# Filtering feature sets

Candidate feature sets are filtered based on their entropy. Should they demonstrate sufficient entropy, we calculate their mutual information with the 'active' and 'possible' caseness variables, for further review.

The purpose of this notebook is to calculate the entropy and mutual information.

This Jupyter notebook calls a different filter notebook for each source of feature sets:
- database feature sets
- clinician feature sets
- literature feature sets
- PPI feature sets
- interview feature sets

### Imports

In [79]:
import itertools
import math

import numpy
import pandas
import scipy.stats
import sklearn.metrics

### Prerequisites

#### Load the caseness variables

In [6]:
%%capture
%run ./"UNSEEN create caseness variables.ipynb"

#### Define a function to calculate the multinomial representation of a feature set.

In [45]:
# Define function that will calculate the multinomial
# representation of a feature set.
#
# The function takes an n-by-m array of n patients and m features
# and produces an n-by-1 array indicating the multinomial category
# to which each patient record belongs.
def mutlinomRepresentation(var_vals):
    # Check that every component feature has two or fewer values and
    # only progress if True.
    for i_col in range(var_vals.shape[1]):
        unique_feature_vals = var_vals.iloc[:, i_col].drop_duplicates()
        if len(unique_feature_vals) > 2:
            print("\n** Error: At least one of the",
                  "component features has more than",
                  "two values so the multinomial",
                  "representation will not be computed.**\n")
            print(i_col, "th variable:", var_vals.columns.values[i_col])
            print(unique_feature_vals)
            next_iter = True
            return 0, next_iter

    # Get all combinations of values of the component features
    # and define feature set values for each multinomial combination.
    feature_combins = var_vals.drop_duplicates().reset_index(drop = True)
    feature_combins['multinom_vals'] = feature_combins.index

    # Define a vector indicating the feature set value.
    myMerge =\
        pandas.merge(
            var_vals,
            feature_combins,
            how = 'left',
            on = list(var_vals.columns.values)
        )
    
    # Extract multinomial representation as output variable.
    featureSet = myMerge['multinom_vals']
    next_iter = False
    return featureSet, next_iter
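
The labelling step inside the function above can be illustrated on its own. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its columns `A` and `B` are assumptions made for illustration). It shows that each distinct combination of feature values receives an integer label in order of first appearance, and that the left merge maps those labels back onto the patient records in their original order.

```python
import pandas

# Hypothetical toy data: five patient records, two binary features.
toy = pandas.DataFrame({"A": [0, 1, 1, 0, 1],
                        "B": [1, 1, 0, 1, 1]})

# One row per distinct combination, labelled in order of first appearance.
combos = toy.drop_duplicates().reset_index(drop=True)
combos["multinom_vals"] = combos.index

# Left-merge the labels back onto the patient records.
labels = toy.merge(combos, how="left", on=list(toy.columns))["multinom_vals"]
print(labels.tolist())  # [0, 1, 2, 0, 1]
```

Because labels follow the order of first appearance, the integer assigned to a given combination depends on the data; only the grouping of records is fixed.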

#### Define a function to calculate the mutual information between feature sets and the caseness variable.

In [37]:
# Define function that will calculate two-way mutual
# information for the features of order m.
def featuresetmi(featureSet_array,
                 casenessVector,
                 m = None,
                 savelocation = None,
                 representation = None):
    # ## Assess argument validity.

    # Check order of feature set. If not provided,
    # default to m = 1.
    if m is None:
        order_int = 1
        order_label = "Individuals"
        print("\nNo value for m provided." +
              "\n...Default value of m = 1 will be used.")
    elif m == 1:
        order_int = m
        order_label = "Individuals"
    elif m == 2:
        order_int = m
        order_label = "Pairs"
    elif m == 3:
        order_int = m
        order_label = "Triplets"
    else:
        print("\n** Error: Integer value between 1",
              "and 3 not supplied for m.**\n")
        return
            
    # Check and set save location.
    if savelocation is None:
        savelocation = \
           ("Mutual information saves/" +
            order_label)
        print("\nNo save location provided." +
              "\n...Defaulting to ~/" + savelocation)
    
    # ## Check representation. If not provided,
    # ## default to the all-present representation.
    if representation is None:
        representation_label = "ALLrepresentation"
        print("\nNo representation provided." +
              "\n...Defaulting to '" + representation_label + "' representation.")
    elif representation == "all":
        representation_label = "ALLrepresentation"
    elif representation == "multi":
        representation_label = "MULTIrepresentation"
    else:
        print("\n** Error: Representation value from",
              "{'all', 'multi'} not provided.**\n")
        return
    
    print("\n\n\n****************************************")  
    print("Calculating mutual information values...")
    # Define the m-way tuples of features sets as a numpy array. We will loop
    # through the rows of this array to create the feature sets.
    combins = \
        numpy.asarray(
            list(
                itertools.combinations(
                    featureSet_array.columns[featureSet_array.columns != 'person_id'],
                    order_int)
                )
            )
    
    # Ensure feature-set and caseness values are matched for person_id.
    full_array = featureSet_array.merge(casenessVector, on = 'person_id')
    
    # Instantiate specific storage for mutual information.
    featureSet_MI = \
        pandas.DataFrame(columns = ['Feature set', 'Mutual information'])
    
    # Instantiate batch number.
    batch = 0
    
    # Instantiate tally of feature sets that are dropped due to low entropy.
    drop_tally = 0
    
    # Define entropy of the particular caseness variable.
    entropy_caseness = \
        scipy.stats.entropy(casenessVector.iloc[:,-1].value_counts(),
                            base = math.e)
    
    # ## loop through the feature sets.
    for i_fs in range(len(combins)):
                
        # Define a vector indicating the feature set value.
        var_vals = full_array[combins[i_fs]]
        if representation_label == "ALLrepresentation":
            # True only if every component feature is present for the patient.
            fs_val = var_vals.all(axis = 1)
        elif representation_label == "MULTIrepresentation":
            fs_val, next_iter = mutlinomRepresentation(var_vals)
            if next_iter:
                continue
        
        
        # Skip the feature set if its entropy is less than that of the
        # caseness variable.
        f_entropy = scipy.stats.entropy(fs_val.value_counts(), base = math.e)
        if f_entropy < entropy_caseness:
            drop_tally += 1
            continue

        # Calculate the mutual information between the feature set and
        # caseness variable.
        f_MI = sklearn.metrics.mutual_info_score(fs_val, full_array.iloc[:,-1])

        # Name the feature set.
        # ...
        # Store the name and mutual information value.
        featureSet_MI.loc[len(featureSet_MI)] = name_var, f_MI

        if len(featureSet_MI) > 9:
            # Increment batch.
            batch += 1

            # Make an interim save of results.
            featureSet_MI.to_csv(savelocation +
                                 order_label + "_" +
                                 representation_label + "_" +
                                 "_batch" +
                                 str(batch) +
                                 ".csv", index = False)
            # Instantiate new storage.
            featureSet_MI = \
                pandas.DataFrame(columns = ['Feature set', 'Mutual information'])


    # Final save of any results not yet written to a batch file.
    if len(featureSet_MI) != 0:
        # Increment counter only when there is a final batch to save.
        batch += 1
        featureSet_MI.to_csv(savelocation +
                             order_label + "_" +
                             representation_label + "_" +
                             "_batch" +
                             str(batch) +
                             ".csv", index = False)

    # Feedback messages.
    print("...\n")
    print(str(batch), "batch(es) of feature sets processed.")
    print(str(drop_tally), "/",
          str(len(combins)),
          "feature sets dropped due to low entropy.")
    print("****************************************")  

## Calculate the entropy and two-way mutual information of the feature sets and the caseness.

Our focus is mutual information, but I check the entropy of each feature first so that I don't calculate the mutual information for any feature whose entropy is less than that of the caseness variable. The justification is that the mutual information between the caseness variable and any feature is less than or equal to the lesser of the two entropies, i.e. $I(X_{i};CMHD) \leq \min\{H(X_{i}), H(CMHD)\}$. A feature whose entropy is below that of the caseness variable therefore caps the attainable mutual information below $H(CMHD)$. We don't want any feature set that is worse than no feature set (i.e. having only the caseness prevalence to predict a random outcome value), so we don't bother with any feature set that lowers the possible mutual information. Two-way mutual information* will not be calculated for those features whose entropy is less than the caseness variable's entropy. The dropped variables are indicated in the `f_to_drop` pandas.DataFrame.
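
The bound can be checked numerically. The following is a minimal sketch on hypothetical toy data (the `feature` and `caseness` vectors are made up for illustration). It uses natural-log entropy from `scipy.stats.entropy` so that the units match `sklearn.metrics.mutual_info_score`, which also returns values in nats.

```python
import math

import pandas
import scipy.stats
import sklearn.metrics

# Hypothetical toy data: a rare feature and a balanced caseness variable.
feature = pandas.Series([0, 0, 0, 0, 0, 0, 0, 1])
caseness = pandas.Series([0, 1, 0, 1, 0, 1, 0, 1])

# Entropies in nats, computed from the value counts.
h_feature = scipy.stats.entropy(feature.value_counts(), base=math.e)
h_caseness = scipy.stats.entropy(caseness.value_counts(), base=math.e)

# Mutual information in nats.
mi = sklearn.metrics.mutual_info_score(feature, caseness)

print("H(feature) =", h_feature)
print("H(caseness) =", h_caseness)
print("I(feature; caseness) =", mi)

# The bound: mutual information cannot exceed the lesser entropy.
print("Bound holds:", mi <= min(h_feature, h_caseness) + 1e-12)

# Under the filtering rule, this feature would be dropped.
print("Dropped:", h_feature < h_caseness)
```

In this toy case the feature's entropy is below that of the caseness variable, so it would be dropped before any mutual information is calculated.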

First, two-way mutual information will be calculated between the caseness variable and individual feature sets. Second, two-way mutual information will be calculated between the caseness variable and pair composites of feature sets. Each pair composite amalgamates two feature sets into a single new binary variable, whose value is `0` if both component feature sets are zero and `1` otherwise**. More-complicated encoding is possible, e.g. a different level for every combination of values from each component feature set. Further code extends these feature-set compositions up to quintuplet composites (i.e. amalgamating five feature sets into a single binary variable).
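
As an illustration of the pair-composite encoding described in this paragraph, here is a minimal sketch on hypothetical data (the frame `fs` and its column names are assumptions). It enumerates every pair of feature sets with `itertools.combinations` and encodes each pair as `0` if both components are zero and `1` otherwise. The `featuresetmi` function defined earlier uses the all-present and multinomial representations instead of this encoding.

```python
import itertools

import pandas

# Hypothetical feature sets for five patients.
fs = pandas.DataFrame({"fs1": [0, 1, 0, 0, 1],
                       "fs2": [0, 0, 1, 0, 1],
                       "fs3": [0, 0, 0, 0, 1]})

composites = {}
for left, right in itertools.combinations(fs.columns, 2):
    # 0 if both component feature sets are zero, 1 otherwise.
    composites[left + "|" + right] = fs[[left, right]].any(axis=1).astype(int)

pair_composites = pandas.DataFrame(composites)
print(pair_composites)
```

Passing `3` instead of `2` to `itertools.combinations` would enumerate triplets in the same way, provided the loop unpacks a tuple of matching length.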

<br/>
<br/>

__\*__ _Initially, the plan was to use $k$-way mutual information for $k>2$ but the meaning of these mutual information values is controversial at best. I side with [Krippendorff's assessment](https://sci-hub.wf/10.1080/03081070902993160), which renders 3-way mutual information interpretable but not any higher-order mutual information statistics (yet?). I decided to stick with two-way mutual information using composite feature sets so that I am comparing the same statistic across individual and composite feature sets._

__\**__ _Other encodings will be trialled at a later date. If a feature set contains more than one feature, then it will be represented in two ways: all-present and multinomial. The all-present representation is a binary variable with a value of `1` if all component features are one, and `0` otherwise. The multinomial representation is a multinomial variable with a value for each possible combination of the component features' values. For example, given a feature set of two features $A \in \{0,1\}$ and $B \in \{0,1\}$, their multinomial feature-set representation would be $C \in \{0,1,2,3\}$, where $C=0$ denotes $(A=0 \text{ AND } B=0)$, $C=1$ denotes $(A=0 \text{ AND } B=1)$, $C=2$ denotes $(A=1 \text{ AND } B=0)$, and $C=3$ denotes $(A=1 \text{ AND } B=1)$. The multinomial representation will only be applied to binary variables._
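
The two representations can be sketched for the $A$, $B$ example as follows. This is an illustration on hypothetical data, not the notebook's own implementation: the fixed coding $C = 2A + B$ reproduces the mapping in the example, whereas a data-dependent labelling (such as order of first appearance) may assign different integers to the same combinations. The sketch also checks that mutual information does not change when the categories are relabelled, so the choice of integer labels should not affect the results.

```python
import pandas
import sklearn.metrics

# Hypothetical toy data: two binary features and a caseness variable.
df = pandas.DataFrame({"A": [0, 0, 1, 1, 1, 0],
                       "B": [0, 1, 0, 1, 1, 1]})
caseness = pandas.Series([0, 0, 1, 1, 1, 0])

# All-present: 1 only when every component feature is 1.
all_present = df.all(axis=1).astype(int)

# Multinomial with the fixed coding from the example: C = 2*A + B.
multinomial = 2 * df["A"] + df["B"]

# The same categories under an arbitrary one-to-one relabelling.
relabelled = multinomial.map({0: 3, 1: 0, 2: 1, 3: 2})

mi_fixed = sklearn.metrics.mutual_info_score(caseness, multinomial)
mi_relabelled = sklearn.metrics.mutual_info_score(caseness, relabelled)

print(all_present.tolist())   # [0, 0, 0, 1, 1, 0]
print(multinomial.tolist())   # [0, 1, 2, 3, 3, 1]
print(abs(mi_fixed - mi_relabelled) < 1e-9)
```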

## Filter database feature sets

In [2]:
%%capture
%run ./"UNSEEN_filter_database_feature_sets.ipynb"

## Filter clinician feature sets

In [None]:
%%capture
%run ./"UNSEEN_filter_clinician_feature_sets.ipynb"

## Filter literature feature sets

In [80]:
%%capture
%run ./"UNSEEN_filter_literature_feature_sets.ipynb"

## Filter PPI feature sets

In [2]:
%%capture
%run ./"UNSEEN_filter_PPI_feature_sets.ipynb"

## Filter interview feature sets

In [2]:
%%capture
%run ./"UNSEEN_filter_interview_feature_sets.ipynb"