# Filtering feature sets.

Candidate feature sets are filtered based on their entropy. Should they demonstrate sufficient entropy, we calculate their mutual information with the 'active' and 'possible' caseness variables, for further review.

The purpose of this script is to calculate the entropy and mutual information.

This Jupyter notebook is intended to be called in other workbooks

runs all notebooks that create the required"UNSEEN database feature set array.IPYNB" on which it depends, and "UNSEEN create caseness array.IPYNB" on which the feature-set Jupyter notebook depends.

### Imports

In [1]:
import itertools

## Prerequisites

In [2]:
%%capture
%run ./"UNSEEN create database feature sets.ipynb"

In [2]:
%%capture
%run ./"UNSEEN create clinician feature sets.ipynb"

In [3]:
%%capture
%run ./"UNSEEN create caseness variables.ipynb"

Caseness variable entropy =  0.395 nats
Caseness variable scaled entropy =  57.0 %
Hit rate (all) = 13.5 %
Hit rate (none) = 86.5 %
Odds (No CMHD : CMHD) =  6.41 times less likely to have CMHD than to have it.


## Prerequisites

## Calculate the entropy and two-way mutual information of the feature sets and the caseness.

Our focus is mutual information but I check the entropy of features so that I don't calculate the mutual information for any features whose entropy is less than that of the caseness variable. Justification for this action is based on the fact that the mutual information between the caseness variable and any feature will be less than or equal to the lesser entropy of the caseness or feature, i.e. $I(X_{i};CMHD) ≤ min\{H(X_{i}), H(CMHD)\}$. We don’t want any feature set that is worse than no feature set (i.e. having only the caseness prevalence to predict a random outcome value) so we don’t bother with any feature set that will lower the possible mutual information. Two-way mutual information* will not be calculated for those features whose entropy is less than the caseness variable's entropy. The dropped variables are indicated in the `f_to_drop` pandas.DataFrame.

First, two-way mutual information will be calculated between the caseness variable and individual feature sets. Secondly, two-way mutual information will be calculated between the caseness variable and pair-composites of feature sets. These pair composites are individual feature sets that amalgamate two feature sets into a new binary definition, where values are `0` if both component feature sets are zero and `1` otherwise**. More-complicated encoding is possible, e.g. a different level for every combination of values from each component feature set. Further code extends these feature-set compositions up to quintuplet composites (i.e. amalgamating five feature sets into a single binary variable).

<br/>
<br/>

__\*__ _Initially, the plan was to use $k$-way mutual information for $k>2$ but the meaning of these mutual information values is controversial at best. I side with [Krippendorf's assessment](https://sci-hub.wf/10.1080/03081070902993160), which renders 3-way mutual information interpretable but not any higher-order mutual information statistics (yet?). I decided to stick with two-way mutual information using composite feature sets so that I am comparing the same statistic across individual and composite feature sets._

__\**__ _Other encodings will be trialled at a later date. If a feature set contains more than one feature, then it will be represented in three ways: OR, AND, and multinomial. The OR representation (alternative called the at-least-one representation) is a binary variable with a value of `0` if the component features are all zero, and `1` otherwise. The AND (alternative called the all-present representation) representation is a binary variable with a value of `1` if all component features are one, and `0` otherwise. The multinomial representation is a multinomial variable with values for each of the possible combinations of component features’ values. For example, given a feature set of two features $A=\{0,1\}$ and $B=\{0,1\}$, their multinomial feature-set representation would be $C=\{0,1,2,3\}$, where $C=0=(A=0\   AND\   B=0)$, $C=1=(A=0\   AND\   B=1)$, $C=2=(A=1\  AND\  B=0)$, and $C=3=(A=1\   AND\   B=1)$. The multinomial representation will only be applied to binary variables._

### Define a function to calculate the multinomial representation of a feature set.

In [4]:
# Define function that will calculate the multinomial
# representation of a feature set.
#
# The function takes an n-by-m array of n patients and m features
# and produces an n-by-1 array indicating the multinomial category
# to which each patient record belongs.
def mutlinomRepresentation(featureSet):
    # Check that the variables have two or fewer values and
    # only progress if True.
    for i_col in range(var_vals.shape[1]-1):
        unique_feature_vals = var_vals.iloc[:, i_col].drop_duplicates()
        if (len(unique_feature_vals) > 2):
            print("\n** Error: At least one of the",
                  "component features has more than",
                  "two values so the multinomial",
                  "representation will not be computed.**\n")
            print(i_col, "th variable:", var_vals.columns.values[i_col])
            unique_feature_vals
            next_iter = True
            return 0, next_iter

    # Get all combinations of values of the component features
    # and define feature set values for each multinomial combination.
    feature_combins = var_vals.drop_duplicates()
    feature_combins =\
        pandas.DataFrame(data = feature_combins, columns = var_vals.columns)\
        .reset_index()\
        .drop(['index'], axis = 1)
    feature_combins['multinom_vals'] = feature_combins.index
    
    
    # Define a vector indicating the feature set value.
    myMerge =\
        pandas.merge(
            var_vals,
            feature_combins,
            how = 'left',
            on = list(var_vals.columns.values)
    )
    
    # Extract multinomial representation as output variable.
    featureSet = myMerge['multinom_vals']
    next_iter = False
    return featureSet, next_iter

### Define a function to calculate the mutual information between feature sets and the caseness variable.

In [5]:
# Define function that will calculate two-way mutual
# information for the features of order m.
def featuresetmi(featureArray,
                 m = None,
                 savelocation = None,
                 representation = None):
    # ## Assess argument validty.
    
    # Check order of feature set. If not provided,
    # default to m = 1.    
    if m == None:
        order_int = 1
        order_label = "Individuals"
        print("\nNo value for m provided." +
              "\n...Default value of m = 1 will be used.")
    elif m == 1:
        order_int = m
        order_label = "Individuals"
    elif m == 2:
        order_int = m
        order_label = "Pairs"
    elif m == 3:
        order_int = m
        order_label = "Triplets"
    else:
        print("\n** Error: Integer value between 1",
              "3 not supplied for m.**\n")
        return
            
    # Check and set save location.
    if savelocation == None:
        savelocation = \
           ("Mutual information saves/"+\
            order_label)
        print("\nNo save location provided." +
              "\n...Defaulting to ~/" + savelocation)    
    
    # ## Check encoding. If not provided, 
    # ## default to OR encoding.
    if representation == None:
        representation_label = "ORrepresentation"
        print("\nNo representation provided." +
              "\n...Defaulting to '" + representation_label + "' representation.")
    elif representation == "or":
        representation_label = "ORrepresentation"
    elif representation == "and":
        representation_label = "ANDrepresentation"
    elif representation == "multi":
        representation_label = "MULTIrepresentation"
    else:
        print("\n** Error: Representation value from ",
              "{'or', 'and', 'multi'} not provided.**\n")
        return

    
    
    print("\n\n\n****************************************")  
    print("Calculating mutual information values...")
    # Define the m-way tuples of features sets as a numpy array. We will loop
    # through the rows of this array to create the feature sets.
    combins = \
        numpy.asarray(
            list(
                itertools.combinations(
                    featureSet_array.columns[featureSet_array.columns != 'person_id'],
                    order_int)
                )
            )
    # Instantiate specific storage for mutual information.
    featureSet_MI = \
        pandas.DataFrame(columns = ['Feature set', 'Mutual information'])
    # Instantiate batch number.
    batch = 0
    # Instantiate tally of feature sets that are dropped due to low entropy.
    drop_tally = 0
    
    # ## loop through the feature sets.
    for i_fs in range(len(combins)):
                
        # Define a vector indicating the feature set value.
        var_vals = featureSet_array[combins[i_fs]]
        if representation_label == "ORrepresentation":
            fs_val = var_vals.any(True)
        elif representation_label == "ANDrepresentation":
            fs_val = var_vals.all(True)
        elif representation_label == "MULTIrepresentation":
            fs_val, next_iter = mutlinomRepresentation(var_vals)
            if next_iter:
                continue
        
        
        # Calculate the mutual information for the feature set.
        f_MI = sklearn.metrics.mutual_info_score(fs_val, caseness_array['CMHD'])

        if f_MI < entropy_caseness:
            drop_tally += 1
            continue
        else:
            # Name the feature set.
            # ...
            # Store the name and mutual information value.
            featureSet_MI.loc[len(featureSet_MI)] = name_var, f_MI

        if len(featureSet_MI) > 9:
                # Increment batch.
                batch += 1

                # Make an interim save of results.
                featureSet_MI.to_csv(savelocation +
                                  order_label + "_" +
                                  representation_label + "_" +
                                  "_batch" + \
                                  str(batch) + \
                                  ".csv", index = False)
                # Instantiate new storage.
                featureSet_MI = \
                    pandas.DataFrame(columns = ['Feature set', 'Mutual information'])


    # Increment counter.
    batch += 1

    # Final save.
    if len(featureSet_MI) != 0:
        featureSet_MI.to_csv(savelocation +
                          order_label + "_" +
                          representation_label + "_" +
                          "_batch" + \
                          str(batch) + \
                          ".csv", index = False)

    # Feedback messages.
    print("...\n")
    print(str(batch), "batch(es) of feature sets processed.")
    print(str(drop_tally), "/",
          str(len(combins)),
          "feature sets dropped due to low entropy.")
    print("****************************************")  

### Mutual information of individual feature sets and the caseness variable.

##### OR representation.

In [211]:
featuresetmi(featureArray = featureSet_array,
            m = 1,
            representation = "or")


No save location provided.
...Defaulting to ~/Mutual information saves/Individuals



****************************************
Calculating mutual information values...
...

1 batch(es) of feature sets processed.
216 / 216 feature sets dropped due to low entropy.
****************************************


##### AND representation.

In [212]:
featuresetmi(featureArray = featureSet_array,
            m = 1,
            representation = "and")


No save location provided.
...Defaulting to ~/Mutual information saves/Individuals



****************************************
Calculating mutual information values...
...

1 batch(es) of feature sets processed.
216 / 216 feature sets dropped due to low entropy.
****************************************


##### MULTI representation.

In [278]:
featuresetmi(featureArray = featureSet_array,
            m = 1,
            representation = "multi")


No save location provided.
...Defaulting to ~/Mutual information saves/Individuals



****************************************
Calculating mutual information values...
...

1 batch(es) of feature sets processed.
216 / 216 feature sets dropped due to low entropy.
****************************************


### Mutual information of pair-composite feature sets and the caseness variable.

##### OR representation.

In [279]:
featuresetmi(featureArray = featureSet_array,
            m = 2,
            representation = "or")


No save location provided.
...Defaulting to ~/Mutual information saves/Pairs



****************************************
Calculating mutual information values...
...

1 batch(es) of feature sets processed.
23220 / 23220 feature sets dropped due to low entropy.
****************************************


##### AND representation.

In [7]:
featuresetmi(featureArray = featureSet_array,
            m = 2,
            representation = "and")


No save location provided.
...Defaulting to ~/Mutual information saves/Pairs



****************************************
Calculating mutual information values...
...

1 batch(es) of feature sets processed.
23220 / 23220 feature sets dropped due to low entropy.
****************************************


##### MULTI representation.

In [8]:
featuresetmi(featureArray = featureSet_array,
            m = 2,
            representation = "multi")


No save location provided.
...Defaulting to ~/Mutual information saves/Pairs



****************************************
Calculating mutual information values...


NameError: name 'var_vals' is not defined

### Mutual information of triplet-composite feature sets and the caseness variable.

##### OR representation.

In [None]:
featuresetmi(featureArray = featureSet_array,
            m = 3,
            representation = "or")


No save location provided.
...Defaulting to ~/Mutual information saves/Triplets



****************************************
Calculating mutual information values...


##### AND representation.

In [None]:
featuresetmi(featureArray = featureSet_array,
            m = 3,
            representation = "and")

##### MULTI representation.

In [None]:
featuresetmi(featureArray = featureSet_array,
            m = 3,
            representation = "multi")


No save location provided.
...Defaulting to ~/Mutual information saves/Triplets



****************************************
Calculating mutual information values...


# Initial results

__\*__ _Note: these initial results were calculated using a previous version of the script where the mutual information of all feature sets was saved. This approach was dropped in favour of only saving mutual information values for feature sets whose mutual information with the caseness variable is greater than the entropy of the caseness variable._
<br/><br/><br/>

All individual feature sets score very low for two-way mutual information: all less than 0.05. A $I_{2}=0.05$ represents 7.9% of the theoretical maximal situatons where the feature set either _exactly is_ the caseness variable or _is exactly not_ the variable. The top five individal feature sets (which have $I_{2}≥0.033$) are defined as having at least one recording of the following SNOMED CT codes in their primary-care electronic health records:

| SNOMED code | Feature set | Topic | Mutual Information | Scaled mutual information | Odds ratio | P(CMHD given X=1) | P(CMHD given X=0) |
| ----------- | ----------- | ----- | ------------------ | ------------------------- | ---------- | ----------------- | ----------------- |
| 314530002 | Medication review done | Medication | ~0.055 | 7.9% | 8.5 | 4.1 | 26.8 |
| 182888003 | Medication requested  | Medication | ~0.035 | 5.0% | 4.9 | 7.9 | 29.5 |
| 1018251000000107 | Serum alanine aminotransferase level (observable entity) | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.6 |
| 1000621000000104 | Serum alkaline phosphatase level | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.5 |
| 1022791000000101 | TSH (thyroid stimulating hormone) level | Endocrine | ~0.033 | 4.7% | 5.0 | 5.4 | 22.5 |

The paradoxes of commonly-reported classification statistics are clearly shown. A medication _review_ has the largest odds ratio but the probability of having a record of the caseness variable given that medication was _requested_ is higher. One might propose that a record of a medication request is a better indicator than a record of a medication review, if they prefer the probability statistic over the odds-ratio statistic. But when we look at the probability of the caseness variable given that there is _no_ record of medication being requested, we see that this is also the largest of the top five feature sets! The odds ratio tries to balance these ambiguous probability statistics but it is, therefore, harder to interpret. Note, the odds ratio for a record of a medication _review_ scores better than for a record of medication _request_ because the distinction between the ambiguous probabilities is greater (multiplicatively).

The scaled mutual information is simply a percentage measure of how much the caseness variable is described by the feature set (in terms of information). Unlike the odds ratio, it will give the same value whether the odds are multiplicatively greater or less than equal - e.g. $I_{2}$ will be the same for $OR = 4.0$ and $OR = 0.25$  - so it only measures magnitude of association.