# Feature Entropy and Mutual information

The purpose of this script is to calculate the entropy of feature sets, and the mutual information between the caseness variable (i.e. record of a SNOMED CT code for conditions indicative of complex mental health difficulties) and feature sets whose entropy is at least as great as the caseness variable's.

This Jupyter notebook runs "UNSEEN create feature set array.IPYNB" on which it depends, and "UNSEEN create caseness array.IPYNB" on which the feature-set Jupyter notebook depends.

### Imports

In [21]:
# No Python libraries need importing.

## Dependencies

Run "UNSEEN create feature set array.IPYNB", which runs "UNSEEN create caseness array.ipynb".

In [1]:
%run ./"UNSEEN create feature set array.ipynb"

In [2]:
%run ./"UNSEEN create caseness array.ipynb"

Caseness variable entropy =  0.4 nats
Caseness variable scaled entropy =  57.7 %
Hit rate (all) = 13.7 %
Hit rate (none) = 86.3 %
Odds (No CMHD : CMHD) =  6.3 times less likely to have CMHD than to have it.


## Prerequisites

In [3]:
# Instantiate template for storing entropy.
feature_entropy_TEMPLATE =\
    pandas.DataFrame(columns = ['Feature set', 'Entropy'])

# Instantiate template for storing mutual information.
feature_mutual_information_TEMPLATE =\
    pandas.DataFrame(columns = ['Feature set', 'Mutual information'])

# Instantiate storage for features that are dropped due to low entropy.
f_to_drop_TEMPLATE =\
    pandas.DataFrame(columns = ['Dropped feature set'])

## Calculate the entropy and two-way mutual information of the feature sets and the caseness.

Our focus is mutual information but I calculate the entropy of features so that I don't calculate the mutual information for any features whose entropy is less than that of the caseness variable. Justification for this action is based on the fact that the mutual information between the caseness variable and any feature will be less than or equal to the lesser entropy of the caseness or feature, i.e. $I(X_{i};CMHD) ≤ min\{H(X_{i}), H(CMHD)\}$. We don’t want any feature set that is worse than no feature set (i.e. having only the caseness prevalence to predict a random outcome value) so we don’t bother with any feature set that will lower the possible mutual information. Two-way mutual information* will not be calculated for those features whose entropy is less than the caseness variable's entropy. The dropped variables are indicated in the `f_to_drop` pandas.DataFrame.

First, two-way mutual information will be calculated between the caseness variable and individual feature sets. Secondly, two-way mutual information will be calculated between the caseness variable and pair-composites of feature sets. These pair composites are individual feature sets that amalgamate two feature sets into a new binary definition, where values are `0` if both component feature sets are zero and `1` otherwise**. More-complicated encoding is possible, e.g. a different level for every combination of values from each component feature set. Further code extends these feature-set compositions up to quintuplet composites (i.e. amalgamating five feature sets into a single binary variable).

<br/>
<br/>

__\*__ _Initially, the plan was to use $k$-way mutual information for $k>2$ but the meaning of these mutual information values is controversial at best. I side with [Krippendorf's assessment](https://sci-hub.wf/10.1080/03081070902993160) , which renders 3-way mutual information interpretable but not any higher-order mutual information statistics (yet?). I decided to stick with two-way mutual information using composite feature sets so that I am comparing the same statistic across individual and composite feature sets._

__\**__ _Other encodings will be trialled at a later date. If a feature set contains more than one feature, then it will be represented in three ways: OR, AND, and multinomial. The OR representation (alternative called the at-least-one representation) is a binary variable with a value of `0` if the component features are all zero, and `1` otherwise. The AND (alternative called the all-present representation) representation is a binary variable with a value of `1` if all component features are one, and `0` otherwise. The multinomial representation is a multinomial variable with values for each of the possible combinations of component features’ values. For example, given a feature set of two features $A=\{0,1\}$ and $B=\{0,1\}$, their multinomial feature-set representation would be $C=\{0,1,2,3\}$, where $C=0=(A=0\   AND\   B=0)$, $C=1=(A=0\   AND\   B=1)$, $C=2=(A=1\  AND\  B=0)$, and $C=3=(A=1\   AND\   B=1)$._

### Entropy and two-way mutual information of individual feature sets and the caseness variable.

********************
Storing all batches of calculations is taking too much storage space. I was doing this initially so that I could review the output of each stage of the workflow. It looks like the storage constraint means I will have to include the entire workflow at once and only produce the final output.
********************

In [5]:
# Instantiate specific storage for mutual information.
feature_MI_individual_ORrep = feature_mutual_information_TEMPLATE
# Instantiate batch number.
batch = 0
# Instantiate tally of feature sets that are dropped due to low entropy.
drop_tally = 0

# Calculate entropy and mutual information, and store the values.
for i_featureSet in featureSet_array.columns[featureSet_array.columns != 'person_id']:
    if len(feature_MI_individual_ORrep) > 999:
        # Increment batch.
        batch += 1
            
        # Make an interim save of results.
        feature_MI_individual_ORrep.to_csv("Mutual information saves/"+\
                                           "Individuals/"+\
                                           "UNSEEN feature set_MutInfo_individual_ORrep_batch"+\
                                           str(batch)+\
                                           ".csv", index = False)
        # Instantiate new storage.
        feature_MI_individual_ORrep = feature_mutual_information_TEMPLATE
    else:
        name_var = i_featureSet
        
        # Update the feature set id table.
        # ## This will only be done for individual feature sets because 
        # ## higher-rder feature sets can be algorithmically define, which
        # ## is more efficient for storage.
        featureSet_ID_table.loc[len(featureSet_ID_table),
                                ['Feature set ID', 'Feature Set 1',]] = [name_var]
        
        # Define the feature set's values.
        # ## In this case, the individual feature set is defined as 0 when the feature is 0,
        # ## and 1 otherwise. This is the "OR" or "at least one" encoding.
        binary_var = featureSet_array[i_featureSet] == 0
        # Calculate the mutual information for the individual feature set.
        f_MI = sklearn.metrics.mutual_info_score(binary_var, caseness_array['CMHD'])
        # Only store the mutual information if it is greater than or equal
        # to the entropy of the outcome.
        if f_MI < entropy_caseness:
            drop_tally += 1
            continue
        else:
            # Store.
            feature_MI_individual_ORrep.loc[len(feature_MI_individual_ORrep)] = \
                name_var, f_MI

# Increment counter.
batch += 1

# Final save.
if len(feature_MI_individual_ORrep) != 0:
    feature_MI_individual_ORrep.to_csv("Mutual information saves/"+\
                                       "Individuals/"+\
                                       "UNSEEN feature set_MutInfo_individual_ORrep_batch"+\
                                       str(batch)+\
                                       ".csv",
                                       index = False)

# Feedback messages.
print(str(batch), "batch(es) of feature sets processed.")
print(str(drop_tally), "/",
      str(len(featureSet_array.columns)-1),
      "feature sets dropped due to low entropy.")

1 batch(es) of feature sets processed.
216 / 216 feature sets dropped due to low entropy.


#### Initial results

__\*__ _Note: these initial results were calculated using a previous version of the script where the mutual information of all feature sets was saved. This approach was dropped in favour of only saving mutual information values for feature sets whose mutual information with the caseness variable is greater than the entropy of the caseness variable._
<br/><br/><br/>

All individual feature sets score very low for two-way mutual information: all less than 0.05. A $I_{2}=0.05$ represents 7.9% of the theoretical maximal situatons where the feature set either _exactly is_ the caseness variable or _is exactly not_ the variable. The top five individal feature sets (which have $I_{2}≥0.033$) are defined as having at least one recording of the following SNOMED CT codes in their primary-care electronic health records:

| SNOMED code | Feature set | Topic | Mutual Information | Scaled mutual information | Odds ratio | P(CMHD given X=1) | P(CMHD given X=0) |
| ----------- | ----------- | ----- | ------------------ | ------------------------- | ---------- | ----------------- | ----------------- |
| 314530002 | Medication review done | Medication | ~0.055 | 7.9% | 8.5 | 4.1 | 26.8 |
| 182888003 | Medication requested  | Medication | ~0.035 | 5.0% | 4.9 | 7.9 | 29.5 |
| 1018251000000107 | Serum alanine aminotransferase level (observable entity) | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.6 |
| 1000621000000104 | Serum alkaline phosphatase level | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.5 |
| 1022791000000101 | TSH (thyroid stimulating hormone) level | Endocrine | ~0.033 | 4.7% | 5.0 | 5.4 | 22.5 |

The paradoxes of commonly-reported classification statistics are clearly shown. A medication _review_ has the largest odds ratio but the probability of having a record of the caseness variable given that medication was _requested_ is higher. One might propose that a record of a medication request is a better indicator than a record of a medication review, if they prefer the probability statistic over the odds-ratio statistic. But when we look at the probability of the caseness variable given that there is _no_ record of medication being requested, we see that this is also the largest of the top five feature sets! The odds ratio tries to balance these ambiguous probability statistics but it is, therefore, harder to interpret. Note, the odds ratio for a record of a medication _review_ scores better than for a record of medication _request_ because the distinction between the ambiguous probabilities is greater (multiplicatively).

The scaled mutual information is simply a percentage measure of how much the caseness variable is described by the feature set (in terms of information). Unlike the odds ratio, it will give the same value whether the odds are multiplicatively greater or less than equal - e.g. $I_{2}$ will be the same for $OR = 4.0$ and $OR = 0.25$  - so it only measures magnitude of association.

### Entropy and two-way mutual information of pair-composite feature sets and the caseness variable.
The nested FOR LOOPs below also update the Feature Set ID table with the new features.


*********************
Why are MI values < 0.4 being saved?
*********************

In [35]:
# Instantiate specific storage for mutual information.
feature_MI_pair_ORrep = feature_mutual_information_TEMPLATE
# Instantiate batch number.
batch = 0
# Instantiate tally of feature sets that are dropped due to low entropy.
drop_tally = 0

# Calculate entropy and mutual information, and store the values.
for i_featureSet in featureSet_array.columns[featureSet_array.columns != 'person_id']:
    for j_featureSet in featureSet_array.columns[featureSet_array.columns != 'person_id']:
        if len(feature_MI_pair_ORrep) > 10:
            break
        if len(feature_MI_pair_ORrep) > 999:
            # Increment batch.
            batch += 1
                
            # Make an interim save of results.
            feature_MI_pair_ORrep.to_csv("Mutual information saves/Pairs/"+\
                  "UNSEEN feature set_MutInfo_pair_ORrep_batch"+\
                                               str(batch)+\
                                               ".csv", index = False)
            # Instantiate new storage.
            feature_MI_pair_ORrep = feature_mutual_information_TEMPLATE
        else:
            # Skip the iteration if the same feature set is selected twice.
            if i_featureSet == j_featureSet:
                continue
            else:
                # Create the feature ID for the pair-composite feature set.
                name_var = "-".join([i_featureSet, j_featureSet])

#                # Update the feature set id table.
#                featureSet_ID_table.loc[len(featureSet_ID_table),
#                                        ['Feature set ID', 'Feature Set 1', 'Feature Set 2']] = \
#                    [name_var, i_featureSet, j_featureSet]

                # Define the pair-composite feature set values.
                # ## In this case, the pair-composite feature set is defined as 0 when both feature
                # ## sets are 0, and 1 otherwise. This is the "OR" or "at least one" encoding.
                binary_var = \
                    pandas.DataFrame(data = {
                                              'i_binary_var' : featureSet_array[i_featureSet] == 0,
                                              'j_binary_var' : featureSet_array[j_featureSet] == 0
                                             }
                                    ).all(True)

                # Calculate the mutual information for the pair-composite feature set.
                f_MI = sklearn.metrics.mutual_info_score(binary_var, caseness_array['CMHD'])
                print("f_MI =",f_MI)
                if f_MI < entropy_caseness:
                    drop_tally += 1
                    continue
                else:
                    # Calculate the mutual information for the pair-composite feature set.
                    feature_MI_pair_ORrep.loc[len(feature_MI_pair_ORrep)] = \
                        name_var, f_MI

# Increment counter.
batch += 1

# Final save.
if len(feature_MI_pair_ORrep) != 0:
    feature_MI_pair_ORrep.to_csv("Mutual information saves/Pairs/"+\
                      "UNSEEN feature set_MutInfo_pair_ORrep_batch"+\
                                                   str(batch)+\
                                                   ".csv",
                                       index = False)

# Feedback messages.
print(str(batch), "batch(es) of feature sets processed.")
count_of_features = len(featureSet_array.columns)-1
count_of_pairs = (pow(count_of_features,2)-count_of_features) / 2
print(str(drop_tally), "/", str(count_of_pairs),
      "feature sets dropped due to low entropy.")

1 batch(es) of feature sets processed.
0 / 23220.0 feature sets dropped due to low entropy.


In [34]:
print(entropy_caseness)
feature_MI_pair_ORrep

0.4


Unnamed: 0,Feature set,Mutual information
0,_1000621000000104-_1000641000000106,0.032399
1,_1000621000000104-_1000651000000109,0.032518
2,_1000621000000104-_1000661000000107,0.032527
3,_1000621000000104-_1000671000000100,0.032443
4,_1000621000000104-_1000681000000103,0.032519
...,...,...
995,_1000671000000100-_309646008,0.019237
996,_1000671000000100-_313204009,0.021621
997,_1000671000000100-_313334002,0.028597
998,_1000671000000100-_314138001,0.021124


### Entropy and two-way mutual information of triplet-composite feature sets and the caseness variable.
The composite feature sets will each be calculated separately to avoid having all the computation in one call, which risks losing everything if it crashes and places heavy demand on RAM.
The code below is an obvious extension of the nested FOR LOOPs used to calculate the two-way mutual information of pair-composite feature sets.

Note: We still only calculate the mutual information for those feature sets whose entropy at least as great as the outcome variable's.

In [32]:
for i_featureSet in f_to_calc:
    for j_featureSet in f_to_calc:
        # Skip the iteration if the same feature set is selected twice.
        if i_featureSet == j_featureSet:
            continue
            
        for k_featureSet in f_to_calc:
            # Skip the iteration if the same feature set is selected twice.
            if len(set([k_featureSet]) & set([i_featureSet, j_featureSet])) > 0:
                continue

            # Create the feature ID for the pair-composite feature set.
            name_var = "-".join([i_featureSet, j_featureSet, k_featureSet])

            # Update the feature set id table.
            featureSet_ID_table.loc[len(featureSet_ID_table),
                                    ['Feature set ID', 'Feature Set 1', 'Feature Set 2',
                                     'Feature Set 3']] = \
                [name_var, i_featureSet, j_featureSet, k_featureSet]

            # Define the pair-composite feature set values.
            # ## In this case, the pair-composite feature set is defined as 0 when both feature
            # ## sets are 0, and 1 otherwise.
            binary_var = \
                pandas.DataFrame(data = {
                                          'i_binary_var' : featureSet_array[i_featureSet] == 0,
                                          'j_binary_var' : featureSet_array[j_featureSet] == 0,
                                          'k_binary_var' : featureSet_array[k_featureSet] == 0
                                         }
                                ).all(True)

            # Calculate the entropy for the pair-composite feature set.
            f_ent = scipy.stats.entropy(binary_var.value_counts(), base = math.e)
            if f_ent < entropy_caseness:
                continue
            else:
                feature_entropy.loc[len(feature_entropy)] = name_var, f_ent

            # Calculate the mutual information for the pair-composite feature set.
            feature_mutual_information.loc[len(feature_mutual_information)] = \
                name_var, sklearn.metrics.mutual_info_score(binary_var, caseness_array['CMHD'])

feature_entropy.to_csv("UNSEEN feature set_entropy_plus triplets.csv", index = False)
feature_mutual_information.to_csv("UNSEEN feature set_MutInfo_plus triplets.csv", index = False)

NameError: name 'f_to_calc' is not defined

### Entropy and two-way mutual information of quadruplet-composite feature sets and the caseness variable.

In [None]:
for i_featureSet in f_to_calc:
    for j_featureSet in f_to_calc:
        # Skip the iteration if the same feature set is selected twice.
        if i_featureSet == j_featureSet:
            continue
            
        for k_featureSet in f_to_calc:
            # Skip the iteration if the same feature set is selected twice.
            if len(set([k_featureSet]) & set([i_featureSet, j_featureSet])) > 0:
                continue
                
            for l_featureSet in f_to_calc:
                # Skip the iteration if the same feature set is selected twice.
                if len(set([l_featureSet]) & set([i_featureSet, j_featureSet, k_featureSet])) > 0:
                    continue

                # Create the feature ID for the pair-composite feature set.
                name_var = "-".join([i_featureSet, j_featureSet, k_featureSet, l_featureSet])

                # Update the feature set id table.
                featureSet_ID_table.loc[len(featureSet_ID_table),
                                        ['Feature set ID', 'Feature Set 1', 'Feature Set 2',
                                         'Feature Set 3',  'Feature Set 4']] = \
                    [name_var, i_featureSet, j_featureSet, k_featureSet, l_featureSet]

                # Define the pair-composite feature set values.
                # ## In this case, the pair-composite feature set is defined as 0 when both feature
                # ## sets are 0, and 1 otherwise.
                binary_var = \
                    pandas.DataFrame(data = {
                                              'i_binary_var' : featureSet_array[i_featureSet] == 0,
                                              'j_binary_var' : featureSet_array[j_featureSet] == 0,
                                              'k_binary_var' : featureSet_array[k_featureSet] == 0,
                                              'l_binary_var' : featureSet_array[l_featureSet] == 0
                                             }
                                    ).all(True)

                # Calculate the entropy for the pair-composite feature set.
                f_ent = scipy.stats.entropy(binary_var.value_counts(), base = math.e)
                if f_ent < entropy_caseness:
                    continue
                else:
                    feature_entropy.loc[len(feature_entropy)] = name_var, f_ent

                # Calculate the mutual information for the pair-composite feature set.
                feature_mutual_information.loc[len(feature_mutual_information)] = \
                    name_var, sklearn.metrics.mutual_info_score(binary_var, caseness_array['CMHD'])

feature_entropy.to_csv("UNSEEN feature set_entropy_plus quadruplets.csv", index = False)
feature_mutual_information.to_csv("UNSEEN feature set_MutInfo_plus quadruplets.csv", index = False)

### Entropy and two-way mutual information of quintuplet-composite feature sets and the caseness variable.

In [None]:
for i_featureSet in f_to_calc:
    for j_featureSet in f_to_calc:
        # Skip the iteration if the same feature set is selected twice.
        if i_featureSet == j_featureSet:
            continue
            
        for k_featureSet in f_to_calc:
            # Skip the iteration if the same feature set is selected twice.
            if len(set([k_featureSet]) & set([i_featureSet, j_featureSet])) > 0:
                continue
                
            for l_featureSet in f_to_calc:
                # Skip the iteration if the same feature set is selected twice.
                if len(set([l_featureSet]) & set([i_featureSet, j_featureSet, k_featureSet])) > 0:
                    continue
                
                for m_featureSet in f_to_calc:
                    # Skip the iteration if the same feature set is selected twice.
                    if len(set([m_featureSet]) & set([i_featureSet, j_featureSet, k_featureSet, l_featureSet])) > 0:
                        continue

                    # Create the feature ID for the pair-composite feature set.
                    name_var = "-".join([i_featureSet, j_featureSet, k_featureSet, l_featureSet, m_featureSet])

                    # Update the feature set id table.
                    # ## Note: 
                    featureSet_ID_table.loc[len(featureSet_ID_table),
                                            ['Feature set ID', 'Feature Set 1', 'Feature Set 2',
                                             'Feature Set 3',  'Feature Set 4', 'Feature Set 5']] = \
                        [name_var, i_featureSet, j_featureSet, k_featureSet, l_featureSet, m_featureSet]

                    # Define the pair-composite feature set values.
                    # ## In this case, the pair-composite feature set is defined as 0 when both feature
                    # ## sets are 0, and 1 otherwise.
                    binary_var = \
                        pandas.DataFrame(data = {
                                                  'i_binary_var' : featureSet_array[i_featureSet] == 0,
                                                  'j_binary_var' : featureSet_array[j_featureSet] == 0,
                                                  'k_binary_var' : featureSet_array[k_featureSet] == 0,
                                                  'l_binary_var' : featureSet_array[l_featureSet] == 0,
                                                  'm_binary_var' : featureSet_array[m_featureSet] == 0
                                                 }
                                        ).all(True)

                    # Calculate the entropy for the pair-composite feature set.
                    f_ent = scipy.stats.entropy(binary_var.value_counts(), base = math.e)
                    if f_ent < entropy_caseness:
                        continue
                    else:
                        feature_entropy.loc[len(feature_entropy)] = name_var, f_ent

                    # Calculate the mutual information for the pair-composite feature set.
                    feature_mutual_information.loc[len(feature_mutual_information)] = \
                        name_var, sklearn.metrics.mutual_info_score(binary_var, caseness_array['CMHD'])

feature_entropy.to_csv("UNSEEN feature set_entropy_plus quintuplets.csv", index = False)
feature_mutual_information.to_csv("UNSEEN feature set_MutInfo_plus quintuplets.csv", index = False)

In [None]:
feature_entropy['Entropy'].plot.hist(bins = 30)

In [None]:
feature_mutual_information['Mutual information'].plot.hist(bins = 30)