# Filter database feature sets.

Candidate feature sets inspired by a database review are filtered based on their entropy. Further details are in this notebook's parent notebook "UNSEEN filter feature sets.ipynb".

## Imports

In [1]:
import itertools
import sklearn.metrics

## Load database feature-set array

Here, we run the notebook that creates the database feature-set array. We then assign the feature-set array to "my_featureSet_array" so that the remaining code in this notebook is the same for all feature-set sources.

It is assumed that the caseness variables have already been created in the parent notebook.

In [2]:
%%capture
%run ./"UNSEEN create database feature sets.ipynb"
my_featureSet_array = fs_database

## Filter feature sets.

### 1. Mutual information of individual feature sets and the caseness variables.

In [3]:
# Set the order of the composite: 1 = individual, 2 = pair, 3 = triplet.
m = 1
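
The order `m` controls how many feature sets are combined into each composite. As a minimal sketch, assuming composites are formed from every unordered combination of `m` feature sets (the actual construction lives in `featuresetmi`, which is not shown in this notebook), `itertools.combinations` can enumerate them. The feature-set names and the helper `enumerate_composites` are hypothetical.

```python
import itertools

# Hypothetical feature-set names, used only for illustration.
feature_names = ["fs_a", "fs_b", "fs_c", "fs_d"]


def enumerate_composites(names, m):
    """Return every unordered combination of m feature sets (assumed scheme)."""
    return list(itertools.combinations(names, m))


# Number of candidate composites for each order: individual, pair, triplet.
composite_counts = {m: len(enumerate_composites(feature_names, m)) for m in (1, 2, 3)}
print(composite_counts)
```

With n feature sets there are C(n, m) composites of order m, so the candidate pool can grow quickly as `m` increases from 1 to 3 when n is large.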

#### 1.1. Multinomial caseness

##### 1.1.1. ALL representation.

In [4]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "all")

NameError: name 'featuresetmi' is not defined
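
`featuresetmi` is not defined in this notebook; it is presumably defined elsewhere (for example, in the parent notebook), and its implementation is not shown here. As a rough, standalone sketch of the quantity involved, the function below calculates the mutual information (in nats) between one binary feature set and a caseness vector. `sklearn.metrics.mutual_info_score`, from the module imported above, returns the same quantity for two label vectors. The helper names and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (nats) between two equal-length label vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (x_val, y_val), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to limit rounding error.
        mi += p_xy * math.log(count * n / (count_x[x_val] * count_y[y_val]))
    return mi


def entropy(y):
    """Shannon entropy (nats) of a label vector."""
    n = len(y)
    return -sum((c / n) * math.log(c / n) for c in Counter(y).values())


# Hypothetical toy data: 1 = feature recorded / case, 0 = otherwise.
caseness = [0, 0, 1, 1]
identical_feature = [0, 0, 1, 1]
unrelated_feature = [0, 1, 0, 1]

print(mutual_information(identical_feature, caseness), entropy(caseness))
print(mutual_information(unrelated_feature, caseness))
```

A feature set that reproduces the caseness variable exactly attains the entropy of the caseness variable, while an unrelated one scores zero.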

##### 1.1.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "multi")

#### 1.2. Definitive caseness

##### 1.2.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "all")

##### 1.2.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "multi")

#### 1.3. Possible caseness

##### 1.3.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "all")

##### 1.3.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "multi")

#### 1.4. No caseness (i.e. control group)

##### 1.4.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "all")

##### 1.4.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "multi")

### 2. Mutual information of pair-composite feature sets and the caseness variables.

In [None]:
# Set the order of the composite: 1 = individual, 2 = pair, 3 = triplet.
m = 2
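
The meaning of the "all" and "multi" representations is defined in the parent notebook and is not shown here. Purely for illustration, the sketch below assumes one plausible reading: "all" marks a patient as 1 only when every feature set in the composite is present, and "multi" keeps each combination of values as its own level. Both function names and the toy vectors are hypothetical, and the real definitions may differ.

```python
# Hypothetical binary feature sets for four patients (1 = code recorded).
fs_a = [1, 1, 0, 0]
fs_b = [1, 0, 1, 0]


def composite_all(*feature_sets):
    """Assumed "all" reading: 1 only if every feature set is present."""
    return [int(all(values)) for values in zip(*feature_sets)]


def composite_multi(*feature_sets):
    """Assumed "multi" reading: one level per combination of values."""
    return ["".join(str(v) for v in values) for values in zip(*feature_sets)]


pair_all = composite_all(fs_a, fs_b)
pair_multi = composite_multi(fs_a, fs_b)
print(pair_all)
print(pair_multi)
```

Under this reading, a pair composite has two levels in the "all" representation and up to four in the "multi" representation.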

#### 2.1. Multinomial caseness

##### 2.1.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "all")

##### 2.1.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "multi")

#### 2.2. Definitive caseness

##### 2.2.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "all")

##### 2.2.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "multi")

#### 2.3. Possible caseness

##### 2.3.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "all")

##### 2.3.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "multi")

#### 2.4. No caseness (i.e. control group)

##### 2.4.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "all")

##### 2.4.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "multi")

### 3. Mutual information of triplet-composite feature sets and the caseness variables.

In [None]:
# Set the order of the composite: 1 = individual, 2 = pair, 3 = triplet.
m = 3

#### 3.1. Multinomial caseness

##### 3.1.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "all")

##### 3.1.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD']],
             m = m,
             representation = "multi")

#### 3.2. Definitive caseness

##### 3.2.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "all")

##### 3.2.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_dx_and_rx']],
             m = m,
             representation = "multi")

#### 3.3. Possible caseness

##### 3.3.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "all")

##### 3.3.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_rx_not_dx']],
             m = m,
             representation = "multi")

#### 3.4. No caseness (i.e. control group)

##### 3.4.1. ALL representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "all")

##### 3.4.2. MULTI representation.

In [None]:
featuresetmi(featureSet_array = my_featureSet_array,
             casenessVector = caseness_array[['person_id','CMHD_control']],
             m = m,
             representation = "multi")

# Initial results (\*needs editing because caseness variable has changed\*)

__\*__ _Note: these initial results were calculated using a previous version of the script where the mutual information of all feature sets was saved. This approach was dropped in favour of only saving mutual information values for feature sets whose mutual information with the caseness variable is greater than the entropy of the caseness variable._
<br/><br/><br/>

All individual feature sets score very low for two-way mutual information: none exceeds approximately 0.055. An $I_{2}$ of approximately 0.055 represents 7.9% of the theoretical maximum, which is reached when the feature set either _is exactly_ the caseness variable or _is exactly not_ the caseness variable. The top five individual feature sets (all with $I_{2} \geq 0.033$) are defined as having at least one recording of the following SNOMED CT codes in their primary-care electronic health records:

| SNOMED code | Feature set | Topic | Mutual information | Scaled mutual information | Odds ratio | P(CMHD given X=1) (%) | P(CMHD given X=0) (%) |
| ----------- | ----------- | ----- | ------------------ | ------------------------- | ---------- | ----------------- | ----------------- |
| 314530002 | Medication review done | Medication | ~0.055 | 7.9% | 8.5 | 4.1 | 26.8 |
| 182888003 | Medication requested  | Medication | ~0.035 | 5.0% | 4.9 | 7.9 | 29.5 |
| 1018251000000107 | Serum alanine aminotransferase level (observable entity) | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.6 |
| 1000621000000104 | Serum alkaline phosphatase level | Liver test | ~0.033 | 4.7% | 5.5 | 4.7 | 21.5 |
| 1022791000000101 | TSH (thyroid stimulating hormone) level | Endocrine | ~0.033 | 4.7% | 5.0 | 5.4 | 22.5 |
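
The scaled values in the table are consistent with dividing the mutual information by $\ln 2$, the maximum entropy (in nats) of a binary variable. That normaliser is an assumption made here for illustration; the original script may instead use the observed entropy of the caseness variable. A minimal check of the arithmetic, using the approximate values from the table:

```python
import math

# Assumed normaliser: maximum entropy of a binary variable, in nats.
max_information = math.log(2)

# Approximate mutual-information values copied from the table above.
tabulated_mi = [0.055, 0.035, 0.033]

scaled_percent = [round(100 * mi / max_information, 1) for mi in tabulated_mi]
print(scaled_percent)
```

This gives 7.9% and 5.0% for the first two rows. For 0.033 it gives 4.8% rather than the tabulated 4.7%, a difference that is within the rounding of the approximate mutual-information values.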

The paradoxes of commonly reported classification statistics are clearly shown. A medication _review_ has the largest odds ratio, but the probability of having a record of the caseness variable given that medication was _requested_ is higher. One might propose that a record of a medication request is a better indicator than a record of a medication review, if one prefers the probability statistic over the odds-ratio statistic. But when we look at the probability of the caseness variable given that there is _no_ record of medication being requested, we see that this is also the largest of the top five feature sets! The odds ratio tries to balance these ambiguous probability statistics and is, therefore, harder to interpret. Note that the odds ratio for a record of a medication _review_ is larger than for a record of a medication _request_ because the multiplicative difference between the two conditional probabilities is greater.
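
To make the relationship between the two conditional probabilities and the odds ratio concrete, the sketch below recomputes the odds ratio for the two medication rows. It assumes the probability columns in the table are percentages; the helper names are hypothetical.

```python
def odds(probability):
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


def odds_ratio(p_given_present, p_given_absent):
    """Odds of caseness when the feature is present relative to when it is absent."""
    return odds(p_given_present) / odds(p_given_absent)


# (P(CMHD | X=1), P(CMHD | X=0)) from the table, read as percentages.
rows = {
    "Medication review done": (0.041, 0.268),
    "Medication requested": (0.079, 0.295),
}

results = {}
for name, (p_present, p_absent) in rows.items():
    ratio = odds_ratio(p_present, p_absent)
    results[name] = (ratio, 1.0 / ratio)
    print(name, round(ratio, 3), round(1.0 / ratio, 1))
```

Under that assumption, both ratios are below 1, and their reciprocals (about 8.6 and 4.9) are close to the tabulated odds ratios of 8.5 and 4.9. This suggests, but does not confirm, that the tabulated odds ratios are expressed in the direction that gives a value above 1.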

The scaled mutual information is simply a percentage measure of how much of the information in the caseness variable is described by the feature set. Unlike the odds ratio, it gives the same value whether the odds ratio is multiplicatively greater or less than 1 (e.g. $I_{2}$ will be the same for $OR = 4.0$ and $OR = 0.25$), so it measures only the magnitude of the association.
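
The symmetry can be checked on a toy example. The sketch below builds two hypothetical 2x2 contingency tables that differ only in which feature value is labelled "present", giving odds ratios of 4.0 and 0.25, and computes the mutual information of each. The counts and helper names are illustrative assumptions, not data from this study.

```python
import math


def mutual_information(table):
    """Mutual information (nats) from a contingency table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                mi += (count / total) * math.log(
                    count * total / (row_sums[i] * col_sums[j])
                )
    return mi


def odds_ratio(table):
    """Odds ratio of a 2x2 table laid out as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Rows: feature present / absent. Columns: case / non-case.
table_or_4 = [[40, 20], [20, 40]]
table_or_quarter = [[20, 40], [40, 20]]  # same data with the feature flipped

for table in (table_or_4, table_or_quarter):
    print(odds_ratio(table), mutual_information(table))
```

Both tables give the same mutual information, although their odds ratios are reciprocals of each other.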