# Evaluate feature sets.

The purpose of this notebook is to calculate the evaluation statistics for each feature set.

#### Load required data.

In [57]:
%run 'UNSEEN_helper_functions.ipynb'
%store -r

## Discard non-informative feature sets.

Non-informative feature sets are those that take only one value across all patients. We expect most feature sets to be non-informative because of the breadth and specificity of our feature-set definitions.
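
As a minimal sketch of this definition, assuming a toy DataFrame whose column names are hypothetical (none come from the study data), constant columns can be identified with `DataFrame.nunique`, which, like `value_counts`, ignores missing values by default. It is an illustration only, not the method used in the cell that follows.

```python
import pandas as pd

# Hypothetical toy data: rows are patients, columns are feature sets.
toy = pd.DataFrame({
    "always_false": [False, False, False, False],    # a single value
    "one_value_or_missing": [1.0, None, 1.0, None],  # a single observed value
    "informative": [0, 1, 0, 1],                     # two distinct values
})

# Count distinct non-missing values per column.
n_values = toy.nunique(dropna=True)

# Names of the non-informative feature sets.
discarded = n_values[n_values < 2].index.tolist()

# Keep only the columns with at least two distinct values.
informative = toy.loc[:, (n_values >= 2).to_numpy()]

print(discarded)
print(list(informative.columns))
```

Note that a column holding one observed value plus missing entries is treated as non-informative here, because missing values are not counted as a value.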

In [58]:
discarded_because_only_one_value = list( set(discarded_because_only_one_value) )
# Remove from feature_set_array all feature sets with only one value.
def valueCheck(df):
    a = len(df.value_counts())
    if a < 2:
        discarded_because_only_one_value.append(df.name)
    return a > 1
feature_set_array = feature_set_array.loc[:,feature_set_array.apply(valueCheck).to_numpy()]

print(f'{len(discarded_because_only_one_value)} feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')
discarded_because_only_one_value = set(discarded_because_only_one_value)
%store discarded_because_only_one_value
with open('discarded_because_only_one_value.csv', 'w', newline='') as file:
    writer = csv.writer(file)
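    # Write one name per row; a bare string would be split into one character per cell.
    writer.writerows([name] for name in discarded_because_only_one_value)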

137016 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'discarded_because_only_one_value' (set)


## Calculate evaluation statistics for all feature sets.

Note that the evaluation statistics for the family-combination feature sets were calculated when those feature sets were created, so they are appended to the evaluations of the feature sets in `feature_set_array` that have more than one value.

In [59]:
# Evaluate the feature sets in `feature_set_array` that have more than one value.
ls_output = []
for i_featureSet in tqdm.notebook.tqdm_notebook(feature_set_array.columns[1:]):
    try:
        ls_output.append(
            evaloutputs(feature_set_array[i_featureSet],
                        caseness_array.caseness_1isYes.astype(int))
        )
    except Exception as err:
        # Report the feature set that could not be evaluated, and why.
        print(f'{i_featureSet}: {err}')

# Append the evaluation statistics for the family-combination feature sets, which were calculated when they were created.
ls_output.extend(ls_output_family_combins)

# Store and save.
%store ls_output
with open('ls_output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(ls_output)

Stored 'ls_output' (list)


## Display evaluation statistics for all feature sets.

The displayed pandas.DataFrame is ordered by descending scaled mutual information. Only the top 30 are shown.
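
The definition of scaled mutual information lives in `UNSEEN_helper_functions.ipynb`, which is not shown here. The sketch below therefore rests on an assumption: that the mutual information between a binary feature set and caseness is divided by the entropy of caseness. The function names are hypothetical. Under that assumption, the confusion-matrix counts of the first family-combination row in the results table give a value close to the 0.084633 displayed for that row.

```python
from math import log2


def entropy_bits(probabilities):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def scaled_mutual_information(tn, fn, fp, tp):
    """Mutual information between a binary feature set and caseness,
    divided by the entropy of caseness (an assumed scaling)."""
    n = tn + fn + fp + tp
    # Marginal entropy of caseness: cases are tp + fn, non-cases are tn + fp.
    h_caseness = entropy_bits([(tp + fn) / n, (tn + fp) / n])
    # Entropy of caseness within each feature-set group, weighted by group size.
    h_conditional = 0.0
    for cases, noncases in ((tp, fp), (fn, tn)):
        group = cases + noncases
        if group > 0:
            h_conditional += (group / n) * entropy_bits(
                [cases / group, noncases / group]
            )
    return (h_caseness - h_conditional) / h_caseness


# Counts taken from the row for Ax_C2_S1_Tx_Kx_P2_Rx_Y0.
smi = scaled_mutual_information(tn=23368, fn=1893, fp=127544, tp=1530)
print(round(smi, 4))
```

The function divides by the entropy of caseness, so it is undefined when every patient has the same caseness value.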

The keys for family-combination feature sets are:

| Letter | Family group               | Description |
| ------ | -------------------------- | ----------- |
| A      | Antecedent                 |  feature sets (generally) preceding the emergence of complex mental health difficulties in adults, and might reflect risk factors or early mechanisms, e.g. adverse childhood experiences, mental health difficulties in childhood or adolescence, behaviour problems in childhood and adolescence, adult trauma (including intimate partner violence), and neurodiversity. |
| C      | Concurrent                 |  feature sets present in adult life (from emerging adulthood onward) that might indicate the emergence of behaviours indicative of complex mental health difficulties, e.g. self-harm, risk, substance misuse or dependency. |
| S      | Service Use                |  feature sets representing patterns of recent use of healthcare services that indicate both intensity and variance of use, e.g. number of mental health-related SNOMED-CT codes in the patient’s record, use of specialist services, burstiness of attendance. |
| T      | Treatment                  |  feature sets indicating therapy and prescriptions, or patterns of them, e.g. repeated referrals to IAPT. |
| K      | Inconsistency              |  feature sets attempting to represent unstable or atypical recorded activity, e.g. median count of appointments not attended, or sample entropy of appointments. |
| P      | Patterns of Prescription   |  feature sets describing patterns in the prescriptions of particular medications, e.g. the count of aborted antidepressant-medication regimes. |
| R      | Relevant Prescriptions     |  feature sets indicating the presence or absence of prescriptions for selected medications. |
| Y      | Antipsychotic prescription |  a single feature set indicating the presence or absence of a prescription for antipsychotic medications. |

and

| Number | Level or Extent | Description |
| ------ | --------------- | ----------- |
| 0      | None            | Indicating patients who have no record of this family-group's feature sets. |
| 1      | Not none        | Indicating patients who have at least one of this family-group's feature sets in their records. |
| 2      | Few             | Indicating patients who have at most the family-specific lower quantile of this family-group's feature sets in their records. |
| 3      | Some            | Indicating patients who have between the family-specific lower and upper quantiles of this family-group's feature sets in their records. |
| 4      | Many            | Indicating patients who have at least the family-specific upper quantile of this family-group's feature sets in their records. |
| x      | Not considered  | The level of the family group is not considered in the definition of this particular family-combination feature set. |
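
The two keys above can be applied mechanically to a family-combination name. The sketch below is a minimal illustration: the function names `decode_family_combination` and `short_name` are hypothetical, and it assumes every name is a sequence of letter-plus-level tokens joined by underscores, as in `Ax_C2_S1_Tx_Kx_P2_Rx_Y0`.

```python
FAMILIES = {
    "A": "Antecedent",
    "C": "Concurrent",
    "S": "Service Use",
    "T": "Treatment",
    "K": "Inconsistency",
    "P": "Patterns of Prescription",
    "R": "Relevant Prescriptions",
    "Y": "Antipsychotic prescription",
}
LEVELS = {
    "0": "None",
    "1": "Not none",
    "2": "Few",
    "3": "Some",
    "4": "Many",
    "x": "Not considered",
}


def decode_family_combination(name):
    """Map each family group in a family-combination name to its level."""
    decoded = {}
    for token in name.split("_"):
        if len(token) != 2 or token[0] not in FAMILIES or token[1] not in LEVELS:
            raise ValueError(f"Unrecognised token {token!r} in {name!r}")
        decoded[FAMILIES[token[0]]] = LEVELS[token[1]]
    return decoded


def short_name(name):
    """Drop the family groups that are not considered (level 'x')."""
    return "_".join(t for t in name.split("_") if not t.endswith("x"))


example = "Ax_C2_S1_Tx_Kx_P2_Rx_Y0"
decoded = decode_family_combination(example)
print(decoded["Concurrent"], "|", decoded["Antipsychotic prescription"])
print(short_name(example))
```

For this example name, the Concurrent family is at level "Few", the antipsychotic prescription is absent, and the shortened name keeps only the four families that are considered.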


The results table is shown below.

In [60]:
eval_output = \
    pandas.DataFrame(ls_output,
                     columns = ['Feature_set', 'Data_type', 'Scaled_mutual_information',
                                'Prevalence_per_thousand', 'Mean', 'Mode', 'Class_balanced_accuracy',
                                'Odds_ratio', 'ppv', 'npv', 'tn', 'fn', 'fp', 'tp'])
eval_output.sort_values(by=['Scaled_mutual_information'], ascending = False, inplace = True)
eval_output.reset_index(drop=True, inplace = True)
eval_output.insert(1, "Feature_set_short", [re.sub(r'_$','', re.sub(r'[ACSTKPRY](x_|x)','', i_fs) ) for i_fs in eval_output.Feature_set])
pandas.set_option('display.max_rows', 30)
display(eval_output[0:30])

# Store and save output table.
%store eval_output
eval_output.to_csv('eval_output.csv', index=False)

Unnamed: 0,Feature_set,Feature_set_short,Data_type,Scaled_mutual_information,Prevalence_per_thousand,Mean,Mode,Class_balanced_accuracy,Odds_ratio,ppv,npv,tn,fn,fp,tp
0,countPsychologicalDisorders,countPsychologicalDisorders,Int64,0.220647,,,1.0,,2.2,,,,,,
1,Ax_C2_S1_Tx_Kx_P2_Rx_Y0,C2_S1_P2_Y0,bool,0.084633,836.32,,,0.08,0.15,0.01,0.93,23368.0,1893.0,127544.0,1530.0
2,Ax_C2_Sx_Tx_Kx_P2_Rx_Y0,C2_P2_Y0,bool,0.084633,836.32,,,0.08,0.15,0.01,0.93,23368.0,1893.0,127544.0,1530.0
3,Ax_C2_Sx_Tx_Kx_Px_Rx_Y0,C2_Y0,bool,0.083747,837.7,,,0.08,0.15,0.01,0.93,23170.0,1878.0,127742.0,1545.0
4,Ax_C2_S1_Tx_Kx_Px_Rx_Y0,C2_S1_Y0,bool,0.083747,837.7,,,0.08,0.15,0.01,0.93,23170.0,1878.0,127742.0,1545.0
5,Ax_C2_Sx_Tx_Kx_P2_R2_Y0,C2_P2_R2_Y0,bool,0.083322,832.68,,,0.09,0.15,0.01,0.93,23923.0,1901.0,126989.0,1522.0
6,Ax_C2_S1_Tx_Kx_P2_R2_Y0,C2_S1_P2_R2_Y0,bool,0.083322,832.68,,,0.09,0.15,0.01,0.93,23923.0,1901.0,126989.0,1522.0
7,Ax_C2_S1_Tx_Kx_Px_R2_Y0,C2_S1_R2_Y0,bool,0.082411,834.02,,,0.08,0.15,0.01,0.93,23731.0,1886.0,127181.0,1537.0
8,Ax_C2_Sx_Tx_Kx_Px_R2_Y0,C2_R2_Y0,bool,0.082411,834.02,,,0.08,0.15,0.01,0.93,23731.0,1886.0,127181.0,1537.0
9,A2_C2_S1_Tx_Kx_P2_Rx_Y0,A2_C2_S1_P2_Y0,bool,0.081798,696.16,,,0.15,0.15,< 0.01,0.95,44397.0,2496.0,106515.0,927.0


Stored 'eval_output' (DataFrame)
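
The `tn`, `fn`, `fp` and `tp` columns are enough to recompute several of the other columns for boolean feature sets. The formulas in the sketch below are assumptions inferred from the displayed rows, since the `evaloutputs` helper is not shown here. In particular, class-balanced accuracy is assumed to be the mean, over both classes, of the smaller of precision and recall, and prevalence is assumed to be the share of patients for whom the feature set is present. The function name is hypothetical.

```python
def confusion_statistics(tn, fn, fp, tp):
    """Recompute summary statistics from confusion-matrix counts."""
    n = tn + fn + fp + tp
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    sensitivity = tp / (tp + fn)  # recall for cases
    specificity = tn / (tn + fp)  # recall for non-cases
    return {
        # Patients with the feature set present, per thousand patients.
        "Prevalence_per_thousand": 1000 * (tp + fp) / n,
        "Odds_ratio": (tp * tn) / (fp * fn),
        "ppv": ppv,
        "npv": npv,
        # Mean over classes of min(precision, recall); an assumed definition.
        "Class_balanced_accuracy": (
            min(ppv, sensitivity) + min(npv, specificity)
        ) / 2,
    }


# Counts taken from the row for Ax_C2_S1_Tx_Kx_P2_Rx_Y0.
stats = confusion_statistics(tn=23368, fn=1893, fp=127544, tp=1530)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

With these assumed formulas, the recomputed values agree to two decimal places with those displayed for that row. The odds ratio is undefined when `fp` or `fn` is zero, and the predictive values are undefined when a feature set is never present or never absent, so a production version would need to handle those cases.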
