# Evaluate feature sets.

The purpose of this notebook is to calculate the evaluation statistics for each feature set.

#### Load required data.

In [1]:
%run 'UNSEEN_helper_functions.ipynb'
%store -r

## Discard non-informative feature sets.

Non-informative feature sets are those that take only one value across all patients. We expect most feature sets to be non-informative because of the breadth and specificity of our feature-set definitions.

In [2]:
discarded_because_only_one_value = list( set(discarded_because_only_one_value) )
# Remove from feature_set_array all feature sets with only one value.
def valueCheck(df):
    a = len(df.value_counts())
    if a < 2:
        discarded_because_only_one_value.append(df.name)
    return a > 1
feature_set_array = feature_set_array.loc[:,feature_set_array.apply(valueCheck).to_numpy()]

print(f'{len(discarded_because_only_one_value)} feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')
discarded_because_only_one_value = set(discarded_because_only_one_value)
%store discarded_because_only_one_value
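with open('discarded_because_only_one_value.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write one feature-set name per row. Passing the bare strings to
    # writerows() would split each name into one character per column.
    writer.writerows([name] for name in discarded_because_only_one_value)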

161386 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'discarded_because_only_one_value' (set)


## Calculate evaluation statistics for all feature sets.

Note that the evaluation statistics for the family-combination feature sets were calculated when those feature sets were created, so they are appended to the evaluations of the feature sets in `feature_set_array` that have more than one value.

In [31]:
# Evaluate the feature sets in `feature_set_array` that have more than one value.
ls_output = []
for i_featureSet in tqdm.notebook.tqdm_notebook(feature_set_array.columns[1:]):
    try:
        ls_output.append(
            evaloutputs(feature_set_array[i_featureSet],
                        caseness_array.caseness_1isYes.astype(int))
        )
    except Exception as err:
        # Report which feature set failed, and why, then carry on with the rest.
        print(f'{i_featureSet}: {err!r}')

# Append the evaluation statistics for the family-combination feature sets, which were calculated when those feature sets were created.
ls_output.extend(ls_output_family_combins)

# Store and save.
%store ls_output
with open('ls_output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(ls_output)

  0%|          | 0/186 [00:00<?, ?it/s]

Stored 'ls_output' (list)


## Display evaluation statistics for all feature sets.

The displayed pandas.DataFrame is ordered by descending scaled mutual information. Only the top 30 are shown.
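
The scaling applied by `evaloutputs` is defined in the helper notebook and is not shown here. As a minimal sketch, assuming that "scaled" means the mutual information between a binary feature set and caseness divided by the entropy of caseness, the statistic can be computed from the four confusion-matrix counts. The function name `scaled_mutual_information` is hypothetical, and the counts are those reported for `C2_S1_T0_P0_Y0` in the results table.

```python
import math

def scaled_mutual_information(tn, fn, fp, tp):
    """Mutual information between a binary feature set and caseness,
    divided by the entropy of caseness (an ASSUMED scaling, not
    necessarily the one used by `evaloutputs`)."""
    n = tn + fn + fp + tp
    # Joint counts keyed by (feature-set value, caseness value).
    joint = {(0, 0): tn, (0, 1): fn, (1, 0): fp, (1, 1): tp}
    p_feature = {0: (tn + fn) / n, 1: (fp + tp) / n}
    p_case = {0: (tn + fp) / n, 1: (fn + tp) / n}

    mutual_information = 0.0
    for (x, y), count in joint.items():
        if count:  # empty cells contribute nothing
            p_xy = count / n
            mutual_information += p_xy * math.log(p_xy / (p_feature[x] * p_case[y]))

    entropy_case = -sum(p * math.log(p) for p in p_case.values() if p > 0)
    return mutual_information / entropy_case

smi = scaled_mutual_information(tn=58893, fn=2653, fp=92019, tp=770)
print(f"Scaled mutual information: {smi:.4f}")
```

Under this assumption the result is roughly 6%, in line with the value reported for that feature set. The sketch does not guard against a caseness column with a single value, for which the entropy is zero.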

The keys for family-combination feature sets are:
| Letter | Family group               | Description |
| ------ | -------------------------- | ----------- |
| A      | Antecedent                 |  a feature set representing features (generally) preceding the emergence of complex mental health difficulties in adults (e.g. adverse childhood experiences), and administrative or clinical events recorded before the age of 30. |
| C      | Concurrent                 |  a feature set representing concerning behaviours after 30 years of age, e.g. self-harm, risk, substance misuse or dependency. |
| S      | Service Use                |  a feature set representing patterns of recent use of healthcare services that indicate both intensity and variance of use, e.g. number of mental health-related SNOMED-CT codes in the patient’s record, use of specialist services, burstiness of attendance. |
| T      | Treatment                  |  a feature set representing use and patterns of therapy and prescriptions, e.g. repeated referrals to Improving Access to Psychological Therapy (IAPT). |
| K      | Inconsistency              |  a feature set representing unstable or atypical attendance activity, e.g. median count of appointments not attended, or sample entropy of appointments. |
| P      | Patterns of Prescription   |  a feature set representing patterns in the prescriptions for medications of interest, e.g. the count of aborted antidepressant-medication regimes. |
| R      | Relevant Prescriptions     |  a feature set indicating the presence or absence of prescriptions for medications of interest. |
| Y      | Antipsychotic prescription |  a feature set containing only one feature that indicates the presence or absence of a prescription for antipsychotic medications. |

and

| Number | Level or Extent | Description |
| ------ | --------------- | ----------- |
| 0      | None            | Indicating patients who have no record of this family-group's feature sets. |
| 1      | Not none        | Indicating patients who have at least one of this family-group's feature sets in their records. |
| 2      | Few             | Indicating patients who have at most the family-specific lower quantile of this family-group's feature sets in their records. |
| 3      | Some            | Indicating patients who have between the family-specific lower and upper quantiles of this family-group's feature sets in their records. |
| 4      | Many            | Indicating patients who have at least the family-specific upper quantile of this family-group's feature sets in their records. |
| x      | Not considered  | The level of the family group is not considered in the definition of this particular family-combination feature set. |
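
To make the two keys concrete, the sketch below decodes a short key such as `C2_S1_T0_P0_Y0` into readable text. It also shows how `x` placeholders are stripped to obtain the short name, using the same regular expressions as the `Feature_set_short` column. The helper names `decode_key` and `shorten_key`, and the example full key, are hypothetical; the format of the full key is inferred from that regular expression rather than confirmed by the data.

```python
import re

FAMILIES = {
    "A": "Antecedent", "C": "Concurrent", "S": "Service Use",
    "T": "Treatment", "K": "Inconsistency", "P": "Patterns of Prescription",
    "R": "Relevant Prescriptions", "Y": "Antipsychotic prescription",
}
LEVELS = {
    "0": "None", "1": "Not none", "2": "Few",
    "3": "Some", "4": "Many", "x": "Not considered",
}

def decode_key(key):
    """Translate a key like 'C2_S1' into 'Concurrent: Few; Service Use: Not none'."""
    parts = []
    for token in key.split("_"):
        match = re.fullmatch(r"([ACSTKPRY])([0-4x])", token)
        if match is None:
            raise ValueError(f"Unrecognised token {token!r} in key {key!r}")
        family, level = match.groups()
        parts.append(f"{FAMILIES[family]}: {LEVELS[level]}")
    return "; ".join(parts)

def shorten_key(key):
    """Drop the 'not considered' families, then any trailing underscore."""
    return re.sub(r"_$", "", re.sub(r"[ACSTKPRY](x_|x)", "", key))

full_key = "Ax_C2_S1_T0_Kx_P0_Rx_Y0"  # hypothetical full feature-set name
short_key = shorten_key(full_key)
print(short_key)
print(decode_key(short_key))
```

A feature set that is not a family combination, such as `countPsychologicalDisorders`, makes `decode_key` raise a `ValueError`, so callers would need to handle those names separately.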


The results table is shown below.

In [2]:
eval_output = \
    pandas.DataFrame(ls_output,
                     columns = ['Feature_set', 'Data_type', 'Scaled_mutual_information',
                                'Prevalence_per_thousand', 'Mean', 'Mode', 'Class_balanced_accuracy',
                                'Odds_ratio', 'ppv', 'npv', 'tn', 'fn', 'fp', 'tp'])

eval_output.insert(1, "Feature_set_short", [re.sub(r'_$','', re.sub(r'[ACSTKPRY](x_|x)','', i_fs) ) for i_fs in eval_output.Feature_set])
eval_output.sort_values(by=['Scaled_mutual_information', 'Feature_set_short'], ascending = [False, True], inplace = True)
eval_output.reset_index(drop=True, inplace = True)
# Store and save output table.
%store eval_output
eval_output.to_csv('eval_output.csv', index=False)


eval_output.insert(4, "pct_Scaled_mutual_information", round( eval_output.Scaled_mutual_information * 100, 1))
pandas.set_option('display.max_rows', 30)
display(eval_output.iloc[0:30, ~eval_output.columns.isin(['Feature_set', 'Data_type', 'Scaled_mutual_information']) ] )


Stored 'eval_output' (DataFrame)


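| Index | Feature_set_short | pct_Scaled_mutual_information | Prevalence_per_thousand | Mean | Mode | Class_balanced_accuracy | Odds_ratio | ppv | npv | tn | fn | fp | tp |
| ----- | ----------------- | ----------------------------- | ----------------------- | ---- | ---- | ----------------------- | ---------- | --- | --- | -- | -- | -- | -- |
| 0 | countPsychologicalDisorders | 20.7 |  |  | 1.0 |  | 2.2 |  |  |  |  |  |  |
| 1 | C2_S1_T0_P0_Y0 | 6.3 | 601.22 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58893.0 | 2653.0 | 92019.0 | 770.0 |
| 2 | C2_S2_T0_P0_Y0 | 6.3 | 601.22 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58893.0 | 2653.0 | 92019.0 | 770.0 |
| 3 | C2_T0_P0_Y0 | 6.3 | 601.22 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58893.0 | 2653.0 | 92019.0 | 770.0 |
| 4 | S2_T0_P0_Y0 | 6.3 | 601.22 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58893.0 | 2653.0 | 92019.0 | 770.0 |
| 5 | C2_S1_T0_Y0 | 6.2 | 602.52 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58700.0 | 2645.0 | 92212.0 | 778.0 |
| 6 | C2_S2_T0_Y0 | 6.2 | 602.52 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58700.0 | 2645.0 | 92212.0 | 778.0 |
| 7 | C2_T0_Y0 | 6.2 | 602.52 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58700.0 | 2645.0 | 92212.0 | 778.0 |
| 8 | S2_T0_Y0 | 6.2 | 602.52 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58700.0 | 2645.0 | 92212.0 | 778.0 |
| 9 | C2_S1_T0_P0 | 6.2 | 602.35 |  |  | 0.2 | 0.19 | < 0.01 | 0.96 | 58732.0 | 2639.0 | 92180.0 | 784.0 |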
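
The ratio columns can be related to the confusion-matrix counts (`tn`, `fn`, `fp`, `tp`). The sketch below is not the code in `evaloutputs`, which lives in the helper notebook; it assumes the usual definitions of `ppv`, `npv` and the odds ratio, takes prevalence to be the proportion of patients for whom the feature set is positive, and takes class-balanced accuracy to be the mean over both classes of the smaller of precision and recall. The function name `confusion_statistics` is hypothetical. With the counts from the `C2_S1_T0_P0_Y0` row, these assumptions give values that round to those shown in the table.

```python
def confusion_statistics(tn, fn, fp, tp):
    """Summary statistics for a binary feature set, under ASSUMED definitions."""
    n = tn + fn + fp + tp
    ppv = tp / (tp + fp)          # precision for the positive class
    npv = tn / (tn + fn)          # precision for the negative class
    sensitivity = tp / (tp + fn)  # recall for the positive class
    specificity = tn / (tn + fp)  # recall for the negative class
    return {
        # Patients per thousand for whom the feature set is positive.
        "Prevalence_per_thousand": 1000 * (tp + fp) / n,
        "ppv": ppv,
        "npv": npv,
        "Odds_ratio": (tp * tn) / (fp * fn),
        # Mean over classes of min(precision, recall).
        "Class_balanced_accuracy": (min(ppv, sensitivity) + min(npv, specificity)) / 2,
    }

stats = confusion_statistics(tn=58893, fn=2653, fp=92019, tp=770)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

The sketch omits guards against zero denominators, which would be needed for feature sets with an empty row or column in the confusion matrix.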


In [34]:
pandas.set_option('display.max_rows', 100)
# Collect the single-family feature sets, ordered by level (0 to 4) and then by family.
family_keys = [family + level for level in '01234' for family in 'ACSTKPRY']
a = pandas.concat([eval_output[eval_output.Feature_set_short == key] for key in family_keys])
pandas.DataFrame([a.index+1, a.Feature_set_short]).transpose().to_csv('family_ranks.csv')