# Evaluating feature sets

This notebook contains the subprotocol for evaluating feature sets that have already been filtered in previous stages of the overall protocol. The following are the expected outputs for each feature set:

1. A prevalence value per 1,000.
2. A [class balance accuracy value](http://search.proquest.com/docview/1500559170?accountid=37552).
3. An odds ratio value.
4. A positive predictive value.
5. A negative predictive value.

We also provide the counts that make up the contingency table so that readers can calculate their own evaluation statistics.


## Imports and helper functions

In [61]:
import os
os.chdir('/home/jupyter/UNSEEN/c-mcinerney-workspace')
%run 'UNSEEN_helper_functions.ipynb'
%store -r

### Retrieve the retriebable parts of the output table from existing files.

In [62]:
order_options = ['Individuals','Pairs', 'Triplets']
data = []
for i_order in order_options:
    # Set the feature-set order of interest in this iteration.
    fs_order = i_order
    # Set the file directory path to the feature-set order of interest in this iteration.
    data_dir = pathlib.Path(os.getcwd() + '/Mutual information saves/' + fs_order)
    # Create a list of all relevant files.
    globbed_files = data_dir.glob("*.parquet") 
    for parquet_file in globbed_files:
        #print(f'For {parquet_file}, do stuff')
        # Read parquet file to list.
        read_list = pandas.read_parquet(parquet_file).values.tolist() 
        # Transpose to make list for feature set and list for mutual information.
        read_list = numpy.array(list((map(list, zip(*read_list)))))
        # Count how many items are in `read_list`.
        len_read_list = numpy.shape(read_list)[1]
        # Set the name of the file to be handled next.
        file_source = os.path.basename(parquet_file)
        # Extract the elements of the file name, which define the contents.
        fs_source, fs_casenessType, fs_representation, dump = file_source.split("_")
        # For the file we are currently handling, make a multi-colummn list of the information of interest.
        temp_list = \
            list(
                zip(numpy.repeat(file_source, len_read_list),
                    numpy.repeat(fs_source, len_read_list),
                    numpy.repeat(fs_order, len_read_list),
                    numpy.repeat(fs_casenessType, len_read_list),
                    numpy.repeat(fs_representation, len_read_list),
                    numpy.array(read_list[:][0]),
                    numpy.array(read_list[:][1])
                   )
        )
        # Join the processed contents of this file to the contents from the other files in the folder.
        data.append(temp_list)      
        
flat_metadata = \
    pandas.DataFrame([item for sublist in data for item in sublist],
                     columns = ['file_source',
                                'Source',
                                'Order',
                                'Caseness_type',
                                'Representation',
                                'Feature_set',
                                'Normalised_mutual_information']
                    )
flat_metadata['Normalised_mutual_information'] = pandas.to_numeric(flat_metadata['Normalised_mutual_information'])
flat_metadata.sort_values('Normalised_mutual_information', ascending = False, inplace = True)

OSError: Corrupt snappy compressed data.

In [68]:
len(flat_metadata)

520

### Check that the required fs_* dataframes exist.

In [65]:
dfs = [
    fs_clinician
    ,fs_literature
]
df_fs = functools.reduce(lambda left, right: pandas.merge(left, right, on = 'person_id'), dfs)

### For each feature set in 'flat_metadata', extract and append evaluation statistics.

In [66]:
ls_output = []
for i_fs in tqdm(range(len(flat_metadata)), unit = 'feature set'):
    # Choose the caseness variable of interest.
    caseness_type = flat_metadata['Caseness_type'][i_fs]
    if caseness_type == 'multinomial':
        vec_caseness = caseness_array['CMHD']
    elif caseness_type == 'definite':
        vec_caseness = caseness_array['CMHD_dx_and_rx']
    elif caseness_type == 'possible':
        vec_caseness = caseness_array['CMHD_rx_not_dx']
    elif caseness_type == 'control':
        vec_caseness = caseness_array['CMHD_control']

    
    # Choose the feature-set components for the feature set of interest.
    fs_components_names = flat_metadata['Feature_set'][i_fs].split("-")
    #print(fs_components_names)
    fs_components = df_fs[fs_components_names]
    
    # Choose the representation for the feature set of interest.
    representation = flat_metadata['Representation'][i_fs]
    if representation == 'ALL':
        vec_featureSet = fs_components.all(True)
    elif representation == "MULTI":
        vec_featureSet, dump = mutlinomRepresentation(fs_components)
    
    # Check if feature set is a constant.
    if len(vec_featureSet.unique()) == 1:
        continue
    
    # Pass the caseness and the feature-set vectors to evaloutputs().
    ls_output.append(evaloutputs(vec_featureSet,
                                 vec_caseness)
                )
# Flatten the appended list of evaluation statistics.
flat_output = \
    pandas.DataFrame(ls_output,
                     columns = ['prevalence per thousand',
                                'cba',
                                'odds ratio',
                                'ppv',
                                'npv',
                                'tn',
                                'fn',
                                'fp',
                                'tp']
                    )

# Append the evaluation statistics to the metadata about the feature set.
evaluation_dataframe = pandas.concat([flat_metadata, flat_output], axis=1, join='inner')
#display(evaluation_dataframe)

# Rename column names for saving.
evaluation_dataframe.rename = \
                    ['Order',
                     'Caseness_type',
                     'Representation',
                     'Feature_set',
                     'Mutual_information',
                     'FeatureSet_prevalence_per_thousand',
                     'Class_balance_acccuracy',
                     'Odds_ratio',
                     'Positive_predictive_value',
                     'Negative_predictive_value',
                     'True_positive_count',
                     'True_negativecount',
                     'False_positive_count',
                     'False_negative_count']

# Save evaluation outputs.
savelocation = "Evaluation/"
evaluation_dataframe.to_csv(savelocation + datetime.datetime.strftime(datetime.datetime.now(), '%Y_%m_%d_%H:%M:%S') + "_Evaluation statistics.csv", index = False)
evaluation_dataframe.astype(str).to_parquet(savelocation + datetime.datetime.strftime(datetime.datetime.now(), '%Y_%m_%d_%H:%M:%S') + "_Evaluation statistics.parquet", index = False)
print("\nEvaluation statistics saved.")

  0%|          | 0/520 [00:00<?, ?feature set/s]


KeyError: "None of [Index(['ALL OF   familialSubstanceMisuse   AND   incarcerationImprisonment'], dtype='object')] are in the [columns]"

In [79]:
print('Evaluation statistics for ''Possible caseness''')
evaluation_dataframe.iloc[0:20,4:]

Evaluation statistics for Possible caseness


Unnamed: 0,Representation,Feature_set,Normalised_mutual_information,prevalence per thousand,cba,odds ratio,ppv,npv,tn,fn,fp,tp
7389,MULTI,depressionNotDysthymiaOrChronic-anxietyOrPanic-relevantPrescriptions,0.007805,< 0.01,0.46,13.57,0.53,0.92,679508,57225,7,8
13143,MULTI,depressionNotDysthymiaOrChronic-relevantPrescriptions-newAntidepressentThreeMonths,0.007799,1.91,0.57,13.3,0.29,0.97,578421,17467,95842,38485
13132,MULTI,depressionNotDysthymiaOrChronic-AccessToHealthcare-relevantPrescriptions,0.007747,1.89,0.57,13.25,0.29,0.97,579817,17551,96879,38854
13142,MULTI,depressionNotDysthymiaOrChronic-relevantPrescriptions-categoryAnnualCountUniqueAntidepressants,0.007737,2.01,0.57,12.92,0.27,0.97,572307,16295,86767,31921
13122,MULTI,depressionNotDysthymiaOrChronic-recurrentEDattednances-relevantPrescriptions,0.007735,1.95,0.57,13.53,0.29,0.97,575754,17034,95987,38411
13144,MULTI,depressionNotDysthymiaOrChronic-relevantPrescriptions-MentalHealthTreatments,0.007732,1.9,0.57,13.21,0.29,0.97,579277,17537,96864,38740
13145,MULTI,depressionNotDysthymiaOrChronic-relevantPrescriptions-RecurringMentalSymptoms,0.007716,2.07,0.57,13.26,0.27,0.97,568337,16158,86226,32499
7065,MULTI,depressionNotDysthymiaOrChronic-paranoia-relevantPrescriptions,0.007714,0.51,0.54,4.73,0.26,0.93,651185,47628,22921,7930
7008,MULTI,depressionNotDysthymiaOrChronic-manyDNA-relevantPrescriptions,0.007713,< 0.01,0.46,7.92,0.4,0.92,679512,57231,3,2
13148,MULTI,depressionNotDysthymiaOrChronic-relevantPrescriptions-sleepDisturbance,0.007711,1.91,0.57,13.37,0.29,0.97,578843,17394,97146,39039
