# Add feature sets that are combinations of the families of feature sets.

The purpose of this notebook is to append feature sets that are defined by combinations of feature-set families.

This notebook is expected to be called by its parent `UNSEEN_create_feature_sets.ipynb`. It will not run unless the requisite variables have been loaded by the parent notebook.

All possible combinations of feature-set families are considered. The feature-set families are:
- 'antecedent'                   : (coded as 'A')
- 'concurrent'                   : (coded as 'C')
- 'serviceUse'                   : (coded as 'S')
- 'treatment'                    : (coded as 'T')
- 'inconsistency'                : (coded as 'K')
- 'patternsOfPrescription'       : (coded as 'P')
- 'relevantPrescriptions'        : (coded as 'R')
- 'antipsychoticsPrescription'   : (coded as 'Y')

Each feature-set family has multiple versions indicating the level of presence:
- 'none'     : (coded as '0') none of the family's component feature sets are present in the patient's record.
- 'notNone'  : (coded as '1') at least one of the family's component feature sets is present in the patient's record.
- 'few'      : (coded as '2') the lowest level of presence of the family's component feature sets is in the patient's record.
- 'some'     : (coded as '3') a moderate level of presence of the family's component feature sets is in the patient's record.
- 'many'     : (coded as '4') the highest level of presence of the family's component feature sets is in the patient's record.
- 'wildcard' : (coded as 'x') the level of presence is ignored for this family, in the particular feature-set combination.

As an example, a feature set entitled `A0_C1_Sx_Tx_Kx_Px_Rx_Yx` represents the patients that have no antecedent feature sets (A0), notNone concurrent feature sets (C1), and any presence, zero or otherwise, of the other families' feature sets.
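The naming scheme can be unpacked mechanically. The following is a minimal sketch of a decoder, assuming every token in a name is a one-letter family code followed by a one-character level code; `decode_feature_set_name`, `FAMILY_CODES` and `LEVEL_CODES` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Hypothetical helper for illustration only; not part of the original notebook.
FAMILY_CODES = {
    'A': 'antecedent',
    'C': 'concurrent',
    'S': 'serviceUse',
    'T': 'treatment',
    'K': 'inconsistency',
    'P': 'patternsOfPrescription',
    'R': 'relevantPrescriptions',
    'Y': 'antipsychoticsPrescription',
}
LEVEL_CODES = {
    '0': 'none',
    '1': 'notNone',
    '2': 'few',
    '3': 'some',
    '4': 'many',
    'x': 'wildcard',
}


def decode_feature_set_name(name):
    """Map a name such as 'A0_C1_Sx_...' to a {family: level} dictionary."""
    decoded = {}
    for token in name.split('_'):
        # Assumption: the first character is the family code, the rest is the level code.
        family_code, level_code = token[0], token[1:]
        decoded[FAMILY_CODES[family_code]] = LEVEL_CODES[level_code]
    return decoded


example = decode_feature_set_name('A0_C1_Sx_Tx_Kx_Px_Rx_Yx')
print(example['antecedent'], example['concurrent'], example['serviceUse'])
```

An unknown family or level code raises a `KeyError`, which makes a malformed name fail loudly rather than silently.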

## Refresh store.

In [1]:
# Get helper functions.
%run 'UNSEEN_helper_functions.ipynb'
# Refresh stored variables, if they are present.
%store -r

## Create all combinations of family flavours.

### FOR loop implementation.

#### Define the list of combinations.

In [76]:
family_combins = []
A_cols = list(nafsm_family_membership.filter(regex = 'antecedent').columns) + ['wildcard']
C_cols = list(nafsm_family_membership.filter(regex = 'concurrent').columns) + ['wildcard']
S_cols = list(nafsm_family_membership.filter(regex = 'serviceUse').columns) + ['wildcard']
T_cols = list(nafsm_family_membership.filter(regex = 'treatment').columns) + ['wildcard']
K_cols = list(nafsm_family_membership.filter(regex = 'inconcsistency').columns) + ['wildcard']
P_cols = list(nafsm_family_membership.filter(regex = 'patternsOfPrescription').columns) + ['wildcard']
R_cols = list(nafsm_family_membership.filter(regex = 'relevantPrescriptions').columns) + ['wildcard']
Y_cols = list(nafsm_family_membership.filter(regex = 'antipsychotics').columns) + ['wildcard']
for i_antecedent in A_cols:
    for i_concurrent in C_cols:
        for i_serviceUse in S_cols:
            for i_treatment in T_cols:
                for i_inconcsistency in K_cols:
                    for i_patternsOfPrescription in P_cols:
                        for i_relevantPrescriptions in R_cols:
                            for i_antipsychotics in Y_cols:
                                family_combins.append([i_antecedent, i_concurrent, i_serviceUse, i_treatment, i_inconcsistency,
                                                       i_patternsOfPrescription, i_relevantPrescriptions, i_antipsychotics])
%store family_combins

Stored 'family_combins' (list)
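The eight nested loops enumerate a Cartesian product, so `itertools.product` from the standard library may be a more compact way to build the same list. The sketch below uses three toy column lists rather than the real `nafsm_family_membership` columns; the column names are hypothetical and assume a `<level>_<family>` pattern.

```python
import itertools

# Toy stand-ins for the per-family column lists (hypothetical names and sizes).
toy_A_cols = ['none_antecedent', 'notNone_antecedent', 'wildcard']
toy_C_cols = ['none_concurrent', 'notNone_concurrent', 'wildcard']
toy_S_cols = ['none_serviceUse', 'wildcard']

# Nested-loop version, mirroring the structure of the cell above.
nested = []
for a in toy_A_cols:
    for c in toy_C_cols:
        for s in toy_S_cols:
            nested.append([a, c, s])

# itertools.product yields the same combinations, with the rightmost list varying fastest.
via_product = [list(combo) for combo in itertools.product(toy_A_cols, toy_C_cols, toy_S_cols)]

print(len(via_product))
print(nested == via_product)
```

The number of combinations is the product of the list lengths, so it grows quickly: if each of the eight families contributed five level columns plus the wildcard, there would be 6**8 = 1,679,616 combinations.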


#### Define some look up dictionaries.

In [5]:
# Instantiate a feature-set storage dictionary.
new_fs_dict = {'person_id' : nafsm_family_membership.person_id}

# Set lookup of family names.
family_lookup_list = ['antecedent', 'concurrent', 'serviceUse', 'treatment', 'inconcsistency', 'patternsOfPrescription', 'relevantPrescriptions', 'antipsychoticsPrescription']

# Set mapping dictionary for count level to count-level code.
level_options = \
        {'none' : '0',
         'notNone'  : '1',
         'few'  : '2',
         'some' : '3',
         'many' : '4',
         'wildcard' : 'x'
        }

# Set mapping dictionary for family name to family code.
family_options = \
            {'antecedent' : 'A',
             'concurrent'  : 'C',
             'serviceUse'  : 'S',
             'treatment' : 'T',
             'inconcsistency' : 'K',
             'patternsOfPrescription' : 'P',
             'relevantPrescriptions' : 'R',
             'antipsychoticsPrescription' : 'Y'
            }

#### Define some storage.

In [3]:
# Storage of the evaluation of the feature sets.
# ## Unlike all other feature sets, these combinations will be evaluated in the same loop that they are created. I 
# ## made this decision because the GoogleCloudPlatform python kernel kept dying trying to save so many feature sets
# ## that turned out to be non-informative anyway.
ls_output = []

# Storage for list of names of non-informative feature sets.
discarded_because_only_one_value = []

#### Do the work.

In [5]:
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins, unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    new_fs_name_list = []
    for family_idx, i_family in enumerate(i_family_combins):
        # Extract the level and family. Column names are expected to look like
        # '<level>_<family>'. A bare 'wildcard' has no underscore, so the unpacking
        # raises a ValueError and the family is looked up by position instead.
        try:
            level, family = i_family.split('_')
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[family_idx]

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present, so that only real column names remain.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True only where every selected family column is True.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative, i.e. the feature set takes the same value for every patient.
    n_distinct_values = len(new_fs_value.value_counts())
    if n_distinct_values < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value.append(new_fs_name)
    else:
        # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
%store ls_output
print(f'{len(ls_output)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')


# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 11546.293073654175 to process.
Stored 'ls_output7' (list)
37242 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

Stored 'discarded_because_only_one_value7' (list)
12758 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
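To make the combine-then-filter step concrete, here is a small sketch on toy data. It is written in plain Python as a stand-in for the pandas calls used in the loop (a row-wise `.all(...)` followed by a count of distinct values); the table contents and the helper names `combine` and `is_informative` are hypothetical.

```python
# Toy membership table (hypothetical values). Keys follow the assumed
# '<level>_<family>' naming; each list holds one boolean per patient.
toy_membership = {
    'none_antecedent':    [True,  True,  False, True],
    'notNone_concurrent': [True,  False, True,  True],
    'none_serviceUse':    [False, False, False, False],
}


def combine(selection):
    """Row-wise AND across the selected columns (stand-in for a row-wise `.all`)."""
    columns = [toy_membership[name] for name in selection]
    return [all(row) for row in zip(*columns)]


def is_informative(values):
    """A feature set is informative only if it takes at least two distinct values."""
    return len(set(values)) >= 2


kept = combine(['none_antecedent', 'notNone_concurrent'])
dropped = combine(['none_antecedent', 'none_serviceUse'])

print(kept, is_informative(kept))
print(dropped, is_informative(dropped))
```

In this toy case the first combination varies across patients and would be evaluated, while the second is `False` for everyone and would go on the discard list.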


#### Write lists to file, just in case there are more problems with the store.

In [None]:
# Extend the lists to be saved.
# ## I had some service issues that meant I had to chunk my processing. If you have to do the same, the script below combines your chunks into a single variable
# ## that will then be saved.
#[ls_output.extend(l) for l in (ls_output2, ls_output3, ls_output4, ls_output5, ls_output6, ls_output7, ls_output8, ls_output9, ls_output10, ls_output11)]
#[discarded_because_only_one_value.extend(l) for l in (discarded_because_only_one_value2, discarded_because_only_one_value3, discarded_because_only_one_value4,
#                                                      discarded_because_only_one_value5, discarded_because_only_one_value6, discarded_because_only_one_value7,
#                                                      discarded_because_only_one_value8, discarded_because_only_one_value9, discarded_because_only_one_value10, 
#                                                     discarded_because_only_one_value11)]

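# Write lists to file.
import csv

with open('ls_output_family_combins.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(ls_output)
# Bind the stored name to the evaluated list so that `%store` has a variable to save.
ls_output_family_combins = ls_output
%store ls_output_family_combins
    
with open('discarded_because_only_one_value.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Each discarded name is a plain string, so wrap it in a list to write one name per row.
    writer.writerows([name] for name in discarded_because_only_one_value)
%store discarded_because_only_one_value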
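One detail worth checking when writing a list of names is that `csv.writer.writerows` treats every row as an iterable of fields. The sketch below illustrates the difference between passing bare strings and passing one-element lists, using an in-memory buffer and shortened, hypothetical feature-set names.

```python
import csv
import io

names = ['A0_C1_Sx', 'A1_Cx_S0']  # hypothetical, shortened feature-set names

# Passing bare strings: each string is iterated character by character.
buffer_split = io.StringIO()
csv.writer(buffer_split).writerows(names)

# Wrapping each name in a one-element list gives one name per row.
buffer_rows = io.StringIO()
csv.writer(buffer_rows).writerows([name] for name in names)

first_split = buffer_split.getvalue().splitlines()[0]
first_row = buffer_rows.getvalue().splitlines()[0]
print(first_split)  # A,0,_,C,1,_,S,x
print(first_row)    # A0_C1_Sx
```

Rows that are already lists, such as the entries of `ls_output`, can be passed to `writerows` directly.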

### Multiprocessing implementation.

I had some issues with the multiprocessing implementation, and I also found that it did not speed things up very much for my particular problem. I leave my scripts below for your interest.

#### Define functions.

#### Do work.