# Add feature sets that are combinations of the families of feature sets.

The purpose of this notebook is to append feature sets that are defined by combinations of feature-set families.

This notebook is expected to be called by its parent `UNSEEN_create_feature_sets.ipynb`. It will not run unless the variables it requires have already been loaded by the parent notebook.

All possible combinations of feature-set families are considered. The feature-set families are:
- 'antecedent'                   : (coded as 'A')
- 'concurrent'                   : (coded as 'C')
- 'serviceUse'                   : (coded as 'S')
- 'treatment'                    : (coded as 'T')
- 'inconsistency'                : (coded as 'K')
- 'patternsOfPrescription'       : (coded as 'P')
- 'relevantPrescriptions'        : (coded as 'R')
- 'antipsychoticsPrescription'   : (coded as 'Y')

Each feature-set family has multiple versions indicating the level of presence:
- 'none'     : (coded as '0') none of the family's component feature sets are present in the patient's record.
- 'notNone'  : (coded as '1') at least one of the family's component feature sets is present in the patient's record.
- 'few'      : (coded as '2') the lowest level of presence of the family's component feature sets is in the patient's record.
- 'some'     : (coded as '3') a moderate level of presence of the family's component feature sets is in the patient's record.
- 'many'     : (coded as '4') the highest level of presence of the family's component feature sets is in the patient's record.
- 'wildcard' : (coded as 'x') the level of presence is ignored for this family, in the particular feature-set combination.

As an example, a feature set entitled `A0_C1_Sx_Tx_Kx_Px_Rx_Yx` represents the patients who have no antecedent feature sets (A0), notNone concurrent feature sets (C1), and any level of presence, zero or otherwise, of the other families' feature sets.
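
To make the naming scheme concrete, the following minimal sketch decodes a feature-set name back into one level per family. The helper `decode_feature_set_name` and the two reverse-lookup dictionaries are hypothetical and are not part of this notebook; they only restate the codes listed above.

```python
# Hypothetical helper (not part of the notebook): decode a feature-set name
# such as 'A0_C1_Sx_Tx_Kx_Px_Rx_Yx' into a {family: level} dictionary.
FAMILY_CODES = {
    'A': 'antecedent',
    'C': 'concurrent',
    'S': 'serviceUse',
    'T': 'treatment',
    'K': 'inconsistency',
    'P': 'patternsOfPrescription',
    'R': 'relevantPrescriptions',
    'Y': 'antipsychoticsPrescription',
}
LEVEL_CODES = {
    '0': 'none',
    '1': 'notNone',
    '2': 'few',
    '3': 'some',
    '4': 'many',
    'x': 'wildcard',
}


def decode_feature_set_name(name):
    """Split the name on '_' and translate each two-character token."""
    decoded = {}
    for token in name.split('_'):
        family_code, level_code = token[0], token[1:]
        decoded[FAMILY_CODES[family_code]] = LEVEL_CODES[level_code]
    return decoded


example = decode_feature_set_name('A0_C1_Sx_Tx_Kx_Px_Rx_Yx')
print(example['antecedent'], example['concurrent'], example['serviceUse'])
```

An unknown code raises a `KeyError` here, which is a deliberate choice for the sketch: a malformed name fails loudly rather than being decoded partially.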

## Refresh store.

In [None]:
# Get helper functions.
%run 'UNSEEN_helper_functions.ipynb'
# Refresh stored variables, if they are present.
%store -r

## Create all combinations of family flavours.

### FOR loop implementation.

#### Define the list of combinations.

In [2]:
family_combins = []
A_cols = list(nafsm_family_membership.filter(regex = 'antecedent').columns) + ['wildcard']
C_cols = list(nafsm_family_membership.filter(regex = 'concurrent').columns) + ['wildcard']
S_cols = list(nafsm_family_membership.filter(regex = 'serviceUse').columns) + ['wildcard']
T_cols = list(nafsm_family_membership.filter(regex = 'treatment').columns) + ['wildcard']
K_cols = list(nafsm_family_membership.filter(regex = 'inconsistency').columns) + ['wildcard']
P_cols = list(nafsm_family_membership.filter(regex = 'patternsOfPrescription').columns) + ['wildcard']
R_cols = list(nafsm_family_membership.filter(regex = 'relevantPrescriptions').columns) + ['wildcard']
Y_cols = list(nafsm_family_membership.filter(regex = 'antipsychotics').columns) + ['wildcard']
for i_antecedent in A_cols:
    for i_concurrent in C_cols:
        for i_serviceUse in S_cols:
            for i_treatment in T_cols:
                for i_inconsistency in K_cols:
                    for i_patternsOfPrescription in P_cols:
                        for i_relevantPrescriptions in R_cols:
                            for i_antipsychotics in Y_cols:
                                family_combins.append([i_antecedent, i_concurrent, i_serviceUse, i_treatment, i_inconsistency,
                                                       i_patternsOfPrescription, i_relevantPrescriptions, i_antipsychotics])
%store family_combins

Stored 'family_combins' (list)
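
As a rough illustration of what the eight nested loops produce, the sketch below builds the same kind of Cartesian product with `itertools.product`, which iterates in the same order as nested `for` loops (the last family varies fastest). It assumes, hypothetically, that every family has exactly five columns named `<level>_<family>`; the real column names come from `nafsm_family_membership` and may differ.

```python
import itertools
import math

# Assumed names: the real columns are read from `nafsm_family_membership`.
families = ['antecedent', 'concurrent', 'serviceUse', 'treatment',
            'inconsistency', 'patternsOfPrescription',
            'relevantPrescriptions', 'antipsychoticsPrescription']
levels = ['none', 'notNone', 'few', 'some', 'many']

# One list of options per family: its five level columns plus 'wildcard'.
options_per_family = [
    [f'{level}_{family}' for level in levels] + ['wildcard']
    for family in families
]

# Count the combinations without materialising them all in memory.
n_combins = math.prod(len(options) for options in options_per_family)

# Peek at the first few combinations only.
combos = itertools.product(*options_per_family)
first_three = [list(combo) for combo in itertools.islice(combos, 3)]

print(n_combins)
print(first_three[1][-1])
```

Under these assumptions there are 6 options for each of 8 families, so 6^8 = 1,679,616 combinations, which is why the work that follows is split into chunks of 50,000.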


#### Define some look up dictionaries.

In [2]:
# Instantiate a feature-set storage dictionary.
new_fs_dict = {'person_id' : nafsm_family_membership.person_id}

# Set lookup of family names.
family_lookup_list = ['antecedent', 'concurrent', 'serviceUse', 'treatment', 'inconsistency', 'patternsOfPrescription', 'relevantPrescriptions', 'antipsychoticsPrescription']

# Set mapping dictionary for count level to count-level code.
level_options = \
        {'none' : '0',
         'notNone'  : '1',
         'few'  : '2',
         'some' : '3',
         'many' : '4',
         'wildcard' : 'x'
        }

# Set mapping dictionary for family name to family code.
family_options = \
            {'antecedent' : 'A',
             'concurrent'  : 'C',
             'serviceUse'  : 'S',
             'treatment' : 'T',
             'inconsistency' : 'K',
             'patternsOfPrescription' : 'P',
             'relevantPrescriptions' : 'R',
             'antipsychoticsPrescription' : 'Y'
            }
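
The lookups above can be combined into a single naming function. The sketch below is one possible way to do that, assuming column names of the form `<level>_<family>` (inferred from the later `split('_')` call); the function name `build_feature_set_name` is hypothetical.

```python
# Same mappings as defined in the cell above, repeated so the sketch runs alone.
level_options = {'none': '0', 'notNone': '1', 'few': '2',
                 'some': '3', 'many': '4', 'wildcard': 'x'}
family_options = {'antecedent': 'A', 'concurrent': 'C', 'serviceUse': 'S',
                  'treatment': 'T', 'inconsistency': 'K',
                  'patternsOfPrescription': 'P', 'relevantPrescriptions': 'R',
                  'antipsychoticsPrescription': 'Y'}
family_lookup_list = list(family_options)  # Families in positional order.


def build_feature_set_name(combination):
    """Hypothetical helper: turn one combination into a name like 'A0_C1_Sx_...'."""
    parts = []
    for position, item in enumerate(combination):
        if item == 'wildcard':
            # A wildcard carries no family name, so use its position instead.
            level, family = 'wildcard', family_lookup_list[position]
        else:
            level, family = item.split('_')
        parts.append(family_options[family] + level_options[level])
    return '_'.join(parts)


name = build_feature_set_name(
    ['none_antecedent', 'notNone_concurrent'] + ['wildcard'] * 6)
print(name)
```

Indexing the dictionaries directly, rather than using `.get` with a message as the default, means an unexpected level or family raises a `KeyError` instead of silently embedding an error message in the feature-set name.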

#### Define some storage.

In [4]:
# Storage of the evaluation of the feature sets.
# ## Unlike all other feature sets, these combinations will be evaluated in the same loop that they are created. I 
# ## made this decision because the GoogleCloudPlatform python kernel kept dying trying to save so many feature sets
# ## that turned out to be non-informative anyway.
ls_output1 = []

# Storage for list of names of non-informative feature sets.
discarded_because_only_one_value1 = []

#### Do the work.

In [5]:
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[:50000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value1.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output1.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output1)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value1) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output1 discarded_because_only_one_value1
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 8724.343080282211 to process.
37826 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

12174 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output1' (list)
Stored 'discarded_because_only_one_value1' (list)
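
The cell above is repeated below for each block of 50,000 combinations. As a sketch of the same pattern written once, the example below wraps the per-chunk work in a function and loops over the chunks. It is a toy under stated assumptions: plain Python lists stand in for the pandas table `nafsm_family_membership`, `toy_evaluate` is a stand-in for `evaloutputs` (whose outputs are not shown in this notebook), and all function names are hypothetical.

```python
import itertools

# Toy membership table: one list of booleans (one entry per patient) per column.
membership = {
    'none_antecedent':    [True, False, True],
    'notNone_antecedent': [False, True, False],
    'none_concurrent':    [True, True, True],
}
n_patients = 3


def make_feature_set(combination):
    """Row-wise AND across the non-wildcard columns of one combination."""
    selected = [col for col in combination if col != 'wildcard']
    values = [all(membership[col][row] for col in selected)
              for row in range(n_patients)]
    return '+'.join(combination), values


def toy_evaluate(values):
    """Stand-in for `evaloutputs`: just count the patients in the feature set."""
    return (sum(values),)


def process_chunk(combinations):
    """Return (evaluated, discarded) for one chunk of combinations."""
    evaluated, discarded = [], []
    for combination in combinations:
        name, values = make_feature_set(combination)
        if len(set(values)) < 2:
            # Only one distinct value for all patients: non-informative.
            discarded.append(name)
        else:
            evaluated.append([name] + list(toy_evaluate(values)))
    return evaluated, discarded


combos = [list(c) for c in itertools.product(
    ['none_antecedent', 'notNone_antecedent', 'wildcard'],
    ['none_concurrent', 'wildcard'])]

chunk_size = 4
all_evaluated, all_discarded = [], []
for start in range(0, len(combos), chunk_size):
    evaluated, discarded = process_chunk(combos[start:start + chunk_size])
    all_evaluated.extend(evaluated)
    all_discarded.extend(discarded)

print(len(all_evaluated), len(all_discarded))
```

Deriving the slice bounds from `range(0, len(combos), chunk_size)` avoids typing them by hand for each chunk. In the real notebook, each chunk's results could still be persisted with `%store` before moving on to the next, which keeps the motivation for chunking (a kernel that dies under load) intact.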


In [6]:
ls_output2 = []
discarded_because_only_one_value2 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[50000:100000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value2.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output2.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output2)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value2) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output2 discarded_because_only_one_value2
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 8487.588908433914 to process.
38259 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

11741 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output2' (list)
Stored 'discarded_because_only_one_value2' (list)


In [7]:
ls_output3 = []
discarded_because_only_one_value3 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[100000:150000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value3.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output3.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output3)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value3) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output3 discarded_because_only_one_value3
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 6806.741814851761 to process.
9945 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

40055 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output3' (list)
Stored 'discarded_because_only_one_value3' (list)


In [8]:
ls_output4 = []
discarded_because_only_one_value4 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[150000:200000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value4.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output4.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output4)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value4) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output4 discarded_because_only_one_value4
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 6922.701315879822 to process.
15505 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

34495 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output4' (list)
Stored 'discarded_because_only_one_value4' (list)


In [9]:
ls_output5 = []
discarded_because_only_one_value5 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[200000:250000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value5.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output5.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output5)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value5) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output5 discarded_because_only_one_value5
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 8858.332808971405 to process.
39388 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

10612 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output5' (list)
Stored 'discarded_because_only_one_value5' (list)


In [10]:
ls_output6 = []
discarded_because_only_one_value6 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[250000:300000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value6.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output6.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output6)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value6) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output6 discarded_because_only_one_value6
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 8618.000146865845 to process.
38994 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

11006 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output6' (list)
Stored 'discarded_because_only_one_value6' (list)


In [3]:
ls_output7 = []
discarded_because_only_one_value7 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[300000:350000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value7.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output7.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output7)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value7) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output7 discarded_because_only_one_value7
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 9072.722830533981 to process.
40424 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

9576 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output7' (list)
Stored 'discarded_because_only_one_value7' (list)


In [4]:
ls_output8 = []
discarded_because_only_one_value8 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[350000:400000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
            wildcard_idx += 1 # This increment says 'There is no wildcard at this index location.'
        except ValueError:
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
            wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard entries, if present.
    selection = [i for i in selection if i != 'wildcard']

    # Make new feature set: True where every selected family-level column is True for the patient.
    new_fs_value = nafsm_family_membership[selection].all(axis = 1)

    # Skip if non-informative.  
    a = len(new_fs_value.value_counts())
    if a < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value8.append(new_fs_name)
    else:
    # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output8.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output8)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value8) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output8 discarded_because_only_one_value8
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 8987.33467054367 to process.
40331 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

9669 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output8' (list)
Stored 'discarded_because_only_one_value8' (list)


In [5]:
ls_output9 = []
discarded_because_only_one_value9 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[450000:500000], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
        except ValueError:
            # A wildcard entry has no '_' separator, so the unpacking fails.
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
        # Advance the position counter so that any later wildcard is matched to the correct family.
        wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard families, if present, so that only real column names are used for the selection.
    selection = [i for i in selection if i != 'wildcard']
    
    # Make new feature set: a patient is a member only if every selected column is True for them.
    new_fs_value = nafsm_family_membership[selection].all(axis=1)

    # Skip if non-informative, i.e. if the feature set takes only one value across all patients.
    if new_fs_value.nunique() < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value9.append(new_fs_name)
    else:
        # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output9.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output9)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value9) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output9 discarded_because_only_one_value9
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/50000 [00:00<?, ? feature sets/s]

It took 7889.966617822647 to process.
38211 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

11789 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output9' (list)
Stored 'discarded_because_only_one_value9' (list)
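
The naming step in the loop above can be read in isolation. The sketch below is illustrative only and rests on assumptions: the lookups are reconstructed from the coding scheme at the top of this notebook rather than copied from the parent notebook, the family order in `FAMILY_LOOKUP_LIST` is assumed to match `family_lookup_list`, and each non-wildcard entry is assumed to have the form `level_family` (for example `none_antecedent`), as implied by the `split('_')` call.

```python
# Hypothetical lookups, reconstructed from the coding scheme described in this notebook.
FAMILY_LOOKUP_LIST = ['antecedent', 'concurrent', 'serviceUse', 'treatment',
                      'inconsistency', 'patternsOfPrescription',
                      'relevantPrescriptions', 'antipsychoticsPrescription']
FAMILY_OPTIONS = dict(zip(FAMILY_LOOKUP_LIST, 'ACSTKPRY'))
LEVEL_OPTIONS = {'none': '0', 'notNone': '1', 'few': '2',
                 'some': '3', 'many': '4', 'wildcard': 'x'}


def build_feature_set_name(family_combination):
    """Return a name such as 'A0_C1_Sx_Tx_Kx_Px_Rx_Yx' for one combination."""
    name_parts = []
    for position, entry in enumerate(family_combination):
        if entry == 'wildcard':
            # A wildcard carries no family name, so take it from its position.
            level, family = 'wildcard', FAMILY_LOOKUP_LIST[position]
        else:
            level, family = entry.split('_')
        name_parts.append(FAMILY_OPTIONS[family] + LEVEL_OPTIONS[level])
    return '_'.join(name_parts)


example = ['none_antecedent', 'notNone_concurrent'] + ['wildcard'] * 6
print(build_feature_set_name(example))
```

Unlike the `.get()` calls with a default message used in the cells, this version raises a `KeyError` on an unrecognised level or family, so a malformed entry cannot end up inside a feature-set name.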


In [6]:
ls_output10 = []
discarded_because_only_one_value10 = []
# Do work.
t1 = time.time()
# Create the names and values for the feature sets.
for i_family_combins in tqdm.notebook.tqdm_notebook(family_combins[500000:], unit = " feature sets"):
    selection = i_family_combins.copy()

    # Get name of new feature set.
    wildcard_idx = 0
    new_fs_name_list = []
    for i_family in i_family_combins:
        # Extract the level and family.
        try:
            level, family = i_family.split('_')
        except ValueError:
            # A wildcard entry has no '_' separator, so the unpacking fails.
            level = 'wildcard'
            family = family_lookup_list[wildcard_idx]
        # Advance the position counter so that any later wildcard is matched to the correct family.
        wildcard_idx += 1

        # Select `level_code` that will be used to represent the variable name.
        level_code = level_options.get(level, 'The provided value for \'level\' is not valid.')

        # Select family_code that will be used to represent the variable name.
        family_code = family_options.get(family, 'The provided value for \'family\' is not valid.')

        # Build variable name string.
        new_fs_name_list.append(family_code + level_code)

    new_fs_name = ('_'.join(new_fs_name_list))

    # Remove wildcard families, if present, so that only real column names are used for the selection.
    selection = [i for i in selection if i != 'wildcard']
    
    # Make new feature set: a patient is a member only if every selected column is True for them.
    new_fs_value = nafsm_family_membership[selection].all(axis=1)

    # Skip if non-informative, i.e. if the feature set takes only one value across all patients.
    if new_fs_value.nunique() < 2:
        # Keep a list of the non-informative feature sets.
        discarded_because_only_one_value10.append(new_fs_name)
    else:
        # Else, evaluate the feature set.
        fs_eval = evaloutputs(new_fs_value, caseness_array.caseness_1isYes.astype(int))
        ls_output10.append([new_fs_name] + list(fs_eval)[1:])

print(f'It took {time.time() - t1} to process.')

# Store the list of output evaluations because I need to append to this in the evaluation of other feature sets.
#%store ls_output
print(f'{len(ls_output10)} feature sets contained sufficient information to be evaluated.')
print('These feature sets\' evaluation statistics can be viewed in the `ls_output` variable.\n')

# Store the discard list because I need to append to this in the evaluation of other feature sets.
#%store discarded_because_only_one_value
print(f'{len(set(discarded_because_only_one_value10) ) } feature sets were discarded because they only presented one value for all patients.')
print('These discards can be viewed in the `discarded_because_only_one_value` variable.')

%store ls_output10 discarded_because_only_one_value10
# Join to `feature_set_array`.
# ## No join is necessary because the feature sets have been processed and evaluated here, rather than saved for later evaluation.

  0%|          | 0/59872 [00:00<?, ? feature sets/s]

It took 9414.497598409653 to process.
49618 feature sets contained sufficient information to be evaluated.
These feature sets' evaluation statistics can be viewed in the `ls_output` variable.

10254 feature sets were discarded because they only presented one value for all patients.
These discards can be viewed in the `discarded_because_only_one_value` variable.
Stored 'ls_output10' (list)
Stored 'discarded_because_only_one_value10' (list)
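
The two decisions made for each combination, namely the row-wise AND across the selected columns and the check for a single repeated value, can be illustrated without the patient data. This is a plain-Python stand-in for the pandas operations used in the cells, not the notebook's implementation: the column names and membership values are invented for illustration, and the helper functions are hypothetical.

```python
def combine_memberships(membership, selection):
    """Row-wise logical AND across the selected columns.

    Assumes at least one column is selected and all columns have equal length.
    """
    columns = [membership[name] for name in selection]
    return [all(row) for row in zip(*columns)]


def is_informative(values):
    """A feature set is informative only if it takes at least two distinct values."""
    return len(set(values)) >= 2


# Invented membership columns for four patients.
membership = {
    'none_antecedent':    [True, True, False, True],
    'notNone_concurrent': [True, False, False, True],
    'few_serviceUse':     [False, False, False, False],
}

kept = combine_memberships(membership, ['none_antecedent', 'notNone_concurrent'])
dropped = combine_memberships(membership, ['none_antecedent', 'few_serviceUse'])

print(kept, is_informative(kept))
print(dropped, is_informative(dropped))
```

The second combination is constant across patients, which is the situation that sends a feature-set name to the discard list instead of to the evaluation.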


#### Write lists to file, just in case there are more problems with the store.

In [7]:
# Combine the chunked lists into single variables to be saved.
# ## I had some service issues that meant I had to chunk my processing. If you have to do the same, the script below
# ## combines your chunks into a single variable that will then be saved.
ls_output = []
discarded_because_only_one_value = []
for chunk in (ls_output1, ls_output2, ls_output3, ls_output4, ls_output5,
              ls_output6, ls_output7, ls_output8, ls_output9, ls_output10):
    ls_output.extend(chunk)
for chunk in (discarded_because_only_one_value1, discarded_because_only_one_value2, discarded_because_only_one_value3,
              discarded_because_only_one_value4, discarded_because_only_one_value5, discarded_because_only_one_value6,
              discarded_because_only_one_value7, discarded_because_only_one_value8, discarded_because_only_one_value9,
              discarded_because_only_one_value10):
    discarded_because_only_one_value.extend(chunk)

ls_output_family_combins = ls_output.copy()
# Convert to a set to remove any duplicate names.
discarded_because_only_one_value = set(discarded_because_only_one_value)


# Write lists to file.
with open('ls_output_family_combins.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(ls_output_family_combins)

%store ls_output_family_combins
    
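with open('discarded_because_only_one_value.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Wrap each name in a list so that it is written as a single field, rather than one character per field.
    writer.writerows([fs_name] for fs_name in sorted(discarded_because_only_one_value))
%store discarded_because_only_one_value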

Stored 'ls_output_family_combins' (list)
Stored 'discarded_because_only_one_value' (set)
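
One detail of writing a collection of names with the `csv` module is worth showing: `writer.writerows` treats each row as an iterable of fields, so a bare string is split into one field per character. The sketch below merges chunked name lists and writes one name per row. The shortened names are invented for illustration, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io

# Invented, shortened feature-set names split across two chunks, with one duplicate.
chunked_discards = [['A0_C1_Sx', 'A0_C2_Sx'], ['A0_C2_Sx', 'A1_Cx_S0']]

# Merge the chunks into a set, which removes the duplicate.
combined = set()
for chunk in chunked_discards:
    combined.update(chunk)

# Write one name per row; each name is wrapped in a list so it stays a single field.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows([name] for name in sorted(combined))
csv_text = buffer.getvalue()

# For comparison: passing a bare string as a row splits it into characters.
split_buffer = io.StringIO()
csv.writer(split_buffer).writerow('A0_C1_Sx')
split_text = split_buffer.getvalue()

print(csv_text)
print(split_text)
```

Sorting before writing is a design choice rather than a requirement; it only makes the file order reproducible, because a set has no defined order.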


### Multiprocessing implementation.

I had some issues with the multiprocessing implementation, and I found that it did not speed things up very much for my particular problem. I leave my scripts below for your interest.

#### Define functions.

#### Do work.