# Functions documentation

### *sorter(files)

* Convert all values in the 'prod_ai' column to strings, then separate the non-nan values for class mapping.

* The 'prod_ai' (product active ingredient) column is used for this and the downstream functions because, unlike a brand name, an active ingredient/generic name may share a suffix with other medications. Matching on shared suffixes keeps the mapping functions computationally efficient.

* Append each sorted dataframe, wrapped in a list, to its global list so that each mapping function iteration runs separately and data integrity is maintained.

In [None]:
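import pandas as pd

class_dfs = []
missing_dfs = []
new_files = []
positives = []
inds = []

def sorter(files):

    for drug_file in files:

        # Cast to string (missing values become 'nan') and strip periods.
        drug_file['prod_ai'] = drug_file['prod_ai'].astype(str)
        drug_file['prod_ai'] = drug_file['prod_ai'].str.replace('.', '', regex=False)

        indices = drug_file[drug_file.prod_ai != 'nan'].index
        nan_indices = drug_file[drug_file.prod_ai == 'nan'].index

        present = drug_file.prod_ai.loc[indices]
        absent = drug_file.prod_ai.loc[nan_indices]

        # Assumption: class_df and missing_df are not defined in the source,
        # so a fresh dataframe is created for each file. 'class_id' starts
        # empty and is assumed to be filled by the mapping functions.
        class_df = pd.DataFrame({'drugname': present})
        class_df['class_id'] = float('nan')

        # Entries without an active ingredient keep their brand name.
        missing_df = pd.DataFrame({'drugname': drug_file.drugname.loc[nan_indices],
                                   'generic': absent})

        class_dfs.append([class_df])
        missing_dfs.append([missing_df])
        new_files.append([drug_file])
        positives.append([present])
        inds.append([indices])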
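
As an illustration of the split that `sorter` performs, the minimal sketch below separates rows with and without an active ingredient using `notna()` rather than comparing against the string `'nan'`. The helper name `split_by_prod_ai` and the sample rows are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_by_prod_ai(drug_file):
    """Hypothetical helper: split one drug file on missing 'prod_ai' values."""
    has_ai = drug_file['prod_ai'].notna()
    # Rows with an active ingredient go on to class mapping.
    names = drug_file.loc[has_ai, 'prod_ai'].astype(str)
    class_df = pd.DataFrame({'drugname': names.str.replace('.', '', regex=False)})
    # Rows without one keep their brand name so they can be reviewed later.
    missing_df = pd.DataFrame({
        'drugname': drug_file.loc[~has_ai, 'drugname'],
        'generic': drug_file.loc[~has_ai, 'prod_ai'],
    })
    return class_df, missing_df

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    'drugname': ['ZOCOR', 'UNKNOWN BRAND', 'VASOTEC'],
    'prod_ai': ['SIMVASTATIN.', np.nan, 'ENALAPRIL'],
})
class_df, missing_df = split_by_prod_ai(sample)
```

Both results keep the original row index, so they can still be lined up with the source file later.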

### *map_1(class_df,p,i)

* First round of mapping logic.

* At the completion of the iteration, separate the mapped drug names and indices from those where no class was mapped, then send the unmapped entries into the next mapping function.
     * Sending each original dataframe through the full mapping logic is extremely computationally expensive. Instead, the original goes through only a small portion of the logic, and the entries that returned nan are separated out. That smaller dataframe is sent through the next mapping function, which has the same .loc separator steps and in turn sends an even smaller dataframe through the third round of logic.
     * This cascade-style mapping proves to be very efficient, especially when handling 1.5+ million observations per dataframe.

* Create a local variable for the mapped entries, and send it to the next function to merge with the next round of mapped entries.

In [None]:
def map_1(class_df,p,i):

    for x,y in zip(p,i):

        ...  # (mapping logic)

    class_df.class_id = class_df.class_id.astype(str)
    # Entries that received a class in this round.
    lead_df = class_df[class_df.class_id != 'nan']
    # Entries still unmapped; only these go on to the next round.
    df_2 = class_df[class_df.class_id == 'nan']

    idx = df_2.index
    drugs = df_2.drugname

    return map_2(df_2,drugs,idx,lead_df)
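
The cascade described above can be sketched as a loop over mapping stages, where each stage only sees the rows that earlier stages left unmapped. Everything in the sketch is an assumption made for illustration: the function `cascade_map`, the two suffix-based stages, and the class labels stand in for the elided mapping logic and are not the notebook's actual rules.

```python
import numpy as np
import pandas as pd

def suffix_stage(suffix_to_class):
    """Build a hypothetical stage that maps a drug name by its suffix."""
    def stage(name):
        for suffix, class_id in suffix_to_class.items():
            if name.lower().endswith(suffix):
                return class_id
        return np.nan
    return stage

def cascade_map(class_df, stages):
    """Run each stage only on rows still unmapped after the previous stages."""
    remaining = class_df.copy()
    mapped_parts = []
    for stage in stages:
        remaining['class_id'] = remaining['drugname'].map(stage)
        hit = remaining['class_id'].notna()
        mapped_parts.append(remaining[hit])
        # Only the misses are carried into the next, smaller round.
        remaining = remaining[~hit].copy()
    mapped = pd.concat(mapped_parts) if mapped_parts else remaining.iloc[0:0]
    return mapped, remaining

# Made-up stages and drug names, for illustration only.
stages = [
    suffix_stage({'statin': 'statin'}),
    suffix_stage({'pril': 'ace_inhibitor'}),
]
drugs = pd.DataFrame({'drugname': ['simvastatin', 'enalapril', 'aspirin']})
mapped, unmapped = cascade_map(drugs, stages)
```

Here `mapped` plays the role of the concatenated mapped entries and `unmapped` the role of the entries that matched no stage.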

### *map_2(class_df,drugs,idx,lead_df)

* Second round of mapping logic.
* See map_1 for explanation...
* Create a local variable holding the concatenated dataframes, and send it to the next function to merge with the next round of mapped entries.

In [None]:
def map_2(class_df,drugs,idx,lead_df):

    for x,y in zip(drugs,idx):

        ...  # (mapping logic)

    class_df.class_id = class_df.class_id.astype(str)

    # Entries mapped in this round, and entries still unmapped.
    df_2 = class_df[class_df.class_id != 'nan']
    df_3 = class_df[class_df.class_id == 'nan']
    # Combine this round's mapped entries with those from map_1.
    final_df = pd.concat([lead_df, df_2])

    idx = df_3.index
    drugs = df_3.drugname

    return map_3(df_3,drugs,idx, final_df)

### *map_3(class_df,drugs,idx, final_df)

* Third round of mapping logic.
* See map_1 for explanation...
* Create a local variable holding the concatenated dataframes and a dataframe of all entries that did not match any of the mapping logic, then append each to its respective global list to examine once the functions have completed.

In [None]:
final_dfs = []
miss_dfs = []

def map_3(class_df,drugs,idx, final_df):

    for x,y in zip(drugs,idx):

        ...  # (mapping logic)

    class_df.class_id = class_df.class_id.astype(str)
    # Entries that no round of mapping logic matched.
    miss_df = class_df[class_df.class_id == 'nan']
    class_df = class_df[class_df.class_id != 'nan']
    # Combine this round's mapped entries with those from map_1 and map_2.
    final_df = pd.concat([final_df, class_df])

    final_dfs.append(final_df)
    miss_dfs.append(miss_df)

### *file_merge(saved_dfs, files)

* Read in the mapped and saved dataframes as well as the original files, select which columns from the original files you wish to merge with the mapped dataframes, and then concatenate them to create a custom table to analyze.
* Append the new dataframes to a global list variable.

In [None]:
additions = []
custom_dfs = []

def file_merge(saved_dfs, files):

    for f,z in zip(saved_dfs, files):
        # Select the chosen original columns for the rows that were mapped.
        indices = f.index
        kept = z[['primaryid', 'caseid', 'dose_form', 'dechal', 'rechal']].loc[indices]
        additions.append(kept)

    for f,a in zip(saved_dfs, additions):
        # Place the mapped and original columns side by side, aligned on the index.
        new = pd.concat([f,a], axis=1)
        custom_dfs.append(new)
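
Assuming the mapped dataframes keep the row index of the original files, the selection and concatenation above amount to an index-aligned join. A minimal sketch on made-up data follows; the helper name `attach_columns` and the sample values are assumptions for illustration.

```python
import pandas as pd

def attach_columns(mapped_df, original_df, columns):
    """Hypothetical helper: add selected original columns to a mapped dataframe."""
    # .loc with the mapped index picks the matching original rows;
    # join then aligns the two frames on that shared index.
    return mapped_df.join(original_df.loc[mapped_df.index, columns])

# Made-up sample data, for illustration only.
original = pd.DataFrame({
    'primaryid': [101, 102, 103],
    'caseid': [11, 12, 13],
    'dose_form': ['TABLET', 'CAPSULE', 'TABLET'],
})
mapped = pd.DataFrame(
    {'drugname': ['simvastatin', 'enalapril'],
     'class_id': ['statin', 'ace_inhibitor']},
    index=[0, 2],
)
custom = attach_columns(mapped, original, ['primaryid', 'caseid', 'dose_form'])
```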

### *reacs_map(reacs)
* Identify unique primaryids from the reaction file(s) to create a new DataFrame, then append all corresponding 'Preferred Term' reaction codes to a list. Finally, iterate through that list and join the codes together as a single row in the new DataFrame.

In [None]:
import time

pt_list = []

def reacs_map(reacs):
    start_time = time.time()

    ids = reacs.primaryid.unique()
    reacs_df = pd.DataFrame(ids, columns=['primaryid'])
    reacs_df['pt'] = 'nan'

    # Reset the global list so repeated calls do not mix results.
    pt_list.clear()
    for x in ids:
        df = reacs[reacs.primaryid==x]
        pt_list.append(df.pt.values)
    for i,a in enumerate(pt_list):
        # .loc[row, column] writes to reacs_df itself rather than to a copy.
        reacs_df.loc[i, 'pt'] = ' , '.join(a)

    end_time = time.time()
    # Elapsed time in hours.
    print((end_time - start_time) / 60 / 60)
    print('Completed. Check your defined variable for output')
    return reacs_df
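
The same one-row-per-primaryid result can also be produced with a `groupby` aggregation, which avoids filtering the full reaction file once per id. This is an alternative sketch on made-up rows, not the notebook's implementation; the function name `reacs_map_grouped` and the sample terms are assumptions.

```python
import pandas as pd

def reacs_map_grouped(reacs):
    """Hypothetical alternative: join all 'pt' values per primaryid."""
    return (
        reacs.groupby('primaryid', sort=False)['pt']
        .agg(' , '.join)   # same separator as the loop-based version
        .reset_index()
    )

# Made-up sample rows, for illustration only.
reacs = pd.DataFrame({
    'primaryid': [101, 101, 102],
    'pt': ['Nausea', 'Headache', 'Rash'],
})
reacs_df = reacs_map_grouped(reacs)
```

Passing `sort=False` keeps the ids in order of first appearance, matching what `unique()` returns in the loop-based version.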

### *outs_map(outs) 
* Identify unique primaryids from outcome file(s) to create new DataFrame, then append all corresponding outcome codes to a list. Finally, iterate through the list and join all codes in a single row in new DataFrame. 

In [None]:
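import time

outs_code_list = []

def outs_map(outs):
    start_time = time.time()

    ids = outs.primaryid.unique()
    outs_df = pd.DataFrame(ids, columns=['primaryid'])
    # Named 'outc_cod' (not 'pt') because file_merge reads of.outc_cod.
    outs_df['outc_cod'] = 'nan'

    # Reset the global list so repeated calls do not mix results.
    outs_code_list.clear()
    for x in ids:
        df = outs[outs.primaryid==x]
        outs_code_list.append(df.outc_cod.values)
    for i,a in enumerate(outs_code_list):
        # .loc[row, column] writes to outs_df itself rather than to a copy.
        outs_df.loc[i, 'outc_cod'] = ' , '.join(a)

    end_time = time.time()

    # Elapsed time in hours.
    print((end_time - start_time) / 60 / 60)
    print('Completed. Check your defined variable for output')
    return outs_df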

### *file_merge(saved_dfs, drug_files, df1, df2)
* First, create a variable representing the indices from the previously mapped DataFrame and use it to locate the primaryid and caseid of each mapped drug within the original drug file. Append this DataFrame copy to a list.
* Next, join the saved_df with the newly appended DataFrame copy and create two new columns for the 'Preferred Term' reactions and the outcome codes, which will be filled with the values obtained from 'reacs_map' and 'outs_map'.
* Finally, enumerate the 'reacs' and 'outs' DataFrames and merge their values with the 'saved_df' by matching on primaryid. Append this final DataFrame to a list for further manipulation.

In [None]:
additions = []
custom_dfs = []

def file_merge(saved_dfs, drug_files, df1, df2):
    for sf,df in zip(saved_dfs, drug_files):
        # Original row labels of the mapped drugs within the drug file.
        indices = sf.orig_idx
        kept = df[['primaryid','caseid']].loc[indices]
        additions.append(kept)

    for f,a,rf,of in zip(saved_dfs, additions, df1, df2):
        new = f.join(a.set_index(f.index))
        new['pt'] = 'nan'
        new['outc_cod'] = 'nan'
        pt_col = new.columns.get_loc('pt')
        outc_col = new.columns.get_loc('outc_cod')

        # enumerate yields positions, so read and write with .iloc.
        for i,x in enumerate(rf.primaryid):
            for j,y in enumerate(new.primaryid):
                if x == y:
                    new.iloc[j, pt_col] = rf.pt.iloc[i]
        for i,x in enumerate(of.primaryid):
            for j,y in enumerate(new.primaryid):
                if x == y:
                    new.iloc[j, outc_col] = of.outc_cod.iloc[i]

        # Append inside the loop so every merged DataFrame is kept, not only the last.
        custom_dfs.append(new)
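
Matching on primaryid can also be expressed as a left merge, which does the lookup in one pass instead of comparing every pair of rows. The following is a minimal sketch, assuming the reaction and outcome DataFrames each hold one row per primaryid with `pt` and `outc_cod` columns respectively; the helper name `add_reactions_and_outcomes` and the sample data are hypothetical.

```python
import pandas as pd

def add_reactions_and_outcomes(mapped_df, reacs_df, outs_df):
    """Hypothetical helper: attach 'pt' and 'outc_cod' by primaryid."""
    # how='left' keeps every mapped row, even without a match.
    out = mapped_df.merge(reacs_df[['primaryid', 'pt']], on='primaryid', how='left')
    out = out.merge(outs_df[['primaryid', 'outc_cod']], on='primaryid', how='left')
    # Keep the notebook's convention of the string 'nan' for missing values.
    return out.fillna({'pt': 'nan', 'outc_cod': 'nan'})

# Made-up sample data, for illustration only.
mapped = pd.DataFrame({
    'primaryid': [101, 102, 103],
    'drugname': ['simvastatin', 'enalapril', 'aspirin'],
})
reacs_df = pd.DataFrame({'primaryid': [101, 102],
                         'pt': ['Nausea , Headache', 'Rash']})
outs_df = pd.DataFrame({'primaryid': [101], 'outc_cod': ['HO , OT']})
custom = add_reactions_and_outcomes(mapped, reacs_df, outs_df)
```

Note that `merge` returns a new DataFrame with a fresh 0-based index, so any original row labels would need to be carried in a column (such as `orig_idx`) if they are needed afterwards.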