## Applying the Model

### 3 New Chemistries

With a random forest model in place, the next step is to measure how well it performs on 3 additional terminal group systems that were not included in the training or testing of any of the models.

These systems were then run under conditions similar to those of the previous monolayer systems, and the coefficient of friction and adhesion force were measured in the same manner.
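
The `COF` and `intercept` values used throughout this notebook are consistent with a linear fit of friction force against normal load, with the slope taken as the coefficient of friction and the intercept as the adhesion term $F_0$. The sketch below assumes that relationship; it is not taken from the analysis code. `fit_friction` is a hypothetical helper (not part of `atools` or `rf`), and the friction forces are made-up numbers at the 5, 15, and 25 nN loads that appear later in the job documents.

```python
def fit_friction(normal_loads, friction_forces):
    """Least-squares line F_f = COF * F_N + F0; returns (COF, F0)."""
    n = len(normal_loads)
    mean_x = sum(normal_loads) / n
    mean_y = sum(friction_forces) / n
    sxx = sum((x - mean_x) ** 2 for x in normal_loads)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(normal_loads, friction_forces))
    cof = sxy / sxx               # slope of the fitted line
    f0 = mean_y - cof * mean_x    # intercept of the fitted line
    return cof, f0


# Illustrative numbers only (not simulation output), in nN.
loads = [5.0, 15.0, 25.0]
friction = [2.75, 4.25, 5.75]

cof, f0 = fit_friction(loads, friction)
print(f"COF = {cof:.3f}, F0 = {f0:.3f} nN")
```

Under this assumption, a system with strong adhesion shows up as a large intercept even if its slope is small, which is why the two quantities are tracked separately below.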

### New Terminal Groups

1. Toluene
1. Phenol
1. Difluoromethyl

#### SMILES
| Terminal Group | SMILES (H-terminated) | SMILES (CH3-terminated) |
|------|------|------|
| Toluene | Cc1ccccc1 | Cc1ccc(C)cc1 |
| Phenol | c1ccc(cc1)O | Cc1ccc(cc1)O |
| Difluoromethyl | FCF | FC(F)C |

### Installing the proper packages

For this notebook, the random forest model was edited so that it can be imported as a module.

To access that version, check out the `accuracy_test` branch of the [random forest terminal group project](https://github.com/PTC-CMC/random_forest_tg.git).

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import signac
import atools_ml
import atools
import copy
import rf
from pprint import pprint

pd.set_option('display.float_format', '{:1.4g}'.format)

In [None]:
print("numpy version:\t\t{}".format(np.version.full_version))
print("pandas version:\t\t{}".format(pd.__version__))
print("matplotlib version:\t{}".format(matplotlib.__version__))
print("seaborn version:\t{}".format(sns.__version__))
print("signac version:\t\t{}".format(signac.__version__))

In [None]:
# the data repositories should be located one level above this Jupyter notebook

# these names must match the ones used by the loops further down
old_screening_proj = signac.get_project(root="../terminal_group_screening/")
old_mixed_screening_proj = signac.get_project(root="../terminal_groups_mixed/")
uniq_smiles = set()

# grab all the smiles strings and filter out the repeats
for job in old_screening_proj:
    uniq_smiles.add((job.sp['terminal_group'], job.doc['h-smiles'], job.doc['ch3-smiles']))

In [None]:
pprint(uniq_smiles)

In [None]:
# list the smiles strings
new_smiles_dict = dict()
first_smiles_dict = dict()

for group in uniq_smiles:
    first_smiles_dict[group[0]] = [group[1], group[2]]

new_smiles_dict['toluene'] = ["Cc1ccccc1", "Cc1ccc(C)cc1"]

new_smiles_dict['phenol'] = ["c1ccc(cc1)O", "Cc1ccc(cc1)O"]

new_smiles_dict['difluoromethyl'] = ["FCF", "FC(F)C"]

pprint(first_smiles_dict)
pprint(new_smiles_dict)
print()
# merge the two dicts
new_smiles_dict.update(first_smiles_dict)
pprint(new_smiles_dict)

In [None]:
# invert the relational dictionary so smiles strings can be linked to their string name

from collections import defaultdict
my_inverted_dict = defaultdict(list)
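# index 0 is the H-terminated SMILES, index 1 the CH3-terminated SMILES;
# plain loops replace the set comprehensions that were used only for their side effects
for index in (0, 1):
    for name, smiles_pair in new_smiles_dict.items():
        my_inverted_dict[smiles_pair[index]].append(name)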
pprint(my_inverted_dict)

# test that this prints out 'acetyl'
print(my_inverted_dict['C(=O)C'][0])
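
Because the inverted dictionary stores a list of names per SMILES string and later cells always take element `[0]`, two groups that share a SMILES string would be resolved silently to whichever name was added first. A minimal sketch of a check for that case follows, using only the three new groups from the table above; `invert_smiles` is a hypothetical helper introduced here for illustration.

```python
from collections import defaultdict

# The three new groups: [H-terminated SMILES, CH3-terminated SMILES].
smiles_by_group = {
    "toluene": ["Cc1ccccc1", "Cc1ccc(C)cc1"],
    "phenol": ["c1ccc(cc1)O", "Cc1ccc(cc1)O"],
    "difluoromethyl": ["FCF", "FC(F)C"],
}


def invert_smiles(mapping):
    """Map every SMILES string (either termination) to the names that use it."""
    inverted = defaultdict(list)
    for name, smiles_pair in mapping.items():
        for smiles in smiles_pair:
            if name not in inverted[smiles]:
                inverted[smiles].append(name)
    return dict(inverted)


inverted = invert_smiles(smiles_by_group)

# Any SMILES string claimed by more than one name is ambiguous for a [0] lookup.
ambiguous = {smi: names for smi, names in inverted.items() if len(names) > 1}

print(inverted["Cc1ccccc1"][0])
print("ambiguous SMILES:", ambiguous)
```

Running the same check on the merged dictionary of all groups would show whether any `[0]` lookup depends on insertion order.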

In [None]:
# load in the signac project for the 3 new chemistries with identical top and bottom monolayers
new_terminal_project = signac.get_project(root="../terminal_group_screening_accuracy_test")

# load in signac project for the 3 new chemistries with non-identical top and bottom monolayers
test_new_old_project = signac.get_project(root="../terminal_group_mixed_original_16_new_3/")

pprint(first_smiles_dict)

In [None]:
# dataframe of the 3 new chemistries with identical monolayers

df_new = pd.DataFrame()
data_new = []
for job in new_terminal_project:
    if "COF" in job.document and "intercept" in job.document:
        data_new.append(
                        {'Terminal Group': job.sp['terminal_groups'][0],
                         'COF': job.doc['COF'],
                         'intercept': job.doc['intercept'],
                         'SMILES-H': new_smiles_dict[job.sp['terminal_groups'][0]][0],}
                        )

df_new = pd.DataFrame(data_new)
df_new['COF'].describe()

In [None]:
# dataframe of the 3 new chemistries with non-identical top and bottom monolayers

df_new_old = pd.DataFrame()
data_new_old = []
for job in test_new_old_project:
    if "COF" in job.document and "intercept" in job.document:
        data_new_old.append({
                             'Terminal Group 1': job.sp['terminal_groups'][0],
                             'Terminal Group 2': job.sp['terminal_groups'][1],
                             'COF': job.doc['COF'],
                             'intercept': job.doc['intercept'],
                             'SMILES-1-H': new_smiles_dict[job.sp['terminal_groups'][0]][0],
                             'SMILES-2-H': new_smiles_dict[job.sp['terminal_groups'][1]][0],
                             }
                             )        
df_new_old = pd.DataFrame(data_new_old)
df_new_old

In [None]:
# return the tribological data from the original studies (16 termini)

data_old_mixed= []
for job in old_mixed_screening_proj:
    if "COF" in job.document and "intercept" in job.document and 'S2_bottom' in job.document and 'S2_top' in job.document:
        data_old_mixed.append({'Terminal Group 1': job.sp['terminal_groups'][0],
                         'Terminal Group 2': job.sp['terminal_groups'][1],
                         'COF': job.doc['COF'],
                         'intercept': job.doc['intercept'],
                         'S2_bottom_5nN': job.doc['S2_bottom']['5nN'],
                         'S2_bottom_15nN': job.doc['S2_bottom']['15nN'],
                         'S2_bottom_25nN': job.doc['S2_bottom']['25nN'],
                         'S2_top_5nN': job.doc['S2_top']['5nN'],
                         'S2_top_15nN': job.doc['S2_top']['15nN'],
                         'S2_top_25nN': job.doc['S2_top']['25nN'],
                         'SMILES-1-H': new_smiles_dict[job.sp['terminal_groups'][0]][0],
                         'SMILES-2-H': new_smiles_dict[job.sp['terminal_groups'][1]][0]})
df_old_mixed = pd.DataFrame(data_old_mixed)
df_old_mixed

In [None]:
# run this to generate a csv version of this dataframe

# df_new.to_csv("./new_3_self.csv")
# df_new.sort_values(by="Terminal Group")

In [None]:
# run this to generate a csv version of this dataframe

# df_new_old.to_csv("./mixed_original_new3.csv")
# df_new_old.sort_values(by="Terminal Group 1")

In [None]:
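# rename first: the dicts use 'intercept', 'SMILES-1-H' and 'SMILES-2-H', so selecting the new names directly yields NaN columns
renamed = pd.DataFrame(data_new_old).rename(columns={"intercept": "F0", "SMILES-1-H": "SMILES_H_1", "SMILES-2-H": "SMILES_H_2"})
df_old_mixed = renamed.reindex(columns=["Terminal Group 1", "Terminal Group 2", "COF", "F0", "SMILES_H_1", "SMILES_H_2"])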

In [None]:
## run this to generate a summary of the prediction of the 5 models for toluene

# #             SMILES2='C[C]=O',

# summary_stat_cof = []
# summary_stat_intercept = []
# for seed in [43, 0, 1, 2, 3]:
    
#     a = rf.predict(SMILES1=new_smiles_dict['toluene'][0],
#                SMILES2=new_smiles_dict['toluene'][0],
#                path_to_data=".",
#                random_seed=seed,
#                vary_descriptors=False,
#                vary_significant=False,
#                barcode_seed=123,
#                feature_cluster_json_location="../random_forest_tg/")
#     summary_stat_cof.append(a['COF'])
#     summary_stat_intercept.append(a['intercept'])
# print(summary_stat_cof, summary_stat_intercept)

In [None]:
# print summary information from the previous cell

# summary_data = []
# # print('COF')
# # print(np.mean(summary_stat_cof, dtype=np.float64))
# # print(np.std(summary_stat_cof, dtype=np.float64))
# # print(np.mean(summary_stat_cof), " +- ",np.std(summary_stat_cof))


# # print("Intercept")
# # print(np.mean(summary_stat_intercept, dtype=np.float64))
# # print(np.std(summary_stat_intercept, dtype=np.float64))
# # print(np.mean(summary_stat_intercept), " +- ",np.std(summary_stat_intercept))

# cof_pred_mean = np.mean(summary_stat_cof, dtype=np.float64)
# cof_pred_std = np.std(summary_stat_cof, dtype=np.float64)

# f_o_pred_mean = np.mean(summary_stat_intercept, dtype=np.float64)
# f_o_pred_std = np.std(summary_stat_intercept, dtype=np.float64)

# calc_mean = df_new[(df_new['Terminal Group'] == "toluene") & (df_new['Terminal Group'] == "toluene") ].describe().loc[['mean']]
# calc_std = df_new[(df_new['Terminal Group'] == "toluene") & (df_new['Terminal Group'] == "toluene") ].describe().loc[['std']]


# #calc_mean = df_new_old[(df_new_old['Terminal Group 1'] == "pyrrole") & (df_new_old['Terminal Group 2'] == "toluene") ].describe().loc[['mean']]
# #calc_std = df_new_old[(df_new_old['Terminal Group 1'] == "pyrrole") & (df_new_old['Terminal Group 2'] == "toluene") ].describe().loc[['std']]

# summary_data = np.asarray([[cof_pred_mean, calc_mean['COF'], f_o_pred_mean, calc_mean['intercept']],[cof_pred_std, calc_std.COF, f_o_pred_std, calc_std.intercept]], dtype=np.float64)
# print(summary_data)
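
One detail to keep in mind when reading the commented-out summary above: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas `describe()` and `Series.std()` report the sample standard deviation (`ddof=1`). With only five seeds the two conventions differ noticeably. The sketch below shows the gap using the standard-library `statistics` module; the COF values are made up for illustration.

```python
import statistics

# Made-up COF predictions for five random seeds (illustrative only).
cof_by_seed = [0.140, 0.150, 0.145, 0.155, 0.160]

mean = statistics.fmean(cof_by_seed)
pop_std = statistics.pstdev(cof_by_seed)    # same convention as np.std's default
sample_std = statistics.stdev(cof_by_seed)  # same convention as pandas .std()

print(f"mean = {mean:.4f}")
print(f"population std (ddof=0) = {pop_std:.5f}")
print(f"sample std (ddof=1)     = {sample_std:.5f}")
```

For five values the sample estimate is larger by a factor of sqrt(5/4), so predicted and calculated spreads are only comparable if both use the same convention.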

Summary tables for tribological properties based on chain length

In [None]:
# this data is now in csv files for easier consumption

# import numpy as np
# def gen_summary_tribo(project):
#     df_index = pd.DataFrame(project.index())
#     df_index = df_index.set_index(['_id'])
#     statepoints = {doc['_id']: doc['statepoint'] for doc in project.index()}
#     df = pd.DataFrame(statepoints).T.join(df_index)
#     chainlengths = df.chainlength.unique()
#     chainlengths.sort()
#     terminal_groups = df.terminal_group.unique()
#     terminal_groups.sort()
#     n_groups = len(terminal_groups)
    
#     # COF
#     summary_cof_list = []
#     for i, terminal_group in enumerate(terminal_groups.tolist()):
#         cof = [df[(df.chainlength==chainlength) &
#                   (df.terminal_group==terminal_group)].COF.mean()
#                for chainlength in chainlengths]
#         cof_err = [df[(df.chainlength==chainlength) &
#                       (df.terminal_group==terminal_group)].COF.std()
#                    for chainlength in chainlengths]

#         dict_1 = {
#             'System': str(terminal_group + ' - ' + terminal_group),
#             'COF: {} carbons'.format(chainlengths[0]): cof[0],
#             'COF: {} carbons'.format(chainlengths[1]): cof[1],
#             'COF: {} carbons'.format(chainlengths[2]): cof[2],
#             'COF: {} carbons'.format(chainlengths[3]): cof[3],
#             'COF: {} carbons'.format(chainlengths[4]): cof[4],
#             'COF Err: {} carbons'.format(chainlengths[0]): cof_err[0],
#             'COF Err: {} carbons'.format(chainlengths[1]): cof_err[1],
#             'COF Err: {} carbons'.format(chainlengths[2]): cof_err[2],
#             'COF Err: {} carbons'.format(chainlengths[3]): cof_err[3],
#             'COF Err: {} carbons'.format(chainlengths[4]): cof_err[4],
            
#         }
#         summary_cof_list.append(dict_1)
#     summary_cof_df = pd.DataFrame(summary_cof_list)
#     cols = list(summary_cof_df.columns)
#     cols = ['System', 'COF: 5 carbons', 'COF: 8 carbons',
#             'COF: 11 carbons', 'COF: 14 carbons', 'COF: 17 carbons',
#             'COF Err: 5 carbons', 'COF Err: 8 carbons',
#             'COF Err: 11 carbons', 'COF Err: 14 carbons', 'COF Err: 17 carbons',
#            ]
#     summary_cof_df = summary_cof_df[cols]
    
#     #print(summary_cof_df)
#     summary_cof_df.to_csv("same_terminal_groups_cof_summary.csv")
#     summary_cof_df.to_html("same_terminal_groups_cof_summary.html")


#     # adhesion
#     summary_adhesion_list = []
#     for i, terminal_group in enumerate(terminal_groups):
#         intercept = [df[(df.chainlength==chainlength) &
#                         (df.terminal_group==terminal_group)].intercept.mean()
#                      for chainlength in chainlengths]
#         intercept_err = [df[(df.chainlength==chainlength) &
#                             (df.terminal_group==terminal_group)].intercept.std()
#                          for chainlength in chainlengths]
#         dict_1 = {
#             'System': str(terminal_group + ' - ' + terminal_group),
#             'F0: {} carbons'.format(chainlengths[0]): intercept[0],
#             'F0: {} carbons'.format(chainlengths[1]): intercept[1],
#             'F0: {} carbons'.format(chainlengths[2]): intercept[2],
#             'F0: {} carbons'.format(chainlengths[3]): intercept[3],
#             'F0: {} carbons'.format(chainlengths[4]): intercept[4],
#             'F0 Err: {} carbons'.format(chainlengths[0]): intercept_err[0],
#             'F0 Err: {} carbons'.format(chainlengths[1]): intercept_err[1],
#             'F0 Err: {} carbons'.format(chainlengths[2]): intercept_err[2],
#             'F0 Err: {} carbons'.format(chainlengths[3]): intercept_err[3],
#             'F0 Err: {} carbons'.format(chainlengths[4]): intercept_err[4],
#         }
#         summary_adhesion_list.append(dict_1)
#     summary_adhesion_df = pd.DataFrame(summary_adhesion_list)
#     cols = list(summary_adhesion_df.columns)
#     cols = ['System', 'F0: 5 carbons', 'F0: 8 carbons',
#             'F0: 11 carbons', 'F0: 14 carbons', 'F0: 17 carbons',
#             'F0 Err: 5 carbons', 'F0 Err: 8 carbons',
#             'F0 Err: 11 carbons', 'F0 Err: 14 carbons', 'F0 Err: 17 carbons',]
#     summary_adhesion_df = summary_adhesion_df[cols]
    
#     #print(summary_adhesion_df)
#     summary_adhesion_df.to_csv("same_terminal_groups_adhesion_summary.csv")
#     summary_adhesion_df.to_html("same_terminal_groups_adhesion_summary.html")
    

#     # nematic Order (S2) 5nN
#     summary_nematic_5nN_list = []
#     for i, terminal_group in enumerate(terminal_groups):
#         nematic = []
#         nematic_err = []

#         nematic_list = [df[(df.chainlength==chainlength) &
#                         (df.terminal_group==terminal_group)].S2
#                      for chainlength in chainlengths]
#         for chain_length_grouping in nematic_list:
#             nematic.append(np.mean([i['5nN'] for i in chain_length_grouping]))
#             nematic_err.append(np.std([i['5nN'] for i in chain_length_grouping]))
#         dict_1 = {
#             'System': str(terminal_group + ' - ' + terminal_group),
#             'S2: {} carbons'.format(chainlengths[0]): nematic[0],
#             'S2: {} carbons'.format(chainlengths[1]): nematic[1],
#             'S2: {} carbons'.format(chainlengths[2]): nematic[2],
#             'S2: {} carbons'.format(chainlengths[3]): nematic[3],
#             'S2: {} carbons'.format(chainlengths[4]): nematic[4],
#             'S2 Err: {} carbons'.format(chainlengths[0]): nematic_err[0],
#             'S2 Err: {} carbons'.format(chainlengths[1]): nematic_err[1],
#             'S2 Err: {} carbons'.format(chainlengths[2]): nematic_err[2],
#             'S2 Err: {} carbons'.format(chainlengths[3]): nematic_err[3],
#             'S2 Err: {} carbons'.format(chainlengths[4]): nematic_err[4],
#         }
#         summary_nematic_5nN_list.append(dict_1)
#     summary_nematic_5nN_df = pd.DataFrame(summary_nematic_5nN_list)
#     cols = list(summary_nematic_5nN_df.columns)
#     cols = ['System', 'S2: 5 carbons', 'S2: 8 carbons',
#             'S2: 11 carbons', 'S2: 14 carbons', 'S2: 17 carbons',
#             'S2 Err: 5 carbons', 'S2 Err: 8 carbons',
#             'S2 Err: 11 carbons', 'S2 Err: 14 carbons', 'S2 Err: 17 carbons']
#     summary_nematic_5nN_df = summary_nematic_5nN_df[cols]
    
#     #print(summary_adhesion_df)
#     summary_nematic_5nN_df.to_csv("same_terminal_groups_5nN_nematic_summary.csv")
#     summary_nematic_5nN_df.to_html("same_terminal_groups_5nN_nematic_summary.html")
#     print(summary_nematic_5nN_df)

#     # nematic Order (S2) 15nN
#     summary_nematic_15nN_list = []
#     for i, terminal_group in enumerate(terminal_groups):
#         nematic = []
#         nematic_err = []

#         nematic_list = [df[(df.chainlength==chainlength) &
#                         (df.terminal_group==terminal_group)].S2
#                      for chainlength in chainlengths]
#         for chain_length_grouping in nematic_list:
#             nematic.append(np.mean([i['15nN'] for i in chain_length_grouping]))
#             nematic_err.append(np.std([i['15nN'] for i in chain_length_grouping]))
#         dict_1 = {
#             'System': str(terminal_group + ' - ' + terminal_group),
#             'S2: {} carbons'.format(chainlengths[0]): nematic[0],
#             'S2: {} carbons'.format(chainlengths[1]): nematic[1],
#             'S2: {} carbons'.format(chainlengths[2]): nematic[2],
#             'S2: {} carbons'.format(chainlengths[3]): nematic[3],
#             'S2: {} carbons'.format(chainlengths[4]): nematic[4],
#             'S2 Err: {} carbons'.format(chainlengths[0]): nematic_err[0],
#             'S2 Err: {} carbons'.format(chainlengths[1]): nematic_err[1],
#             'S2 Err: {} carbons'.format(chainlengths[2]): nematic_err[2],
#             'S2 Err: {} carbons'.format(chainlengths[3]): nematic_err[3],
#             'S2 Err: {} carbons'.format(chainlengths[4]): nematic_err[4],
#         }
#         summary_nematic_15nN_list.append(dict_1)
#     summary_nematic_15nN_df = pd.DataFrame(summary_nematic_15nN_list)
#     cols = list(summary_nematic_15nN_df.columns)
#     cols = ['System', 'S2: 5 carbons', 'S2: 8 carbons',
#             'S2: 11 carbons', 'S2: 14 carbons', 'S2: 17 carbons',
#             'S2 Err: 5 carbons', 'S2 Err: 8 carbons',
#             'S2 Err: 11 carbons', 'S2 Err: 14 carbons', 'S2 Err: 17 carbons']
#     summary_nematic_15nN_df = summary_nematic_15nN_df[cols]
    
#     #print(summary_adhesion_df)
#     summary_nematic_15nN_df.to_csv("same_terminal_groups_15nN_nematic_summary.csv")
#     summary_nematic_15nN_df.to_html("same_terminal_groups_15nN_nematic_summary.html")
#     print(summary_nematic_15nN_df)

#     # nematic Order (S2) 25nN
#     summary_nematic_25nN_list = []
#     for i, terminal_group in enumerate(terminal_groups):
#         nematic = []
#         nematic_err = []

#         nematic_list = [df[(df.chainlength==chainlength) &
#                         (df.terminal_group==terminal_group)].S2
#                      for chainlength in chainlengths]
#         for chain_length_grouping in nematic_list:
#             nematic.append(np.mean([i['25nN'] for i in chain_length_grouping]))
#             nematic_err.append(np.std([i['25nN'] for i in chain_length_grouping]))
#         dict_1 = {
#             'System': str(terminal_group + ' - ' + terminal_group),
#             'S2: {} carbons'.format(chainlengths[0]): nematic[0],
#             'S2: {} carbons'.format(chainlengths[1]): nematic[1],
#             'S2: {} carbons'.format(chainlengths[2]): nematic[2],
#             'S2: {} carbons'.format(chainlengths[3]): nematic[3],
#             'S2: {} carbons'.format(chainlengths[4]): nematic[4],
#             'S2 Err: {} carbons'.format(chainlengths[0]): nematic_err[0],
#             'S2 Err: {} carbons'.format(chainlengths[1]): nematic_err[1],
#             'S2 Err: {} carbons'.format(chainlengths[2]): nematic_err[2],
#             'S2 Err: {} carbons'.format(chainlengths[3]): nematic_err[3],
#             'S2 Err: {} carbons'.format(chainlengths[4]): nematic_err[4],
#         }
#         summary_nematic_25nN_list.append(dict_1)
#     summary_nematic_25nN_df = pd.DataFrame(summary_nematic_25nN_list)
#     cols = list(summary_nematic_25nN_df.columns)
#     cols = ['System', 'S2: 5 carbons', 'S2: 8 carbons',
#             'S2: 11 carbons', 'S2: 14 carbons', 'S2: 17 carbons',
#             'S2 Err: 5 carbons', 'S2 Err: 8 carbons',
#             'S2 Err: 11 carbons', 'S2 Err: 14 carbons', 'S2 Err: 17 carbons']
#     summary_nematic_25nN_df = summary_nematic_25nN_df[cols]
    
#     #print(summary_adhesion_df)
#     summary_nematic_25nN_df.to_csv("same_terminal_groups_25nN_nematic_summary.csv")
#     summary_nematic_25nN_df.to_html("same_terminal_groups_25nN_nematic_summary.html")
#     print(summary_nematic_25nN_df)

In [None]:
#gen_summary_tribo(old_screening_proj)

### Generate complete prediction of phenol and toluene systems vs the original 16 chemistries

In [None]:
df_new_old['Terminal Group 1'].unique()

In [None]:
df_new_old['Terminal Group 2'].unique()

In [None]:
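# `&` binds more tightly than `|`, so the exclusions apply only to the phenol rows;
# every row with toluene as Terminal Group 2 is kept. The parentheses make that explicit.
filtered_new_old_df = df_new_old[
    (df_new_old['Terminal Group 2'] == 'toluene') |
    ((df_new_old['Terminal Group 2'] == 'phenol') &
     (df_new_old['Terminal Group 1'] != 'phenol') &
     (df_new_old['Terminal Group 1'] != 'difluoromethyl') &
     (df_new_old['Terminal Group 2'] != 'difluoromethyl'))]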
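
Mixing `|` and `&` in one mask is easy to misread, because `&` binds more tightly than `|` in Python, for pandas masks and plain booleans alike. The sketch below uses plain dictionaries rather than a DataFrame to show how the grouping changes which rows survive. The rows are hypothetical examples, and `as_written` and `fully_grouped` are names introduced here for illustration.

```python
# Hypothetical rows: (Terminal Group 1, Terminal Group 2).
rows = [
    {"tg1": "pyrrole", "tg2": "toluene"},
    {"tg1": "phenol", "tg2": "toluene"},
    {"tg1": "pyrrole", "tg2": "phenol"},
    {"tg1": "phenol", "tg2": "phenol"},
]


def as_written(r):
    # Parsed as A | (B & C): the exclusion only applies to the phenol rows.
    return (r["tg2"] == "toluene") | (r["tg2"] == "phenol") & (r["tg1"] != "phenol")


def fully_grouped(r):
    # Parsed as (A | B) & C: the exclusion applies to both target groups.
    return ((r["tg2"] == "toluene") | (r["tg2"] == "phenol")) & (r["tg1"] != "phenol")


kept_as_written = [(r["tg1"], r["tg2"]) for r in rows if as_written(r)]
kept_grouped = [(r["tg1"], r["tg2"]) for r in rows if fully_grouped(r)]

print(kept_as_written)
print(kept_grouped)
```

The two versions differ only in the phenol/toluene row, so it is worth confirming which grouping matches the systems that were meant to be compared.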

In [None]:
filtered_new_old_df

In [None]:
# generate a complete summary of the prediction of the various monolayers
# note, this data also exists in a csv file
# this takes some time to run

# summary_stat = []
# for seed in [43, 0, 1, 2, 3]:
#     print(seed)
#     for smi1 in list(filtered_new_old_df['SMILES-1-H'].unique()):
#         for smi2 in list(filtered_new_old_df['SMILES-2-H'].unique()):
#             a = rf.predict(SMILES1=smi1,
#                        SMILES2=smi2,
#                        path_to_data=".",
#                        random_seed=seed,
#                        vary_descriptors=False,
#                        vary_significant=False,
#                        barcode_seed=123,
#                        feature_cluster_json_location="./random_forest_tg/")
#             a['SMI-H-TOP'] = smi1
#             a['SMI-H-BOTTOM'] = smi2
#             a['Terminal Group 1'] = my_inverted_dict[smi1][0]
#             a['Terminal Group 2'] = my_inverted_dict[smi2][0]
#             summary_stat.append(a)
#     #summary_stat_cof.append(a['COF'])
#     #summary_stat_intercept.append(a['intercept'])
# #print(summary_stat_cof, summary_stat_intercept)
# pd.DataFrame(summary_stat)
#
# summary_new_with_old_df = pd.DataFrame(summary_stat)

In [None]:
# summary_total_list = []
# for smi1 in list(summary_new_with_old_df['SMI-H-TOP'].unique()):
#     for smi2 in list(summary_new_with_old_df['SMI-H-BOTTOM'].unique()):
#         cof_mean = summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['COF'].mean()
#         cof_std = summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['COF'].std()
#         intercept_mean = summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['intercept'].mean()
#         intercept_std = summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['intercept'].std()
        
#         calc_cof_mean = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi1) & (filtered_new_old_df['SMILES-2-H'] == smi2)]['COF'].mean()
#         calc_cof_std = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi1) & (filtered_new_old_df['SMILES-2-H'] == smi2)]['COF'].std()
#         calc_intercept_mean = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi1) & (filtered_new_old_df['SMILES-2-H'] == smi2)]['intercept'].mean()
#         calc_intercept_std = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi1) & (filtered_new_old_df['SMILES-2-H'] == smi2)]['intercept'].std()
#         print(np.any(np.isnan([calc_cof_mean, calc_cof_std, calc_intercept_mean, calc_intercept_std])))
#         if np.any(np.isnan([calc_cof_mean, calc_cof_std, calc_intercept_mean, calc_intercept_std])):
#             calc_cof_mean = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi2) & (filtered_new_old_df['SMILES-2-H'] == smi1)]['COF'].mean()
#             calc_cof_std = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi2) & (filtered_new_old_df['SMILES-2-H'] == smi1)]['COF'].std()
#             calc_intercept_mean = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi2) & (filtered_new_old_df['SMILES-2-H'] == smi1)]['intercept'].mean()
#             calc_intercept_std = filtered_new_old_df[(filtered_new_old_df['SMILES-1-H'] == smi2) & (filtered_new_old_df['SMILES-2-H'] == smi1)]['intercept'].std()
#             summary_total_list.append({
#                 'SMI-H-1': smi2,
#                 'SMI-H-2': smi1,
#                 'Terminal Group 1': summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['Terminal Group 1'].unique()[0],
#                 'Terminal Group 2': summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['Terminal Group 2'].unique()[0],
#                 'COF Prediction (mean)': cof_mean,
#                 'COF Prediction (std)': cof_std,
#                 'F0 Prediction (mean)': intercept_mean,
#                 'F0 Prediction (std)': intercept_std,
#                 'COF Calculated (mean)': calc_cof_mean,
#                 'COF Calculated (std)': calc_cof_std,
#                 'F0 Calculated (mean)': calc_intercept_mean,
#                 'F0 Calculated (std)': calc_intercept_std,
#             })
#         else:
#             summary_total_list.append({
#                 'SMI-H-1': smi1,
#                 'SMI-H-2': smi2,
#                 'Terminal Group 1': summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['Terminal Group 1'].unique()[0],
#                 'Terminal Group 2': summary_new_with_old_df[(summary_new_with_old_df['SMI-H-TOP'] == smi1) & (summary_new_with_old_df['SMI-H-BOTTOM'] == smi2)]['Terminal Group 2'].unique()[0],
#                 'COF Prediction (mean)': cof_mean,
#                 'COF Prediction (std)': cof_std,
#                 'F0 Prediction (mean)': intercept_mean,
#                 'F0 Prediction (std)': intercept_std,
#                 'COF Calculated (mean)': calc_cof_mean,
#                 'COF Calculated (std)': calc_cof_std,
#                 'F0 Calculated (mean)': calc_intercept_mean,
#                 'F0 Calculated (std)': calc_intercept_std,
#             })
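
The commented-out cell above retries each lookup with the two SMILES strings swapped when the first ordering returns NaN, which treats a top/bottom pair as the same system regardless of order. A sketch of an alternative, assuming that symmetry holds, is to key every record by an order-independent tuple so both orderings land in the same bucket. `pair_key` is a hypothetical helper and the COF values are made up for illustration.

```python
from collections import defaultdict
import statistics


def pair_key(smiles_a, smiles_b):
    """Order-independent key for a top/bottom monolayer pair."""
    return tuple(sorted((smiles_a, smiles_b)))


# Made-up replicate COF values (illustrative only), stored in either order.
records = [
    ("Cc1ccccc1", "c1ccc(cc1)O", 0.150),
    ("c1ccc(cc1)O", "Cc1ccccc1", 0.154),
    ("Cc1ccccc1", "Cc1ccccc1", 0.140),
]

cof_by_pair = defaultdict(list)
for top, bottom, cof in records:
    cof_by_pair[pair_key(top, bottom)].append(cof)

for key, values in cof_by_pair.items():
    # A sample standard deviation needs at least two replicates.
    spread = statistics.stdev(values) if len(values) > 1 else float("nan")
    print(key, statistics.fmean(values), spread)
```

With this keying there is a single code path for building the summary rows, rather than a duplicated branch for the swapped case.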

In [None]:
# pd.DataFrame(summary_total_list)

In [None]:
# filtered_new_old_df[(filtered_new_old_df['Terminal Group 2'] == 'phenol') | (filtered_new_old_df['Terminal Group 1']=='pyrrole')]

In [None]:
# final_df = pd.DataFrame(summary_total_list)

In [None]:
# col_list = list(final_df.columns)
# print(col_list)
# col_list = ['Terminal Group 1',
#             'Terminal Group 2',
#             'COF Calculated (mean)',
#             'COF Calculated (std)',
#             'COF Prediction (mean)',
#             'COF Prediction (std)',
#             'F0 Calculated (mean)',
#             'F0 Calculated (std)',
#             'F0 Prediction (mean)',
#             'F0 Prediction (std)',
#             'SMI-H-1',
#             'SMI-H-2',]
# final_df = final_df[col_list]

In [None]:
# 
#
# final_df.sort_values(by='Terminal Group 1').dropna(how='any')

### Two rows had to be dropped because of missing data:

1. pyrrole-phenol
2. phenyl-phenol

In [None]:
# final_df.to_csv('two_new_groups_predicted_vs_calculated.csv',)
# final_df.to_html('two_new_groups_predicted_vs_calculated.html')

In [None]:
f_df = pd.read_csv('./csv_files/two_new_groups_predicted_vs_calculated.csv')
f_df['COF Calculated (mean)']

In [None]:
f_df

In [None]:
f_df = f_df.dropna(how='any')

In [None]:
f_df.sort_values(by='Terminal Group 1')

In [None]:
ax = f_df.plot(kind='scatter',
          x='COF Prediction (mean)',
          y='COF Calculated (mean)',
          xticks=[0.10, 0.14, 0.18],
          yticks=[0.10, 0.14, 0.18]
          )
X_plot = np.linspace(0.08,0.19,100)
ax.set_xlabel('COF (Predicted)')
ax.set_ylabel('COF (actual)')
ax.plot(X_plot, X_plot,'k',zorder=-1)
ax.axis(xmin=0.1, xmax=0.19, ymin=0.1, ymax=0.19)
ax.grid()
plt.tight_layout()
#plt.savefig('./cof_pred_calc.pdf', format='pdf')
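
A parity plot shows agreement visually; a pair of summary numbers can complement it. The sketch below computes the mean absolute error and root-mean-square error between predicted and calculated values. It is an illustrative addition, not part of the original analysis: `parity_metrics` is a hypothetical helper and the numbers are made up rather than read from the CSV.

```python
import math


def parity_metrics(predicted, actual):
    """Return (MAE, RMSE) between paired predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Made-up COF values (illustrative only, not taken from the CSV).
predicted = [0.12, 0.14, 0.16, 0.18]
actual = [0.13, 0.14, 0.15, 0.16]

mae, rmse = parity_metrics(predicted, actual)
print(f"MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```

With the real data, the two lists would be the `COF Prediction (mean)` and `COF Calculated (mean)` columns after the rows with missing values are dropped.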


In [None]:
ax = f_df.plot(kind='scatter',
          x='F0 Prediction (mean)',
          y='F0 Calculated (mean)',
          xticks=[0, 3, 6],
          yticks=[0, 3, 6]
          )
X_plot = np.linspace(0,7,100)
ax.set_xlabel('$F_0$ (Predicted)')
ax.set_ylabel('$F_0$ (actual)')
ax.plot(X_plot, X_plot,'k',zorder=-1)
ax.axis(xmin=0, xmax=7, ymin=0, ymax=7)
ax.grid()
plt.tight_layout()
#plt.savefig('./intercept_pred_calc.pdf', format='pdf')

In [None]:
f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Prediction (mean)']

In [None]:
f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Calculated (mean)']

In [None]:
plt.rcParams['axes.axisbelow'] = True
hfont = {'fontname':'Helvetica'}
X_plot = np.linspace(0,.2,100)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Calculated (mean)'],
            xerr=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Prediction (std)'],
            yerr=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Calculated (std)'],
            mec='k',
            ls='none',
            elinewidth=0.75,
            marker='o',
            fillstyle='full',
            c='k',
            label='toluene',
            alpha=1,
            zorder=2)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Calculated (mean)'],
            xerr=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Prediction (std)'],
            yerr=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Calculated (std)'],
            mec='r',
            ls='none',
            marker='s',
            fillstyle='none',
            c='r',
            elinewidth=0.75,
            label='phenol',
            alpha=1,
            zorder=2)
plt.xticks([0.10, 0.14, 0.18])
plt.yticks([0.10, 0.14, 0.18])
plt.axis(xmin=0.1, xmax=0.19, ymin=0.1, ymax=0.19)
plt.plot(X_plot, X_plot,'k',zorder=1)
plt.grid(which='both')
plt.xlabel('COF (predicted)', **hfont)
plt.ylabel('COF (actual)', **hfont)
#plt.legend(bbox_to_anchor=(1.04, 1))
plt.tight_layout()
plt.gca().set_aspect('equal')
#plt.savefig('./cof_pred_calc.pdf', format='pdf')

In [None]:
plt.rcParams['axes.axisbelow'] = True
hfont = {'fontname':'Helvetica'}
X_plot = np.linspace(0,7,100)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Calculated (mean)'],
            xerr=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Prediction (std)'],
            yerr=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Calculated (std)'],
            ls='none',
            elinewidth=0.75,
            mec='k',
            c='k',
            marker='o',
            fillstyle='full',
            label='toluene',
            alpha=1,
            zorder=2)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Calculated (mean)'],
            xerr=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Prediction (std)'],
            yerr=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Calculated (std)'],
            ls='none',
            elinewidth=0.75,
            mec='r',
            marker='s',
            fillstyle='none',
            c='r',
            alpha=1,
            label='phenol',
            zorder=2)
plt.xticks([0, 3, 6])
plt.yticks([0, 3, 6])
plt.axis(xmin=0, xmax=7, ymin=0, ymax=7)
plt.plot(X_plot, X_plot,'k',zorder=1)
plt.grid(which='both')
plt.xlabel('$F_0$ (predicted), nN')
plt.ylabel('$F_0$ (actual), nN')
plt.gca().set_aspect(aspect='equal')
plt.tight_layout()
#plt.savefig('./intercept_pred_calc.pdf', format='pdf')

In [None]:
plt.rcParams['axes.axisbelow'] = True
hfont = {'fontname':'Helvetica'}
#X_plot = np.linspace(0,7,100)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Calculated (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Calculated (mean)'],
            marker='o',
            ls='none',
            c='#1f77b4',
            alpha=0.9,
            elinewidth=0.5,
            mec='k',
            label='calculated',
            zorder=2)
plt.errorbar(x=f_df[(f_df['Terminal Group 2'] == 'toluene')]['F0 Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'toluene')]['COF Prediction (mean)'],
            mec='k',
            ls='none',
            alpha=0.9,
            elinewidth=0.5,
            marker='o',
            c='#d62728',
            label='predicted',
            zorder=2)
plt.yticks([0.10, 0.14, 0.18])
plt.xticks([0, 2, 4])
plt.axis(xmin=0, xmax=7, ymin=0.10, ymax=0.19)
#plt.plot(X_plot, X_plot,'k',zorder=1)
#plt.legend(loc='best')
plt.ylabel('COF')
plt.xlabel('$F_0$, nN',fontdict={'fontname': 'Helvetica'})
plt.grid(which='both')
plt.tight_layout()
#plt.savefig('./cof_intercept_toluene.pdf', format='pdf')

In [None]:
plt.rcParams['axes.axisbelow'] = True
hfont = {'fontname':'Helvetica'}
#X_plot = np.linspace(0,7,100)
plt.scatter(x=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Calculated (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Calculated (mean)'],
            edgecolors='k',
            marker='o',
            c='#1f77b4',
            label='calculated',
            zorder=2)
plt.scatter(x=f_df[(f_df['Terminal Group 2'] == 'phenol')]['F0 Prediction (mean)'],
            y=f_df[(f_df['Terminal Group 2'] == 'phenol')]['COF Prediction (mean)'],
            edgecolors='k',
            marker='o',
            c='#d62728',
            label='predicted',
            zorder=2)
plt.yticks([0.10, 0.14, 0.18])
plt.xticks([0, 3, 6])
plt.axis(xmin=0, xmax=7, ymin=0.10, ymax=0.19)
#plt.plot(X_plot, X_plot,'k',zorder=1)
plt.legend(loc='best')
plt.ylabel('COF')
plt.xlabel('$F_0$, nN')
plt.grid(which='both')
plt.tight_layout()
#plt.savefig('./cof_intercept_phenol.pdf', format='pdf')

In [None]:
data_old_mixed= []
for job in old_mixed_screening_proj:
    if "COF" in job.document and "intercept" in job.document:
        data_old_mixed.append({'Terminal Group 1': job.sp['terminal_groups'][0],
                         'Terminal Group 2': job.sp['terminal_groups'][1],
                         'COF': job.doc['COF'],
                         'intercept': job.doc['intercept'],
                         'SMILES-1-H': new_smiles_dict[job.sp['terminal_groups'][0]][0],
                         'SMILES-2-H': new_smiles_dict[job.sp['terminal_groups'][1]][0]})

for job in old_screening_proj:
    if "COF" in job.document and "intercept" in job.document:
        data_old_mixed.append({'Terminal Group 1': job.sp['terminal_group'],
                         'Terminal Group 2': job.sp['terminal_group'],
                         'COF': job.doc['COF'],
                         'intercept': job.doc['intercept'],
                         'SMILES-1-H': new_smiles_dict[job.sp['terminal_group']][0],
                         'SMILES-2-H': new_smiles_dict[job.sp['terminal_group']][0]})
        
df_previous_study = pd.DataFrame(data_old_mixed)

In [None]:
# summary_stat = []
# for seed in [43, 0, 1, 2, 3]:
#     print(seed)
#     for key, smi1 in old_smiles_dict.items(): # list(filtered_new_old_df['SMILES-1-H'].unique()):
#         print(smi1)
#         for k2, smi2 in old_smiles_dict.items():  #list(filtered_new_old_df['SMILES-2-H'].unique()):
#             a = rf.predict(SMILES1=smi1[0],
#                        SMILES2=smi2[0],
#                        path_to_data=".",
#                        random_seed=seed,
#                        vary_descriptors=False,
#                        vary_significant=False,
#                        barcode_seed=123,
#                        feature_cluster_json_location="./random_forest_tg/")
#             a['SMI-H-TOP'] = smi1[0]
#             a['SMI-H-BOTTOM'] = smi2[0]
#             a['Terminal Group 1'] = my_inverted_dict[smi1[0]][0]
#             a['Terminal Group 2'] = my_inverted_dict[smi2[0]][0]
#             summary_stat.append(a)
# pd.DataFrame(summary_stat)

In [None]:
# df_test_train = pd.DataFrame(summary_stat)
df_test_train = pd.read_csv('./csv_files/andrew_df_test_train.csv')

In [None]:
df_test_train[(df_test_train['Terminal Group 1'] == 'perfluoromethyl') 
                     & (df_test_train['Terminal Group 2'] == 'perfluoromethyl')&
                    (df_test_train['Model Number'] == 43)]
list(df_test_train['Terminal Group 1'].unique())


In [None]:
import glob, os
summary_project_df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "./csv_files/model_*_summary*_*_andrew*.csv"))),axis=1,).drop_duplicates().reset_index(drop=True)

summary_project_df = summary_project_df.drop(columns=['Unnamed: 0'], axis=1)

five_models = dict()
for model in [43, 0, 1, 2, 3]:
    five_models['model{}'.format(model)] = pd.read_csv('./csv_files/model_{}_summary_test_train_andrew.csv'.format(model))

In [None]:
five_models['model0']

In [None]:
df_test_train[ 
              (df_test_train['Terminal Group 1']=='acetyl') &
              (df_test_train['Terminal Group 2'] == 'acetyl')]['COF'].describe()

print(sorted(df_previous_study['Terminal Group 1'].unique()))
print(sorted(df_previous_study['Terminal Group 2'].unique()))

In [None]:
def _pair_stats(df, tg1, tg2, column):
    """Return (mean, std) of `column` over the rows matching the (tg1, tg2) pair."""
    values = df[(df['Terminal Group 1'] == tg1) &
                (df['Terminal Group 2'] == tg2)][column]
    return values.mean(), values.std()

summary_andrew = []
for tg1 in df_previous_study['Terminal Group 1'].unique():
    for tg2 in df_previous_study['Terminal Group 2'].unique():
        # model predictions, aggregated over all rows for this pair
        cof_mean, cof_std = _pair_stats(df_test_train, tg1, tg2, 'COF')
        intercept_mean, intercept_std = _pair_stats(df_test_train, tg1, tg2, 'intercept')

        # simulation results from the previous screening studies
        cof_calc_mean, cof_calc_std = _pair_stats(df_previous_study, tg1, tg2, 'COF')
        intercept_calc_mean, intercept_calc_std = _pair_stats(df_previous_study, tg1, tg2, 'intercept')

        # the pair may only be stored in the opposite order, so retry with the groups swapped
        if np.any(np.isnan([cof_calc_mean, cof_calc_std, intercept_calc_mean, intercept_calc_std])):
            cof_calc_mean, cof_calc_std = _pair_stats(df_previous_study, tg2, tg1, 'COF')
            intercept_calc_mean, intercept_calc_std = _pair_stats(df_previous_study, tg2, tg1, 'intercept')

        summary_andrew.append({
            'Terminal Group 1': tg1,
            'Terminal Group 2': tg2,
            'COF Prediction (mean)': cof_mean,
            'COF Prediction (std)': cof_std,
            'F0 Prediction (mean)': intercept_mean,
            'F0 Prediction (std)': intercept_std,
            'COF Calculated (mean)': cof_calc_mean,
            'COF Calculated (std)': cof_calc_std,
            'F0 Calculated (mean)': intercept_calc_mean,
            'F0 Calculated (std)': intercept_calc_std,
        })

In [None]:
summary_df = pd.DataFrame(summary_andrew)

In [None]:
summary_df = summary_df.dropna(how='any')
summary_df
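
The nested loops above look up each pair of terminal groups in one order and fall back to the swapped order when nothing is found. As a rough alternative, the sketch below builds an order-independent key and lets `groupby` do the aggregation. The frame `toy_calc`, its values, and the `pair` column are made up for illustration and are not part of the project data. Note that, unlike the loop, this version pools rows from both orderings instead of using the swapped order only as a fallback.

```python
import pandas as pd

# toy stand-in for the simulation results; the numbers are invented
toy_calc = pd.DataFrame({
    'Terminal Group 1': ['methyl', 'methyl', 'phenyl', 'hydroxyl'],
    'Terminal Group 2': ['phenyl', 'phenyl', 'methyl', 'hydroxyl'],
    'COF': [0.10, 0.12, 0.14, 0.20],
    'intercept': [1.0, 2.0, 3.0, 4.0],
})

def pair_key(row):
    """Sort the two group names so (a, b) and (b, a) share one key."""
    return ' / '.join(sorted((row['Terminal Group 1'], row['Terminal Group 2'])))

# hypothetical helper column holding the order-independent key
toy_calc['pair'] = toy_calc.apply(pair_key, axis=1)

# mean and sample standard deviation per unordered pair
toy_summary = toy_calc.groupby('pair')[['COF', 'intercept']].agg(['mean', 'std'])
print(toy_summary)
```

A pair with a single row has an undefined sample standard deviation (NaN), so a `dropna(how='any')` like the one used in this notebook would discard it.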

In [None]:
# df_test_train[(df_test_train['Terminal Group 1'] == tg1) & 
#                      (df_test_train['Terminal Group 2'] == tg2)&
#                      (df_test_train['Model Number'] == 43)]['COF']

In [None]:

model_list_group1 = list(five_models['model0']['Terminal Group 1'])
model_list_group2 = list(five_models['model0']['Terminal Group 2'])
# marker styles for points used in training vs. held-out test points
train_style = dict(mec='k', mew=0.5, ecolor='k', ls='none', elinewidth=0.5,
                   marker='.', ms=7, fillstyle='full', c='w', label='train',
                   alpha=1, zorder=2)
test_style = dict(mec='k', mew=0.75, ecolor='k', ls='none', elinewidth=0.75,
                  marker='X', ms=4, fillstyle='full', c='r', label='test',
                  alpha=1, zorder=3)
for model in [43, 0, 1, 2, 3]:
    plt.clf()
    plt.rcParams['axes.axisbelow'] = True
    hfont = {'fontname':'Helvetica'}
    X_plot = np.linspace(0,.3,100)
    counter = 0
    model_df = five_models['model{}'.format(model)]
    for tg1, tg2 in zip(model_list_group1, model_list_group2):
        counter = counter + 1
        # 'Train'/'Test' labels of this pair for the current model
        split = model_df[(model_df['Terminal Group 1'] == tg1) &
                         (model_df['Terminal Group 2'] == tg2)]['Model {}'.format(model)]
        # prediction of the current model (x) and simulation summary (y)
        predicted = df_test_train[(df_test_train['Terminal Group 1'] == tg1) &
                                  (df_test_train['Terminal Group 2'] == tg2) &
                                  (df_test_train['Model Number'] == model)]
        calculated = summary_df[(summary_df['Terminal Group 1'] == tg1) &
                                (summary_df['Terminal Group 2'] == tg2)]
        style = train_style if np.all(split == 'Train') else test_style
        plt.errorbar(x=predicted['COF'],
                     y=calculated['COF Calculated (mean)'],
                     xerr=None,
                     yerr=calculated['COF Calculated (std)'],
                     **style)
    plt.xticks([0.08, 0.14, 0.18])
    plt.yticks([0.08, 0.14, 0.18])
    plt.axis(xmin=0.08, xmax=0.21, ymin=0.08, ymax=0.21)
    plt.plot(X_plot, X_plot,'k',zorder=1)
    plt.grid(which='both')
    plt.xlabel('COF (predicted)', **hfont)
    plt.ylabel('COF (actual)', **hfont)
    plt.tight_layout()
    plt.gca().set_aspect('equal')
    plt.savefig('./cof_test_train_andrew_model_{}.pdf'.format(model), format='pdf')
    print(counter)

In [None]:

model_list_group1 = list(five_models['model0']['Terminal Group 1'])
model_list_group2 = list(five_models['model0']['Terminal Group 2'])
# marker styles for points used in training vs. held-out test points
train_style = dict(mec='k', mew=0.5, ecolor='k', ls='none', elinewidth=0.5,
                   marker='.', ms=7, fillstyle='full', c='w', label='train',
                   alpha=1, zorder=2)
test_style = dict(mec='k', mew=0.75, ecolor='k', ls='none', elinewidth=0.75,
                  marker='X', ms=4, fillstyle='full', c='r', label='test',
                  alpha=1, zorder=3)
for model in [43, 0, 1, 2, 3]:
    plt.clf()
    plt.rcParams['axes.axisbelow'] = True
    hfont = {'fontname':'Helvetica'}
    X_plot = np.linspace(0,7,100)
    counter = 0  # number of test points for this model
    model_df = five_models['model{}'.format(model)]
    for tg1, tg2 in zip(model_list_group1, model_list_group2):
        # 'Train'/'Test' labels of this pair for the current model
        split = model_df[(model_df['Terminal Group 1'] == tg1) &
                         (model_df['Terminal Group 2'] == tg2)]['Model {}'.format(model)]
        # prediction of the current model (x) and simulation summary (y)
        predicted = df_test_train[(df_test_train['Terminal Group 1'] == tg1) &
                                  (df_test_train['Terminal Group 2'] == tg2) &
                                  (df_test_train['Model Number'] == model)]
        calculated = summary_df[(summary_df['Terminal Group 1'] == tg1) &
                                (summary_df['Terminal Group 2'] == tg2)]
        if np.all(split == 'Train'):
            style = train_style
        else:
            counter = counter + 1
            style = test_style
        plt.errorbar(x=predicted['intercept'],
                     y=calculated['F0 Calculated (mean)'],
                     xerr=None,
                     yerr=calculated['F0 Calculated (std)'],
                     **style)
    plt.xticks([0, 3, 6])
    plt.yticks([0, 3, 6])
    plt.axis(xmin=0, xmax=7, ymin=0, ymax=7)
    plt.plot(X_plot, X_plot,'k',zorder=1)
    plt.grid(which='both')
    plt.xlabel('$F_0$ (predicted), nN')
    plt.ylabel('$F_0$ (actual), nN')
    plt.tight_layout()
    plt.gca().set_aspect('equal')
    #plt.savefig('./intercept_test_train_andrew_model_{}.pdf'.format(model), format='pdf')
    #print(counter)
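
The parity plots above show visually how far the train and test points fall from the diagonal. A number to go alongside them could be computed as in the sketch below, which reports the mean and maximum absolute error per split. The frame `toy_parity` and its `split`, `predicted`, and `calculated` columns are hypothetical stand-ins with invented values. They do not come from the project CSV files.

```python
import pandas as pd

# invented predicted vs. calculated values, tagged by split
toy_parity = pd.DataFrame({
    'split': ['Train', 'Train', 'Test', 'Test'],
    'predicted': [0.10, 0.14, 0.12, 0.18],
    'calculated': [0.11, 0.14, 0.15, 0.16],
})

# vertical distance of each point from the y = x line of a parity plot
toy_parity['abs_error'] = (toy_parity['predicted'] - toy_parity['calculated']).abs()

# mean and worst-case absolute error for each split
parity_metrics = toy_parity.groupby('split')['abs_error'].agg(['mean', 'max'])
print(parity_metrics)
```

A test error well above the train error would suggest the model is memorizing the training pairs rather than generalizing to new ones.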

In [None]:
%%writefile generate_plots.py
import matplotlib
import pandas as pd
import numpy as np
import glob
import os
import pprint

from matplotlib import pyplot as plt

# Load in the relevant dataframes
screening_original_df = pd.read_csv('./csv_files/monolayer_screening_original_data.csv')
test_train_df = pd.read_csv('./csv_files/andrew_df_test_train.csv')

summary_model_df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "./csv_files/model_*_summary*_*_andrew*.csv"))),axis=1,).drop_duplicates().reset_index(drop=True)
summary_model_df = summary_model_df.drop(columns=['Unnamed: 0'], axis=1)

orig_models = dict()
for model in [43, 0, 1, 2, 3]:
    orig_models['model{}'.format(model)] = pd.read_csv('./csv_files/model_{}_summary_test_train_andrew.csv'.format(model))

# generate summary of simulation data from original projects
summary_project = []
for tg1 in list(screening_original_df['Terminal Group 1'].unique()):
    for tg2 in list(screening_original_df['Terminal Group 2'].unique()):
        cof_mean = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['COF'].mean()
        cof_std = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['COF'].std()
        intercept_mean = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['intercept'].mean()
        intercept_std = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['intercept'].std()

        cof_calc_mean = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['COF'].mean()
        cof_calc_std = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['COF'].std()
        intercept_calc_mean = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['intercept'].mean()
        intercept_calc_std = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['intercept'].std()

        print(np.any(np.isnan([cof_calc_mean, cof_calc_std, intercept_calc_mean, intercept_calc_std])))
        if np.any(np.isnan([cof_calc_mean, cof_calc_std, intercept_calc_mean, intercept_calc_std])):
            cof_calc_mean = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['COF'].mean()
            cof_calc_std = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['COF'].std()
            intercept_calc_mean = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['intercept'].mean()
            intercept_calc_std = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['intercept'].std()
            summary_project.append({
                'Terminal Group 1': tg1,
                'Terminal Group 2': tg2,
                'COF Prediction (mean)': cof_mean,
                'COF Prediction (std)': cof_std,
                'F0 Prediction (mean)': intercept_mean,
                'F0 Prediction (std)': intercept_std,
                'COF Calculated (mean)': cof_calc_mean,
                'COF Calculated (std)': cof_calc_std,
                'F0 Calculated (mean)': intercept_calc_mean,
                'F0 Calculated (std)': intercept_calc_std,
            })
        else:
            summary_project.append({
                'Terminal Group 1': tg1,
                'Terminal Group 2': tg2,
                'COF Prediction (mean)': cof_mean,
                'COF Prediction (std)': cof_std,
                'F0 Prediction (mean)': intercept_mean,
                'F0 Prediction (std)': intercept_std,
                'COF Calculated (mean)': cof_calc_mean,
                'COF Calculated (std)': cof_calc_std,
                'F0 Calculated (mean)': intercept_calc_mean,
                'F0 Calculated (std)': intercept_calc_std,
            })
# convert list of dictionaries into dataframe, drop any NaN values
project_pred_calc_df = pd.DataFrame(summary_project).dropna(how='any')


# now generate COF Test/Train Plots

model_list_group1 = list(orig_models['model0']['Terminal Group 1'])
model_list_group2 = list(orig_models['model0']['Terminal Group 2'])
for model in [43, 0, 1, 2, 3]:
    plt.clf()
    plt.rcParams['axes.axisbelow'] = True
    hfont = {'fontname':'Helvetica'}
    X_plot = np.linspace(0,.3,100)
    for tg1, tg2 in zip(model_list_group1, model_list_group2):
        model_df = orig_models['model{}'.format(model)]
        # check if this datum was for testing or training
        if (np.all(model_df[(model_df['Terminal Group 1'] == tg1) & 
                            (model_df['Terminal Group 2'] == tg2)]['Model {}'.format(model)] == 'Train')):
            plt.errorbar(x=test_train_df[(test_train_df['Terminal Group 1'] == tg1) & 
                             (test_train_df['Terminal Group 2'] == tg2) &
                             (test_train_df['Model Number'] == model)]['COF'],
                        y=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['COF Calculated (mean)'],
                        xerr=None,
                        yerr=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['COF Calculated (std)'],
                        mec='k',
                        mew=0.5,
                        ecolor='k',
                        ls='none',
                        elinewidth=0.5,
                        marker='.',
                        ms=7,
                        fillstyle='full',
                        c='w',
                        label='train',
                        alpha=1,
                        zorder=2)
        else:
            plt.errorbar(x=test_train_df[(test_train_df['Terminal Group 1'] == tg1) & 
                             (test_train_df['Terminal Group 2'] == tg2)&
                             (test_train_df['Model Number'] == model)]['COF'],
                        y=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['COF Calculated (mean)'],
                        xerr=None,
                        yerr=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['COF Calculated (std)'],
                        mec='k',
                        mew=0.75,
                        ecolor='k',
                        ls='none',
                        marker='X',
                        fillstyle='full',
                        c='r',
                        ms=4,
                        elinewidth=0.75,
                        label='test',
                        alpha=1,
                        zorder=3)
    plt.xticks([0.08, 0.14, 0.18])
    plt.yticks([0.08, 0.14, 0.18])
    plt.axis(xmin=0.08, xmax=0.21, ymin=0.08, ymax=0.21)
    plt.plot(X_plot, X_plot,'k',zorder=1)
    plt.grid(which='both')
    plt.xlabel('COF (predicted)', **hfont)
    plt.ylabel('COF (actual)', **hfont)
    plt.tight_layout()
    plt.gca().set_aspect('equal')
    plt.savefig('./cof_test_train_andrew_model_{}.pdf'.format(model), format='pdf')
    
    
# now generate F0 test/train plots
for model in [43, 0, 1, 2, 3]:
    plt.clf()
    plt.rcParams['axes.axisbelow'] = True
    hfont = {'fontname':'Helvetica'}
    X_plot = np.linspace(0,7,100)
    for tg1, tg2 in zip(model_list_group1, model_list_group2):
        model_df = orig_models['model{}'.format(model)]
        # check if this datum was for testing or training
        if (np.all(model_df[(model_df['Terminal Group 1'] == tg1) & 
                            (model_df['Terminal Group 2'] == tg2)]['Model {}'.format(model)] == 'Train')):
            plt.errorbar(x=test_train_df[(test_train_df['Terminal Group 1'] == tg1) & 
                             (test_train_df['Terminal Group 2'] == tg2) &
                             (test_train_df['Model Number'] == model)]['intercept'],
                        y=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['F0 Calculated (mean)'],
                        xerr=None,
                        yerr=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['F0 Calculated (std)'],
                        mec='k',
                        mew=0.5,
                        ecolor='k',
                        ls='none',
                        elinewidth=0.5,
                        marker='.',
                        ms=7,
                        fillstyle='full',
                        c='w',
                        label='train',
                        alpha=1,
                        zorder=2)
        else:
            plt.errorbar(x=test_train_df[(test_train_df['Terminal Group 1'] == tg1) & 
                             (test_train_df['Terminal Group 2'] == tg2)&
                             (test_train_df['Model Number'] == model)]['intercept'],
                        y=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['F0 Calculated (mean)'],
                        xerr=None,
                        yerr=project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) &
                                        (project_pred_calc_df['Terminal Group 2'] == tg2)]['F0 Calculated (std)'],
                        mec='k',
                        mew=0.75,
                        ecolor='k',
                        ls='none',
                        marker='X',
                        fillstyle='full',
                        c='r',
                        ms=4,
                        elinewidth=0.75,
                        label='test',
                        alpha=1,
                        zorder=3)
    plt.xticks([0, 3, 6])
    plt.yticks([0, 3, 6])
    plt.axis(xmin=0, xmax=7, ymin=0, ymax=7)
    plt.plot(X_plot, X_plot,'k',zorder=1)
    plt.grid(which='both')
    plt.xlabel('$F_0$ (predicted), nN')
    plt.ylabel('$F_0$ (actual), nN')
    plt.tight_layout()
    plt.gca().set_aspect('equal')
    plt.savefig('./intercept_test_train_andrew_model_{}.pdf'.format(model), format='pdf')

In [None]:
screening_original_df = pd.read_csv('./csv_files/monolayer_screening_original_data.csv')
test_train_df = pd.read_csv('./csv_files/andrew_df_test_train.csv')
orig_models = dict()
for model in [43, 0, 1, 2, 3]:
    orig_models['model{}'.format(model)] = pd.read_csv('./csv_files/model_{}_summary_test_train_andrew.csv'.format(model))

In [None]:
# generate summary of simulation data from original projects
summary_project = []
for tg1 in list(screening_original_df['Terminal Group 1'].unique()):
    for tg2 in list(screening_original_df['Terminal Group 2'].unique()):
        cof_mean = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['COF'].mean()
        cof_std = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['COF'].std()
        intercept_mean = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['intercept'].mean()
        intercept_std = test_train_df[
            (test_train_df['Terminal Group 1'] == tg1) &
            (test_train_df['Terminal Group 2'] == tg2)]['intercept'].std()

        cof_calc_mean = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['COF'].mean()
        cof_calc_std = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['COF'].std()
        intercept_calc_mean = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['intercept'].mean()
        intercept_calc_std = screening_original_df[
            (screening_original_df['Terminal Group 1'] == tg1) &
            (screening_original_df['Terminal Group 2'] == tg2)]['intercept'].std()

        print(np.any(np.isnan([cof_calc_mean, cof_calc_std, intercept_calc_mean, intercept_calc_std])))
        if np.any(np.isnan([cof_calc_mean, cof_calc_std, intercept_calc_mean, intercept_calc_std])):
            cof_calc_mean = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['COF'].mean()
            cof_calc_std = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['COF'].std()
            intercept_calc_mean = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['intercept'].mean()
            intercept_calc_std = screening_original_df[
                (screening_original_df['Terminal Group 1'] == tg2) &
                (screening_original_df['Terminal Group 2'] == tg1)]['intercept'].std()
            summary_project.append({
                'Terminal Group 1': tg1,
                'Terminal Group 2': tg2,
                'COF Prediction (mean)': cof_mean,
                'COF Prediction (std)': cof_std,
                'F0 Prediction (mean)': intercept_mean,
                'F0 Prediction (std)': intercept_std,
                'COF Calculated (mean)': cof_calc_mean,
                'COF Calculated (std)': cof_calc_std,
                'F0 Calculated (mean)': intercept_calc_mean,
                'F0 Calculated (std)': intercept_calc_std,
            })
        else:
            summary_project.append({
                'Terminal Group 1': tg1,
                'Terminal Group 2': tg2,
                'COF Prediction (mean)': cof_mean,
                'COF Prediction (std)': cof_std,
                'F0 Prediction (mean)': intercept_mean,
                'F0 Prediction (std)': intercept_std,
                'COF Calculated (mean)': cof_calc_mean,
                'COF Calculated (std)': cof_calc_std,
                'F0 Calculated (mean)': intercept_calc_mean,
                'F0 Calculated (std)': intercept_calc_std,
            })
# convert list of dictionaries into dataframe, drop any NaN values
project_pred_calc_df = pd.DataFrame(summary_project).dropna(how='any')

In [None]:
project_pred_calc_df

In [None]:
my_new_list = list()
model_list_group1 = list(orig_models['model0']['Terminal Group 1'])
model_list_group2 = list(orig_models['model0']['Terminal Group 2'])
for tg1, tg2 in zip(model_list_group1, model_list_group2):
    print(project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) & (project_pred_calc_df['Terminal Group 2'] == tg2)])
    my_new_list.append(project_pred_calc_df[(project_pred_calc_df['Terminal Group 1'] == tg1) & (project_pred_calc_df['Terminal Group 2'] == tg2)])
    
my_new_df = pd.concat(my_new_list)
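
The loop above selects the summary rows one pair at a time and concatenates them. The same selection can likely be written as a single inner merge on the two terminal-group columns, as sketched below. `toy_pairs` and `toy_pred_calc` are made-up frames standing in for the model's pair list and for `project_pred_calc_df`. One difference to keep in mind: `merge` returns a fresh integer index, whereas the loop keeps the original row labels.

```python
import pandas as pd

# pairs to select, in the order they should appear (invented values)
toy_pairs = pd.DataFrame({
    'Terminal Group 1': ['a', 'b'],
    'Terminal Group 2': ['b', 'c'],
})

# stand-in for the predicted/calculated summary table
toy_pred_calc = pd.DataFrame({
    'Terminal Group 1': ['a', 'b', 'c'],
    'Terminal Group 2': ['b', 'c', 'c'],
    'COF Calculated (mean)': [0.1, 0.2, 0.3],
})

# keep only the summary rows whose pair appears in toy_pairs
selected = toy_pairs.merge(toy_pred_calc,
                           on=['Terminal Group 1', 'Terminal Group 2'],
                           how='inner')
print(selected)
```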

In [None]:
my_new_df