## Investigation of Duplicate SMILES Codes in the 'ML-ready' DFs

### Problem Statement
The notebook 'pubchem-sparse-matrix' on the Jovian Server showed that the Metadata_broad_sample identifier in the file cp_annotations.csv is not suitable as a unique identifier: the DF obtained by merging on Metadata_broad_sample had duplicate inchikeys. The notebook 'Troubleshooting' in this directory then showed that the identifier columns CPD_SMILES and inchikey contain duplicate values.

### Aim of this Notebook
This notebook investigates the cause of the duplicate SMILES. The columns 'CPD_SMILES' and 'Metadata_broad_sample' are taken from the file 'cp_annotations.csv', and all rows whose SMILES occurs only once are dropped. The DF is then sorted by SMILES so that the competing Metadata_broad_sample identifiers appear next to each other. The resulting DF is merged with the file 'Raw_Image_data.txt' on the column 'Metadata_broad_sample'. All CellProfiler columns and invariant metadata columns are left out of the resulting DataFrame (in the code below, by reading only the relevant metadata columns from the raw file). The DF is then inspected visually for patterns relating SMILES and Metadata_broad_sample to the other information available in the DataFrame.

### Summary of the Steps Undertaken
1. Get Relevant Identifiers from Annotations File
2. Get Relevant Rows in sorted order
3. Merge with Raw Data
4. Inspect Metadata Columns to find Patterns
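
The four steps above can be sketched on a small toy example. This is only an illustration under stated assumptions: the helper name `duplicated_smiles_rows`, the toy identifiers, and the toy plate map names are made up and do not come from the real files.

```python
import pandas as pd


def duplicated_smiles_rows(annotations: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows whose CPD_SMILES occurs more than once, sorted by SMILES."""
    # Step 1: keep only the two identifier columns and drop incomplete rows
    ids = annotations.loc[:, ["CPD_SMILES", "Metadata_broad_sample"]].dropna()
    # Step 2: flag every occurrence of a repeated SMILES, then sort
    mask = ids["CPD_SMILES"].duplicated(keep=False)
    return ids[mask].sort_values(by="CPD_SMILES")


# Toy annotations (made-up values, not taken from cp_annotations.csv)
toy_annotations = pd.DataFrame({
    "CPD_SMILES": ["CCO", "CCO", "CCN", None],
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2", "TOY-B-1", "TOY-CTRL"],
})

# Toy raw data with one row per well (made-up values)
toy_raw = pd.DataFrame({
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2", "TOY-A-2", "TOY-B-1"],
    "Metadata_Plate_Map_Name": ["map_1", "map_2", "map_2", "map_1"],
})

dups = duplicated_smiles_rows(toy_annotations)

# Step 3: attach the plate metadata to the duplicated rows only
merged = dups.merge(toy_raw, how="left", on="Metadata_broad_sample")

# Step 4: inspect
print(merged)
```

In the toy data only the SMILES `CCO` is duplicated, so the merged frame contains just the wells of its two identifiers.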

### Conclusion
Compounds with the same SMILES but different Metadata_broad_sample appear to be spread across different plates, as the metadata columns 'Metadata_Plate' and 'Metadata_Plate_Map_Name' suggest. The annotations file contains 30616 distinct Metadata_broad_sample values, which is also the number of distinct compounds stated by the authors. That figure appears to be an overcount: the file contains only 30408 distinct SMILES, because 204 SMILES are each associated with two different Metadata_broad_sample values. The true number of distinct compounds is therefore lower.
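
The visual inspection behind this conclusion could be complemented by a programmatic check. The following is a minimal sketch, assuming a merged frame with the columns used in this notebook; the toy values and the derived column names (`n_identifiers`, `n_plate_maps`, `split_across_plate_maps`) are hypothetical.

```python
import pandas as pd

# Toy stand-in for the merged frame (made-up values)
merged = pd.DataFrame({
    "CPD_SMILES": ["CCO", "CCO", "CCO", "CCN", "CCN"],
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2", "TOY-A-2", "TOY-B-1", "TOY-B-2"],
    "Metadata_Plate_Map_Name": ["map_1", "map_2", "map_2", "map_3", "map_3"],
})

# For each SMILES, count distinct identifiers and distinct plate maps
per_smiles = merged.groupby("CPD_SMILES").agg(
    n_identifiers=("Metadata_broad_sample", "nunique"),
    n_plate_maps=("Metadata_Plate_Map_Name", "nunique"),
)

# A SMILES is "split" if its identifiers sit on more than one plate map
per_smiles["split_across_plate_maps"] = per_smiles["n_plate_maps"] > 1
print(per_smiles)
```

On the real data, the share of rows with `split_across_plate_maps` set would indicate how general the pattern seen by eye actually is.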



In [10]:
# Prerequisite

import pandas as pd
df = pd.read_csv("../../input/cp_annotations.csv")

In [11]:
df.CPD_SMILES.dropna().unique().shape, df.Metadata_broad_sample.dropna().unique().shape

((30408,), (30616,))

In [12]:
# 1. Get Relevant Identifiers from Annotations File
df = df.loc[:, ['CPD_SMILES', 'Metadata_broad_sample']].copy()
# keep=False flags every occurrence of a repeated SMILES, not just the later ones
df['duplicate'] = df['CPD_SMILES'].duplicated(keep=False)

In [13]:
# 2. Get Relevant Rows in sorted order
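df = df[df['duplicate']]                # keep only rows whose SMILES occurs more than once
df = df.dropna()                        # drop rows without SMILES or identifier
df = df.sort_values(by=['CPD_SMILES'])  # place competing identifiers next to each other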

In [16]:
# Non-essential control step: does every duplicated SMILES occur in exactly two rows?
counts = df.CPD_SMILES.value_counts()
ratio = bool((counts == 2).all())
uniqueSMILES = counts.shape[0]

print('#MBS/#SMILES=2? {}!\t#uniqueSMILES: {}'.format(ratio, uniqueSMILES))

#MBS/#SMILES=2? True!	#uniqueSMILES: 204
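
Note that `value_counts()` counts rows per SMILES, not distinct identifiers per SMILES. A stricter variant of the control step, sketched here on made-up toy data, counts distinct Metadata_broad_sample values per SMILES with `groupby(...).nunique()`. The two checks can disagree when the same identifier is listed twice for one SMILES.

```python
import pandas as pd

# Toy data (made-up values): CCN has two rows but only one distinct identifier
toy = pd.DataFrame({
    "CPD_SMILES": ["CCO", "CCO", "CCN", "CCN"],
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2", "TOY-B-1", "TOY-B-1"],
})

# Row-based check, as in the control step above
rows_per_smiles = toy["CPD_SMILES"].value_counts()
all_two_rows = bool((rows_per_smiles == 2).all())

# Identifier-based check: distinct identifiers per SMILES
ids_per_smiles = toy.groupby("CPD_SMILES")["Metadata_broad_sample"].nunique()
all_two_ids = bool((ids_per_smiles == 2).all())

print(all_two_rows, all_two_ids)
```

For the real data, the 408 unique identifiers reported further down are consistent with two distinct identifiers for each of the 204 SMILES.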


In [15]:
df.CPD_SMILES.value_counts()

COCC(=O)O[C@]1(CCN(C)CCCc2nc3ccccc3[nH]2)CCc2cc(F)ccc2[C@@H]1C(C)C             2
CCC(CO)NC(=O)C1CN(C)C2Cc3c[nH]c4cccc(C2=C1)c34                                 2
C[C@H](CO)N1C[C@@H](C)[C@@H](CN(C)C(=O)Nc2ccnc(c2)N2CCOCC2)OCc2cnnn2CCCC1=O    2
C[C@@H](CO)N1C[C@@H](C)[C@H](CN(C)C(=O)Oc2ccc(C)cc2)OCc2cnnn2CCCC1=O           2
C[C@H](CO)N1C[C@@H](C)[C@H](CN(C)C(=O)Nc2ccccc2)OCc2cn(CCCC1=O)nn2             2
                                                                              ..
C[C@@H](CO)N1C[C@H](C)[C@@H](CN(C)C(=O)Nc2ccccc2C(F)(F)F)OCc2cn(CCCC1=O)nn2    2
C[C@@H](CO)N1C[C@H](C)[C@@H](CN(C)C(=O)Nc2ccc(cc2)N2CCCCC2)OCc2cnnn2CCCC1=O    2
[O-][N+](=O)c1ccc2nc(ccc2c1)N1CCNCC1                                           2
CC(NCCC(c1ccccc1)c1ccccc1)c1ccccc1                                             2
C[C@H](CO)N1C[C@@H](C)[C@H](CN(C)C(=O)c2ccncc2)OCc2cnnn2CCCC1=O                2
Name: CPD_SMILES, Length: 204, dtype: int64

In [6]:
df.Metadata_broad_sample.unique().shape

(408,)

In [15]:
df

Unnamed: 0,CPD_SMILES,Metadata_broad_sample,duplicate
22686,CC#C[C@]1(O)CC[C@H]2[...,BRD-K37270826-001-15-1,True
1620,CC#C[C@]1(O)CC[C@H]2[...,BRD-K37270826-001-03-7,True
20859,CC(C)(C)c1ccc(cc1)C(O...,BRD-A06352418-001-17-6,True
906,CC(C)(C)c1ccc(cc1)C(O...,BRD-A06352418-001-03-6,True
10116,CC(C)COC(=O)N(C)C[C@@...,BRD-K83190598-001-02-8,True
...,...,...,...
1734,OC(=O)c1ccc2nc(-c3ccc...,BRD-K32584078-001-02-3,True
27156,OC[C@H]1N[C@H](C#N)[C...,BRD-K39376673-001-02-6,True
27468,OC[C@H]1N[C@H](C#N)[C...,BRD-K39376673-001-04-2,True
18407,[O-][N+](=O)c1ccc2nc(...,BRD-K95821857-050-03-5,True
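
In the identifier pairs visible above, the two competing Metadata_broad_sample values share their first two hyphen-separated fields and differ only in the trailing ones. The sketch below makes that comparison explicit for the pairs shown. The column name `id_prefix` is hypothetical, and whether this prefix is a reliable compound key for the whole file is an assumption that would have to be checked against the full data.

```python
import pandas as pd

# Identifier pairs copied from the rows displayed above
pairs = pd.DataFrame({
    "Metadata_broad_sample": [
        "BRD-K37270826-001-15-1", "BRD-K37270826-001-03-7",
        "BRD-A06352418-001-17-6", "BRD-A06352418-001-03-6",
        "BRD-K39376673-001-02-6", "BRD-K39376673-001-04-2",
    ],
})

# Hypothetical column: the first two hyphen-separated fields of the identifier
pairs["id_prefix"] = (
    pairs["Metadata_broad_sample"].str.split("-").str[:2].str.join("-")
)

# Each prefix should cover exactly the two identifiers of one pair
print(pairs["id_prefix"].value_counts())
```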


In [5]:
# 3. Merge with Raw Data
meta_data_cols = ['Metadata_broad_sample', 'Metadata_Plate','Metadata_Plate_Map_Name',
                  'Metadata_mmoles_per_liter', 'Metadata_pert_id','Metadata_pert_mfc_id', 'Metadata_pert_well']

raw = pd.read_csv("../../input/Raw_Image_data.txt", usecols=meta_data_cols, sep='\t')
df = pd.merge(left=df, right=raw, how='outer', on='Metadata_broad_sample')
# rows that exist only in the raw data have no 'duplicate' flag; drop them
df = df.dropna(subset=['duplicate'])
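
An outer merge followed by dropping the rows without a 'duplicate' flag keeps the same rows as a left merge. The sketch below illustrates this on made-up toy frames and also uses the `indicator` and `validate` parameters of `pd.merge` to make the merge easier to audit. Using `validate="one_to_many"` assumes each identifier occurs only once in the left frame, which holds for the toy data and would need to be confirmed for the real one.

```python
import pandas as pd

# Toy frames (made-up values)
left = pd.DataFrame({
    "CPD_SMILES": ["CCO", "CCO"],
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2"],
    "duplicate": [True, True],
})
right = pd.DataFrame({
    "Metadata_broad_sample": ["TOY-A-1", "TOY-A-2", "TOY-C-1"],
    "Metadata_Plate": ["plate_1", "plate_2", "plate_3"],
})

# Variant used in this notebook: outer merge, then drop right-only rows
outer = pd.merge(left, right, how="outer", on="Metadata_broad_sample")
outer = outer.dropna(subset=["duplicate"])

# Alternative: left merge; '_merge' records where each row came from
left_merged = pd.merge(
    left, right, how="left", on="Metadata_broad_sample",
    indicator=True, validate="one_to_many",
)

print(len(outer), len(left_merged))
print(left_merged["_merge"].value_counts())
```

A `_merge` value of `left_only` in the alternative would point to an identifier from the annotations that never appears in the raw data.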

In [6]:
# 4. Inspect Metadata Columns to find Patterns

pd.set_option('display.max_colwidth', 25)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df[['CPD_SMILES','Metadata_broad_sample', 'Metadata_Plate_Map_Name']])

                    CPD_SMILES   Metadata_broad_sample Metadata_Plate_Map_Name
0     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-15-1       C-2113-01-D39-030
1     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-15-1       C-2113-01-D39-030
2     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-15-1       C-2113-01-D39-030
3     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
4     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
5     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
6     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
7     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
8     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
9     CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
10    CC#C[C@]1(O)CC[C@H]2[...  BRD-K37270826-001-03-7            H-BIOA-004-3
11    CC(C)(C)c1ccc(cc1)C(O...  BRD-A06352418-001-17