In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
from rdkit import Chem
from rdkit.Chem import PandasTools
from IPython.display import HTML

# Fragment Error Analysis 1 (Lipophilicity-ID)
Date: January 21, 2021
## Objectives
Determine the sensitivity of this fragment analysis to the percentile threshold that separates "hard" and "easy" molecules.

### Approach
1. Split molecules into "easy to predict" and "hard to predict"
    1. Percentile threshold
    2. This might need to be **dataset specific**.  Molecules or fragments that are difficult to predict for one
      property may not be difficult for the next, so these effects can cancel out when errors are averaged across properties.


2. Compare and contrast fragments from these groups.
    1. Are the most common (by number of appearances) the same?

3. Remove highly conserved fragments.  Fragments that are present in both the easy-to-predict and the hard-to-predict molecule sets
 are removed.
    1. This might remove all fragments?
    2. Maybe remove the top `n` most frequent or the top `X%` most frequent

4. Identify which fragments are most popular based on relationship counts and relationship error weights.


5.  Analyze results.
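
The approach above can be sketched in plain Python on toy data. This is only an illustration of the steps, not the notebook's implementation (the analysis itself runs in Cypher): the `molecules` dictionary, the fragment names `A` to `E`, the fixed `cutoff`, and the function `hard_only_fragments` are all hypothetical.

```python
from collections import Counter


def hard_only_fragments(molecules, cutoff, top_n=1000):
    """Return stats for fragments frequent in hard molecules but not in easy ones.

    molecules: dict of molecule id -> (difficulty, list of fragment names).
    Each row is (fragment, number_of_rel, sum_difficulty, avg_difficulty).
    """
    easy_counts, hard_counts = Counter(), Counter()
    for difficulty, frags in molecules.values():
        unique = list(dict.fromkeys(frags))  # one relationship per molecule-fragment pair
        if difficulty < cutoff:
            easy_counts.update(unique)
        elif difficulty > cutoff:
            hard_counts.update(unique)

    # Keep the top-n most frequent fragments of each group, then subtract easy from hard.
    easy_top = {name for name, _ in easy_counts.most_common(top_n)}
    hard_top = [name for name, _ in hard_counts.most_common(top_n)]
    remaining = [name for name in hard_top if name not in easy_top]

    rows = []
    for name in remaining:
        # Difficulty of every molecule in the dataset that contains this fragment.
        diffs = [d for d, frags in molecules.values() if name in frags]
        rows.append((name, len(diffs), sum(diffs), sum(diffs) / len(diffs)))

    # Most relationships first, ties broken by higher average difficulty.
    rows.sort(key=lambda r: (-r[1], -r[3]))
    return rows


# Hypothetical toy dataset: two easy and two hard molecules around cutoff = 1.0.
molecules = {
    "m1": (0.2, ["A", "B"]),
    "m2": (0.3, ["A", "C"]),
    "m3": (1.5, ["A", "D"]),
    "m4": (2.0, ["D", "E"]),
}
result = hard_only_fragments(molecules, cutoff=1.0)
for row in result:
    print(row)
```

Fragment `A` appears in both groups and is dropped, which is the behaviour step 3 is meant to produce.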

#  Fragment Analysis
Ideally, molecule `difficulty` would be calculated on the fly when the analysis runs. A user may want to know which fragments are difficult for a particular chemical property, such as logP. In that scenario, the `difficulty` property should only consider logP errors. Storing it, however, leaves a user-query-specific property persisting in the mother graph, which is undesirable. My less-than-elegant solution is as follows:
1. Remove all `difficulty` weights
2. Make new `difficulty` weights for the chemical property of interest
3. Run Fragment Analysis
4. Remove the `difficulty` weights
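
One way to make sure step 4 is never skipped is to wrap the four steps in a context manager. The sketch below assumes a hypothetical callable `run(query, params)` that executes a single Cypher statement (for example, a thin wrapper around a database session); `difficulty_weights`, `fake_run`, and the `"ANALYSIS"` placeholder are illustrative names, not part of this notebook.

```python
from contextlib import contextmanager

REMOVE_WEIGHTS = (
    "MATCH (M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment) "
    "REMOVE M.difficulty, f.difficulty"
)
CREATE_WEIGHTS = (
    "MATCH (D:DataSet{data: $data})-[:SPLITS_INTO_TEST]->(T:TestSet)"
    "-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment) "
    "WITH avg(p.average_error) AS difficulty, f, M, F "
    "SET M.difficulty = difficulty, f.difficulty = difficulty"
)


@contextmanager
def difficulty_weights(run, dataset):
    """Create property-specific `difficulty` weights and always remove them afterwards."""
    run(REMOVE_WEIGHTS, {})                  # step 1: remove old weights
    run(CREATE_WEIGHTS, {"data": dataset})   # step 2: new weights for this dataset
    try:
        yield                                # step 3: caller runs the fragment analysis
    finally:
        run(REMOVE_WEIGHTS, {})              # step 4: clean up, even if the analysis fails


# Usage with a stand-in runner that only records what would be executed.
calls = []


def fake_run(query, params):
    calls.append((query, params))


with difficulty_weights(fake_run, "Lipophilicity-ID.csv"):
    fake_run("ANALYSIS", {"data": "Lipophilicity-ID.csv"})

print(len(calls), "statements issued")
```

The `try`/`finally` keeps the mother graph free of query-specific properties even when the analysis query raises an error.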

## Streamlined Cypher Procedure
Cypher commands can be run as a batch by separating them with `;`, but their outputs will be suppressed, so the command that returns your results should be run by itself. The first three commands can be run together, however. They remove old weights, set the dataset of interest, and create new weights for use with the analysis.

### Prepare Graph
```cypher
// Delete old weights
MATCH (M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
REMOVE M.difficulty, f.difficulty                      
RETURN M, F, f;

// Set the Dataset you are interested in
:param data => "Lipophilicity-ID.csv"; // must be in separate command from MATCH

// Make new weights for Dataset
MATCH (D:DataSet{data: $data})-[:SPLITS_INTO_TEST]->(T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
WITH avg(p.average_error) as difficulty, f, M, F
SET M.difficulty = difficulty                      
SET f.difficulty = difficulty
RETURN M, F, f;
```
### Run Fragment Analysis
The command below runs the fragment analysis and returns, for each fragment, the number of relationships, the sum of their errors, and the average error. The percentile threshold is set in this command, as the second argument to `percentileCont`. Vary it here.

```cypher
// Remove Common Fragments
MATCH (D:DataSet{data: $data})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WITH  percentileCont(M.difficulty, 0.90) as cutoff

MATCH (D:DataSet{data: $data})-[c:CONTAINS_MOLECULE]->(eM:Molecule)-[ef:HAS_FRAGMENT]->(eF:Fragment)
WHERE eM.difficulty < cutoff // easy molecules
WITH eF, count(ef) as efreq, cutoff // gather frags and frequency
ORDER BY efreq DESC LIMIT 1000  //  limit to top n
WITH  collect(eF) as easyFrags, cutoff

MATCH (D:DataSet{data: $data})-[c:CONTAINS_MOLECULE]->(hM:Molecule)-[hf:HAS_FRAGMENT]->(hF:Fragment)
WHERE hM.difficulty > cutoff // hard molecules
WITH hF, count(hf) as hfreq, easyFrags
ORDER BY hfreq DESC LIMIT 1000
WITH collect(hF) as hardFrags, easyFrags

// use APOC to do list intersect & subtraction
WITH apoc.coll.intersection(easyFrags, hardFrags) as overlap, apoc.coll.subtract(hardFrags, easyFrags) as remain 

// Find Molecule-Fragment pairs that have the remaining fragments and are in the dataset
UNWIND remain as rFrags
MATCH (D:DataSet{data: $data})-[c:CONTAINS_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(rFrags)
WITH M, rFrags
MATCH (M)-[f:HAS_FRAGMENT]->(rFrags)

// Get Difficulty Stats for Remaining Fragments
WITH rFrags.name as fragment, count(f) as number_of_rel, sum(f.difficulty) as sum_difficulty,sum(f.difficulty)/count(f) as avg_difficulty// , M, rFrags,f 
RETURN fragment, number_of_rel, sum_difficulty, avg_difficulty
ORDER BY number_of_rel DESC, avg_difficulty DESC                                                       
```
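
The cutoff in the query comes from `percentileCont`, which interpolates linearly between neighbouring values. The sketch below shows how the threshold moves with the percentile on made-up difficulty values. It assumes the common rank convention `p * (n - 1)`; the function name `percentile_cont` and the sample values are hypothetical, so treat it as an aid for reasoning about the cutoff rather than a reproduction of the database's exact result.

```python
def percentile_cont(values, p):
    """Linearly interpolated percentile for p in [0, 1] (rank = p * (n - 1))."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)          # fractional rank into the sorted list
    lo = int(pos)                         # index of the value just below the rank
    hi = min(lo + 1, len(ordered) - 1)    # index of the value just above the rank
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Hypothetical molecule difficulties (average prediction errors).
difficulties = [0.1, 0.4, 0.5, 0.9, 1.3]
cutoffs = {p: percentile_cont(difficulties, p) for p in (0.50, 0.75, 0.90)}
for p, cutoff in cutoffs.items():
    print(f"{p:.2f} -> {cutoff:.3f}")
```

Lowering the percentile lowers the cutoff, so more molecules count as "hard" and fewer as "easy", which changes both fragment lists that feed the subtraction step.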
### Clean up
***RUN THIS AT THE END to clean up after yourself!***

```cypher
// Delete weights again
MATCH (D:DataSet{data: $data})-[:SPLITS_INTO_TEST]->(T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
// REMOVE takes property names only; no value is assigned
REMOVE M.difficulty, f.difficulty
RETURN M, F, f LIMIT 20;
```

## The Fragment Analysis Results (with errant models removed)
Upon running the fragment analysis, I saved the Cypher results as a CSV. The Cypher query sorts the fragments first by the number of incoming relationships, i.e., how many molecules have that fragment, and then by the average prediction difficulty. The results are below.

### 90 Percentile Threshold

In [16]:
pd.options.display.float_format = '{:,.3f}'.format
path = 'LipoID/'
frags90 = pd.read_csv(path + 'Frag_Analysis_lipoID2.csv')
frags90 = frags90.rename(columns={'number_of_rel': 'rels'})
frags90 = frags90.drop(columns=['sum_difficulty'])
frags90.head(15)
# print(frags.head(10).to_latex(index=False))

Unnamed: 0,fragment,rels,avg_difficulty
0,cc<-X>cNC,68,0.769
1,ccnc[nH],67,0.77
2,c<=O>nCc,67,0.737
3,CCCC<-O>,65,0.76
4,c<-X>cC<=O>NC,65,0.748
5,ccCnc<=O>,65,0.73
6,ccC<=O>C,65,0.713
7,cc-c(c)s,65,0.711
8,c<-O>c,64,1.033
9,ncc[nH],64,0.79


**Sort by average difficulty**

In [17]:
frags90.sort_values(by="avg_difficulty", ascending=False).head(15)

Unnamed: 0,fragment,rels,avg_difficulty
199,ccc(c<-OMe>)S(<=O>)<=O>,15,1.275
186,ccN(C)S(<=O>)<=O>,21,1.23
196,CCN(c)S(<=O>)<=O>,16,1.225
176,CC<=O>OC,27,1.212
165,CC<=O>O,30,1.205
187,c<-O>cc(c)c,21,1.205
197,cNC<=O>Nc,16,1.159
189,c<-OMe>cS(<=O>)<=O>,20,1.131
190,cc<-OMe>cS(<=O>)<=O>,20,1.131
191,c<-OMe>cS(<=O>)<=O>N,20,1.131


### 75 Percentile Threshold

In [18]:
frags75 = pd.read_csv(path + 'Frag_Analysis_lipoID2_75.csv')
frags75 = frags75.rename(columns={'number_of_rel': 'rels'})
frags75 = frags75.drop(columns=['sum_difficulty'])

frags75.head(15)

Unnamed: 0,fragment,rels,avg_difficulty
0,CCOC<=O>,87,0.998
1,CC(C)(C)C,78,0.883
2,cnnc-c,76,0.967
3,cC<-O>CN,75,0.845
4,ccC<-O>CN,75,0.845
5,CCCC<=O>,74,0.872
6,cOCCN,73,0.813
7,c-c[nH],73,0.742
8,cC<-O>CNC,72,0.845
9,c<-X>cNC,72,0.784


**Sort by average difficulty**

In [19]:
frags75.sort_values(by="avg_difficulty", ascending=False).head(15)

Unnamed: 0,fragment,rels,avg_difficulty
166,CC<=O>OC,27,1.212
165,CC<=O>O,30,1.205
150,c<-O>cC,41,1.073
158,[nH]ccn,38,1.067
147,cc(c)NS(<=O>)<=O>,43,1.054
30,c<-O>c,64,1.033
162,cccc<-NO2>,36,1.018
127,cc([nH])cn,48,1.015
139,COC<=O>,46,1.01
156,CCCOC<=O>,39,1.002


### 50 Percentile Threshold

In [20]:
frags50 = pd.read_csv(path + 'Frag_Analysis_lipoID2_50.csv')
frags50 = frags50.rename(columns={'number_of_rel': 'rels'})
frags50 = frags50.drop(columns=['sum_difficulty'])
frags50.head(15)

Unnamed: 0,fragment,rels,avg_difficulty
0,c<-O>cc,101,0.942
1,c-cnn,95,0.942
2,cc-cnn,90,0.966
3,CCOC<=O>,87,0.998
4,cc(c)S,80,0.912
5,ccc(c)S,80,0.912
6,CCCcn,80,0.859
7,cc(n)NC<=O>,80,0.823
8,CC(C)(C)C,78,0.883
9,c<-X>c,78,0.811


**Sort by average difficulty**

In [21]:
frags50.sort_values(by="avg_difficulty", ascending=False).head(15)

Unnamed: 0,fragment,rels,avg_difficulty
137,c<-O>cC,41,1.073
62,c<-O>c,64,1.033
134,COC<=O>,46,1.01
3,CCOC<=O>,87,0.998
109,cc(C)c-c,57,0.984
11,cnnc-c,76,0.967
2,cc-cnn,90,0.966
130,Ccc(C)n,49,0.958
135,c<-C(=O)O>cc,46,0.942
0,c<-O>cc,101,0.942


In [22]:
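from IPython.display import display_html
def display_side_by_side(*args):
    """Render several DataFrames next to each other in one output cell."""
    html_str = ''
    for df in args:
        html_str += df.to_html()
    # Style only the opening <table> tags so the closing tags stay valid HTML
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)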

In [23]:
def df_cleanup(df):
    """Rename `number_of_rel` to `rels` and drop the `sum_difficulty` column."""
    df = df.rename(columns={'number_of_rel': 'rels'})
    df = df.drop(columns=['sum_difficulty'])
    return df


In [24]:
frags90h = df_cleanup(pd.read_csv(path + 'Frag_Analysis_lipoID2_nohlim.csv'))
frags75h = df_cleanup(pd.read_csv(path + 'Frag_Analysis_lipoID2_nohlim_75.csv'))
frags50h = df_cleanup(pd.read_csv(path + 'Frag_Analysis_lipoID2_nohlim_50.csv'))

### Compare side by side
Sorted by the total number of relationships to the fragment.

In [25]:
# sum_difficulty was already dropped when the frames were loaded, so no column filter is needed
display_side_by_side(
    frags90.head(25),
    frags75.head(25),
    frags50.head(25),
)

Unnamed: 0,fragment,rels,avg_difficulty
0,cc<-X>cNC,68,0.769
1,ccnc[nH],67,0.77
2,c<=O>nCc,67,0.737
3,CCCC<-O>,65,0.76
4,c<-X>cC<=O>NC,65,0.748
5,ccCnc<=O>,65,0.73
6,ccC<=O>C,65,0.713
7,cc-c(c)s,65,0.711
8,c<-O>c,64,1.033
9,ncc[nH],64,0.79

Unnamed: 0,fragment,rels,avg_difficulty
0,CCOC<=O>,87,0.998
1,CC(C)(C)C,78,0.883
2,cnnc-c,76,0.967
3,cC<-O>CN,75,0.845
4,ccC<-O>CN,75,0.845
5,CCCC<=O>,74,0.872
6,cOCCN,73,0.813
7,c-c[nH],73,0.742
8,cC<-O>CNC,72,0.845
9,c<-X>cNC,72,0.784

Unnamed: 0,fragment,rels,avg_difficulty
0,c<-O>cc,101,0.942
1,c-cnn,95,0.942
2,cc-cnn,90,0.966
3,CCOC<=O>,87,0.998
4,cc(c)S,80,0.912
5,ccc(c)S,80,0.912
6,CCCcn,80,0.859
7,cc(n)NC<=O>,80,0.823
8,CC(C)(C)C,78,0.883
9,c<-X>c,78,0.811


**Sorted by Average Difficulty**

In [29]:
display_side_by_side(
    frags90.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(35),
    frags75.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(25),
    frags50.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(25),
)

Unnamed: 0,fragment,rels,avg_difficulty
0,ccc(c<-OMe>)S(<=O>)<=O>,15,1.275
1,ccN(C)S(<=O>)<=O>,21,1.23
2,CCN(c)S(<=O>)<=O>,16,1.225
3,CC<=O>OC,27,1.212
4,CC<=O>O,30,1.205
5,c<-O>cc(c)c,21,1.205
6,cNC<=O>Nc,16,1.159
7,c<-OMe>cS(<=O>)<=O>,20,1.131
8,cc<-OMe>cS(<=O>)<=O>,20,1.131
9,c<-OMe>cS(<=O>)<=O>N,20,1.131

Unnamed: 0,fragment,rels,avg_difficulty
0,CC<=O>OC,27,1.212
1,CC<=O>O,30,1.205
2,c<-O>cC,41,1.073
3,[nH]ccn,38,1.067
4,cc(c)NS(<=O>)<=O>,43,1.054
5,c<-O>c,64,1.033
6,cccc<-NO2>,36,1.018
7,cc([nH])cn,48,1.015
8,COC<=O>,46,1.01
9,CCCOC<=O>,39,1.002

Unnamed: 0,fragment,rels,avg_difficulty
0,c<-O>cC,41,1.073
1,c<-O>c,64,1.033
2,COC<=O>,46,1.01
3,CCOC<=O>,87,0.998
4,cc(C)c-c,57,0.984
5,cnnc-c,76,0.967
6,cc-cnn,90,0.966
7,Ccc(C)n,49,0.958
8,c<-C(=O)O>cc,46,0.942
9,c<-O>cc,101,0.942
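
One simple way to put a number on the threshold sensitivity is the Jaccard overlap between the top-ranked fragment sets at two thresholds. The sketch below uses only the ten rows shown above for each threshold (sorted by average difficulty), not the full CSV files, so it illustrates the calculation rather than giving a result for the complete analysis. The helper `jaccard` is a hypothetical name.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of fragment names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Top ten fragments by avg_difficulty, copied from the three tables above.
top90 = [
    "ccc(c<-OMe>)S(<=O>)<=O>", "ccN(C)S(<=O>)<=O>", "CCN(c)S(<=O>)<=O>",
    "CC<=O>OC", "CC<=O>O", "c<-O>cc(c)c", "cNC<=O>Nc",
    "c<-OMe>cS(<=O>)<=O>", "cc<-OMe>cS(<=O>)<=O>", "c<-OMe>cS(<=O>)<=O>N",
]
top75 = [
    "CC<=O>OC", "CC<=O>O", "c<-O>cC", "[nH]ccn", "cc(c)NS(<=O>)<=O>",
    "c<-O>c", "cccc<-NO2>", "cc([nH])cn", "COC<=O>", "CCCOC<=O>",
]
top50 = [
    "c<-O>cC", "c<-O>c", "COC<=O>", "CCOC<=O>", "cc(C)c-c",
    "cnnc-c", "cc-cnn", "Ccc(C)n", "c<-C(=O)O>cc", "c<-O>cc",
]

overlaps = {
    "90 vs 75": jaccard(top90, top75),
    "75 vs 50": jaccard(top75, top50),
    "90 vs 50": jaccard(top90, top50),
}
for pair, score in overlaps.items():
    print(f"{pair}: {score:.3f}")
```

A score near 1 would mean the ranking barely depends on the threshold; a score near 0 means the two thresholds surface almost entirely different fragments.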


### Analysis with no limit on hard frags
**Sorted by Relationship Count**

In [27]:
# df_cleanup already dropped sum_difficulty, so no column filter is needed
display_side_by_side(
    frags90h.head(25),
    frags75h.head(25),
    frags50h.head(25),
)

Unnamed: 0,fragment,rels,avg_difficulty
0,cc<-X>cNC,68,0.769
1,ccnc[nH],67,0.77
2,c<=O>nCc,67,0.737
3,CCCC<-O>,65,0.76
4,c<-X>cC<=O>NC,65,0.748
5,ccCnc<=O>,65,0.73
6,ccC<=O>C,65,0.713
7,cc-c(c)s,65,0.711
8,c<-O>c,64,1.033
9,ncc[nH],64,0.79

Unnamed: 0,fragment,rels,avg_difficulty
0,CCOC<=O>,87,0.998
1,CC(C)(C)C,78,0.883
2,cnnc-c,76,0.967
3,cC<-O>CN,75,0.845
4,ccC<-O>CN,75,0.845
5,CCCC<=O>,74,0.872
6,cOCCN,73,0.813
7,c-c[nH],73,0.742
8,cC<-O>CNC,72,0.845
9,c<-X>cNC,72,0.784

Unnamed: 0,fragment,rels,avg_difficulty
0,c<-O>cc,101,0.942
1,c-cnn,95,0.942
2,cc-cnn,90,0.966
3,CCOC<=O>,87,0.998
4,cc(c)S,80,0.912
5,ccc(c)S,80,0.912
6,CCCcn,80,0.859
7,cc(n)NC<=O>,80,0.823
8,CC(C)(C)C,78,0.883
9,c<-X>c,78,0.811


**Sorted by Average Difficulty**

In [28]:
display_side_by_side(
    frags90h.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(25),
    frags75h.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(25),
    frags50h.sort_values(by="avg_difficulty", ignore_index=True, ascending=False).head(25),
)

Unnamed: 0,fragment,rels,avg_difficulty
0,CC(<-C(=O)O>)<-N>,1,3.334
1,CC(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
2,C(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
3,CC(<-C(=O)O>)<-N>CO,1,3.334
4,COP(<-O>)(<-O>)<=O>,1,3.334
5,C(<-C(=O)O>)<-N>CO,1,3.334
6,CC(<-C(=O)O>)<-N>C,1,3.334
7,c(<=NCH3>)<=NCH3>cccn,1,3.177
8,c(<=NCH3>)<=NCH3>cc(-c)s,1,3.177
9,cc(<=NCH3>)<=NCH3>ccs,1,3.177

Unnamed: 0,fragment,rels,avg_difficulty
0,CC(<-C(=O)O>)<-N>,1,3.334
1,CC(<-C(=O)O>)<-N>C,1,3.334
2,C(<-C(=O)O>)<-N>CO,1,3.334
3,COP(<-O>)(<-O>)<=O>,1,3.334
4,CC(<-C(=O)O>)<-N>CO,1,3.334
5,C(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
6,CC(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
7,cc(<=NCH3>)<=NCH3>c,1,3.177
8,cccc(<=NCH3>)<=NCH3>c,1,3.177
9,cccc(<=NCH3>)<=NCH3>,1,3.177

Unnamed: 0,fragment,rels,avg_difficulty
0,CC(<-C(=O)O>)<-N>,1,3.334
1,CC(<-C(=O)O>)<-N>C,1,3.334
2,C(<-C(=O)O>)<-N>CO,1,3.334
3,COP(<-O>)(<-O>)<=O>,1,3.334
4,CC(<-C(=O)O>)<-N>CO,1,3.334
5,C(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
6,CC(<-C(=O)O>)<-N>COP(<-O>)(<-O>)<=O>,1,3.334
7,cc(<=NCH3>)<=NCH3>c,1,3.177
8,cccc(<=NCH3>)<=NCH3>c,1,3.177
9,cccc(<=NCH3>)<=NCH3>,1,3.177
