In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display

# Page Rank Analysis 2
Date: October 12, 2020
## Objectives
Isolate fragment impact from fragment frequency.  The idea is to minimize the impact of highly frequent fragments
such as `ccc`.

### Approach
1. Split molecules into "easy to predict" and "hard to predict"
    1. Top and bottom quartiles of scaled average error
    2. This might need to be **dataset specific**.  Molecules or fragments that are difficult to predict for one
      property may not be difficult for the next.  These effects will offset in an average error.
      Try logP14k without scaled error in next attempt.

2. Compare and contrast fragments from these groups.
    1. Are the most common (by number of appearances) the same?

3. Remove highly conserved fragments.  Fragments that are present in both the easy and hard to predict molecule sets
 are removed.
    1. This might remove all fragments?
    2. Maybe remove the top `n` most frequent or the top `X%` most frequent

4. Create graph projection with remaining molecules and fragments. Create unweighted and weighted graphs.

5. Run PageRank algorithm on both graph projections.
    1. Return fragments rank and score.  Collect results in CSV.

6.  Analyze results.

### Grouping Molecules by Prediction Error
I need to collect statistics on each molecule's average prediction error.  To keep things simple and minimize variables,
I am going to use only the `Lipophilicity` dataset.  Currently, 11 models have used this dataset.

**Make a difficulty property based on molecule predictions.**  This will be used to categorize molecules as hard to predict.  In the query below, I do not use the `scaled average error` because I am only looking at a single dataset.  This is not ideal, since I am writing directly to the molecule node, which may be part of more than one dataset.
```cypher
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[]-(T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)
// Group by molecule only, so the average spans all of its predictions
WITH M, avg(p.average_error) as difficulty
SET M.difficulty = difficulty
RETURN M.smiles, M.difficulty
```

**Find the difficult to predict molecules.**  This query will find the molecules above the 90th percentile.  In other words, the 10% of molecules with the highest average error.
```cypher
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WITH  percentileCont(M.difficulty, 0.90) as cutoff
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WHERE M.difficulty > cutoff
RETURN  id(M) as NodeID, M.smiles as SMILES , M.difficulty as Difficulty, cutoff ORDER BY
M.difficulty DESC LIMIT 100
```
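
The easy/hard split can also be sketched outside the database. The snippet below is a minimal illustration, assuming a plain dictionary of molecule identifiers and difficulty values (all names and numbers are made up). It uses a linearly interpolated percentile, which is how `percentileCont` computes its result, and the same strict inequalities as the queries here.

```python
def percentile_cont(values, p):
    """Linearly interpolated percentile, with p given in [0, 1]."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical molecule id -> difficulty (average prediction error)
errors = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.80, 1.50]
difficulty = {f"mol_{i}": err for i, err in enumerate(errors)}

cutoff = percentile_cont(difficulty.values(), 0.90)
hard = {mol for mol, err in difficulty.items() if err > cutoff}
easy = {mol for mol, err in difficulty.items() if err < cutoff}

print(f"cutoff={cutoff:.2f}, hard={sorted(hard)}, n_easy={len(easy)}")
```

Because both comparisons are strict, a molecule whose difficulty equals the cutoff exactly lands in neither group.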

**Find the most common fragments.** We want to subtract the most common fragments in the bottom 90% from the fragments in the top 10% most difficult molecules.  But first we must identify what fragments are most common.

**Most common fragments in easy molecules**
```cypher
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WITH  percentileCont(M.difficulty, 0.90) as cutoff, count(M) as total
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
WHERE M.difficulty < cutoff
RETURN F.name, count(f), 0.9 * total as Total, toFloat(count(f)) / 0.9 / total * 100 as percent ORDER BY count(f) DESC LIMIT 100
```
**Most common fragments in hard molecules**
```cypher
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WITH  percentileCont(M.difficulty, 0.90) as cutoff, count(M) as total
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
WHERE M.difficulty > cutoff
RETURN F.name, count(f), 0.1 * total as Total, toFloat(count(f)) / 0.1 / total * 100 as percent ORDER BY count(f) DESC LIMIT 100
```

The Cypher queries above find the most frequent fragments in each group, based on the number of `HAS_FRAGMENT` relationships each fragment has to molecules in that group.  They also estimate the percentage of molecules in the group that contain each fragment, taking the group size as 90% (easy) or 10% (hard) of the dataset.
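
As a rough sketch of the percentage calculation, here is the same idea in Python on a toy group. The molecule and fragment names are placeholders, and each molecule is assumed to list a fragment at most once. Unlike the queries, this version divides by the actual group size rather than a fixed fraction of the dataset.

```python
from collections import Counter

# Hypothetical molecule -> set of fragments, for a single group
group = {
    "mol_a": {"ccc", "C=O"},
    "mol_b": {"ccc", "CN"},
    "mol_c": {"ccc"},
    "mol_d": {"C=O"},
}

# Number of molecules in the group that contain each fragment
counts = Counter(frag for frags in group.values() for frag in frags)

# Percent of molecules in the group that contain each fragment
prevalence = {frag: 100 * n / len(group) for frag, n in counts.most_common()}

for frag, pct in prevalence.items():
    print(f"{frag}: {pct:.1f}%")
```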

### Removing fragments that overlap molecular groups
Next we need to find which fragments are common to both the high-error and low-error sets, then isolate the ones that are more frequent in the high-error group.

I think there are several ways we could go about making rules for which fragments to remove.

1. We could remove the `n` most common fragments in the easy group from the hard group.

2. Remove fragments with a prevalence above a threshold, say 25%, i.e., if a fragment is present in 25% or more of the easy molecules, remove it.

3. We could remove fragments that have the same prevalence (within a threshold, say 2%) in both the hard and easy sets.

4. Remove all fragments present in the easy group from the hard group.  This will remove the most and leave fragments that *only* exist in the hard group.

Let's start with the first approach.

```cypher
MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(M:Molecule)
WITH  percentileCont(M.difficulty, 0.90) as cutoff

MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(eM:Molecule)-[ef:HAS_FRAGMENT]->(eF:Fragment)
WHERE eM.difficulty < cutoff // easy molecules
WITH eF, count(ef) as efreq, cutoff // gather frags and frequency
ORDER BY efreq DESC LIMIT 1000  //  limit to top n
WITH  collect(eF) as easyFrags, cutoff

MATCH (D:DataSet{data:'Lipophilicity-ID.csv'})-[c:CONTAINS_MOLECULE]->(hM:Molecule)-[hf:HAS_FRAGMENT]->(hF:Fragment)
WHERE hM.difficulty > cutoff // hard molecules
WITH hF, count(hf) as hfreq, easyFrags
ORDER BY hfreq DESC LIMIT 1000
WITH collect(hF) as hardFrags, easyFrags

WITH apoc.coll.intersection(easyFrags, hardFrags) as overlap, apoc.coll.subtract(hardFrags, easyFrags) as remain  // use APOC to do list intersect & subtraction
RETURN size(remain), remain
```

The above query collects the 1000 most frequent fragments in both the easy and difficult groups.  Then it calculates the overlap between the sets and the difference between them. It returns what remains of the hard group once the easy fragments have been removed.
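
The top-`n` subtraction is easy to check by hand on a small example. This is only a sketch with invented fragment lists and a tiny `n`; the helper `top_n` is a hypothetical name, not part of the database workflow.

```python
from collections import Counter


def top_n(fragment_lists, n):
    """Return the n most frequent fragments across a group of molecules."""
    counts = Counter(frag for frags in fragment_lists for frag in frags)
    return [frag for frag, _ in counts.most_common(n)]


# Hypothetical fragment lists, one inner list per molecule
easy_mols = [["ccc", "C=O"], ["ccc", "CN"], ["ccc", "C=O"]]
hard_mols = [["ccc", "CS"], ["ccc", "C=O", "CBr"], ["CS", "C=O"]]

easy_top = set(top_n(easy_mols, n=2))
hard_top = top_n(hard_mols, n=3)

overlap = [frag for frag in hard_top if frag in easy_top]     # list intersection
remain = [frag for frag in hard_top if frag not in easy_top]  # list subtraction

print("overlap:", overlap)
print("remain:", remain)
```

Here only `CS` survives, because `ccc` and `C=O` are among the most common fragments in both groups. The choice of `n` controls how aggressive the filter is.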

You can `UNWIND` the resulting list to use the nodes in a `MATCH` query.
```cypher
UNWIND remain as rFrags
MATCH (M:Molecule)-[:HAS_FRAGMENT]->(rFrags)
RETURN M, rFrags
```

This will return the molecules that have those "difficult" fragments.

##  Graph Projections
We are going to use a Cypher graph projection. Ideally, we would be able to calculate molecule `difficulty` on the fly when creating the graph projection.  However, the Cypher commands for graph projection must be read-only, so the `difficulty` property must exist prior to projection.  The challenge with this is that the graph analysis may be focused on a particular dataset.  A user may want to know which fragments are difficult for a particular chemical property, such as logP.  In this scenario, the `difficulty` property should only consider logP errors.  But then we have a user-query specific property persisting in the mother graph, which is undesirable.  My less than elegant solution is as follows:
1. Remove all `difficulty` weights
2. Make new `difficulty` weights for the chemical property of interest
3. Create the Cypher graph projection(s)
4. Remove the `difficulty` weights
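
One way to make sure step 4 always runs, even if the projection fails, is to wrap the four steps in a context manager. The sketch below is only an illustration: `temporary_weights` and `run_query` are hypothetical names, and the runner here just records the query strings. A real runner would send them to Neo4j.

```python
from contextlib import contextmanager

REMOVE_WEIGHTS = "MATCH (:Molecule)-[h:HAS_FRAGMENT]->(:Fragment) REMOVE h.difficulty"


@contextmanager
def temporary_weights(run_query, set_weights_query):
    """Set property-specific weights, then always remove them again."""
    run_query(REMOVE_WEIGHTS)        # step 1: clear any old weights
    run_query(set_weights_query)     # step 2: weights for the property of interest
    try:
        yield                        # step 3: caller creates the projection(s)
    finally:
        run_query(REMOVE_WEIGHTS)    # step 4: clean up, even on failure


# Stand-in runner that only logs queries (placeholder strings, not real Cypher)
log = []
with temporary_weights(log.append, "SET_LOGP_WEIGHTS_PLACEHOLDER"):
    log.append("CREATE_PROJECTION_PLACEHOLDER")

print(log)
```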

### Remove Old Weights
```cypher
MATCH (:Molecule)-[h:HAS_FRAGMENT]->(:Fragment)
REMOVE h.difficulty
RETURN h limit 10
```

### Make New Weights for LogP
```cypher
MATCH (D:DataSet{data:"Lipophilicity-ID.csv"})-[:SPLITS_INTO_TEST]->(T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
WITH avg(p.average_error) as difficulty, f, M, F
SET f.difficulty = difficulty
RETURN M, F, f LIMIT 20
```

### Create Graph Projection
This query contains the same logic as the above sections: set a difficulty cutoff (90th percentile), remove common fragments, etc.  This just does it all in one big, ugly command.
```cypher
CALL gds.graph.create.cypher(
    'hard-frags-weight',
    'MATCH (D:DataSet{data:"Lipophilicity-ID.csv"}) MATCH (D)-[:CONTAINS_MOLECULE]->(mol:Molecule)<-[stats:CONTAINS_PREDICTED_MOLECULE]-(:TestSet) WITH avg(stats.scaled_average_error) as aces, mol.smiles as smiles WITH percentileCont(aces, 0.90) as cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(eM:Molecule) WITH avg(stats.scaled_average_error) as ace, eM.smiles as smiles, cutoff UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(easy_frags:Fragment) WHERE row["ace"] < cutoff WITH collect(easy_frags) as easy_frags, cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(hM:Molecule) WITH avg(stats.scaled_average_error) as ace, hM.smiles as smiles, cutoff, easy_frags UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(hard_frags:Fragment) WHERE row["ace"] > cutoff WITH collect(hard_frags) as hard_frags, easy_frags WITH apoc.coll.subtract(hard_frags, easy_frags) as hard_frags MATCH (mol:Molecule) WITH collect(mol) as molecules, hard_frags UNWIND [hard_frags, molecules] as sublist UNWIND sublist as mol_or_hard_frag RETURN id(mol_or_hard_frag) as id',
    'MATCH (D:DataSet{data:"Lipophilicity-ID.csv"}) MATCH (D)-[:CONTAINS_MOLECULE]->(mol:Molecule)<-[stats:CONTAINS_PREDICTED_MOLECULE]-(:TestSet) WITH avg(stats.scaled_average_error) as aces, mol.smiles as smiles WITH percentileCont(aces, 0.90) as cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(eM:Molecule) WITH avg(stats.scaled_average_error) as ace, eM.smiles as smiles, cutoff UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(easy_frags:Fragment) WHERE row["ace"] < cutoff WITH collect(easy_frags) as easy_frags, cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(hM:Molecule) WITH avg(stats.scaled_average_error) as ace, hM.smiles as smiles, cutoff, easy_frags UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(hard_frags:Fragment) WHERE row["ace"] > cutoff WITH collect(hard_frags) as hard_frags, easy_frags WITH apoc.coll.subtract(hard_frags, easy_frags) as hard_frags UNWIND hard_frags as hard_frag MATCH (T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(mol:Molecule)-[f:HAS_FRAGMENT]->(hard_frag) RETURN id(mol) AS source, id(hard_frag) AS target, f.difficulty as weight'
)
YIELD graphName, nodeCount, relationshipCount;
```

### Single Command for Procedure
All of the Cypher steps can be run at once by separating them with `;`.

```cypher
// Delete old weights
MATCH (:Molecule)-[h:HAS_FRAGMENT]->(:Fragment)
REMOVE h.difficulty
RETURN h limit 10;

// Make new weights for LogP
MATCH (D:DataSet{data:"Lipophilicity-ID.csv"})-[:SPLITS_INTO_TEST]->(T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(M:Molecule)-[f:HAS_FRAGMENT]->(F:Fragment)
WITH avg(p.average_error) as difficulty, f, M, F
SET f.difficulty = difficulty
RETURN count(f);

// Create the graph projection
CALL gds.graph.create.cypher(
    'hard-frags-weight',
    'MATCH (D:DataSet{data:"Lipophilicity-ID.csv"}) MATCH (D)-[:CONTAINS_MOLECULE]->(mol:Molecule)<-[stats:CONTAINS_PREDICTED_MOLECULE]-(:TestSet) WITH avg(stats.scaled_average_error) as aces, mol.smiles as smiles WITH percentileCont(aces, 0.90) as cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(eM:Molecule) WITH avg(stats.scaled_average_error) as ace, eM.smiles as smiles, cutoff UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(easy_frags:Fragment) WHERE row["ace"] < cutoff WITH collect(easy_frags) as easy_frags, cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(hM:Molecule) WITH avg(stats.scaled_average_error) as ace, hM.smiles as smiles, cutoff, easy_frags UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(hard_frags:Fragment) WHERE row["ace"] > cutoff WITH collect(hard_frags) as hard_frags, easy_frags WITH apoc.coll.subtract(hard_frags, easy_frags) as hard_frags MATCH (mol:Molecule) WITH collect(mol) as molecules, hard_frags UNWIND [hard_frags, molecules] as sublist UNWIND sublist as mol_or_hard_frag RETURN id(mol_or_hard_frag) as id',
    'MATCH (D:DataSet{data:"Lipophilicity-ID.csv"}) MATCH (D)-[:CONTAINS_MOLECULE]->(mol:Molecule)<-[stats:CONTAINS_PREDICTED_MOLECULE]-(:TestSet) WITH avg(stats.scaled_average_error) as aces, mol.smiles as smiles WITH percentileCont(aces, 0.90) as cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(eM:Molecule) WITH avg(stats.scaled_average_error) as ace, eM.smiles as smiles, cutoff UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(easy_frags:Fragment) WHERE row["ace"] < cutoff WITH collect(easy_frags) as easy_frags, cutoff MATCH (:TestSet)-[stats:CONTAINS_PREDICTED_MOLECULE]->(hM:Molecule) WITH avg(stats.scaled_average_error) as ace, hM.smiles as smiles, cutoff, easy_frags UNWIND [{ace: ace, smiles: smiles}] as row MATCH (:Molecule {smiles: row["smiles"]})-[:HAS_FRAGMENT]->(hard_frags:Fragment) WHERE row["ace"] > cutoff WITH collect(hard_frags) as hard_frags, easy_frags WITH apoc.coll.subtract(hard_frags, easy_frags) as hard_frags UNWIND hard_frags as hard_frag MATCH (T:TestSet)-[p:CONTAINS_PREDICTED_MOLECULE]->(mol:Molecule)-[f:HAS_FRAGMENT]->(hard_frag) RETURN id(mol) AS source, id(hard_frag) AS target, f.difficulty as weight'
)
YIELD graphName, nodeCount, relationshipCount;

// Delete weights again
MATCH (:Molecule)-[h:HAS_FRAGMENT]->(:Fragment)
REMOVE h.difficulty
RETURN h limit 10

```

## Unweighted PageRank Algorithm

```cypher
CALL gds.pageRank.stream('mols_native',{
	maxIterations: 20
    })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
```

## Weighted PageRank Algorithm
```cypher
CALL gds.pageRank.stream('mols_native',{
	maxIterations: 20,
    relationshipWeightProperty: 'weight'
    })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
```
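
To build some intuition for what the weight changes, here is a small textbook-style PageRank in plain Python on a toy molecule-to-fragment graph. This is a sketch under stated assumptions, not a reimplementation of the GDS procedure: the node names and weights are invented, the damping factor of 0.85 is the conventional choice, and score lost at sink nodes (fragments have no outgoing edges) is simply not redistributed.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over (source, target, weight) tuples."""
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    out_weight = {node: 0.0 for node in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - damping) / len(nodes) for node in nodes}
        for s, t, w in edges:
            # Each source splits its score across its edges in proportion to weight
            new[t] += damping * score[s] * w / out_weight[s]
        score = new
    return score


# Hypothetical molecule -> fragment edges; weight plays the role of difficulty
weighted_edges = [
    ("mol_a", "frag_x", 0.2),
    ("mol_a", "frag_y", 0.8),
    ("mol_b", "frag_x", 0.5),
    ("mol_b", "frag_z", 0.5),
]
unweighted_edges = [(s, t, 1.0) for s, t, _ in weighted_edges]

unweighted = pagerank(unweighted_edges)
weighted = pagerank(weighted_edges)

print("unweighted top:", max(unweighted, key=unweighted.get))
print("weighted top:  ", max(weighted, key=weighted.get))
```

In the unweighted run `frag_x` wins because it appears in both molecules. With weights, `frag_y` overtakes it because its single edge carries most of `mol_a`'s difficulty.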

In [None]:
path = '/home/adam/research/neo4j/gds_results/pageRank/'
df_w = pd.read_csv(path + 'mol_frags_weight_3.csv')
df_n = pd.read_csv(path + 'mol_frags_noweight_3.csv')

n = 5000  # counter for n highest results

In [None]:
df_n.shape

In [None]:
df_w.shape

In [None]:
def cleanup(df, n=0):
    """
    Quick function to process incoming dataframes.
    Accepts a data frame and an integer for how many entries to keep
    (n = 0 keeps all rows).  Rank is assigned from row order, so the
    input is assumed to be sorted by score, highest first.
    """
    df = df.dropna()
    df = df.reset_index(drop=True)
    df["rank"] = df.index + 1
    if n == 0:
        n = df.shape[0]
    df = df[:n]

    return df


In [None]:
# df_w["rank"] = df_w.index + 1
# df_w = df_w[:n]
# df_w.head(10)
df_w = cleanup(df_w)
df_w.describe()

In [None]:
df_n = cleanup(df_n)
df_n.describe()

In [None]:
df_n.head(15)

### Clean up Results
We need to merge the two results into one dataframe.
One column with the fragment, one with unweighted scores,
one with the weighted scores.

Once that is done, we can see how we can manipulate them to get answers.
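
For reference, here is a toy version of the merge, using invented fragment names and scores. With an outer merge, a fragment that appears in only one result gets `NaN` in the other result's columns. The `indicator=True` option of `pd.merge` adds a `_merge` column that makes those rows easy to spot.

```python
import pandas as pd

# Hypothetical PageRank results, already ranked
unweighted = pd.DataFrame(
    {"name": ["ccc", "C=O", "CN"], "score": [3.0, 2.0, 1.0], "rank": [1, 2, 3]}
)
weighted = pd.DataFrame(
    {"name": ["C=O", "ccc", "CS"], "score": [2.5, 2.4, 0.5], "rank": [1, 2, 3]}
)

merged = pd.merge(
    unweighted,
    weighted,
    on="name",
    how="outer",
    suffixes=("_no_weight", "_weight"),
    indicator=True,  # adds a _merge column: left_only, right_only, or both
)

# Keep only fragments that were scored in both runs
both = merged[merged["_merge"] == "both"]
print(merged[["name", "score_no_weight", "score_weight", "_merge"]])
```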

In [None]:
df = pd.merge(df_n, df_w, on="name", how='outer', suffixes=("_no_weight", "_weight"))
df = df.dropna()  # drop fragments that are missing from either result
df["score_diff"] = df["score_weight"] - df["score_no_weight"]

# Negative rank diff means fragment became more important with weight!
df["rank_diff"] = df["rank_weight"] - df["rank_no_weight"]

# Positive frac means fragment became more important with weight
df["frac"] = (df["score_weight"] - df["score_no_weight"])/df["score_no_weight"]*100
# df = df[np.abs(df.score_diff) > 0.01]
df.sort_values(by="score_diff", ascending=True).head(25)
# df.head(25)

In [None]:
df.describe()

In [None]:
%matplotlib notebook
plt.style.use('bmh')
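# Sort first so the slice is the top 1000 by unweighted rank, whatever row order the merge returned
dff = df.sort_values("rank_no_weight")[:1000]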
fig, ax = plt.subplots()
# ax = plt.subplot(111)
ax.bar(dff.rank_no_weight, dff.rank_diff, color=(dff['rank_diff'] > 0).map({True: 'b', False: 'r'}))
plt.xlabel("Unweighted Fragment Rank")
plt.ylabel("Rank Change After Weighting")
# plt.ylim(-225, 225)
plt.xlim(0,)
# plt.show()
# plt.close()


Change the number of points the graph looks at by adjusting the slice of `df`.  Interestingly, as `n` increases, the magnitude of `rank_diff` also tends to increase.  This is largely because very little score separates entries far down the ranking (low scores), so a small change in score can cause a large jump in rank.
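
The effect is easy to reproduce with made-up numbers. In this sketch (all names and scores are hypothetical), the same score bump is applied once near the top of the ranking and once in a tightly packed tail.

```python
def ranks(scores):
    """Map each name to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Two well-separated leaders followed by a tightly packed tail
before = {"top_1": 5.0, "top_2": 4.0}
before.update({f"tail_{i}": 0.150 + 0.001 * (10 - i) for i in range(10)})

delta = 0.02  # the same small score change in both cases
after_top = dict(before, top_2=before["top_2"] + delta)
after_tail = dict(before, tail_9=before["tail_9"] + delta)

top_shift = ranks(before)["top_2"] - ranks(after_top)["top_2"]
tail_shift = ranks(before)["tail_9"] - ranks(after_tail)["tail_9"]

print("rank change near the top:", top_shift)
print("rank change in the tail: ", tail_shift)
```

The bump leaves the second-ranked entry where it is, while the last entry in the tail climbs past every other tail entry.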

In [None]:
# Sort first so the slice is the top fragments by unweighted rank; adjust what data to look at here
dff = df.sort_values("rank_no_weight")[:2500]
plt.style.use('bmh')
fig, ax = plt.subplots()
ax.scatter(dff.score_no_weight, dff.score_weight)
plt.xlabel("Unweighted Fragment Score")
plt.ylabel("Weighted Fragment Score")

## Whole Graph Page Rank
I figured it would be interesting to run the PageRank algorithm on the full graph, unweighted.

### Creating the graph projection
```cypher
CALL gds.graph.create('whole-graph', '*', '*')
```

### Run PageRank Algorithm
```cypher
CALL gds.pageRank.stream('whole-graph',{
    maxIterations: 20
    })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score,  labels(gds.util.asNode(nodeId)) as NodeType, nodeId
ORDER BY NodeType, score DESC
```