In [91]:
import pandas as pd
import numpy as np

## Graph Projections
I used a native projection to create the subgraph for this analysis.

```cypher
CALL gds.graph.create(
  'mols_native',             // graph name
  ['Molecule', 'Fragment'],  // node labels
  'HAS_FRAGMENT',            // relationship type
  {relationshipProperties: {weight: {property: 'difficulty', defaultValue: 0}}}  // relationship properties
)
YIELD graphName, nodeCount, relationshipCount;
```
This projection contains only `Molecule` and `Fragment` nodes and the `HAS_FRAGMENT` relationships between them.
The Neo4j relationship property `difficulty` is exposed in the projection under the name `weight`; relationships without a `difficulty` value get a weight of 0.

## Unweighted PageRank Algorithm

```cypher
CALL gds.pageRank.stream('mols_native', {
  maxIterations: 20
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
```

## Weighted PageRank Algorithm
```cypher
CALL gds.pageRank.stream('mols_native', {
  maxIterations: 20,
  relationshipWeightProperty: 'weight'
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
```
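
To make the difference between the two calls concrete, here is a minimal sketch of a textbook power-iteration PageRank on a toy molecule-to-fragment graph. It is not the GDS implementation: the function `pagerank`, the damping factor of 0.85, the handling of nodes without outgoing edges, and the toy edges and weights are all assumptions made for illustration. The sketch also normalizes scores to sum to 1, so its absolute values are not comparable to the GDS scores in this notebook.

```python
import numpy as np


def pagerank(edges, n_nodes, damping=0.85, max_iterations=20):
    """Textbook power-iteration PageRank over (source, target, weight) edges."""
    # transition[t, s] is the share of s's score that flows to t,
    # proportional to the weight of the edge s -> t.
    transition = np.zeros((n_nodes, n_nodes))
    for source, target, weight in edges:
        transition[target, source] += weight
    out_weight = transition.sum(axis=0)
    has_out = out_weight > 0
    transition[:, has_out] /= out_weight[has_out]

    scores = np.full(n_nodes, 1.0 / n_nodes)
    for _ in range(max_iterations):
        # Nodes with no outgoing weight spread their score evenly (an assumption).
        dangling = scores[~has_out].sum() / n_nodes
        scores = (1 - damping) / n_nodes + damping * (transition @ scores + dangling)
    return scores


# Hypothetical toy graph: nodes 0 and 1 are molecules, 2 and 3 are fragments.
unweighted_edges = [(0, 2, 1.0), (0, 3, 1.0), (1, 2, 1.0)]
weighted_edges = [(0, 2, 1.0), (0, 3, 3.0), (1, 2, 1.0)]  # made-up weights

unweighted = pagerank(unweighted_edges, n_nodes=4)
weighted = pagerank(weighted_edges, n_nodes=4)
print("unweighted:", unweighted.round(4))
print("weighted:  ", weighted.round(4))
```

In the weighted run, molecule 0 sends three quarters of its score to fragment 3 instead of half, so fragment 3 gains score at the expense of fragment 2. The same mechanism is what shifts scores and ranks between the two result sets compared below.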

In [92]:
path = '/home/adam/research/neo4j/gds_results/pageRank/'
df_w = pd.read_csv(path + 'mol_frags_weight.csv')
df_n = pd.read_csv(path + 'mol_frags_noweight.csv')

In [93]:
df_w["rank"] = df_w.index + 1
df_w = df_w[:1000]
df_w.head(10)


Unnamed: 0,name,score,rank
0,cc,39.579261,1
1,ccc,34.946431,2
2,cccc,31.597883,3
3,CC,29.49366,4
4,ccccc,28.826794,5
5,CCC,20.091932,6
6,CN,15.39781,7
7,ccC,14.732795,8
8,cccC,13.315302,9
9,CCCC,12.873738,10


In [94]:
df_n["rank"] = df_n.index + 1
df_n = df_n[:1000]
df_n.head(10)

Unnamed: 0,name,score,rank
0,cc,46.882612,1
1,ccc,41.416363,2
2,cccc,37.454273,3
3,ccccc,34.226061,4
4,CC,33.965738,5
5,CCC,22.980744,6
6,CN,18.242777,7
7,ccC,17.466311,8
8,cccC,15.778175,9
9,cn,14.910307,10


### Clean up Results
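We need to merge the two result sets into one dataframe keyed on the fragment name,
with the unweighted score and rank in one pair of columns
and the weighted score and rank in another.

Once that is done, we can compute per-fragment differences in score and rank and see which fragments are most affected by the weighting.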
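
As a small illustration of why missing values appear after the merge, here is a sketch on toy data. The frames `toy_n` and `toy_w` and their values are hypothetical placeholders, not results from this analysis. Because each table is cut to its own top 1000, a fragment can appear in one table but not the other; an outer merge keeps such rows and fills the missing side with `NaN`.

```python
import pandas as pd

# Hypothetical toy tables standing in for the unweighted and weighted results.
toy_n = pd.DataFrame({"name": ["cc", "CC", "cn"], "score": [3.0, 2.0, 1.0], "rank": [1, 2, 3]})
toy_w = pd.DataFrame({"name": ["cc", "CC", "CCCC"], "score": [2.5, 1.5, 0.5], "rank": [1, 2, 3]})

# Outer merge keeps fragments found in either table; suffixes label the source.
toy = pd.merge(toy_n, toy_w, on="name", how="outer", suffixes=("_no_weight", "_weight"))

# dropna returns a new frame, so the result has to be assigned.
toy_both = toy.dropna()
print(toy)
print(toy_both)
```

Only fragments present in both tables survive `dropna`, which is what the score and rank differences require.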

In [101]:
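df = pd.merge(df_n, df_w, on="name", how='outer', suffixes=("_no_weight", "_weight"))
# Keep only fragments present in both top-1000 lists (dropna returns a new frame).
df = df.dropna()
df["score_diff"] = df["score_weight"] - df["score_no_weight"]
df["rank_diff"] = df["rank_weight"] - df["rank_no_weight"]
# Percent change in score relative to the unweighted score.
df["frac"] = df["score_diff"] / df["score_no_weight"] * 100
df = df[np.abs(df.score_diff) > 0.01]
# Most negative rank_diff first: fragments that move up the most when weighted.
df.sort_values(by="rank_diff", ascending=True).head(25)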
# df.head(15)

Unnamed: 0,name,score_no_weight,rank_no_weight,score_weight,rank_weight,score_diff,rank_diff,frac
974,C=C<-X>,0.525448,975.0,0.512331,858.0,-0.013117,-117.0,-2.4963
954,CC<-tBu>,0.533004,955.0,0.521442,839.0,-0.011562,-116.0,-2.169178
814,CC<-C(=O)OMe>,0.597074,815.0,0.578767,721.0,-0.018307,-94.0,-3.066055
917,CCCO<-C(=O)H>,0.551881,918.0,0.522183,836.0,-0.029699,-82.0,-5.381332
863,CC(<-X>)<-X>,0.574461,864.0,0.546598,784.0,-0.027864,-80.0,-4.850403
934,CCC<-C#N>,0.543199,935.0,0.510112,863.0,-0.033088,-72.0,-6.091274
830,CC=C(C)C,0.589032,831.0,0.556325,761.0,-0.032706,-70.0,-5.552544
990,CC<-C(=O)H>,0.515592,991.0,0.490092,923.0,-0.0255,-68.0,-4.945769
951,cc<-OEt>c,0.534721,952.0,0.502643,887.0,-0.032078,-65.0,-5.999087
766,c<-X>c<-X>c<-X>c<-X>,0.621698,767.0,0.596584,702.0,-0.025114,-65.0,-4.039515


In [96]:
df.describe()

Unnamed: 0,score_no_weight,rank_no_weight,score_weight,rank_weight,score_diff,rank_diff,frac
count,978.0,978.0,978.0,978.0,978.0,978.0,978.0
mean,1.828634,490.530675,1.571115,491.209611,-0.257519,0.678937,-12.91013
std,3.442888,283.852073,2.918664,284.589,0.526322,21.502783,2.441366
min,0.513856,1.0,0.456035,1.0,-7.303351,-117.0,-23.434538
25%,0.648836,245.25,0.568223,245.25,-0.225419,-6.0,-14.573208
50%,0.925279,489.5,0.802157,489.5,-0.11907,0.0,-13.19738
75%,1.607536,735.75,1.396581,736.75,-0.078379,6.0,-11.513205
max,46.882612,997.0,39.579261,1000.0,-0.011562,170.0,-2.169178
