# GI50
The NCI60 data gives insight into the effect of a chemical on cancer.
Cancer is not one disease but a collection of diseases: almost any cell type in the body can start growing uncontrollably.
For this reason NCI60 covers multiple cell lines (column CELL_NAME).

This notebook uses the GI50 data of the NCI60 project.
GI50: the concentration needed to inhibit growth by 50%. The column "AVERAGE" holds the average concentration needed.
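
The rows printed by `gi50.head()` below show AVERAGE values such as -7.1391 next to a CONCENTRATION_UNIT of `M`, which suggests the values are stored as log10 of a molar concentration. The notebook does not state this, so treat it as an assumption. Under that assumption, a value can be converted back to a concentration as sketched here; `log_gi50_to_molar` is a hypothetical helper name.

```python
def log_gi50_to_molar(log_gi50: float) -> float:
    """Convert a GI50 value to mol/L.

    Assumption: the AVERAGE column stores log10 of the concentration in mol/L.
    """
    return 10.0 ** log_gi50


# AVERAGE of the first row shown by gi50.head() (cell line NCI-H23)
molar = log_gi50_to_molar(-7.1391)
print(f"{molar:.3e} M = {molar * 1e9:.1f} nM")
```

On this reading, a more negative value means that less of the chemical is needed to halve growth.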

In [1]:
import pandas as pd
from os.path import join as path_join
import numpy as np

In [2]:
gi50 = pd.read_csv(path_join("data", "GI50.csv"))
print(gi50.shape)
gi50.head()

(4585048, 14)


Unnamed: 0,RELEASE_DATE,EXPID,PREFIX,NSC,CONCENTRATION_UNIT,LOG_HI_CONCENTRATION,PANEL_NUMBER,CELL_NUMBER,PANEL_NAME,CELL_NAME,PANEL_CODE,COUNT,AVERAGE,STDDEV
0,20210223,0001MD02,S,123127,M,-4.6021,1,1,Non-Small Cell Lung Cancer,NCI-H23,LNS,1,-7.1391,0.0
1,20210223,0001MD02,S,123127,M,-4.6021,10,14,Melanoma,M14,MEL,1,-7.052,0.0
2,20210223,0001MD02,S,123127,M,-4.6021,12,5,CNS Cancer,SNB-75,CNS,1,-7.138,0.0
3,20210223,0001MD02,S,123127,M,-4.6021,4,2,Colon Cancer,HCC-2998,COL,1,-6.9426,0.0
4,20210223,0001MD02,S,123127,M,-4.6021,5,5,Breast Cancer,MDA-MB-231/ATCC,BRE,1,-6.4485,0.0


Create a pivot table in which every value is the average GI50 value for one combination of cell line and chemical (NSC).

In [3]:
# One row per chemical (NSC), one column per cell line; repeated measurements are averaged.
correlation = pd.pivot_table(gi50, values='AVERAGE', index=['NSC'],
                    columns=['CELL_NAME'], aggfunc='mean')
correlation.head()

CELL_NAME \ NSC,1,17,26,89,112,171,185,186,196,197,...,836824,836941,836942,837081,837082,837396,837397,837398,837892,837893
786-0,-4.716867,-5.557233,-5.4265,-4.70005,-6.4911,-4.0,-7.4287,-4.609967,-5.1022,-5.5461,...,-5.4324,-5.3468,-4.0,-4.8286,-5.1042,-6.6969,-5.5047,-6.5669,-4.301,-5.6072
A-172/H.Fine,,,,,,,,-4.2305,,,...,,,,,,,,,,
A-204,,,,,,,,,,,...,,,,,,,,,,
A-C/EBP 3,,,,,,,,,,,...,,,,,,,,,,
A-CREB 1,,,,,,,,,,,...,,,,,,,,,,
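
To make the aggregation concrete, here is a minimal sketch on made-up records, using the same arguments as the `pd.pivot_table` call above (index NSC, columns CELL_NAME). The values and the cell line names "A" and "B" are invented for illustration only.

```python
import pandas as pd

# Made-up long-format records; NSC 1 is measured twice on cell line "A".
toy = pd.DataFrame(
    {
        "NSC": [1, 1, 1, 17],
        "CELL_NAME": ["A", "A", "B", "A"],
        "AVERAGE": [-5.0, -6.0, -4.0, -7.0],
    }
)

# One row per chemical, one column per cell line; repeats are averaged.
toy_pivot = pd.pivot_table(
    toy, values="AVERAGE", index="NSC", columns="CELL_NAME", aggfunc="mean"
)
print(toy_pivot)
```

The two measurements of NSC 1 on "A" collapse into their mean, and the combination that was never measured (NSC 17 on "B") becomes NaN, which is why the real table is sparse.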


The difference in how two cell lines react to the same chemical gives an idea of how similar those two cell lines are. Taking the average of the absolute differences over all chemicals they have in common gives a difference score, where 0 means the two cell lines respond identically and the score grows without bound as they become more different.
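
As a small illustration of this score, the sketch below uses made-up GI50 values for two hypothetical cell lines. It relies on pandas aligning the two series on their index, so chemicals missing from either cell line drop out of the average.

```python
import numpy as np
import pandas as pd

# Made-up GI50 values for two hypothetical cell lines, indexed by chemical.
cell_a = pd.Series({1: -5.0, 17: -6.0, 26: -4.5, 89: np.nan})
cell_b = pd.Series({1: -5.5, 17: -6.0, 26: np.nan, 89: -7.0})

# Subtraction aligns on the index; a chemical missing in either series gives NaN.
abs_diff = (cell_a - cell_b).abs().dropna()

score = abs_diff.mean()   # mean over the shared chemicals only
n_shared = len(abs_diff)  # how many chemicals contributed
print(score, n_shared)
```

Keeping the number of shared chemicals next to the score helps judge how much weight the score deserves.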

In [17]:
np.abs(correlation["HCT-15"] - correlation["HOP-92"]).mean()

Avg diff 0.5365136610125857, num difs 59
CELL_NAME
786-0              0.071407
A498               0.429719
A549/ATCC          1.584981
ACHN               0.460768
BT-549             0.398016
CAKI-1             0.784300
CCRF-CEM           0.176300
COLO 205           0.575962
DU-145             0.410943
EKVX               0.539297
HCC-2998           1.080709
HCT-116            0.805091
HCT-15             1.106202
HL-60(TB)          0.396537
HOP-62             0.417476
HOP-92             0.198874
HS 578T            0.345624
HT29               0.930421
IGROV1             0.850885
K-562              0.288681
KM12               0.353211
LOX IMVI           0.884969
M14                0.006811
MALME-3M           0.309639
MCF7               0.976516
MDA-MB-231/ATCC    1.117000
MDA-MB-435         0.491341
MDA-N              0.492866
MOLT-4             0.432465
NCI-H226           0.634068
NCI-H23            0.884898
NCI-H322M          0.577172
NCI-H460           1.474320
NCI-H522           0.4307

We can do the same in a graph database. However, the database is not fully in memory, so we first need to retrieve the data for the corresponding cell lines.

In [5]:
import json

from py2neo import Graph

with open("config.json") as f:
    config = json.load(f)

neo4j_url = config.get("neo4jUrl", "bolt://localhost:7687")
user = config.get("user", "neo4j")
pswd = config.get("pswd", "password")
graph = Graph(neo4j_url, auth=(user, pswd))

At this point the database only holds GI50 data, but we still write the query as if more kinds of data exist. We look for all chemicals that have a condition with both cell line A and cell line B. All conditions must also have a GI50 measurement.
Next we average the measurements and calculate the distance just like above.
The values differ somewhat from those computed with pandas because not all chemicals could be loaded into the database.

In [6]:
response = graph.run(
    """
    MATCH (cell1:CellLine {label: $cell_name_1})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (chem:Synonym)<-[:USES]-(c1:Condition)-[:USES]->(cell1)
    MATCH (c1)-[m1:MEASURES]->(gi50)
    WITH DISTINCT chem as chem, avg(toFloat(m1.value)) as values1, count(m1.value) as numVal1, gi50

    MATCH (cell2:CellLine {label: $cell_name_2})
    MATCH (chem)<-[:USES]-(c:Condition)-[:USES]->(cell2)
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT chem as chem, avg(toFloat(m.value)) as values2, count(m.value) as numVal2, values1, numVal1

    RETURN avg(abs(values1- values2)) as distance
    """,
    cell_name_1="HCT-15",
    cell_name_2="HOP-92",
).data()
print(response[0]["distance"])

0.31269519410755936
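
The steps of the query can be mirrored in pandas on a toy table, which may help when reading the Cypher. This is only a sketch: the records are made up, `mean_per_chem` is a hypothetical helper, and storing values as strings is an assumption based on the `toFloat` call in the query.

```python
import pandas as pd

# Made-up measurements in long format, one row per (chemical, cell line, value).
records = pd.DataFrame(
    {
        "chem": ["c1", "c1", "c1", "c2", "c2", "c3"],
        "cell": ["HCT-15", "HCT-15", "HOP-92", "HCT-15", "HOP-92", "HCT-15"],
        "value": ["-5.0", "-6.0", "-5.0", "-4.0", "-4.5", "-7.0"],
    }
)
records["value"] = records["value"].astype(float)  # plays the role of toFloat(m.value)


def mean_per_chem(df: pd.DataFrame, cell: str) -> pd.Series:
    """Average value per chemical for one cell line (the avg(...) in each WITH)."""
    return df[df["cell"] == cell].groupby("chem")["value"].mean()


values1 = mean_per_chem(records, "HCT-15")
values2 = mean_per_chem(records, "HOP-92")

# The inner join keeps only chemicals measured on both cell lines.
both = pd.concat([values1, values2], axis=1, join="inner", keys=["v1", "v2"])
distance = (both["v1"] - both["v2"]).abs().mean()
print(distance, len(both))
```

Chemical `c3` is measured on only one cell line, so it does not contribute to the distance, just as the second `MATCH` on `chem` in the query drops chemicals without a condition for the second cell line.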


In [7]:
response = graph.run(
    """
    MATCH (cell1:CellLine {label: $cell_name_1})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (chem:Synonym)<-[:USES]-(c1:Condition)-[:USES]->(cell1)
    MATCH (c1)-[m1:MEASURES]->(gi50)
    WITH DISTINCT chem as chem, avg(toFloat(m1.value)) as values1, gi50
    // WITH DISTINCT cell1, collect(chem) as col_chem, collect(values1) as col_val1, gi50

    MATCH (cell2:CellLine)
    WHERE cell2.label in ["HOP-92", "HCT-116"]

    MATCH (chem)<-[:USES]-(c:Condition)-[:USES]->(cell2)
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT chem, avg(toFloat(m.value)) as values2, cell2, values1

    RETURN DISTINCT cell2, avg(abs(values1- values2)) as distance
    """,
    cell_name_1="HCT-15",
).data()

for r in response:
    print(r["cell2"]["label"], r["distance"])

HOP-92 0.31269519410755936
HCT-116 0.1841500216693871


In [10]:
response = graph.run(
    """
    MATCH (cell1:CellLine {label: $cell_name_1})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (chem:Synonym)<-[:USES]-(c1:Condition)-[:USES]->(cell1)
    MATCH (c1)-[m1:MEASURES]->(gi50)
    WITH DISTINCT chem as chem, avg(toFloat(m1.value)) as values1, gi50

    MATCH (cell2:CellLine)
    WHERE cell2.label in ["HOP-92", "HCT-116"]


    CALL {
        WITH chem, values1, gi50, cell2

        MATCH (chem)<-[:USES]-(c:Condition)-[:USES]->(cell2)
        MATCH (c)-[m:MEASURES]->(gi50)
        WITH DISTINCT chem as chem2, abs(values1-avg(toFloat(m.value))) as values2, cell2 as cell
        RETURN chem2, cell, values2
    }

    RETURN DISTINCT cell.label as cellName, avg(values2) as distance, count(values2) as numChems
     """,
    cell_name_1="HCT-15",
).data()

for r in response:
    print(r["cellName"], r["distance"], r["numChems"])

HOP-92 0.31269519410755936 45469
HCT-116 0.1841500216693871 50975


In [8]:
for cell_line in correlation:
    all_distance = np.abs(correlation["HCT-15"] - correlation[cell_line]).dropna()
    distance = all_distance.mean()
    print(f"Difference with {cell_line} is {distance:.4f} of {len(all_distance)} samples")

Difference with 786-0 is 0.2012 of 52098 samples
Difference with A-172/H.Fine is 0.5092 of 783 samples
Difference with A-204 is 1.4629 of 16 samples
Difference with A-C/EBP 3 is 0.8016 of 10 samples
Difference with A-CREB 1 is 0.8027 of 10 samples
Difference with A-CREB 2 is 0.6198 of 10 samples
Difference with A-FOS 2 is 0.9863 of 10 samples
Difference with A-FOS 3 is 0.8214 of 10 samples
Difference with A-JUN 1 is 0.4059 of 10 samples
Difference with A-JUN 3 is 0.3378 of 10 samples
Difference with A431 is 0.8911 of 10 samples
Difference with A498 is 0.2961 of 46660 samples
Difference with A549/ATCC is 0.2128 of 53286 samples
Difference with ACHN is 0.1831 of 52442 samples
Difference with BT-549 is 0.2632 of 37355 samples
Difference with CACO-2 is 1.0474 of 17 samples
Difference with CAKI-1 is 0.2103 of 49761 samples
Difference with CALU-1 is 0.5601 of 16 samples
Difference with CCD-19LU is 1.0922 of 3 samples
Difference with CCRF-CEM is 0.2796 of 50167 samples
Difference with CHA-59 
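
The same comparison can be done without an explicit loop, collecting the distance and the sample count in one table so that cell lines with very few shared chemicals can be filtered out. The sketch below runs on a made-up pivot with hypothetical cell line names; the `MIN_SAMPLES` cut-off is an arbitrary value chosen for the toy data, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up pivot: rows are chemicals, columns are hypothetical cell lines.
toy = pd.DataFrame(
    {
        "REF": [-5.0, -6.0, -4.0, -7.0],
        "X": [-5.5, -6.0, -4.5, np.nan],
        "Y": [np.nan, np.nan, -6.0, np.nan],
    },
    index=pd.Index([1, 17, 26, 89], name="NSC"),
)

# Absolute difference of every column against the reference column.
abs_diff = toy.sub(toy["REF"], axis=0).abs()

# mean() and count() both skip NaN, so each column only uses shared chemicals.
summary = pd.DataFrame(
    {"distance": abs_diff.mean(), "n_samples": abs_diff.count()}
)

MIN_SAMPLES = 2  # hypothetical cut-off, chosen only for this toy example
reliable = summary[summary["n_samples"] >= MIN_SAMPLES].sort_values("distance")
print(reliable)
```

On the real table the same pattern would use `correlation` in place of `toy` and `"HCT-15"` in place of `"REF"`.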

In [9]:
correlation.isna().sum().sum()

5819308

In [10]:
correlation.isna().sum().sum()/correlation.size

0.6456632510125747

In [11]:
correlation.isna().sum().sort_values()

CELL_NAME
A549/ATCC          1394
OVCAR-8            1519
SW-620             1522
U251               1823
SF-295             1954
                  ...  
VDSO/CMV-8        56683
VDSO/E6-18        56683
NYH/ICRF-187-1    56683
CHO               56683
NYH               56684
Length: 159, dtype: int64