# Preparing the Data for Training

### This Jupyter Notebook replicates the preprocessing steps of the original paper for better understanding

The architecture of DeepSignalingFlow forces us to create separate networks for each available dataset, depending on the number of genes and drug combinations. We will therefore start by preprocessing only the NCI ALMANAC dataset.

To prepare, we need to download the following datasets:

NCI ALMANAC [Download](https://wiki.nci.nih.gov/spaces/NCIDTPdata/pages/338237347/NCI-ALMANAC) <br>
KEGG [Download](https://www.genome.jp/kegg/) <br>
DrugBank [Download](https://go.drugbank.com/) <br>
Cell Model Passports RNA-Seq [Download](https://cog.sanger.ac.uk/cmp/download/rnaseq_20191101.zip) <br>
Cell Model Passports CNV [Download](https://cog.sanger.ac.uk/cmp/download/cnv_20191101.zip) <br>

NOTES: <br>
We do not know which KEGG dataset was used here. <br>
The NCI ALMANAC dataset is called NCI60 in the code, although the two are not the same dataset. <br>


#### Reading our dataset
Now we begin by loading our dataset and editing it into a new file.

In the original code this can be found at `parse/init_parse.py`

Let's do some basic tests and follow their process:

In [50]:
import pandas as pd
import os
import numpy as np

In [51]:
dl_input_df = pd.read_csv('../data-nci/raw_data/nci/nci-raw-data.csv')
dl_input_df.head()



Unnamed: 0,COMBODRUGSEQ,SCREENER,STUDY,TESTDATE,PLATE,PANELNBR,CELLNBR,PREFIX1,NSC1,SAMPLE1,...,PERCENTGROWTH,PERCENTGROWTHNOTZ,TESTVALUE,CONTROLVALUE,TZVALUE,EXPECTEDGROWTH,SCORE,VALID,PANEL,CELLNAME
0,260496,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,85.979,88.159,332168.0,376781.548,58592.854,90.342,4.0,Y,Renal Cancer,786-0
1,260497,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,100.903,100.763,379656.0,376781.548,58592.854,87.13,-14.0,Y,Renal Cancer,786-0
2,260498,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,14.147,27.498,103608.0,376781.548,58592.854,12.739,-1.0,Y,Renal Cancer,786-0
3,260499,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,71.268,75.736,285360.0,376781.548,58592.854,76.397,5.0,Y,Renal Cancer,786-0
4,260500,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,89.278,90.945,342664.0,376781.548,58592.854,73.681,-16.0,Y,Renal Cancer,786-0


Now we can copy the first operation from the paper. Since the raw data has no columns called "Drug A" or "Drug B", we cannot perform this operation on the raw data as written.

In [52]:
#dl_input_df = dl_input_df.groupby(['Drug A', 'Drug B', 'Cell Line Name']).agg({'Score':'mean'}).reset_index()
#dl_input_df.head()

To perform the same operation, we need to substitute the column names used in the raw data. The corresponding drug fields could be: <br>
`NSC1` or <br>
`SAMPLE1` <br>
The choice should not matter much, as we just want to look at the data for now.
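As a minimal sketch of the renaming idea, the raw columns can be mapped onto the names the original code expects before running the unchanged aggregation. The mapping below is an assumption (including the choice of `NSC1` over `SAMPLE1` and the existence of a matching `NSC2` column), and the small DataFrame is made-up data standing in for the raw CSV.

```python
import pandas as pd

# Hypothetical mapping from raw column names to the names used in the
# original parsing code. Whether NSC* or SAMPLE* is the right drug key
# is an open question.
column_map = {
    "NSC1": "Drug A",
    "NSC2": "Drug B",
    "CELLNAME": "Cell Line Name",
    "SCORE": "Score",
}

# Tiny made-up frame standing in for the raw CSV.
raw = pd.DataFrame({
    "NSC1": [752, 752, 3088],
    "NSC2": [3088.0, 3088.0, 752.0],
    "CELLNAME": ["786-0", "786-0", "HCT-15"],
    "SCORE": [4.0, -14.0, 2.0],
})

# After renaming, the aggregation from the original code runs unchanged.
renamed = raw.rename(columns=column_map)
mean_scores = (
    renamed.groupby(["Drug A", "Drug B", "Cell Line Name"])
    .agg({"Score": "mean"})
    .reset_index()
)
print(mean_scores)
```

Renaming once up front keeps the rest of the original pipeline untouched, which makes it easier to compare intermediate results with the authors' code.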

In [53]:
dl_input_df = dl_input_df.groupby(['SAMPLE1', 'SAMPLE2', 'CELLNAME']).agg({'SCORE':'mean'}).reset_index()
print(dl_input_df.shape)
dl_input_df.tail()

(62277, 4)


Unnamed: 0,SAMPLE1,SAMPLE2,CELLNAME,SCORE
62272,290,31.0,HCT-15,-2.444444
62273,290,46.0,HCT-15,-2.0
62274,290,52.0,HCT-15,-7.888889
62275,290,91.0,HCT-15,-2.444444
62276,290,203.0,HCT-15,3.222222


In [54]:
dl_input_deletion_list = []
dl_input_df = dl_input_df.fillna('missing')
for row in dl_input_df.itertuples():
    if row[1] == 'missing' or row[2] == 'missing':
        dl_input_deletion_list.append(row[0])
dl_input_df = dl_input_df.drop(dl_input_df.index[dl_input_deletion_list]).reset_index(drop=True)
print(dl_input_df.shape)
dl_input_df.tail()

(62277, 4)


Unnamed: 0,SAMPLE1,SAMPLE2,CELLNAME,SCORE
62272,290,31.0,HCT-15,-2.444444
62273,290,46.0,HCT-15,-2.0
62274,290,52.0,HCT-15,-7.888889
62275,290,91.0,HCT-15,-2.444444
62276,290,203.0,HCT-15,3.222222


Something seems wrong here. Not only are some preprocessing operations not mentioned, but no missing values are being removed. This may be connected to the two different datatypes of `SAMPLE1` and `SAMPLE2`. <br>
So let's investigate a bit.

In [78]:
dl_input_df = pd.read_csv('../data-nci/raw_data/nci/nci-raw-data.csv')
dl_input_df.head()



Unnamed: 0,COMBODRUGSEQ,SCREENER,STUDY,TESTDATE,PLATE,PANELNBR,CELLNBR,PREFIX1,NSC1,SAMPLE1,...,PERCENTGROWTH,PERCENTGROWTHNOTZ,TESTVALUE,CONTROLVALUE,TZVALUE,EXPECTEDGROWTH,SCORE,VALID,PANEL,CELLNAME
0,260496,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,85.979,88.159,332168.0,376781.548,58592.854,90.342,4.0,Y,Renal Cancer,786-0
1,260497,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,100.903,100.763,379656.0,376781.548,58592.854,87.13,-14.0,Y,Renal Cancer,786-0
2,260498,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,14.147,27.498,103608.0,376781.548,58592.854,12.739,-1.0,Y,Renal Cancer,786-0
3,260499,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,71.268,75.736,285360.0,376781.548,58592.854,76.397,5.0,Y,Renal Cancer,786-0
4,260500,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,89.278,90.945,342664.0,376781.548,58592.854,73.681,-16.0,Y,Renal Cancer,786-0


In [79]:
print(dl_input_df['SAMPLE1'].dtype)
print(dl_input_df['SAMPLE2'].dtype)

int64
float64
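A small sketch of why two ID-like columns can end up with different dtypes: NumPy integers cannot represent NaN, so pandas stores an integer column that contains missing values as `float64`. The series below are made-up values, and the nullable `Int64` dtype is shown only as one possible alternative, not as what the original code does.

```python
import numpy as np
import pandas as pd

complete = pd.Series([37, 46, 52])      # no missing values
with_gap = pd.Series([37, np.nan, 52])  # one missing value

print(complete.dtype)  # integer dtype
print(with_gap.dtype)  # float dtype, because NaN is a float

# pandas' nullable integer dtype keeps integers and still marks the gap.
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```

Under this reading, the `float64` dtype of `SAMPLE2` would be a symptom of missing values in that column rather than their cause.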


I suspect the datatypes are related to the missing values. How many missing values do we have in `SAMPLE2`?

In [80]:
print(dl_input_df[['SAMPLE1', 'SAMPLE2', 'CELLNAME', 'SCORE']].isna().sum())

SAMPLE1          0
SAMPLE2     812961
CELLNAME         0
SCORE       815031
dtype: int64


In [81]:
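# dropna() without `subset` drops every row with a NaN in any column,
# not only rows where SAMPLE2 or SCORE is missing
dl_input_df = dl_input_df.dropna()
# With no NaN left, SAMPLE2 can be cast back to an integer dtype
dl_input_df['SAMPLE2'] = dl_input_df['SAMPLE2'].astype('int64')
print("Missing values in SAMPLE2:", dl_input_df['SAMPLE2'].isna().sum())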

Missing values in SAMPLE2: 0


In [82]:
dl_input_df = dl_input_df.groupby(['SAMPLE1', 'SAMPLE2', 'CELLNAME']).agg({'SCORE':'mean'}).reset_index()
print(dl_input_df.shape)
dl_input_df.tail()

(59898, 4)


Unnamed: 0,SAMPLE1,SAMPLE2,CELLNAME,SCORE
59893,290,31,HCT-15,-2.444444
59894,290,46,HCT-15,-2.0
59895,290,52,HCT-15,-7.888889
59896,290,91,HCT-15,-2.444444
59897,290,203,HCT-15,3.222222


In [83]:
dl_input_deletion_list = []
dl_input_df = dl_input_df.fillna('missing')
for row in dl_input_df.itertuples():
    if row[1] == 'missing' or row[2] == 'missing':
        dl_input_deletion_list.append(row[0])
dl_input_df = dl_input_df.drop(dl_input_df.index[dl_input_deletion_list]).reset_index(drop=True)
print(dl_input_df.shape)
dl_input_df.tail()

(59898, 4)


Unnamed: 0,SAMPLE1,SAMPLE2,CELLNAME,SCORE
59893,290,31,HCT-15,-2.444444
59894,290,46,HCT-15,-2.0
59895,290,52,HCT-15,-7.888889
59896,290,91,HCT-15,-2.444444
59897,290,203,HCT-15,3.222222
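One possible explanation for the removal loop never deleting anything is that pandas `groupby` excludes rows whose group keys contain NaN by default, so no missing `SAMPLE1` or `SAMPLE2` values survive the aggregation. The sketch below illustrates this on made-up data; it has not been verified against the full dataset. It also shows `dropna(subset=...)` as a vectorized alternative to the `fillna('missing')` loop.

```python
import numpy as np
import pandas as pd

# Made-up rows; the NaN in SAMPLE2 stands for a row without a second drug.
toy = pd.DataFrame({
    "SAMPLE1": [37, 37, 46],
    "SAMPLE2": [31.0, np.nan, 52.0],
    "CELLNAME": ["786-0", "786-0", "HCT-15"],
    "SCORE": [4.0, 1.0, -2.0],
})
keys = ["SAMPLE1", "SAMPLE2", "CELLNAME"]

# Default groupby behaviour: rows with a NaN key are silently excluded.
grouped = toy.groupby(keys).agg({"SCORE": "mean"}).reset_index()

# Removing the rows explicitly before grouping gives the same result.
explicit = (
    toy.dropna(subset=["SAMPLE1", "SAMPLE2"])
    .groupby(keys)
    .agg({"SCORE": "mean"})
    .reset_index()
)

print(len(toy), len(grouped))
print(grouped[["SAMPLE1", "SAMPLE2"]].isna().sum().sum())
print(grouped.equals(explicit))
```

If this is what happens on the real data, the deletion loop after the `groupby` is a no-op in both runs, which would match the unchanged shapes above.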


## Let's put a stop here

This seems a bit futile without having input from the authors.

## Let's look at the filtered data instead

To do that, we copy the data folders from the DSF repository into our own. <br>
They can be found under `data-nci`. Don't forget to modify the `.gitignore` so we don't push large amounts of data. <br>

We can ignore the following folders for now: <br>
```
filtered_data-original
plot
result
```

The README suggests running the following command to run the model:
```
python geo_tmain_webgnn.py 
```
This file accepts custom specifications for our GNN. <br>
More importantly, this is where the model is "built" and the data is used:
```python
def build_geowebgnn_model(args, device, dataset):
    print('--- BUILDING UP WEBGNN MODEL ... ---')
    # GET PARAMETERS
    # [num_gene, num_drug, (adj)node_num]
    final_annotation_gene_df = pd.read_csv('./' + dataset + '/filtered_data/kegg_gene_annotation.csv')
    gene_name_list = list(final_annotation_gene_df['kegg_gene'])
    num_gene = len(gene_name_list)
    drug_num_dict_df = pd.read_csv('./' + dataset + '/filtered_data/drug_num_dict.csv')
    drug_dict = dict(zip(drug_num_dict_df.Drug, drug_num_dict_df.drug_num))
    num_drug = len(drug_dict)
    node_num = num_gene + num_drug
    # [num_gene_edge, num_drug_edge]
    gene_num_df = pd.read_csv('./' + dataset + '/filtered_data/kegg_gene_num_interaction.csv')
    gene_num_df = gene_num_df.drop_duplicates()
    drugbank_num_df = pd.read_csv('./' + dataset + '/filtered_data/final_drugbank_num_sym.csv')
    num_gene_edge = gene_num_df.shape[0]
    num_drug_edge = drugbank_num_df.shape[0]
    num_edge = num_gene_edge + num_drug_edge
    # import pdb; pdb.set_trace()
    # BUILD UP MODEL
    model = WeBGNNDecoder(input_dim=args.input_dim, hidden_dim=args.hidden_dim, embedding_dim=args.output_dim, 
                decoder_dim=args.decoder_dim, node_num=node_num, num_edge=num_edge, num_gene_edge=num_gene_edge, device=device)
    model = model.to(device)
    return model
```
Here we use: <br>
`/filtered_data/kegg_gene_annotation.csv`, which is a `.csv` with only 1 column and *2016* gene names, <br>
to get the number of genes (`num_gene`). <br>
`/filtered_data/drug_num_dict.csv`, which is a `.csv` with two columns: <br>
one with the names of the drugs <br>
and the other with their numbers, going from 2017 (continuing from the genes?) to 2034, <br>
so *18* drugs, <br>
to get the number of drugs (`num_drug`). <br>
These two numbers are then added and passed to the `WeBGNNDecoder` as `node_num`. <br>

`/filtered_data/kegg_gene_num_interaction.csv`, which is a `.csv` with two columns: <br>
one with the number of a gene <br>
and the other with the number of the gene it connects to. <br>
For the 2016 genes there are *18512* connections, <br>
which gives the number of gene edges (`num_gene_edge`). <br>
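The counting described above can be sketched on miniature stand-in files. The column names (`kegg_gene`, `Drug`, `drug_num`, `src`, `dest`) follow the quoted code and the cells in this notebook, while the file contents are made up for illustration.

```python
import io
import pandas as pd

# Made-up miniature stand-ins for the three filtered files.
gene_csv = io.StringIO("kegg_gene\nA1\nB2\nC3\n")
drug_csv = io.StringIO("Drug,drug_num\nDrugX,4\nDrugY,5\n")
edge_csv = io.StringIO("src,dest\n1,2\n2,3\n1,2\n")

genes = pd.read_csv(gene_csv)
drugs = pd.read_csv(drug_csv)
edges = pd.read_csv(edge_csv)

# Node counts: genes plus drugs.
num_gene = len(genes["kegg_gene"])
num_drug = len(dict(zip(drugs["Drug"], drugs["drug_num"])))
node_num = num_gene + num_drug

# Edge count after removing exact duplicate rows, as in the quoted function.
num_gene_edge = edges.drop_duplicates().shape[0]

# Check that drug numbering continues directly after the genes.
numbering_continues = drugs["drug_num"].min() == num_gene + 1

print(num_gene, num_drug, node_num, num_gene_edge, numbering_continues)
```

Running the same checks on the real files would confirm whether the drug numbers really continue from the gene numbers, which is only suspected above.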


In [1]:
import pandas as pd
import numpy as np

In [None]:
gene_num_df = pd.read_csv('../data-nci/filtered_data/kegg_gene_num_interaction.csv')


In [11]:
is_present = (gene_num_df[['src', 'dest']] == 4).any().any()
is_present

True

In [13]:
missing_numbers = [
    num for num in range(1, 2017) 
    if not ((gene_num_df['src'] == num).any() or 
            (gene_num_df['dest'] == num).any())
]
print("Missing numbers:", missing_numbers)

Missing numbers: []
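The same coverage check can be written with a set difference, which avoids scanning both columns once per gene number. This is a sketch on made-up edges, and `num_gene = 5` is a hypothetical value chosen for the toy data.

```python
import pandas as pd

# Made-up edge list; gene numbers 3 and 5 never appear.
edges = pd.DataFrame({"src": [1, 2, 4], "dest": [2, 4, 1]})
num_gene = 5  # hypothetical number of genes for this toy example

# Every gene number that occurs as a source or a destination.
present = set(edges["src"]) | set(edges["dest"])

# Gene numbers in 1..num_gene that occur in neither column.
missing = sorted(set(range(1, num_gene + 1)) - present)
print("Missing numbers:", missing)
```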


In [14]:
mirrored_df = gene_num_df.rename(columns={'src':'dest', 'dest':'src'})

# Combine original and mirrored DataFrames
combined = pd.concat([gene_num_df, mirrored_df])

# Find duplicates (including mirrored pairs)
duplicates = combined[combined.duplicated(keep=False)]  # keep=False marks all duplicates

# Show duplicates
print("Duplicate connections (including mirrors):")
print(duplicates.sort_values(by=['src', 'dest']))

# Count duplicates
print(f"\nTotal duplicate pairs: {len(duplicates)//2}")

Duplicate connections (including mirrors):
        src  dest
487      82   143
848      82   143
488      82   144
858      82   144
489      82   422
...     ...   ...
18356  2007  2000
18470  2008  2000
18357  2008  2000
18488  2010  2000
18358  2010  2000

[1324 rows x 2 columns]

Total duplicate pairs: 662
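A note on reading this count: when both `(a, b)` and `(b, a)` exist in the original file, each direction appears once in the original and once in the mirrored frame, so the pair contributes four rows to `duplicates`. Assuming there are no exact duplicate rows and no self-loops, `len(duplicates)//2` then counts directed edges that have a reverse counterpart, not undirected pairs. A sketch of counting reciprocal pairs directly, on made-up edges:

```python
import numpy as np
import pandas as pd

# Made-up edges: (1, 2) and (3, 4) exist in both directions, (4, 5) does not.
edges = pd.DataFrame({"src": [1, 2, 3, 4, 4], "dest": [2, 1, 4, 3, 5]})

# Canonical form: smaller number first, so (a, b) and (b, a) become equal.
canon = pd.DataFrame({
    "lo": np.minimum(edges["src"], edges["dest"]),
    "hi": np.maximum(edges["src"], edges["dest"]),
})

# A canonical pair seen more than once was present in both directions
# (assuming the edge list has no exact duplicate rows).
counts = canon.groupby(["lo", "hi"]).size()
reciprocal_pairs = counts[counts > 1]
print(len(reciprocal_pairs))
```

Whether reciprocal edges should be kept as two directed edges or merged into one depends on how the model treats edge direction, which the notebook has not established yet.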


In [15]:
gene_num_df.shape

(18512, 2)

In [16]:
gene_num_df_cleaned = gene_num_df.drop_duplicates()
gene_num_df_cleaned.shape

(18512, 2)