## Framework for Supervised Learning on CNV Data and Gene Expressions

We first load the gene-level CNV file for a single sample into a dataframe, shown below.

In [42]:
import pandas as pd

df_map = pd.read_csv('https://raw.githubusercontent.com/LaraLim/cnv-supervised-learning/refs/heads/main/sample_mapping.csv')


#first sample 
filename = df_map['Filename'][0]

#build the URL for the sample's gene-level CNV file and load it
url = f'https://raw.githubusercontent.com/LaraLim/cnv-supervised-learning/refs/heads/main/gene_level_cnv/{filename}'
df = pd.read_csv(url, sep='\t')


#find how many unique gene_ids are in the dataframe
unique_genes = df['gene_id'].unique()
num_genes = len(unique_genes)
num_rows = df.shape[0]

#print the number of rows and columns for a sample
print(f'{filename} has {df.shape[0]} rows and {df.shape[1]} columns')

print(f'The number of unique gene_ids in the dataframe is {num_genes}')

# validate whether the number of unique gene_ids is equal to the number of rows in the dataframe
if (num_genes == num_rows):
    print("The number of unique gene_ids is equal to the number of rows for a sample")

display(df.head())

ee5e869c-e15f-4899-9e12-377920609b42.wgs.ASCAT.gene_level.copy_number_variation.tsv has 60623 rows and 8 columns
The number of unique gene_ids in the dataframe is 60623
The number of unique gene_ids is equal to the number of rows for a sample


Unnamed: 0,gene_id,gene_name,chromosome,start,end,copy_number,min_copy_number,max_copy_number
0,ENSG00000223972.5,DDX11L1,chr1,11869,14409,4.0,4.0,4.0
1,ENSG00000227232.5,WASH7P,chr1,14404,29570,4.0,4.0,4.0
2,ENSG00000278267.1,MIR6859-1,chr1,17369,17436,4.0,4.0,4.0
3,ENSG00000243485.5,MIR1302-2HG,chr1,29554,31109,4.0,4.0,4.0
4,ENSG00000284332.1,MIR1302-2,chr1,30366,30503,4.0,4.0,4.0


Now we iterate over all of the samples, creating a dataframe for each and storing it in a dictionary keyed by its sample identifier (CaseID).
We want to ensure that each sample contains the same list of unique gene identifiers and has the same number of rows.

In [43]:
#dictionary of dataframes for each sample
dfs = {}

# loop through samples, creating a data frame for each sample
for filename in df_map['Filename']:

    #look up the case id that corresponds to this file
    case_id = df_map[df_map['Filename'] == filename]['CaseID'].values[0]


    url = f'https://raw.githubusercontent.com/LaraLim/cnv-supervised-learning/refs/heads/main/gene_level_cnv/{filename}'
    df = pd.read_csv(url, sep='\t')

    #check if the number of rows is the same
    if df.shape[0] != num_rows: 
        print("The number of rows is not the same between all files")
    
    #check if the number of genes is the same
    if len(df['gene_id'].unique()) != num_genes:
        print("The number of genes is not the same between all files")

    #check whether the set of unique genes is the same as in the first sample
    if set(df['gene_id']) != set(unique_genes):
        print("The unique genes are not the same between all files")
    
    df['CaseID'] = case_id
    df['SumGeneExpression'] = 0
    df['MeanGeneExpression'] = 0
    df['VarianceGeneExpression'] = 0

    df['status'] = 'normal'
    df.loc[df['copy_number'] > 2, 'status'] = 'amplified'
    df.loc[df['copy_number'] < 2, 'status'] = 'deleted'

    #move the position of the copy number column to the last column
    copy_number = df.pop('copy_number')
    df['copy_number_target'] = copy_number

    
    #add the dataframe to the dictionary
    dfs[case_id] = df



We now know that the list of unique gene identifiers is consistent across the samples. We have also confirmed that the number of unique gene identifiers in each sample equals the number of rows in that sample. This allows us to apply the same transformations over all the genes within each sample.
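
One way to check this consistency, assuming gene order should also match across files, is to compare the `gene_id` columns position by position. The sketch below is a minimal illustration on toy frames; the helper `same_gene_ids` and the toy identifiers are hypothetical and not part of the notebook.

```python
import pandas as pd

def same_gene_ids(reference: pd.DataFrame, other: pd.DataFrame) -> bool:
    """Return True when both frames list identical gene_id values in the same order."""
    # reset_index so that only the values (not the row labels) are compared
    return reference['gene_id'].reset_index(drop=True).equals(
        other['gene_id'].reset_index(drop=True)
    )

# Toy frames standing in for three samples (hypothetical values)
sample_a = pd.DataFrame({'gene_id': ['ENSG1', 'ENSG2', 'ENSG3']})
sample_b = pd.DataFrame({'gene_id': ['ENSG1', 'ENSG2', 'ENSG3']})
sample_c = pd.DataFrame({'gene_id': ['ENSG1', 'ENSG2', 'ENSG4']})

print(same_gene_ids(sample_a, sample_b))  # True
print(same_gene_ids(sample_a, sample_c))  # False
```

If only membership matters and not order, comparing `set(df['gene_id'])` between samples would be enough.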


### Data Imputations
Next, we check the copy number columns for null values: how many there are in each sample, and whether they affect the same genes across samples.

In [44]:
#count the null values in each column for every sample
null_values = pd.DataFrame()

for case_id, df in dfs.items():
    null_values[case_id] = df.isnull().sum()

display(null_values)


# for each sample, collect the gene_ids whose copy number is null so they can be compared across samples
null_genes = {}
for case_id, df in dfs.items():
    null_genes[case_id] = df.loc[df['copy_number_target'].isnull(), ['gene_id']]

# for case_id, df in null_genes.items():
#     print(f'Null genes for {case_id}')
#     display(df)

#display the first sample's null genes
display(null_genes['C3L-00359-01'])


Unnamed: 0,C3L-00359-01,C3L-00606-01,C3L-00606-02,C3L-00606-03,C3L-01287-01,C3L-01287-03,C3L-01953-01,C3L-02705-71,C3L-02858-01,C3L-03968-01,...,C3N-01904-02,C3N-02181-02,C3N-02188-03,C3N-02190-01-02,C3N-02769-02,C3N-02783-05,C3N-02784-01-03,C3N-03184-02,C3N-03186-01,C3N-03188-02
gene_id,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
gene_name,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
chromosome,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
start,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
end,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
min_copy_number,2225,2220,2217,2217,1709,1712,1709,1707,1716,2243,...,2220,2222,1709,1706,1714,2221,1721,1721,2226,1724
max_copy_number,2225,2220,2217,2217,1709,1712,1709,1707,1716,2243,...,2220,2222,1709,1706,1714,2221,1721,1721,2226,1724
CaseID,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
SumGeneExpression,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MeanGeneExpression,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,gene_id
119,ENSG00000284740.1
126,ENSG00000286989.1
180,ENSG00000284745.1
213,ENSG00000284668.2
214,ENSG00000284703.1
...,...
60618,ENSG00000124334.17_PAR_Y
60619,ENSG00000270726.6_PAR_Y
60620,ENSG00000185203.12_PAR_Y
60621,ENSG00000182484.15_PAR_Y


We can see that copy_number_target is null for a varying number of genes in each sample. However, the affected genes remain **mostly** consistent across samples. Therefore, to handle these missing values while staying consistent across our datasets, we impute the missing copy numbers with -1.

In [45]:
#impute the copy number values that are missing with -1 
for case_id, df in dfs.items():
    dfs[case_id]['copy_number_target'] = dfs[case_id]['copy_number_target'].fillna(-1)

display(dfs['C3L-00359-01'].tail())

Unnamed: 0,gene_id,gene_name,chromosome,start,end,min_copy_number,max_copy_number,CaseID,SumGeneExpression,MeanGeneExpression,VarianceGeneExpression,status,copy_number_target
60618,ENSG00000124334.17_PAR_Y,IL9R,chrY,57184101,57197337,,,C3L-00359-01,0,0,0,normal,-1.0
60619,ENSG00000270726.6_PAR_Y,AJ271736.1,chrY,57190738,57208756,,,C3L-00359-01,0,0,0,normal,-1.0
60620,ENSG00000185203.12_PAR_Y,WASIR1,chrY,57201143,57203357,,,C3L-00359-01,0,0,0,normal,-1.0
60621,ENSG00000182484.15_PAR_Y,WASH6P,chrY,57207346,57212230,,,C3L-00359-01,0,0,0,normal,-1.0
60622,ENSG00000227159.8_PAR_Y,DDX11L16,chrY,57212184,57214397,,,C3L-00359-01,0,0,0,normal,-1.0


From processing the statistical data on Gene Expressions, we also know that the dense matrices do not exist on the cluster for the following cases and their samples:

- C3N-01180
- C3N-01334

Therefore, we remove these samples from our dataset.

We are left with a total of 36 samples.


For the regression problem, we will drop CaseID, gene_id, gene_name, min_copy_number, max_copy_number, and status. For a classification problem, we would keep status.

In [47]:
# remove the C3N-01180 and C3N-01334 samples from the dictionary
dfs.pop('C3N-01180-01')
dfs.pop('C3N-01334-03')

Unnamed: 0,gene_id,gene_name,chromosome,start,end,min_copy_number,max_copy_number,CaseID,SumGeneExpression,MeanGeneExpression,VarianceGeneExpression,status,copy_number_target
0,ENSG00000223972.5,DDX11L1,chr1,11869,14409,4.0,4.0,C3N-01334-03,0,0,0,amplified,4.0
1,ENSG00000227232.5,WASH7P,chr1,14404,29570,4.0,4.0,C3N-01334-03,0,0,0,amplified,4.0
2,ENSG00000278267.1,MIR6859-1,chr1,17369,17436,4.0,4.0,C3N-01334-03,0,0,0,amplified,4.0
3,ENSG00000243485.5,MIR1302-2HG,chr1,29554,31109,4.0,4.0,C3N-01334-03,0,0,0,amplified,4.0
4,ENSG00000284332.1,MIR1302-2,chr1,30366,30503,4.0,4.0,C3N-01334-03,0,0,0,amplified,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
60618,ENSG00000124334.17_PAR_Y,IL9R,chrY,57184101,57197337,,,C3N-01334-03,0,0,0,normal,-1.0
60619,ENSG00000270726.6_PAR_Y,AJ271736.1,chrY,57190738,57208756,,,C3N-01334-03,0,0,0,normal,-1.0
60620,ENSG00000185203.12_PAR_Y,WASIR1,chrY,57201143,57203357,,,C3N-01334-03,0,0,0,normal,-1.0
60621,ENSG00000182484.15_PAR_Y,WASH6P,chrY,57207346,57212230,,,C3N-01334-03,0,0,0,normal,-1.0


### Transform Numerical Features

For each sample, we transform the numerical features:
- start position (normalize)
- end position (normalize)
- Sum Gene Expression (log-normalize)
- Mean Gene Expression (log-normalize)
- Variance Gene Expression (log-normalize)

Example of normalizing the gene expressions for a few samples: https://github.com/LaraLim/cnv-supervised-learning/blob/main/visualize_genes_per_sample.ipynb
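
The per-sample transformations listed above could be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final method: "normalize" is assumed to mean min-max scaling to [0, 1] within a sample, the log transform is assumed to be `np.log1p` (which keeps zeros at zero), and the helper `transform_sample` and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

POSITION_COLS = ['start', 'end']
EXPRESSION_COLS = ['SumGeneExpression', 'MeanGeneExpression', 'VarianceGeneExpression']

def transform_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with min-max scaled positions and log1p-transformed expressions."""
    out = df.copy()
    for col in POSITION_COLS:
        lo, hi = out[col].min(), out[col].max()
        # assumption: min-max scaling; a constant column maps to 0.0
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    for col in EXPRESSION_COLS:
        # assumption: log(1 + x) so that zero expression stays at zero
        out[col] = np.log1p(out[col])
    return out

# Toy sample with hypothetical values
toy = pd.DataFrame({
    'start': [100, 200, 300],
    'end': [150, 250, 350],
    'SumGeneExpression': [0.0, 9.0, 99.0],
    'MeanGeneExpression': [0.0, 3.0, 33.0],
    'VarianceGeneExpression': [0.0, 1.0, 11.0],
})

transformed = transform_sample(toy)
print(transformed)
```

Scaling positions over the whole sample mixes chromosomes of different lengths; scaling within each chromosome is an alternative worth considering.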


**TODO:** before we log-normalize all of the gene expressions per sample, ask Ted whether we should ***also*** perform any transformations of each gene across samples
https://github.com/LaraLim/cnv-supervised-learning/blob/main/visualize_genes_across_samples.ipynb


### Combining Samples For ML
After we transform the numerical features for each sample, we combine the samples for ML. 

In [48]:
combined_df = pd.concat(dfs.values(), ignore_index=True)
display(combined_df.head())

Unnamed: 0,gene_id,gene_name,chromosome,start,end,min_copy_number,max_copy_number,CaseID,SumGeneExpression,MeanGeneExpression,VarianceGeneExpression,status,copy_number_target
0,ENSG00000223972.5,DDX11L1,chr1,11869,14409,4.0,4.0,C3L-00359-01,0,0,0,amplified,4.0
1,ENSG00000227232.5,WASH7P,chr1,14404,29570,4.0,4.0,C3L-00359-01,0,0,0,amplified,4.0
2,ENSG00000278267.1,MIR6859-1,chr1,17369,17436,4.0,4.0,C3L-00359-01,0,0,0,amplified,4.0
3,ENSG00000243485.5,MIR1302-2HG,chr1,29554,31109,4.0,4.0,C3L-00359-01,0,0,0,amplified,4.0
4,ENSG00000284332.1,MIR1302-2,chr1,30366,30503,4.0,4.0,C3L-00359-01,0,0,0,amplified,4.0


In [39]:
#display dimensions of the combined dataframe
print(f'The combined dataframe has {combined_df.shape[0]} rows and {combined_df.shape[1]} columns')

The combined dataframe has 2303674 rows and 7 columns


### Drop the Columns for Gene Identifiers & CaseID

We don't want the ML model to associate CNV with a specific identifier, such as
- gene_id
- gene_name
- CaseID


For a regression problem we drop the status. 
**TODO** ask Ted if we should also consider comparing to a classification problem

In [49]:
# drop the identifier columns and the status label from the combined dataframe
combined_df.drop(['gene_id', 'gene_name', 'CaseID', 'status'], axis=1, inplace=True)
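
If the classification variant is pursued, the `status` label would become the target instead of being dropped. The sketch below shows one possible way to derive it from raw copy numbers before imputation. It is an assumption-laden illustration: the helper `label_status` and the extra `'unknown'` class for missing copy numbers are hypothetical and differ from the notebook, where missing values are labeled `'normal'`.

```python
import numpy as np
import pandas as pd

def label_status(copy_number: pd.Series) -> pd.Series:
    """Map copy numbers to 'amplified', 'deleted', 'normal', or 'unknown' (missing)."""
    conditions = [
        copy_number.isnull().to_numpy(),   # assumption: keep missing values separate
        (copy_number > 2).to_numpy(),
        (copy_number < 2).to_numpy(),
    ]
    choices = ['unknown', 'amplified', 'deleted']
    labels = np.select(conditions, choices, default='normal')
    return pd.Series(labels, index=copy_number.index, name='status')

# Toy copy numbers (hypothetical values)
toy_copy_number = pd.Series([4.0, 2.0, 1.0, np.nan])
toy_status = label_status(toy_copy_number)
print(toy_status.tolist())  # ['amplified', 'normal', 'deleted', 'unknown']
```

For classification, the copy number columns themselves would need to be excluded from the features, since the label is computed directly from them.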


### Split the data for machine learning

In [51]:
#split for training and testing
from sklearn.model_selection import train_test_split

X = combined_df.drop('copy_number_target', axis=1)
y = combined_df['copy_number_target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#train SVM, Random Forest, etc, comparing performance
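
The model comparison mentioned in the last comment could look roughly like the sketch below. It runs on small synthetic data rather than `combined_df`, and the helper `build_pipeline`, the toy columns, and the choice of mean absolute error as the metric are assumptions made for illustration. The `chromosome` column is one-hot encoded because scikit-learn regressors need numeric inputs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVR

# Synthetic stand-in for the combined dataframe (hypothetical values)
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    'chromosome': rng.choice(['chr1', 'chr2', 'chrX'], size=n),
    'start': rng.random(n),
    'end': rng.random(n),
    'MeanGeneExpression': rng.random(n),
})
toy['copy_number_target'] = 2 + 2 * toy['MeanGeneExpression'] + rng.normal(0, 0.1, n)

X = toy.drop('copy_number_target', axis=1)
y = toy['copy_number_target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def build_pipeline(model):
    """One-hot encode chromosome, pass other columns through, then fit the model."""
    encode = ColumnTransformer(
        [('chrom', OneHotEncoder(handle_unknown='ignore'), ['chromosome'])],
        remainder='passthrough',
    )
    return make_pipeline(encode, model)

models = {
    'svm': SVR(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    pipe = build_pipeline(model).fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, pipe.predict(X_test))
    print(f'{name}: MAE = {scores[name]:.3f}')
```

On the full dataset of over two million rows, a kernel SVM is likely to be slow to train, so a subsample or a linear model may be a more practical first baseline.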




***TODO***
Ask Ted whether we need to perform any further transformations on the numerical training and test data after the split, given that we already have normalized values for each sample.
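
If a further transformation after the split does turn out to be needed, a common pattern is to fit it on the training data only and then apply it to both splits, so that no information from the test set leaks into training. The sketch below illustrates this with `StandardScaler` on synthetic data; the choice of scaler and the toy arrays are assumptions, not a decision made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and target (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # approximately zero per feature
print(X_test_scaled.shape)
```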