## Mapping Gene Expression Data to CNV Table

### Overview
This notebook maps gene expression data to the corresponding gene-level copy number variation (CNV) table by gene name, so that gene expression can be added to the ground truth table as a feature. The gene expression data comes from the statistics files, which are generated by preprocessing the dense matrices.

### Workflow:

1. Gene-level CNV data is retrieved from its original file.
2. Gene names in the CNV table are matched (joined) to those in the corresponding stats file.
3. The expression data for genes is identified in the stats file (Sum column) and added to a new column in the CNV table.
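
The three steps above can be sketched on toy data. This is only an illustration: the two small tables below are hypothetical stand-ins for the real CNV and stats files, and their expression values are made up. Only the column names `gene_name`, `copy_number`, `Gene`, and `Sum` are taken from the files used in this notebook.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real CNV and stats files.
df_cnv = pd.DataFrame({
    "gene_name": ["DDX11L1", "WASH7P", "MIR6859-1"],
    "copy_number": [5.0, 5.0, 5.0],
})
df_ge = pd.DataFrame({
    "Gene": ["WASH7P", "AL627309.6"],
    "Sum": [0.5, 0.25],  # made-up expression sums
})

# Step 2: left join on the gene name so every CNV row is kept.
merged = df_cnv.merge(
    df_ge[["Gene", "Sum"]],
    left_on="gene_name",
    right_on="Gene",
    how="left",
)

# Step 3: expose Sum as gene_expr and drop the redundant join key.
merged = merged.rename(columns={"Sum": "gene_expr"}).drop(columns="Gene")
print(merged)
```

Genes that appear only in the CNV table keep their row and receive NaN in `gene_expr`. Genes that appear only in the stats table are dropped by the left join.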

### Test variant:
One stats file mapped to one gene-level CNV file

**Stats file:** `gene_statistics.csv`, original from commit 410eb18

**CNV file:** original name `cfe2f44c-ab6a-407a-ae93-5203cd3eb6fe.wgs.ASCAT.gene_level.copy_number_variation.tsv` (case ID C3L-00606, sample ID C3L-00606-02)

Import libraries

In [9]:
import pandas as pd
from google.colab import files

Get CNV and gene expression data

In [10]:
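uploaded = files.upload()
# Note: a re-uploaded file is saved with a " (n)" suffix (see the "Saving ..." lines in the output),
# so the reads below pick up the copy already stored under the original name.
df_cnv = pd.read_csv("gene_level.copy_number_variation.tsv", sep='\t') # CNV
df_ge = pd.read_csv("gene_statistics.csv", delimiter=',') # gene expression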

print("\ngene-level copy number variation:")
print(df_cnv.head())

print("\ngene expression statistics:")
print(df_ge.head())

Saving gene_level.copy_number_variation.tsv to gene_level.copy_number_variation (2).tsv
Saving gene_statistics.csv to gene_statistics (2).csv

gene-level copy number variation:
             gene_id    gene_name chromosome  start    end  copy_number  \
0  ENSG00000223972.5      DDX11L1       chr1  11869  14409          5.0   
1  ENSG00000227232.5       WASH7P       chr1  14404  29570          5.0   
2  ENSG00000278267.1    MIR6859-1       chr1  17369  17436          5.0   
3  ENSG00000243485.5  MIR1302-2HG       chr1  29554  31109          5.0   
4  ENSG00000284332.1    MIR1302-2       chr1  30366  30503          5.0   

   min_copy_number  max_copy_number  
0              5.0              5.0  
1              5.0              5.0  
2              5.0              5.0  
3              5.0              5.0  
4              5.0              5.0  

gene expression statistics:
         Gene       Sum      Mean      Variance
0      WASH7P  0.032887  0.000003  5.050720e-10
1  AL627309.6  0.01

Or retrieve combined CNV data locally

In [None]:
file_path = "D:/GDC-data/ground_truth_combined.csv"
df = pd.read_csv(file_path, delimiter=',')
print(df.head())

Merge the dataframes on `gene_name` from `df_cnv` and `Gene` from `df_ge`. Use a left join to preserve all rows from `df_cnv`.

In [13]:
df_cnv_ge = pd.merge(df_cnv, df_ge[['Gene', 'Sum']], left_on='gene_name', right_on='Gene', how='left')

df_cnv_ge = df_cnv_ge.rename(columns={'Sum': 'gene_expr'}) # rename Sum col to gene_expr in the merged df
df_cnv_ge = df_cnv_ge.drop(columns='Gene') # drop unnecessary Gene col from former stats csv

print(df_cnv_ge.head(30))

               gene_id    gene_name chromosome   start     end  copy_number  \
0    ENSG00000223972.5      DDX11L1       chr1   11869   14409          5.0   
1    ENSG00000227232.5       WASH7P       chr1   14404   29570          5.0   
2    ENSG00000278267.1    MIR6859-1       chr1   17369   17436          5.0   
3    ENSG00000243485.5  MIR1302-2HG       chr1   29554   31109          5.0   
4    ENSG00000284332.1    MIR1302-2       chr1   30366   30503          5.0   
5    ENSG00000237613.2      FAM138A       chr1   34554   36081          5.0   
6    ENSG00000268020.3       OR4G4P       chr1   52473   53312          5.0   
7    ENSG00000240361.2      OR4G11P       chr1   57598   64116          5.0   
8    ENSG00000186092.6        OR4F5       chr1   65419   71585          5.0   
9    ENSG00000238009.6   AL627309.1       chr1   89295  133723          5.0   
10   ENSG00000239945.1   AL627309.3       chr1   89551   91105          5.0   
11   ENSG00000233750.3       CICP27       chr1  1310

Note that some gene expression values are missing (NaN) in this test, for two reasons:
- The CNV data includes both tumor and normal examples. These do not map one-to-one to the stats file, which is derived from the dense matrix of tumor examples only.
- The stats file used here does not correspond to this specific CNV file; it was chosen as a basic comparison with enough genes in common to test the mapping.
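
To gauge how much expression data is missing, the matched and unmatched CNV rows can be counted with the `indicator` option of `pandas.merge`. The sketch below is a minimal example under stated assumptions: the helper name `match_report` is hypothetical, and the toy tables are illustrative rather than taken from the real files.

```python
import pandas as pd

def match_report(df_cnv, df_ge):
    """Count CNV rows with and without a matching gene in the stats table.

    Hypothetical helper; assumes a 'gene_name' column in df_cnv
    and a 'Gene' column in df_ge, as in this notebook.
    """
    # Deduplicate stats genes so repeated names cannot multiply CNV rows.
    stats_genes = df_ge[["Gene"]].drop_duplicates()
    checked = df_cnv.merge(
        stats_genes,
        left_on="gene_name",
        right_on="Gene",
        how="left",
        indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
    )
    return {
        "matched": int((checked["_merge"] == "both").sum()),
        "unmatched": int((checked["_merge"] == "left_only").sum()),
    }

# Illustrative toy tables (not the real data).
toy_cnv = pd.DataFrame({"gene_name": ["DDX11L1", "WASH7P", "MIR6859-1"]})
toy_ge = pd.DataFrame({"Gene": ["WASH7P", "AL627309.6"], "Sum": [0.5, 0.25]})

report = match_report(toy_cnv, toy_ge)
print(report)
```

Calling `match_report(df_cnv, df_ge)` on the real tables would show how many CNV rows end up with NaN in `gene_expr` because their gene is absent from the stats file.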