### Mapping Statistical Data of Gene Expressions

On the cluster, the dense gene expression matrices are stored in directories numbered in sample order, as listed in the sample mapping file: https://github.com/LaraLim/cnv-supervised-learning/blob/main/mapping/sample_mapping.csv

The statistics calculated from the gene expression data have been uploaded to this repo, with each file identified by its subdirectory on the cluster. The following code maps the statistics files to the sample case IDs.

In [24]:
# Imports
import requests
import pandas as pd
from IPython.display import display  # explicit import so display() also works outside a notebook

In [None]:
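# GitHub API URL listing the contents of the gene_statistics directory
api_url = 'https://api.github.com/repos/LaraLim/cnv-supervised-learning/contents/gene_statistics'
response = requests.get(api_url)
response.raise_for_status()  # stop here if the request failed (e.g. rate limit or bad URL)
files = response.json()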

# obtain the csv files in the gene_statistics directory
desired_files = [file['name'] for file in files if file['name'].endswith('.csv')]

print("The csv files in the gene_statistics directory are:")
print(desired_files)

# load the sample mapping file, which holds the case IDs and the numbered subdirectories
df_map = pd.read_csv('https://raw.githubusercontent.com/LaraLim/cnv-supervised-learning/refs/heads/main/mapping/sample_mapping.csv')

# create a Stats_file column: everything before the last '-' in the CaseID column,
# joined with '_' to the number in the Subdir column
df_map['Stats_file'] = df_map['CaseID'].str.rsplit('-', n=1).str[0] + '_' + df_map['Subdir'].astype(str)

#display the mapped stats files to the sample case IDs
print("Mapped stats files to the sample case IDs:")
display(df_map)



The csv files in the gene_statistics directory are:
['C3L-00359_1.csv', 'C3L-00606_1.csv', 'C3L-00606_2.csv', 'C3L-00606_3.csv', 'C3L-01287_1.csv', 'C3L-01287_2.csv', 'C3L-01953_1.csv', 'C3L-02705_1.csv', 'C3L-02858_1.csv', 'C3L-03405_1.csv', 'C3L-03968_1.csv', 'C3N-00148_1.csv', 'C3N-00148_2.csv', 'C3N-00148_3.csv', 'C3N-00148_4.csv', 'C3N-00149_1.csv', 'C3N-00149_2.csv', 'C3N-00149_3.csv', 'C3N-00439_1.csv', 'C3N-00662_1.csv', 'C3N-01175_1.csv', 'C3N-01270_1.csv', 'C3N-01798_1.csv', 'C3N-01814_1.csv', 'C3N-01815_1.csv', 'C3N-01816_1.csv', 'C3N-01904_1.csv', 'C3N-02181_1.csv', 'C3N-02188_1.csv', 'C3N-02190_1.csv', 'C3N-02769_1.csv', 'C3N-02783_1.csv', 'C3N-02784_1.csv', 'C3N-03184_1.csv', 'C3N-03186_1.csv', 'C3N-03188_1.csv']
Mapped stats files to the sample case IDs:


Unnamed: 0,CaseID,Filename,Subdir,Stats_file
0,C3L-00359-01,ee5e869c-e15f-4899-9e12-377920609b42.wgs.ASCAT...,1,C3L-00359_1
1,C3L-00606-01,f4f49853-5dbc-4b00-8ed0-3dffec3423cd.wgs.ASCAT...,1,C3L-00606_1
2,C3L-00606-02,cfe2f44c-ab6a-407a-ae93-5203cd3eb6fe.wgs.ASCAT...,2,C3L-00606_2
3,C3L-00606-03,77009b5c-68a2-4763-91b9-c9ca64694b55.wgs.ASCAT...,3,C3L-00606_3
4,C3L-01287-01,d36eea62-f7d1-48f4-9ca3-3b4b11a71e57.wgs.ASCAT...,1,C3L-01287_1
5,C3L-01287-03,6e0fe3f9-f17f-44cd-8773-184eebec5321.wgs.ASCAT...,2,C3L-01287_2
6,C3L-01953-01,8abdeb70-a909-49c1-867d-2916f840af54.wgs.ASCAT...,1,C3L-01953_1
7,C3L-02705-71,c4ec3b0e-0cef-4937-bbb7-be06b69bf54c.wgs.ASCAT...,1,C3L-02705_1
8,C3L-02858-01,0bc70e09-1157-4af2-a045-1ac10f34c997.wgs.ASCAT...,1,C3L-02858_1
9,C3L-03405-01,0bc70e09-1157-4af2-a045-1ac10f34c997.wgs.ASCAT...,1,C3L-03405_1
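
Before relying on the mapping, it can help to confirm that every derived name corresponds to a file actually present in `gene_statistics`. The following is a minimal sketch and not part of the original notebook: `find_unmatched` is a hypothetical helper, and the short sample lists stand in for `df_map['Stats_file']` and `desired_files`.

```python
def find_unmatched(stats_names, repo_files):
    """Return (missing, unused).

    missing: derived names that have no matching .csv in the repo listing
    unused:  .csv files in the repo listing that no derived name points to
    """
    expected = {name + '.csv' for name in stats_names}
    available = set(repo_files)
    missing = sorted(expected - available)
    unused = sorted(available - expected)
    return missing, unused


# Illustrative values copied from the outputs above; in the notebook, pass
# df_map['Stats_file'] and desired_files instead.
sample_names = ['C3L-00359_1', 'C3L-00606_1', 'C3L-00606_2']
sample_files = ['C3L-00359_1.csv', 'C3L-00606_1.csv',
                'C3L-00606_2.csv', 'C3L-00606_3.csv']

missing, unused = find_unmatched(sample_names, sample_files)
print("Mapped names with no file:", missing)
print("Files with no mapped name:", unused)
```

An empty `missing` list means every row of the mapping resolves to a file; entries in `unused` point to statistics files that the mapping does not cover.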


In [26]:
# correct Stats_file entries that still carry the sample suffix so that they
# match the file names in the repo (C3N-02190-01_1 and C3N-02784-01_1)

# replace in the Stats_file column
df_map['Stats_file'] = df_map['Stats_file'].replace({'C3N-02190-01_1': 'C3N-02190_1',
                                                       'C3N-02784-01_1': 'C3N-02784_1'})

# append .csv to each entry in the Stats_file column
df_map['Stats_file'] = df_map['Stats_file'] + '.csv'

# rename the Filename column to CNV_file and the Subdir column to Cluster_subdir
df_map = df_map.rename(columns={'Filename': 'CNV_file', 'Subdir': 'Cluster_subdir'})


display(df_map)

# write the mapping to stats_mapping.csv
df_map.to_csv('stats_mapping.csv', index=False)

Unnamed: 0,CaseID,CNV_file,Cluster_subdir,Stats_file
0,C3L-00359-01,ee5e869c-e15f-4899-9e12-377920609b42.wgs.ASCAT...,1,C3L-00359_1.csv
1,C3L-00606-01,f4f49853-5dbc-4b00-8ed0-3dffec3423cd.wgs.ASCAT...,1,C3L-00606_1.csv
2,C3L-00606-02,cfe2f44c-ab6a-407a-ae93-5203cd3eb6fe.wgs.ASCAT...,2,C3L-00606_2.csv
3,C3L-00606-03,77009b5c-68a2-4763-91b9-c9ca64694b55.wgs.ASCAT...,3,C3L-00606_3.csv
4,C3L-01287-01,d36eea62-f7d1-48f4-9ca3-3b4b11a71e57.wgs.ASCAT...,1,C3L-01287_1.csv
5,C3L-01287-03,6e0fe3f9-f17f-44cd-8773-184eebec5321.wgs.ASCAT...,2,C3L-01287_2.csv
6,C3L-01953-01,8abdeb70-a909-49c1-867d-2916f840af54.wgs.ASCAT...,1,C3L-01953_1.csv
7,C3L-02705-71,c4ec3b0e-0cef-4937-bbb7-be06b69bf54c.wgs.ASCAT...,1,C3L-02705_1.csv
8,C3L-02858-01,0bc70e09-1157-4af2-a045-1ac10f34c997.wgs.ASCAT...,1,C3L-02858_1.csv
9,C3L-03405-01,0bc70e09-1157-4af2-a045-1ac10f34c997.wgs.ASCAT...,1,C3L-03405_1.csv
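
Once `stats_mapping.csv` has been written, the statistics file for a given case can be looked up by `CaseID`. The sketch below is an illustration under stated assumptions: `load_mapping` and `stats_url_for` are hypothetical helpers, and the raw-file base URL is inferred from the repository paths used above rather than confirmed by the source. The inline CSV text stands in for the saved file and includes only the columns the lookup needs.

```python
import csv
import io

# Assumed base URL, inferred from the sample_mapping.csv raw URL and the
# gene_statistics API path used earlier; adjust if the repo layout differs.
RAW_BASE = ('https://raw.githubusercontent.com/LaraLim/cnv-supervised-learning/'
            'refs/heads/main/gene_statistics/')


def load_mapping(handle):
    """Read mapping rows into a {CaseID: Stats_file} dictionary."""
    return {row['CaseID']: row['Stats_file'] for row in csv.DictReader(handle)}


def stats_url_for(case_id, mapping):
    """Return the assumed raw URL of the statistics CSV mapped to case_id."""
    if case_id not in mapping:
        raise KeyError(f"No statistics file mapped for {case_id}")
    return RAW_BASE + mapping[case_id]


# Stand-in for open('stats_mapping.csv', newline=''); values taken from the
# table above.
sample_csv = io.StringIO(
    "CaseID,Cluster_subdir,Stats_file\n"
    "C3L-00359-01,1,C3L-00359_1.csv\n"
    "C3L-00606-02,2,C3L-00606_2.csv\n"
)

mapping = load_mapping(sample_csv)
url = stats_url_for('C3L-00606-02', mapping)
print(url)
```

The resulting URL can then be passed to `pd.read_csv` to load the statistics for that sample, provided the assumed base URL matches the repository.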
