#### Contents:
* [Data description](#1)
    * [Genetic alterations in breast cancer](#1.a)
    * [Clinical, genomic and epigenomic data of breast cancer patients](#1.b)
* [Data processing](#2)

In [1]:
import pandas as pd

## Data description <a class="anchor" id="1"></a>

### Genetic alterations in breast cancer <a class="anchor" id="1.a"></a>

The **gene_list** dataframe stores a series of genes involved in the development of breast cancer. For each variant, the gene involved and the 

In [None]:
# Gene list sources:
# COSMIC: https://cancer.sanger.ac.uk/cosmic
# Genomics England PanelApp: https://panelapp.genomicsengland.co.uk/

In [54]:
gl1 = pd.read_csv('data/final_PanelApp_CGC_genes.txt', header=None)[0].tolist()
len(gl1)

877

In [55]:
gl1

['A1CF',
 'ABI1',
 'ABL1',
 'ABL2',
 'ACD',
 'ACKR3',
 'ACSL3',
 'ACSL6',
 'ACTRT1',
 'ACVR1',
 'ACVR1B',
 'ACVR2A',
 'ADM',
 'AFDN',
 'AFF1',
 'AFF3',
 'AFF4',
 'AIP',
 'AKAP9',
 'AKT1',
 'AKT2',
 'AKT3',
 'ALDH2',
 'ALK',
 'AMER1',
 'ANK1',
 'ANKRD26',
 'APC',
 'APOB',
 'APOBEC3B',
 'AR',
 'ARAF',
 'ARHGAP26',
 'ARHGAP35',
 'ARHGAP5',
 'ARHGEF10',
 'ARHGEF10L',
 'ARHGEF12',
 'ARID1A',
 'ARID1B',
 'ARID2',
 'ARNT',
 'ASPSCR1',
 'ASXL1',
 'ASXL2',
 'ATF1',
 'ATIC',
 'ATM',
 'ATP1A1',
 'ATP2B3',
 'ATR',
 'ATRX',
 'AXIN1',
 'AXIN2',
 'B2M',
 'BAP1',
 'BARD1',
 'BAX',
 'BAZ1A',
 'BCL10',
 'BCL11A',
 'BCL11B',
 'BCL2',
 'BCL2L12',
 'BCL3',
 'BCL6',
 'BCL7A',
 'BCL9',
 'BCL9L',
 'BCLAF1',
 'BCOR',
 'BCORL1',
 'BCR',
 'BIRC3',
 'BIRC6',
 'BLM',
 'BMP5',
 'BMPR1A',
 'BRAF',
 'BRCA1',
 'BRCA2',
 'BRD3',
 'BRD4',
 'BRIP1',
 'BTG1',
 'BTG2',
 'BTK',
 'BUB1B',
 'C15orf65',
 'CACNA1D',
 'CALR',
 'CAMTA1',
 'CANT1',
 'CARD11',
 'CARS',
 'CASP3',
 'CASP8',
 'CASP9',
 'CASR',
 'CBFA2T3',
 'CBFB',
 'C

In [25]:
gene_list2 = pd.read_csv('data/CBCG.csv')
gl2=gene_list2.Gene.str.rsplit('\xa0').str.get(0).unique().tolist()

In [27]:
len(gl2)

623

In [28]:
gl2

['ABCB1',
 'ABCG2',
 'ABHD8',
 'ABR',
 'ABRAXAS1',
 'AC008132.1',
 'AC058822.1',
 'ACE',
 'ACOXL-AS1',
 'ACOXL',
 'ACTA2',
 'ACYP2',
 'ADAM29',
 'ADARB2',
 'ADCY3',
 'ADCY9',
 'ADSS1',
 'AFDN',
 'AGTR1',
 'AHR',
 'AHRR',
 'AIRE',
 'AKAP6',
 'AKAP9',
 'AKR1A1',
 'AKT1',
 'ALDH2',
 'ALG9',
 'ALOX12',
 'ALOX12-AS1',
 'AMFR',
 'ANGPT2',
 'ANKLE1',
 'ANTXR1',
 'AP4B1-AS1',
 'APC',
 'APEX1',
 'APOBEC3A',
 'APOBEC3A_B',
 'AQP4-AS1',
 'ARHGAP26',
 'ARHGEF5',
 'ARL11',
 'ARLNC1',
 'ASPH',
 'ASPRV1',
 'ASTN2',
 'ASTN2-AS1',
 'ATAD5',
 'ATE1',
 'ATG10',
 'ATM',
 'ATP12A',
 'ATR',
 'ATRIP',
 'ATXN1',
 'ATXN7',
 'AURKA',
 'AXIN2',
 'BABAM1',
 'BAIAP2L1',
 'BARD1',
 'BCKDHB',
 'BCL2L11',
 'BCL2L15',
 'BET1',
 'BIVM-ERCC5',
 'BLM',
 'BRCA1',
 'BRCA2',
 'BRD7',
 'BRIP1',
 'BTN2A1',
 'C11orf65',
 'C2CD6',
 'CABLES1',
 'CASC15',
 'CASC16',
 'CASC21',
 'CASC8',
 'CASP10',
 'CASP8',
 'CBFA2T3',
 'CCDC152',
 'CCDC170',
 'CCDC33',
 'CCDC88C',
 'CCDC91',
 'CCND1',
 'CCNE1',
 'CDC27',
 'CDCA7L',
 'CDH1',
 'CD

In [29]:
# genes shared between the CBCG breast cancer gene list and the genes contained in the server file
len(list(set(gene_list) & set(gl2)))

126

### Clinical, genomic and epigenomic data of breast cancer patients <a class="anchor" id="1.b"></a>

### Clinical data

In [30]:
clin_patient = pd.read_csv('data/data_clinical_patient.txt', sep='\t', comment='#')
clin_sample = pd.read_csv('data/data_clinical_sample.txt', sep='\t', comment='#')

In [8]:
clin_patient.columns

Index(['PATIENT_ID', 'LYMPH_NODES_EXAMINED_POSITIVE', 'NPI', 'CELLULARITY',
       'CHEMOTHERAPY', 'COHORT', 'ER_IHC', 'HER2_SNP6', 'HORMONE_THERAPY',
       'INFERRED_MENOPAUSAL_STATE', 'SEX', 'INTCLUST', 'AGE_AT_DIAGNOSIS',
       'OS_MONTHS', 'OS_STATUS', 'CLAUDIN_SUBTYPE', 'THREEGENE',
       'VITAL_STATUS', 'LATERALITY', 'RADIO_THERAPY', 'HISTOLOGICAL_SUBTYPE',
       'BREAST_SURGERY', 'RFS_STATUS', 'RFS_MONTHS'],
      dtype='object')

In [9]:
clin_sample.columns

Index(['PATIENT_ID', 'SAMPLE_ID', 'CANCER_TYPE', 'CANCER_TYPE_DETAILED',
       'ER_STATUS', 'HER2_STATUS', 'GRADE', 'ONCOTREE_CODE', 'PR_STATUS',
       'SAMPLE_TYPE', 'TUMOR_SIZE', 'TUMOR_STAGE', 'TMB_NONSYNONYMOUS'],
      dtype='object')
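
Both clinical tables expose a `PATIENT_ID` column, so they can be combined into a single table. The following is a minimal sketch on made-up toy rows (the values are illustrative, not taken from the dataset). It assumes one row per patient in the patient table, which `validate="one_to_many"` checks.

```python
import pandas as pd

# Toy stand-ins for clin_patient and clin_sample (illustrative values only)
toy_patient = pd.DataFrame({
    "PATIENT_ID": ["P1", "P2", "P3"],
    "AGE_AT_DIAGNOSIS": [61.2, 48.9, 70.1],
})
toy_sample = pd.DataFrame({
    "PATIENT_ID": ["P1", "P2"],
    "SAMPLE_ID": ["P1", "P2"],
    "ER_STATUS": ["Positive", "Negative"],
})

# Left join keeps every patient; indicator=True adds a _merge column
# that flags patients without a matching sample row
toy_clin = toy_patient.merge(
    toy_sample,
    on="PATIENT_ID",
    how="left",
    validate="one_to_many",
    indicator=True,
)

# Patients present in the patient table but absent from the sample table
missing_sample = toy_clin.loc[toy_clin["_merge"] == "left_only", "PATIENT_ID"].tolist()
print(missing_sample)
```

A left join is used here so that patients without sample annotations stay visible instead of being dropped silently.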

### Genomic data

In [31]:
mut = pd.read_csv('data/data_mutations.txt', sep='\t', comment='#')
cna = pd.read_csv('data/data_cna.txt', sep='\t')
rna = pd.read_csv('data/data_mrna_illumina_microarray_zscores_ref_diploid_samples.txt', sep='\t', comment='#')
meth = pd.read_csv('data/data_methylation_promoters_rrbs.txt', sep='\t')

The **mut** dataframe stores 17272 small variants, ranging in length from one nucleotide to a few tens of nucleotides. Each variant is associated with a patient and describes a difference in a DNA fragment relative to the reference genome. 

In [58]:
mut.columns

Index(['Hugo_Symbol', 'Entrez_Gene_Id', 'Center', 'NCBI_Build', 'Chromosome',
       'Start_Position', 'End_Position', 'Strand', 'Consequence',
       'Variant_Classification', 'Variant_Type', 'Reference_Allele',
       'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'dbSNP_RS',
       'dbSNP_Val_Status', 'Tumor_Sample_Barcode',
       'Matched_Norm_Sample_Barcode', 'Match_Norm_Seq_Allele1',
       'Match_Norm_Seq_Allele2', 'Tumor_Validation_Allele1',
       'Tumor_Validation_Allele2', 'Match_Norm_Validation_Allele1',
       'Match_Norm_Validation_Allele2', 'Verification_Status',
       'Validation_Status', 'Mutation_Status', 'Sequencing_Phase',
       'Sequence_Source', 'Validation_Method', 'Score', 'BAM_File',
       'Sequencer', 't_ref_count', 't_alt_count', 'n_ref_count', 'n_alt_count',
       'HGVSc', 'HGVSp', 'HGVSp_Short', 'Transcript_ID', 'RefSeq',
       'Protein_position', 'Codons', 'Hotspot'],
      dtype='object')

The most relevant information is stored in the following columns:  
**Hugo_Symbol**: identification symbol of the gene in which the variant occurs  
**Chromosome**  
**Start_Position** and **End_Position**: exact position of the variant in the DNA strand  
**Consequence**, **Variant_Classification** and **Variant_Type**  
**Reference_Allele**: nucleotide found in the reference genome at the same variant location  
**Tumor_Seq_Allele1** and **Tumor_Seq_Allele2**  
**Tumor_Sample_Barcode**: sample ID, which identifies the patient
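
As an illustration of how these columns can be used together, here is a small sketch that counts mutated samples per gene and tabulates variant classes. The `toy_mut` frame is a made-up stand-in for `mut`, restricted to three of the columns listed above.

```python
import pandas as pd

# Toy stand-in for mut, reduced to three of the columns listed above
toy_mut = pd.DataFrame({
    "Hugo_Symbol": ["TP53", "TP53", "PIK3CA", "TP53"],
    "Variant_Classification": [
        "Missense_Mutation", "Frame_Shift_Ins",
        "Missense_Mutation", "Missense_Mutation",
    ],
    "Tumor_Sample_Barcode": ["S1", "S1", "S2", "S3"],
})

# Number of distinct samples carrying at least one variant in each gene
samples_per_gene = toy_mut.groupby("Hugo_Symbol")["Tumor_Sample_Barcode"].nunique()

# Variant counts per gene (rows) and classification (columns)
class_counts = pd.crosstab(toy_mut["Hugo_Symbol"], toy_mut["Variant_Classification"])

print(samples_per_gene)
print(class_counts)
```

Counting distinct barcodes rather than rows avoids inflating the frequency of a gene when one sample carries several variants in it.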

In [28]:
len(mut.Hugo_Symbol.unique())

173

In [30]:
len(mut.Tumor_Sample_Barcode.unique())

2369

In [31]:
mut.Variant_Classification.unique()

array(['Frame_Shift_Ins', 'In_Frame_Ins', 'Missense_Mutation', 'Silent',
       'Frame_Shift_Del', 'Nonsense_Mutation', 'Splice_Region',
       'Splice_Site', 'In_Frame_Del', "5'UTR", 'Nonstop_Mutation',
       'Intron', 'Translation_Start_Site'], dtype=object)

In [34]:
cols = ['Hugo_Symbol','Chromosome','Start_Position','End_Position','Consequence','Variant_Classification','Variant_Type','Reference_Allele','Tumor_Seq_Allele1','Tumor_Seq_Allele2','Tumor_Sample_Barcode']
mut.query('Variant_Classification=="5\'UTR"')[cols]

Unnamed: 0,Hugo_Symbol,Chromosome,Start_Position,End_Position,Consequence,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,Tumor_Sample_Barcode
309,PRR16,5,119800268,119800307,5_prime_UTR_variant,5'UTR,DEL,GATCAAGATCATCGTGGAGGATTTGGAATTAGTCCTGGGC,GATCAAGATCATCGTGGAGGATTTGGAATTAGTCCTGGGC,-,MTS-T0335
2127,PRR16,5,119800263,119800263,5_prime_UTR_variant,5'UTR,SNP,G,G,T,MB-0404
12037,PRR16,5,119800333,119800333,5_prime_UTR_variant,5'UTR,SNP,T,T,G,MB-5152
16940,PRR16,5,119800287,119800287,5_prime_UTR_variant,5'UTR,SNP,G,G,T,MTS-T1292


In [74]:
mut[['Hugo_Symbol','Chromosome','Start_Position','End_Position','Consequence','Variant_Classification','Variant_Type','Reference_Allele','Tumor_Seq_Allele1','Tumor_Seq_Allele2','Tumor_Sample_Barcode']]

Unnamed: 0,Hugo_Symbol,Chromosome,Start_Position,End_Position,Consequence,Variant_Classification,Variant_Type,Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2,Tumor_Sample_Barcode
0,TP53,17,7579344,7579345,frameshift_variant,Frame_Shift_Ins,INS,-,-,G,MTS-T0058
1,TP53,17,7579346,7579347,protein_altering_variant,In_Frame_Ins,INS,-,-,CAG,MTS-T0058
2,MLLT4,6,168299111,168299111,missense_variant,Missense_Mutation,SNP,G,G,T,MTS-T0058
3,NF2,22,29999995,29999995,missense_variant,Missense_Mutation,SNP,G,G,T,MTS-T0058
4,SF3B1,2,198288682,198288682,synonymous_variant,Silent,SNP,A,A,T,MTS-T0059
...,...,...,...,...,...,...,...,...,...,...,...
17267,PIK3CA,3,178917478,178917478,"missense_variant,splice_region_variant",Missense_Mutation,SNP,G,G,A,MB-0906
17268,PIK3CA,3,178952085,178952085,missense_variant,Missense_Mutation,SNP,A,A,G,MB-0906
17269,TP53,17,7579355,7579355,missense_variant,Missense_Mutation,SNP,A,A,G,MB-0906
17270,JAK1,1,65311232,65311233,frameshift_variant,Frame_Shift_Ins,INS,-,-,T,MB-0906


The **cna** dataframe stores, for each patient, the copy number alterations (gains or deletions of large DNA fragments) of 22544 genes. Rows represent genes and columns represent patients.   
Values:  
-2 = homozygous deletion  
-1 = hemizygous deletion  
0 = neutral / no change  
1 = gain  
2 = high level amplification.  
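
The coding above can be turned into readable labels by reshaping the gene-by-patient matrix into long form. This is a sketch on a made-up three-sample table; `toy_cna`, `CNA_LABELS` and the sample names are illustrative and not part of the notebook.

```python
import pandas as pd

# Labels for the discrete copy number codes described above
CNA_LABELS = {
    -2: "homozygous deletion",
    -1: "hemizygous deletion",
    0: "neutral",
    1: "gain",
    2: "high-level amplification",
}

# Toy stand-in for cna: genes as rows, samples as columns
toy_cna = pd.DataFrame({
    "Hugo_Symbol": ["A1BG", "A2M"],
    "S1": [0, 2],
    "S2": [-1, -1],
    "S3": [0, 0],
})

# Wide to long: one row per (gene, sample) pair
long_cna = toy_cna.melt(id_vars="Hugo_Symbol", var_name="sample", value_name="cna")
long_cna["label"] = long_cna["cna"].map(CNA_LABELS)

# Number of samples with any non-neutral call, per gene
altered_per_gene = (long_cna["cna"] != 0).groupby(long_cna["Hugo_Symbol"]).sum()
print(altered_per_gene)
```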

In [29]:
len(cna.Hugo_Symbol.unique())

22542

In [25]:
cna

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,MB-0000,MB-0039,MB-0045,MB-0046,MB-0048,MB-0050,MB-0053,MB-0062,...,MB-5467,MB-5546,MB-5585,MB-5625,MB-5648,MB-6020,MB-6213,MB-6230,MB-7148,MB-7188
0,A1BG,1.0,0,0,-1,0,0,0,0,-1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,A1BG-AS1,503538.0,0,0,-1,0,0,0,0,-1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,A1CF,29974.0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
3,A2M,2.0,0,0,-1,-1,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,A2M-AS1,144571.0,0,0,-1,-1,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,ZYG11A,,0,0,1,0,0,0,1,1,...,1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0
22540,ZYG11B,79699.0,0,0,-1,0,0,0,1,1,...,1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0
22541,ZYX,7791.0,0,-1,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0
22542,ZZEF1,23140.0,0,0,-2,-1,-1,-1,0,-1,...,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0


The **rna** dataframe stores, for each patient (columns), the mRNA expression values of 20603 genes (rows).

In [38]:
len(rna.Hugo_Symbol.unique())

20387
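
The number of unique symbols is lower than the 20603 rows mentioned in the description, which suggests that some symbols occur in more than one row. A possible way to list them is sketched below on a toy table. Averaging the duplicated rows is only one option and is an assumption here, not something the notebook does.

```python
import pandas as pd

# Toy stand-in for rna with one repeated gene symbol
toy_rna = pd.DataFrame({
    "Hugo_Symbol": ["RERE", "PHF7", "RERE", "CIDEA"],
    "S1": [-0.7, -0.3, 0.5, 0.0],
    "S2": [1.2, -1.0, 0.1, -1.0],
})

# keep=False marks every occurrence of a repeated symbol, not only the later ones
dup_mask = toy_rna["Hugo_Symbol"].duplicated(keep=False)
dup_symbols = sorted(toy_rna.loc[dup_mask, "Hugo_Symbol"].unique())

# One possible policy (an assumption): average the rows that share a symbol
collapsed = toy_rna.groupby("Hugo_Symbol", as_index=False).mean(numeric_only=True)

print(dup_symbols)
print(collapsed)
```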

In [13]:
rna

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,MB-0362,MB-0346,MB-0386,MB-0574,MB-0185,MB-0503,MB-0641,MB-0201,...,MB-6192,MB-4820,MB-5527,MB-5167,MB-5465,MB-5453,MB-5471,MB-5127,MB-4313,MB-4823
0,RERE,473,-0.7139,1.2266,-0.0053,-0.4399,-0.5958,0.4729,0.4974,-1.1900,...,-0.4596,1.8975,1.1120,1.1942,-1.7974,1.1339,0.0259,-0.3529,-1.2327,1.7217
1,RNF165,494470,-0.4606,0.3564,-0.6800,-1.0563,-0.0377,-0.6829,-0.2854,-0.4336,...,-1.0927,0.9103,-0.0023,-0.2898,3.5763,1.3429,0.5726,0.1731,0.5482,1.2239
2,PHF7,51533,-0.3325,-1.0617,0.2587,-0.2982,-1.2422,0.0558,-0.5011,-0.6418,...,-0.0725,0.7219,0.1402,0.8718,-0.9275,-0.0587,0.5240,-0.0311,4.4925,-0.2173
3,CIDEA,1149,-0.0129,-1.0394,3.2991,-0.2632,-1.0949,1.2628,2.0796,-0.8310,...,0.0679,-0.7126,-0.1523,-0.7593,-0.7141,-0.4324,-0.0336,-0.4003,2.4698,-0.7268
4,TENT2,167153,-0.7853,0.0337,-0.6649,2.1640,-0.2031,1.0304,0.6046,-1.7557,...,0.6400,-0.1102,1.2719,0.8178,-1.0301,0.6082,0.5608,2.4222,-3.2853,0.4181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20598,VPS72,6944,-0.2908,0.3443,0.4818,0.2503,-0.1057,-0.1657,-0.4730,1.4719,...,-0.9195,-1.4857,-1.4543,0.3791,-0.3989,-1.5529,-0.6349,-0.8160,-1.0902,-0.2811
20599,CSMD3,114788,-0.5286,-0.4379,6.9258,1.0466,-0.1060,0.3284,0.0993,-0.1987,...,-0.3776,-0.6366,-0.0607,-0.0475,0.2231,0.0706,0.1188,-0.3231,-0.1251,-0.4265
20600,CC2D1A,54862,0.0068,-0.7520,0.0519,0.2502,-0.3376,-0.4705,-0.6036,-1.1946,...,-0.5877,-1.1169,-0.5420,0.2947,-0.2800,2.5337,-0.8272,-0.1200,4.2708,-1.0090
20601,IGSF9,57549,0.4053,1.2968,0.7962,-0.1634,-0.2418,-0.2545,-0.9814,1.9240,...,-0.6217,-1.5481,-1.2088,0.4594,0.3821,0.3254,0.8187,-0.5648,0.5931,0.9043


The **meth** dataframe stores, for each patient (columns), the promoter methylation values of 13188 genes (rows). TODO: explain in more detail.

In [14]:
meth

Unnamed: 0,Hugo_Symbol,MB-0006,MB-0028,MB-0035,MB-0046,MB-0050,MB-0053,MB-0054,MB-0062,MB-0064,...,MB-7279,MB-7281,MB-7283,MB-7285,MB-7288,MB-7289,MB-7291,MB-7292,MB-7293,MB-7296
0,A2M,0.045031,0.066532,0.015487,0.102439,0.001905,0.039106,0.080311,0.007585,0.003140,...,0.000000,0.108374,0.001916,0.004673,0.004049,0.000000,0.177112,0.015464,0.090909,0.1
1,A4GALT,0.017582,0.038095,0.033333,0.000000,0.009804,0.236311,0.023460,0.001227,0.004673,...,0.000000,0.026393,0.202151,0.061224,0.000000,0.062745,0.039773,0.007752,0.011628,0.0
2,AAAS,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.003817,0.000000,...,0.000000,0.000000,0.013043,0.000000,0.000000,0.011811,0.003401,0.000000,0.014151,0.0
3,AACS,0.000000,0.019802,0.000000,0.000000,0.000000,0.000000,0.032967,0.000000,0.012195,...,0.000000,0.000000,0.026316,,0.000000,0.000000,0.007194,0.000000,0.000000,0.0
4,AADACL2,0.870370,0.958763,0.987395,0.965517,0.969231,0.763359,0.979310,0.958904,0.845588,...,,0.922414,0.962264,0.935484,0.812500,0.939163,0.921348,,0.937500,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13183,ZWINT,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.011494,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.016000,0.000000,0.000000,0.0
13184,ZXDC,0.000000,0.000000,0.000000,0.002331,0.000000,0.000000,0.006061,0.000000,0.008403,...,0.000000,0.000000,0.012500,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
13185,ZYG11B,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.012658,0.000000,0.000000,...,,0.000000,0.038961,0.014085,0.000000,0.011173,0.000000,,0.000000,0.0
13186,ZYX,0.013445,0.008097,0.000000,0.000000,0.000000,0.003155,0.007194,0.005703,0.012384,...,0.015873,0.000000,0.021028,0.026846,0.001546,0.005533,0.008869,0.013699,0.000000,0.0


In [None]:
# CELLULARITY: tumor content

In [None]:
# Columns holding a single distinct value (NaN included) carry no information
useless_cols = [c for c in mut.columns if mut[c].nunique(dropna=False) == 1]

# Remaining, informative columns
new_cols = [c for c in mut.columns if c not in useless_cols]


In [None]:
######## inconsistencies

In [13]:
# gene list obtained from CBCG = gl1
len(gl1)

623

In [14]:
# gene list from another article = gl2
#https://www.frontiersin.org/articles/10.3389/fgene.2021.596794/full
len(gl2)

193

In [15]:
# genes in common between the two lists
len(list(set(gl1) & set(gl2)))

64

In [16]:
# gene list from the METABRIC file = g_META
# https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric/data?select=METABRIC_RNA_Mutation.csv

file = pd.read_csv('METABRIC_RNA_Mutation.csv')
# the last 173 columns are the per-gene mutation columns
genes = file[file.columns[-173:]].columns
# remove the '_mut' suffix and upper-case the symbols
g_META = genes.str.split('_').str.get(0)
g_META = g_META.str.upper()
len(g_META)


173
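
The cell above selects the mutation columns by position (the last 173) and keeps the text before the first underscore. An alternative is sketched below that selects by the `_mut` suffix instead, shown on a made-up column list. It assumes every mutation column ends in `_mut`, which the displayed header suggests but which is not verified against the full file here.

```python
# Hypothetical column names in the style of METABRIC_RNA_Mutation.csv
toy_columns = ["patient_id", "gsk3b", "hla-g", "nras_mut", "hras_mut", "siah1_mut"]

SUFFIX = "_mut"

# Keep only the columns ending in the suffix, strip it, and upper-case the symbol
toy_g_meta = [c[: -len(SUFFIX)].upper() for c in toy_columns if c.endswith(SUFFIX)]

# Every other column, in its original order
toy_other = [c for c in toy_columns if not c.endswith(SUFFIX)]

print(toy_g_meta)
print(toy_other)
```

Selecting by suffix does not depend on the column order and leaves any underscore inside a gene symbol untouched.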

In [37]:
file[file.columns[200:]]

Unnamed: 0,gsk3b,hif1a,hla-g,hras,igf1,igf1r,inha,inhba,inhbc,itgav,...,mtap_mut,ppp2cb_mut,smarcd1_mut,nras_mut,ndfip1_mut,hras_mut,prps2_mut,smarcb1_mut,stmn2_mut,siah1_mut
0,-0.7982,0.4191,0.1048,-0.9923,0.9947,-0.4445,0.0883,1.1190,-0.4613,-0.4337,...,0,0,0,0,0,0,0,0,0,0
1,-0.0094,-0.8038,-0.9355,0.9679,0.8191,0.0638,-0.2797,-0.6347,-0.2217,-0.7036,...,0,0,0,0,0,0,0,0,0,0
2,-1.7768,3.5336,1.9768,-0.6058,1.5676,-0.6426,-0.4539,-0.3512,-1.4372,2.2121,...,0,0,0,0,0,0,0,0,0,0
3,-1.6895,3.3403,2.0772,0.9831,2.8491,-0.7844,1.9837,0.6968,0.1198,1.3652,...,0,0,0,0,0,0,0,0,0,0
4,-0.3001,1.3211,0.0246,0.2881,-0.2109,-0.5785,-0.1702,-0.5076,0.3660,-1.2198,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1899,-1.0073,0.4651,0.0992,0.0434,1.5313,-0.1809,-0.3722,-0.8222,-0.2382,0.6536,...,0,0,0,0,0,0,0,0,0,0
1900,0.5816,1.4471,-0.0569,0.7485,0.0285,0.8643,-0.3629,1.2752,0.9347,0.2462,...,0,0,0,0,0,0,0,0,0,0
1901,0.0491,2.8388,0.8123,-0.0733,-0.1027,-0.6666,-0.4058,-0.9113,0.0418,-0.3592,...,0,0,0,0,0,0,0,0,0,0
1902,-0.1292,0.3175,0.6938,1.2855,-0.0013,0.3560,0.0146,-1.1223,1.2106,-0.2377,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# list of RNA expression genes from the METABRIC file = r_META

rna_loc = file[file.columns[31:-173]].columns
r_META = rna_loc.str.upper()
len(r_META)

489

In [24]:
# genes in common between the mut file and g_META
g_mut = mut.Hugo_Symbol.unique()
len(g_mut)

173

In [25]:
len(list(set(g_mut) & set(g_META)))

169

In [22]:
for i in g_mut:
    if i not in g_META:
        print(i)

MLLT4
GPR124
MLL2
LARGE


In [23]:
for i in g_META:
    if i not in g_mut:
        print(i)

KMT2D
AFDN
ADGRA2
LARGE1
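
The four unmatched symbols on each side pair up as older and newer names of the same genes (MLLT4/AFDN, GPR124/ADGRA2, MLL2/KMT2D, LARGE/LARGE1). The sketch below renames symbols before intersecting. The `ALIASES` dictionary is written by hand from the two outputs above, and the short lists are toy stand-ins for `g_mut` and `g_META`.

```python
# Old symbol -> current symbol, taken from the two mismatch lists above
ALIASES = {
    "MLLT4": "AFDN",
    "GPR124": "ADGRA2",
    "MLL2": "KMT2D",
    "LARGE": "LARGE1",
}

def harmonize(symbols):
    """Return the set of symbols with known old names replaced by current ones."""
    return {ALIASES.get(s, s) for s in symbols}

# Toy stand-ins for g_mut and g_META
toy_g_mut = ["TP53", "MLLT4", "GPR124", "MLL2", "LARGE"]
toy_g_meta = ["TP53", "AFDN", "ADGRA2", "KMT2D", "LARGE1"]

before = set(toy_g_mut) & set(toy_g_meta)
after = harmonize(toy_g_mut) & harmonize(toy_g_meta)

print(len(before), len(after))
```

For more than a handful of genes, a curated symbol table would be a safer source for the mapping than a hand-written dictionary.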


In [137]:
len(set(mut.Hugo_Symbol))

173