#### Contents:
1. [Clinical and genomic data of breast cancer patients](#1)<br>
       1.1 [Clinical data](#1.1)<br>
       1.2 [Genomic, gene expression and epigenomic data](#1.2)<br>
2. [Genetic alterations in breast cancer](#2)<br>

This notebook is divided into two sections. The first shows the raw data, which are processed in data_processing.ipynb and analysed in analysis.ipynb. The second builds a list of breast cancer related genes from two datasets downloaded from public databases.

In [1]:
import pandas as pd
import numpy as np

## 1. Clinical and genomic data of breast cancer patients <a class="anchor" id="1"></a>

The dataset consists of six files downloaded from https://www.cbioportal.org/study/summary?id=brca_metabric; it stores clinical and genomic information on 2509 breast cancer patients.

Each file is shown below after being loaded into a pandas dataframe.

The files are not stored in the GitHub repository because of their size.

### 1.1 Clinical data <a class="anchor" id="1.1"></a>

Clinical data are stored in data_clinical_patient.txt and data_clinical_sample.txt. The main features used in the analysis are described after the two tables.

In [2]:
clin_patient = pd.read_csv('data/data_clinical_patient.txt', sep='\t', comment='#')
clin_sample = pd.read_csv('data/data_clinical_sample.txt', sep='\t', comment='#')

In [3]:
print(clin_patient.columns)
clin_patient

Index(['PATIENT_ID', 'LYMPH_NODES_EXAMINED_POSITIVE', 'NPI', 'CELLULARITY',
       'CHEMOTHERAPY', 'COHORT', 'ER_IHC', 'HER2_SNP6', 'HORMONE_THERAPY',
       'INFERRED_MENOPAUSAL_STATE', 'SEX', 'INTCLUST', 'AGE_AT_DIAGNOSIS',
       'OS_MONTHS', 'OS_STATUS', 'CLAUDIN_SUBTYPE', 'THREEGENE',
       'VITAL_STATUS', 'LATERALITY', 'RADIO_THERAPY', 'HISTOLOGICAL_SUBTYPE',
       'BREAST_SURGERY', 'RFS_STATUS', 'RFS_MONTHS'],
      dtype='object')


Unnamed: 0,PATIENT_ID,LYMPH_NODES_EXAMINED_POSITIVE,NPI,CELLULARITY,CHEMOTHERAPY,COHORT,ER_IHC,HER2_SNP6,HORMONE_THERAPY,INFERRED_MENOPAUSAL_STATE,...,OS_STATUS,CLAUDIN_SUBTYPE,THREEGENE,VITAL_STATUS,LATERALITY,RADIO_THERAPY,HISTOLOGICAL_SUBTYPE,BREAST_SURGERY,RFS_STATUS,RFS_MONTHS
0,MB-0000,10.0,6.044,,NO,1.0,Positve,NEUTRAL,YES,Post,...,0:LIVING,claudin-low,ER-/HER2-,Living,Right,YES,Ductal/NST,MASTECTOMY,0:Not Recurred,138.65
1,MB-0002,0.0,4.020,High,NO,1.0,Positve,NEUTRAL,YES,Pre,...,0:LIVING,LumA,ER+/HER2- High Prolif,Living,Right,YES,Ductal/NST,BREAST CONSERVING,0:Not Recurred,83.52
2,MB-0005,1.0,4.030,High,YES,1.0,Positve,NEUTRAL,YES,Pre,...,1:DECEASED,LumB,,Died of Disease,Right,NO,Ductal/NST,MASTECTOMY,1:Recurred,151.28
3,MB-0006,3.0,4.050,Moderate,YES,1.0,Positve,NEUTRAL,YES,Pre,...,0:LIVING,LumB,,Living,Right,YES,Mixed,MASTECTOMY,0:Not Recurred,162.76
4,MB-0008,8.0,6.080,High,YES,1.0,Positve,NEUTRAL,YES,Post,...,1:DECEASED,LumB,ER+/HER2- High Prolif,Died of Disease,Right,YES,Mixed,MASTECTOMY,1:Recurred,18.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2504,MTS-T2428,0.0,2.540,,,1.0,Positve,,,,...,,,,,,,,,1:Recurred,4.93
2505,MTS-T2429,0.0,4.560,,,1.0,Positve,,,,...,,,,,,,,,1:Recurred,16.18
2506,MTS-T2430,0.0,,,,,,,,,...,,,,,,,,,,
2507,MTS-T2431,0.0,,,,,,,,,...,,,,,,,,,,


In [4]:
print(clin_sample.columns)
clin_sample

Index(['PATIENT_ID', 'SAMPLE_ID', 'CANCER_TYPE', 'CANCER_TYPE_DETAILED',
       'ER_STATUS', 'HER2_STATUS', 'GRADE', 'ONCOTREE_CODE', 'PR_STATUS',
       'SAMPLE_TYPE', 'TUMOR_SIZE', 'TUMOR_STAGE', 'TMB_NONSYNONYMOUS'],
      dtype='object')


Unnamed: 0,PATIENT_ID,SAMPLE_ID,CANCER_TYPE,CANCER_TYPE_DETAILED,ER_STATUS,HER2_STATUS,GRADE,ONCOTREE_CODE,PR_STATUS,SAMPLE_TYPE,TUMOR_SIZE,TUMOR_STAGE,TMB_NONSYNONYMOUS
0,MB-0000,MB-0000,Breast Cancer,Breast Invasive Ductal Carcinoma,Positive,Negative,3.0,IDC,Negative,Primary,22.0,2.0,0.000000
1,MB-0002,MB-0002,Breast Cancer,Breast Invasive Ductal Carcinoma,Positive,Negative,3.0,IDC,Positive,Primary,10.0,1.0,2.615035
2,MB-0005,MB-0005,Breast Cancer,Breast Invasive Ductal Carcinoma,Positive,Negative,2.0,IDC,Positive,Primary,15.0,2.0,2.615035
3,MB-0006,MB-0006,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Positive,Negative,2.0,MDLC,Positive,Primary,25.0,2.0,1.307518
4,MB-0008,MB-0008,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Positive,Negative,3.0,MDLC,Positive,Primary,40.0,2.0,2.615035
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2504,MTS-T2428,MTS-T2428,Breast Cancer,Invasive Breast Carcinoma,Positive,,1.0,BRCA,,Primary,27.0,1.0,2.615035
2505,MTS-T2429,MTS-T2429,Breast Cancer,Invasive Breast Carcinoma,Positive,,2.0,BRCA,,Primary,28.0,2.0,5.230071
2506,MTS-T2430,MTS-T2430,Breast Cancer,Invasive Breast Carcinoma,,,,BRCA,,Primary,,0.0,7.845106
2507,MTS-T2431,MTS-T2431,Breast Cancer,Invasive Breast Carcinoma,,,,BRCA,,Primary,,0.0,9.152624


**PATIENT_ID** and **SAMPLE_ID**: identifier of each patient/sample (in this dataset the two columns contain the same IDs)

**TMB_NONSYNONYMOUS**: tumor mutational burden, i.e. the number of nonsynonymous somatic (acquired) mutations per megabase of coding sequence of the tumor genome (Mut/Mb)

**COHORT**: group of subjects who share a defining characteristic (takes a value from 1 to 5)

**GRADE**: histological grade assigned by the pathologist according to how abnormal, and therefore how aggressive, the cancer cells look (takes a value from 1 to 3)

**ER_STATUS**: whether the cancer cells are positive or negative for estrogen receptors

**ER_IHC**: estrogen receptor status assessed by immunohistochemistry, a staining technique used in pathology that targets a specific antigen: if the antigen is present the tissue on the slide is colored, otherwise it is not (positive/negative)

**HER2_STATUS**: whether the cancer is positive or negative for HER2

**HER2_SNP6**: HER2 status assessed with a molecular technique, namely copy number data from the SNP 6.0 genotyping array (for example 'NEUTRAL' in the rows shown above)

**PR_STATUS**: whether the cancer cells are positive or negative for progesterone receptors

**INTCLUST**: integrative cluster, i.e. the molecular subtype of the cancer derived from gene expression and copy number profiles (takes one of the values '4ER+', '3', '9', '7', '4ER-', '5', '8', '10', '1', '2', '6')

**CLAUDIN_SUBTYPE**: molecular subtype from the PAM50 classification extended with the claudin-low subtype (for example 'LumA', 'LumB' and 'claudin-low' in the rows shown above). PAM50 is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (i.e. spread to other organs). The claudin-low subtype is defined by gene expression characteristics, most prominently low expression of cell–cell adhesion genes, high expression of epithelial–mesenchymal transition (EMT) genes, and stem cell-like/less differentiated gene expression patterns

**THREEGENE**: three-gene classifier subtype (takes one of the values 'ER-/HER2-', 'ER+/HER2- High Prolif', 'ER+/HER2- Low Prolif', 'HER2+', or nan)

**HISTOLOGICAL_SUBTYPE**: type of cancer based on microscopic examination of the cancer tissue (takes one of the values 'Ductal/NST', 'Mixed', 'Lobular', 'Tubular/ cribriform', 'Mucinous', 'Medullary', 'Other', 'Metaplastic')
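
The statements above (identical patient and sample IDs, the set of values taken by a categorical column) can be checked directly on the dataframes. The following is a minimal sketch on a small hand-made table: `toy_clin` and its rows are illustrative assumptions, not records from the METABRIC files. On the real data the same calls would presumably be applied to `clin_sample` and `clin_patient`.

```python
import pandas as pd

# Toy stand-in for the clinical tables; IDs and values are illustrative only.
toy_clin = pd.DataFrame({
    'PATIENT_ID': ['P1', 'P2', 'P3', 'P4'],
    'SAMPLE_ID': ['P1', 'P2', 'P3', 'P4'],
    'GRADE': [3.0, 2.0, None, 3.0],
    'THREEGENE': ['ER-/HER2-', 'HER2+', None, 'ER-/HER2-'],
})

# True when every sample ID equals the corresponding patient ID
same_ids = (toy_clin['PATIENT_ID'] == toy_clin['SAMPLE_ID']).all()

# Frequency of each category, keeping missing values visible
threegene_counts = toy_clin['THREEGENE'].value_counts(dropna=False)

# Sorted list of the observed (non-missing) grades
grades = sorted(toy_clin['GRADE'].dropna().unique())

print(same_ids)
print(threegene_counts)
print(grades)
```

Passing `dropna=False` to `value_counts` makes the number of unlabeled patients explicit, which matters here because several clinical columns contain missing values.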

### 1.2 Genomic, gene expression and epigenomic data  <a class="anchor" id="1.2"></a>

In [5]:
mut = pd.read_csv('data/data_mutations.txt', sep='\t', comment='#')
cna = pd.read_csv('data/data_cna.txt', sep='\t')
rna = pd.read_csv('data/data_mrna_illumina_microarray_zscores_ref_diploid_samples.txt', sep='\t', comment='#')
meth = pd.read_csv('data/data_methylation_promoters_rrbs.txt', sep='\t')

#### Small mutations

The **mut** dataframe stores 17272 small variants, ranging in length from one nucleotide to a few tens of nucleotides. Each variant is associated with a patient and describes a difference in a short DNA segment with respect to the reference genome.

The most relevant information is stored in the following columns:

**Hugo_Symbol**: identification symbol of the gene in which the variant occurs 

**Chromosome**, **Start_Position**, **End_Position** and **Strand**: exact position of the variant in the DNA molecule  

**Consequence**: predicted effect of the variant (e.g. missense_variant, frameshift_variant)  

**Variant_Classification** and **Variant_Type**: type of mutation (insertion, deletion, single nucleotide variant...)  

**Reference_Allele**: nucleotide(s) found in the reference genome at the variant location  

**Tumor_Seq_Allele1** and **Tumor_Seq_Allele2**: nucleotide(s) observed on each of the two alleles of the tumor sample at the variant location  

**Tumor_Sample_Barcode**: sample id (referred to the patient)
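
As an illustration of how these columns can be combined, the sketch below counts, for each gene, the number of distinct samples carrying a variant and the number of variants per classification. The table `toy_mut` is a hand-made assumption with illustrative rows; only the column names are taken from the **mut** dataframe.

```python
import pandas as pd

# Toy MAF-like table; rows are illustrative, not real METABRIC records.
toy_mut = pd.DataFrame({
    'Hugo_Symbol': ['TP53', 'TP53', 'PIK3CA', 'PIK3CA', 'PIK3CA'],
    'Variant_Classification': ['Missense_Mutation', 'Frame_Shift_Ins',
                               'Missense_Mutation', 'Missense_Mutation', 'Silent'],
    'Tumor_Sample_Barcode': ['S1', 'S2', 'S1', 'S3', 'S3'],
})

# Number of distinct samples carrying at least one variant in each gene
samples_per_gene = toy_mut.groupby('Hugo_Symbol')['Tumor_Sample_Barcode'].nunique()

# Variant counts per gene, one column per classification
class_table = pd.crosstab(toy_mut['Hugo_Symbol'], toy_mut['Variant_Classification'])

print(samples_per_gene)
print(class_table)
```

Counting distinct samples rather than rows avoids counting a patient twice when the same gene carries more than one variant in that patient.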

In [6]:
print(mut.columns)
mut

Index(['Hugo_Symbol', 'Entrez_Gene_Id', 'Center', 'NCBI_Build', 'Chromosome',
       'Start_Position', 'End_Position', 'Strand', 'Consequence',
       'Variant_Classification', 'Variant_Type', 'Reference_Allele',
       'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'dbSNP_RS',
       'dbSNP_Val_Status', 'Tumor_Sample_Barcode',
       'Matched_Norm_Sample_Barcode', 'Match_Norm_Seq_Allele1',
       'Match_Norm_Seq_Allele2', 'Tumor_Validation_Allele1',
       'Tumor_Validation_Allele2', 'Match_Norm_Validation_Allele1',
       'Match_Norm_Validation_Allele2', 'Verification_Status',
       'Validation_Status', 'Mutation_Status', 'Sequencing_Phase',
       'Sequence_Source', 'Validation_Method', 'Score', 'BAM_File',
       'Sequencer', 't_ref_count', 't_alt_count', 'n_ref_count', 'n_alt_count',
       'HGVSc', 'HGVSp', 'HGVSp_Short', 'Transcript_ID', 'RefSeq',
       'Protein_position', 'Codons', 'Hotspot'],
      dtype='object')


Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,Center,NCBI_Build,Chromosome,Start_Position,End_Position,Strand,Consequence,Variant_Classification,...,n_ref_count,n_alt_count,HGVSc,HGVSp,HGVSp_Short,Transcript_ID,RefSeq,Protein_position,Codons,Hotspot
0,TP53,,METABRIC,GRCh37,17,7579344,7579345,+,frameshift_variant,Frame_Shift_Ins,...,,,ENST00000269305.4:c.343dup,p.His115ProfsTer34,p.H115Pfs*34,ENST00000269305,NM_001126112.2,114.0,-/C,0
1,TP53,,METABRIC,GRCh37,17,7579346,7579347,+,protein_altering_variant,In_Frame_Ins,...,,,ENST00000269305.4:c.340_341insCTG,p.Leu114delinsSerVal,p.L114delinsSV,ENST00000269305,NM_001126112.2,114.0,ttg/tCTGtg,0
2,MLLT4,,METABRIC,GRCh37,6,168299111,168299111,+,missense_variant,Missense_Mutation,...,,,ENST00000392108.3:c.1544G>T,p.Gly515Val,p.G515V,ENST00000392108,NM_001040000.2,515.0,gGa/gTa,0
3,NF2,,METABRIC,GRCh37,22,29999995,29999995,+,missense_variant,Missense_Mutation,...,,,ENST00000338641.4:c.8G>T,p.Gly3Val,p.G3V,ENST00000338641,NM_000268.3,3.0,gGg/gTg,0
4,SF3B1,,METABRIC,GRCh37,2,198288682,198288682,+,synonymous_variant,Silent,...,,,ENST00000335508.6:c.45T>A,p.Ile15=,p.I15=,ENST00000335508,NM_012433.2,15.0,atT/atA,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17267,PIK3CA,,METABRIC,GRCh37,3,178917478,178917478,+,"missense_variant,splice_region_variant",Missense_Mutation,...,,,ENST00000263967.3:c.353G>A,p.Gly118Asp,p.G118D,ENST00000263967,NM_006218.2,118.0,gGt/gAt,0
17268,PIK3CA,,METABRIC,GRCh37,3,178952085,178952085,+,missense_variant,Missense_Mutation,...,,,ENST00000263967.3:c.3140A>G,p.His1047Arg,p.H1047R,ENST00000263967,NM_006218.2,1047.0,cAt/cGt,0
17269,TP53,,METABRIC,GRCh37,17,7579355,7579355,+,missense_variant,Missense_Mutation,...,,,ENST00000269305.4:c.332T>C,p.Leu111Pro,p.L111P,ENST00000269305,NM_001126112.2,111.0,cTg/cCg,0
17270,JAK1,,METABRIC,GRCh37,1,65311232,65311233,+,frameshift_variant,Frame_Shift_Ins,...,,,ENST00000342505.4:c.2079dup,p.Val694SerfsTer15,p.V694Sfs*15,ENST00000342505,NM_002227.2,693.0,aaa/aaAa,0


In [7]:
print('Total number of genes with small mutations: ',len(mut.Hugo_Symbol.unique()))
print('Number of patients with small mutations: ',len(mut.Tumor_Sample_Barcode.unique()))

Total number of genes with small mutations:  173
Number of patients with small mutations:  2369


#### Copy number variations

The **cna** dataframe stores, for each patient, the copy number alterations (duplications or deletions of large DNA segments) of each gene. Rows represent genes and columns represent patients; the table has 22544 rows, corresponding to 22542 unique gene symbols (see the count below).   
Values:  
-2 = homozygous deletion  
-1 = hemizygous deletion  
0 = neutral / no change  
1 = gain  
2 = high level amplification  
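
The coding above can be turned into readable labels and summarised per gene. The following is a minimal sketch on a toy table with the same layout as **cna**: the names `CNA_LABELS` and `toy_cna`, the gene names and the values are illustrative assumptions.

```python
import pandas as pd

# Mapping of the discrete copy number codes listed above
CNA_LABELS = {
    -2: 'homozygous deletion',
    -1: 'hemizygous deletion',
    0: 'neutral',
    1: 'gain',
    2: 'high level amplification',
}

# Toy table with the same layout as cna; gene names and values are illustrative.
toy_cna = pd.DataFrame({
    'Hugo_Symbol': ['GENE_A', 'GENE_B'],
    'Entrez_Gene_Id': [1.0, 2.0],
    'S1': [0, 2],
    'S2': [-1, 2],
    'S3': [0, -2],
})

# Keep only the patient columns, indexed by gene
values = toy_cna.set_index('Hugo_Symbol').drop(columns='Entrez_Gene_Id')

# Fraction of patients with any alteration (non-zero code) per gene
altered_fraction = (values != 0).mean(axis=1)

# Human-readable label for every gene/patient pair
labels = values.apply(lambda col: col.map(CNA_LABELS))

print(altered_fraction)
print(labels)
```

Some patient columns of the real table are stored as floats (e.g. `0.0`, `-1.0`); the dictionary lookup still works in that case because Python treats `-1.0` and `-1` as the same key.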

In [8]:
print('Total number of genes with copy number variations: ',len(cna.Hugo_Symbol.unique()))
print('Number of patients with copy number variations: ',len(cna.columns)-2)

Total number of genes with copy number variations:  22542
Number of patients with copy number variations:  2173


In [9]:
cna

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,MB-0000,MB-0039,MB-0045,MB-0046,MB-0048,MB-0050,MB-0053,MB-0062,...,MB-5467,MB-5546,MB-5585,MB-5625,MB-5648,MB-6020,MB-6213,MB-6230,MB-7148,MB-7188
0,A1BG,1.0,0,0,-1,0,0,0,0,-1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,A1BG-AS1,503538.0,0,0,-1,0,0,0,0,-1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,A1CF,29974.0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
3,A2M,2.0,0,0,-1,-1,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,A2M-AS1,144571.0,0,0,-1,-1,0,0,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22539,ZYG11A,,0,0,1,0,0,0,1,1,...,1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0
22540,ZYG11B,79699.0,0,0,-1,0,0,0,1,1,...,1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0
22541,ZYX,7791.0,0,-1,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-1.0
22542,ZZEF1,23140.0,0,0,-2,-1,-1,-1,0,-1,...,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0


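#### RNA expression

The **rna** dataframe stores, for each patient (columns), the mRNA expression values of the genes (rows): 20603 rows covering 20387 unique gene symbols (see the count below). As the file name indicates, the values are z-scores computed relative to diploid samples.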

In [10]:
print('Total number of genes: ',len(rna.Hugo_Symbol.unique()))
print('Number of patients: ',len(rna.columns)-2)

Total number of genes:  20387
Number of patients:  1980


In [11]:
rna

Unnamed: 0,Hugo_Symbol,Entrez_Gene_Id,MB-0362,MB-0346,MB-0386,MB-0574,MB-0185,MB-0503,MB-0641,MB-0201,...,MB-6192,MB-4820,MB-5527,MB-5167,MB-5465,MB-5453,MB-5471,MB-5127,MB-4313,MB-4823
0,RERE,473,-0.7139,1.2266,-0.0053,-0.4399,-0.5958,0.4729,0.4974,-1.1900,...,-0.4596,1.8975,1.1120,1.1942,-1.7974,1.1339,0.0259,-0.3529,-1.2327,1.7217
1,RNF165,494470,-0.4606,0.3564,-0.6800,-1.0563,-0.0377,-0.6829,-0.2854,-0.4336,...,-1.0927,0.9103,-0.0023,-0.2898,3.5763,1.3429,0.5726,0.1731,0.5482,1.2239
2,PHF7,51533,-0.3325,-1.0617,0.2587,-0.2982,-1.2422,0.0558,-0.5011,-0.6418,...,-0.0725,0.7219,0.1402,0.8718,-0.9275,-0.0587,0.5240,-0.0311,4.4925,-0.2173
3,CIDEA,1149,-0.0129,-1.0394,3.2991,-0.2632,-1.0949,1.2628,2.0796,-0.8310,...,0.0679,-0.7126,-0.1523,-0.7593,-0.7141,-0.4324,-0.0336,-0.4003,2.4698,-0.7268
4,TENT2,167153,-0.7853,0.0337,-0.6649,2.1640,-0.2031,1.0304,0.6046,-1.7557,...,0.6400,-0.1102,1.2719,0.8178,-1.0301,0.6082,0.5608,2.4222,-3.2853,0.4181
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20598,VPS72,6944,-0.2908,0.3443,0.4818,0.2503,-0.1057,-0.1657,-0.4730,1.4719,...,-0.9195,-1.4857,-1.4543,0.3791,-0.3989,-1.5529,-0.6349,-0.8160,-1.0902,-0.2811
20599,CSMD3,114788,-0.5286,-0.4379,6.9258,1.0466,-0.1060,0.3284,0.0993,-0.1987,...,-0.3776,-0.6366,-0.0607,-0.0475,0.2231,0.0706,0.1188,-0.3231,-0.1251,-0.4265
20600,CC2D1A,54862,0.0068,-0.7520,0.0519,0.2502,-0.3376,-0.4705,-0.6036,-1.1946,...,-0.5877,-1.1169,-0.5420,0.2947,-0.2800,2.5337,-0.8272,-0.1200,4.2708,-1.0090
20601,IGSF9,57549,0.4053,1.2968,0.7962,-0.1634,-0.2418,-0.2545,-0.9814,1.9240,...,-0.6217,-1.5481,-1.2088,0.4594,0.3821,0.3254,0.8187,-0.5648,0.5931,0.9043


In [12]:
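# collect the gene symbols that do not appear in exactly one row of rna
list_rna_duplicates = []
for i in rna.Hugo_Symbol.unique():
    c = rna[rna['Hugo_Symbol']==i].Hugo_Symbol
    if len(c) != 1:
        list_rna_duplicates.append(i)
# number of gene symbols with more than one row
len(list_rna_duplicates)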

197

#### Methylation

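The **meth** dataframe stores, for each patient (columns), the methylation values of the genes (rows): 13188 rows covering 13184 unique gene symbols (see the count below). As the file name indicates, the values refer to promoter regions and come from RRBS (reduced representation bisulfite sequencing); in the rows shown below they lie between 0 and 1.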

In [13]:
meth

Unnamed: 0,Hugo_Symbol,MB-0006,MB-0028,MB-0035,MB-0046,MB-0050,MB-0053,MB-0054,MB-0062,MB-0064,...,MB-7279,MB-7281,MB-7283,MB-7285,MB-7288,MB-7289,MB-7291,MB-7292,MB-7293,MB-7296
0,A2M,0.045031,0.066532,0.015487,0.102439,0.001905,0.039106,0.080311,0.007585,0.003140,...,0.000000,0.108374,0.001916,0.004673,0.004049,0.000000,0.177112,0.015464,0.090909,0.1
1,A4GALT,0.017582,0.038095,0.033333,0.000000,0.009804,0.236311,0.023460,0.001227,0.004673,...,0.000000,0.026393,0.202151,0.061224,0.000000,0.062745,0.039773,0.007752,0.011628,0.0
2,AAAS,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.003817,0.000000,...,0.000000,0.000000,0.013043,0.000000,0.000000,0.011811,0.003401,0.000000,0.014151,0.0
3,AACS,0.000000,0.019802,0.000000,0.000000,0.000000,0.000000,0.032967,0.000000,0.012195,...,0.000000,0.000000,0.026316,,0.000000,0.000000,0.007194,0.000000,0.000000,0.0
4,AADACL2,0.870370,0.958763,0.987395,0.965517,0.969231,0.763359,0.979310,0.958904,0.845588,...,,0.922414,0.962264,0.935484,0.812500,0.939163,0.921348,,0.937500,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13183,ZWINT,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.011494,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.016000,0.000000,0.000000,0.0
13184,ZXDC,0.000000,0.000000,0.000000,0.002331,0.000000,0.000000,0.006061,0.000000,0.008403,...,0.000000,0.000000,0.012500,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
13185,ZYG11B,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.012658,0.000000,0.000000,...,,0.000000,0.038961,0.014085,0.000000,0.011173,0.000000,,0.000000,0.0
13186,ZYX,0.013445,0.008097,0.000000,0.000000,0.000000,0.003155,0.007194,0.005703,0.012384,...,0.015873,0.000000,0.021028,0.026846,0.001546,0.005533,0.008869,0.013699,0.000000,0.0


In [14]:
print('Total number of genes: ',len(meth.Hugo_Symbol.unique()))
print('Number of patients: ',len(meth.columns)-1)

Total number of genes:  13184
Number of patients:  1418


Some gene symbols appear in more than one row of the **cna**, **rna** and **meth** dataframes. Here "duplicate" means more than one row for the same gene; the rows are not necessarily identical, since the values for each patient may differ.
This may be due to an error or to the presence of more than one sample for the same patient. 
This check is postponed, since it may become unnecessary once the datasets are filtered according to the breast cancer gene list.
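
If the check turns out to be needed, it could look like the following sketch, which separates repeated genes whose rows are identical from repeated genes whose rows differ. The table `toy` and its values are illustrative assumptions; the same calls would presumably apply to **cna**, **rna** or **meth** after dropping any non-patient column other than `Hugo_Symbol`.

```python
import pandas as pd

# Toy expression-like table; gene names and values are illustrative.
toy = pd.DataFrame({
    'Hugo_Symbol': ['GENE_A', 'GENE_A', 'GENE_B', 'GENE_C', 'GENE_C'],
    'S1': [0.5, 0.5, 1.0, -0.2, 0.3],
    'S2': [1.5, 1.5, 0.0, 0.8, 0.8],
})

# Rows whose gene symbol occurs more than once (all occurrences are kept)
dup_rows = toy[toy['Hugo_Symbol'].duplicated(keep=False)]
dup_genes = dup_rows['Hugo_Symbol'].unique().tolist()

# After removing exact duplicate rows, a gene that still has more than
# one row carries different values for at least one patient
n_distinct = dup_rows.drop_duplicates().groupby('Hugo_Symbol').size()
inconsistent_genes = n_distinct[n_distinct > 1].index.tolist()

print(dup_genes)
print(inconsistent_genes)
```

In this toy example `GENE_A` is repeated with identical values, so one of its rows could simply be dropped, while `GENE_C` has conflicting values and would need a decision (for instance keeping one row or averaging).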

## 2. Genetic alterations in breast cancer <a class="anchor" id="2"></a>

Below, two lists of genes involved in breast cancer are extracted separately from two files and then merged.

The files were downloaded from two different databases: Census_allSun_Jan_14_17_21_06_2024.csv from https://cancer.sanger.ac.uk/census and Familial_breast_cancer.tsv from https://panelapp.genomicsengland.co.uk/panels/158/.

In [15]:
gene1 = pd.read_csv('data/gene_list/Census_allSun_Jan_14_17_21_06_2024.csv')
gene2 = pd.read_csv('data/gene_list/Familial_breast_cancer.tsv', sep='\t')

In [16]:
gene1

Unnamed: 0,Gene Symbol,Name,Entrez GeneId,Genome Location,Tier,Hallmark,Chr Band,Somatic,Germline,Tumour Types(Somatic),Tumour Types(Germline),Cancer Syndrome,Tissue Type,Molecular Genetics,Role in Cancer,Mutation Types,Translocation Partner,Other Germline Mut,Other Syndrome,Synonyms
0,A1CF,APOBEC1 complementation factor,29974.0,10:50799421-50885675,2,,10q11.23,yes,,melanoma,,,E,,oncogene,Mis,,,,"ACF,ACF64,ACF65,APOBEC1CF,ASP,CCDS73133.1,ENSG..."
1,ABI1,abl-interactor 1,10006.0,10:26746593-26860935,1,Yes,10p12.1,yes,,AML,,,L,Dom,"TSG, fusion",T,KMT2A,,,"ABI-1,CCDS7150.1,E3B1,ENSG00000136754.17,NM_00..."
2,ABL1,v-abl Abelson murine leukemia viral oncogene h...,25.0,9:130713946-130885683,1,Yes,9q34.12,yes,,"CML, ALL, T-ALL",,,L,Dom,"oncogene, fusion","T, Mis","BCR, ETV6, NUP214",,,"ABL,CCDS35165.1,ENSG00000097007.17,JTK7,NM_007..."
3,ABL2,"c-abl oncogene 2, non-receptor tyrosine kinase",27.0,1:179099327-179229601,1,,1q25.2,yes,,AML,,,L,Dom,"oncogene, fusion",T,ETV6,,,"ABLL,ARG,CCDS30947.1,ENSG00000143322.19,NM_007..."
4,ACKR3,atypical chemokine receptor 3,57007.0,2:236569641-236582358,1,Yes,2q37.3,yes,,lipoma,,,M,Dom,"oncogene, fusion",T,HMGA2,,,"CCDS2516.1,CMKOR1,CXCR7,ENSG00000144476.5,GPR1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
738,ZNF429,zinc finger protein 429,353088.0,19:21505564-21538078,2,,19p12,yes,,GBM,,,O,,,Mis,,,,"CCDS42537.1,ENSG00000197013.9,NM_001001415.2,N..."
739,ZNF479,zinc finger protein 479,90827.0,7:57119614-57139864,2,,7p11.2,yes,,"lung cancer, bladder carcinoma, prostate carci...",,,E,,,Mis,,,,"CCDS43590.1,ENSG00000185177.12,KR19,NM_033273...."
740,ZNF521,zinc finger protein 521,25925.0,18:25061926-25352152,1,,18q11.2,yes,,ALL,,,L,Dom,"oncogene, fusion",T,PAX5,,,"CCDS32806.1,EHZF,ENSG00000198795.10,Evi3,NM_01..."
741,ZNRF3,zinc and ring finger 3,84133.0,22:28883592-29057487,2,,22q12.1,yes,,"colorectal cancer, adrenocortical carcinoma, g...",,,E,,TSG,"N, F, Mis",,,,"BK747E2.3,CCDS56225.1,ENSG00000183579.15,FLJ22..."


**gene1** contains information on genes related to different types of cancer. The genes related to breast cancer are selected below.

The cancer types are stored in the 'Tumour Types(Somatic)' column, while the symbol of each gene is stored in the 'Gene Symbol' column.

In [17]:
print('Total number of genes: ',len(gene1['Gene Symbol'].unique()))

# keep the two relevant columns and drop rows with missing labels
gene1_red = gene1[['Gene Symbol','Tumour Types(Somatic)']].dropna().reset_index(drop=True)

# select rows with 'breast' in 'Tumour Types(Somatic)' (plain substring match)
is_breast = gene1_red['Tumour Types(Somatic)'].str.contains('breast', regex=False)
gene1_red = gene1_red[is_breast]
print('Number of labeled breast cancer related genes: ',len(gene1_red))

gene_list1 = gene1_red['Gene Symbol'].tolist()

Total number of genes:  743
Number of labeled breast cancer related genes:  42


In [18]:
gene2

Unnamed: 0,Entity Name,Entity type,Gene Symbol,Sources(; separated),Level4,Level3,Level2,Model_Of_Inheritance,Phenotypes,Omim,...,Position GRCh38 Start,Position GRCh38 End,STR Repeated Sequence,STR Normal Repeats,STR Pathogenic Repeats,Region Haploinsufficiency Score,Region Triplosensitivity Score,Region Required Overlap Percentage,Region Variant Type,Region Verbose Name
0,ATM,gene,ATM,Emory Genetics Laboratory;Expert Review Green;...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","{Breast cancer, susceptibility to}, OMIM:114480",,...,,,,,,,,,,
1,BRCA1,gene,BRCA1,Eligibility statement prior genetic testing;Em...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","{Breast-ovarian cancer, familial, 1}, OMIM:604370",,...,,,,,,,,,,
2,BRCA2,gene,BRCA2,Eligibility statement prior genetic testing;Em...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","{Breast-ovarian cancer, familial, 2}, OMIM:612...",,...,,,,,,,,,,
3,CHEK2,gene,CHEK2,Emory Genetics Laboratory;Expert Review Green;...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","Li-Fraumeni syndrome, 609265; Osteosarcoma, so...",,...,,,,,,,,,,
4,PALB2,gene,PALB2,Emory Genetics Laboratory;Expert Review Green;...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","Fanconi anemia, complementation group N, 61083...",,...,,,,,,,,,,
5,PTEN,gene,PTEN,Emory Genetics Laboratory;Expert Review Green,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...",High Risk Breast Cancer; Breast and Ovarian Ca...,,...,,,,,,,,,,
6,RAD51C,gene,RAD51C,Emory Genetics Laboratory;Expert Review Green;...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","Fanconi anemia, complementation group O, 61339...",,...,,,,,,,,,,
7,RAD51D,gene,RAD51D,Emory Genetics Laboratory;Expert Review Green;...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","{Breast-ovarian cancer, familial, susceptibili...",,...,,,,,,,,,,
8,STK11,gene,STK11,Emory Genetics Laboratory;Expert Review Green,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...",High Risk Breast Cancer; Breast and Ovarian Ca...,,...,,,,,,,,,,
9,TP53,gene,TP53,Emory Genetics Laboratory;Expert list;Expert R...,Familial breast cancer,Breast and endocrine,Tumour syndromes,"MONOALLELIC, autosomal or pseudoautosomal, NOT...","Colorectal cancer, 114500; Li-Fraumeni syndrom...",,...,,,,,,,,,,


In [19]:
gene_list2 = gene2['Gene Symbol'].tolist()
print('Number of breast cancer related genes: ',len(gene_list2))

Number of breast cancer related genes:  27


Merge the lists extracted from the two datasets.

In [20]:
gene_list = gene_list1 + gene_list2
gene_list = list(set(gene_list))
print('Total number of breast cancer related genes: ',len(gene_list))
gene_list

Total number of breast cancer related genes:  63


['TP53',
 'MSH2',
 'MAP3K1',
 'ZMYM3',
 'NOTCH1',
 'IKZF3',
 'RB1',
 'PTEN',
 'XRCC2',
 'NTRK3',
 'MAP3K13',
 'AR',
 'PALB2',
 'RAD51C',
 'CASP8',
 'ESR1',
 'MUTYH',
 'RAD51D',
 'TBX3',
 'FOXA1',
 'ETV6',
 'SALL4',
 'BRCA1',
 'CHEK2',
 'STK11',
 'CDH1',
 'FBLN2',
 'NCOR1',
 'ATRIP',
 'ERBB2',
 'CCND1',
 'EPCAM',
 'GOLPH3',
 'KEAP1',
 'EP300',
 'NBN',
 'HGF',
 'ATM',
 'PPM1D',
 'BRCA2',
 'FADD',
 'MAP2K4',
 'RAD50',
 'PMS2',
 'PBRM1',
 'MLH1',
 'RAD54L',
 'RRAS2',
 'GATA3',
 'SMARCD1',
 'PIK3CA',
 'ARID1A',
 'AKT1',
 'BRIP1',
 'BARD1',
 'FLNA',
 'ASPM',
 'CDKN1B',
 'MSH6',
 'BAP1',
 'ARID1B',
 'CTCF',
 'IRS4']
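
Because a Python set has no defined order, the order of the merged list shown above can change from one run to the next. If a reproducible order is preferred, one option (an assumption here, not what the notebook does) is to sort the merged set. The sketch below uses two short toy lists standing in for `gene_list1` and `gene_list2`.

```python
# Toy lists standing in for gene_list1 and gene_list2
toy_list1 = ['TP53', 'PIK3CA', 'BRCA1']
toy_list2 = ['BRCA1', 'ATM', 'TP53']

# Union of the two lists, sorted alphabetically for a stable order
merged = sorted(set(toy_list1) | set(toy_list2))

# Genes present in both sources
shared = sorted(set(toy_list1) & set(toy_list2))

print(merged)
print(shared)
```

The intersection also gives a quick way to see how many genes the two sources have in common.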

Save the complete list to a .txt file, one gene symbol per line.

In [21]:
with open(r'data/gene_list.txt', 'w') as file:
    for item in gene_list:
        file.write(item+'\n')