# Data from Human Protein Atlas
LINK: https://www.proteinatlas.org/about/download

**RNA HPA cell line cancer gene data**
* Ensembl gene identifier ("Gene")
* gene name ("Gene name")
* analyzed sample ("Cancer")
* transcripts per million ("TPM")
* protein-coding transcripts per million ("pTPM")
* normalized expression ("nTPM")

The file contains one row per gene and cancer type, each with its TPM value.
TPM (transcripts per million) is a normalized measure of how strongly a gene is expressed in a sample.
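
As a rough illustration of what a TPM value encodes, the sketch below computes TPM from raw read counts and transcript lengths using the standard definition. The gene labels, counts, and lengths are made-up toy values, not HPA data, and this is not a reproduction of the HPA pipeline.

```python
import pandas as pd

# Toy values (hypothetical): raw read counts and transcript lengths in kilobases.
toy = pd.DataFrame(
    {"count": [100, 300, 600], "length_kb": [1.0, 3.0, 2.0]},
    index=["geneA", "geneB", "geneC"],
)

# Step 1: reads per kilobase, which corrects for transcript length.
rpk = toy["count"] / toy["length_kb"]

# Step 2: rescale so that all values in the sample sum to one million.
toy["tpm"] = rpk / rpk.sum() * 1e6

print(toy)
```

Because of the rescaling in step 2, TPM values are comparable between genes within one sample: geneA and geneB get the same TPM here even though geneB has three times as many reads, since it is also three times as long.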

**Output file format**
* Ensembl gene identifier ("id")
* gene name ("name")
* transcripts per million ("tpm")

## Load data

In [1]:
import pandas as pd

In [2]:
file = "../import_data/HPA/rna_celline_cancer.tsv"
df_hpa = pd.read_csv(file, delimiter="\t")

df_hpa

Unnamed: 0,Gene,Gene name,Cancer,TPM,pTPM,nTPM
0,ENSG00000000003,TSPAN6,Adrenocortical cancer,17.5,21.7,20.4
1,ENSG00000000003,TSPAN6,Bile duct cancer,20.9,26.5,34.4
2,ENSG00000000003,TSPAN6,Bladder cancer,27.8,35.6,36.8
3,ENSG00000000003,TSPAN6,Bone cancer,15.9,20.6,17.6
4,ENSG00000000003,TSPAN6,Brain cancer,15.0,19.1,17.1
...,...,...,...,...,...,...
604855,ENSG00000291317,TMEM276,skin cancer,11.9,15.0,14.8
604856,ENSG00000291317,TMEM276,testis cancer,4.7,5.8,4.7
604857,ENSG00000291317,TMEM276,thyroid cancer,15.9,20.2,18.3
604858,ENSG00000291317,TMEM276,Uncategorized,12.3,15.2,14.4


## Filter for lung cancer

In [3]:
df_hpa['Cancer'].unique()

array(['Adrenocortical cancer', 'Bile duct cancer', 'Bladder cancer',
       'Bone cancer', 'Brain cancer', 'breast cancer', 'cervical cancer',
       'colorectal cancer', 'Esophageal cancer', 'Gallbladder cancer',
       'Gastric cancer', 'head and neck cancer', 'Kidney cancer',
       'Leukemia', 'liver cancer', 'lung cancer', 'lymphoma', 'Myeloma',
       'Neuroblastoma', 'Non-cancerous', 'ovarian cancer',
       'pancreatic cancer', 'prostate cancer', 'Rhabdoid', 'Sarcoma',
       'skin cancer', 'testis cancer', 'thyroid cancer', 'Uncategorized',
       'Uterine cancer'], dtype=object)
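
The labels above mix capitalization ('Bladder cancer' vs. 'lung cancer'), so an exact string comparison only works if the label is typed exactly as stored. A minimal sketch of a case-insensitive match, run on a small toy frame with made-up TPM values rather than on `df_hpa`:

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the mixed capitalization of 'Cancer'.
toy = pd.DataFrame(
    {
        "Gene": ["geneA", "geneA", "geneB"],
        "Cancer": ["Bladder cancer", "lung cancer", "Lung cancer "],
        "TPM": [1.0, 2.0, 3.0],
    }
)

# Normalize whitespace and case before comparing, so 'Lung cancer ' also matches.
mask = toy["Cancer"].str.strip().str.lower() == "lung cancer"
toy_lung = toy[mask]

print(toy_lung)
```

In the real data, 'lung cancer' appears in lowercase only, so the exact comparison used in this notebook is sufficient; the normalized match is just a safeguard if the labels change between HPA releases.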

In [4]:
print("Number of unique genes is", df_hpa['Gene'].nunique())

Number of unique genes is 20162


In [5]:
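# Keep only the lung cancer rows, then drop the columns not needed downstream.
df_lung = df_hpa[df_hpa['Cancer'] == 'lung cancer']
df_lung = df_lung.drop(columns=['pTPM', 'nTPM', 'Cancer'])
# Rename the remaining columns to the output schema (id, name, tpm).
df_lung = df_lung.rename(columns={'Gene': 'id', 'Gene name': 'name', 'TPM': 'tpm'})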

df_lung

Unnamed: 0,id,name,tpm
15,ENSG00000000003,TSPAN6,17.5
45,ENSG00000000005,TNMD,0.0
75,ENSG00000000419,DPM1,77.3
105,ENSG00000000457,SCYL3,5.4
135,ENSG00000000460,C1orf112,14.4
...,...,...,...
604725,ENSG00000291313,ENSG00000291313,7.9
604755,ENSG00000291314,ENSG00000291314,0.2
604785,ENSG00000291315,ENSG00000291315,0.0
604815,ENSG00000291316,ENSG00000291316,8.9


## Save the data to a CSV file

In [6]:
df_lung.to_csv("../processed_data/HPA_lung_cancer.csv", index=False)

print(f'There are {df_lung.shape[0]} rows/genes in the saved dataset.')
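
One way to check the saved file is to read it back and confirm the columns and the one-row-per-gene assumption. The sketch below does this on a two-row toy frame written to a temporary directory; the file name `toy_lung.csv` is hypothetical, and the real check would point at the saved CSV instead.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for df_lung, using the id/name/tpm schema.
toy_lung = pd.DataFrame(
    {
        "id": ["ENSG00000000003", "ENSG00000000005"],
        "name": ["TSPAN6", "TNMD"],
        "tpm": [17.5, 0.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_lung.csv")  # hypothetical file name
    toy_lung.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The round trip should preserve the schema and keep one row per gene id.
print(list(reloaded.columns))
print("ids unique:", reloaded["id"].is_unique)
```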