# Data from Cell Model Passports
Link: https://cellmodelpassports.sanger.ac.uk/downloads

Expression Data → all RNA Seq processed Data

All tissues are cancer tissues. We need to filter for lung cancer.

**Output file format:**
* id
* name
* tpm

In [1]:
import pandas as pd


## Model infos for cancer type
Loaded from Model Annotation → list of all annotated models.

Needed to filter the `all data` expression file for lung cancer.

In [2]:
file = "../import_data/CMP/model_list_20240110.csv"

# load data
df_model_info = pd.read_csv(file, delimiter=',', usecols=['model_id', 'tissue', 'cancer_type', 'tissue_status', 'cancer_type_detail'])


# filter for lung cancer
lung_cancer = ['Small Cell Lung Carcinoma', 'Non-Small Cell Lung Carcinoma', 'Squamous Cell Lung Carcinoma']
df_model_info_lung = df_model_info.where(df_model_info["cancer_type"].isin(lung_cancer)).dropna()

df_model_info_lung.head(5)

Unnamed: 0,model_id,tissue,cancer_type,tissue_status,cancer_type_detail
13,SIDM01387,Lung,Small Cell Lung Carcinoma,Metastasis,Small Cell Lung Carcinoma
18,SIDM00990,Lung,Non-Small Cell Lung Carcinoma,Metastasis,Non-Small Cell Lung Carcinoma
20,SIDM00019,Lung,Non-Small Cell Lung Carcinoma,Metastasis,Papillary Lung Adenocarcinoma
28,SIDM00046,Lung,Non-Small Cell Lung Carcinoma,Unknown,Lung Adenocarcinoma
30,SIDM01069,Lung,Non-Small Cell Lung Carcinoma,Tumour,Lung Adenocarcinoma
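
A note on the filter above: `where(...).dropna()` first sets every non-matching row to NaN and then drops all rows that contain any NaN. A lung model with a missing value in another column (for example `cancer_type_detail`) would therefore be dropped as well. The sketch below uses a small made-up table (the rows are illustrative and not taken from the CMP file) to compare this with plain boolean indexing, which may be the safer choice here.

```python
import pandas as pd

# Toy stand-in for the model list; all values are made up for illustration.
toy_models = pd.DataFrame({
    "model_id": ["M1", "M2", "M3"],
    "cancer_type": [
        "Small Cell Lung Carcinoma",
        "Melanoma",
        "Non-Small Cell Lung Carcinoma",
    ],
    "cancer_type_detail": ["Small Cell Lung Carcinoma", "Melanoma", None],
})
lung_types = ["Small Cell Lung Carcinoma", "Non-Small Cell Lung Carcinoma"]
mask = toy_models["cancer_type"].isin(lung_types)

# Pattern used in the notebook: also removes M3 because its detail is missing.
via_where = toy_models.where(mask).dropna()

# Boolean indexing: keeps every row that matches the mask.
via_mask = toy_models[mask]

print(len(via_where), len(via_mask))
```

If rows with missing annotations should be kept, boolean indexing avoids the silent loss; if they should be removed, an explicit `dropna(subset=[...])` makes that intent visible.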


In [3]:
# List of model_ids with lung cancer
model_ids_lung = df_model_info_lung["model_id"].to_list()

## Expression data

RNA-Seq data from different cancer cell models.

**columns:**
- id
- gene_symbol
- dataset_id
- data_source: Organisation or Project that provided the data
- model_id: cultured cell / cell line (this is the "sample")
- tpm

In [4]:
# read in the data
file = "../import_data/CMP/rnaseq_all_data_20220624.csv"
df_cmp_all = pd.read_csv(file, delimiter=",", usecols=["id", "gene_symbol", "dataset_id", "data_source", "model_id", "tpm"])

print("There are {} rows in the import_data.".format(df_cmp_all.shape[0]))

df_cmp_all.head()

There are 53348469 rows in the import_data.


Unnamed: 0,dataset_id,id,model_id,tpm,data_source,gene_symbol
0,22,133594790,SIDM01313,14.41,Sanger,CASP10
1,22,133630300,SIDM01313,0.64,Sanger,NBPF10
2,22,133630301,SIDM01313,0.39,Sanger,RPL17P51
3,22,133630302,SIDM01313,0.0,Sanger,PPATP2
4,22,133630303,SIDM01313,3.32,Sanger,MMP28
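
With more than 53 million rows, loading the whole file at once can be memory-hungry. One possible alternative, sketched below under the assumption that only the lung models are needed later, is to read the CSV in chunks and filter each chunk right away. The inline CSV, the `SIDM_A`-style model IDs, and the tiny `chunksize` are made up so the example is self-contained.

```python
import io
import pandas as pd

# Toy CSV with the same column names as the CMP expression file (values made up).
toy_csv = io.StringIO(
    "id,gene_symbol,dataset_id,data_source,model_id,tpm\n"
    "1,CASP10,22,Sanger,SIDM_A,14.41\n"
    "2,MMP28,22,Sanger,SIDM_B,3.32\n"
    "3,CASP10,22,Sanger,SIDM_C,7.00\n"
    "4,MMP28,22,Sanger,SIDM_A,0.64\n"
)
wanted_models = {"SIDM_A", "SIDM_C"}  # hypothetical stand-in for model_ids_lung

filtered_chunks = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for chunk in pd.read_csv(
    toy_csv, usecols=["gene_symbol", "model_id", "tpm"], chunksize=2
):
    filtered_chunks.append(chunk[chunk["model_id"].isin(wanted_models)])

df_filtered = pd.concat(filtered_chunks, ignore_index=True)
print(df_filtered)
```

For the real file a much larger `chunksize` (for example a few million rows) would be more sensible; the right value depends on the available memory.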


### Clean Dataframe

In [5]:
df_cmp_all.drop(columns=["dataset_id", "data_source", "id"], inplace=True)
df_cmp_all.rename(columns={"gene_symbol": "name"}, inplace=True)

In [6]:
# filter rows with lung cancer model ids
df_cmp_all = df_cmp_all.where(df_cmp_all["model_id"].isin(model_ids_lung)).dropna()

print("There are {} rows with lung cancer data.".format(df_cmp_all.shape[0]))

There are 7564389 rows with lung cancer data.


### Analyze Dataset

In [9]:
# check for missing values
missing_values = df_cmp_all.isnull().sum()

# TPM Ranges
min_tpm = df_cmp_all["tpm"].min()
max_tpm = df_cmp_all["tpm"].max()

# genes
n_genes = df_cmp_all["name"].nunique()

# tissues: each model_id is one cell model (one sample)
n_tissues = df_cmp_all["model_id"].nunique()

print(f"Missing values:\n"
      f"{missing_values}\n")

print(f"Min TPM: {min_tpm}")
print(f"Max TPM: {max_tpm}\n")

print(f"Number of genes: {n_genes}")
print(f"Number of tissues: {n_tissues}")

Missing values:
model_id    0
tpm         0
name        0
dtype: int64

Min TPM: 0.0
Max TPM: 132676.0

Number of genes: 37262
Number of tissues: 203



The CMP data is later merged (see "Merge Data") with an Ensembl gene list on the gene names to retrieve the ENS ID for each gene.

After the merge, 3,760 genes have no ENS ID, either because their name is not unique in the Ensembl file or because it does not occur there at all. We remove these genes so that every remaining gene has exactly one ID.

### Group Data to mean values

In [7]:
df_cmp_all.drop(columns=["model_id"], inplace=True)
df_cmp_group = df_cmp_all.groupby(["name"]).mean().reset_index()

print("There are {} rows in the grouped dataset.".format(df_cmp_group.shape[0]))
df_cmp_group

There are 37262 rows in the grouped dataset.


Unnamed: 0,name,tpm
0,A1BG,0.827192
1,A1BG-AS1,4.676305
2,A1CF,1.355369
3,A2M,1.669212
4,A2M-AS1,1.033596
...,...,...
37257,ZYG11B,21.124039
37258,ZYX,99.636946
37259,ZYXP1,0.000000
37260,ZZEF1,17.697980
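
The mean across models is sensitive to single very high values (the maximum TPM above is 132676.0). As a rough sanity check, one could compute the median and the number of contributing models next to the mean. This is only a sketch on made-up numbers, not part of the original pipeline.

```python
import pandas as pd

# Toy long-format expression table; values are made up for illustration.
toy_expr = pd.DataFrame({
    "name": ["A1BG", "A1BG", "ZYX", "ZYX", "ZYX"],
    "tpm": [1.0, 3.0, 10.0, 20.0, 1000.0],
})

# One row per gene with mean, median and number of contributing samples.
summary = (
    toy_expr.groupby("name")["tpm"]
    .agg(["mean", "median", "count"])
    .reset_index()
)
print(summary)
```

A large gap between mean and median for a gene hints that a few models dominate its average.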


## Ensembl Dataset
Downloaded via BioMart



In [19]:
df_ensembl = pd.read_csv("../import_data/ENSEMBLE/ensemble_gene_id.txt", delimiter="\t")
df_ensembl.rename(columns={"gene_symbol": "name"}, inplace=True)

# drop rows without gene_symbol
df_ensembl.drop(df_ensembl[df_ensembl["name"].isnull()].index, inplace=True)

df_ensembl

Unnamed: 0,Gene_stable_ID,name
0,ENSG00000210049,MT-TF
1,ENSG00000211459,MT-RNR1
2,ENSG00000210077,MT-TV
3,ENSG00000210082,MT-RNR2
4,ENSG00000209082,MT-TL1


In [22]:
duplicate_names = df_ensembl["name"].duplicated(keep=False).sum()
rows = df_ensembl.shape[0]
print(f'{duplicate_names} of {rows} rows do not have a unique gene name in the ENS dataset')

10605 of 48311 rows do not have a unique gene name in the ENS dataset


PROBLEM: There are gene names that are not unique.

→ If the names are not unique, we cannot merge the data on the gene names with our dataset.
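
To illustrate why this matters, here is a minimal sketch on made-up data (the gene names and `ENSG_n` IDs are placeholders). A left merge on a non-unique key duplicates the expression rows, and the `validate` argument of `pd.merge` can be used to fail early instead.

```python
import pandas as pd

# Toy expression table: one row per gene name.
expr = pd.DataFrame({"name": ["GENE_A", "GENE_B"], "tpm": [1.0, 2.0]})

# Toy Ensembl table in which GENE_A maps to two different IDs.
ens = pd.DataFrame({
    "Gene_stable_ID": ["ENSG_1", "ENSG_2", "ENSG_3"],
    "name": ["GENE_A", "GENE_A", "GENE_B"],
})

# Without a check, GENE_A appears twice in the result.
merged = pd.merge(expr, ens, on="name", how="left")
print(len(expr), "->", len(merged))

# validate="many_to_one" requires the key to be unique in the right table.
try:
    pd.merge(expr, ens, on="name", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)
```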

In [10]:
# drop every row whose gene name is not unique (keep=False drops all copies)
df_ensembl_unique = df_ensembl.drop_duplicates(subset=["name"], keep=False)

df_ensembl_unique

Unnamed: 0,Gene_stable_ID,name
0,ENSG00000210049,MT-TF
1,ENSG00000211459,MT-RNR1
2,ENSG00000210077,MT-TV
3,ENSG00000210082,MT-RNR2
4,ENSG00000209082,MT-TL1
...,...,...
70606,ENSG00000232679,LINC01705
70607,ENSG00000200033,RNU6-403P
70608,ENSG00000228437,LINC02474
70609,ENSG00000229463,LYST-AS1


## Merge Data

In [24]:
df_cmp_ens = pd.merge(df_cmp_group, df_ensembl_unique, on="name", how="left")
df_cmp_ens

Unnamed: 0,name,tpm,Gene_stable_ID
0,A1BG,0.827192,ENSG00000121410
1,A1BG-AS1,4.676305,ENSG00000268895
2,A1CF,1.355369,ENSG00000148584
3,A2M,1.669212,ENSG00000175899
4,A2M-AS1,1.033596,ENSG00000245105
...,...,...,...
37257,ZYG11B,21.124039,ENSG00000162378
37258,ZYX,99.636946,
37259,ZYXP1,0.000000,ENSG00000274572
37260,ZZEF1,17.697980,ENSG00000074755


In [25]:
# check Data with missing ENS
missing_ens = df_cmp_ens[df_cmp_ens["Gene_stable_ID"].isnull()]

print(len(missing_ens), "/",len(df_cmp_ens),  "still have no ENS ID")
missing_ens

3760 / 37262 still have no ENS ID


Unnamed: 0,name,tpm,Gene_stable_ID
8,A2MP1,0.035025,
12,AAAS,46.539015,
14,AACSP1,2.496207,
16,AADACL2,0.035813,
17,AADACL2-AS1,0.550640,
...,...,...,...
37197,ZP2,0.042167,
37207,ZRANB2-AS2,0.264926,
37218,ZSCAN2,6.817094,
37238,ZSWIM4,11.950049,


TODO: Which of these 3760 genes have no ENS ID because of duplicate names, and which were missing from the Ensembl file from the beginning?

→ These genes could get an ID by looking up Ensembl name aliases, but that would make the duplicate-name problem even worse.

Solution: Since these are only about 10% of the genes (3,760 of 37,262), we drop these rows.
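
One way the TODO could be approached is sketched below, assuming the same column names as in this notebook. The data is made up, and `ens`, `expr`, and `reason` are illustrative names. The idea is to look up each gene without an ID in the set of names that were duplicated in the Ensembl table before deduplication.

```python
import pandas as pd

# Toy Ensembl table: GENE_A is ambiguous, GENE_C is absent.
ens = pd.DataFrame({
    "Gene_stable_ID": ["ENSG_1", "ENSG_2", "ENSG_3"],
    "name": ["GENE_A", "GENE_A", "GENE_B"],
})
ens_unique = ens.drop_duplicates(subset=["name"], keep=False)

expr = pd.DataFrame({"name": ["GENE_A", "GENE_B", "GENE_C"], "tpm": [1.0, 2.0, 3.0]})
merged = expr.merge(ens_unique, on="name", how="left")

# Genes that ended up without an ID after the merge.
missing = merged[merged["Gene_stable_ID"].isnull()].copy()

# Names that were removed from the Ensembl table because they were not unique.
dup_names = set(ens.loc[ens["name"].duplicated(keep=False), "name"])

missing["reason"] = missing["name"].isin(dup_names).map(
    {True: "ambiguous name in Ensembl", False: "name not in Ensembl"}
)
print(missing["reason"].value_counts())
```

In the notebook, `df_ensembl` and `missing_ens` would take the place of `ens` and `missing`.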

In [26]:
# show rows with duplicate names in df_cmp_ens
df_cmp_ens[df_cmp_ens["name"].duplicated(keep=False)]

Unnamed: 0,name,tpm,Gene_stable_ID


### Clean up

In [27]:
df_cmp_ens.dropna(subset=["Gene_stable_ID"], inplace=True)
df_cmp_ens.rename(columns={"Gene_stable_ID": "id"}, inplace=True)
df_cmp_ens

Unnamed: 0,name,tpm,id
0,A1BG,0.827192,ENSG00000121410
1,A1BG-AS1,4.676305,ENSG00000268895
2,A1CF,1.355369,ENSG00000148584
3,A2M,1.669212,ENSG00000175899
4,A2M-AS1,1.033596,ENSG00000245105
...,...,...,...
37256,ZYG11AP1,0.000887,ENSG00000232242
37257,ZYG11B,21.124039,ENSG00000162378
37259,ZYXP1,0.000000,ENSG00000274572
37260,ZZEF1,17.697980,ENSG00000074755


### Save Data

In [28]:
df_cmp_ens.to_csv("../processed_data/CMP_cancer_mean.csv", index=False)
print(f'There are {df_cmp_ens.shape[0]} rows/genes in the saved dataset.')

There are 33502 rows/genes in the saved dataset.
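
The output format listed at the top is `id`, `name`, `tpm`, while the saved frame has the column order `name`, `tpm`, `id`. If the column order matters downstream (an assumption, since readers that select columns by name are unaffected), the columns could be reordered and checked before saving. The sketch below writes to an in-memory buffer instead of a file, and the TPM values are rounded toy values.

```python
import io
import pandas as pd

# Toy frame in the same column order as df_cmp_ens after the clean-up step.
toy_out = pd.DataFrame({
    "name": ["A1BG", "A2M"],
    "tpm": [0.83, 1.67],
    "id": ["ENSG00000121410", "ENSG00000175899"],
})

# Reorder to the documented output format.
ordered = toy_out[["id", "name", "tpm"]]

# Basic sanity checks before writing: unique IDs, no missing values.
ids_unique = ordered["id"].is_unique
has_missing = bool(ordered.isnull().any().any())

buffer = io.StringIO()
ordered.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)
```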
