<a href="https://colab.research.google.com/github/JinLeeGG/Survival-Prediction-Model-for-AML-using-Gene-Expression-Data-from-TCGA/blob/main/Data_Analysis_Process.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Analysis Process**
- This notebook documents the data analysis process that precedes investigating the relationship between Early Growth Response 1 (EGR1) gene expression levels and survival outcomes in Acute Myeloid Leukemia (AML) patients, using data from The Cancer Genome Atlas (TCGA) Research Network.

# **Dataset**
- Source: [Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia](https://gdc.cancer.gov/about-data/publications/laml_2012)
- Sample Size: 200 AML patients
- Data Types:
  - Patient Clinical Data
  - RNAseq GAF 2.0 read count



In [1]:
# import libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# **Clinical Dataset**

In [2]:
# Reading clinical dataset
Clinical_df = pd.read_csv('/content/drive/MyDrive/Acute Myeloid Leukemia (TCGA, PanCancer Atlas)/datasets/clinical_patient_laml.tsv', sep='\t', index_col=0)
Clinical_df.head()

bcr_patient_barcode,acute_myeloid_leukemia_calgb_cytogenetics_risk_category,age_at_initial_pathologic_diagnosis,atra_exposure,bcr_patient_uuid,cumulative_agent_total_dose,cytogenetic_abnormality,cytogenetic_abnormality_other,cytogenetic_analysis_performed_ind,date_of_form_completion,date_of_initial_pathologic_diagnosis,...,prior_diagnosis,prior_hematologic_disorder_diagnosis_indicator,race,steroid_therapy_administered,tissue_source_site,total_dose_units,tumor_tissue_site,vital_status,FISH_test_component,FISH_test_component_percentage_value
TCGA-AB-2802,Intermediate/Normal,50,NO,b93cb62a-a7dc-406d-8482-6b51a92ea3c3,0,Normal,[Not Available],YES,2010-12-14,2001-00-00,...,NO,NO,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,,
TCGA-AB-2803,Favorable,61,NO,fb4c9803-3690-4f6a-9402-72a4f36d64d1,0,Normal,t(15;17),YES,2010-12-14,2001-00-00,...,NO,NO,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,PML-RAR,95.0
TCGA-AB-2804,Intermediate/Normal,30,YES,2fcda6a9-813b-41b2-aae4-ca42c9986287,0,Normal,[Not Available],YES,2010-12-14,2001-00-00,...,NO,NO,WHITE,[Not Available],AB,[Not Available],BONE MARROW,LIVING,PML-RAR,9.0
TCGA-AB-2805,Intermediate/Normal,77,NO,ada38f3e-8020-4394-9e7c-50d06dd04769,0,Normal,No,YES,2010-12-14,2002-00-00,...,YES,NO,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,,
TCGA-AB-2806,Favorable,46,NO,e78ff499-037b-450a-ac04-6fb3a9e124a4,4000,t (8;21),[Not Available],YES,2010-12-14,2002-00-00,...,NO,NO,WHITE,[Not Available],AB,mg,BONE MARROW,DECEASED,,


# **Information needed from the clinical dataset:**
  1. bcr_patient_barcode (the patient's unique ID)
  2. vital_status (the patient's status: DECEASED or LIVING)
  3. days_to_death (days from diagnosis to death, for deceased patients)
  4. days_to_last_followup (days to the last follow-up, for living patients)
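
The columns listed above can be pulled out of the full clinical table in one step. The following is a minimal sketch on a toy DataFrame, not the notebook's own cell: `toy_clinical` and `survival_cols` are hypothetical names, and the two rows only mirror values shown in the `head()` output above.

```python
import pandas as pd

# Toy stand-in for the clinical table (hypothetical; two rows for illustration).
toy_clinical = pd.DataFrame(
    {
        "vital_status": ["DECEASED", "LIVING"],
        "days_to_death": ["365", "[Not Applicable]"],
        "days_to_last_followup": ["[Not Available]", "2556"],
        "race": ["WHITE", "WHITE"],
    },
    index=pd.Index(["TCGA-AB-2802", "TCGA-AB-2804"], name="bcr_patient_barcode"),
)

# Keep only the survival-related columns; the barcode stays as the index.
survival_cols = ["vital_status", "days_to_death", "days_to_last_followup"]
survival_df = toy_clinical[survival_cols]
print(survival_df)
```

Because `bcr_patient_barcode` is the index rather than a column, it is carried along automatically and does not need to be listed.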

## Data inspection of the clinical dataset


In [3]:
# Inspecting clinical dataset
Clinical_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, TCGA-AB-2802 to TCGA-AB-3012
Data columns (total 77 columns):
 #   Column                                                                   Non-Null Count  Dtype  
---  ------                                                                   --------------  -----  
 0   acute_myeloid_leukemia_calgb_cytogenetics_risk_category                  200 non-null    object 
 1   age_at_initial_pathologic_diagnosis                                      200 non-null    int64  
 2   atra_exposure                                                            200 non-null    object 
 3   bcr_patient_uuid                                                         200 non-null    object 
 4   cumulative_agent_total_dose                                              200 non-null    object 
 5   cytogenetic_abnormality                                                  200 non-null    object 
 6   cytogenetic_abnormality_other                              

In [4]:
# index contains patient's unique id
Clinical_df.index

Index(['TCGA-AB-2802', 'TCGA-AB-2803', 'TCGA-AB-2804', 'TCGA-AB-2805',
       'TCGA-AB-2806', 'TCGA-AB-2807', 'TCGA-AB-2808', 'TCGA-AB-2809',
       'TCGA-AB-2810', 'TCGA-AB-2811',
       ...
       'TCGA-AB-3000', 'TCGA-AB-3001', 'TCGA-AB-3002', 'TCGA-AB-3005',
       'TCGA-AB-3006', 'TCGA-AB-3007', 'TCGA-AB-3008', 'TCGA-AB-3009',
       'TCGA-AB-3011', 'TCGA-AB-3012'],
      dtype='object', name='bcr_patient_barcode', length=200)

In [5]:
# days to death (patient has not survived)
Clinical_df['days_to_death']

bcr_patient_barcode,days_to_death
TCGA-AB-2802,365
TCGA-AB-2803,792
TCGA-AB-2804,[Not Applicable]
TCGA-AB-2805,576
TCGA-AB-2806,944
...,...
TCGA-AB-3007,[Not Applicable]
TCGA-AB-3008,822
TCGA-AB-3009,576
TCGA-AB-3011,[Not Applicable]


In [6]:
# days to last follow-up (patient is alive)
Clinical_df['days_to_last_followup']

bcr_patient_barcode,days_to_last_followup
TCGA-AB-2802,[Not Available]
TCGA-AB-2803,[Not Available]
TCGA-AB-2804,2556
TCGA-AB-2805,[Not Available]
TCGA-AB-2806,[Not Available]
...,...
TCGA-AB-3007,1581
TCGA-AB-3008,[Not Available]
TCGA-AB-3009,[Not Available]
TCGA-AB-3011,1885


In [7]:
# Patient's current status
Clinical_df['vital_status']

bcr_patient_barcode,vital_status
TCGA-AB-2802,DECEASED
TCGA-AB-2803,DECEASED
TCGA-AB-2804,LIVING
TCGA-AB-2805,DECEASED
TCGA-AB-2806,DECEASED
...,...
TCGA-AB-3007,LIVING
TCGA-AB-3008,DECEASED
TCGA-AB-3009,DECEASED
TCGA-AB-3011,LIVING


In [8]:
# 'days_to_last_known_alive' contains only '[Not Available]'
Clinical_df['days_to_last_known_alive'].describe()

Unnamed: 0,days_to_last_known_alive
count,200
unique,1
top,[Not Available]
freq,200


## **Creating a new metadata column "Observation Period"**
  - To analyze patient survival, we need each patient's total observed survival time in days.
  - From vital_status, we know whether a patient is living or deceased.
      - If the patient is deceased, we take the days from "days_to_death".
      - If the patient is living, we take the days from "days_to_last_followup" ("days_to_last_known_alive" contains only "[Not Available]").
      - Combine these values into a new metadata column called "Observation Period".

In [9]:
# Convert 'days_to_death' into numeric type (non-numeric string turns NaN)
Clinical_df['days_to_death'] = pd.to_numeric(Clinical_df['days_to_death'], errors='coerce')

# Convert 'days_to_last_followup' into numeric type (non-numeric string turns NaN)
Clinical_df['days_to_last_followup'] = pd.to_numeric(Clinical_df['days_to_last_followup'], errors='coerce')

# Creating new metadata 'Observation Period'
Clinical_df['Observation Period'] = np.where(
    Clinical_df['vital_status'] == 'DECEASED', # If status is DECEASED
    Clinical_df['days_to_death'], # If True
    Clinical_df['days_to_last_followup'] # if False
)

# Checking Observation Period
Clinical_df['Observation Period'].head(20)

bcr_patient_barcode,Observation Period
TCGA-AB-2802,365.0
TCGA-AB-2803,792.0
TCGA-AB-2804,2556.0
TCGA-AB-2805,576.0
TCGA-AB-2806,944.0
TCGA-AB-2807,180.0
TCGA-AB-2808,2861.0
TCGA-AB-2809,62.0
TCGA-AB-2810,31.0
TCGA-AB-2811,243.0


In [10]:
Clinical_df['Observation Period'].describe()

Unnamed: 0,Observation Period
count,186.0
mean,560.61828
std,589.761637
min,0.0
25%,152.25
50%,365.0
75%,753.5
max,2861.0


In [23]:
# How many patients do not have an observation period?
nan_count = Clinical_df['Observation Period'].isna().sum()
print(nan_count)

14
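
Patients with a missing observation period cannot contribute a survival time, so it helps to see whether the gaps come from living or deceased patients before deciding how to handle them. The following is a minimal sketch on toy data: the frame `toy`, its `P1`..`P4` barcodes, and the choice to drop the incomplete rows are all assumptions for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Clinical_df (hypothetical values).
toy = pd.DataFrame(
    {
        "vital_status": ["DECEASED", "LIVING", "LIVING", "DECEASED"],
        "Observation Period": [365.0, 2556.0, np.nan, np.nan],
    },
    index=pd.Index(["P1", "P2", "P3", "P4"], name="bcr_patient_barcode"),
)

# Break the missing observation periods down by vital status.
missing_mask = toy["Observation Period"].isna()
missing_by_status = toy.loc[missing_mask, "vital_status"].value_counts()
print(missing_by_status)

# One possible handling (an assumption): keep only patients with a known period.
survival_ready = toy.dropna(subset=["Observation Period"])
print(len(survival_ready))
```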


## **Creating a new metadata column 'Status'**
  - Currently 'vital_status' is an object type (LIVING or DECEASED).
  - We need to convert it into a numeric value (1 if the patient is deceased, 0 if alive).




In [11]:
Clinical_df['Status'] = np.where(
    Clinical_df['vital_status'] == 'DECEASED',
    1, # if true
    0 # if false
)

# Checking
Clinical_df['Status'].head(20)

bcr_patient_barcode,Status
TCGA-AB-2802,1
TCGA-AB-2803,1
TCGA-AB-2804,0
TCGA-AB-2805,1
TCGA-AB-2806,1
TCGA-AB-2807,1
TCGA-AB-2808,0
TCGA-AB-2809,1
TCGA-AB-2810,1
TCGA-AB-2811,1


In [12]:
Clinical_df['Status'].describe()

Unnamed: 0,Status
count,200.0
mean,0.665
std,0.473175
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


In [13]:
# Count how many 0s (alive) and 1s (deceased) there are
counts = Clinical_df['Status'].value_counts()
counts

Status,count
1,133
0,67


In [14]:
# Check the final clinical DataFrame
Clinical_df

bcr_patient_barcode,acute_myeloid_leukemia_calgb_cytogenetics_risk_category,age_at_initial_pathologic_diagnosis,atra_exposure,bcr_patient_uuid,cumulative_agent_total_dose,cytogenetic_abnormality,cytogenetic_abnormality_other,cytogenetic_analysis_performed_ind,date_of_form_completion,date_of_initial_pathologic_diagnosis,...,race,steroid_therapy_administered,tissue_source_site,total_dose_units,tumor_tissue_site,vital_status,FISH_test_component,FISH_test_component_percentage_value,Observation Period,Status
TCGA-AB-2802,Intermediate/Normal,50,NO,b93cb62a-a7dc-406d-8482-6b51a92ea3c3,0,Normal,[Not Available],YES,2010-12-14,2001-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,,,365.0,1
TCGA-AB-2803,Favorable,61,NO,fb4c9803-3690-4f6a-9402-72a4f36d64d1,0,Normal,t(15;17),YES,2010-12-14,2001-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,PML-RAR,95,792.0,1
TCGA-AB-2804,Intermediate/Normal,30,YES,2fcda6a9-813b-41b2-aae4-ca42c9986287,0,Normal,[Not Available],YES,2010-12-14,2001-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,LIVING,PML-RAR,9,2556.0,0
TCGA-AB-2805,Intermediate/Normal,77,NO,ada38f3e-8020-4394-9e7c-50d06dd04769,0,Normal,No,YES,2010-12-14,2002-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,DECEASED,,,576.0,1
TCGA-AB-2806,Favorable,46,NO,e78ff499-037b-450a-ac04-6fb3a9e124a4,4000,t (8;21),[Not Available],YES,2010-12-14,2002-00-00,...,WHITE,[Not Available],AB,mg,BONE MARROW,DECEASED,,,944.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCGA-AB-3007,Favorable,35,NO,d3988699-70d6-43e1-b84b-9e38b4d2d2b1,[Not Available],Normal|del (7q) / 7q-|Trisomy 8|t (15;17),NO,YES,2010-12-14,2005-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,LIVING,PML-RAR|CBF-Beta|-7 or del(7q),15|0|96,1581.0,0
TCGA-AB-3008,Intermediate/Normal,22,NO,898a092e-89fe-4010-afea-14c605f99481,8000,Normal,[Not Available],YES,2010-12-14,2005-00-00,...,WHITE,[Not Available],AB,mg,BONE MARROW,DECEASED,PML-RAR|MLL|CBF-Beta,3|0|1,822.0,1
TCGA-AB-3009,Intermediate/Normal,23,NO,c9b92f8f-4599-47d1-9d12-31e42166a091,[Not Available],Normal,[Not Available],YES,2010-12-14,2005-00-00,...,[Not Available],[Not Available],AB,[Not Available],BLOOD,DECEASED,,,576.0,1
TCGA-AB-3011,Intermediate/Normal,21,NO,e08f84fe-0013-4734-9966-cd734e6fedc5,[Not Available],Normal,[Not Available],YES,2010-12-14,2005-00-00,...,WHITE,[Not Available],AB,[Not Available],BONE MARROW,LIVING,PML-RAR|MLL|TEL-AML 1,0|1|1,1885.0,0


# **RNA Expression Dataset**

In [15]:
# Reading RNA Expression Dataset
file_path = '/content/drive/MyDrive/Acute Myeloid Leukemia (TCGA, PanCancer Atlas)/datasets/laml.rnaseq.179_v1.0_gaf2.0_read_count_matrix.txt.tcgaID.txt'
gene_df = pd.read_csv(file_path, sep='\t', index_col=0) # set GeneID as index

gene_df

GeneID,TCGA-AB-2803,TCGA-AB-2807,TCGA-AB-2963,TCGA-AB-2826,TCGA-AB-2867,TCGA-AB-2818,TCGA-AB-2808,TCGA-AB-2853,TCGA-AB-2854,TCGA-AB-2822,...,TCGA-AB-2949,TCGA-AB-2981,TCGA-AB-2999,TCGA-AB-2896,TCGA-AB-2952,TCGA-AB-2920,TCGA-AB-2841,TCGA-AB-2811,TCGA-AB-2979,TCGA-AB-2977
?|100132510_calculated,14.00,4.00,31.82,32.00,5.00,0.00,0.00,0.00,30.42,4.00,...,19.54,56.00,22.00,17.00,3.00,0.00,15.00,15.00,0.90,36.00
?|100134860_calculated,339.42,382.66,198.44,113.44,218.28,177.32,233.36,268.30,178.88,543.92,...,684.04,238.28,540.18,837.32,560.84,663.38,783.04,1056.46,853.70,591.32
?|10357_calculated,5.00,2.00,8.98,26.12,23.66,5.00,9.20,14.00,3.98,6.28,...,1.00,4.74,7.90,5.80,5.36,0.00,6.00,2.00,2.50,20.46
?|10431_calculated,1638.14,788.50,2078.18,1253.20,633.58,1065.06,1241.54,1658.58,1464.00,674.76,...,975.66,1087.08,2183.28,1331.26,1168.14,1488.88,1343.04,4892.32,1009.80,2349.76
?|114130_calculated,23.72,980.76,60.28,35.76,203.26,172.26,123.72,48.52,28.86,158.34,...,197.40,105.52,119.48,85.06,143.40,339.22,149.18,212.96,156.58,54.70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX|7791_calculated,9613.40,7733.44,17099.60,21996.50,20587.60,22100.10,10197.40,13655.20,14063.70,21799.20,...,15616.70,43075.40,13611.50,19541.50,42372.30,34502.20,64440.40,165176.00,56408.90,47559.30
ZZEF1|23140_calculated,5332.46,9364.42,9210.82,6321.28,8703.40,9577.96,8040.82,8076.90,9366.36,9578.34,...,10947.50,15371.80,8626.32,8168.92,11208.60,7845.28,9837.74,9715.26,13772.90,5357.90
ZZZ3|26009_calculated,2452.22,2986.18,4788.52,2025.34,3722.32,2884.50,3697.18,5077.84,2072.30,4125.34,...,4373.78,2374.28,2821.74,3226.08,3098.56,2486.90,1913.44,1419.48,2620.20,1045.34
psiTPTE22|387590_calculated,33.00,51.88,7.00,140.18,583.76,15.00,47.70,5.04,33.96,30.72,...,31.84,10.98,67.00,26.50,187.78,36.94,833.10,25.54,74.98,2.00


# **Inspecting RNA expression dataframe**
 - Contains information on 20442 genes (rows)
    - Format:
      - {gene_name}|{Id}_calculated
      - ? : unknown gene name
 - Contains 179 patient cases (columns)
    - Format:
      - TCGA-AB-2803 (Patient's unique id)
In [16]:
# Inspecting gene dataframe
#   - Contains information on 20442 genes (rows)
#   - Contains 179 patient cases (columns)
gene_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20442 entries, ?|100132510_calculated to tAKR|389932_calculated
Columns: 179 entries, TCGA-AB-2803 to TCGA-AB-2977
dtypes: float64(179)
memory usage: 28.1+ MB
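
The expression matrix has 179 patient columns while the clinical table has 200 rows, so any joint analysis needs the barcodes present in both. The following is a minimal sketch on toy data, not part of the original notebook: `toy_clinical`, `toy_gene`, and `shared` are hypothetical names, and the values only mirror rows shown above.

```python
import pandas as pd

# Toy clinical table: one row per patient (hypothetical subset).
toy_clinical = pd.DataFrame(
    {"Status": [1, 1, 1]},
    index=pd.Index(
        ["TCGA-AB-2802", "TCGA-AB-2803", "TCGA-AB-2807"],
        name="bcr_patient_barcode",
    ),
)

# Toy expression matrix: one column per patient (hypothetical subset).
toy_gene = pd.DataFrame(
    {"TCGA-AB-2803": [792.14, 1139.18], "TCGA-AB-2807": [1095.44, 1121.68]},
    index=pd.Index(["A1BG-AS|503538", "A1BG|1"], name="GeneID"),
)

# Barcodes that appear both as clinical rows and as expression columns.
shared = toy_clinical.index.intersection(toy_gene.columns)

clinical_matched = toy_clinical.loc[shared]
gene_matched = toy_gene[shared]
print(list(shared))
```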


In [17]:
# Inspecting number of unknown Gene ID
Unknown_gene_count = gene_df.index.str.startswith('?').sum()
print(f'Number of unknown gene Id: {Unknown_gene_count}')

Number of unknown gene Id: 123


In [18]:
# Removing unknown gene ID from dataframe

gene_df = gene_df[~gene_df.index.str.startswith('?')]
gene_df.index

Index(['A1BG-AS|503538_calculated', 'A1BG|1_calculated',
       'A1CF|29974_calculated', 'A2LD1|87769_calculated',
       'A2ML1|144568_calculated', 'A2M|2_calculated',
       'A4GALT|53947_calculated', 'A4GNT|51146_calculated',
       'AAA1|404744_calculated', 'AAAS|8086_calculated',
       ...
       'ZWINT|11130_calculated', 'ZXDA|7789_calculated',
       'ZXDB|158586_calculated', 'ZXDC|79364_calculated',
       'ZYG11B|79699_calculated', 'ZYX|7791_calculated',
       'ZZEF1|23140_calculated', 'ZZZ3|26009_calculated',
       'psiTPTE22|387590_calculated', 'tAKR|389932_calculated'],
      dtype='object', name='GeneID', length=20319)

In [19]:
# Keep only rows whose GeneID contains the '|' separator
clean_gene_df = gene_df[gene_df.index.str.contains('|', regex=False, na=False)].copy()

# Removing "_calculated" from GeneID
clean_gene_df.index = clean_gene_df.index.str.replace('_calculated', '', regex=False)

# Checking
clean_gene_df

GeneID,TCGA-AB-2803,TCGA-AB-2807,TCGA-AB-2963,TCGA-AB-2826,TCGA-AB-2867,TCGA-AB-2818,TCGA-AB-2808,TCGA-AB-2853,TCGA-AB-2854,TCGA-AB-2822,...,TCGA-AB-2949,TCGA-AB-2981,TCGA-AB-2999,TCGA-AB-2896,TCGA-AB-2952,TCGA-AB-2920,TCGA-AB-2841,TCGA-AB-2811,TCGA-AB-2979,TCGA-AB-2977
A1BG-AS|503538,792.14,1095.44,425.60,363.22,505.22,772.74,570.74,460.62,775.30,1148.84,...,901.26,670.12,1330.06,506.64,1015.04,1213.56,1274.18,436.34,1417.64,963.22
A1BG|1,1139.18,1121.68,322.96,274.50,379.46,599.24,531.26,465.74,723.88,1252.48,...,801.48,412.22,2261.20,512.84,786.04,1124.96,1311.62,806.68,1446.26,1226.78
A1CF|29974,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
A2LD1|87769,194.50,111.06,225.60,92.92,143.86,278.46,123.08,274.40,195.24,142.30,...,207.56,269.26,173.72,219.28,171.38,242.20,184.98,242.34,107.64,250.72
A2ML1|144568,24.36,11.08,58.84,12.06,17.30,22.10,21.64,34.30,30.32,40.22,...,33.12,59.74,25.68,60.42,55.42,89.10,22.82,5.10,24.22,11.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYX|7791,9613.40,7733.44,17099.60,21996.50,20587.60,22100.10,10197.40,13655.20,14063.70,21799.20,...,15616.70,43075.40,13611.50,19541.50,42372.30,34502.20,64440.40,165176.00,56408.90,47559.30
ZZEF1|23140,5332.46,9364.42,9210.82,6321.28,8703.40,9577.96,8040.82,8076.90,9366.36,9578.34,...,10947.50,15371.80,8626.32,8168.92,11208.60,7845.28,9837.74,9715.26,13772.90,5357.90
ZZZ3|26009,2452.22,2986.18,4788.52,2025.34,3722.32,2884.50,3697.18,5077.84,2072.30,4125.34,...,4373.78,2374.28,2821.74,3226.08,3098.56,2486.90,1913.44,1419.48,2620.20,1045.34
psiTPTE22|387590,33.00,51.88,7.00,140.18,583.76,15.00,47.70,5.04,33.96,30.72,...,31.84,10.98,67.00,26.50,187.78,36.94,833.10,25.54,74.98,2.00


# Saving processed data

In [20]:
os.makedirs('processed_datas', exist_ok=True)

path1 = os.path.join('processed_datas', 'cleaned_clinical_df.csv')
path2 = os.path.join('processed_datas', 'cleaned_gene_df.csv')

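# Keep the index so the patient barcodes and GeneIDs are written to the files
Clinical_df.to_csv(path1)
# Save the cleaned gene table (GeneID without the "_calculated" suffix)
clean_gene_df.to_csv(path2)

print("DataFrames successfully saved to processed_datas")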

DataFrames successfully saved to processed_datas
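
When these files are reloaded in a later notebook, the patient barcodes and GeneIDs are needed as row labels. The following is a minimal sketch of a CSV round trip that keeps the index, using toy data and a temporary directory; the frame `toy` and the file name `toy_clinical.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the processed clinical table (hypothetical subset).
toy = pd.DataFrame(
    {"Observation Period": [365.0, 2556.0], "Status": [1, 0]},
    index=pd.Index(["TCGA-AB-2802", "TCGA-AB-2804"], name="bcr_patient_barcode"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clinical.csv")
    # The default index=True writes the barcodes as the first CSV column.
    toy.to_csv(path)
    # index_col=0 restores that first column as the index on reload.
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.index.tolist())
```

Writing with `index=False` instead would drop the barcodes from the file, so the reloaded rows could no longer be matched to patients.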
