# Data Preprocessing:

In this step, the full clinical dataset for TCGA-LUAD patients is narrowed down to two groups of patients based on smoking status. Group 1 consists of 10 randomly selected patients (male or female) for whom years_smoked data is available.

Group 2 is built the same way. It consists of 10 randomly selected patients (male or female) for whom smoking exposure data is not available; these patients are treated as the non-smoking comparison group for Group 1.

After the patients of interest are identified, their submitter_id values are used to obtain each patient's DNA methylation data from GDC.


In [0]:
import pandas as pd #pandas is used to load and manipulate the datasets.

In [0]:
clinicaldata = pd.read_csv("TCGA-LUAD_Clinical.csv") #Entire clinical dataset 
clinicaldata.head() #Checking the dataset looks right.

Unnamed: 0,submitter_id,year_of_diagnosis,classification_of_tumor,last_known_disease_status,updated_datetime,primary_diagnosis,tumor_stage,age_at_diagnosis,morphology,days_to_last_known_disease_status,...,treatments_radiation_treatment_id,treatments_radiation_therapeutic_agents,treatments_radiation_regimen_or_line_of_therapy,treatments_radiation_treatment_intent_type,treatments_radiation_treatment_anatomic_site,treatments_radiation_treatment_outcome,treatments_radiation_days_to_treatment_end,treatments_radiation_treatment_or_therapy,bcr_patient_barcode,disease
0,TCGA-05-4244,2009.0,not reported,not reported,2019-08-08T17:07:40.762038-05:00,"Adenocarcinoma, NOS",stage iv,25752.0,8140/3,,...,834430fa-3aa8-570c-bac0-0c09865fbb2b,,,,,,,not reported,TCGA-05-4244,LUAD
1,TCGA-05-4245,2009.0,not reported,not reported,2019-08-08T17:07:40.762038-05:00,"Adenocarcinoma, NOS",stage iiia,29647.0,8140/3,,...,85506557-5c39-5d73-ba25-228aee1aea2a,,,,,,,no,TCGA-05-4245,LUAD
2,TCGA-05-4249,2007.0,not reported,not reported,2019-08-08T17:07:40.762038-05:00,"Adenocarcinoma, NOS",stage ib,24532.0,8140/3,,...,7e0ec2e5-724b-5661-9b44-92b73528531a,,,,,,,no,TCGA-05-4249,LUAD
3,TCGA-05-4250,2007.0,not reported,not reported,2019-08-08T17:07:40.762038-05:00,"Adenocarcinoma, NOS",stage iiia,29068.0,8140/3,,...,cf2c4caa-1bf5-5a80-9c8c-ba63bca72cda,,,,,,,not reported,TCGA-05-4250,LUAD
4,TCGA-05-4382,2009.0,not reported,not reported,2019-08-08T17:07:40.762038-05:00,Adenocarcinoma with mixed subtypes,stage ib,24868.0,8255/3,,...,4f8e2c3e-77c1-5313-967b-0eac6e3fafac,,,,,,,yes,TCGA-05-4382,LUAD


In [0]:
clinicaldata.shape #Making sure it's complete.

(522, 74)

In [0]:
for col in clinicaldata.columns: #The names of all columns in clinical dataset.
    print(col)


submitter_id
year_of_diagnosis
classification_of_tumor
last_known_disease_status
updated_datetime
primary_diagnosis
tumor_stage
age_at_diagnosis
morphology
days_to_last_known_disease_status
created_datetime
prior_treatment
ajcc_pathologic_n
ajcc_pathologic_m
state
days_to_last_follow_up
days_to_recurrence
diagnosis_id
tumor_grade
icd_10_code
tissue_or_organ_of_origin
progression_or_recurrence
prior_malignancy
ajcc_staging_system_edition
ajcc_pathologic_stage
synchronous_malignancy
site_of_resection_or_biopsy
ajcc_pathologic_t
days_to_diagnosis
cigarettes_per_day
weight
alcohol_intensity
bmi
years_smoked
alcohol_history
exposure_id
height
pack_years_smoked
gender
year_of_birth
demographic_id
race
age_at_index
vital_status
ethnicity
year_of_death
days_to_birth
days_to_death
treatments_pharmaceutical_days_to_treatment_start
treatments_pharmaceutical_treatment_effect
treatments_pharmaceutical_initial_disease_status
treatments_pharmaceutical_treatment_type
treatments_pharmaceutical_treatmen

There are a lot of columns here that we won't be using right now. For simplicity, let's reduce the dataset to the columns we need.

In [0]:
clindata_copy = clinicaldata[['submitter_id', 'cigarettes_per_day', 'alcohol_intensity', 
                              'years_smoked','alcohol_history', 'pack_years_smoked', 
                              'gender', 'race', 'vital_status', 'ethnicity', 
                              'bcr_patient_barcode','disease']].copy()
#Leaving in some columns to get a good overview of the data.

In [0]:
clindata_copy.shape

(522, 13)

The dataset now has 13 columns instead of 74, which makes it much easier to understand and process.

# Group 1: Smokers

Below, the 200 patients with the highest years_smoked values are selected, ordered from most to fewest years smoked. This is the dataset from which Group 1 patients are randomly selected.

In [0]:
smokersdata = clindata_copy.nlargest(200, 'years_smoked') 
#Keeps the 200 rows with the largest years_smoked values, sorted in descending order.
#The dataset below ranges from 64 years smoked to 2 years smoked.
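The selection criterion stated at the top of this notebook is that years_smoked is available, so one alternative to a fixed cutoff of 200 rows is to filter on that availability directly. The sketch below illustrates the idea on a small made-up table; `toy` and `smokers_toy` are hypothetical names and the values are not from the TCGA-LUAD data.

```python
import pandas as pd

# Toy stand-in for clindata_copy; the values are invented for illustration.
toy = pd.DataFrame({
    "submitter_id": ["P1", "P2", "P3", "P4"],
    "years_smoked": [30.0, float("nan"), 12.0, float("nan")],
})

# Keep only the rows where years_smoked was actually recorded.
smokers_toy = toy[toy["years_smoked"].notna()]

# Sort from most to fewest years smoked, mirroring the ordering used above.
smokers_toy = smokers_toy.sort_values("years_smoked", ascending=False)
print(smokers_toy)
```

With this approach the size of the smoker pool follows from the data rather than from a chosen number of rows.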

In [0]:
smokersdata.head()

Unnamed: 0,submitter_id,cigarettes_per_day,alcohol_intensity,years_smoked,alcohol_history,pack_years_smoked,gender,race,vital_status,ethnicity,vital_status.1,bcr_patient_barcode,disease
77,TCGA-44-6777,3.506849,,64.0,Not Reported,64.0,female,white,Dead,not reported,Dead,TCGA-44-6777,LUAD
410,TCGA-91-6829,5.178082,,63.0,Not Reported,94.5,male,white,Dead,not hispanic or latino,Dead,TCGA-91-6829,LUAD
222,TCGA-55-8205,1.643836,,61.0,Not Reported,30.0,female,white,Alive,not hispanic or latino,Alive,TCGA-55-8205,LUAD
97,TCGA-44-A4SS,4.931507,,60.0,Not Reported,90.0,male,white,Alive,not hispanic or latino,Alive,TCGA-44-A4SS,LUAD
96,TCGA-44-A47G,1.534247,,56.0,Not Reported,28.0,female,white,Alive,not hispanic or latino,Alive,TCGA-44-A47G,LUAD


In [0]:
Group1 = smokersdata.sample(n = 10) #Randomly selecting 10 patients for group 1.

In [0]:
Group1

Unnamed: 0,submitter_id,cigarettes_per_day,alcohol_intensity,years_smoked,alcohol_history,pack_years_smoked,gender,race,vital_status,ethnicity,vital_status.1,bcr_patient_barcode,disease
504,TCGA-MP-A4TE,2.191781,,40.0,Not Reported,40.0,male,white,Dead,not hispanic or latino,Dead,TCGA-MP-A4TE,LUAD
79,TCGA-44-6779,0.821918,,30.0,Not Reported,15.0,female,white,Dead,not reported,Dead,TCGA-44-6779,LUAD
428,TCGA-93-A4JN,2.191781,,20.0,Not Reported,40.0,male,white,Alive,not hispanic or latino,Alive,TCGA-93-A4JN,LUAD
433,TCGA-95-7043,2.191781,,39.0,Not Reported,40.0,female,white,Dead,not hispanic or latino,Dead,TCGA-95-7043,LUAD
251,TCGA-55-A494,0.383562,,13.0,Not Reported,7.0,female,white,Alive,not hispanic or latino,Alive,TCGA-55-A494,LUAD
463,TCGA-97-A4M1,0.164384,,3.0,Not Reported,3.0,female,white,Alive,not hispanic or latino,Alive,TCGA-97-A4M1,LUAD
426,TCGA-93-7348,2.30137,,42.0,Not Reported,42.0,female,white,Alive,not hispanic or latino,Alive,TCGA-93-7348,LUAD
471,TCGA-99-8028,1.643836,,30.0,Not Reported,30.0,female,black or african american,Alive,not hispanic or latino,Alive,TCGA-99-8028,LUAD
362,TCGA-78-7539,0.120548,,28.0,Not Reported,2.2,female,white,Alive,not reported,Alive,TCGA-78-7539,LUAD
505,TCGA-MP-A4TF,2.191781,,40.0,Not Reported,40.0,female,white,Dead,not hispanic or latino,Dead,TCGA-MP-A4TF,LUAD


Because sample draws a new random subset every time it runs, Group 1 must be saved to a csv. Otherwise, re-running the sampling cell would replace Group 1 with a completely new random set of 10 patients.

In [0]:
Group1.to_csv('group1.csv') #Saving Group 1 data into a csv.
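Saving to a csv is one way to keep the selection stable. Another option, shown here only as a sketch, is to pass a fixed `random_state` to `sample` so that the same rows are drawn on every run. The seed value and the toy table below are assumptions for illustration and were not used in this notebook.

```python
import pandas as pd

# Toy stand-in for smokersdata with 50 made-up patient IDs.
toy = pd.DataFrame({"submitter_id": [f"P{i}" for i in range(50)]})

SEED = 42  # hypothetical seed; any fixed integer works

# Two independent draws with the same seed select the same rows.
first = toy.sample(n=10, random_state=SEED)
second = toy.sample(n=10, random_state=SEED)

print(first.equals(second))
```

Writing the csv is still useful for sharing the selected patients with the next notebook; the seed only makes the draw itself repeatable.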

Group 1 is now saved as a file containing the IDs of 10 lung cancer patients with a recorded smoking history. It is a mix of male and female patients. Each patient's submitter_id will be used to obtain their DNA methylation file.

# Group 2: Non-Smokers

To select 10 random patients for this group, every smoking exposure value for a patient must be zero. The data currently contains NaN values, which mark missing entries. These are replaced with 0, so a patient with no recorded smoking exposure is treated as having none. Once the data is clean, the 10 random patients are selected.

In [0]:
clindata_copy = clindata_copy.fillna(0) #Replaces all NaNs with 0.
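Calling `fillna(0)` on the whole DataFrame also replaces missing values in non-numeric columns with 0. If that is not wanted, the fill can be limited to the exposure columns. The sketch below shows this on a made-up table; `toy`, `filled` and `exposure_cols` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for clindata_copy; the values are invented for illustration.
toy = pd.DataFrame({
    "submitter_id": ["P1", "P2", "P3"],
    "years_smoked": [30.0, float("nan"), float("nan")],
    "pack_years_smoked": [45.0, float("nan"), 10.0],
    "race": ["white", None, "asian"],
})

# Only these numeric exposure columns get their NaNs replaced with 0.
exposure_cols = ["years_smoked", "pack_years_smoked"]

filled = toy.copy()
filled[exposure_cols] = filled[exposure_cols].fillna(0)

# The missing value in the text column is left untouched.
print(filled)
```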

In [0]:
clindata_copy.describe() #Take a look at the data. There are no NaNs visible.

Unnamed: 0,cigarettes_per_day,alcohol_intensity,years_smoked,pack_years_smoked
count,522.0,522.0,522.0,522.0
mean,1.560463,0.0,11.756705,28.478448
std,1.625265,0.0,17.15796,29.661089
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,1.342466,0.0,0.0,24.5
75%,2.465753,0.0,25.0,45.0
max,8.438356,0.0,64.0,154.0


In [0]:
clindata_copy.isna().sum() #This counts the # of NaNs in each column. It verifies there are no NaN in the dataset.

submitter_id           0
cigarettes_per_day     0
alcohol_intensity      0
years_smoked           0
alcohol_history        0
pack_years_smoked      0
gender                 0
race                   0
vital_status           0
ethnicity              0
vital_status           0
bcr_patient_barcode    0
disease                0
dtype: int64

Now that the dataset is clean, we select all patients for whom cigarettes_per_day, years_smoked and pack_years_smoked are all equal to zero.

In [0]:
nonsmokers = clindata_copy.loc[
    (clindata_copy['cigarettes_per_day'] == 0)
    & (clindata_copy['years_smoked'] == 0)
    & (clindata_copy['pack_years_smoked'] == 0)
]
#This selects all rows for which every smoking exposure column is equal to zero.
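The same filter can be written more compactly by comparing all exposure columns at once and requiring every comparison in a row to be true. This is a sketch on made-up data, with `toy`, `mask` and `nonsmokers_toy` as hypothetical names.

```python
import pandas as pd

# Toy stand-in for clindata_copy after NaNs were replaced with 0.
toy = pd.DataFrame({
    "submitter_id": ["P1", "P2", "P3"],
    "cigarettes_per_day": [0.0, 1.5, 0.0],
    "years_smoked": [0.0, 20.0, 0.0],
    "pack_years_smoked": [0.0, 30.0, 12.0],
})

exposure_cols = ["cigarettes_per_day", "years_smoked", "pack_years_smoked"]

# True only for rows where every exposure column equals zero.
mask = toy[exposure_cols].eq(0).all(axis=1)
nonsmokers_toy = toy.loc[mask]
print(nonsmokers_toy)
```

Here P3 is excluded because pack_years_smoked is non-zero even though the other two columns are zero.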

In [0]:
nonsmokers.describe() #We verify that for those columns the max values are 0.

Unnamed: 0,cigarettes_per_day,alcohol_intensity,years_smoked,pack_years_smoked
count,154.0,154.0,154.0,154.0
mean,0.0,0.0,0.0,0.0
std,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0


In [0]:
Group2 = nonsmokers.sample(n = 10) #Randomly selecting 10 patients for group 2.

In [0]:
Group2 #Seeing what the data looks like.

Unnamed: 0,submitter_id,cigarettes_per_day,alcohol_intensity,years_smoked,alcohol_history,pack_years_smoked,gender,race,vital_status,ethnicity,vital_status.1,bcr_patient_barcode,disease
232,TCGA-55-8508,0.0,0.0,0.0,Not Reported,0.0,female,black or african american,Alive,not hispanic or latino,Alive,TCGA-55-8508,LUAD
54,TCGA-44-2661,0.0,0.0,0.0,Not Reported,0.0,female,white,Alive,not hispanic or latino,Alive,TCGA-44-2661,LUAD
320,TCGA-73-7499,0.0,0.0,0.0,Not Reported,0.0,female,white,Dead,not hispanic or latino,Dead,TCGA-73-7499,LUAD
401,TCGA-86-8672,0.0,0.0,0.0,Not Reported,0.0,male,white,Dead,not hispanic or latino,Dead,TCGA-86-8672,LUAD
137,TCGA-50-5930,0.0,0.0,0.0,Not Reported,0.0,male,white,Dead,not hispanic or latino,Dead,TCGA-50-5930,LUAD
202,TCGA-55-7816,0.0,0.0,0.0,Not Reported,0.0,female,white,Dead,not hispanic or latino,Dead,TCGA-55-7816,LUAD
241,TCGA-55-8619,0.0,0.0,0.0,Not Reported,0.0,female,white,Alive,not hispanic or latino,Alive,TCGA-55-8619,LUAD
417,TCGA-91-6848,0.0,0.0,0.0,Not Reported,0.0,male,white,Alive,not hispanic or latino,Alive,TCGA-91-6848,LUAD
336,TCGA-75-7030,0.0,0.0,0.0,Not Reported,0.0,male,not reported,Alive,not reported,Alive,TCGA-75-7030,LUAD
293,TCGA-69-7760,0.0,0.0,0.0,Not Reported,0.0,male,white,Alive,not hispanic or latino,Alive,TCGA-69-7760,LUAD


In [0]:
Group2.to_csv('group2.csv') #Saving Group 2 as csv.
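By default `to_csv` also writes the row index as an unnamed first column, and reading such a file back without `index_col` produces an `Unnamed: 0` column, which is likely the source of that column in the tables above. The sketch below shows the difference using an in-memory buffer instead of a real file; the toy table and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Toy stand-in for a saved group; the IDs are invented for illustration.
toy = pd.DataFrame({"submitter_id": ["P1", "P2"]}, index=[77, 410])

# Default behaviour: the index is written and comes back as "Unnamed: 0".
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index_col = pd.read_csv(buffer)

# With index=False only the data columns are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)

# The IDs can then be pulled out as a plain list for the next step.
patient_ids = clean["submitter_id"].tolist()
print(list(with_index_col.columns), list(clean.columns), patient_ids)
```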

For organizational purposes this notebook ends here. Notebook 2 consists of analyzing DNA Methylation data of the patients we selected here.