### Imports

In [44]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from library1.sb_utils import save_file

### Load in Data

In [3]:
bc_data_eda = pd.read_csv('/Users/sharanaravindh/Desktop/springboard/Github repository/Capstone-Project-1-Breast-Cancer-Prognosis/Data/breast_cancer_EDA.csv')

In [4]:
bc_data_eda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321 entries, 0 to 320
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   321 non-null    int64  
 1   Gender                321 non-null    object 
 2   Protein1              321 non-null    float64
 3   Protein2              321 non-null    float64
 4   Protein3              321 non-null    float64
 5   Protein4              321 non-null    float64
 6   Tumour_Stage          321 non-null    int64  
 7   Histology             321 non-null    object 
 8   ER status             321 non-null    object 
 9   PR status             321 non-null    object 
 10  HER2 status           321 non-null    object 
 11  Surgery_type          321 non-null    object 
 12  Date_of_Surgery       321 non-null    object 
 13  Date_of_Last_Visit    321 non-null    object 
 14  Patient_Status        321 non-null    object 
 15  Survival_Time_Months  3

bc_data_eda.head()

### Scale Standardization

My goal is to scale the numerical features in the dataset, since I plan to apply k-nearest neighbors, k-means clustering, and principal component analysis (PCA) to parts of it. These methods rely on distances or variances, so a feature measured on a larger scale would otherwise dominate the result. StandardScaler fits the needs of this project: it centers each feature at zero with unit variance, which suits PCA and k-means clustering and also helps when fitting logistic regression models. For logistic regression, I will fit the model on both scaled and unscaled data so that I can compare the two and check that the results do not depend on the scaling.
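As a rough illustration of what standardization does, the sketch below applies the z-score formula by hand with NumPy. The toy values are hypothetical and are not taken from the dataset; the population standard deviation (`ddof=0`) is used because that is the form StandardScaler applies.

```python
import numpy as np

# Hypothetical toy values: two features on very different scales
X = np.array([[36.0, 0.08],
              [43.0, -0.42],
              [69.0, 0.21],
              [56.0, 0.35]])

# Standardize each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for each column
print(X_scaled.std(axis=0))   # close to 1 for each column
```

After this transform both columns contribute on a comparable scale to any distance or variance computation.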

In [29]:
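col_to_scale = ['Age','Protein1','Protein2','Protein3','Protein4','Survival_Time_Months']

# Select the numerical columns to standardize
bc_numerical = bc_data_eda[col_to_scale]

# Fit the scaler (per-column mean and standard deviation), then transform
scaler = StandardScaler()
scaler.fit(bc_numerical)
bc_numerical = scaler.transform(bc_numerical)

# transform() returns a NumPy array, so rebuild a DataFrame with the original column names
bc_numerical = pd.DataFrame(bc_numerical, columns=col_to_scale)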


In [31]:
bc_numerical.head()

Unnamed: 0,Age,Protein1,Protein2,Protein3,Protein4,Survival_Time_Months
0,-1.307174,1.79365,1.316245,0.172469,-0.091956,-1.002878
1,-0.37765,0.048512,0.468651,-0.68945,-0.830435,-0.440454
2,0.319494,-0.909706,0.891237,-0.471689,0.003221,0.578939
3,1.4814,-1.556695,-0.90852,-0.472012,0.198508,0.66522
4,-1.307174,0.462757,0.874831,-0.767704,-0.64201,-1.239351


In [34]:
bc_data_scaled = pd.concat([bc_data_eda.drop(col_to_scale, axis=1), bc_numerical], axis=1)

In [35]:
bc_data_scaled.head()

Unnamed: 0,Gender,Tumour_Stage,Histology,ER status,PR status,HER2 status,Surgery_type,Date_of_Surgery,Date_of_Last_Visit,Patient_Status,Tumor_advanced,Age,Protein1,Protein2,Protein3,Protein4,Survival_Time_Months
0,FEMALE,2,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,2018-05-20,2018-08-26,Alive,1,-1.307174,1.79365,1.316245,0.172469,-0.091956,-1.002878
1,FEMALE,2,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,2018-04-26,2019-01-25,Dead,1,-0.37765,0.048512,0.468651,-0.68945,-0.830435,-0.440454
2,FEMALE,2,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Lumpectomy,2018-08-24,2020-04-08,Alive,1,0.319494,-0.909706,0.891237,-0.471689,0.003221,0.578939
3,FEMALE,1,Infiltrating Ductal Carcinoma,Positive,Positive,Negative,Other,2018-11-16,2020-07-28,Alive,0,1.4814,-1.556695,-0.90852,-0.472012,0.198508,0.66522
4,FEMALE,2,Infiltrating Ductal Carcinoma,Positive,Positive,Positive,Lumpectomy,2018-12-12,2019-01-05,Alive,1,-1.307174,0.462757,0.874831,-0.767704,-0.64201,-1.239351


Since I want to use both scaled and unscaled data in my modeling steps, I'm keeping the scaled data separate from the unscaled data. In the encoding stage I will make two different dataframes: one containing scaled data with encoded features, and another containing unscaled data with encoded features.

### Encoding 

One of the models implemented will be a logistic regression model that looks for a relationship between protein levels and the cancer stage. Since binary logistic regression requires a binary target variable, stage 1 tumors will be encoded as 0 (early-stage cancer), while stages 2 and 3 will be encoded as 1 (advanced tumors).

In [36]:
bc_data_scaled['Tumor_advanced'] = ((bc_data_scaled['Tumour_Stage'] == 2) | (bc_data_scaled['Tumour_Stage'] == 3)).astype(int)
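
The same binary encoding can also be written with `Series.isin`, which may read more easily if further stages ever need to be included. This is only a sketch on a hypothetical toy frame that reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical toy frame with the same column name as the notebook
toy = pd.DataFrame({'Tumour_Stage': [1, 2, 3, 1]})

# Stages 2 and 3 -> 1 (advanced); stage 1 -> 0
toy['Tumor_advanced'] = toy['Tumour_Stage'].isin([2, 3]).astype(int)

print(toy['Tumor_advanced'].tolist())
```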

In [37]:
bc_data_scaled_encode = pd.get_dummies(bc_data_scaled)

In [38]:
bc_data_scaled_encode.head()

Unnamed: 0,Tumour_Stage,Tumor_advanced,Age,Protein1,Protein2,Protein3,Protein4,Survival_Time_Months,Gender_FEMALE,Gender_MALE,...,Date_of_Last_Visit_2021-03-24,Date_of_Last_Visit_2021-05-04,Date_of_Last_Visit_2021-11-15,Date_of_Last_Visit_2022-03-14,Date_of_Last_Visit_2022-03-15,Date_of_Last_Visit_2022-05-19,Date_of_Last_Visit_2022-06-26,Date_of_Last_Visit_2022-11-04,Patient_Status_Alive,Patient_Status_Dead
0,2,1,-1.307174,1.79365,1.316245,0.172469,-0.091956,-1.002878,1,0,...,0,0,0,0,0,0,0,0,1,0
1,2,1,-0.37765,0.048512,0.468651,-0.68945,-0.830435,-0.440454,1,0,...,0,0,0,0,0,0,0,0,0,1
2,2,1,0.319494,-0.909706,0.891237,-0.471689,0.003221,0.578939,1,0,...,0,0,0,0,0,0,0,0,1,0
3,1,0,1.4814,-1.556695,-0.90852,-0.472012,0.198508,0.66522,1,0,...,0,0,0,0,0,0,0,0,1,0
4,2,1,-1.307174,0.462757,0.874831,-0.767704,-0.64201,-1.239351,1,0,...,0,0,0,0,0,0,0,0,1,0


In [41]:
# Drop the one-hot columns generated from the date fields
date_prefix = 'Date_of'
columns_to_keep = [col for col in bc_data_scaled_encode.columns if not col.startswith(date_prefix)]
bc_data_scaled_encode = bc_data_scaled_encode[columns_to_keep]

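Next, I remove columns that carry redundant information, including one of each pair of dummy columns generated from a binary feature (e.g. Patient_Status_Dead, which is the complement of Patient_Status_Alive).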

In [43]:
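# drop() returns a new DataFrame, so assign the result back to keep the change
bc_data_scaled_encode = bc_data_scaled_encode.drop(['ER status_Positive','PR status_Positive','HER2 status_Negative','Patient_Status_Dead','Gender_MALE'], axis=1)
bc_data_scaled_encode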

Unnamed: 0,Age,Protein1,Protein2,Protein3,Protein4,Tumour_Stage,Survival_Time_Months,Tumor_advanced,Gender_FEMALE,Histology_Infiltrating Ductal Carcinoma,Histology_Infiltrating Lobular Carcinoma,Histology_Mucinous Carcinoma,HER2 status_Positive,Surgery_type_Lumpectomy,Surgery_type_Modified Radical Mastectomy,Surgery_type_Other,Surgery_type_Simple Mastectomy,Patient_Status_Alive
0,-1.307174,1.793650,1.316245,0.172469,-0.091956,2,-1.002878,1,1,1,0,0,0,0,0,1,0,1
1,-0.377650,0.048512,0.468651,-0.689450,-0.830435,2,-0.440454,1,1,1,0,0,0,0,0,1,0,0
2,0.319494,-0.909706,0.891237,-0.471689,0.003221,2,0.578939,1,1,1,0,0,0,1,0,0,0,1
3,1.481400,-1.556695,-0.908520,-0.472012,0.198508,1,0.665220,0,1,1,0,0,0,0,0,1,0,1
4,-1.307174,0.462757,0.874831,-0.767704,-0.642010,2,-1.239351,1,1,1,0,0,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
316,0.009652,0.093576,0.491002,0.201050,0.436843,2,0.080427,1,1,1,0,0,1,1,0,0,0,1
317,-1.384634,0.231937,-1.563540,0.963521,-0.857063,1,-0.446845,0,1,1,0,0,1,0,1,0,0,1
318,-0.377650,1.429548,0.757458,-0.408084,1.366080,2,-0.680123,1,1,1,0,0,0,0,0,0,1,0
319,1.171558,1.830199,0.519960,-0.465522,-0.187657,2,-1.287285,1,1,0,1,0,0,1,0,0,0,1
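
The cells above encode every object column first and then remove the date dummies and the complementary dummies by hand. One possible alternative, shown here only as a sketch on a hypothetical toy frame, is to drop the date columns before encoding and let `pd.get_dummies` remove one level per feature with `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    'Gender': ['FEMALE', 'MALE', 'FEMALE'],
    'Surgery_type': ['Other', 'Lumpectomy', 'Other'],
    'Date_of_Surgery': ['2018-05-20', '2018-04-26', '2018-08-24'],
    'Protein1': [1.79, 0.05, -0.91],
})

# Drop the date columns first so they never become one dummy column per date
date_cols = [c for c in toy.columns if c.startswith('Date_of')]
toy = toy.drop(columns=date_cols)

# drop_first=True keeps k-1 dummy columns for a feature with k levels
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Note that `drop_first=True` removes the first level in sorted order, so the columns it keeps (for example `Gender_MALE` rather than `Gender_FEMALE`) can differ from the manual choice made in this notebook.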


### Training & Testing Splits

The main goal of this project is to determine the prognosis (survival time) based on protein levels and/or the other factors provided in the dataframe. The first train/test split is built around this question, with X being all other features and y being the survival time.

In [45]:
X1 = bc_data_scaled_encode.drop('Survival_Time_Months', axis=1)  
y1 = bc_data_scaled_encode['Survival_Time_Months']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)

The second train/test split is for the logistic regression model, with X being the protein levels and y being the `Tumor_advanced` variable.

In [47]:
X2 = bc_data_scaled_encode[['Protein1','Protein2','Protein3','Protein4']]  
y2 = bc_data_scaled_encode['Tumor_advanced']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)
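
The splits above are taken from data that was standardized using all rows. A common variation, which is not what this notebook does, is to split first and fit the scaler on the training rows only, so that no statistics from the test set enter the transform. The sketch below assumes hypothetical, randomly generated unscaled data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled toy data; names mirror the notebook, values are random
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(50, 10, size=(20, 2)), columns=['Protein1', 'Protein2'])
y = pd.Series(rng.integers(0, 2, size=20), name='Tumor_advanced')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training columns have mean zero and unit variance, while the test columns are only approximately standardized, which is expected.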

### Saving Data

In [48]:
datapath = '/Users/sharanaravindh/Desktop/springboard/Github repository/Capstone-Project-1-Breast-Cancer-Prognosis/Data'
save_file(bc_data_scaled_encode,'breast_cancer_preprocessed.csv',datapath)

Writing file.  "/Users/sharanaravindh/Desktop/springboard/Github repository/Capstone-Project-1-Breast-Cancer-Prognosis/Data/breast_cancer_preprocessed.csv"
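
`save_file` comes from the local `library1.sb_utils` module, whose implementation is not shown here. If that helper is unavailable, a plain `DataFrame.to_csv` call is one possible stand-in. The sketch below assumes a hypothetical two-row DataFrame and writes to a temporary directory rather than to `datapath`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the preprocessed DataFrame
df = pd.DataFrame({'Age': [-1.31, -0.38], 'Tumor_advanced': [1, 1]})

# A temporary directory is used here; in the notebook this would be `datapath`
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'breast_cancer_preprocessed.csv')

# index=False avoids writing the row index as an extra unnamed column
df.to_csv(out_path, index=False)

# Read the file back to confirm the columns round-trip unchanged
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```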
