# Thoracic Surgery Patient Survival

The goal is to determine whether the post-operative life expectancy of lung cancer patients can be predicted from the patient attributes in the data set.

If a pattern could be identified between patient characteristics and failure to survive the first year after surgery, doctors and patients could more easily decide whether to go ahead with the operation. When doctors believe that surgery will merely lower the patient's quality of life and carries a known high risk of death within a year, both parties may instead choose alternative treatments or palliative care.


## 1 Setup
Import modules


In [1]:
# import numpy, pandas, and the scikit-learn utilities used below
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer

## 2 Set Random Seed

It's *very* important that you set this! In this course we will use the random seed value of 1.

In [2]:
# set random seed to ensure that results are repeatable
np.random.seed(1)
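
As a minimal sketch of why the seed matters: when `train_test_split` is called without `random_state`, scikit-learn draws from NumPy's global generator, so reseeding before the call reproduces the same split. The array `data` and the variable names below are illustrative and not part of the notebook. Passing `random_state` directly is an alternative that does not depend on the global state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy stand-in for the rows of a data set

# Seeding the global NumPy generator before each call gives the same split,
# because train_test_split falls back to it when random_state is None.
np.random.seed(1)
train_a, test_a = train_test_split(data, test_size=0.3)

np.random.seed(1)
train_b, test_b = train_test_split(data, test_size=0.3)

same_split = np.array_equal(test_a, test_b)

# Alternative: pass random_state so the split does not depend on
# whatever has consumed random numbers earlier in the session.
train_c, test_c = train_test_split(data, test_size=0.3, random_state=1)
train_d, test_d = train_test_split(data, test_size=0.3, random_state=1)

same_split_explicit = np.array_equal(test_c, test_d)

print(same_split, same_split_explicit)
```

Note that the global seed only guarantees repeatability if the cells run in the same order every time; any cell that consumes random numbers before the split changes the result.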

## 3 Load data

In [3]:
# Uncomment the following lines to debug problems with finding the .csv file path.
# They print the current working directory.
#import os
#print(os.getcwd())

In [4]:
thoracic_surgery=pd.read_csv("thoracicsurgery.csv") # let's use the surgery data for all the models
thoracic_surgery.head(5) 

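| | id | DGN | FVC | Volume_Ex | Performance | Pain_BS | Haemoptysis_BS | Dyspnoea_BS | Cough_BS | Weakness_BS | Tumour_Size | Diabetes_T2 | HA_6 | PAD | Smoking | Asthma | AGE | Risk1Yr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DGN2 | 2.88 | 2.16 | PRZ1 | F | F | F | T | T | OC14 | F | F | F | T | F | 60 | F |
| 1 | 2 | DGN3 | 3.4 | 1.88 | PRZ0 | F | F | F | F | F | OC12 | F | F | F | T | F | 51 | F |
| 2 | 3 | DGN3 | 2.76 | 2.08 | PRZ1 | F | F | F | T | F | OC11 | F | F | F | T | F | 59 | F |
| 3 | 4 | DGN3 | 3.68 | 3.04 | PRZ0 | F | F | F | F | F | OC11 | F | F | F | F | F | 54 | F |
| 4 | 5 | DGN3 | 2.44 | 0.96 | PRZ2 | F | T | F | T | T | OC11 | F | F | F | T | F | 73 | T |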


In [5]:
thoracic_surgery.describe()

| | id | FVC | Volume_Ex | AGE |
|---|---|---|---|---|
| count | 470.0 | 470.0 | 470.0 | 470.0 |
| mean | 235.5 | 3.281638 | 4.568702 | 62.534043 |
| std | 135.821574 | 0.871395 | 11.767857 | 8.706902 |
| min | 1.0 | 1.44 | 0.96 | 21.0 |
| 25% | 118.25 | 2.6 | 1.96 | 57.0 |
| 50% | 235.5 | 3.16 | 2.4 | 62.0 |
| 75% | 352.75 | 3.8075 | 3.08 | 69.0 |
| max | 470.0 | 6.3 | 86.3 | 87.0 |


In [6]:
thoracic_surgery.columns

Index(['id', 'DGN', 'FVC', 'Volume_Ex', 'Performance', 'Pain_BS',
       'Haemoptysis_BS', 'Dyspnoea_BS', 'Cough_BS', 'Weakness_BS',
       'Tumour_Size', 'Diabetes_T2', 'HA_6', 'PAD', 'Smoking', 'Asthma', 'AGE',
       'Risk1Yr'],
      dtype='object')

In [7]:
thoracic_surgery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470 entries, 0 to 469
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              470 non-null    int64  
 1   DGN             470 non-null    object 
 2   FVC             470 non-null    float64
 3   Volume_Ex       470 non-null    float64
 4   Performance     470 non-null    object 
 5   Pain_BS         470 non-null    object 
 6   Haemoptysis_BS  470 non-null    object 
 7   Dyspnoea_BS     470 non-null    object 
 8   Cough_BS        470 non-null    object 
 9   Weakness_BS     470 non-null    object 
 10  Tumour_Size     470 non-null    object 
 11  Diabetes_T2     470 non-null    object 
 12  HA_6            470 non-null    object 
 13  PAD             470 non-null    object 
 14  Smoking         470 non-null    object 
 15  Asthma          470 non-null    object 
 16  AGE             470 non-null    int64  
 17  Risk1Yr         470 non-null    object 

In [8]:
# check for missing values
thoracic_surgery.isnull().sum()

id                0
DGN               0
FVC               0
Volume_Ex         0
Performance       0
Pain_BS           0
Haemoptysis_BS    0
Dyspnoea_BS       0
Cough_BS          0
Weakness_BS       0
Tumour_Size       0
Diabetes_T2       0
HA_6              0
PAD               0
Smoking           0
Asthma            0
AGE               0
Risk1Yr           0
dtype: int64
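
Every count is zero, so no imputation is needed for this data set. For reference, had a numeric column contained gaps, the `SimpleImputer` imported in the setup cell could fill them. The sketch below uses made-up values and assumes a median strategy; neither comes from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap per column (not taken from the surgery data)
toy = pd.DataFrame({
    "FVC": [2.88, np.nan, 2.76, 3.68],
    "AGE": [60, 51, np.nan, 54],
})

missing_before = toy.isnull().sum()

# Replace each missing value with the median of its column (assumed strategy)
imputer = SimpleImputer(strategy="median")
toy[["FVC", "AGE"]] = imputer.fit_transform(toy[["FVC", "AGE"]])

missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
```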

In [9]:
# Label-encode the object-type columns as integers so that they can be used in numeric computations

In [10]:
labelencoder = LabelEncoder() 
for col in ['DGN','Performance','Pain_BS','Haemoptysis_BS','Dyspnoea_BS','Cough_BS','Weakness_BS','Tumour_Size','Diabetes_T2','HA_6','PAD','Smoking','Asthma','Risk1Yr']:
    thoracic_surgery[col]=labelencoder.fit_transform(thoracic_surgery[col])
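
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. This is a minimal sketch on a toy frame whose values merely imitate the `Tumour_Size` and `Risk1Yr` columns; the `mappings` dictionary is an illustrative helper, not part of the notebook. `LabelEncoder` sorts the labels, so the codes follow alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values in the style of the Tumour_Size and Risk1Yr columns
toy = pd.DataFrame({
    "Tumour_Size": ["OC14", "OC12", "OC11", "OC11"],
    "Risk1Yr": ["F", "F", "T", "F"],
})

mappings = {}
for col in toy.columns:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    # classes_ is sorted; the position of each label is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)
print(toy)
```

Because the codes depend on which labels are present, the toy mapping for `OC14` differs from the one in the full data set, where `OC13` also occurs.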

In [11]:
thoracic_surgery.dtypes

id                  int64
DGN                 int32
FVC               float64
Volume_Ex         float64
Performance         int32
Pain_BS             int32
Haemoptysis_BS      int32
Dyspnoea_BS         int32
Cough_BS            int32
Weakness_BS         int32
Tumour_Size         int32
Diabetes_T2         int32
HA_6                int32
PAD                 int32
Smoking             int32
Asthma              int32
AGE                 int64
Risk1Yr             int32
dtype: object

In [13]:
thoracic_surgery.head(5)

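| | id | DGN | FVC | Volume_Ex | Performance | Pain_BS | Haemoptysis_BS | Dyspnoea_BS | Cough_BS | Weakness_BS | Tumour_Size | Diabetes_T2 | HA_6 | PAD | Smoking | Asthma | AGE | Risk1Yr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2.88 | 2.16 | 1 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 0 | 60 | 0 |
| 1 | 2 | 2 | 3.4 | 1.88 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 51 | 0 |
| 2 | 3 | 2 | 2.76 | 2.08 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 59 | 0 |
| 3 | 4 | 2 | 3.68 | 3.04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 54 | 0 |
| 4 | 5 | 2 | 2.44 | 0.96 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 73 | 1 |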


## 4 Split the data into training and test sets

In [14]:
# split the data into training (70%) and test (30%) sets
train_df, test_df = train_test_split(thoracic_surgery, test_size=0.3)

# to reduce repetition in later code, create variables to represent the columns
# that are the predictors and the target
target = 'Risk1Yr'
predictors = list(thoracic_surgery.columns)
predictors.remove(target)
predictors.remove('id') # 'id' is a row identifier, not a patient attribute
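
As a quick sanity check on the split, the sketch below applies the same `test_size=0.3` to a toy frame with 470 rows, the size of the surgery data. The frame contents and the explicit `random_state=1` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same number of rows as the surgery data (470)
toy = pd.DataFrame({
    "id": np.arange(1, 471),
    "Risk1Yr": np.zeros(470, dtype=int),
})

train_toy, test_toy = train_test_split(toy, test_size=0.3, random_state=1)

# Roughly 70% of the rows go to training and 30% to testing
print(len(train_toy), len(test_toy))

# No row ends up in both sets
overlap = set(train_toy["id"]) & set(test_toy["id"])
print(len(overlap))
```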

## 5 Standardize the predictors

In [15]:
# create a standard scaler; it is fit on the training predictors only
scaler = preprocessing.StandardScaler()
cols_to_stdize = predictors

# work on copies so the assignments below cannot trigger pandas' SettingWithCopyWarning
train_df = train_df.copy()
test_df = test_df.copy()

# fit the scaler on the training predictors and transform them
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])

# transform the test predictors with the mean and standard deviation learned from the training set
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
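
The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the scaler should learn its mean and standard deviation from the training data alone. A minimal sketch on made-up numbers (the toy frames and variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up predictor values (not taken from the surgery data)
train_toy = pd.DataFrame({"FVC": [2.0, 3.0, 4.0], "AGE": [50.0, 60.0, 70.0]})
test_toy = pd.DataFrame({"FVC": [3.0, 5.0], "AGE": [60.0, 80.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean and std from the training rows
test_scaled = scaler.transform(test_toy)        # reuses the training mean and std

# After scaling, each training column has mean 0 and standard deviation 1
train_means = train_scaled.mean(axis=0)
train_stds = train_scaled.std(axis=0)

print(train_means, train_stds)
print(test_scaled)
```

The test columns are generally not centered at exactly zero after scaling, which is expected: they are expressed relative to the training distribution.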

## 6 Save the data

In [16]:
train_X = train_df[predictors]
train_y = train_df[target] # train_y is a Series
test_X = test_df[predictors]
test_y = test_df[target] # test_y is a Series

train_df.to_csv('thoracic_train_df_risk.csv', index=False)
train_X.to_csv('thoracic_train_X_risk.csv', index=False)
train_y.to_csv('thoracic_train_y_risk.csv', index=False)
test_df.to_csv('thoracic_test_df_risk.csv', index=False)
test_X.to_csv('thoracic_test_X_risk.csv', index=False)
test_y.to_csv('thoracic_test_y_risk.csv', index=False)
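
When these files are read back later, note that a Series saved with `to_csv` is loaded by `pd.read_csv` as a one-column DataFrame. The sketch below shows one way to restore it with `squeeze("columns")`. It writes toy data to a temporary directory, and the file names are hypothetical, so it does not touch the files saved above.

```python
import os
import tempfile

import pandas as pd

# Toy predictors and target (not the real surgery data)
toy_X = pd.DataFrame({"FVC": [2.88, 3.4], "AGE": [60, 51]})
toy_y = pd.Series([0, 1], name="Risk1Yr")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "toy_X.csv")  # hypothetical file name
    y_path = os.path.join(tmp_dir, "toy_y.csv")  # hypothetical file name

    toy_X.to_csv(x_path, index=False)
    toy_y.to_csv(y_path, index=False)

    loaded_X = pd.read_csv(x_path)
    # read_csv returns a one-column DataFrame; squeeze turns it back into a Series
    loaded_y = pd.read_csv(y_path).squeeze("columns")

print(type(loaded_y).__name__)
print(loaded_X.columns.tolist())
```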