<h3> Predicting whether a patient will survive the first year after lung cancer surgery - using machine learning. </h3>
<br>

  <ul>
    <li>Imports</li>
    <li>Load the dataset and Exploration</li>
    <li>Preparing data before modeling</li>
    <li>Modeling</li>
    <li>SVM</li>
    <li>Random Forest Classifier</li>
    <li>Cross Validation</li>
    <li>Summary</li>
  </ul>
<br><br>

<h3>1. Problem Definition</h3>
<br>

  <p>In this project, the problem we will investigate is binary classification.
  We will use a number of different features (information) about patients to predict whether they will survive the first year after surgery.</p>
  <br>
  <blockquote>The data is dedicated to classification problem related to the post-operative life expectancy in the lung cancer patients after thoracic surgery in which there are two classes class 1 - the death of patients within one year after surgery and class 2 – the patients who survive.</blockquote>

<br><br>

<h3>2. Data</h3>
<br>

  <p>The original data came from <a href="https://www.kaggle.com/sid321axn/thoraric-surgery">Kaggle</a>.
  The database contains 18 attributes (features), but only 16 of them will be used here. I will also add one feature myself.</p>
  <br>

<h5>Data Dictionary</h5>
<br>

  <p>The following are the features we'll use to predict our target variable (1 year survival period).</p>
  <br>

  <ol>
    <li>ID</li><br>
    <li>DGN: Diagnosis - specific combination of ICD-10 codes for primary and secondary as well multiple tumours if any (DGN3,DGN2,DGN4,DGN6,DGN5,DGN8,DGN1)</li><br>
    <li>PRE4: Forced vital capacity - FVC (numeric)</li><br>
    <li>PRE5: Volume that has been exhaled at the end of the first second of forced expiration - FEV1 (numeric)</li><br>
    <li>PRE6: Performance status - Zubrod scale (PRZ2,PRZ1,PRZ0)</li><br>
    <li>PRE7: Pain before surgery (T,F)</li><br>
    <li>PRE8: Haemoptysis before surgery (T,F)</li><br>
    <li>PRE9: Dyspnoea before surgery (T,F)</li><br>
    <li>PRE10: Cough before surgery (T,F)</li><br>
    <li>PRE11: Weakness before surgery (T,F)</li><br>
    <li>PRE14: T in clinical TNM - size of the original tumour, from OC11 (smallest) to OC14 (largest) (OC11,OC14,OC12,OC13)</li><br>
    <li>PRE17: Type 2 DM - diabetes mellitus (T,F)</li><br>
    <li>PRE19: MI up to 6 months (T,F)</li><br>
    <li>PRE25: PAD - peripheral arterial diseases (T,F)</li><br>
    <li>PRE30: Smoking (T,F)</li><br>
    <li>PRE32: Asthma (T,F)</li><br>
    <li>AGE: Age at surgery (numeric)</li><br>
    <li>Risk1Yr: 1 year survival period - (T)rue value if died (T,F)</li><br><br>
    <li>RATIO = PRE5/PRE4</li><br>
  </ol>

<h5>Imports</h5>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV # for tuning models
from sklearn.model_selection import train_test_split #Split arrays or matrices into random train and test subsets
from sklearn.model_selection import cross_val_score #Evaluate a score by cross-validation

from sklearn.metrics import confusion_matrix # Compute confusion matrix to evaluate the accuracy of a classification
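from sklearn.metrics import ConfusionMatrixDisplay # Plot Confusion Matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; alias its replacement so the calls below keep working
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator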
from sklearn.metrics import classification_report # Build a text report showing the main classification metrics

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from imblearn.over_sampling import SMOTE # provides a set of method to perform over-sampling

# We want our plots to appear in the notebook
%matplotlib inline

<h5>Load and Explore the Data</h5>

In [None]:
df = pd.read_csv('../input/thoraric-surgery/ThoraricSurgery.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Checking if we have 0 value in float dtypes
df[(df['PRE4'] == 0) | (df['PRE5'] == 0)]

<p>Most of the features are non-numeric. They are T or F, meaning true or false. We will change all of them to 1 for T and 0 for F.</p>
<p>Applies to: PRE7, PRE8, PRE9, PRE10, PRE11, PRE17, PRE19, PRE25, PRE30, PRE32, Risk1Yr</p>

In [None]:
# Data is small so I do copy.
# df2 will be my new data on which I will make changes
df2 = df.copy()

In [None]:
# I change all 'T' to 1 and 'F' to 0 in df2 using the lambda function
df2[['PRE7', 'PRE8', 'PRE9', 'PRE10', 'PRE11', 'PRE17', 'PRE19', 'PRE25', 
     'PRE30', 'PRE32', 'Risk1Yr']] = df2[['PRE7', 'PRE8', 'PRE9', 'PRE10', 'PRE11', 
                                          'PRE17', 'PRE19', 'PRE25', 'PRE30', 'PRE32', 
                                          'Risk1Yr']].apply(lambda x: np.where(x == 'T', 1, 0)) 

In [None]:
# Checking
df2[['PRE7', 'PRE8', 'PRE9', 'PRE10', 'PRE11', 'PRE17', 'PRE19', 
     'PRE25', 'PRE30', 'PRE32', 'Risk1Yr']].head()

In [None]:
df2.head()

In [None]:
# this plot shows how many patients survived 1 year and how many did not after surgery
sns.set_style(style="whitegrid")
fig, ax = plt.subplots(figsize=(16,10))


sns.countplot(x="Risk1Yr", data=df2, hue = 'Risk1Yr', palette = 'rocket_r')

ax.set_title('Bar chart of Risk')
legend_labels, _= ax.get_legend_handles_labels()
ax.legend(legend_labels, ['Survive 1st year', 'Not survive 1st year'], bbox_to_anchor=(1,1))

<p>The data size is small and the data is not well balanced.</p><br>
<p>Anyway, we will check what can be deduced from them.</p>
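
<p>A minimal sketch of why the imbalance matters, using made-up labels rather than the real counts from this dataset: a model that always predicts the majority class can reach high accuracy while identifying none of the at-risk patients. The toy label list and the helper name <code>majority_baseline</code> are illustrative assumptions.</p>

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority_class, accuracy) for a model that always predicts the most common label."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Hypothetical toy labels: 0 = survived the first year, 1 = died within the first year
toy_labels = [0] * 17 + [1] * 3

majority_class, baseline_accuracy = majority_baseline(toy_labels)

# Class 1 is never predicted, so none of the class 1 patients are found
recall_class_1 = 0 / toy_labels.count(1)

print(majority_class, baseline_accuracy, recall_class_1)
```

<p>This is why accuracy alone is a weak yardstick on unbalanced data, and why precision, recall and F1 are also worth checking.</p>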

<p>DGN: Diagnosis - specific combination of ICD-10 codes for primary and secondary as well multiple tumours if any (DGN3,DGN2,DGN4,DGN6,DGN5,DGN8,DGN1)</p><br>

<p>We check which DGN code is the most common among the patients.</p>

In [None]:
df2['DGN'].value_counts()

In [None]:
# this plot shows count of a given DGN in all patients
sns.set_style(style="whitegrid")
fig, ax = plt.subplots(figsize=(16,10))
ax.set_title("Type of DGN in all patients")

sns.set(font_scale=2)
sns.countplot(x='DGN', data=df2)

<p>DGN3 is the most common code across all patients.</p>

In [None]:
# this plot shows count of a given DGN in patients who didnt survive 1st year
sns.set_style(style="whitegrid")
fig, ax = plt.subplots(figsize=(16,10))
sns.set(font_scale=2)
ax.set_title("Type of DGN in patients who didn't survive first year after surgery")
sns.countplot(x='DGN', data= df2[df2['Risk1Yr'] == 1])

<p>DGN3 was the most common code across all patients.</p>

<p>The trend is similar in patients who did not survive 1 year after surgery.</p>

<p>Ok, as you know smoking is bad for your health. Let's see if it had an effect on patients.</p>

In [None]:
# This shows how many patients survived 1 year being a smoker (0,1)
# This shows how many patients survived 1 year without being a smoker (0,0)
# This shows how many patients did not survive 1 year being a smoker (1,1)
# This shows how many patients did not survive 1 year without being a smoker (1,0)

fig, ax = plt.subplots(figsize=(15,7))

# group patients by 'Risk1Yr', 
# taking into account the number of patients who were smokers (1) and who were not smokers (0)
df2.groupby('Risk1Yr')['PRE30'].value_counts().plot(ax=ax, kind='bar', 
                                                    title = 'Bar chart of Risk by Smoking', colormap = 'coolwarm')
ax.set(xlabel = "(Risk1Yr, Smoker)")

<p>Focusing on patients who did not survive 1 year after surgery, you can see that 90% of them were smokers.</p>
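
<p>Raw counts are hard to compare between groups of different sizes, so it can help to look at the share of smokers within each outcome group. The sketch below does that with <code>pd.crosstab</code> on a small made-up frame; the values are hypothetical and only the column names follow the notebook.</p>

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "Risk1Yr": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "PRE30":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# normalize="index" divides each row by its total, so every row sums to 1
smoker_share = pd.crosstab(toy["Risk1Yr"], toy["PRE30"], normalize="index")
print(smoker_share)
```

<p>On the real data, the same call with <code>df2</code> in place of <code>toy</code> would show the smoker share for survivors next to the share for non-survivors, which is the comparison that matters here.</p>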

<p>Now let's look at the FVC (PRE4) parameter.</p><br>
<p>Forced vital capacity is the total amount of air that can be exhaled following a deep inhalation in an FVC test. The normal FVC range for an adult is between 3 liters and 5 liters.</p>

In [None]:
plt.figure(figsize=(10,6))
plt.hist(x=df2['PRE4'], bins = 10)

plt.xlabel('Exhaled Total Amount Of Air')
plt.ylabel('Number of patients')
plt.title('Histogram of Forced Vital Capacity')

plt.show()

In [None]:
# mean FVC
df2['PRE4'].mean()

In [None]:
# Checking the age of the youngest patient
# He was adult
df2['AGE'].min()

In [None]:
# mean AGE
df2['AGE'].mean()

<p>Most FVC values fall between 2.5 and 4.

The patients' mean FVC is 3.2, at the lower end of the normal range.</p>

<p>Next step</p><br>
<p>Checking FVC vs FEV1</p><br>

<p>The FEV1/FVC ratio is a measurement of the amount of air you can forcefully exhale from your lungs.</p>
<p>Decreased FVC With Proportional FEV1/FVC Ratio. <br> If your FVC is decreased but the ratio of FEV1/FVC is normal, this indicates a restrictive pattern. A normal ratio is 70% to 80% in adults, and 85% in children.</p>

<p><a href="https://www.verywellhealth.com/fev1fvc-ratio-of-fev1-to-fvc-spirometry-914783">SOURCE</a></p>
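
<p>As a small worked example of the ratio described above, the sketch below divides FEV1 by FVC and checks the result against the 70% to 80% adult range quoted in the text. The measurements and both helper names are hypothetical, and the range check is only an illustration, not a clinical rule.</p>

```python
def fev1_fvc_ratio(fev1_liters, fvc_liters):
    """Return FEV1 divided by FVC; both values are expected in liters."""
    if fvc_liters <= 0:
        raise ValueError("FVC must be positive")
    return fev1_liters / fvc_liters


def within_quoted_adult_range(ratio, low=0.70, high=0.80):
    """Check a ratio against the 70% to 80% adult range quoted in the text."""
    return low <= ratio <= high


# Hypothetical measurements: FEV1 = 2.4 L, FVC = 3.2 L
ratio = fev1_fvc_ratio(2.4, 3.2)
print(round(ratio, 2), within_quoted_adult_range(ratio))
```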

In [None]:
# Checking jointplot FVC vs FEV1 in all patients
sns.set_style(style="whitegrid")

g = sns.jointplot(x='PRE4',y='PRE5',data=df2, kind='scatter', height = 7)
g.set_axis_labels('FVC', 'FEV1', fontsize=22)
g.fig.suptitle("Plot of FVC vs FEV1")
g.fig.tight_layout()

In [None]:
# Scatter plots of FVC vs FEV1, one panel per Risk1Yr value
sns.set_style(style="whitegrid")

g = sns.FacetGrid(col='Risk1Yr',data=df2, height = 5)
g.map(plt.scatter, 'PRE4', 'PRE5')
g.set_axis_labels('FVC', 'FEV1')
g.fig.suptitle("Plot of FVC vs FEV1", fontsize=22)

In [None]:
# I add a new feature called 'RATIO' to the data frame. RATIO = FEV1 / FVC
df2['RATIO'] = df2['PRE5'] / df2['PRE4']

<p>In my opinion, the charts above show some anomalies in the data (specifically, 15 rows).

According to this <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5849379/">source</a>, the ratio should not be greater than ~120%, and the value of FEV1 (PRE5) should be roughly between 0.6 and 4.7 [L].</p>

In [None]:
# looking for those 15 rows
df2[df2['RATIO'] > 1.3]

In [None]:
# List of indexes I want to remove
my_l = list(df2[df2['RATIO'] > 1.3].index.values)

In [None]:
len(list(df2[df2['RATIO'] > 1.3].index.values))

In [None]:
len(df2)

In [None]:
df2 = df2.drop(my_l, axis=0)
len(df2)

In [None]:
# Checking after drop
df2[df2['RATIO'] > 1.3]

<p>Now I redraw the chart.</p>

In [None]:
# Scatter plots of FVC vs FEV1, one panel per Risk1Yr value,
# after dropping the 15 anomalous rows
sns.set_style(style="whitegrid")

g = sns.FacetGrid(col='Risk1Yr',data=df2, height = 5)
g.map(plt.scatter, 'PRE4', 'PRE5')
g.set_axis_labels('FVC', 'FEV1')
g.fig.suptitle("Plot of FVC vs FEV1", fontsize=22)

<p>Ok now we can check mean ratio.</p>
<p>If your FVC is decreased but the ratio of FEV1/FVC is normal, this indicates a restrictive pattern.</p>
<p>A normal ratio is 70% to 80% in adults, and 85% in children.</p>
<p> <a target="_blank" href="https://www.verywellhealth.com/fev1fvc-ratio-of-fev1-to-fvc-spirometry-914783">More HERE</a></p>

In [None]:
# FEV1 / FVC by Risk1Yr
plt.figure(figsize=(10,6))
sns.set_style("whitegrid")
sns.boxplot(x= 'Risk1Yr', y='RATIO',data=df2,palette='rainbow').set_title("FEV1/FVC Ratio by Risk1Yr")

plt.yticks(np.arange(0.4, 1.3, 0.1))

<p>Patients who did not survive 1 year after surgery had a lower median ratio (the line inside each box) than those who survived, but they were still within the normal 70% - 80% range.</p>

In [None]:
# FEV1 / FVC by Asthma
plt.figure(figsize=(10,6))
sns.set_style("whitegrid")
sns.boxplot(x= 'PRE32', y='RATIO',data=df2,palette='rainbow').set_title("FEV1/FVC Ratio by Asthma")

plt.xlabel('Asthma')
plt.yticks(np.arange(0.4, 1.3, 0.1))

In [None]:
df2['PRE32'].value_counts()

<p>The asthma patients have an FEV1/FVC ratio below 70%, but there were only two of them.</p>

<p>Next step</p>
<p>PRE14: T in clinical TNM - size of the original tumour, from OC11 (smallest) to OC14 (largest) (OC11,OC14,OC12,OC13)</p>

In [None]:
# this plot shows count of a given OC in all patients
sns.set_style(style="whitegrid")
fig, ax = plt.subplots(figsize=(16,10))
ax.set_title("Type of OC in all patients")

sns.set(font_scale=2)
sns.countplot(x= 'PRE14', data = df2)
ax.set_xlabel('OC type')

In [None]:
# this plot shows count of a given OC in patients who didnt survive 1st year
sns.set_style(style="whitegrid")
fig, ax = plt.subplots(figsize=(16,10))
sns.set(font_scale=2)
ax.set_title("Type of OC in patients who didn't survive first year after surgery")
sns.countplot(x= 'PRE14',  data =df2[df2['Risk1Yr'] == 1])
ax.set_xlabel('OC type')

<p>The data show that the majority of patients had a tumor classified as OC12 and OC11.</p>
<p>The trend is similar in patients who did not survive 1 year after surgery.</p>

<h5>Data preparation before modeling.</h5>

In [None]:
#### DGN, PRE6 and PRE14 are object dtypes, so I'll use get_dummies - one hot encoding
df2.info()

In [None]:
# first I remove ID column
df2 = df2.drop('id', axis = 1)

In [None]:
df2.head()

In [None]:
# using get_dummies
# new data frame: df3
df3 = pd.get_dummies(df2, drop_first = False, columns = ['DGN', 'PRE6', 'PRE14'])

In [None]:
# no object columns now
df3.info()

In [None]:
# Find the correlation between our independent variables
corr_matrix = df3.corr()
corr_matrix

In [None]:
# Let's make it look a little prettier
corr_matrix = df3.corr()
plt.figure(figsize=(25, 20))
sns.heatmap(corr_matrix, annot=False, linewidths=0.5, cmap="YlGn")

<h5>MODELING</h5>

In [None]:
df3.head()

In [None]:
X = df3.drop("Risk1Yr", axis=1)
y = df3['Risk1Yr']

In [None]:
# dataset is small and unbalanced: that's not good
y.value_counts()

In [None]:
# using SMOTE to take care of data
# MORE: https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)  # fit_sample was renamed to fit_resample in imbalanced-learn

In [None]:
# now the dataset is still small, but more balanced
y_sm.value_counts()
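
<p>SMOTE balances the classes by creating synthetic minority rows: it picks a minority row, picks one of its nearest minority neighbours, and places a new point somewhere on the line between them. The NumPy sketch below illustrates only that core idea on made-up points. It is not the imbalanced-learn implementation, and the function name <code>smote_like_samples</code> and its parameters are assumptions for illustration.</p>

```python
import numpy as np


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and their nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))      # random minority row
        j = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)


# Hypothetical minority-class points with two features
toy_minority = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [2.0, 2.0]]
new_rows = smote_like_samples(toy_minority, n_new=5)
print(new_rows.shape)
```

<p>Because every synthetic row lies between two existing minority rows, the new points stay inside the region already covered by the minority class.</p>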

In [None]:
# making a dictionary in which we include three models with some parameters pre-set.
model_params = {
    'random_forest' :{
        'model' : RandomForestClassifier(),
        'params' : {
            'n_estimators' : [1, 5, 10]
        }
    },
    
    'svm' : {
        'model' : SVC(gamma = 'auto'),
        'params' : {
            'C' : [0.1, 1, 10, 100],
            'kernel' : ["rbf", "linear"]
        }
    },
    
    'logistics_regression' : {
        'model' : LogisticRegression(solver = 'liblinear'),  # multi_class is deprecated in recent scikit-learn; 'auto' was the default
        'params' : {
            'C' : [0.1, 1, 10, 100]
        }
    }
}

In [None]:
# implement GridSearchCV for three models using a loop and a previously created dictionary
# in the created variable scores, we save best_score and best_params for each model
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5)
    clf.fit(X_sm, y_sm)
    scores.append({
        'model' : model_name,
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })

In [None]:
# making data frame
sc = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
sc

<p>Further, I focused on two models - RFC and SVM</p>

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.33, random_state = 42)

<h4>SVM Model</h4>

In [None]:
# implementing SVC model with best_params
clf_svm = SVC(C = 100, kernel = 'rbf')
clf_svm.fit(X_train, y_train)

In [None]:
# confusion matrix SVC
plot_confusion_matrix(clf_svm,
                     X_test,
                     y_test,
                     values_format = 'd',
                     display_labels=['Risk1Yr:0', 'Risk1Yr:1'])

In [None]:
# Now predict values for the testing data
predictions_svm = clf_svm.predict(X_test)

In [None]:
# Create a classification report for the model SVC
print(classification_report(y_test,predictions_svm))

<h4>Random Forest Classifier Model</h4>

In [None]:
# implementing RFC model with best_params
clf_rfc = RandomForestClassifier(n_estimators =  10)
clf_rfc.fit(X_train, y_train)

In [None]:
# confusion matrix RFC
plot_confusion_matrix(clf_rfc,
                     X_test,
                     y_test,
                     values_format = 'd',
                     display_labels=['Risk1Yr:0', 'Risk1Yr:1'])

In [None]:
# Now predict values for the testing data
predictions_rfc = clf_rfc.predict(X_test)

In [None]:
# Create a classification report for the RFC model 
print(classification_report(y_test,predictions_rfc))
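
<p>The numbers in a classification report follow directly from the four cells of the confusion matrix. The sketch below computes them by hand for the positive class; the counts are made up for illustration and are not results from the models above, and the helper name <code>binary_metrics</code> is an assumption.</p>

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: rows are true classes, columns are predicted classes
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

<p>Here recall answers "how many of the patients who died were flagged", while precision answers "how many of the flagged patients actually died".</p>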

<h4>Cross Validation SVM</h4>

In [None]:
# Cross-validated accuracy score
cv_acc_svm = cross_val_score(clf_svm,
                         X_sm,
                         y_sm,
                         cv=5, # 5-fold cross-validation
                         scoring="accuracy") # accuracy as scoring
cv_acc_svm

In [None]:
# accuracy mean
cv_acc_svm = np.mean(cv_acc_svm)
cv_acc_svm

In [None]:
# Cross-validated precision score

cv_precision_svm = np.mean(cross_val_score(clf_svm,
                                       X_sm,
                                       y_sm,
                                       cv=5, # 5-fold cross-validation
                                       scoring="precision")) # precision as scoring
# precision mean
cv_precision_svm

In [None]:
# Cross-validated recall score

cv_recall_svm = np.mean(cross_val_score(clf_svm,
                                    X_sm,
                                    y_sm,
                                    cv=5, # 5-fold cross-validation
                                    scoring="recall")) # recall as scoring
# recall mean
cv_recall_svm

In [None]:
# Cross-validated F1 score
cv_f1_svm = np.mean(cross_val_score(clf_svm,
                                X_sm,
                                y_sm,
                                cv=5, # 5-fold cross-validation
                                scoring="f1")) # f1 as scoring
# f1 score mean
cv_f1_svm

In [None]:
# Visualizing cross-validated metrics SVM Model
cv_metrics_svm = pd.DataFrame({"Accuracy": cv_acc_svm,
                            "Precision": cv_precision_svm,
                            "Recall": cv_recall_svm,
                            "F1": cv_f1_svm},
                          index=[0])
cv_metrics_svm.T.plot.bar(title="Cross-Validated Metrics SVM Model", legend=False);
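
<p>The four separate <code>cross_val_score</code> calls above refit the model once per metric. As a possible alternative, scikit-learn's <code>cross_validate</code> accepts a list of scorers and evaluates them all on the same folds. The sketch below runs on synthetic data from <code>make_classification</code> as a stand-in for <code>X_sm</code> and <code>y_sm</code>, so its scores say nothing about the surgery data.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for X_sm, y_sm (assumption: not the surgery data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

metric_names = ["accuracy", "precision", "recall", "f1"]

# One call fits the model once per fold and scores every metric on that fold
results = cross_validate(SVC(C=100, kernel="rbf"), X_demo, y_demo,
                         cv=5, scoring=metric_names)

# cross_validate stores each metric under the key "test_<name>"
cv_means = {name: results[f"test_{name}"].mean() for name in metric_names}
print(cv_means)
```

<p>The resulting dictionary can be passed to <code>pd.DataFrame(cv_means, index=[0])</code> to build the same kind of one-row frame that is plotted above.</p>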

<h4>Cross Validation RFC</h4>

In [None]:
# Cross-validated accuracy score
cv_acc_rfc = np.mean(cross_val_score(clf_rfc,
                         X_sm,
                         y_sm,
                         cv=5, # 5-fold cross-validation
                         scoring="accuracy")) # accuracy as scoring
# acc mean
cv_acc_rfc

In [None]:
# Cross-validated precision score

cv_precision_rfc = np.mean(cross_val_score(clf_rfc,
                                       X_sm,
                                       y_sm,
                                       cv=5, # 5-fold cross-validation
                                       scoring="precision")) # precision as scoring
# precision mean
cv_precision_rfc

In [None]:
# Cross-validated recall score

cv_recall_rfc = np.mean(cross_val_score(clf_rfc,
                                    X_sm,
                                    y_sm,
                                    cv=5, # 5-fold cross-validation
                                    scoring="recall")) # recall as scoring
# recall mean
cv_recall_rfc

In [None]:
# Cross-validated F1 score
cv_f1_rfc = np.mean(cross_val_score(clf_rfc,
                                X_sm,
                                y_sm,
                                cv=5, # 5-fold cross-validation
                                scoring="f1")) # f1 as scoring
# f1 score mean
cv_f1_rfc

In [None]:
# Visualizing cross-validated metrics RFC Model
cv_metrics_rfc = pd.DataFrame({"Accuracy": cv_acc_rfc,
                            "Precision": cv_precision_rfc,
                            "Recall": cv_recall_rfc,
                            "F1": cv_f1_rfc},
                          index=[0])
cv_metrics_rfc.T.plot.bar(title="Cross-Validated Metrics RFC Model", legend=False);

<h5>Summary</h5>

In [None]:
# concat two dataframes cv_metrics_rfc and cv_metrics_svm
cv_metrics_rfc['Key'] = 'RFC'
cv_metrics_svm['Key'] = 'SVM'
df_summary = pd.concat([cv_metrics_rfc,cv_metrics_svm],keys=['RFC','SVM'])

In [None]:
df_summary.plot.bar(x='Key', title='Random Forest Classifier vs Support Vector Machine', 
                    figsize=(15,10), fontsize=12).set_xlabel('Model Type')