# Predicting Thyroid Cancer Recurrence - Model Building

### Purpose of this notebook
This notebook explores the models used in this study, optimises feature selection and identifies the best-performing models.

## 1. Imports and Loading data

In [33]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn imports
from sklearn.model_selection import train_test_split

# imports for encoding
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer

# imports for metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc 

# imports for models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [34]:
file_path = "../data/raw/Thyroid_Diff.csv"
df = pd.read_csv(file_path)

## 2. Data Transformation

What transformations are required?
- Correct the spelling of the Hx Radiotherapy column (misspelled as Hx Radiothreapy in the raw data)
- Use the pipeline established in the EDA to transform categorical data into numerical form
- Establish whether the numerical field needs transforming, based on the types of models that will be used
- Transform the class variable into numerical form

In [35]:
df.head()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
0,27,F,No,No,No,Euthyroid,Single nodular goiter-left,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Indeterminate,No
1,34,F,No,Yes,No,Euthyroid,Multinodular goiter,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
2,30,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
3,62,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
4,62,F,No,No,No,Euthyroid,Multinodular goiter,No,Micropapillary,Multi-Focal,Low,T1a,N0,M0,I,Excellent,No


In [36]:
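# Note: rename returns a new DataFrame and the result is not assigned here,
# so df keeps the original "Hx Radiothreapy" column name used in later cells
df.rename(columns={"Hx Radiothreapy":"Hx Radiotherapy"})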

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiotherapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
0,27,F,No,No,No,Euthyroid,Single nodular goiter-left,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Indeterminate,No
1,34,F,No,Yes,No,Euthyroid,Multinodular goiter,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
2,30,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
3,62,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
4,62,F,No,No,No,Euthyroid,Multinodular goiter,No,Micropapillary,Multi-Focal,Low,T1a,N0,M0,I,Excellent,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378,72,M,Yes,Yes,Yes,Euthyroid,Single nodular goiter-right,Right,Papillary,Uni-Focal,High,T4b,N1b,M1,IVB,Biochemical Incomplete,Yes
379,81,M,Yes,No,Yes,Euthyroid,Multinodular goiter,Extensive,Papillary,Multi-Focal,High,T4b,N1b,M1,IVB,Structural Incomplete,Yes
380,72,M,Yes,Yes,No,Euthyroid,Multinodular goiter,Bilateral,Papillary,Multi-Focal,High,T4b,N1b,M1,IVB,Structural Incomplete,Yes
381,61,M,Yes,Yes,Yes,Clinical Hyperthyroidism,Multinodular goiter,Extensive,Hurthel cell,Multi-Focal,High,T4b,N1b,M0,IVA,Structural Incomplete,Yes
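
`DataFrame.rename` returns a new object by default, so the call above displays the corrected header without changing `df`; this is why `X.head()` further down still shows the original spelling. A minimal sketch on a toy frame (the `toy` and `renamed` names are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative only)
toy = pd.DataFrame({"Hx Radiothreapy": ["No", "Yes"], "Age": [27, 34]})

# rename returns a new DataFrame; the original frame is left untouched
renamed = toy.rename(columns={"Hx Radiothreapy": "Hx Radiotherapy"})

print(list(toy.columns))      # original spelling is still present
print(list(renamed.columns))  # corrected spelling
```

Assigning the result back (`df = df.rename(...)`) would persist the change, in which case any column lists defined later would need the corrected spelling as well.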


### Split features and target variables

In [37]:
X = df.drop(columns=['Recurred'])

In [38]:
X.head()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response
0,27,F,No,No,No,Euthyroid,Single nodular goiter-left,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Indeterminate
1,34,F,No,Yes,No,Euthyroid,Multinodular goiter,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent
2,30,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent
3,62,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent
4,62,F,No,No,No,Euthyroid,Multinodular goiter,No,Micropapillary,Multi-Focal,Low,T1a,N0,M0,I,Excellent


In [39]:
y = df['Recurred']

In [40]:
y.head()

0    No
1    No
2    No
3    No
4    No
Name: Recurred, dtype: object

### Feature Column transformations

In [41]:
ohe_columns = ['Gender', 'Smoking', 'Hx Smoking', 'Hx Radiothreapy', 'Thyroid Function', 'Physical Examination', 'Adenopathy', 'Pathology', 'Focality', 'Response']
oe_columns = ['Risk', 'T', 'N', 'M', 'Stage']
numeric_features = ['Age']


In [42]:
# Encoder objects
ohe = OneHotEncoder()
oe = OrdinalEncoder()
ss = StandardScaler()


In [43]:
# Combine the encoders and the scaler into one ColumnTransformer (fitted in the next cell)
preprocessor = ColumnTransformer(
    [
        ('OneHotEncoder', ohe, ohe_columns),
        ('OrdinalEncoder', oe, oe_columns),
        ('StandardScaler', ss, numeric_features)
    ]
)
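
By default `OrdinalEncoder` assigns integer codes in sorted (alphabetical) order of the observed categories, which may not match the clinical ordering of a field such as `Risk`. The sketch below uses a toy column to show how an explicit order could be passed through the `categories` parameter. The ordering Low < Intermediate < High and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column; the category values and their ordering are assumed for illustration
toy = pd.DataFrame({"Risk": ["Low", "High", "Intermediate", "Low"]})

# Default behaviour: categories are sorted, so High=0, Intermediate=1, Low=2
default_codes = OrdinalEncoder().fit_transform(toy)

# Explicit order: one list of categories per encoded column
risk_order = [["Low", "Intermediate", "High"]]
ordered_codes = OrdinalEncoder(categories=risk_order).fit_transform(toy)

print(default_codes.ravel())
print(ordered_codes.ravel())
```

Tree-based models are largely insensitive to this choice, but distance-based and linear models (K-Neighbors, SVM, Logistic Regression) treat the codes as magnitudes, so the order can affect their results.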

In [44]:
X = preprocessor.fit_transform(X)

In [45]:
X.shape

(383, 40)

### Converting categorical Target variable to numerical

In [46]:
# Create LabelEncoder Instance
le = LabelEncoder()

In [47]:
y = le.fit_transform(df['Recurred'])

In [48]:
y.shape

(383,)
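
`LabelEncoder` assigns codes in sorted order of the labels, so with the `No`/`Yes` values shown earlier, `Yes` should become class 1, the positive class that precision and recall are reported for. A quick sketch for checking the mapping (the toy `labels` list and the `le_demo` name are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy target values mirroring the No/Yes labels of the Recurred column
labels = ["No", "Yes", "No", "No", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the original labels in the order of their integer codes
mapping = {str(cls): int(code) for code, cls in enumerate(le_demo.classes_)}
print(mapping)
```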

### Train test split

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((306, 40), (77, 40))
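
In the cells above the preprocessor is fitted on all 383 rows before the split, so the scaler statistics and encoder categories are learned partly from rows that end up in the test set. One common alternative is to split first and fit only on the training rows. The sketch below uses toy data; the `toy` frame, its values and the `handle_unknown="ignore"` setting are assumptions for illustration, not the notebook's configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the real features and target (illustrative only)
toy = pd.DataFrame({
    "Age": [27, 34, 30, 62, 62, 72, 81, 61, 45, 52],
    "Gender": ["F", "F", "F", "F", "F", "M", "M", "M", "F", "M"],
})
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

# Split the raw rows first
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)

pre = ColumnTransformer([
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ("scale", StandardScaler(), ["Age"]),
])

X_tr_enc = pre.fit_transform(X_tr)  # statistics learned from training rows only
X_te_enc = pre.transform(X_te)      # test rows reuse those statistics

print(X_tr_enc.shape, X_te_enc.shape)
```

With a dataset of this size the effect on the reported metrics may be small, but fitting on the training rows only keeps the test set a fair stand-in for unseen patients.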

## 3. Initial Model Training and Evaluation



In [50]:
def evaluate_model(true, predicted):
    """Return confusion matrix, accuracy, precision, recall, F1 and ROC AUC for binary predictions."""
    cm = confusion_matrix(true, predicted)
    accuracy = accuracy_score(true, predicted)
    precision = precision_score(true, predicted)
    recall = recall_score(true, predicted)
    f1 = f1_score(true, predicted)
    # ROC is computed from hard class labels here, so the AUC reflects a single operating point
    fpr, tpr, thresholds = roc_curve(true, predicted)
    roc_auc = auc(fpr, tpr)
    return cm, accuracy, precision, recall, f1, roc_auc
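
Because `evaluate_model` receives the output of `predict`, its ROC curve is built from 0/1 labels rather than continuous scores. The sketch below contrasts that with an AUC computed from predicted probabilities. It uses synthetic data from `make_classification` and the `roc_auc_score` helper, neither of which appears in the notebook, so treat it as an illustration rather than the study's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: a single threshold, as in evaluate_model
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the positive-class probability: uses the full ranking of samples
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"labels: {auc_from_labels:.3f}  scores: {auc_from_scores:.3f}")
```

Note that `SVC()` only exposes `predict_proba` when constructed with `probability=True`; its `decision_function` output can be passed to `roc_auc_score` instead.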
    

In [51]:
# Models to be evaluated
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier(),
    "SVM Classifier": SVC(),
}


In [52]:
model_list = []
metrics_list = []

In [55]:
for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate Train and Test dataset
    model_train_cm, model_train_accuracy, model_train_precision, model_train_recall, model_train_f1, model_train_roc_auc = evaluate_model(y_train, y_train_pred)

    model_test_cm, model_test_accuracy, model_test_precision, model_test_recall, model_test_f1, model_test_roc_auc = evaluate_model(y_test, y_test_pred)

    
    print(name)
    model_list.append(name)
    # TODO: Fix issue causing model names to not appear correctly in final results table.
    print('Model performance for Training set')
    print(f"- Confusion Matrix: {model_train_cm}")
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print("- Precision: {:.4f}".format(model_train_precision))
    print("- Recall: {:.4f}".format(model_train_recall))
    print("- F1 Score: {:.4f}".format(model_train_f1))
    print("- ROC_AUC: {:.4f}".format(model_train_roc_auc))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print(f"- Confusion Matrix: {model_test_cm}")
    print("- Accuracy: {:.4f}".format(model_test_accuracy))
    print("- Precision: {:.4f}".format(model_test_precision))
    print("- Recall: {:.4f}".format(model_test_recall))
    print("- F1 Score: {:.4f}".format(model_test_f1))
    print("- ROC_AUC: {:.4f}".format(model_test_roc_auc))
    metrics_list.append([model_test_precision, model_test_recall, model_test_roc_auc])
    
    print('='*35)
    print('\n')

Logistic Regression
Model performance for Training set
- Confusion Matrix: [[215   2]
 [  8  81]]
- Accuracy: 0.9673
- Precision: 0.9759
- Recall: 0.9101
- F1 Score: 0.9419
- ROC_AUC: 0.9504
----------------------------------
Model performance for Test set
- Confusion Matrix: [[58  0]
 [ 2 17]]
- Accuracy: 0.9740
- Precision: 1.0000
- Recall: 0.8947
- F1 Score: 0.9444
- ROC_AUC: 0.9474


K-Neighbors Classifier
Model performance for Training set
- Confusion Matrix: [[213   4]
 [ 17  72]]
- Accuracy: 0.9314
- Precision: 0.9474
- Recall: 0.8090
- F1 Score: 0.8727
- ROC_AUC: 0.8953
----------------------------------
Model performance for Test set
- Confusion Matrix: [[57  1]
 [ 4 15]]
- Accuracy: 0.9351
- Precision: 0.9375
- Recall: 0.7895
- F1 Score: 0.8571
- ROC_AUC: 0.8861


Decision Tree
Model performance for Training set
- Confusion Matrix: [[217   0]
 [  0  89]]
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1 Score: 1.0000
- ROC_AUC: 1.0000
------------------------------

## 4. Initial Model Evaluation

In [59]:
pd.DataFrame(list(zip(model_list, metrics_list)), columns=['Model Name', 'mtp, mtr, mt_roc_auc']).sort_values(by=["mtp, mtr, mt_roc_auc"],ascending=False)

Unnamed: 0,Model Name,"mtp, mtr, mt_roc_auc"
3,K-Neighbors Classifier,"[1.0, 0.9473684210526315, 0.9736842105263157]"
0,Logistic Regression,"[1.0, 0.8947368421052632, 0.9473684210526316]"
4,Decision Tree,"[1.0, 0.8947368421052632, 0.9473684210526316]"
1,Logistic Regression,"[0.9375, 0.7894736842105263, 0.8861161524500907]"
2,Logistic Regression,"[0.9, 0.9473684210526315, 0.9564428312159708]"
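
In the table above, three rows are labelled Logistic Regression, and row 1 carries the test metrics printed earlier for K-Neighbors Classifier. That pattern is consistent with `model_list` holding extra entries from earlier partial runs, so that `zip` pairs names and metrics out of step. One way to avoid this is to keep each model's name and metrics in a single record and rebuild the list every time the loop runs. A minimal sketch follows; the model names and metric values are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical records for illustration only; in the notebook each record
# would be appended inside the training loop, after resetting the list
results = [
    {"Model Name": "Model A", "Precision": 0.90, "Recall": 0.80, "ROC_AUC": 0.85},
    {"Model Name": "Model B", "Precision": 0.95, "Recall": 0.90, "ROC_AUC": 0.93},
]

# One column per metric makes sorting by a single metric explicit
results_df = (
    pd.DataFrame(results)
    .sort_values(by="Recall", ascending=False)
    .reset_index(drop=True)
)
print(results_df)
```

Storing one metric per column also avoids sorting on a column of lists, which orders rows lexicographically by precision first.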


## Feature Selection

## 5. Hyperparameter Tuning

## 6. Impact of Over or Under Sampling


### Observations