## Predicting Thyroid Disease Using Machine Learning

We are going to take the following approach:
* Problem Definition
* Data
* Evaluation
* Features
* Modelling 
* Experimentation


# 1. Problem Definition

Given clinical parameters about a patient, can we predict whether or not the patient has thyroid disease?


# 2. Data

The datasets featured below were created by reconciling thyroid disease datasets provided by the UCI Machine Learning Repository.

# 3. Evaluation
If we can reach 95% accuracy at predicting whether or not a patient has thyroid disease, we will pursue the project.

# 4. Features
The size of the file in this Kaggle dataset is shown below, along with a list of its attributes and a short description of each:

thyroidDF.csv - 9172 observations x 31 attributes
* age - age of the patient (int)
* sex - sex the patient identifies as (str)
* on_thyroxine - whether patient is on thyroxine (bool)
* query_on_thyroxine - whether patient is on thyroxine (bool)
* on_antithyroid_meds - whether patient is on antithyroid meds (bool)
* sick - whether patient is sick (bool)
* pregnant - whether patient is pregnant (bool)
* thyroid_surgery - whether patient has undergone thyroid surgery (bool)
* I131_treatment - whether patient is undergoing I131 treatment (bool)
* query_hypothyroid - whether patient believes they have hypothyroid (bool)
* query_hyperthyroid - whether patient believes they have hyperthyroid (bool)
* lithium - whether patient takes lithium (bool)
* goitre - whether patient has goitre (bool)
* tumor - whether patient has tumor (bool)
* hypopituitary - whether patient has hypopituitarism, an underactive pituitary gland (bool)
* psych - whether patient * psych (bool)
* TSH_measured - whether TSH was measured in the blood (bool)
* TSH - TSH level in blood from lab work (float)
* T3_measured - whether T3 was measured in the blood (bool)
* T3 - T3 level in blood from lab work (float)
* TT4_measured - whether TT4 was measured in the blood (bool)
* TT4 - TT4 level in blood from lab work (float)
* T4U_measured - whether T4U was measured in the blood (bool)
* T4U - T4U level in blood from lab work (float)
* FTI_measured - whether FTI was measured in the blood (bool)
* FTI - FTI level in blood from lab work (float)
* TBG_measured - whether TBG was measured in the blood (bool)
* TBG - TBG level in blood from lab work (float)
* referral_source - (str)
* target - hyperthyroidism medical diagnosis (str)
* patient_id - unique id of the patient (str)
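
The attributes marked (bool) above are stored in the raw file as the strings `t` and `f`, as the data preview in this notebook shows. A minimal sketch of turning such flags into real booleans, assuming a small hand-made frame (`sample` and `flag_columns` are illustrative names and values, not part of the notebook):

```python
import pandas as pd

# Tiny stand-in for a few rows of thyroidDF.csv (values invented for illustration)
sample = pd.DataFrame({
    "age": [29, 41, 56],
    "on_thyroxine": ["f", "t", "f"],
    "pregnant": ["f", "f", "t"],
})

flag_columns = ["on_thyroxine", "pregnant"]

# Map the 't'/'f' strings to True/False; any other value would become NaN
for column in flag_columns:
    sample[column] = sample[column].map({"t": True, "f": False})

print(sample)
```

Keeping the flags as booleans makes them easier to read than integer category codes, although either form should work as model input.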

# Preparing the tools

We are going to use pandas, NumPy and matplotlib for analysing and manipulating the data, and scikit-learn for modelling.

In [66]:
# For Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Model from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, RocCurveDisplay

# Loading the data

In [2]:
df = pd.read_csv("thyroidDF.csv")
df

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_meds,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,target,patient_id
0,29,F,f,f,f,f,f,f,f,t,...,,f,,f,,f,,other,-,840801013
1,29,F,f,f,f,f,f,f,f,f,...,128.0,f,,f,,f,,other,-,840801014
2,41,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,11.0,other,-,840801042
3,36,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,26.0,other,-,840803046
4,32,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,36.0,other,S,840803047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9167,56,M,f,f,f,f,f,f,f,f,...,64.0,t,0.83,t,77.0,f,,SVI,-,870119022
9168,22,M,f,f,f,f,f,f,f,f,...,91.0,t,0.92,t,99.0,f,,SVI,-,870119023
9169,69,M,f,f,f,f,f,f,f,f,...,113.0,t,1.27,t,89.0,f,,SVI,I,870119025
9170,47,F,f,f,f,f,f,f,f,f,...,75.0,t,0.85,t,88.0,f,,other,-,870119027


In [3]:
df.shape

(9172, 31)

In [4]:
df.columns

Index(['age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_meds', 'sick', 'pregnant', 'thyroid_surgery',
       'I131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium',
       'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH_measured', 'TSH',
       'T3_measured', 'T3', 'TT4_measured', 'TT4', 'T4U_measured', 'T4U',
       'FTI_measured', 'FTI', 'TBG_measured', 'TBG', 'referral_source',
       'target', 'patient_id'],
      dtype='object')

In [5]:
df.dtypes

age                      int64
sex                     object
on_thyroxine            object
query_on_thyroxine      object
on_antithyroid_meds     object
sick                    object
pregnant                object
thyroid_surgery         object
I131_treatment          object
query_hypothyroid       object
query_hyperthyroid      object
lithium                 object
goitre                  object
tumor                   object
hypopituitary           object
psych                   object
TSH_measured            object
TSH                    float64
T3_measured             object
T3                     float64
TT4_measured            object
TT4                    float64
T4U_measured            object
T4U                    float64
FTI_measured            object
FTI                    float64
TBG_measured            object
TBG                    float64
referral_source         object
target                  object
patient_id               int64
dtype: object

In [6]:
df.isna().sum()

age                       0
sex                     307
on_thyroxine              0
query_on_thyroxine        0
on_antithyroid_meds       0
sick                      0
pregnant                  0
thyroid_surgery           0
I131_treatment            0
query_hypothyroid         0
query_hyperthyroid        0
lithium                   0
goitre                    0
tumor                     0
hypopituitary             0
psych                     0
TSH_measured              0
TSH                     842
T3_measured               0
T3                     2604
TT4_measured              0
TT4                     442
T4U_measured              0
T4U                     809
FTI_measured              0
FTI                     802
TBG_measured              0
TBG                    8823
referral_source           0
target                    0
patient_id                0
dtype: int64
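
To judge how badly each column is affected, the counts above can be turned into fractions of the 9172 rows. A small sketch using the numbers from the output (the dictionary is typed in by hand here; on the real frame, `df.isna().mean()` gives the same ratios directly):

```python
# Missing-value counts copied from the df.isna().sum() output above
missing_counts = {"sex": 307, "TSH": 842, "T3": 2604, "TT4": 442, "T4U": 809, "FTI": 802, "TBG": 8823}
n_rows = 9172

# Fraction of rows with a missing value, per column
missing_fraction = {col: count / n_rows for col, count in missing_counts.items()}

# Print the columns from most to least affected
for col, frac in sorted(missing_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:>4}: {frac:.1%}")
```

TBG is missing in roughly 96% of the rows, so imputing it would fill almost the whole column with a single value. Dropping that column may be worth trying.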

In [7]:
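def preprocessing(data):
    """
    Preprocess the data and return the transformed DataFrame.

    Numeric columns with missing values are filled with the column median,
    non-numeric columns (including the target) are replaced by integer
    category codes, and a boolean "<column>_is_missing" flag is added for each.
    """
    # Work on a copy so the caller's DataFrame is left unchanged
    data = data.copy()

    # Convert string columns to ordered categoricals
    for label,content in data.items():
        if pd.api.types.is_string_dtype(content):
            data[label] = content.astype('category').cat.as_ordered()

    # Flag missing numeric values, then fill them with the column median
    for label,content in data.items():
        if pd.api.types.is_numeric_dtype(content):
            if content.isna().values.any():
                data[label+"_is_missing"] = pd.isnull(content)
                data[label] = content.fillna(content.median())

    # Encode non-numeric columns as integers; +1 moves the missing-value code from -1 to 0
    for label,content in data.items():
        if not pd.api.types.is_numeric_dtype(content):
            data[label+"_is_missing"] = pd.isnull(content)
            data[label] = pd.Categorical(content).codes+1

    return data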
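
To see what the two main steps do, here is a sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration; the column names only mirror the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({"TSH": [1.0, None, 3.0], "sex": ["F", None, "M"]})

# Numeric column: flag the gap first, then fill it with the median of the observed values
toy["TSH_is_missing"] = toy["TSH"].isna()
toy["TSH"] = toy["TSH"].fillna(toy["TSH"].median())

# Non-numeric column: flag the gap, then encode the categories as integers.
# pd.Categorical gives missing values the code -1, so adding 1 maps them to 0.
toy["sex_is_missing"] = toy["sex"].isna()
toy["sex"] = pd.Categorical(toy["sex"]).codes + 1

print(toy)
```

The missing TSH value becomes 2.0 (the median of 1.0 and 3.0), and `sex` is encoded as F = 1, M = 2 and missing = 0.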

In [11]:
transformed_df = preprocessing(df)
transformed_df

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_meds,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,hypopituitary_is_missing,psych_is_missing,TSH_measured_is_missing,T3_measured_is_missing,TT4_measured_is_missing,T4U_measured_is_missing,FTI_measured_is_missing,TBG_measured_is_missing,referral_source_is_missing,target_is_missing
0,29,1,1,1,1,1,1,1,1,2,...,False,False,False,False,False,False,False,False,False,False
1,29,1,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
2,41,1,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
3,36,1,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
4,32,1,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9167,56,2,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
9168,22,2,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
9169,69,2,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False
9170,47,1,1,1,1,1,1,1,1,1,...,False,False,False,False,False,False,False,False,False,False


In [17]:
def model_selection(data):
    """
    Fit three baseline classifiers on one train/test split and return their test accuracy.
    """
    X = data.drop("target",axis=1)
    y = data.target
    # Use random_state for reproducibility; assigning to np.random.seed would overwrite the function instead of seeding
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

    # Logistic regression
    lgr_model = LogisticRegression(n_jobs=-1)
    lgr_model.fit(X_train,y_train)
    lgr_score = lgr_model.score(X_test,y_test)

    # K-nearest neighbours
    kn_model = KNeighborsClassifier(n_neighbors=10)
    kn_model.fit(X_train,y_train)
    kn_score = kn_model.score(X_test,y_test)

    # Random forest
    rf_model = RandomForestClassifier(n_estimators=100,random_state=42)
    rf_model.fit(X_train,y_train)
    rf_score = rf_model.score(X_test,y_test)

    scores = {"Logistic Regression":lgr_score,"KNeighbors Classifier":kn_score,"Random Forest Classifier":rf_score}
    return scores

In [19]:
scores = model_selection(transformed_df)
scores

{'Logistic Regression': 0.7346049046321526,
 'KNeighbors Classifier': 0.7340599455040872,
 'Random Forest Classifier': 0.9520435967302452}
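
Logistic regression and k-nearest neighbours are sensitive to feature scale, and the frame mixes columns on very different scales (`patient_id` is in the hundreds of millions, while `T4U` is around 1). Whether that explains the gap to the random forest has not been checked here. A minimal sketch of scaling inside a pipeline, using synthetic stand-in data rather than the thyroid frame:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is blown up to mimic an ID-like column
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)
X_demo[:, 0] = X_demo[:, 0] * 1e8

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, as part of the pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
scaled_knn.fit(Xd_train, yd_train)
scaled_score = scaled_knn.score(Xd_test, yd_test)
print(scaled_score)
```

Putting the scaler in the pipeline keeps test-set statistics out of training, which also matters once cross-validation is used.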

In [27]:
def evaluation(model):
    """Score a fitted model on the global X_test and y_test."""
    y_preds = model.predict(X_test)
    # With average='micro' on a single-label multiclass target, precision, recall and F1 all equal accuracy
    accuracy = accuracy_score(y_test,y_preds)
    precision = precision_score(y_test,y_preds,average='micro')
    recall = recall_score(y_test,y_preds,average='micro')
    f1 = f1_score(y_test,y_preds,average='micro')
    evaluation_scores = {"Accuracy Score":accuracy,"Precision Score":precision,"Recall Score":recall,"F1 Score":f1}
    return evaluation_scores

In [32]:
# Fix the split and the forest with random_state so the run is reproducible
X = transformed_df.drop("target",axis=1)
y = transformed_df.target
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
model = RandomForestClassifier(n_estimators=100,random_state=42)
model.fit(X_train,y_train)
print(model.score(X_test,y_test))
evaluation_scores = evaluation(model)

0.9422343324250682


In [34]:
evaluation_scores

{'Accuracy Score': 0.9422343324250682,
 'Precision Score': 0.9422343324250682,
 'Recall Score': 0.9422343324250682,
 'F1 Score': 0.9422343324250682}
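
All four numbers are identical because micro-averaging pools every prediction before computing the metric, which for a single-label multiclass target reduces to accuracy. Macro-averaging instead weights each class equally and can expose weak performance on rare classes. A small sketch on made-up labels (not drawn from the thyroid data):

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Imbalanced toy labels: class 0 dominates, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(accuracy, micro_f1, macro_recall)
```

Accuracy and micro F1 are both 5/7 here, while macro recall drops to 0.5 because class 2 is never predicted and class 1 is found only half of the time.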

## Tuning Hyperparameters

Tuning hyperparameters, first with RandomizedSearchCV and then with GridSearchCV over the same grid.

In [44]:
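# Note: max_features="auto" (same as "sqrt" for classifiers) was removed in scikit-learn 1.3; drop it on newer versions
grid = {"n_estimators":np.arange(10,100,10),"max_features":[0.5,"auto","sqrt"],"min_samples_split":np.arange(2,10,2),"min_samples_leaf":np.arange(2,10,2)}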

In [45]:
model = RandomForestClassifier(n_jobs=-1)
rs_clf = RandomizedSearchCV(model,grid,n_iter=10,cv=5,verbose=True)
rs_clf.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits




RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1),
                   param_distributions={'max_features': [0.5, 'auto', 'sqrt'],
                                        'min_samples_leaf': array([2, 4, 6, 8]),
                                        'min_samples_split': array([2, 4, 6, 8]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)

In [46]:
rs_clf.best_score_

0.9482088229885826

In [47]:
evaluation(rs_clf)

{'Accuracy Score': 0.9520435967302452,
 'Precision Score': 0.9520435967302452,
 'Recall Score': 0.9520435967302452,
 'F1 Score': 0.9520435967302452}

In [49]:
gs_clf = GridSearchCV(model,grid,cv=5,verbose=True)
gs_clf.fit(X_train,y_train)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits




GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1),
             param_grid={'max_features': [0.5, 'auto', 'sqrt'],
                         'min_samples_leaf': array([2, 4, 6, 8]),
                         'min_samples_split': array([2, 4, 6, 8]),
                         'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
             verbose=True)
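
The fit counts in the log follow directly from the grid: the exhaustive search tries every combination of the option lists, and each combination is fitted once per fold. A quick check of that arithmetic, with the list sizes read off the grid above:

```python
import math

# Number of options per hyperparameter in the grid above
grid_sizes = {"n_estimators": 9, "max_features": 3, "min_samples_split": 4, "min_samples_leaf": 4}
cv_folds = 5

n_candidates = math.prod(grid_sizes.values())
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)  # 432 2160
```

By comparison, the randomized search with `n_iter=10` sampled only 10 of these 432 combinations, which is why it needed 50 fits.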

In [50]:
gs_clf.best_score_

0.9548873583969953

In [51]:
evaluation(gs_clf)

{'Accuracy Score': 0.9564032697547684,
 'Precision Score': 0.9564032697547684,
 'Recall Score': 0.9564032697547684,
 'F1 Score': 0.9564032697547684}

In [68]:
gs_clf.best_params_

{'max_features': 0.5,
 'min_samples_leaf': 2,
 'min_samples_split': 4,
 'n_estimators': 50}
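
Rather than retyping these values, the dictionary can be unpacked into the estimator. A minimal sketch, with the parameters copied by hand from the output above (`best_params` and `sketch_model` are illustrative names; `random_state=42` is an added assumption for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

# Best parameters copied from the gs_clf.best_params_ output above
best_params = {"max_features": 0.5, "min_samples_leaf": 2, "min_samples_split": 4, "n_estimators": 50}

# Unpack the dictionary instead of retyping each value
sketch_model = RandomForestClassifier(**best_params, random_state=42)
print(sketch_model.get_params()["n_estimators"])
```

With the default `refit=True`, `gs_clf.best_estimator_` already holds a model with these parameters refitted on the whole training set, so it could be used directly as well.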

In [69]:
ideal_model = RandomForestClassifier(max_features = 0.5,
 min_samples_leaf = 2,
 min_samples_split = 4,
 n_estimators = 50)

In [70]:
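import joblib

# Fit the tuned model before saving it; dumping an unfitted estimator stores only its hyperparameters
ideal_model.fit(X_train,y_train)
joblib.dump(ideal_model,"thyroid_model.sav")


['thyroid_model.sav']

In [74]:
# Reload the fitted model and score it on the held-out test set
loaded_model = joblib.load("thyroid_model.sav")
loaded_model.score(X_test,y_test)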

0.9547683923705722

In [75]:
data = pd.read_csv("thyroidDF.csv")
data

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_meds,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,target,patient_id
0,29,F,f,f,f,f,f,f,f,t,...,,f,,f,,f,,other,-,840801013
1,29,F,f,f,f,f,f,f,f,f,...,128.0,f,,f,,f,,other,-,840801014
2,41,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,11.0,other,-,840801042
3,36,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,26.0,other,-,840803046
4,32,F,f,f,f,f,f,f,f,f,...,,f,,f,,t,36.0,other,S,840803047
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9167,56,M,f,f,f,f,f,f,f,f,...,64.0,t,0.83,t,77.0,f,,SVI,-,870119022
9168,22,M,f,f,f,f,f,f,f,f,...,91.0,t,0.92,t,99.0,f,,SVI,-,870119023
9169,69,M,f,f,f,f,f,f,f,f,...,113.0,t,1.27,t,89.0,f,,SVI,I,870119025
9170,47,F,f,f,f,f,f,f,f,f,...,75.0,t,0.85,t,88.0,f,,other,-,870119027


In [80]:
data.target.value_counts()

-      6771
K       436
G       359
I       346
F       233
R       196
A       147
L       115
M       111
N       110
S        85
GK       49
AK       46
J        30
B        21
MK       16
Q        14
O        14
C|I      12
KJ       11
GI       10
H|K       8
D         8
FK        6
C         6
P         5
MI        2
LJ        1
GKJ       1
OI        1
D|R       1
E         1
Name: target, dtype: int64
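
The counts show a heavily imbalanced, multi-class target, which matters when reading the accuracy figures. The sketch below computes the accuracy of always predicting the most common label and derives a binary target that matches the problem definition. It assumes `-` marks patients without a diagnosed condition, an interpretation this notebook does not state explicitly. The `targets` series is a made-up sample of codes.

```python
import pandas as pd

# '-' occurs 6771 times out of 9172 rows (see the counts above)
majority_baseline = 6771 / 9172
print(f"Accuracy of always predicting '-': {majority_baseline:.3f}")

# Hypothetical binary target: 1 for any diagnosis code, 0 for '-'
targets = pd.Series(["-", "K", "-", "GK", "C|I", "-"])
has_condition = (targets != "-").astype(int)
print(has_condition.tolist())
```

The baseline is about 0.738, close to the logistic regression and k-nearest neighbours scores reported earlier. Those two models may therefore be doing little more than predicting the majority class.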