# <font color = 'maroon'><u><center>Thyroid Detection Analysis</center></u></font>

### Data Info

<p>The thyroid is a gland that produces thyroid hormone, which is essential for
regulating breathing, body weight, heart rate, and muscle strength. Any irregularity
in the production of this hormone can be serious and, in severe untreated cases,
life-threatening. Thyroid function is commonly grouped into four states, of which
euthyroid denotes normal function:</p>

- Hyperthyroid
- Hypothyroid
- Euthyroid-sick
- Euthyroid

<p>However, the human body reacts differently to these irregularities, producing
diverse symptoms, so the disease goes undiagnosed in many cases. The challenge here
is to train a machine learning model to predict whether or not a patient has a
thyroid-related disorder.</p>


- The dataset has 3772 rows and 19 columns (18 input features plus the **Thyroid** target). A 75/25 split yields 2829 training instances and 943 test instances.

### Feature Analysis

- **age** - age of the patient (int)
- **sex** - sex of the patient (str)
- **on thyroxine** - whether patient is on thyroxine (bool)
- **on antithyroid medication** - whether patient is on antithyroid medication (bool)
- **sick** - whether patient is sick (bool)
- **pregnant** - whether patient is pregnant (bool)
- **thyroid surgery** - whether patient has undergone thyroid surgery (bool)
- **I131 treatment** - whether patient is undergoing I131 treatment (bool)
- **lithium** - whether patient is on lithium (bool)
- **goitre** - whether patient has goitre (bool)
- **tumor** - whether patient has a tumor (bool)
- **hypopituitary** - whether patient has an underactive pituitary gland (bool)
- **psych** - whether patient has a psychological condition (bool)
- **TSH** - TSH level in blood from lab work (float)
- **T3** - T3 level in blood from lab work (float)
- **TT4** - TT4 level in blood from lab work (float)
- **T4U** - T4U level in blood from lab work (float)
- **FTI** - FTI level in blood from lab work (float)
- **Thyroid** - target: whether patient has a thyroid disorder or not (bool)

### Importing the Libraries

In [22]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

### Getting the dataset

In [14]:
dataset = pd.read_csv('thyroid.csv') 
dataset.head()

Unnamed: 0,age,sex,on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,lithium,goitre,tumor,hypopituitary,psych,TSH,T3,TT4,T4U,FTI,Thyroid
0,41,F,f,f,f,f,f,f,f,f,f,f,f,1.3,2.5,125,1.14,109,P
1,23,F,f,f,f,f,f,f,f,f,f,f,f,4.1,2,102,?,?,P
2,46,M,f,f,f,f,f,f,f,f,f,f,f,0.98,?,109,0.91,120,P
3,70,F,t,f,f,f,f,f,f,f,f,f,f,0.16,1.9,175,?,?,P
4,70,F,f,f,f,f,f,f,f,f,f,f,f,0.72,1.2,61,0.87,70,P


### Data Preprocessing

In [15]:
dataset = dataset.fillna(0)  # Replace NaN values with 0; note that this file marks missing values with '?', which fillna() leaves untouched
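
The rows printed by `dataset.head()` show missing lab values as `?` rather than as NaN. A minimal sketch of one way to handle that placeholder is given here, assuming a toy frame with made-up values and an assumed choice of median imputation; the names `toy` and `numeric_cols` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the shape of the rows shown above; values are illustrative.
toy = pd.DataFrame({
    "TSH": ["1.3", "4.1", "0.98"],
    "T4U": ["1.14", "?", "0.91"],
    "sex": ["F", "F", "M"],
})

numeric_cols = ["TSH", "T4U"]  # assumed subset of the lab-measurement columns

# Turn the "?" placeholder into a real missing value.
toy = toy.replace("?", np.nan)

# Convert the lab columns from strings to floats; NaN is preserved.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric)

# Impute with the column median (one of several reasonable choices).
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy.dtypes)
print(toy)
```

Converting the lab columns to floats keeps their numeric ordering, which label-encoding their string form would not.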

In [16]:
objList = dataset.select_dtypes(include="object").columns  # Names of all object-dtype (string) columns
print(objList)

Index(['age', 'sex', 'on thyroxine', 'on antithyroid medication', 'sick',
       'pregnant', 'thyroid surgery', 'I131 treatment', 'lithium', 'goitre',
       'tumor', 'hypopituitary', 'psych', 'TSH', 'T3', 'TT4', 'T4U', 'FTI',
       'Thyroid'],
      dtype='object')


In [17]:
le = LabelEncoder() #Initializing the Label Encoder

In [18]:
for feat in objList:  # Label-encode every object-dtype column (this includes the numeric columns that were read as strings)
    dataset[feat] = le.fit_transform(dataset[feat].astype(str))
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   age                        3772 non-null   int32
 1   sex                        3772 non-null   int32
 2   on thyroxine               3772 non-null   int32
 3   on antithyroid medication  3772 non-null   int32
 4   sick                       3772 non-null   int32
 5   pregnant                   3772 non-null   int32
 6   thyroid surgery            3772 non-null   int32
 7   I131 treatment             3772 non-null   int32
 8   lithium                    3772 non-null   int32
 9   goitre                     3772 non-null   int32
 10  tumor                      3772 non-null   int32
 11  hypopituitary              3772 non-null   int32
 12  psych                      3772 non-null   int32
 13  TSH                        3772 non-null   int32
 14  T3                      
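
Because the loop above refits a single `le` object for each column, only the mapping of the last column remains available afterwards. A minimal sketch of keeping one encoder per categorical column follows, so that codes can be mapped back to their labels later. The toy data, the `N` label, and the names `categorical_cols` and `encoders` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with a few categorical columns; values are illustrative.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "sick": ["f", "t", "f"],
    "Thyroid": ["P", "N", "P"],
})

categorical_cols = ["sex", "sick", "Thyroid"]  # assumed list of truly categorical columns
encoders = {}

for col in categorical_cols:
    enc = LabelEncoder()                      # a fresh encoder per column
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc                       # keep it for later decoding

print(toy)
print(encoders["Thyroid"].classes_)
print(encoders["Thyroid"].inverse_transform([0, 1]))
```

`LabelEncoder` assigns integer codes in sorted order of the labels, so `classes_` shows which code corresponds to which original value.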

In [19]:
x = dataset.iloc[:, :-1].values  # Input data, i.e. the independent variables (all columns except the last)
y = dataset.iloc[:, -1].values   # Output data, i.e. the dependent variable (the Thyroid column)

### Feature Scaling 

In [20]:
sc_x = StandardScaler()  # Initialize the StandardScaler and apply it to the input data x
x = sc_x.fit_transform(x)

### Train-Test Split

In [23]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)  # Split the data into 75% training and 25% test sets
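
The notebook scales the full dataset before splitting it, so the scaler's mean and variance are partly computed from test rows. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is given here on random stand-in data with the same row count as the dataset. The `stratify` argument and all `*_demo` names are assumptions added for illustration. Tree-based models such as random forests are generally insensitive to feature scaling, so this step matters mostly if other model types are tried.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(3772, 5))      # stand-in features, same row count as the dataset
y_demo = rng.integers(0, 2, size=3772)   # stand-in binary labels

# Split first; stratify keeps the class proportions similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # statistics learned from training rows only
x_te_scaled = scaler.transform(x_te)      # test rows reuse those statistics

print(x_tr.shape, x_te.shape)
```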

### Model Initialization

- A Random Forest Classifier is used because this is a classification problem and, among the models tried, random forest gave the best performance, with high accuracy and a low error rate

In [24]:
rfc = RandomForestClassifier(n_estimators=1000)  # Random forest classifier with 1000 trees
rfc.fit(x_train, y_train)  # Fit the model to the training set

RandomForestClassifier(n_estimators=1000)

### Parameter Tuning using Grid Search

- Grid search is used to find the best parameters and the corresponding cross-validated accuracy score

In [25]:
parameter = [{'n_estimators': [500, 1000, 2000, 3000]}]  # Candidate numbers of trees
grid_search = GridSearchCV(estimator=rfc,
                           param_grid=parameter,
                           scoring='accuracy',  # Rank candidates by accuracy
                           cv=10,               # 10-fold cross-validation; scores are averaged over the folds
                           n_jobs=-1)           # Use all available CPU cores
grid_search = grid_search.fit(x_train, y_train)
best_accuracy = grid_search.best_score_      # Mean cross-validated accuracy of the best candidate
best_parameters = grid_search.best_params_   # Best parameter setting, i.e. the optimal number of trees

In [27]:
print(best_accuracy)

0.9589993233591458


In [26]:
print(best_parameters)

{'n_estimators': 1000}
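
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` also exposes the per-candidate scores in `cv_results_` and, with the default `refit=True`, a model retrained on the whole training set in `best_estimator_`. A minimal sketch on synthetic data follows. The small grid, the extra `max_depth` parameter, and the `*_demo` names are assumptions chosen so the example runs quickly, not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes in seconds.
x_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}  # illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(x_demo, y_demo)

# One row per candidate, sorted from best to worst mean accuracy.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values("rank_test_score"))

# The refitted best model can be used directly for prediction.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the difference between candidates is larger than the fold-to-fold variation.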


### Predicting the results

In [28]:
y_pred = rfc.predict(x_test)  # Predict the class labels for the test set

### Getting the Accuracy of the model

In [29]:
cm = confusion_matrix(y_test, y_pred)  # Rows are true classes, columns are predicted classes; the diagonal holds the correct predictions
                                       # Accuracy = number of correct predictions / total number of predictions

In [30]:
print(cm)

[[ 50  34]
 [  3 856]]
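
The accuracy formula in the comment above can be applied directly to the printed matrix: (50 + 856) / 943 is roughly 0.961. A short sketch of that arithmetic follows, together with per-class recall and precision; it uses only the four numbers shown in the output, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[50, 34],
               [3, 856]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column sums)

print(f"accuracy  = {accuracy:.4f}")
print("recall    =", np.round(recall, 4))
print("precision =", np.round(precision, 4))
```

Although overall accuracy is high, recall for class 0 is only 50/84 (about 0.595), which the single accuracy figure hides because class 1 dominates the test set.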


### Report

In [None]:
report = classification_report(y_test, y_pred)  # Precision, recall, F1-score and support for each class
print(report)

### Exporting the Model

In [None]:
import pickle            # This is to export the model for future prediction on new data
with open('model.pickle','wb') as f:
    pickle.dump(rfc,f)
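
To predict on new data later, the saved model needs inputs preprocessed the same way as the training data, so it can help to store the fitted scaler alongside the classifier. A minimal sketch of saving and reloading such a bundle is given here on random stand-in data. The dictionary layout, the file name `model_bundle.pickle`, and the `*_demo` names are assumptions, not part of the original notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4))           # stand-in features
y_demo = (x_demo[:, 0] > 0).astype(int)      # stand-in labels

scaler = StandardScaler().fit(x_demo)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(scaler.transform(x_demo), y_demo)

# Save the scaler and the model together (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pickle")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler, "model": model}, f)

# Later: reload the bundle and predict on new, unscaled rows.
with open(path, "rb") as f:
    bundle = pickle.load(f)

x_new = rng.normal(size=(3, 4))
restored_pred = bundle["model"].predict(bundle["scaler"].transform(x_new))
print(restored_pred)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.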