# Thyroid Disease Detection

##### Life Cycle of the Thyroid Disease Detector
- Understanding the problem statement
- Data Collection
- Data Checks to perform
- Exploratory Data Analysis
- Data preprocessing
- Model Training
- Choose best model

## 1) Problem Statement

- Thyroid disease is a frequent subject of medical diagnosis and prediction, and its onset is difficult to forecast. The thyroid gland is one of the body's most vital organs: the hormones it releases regulate metabolism. Hyperthyroidism and hypothyroidism are the two most common thyroid diseases, and both disturb the hormone release that controls the body's metabolic rate. The main goal is to predict whether a patient is at risk of developing thyroid disease.


## 2) Data Collection
- Dataset Source - https://archive.ics.uci.edu/dataset/102/thyroid+disease
- From Garavan Institute
- Documentation: as given by Ross Quinlan
- 6 databases from the Garavan Institute in Sydney, Australia
- Approximately the following for each database:
    - 2800 training (data) instances and 972 test instances
    - Plenty of missing data
    - 29 or so attributes, either Boolean or continuously-valued

- 2 additional databases, also from Ross Quinlan, are also here:
    - Hypothyroid.data and sick-euthyroid.data
    - Quinlan believes that these databases have been corrupted
    - Their format is highly similar to the other databases

- A Thyroid database suited for training ANNs:
    - 3 classes
    - 3772 training instances, 3428 testing instances
    - Includes cost data (donated by Peter Turney)

### 2.1 Import Required Packages
#### Importing pandas, NumPy, Matplotlib, Seaborn, and the `re` and `os` standard-library modules

In [43]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import os
import re

### Setting the maximum rows and columns displayed for DataFrames
- Maximum columns: no limit (`None`)
- Maximum rows: 10000

In [44]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',10000)

## 3) Combining Multiple datasets
- Datasets are divided into 2 formats:
    - Format-1: contains 30 attributes
    - Format-2: contains 26 attributes

### 3.1 Defining paths for Format-1 and Format-2 file folders

In [45]:
file_path_1='raw_thyroid_dataset/format_1_data'

In [46]:
file_path_2='raw_thyroid_dataset/format_2_data'

- The function below fetches the names of the training files (those ending in `.data`)

In [47]:
def fetch_file_names(dir_path):
    files=os.listdir(dir_path)
    data_file_names=[file for file in files if file.endswith('.data')]
    return data_file_names

- The function below combines all the individual datasets into one DataFrame

In [48]:
def merge_dataset(dir_path, data_file_names_list):
    # Read the first file, then append the remaining files row-wise
    df=pd.read_csv(os.path.join(dir_path,data_file_names_list[0]),header=None)
    for file in data_file_names_list[1:]:
        temp_df=pd.read_csv(os.path.join(dir_path,file),header=None)
        df=pd.concat([df,temp_df],axis=0, ignore_index=True)
    return df
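
As a possible alternative, the same merge can be sketched by reading every file into a list and calling `pd.concat` once, which avoids re-copying the growing DataFrame on each iteration. The name `merge_dataset_once` is hypothetical, and sorting the file names is an added assumption to make the row order reproducible, since `os.listdir` returns entries in arbitrary order.

```python
import os
import pandas as pd

def merge_dataset_once(dir_path, data_file_names_list):
    """Hypothetical variant of merge_dataset: read all files, concatenate once."""
    frames = [
        pd.read_csv(os.path.join(dir_path, name), header=None)
        for name in sorted(data_file_names_list)  # assumption: fixed, sorted order
    ]
    # A single concat call; ignore_index gives a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)
```

Note that `pd.concat` raises a `ValueError` on an empty list, so this sketch assumes at least one `.data` file was found.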

### 3.2 Combining all Format-1 datasets in a single file

In [49]:
file_names=fetch_file_names(file_path_1)
df=merge_dataset(file_path_1,file_names)

In [50]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,?,SVI,negative.|2807


- Format-1 data has 7544 instances and 30 attributes

#### 3.2.1 Fetching Feature names of dataset

- The function below reads the feature names from the `.names` file at the given path

In [51]:
def fetch_feature_names(src_path):
    with open(src_path) as file:
        itr_line=file.readlines()
        # The attribute block starts at the 'age' line and runs to the end of the file
        features_beginning_point=itr_line.index('age:\t\t\t\tcontinuous.\n')
        itr_line=itr_line[features_beginning_point:]

        features_list=[]
        for features in itr_line:
            sep_point=features.find(":")
            features_list.append(features[:sep_point])
    return features_list
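
A slightly more defensive version of this parsing step might look like the sketch below. It takes the lines directly, skips lines without a colon (for such a line `find` returns -1, so the slice `features[:-1]` would add a stray entry), and does not depend on the exact number of tabs after `age:`. The helper name `parse_feature_names` and the sample lines are assumptions for illustration, not the contents of the real `.names` files.

```python
def parse_feature_names(lines, first_feature="age"):
    """Hypothetical helper: return attribute names from the lines of a .names file."""
    names = []
    started = False
    for line in lines:
        name, sep, _ = line.partition(":")
        if not sep:
            # No colon: blank line or free text, so not an attribute declaration
            continue
        name = name.strip()
        if name == first_feature:
            started = True
        if started:
            names.append(name)
    return names

# Illustrative input only; the real files declare many more attributes
sample = [
    "negative, hypothyroid.\t\t| classes\n",
    "\n",
    "age:\t\t\t\tcontinuous.\n",
    "sex:\t\t\t\tM, F.\n",
    "\n",
    "on thyroxine:\t\t\tf, t.\n",
]
print(parse_feature_names(sample))  # ['age', 'sex', 'on thyroxine']
```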

In [52]:
feature_list=fetch_feature_names(f"{file_path_1}/allhyper.names")

#### The dataset has 30 columns, but the feature name list has only 29 entries
- Appending `disease` to the list as the name of the 30th column

In [53]:
feature_list.append('disease')

- Assigning the feature names as the column headers of the dataset

In [54]:
df.columns=feature_list

In [55]:
df.head(200)

Unnamed: 0,age,sex,on thyroxine,query on thyroxine,on antithyroid medication,sick,pregnant,thyroid surgery,I131 treatment,query hypothyroid,query hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH measured,TSH,T3 measured,T3,TT4 measured,TT4,T4U measured,T4U,FTI measured,FTI,TBG measured,TBG,referral source,disease
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,?,SVI,negative.|2807
5,18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,?,other,negative.|3434
6,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,?,other,negative.|1595
7,80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,?,SVI,negative.|1367
8,66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,?,SVI,negative.|1787
9,68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,?,SVI,negative.|2534


### 3.3 Combining all Format-2 datasets in a single file

In [56]:
file_names=fetch_file_names(file_path_2)
# df2=merge_dataset(file_path_2,file_names)

In [57]:
# print(df2.columns)
# df2.shape

- Format-2 data has 6326 instances and 26 attributes

#### 3.3.1 Fetching feature names of dataset

In [58]:
# Defining function to fetch features from format-2 dataset folder

def fetch_feature_names2(src_path):
    with open(src_path) as file:
        itr_line=file.readlines()
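        # Locate the first line that declares the 'age' attribute
        # (the original pattern r'age*' matched any line containing "ag")
        features_beginning_point=[idx for idx,val in enumerate(itr_line) if re.match(r'\s*age\s*:',val)]
        itr_line=itr_line[features_beginning_point[0]:]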

        features_list=[]
        for features in itr_line:
            sep_point=features.find(":")
            features_list.append(features[:sep_point])
    return features_list
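
When the first attribute line is located with a regular expression, the pattern needs some care: `age*` means "ag" followed by zero or more "e", so it matches any line that merely contains "ag". The sketch below compares that loose pattern with an anchored one. The sample lines, including the header line, are made up for illustration and are not taken from the real `.names` file.

```python
import re

lines = [
    "diagnosis classes: negative, hypothyroid\n",  # made-up header containing "ag"
    "age: continuous.\n",
    "sex: M, F.\n",
]

loose = re.compile(r"age*")         # "ag" followed by zero or more "e"
strict = re.compile(r"\s*age\s*:")  # optional indent, "age", optional space, colon

# search() scans the whole line; match() only succeeds at the start of the line
loose_hits = [i for i, line in enumerate(lines) if loose.search(line)]
strict_hits = [i for i, line in enumerate(lines) if strict.match(line)]

print(loose_hits)   # [0, 1]
print(strict_hits)  # [1]
```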

In [59]:
# feature_list2=fetch_feature_names2(f"{file_path_2}/hypothyroid.names")

In [60]:
# feature_list2

### 3.4 Converting Format-2 data to the Format-1 layout before merging

In [61]:
# Moving the disease column from the first position to the last

# outputs=df2.pop(0)
# df2.columns=feature_list2
# df2['disease']=outputs

In [62]:
# df2.head()

### 3.5 Merging Format-1 and Format-2 data

#### 3.5.1 Checking columns of Format-1 and Format-2 dataset

In [63]:
print(f"Format-1 Data columns:\n{df.columns}")
# print(f"Format-2 Data columns:\n{df2.columns}")

Format-1 Data columns:
Index(['age', 'sex', 'on thyroxine', 'query on thyroxine',
       'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
       'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
       'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
       'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
       'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source',
       'disease'],
      dtype='object')


- Feature names in the Format-1 dataset use whitespace to separate words
- Feature names in the Format-2 dataset use underscores to separate words

    - All feature names must be converted to the same word-separation format

#### Replacing whitespace with underscores in the feature names of the Format-1 dataset

In [64]:
df.columns=[features.replace(" ","_") for features in df.columns]
df.columns

Index(['age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery',
       'I131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium',
       'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH_measured', 'TSH',
       'T3_measured', 'T3', 'TT4_measured', 'TT4', 'T4U_measured', 'T4U',
       'FTI_measured', 'FTI', 'TBG_measured', 'TBG', 'referral_source',
       'disease'],
      dtype='object')

#### 3.5.2 Features missing from Format-2 data

- I131 treatment
- hypopituitary
- psych
- referral source

In [65]:
# print(f"Missing Features in Format-2 Data:")
# for features in df.columns:
#     if features not in df2.columns:
#         print("* ",end="")
#         print(features)
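
The check above, and the effect such missing features have on a later row-wise concatenation, can be sketched on two tiny stand-in frames. The frames `df_a` and `df_b` and their values are made up for illustration; only the behaviour of `pd.concat`, which aligns on column names and fills absent cells with NaN, is the point here.

```python
import pandas as pd

# Tiny stand-ins for the two formats; the values are made up
df_a = pd.DataFrame({"age": [41, 23], "psych": ["f", "t"],
                     "disease": ["negative", "negative"]})
df_b = pd.DataFrame({"age": [70], "disease": ["hypothyroid"]})

# Columns present in the first frame but absent from the second
missing = [col for col in df_a.columns if col not in df_b.columns]

# concat aligns on column names; cells with no source value become NaN
combined = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print(missing)
print(combined["psych"].isna().tolist())
```

Rows that come from the frame lacking a column therefore need imputation, or the column needs dropping, before modelling.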

#### 3.5.3 Creating a final dataset

In [66]:
# final_dataset=pd.concat([df,df2],axis=0)
final_dataset=df

In [67]:
final_dataset.head(200)

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,disease
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,?,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70,f,?,SVI,negative.|2807
5,18,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.03,f,?,t,183,t,1.3,t,141,f,?,other,negative.|3434
6,59,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,f,?,t,72,t,0.92,t,78,f,?,other,negative.|1595
7,80,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.2,t,0.6,t,80,t,0.7,t,115,f,?,SVI,negative.|1367
8,66,F,f,f,f,f,f,f,f,f,f,f,f,t,f,f,t,0.6,t,2.2,t,123,t,0.93,t,132,f,?,SVI,negative.|1787
9,68,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,2.4,t,1.6,t,83,t,0.89,t,93,f,?,SVI,negative.|2534


In [68]:
final_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7544 entries, 0 to 7543
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        7544 non-null   object
 1   sex                        7544 non-null   object
 2   on_thyroxine               7544 non-null   object
 3   query_on_thyroxine         7544 non-null   object
 4   on_antithyroid_medication  7544 non-null   object
 5   sick                       7544 non-null   object
 6   pregnant                   7544 non-null   object
 7   thyroid_surgery            7544 non-null   object
 8   I131_treatment             7544 non-null   object
 9   query_hypothyroid          7544 non-null   object
 10  query_hyperthyroid         7544 non-null   object
 11  lithium                    7544 non-null   object
 12  goitre                     7544 non-null   object
 13  tumor                      7544 non-null   object
 14  hypopitu

In [69]:
final_dataset.to_csv("raw_data.csv")

### 3.6 Shape of dataset

In [70]:
final_dataset.shape

(7544, 30)

### 4.1 Replacing `?` with NaN in all features of the dataset

In [71]:
for features in final_dataset.columns:
    final_dataset[features]=np.where(final_dataset[features]=="?",np.nan,final_dataset[features])
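
The same substitution can also be sketched without a loop, using `DataFrame.replace` on the whole frame. The `sample` frame below is a made-up stand-in for `final_dataset`. Passing `na_values="?"` to `pd.read_csv` would be another option, handling the placeholder at load time.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with "?" as the missing-value placeholder
sample = pd.DataFrame({"TSH": ["1.3", "?"], "sex": ["F", "?"], "T3": ["?", "2.5"]})

# Replace every cell equal to "?" with NaN across all columns at once
cleaned = sample.replace("?", np.nan)

print(cleaned.isna().sum())
```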


#### Checking the first 5 rows of the dataset

In [72]:
final_dataset.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,disease
0,41,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125,t,1.14,t,109.0,f,,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102,f,,f,,f,,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,,t,109,t,0.91,t,120.0,f,,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175,f,,f,,f,,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61,t,0.87,t,70.0,f,,SVI,negative.|2807


### 4.2 Fixing Numerical features

- Listing the numerical features manually

In [73]:
num_features=['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'TBG']

#### 4.2.1 Typecasting object features to float datatype

In [74]:
for features in num_features:
    final_dataset[features]=final_dataset[features].astype(float)
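
`astype(float)` raises a `ValueError` if any entry cannot be parsed as a number. Where stray non-numeric entries are a possibility, `pd.to_numeric` with `errors="coerce"` is one alternative, sketched here on a made-up frame (the value `"bad"` is invented to show the behaviour).

```python
import pandas as pd

# Made-up stand-in; "bad" is an invented unparseable entry
sample = pd.DataFrame({"age": ["41", "23", None], "TSH": ["1.3", "bad", "0.98"]})

for col in ["age", "TSH"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```

The trade-off is that coercion hides bad values silently, so counting the NaNs before and after the conversion is a sensible companion check.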

### 4.3 Fixing Categorical features

In [75]:
cat_features=['sex', 'referral_source','disease']

#### 4.3.1 Extracting the disease categories by splitting on the "." character

In [76]:
# Keep only the text before the first "." (drops suffixes such as "|3733")
final_dataset['disease']=final_dataset['disease'].str.split('.').str[0]

#### Extracting the disease categories by splitting on the "[" character

In [77]:
final_dataset['disease']=final_dataset['disease'].str.split('[')
final_dataset['disease']=[val[0] for val in final_dataset['disease']]
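
The two splitting steps can be combined into a single pass, sketched here as a plain function. The helper name `clean_disease_label` is hypothetical, and the bracketed input is a made-up example of the "[" form, since none appears in the rows shown above.

```python
import re

def clean_disease_label(raw):
    """Hypothetical helper: keep the text before the first '.', '|' or '['."""
    return re.split(r"[.|\[]", raw, maxsplit=1)[0].strip()

print(clean_disease_label("negative.|3733"))       # negative
print(clean_disease_label("negative[840801013]"))  # negative (made-up "[" example)
```

On a DataFrame column the function could be applied with `final_dataset['disease'].map(clean_disease_label)`, assuming the column contains no missing values.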

#### Checking the frequency of each class in disease feature

In [78]:
final_dataset['disease'].value_counts()

disease
negative                   7151
compensated hypothyroid     194
primary hypothyroid          95
hyperthyroid                 79
goitre                       12
T3 toxic                     10
secondary hypothyroid         2
secondary toxic               1
Name: count, dtype: int64
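
The counts above sum to 7544 rows, with the negative class dominating. One quick way to quantify that imbalance, using the numbers copied from the output above, is sketched below. With the DataFrame at hand, `value_counts(normalize=True)` would give the same shares directly.

```python
# Class counts copied from the value_counts() output above
counts = {
    "negative": 7151,
    "compensated hypothyroid": 194,
    "primary hypothyroid": 95,
    "hyperthyroid": 79,
    "goitre": 12,
    "T3 toxic": 10,
    "secondary hypothyroid": 2,
    "secondary toxic": 1,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)                        # 7544
print(f"{shares['negative']:.1%}")  # 94.8%
```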

In [79]:
final_dataset.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,disease
0,41.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,1.3,t,2.5,t,125.0,t,1.14,t,109.0,f,,SVHC,negative
1,23.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,4.1,t,2.0,t,102.0,f,,f,,f,,other,negative
2,46.0,M,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.98,f,,t,109.0,t,0.91,t,120.0,f,,other,negative
3,70.0,F,t,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.16,t,1.9,t,175.0,f,,f,,f,,other,negative
4,70.0,F,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,0.72,t,1.2,t,61.0,t,0.87,t,70.0,f,,SVI,negative


### 4.4 Fixing the boolean features

In [80]:
bool_features=[features for features in final_dataset.columns if features not in num_features+cat_features]

#### Showing all boolean features

In [81]:
bool_features

['on_thyroxine',
 'query_on_thyroxine',
 'on_antithyroid_medication',
 'sick',
 'pregnant',
 'thyroid_surgery',
 'I131_treatment',
 'query_hypothyroid',
 'query_hyperthyroid',
 'lithium',
 'goitre',
 'tumor',
 'hypopituitary',
 'psych',
 'TSH_measured',
 'T3_measured',
 'TT4_measured',
 'T4U_measured',
 'FTI_measured',
 'TBG_measured']

In [82]:
bool_dict={'y':1, 'n':0, 'f':0, 't':1}

for features in bool_features:
    final_dataset[features]=final_dataset[features].map(bool_dict)
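
`Series.map` with a dictionary returns NaN for any value that is not one of the dictionary's keys, so an unexpected code would silently become missing. A small sanity check for that, on a made-up series, might look like this (the value `'T'` is invented to trigger the case):

```python
import pandas as pd

bool_dict = {'y': 1, 'n': 0, 'f': 0, 't': 1}
sample = pd.Series(['t', 'f', 'y', 'T'])  # 'T' is a made-up value not in the mapping

mapped = sample.map(bool_dict)

# Values present before mapping but NaN afterwards were not in the dictionary
unmapped = sample[mapped.isna() & sample.notna()].unique().tolist()

print(unmapped)  # ['T']
```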

In [83]:
final_dataset.to_csv("blended_data.csv")