# Nairobi Hospital Hypothyroidism Test

## 1. Defining the Question

### a) Specifying the Question

The goal is to build a model that determines whether a patient's symptoms indicate hypothyroidism.

### b) Defining the Metric for Success


### c) Understanding the context

Hypothyroidism (underactive thyroid) is a condition in which your thyroid gland doesn't produce enough of certain crucial hormones.

Hypothyroidism may not cause noticeable symptoms in the early stages. Over time, untreated hypothyroidism can cause a number of health problems, such as obesity, joint pain, infertility and heart disease.

Accurate thyroid function tests are available to diagnose hypothyroidism. Treatment with synthetic thyroid hormone is usually simple, safe and effective once you and your doctor find the right dose for you.

Hypothyroidism signs and symptoms may include:

* Fatigue
* Increased sensitivity to cold
* Constipation
* Dry skin
* Weight gain
* Puffy face
* Hoarseness
* Muscle weakness
* Elevated blood cholesterol level
* Muscle aches, tenderness and stiffness
* Pain, stiffness or swelling in your joints
* Heavier than normal or irregular menstrual periods
* Thinning hair
* Slowed heart rate
* Depression
* Impaired memory
* Enlarged thyroid gland (goiter)

### d) Recording the Experimental Design

We will use exploratory data analysis (univariate and bivariate) to examine the relationships and differences between variables. We shall then use Decision Trees and Support Vector Machines to make predictions.
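
As a rough illustration of the modelling step, the sketch below fits both classifiers with scikit-learn on a tiny synthetic dataset. The feature values, the choice of `tsh` and `t3` as the only predictors, the hyperparameters and the train/test split are all assumptions made for illustration; none of them come from the hospital data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): high TSH and low T3 loosely mark hypothyroid cases
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tsh": np.concatenate([rng.normal(2, 1, n), rng.normal(40, 10, n)]),
    "t3": np.concatenate([rng.normal(2, 0.5, n), rng.normal(0.8, 0.3, n)]),
    "status": ["negative"] * n + ["hypothyroid"] * n,
})

X = toy[["tsh", "t3"]]
y = toy["status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the two model families named in the experimental design
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "svm": SVC(kernel="rbf"),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out rows

print(scores)
```

The toy classes are balanced, which the real data is not, so accuracy on the real dataset may need to be read alongside per-class measures such as recall.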

### e) Data Relevance

The dataset to use for this project can be found by following this link: https://bit.ly/hypothyroid_data

The dataset includes the following features:

* Age
* Sex
* on_thyroxine
* query_on_thyroxine
* on_antithyroid_medication
* thyroid_surgery
* query_hypothyroid
* query_hyperthyroid
* pregnant
* sick
* tumor
* lithium
* goitre
* TSH_measured
* TSH
* T3_measured
* T3
* TT4_measured
* TT4

## 2. Reading the Data

In [20]:
# Importing the Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)



In [21]:
# Let's read the dataset

url = 'https://bit.ly/hypothyroid_data'

hypothyroid_data = pd.read_csv(url)

## 3. Checking the Data

In [22]:
# Checking the top data

hypothyroid_data.head()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,72,M,f,f,f,f,f,f,f,f,f,f,f,y,30.0,y,0.6,y,15,y,1.48,y,10,n,?
1,hypothyroid,15,F,t,f,f,f,f,f,f,f,f,f,f,y,145.0,y,1.7,y,19,y,1.13,y,17,n,?
2,hypothyroid,24,M,f,f,f,f,f,f,f,f,f,f,f,y,0.0,y,0.2,y,4,y,1.0,y,0,n,?
3,hypothyroid,24,F,f,f,f,f,f,f,f,f,f,f,f,y,430.0,y,0.4,y,6,y,1.04,y,6,n,?
4,hypothyroid,77,M,f,f,f,f,f,f,f,f,f,f,f,y,7.3,y,1.2,y,57,y,1.28,y,44,n,?


In [23]:
# Checking the columns

hypothyroid_data.columns

Index(['status', 'age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid',
       'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre',
       'TSH_measured', 'TSH', 'T3_measured', 'T3', 'TT4_measured', 'TT4',
       'T4U_measured', 'T4U', 'FTI_measured', 'FTI', 'TBG_measured', 'TBG'],
      dtype='object')

In [24]:
# Checking the shape

hypothyroid_data.shape

(3163, 26)

The dataset contains 26 columns and 3163 rows.

In [25]:
# Describing the data

hypothyroid_data.describe()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,sick,tumor,lithium,goitre,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
count,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163,3163
unique,2,93,3,2,2,2,2,2,2,2,2,2,2,2,2,240,2,70,2,269,2,159,2,281,2,53
top,negative,?,F,f,f,f,f,f,f,f,f,f,f,f,y,0,y,?,y,?,y,?,y,?,n,?
freq,3012,446,2182,2702,3108,3121,3059,2922,2920,3100,3064,3123,3161,3064,2695,894,2468,695,2914,249,2915,248,2916,247,2903,2903


## 4. Tidying the Dataset

### a). Checking for Null Values

In [26]:
# Replacing the '?' placeholder in the dataframe with NaN

hypothyroid_data.replace('?', np.nan, inplace=True)
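
To see why this replacement matters, the toy example below (hypothetical values, not the hospital data) shows that `'?'` entries only count as missing once they are turned into `NaN`, and that the column can then be converted to numbers.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the '?' placeholder in the raw file
toy = pd.DataFrame({"TSH": ["30.0", "?", "7.3"], "sex": ["M", "F", "?"]})

cleaned = toy.replace("?", np.nan)

# '?' entries are now counted as missing
missing_counts = cleaned.isnull().sum()

# With the placeholder gone, the text column converts cleanly to floats
tsh_numeric = pd.to_numeric(cleaned["TSH"])

print(missing_counts)
print(tsh_numeric.dtype)
```

Passing `na_values='?'` to `pd.read_csv` would mark the placeholder as missing at read time instead.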

In [27]:
total = hypothyroid_data.isnull().sum().sort_values(ascending=False)
percentage = (hypothyroid_data.isnull().sum()/hypothyroid_data.isnull().count()*100).sort_values(ascending=False)
missing_value = pd.concat([total,percentage],axis=1,keys=['Total','Percentage'])
missing_value.head(10)

Unnamed: 0,Total,Percentage
TBG,2903,91.779956
T3,695,21.972811
TSH,468,14.79608
age,446,14.100537
TT4,249,7.872273
T4U,248,7.840658
FTI,247,7.809042
sex,73,2.307936
TSH_measured,0,0.0
TBG_measured,0,0.0


Since the TBG column is almost 92% missing (2903 of 3163 rows), we shall drop it.
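
The same decision can be expressed as a rule. The sketch below drops every column whose missing share exceeds a cutoff; the 50% threshold and the toy values are assumptions chosen for illustration, not a setting used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'TBG' is mostly missing, 'TSH' only partly
toy = pd.DataFrame({
    "TSH": [30.0, np.nan, 7.3, 1.2],
    "TBG": [np.nan, np.nan, np.nan, 28.0],
    "sex": ["M", "F", "F", "M"],
})

threshold = 0.5  # assumed cutoff: drop columns with more than 50% missing

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)

print(missing_share)
print(cols_to_drop)
```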

In [28]:
hypothyroid_data.drop('TBG', axis=1, inplace=True)

We shall also drop the rows that still contain missing values. Because this is medical data, it should contain only actual measurements rather than imputed ones.

In [29]:
hypothyroid_data.dropna(inplace=True)
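
Because the negative class dominates this dataset, it can be worth checking how many rows of each class survive the drop. A minimal sketch on toy data (hypothetical values, column names borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "status": ["hypothyroid", "negative", "negative", "negative", "hypothyroid"],
    "age": [72, np.nan, 24, 24, 77],
    "TSH": [30.0, 1.1, np.nan, 0.9, 7.3],
})

before = toy["status"].value_counts()
after = toy.dropna()["status"].value_counts()

# Rows lost per class; reindex keeps classes that vanish entirely
lost = before - after.reindex(before.index, fill_value=0)

print(len(toy), "rows before,", len(toy.dropna()), "rows after")
print(lost)
```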

In [30]:
# Confirming there is no missing data

hypothyroid_data.isnull().sum()

status                       0
age                          0
sex                          0
on_thyroxine                 0
query_on_thyroxine           0
on_antithyroid_medication    0
thyroid_surgery              0
query_hypothyroid            0
query_hyperthyroid           0
pregnant                     0
sick                         0
tumor                        0
lithium                      0
goitre                       0
TSH_measured                 0
TSH                          0
T3_measured                  0
T3                           0
TT4_measured                 0
TT4                          0
T4U_measured                 0
T4U                          0
FTI_measured                 0
FTI                          0
TBG_measured                 0
dtype: int64

### b). Checking for Duplicates

In [12]:
hypothyroid_data.duplicated().sum()

77

There are 77 duplicated rows, which we shall remove.

In [13]:
hypothyroid_data.drop_duplicates(keep='first',inplace=True)

In [14]:
# Confirming they have been removed

hypothyroid_data.duplicated().sum()

0
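
Before removing duplicates it can help to look at them. The sketch below, on toy data with hypothetical values, uses `keep=False` to show every copy of a repeated row, then keeps the first occurrence in the same way as `drop_duplicates(keep='first')` above.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 are identical
toy = pd.DataFrame({
    "status": ["negative", "negative", "hypothyroid", "negative"],
    "age": [24, 24, 72, 24],
    "sex": ["F", "F", "M", "M"],
})

# keep=False flags all members of a duplicate group, not only the later copies
all_copies = toy[toy.duplicated(keep=False)]

# duplicated() with the default keep='first' counts only the extra copies
n_extra = toy.duplicated().sum()

deduped = toy.drop_duplicates(keep="first")

print(all_copies)
print(n_extra, len(deduped))
```

Identical rows in patient data may be distinct patients who happen to share the same values, so whether to drop them is a judgment call.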

### c). Checking the Datatypes

In [15]:
hypothyroid_data.dtypes

status                       object
age                          object
sex                          object
on_thyroxine                 object
query_on_thyroxine           object
on_antithyroid_medication    object
thyroid_surgery              object
query_hypothyroid            object
query_hyperthyroid           object
pregnant                     object
sick                         object
tumor                        object
lithium                      object
goitre                       object
TSH_measured                 object
TSH                          object
T3_measured                  object
T3                           object
TT4_measured                 object
TT4                          object
T4U_measured                 object
T4U                          object
FTI_measured                 object
FTI                          object
TBG_measured                 object
TBG                          object
dtype: object

In [38]:
# Changing column datatypes to their appropriate datatypes
# Lists of numerical, categorical and boolean columns have been created for efficiency
# The lists use lower-case names; columns are matched case-insensitively below
# because the dataframe still has upper-case names such as 'TSH' at this point
# Numerical columns list
#
numeric_cols = ['age', 'tsh', 't3', 'tt4', 't4u', 'fti']

# Categorical columns list
categorical_cols = ['status', 'sex','tsh_measured', 't3_measured', 'tt4_measured',
            't4u_measured', 'fti_measured', 'tbg_measured']

# Boolean columns list
boolean_cols = ['on_thyroxine', 'query_on_thyroxine','on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid',
            'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre']

# Using a for loop to change columns to their appropriate datatypes
#
for column in hypothyroid_data.columns:
  name = column.lower()
  if name in numeric_cols:
    hypothyroid_data[column] = hypothyroid_data[column].astype('float64')
  elif name in categorical_cols:
    hypothyroid_data[column] = hypothyroid_data[column].astype('category')
  elif name in boolean_cols:
    # Replacing 'f'/'t' with False/True before casting to bool
    hypothyroid_data[column] = hypothyroid_data[column].replace({'f': False, 't': True}).astype('bool')

# Previewing the column datatypes to check whether the changes have been effected
#
hypothyroid_data.dtypes

status                       category
age                           float64
sex                          category
on_thyroxine                     bool
query_on_thyroxine               bool
on_antithyroid_medication        bool
thyroid_surgery                  bool
query_hypothyroid                bool
query_hyperthyroid               bool
pregnant                         bool
sick                             bool
tumor                            bool
lithium                          bool
goitre                           bool
TSH_measured                 category
TSH                           float64
T3_measured                  category
T3                            float64
TT4_measured                 category
TT4                           float64
T4U_measured                 category
T4U                           float64
FTI_measured                 category
FTI                           float64
TBG_measured                 category
dtype: object
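
The same conversion can also be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses a toy frame with hypothetical values and matches names case-insensitively, so it works whether or not the columns have been lower-cased. The set names `numeric`, `categorical` and `boolean` are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical values) with every column stored as text
toy = pd.DataFrame({
    "status": ["hypothyroid", "negative"],
    "age": ["72", "15"],
    "on_thyroxine": ["f", "t"],
    "TSH": ["30.0", "145.0"],
    "TSH_measured": ["y", "y"],
})

numeric = {"age", "tsh"}
categorical = {"status", "tsh_measured"}
boolean = {"on_thyroxine"}

# Map 't'/'f' flags to real booleans first; astype('bool') on the raw strings
# would turn every non-empty string, including 'f', into True
for col in toy.columns:
    if col.lower() in boolean:
        toy[col] = toy[col].map({"t": True, "f": False})

# Build one mapping and convert in a single call
dtype_map = {}
for col in toy.columns:
    key = col.lower()
    if key in numeric:
        dtype_map[col] = "float64"
    elif key in categorical:
        dtype_map[col] = "category"
    elif key in boolean:
        dtype_map[col] = "bool"

toy = toy.astype(dtype_map)
print(toy.dtypes)
```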

### d). Checking for Outliers
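
No outlier check is recorded for this step. As a minimal sketch of one common approach, assumed here rather than taken from the notebook, the 1.5 × IQR rule flags values that lie far outside the middle half of a numeric column. The values below are hypothetical.

```python
import pandas as pd

# Toy values (hypothetical); 430.0 is far from the rest
tsh = pd.Series([30.0, 0.0, 7.3, 1.2, 2.5, 3.1, 430.0], name="tsh")

q1 = tsh.quantile(0.25)
q3 = tsh.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = tsh[(tsh < lower) | (tsh > upper)]
print(outliers)
```

Flagged values are not necessarily errors. The hypothyroid rows previewed earlier include TSH readings of 145.0 and 430.0, so extreme values here may carry the signal the model needs.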

### e). Columns Formatting 

In [17]:
# For consistency, the column names should be uniform
# We shall change all column names to lower case
#
hypothyroid_data.columns = hypothyroid_data.columns.str.lower()

# Previewing the columns 

hypothyroid_data.columns

Index(['status', 'age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid',
       'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre',
       'tsh_measured', 'tsh', 't3_measured', 't3', 'tt4_measured', 'tt4',
       't4u_measured', 't4u', 'fti_measured', 'fti', 'tbg_measured', 'tbg'],
      dtype='object')