# Hyperthyroid Disease Classification
### Group 17: Matthew Gillies, Ryan Lee, Eric Liu, Arman Moztarzadeh

## Summary
In this report, we train a classifier to predict the presence of hyperthyroidism from attributes such as age, sex, prior treatment for thyroid disease, and thyroid hormone levels. Our classifier reaches 98% accuracy on the test set.

## Introduction
The **Thyroid Disease dataset** obtained from the **UCI Machine Learning Repository** will be used to predict the presence of hyperthyroidism. Hyperthyroidism is a condition in which the thyroid gland produces an excess of thyroid hormones (De Leo et al., 2016). As a result, the body's metabolism "speeds up", leading to weight loss, rapid heartbeat, fatigue, shaky hands, sweating, and more (U.S. Department of Health and Human Services, 2021).
<br>
Studies have shown that certain groups within the population are more predisposed to hyperthyroidism. It is more common in women, in women who were recently pregnant, and in people with type 1 or type 2 diabetes, among other factors (Allahabadia et al., 2000). To help predict whether someone has hyperthyroidism, we use 20 attributes from the Thyroid Disease dataset. These include age, sex, whether the patient takes medication for thyroid disease (thyroxine or antithyroid medication), pregnancy status, prior treatment for thyroid disease (thyroid surgery or I131 radiotherapy for hyperthyroidism), and the levels of different hormones in the body (TSH, TT4, T4U, and FTI). **With these factors, we hope to build an accurate classifier to predict whether or not someone has hyperthyroidism.**

## Methods and Results

### Data-Cleaning
To begin this analysis, we read in the data files from the original source and merge them **solely** for pre-processing. The `binaryClass` feature is relabelled so that all values are either positive or negative, removing all other diagnoses. Next, we replace all `?` values with `NaN` and remove all columns that are irrelevant to our classification model or contain an extremely large number of `NaN` values. We then drop all remaining rows with `NaN` values from the data set. Since all of the columns initially have dtype `object`, they are converted to their respective data types (either numeric or categorical). Finally, we convert the `binaryClass` column from character values ("P" and "N") to integer values (0 and 1 respectively) to match the classification model. The mapping is reversed because, in this data set, positive labels actually represent a negative diagnosis.
### Exploratory Data Analysis
After data-cleaning, we performed exploratory data analysis (EDA) on the entire data set through summary statistics, correlations of numeric features, and value counts.
### Model Training
We split the data into training and test sets with a 70/30 split, separating the `binaryClass` feature (**target**) from the rest of the data set. A `ColumnTransformer` was then created to scale all numeric variables and one-hot encode all categorical variables so that they are in a state the model can process. The `ColumnTransformer` was fitted on the training set and used to transform both the training and test sets. We then performed cross-validation with a `LogisticRegression` model on the training data, which returned an average validation score of 98%. Fitting the model on the full training set again produced 98% accuracy. Furthermore, we visualized the training predictions with `TSH` and `TT4` concentrations on the x and y axes respectively. Finally, **the model was applied to the test set, where it reached 98% accuracy**. Another plot visualizing the class predictions was created, along with a confusion matrix to assess the impact of the predictions. From the confusion matrix we can see that most of our incorrect predictions are false negatives. This is not ideal, as incorrectly diagnosing someone with the disease has less serious consequences than incorrectly predicting that someone is disease free.

## Imports

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline


# Function Imports (4 Functions)
import group17pkg as grp

import warnings
warnings.filterwarnings('ignore')

## Data-Cleaning

In [None]:
if not os.path.exists("data"):
    os.makedirs("data")
path1 = 'http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/allhyper.data'
path2 = 'http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/allhyper.test'
columnNames = ["age", "sex", "on thyroxine", "query on thyroxine",
               "on antithyroid medication", "sick", "pregnant",
               "thyroid surgery", "I131 treatment", "query hypothyroid",
               "query hyperthyroid", "lithium", "goitre", "tumor",
               "hypopituitary", "psych", "TSH measured", "TSH", "T3 measured",
               "T3", "TT4 measured", "TT4", "T4U measured", "T4U",
               "FTI measured", "FTI", "TBG measured", "TBG", "referral source",
               "binaryClass"]
dfData = pd.read_csv(path1, names=columnNames)
dfTest = pd.read_csv(path2, names=columnNames)
hyperthyroid_df = pd.concat([dfData, dfTest])

In [None]:
hyperthyroid_df.head()

In [None]:
hyperthyroid_df = grp.relabel_bclass(hyperthyroid_df)

In [None]:
hyperthyroid_df.binaryClass.unique().tolist()

In [None]:
# Reading in data
# hyperthyroid_df = pd.read_csv("data/hyperthyroid.csv")
hyperthyroid_df.head()

In [None]:
# Replacing ? values with NA
hyperthyroid = hyperthyroid_df.replace("?", np.nan)

In [None]:
hyperthyroid.isna().sum()
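
The per-column counts above can also be read as fractions, which makes it easier to spot columns that are mostly empty. The following is a minimal sketch on a small invented frame; the values, the `toy_missing` name, and the 0.5 cutoff are assumptions for illustration and are not taken from the thyroid data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the thyroid data; all values are invented.
toy_missing = pd.DataFrame({
    "age": ["41", "23", "?", "70"],
    "TSH": ["1.3", "?", "0.98", "0.16"],
    "TBG": ["?", "?", "?", "?"],
})

# Treat "?" as missing, then compute the share of missing values per column.
toy_missing = toy_missing.replace("?", np.nan)
missing_share = toy_missing.isna().mean()

# Hypothetical cutoff: flag columns that are more than half empty.
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_empty)
```

A cutoff like this only suggests candidates for removal; which columns to drop remains a judgment call.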

In [None]:
# Due to the large number of NA values in TBG and T3, these features are removed from the dataset.
# All "measured" features are also removed, since once rows with NAs are dropped they will all be "t".
# We also remove the referral source column, as it is not relevant to the model.
# All other rows with NA values are simply dropped.
hyper = hyperthyroid.drop(columns=["TBG", "TBG measured", "T3", "T3 measured", "TSH measured",
                                   "TT4 measured", "FTI measured", "T4U measured", "referral source"])

In [None]:
hyper_clean = hyper.dropna()

In [None]:
hyper_clean.info()

In [None]:
# Changing Dtype of the columns to numeric/categorical
num_cols = ['age', 'TSH', 'TT4', 'T4U', 'FTI']
cat_cols = ['sex', 'on thyroxine', 'query on thyroxine', 'on antithyroid medication',
            'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid',
            'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'binaryClass', 'hypopituitary']

hyper_clean = grp.col_dtype_reformat(num_cols, cat_cols, hyper_clean)
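
The source of `grp.col_dtype_reformat` is not shown in this notebook. As a rough idea of what such a conversion step can look like, here is a hypothetical stand-in named `reformat_dtypes`; its name, argument order, and behaviour are assumptions and may differ from the package's actual function.

```python
import pandas as pd


def reformat_dtypes(df, num_cols, cat_cols):
    """Hypothetical helper: return a copy with numeric and categorical dtypes."""
    out = df.copy()
    for col in num_cols:
        # Values arrive as strings, so parse them as numbers.
        out[col] = pd.to_numeric(out[col])
    for col in cat_cols:
        # Store repeated labels such as "F"/"M" as a categorical column.
        out[col] = out[col].astype("category")
    return out


# Invented rows for illustration only.
toy_raw = pd.DataFrame({"age": ["41", "23"], "TSH": ["1.3", "0.98"], "sex": ["F", "M"]})
toy_typed = reformat_dtypes(toy_raw, ["age", "TSH"], ["sex"])
print(toy_typed.dtypes)
```

Working on a copy leaves the input frame unchanged, which avoids modifying a slice of another frame in place.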

In [None]:
hyper_clean.info()

In [None]:
# Encoding binaryClass as integers: "P" (a negative diagnosis in this data set) becomes 0 and "N" becomes 1
hyper_clean['binaryClass'] = hyper_clean['binaryClass'].replace(["N", "P"], [1, 0])

## Exploratory Data Analysis (EDA)

In [None]:
hyper_clean.info()

In [None]:
hyper_clean.describe()

In [None]:
for c in hyper_clean.columns:
    print("---- %s ---" % c)
    print(hyper_clean[c].value_counts())

In [None]:
grp.plot_correlations(hyper_clean).show()

In [None]:
print("Figure 2: Realization of class counts")
pd.DataFrame(hyper_clean['binaryClass'].value_counts())

## Model Training & Analysis

In [None]:
# Splitting data
X = hyper_clean.drop(columns="binaryClass")
y = hyper_clean['binaryClass']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train.head()
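
The split above is random, so its exact contents change between runs. If a repeatable split with the same class ratio in both parts is wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses invented labels; the `toy_` names, the 90/10 class ratio, and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90 zeros and 10 ones (invented for illustration).
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y keeps the class ratio the same in both parts;
# random_state fixes the shuffle so the split is repeatable.
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=123
)
print(toy_y_tr.mean(), toy_y_te.mean())
```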

In [None]:
# Preprocessing data
onehot = ['sex', 'on thyroxine', 'query on thyroxine', 'on antithyroid medication',
          'sick', 'pregnant', 'thyroid surgery', 'I131 treatment', 'query hypothyroid',
          'query hyperthyroid', 'lithium', 'goitre', 'tumor', 'psych', 'hypopituitary']
numeric = ['age', 'TSH', 'TT4', 'T4U', 'FTI']

In [None]:
ct = make_column_transformer(
    (StandardScaler(), numeric),
    (OneHotEncoder(handle_unknown='ignore'), onehot)
)

In [None]:
transformed_X_train = ct.fit_transform(X_train)
transformed_X_test = ct.transform(X_test)
transformed_X_test

In [None]:
# Creating LogisticRegression Classifier
pipe_log = make_pipeline(ct, LogisticRegression(max_iter=1000, C=1))
cv = cross_validate(pipe_log, X_train, y_train, error_score='raise', return_train_score=True)
print("Figure 3: Cross-Validation Scores for the Classifier")
pd.DataFrame(cv)

Cross-validation performs well, so this model will be fitted on the full training set.

In [None]:
lr = LogisticRegression(max_iter=1000, C=1)
X_train_trans = ct.fit_transform(X_train)
X_test_trans = ct.transform(X_test)
lr.fit(X_train_trans, y_train)
train_preds = lr.predict(X_train_trans)
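
An alternative to transforming the data and fitting the classifier in separate steps is to fit a single pipeline, so that preprocessing and model always stay in sync. The following is a minimal sketch on invented data; the `toy_` names and all values are assumptions, and only two of the features are represented.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with one numeric and one categorical feature.
toy_train = pd.DataFrame({"TSH": [0.1, 0.2, 2.0, 2.5, 0.15, 3.0],
                          "sex": ["F", "M", "F", "M", "F", "M"]})
toy_labels = [1, 1, 0, 0, 1, 0]
toy_test = pd.DataFrame({"TSH": [0.12, 2.8], "sex": ["M", "F"]})

toy_pipe = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ["TSH"]),
        (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ),
    LogisticRegression(max_iter=1000, C=1),
)

# fit() scales/encodes the training data and fits the classifier in one call;
# predict() reuses the fitted transformer on the new rows.
toy_pipe.fit(toy_train, toy_labels)
toy_preds = toy_pipe.predict(toy_test)
print(toy_preds)
```

With this layout the test rows are never passed to `fit`, so the scaler and encoder only learn from the training data.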

In [None]:
print("Figure 4: Visualization of Classification with TSH and TT4 concentration on the axes for training set")
grp.visualize_classification(X_train, train_preds).show()

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_train, train_preds)

The model produces a very strong 98% accuracy rate on the training data. 

In [None]:
# Test set
test_preds = lr.predict(X_test_trans)

In [None]:
print("Figure 5: Visualization of Classification with TSH and TT4 concentration on the axes for test set")
grp.visualize_classification(X_test, test_preds).show()

In [None]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, test_preds)

The model also performs quite well on the test set with a 98% accuracy.

In [None]:
# Confusion Matrix
print("Figure 6: Confusion matrix for test predictions")
cm = confusion_matrix(y_test, test_preds)
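# confusion_matrix orders rows and columns by sorted label (0, then 1), so use the fitted class order
disp = ConfusionMatrixDisplay(cm, display_labels=lr.classes_)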
disp = disp.plot()
plt.grid(False)
plt.show()
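
Because false negatives matter most here, it can help to look at recall and the false negative rate alongside accuracy. The sketch below computes them from invented labels in plain Python; the `binary_rates` helper, the toy labels, and the choice of `1` as the disease class are assumptions for illustration.

```python
def binary_rates(y_true, y_pred, positive=1):
    """Count outcomes for a binary problem and derive recall and false negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    actual_pos = tp + fn
    recall = tp / actual_pos if actual_pos else float("nan")
    false_negative_rate = fn / actual_pos if actual_pos else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "recall": recall, "false_negative_rate": false_negative_rate}


# Invented labels: 1 marks the disease class in this toy example.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
toy_rates = binary_rates(toy_true, toy_pred)
print(toy_rates)
```

In this toy case 7 of 10 predictions are correct, yet half of the disease cases are missed, which accuracy alone would not reveal.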

## Discussion
The model we trained has 98% accuracy on the test set. As we were provided with a plethora of measurements specific to the thyroid hormones tied to hyperthyroidism, we expected to be able to train a classifier with a high level of accuracy. We believe that this model could act as a further test to back up a doctor's medical opinion on hyperthyroidism. Additionally, due to the low cost of prediction, it could potentially be an easy resource for people to self-test for the disease, since there are already at-home thyroid tests that measure thyroid hormone levels. By lowering the barrier to entry, more potential hyperthyroidism patients could detect the disease early on instead of waiting for a medical appointment. A future project could model the progression of hyperthyroidism over time. Thyroid hormone levels exist on a spectrum, and simply saying someone has hyperthyroidism or hypothyroidism severely oversimplifies the disease. Other projects could examine whether a model trained on this dataset holds up in a modern setting. The dataset was published in 1987, and a study reported that hyperthyroidism increased between 1987 and 1995 following an increase in salt iodization (Mostbeck et al., 1998).


## References

Allahabadia, A., Daykin, J., Holder, R. L., Sheppard, M. C., Gough, S. C., & Franklyn, J. A. (2000). Age and Gender Predict the Outcome of Treatment for Graves’ Hyperthyroidism. The Journal of Clinical Endocrinology & Metabolism, 85(3), 1038–1042. https://doi.org/10.1210/jcem.85.3.6430

De Leo, S., Lee, S. Y., & Braverman, L. E. (2016). Hyperthyroidism. The Lancet, 388(10047), 906–918. https://doi.org/10.1016/s0140-6736(16)00278-6

Mostbeck, A., Galvan, G., Bauer, P., Eber, O., Atefie, K., Dam, K., Feichtinger, H., Fritzsche, H., Haydl, H., Köhn, H., König, B., Koriska, K., Kroiss, A., Lind, P., Markt, B., Maschek, W., Pesl, H., Ramschak-Schwarzer, S., Riccabona, G., … Zechmann, W. (1998). The incidence of hyperthyroidism in Austria from 1987 to 1995 before and after an increase in salt iodization in 1990. European Journal of Nuclear Medicine and Molecular Imaging, 25(4), 367–374. https://doi.org/10.1007/s002590050234

U.S. Department of Health and Human Services. (2021, August). Hyperthyroidism (overactive thyroid). National Institute of Diabetes and Digestive and Kidney Diseases. Retrieved February 17, 2023, from https://www.niddk.nih.gov/health-information/endocrine-diseases/hyperthyroidism
