<h1>Thyroid Tumor Outlier Detection</h1>

The goal of this project is to examine thyroid tumor data and to detect outliers using both supervised and unsupervised methods. The data comes from the UC Irvine Machine Learning Repository. Results are discussed after each section.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('annthyroid_unsupervised_anomaly_detection.csv', delimiter=';')
print(df.shape)
df.head()

(6916, 24)


Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,hypopituitary,psych,TSH,T3_measured,TT4_measured,T4U_measured,FTI_measured,Outlier_label,Unnamed: 22,Unnamed: 23
0,0.45,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,61.0,6.0,23.0,87.0,26.0,o,,
1,0.61,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,29.0,15.0,61.0,96.0,64.0,o,,
2,0.16,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,29.0,19.0,58.0,103.0,56.0,o,,
3,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,114.0,3.0,24.0,61.0,39.0,o,,
4,0.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,49.0,3.0,5.0,116.0,4.0,o,,


In [3]:
df = df.drop(['Unnamed: 22','Unnamed: 23'], axis=1)

In [4]:
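# Note: the label column name in the CSV carries a trailing space.
# Encode outliers ('o') as 1 and normal points ('n') as 0.
df['Outlier_label '] = df['Outlier_label '].map({'o':1,'n':0})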
df.head()

Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,goitre,tumor,hypopituitary,psych,TSH,T3_measured,TT4_measured,T4U_measured,FTI_measured,Outlier_label
0,0.45,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,61.0,6.0,23.0,87.0,26.0,1
1,0.61,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,29.0,15.0,61.0,96.0,64.0,1
2,0.16,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,29.0,19.0,58.0,103.0,56.0,1
3,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,114.0,3.0,24.0,61.0,39.0,1
4,0.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,49.0,3.0,5.0,116.0,4.0,1


In [5]:
target_df = df.pop('Outlier_label ')

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target_df, stratify=target_df)
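Note that the cell above fits the scaler on the full dataset before splitting, so the test rows contribute to the scaling statistics. For a tree-based model this is unlikely to matter much, since Random Forests are insensitive to feature scaling, but a stricter setup splits first and scales inside a pipeline. The following is a minimal sketch of that idea on synthetic data; `X_demo`, `y_demo`, and the other names are illustrative assumptions and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the thyroid features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Split first, so the test rows never influence the scaler's mean and variance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when transforming the test rows at prediction time.
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.2f}")
```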

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

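# Fit a Random Forest with default hyperparameters on the training split
rand_forest = RandomForestClassifier()
rand_forest.fit(X_train, y_train)

# Evaluate on the held-out split; classification_report accepts the
# prediction array directly, so no DataFrame wrapper is needed
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test, y_pred))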

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      1667
           1       0.96      0.81      0.88        62

    accuracy                           0.99      1729
   macro avg       0.98      0.90      0.94      1729
weighted avg       0.99      0.99      0.99      1729



As we can see above, an out-of-the-box Random Forest Classifier performs very well overall. The one concerning metric is the recall on the outlier class. Here, 0 is a normal data point, and 1 is an outlier. Recall is defined as:

$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

A recall of 0.81 on the outlier class means that roughly one in five true outliers is missed, i.e. predicted as normal (a false negative). Since we are predicting anomalous thyroid tumor data, false negatives on anomalous data points are very undesirable.
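To make the recall formula concrete, here is a small worked example on hypothetical labels (the arrays `y_true_demo` and `y_pred_demo` are made up for illustration and are not taken from the thyroid data). It counts true positives and false negatives from a confusion matrix and checks the result against scikit-learn's `recall_score`.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = outlier, 0 = normal (illustrative only).
y_true_demo = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall = TP / (TP + FN): the share of true outliers that were caught.
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true_demo, y_pred_demo)

print(f"TP={tp}, FN={fn}, recall={recall_manual:.2f}")
```

In this toy case four of the five outliers are caught and one is missed, so the single false negative is what pulls recall below 1.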

This could be improved further by hyperparameter tuning. In particular, a GridSearchCV could be employed to find the best hyperparameters for the model. Unfortunately, this is beyond the scope of this project due to the number of features and the runtime on my local machine.
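For reference, a grid search of this kind might look like the sketch below. It was not run as part of this project: the data is synthetic, the grid values are arbitrary assumptions kept small to limit runtime, and `scoring="recall"` is chosen here only because recall on the outlier class is the metric of concern above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the thyroid data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# A deliberately small, illustrative grid; real values would need more thought.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "class_weight": [None, "balanced"],
}

# Optimise recall on the positive (outlier) class with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated recall: {search.best_score_:.2f}")
```

The cost grows with the product of the grid sizes times the number of folds, which is why a full search can be slow on a local machine.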

Below, we first look at which features the Random Forest relies on most, and then examine an unsupervised method to determine outliers:

In [8]:
importance_df = pd.DataFrame(data=rand_forest.feature_importances_, index=df.columns, columns=['importance'])
importance_df.query('importance > .01')

Unnamed: 0,importance
Age,0.057386
on_thyroxine,0.046454
TSH,0.490968
T3_measured,0.136596
TT4_measured,0.07883
T4U_measured,0.053976
FTI_measured,0.087123


Above we have measured the importance of each feature when predicting whether a data point is anomalous, keeping only features with an importance above 0.01. The most significant feature in this model is TSH (about 0.49), followed by T3_measured (about 0.14).
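The values in `feature_importances_` are impurity-based and are computed from the training data. One way to cross-check such a ranking is permutation importance on held-out data, which measures how much the score drops when a single feature is shuffled. The sketch below illustrates the call on synthetic data; it was not run on the thyroid dataset, and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 6 features, 3 of them informative (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test split and record the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

# Print features from most to least important by mean score drop.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```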

In [9]:
target_df = target_df.map({1:-1,0:1})
target_df.value_counts()

 1    6666
-1     250
Name: Outlier_label , dtype: int64

In [10]:
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest()
iso_forest.fit(scaled_df)
y_pred_un = iso_forest.predict(scaled_df)
print(classification_report(target_df, y_pred_un))

              precision    recall  f1-score   support

          -1       0.05      0.06      0.06       250
           1       0.96      0.96      0.96      6666

    accuracy                           0.93      6916
   macro avg       0.51      0.51      0.51      6916
weighted avg       0.93      0.93      0.93      6916



We see above that the out-of-the-box Isolation Forest algorithm performs only slightly worse than the Random Forest classifier if we measure by accuracy alone (0.93 versus 0.99). Here, -1 is anomalous, and 1 is normal. Precision is defined as:

$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

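A precision of 0.05 on the anomalous class, coupled with a recall of 0.06, means that the model correctly identifies almost none of the anomalous data points: most points it flags are actually normal, and most true anomalies go unflagged. With its default settings, an Isolation Forest is therefore not a useful algorithm for this problem.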
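One knob that was not explored here is the `contamination` parameter of `IsolationForest`, which sets the share of points the model labels as anomalous (in recent scikit-learn versions the default is `'auto'`). The sketch below shows how it could be set to a known outlier fraction and how the continuous anomaly scores could be evaluated with ROC AUC. It uses synthetic data in which the outliers are far from the normal points by construction, so it says nothing about whether this would help on the thyroid data; all names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, roc_auc_score

# Synthetic data: 960 normal points and 40 shifted outliers (assumption).
rng = np.random.RandomState(0)
X_normal = rng.normal(0, 1, size=(960, 4))
X_outlier = rng.normal(5, 1, size=(40, 4))
X_demo = np.vstack([X_normal, X_outlier])

# Same label convention as above: 1 = normal, -1 = anomalous.
y_demo = np.r_[np.ones(960, dtype=int), -np.ones(40, dtype=int)]

# Tell the model roughly what fraction of points to flag as anomalous.
contamination = 40 / 1000
iso = IsolationForest(contamination=contamination, random_state=0).fit(X_demo)
y_pred_demo = iso.predict(X_demo)
print(classification_report(y_demo, y_pred_demo))

# score_samples returns lower values for more abnormal points, so negate it
# to obtain a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X_demo)
auc = roc_auc_score(y_demo == -1, anomaly_score)
print(f"ROC AUC: {auc:.2f}")
```

Evaluating the scores rather than the hard labels separates two questions: whether the model ranks anomalies above normal points at all, and whether the decision threshold is set sensibly.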