# Practice assignment: Handling imbalanced data

This assignment is graded by your `submission.json`.

The cell below creates a valid `submission.json` file; fill in your answers there.

You can press "Submit Assignment" at any time to submit partial progress.

In [50]:
%%file submission.json
{
    "q1": 0.06124,
    "q2": 7,
    "q3": 2,
    "q4": 20,
    "q5": 1,
    "q6": -202.06375,
    "q7": 0.00813,
    "q8": 0.00035,
    "q9": 0.86899,
    "q10": 0.87931,
    "q11": 0.95549,
    "q12": 0.83683,
    "q13": 0.89486,
    "q14": 0.89429,
    "q15": 0.92072,
    "q16": 0.92015,
    "q17": 0.92877,
    "q18": 0.91153
}

Overwriting submission.json


In this programming assignment, you are going to work with a dataset based on the following data:

https://archive.ics.uci.edu/ml/datasets/thyroid+disease

_Citation:_

* _(Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)_

The dataset contains various attributes of patients. Some of them have a thyroid disease (`'Class' = 1`) and some do not (`'Class' = 0`).

The data is imbalanced. In this assignment, you are going to preprocess the data and apply various techniques for imbalanced classification.

In [2]:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.under_sampling import NearMiss, RandomUnderSampler, TomekLinks
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [3]:
df = pd.read_csv('data.csv')

In [4]:
df['Class']

0       0
1       0
2       0
3       0
4       0
       ..
3767    0
3768    0
3769    0
3770    0
3771    0
Name: Class, Length: 3772, dtype: int64

## 1

**q1:** What proportion of patients in this data has a thyroid disease? Provide the answer (a number from 0 to 1), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [5]:
# your code here
# Share of rows with Class == 1 (the mean of a boolean mask)
(df['Class'] == 1).mean()

0.0612407211028632
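
As a minimal sketch of the same computation, assuming a hypothetical toy target (the labels below are made up for illustration and are not taken from `data.csv`), the positive-class proportion is the mean of a boolean mask:

```python
import pandas as pd

# Hypothetical toy target: 2 positives out of 10 patients
toy_class = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], name='Class')

# Comparing with 1 gives a boolean mask; its mean is the share of True values
positive_share = (toy_class == 1).mean()
print(round(positive_share, 5))
```

This avoids relying on the order of `value_counts()`, which puts the most frequent class first rather than class 0.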

## 2

**q2:** How many columns contain missing values (NaN)?

In [6]:
# your code here
df_null = df.isnull().sum()
df_null[df_null>0].shape

(7,)
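
A compact alternative, sketched here on a hypothetical toy frame (column names and values are illustrative assumptions), is to flag each column that has at least one NaN and sum the flags:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' and 'c' contain NaN, 'a' does not
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [1.0, np.nan, 3.0],
    'c': [np.nan, np.nan, np.nan],
})

# isna().any() is True for every column with at least one missing value
n_cols_with_nan = int(toy.isna().any().sum())
print(n_cols_with_nan)
```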

In [7]:
df.isnull().sum()

age                             1
sex                             0
on_thyroxine                    0
query_on_thyroxine              0
on_antithyroid_medication       0
sick                            0
pregnant                        0
thyroid_surgery                 0
I131_treatment                  0
query_hypothyroid               0
query_hyperthyroid              0
lithium                         0
goitre                          0
tumor                           0
hypopituitary                   0
psych                           0
TSH_measured                    0
TSH                           369
T3_measured                     0
T3                            769
TT4_measured                    0
TT4                           231
T4U_measured                    0
T4U                           387
FTI_measured                    0
FTI                           385
TBG_measured                    0
TBG                          3772
referral_source                 0
Class                           0
dtype: int64

## 3

**q3:** How many columns contain only one unique value (count NaNs too)? If the number is bigger than 0, drop these columns.

In [8]:
# your code here
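# Number of unique values per object-typed column (NaNs are not counted here)
df_unique = df.describe(include='object').loc['unique']
# 'TBG_measured' has a single value and 'TBG' is entirely NaN (3772 missing), so both are constant
df.drop(columns=['TBG_measured', 'TBG'], inplace=True)
df_unique.keys().tolist()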

['sex',
 'on_thyroxine',
 'query_on_thyroxine',
 'on_antithyroid_medication',
 'sick',
 'pregnant',
 'thyroid_surgery',
 'I131_treatment',
 'query_hypothyroid',
 'query_hyperthyroid',
 'lithium',
 'goitre',
 'tumor',
 'hypopituitary',
 'psych',
 'TSH_measured',
 'T3_measured',
 'TT4_measured',
 'T4U_measured',
 'FTI_measured',
 'TBG_measured',
 'referral_source']
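
`describe(include='object')` only covers object-typed columns, so a numeric column that is entirely NaN (such as `TBG`) has to be spotted separately. One possible alternative, sketched on a hypothetical toy frame with made-up column names, is `nunique(dropna=False)`, which counts NaN as a value and therefore catches both kinds of constant column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    'flag': ['f', 'f', 'f'],            # one unique string
    'empty': [np.nan, np.nan, np.nan],  # only NaN
    'value': [1.0, 2.0, np.nan],        # varies
})

# With dropna=False, NaN counts as one of the unique values
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

toy = toy.drop(columns=constant_cols)
print(constant_cols, toy.columns.tolist())
```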

In [9]:
df[df.columns[~df.columns.isin(df_unique.keys())]]

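| | age | TSH | T3 | TT4 | T4U | FTI | Class |
|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 1.30 | 2.5 | 125.0 | 1.14 | 109.0 | 0 |
| 1 | 23.0 | 4.10 | 2.0 | 102.0 | NaN | NaN | 0 |
| 2 | 46.0 | 0.98 | NaN | 109.0 | 0.91 | 120.0 | 0 |
| 3 | 70.0 | 0.16 | 1.9 | 175.0 | NaN | NaN | 0 |
| 4 | 70.0 | 0.72 | 1.2 | 61.0 | 0.87 | 70.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3767 | 30.0 | NaN | NaN | NaN | NaN | NaN | 0 |
| 3768 | 68.0 | 1.00 | 2.1 | 124.0 | 1.08 | 114.0 | 0 |
| 3769 | 74.0 | 5.10 | 1.8 | 112.0 | 1.07 | 105.0 | 0 |
| 3770 | 72.0 | 0.70 | 2.0 | 82.0 | 0.94 | 87.0 | 0 |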


## 4

**q4:** Calculate the number of binary columns (only two unique values) with `'object'` data types. Transform them with `LabelEncoder` so that their values become numbers.

In [10]:
# your code here
binary_columns = df_unique[df_unique==2].keys().tolist()
len(binary_columns)

20

In [11]:
binary_columns

['sex',
 'on_thyroxine',
 'query_on_thyroxine',
 'on_antithyroid_medication',
 'sick',
 'pregnant',
 'thyroid_surgery',
 'I131_treatment',
 'query_hypothyroid',
 'query_hyperthyroid',
 'lithium',
 'goitre',
 'tumor',
 'hypopituitary',
 'psych',
 'TSH_measured',
 'T3_measured',
 'TT4_measured',
 'T4U_measured',
 'FTI_measured']

In [12]:
label_encoder = LabelEncoder()
df[binary_columns] = df[binary_columns].apply(lambda x: label_encoder.fit_transform(x))
df

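| | age | sex | on_thyroxine | query_on_thyroxine | on_antithyroid_medication | sick | pregnant | thyroid_surgery | I131_treatment | query_hypothyroid | ... | T3_measured | T3 | TT4_measured | TT4 | T4U_measured | T4U | FTI_measured | FTI | referral_source | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2.5 | 1 | 125.0 | 1 | 1.14 | 1 | 109.0 | SVHC | 0 |
| 1 | 23.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2.0 | 1 | 102.0 | 0 | NaN | 0 | NaN | other | 0 |
| 2 | 46.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | NaN | 1 | 109.0 | 1 | 0.91 | 1 | 120.0 | other | 0 |
| 3 | 70.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1.9 | 1 | 175.0 | 0 | NaN | 0 | NaN | other | 0 |
| 4 | 70.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1.2 | 1 | 61.0 | 1 | 0.87 | 1 | 70.0 | SVI | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3767 | 30.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | NaN | 0 | NaN | 0 | NaN | 0 | NaN | other | 0 |
| 3768 | 68.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2.1 | 1 | 124.0 | 1 | 1.08 | 1 | 114.0 | SVI | 0 |
| 3769 | 74.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1.8 | 1 | 112.0 | 1 | 1.07 | 1 | 105.0 | other | 0 |
| 3770 | 72.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 2.0 | 1 | 82.0 | 1 | 0.94 | 1 | 87.0 | SVI | 0 |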
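
To see which number each original label received, the fitted encoder's `classes_` attribute can be inspected. The sketch below assumes hypothetical string flags `'f'` and `'t'`; the actual labels in `data.csv` may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column with string flags
toy_flag = ['f', 't', 'f', 'f', 't']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_flag)

# classes_ holds the sorted original labels; the position is the encoded value
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping, encoded.tolist())
```

Because the labels are sorted before numbering, the encoding depends only on the set of labels present, not on their order in the column.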


## 5

**q5:** How many categorical columns with the `'object'` data type remain in the data? Encode them with One-Hot encoding (with `pandas.get_dummies()`) the same way as in the programming assignment in week 1.

In [13]:
df_unique

sex                          2
on_thyroxine                 2
query_on_thyroxine           2
on_antithyroid_medication    2
sick                         2
pregnant                     2
thyroid_surgery              2
I131_treatment               2
query_hypothyroid            2
query_hyperthyroid           2
lithium                      2
goitre                       2
tumor                        2
hypopituitary                2
psych                        2
TSH_measured                 2
T3_measured                  2
TT4_measured                 2
T4U_measured                 2
FTI_measured                 2
TBG_measured                 1
referral_source              5
Name: unique, dtype: object

In [14]:
# One-hot encode the remaining categorical column; NaNs get no separate dummy column
referral_source_category = pd.get_dummies(df['referral_source'], dummy_na=False)

In [15]:
df = pd.concat([df, referral_source_category], axis = 1)

In [16]:
df.drop(columns = ['referral_source'], inplace = True)
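
The three cells above (encode, concatenate, drop) can be illustrated end to end on a hypothetical toy frame. The rows below are made up for the example and only reuse category names that appear in the output earlier.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    'age': [41, 23, 46],
    'referral_source': ['SVHC', 'other', 'SVI'],
})

# One indicator column per category
dummies = pd.get_dummies(toy['referral_source'])

# Replace the original column with its indicator columns
toy = pd.concat([toy.drop(columns=['referral_source']), dummies], axis=1)
print(toy.columns.tolist())
```

Each row ends up with exactly one active indicator, so no information from the original column is lost.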

## 6

We have encoded the categorical features, but we still have missing values. Fill them with the number -999.

**q6:** What is the mean value of the `'T3'` column now? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Don't worry if the mean changes significantly after filling the missing values. The value -999 acts as a special "missing" marker, and tree-based models can separate it from real measurements with a single split, so it should have little effect on them.

In [17]:
# your code here
df.fillna(-999, inplace = True)
df.isnull().sum()
df['T3'].mean()

-202.06374867444347
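
The size of the shift is easy to reproduce by hand. The sketch below uses a hypothetical four-value column (not the real `'T3'` values) to show how a sentinel of -999 drags the mean far below every observed measurement:

```python
import numpy as np
import pandas as pd

# Hypothetical column: two observed values, two missing
toy_t3 = pd.Series([2.0, 3.0, np.nan, np.nan])

mean_observed = toy_t3.mean()             # NaNs are skipped: (2 + 3) / 2
mean_filled = toy_t3.fillna(-999).mean()  # (2 + 3 - 999 - 999) / 4
print(mean_observed, mean_filled)
```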

## 7

Finally, we have preprocessed the data. Next, we separate the target from the dataframe with features (`df` -> `X`, `y`).

Split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn`. Test size should be 0.25 of the whole data. Use `random_state=13`, so that your results are reproducible and similar to the original ones.

**q7:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the same proportion in the test set. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [18]:
X = df.drop('Class', axis=1)
y = df['Class']

In [19]:
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 13)

In [20]:
# The task asks for the absolute value of the difference
abs(y_train[y_train == 1].shape[0]/y_train.shape[0] - y_test[y_test==1].shape[0]/y_test.shape[0])

0.008130081300813004

## 8

Now split the data (`X` and `y`) into train and test sets using `train_test_split` from `sklearn` with the same parameters as in task 7, but also add the `stratify=y` parameter for stratification. This may help make the positive class proportions in train and test more similar.

**q8:** Measure the proportion of patients in the train set who have a thyroid disease (as in task 1). Measure the same proportion in the test set. As the answer, provide the absolute value of the difference between these proportions (to compare positive class proportions in train and test), rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Is it bigger or smaller than the similar number in the previous task?

In [21]:
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 13, stratify = y)
# The task asks for the absolute value of the difference
abs(y_train[y_train == 1].shape[0]/y_train.shape[0] - y_test[y_test == 1].shape[0]/y_test.shape[0])

0.0003534817956875255
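
The effect of `stratify` can be checked in isolation. The sketch below assumes a hypothetical target with 60 positives among 1000 samples (not the assignment data) and compares the train/test gap with and without stratification; `proportion_gap` is a helper name introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60 positives among 1000 samples
y_toy = np.array([1] * 60 + [0] * 940)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

def proportion_gap(stratify):
    """Absolute difference of positive-class shares between train and test."""
    _, _, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.25, random_state=13, stratify=stratify
    )
    return abs(y_tr.mean() - y_te.mean())

gap_plain = proportion_gap(None)
gap_stratified = proportion_gap(y_toy)
print(gap_plain, gap_stratified)
```

With class counts that divide evenly by the split ratio, as here, the stratified gap vanishes; on real data a small remainder is usually left because the counts rarely divide exactly.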

## 9

Let's move on to modeling. First, we write two functions to estimate the quality of a machine learning model's predictions on the test set via different metrics.

In this and all the following tasks, use the same train and test sets that you obtained in task 8 (with stratification).

Train a Random Forest classifier from `sklearn` with 50 estimators and `random_state=13`, and let all other parameters have the default values. Fit it on the train set and obtain predictions for the test set. Run the function which computes scores. 

**q9:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

In [22]:
def compute_metrics(y_test, y_pred):
    print('Accuracy: {:.5f}'.format(accuracy_score(y_test, y_pred)))
    print('F-score: {:.5f}'.format(f1_score(y_test, y_pred)))
    print('Precision: {:.5f}'.format(precision_score(y_test, y_pred)))
    print('Recall: {:.5f}'.format(recall_score(y_test, y_pred)))
    print('Accuracy (balanced): {:.5f}'.format(balanced_accuracy_score(y_test, y_pred)))
    print('MCC: {:.5f}'.format(matthews_corrcoef(y_test, y_pred)))

def compute_confusion_matrix(y_test, y_pred):
    compute_metrics(y_test, y_pred)
    return pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1, 0]),
        columns=['a(x) = 1', 'a(x) = 0'],
        index=['y = 1', 'y = 0'],
    ).T

In [23]:
# your code here
clf = RandomForestClassifier(n_estimators=50, random_state=13)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98091
F-score: 0.82692
Precision: 0.93478
Recall: 0.74138
Accuracy (balanced): 0.86899
MCC: 0.82312


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 43 | 3 |
| a(x) = 0 | 15 | 882 |
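
The printed scores can be reproduced by hand from the four confusion matrix counts. The sketch below uses only the standard library, with the counts copied from the matrix above (43 true positives, 3 false positives, 15 false negatives, 882 true negatives); the variable names are chosen for this example.

```python
import math

# Counts read off the confusion matrix above
tp, fp = 43, 3     # row a(x) = 1
fn, tn = 15, 882   # row a(x) = 0

recall = tp / (tp + fn)         # share of sick patients that were found
specificity = tn / (tn + fp)    # share of healthy patients that were cleared
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (recall + specificity) / 2
f_score = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(f'{accuracy:.5f} {balanced_accuracy:.5f} {f_score:.5f} {mcc:.5f}')
```

Plain accuracy is dominated by the 885 healthy patients, while balanced accuracy gives both classes equal weight, which is why the two values differ so much here.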


## 10

In this task, perform the same procedure as in task 9, but with the parameter `class_weight='balanced'` in the Random Forest classifier. 

**q10:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - did setting class weights improve the quality of the model?

In [25]:
# your code here
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.86275
Precision: 1.00000
Recall: 0.75862
Accuracy (balanced): 0.87931
MCC: 0.86418


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 44 | 0 |
| a(x) = 0 | 14 | 885 |
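
According to the scikit-learn documentation, `class_weight='balanced'` weights classes inversely proportional to their frequencies, using `n_samples / (n_classes * np.bincount(y))`. A small sketch of that formula on a hypothetical toy target (90 negatives, 10 positives):

```python
import numpy as np

# Hypothetical imbalanced target
y_toy = np.array([0] * 90 + [1] * 10)

n_samples = len(y_toy)
class_counts = np.bincount(y_toy)   # samples per class: [90, 10]
n_classes = len(class_counts)

# Rare classes get proportionally larger weights
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)
```

With these weights, both classes contribute the same total weight (count times weight) to the training objective.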


## 11

Let's try to balance the train set with different approaches. We will use a special library, `imbalanced-learn` (documentation: https://imbalanced-learn.org/stable/).

In this and all the following tasks, use the same Random Forest classifier setting as in task 10 (with `class_weight='balanced'`).

Let's start with random undersampling (`RandomUnderSampler`). Run it with the default parameter values and `random_state=13` on the initial train data (from task 8) and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q11:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did random undersampling method perform?

In [29]:
# your code here
rus = RandomUnderSampler(random_state = 13)
# fit_resample is the current method name; older releases also accepted fit_sample
X_res, y_res = rus.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.96182
F-score: 0.75342
Precision: 0.62500
Recall: 0.94828
Accuracy (balanced): 0.95549
MCC: 0.75244


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 55 | 33 |
| a(x) = 0 | 3 | 852 |
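
The idea behind random undersampling can be sketched in a few lines of NumPy. This is an illustration of the principle on a hypothetical toy target, not the `imbalanced-learn` implementation: a random subset of the majority class is kept, as large as the minority class.

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical toy data: 95 negatives, 5 positives
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep a random subset of the majority class, as large as the minority class
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_res, y_res = X_toy[keep], y_toy[keep]
print(np.bincount(y_res))
```

Most majority-class rows are discarded, which fits the pattern in the scores above: recall rises sharply while precision drops.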


## 12

Take the second version of `NearMiss` (`version=2`). Run it with `sampling_strategy=0.2`, `n_neighbors=3` and other default parameter values on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q12:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did NearMiss-2 method perform?

In [31]:
# your code here
undersample = NearMiss(version = 2, sampling_strategy=0.2, n_neighbors=3)
X_res, y_res = undersample.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.79958
F-score: 0.35052
Precision: 0.21888
Recall: 0.87931
Accuracy (balanced): 0.83683
MCC: 0.37525


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 51 | 182 |
| a(x) = 0 | 7 | 703 |


## 13

Take the Tomek's links method (`TomekLinks`) with the default parameter values, run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q13:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did Tomek's links method perform? What was the best undersampling approach?

In [35]:
# your code here
tomek_links = TomekLinks()
X_res, y_res = tomek_links.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98409
F-score: 0.85981
Precision: 0.93878
Recall: 0.79310
Accuracy (balanced): 0.89486
MCC: 0.85485


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 46 | 3 |
| a(x) = 0 | 12 | 882 |
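
A Tomek link is a pair of samples from different classes that are each other's nearest neighbours. The sketch below finds such pairs on a hypothetical one-dimensional toy set and drops the majority-class member of each link. It is a brute-force illustration, not the `imbalanced-learn` implementation, and `find_tomek_links` is a helper name introduced only for this example.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbour
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]

# Hypothetical toy data: class 0 is the majority
X_toy = np.array([[0.0], [0.4], [1.0], [2.0], [2.2], [4.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1])

links = find_tomek_links(X_toy, y_toy)

# Drop the majority-class (label 0) member of each link
to_drop = [i for pair in links for i in pair if y_toy[i] == 0]
keep = np.setdiff1d(np.arange(len(y_toy)), to_drop)
print(links, keep.tolist())
```

Only samples on the class boundary are removed, so the train set shrinks very little. That fits the scores above, which stay close to those of task 10.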


## 14

Now let's move on to oversampling. Take the random oversampling approach (`RandomOverSampler`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q14:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did random oversampling method perform?

In [39]:
# your code here
ros = RandomOverSampler(sampling_strategy = 0.8, random_state = 13)
X_res, y_res = ros.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98303
F-score: 0.85185
Precision: 0.92000
Recall: 0.79310
Accuracy (balanced): 0.89429
MCC: 0.84552


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 46 | 4 |
| a(x) = 0 | 12 | 881 |


## 15

Take SMOTE (`SMOTE`) with `sampling_strategy=0.8`, `k_neighbors=5`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q15:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did SMOTE method perform? Was it better than random oversampling?

In [41]:
# your code here
smote = SMOTE(sampling_strategy = 0.8, k_neighbors = 5, random_state = 13)
X_res, y_res = smote.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98727
F-score: 0.89091
Precision: 0.94231
Recall: 0.84483
Accuracy (balanced): 0.92072
MCC: 0.88566


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 49 | 3 |
| a(x) = 0 | 9 | 882 |
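
Unlike random oversampling, which duplicates existing rows, SMOTE creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that single interpolation step on hypothetical 2-D points. It is a simplified illustration rather than the library's implementation, and `synthesize` is a helper name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical minority-class points in 2-D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]])

def synthesize(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dist = np.linalg.norm(points - points[i], axis=1)
    dist[i] = np.inf                    # exclude the sample itself
    neighbours = np.argsort(dist)[:k]
    j = rng.choice(neighbours)
    lam = rng.random()                  # position along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i]), i, j

new_point, i, j = synthesize(minority, k=2, rng=rng)
print(new_point, i, j)
```

The synthetic point always lies on the segment between the two chosen samples, so the minority region is filled in rather than merely reweighted.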


## 16

Take ADASYN (`ADASYN`) with `sampling_strategy=0.8`, `n_neighbors=5`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q16:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did ADASYN method perform? Was it better than SMOTE?

In [43]:
# your code here
adasyn = ADASYN(sampling_strategy = 0.8, n_neighbors=5, random_state = 13)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98621
F-score: 0.88288
Precision: 0.92453
Recall: 0.84483
Accuracy (balanced): 0.92015
MCC: 0.87658


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 49 | 4 |
| a(x) = 0 | 9 | 881 |


## 17

Take the first version of borderline SMOTE (`BorderlineSMOTE`, `kind='borderline-1'`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q17:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

Look at the scores overall. What do you think - how did BorderlineSMOTE-1 method perform? Was it better than SMOTE and ADASYN? What was the best oversampling approach?

In [47]:
# your code here
smote_1 = BorderlineSMOTE(kind='borderline-1', sampling_strategy = 0.8, random_state = 13)
X_res, y_res = smote_1.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98727
F-score: 0.89286
Precision: 0.92593
Recall: 0.86207
Accuracy (balanced): 0.92877
MCC: 0.88674


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 50 | 4 |
| a(x) = 0 | 8 | 881 |


## 18

Finally, check the performance of the combination of oversampling and undersampling. Take SMOTE + Tomek's links (`SMOTETomek`) with `sampling_strategy=0.8`, `random_state=13` and other default parameter values. Run it on the initial train data and modify it. Train a Random Forest classifier on the modified data and obtain the predictions on the test data.

**q18:** What balanced accuracy value do you obtain? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

What do you think, which approach was the best to deal with our data?

In [49]:
# your code here
smote_tomek = SMOTETomek(sampling_strategy = 0.8, random_state = 13)
X_res, y_res = smote_tomek.fit_resample(X_train, y_train)
clf = RandomForestClassifier(n_estimators = 50, class_weight = 'balanced', random_state = 13)
clf.fit(X_res, y_res)
y_pred = clf.predict(X_test)
compute_confusion_matrix(y_test, y_pred)

Accuracy: 0.98515
F-score: 0.87273
Precision: 0.92308
Recall: 0.82759
Accuracy (balanced): 0.91153
MCC: 0.86632


| | y = 1 | y = 0 |
|---|---|---|
| a(x) = 1 | 48 | 4 |
| a(x) = 0 | 10 | 881 |
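
To help answer the final question, the balanced accuracy values printed in tasks 9 to 18 can be collected and ranked. The short labels in the sketch below are chosen for this example; the numbers are copied from the outputs above.

```python
# Balanced accuracy per approach, copied from the outputs of tasks 9-18
balanced_accuracy = {
    'Random Forest (default)': 0.86899,
    "Random Forest (class_weight='balanced')": 0.87931,
    'RandomUnderSampler': 0.95549,
    'NearMiss-2': 0.83683,
    'TomekLinks': 0.89486,
    'RandomOverSampler': 0.89429,
    'SMOTE': 0.92072,
    'ADASYN': 0.92015,
    'BorderlineSMOTE-1': 0.92877,
    'SMOTETomek': 0.91153,
}

# Sort from best to worst
ranking = sorted(balanced_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f'{score:.5f}  {name}')
```

Balanced accuracy alone puts random undersampling first, but its precision (0.62500) is far below that of the oversampling methods, and BorderlineSMOTE-1 has the highest MCC (0.88674), so the other printed metrics are worth weighing too.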
