## Dealing with an Imbalanced Dataset

**We study imbalanced data by introducing the concepts of upsampling and downsampling.**
**We then fit a K-Nearest Neighbours classifier, tune its hyperparameter with GridSearchCV, and identify the best model.**

### Accessing the dataset

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
bank = pd.read_csv('bank.csv')
bank

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [4]:
bank['y'].value_counts()

no     39922
yes     5289
Name: y, dtype: int64

In [5]:
5289/(5289+39922)

0.11698480458295547

**The Yes class makes up about 11.70% of the rows and the No class about 88.30%.**

**An imbalanced dataset is one in which the classes of the target variable are far from equally represented, so that one class dominates the others.**
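
The class shares can also be read directly from pandas instead of being computed by hand. The sketch below rebuilds only the target column from the counts printed above (it does not load `bank.csv`), and the name `imbalance_ratio` is introduced here purely for illustration.

```python
import pandas as pd

# Rebuild only the target column from the class counts printed above.
y = pd.Series(["no"] * 39922 + ["yes"] * 5289, name="y")

# normalize=True returns proportions instead of raw counts.
class_share = y.value_counts(normalize=True)
print(class_share.round(4))

# Imbalance ratio: majority-class count divided by minority-class count.
counts = y.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio (no : yes) is about {imbalance_ratio:.2f} : 1")
```

On the real data the equivalent call would be `bank['y'].value_counts(normalize=True)`.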

### Resolving the Imbalance

In [7]:
# Splitting the data into 2 according to the class

bank_yes = bank[bank['y'] == 'yes']
bank_yes.shape

(5289, 17)

In [8]:
bank_no = bank[bank['y'] == 'no']
bank_no.shape

(39922, 17)

### Upsampling

**Upsampling is applied to the minority class in order to increase its number of data points by sampling existing rows with replacement. In this case we do it for the class "Yes".**

In [11]:
from sklearn.utils import resample

bank_yes_up = resample(bank_yes, replace = True, random_state = 100, n_samples = 15000)
bank_yes_up.shape

(15000, 17)

In [12]:
bank_yes_up

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
42571,38,admin.,married,secondary,no,11303,no,no,cellular,28,dec,473,2,216,2,failure,yes
3190,41,blue-collar,married,secondary,no,1384,yes,no,unknown,15,may,1162,4,-1,0,unknown,yes
10049,42,entrepreneur,married,tertiary,no,5345,no,no,unknown,11,jun,878,3,-1,0,unknown,yes
31962,31,blue-collar,married,secondary,no,1406,yes,yes,cellular,13,apr,1091,2,-1,0,unknown,yes
43024,51,management,married,tertiary,no,346,no,no,cellular,12,feb,122,1,92,5,success,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38934,24,services,single,secondary,no,414,yes,no,cellular,18,may,493,1,370,1,other,yes
42798,62,entrepreneur,married,secondary,no,3904,no,no,telephone,29,jan,403,2,-1,0,unknown,yes
31333,58,self-employed,married,tertiary,no,5810,no,no,cellular,12,mar,139,1,-1,0,unknown,yes
42557,33,management,married,tertiary,no,1808,no,no,cellular,28,dec,250,3,-1,0,unknown,yes
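
Because 15000 rows are drawn from only 5289 originals, the upsampled frame necessarily contains repeated rows. This adds copies, not new information. A minimal sketch on hypothetical toy data (the `minority` frame below is made up for illustration) shows the effect:

```python
import pandas as pd
from sklearn.utils import resample

# Toy minority class with 5 distinct rows (hypothetical data).
minority = pd.DataFrame({"age": [25, 31, 47, 52, 60], "y": ["yes"] * 5})

# replace=True samples with replacement, so n_samples may exceed len(minority).
minority_up = resample(minority, replace=True, n_samples=12, random_state=100)

print(minority_up.shape)
# Every upsampled row is a copy of an original row, so index labels repeat.
print(minority_up.index.value_counts().sort_index())
```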


### Downsampling

**Downsampling is applied to the majority class in order to decrease its number of data points by keeping a random subset of rows, sampled without replacement. In this case we do it for the class "No".**

In [13]:
bank_no_down = resample(bank_no, replace = False, random_state = 100, n_samples = 25000)
bank_no_down.shape

(25000, 17)

In [14]:
bank_no_down

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
9789,39,blue-collar,married,tertiary,no,10483,no,no,unknown,9,jun,218,1,-1,0,unknown,no
44899,77,retired,married,tertiary,no,0,no,no,cellular,27,sep,990,4,-1,0,unknown,no
2331,28,blue-collar,married,secondary,no,-95,yes,no,unknown,13,may,200,1,-1,0,unknown,no
15271,41,management,single,tertiary,no,0,yes,no,cellular,17,jul,509,7,-1,0,unknown,no
10642,38,services,divorced,secondary,no,439,no,no,unknown,16,jun,50,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27927,43,unknown,single,unknown,no,181,no,no,telephone,28,jan,41,1,-1,0,unknown,no
18014,40,management,divorced,tertiary,no,69,yes,no,cellular,30,jul,149,2,-1,0,unknown,no
35076,40,management,married,tertiary,no,429,yes,no,cellular,6,may,222,2,363,4,failure,no
11238,44,blue-collar,single,unknown,no,4330,no,no,unknown,18,jun,16,9,-1,0,unknown,no
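
With `replace = False` every selected row is unique, and `n_samples` cannot exceed the number of available rows. The following sketch uses a hypothetical 20-row `majority` frame to illustrate both points:

```python
import pandas as pd
from sklearn.utils import resample

# Toy majority class with 20 distinct rows (hypothetical data).
majority = pd.DataFrame({"age": range(20, 40), "y": ["no"] * 20})

# replace=False draws a subset, so no row can be picked twice.
majority_down = resample(majority, replace=False, n_samples=8, random_state=100)
print(majority_down.shape)
print(majority_down.index.is_unique)

# Asking for more rows than exist raises an error when replace=False.
error_raised = False
try:
    resample(majority, replace=False, n_samples=25, random_state=100)
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```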


### Creating a dataset by combining the two classes

In [15]:
bank_new = pd.concat([bank_yes_up, bank_no_down])
bank_new.shape

(40000, 17)
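
Note that the combined data is less skewed than the original but still not perfectly balanced: 15000 of the 40000 rows are "yes". A quick check, rebuilt here from the two sample sizes chosen above rather than from the real frame:

```python
import pandas as pd

# Class sizes chosen in the upsampling and downsampling steps above.
n_yes, n_no = 15000, 25000
y_new = pd.Series(["yes"] * n_yes + ["no"] * n_no, name="y")

# Proportion of each class in the combined dataset.
new_share = y_new.value_counts(normalize=True)
print(new_share)
```

On the notebook's own data the same check would be `bank_new['y'].value_counts(normalize=True)`.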

In [17]:
bank_new.head(25)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
42571,38,admin.,married,secondary,no,11303,no,no,cellular,28,dec,473,2,216,2,failure,yes
3190,41,blue-collar,married,secondary,no,1384,yes,no,unknown,15,may,1162,4,-1,0,unknown,yes
10049,42,entrepreneur,married,tertiary,no,5345,no,no,unknown,11,jun,878,3,-1,0,unknown,yes
31962,31,blue-collar,married,secondary,no,1406,yes,yes,cellular,13,apr,1091,2,-1,0,unknown,yes
43024,51,management,married,tertiary,no,346,no,no,cellular,12,feb,122,1,92,5,success,yes
18015,55,management,married,tertiary,no,-375,no,no,cellular,30,jul,814,2,-1,0,unknown,yes
44802,31,management,single,tertiary,no,3340,no,no,cellular,15,sep,213,2,469,3,success,yes
39724,43,admin.,married,secondary,no,132,no,no,cellular,27,may,187,2,71,1,success,yes
43431,78,retired,divorced,primary,no,1389,no,no,cellular,8,apr,335,1,-1,0,unknown,yes
31180,25,technician,single,secondary,no,1231,yes,no,cellular,27,feb,412,5,-1,0,unknown,yes


In [18]:
bank_new.tail(25)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
8935,39,blue-collar,married,primary,no,0,yes,no,unknown,4,jun,112,4,-1,0,unknown,no
36257,40,services,married,secondary,no,51,yes,no,cellular,11,may,50,1,-1,0,unknown,no
16162,32,management,married,tertiary,no,663,no,yes,cellular,22,jul,62,3,-1,0,unknown,no
37035,36,technician,married,tertiary,no,421,yes,no,cellular,13,may,793,5,-1,0,unknown,no
13823,33,technician,married,tertiary,no,1746,yes,no,cellular,10,jul,184,1,-1,0,unknown,no
8260,54,self-employed,married,primary,no,277,yes,no,unknown,2,jun,360,3,-1,0,unknown,no
28590,28,technician,married,tertiary,no,203,no,yes,cellular,29,jan,188,3,-1,0,unknown,no
12183,40,blue-collar,married,secondary,no,95,no,yes,unknown,20,jun,61,4,-1,0,unknown,no
17209,37,management,single,unknown,no,242,no,yes,cellular,28,jul,124,3,-1,0,unknown,no
39681,25,management,single,tertiary,no,430,no,yes,cellular,27,may,145,2,-1,0,unknown,no


### Shuffling the dataset

In [19]:
from sklearn.utils import shuffle

bank_new = shuffle(bank_new)
bank_new.shape

(40000, 17)
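
The `shuffle` call above has no `random_state`, so the row order (and everything computed after it) may differ from run to run. A minimal sketch of a reproducible shuffle on hypothetical toy data, together with a pandas-only alternative:

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame: 3 "yes" rows followed by 3 "no" rows.
df = pd.DataFrame({"x": range(6), "y": ["yes"] * 3 + ["no"] * 3})

# Same random_state -> same row order on every run.
a = shuffle(df, random_state=100)
b = shuffle(df, random_state=100)
print(a.index.tolist() == b.index.tolist())

# pandas-only alternative: sample all rows without replacement.
c = df.sample(frac=1, random_state=100)
print(sorted(c.index) == sorted(df.index))
```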

In [20]:
bank_new.head(25)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
36373,33,services,single,secondary,no,-530,yes,yes,telephone,11,may,168,5,-1,0,unknown,no
16791,25,entrepreneur,married,tertiary,no,30,yes,no,cellular,24,jul,254,2,-1,0,unknown,no
8084,35,services,married,secondary,no,152,yes,no,unknown,2,jun,563,1,-1,0,unknown,yes
44095,71,retired,divorced,unknown,no,392,no,no,telephone,7,jul,276,2,-1,0,unknown,yes
44556,35,management,married,tertiary,no,2717,no,no,cellular,13,aug,394,6,-1,0,unknown,yes
12320,27,blue-collar,married,tertiary,no,335,yes,no,unknown,26,jun,519,3,-1,0,unknown,yes
36526,41,blue-collar,married,secondary,no,849,yes,yes,cellular,12,may,100,4,-1,0,unknown,no
37887,37,services,married,secondary,no,6089,yes,no,cellular,14,may,616,2,-1,0,unknown,yes
13400,36,blue-collar,married,secondary,no,56,yes,no,cellular,9,jul,183,1,-1,0,unknown,no
787,46,management,married,tertiary,no,0,no,no,unknown,7,may,70,2,-1,0,unknown,no


### Splitting into target and features

In [21]:
y_tar = bank_new['y']
y_tar.shape

(40000,)

In [22]:
X = bank_new.drop(['y'], axis = 1)
X.shape

(40000, 16)

In [23]:
X

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
36373,33,services,single,secondary,no,-530,yes,yes,telephone,11,may,168,5,-1,0,unknown
16791,25,entrepreneur,married,tertiary,no,30,yes,no,cellular,24,jul,254,2,-1,0,unknown
8084,35,services,married,secondary,no,152,yes,no,unknown,2,jun,563,1,-1,0,unknown
44095,71,retired,divorced,unknown,no,392,no,no,telephone,7,jul,276,2,-1,0,unknown
44556,35,management,married,tertiary,no,2717,no,no,cellular,13,aug,394,6,-1,0,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26656,52,housemaid,married,secondary,no,3923,yes,no,cellular,20,nov,20,8,190,3,success
44162,31,technician,married,tertiary,no,2166,no,no,cellular,13,jul,577,6,182,2,success
37472,42,management,married,tertiary,no,1162,yes,no,cellular,13,may,406,5,364,1,other
18509,54,admin.,married,secondary,no,812,yes,yes,cellular,31,jul,83,6,-1,0,unknown


In [24]:
# Converting the categorical features to numeric

X_new = pd.get_dummies(X)
X_new.shape

(40000, 51)

In [25]:
X_new

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
36373,33,-530,11,168,5,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
16791,25,30,24,254,2,-1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
8084,35,152,2,563,1,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
44095,71,392,7,276,2,-1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
44556,35,2717,13,394,6,-1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26656,52,3923,20,20,8,190,3,0,0,0,...,0,0,0,1,0,0,0,0,1,0
44162,31,2166,13,577,6,182,2,0,0,0,...,0,0,0,0,0,0,0,0,1,0
37472,42,1162,13,406,5,364,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
18509,54,812,31,83,6,-1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
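
`pd.get_dummies` leaves the 7 numeric columns untouched and expands each of the 9 categorical columns into one indicator column per category, which is how 16 columns become 51. The sketch below shows the mechanics on a hypothetical three-row frame. `dtype=int` is passed so that the output is 0/1 regardless of the pandas version's default, and `drop_first=True` is shown only as an optional variant, not something the notebook uses.

```python
import pandas as pd

# Hypothetical toy features: one numeric and two categorical columns.
X_toy = pd.DataFrame({
    "age": [33, 25, 35],
    "marital": ["single", "married", "married"],
    "housing": ["yes", "yes", "no"],
})

# One indicator column per category; numeric columns pass through unchanged.
X_dummies = pd.get_dummies(X_toy, dtype=int)
print(X_dummies.columns.tolist())

# Optional variant: drop the first level of each categorical column.
X_dummies_k1 = pd.get_dummies(X_toy, dtype=int, drop_first=True)
print(X_dummies_k1.columns.tolist())
```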


### Standardization of features

In [26]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_sc = sc.fit_transform(X_new)
X_sc

array([[-0.70650057, -0.60798423, -0.55417447, ..., -0.21921049,
        -0.28888227,  0.55166443],
       [-1.40079886, -0.4412374 ,  0.99410664, ..., -0.21921049,
        -0.28888227,  0.55166443],
       [-0.532926  , -0.4049104 , -1.6260614 , ..., -0.21921049,
        -0.28888227,  0.55166443],
       ...,
       [ 0.07458499, -0.10417057, -0.31597738, ...,  4.56182538,
        -0.28888227, -1.81269616],
       [ 1.11603242, -0.20838734,  1.82779647, ..., -0.21921049,
        -0.28888227,  0.55166443],
       [ 3.98001283,  0.20192899, -0.43507593, ..., -0.21921049,
        -0.28888227,  0.55166443]])

### Splitting the data into train and test

In [28]:
from sklearn.model_selection import train_test_split

X_sc_train, X_sc_test, y_tar_train, y_tar_test = train_test_split(X_sc, y_tar, test_size = 0.2, random_state = 100)

X_sc_train.shape, X_sc_test.shape, y_tar_train.shape, y_tar_test.shape

((32000, 51), (8000, 51), (32000,), (8000,))
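
In the cells above the scaler is fitted on all 40000 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data (all `*_demo`, `X_tr` and `X_te` names are hypothetical), is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and a binary target.
rng = np.random.default_rng(100)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first; stratify keeps the class shares similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_sc = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_sc.mean(axis=0).round(6))  # close to 0 by construction
print(X_te_sc.mean(axis=0).round(3))  # generally not exactly 0
```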

### Model Building - K Nearest Neighbor Classifier

In [29]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_sc_train, y_tar_train)

KNeighborsClassifier()

### Model performance

In [31]:
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score

y_pred = knn.predict(X_sc_test)  # predict once and reuse for both metrics
cm = confusion_matrix(y_tar_test, y_pred)
report = classification_report(y_tar_test, y_pred)

print("The Confusion matrix is: \n", cm)
print("\n")
print("The report is:\n", report)

The Confusion matrix is: 
 [[4285  698]
 [ 527 2490]]


The report is:
               precision    recall  f1-score   support

          no       0.89      0.86      0.87      4983
         yes       0.78      0.83      0.80      3017

    accuracy                           0.85      8000
   macro avg       0.84      0.84      0.84      8000
weighted avg       0.85      0.85      0.85      8000
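
The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1 for the "yes" class and the overall accuracy from the four counts printed above:

```python
import numpy as np

# Rows are true classes, columns are predicted classes, in the order ["no", "yes"].
cm = np.array([[4285, 698],
               [527, 2490]])

# Treating "yes" as the positive class.
tn, fp, fn, tp = cm.ravel()

precision_yes = tp / (tp + fp)   # of all predicted "yes", how many were right
recall_yes = tp / (tp + fn)      # of all true "yes", how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)
accuracy = (tp + tn) / cm.sum()

print(f"precision(yes) = {precision_yes:.2f}")
print(f"recall(yes)    = {recall_yes:.2f}")
print(f"f1(yes)        = {f1_yes:.2f}")
print(f"accuracy       = {accuracy:.2f}")
```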



### Hyperparameter tuning using GridSearchCV

In [34]:
from sklearn.model_selection import GridSearchCV

knn_gs = GridSearchCV(knn, {'n_neighbors': range(3, 8)})  # Arguments: estimator (the model) and param_grid (values to try, here k = 3 to 7)
knn_gs.fit(X_sc_train, y_tar_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(3, 8)})

In [35]:
knn_gs.best_params_

{'n_neighbors': 3}
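
By default `GridSearchCV` uses 5-fold cross-validation, scores a classifier by accuracy, and refits the best configuration on the whole training set, so `best_estimator_` can be used for prediction directly. The sketch below makes those choices explicit on synthetic data. The pipeline step names (`scale`, `knn`), the `f1_macro` scoring choice, and the `*_demo` data are assumptions for illustration and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data standing in for the bank features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=100
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=100
)

# Scaling inside the pipeline is refitted on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": range(3, 8)},  # same k range as above
    cv=5,                # number of cross-validation folds
    scoring="f1_macro",  # assumption: a metric less dominated by the majority class
)
search.fit(X_tr, y_tr)

print(search.best_params_)
best_model = search.best_estimator_   # already refitted on all of X_tr
test_accuracy = best_model.score(X_te, y_te)
print(test_accuracy)
```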

### Creating the best KNN Model

In [36]:
knn_best = KNeighborsClassifier(n_neighbors = 3)
knn_best.fit(X_sc_train, y_tar_train)

KNeighborsClassifier(n_neighbors=3)

In [37]:
y_pred_best = knn_best.predict(X_sc_test)  # predict once and reuse for both metrics
report = classification_report(y_tar_test, y_pred_best)
cm = confusion_matrix(y_tar_test, y_pred_best)
print("The Confusion matrix is: \n", cm)
print("\n")
print("The report is:\n", report)

The Confusion matrix is: 
 [[4313  670]
 [ 369 2648]]


The report is:
               precision    recall  f1-score   support

          no       0.92      0.87      0.89      4983
         yes       0.80      0.88      0.84      3017

    accuracy                           0.87      8000
   macro avg       0.86      0.87      0.86      8000
weighted avg       0.87      0.87      0.87      8000
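
One caveat when reading these scores: the resampling was done before `train_test_split`, so copies of the same upsampled "yes" row can land in both the training and the test set. A nearest-neighbour model with a small `k` can match such copies exactly, which may make the results look better than they would on unseen data. A common alternative is to split first and resample only the training part. The sketch below shows that order on hypothetical toy data (the frame `toy` and all sample sizes are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced frame: 120 "yes" rows and 880 "no" rows.
rng = np.random.default_rng(100)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": ["yes"] * 120 + ["no"] * 880,
})

# 1. Split first; the test set keeps the original class distribution.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["y"], random_state=100
)

# 2. Resample only the training rows.
train_yes = train[train["y"] == "yes"]
train_no = train[train["y"] == "no"]
train_yes_up = resample(train_yes, replace=True, n_samples=300, random_state=100)
train_no_down = resample(train_no, replace=False, n_samples=500, random_state=100)

# 3. Combine and shuffle the resampled training data.
train_bal = pd.concat([train_yes_up, train_no_down]).sample(frac=1, random_state=100)

# No test row can appear in the resampled training data.
overlap = set(train_bal.index) & set(test.index)
print(len(overlap))
print(test["y"].value_counts(normalize=True).round(2))
```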

