# Lab | Handling Data Imbalance in Classification Models

### Begin the Modeling here

Look critically at the dtypes of numerical and categorical columns and make changes where appropriate.

Concatenate numerical and categorical back together again for your X dataframe. Designate TARGET_B as y.

Split the data into a training set and a test set.

Split further into train_num and train_cat. Also test_num and test_cat.

Scale the features either by using MinMax Scaler or a Standard Scaler. (train_num, test_num)

Encode the categorical features using One-Hot Encoding or Ordinal Encoding. (train_cat, test_cat)

- fit only on train data, transform both train and test

- again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test

Fit a logistic regression model on the training data.

Check the accuracy on the test data.

#### Note: So far we have not balanced the data.

### Managing imbalance in the dataset

Check for the imbalance.

Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.

Each time fit the model and see how the accuracy of the model has changed.

### Preparing the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# files were output from previous lab (lab-revisiting-machine-learning.ipynb)

numerical = pd.read_csv('numerical.csv').drop(['Unnamed: 0'], axis=1)
categorical = pd.read_csv('categorical.csv').drop(['Unnamed: 0'], axis=1)
target = pd.read_csv('target.csv').drop(['Unnamed: 0'], axis=1)

display(numerical.shape)
display(categorical.shape)
target

(95412, 321)

(95412, 10)

Unnamed: 0,TARGET_B,TARGET_D
0,0,0.0
1,0,0.0
2,0,0.0
3,0,0.0
4,0,0.0
...,...,...
95407,0,0.0
95408,0,0.0
95409,0,0.0
95410,1,18.0


In [3]:
display(categorical.dtypes)
categorical.head()

STATE       object
CLUSTER      int64
HOMEOWNR    object
GENDER      object
DATASRCE     int64
RFA_2R      object
RFA_2A      object
GEOCODE2    object
DOMAIN_A    object
DOMAIN_B     int64
dtype: object

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B
0,IL,36,U,F,0,L,E,C,T,2
1,CA,14,H,M,3,L,G,A,S,1
2,NC,43,U,M,3,L,E,C,R,2
3,CA,44,U,F,3,L,E,C,R,2
4,FL,16,H,F,3,L,F,A,S,2


In [4]:
categorical['CLUSTER'] = categorical['CLUSTER'].astype('object')
categorical['DATASRCE'] = categorical['DATASRCE'].astype('object')
# I will treat DOMAIN_B as ordinally encoded, so it stays numeric and gets scaled with the numerical columns

categorical.dtypes

STATE       object
CLUSTER     object
HOMEOWNR    object
GENDER      object
DATASRCE    object
RFA_2R      object
RFA_2A      object
GEOCODE2    object
DOMAIN_A    object
DOMAIN_B     int64
dtype: object

In [5]:
y = target['TARGET_B']
y.value_counts()

0    90569
1     4843
Name: TARGET_B, dtype: int64

In [6]:
X = pd.concat([numerical, categorical], axis = 1)
X.shape

(95412, 331)

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 123)

X_train_num = X_train.select_dtypes(np.number).copy()
X_test_num = X_test.select_dtypes(np.number).copy()
X_train_cat = X_train.select_dtypes(object).copy()
X_test_cat = X_test.select_dtypes(object).copy()

display(X_train_cat.shape) # one column (DOMAIN_B) has been moved over to numeric
X_train_cat.head()

(76329, 9)

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A
26355,other,36,U,F,0,L,G,B,T
3034,GA,40,H,other,3,L,E,A,T
21143,FL,18,U,M,0,L,F,A,S
46939,other,53,H,F,2,L,G,C,R
73809,other,27,U,M,0,L,G,B,C


In [8]:
X_train_num.head()

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,LASTGIFT,LASTDATE,FISTDATE,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,DOMAIN_B
26355,8601,2,1412,83.0,5.0,5.0,0,1,48,38,...,50.0,9511,8702,9.0,33.461538,141909,0,2,20.0,2
3034,8601,0,0,61.611649,6.0,8.0,1,0,20,32,...,12.0,9601,8706,7.0,10.375,27010,0,1,15.0,2
21143,9401,1,0,61.611649,5.0,5.0,0,1,24,38,...,16.0,9511,9401,10.0,12.0,38038,0,1,21.0,2
46939,8601,0,4407,53.0,1.0,5.0,0,0,62,20,...,25.0,9505,8706,10.0,8.428571,137020,0,1,62.0,3
73809,9401,0,4207,55.0,5.0,5.0,0,0,28,18,...,25.0,9512,9310,9.0,23.75,51546,0,1,35.0,2


In [9]:
y_train.head()

26355    0
3034     1
21143    0
46939    0
73809    0
Name: TARGET_B, dtype: int64

In [10]:
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(X_train_num) # fit only on the training set, but applied to both train and test (below)

X_train_scaled = pd.DataFrame(transformer.transform(X_train_num), columns=X_train_num.columns)
X_test_scaled = pd.DataFrame(transformer.transform(X_test_num), columns=X_train_num.columns)

X_train_scaled.head()

Unnamed: 0,ODATEDW,TCODE,DOB,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,...,LASTGIFT,LASTDATE,FISTDATE,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,DOMAIN_B
0,0.21147,2.8e-05,0.145417,0.845361,0.666667,0.555556,0.0,0.010101,0.484848,0.383838,...,0.05,0.040201,0.906175,0.008272,0.032217,0.739956,0.0,0.333333,0.311475,0.333333
1,0.21147,0.0,0.0,0.624862,0.833333,0.888889,0.004149,0.0,0.20202,0.323232,...,0.012,0.492462,0.906592,0.006434,0.009101,0.140821,0.0,0.0,0.229508,0.333333
2,0.784946,1.4e-05,0.0,0.624862,0.666667,0.555556,0.0,0.010101,0.242424,0.383838,...,0.016,0.040201,0.978965,0.009191,0.010728,0.198326,0.0,0.0,0.327869,0.333333
3,0.21147,0.0,0.453862,0.536082,0.0,0.555556,0.0,0.0,0.626263,0.20202,...,0.025,0.01005,0.906592,0.009191,0.007152,0.714462,0.0,0.0,1.0,0.666667
4,0.784946,0.0,0.433265,0.556701,0.666667,0.555556,0.0,0.0,0.282828,0.181818,...,0.025,0.045226,0.969489,0.008272,0.022493,0.268763,0.0,0.0,0.557377,0.333333
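Fitting the scaler on the training set only has a visible side effect: `MinMaxScaler` stores the training minimum and maximum per column, so a test value outside that range is mapped outside [0, 1]. A minimal sketch on made-up numbers (the arrays `toy_train` and `toy_test` are hypothetical and not taken from the lab data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data, for illustration only
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[12.0], [-2.0], [5.0]])

toy_scaler = MinMaxScaler().fit(toy_train)   # learns min and max from the training rows only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# The same transformation written out by hand: (x - train_min) / (train_max - train_min)
train_min = toy_train.min(axis=0)
train_max = toy_train.max(axis=0)
by_hand = (toy_test - train_min) / (train_max - train_min)

print(toy_train_scaled.ravel())   # stays inside [0, 1]
print(toy_test_scaled.ravel())    # 12 and -2 fall outside [0, 1]
```

This is expected behaviour rather than a bug: the test set is scaled with what was learned from the training set, exactly as the model will see new data.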


In [11]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first').fit(X_train_cat) # again, fit only on the training set, but applied to both
cols = encoder.get_feature_names_out(input_features=X_train_cat.columns)

X_train_encode = pd.DataFrame(encoder.transform(X_train_cat).toarray(),columns=cols)
X_test_encode = pd.DataFrame(encoder.transform(X_test_cat).toarray(),columns=cols)

X_train_encode.head()

Unnamed: 0,STATE_FL,STATE_GA,STATE_IL,STATE_IN,STATE_MI,STATE_MO,STATE_NC,STATE_TX,STATE_WA,STATE_WI,...,RFA_2A_E,RFA_2A_F,RFA_2A_G,GEOCODE2_B,GEOCODE2_C,GEOCODE2_D,DOMAIN_A_R,DOMAIN_A_S,DOMAIN_A_T,DOMAIN_A_U
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# now we recombine the scaled/encoded test and train sets
# we also reset the indexes of the y series to realign with the transformed X

X_train_transformed = pd.concat([X_train_scaled,X_train_encode], axis = 1)
X_test_transformed = pd.concat([X_test_scaled,X_test_encode], axis = 1)

y_train = y_train.reset_index(drop=True)          # to realign the X and y datasets for when we over/under-sample later
y_test = y_test.reset_index(drop=True)

### Logistic Regression LR

In [13]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(max_iter = 1000).fit(X_train_transformed, y_train) # default parameters apart from max_iter

score_LR = LR.score(X_test_transformed, y_test)

In [14]:
score_LR    # looks very high, but it is about the same as the share of class 0 in the data (90569 of 95412 rows)

0.9493790284546455

### LR - Resampling

In [15]:
# We just over/undersample the train data to improve the model. It is still tested with the test data as before.

In [16]:
y_train.value_counts()

0    72452
1     3877
Name: TARGET_B, dtype: int64
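A quick arithmetic check puts the 0.9494 accuracy in context. The sketch below uses only the class counts printed in this notebook (the full-data counts from `y.value_counts()` and the training counts just above) to work out what a model that always predicts class 0 would score on the test set. The variable names are mine, for illustration only.

```python
# Class counts copied from the value_counts() outputs in this notebook
total_0, total_1 = 90569, 4843        # full dataset
train_0_n, train_1_n = 72452, 3877    # training split

test_0_n = total_0 - train_0_n        # class 0 rows left for the test set
test_1_n = total_1 - train_1_n        # class 1 rows left for the test set

# Accuracy of a "model" that always predicts class 0
baseline_accuracy = test_0_n / (test_0_n + test_1_n)

print(test_0_n, test_1_n)             # 18117 966
print(round(baseline_accuracy, 4))    # 0.9494
```

The baseline agrees with `score_LR` (0.9493790284546455). That is consistent with the logistic regression predicting class 0 for essentially every test row, although accuracy alone cannot prove it.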

In [17]:
from sklearn.utils import resample

train = pd.concat([X_train_transformed, y_train],axis=1) # we recombine the X and y to maintain relationship within the rows

train_0 = train[train['TARGET_B']==0]
train_1 = train[train['TARGET_B']==1]

display(train_0.shape)
train_1.shape            # the data is imbalanced

(72452, 402)

(3877, 402)

In [18]:
# first we will oversample the train_1 dataset

train_1_oversampled = resample(train_1,
                                    replace=True,
                                    n_samples = len(train_0),
                                    random_state=0)

display(train_1_oversampled.shape)
display(train_0.shape)

(72452, 402)

(72452, 402)

In [19]:
train_oversampled = pd.concat([train_0, train_1_oversampled],axis=0)
y_train_over = train_oversampled['TARGET_B'].copy()
X_train_over = train_oversampled.drop('TARGET_B',axis = 1).copy()
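To see what `resample` does at a size that can be checked by eye, here is a small sketch on a hypothetical ten-row frame (the `toy` data and the `feature` column are made up; only the `TARGET_B` name is borrowed from the lab):

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame: 8 rows of class 0, 2 rows of class 1
toy = pd.DataFrame({
    'feature': range(10),
    'TARGET_B': [0] * 8 + [1] * 2,
})

toy_0 = toy[toy['TARGET_B'] == 0]
toy_1 = toy[toy['TARGET_B'] == 1]

# Oversample the minority class WITH replacement up to the majority size
toy_1_over = resample(toy_1, replace=True, n_samples=len(toy_0), random_state=0)
toy_over = pd.concat([toy_0, toy_1_over], axis=0)

# Undersample the majority class WITHOUT replacement down to the minority size
toy_0_under = resample(toy_0, replace=False, n_samples=len(toy_1), random_state=0)
toy_under = pd.concat([toy_0_under, toy_1], axis=0)

print(toy_over['TARGET_B'].value_counts().to_dict())    # both classes now have 8 rows
print(toy_under['TARGET_B'].value_counts().to_dict())   # both classes now have 2 rows
print(sorted(toy_1_over['feature'].unique()))           # only original minority rows, repeated
```

Oversampling adds no new information: the minority rows are simply repeated. Undersampling discards most of the majority rows. Both are applied to the training data only.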

In [20]:
LR_over = LogisticRegression(max_iter = 1500).fit(X_train_over, y_train_over)

score_LR_over = LR_over.score(X_test_transformed, y_test)

In [21]:
score_LR_over

0.6186658282240738

In [22]:
# now let's undersample the train_0 dataset

train_0_undersampled = resample(train_0,
                                    replace=False,
                                    n_samples = len(train_1),
                                    random_state=0)

display(train_0_undersampled.shape)
display(train_1.shape)

(3877, 402)

(3877, 402)

In [23]:
train_undersampled = pd.concat([train_0_undersampled, train_1],axis=0)
y_train_under = train_undersampled['TARGET_B'].copy()
X_train_under = train_undersampled.drop('TARGET_B',axis = 1).copy()

In [24]:
LR_under = LogisticRegression(max_iter = 1000).fit(X_train_under, y_train_under)

score_LR_under = LR_under.score(X_test_transformed, y_test)

In [25]:
score_LR_under

0.590525598700414

In [26]:
print('Original Logistic Regression score:      ', score_LR)
print('Over-sampled Logistic Regression score:  ',score_LR_over)
print('Under-sampled Logistic Regression score: ',score_LR_under)

Original Logistic Regression score:       0.9493790284546455
Over-sampled Logistic Regression score:   0.6186658282240738
Under-sampled Logistic Regression score:  0.590525598700414
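Accuracy dropping after resampling does not necessarily mean the model got worse at finding class 1. The sketch below uses hypothetical hand-written labels (not the lab's predictions) to show how a predictor with lower accuracy can still have much better recall on the minority class. `accuracy_score`, `recall_score` and `balanced_accuracy_score` are standard functions in `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 18 negatives followed by 2 positives
y_true = [0] * 18 + [1] * 2

# Predictor A: always predicts the majority class
pred_a = [0] * 20

# Predictor B: finds both positives but raises 6 false alarms
pred_b = [1] * 6 + [0] * 12 + [1] * 2

results = {}
for name, pred in [('always 0', pred_a), ('flags more 1s', pred_b)]:
    results[name] = {
        'accuracy': accuracy_score(y_true, pred),
        'recall_1': recall_score(y_true, pred),            # recall for class 1
        'balanced': balanced_accuracy_score(y_true, pred)  # mean of per-class recall
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Predictor A scores 0.9 accuracy with zero recall on class 1. Predictor B scores only 0.7 accuracy but catches every positive. Whether the resampled LR models in this lab behave like predictor B would need a confusion matrix on `X_test_transformed` to confirm.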


### Nearest Neighbour Classifier KNC

In [27]:
# we will use the original (imbalanced) train dataset

In [28]:
from sklearn import neighbors
KNC = neighbors.KNeighborsClassifier()
KNC.fit(X_train_transformed, y_train)

In [29]:
score_KNC = KNC.score(X_test_transformed, y_test)
score_KNC

0.947911753917099

### Decision Tree Classifier DTC

In [30]:
from sklearn.tree import DecisionTreeClassifier

DTC = DecisionTreeClassifier(max_depth=3)          # arbitrary choice of 3 levels
DTC.fit(X_train_num, y_train)                      # we go back to the unscaled numerical columns only (a tree does not need scaling)

In [31]:
score_DTC = DTC.score(X_test_num,y_test)
print("Test data score: ", score_DTC)
print("Train data score: ", DTC.score(X_train_num,y_train))

Test data score:  0.9493790284546455
Train data score:  0.9492067235257897


In [32]:
from sklearn.tree import export_text

tree = export_text(DTC, feature_names=list(X_train_num.columns))
print(tree)

|--- LASTGIFT <= 14.03
|   |--- LASTGIFT <= 8.22
|   |   |--- AVGGIFT <= 8.97
|   |   |   |--- class: 0
|   |   |--- AVGGIFT >  8.97
|   |   |   |--- class: 0
|   |--- LASTGIFT >  8.22
|   |   |--- LASTDATE <= 9701.50
|   |   |   |--- class: 0
|   |   |--- LASTDATE >  9701.50
|   |   |   |--- class: 0
|--- LASTGIFT >  14.03
|   |--- HVP1 <= 10.50
|   |   |--- NGIFTALL <= 4.50
|   |   |   |--- class: 0
|   |   |--- NGIFTALL >  4.50
|   |   |   |--- class: 0
|   |--- HVP1 >  10.50
|   |   |--- EC3 <= 35.50
|   |   |   |--- class: 0
|   |   |--- EC3 >  35.50
|   |   |   |--- class: 0



In [33]:
# interesting that 'LASTGIFT' appears in two different level-nodes
# also note that every leaf predicts class 0, so this tree never predicts class 1

### Summary of scores so far

In [34]:
print('Original Logistic Regression score:      ', round(score_LR,4))
print('Over-sampled Logistic Regression score:  ', round(score_LR_over,4))
print('Under-sampled Logistic Regression score: ', round(score_LR_under,4))
print('Nearest Neighbour score:                 ', round(score_KNC,4))
print('Decision Tree Classifier score:          ', round(score_DTC,4))

Original Logistic Regression score:       0.9494
Over-sampled Logistic Regression score:   0.6187
Under-sampled Logistic Regression score:  0.5905
Nearest Neighbour score:                  0.9479
Decision Tree Classifier score:           0.9494


In [35]:
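# The original LR model and the DTC have generated the best (and identical to 10dp !!!) scores.
# This is especially strange as only the numerical data was used in the DTC.
# It is less strange given that every leaf of the DTC predicts class 0: its score is just the share of class 0 in the test set.
# Over/under sampling lowered the accuracy of the LR model (0.9494 down to 0.6187 and 0.5905).
# KNC also produced a good result.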

### Hyperparameter optimisation

In [36]:
# Let's try to improve the LR model by amending parameters

In [37]:
# for reference:

from sklearn.model_selection import cross_validate

cross_validate(LR, X_train_transformed, y_train, cv = 5)

{'fit_time': array([13.9911871 , 15.74153972, 14.5428803 , 18.38295341, 16.47683191]),
 'score_time': array([0.02093935, 0.02094293, 0.02393603, 0.02293777, 0.0287354 ]),
 'test_score': array([0.94923359, 0.94923359, 0.94916809, 0.94916809, 0.94923027])}

In [38]:
from sklearn.model_selection import GridSearchCV

# let's pick a few parameters to try: 2 x 2 = 4 combinations, each fitted 5 times with cv = 5

model = LogisticRegression()

grid = {'class_weight': [None, 'balanced'],
        'solver': ['lbfgs', 'saga'],
        'max_iter': [1000], # must be a list even for one value; fixed at 1000 as I got a warning when using the default, 100, earlier
        }
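scikit-learn expects every value in a parameter grid to be a list (or array), including a parameter held at a single value such as `max_iter`. A small sketch, using `ParameterGrid` from `sklearn.model_selection` to list the candidates without fitting anything; `toy_grid`, `candidates` and `n_fits` are names chosen for this illustration:

```python
from sklearn.model_selection import ParameterGrid

# Every value is a list, including the single fixed value for max_iter
toy_grid = {'class_weight': [None, 'balanced'],
            'solver': ['lbfgs', 'saga'],
            'max_iter': [1000]}

candidates = list(ParameterGrid(toy_grid))
n_folds = 5                                   # matches cv = 5 used for the grid search
n_fits = len(candidates) * n_folds + 1        # +1 for the default refit on the full training set

for params in candidates:
    print(params)
print(len(candidates), 'candidates,', n_fits, 'model fits in total')
```

With the 14 to 18 seconds per fit reported by `cross_validate` above, 21 fits would already take several minutes for `lbfgs` alone, and `saga` may take a different amount of time on this data. Passing `n_jobs=-1` to `GridSearchCV` or searching on a subsample of the training rows are two common ways to shorten the run.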

In [39]:
grid_search = GridSearchCV(estimator = model, param_grid = grid, cv = 5)

In [None]:
%%time

grid_search.fit(X_train_transformed, y_train)

grid_search.best_params_

In [None]:
# the grid search would not finish (it timed out), so I can't compare the scores

In [None]:
print('Original Logistic Regression score (test set):          ', score_LR)
print('Best grid-search score (mean cross-validated accuracy): ', grid_search.best_score_)
