# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (TargetB) and then how much they will give (TargetD, which will be used in Friday's lab). You will be using `categorical.csv`, `numerical.csv`, and `target.csv` from the `files_for_lab` folder, which can be found at this link.
[link to data](https://github.com/ta-data-remote/lab-random-forests/tree/master/files_for_lab)
You will need to download the data locally.  Remember to add the files to your .gitignore.

### Scenario

You are revisiting the Healthcare for All Case Study. You are provided with this historical data about Donors and how much they donated. Your task is to build a machine learning model that will help the company identify people who are more likely to donate and then try to predict the donation amount.

### Instructions

In this lab, we will look at the degree of imbalance in the data and correct it using the techniques we learned in class. You should fork and clone this repo and begin a new Jupyter notebook.

Here are the steps to be followed (building a simple model without balancing the data):


**Everyone is starting with the same cleaned data**

 

**Begin the Modeling here**
- Look critically at the dtypes of numerical and categorical columns and make changes where appropriate.
- Concatenate numerical and categorical back together again for your X dataframe.  Designate the TargetB as y.
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using MinMax Scaler or a Standard Scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on train data, transform both train and test
      - again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test
  - Fit a logistic regression (classification) model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

**Managing imbalance in the dataset**

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.
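
The two resampling strategies named above can be sketched with plain NumPy index sampling. The toy arrays, the 95/5 class split, and the helper names `upsample_minority` and `downsample_majority` are assumptions for illustration, not the class material. Both helpers assume the minority class really is the smaller one, and resampling should be applied to the training split only so the test set keeps its real class ratio.

```python
import numpy as np

def upsample_minority(X, y, minority=1, seed=7):
    """Randomly repeat minority rows (with replacement) until both classes match."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def downsample_majority(X, y, minority=1, seed=7):
    """Randomly drop majority rows (without replacement) until both classes match."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]

# Toy training data (hypothetical): 95 non-donors, 5 donors
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(100).reshape(-1, 1)

X_up, y_up = upsample_minority(X_toy, y_toy)
X_down, y_down = downsample_majority(X_toy, y_toy)
print(np.bincount(y_up))    # [95 95]
print(np.bincount(y_down))  # [5 5]
```

Upsampling keeps every original row but repeats donors many times, while downsampling throws away most non-donors; which trade-off is acceptable depends on how much data the model needs.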





In [28]:
import pandas as pd
import numpy as np


In [29]:
num = pd.read_csv('numerical.csv')
cat = pd.read_csv('categorical.csv')
tgt = pd.read_csv('target.csv')

In [30]:
num.isna().sum().sum() # no null values
num.dtypes # why do we still have 315 columns?

TCODE         int64
AGE         float64
INCOME        int64
WEALTH1       int64
HIT           int64
             ...   
AVGGIFT     float64
CONTROLN      int64
HPHONE_D      int64
RFA_2F        int64
CLUSTER2      int64
Length: 315, dtype: object

In [31]:
cat.isna().sum().sum() # no null values
print(cat.dtypes) # a lot of columns with value type int64, after reviewing, I don't think any of these are ordinal values, 
                  # thus I'll simply change them all to objects
cat = cat.astype(str)
print(cat.dtypes)

STATE           object
CLUSTER          int64
HOMEOWNR        object
GENDER          object
DATASRCE         int64
RFA_2R          object
RFA_2A          object
GEOCODE2        object
DOMAIN_A        object
DOMAIN_B         int64
ODATEW_YR        int64
ODATEW_MM        int64
DOB_YR           int64
DOB_MM           int64
MINRDATE_YR      int64
MINRDATE_MM      int64
MAXRDATE_YR      int64
MAXRDATE_MM      int64
LASTDATE_YR      int64
LASTDATE_MM      int64
FIRSTDATE_YR     int64
FIRSTDATE_MM     int64
dtype: object
STATE           object
CLUSTER         object
HOMEOWNR        object
GENDER          object
DATASRCE        object
RFA_2R          object
RFA_2A          object
GEOCODE2        object
DOMAIN_A        object
DOMAIN_B        object
ODATEW_YR       object
ODATEW_MM       object
DOB_YR          object
DOB_MM          object
MINRDATE_YR     object
MINRDATE_MM     object
MAXRDATE_YR     object
MAXRDATE_MM     object
LASTDATE_YR     object
LASTDATE_MM     object
FIRSTDATE_YR    obje

In [32]:
# concat as X, assign target as y; split to train/test, and further split to num/cat _ train/test
X = pd.concat([num,cat],axis=1)
y = tgt['TARGET_B']

# splitting into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# further split
train_num = X_train.select_dtypes('number')
test_num = X_test.select_dtypes('number')
train_cat = X_train.select_dtypes(exclude='number')
test_cat = X_test.select_dtypes(exclude='number')
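
Because only about 5% of the rows are donors, a plain random split can leave the train and test sets with slightly different donor rates. A possible variation, sketched here on toy data rather than the lab files, is to pass `stratify=y` to `train_test_split` so both splits keep the class ratio. The toy frame and its `feature` column are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the lab data (hypothetical): 100 rows, 5% positives
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 95 + [1] * 5, name="TARGET_B")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)

# Both splits keep the 5% positive rate of the full data
print(y_tr.mean(), y_te.mean())
```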

In [33]:
# scale numeric data
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(train_num)

train_num = pd.DataFrame(transformer.transform(train_num), columns=train_num.columns)
test_num = pd.DataFrame(transformer.transform(test_num), columns=train_num.columns)

In [34]:
# One-Hot-Encode categorical data
from sklearn.preprocessing import OneHotEncoder

# Fitted on the whole `cat` frame instead of train_cat: my train_test_split (random_state=7) put a value in test_cat
# that never appears in train_cat, so an encoder fitted on train_cat alone would fail on it.
# I've knowingly violated the fit-on-train-only rule here.
encoder = OneHotEncoder().fit(cat)
cols = [colname for row in encoder.categories_ for colname in row]

train_cat = pd.DataFrame(encoder.transform(train_cat).toarray(), columns=cols)
test_cat = pd.DataFrame(encoder.transform(test_cat).toarray(), columns=cols)

In [35]:
# reconcat transformed and encoded dataframes back into X_train and X_test
X_train = pd.concat([train_num,train_cat],axis=1)
X_test = pd.concat([test_num,test_cat],axis=1)

In [36]:
#model
from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=7, solver='lbfgs',
                  multi_class='multinomial', max_iter=1000).fit(X_train, y_train)
classification.score(X_test, y_test)

0.9509511083163025

In [40]:
# accuracy score: 0.95 <-- Hey, that's pretty good! But are we simply predicting the majority class here?
tgt['TARGET_B'].value_counts()[0]/len(tgt) # Apparently the majority class (0) makes up about 95% of the data

0.9492411855951033

In [41]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

pred = classification.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

precision:  0.25
recall:  0.0010706638115631692
f1:  0.0021321961620469083


In [42]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,pred)

array([[18146,     3],
       [  933,     1]], dtype=int64)
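
The scores reported above follow directly from this confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The arithmetic below is a worked check using only the four counts shown; it also computes what a model that always predicts "no donation" would score on the same test set.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 18146, 3, 933, 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted donors, how many really donated
recall = tp / (tp + fn)      # of the real donors, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the majority class (0), so every real 0 is correct
baseline_accuracy = (tn + fp) / total

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("baseline: ", baseline_accuracy)
```

The baseline comes out at about 0.9511, marginally above the model's 0.9510, which confirms that the unbalanced model is doing no better than always guessing the majority class.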

In [43]:
# oversample with SMOTE; I like SMOTE because I just like the feeling of trusting black boxes
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=7, k_neighbors=3)
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train,y_train)
X_train_SMOTE.shape

(144840, 656)
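
To make SMOTE a little less of a black box: the core idea is to create each synthetic minority row by interpolating between a real minority row and one of its `k` nearest minority neighbours. The function below is a simplified NumPy sketch of that idea, not imblearn's implementation; the name `smote_sketch` and the toy points are hypothetical, and it assumes there are more than `k` minority rows.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=7):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among the minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority points in two dimensions
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_sketch(X_min_toy, n_new=10)
print(synthetic.shape)  # (10, 2)
```

With its default settings, imblearn's `SMOTE` keeps adding synthetic minority rows until both classes have the same count, which is why the resampled training set above has 144840 rows, exactly twice the number of majority rows.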

In [48]:
# LR again, but under a different variable name to keep the two models apart
LR = LogisticRegression(random_state=7, solver='saga',
                  multi_class='multinomial', max_iter=1000)
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred = LR.predict(X_test)

print("accuracy: ", LR.score(X_test, y_test))
print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

accuracy:  0.6156788764869255
precision:  0.06616052060737528
recall:  0.5224839400428265
f1:  0.1174488567990373


In [None]:
# Oh no! Accuracy and precision have dropped significantly. But recall has increased by a lot, which pulls the F1 score up as well.