# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (TargetB) and then how much they will give (TargetD); the latter will be used in Friday's lab. You will be using `files_for_lab/categorical.csv`, `numerical.csv`, and `target.csv`, which can be found at this link.
[link to data](https://github.com/ta-data-remote/lab-random-forests/tree/master/files_for_lab)
You will need to download the data locally.  Remember to add the files to your .gitignore.

### Scenario

You are revisiting the Healthcare for All Case Study. You are provided with this historical data about Donors and how much they donated. Your task is to build a machine learning model that will help the company identify people who are more likely to donate and then try to predict the donation amount.

### Instructions

In this lab, we will take a look at the degree of imbalance in the data and correct it using the techniques we learned in the class.  You should fork and clone this Repo and begin a new Jupyter notebook.

Here are the steps to be followed (building a simple model without balancing the data):


**Everyone is starting with the same cleaned data**

 

**Begin the Modeling here**
- Look critically at the dtypes of numerical and categorical columns and make changes where appropriate.
- Concatenate numerical and categorical back together again for your X dataframe.  Designate the TargetB as y.
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using MinMax Scaler or a Standard Scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on train data, transform both train and test
      - again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test
  - Fit a logistic regression (classification) model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

**Managing imbalance in the dataset**

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.
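
The two resampling strategies mentioned above can be sketched on toy data as follows. This is a minimal illustration using plain pandas sampling, not necessarily the exact method shown in class; the `toy` frame and its `feature` column are made up for the example, and in the lab the resampling would be applied to the training set only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the lab data: 95 non-donors (0) and 5 donors (1).
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "TARGET_B": [0] * 95 + [1] * 5,
})

majority = toy[toy["TARGET_B"] == 0]
minority = toy[toy["TARGET_B"] == 1]

# Upsampling: draw minority rows with replacement until both classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=7)
upsampled = pd.concat([majority, minority_up]).sample(frac=1, random_state=7)

# Downsampling: keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=7)
downsampled = pd.concat([majority_down, minority]).sample(frac=1, random_state=7)

up_counts = upsampled["TARGET_B"].value_counts().to_dict()
down_counts = downsampled["TARGET_B"].value_counts().to_dict()
print(up_counts, down_counts)
```

Upsampling keeps every original row but repeats donors many times, while downsampling discards most of the majority class; both leave the test set untouched.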





In [1]:
import pandas as pd
import numpy as np


In [21]:
num = pd.read_csv('numerical.csv')
cat = pd.read_csv('categorical.csv')
tgt = pd.read_csv('target.csv')

In [22]:
num.isna().sum().sum() # no null values
num.dtypes # why do we still have 315 columns?

TCODE         int64
AGE         float64
INCOME        int64
WEALTH1       int64
HIT           int64
             ...   
AVGGIFT     float64
CONTROLN      int64
HPHONE_D      int64
RFA_2F        int64
CLUSTER2      int64
Length: 315, dtype: object

In [23]:
# cat.isna().sum().sum() # no null values
# print(cat.dtypes) # a lot of columns with value type int64, after reviewing, I don't think any of these are ordinal values, 
#                   # thus I'll simply change them all to objects
# cat = cat.astype(str)
# print(cat.dtypes)

# Note from Friday: nope, don't do this, it causes tremendous problems

In [24]:
# concat as X, assign target as y; split to train/test, and further split to num/cat _ train/test
X = pd.concat([num,cat],axis=1)
y = tgt['TARGET_B']

# splitting into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# further split
train_num = X_train.select_dtypes('number')
test_num = X_test.select_dtypes('number')
train_cat = X_train.select_dtypes(exclude='number')
test_cat = X_test.select_dtypes(exclude='number')

In [25]:
# scale numeric data
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(train_num)

train_num = pd.DataFrame(transformer.transform(train_num), columns=train_num.columns)
test_num = pd.DataFrame(transformer.transform(test_num), columns=train_num.columns)

In [26]:
# One-Hot-Encode categorical data
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder().fit(train_cat)

train_cat = pd.DataFrame(encoder.transform(train_cat).toarray(), columns=encoder.get_feature_names_out())
test_cat = pd.DataFrame(encoder.transform(test_cat).toarray(), columns=encoder.get_feature_names_out())

In [27]:
# reconcat transformed and encoded dataframes back into X_train and X_test
X_train = pd.concat([train_num,train_cat],axis=1)
X_test = pd.concat([test_num,test_cat],axis=1)

In [31]:
#model
from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=7, solver='saga',
                  multi_class='multinomial', max_iter=1000).fit(X_train, y_train)
classification.score(X_test, y_test)

0.951055913640413

In [32]:
# accuracy score: 0.95 <-- Hey, that's pretty good! But are we simply predicting the majority class here?
tgt['TARGET_B'].value_counts()[0]/len(tgt) # Apparently the majority class (0) makes up about 95% of the rows

0.9492411855951033

In [36]:
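from sklearn.metrics import confusion_matrix
pred = classification.predict(X_test) # predictions from the unbalanced model fitted above
confusion_matrix(y_test,pred) # the predicted-positive column is all zero!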

array([[18149,     0],
       [  934,     0]], dtype=int64)
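
To make the point about accuracy concrete, the numbers in the confusion matrix above can be turned into accuracy and recall by hand. This is only a worked-arithmetic sketch on the printed matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class).
cm = np.array([[18149, 0],
               [934, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()         # share of correct predictions
recall = tp / (tp + fn)                 # share of actual donors the model found
majority_share = (tn + fp) / cm.sum()   # accuracy of always predicting 0

print(accuracy, recall, majority_share)
```

The accuracy equals the share of non-donors in the test set, and recall is zero: the model never predicts a donor.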

In [38]:
# oversample with SMOTE, I like SMOTE because I just like the feeling of trusting blackboxes
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=7, k_neighbors=3)
X_train_SMOTE,y_train_SMOTE = sm.fit_resample(X_train,y_train)
X_train_SMOTE.shape

(144840, 361)

In [39]:
# LR again, but with a different variable name to keep the two models apart
from sklearn.metrics import precision_score, recall_score, f1_score
LR = LogisticRegression(random_state=7, solver='saga',
                  multi_class='multinomial', max_iter=1000)
LR.fit(X_train_SMOTE, y_train_SMOTE)
pred = LR.predict(X_test)

print("accuracy: ", LR.score(X_test, y_test))
print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

accuracy:  0.6028926269454489
precision:  0.06654488517745302
recall:  0.5460385438972163
f1:  0.11863224005582693


In [None]:
# Oh no! Accuracy has dropped significantly. But recall has increased by a lot, resulting in an f1 score increase as well
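
As a quick check on how the three scores relate, the f1 score is the harmonic mean of precision and recall. The sketch below simply recomputes it from the two values printed above.

```python
# Values copied from the printed output of the SMOTE-trained model.
precision = 0.06654488517745302
recall = 0.5460385438972163

# f1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

Because the harmonic mean is pulled toward the smaller value, the low precision keeps f1 low even though recall is above 0.5.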

# Lab | Random Forests

For this lab, you will be using the .CSV files provided in the `files_for_lab` folder.  These are cleaned versions of the learningSet data from the Case Study 'Healthcare for All'.   
You may continue in the Jupyter Notebook you created yesterday.  There is no need to fork and clone this Repo.

### Instructions

- Apply the Random Forests algorithm AFTER upsampling the data to deal with the imbalance.
- Use Feature Selections that you have learned in class to decide if you want to use all of the features (Variance Threshold, RFE, PCA, etc.)
- Re-run the Random Forest algorithm to determine if the Feature Selection has improved the results.
- Discuss the output and its impact in the business scenario. Is the cost of a false positive equal to the cost of a false negative? How would you change your algorithm or data in order to maximize the return for the business?
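
One way to think about the false-positive versus false-negative question is to attach money to each outcome. The sketch below is an illustration only: `campaign_profit`, `gift_value`, and `mail_cost` are hypothetical names and the amounts are placeholders, not figures from the case study.

```python
import numpy as np

def campaign_profit(y_true, y_pred, gift_value, mail_cost):
    """Profit from mailing everyone predicted as a donor.

    gift_value and mail_cost are hypothetical placeholders.
    Returns (profit, missed), where missed is the profit lost
    on donors the model failed to flag (false negatives).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # A true positive earns the gift minus the mailing cost;
    # a false positive only costs the mailing.
    profit = tp * (gift_value - mail_cost) - fp * mail_cost
    missed = fn * (gift_value - mail_cost)
    return profit, missed

y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
profit, missed = campaign_profit(y_true, y_pred, gift_value=15.0, mail_cost=1.0)
print(profit, missed)
```

With these made-up amounts, one missed donor costs as much as fourteen wasted mailings, which is why recall can matter more than accuracy here.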

In [40]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=10, # max number of questions to ask
                             min_samples_split=20, # amount of rows still considered at every question
                             min_samples_leaf =20, # ultimate answer based on at least this many rows
                             max_samples=0.8, # fraction of X-train to use in each tree
                             random_state=7)
clf.fit(X_train_SMOTE, y_train_SMOTE)
print(clf.score(X_train_SMOTE, y_train_SMOTE))
print(clf.score(X_test, y_test))

0.958733775200221
0.9232825027511398


In [41]:
# accuracy is not bad, much better than the Logistic Regression model
pred = clf.predict(X_test)

print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

precision:  0.06557377049180328
recall:  0.042826552462526764
f1:  0.05181347150259067


In [None]:
# None of the above three metrics are better than the Logistic Regression model
# Let's use PCA to build/reduce features
# from sklearn.decomposition import PCA
# pca = PCA(0.80).fit(X_train_SMOTE)
# pca.explained_variance_ratio_ # OK this looks cool, acceptable number of features adding up to explained variance ratio of 0.90
#                               # But even the one with highest evr is not too high (0.09), I'll try anyway
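
For reference, passing a float between 0 and 1 to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows this on synthetic data (`X_toy` is made up for the illustration and is not the lab data).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 independent features with increasing spread.
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(200, 10)) * np.arange(1, 11)

# Keep as many components as needed to explain 80% of the variance.
pca = PCA(n_components=0.80).fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(pca.n_components_, cumulative[-1])
```

The last entry of `cumulative` is therefore at least the requested fraction, so a threshold of 0.80 guarantees 0.80, not 0.90.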

In [None]:
# X_train_SMOTE_pca = pca.transform(X_train_SMOTE)

# clf2 = RandomForestClassifier(max_depth=10, # max number of questions to ask
#                              min_samples_split=20, # amount of rows still considered at every question
#                              min_samples_leaf =20, # ultimate answer based on at least this many rows
#                              random_state=7)
# clf2.fit(X_train_SMOTE_pca, y_train_SMOTE)

# pred2 = clf2.predict(pca.transform(X_test))

# print("accuracy: ",clf2.score(pca.transform(X_test), y_test))
# print("precision: ",precision_score(y_test,pred2))
# print("recall: ",recall_score(y_test,pred2))
# print("f1: ",f1_score(y_test,pred2))

In [42]:
# this forest has way better recall and f1 score than the previous one without PCA
# from sklearn.metrics import confusion_matrix
# confusion_matrix(y_test,pred2)

NameError: name 'pred2' is not defined

In [None]:
# I would just use the logistic regression model.

# Lab | Final regression model in "Health Care for All" Case

### Instructions

At this point, we have created a model to predict who will make a donation and who won't (Classification Model). But what about the amount of money that each person will give?

In this lab, subset those that have made a donation (Target B) and use that subset to create a model to predict how much money they will give (Target D) (Regression Model).

- Only look at people who have donated (Target B = 1)
- Use this new dataframe to create a model to predict how much they will donate (Target D)
- Using the regression model, make predictions on all of the people our classification model predicted will donate.
- See the pdf file for a schema of the process.

Evaluate the results of your model and estimate how much better they are for the business compared with the naive scenario we discussed on Monday (just sending donation cards to everyone).
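
A comparison with the naive scenario could be sketched as below. This is a toy example under stated assumptions: `MAIL_COST` is a hypothetical cost per donation card, and the small `toy` frame stands in for the real data with its actual donations (`TARGET_D`) and the classifier's predictions (`pred_target_b`).

```python
import pandas as pd

MAIL_COST = 1.0  # hypothetical cost per donation card

# Toy data: actual donation amounts and the classifier's predictions.
toy = pd.DataFrame({
    "TARGET_D":      [0.0, 5.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "pred_target_b": [0,   1,   1,   0,   1,   0,   0,   0,   0,   0],
})

# Naive scenario: mail everyone.
naive_net = toy["TARGET_D"].sum() - MAIL_COST * len(toy)

# Targeted scenario: mail only the predicted donors.
mailed = toy[toy["pred_target_b"] == 1]
targeted_net = mailed["TARGET_D"].sum() - MAIL_COST * len(mailed)

print(naive_net, targeted_net)
```

The same two lines of arithmetic can be applied to the test set once real predictions and a real mailing cost are available.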

You can see a flowchart for the project here --  [Lucid Flowchart](https://lucid.app/lucidchart/dd701870-3d4e-45c3-b49c-01976181ae06/edit?viewport_loc=-15%2C-25%2C2150%2C1048%2C0_0&invitationId=inv_089ae862-550b-4e82-a606-a8122f39d2f2)

In [54]:
#concat all data to a big dataframe
all_data = pd.concat([num,cat,tgt],axis=1)
all_data.shape

(95412, 339)

In [55]:
#select X and y, and train-test-split for regression model
donations_data = all_data[all_data['TARGET_B']==1]
X = donations_data.drop(columns=['TARGET_B','TARGET_D'])
y = donations_data['TARGET_D']

# splitting into train set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# further split
train_num = X_train.select_dtypes('number')
test_num = X_test.select_dtypes('number')
train_cat = X_train.select_dtypes(exclude='number')
test_cat = X_test.select_dtypes(exclude='number')

In [58]:
# scale numeric data
from sklearn.preprocessing import MinMaxScaler

transformer = MinMaxScaler().fit(train_num)

train_num = pd.DataFrame(transformer.transform(train_num), columns=train_num.columns)
test_num = pd.DataFrame(transformer.transform(test_num), columns=train_num.columns)

# One-Hot-Encode categorical data
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder().fit(train_cat)

train_cat = pd.DataFrame(encoder.transform(train_cat).toarray(), columns=encoder.get_feature_names_out())
test_cat = pd.DataFrame(encoder.transform(test_cat).toarray(), columns=encoder.get_feature_names_out())

# reconcat transformed and encoded dataframes back into X_train and X_test
X_train = pd.concat([train_num,train_cat],axis=1)
X_test = pd.concat([test_num,test_cat],axis=1)

In [49]:
# select model
# from sklearn.model_selection import cross_val_score

# from sklearn.tree import DecisionTreeRegressor
# model1 = DecisionTreeRegressor()
# from sklearn.linear_model import LinearRegression
# model2 = LinearRegression()
# from sklearn.neighbors import KNeighborsRegressor
# model3 = KNeighborsRegressor()

# model_pipeline = [model1, model2, model3]
# model_names = ['Decision Tree Regressor', 'Linear Regression', 'KNN']
# scores = {}
# for model, model_name in zip(model_pipeline, model_names):
#     mean_score = np.mean(cross_val_score(model, X_train, y_train, cv=5))
#     scores[model_name] = mean_score
# print(scores)

In [61]:
# we will use linear regression
from sklearn.linear_model import LinearRegression as LinReg

linreg=LinReg()    # model
linreg.fit(X_train, y_train)   # model training
y_pred_linreg=linreg.predict(X_test)   # model prediction

print ('train R2: {} -- test R2: {}'.format(linreg.score(X_train, y_train),
                                            linreg.score(X_test, y_test)))

train R2: 0.591708326367602 -- test R2: 0.43991407526510484


### Now that the models are ready, we move on to applying them to the whole dataset

In [72]:
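# preprocess the input X
# Note: `transformer` and `encoder` were refit on the donor-only training set in In [58],
# so the classifier `LR` (trained on features scaled in In [25]) receives differently scaled inputs here.
# Keeping separate scaler/encoder objects for the two models would avoid this.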
X = all_data.drop(columns=['TARGET_B','TARGET_D'])

numericalX    = X.select_dtypes('number')
categoricalX = X.select_dtypes(exclude='number')


scaled_numerical_X = transformer.transform(numericalX)
scaled_numerical_X = pd.DataFrame(scaled_numerical_X, columns=numericalX.columns)

encoded_categorical_X = encoder.transform(categoricalX).toarray()
encoded_categorical_X =pd.DataFrame(encoded_categorical_X, columns=encoder.get_feature_names_out())


X = pd.concat([scaled_numerical_X, encoded_categorical_X], axis = 1)

In [76]:
print(X.shape,all_data.shape)

(95412, 361) (95412, 340)


In [77]:
# Apply Classification
X['pred_target_b'] = LR.predict(X) # Using the LogisticRegression model trained with SMOTE data, explained choice above.
                                          # Got predictions (we are really over predicting a lot)

In [79]:
# Select data and apply regression model
X = X[X['pred_target_b'] == 1].copy() # copy so that adding a column below does not trigger SettingWithCopyWarning

#model:
X['pred_amount'] = linreg.predict(X.drop(columns=['pred_target_b']))

In [88]:
print('number of predicted donors:',X['pred_amount'].count(),'; sum of predicted amount:',X['pred_amount'].sum())

number of predicted donors: 64643 ; sum of predicted amount: 1039892.8515625


In [89]:
X['pred_amount'].describe()

count    64643.000000
mean        16.086705
std          8.880228
min        -60.392578
25%         10.379883
50%         15.365234
75%         19.779297
max        322.824219
Name: pred_amount, dtype: float64
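
The summary above includes a negative minimum, which cannot be a real donation; an unconstrained linear regression can produce such values. One possible post-processing step, which is not part of the original notebook, is to clip predictions at zero. The sketch uses a few rounded values from the summary as stand-ins for the full prediction column.

```python
import numpy as np

# Rounded stand-ins for predicted amounts (min, quartiles, max from above).
raw_pred = np.array([-60.39, 10.38, 15.37, 322.82])

# Replace negative predictions with 0; leave the rest unchanged.
clipped = np.clip(raw_pred, 0, None)
print(clipped)
```

Clipping only removes impossible values; it does not address the very large predictions at the other end of the range.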