# Lab | Handling Data Imbalance in Classification Models

For this lab and in the next lessons we will use the dataset 'Healthcare For All' building a model to predict who will donate (TargetB) and then - how much they will give (TargetD) (will be used for lab on Friday). You will be using `files_for_lab/categorical.csv, numerical.csv, and target.csv` which can be found at this link.
[link to data](https://github.com/ta-data-remote/lab-random-forests/tree/master/files_for_lab)
You will need to download the data locally.  Remember to add the files to your .gitignore.

### Scenario

You are revisiting the Healthcare for All Case Study. You are provided with this historical data about Donors and how much they donated. Your task is to build a machine learning model that will help the company identify people who are more likely to donate and then try to predict the donation amount.

### Instructions

In this lab, we will take a look at the degree of imbalance in the data and correct it using the techniques we learned in the class.  You should fork and clone this Repo and begin a new Jupyter notebook.

Here are the steps to be followed (building a simple model without balancing the data):


**Everyone is starting with the same cleaned data**

 

**Begin the Modeling here**
- Look critically at the dtypes of numerical and categorical columns and make changes where appropriate.
- Concatenate numerical and categorical back together again for your X dataframe.  Designate the TargetB as y.
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using MinMax Scaler or a Standard Scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on train data, transform both train and test
      - again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test
  - Fit a logistic regression (classification) model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.

![Screenshot 2024-05-16 at 14.21.48.png](attachment:2a30efe9-b4a8-4949-b54c-456576cf6acd.png)

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load up that sweet data 

cats = pd.read_csv("categorical.csv")
nums = pd.read_csv("numerical.csv")
target = pd.read_csv("target.csv")

print(cats.shape)
print(nums.shape)
print(target.shape)

(95412, 22)
(95412, 315)
(95412, 2)


In [3]:
# I can see that there are no null values in the columns, which is nice, so don't need to change any row values
nulls_percent_df = pd.DataFrame(nums.isna().sum()/len(nums)).reset_index()
nulls_percent_df
nulls_percent_df.columns = ['column_name', 'nulls_percentage']
nulls_percent_df

Unnamed: 0,column_name,nulls_percentage
0,TCODE,0.0
1,AGE,0.0
2,INCOME,0.0
3,WEALTH1,0.0
4,HIT,0.0
...,...,...
310,AVGGIFT,0.0
311,CONTROLN,0.0
312,HPHONE_D,0.0
313,RFA_2F,0.0


In [4]:
# Creating X and y and splitting my test and training data
X = pd.concat([cats, nums], axis=1)
y = target['TARGET_B']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)


In [5]:
# Sclaing the nums using min/max scaler
from sklearn.preprocessing import MinMaxScaler
X_train_num = X_train.select_dtypes(include = np.number)
X_test_num = X_test.select_dtypes(include = np.number)

# Fitting and Scaling training data
transformer = MinMaxScaler().fit(X_train_num) # need to keep transformer
X_train_normalized = transformer.transform(X_train_num)
X_train_norm = pd.DataFrame(X_train_normalized, columns=X_train_num.columns)

# Scaling test data 
X_test_normalized = transformer.transform(X_test_num)
X_test_norm = pd.DataFrame(X_test_normalized, columns=X_train_num.columns)

In [6]:
# Encoding the cats - using ordinal encoder because I don't want 300 new columns to add to the 300 exisitng ones 
X_train_cat = X_train.select_dtypes(include = object)
X_test_cat = X_test.select_dtypes(include = object)

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

# Fit the encoder to the data
encoder.fit(X_train_cat)

# Transform the data using the fitted encoder
X_train_cat_encoded = encoder.transform(X_train_cat)
X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded, columns=X_train_cat.columns)

In [7]:
# Also going to encode my test data using the same encoder before I forget 
X_test_cat_encoded = encoder.transform(X_test_cat)
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded, columns=X_test_cat.columns)

In [8]:
# X_ train_num and train_cat as X_train as well as test_num and test_cat as X_test

X_train_clean =  pd.concat([X_train_norm, X_train_cat_encoded_df], axis=1)
X_test_clean =  pd.concat([X_test_norm, X_test_cat_encoded_df], axis=1)

X_train_clean

Unnamed: 0,CLUSTER,DATASRCE,DOMAIN_B,ODATEW_YR,ODATEW_MM,DOB_YR,DOB_MM,MINRDATE_YR,MINRDATE_MM,MAXRDATE_YR,...,HPHONE_D,RFA_2F,CLUSTER2,STATE,HOMEOWNR,GENDER,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A
0,0.769231,0.5,0.666667,0.714286,0.0,0.649485,0.000000,0.789474,0.000000,0.909091,...,0.0,0.333333,0.885246,3.0,0.0,0.0,0.0,2.0,0.0,3.0
1,0.826923,0.0,0.333333,0.857143,0.0,0.536082,0.000000,0.894737,0.545455,0.954545,...,1.0,1.000000,0.262295,11.0,0.0,0.0,0.0,1.0,1.0,1.0
2,0.634615,0.5,0.000000,0.571429,0.0,0.453608,0.000000,0.684211,1.000000,0.909091,...,1.0,0.666667,0.229508,8.0,1.0,0.0,0.0,1.0,0.0,3.0
3,0.961538,1.0,0.666667,0.214286,0.0,0.000000,0.090909,0.473684,0.272727,0.909091,...,0.0,0.666667,0.934426,10.0,0.0,1.0,0.0,1.0,3.0,1.0
4,0.961538,0.5,0.666667,0.785714,0.0,0.000000,0.090909,0.842105,0.090909,0.909091,...,1.0,0.000000,0.967213,3.0,0.0,1.0,0.0,3.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76324,0.442308,1.0,0.000000,0.642857,0.0,0.402062,0.000000,0.736842,0.818182,0.818182,...,1.0,0.000000,0.180328,11.0,1.0,1.0,0.0,2.0,2.0,0.0
76325,0.865385,0.5,0.333333,0.928571,0.0,0.000000,0.090909,0.894737,0.909091,0.909091,...,1.0,0.000000,0.950820,7.0,1.0,0.0,0.0,2.0,3.0,1.0
76326,0.326923,1.0,0.333333,0.785714,0.0,0.000000,0.090909,0.842105,0.000000,0.909091,...,0.0,0.000000,0.295082,10.0,0.0,0.0,0.0,2.0,1.0,2.0
76327,0.692308,1.0,0.333333,0.214286,0.0,0.319588,0.000000,0.473684,1.000000,0.954545,...,1.0,0.000000,0.836066,11.0,0.0,0.0,0.0,2.0,3.0,3.0


In [9]:
# Time to build my logistic regression model 

from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=0, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_clean, y_train)

train = classification.predict(X_train_clean)
train_score = classification.score(X_train_clean, y_train)
print("Train Score:",train_score)

# Now we can make predictions on the test set:
test = classification.predict(X_test_clean)
test_score = classification.score(X_test_clean, y_test)
print("Test Score:",test_score)

Train Score: 0.949848681366191
Test Score: 0.9468112980139392


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [10]:
# That seems like an amazing score - almost too good to be true. Perhaps the datas is super imbalanced....
# So it just predicts 0 donations and only got 1015 wrong

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, test)

array([[18068,     0],
       [ 1015,     0]])

In [11]:
# Now to try downsampling - first I need to temporarily concat X_train and y_train
# need to reset_index on y_train to make sure they line up
trainset = pd.concat([X_train_clean, y_train.reset_index(drop=True)], axis=1)

# quicker way to downsample category 0:
category_0_downsampled = trainset[trainset['TARGET_B']==0].sample(len(trainset[trainset['TARGET_B']==1])) # Could just use the df.sample function here

category_1 = trainset[trainset['TARGET_B']== 1 ]
trainset_new = pd.concat([category_0_downsampled, category_1], axis = 0)
trainset_new = trainset_new.sample(frac=1) #randomize the rows
X_train_treated_downsampled = trainset_new.drop(['TARGET_B'], axis=1)
y_train = trainset_new['TARGET_B']

print(X_train_treated_downsampled.shape) # split between cat 0 and cat 1

(7656, 337)


In [12]:
# Time to test the model again 
classification = LogisticRegression(random_state=0, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_treated_downsampled, y_train)

train = classification.predict(X_train_treated_downsampled)
train_score = classification.score(X_train_treated_downsampled, y_train)
print("Train Score:",train_score)

# Now we can make predictions on the test set:
test2 = classification.predict(X_test_clean)
test_score = classification.score(X_test_clean, y_test)
print("Test Score:",test_score)

Train Score: 0.6140282131661442
Test Score: 0.5916784572656291


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Model is much worse as expected - let's check the confusion matrix again
# This time we have many more false positives, but we've identified 500 donors - yay
confusion_matrix(y_test, test2)

array([[10758,  7310],
       [  482,   533]])

In [14]:
# Now to try upsampling - seems like there's a tool for that 

from sklearn.utils import resample

# Need to reset_index on y_train to make sure they line up
trainset2 = pd.concat([X_train_clean, y_train.reset_index(drop=True)], axis=1)

# Upsample category 1 to match the number of samples in category 0
category_1_upsampled = resample(trainset2[trainset2['TARGET_B'] == 1],
                                replace=True,  # sample with replacement
                                n_samples=len(trainset2[trainset2['TARGET_B'] == 0]),  # match number in category 0
                                random_state=42)  # reproducible results

# Combine the upsampled category 1 with category 0
trainset_new = pd.concat([trainset[trainset['TARGET_B'] == 0], category_1_upsampled])

# Shuffle the rows
trainset_new = trainset_new.sample(frac=1, random_state=42)

# Separate features and target label
X_train_treated_upsampled = trainset_new.drop(['TARGET_B'], axis=1)
y_train_upsampled = trainset_new['TARGET_B']


In [15]:
# Time to run the model again and see what the results are
classification = LogisticRegression(random_state=0, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_treated_upsampled, y_train_upsampled)

train = classification.predict(X_train_treated_upsampled)
train_score = classification.score(X_train_treated_upsampled, y_train_upsampled)
print("Train Score:",train_score)

# Now we can make predictions on the test set:
test3 = classification.predict(X_test_clean)
test_score = classification.score(X_test_clean, y_test)
print("Test Score:",test_score)

Train Score: 0.949848681366191
Test Score: 0.9468112980139392


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [16]:
# Looks very similar, slightly fewer true negatives this time
display(confusion_matrix(y_test, test3))
print("price of false positives:", 8643*.68)
print("revenue from identifying donors:", 550*15)

array([[18068,     0],
       [ 1015,     0]])

price of false positives: 5877.240000000001
revenue from identifying donors: 8250


# Lab | Random Forests
For this lab, you will be using the .CSV files provided in the files_for_lab folder. These are cleaned versions of the learningSet data from the Case Study 'Healthcare for All'.
You may continue in the Jupyter Notebook you created yesterday. There is no need to fork and clone this Repo.

Instructions
- Apply the Random Forests algorithm AFTER upscaling the data to deal with the imbalance.
- Use Feature Selections that you have learned in class to decide if you want to use all of the features (Variance Threshold, RFE, PCA, etc.)
- Re-run the Random Forest algorithm to determine if the Feature Selection has improved the results.
- Discuss the output and its impact in the business scenario. Is the cost of a false positive equals to the cost of the false negative? How would you change your algorithm or data in order to maximize the return of the business?

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

clf = RandomForestClassifier(max_depth=5, # max number of questions to ask
                             min_samples_split=20, # amount of rows still considered at every question
                             min_samples_leaf =20, # ultimate answer based on at least this many rows
                             max_samples=0.8, # fraction of X-train to use in each tree
                             random_state=42)
clf.fit(X_train_treated_upsampled, y_train_upsampled)
print(clf.score(X_train_treated_upsampled, y_train_upsampled))
print(clf.score(X_test_clean, y_test))

y_pred = clf.predict(X_test_clean)
display(y_test.value_counts())
display(confusion_matrix(y_test, y_pred))

0.949848681366191
0.9468112980139392


TARGET_B
0    18068
1     1015
Name: count, dtype: int64

array([[18068,     0],
       [ 1015,     0]])

In [18]:
# Results seem moderately worse than before, huge numbers of false negatives
print("price of false positives:", 7497*.68)
print("revenue from identifying donors:", 378*15)

price of false positives: 5097.96
revenue from identifying donors: 5670


In [19]:
# I realised the above calculation is actually slightly wrong because you need to include the 637 in the 7497 
# Since you still need to pay for the postage for those who return, so the margin is less good

In [20]:
cats = pd.read_csv("categorical.csv")
nums = pd.read_csv("numerical.csv")
target = pd.read_csv("target.csv")


In [21]:
# Ok, now only 79 columns instead of 350, so let's start again on the df and see if that improves the model

X = pd.concat([cats, nums], axis=1)
y = target['TARGET_B']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)

In [22]:
# Getting numericals AGAIN 
X_train_num = X_train.select_dtypes(include = np.number)
X_test_num = X_test.select_dtypes(include = np.number)

transformer = MinMaxScaler().fit(X_train_num) # need to keep transformer

X_train_normalized = transformer.transform(X_train_num)
X_test_normalized = transformer.transform(X_test_num)
X_train_norm = pd.DataFrame(X_train_normalized, columns=X_train_num.columns)
X_test_norm = pd.DataFrame(X_test_normalized, columns=X_test_num.columns)
X_train_norm

Unnamed: 0,CLUSTER,DATASRCE,DOMAIN_B,ODATEW_YR,ODATEW_MM,DOB_YR,DOB_MM,MINRDATE_YR,MINRDATE_MM,MAXRDATE_YR,...,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
0,0.769231,0.5,0.666667,0.714286,0.0,0.649485,0.000000,0.789474,0.000000,0.909091,...,0.055556,0.00300,0.002002,0.015,0.002757,0.007146,0.509954,0.0,0.333333,0.885246
1,0.826923,0.0,0.333333,0.857143,0.0,0.536082,0.000000,0.894737,0.545455,0.954545,...,0.083333,0.00500,0.001201,0.011,0.006434,0.005978,0.909749,1.0,1.000000,0.262295
2,0.634615,0.5,0.000000,0.571429,0.0,0.453608,0.000000,0.684211,1.000000,0.909091,...,0.138889,0.00300,0.001001,0.010,0.008272,0.004865,0.606703,1.0,0.666667,0.229508
3,0.961538,1.0,0.666667,0.214286,0.0,0.000000,0.090909,0.473684,0.272727,0.909091,...,0.694444,0.00102,0.001201,0.011,0.003676,0.002367,0.416533,0.0,0.666667,0.934426
4,0.961538,0.5,0.666667,0.785714,0.0,0.000000,0.090909,0.842105,0.090909,0.909091,...,0.027778,0.00500,0.005005,0.030,0.011029,0.018662,0.500130,1.0,0.000000,0.967213
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76324,0.442308,1.0,0.000000,0.642857,0.0,0.402062,0.000000,0.736842,0.818182,0.818182,...,0.055556,0.01000,0.002202,0.015,0.012868,0.011652,0.963306,1.0,0.000000,0.180328
76325,0.865385,0.5,0.333333,0.928571,0.0,0.000000,0.090909,0.894737,0.909091,0.909091,...,0.000000,0.02000,0.003003,0.020,0.008272,0.018662,0.114639,1.0,0.000000,0.950820
76326,0.326923,1.0,0.333333,0.785714,0.0,0.000000,0.090909,0.842105,0.000000,0.909091,...,0.027778,0.01500,0.002803,0.019,0.010110,0.015658,0.403147,0.0,0.000000,0.295082
76327,0.692308,1.0,0.333333,0.214286,0.0,0.319588,0.000000,0.473684,1.000000,0.954545,...,0.305556,0.00200,0.002002,0.015,0.005515,0.003374,0.304339,1.0,0.000000,0.836066


In [23]:
# Reduce the train and test data 
from sklearn.feature_selection import VarianceThreshold 

var_threshold = 0.02
sel = VarianceThreshold(threshold=(var_threshold))

# Fit the VarianceThreshold selector
sel.fit(X_train_norm)

# Get the indices of the selected features
selected_indices = sel.get_support(indices=True)

# Get the selected column names
selected_columns = X_train_norm.columns[selected_indices]

# Transform the original DataFrame and apply the same columns to both data frames
trim_X_train = pd.DataFrame(sel.transform(X_train_norm), columns=selected_columns)

trim_X_test = pd.DataFrame(sel.transform(X_test_norm), columns=selected_columns)


In [24]:
# Now to put the data set back together again after encoding 
X_train_cat = X_train.select_dtypes(include = object)
X_test_cat = X_test.select_dtypes(include = object)

encoder = OrdinalEncoder()

# Fit the encoder to the data
encoder.fit(X_train_cat)

# Transform the data using the fitted encoder
X_train_cat_encoded = encoder.transform(X_train_cat)
X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded, columns=X_train_cat.columns)

X_test_cat_encoded = encoder.transform(X_test_cat)
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded, columns=X_test_cat.columns)
X_train_clean =  pd.concat([trim_X_train, X_train_cat_encoded_df], axis=1)
X_test_clean =  pd.concat([trim_X_test, X_test_cat_encoded_df], axis=1)


In [25]:
# Now running the model on every row 

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

clf = RandomForestClassifier(max_depth=5, # max number of questions to ask
                             min_samples_split=20, # amount of rows still considered at every question
                             min_samples_leaf =20, # ultimate answer based on at least this many rows
                             max_samples=0.8, # fraction of X-train to use in each tree
                             random_state=42)
clf.fit(X_train_clean, y_train)
print(clf.score(X_train_clean, y_train))
print(clf.score(X_test_clean, y_test))

y_pred = clf.predict(X_test_clean)
display(y_test.value_counts())
display(confusion_matrix(y_test, y_pred))

0.949848681366191
0.9468112980139392


TARGET_B
0    18068
1     1015
Name: count, dtype: int64

array([[18068,     0],
       [ 1015,     0]])

In [26]:
# Super imbalanced so I need to do the upsampling again and see if that helps, then run the model again

trainset4 = pd.concat([X_train_clean, y_train.reset_index(drop=True)], axis=1)
# quicker way to downsample category 0:
category_4_downsampled = trainset4[trainset4['TARGET_B']==0].sample(len(trainset4[trainset4['TARGET_B']==1]))
print(category_4_downsampled.shape)

category_5 = trainset4[trainset4['TARGET_B']== 1 ]
print(category_5.shape)
trainset_new3= pd.concat([category_4_downsampled, category_5], axis = 0)
trainset_new3 = trainset_new3.sample(frac=1) #randomize the rows
X_train_treated_downsampled = trainset_new3.drop(['TARGET_B'], axis=1)
y_train = trainset_new3['TARGET_B']

print(X_train_treated_downsampled.shape)

(3828, 98)
(3828, 98)
(7656, 97)


In [27]:
# Double checking 
print(X_train_treated_downsampled.shape)
print(y_train.shape)
print(X_test_clean.shape)
print(y_test.shape)

(7656, 97)
(7656,)
(19083, 97)
(19083,)


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

clf = RandomForestClassifier(max_depth=5, # max number of questions to ask
                             min_samples_split=20, # amount of rows still considered at every question
                             min_samples_leaf =20, # ultimate answer based on at least this many rows
                             max_samples=0.8, # fraction of X-train to use in each tree
                             random_state=42)
clf.fit(X_train_treated_downsampled, y_train)
print(clf.score(X_train_treated_downsampled, y_train))
print(clf.score(X_test_clean, y_test))

y_pred = clf.predict(X_test_clean)
display(y_test.value_counts())
display(confusion_matrix(y_test, y_pred))

0.6327063740856844
0.5849185138605041


TARGET_B
0    18068
1     1015
Name: count, dtype: int64

array([[10605,  7463],
       [  458,   557]])

In [29]:
# Previous results were as follows
# array([[10571,  7497],
#        [  637,   378]])

print("price of false positives:", 9304*.68)
print("revenue from identifying donors:", 528*15)

# Not a great margin in the end for all that work, but just about the best result (528 vs 518)

price of false positives: 6326.72
revenue from identifying donors: 7920


In [31]:
# Create a DataFrame with actual and predicted values for later 
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

# Identify the rows where the model correctly predicted a 1
correctly_predicted_1 = results[(results['Actual'] == 1) & (results['Predicted'] == 1)]

# Get the indices of these rows
correctly_predicted_1_indices = correctly_predicted_1.index

# X = pd.concat([X_train_clean, X_test_clean], axis=1) Tried and failed to use the cleaned data

# Display these rows
correctly_predicted_1_rows = X.loc[correctly_predicted_1_indices]
# 
correctly_predicted_1_rows

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B,...,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2
39860,TX,25,H,M,3,L,E,A,C,2,...,9,3.0,15.00,10.0,10,10.060606,5144,0,2,51
48302,CA,5,H,M,3,L,G,A,U,2,...,8,5.0,30.00,30.0,6,15.176471,149433,0,1,9
82759,FL,28,H,M,2,L,G,B,C,2,...,0,8.0,25.00,25.0,13,16.750000,43998,0,2,23
76061,TX,15,U,F,1,L,G,A,S,1,...,2,10.0,26.25,25.0,3,20.638000,121773,1,3,10
74404,other,12,H,M,1,L,F,A,S,1,...,5,3.0,15.00,12.0,8,9.000000,10606,0,3,20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54241,MI,5,U,F,3,L,D,A,U,2,...,11,2.0,7.00,6.0,7,4.176471,65565,1,4,27
22034,other,40,U,M,3,L,E,B,T,2,...,22,5.0,10.00,6.0,6,7.800000,56451,1,3,17
35183,CA,14,H,M,2,L,F,A,S,1,...,1,20.0,20.00,20.0,9,20.000000,150043,0,1,1
76232,MO,27,H,F,3,L,D,C,C,2,...,3,1.0,5.00,5.0,4,2.571429,101390,1,4,35


In [None]:
# X_full_data = pd.concat([X_train_clean, X_test_clean], axis=0)

# y_pred_full = clf.predict(X_full_data)

# # Evaluate the performance on the entire dataset
# full_dataset_accuracy = clf.score(X_full_data, y)
# print("Accuracy on the entire dataset:", full_dataset_accuracy)
# X_full_data

# Lab | Final regression model in "Health Care for All" Case

In this lab, subset those that have made a donation (Target B) and use that subset to create a model to predict how much money will they give (Target D) (Regression Model).

- Only look at people who have donated (Target B = 1)
- Use this new dataframe to create a model to predict how much they will donate (Target D)
- Using the regression model, make predictions on all of the people our classification model predicted will donate.

Evaluate the result of your model and estimate how much better the result are for the business in comparison with the naive scenario we discuss on Monday. (Just sending donation cards to everyone)

![Screenshot 2024-05-16 at 14.21.48.png](attachment:8304ca04-5c28-4998-8843-a38a56d17eeb.png)

In [32]:
# Let's run it backkkkkkk

cats = pd.read_csv("categorical.csv")
nums = pd.read_csv("numerical.csv")
target = pd.read_csv("target.csv")


In [33]:
megaset = pd.concat([cats, nums, target], axis=1)
filtered = megaset[megaset['TARGET_B'] == 1]
filtered = filtered.drop(columns='TARGET_B')
filtered

Unnamed: 0,STATE,CLUSTER,HOMEOWNR,GENDER,DATASRCE,RFA_2R,RFA_2A,GEOCODE2,DOMAIN_A,DOMAIN_B,...,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_D
20,other,12,H,F,3,L,D,A,S,1,...,2.00,7.0,5.0,12,4.066667,82943,1,3,3,4.0
30,TX,35,H,M,3,L,D,A,T,1,...,2.00,10.0,7.0,9,6.181818,190313,1,3,14,7.0
45,other,24,H,F,3,L,D,C,C,1,...,3.00,6.0,5.0,3,4.857143,76585,1,3,11,5.0
78,CA,13,H,F,2,L,F,A,S,1,...,5.00,17.0,10.0,21,11.000000,156378,0,2,2,13.0
93,GA,18,H,M,3,L,E,A,S,2,...,5.00,12.0,12.0,6,9.400000,25641,1,3,22,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95298,CA,36,H,F,3,L,F,A,T,2,...,0.07,17.0,17.0,7,7.935667,154544,0,1,52,20.0
95309,CA,12,H,F,3,L,F,B,S,1,...,5.00,15.0,15.0,4,11.666667,171302,1,1,20,15.0
95398,WI,11,H,F,3,L,G,B,S,1,...,5.00,25.0,20.0,15,14.400000,78831,0,3,3,3.0
95403,other,49,H,F,2,L,F,D,R,2,...,3.00,20.0,20.0,10,11.583333,84678,0,1,56,10.0


In [34]:
X = filtered.drop(columns='TARGET_D')
y = filtered['TARGET_D']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)

# Getting numericals AGAIN 
X_train_num = X_train.select_dtypes(include = np.number)
X_test_num = X_test.select_dtypes(include = np.number)

transformer = MinMaxScaler().fit(X_train_num) # need to keep transformer

X_train_normalized = transformer.transform(X_train_num)
X_test_normalized = transformer.transform(X_test_num)
X_train_norm = pd.DataFrame(X_train_normalized, columns=X_train_num.columns)
X_test_norm = pd.DataFrame(X_test_normalized, columns=X_test_num.columns)
X_train_norm

# Reduce the train and test data 
from sklearn.feature_selection import VarianceThreshold 

var_threshold = 0.02
sel = VarianceThreshold(threshold=(var_threshold))

# Fit the VarianceThreshold selector
sel.fit(X_train_norm)

# Get the indices of the selected features
selected_indices = sel.get_support(indices=True)

# Get the selected column names
selected_columns = X_train_norm.columns[selected_indices]

# Transform the original DataFrame and apply the same columns to both data frames
trim_X_train = pd.DataFrame(sel.transform(X_train_norm), columns=selected_columns)

trim_X_test = pd.DataFrame(sel.transform(X_test_norm), columns=selected_columns)

# Now to put the data set back together again after encoding 
X_train_cat = X_train.select_dtypes(include = object)
X_test_cat = X_test.select_dtypes(include = object)

encoder = OrdinalEncoder()

# Fit the encoder to the data
encoder.fit(X_train_cat)

# Transform the data using the fitted encoder
X_train_cat_encoded = encoder.transform(X_train_cat)
X_train_cat_encoded_df = pd.DataFrame(X_train_cat_encoded, columns=X_train_cat.columns)

X_test_cat_encoded = encoder.transform(X_test_cat)
X_test_cat_encoded_df = pd.DataFrame(X_test_cat_encoded, columns=X_test_cat.columns)
X_train_clean =  pd.concat([trim_X_train, X_train_cat_encoded_df], axis=1)
X_test_clean =  pd.concat([trim_X_test, X_test_cat_encoded_df], axis=1)

print(X_train_clean.shape)
print(X_test_clean.shape)
print(y_train.shape)
print(y_test.shape)

(3874, 114)
(969, 114)
(3874,)
(969,)


In [35]:
# Now to run our regression model - we will use linear regression because we are predicting values
# Maybe we could try KNN afterwards - let's see 

from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(X_train_clean,y_train)

from sklearn.metrics import r2_score
predictions = lm.predict(X_train_clean)
print("R2 train score is:",r2_score(y_train, predictions))

predictions_test = lm.predict(X_test_clean)
print("R2 test score is:",r2_score(y_test, predictions_test))

from sklearn.metrics import mean_squared_error
mse=mean_squared_error(y_test,predictions_test)
print("MSE is: ", mse)


rmse = np.sqrt(mean_squared_error(y_test,predictions_test))
print("RMSE is: ",rmse)

print("RMSE/Ytest mean is: ",rmse/y_test.mean())

from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
print("MAE-score is ", mean_absolute_error(y_test, predictions_test))

R2 train score is: 0.34975452838709986
R2 test score is: 0.3564822614212426
MSE is:  71.22154314490466
RMSE is:  8.439285701106739
RMSE/Ytest mean is:  0.5558248668917627
MAE-score is  5.240085102512447


In [36]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=20)  # Have worked through them a bit but they're still terrible scores

# Fit the model
knn.fit(X_train_clean, y_train)

# Make predictions
y_pred = knn.predict(X_test_clean)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 78.0038675105779
R-squared: 0.29520099952516554


In [43]:
# common_columns = correctly_predicted_1_rows.columns.intersection(X_train_clean.columns)

# # Subset X_test_clean to only include common columns
# correctly_predicted_1_rows_clean = correctly_predicted_1_rows[common_columns

common_columns = correctly_predicted_1_rows.columns.intersection(X_train_clean.columns)

# Reorder X_test_clean to match the order of columns in X_train_treated_downsampled
correctly_predicted_1_rows_ordered = X_train_clean[common_columns]

# Ensure the common columns are in the same order as in X_train_treated_downsampled
correctly_predicted_1_rows_clean = correctly_predicted_1_rows_ordered[X_train_clean.columns]

In [39]:
# correctly_predicted_1_rows_clean = pd.DataFrame(sel.transform(correctly_predicted_1_rows), columns=selected_columns)

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- DOMAIN_A
- GENDER
- GEOCODE2
- HOMEOWNR
- RFA_2A
- ...


In [44]:
# Now I will apply my linear regression model to my data set from earlier (part 1)
# correctly_predicted_1_rows
predictions = lm.predict(correctly_predicted_1_rows_clean)
print("R2 train score is:",r2_score(y_train, predictions))

R2 train score is: 0.34975452838709986


In [46]:
# Predicted sum raised would be 60,956 raised (which seems like a lot) from the potential donors
predictions.sum()

60956.03000000003

In [52]:
# Here are the previous results
# Name: count, dtype: int64
# array([[10605,  7463],
#        [  458,   557]])

# This would therefore be much better than if mail was sent to everyone, which comes in at a bigger cost 
95412*0.68

64880.16