In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Below is the framework we will use to analyze the data:
1. Look at the big picture.
2. Get the Data.
3. Discover and Visualize data to gain insights (EDA).
4. Preprocess the data.
5. Select a Model and train/finetune it.
6. Evaluate the model.
7. Predict the test set. 

# 1. Big Picture

Our objective is to **identify which potential donors** the charity should contact in order to **maximize profitability of marketing campaign.**

From the given data, it looks like we will need to do some **supervised classification** to predict which donor will give us profit, meaning which donors have a positive gain when *amount - marketingCost*. Furthermore, we would need to have some interpretability to explain how we came up with that prediction i.e. what predictors we value most, etc, to communicate to our partners why our prediction is correct. 

# 2. Get the Data

First, we load the train, test datasets, and the zipcode dataset.

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
zipcode = pd.read_csv("zipCodeMarketingCosts.csv")

  interactivity=interactivity, compiler=compiler, result=result)


Now, let's take a peek at all the datasets.

In [3]:
train.head()

Unnamed: 0,date,source,title,state,zip,mailcode,has_chapter,dob,noexch,recinhse,...,amount,hphone_d,rfa_2r,rfa_2f,rfa_2a,mdmaud_r,mdmaud_f,mdmaud_a,cluster2,geocode2
0,9101,IMA,0,KY,40207-,,,6901,0,,...,,0,L,3,D,X,X,X,4.0,B
1,8601,LIS,2,MI,48504,,,4101,0,,...,,0,L,2,F,X,X,X,49.0,B
2,9601,AGS,28,WA,99218,,,0,0,,...,,0,L,3,E,X,X,X,48.0,B
3,9101,NAD,2,NM,88201,B,,5201,0,,...,34.0,0,L,1,F,X,X,X,39.0,C
4,9201,FRC,0,AL,35603,,,4301,0,,...,,0,L,1,G,X,X,X,16.0,C


In [4]:
test.head()

Unnamed: 0,date,source,title,state,zip,mailcode,has_chapter,dob,noexch,recinhse,...,hphone_d,rfa_2r,rfa_2f,rfa_2a,mdmaud_r,mdmaud_f,mdmaud_a,cluster2,geocode2,market
0,9301,TRE,1,FL,34461,,,2001,0,,...,1,L,2,F,X,X,X,52.0,C,
1,9101,PV3,1002,CA,91106,,,0,0,,...,0,L,1,F,X,X,X,24.0,A,
2,8601,MBC,0,MN,56470,,,4305,0,,...,0,L,3,D,X,X,X,59.0,D,
3,8601,BHG,0,IN,47441,,,0,0,,...,0,L,4,D,X,X,X,59.0,D,
4,9501,AIR,0,NC,28906,,,4201,0,,...,0,L,1,F,X,X,X,60.0,D,


In [5]:
zipcode.head()

Unnamed: 0,marketingCost,zip
0,2.53,35236
1,2.63,35541
2,1.73,35542
3,3.92,35235
4,2.32,35232


Lets look at the size of the datasets

In [6]:
print(train.shape)
print(test.shape)
print(zipcode.shape)

(182190, 481)
(9589, 480)
(20265, 2)


Look's like there's a lot of features involved and test doesn't have the response/amount, which makes sense.

# 3. Discover and EDA

# 4. Preprocess

TO DO:
1. Merge train, test with zipcode
2. Create new target labels in train (Profit/Not Profit = 1/0) 
    - Let us define Not Profit if the donor doesn't respond. **
    - Let us assume that each cost in the zipcode is for each individual donor **
    - Drop amount, responded
3. Combine train, test to preprocess
4. Pre-select Features: 
    - Drop redundant features
    - Select most important predictors/hypotheses to start
5. Categorical Features:
    - Decode encoded features
    - One-hot Encode
    - PCA(?) 
6. Numerical Features: 
    - Standardize
    - Do PCA

### Merge with Zipcode

First, we need to clean zip code in train, test, and zipcode.

In [7]:
train.zip = train.zip.str.replace('-','').astype(int)
test.zip = test.zip.str.replace('-','').astype(int)
zipcode.zip = zipcode.zip.astype(int)

In [8]:
#merge with zipcode dataset
train_merged = train.merge(zipcode, how = "left", left_on = "zip", right_on = "zip")
test_merged = test.merge(zipcode, how = "left", left_on = "zip", right_on = "zip")

### Create new target label for Train: profit

In [9]:
train_merged["net"] = train_merged.amount - train_merged.marketingCost
train_merged.net.fillna(0, inplace = True)
train_merged["profit"] = [1 if x > 0 else 0 for x in train_merged.net]
train_merged = train_merged.drop(columns = ["net","amount","responded"], axis = 1)

### Check for class imbalance

In [110]:
(sum(y_train > 0))/float(sum(y_train <= 0))

0.9514938246049817

In [79]:
#class is imbalanced, need to resample
from sklearn.utils import resample

train_majority = train_merged[train_merged.profit == 0]
train_minority = train_merged[train_merged.profit == 1]

train_minority_upsampled = resample(train_minority,
                                   replace = True,
                                   n_samples = sum(y_train > 0) * 20,
                                   random_state = 123) # match majority class

# Combine majority class with upsampled minority class
train_upsampled = pd.concat([train_majority, train_minority_upsampled])

In [80]:
train_upsampled.profit.value_counts()

0    173916
1    165480
Name: profit, dtype: int64

### Create X_train, y_train, y_test

In [81]:
y_train = train_upsampled.profit
X_train = train_upsampled.drop(columns = ["profit"], axis = 1)
X_test = test_merged.drop(columns = ["market"]) #drop market for now

In [82]:
print(y_train.shape)
print(X_train.shape)
print(X_test.shape)

(339396,)
(339396, 480)
(9589, 480)


### Combine Train, Test to preprocess

In [83]:
full_train = pd.concat([X_train, X_test])
print(full_train.shape)
full_train.head()

(348985, 480)


Unnamed: 0,date,source,title,state,zip,mailcode,has_chapter,dob,noexch,recinhse,...,hphone_d,rfa_2r,rfa_2f,rfa_2a,mdmaud_r,mdmaud_f,mdmaud_a,cluster2,geocode2,marketingCost
0,9101,IMA,0,KY,40207,,,6901,0,,...,0,L,3,D,X,X,X,4.0,B,2.11
1,8601,LIS,2,MI,48504,,,4101,0,,...,0,L,2,F,X,X,X,49.0,B,3.51
2,9601,AGS,28,WA,99218,,,0,0,,...,0,L,3,E,X,X,X,48.0,B,1.21
4,9201,FRC,0,AL,35603,,,4301,0,,...,0,L,1,G,X,X,X,16.0,C,1.95
5,9201,HHH,1,ID,83702,,,1703,0,X,...,0,L,2,E,X,X,X,48.0,C,2.93


### Pre-Select Features / Hypothesis

To simplify the model, let us use the columns that we will find pertinent to this problem, then add on additional columns later to see how they can improve prediction. 

The following is our hypothesis for this problem: 

We hypothesize that the the greatest predictors of donations will be: 
1. Social status 
    - Customer Title
    - RFM
    - Wealth Rating
2. Location
    - State
    - Neighborhood (domain, neighborhood)
3. Loyalty to Charity
    - RFM
    - File flags (In House, etc.)
    - Promotion History
4. Time of Year
5. Demographics
    - Family or not 
    - Gender
    - Age
6. Interests 

TO DO:
1. Clean Zip based on mailcode (if bad address, not use zipcode)
2. Explain why some features are dropped
3. Use State, Cluster

In [84]:
#drop features that are already captured in other features
#income range not understood feature, already captured in wealth rating
#wealth1/wealth2? 
feature_to_drop = ["date","source","zip", "mailcode", 
                   "dob", "ageflag", "income_range", 
                   "geocode", "lifesrc","id","cluster2","geocode2","msa","adi","dma"]

full_train_drop = full_train.drop(columns = feature_to_drop, axis = 1)

For starters, let's just use the following features: 
- Demographics (age, Gender, number of children) 
- RFM
- RFA
- Socioeconomic Status (title, wealth2)
- Characteristics of Neighborhood

In [85]:
#get column names of neighborhood characteristics
print("start of characteristics:",full_train_drop.columns.get_loc("pop901"))
print("end of characteristics:",full_train_drop.columns.get_loc("ac2"))
neigh_chars_colnames = list(full_train_drop.columns.values[66:349])

start of characteristics: 66
end of characteristics: 348


In [86]:
#categorical variables 
cat_features = ["title","mdmaud","domain","gender","rfa_2f","rfa_2r","rfa_2a"]

#numerical variables
num_features_small = ["numchld","wealth2","age"] #each wealth rating different in each state: how do we capture that? 
num_features = num_features_small + neigh_chars_colnames

#subset full dataset
all_features = cat_features + num_features
subset = full_train_drop[all_features]
subset.head()

Unnamed: 0,title,mdmaud,domain,gender,rfa_2f,rfa_2r,rfa_2a,numchld,wealth2,age,...,hc16,hc17,hc18,hc19,hc20,hc21,mhuc1,mhuc2,ac1,ac2
0,0,XXXX,C1,F,3,L,D,,9.0,29.0,...,0,99,0,53,99,99,9,2,8,9
1,2,XXXX,C3,F,2,L,F,2.0,,57.0,...,0,84,16,95,99,99,7,2,9,10
2,28,XXXX,C2,F,3,L,E,,,,...,4,99,0,99,99,98,10,4,4,3
4,0,XXXX,T2,F,1,L,G,,8.0,55.0,...,5,98,1,72,99,99,7,2,4,4
5,1,XXXX,C2,M,2,L,E,,4.0,81.0,...,2,99,0,99,99,98,5,2,0,5


In [87]:
subset.describe()

Unnamed: 0,title,rfa_2f,numchld,wealth2,age,pop901,pop902,pop903,pop90c1,pop90c2,...,hc16,hc17,hc18,hc19,hc20,hc21,mhuc1,mhuc2,ac1,ac2
count,348985.0,348985.0,42953.0,194869.0,264877.0,348985.0,348985.0,348985.0,348985.0,348985.0,...,348985.0,348985.0,348985.0,348985.0,348985.0,348985.0,348985.0,348985.0,348985.0,348985.0
mean,60.862424,2.02466,1.501851,5.064649,61.884388,3200.046627,854.412969,1209.588472,58.409066,13.625442,...,5.476161,82.499262,15.307237,71.7408,97.646108,94.657034,8.185994,2.34135,5.882677,6.026637
std,1018.306569,1.111574,0.790114,2.796953,16.267946,5540.794894,1416.979947,2076.948679,47.416041,31.285227,...,10.57553,28.486959,26.750646,35.620863,9.20995,10.305667,3.566352,0.861654,2.850272,3.285127
min,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,1.0,3.0,49.0,986.0,265.0,378.0,0.0,0.0,...,0.0,79.0,0.0,44.0,99.0,94.0,6.0,2.0,4.0,4.0
50%,1.0,2.0,1.0,5.0,63.0,1570.0,422.0,587.0,99.0,0.0,...,1.0,99.0,1.0,94.0,99.0,98.0,8.0,2.0,6.0,6.0
75%,2.0,3.0,2.0,8.0,75.0,3100.0,846.0,1169.0,99.0,0.0,...,5.0,99.0,18.0,99.0,99.0,99.0,9.0,3.0,7.0,8.0
max,72002.0,4.0,7.0,9.0,98.0,98701.0,23766.0,35403.0,99.0,99.0,...,99.0,99.0,99.0,99.0,99.0,99.0,21.0,5.0,99.0,99.0


We can see above that some features have missing values: age, wealth2, numchld. Let's assume all the neighborhood characteristics have no NaN and that they're reasonable.

### Decode

In [88]:
#decode mdmaud
subset.mdmaud.unique()

array(['XXXX', 'D2CM', 'D1CM', 'C1CM', 'C2LM', 'L1CM', 'I1CM', 'L1LM',
       'C2CM', 'I1LM', 'D5MM', 'C1LM', 'D5TM', 'L2CM', 'C5CM', 'D5CM',
       'I2CM', 'C2MM', 'D2MM', 'C5MM', 'C1MM', 'C5TM', 'I5MM', 'I2MM',
       'I5CM', 'L2TM', 'L1MM', 'L2LM'], dtype=object)

In [89]:
subset["recency"] = [x[0] for x in subset.mdmaud.values]
subset["frequency"] = [x[1] for x in subset.mdmaud.values]
subset["amount"] = [x[2] for x in subset.mdmaud.values]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [90]:
#decode domain
subset["urbanicity"] = [x[0] for x in subset.domain.values]
subset["neighborhood_status"] = [x[1] if x != ' ' else "X" for x in subset.domain.values] #remove one missing value

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [91]:
#add into categorical features
cat_features += ["urbanicity","neighborhood_status","recency","frequency","amount"]

### Impute Missing Values

### Categorical Values

In [92]:
#make gender into three categories
subset.gender = ["Other" if x != "F" and x != "M" else x for x in subset.gender.values]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


#### Numerical Values

We will use sklearn's Imputer for this case and use the median strategy.

TO DO:
- use a better imputer strategy (quick regression)

In [93]:
subset.head()

Unnamed: 0,title,mdmaud,domain,gender,rfa_2f,rfa_2r,rfa_2a,numchld,wealth2,age,...,hc21,mhuc1,mhuc2,ac1,ac2,recency,frequency,amount,urbanicity,neighborhood_status
0,0,XXXX,C1,F,3,L,D,,9.0,29.0,...,99,9,2,8,9,X,X,X,C,1
1,2,XXXX,C3,F,2,L,F,2.0,,57.0,...,99,7,2,9,10,X,X,X,C,3
2,28,XXXX,C2,F,3,L,E,,,,...,98,10,4,4,3,X,X,X,C,2
4,0,XXXX,T2,F,1,L,G,,8.0,55.0,...,99,7,2,4,4,X,X,X,T,2
5,1,XXXX,C2,M,2,L,E,,4.0,81.0,...,98,5,2,0,5,X,X,X,C,2


In [94]:
from sklearn.preprocessing import Imputer

#Impute numerical columns
def impute_num(data, strategy, columns_to_impute):
    
    df = data.copy()
    
    imputer = Imputer(strategy = strategy)
    
    for column in columns_to_impute:
        df[[column]] = imputer.fit_transform(df[[column]])
    
    return df

#for numchild, assume no child
subset.numchld = subset.age.fillna(0)

#for wealth, assume middle class
subset.wealth2 = subset.age.fillna(5)

#impute median for age
subset = impute_num(subset, "median", ["age"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


### One hot Encode Categorical Values

Before we hot-encode, let's double check the categorical features

In [95]:
subset[cat_features].describe(include = ['O'])

Unnamed: 0,mdmaud,domain,gender,rfa_2r,rfa_2a,urbanicity,neighborhood_status,recency,frequency,amount
count,348985,348985,348985,348985,348985,348985,348985,348985,348985,348985
unique,28,17,3,1,4,6,5,5,4,5
top,XXXX,R2,F,L,F,S,2,X,X,X
freq,347793,48591,186951,348985,164732,82954,168824,347793,347793,347793


In [96]:
subset.gender.unique()

array(['F', 'M', 'Other'], dtype=object)

In [97]:
subset[cat_features].describe()

Unnamed: 0,title,rfa_2f
count,348985.0,348985.0
mean,60.862424,2.02466
std,1018.306569,1.111574
min,0.0,1.0
25%,0.0,1.0
50%,1.0,2.0
75%,2.0,3.0
max,72002.0,4.0


In [98]:
# drop domain code since not needed anymore
subset = subset.drop(columns = "domain", axis = 1)

#drop mdmaud since not needed anymore
subset = subset.drop(columns = "mdmaud", axis = 1)

In [99]:
cat_features = [x for x in cat_features if x not in ('domain', 'mdmaud')]

In [100]:
def encode_dummies(df, columns_to_encode): 
    return pd.get_dummies(df, prefix = columns_to_encode, columns = columns_to_encode)


subset = encode_dummies(subset, cat_features)

### Scale Numerical Values

This will help our dimension reduction.

In [101]:
from sklearn.preprocessing import MinMaxScaler

def scaler(data, columns_to_scale):
    
    df = data.copy()
    scaler = MinMaxScaler()
    
    for column in columns_to_scale:
        df[[column]] = scaler.fit_transform(df[[column]])
        
    return df 

subset = scaler(subset, num_features)

### Reduce Dimensions of Numerical Values

We only do PCA for numerical values since for binary data we would not achieve an accurate representation for the data.

In [102]:
subset.head()

Unnamed: 0,numchld,wealth2,age,pop901,pop902,pop903,pop90c1,pop90c2,pop90c3,pop90c4,...,recency_X,frequency_1,frequency_2,frequency_5,frequency_X,amount_C,amount_L,amount_M,amount_T,amount_X
0,0.295918,0.28866,0.28866,0.018592,0.024615,0.022992,1.0,0.0,0.0,0.464646,...,1,0,0,0,1,0,0,0,0,1
1,0.581633,0.57732,0.57732,0.011023,0.011782,0.014349,1.0,0.0,0.0,0.494949,...,1,0,0,0,1,0,0,0,0,1
2,0.0,0.041237,0.639175,0.078277,0.076748,0.098099,1.0,0.0,0.0,0.464646,...,1,0,0,0,1,0,0,0,0,1
4,0.561224,0.556701,0.556701,0.066261,0.077464,0.069486,0.787879,0.0,0.222222,0.484848,...,1,0,0,0,1,0,0,0,0,1
5,0.826531,0.824742,0.824742,0.007791,0.008163,0.009717,1.0,0.0,0.0,0.474747,...,1,0,0,0,1,0,0,0,0,1


In [103]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.90)

subset_num = subset[num_features]
subset_num_reduced = pca.fit_transform(subset_num)
subset_num_reduced

array([[ 0.72365942, -0.1472508 , -0.4659292 , ...,  0.12601657,
        -0.10672132,  0.18731442],
       [-0.53526081,  0.6966524 , -0.02231325, ...,  0.03170666,
         0.4611634 , -0.10498123],
       [ 0.03972832,  0.87269803,  1.25856537, ..., -0.14022146,
         0.09856285, -0.18756773],
       ...,
       [-0.25272633, -1.37638766,  0.38094027, ..., -0.03152259,
        -0.01516832, -0.13293231],
       [ 1.89723255, -0.839615  , -0.68220942, ...,  0.1066986 ,
         0.07683099,  0.00905444],
       [-1.56947797, -0.07552519, -0.4139631 , ..., -0.01896723,
         0.10015638,  0.02556052]])

In [104]:
pca.explained_variance_

array([1.24422328, 0.79536281, 0.49890718, 0.35153678, 0.26959322,
       0.21403161, 0.1796066 , 0.17206147, 0.16470201, 0.11484887,
       0.10580487, 0.095935  , 0.08412829, 0.06884408, 0.06123981,
       0.0560617 , 0.05241656, 0.04607383, 0.0439153 , 0.03874204,
       0.03710803, 0.03331658, 0.02833882, 0.02722285, 0.0249037 ,
       0.02327648, 0.02267779, 0.02034576, 0.01974198, 0.018518  ,
       0.01802544])

In [105]:
#drop previous columns, add in components
subset_final = subset.drop(columns = num_features, axis = 1)
subset_final = subset_final.join(pd.DataFrame(subset_num_reduced).iloc[:,0:11])

### Split back to Train and Test

In [109]:
y_train.shape

(339396,)

In [112]:
X_train = subset_final.iloc[0:339396]
X_test = subset_final.iloc[339396:subset_final.shape[0]]
print(X_train.shape)
print(X_test.shape)

(339396, 104)
(9589, 104)


# 5. Train Model / Tune it

Let us first start with a baseline classification algorithm: a **logistic regression**. NOTE: logistic regression did not converge, which made sense as logistic regression is not meant for high dimension

In [113]:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(random_state = 0, solver = 'sag', verbose = 1) #sag faster for large datasets
# clf.fit(X_pretrain, y_pretrain)

For a baseline, we will have a model that predicts the variable randomly. (Random Generator from 0 to 1)

In [120]:
import random
y_pred = np.random.randint(2, size=len(y_train))

from sklearn.metrics import recall_score
recall_score(y_train, y_pred, average="macro")

0.5005242447317845

Finally, let's try out a **Random Forest Classifier**.

In [114]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 100, max_depth = 10, random_state = 0)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

### Cross-Validate and Tune the Model

TO DO: 
- Tune the dataset

In [100]:
#Grid-Search parameters


# 6. Evaluate the Model

In this evaluation, we

### Evaluate the Pretest 

The metrics we will use are ROC/AUC, Precision, Recall, F1.

In [116]:
from sklearn.model_selection import cross_validate

scoring = {'precision' : 'precision_macro',
           'recall' : 'recall_macro', 
           'accuracy' : 'accuracy',
           'roc_auc': 'roc_auc'}

scores = cross_validate(rfc, X_train, y_train, scoring = scoring, cv = 5, return_train_score = True)

SGD Results **not good**: 
{'fit_time': array([138.87587214, 141.50030422, 136.17024302, 152.12180829,
        136.2348361 ]),
 'score_time': array([0.29960179, 0.22892308, 0.24707389, 0.25374961, 0.24197674]),
 'test_precision': array([0.47722205, 0.47722205, 0.47722205, 0.47723842, 0.47723764]),
 'train_precision': array([0.47723004, 0.47723004, 0.47723004, 0.47722595, 0.47722614]),
 'test_recall': array([0.5, 0.5, 0.5, 0.5, 0.5]),
 'train_recall': array([0.5, 0.5, 0.5, 0.5, 0.5]),
 'test_accuracy': array([0.9544441 , 0.9544441 , 0.9544441 , 0.95447684, 0.95447528]),
 'train_accuracy': array([0.95446008, 0.95446008, 0.95446008, 0.9544519 , 0.95445229]),
 'test_f1': array([0.48834556, 0.48834556, 0.48834556, 0.48835413, 0.48835372]),
 'train_f1': array([0.48834974, 0.48834974, 0.48834974, 0.4883476 , 0.4883477 ]),
 'test_roc_auc': array([0.48871541, 0.51291833, 0.5078014 , 0.51084327, 0.49734952]),
 'train_roc_auc': array([0.50664669, 0.50165642, 0.50319574, 0.49676853, 0.49933318])}

In [117]:
scores

{'fit_time': array([53.32907581, 51.04806089, 52.99109793, 50.8713851 , 51.64807916]),
 'score_time': array([4.10545874, 4.31011844, 4.74526668, 4.5207932 , 4.27778769]),
 'test_precision': array([0.49002941, 0.486505  , 0.48131936, 0.49459425, 0.49154836]),
 'train_precision': array([0.72286259, 0.74196717, 0.72568797, 0.74386021, 0.74092431]),
 'test_recall': array([0.49514208, 0.4945944 , 0.49058444, 0.49748568, 0.49704195]),
 'train_recall': array([0.63981011, 0.64454165, 0.64302278, 0.66305481, 0.63021664]),
 'test_accuracy': array([0.50405127, 0.50422075, 0.49934442, 0.50657788, 0.50706404]),
 'train_accuracy': array([0.6473136 , 0.65234221, 0.65046019, 0.67011274, 0.63856407]),
 'test_roc_auc': array([0.49159006, 0.48485496, 0.49477282, 0.49448732, 0.49788929]),
 'train_roc_auc': array([0.781916  , 0.79547371, 0.79316329, 0.80487031, 0.79184019])}

We'd like a model that go beyond 0.5 in roc/auc, recall and precision, as a classifier with such a metric is simply guessing randomly whether a donor will donate or not.

# 7. Predict the Test Set

In [None]:
best_model = RandomForestClassifier(n_estimators = 100, max_depth = 2, random_state = 0)
best_model.fit(X_train)
results = best_model.predict(X_test)

In [None]:
test = pd.read_csv("test.csv")
test.market = results
test.to_csv("test_predicted.csv")

# 8. Next Steps

The classifier still did not produce good results, as it could not exceed a baseline model (randomly choose between 0 and 1). If there were more time, we would try to use more data and find better models to perform prediction. A few things to try: 
1. Look more into the class imbalance.
2. Look into PCA, locate errors. 
3. Use more data.