### WE3a-DTrees

Universal Bank recently trialed a marketing campaign to sell its new CD account product to existing customers. The bank contacted 5000 of its non-CD-account customers with an offer. The data provided in universal.csv is the result of this market test.

Use the techniques covered in this class to load and clean the data. Then, identify the best predictive model (using only the models covered thus far). Use RandomizedSearchCV combined with GridSearchCV to identify the best parameters for each model tested.

Be sure to document your thought process using markdown. Think of this as a report that your manager will read. This assignment requires you to decide how to process the provided data best (i.e., encoding). Be sure to provide your arguments/observations in markdown as you progress through data preparation, fitting, and performance evaluation.
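
The two-stage search named in the brief can be read as a randomized search over a wide parameter space, followed by a grid search in a narrow band around the randomized winner. The sketch below shows one way that might look for a decision tree. The synthetic data from `make_classification`, the parameter ranges, the `f1` scoring choice, and the names `wide_space` and `narrow_grid` are illustrative assumptions, not part of the assignment or the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the Universal Bank file).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)

# Stage 1: sample settings at random from a wide space (ranges are assumptions).
wide_space = {
    "max_depth": list(range(2, 21)),
    "min_samples_leaf": list(range(1, 31)),
    "criterion": ["gini", "entropy"],
}
rand_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=wide_space,
    n_iter=20, cv=3, scoring="f1", random_state=1,
)
rand_search.fit(X, y)
best = rand_search.best_params_

# Stage 2: exhaustive grid in a narrow band around the randomized winner.
narrow_grid = {
    "max_depth": sorted({max(1, best["max_depth"] + d) for d in (-1, 0, 1)}),
    "min_samples_leaf": sorted({max(1, best["min_samples_leaf"] + d) for d in (-2, 0, 2)}),
    "criterion": [best["criterion"]],
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), narrow_grid, cv=3, scoring="f1"
)
grid_search.fit(X, y)

print("randomized:", best, round(rand_search.best_score_, 3))
print("refined:   ", grid_search.best_params_, round(grid_search.best_score_, 3))
```

Because the narrow grid contains the randomized winner and both searches use the same unshuffled folds, the refined score in this sketch should not fall below the randomized one.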

### Importing necessary modules

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier 

np.random.seed(1)

### Reading and displaying data from the link

In [2]:
univBankdata = pd.read_csv('https://raw.githubusercontent.com/prof-tcsmith/data/master/UniversalBank.csv') 

In [3]:
univBankdata.head(5)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


### Details of the data

ID: Customer ID (dropped; it is an identifier, not a predictor)

Age: Customer's age in completed years  

Experience: Number of years of professional experience  

Income: Annual income of the customer

Family Size: Family size of the customer  

CCAvg: Average spending on credit cards per month (x1000)

Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional  

Mortgage: Value of house mortgage if any (x1000)  

Personal Loan: (1 if customer has a personal loan with the bank, 0 otherwise)

Securities Account: (1 if customer has a securities account with the bank, 0 otherwise) 

CD Account: (1 if customer has certificate of deposit (CD) account with bank, 0 otherwise) 

Online Banking: (1 if customer uses Internet banking facilities, 0 otherwise) 

Credit Card: (1 if customer uses credit card issued by Universal Bank, 0 otherwise)  


## Cleaning the data

### Tidying column names and replacing the Education codes with labels

In [4]:
univBankdata.columns = [s.strip() for s in univBankdata.columns] 
univBankdata.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

In [5]:
univBankdata['Education'] = univBankdata['Education'].replace({1: "Undergrad", 2: 'Graduate', 3: 'Advanced/Professional'})

### Implementing one-hot encoding for Education

In [6]:
dummies_education = pd.get_dummies(univBankdata['Education'], prefix='Education', drop_first=False)
univBankdata = univBankdata.join(dummies_education)
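
As a quick check on what the encoding step produces, the sketch below runs the same `replace` and `get_dummies` calls on a four-row toy column. The toy frame and the `dtype=int` argument are assumptions added for illustration: recent pandas versions return boolean indicators by default, while the output later in this notebook shows `uint8`.

```python
import pandas as pd

# Toy stand-in for the Education column (codes as in the data dictionary).
toy = pd.DataFrame({"Education": [1, 2, 3, 1]})
labels = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}
toy["Education"] = toy["Education"].replace(labels)

# dtype=int keeps the indicators as 0/1 regardless of the pandas default.
dummies = pd.get_dummies(toy["Education"], prefix="Education", dtype=int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # one indicator set per row
```

With `drop_first=False` every level keeps its own column, so each row has exactly one indicator set to 1.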

### Dropping unnecessary columns

In [7]:
univBankdata = univBankdata.drop(['ID', 'ZIP Code', 'Education'], axis=1)

### Properties and observations of cleaned data

In [8]:
univBankdata.head(3)

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard,Education_Advanced/Professional,Education_Graduate,Education_Undergrad
0,25,1,49,4,1.6,0,0,1,0,0,0,0,0,1
1,45,19,34,3,1.5,0,0,1,0,0,0,0,0,1
2,39,15,11,1,1.0,0,0,0,0,0,0,0,0,1


In [9]:
univBankdata.shape

(5000, 14)

In [10]:
univBankdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Age                              5000 non-null   int64  
 1   Experience                       5000 non-null   int64  
 2   Income                           5000 non-null   int64  
 3   Family                           5000 non-null   int64  
 4   CCAvg                            5000 non-null   float64
 5   Mortgage                         5000 non-null   int64  
 6   Personal Loan                    5000 non-null   int64  
 7   Securities Account               5000 non-null   int64  
 8   CD Account                       5000 non-null   int64  
 9   Online                           5000 non-null   int64  
 10  CreditCard                       5000 non-null   int64  
 11  Education_Advanced/Professional  5000 non-null   uint8  
 12  Education_Graduate  

In [11]:
univBankdata.describe()

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard,Education_Advanced/Professional,Education_Graduate,Education_Undergrad
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,45.3384,20.1046,73.7742,2.3964,1.937938,56.4988,0.096,0.1044,0.0604,0.5968,0.294,0.3002,0.2806,0.4192
std,11.463166,11.467954,46.033729,1.147663,1.747659,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637,0.458391,0.449337,0.493478
min,23.0,-3.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,35.0,10.0,39.0,1.0,0.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,45.0,20.0,64.0,2.0,1.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,55.0,30.0,98.0,3.0,2.5,101.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0
max,67.0,43.0,224.0,4.0,10.0,635.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
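
The summary above reports a minimum Experience of -3, which cannot be a real number of years worked. A minimal sketch of how such rows might be counted and handled follows. The toy frame stands in for `univBankdata`, and clipping to zero is only one possible treatment (an assumption; the notebook itself leaves these values unchanged).

```python
import pandas as pd

# Toy stand-in rows; in the notebook the frame would be univBankdata.
toy = pd.DataFrame({"Age": [25, 24, 45, 23], "Experience": [1, -1, 19, -3]})

# Flag rows whose experience is negative.
negative = toy["Experience"] < 0
print(f"{negative.sum()} rows have negative Experience")

# One possible treatment (an assumption, not part of the original notebook):
# clip negatives to zero, on the reading that they are data-entry errors.
toy["Experience_clipped"] = toy["Experience"].clip(lower=0)
print(toy)
```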


In [12]:
univBankdata.isna().sum()

Age                                0
Experience                         0
Income                             0
Family                             0
CCAvg                              0
Mortgage                           0
Personal Loan                      0
Securities Account                 0
CD Account                         0
Online                             0
CreditCard                         0
Education_Advanced/Professional    0
Education_Graduate                 0
Education_Undergrad                0
dtype: int64

### Splitting the data for training and testing

In [13]:
train_df, test_df = train_test_split(univBankdata, test_size=0.3)
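
Only about 6% of customers hold a CD account (the mean of `CD Account` in the summary above is 0.0604), so a purely random split can leave the two sets with slightly different positive rates. The sketch below shows a stratified alternative on a toy frame. The toy data, the `stratify` argument, and the explicit `random_state` are assumptions for illustration; the notebook's own split relies on `np.random.seed(1)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the class balance reported by describe().
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Income": rng.integers(8, 225, size=1000),
    "CD Account": (rng.random(1000) < 0.06).astype(int),
})

# stratify keeps the share of positives nearly equal in both parts.
train_toy, test_toy = train_test_split(
    toy, test_size=0.3, stratify=toy["CD Account"], random_state=1
)

print("train positive rate:", round(train_toy["CD Account"].mean(), 4))
print("test positive rate: ", round(test_toy["CD Account"].mean(), 4))
```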

### Separating the predictors and target variable

In [14]:
target = 'CD Account'
predictors = list(univBankdata.columns)
predictors.remove(target)

### Looking for null values

In [15]:
numeric_cols_with_nas = list(train_df.isna().sum()[train_df.isna().sum() > 0].index)
numeric_cols_with_nas

[]

### Creating a common scale between the numeric columns by standardizing each numeric column

In [16]:
# Create a standard scaler; it is fit on the training predictors only
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['Family', 'Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

# Explicit copies avoid a possible SettingWithCopyWarning on the assignments below
train_df = train_df.copy()
test_df = test_df.copy()

# Fit on the training set, then apply the same statistics to the test set
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
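
The fit-on-train, transform-on-test pattern used above can be checked on a tiny example. In this sketch the two toy frames and their values are made up for illustration; the point is that the test rows are scaled with the training mean and standard deviation, so they need not end up with mean 0 themselves.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames (hypothetical values, not the bank data).
train_toy = pd.DataFrame({"Income": [40.0, 60.0, 80.0], "Online": [0, 1, 1]})
test_toy = pd.DataFrame({"Income": [60.0, 100.0], "Online": [1, 0]})
cols = ["Income"]

scaler = StandardScaler()
train_toy[cols] = scaler.fit_transform(train_toy[cols])  # learns mean and std from train
test_toy[cols] = scaler.transform(test_toy[cols])        # reuses the training statistics

print("training mean used for scaling:", scaler.mean_)
print(test_toy)
```

The binary `Online` column is left out of `cols` and passes through unchanged, mirroring how the notebook standardizes only the numeric columns.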


### Saving the datasets for testing and training

In [17]:
X_train = train_df[predictors]
y_train = train_df[target]
X_test = test_df[predictors]
y_test = test_df[target]

X_train.to_csv('univbank-train_X-data.csv', index=False)
y_train.to_csv('univbank-train_y-data.csv', index=False)
X_test.to_csv('univbank-test_X-data.csv', index=False)
y_test.to_csv('univbank-test_y-data.csv', index=False)
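
When these files are read back in another notebook, the target comes back as a one-column DataFrame rather than a Series, and the original row labels are gone because of `index=False`. A minimal round-trip sketch follows; the temporary directory, the file name `toy-train_y-data.csv`, and the three-row series are hypothetical stand-ins.

```python
import os
import tempfile

import pandas as pd

# Toy target with non-contiguous row labels, like the shuffled training split.
y_toy = pd.Series([0, 1, 0], name="CD Account", index=[1334, 4768, 65])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-train_y-data.csv")
    y_toy.to_csv(path, index=False)

    # read_csv returns a DataFrame; squeeze turns the single column into a Series.
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name, y_loaded.tolist())
print(y_loaded.index.tolist())  # labels restart at 0 after the round trip
```

Since the predictors and targets are saved in the same row order, they still line up by position after loading.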


In [18]:
X_train

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,Online,CreditCard,Education_Advanced/Professional,Education_Graduate,Education_Undergrad
1334,0.135977,0.156137,-0.837058,-0.344955,-0.365708,-0.557707,0,0,1,0,0,0,1
4768,-0.646212,-0.538578,-0.750406,-1.217944,0.037876,-0.557707,0,0,1,0,0,1,0
65,1.178895,1.285050,1.242590,-1.217944,1.075664,-0.557707,0,0,1,1,0,0,1
177,-1.428400,-1.493812,-0.187168,1.401022,-0.077434,1.869923,0,0,0,0,0,1,0
4489,-0.559302,-0.625418,-1.140340,0.528033,-0.999912,-0.557707,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2895,1.265805,1.371889,-0.750406,1.401022,-0.365708,0.835195,0,0,1,0,0,1,0
2763,0.831255,0.937692,-1.313644,1.401022,-0.711638,-0.557707,0,0,1,0,0,0,1
905,0.049067,0.156137,-0.988699,-1.217944,-0.538673,0.278035,0,0,1,1,0,0,1
3980,0.049067,0.156137,0.332744,1.401022,-0.308053,-0.557707,0,0,1,0,0,1,0


In [19]:
y_train

1334    0
4768    0
65      0
177     0
4489    0
       ..
2895    0
2763    0
905     0
3980    0
235     0
Name: CD Account, Length: 3500, dtype: int64

## Conclusion

In this notebook I used the techniques covered in class to load and clean the data, then saved the training and test predictors and targets to CSV files. I will use these files in the WE3a-DTrees-model-fit notebook.