# Assignment WE04-Universal Bank

Universal Bank recently trialed a marketing campaign to sell its new Securities Account product to existing customers. The bank contacted 5000 customers who did not hold a Securities Account and made them an offer. The data provided in universal.csv are the result of this market test.

Use the techniques covered in this class to load and clean the data. Then identify the best predictive model, using only the models covered thus far: Logistic Regression, SVM (with various kernels), and Decision Trees. Your target variable is Securities Account. Your scoring measure is precision. Use RandomizedSearchCV combined with GridSearchCV to identify the best parameters for each model tested.
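
One way to read "RandomizedSearchCV combined with GridSearchCV" is a two-stage search: a coarse randomized search over a wide range, followed by a fine grid around the best value found. The sketch below illustrates that idea under stated assumptions: it uses synthetic data from `make_classification` (not the bank data), a single Logistic Regression hyperparameter `C`, and arbitrary ranges and multipliers chosen only for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in data (assumption: NOT the Universal Bank data).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=1
)

# Stage 1: coarse randomized search over a wide, log-scaled range of C.
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=10,
    scoring="precision",  # the scoring measure required by the assignment
    cv=5,
    random_state=1,
)
random_search.fit(X, y)
best_c = random_search.best_params_["C"]

# Stage 2: fine grid centred on the best value from stage 1.
# The multipliers are illustrative, not a recommendation.
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [best_c * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]},
    scoring="precision",
    cv=5,
)
grid_search.fit(X, y)

print(grid_search.best_params_, round(grid_search.best_score_, 3))
```

The same pattern would apply to SVM and Decision Tree models by swapping the estimator and the parameter ranges.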

Be sure to document your thought process using markdown. Think of this as a report that your manager will read. This assignment requires you to decide how best to process the provided data (e.g., encoding). Be sure to provide your arguments and observations in markdown as you progress through data preparation, fitting, and performance evaluation.


    Id: Customer ID
    Age: Customer's age in completed years  
    Experience: Number of years of professional experience  
    Income: Annual income of the customer ($000s)  
    Family Size: Family size of the customer  
    CCAvg: Average spending on credit cards per month ($000s)  
    Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional  
    Mortgage: Value of house mortgage if any ($000s)  
    Personal Loan: (1 if customer has personal loan with bank, 0 otherwise)  
    Securities Account: (1 if customer has securities account with bank, 0 otherwise)  
    CD Account: (1 if customer has certificate of deposit (CD) account with bank, 0 otherwise)  
    Online Banking: (1 if customer uses Internet banking facilities, 0 otherwise)  
    Credit Card: (1 if customer uses credit card issued by Universal Bank, 0 otherwise)

## 1.0 Import libraries and set random seed

In [15]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

np.random.seed(1)


## 2.0 Data loading

In [16]:
Ubank = pd.read_csv("./data/UniversalBank.csv")

## 3.0 Data cleaning
The target variable is `Securities Account`.

### 3.1 Exploration of the data

In [17]:
Ubank.head(4)

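|   | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |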


Quick overview: missing values, number of observations, and column names and types:

In [18]:
Ubank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


Generate a statistical summary of the numeric columns in the data:

In [19]:
Ubank.describe()

|   | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 45.3384 | 20.1046 | 73.7742 | 93152.503 | 2.3964 | 1.937938 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 2121.852197 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.0 | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1250.75 | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2500.5 | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 3750.25 | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 5000.0 | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


**Observations:**
* There are no missing data: every column has 5000 non-null values.
* The categorical variables are already numerically encoded (Education as 1 to 3, the remaining flags as 0/1), and I agree with that encoding.
* Drop the ID feature since it is not relevant for the prediction.
* At least one ZIP code is wrong: the minimum value (9307) has only 4 digits. Such rows can be dropped or imputed (after the data split), depending on how many there are.
* The Experience feature must not have negative values, yet its minimum is -3. Those rows can be dropped, or the values can be imputed using the mean of the variable after the data split.
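
The last observation mentions imputing after the split as an alternative to dropping. This notebook drops the rows, but for comparison, here is a minimal sketch of mean imputation that uses only training-set statistics. The toy frame and its values are made up for illustration, and the names `toy`, `toy_train`, and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the bank data (assumption: values are invented).
toy = pd.DataFrame({
    "Experience": [1, 19, -3, 9, 30, -1, 12, 25, 4, 16],
    "Securities Account": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# Mark impossible (negative) values as missing before splitting.
toy["Experience"] = toy["Experience"].where(toy["Experience"] >= 0, np.nan)

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=1)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

# Compute the mean on the training rows only, then fill both sets with it,
# so no information from the test set leaks into the imputation.
train_mean = toy_train["Experience"].mean()
toy_train["Experience"] = toy_train["Experience"].fillna(train_mean)
toy_test["Experience"] = toy_test["Experience"].fillna(train_mean)
```

Dropping is simpler and costs few rows here; imputation would matter more if the invalid values were a larger share of the data.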

In [20]:
#drop ID feature
Ubank.drop('ID', axis=1, inplace = True)

In [21]:
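#check how many ZIP codes are wrong
# number of digits of a positive integer n = floor(log10(n)) + 1
zipdigits = np.int_(np.log10(np.array(Ubank['ZIP Code'])) + 1)
# row positions of ZIP codes without exactly 5 digits (positions equal the
# index labels here because the index is still the default RangeIndex)
zip_index = list(np.where(zipdigits != 5)[0])
print(len(zip_index))
print(Ubank.loc[zip_index,['ZIP Code']])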

1
     ZIP Code
384      9307
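
The digit count above relies on a logarithm. An equivalent check can be written by converting the values to strings and measuring their length, which some readers may find easier to follow. The sketch below runs on a handful of ZIP codes taken from the outputs above; the names `zips` and `bad_zip_mask` are hypothetical.

```python
import pandas as pd

# A few ZIP codes from the outputs above, including the 4-digit one.
zips = pd.Series([91107, 90089, 9307, 94112], name="ZIP Code")

# A valid ZIP code has exactly 5 digits when written as a string.
bad_zip_mask = zips.astype(str).str.len() != 5

print(zips[bad_zip_mask])
```

Applied to the full data, `Ubank.index[mask]` would give the index labels to drop, assuming the mask is built from `Ubank['ZIP Code']`.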


In [22]:
#check how many Experience observations are negative
experience_index = list(np.where(np.array(Ubank['Experience']) < 0)[0])
len(experience_index)

52

In [23]:
# Drop the rows with a wrong ZIP code or a negative Experience value
Ubank.drop(zip_index, axis=0, inplace = True)
Ubank.drop(experience_index, axis=0, inplace = True)
Ubank.describe()

|   | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 | 4947.0 |
| mean | 45.556095 | 20.330099 | 73.825147 | 93168.521932 | 2.391146 | 1.936196 | 1.878714 | 56.645846 | 0.097029 | 0.104306 | 0.061047 | 0.596927 | 0.293916 |
| std | 11.321615 | 11.312922 | 46.111141 | 1761.253907 | 1.148333 | 1.747768 | 0.839679 | 101.835994 | 0.296026 | 0.305688 | 0.239441 | 0.490565 | 0.4556 |
| min | 24.0 | 0.0 | 8.0 | 90005.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 36.0 | 10.5 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 46.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.6 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### 3.2 Split data (train/test)

In [24]:
# split the data into training (70%) and test (30%) sets
train_df, test_df = train_test_split(Ubank, test_size=0.3)

# create variables to represent the columns
# that are our predictors and target
target = 'Securities Account'
predictors = list(Ubank.columns)
predictors.remove(target)
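
Only about 10% of customers hold a Securities Account (the mean of the target column is roughly 0.10), so a plain random split can leave the two sets with slightly different positive rates. A stratified split keeps the rates equal. The sketch below shows the idea on a made-up frame with the same 10% positive rate; it is an optional variation, not what this notebook does, and the names `demo`, `demo_train`, and `demo_test` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with a 10% positive rate (assumption: illustrative only).
demo = pd.DataFrame({
    "Income": range(100),
    "Securities Account": [1] * 10 + [0] * 90,
})

# stratify keeps the class proportions the same in both subsets;
# random_state makes the split reproducible on its own.
demo_train, demo_test = train_test_split(
    demo,
    test_size=0.3,
    stratify=demo["Securities Account"],
    random_state=1,
)

print(demo_train["Securities Account"].mean())
print(demo_test["Securities Account"].mean())
```

Passing `random_state` directly also removes the dependence on the global NumPy seed set in section 1.0.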

### 3.3 Conduct any data preparation that should be done *AFTER* the data split


Remove differences of scale by **standardizing** the numerical variables.

In [25]:
# create a standard scaler; it is fit on the training set only
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['Age', 'Experience', 'Income', 
                   'ZIP Code', 'Family', 'CCAvg', 'Mortgage']  

# work on explicit copies so the assignments below do not trigger
# a pandas SettingWithCopyWarning
train_df = train_df.copy()
test_df = test_df.copy()

# fit the scaler on the training columns and transform them
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])

# transform the test columns using the training mean and standard deviation
test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])
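
The same "scale some columns, leave the rest alone" step can also be expressed with scikit-learn's `ColumnTransformer`, which is convenient when the preprocessing later needs to sit inside a pipeline. This is a sketch of an alternative, not the approach used in this notebook. The training `Income` values come from the first rows shown earlier; the test values and the names `train_demo`, `test_demo`, and `ct` are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

train_demo = pd.DataFrame(
    {"Income": [49.0, 34.0, 11.0, 100.0], "Education": [1, 1, 1, 2]}
)
# Assumption: invented test rows, for illustration only.
test_demo = pd.DataFrame({"Income": [45.0, 29.0], "Education": [2, 3]})

# Standardize Income; pass every other column through unchanged.
ct = ColumnTransformer(
    [("scale", StandardScaler(), ["Income"])],
    remainder="passthrough",
)

# Fit on the training data only, then reuse the fitted statistics on the test data.
train_scaled = ct.fit_transform(train_demo)
test_scaled = ct.transform(test_demo)

print(train_scaled)
print(test_scaled)
```

By default the result is a NumPy array with the transformed columns first, followed by the passed-through columns.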


In [26]:
train_df.describe()

|   | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 | 3462.0 |
| mean | 1.571373e-17 | -7.946657e-17 | -1.478534e-16 | 1.266366e-15 | -2.5437e-16 | 1.390184e-17 | 1.879549 | 3.897005e-16 | 0.102542 | 0.107163 | 0.058637 | 0.599942 | 0.297805 |
| std | 1.000144 | 1.000144 | 1.000144 | 1.000144 | 1.000144 | 1.000144 | 0.838053 | 1.000144 | 0.303403 | 0.309366 | 0.234977 | 0.489981 | 0.457359 |
| min | -1.885039 | -1.779143 | -1.423802 | -1.77034 | -1.218872 | -1.108785 | 1.0 | -0.553752 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.9173502 | -0.898374 | -0.7568999 | -0.7718598 | -1.218872 | -0.7113473 | 1.0 | -0.553752 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.03763319 | -0.0176052 | -0.2190756 | 0.1487727 | -0.3454017 | -0.2571325 | 2.0 | -0.553752 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 0.8420838 | 0.8631636 | 0.5553914 | 0.8262724 | 0.5280685 | 0.3674128 | 3.0 | 0.4259143 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 1.897744 | 1.920086 | 3.093922 | 1.978755 | 1.401539 | 4.5689 | 3.0 | 5.66713 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## 4.0 Save the data

In [27]:
train_X = train_df[predictors]
train_y = train_df[target] 
test_X = test_df[predictors]
test_y = test_df[target] 

train_df.to_csv('./data/ubank_train_df.csv', index=False)
train_X.to_csv('./data/ubank_train_X.csv', index=False)
train_y.to_csv('./data/ubank_train_y.csv', index=False)

test_df.to_csv('./data/ubank_test_df.csv', index=False)
test_X.to_csv('./data/ubank_test_X.csv', index=False)
test_y.to_csv('./data/ubank_test_y.csv', index=False)
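
When these files are read back in a later notebook, `pd.read_csv` returns a DataFrame even for the single-column target files, so the target column has to be selected to get a Series again. The sketch below demonstrates the round trip with a small made-up Series written to a temporary directory; the file name `y_demo.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small made-up target Series (assumption: illustrative values).
y_demo = pd.Series([0, 1, 0], name="Securities Account")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_demo.csv")

    # Same call pattern as above: the Series name becomes the CSV header.
    y_demo.to_csv(path, index=False)

    # read_csv returns a DataFrame; select the column to recover a Series.
    y_loaded = pd.read_csv(path)["Securities Account"]

print(y_loaded)
```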