# Capstone Two - Pre-processing and Training Data Development.ipynb

Now that the data is clean enough to work with, let's take some time to prepare it for modeling. As a reminder, our goal is to determine whether specific characteristics of our bank's cold-call marketing campaigns correlate with prospects saying "yes" to a term deposit with the bank.

We have 20 columns to work with, so let's get to it!

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io
%matplotlib inline

In [3]:
# Download the CSV file and load it into a DataFrame.
url = 'https://raw.githubusercontent.com/GabeGibitz/springboard/master/Capstone/data/bank_clean.csv'
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
df.drop('Unnamed: 0', inplace=True, axis=1)
df.head()

Unnamed: 0,age,job,marital,education,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56.0,housemaid,married,4.0,0.0,0.0,telephone,5.0,1.0,261.0,1.0,,0.0,,1.1,93.994,-36.4,4.857,5191.0,0.0
1,57.0,services,married,12.0,0.0,0.0,telephone,5.0,1.0,149.0,1.0,,0.0,,1.1,93.994,-36.4,4.857,5191.0,0.0
2,37.0,services,married,12.0,1.0,0.0,telephone,5.0,1.0,226.0,1.0,,0.0,,1.1,93.994,-36.4,4.857,5191.0,0.0
3,40.0,admin.,married,6.0,0.0,0.0,telephone,5.0,1.0,151.0,1.0,,0.0,,1.1,93.994,-36.4,4.857,5191.0,0.0
4,56.0,services,married,12.0,0.0,1.0,telephone,5.0,1.0,307.0,1.0,,0.0,,1.1,93.994,-36.4,4.857,5191.0,0.0
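
As a possible simplification of the loading step, `pd.read_csv` can read a URL or any file-like object directly, and `index_col=0` keeps the first column from showing up as `Unnamed: 0`. The sketch below assumes that column is just a saved row index. It uses a tiny inline CSV (two rows copied from the output above, with fewer columns) so it runs offline; with the real file, the `url` string would be passed in place of the `StringIO` object.

```python
import io

import pandas as pd

# Tiny stand-in for bank_clean.csv (illustrative only).
# The leading unnamed column plays the role of the saved index.
sample_csv = io.StringIO(
    ",age,job,duration,y\n"
    "0,56.0,housemaid,261.0,0.0\n"
    "1,57.0,services,149.0,0.0\n"
)

# index_col=0 uses the first column as the index instead of
# creating an 'Unnamed: 0' data column.
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

With this approach there is no `Unnamed: 0` column to drop afterwards.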


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38241 entries, 0 to 38240
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             38241 non-null  float64
 1   job             38241 non-null  object 
 2   marital         38241 non-null  object 
 3   education       38241 non-null  float64
 4   housing         38241 non-null  float64
 5   loan            38241 non-null  float64
 6   contact         38241 non-null  object 
 7   month           38241 non-null  float64
 8   day_of_week     38241 non-null  float64
 9   duration        38239 non-null  float64
 10  campaign        38241 non-null  float64
 11  pdays           1366 non-null   float64
 12  previous        38241 non-null  float64
 13  poutcome        5179 non-null   object 
 14  emp.var.rate    38241 non-null  float64
 15  cons.price.idx  38241 non-null  float64
 16  cons.conf.idx   38241 non-null  float64
 17  euribor3m       38241 non-null 

In [6]:
# Drop the two columns that are mostly null.
df.drop(['pdays', 'poutcome'], axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38241 entries, 0 to 38240
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             38241 non-null  float64
 1   job             38241 non-null  object 
 2   marital         38241 non-null  object 
 3   education       38241 non-null  float64
 4   housing         38241 non-null  float64
 5   loan            38241 non-null  float64
 6   contact         38241 non-null  object 
 7   month           38241 non-null  float64
 8   day_of_week     38241 non-null  float64
 9   duration        38239 non-null  float64
 10  campaign        38241 non-null  float64
 11  previous        38241 non-null  float64
 12  emp.var.rate    38241 non-null  float64
 13  cons.price.idx  38241 non-null  float64
 14  cons.conf.idx   38241 non-null  float64
 15  euribor3m       38241 non-null  float64
 16  nr.employed     38241 non-null  float64
 17  y               38241 non-null 
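
The two dropped columns are the ones that are almost entirely empty: `pdays` has 1,366 non-null values and `poutcome` has 5,179, out of 38,241 rows. One way to make that choice explicit is to compute the share of missing values per column and drop anything above a cutoff. This is only a sketch on a made-up frame; the `max_null_share` cutoff of 0.5 is an assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the bank data: 'pdays' is mostly missing.
demo = pd.DataFrame({
    "age": [56.0, 57.0, 37.0, 40.0],
    "pdays": [np.nan, np.nan, np.nan, 3.0],
    "duration": [261.0, np.nan, 226.0, 151.0],
})

# Fraction of missing values in each column.
null_share = demo.isna().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
max_null_share = 0.5
mostly_null = null_share[null_share > max_null_share].index.tolist()

trimmed = demo.drop(columns=mostly_null)
print(mostly_null)
print(trimmed.columns.tolist())
```

Columns with only a few gaps, like `duration` here, survive this step and can be handled row by row afterwards.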

In [8]:
df.dropna(how='any', inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38239 entries, 0 to 38240
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             38239 non-null  float64
 1   job             38239 non-null  object 
 2   marital         38239 non-null  object 
 3   education       38239 non-null  float64
 4   housing         38239 non-null  float64
 5   loan            38239 non-null  float64
 6   contact         38239 non-null  object 
 7   month           38239 non-null  float64
 8   day_of_week     38239 non-null  float64
 9   duration        38239 non-null  float64
 10  campaign        38239 non-null  float64
 11  previous        38239 non-null  float64
 12  emp.var.rate    38239 non-null  float64
 13  cons.price.idx  38239 non-null  float64
 14  cons.conf.idx   38239 non-null  float64
 15  euribor3m       38239 non-null  float64
 16  nr.employed     38239 non-null  float64
 17  y               38239 non-null 
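
After `dropna`, the index still runs from 0 to 38240 with two labels missing, which is why the summary now reports an `Int64Index` instead of a `RangeIndex`. If contiguous row labels are wanted, resetting the index is one option. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in the middle row.
demo = pd.DataFrame({
    "duration": [261.0, np.nan, 226.0],
    "y": [0.0, 0.0, 1.0],
})

# Rows with any missing value are removed; surviving rows keep their labels.
dropped = demo.dropna(how="any")
print(dropped.index.tolist())

# drop=True discards the old labels instead of adding them as a column.
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```

In this notebook the gap is harmless, because the scaled frame built later gets a fresh index anyway.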

Next, we need to decide what to do with the categorical columns. We will create dummy columns for marital and contact, and drop job to keep the number of columns to a minimum.

In [11]:
# Drop the job column.

df.drop('job', axis=1, inplace=True)
df.head()

Unnamed: 0,age,marital,education,housing,loan,contact,month,day_of_week,duration,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56.0,married,4.0,0.0,0.0,telephone,5.0,1.0,261.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0
1,57.0,married,12.0,0.0,0.0,telephone,5.0,1.0,149.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0
2,37.0,married,12.0,1.0,0.0,telephone,5.0,1.0,226.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0
3,40.0,married,6.0,0.0,0.0,telephone,5.0,1.0,151.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0
4,56.0,married,12.0,0.0,1.0,telephone,5.0,1.0,307.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0


In [17]:
# Dummies for the marital column.
df = pd.get_dummies(
    df,
    prefix='marital',
    prefix_sep='_',
    dummy_na=False,
    columns=['marital'],
    sparse=False,
    drop_first=False,
    dtype=None,
)

In [18]:
df.head()

Unnamed: 0,age,education,housing,loan,contact,month,day_of_week,duration,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,marital_divorced,marital_married,marital_single
0,56.0,4.0,0.0,0.0,telephone,5.0,1.0,261.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0
1,57.0,12.0,0.0,0.0,telephone,5.0,1.0,149.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0
2,37.0,12.0,1.0,0.0,telephone,5.0,1.0,226.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0
3,40.0,6.0,0.0,0.0,telephone,5.0,1.0,151.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0
4,56.0,12.0,0.0,1.0,telephone,5.0,1.0,307.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0


In [19]:
# Dummies for the contact column.
df = pd.get_dummies(
    df,
    prefix='contact',
    prefix_sep='_',
    dummy_na=False,
    columns=['contact'],
    sparse=False,
    drop_first=False,
    dtype=None,
)
df.head()

Unnamed: 0,age,education,housing,loan,month,day_of_week,duration,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,marital_divorced,marital_married,marital_single,contact_cellular,contact_telephone
0,56.0,4.0,0.0,0.0,5.0,1.0,261.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0,0,1
1,57.0,12.0,0.0,0.0,5.0,1.0,149.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0,0,1
2,37.0,12.0,1.0,0.0,5.0,1.0,226.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0,0,1
3,40.0,6.0,0.0,0.0,5.0,1.0,151.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0,0,1
4,56.0,12.0,0.0,1.0,5.0,1.0,307.0,1.0,0.0,1.1,93.994,-36.4,4.857,5191.0,0.0,0,1,0,0,1
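
The two encoding cells could also be written as a single call, since `columns` accepts a list and the default prefix is the column name. The sketch below uses a made-up frame; passing `dtype=int` is an assumption added so the dummies print as 0/1 regardless of pandas version (recent versions default to booleans).

```python
import pandas as pd

# Illustrative frame, not the bank data.
demo = pd.DataFrame({
    "age": [56.0, 37.0, 29.0],
    "marital": ["married", "single", "divorced"],
    "contact": ["telephone", "cellular", "cellular"],
})

# Encode both categorical columns at once; each dummy column is named
# '<column>_<category>'.
encoded = pd.get_dummies(demo, columns=["marital", "contact"], dtype=int)
print(encoded.columns.tolist())
```

Because every row belongs to exactly one category, the dummies of each column always sum to 1. For models that are sensitive to that redundancy, `drop_first=True` is an option worth considering.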


## Preprocessing and Scaling Data

In [28]:
from sklearn.preprocessing import StandardScaler

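# Note: this standardizes every column, including the dummy columns and the target y.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array
# Rebuild a DataFrame so the column names are preserved.
df_scaled = pd.DataFrame(df_scaled, columns=list(df.columns))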
df_scaled.head()

Unnamed: 0,age,education,housing,loan,month,day_of_week,duration,campaign,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,marital_divorced,marital_married,marital_single,contact_cellular,contact_telephone
0,1.568583,-1.983929,-1.079202,-0.429655,-0.79081,-1.401933,0.010796,-0.566037,-0.349009,0.649554,0.735081,0.895741,0.713045,0.328424,-0.353938,-0.355993,0.8061,-0.625757,-1.330617,1.330617
1,1.665771,0.036578,-1.079202,-0.429655,-0.79081,-1.401933,-0.420403,-0.566037,-0.349009,0.649554,0.735081,0.895741,0.713045,0.328424,-0.353938,-0.355993,0.8061,-0.625757,-1.330617,1.330617
2,-0.277998,0.036578,0.926611,-0.429655,-0.79081,-1.401933,-0.123954,-0.566037,-0.349009,0.649554,0.735081,0.895741,0.713045,0.328424,-0.353938,-0.355993,0.8061,-0.625757,-1.330617,1.330617
3,0.013567,-1.478802,-1.079202,-0.429655,-0.79081,-1.401933,-0.412703,-0.566037,-0.349009,0.649554,0.735081,0.895741,0.713045,0.328424,-0.353938,-0.355993,0.8061,-0.625757,-1.330617,1.330617
4,1.568583,0.036578,-1.079202,2.327448,-0.79081,-1.401933,0.187895,-0.566037,-0.349009,0.649554,0.735081,0.895741,0.713045,0.328424,-0.353938,-0.355993,0.8061,-0.625757,-1.330617,1.330617
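
To see where these numbers come from: with its default settings, `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below reproduces that formula with plain pandas on a made-up frame. It also shows what happens to a 0/1 column such as `y`: it still has only two distinct values, but they are no longer 0 and 1.

```python
import pandas as pd

# Illustrative frame, not the bank data.
demo = pd.DataFrame({
    "duration": [261.0, 149.0, 226.0, 151.0],
    "y": [0.0, 0.0, 0.0, 1.0],
})

# Standardize: subtract the column mean, divide by the population
# standard deviation (ddof=0, unlike pandas' default of ddof=1).
z = (demo - demo.mean()) / demo.std(ddof=0)

print(z.round(4))
print(sorted(z["y"].round(4).unique()))
```

After this transform every column has mean 0 and unit variance, which is why the first five rows of `y` above all show the same negative value.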


## Split the Data Into Training and Testing (while keeping our final test data separate):

Now that we have all dummy variables, we will split our data up 75/25. We will train on the 75% of the cleaned data and test on the remaining 25%.

Here we go!

In [20]:
from sklearn.model_selection import train_test_split

In [30]:
# Drop the prediction variable from the dataframe to get X, and keep only the prediction variable in y.
# Print the shapes to make sure we lost the column.

print(df_scaled.shape)
X = df_scaled.drop('y', axis=1)
print(X.shape)

y = df_scaled.y
print(y.shape)

(38239, 20)
(38239, 19)
(38239,)
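
Because `X` and `y` come from `df_scaled`, the target holds standardized values rather than 0/1 labels, and the scaler has already seen every row, including the ones that will later be used for testing. An alternative ordering, sketched here on synthetic data and not the method used in this notebook, is to split first, fit the scaler on the training features only, and leave the target unscaled. All names in the block are hypothetical, and `stratify` is an optional extra that keeps the yes/no ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded frame (random values, illustration only).
rng = np.random.default_rng(37)
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=200).astype(float),
    "duration": rng.integers(0, 1000, size=200).astype(float),
    "y": rng.integers(0, 2, size=200).astype(float),
})

features = demo.drop(columns="y")
target = demo["y"].astype(int)  # keep the labels as 0/1

# Split first, so the held-out rows never influence the scaler.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.25, random_state=37, stratify=target
)

# Fit on the training rows only, then apply the same transform to both parts.
feature_scaler = StandardScaler().fit(f_train)
f_train_scaled = pd.DataFrame(
    feature_scaler.transform(f_train), columns=features.columns, index=f_train.index
)
f_test_scaled = pd.DataFrame(
    feature_scaler.transform(f_test), columns=features.columns, index=f_test.index
)

print(f_train_scaled.shape, f_test_scaled.shape)
```

The test features are scaled with the training mean and standard deviation, so their own mean will be close to, but not exactly, zero.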


In [31]:
# Now, we split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=37)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(28679, 19)
(9560, 19)
(28679,)
(9560,)
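
The row counts can be checked by hand. When only `test_size` is given as a fraction, scikit-learn rounds the test share up and gives the remaining rows to the training set. A quick sketch of that arithmetic:

```python
import math

n_rows = 38239    # rows left after dropna, per the df.info() output above
test_size = 0.25

# The test share is rounded up; the training set gets the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 28679 9560
```

Here 25% of 38,239 is 9,559.75, which rounds up to 9,560 test rows and leaves 28,679 for training.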


## Export Data to CSV Files

In [34]:
pwd

'/Users/gagibitz/Documents/springboard/springboard-git/springboard/Capstone/notebooks'

In [35]:
cd ..

/Users/gagibitz/Documents/springboard/springboard-git/springboard/Capstone


In [36]:
X_train.to_csv('data/X_train.csv')

In [37]:
X_test.to_csv('data/X_test.csv')
y_train.to_csv('data/y_train.csv')
y_test.to_csv('data/y_test.csv')

In [38]:
df_scaled.to_csv('data/bank_scaled.csv')
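
By default `to_csv` also writes the index, and that extra column comes back as `Unnamed: 0` when the file is read again with a plain `pd.read_csv`. If the row labels are not needed, `index=False` avoids it, and building paths with `pathlib` removes the need for `cd ..`. The sketch below writes to a temporary folder so it does not touch the project; the file name `X_train_demo.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame, not the real training data.
demo = pd.DataFrame({"age": [1.5, -0.3], "duration": [0.01, -0.42]})

# A temporary folder stands in for the project's data/ directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    out_path = data_dir / "X_train_demo.csv"  # hypothetical file name
    demo.to_csv(out_path, index=False)        # index=False: no extra column on reload

    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

If the original row labels are needed to line up `X_train` with `y_train` later, keeping the index and reading the files back with `index_col=0` is the other option.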