## Machine Learning
### Exploratory Analysis and Pre-processing
### Solution to Proposed Exercises

Build a training dataset from the original bank-marketing CSV by:
1. Discarding the 'month' feature, because we consider it irrelevant for our purposes
2. Converting all binary nominal features to 0/1
3. Converting nominal features with more than two values to dummy variables
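
The three steps above can be sketched end to end on a small made-up frame. This is only an illustration: the `toy` data and the names `step_1`, `step_2`, and `step_3` are assumptions and are not part of the exercise, which works on the full CSV.

```python
import pandas as pd

# Toy stand-in for the bank-marketing data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "single", "divorced"],
    "housing": ["no", "yes", "yes"],
    "month": ["oct", "may", "apr"],
    "y": ["no", "no", "yes"],
})

# Step 1: drop the feature we consider irrelevant.
step_1 = toy.drop(columns="month")

# Step 2: map the binary yes/no features to 0/1, one column at a time.
step_2 = step_1.copy()
for col in ["housing", "y"]:
    step_2[col] = step_2[col].map({"no": 0, "yes": 1})

# Step 3: one-hot encode the remaining nominal feature.
step_3 = pd.get_dummies(step_2, columns=["marital"])

print(step_3.columns.tolist())
```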
    

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### DataSet Bank-Marketing

In [2]:
bank_marketing = pd.read_csv('../../data/bank.csv', sep=';')

In [3]:
from sklearn import preprocessing

In [4]:
bank_marketing.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [5]:
bank_marketing.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

First, we extract the non-numeric features.

In [6]:
data_types = bank_marketing.dtypes
nominal_features = data_types[data_types == 'object'].index
nominal_features

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y'],
      dtype='object')
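
As a possible alternative to comparing dtypes against `'object'`, `DataFrame.select_dtypes` can pick out the non-numeric columns directly. The sketch below uses a made-up `toy` frame; excluding `"number"` rather than including `"object"` is a choice made here so that the selection does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Made-up frame with one numeric and two nominal columns.
toy = pd.DataFrame({
    "age": [30, 33],
    "job": ["services", "student"],
    "y": ["no", "yes"],
})

# Everything that is not numeric is treated as a nominal feature.
nominal = toy.select_dtypes(exclude="number").columns
numeric = toy.select_dtypes(include="number").columns

print(list(nominal))
print(list(numeric))
```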

Next, we list the distinct values of each non-numeric feature.

In [7]:
for f in nominal_features:
    print(f, bank_marketing[f].unique())

job ['unemployed' 'services' 'management' 'blue-collar' 'self-employed'
 'technician' 'entrepreneur' 'admin.' 'student' 'housemaid' 'retired'
 'unknown']
marital ['married' 'single' 'divorced']
education ['primary' 'secondary' 'tertiary' 'unknown']
default ['no' 'yes']
housing ['no' 'yes']
loan ['no' 'yes']
contact ['cellular' 'unknown' 'telephone']
month ['oct' 'may' 'apr' 'jun' 'feb' 'aug' 'jan' 'jul' 'nov' 'sep' 'mar' 'dec']
poutcome ['unknown' 'failure' 'other' 'success']
y ['no' 'yes']
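
The listing above shows which features are binary and which take more than two values. One way to make that split programmatically is to count distinct values per column, as in the following sketch. The `toy` frame and the names `binary_features` and `multi_valued_features` are hypothetical.

```python
import pandas as pd

# Made-up nominal columns: one with three values, two with yes/no.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "housing": ["no", "yes", "yes", "no"],
    "y": ["no", "no", "yes", "no"],
})

# Number of distinct values in each column.
n_values = toy.nunique()

# Binary features get the 0/1 mapping; the rest become dummy variables.
binary_features = n_values[n_values == 2].index.tolist()
multi_valued_features = n_values[n_values > 2].index.tolist()

print(binary_features)
print(multi_valued_features)
```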


1. Discard the 'month' feature, because we consider it irrelevant.

In [8]:
bank_1 = bank_marketing.drop('month', axis=1)
bank_1.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome', 'y'],
      dtype='object')
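
Note that `drop` returns a new dataframe and leaves the original untouched. The short sketch below, on a made-up `toy` frame, shows this and also the `errors="ignore"` option, which can be handy when a notebook cell is run more than once.

```python
import pandas as pd

toy = pd.DataFrame({"age": [30, 33], "month": ["oct", "may"], "y": ["no", "no"]})

# drop() returns a copy without the column; `toy` itself keeps 'month'.
without_month = toy.drop(columns=["month"])

# errors="ignore" turns the call into a no-op when the column is already gone.
again = without_month.drop(columns=["month"], errors="ignore")

print(toy.columns.tolist())
print(again.columns.tolist())
```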

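2. Convert all nominal features with two values to 0/1.

Every binary feature takes the same two values ('no' and 'yes'), so a single replacement applied to the whole dataframe covers them all. This is safe here because 'no' and 'yes' do not appear as values of any other feature.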

In [9]:
mapping = {'no': 0, 'yes': 1}
bank_2 = bank_1.replace(mapping)

bank_2[['default','housing','y']].head()

,default,housing,y
0,0,0,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
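
A frame-wide replacement touches every column that happens to contain 'no' or 'yes'. If that is a concern on other data, the mapping can be restricted to the columns whose values are all covered by it. The following is a minimal sketch on a made-up `toy` frame; `binary_cols` and `encoded` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "contact": ["cellular", "unknown", "telephone"],
})

mapping = {"no": 0, "yes": 1}

# Keep only the columns whose distinct values are all keys of the mapping.
binary_cols = [c for c in toy.columns if set(toy[c].unique()) <= set(mapping)]

# Apply the mapping column by column; other columns stay as they are.
encoded = toy.copy()
for col in binary_cols:
    encoded[col] = encoded[col].map(mapping)

print(binary_cols)
print(encoded)
```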


3. Convert nominal features with more than two values to dummy variables.

Here we can apply the get_dummies function directly to the dataframe: it encodes only the remaining non-numeric features and leaves the numeric ones untouched.


In [10]:
bank_3 = pd.get_dummies(bank_2)
bank_3.columns

Index(['age', 'default', 'balance', 'housing', 'loan', 'day', 'duration',
       'campaign', 'pdays', 'previous', 'y', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student', 'job_technician',
       'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married',
       'marital_single', 'education_primary', 'education_secondary',
       'education_tertiary', 'education_unknown', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'poutcome_failure',
       'poutcome_other', 'poutcome_success', 'poutcome_unknown'],
      dtype='object')
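
The dummy columns in this notebook come out as `uint8`. Recent pandas releases (2.0 and later) return boolean dummy columns by default, so the sketch below passes `dtype` explicitly to get integer indicators regardless of version. It also names the columns to encode through `columns=`, which is an optional choice made here for clarity; the `toy` frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 33, 35],
    "contact": ["cellular", "unknown", "telephone"],
})

# Encode only 'contact' and request integer indicator columns explicitly.
dummies = pd.get_dummies(toy, columns=["contact"], dtype="uint8")

print(dummies.columns.tolist())
print(dummies.dtypes)
```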

This gives us a dataset in which every feature is numeric.

In [11]:
bank_3.dtypes

age                    int64
default                int64
balance                int64
housing                int64
loan                   int64
day                    int64
duration               int64
campaign               int64
pdays                  int64
previous               int64
y                      int64
job_admin.             uint8
job_blue-collar        uint8
job_entrepreneur       uint8
job_housemaid          uint8
job_management         uint8
job_retired            uint8
job_self-employed      uint8
job_services           uint8
job_student            uint8
job_technician         uint8
job_unemployed         uint8
job_unknown            uint8
marital_divorced       uint8
marital_married        uint8
marital_single         uint8
education_primary      uint8
education_secondary    uint8
education_tertiary     uint8
education_unknown      uint8
contact_cellular       uint8
contact_telephone      uint8
contact_unknown        uint8
poutcome_failure       uint8
poutcome_other         uint8
poutcome_success       uint8
poutcome_unknown       uint8
dtype: object

In [12]:
bank_3.head()

,age,default,balance,housing,loan,day,duration,campaign,pdays,previous,...,education_secondary,education_tertiary,education_unknown,contact_cellular,contact_telephone,contact_unknown,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,30,0,1787,0,0,19,79,1,-1,0,...,0,0,0,1,0,0,0,0,0,1
1,33,0,4789,1,1,11,220,1,339,4,...,1,0,0,1,0,0,1,0,0,0
2,35,0,1350,1,0,16,185,1,330,1,...,0,1,0,1,0,0,1,0,0,0
3,30,0,1476,1,1,3,199,4,-1,0,...,0,1,0,0,0,1,0,0,0,1
4,59,0,0,1,0,5,226,1,-1,0,...,1,0,0,0,0,1,0,0,0,1
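
Before using the result for training, it may be worth checking that no non-numeric column slipped through and then separating the target `y` from the features. The sketch below does this on a made-up frame shaped like the encoded data; the names `X`, `target`, and `all_numeric` are assumptions, not part of the exercise.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame that mimics a few columns of the encoded dataset.
toy = pd.DataFrame({
    "age": [30, 33],
    "housing": [0, 1],
    "contact_cellular": [1, 0],
    "y": [0, 1],
})

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)

# Separate the features from the target column.
X = toy.drop(columns="y")
target = toy["y"]

print(all_numeric)
print(X.shape, target.shape)
```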
