## Homework 4 Classification I
### Natalia Hoyos Velasquez
- 20805370

This homework applies the k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM) classification algorithms to the Bank Marketing dataset. Python 3.7.1 was used for this purpose.

The file bank-additional.zip taken from [1] contains two .csv files (bank-additional.csv and bank-additional-full.csv) and the dataset documentation (bank-additional-names.txt). The file bank-additional-full.csv was used as the data source, and the documentation was used to understand the dataset's contents and to identify the column labels.

The pandas [2] and scikit-learn [3] documentation was used as a guide to solve the problems presented in this homework. Courses previously taken at DataCamp [4] were also helpful.


## Libraries used
- Pandas was imported to handle the dataframes.

- Scikit-learn is an open-source machine learning library for Python. Its preprocessing module was used for encoding and standardization, and its other modules provide the classifier and the evaluation metrics.

- Seaborn was used to build the plot.

In [1]:
##--Importing necessary libraries
import pandas as pd 
import numpy as np
from sklearn import preprocessing
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

##--Disabling the warnings for better notebook visibility
import warnings
warnings.simplefilter('ignore') 

##--Setting SNS palette and parameters
sns.set(style='whitegrid', palette='Set1')
SMALL_SIZE = 18
MEDIUM_SIZE = 20
BIGGER_SIZE = 25

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

## Loading the File
Inspecting the dataset showed that, although it is stored as a comma separated values (csv) file, each value is actually separated by a semicolon (;). The file contains a header with the column names.

The file was downloaded from the dataset URL and loaded as a pandas dataframe, with the header and delimiter parameters set to 0 and ";" respectively. The first 5 rows can be seen in the table below.
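The effect of the two parameters can be checked on a small in-memory sample. This is only a sketch: the three columns and two rows below are a hypothetical excerpt, not the real file, and `sep` is used as the equivalent of `delimiter`.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same layout as the bank file:
# one header line, then semicolon-separated values.
sample = io.StringIO('age;job;y\n56;housemaid;no\n57;services;no\n')

# header=0 takes the first line as column names; sep=';' splits on semicolons.
sample_df = pd.read_csv(sample, header=0, sep=';')

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Without the explicit separator, pandas would split on commas and each row would end up in a single column.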

In [2]:
##--Reading the file

#Read the file
df = pd.read_csv('bank-additional/bank-additional-full.csv', header = 0, delimiter = ";")
# df = pd.read_csv('bank/bank-full.csv', header = 0, delimiter = ";")

print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


The dataset is related to the phone-based direct marketing campaigns of a banking institution located in Portugal. As the summary above shows, the loaded file contains 41,188 instances and 20 input attributes. The goal attribute (variable y) describes whether the client will subscribe to a term deposit.

## Data Cleaning and Preprocessing
The dataframe has missing values denoted as “unknown”, which are changed to Not a Number (NaN) so that they can be counted and handled more easily.

In [3]:
#Replacing the 'unknown' entries with NaN values so they can be counted
df = df.replace(to_replace = 'unknown', value = np.nan)
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               40858 non-null object
marital           41108 non-null object
education         39457 non-null object
default           32591 non-null object
housing           40198 non-null object
loan              40198 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


The information above shows that no variable is dominated by missing values: most affected columns are missing fewer than 5% of their entries, and the largest gap is in “default”, with 8,597 missing values (about 21%).

The documentation found in [1] states that the duration of a call is not known before the call is performed, so the “duration” column should be used for benchmark purposes only. The documentation advises discarding it, and it is dropped here.

The features that contain missing values are “job”, “marital”, “education”, “default”, “housing” and “loan”. Replacing the missing values of these attributes with the mean would not make sense since they are categorical, and therefore they are replaced with the mode of each column.
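Mode imputation can be illustrated on a toy dataframe. This is a minimal sketch, assuming one common pandas idiom (`fillna` with the first row of `mode()`); the two columns and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up categorical columns with one missing value each.
toy = pd.DataFrame({
    'job': ['admin.', np.nan, 'admin.', 'services'],
    'loan': ['no', 'no', np.nan, 'yes'],
})

# mode() returns a DataFrame (ties produce extra rows);
# row 0 holds one mode per column.
modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its label.
toy_filled = toy.fillna(modes)
print(toy_filled)
```

Because the fill values are matched by column name, every column receives its own most frequent category.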

In [4]:
# Dropping "duration", which is not known before a call is performed
df.drop(columns = ['duration'],  inplace=True)

# Filling the missing values of each column with that column's mode
df = df.fillna(df.mode().iloc[0])

df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Categorical Data Encoding
The dataset contains categorical features but the kNN and SVM classification models don’t handle categorical features. To solve this, some kind of encoding to turn them into numerical attributes needs to be used. 

In this case, the variables 'job', 'marital', 'default', 'housing', 'loan', 'contact' and 'poutcome' were encoded by giving a unique number to each of their possible values, with no logical order other than alphabetical order. ‘education’ was encoded following the level of education, beginning with illiterate and finishing with university degree. ‘month’ and ‘day_of_week’ were encoded in chronological order.
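The difference between the two encoding styles can be seen on a toy example. This is a sketch with made-up rows; the `order` dictionary is a shortened, hypothetical version of an education mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'marital': ['married', 'single', 'divorced'],
    'education': ['high.school', 'illiterate', 'basic.4y'],
})

# LabelEncoder sorts the distinct values, so the codes follow alphabetical order.
le = LabelEncoder()
marital_codes = le.fit_transform(toy['marital'])
print(list(le.classes_), list(marital_codes))

# An explicit dictionary keeps a meaningful order. Values that are absent
# from the dictionary would become NaN, so it is worth checking for them.
order = {'illiterate': 0, 'basic.4y': 1, 'high.school': 4}
education_codes = toy['education'].map(order)
print(education_codes.tolist(), education_codes.isna().any())
```

Checking for NaN after `map` is a cheap way to catch a category that was left out of the dictionary.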

In [5]:
# #Encoding Marital, Job with one hot encoder
# df_encode = df.copy()

# onehotencoder2 = preprocessing.OneHotEncoder()
# enc = job = onehotencoder2.fit_transform(df_encode[['job', 'marital']]).toarray()

# col_job = np.unique(df_encode['job'])
# col_marital = np.unique(df_encode['marital'])

# # job, marital, contact, day, month, poutcome

In [6]:
#Encoding job, marital, default, housing, loan, contact, poutcome with labelencoder
df_encode = df.copy()
labelencoder_df = preprocessing.LabelEncoder()
feas = ('job', 'marital', 'default', 'housing', 'loan', 'contact','poutcome')
for item in feas:
    df_encode[item]=labelencoder_df.fit_transform(df_encode[item])

In [7]:
#Encoding Education
education = {"illiterate":0, "basic.4y":1, "basic.6y":2, "basic.9y":3, 
             "high.school":4,"professional.course":5, "university.degree":6}
df_encode['education'] = df_encode['education'].map(education)

In [8]:
#Encoding Month in the order of the months
ord_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov':11, 'dec':12}

df_encode['month'] = df_encode['month'].map(ord_map)

ord_map = {'mon': 1, 'tue': 2, 'wed': 3, 'thu': 4, 'fri': 5, 'sat': 6,
          'sun': 7}

df_encode['day_of_week'] = df_encode['day_of_week'].map(ord_map)

df_encode[df['y']=='no'].count() #36548 samples with 'y'='no'
df_encode[df['y']=='yes'].count() #4640 samples with 'y'='yes'
df_encode.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,3,1,1,0,0,0,1,5,1,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,no
1,57,7,1,4,0,0,0,1,5,1,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,no
2,37,7,1,4,0,1,0,1,5,1,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,no
3,40,0,1,2,0,0,0,1,5,1,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,no
4,56,7,1,4,0,0,1,1,5,1,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,no


## Splitting Data
Choosing the train/test split ratio involves a trade-off. A training set that is too small gives the model too little information and leads to underfitting, while a test set that is too small does not cover enough cases to give a reliable estimate of performance. For big datasets, 80/20 is a commonly used ratio (it is sometimes loosely linked to the Pareto principle), and it is used here together with stratification so that both sets keep the original class proportions.
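The effect of a stratified 80/20 split can be checked on synthetic labels. This is only a sketch: the 900/100 class counts are an assumption chosen to mimic an imbalanced target, and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumed 900 'no' and 100 'yes').
y_toy = np.array(['no'] * 900 + ['yes'] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2)

# Share of the minority class in each part of the split.
train_share = np.mean(y_tr == 'yes')
test_share = np.mean(y_te == 'yes')
print(len(y_tr), len(y_te), train_share, test_share)
```

With `stratify`, the minority share stays close to 10% in both parts, which a purely random split does not guarantee.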

In [9]:
X = df_encode.iloc[:, :-1].values
y = df_encode.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, 
                                                    stratify = y, 
                                                    random_state = 2)

# Feature Scaling
sc = StandardScaler()
cols = df_encode.columns
cols = cols[:-1] 
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=cols)
X_test = pd.DataFrame(sc.transform(X_test), columns=cols)

df_z=pd.concat((pd.DataFrame(X_test), pd.DataFrame(y_test, columns=['y'])), axis=1)

## Classification models

In the present homework two machine learning techniques were used to create a classification model: k-Nearest Neighbor (kNN) and Support Vector Machine (SVM). 

It is worth mentioning that in binary classification problems accuracy, a frequently used performance measure, can be misleading if one class is much more common than the other. In this exercise the classes are imbalanced: 36,548 samples (88.7%) belong to the ‘no’ class and only 4,640 samples (11.3%) belong to the ‘yes’ class.
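A quick calculation shows why accuracy alone says little here. The sketch below uses the class counts noted in the encoding cell; the "always predict 'no'" classifier is a hypothetical baseline, not a model used in this homework.

```python
# Class counts of the full dataset, as noted in the encoding cell.
n_no, n_yes = 36548, 4640
total = n_no + n_yes

share_no = n_no / total
share_yes = n_yes / total

# A classifier that always answers 'no' is right exactly as often as 'no' occurs.
baseline_accuracy = share_no
print(round(share_no, 3), round(share_yes, 3), round(baseline_accuracy, 3))
```

Any model therefore has to be judged against this baseline rather than against 50%.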

Besides accuracy, other measures that can be analyzed are:
- Precision: the share of samples predicted as positive that are actually positive.
- Recall: the share of actual positive samples that were predicted as positive.
- F1 score: the harmonic mean of precision and recall, which balances the two [koo ping].


### kNN Classifier

In [10]:
classifier = KNeighborsClassifier(n_neighbors = 5)

classifier.fit(X_train, y_train)

y_pred_knn = classifier.predict(X_test)

df_knn=pd.concat((pd.DataFrame(X_test), pd.DataFrame(y_pred_knn, columns=['y'])), axis=1)

err = (y_test != y_pred_knn).sum()/len(y_test)
acc = 1-err
print("The accuracy of the kNN model is " + str(round(acc,2)) + 
      " and the error is " + str(round(err,2)) + "\n" + "  'no'  'yes'")

#Evaluating
print(metrics.confusion_matrix(y_test, y_pred_knn))
print(metrics.classification_report(y_test, y_pred_knn))
# , target_names=set(y)

The accuracy of the kNN model is 0.89 and the error is 0.11
  'no'  'yes'
[[7086  224]
 [ 669  259]]
              precision    recall  f1-score   support

          no       0.91      0.97      0.94      7310
         yes       0.54      0.28      0.37       928

   micro avg       0.89      0.89      0.89      8238
   macro avg       0.72      0.62      0.65      8238
weighted avg       0.87      0.89      0.88      8238
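The values reported for the ‘yes’ class can be reproduced by hand from the confusion matrix. This is a small check, assuming ‘yes’ is treated as the positive class and that rows are true labels and columns are predictions.

```python
import numpy as np

# Confusion matrix printed above, in the order ('no', 'yes').
cm = np.array([[7086, 224],
               [669, 259]])

tn, fp = cm[0]   # true 'no': predicted 'no', predicted 'yes'
fn, tp = cm[1]   # true 'yes': predicted 'no', predicted 'yes'

precision = float(tp / (tp + fp))            # 259 / 483
recall = float(tp / (tp + fn))               # 259 / 928
f1 = 2 * precision * recall / (precision + recall)
accuracy = float((tp + tn) / cm.sum())       # 7345 / 8238

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The accuracy of 0.89 is barely above the share of the ‘no’ class, while the recall shows that only about 28% of the actual ‘yes’ clients are found.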

