# Lab 4: SVM + Neural Networks #


In [1]:
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ParameterGrid

import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
df_train = pd.read_csv('./lab_4_training.csv')
df_test = pd.read_csv('./lab_4_test.csv')
df_train.head()

Unnamed: 0.1,Unnamed: 0,gender,age,year,eyecolor,height,miles,brothers,sisters,computertime,exercise,exercisehours,musiccds,playgames,watchtv
0,1303,male,20,second,green,73.0,210.0,0,1,10.0,Yes,5.0,50.0,1.0,15.0
1,36,male,20,third,other,71.0,90.0,1,0,15.0,Yes,4.0,10.0,0.0,1.0
2,489,male,22,fourth,hazel,75.0,200.0,0,1,1.0,Yes,2.0,150.0,1.0,10.0
3,1415,male,19,second,brown,72.0,35.0,2,2,20.0,Yes,5.0,100.0,0.0,7.0
4,616,male,22,fourth,hazel,71.0,15.0,2,1,10.0,Yes,7.0,10.0,0.0,5.0


***
### Question 1###
Calculate a baseline accuracy measure using the majority class.

** Question 1.a**  
Find the majority class in the training set. If you always predicted this class in the training set, what would your accuracy be?

In [3]:
df_train.gender.describe()

count       1590
unique         2
top       female
freq         855
Name: gender, dtype: object

In [4]:
train_size = df_train.gender.count()
train_female = df_train.where(df_train.gender=='female').gender.count()
train_male = df_train.where(df_train.gender=='male').gender.count()
train_pred = np.maximum(train_female, train_male)/train_size
train_acc = train_pred
train_acc

0.5377358490566038

**Question 1.b**   
If you always predicted this same class (majority from the training set) in the test set, what would your accuracy be?

In [5]:
df_test.gender.describe()

count        398
unique         2
top       female
freq         208
Name: gender, dtype: object

In [6]:
test_size = df_test.gender.count()
test_female = df_test.where(df_test.gender=='female').gender.count()
test_male = df_test.where(df_test.gender=='male').gender.count()
test_freq = test_female/test_size
test_acc = np.minimum(test_freq, train_pred)
test_acc
#measured value – accepted value) ÷ accepted value

0.5226130653266332

***
### Question 2 ###
Get started with Neural Networks.

   
Choose a NN implementation and specify which you choose. Be sure the implementation allows you to modify the number of hidden layers and hidden nodes per layer.  

NOTE: When possible, specify the logsig (sigmoid/logistc) function as the transfer function for the output node and use Levenberg-Marquardt backpropagation (lbfgs). It is possible to specify logsig or logistic in Sklearn MLPclassifier (Neural net).  

**Question 2.a**   
Train a neural network with a single 10 node hidden layer. Only use the Height feature of the dataset to predict the Gender. You will have to change Gender to a 0 and 1 class. After training, use your trained model to predict the class using the height feature from the training set. What was the accuracy of this prediction?

In [7]:
import keras as K
from keras.models import Sequential
from keras.layers import Dense

y_train = df_train['gender'].apply(lambda x: 0 if x == 'male' else 1).values.reshape(-1,1)
x_train = df_train['height'].values.reshape(-1,1)

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=1))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(x_train, y_train)

Using TensorFlow backend.


Epoch 1/1


<keras.callbacks.History at 0x1167dc160>

**Question 2.b**  
Take the trained model from question 2.b and use it to predict the test set. This can be accomplished by taking the trained model and giving it the Height feature values from the test set. What is the accuracy of this model on the test set?

In [8]:
y_test = df_test['gender'].apply(lambda x: 0 if x == 'male' else 1).values.reshape(-1,1)
x_test = df_test['height'].values.reshape(-1,1)

model.evaluate(x_test, y_test, batch_size=256)



[0.7148509082482688, 0.4773869355719293]

**Question 2.c**   
Neural Networks tend to prefer smaller, normalized feature values. Try taking the log of the height feature in both training and testing sets or use a Standard Scalar operation in SKlearn to centre and normalize the data between 0-1 for continuous values. Repeat question 2.c with the log version and the normalized and centered version of this feature

In [9]:
x_train = df_train['height'].apply('log').values.reshape(-1,1)
x_test = df_test['height'].apply('log').values.reshape(-1,1)

model.fit(x_train, y_train, epochs=2, batch_size=256)
model.evaluate(x_test, y_test, batch_size=256)


Epoch 1/2
Epoch 2/2


[0.7170084592085987, 0.4773869355719293]

In [10]:
scaler = StandardScaler()
x_train = scaler.fit_transform(df_train['height'].values.reshape(-1,1))

x_test = scaler.fit_transform(df_test['height'].values.reshape(-1,1))

model.fit(x_train, y_train, epochs=2, batch_size=256)
model.evaluate(x_test, y_test, batch_size=25)

Epoch 1/2
Epoch 2/2


[0.7378595676553908, 0.47738693287624184]

***
### Question 3 ###
Get started with Support Vector Machines.

  
Chose a SVM implementation and specify which you choose. Be sure the implementation allows you to choose between linear and RBF kernels.

**Question 3.a**   
Use the same dataset from 2.a using the linear kernel to find training set prediction accuracy.

In [11]:
from sklearn import svm

y_train = df_train['gender'].apply(lambda x: 0 if x == 'male' else 1).values.reshape(-1,1)
x_train = df_train['height'].values.reshape(-1,1)
y_test = df_test['gender'].apply(lambda x: 0 if x == 'male' else 1).values.reshape(-1,1)
x_test = df_test['height'].values.reshape(-1,1)

clf = svm.SVC(kernel='linear')
clf.fit(x_train, y_train)

clf.score(x_train, y_train)

0.8465408805031447

**Question 3.b**   
Use the same dataset from 2.a using the linear kernel to find test set prediction accuracy

In [12]:
clf.score(x_test, y_test)

0.8542713567839196

**Question 3.c**   
Use the same dataset from 2.a using the RBF kernel  to find training set prediction accuracy

In [13]:
clf = svm.SVC(kernel='rbf')
clf.fit(x_train, y_train)

clf.score(x_train, y_train)

0.8465408805031447

**Question 3.d**   
Use the same dataset from 2.a using the RBF kernel  to find test set prediction accuracy

In [14]:
clf.score(x_test, y_test)

0.8542713567839196

**Question 3.e**   
Use the same dataset from 2.c (log) using the RBF to find test set prediction accuracy

In [15]:
clf = svm.SVC(kernel='rbf')
x_train = df_train['height'].apply('log').values.reshape(-1,1)
x_test = df_test['height'].apply('log').values.reshape(-1,1)
clf.fit(x_train, y_train)

clf.score(x_test, y_test)

0.8542713567839196

**Question 3.f**   
Z-score is a normalization technique. It is the value of a feature minus the average value for that feature in the training set, divided by the standard deviation of that feature in the training set. Repeat question 3.e using Z-score and note if there is any difference in accuracy and comment on why there is a change or no change in accuracy

In [16]:
x_train = df_train['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std()).values.reshape(-1,1)
x_test = df_test['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std()).values.reshape(-1,1)
clf.fit(x_train, y_train)

clf.score(x_test, y_test)

0.8542713567839196

***

### Question 4 ###
The rest of features in this dataset barring a few are categorical. Neither ML method accepts categorical features, so transform year, eyecolor, exercise into a set of binary features, one feature per unique original feature value, and mark the binary feature as ‘1’ if the feature value matches the original value and ‘0’ otherwise. Using only these binary variable transformed features, train and predict the class of the test set.

In [23]:
df_train4 = df_train[['year', 'eyecolor', 'exercise']]
df_test4 = df_test[['year', 'eyecolor', 'exercise']]

X_train = df_train4.to_dict('records')
y_train = df_train['gender']

X_test = df_test4.to_dict('records')
y_test = df_test['gender']

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

le = LabelEncoder()
le.fit(["male", "female"])
y_train = le.transform(y_train).reshape(-1,1)
y_test = le.transform(y_test).reshape(-1,1)

**Question 4.a**    
What was your accuracy using Neural Network with a single 10 node hidden layer? During training, use a maximum number of iterations of 50. (Expected training time: ~15 mins)

In [24]:
model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=13))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x1171effd0>

In [25]:
model.evaluate(X_test, y_test)



[0.6892389697046136, 0.5226130659256748]

**Question 4.b**    
What was your accuracy using a SVM with RBF kernel?

In [26]:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.585427135678392

***
### Question 5###
Using a NN, does height + eye color predict the test set class better by:

**Question 5.a**  
Keeping the original feature values?

In [27]:
X_train = df_train[['height','eyecolor']]
X_train = X_train.to_dict('records')

X_test = df_test[['height','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.6893466974622641, 0.5226130659256748]

**Question 5.b**  
Taking the log of the original values?

In [28]:
df_train_log = df_train
df_test_log = df_test
df_train_log['height'] = df_train['height'].apply('log')
df_test_log['height'] = df_test['height'].apply('log')

X_train = df_train_log[['height','eyecolor']]
X_train = X_train.to_dict('records')

X_test = df_test_log[['height','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.6952551546408303, 0.5226130659256748]

**Question 5.c**  
Taking the Z-score of the original values?

In [29]:
df_train_Z = df_train
df_test_Z = df_test
df_train_Z['height'] = df_train_Z['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())
df_test_Z['height'] = df_test_Z['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())

df_train_Z = df_train_Z[['height','eyecolor']]
X_train = df_train_Z[['height','eyecolor']]
X_train = X_train.to_dict('records')

df_test_Z = df_test_Z[['height','eyecolor']]
X_test = df_test_Z[['height','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.9106535570106314, 0.47738693512264807]

***
### Question 6 ###
Repeat question 5 for exercise hours + eye color

In [36]:
X_train = df_train[['exercisehours','eyecolor']]
X_train = X_train.to_dict('records')

X_test = df_test[['exercisehours','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[nan, 0.0]

In [31]:
df_train_log = df_train
df_test_log = df_test
df_train_log['exercisehours'] = df_train['exercisehours'].apply('log')
df_test_log['exercisehours'] = df_test['exercisehours'].apply('log')

X_train = df_train_log[['exercisehours','eyecolor']]
X_train = X_train.to_dict('records')

X_test = df_test_log[['exercisehours','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[nan, 0.0]

In [32]:
df_train_Z = df_train
df_test_Z = df_test
df_train_Z['exercisehours'] = df_train_Z['exercisehours'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())
df_test_Z['exercisehours'] = df_test_Z['exercisehours'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())

X_train = df_train_Z[['exercisehours','eyecolor']]
X_train = X_train.to_dict('records')

X_test = df_test_Z[['exercisehours','eyecolor']]
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=6))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test, y_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[nan, 0.0]

***
### Question 7###
Combine the features from question 4, 5, and exercise hours from question 6 (using the best normalization feature set form questions 5 and 6)

**Question 7.a**  
What was the NN accuracy on the test set using the single 10 node hidden layer?

In [47]:
X_train = df_train[['eyecolor','year', 'exercise', 'height']]
X_train['height'] = X_train['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())
X_train = X_train.to_dict('records')

#'exercisehours',

X_test = df_test[['eyecolor','year', 'exercise', 'height']]
X_test['height'] = X_test['height'].apply(lambda x: (x - df_train['height'].mean())/df_train['height'].std())
X_test = X_test.to_dict('records')

vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.fit_transform(X_test).toarray()

model = Sequential()
model.add(Dense(units=10, activation='sigmoid', input_dim=14))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50)
model.evaluate(X_test, y_test)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


[1.218741680509481, 0.47738693512264807]

**Question 7.b**  
What was the SVM accuracy on the test set the RBF kernel?

In [48]:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.47738693467336685

***
### Question 8- Bonus###
Can you improve your test set prediction accuracy by 5% or more?  

See how close to that milestone of improvement you can get by modifying the tuning parameters of either Neural Networks (the number of hidden layers, number of hidden nodes in each layer, the learning rate aka mu) or with SVM (choosing kernel, C, and gamma). A great guide to tuning parameters is explained in this guide: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. 

While the guide is specific to SVM and in particular the C and gamma parameters of the RBF kernel, the method applies to generally to any ML technique with tuning parameters.

Please also write a paragraph in a markdown cell below with an explanation of your approach and evaluation metrics.
