![DSA logo](dsalogo.png)

### Instructions

1. Make sure you are using version 3 or later of the notebook. If you installed Anaconda with Python 3, this is likely to be fine. The next piece of code checks whether you have the right version.
2. The notebook includes some open test cases that you can use to check your code. However, your submission will be run on another set of test cases that you can't see, and marks will be awarded from those. Passing all the tests in this notebook is therefore not a guarantee that you have done things correctly, though it's highly probable.
3. Make sure you submit a notebook that doesn't return any errors. One way to ensure this is to run all the cells before you submit the notebook.
4. When you are done, create a zip file of your notebook and upload it.
5. For each cell where you see "YOUR CODE HERE", delete the `raise NotImplementedError()` statement when you write your code there. Don't leave it in the notebook.
6. Once you are done, you are done.

## Machine Learning primer
<i>Ernest M (03/2018)</i>

This notebook takes you through loading Sklearn and using it for basic Machine Learning.

In [2]:
from nose.tools import assert_equal, assert_greater, assert_less, assert_greater_equal
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

In [3]:
# Import the good stuff
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#### Question 1

Read in the data from the file `multiclassdata.csv` using pandas

In [94]:
def read_data():
    # Load the CSV into a DataFrame and return its contents as a NumPy array
    data = pd.read_csv('multiclassdata.csv')
    return data.values

In [95]:
data = read_data()
assert_equal(isinstance(data, np.ndarray), True)
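
As a rough illustration of what the conversion from a DataFrame to an array does, here is a minimal sketch that reads a tiny in-memory CSV instead of `multiclassdata.csv`. The column names and values are made up for the example; the real file's layout is not shown in this notebook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for multiclassdata.csv: four feature columns and a label column.
csv_text = """f1,f2,f3,f4,label
5.1,3.5,1.4,0.2,0
6.2,2.9,4.3,1.3,1
7.7,3.0,6.1,2.3,2
"""

frame = pd.read_csv(io.StringIO(csv_text))  # the header row becomes the column names
array = frame.to_numpy()                    # same data as frame.values, as a NumPy array

print(frame.shape)  # rows and columns, header excluded
print(array.dtype)  # integer and float columns are upcast to one common dtype
```

Note that the label column is upcast to float along with the features, which is worth keeping in mind when comparing labels later.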

#### Question 2

Split the data into the features `X` and the labels `y`. The labels are the last column of the data.

In [103]:
def split_Xy(data):
    # Inputs a numpy array and outputs two arrays, X: the feature vectors and y: the labels
    # Features are every column except the last; labels are the last column as a 1-D array
    return data[:, :-1], data[:, -1]

In [104]:
X,y = split_Xy(data)
assert_equal(isinstance(X, np.ndarray), True)
assert_equal(X.shape[0], y.shape[0])
assert_equal(y.shape[0], 500)
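
To make the column slicing concrete, here is a small sketch on a made-up 3x3 array (not the assignment data). It shows why indexing with an integer gives a one-dimensional label vector, while indexing with a list keeps a column axis that then needs flattening.

```python
import numpy as np

# Toy array: 3 rows, 2 feature columns, labels in the last column.
toy = np.array([[1.0, 2.0, 0.0],
                [3.0, 4.0, 1.0],
                [5.0, 6.0, 0.0]])

X_toy = toy[:, :-1]   # every column except the last, shape (3, 2)
y_toy = toy[:, -1]    # last column only, shape (3,)

# Indexing with a list keeps the column axis, so the result is 2-D.
y_col = toy[:, [-1]]  # shape (3, 1)

print(X_toy.shape, y_toy.shape, y_col.shape)
```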

#### Question 3

How many classes are represented in the labels?

In [105]:
def num_classes(label):
    # Return the number of distinct classes in the label
    return len(np.unique(label))

In [106]:
nc = num_classes(y)
assert nc > 1
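
Beyond the number of classes, it is often useful to see how many samples each class has. The following is a small sketch on made-up labels, using the `return_counts` option of `np.unique`.

```python
import numpy as np

labels = np.array([2, 0, 1, 2, 2, 0])  # made-up labels, not the assignment data

# np.unique returns the sorted distinct values; return_counts adds their frequencies.
classes, counts = np.unique(labels, return_counts=True)
n_classes = len(classes)

for cls, cnt in zip(classes, counts):
    print(cls, cnt)
```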

#### Question 4

Split both X and y into a training set and a test set, with 20% of the data in the test set. Hint: you can use Sklearn's `train_test_split` function with a `random_state` of 10.

In [107]:
print(X.shape)

(500, 4)


In [108]:
def split_train_test(X,y):
    # Split the data in Xtrain, Xtest, ytrain, ytest with a random_state of 10
    return train_test_split(X, y, test_size=0.20, random_state=10)

In [109]:
X_train, X_test, y_train, y_test = split_train_test(X,y)
assert_equal(X_train.shape, (400,4))
assert_equal(y_test.shape, (100,))
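
The sketch below, on a made-up 10-sample array, illustrates two properties of `train_test_split` that the question relies on: the rows of X and y stay aligned after shuffling, and a fixed `random_state` makes the split reproducible. The toy arrays are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(10, 4)  # 10 samples, 4 features; row i starts with 4*i
y_toy = np.arange(10) % 2             # alternating 0/1 labels, so y[i] = i % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=10
)

print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
# Recover each test row's original index from its first feature and check its label.
print((X_te[:, 0] // 4) % 2, y_te)
```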

#### Question 5

Train a logistic regression classifier using Sklearn and return its predictions on the test set.

In [121]:
def logistic_regression(X_train, X_test, y_train):
    # Fit a logistic regression model on the training data and predict the test labels
    log_reg = LogisticRegression()
    log_reg.fit(X_train, y_train)
    predictions = log_reg.predict(X_test)
    return predictions

In [122]:
y_pred = logistic_regression(X_train, X_test, y_train)
assert_equal(y_pred.shape, (100,))
assert_greater(sum(y_pred), 100)
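
For a multiclass problem, `predict` returns one hard label per sample, while `predict_proba` returns one probability per class. Here is a minimal sketch on synthetic data generated with `make_blobs`; the data, the 120/30 split, and the `max_iter` value are assumptions for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the assignment data: 150 points in 3 clusters, 4 features.
X_demo, y_demo = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)

clf = LogisticRegression(max_iter=1000)  # a higher max_iter helps avoid convergence warnings
clf.fit(X_demo[:120], y_demo[:120])      # train on the first 120 rows

labels = clf.predict(X_demo[120:])       # hard class labels, shape (30,)
probs = clf.predict_proba(X_demo[120:])  # one probability per class, shape (30, 3)

print(labels[:5])
print(probs[:5].round(3))
```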

#### Question 6

Get the performance (accuracy) of your algorithm on the test set, given `ytest`.

In [167]:
def get_accuracy(ypred, ytest):
    # Get Accuracy given ytest as a percentage
    # accuracy_score returns a fraction in [0, 1], so scale it by 100
    accuracy = accuracy_score(ytest, ypred) * 100
    return accuracy

In [168]:
acc = get_accuracy(y_pred, y_test)
assert_less(acc, 90)
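
`accuracy_score` returns a fraction between 0 and 1, while the tests in this notebook compare against values such as 90 and 92, which read as percentages. A small sketch with made-up labels shows the difference between the two scales.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1])  # made-up ground truth
y_hat = np.array([0, 1, 2, 1, 1])   # made-up predictions, 4 of 5 correct

fraction = accuracy_score(y_true, y_hat)  # share of correct predictions, in [0, 1]
percent = 100.0 * fraction                # the same value on a 0-100 scale
manual = np.mean(y_true == y_hat)         # equivalent computation by hand

print(fraction, percent, manual)
```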

#### Question 7

Retrain the algorithm with a Support Vector Classifier. Tune the parameters.

In [169]:
def support_vector_classifier(X_train, X_test, y_train, c=1.0, kern='rbf'):
    # Fit an SVC with the given regularisation strength and kernel, then predict the test labels
    clf = SVC(C=c, kernel=kern)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    return predictions

In [170]:
y_pred_svc = support_vector_classifier(X_train, X_test, y_train)
assert_equal(y_pred_svc.shape, (100,))
assert_greater(sum(y_pred_svc), 100)

#### Question 8

Get the accuracy of the SVC algorithm with default parameters

In [171]:
ypredsvc = support_vector_classifier(X_train, X_test, y_train)
ac = get_accuracy(ypredsvc, y_test)
assert_less(ac, 87)

#### Question 9

Tune the parameters of the SVC algorithm to get an accuracy greater than or equal to 92%.

In [172]:
# accuracy_score returns a fraction in [0, 1], so assert_greater_equal(accc, 92) can only pass if get_accuracy returns a percentage.


In [None]:
yypredsvc = support_vector_classifier(X_train, X_test, y_train, c=1000, kern='linear')
accc = get_accuracy(yypredsvc, y_test)
assert_greater_equal(accc, 92)
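
Rather than picking `C` and the kernel by hand, one option is a cross-validated grid search. The following is a minimal sketch using `GridSearchCV` on synthetic data; the data, the parameter grid, and the number of folds are assumptions for illustration, and this is not necessarily the approach the assignment expects.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data: 3 overlapping clusters, 4 features.
X_demo, y_demo = make_blobs(
    n_samples=200, n_features=4, centers=3, cluster_std=3.0, random_state=0
)

# Hypothetical grid: every combination of C and kernel is scored with 5-fold CV.
param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)       # the best combination found on this data
print(100 * search.best_score_)  # mean cross-validated accuracy as a percentage
```

In this notebook the search would be fit on `X_train` and `y_train` only, so that the test set is not used for choosing parameters.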