# Python as a Calculator

Blank notebook to be used for class exercises.

## Exercise 1

Write code to load the data in "iris.csv". The first four columns are the features, and the last column is the class label. Don't forget to convert the dataset into a NumPy array.

After the dataset is loaded, create train and test partitions using the following scikit-learn method:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

File path: ../data/datasets/iris/iris.csv

In [1]:
!head ../data/datasets/iris/iris.csv

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


In [7]:
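import csv
import numpy as np
from sklearn.model_selection import train_test_split

X = []  # feature rows (four floats each)
y = []  # class labels (strings)
with open('../data/datasets/iris/iris.csv', newline='') as in_file:
    reader = csv.reader(in_file, delimiter=',')
    for row in reader:
        if not row:  # skip blank lines, which would otherwise raise an IndexError
            continue
        X.append([float(x) for x in row[:-1]])
        y.append(row[-1])
X = np.array(X)
y = np.array(y)

# Hold out 20% of the samples as the test partition
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)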
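Because `train_test_split` shuffles randomly, the partitions differ from run to run unless a seed is fixed. The following is a minimal sketch of a reproducible, class-balanced split. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of the CSV file; the helper name `stratified_split` and the seed value 0 are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def stratified_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: reproducible split that preserves class proportions."""
    # random_state fixes the shuffle; stratify=y keeps the class ratio in both partitions
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)


# Assumption: the bundled iris data stands in for ../data/datasets/iris/iris.csv
X_iris, y_iris = load_iris(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = stratified_split(X_iris, y_iris)

print(Xs_train.shape, Xs_test.shape)
print(np.bincount(ys_test))  # number of test samples per class
```

With stratification, each of the three classes contributes the same number of samples to the test partition, which makes scores on a small dataset like this one less dependent on the luck of the shuffle.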

## Exercise 2

Using the iris data you loaded in Exercise 1, train an SVM on the train split and evaluate it using accuracy on the test split. Vary the parameters of the SVM to see how they affect performance.

Next, try a different classifier, a random forest, and see how it compares to the SVM:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [10]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Support vector classifier with default parameters
clf = SVC()
clf.fit(X_train, y_train)
y_pred_svc = clf.predict(X_test)

# Random forest with default parameters
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Note: this reports macro-averaged F1 rather than the accuracy the exercise asks for
print("SVC F1: {} RF F1: {}".format(
    f1_score(y_test, y_pred_svc, average='macro'),
    f1_score(y_test, y_pred_rf, average='macro')))

SVC F1: 0.9649122807017544 RF F1: 0.9649122807017544
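The exercise text asks for accuracy, so here is a minimal sketch of the same comparison scored with `accuracy_score`. It assumes scikit-learn's bundled iris data (`load_iris`) rather than the CSV file, and the variable names and the seed value 0 are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for iris.csv
X_iris, y_iris = load_iris(return_X_y=True)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

# Fit both classifiers on the train split
svc_demo = SVC().fit(Xa_train, ya_train)
rf_demo = RandomForestClassifier(random_state=0).fit(Xa_train, ya_train)

# Accuracy is the fraction of test samples whose predicted label is correct
acc_svc = accuracy_score(ya_test, svc_demo.predict(Xa_test))
acc_rf = accuracy_score(ya_test, rf_demo.predict(Xa_test))
print("SVC accuracy: {:0.4f} RF accuracy: {:0.4f}".format(acc_svc, acc_rf))
```

For classifiers, `model.score(X, y)` returns the same accuracy value, so either form can be used.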


## Exercise 3

Using the iris dataset, create a 2-way split (train/validation). Loop over all combinations of the SVC kernel parameters "rbf" and "linear" and the C parameters 0.001, 0.01, 0.1, 1., and 10. Print the training and validation scores for every pair of parameters. How do they compare?

Hint: You need to nest two for loops. You can use the train/test splits from Exercise 1, treating the test split as the validation set.

In [15]:

# Try every (C, kernel) combination
for C in [0.001, 0.01, 0.1, 1., 10.]:
    for kernel in ['rbf', 'linear']:
        clf = SVC(C=C, kernel=kernel)
        clf.fit(X_train, y_train)
        # Only the held-out score is printed here; the training score is not computed
        y_pred = clf.predict(X_test)
        print("C: {} Kernel: {} F1: {:0.4f}".format(
            C, kernel, f1_score(y_test, y_pred, average='macro')))

C: 0.001 Kernel: rbf F1: 0.1538
C: 0.001 Kernel: linear F1: 0.5476
C: 0.01 Kernel: rbf F1: 0.1538
C: 0.01 Kernel: linear F1: 1.0000
C: 0.1 Kernel: rbf F1: 1.0000
C: 0.1 Kernel: linear F1: 0.9649
C: 1.0 Kernel: rbf F1: 0.9649
C: 1.0 Kernel: linear F1: 0.9649
C: 10.0 Kernel: rbf F1: 0.9649
C: 10.0 Kernel: linear F1: 0.9649
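The exercise asks for both training and validation scores for every parameter pair. A minimal sketch of a loop that reports both follows. It assumes scikit-learn's bundled iris data (`load_iris`) and a fixed seed of 0, and it passes `zero_division=0` to `f1_score` so that classes the model never predicts count as 0 without raising a warning; these choices are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for iris.csv
X_iris, y_iris = load_iris(return_X_y=True)
Xg_train, Xg_val, yg_train, yg_val = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0
)

results = []  # one (C, kernel, train_f1, val_f1) tuple per combination
for C in [0.001, 0.01, 0.1, 1., 10.]:
    for kernel in ['rbf', 'linear']:
        model = SVC(C=C, kernel=kernel).fit(Xg_train, yg_train)
        # Score the same fitted model on the data it saw and on held-out data
        train_f1 = f1_score(yg_train, model.predict(Xg_train),
                            average='macro', zero_division=0)
        val_f1 = f1_score(yg_val, model.predict(Xg_val),
                          average='macro', zero_division=0)
        results.append((C, kernel, train_f1, val_f1))
        print("C: {} Kernel: {} Train F1: {:0.4f} Val F1: {:0.4f}".format(
            C, kernel, train_f1, val_f1))
```

A training score far above the validation score suggests overfitting, while low scores on both suggest the model is too heavily regularized for this data.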




## Exercise 4

Use the iris dataset to create a 2-way split, but this time optimize the SVC parameters using GridSearchCV (also try a RandomForest model). Then report the final F1 score on the train, validation, and test datasets. How close are the validation and test scores? How does the training score compare to the test and validation scores?

In [20]:
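from sklearn.model_selection import GridSearchCV

# Parameter grid: every combination of C and kernel is tried
params = {"C": [0.0001, 0.001, 0.01, 0.1, 1., 10.], "kernel": ["rbf", "linear"]}

svc = SVC()
clf = GridSearchCV(svc, params, cv=5)  # 5-fold cross-validation on the train split
clf.fit(X_train, y_train)

# predict() uses the best estimator, refit on the whole train split
train_preds_svc = clf.predict(X_train)
train_f1_svc = f1_score(y_train, train_preds_svc, average='macro')
test_preds_svc = clf.predict(X_test)
test_f1_svc = f1_score(y_test, test_preds_svc, average='macro')
# Caution: with no `scoring` argument, best_score_ is the mean cross-validated
# accuracy of the best parameters, not an F1 score
dev_f1_svc = clf.best_score_

print("SVC - Train F1: {} Dev F1: {} Test F1: {}".format(train_f1_svc, dev_f1_svc, test_f1_svc))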

SVC - Train F1: 0.9917645264602714 Dev F1: 0.9916666666666667 Test F1: 0.9649122807017544
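To make the validation score an F1 score and to cover the random forest part of the exercise, one option is to pass `scoring="f1_macro"` to `GridSearchCV` and run a second search. The sketch below assumes scikit-learn's bundled iris data (`load_iris`), a fixed seed of 0, and a small random forest grid (`n_estimators`, `max_depth`) chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for iris.csv
X_iris, y_iris = load_iris(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0, stratify=y_iris
)

# scoring="f1_macro" makes best_score_ a mean cross-validated macro F1
searches = {
    "SVC": GridSearchCV(
        SVC(),
        {"C": [0.01, 0.1, 1., 10.], "kernel": ["rbf", "linear"]},
        cv=5, scoring="f1_macro"),
    "RF": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [10, 50], "max_depth": [2, None]},  # illustrative grid
        cv=5, scoring="f1_macro"),
}

scores = {}  # name -> (train F1, validation F1, test F1)
for name, search in searches.items():
    search.fit(Xc_train, yc_train)
    train_f1 = f1_score(yc_train, search.predict(Xc_train), average="macro")
    test_f1 = f1_score(yc_test, search.predict(Xc_test), average="macro")
    scores[name] = (train_f1, search.best_score_, test_f1)
    print("{} - Train F1: {:0.4f} Dev F1: {:0.4f} Test F1: {:0.4f} Best: {}".format(
        name, train_f1, search.best_score_, test_f1, search.best_params_))
```

Here the validation score is the average over the five cross-validation folds of the train split, so the test split stays untouched until the final evaluation.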
