# Python as a Calculator

Blank notebook to be used for class exercises.

## Exercise 1

Write code to load the data in "iris.csv" into numpy arrays.

The first 4 columns are the features/attributes. The last column is the
class. Simply load the class as a list of strings. Don't forget to convert the
dataset into a numpy array. You can use either DictVectorizer or the CSV
method on the previous slide to load the features.
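As a rough illustration of the CSV approach, here is a minimal sketch that parses rows with Python's `csv.reader`. It assumes an in-memory sample (three rows in the same format as iris.csv) instead of the real file, and the names `sample`, `features`, `labels`, and `X_sample` are made up for this example.

```python
import csv
import io

import numpy as np

# Assumption: three rows in the iris.csv format stand in for the real file.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

features = []
labels = []
for row in csv.reader(sample):
    if not row:  # csv.reader yields an empty list for blank lines
        continue
    # All fields except the last are numeric features; the last is the class.
    features.append([float(value) for value in row[:-1]])
    labels.append(row[-1])

X_sample = np.array(features)
print(X_sample.shape)  # (3, 4)
print(labels[0])       # Iris-setosa
```

To read the real file, the `io.StringIO` object would be replaced by the handle returned by `open(...)`.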

In [50]:
with open('../data/datasets/iris/iris.csv') as in_file:
    count = 0
    for row in in_file:
        print(row.strip())
        count += 1
        if count == 10:
            break

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


In [51]:
# Method 1: parse each comma-separated line by hand
import numpy as np

X = []
y = []
with open('../data/datasets/iris/iris.csv') as in_file:
    for row in in_file:
        data = row.strip().split(',')
        # The first four fields are numeric features; the last is the class label.
        feats = [float(x) for x in data[:-1]]
        X.append(feats)
        y.append(data[-1])
X = np.array(X)
y = np.array(y)

In [52]:
# Method 2
from sklearn.feature_extraction import DictVectorizer

X_dicts = []
y = []

with open('../data/datasets/iris/iris.csv') as in_file:
    for row in in_file:
        data = row.strip().split(',')
        item = {}
        item['feat 1'] = float(data[0])
        item['feat 2'] = float(data[1])
        item['feat 3'] = float(data[2])
        item['feat 4'] = float(data[3])
        X_dicts.append(item)
        y.append(data[-1])

vec = DictVectorizer(sparse=False)

X = vec.fit_transform(X_dicts)
y = np.array(y)

## Exercise 2

Using the iris data you loaded in Exercise 1, do the following:

- Use train_test_split() to split the iris dataset (use 0.2 for the
test size).
- Train an SVM on the train split and evaluate using accuracy on the
test split.
- Fiddle with the parameters of the SVM to see how they affect the
performance.
- Calculate the accuracy on the train split. Is there a difference between the train/test accuracies?

Next, try using a different classifier, a random forest, and see how it
compares to the SVM.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Note that this is a toy dataset, so all scores will be high.

In [53]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [54]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

clf = SVC(kernel='linear', C=0.0001)

clf.fit(X_train, y_train)

y_test_preds = clf.predict(X_test)

print("Test Score: {:.4f}".format(accuracy_score(y_test, y_test_preds)))
print("Train Score: {:.4f}".format(accuracy_score(y_train, clf.predict(X_train))))

Test Score: 0.3000
Train Score: 0.3417
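
The random forest comparison requested in this exercise could look roughly like the sketch below. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of iris.csv, and the `rf_`-prefixed names and the `n_estimators=100` setting are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv.
iris_X, iris_y = load_iris(return_X_y=True)
rf_X_train, rf_X_test, rf_y_train, rf_y_test = train_test_split(
    iris_X, iris_y, test_size=0.2, random_state=42)

# random_state makes the bootstrap samples and feature choices repeatable.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(rf_X_train, rf_y_train)

rf_test_acc = accuracy_score(rf_y_test, rf.predict(rf_X_test))
rf_train_acc = accuracy_score(rf_y_train, rf.predict(rf_X_train))
print("Test Score: {:.4f}".format(rf_test_acc))
print("Train Score: {:.4f}".format(rf_train_acc))
```

Since this is a toy dataset, both scores are likely to be high, so the comparison with the SVM says more about the parameter settings than about the classifiers themselves.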


## Exercise 3

Using the train/test iris dataset split from Exercise 2, compare all combinations of parameters by looping over the SVC kernel parameters "rbf" and "linear",
and C parameters 0.001, 0.01, 0.1, 1., and 10. Print the training and
test scores for every pair of parameters.


Hint: You need to nest two for loops. You can use the train/test splits from Exercise 2.

In [55]:
for k in ['rbf', 'linear']:
    for c in [0.001, 0.01, 0.1, 1., 10.]:
        clf = SVC(kernel=k, C=c)
        clf.fit(X_train, y_train)
        y_test_preds = clf.predict(X_test)
        y_train_preds = clf.predict(X_train)
        print("Test Score: {:.4f} Train Score: {:.4f}".format(
            accuracy_score(y_test, y_test_preds), accuracy_score(y_train, y_train_preds)))


Test Score: 0.3000 Train Score: 0.3417
Test Score: 0.3000 Train Score: 0.3417
Test Score: 1.0000 Train Score: 0.9417
Test Score: 1.0000 Train Score: 0.9917
Test Score: 1.0000 Train Score: 0.9833
Test Score: 0.3000 Train Score: 0.3417
Test Score: 0.9667 Train Score: 0.9250
Test Score: 1.0000 Train Score: 0.9667
Test Score: 1.0000 Train Score: 0.9750
Test Score: 0.9667 Train Score: 0.9667
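
The scores above are printed in loop order (the five C values for "rbf", then the five for "linear") without labels. A minimal sketch that labels each line and keeps the scores in a dictionary is shown here. It assumes the bundled `load_iris` data in place of iris.csv, uses `itertools.product` as a compact equivalent of the two nested loops, and the names `results`, `grid_X`, and the `g_` prefix are hypothetical.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for the arrays loaded from iris.csv.
grid_X, grid_y = load_iris(return_X_y=True)
g_X_train, g_X_test, g_y_train, g_y_test = train_test_split(
    grid_X, grid_y, test_size=0.2, random_state=42)

results = {}
# product() yields every (kernel, C) pair, like two nested for loops.
for kernel, c in product(['rbf', 'linear'], [0.001, 0.01, 0.1, 1., 10.]):
    model = SVC(kernel=kernel, C=c).fit(g_X_train, g_y_train)
    test_acc = accuracy_score(g_y_test, model.predict(g_X_test))
    train_acc = accuracy_score(g_y_train, model.predict(g_X_train))
    results[(kernel, c)] = (test_acc, train_acc)
    print("kernel: {:6s} C: {:<6} Test Score: {:.4f} Train Score: {:.4f}".format(
        kernel, c, test_acc, train_acc))
```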


## Exercise 4

Use the iris dataset to create a 3-way split (train/val/test), optimize the SVC parameters using the validation split, then report the final F1 score on the test, train, and validation datasets.

You will need to use the train\_test\_split() method on the train dataset from Exercise 2. You can use a 10% test size.

How close are the validation and test scores? How does the training score compare to the test and validation scores?

In [56]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42)

# Score each (C, kernel) pair on the validation split.
for c in [0.1, 1.]:
    for k in ['rbf', 'linear']:
        clf = SVC(kernel=k, C=c)
        clf.fit(X_train2, y_train2)
        val_preds = clf.predict(X_val)
        acc = accuracy_score(y_val, val_preds)
        print("C: {} kern: {} acc: {}".format(c, k, acc))

# Refit one of the configurations with the best validation accuracy
# and score it once on the held-out test split.
fclf = SVC(kernel='rbf', C=0.1)
fclf.fit(X_train2, y_train2)
test_preds = fclf.predict(X_test)

test_acc = accuracy_score(y_test, test_preds)
print(test_acc)

# Alternative: let GridSearchCV choose the parameters using 3-fold
# cross-validation on the full train split from Exercise 2.
params = {"C": [0.1, 1.], 'kernel': ['rbf', 'linear']}

gclf = GridSearchCV(SVC(), params, cv=3)

gclf.fit(X_train, y_train)

new_preds = gclf.predict(X_test)

test_acc = accuracy_score(y_test, new_preds)
print(test_acc)

C: 0.1 kern: rbf acc: 0.9166666666666666
C: 0.1 kern: linear acc: 0.9166666666666666
C: 1.0 kern: rbf acc: 0.8333333333333334
C: 1.0 kern: linear acc: 0.8333333333333334
1.0
1.0
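
The cell above reports accuracy, while the exercise asks for F1 on all three splits. One way this might be done is sketched below. It assumes the bundled `load_iris` data in place of iris.csv and macro averaging for `f1_score` (the exercise does not say which average to use); the `f_`-prefixed names, `best_params`, and `f1_by_split` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for iris.csv.
f_X, f_y = load_iris(return_X_y=True)
f_X_train, f_X_test, f_y_train, f_y_test = train_test_split(
    f_X, f_y, test_size=0.2, random_state=42)
f_X_train2, f_X_val, f_y_train2, f_y_val = train_test_split(
    f_X_train, f_y_train, test_size=0.1, random_state=42)

# Pick the (C, kernel) pair with the best macro F1 on the validation split only.
best_score, best_params = -1.0, None
for c in [0.1, 1.]:
    for k in ['rbf', 'linear']:
        candidate = SVC(kernel=k, C=c).fit(f_X_train2, f_y_train2)
        score = f1_score(f_y_val, candidate.predict(f_X_val), average='macro')
        if score > best_score:
            best_score, best_params = score, {'C': c, 'kernel': k}

# Refit the chosen configuration and report F1 on every split.
final = SVC(**best_params).fit(f_X_train2, f_y_train2)
f1_by_split = {}
for name, (Xs, ys) in {'train': (f_X_train2, f_y_train2),
                       'val': (f_X_val, f_y_val),
                       'test': (f_X_test, f_y_test)}.items():
    f1_by_split[name] = f1_score(ys, final.predict(Xs), average='macro')
    print("{} F1: {:.4f}".format(name, f1_by_split[name]))
```

The validation split here holds only 12 examples, so its score moves in large steps and can differ noticeably from the test score.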


# NOTES for F1 Score
from sklearn.metrics import f1_score

MICRO F1

TP_i = true positives for class i

FP_i = false positives for class i

FN_i = false negatives for class i

TP = sum of all TP_i

FP = sum of all FP_i

FN = sum of all FN_i

precision = TP / (TP + FP)

recall = TP / (TP + FN)

F1 = (2 * precision * recall) / (precision + recall)



MACRO F1

TP_i = true positives for class i

FP_i = false positives for class i

FN_i = false negatives for class i

precision_i = TP_i / (TP_i + FP_i)

recall_i = TP_i / (TP_i + FN_i)

F1_i = (2 * precision_i * recall_i) / (precision_i + recall_i)

Macro F1 = $\frac{1}{C}\sum_{i=1}^{C} F1_i$, where C is the number of classes.
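
To make the difference between the two averages concrete, here is a small pure-Python sketch that follows the formulas in these notes. The labels in `y_true` and `y_pred` are invented for illustration, the helper names are hypothetical, and returning 0.0 for an undefined score is an assumed convention.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (TP_i, FP_i, FN_i)}, treating each class as positive in turn."""
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels for three classes: 8 examples, 6 predicted correctly.
y_true = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']
counts = per_class_counts(y_true, y_pred, ['a', 'b', 'c'])

# Micro: sum TP, FP, FN over the classes first, then compute a single F1.
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts.values())))
# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

print(micro_f1)  # 0.75
print(macro_f1)  # about 0.719
```

In this example the per-class F1 values are roughly 0.857, 0.5, and 0.8, so the weaker class "b" pulls the macro average below the micro average.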

## Group Project

For this exercise, you will code your own k-nearest-neighbor method. Assume "k" is equal to 1. You can use Euclidean distance to find similar examples.

Euclidean distance is defined as

$ EDist = \sqrt{(x_0 - v_0)^2 + (x_1 - v_1)^2 + \dots + (x_{D-1} - v_{D-1})^2} $

where D is the dimension size (number of elements) for the vectors.
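
As a quick worked example of the formula, the sketch below computes the distance between two four-dimensional vectors term by term. The values are in the same format as rows of the iris data, and the variable names are chosen for this example.

```python
import math

x = [5.1, 3.5, 1.4, 0.2]
v = [4.9, 3.0, 1.4, 0.2]

# One squared difference per dimension: approximately 0.04, 0.25, 0.0, 0.0
squared = [(xi - vi) ** 2 for xi, vi in zip(x, v)]
dist = math.sqrt(sum(squared))
print(round(dist, 4))  # 0.5385
```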

**Hint:** The **easiest** way to complete this exercise is with for loops. The **fastest** way to complete this exercise is to use numpy cleverness. Both approaches are acceptable.

In [59]:
def edist(x1, x2):
    return np.linalg.norm(x1-x2)


def kNN(query, X, y):
    '''
        Complete this function to return the class of the closest
        example (row) in X to the vector "query" based on Euclidean
        distance.
        :param vector query: A numpy vector (will be one row from the test set from Exercise 2)
        :param matrix X: A numpy matrix (will be the training dataset from Exercise 2)
        :param vector y: The class labels for the rows of X
        :return: The class (element of y) corresponding to the closest item in X to query.
    '''
    min_dist = None
    pred = None
    for xi,yi in zip(X,y):
        if min_dist is None:
            min_dist = edist(query, xi)
            pred = yi
        else:
            dist = edist(query, xi)
            if dist < min_dist:
                min_dist = dist
                pred = yi
    return pred

In [60]:
preds = []
for test_x1 in X_test:
    new_pred = kNN(test_x1, X_train, y_train)
    preds.append(new_pred)
print(accuracy_score(y_test, preds))

1.0
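
The "numpy cleverness" mentioned in the hint could take the form of broadcasting, which computes every query-to-example distance at once. This is only a sketch of one possible approach: the function name `nearest_neighbor_predict` and the toy arrays are invented for illustration.

```python
import numpy as np


def nearest_neighbor_predict(queries, X, y):
    """Return the label of the closest row of X (Euclidean distance) for each query row."""
    queries = np.asarray(queries, dtype=float)
    X = np.asarray(X, dtype=float)
    # Broadcasting: (n_queries, 1, D) - (1, n_train, D) -> (n_queries, n_train, D)
    diffs = queries[:, np.newaxis, :] - X[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # argmin returns the first index on ties, like the strict "<" in a loop version.
    return np.asarray(y)[dists.argmin(axis=1)]


toy_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_y = np.array(['a', 'b', 'c'])
toy_queries = np.array([[0.1, 0.2], [4.0, 4.5], [0.9, 0.8]])
toy_preds = nearest_neighbor_predict(toy_queries, toy_X, toy_y)
print(toy_preds)  # ['a' 'c' 'b']
```

The intermediate `diffs` array holds n_queries × n_train × D values, so this trades memory for speed. That is fine for a dataset the size of iris but may need batching on larger data.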


## Exercise 5

The tab (\t) separated file "sentiment-twitter-data.tsv" contains tweets annotated for sentiment. Load the data then do the following:

- Split the dataset into a train/test split.
- Create a bag-of-words feature representation for the tweets using the CountVectorizer.
- Use grid search (CV) on the train split to find the best C parameters for a LinearSVC classifier. Only test 2 C values to reduce overhead (0.1 and 1.). Also, use a 2-fold CV, i.e., cv=2.
- Report (print) the accuracy of the final classifier on the test data and train data.
- How many features were created with the bag-of-words representation?

file path: ../data/sentiment-twitter-data.tsv

In [2]:
# This is a tab-separated file, so with csv reader use delimiter="\t"
with open('../data/sentiment-twitter-data.tsv') as in_file:
    count = 0
    for row in in_file:
        print(row.strip())
        count += 1
        if count == 10:
            break

264183816548130816	15140428	positive	Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)
264249301910310912	18516728	negative	Iranian general says Israel's Iron Dome can't deal with their missiles (keep talking like that and we may end up finding out)
264105751826538497	147088367	positive	with J Davlar 11th. Main rivals are team Poland. Hopefully we an make it a successful end to a tough week of training tomorrow.
264094586689953794	332474633	negative	Talking about ACT's &amp;&amp; SAT's, deciding where I want to go to college, applying to colleges and everything about college stresses me out.
254941790757601280	557103111	negative	They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner. @S4NYC @RasmussenPoll
264169034155696130	382403760	neutral	Im bringing the monster load of candy tomorrow, I just hope it doesn't get all squiched
263192091700654080	344222239	objective-OR-neutral	Apple software, retail chiefs out in o

In [40]:
import csv

import numpy as np
np.random.seed(42)
import random
random.seed(42)

X_txt = []
y = []
with open('../data/sentiment-twitter-data.tsv') as in_file:
    iCSV = csv.reader(in_file, delimiter='\t')
    for row in iCSV:
        X_txt.append(row[3])
        y.append(row[2])
        
X_train_txt, X_test_txt, y_train, y_test = train_test_split(X_txt, y,
                                                            test_size=0.2,
                                                            random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

vec.fit(X_train_txt)

X_train = vec.transform(X_train_txt)
X_test = vec.transform(X_test_txt)

# Shape is (number of training tweets, number of bag-of-words features)
print(X_train.shape)

from sklearn.svm import LinearSVC

params = {'C': [0.1, 1.]}

clf = GridSearchCV(LinearSVC(random_state=42), params, cv=10, scoring='f1_micro')

clf.fit(X_train, y_train)


(6401, 18118)


GridSearchCV(cv=10, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0),
       fit_params=None, iid=True, n_jobs=1, param_grid={'C': [0.1, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_micro', verbose=0)

In [41]:
clf.best_score_

0.452585533510389

In [42]:
preds = clf.predict(X_test)

print(accuracy_score(y_test, preds))

0.46595877576514677
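
The number of bag-of-words features is the second entry of the shape printed earlier, `(6401, 18118)`, i.e. 18118 features. The toy sketch below shows where that number comes from: each distinct token seen during fitting becomes one column. The three sentences and the `toy_`-prefixed names are invented for illustration, and the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example sentences, not taken from the tweet dataset.
toy_tweets = [
    "I love this phone",
    "I hate this phone",
    "this movie is great great great",
]
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_tweets)

# With default settings, text is lowercased and one-character tokens
# such as "I" are dropped, leaving 7 distinct tokens.
vocab = sorted(toy_vec.vocabulary_)
print(vocab)             # ['great', 'hate', 'is', 'love', 'movie', 'phone', 'this']
print(toy_counts.shape)  # (3, 7)
```

Each cell holds a token count, so the "great" column of the third row is 3.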
