# Homework 4: SVM


This assignment is due on Moodle by **11:59pm on Friday, November 2**.
Your solutions to theoretical questions should be done in Markdown/MathJax directly below the associated question.
Your solutions to computational questions should include any specified Python code and results 
as well as written commentary on your conclusions.
Remember that you are encouraged to discuss the problems with your instructors and classmates, 
but **you must write all code and solutions on your own**. For a refresher on the course **Collaboration Policy** click [here](https://github.com/BoulderDS/CSCI-4622-Machine-Learning-18fa/blob/master/info/syllabus.md#collaboration-policy).

**NOTES**: 

- Do **NOT** load or use any Python packages that are not available in Anaconda 3.6. 
- Some problems with code may be autograded.  If we provide a function API **do not** change it.  If we do not provide a function API then you're free to structure your code however you like. 
- Submit only this Jupyter notebook to Moodle.  Do not compress it using tar, rar, zip, etc. 


Name: 

In [None]:
import math
import pickle
import gzip
import numpy as np
import pandas
import matplotlib.pyplot as plt
%matplotlib inline

In this homework you'll explore the primal and dual representations of support vector machines
and compare the performance of various kernels on a sentiment classification task.

[50 Points] Problem 1 - Basic concepts of SVM
---

**PART A**: 
* What are the main differences between the primal and the dual representations?
* For each of $\xi$, $C$, $\alpha$, and $\beta$, describe its role. If it has a special value, state that value and explain what it means.

YOUR ANSWER HERE

**PART B**: 

 * Given a weight vector, implement the *find_support* function that returns the indices of the support vectors.
 * Given a weight vector, implement the *find_slack* function that returns the indices of the vectors with nonzero slack.
 * Given the alpha dual vector, implement the *weight_vector* function that returns the corresponding weight vector.
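
As a warm-up for the functions in this part, here is a minimal sketch of how the data arrays can be split into features and labels and how the quantity $y_i(w \cdot x_i + b)$ can be computed for every point at once. The hyperplane `w`, `b` used here is a hypothetical choice for illustration only; it is not claimed to be the solution the tests expect.

```python
import numpy as np

# Same points as kSEP: columns are (x1, x2, label).
data = np.array([(-2, 2, +1), (0, 4, +1), (2, 1, +1),
                 (-2, -3, -1), (0, -1, -1), (2, -3, -1)])

# Split into a feature matrix and a label vector.
x, y = data[:, :2], data[:, 2]

# Hypothetical hyperplane, chosen only for illustration.
w = np.array([0.0, 1.0])
b = 0.0

# y_i * (w . x_i + b) for every example, computed in one vectorized step.
margins = y * (x @ w + b)
print(margins)
```

How these values relate to support vectors and slack is what `find_support` and `find_slack` ask you to work out.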

In [None]:
import numpy as np

kINSP = np.array([(1, 8, +1),
               (7, 2, -1),
               (6, -1, -1),
               (-5, 0, +1),
               (-5, 1, -1),
               (-5, 2, +1),
               (6, 3, +1),
               (6, 1, -1),
               (5, 2, -1)])

kSEP = np.array([(-2, 2, +1),    # 0 - A
              (0, 4, +1),     # 1 - B
              (2, 1, +1),     # 2 - C
              (-2, -3, -1),   # 3 - D
              (0, -1, -1),    # 4 - E
              (2, -3, -1),    # 5 - F
              ])


def weight_vector(x, y, alpha):
    """
    Given a vector of alphas, compute the primal weight vector w.
    The vector w should be returned as a NumPy array.
    """

    w = np.zeros(len(x[0]))
    # YOUR CODE HERE
    raise NotImplementedError()
    return w



def find_support(x, y, w, b, tolerance=0.001):
    """
    Given a set of training examples and primal weights, return the indices
    of all of the support vectors as a set.
    """

    support = set()
    # YOUR CODE HERE
    raise NotImplementedError()
    return support



def find_slack(x, y, w, b):
    """
    Given a set of training examples and primal weights, return the indices
    of all examples with nonzero slack as a set.
    """

    slack = set()
    # YOUR CODE HERE
    raise NotImplementedError()
    return slack

In [None]:
%run -i tests/tests.py

**PART C**

The goal of this problem is to correctly classify test data points, given a training data set.
For this problem, assume that we are training an SVM with a quadratic kernel, that is, a polynomial kernel of degree 2. You are given the data set presented in Figure 1. The slack penalty $C$ will determine the location of the decision boundary.
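
For reference, one common way to write a degree-2 polynomial kernel is given below. The scale $\gamma$ and offset $r$ are notation assumed here (they follow scikit-learn's parameterization, where $r$ is called `coef0`); the problem itself does not fix their values.

$$
K(\mathbf{x}, \mathbf{z}) = \left(\gamma \, \mathbf{x}^\top \mathbf{z} + r\right)^2
$$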

Answer each question below, justifying your answer in a sentence or by drawing the decision boundary.

![training_data](./data/data.png)

* Where would the decision boundary be for very large values of $C$?
* Where would you expect the decision boundary to be if $C = 0$?
* Which of the two cases above would you expect to generalize better on test data? Why?

YOUR ANSWER HERE

[50 Points] Problem 2 - SVM with Sklearn
---

In this problem, you are going to get familiar with important practical functions in scikit-learn such as pipeline, grid search, and cross validation. You will experiment with these using support vector machines.

Note that grid search can take some time on your laptop, so make sure that your code is correct with a small subset of the training data and search a reasonable number of options.

* Use the Sklearn implementation of support vector machines to train a classifier that distinguishes positive from negative sentiments.
* Experiment with linear, polynomial, and RBF kernels. In each case, perform a grid search to help determine optimal hyperparameters for the given model (e.g. $C$ for the linear kernel, $C$ and degree $p$ for the polynomial kernel, and $C$ and $\gamma$ for the RBF kernel). Comment on the experiments you ran and the optimal hyperparameters you found.
Hint: http://scikit-learn.org/stable/modules/grid_search.html
* Comment on classification performance for each model for optimal parameters by testing on a hold-out set.
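
The workflow described above (pipeline, grid search, cross validation) can be sketched on a toy corpus as follows. The texts, labels, fold count, and grid values are made up for illustration and are assumptions, not recommended settings for the homework data. The parameter names use the `svc__` prefix because `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data, invented for this sketch (1 = positive, 0 = negative).
toy_texts = ["good movie", "great film", "loved it", "really good",
             "bad movie", "awful film", "hated it", "really bad"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer followed by an SVM; steps are named "countvectorizer" and "svc".
toy_pipeline = make_pipeline(CountVectorizer(), SVC())

# One sub-grid per kernel, so each kernel is only paired with the
# hyperparameters that it actually uses.
toy_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1], "svc__degree": [2, 3],
     "svc__gamma": [0.1, 1]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1], "svc__gamma": [0.01, 0.1]},
]

# Stratified folds keep the class ratio the same in every split.
toy_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=toy_cv, scoring="accuracy")
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

Passing a list of dictionaries rather than a single dictionary avoids fitting redundant combinations, such as a linear kernel with several values of `degree`, which helps keep the grid search time manageable.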

The following dataset contains reviews and their associated sentiments.

Create an SVM classifier to predict whether the sentiment of each review is positive or negative.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
reviews  = pd.read_csv('./data/reviews.csv')
train, test = train_test_split(reviews, test_size=0.2, random_state=4622)
X_train = train['reviews'].values
X_test = test['reviews'].values
y_train = train['sentiment']
y_test = test['sentiment']

In [None]:
len(X_train), sum(y_train), len(X_test), sum(y_test)

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, roc_auc_score, recall_score, precision_score

**PART A**

Use CountVectorizer to convert each review into a vector of term frequencies.
Define the cross-validation split using StratifiedKFold.
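
To make the term-frequency representation concrete, here is a small sketch that applies `CountVectorizer` with its default settings to two made-up sentences. The sentences and variable names are assumptions for illustration; the homework vectorizer should use the tokenizer and parameters described in the cell below.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents; "bad" appears twice in the second one.
docs = ["the movie was good", "the movie was bad bad"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix, one row per document

# Recover the vocabulary in column order from the term -> column mapping.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocab)
print(counts.toarray())
```

Each column corresponds to one vocabulary term and each entry counts how often that term occurs in the document, so the second row contains a 2 in the column for "bad".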

In [None]:
def tokenize(text): 
    tknzr = TweetTokenizer()
    return tknzr.tokenize(text)

en_stopwords = set(stopwords.words("english")) 

# CREATE CountVectorizer using sklearn.feature_extraction.text.CountVectorizer
# Hint: use the above tokenize function
# Hint: play with different parameters, in particular, min_df can help with generalizability
# YOUR CODE HERE
raise NotImplementedError()

# Split the dataset into 5 folds using sklearn.model_selection.StratifiedKFold.
# YOUR CODE HERE
raise NotImplementedError()

**PART B**
* Create a pipeline with the CountVectorizer and an SVM classifier.
* Define the grid search parameters.
* Create a GridSearchCV object with the pipeline and fit it to the data.
* Compute the accuracy of the best estimator found by GridSearchCV.

In [None]:
# DEFINE GRID SEARCH PARAMETERS kernel, C, degree, gamma respectively
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
np.random.seed(1234)
# Define pipeline using make_pipeline with vectorizer and SVM Classifier
# YOUR CODE HERE
raise NotImplementedError()

# Create GridSearchCV with pipeline and grid search parameters, scoring as accuracy.
# for example grid_svm = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")

# YOUR CODE HERE
raise NotImplementedError()

# For debugging purposes, it makes sense to use a smaller subset of the training set to speed up the grid search
grid_svm.fit(X_train, y_train)

In [None]:
print("best params:")
print(grid_svm.best_params_)

print("best cv score:")
print(grid_svm.best_score_)

In [None]:
def report_results(model, X, y):
    pred = model.predict(X)        
    acc = accuracy_score(y, pred)
    f1 = f1_score(y, pred)
    prec = precision_score(y, pred)
    rec = recall_score(y, pred)
    result = {'f1': f1, 'acc': acc, 'precision': prec, 'recall': rec}
    return result

In [None]:
report_results(grid_svm.best_estimator_, X_test, y_test)

**PART C**

Explain the overall procedure and report the final result that you obtained, including which kernel and hyperparameters were chosen.

YOUR ANSWER HERE