# Tutorial 4 - Fundamental ML Algorithms Part II

In this tutorial, we'll explore the **support vector machine** and ensemble methods such as **random forests** and **gradient boosted trees**. We'll work with the Yelp dataset, which consists of variable-length text reviews and their corresponding star ratings from 1 to 5. The task is basic sentiment analysis, phrased as a classification problem: predict the star rating that a new Yelp review would receive.

### 1. Preprocessing / Preparing the Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

from scipy import sparse
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score, f1_score
from collections import Counter

In [None]:
# Specify paths to read and write data from
dataPath = './dataset/'

# Regex matching any character that is neither a word character nor whitespace (used to strip punctuation)
regex = re.compile(r'[^\w\s]')

FEATURES = 10000 # we will use the 10 000 most common words in the training set as features
data_types = ['train', 'valid', 'test']

We first define some helper functions for preprocessing the Yelp dataset. Recall that machine learning models operate on numbers, so we must encode the reviews and ratings numerically. We use the following strategy.

In [None]:
''' Preprocessing function: lowercases and removes punctuation. Returns a list of cleaned review strings and a list of integer ratings '''
def preprocess(filePath):
    with open(filePath, 'r', encoding="utf-8") as f:
        lines = f.readlines()
        
        reviews, ratings = [], []
        for l in lines:
            splitted = l.split('\t')
            ratings.append(int(splitted[1].strip()))
            reviews.append(regex.sub('', splitted[0].strip()).lower())

    return reviews, ratings
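
To see what the cleaning step does, here is a minimal sketch that applies the same regex and lowercasing to a single made-up line in the `review<TAB>rating` layout that `preprocess` assumes. The sample text and the helper name `clean_line` are illustrative only; they are not part of the dataset or of the tutorial code.

```python
import re

# Same pattern as above: drop every character that is neither a word character nor whitespace
regex = re.compile(r'[^\w\s]')

def clean_line(line):
    """Split one tab-separated line into (cleaned review text, integer rating)."""
    text, rating = line.split('\t')
    return regex.sub('', text.strip()).lower(), int(rating.strip())

# Made-up example line, not taken from the Yelp data
sample = "Great food, AMAZING service!!!\t5\n"
review, rating = clean_line(sample)
print(review)  # great food amazing service
print(rating)  # 5
```

Note that removing punctuation also merges contractions: "don't" becomes "dont".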

In [None]:
# We have 3 different files for train, validation and test sets, respectively
# Inspect .txt files
# Preprocess these .txt files and separate into features, labels for supervised classification problem

### FILL IN ###
### FILL IN ###
### FILL IN ###

In [None]:
# Inspect a sample review from the training data

### FILL IN ###

In [None]:
# Inspect corresponding rating, do this for a few samples

### FILL IN ###

In [None]:
# Convert to dictionaries: yelp_text, yelp_ratings

yelp_text = {'train': X_train, 'valid': X_val, 'test': X_test}
### FILL IN ###

In [None]:
''' Function takes in a list of review strings and returns the num most frequent words, along with their (word, count) pairs '''
def top_n_words(linesOfLines, num):
    count = Counter([word for line in linesOfLines for word in line.split()]).most_common(num)
    top_features = [word[0] for word in count]

    return top_features, count
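
As a quick check of what `top_n_words` returns, the sketch below runs the same `Counter(...).most_common(num)` logic on three made-up reviews. The toy reviews are assumptions for illustration; on the real data you would pass the training reviews.

```python
from collections import Counter

def top_n_words(lines, num):
    """Return the num most frequent whitespace-separated words and their (word, count) pairs."""
    count = Counter(word for line in lines for word in line.split()).most_common(num)
    return [word for word, _ in count], count

# Made-up, already-cleaned reviews
toy_reviews = ["the food was good", "the service was slow", "the food was cold"]
vocab, counts = top_n_words(toy_reviews, 3)
print(vocab)   # ['the', 'was', 'food']
print(counts)  # [('the', 3), ('was', 3), ('food', 2)]
```

Words with equal counts keep the order in which they were first encountered, which is why `'the'` precedes `'was'` here.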

In [None]:
# Return top 10 000 features for dataset: yelp_vocab, yelp_count

### FILL IN ###

In [None]:
# Inspect some of the most common words

### FILL IN ###

In [None]:
''' Function that converts preprocessed text to binary and frequency bag-of-words representations, each paired with the corresponding ratings '''
def convert_bow(text, ratings):
    binary = {}
    freq = {}

    ### FILL IN ### Code for frequency vectorizer
    ### FILL IN ### Code for binary vectorizer
    
    # data_types referenced globally, bad practice
    for dtype in data_types:
        # v_bin is stored as a sparse matrix, v_freq as a dense numpy array - note the difference in computation time between the two
        v_freq = np.array(normalize(vectorizer.fit_transform(text[dtype]).todense()))
        # fit_transform already returns a sparse matrix, so no dense round trip is needed
        v_bin = sparse.csr_matrix(vectorizer_bin.fit_transform(text[dtype]))
        # Appends transformed bag-of-words representation, and corresponding ratings
        
        freq[dtype] = [v_freq, ratings[dtype]]
        ### FILL IN ### do the same for binary bag-of-words

    # return bin, freq
    return binary, freq
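
One possible way to build the two representations is shown below on toy data, as a sketch rather than the intended solution. It assumes `CountVectorizer` is given a fixed `vocabulary`, with `binary=True` for the presence/absence variant, and that the counts are rescaled with `normalize`. The toy reviews and vocabulary are made up. Note that `normalize` uses the L2 norm by default; passing `norm='l1'` would instead make each row sum to 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Made-up vocabulary and reviews for illustration
toy_vocab = ['food', 'good', 'bad']
toy_reviews = ['good food good', 'bad food']

# Fixed vocabulary: column j always corresponds to toy_vocab[j]
vec_count = CountVectorizer(vocabulary=toy_vocab)
vec_bin = CountVectorizer(vocabulary=toy_vocab, binary=True)

counts = vec_count.fit_transform(toy_reviews)  # sparse matrix of raw counts
binary = vec_bin.fit_transform(toy_reviews)    # sparse matrix of 0/1 indicators
freq = normalize(counts)                       # rows rescaled to unit L2 norm (the default)

print(counts.toarray())  # rows: [1, 2, 0] and [1, 0, 1]
print(binary.toarray())  # rows: [1, 1, 0] and [1, 0, 1]
print(np.round(freq.toarray(), 3))
```

Because the vocabulary is fixed, the train, validation and test matrices all share the same columns, which the classifiers below rely on.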

In [None]:
# Convert text to binary and frequency bag-of-words representation

### FILL IN ###

In [None]:
# Inspect values of binary bag-of-words representation
# Inspect sparse matrix shapes

### FILL IN ###

In [None]:
# Inspect corresponding ratings

### FILL IN ###

In [None]:
# Inspect frequency bag-of-words sample values

### FILL IN ###

In [None]:
# Compare with original text from same sample

### FILL IN ###

Now that we have our bag-of-words representations, we attempt to classify the data using the algorithms seen today: SVMs, random forests, and gradient boosted trees.

#### Define function to train and evaluate classifiers.

In [None]:
''' Function to train and evaluate a classifier; returns the accuracies on the different sets and the best parameters '''
def train_clf(dataset, clf, params):

    # Hyperparameter tuning
    if params is not None:
        clf = tune_hyper_params(clf, dataset, params)
        # Concatenate training, validation sets - use validation set to tune hyperparameters
        X_train_val = sparse.vstack([dataset['train'][0], dataset['valid'][0]])
        y_train_val = np.concatenate((dataset['train'][1], dataset['valid'][1]))

        clf.fit(X_train_val, y_train_val)
        
    # If no hyperparameter tuning, fit on training data
    else:
        clf.fit(dataset['train'][0], dataset['train'][1])

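    # Note: when tuning, the refit model has also been trained on the validation set,
    # so the validation accuracy reported below is optimistic in that case
    acc_train = accuracy_score(dataset['train'][1], clf.predict(dataset['train'][0]))
    acc_val = accuracy_score(dataset['valid'][1], clf.predict(dataset['valid'][0]))
    acc_test = accuracy_score(dataset['test'][1], clf.predict(dataset['test'][0]))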

    acc = {'Train Accuracy': acc_train, 'Validation Accuracy': acc_val, 'Test Accuracy': acc_test}
    
    best_param = None if params is None else clf.best_params_

    return acc, best_param

Now, we can evaluate different classifiers on both the binary and frequency bag-of-words representations of the Yelp dataset.

### 2. Support Vector Machines

In [None]:
# Linear SVM
# Candidate values 100, 200, ..., 1000 (range(10) would also include a useless max_iter of 0)
params = [{'max_iter': [100 * i for i in range(1, 11)]}]

In [None]:
# Test on binary bag-of-words representation
acc_SVM, best_params = ### FILL IN ###
print('Linear SVM')
print(acc_SVM)
print('Best params - {}'.format(best_params))

In [None]:
# Test on freq bag-of-words representation
acc_SVM, best_params = ### FILL IN ###
print('Linear SVM')
print(acc_SVM)
print('Best params - {}'.format(best_params))

### 3. Ensemble Methods

#### 3.1 Random Forests

In [None]:
# Test on binary bag-of-words representation
acc_rand_forest, best_params = ### FILL IN ###
print('Random Forest')
print(acc_rand_forest)
print('Best params - {}'.format(best_params))

In [None]:
# Test on freq bag-of-words representation
acc_rand_forest, best_params = ### FILL IN ###
print('Random Forest')
print(acc_rand_forest)
print('Best params - {}'.format(best_params))

#### 3.2 Gradient Boosted Trees

In [None]:
# Test on binary bag-of-words representation
acc_grad_boosted, best_params = ### FILL IN ###
print('Gradient Boosted Trees')
print(acc_grad_boosted)
print('Best params - {}'.format(best_params))

In [None]:
# Test on freq bag-of-words representation
acc_grad_boosted, best_params = ### FILL IN ###
print('Gradient Boosted Trees')
print(acc_grad_boosted)
print('Best params - {}'.format(best_params))

### BONUS - Hyperparameter Tuning

The accuracy we obtained is not great. What could we have done differently? First, we did not tune hyperparameters (except briefly for the SVM), so below is the function used for hyperparameter tuning on the validation set. Note that `train_clf` calls `tune_hyper_params` whenever `params` is not `None`, so this cell must be run before any such call. Explore the sklearn documentation and try to beat the accuracy scores reached so far. Anything else? Inspect the most common words: what are they? Are they all useful?

In [None]:
# Function that wraps the given classifier in a grid search whose hyperparameters are tuned on the validation set
def tune_hyper_params(classifier, dataset, parameters):
    # -1 keeps a sample in the training part of the split, 0 assigns it to the validation fold
    n_train = dataset['train'][0].shape[0]
    n_valid = dataset['valid'][0].shape[0]
    ps = PredefinedSplit(test_fold=[-1] * n_train + [0] * n_valid)
    classifier = GridSearchCV(classifier, parameters, cv=ps, refit=True)

    return classifier
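
To make the role of `PredefinedSplit` concrete, the following sketch (with made-up set sizes) shows how `test_fold` marks each row of the stacked train + validation matrix: `-1` keeps a row in the training part of every split, while `0` assigns it to the held-out part of split 0. The result is a single split, so `GridSearchCV` scores every candidate on the validation rows only.

```python
from sklearn.model_selection import PredefinedSplit

# Made-up sizes: 3 training rows followed by 2 validation rows
n_train, n_valid = 3, 2
test_fold = [-1] * n_train + [0] * n_valid

ps = PredefinedSplit(test_fold=test_fold)
n_splits = ps.get_n_splits()
train_idx, valid_idx = next(ps.split())

print(n_splits)   # 1
print(train_idx)  # [0 1 2]
print(valid_idx)  # [3 4]
```

With `refit=True`, the best candidate is afterwards refit on all rows passed to `fit`, that is, on the training and validation data together.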