# Tutorial 4 - Fundamental ML Algorithms Part II

In this tutorial, we'll explore the **support vector machine** and ensemble methods such as **random forests** and **gradient boosted trees**. To do so, we'll work with the Yelp dataset, which consists of variable-length text reviews and their corresponding star ratings from 1 to 5. The task is basic sentiment analysis, phrased as a classification problem: predict the star rating that a new Yelp review would receive.

### 1. Preprocessing / Preparing the Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

from scipy import sparse
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score, f1_score
from collections import Counter

In [2]:
# Specify paths to read and write data from
dataPath = './dataset/'

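# Regex matching any character that is neither a word character nor whitespace (i.e. punctuation)
# Raw string so that \w and \s are not treated as (invalid) string escapes
regex = re.compile(r'[^\w\s]')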

FEATURES = 10000 # we will be inspecting 10 000 most common words in the training set
data_types = ['train', 'valid', 'test']

Next, we define some helper functions to preprocess the Yelp dataset. Recall that machine learning models can only directly interpret numbers, so we must encode the reviews and ratings numerically. We use the following strategy.

In [3]:
''' Preprocessing function: lowercase, remove punctuation. Returns a list of cleaned review strings and a list of integer ratings '''
def preprocess(filePath):
    with open(filePath, 'r', encoding="utf-8") as f:
        lines = f.readlines()
        
        reviews, ratings = [], []
        for l in lines:
            splitted = l.split('\t')
            ratings.append(int(splitted[1].strip()))
            reviews.append(regex.sub('', splitted[0].strip()).lower())

    return reviews, ratings
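To make the cleaning step concrete, here is a minimal sketch that applies the same idea to a single line. It assumes the `review<TAB>rating` layout implied by `preprocess`; the sample line and the helper name `clean_line` are made up for illustration and are not part of the dataset or the tutorial code.

```python
import re

# Same pattern as the tutorial's `regex`, written as a raw string
punct = re.compile(r'[^\w\s]')

def clean_line(line):
    """Split one 'review<TAB>rating' line and normalise the review text (hypothetical helper)."""
    text, rating = line.split('\t')
    return punct.sub('', text.strip()).lower(), int(rating.strip())

# Made-up sample line, not taken from the Yelp files
sample = "Great food, but it's pricey!\t4\n"
review, rating = clean_line(sample)
print(review, rating)
```

Note that removing the apostrophe turns "it's" into "its", which is why tokens such as `dont` and `ive` show up in the cleaned reviews.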

In [4]:
# We have 3 different files for train, validation and test sets, respectively
# Inspect .txt files
# Preprocess these .txt files and separate into features, labels for supervised classification problem
X_train, y_train = preprocess(dataPath + 'yelp-train.txt')
X_val, y_val = preprocess(dataPath + 'yelp-valid.txt')
X_test, y_test = preprocess(dataPath + 'yelp-test.txt')

In [5]:
# Inspect sample of the training data review
X_train[1]

'best nights to go to postinos are mondays and tuesdays  they offer 20 deals where you get 4 slices of bruschetta out of the 12 that they offer and one whole bottle of the house wine  each bruschetta slice is probably the size of maybe your hand with your fingers outstretched  if youre a petite girl     they then cut each slice into 4s  perfect for sharingwe went on a monday night after 8pm and ordered 2 bottles of wine the 2 orders of the bruschetta which were 8 slices for those that cant count and a bowl of olives  the total came out to about 50thats a little over 13person wo tip  awesome    plus they have complementary valet parking everything was just fantastic ive never had fig before and they made my experience there quite memorable  i definitely recommend you come in regardless if its for their 20 deals on montues or not  theyve got great food and i will be back'

In [6]:
# Inspect corresponding rating, do this for a few samples
y_train[1]

5

In [7]:
# Convert to dictionaries
yelp_text = {'train': X_train, 'valid': X_val, 'test': X_test}
yelp_ratings = {'train': y_train, 'valid': y_val, 'test': y_test}

In [8]:
''' Function takes in a list of review strings, returns the n most frequent words and their (word, count) pairs '''
def top_n_words(linesOfLines, n):
    count = Counter([word for line in linesOfLines for word in line.split()]).most_common(n)
    top_features = [word[0] for word in count]

    return top_features, count
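As a small illustration of how the vocabulary is built, the sketch below runs the same `Counter(...).most_common(n)` idea on three toy reviews. The reviews are invented for the example.

```python
from collections import Counter

# Toy, already-cleaned reviews (illustrative only)
toy_reviews = ["good food good service", "bad food", "good place"]

# Count every whitespace-separated token across all reviews
counts = Counter(word for line in toy_reviews for word in line.split())

# Keep the 2 most frequent (word, count) pairs, then just the words
top2 = counts.most_common(2)
top2_words = [word for word, _ in top2]
print(top2, top2_words)
```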

In [9]:
# Return top 10 000 features for dataset
yelp_vocab, yelp_count = top_n_words(X_train, FEATURES)

In [10]:
# Inspect some of the most common words
yelp_vocab

['the',
 'and',
 'a',
 'i',
 'to',
 'of',
 'was',
 'is',
 'it',
 'for',
 'in',
 'that',
 'my',
 'with',
 'but',
 'you',
 'this',
 'they',
 'on',
 'have',
 'we',
 'not',
 'had',
 'are',
 'place',
 'good',
 'at',
 'so',
 'were',
 'food',
 'be',
 'as',
 'there',
 'great',
 'like',
 'if',
 'its',
 'me',
 'all',
 'just',
 'very',
 'out',
 'here',
 'one',
 'or',
 'get',
 'their',
 'from',
 'up',
 'go',
 'really',
 'when',
 'our',
 'time',
 'about',
 'some',
 'would',
 'service',
 'an',
 'your',
 'what',
 'can',
 'been',
 'which',
 'back',
 'more',
 'dont',
 'only',
 'also',
 'will',
 'by',
 'no',
 'love',
 'has',
 'little',
 'too',
 'nice',
 'im',
 'other',
 'because',
 'well',
 'always',
 'ive',
 'than',
 'them',
 'do',
 'even',
 'us',
 'best',
 'pretty',
 'got',
 'he',
 'after',
 'she',
 'much',
 'chicken',
 'try',
 'ordered',
 'restaurant',
 'menu',
 'people',
 'know',
 'think',
 'could',
 'didnt',
 'first',
 'am',
 'order',
 'make',
 'went',
 'over',
 'never',
 'staff',
 'friendly',
 'ba

In [11]:
''' Function that converts preprocessed text to binary, frequency bag-of-words representations with corresponding ratings '''
def convert_bow(text, ratings):
    binary = {}
    freq = {}

    vectorizer = CountVectorizer(vocabulary=yelp_vocab)
    vectorizer_bin = CountVectorizer(vocabulary=yelp_vocab, binary=True)
    # data_types referenced globally, bad practice
    for dtype in data_types:
        v_freq = np.array(normalize(vectorizer.fit_transform(text[dtype]).todense()))
        v_bin = sparse.csr_matrix(np.array(vectorizer_bin.fit_transform(text[dtype]).todense()))
        # Appends transformed bag-of-words representation, and corresponding ratings
        freq[dtype] = [v_freq, ratings[dtype]]
        binary[dtype] = [v_bin, ratings[dtype]]

    # return bin, freq
    return binary, freq
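The difference between the two representations is easier to see on a toy example. The sketch below uses a three-word vocabulary and two invented documents; `toy_vocab` and `toy_docs` are assumptions for illustration. Note that `normalize` defaults to `norm='l2'`, so the "frequency" rows have unit Euclidean length rather than summing to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Toy vocabulary and documents (illustrative, not the Yelp data)
toy_vocab = ['good', 'food', 'bad']
toy_docs = ['good good food', 'bad food']

# Raw counts: one column per vocabulary word, in vocabulary order
counts = CountVectorizer(vocabulary=toy_vocab).fit_transform(toy_docs)

# Binary: 1 if the word occurs at least once, 0 otherwise
binary = CountVectorizer(vocabulary=toy_vocab, binary=True).fit_transform(toy_docs)

# "Frequency": each row of counts rescaled to unit L2 norm (the default)
freq = normalize(counts)

print(counts.toarray())
print(binary.toarray())
print(freq.toarray())
```

One thing to keep in mind: `CountVectorizer`'s default `token_pattern` only keeps tokens of two or more characters, so single-letter vocabulary entries such as `'a'` or `'i'` would get all-zero columns.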

In [12]:
# Convert text to binary and frequency bag-of-words representation
yelp_bin, yelp_freq = convert_bow(yelp_text, yelp_ratings)

In [13]:
# Inspect values of binary bag-of-words representation
# Inspect sparse matrix shapes
yelp_bin['train'][0].shape

(7000, 10000)

In [14]:
# Inspect corresponding ratings
print(len(yelp_bin['train'][1]))
yelp_bin['train'][1]

7000


[5,
 5,
 5,
 3,
 2,
 2,
 3,
 1,
 3,
 5,
 5,
 5,
 4,
 4,
 1,
 2,
 2,
 3,
 5,
 2,
 5,
 4,
 4,
 5,
 5,
 5,
 5,
 3,
 4,
 5,
 4,
 4,
 5,
 4,
 4,
 5,
 2,
 4,
 3,
 4,
 4,
 4,
 4,
 4,
 3,
 5,
 5,
 5,
 2,
 4,
 5,
 4,
 5,
 5,
 2,
 5,
 5,
 5,
 1,
 5,
 4,
 3,
 3,
 4,
 4,
 4,
 5,
 4,
 4,
 4,
 2,
 5,
 5,
 4,
 1,
 4,
 4,
 2,
 4,
 4,
 3,
 2,
 3,
 4,
 4,
 5,
 3,
 4,
 5,
 5,
 5,
 1,
 5,
 2,
 4,
 4,
 5,
 5,
 4,
 1,
 3,
 5,
 4,
 5,
 2,
 5,
 1,
 4,
 5,
 5,
 4,
 5,
 4,
 4,
 5,
 2,
 5,
 4,
 1,
 1,
 1,
 3,
 3,
 2,
 4,
 4,
 4,
 2,
 1,
 5,
 1,
 2,
 4,
 5,
 1,
 4,
 5,
 3,
 3,
 5,
 5,
 1,
 1,
 5,
 5,
 4,
 5,
 4,
 2,
 5,
 5,
 1,
 4,
 4,
 1,
 4,
 3,
 2,
 5,
 5,
 4,
 2,
 5,
 5,
 1,
 3,
 5,
 4,
 4,
 4,
 5,
 3,
 3,
 1,
 2,
 1,
 2,
 2,
 2,
 5,
 2,
 5,
 5,
 5,
 5,
 4,
 5,
 5,
 5,
 4,
 4,
 5,
 4,
 5,
 3,
 3,
 5,
 2,
 4,
 3,
 3,
 4,
 4,
 3,
 5,
 1,
 4,
 1,
 4,
 5,
 4,
 5,
 4,
 5,
 4,
 4,
 4,
 5,
 4,
 2,
 4,
 3,
 4,
 4,
 5,
 3,
 4,
 1,
 3,
 2,
 4,
 4,
 2,
 4,
 5,
 5,
 3,
 4,
 2,
 5,
 2,
 4,
 1,
 3,
 2,
 4,
 5,
 5,
 3,
 3,


In [15]:
# Inspect frequency bag-of-words sample values
yelp_freq['train'][0][0]

array([ 0.50702013,  0.20280805,  0.        , ...,  0.        ,
        0.        ,  0.        ])

In [18]:
# Compare with original text from same sample
X_train[0]

'i cant believe i havent yelped about the place yet several months maybe over a year ago my husband read a newspaper article about the clover coffee maker and the one place in town that had managed to procure one i was skeptical as is my nature it cant be that much better right youre just saying its amazing because you want to talk about the new hot coffee shop you discovered right well maybe but i love this place and i dont think it has a whole lot to do with the clover they roast their own beans and they roast them way differently than that other ginormous coffee chain  all a light or medium roast never bitter never oily never yucky the coffee they make there is obviously the best but i send my husband in every week now to buy a pound of beans so that i can approximate the same coffee at home add an edgy though sometimes intimidating seating area great local art which we bought off the wall and smiley serviceim sold cant wait to try out the downtown location'

Now that we have our bag-of-words representations, we attempt to classify the data using the algorithms seen today: SVMs, random forests, and gradient-boosted trees.

#### Define function to train and evaluate classifiers.

In [16]:
''' Function to train and evaluate a classifier; returns accuracies on the different sets and the best parameters '''
def train_clf(dataset, clf, params):
    
    # Hyperparameter tuning
    if params is not None:
        clf = tune_hyper_params(clf, dataset, params)
        # Stack training and validation sets; the PredefinedSplit inside the grid search
        # uses the validation rows to score each hyperparameter setting
        X_train_val = sparse.vstack([dataset['train'][0], dataset['valid'][0]])
        y_train_val = np.concatenate((dataset['train'][1], dataset['valid'][1]))

        clf.fit(X_train_val, y_train_val)
        
    # If no hyperparameter tuning, fit on training data
    else:
        clf.fit(dataset['train'][0], dataset['train'][1])

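    # Note: when params is given, clf was refit on train + validation above,
    # so the validation accuracy below is not a held-out estimate
    acc_train = accuracy_score(dataset['train'][1], clf.predict(dataset['train'][0]))
    acc_val = accuracy_score(dataset['valid'][1], clf.predict(dataset['valid'][0]))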
    acc_test = accuracy_score(dataset['test'][1], clf.predict(dataset['test'][0]))

    acc = {'Train Accuracy': acc_train, 'Validation Accuracy': acc_val, 'Test Accuracy': acc_test}
    best_param = None if params is None else clf.best_params_

    return acc, best_param

Now, we can evaluate different classifiers on both the binary and frequency bag-of-words representations of the Yelp dataset.

### 2. Support Vector Machines

In [19]:
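# Linear SVM: grid over the maximum number of solver iterations
# (start at 1 so that the invalid value max_iter=0 is not included)
params = [{'max_iter': [100 * i for i in range(1, 10)]}]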

In [20]:
# Test on binary bag-of-words representation
acc_SVM, best_params = train_clf(yelp_bin, LinearSVC(), params)
print('Linear SVM')
print(acc_SVM)
print('Best params - {}'.format(best_params))

Linear SVM
{'Train Accuracy': 0.996, 'Validation Accuracy': 0.996, 'Test Accuracy': 0.44750000000000001}
Best params - {'max_iter': 200}


In [21]:
# Test on freq bag-of-words representation
acc_SVM, best_params = train_clf(yelp_freq, LinearSVC(), params)
print('Linear SVM')
print(acc_SVM)
print('Best params - {}'.format(best_params))

Linear SVM
{'Train Accuracy': 0.80671428571428572, 'Validation Accuracy': 0.81000000000000005, 'Test Accuracy': 0.52000000000000002}
Best params - {'max_iter': 100}
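The grid used for the SVM only varies `max_iter`. As a hedged sketch of what else could be tuned, the block below searches over the regularisation strength `C` of `LinearSVC` with a fixed train/validation split. It runs on synthetic data from `make_classification` as a stand-in for the bag-of-words matrices; the data, the split sizes and the `C` values are illustrative assumptions, not results or settings from this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import LinearSVC

# Synthetic stand-in for the bag-of-words matrices (not the Yelp data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# First 200 rows are always training rows (-1); the rest form one validation fold (0)
n_train = 200
test_fold = [-1] * n_train + [0] * (len(y) - n_train)

# Candidate regularisation strengths (illustrative values)
grid = {'C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LinearSVC(max_iter=5000), grid, cv=PredefinedSplit(test_fold))
search.fit(X, y)
print(search.best_params_)
```

Smaller values of `C` mean stronger regularisation, which may help when training accuracy is far above test accuracy.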


### 3. Ensemble Methods

#### 3.1 Random Forests

In [22]:
# Test on binary bag-of-words representation
acc_rand_forest, best_params = train_clf(yelp_bin, RandomForestClassifier(), None)
print('Random Forest')
print(acc_rand_forest)
print('Best params - {}'.format(best_params))

Random Forest
{'Train Accuracy': 0.98928571428571432, 'Validation Accuracy': 0.36799999999999999, 'Test Accuracy': 0.39700000000000002}
Best params - None


In [23]:
acc_rand_forest, best_params = train_clf(yelp_freq, RandomForestClassifier(), None)
print('Random Forest')
print(acc_rand_forest)
print('Best params - {}'.format(best_params))

Random Forest
{'Train Accuracy': 0.99228571428571433, 'Validation Accuracy': 0.38300000000000001, 'Test Accuracy': 0.40000000000000002}
Best params - None


#### 3.2 Gradient Boosted Trees

In [24]:
acc_grad_boosted, best_params = train_clf(yelp_bin, GradientBoostingClassifier(subsample=0.75, n_estimators=200), None)
print('Gradient Boosted Trees')
print(acc_grad_boosted)
print('Best params - {}'.format(best_params))

Gradient Boosted Trees
{'Train Accuracy': 0.75871428571428567, 'Validation Accuracy': 0.48899999999999999, 'Test Accuracy': 0.48599999999999999}
Best params - None


In [None]:
acc_grad_boosted, best_params = train_clf(yelp_freq, GradientBoostingClassifier(), None)
print('Gradient Boosted Trees')
print(acc_grad_boosted)
print('Best params - {}'.format(best_params))

### BONUS - Hyperparameter Tuning

The accuracy we obtained is not so great. What could we have done differently? First of all, we didn't get to tune hyperparameters (except briefly on the SVM), so here is the function used for hyperparameter tuning on the validation set. Explore the sklearn documentation and try to beat the accuracy scores reached so far. Anything else? Inspect the most common words: what are they? Are they all useful?

In [17]:
# Function that returns best hyper-parameters for given classifier, tunes parameters on validation set
def tune_hyper_params(classifier, dataset, parameters):
    # -1 marks rows that always stay in the training fold; 0 marks the single validation fold
    n_train = dataset['train'][0].shape[0]
    n_valid = dataset['valid'][0].shape[0]
    ps = PredefinedSplit(test_fold=[-1] * n_train + [0] * n_valid)
    classifier = GridSearchCV(classifier, parameters, cv=ps, refit=True)

    return classifier
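To see what the `PredefinedSplit` in `tune_hyper_params` does, here is a minimal sketch on toy sizes. The sizes 4 and 2 are arbitrary assumptions; in the tutorial they would be the number of training and validation reviews.

```python
from sklearn.model_selection import PredefinedSplit

# Toy sizes (illustrative): 4 training rows followed by 2 validation rows
n_train, n_valid = 4, 2

# -1: row is never used for testing; 0: row belongs to test fold number 0
test_fold = [-1] * n_train + [0] * n_valid
ps = PredefinedSplit(test_fold=test_fold)

# Exactly one (train indices, test indices) split is produced
splits = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in ps.split()]
print(ps.get_n_splits(), splits)
```

This is why the data passed to `fit` must be the training rows stacked on top of the validation rows, in that order: the indices in `test_fold` refer to positions in the stacked matrix.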