# Classification cont.

Let's test the performance of our classification models from last week on the test data! I'm going to quickly rerun my best model from last week and see how it performs on the test data. You can substitute your own best model, or copy the last few cells from this notebook into last week's notebook and evaluate your model there.

### Install packages and load data

In [None]:
!pip install asent
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install gensim

In [1]:
from asent import lexicons
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import gensim.downloader
from sklearn.linear_model import LogisticRegression

In [2]:
lex = pd.DataFrame(lexicons.get("lexicon_en_v1").items(), columns=["word", "sentiment"])

train, test = train_test_split(lex, test_size=0.2, random_state=42)

### Preprocessing and feature generation

In [3]:
# binarise the outcome variable
y = [1 if x>0 else 0 for x in train["sentiment"]]

In [4]:
# load the embeddings
embeddings = gensim.downloader.load("glove-wiki-gigaword-300")

I've changed how I handle out-of-vocabulary words: instead of a zero vector, they now get the mean of the embeddings of the in-vocabulary words in the training data. The mean embedding has the same shape as an individual embedding; each of its dimensions is the mean of the corresponding dimension across those embeddings. This is a more sophisticated way of handling out-of-vocabulary words than a zero vector, and it should improve the performance of the model.
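
To make the averaging concrete, here is a minimal sketch on made-up 3-dimensional vectors. The name `toy_vectors` and its values are hypothetical and stand in for the real 300-dimensional GloVe embeddings.

```python
import numpy as np

# Toy 3-dimensional "embeddings" for three hypothetical in-vocabulary words
toy_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
])

# axis=0 averages over words, keeping one value per embedding dimension
toy_mean = np.mean(toy_vectors, axis=0)

print(toy_mean.shape)  # (3,)
print(toy_mean)        # [2. 2. 2.]
```

The result has one entry per dimension, which is why it can stand in for any single word's embedding.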

In [7]:
# key_to_index is a dict, so membership checks are fast (index_to_key is a list)
embeddings_mean = np.mean([embeddings[w] for w in train["word"] if w in embeddings.key_to_index], axis=0)
embeddings_mean.shape

In [21]:
# use the word's embedding if it exists, otherwise fall back to the training mean
features = [embeddings[w] if w in embeddings.key_to_index else embeddings_mean for w in train["word"]]

In [23]:
X = np.array(features)
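
Since the same lookup is needed again for the test data, it could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original notebook: `embed_words` is a hypothetical name, and a plain dict of 2-dimensional vectors stands in for the GloVe vectors (gensim's `KeyedVectors` supports the same `in` and `[]` operations).

```python
import numpy as np

def embed_words(words, vectors, fallback):
    """Look up each word in `vectors`; use `fallback` for unknown words."""
    return np.array([vectors[w] if w in vectors else fallback for w in words])

# Hypothetical stand-in for the GloVe vectors: a plain dict of 2-d vectors
toy_vectors = {
    "good": np.array([1.0, 3.0]),
    "bad": np.array([3.0, 1.0]),
}
train_words = ["good", "bad", "zzyzx"]  # "zzyzx" is out of vocabulary

# The fallback is estimated from in-vocabulary *training* words only
known = [toy_vectors[w] for w in train_words if w in toy_vectors]
fallback = np.mean(known, axis=0)

X_toy = embed_words(train_words, toy_vectors, fallback)
print(X_toy.shape)  # (3, 2)
print(X_toy[2])     # [2. 2.]
```

Passing the fallback in as an argument makes it explicit that the training mean is reused unchanged when embedding other data.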

### Training the model

I've made a few tweaks to the model settings to try to improve performance. First, I've added class weights to account for class imbalance: we have more negative outcomes (0) than positive outcomes (1), and the weights tell the model how much each class should count during fitting. Second, I've changed the optimisation algorithm (`solver`) to `liblinear`, which is better suited to smaller datasets.

In [None]:
y.count(0) / len(y)

In [136]:
clf = LogisticRegression(random_state=42,
                         class_weight={0:0.55,1:0.45},
                         solver="liblinear",
                         )
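
For comparison, scikit-learn also accepts `class_weight="balanced"`, which weights each class inversely proportional to its frequency using `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up labels. `balanced_class_weights` and `toy_y` are hypothetical names, and the 55/45 split is an assumption chosen to mirror the weights above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels with a 55/45 split between classes 0 and 1
toy_y = [0] * 55 + [1] * 45
weights = balanced_class_weights(toy_y)
print(weights)
```

Under this scheme the rarer class gets the larger weight, which is the opposite direction from the hand-set weights above, so it may be worth trying both and comparing the scores.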

In [None]:
clf.fit(X, y)

In [None]:
clf.score(X, y)

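I also tried a random forest model, which seems to outperform the logistic regression model. Keep in mind that the score below is computed on the training data, where random forests tend to score very highly, so the fairer comparison is the one on the test data.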

In [143]:
from sklearn.ensemble import RandomForestClassifier

clf_forest = RandomForestClassifier(random_state=42)

In [None]:
clf_forest.fit(X, y)

In [None]:
clf_forest.score(X, y)

### Testing the model

Now let's see how the model performs on the test data!

In [139]:
y_test = [1 if x>0 else 0 for x in test["sentiment"]]

One important thing to note is that we still use the mean of the embeddings from the **training data** for out-of-vocabulary words in the test data. The test set is meant to stand in for out-of-sample data, so that we can see how well our model generalises to unseen data. If the test data had actually been unavailable to us when we trained the model, we couldn't have used its embeddings to handle out-of-vocabulary words.

In [140]:
# same lookup as for the training data, reusing the training-data mean as the fallback
features_test = [embeddings[w] if w in embeddings.key_to_index else embeddings_mean for w in test["word"]]

In [141]:
X_test = np.array(features_test)

This time we don't fit the model again; we just use the model we trained on the training data to predict the labels for the test data.

In [None]:
clf.score(X_test, y_test)

In [None]:
clf_forest.score(X_test, y_test)
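
For scikit-learn classifiers, `score` returns the mean accuracy: the fraction of predicted labels that match the true labels. As a minimal sketch of what that number means, here is the same calculation in plain Python on made-up labels (`accuracy`, `toy_true` and `toy_pred` are hypothetical names).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))  # 0.75
```

In the notebook, the equivalent would be comparing `clf.predict(X_test)` against `y_test`.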