# Lab 5: Text Classification

This lab explores a new dataset for text classification tasks using naïve Bayes and logistic regression.

### Learning Outcomes
* Be able to train and test naïve Bayes and logistic regression classifiers using scikit-learn.
* Know how to apply evaluation metrics to the classifiers and display examples of misclassifications.
* Be able to examine learned model parameters and explain how each classifier makes a decision.

### Outline

1. Load a new Twitter dataset, which is described in [this paper](https://arxiv.org/pdf/2010.12421.pdf), then extracts feature vectors from each sample.
1. Training and evaluating naïve Bayes using Scikit-learn.
1. Training and evaluating logistic regression using Scikit-learn.
1. Optional extension: lemmatization and bigram features.
1. Optional extensions: lexicon features.

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. Look out for 'QUESTIONS' which you should try to answer before moving on to the next cell. Aim to work through the lab during the scheduled lab hours. To get help, you can talk to TAs or the lecturer during the labs, post questions to Blackboard (anonymously) or on Teams in the QA channel (with your name), or ask a question in the Wednesday live sessions. 

As you work through the notebooks, please make a note of any code that is unclear to you.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises! To understand what's going on inside the methods we use here, make sure to watch the lecture videos for the same week.

# 1. Preparing the Data 

This time we are using part of the Tweet Eval dataset, which contains seven Twitter datasets for various social media classification tasks. Here, we'll focus on the sentiment analysis data. 
Run the code below to download the data from [HuggingFace's datasets hub](https://huggingface.co/datasets/tweet_eval):

In [83]:
from datasets import load_dataset

cache_dir = "./data_cache"

# The data is already divided into training and test sets.
# Load the training set:
train_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="train",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Training dataset with {len(train_dataset)} instances loaded")

Reusing dataset tweet_eval (./data_cache\tweet_eval\sentiment\1.1.0\12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


Training dataset with 45615 instances loaded


In [84]:
# Load the test set:
test_dataset = load_dataset(
    "tweet_eval",
    name="sentiment",
    split="test",
    #ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Test dataset with {len(test_dataset)} instances loaded")

KeyboardInterrupt: 

Let's take a look at one of the instances in the training set:

The next step is to tokenise the text of each tweet and convert it to a bag of words, ready for input to a classifier. 
To do this, we will use the scikit-learn library. 

In [None]:
# Put the data into lists ready for the next steps...
train_tweets = [sample['text'] for sample in train_dataset]
train_labels = [sample['label'] for sample in train_dataset]

In [None]:
test_tweets = [sample['text'] for sample in test_dataset]
test_labels = [sample['label'] for sample in test_dataset]
print(train_dataset[0], test_dataset[0], train_labels[0],test_labels[0])

{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"', 'label': 2} {'text': "@user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.", 'label': 1} 2 1


To extract a bag of words, we can use the CountVectorizer class ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).
This class outputs the bag of words as a feature vector, where the length of the vector is equal to the size of the vocabulary, and the values are the counts of each words in a document. 

Run the code below to obtain feature vectors for the training and test samples:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using WordNetTokenizer. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)

vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer

vectorizer.fit(train_tweets)  # Learn the vocabulary
X_train = vectorizer.transform(train_tweets)  # extract training set bags of words
X_test = vectorizer.transform(test_tweets)  # extract test set bags of words



The fit() method sets the vectorizer up by extracting a vocabulary from some text data. 

QUESTION: Why do we fit the CountVectorizer on the training set?

The vectorizer stores the vocabulary as a dictionary that maps a token to its index in the feature vector. The code below looks up the indexes of some example words:

In [None]:
import reprlib

vocabulary = vectorizer.vocabulary_
print(vocabulary['the'])
print(vocabulary['horse'])
print(vocabulary['smile'])

print(f'Vocabulary size = {len(vocabulary)}')

45912
23574
42635
Vocabulary size = 51915


# 2. Naive Bayes Classifier

The code above has obtained the feature vectors and lists of labels. The data is now ready for use
with scikit-learn's classifiers.

Scikit-learn contains several different variants of naïve Bayes for different kinds of data. For our bag of words data, we need to use the [MultinomialNB class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB).


TODO 2.1: Look at the documentation for MultinomalNB and write code to train a NB classifier using `X_train` and `train_labels`.

In [None]:
# WRITE YOUR CODE HERE
from sklearn.naive_bayes import MultinomialNB as mnb

mnb_model = mnb()
mnb_model.fit(X_train,train_labels)


MultinomialNB()

Now we have a trained model, we would like to evaluate its performance on some test data. 

TODO 2.2: Refer to the documentation again and predict the labels for the test set. Use `X_test` as the inputs to the classifier.

In [None]:
# WRITE YOUR CODE HERE
test_predicted_labels = mnb_model.predict(X_test)
print(test_predicted_labels[:5])

[0 1 1 2 1]


We can compute standard metrics for classifier performance using [scikit-learn's metrics libary](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules). A useful function for multi-class classification (when there are more than two classes) is the [classification report function](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report).

TODO 2.3: Refer again to the documentation, and compute accuracy, precision, recall and F1 scores on the test set. 

In [None]:
# WRITE YOUR CODE HERE
from sklearn.metrics import classification_report
import numpy as np
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(test_predicted_labels, test_labels, target_names=target_names))



              precision    recall  f1-score   support

     class 0       0.43      0.66      0.52      2582
     class 1       0.68      0.61      0.64      6594
     class 2       0.64      0.49      0.56      3108

    accuracy                           0.59     12284
   macro avg       0.58      0.59      0.57     12284
weighted avg       0.62      0.59      0.59     12284



Now, let's examine the classifier that we learned. If you don't follow what's happening here, you may wish to refer back to the slides on naïve Bayes classifiers or to [Jurafsky and Martin's textbook](https://web.stanford.edu/~jurafsky/slp3/4.pdf). 

Previously, we trained a MultinomialNB classifier. The trained classifier object stores all the probabilities that it learned during training, which are needed to make predictions. The log of the likelihoods of each word given the class are represented by the attribute `feature_log_prob_`. So, if your classifier object is named `classifier`, you can access the likelihoods with `classifier.feature_log_prob_`.

TODO 2.4: Print out the likelihood of the words 'happy' and 'hate' in each class. Hint: look up the index of the chosen words in `vocabulary`. The rows of `feature_log_prob` correspond to classes, and the columns to words.

In [None]:
import numpy as np

### CHANGE THE NAME OF THE CLASSIFIER VARIABLE BELOW TO USE YOUR TRAINED CLASSIFIER
feat_likelihoods = np.exp(mnb_model.feature_log_prob_)  # Use exponential to convert the logs back to probabilities
###
# WRITE YOUR CODE HERE
index_happ = vocabulary["happy"]
index_hate = vocabulary["hate"]
print(f'happy in class 0: {feat_likelihoods[0][index_happ]}')
print(f'happy in class 1: {feat_likelihoods[1][index_happ]}')
print(f'happy in class 2: {feat_likelihoods[2][index_happ]}')
print(f'hate in class 0: {feat_likelihoods[0][index_hate]}')
print(f'hate in class 1: {feat_likelihoods[1][index_hate]}')
print(f'hate in class 2: {feat_likelihoods[2][index_hate]}')

happy in class 0: 0.00016528408760950073
happy in class 1: 0.00010482220248600166
happy in class 2: 0.0017517469461780536
hate in class 0: 0.000509253675337381
hate in class 1: 8.957533666985593e-05
hate in class 2: 5.334186803221848e-05


The sentiment classes are negative (0), neutral (1) and positive (2). 

QUESTION: Which class has the strongest association with 'happy' and with 'hate'?

A key part of evaluating a classifier is investigating the errors it makes to better understand its limitations. 

TODO 2.5: Complete the code below to print out some misclassified tweets along with their predicted and true labels.

In [None]:
error_indexes = test_predicted_labels != test_labels  # compare predictions to gold labels
print(len(error_indexes))
# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

### WRITE YOUR CODE HERE
gold_err = np.array(test_labels)[error_indexes]
pred_err = np.array(test_predicted_labels)[error_indexes]

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {gold_err[i]}, prediction = {pred_err[i]}.')

12284
Tweet: @user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.; true label = 1, prediction = 0.
Tweet: @user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.; true label = 0, prediction = 1.
Tweet: Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS; true label = 2, prediction = 1.
Tweet: @user @user @user @user @user @user take away illegals and dead people and Trump wins popular vote too.; true label = 0, prediction = 1.
Tweet: When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!; true label = 0, prediction = 1.
Tweet: @user ohhh ok i see 🤔 what if u have medical marijuana clearance? Does that make a difference; true label = 1, prediction = 0.
Tweet: @user alt-right was adopted by Deplorables. Average middle Americans.  I've now moved to Libertarian. @user; true lab

# 3. Logistic Regression Classifier

Another simple, linear classifier is logistic regression. This classifier does not rely on the conditional independence assumption, so can better model features that are highly correlated with each other. Scikit-learn provides the [logisticRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), which has a very similar interface to the naïve Bayes classifier.

TODO 3.1: Train a logistic regression classifier, referring to the scikit-learn documentation as required.

In [None]:
# WRITE YOUR CODE HERE
from sklearn.linear_model import LogisticRegression as lr
lr_model = lr(max_iter=1000)
lr_model.fit(X_train,train_labels)

LogisticRegression(max_iter=1000)

TODO 3.2: Obtain predictions on the test set.

In [None]:
# WRITE YOUR CODE HERE
lr_predicted_test_labels = lr_model.predict(X_test)

TODO 3.3: Compute accuracy, precision, recall and F1 scores on the test set using [scikit-learn's metrics libary.](https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

In [None]:
# WRITE YOUR CODE HERE
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(lr_predicted_test_labels, test_labels, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.42      0.66      0.51      2568
     class 1       0.70      0.60      0.65      6951
     class 2       0.58      0.50      0.54      2765

    accuracy                           0.59     12284
   macro avg       0.57      0.59      0.57     12284
weighted avg       0.62      0.59      0.59     12284



QUESTION: How does the performance of logistic regression compare with naïve Bayes?

The logistic regression classifier works by learning a weight for each feature that indicates its importance in predicting a class. These weights are stored in the `coef_` attribute of the LogisticRegression object, which has rows corresponding to classes, and columns corresponding to words in the vocabulary. 

TODO 3.4: Print out the weights for 'happy' and 'hate' for each class.

In [None]:
### WRITE YOUR CODE HERE
weights = lr_model.coef_
index_happ = vocabulary["happy"]
index_hate = vocabulary["hate"]
print(f'happy in class 0: {weights[0][index_happ]}')
print(f'happy in class 1: {weights[1][index_happ]}')
print(f'happy in class 2: {weights[2][index_happ]}')
print(f'hate in class 0: {weights[0][index_hate]}')
print(f'hate in class 1: {weights[1][index_hate]}')
print(f'hate in class 2: {weights[2][index_hate]}')

happy in class 0: -0.5335792742995011
happy in class 1: -1.2779543000406763
happy in class 2: 1.8115335743401884
hate in class 0: 1.7530872087713203
hate in class 1: -0.17058100984881422
hate in class 2: -1.582506198922476


QUESTION: Are the weights what you would expect to see?

The code below prints out the words with the highest weights for each class. We use numpy's `argsort` function to get the indexes of the sorted weights. Run the code below to show the result: 

In [None]:
n_feats_to_show = 10


# Flip the index so that values are keys and keys are values:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))
for c, weights_c in enumerate(lr_model.coef_):
    print(f'\nWeights for class {c}:\n')
    strongest_idxs = np.argsort(weights_c)[-n_feats_to_show:]
    for idx in strongest_idxs:
        print(f'{vocab_inverted[idx]} with weight {weights_c[idx]}')


Weights for class 0:

fucked with weight 1.8719001739520214
horrible with weight 1.8810811722403882
disappointing with weight 1.8826870906457467
fuck with weight 1.8890126298389327
asshole with weight 1.9000627856350742
terrible with weight 1.9186021292858315
ruined with weight 1.9238251486899896
stupid with weight 2.2156346849764352
sucks with weight 2.321729398187835
worst with weight 2.697173777649992

Weights for class 1:

jab with weight 1.092118742018886
1977 with weight 1.1116859940321
uc with weight 1.1146564729729118
klub with weight 1.1291464808202143
load with weight 1.173361579803787
bama with weight 1.2190830130882138
50/50 with weight 1.3216677124489193
morris with weight 1.3563773417873466
paterno with weight 1.4175607934836516
phase2 with weight 1.688193658393043

Weights for class 2:

impressive with weight 1.778428553491804
congratulations with weight 1.801213015155426
happy with weight 1.8115335743401884
proud with weight 1.8179080217177876
perfect with weight 1.830

TODO 3.5: Use the same code as for naïve Bayes to print out examples of misclassified tweets and their labels. Hint: you should be able to compy and paste your code from above :) 

In [None]:
error_indexes = lr_predicted_test_labels != test_labels  # compare predictions to gold labels

# get the text of tweets where the classifier made an error:
tweets_err = np.array(test_tweets)[error_indexes]

### WRITE YOUR CODE HERE
gold_err = np.array(test_labels)[error_indexes]
pred_err = np.array(lr_predicted_test_labels)[error_indexes]

for i in range(10):  # just print the first ten
    print(f'Tweet: {tweets_err[i]}; true label = {gold_err[i]}, prediction = {pred_err[i]}.')

Tweet: @user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.; true label = 1, prediction = 0.
Tweet: I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user; true label = 2, prediction = 1.
Tweet: @user Wow,first Hugo Chavez and now Fidel Castro. Danny Glover, Michael Moore, Oliver Stone, and Sean Penn are running out of heroes.; true label = 0, prediction = 1.
Tweet: Twitter's #ThankYouObama Shows Heartfelt Gratitude To POTUS; true label = 2, prediction = 1.
Tweet: An interesting security vulnerability - albeit not for the everyday car thief; true label = 1, prediction = 2.
Tweet: When Ryan privatizes SS, Medicare, Medicaid, & does away with ACA, what will Trump's base feel about "change" then? That's a big one right?!; true label = 0, prediction = 1.
Tweet: Swampbitch Nasty Pelosi  loves yelling 'Fire' in the crowded swamp. #blackfriday @user; true label = 0, prediction = 1.
Tweet: @user ohhh ok i see 🤔 what if u 

# 4. Optional: Lemmatization and N-grams

You only need to do this section if you finish the previous sections before the end of the lab.

In the previous lab, we tried out lemmatization. This is useful for reducing the size of the vocabulary. Could it help us here?

To apply lemmatization, we have to go back to the CountVectorizer and define a new tokenizer that will carry out the extra step of lemmatization. Run the code below to test this out:

In [None]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        
    def __call__(self, tweets):
        return [self.wnl.lemmatize(self.wnl.lemmatize(self.wnl.lemmatize(tok, pos='n'), pos='v'), pos='a') for tok in word_tokenize(tweets)]
    
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)
# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survive', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']


In [None]:
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

Vocabulary size: 45324


TODO 4.1: Now, repeat your training of the logistic regression using the new features, and compare its performance with the previous classifers.

In [None]:
### WRITE YOUR OWN CODE HERE
mnb_model = mnb()
mnb_model.fit(X_train,train_labels)
test_predicted_labels = mnb_model.predict(X_test)
print(classification_report(test_predicted_labels, test_labels, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.45      0.66      0.54      2723
     class 1       0.67      0.61      0.64      6501
     class 2       0.63      0.49      0.55      3060

    accuracy                           0.60     12284
   macro avg       0.59      0.59      0.58     12284
weighted avg       0.62      0.60      0.60     12284



QUESTION: Did lemmatization bring about any improvements on this dataset?

The bag of words is a very simple representation of the tweets that does not capture enough information to make accurate sentiment classifications. Another way to improve it could be to use bigrams instead of single words as our features. Bigrams are pairs of words that occur one after another in the text. Bigrams are a kind of 'n-gram', where 'n=2'. 

To extract bigrams, we again modify our CountVectorizer. This class has a parameter `ngram_range`, which determines the range of sizes of n-grams the vectorizer will include. If we set `ngram_range=(1,1)` we have our standard bag of words. If we set it to `ngram_range=(2,2)`, we use bigrams instead. Choosing If we set `ngram_range=(1,2)` will use both single tokens (unigrams) and bigrams.

TODO 4.2: Create a new CountVectorizer that extracts bigram features instead of unigrams (single tokens) and uses the LemmaTokenizer.

In [None]:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)
vectorizer = CountVectorizer(tokenizer=Tokenizer(), ngram_range=(1,1))
vectorizer.fit(train_tweets)
print(list(vectorizer.vocabulary_)[:20])
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survived', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']
Vocabulary size: 51915


In [None]:
vectorizer = CountVectorizer(tokenizer=Tokenizer(), ngram_range=(1,2))
vectorizer.fit(train_tweets)
print(list(vectorizer.vocabulary_)[-20:])
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

['september has', 'arrived ,', 'means apple', 'from becoming', 'becoming an', 'official thing', 'maguire had', 'had opened', 'in hilton', 'hilton head', 'head till', '8th lol', 'lol go', 'aldean sept.', 'sept. 19th', '! alot', 'president richard', 'trumka as', 'he explores', 'explores whether']
Vocabulary size: 419483


In [None]:
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

In [None]:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))
print(vocab_inverted[24831])
print(vectorizer.get_feature_names()[24831:24832])
print(train_tweets[0])
print(X_train)

.
['.']
"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"
  (0, 1712)	1
  (0, 4368)	1
  (0, 10297)	1
  (0, 19011)	1
  (0, 22105)	1
  (0, 24831)	1
  (0, 24833)	1
  (0, 46525)	1
  (0, 46555)	1
  (0, 53100)	1
  (0, 53734)	1
  (0, 54644)	1
  (0, 56453)	1
  (0, 90676)	1
  (0, 90693)	1
  (0, 102180)	1
  (0, 102188)	1
  (0, 144006)	1
  (0, 144027)	1
  (0, 186451)	1
  (0, 186452)	1
  (0, 195916)	1
  (0, 195917)	1
  (0, 204054)	1
  (0, 206379)	1
  :	:
  (45614, 279352)	1
  (45614, 298576)	2
  (45614, 298625)	1
  (45614, 298644)	1
  (45614, 311485)	1
  (45614, 311486)	1
  (45614, 312074)	1
  (45614, 312089)	1
  (45614, 369580)	1
  (45614, 369817)	1
  (45614, 371781)	1
  (45614, 380633)	1
  (45614, 380634)	1
  (45614, 383703)	1
  (45614, 383742)	1
  (45614, 392713)	1
  (45614, 392717)	1
  (45614, 397987)	1
  (45614, 397991)	1
  (45614, 403687)	1
  (45614, 403698)	1
  (45614, 405626)	1
  (45614, 405949)	1
  (45614, 407489)	1
  (



TODO 4.3: Now, repeat your training of the logistic regression or naïve Bayes classifier using the new features, and compare its performance with the previous classifers.

In [None]:
### WRITE YOUR OWN CODE HERE
mnb_model = mnb()
mnb_model.fit(X_train,train_labels)
test_predicted_labels = mnb_model.predict(X_test)
print(classification_report(test_predicted_labels, test_labels, target_names=target_names))



              precision    recall  f1-score   support

     class 0       0.11      0.76      0.19       572
     class 1       0.77      0.55      0.65      8247
     class 2       0.68      0.46      0.55      3465

    accuracy                           0.54     12284
   macro avg       0.52      0.59      0.46     12284
weighted avg       0.71      0.54      0.60     12284



QUESTION: Do bigrams improve performance on this dataset?

# 5. Optional: Lexicon Features

You only need to do this part if you finish the other parts before the end of the lab session. 

The NLTK library contains sentiment lexicons, which are lists of words with negative or positive connotations. 

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

analyser = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to C:\Users\12055\AppDat
[nltk_data]     a\Local\Programs\Python\Python310\lib\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Now have a look at the sentiment scores for some words in the lexicon by running the code below. What do the scores mean and why do some words have no score?

In [None]:
testwords = ['happy', 'wonderful', 'horrible', 'boring', 'tablecloth', 'not']

for word in testwords:
    if word in analyser.lexicon:
        print(f'{word}: {analyser.lexicon[word]}')
    else:
        print(f'{word}: NOT IN LEXICON')

happy: 2.7
wonderful: 2.7
horrible: -2.5
boring: -1.3
tablecloth: NOT IN LEXICON
not: NOT IN LEXICON


Now we would like to use this function to compute counts of all positive and negative words. Let's start by recording whether the words in our vocabulary are positive or negative:

In [None]:
# get the Vader lexicon scores for each word in our vocabulary
vectorizer = CountVectorizer(tokenizer=Tokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])

vocabulary = vectorizer.vocabulary_

lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))

for i, term in enumerate(vocabulary):
    if term in analyser.lexicon and analyser.lexicon[term] > 0:
        lex_pos_scores[0, i] = 1
    elif term in analyser.lexicon and analyser.lexicon[term] < 0:
        lex_neg_scores[0, i] = 1



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survived', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']


In [None]:
print(X_train.shape)

(45615, 51915)


Now let's compute the counts of positive and negative words in the dataset:

In [None]:
# Compute the scores for each instance in the data set. 
# Multiply the lexicon scores by the feature vectors, then sum over the 
# vocabulary to get the total positive and total negative counts:
lex_pos_train = np.sum(X_train.multiply(lex_pos_scores), axis=1)
lex_pos_test = np.sum(X_test.multiply(lex_pos_scores), axis=1)
lex_neg_train = np.sum(X_train.multiply(lex_neg_scores), axis=1)
lex_neg_test = np.sum(X_test.multiply(lex_neg_scores), axis=1)


Finally, we can append the counts to the feature vector and treat them as extra features:

In [None]:
from scipy.sparse import hstack

X_train = hstack((X_train, lex_pos_train, lex_neg_train))
X_test = hstack((X_test, lex_pos_test, lex_neg_test))

In [None]:
print(X_train.shape)

(45615, 51917)


TODO 5.1: Use the new X_train and X_test feature vectors to train and evaluate your classifier. 
Does adding the lexicon features improve performance?


In [87]:
### WRITE YOUR OWN CODE HERE
mnb_model = mnb()
mnb_model.fit(X_train,train_labels)
test_predicted_labels = mnb_model.predict(X_test)
print(classification_report(test_predicted_labels, test_labels, target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.40      0.67      0.50      2385
     class 1       0.70      0.60      0.65      6848
     class 2       0.64      0.50      0.56      3051

    accuracy                           0.59     12284
   macro avg       0.58      0.59      0.57     12284
weighted avg       0.62      0.59      0.60     12284



In [88]:
vectorizer = CountVectorizer(tokenizer=Tokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

# Print out some of the features in the vocabulary:
print(list(vectorizer.vocabulary_)[:20])
vocabulary = vectorizer.vocabulary_
lex_pos_scores = np.zeros((1, len(vocabulary)))
lex_neg_scores = np.zeros((1, len(vocabulary)))
for i, term in enumerate(vocabulary):
    if term in analyser.lexicon and analyser.lexicon[term] > 0:
        lex_pos_scores[0, i] = analyser.lexicon[term]
    elif term in analyser.lexicon and analyser.lexicon[term] < 0:
        lex_neg_scores[0, i] = analyser.lexicon[term]
lex_pos_train = np.sum(X_train.multiply(lex_pos_scores), axis=1)
lex_pos_test = np.sum(X_test.multiply(lex_pos_scores), axis=1)
lex_neg_train = np.sum(X_train.multiply(lex_neg_scores), axis=1)
lex_neg_test = np.sum(X_test.multiply(lex_neg_scores), axis=1)
X_train = hstack((X_train, lex_pos_train, lex_neg_train))
X_test = hstack((X_test, lex_pos_test, lex_neg_test))
lr_model = lr(max_iter=1000)
lr_model.fit(X_train,train_labels)
test_predicted_labels = lr_model.predict(X_test)
print(classification_report(test_predicted_labels, test_labels, target_names=target_names))



['``', 'qt', '@', 'user', 'in', 'the', 'original', 'draft', 'of', '7th', 'book', ',', 'remus', 'lupin', 'survived', 'battle', 'hogwarts', '.', '#', 'happybirthdayremuslupin']
              precision    recall  f1-score   support

     class 0       0.42      0.66      0.52      2573
     class 1       0.70      0.60      0.65      6955
     class 2       0.58      0.50      0.54      2756

    accuracy                           0.59     12284
   macro avg       0.57      0.59      0.57     12284
weighted avg       0.62      0.59      0.60     12284



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [91]:
np.sum(lex_neg_scores)
np.sum(lex_pos_scores)

2059.9999999999995

In [None]:
vectorizer = CountVectorizer(tokenizer=Tokenizer())

vectorizer.fit(train_tweets)
X_train = vectorizer.transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

In [None]:
sen_list = ["Research shows you could be better off by up to £43k over 30 years of investing in an ii ISA and Trading Account due to our low, flat fees. This is just for illustration if all other factors were the same.",
"The advantage of lower fees over time means that you could be significantly better off in the long run. And by holding multiple accounts with ii for the same low fee, you can save even more.", 
"Transfer your ISA and Trading Accounts to ii and keep more of your money.",
"Please remember, investment value can go up or down and you could get back less than you invest. The value of international investments may be affected by currency fluctuations which might reduce their value in sterling."] 

In [None]:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)
vectorizer = CountVectorizer(tokenizer=Tokenizer(), ngram_range=(1,1))
vectorizer.fit(sen_list)
X_train = vectorizer.transform(sen_list)
print(list(vectorizer.vocabulary_)[:20])
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

['research', 'shows', 'you', 'could', 'be', 'better', 'off', 'by', 'up', 'to', '£43k', 'over', '30', 'years', 'of', 'investing', 'in', 'an', 'ii', 'isa']
Vocabulary size: 86




In [None]:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)
vectorizer = CountVectorizer(tokenizer=Tokenizer(), ngram_range=(1,2))
vectorizer.fit(sen_list)
X_train = vectorizer.transform(train_tweets)
print(list(vectorizer.vocabulary_)[-20:])
print(f'Vocabulary size: {len(vectorizer.vocabulary_)}')

['invest .', '. the', 'the value', 'value of', 'of international', 'international investments', 'investments may', 'may be', 'be affected', 'affected by', 'by currency', 'currency fluctuations', 'fluctuations which', 'which might', 'might reduce', 'reduce their', 'their value', 'value in', 'in sterling', 'sterling .']
Vocabulary size: 212
