# Missions 10/20 Project: Text Classification on Tweets

We are going to build an example of a text classification model using a dataset providing tweets and their sentiment about Covid-19.

Here is the URL to the dataset that needs to be downloaded:
https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/code

In [6]:
# Import necessary libraries
import pandas as pd
import spacy

In [5]:
# The dataset is already split into a train set and a test set
# Change the paths below to match where the csv files are stored on your machine
train = pd.read_csv("./Corona_NLP_train.csv", encoding='latin-1')
test = pd.read_csv('./Corona_NLP_test.csv', encoding='latin-1')

In [11]:
train.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | NaN | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


The `UserName` and `ScreenName` values have been replaced by numbers to preserve the users' anonymity.

Our focus is only on the last two columns, `OriginalTweet` and `Sentiment`, since they hold our texts and their associated labels.

Let's take a look at how much data we have:

In [12]:
train.shape

(41157, 6)

With this in mind, we can start creating our model.

In [7]:
# Create an empty model in English
nlp = spacy.blank("en")

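# Add the TextCategorizer to the empty model using textcat_multilabel
# (this component scores each label independently; spaCy also offers "textcat" for mutually exclusive labels)
textcat = nlp.add_pipe("textcat_multilabel")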

In [8]:
# Find out exactly which values are in Sentiment, which will be our labels
for value in train['Sentiment'].unique():
    print(value)

Neutral
Positive
Extremely Negative
Negative
Extremely Positive
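
Beyond listing the distinct labels, it can help to check how evenly they are distributed, since a strongly unbalanced label set makes accuracy harder to interpret. The following is a minimal sketch on a made-up list of sentiments; `sample_sentiments` and `label_distribution` are hypothetical names, not part of the notebook. On the real data, `train['Sentiment'].value_counts()` gives the same kind of information.

```python
from collections import Counter

# Hypothetical stand-in for train['Sentiment']; the real column is much longer.
sample_sentiments = [
    "Neutral", "Positive", "Positive",
    "Extremely Negative", "Negative", "Positive",
]


def label_distribution(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


for label, n, share in label_distribution(sample_sentiments):
    print(f"{label:<20}{n:>3}  {share:.0%}")
```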


In [9]:
# Add our labels to the text classifier
textcat.add_label("Neutral")
textcat.add_label("Positive")
textcat.add_label("Extremely Negative")
textcat.add_label("Negative")
textcat.add_label("Extremely Positive")

1

In [10]:
# Convert our labels to the form that TextCategorizer requires
train_texts = train['OriginalTweet'].values
train_labels = [{'cats': {'Neutral': sentiment == 'Neutral',
                          'Positive': sentiment == 'Positive',
                          'Extremely Negative': sentiment == 'Extremely Negative',
                          'Negative': sentiment == 'Negative',
                          'Extremely Positive': sentiment == 'Extremely Positive'}}
                for sentiment in train['Sentiment']]
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8',
  {'cats': {'Neutral': True,
    'Positive': False,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}}),
 ('advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order',
  {'cats': {'Neutral': False,
    'Positive': True,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}}),
 ('Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P',
  {'cats': {'Neutral': False,
    'Positive': True,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}})]
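
The `cats` dictionaries above all follow the same pattern: one key per label, with `True` only for the tweet's own sentiment. As a sketch, the same structure can be produced from a list of labels instead of spelling out every key by hand. `LABELS` and `to_cats` are hypothetical names introduced for illustration.

```python
# Label names taken from the Sentiment column shown earlier.
LABELS = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]


def to_cats(sentiment, labels=LABELS):
    """Build the {'cats': {...}} annotation for a single sentiment value."""
    if sentiment not in labels:
        raise ValueError(f"Unknown sentiment: {sentiment!r}")
    return {"cats": {label: sentiment == label for label in labels}}


print(to_cats("Positive"))
```

Raising an error on an unknown value is a design choice of this sketch: it surfaces typos or unexpected labels early instead of silently producing an annotation that is `False` everywhere.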

**Now let's train it!**

In [34]:
import random
from spacy.util import minibatch
from spacy.training.example import Example

# Fix the random seeds so that runs are reproducible
random.seed(1)
spacy.util.fix_random_seed(1)

# Initialize the model and get the optimizer
# (nlp.initialize() replaces the deprecated nlp.begin_training() in spaCy v3)
optimizer = nlp.initialize()

# losses is created once, so the values printed below accumulate across epochs
losses = {}
for epoch in range(4):
    random.shuffle(train_data)
    # Create the batch generator with a certain batch size
    batches = minibatch(train_data, size=20)
    # Iterate through minibatches
    for batch in batches:
        # Note: nlp.update is called once per example here, so each update sees a single tweet
        for text, annotations in batch:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            # Update the model's parameters with this example
            nlp.update([example], sgd=optimizer, losses=losses)
    print(losses)

{'textcat_multilabel': 4954.34124961769}
{'textcat_multilabel': 8577.840496189383}
{'textcat_multilabel': 11523.869732122108}
{'textcat_multilabel': 14053.386383053567}
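
Because the `losses` dictionary is created once, before the epoch loop, the values printed above are running totals rather than per-epoch losses. The short sketch below recovers the loss of each epoch by taking differences of the printed totals; `cumulative` and `per_epoch` are names introduced here for illustration.

```python
# Cumulative losses copied from the training output above.
cumulative = [
    4954.34124961769,
    8577.840496189383,
    11523.869732122108,
    14053.386383053567,
]

# Loss of each epoch = running total minus the previous running total.
per_epoch = [curr - prev for prev, curr in zip([0.0] + cumulative[:-1], cumulative)]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.2f}")
```

Read this way, the loss goes from roughly 4954 in the first epoch to roughly 2530 in the fourth, so it decreases from one epoch to the next even though the printed totals grow.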


In [52]:
# Quickly test our model on sentences we've come up with
texts = ["I hate everything about this situation, Covid makes me depressed",
         "I love staying at home and watching Netflix all day",
         "Today I saw a balloon"]

# Tokenize our test sentences so that the model can process them
docs = [nlp.tokenizer(text) for text in texts]

# Use textcat to get the predicted scores for each sentence in docs
textcat = nlp.get_pipe('textcat_multilabel')
scores = textcat.predict(docs)

# Pick the highest-scoring label for each sentence with .argmax
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['Extremely Negative', 'Extremely Positive', 'Neutral']
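
To make the last step more concrete, here is a dependency-free sketch of what `scores.argmax(axis=1)` followed by the label lookup does. The score values are invented for illustration, and the label order is assumed to match the order in which the labels were added to the classifier.

```python
# Assumed label order: the order used when the labels were added.
LABELS = ("Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive")

# Hypothetical scores: one row per sentence, one column per label.
# With a multilabel classifier the rows do not have to sum to 1.
scores = [
    [0.05, 0.10, 0.70, 0.40, 0.02],
    [0.10, 0.30, 0.02, 0.05, 0.80],
    [0.60, 0.20, 0.05, 0.10, 0.10],
]


def argmax(row):
    """Return the index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [LABELS[argmax(row)] for row in scores]
print(predicted)
```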


The model seems to capture the general sentiment of our examples: in the output above, the first sentence is labelled Extremely Negative, the second Extremely Positive, and the third Neutral.

Now, let's evaluate our model by computing its accuracy on our test set. To do that, we first need to count the number of correct predictions the model has made.

In [36]:
# Retrieve the tweets from our test set
test_texts = test["OriginalTweet"]
docs_test = [nlp.tokenizer(test_text) for test_text in test_texts]

# Make the predictions
scores_test = textcat.predict(docs_test)
predicted_labels_test = scores_test.argmax(axis=1)

In [37]:
# Retrieve the actual labels associated with the predictions as a list
actual_labels_test = test["Sentiment"].tolist()

In [38]:
#Check the size of the data to know how many predictions we have
print(predicted_labels_test.shape)

(3798,)


In [39]:
# Turn the predicted indices into the labels they are associated with
predicted_labels_evaluate = [textcat.labels[label_new] for label_new in predicted_labels_test]

# Compare predictions and actual labels pairwise to count the correct predictions
true_prediction = 0
for predicted, actual in zip(predicted_labels_evaluate, actual_labels_test):
    if predicted == actual:
        true_prediction += 1

print(true_prediction)

2620


In [41]:
# Compute our accuracy: correct predictions divided by the number of test tweets (3798)
accuracy = true_prediction / len(actual_labels_test)
print(round(accuracy, 4))

0.6898
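
A single accuracy figure does not show which labels the model gets right and which it confuses. The sketch below, on made-up predictions, computes the overall accuracy together with the share of correct predictions for each actual label. `accuracy_score` and `per_label_recall` are hypothetical helpers written for this example, not functions from spaCy or pandas.

```python
from collections import Counter


def accuracy_score(predicted, actual):
    """Share of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def per_label_recall(predicted, actual):
    """For each actual label, the share of its examples predicted correctly."""
    totals = Counter(actual)
    hits = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels, standing in for predicted_labels_evaluate and actual_labels_test.
toy_predicted = ["Positive", "Negative", "Neutral", "Positive"]
toy_actual = ["Positive", "Extremely Negative", "Neutral", "Negative"]

print(accuracy_score(toy_predicted, toy_actual))
print(per_label_recall(toy_predicted, toy_actual))
```

Dividing by `len(actual)` rather than a hard-coded count keeps the computation correct if the size of the test set changes.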


**In the end, even though our model uses the general-purpose TextCategorizer component of the spaCy library, it is able to perform sentiment analysis simply because it was trained with "positive" and "negative" type labels.**

Unfortunately, our model isn't performing well and takes a long time to train. If we could make training faster, we could increase the batch size and the number of epochs, which could lead to a better-performing model. Using this approach we were already able, through different iterations, to raise the accuracy from 0.57 to the result above, so there is some room for improvement!
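
One possible contributor to the long training time is that the training loop calls `nlp.update` once per tweet, so the batch size has no effect on how many updates are made. The sketch below passes a whole minibatch to each `nlp.update` call and resets the loss dictionary every epoch. It assumes spaCy v3; `train_in_batches` is a hypothetical helper, and whether it trains faster or reaches a better accuracy on this dataset has not been measured here.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch


def train_in_batches(nlp, train_data, n_epochs=4, batch_size=20, seed=1):
    """Train with one nlp.update call per minibatch; return one loss dict per epoch."""
    random.seed(seed)
    spacy.util.fix_random_seed(seed)
    data = list(train_data)  # shuffle a copy, leave the caller's list untouched
    optimizer = nlp.initialize()
    history = []
    for _ in range(n_epochs):
        random.shuffle(data)
        losses = {}  # fresh dict, so each entry is the loss of one epoch only
        for batch in minibatch(data, size=batch_size):
            examples = [
                Example.from_dict(nlp.make_doc(text), annotations)
                for text, annotations in batch
            ]
            # A single update for the whole batch
            nlp.update(examples, sgd=optimizer, losses=losses)
        history.append(losses)
    return history
```

With the pipeline and `train_data` built as in the notebook, this would be called as `history = train_in_batches(nlp, train_data)`, after which `history[i]["textcat_multilabel"]` holds the loss of epoch `i`.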