This notebook walks through an example text classification model, trained on a dataset of tweets about Covid-19 and their sentiment labels.

Here is the URL to the dataset: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/code

In [20]:
# Import necessary libraries
import pandas as pd
import spacy

In [21]:
# The dataset is already split into a training set and a test set

train = pd.read_csv('./Corona_NLP_train.csv', encoding='latin-1')
test = pd.read_csv('./Corona_NLP_test.csv', encoding='latin-1')

In [22]:
train.head()

,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


The UserName and ScreenName values have been replaced by numbers to preserve the users' anonymity.

In [23]:
train.shape

(41157, 6)

In [24]:
train.info  # without parentheses this shows the bound method's repr; call train.info() for the column summary

<bound method DataFrame.info of        UserName  ScreenName                      Location     TweetAt  \
0          3799       48751                        London  16-03-2020   
1          3800       48752                            UK  16-03-2020   
2          3801       48753                     Vagabonds  16-03-2020   
3          3802       48754                           NaN  16-03-2020   
4          3803       48755                           NaN  16-03-2020   
...         ...         ...                           ...         ...   
41152     44951       89903  Wellington City, New Zealand  14-04-2020   
41153     44952       89904                           NaN  14-04-2020   
41154     44953       89905                           NaN  14-04-2020   
41155     44954       89906                           NaN  14-04-2020   
41156     44955       89907  i love you so much || he/him  14-04-2020   

                                           OriginalTweet           Sentiment  
0      @MeNy

In [25]:
# Create an empty model
nlp = spacy.blank("en")

# Add the TextCategorizer to the empty model
textcat = nlp.add_pipe("textcat")

In [26]:
# Find out which values are in sentiments
sentiments = train['Sentiment']
sentiments = sentiments.drop_duplicates()
for value in sentiments:
    print(value)

Neutral
Positive
Extremely Negative
Negative
Extremely Positive


In [27]:
# Add labels to text classifier
textcat.add_label("Neutral")
textcat.add_label("Positive")
textcat.add_label("Extremely Negative")
textcat.add_label("Negative")
textcat.add_label("Extremely Positive")

1

In [28]:
train_texts = train['OriginalTweet'].values
train_labels = [{'cats': {'Neutral': Sentiment == 'Neutral',
                          'Positive': Sentiment == 'Positive',
                          'Extremely Negative': Sentiment == 'Extremely Negative',
                          'Negative': Sentiment == 'Negative',
                          'Extremely Positive': Sentiment == 'Extremely Positive'}} 
                for Sentiment in train['Sentiment']]
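
The comprehension above spells out each label by hand. A minimal sketch of the same one-hot encoding driven by a single label list, so the dictionary keys cannot drift from the labels registered on the text categorizer. `LABELS` and `make_cats` are hypothetical names introduced for illustration; they are not part of spaCy or of the original notebook.

```python
# Hypothetical helper (not a spaCy function): build the {'cats': {...}}
# annotation for one tweet from a fixed list of labels.
LABELS = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]

def make_cats(sentiment, labels=LABELS):
    """Return a one-hot 'cats' dict: True for the matching label, False otherwise."""
    return {"cats": {label: sentiment == label for label in labels}}

example = make_cats("Positive")
print(example)
```

With this helper, `train_labels` could be built as `[make_cats(s) for s in train['Sentiment']]`.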

In [29]:
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8',
  {'cats': {'Neutral': True,
    'Positive': False,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}}),
 ('advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order',
  {'cats': {'Neutral': False,
    'Positive': True,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}}),
 ('Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P',
  {'cats': {'Neutral': False,
    'Positive': True,
    'Extremely Negative': False,
    'Negative': False,
    'Extremely Positive': False}})]

In [31]:
import random
from spacy.util import minibatch
from spacy.training.example import Example

random.seed(1)
spacy.util.fix_random_seed(1)
# nlp.initialize() replaces the deprecated nlp.begin_training() in spaCy v3
optimizer = nlp.initialize()

losses = {}  # never reset, so each printed value is the running total across epochs
for epoch in range(3):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 100
    batches = minibatch(train_data, size=100)
    # Iterate through minibatches
    for batch in batches:
        for OriginalTweet, Sentiment in batch:
            doc = nlp.make_doc(OriginalTweet)
            example = Example.from_dict(doc, Sentiment)
            nlp.update([example], sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 5454.42810615442}
{'textcat': 10351.818137345455}
{'textcat': 14990.507116890312}
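
Because `losses` is created once before the loop and never reset, each printed value is a running total over all epochs so far rather than the loss of that epoch alone. A small sketch that recovers the per-epoch loss by differencing the totals printed above; `cumulative` and `per_epoch` are illustrative names.

```python
# Running totals copied from the training output above.
cumulative = [5454.42810615442, 10351.818137345455, 14990.507116890312]

# Loss of each epoch = this total minus the previous total (0 before epoch 1).
per_epoch = [curr - prev for prev, curr in zip([0.0] + cumulative[:-1], cumulative)]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.2f}")
```

Read this way, the loss falls from roughly 5454 in the first epoch to 4897 and then 4639, so the model is still improving after three epochs.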


In [35]:
texts = ["I am very scared of dying of Covid-19",
         "I love staying at home and watching Netflix all day",
         "Today I saw a balloon"]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores = textcat.predict(docs)

predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['Negative', 'Extremely Positive', 'Neutral']
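
The cell above relies on `textcat.predict` returning one row of scores per document, with one column per label in the order the labels were added; `argmax(axis=1)` then picks the highest-scoring column. A plain-Python sketch of that mapping follows. The score values are made up for illustration and are not output from the trained model, and `argmax` here is a small stand-in for the NumPy method.

```python
# Label order matches the order of the add_label calls in the notebook.
labels = ("Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive")

# Hypothetical scores for two documents (one row per doc, one column per label).
scores = [
    [0.10, 0.15, 0.05, 0.60, 0.10],
    [0.05, 0.20, 0.05, 0.10, 0.60],
]

def argmax(row):
    """Index of the largest value in a row (the first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

predicted = [labels[argmax(row)] for row in scores]
print(predicted)
```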


In [49]:
test_texts = test["OriginalTweet"]
docs_test = [nlp.tokenizer(test_text) for test_text in test_texts]

scores_test = textcat.predict(docs_test)
predicted_labels_test = scores_test.argmax(axis=1)
#print([textcat.labels[label_new] for label_new in predicted_labels_test])

In [65]:
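# Define the true labels first so this cell runs in document order
actual_labels_test = test["Sentiment"].values
print(predicted_labels_test.shape)
print(actual_labels_test.shape)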

(3798,)
(3798,)


In [90]:
actual_labels_test = test["Sentiment"].values
actual_labels_test = actual_labels_test.tolist()

In [99]:
# Map the predicted class indices back to label names
predicted_labels = [textcat.labels[label_new] for label_new in predicted_labels_test]
# Count the predictions that match the true labels
true_prediction = sum(pred == actual for pred, actual in zip(predicted_labels, actual_labels_test))

print(true_prediction)

2292


In [100]:
# Accuracy = correct predictions / number of test tweets (3798)
accuracy = true_prediction / len(actual_labels_test)
print(f"{accuracy:.4f}")

0.6035
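
A single accuracy figure does not show which of the five classes the model gets wrong. A minimal sketch of per-class accuracy using only the standard library; `per_class_accuracy` is a hypothetical helper, and the two short lists are toy stand-ins for `actual_labels_test` and `predicted_labels`.

```python
from collections import Counter

def per_class_accuracy(actual, predicted):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy stand-ins for actual_labels_test and predicted_labels.
actual = ["Positive", "Positive", "Negative", "Neutral"]
predicted = ["Positive", "Negative", "Negative", "Neutral"]
result = per_class_accuracy(actual, predicted)
print(result)
```

Calling `per_class_accuracy(actual_labels_test, predicted_labels)` on the real lists would show, for instance, whether the "Extremely" classes are mostly being confused with their milder neighbours.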
