# Natural Language Processing Project: Coronavirus Tweets

Coded by Luna McBride, following ideas in the Kaggle NLP course

In [340]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy # NLP
from sklearn.svm import LinearSVC
import re # regular expressions
import html # HTML content, like &amp;
from spacy.lang.en.stop_words import STOP_WORDS # stopwords
from sklearn.model_selection import train_test_split # training and testing a model
from spacy.util import minibatch # batches for training
import random # randomizing for training

nlp = spacy.load('en_core_web_lg') #Load spacy, up here so I do not have to load it constantly

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv
/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv


# Read in the data files

Note: calling read_csv("file") without an explicit encoding raises a UnicodeDecodeError here. Source for the fix: https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python
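
To see why the `encoding` argument matters, here is a minimal standard-library sketch. The sample strings are made up, not taken from the dataset. A byte that is valid ISO-8859-1 can be invalid UTF-8, which is the kind of failure that surfaces as a `UnicodeDecodeError`. The second half shows one plausible source of the stray "Â" that shows up in some tweets later on: UTF-8 bytes read back as ISO-8859-1. Whether that is what actually happened in this file is an assumption.

```python
# "café" stored as ISO-8859-1 ends in the single byte 0xE9
raw = "café".encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails, because 0xE9 starts a multi-byte
# sequence that is never completed
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in works
latin1_text = raw.decode("iso-8859-1")

# A no-break space (U+00A0) is two bytes in UTF-8: 0xC2 0xA0.
# Read as ISO-8859-1, the first byte turns into "Â"
mojibake = "\u00a0".encode("utf-8").decode("iso-8859-1")

print(utf8_ok, latin1_text, repr(mojibake))
```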

In [341]:
train = pd.read_csv("../input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding = "ISO-8859-1") #Load the training set
train.head() #Take a peek at the training set

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [342]:
test = pd.read_csv("../input/covid-19-nlp-text-classification/Corona_NLP_test.csv", encoding = "ISO-8859-1") #Load the testing set
test.head() #Take a peek at the testing set

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral


---

# Check for Nulls

In [343]:
# Check for nulls in all columns in Train
print("Train CSV: \n")
print(train["UserName"].isnull().any())
print(train["ScreenName"].isnull().any())
print(train["Location"].isnull().any())
print(train["TweetAt"].isnull().any())
print(train["OriginalTweet"].isnull().any())
print(train["Sentiment"].isnull().any())
    
#Location has a null
train["Location"] = train["Location"].fillna("Unknown") #Fill the null values with "Unknown"
print("Location: ", train["Location"].isnull().any(), "\n") #Print the now fixed location to make sure it is truly fixed

# Check for nulls in all columns in Test
print("Test CSV: \n")
print(test["UserName"].isnull().any())
print(test["ScreenName"].isnull().any())
print(test["Location"].isnull().any())
print(test["TweetAt"].isnull().any())
print(test["OriginalTweet"].isnull().any())
print(test["Sentiment"].isnull().any())

#Location has a null
test["Location"] = test["Location"].fillna("Unknown") #Fill the null values with "Unknown"
print("Location: ", test["Location"].isnull().any(), "\n") #Print the now fixed location to make sure it is truly fixed

Train CSV: 

False
False
True
False
False
False
Location:  False 

Test CSV: 

False
False
True
False
False
False
Location:  False 



All nulls in Location are now filled with "Unknown", so no nulls remain.
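
The column-by-column prints above can also be collapsed into one call. This is a sketch on a hypothetical toy frame (`toy` and its values are made up), assuming pandas is available as in the rest of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the tweet data
toy = pd.DataFrame({
    "Location": ["London", None, "UK"],
    "OriginalTweet": ["a", "b", "c"],
})

# One boolean per column: True where the column holds at least one null
null_flags = toy.isnull().any()

# Fill the missing locations, then confirm nothing is left anywhere
toy["Location"] = toy["Location"].fillna("Unknown")
remaining = bool(toy.isnull().any().any())

print(null_flags)
print("Any nulls left:", remaining)
```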

---

# Check OriginalTweet for empty strings

In [344]:
empty = train["OriginalTweet"].apply(lambda x: print("One") if not x else x) #Prints "One" if there are any empty strings
empty2 = test["OriginalTweet"].apply(lambda x: print("One") if not x else x) #Prints "One" if there are any empty strings

No print statements. No empty strings here.
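
A check that does not rely on side effects inside `apply` might look like the sketch below. The `tweets` series is a made-up stand-in for `train["OriginalTweet"]`. Note that this version also counts whitespace-only strings as blank, which the `not x` test above does not.

```python
import pandas as pd

# Made-up stand-in for the OriginalTweet column
tweets = pd.Series(["hello world", "", "   ", "covid"])

# True for rows that are empty once surrounding whitespace is stripped
blank_mask = tweets.str.strip() == ""

blank_count = int(blank_mask.sum())           # how many blank rows there are
blank_rows = tweets.index[blank_mask].tolist()  # which rows they are

print(blank_count, blank_rows)
```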

---

# Tweet Processing

Sources for Tweet Processing: https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e , https://medium.com/analytics-vidhya/working-with-twitter-data-b0aa5419532 , https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/ ,  https://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string , plus some general regex searches.

In [345]:
punctuations = """!()-![]{};:+'"\,<>./?@#$%^&*_~Â""" #List of punctuations to remove, including a weird A that will not process out any other way

#CleanTweets: parses the tweets and removes punctuation, stop words, digits, and links.
#Input: the list of tweets that need parsing
#Output: the parsed tweets
def cleanTweets(tweetParse):
    for i in range(0,len(tweetParse)):
        tweet = tweetParse[i] #Putting the tweet into a variable so that it is not calling tweetParse[i] over and over
        tweet = html.unescape(tweet) #Removes leftover HTML elements, such as &amp;
        tweet = re.sub(r"@\w+", ' ', tweet) #Completely removes @'s, as other peoples' usernames mean nothing
        tweet = re.sub(r'https\S+', ' ', tweet) #Removes https links (plain http:// links are not matched), as links provide no data in tweet analysis in themselves
        tweet = re.sub(r"\d+\S+", ' ', tweet) #Removes digits plus whatever follows them up to the next space, like "14th"; a lone single digit is left in place
        tweet = ''.join([punc for punc in tweet if not punc in punctuations]) #Removes the punctuation defined above
        tweet = tweet.lower() #Turning the tweets lowercase real quick for later use
    
        tweetWord = tweet.split() #Splits the tweet into individual words
        tweetParse[i] = ''.join([word + " " for word in tweetWord if nlp.vocab[word].is_stop == False]) #Checks if the words are stop words
        
    return tweetParse #Returns the parsed tweets

#Jeez, this whole NLP project (plus the kaggle course) has thrown a lot of use of making a list via _ for _ if _

trainCopy = train["OriginalTweet"].copy() #Copies the train tweets, using a copy to ensure I do not screw it up
testCopy = test["OriginalTweet"].copy() #Copies the test tweets, using a copy to ensure I do not screw it up

trainTweets = cleanTweets(trainCopy) #Calls the cleanTweets method to clean the train tweets
testTweets = cleanTweets(testCopy) #Calls the cleanTweets method to clean the test tweets

train["CleanTweet"] = trainTweets #Puts the clean train tweets into a new column
test["CleanTweet"] = testTweets #Puts the clean test tweets into a new column
train.head() #Take a peek at the new addition to the data

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,CleanTweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive,advice talk neighbours family exchange phone n...
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworths elderly disab...
3,3802,48754,Unknown,16-03-2020,My food stock is not the only one which is emp...,Positive,food stock dont panic food need stay calm stay...
4,3803,48755,Unknown,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative,ready supermarket covid outbreak im paranoid f...


***Important note***: several rows become blank strings after processing. After a bit of exploration, I found that these blank strings carry different sentiments (e.g. rows 186 and 13777 of the train set, which are neutral and negative respectively). Below is the list of the indices of the empty strings after preprocessing. These rows are removed further below. This matters because empty tweets give no information and, since they carry different sentiments, could skew the model.
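
To illustrate how a tweet ends up blank, here is a standard-library sketch of the same cleaning steps. It is not the notebook's function: `TOY_STOP_WORDS` is a tiny hypothetical stand-in for spaCy's stop-word lookup, the sample tweets are made up, and words are joined without the trailing space that `cleanTweets` leaves.

```python
import html
import re

# Hypothetical, tiny stop-word set; the notebook uses spaCy's vocabulary instead
TOY_STOP_WORDS = {"to", "your", "the", "is", "at", "a", "and"}
PUNCTUATION = set("""!()-[]{};:+'"\\,<>./?@#$%^&*_~""")

def clean_tweet(tweet):
    tweet = html.unescape(tweet)              # decode entities such as &amp;
    tweet = re.sub(r"@\w+", " ", tweet)       # drop @mentions
    tweet = re.sub(r"https\S+", " ", tweet)   # drop https links
    tweet = re.sub(r"\d+\S+", " ", tweet)     # drop digits and what follows them
    tweet = "".join(ch for ch in tweet if ch not in PUNCTUATION)
    words = tweet.lower().split()
    return " ".join(w for w in words if w not in TOY_STOP_WORDS)

# A tweet made only of mentions and a link has nothing left afterwards
only_mentions = clean_tweet("@user_one @user_two https://example.com/abc")

# A tweet with real words keeps its content words
normal = clean_tweet("Talk to your neighbours &amp; family at the store")

print(repr(only_mentions))
print(repr(normal))
```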

In [346]:
print(trainTweets.loc[trainTweets == ""], "\n \n") #Print the row numbers with empty clean train tweets
print(testTweets.loc[testTweets == ""]) #Print the row number with empty clean test tweets

0        
16       
186      
583      
2190     
5214     
5946     
8841     
12410    
13777    
13843    
14840    
16920    
16924    
18437    
22994    
27932    
28549    
28604    
28987    
29888    
30345    
30473    
31116    
31293    
31440    
31627    
31657    
32455    
35563    
35565    
35601    
36781    
37646    
40893    
Name: OriginalTweet, dtype: object 
 

3066    
3195    
Name: OriginalTweet, dtype: object


In [347]:
#RemoveBlanks: removes tweets that became blank after processing
#Input: the dataframe to look at
#Output: none
def removeBlanks(df):
    df["CleanTweet"] = df["CleanTweet"].apply(lambda x: np.nan if not x else x) #Changes blank strings to nan
    df.dropna(subset = ["CleanTweet"], inplace = True) #Drops the rows newly assigned to nan
    df.reset_index(drop=True, inplace=True) #Reset indices so we can still loop through without error

removeBlanks(train) #Removes the blanks from the train set
removeBlanks(test) #Removes the blanks from the test set
train.head() #Opens up the train to take a peek, as the first one was blank in the training set

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,CleanTweet
0,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive,advice talk neighbours family exchange phone n...
1,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworths elderly disab...
2,3802,48754,Unknown,16-03-2020,My food stock is not the only one which is emp...,Positive,food stock dont panic food need stay calm stay...
3,3803,48755,Unknown,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative,ready supermarket covid outbreak im paranoid f...
4,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,As news of the regionÂs first confirmed COVID...,Positive,news regions confirmed covid case came sulliv...


And to check if that caught everything

In [348]:
print(train["CleanTweet"].loc[train["CleanTweet"] == ""], "\n \n") #Print the row number that still has empty clean train tweets
print(test["CleanTweet"].loc[test["CleanTweet"] == ""]) #Print the row number that still has empty clean test tweets

Series([], Name: CleanTweet, dtype: object) 
 

Series([], Name: CleanTweet, dtype: object)


---

# Adding Numeric Sentiments and Remove Extremes

In [349]:
# Sentiments: A function to turn the word sentiments into numerical values (0, 1, 2), 0 being negative, 1 neutral, and 2 positive.
# Note: any value that is not "Negative" or "Neutral" falls through to 2, so this must run after removeExtremes
def sentiments(x):
    if x == "Negative":
        return 0
    if x == "Neutral":
        return 1
    return 2

def removeExtremes(x):
    if x == "Extremely Negative":
        return "Negative"
    if x == "Extremely Positive":
        return "Positive"
    return x

#Extremes were causing problems in the model, as it is hard to exemplify extreme to a computer
#These change the extremes to just their counterparts so it is not a necessary hurdle
train["Sentiment"] = train["Sentiment"].apply(removeExtremes)
test["Sentiment"] = test["Sentiment"].apply(removeExtremes)

train["NumSentiment"] = train["Sentiment"].apply(sentiments) #Add a column into train for numerical sentiment
test["NumSentiment"] = test["Sentiment"].apply(sentiments) #Add a column into test for numerical sentiment
test.head() #Display the test and see if it has numerical sentiment

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,CleanTweet,NumSentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Negative,trending new yorkers encounter supermarket she...,0
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive,couldnt find hand sanitizer fred meyer turned ...,2
2,3,44955,Unknown,02-03-2020,Find out how you can protect yourself and love...,Positive,find protect loved ones coronavirus,2
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative,panic buying hits newyork city anxious shopper...,0
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,toiletpaper dunnypaper coronavirus coronavirus...,1
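
The two mapping steps can also be written as dictionary lookups. This is only a sketch: `COLLAPSE`, `TO_NUMBER`, and `to_numeric` are names introduced here for illustration. Unlike `sentiments` above, which returns 2 for anything that is not Negative or Neutral, this version returns -1 for a label outside the five expected values, so bad labels are easy to spot.

```python
# Fold the two "Extremely ..." labels into their plain counterparts
COLLAPSE = {"Extremely Negative": "Negative", "Extremely Positive": "Positive"}

# Numeric code for each of the three remaining labels
TO_NUMBER = {"Negative": 0, "Neutral": 1, "Positive": 2}

def to_numeric(sentiment):
    collapsed = COLLAPSE.get(sentiment, sentiment)
    return TO_NUMBER.get(collapsed, -1)  # -1 flags an unexpected label

examples = ["Extremely Negative", "Neutral", "Extremely Positive", "typo"]
codes = [to_numeric(s) for s in examples]
print(codes)
```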


---

# Begin the pipeline

In [350]:
#Pipe for processing, copied from the kaggle course
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})
try:
    nlp.add_pipe(textcat) #Add the pipe
    print("Pipeline loaded") #Print for if the pipeline is loaded
except:
    nlp.remove_pipe("textcat") #delete the pipe to reload
    nlp.add_pipe(textcat) #Add the pipe
    print("Pipeline now loaded") #Print for if the pipeline is loaded

#Adding labels for the tweets
textcat.add_label("Negative")
textcat.add_label("Neutral")
textcat.add_label("Positive")

Pipeline loaded


1

---

# Training via the training set

Based on the code from the kaggle course Text Classification

In [351]:
#TrainData: a function to train the model to the train data. Modeled after the one in the kaggle class
#Input: the model, the training data, and an optimizer
#Output: losses
def trainData(model, data, optimize):
    losses = {} #A dictionary for the losses data
    random.seed() #Randomizing the seed of shuffling data
    random.shuffle(data) #Shuffles the data
    
    batches = minibatch(data, size=10) #Creates batches of texts
    
    #For each batch of texts
    for batch in batches:
        text, label = zip(*batch) #Unzip the labels and text
        model.update(text, label, sgd = optimize, losses = losses) #Update the model with the new data
    
    return losses #Return the losses

I had initially been trying to count sentiments like Extremely Negative and Negative separately, but that gave a loss of 32 with batches of 10. It also took over an hour. To compare, this took a few minutes to finish with a loss of about 1 at batches of 30. I guess it is hard to quantify "Extremely".

Also, I initially trained the model here and then predicted later, but I decided to have all the heavy lifting in one place.
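
For anyone unfamiliar with batching, the shuffle-then-batch step in `trainData` can be pictured with plain Python. `simple_minibatch` below is a hypothetical stand-in for a fixed-size use of spaCy's `minibatch`, and the data is made up; it only shows how the pairs are grouped and unzipped, not how the model is updated.

```python
import random

def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up (text, label) pairs in the same shape the notebook zips together
pairs = [(f"tweet {i}", {"cats": {"Positive": i % 2 == 0}}) for i in range(25)]

random.seed(0)        # fixed seed here only so the sketch is repeatable
random.shuffle(pairs)

batches = list(simple_minibatch(pairs, size=10))
batch_sizes = [len(batch) for batch in batches]

# Each batch is unzipped into parallel tuples of texts and labels
texts, labels = zip(*batches[0])

print(batch_sizes)
```

With 25 items and a batch size of 10, the last batch holds the 5 leftover items.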

---

# Prediction on Test set

In [352]:
#PredictTexts: predicts the sentiment of the tweet, from negative to positive
#Input: the model and the tweets
#Output: predictions
def predictTexts(model, texts):
    predicText = [model.tokenizer(text) for text in texts] #Tokenizes the test tweets
    textcat = model.get_pipe("textcat") #Gets the trained textcat pipe from the model instead of relying on the global variable
    scores,_ = textcat.predict(predicText) #Gets the scores from the predictions, ignoring other outputs
    classes = scores.argmax(axis = 1) #Get the highest ranked prediction score for each tweet
    return classes #Returns the predictions

In [353]:
#CheckAccuracy: checks the accuracy compared to the predictions.
#Input: the NLP model, the tweets to predict, their pre-determined labels
#Output: the accuracy of the predictions
def checkAccuracy(model, texts, labels):
    predicted = predictTexts(model, texts) #Creates predictions on the tweets
    trueVal = [2*int(label["cats"]["Positive"]) + int(label["cats"]["Neutral"]) for label in labels] #Gets the actual value of the tweets provided
    correct = 0 #A holder variable for how many predictions are correct
    total = len(predicted) #The total number of analyzed tweets
    
    #For loop, comparing predictions to their values
    for i in range(0,total):
        if trueVal[i] == predicted[i]: #If the prediction is correct
            correct+=1  #Add a point to the correct pile
    
    accuracy = correct/total #Get the accuracy of the number correct over the number total
    return accuracy #Returns the accuracy of the model
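
The `trueVal` expression is compact, so here is a small standard-library sketch of what it and the `argmax` step compute. The scores are made up, and the sketch assumes the score columns follow the order in which the labels were added (Negative, Neutral, Positive), which is what the comparison in `checkAccuracy` relies on.

```python
# Assumed column order, matching the order of the add_label calls
LABEL_ORDER = ["Negative", "Neutral", "Positive"]

def true_index(label):
    """Turn a one-hot cats dict into 0, 1, or 2."""
    cats = label["cats"]
    return 2 * int(cats["Positive"]) + int(cats["Neutral"])

def predicted_index(scores):
    """Index of the highest score, a plain-Python argmax."""
    return max(range(len(scores)), key=lambda i: scores[i])

labels = [
    {"cats": {"Negative": True, "Neutral": False, "Positive": False}},
    {"cats": {"Negative": False, "Neutral": False, "Positive": True}},
    {"cats": {"Negative": False, "Neutral": False, "Positive": True}},
]
# Made-up model scores, one row per tweet
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]

truth = [true_index(label) for label in labels]
predicted = [predicted_index(row) for row in scores]
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)

print(truth, predicted, accuracy)
```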

---

# Adding labels for the categories

In [354]:
labels = [] #Labels for the cleaned training tweet
labelsT = [] #The labels for the cleaned test tweet

#For loop to add true and false to classifications for the train set
for i in range(0,len(train)): 
    label = train["Sentiment"][i] #Get the sentiment
    
    #Categorize true false based on the labels
    if label == "Negative":
        cats = {"Negative" : True, "Neutral" : False, "Positive" : False}
    elif label == "Neutral":
        cats = {"Negative" : False, "Neutral" : True, "Positive" : False}
    else:
        cats = {"Negative" : False, "Neutral" : False, "Positive" : True}
    labels.append({'cats' : cats})

#For loop to add true and false to classifications for the test set
for i in range(0,len(test)):
    label = test["Sentiment"][i] #Get the sentiment
    
    #Categorize true false based on the labels
    if label == "Negative":
        cats = {"Negative" : True, "Neutral" : False, "Positive" : False}
    elif label == "Neutral":
        cats = {"Negative" : False, "Neutral" : True, "Positive" : False}
    else:
        cats = {"Negative" : False, "Neutral" : False, "Positive" : True}
    labelsT.append({'cats' : cats})
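
The two loops above build the same dictionaries, so they could share a helper. A possible sketch follows; `make_label` and `LABELS` are names introduced here, not part of the notebook. As in the loops, anything that is not Negative or Neutral is treated as Positive.

```python
LABELS = ("Negative", "Neutral", "Positive")

def make_label(sentiment):
    # Same fall-through as the loops: unknown values count as Positive
    target = sentiment if sentiment in ("Negative", "Neutral") else "Positive"
    return {"cats": {name: name == target for name in LABELS}}

# Made-up sentiments standing in for train["Sentiment"]
sample_sentiments = ["Negative", "Neutral", "Positive"]
sample_labels = [make_label(s) for s in sample_sentiments]

print(sample_labels[1])
```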


---

# Run the training and prediction

In [355]:
texts = train["CleanTweet"].copy() #Get the clean tweets
tokenTexts = [nlp.tokenizer(tweet) for tweet in texts] #Tokenize the training tweets
optimize = nlp.begin_training() #The optimizer, using spacy
data = list(zip(tokenTexts, labels)) #Zipping the labels and texts together

losses = trainData(nlp, data, optimize) #Train the model
accuracy = checkAccuracy(nlp, test["CleanTweet"].copy(), labelsT) #Gets the accuracy of predictions for the trained model
print("Losses: ", losses["textcat"], "Accuracy: ", accuracy) #Prints the loss when training and the accuracy

Losses:  17.024992450140417 Accuracy:  0.7623814541622761


The initial accuracy was 19%; however, that was when I was building the labels before deleting the blank tweets. Just a note about the process.
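
One way such a mismatch can arise is sketched below with made-up data: if labels are collected from every row and the blank tweets are dropped from the texts afterwards, the two lists no longer line up. This is an illustration of the ordering problem, not a reconstruction of the exact earlier code.

```python
# Made-up tweets after cleaning; two of them came out blank
tweets = ["", "good news", "", "bad news", "fine"]
sentiments = ["Neutral", "Positive", "Negative", "Negative", "Neutral"]

# Wrong order: labels taken from all rows, blanks removed from the texts later
labels_before = list(sentiments)
kept_texts = [t for t in tweets if t]
misaligned = list(zip(kept_texts, labels_before))  # zip silently truncates

# Right order: filter texts and labels together so each pair stays intact
aligned = [(t, s) for t, s in zip(tweets, sentiments) if t]

print(misaligned)
print(aligned)
```

In the misaligned version every remaining tweet is paired with the wrong sentiment, which is enough to drag accuracy well below chance.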

---

# Conclusion

The highest accuracy I achieved was 76.2% with a loss of 17.025, using a batch size of 10. This is honestly a little lower than I was hoping. I used spaCy for this first NLP project in order to follow the Kaggle course, but I believe it was a limiting factor in this case. When looking up methods and other items, I noticed NLTK and Keras popping up far more often, even with "Spacy" in the search term. This makes me think they are more common and possibly more powerful, but I will do a little more research before trying another project with one of them. Overall, however, I do think it was a good learning experience.