### Text Preprocessing

Load covid tweets 

In [1]:
import pandas as pd
import numpy as np
import re

tweets_covid = pd.read_csv("covid19_tweets.csv")
tweets_covid.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False


### Data Preparation
To clean the data we need to remove the **links**, **punctuation**, **numbers**, **emojis**, and **stop words**. We will utilize nltk's english stopword and wordnet databases to filter out unwanted words and then normalize the rest.

In [3]:
import nltk
from nltk.corpus import stopwords

# Download nltk's databases
nltk.download('all')

def remove_stopwords(txt):
    """Remove stopwords from the input text

    Args:
        txt (str): the input text to filter

    Returns:
        str: the filtered text, with all stopwords removed
    """
    words = txt.lower().split()
    non_stopwords = [word for word in words if word not in stop_words]
    non_stopwords = ' '.join(non_stopwords)
    return non_stopwords

# Create a hash-set containing all stopwords
# which automatically guarantees word uniqueness
stop_words = set(stopwords.words('english'))

# Filter out links
tweets_covid['clean_text'] = tweets_covid['text'].apply(lambda s: ' '.join(re.sub("(w+://S+)", " ", s).split()))
# Filter out punctuation
tweets_covid["clean_text"] = tweets_covid["clean_text"].apply(lambda s: ' '.join(re.sub("[.,!?:;-='...@#_]", " ", s).split()))
# Filter out numerical values
tweets_covid["clean_text"] = tweets_covid["clean_text"].apply(lambda s: ' '.join(re.sub("\d", "", s).split()))
# Filter out emojis, first turn into ascii and then back to utf
tweets_covid["clean_text"] = tweets_covid["clean_text"].apply(lambda s: s.encode('ascii', 'ignore').decode('ascii'))
# Filter out stopwords
tweets_covid["clean_text"] = tweets_covid["clean_text"].apply(lambda s: remove_stopwords(s))
# Print sample output
tweets_covid[['text', 'clean_text']]

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\nikos\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\nikos\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\nikos\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\nikos\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\nikos\AppData\Roaming\nltk_data...
[nltk_data]    | 

Unnamed: 0,text,clean_text
0,If I smelled the scent of hand sanitizers toda...,smelled scent hand sanitizers today someone pa...
1,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,hey yankees yankeespr mlb - made sense players...
2,@diane3443 @wdunlap @realDonaldTrump Trump nev...,diane wdunlap realdonaldtrump trump never clai...
3,@brookbanktv The one gift #COVID19 has give me...,brookbanktv one gift covid give appreciation s...
4,25 July : Media Bulletin on Novel #CoronaVirus...,july media bulletin novel coronavirusupdates c...
...,...,...
179103,Thanks @IamOhmai for nominating me for the @WH...,thanks iamohmai nominating wearamask challenge...
179104,2020! The year of insanity! Lol! #COVID19 http...,year insanity lol covid https //t co/ynpyzgn
179105,@CTVNews A powerful painting by Juan Lucena. I...,ctvnews powerful painting juan lucena tribute ...
179106,"More than 1,200 students test positive for #CO...",students test positive covid major university ...


Tokenize the clean text

In [4]:
# Split each line on every whitespace
tweets_covid['clean_text'] = tweets_covid['clean_text'].apply(lambda s: s.split())
tweets_covid[['text', 'clean_text']]

Unnamed: 0,text,clean_text
0,If I smelled the scent of hand sanitizers toda...,"[smelled, scent, hand, sanitizers, today, some..."
1,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,"[hey, yankees, yankeespr, mlb, -, made, sense,..."
2,@diane3443 @wdunlap @realDonaldTrump Trump nev...,"[diane, wdunlap, realdonaldtrump, trump, never..."
3,@brookbanktv The one gift #COVID19 has give me...,"[brookbanktv, one, gift, covid, give, apprecia..."
4,25 July : Media Bulletin on Novel #CoronaVirus...,"[july, media, bulletin, novel, coronavirusupda..."
...,...,...
179103,Thanks @IamOhmai for nominating me for the @WH...,"[thanks, iamohmai, nominating, wearamask, chal..."
179104,2020! The year of insanity! Lol! #COVID19 http...,"[year, insanity, lol, covid, https, //t, co/yn..."
179105,@CTVNews A powerful painting by Juan Lucena. I...,"[ctvnews, powerful, painting, juan, lucena, tr..."
179106,"More than 1,200 students test positive for #CO...","[students, test, positive, covid, major, unive..."


### Text Normalization

At this stage we want to convert words to their base form. This will produce the root form of all words, which will help our models later on during training and inference.

In [5]:
from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
# Apply text normalization
tweets_covid['clean_text'] = tweets_covid['clean_text'].apply(lambda tokens: [lemmatiser.lemmatize(token, pos='v') for token in tokens])
tweets_covid[['text', 'clean_text']]

Unnamed: 0,text,clean_text
0,If I smelled the scent of hand sanitizers toda...,"[smell, scent, hand, sanitizers, today, someon..."
1,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,"[hey, yankees, yankeespr, mlb, -, make, sense,..."
2,@diane3443 @wdunlap @realDonaldTrump Trump nev...,"[diane, wdunlap, realdonaldtrump, trump, never..."
3,@brookbanktv The one gift #COVID19 has give me...,"[brookbanktv, one, gift, covid, give, apprecia..."
4,25 July : Media Bulletin on Novel #CoronaVirus...,"[july, media, bulletin, novel, coronavirusupda..."
...,...,...
179103,Thanks @IamOhmai for nominating me for the @WH...,"[thank, iamohmai, nominate, wearamask, challen..."
179104,2020! The year of insanity! Lol! #COVID19 http...,"[year, insanity, lol, covid, https, //t, co/yn..."
179105,@CTVNews A powerful painting by Juan Lucena. I...,"[ctvnews, powerful, paint, juan, lucena, tribu..."
179106,"More than 1,200 students test positive for #CO...","[students, test, positive, covid, major, unive..."
