This notebook explores tweets scraped with the query hashtag "#COVID19". We first preprocess the tweets, then produce some visualisations, and later investigate whether there is a correlation between COVID19 and scarcity and, if so, try to quantify it. One issue with this approach is that we need a benchmark to compare against; without one, the values we report are effectively useless. A better approach might be to query the hashtag "#Hunger" or "#Poverty" and count how many associations there are with coronavirus. Again, past data would come in handy. A Premium Twitter Developer account may help, and we could form a partnership with the right people to support this analysis of tweets. This notebook contains more visualisation than NLP.


Author: Steven Vuong<br>
Last Edited: 11-05-2020

In [1]:
# import libraries we will be using
import pandas as pd

In [2]:
csv_savepath = "../../data/tweets.csv"
tweets_df = pd.read_csv(csv_savepath) # Of query "#COVID19"

In [3]:
tweets_df.head()

Unnamed: 0,tweet_creation_date,tweet_text,tweet_retweet_count,tweet_favourite_count,tweet_hashtags,user_follow_count,user_created_at,user_verified
0,2020-05-11 06:52:34,RT @Parvez_Iftikhar: I tried to tell Sama TV t...,5,0,[],1180,2015-07-10 10:57:30,False
1,2020-05-11 06:52:33,All crowd and no social distancing at Paris ga...,0,0,[],83205,2013-01-04 14:19:42,True
2,2020-05-11 06:52:33,RT @HackneyAbbott: Muddled messaging from @Bor...,289,0,['coronavirus'],90,2015-09-27 08:00:46,False
3,2020-05-11 06:52:33,RT @OpenOrphan: We are very pleased to announc...,1,0,[],266,2016-07-08 16:01:44,False
4,2020-05-11 06:52:33,"For @NicolaSturgeon Alert, adjective meaning \...",0,0,[],273,2010-11-03 19:07:27,False
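
The benchmark concern raised in the introduction can be made concrete with a simple share: the fraction of tweets that mention a scarcity-related term. The sketch below is only an illustration under stated assumptions. The keyword list `SCARCITY_TERMS`, the helper `share_mentioning` and the sample texts are all hypothetical and are not part of the original notebook.

```python
import re

# Hypothetical keyword list, chosen for illustration only.
SCARCITY_TERMS = ("hunger", "poverty", "shortage", "scarcity")


def share_mentioning(texts, terms=SCARCITY_TERMS):
    """Return the fraction of texts containing at least one term as a whole word."""
    if not texts:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# Made-up sample texts standing in for the scraped tweets.
sample_texts = [
    "Food shortage reported as lockdown continues #COVID19",
    "Stay home, stay safe #COVID19",
    "Poverty is rising during the pandemic",
    "New testing centre opens today",
]
covid_share = share_mentioning(sample_texts)
```

With the notebook's data, the same function could presumably be called on `tweets_df["tweet_text"].dropna().tolist()`. The resulting share only becomes meaningful when compared with the share computed on a baseline sample, for example tweets from before the pandemic, which is the benchmark the introduction asks for.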


We will try to get some ideas for preprocessing from other notebooks/kernels.

-  Idea: Try to predict whether a tweet contains the hashtag "#Hunger" or a related one. We can use the tweet text and other data to make predictions, then ensemble them. Brilliant template for model training: https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
-  Preprocessing: https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing
-  Visualisation: https://www.kaggle.com/duttadebadri/detailed-nlp-project-prediction-visualization
    - Histogram/barplots for frequency
    - Wordcloud (Hoping to do post processing)
    - Word correlation map
    
Also to consider: unigrams/bigrams/trigrams
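
As a rough sketch of the unigram/bigram/trigram idea, n-grams can be counted with `collections.Counter` over whitespace-split tokens. The helper name `ngram_counts` and the sample texts are hypothetical. A real run would presumably use the preprocessed tweet text instead.

```python
from collections import Counter


def ngram_counts(texts, n=2):
    """Count n-grams over lowercased, whitespace-tokenised texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zipping n staggered views of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up sample texts for illustration.
sample_texts = ["stay home stay safe", "stay home save lives"]
unigrams = ngram_counts(sample_texts, n=1)
bigrams = ngram_counts(sample_texts, n=2)
trigrams = ngram_counts(sample_texts, n=3)
```

Each key is a tuple of tokens, so `bigrams.most_common(10)` gives the ten most frequent bigrams, which could feed the frequency barplots listed above.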


In [134]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords, \
    wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# from collections import Counter # Used later for frequent and rare words
import re
from language_dicts.emoticons_dict import EMOTICONS
from language_dicts.emojis_dict import UNICODE_EMO

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/steven/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/steven/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/steven/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [135]:
class TextPreprocessor:
    """Class containing typical functions for text preprocessing.
    Can swap ordering of preprocessing functions if so desired.
    Args:
        - text(str)
    Note:
        - Appears lemmatize is more effective than stemming
    Ref:
        - https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing
    """
    def __init__(self, text: str):
        self.text = text
        
    def _make_lowercase(self):
        self.text = self.text.lower()
        
    def _remove_punctuation(self):
        """Note: We do not remove '@' because it is used to reference
        other Twitter accounts."""
        # '_' is removed too, so handles such as @parvez_iftikhar lose their underscore
        punctuation_to_remove = "\"!#$%&'()*+,-./:;<=>?[\\]^_{|}~`"
        self.text = self.text.translate(
            str.maketrans('', '', punctuation_to_remove)
        )
    
    def _remove_stopwords(self):
        nltk_stopwords = set(stopwords.words('english'))
        self.text = " ".join([
            word for word in str(self.text).split() if word not in nltk_stopwords
        ])
        
    def _stem(self):
        """Note: Can also use SnowballStemmer for languages other than English"""
        stemmer = PorterStemmer()
        self.text = " ".join([stemmer.stem(word) for word in self.text.split()])
        
    def _lemmatize(self):
        """Note: Different Lemmatizing options: N (noun), 
        V (verb), J (adjective) and R (adverb)
        """
        lemmatizer = WordNetLemmatizer()
        wordnet_map = {
            "N":wordnet.NOUN, 
            "V":wordnet.VERB, 
            "J":wordnet.ADJ, 
            "R":wordnet.ADV
        }
        pos_tagged_text = nltk.pos_tag(self.text.split())
        self.text = " ".join([
            lemmatizer.lemmatize(
                word, wordnet_map.get(pos[0], wordnet.NOUN)
            ) for word, pos in pos_tagged_text
        ])
        
    def _remove_emojis(self):
        """Ref: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b"""
        emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE
                                  )
        self.text = emoji_pattern.sub(r'', self.text)
        
    def _convert_emojis_to_words(self):
        for emoji in UNICODE_EMO:
            self.text = re.sub(
                r'('+emoji+')', 
                "_".join(UNICODE_EMO[emoji].replace(",","").replace(":","").split()), 
                self.text
            )
        
    def _remove_emoticons(self):
        emoticon_pattern = re.compile(
            u'(' + u'|'.join(k for k in EMOTICONS) + u')'
        )
        self.text = emoticon_pattern.sub(r'', self.text)
        
    def _convert_emoticons_to_words(self):
        for emoticon in EMOTICONS:
            self.text = re.sub(
                u'('+emoticon+')', 
                "_".join(EMOTICONS[emoticon].replace(",","").split()), 
                self.text
            )
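

The emoticon methods above build regular expressions directly from the dictionary keys. If those keys are literal emoticon strings rather than pre-escaped patterns (an assumption, since the contents of `language_dicts` are not shown here), characters such as `(` and `)` need escaping. The sketch below shows one way to do that, using a small hypothetical dictionary `SAMPLE_EMOTICONS` in place of `EMOTICONS`. If the real keys are already regex patterns, `re.escape` should not be applied.

```python
import re

# Hypothetical stand-in for EMOTICONS; keys are literal strings here.
SAMPLE_EMOTICONS = {
    ":)": "Happy face",
    ":-)": "Happy face",
    ":-))": "Very happy",
    ":(": "Sad face",
}


def build_emoticon_pattern(emoticons):
    """Compile one alternation pattern with every key escaped."""
    # Longest keys first so that ":-))" is tried before ":-)".
    keys = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


def remove_emoticons(text, emoticons=SAMPLE_EMOTICONS):
    return build_emoticon_pattern(emoticons).sub("", text)


def convert_emoticons_to_words(text, emoticons=SAMPLE_EMOTICONS):
    pattern = build_emoticon_pattern(emoticons)
    # Replace each match with its description, joined by underscores.
    return pattern.sub(
        lambda m: "_".join(emoticons[m.group(0)].replace(",", "").split()),
        text,
    )


cleaned = remove_emoticons("great news :-)) see you soon :(")
worded = convert_emoticons_to_words("great news :)")
```

Compiling a single pattern also avoids looping over the whole dictionary and calling `re.sub` once per key.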




In [137]:
print("-----------Sample Text-----------")
txt_sample = TextPreprocessor(tweets_df["tweet_text"][0])
print(txt_sample.text)

# Make lowercase
txt_sample._make_lowercase()
print("\n-----------Make Lowercase-----------")
print(txt_sample.text)

# Remove punctuation
txt_sample._remove_punctuation()
print("\n-----------Remove Punctuation-----------")
print(txt_sample.text)

# Remove stopwords
txt_sample._remove_stopwords()
print("\n-----------Remove Stopwords-----------")
print(txt_sample.text)

# Lemmatize words
txt_sample._lemmatize()
print("\n-----------Lemmatize-----------")
print(txt_sample.text)

# Remove emojis
txt_sample._remove_emojis()
print("\n-----------Remove Emoji-----------")
print(txt_sample.text)

# Remove emoticons
txt_sample._remove_emoticons()
print("\n-----------Remove Emoticons-----------")
print(txt_sample.text)


-----------Sample Text-----------
RT @Parvez_Iftikhar: I tried to tell Sama TV that our 4G is badly in need of improvement. Govt needs to give some spectrum plus some tax re…

-----------Make Lowercase-----------
rt @parvez_iftikhar: i tried to tell sama tv that our 4g is badly in need of improvement. govt needs to give some spectrum plus some tax re…

-----------Remove Punctuation-----------
rt @parveziftikhar i tried to tell sama tv that our 4g is badly in need of improvement govt needs to give some spectrum plus some tax re…

-----------Remove Stopwords-----------
rt @parveziftikhar tried tell sama tv 4g badly need improvement govt needs give spectrum plus tax re…

-----------Lemmatize-----------
rt @parveziftikhar try tell sama tv 4g badly need improvement govt need give spectrum plus tax re…

-----------Remove Emoji-----------
rt @parveziftikhar try tell sama tv 4g badly need improvement govt need give spectrum plus tax re…

-----------Remove Emoticons-----------
rt @parveziftikha

Notes (for this use case):

- We opt for lemmatizing rather than stemming in this use case, because the example output looks better.

We will opt for removing emoticons and emojis rather than
converting them to words. We just want a very simple
model to begin with, and emoticons and emojis may not
feature prominently in these tweets. Conversion may be useful for other use cases.

In [None]:
# Make a plot function -> histogram, wordcloud, etc.
# Or does that belong in a separate function? Maybe a class with static
# methods.
# Removing frequent words has to be performed outside the preprocessor class,
# since all the words must be analysed first and then removed. The same
# applies to removing rare words.
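
Since frequent and rare words can only be identified after looking at the whole corpus, this step could be sketched as a corpus-level function like the one below. The function name `remove_frequent_and_rare`, its thresholds and the sample texts are hypothetical and would need tuning on the real tweets.

```python
from collections import Counter


def remove_frequent_and_rare(texts, n_frequent=1, min_count=2):
    """Drop the n_frequent most common words and words seen fewer than min_count times."""
    # First pass: count every word across the whole corpus.
    counts = Counter(word for text in texts for word in text.split())
    frequent = {word for word, _ in counts.most_common(n_frequent)}
    rare = {word for word, count in counts.items() if count < min_count}
    to_drop = frequent | rare
    # Second pass: filter each text against the combined drop set.
    return [
        " ".join(word for word in text.split() if word not in to_drop)
        for text in texts
    ]


# Made-up, already preprocessed sample texts.
sample_texts = [
    "covid19 lockdown food",
    "covid19 food bank",
    "covid19 lockdown news",
]
filtered = remove_frequent_and_rare(sample_texts)
```

Here the query term itself is the most frequent word, which is one reason to remove frequent words before plotting: it would otherwise dominate any histogram or wordcloud.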


In [129]:
# Also consider the Stanford NER (named entity recognition) tagger
# https://towardsdatascience.com/tweet-analytics-using-nlp-f83b9f7f7349
# from nltk.tag import StanfordNERTagger
# from nltk.tokenize import word_tokenize

# st = StanfordNERTagger('/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz','/usr/share/stanford-ner/stanford-ner.jar',encoding='utf-8')
# #a tweet by Donald Trump
# text = 'Just had a very good call with @SwedishPM Stefan Löfven who assured me that American citizen A$AP Rocky will be treated fairly. Likewise, I assured him that A$AP was not a flight risk and offered to personally vouch for his bail, or an alternative....'

# tokenized_text = word_tokenize(text)
# classified_text = st.tag(tokenized_text)

# print(classified_text)