
# Text Preprocessing for Twitter Sentiment Analysis

In [1]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


# Imports and Constants

In [2]:
import pandas as pd
import re
import nltk
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist
import string

In [3]:
DATA_FILE_PATH = '/content/drive/MyDrive/NLPGh/'
CLEAN_DATA_FILE_NAME = 'NALS1Clean.csv'
SAVE_FILE = True
TOKENIZED_DATA_FILE_NAME = 'NALS1Tokenized.csv'

# Load Data

In [4]:
df = pd.read_csv(DATA_FILE_PATH + CLEAN_DATA_FILE_NAME)

In [5]:
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,tweet,location,pretweet,Sentiment
0,Pls add us some momo to make data 0246964913 Ã°Å¸ËâÃ°Å¸ËâÃ°Å¸Ëâ https://t.co/w5ozYUF59x,,pl add some momo make data 0246964913,0.0
1,@McVan_1 @AnnanPerry @blac4rina We will descend on @NAkufoAddo soon,Ghana,will descend soon,0.0
2,*Forgery allegations by EC is not enough to disqualify the five presidential candidates*\n\nhttps://t.co/GAkYghEbQH… https://t.co/o0pCodbuWj,,forgeri alleg not enough disqualifi the five presidenti candid,0.0
3,@NiiWills @bosompemny I don’t know how dem dey see @NAkufoAddo oo,dansoman accra,dont know how dem dey see,0.0
4,Do we have online renewal what what ka kwano?? https://t.co/3CdekJYMgr,Botswana,have onlin renew what what kwano,0.0


# Clean Tweet Text Data

* Change all text to lowercase
* Remove URLs
* Remove mentions
* Remove the placeholders {link} and \[video\]
* Remove HTML character entities such as `&amp;`
* Remove punctuation that isn't part of an emoticon or hashtag
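
The steps listed above can be sketched as one function applied to a single string. This is an illustrative consolidation rather than the notebook's own code: the name `clean_tweet` and the sample tweet are hypothetical, the bare-domain pattern is one possible simplified form, and the final whitespace collapse is an extra step added only to make the result easier to read.

```python
import re


def clean_tweet(text):
    """Apply the cleaning steps, in order, to one tweet string (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)            # URLs with a scheme
    text = re.sub(r"(?:www\.)?[a-z]+\.com", "", text)   # bare .com domains (assumed pattern)
    text = re.sub(r"@mention", "", text)                # mention placeholder
    text = re.sub(r"\{link\}", "", text)                # {link} placeholder
    text = re.sub(r"&[a-z]+;", "", text)                # HTML entities such as &amp;
    text = re.sub(r"\[video\]", "", text)               # [video] placeholder
    # keep only letters, whitespace, and characters used in emoticons or hashtags
    text = re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", "", text)
    # extra step, not in the list above: collapse the gaps left by the removals
    return " ".join(text.split())


sample = "@mention Check THIS out {link} [video] &amp; vote :) #Ghana2020 https://t.co/abc"
cleaned = clean_tweet(sample)
print(cleaned)  # check this out vote :) #ghana
```

Note that the last pattern also strips digits, which is why `#Ghana2020` becomes `#ghana`.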

In [6]:
df_clean = df.copy()  # work on a copy so df keeps the data as loaded

In [7]:
# lower case
df_clean.pretweet = df_clean.pretweet.str.lower()

In [8]:
# remove url links
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))

In [9]:
# remove urls/websites that don't use http; only .com domains are matched,
# so ordinary words that are separated by a . are not removed
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r"(?:www\.)?[a-z]+\.com", '', x))

In [10]:
# remove @mention
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r'@mention', '', x))

In [11]:
# remove {link}
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r'{link}', '', x))

In [12]:
# remove &text; html chars
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r'&[a-z]+;', '', x))

In [13]:
# [video]
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r"\[video\]", '', x))

In [14]:
# remove all remaining characters that aren't letters, white space, or
# one of the following #:;()-/\='] that are used in emoticons or hashtags
df_clean.pretweet = df_clean.pretweet.apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#]", '', x))

In [15]:
df_clean.iloc[90:100]

Unnamed: 0,tweet,location,pretweet,Sentiment
90,Are you joining me to vote for @NAkufoAddo on 7th Dec?\n#1Touch4Nana #4MoreForNana\n#AppreciateAkufoAddo https://t.co/9zcefdXQyW,Kwahu Bepong,are you join vote for th dec touchnana morefornana appreciateakufoaddo,1.0
91,@ObreAkye @NAkufoAddo I'm sure the police will be given electronic devices for that purpose.,"Tema, Ghana",im sure the polic will given electron devic for that purpos,0.0
92,@JuliusOkpei @NAkufoAddo So because he’s using the countries revenue to benefit us we can’t appreciate him erhhhh??… https://t.co/kOcWTCT4dJ,,becaus he use the countri revenu benefit cant appreci him erhhhh,0.0
93,@luielle @iamAmaBlue @NAkufoAddo Then go there,"Tema, Ghana",then there,0.0
94,@ShopAuthenticGh @Bra_Sammy20 @AbbanyawYaw @NAkufoAddo Increase the volume bro,Ghana,increas the volum bro,0.0
95,@alffyalf1 @__mbrownn @AgnesAdjei10 @gisthaphy @NAkufoAddo @O_LI_SE Hey enough of this ok. One love,"Lagos, Nigeria",hey enough thi ok one love,0.0
96,"@shattadrake @flexkgermain @NAkufoAddo US sef dey owe, make we think",,sef dey owe make think,0.0
97,Where is that stupid boy @kwadwosheldon masa u must apologize to our Great President @NAkufoAddo,"Sunyani, Ghana",where that stupid boy masa must apolog our great presid,0.0
98,@Eli_elShay @RexOmarrr @NAkufoAddo ThatÃ¢â¬â¢s wat u seeing,Ghana,that wat see,0.0
99,Please President @NAkufoAddo can you employ him in the fire service...he's saved many live at the collapsed church… https://t.co/HnHiSG6rhP,"Greater Kumasi, Ghana.",pleas presid can you employ him the fire serviceh save mani live the collaps church,0.0


# Tokenize pretweet

Use the specialized NLTK TweetTokenizer, which keeps hashtags and emoticons together as single tokens.
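
As a minimal sketch of that behaviour, the snippet below tokenizes one made-up string with the default `TweetTokenizer` settings. The sample text is an assumption chosen for illustration; it is not taken from the dataset.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

# hypothetical sample containing an emoticon and a hashtag
sample = "great job :) #ghana"
tokens = tokenizer.tokenize(sample)

# the emoticon and the hashtag each come back as one token
print(tokens)  # ['great', 'job', ':)', '#ghana']
```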

In [16]:
tknzr = TweetTokenizer()

In [17]:
df_clean['tokens'] = df_clean['pretweet'].apply(tknzr.tokenize)

In [18]:
df_clean.iloc[40:50][['pretweet', 'tokens']]

Unnamed: 0,pretweet,tokens
40,got sens like that,"[got, sens, like, that]"
41,hmm ghanaian are the caus,"[hmm, ghanaian, are, the, caus]"
42,you are nigerian pleas stay and fight end sar,"[you, are, nigerian, pleas, stay, and, fight, end, sar]"
43,excel your excel,"[excel, your, excel]"
44,ye but where your evid that they didnt their job or,"[ye, but, where, your, evid, that, they, didnt, their, job, or]"
45,wa onli you that benefit from hi free thing but not those need help abandon,"[wa, onli, you, that, benefit, from, hi, free, thing, but, not, those, need, help, abandon]"
46,had hand thi digit transform alway,"[had, hand, thi, digit, transform, alway]"
47,npp campaign your stronghold swing state oo dun let voltarian wast your campaign time yoo,"[npp, campaign, your, stronghold, swing, state, oo, dun, let, voltarian, wast, your, campaign, time, yoo]"
48,the law work tho thi collaps show some evid structur defect,"[the, law, work, tho, thi, collaps, show, some, evid, structur, defect]"
49,oy guy rough nana addo more more,"[oy, guy, rough, nana, addo, more, more]"


## Remove Punctuation From Tokens

The tweet tokenizer keeps the characters of common emoticons together as one token, but any other punctuation ends up as a stand-alone token that needs to be removed.

In [19]:
PUNCTUATION_LIST = list(string.punctuation)

In [20]:
def remove_punctuation(word_list):
    """Remove punctuation tokens from a list of tokens"""
    return [w for w in word_list if w not in PUNCTUATION_LIST]

In [21]:
df_clean['tokens'] = df_clean['tokens'].apply(remove_punctuation)
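
To illustrate why this filter leaves emoticons and hashtags alone, here is a small self-contained sketch. It only drops tokens that are exactly one punctuation character, so multi-character tokens survive. The helper name `drop_punctuation_tokens` and the token list are hypothetical.

```python
import string

# single punctuation characters, e.g. ',', '!', ':'
PUNCTUATION = set(string.punctuation)


def drop_punctuation_tokens(tokens):
    """Keep every token that is not a lone punctuation character."""
    return [t for t in tokens if t not in PUNCTUATION]


example = ["vote", ",", ":)", "#ghana", "!", "don't"]
filtered = drop_punctuation_tokens(example)
print(filtered)  # ['vote', ':)', '#ghana', "don't"]
```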

# Create Corpus

In [22]:
# flatten the per-tweet token lists into one corpus-wide list
corpus_tokens = [token for tokens in df_clean['tokens'] for token in tokens]

# Check Frequency Distribution

In [23]:
corpus_freq_dist = FreqDist(corpus_tokens)

In [24]:
len(corpus_freq_dist)

5622

How many words appear only once?

In [25]:
only_one_instance = [w for w in corpus_freq_dist.most_common() if w[1] == 1]

In [26]:
len(only_one_instance)

3013

More than half of the distinct words in the corpus (3,013 of 5,622, about 54%) appear only once.
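
The same check can be sketched on a toy token list with `collections.Counter`, which `FreqDist` subclasses. The toy tokens are invented for illustration; the last line simply redoes the arithmetic on the two counts reported above.

```python
from collections import Counter

# hypothetical toy corpus, not the real data
toy_tokens = ["the", "vote", "the", "ghana", "vote", "the", "momo", "kwano"]
freq = Counter(toy_tokens)

# words seen exactly once
once = sorted(w for w, count in freq.items() if count == 1)
share_once = len(once) / len(freq)
print(once, share_once)  # ['ghana', 'kwano', 'momo'] 0.6

# share for the counts reported in this notebook
reported_share = round(3013 / 5622, 3)
print(reported_share)  # 0.536
```

`FreqDist` also offers a `hapaxes()` method that returns the words occurring once, which could replace the list comprehension in the cell above.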

How many words appear at least 5 times?

In [27]:
at_least_five = [w for w in corpus_freq_dist.most_common() if w[1] >= 5]

In [28]:
len(at_least_five)

1231

In [29]:
at_least_five[:50]

[('the', 2159),
 ('you', 1585),
 ('and', 1015),
 ('for', 806),
 ('your', 607),
 ('are', 581),
 ('thi', 558),
 ('presid', 527),
 ('that', 513),
 ('ghana', 381),
 ('not', 345),
 ('will', 337),
 ('what', 334),
 ('have', 328),
 ('elect', 303),
 ('with', 276),
 ('they', 273),
 ('peopl', 272),
 ('all', 270),
 ('congratul', 263),
 ('dont', 260),
 ('but', 234),
 ('hi', 222),
 ('pleas', 221),
 ('more', 217),
 ('our', 213),
 ('nana', 204),
 ('wa', 202),
 ('it', 202),
 ('ha', 201),
 ('vote', 201),
 ('can', 193),
 ('who', 181),
 ('know', 174),
 ('god', 173),
 ('from', 173),
 ('one', 166),
 ('how', 162),
 ('like', 161),
 ('ghanaian', 156),
 ('job', 154),
 ('see', 153),
 ('say', 147),
 ('whi', 147),
 ('about', 143),
 ('good', 138),
 ('just', 138),
 ('him', 137),
 ('come', 137),
 ('npp', 136)]

This group makes up more than one fifth of the vocabulary (1,231 of 5,622 distinct words) and contains many stop words that would typically be removed from text. However, because a tweet is limited to a small number of characters, each word a person uses is of potential value for sentiment analysis.

Additionally, according to a study on removing stop words from tweets for sentiment analysis, removing them degrades classification performance; see [link](https://www.aclweb.org/anthology/L14-1265/).
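
When weighing how much these frequent words matter, it can help to separate their share of the vocabulary from their share of all token occurrences. The sketch below computes both on an invented toy list; `frequent_word_shares` is a hypothetical helper, and no token-level share is claimed here for the real corpus.

```python
from collections import Counter


def frequent_word_shares(tokens, min_count=5):
    """Return (share of distinct words, share of token occurrences)
    covered by words that appear at least min_count times."""
    freq = Counter(tokens)
    frequent = {w: c for w, c in freq.items() if c >= min_count}
    vocab_share = len(frequent) / len(freq)
    token_share = sum(frequent.values()) / sum(freq.values())
    return vocab_share, token_share


# hypothetical toy corpus: two frequent words and three rare ones
toy = ["the"] * 6 + ["you"] * 5 + ["momo", "kwano", "erhhhh"]
vocab_share, token_share = frequent_word_shares(toy)
print(vocab_share, round(token_share, 3))  # 0.4 0.786

# vocabulary share for the counts reported in this notebook
reported_vocab_share = round(1231 / 5622, 3)
print(reported_vocab_share)  # 0.219
```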

# Save Cleaned and Tokenized Data

In [30]:
if SAVE_FILE:
    df_clean.to_csv(DATA_FILE_PATH + TOKENIZED_DATA_FILE_NAME, index=False)
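
One thing to keep in mind when this file is read back: a CSV stores each token list as its string representation, so the `tokens` column comes back as text. The sketch below shows one way to restore the lists with `ast.literal_eval`, using an in-memory buffer and a made-up two-row frame in place of the real file.

```python
import ast
import io

import pandas as pd

# hypothetical stand-in for the tokenized DataFrame
frame = pd.DataFrame({"tokens": [["vote", ":)"], ["#ghana"]]})

# write to an in-memory buffer instead of Google Drive
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
was_string = isinstance(reloaded.loc[0, "tokens"], str)  # lists were saved as text

# parse each string back into a Python list
reloaded["tokens"] = reloaded["tokens"].apply(ast.literal_eval)
restored = reloaded["tokens"].tolist()
print(restored)  # [['vote', ':)'], ['#ghana']]
```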