## Preprocessing

In this lab, we explore how to preprocess tweets for sentiment analysis. This week's assignment provides a ready-made preprocessing function, but it is still worth knowing what happens under the hood. By the end of this lab, you will have seen how to use the NLTK package to build a preprocessing pipeline for Twitter datasets.

# Import Libraries

In [1]:
import nltk
from nltk.corpus import twitter_samples
import numpy as np
import re
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings
from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming

stem = PorterStemmer()  # stemmer instance reused in the stemming step

# About the Twitter dataset

The sample dataset from NLTK is split into positive and negative tweets, with exactly 5000 of each. This exact match is not a coincidence: the dataset was built to be balanced. The balance does not reflect the real distribution of positive and negative tweets in live Twitter streams. It exists because balanced datasets simplify the design of most computational methods used for sentiment analysis. Keep in mind that this balance of classes is artificial.

You can download the dataset to your workspace (or to your local computer) by running:

In [2]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [3]:
Postive_Samples=twitter_samples.strings('positive_tweets.json')
Negative_Samples=twitter_samples.strings('negative_tweets.json')

print(len(Postive_Samples))
print(len(Negative_Samples))

5000
5000


# Looking at raw texts

Before anything else, we can print a couple of tweets from the dataset to see what they look like. Understanding the data is often said to be responsible for 80% of the success or failure of data science projects. We can use this time to note aspects we would like to handle when preprocessing our data.

Below, you will print one random positive tweet followed by one random negative tweet. (Warning: This is taken from a public dataset of real tweets, and a very small portion of them contains explicit content.)

In [4]:
Postive_Samples_Random=Postive_Samples[np.random.randint(0,5000)]
Negative_Samples_Random=Negative_Samples[np.random.randint(0,5000)]
print(Postive_Samples_Random)
print(Negative_Samples_Random)

Last day at work! This time for real :D No more summer jobs! But school starts in three weeks :/ #mysummer #happy
I feel like I'm a weird person for shipping Bellamy &amp; Raven, because everyone ships him with Clarke, but the heart wants what it wants :(


In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

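# Preprocessing

We pick one tweet and apply each step in turn: remove hash signs and hyperlinks, tokenize, remove stop words and punctuation, and stem.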

In [6]:
tweet=Postive_Samples[2277]
print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


In [7]:
# remove the hash sign (#) but keep the word that follows it
hash_remove_tweet = re.sub(r'#', '', tweet)
print(hash_remove_tweet)

# remove hyperlinks from the text
links_remove_tweet = re.sub(r'https?://\S+', '', hash_remove_tweet)
print(links_remove_tweet)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… https://t.co/3tfYom0N1i
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 
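
The two substitutions above can be wrapped into one helper, as a minimal sketch. The function name `strip_hashes_and_links` is hypothetical; it is not part of NLTK or of the function provided with the assignment, and it relies only on the standard `re` module.

```python
import re

def strip_hashes_and_links(text):
    """Remove '#' signs and http(s) links from a tweet (illustrative helper)."""
    # Drop only the '#' character so the hashtag word itself is kept.
    text = re.sub(r'#', '', text)
    # Drop hyperlinks: 'http://' or 'https://' followed by non-whitespace.
    text = re.sub(r'https?://\S+', '', text)
    return text

sample = "Sunny morning :) #happy https://t.co/3tfYom0N1i"
cleaned = strip_hashes_and_links(sample)
print(repr(cleaned))
```

Note that the whitespace that preceded the link is left in place; the tokenizer in the next step ignores it.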


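# Tokenize the String

Note that `TweetTokenizer()` keeps the original casing by default, so the tokens below are not lowercased.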

In [8]:
Tokenize=TweetTokenizer()
Tokenize_words=Tokenize.tokenize(links_remove_tweet)
print(Tokenize_words)

['My', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'Friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'Friday', 'off', '…']
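
`TweetTokenizer` accepts a few constructor options that are often useful for tweets. The sketch below is not what the cell above runs: the example string is made up, and the combination of `preserve_case`, `strip_handles`, and `reduce_len` is just one possible configuration.

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases tokens (emoticons are left untouched),
# strip_handles=True drops @user mentions,
# reduce_len=True shortens long runs of a repeated character to three.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

example = "@remy This is waaaaayyyy too much for you :D"
tokens = tokenizer.tokenize(example)
print(tokens)
```

Lowercasing at this stage matters later, because the NLTK stop-word list contains only lowercase entries.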


# Remove Stop Words and Punctuation

In [9]:
stop_words=stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
len(Tokenize_words)

16

In [11]:
import string

remove_stop_words = []
for word in Tokenize_words:
    # keep tokens that are neither stop words nor punctuation
    # (the stop-word check is case-sensitive)
    if word not in stop_words and word not in string.punctuation:
        remove_stop_words.append(word)

print(remove_stop_words)

['My', 'beautiful', 'sunflowers', 'sunny', 'Friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'Friday', '…']
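
Notice that 'My' survives the filter: the NLTK stop-word list is all lowercase, while the tokens kept their original casing. One possible fix, sketched below, is to compare the lowercased token against the list. To keep the snippet self-contained, `sample_stop_words` is a small hand-picked stand-in for `stopwords.words('english')`, and the token list is copied from the tokenizer output earlier in this notebook.

```python
import string

# Tokens copied from the tokenizer output earlier in this notebook.
tokens = ['My', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'Friday',
          'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy',
          'Friday', 'off', '…']

# Assumption: a tiny stand-in for stopwords.words('english').
sample_stop_words = {'my', 'on', 'a', 'off'}

filtered = [
    tok for tok in tokens
    # Case-insensitive stop-word check.
    if tok.lower() not in sample_stop_words
    # string.punctuation is a str, so this is a substring test;
    # it drops single ASCII punctuation characters.
    and tok not in string.punctuation
]
print(filtered)
```

The emoticon ':)' and the ellipsis '…' are kept, since neither is a substring of `string.punctuation`.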


# Stemming

In [12]:
tweet_clean=[]
for words in remove_stop_words:
    clean_words=stem.stem(words)
    tweet_clean.append(clean_words)
    
print(tweet_clean)

['My', 'beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
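
Putting the steps together, the whole pipeline might look like the sketch below. The function name `process_tweet` and its `stop_words` parameter are assumptions made for illustration; this is not the function shipped with the assignment. Unlike the cells above, it lowercases tokens during tokenization so that capitalized stop words are removed as well.

```python
import re
import string

from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet, stop_words):
    """Clean, tokenize, filter, and stem one tweet (illustrative sketch)."""
    # 1. Remove '#' signs and hyperlinks.
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'https?://\S+', '', tweet)

    # 2. Tokenize and lowercase (emoticons keep their form).
    tokenizer = TweetTokenizer(preserve_case=False)
    tokens = tokenizer.tokenize(tweet)

    # 3. Drop stop words and ASCII punctuation, then stem what is left.
    stemmer = PorterStemmer()
    return [
        stemmer.stem(tok) for tok in tokens
        if tok not in stop_words and tok not in string.punctuation
    ]

# In the notebook you would pass stopwords.words('english');
# a small stand-in set keeps this snippet free of corpus downloads.
demo_stop_words = {'my', 'on', 'a', 'off'}
demo_tweet = "My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #happy"
result = process_tweet(demo_tweet, demo_stop_words)
print(result)
```

Passing the stop words in as an argument is a design choice for the sketch: it lets you swap in a different list without touching the function.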
