# Capstone project 2: Sentiment analysis of tweets

**Table of Contents**
- Introduction

- Section A: Preparing the Test set 
    - Step A.1: Getting the authentication credentials
    - Step A.2: Authenticating our python script
    - Step A.3: Creating the function to build the Test set
    
- Section B: Preparing the Training set 

- Section C: Pre-processing Tweets in the Data Sets (both Test and Training)

- Section D: Naive Bayes Classifier 
    - Step D.1: Build a vocabulary/list of words in our training data set 
    - Step D.2: Match tweet content against our vocabulary 
    - Step D.3: Build our word feature vector 
    - Step D.4: Training the classifier 
    
- Section E: Testing the model 
- Section F: Measuring the model performance

- Conclusion

## Introduction

**What is a Sentiment Analysis and why is it useful?**

Sentiment Analysis is the use of Natural Language Processing to determine the attitude, opinions and emotions of a speaker, writer or other subject within an online mention. In other words, it is the process of determining whether a piece of writing is positive or negative. A human can readily recognize and classify a text as positive or negative. A computer cannot do this out of the box, but it can learn to do so.

**On which topic?**

I would like to perform a sentiment analysis on Tweets about a recent blockbuster release: Joker, with Joaquin Phoenix. Released in the USA at the beginning of October 2019 and in theaters around the world afterwards, Joker has undoubtedly divided its audience. Whether people loved or hated it, the reactions were numerous. I saw it myself at the movies in Switzerland as soon as it came out, and felt exactly this way: divided. Since the movie had won the Golden Lion at the Venice Film Festival, I had high expectations. The complex personality of the Joker was one of the reasons I wanted to see the movie. But after watching it, I realized it had a stronger impact on me than expected, and not only a positive one. The reviews I read afterwards were also split between positive and negative. Therefore, I thought it would be interesting to train a classification model on Tweets about Joker.

**Who is interested?**

Different types of people could be interested in knowing the proportion of positive vs negative tweets on this topic: fans of Joaquin Phoenix and of The Dark Knight Rises, owners of movies, producers, script writers, psychologists and psychiatrists.

**Methodology**

I will be using the Twitter API to collect a Test set based on keywords. A function will return a list of tweets that contain the selected keywords. Each tweet’s text will then be assigned a label (‘positive’ or ‘negative’) by the classifier. The Training set will be downloaded instead, because it has to consist of a large number of tweets that are already labelled as ‘positive’ or ‘negative’. The Training set is critical to the success of the model, since the model will “learn” how to perform sentiment analysis from it.

In [69]:
import numpy as np
import tweepy, json
import twitter
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import time
import ssl
import requests
import csv
import pickle
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords

In [2]:
from requests.exceptions import Timeout, ConnectionError
from requests.packages.urllib3.exceptions import ReadTimeoutError

## Section A: Preparing The Test Set

**Step A.1: Getting the authentication credentials**

First off, we need to visit the Twitter Developer website, log into our account, and apply for the Twitter API. This takes a couple of clicks to execute, and wait a few hours for the approval from Twitter. 

**Step A.2: Authenticating our Python script**

Since we now have our Twitter Developer credentials (i.e. API keys and access tokens), we can proceed to authenticating our program. We have already imported the twitter library, so we create a twitter.Api object with the credentials, which should be kept in a safe place rather than shared, as follows:

In [3]:
import os

# Credentials are read from environment variables (the variable names are placeholders)
# rather than being hard-coded in the notebook.
api = twitter.Api(consumer_key=os.environ['TWITTER_CONSUMER_KEY'],
                  consumer_secret=os.environ['TWITTER_CONSUMER_SECRET'],
                  access_token_key=os.environ['TWITTER_ACCESS_TOKEN_KEY'],
                  access_token_secret=os.environ['TWITTER_ACCESS_TOKEN_SECRET'])

In [4]:
print(api.VerifyCredentials())

{"created_at": "Thu Jul 30 07:50:24 +0000 2015", "default_profile": true, "favourites_count": 2, "followers_count": 6, "friends_count": 63, "id": 3395163311, "id_str": "3395163311", "name": "Zeballos Coline", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_image_url": "http://pbs.twimg.com/profile_images/958331092867665920/qtDfnmPi_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/958331092867665920/qtDfnmPi_normal.jpg", "profile_link_color": "1DA1F2", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "screen_name": "colinezeballos", "status": {"created_at": "Tue Jan 30 13:54:39 +0000 2018", "id": 958337667451772929, "id_str": "958337667451772929", "lang": "fr", "retweet_count": 550, "re

The last line in the previous code snippet is only there to verify that our API instance works. 

**Step A.3: Creating the function to build the Test set**

Now we can start on making a function that downloads the Test set that we talked about. Basically, this is going to be a function that takes a search keyword (i.e. string) as an input, searches for tweets that include this keyword and returns them as twitter.Status objects that we can iterate through.

The caveat here, though, is that Twitter limits the number of requests you can make through the API. For the search endpoint, this limit is 180 requests per 15-minute window.
This means we can call our search function at most 180 times every 15 minutes, and each call returns at most 100 tweets. This should not be a problem, as our Test set is not going to be that large anyway. For the sake of simplicity, we will make a single search request for 100 tweets for now, which stays well within the allowed number of requests. Our function for searching for the tweets (i.e. Test set) will be:

In [5]:
def buildTestSet(search_keyword):
    try:
        # a single search request returns at most 100 tweets
        tweets_fetched = api.GetSearch(search_keyword, count=100)

        print("Fetched " + str(len(tweets_fetched)) + " tweets for the term " + search_keyword)

        # the label is left empty for now; the classifier fills it in later
        return [{"text": status.text, "label": None} for status in tweets_fetched]
    except Exception as e:
        print("Unfortunately, something went wrong:", e)
        return None

As expected, this function will return a list of tweets that contain our search keyword/s.

Note that we coupled every tweet’s text with a label in a Python dictionary, and that the label is None for now. This is because we are going to classify each tweet as positive or negative later on, in order to determine whether the overall sentiment on the search term is positive or negative, based on the majority count. This is how Sentiment Analysis works in practice.
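To make the majority-count idea concrete, here is a minimal sketch. The helper name `majority_sentiment` and the example labels are made up for illustration; they are not part of the project code.

```python
from collections import Counter

def majority_sentiment(labels):
    """Return the most frequent label and its share of all labels, in percent."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, 100 * count / len(labels)

# Hypothetical predictions for five tweets (made-up values for illustration)
example_labels = ["positive", "negative", "positive", "positive", "negative"]
overall, share = majority_sentiment(example_labels)
print(overall, share)
```

With three of the five labels positive, the overall sentiment comes out as positive with a 60% share.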

Before we move on, let’s test out our function by adding the following code after the function body:

In [6]:
search_term = input("Enter a search keyword:")

Enter a search keyword:joker, movie


As described in the project introduction, we are looking for Tweets about the movie "Joker", so we search for the keywords "joker" and "movie".

In [7]:
testDataSet = buildTestSet(search_term)

Fetched 100 tweets for the term joker, movie


In [35]:
len(testDataSet)

100

We now have a variable testDataSet that contains our Test set of 100 Tweets on the movie Joker.

In [8]:
print(testDataSet[0:4])

[{'text': 'Finally saw the joker.\n\nWhile the technical aspects of the movie are stunning and the acting incredible.\n\nThat migh… https://t.co/QFIi10T0oH', 'label': None}, {'text': '"Joker" became the first R-rated movie to hit $1 billion at the box-office, and there are\u200b rumors of a sequel https://t.co/k6U8yuxsNC', 'label': None}, {'text': "Would you like to see a sequel to Joaquin Phoenix's #Joker? https://t.co/InEGd1EvME", 'label': None}, {'text': 'RT @1woo17: They had done this before for other movies such as:\nBTS Bring The Soul: The Movie\nWeathering With You\nJoker\nSpiderman: Far From…', 'label': None}]


These are four tweets that contain our search keywords. Now everything is set. We have our Test set and we can move on to building our Training set.

## Section B: Preparing The Training Set

In this section, we will also be using our Twitter API instance from the last section. However, we need to get some things out of the way first. We will be using a downloadable Training set, whose tweets were all labeled by hand (for example as positive or negative) depending on their content. This is exactly what a Training set is for.

A Training set is critical to the success of the model. The data needs to be labeled properly, with no inconsistencies or gaps, as training relies heavily on the accuracy of the data and on the way it was acquired.

For this task, we will be using Niek Sanders’ Corpus of over 5000 hand-classified tweets, which makes it quite reliable. There’s also a catch here. Twitter's terms do not allow the text of tweets to be redistributed, even though all such data is publicly available. Therefore, the corpus only includes a keyword (topic of the tweet), a label and a tweet ID number for every tweet (i.e. row in our CSV corpus). You can get the file containing the corpus from this personal repository: https://github.com/karanluthra/twitter-sentiment-training/blob/master/corpus.csv
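As a rough illustration of the corpus layout, the sketch below reads a few rows with Python's csv module. It uses an in-memory sample instead of the real file. The quoting style, the second tweet ID and the field names (`topic`, `label`, `tweet_id`) are assumptions made for the example, since the file itself has no header row.

```python
import csv
import io

# Three sample rows in the corpus layout: topic, label, tweet id (no header row).
# The quoting and the second row's id are assumptions for illustration.
sample = io.StringIO(
    '"apple","positive","126415614616154112"\n'
    '"google","negative","126534770095169536"\n'
    '"twitter","irrelevant","126854423317188608"\n'
)

# The field names are our own choice, since the file carries no header
fieldnames = ["topic", "label", "tweet_id"]
rows = list(csv.DictReader(sample, fieldnames=fieldnames))

for row in rows:
    print(row["tweet_id"], row["label"], row["topic"])
```

Supplying the field names explicitly keeps the first row as data instead of treating it as a header.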

In [45]:
# Exploring the Training Dataset from Niek Sanders
tr_set = pd.read_csv('/Users/colinechabloz/Desktop/CapstoneProject2/corpus.csv')

In [46]:
tr_set.head()

| | apple | positive | 126415614616154112 |
|---|---|---|---|
| 0 | apple | positive | 126404574230740992 |
| 1 | apple | positive | 126402758403305474 |
| 2 | apple | positive | 126397179614068736 |
| 3 | apple | positive | 126395626979196928 |
| 4 | apple | positive | 126394830791254016 |


In [48]:
len(tr_set)

5512

In [49]:
tr_set.values

array([['apple', 'positive', 126404574230740992],
       ['apple', 'positive', 126402758403305474],
       ['apple', 'positive', 126397179614068736],
       ...,
       ['twitter', 'irrelevant', 126854999442587648],
       ['twitter', 'irrelevant', 126854818101858304],
       ['twitter', 'irrelevant', 126854423317188608]], dtype=object)

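The Training DataFrame contains 5512 rows, each with a topic, a label and a tweet id. Note that read_csv treated the first line of the file as a header (it shows up as the column names above), so the file itself holds one more tweet than that; passing header=None to read_csv would keep it as data. There seem to be more labels than just 'positive' and 'negative' (e.g. irrelevant). Let's check the distinct values of the topic and label columns: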

In [54]:
tr_set[tr_set.columns[0]].unique()

array(['apple', 'google', 'microsoft', 'twitter'], dtype=object)

**Observation:** the first column contains topics of type:

- apple
- google
- microsoft
- twitter

In [53]:
tr_set[tr_set.columns[1]].unique()

array(['positive', 'negative', 'neutral', 'irrelevant'], dtype=object)

**Observation:** the second column contains labels of type:

- positive
- negative
- neutral
- irrelevant

The original dataset "tr_set" doesn't contain the Tweets text, but we will fetch it using the API below.

First, we must remember the Twitter API limit we talked about. This will also apply here, as we will be using the API to get the actual tweet text through each tweet’s ID number included in the Corpus we have. This means, to download 5000 tweets, we will need to follow:

max_number_of_requests = 180
time_window = 15 minutes = 900 seconds
Therefore, the process should follow:
Repeat until end-of-file: {
    1 request -> (900/180) = 5 sec wait
}

Let’s now write the code that does exactly that. Let’s not forget to save the tweets we retrieve through the API into a new CSV file so that we don’t have to download them every time we run the code. Our function will be as follows:

In [9]:
def buildTrainingSet(corpusFile, tweetDataFile):
    corpus = []

    # read the corpus: each row holds topic, label and tweet id
    with open(corpusFile, 'r') as csvfile:
        lineReader = csv.reader(csvfile, delimiter=',', quotechar="\"")
        for row in lineReader:
            corpus.append({"tweet_id": row[2], "label": row[1], "topic": row[0]})

    # 180 requests per 900-second window, i.e. one request every 5 seconds
    rate_limit = 180
    sleep_time = 900 / rate_limit

    trainingDataSet = []

    for tweet in corpus:
        try:
            status = api.GetStatus(tweet["tweet_id"])
            tweet["text"] = status.text
            trainingDataSet.append(tweet)
            time.sleep(sleep_time)
        except Exception as e:
            # e.g. the tweet was deleted or is no longer public
            print("Error processing:", tweet["tweet_id"], e)
            continue

    # now we write them to the empty CSV file (the csv module needs text mode in Python 3)
    with open(tweetDataFile, 'w', newline='', encoding='utf-8') as csvfile:
        linewriter = csv.writer(csvfile, delimiter=',', quotechar="\"")
        for tweet in trainingDataSet:
            try:
                linewriter.writerow([tweet["tweet_id"], tweet["text"], tweet["label"], tweet["topic"]])
            except Exception as e:
                print(e)
    return trainingDataSet

Let us walk through the function above. First, we define the function to take two inputs, both of which are file paths:

- corpusFile is the string path to the Niek Sanders’ CSV corpus file we downloaded. This file, as mentioned earlier, includes the tweet’s topic, label and id.
- tweetDataFile is the string path to the file we would like to save the full tweets in. In contrast to corpusFile, this file will include every tweet’s text as well as topic, label and id.

Next, we started with an empty list corpus. We then opened the file corpusFile and appended every tweet from the file to the list corpus.

The next segment of the code deals with getting the text of tweets based on their IDs. We loop through the tweets in corpus, calling the API on every tweet to get the twitter.Status object of that particular tweet. Afterwards, we use that same object (status) to get the text associated with it and push it into trainingDataSet, then sleep (i.e. pause execution) for five seconds (900/180 seconds) in order to abide by the request limit we talked about.
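As a quick sanity check on the pacing, the arithmetic can be written out in a few lines. This is only a sketch, assuming the limit of 180 requests per 900-second window stated above. The helper names are my own, and the estimate ignores network latency and failed requests.

```python
def seconds_between_requests(max_requests=180, window_seconds=900):
    """Pause needed after each request to stay within the rate limit."""
    return window_seconds / max_requests

def estimated_hours(n_tweets, max_requests=180, window_seconds=900):
    """Lower bound on the download time, ignoring network latency."""
    return n_tweets * seconds_between_requests(max_requests, window_seconds) / 3600

print(seconds_between_requests())        # pause in seconds
print(round(estimated_hours(5000), 1))   # hours for roughly 5000 tweets
```

At one request every five seconds, about 5000 tweets need close to seven hours of waiting alone.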

Now let’s give our script the time to download the tweets (which will take hours) with this function. We can do this using the following snippet:

In [10]:
#THIS STEP TAKES A COUPLE OF HOURS
corpusFile = "/Users/colinechabloz/Desktop/CapstoneProject2/corpus.csv"
tweetDataFile = "/Users/colinechabloz/Desktop/CapstoneProject2/tweetDataFile.csv"
trainingData = buildTrainingSet(corpusFile, tweetDataFile)

In [11]:
# the with-block closes the file once the data is written
with open('/Users/colinechabloz/Desktop/CapstoneProject2/training_data.pkl', 'wb') as file_pi:
    pickle.dump(trainingData, file_pi)

I have saved the Training set in a pickle file (training_data.pkl), and I load it back into a variable called trainingData in the next cell.
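The save-then-reload pattern can also be wrapped in a small helper, so that the slow download only runs when no cached file exists yet. The following is a minimal sketch. `load_or_build` and `fake_build` are hypothetical names, and a temporary directory with a stand-in builder replaces the real path and the real download.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data

# Usage with a stand-in builder instead of the slow download
calls = []
def fake_build():
    calls.append(1)
    return [{"tweet_id": "1", "label": "positive", "text": "made-up tweet"}]

cache_file = os.path.join(tempfile.mkdtemp(), "training_data.pkl")
first = load_or_build(cache_file, fake_build)   # builds and writes the cache
second = load_or_build(cache_file, fake_build)  # reads the cache, no rebuild
print(first == second, len(calls))
```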

In [12]:
with open("/Users/colinechabloz/Desktop/CapstoneProject2/training_data.pkl", 'rb') as file_pi:
    trainingData = pickle.load(file_pi)

In [13]:
trainingData

[{'tweet_id': '126415614616154112',
  'label': 'positive',
  'topic': 'apple',
  'text': 'Now all @Apple has to do is get swype on the iphone and it will be crack. Iphone that is'},
 {'tweet_id': '126402758403305474',
  'label': 'positive',
  'topic': 'apple',
  'text': "Hilarious @youtube video - guy does a duet with @apple 's Siri. Pretty much sums up the love affair! http://t.co/8ExbnQjY"},
 {'tweet_id': '126397179614068736',
  'label': 'positive',
  'topic': 'apple',
  'text': '@RIM you made it too easy for me to switch to @Apple iPhone. See ya!'},
 {'tweet_id': '126379685453119488',
  'label': 'positive',
  'topic': 'apple',
  'text': 'The 16 strangest things Siri has said so far. I am SOOO glad that @Apple gave Siri a sense of humor! http://t.co/TWAeUDBp via @HappyPlace'},
 {'tweet_id': '126377656416612353',
  'label': 'positive',
  'topic': 'apple',
  'text': 'Great up close & personal event @Apple tonight in Regent St store!'},
 {'tweet_id': '126373779483004928',
  'label': 'po

In [57]:
# Transform list into dataframe for easier manipulation
df_trainingData = pd.DataFrame(trainingData) 

In [58]:
df_trainingData.head()

| | label | text | topic | tweet_id |
|---|---|---|---|---|
| 0 | positive | Now all @Apple has to do is get swype on the i... | apple | 126415614616154112 |
| 1 | positive | Hilarious @youtube video - guy does a duet wit... | apple | 126402758403305474 |
| 2 | positive | @RIM you made it too easy for me to switch to ... | apple | 126397179614068736 |
| 3 | positive | The 16 strangest things Siri has said so far. ... | apple | 126379685453119488 |
| 4 | positive | Great up close & personal event @Apple tonight... | apple | 126377656416612353 |


In [59]:
# Check that we find the same values for Label column
df_trainingData['label'].unique()

array(['positive', 'negative', 'neutral', 'irrelevant'], dtype=object)

In [61]:
# Check that we find the same values for Topic column
df_trainingData['topic'].unique()

array(['apple', 'google', 'microsoft', 'twitter'], dtype=object)

**Observations:** Good! We find the same topics and labels after running the function.

In [60]:
df_trainingData.values

array([['positive',
        'Now all @Apple has to do is get swype on the iphone and it will be crack. Iphone that is',
        'apple', '126415614616154112'],
       ['positive',
        "Hilarious @youtube video - guy does a duet with @apple 's Siri. Pretty much sums up the love affair! http://t.co/8ExbnQjY",
        'apple', '126402758403305474'],
       ['positive',
        '@RIM you made it too easy for me to switch to @Apple iPhone. See ya!',
        'apple', '126397179614068736'],
       ...,
       ['irrelevant', 'me re copè con #twitter', 'twitter',
        '126855687060987904'],
       ['irrelevant',
        '#twitter tiene la mala costumbre de ponerce bno cuano yo me voy :/',
        'twitter', '126854999442587648'],
       ['irrelevant',
        'Oi @flaviasansi. Muito bem vinda ao meu #Twitter. Sempre dou followback pelo meu perfil profissional. Permaneça por aqui, certo? Abrass!',
        'twitter', '126854818101858304']], dtype=object)

**Observations**: the Training Dataset now has column names and contains the tweets' text, in addition to the topic, label and tweet id.

## Section C: Pre-processing Tweets in The Data Sets

Before we move on to the actual classification section, there is some cleaning up to do. As a matter of fact, this step is critical and usually takes a long time when building Machine Learning models. However, this will not be a problem in our task, as the data we have is relatively consistent. In other words, we know exactly what we need from it. I will elaborate on this later on.

**What matters and what doesn’t matter in Sentiment Analysis**

Words are the most important part (to an extent that we will talk about in the upcoming section). Punctuation, on the other hand, carries little sentiment for the word-based model we are building, so we discard it. Moreover, tweet components like images, videos, URLs, usernames, emojis, etc. are not used to determine the polarity (whether it is positive or negative) of the tweet. However, this is only true for this application. For instance, in another application, we could have a Deep Learning image classifier that learns and predicts whether the image that the tweet contains stands for something positive (e.g. a rainbow) or negative (e.g. a tank). Technically, Sentiment Analysis is a task and Deep Learning is a family of methods, and both belong to the broader field of Machine Learning. In fact, we can perform Sentiment Analysis through Deep Learning, but that’s a story for another day.

**A word about the importance of normalizing/pre-processing**

Normalization in the NLP context is the process of converting a list of words to a more uniform sequence. By transforming the words into a standard format, later operations can be done on the data without compromising the process. Many pre-processing steps can be taken, including lowercasing, stemming (example: troubled, troubles go into 'troubl'), lemmatization (example: troubled, troubles go into 'trouble'), stopword removal and noise removal.

So we know what we need to keep in the tweets we have and what we need to take out. This applies to both Training and Test sets. So let’s make our pre-processor class:

In [15]:
class PreProcessTweets:
    def __init__(self):
        # stop words, punctuation marks and our two placeholder tokens are all discarded
        self._stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
    def processTweets(self, list_of_tweets, labels='All'):
        processedTweets=[]
        for tweet in list_of_tweets:
            if labels != "All":
                # keep only the tweets whose label is in the requested list
                if tweet["label"] in labels:
                    processedTweets.append((self._processTweet(tweet["text"]),tweet["label"]))
            else:
                processedTweets.append((self._processTweet(tweet["text"]), tweet["label"]))
        return processedTweets
    def _processTweet(self, tweet):
        tweet = tweet.lower() # convert text to lower-case
        tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with a placeholder
        tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet) # replace usernames with a placeholder
        tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
        tweet = word_tokenize(tweet) # split the text into tokens (repeated characters are left as they are)
        return [word for word in tweet if word not in self._stopwords]

**Explaining the pre-processing steps and their importance:**

1. We start off with our imported libraries. re is Python’s Regular Expressions (RegEx) library, which takes care of parsing strings and modifying them in an efficient way without having to explicitly iterate through the characters of the string. We also imported nltk, the Natural Language Toolkit, which is one of the most commonly used Python libraries for text processing. It takes care of any processing that we need to perform on text to change its form or extract certain components from it. The class constructor builds the set of tokens to discard: English stop words, punctuation marks and the two placeholders AT_USER and URL. Stop words are a relatively big topic that you can read up on later, as it belongs more to Natural Language Processing in general than to our specific task.

2. The processTweets function loops through all the tweets passed to it, calling its neighboring function _processTweet on every tweet in the list. The latter does the actual pre-processing steps:
    - Making all the text lower-case. This is merely because, in almost all programming languages, “cAr” is not interpreted the same way as “car”. Therefore, it is better to normalize all characters to be lower-case across all our data.
    - URLs and usernames are replaced by the placeholders URL and AT_USER, which are then discarded together with the stop words. This is for the reasons we discussed earlier.
    - The number sign (i.e. #) is removed from every hashtag, in order to avoid hashtags being processed differently than regular words.
    - The tweet’s text is broken into words (tokenized) in order to ease its processing in the upcoming stages. This step is also called text segmentation or lexical analysis.
    - Removing English stop words and punctuation marks: stop words are a set of commonly used words in a language (a, the, is, are,...). The idea behind removing these words is to remove low-information words from the text and focus on the important words instead.
    - Note that repeated characters (e.g. “caaaaar”) are not collapsed by the class as written, so such unusual spellings are kept as separate words. Handling them would need an additional step.

Let’s take an example. The following tweet could be present in the data set:

"@person1 retweeted @person2: Corn has got to be the most delllllicious crop in the world!!!! #corn #thoughts..."

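After the lower-casing and the regular-expression steps, the tweet looks like:

“AT_USER retweeted AT_USER corn has got to be the most delllllicious crop in the world!!!! corn thoughts...”

And finally, tokenization followed by the removal of stop words, punctuation marks and placeholders leaves a list of words along the lines of:

[“retweeted”, “corn”, “got”, “delllllicious”, “crop”, “world”, “corn”, “thoughts”]

(Multi-character punctuation tokens such as “...” may also remain, since only single punctuation characters are in the discard set.)

Note that our code does not collapse repeated characters (“delllllicious” stays as it is), so unusual spellings end up as separate words. It also does not remove duplicate words (i.e. “corn”) from the text, but rather keeps them. This is because duplicate words can play a role in determining the polarity of the text (as we will see in the upcoming section).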
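A step that shortens repeated characters could be sketched with a single regular expression, as below. This is an illustrative helper, not part of the PreProcessTweets class. The name `collapse_repeats` and the `keep` parameter are my own.

```python
import re

def collapse_repeats(text, keep=2):
    """Shorten any run of three or more identical characters to `keep` copies."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, text)

print(collapse_repeats("delllllicious", keep=1))
print(collapse_repeats("sooo"))
print(collapse_repeats("cool", keep=1))
```

Only runs of three or more characters are touched, so words that legitimately contain a doubled letter, such as "cool", pass through unchanged. Keeping two copies is a common compromise, because collapsing to a single character can distort words whose correct spelling has a double letter inside a longer run.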

We are all set to use our Pre-processor class. First, we will create a variable that refers to it (an object), and then call it on both the Training and Test sets as we discussed earlier:

In [16]:
nltk.download('all')  # downloads every NLTK package; only the tokenizer models and the stop word list are needed here

[nltk_data] Downloading collection 'all'

True

**Note:** Remember, the Tweets' labels in the Training Dataset take 4 values: positive, negative, irrelevant and neutral. We decide at this point to keep only the Tweets with label 'positive' or 'negative'.

In [17]:
tweetProcessor = PreProcessTweets()
preprocessedTrainingSet = tweetProcessor.processTweets(trainingData, labels=['positive', 'negative'])
preprocessedTestSet = tweetProcessor.processTweets(testDataSet)

**Explanation:** we apply the data wrangling/cleaning steps to both the Training Set (trainingData) and the Test set (testDataSet).

In [18]:
print(preprocessedTrainingSet[0:4])

[(['get', 'swype', 'iphone', 'crack', 'iphone'], 'positive'), (['hilarious', 'video', 'guy', 'duet', "'s", 'siri', 'pretty', 'much', 'sums', 'love', 'affair'], 'positive'), (['made', 'easy', 'switch', 'iphone', 'see', 'ya'], 'positive'), (['16', 'strangest', 'things', 'siri', 'said', 'far', 'sooo', 'glad', 'gave', 'siri', 'sense', 'humor', 'via'], 'positive')]


In [36]:
print(preprocessedTestSet[0:4])

[(['finally', 'saw', 'joker', 'technical', 'aspects', 'movie', 'stunning', 'acting', 'incredible', 'migh…'], None), (['``', 'joker', "''", 'became', 'first', 'r-rated', 'movie', 'hit', '1', 'billion', 'box-office', 'are\u200b', 'rumors', 'sequel'], None), (['would', 'like', 'see', 'sequel', 'joaquin', 'phoenix', "'s", 'joker'], None), (['rt', 'done', 'movies', 'bts', 'bring', 'soul', 'movie', 'weathering', 'joker', 'spiderman', 'far', 'from…'], None)]


Now we can move on to the most exciting part — classification. 

## Section D: Naive Bayes Classifier

**What is Naive Bayes Classifier?**

Naive Bayes Classifier is a classification algorithm that relies on Bayes’ Theorem. This theorem provides a way of calculating a type of probability called posterior probability, in which the probability of an event A occurring depends on known probabilistic background (e.g. evidence of event B). For example, if Person_X only plays tennis when it is not raining outside, then, according to Bayesian statistics, the probability of Person_X playing tennis when it is not raining can be given as:

P(X plays | no rain) = P(no rain | X plays)*P(X plays)/P(no rain)

following Bayes’ theorem:

P(A|B) = P(B|A)*P(A)/P(B)
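To see the formula at work, here is a small numeric sketch of the tennis example. All probabilities are made-up values chosen for illustration; only the formula itself comes from the text above.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Made-up numbers for the tennis example
p_plays = 0.3                 # P(X plays)
p_no_rain = 0.6               # P(no rain)
p_no_rain_given_plays = 1.0   # X only plays when it is not raining

print(posterior(p_no_rain_given_plays, p_plays, p_no_rain))
```

With these numbers, knowing that it is not raining raises the probability that X plays from 0.3 to 0.5.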

**Why Naive Bayes Classifier?**

Naive Bayes (NB) classifiers are widely used in natural language processing problems. Naive Bayes predicts the tag of a text by calculating the probability of each tag for the given text and then outputting the tag with the highest probability. It is one of the simplest classifiers, which is enough in our case.

Another possible classifier:
- Support Vector Machine: 
    - effective in high dimensional spaces;
    - still effective in cases where number of dimensions is greater than the number of samples
    - uses a subset of training points in the decision function (called support vectors), so it is also memory efficient (very important too!)
    
**Conclusion:** SVM could be a good option too. Since we need to choose one, we will work with NB. 

All we need to know for our task is that a Naive Bayes Classifier depends on the ever-famous Bayes’ theorem. Before we move on, let’s give a quick overview of the steps we will be taking next:

- Build a vocabulary (list of words) of all the words resident in our training data set.
- Match tweet content against our vocabulary — word-by-word.
- Build our word feature vector.
- Plug our feature vector into the Naive Bayes Classifier.

**Step D.1: Building the vocabulary**

A vocabulary in Natural Language Processing is a list of all speech segments available for the model. In our case, this includes all the words present in the Training set we have, as the model can make use of all of them relatively equally, at least at this point. The code will look something like this:

In [19]:
def buildVocabulary(preprocessedTrainingData):
    all_words = []
    
    for (words, sentiment) in preprocessedTrainingData:
        all_words.extend(words)

    wordlist = nltk.FreqDist(all_words)
    word_features = wordlist.keys()
    
    return word_features

This creates a list all_words of every word in the Training set and then reduces it to word_features. FreqDist counts how often each word occurs, with the word as key and its frequency (number of occurrences in the set) as value; word_features keeps only the keys, i.e. the distinct words.
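The same idea can be tried out without NLTK, since nltk.FreqDist behaves much like Python's collections.Counter. The sketch below uses two made-up tweets; `toy_training_set` and `vocabulary` are illustrative names.

```python
from collections import Counter

# Two made-up pre-processed tweets in the (words, label) layout used above
toy_training_set = [
    (["love", "movie", "love"], "positive"),
    (["awful", "movie"], "negative"),
]

all_words = []
for words, sentiment in toy_training_set:
    all_words.extend(words)

word_counts = Counter(all_words)       # word -> number of occurrences
vocabulary = list(word_counts.keys())  # distinct words, in order of first appearance

print(word_counts["love"], vocabulary)
```

The counts are available if needed, but the vocabulary only keeps one entry per distinct word.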

**Step D.2: Matching tweets against our vocabulary**

This step is crucial, as we will go through all the words in our Training set (i.e. our word_features list), comparing every word against the tweet at hand, and associating a value with the word as follows:

"label 1 (true): if word in vocabulary is resident in tweet
label 0 (false): if word in vocabulary is not resident in tweet"

This is fairly simple to code:

In [20]:
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in tweet_words)
    return features 

Given the last snippet, for every word in word_features, we will have the dictionary key ‘contains(X)’, where X is the word. Each of those keys will have the value True/False, according to what we said earlier about the labels: True for ‘present’ and False for ‘absent’.
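To see what such a feature dictionary looks like, here is a small sketch with a three-word vocabulary. `extract_features_from` is a hypothetical variant that takes the vocabulary as an argument instead of reading the global word_features, and the words are made up.

```python
def extract_features_from(tweet, vocabulary):
    """Same idea as extract_features, with the vocabulary passed in explicitly."""
    tweet_words = set(tweet)
    return {"contains(%s)" % word: (word in tweet_words) for word in vocabulary}

toy_vocabulary = ["love", "movie", "awful"]   # made-up vocabulary
features = extract_features_from(["love", "joker"], toy_vocabulary)
print(features)
```

Words of the tweet that are not in the vocabulary ("joker" here) simply do not appear in the features, while every vocabulary word gets a True or False entry.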

**Step D.3: Building our feature vector**

Let’s now call the last two functions we have written. This will build our final feature vector, with which we can proceed on to training.

In [21]:
word_features = buildVocabulary(preprocessedTrainingSet)
trainingFeatures = nltk.classify.apply_features(extract_features, preprocessedTrainingSet)

In [39]:
word_features



In [63]:
len(word_features)

2700

**Observations:** Our vocabulary contains 2700 distinct words.

In [40]:
trainingFeatures



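The NLTK built-in function apply_features performs the actual feature extraction from our lists. It does so lazily, computing the features of each tweet only when they are accessed, which avoids holding every feature dictionary in memory at once. Our final feature vector is trainingFeatures.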
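
Conceptually, for labeled input the result behaves like a list of (features, label) pairs, one per tweet. The sketch below builds such pairs eagerly in plain Python. It only illustrates the shape of the result and is not NLTK's implementation; the toy vocabulary, data and names are assumptions.

```python
# Hypothetical two-word vocabulary and two-tweet training set.
toy_vocabulary = ["love", "hate"]

def toy_extract_features(tweet):
    tweet_words = set(tweet)
    return {"contains(%s)" % w: (w in tweet_words) for w in toy_vocabulary}

toy_training_set = [
    (["love", "movie"], "positive"),
    (["hate", "movie"], "negative"),
]

# Eager version: apply the feature function to every labeled tweet up front.
toy_training_features = [
    (toy_extract_features(words), label) for words, label in toy_training_set
]

for featureset, label in toy_training_features:
    print(label, featureset)
```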

**Step D.4: Training the classifier**

We have finally come to the most important (and, ironically, the shortest) part of our task. Thanks to NLTK, a single function call trains the model as a Naive Bayes Classifier, since the classifier is built into the library:

In [22]:
NBayesClassifier = nltk.NaiveBayesClassifier.train(trainingFeatures)

We will perform a small test of our classifier by creating artificial word sequences that are clearly positive or clearly negative, to check that the classifier behaves sensibly.

In [30]:
test_tweet = ['rt','hilarious', 'video', 'guy', 'duet', "'s", 'siri', 'pretty', 'much', 'sums', 'love', 'affair']
print(NBayesClassifier.classify(extract_features(test_tweet)))

positive


In [31]:
test_tweet = ['terrible','bad', 'awful', 'negative', 'duet', 'stupid', 'mean']
print(NBayesClassifier.classify(extract_features(test_tweet)))

negative


These two small, extreme tests suggest that the classifier has at least learned something sensible!
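
To give an intuition for what training estimates, here is a minimal from-scratch sketch of a Naive Bayes classifier over boolean 'contains' features. It is not NLTK's implementation: the add-one smoothing, the function names and the toy data are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count label frequencies and, per label, how often each feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)  # label -> feature name -> times True
    feature_names = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, feature_names

def classify(model, features):
    """Return the label with the highest log-posterior score."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for name in feature_names:
            # Add-one smoothed estimate of P(feature is True | label)
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if features.get(name, False) else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data in the (features, label) format used above.
toy_data = [
    ({"contains(love)": True, "contains(awful)": False}, "positive"),
    ({"contains(love)": True, "contains(awful)": False}, "positive"),
    ({"contains(love)": False, "contains(awful)": True}, "negative"),
    ({"contains(love)": False, "contains(awful)": True}, "negative"),
]
model = train_naive_bayes(toy_data)
print(classify(model, {"contains(love)": True, "contains(awful)": False}))  # positive
print(classify(model, {"contains(love)": False, "contains(awful)": True}))  # negative
```

The "naive" part is the inner loop: each feature contributes its own probability independently of the others, and the contributions are simply added in log space.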

Note that the training call above (i.e. nltk.NaiveBayesClassifier.train()) can take a few minutes to execute. With the classifier trained, we can now test it on our Test set.

## Section E: Testing The Model

Let's run the classifier (i.e. NBayesClassifier) on the 100 tweets that we downloaded from Twitter according to our search terms. We then take the majority vote of the labels returned by the classifier and output the percentage (i.e. score) of tweets carrying the winning label.

In [23]:
NBResultLabels = [NBayesClassifier.classify(extract_features(tweet[0])) for tweet in preprocessedTestSet]

# get the majority vote
if NBResultLabels.count('positive') > NBResultLabels.count('negative'):
    print("Overall Positive Sentiment")
    print("Positive Sentiment Percentage = " + str(100*NBResultLabels.count('positive')/len(NBResultLabels)) + "%")
else: 
    print("Overall Negative Sentiment")
    print("Negative Sentiment Percentage = " + str(100*NBResultLabels.count('negative')/len(NBResultLabels)) + "%")

Overall Negative Sentiment
Negative Sentiment Percentage = 52.0%


In [64]:
NBResultLabels.count('negative')

52

**Observation:** The sentiment analysis of Tweets about the movie Joker (with Joaquin Phoenix, released in October 2019) yields an overall negative sentiment, at 52%. This near-even split is in line with the reviews one can read in newspapers and on the internet, which are balanced between positive and negative. Let's have a look at a few Tweets and their assigned labels to check whether the classification is intuitive. This gives a first, approximate idea of our classifier's performance.
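
The majority vote can also be written with collections.Counter, which reports the share of every label at once. This is a sketch on a hypothetical list of predicted labels, not the notebook's actual output; the names label_percentages and toy_labels are assumptions.

```python
from collections import Counter

def label_percentages(labels):
    """Return {label: percentage of all labels}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}

# Hypothetical predictions standing in for NBResultLabels.
toy_labels = ["negative", "positive", "negative", "positive", "negative"]

percentages = label_percentages(toy_labels)
majority_label = max(percentages, key=percentages.get)

print(percentages)     # {'negative': 60.0, 'positive': 40.0}
print(majority_label)  # negative
```

Unlike the if/else version, a tie is resolved here by whichever label max encounters first, so ties should be handled explicitly if they matter.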

## Section F: Analyzing the performance of the model

**Concatenate preprocessedTestSet and NBResultLabels** 

Making some changes to preprocessedTestSet

In [118]:
len(preprocessedTestSet)

100

In [72]:
# Transform list into dataframe for easier manipulation
df_preprocessedTestSet = pd.DataFrame(preprocessedTestSet) 

In [74]:
df_preprocessedTestSet.shape

(100, 2)

In [80]:
# Adding column names
df_preprocessedTestSet.columns = ['Tweets pre-processed', 'label']

In [81]:
df_preprocessedTestSet.head(10)

Unnamed: 0,Tweets pre-processed,label
0,"[finally, saw, joker, technical, aspects, movi...",
1,"[``, joker, '', became, first, r-rated, movie,...",
2,"[would, like, see, sequel, joaquin, phoenix, '...",
3,"[rt, done, movies, bts, bring, soul, movie, we...",
4,"[ashishchanchalani, ashishchanchalanimemes, ol...",
5,"[think, joker, new, favorite, character/movie,...",
6,"[rt, still, thinking, sana, harley, quinn, wan...",
7,"[joker, action, movie]",
8,"[saw, ford, v, ferrari, last, weekend, great, ...",
9,"[’, dope, joker, movie]",


Making some changes to NBResultLabels

In [78]:
len(NBResultLabels)

100

In [87]:
# Transform list into dataframe for easier manipulation
df_NBResultLabels = pd.DataFrame(NBResultLabels) 

In [88]:
df_NBResultLabels.shape

(100, 1)

In [89]:
# Adding column names
df_NBResultLabels.columns = ['label']

In [90]:
df_NBResultLabels.head(10)

Unnamed: 0,label
0,positive
1,negative
2,negative
3,positive
4,negative
5,positive
6,negative
7,positive
8,positive
9,positive


In [92]:
# Add labels to preprocessedTestSet
test_labels = np.concatenate((df_preprocessedTestSet, df_NBResultLabels),axis=1)

In [119]:
test_labels

array([[list(['finally', 'saw', 'joker', 'technical', 'aspects', 'movie', 'stunning', 'acting', 'incredible', 'migh…']),
        None, 'positive'],
       [list(['``', 'joker', "''", 'became', 'first', 'r-rated', 'movie', 'hit', '1', 'billion', 'box-office', 'are\u200b', 'rumors', 'sequel']),
        None, 'negative'],
       [list(['would', 'like', 'see', 'sequel', 'joaquin', 'phoenix', "'s", 'joker']),
        None, 'negative'],
       [list(['rt', 'done', 'movies', 'bts', 'bring', 'soul', 'movie', 'weathering', 'joker', 'spiderman', 'far', 'from…']),
        None, 'positive'],
       [list(['ashishchanchalani', 'ashishchanchalanimemes', 'oldvideos', 'joker', 'think', 'downloaded', 'wrong', 'joker', 'movie']),
        None, 'negative'],
       [list(['think', 'joker', 'new', 'favorite', 'character/movie', 'love', 'every', 'actor', '’', 'played', 'get', 'new', 'parts', 'him…']),
        None, 'positive'],
       [list(['rt', 'still', 'thinking', 'sana', 'harley', 'quinn', 'wanted', 'c

In [104]:
# Check first 10 Tweets words and their attributed labels
test_labels[0]

array([list(['finally', 'saw', 'joker', 'technical', 'aspects', 'movie', 'stunning', 'acting', 'incredible', 'migh…']),
       None, 'positive'], dtype=object)

In [105]:
test_labels[1]

array([list(['``', 'joker', "''", 'became', 'first', 'r-rated', 'movie', 'hit', '1', 'billion', 'box-office', 'are\u200b', 'rumors', 'sequel']),
       None, 'negative'], dtype=object)

In [106]:
test_labels[2]

array([list(['would', 'like', 'see', 'sequel', 'joaquin', 'phoenix', "'s", 'joker']),
       None, 'negative'], dtype=object)

In [107]:
test_labels[3]

array([list(['rt', 'done', 'movies', 'bts', 'bring', 'soul', 'movie', 'weathering', 'joker', 'spiderman', 'far', 'from…']),
       None, 'positive'], dtype=object)

In [108]:
test_labels[4]

array([list(['ashishchanchalani', 'ashishchanchalanimemes', 'oldvideos', 'joker', 'think', 'downloaded', 'wrong', 'joker', 'movie']),
       None, 'negative'], dtype=object)

In [110]:
test_labels[5]

array([list(['think', 'joker', 'new', 'favorite', 'character/movie', 'love', 'every', 'actor', '’', 'played', 'get', 'new', 'parts', 'him…']),
       None, 'positive'], dtype=object)

In [111]:
test_labels[6]

array([list(['rt', 'still', 'thinking', 'sana', 'harley', 'quinn', 'wanted', 'chaeyoung', 'joker', 'coz', 'chaeyoung', 'love', 'movie…']),
       None, 'negative'], dtype=object)

In [112]:
test_labels[7]

array([list(['joker', 'action', 'movie']), None, 'positive'], dtype=object)

In [116]:
test_labels[8]

array([list(['saw', 'ford', 'v', 'ferrari', 'last', 'weekend', 'great', 'movie', 'would', 'top', 'list', '2019', "'m", 'marvel', 'guy', 'avengers…']),
       None, 'positive'], dtype=object)

In [117]:
test_labels[9]

array([list(['’', 'dope', 'joker', 'movie']), None, 'positive'],
      dtype=object)

**Observations of Tweets and assigned label:** 
- Tweet[0]: At a glance, we see words such as 'stunning', 'incredible' that are positive. The label is therefore 'positive' as expected. CORRECT
- Tweet[1]: This tweet seems rather neutral, but is classified as 'negative'. BIAS?
- Tweet[2]: This tweet seems rather neutral, but is classified as 'negative'. BIAS?
- Tweet[3]: This tweet seems rather neutral, but is classified as 'positive'. BIAS?
- Tweet[4]: This tweet seems rather neutral, but is classified as 'negative'. BIAS?
- Tweet[5]: This tweet seems rather positive, with the words 'love' and 'favorite'. The label is therefore 'positive' as expected. CORRECT
- Tweet[6]: This tweet seems rather neutral, with a small positive touch with the word 'love'. However, the label is 'negative'. WRONG
- Tweet[7]: This tweet seems rather neutral, but is classified as 'positive'. BIAS?
- Tweet[8]: At a glance, we see words such as 'great', 'top' that are positive. The label is therefore 'positive' as expected. CORRECT
- Tweet[9]: This tweet seems rather neutral, with a small positive note 'dope'. The label is therefore 'positive' as expected. CORRECT

**As a conclusion, we see different cases:**
1. **50%** | Neutral Tweets are forced into one of the two classes, which is a consequence of simplifying the classification options to two (positive and negative). Here they are classified as negative slightly more often (3/5) than as positive (2/5).
2. **40%** | Correct classification of the Tweet
3. **10%** | Wrong classification of the Tweet

This (subjective) evaluation on ten Tweets tells us that the classification model is not yet optimal.
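
The three percentages above can be reproduced with a short tally. The predicted labels are those displayed for the first ten Tweets; the manual labels are one possible encoding of the subjective observations (Tweet[6] is encoded as 'positive' because of the word 'love'), and the names used are assumptions for illustration.

```python
from collections import Counter

# Predicted labels for the ten Tweets, as displayed in df_NBResultLabels.head(10).
predicted = ["positive", "negative", "negative", "positive", "negative",
             "positive", "negative", "positive", "positive", "positive"]

# Manual judgments encoded from the observations (subjective).
manual = ["positive", "neutral", "neutral", "neutral", "neutral",
          "positive", "positive", "neutral", "positive", "positive"]

def outcome(manual_label, predicted_label):
    """Classify one prediction as neutral, correct or wrong."""
    if manual_label == "neutral":
        return "neutral"  # no right answer exists among the two classes
    return "correct" if manual_label == predicted_label else "wrong"

tally = Counter(outcome(m, p) for m, p in zip(manual, predicted))
for name, count in tally.items():
    print(name, f"{100 * count / len(predicted):.0f}%")
```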

**Possible improvements:**
- Trying different combinations of pre-processing steps
- Adding some pre-processing steps
- Checking the impact of each pre-processing step on model performance
- Building our own Training Dataset, to control quality of the classification in the Training Dataset, as it is crucial to a good model performance.

## Conclusion

Sentiment Analysis is an interesting way to think about the applicability of Natural Language Processing in drawing automated conclusions about text. It is used in social media trend analysis and, sometimes, for marketing purposes. Writing a Sentiment Analysis program in Python is not a difficult task, thanks to modern, ready-to-use libraries.

With this project, we wanted to analyze Tweets about the recently released movie "Joker", which attracted all kinds of comments and reviews on the web. In summary, we have:

- created a Test set directly from Twitter thanks to the API
- used an existing Training set whose tweets were labeled by hand as positive or negative according to their content
- cleaned both sets by removing any symbols (punctuation, hashtags...) that do not bring anything to the sentiment analysis
- created a vocabulary of words based on the Training set
- matched tweets against the vocabulary
- trained the classifier
- tested the model on the Test set
- evaluated the performance of our model

The result of the sentiment analysis is that just over half of the Tweets in the Test set (52%) are classified as negative and the rest as positive. The results reflect the divided reviews of the movie, which matches the impression I had when scanning through articles on the web.

However, the quality of our model seems limited, and several areas of improvement have been identified above.
I would not use the model as it is, but most probably would after implementing the changes mentioned above.

Another area of improvement concerns how we measure the performance of our model. The way the project was conducted did not allow us to use a solid performance tool such as a confusion matrix, because the Tweets in the Test set have no true labels to compare against. To measure the model's performance better, we could:
- separate the Training Tweets Dataset between Train and Test sets
- develop and train the model on the Train set
- use Test set to test model
- run metrics (confusion matrix) to evaluate model's performance over Test set and Train set
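
The steps above can be sketched with the standard library alone. This is a minimal illustration on made-up labels, not a result from this project; the function names, the 80/20 split and the seed are assumptions.

```python
import random
from collections import Counter

def train_test_split(labeled_data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = list(labeled_data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

def accuracy(matrix):
    """Share of predictions where the true and predicted labels agree."""
    correct = sum(count for (true, pred), count in matrix.items() if true == pred)
    return correct / sum(matrix.values())

# Hypothetical labeled data set of ten items.
toy_data = list(zip(range(10), ["positive", "negative"] * 5))
train, test = train_test_split(toy_data)
print(len(train), len(test))  # 8 2

# Hypothetical true and predicted labels for a held-out set.
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

matrix = confusion_matrix(y_true, y_pred)
print(matrix[("positive", "positive")])  # 1
print(matrix[("negative", "positive")])  # 1
print(accuracy(matrix))                  # 0.6
```

NLTK also provides nltk.classify.accuracy and nltk.ConfusionMatrix, which could replace the hand-written helpers once labeled test data is available.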