# Twitter Sentiment Analysis:

Sentiment Analysis means analyzing the sentiment of a given text or document and categorizing the text/document into a specific class or category. In our case, we will be classifying tweets as either positive or negative.

We will be using the following modules for this project:
* NLTK (For implementation based on nltk)
* Numpy, re, string (For efficient processing)
* tweepy (For fetching tweets from the api)
* textblob (For implementation based on textblob)

## Sentiment Analysis using NLTK

Here we will train a model called `NaiveBayesClassifier` to perform supervised classification. The first step is to import everything we will need for this project.

In [1]:
import string
import re
import numpy as np
import tweepy
import nltk
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk import NaiveBayesClassifier, classify

Let us first set up a tweepy API object so that we can fetch tweets easily.

Let us also define a function to fetch the text of a tweet by its tweet ID.

In [2]:
# Credentials are read from a local file, one value per line, in this order
with open('../../data/taskone/i/TwitterAPI.txt') as fh:
    ids = fh.read().split('\n')
API_KEY = ids[0]
API_KEY_SECRET = ids[1]
ACCESS_TOKEN = ids[2]
ACCESS_TOKEN_SECRET = ids[3]
auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

In [3]:
def get_tweet(_id, authenticated_api=api):
    tweet = authenticated_api.get_status(_id)
    return tweet.text

The `twitter_samples` corpus from `nltk.corpus` has 3 files:
* positive_tweets.json: A file containing 5000 positive tweets
* negative_tweets.json: A file containing 5000 negative tweets
* tweets.20150430-223406.json: A file containing 20000 tweets

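We will be using the first two files to train our `NaiveBayesClassifier`.

Let's load the data into lists using the `twitter_samples.strings` method. If the corpora are not yet installed locally, they can be fetched once with `nltk.download('twitter_samples')` and `nltk.download('stopwords')`.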


In [4]:
pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')

The first order of business is cleaning and tokenizing each document in the corpus (a tweet, in this case).

We will do that by writing a function that cleans a tweet and tokenizes it.

To clean the tweet, we will apply the following steps:
- Remove stock market tickers and dollar amounts like $GE or $5
- Remove the retweet marker "RT" at the start of a tweet
- Remove hyperlinks and the # sign of hashtags (the hashtag word itself is kept)
- Remove Twitter handles like @Twitter
- Remove stop words
- Remove punctuation
- Stem each token

In [5]:
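def clean(tweet):
    # Remove stock tickers and dollar amounts such as $GE or $5
    tweet = re.sub(r'\$\w*', '', tweet)
    # Remove the retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Remove hyperlinks (this also drops any text after the link on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # Remove only the hash sign; the hashtag word is kept
    tweet = re.sub(r'#', '', tweet)
    # Lowercase, drop @handles and shorten long runs of repeated characters
    tokenizer = TweetTokenizer(preserve_case=False,
                               strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    tweet_clean = []
    stemmer = PorterStemmer()
    sw = set(stopwords.words('english'))  # a set gives fast membership tests
    for word in tweet_tokens:
        if (word not in sw and
            word not in string.punctuation):
            tweet_clean.append(stemmer.stem(word))
    return tweet_clean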

In [6]:
random_tweet = "RT @Twitter Hello There! Hope you are having a good day. It's so sunny outside. It really is sunny. I have $5 :) #good #morning http://www.youtube.com/watch?"
clean(random_tweet)

['hello',
 'hope',
 'good',
 'day',
 'sunni',
 'outsid',
 'realli',
 'sunni',
 ':)',
 'good',
 'morn']
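
To see what the regular-expression stage does on its own, before tokenizing, stop-word removal and stemming, the four substitutions from `clean` can be run in isolation. This is a minimal sketch that needs only `re`; the helper name `strip_markup` and the sample string are made up for illustration.

```python
import re

def strip_markup(tweet):
    """Apply only the regex substitutions used in `clean` (no tokenizing or stemming)."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # "$5" or "$GE" style tokens
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL and the rest of its line
    tweet = re.sub(r'#', '', tweet)                     # hash sign only; the word stays
    return tweet

sample = "RT @Twitter I have $5 :) #good #morning http://example.com/watch more text"
print(repr(strip_markup(sample)))
# '@Twitter I have  :) good morning '
```

Note that `more text` after the URL is dropped as well, because `.*` runs to the end of the line. The `@Twitter` handle is still present at this stage; in `clean` it is removed later by `TweetTokenizer(strip_handles=True)`.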

Now we need a feature extractor. We will create a simple bag-of-words function called `BoW` that returns a dictionary mapping each token to `True`.

Note: We can't use a frequency based vectorization like in the Text Similarity project because we do not have a reference vector in this case. 

In [7]:
def BoW(tweet_clean):
    words_dictionary = {word: True for word in tweet_clean}
    return words_dictionary
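
The effect of using booleans rather than counts is easiest to see on a token list with repeats. A small sketch, restating `BoW` so the block runs on its own; the `Counter` line is there only for contrast and is not used anywhere in this project.

```python
from collections import Counter

def BoW(tweet_clean):
    # Presence-only features: every distinct token maps to True
    return {word: True for word in tweet_clean}

tokens = ['sunni', 'outsid', 'realli', 'sunni']

print(BoW(tokens))      # {'sunni': True, 'outsid': True, 'realli': True}
print(Counter(tokens))  # Counter({'sunni': 2, 'outsid': 1, 'realli': 1})
```

The repeated token `'sunni'` appears once in the bag of words, so the classifier only learns whether a word occurs in a tweet, not how often.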

We should now focus on creating the feature sets from `pos_tweets` and `neg_tweets`.

We will do this by looping over each tweet in both lists and:
- Cleaning the tweet
- Extracting the features from the tokens (using the `BoW` function)

In [8]:
pos_features = []
neg_features = []
for tweet in pos_tweets:
    tweet = clean(tweet)
    pos_features.append((BoW(tweet), 1))
for tweet in neg_tweets:
    tweet = clean(tweet)
    neg_features.append((BoW(tweet), 0))

Our `pos_features` is a list of tuples, where the first element of each tuple is the bag of words for a particular tweet and the second element is its sentiment. The same holds for `neg_features`.

We have chosen `1` to denote positive sentiment and `0` to denote negative sentiment in this project. Hence `pos_features` has `1` as the second element in every tuple, and `neg_features` has `0` as the second element in every tuple.

This means that we now have labelled data that we can split into training and testing data. `pos_features` holds the labelled data for positive tweets, and `neg_features` holds the labelled data for negative tweets.

In [9]:
test_set = pos_features[:100] + neg_features[:100]
train_set = pos_features[100:] + neg_features[100:]
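
Slicing off the first 100 items of each list works, but it always holds out the same tweets. One possible variation, sketched here on stand-in data, shuffles each class with a fixed seed before splitting. The function name `split_features`, the seed and the toy feature sets are assumptions for illustration, not part of the original project.

```python
import random

def split_features(features, n_test, seed=42):
    """Shuffle a copy of `features`, then return (test, train)."""
    shuffled = list(features)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG keeps the split reproducible
    return shuffled[:n_test], shuffled[n_test:]

# Stand-ins for pos_features / neg_features
toy_pos = [({'tok%d' % i: True}, 1) for i in range(10)]
toy_neg = [({'tok%d' % i: True}, 0) for i in range(10)]

pos_test, pos_train = split_features(toy_pos, n_test=2)
neg_test, neg_train = split_features(toy_neg, n_test=2)

toy_test = pos_test + neg_test
toy_train = pos_train + neg_train
print(len(toy_test), len(toy_train))  # 4 16
```

Splitting each class separately keeps the test set balanced between positive and negative tweets, as in the slicing above.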

We are now ready to train our model.

We define a `NaiveBayesClassifier` below and train it on the `train_set`.

In [10]:
classifier = NaiveBayesClassifier.train(train_set)
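
To get a feel for what `train` expects and returns, the same call can be tried on a handful of hand-written feature sets. The four training examples below are made up; only `NaiveBayesClassifier.train`, `classify` and `labels` come from NLTK.

```python
from nltk import NaiveBayesClassifier

# Each item is (bag of words, label), the same shape as pos_features / neg_features
toy_train = [
    ({'good': True, 'day': True}, 1),
    ({'great': True}, 1),
    ({'bad': True, 'day': True}, 0),
    ({'sad': True}, 0),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

print(sorted(toy_classifier.labels()))           # [0, 1]
print(toy_classifier.classify({'bad': True}))    # 0
print(toy_classifier.classify({'great': True}))  # 1
```

The labels can be any hashable values; this project uses the integers `1` and `0`.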

We can now test our `NaiveBayesClassifier` on some dummy data. Let's say someone tweets that they had a bad day today.

NOTE: The expected output is `0`, which denotes a negative sentiment.

In [11]:
dummy_tweet = "RT @Twitter I had such a bad day today :( #bad #morning"
dummy_bag = BoW(clean(dummy_tweet))
classifier.classify(dummy_bag)

0

Recall that `0` denotes a negative sentiment. Our model has correctly caught the negative sentiment in the dummy tweet and given us an output of `0`.

At this point we should also check the probability of our prediction (or classification).

We can do this easily using the `NaiveBayesClassifier.prob_classify` method.

In [12]:
prob = classifier.prob_classify(dummy_bag)
prob.prob(0), prob.prob(1)  # P(0), i.e. negative, and P(1), i.e. positive

(0.9995324354143069, 0.00046756458569199155)

As we can see, according to the `NaiveBayesClassifier` the probability of our tweet having a negative sentiment is 0.9995 (99.95%).
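
The two numbers returned by `prob_classify` sum to 1 (up to floating-point rounding) because they are normalised posteriors: naive Bayes multiplies a prior for each label by one likelihood per feature, then divides by the total. A minimal sketch of that arithmetic with made-up priors and likelihoods; these are not the values learned by the classifier above, and NLTK's own estimator applies smoothing that is not reproduced here.

```python
# Hypothetical numbers for a tweet containing the tokens 'bad' and 'day'
prior = {0: 0.5, 1: 0.5}
likelihood = {
    0: {'bad': 0.30, 'day': 0.10},  # P(token present | negative)
    1: {'bad': 0.02, 'day': 0.10},  # P(token present | positive)
}
tokens = ['bad', 'day']

# Unnormalised score: prior times the product of the per-token likelihoods
score = {}
for label in prior:
    s = prior[label]
    for tok in tokens:
        s *= likelihood[label][tok]
    score[label] = s

# Dividing by the total turns the scores into probabilities
total = sum(score.values())
posterior = {label: s / total for label, s in score.items()}
print(posterior)
```

With these numbers the posterior comes out at roughly 0.94 for the negative label and 0.06 for the positive one. The token `'day'` is equally likely under both labels, so it does not move the result; `'bad'` does all the work.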

We can also measure the accuracy of the classifier on our `test_set` and see how well it performs.

This can be done by calling `classify.accuracy` with our trained `NaiveBayesClassifier` and the `test_set`.

In [13]:
classify.accuracy(classifier, test_set)

0.97

This gives us a very good accuracy score of 97%!

With only 200 tweets in `test_set` this figure is a rough estimate, but it does show that the model classifies tweets correctly to a decent extent.
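
`classify.accuracy` reports a single number: the fraction of test items whose predicted label matches the gold label. With a small test set it can be worth looking at the errors per class as well. A sketch with hypothetical gold and predicted labels (not the output of the classifier above); the helper name `summarise` is an assumption.

```python
from collections import Counter

def summarise(gold, predicted):
    """Return overall accuracy and a Counter of (gold, predicted) pairs."""
    pairs = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in pairs.items() if g == p)
    return correct / len(gold), pairs

# Made-up labels: 1 = positive, 0 = negative
gold      = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]

accuracy, pairs = summarise(gold, predicted)
print(accuracy)       # 0.875
print(pairs[(1, 0)])  # 1: positive tweets predicted as negative
print(pairs[(0, 1)])  # 0: negative tweets predicted as positive
```

For the real model, the predicted labels could be collected with `[classifier.classify(feats) for feats, _ in test_set]` and the gold labels with `[label for _, label in test_set]`.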


We can now move on to defining a `get_sentiment` function that acts as a wrapper around our previously defined functions.

It takes a raw tweet as its input and returns its sentiment along with the probability the classifier assigned to that sentiment.

In [14]:
def get_sentiment(tweet):
    clean_data = clean(tweet)
    bag = BoW(clean_data)
    ans = classifier.classify(bag)
    prob = classifier.prob_classify(bag)
    if ans == 0:
        return "Negative Sentiment", prob.prob(0)
    return "Positive Sentiment", prob.prob(1)

In [15]:
get_sentiment(dummy_tweet)

('Negative Sentiment', 0.9995324354143069)

As we can see, this gives us the expected result. We can now write the driver code for this.


In [16]:
_id = input("Enter a tweet ID: ")
tweet = get_tweet(_id)
s, p = get_sentiment(tweet)
print("The tweet has", s, "\nThis was predicted with", p, "probability")

The tweet has Positive Sentiment 
This was predicted with 0.6957860762007811 probability


This was sentiment analysis using the `nltk` module.
We also happen to have a module called `textblob` which is built on top of `nltk`.
We will also explore an implementation based on that.

## Sentiment Analysis using textblob

Here we will use the textblob library to extract the sentiment of a given text. The first step is to import the required library.

In [17]:
from textblob import TextBlob

A `TextBlob` is an object that is initialised with a string. It has a very convenient property called `sentiment` which we can access to easily figure out the sentiment in the string used to initialise it.

In our case, this string will be the tweet.

Before we create the TextBlob object, we will need a cleaned string.

This will mainly be done using the `clean` function we created above for the `nltk` implementation, but we will need to convert the cleaned data back to a string.

This is because the TextBlob object needs to be initialised with a string, and not a list of words.

We will test this out on `dummy_tweet` that we defined above.

In [18]:
def tb_clean(tweet):
    return ' '.join(clean(tweet))
tb_clean(dummy_tweet)

'bad day today :( bad morn'

As we can see, it returns a string of cleaned words, as expected.

Now all we need to do is initialise a `TextBlob` object with this clean string and take a look at its `sentiment` property.

The `sentiment` property of a `TextBlob` is itself an object with two properties of its own:
* sentiment.polarity
* sentiment.subjectivity

The polarity is a number between -1.0 (most negative) and 1.0 (most positive) that indicates the sentiment of the string, so that is what we will use.

At this point, since we don't need to generate any vectors for our string or train any models, we can directly start writing our final wrapper function (as we did at the end of our `nltk` implementation).

In [19]:
def tb_sentiment(tweet):
    tweet = tb_clean(tweet)
    tb = TextBlob(tweet)
    # Polarity above 0 counts as positive; 0 or below counts as negative
    if tb.sentiment.polarity > 0:
        return 'Positive Sentiment'
    else:
        return 'Negative Sentiment'
tb_sentiment(dummy_tweet)

'Negative Sentiment'
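
One consequence of the `> 0` test is that a polarity of exactly `0.0`, which TextBlob typically reports when it finds no sentiment-bearing words, is labelled negative. A possible refinement, sketched here as an assumption rather than as part of the original project, adds a neutral band around zero. The function works on a plain float so that it runs without TextBlob; the name `label_polarity` and the `0.05` threshold are arbitrary choices.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1.0, 1.0] to one of three labels."""
    if polarity > threshold:
        return 'Positive Sentiment'
    if polarity < -threshold:
        return 'Negative Sentiment'
    return 'Neutral Sentiment'

for score in (0.8, 0.0, -0.7):
    print(score, label_polarity(score))
```

In the notebook it could be called as `label_polarity(TextBlob(tb_clean(tweet)).sentiment.polarity)`.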

We can now simply write the driver code and conclude our sentiment analysis project.

In [20]:
_id = input("Enter a tweet ID: ")
tweet = get_tweet(_id)
s = tb_sentiment(tweet)
print("The tweet has", s)

The tweet has Positive Sentiment


# Conclusion:
We have thus explored two different ways of performing sentiment analysis, using two different Python libraries:
* `nltk`
* `textblob`

## NLTK
We started off by cleaning our tweets.

We then implemented a feature extraction function to generate a bag of words.

Finally, we trained a classifier called `NaiveBayesClassifier` to perform supervised classification.

## Textblob
We again started off by cleaning our tweets.

We did not need to implement our own feature extraction function or train our own classifier.

We initialised a `TextBlob` object and accessed its `sentiment` property.