## Case Study - Sentiment Analysis

This case study shows how to perform sentiment analysis on Twitter data using NLP techniques in Python.

##### Scenario:

The demonetization of Indian currency was a step taken by the Government of India on November 8, 2016, that left the entire country in shock. Some currency notes were banned, the country reacted emotionally, and some people took to Twitter to express their feelings.

##### Challenge:

It is important to understand the implications of the steps taken by the Government and to act on the citizens' response. As a Data Scientist, you have to analyse these tweets to understand the overall reaction of citizens.

##### Dataset:

Data of tweets on #demonetization has been extracted from Twitter and is available in the file "demonetization-tweets_data.csv". This dataset contains 7470 rows and 12 columns. The columns are described below:

- <b>text</b>: Text of tweet
- <b>favorited</b>: Boolean; indicates whether the tweet has been liked by the authenticating user
- <b>favoriteCount</b>: Number of times this tweet has been liked
- <b>replyToSN</b>: Screen Name of original tweet's author if this tweet is a reply
- <b>created</b>: Timestamp of creation of tweet
- <b>truncated</b>: Boolean; indicates whether the tweet has been truncated due to length limits
- <b>replyToSID</b>: ID of the original tweet if this tweet is a reply
- <b>statusSource</b>: Source used to post the tweet
- <b>screenName</b>: Screen Name of the author of the tweet
- <b>retweetCount</b>: Number of times this tweet has been retweeted
- <b>isRetweet</b>: Boolean; indicates whether this tweet is a retweet or not
- <b>retweeted</b>: Boolean; indicates whether this tweet has been retweeted by the authenticating user


- ### Load the required libraries

In [1]:
# Load the required libraries from Python
# Make sure the required NLTK resources have been downloaded; if not, download them using the nltk.download command
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import re
import nltk 

- ### Load and analyse the data 


    - Load the data from the required location into a DataFrame
    - Analyse the shape of the data by printing its total number of rows & columns
    - Also print 5 rows of the DataFrame
    - Print the 'text' of the tweet with highest number of retweets

In [2]:
tweets = pd.read_csv("demonetization-tweets_data.csv",encoding = "ISO-8859-1")

In [3]:
tweets.shape

(7470, 12)

In [4]:
tweets.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,11/23/2016 18:40,False,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False
1,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,11/23/2016 18:40,False,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False
2,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,11/23/2016 18:39,False,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False
3,RT @gauravcsawant: Rs 40 lakh looted from a ba...,False,0,,11/23/2016 18:38,False,,"<a href=""http://twitter.com/download/android"" ...",bhodia1,637,True,False
4,RT @sumitbhati2002: Many opposition leaders ar...,False,0,,11/23/2016 18:38,False,,"<a href=""http://twitter.com/download/android"" ...",sumitbhati2002,1,True,False


In [5]:
tweets.loc[tweets['retweetCount'].idxmax(), 'text']

'RT @RNTata2000: The government\x92s bold implementation of the demonetization programme needs the nation\x92s support. https://t.co/tx1ZILSor8'
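
`idxmax()` returns the index label of the largest value, not its position, so label-based lookup with `.loc` is the matching accessor; the two only coincide while the DataFrame keeps its default 0..n-1 index. A minimal sketch on a made-up three-row frame (the `toy` data and its index values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
toy = pd.DataFrame(
    {"text": ["first", "second", "third"], "retweetCount": [5, 40, 12]},
    index=[10, 20, 30],
)

top_label = toy["retweetCount"].idxmax()  # index label of the largest count
top_text = toy.loc[top_label, "text"]     # label-based lookup
print(top_label, top_text)
```

Here `toy.iloc[top_label]` would raise an `IndexError`, because position 20 does not exist in a three-row frame.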

- ### Clean the data


    - Observe that the tweet text contains various elements such as 'Retweet tag RT@', 'punctuation marks' and 'stop words'
    - Use functions from Python libraries such as re, string and NLTK to remove these unnecessary elements


In [6]:
# Load the required libraries for cleaning
import string,re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [7]:
# Create a function to generate cleaned data from raw text
stop_words = set(stopwords.words('english')) # Build the stop-word set once instead of once per word

def clean_text(text):
    tokens = word_tokenize(text) # Create tokens
    tokens = tokens[4:] # Drop the first four tokens ('RT', '@', user name, ':'); assumes the tweet starts with 'RT @user:'
    text = " ".join(tokens) # Join tokens
    text = re.sub('https', '', text) # Remove the literal text 'https' (the rest of the URL is kept)
    text = ''.join(char for char in text if char not in string.punctuation) # Remove punctuation
    words = [word for word in text.split() if word.lower() not in stop_words] # Remove common English words (I, you, we,...)
    return " ".join(words)
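
Dropping the first four tokens assumes every tweet begins with `RT @user:`, and deleting only the text `https` leaves URL fragments such as `tcou7glnrq31f` in the output. One possible alternative, sketched below with the standard library only, removes the retweet prefix and whole URLs with regular expressions. The helper name `strip_tweet_markup` and the second sample tweet are hypothetical, the first sample is shortened from the tweet printed earlier, and stop-word removal is left out to keep the sketch independent of NLTK data.

```python
import re
import string

RT_PREFIX = re.compile(r"^RT\s+@\w+:\s*")   # leading 'RT @user:' only
URL = re.compile(r"https?://\S+")           # whole URL, not just the scheme
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_tweet_markup(text):
    """Hypothetical helper: drop the retweet prefix, URLs and punctuation."""
    text = RT_PREFIX.sub("", text)
    text = URL.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())           # collapse leftover whitespace

retweet = "RT @RNTata2000: The government needs the nation's support. https://t.co/tx1ZILSor8"
original = "Banks were crowded again today, long queues everywhere!"

print(strip_tweet_markup(retweet))
print(strip_tweet_markup(original))
```

Unlike a fixed slice, the prefix pattern leaves tweets that are not retweets untouched at the start.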

In [8]:
# Apply the function to 'text' to clean it
# Add cleaned data as a separate column to the DataFrame

tweets['cleaned_text']=tweets['text'].apply(clean_text)

In [9]:
# Print the first 5 values of cleaned tweet data

tweets['cleaned_text'].head()

0    Critical question PayTM informed Demonetizatio...
1    Former FinSec RBI Dy Governor CBDT Chair Harva...
2    Reddy Wedding mailtoday cartoon demonetization...
3    Rs 40 lakh looted bank Kishtwar J amp K Third ...
4    Many opposition leaders narendramodi Demonetiz...
Name: cleaned_text, dtype: object

- ### Process the data


    - Apart from cleaning, the data also needs to be processed to remove elements that may cause issues in the analysis
    - Examples of such elements are single characters, multiple spaces and upper-case letters
    - Apply various text pre-processing techniques one-by-one to the cleaned data
    
        - Remove all the special characters
        - Remove single characters appearing in the text except the start
        - Remove single characters appearing at the start
        - Substitute multiple spaces with a single space
        - Remove prefix 'b'
        - Convert to lowercase
        - Print first five values of processed data
        - Add the processed data as a separate column to the DataFrame


In [10]:
features = tweets['cleaned_text']
processed_features = []

for text in features:
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(text))
    
    # Remove single characters appearing in the text except the start
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    
    # Remove single characters appearing at the start
    # '^' is left unescaped so that it anchors the match at the start of the string
    processed_feature = re.sub(r'^[a-zA-Z]\s+', '', processed_feature)
    
    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)
    
    # Remove prefix 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    
    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)
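
The same steps can also be wrapped in a function and checked on a single string before running them over the whole column. This is a sketch rather than the notebook's own code: the name `preprocess` is an assumption, and the start-of-string pattern uses an unescaped `^` so that it anchors at the beginning.

```python
import re

def preprocess(text):
    """Hypothetical helper applying the pre-processing steps in order."""
    text = re.sub(r"\W", " ", str(text))         # special characters -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters inside the text
    text = re.sub(r"^[a-zA-Z]\s+", "", text)     # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    text = re.sub(r"^b\s+", "", text)            # leftover 'b' prefix
    return text.lower()

sample = "Rs 40 lakh looted bank Kishtwar J amp K Third incident"
print(preprocess(sample))
```

With such a function, `tweets['cleaned_text'].apply(preprocess)` would produce the processed column in a single call.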

In [11]:
# Print first five values of processed data
processed_features[:5]

['critical question paytm informed demonetization edict pm clearly fishy requires full disclosure amp',
 'former finsec rbi dy governor cbdt chair harvard professor lambaste demonetization aam aadmi listen th',
 'reddy wedding mailtoday cartoon demonetization reddywedding tcou7glnrq31f',
 'rs 40 lakh looted bank kishtwar amp third incident since demonetization terrorists',
 'many opposition leaders narendramodi demonetization respect decision support oppositio']

In [12]:
# Add the processed data as a separate column to the DataFrame

tweets['processed_text'] = processed_features

In [13]:
# Observe the entire data

tweets.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,statusSource,screenName,retweetCount,isRetweet,retweeted,cleaned_text,processed_text
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,11/23/2016 18:40,False,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False,Critical question PayTM informed Demonetizatio...,critical question paytm informed demonetizatio...
1,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,11/23/2016 18:40,False,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False,Former FinSec RBI Dy Governor CBDT Chair Harva...,former finsec rbi dy governor cbdt chair harva...
2,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,11/23/2016 18:39,False,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False,Reddy Wedding mailtoday cartoon demonetization...,reddy wedding mailtoday cartoon demonetization...
3,RT @gauravcsawant: Rs 40 lakh looted from a ba...,False,0,,11/23/2016 18:38,False,,"<a href=""http://twitter.com/download/android"" ...",bhodia1,637,True,False,Rs 40 lakh looted bank Kishtwar J amp K Third ...,rs 40 lakh looted bank kishtwar amp third inci...
4,RT @sumitbhati2002: Many opposition leaders ar...,False,0,,11/23/2016 18:38,False,,"<a href=""http://twitter.com/download/android"" ...",sumitbhati2002,1,True,False,Many opposition leaders narendramodi Demonetiz...,many opposition leaders narendramodi demonetiz...


- ### Run Sentiment analysis

    1. Import TextBlob from Python to calculate various Sentiment scores as described below:
        - <b>Polarity</b> is a float value within the range [-1.0 to 1.0] where 0 indicates neutral, +1 indicates a very positive sentiment and -1 represents a very negative sentiment.
        - <b>Subjectivity</b> is a float value within the range [0.0 to 1.0] where 0.0 is very objective and 1.0 is very subjective. Subjective sentences express personal feelings, views, beliefs, opinions, allegations, desires, suspicions and speculations, whereas objective sentences are factual.

    2. After calculating the above scores, encode the polarity scores into three categories- 'positive', 'negative' and 'neutral'
    
    3. Print the most positive and most negative tweet using Polarity score
    
    4. Print the most subjective and most objective tweet using Subjectivity score

In [14]:
from textblob import TextBlob  # Python library used to compute sentiment scores

In [15]:
# Create a function to calculate Sentiment scores for each text
def generate_polarity(text):
    sentiment = TextBlob(text).sentiment
    return sentiment

In [16]:
# Apply the function to processed data
sentiment = tweets['processed_text'].apply(generate_polarity)
sentiment = sentiment.to_frame()
sentiment.head()

Unnamed: 0,processed_text
0,"(0.15, 0.5777777777777778)"
1,"(0.0, 0.0)"
2,"(0.0, 0.0)"
3,"(0.0, 0.0)"
4,"(0.5, 0.5)"


In [17]:
# Use the first element as Polarity
sentiment['polarity'] = sentiment['processed_text'].apply(lambda x:x[0])

# Use the second element as Subjectivity
sentiment['subjectivity'] = sentiment['processed_text'].apply(lambda x: x[1])

In [18]:
# Add two columns to DataFrame for Polarity and Subjectivity score respectively

tweets['polarity'] = sentiment['polarity']
tweets['subjectivity'] = sentiment['subjectivity']

In [19]:
# Encode polarity into 'positive', 'negative' and 'neutral' based on the score

tweets['polarity_encoded'] = ['positive' if x > 0 else 'negative' if x < 0 else 'neutral' for x in tweets['polarity']]

In [20]:
# Print the number of tweets of each category of polarity
tweets['polarity_encoded'].value_counts()

neutral     3724
positive    2645
negative    1101
Name: polarity_encoded, dtype: int64
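
Expressed as shares, the counts above are easier to compare. The sketch below only redoes the arithmetic on the three printed counts; `tweets['polarity_encoded'].value_counts(normalize=True)` should give the same proportions directly from the DataFrame.

```python
# Counts copied from the value_counts() output above
counts = {"neutral": 3724, "positive": 2645, "negative": 1101}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Share of positive tweets among those with a non-zero polarity
polar = counts["positive"] + counts["negative"]
positive_among_polar = round(100 * counts["positive"] / polar, 1)

print(total, shares, positive_among_polar)
```

Roughly half of the tweets are scored neutral, and among the rest positive tweets outnumber negative ones by about 7 to 3.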

In [21]:
# Print the most positive and most negative tweet

print("The most positive tweet:", tweets.loc[tweets['polarity'].idxmax(), 'processed_text'])
print("The most negative tweet:", tweets.loc[tweets['polarity'].idxmin(), 'processed_text'])

The most positive tweet: one greatest computer scientists dr vijay bhatkar views demonetization decision hon pm narendramodi h
The most negative tweet: pathetic journalism media thought get stds atms another attempt malign demonetization tco


In [22]:
# Print the most subjective and most objective tweet

print("The most subjective tweet:", tweets.loc[tweets['subjectivity'].idxmax(), 'processed_text'])
print("The most objective tweet:", tweets.loc[tweets['subjectivity'].idxmin(), 'processed_text'])

The most subjective tweet: demonetization harbhajansingh gives hilarious shagun suggestion struggling wedding season
The most objective tweet: former finsec rbi dy governor cbdt chair harvard professor lambaste demonetization aam aadmi listen th


- ### Apply Vectorization

    1. Create a DataFrame containing only the columns of interest- Processed text & Polarity Category
    2. Tokenize the text using TweetTokenizer from NLTK
    3. Calculate the number of unique words (Bag of Words) using Count Vectorizer

In [23]:
# Inspect the available columns before selecting the columns of interest
tweets.columns

Index(['text', 'favorited', 'favoriteCount', 'replyToSN', 'created',
       'truncated', 'replyToSID', 'statusSource', 'screenName', 'retweetCount',
       'isRetweet', 'retweeted', 'cleaned_text', 'processed_text', 'polarity',
       'subjectivity', 'polarity_encoded'],
      dtype='object')

In [24]:
df = tweets[['processed_text', 'polarity_encoded']]

In [25]:
df.head()

Unnamed: 0,processed_text,polarity_encoded
0,critical question paytm informed demonetizatio...,positive
1,former finsec rbi dy governor cbdt chair harva...,neutral
2,reddy wedding mailtoday cartoon demonetization...,neutral
3,rs 40 lakh looted bank kishtwar amp third inci...,neutral
4,many opposition leaders narendramodi demonetiz...,positive


In [26]:
df.shape

(7470, 2)

In [27]:
# Load the libraries required for tokenization and vectorization

from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [28]:
# Function to generate tokens using TweetTokenizer
def tokenize(text): 
    tk = TweetTokenizer()
    return tk.tokenize(text)

vectorizer = CountVectorizer(analyzer = 'word',tokenizer = tokenize,lowercase = True,ngram_range=(1, 1))

In [29]:
# Generate unique words from the processed data by applying Count Vectorizer along with TweetTokenizer

count= vectorizer.fit_transform(df['processed_text'])

In [30]:
# What is the shape of the data? Count Vectorizer provides information about the unique words present in the data
count.shape
# This returns the shape of the document-term matrix generated by Count Vectorizer
# The matrix has one row per row of the input DataFrame and one column per unique n-gram (here unigrams) created by the vectorizer

(7470, 8919)
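
To make the shape concrete, a bag-of-words matrix can be built by hand for a few short documents. The three documents below are invented for illustration, and whitespace splitting stands in for TweetTokenizer:

```python
from collections import Counter

# Invented mini-corpus; each string plays the role of one processed tweet
docs = [
    "demonetization support decision",
    "bank queue demonetization",
    "support support bank",
]

# Vocabulary: every unique token, sorted alphabetically
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, cells hold term counts
matrix = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

shape = (len(matrix), len(vocab))
print(vocab)
print(matrix)
print(shape)
```

Three documents with five distinct words give a 3 x 5 matrix, in the same way that 7470 tweets with 8919 distinct tokens give the shape printed above.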

- ### Create a classification model on our data

    1. Split the data into training and testing data sets
        - Use processed data as independent variable and polarity as dependent variable
    2. Extract features using TFIDF Vectorizer
    3. Perform Multinomial Naive Bayes Classification
        - Apply MultinomialNB on training data
        - Predict polarity by applying the fitted model to testing data
        - Calculate accuracy of predicted values
    4. Perform Random Forest classification on the processed data and compare the accuracy score of both these models

In [31]:
# Load the libraries required for performing classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

In [32]:
# Split the data into training and testing data sets
# Use processed data as independent variable and polarity as dependent variable

X = df['processed_text'].values
y = df['polarity_encoded'].values

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=100, test_size=0.3)
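
Because the three polarity classes are unevenly represented, one option is to stratify the split so that both parts keep the same class mix. The sketch below is an illustration on made-up data, not the split used above; `stratify` is a standard `train_test_split` parameter, while the toy texts and the 50/30/20 class mix are assumptions:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 texts with a 50/30/20 class mix
y_toy = ["neutral"] * 50 + ["positive"] * 30 + ["negative"] * 20
X_toy = [f"tweet {i}" for i in range(len(y_toy))]

Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

train_counts = Counter(ytr)
test_counts = Counter(yte)
print(train_counts)
print(test_counts)
```

Each class contributes 70% of its rows to the training part and 30% to the testing part.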

In [33]:
# Extract features using TFIDF Vectorizer

vectorizer = TfidfVectorizer(max_features=1000)
X_train_idf = vectorizer.fit_transform(X_train)
X_test_idf = vectorizer.transform(X_test)

In [34]:
# Print idf values
df_idf = pd.DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"])
# Sort descending
df_idf.sort_values(by=['idf_weights'], ascending=False).head()

Unnamed: 0,idf_weights
ysrcp,7.770407
u092c,7.482725
listen,7.364942
tcorng1gkiugm,7.364942
tcorijenewx7y,7.364942
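
By default, `TfidfVectorizer` uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of training documents and df(t) is the number of documents containing term t. The sketch below evaluates that formula for n = 5229, the training size implied by the 70/30 split of 7470 tweets. The document frequencies 5, 7 and 8 are not reported in the output; they are assumed here because they reproduce the listed weights.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf as used by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_train = 7470 - 2241  # 5229 training rows after holding out 30% for testing

# Assumed document frequencies (not shown in the notebook output)
for doc_freq in (5, 7, 8):
    print(doc_freq, round(smoothed_idf(n_train, doc_freq), 6))
```

The rarer a term, the higher its weight, which is why the top of the sorted table is dominated by tokens that occur in only a handful of tweets.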


In [35]:
# Perform Multinomial Naive Bayes Classification
# Apply MultinomialNB on training data
mnb = MultinomialNB()
mnb.fit(X_train_idf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [36]:
# Predict polarity by applying the fitted model to testing data
pred_mnb = mnb.predict(X_test_idf)

# Calculate accuracy of predicted values
acc = accuracy_score(y_test, pred_mnb)


results = pd.DataFrame([['Multinomial Naive Bayes', acc]],
               columns = ['Model', 'Accuracy'])

print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.862115


In [37]:
# Perform Random Forest classification on the processed data and compare the accuracy score of both these models

# Random Forest Classifier with 'gini'

from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train_idf, y_train)

# Predict using testing data
y_pred_rf = clf_rf.predict(X_test_idf)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Gini)', acc]],
               columns = ['Model', 'Accuracy'])

results = pd.concat([results, model_results], ignore_index=True)
print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.862115
1      Random Forest(Gini)  0.927711


In [38]:
# Random Forest Classifier with 'entropy'

from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(criterion='entropy')
clf_rf.fit(X_train_idf, y_train)

# Predict using testing data
y_pred_rf = clf_rf.predict(X_test_idf)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred_rf)

model_results = pd.DataFrame([['Random Forest(Entropy)', acc]],
               columns = ['Model', 'Accuracy'])

results = pd.concat([results, model_results], ignore_index=True)
print(results)

                     Model  Accuracy
0  Multinomial Naive Bayes  0.862115
1      Random Forest(Gini)  0.927711
2   Random Forest(Entropy)  0.920571
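
`RandomForestClassifier` draws random bootstrap samples and feature subsets, so the accuracies above can shift slightly from run to run unless `random_state` is fixed, which matters when two scores differ by less than one percentage point. A minimal sketch on invented toy texts (the texts, labels, seed and helper name `fit_predict` are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus, only to show the effect of fixing the seed
texts = [
    "great bold decision support",
    "pathetic journalism malign",
    "bank queue long",
    "greatest scientists support decision",
    "terrible chaos outside atm",
    "notes exchanged at bank",
    "good move against black money",
    "worst planning ever",
]
labels = ["positive", "negative", "neutral", "positive",
          "negative", "neutral", "positive", "negative"]

X_toy = TfidfVectorizer().fit_transform(texts)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and return class probabilities."""
    clf = RandomForestClassifier(n_estimators=25, criterion="gini", random_state=seed)
    clf.fit(X_toy, labels)
    return clf.predict_proba(X_toy)

# Two independent fits with the same seed give identical probabilities
reproducible = bool((fit_predict(100) == fit_predict(100)).all())
print(reproducible)
```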


In [39]:
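# Display confusion matrix for the Random Forest (entropy) model, the last one fitted
# Rows are true labels and columns are predicted labels, in sorted order: negative, neutral, positive

confusion_matrix(y_test, y_pred_rf)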

array([[ 270,   75,    9],
       [   6, 1048,    5],
       [   5,   78,  745]])
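
Accuracy alone hides how each class fares. The per-class precision and recall can be read off the matrix above, assuming scikit-learn's default label order (sorted alphabetically: negative, neutral, positive) with true labels on the rows. A small sketch using only the printed numbers:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
labels = ["negative", "neutral", "positive"]
cm = [
    [270, 75, 9],
    [6, 1048, 5],
    [5, 78, 745],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    recall[label] = cm[i][i] / sum(cm[i])                    # correct / all true
    precision[label] = cm[i][i] / sum(row[i] for row in cm)  # correct / all predicted

print(round(accuracy, 6))
for label in labels:
    print(label, round(precision[label], 3), round(recall[label], 3))
```

The accuracy recovered from the matrix, 0.920571, matches the Random Forest(Entropy) row, and recall is lowest for the negative class (about 0.76), with most of its errors predicted as neutral.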

<b><i>Conclusion</i></b>: In this demonstration of the case study, we examined how to perform Sentiment Analysis on Twitter data through various phases such as data cleaning, data pre-processing, tokenization, sentiment scoring, feature extraction and classification using Machine Learning algorithms.