# Sentiment analysis using labelled data and dictionary based approaches


## Text classification / Sentiment analysis using labelled data

Suppose you are given short reviews of some movies. A review may speak well or badly of the movie. A human reader can identify the sentiment of the text from the words in the sentence. How can we make a machine understand the sentiment in the text?

One way is the machine learning (ML) way. A ground truth is created for some corpus, i.e. we have both positive and negative reviews, each tagged with its class. The algorithm is trained on this data (after it is converted to a structured form) and learns a pattern that lets it classify a review from the words it uses.

Another way is the dictionary approach, where we create a dictionary of positive and negative words and explicitly state which words are positive and which are negative. We can then count the number of positive and negative words in a sentence and compute a score. If the score is positive, the sentence is labelled positive; otherwise it is labelled negative.

In either case there is manual work involved: creating the ground truth in the first case, or creating the dictionary in the second.
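
The dictionary approach described above can be sketched in a few lines of plain Python. The word lists are tiny, made-up examples (an assumption for illustration, not a real sentiment lexicon), and `dictionary_score` and `dictionary_label` are hypothetical helper names.

```python
# Toy lexicon: an illustrative assumption, not a real sentiment dictionary.
POSITIVE_WORDS = {"good", "great", "fun", "honest", "rare"}
NEGATIVE_WORDS = {"bad", "silly", "tedious", "vapid", "juvenile"}


def dictionary_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = text.lower().split()
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def dictionary_label(text):
    """Label a text positive if its score is above zero, otherwise negative."""
    return "positive" if dictionary_score(text) > 0 else "negative"


print(dictionary_label("wasabi is a good place to start"))   # positive
print(dictionary_label("simplistic , silly and tedious ."))  # negative
```

A text with no dictionary words scores zero and falls into the negative class under this rule, which is one reason practical lexicons often add a neutral band.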

### Load Libraries

In [6]:
import re
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

### Load Data - Positive text

In [7]:
# Read the positive reviews, one review per line ("r" opens the file for reading)
with open("short_reviews/positive.txt", "r", encoding="latin-1") as f1:
    short_pos = f1.readlines()

In [8]:
len(short_pos)

5331

In [9]:
short_pos[:10]

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n',
 'effective but too-tepid biopic\n',
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \n',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \n",
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game . \n',
 'offers that rare combination of entertainment and education . \n',
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions 

In [10]:
type(short_pos)

list

In [11]:
# Strip the newline characters, then keep the first 1000 reviews
short_pos = [re.sub("\n", "", i) for i in short_pos]
x_short_pos = short_pos[:1000]

In [12]:
len(x_short_pos)

1000

### Load Data - Negative text

In [14]:
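# Read the negative reviews, one review per line
with open("short_reviews/negative.txt", "r", encoding="latin-1") as f2:
    short_neg = f2.readlines()
print(len(short_neg))
# Strip the newline characters, then keep the first 1000 reviews
short_neg = [re.sub("\n", "", i) for i in short_neg]
x_short_neg = short_neg[:1000]
len(x_short_neg)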

5331


1000

In [15]:
x_short_neg[:5]

['simplistic , silly and tedious . ',
 "it's so laddish and juvenile , only teenage boys could possibly find it funny . ",
 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . ',
 '[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . ',
 'a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . ']

In [16]:
#Combine both the positive and negative reviews data
data=x_short_pos+x_short_neg

In [17]:
len(data)

2000

In [18]:
data[1]

'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . '

### Preprocessing

In [21]:
from nltk.corpus import stopwords

# In Python, searching a set is much faster than searching a list,
# so convert the stop words to a set once, outside the function
stops = set(stopwords.words("english"))

def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)

    # 1. Replace every character that is not a letter or digit with a space
    letters_only = re.sub("[^a-zA-Z0-9]", " ", raw_review)

    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()

    # 3. Remove stop words
    meaningful_words = [w for w in words if w not in stops]

    # 4. Join the words back into one string separated by spaces,
    # and return the result
    return " ".join(meaningful_words)


# Clean every review and replace the raw data with the cleaned version
data = [review_to_words(review) for review in data]

In [22]:
data[1]

'gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe co writer director peter jackson expanded vision j r r tolkien middle earth'

In [24]:
cv=CountVectorizer(stop_words='english',lowercase=True,
                   strip_accents='unicode',decode_error='ignore')

tdm = cv.fit_transform(data)
tdm

<2000x7383 sparse matrix of type '<class 'numpy.int64'>'
	with 18836 stored elements in Compressed Sparse Row format>
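
To make the shape of this matrix concrete, here is a hand-rolled bag-of-words sketch in plain Python. It only approximates what a count vectorizer produces (no stop-word removal, no accent stripping), and both the two-sentence corpus and the helper name `bag_of_words` are made up for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Build a sorted vocabulary and one row of word counts per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ["good film good cast", "silly tedious film"]  # made-up corpus
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['cast', 'film', 'good', 'silly', 'tedious']
print(rows)        # [[1, 1, 2, 0, 0], [0, 1, 0, 1, 1]]
```

Each row is one document and each column one vocabulary word, which is why the matrix above has 2000 rows (reviews) and 7383 columns (distinct words) and consists mostly of zeros.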

In [25]:
Mat = tdm.todense()
Mat

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [26]:
Mat.shape

(2000, 7383)

In [27]:
import pandas as pd
Mat = pd.DataFrame(Mat)
Mat.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7373,7374,7375,7376,7377,7378,7379,7380,7381,7382
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
#Define the Target Variable
Mat['type'] = ['pos']*1000+['neg']*1000
Mat.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7374,7375,7376,7377,7378,7379,7380,7381,7382,type
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,pos


In [29]:
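# Train/test split: shuffle the rows (fixed seed for reproducibility),
# then use the first 1500 rows for training and the remaining 500 for testing
Mat = Mat.sample(frac = 1,random_state=1234)
train = Mat.iloc[:1500]
test = Mat.iloc[1500:]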

### Run a Logistic Regression model

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
logreg = LogisticRegression()
X=train.iloc[:,:-1]
Y=train.iloc[:,-1]

In [32]:
logreg.fit(X,Y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [33]:
# Predictions on test data
test1 = test.iloc[:, :-1]  # all columns except the last one (the target)
true = test.iloc[:, -1]    # the target column
pred = logreg.predict(test1)

In [34]:
# Confusion matrix on the test data (rows: true labels, columns: predicted labels)
confusion_matrix(true, pred)

array([[161,  99],
       [ 71, 169]])
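
The confusion matrix can be turned into summary metrics by hand. The sketch below copies the four numbers from the output above and relies on scikit-learn's convention that labels appear in sorted order (`'neg'` first, then `'pos'`), with true labels on the rows and predictions on the columns.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predictions; order is 'neg', then 'pos'.
matrix = [[161, 99],
          [71, 169]]

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]
accuracy = correct / total

# Precision and recall for the 'pos' class.
pos_precision = matrix[1][1] / (matrix[0][1] + matrix[1][1])
pos_recall = matrix[1][1] / (matrix[1][0] + matrix[1][1])

print(f"accuracy      = {accuracy:.3f}")       # 0.660
print(f"pos precision = {pos_precision:.3f}")  # 0.631
print(f"pos recall    = {pos_recall:.3f}")     # 0.704
```

An accuracy of 66% on a balanced two-class problem is only moderately better than the 50% expected from guessing, so there is room for improvement.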

__Exercise: try other classification models and TF-IDF features, and check whether you can improve the accuracy.__

Multivariate Bernoulli Naive Bayes **(BernoulliNB)**: The Bernoulli model is useful if your **feature vectors are binary (i.e., 0s and 1s)**. One application is text classification with a bag-of-words model, where 1 means "word occurs in the document" and 0 means "word does not occur in the document".

Multinomial Naive Bayes **(MultinomialNB)**: The multinomial Naive Bayes model is typically used for discrete counts. For example, in a **text classification problem** we can take the idea of Bernoulli trials one step further: instead of **"word occurs in the document"** we record "how often the word occurs in the document". You can think of it as "the number of times outcome x_i is observed over n trials".

Gaussian Naive Bayes **(GaussianNB)**: Here we assume that the features follow a normal distribution. Instead of discrete counts we have **continuous features** (e.g., the popular Iris dataset, where the features are sepal width, petal width, sepal length and petal length).
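
As a starting point for trying another model with TF-IDF features, the sketch below chains scikit-learn's `TfidfVectorizer` and `MultinomialNB` in a pipeline. The four training sentences are made up and stand in for the cleaned reviews; TF-IDF values are fractional rather than true counts, which `MultinomialNB` accepts in practice, but treat this as an experiment and not a tuned model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the cleaned reviews.
train_texts = [
    "good fun film",
    "rare honest movie",
    "silly tedious plot",
    "vapid juvenile mess",
]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorises the text and fits the classifier in one step.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["honest fun movie", "tedious mess"])
print(predictions)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned from the training texts only, which avoids leaking test data into the features.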


**What else could be done to improve the accuracy?**

1. Should I manually classify all English words into emotions?
2. Should I weight words like good, better, best, etc.?
3. Will reviews containing phrases like 'not bad' or 'not good' be handled correctly?
4. Is there an easier method?
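
The third question is worth a small experiment. Common stop-word lists include "not", so a pipeline that strips stop words may discard the negation before the model ever sees it. The sketch below shows one simple, hypothetical way to handle negation in a word-counting scorer: flip the polarity of the word that follows a negation. The word lists and the function name `score_with_negation` are assumptions for illustration.

```python
POSITIVE_WORDS = {"good", "great"}   # toy lexicon (assumption)
NEGATIVE_WORDS = {"bad", "boring"}   # toy lexicon (assumption)
NEGATIONS = {"not", "never", "no"}


def score_with_negation(text):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    score = 0
    negate_next = False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate_next = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (word in POSITIVE_WORDS) - (word in NEGATIVE_WORDS)
        score += -polarity if negate_next else polarity
        negate_next = False
    return score


print(score_with_negation("not bad"))   # 1
print(score_with_negation("not good"))  # -1
print(score_with_negation("good"))      # 1
```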

**There are libraries that help:**

**NLTK (Natural Language Toolkit)**: One of the oldest NLP libraries, used mostly for research and education.

**TextBlob**: Built on top of NLTK and well suited to beginners. It offers a user-friendly, intuitive interface to NLTK and is used for rapid prototyping.

**spaCy**: Widely used in industry and designed for production use.

**CoreNLP (Stanford CoreNLP)**: A production-ready solution built and maintained by the Stanford NLP group; it is written in Java.

**gensim**: A package for topic modeling, vector space modeling and document similarity.

**Polyglot**: Usually used for projects involving a language that spaCy doesn’t support.

## Sentiment Analysis using a dictionary based approach

### Tweets Sentiment Use Case


In [69]:
# !pip install tweepy

# TextBlob is a Python library for processing textual data.
# Install it with the following pip command:
# !pip install textblob

# TextBlob also needs some NLTK corpora, which can be downloaded with:
# !python -m textblob.download_corpora

# (A corpus is a large, structured set of texts; "corpora" is the plural.)

How TextBlob works - https://planspace.org/20150607-textblob_sentiment/

Create Twitter app - https://docs.inboundnow.com/guide/create-twitter-application/

In [70]:
import re 
import tweepy 
from tweepy import OAuthHandler 
from textblob import TextBlob 
  
class TwitterClient(object): 
    ''' 
    Generic Twitter Class for sentiment analysis. 
    '''
    def __init__(self): 
        ''' 
        Class constructor or initialization method. 
        '''
        # keys and tokens from the Twitter Dev Console 
        consumer_key = ''
        consumer_secret = ''
        access_token = ''
        access_token_secret = ''
  
        # attempt authentication 
        try: 
            # create OAuthHandler object 
            self.auth = OAuthHandler(consumer_key, consumer_secret) 
            # set access token and secret 
            self.auth.set_access_token(access_token, access_token_secret) 
            # create tweepy API object to fetch tweets 
            self.api = tweepy.API(self.auth) 
        except Exception as e:
            print("Error: Authentication Failed ({})".format(e))
  
    def clean_tweet(self, tweet): 
        ''' 
        Utility function to clean tweet text by removing links, special characters 
        using simple regex statements. 
        '''
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
  
    def get_tweet_sentiment(self, tweet): 
        ''' 
        Utility function to classify sentiment of passed tweet 
        using textblob's sentiment method 
        '''
        # create TextBlob object of passed tweet text 
        analysis = TextBlob(self.clean_tweet(tweet)) 
        # set sentiment 
        if analysis.sentiment.polarity > 0: 
            return('positive')
        elif analysis.sentiment.polarity == 0: 
            return ('neutral')
        else: 
            return ('negative')
  
    def get_tweets(self, query, count = 10): 
        ''' 
        Main function to fetch tweets and parse them. 
        '''
        # empty list to store parsed tweets 
        tweets = [] 
  
        try: 
            # call twitter api to fetch tweets
            # (this is the Tweepy 3.x name; Tweepy 4.x renamed it to search_tweets)
            fetched_tweets = self.api.search(q = query, count = count)
  
            # parsing tweets one by one 
            for tweet in fetched_tweets: 
                # empty dictionary to store required params of a tweet 
                parsed_tweet = {} 
  
                # saving text of tweet 
                parsed_tweet['text'] = tweet.text 
                # saving sentiment of tweet 
                parsed_tweet['sentiment'] = self.get_tweet_sentiment(tweet.text) 
  
                # appending parsed tweet to tweets list 
                if tweet.retweet_count > 0: 
                    # if tweet has retweets, ensure that it is appended only once 
                    if parsed_tweet not in tweets: 
                        tweets.append(parsed_tweet) 
                else: 
                    tweets.append(parsed_tweet) 
  
            # return parsed tweets 
            return (tweets) 
  
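        except tweepy.TweepError as e:
            # print error (if any) and return whatever was collected
            # (usually an empty list) so that callers can still use len()
            # (Tweepy 4.x renamed TweepError to TweepyException)
            print("Error : " + str(e))
            return tweets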
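
To see what the cleaning step does without calling the Twitter API, the sketch below applies the same kind of pattern to a made-up tweet. The pattern is modelled on `clean_tweet` (mentions, then any character that is not alphanumeric, space or tab, then URL-like tokens); the name `clean_text` and the sample text are assumptions for illustration.

```python
import re

# Mentions | single non-alphanumeric characters | URL-like tokens.
TWEET_NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")


def clean_text(text):
    """Replace noise with spaces, then collapse repeated whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())


sample = "@user Loved the movie!! https://t.co/abc123 #great"  # made-up tweet
print(clean_text(sample))  # Loved the movie great
```

Note that the hashtag symbol is removed but the hashtag word is kept, so `#great` still contributes to the polarity score.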
  

In [71]:
def main(): 
    # creating object of TwitterClient Class 
    api = TwitterClient() 
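    # calling function to get tweets
    tweets = api.get_tweets(query = 'surgical strike', count = 150)
    # stop early if nothing was fetched, to avoid dividing by zero below
    if not tweets:
        print("No tweets fetched")
        return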
  
    # picking positive tweets from tweets 
    ptweets = [tweet for tweet in tweets if tweet['sentiment'] == 'positive'] 
    # percentage of positive tweets 
    print("Positive tweets percentage: {} %".format(100*len(ptweets)/len(tweets))) 
    # picking negative tweets from tweets 
    ntweets = [tweet for tweet in tweets if tweet['sentiment'] == 'negative'] 
    # percentage of negative tweets 
    print("Negative tweets percentage: {} %".format(100*len(ntweets)/len(tweets))) 
    # percentage of neutral tweets 
    print("Neutral tweets percentage: {} % ".format(100*(len(tweets) - (len(ntweets) + len(ptweets)))/len(tweets))) 
  
    # printing first 10 positive tweets 
    print("\n\nPositive tweets:") 
    for tweet in ptweets[:10]: 
        print(tweet['text']) 
  
    # printing first 10 negative tweets 
    print("\n\nNegative tweets:") 
    for tweet in ntweets[:10]: 
        print(tweet['text'])

In [None]:
if __name__ == "__main__": 
    # calling main function 
    main() 
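
The percentage calculations in `main` can also be written more compactly with a counter. This is a minimal sketch on made-up labels; `sentiment_shares` is a hypothetical helper name, and it returns an empty dictionary when there are no labels so that no division by zero can occur.

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the percentage of each label, or an empty dict for no labels."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: 100 * count / len(labels) for label, count in counts.items()}


labels = ["positive", "neutral", "positive", "negative"]  # made-up labels
print(sentiment_shares(labels))
# {'positive': 50.0, 'neutral': 25.0, 'negative': 25.0}
```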