# Opinion Mining & Sentiment Analysis

## What is Opinion Mining?
    Opinion mining is the mining of human-generated data. People act like sensors: they signal an opinion whenever they use a product, watch a movie, or use a service.
    The output of these human sensors is unstructured data, which may be video, audio, or text.
    Opinion mining is subjective at two levels. The opinion provider is one person, who interprets their experience of a service or object and gives feedback. The text miner or data analyst is a second person, who interprets that feedback. Both parties work from interpretation and share their understanding and inferences subjectively, so everything in opinion mining is subjective and nothing can be called factually right or wrong.
    
### What is it that we want to understand?
    The basic questions that come up most often are:
        1. Who is talking about the product?
        2. What is the product?
    Answering these leads to a few more questions:
        1. What is the opinion?
        2. In what context was the opinion expressed?
        3. Is it good for the product? Is it positive or negative?

### How easy is this task of opinion mining?
    Sometimes information such as who is talking about the product is readily available. At other times we have only a text passage, and both the opinion holder and the target are hidden in the text (for example, the passage may refer indirectly to a government official while the opinion holder belongs to an opposition party). In such cases the information has to be deduced from the passage.
    The problem gets more complex when the opinion provider is a group rather than an individual, when the target is someone else's opinion or a set of products rather than a single entity, or when the opinion text or its context is highly complex.
    
## Why Opinion Mining?
    Some reasons that come to mind are:
        1. To make better decisions
        2. To understand people
        3. To improve and target advertising
        4. For Business Intelligence
        5. For Market Research
        6. For any other research...
        
## What is Sentiment Analysis?
    At its core, it is a classification problem. Very often we already know the opinion holder, the opinion target (product), and the context of the opinion. The only thing left is to analyse the sentiment.
    So the input is text data and the output is a sentiment. There are two types of analysis:
        a. Polarity Analysis - positive, negative, neutral, or rank-ordered categories such as 1, 2, 3, 4
        b. Emotion Analysis - sad, happy, angry, scared, disgusted
    Either way, it is a sentiment classification problem.
    
## How do we do this analysis?
    Feature identification comes to the rescue here. It is the most complex step, and identifying the right features can make a huge difference. Natural Language Processing is often described as a powerful tool for identifying the right features, though rich NLP features can lead to overfitting.
    
    Some commonly used features are:
        1. Character n-grams - Sequences of n consecutive characters, where n is any value the analyst finds relevant
        2. Word n-grams - Sequences of n consecutive words
        3. Part-of-speech (POS) tag n-grams - Sequences of n consecutive POS tags such as adjective, noun, verb
        4. Word classes - Syntactic classes such as POS tags, semantic classes such as thesaurus entries, or other word clusters
        5. Word Patterns - Frequently used word patterns
        6. Sentence Patterns - Specific sets of repeating sentences
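
The first three feature types in the list can be illustrated with a few lines of plain Python. This is a minimal sketch, not a production tokenizer: the helper names `char_ngrams` and `word_ngrams` are hypothetical, the sentence is split on whitespace only, and the POS tags are assigned by hand where a real tagger would normally be used.

```python
def char_ngrams(text, n):
    """Return all contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return all contiguous token sequences of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love my iPhone"
tokens = sentence.split()  # naive whitespace tokenization, for illustration only

# Hand-assigned POS tags (hypothetical; a real tagger would produce these)
pos_tags = ["PRON", "VERB", "PRON", "NOUN"]

print(char_ngrams("love", 3))     # character 3-grams of one word
print(word_ngrams(tokens, 2))     # word 2-grams (bigrams)
print(word_ngrams(pos_tags, 2))   # POS tag 2-grams
```

The same sliding window serves all three cases; only the unit (character, word, or tag) changes.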
        
    Choosing the right features can be a tough call. The selection of text features depends on the purpose of your mining task. For example, if a data analyst aims to classify text as positive or negative, a unigram (1-gram) word feature on its own can be a poor choice. Take two sentences: 'I love my iPhone.' and 'I don't love my iPhone as much as I love my MacBook.' With the unigram 'love' as the feature, both sentences would be classified as positive towards the iPhone, because 'love' signals positiveness in both, even though the second sentence is negative.
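
The iPhone example can be checked directly. The sketch below uses a deliberately simple tokenizer (lowercasing and whitespace splitting, an assumption made for illustration) and hypothetical helper names. It shows that the unigram 'love' is present in both sentences, while the bigram "don't love" appears only in the negative one.

```python
positive = "I love my iPhone."
negative = "I don't love my iPhone as much as I love my MacBook."


def tokenize(text):
    """Lowercase, drop the final period, and split on whitespace (illustrative only)."""
    return text.lower().rstrip(".").split()


def ngram_features(tokens, n):
    """Set of word n-grams (as tuples) present in the token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


for text in (positive, negative):
    tokens = tokenize(text)
    has_love = ("love",) in ngram_features(tokens, 1)
    has_negated_love = ("don't", "love") in ngram_features(tokens, 2)
    print(text, "| unigram love:", has_love, "| bigram don't love:", has_negated_love)
```

A bigram feature is only one possible remedy; negation handling can also be done in other ways, and the right choice depends on the task.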
    
## How do we define machine learning process for Sentiment Analysis?
    The steps are similar to those of other machine learning problems.
    A. Select a set of features that you, as an analyst and domain expert, believe are appropriate
    B. Train a model on your data using those features
    C. Validate the model on new data and modify the features based on the errors

## Movie Reviews - Sentiment Analysis
Let us now consider a movie review sentiment analysis use case in Python.

In [1]:
################ IMPORT DATA AND EXPLORE #########################

# Importing nltk & random package
import nltk
import random
# Importing the dataset & stopwords corpus
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews

In [2]:
# Creating a documents list which stores each review's word list and its category (pos or neg)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# we randomly shuffle the documents before creating training and testing datasets
random.shuffle(documents)

In [3]:
# looking at the document at index 40 (an arbitrary pick) to see its content
# print(documents[40])

In [4]:
# Listing categories
movie_reviews.categories()

['neg', 'pos']

In [5]:
# Listing unique file ids
# movie_reviews.fileids()

In [6]:
# Finding out number of categories
len(movie_reviews.categories())

2

In [7]:
# Finding out number of distinct file ids
len(movie_reviews.fileids())

2000

In [8]:
# Here we define a function that keeps only the useful words, i.e. words that are not English stop words
# The useful words are stored in a dict that maps each word to True
# The stop word list is loaded once into a set; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words("english"))

def create_word_features(words):
    useful_words = [word for word in words if word not in english_stopwords]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict

In [9]:
# Here we create a list for the movies with negative reviews
# the for loop scans every fileid that belongs to the neg category
# for each neg fileid, the words of that review are read into 'words'
# the user defined function 'create_word_features' keeps only the useful words (no stop words) as a feature dict
# each (feature dict, "negative") pair is appended to the 'neg_reviews' list
# As a result, we get one labelled feature dict per negative review
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append((create_word_features(words), "negative"))

In [10]:
# Just as with neg_reviews, we create a list of labelled feature dicts for the positive reviews
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append((create_word_features(words), "positive"))

In [11]:
# To verify the list contents we print the length of neg_reviews,
# which should come out to be 1000 as there are exactly 1000 neg reviews
print(len(neg_reviews))

1000


In [45]:
# to check the contents of neg_reviews, we pull out the 1000 feature dicts and look at the second one
neg_review_sample = [item[0] for item in neg_reviews]
neg_review_sample[1]

{'"': True,
 "'": True,
 '(': True,
 ')': True,
 ',': True,
 '.': True,
 'across': True,
 'acting': True,
 'action': True,
 'another': True,
 'around': True,
 'average': True,
 'back': True,
 'baldwin': True,
 'bastard': True,
 'big': True,
 'body': True,
 'brain': True,
 'bringing': True,
 'brother': True,
 'bug': True,
 'cgi': True,
 'chase': True,
 'comes': True,
 'course': True,
 'crew': True,
 'curtis': True,
 'damn': True,
 'deserted': True,
 'design': True,
 'donald': True,
 'drunkenly': True,
 'empty': True,
 'even': True,
 'feels': True,
 'flash': True,
 'flashy': True,
 'get': True,
 'going': True,
 'good': True,
 'gore': True,
 'got': True,
 'h20': True,
 'halloween': True,
 'happy': True,
 'head': True,
 'hey': True,
 'hit': True,
 'jamie': True,
 'kick': True,
 'know': True,
 'lee': True,
 'let': True,
 'like': True,
 'likely': True,
 'likes': True,
 'little': True,
 'middle': True,
 'mir': True,
 'movie': True,
 'much': True,
 'nowhere': True,
 'occasional': True,
 'origi

In [13]:
# Create training and testing dataset
train_set = neg_reviews[:750] + pos_reviews[:750]
test_set =  neg_reviews[750:] + pos_reviews[750:]
print(len(train_set),  len(test_set))

1500 500
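
The split above produces `(feature dict, label)` pairs, which is the format NLTK's `nltk.NaiveBayesClassifier.train` accepts. To show what such a classifier does with these pairs, here is a simplified Naive Bayes written in plain Python. It is a stand-in sketch under stated assumptions, not NLTK's implementation: the function names and the toy data are hypothetical, and it uses add-one smoothing over the words present in each review.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_features):
    """Count labels and per-label word occurrences from (feature dict, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labelled_features:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, features):
    """Return the label with the highest log-probability, using add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in features:
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data in the same shape as train_set and test_set
toy_train_set = [
    ({"great": True, "fun": True}, "positive"),
    ({"great": True, "acting": True}, "positive"),
    ({"boring": True, "plot": True}, "negative"),
    ({"boring": True, "acting": True}, "negative"),
]
toy_test_set = [
    ({"great": True, "plot": True}, "positive"),
    ({"boring": True, "fun": True}, "negative"),
]

model = train_naive_bayes(toy_train_set)
correct = sum(classify(model, feats) == label for feats, label in toy_test_set)
accuracy = correct / len(toy_test_set)
print(accuracy)
```

On the real data, the same train-then-score pattern would be applied to `train_set` and `test_set`.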


In [28]:
#importing necessary packages
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
import os
import warnings
warnings.filterwarnings('ignore')

In [15]:
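#setting up working directory
os.getcwd()
os.chdir("/Users/dhanashreepokale/Downloads/Data Mining/tm/data")
# names= is passed without header=0, so the file's own header line is read as the first data row (row 0 in the output)
df = pd.read_csv("movie_reviews_n.txt", sep='\t',names=['content','polarity'])
df.head()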

                                             content  polarity
0                                            content  polarity
1  plot : two teen couples go to a church party ,...         0
2  the happy bastard's quick movie review damn th...         0
3  it is movies like these that make a jaded movi...         0
4  " quest for camelot " is warner bros . ' firs...         0


In [16]:
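#TFIDF Vectorizer
stopset = set(stopwords.words('english'))
# stop_words is passed as a list; recent scikit-learn releases reject a set for this parameter
vectorizer = TfidfVectorizer(use_idf=True, lowercase = True, strip_accents = 'ascii', stop_words= list(stopset))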
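
To make the TF-IDF weighting less of a black box, the sketch below computes it by hand for a hypothetical three-document corpus. It assumes the weighting scikit-learn documents for its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, with each document vector then scaled to unit length. Tokenization here is plain whitespace splitting, which is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

docs = [
    "good movie good plot",
    "bad movie",
    "good acting",
]  # hypothetical toy corpus

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second document, 'bad' receives a higher weight than 'movie' because 'movie' occurs in more documents and is therefore less distinctive.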

In [17]:
#in this case, our dependent variable will be polarity: 0 (didn't like the movie) or 1 (liked the movie)
y=df.polarity

In [18]:
#convert the text in df.content to TF-IDF features
X= vectorizer.fit_transform(df.content)

In [19]:
#y has one entry per observation; X is (observations x unique words)
print(y.shape)
print(X.shape)

(2001,)
(2001, 39516)


In [20]:
#Test Train Split as usual
X_train, X_test, y_train,y_test = train_test_split(X,y, random_state=42)

In [21]:
#we will train a naive bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
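
The predictions in `pred` are computed but not scored, and `roc_auc_score` is imported without being used. The sketch below shows what the two usual scores measure, using hypothetical labels and scores and hand-written helper functions rather than the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]              # hypothetical true polarities
scores = [0.1, 0.4, 0.35, 0.8]     # hypothetical predicted probabilities of class 1
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

With scikit-learn, the AUC would typically be computed as `roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])`, which scores the predicted probabilities rather than the hard labels.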

In [22]:
################## PIPELINE #######################

#importing required packages
import sklearn.pipeline
import sklearn.feature_selection
from sklearn import ensemble


In [23]:
#specifying feature selection technique and its parameters
select = sklearn.feature_selection.SelectKBest(k=100)
#specifying the classifier
clf = sklearn.ensemble.RandomForestClassifier()

In [24]:
#creating a steps object to store above mentioned techniques
steps = [('feature_selection', select),
        ('random_forest', clf)]

In [25]:
# using pipeline for tightening up the steps code
pipeline = sklearn.pipeline.Pipeline(steps)

In [26]:
################## sampling #######################
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [27]:
################## MODEL FITTING & PREDICTION REPORT #######################
### fit your pipeline on X_train and y_train
pipeline.fit( X_train, y_train )
### call pipeline.predict() on your X_test data to make a set of test predictions
y_prediction = pipeline.predict( X_test )
### test your predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )
### and print the report
print(report)
warnings.filterwarnings('ignore')

             precision    recall  f1-score   support

          0       0.69      0.80      0.74       332
          1       0.76      0.64      0.70       329

avg / total       0.73      0.72      0.72       661
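
To read the report: precision is the share of predictions for a class that are correct, recall is the share of true members of a class that are found, f1-score is their harmonic mean, and support is the number of true instances per class (332 + 329 = 661 here). The sketch below computes the first three for one class from hypothetical labels; the helper name is an assumption made for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from raw label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical true and predicted polarities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for label in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

The same formula ties the report's columns together: for class 0, 2 × 0.69 × 0.80 / (0.69 + 0.80) ≈ 0.74, which matches the f1-score shown.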

