# Sentiment Analysis

Fine-Grained Sentiment:
This type of analysis goes beyond a simple positive/negative split and assigns each input a precise polarity, which gives a more detailed understanding of customer feedback.
Review labels are: very negative, negative, neutral, positive and very positive.

The dataset consists of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
test.tsv contains just phrases. You must assign a sentiment label to each phrase.
The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
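
The label scheme above can be kept next to the code as a small lookup table, so that numeric predictions can be turned back into readable names. This is only a sketch: `SENTIMENT_LABELS` and `label_name` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
# Hypothetical lookup table mirroring the label list above.
SENTIMENT_LABELS = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def label_name(label):
    """Return the readable name for an integer sentiment label."""
    # int() also accepts NumPy integers such as those returned by predict()
    return SENTIMENT_LABELS[int(label)]


print(label_name(3))
```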

https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

In [38]:
import string # for some string manipulation tasks
import nltk # natural language toolkit
import re # regex
from string import punctuation # punctuation characters to strip
from nltk.corpus import stopwords # stop words in sentences
from nltk.stem import WordNetLemmatizer # for lemmatizing words
from nltk.stem import SnowballStemmer # for stemming words
from contractions import contractions_dict # to expand contractions


In [39]:
#Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pylab as pl 

In [40]:
import numpy as np
import pandas as pd
#train_data = pd.read_csv('train.tsv/train.tsv', delimiter='\t', index_col = 'PhraseId')
train_data = pd.read_csv("train.tsv/train.tsv", delimiter='\t')

In [41]:
train_data.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [42]:
train_data.shape

(156060, 4)

In [43]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [44]:
#output is Sentiment i.e  negative=0 , somewhat negative=1 ,neutral=2 ,somewhat positive=3 ,positive = 4
train_data["Sentiment"].value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64
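
The counts above show that the classes are far from balanced. A quick way to quantify this, using only the numbers printed above, is to compute each class's share and the accuracy a model would reach by always predicting the most frequent label. The variable names here are illustrative, not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
majority_baseline = proportions[majority_label]

for label in sorted(proportions):
    print(f"label {label}: {proportions[label]:.1%}")
print(f"majority-class baseline (label {majority_label}): {majority_baseline:.1%}")
```

Any accuracy figure reported later is worth reading against this baseline.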

In [45]:
#train_data["SentenceId"].value_counts()

In [46]:
#NLTK makes tokenization easy to implement
#takes a string and returns a list of sentences, using nltk.sent_tokenize() to split the text
def sent_tokenize(text):
    return nltk.sent_tokenize(text)  

In [47]:
#This function returns the list of words in the text
def word_tokenize(text):
    return nltk.word_tokenize(text) 

In [48]:
#apply sent_tokenize to each phrase
train_data['Phrase'].apply(sent_tokenize) 

0         [A series of escapades demonstrating the adage...
1         [A series of escapades demonstrating the adage...
2                                                [A series]
3                                                       [A]
4                                                  [series]
5         [of escapades demonstrating the adage that wha...
6                                                      [of]
7         [escapades demonstrating the adage that what i...
8                                               [escapades]
9         [demonstrating the adage that what is good for...
10                                [demonstrating the adage]
11                                          [demonstrating]
12                                              [the adage]
13                                                    [the]
14                                                  [adage]
15                        [that what is good for the goose]
16                                      

In [49]:
#convert all words to lower case so that differently cased forms of the same word (e.g. "Good" and "good") map to a single token

def convert_lowercase(text):
    return text.lower()

In [50]:
train_data['Phrase'].apply(convert_lowercase)

0         a series of escapades demonstrating the adage ...
1         a series of escapades demonstrating the adage ...
2                                                  a series
3                                                         a
4                                                    series
5         of escapades demonstrating the adage that what...
6                                                        of
7         escapades demonstrating the adage that what is...
8                                                 escapades
9         demonstrating the adage that what is good for ...
10                                  demonstrating the adage
11                                            demonstrating
12                                                the adage
13                                                      the
14                                                    adage
15                          that what is good for the goose
16                                      

In [51]:
#train_data['Phrase'][:200].apply(autocorrect_spells)

In [52]:
#takes a string and returns the text without digits (checked character by character, no regex needed)
def remove_num(text):
    output = ''.join(c for c in text if not c.isdigit())
    return output 

def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation) 

#removes stop words such as "is", "the", "a", ...

def remove_stopwords(sentence):
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    return ' '.join([w for w in nltk.word_tokenize(sentence) if w not in stop_words]) 

def lemmatize(text):
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in nltk.word_tokenize(text)]
    return " ".join(lemmatized_words)
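
To see what the two character-level helpers do, they can be tried on a single sample string. The block below repeats their definitions so that it runs on its own and needs no NLTK data; the sample sentence is made up for illustration.

```python
from string import punctuation


def remove_num(text):
    # keep every character that is not a digit
    return "".join(c for c in text if not c.isdigit())


def remove_punct(text):
    # keep every character that is not ASCII punctuation
    return "".join(c for c in text if c not in punctuation)


sample = "It 's a 10/10 film , really !"  # made-up example phrase
cleaned = remove_punct(remove_num(sample))
tokens = cleaned.split()
print(tokens)
```

Note that stripping the apostrophe from `'s` leaves a stray `s` token behind, which is one reason the order of the cleaning steps matters.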

In [53]:
def preprocess(text):
    lower_text = convert_lowercase(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []
    for each_sent in sentence_tokens:
        # lemmatize first, then strip digits, punctuation and stop words
        lemmatized_sent = lemmatize(each_sent)
        clean_text = remove_num(lemmatized_sent)
        clean_text = remove_punct(clean_text)
        clean_text = remove_stopwords(clean_text)
        word_list.extend(word_tokenize(clean_text))
    return word_list

In [54]:
from sklearn.feature_extraction.text import CountVectorizer 

In [55]:
cv = CountVectorizer(analyzer=preprocess) 

In [56]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()

In [57]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

In [58]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', cv),  # strings to token integer counts
    ('tfidf', tfidf),  # integer counts to weighted TF-IDF scores
    ('classifier', classifier),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
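
The first two pipeline steps can be illustrated without scikit-learn. The sketch below counts tokens per document and then reweights the counts with the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfTransformer` does with its default settings. The three toy documents and the helper name `tfidf_vector` are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made-up data).
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}


def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF weights of one tokenized document."""
    counts = Counter(doc)  # the "bag of words" step
    weights = {term: c * idf[term] for term, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


for doc in docs:
    print(doc, tfidf_vector(doc))
```

Rare terms such as `plot` receive a larger IDF than terms that occur in most documents, so they contribute more to the final vector.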

In [59]:
pipeline.fit(train_data['Phrase'],train_data['Sentiment']) 

Pipeline(memory=None,
     steps=[('bow', CountVectorizer(analyzer=<function preprocess at 0x000002C0512F1948>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [64]:
from sklearn.metrics import classification_report

all_predictions = pipeline.predict(train_data['Phrase'])
print(classification_report(train_data['Sentiment'], all_predictions))

              precision    recall  f1-score   support

           0       0.77      0.08      0.15      7072
           1       0.61      0.34      0.43     27273
           2       0.64      0.92      0.76     79582
           3       0.60      0.45      0.51     32927
           4       0.78      0.10      0.18      9206

   micro avg       0.63      0.63      0.63    156060
   macro avg       0.68      0.38      0.41    156060
weighted avg       0.64      0.63      0.59    156060
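
The summary rows of the report can be reproduced from the per-class rows. The macro average is the plain mean over the five classes, while the weighted average weights each class by its support. The sketch below recomputes both for recall from the rounded values printed above, so the results agree only to two decimals.

```python
# Per-class rows copied from the report above:
# label -> (precision, recall, f1, support)
report = {
    0: (0.77, 0.08, 0.15, 7072),
    1: (0.61, 0.34, 0.43, 27273),
    2: (0.64, 0.92, 0.76, 79582),
    3: (0.60, 0.45, 0.51, 32927),
    4: (0.78, 0.10, 0.18, 9206),
}

total_support = sum(row[3] for row in report.values())

# Macro average: every class counts equally.
macro_recall = sum(row[1] for row in report.values()) / len(report)

# Weighted average: classes count in proportion to their support.
weighted_recall = sum(row[1] * row[3] for row in report.values()) / total_support

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The gap between the two values reflects the class imbalance: the large neutral class has high recall and dominates the weighted figure, while the two extreme classes pull the macro figure down.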



NLTK is used above to clean and preprocess the data. spaCy is an alternative: a library designed for fast, practical NLP work.

In [None]:
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm') 

In [None]:

def tokenizer_spacy(my_doc):
    tokenizer = Tokenizer(nlp.vocab)
    return tokenizer(my_doc.text)

In [None]:
#remove stopwords
def filter_words(my_doc):
    filtered_sent=[]
    for word in my_doc:
        if not word.is_stop:
            filtered_sent.append(word)
    return filtered_sent

In [None]:
def lemmatization_spcay(my_doc):
    lem_word = []
    for i in my_doc:
        lem_word.append(i.lemma_)
    return lem_word

In [None]:
def partofSpeach(my_doc):
    for word in my_doc:
        print(word.text,word.pos_) 

In [None]:
def removePunch(my_doc):
    nopunc = []
    for word in my_doc:
        if word.pos_ != 'PUNCT':
            nopunc.append(word)
    return nopunc

In [None]:
import string
from spacy.lang.en.stop_words import STOP_WORDS

punctuations = string.punctuation

nlp = spacy.load('en_core_web_sm')

stopwards = STOP_WORDS  # spaCy's English stop-word set (not used below)


def preprocessText(text):
    # drop digits before tokenizing
    noNum = "".join([i for i in text if not i.isdigit()])

    tokenized_list = nlp(noNum)

    # lemmatize and lower-case; spaCy v2 lemmatizes pronouns to "-PRON-", so keep the original word in that case
    tokenized_list = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokenized_list]

    # drop punctuation tokens
    tokenized_list = [word for word in tokenized_list if word not in punctuations]

    return tokenized_list
    
    

In [61]:
from sklearn.naive_bayes import GaussianNB

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

classifier = MultinomialNB()
tfidf_vector = TfidfVectorizer(tokenizer = preprocess)
# Create pipeline 
pipe = Pipeline([('tfidf',tfidf_vector),
                 ('classifier', classifier)])

# fit the pipeline on the training phrases
pipe.fit(train_data['Phrase'],train_data['Sentiment']) 

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...True, vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [63]:
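from sklearn.metrics import classification_report
# accuracy on the training data; stored in a variable because a bare expression mid-cell is not displayed
train_accuracy = pipe.score(train_data['Phrase'],train_data['Sentiment'])
Pred = pipe.predict(train_data['Phrase'])
print(classification_report(train_data['Sentiment'], Pred)) 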

              precision    recall  f1-score   support

           0       0.77      0.08      0.15      7072
           1       0.61      0.34      0.43     27273
           2       0.64      0.92      0.76     79582
           3       0.60      0.45      0.51     32927
           4       0.78      0.10      0.18      9206

   micro avg       0.63      0.63      0.63    156060
   macro avg       0.68      0.38      0.41    156060
weighted avg       0.64      0.63      0.59    156060
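
Both reports score the model on the same phrases it was fitted on, so they probably overstate how well it generalizes. One possible remedy is to hold out a validation set. Because the phrases of a sentence overlap heavily, the sketch below holds out whole sentences by `SentenceId` instead of individual phrases. The function name `split_by_sentence`, the 20% fraction and the toy rows for sentences 2 and 3 are assumptions made for illustration.

```python
import random


def split_by_sentence(rows, valid_fraction=0.2, seed=0):
    """Split rows into train/validation lists without sharing a SentenceId."""
    sentence_ids = sorted({row["SentenceId"] for row in rows})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(sentence_ids)
    n_valid = max(1, int(len(sentence_ids) * valid_fraction))
    valid_ids = set(sentence_ids[:n_valid])
    train = [row for row in rows if row["SentenceId"] not in valid_ids]
    valid = [row for row in rows if row["SentenceId"] in valid_ids]
    return train, valid


# Toy rows; sentences 2 and 3 are made up.
rows = [
    {"SentenceId": 1, "Phrase": "A series", "Sentiment": 2},
    {"SentenceId": 1, "Phrase": "series", "Sentiment": 2},
    {"SentenceId": 2, "Phrase": "a dull plot", "Sentiment": 1},
    {"SentenceId": 2, "Phrase": "dull", "Sentiment": 1},
    {"SentenceId": 3, "Phrase": "a charming film", "Sentiment": 3},
    {"SentenceId": 3, "Phrase": "charming", "Sentiment": 3},
]

train_rows, valid_rows = split_by_sentence(rows)
print(len(train_rows), "training rows,", len(valid_rows), "validation rows")
```

With the real data, the pipeline would be fitted on the training part only and the classification report computed on the validation part.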



In [66]:
test_data = pd.read_csv("test.tsv/test.tsv", delimiter='\t')

In [67]:
test_data.head()

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine


In [69]:
# predict sentiment labels for the test phrases
y_test_pred=pipe.predict(test_data['Phrase']) 

In [71]:
y_test_pred

array([3, 3, 2, ..., 1, 1, 2], dtype=int64)
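
The predictions still need to be paired with their `PhraseId` values before they can be handed in. The sketch below writes such a table as CSV. The two-column layout `PhraseId,Sentiment` is an assumption about the expected submission format, and `write_submission` is a hypothetical helper. It writes to an in-memory buffer here; an open file would work the same way.

```python
import csv
import io


def write_submission(phrase_ids, predictions, fh):
    """Write one 'PhraseId,Sentiment' row per prediction to an open text file."""
    writer = csv.writer(fh)
    writer.writerow(["PhraseId", "Sentiment"])  # assumed header
    for phrase_id, label in zip(phrase_ids, predictions):
        # int() turns NumPy integers into plain Python ints
        writer.writerow([int(phrase_id), int(label)])


# First three test PhraseIds and predictions as shown in the outputs above.
phrase_ids = [156061, 156062, 156063]
predictions = [3, 3, 2]

buffer = io.StringIO()
write_submission(phrase_ids, predictions, buffer)
print(buffer.getvalue())
```

With the real data this would be called with `test_data['PhraseId']` and `y_test_pred`, and a file opened with `newline=''` in place of the buffer.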

In [76]:
pipe.predict(["It 's a perfect show of respect to just one of those underrated professionals who deserve but rarely receive it ."])

array([3], dtype=int64)