<a href="https://www.kaggle.com/raviprakashkumar/automation-detection-of-different-sentiments-from?scriptVersionId=84386671" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Algorithm applied : 
+ NLP (Natural Language Processing)

### Project Objective : 
+ To develop a deep learning algorithm that detects the different sentiments expressed in a collection of English sentences or a long paragraph.

### To predict the number of positive and negative reviews using either classification or deep learning algorithms.

### Importing Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing datasets

+ The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. This notebook loads the Kaggle CSV version, which holds all the reviews in a single file.
+ Original dataset source: http://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
dataset = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
dataset

index,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
dataset.keys()

Index(['review', 'sentiment'], dtype='object')

### Cleaning the text
+ We remove punctuation and other special characters and convert all capital letters to lower case, because these variations add noise when the text is processed.

In [4]:
import re
# Importing the library that provides the stopword list; stopwords such as
# the articles (the, a, an...) do not help to predict the emotion of a text
import nltk

# Downloading the stopword list from the nltk library
nltk.download('stopwords')

# Importing stopwords
from nltk.corpus import stopwords

[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>
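
The log above shows that `nltk.download('stopwords')` could not reach the network. Below is a minimal sketch of one way to guard against that case, assuming a small hand-written fallback list is acceptable. The helper name `load_stopwords` and the fallback words are hypothetical and not part of the original notebook.

```python
def load_stopwords():
    """Return English stopwords without 'not', using a tiny fallback list if NLTK data is missing."""
    try:
        from nltk.corpus import stopwords
        # Raises LookupError when the stopwords corpus has not been downloaded
        words = set(stopwords.words('english'))
    except (ImportError, LookupError):
        # Hypothetical, deliberately short fallback list for illustration only
        words = {'a', 'an', 'the', 'is', 'in', 'of', 'and', 'to', 'this', 'not'}
    # Keep 'not' in the text because it can flip the sentiment
    words.discard('not')
    return words

stopword_set = load_stopwords()
print('the' in stopword_set, 'not' in stopword_set)
```

The fallback list is far too short for real use; it only keeps the rest of the notebook runnable when the download fails.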


### Stemming of Text : It converts each word to its root form.

+ Example:
- loved becomes love
- helped becomes help
- hopes becomes hope

+ Reason: 
- After cleaning the text we build the bag of words model, which is a sparse matrix with one column for each distinct word. Without stemming, the matrix would contain separate columns for different forms of the same word (for example, one for the present tense and another for the past tense), even though they express the same emotion. Stemming merges these redundant columns and so reduces the dimension and complexity of the sparse matrix.
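
To make the dimension argument concrete, here is a small sketch that counts distinct words before and after stemming. The word list is made up for illustration; `PorterStemmer` is the same NLTK class this notebook uses for cleaning.

```python
from nltk.stem.porter import PorterStemmer

# Made-up sample words: several inflected forms of the same three roots
words = ['loved', 'love', 'loves', 'helped', 'help', 'hopes', 'hope']

ps = PorterStemmer()
stems = [ps.stem(word) for word in words]

# Each distinct word would become one column in the bag of words matrix
print(len(set(words)), 'columns without stemming')
print(len(set(stems)), 'columns with stemming')
print(sorted(set(stems)))
```

The seven surface forms collapse into three stems, so the matrix needs three columns instead of seven.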

In [5]:
from nltk.stem.porter import PorterStemmer

# Cleaning the texts
# Cleaned_Text starts as an empty list; we loop over every review in the dataset,
# apply the cleaning steps to it, and append the cleaned review to the list

# Creating the stemmer and the stopword list once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')

# Remove 'not' from the stopwords as it can alter the emotion of expression
all_stopwords.remove('not')
stopword_set = set(all_stopwords)

Cleaned_Text = []
for i in range(len(dataset)):
    # Replacing every character that is not a letter (a-z or A-Z) with a space
    Text = re.sub('[^a-zA-Z]', ' ', dataset['review'][i])

    # Transforming all the capital letters into lower case letters
    Text = Text.lower()

    # Splitting the text into words so that we can apply stemming
    Text = Text.split()

    # Stemming every word that is not a stopword
    Text = [ps.stem(word) for word in Text if word not in stopword_set]

    # Joining the words back together, separated by spaces
    Text = ' '.join(Text)

    # Adding the cleaned text to the list
    Cleaned_Text.append(Text)
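
The sample rows shown earlier contain `<br />` markup. With the regular expression above, such a tag turns into a stray `br` token that survives cleaning. Below is a minimal sketch of one possible extra step, assuming the tags should be dropped before the letter filter. The helper name `strip_html_tags` and the sample sentence are hypothetical, and this step is not part of the original pipeline.

```python
import re

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r'<[^>]+>', ' ', text)

# Made-up review fragment containing the same kind of markup
raw = 'Great plot.<br /><br />Weak ending.'

# Original cleaning step only: the tag name "br" is kept as a word
without_strip = re.sub('[^a-zA-Z]', ' ', raw).lower().split()

# Tags removed first, then the same cleaning step
with_strip = re.sub('[^a-zA-Z]', ' ', strip_html_tags(raw)).lower().split()

print(without_strip)
print(with_strip)
```

Adding this step would change the vocabulary, so the scores reported in this notebook would no longer match exactly.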

### Cleaned_Text

In [6]:
# Cleaned_Text

### Creating the Bag of Words model

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Creating an instance of the CountVectorizer class, keeping the 2000 most frequent words
cv = CountVectorizer(max_features = 2000)

# The fit step learns the vocabulary from all the cleaned texts
# and the transform step counts each vocabulary word per review, one column per word
X = cv.fit_transform(Cleaned_Text).toarray()

# Creating the dependent variable (the sentiment column)
y1 = dataset.iloc[:, -1].values
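
To illustrate what the bag of words matrix contains, here is a plain-Python sketch that mimics the idea on three made-up, already cleaned texts. It is a simplified illustration, not scikit-learn's implementation. The helper name `bag_of_words` is hypothetical, and breaking ties between equally frequent words alphabetically is an assumption made for this sketch.

```python
from collections import Counter

def bag_of_words(texts, max_features=None):
    """Return (vocabulary, matrix) of word counts, one row per text."""
    tokenized = [text.split() for text in texts]
    # Total frequency of every word across the whole corpus
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Keep the most frequent words (ties broken alphabetically), then sort the columns
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    vocabulary = sorted(ranked[:max_features])
    # Count how often each vocabulary word occurs in each text
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

texts = ['love movi', 'not love movi', 'bad movi bad act']
vocabulary, matrix = bag_of_words(texts, max_features=3)
print(vocabulary)
for row in matrix:
    print(row)
```

With `max_features=3` the rarest words (`act` and `not`) get no column, which is the same kind of trade-off that `max_features = 2000` makes on the full corpus.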

### Label encoding the dependent variable containing the sentiments

In [8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y1)

In [9]:
y

array([1, 1, 1, ..., 0, 0, 0])
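
`LabelEncoder` assigns integer codes to the sorted class names, which is why `negative` maps to 0 and `positive` maps to 1. Here is a short sketch on made-up labels showing how to inspect the mapping and decode values back to names. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the same format as the sentiment column
labels = np.array(['positive', 'negative', 'positive'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the class names in the order of their codes
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform turns encoded values back into the original names
decoded = encoder.inverse_transform(np.array([0, 1]))
print(decoded.tolist())
```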

In [10]:
# Creating a dataframe to compare the original and the encoded sentiments
Df1 = pd.DataFrame(y1)
Df1.columns = ['Sentiments']
Df1['Sentiments Encoded'] = y
Df1

index,Sentiments,Sentiments Encoded
0,positive,1
1,positive,1
2,positive,1
3,negative,0
4,positive,1
...,...,...
49995,positive,1
49996,negative,0
49997,negative,0
49998,negative,0


### Splitting the dataset into the training set and the test set

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
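
The split above is purely random, so the class balance of the test set can differ slightly from that of the full data. Here is a hedged sketch of an alternative, assuming equal class proportions in both parts are wanted: the `stratify` argument of `train_test_split`. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features, 5 samples per class
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr).tolist(), np.bincount(y_te).tolist())
```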

# Creating Classification model

### 1. Naive Bayes Classification Model

In [12]:
# Importing library for naive_bayes classification
from sklearn.naive_bayes import GaussianNB

# Creating instance
classifier1 = GaussianNB()

# Training the Naive Bayes classification model
classifier1.fit(X_train, y_train)

# Predicting the test set results using the model
y_pred = classifier1.predict(X_test)

# Comparing the predicted y and the actual y side by side
Df_1 = pd.DataFrame(y_pred)
Df_1.columns = ['Predicted_Emotions']
Df_1['Actual_Emotions'] = y_test
Df_1

index,Predicted_Emotions,Actual_Emotions
0,0,1
1,1,1
2,0,0
3,1,1
4,0,0
...,...,...
9995,0,0
9996,0,1
9997,1,1
9998,0,0
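
`GaussianNB` models each feature as a continuous, normally distributed value, whereas the bag of words columns are word counts. As a point of comparison only (this is a swapped-in variant, not the model used in this notebook), scikit-learn's `MultinomialNB` is designed for count features. Here is a minimal sketch on made-up, already stemmed texts, with all variable names chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts; 1 = positive, 0 = negative
train_texts = ['good great movi', 'love good film', 'bad wast movi', 'terribl bad film']
train_labels = [1, 1, 0, 0]

demo_cv = CountVectorizer()
X_counts = demo_cv.fit_transform(train_texts)  # stays sparse, no .toarray() needed

demo_model = MultinomialNB()
demo_model.fit(X_counts, train_labels)

# Predicting two new made-up texts with the fitted vectorizer and model
demo_pred = demo_model.predict(demo_cv.transform(['good film', 'bad wast']))
print(demo_pred.tolist())
```

Whether this variant scores better than `GaussianNB` on the IMDB reviews is not shown here and would need to be measured.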


In [13]:
# Importing library for creating Confusion matrix and accuracy
from sklearn.metrics import confusion_matrix, accuracy_score

In [14]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy score (the fraction of correct predictions)
R = accuracy_score(y_test, y_pred)
print('\nAccuracy score = ',R)

Confusion Matrix :

 [[4228  733]
 [1840 3199]]

Accuracy score =  0.7427
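
The score printed above is the accuracy, and it can be verified by hand from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes (0 = negative, 1 = positive). The sketch below recomputes it and, as an addition not present in the original, derives precision and recall for the positive class from the same four numbers.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
cm_values = [[4228, 733],
             [1840, 3199]]

tn, fp = cm_values[0]  # actual negative: predicted negative / predicted positive
fn, tp = cm_values[1]  # actual positive: predicted negative / predicted positive

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print('accuracy  =', accuracy)
print('precision =', round(precision, 4))
print('recall    =', round(recall, 4))
```

Recall is lower than precision here because the model misses 1840 positive reviews while producing only 733 false positives.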


In [15]:
def Prediction_Single_Text(Text):
    # Cleans a single text like the training reviews and returns the
    # predicted encoded sentiment (0 = negative, 1 = positive)

    # Replacing every character that is not a letter (a-z or A-Z) with a space
    Text = re.sub('[^a-zA-Z]', ' ', Text)

    # Transforming all the capital letters into lower case letters
    Text = Text.lower()

    # Splitting the text into words so that we can apply stemming
    Text = Text.split()

    # Creating the stemmer and the stopword set
    ps = PorterStemmer()
    all_stopwords = set(stopwords.words('english'))

    # Remove 'not' from the stopwords as it can alter the emotion of expression
    all_stopwords.discard('not')

    # Stemming every word that is not a stopword
    Text = [ps.stem(word) for word in Text if word not in all_stopwords]

    # Joining the words back together, separated by spaces
    Text = ' '.join(Text)

    # Vectorizing with the fitted CountVectorizer and predicting with the trained model
    Cleaned_Text = [Text]
    new_X_test = cv.transform(Cleaned_Text).toarray()
    new_y_pred1 = classifier1.predict(new_X_test)
    return new_y_pred1

In [16]:
Text = 'I wasted my money in this movie.'
Prediction_Single_Text(Text)

array([0])