## MOVIE REVIEW CLASSIFICATION
This notebook determines whether a given movie review expresses positive or negative sentiment, using TF-IDF features and a Multinomial Naive Bayes classifier.

#### Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder converts categorical labels to integer codes

In [3]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
import pickle
import seaborn as sns

#### Load Movie Reviews

In [4]:
df = pd.read_csv(r"C:\Users\dell\Desktop\sample_project_1\IMDB Dataset.csv\IMDB Dataset.csv")
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


### EDA

In [5]:
df.shape

(50000, 2)

In [6]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
df.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [9]:
# Check the unique values in the sentiment column
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [10]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

The `sentiment` column holds two classes, positive and negative, with 25,000 reviews each, so the dataset is balanced.

We use `LabelEncoder` to convert the categorical labels into integers (positive: 1, negative: 0).

In [11]:
label=LabelEncoder()
df['sentiment'] = label.fit_transform(df['sentiment'])
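
The 1/0 assignment is not arbitrary: `LabelEncoder` numbers the classes in sorted order of their names, so "negative" comes before "positive". The following plain-Python sketch mimics that mapping on a made-up `labels` list; it illustrates the behaviour and is not scikit-learn's implementation.

```python
labels = ["positive", "negative", "positive", "negative"]  # made-up sample labels

# Sorted unique class names; the position in this list is the integer code.
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)  # {'negative': 0, 'positive': 1}
print(encoded)  # [1, 0, 1, 0]
```

On the fitted encoder itself, `label.classes_` holds the same sorted class names, which is a quick way to confirm the mapping.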

In [12]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


#### Dividing the data into the independent variable (review) and the dependent variable (sentiment, the target)

In [13]:
x = df['review']
y = df['sentiment']

#### Remove all special and numeric characters from the data, remove stopwords and apply stemming
- Numeric and special characters are removed because they carry little sentiment signal for a word-based model and would only add noise to the vocabulary.
- Stopwords are very common words such as "the", "is", "in", "and", "to" and "of", which contribute little to the meaning of a sentence and are therefore dropped.
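
The two cleaning steps above can be sketched on a single sentence. This is a minimal illustration: `STOP_WORDS` is a deliberately tiny, hypothetical set (the notebook uses NLTK's full English list), and `clean_text` and `sample` are made-up names.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOP_WORDS = {"the", "is", "in", "and", "to", "of", "a", "this", "was"}

def clean_text(text):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return [word for word in words if word not in STOP_WORDS]

sample = "This movie was GREAT!!! 10/10 <br /> Loved the ending."
print(clean_text(sample))  # ['movie', 'great', 'br', 'loved', 'ending']
```

Note that the HTML tag `<br />` leaves a stray `br` token behind, because the regular expression removes only the angle brackets and the slash. The IMDB reviews contain many such tags, so stripping them before this step may be worth considering.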

### Stemming
Stemming reduces words to a common root form by stripping suffixes; for example, loving, loved and loves are all stemmed to love. The resulting stem is not always a dictionary word.
We will use the PorterStemmer algorithm for this purpose.

In [None]:
# The stopword list must be available locally; this downloads it if it is missing
nltk.download("stopwords", quiet=True)

ps = PorterStemmer()
# Build the stopword set once; rebuilding it for every word makes the loop far slower
stop_words = set(stopwords.words("english"))
corpus = []

for text in x:
    # Replace every non-letter character (digits, punctuation, symbols) with a space
    review = re.sub("[^a-zA-Z]", " ", text)
    review = review.lower()
    # Split the review into a list of separate words
    review = review.split()
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    # Join the stemmed words back into a single string
    review = " ".join(review)
    corpus.append(review)

In [None]:
corpus

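### Apply TF-IDF Vectorizer to convert text data into vectors
Vectorisation converts each review into a numeric vector so that the model can work with it. TF-IDF weights a term by how often it occurs in a review and scales that weight down when the term appears in many reviews. Setting `max_features=5000` keeps only the 5,000 most frequent terms in the corpus.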

In [None]:
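# TfidfVectorizer lives in sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=5000)
# .toarray() builds a dense 50000 x 5000 matrix (about 2 GB as float64);
# MultinomialNB also accepts the sparse matrix returned by fit_transform
x = cv.fit_transform(corpus).toarray()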
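
To make the weighting concrete, here is a plain-Python sketch of TF-IDF on two invented documents. It follows scikit-learn's documented defaults (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation) but is only an illustration, not the library's code; `docs`, `doc_freq` and `tfidf_vector` are made-up names.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie"]  # invented toy corpus
tokenised = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenised for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in tokenised) for term in vocab}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights in vocabulary order."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenised]
for doc, vector in zip(docs, vectors):
    print(doc, [round(value, 3) for value in vector])
```

In each vector the term shared by both documents ("movie") receives a lower weight than the term unique to that document, which is the effect TF-IDF is designed to produce.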

In [None]:
x.shape

### Split Data into train and test

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size =0.2,random_state = 42)
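
With `test_size=0.2`, 10,000 of the 50,000 reviews are held out for testing and 40,000 are used for training; `random_state=42` makes the shuffle repeatable. The sketch below shows the idea in plain Python. `split_indices` is a hypothetical helper and does not reproduce the exact row order that `train_test_split` produces.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50000)
print(len(train_idx), len(test_idx))  # 40000 10000
```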

In [None]:
x_train,x_test,y_train,y_test

### Define Naive Bayes Model

In [None]:
mnb = MultinomialNB()
mnb.fit(x_train,y_train)

### Testing and evaluation

In [None]:
pred = mnb.predict(x_test)

In [None]:
print(accuracy_score(y_test,pred))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
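
To help read that output, the sketch below derives accuracy, precision and recall from the four cells of a binary confusion matrix. The labels and predictions are made up for illustration and are not results from this model. The matrix layout matches scikit-learn's convention: rows are true classes, columns are predicted classes, in sorted label order.

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up labels, 1 = positive
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

matrix = [[tn, fp], [fn, tp]]
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(matrix)                       # [[3, 1], [1, 3]]
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```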

### Define a function to test the model

In [None]:
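def test_model(sentence):
    # Clean the input the same way as the training reviews so its tokens match the fitted vocabulary
    words = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
    stop_words = set(stopwords.words("english"))
    cleaned = " ".join(ps.stem(word) for word in words if word not in stop_words)
    # cv and mnb are the vectorizer and classifier fitted in the cells above
    sen = cv.transform([cleaned]).toarray()
    res = mnb.predict(sen)[0]
    if res == 1:
        return "Positive review"
    else:
        return "Negative review"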
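
If the fitted vectorizer and classifier are to be reused outside this notebook, `pickle` (imported at the top) can store them on disk. The sketch below shows the save and load round trip with a stand-in dictionary instead of a real fitted object; the file name `vectorizer.pkl` and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; any picklable Python object works the same way.
fitted_stand_in = {"vocabulary": {"great": 0, "movi": 1}, "max_features": 5000}

path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")  # hypothetical file name
with open(path, "wb") as handle:
    pickle.dump(fitted_stand_in, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == fitted_stand_in)  # True
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.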

### Testing a positive and negative review

In [None]:
sen = "This  is a great movie to watch with friends and family23l;tq"
res = test_model(sen)
print(res)

In [None]:
sen = "This is a very worst movie.Wouldnt recommend to anyone"
res = test_model(sen)
print(res)