# **Sentiment Analysis on Movie Reviews**

### Importing the dataset 

In [1]:
import pandas as pd
# Load the dataset
data = pd.read_csv("Datasets/dataset/archive/IMDB Dataset.csv")

# Check the first few rows
print(data.head())

# Check the column names
print(data.columns)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Index(['review', 'sentiment'], dtype='object')


### Preprocessing the text

In [7]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK stop-word list
nltk.download('stopwords')
# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load the dataset
data = pd.read_csv("Datasets/dataset/archive/IMDB Dataset.csv")

# Example review
review = data['review'][7]
print("Original Review : \n", review)

# 1. Convert the review to lowercase
review = review.lower()

# 2. Replace HTML tags such as <br /> with a space so adjacent words stay separate
review = re.sub(r'<.*?>', ' ', review)

# 3. Remove punctuation and digits (anything other than a-z and whitespace)
review = re.sub(r'[^a-z\s]', '', review)

# 4. Tokenize on whitespace
tokens = review.split()

# 5. Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# 6. Lemmatize, treating every token as a verb (pos='v')
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

print("\n After Cleaning : \n", tokens)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aarya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aarya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\aarya\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Original Review : /n This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.

 After Cleaning : 
 ['show', 'amaze', 'fres
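
The replacement string used in the tag-removal step matters, because IMDB reviews often place `<br /><br />` directly between two sentences with no surrounding space (in the review above: `today.<br /><br />It's`). The following is a minimal sketch using only the standard library; the sample sentence and the `tokenize` helper are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical two-sentence review with the same <br /> markup as the IMDB data
sample = "Great film.<br /><br />Loved it, 10/10!"

def tokenize(text, tag_replacement):
    """Lowercase, strip tags, drop non-letters, then split on whitespace."""
    text = text.lower()
    text = re.sub(r'<.*?>', tag_replacement, text)
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

glued = tokenize(sample, '')    # tags removed with nothing in their place
spaced = tokenize(sample, ' ')  # tags replaced by a single space

print(glued)   # ['great', 'filmloved', 'it']
print(spaced)  # ['great', 'film', 'loved', 'it']
```

With an empty replacement, the last word of one sentence and the first word of the next merge into a single token that matches neither the stop-word list nor the lemmatizer's dictionary.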

### Preprocessing function: applying the cleaning steps to the whole dataset

In [6]:
# importing the necessary libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# initializing the Lemmatizer and the stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define clean_review, which cleans a single review string
def clean_review(text):
    # 1. Convert the text to lowercase
    text = text.lower()

    # 2. Replace HTML tags with a space so adjacent words stay separate
    text = re.sub(r'<.*?>', ' ', text)

    # 3. Remove punctuation and digits
    text = re.sub(r'[^a-z\s]', '', text)

    # 4. Tokenize on whitespace
    tokens = text.split()

    # 5. Remove stop words
    tokens = [word for word in tokens if word not in stop_words]

    # 6. Lemmatize, treating every token as a verb (pos='v')
    tokens = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

    # Return the cleaned review as a single space-separated string
    return " ".join(tokens)

# Load the dataset
data = pd.read_csv("Datasets/dataset/archive/IMDB Dataset.csv")

# Apply the function to every review
data['clean_review'] = data['review'].apply(clean_review)

# Check the result on the first two rows
print(data[['review', 'clean_review']].head(2))

# Confirm the cell ran to completion
print("\n It Works !")

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   

                                        clean_review  
0  one reviewers mention watch oz episode youll h...  
1  wonderful little production film technique una...  

 It Works !
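
One side effect of stop-word removal is worth keeping in mind for sentiment work: NLTK's English list includes negations such as "not" and "no", so a phrase like "not funny" is reduced to "funny". The sketch below illustrates the effect with a small, hypothetical stop-word set (the real NLTK list is much longer); `remove_stopwords` and `keep_negations` are names invented for this example.

```python
# Abbreviated, hypothetical stop-word set; NLTK's real English list is much longer
STOP_WORDS = {"the", "was", "is", "it", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords(tokens, keep_negations=False):
    """Drop stop words, optionally leaving negation words in place."""
    blocked = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in blocked]

tokens = "the show was not funny".split()

print(remove_stopwords(tokens))                       # ['show', 'funny']
print(remove_stopwords(tokens, keep_negations=True))  # ['show', 'not', 'funny']
```

In the notebook, the equivalent change would be to subtract the negation words from `stop_words` before calling `clean_review`. Whether this improves a downstream classifier is an empirical question and would need to be measured.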


### Feature Extraction For ML

In [11]:
'''Using the scikit-learn library:
    sklearn.feature_extraction.text → utilities that turn text into numeric features.
    TfidfVectorizer → builds the TF-IDF matrix.'''
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer limited to 5,000 features
vectorizer = TfidfVectorizer(max_features=5000)

# Apply TF-IDF to all cleaned reviews
x = vectorizer.fit_transform(data['clean_review'])

# Prepare the labels (what we want to predict): positive → 1, negative → 0
y = data['sentiment'].map({'positive': 1, 'negative': 0})

print("Shape Of TF_IDF Matrix", x.shape)

Shape Of TF_IDF Matrix (50000, 5000)
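
To make the numbers inside that matrix less opaque, here is a minimal standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then each row scaled to unit L2 length. The three toy documents and the `tfidf` helper are hypothetical, and the sketch leaves out `max_features`, which keeps only the most frequent terms in the corpus.

```python
import math
from collections import Counter

# Hypothetical cleaned reviews, standing in for data['clean_review']
docs = ["good film good cast", "bad film", "good plot"]

def tfidf(docs):
    """Return (vocabulary, dense TF-IDF rows) for whitespace-tokenized docs."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))  # L2 normalization
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, matrix = tfidf(docs)
print(vocab)                    # ['bad', 'cast', 'film', 'good', 'plot']
print(len(matrix), len(vocab))  # 3 5
```

In the second document, "bad" receives a higher weight than "film" because "bad" appears in only one document while "film" appears in two. This is how TF-IDF downweights terms that are common across the corpus.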
