<a href="https://colab.research.google.com/github/hussain0048/Projects-/blob/master/Movie_Reviews_through_Sentiment_Analysis_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **📚Table of Contents**



1.   Introduction
2.   Import necessary libraries
3.   Load Dataset
4.   Pre-processing Steps



# **📚Introduction:**

In the ever-evolving landscape of cinema, understanding how audiences feel about a movie is paramount. Movie reviews offer valuable insight into audience perceptions, but manually analyzing a large volume of reviews is time-consuming and often subjective. Sentiment analysis, a Natural Language Processing (NLP) technique, automates the process of gauging sentiment from text. In this blog, we work through a dataset of 50,000 IMDB movie reviews to show how sentiment analysis can be used to analyze cinematic feedback.

# **📚Import necessary libraries**

In [4]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import re  # regular expressions
from nltk.corpus import stopwords  # common words that are usually removed during preprocessing
from nltk.tokenize import word_tokenize  # splits text into individual words (tokens)
from nltk.stem import SnowballStemmer  # reduces words to their root form with the Snowball algorithm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
import pickle

**CountVectorizer**, from scikit-learn's feature extraction module, converts a collection of text documents into a matrix of token counts: one row per document, one column per word in the vocabulary, and each cell holding how often that word occurs in that document. This numeric representation is what lets machine learning models process textual data.
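To make the idea of a token-count matrix concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the three toy documents and the names `docs`, `vocab`, and `matrix` are made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[token] for token in vocab])

print(vocab)   # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```

`CountVectorizer` builds the same kind of table, but by default it also lowercases the text and returns the counts as a sparse matrix, which matters when the vocabulary has tens of thousands of columns.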


The **stopwords** corpus in NLTK contains common words (such as "the", "is", and "and") that carry little meaning on their own and are therefore typically removed from text data during preprocessing.

The **pickle module** in Python provides functionality for serializing and deserializing Python objects, allowing for easy storage and retrieval of data structures, such as lists or dictionaries, in a binary format.
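As a rough illustration of that round trip, the sketch below serializes a dictionary to bytes and restores it. The `artifact` dictionary is a hypothetical stand-in for whatever object you want to keep, not something taken from this notebook.

```python
import pickle

# Hypothetical object standing in for the data structure you want to store.
artifact = {"vocabulary": ["bad", "good", "movie"], "alpha": 1.0}

blob = pickle.dumps(artifact)   # serialize the object to a bytes string
restored = pickle.loads(blob)   # rebuild an equal object from those bytes

print(type(blob).__name__)      # bytes
print(restored == artifact)     # True
```

To store the object on disk instead, `pickle.dump(obj, f)` and `pickle.load(f)` do the same thing with a file opened in binary mode (`'wb'` and `'rb'`). Only unpickle files you trust, because loading a pickle can execute arbitrary code.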

# **📚Load Dataset**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This cell mounts Google Drive in the Colab notebook at `/content/drive`, so files and directories stored on Drive can be read from the notebook like local files.

In [5]:
data = pd.read_csv('/content/drive/MyDrive/Courses /Data Science /NLP/Datasets/IMDB-Dataset.csv')
print(data.shape)
data.head()

(50000, 2)


|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [None]:
data.info()

The `info()` method prints a concise summary of the DataFrame: the column names, the number of non-null values in each column, each column's data type, and the total memory usage.

In [7]:
data.sentiment.value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

`value_counts()` returns how many rows carry each sentiment label. Here the dataset is perfectly balanced, with 25,000 positive and 25,000 negative reviews.

In [8]:
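# Map the string labels to integers and assign the result back to the column.
# This avoids calling replace(..., inplace=True) on a column reached through
# attribute access, which recent pandas versions flag as chained assignment.
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})
data.head(10)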

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. `<br /><br />`The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |
| 5 | Probably my all-time favorite movie, a story o... | 1 |
| 6 | I sure would like to see a resurrection of a u... | 1 |
| 7 | This show was an amazing, fresh & innovative i... | 0 |
| 8 | Encouraged by the positive comments about this... | 0 |
| 9 | If you like original gut wrenching laughter yo... | 1 |


This preprocessing step is called label encoding: the categorical sentiment labels are replaced with numerical values ('positive' becomes 1 and 'negative' becomes 0) directly in the `data` DataFrame, so that they can later serve as numeric targets for a classifier.
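One way to sketch label encoding on a small scale is with an explicit mapping dictionary plus a check that no label was missed. The `toy` DataFrame, the `label_map` name, and the extra `label` column below are assumptions made for this example and are not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical) with the same two columns as the IMDB dataset.
toy = pd.DataFrame({
    "review": ["great film", "awful plot", "loved it"],
    "sentiment": ["positive", "negative", "positive"],
})

label_map = {"positive": 1, "negative": 0}
toy["label"] = toy["sentiment"].map(label_map)

# Any label missing from label_map would become NaN, so verify before training.
unmapped = int(toy["label"].isna().sum())
print(unmapped)                  # 0

toy["label"] = toy["label"].astype(int)
print(toy["label"].tolist())     # [1, 0, 1]
```

Keeping the mapping in a named dictionary also makes it easy to translate predictions back into words later, for example with `{v: k for k, v in label_map.items()}`.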

# **📚Pre-processing Steps**

*   Remove HTML tags
*   Remove special characters
*   Convert everything to lowercase
*   Remove stopwords
*   Stemming
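The five steps above can be sketched end to end in plain Python. This is only an illustration under loud assumptions: `STOP_WORDS` is a tiny hand-picked set rather than NLTK's English list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, not the Snowball algorithm. The function names are hypothetical.

```python
import re

# Hypothetical, hand-picked stop words; NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}

def naive_stem(word):
    # Crude stand-in for a stemmer (NOT Snowball): strips one common
    # suffix from words longer than four characters.
    for suffix in ("ing", "ed", "s"):
        if len(word) > 4 and word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"<.*?>", " ", text)            # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. remove special characters
    text = text.lower()                           # 3. convert to lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 4. remove stop words
    return " ".join(naive_stem(t) for t in tokens)             # 5. stemming

result = preprocess("This movie was AMAZING!<br />The acting is great.")
print(result)  # movie amaz act great
```

The order matters: lowercasing happens before stop-word removal so that "This" and "The" match the lowercase entries in the stop-word set.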

**Remove HTML tags**

In [9]:
def clean(text):
    # Match everything from '<' up to the nearest '>' (non-greedy) and delete it.
    html_tag = re.compile(r'<.*?>')
    return re.sub(html_tag, '', text)

In [None]:
data.review = data.review.apply(clean)
data.review[0]

The `clean` function uses a regular expression to strip HTML tags such as `<br />` from the input text and returns the cleaned text; `apply` then runs it on every row of the `review` column.
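The short sketch below shows how the `<.*?>` pattern behaves on a made-up sample string, why the non-greedy `?` matters, and one possible variation that substitutes a space so neighbouring words do not get glued together. The `sample` text and the variable names are assumptions for illustration.

```python
import re

TAG = re.compile(r"<.*?>")   # non-greedy: each match stops at the nearest '>'

# Made-up review fragment containing HTML tags.
sample = "Great cast.<br /><br />Weak <b>plot</b> though."

# Same substitution as clean(): tags are deleted outright.
no_tags = TAG.sub("", sample)
print(no_tags)      # Great cast.Weak plot though.

# A greedy pattern swallows everything between the first '<' and the last '>'.
greedy = re.sub(r"<.*>", "", sample)
print(greedy)       # Great cast. though.

# Variation: replace tags with a space, then collapse repeated whitespace.
normalized = re.sub(r"\s+", " ", TAG.sub(" ", sample)).strip()
print(normalized)   # Great cast. Weak plot though.
```

Whether the space-preserving variation is worth it depends on the data: when a tag sits directly between two words, plain deletion merges them into a single token.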