# SENTIMENT ANALYSIS WITH NLP

In this lab we perform sentiment analysis on the IMDB movie reviews dataset using TF-IDF vectorization and logistic regression.

In the following sections, we'll:
- clean and prepare the data
- build a model
- perform sentiment analysis on the dataset.

## Understanding and cleaning data

In [None]:
# Importing the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [None]:
# Read the CSV file into the DataFrame 'df'.
df = pd.read_csv('/content/drive/MyDrive/IMDB Dataset.csv')

In [None]:
# Let's understand the type of values in each column of our dataframe 'df'.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
# Let's look at the first few rows to see what the data looks like.
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# Let's check whether the dataset contains any missing values.
print(df.isnull().sum())

review       0
sentiment    0
dtype: int64


No missing values are found.

## Data preparation

Before we build a machine learning model, it's important to properly preprocess the data to make it suitable for modeling.

The IMDB dataset consists of text reviews and sentiment labels (positive/negative), so all of our input features are in text format. Since Logistic Regression requires numerical input, we need to transform the text into numerical vectors.

This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes in: it converts raw text into numerical features by weighting each word by how often it appears in a review and down-weighting words that appear in many reviews.

The cells below encode the labels and clean the review text; the vectorization itself is applied after the train/test split.

In [None]:
# Convert sentiment labels from text to binary values: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({"positive": 1, "negative": 0})
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [None]:
# Lowercasing the text
df['review'] = df['review'].str.lower()
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production. <br /><br />the...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there's a family where a little boy ...,0
4,"petter mattei's ""love in the time of money"" is...",1


In [None]:
# Remove HTML tags, then replace punctuation with spaces
import re

def remove_html_tags(text):
    # Strip anything that looks like an HTML tag, e.g. <br />
    text = re.sub(r'<.*?>', '', text)
    # Replace every character that is neither a word character nor whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    return text

# Apply the function to the DataFrame
df['review'] = df['review'].apply(remove_html_tags)
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tec...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there s a family where a little boy ...,0
4,petter mattei s love in the time of money is...,1


In [None]:
# Collapse runs of whitespace into single spaces
def remove_whitespace(text):
    return " ".join(text.split())

df['review'] = df['review'].apply(remove_whitespace)
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tech...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there s a family where a little boy ...,0
4,petter mattei s love in the time of money is a...,1


In [None]:
# Remove stop words
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)
df['review'] = df['review'].apply(remove_stopwords)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,1
1,wonderful little production filming technique ...,1
2,thought wonderful way spend time hot summer we...,1
3,basically family little boy jake thinks zombie...,0
4,petter mattei love time money visually stunnin...,1
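The cleaning steps above can also be sketched as a single function, which is convenient when the same preprocessing has to be applied to new text later. This is an illustrative sketch, not the lab's code: `clean_review` and `SAMPLE_STOP_WORDS` are hypothetical names, and the small hand-picked stop-word set stands in for NLTK's English list so the example runs without a download.

```python
import re

# Assumption: a tiny hand-picked stop-word set replaces NLTK's English list
# so that this sketch runs without downloading anything.
SAMPLE_STOP_WORDS = {"a", "the", "is", "this", "was", "of", "it"}

def clean_review(text, stop_words=SAMPLE_STOP_WORDS):
    """Apply the notebook's cleaning steps to one review string."""
    text = text.lower()                     # lowercase
    text = re.sub(r"<.*?>", "", text)       # drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation -> space
    words = text.split()                    # split() also collapses whitespace
    return " ".join(w for w in words if w not in stop_words)

cleaned = clean_review("A wonderful little production. <br /><br />The filming is great!")
print(cleaned)  # wonderful little production filming great
```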


In [None]:
# Check whether a GPU is available (informational only: scikit-learn's LogisticRegression runs on the CPU)
import torch
is_cuda = torch.cuda.is_available()
# If we have a GPU available, we'll set our device to GPU.
if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

GPU not available, CPU used


## Model Building and Evaluation

In [None]:
# Importing train-test-split
from sklearn.model_selection import train_test_split

In [None]:
# Putting review to X
X = df['review']

# Putting sentiment to y
y = df['sentiment']

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vocabulary and IDF weights on the training reviews only,
# then reuse them to transform the test reviews.
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In [None]:
# Train Logistic Regression Model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

In [None]:
# Make predictions
y_pred = model.predict(X_test_tfidf)
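Predictions require test text that has been transformed with the vectorizer fitted on the training data. One way to keep the two steps in sync is scikit-learn's `make_pipeline`, sketched below as an alternative to the separate cells. The miniature dataset and the names `train_texts`, `test_texts`, `sentiment_pipeline` and `pipeline_pred` are assumptions made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature dataset, used only so the sketch runs on its own.
train_texts = ["great film loved it", "wonderful acting",
               "terrible plot", "boring and bad"]
train_labels = [1, 1, 0, 0]
test_texts = ["loved the acting", "bad plot"]

# fit() learns the TF-IDF vocabulary and the classifier from the training
# text; predict() reuses that fitted vocabulary on the unseen text.
sentiment_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_pipeline.fit(train_texts, train_labels)
pipeline_pred = sentiment_pipeline.predict(test_texts)
print(pipeline_pred)
```

With a pipeline, raw strings go in at both training and prediction time, so there is no separate transformed test matrix to forget.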

In [None]:
# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", accuracy)


Model Accuracy: 0.8869


In [None]:
# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.87      0.88      4961
           1       0.88      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



In [None]:
# Print confusion matrix
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[4323  638]
 [ 493 4546]]
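As a cross-check, the accuracy and the class-1 precision and recall can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes those four counts; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = np.array([[4323, 638],
               [493, 4546]])
tn, fp, fn, tp = cm.ravel()

accuracy_from_cm = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_pos = tp / (tp + fp)            # of the reviews predicted positive, how many are positive
recall_pos = tp / (tp + fn)               # of the positive reviews, how many were found

print(round(accuracy_from_cm, 4))  # 0.8869
print(round(precision_pos, 2))     # 0.88
print(round(recall_pos, 2))        # 0.9
```

These values agree with the accuracy and the class-1 row of the classification report.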


In [None]:
# Test the model
sample_review = ["This product is amazing! I love it."]
sample_review_tfidf = tfidf.transform(sample_review)
prediction = model.predict(sample_review_tfidf)
print("\nSample Review Prediction:", "Positive" if prediction[0] == 1 else "Negative")


Sample Review Prediction: Positive
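Besides the hard label, logistic regression can also report how confident it is through `predict_proba`. The sketch below trains on a made-up toy corpus so that it runs on its own; `demo_texts`, `demo_tfidf` and `demo_model` are hypothetical names, and the probabilities it prints say nothing about the model trained in this lab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, used only for illustration.
demo_texts = ["amazing film i love it", "great and wonderful",
              "awful waste of time", "boring and terrible"]
demo_labels = [1, 1, 0, 0]

demo_tfidf = TfidfVectorizer()
demo_model = LogisticRegression()
demo_model.fit(demo_tfidf.fit_transform(demo_texts), demo_labels)

sample = ["This product is amazing! I love it."]
proba = demo_model.predict_proba(demo_tfidf.transform(sample))

# Columns follow demo_model.classes_, so column 1 is the positive class.
print(demo_model.classes_)
print(proba.shape)
print("P(positive) =", proba[0, 1])
```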
