Sentiment analysis is a powerful technique for understanding the emotions conveyed in text. By interpreting these emotions, businesses and organizations can gain valuable insights into public opinion, enabling them to make informed decisions that drive growth. This analysis provides real feedback from the public, helping organizations take necessary actions based on genuine sentiments.

I have developed a sentiment analysis model designed to help people understand the sentiment behind various texts. The model has been trained on diverse datasets, including the IMDB dataset and hotel review datasets, so that it can interpret emotions from different sources of text.

To achieve this, a Large Language Model (LLM) was fine-tuned on these datasets. One of the most significant advances in AI has been the transformer architecture, which models natural language efficiently and accurately. Building the analysis on a transformer lets the model understand, interpret, and predict the sentiment of a text.

## Importing Required Packages

In [None]:
!pip install evaluate transformers peft datasets trl bitsandbytes torch

In [None]:
!pip uninstall -y bitsandbytes
!pip install bitsandbytes

In [2]:
import os
import pandas as pd
import numpy as np
import warnings
import re
import string
import torch
from transformers import AutoTokenizer, pipeline, DistilBertForSequenceClassification, BitsAndBytesConfig
from warnings import filterwarnings
from transformers import TrainingArguments, Trainer
from datasets import Dataset
from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig,TaskType
import evaluate
from transformers import AutoModelForSequenceClassification

## Loading Dataset for the Analysis

The IMDB dataset, about 50,000 movie reviews, was downloaded from the Stanford AI website. Because this analysis is not limited to movie reviews, I also downloaded other review data, stored it locally, and loaded it into the Jupyter notebook. The data from these sources was then merged.

#### Stanford IMDB Dataset

In [2]:
os.chdir("C:\\Users\\Abhinav Khandelwal\\Desktop\\Machine Learning\\LLM Projects\\Sentiment Analysis")

In [3]:
curdir = os.getcwd()

In [4]:
Review = []
Label = []

# Expected layout: <dataset>\<split>\<pos|neg>\<review file>.
# Note: every pass of the outer loop reads the first entry of the working
# directory, so that directory should hold only the dataset folder.
for i in os.listdir():
    dataset_dir = os.path.join(curdir, os.listdir()[0])
    for j in os.listdir(dataset_dir):
        t_data_dir = os.path.join(dataset_dir, j)
        for k in os.listdir(t_data_dir):
            t_data2 = os.path.join(t_data_dir, k)
            # The last three characters of the folder path ("pos" or "neg") carry the label.
            label = t_data2[-3:]
            for q in os.listdir(t_data2):
                file_dir = os.path.join(t_data2, q)
                with open(file_dir, encoding="utf-8") as file:
                    Review.append(file.read())
                # 1 = positive review, 0 = anything else
                Label.append(1 if label == "pos" else 0)

Review = pd.DataFrame(dict(Final_Review=Review, Sentiment=Label))
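
The nested directory walk can also be written with `pathlib`. The following is a minimal sketch rather than the notebook's loader: it assumes a `<root>/<split>/<pos|neg>/<file>.txt` layout, `load_reviews` is a hypothetical helper name, and unlike the loop above it skips folders that are not named `pos` or `neg` instead of labelling them 0.

```python
from pathlib import Path


def load_reviews(root):
    """Collect (text, label) pairs from <root>/<split>/<pos|neg>/*.txt.

    Hypothetical helper: label is 1 for files under "pos" and 0 under "neg".
    """
    rows = []
    for path in sorted(Path(root).glob("*/*/*.txt")):
        folder = path.parent.name
        if folder not in ("pos", "neg"):
            continue  # skip folders that carry no sentiment label
        label = 1 if folder == "pos" else 0
        rows.append((path.read_text(encoding="utf-8"), label))
    return rows
```

The result can be turned into the same frame with `pd.DataFrame(load_reviews(dataset_dir), columns=["Final_Review", "Sentiment"])`.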

In [16]:
IMDB = Review

In [112]:
IMDB.sample(3)

| | Final_Review | Sentiment |
|---|---|---|
| 37776 | This is the first out of the Guinea Pig series... | 1 |
| 46302 | A lot of the user comments i have seen on the ... | 1 |
| 29747 | I was very excited about this film when I firs... | 0 |


#### Hotel Reviews Dataset

In [149]:
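# `i` is the loop variable left over from the IMDB loading cell, so this cell
# depends on that cell having run and on `i` naming the hotel-review folder.
hotel_dir = os.path.join(os.getcwd(), i)
for o in os.listdir(hotel_dir):
    # Each CSV overwrites the previous one, so hotel_1 ends up as the last file read.
    hotel_1 = pd.read_csv(os.path.join(hotel_dir, o))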

In [150]:
hotel_2 = pd.read_csv("C:\\Users\\Abhinav Khandelwal\\Desktop\\Machine Learning\\LLM Projects\\Sentiment Analysis\\tripadvisor_hotel_reviews.csv")

### Data Preprocessing, Data Cleaning and EDA

The hotel dataset I downloaded has ratings from 0 to 5, which I will encode as binary labels: a rating of 2 or lower becomes 0 and a rating of 3 or higher becomes 1.
The hotel dataset also contains unnecessary columns, which I will remove.
The data is imbalanced, so I will downsample the majority class; enough data remains for the analysis (more than 70,000 data points).
Preprocessing will also include removing tags and punctuation. I will keep the stop words, because I am using an LLM that can use context, and I expect the stop words to help the analysis.
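
The encoding rule can be checked on a few toy rows before applying it to the real data. This is a small sketch with made-up reviews and illustrative column names; it uses a vectorised comparison, which gives the same labels as a per-row `lambda`.

```python
import pandas as pd

# Toy ratings standing in for the hotel data; the rows are invented for illustration.
toy = pd.DataFrame({
    "Review": ["awful", "poor", "fine", "good", "great"],
    "Rating": [1, 2, 3, 4, 5],
})

# Ratings of 3 and above map to 1; everything else maps to 0.
toy["Sentiment"] = (toy["Rating"] >= 3).astype(int)
print(toy["Sentiment"].tolist())  # [0, 0, 1, 1, 1]
```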

**Removing unnecessary columns from the dataset**

In [151]:
hotel_1 = hotel_1[["Description","Is_Response"]]

**Encoding ratings into Sentiment**

In [152]:
hotel_2["Rating"] = hotel_2["Rating"].apply(lambda x: 1 if x >=3 else 0)

In [153]:
hotel_2.sample(3)

| | Review | Rating |
|---|---|---|
| 16391 | okay decor nice new, desk staff uppity profess... | 1 |
| 4169 | just ok overall impression property customer s... | 1 |
| 6293 | okay just got riu south beach miami, stayed 2 ... | 0 |


In [155]:
hotel_1["Is_Response"] = hotel_1["Is_Response"].apply(lambda x: 1 if x == "happy" else 0)

In [157]:
hotel_1.sample(3)

| | Description | Is_Response |
|---|---|---|
| 18614 | DO NOT STAY HERE!!! my boyfriend and I figured... | 0 |
| 30066 | This was a great hotel for our family of four.... | 1 |
| 20350 | Seriously - this place is great. We (two adult... | 1 |


**Renaming the columns so they are uniform for merging**

In [163]:
IMDB.columns = ["Reviews","Sentiment"]

In [167]:
hotel_1.columns = ["Reviews","Sentiment"]

In [168]:
hotel_2.columns = ["Reviews","Sentiment"]

**Merging all the datasets into one dataset for the analysis**

In [180]:
Final_data = pd.concat([IMDB,hotel_1,hotel_2])
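
By default `pd.concat` keeps each frame's original row labels, so the merged frame can contain repeated index values until it is renumbered. A minimal sketch with toy frames (the rows are invented) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

a = pd.DataFrame({"Reviews": ["good film", "dull film"], "Sentiment": [1, 0]})
b = pd.DataFrame({"Reviews": ["clean room"], "Sentiment": [1]})

kept = pd.concat([a, b])                      # row labels: 0, 1, 0
fresh = pd.concat([a, b], ignore_index=True)  # row labels: 0, 1, 2

print(kept.index.tolist(), fresh.index.tolist())
```

In this notebook the index is renumbered later, when the data is shuffled with `ignore_index=True`.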

**Exploratory data analysis follows, to understand the patterns in the data.**

In [189]:
Final_data.shape

(109423, 2)

In [191]:
#Checking whether the data is balanced. It clearly is not, so we could use a balancing technique such as augmentation or
#class weights. Because we have more than 100,000 data points, I will instead downsample the majority class for uniformity.
#In our case that means removing positive-labeled points.

Final_data["Sentiment"].value_counts()/Final_data.shape[0] * 100


1    62.873436
0    37.126564
Name: Sentiment, dtype: float64
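
Class weights are mentioned above as an alternative to removing data. The notebook does not use them, but as a rough sketch, the common "balanced" heuristic `n_samples / (n_classes * class_count)` can be computed like this on toy labels:

```python
import pandas as pd

# Toy labels: 5 positive and 3 negative examples (invented for illustration).
labels = pd.Series([1, 1, 1, 1, 1, 0, 0, 0])

counts = labels.value_counts()
# "Balanced" heuristic: the rarer class receives the larger weight.
weights = len(labels) / (len(counts) * counts)
print(weights.to_dict())
```

Such weights would typically be passed to a weighted loss during training; how that is wired up depends on the training setup and is not shown here.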

In [510]:
#Shuffling the data to avoid any ordering bias.
Shuffled_data = Final_data.sample(frac=1,random_state=43,ignore_index=True)

In [511]:
#Downsample the positive class from 68798 rows to 40625, the size of the negative class, by random sampling.

positive = Shuffled_data[Shuffled_data["Sentiment"]==1].sample(40625,ignore_index=True,random_state=44)
negative = Shuffled_data[Shuffled_data["Sentiment"]==0]
#Note: the name `Dataset` shadows the `Dataset` class imported from `datasets` at the top of the notebook.
Dataset = pd.concat([positive,negative])

In [512]:
#Now we have balanced data.
Dataset["Sentiment"].value_counts()

1    40625
0    40625
Name: Sentiment, dtype: int64
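
The target size does not have to be typed in by hand. As a sketch under the same idea (random downsampling to the smallest class), the size can be read from the data itself; the toy frame and the variable names `n_min` and `balanced` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reviews": [f"review {n}" for n in range(10)],
    "Sentiment": [1] * 7 + [0] * 3,
})

# The smallest class decides how many rows every class keeps.
n_min = toy["Sentiment"].value_counts().min()
balanced = pd.concat(
    [group.sample(n_min, random_state=44) for _, group in toy.groupby("Sentiment")],
    ignore_index=True,
)
print(balanced["Sentiment"].value_counts().to_dict())
```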

In [513]:
#Reset index
Dataset.reset_index(inplace=True,drop = True)

In [514]:
#Data types are correct and there are no null values
Dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81250 entries, 0 to 81249
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Reviews    81250 non-null  object
 1   Sentiment  81250 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.2+ MB


In [515]:
#Checking for duplicated rows. pandas flags 349 rows as duplicates. The rows listed below look unique because
#duplicated() marks only the later copies of a repeated row, so we will remove these duplicates.
Dataset[Dataset.duplicated()]

| | Reviews | Sentiment |
|---|---|---|
| 5522 | Wow! So much fun! Probably a bit much for norm... | 1 |
| 14808 | If you want Scream or anything like the big-st... | 1 |
| 14842 | A longtime fan of Bette Midler, I must say her... | 1 |
| 15652 | The undoubted highlight of this movie is Peter... | 1 |
| 15765 | This is a new Barbie movie. The graphics were ... | 1 |
| ... | ... | ... |
| 80814 | I found it hard to care about these characters... | 0 |
| 80989 | "Three" is a seriously dumb shipwreck movie. M... | 0 |
| 81071 | I do not fail to recognize Haneke's above-aver... | 0 |
| 81159 | Les Visiteurs, the first movie about the medie... | 0 |


In [516]:
##Removing Duplicates
Dataset.drop_duplicates(inplace=True)
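
To see why flagged rows can look unique, it helps to compare the default `duplicated()` with `keep=False`, which marks every copy of a repeated row. A small sketch on an invented frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Reviews": ["great stay", "bad film", "great stay"],
    "Sentiment": [1, 0, 1],
})

later_copies = toy[toy.duplicated()]          # default keep="first": only row 2
all_copies = toy[toy.duplicated(keep=False)]  # rows 0 and 2
deduped = toy.drop_duplicates()               # rows 0 and 1 remain

print(later_copies.index.tolist(), all_copies.index.tolist(), len(deduped))
```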

In [521]:
Dataset.Reviews[10]

'CAUTION: Potential Spoilers Ahead!<br /><br />"Steven Spielberg Presents Tiny Toon Adventures" was always one of my favorite cartoons growing up (heck, it still is). And this movie perfectly captures everything I love about the show and puts it in full-length form.<br /><br />Beautifully animated by the Tokyo Movie Shinsa studio (WB outsourced every "Tiny Toons" project, and this was the best studio to handle the show), the movie starts at the end of the school year at Acme Looniversity, the renowned cartoon college where Buster and Babs Bunny (no relation) and their teenage toon peers learn from the masters of animated lunacy, the Looney Tunes. After the final bell, the movie splits off into five different plots. Buster engages Babs in a water gun fight that culminates with a bursting dam and a tidal wave, sending Buster, Babs, and Elmyra\'s dog Byron downriver on an overturned picnic table in search of adventure in the deep South. Plucky Duck talks Hamton Pig and his family into let

In [522]:
#Remove html tags.
Dataset["Reviews"] = Dataset["Reviews"].str.replace("<.+?>","",regex=True)
Dataset["Reviews"][10]

'CAUTION: Potential Spoilers Ahead!"Steven Spielberg Presents Tiny Toon Adventures" was always one of my favorite cartoons growing up (heck, it still is). And this movie perfectly captures everything I love about the show and puts it in full-length form.Beautifully animated by the Tokyo Movie Shinsa studio (WB outsourced every "Tiny Toons" project, and this was the best studio to handle the show), the movie starts at the end of the school year at Acme Looniversity, the renowned cartoon college where Buster and Babs Bunny (no relation) and their teenage toon peers learn from the masters of animated lunacy, the Looney Tunes. After the final bell, the movie splits off into five different plots. Buster engages Babs in a water gun fight that culminates with a bursting dam and a tidal wave, sending Buster, Babs, and Elmyra\'s dog Byron downriver on an overturned picnic table in search of adventure in the deep South. Plucky Duck talks Hamton Pig and his family into letting him come with them 
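
Deleting tags outright joins the neighbouring sentences, as in `form.Beautifully` in the output above. One possible variation, shown here as a sketch and not as the notebook's method, replaces each tag with a space and then collapses repeated spaces; `strip_tags` is a hypothetical helper name.

```python
import re

TAG = re.compile(r"<.+?>")


def strip_tags(text):
    """Replace each tag with a space, then collapse runs of spaces."""
    return re.sub(r" {2,}", " ", TAG.sub(" ", text)).strip()


cleaned = strip_tags("full-length form.<br /><br />Beautifully animated")
print(cleaned)  # full-length form. Beautifully animated
```

It could be applied to the column with `Dataset["Reviews"].apply(strip_tags)`.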

In [523]:
#Converting all letters to lowercase for uniformity
Dataset["Reviews"] = Dataset["Reviews"].str.lower()

In [524]:
#Creating a function for removing punctuation
def remove_punc(text):
    # The third argument of str.maketrans lists the characters to delete.
    return text.translate(str.maketrans("", "", string.punctuation))

In [525]:
#Removing Punctuations
Dataset["Reviews"] = Dataset["Reviews"].apply(remove_punc)
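
As a quick illustration of what the translation table does, here is a one-string example (the sentence is invented). Every character in `string.punctuation` is deleted, including apostrophes, which is why words such as `couldve` appear in the cleaned reviews.

```python
import string

# Empty first and second arguments: nothing is remapped, the third lists deletions.
table = str.maketrans("", "", string.punctuation)
stripped = "Wow! So much fun, isn't it?".translate(table)
print(stripped)  # Wow So much fun isnt it
```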

In [526]:
#Punctuation has been removed, but a newline character (\n) is still present, which we will remove with a regex
Dataset["Reviews"][19]

'we loved every second of our new years eve getaway from the moment we arrived until we had to fly home service was impeccable and the suite was everything we couldve asked for and then some enjoyed a massage to start off the new year right and i loved the relaxation room in the spa  the beds are so comfortable that i almost fell asleep the staff went out of their way and accommodated our friends with a toddler and provided a crib complete with stuffed animal to make the little one happy\nthe hotel was kind enough to send us chocolates and fruit as a small holiday gift which was a nice touch and we did enjoy them along with some champagne i miss the plush robes the mood lighting and the tub in the executive suitei would most definitely stay again when i return to chicago in fact i cant wait to go back'

In [527]:
#Removing newline (\n) characters
Dataset["Reviews"] = Dataset["Reviews"].str.replace(r'\n','',regex=True)

In [528]:
Dataset["Reviews"][19]

'we loved every second of our new years eve getaway from the moment we arrived until we had to fly home service was impeccable and the suite was everything we couldve asked for and then some enjoyed a massage to start off the new year right and i loved the relaxation room in the spa  the beds are so comfortable that i almost fell asleep the staff went out of their way and accommodated our friends with a toddler and provided a crib complete with stuffed animal to make the little one happythe hotel was kind enough to send us chocolates and fruit as a small holiday gift which was a nice touch and we did enjoy them along with some champagne i miss the plush robes the mood lighting and the tub in the executive suitei would most definitely stay again when i return to chicago in fact i cant wait to go back'
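
Removing the newline without a replacement merges the words on either side of it (`happythe` in the output above). A possible alternative, sketched here with a hypothetical helper `normalize_whitespace`, turns any run of whitespace into a single space instead:

```python
import re


def normalize_whitespace(text):
    """Turn any run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


normalized = normalize_whitespace("little one happy\nthe hotel  was kind")
print(normalized)  # little one happy the hotel was kind
```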

In [536]:
#Our final dataset is now preprocessed and balanced
Dataset["Sentiment"].value_counts()/Dataset.shape[0]*100

1    50.159448
0    49.840552
Name: Sentiment, dtype: float64

In [64]:
#Data loading and preprocessing are done. Next we will fine-tune the LLM. After reviewing the models available on Hugging Face,
#I chose DistilBERT for two reasons: first, like BERT, it was pretrained on a large corpus and has a good understanding of
#general language; second, it has fewer parameters.