<a href="https://colab.research.google.com/github/LuluW8071/Text-Sentiment-Analysis/blob/main/Text-Sentiment-Analysis-using-BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip uninstall transformers -y
!pip install transformers[torch]

Found existing installation: transformers 4.38.2
Uninstalling transformers-4.38.2:
  Successfully uninstalled transformers-4.38.2
Collecting transformers[torch]
  Downloading transformers-4.39.3-py3-none-any.whl (8.8 MB)
Installing collected packages: transformers
Successfully installed transformers-4.39.3


## 1. Download and Load the dataset

The script below downloads a dataset that combines the Yelp Polarity dataset with the IMDb movie review dataset. The Yelp Polarity data was first reduced to the columns needed for sentiment analysis and then merged with the IMDb data.

In [None]:
import gdown
import zipfile
import os

file_url = 'https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik'
file_name = 'sentiment_dataset.zip'

# Download the file from Google Drive
gdown.download(file_url, file_name, quiet=False)
extract_dir = './dataset'

# Extract the zip file
with zipfile.ZipFile(file_name, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Remove the zip file after extraction
os.remove(file_name)
print("Files extracted successfully to:", extract_dir)

Downloading...
From (original): https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik
From (redirected): https://drive.google.com/uc?id=1Jp3D5gdxGrwa5dHbr4p-pECrD8wi7vik&confirm=t&uuid=8194922d-84a4-4970-bccc-7196797eea92
To: /content/sentiment_dataset.zip
100%|██████████| 182M/182M [00:01<00:00, 150MB/s]


Files extracted successfully to: ./dataset


In [None]:
import pandas as pd
import numpy as np

In [55]:
df = pd.read_csv("dataset/sentiment_combined.csv")
df = df.sample(n=10000, random_state=42)

# Reset the index
df.reset_index(drop=True, inplace=True)
df.head(), df.shape

(                                              review sentiment
 0  Never disappointed!   I have been coming here ...  positive
 1  If you order sushi, ask for the secret menu.  ...  positive
 2  Don't miss the fire breathing dragon roll!!!! ...  positive
 3  Typical chain Mexican food, nothing great, but...  negative
 4  So I'll preface this review with the fact that...  negative,
 (10000, 2))

In [56]:
df['review'][0]

'Never disappointed!   I have been coming here for years (since it was La Taqueria) whenever I have to go to any of the downtown government offices/ court.  The food quality is very consistent and fresh.  I love the Carne Asada, carnitas, and chicken in a burrito or tacos.  They have authentic chips fried in house and very fresh pico.   The absolutely best is the green sauce they have at every table.  I put it on everything!'

In [57]:
df['sentiment'].value_counts()

sentiment
positive    5015
negative    4985
Name: count, dtype: int64

In [58]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from collections import Counter

# Precompile regular expressions for faster preprocessing
non_word_chars_pattern = re.compile(r"[^\w\s]")
whitespace_pattern = re.compile(r"\s+")
digits_pattern = re.compile(r"\d")
username_pattern = re.compile(r"@([^\s]+)")
hashtags_pattern = re.compile(r"#\d+")
br_pattern = re.compile(r'<br\s*/?>\s*<br\s*/?>')

def preprocess_string(s):
    # Remove punctuation and symbols (keeps letters, digits, underscores and whitespace)
    s = non_word_chars_pattern.sub('', s)
    # Collapse each run of whitespace into a single space
    s = whitespace_pattern.sub(' ', s)
    # Remove digits
    s = digits_pattern.sub('', s)
    # Remove @usernames (no effect here: '@' was already stripped above)
    s = username_pattern.sub('', s)
    # Remove numeric hashtags (no effect here: '#' and digits were already stripped above)
    s = hashtags_pattern.sub('', s)
    # Remove doubled <br /> tags (no effect here: '<', '/' and '>' were already stripped above)
    s = br_pattern.sub('', s)
    # Remove leftover URL schemes, retweet markers and hyphens.
    # Caution: plain substring replacement also hits letters inside words ("court" -> "cou")
    s = s.replace("https", "")
    s = s.replace("http", "")
    s = s.replace("rt", "")
    s = s.replace("-", "")
    # Remove "br" left over from <br /> tags; this also alters words ("breathing" -> "eathing")
    s = s.replace("br", "")
    # Remove newline characters
    s = s.replace("\n", "")
    return s

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
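
`preprocess_string` above strips punctuation first, so the later `@`, `#` and `<br />` patterns have nothing left to match. Its plain `replace("rt", "")` and `replace("br", "")` calls also cut letters out of ordinary words (later outputs in this notebook show "court" as "cou" and "breathing" as "eathing"). The block below is a minimal sketch of one possible reordering, not the notebook's method. The name `preprocess_string_v2` and the uppercase pattern names are hypothetical. It also assumes that whole URLs should be dropped and that `rt` should be removed only as a standalone token.

```python
import re

# Hypothetical variant of preprocess_string; all names below are illustrative.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
NUMERIC_HASHTAG = re.compile(r"#\d+")
RETWEET = re.compile(r"\brt\b", re.IGNORECASE)
NON_WORD = re.compile(r"[^\w\s]")
DIGITS = re.compile(r"\d")
WHITESPACE = re.compile(r"\s+")

def preprocess_string_v2(s):
    # Markup-dependent patterns run first, while '<', '@', '#' and ':' still exist.
    s = BR_TAG.sub(" ", s)
    s = URL.sub(" ", s)
    s = MENTION.sub(" ", s)
    s = NUMERIC_HASHTAG.sub(" ", s)
    # Word boundaries leave 'rt' inside words such as 'court' untouched.
    s = RETWEET.sub(" ", s)
    # Now strip the remaining punctuation and digits.
    s = NON_WORD.sub("", s)
    s = DIGITS.sub("", s)
    # Collapse whitespace last so earlier removals leave no double spaces.
    return WHITESPACE.sub(" ", s).strip()

sample = "RT @bob: the court was great!<br /><br />Don't miss the fire breathing roll #1 http://x.co/a"
print(preprocess_string_v2(sample))
```

Swapping this in would change the cleaned text, so the outputs recorded further down would no longer match exactly.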


In [59]:
from tqdm.notebook import tqdm_notebook

# Initialize an empty list to store preprocessed reviews
preprocessed_reviews = []

# Use tqdm_notebook instead of tqdm.tqdm for Jupyter Notebook
for review in tqdm_notebook(df['review'], desc='Preprocessing', dynamic_ncols=True):
    preprocessed_review = preprocess_string(review)
    preprocessed_reviews.append(preprocessed_review)

# Assign the preprocessed reviews back to the 'review' column
df['review'] = preprocessed_reviews

Preprocessing:   0%|          | 0/10000 [00:00<?, ?it/s]

In [60]:
df['review'][0], df['sentiment'][0]

('Never disappointed I have been coming here for years since it was La Taqueria whenever I have to go to any of the downtown government offices cou The food quality is very consistent and fresh I love the Carne Asada carnitas and chicken in a burrito or tacos They have authentic chips fried in house and very fresh pico The absolutely best is the green sauce they have at every table I put it on everything',
 'positive')

In [64]:
# Map the string labels to integers: 'positive' -> 1, 'negative' -> 0
df['sentiment'] = df['sentiment'].replace({'positive': 1, 'negative': 0})
df.head()

                                              review  sentiment
0  Never disappointed I have been coming here for...          1
1  If you order sushi ask for the secret menu The...          1
2  Dont miss the fire eathing dragon roll Habanar...          1
3  Typical chain Mexican food nothing great but n...          0
4  So Ill preface this review with the fact that ...          0
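
`Series.replace` leaves any value that is missing from the mapping unchanged, so a misspelled or unexpected label would pass through silently as a string. The sketch below shows one way to make the encoding fail loudly instead. `LABEL_TO_ID` and `encode_labels` are hypothetical names, and the function works on any iterable of label strings.

```python
# Hypothetical helper: encode sentiment strings and reject unknown labels.
LABEL_TO_ID = {"negative": 0, "positive": 1}

def encode_labels(labels, mapping=LABEL_TO_ID):
    labels = list(labels)
    # Collect every label that has no integer id in the mapping.
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"Unexpected sentiment labels: {unknown}")
    return [mapping[label] for label in labels]

print(encode_labels(["positive", "negative", "positive"]))
```

In the notebook this could be applied as `df['sentiment'] = encode_labels(df['sentiment'])`, provided the column still holds the original strings.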


In [65]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['review'],
                                                    df['sentiment'],
                                                    test_size=0.2)

len(X_train), len(X_test)

(8000, 2000)
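
The split above passes no `random_state`, so each run produces a different partition. scikit-learn's `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class ratio equal in both parts. The pure-Python sketch below illustrates what a seeded, stratified split does. `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Seeded split that keeps the label ratio roughly equal in both parts."""
    rng = random.Random(seed)
    # Group example indices by their label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        # Take the same fraction of every class for the test part.
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    # Shuffle again so the classes are interleaved.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    pick = lambda seq, indices: [seq[i] for i in indices]
    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

texts = [f"review {i}" for i in range(10)]
labels = [0] * 5 + [1] * 5
X_tr, X_te, y_tr, y_te = stratified_split(texts, labels)
print(len(X_tr), len(X_te), sorted(y_te))
```

With the same seed the function returns the same partition on every call, which makes later results comparable between runs.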

In [66]:
X_train, X_test, y_train, y_test = list(X_train), list(X_test), list(y_train), list(y_test)
X_train[:2], y_train[:2]

(['Thank you for the reply You refusing to ruin the very subjective integrity of your food as you put it wasnt the reason I walked out It was the hostile way that information was relayed to me This world is full of many different people just because you enjoy a shwarma one way doesnt mean everyone else in the world will If I am paying you my hard earned money for your product I think Im entitled to put it into my mouth and my body the way that I prefer I dont think this is a hard ideology to grasp and frankly its just good business practice As soon as I walked into your business I was your guest You argued or what seemed like a small argument in arabic right in front of me while I looked at the menu That alone I find is disrespectful to a guest Then the way he glared at me was the icing on the cake I was there to give you my business and thats not how you treat a guest Maybe things are different Syria and you are trying to make the customer service authentic too I dont know',
  'Delive

In [67]:
import torch
from torch.utils.data import Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [68]:
import torch
from torch.utils.data import Dataset

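class data(Dataset):
    """Wraps tokenizer encodings and labels so Trainer can index them as a PyTorch Dataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # tokenizer output (input_ids, attention_mask, ...)
        self.labels = labels        # integer class labels, one per example

    def __getitem__(self, index):
        # Build one example: a tensor per encoding field plus the label under 'labels'
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[index])
        return item

    def __len__(self):
        return len(self.labels)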

In [69]:
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

In [70]:
train_encoding = tokenizer(X_train, truncation=True, padding=True)
test_encoding = tokenizer(X_test, truncation=True, padding=True)
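
In the tokenizer calls above, `truncation=True` cuts each review to the model's maximum length (512 tokens for DistilBERT) and `padding=True` pads every sequence to the longest one in the batch, with an attention mask marking the real tokens. The toy sketch below illustrates that idea with made-up token ids. `pad_and_truncate` is a hypothetical helper, not the tokenizer's implementation, and unlike the real tokenizer it does not add or preserve special tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy illustration of truncation followed by padding to the longest sequence."""
    # Truncation: drop everything beyond max_length.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    # Padding: extend shorter sequences with pad_id up to the longest length.
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[5, 6, 7, 8], [9, 10], [1, 2, 3, 4, 5, 6, 7]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the notebook tokenizes the whole training list in one call, every training example is padded to the longest review in that list, up to the 512-token cap.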

In [71]:
train_dataset = data(train_encoding, y_train)
test_dataset = data(test_encoding, y_test)

In [72]:
training_args = TrainingArguments(output_dir="./results")

In [73]:
model = DistilBertForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
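
The `Trainer` above is created without a metric function, so evaluation would report only the loss. `Trainer` accepts a `compute_metrics` callable that receives the model's logits together with the true labels. The sketch below is one possible accuracy function, written in plain Python so that it runs on nested lists as well as on the NumPy arrays the trainer passes in. It is an illustration, not part of the original notebook.

```python
def compute_metrics(eval_pred):
    # The trainer passes an object that unpacks into (logits, labels).
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Toy check with made-up logits for four examples.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.2], [-0.5, 0.5]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer` and call `trainer.evaluate()` after training.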


In [74]:
from accelerate import Accelerator

# Initialize Accelerator
accelerator = Accelerator()

In [75]:
trainer.train()

Step,Training Loss


KeyboardInterrupt: 