In this project, I am going to create a text classification service that automatically categorizes customer reviews by sentiment (positive or negative). Such a service helps businesses gauge customer sentiment about their products or services, so they can respond and improve in a more targeted way.

**Task Definition**

*   Input

The input to the model will be raw text from customer reviews. These texts can range from a single sentence to a paragraph, detailing a customer's experience with a product or service.

*   Output

The output will be a sentiment category for each review: 'Positive' or 'Negative'.

**Approach**

I will use BERT (Bidirectional Encoder Representations from Transformers) for this task.

*   Preprocessing

Clean the text data by removing HTML tags and non-alphabetic characters (digits, punctuation, and other special characters).

*   Model Fine-Tuning

Utilize a pre-trained BERT model and fine-tune it on our sentiment analysis dataset.

*   Training

Train the model using a labeled dataset of customer reviews, where each review is tagged with its corresponding sentiment.

*   Evaluation

Use a separate validation set to evaluate the model's performance.
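
The evaluation step can be made concrete with a few standard classification metrics. The sketch below is a minimal illustration, assuming binary labels encoded as 1 (positive) and 0 (negative). The function name `binary_metrics` and the toy labels are hypothetical and not part of this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard every ratio against division by zero
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: six reviews, two of them misclassified
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

A function along these lines could be adapted into the `compute_metrics` argument that the Hugging Face `Trainer` accepts, so that evaluation reports more than the loss.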





For the dataset, I will use the IMDB reviews dataset. It consists of viewer reviews of different movies, and each review is labeled with a sentiment category (positive or negative).

In [None]:
import zipfile
import os

# Define the path to the zip file and the extraction target directory
zip_file_path = 'IMDB.zip'
extraction_path = 'imdb_reviews'

# Unzip the dataset
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

# List the contents of the extracted folder
extracted_files = os.listdir(extraction_path)
extracted_files

['IMDB Dataset.csv']

**Training data**


*   Cleaning the Reviews

Removing special characters and converting all text to lowercase to normalize the data.

*  Tokenization and Removal of Stopwords

Breaking down the reviews into individual words and removing common words that do not contribute to sentiment.

*  Splitting the Dataset

Dividing the data into training and validation sets to prepare for model training and evaluation. The code in this notebook uses an 80/20 split and does not hold out a separate test set.
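
The notebook's own code splits the data only into training and validation parts. If a held-out test set is also wanted, a three-way split could be sketched as follows. This is a minimal standard-library illustration; the helper `three_way_split`, the 80/10/10 fractions, and the seed are assumptions chosen for the example.

```python
import random


def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle items and split them into train, validation and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")

    # Shuffle indices with a fixed seed so the split is reproducible
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(idx):
        return [items[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Toy example with 100 placeholder reviews
reviews = [f"review {i}" for i in range(100)]
train, val, test = three_way_split(reviews)
print(len(train), len(val), len(test))
```

With scikit-learn, a comparable result can be obtained by calling `train_test_split` twice: once to carve off the test set, and once more to split the remainder into training and validation data.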

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download the stopword list and the punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

# Load the dataset
df = pd.read_csv('imdb_reviews/IMDB Dataset.csv')

# Build the stopword set once: calling stopwords.words() for every word
# is very slow on a dataset of this size
stop_words = set(stopwords.words('english'))

# Define a function for cleaning the reviews
def clean_review(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Replace non-alphabetic characters with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords (set lookup is O(1) per word)
    words = [w for w in words if w not in stop_words]
    # Join the words back into one string
    clean_text = ' '.join(words)
    return clean_text

# Apply the cleaning function to the review column
df['clean_review'] = df['review'].apply(clean_review)

# Display the first few rows of the cleaned dataset
df[['review', 'clean_review']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,review,clean_review
0,One of the other reviewers has mentioned that ...,one reviewers mentioned watching oz episode ho...
1,A wonderful little production. <br /><br />The...,wonderful little production filming technique ...
2,I thought this was a wonderful way to spend ti...,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,basically family little boy jake thinks zombie...
4,"Petter Mattei's ""Love in the Time of Money"" is...",petter mattei love time money visually stunnin...


In [None]:
# Re-defining the DataFrame to include only necessary columns for clarity
df_cleaned = df[['clean_review', 'sentiment']].copy()

# Display the first few rows of the redefined DataFrame
df_cleaned.head()

Unnamed: 0,clean_review,sentiment
0,one reviewers mentioned watching oz episode ho...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically family little boy jake thinks zombie...,negative
4,petter mattei love time money visually stunnin...,positive


In [None]:
import torch  # needed by ReviewsDataset.__getitem__ below
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset

# Prepare datasets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df_cleaned['clean_review'],
    df_cleaned['sentiment'].map({'positive': 1, 'negative': 0}),
    test_size=0.2
)

class ReviewsDataset(Dataset):
    # Custom Dataset for loading encoded reviews and labels
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Returns a single item from the dataset at the specified index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # Returns the total number of items in the dataset
        return len(self.labels)

# Encode the dataset
def encode_reviews(tokenizer, reviews, labels):
    encodings = tokenizer(reviews, truncation=True, padding=True, max_length=128)
    return ReviewsDataset(encodings, labels)

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_dataset = encode_reviews(tokenizer, train_texts.tolist(), train_labels.tolist())
val_dataset = encode_reviews(tokenizer, val_texts.tolist(), val_labels.tolist())
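
The tokenizer call above truncates long reviews to `max_length` tokens, pads shorter ones to a common length, and returns an attention mask that marks which positions hold real tokens. The following is a simplified illustration of that idea on plain lists of token IDs. The helper `pad_and_truncate`, the pad ID of 0, and the toy IDs are assumptions for the sketch; the real tokenizer additionally inserts special tokens and handles them during truncation.

```python
def pad_and_truncate(sequences, max_length=128, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    # Step 1: cut off everything beyond max_length
    truncated = [list(seq[:max_length]) for seq in sequences]

    # Step 2: pad up to the longest remaining sequence in the batch
    target_len = max((len(seq) for seq in truncated), default=0)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch: one short and one long sequence of made-up token IDs
batch = pad_and_truncate(
    [[11, 12, 13], [21, 22, 23, 24, 25, 26]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Note that with `max_length=128`, as used in this notebook, any text beyond the first 128 tokens of a review is discarded, which is a trade-off between training speed and the amount of context the model sees.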

**Model**

I will use transfer learning: a pre-trained BERT model from the Transformers library, fine-tuned for our review classification task.

In [None]:
#! pip install -U accelerate
#! pip install -U transformers

I will also integrate Neptune.ai for experiment tracking; it provides tools for logging and comparing experiments.

In [None]:
#!pip install neptune-client

In [None]:
import os
import torch
import accelerate
import neptune.new as neptune


# Initialize Neptune. The API token is read from the NEPTUNE_API_TOKEN
# environment variable rather than being hard-coded in the notebook.
run = neptune.init_run(
  project = "LSML2/LSML2-Final",
  api_token = os.environ["NEPTUNE_API_TOKEN"],
)


# Load the pre-trained model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)

# Define training arguments
training_args = TrainingArguments(
    output_dir = './results',
    num_train_epochs = 3,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 32,
    warmup_steps = 500,
    weight_decay = 0.01,
    logging_dir = './logs',
    logging_steps = 10,
    evaluation_strategy = "epoch",
    report_to = "none",  # Disable integration with other experiment tracking tools
)


# Initialize the Trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset
)

# Train the model
trainer.train()

# Log model after training
model_path = "review-analysis-bert-model"
trainer.save_model(model_path)
run["trained_model"].upload_files(model_path)  # model_path is a directory, so upload it as a set of files

# Log training arguments
run["training/args"] = vars(training_args)

# Log final results
final_results = trainer.evaluate()
for key, value in final_results.items():
    run[f"evaluation/{key}"] = value

# Stop the Neptune run
run.stop()
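
Because the goal is a classification service, the fine-tuned model's raw outputs (logits) eventually have to be turned into a sentiment label. The sketch below shows one way to do that with a softmax followed by an argmax, assuming the label mapping used earlier in this notebook (negative = 0, positive = 1). The function `logits_to_prediction` and the example logits are hypothetical.

```python
import math

# Same encoding as the label mapping used when preparing the datasets
ID2LABEL = {0: "negative", 1: "positive"}


def logits_to_prediction(logits):
    """Convert raw class scores into (label, probability) via softmax."""
    # Subtract the maximum for numerical stability before exponentiating
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pick the class with the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Example with made-up logits for a single review
label, confidence = logits_to_prediction([-1.2, 2.3])
print(label, round(confidence, 3))
```

Returning the probability alongside the label lets a downstream service flag low-confidence predictions for manual review instead of treating every output as equally reliable.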