# **Auto-Review Analyzer: AI-Powered Pros & Cons Extraction & Summarization** #
### *A scalable NLP system for automatically identifying, classifying, and summarizing positive and negative reviews using Transformer architectures and LoRA fine-tuning*

## 1. Project Overview

### 1.1 Objective & Scope
This project delivers a scalable prototype for automatically identifying **pros** and **cons** from user-generated text, such as reviews, comments, and feedback, and summarizing key insights concisely. While many platforms (e.g., Amazon, Mercado Libre) offer aggregated feedback summaries, explicitly distinguishing between pros and cons provides clearer, more actionable information for decision-making and user satisfaction.

The focus of this project is on building a adaptable foundation for future development rather than a finalized production system.

### 1.2 Solution & Technical Approach
I developed a **two-layer Transformer encoder classifier**, optimized for both accuracy and efficiency. The model was trained on a curated dataset of **~1.7 million Glassdoor job reviews**, with pros and cons carefully extracted and structured for supervised learning. To maximize performance within limited resources, we fine-tuned the model using **LoRA (Low-Rank Adaptation)**, a parameter-efficient technique that reduces training time and computational cost.

### 1.3 Development Context & Constraints
Development was conducted under tight practical constraints, including limited Colab GPU availability, RAM, GPU memory, and storage. Despite these challenges, the resulting model performs reliably in tests and produces consistent, interpretable outputs.

## 2. Environment setup

In this section we prepare our environment to prevent compatibility issues. We begin by clearing any conflicts, then install pinned, compatible versions of all required libraries.

### 2.1 Dependency Pinning & Version Control
To ensure a stable training environment and prevent compatibility issues, the first step is to remove any preinstalled versions of PyTorch, Transformers, NumPy, and related libraries in the Colab environment. Mixing library versions across sessions can introduce runtime errors, CUDA conflicts, or subtle changes in model behavior.

After cleanup, we reinstall pinned, mutually compatible versions of the core dependencies (PyTorch, Transformers, PEFT, Accelerate, and others). This approach ensures reproducibility and guarantees that all components function correctly together, particularly during LoRA fine-tuning and later deployment of the model.

In [1]:
# Environment cleanup (to avoid version conflicts)
!pip uninstall -y -q \
    torch \
    torchvision \
    torchaudio \
    torchtext \
    torchdata \
    numpy \
    transformers \
    peft

# Install pinned, compatible versions
!pip install -q \
    torch==2.3.0 \
    torchtext==0.18.0 \
    numpy==2.0.0 \
    transformers==4.41.0 \
    peft==0.11.0 \
    sentencepiece==0.2.0 \
    safetensors==0.4.3 \
    accelerate==0.29.3 \
    gradio

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
timm 1.0.22 requires torchvision, which is not installed.
torchtune 0.6.1 requires torchdata==0.11.0, which is not installed.
fastai 2.8.6 requires torchvision>=0.11, which is not installed.[0m[31m
[0m

Once the libraries are installed, the Python environment must be restarted so
that the new versions are properly loaded. Here we trigger a clean restart using `os._exit(0)`, which terminates the current process. And we manually reload the session with the freshly installed packages.

In [None]:
# Restart runtime after installation
import os
os._exit(0)

### 2.2 Library Installation & GPU/CPU Setup

With the environment ready, we move on to importing the main libraries for the project. These cover data handling (NumPy and pandas), plotting (Matplotlib), and general utilities like tqdm and scikit-learn. We use PyTorch and TorchText for training and text preprocessing, and Transformers together with PEFT for LoRA fine-tuning. We also bring in NLTK, Gradio, and a few Colab/Kaggle helpers when needed. At the end, we print some environment details to check GPU support and confirm the library versions.

In [1]:
# Standard libraries
import os
import math
import time
import pickle
import warnings
warnings.filterwarnings("ignore")

# Numericals & data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Utilities
import shutil
from tqdm import tqdm
from sklearn.model_selection import train_test_split

# PyTorch & TorchText utilities
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataset import random_split
from torch.nn.utils.rnn import pad_sequence
from torch.cuda.amp import autocast, GradScaler
torch.set_num_threads(1)

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset
from torchtext.vocab import (
    build_vocab_from_iterator,
    GloVe,
    Vectors,
    vocab as create_vocab
)

# Transformers & PEFT
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    BitsAndBytesConfig,
    logging
)
logging.set_verbosity_error() # Suppress Transformers warnings

from peft import LoraConfig, get_peft_model

# NLP utilities
import re
import string
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

# UI & External services
import gdown
import kagglehub
import gradio as gr
from google.colab import files
from huggingface_hub import notebook_login

# Environment info
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


PyTorch: 2.3.0+cu121
CUDA Available: True
GPU: Tesla T4
CUDA Version: 12.1


## 3. Data Preprocessing

This section covers the complete data preparation pipeline. We begin by downloading the training dataset and implementing a custom Dataset class for preprocessing. We then instantiate training, evaluation, and test data iterators. Following this, we load pre-trained GloVe embeddings to build the vocabulary and create a tokenization pipeline. A custom collation function is defined to form batches. Finally, we configure the data loaders to enable efficient batch processing during model training.

### 3.1 Dataset Loading & Exploration

First, we download the Glassdoor Job Reviews dataset from Kaggle using `kagglehub` and load it into a pandas DataFrame. The dataset includes user reviews with separate *pros* and *cons* fields, which will be used as training data for the classifier.

In [2]:
# Download Glassdoor Job Reviews dataset from Kaggle
path_training_set = kagglehub.dataset_download("davidgauthier/glassdoor-job-reviews")
# Load dataset into a Pandas DataFrame
gs_data = pd.read_csv(f"{path_training_set}/glassdoor_reviews.csv")

Using Colab cache for faster access to the 'glassdoor-job-reviews' dataset.


The Glassdoor dataset contains about 838k rows and 18 columns, including company information, job titles, ratings, and other metadata. For this project, the most important fields are **pros** and **cons**. However, because each review provides both a pros entry and a cons entry, we effectively have around 1.7 million text samples available for training, validation, and testing.

In [3]:
gs_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 838566 entries, 0 to 838565
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   firm                 838566 non-null  object 
 1   date_review          838566 non-null  object 
 2   job_title            838566 non-null  object 
 3   current              838566 non-null  object 
 4   location             541223 non-null  object 
 5   overall_rating       838566 non-null  int64  
 6   work_life_balance    688672 non-null  float64
 7   culture_values       647193 non-null  float64
 8   diversity_inclusion  136066 non-null  float64
 9   career_opp           691065 non-null  float64
 10  comp_benefits        688484 non-null  float64
 11  senior_mgmt          682690 non-null  float64
 12  recommend            838566 non-null  object 
 13  ceo_approv           838566 non-null  object 
 14  outlook              838566 non-null  object 
 15  headline         

We display three random reviews from the dataset to quickly inspect the content.
This lets us confirm that the `pros` and `cons` text fields are correctly loaded and formatted before moving on to preprocessing.

In [4]:
gs_data.sample(3)

Unnamed: 0,firm,date_review,job_title,current,location,overall_rating,work_life_balance,culture_values,diversity_inclusion,career_opp,comp_benefits,senior_mgmt,recommend,ceo_approv,outlook,headline,pros,cons
408130,Iron-Mountain-Inc,2021-01-28,Warehouse Associate,"Former Employee, less than 1 year","Livermore, CA",1,1.0,1.0,2.0,1.0,1.0,1.0,x,x,r,This place is terrible!,None. There are no pros.,Everything! Low wages. Difficult co-workers. N...
34758,Apple,2016-03-30,IT Support Technician and Help Desk Specialist,"Current Employee, more than 10 years","Sherman Oaks, CA",5,4.0,4.0,,4.0,4.0,4.0,v,o,v,Mac Genius or Home Advisor,A love of helping customers get the best under...,Knowing that I won't be able to support those ...
495589,Marriott-International,2017-12-06,Anonymous Employee,Current Employee,,4,3.0,4.0,,2.0,4.0,2.0,v,r,o,Good company,"Great benefits, great pay, free food","Irregular hours, early mornings, no upward mob..."


### 3.2 Binary Dataset Preparation

After loading and inspecting the dataset, this step prepares the reviews so they can be used for training. The `GlassdoorDataset` class extracts the `pros` and `cons` columns, assigns label 1 to pros and label 0 to cons, removes missing entries, and performs an 85/15 train–test split with a fixed seed for reproducibility. The result is a clean binary dataset where each sample is returned in the form `(label, text)`.

In [5]:
class GlassdoorDataset(Dataset):
    def __init__(self, train=True, random_state=42):
        """
        Args:
            train (bool): If True, return training split; otherwise test split.
            random_state (int): Seed for reproducible train/test split.
        """
        # Create labeled Pros data (Class = 1)
        pros = pd.DataFrame({
            "Text": gs_data.iloc[:, -2],
            "Class": np.ones(len(gs_data), dtype=int)
        })
        # Create labeled Cons data (Class = 0)
        cons = pd.DataFrame({
            "Text": gs_data.iloc[:, -1],
            "Class": np.zeros(len(gs_data), dtype=int)
        })
        # Combine Pros and Cons, remove missing entries
        gs_df = (
            pd.concat([pros, cons], ignore_index=True)
              .dropna()
        )
        # Train / Test split
        train_texts, test_texts, train_labels, test_labels = train_test_split(
            gs_df["Text"].values,
            gs_df["Class"].values,
            test_size=0.15,
            random_state=random_state,
            shuffle=True
        )
        # Select split based on `train` flag
        if train:
            self.texts = train_texts
            self.labels = train_labels
        else:
            self.texts = test_texts
            self.labels = test_labels

        self.length = len(self.texts)

    def __len__(self):
        """Return dataset size."""
        return self.length

    def __getitem__(self, idx):
        """
        Return one sample:
            (label, text)
        """
        return self.labels[idx], self.texts[idx]

Here we define a simple dictionary that maps numeric labels to readable names. Label 0 corresponds to “Con” and label 1 corresponds to “Pro”, this makes results easier to interpret when visualizing predictions or building the user interface.

In [6]:
labels = {0: "Con", 1: "Pro"}

Next, we create the training and test iterators using the `GlassdoorDataset` class, and then inspect one example to verify the labels and text look correct. The printed output confirms that a sample labeled `1` corresponds to a “Pro” entry, which helps validate that the dataset was built properly before moving on to modeling.

In [7]:
# Instantiate train and test datasets
train_iter = GlassdoorDataset(train=True)
test_iter = GlassdoorDataset(train=False)

# Inspect a sample
train_label, train_text = train_iter[5]
print(f"Label: {train_label} ({labels[train_label]})")
print(f"Text: {train_text}")

Label: 1 (Pro)
Text: Worked there for a couple of years, you're colleague's made it worth it.  Pay was low but your told the pay before you start. Just a typical call centre.  At least I wasn't cold calling.


Moving on, we convert the training and test datasets into map-style datasets so they can be indexed efficiently and used with PyTorch data loaders. We then split the training set into two parts: 80% for training and 20% for validation. The validation split is used to track performance and tune hyperparameters, while the test set remains untouched until final evaluation.

In [8]:
# Convert the training and testing iterators to map-style datasets.
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# Split train dataset into training (80%) and validation (20%)
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, valid_dataset = random_split(train_dataset, [train_size, val_size])

### 3.3 Vocabulary Building with GloVe Embeddings

Now, we prepare the data so it can be passed to the model. We load pretrained GloVe embeddings (100-dimensional) and use them to build the project vocabulary. The vocabulary is constructed directly from the GloVe token index and includes special tokens for unknown words and padding. Any token not found in GloVe is mapped to `<unk>`. We also record useful metadata such as the padding index and the total vocabulary size, which will be needed later when defining the model and creating the data loaders.

In [9]:
# Load pretrained GloVe embeddings
glove_embedding = GloVe(name="6B", dim=100, cache=".vector_cache")

# Build vocabulary from GloVe
vocab = create_vocab(
    glove_embedding.stoi,
    min_freq=1,
    specials=["<unk>", "<pad>"],
    special_first=True
)
# Set default token for out-of-vocabulary words
vocab.set_default_index(vocab["<unk>"])

# Vocabulary metadata
PAD_IDX = vocab["<pad>"]
vocab_size = len(vocab)
print("Pad index:", PAD_IDX, "\nVocab size: ", vocab_size)

Pad index: 1 
Vocab size:  400001


### 3.4 Tokenization & Batch Collation Pipeline
This step defines the text preprocessing pipeline. We use TorchText’s `basic_english` tokenizer to split each review into tokens, and then convert those tokens into integer indices using the previously built vocabulary. The `text_pipeline` function will be applied later inside the data loaders so raw text can be fed directly into the model.

In [10]:
# Tokenizer
tokenizer = get_tokenizer("basic_english")

def text_pipeline(text):
    """
    Convert raw text into a sequence of vocabulary indices.

    Process:
    - Tokenize input text using a basic English tokenizer.
    - Map tokens to integer indices using the vocabulary.
    """
    return vocab(tokenizer(text))

Here we select the computation device. If a GPU is available, the model and tensors will run on CUDA; otherwise, the code falls back to the CPU. This ensures the notebook works both locally and in environments without GPUs.

In [11]:
# Select GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

The `collate_batch` function takes a batch of label–text pairs from a DataLoader and converts them into tensors suitable for model training. It turns each text example into token indices, pads the sequences so they all share the same length, converts the labels into a tensor, and returns both so the model receives a uniform batch.

In [12]:
def collate_batch(batch):
    """
    Prepare a batch for training.

    For each (label, text) pair:
    - Convert text into a tensor of token indices
    - Pad all sequences in the batch to the same length
    - Convert labels into a tensor

    Returns:
        labels (LongTensor): shape (batch_size,)
        texts  (LongTensor): shape (batch_size, max_seq_len)
    """
    label_list = []
    text_list = []

    for label, text in batch:
        # Store label
        label_list.append(label)

        # Tokenize text and convert to tensor
        text_list.append(torch.tensor(text_pipeline(text), dtype=torch.int64))

    # Convert labels to tensor
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Pad text sequences
    text_list = pad_sequence(text_list, batch_first=True)

    return label_list, text_list

### 3.5 DataLoaders Configuration & Parallelized Batching

Finally, we create the DataLoader objects to organize training, validation, and test datasets into mini-batches of fixed size, shuffle the training data, and rely on the `collate_batch` function to pad and format each batch correctly. They also use multiple worker processes to load data in parallel and enable faster transfers to the GPU, helping the model train and evaluate efficiently.

In [13]:
BATCH_SIZE = 64

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_batch,
    num_workers=min(4, os.cpu_count() - 1),  # More workers
    pin_memory=True,  # Faster data transfer to GPU
    persistent_workers=True
)
valid_dataloader = DataLoader(
    valid_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_batch,
    num_workers=2
)
test_dataloader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_batch,
    num_workers=2
)

Next, the code below scans every review in both the training and test sets to determine the maximum sequence length. This value is then used to configure the transformer model's input size.

In [14]:
max_len = 0

# Iterate over training reviews
for review in list(train_iter[:][1]):
    review_len = len(text_pipeline(review))
    if review_len > max_len:
        max_len = review_len

# Iterate over test reviews
for review in list(test_iter[:][1]):
    review_len = len(text_pipeline(review))
    if review_len > max_len:
        max_len = review_len

# Set maximum sequence length
MAX_LEN = max_len

print("Length of longest review:", max_len)

Length of longest review: 3379


## 4. Model Development & Architecture

In this section we detail the construction of the core classifier. It begins by implementing a standard positional encoding class to provide the model with sequence order. Next, we define the full Transformer encoder architecture, built on top of frozen GloVe embeddings, which processes text through attention layers and uses masked mean pooling. Finally, we cover the utility functions for making single predictions and evaluating the model's accuracy.

### 4.1 Positional Encoding Implementation

Firstly, we create a class that implements sinusoidal positional encoding for Transformer models, following the original “Attention Is All You Need” paper. It precomputes a fixed encoding matrix where each position is assigned a unique combination of sine and cosine values across embedding dimensions. These encodings are added to the input token embeddings, enabling the model to utilize sequential order and relative positional information. The encodings are computed once for the maximum sequence length and reused during both training and inference.



In [15]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_len=MAX_LEN, dropout=0.1):
        """
        Args:
            d_model (int): Embedding dimension.
            max_seq_len (int): Maximum sequence length supported.
            dropout (float): Dropout probability applied after adding encoding.
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Initialize positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)

        # Create position indices [0, 1, ..., max_seq_len - 1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

        # Compute the div_term for sine and cosine frequencies
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )
        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension for broadcasting
        pe = pe.unsqueeze(0)

        # Register as buffer (not a learnable parameter)
        self.register_buffer("pe", pe)

    def forward(self, x):
        """
        Add positional encoding to input embeddings.

        Args:
            x (Tensor): Shape (batch_size, seq_len, d_model)

        Returns:
            Tensor: Position-aware embeddings with same shape as input.
        """
        x = x + self.pe[:, : x.size(1), :]
        return self.dropout(x)

### 4.2 Transformer Encoder Classifier Architecture

Then, we develop the Transformer-based text classifier architecture built on top of pretrained GloVe embeddings. It embeds each input token, adds positional encodings so the model can understand word order, and passes the sequence through a stack of Transformer encoder layers. Afterward it pools the encoded sequence into a single representation while ignoring padding, and feeds that vector into a linear classifier to produce logits for each class.

In [16]:
class TransformerEncoderClassifier(nn.Module):
    def __init__(
        self,
        num_class,
        vocab_size,
        freeze=True,
        nhead=2,
        dim_feedforward=128,
        max_len=MAX_LEN,
        num_layers=2,
        dropout=0.1,
        activation="gelu",
        classifier_dropout=0.1,
    ):
        """
        Transformer-based text classifier with pretrained GloVe embeddings.

        Args:
            num_class (int): Number of output classes.
            vocab_size (int): Vocabulary size.
            freeze (bool): Whether to freeze GloVe embeddings.
            nhead (int): Number of attention heads.
            dim_feedforward (int): Hidden dimension of feedforward layers.
            max_len (int): Maximum supported sequence length.
            num_layers (int): Number of Transformer encoder layers.
            dropout (float): Dropout probability in encoder and positional encoding.
            activation (str): Activation function for encoder layers.
            classifier_dropout (float): Dropout before classifier (reserved).
        """
        super().__init__()

        # Embedding layer initialized from pretrained GloVe
        self.emb = nn.Embedding.from_pretrained(
            glove_embedding.vectors,
            freeze=freeze,
            padding_idx=PAD_IDX,
        )

        embedding_dim = self.emb.embedding_dim
        self.d_model = embedding_dim

        # Positional encoding
        self.pos_encoder = PositionalEncoding(
            d_model=embedding_dim,
            dropout=dropout,
            max_seq_len=max_len,
        )

        # Transformer encoder stack
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True,
            activation=activation,
        )

        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
        )

        # Classification head
        self.classifier = nn.Linear(
            embedding_dim,
            num_class
        )

    def forward(self, x):
        """
        Forward pass.

        Args:
            x (LongTensor): Token indices with shape (batch_size, seq_len)

        Returns:
            Tensor: Logits with shape (batch_size, num_class)
        """

        # Identify padding positions
        padding_mask = (x == PAD_IDX)  # (batch_size, seq_len)

        # Embed tokens and scale by sqrt(d_model)
        x = self.emb(x) * math.sqrt(self.d_model)

        # Add positional encoding
        x = self.pos_encoder(x)

        # Apply Transformer encoder with padding mask
        x = self.transformer_encoder(
            x,
            src_key_padding_mask=padding_mask
        )
        # Masked mean pooling over sequence dimension
        mask = (~padding_mask).unsqueeze(-1)  # (batch_size, seq_len, 1)
        denom = mask.sum(dim=1).clamp(min=1)
        x = (x * mask).sum(dim=1) / denom

        # Final classification
        x = self.classifier(x)

        return x

Here we create an instance of Transformer classifier with two output classes and the vocabulary that matches the pretrained embeddings, then it is moved to the selected device so it can run on either the CPU or GPU depending on availability.

In [17]:
# Instantiate the Transformer classifier
classifier = TransformerEncoderClassifier(
    num_class=2,          # Binary classification: Pro vs Con
    vocab_size=vocab_size # Vocabulary size built from GloVe
).to(device)              # Move model to the appropriate device (CPU/GPU)

### 4.3 Prediction and Evaluation functions

For testing and evaluation, we define a predict and evaluation function. The predict function accepts a raw text review, tokenizes it, and converts tokens to vocabulary IDs. The function then runs the model in inference mode to obtain logits, selects the highest-scoring class, and returns the corresponding label ("Pro" or "Con"). Before training, the model is tested to confirm its predictions are effectively random.

In [18]:
def predict(text, model):
    """
    Predicts whether a review is a 'Pro' or 'Con'.

    Process:
    - Tokenize text and convert tokens to vocabulary indices.
    - Add a batch dimension (model expects batches).
    - Move tensor to the same device as the model.
    - Return label.
    """
    with torch.no_grad():
        # Tokenize and get vocab embeddings
        tokens = vocab(tokenizer(text))

        # Convert to tensor and add batch dimension (1, seq_len)
        text_tensor = torch.tensor(tokens).unsqueeze(0).to(device)

        # Forward pass
        output = model(text_tensor)

        # Get predicted class index and map to label name
        return labels[output.argmax(1).item()]

# Check model prediction, not trained
predict("This product is fantastic", classifier)

'Pro'

The evaluation function evaluates the model on a held-out dataset by running it in evaluation mode, turning off gradient computation, and iterating through the dataloader to collect predictions. It compares each prediction with the true label, counts how many are correct, and finally returns the overall accuracy. We used Mixed precision is used to speed up evaluation without affecting results.

In [19]:
def evaluate(dataloader, model_eval):
    """
    Evaluates the model on test dataloader and returns accuracy.

    Process:
      - Switch model to eval mode (disables dropout, etc.)
      - Disable gradients to speed things up and save memory
      - Use mixed precision during forward passes
      - Accumulate correct predictions and compute accuracy
    """
    model_eval.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for label, text in dataloader:
            # Move batch to device
            label, text = label.to(device), text.to(device)

            # Forward pass with Automatic Mixed Precision
            with autocast():
                output = model_eval(text)

            # Handle models that return dicts (LoRA wrappers)
            logits = output["logits"] if isinstance(output, dict) else output

            # Predicted class indices
            predicted = torch.max(logits.data, 1)[1]

            # Update metrics
            total_acc += (predicted == label).sum().item()
            total_count += label.size(0)

    return total_acc / total_count

Now, we run the evaluation function on the test dataset using the current (untrained) classifier. Since the model hasn’t learned anything yet, the accuracy is expected to be close to random guessing, roughly around 50%.

In [20]:
# Compute accuracy on test data, should be around 50%
test_accuracy = evaluate(test_dataloader, classifier)
test_accuracy

0.49903008331743304

## 5. Training Pipeline & Model Evaluation

This section covers the complete training and evaluation workflow. We begin by configuring the model’s training setup with loss, optimizer, and scheduler. Next, we detail the customized training loop using mixed-precision for efficiency. Following training, the pipeline loads the previously trained model and performs inference, concluding with a final evaluation on the test set to report the model's generalization accuracy.

### 5.1 AMP Customized Training Loop

As a start, we define the main training loop. This function trains the model for a specified number of epochs, using mixed precision to speed up computations. In each epoch, it iterates through training batches to perform the forward pass, calculate loss, execute the backward pass, and update the model parameters. It also adjusts the learning rate with a scheduler. After processing all batches, it evaluates the model on the validation set and logs the epoch's training loss and validation accuracy.

In [21]:
def train_model(
    model,
    optimizer,
    criterion,
    scheduler,
    train_dataloader,
    valid_dataloader,
    epochs=30
):
    """
    Train the model using mixed precision, track validation accuracy,
    and keep the best accuracy seen so far.
    """

    cum_loss_list = []   # Track total loss per epoch
    acc_epoch = []       # Track validation accuracy per epoch
    acc_old = 0          # Best validation accuracy
    time_start = time.time()

    # GradScaler supports Automatic Mixed Precision (AMP)
    scaler = GradScaler()

    for epoch in tqdm(range(1, epochs + 1)):
        model.train()
        cum_loss = 0

        for idx, (label, text) in enumerate(train_dataloader):
            optimizer.zero_grad()

            # Move batch to GPU/CPU
            label, text = label.to(device), text.to(device)

            # Forward pass under AMP context
            with autocast():
                predicted_label = model(text)
                loss = criterion(predicted_label, label)

            # Backpropagation (scaled for numerical stability)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            cum_loss += loss.item()

        # Learning-rate scheduling
        scheduler.step()
        print(f"Epoch {epoch}/{epochs} - Loss: {cum_loss}")

        # Track metrics
        cum_loss_list.append(cum_loss)

        accu_val = evaluate(valid_dataloader, model)
        acc_epoch.append(accu_val)

        # Save best validation accuracy
        if accu_val > acc_old:
            print(accu_val)
            acc_old = accu_val

    time_end = time.time()
    print(f"Training time: {time_end - time_start:.2f} seconds")

    return model, cum_loss_list, acc_epoch

### 5.2 Model Training Configuration

Next, we define a helper function that either trains a new classifier or loads one that was previously saved. When training is enabled, it sets up the loss function, optimizer, and learning-rate scheduler, runs the training loop, and then saves the finished model to disk. When training is disabled, it simply loads the saved model file and makes it ready for use on the current device.

In [22]:
def train_classifier(
    train=False,
    n_epochs=50,
    lr=0.01,
    step_size=2.0,
    gamma=0.8,
    save_name="transformer_classifier",
    load_name="transformer_classifier"
):
    """
    Train the classifier from scratch OR load a previously saved model.

    Args:
        train (bool): If True, train and save a new model.
                      If False, load an existing one.
    """

    global vocab, classifier, PAD_IDX

    if train:
        # Training configuration
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(
            classifier.parameters(),
            lr=lr,
            momentum=0.9
        )
        scheduler = torch.optim.lr_scheduler.StepLR(
            optimizer,
            step_size=step_size,
            gamma=gamma
        )
        # Run training loop
        train_model(
            model=classifier,
            optimizer=optimizer,
            criterion=criterion,
            scheduler=scheduler,
            train_dataloader=train_dataloader,
            valid_dataloader=valid_dataloader,
            epochs=n_epochs,
        )

        # Save entire model object
        torch.save(classifier, save_name + ".pth")
        print("Saved entire model")

    else:
        # Load saved model
        classifier = torch.load(
            load_name + ".pth",
            map_location=device
        )
        print("Loaded entire model")

### 5.3 Model Inference & Performance Validation

Due to GPU computation limits, the model was trained for 30 epochs. The final model object was saved and uploaded to Google Drive for storage. Now, for inference, we load that saved TransformerClassifier checkpoint from disk. After loading, we print its architecture to verify the layer structure and configuration match the original trained model, ensuring a correct recovery.

In [23]:
# Extract file id from the shareable link
file_id = '1Z2pimkOf28NFMmC3K866hOsW1yNVE6XL'
# Download using gdown
url = f'https://drive.google.com/uc?id={file_id}&confirm=t'
gdown.download(url, 'transformer_classifier.pth', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1Z2pimkOf28NFMmC3K866hOsW1yNVE6XL&confirm=t
To: /content/transformer_classifier.pth
100%|██████████| 162M/162M [00:02<00:00, 55.3MB/s]


'transformer_classifier.pth'

In [24]:
train_classifier(
    train=False,
    save_name="transformer_classifier",
    load_name="transformer_classifier"
)
print("Model Architecture:\n", classifier)

Loaded entire model
Model Architecture:
 TransformerEncoderClassifier(
  (emb): Embedding(400000, 100, padding_idx=1)
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)
        )
        (linear1): Linear(in_features=100, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=100, bias=True)
        (norm1): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((100,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (classifier): Linear(in_features=100, out_features=2, bias=True)
)


Now, we run the `predict` function with the loaded model on the same example review.

In [25]:
predict("This product is fantastic", classifier)

'Pro'

Finaly, we run the evaluate function on the trained classifier on the held-out test set and prints the overall accuracy.

In [26]:
# Evaluate the trained (or loaded) classifier on the test set
test_accuracy = evaluate(test_dataloader, classifier)
print(f"Test accuracy: {test_accuracy:.2%}")

Test accuracy: 94.54%


## 6. Fine-Tunning Text Classifier with LoRA

This section covers the parameter-efficient adaptation of the trained model using LoRA. We begin by creating a configuration wrapper to make the Transformer classifier compatible with PEFT tools. Next, we inject and configure the LoRA adapters into the model's attention and feed-forward layers. Finally, the process of loading the fine-tuned model and evaluating its improved performance on the test set is demonstrated.

### 6.1 PEFT Wrapper Configuration

First, we create a configuration class that stores the key hyperparameters and metadata needed to describe the Transformer classifier in a way similar to Hugging Face models. It defines architectural details (like hidden size, number of layers, and attention heads), task settings such as number of labels and vocabulary size, label mappings, and a few utility fields used for compatibility and logging.

In [27]:
# Lightweight configuration object to mimic Hugging Face model configs
class ModelConfig:
    def __init__(self):
        # Core architecture parameters
        self.hidden_size = 100
        self.num_hidden_layers = 2
        self.num_attention_heads = 5
        self.intermediate_size = 128

        # Task / vocabulary settings
        self.vocab_size = 400000
        self.num_labels = 2
        self.problem_type = "single_label_classification"

        # Label mappings
        self.id2label = {0: "Con", 1: "Pro"}
        self.label2id = {"Con": 0, "Pro": 1}

        # Misc
        self.torch_dtype = "float32"
        self._name_or_path = "custom_model"
        self.use_return_dict = True
        self.classifier_dropout = None

Next, the wrapper class below helps us adapt the existing Transformer classifier so it can work with PEFT/LoRA. It stores the original trained model, exposes a configuration object that mimics the Hugging Face interface, and freezes all base parameters so that only LoRA adapters will be trained. Its forward method simply routes inputs to the underlying classifier while providing the standard signature expected by PEFT tools.

In [28]:
# Wrapper to make our classifier compatible with PEFT / LoRA
class TransformerEncoderClassifierWithLoRA(nn.Module):
    def __init__(self, original_model):
        """
        Wrap the trained classifier so LoRA can attach adapters.
        The base model is frozen and only LoRA layers will be trained.
        """
        super().__init__()

        # Keep reference to the original classifier
        self.model = original_model

        # Provide a config attribute (expected by PEFT / Transformers)
        self.config = ModelConfig()

        # Freeze all base model parameters
        for param in self.model.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        """
        Standard forward signature
        """
        logits = self.model(input_ids)
        return logits

### 6.2 Adapter Injection in Attention/Linear Layers

Here the frozen Transformer classifier is wrapped so LoRA can attach lightweight adapters, and a configuration is defined describing how those adapters should behave. Below we define the LoRA settings such as where adapters are inserted, how large they are, how strongly their updates are scaled, and which parts of the base model are still allowed to train.

In [29]:
# Wrap the frozen classifier so LoRA can attach adapters
model_with_lora = TransformerEncoderClassifierWithLoRA(classifier)

# LoRA configuration
lora_config = LoraConfig(
    r=8,                               # Rank matrices
    lora_alpha=16,                     # Scaling factor for LoRA updates
    target_modules=["linear1", "linear2", "out_proj"],  # Layers to inject adapters
    lora_dropout=0.1,                  # Dropout
    bias="all",                        # Also train biases
    task_type="SEQ_CLS",               # Sequence classification task
    modules_to_save=["classifier"],    # Keep final binary classifier trainable
)

This step applies the LoRA configuration to the wrapped model, injecting trainable adapter layers while keeping the original Transformer mostly frozen. After attaching the adapters, we print out which parameters will actually be updated, confirming that only the LoRA components will be trained.

In [30]:
# Attach LoRA adapters to the model
classifier = get_peft_model(model_with_lora, lora_config)

# Display which parameters will update during training
classifier.print_trainable_parameters()

trainable params: 12,354 || all params: 40,144,156 || trainable%: 0.0308


### 6.3 Parameter-Efficient Training & Evaluation

We trained the LoRA classifier for 20 epochs, to avoid retraining it is saved on Drive and loaded here. By printing the model architecture confirms that LoRA adapters were successfully injected into the Transformer layers, keeping the base model frozen while the classifier head remains trainable.

In [31]:
# Extract FILE_ID from the shareable link
file_id = '1YozUCXDvaLhaWQw-lRHuILP-oAwv_2ph'

# Download using gdown
url = f'https://drive.google.com/uc?id={file_id}&confirm=t'
gdown.download(url, 'transformer_classifier_lora.pth', quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1YozUCXDvaLhaWQw-lRHuILP-oAwv_2ph&confirm=t
To: /content/transformer_classifier_lora.pth
100%|██████████| 162M/162M [00:03<00:00, 40.7MB/s]


'transformer_classifier_lora.pth'

In [32]:
# Train = False - loads already fine-tuned LoRA model
train_classifier(
    train=False,
    n_epochs=40,
    lr=0.01,
    step_size=5.0,
    gamma=0.8,
    save_name="transformer_classifier_lora",
    load_name="transformer_classifier_lora"
    )

print("Classifier architecure with LoRA: ", classifier)

Loaded entire model
Classifier architecure with LoRA:  PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): TransformerEncoderClassifierWithLoRA(
      (model): TransformerEncoderClassifier(
        (emb): Embedding(400000, 100, padding_idx=1)
        (pos_encoder): PositionalEncoding(
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer_encoder): TransformerEncoder(
          (layers): ModuleList(
            (0-1): 2 x TransformerEncoderLayer(
              (self_attn): MultiheadAttention(
                (out_proj): lora.Linear(
                  (base_layer): NonDynamicallyQuantizableLinear(in_features=100, out_features=100, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=100, out_features=8, bias=False)
                  )
                  (lora_B): Mo

Finally, the model with trained LoRA adapters is evaluated on the test set, achieving a slight improvement in accuracy over the baseline.

In [33]:
# Measure final generalization performance
test_accuracy = evaluate(test_dataloader, classifier)
print(f"Test accuracy: {test_accuracy:.2%}")

Test accuracy: 94.60%


## 7. Summarization System Implementation

In this section we build the final integrated system for review analysis. First, we set up a Hugging Face summarization pipeline. Next, we implement a text cleaning and sentence splitting function. Then, we create a combined pipeline that extracts individual sentences, classifies each one as a Pro or a Con, and generates separate summaries for each category. Finally, we test the complete system on a set of real Amazon reviews.

### 7.1 Summarization Pipeline

We begin by creating a Hugging Face pipeline for text summarization using the pretrained `philschmid/bart-large-cnn-samsum` model. This pipeline automatically handles tokenization, model inference, and decoding, allowing us to summarize input text with a single function call.

In [34]:
# Hugging Face pipeline for text summarization
summarizer = pipeline(
    task="summarization",
    model="philschmid/bart-large-cnn-samsum"
)

### 7.2 Text Cleaning & Sentence Processing

To handle summarization, first we implement a text cleaning pipeline that removes numbers, special characters, and pronouns while preserving letters, spaces, and basic punctuation (periods, exclamation points, question marks). The pipeline also normalizes whitespace by collapsing multiple spaces into single spaces and trimming leading/trailing whitespace.

In [35]:
def clean_text_basic(text):
    """
    Basic cleaning: remove numbers and special symbols, keep letters, spaces, and basic punctuation
    """
    # Remove numbers and leading/trailing whitespace
    text = re.sub(r'\d+', '', text.strip())

    # Remove special symbols but keep basic punctuation (.!?,)
    text = re.sub(r'[^\w\s.!?]', '', text)
    text = text.replace('\n', '. ')

    # Replace any non-space character after . or , with a space
    text = re.sub(r'([.,])\S', r'\1 ', text)

    pronouns = [
    # Personal pronouns
    'I', 'me', 'my', 'mine', 'myself',
    'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself',
    'she', 'her', 'hers', 'herself',
    'it', 'its', 'itself',
    'we', 'us', 'our', 'ours', 'ourselves',
    'they', 'them', 'their', 'theirs', 'themselves'
    ]

    # Remove pronouns and related
    pattern = r'\b(' + '|'.join(map(re.escape, pronouns)) + r')\b'
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\bI'?ve\b", "", text)
    text = re.sub(r"\bI'?m\b", "", text)
    text = re.sub(r"\bI'?ll\b", "", text)
    text = re.sub(r"\bI'?d\b", "", text)
    text = re.sub(r"\bam\b", "", text)

    # Remove extra whitespace
    text = ' '.join(text.split())

    return text.strip()

### 7.3 Integrated Pros/Cons Extraction Pipeline

Finally, we implement the summarizer and classifier, the function takes a piece of text, splits it into sentences, classifies each sentence as a *Pro* or *Con* using the trained classifier, and then summarizes the grouped Pro and Con sentences separately using the Hugging Face summarization pipeline. It returns tuple of a structured strings showing summarized Pros and Cons, skipping empty sections if no sentences fall into one category.

In [36]:
def ProsConsSummarizer(text, summarizer=summarizer):
    """
    Split text into sentences, classify each sentence as Pro/Con,
    summarize each sentence, and return grouped Pros/Cons text.
    """
    # Clean and tokenize textual input
    sentences = nltk.sent_tokenize(clean_text_basic(text))

    pros = ""
    cons = ""

    for sentence in sentences:
        # Predict sentiment category using our classifier
        prediction = predict(sentence, classifier)

        sentence = sentence.capitalize()

        if prediction == "Pro":
            pros += sentence + ", "
        else:
            cons += sentence + ", "

    if pros == "" and cons == "":
      pros = "No Pros found in review."
      cons = "No Cons found in review."
    elif pros == "":
      pros = "No Pros found in review."
      cons = "- " + summarizer(cons, max_length=100, min_length=50)[0]["summary_text"].strip()
      cons = re.sub(r'\.\s+', '.\n- ', cons)
    elif cons == "":
      cons = "No Cons found in review."
      pros = "- " + summarizer(pros, max_length=100, min_length=50)[0]["summary_text"].strip()
      pros = re.sub(r'\.\s+', '.\n- ', pros)
    else:
      pros = "- " + summarizer(pros, max_length=100, min_length=25)[0]["summary_text"].strip()
      pros = re.sub(r'\.\s+', '.\n- ', pros)
      cons = "- " + summarizer(cons, max_length=100, min_length=25)[0]["summary_text"].strip()
      cons = re.sub(r'\.\s+', '.\n- ', cons)
    return pros, cons

### 7.4 Example Reviews for Testing
These five reviews are real Amazon product reviews expressing both positive and negative aspects. They include detailed feedback about fit, usability, performance, comfort, and design issues, providing a mix of pros and cons that can be analyzed or summarized by the classifier and summarization pipeline.

In [37]:
review1 = """
I absolutely loved the toe of this shoe, they seem to be exceptionally comfortable,lightweight and fit nicely. However the issue that I had with these shoes is the arch, my Arch just rubbed on it, too hard and way to prominent..if they could figure out how to redesign the arch in the shoe it would be a fabulous walking shoe.
"""

review2 = """
I ordered these sneakers in a size 7 wide, even though I usually wear a 6 or 6.5, because I wanted a little extra room. For me personally, that sizing choice worked fairly well. They felt slightly loose in the heel area but still comfortable overall.
The shoes came in black with some dark gray tones and a white logo on the side. The style looked clean and simple to me, and I thought they seemed stylish. The bottom soles seemed to have good traction, and when I tested them on tile and hardwood, they didn’t seem slippery. The shoes also felt very lightweight, which I liked since I often have trouble finding comfortable sneakers that don’t seem to weigh down my feet.
The fabric on the upper part of the shoes seemed breathable and flexible, with what seemed like mesh like areas. I think that’s a nice touch for comfort and keeping feet cooler. The shoes can be tied, but they also seem to work well as slip ons, and there’s a small tab or loop on the back that makes them easier to pull on and off.
Inside, the lining felt soft, but I did notice that the insole didn't seem to be secured, it lifted slightly. At first, I thought that might be an issue, but after wearing them for a bit, it didn’t seem to move around when I walked. Still, I think it’s worth mentioning in case others prefer shoes with the insole sewn in. The cushioning itself felt a bit thin but they were still comfortable to me overall.
The overall construction seemed decent, and the shoes felt breathable, light, and seemed easy to wear. For me personally, the comfort level was decent overall. I liked how flexible they felt and that the traction worked well across different surfaces.
Overall, these sneakers seemed lightweight, breathable, and seemed stylish. For me personally, they met most of what I was hoping for, and I plan to keep wearing them. The only small downsides for me were the slightly loose fit at the heel for me and the insole not being fully secured, but they still felt comfortable overall. I’ll continue wearing them and may update my review later if anything changes with more wear.
"""

review3 = """
I’ve had this machine now for a month and I absolutely LOVE IT!!
-pulls good shots
-easy to use and clean
The only bad thing i’ve found about it so far, it’s a little noisy but it’s not terrible! (most machines are going to be noisy)
The frothing wand is really good if you know how to use it, I recommend watching videos on how to properly steam milk if you don’t know how!
"""

review4 = """
I love the oversized hood and the jacket fits well though if you're big chested on a small frame as I am, I'd recommend going a size larger than you're used to wearing. I'm a med normally in coats, but this one doesn't give much room across my chest. Also, I was unhappy about the inner pockets because they were not stitched at the bottom,
so although it looked like a pocket, it was clearly not one. I stitched the bottoms of the pockets closed because I love inside pockets. I have to say the value for the money spent is very good though.
"""

review5 = """
Listen. These shoes fit great, they look sporty, and they're perfect for my wide duck feet (don’t judge me). I bought the extra wide size and my toes are living their best life now.
But here’s the twist. Getting them on is like wrestling an octopus into a sock. The opening is a stiff little circle with all the give of a jealous ex. There’s no stretch. None. Zip. You will earn your right to wear these shoes through sheer determination, upper body strength, and possibly a shoehorn named Excalibur.
Once you're in though? Cloud city. My feet are happy, comfy, and supported. Totally worth the five minute wrestling match. Great price too. Just warm up first. Maybe stretch. Hydrate.
"""

Next we try our integrated pros and cons extraction pipeline.

In [38]:
pros, cons = ProsConsSummarizer(review3, summarizer=summarizer)
print(f"Pros:\n {pros} \nCons:\n {cons}")

Pros:
 - The frothing wand is good if you know how to use it.
- The machine is easy to use and clean. 
Cons:
 - The only bad thing so far is that the machines are noisy, but it's not terrible as most machines are going to be noisy.


## 8. Gradio Interface Implementation for User Interaction

As a last step, we build an interactive web app that analyzes product reviews to automatically identify pros and cons. It creates a clean interface where one can paste any review text, and it will extract and display the positive and negative points in separate, color-coded sections, complete with helpful statistics and example reviews.

In [39]:
# Create the Gradio interface
with gr.Blocks(
    title="Review Pros & Cons Summarizer",
    theme=gr.themes.Soft(),
    css="""
        .pros-card {
            border: 1px solid #10b981;
            border-radius: 10px;
            padding: 15px;
            background: linear-gradient(135deg, #f0fdf4 0%, #dcfce7 100%);
        }
        .cons-card {
            border: 1px solid #ef4444;
            border-radius: 10px;
            padding: 15px;
            background: linear-gradient(135deg, #fef2f2 0%, #fee2e2 100%);
        }
        .pros-header {
            color: #059669;
            font-weight: bold;
            margin-bottom: 10px;
        }
        .cons-header {
            color: #dc2626;
            font-weight: bold;
            margin-bottom: 10px;
        }
        .stats {
            background: #f8fafc;
            padding: 10px;
            border-radius: 8px;
            margin-top: 10px;
            font-size: 0.9em;
            border-left: 3px solid #94a3b8;
        }
        .container {
            max-width: 900px;
            margin: auto;
        }
        .output-text {
            font-size: 14px;
            line-height: 1.6;
        }
        .btn-primary {
            background: linear-gradient(135deg, #3b82f6 0%, #1d4ed8 100%);
            border: none;
        }
        .btn-secondary {
            background: white;
        }
    """
) as demo:

    # Header
    gr.Markdown("""
    # 🧩 Review Pros & Cons Summarizer
    **Automatically extract and summarize positive and negative aspects from reviews**

    *Enter any review text below to see the extracted pros and cons in separate sections.*
    """, elem_classes="header")

    # Main layout with two columns
    with gr.Row():
        # Left column - Input
        with gr.Column(scale=1):
            gr.Markdown("### 📝 **Input Review**")

            input_text = gr.Textbox(
                label="",
                lines=10,
                max_lines=30,
                placeholder='''Paste your product review, customer feedback, or any opinion text here...\n\nExample: "The camera quality is amazing but battery life could be better."''',
                show_label=False,
            )

            with gr.Row():
                analyze_btn = gr.Button(
                    "🔍 Analyze Review",
                    variant="primary",
                    scale=2,
                    elem_classes="btn-primary"
                )
                clear_btn = gr.Button(
                    "🗑️ Clear",
                    variant="secondary",
                    scale=1,
                    elem_classes="btn-secondary"
                )

            # Example section
            gr.Markdown("### 💡 **Quick Examples**")
            gr.Examples(
                examples=[[review1], [review2], [review3], [review4], [review5]],
                inputs=input_text,
                label="Click any example to load it",
                examples_per_page=3
            )

        # Right column - Output
        with gr.Column(scale=1):
            gr.Markdown("### 📊 **Analysis Results**")

            # Pros section
            with gr.Column(elem_classes="pros-card"):
                gr.Markdown("##### ✅ **PROS**", elem_classes="pros-header")
                pros_output = gr.Textbox(
                    label="",
                    lines=3,
                    max_lines=10,
                    interactive=False,
                    show_label=False,
                    elem_classes="output-text",
                    placeholder="Positive aspects will appear here...",
                    show_copy_button=True
                )

                # Optional stats for pros
                with gr.Accordion("📈 Details", open=False):
                    pros_stats = gr.Markdown("")

            # Cons section
            with gr.Column(elem_classes="cons-card"):
                gr.Markdown("##### ❌ **CONS**", elem_classes="cons-header")
                cons_output = gr.Textbox(
                    label="",
                    lines=3,
                    max_lines=10,
                    interactive=False,
                    show_label=False,
                    elem_classes="output-text",
                    placeholder="Negative aspects will appear here...",
                    show_copy_button=True
                )

                # Optional stats for cons
                with gr.Accordion("📈 Details", open=False):
                    cons_stats = gr.Markdown("")

    # Footer
    gr.Markdown("""
    ---
    *Note: The analysis splits text into sentences, classifies each as Pro or Con, and provides concise summaries.*
    """)

    # Helper function to calculate statistics
    def calculate_stats(pros_text, cons_text):
        """Calculate statistics for display"""
        pros_words = len(pros_text.split()) if pros_text!="No Pros found in review." else 0
        cons_words = len(cons_text.split()) if cons_text!="No Cons found in review." else 0

        pros_stats_text = f"""
        **Pros Statistics:**\n
        • Words: {pros_words}
        • Characters: {len(pros_text) if pros_text else 0}
        • Has content: {"✅ Yes" if pros_text!= "No Pros found in review." else "❌ No"}
        """

        cons_stats_text = f"""
        **Cons Statistics:**\n
        • Words: {cons_words}
        • Characters: {len(cons_text) if cons_text else 0}
        • Has content: {"✅ Yes" if cons_text!= "No Cons found in review." else "❌ No"}
        """

        return pros_stats_text, cons_stats_text

    # Main processing function
    def process_text(text):
        """Process the text and return all outputs"""
        # Call your function directly - it returns (pros, cons)
        pros, cons = ProsConsSummarizer(text)

        # Calculate statistics
        pros_stats_text, cons_stats_text = calculate_stats(pros, cons)

        # Return all outputs in order
        return pros, cons, pros_stats_text, cons_stats_text

    # Connect the analyze button
    analyze_btn.click(
        fn=process_text,
        inputs=input_text,
        outputs=[
            pros_output,     # Main pros display
            cons_output,     # Main cons display
            pros_stats,      # Pros statistics
            cons_stats      # Cons statistics
            ]
    )

    # Clear button functionality
    def clear_all():
        """Clear all inputs and outputs"""
        return ["", "", "", "", ""]

    clear_btn.click(
        fn=clear_all,
        inputs=None,
        outputs=[
            input_text,      # Input textbox
            pros_output,     # Pros output
            cons_output,     # Cons output
            pros_stats,      # Pros stats
            cons_stats      # Cons stats
            ]
    )

    # Auto-clear when input changes
    def clear_on_input_change(text):
         return ["", "", "", "", ""]

    input_text.change(
         fn=clear_on_input_change,
         inputs=input_text,
         outputs=[pros_output, cons_output, pros_stats, cons_stats]
    )

# Launch the application
if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0" if os.getenv("SPACE_ID") else "127.0.0.1",
        share=False,
        show_error=True,
        favicon_path=None,
        auth=None,
        auth_message=None,
        prevent_thread_lock=False,
        show_api=False,
        debug=False,
        quiet=True
    )

<IPython.core.display.Javascript object>