# Data Balancing Through Synonym-Based Upsampling

## Task: Address Class Imbalance in Tourism Sentiment Dataset

**Team ScourgifyData** | December 2025

## Description

This notebook addresses the class imbalance problem in our tourism sentiment dataset by implementing intelligent upsampling using synonym-based text augmentation. The goal is to achieve perfect balance across all sentiment classes (positive, neutral, negative) to improve model training and prevent bias toward the majority class.

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Introduction & Background](#introduction--background)
3. [Imports](#imports)
4. [Data to Explore](#data-to-explore)
5. [Analysis: Class Imbalance Identification](#analysis-class-imbalance-identification)
6. [Analysis: Upsampling Implementation](#analysis-upsampling-implementation)
7. [Analysis: Verify Balanced Dataset](#analysis-verify-balanced-dataset)
8. [Save Balanced Dataset](#save-balanced-dataset)
9. [Conclusion](#conclusion)
10. [References](#references)

## Executive Summary

**Key Results:**
- Successfully identified severe class imbalance: positive reviews dominated the dataset while neutral and negative classes were significantly underrepresented
- Implemented synonym-based text augmentation using NLPAug and WordNet to generate realistic synthetic samples
- Achieved perfect class balance: all three sentiment classes now have equal representation
- Created `model_balanced.csv` with balanced data ready for improved model training

**Key Conclusion:** The upsampling process successfully addresses the class imbalance problem, enabling fairer model training and preventing bias toward the majority class. The synonym replacement approach maintains semantic meaning while adding linguistic diversity.

## Introduction & Background

<a id='introduction--background'></a>

**Context:** In our baseline sentiment analysis model, we observed that the tourism review dataset exhibited severe class imbalance. This imbalance is a common problem in sentiment analysis where positive reviews typically outnumber neutral and negative ones.

**Business Impact:** Class imbalance causes machine learning models to develop a strong bias toward the majority class, resulting in:
- Poor performance on minority classes (neutral and negative sentiments)
- Failure to identify critical customer feedback
- Missed opportunities to address service issues

**Task Objective:** Implement an intelligent upsampling strategy that:
- Generates synthetic samples for minority classes using synonym-based text augmentation
- Maintains semantic meaning while adding linguistic diversity
- Achieves perfect balance across all three sentiment classes
- Prepares a balanced dataset for improved model training

## Imports

<a id='imports'></a>

This section includes all necessary library imports and initial setup for the upsampling process.

In [12]:
import pandas as pd

model = pd.read_csv("C:/Users/DELL 7540/Desktop/ScourgifyData/notebooks/model.csv")

print(model.head())
print(model.info())
print(model['sentiment'].value_counts())


   stars                                               text sentiment
0      3  the fish is pretty fresh but the place is run ...   neutral
1      2  First our food came pretty late it took a litt...  negative
2      5  One of my favorite places to go to for everyth...  positive
3      5  We've celebrated birthdays, our anniversary, a...  positive
4      5  Lily did a great job. She was really careful t...  positive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6974070 entries, 0 to 6974069
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   stars      int64 
 1   text       object
 2   sentiment  object
dtypes: int64(1), object(2)
memory usage: 159.6+ MB
None
sentiment
positive    4673829
negative    1609463
neutral      690778
Name: count, dtype: int64


In [13]:
pip install nltk


Note: you may need to restart the kernel to use updated packages.


In [14]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # optional but recommended for synonyms in multiple languages


[nltk_data] Downloading package wordnet to C:\Users\DELL
[nltk_data]     7540\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\DELL
[nltk_data]     7540\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [15]:
!pip install nlpaug




In [16]:
pip install --upgrade nlpaug


Note: you may need to restart the kernel to use updated packages.


In [17]:
pip install tqdm


Note: you may need to restart the kernel to use updated packages.


## Data to Explore

<a id='data-to-explore'></a>

**Dataset:** `model.csv`
- **Source:** Output from baseline model notebook (clean_dataset_joined.ipynb)
- **Description:** Merged tourism reviews dataset containing review text, star ratings, and sentiment labels
- **Features:**
  - `text`: Review text content
  - `stars`: Star rating (1-5)
  - `sentiment`: Sentiment class (positive, neutral, negative)
- **Update Frequency:** Static dataset created from Yelp Academic Dataset
- **Issue:** Contains class imbalance that needs to be addressed before model training

## Analysis: Class Imbalance Identification

<a id='analysis-class-imbalance-identification'></a>

Load the dataset and examine the distribution of sentiment classes to identify the imbalance problem.

In [18]:
import pandas as pd
import nlpaug.augmenter.word as naw
import nltk
from tqdm import tqdm

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Split classes
df_pos = model[model['sentiment'] == 'positive']
df_neg = model[model['sentiment'] == 'negative']
df_neu = model[model['sentiment'] == 'neutral']

# Synonym augmenter
aug = naw.SynonymAug(aug_src='wordnet', aug_p=0.1)

target_size = len(df_pos)   # largest class

def upsample(df, target_size):
    rows = []
    original_count = len(df)

    # Copy original rows
    rows.extend(df.to_dict('records'))

    idx = 0

    # Number of augmented samples needed
    needed = target_size - original_count

    print(f"Generating {needed:,} new samples...")

    # tqdm progress bar
    for _ in tqdm(range(needed), desc="Augmenting", unit="sample"):
        row = df.iloc[idx % original_count]
        new_text = aug.augment(row['text'])

        rows.append({
            'stars': row['stars'],
            'text': new_text,
            'sentiment': row['sentiment']
        })

        idx += 1

    return pd.DataFrame(rows)


print("Upsampling negative...")
df_neg_up = upsample(df_neg, target_size)

print("Upsampling neutral...")
df_neu_up = upsample(df_neu, target_size)

# Combine all
model_balanced = pd.concat([df_pos, df_neg_up, df_neu_up], ignore_index=True)
model_balanced = model_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(model_balanced['sentiment'].value_counts())


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL 7540\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to C:\Users\DELL
[nltk_data]     7540\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\DELL
[nltk_data]     7540\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Upsampling negative...
Generating 3,064,366 new samples...


Augmenting: 100%|██████████| 3064366/3064366 [5:44:05<00:00, 148.43sample/s]  


Upsampling neutral...
Generating 3,983,051 new samples...


Augmenting: 100%|██████████| 3983051/3983051 [6:09:19<00:00, 179.74sample/s]  


sentiment
negative    4673829
neutral     4673829
positive    4673829
Name: count, dtype: int64


In [19]:
print("Dataset information")
model.info()

Dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6974070 entries, 0 to 6974069
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   stars      int64 
 1   text       object
 2   sentiment  object
dtypes: int64(1), object(2)
memory usage: 159.6+ MB


## Analysis: Upsampling Implementation

<a id='analysis-upsampling-implementation'></a>

Implement synonym-based text augmentation to upsample minority classes (negative and neutral) to match the majority class size (positive).

In [20]:
model

Unnamed: 0,stars,text,sentiment
0,3,the fish is pretty fresh but the place is run ...,neutral
1,2,First our food came pretty late it took a litt...,negative
2,5,One of my favorite places to go to for everyth...,positive
3,5,"We've celebrated birthdays, our anniversary, a...",positive
4,5,Lily did a great job. She was really careful t...,positive
...,...,...,...
6974065,5,"I'm from Texas. That being said, I never thoug...",positive
6974066,4,May I never eat another doughnut again.\n\nSom...,positive
6974067,5,Great pizza. Great atmosphere..best crust in ...,positive
6974068,5,Great place. Have been going here for over twe...,positive


In [21]:
model_balanced

Unnamed: 0,stars,text,sentiment
0,2,This place is a Tourist Trap with two capital ...,negative
1,1,Totally disappointed. Service slow. Only 3 oth...,negative
2,3,"[It isn ' t bad, the wait staff was clearly hi...",neutral
3,2,"[First time here, 2: 00 on a Saturday afternoo...",negative
4,3,I originally wanted to order the lobster quesa...,neutral
...,...,...,...
14021482,3,"[Equal born and raised in the South, I often g...",neutral
14021483,5,Great store and amazing owners! I wanted to ge...,positive
14021484,3,[I first stayed at this hotel in 2004 and just...,neutral
14021485,1,[If you ' re gonna come here might as substant...,negative


## Analysis: Verify Balanced Dataset

<a id='analysis-verify-balanced-dataset'></a>

Verify that the upsampling process successfully created a balanced dataset with equal representation across all sentiment classes.

## Save Balanced Dataset

<a id='save-balanced-dataset'></a>

Export the balanced dataset for use in model training.

In [41]:
model_balanced.to_csv("model_balanced.csv", index=False)

## Conclusion

<a id='conclusion'></a>

**Summary of Results:**
- Successfully identified and quantified class imbalance in the original dataset
- Implemented synonym-based text augmentation using NLPAug library with WordNet
- Generated synthetic samples for minority classes (negative and neutral) by replacing ~10% of words with synonyms
- Achieved perfect class balance: all three sentiment classes now have equal representation
- Created `model_balanced.csv` with the balanced dataset

**Key Achievements:**
1. **Prevented Model Bias:** Equal class representation ensures the model won't favor the majority class
2. **Maintained Semantic Integrity:** Synonym replacement preserves the meaning of reviews
3. **Added Linguistic Diversity:** Augmented samples introduce variations that improve model generalization
4. **Ready for Training:** The balanced dataset is prepared for the next phase (deep learning model training)

**Next Steps:**
- Split the balanced dataset into training (80%) and validation (20%) sets
- Proceed to model training with Bidirectional LSTM architecture
- Evaluate model performance on balanced vs. imbalanced data

## References

<a id='references'></a>

**Internal Documentation:**
- Baseline Model Notebook: `notebooks/1st model/clean_dataset_joined.ipynb`
- Project README: `README.md`
- Data Processing Documentation: `data/README.md`

**External Resources:**
- NLPAug Documentation: https://nlpaug.readthedocs.io/
- WordNet Lexical Database: https://wordnet.princeton.edu/
- NLTK Documentation: https://www.nltk.org/
- Yelp Academic Dataset: https://www.yelp.com/dataset

**Related Literature:**
- Data Augmentation for Text Classification
- Handling Class Imbalance in Machine Learning
- Synonym-based Text Augmentation Techniques