# Handling Imbalanced Classes with Augmentation

In this notebook, we address the issue of **class imbalance**, where some emotion classes (minority classes) contain significantly fewer sentences than others.

To balance the dataset:
- We first identify the **minority classes** by comparing the number of sentences across all classes.
- For each minority class, we calculate how many additional sentences are needed to match the size of the largest class.
- We then generate **augmented sentences** for these underrepresented classes using a controlled augmentation method, ensuring that the new data is meaningful and label-consistent.

This results in a class-balanced dataset, which is better suited for training robust and fair models.


Assume the dataset is a dictionary whose keys are class labels and whose values are lists of tokenized sentences (each sentence is a list of words):

1: [[w11,w12..],[w21,w22]...]
2: ...
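
The first two steps above can be sketched on a dictionary with this layout. This is a minimal illustration only: the toy `dataset` and the variable names are hypothetical and are not part of the notebook's pipeline.

```python
# Hypothetical toy dataset: class label -> list of tokenized sentences.
dataset = {
    1: [["i", "feel", "calm"], ["nothing", "happened"]],
    2: [["that", "was", "intense"]],
    3: [["we", "won"], ["what", "a", "day"], ["so", "loud"], ["run"]],
}

# Size of every class and of the largest one.
class_sizes = {cls: len(sentences) for cls, sentences in dataset.items()}
largest = max(class_sizes.values())

# Number of sentences each class still needs to match the largest class.
missing = {cls: largest - n for cls, n in class_sizes.items()}

# Minority classes are the ones with a positive deficit.
minority_classes = [cls for cls, gap in missing.items() if gap > 0]

print(class_sizes)
print(missing)
print(minority_classes)
```

Here class 3 is the largest, so only classes 1 and 2 would receive augmented sentences.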

In [None]:
!pip install --upgrade gensim
!pip install nltk
!pip install wordcloud
# scipy==1.12.1 does not exist on PyPI (pip fails with "Could not find a version that satisfies the requirement").
# 1.13.1 is the version pip resolved for gensim 4.3.3, which requires scipy<1.14.0,>=1.7.0.
!pip install scipy==1.13.1
!pip install numpy==1.26.4
# If a later import raises "numpy.dtype size changed, may indicate binary incompatibility",
# restart the runtime so the pinned numpy is the one that gets loaded.

In [None]:
import nltk
nltk.download('stopwords')

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

In [None]:
import os
import random
from collections import Counter, defaultdict
from itertools import chain
from typing import Any, Callable, Dict, Iterable, List, Tuple

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from google.colab import drive
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load W2V model

In [None]:
import gensim.downloader as api

# this will download (once) and load the Word2Vec Google News vectors
w2v_model = api.load("word2vec-google-news-300")

# sanity check: nearest neighbours of "king"
w2v_model.most_similar("king", topn=5)


[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474)]


# Insert Paths
Enter the full path to the CSV containing the sentences  
and the full path to the CSV containing the emotion ratings.


In [None]:
words_csv_path = "/content/drive/MyDrive/segmented_transcriptions_small.csv" #set the full path to the csv with the words here
rating_csv_path = "/content/drive/MyDrive/segmented_transcriptions_full.csv" #set the full path to emotion ratings here
emotion = "arousal"

In [None]:
df_words = pd.read_csv(words_csv_path, encoding="latin1")
df_rating = pd.read_csv(rating_csv_path)


In [None]:
df_rating.columns

Index(['sub', 'episode', 'segment', 'start', 'end', 'irritation', 'nostalgia',
       'pride', 'relief', 'sadness', 'satisfaction', 'surprise', 'sympathy',
       'triumph', 'arousal', 'valence', 'contempt', 'contentment',
       'embarrassment', 'empathic_pain', 'envy', 'gratitude', 'disgust',
       'disappointment', 'despair', 'admiration', 'amusement',
       'aesthetic_appreciation', 'anger', 'anxiety', 'awe', 'calmness',
       'confusion', 'excitement', 'fear', 'guilt', 'interest', 'joy',
       'pleasure', 'romance', 'craving', 'entrancement', 'hope', 'boredom',
       'adoration', 'jealousy', 'horror', 'sexual_desire', 'ep_inds'],
      dtype='object')

## Organizing Sentences for Augmentation

The code below builds a nested dictionary. Each top-level key is an emotion name, and its value is a sub-dictionary that maps each rating (e.g., 1 to 7) to the list of sentences that received that rating for that emotion. Each entry in the list is a tuple `(tokens, row_index)`: the sentence as a list of words, plus the index of its row in the merged DataFrame, which is used later to look up the sentence's emotion scores.
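
To make the shape of this structure concrete, here is a small sketch that builds it from three hypothetical rows. The `rows` list stands in for the merged DataFrame and `word_dicts_toy` is an illustrative name, not the notebook's actual variable.

```python
# Hypothetical rows standing in for the merged DataFrame:
# (row_index, sentence, arousal rating)
rows = [
    (0, "i found a giant snail", 6),
    (1, "it was fine", 3),
    (2, "my brother screamed", 6),
]

# emotion -> rating -> list of (tokens, row_index)
word_dicts_toy = {"arousal": {}}
for idx, sentence, rating in rows:
    word_dicts_toy["arousal"].setdefault(rating, []).append((sentence.split(), idx))

# All sentences rated 6 for arousal, each with its original row index.
for tokens, idx in word_dicts_toy["arousal"][6]:
    print(idx, tokens)
```

Keeping the row index next to the tokens is what later allows an augmented sentence to recover the scores of the sentence it came from.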


In [None]:
def merge_words_ratings(
    df_words: pd.DataFrame,
    df_rating: pd.DataFrame,
    emotions: Iterable[str],
) -> pd.DataFrame:
    """
    Return a DataFrame with one row per sentence (from df_words),
    carrying along each requested emotion score (from df_rating).

    Resulting columns: ["sub","episode","segment","label"] + list(emotions)
    """
    id_cols = ["sub", "episode", "segment"]
    # only keep id_cols + emotion columns in df_rating
    keep_cols = id_cols + list(emotions)
    df_trim = df_rating[keep_cols]
    # merge so that every sentence in df_words shows up, with its emotion scores (or NaN)
    df_merged = (
        df_words
        .merge(df_trim, on=id_cols, how="left", sort=False)
        # ensure label is present
        .loc[:, id_cols + ["label"] + list(emotions)]
    )
    return df_merged


def build_word_dicts(
    df_merged: pd.DataFrame,
    emotions: Iterable[str],
    tokenize: Callable[[str], List[str]] = lambda s: s.split(),
) -> Dict[str, Dict[Any, List[Tuple[List[str], int]]]]:
    """
    Given a merged DataFrame (one row per sentence, with columns:
            ["sub","episode","segment","label"] + list(emotions)
    and whose index is the original row‐index),
    build, for each emotion, a dict mapping class_value → list of
    (tokenized_sentence, original_row_index).

    Parameters
    ----------
    df_merged : pd.DataFrame
        Must contain:
          - "label" column (the sentence)
          - one column per emotion in `emotions`
        And its index should correspond to the original df_merged indices.
    emotions : iterable of str
        The names of the emotion‐columns to build dicts for.
    tokenize : Callable[[str], List[str]]
        How to split a sentence into words.

    Returns
    -------
    Dict[str, Dict[Any, List[Tuple[List[str], int]]]]
        A mapping from each emotion name to its word_dict. Each word_dict:
            class_value → [ ( [w1, w2, …], row_idx ), … ]
    """
    # prepare one defaultdict(list) per emotion
    out: Dict[str, Dict[Any, List[Tuple[List[str], int]]]] = {
        emo: defaultdict(list) for emo in emotions
    }

    # iterate once over every row (gets you row.Index as the DF index)
    for row in df_merged.itertuples(index=True):
        tokens = tokenize(row.label)  # split the sentence
        idx    = row.Index           # the original merged‐df index

        for emo in emotions:
            cls = getattr(row, emo)  # fetch the value in that emotion‐column
            out[emo][cls].append((tokens, idx))

    # turn defaultdicts into normal dicts
    return {emo: dict(d) for emo, d in out.items()}

# 1) merge the sentences with their emotion ratings
merged = merge_words_ratings(df_words, df_rating, ["arousal","sadness","fear","joy","disgust","anger","surprise"])

# 2) then build the per-emotion word dict
word_dicts = build_word_dicts(merged, ["arousal"])

word_dict_arousal = word_dicts["arousal"]

## Augmentation part

The code below takes the word dictionary generated in the previous step and computes a LIC score for each (class, word) pair: the z-score of the word's term frequency in that class relative to its term frequencies in the other classes, multiplied by an IDF weight. It also determines, for each class, how many additional sentences are needed to balance the dataset, i.e., so that every class contains as many sentences as the largest class.
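
Written out, the score computed by `compute_lic` appears to be the following. This is a reading of the code rather than a published definition. Here $n_{c,w}$ is the count of word $w$ in class $c$, $C$ is the number of classes, $\mathrm{df}(w)$ is the number of classes in which $w$ occurs, and $\mu_{\neg c}(w)$ and $\sigma_{\neg c}(w)$ are the mean and sample standard deviation of $\mathrm{tf}(c', w)$ over all classes $c' \neq c$.

```latex
\begin{aligned}
\mathrm{tf}(c,w)  &= \frac{n_{c,w}}{\sum_{w'} n_{c,w'}} \\
\mathrm{idf}(w)   &= \ln\!\left(\frac{C}{1 + \mathrm{df}(w)}\right) + 1 \\
\mathrm{LIC}(c,w) &= \frac{\mathrm{tf}(c,w) - \mu_{\neg c}(w)}{\sigma_{\neg c}(w)} \cdot \mathrm{idf}(w)
\end{aligned}
```

When $\sigma_{\neg c}(w) = 0$ the score is undefined and the pair is dropped, so a later lookup falls back to the default of 0.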

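Next, we iterate over each class and generate synthetic sentences to fill the gap. For each new sentence, we pick a random existing sentence from the class, select its non-stopword with the highest positive LIC score (skipping sentence positions that were already replaced), and substitute the most similar word according to the pretrained Word2Vec model. Words missing from the Word2Vec vocabulary are skipped, and generation stops after `10 × missing` attempts, so a class can remain slightly short of the target.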
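
The replacement step can be illustrated without loading the full Word2Vec model. In the sketch below, `toy_lic`, `toy_neighbours`, `toy_stop_words` and `augment_once` are hypothetical stand-ins: the neighbour table plays the role of `w2v_model.most_similar`, and the scores are made up for illustration.

```python
# Hypothetical stand-ins for the LIC table, the Word2Vec neighbours,
# and the stopword list used in the real pipeline.
toy_lic = {(6, "giant"): 63.1, (6, "snail"): 12.0, (6, "found"): 9.9}
toy_neighbours = {"giant": ["behemoth", "giants"], "snail": ["slug"]}
toy_stop_words = {"i", "a", "the"}


def augment_once(sentence, cls):
    """Return a copy of `sentence` with its highest-LIC content word replaced."""
    candidates = [
        (pos, w, toy_lic.get((cls, w), 0.0))
        for pos, w in enumerate(sentence)
        if w.lower() not in toy_stop_words
    ]
    candidates.sort(key=lambda item: -item[2])  # highest score first
    for pos, word, score in candidates:
        if score <= 0:
            continue  # word is not indicative of this class
        neighbours = [n for n in toy_neighbours.get(word, []) if n != word]
        if not neighbours:
            continue  # no usable neighbour, try the next candidate
        new_sentence = list(sentence)
        new_sentence[pos] = neighbours[0]
        return new_sentence
    return None  # nothing could be replaced


print(augment_once(["i", "found", "a", "giant", "snail"], 6))
```

Only one word is swapped per generated sentence, which keeps the new sentence close to its source and makes it reasonable to reuse the source's label.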

By the end of this process, the word dictionary is extended with newly augmented sentences, resulting in a class-balanced dataset ready for training.

In [None]:
stop_words = set(stopwords.words("english"))


In [None]:
from typing import Callable, List, Dict
stop_words = set(stopwords.words("english"))

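def compute_lic(word_dict: Dict) -> Dict:
    """
    Return {(class, word): LIC score}, where the score is the z-score of the
    word's term frequency in the class versus the other classes, times an IDF weight.
    Pairs whose z-score is undefined (no spread across the other classes) are omitted.
    """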
    def class_word_counts(data):
        out = {}
        for cls, tuples in data.items():
            flattened = Counter(word for sub in tuples for word in sub[0])
            out.update({(cls, w): c for w, c in flattened.items()})
        return out

    c_w_c = class_word_counts(word_dict)
    lic_df = (
        pd.Series(c_w_c, name="count")
        .rename_axis(index=["class", "word"])
        .reset_index()
    )
    lic_df["tf"] = lic_df["count"] / lic_df.groupby("class")["count"].transform("sum")

    pivot = lic_df.pivot_table(index="word", columns="class", values="tf", fill_value=0.0)
    C = len(word_dict)
    df_w = (pivot > 0).sum(axis=1)
    idf = np.log(C / (1 + df_w)) + 1

    lic_records = []
    for c in pivot.columns:
        tf_c = pivot[c]
        tf_not_c = pivot.drop(columns=c)
        mu_not_c = tf_not_c.mean(axis=1)
        sigma_not_c = tf_not_c.std(axis=1).replace(0, np.nan)

        z = (tf_c - mu_not_c) / sigma_not_c
        licc = z * idf
        lic_records.append(licc.rename(c))

    lic_df = pd.concat(lic_records, axis=1).stack(dropna=True)
    lic_series = lic_df.swaplevel(0, 1).sort_index()
    lic_series.index.names = ["class", "word"]
    return lic_series.to_dict()


def count_missing_samples(word_dict: Dict) -> Dict:
    """
    Count how many samples each class is missing relative to the largest class.
    """
    max_size = max(len(samples) for samples in word_dict.values())
    return {
        cls: max_size - len(samples)
        for cls, samples in word_dict.items()
    }


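def generate_samples_for_class(word_dict, key, missing, LIC, w2v_model):
    """
    Generate up to `missing` augmented samples for class `key` and append them
    to word_dict[key] in place. Each new sample keeps the row index of the
    sentence it was derived from.
    """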
    existing = word_dict.get(key, [])
    n_exist = len(existing)
    if n_exist == 0 or missing <= 0:
        return []

    new_samples = []
    used_pairs = set()
    attempts = 0
    max_attempts = missing * 10

    while len(new_samples) < missing and attempts < max_attempts:
        attempts += 1
        idx = random.randrange(n_exist)
        sentence, df_idx  = existing[idx]

        lic_list = [(pos, w, LIC.get((key, w), 0.0)) for pos, w in enumerate(sentence) if w.lower() not in stop_words]
        lic_list.sort(key=lambda x: -x[2])
        for pos, word, score in lic_list:
            if score <= 0 or (idx, pos) in used_pairs:
                continue
            try:
                sims = w2v_model.most_similar(positive=[word], topn=10)
            except KeyError:
                continue
            replacement = next((w for w, _ in sims if w != word), None)
            if not replacement:
                continue

            new_sent = sentence.copy()
            new_sent[pos] = replacement
            new_samples.append((new_sent,df_idx))
            used_pairs.add((idx, pos))
            break

    word_dict[key].extend(new_samples)
    return new_samples


def generate_samples(word_dict, LIC, w2v_model):
    missing_counts = count_missing_samples(word_dict)
    for cls, missing in missing_counts.items():
        generate_samples_for_class(word_dict, cls, missing, LIC, w2v_model)
    return word_dict


def augment_all_emotions(word_dicts, emotions, w2v_model) -> Dict[str, dict]:
    augmented = {}
    for emo in emotions:
        word_dict = word_dicts[emo]
        lic = compute_lic(word_dict)
        balanced = generate_samples(word_dict, lic, w2v_model)
        augmented[emo] = balanced
    return augmented

In [None]:
emotions = ["arousal"]
augmented_word_dicts = augment_all_emotions(word_dicts, emotions, w2v_model)



🔧 Augmenting for emotion: arousal
[(30, 'giant', 63.12511946764411), (31, 'giant', 63.12511946764411), (6, 'brother', 31.190837938241714), (25, 'hands', 20.120287008048873), (11, 'snail', 12.030269020600269), (32, 'snail', 12.030269020600269), (9, 'found', 9.90131691864011), (4, 'time', 7.400576964778002), (15, 'honestly', 0.0), (19, 'size', 0.0), (22, 'four', 0.0), (26, 'like', -1.0797454835905596)]


  lic_df = pd.concat(lic_records, axis=1).stack(dropna=True)


[(30, 'giant', 63.12511946764411), (31, 'giant', 63.12511946764411), (6, 'brother', 31.190837938241714), (25, 'hands', 20.120287008048873), (11, 'snail', 12.030269020600269), (32, 'snail', 12.030269020600269), (9, 'found', 9.90131691864011), (3, 'one', 9.090287546464811), (4, 'time', 7.400576964778002), (15, 'honestly', 0.0), (19, 'size', 0.0), (22, 'four', 0.0), (26, 'like', -1.0797454835905596)]
[(30, 'giant', 63.12511946764411), (6, 'brother', 31.190837938241714), (25, 'hands', 20.120287008048873), (11, 'snail', 12.030269020600269), (32, 'snail', 12.030269020600269), (9, 'found', 9.90131691864011), (3, 'one', 9.090287546464811), (4, 'time', 7.400576964778002), (15, 'honestly', 0.0), (19, 'size', 0.0), (22, 'four', 0.0), (31, 'behemoth', 0.0), (26, 'like', -1.0797454835905596)]
[(30, 'giant', 63.12511946764411), (31, 'giant', 63.12511946764411), (25, 'hands', 20.120287008048873), (11, 'snail', 12.030269020600269), (32, 'snail', 12.030269020600269), (9, 'found', 9.90131691864011), (3,

## Restore emotion scores for each sentence

In the final step, we restore the **emotion scores** for each sentence — including both original and augmented samples.

- **Original sentences** keep their original emotion scores as they appeared in the dataset.
- **Augmented sentences** inherit the same emotion scores as the sentence from which they were generated.

This ensures that all samples, whether real or synthetic, are labeled consistently and meaningfully for training.
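
The inheritance rule can be shown on toy data. This is a sketch under stated assumptions: `scores_by_row` is a hypothetical stand-in for the merged DataFrame, and the third sample plays the role of an augmented copy of row 0.

```python
# Hypothetical score table keyed by original row index.
scores_by_row = {
    0: {"arousal": 6, "joy": 2},
    1: {"arousal": 3, "joy": 5},
}

# Each sample is (tokens, original_row_index).
# The last sample is an augmented copy of row 0 with one word replaced.
samples = [
    (["i", "found", "a", "giant", "snail"], 0),
    (["it", "was", "fine"], 1),
    (["i", "found", "a", "behemoth", "snail"], 0),
]

# Every sample looks up its scores through the row index it carries.
restored = [
    {"sentence": " ".join(tokens), **scores_by_row[idx]}
    for tokens, idx in samples
]

for row in restored:
    print(row)
```

Because the augmented sample carries the same row index as its source, it ends up with identical scores for every emotion, not only the one being balanced.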


In [None]:
def augmented_dict_to_df(
    augmented_wd: Dict[Any, List[Tuple[List[str], int]]],
    df_merged: pd.DataFrame,
    emotions: List[str]
) -> pd.DataFrame:
    """
    Build a DataFrame with columns:
       - 'sentence' : the augmented sentence (joined tokens)
       - one column per emotion in `emotions`, filled from df_merged

    Params
    ------
    augmented_wd : Dict[class_value, List[(tokens, df_idx)]]
      Your word_dict for a single emotion, where each sample is (tokens, original_row_index).
    df_merged    : pd.DataFrame
      The merged DataFrame that has your emotion‐columns and whose index is the original row‐indices.
    emotions     : List of column names in df_merged you want to pull (e.g. ['arousal','valence','interest']).

    Returns
    -------
    pd.DataFrame with columns ['sentence'] + emotions.
    """
    rows = []
    for cls, samples in augmented_wd.items():
        for tokens, df_idx in samples:
            sentence = " ".join(tokens)
            # grab the original scores for all requested emotions
            scores = df_merged.loc[df_idx, emotions].to_dict()
            rows.append({"sentence": sentence, **scores})

    return pd.DataFrame(rows)


# --------------------
# Example usage:

# say you only care about augmenting arousal:
wd_arousal = augmented_word_dicts["arousal"]
# `merged` holds 'label' plus the seven emotion columns requested in merge_words_ratings above
df_aug = augmented_dict_to_df(
    wd_arousal,
    merged,
    emotions=["arousal","sadness","fear","joy","disgust","anger","surprise"]
)



## Export
Export the original and augmented sentences with their emotion scores to a CSV file for future analysis.

In [None]:
def export_word_dicts(word_dicts: dict, out_dir: str = "./exports"):
    """
    Save each emotion’s word-dictionary to <out_dir>/<emotion>.csv
    with columns: class, sentence
    """
    os.makedirs(out_dir, exist_ok=True)

    for emotion, wdict in word_dicts.items():
        rows = []
        for cls, samples in wdict.items():
            # each sample is a (tokens, original_row_index) tuple
            for tokens, _df_idx in samples:
                sentence_str = " ".join(tokens)
                rows.append({"class": cls, "sentence": sentence_str})

        df = pd.DataFrame(rows)
        path = os.path.join(out_dir, f"{emotion}.csv")
        df.to_csv(path, index=False, encoding="utf-8")

export_word_dicts(augmented_word_dicts,
                  out_dir="/content/drive/MyDrive/emotions_csv")

# Re-Balance + Proof about the generalization effect

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def cap_augmented_in_3to5_rectangle(df, cap_per_class=2000, seed=42):
    """
    Downsample ONLY augmented rows within the rectangle:
      is_augmented == True AND valence in [2,5] AND arousal in [3,6]
    so that final counts per class (valence=2..5 and arousal=3..6) are <= cap_per_class
    wherever enough augmented rows exist to remove.
    Two passes: (1) cap by valence, (2) cap by arousal.
    Note: the ranges above follow the code; the function name still says "3to5".
    """
    rng = np.random.RandomState(seed)
    df = df.copy()

    # Ensure numeric (and drop NAs in these keys if exist)
    df['valence'] = pd.to_numeric(df['valence'], errors='coerce')
    df['arousal'] = pd.to_numeric(df['arousal'], errors='coerce')
    df = df.dropna(subset=['valence', 'arousal'])
    df['valence'] = df['valence'].astype(int)
    df['arousal'] = df['arousal'].astype(int)

    def rect_mask(d):
        return (
            d['is_augmented'].astype(bool) &
            d['valence'].between(2, 5) &
            d['arousal'].between(3, 6)
        )

    # ---- Pass A: cap by VALENCE for classes 2,3,4,5 ----
    for v in (2, 3, 4, 5):
        total_v = (df['valence'] == v).sum()
        excess = max(0, total_v - cap_per_class)
        if excess > 0:
            candidates = df.index[rect_mask(df) & (df['valence'] == v)]
            if len(candidates) > 0:
                n_remove = min(excess, len(candidates))
                drop_idx = rng.choice(candidates, size=n_remove, replace=False)
                df = df.drop(drop_idx)

    # ---- Pass B: cap by AROUSAL for classes 3,4,5,6 ----
    for a in (3, 4, 5, 6):
        total_a = (df['arousal'] == a).sum()
        excess = max(0, total_a - cap_per_class)
        if excess > 0:
            candidates = df.index[rect_mask(df) & (df['arousal'] == a)]
            if len(candidates) > 0:
                n_remove = min(excess, len(candidates))
                drop_idx = rng.choice(candidates, size=n_remove, replace=False)
                df = df.drop(drop_idx)

    return df

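# === usage ===
# NOTE: df_balanced is not defined in this notebook; it is assumed to hold the original
# and augmented rows with 'valence', 'arousal', 'sub' and a boolean 'is_augmented' column.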
df_capped = cap_augmented_in_3to5_rectangle(df_balanced, cap_per_class=2000, seed=42)
levels = np.arange(1, 8)

bef_val = (df_balanced.loc[df_balanced['is_augmented'] == False, 'valence']
           .astype(int).value_counts().reindex(levels, fill_value=0))
bef_aro = (df_balanced.loc[df_balanced['is_augmented'] == False, 'arousal']
           .astype(int).value_counts().reindex(levels, fill_value=0))

aft_val = df_capped['valence'].astype(int).value_counts().reindex(levels, fill_value=0)
aft_aro = df_capped['arousal'].astype(int).value_counts().reindex(levels, fill_value=0)

aug_val = np.maximum(aft_val - bef_val, 0)
aug_aro = np.maximum(aft_aro - bef_aro, 0)
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# --- Arousal subplot ---
axes[0].bar(levels, bef_aro, color="#88c999", alpha=0.7, label="Before (no_aug)")
axes[0].bar(levels, aug_aro, bottom=bef_aro, color="#a6cee3", alpha=0.99, label="After (aug)")
axes[0].set_title("Arousal Distribution")
axes[0].set_xlabel("Emotion Level")
axes[0].set_ylabel("Count")
axes[0].legend()

# --- Valence subplot ---
axes[1].bar(levels, bef_val, color="#88c999", alpha=0.7, label="Before (no_aug)")
axes[1].bar(levels, aug_val, bottom=bef_val, color="#a6cee3", alpha=0.99, label="After (aug)")
axes[1].set_title("Valence Distribution")
axes[1].set_xlabel("Emotion Level")
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
balance_count = df_balanced[df_balanced['is_augmented']==False]['sub'].value_counts()
capped_count = df_capped['sub'].value_counts()
# create one df with both value counts (df_capped is the re-balanced set)
df_count = pd.DataFrame({'original_count': balance_count, 'after_rebalance': capped_count})
df_count

In [None]:
after_augmentation = df_balanced['sub'].value_counts()
original_count = df_balanced[df_balanced['is_augmented']==False]['sub'].value_counts()
capped_count = df_capped['sub'].value_counts()

df_count = pd.DataFrame({'original_count': original_count,
                         'after_augmentation_count': after_augmentation,
                         'after_rebalance': capped_count})
df_count