# Data Preprocessing and Undersampling

This notebook demonstrates the process of reading a CSV file, preprocessing its headers, filtering the data, and applying undersampling to balance the dataset. This is particularly useful in machine learning tasks where class imbalance could bias the model training.

## Features:
- Preprocess CSV headers to remove unwanted characters.
- Read the CSV data into a pandas DataFrame.
- Filter the data by removing specific columns and rows with missing values.
- Apply undersampling to balance the dataset.
- Visualize the distribution of classes before and after undersampling.
- Save the processed data to a new CSV file.

## Setup and Imports

Before running this notebook, install the packages it imports: `pandas`, `imbalanced-learn` (imported as `imblearn`), and `matplotlib`.
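
As a quick sanity check before running the cells, something like the following sketch can report which of the packages are missing. The `REQUIRED` mapping is a hypothetical helper for illustration, not part of the notebook.

```python
from importlib.util import find_spec

# Import names used by this notebook, mapped to their pip package names.
# (Hypothetical helper mapping, introduced only for this check.)
REQUIRED = {
    "pandas": "pandas",
    "imblearn": "imbalanced-learn",
    "matplotlib": "matplotlib",
}

# find_spec returns None when a top-level module cannot be found.
missing = [pip_name for module, pip_name in REQUIRED.items()
           if find_spec(module) is None]

if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```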

In [None]:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt

## Configuration

Specify the path to the CSV file and other relevant settings here.


In [None]:
CSV_FILENAME = 'annotated.csv'   # Adjust the path to your CSV file

## Data Loading and Preprocessing

Load the CSV data, then clean up its columns and filter out the rows that cannot be used.

In [None]:
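try:
    csv = pd.read_csv(CSV_FILENAME, lineterminator='\n')
except FileNotFoundError:
    # Stop with a clear message; every later cell depends on `csv`.
    # SystemExit halts the cell without shutting down the kernel.
    raise SystemExit(f"ERROR: File not found: {CSV_FILENAME}")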

In [None]:
csv

In [None]:
# Drop submission-related columns
csv = csv.drop(columns=[
    'submission_name',
    'submission_text',
    '\r',   # CRLF (Windows) line endings can leave a stray '\r'
            # that is parsed as its own column; drop it if present
], errors='ignore')

csv
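
To see where a `'\r'` column can come from, here is a minimal sketch using a made-up in-memory file with CRLF line endings and a trailing comma on each line. The sample text is an assumption for illustration only; the real `annotated.csv` may look different.

```python
import io
import pandas as pd

# Hypothetical CRLF-terminated data with a trailing comma on every line
raw_text = "body,label,\r\nhello,1,\r\nworld,0,\r\n"

# With lineterminator='\n', the '\r' before each newline is treated as
# ordinary data, so it should end up in a column of its own
raw = pd.read_csv(io.StringIO(raw_text), lineterminator='\n')
print(list(raw.columns))

# errors='ignore' makes the drop a no-op when the column is absent
cleaned = raw.drop(columns=['\r'], errors='ignore')
print(list(cleaned.columns))
```

Because of `errors='ignore'`, the same cell is safe to run on files with plain `\n` line endings.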

In [None]:
# Keep only the rows whose label is 0 or 1.
# Labels are compared as strings so the filter also works when
# pandas has parsed the column as integers.
csv = csv[csv['label'].astype(str).isin(['0', '1'])]

csv
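
The label filter depends on how pandas has typed the `label` column. The sketch below, on made-up data, shows a string-based filter and why comparing an integer column against the string `'0'` would silently match nothing.

```python
import pandas as pd

# Hypothetical data: labels are strings because of a stray value
df = pd.DataFrame({
    'body': ['a', 'b', 'c', 'd'],
    'label': ['0', '1', 'maybe', '1'],
})

# Keep only rows labelled '0' or '1', preserving the original row order
filtered = df[df['label'].isin(['0', '1'])]
print(filtered)

# If every label were numeric, pandas would infer an integer column,
# and a comparison against the string '0' would match no rows
ints = pd.Series([0, 1, 1])
print((ints == '0').any())
```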

In [None]:
# Drop all columns with no header, which pandas names 'Unnamed: N'.
# This avoids errors caused by stray data in otherwise unused columns.
# Keep only the columns whose header does not begin with 'Unnamed'.
csv = csv.loc[:, ~csv.columns.str.startswith('Unnamed')]

csv
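
As a small illustration of the headerless-column case, the following sketch reads a made-up file whose third column has data but no header, then applies the same kind of filter.

```python
import io
import pandas as pd

# Hypothetical file whose third column has data but no header
text = "body,label,\nhello,1,stray\nworld,0,\n"
df = pd.read_csv(io.StringIO(text))
print(list(df.columns))

# Keep only the columns whose header does not start with 'Unnamed'
named = df.loc[:, ~df.columns.str.startswith('Unnamed')]
print(list(named.columns))
```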

In [None]:
# Drop all rows with no text
# (a list for `subset` works across pandas versions)
csv = csv.dropna(subset=['body'])

In [None]:
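# Extract X (text) and y (labels) from the csv so the data can be
# undersampled. This assumes the text is the first remaining column
# and the label is the second.
X = csv.iloc[:, 0]
y = csv.iloc[:, 1]

# Only the last expression of a cell is shown automatically,
# so display both explicitly
display(X)
display(y)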

In [None]:
# Reshape X into a 2D array to be compatible with the undersampler
X = X.values.reshape(-1, 1)

X
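
The reshape step can be checked in isolation. This sketch uses a made-up three-item Series to show the shape before and after, and that `flatten()` reverses it.

```python
import pandas as pd

# Hypothetical text column
texts = pd.Series(['first post', 'second post', 'third post'])

# One row per sample, one column per feature (here a single text column)
X_2d = texts.values.reshape(-1, 1)
print(X_2d.shape)

# flatten() undoes the reshape and returns a 1D array again
X_1d = X_2d.flatten()
print(X_1d.shape)
```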

## Undersampling

Apply undersampling to balance the dataset, focusing on the distribution of the 'label' column.
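
Conceptually, random undersampling keeps a random subset of each larger class so that every class ends up the size of the smallest one. The sketch below illustrates that idea with pandas alone on made-up data. It is not how `RandomUnderSampler` is implemented, and it will not necessarily select the same rows.

```python
import pandas as pd

# Hypothetical imbalanced data: four rows of class '0', two of class '1'
df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'label': ['0', '0', '0', '0', '1', '1'],
})

# Size of the smallest class; every class is cut down to this size
n_min = int(df['label'].value_counts().min())

# Draw n_min rows at random (without replacement) from each class
balanced = df.groupby('label').sample(n=n_min, random_state=42)

print(df['label'].value_counts().to_dict())
print(balanced['label'].value_counts().to_dict())
```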

In [None]:
sampler = RandomUnderSampler(random_state=42)

In [None]:
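# Undersample the data
try:
    X_resampled, y_resampled = sampler.fit_resample(X, y)
except ValueError as err:
    # Raised, for example, when there is not enough data to resample.
    # SystemExit halts the cell without shutting down the kernel.
    raise SystemExit(f"ERROR: Insufficient data ({err})")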

In [None]:
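# Only the last expression of a cell is shown automatically,
# so display both explicitly
display(X_resampled)
display(y_resampled)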

In [None]:
# Flatten X again after resampling so it returns to
# a 1D array
X_resampled = X_resampled.flatten()

X_resampled

In [None]:
# Make a new dataframe with the resampled data.
# The column names match those used by the
# 2016 and 2022 PH Hate Speech datasets.
final_csv = pd.DataFrame(
    list(zip(X_resampled, y_resampled)),
    columns=['text', 'label']
)

final_csv

## Save Processed Data

Save the undersampled dataset to a new CSV file.

In [None]:
final_filename = CSV_FILENAME.replace('.csv', '-final.csv')
final_csv.to_csv(final_filename, index=False)
print(f"Processed data saved to {final_filename}")
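
`str.replace` substitutes every occurrence of `'.csv'` in the path, which matters only if a directory name also contains it. A slightly more defensive variant, sketched here with `pathlib`, changes only the file name. The helper `final_name` is hypothetical, and `PurePosixPath` assumes forward-slash paths.

```python
from pathlib import PurePosixPath


def final_name(csv_path: str) -> str:
    """Return csv_path with '-final' inserted before the file extension."""
    path = PurePosixPath(csv_path)
    return str(path.with_name(f"{path.stem}-final{path.suffix}"))


print(final_name('annotated.csv'))
print(final_name('exports.csv.d/annotated.csv'))
```

For the default `CSV_FILENAME` both approaches give the same result, so this is only worth adopting if the input path may contain `.csv` more than once.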

## Visualization

Visualize the class distribution before and after undersampling to understand the effect.


In [None]:
fig, axs = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

# Count each class once; sort by label so both panels use the same order
before_counts = csv['label'].value_counts().sort_index()
after_counts = final_csv['label'].value_counts().sort_index()

# Before undersampling
axs[0].bar(before_counts.index, before_counts.values, color='skyblue')
axs[0].set_title('Before Undersampling')
axs[0].set_xlabel('Label')
axs[0].set_ylabel('Count')

# After undersampling
axs[1].bar(after_counts.index, after_counts.values, color='lightgreen')
axs[1].set_title('After Undersampling')
axs[1].set_xlabel('Label')

plt.show()