# Text Preprocessing for Emotion Analysis

This notebook handles all preprocessing steps for the emotions dataset. It processes one split at a time (train, validation, or test) and saves the preprocessed DataFrame to a pickle file for efficient loading in training notebooks.

## Preprocessing Steps:
1. Load CSV data
2. Inspect missing values and remove duplicates (training split only)
3. Remove URLs
4. Remove special characters and punctuation
5. Remove extra whitespaces
6. Remove numeric values
7. Lowercase text
8. Remove stopwords
9. Remove remaining non-alphabetic characters
10. Save preprocessed dataframe to pickle file
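
As a rough illustration, the text-cleaning steps listed above (3 through 9) can be condensed into a single function. This is only a sketch: `clean_text` and its `stop_words` parameter are hypothetical names that do not appear in the notebook, which applies each step column-wise with pandas. The regular expressions are the ones used in the cells below.

```python
import re

# Hypothetical helper (not part of the notebook): applies steps 3-9 to one string.
def clean_text(text, stop_words=frozenset()):
    text = re.sub(r'http\S+', '', text)        # 3. remove URLs
    text = re.sub(r'[^\w\s]', '', text)        # 4. remove punctuation and special characters
    text = re.sub(r'\s+', ' ', text).strip()   # 5. collapse runs of whitespace
    text = re.sub(r'\d+', '', text)            # 6. remove digits
    text = text.lower()                        # 7. lowercase
    words = [w for w in text.split() if w not in stop_words]  # 8. drop stopwords
    text = ' '.join(words)
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # 9. keep only ASCII letters and whitespace
    return re.sub(r'\s+', ' ', text).strip()   # tidy spaces left behind by step 9

sample = "I feel GREAT today!!! see http://example.com 100%"
cleaned = clean_text(sample, stop_words={"i", "see"})
print(cleaned)
```

With the toy stopword set above, the sample reduces to `feel great today`. Passing the NLTK English stopword list as `stop_words` should mirror the notebook's behaviour more closely.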


## 1. Import Required Libraries


In [174]:
import pandas as pd
import numpy as np
import re
import nltk
import pickle
import os
from nltk.corpus import stopwords

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

print("✅ Libraries imported successfully")


✅ Libraries imported successfully


[nltk_data] Downloading package punkt to /home/ido/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ido/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Configuration - Dataset Split

**Set `split`** to one of: `'train'`, `'validation'`, or `'test'`

The notebook will automatically:
- Load the corresponding CSV file from `./data/`
- Apply appropriate preprocessing (e.g., remove duplicates only for training)
- Save to the corresponding pickle file


In [175]:
# ========================================
# CONFIGURATION: Set the dataset split
# ========================================
# Choose one: 'train', 'validation', or 'test'
split = 'validation'  # <-- CHANGE THIS

# Validate split
assert split in ['train', 'validation', 'test'], \
    "split must be one of: 'train', 'validation', 'test'"

# Automatically set input and output paths based on split
INPUT_CSV_PATH = f'./data/{split}.csv'
OUTPUT_PKL_PATH = f'./data/{split}_preprocessed.pkl'

print(f"Dataset Split: {split.upper()}")
print(f"Input file:    {INPUT_CSV_PATH}")
print(f"Output file:   {OUTPUT_PKL_PATH}")


Dataset Split: VALIDATION
Input file:    ./data/validation.csv
Output file:   ./data/validation_preprocessed.pkl
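
The split-to-path mapping above could also be wrapped in a small function, sketched here under the assumption that the files keep the `./data/{split}.csv` naming. `resolve_paths` and `VALID_SPLITS` are hypothetical names introduced for illustration. The sketch raises `ValueError` rather than using `assert`, because assertions are skipped when Python runs with the `-O` flag.

```python
from pathlib import Path

# Hypothetical names for illustration; the notebook uses plain module-level variables.
VALID_SPLITS = ("train", "validation", "test")

def resolve_paths(split, data_dir="./data"):
    """Return (input CSV path, output pickle path) for a dataset split."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    base = Path(data_dir)
    return base / f"{split}.csv", base / f"{split}_preprocessed.pkl"

input_csv, output_pkl = resolve_paths("validation")
print(input_csv)
print(output_pkl)
```

Both returned values are `pathlib.Path` objects, which `pd.read_csv` and `DataFrame.to_pickle` accept directly.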


## 3. Load Data


In [176]:
# Load the CSV file
df = pd.read_csv(INPUT_CSV_PATH)

print(f"Data loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()


Data loaded successfully!
Shape: (2000, 2)

First few rows:


,text,label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


## 4. Initial Data Inspection


In [177]:
# Check for null values
print("Null values per column:")
print(df.isnull().sum())

print(f"\nNumber of duplicates: {df.duplicated().sum()}")

# Display column names
print(f"\nColumn names: {df.columns.tolist()}")


Null values per column:
text     0
label    0
dtype: int64

Number of duplicates: 0

Column names: ['text', 'label']


## 4.1. Remove Duplicates (Train Only)

Duplicates are removed ONLY from training data to prevent overfitting.  
Validation and test data keep duplicates to preserve real-world distribution.


In [178]:
# Remove duplicates on train dataset
if split == 'train':
    duplicates_count = df.duplicated().sum()
    if duplicates_count > 0:
        df = df.drop_duplicates().reset_index(drop=True)
        print(f"⚠️  Training split: Removed {duplicates_count} duplicate(s)")
        print(f"   New shape: {df.shape}")
    else:
        print("✅ Training split: No duplicates found")
else:
    print(f"📝 {split.capitalize()} split: Keeping duplicates (preserves data distribution)")


📝 Validation split: Keeping duplicates (preserves data distribution)
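
Note that `df.duplicated()` with no arguments only flags rows where every column matches. The sketch below uses a made-up toy DataFrame (not the emotions data) to show how the same text with different labels could be detected as well. Whether such rows should be dropped is a judgment call that the notebook does not make.

```python
import pandas as pd

# Toy data invented for illustration; column names match the raw CSV ('text', 'label').
toy = pd.DataFrame({
    "text": ["i feel fine", "i feel fine", "i feel sad", "i feel sad"],
    "label": [1, 1, 0, 3],
})

# Rows identical in every column (what the notebook counts and removes)
full_row_dupes = int(toy.duplicated().sum())

# Rows whose text repeats, regardless of label
text_dupes = int(toy.duplicated(subset=["text"]).sum())

# Texts that appear with more than one distinct label
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(full_row_dupes, text_dupes, conflicting)
```

In this toy example one row is a full duplicate, two rows repeat an earlier text, and `"i feel sad"` carries conflicting labels.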


## 5. Data Cleaning - Column Renaming


In [179]:
# Rename columns for consistency (capitalize first letter)
if 'text' in df.columns:
    df.rename(columns={'text': 'Text'}, inplace=True)
    
if 'label' in df.columns:
    df.rename(columns={'label': 'Label'}, inplace=True)

# Drop any unnamed index columns if they exist
if 'Unnamed: 0' in df.columns:
    df.drop('Unnamed: 0', axis=1, inplace=True)

print("Columns after renaming:")
print(df.columns.tolist())
df.head()


Columns after renaming:
['Text', 'Label']


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


## 6. Text Preprocessing Pipeline

### Step 1: Remove URLs


In [180]:
# Remove URLs from text
df['Text'] = df['Text'].str.replace(r'http\S+', '', regex=True)

print("URLs removed.")
df.head()


URLs removed.


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


### Step 2: Remove Special Characters and Punctuation


In [181]:
# Remove special characters and punctuation
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)

print("Special characters and punctuation removed.")
df.head()


Special characters and punctuation removed.


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


### Step 3: Remove Extra Whitespaces


In [182]:
# Remove extra whitespaces (replace multiple spaces with single space)
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True)

# Strip leading and trailing whitespaces
df['Text'] = df['Text'].str.strip()

print("Extra whitespaces removed.")
df.head()


Extra whitespaces removed.


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


### Step 4: Remove Numeric Values


In [183]:
# Remove numeric values from text
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)

print("Numeric values removed.")
df.head()


Numeric values removed.


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


### Step 5: Lowercase Text


In [184]:
# Convert all text to lowercase
df['Text'] = df['Text'].str.lower()

print("Text converted to lowercase.")
df.head()


Text converted to lowercase.


,Text,Label
0,im feeling quite sad and sorry for myself but ...,0
1,i feel like i am still looking at a blank canv...,0
2,i feel like a faithful servant,2
3,i am just feeling cranky and blue,3
4,i can have for a treat or if i am feeling festive,1


### Step 6: Remove Stopwords


In [185]:
# Remove English stopwords (a set gives constant-time membership checks)
stop = set(stopwords.words('english'))
df['Text'] = df['Text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

print("Stopwords removed.")
df.head()


Stopwords removed.


,Text,Label
0,im feeling quite sad sorry ill snap soon,0
1,feel like still looking blank canvas blank pie...,0
2,feel like faithful servant,2
3,feeling cranky blue,3
4,treat feeling festive,1
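
The output above still contains tokens such as `im` and `ill`. One likely reason is that punctuation is stripped before stopword removal, so a contraction that lost its apostrophe no longer matches an entry like `don't`. The sketch below illustrates the effect with a small hand-written stopword set. `STOP_WORDS` and `remove_stopwords` are hypothetical stand-ins, not the NLTK list itself.

```python
# Hand-written stand-in for stopwords.words('english'); assumed entries, for illustration only.
STOP_WORDS = frozenset({"i", "am", "and", "for", "but", "don't"})

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

# Apostrophes already stripped, as in the notebook's pipeline order
without_apostrophes = remove_stopwords("i dont feel sad and im sorry")

# Apostrophes still present: "don't" now matches the stopword entry
with_apostrophes = remove_stopwords("i don't feel sad")

print(without_apostrophes)
print(with_apostrophes)
```

If these leftovers matter for the downstream model, one option would be to remove stopwords before stripping punctuation, or to extend the stopword set with apostrophe-free variants.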


### Step 7: Remove Non-Alphabetic Characters


In [186]:
# Remove any remaining characters that are not ASCII letters or whitespace
# (underscores and non-ASCII letters survive the earlier \w-based step)
df['Text'] = df['Text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

# Clean up any extra spaces that might have been created
df['Text'] = df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()

print("Non-alphabetic characters removed.")
df.head()


Non-alphabetic characters removed.


,Text,Label
0,im feeling quite sad sorry ill snap soon,0
1,feel like still looking blank canvas blank pie...,0
2,feel like faithful servant,2
3,feeling cranky blue,3
4,treat feeling festive,1
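
The sample rows look unchanged here, which raises the question of what this step adds over Step 2. A minimal sketch of the difference, using a made-up input string: in Python 3, `\w` matches underscores and non-ASCII letters, so `[^\w\s]` leaves them in place, while `[^a-zA-Z\s]` removes them.

```python
import re

# Made-up input containing an accented letter, an underscore, and punctuation
raw = "café_au lait!"

# Step 2 pattern: \w is Unicode-aware for str patterns, so 'é' and '_' are kept
after_step2 = re.sub(r'[^\w\s]', '', raw)

# Step 7 pattern: only ASCII letters and whitespace are kept
after_step7 = re.sub(r'[^a-zA-Z\s]', '', after_step2)

print(after_step2)  # café_au lait
print(after_step7)  # cafau lait
```

Dropping accented letters this way changes the word itself, which may or may not be acceptable depending on the dataset.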


## 7. Final Data Inspection


In [187]:
# Check for empty texts after preprocessing
empty_texts = df[df['Text'].str.strip() == ''].shape[0]
print(f"Number of empty text entries after preprocessing: {empty_texts}")

if empty_texts > 0:
    print(f"\nRemoving {empty_texts} empty text entries...")
    df = df[df['Text'].str.strip() != ''].reset_index(drop=True)

print(f"\nFinal shape: {df.shape}")
print(f"\nLabel distribution:")
print(df['Label'].value_counts().sort_index())


Number of empty text entries after preprocessing: 0

Final shape: (2000, 2)

Label distribution:
Label
0    550
1    704
2    178
3    275
4    212
5     81
Name: count, dtype: int64


## 8. Display Sample of Preprocessed Data


In [188]:
# Display random samples from the preprocessed data
print("Sample of preprocessed data:")
df.sample(10)


Sample of preprocessed data:


,Text,Label
1582,looking forward amazing makes feel probably po...,5
961,really didnt feel like going roger keen went b...,1
40,sit chicken preferably bone chicken thighs ski...,2
1965,started feeling pathetic ashamed,0
600,learnt expectations people always met may leav...,0
417,im feeling artistic couple drawings dust ms ca...,1
1210,suppose keep putting know feeling inadequate s...,0
1775,feeling alot people think feel way im sure apa...,1
1300,feeling depressed fabric prices much money hob...,0
918,always feel bit homesick,0


## 9. Save Preprocessed DataFrame to Pickle File

**Why Pickle?**
- Fast and efficient for Python DataFrames
- Preserves data types and structure
- Easy to load in training notebooks
- File size is usually comparable to CSV (not guaranteed to be smaller; it depends on the data)


In [189]:
# Create output directory if it doesn't exist
output_dir = os.path.dirname(OUTPUT_PKL_PATH)
if output_dir and not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# Save the preprocessed dataframe to pickle file
df.to_pickle(OUTPUT_PKL_PATH)

print("\n" + "="*60)
print("PREPROCESSING COMPLETE!")
print("="*60)
print(f"\n✅ Preprocessed data saved to: {OUTPUT_PKL_PATH}")
print(f"   Shape: {df.shape}")
print(f"   Columns: {df.columns.tolist()}")
print(f"   File size: {os.path.getsize(OUTPUT_PKL_PATH) / 1024:.2f} KB")
print(f"\n📌 To load this data in your training notebook, use:")
print(f"   df = pd.read_pickle('{OUTPUT_PKL_PATH}')")



PREPROCESSING COMPLETE!

✅ Preprocessed data saved to: ./data/validation_preprocessed.pkl
   Shape: (2000, 2)
   Columns: ['Text', 'Label']
   File size: 140.81 KB

📌 To load this data in your training notebook, use:
   df = pd.read_pickle('./data/validation_preprocessed.pkl')
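
To illustrate the "preserves data types" point, here is a small round-trip sketch. It uses a made-up two-row DataFrame and a temporary directory rather than the notebook's real files. The `int8` label dtype is an assumption chosen to make the difference visible, since the notebook's labels are ordinary integers.

```python
import os
import tempfile

import pandas as pd

# Made-up demo frame; int8 is used only to make dtype preservation visible
df_demo = pd.DataFrame({
    "Text": ["feel like faithful servant", "feeling cranky blue"],
    "Label": pd.Series([2, 3], dtype="int8"),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")

    df_demo.to_pickle(pkl_path)
    df_demo.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)   # dtypes restored exactly
    from_csv = pd.read_csv(csv_path)      # dtypes re-inferred from text

print(from_pkl["Label"].dtype)
print(from_csv["Label"].dtype)
```

The pickle round trip returns an identical DataFrame, while the CSV round trip re-infers the label dtype. One caveat: unpickling can execute arbitrary code, so `pd.read_pickle` should only be used on files from a trusted source.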


## 10. Display Final Preprocessed DataFrame

In [190]:
# Display first 10 rows of the preprocessed dataframe
df.head(10)

,Text,Label
0,im feeling quite sad sorry ill snap soon,0
1,feel like still looking blank canvas blank pie...,0
2,feel like faithful servant,2
3,feeling cranky blue,3
4,treat feeling festive,1
5,start feel appreciative god done,1
6,feeling confident able take care baby,1
7,feel incredibly lucky able talk,1
8,feel less keen army every day,1
9,feel dirty ashamed saying,0
