# Exploratory Data Analysis (EDA)

**NLP Multi-Type Classification Project**

This notebook performs exploratory data analysis on the processed train/val/test datasets.

## Objectives

1. Load processed CSVs (`train_4class.csv`, `val_4class.csv`, `test_4class.csv`)
2. Verify data integrity and schema compliance
3. Analyze class distributions across splits
4. Investigate text length distributions
5. Detect duplicates and potential data quality issues
6. Visualize key patterns and relationships

## EDA Checklist

### Data Loading and Validation
- [ ] Load all three splits (train/val/test)
- [ ] Verify column names match expected schema
- [ ] Check for missing values
- [ ] Verify all labels are in {T1, T2, T3, T4} or {0, 1, 2, 3}
- [ ] Count total samples per split

### Class Distribution Analysis
- [ ] Compute class counts and proportions per split
- [ ] Visualize class distribution (bar charts)
- [ ] Check for class imbalance (warn if any class makes up less than 10% of a split)
- [ ] Compare class ratios across train/val/test

### Text Length Analysis
- [ ] Compute word count and character count for all texts
- [ ] Plot length distributions per class (boxplots/violin plots)
- [ ] Identify outliers (very short or very long texts)
- [ ] Print the longest 5% and shortest 5% of texts for manual inspection
- [ ] Generate summary statistics (mean, median, std, min, max)

### Duplicate Detection
- [ ] Count exact duplicate texts within each split
- [ ] Identify duplicate texts across splits (potential leakage)
- [ ] Print examples of duplicates for inspection

### Family-Aware Split Validation
- [ ] Verify that no `family_id` appears in more than one split
- [ ] Count unique families per split
- [ ] Verify family grouping is correct

### Data Quality Checks
- [ ] Check for empty or near-empty texts
- [ ] Check for unusual characters or encoding issues
- [ ] Identify potential outliers or anomalies
- [ ] Verify text preprocessing was applied correctly

### Visualization Summary
- [ ] Class proportion bar chart (per split)
- [ ] Text length boxplot (per class)
- [ ] Text length histogram (overall distribution)
- [ ] Correlation between text length and class (if any)

---

**TODO**: Implement code cells below to complete each checklist item.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Add src to path for imports
sys.path.append('../src')
from constants import LABELS, LABEL2ID, REQUIRED_COLUMNS

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully.")


## 1. Load Processed Data

Load train/val/test CSVs from `data/processed/`.


In [None]:
# TODO: Load CSVs
# train_df = pd.read_csv('../data/processed/train_4class.csv')
# val_df = pd.read_csv('../data/processed/val_4class.csv')
# test_df = pd.read_csv('../data/processed/test_4class.csv')

# print(f"Train samples: {len(train_df)}")
# print(f"Val samples: {len(val_df)}")
# print(f"Test samples: {len(test_df)}")
# print(f"Total samples: {len(train_df) + len(val_df) + len(test_df)}")

print("TODO: Implement data loading")
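
The validation items in the checklist (schema, missing values, label set) could be sketched as one small helper. This is an illustration under stated assumptions, not the project's implementation: `validate_split` is a hypothetical helper, and `ASSUMED_COLUMNS` and `ASSUMED_LABELS` are stand-ins for `REQUIRED_COLUMNS` and `LABELS` from `constants`, whose actual contents are not shown in this notebook. A toy frame replaces the real CSVs so the sketch runs on its own.

```python
import pandas as pd

# Assumed stand-ins for constants.REQUIRED_COLUMNS and constants.LABELS.
ASSUMED_COLUMNS = ["text", "label", "family_id"]
ASSUMED_LABELS = {"T1", "T2", "T3", "T4"}


def validate_split(df, required_columns, allowed_labels):
    """Return a dict describing schema, missing-value, and label problems."""
    missing_columns = [c for c in required_columns if c not in df.columns]
    present = [c for c in required_columns if c in df.columns]
    # Count NaN cells per required column that is actually present.
    null_counts = df[present].isna().sum()
    null_counts = null_counts[null_counts > 0].to_dict()
    unexpected_labels = set()
    if "label" in df.columns:
        unexpected_labels = set(df["label"].dropna().unique()) - set(allowed_labels)
    return {
        "n_rows": len(df),
        "missing_columns": missing_columns,
        "null_counts": null_counts,
        "unexpected_labels": unexpected_labels,
    }


# Toy data: one missing text and one label outside the allowed set.
toy_df = pd.DataFrame({
    "text": ["alpha beta", None, "gamma"],
    "label": ["T1", "T2", "T9"],
    "family_id": ["f1", "f2", "f3"],
})
report = validate_split(toy_df, ASSUMED_COLUMNS, ASSUMED_LABELS)
print(report)
```

With the real data, the same call could be made once per split, passing `REQUIRED_COLUMNS` and the project's label set in place of the assumed values.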


## 2. Class Distribution Analysis

Analyze the distribution of classes across splits.


In [None]:
# TODO: Compute class distributions
# for split_name, df in [("Train", train_df), ("Val", val_df), ("Test", test_df)]:
#     print(f"\n{split_name} Class Distribution:")
#     print(df['label'].value_counts().sort_index())
#     print(df['label'].value_counts(normalize=True).sort_index())

print("TODO: Implement class distribution analysis")
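
To compare class ratios across splits and apply the 10% imbalance warning from the checklist, the proportions can be collected into one table. The sketch below is a minimal version on toy data; `class_proportions` and `IMBALANCE_THRESHOLD` are hypothetical names, and the `label` column name is taken from the stub above.

```python
import pandas as pd

IMBALANCE_THRESHOLD = 0.10  # warning level taken from the checklist


def class_proportions(splits, label_col="label"):
    """Return class proportions: one row per class, one column per split."""
    columns = {
        name: df[label_col].value_counts(normalize=True)
        for name, df in splits.items()
    }
    # Classes absent from a split become 0.0 instead of NaN.
    return pd.DataFrame(columns).fillna(0.0).sort_index()


# Toy splits: T4 is rare in train and absent from val.
toy_splits = {
    "train": pd.DataFrame({"label": ["T1"] * 8 + ["T2"] * 6 + ["T3"] * 5 + ["T4"] * 1}),
    "val": pd.DataFrame({"label": ["T1"] * 2 + ["T2"] * 2 + ["T3"] * 1}),
}
proportions = class_proportions(toy_splits)
print(proportions.round(3))

# Collect every (class, split) pair whose share is below the threshold.
flagged = [
    (label, split)
    for split in proportions.columns
    for label, share in proportions[split].items()
    if share < IMBALANCE_THRESHOLD
]
for label, split in flagged:
    print(f"WARNING: {label} is {proportions.loc[label, split]:.1%} of {split}")
```

The same table can feed the bar charts directly, for example via `proportions.plot(kind="bar")`.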


## 3. Text Length Analysis

Analyze text length distributions (word count and character count).


In [None]:
# TODO: Compute and visualize text lengths
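# if 'text_len_word' not in train_df.columns:
#     # Fall back to a whitespace token count when the column is not precomputed
#     train_df['text_len_word'] = train_df['text'].str.split().str.len()
# print("Text length statistics:")
# print(train_df.groupby('label')['text_len_word'].describe())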

print("TODO: Implement text length analysis")
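
The length items in the checklist (word and character counts, per-class statistics, and the 5% tails) might look like the sketch below. It assumes whitespace tokenization for the word count; `add_length_columns` is a hypothetical helper and `text_len_char` is an assumed column name chosen to mirror `text_len_word`. Toy data keeps the example self-contained.

```python
import pandas as pd


def add_length_columns(df, text_col="text"):
    """Return a copy of df with whitespace-token and character counts added."""
    out = df.copy()
    text = out[text_col].fillna("").astype(str)
    out["text_len_word"] = text.str.split().str.len()
    out["text_len_char"] = text.str.len()
    return out


toy_df = pd.DataFrame({
    "text": ["a", "one two three", "four five", "six seven eight nine ten", ""],
    "label": ["T1", "T1", "T2", "T2", "T1"],
})
lengths = add_length_columns(toy_df)

# Per-class summary statistics (count, mean, std, min, quartiles, max).
print(lengths.groupby("label")["text_len_word"].describe())

# Rows at or beyond the 5th and 95th percentiles, for manual inspection.
low_cut, high_cut = lengths["text_len_word"].quantile([0.05, 0.95])
shortest = lengths[lengths["text_len_word"] <= low_cut]
longest = lengths[lengths["text_len_word"] >= high_cut]
print(shortest[["text", "text_len_word"]])
print(longest[["text", "text_len_word"]])
```

For the per-class plots, `sns.boxplot(data=lengths, x="label", y="text_len_word")` is one option using the libraries already imported.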


## 4. Duplicate Detection

Check for exact duplicate texts within each split and across splits. A text shared between splits is a potential source of leakage.


In [None]:
# TODO: Check for duplicates
# duplicated() flags every copy after the first, so each count is the number of extra copies
# print("Duplicate texts within splits:")
# print(f"Train duplicates: {train_df['text'].duplicated().sum()}")
# print(f"Val duplicates: {val_df['text'].duplicated().sum()}")
# print(f"Test duplicates: {test_df['text'].duplicated().sum()}")

print("TODO: Implement duplicate detection")
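
The stub above covers only within-split duplicates. For the cross-split case, one possible approach is an inner merge on the de-duplicated text column, sketched here on toy data. `cross_split_duplicates` is a hypothetical helper, and the comparison is exact string matching only, so texts that differ in whitespace or casing are not caught.

```python
import pandas as pd


def cross_split_duplicates(left, right, text_col="text"):
    """Return the distinct texts that occur in both frames."""
    left_texts = left[[text_col]].drop_duplicates()
    right_texts = right[[text_col]].drop_duplicates()
    return left_texts.merge(right_texts, on=text_col, how="inner")


toy_train = pd.DataFrame({"text": ["same text", "only in train", "same text"]})
toy_test = pd.DataFrame({"text": ["only in test", "same text"]})

# Copies beyond the first occurrence within one split.
print(f"Within-train duplicates: {toy_train['text'].duplicated().sum()}")

# Distinct texts present in both splits (potential leakage).
shared = cross_split_duplicates(toy_train, toy_test)
print(f"Train-test shared texts: {len(shared)}")
print(shared.head())
```

Running the helper for each pair of splits (train-val, train-test, val-test) would cover the full checklist item.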


## 5. Leakage Check

Verify that no `family_id` appears in more than one split.


In [None]:
# TODO: Check for family_id overlap
# train_families = set(train_df['family_id'])
# val_families = set(val_df['family_id'])
# test_families = set(test_df['family_id'])

# overlap_train_val = train_families & val_families
# overlap_train_test = train_families & test_families
# overlap_val_test = val_families & test_families

# print(f"Train-Val overlap: {len(overlap_train_val)} families")
# print(f"Train-Test overlap: {len(overlap_train_test)} families")
# print(f"Val-Test overlap: {len(overlap_val_test)} families")

# if not (overlap_train_val or overlap_train_test or overlap_val_test):
#     print("✓ PASS: No family leakage detected")
# else:
#     print("✗ FAIL: Family leakage detected!")

print("TODO: Implement leakage check")
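
The pairwise set intersections in the stub can also be written once for any number of splits. The following is a minimal sketch on toy data; `family_overlaps` is a hypothetical helper, and the `family_id` column name is taken from the stub above.

```python
from itertools import combinations

import pandas as pd


def family_overlaps(splits, family_col="family_id"):
    """Map each pair of split names to the set of family IDs they share."""
    families = {name: set(df[family_col]) for name, df in splits.items()}
    return {
        (a, b): families[a] & families[b]
        for a, b in combinations(families, 2)
    }


# Toy splits: family f2 appears in both train and test.
toy_splits = {
    "train": pd.DataFrame({"family_id": ["f1", "f1", "f2"]}),
    "val": pd.DataFrame({"family_id": ["f3"]}),
    "test": pd.DataFrame({"family_id": ["f2", "f4"]}),
}
overlaps = family_overlaps(toy_splits)
for (a, b), shared_ids in overlaps.items():
    print(f"{a}-{b} overlap: {len(shared_ids)} families")

# Keep only the pairs that actually share at least one family.
leaked = {pair: ids for pair, ids in overlaps.items() if ids}
print("PASS: no family leakage" if not leaked else f"FAIL: {leaked}")
```

Returning the overlapping IDs rather than only their count makes it possible to inspect which families need to be reassigned.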

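
The Data Quality Checks in the checklist (empty or near-empty texts, unusual characters or encoding issues) have no cell of their own yet. A rough sketch follows, with several assumptions: `quality_flags` is a hypothetical helper, the `MIN_WORDS` cut-off for "near-empty" is arbitrary, and the Unicode replacement character U+FFFD together with stray control characters is used as a simple proxy for encoding problems.

```python
import pandas as pd

MIN_WORDS = 3  # assumed cut-off for "near-empty"; tune for the dataset


def quality_flags(df, text_col="text", min_words=MIN_WORDS):
    """Return one boolean column per check, aligned with df's index."""
    text = df[text_col].fillna("").astype(str)
    flags = pd.DataFrame(index=df.index)
    flags["is_empty"] = text.str.strip().eq("")
    flags["is_near_empty"] = text.str.split().str.len().lt(min_words)
    # U+FFFD is what decoders emit for bytes they could not decode.
    flags["has_replacement_char"] = text.str.contains("\ufffd", regex=False)
    # Control characters other than tab, newline, and carriage return.
    flags["has_control_char"] = text.str.contains(
        r"[\x00-\x08\x0b\x0c\x0e-\x1f]", regex=True
    )
    return flags


toy_df = pd.DataFrame({
    "text": [
        "a perfectly ordinary sentence",
        "   ",
        "caf\ufffd menu for today",
        "ok",
        None,
    ],
})
flags = quality_flags(toy_df)
# Number of rows failing each check.
print(flags.sum())
```

Rows where any flag is set can be pulled out for manual review with `toy_df[flags.any(axis=1)]`.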