# Post Exploratory Data Analysis (Post-EDA)
We perform Post-EDA to confirm that data cleaning worked as intended and that the dataset is ready for modeling. The checks cover missing values, duplicates, label distributions, and token lengths (texts should have been truncated to 128 tokens). We also review sample texts with their labels to verify data quality and consistency.

Import necessary libraries

In [2]:
import pandas as pd

Load processed dataset

In [3]:
df = pd.read_csv('/content/combined_train.csv')

Check total number of samples

In [4]:
print(f"Total samples: {len(df)}")

Total samples: 28454


Check for missing/null values

In [5]:
null_counts = df.isnull().sum()
print("\nMissing values per column:")
print(null_counts)


Missing values per column:
text             0
emotion_label    0
sarcasm_label    0
dtype: int64
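
A null check does not flag strings that are empty or contain only whitespace, so a cleaned text column can pass `isnull()` and still hold unusable rows. The minimum token count of 1 reported further down already suggests there are none here. The sketch below makes that check explicit. The helper name `count_blank_texts` and the toy list are illustrative assumptions, not part of the original notebook.

```python
def count_blank_texts(texts):
    """Count entries that are empty or whitespace-only after str() conversion."""
    return sum(1 for t in texts if not str(t).strip())


# Toy data standing in for the notebook's text column (assumption for illustration).
toy_texts = ["great day", "   ", "", "ok"]
blank_count = count_blank_texts(toy_texts)
print(f"Blank texts: {blank_count}")
```

In the notebook, the same helper could be called as `count_blank_texts(df['text'].tolist())`.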


Check duplicates by 'text'

In [6]:
dup_count = df.duplicated(subset=['text']).sum()
print(f"\nDuplicate rows by 'text': {dup_count}")


Duplicate rows by 'text': 0
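
The check above only catches exact string matches. Texts that differ only in letter case or spacing would still count as distinct. A minimal sketch of a stricter check follows, assuming that lowercasing and collapsing whitespace is an acceptable notion of "same text" for this dataset. The helper names and toy data are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse runs of whitespace into single spaces."""
    return " ".join(str(text).lower().split())


def count_normalized_duplicates(texts):
    """Count rows that repeat an earlier row once texts are normalized."""
    counts = Counter(normalize(t) for t in texts)
    return sum(c - 1 for c in counts.values() if c > 1)


# Toy data: three variants of the same sentence plus one unique sentence.
toy_texts = ["Great day !", "great  day !", "Another one", "great day !"]
near_dups = count_normalized_duplicates(toy_texts)
print(f"Duplicates after normalization: {near_dups}")
```

In the notebook this could be run as `count_normalized_duplicates(df['text'].tolist())`. A nonzero result would point to near-duplicates that survived cleaning.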


Check label distributions

In [7]:
print("\nEmotion label distribution:")
print(df['emotion_label'].value_counts(normalize=True).sort_index())

print("\nSarcasm label distribution:")
print(df['sarcasm_label'].value_counts(normalize=True).sort_index())


Emotion label distribution:
emotion_label
-1    0.129332
 0    0.102938
 1    0.024179
 2    0.118296
 3    0.293491
 4    0.040416
 5    0.173754
 6    0.004006
 7    0.113587
Name: proportion, dtype: float64

Sarcasm label distribution:
sarcasm_label
-1    0.870668
 0    0.067056
 1    0.062276
Name: proportion, dtype: float64
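
Because the proportions above include the -1 placeholder, they understate each class's share among samples that actually carry an emotion label. The sketch below renormalizes the printed proportions over labeled samples only and computes the ratio between the most and least frequent class. The values are copied from the output above. The helper name `labeled_shares` is an assumption for illustration.

```python
# Proportions copied from the emotion label distribution printed above.
emotion_props = {
    -1: 0.129332, 0: 0.102938, 1: 0.024179, 2: 0.118296, 3: 0.293491,
    4: 0.040416, 5: 0.173754, 6: 0.004006, 7: 0.113587,
}


def labeled_shares(props, missing=-1):
    """Drop the missing-label placeholder and renormalize the rest to sum to 1."""
    labeled = {k: v for k, v in props.items() if k != missing}
    total = sum(labeled.values())
    return {k: v / total for k, v in labeled.items()}


shares = labeled_shares(emotion_props)
most = max(shares, key=shares.get)
least = min(shares, key=shares.get)
ratio = shares[most] / shares[least]
print(f"Most frequent: {most}, least frequent: {least}, ratio: {ratio:.1f}")
```

With these numbers the most frequent emotion class (3) is roughly 73 times as common as the rarest (6), a gap worth keeping in mind when choosing evaluation metrics.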


Count samples labeled -1 (no label for that task in the multitask setup)

In [8]:
emotion_missing = (df['emotion_label'] == -1).sum()
sarcasm_missing = (df['sarcasm_label'] == -1).sum()
print(f"\nSamples with missing emotion label (-1): {emotion_missing} ({emotion_missing / len(df) * 100:.2f}%)")
print(f"Samples with missing sarcasm label (-1): {sarcasm_missing} ({sarcasm_missing / len(df) * 100:.2f}%)")


Samples with missing emotion label (-1): 3680 (12.93%)
Samples with missing sarcasm label (-1): 24774 (87.07%)
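
The two counts sum to the dataset size (3680 + 24774 = 28454), which is consistent with every sample being labeled for exactly one task. The sum alone does not prove it, since equal numbers of fully labeled and fully unlabeled rows would give the same total. A minimal sketch of a direct check follows. The helper name `label_coverage` and the toy labels are hypothetical.

```python
from collections import Counter


def label_coverage(emotion_labels, sarcasm_labels, missing=-1):
    """Count rows by which of the two task labels are present."""
    counts = Counter()
    for e, s in zip(emotion_labels, sarcasm_labels):
        has_e, has_s = e != missing, s != missing
        if has_e and has_s:
            counts["both"] += 1
        elif has_e:
            counts["emotion_only"] += 1
        elif has_s:
            counts["sarcasm_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy labels standing in for the two label columns (assumption for illustration).
toy_emotion = [5, 3, -1, -1, 2]
toy_sarcasm = [-1, -1, 1, 0, -1]
coverage = label_coverage(toy_emotion, toy_sarcasm)
print(dict(coverage))
```

In the notebook, `label_coverage(df['emotion_label'].tolist(), df['sarcasm_label'].tolist())` would show whether any row has both labels or neither.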


Text length stats (token count)

In [9]:
# Token counts are not precomputed, so approximate them by splitting on whitespace.
# This counts words, not the model's subword tokens, so it can understate the tokenized length.
df['token_count'] = df['text'].apply(lambda x: len(str(x).split()))
print("\nToken count statistics (after cleaning & truncation):")
print(f"Mean: {df['token_count'].mean():.2f}")
print(f"Median: {df['token_count'].median()}")
print(f"Max: {df['token_count'].max()}")
print(f"Min: {df['token_count'].min()}")


Token count statistics (after cleaning & truncation):
Mean: 18.10
Median: 17.0
Max: 127
Min: 1
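
A maximum of 127 whitespace tokens stays under the 128-token limit, but the count depends on how the text is tokenized. The sketch below takes the tokenizer as a parameter so the same check can be repeated with the model's own tokenizer. The helper name `count_over_limit`, the default `str.split` tokenizer, and the toy texts are assumptions for illustration.

```python
def count_over_limit(texts, tokenize=str.split, max_len=128):
    """Return (number of texts longer than max_len, longest length seen)."""
    lengths = [len(tokenize(str(t))) for t in texts]
    over = sum(1 for n in lengths if n > max_len)
    return over, max(lengths, default=0)


# Toy data: the second text has 130 whitespace-separated tokens.
toy_texts = ["a b c", "word " * 130, "hi"]
over, longest = count_over_limit(toy_texts)
print(f"Texts over limit: {over}, longest: {longest}")
```

Passing a different callable as `tokenize` (any function that maps a string to a list of tokens) gives the count under that tokenizer instead of the whitespace approximation.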


Show sample cleaned texts

In [15]:
print("\nSample cleaned tweets with labels:")
sample_df = df.sample(10)[['text', 'emotion_label', 'sarcasm_label']]
for idx, row in sample_df.iterrows():
    print(f"Text: {row['text']}\nEmotion Label: {row['emotion_label']}, Sarcasm Label: {row['sarcasm_label']}\n---")


Sample cleaned tweets with labels:
Text: Did not realize how excited I was to have my Netflix stream from my computer to my TV until the cords to make this happen failed .
Emotion Label: 5, Sarcasm Label: -1
---
Text: Oh yes , I loved it . Was n't the scene with the judge great ?
Emotion Label: 3, Sarcasm Label: -1
---
Text: Bout to be another great day ! !
Emotion Label: 3, Sarcasm Label: -1
---
Text: I kept trying to get back to his lugubrious face , which reclined morosely in his good hand as the guests filled the air around him with cultivated noises .
Emotion Label: 5, Sarcasm Label: -1
---
Text: Bobby Robson ' s delight at having guided his team to another major tournament was coupled with gratitude to the 40 - year-old goalkeeper .
Emotion Label: 3, Sarcasm Label: -1
---
Text: I have plans . I feel sick . What a surprising turn of events .
Emotion Label: -1, Sarcasm Label: 1
---
Text: News about current affairs , documentaries , music , movies , noncommercial ads and so on .
Em

## Post-EDA summary

- The combined training dataset contains 28,454 samples with no null values in the text, emotion_label, or sarcasm_label columns. Missing task labels are encoded as -1 rather than null.

- Duplicate checking by exact text match found zero duplicates, so every sample's text is unique.

- Whitespace-split token counts range from 1 to 127, with a mean of about 18 and a median of 17. This is consistent with the 128-token truncation applied during preprocessing to fit model input constraints, although a whitespace count only approximates the length under the model's tokenizer.

- A random sample of 10 tweets with their corresponding emotion and sarcasm labels demonstrates the variety in text content and label distribution.

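- Overall, the dataset is clean and ready for modeling, with two caveats. Sarcasm labels are sparse: 87.07% of samples carry -1 for sarcasm, while 12.93% carry -1 for emotion. The emotion classes are also strongly imbalanced, ranging from 29.3% of all samples (label 3) down to 0.4% (label 6). These findings should guide preprocessing and model training strategies.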
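
One common way to act on class imbalance during training is to weight each class inversely to its frequency. The sketch below uses the formula total / (number of classes × class count), the same heuristic scikit-learn applies for `class_weight='balanced'`. It is one possible strategy under that assumption, not the method this notebook commits to. The helper name and the toy counts are hypothetical.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Toy class counts (assumption for illustration), with -1 rows already excluded.
toy_counts = {0: 50, 1: 30, 2: 20}
weights = inverse_frequency_weights(toy_counts)
for label, w in sorted(weights.items()):
    print(f"class {label}: weight {w:.3f}")
```

Rare classes receive weights above 1 and frequent classes below 1. Rows labeled -1 should be excluded before counting, since they carry no class for that task.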