# Bangla Punctuation Restoration - Data Exploration

This notebook explores the dataset for Bangla punctuation restoration, analyzing text patterns, punctuation distribution, and data quality metrics.

## Objectives:
1. Load and explore the Bangla punctuation dataset
2. Analyze punctuation patterns and distribution
3. Examine text statistics and characteristics
4. Evaluate dataset quality
5. Explore adversarial examples
6. Visualize key insights

In [None]:
import sys
import os
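# __file__ is not defined inside a notebook, so resolve the project root
# relative to the working directory instead
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))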

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re
import warnings
warnings.filterwarnings('ignore')

from src.data.dataset_loader import BanglaDatasetLoader
from src.data.data_processor import BanglaTextProcessor
from src.data.adversarial_attacks import AdversarialAttacks
from config import Config

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("✅ Imports loaded successfully!")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

In [None]:
# Initialize configuration and data loader
config = Config()
loader = BanglaDatasetLoader(config)
processor = BanglaTextProcessor()

print("🔄 Loading Bangla punctuation dataset...")

# Load the dataset
try:
    # Try to load from existing sources or generate sample data
    dataset = loader.load_dataset()
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Dataset shape: {len(dataset)} samples")
    
    # Display sample data
    print("\n📝 Sample data:")
    for i, (original, punctuated) in enumerate(dataset[:3]):
        print(f"Sample {i+1}:")
        print(f"  Original: {original}")
        print(f"  Punctuated: {punctuated}")
        print()
        
except Exception as e:
    print(f"⚠️ Error loading dataset: {e}")
    print("🔄 Generating sample dataset for exploration...")
    
    # Generate sample data for demonstration
    sample_texts = [
        ("আমি বাংলাদেশে থাকি", "আমি বাংলাদেশে থাকি।"),
        ("তুমি কেমন আছো", "তুমি কেমন আছো?"),
        ("আজ আবহাওয়া খুব ভালো", "আজ আবহাওয়া খুব ভালো।"),
        ("আমি ভাত খাই সকালে দুপুরে এবং রাতে", "আমি ভাত খাই সকালে, দুপুরে এবং রাতে।"),
        ("আপনি কি আসবেন", "আপনি কি আসবেন?"),
    ]
    dataset = sample_texts * 20  # Replicate for analysis
    print(f"✅ Sample dataset generated with {len(dataset)} samples")

In [None]:
# Basic Dataset Statistics
print("📊 BASIC DATASET STATISTICS")
print("=" * 50)

# Convert to lists for easier analysis
original_texts = [item[0] for item in dataset]
punctuated_texts = [item[1] for item in dataset]

# Text length statistics
original_lengths = [len(text) for text in original_texts]
punctuated_lengths = [len(text) for text in punctuated_texts]
word_counts = [len(text.split()) for text in original_texts]

stats_df = pd.DataFrame({
    'Metric': ['Total Samples', 'Avg Original Length', 'Avg Punctuated Length', 
               'Avg Word Count', 'Min Length', 'Max Length'],
    'Value': [
        len(dataset),
        np.mean(original_lengths),
        np.mean(punctuated_lengths),
        np.mean(word_counts),
        min(original_lengths),
        max(original_lengths)
    ]
})

print(stats_df.to_string(index=False))

# Create distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Text length distribution
axes[0, 0].hist(original_lengths, bins=20, alpha=0.7, label='Original', color='skyblue')
axes[0, 0].hist(punctuated_lengths, bins=20, alpha=0.7, label='Punctuated', color='orange')
axes[0, 0].set_xlabel('Text Length (characters)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Text Length Distribution')
axes[0, 0].legend()

# Word count distribution
axes[0, 1].hist(word_counts, bins=15, alpha=0.7, color='lightgreen')
axes[0, 1].set_xlabel('Word Count')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Word Count Distribution')

# Length difference
length_diff = [p - o for o, p in zip(original_lengths, punctuated_lengths)]
axes[1, 0].hist(length_diff, bins=15, alpha=0.7, color='coral')
axes[1, 0].set_xlabel('Length Difference (Punctuated - Original)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Punctuation Addition Impact')

# Box plot for length comparison
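axes[1, 1].boxplot([original_lengths, punctuated_lengths])
# Label the two boxes through the axis so this works across Matplotlib versions
axes[1, 1].set_xticklabels(['Original', 'Punctuated'])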
axes[1, 1].set_ylabel('Text Length')
axes[1, 1].set_title('Length Comparison Box Plot')

plt.tight_layout()
plt.show()

In [None]:
# Punctuation Pattern Analysis
print("🔍 PUNCTUATION PATTERN ANALYSIS")
print("=" * 50)

# Define Bangla and English punctuation marks; the two lists overlap,
# so merge them into a set for duplicate-free, fast membership tests
bangla_punctuation = ['।', '?', '!', ',', ';', ':', '"', "'", '-', '—', '(', ')', '[', ']']
english_punctuation = ['.', '?', '!', ',', ';', ':', '"', "'", '-', '—', '(', ')', '[', ']']
all_punctuation = set(bangla_punctuation + english_punctuation)

# Count punctuation occurrences
punctuation_counts = Counter()
for text in punctuated_texts:
    for char in text:
        if char in all_punctuation:
            punctuation_counts[char] += 1

print("🔤 Punctuation frequency:")
total_punct = sum(punctuation_counts.values())  # compute once, outside the loop
for punct, count in punctuation_counts.most_common():
    percentage = (count / total_punct) * 100
    print(f"  '{punct}': {count} ({percentage:.1f}%)")

# Analyze punctuation positions
def analyze_punctuation_positions(texts):
    positions = {'beginning': 0, 'middle': 0, 'end': 0}
    for text in texts:
        for i, char in enumerate(text):
            if char in all_punctuation:
                if i == 0:
                    positions['beginning'] += 1
                elif i == len(text) - 1:
                    positions['end'] += 1
                else:
                    positions['middle'] += 1
    return positions

punct_positions = analyze_punctuation_positions(punctuated_texts)
print(f"\n📍 Punctuation positions:")
for pos, count in punct_positions.items():
    print(f"  {pos.capitalize()}: {count}")

# Visualize punctuation analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Punctuation frequency bar plot
punct_chars = list(punctuation_counts.keys())
punct_freq = list(punctuation_counts.values())
axes[0, 0].bar(punct_chars, punct_freq, color='lightcoral')
axes[0, 0].set_xlabel('Punctuation Marks')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Punctuation Mark Frequency')

# Punctuation position pie chart
# Materialise the dict views as lists before handing them to Matplotlib
axes[0, 1].pie(list(punct_positions.values()), labels=list(punct_positions.keys()),
               autopct='%1.1f%%', startangle=90)
axes[0, 1].set_title('Punctuation Position Distribution')

# Sentences ending with different punctuation
ending_punct = Counter()
for text in punctuated_texts:
    if text and text[-1] in all_punctuation:
        ending_punct[text[-1]] += 1

if ending_punct:
    axes[1, 0].bar(list(ending_punct.keys()), list(ending_punct.values()), color='lightblue')
    axes[1, 0].set_xlabel('Ending Punctuation')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].set_title('Sentence Ending Punctuation')

# Text without punctuation vs with punctuation
no_punct_count = sum(1 for text in original_texts if not any(p in text for p in all_punctuation))
with_punct_count = len(original_texts) - no_punct_count
axes[1, 1].pie([no_punct_count, with_punct_count], 
               labels=['No Punctuation', 'Has Punctuation'], 
               autopct='%1.1f%%', startangle=90)
axes[1, 1].set_title('Original Text Punctuation Status')

plt.tight_layout()
plt.show()
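
The counts above are character-level. A punctuation restoration model is usually trained on token-level labels instead, so it can help to look at the same distribution per word. The cell below is a minimal sketch under that assumption: `PUNCT_MARKS`, `token_labels`, and the `"O"` (no punctuation) label are illustrative names, not part of the project's `src` package, and only a single trailing mark per token is handled.

```python
from collections import Counter

# Assumed label set: the mark attached to the end of a word, or "O" for none.
PUNCT_MARKS = {"।", "?", "!", ",", ";", ":"}

def token_labels(punctuated):
    """Split on whitespace and peel one trailing punctuation mark off each token."""
    pairs = []
    for token in punctuated.split():
        if len(token) > 1 and token[-1] in PUNCT_MARKS:
            pairs.append((token[:-1], token[-1]))
        else:
            pairs.append((token, "O"))
    return pairs

example = "আমি ভাত খাই সকালে, দুপুরে এবং রাতে।"
labels = token_labels(example)
label_counts = Counter(label for _, label in labels)

print(labels)
print(label_counts)
```

Running `token_labels` over every punctuated sentence and summing the counters gives the class balance a model would actually see, which is typically dominated by `"O"`.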

In [None]:
# Bangla Text Characteristics Analysis
print("🔤 BANGLA TEXT CHARACTERISTICS")
print("=" * 50)

# Define Bangla character ranges
bangla_chars = set()
for text in original_texts:
    for char in text:
        if '\u0980' <= char <= '\u09FF':  # Bangla Unicode range
            bangla_chars.add(char)

print(f"📝 Unique Bangla characters found: {len(bangla_chars)}")
print(f"🔤 Characters: {sorted(bangla_chars)[:20]}...")  # Show first 20

# Analyze character frequency
char_counter = Counter()
for text in original_texts:
    for char in text:
        if '\u0980' <= char <= '\u09FF':
            char_counter[char] += 1

print(f"\n📊 Top 10 most frequent Bangla characters:")
for char, count in char_counter.most_common(10):
    print(f"  '{char}': {count}")

# Word analysis
all_words = []
for text in original_texts:
    words = text.split()
    all_words.extend(words)

word_counter = Counter(all_words)
print(f"\n📚 Vocabulary statistics:")
print(f"  Total words: {len(all_words)}")
print(f"  Unique words: {len(word_counter)}")
print(f"  Average word frequency: {len(all_words) / len(word_counter):.2f}")

print(f"\n🏆 Top 10 most frequent words:")
for word, count in word_counter.most_common(10):
    print(f"  '{word}': {count}")

# Sentence type analysis (based on ending punctuation)
sentence_types = {'statement': 0, 'question': 0, 'exclamation': 0, 'other': 0}
for text in punctuated_texts:
    if text.endswith('।') or text.endswith('.'):
        sentence_types['statement'] += 1
    elif text.endswith('?'):
        sentence_types['question'] += 1
    elif text.endswith('!'):
        sentence_types['exclamation'] += 1
    else:
        sentence_types['other'] += 1

print(f"\n📄 Sentence type distribution:")
for stype, count in sentence_types.items():
    percentage = (count / len(punctuated_texts)) * 100
    print(f"  {stype.capitalize()}: {count} ({percentage:.1f}%)")

# Visualize text characteristics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Character frequency (top 15)
top_chars = dict(char_counter.most_common(15))
axes[0, 0].bar(range(len(top_chars)), list(top_chars.values()), color='lightgreen')
axes[0, 0].set_xticks(range(len(top_chars)))
axes[0, 0].set_xticklabels(list(top_chars.keys()), rotation=45)
axes[0, 0].set_xlabel('Bangla Characters')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Top 15 Bangla Character Frequency')

# Word length distribution
word_lengths = [len(word) for word in all_words]
axes[0, 1].hist(word_lengths, bins=15, alpha=0.7, color='lightpink')
axes[0, 1].set_xlabel('Word Length (characters)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Word Length Distribution')

# Sentence type pie chart
# Materialise the dict views as lists before handing them to Matplotlib
axes[1, 0].pie(list(sentence_types.values()), labels=list(sentence_types.keys()),
               autopct='%1.1f%%', startangle=90)
axes[1, 0].set_title('Sentence Type Distribution')

# Vocabulary growth: cumulative number of unique words as samples are read
# in dataset order (the dataset has no timestamps, so sample index is the x-axis)
unique_words_cumulative = []
seen_words = set()
for text in original_texts:
    seen_words.update(text.split())
    unique_words_cumulative.append(len(seen_words))

axes[1, 1].plot(unique_words_cumulative, color='purple')
axes[1, 1].set_xlabel('Sample Index')
axes[1, 1].set_ylabel('Cumulative Unique Words')
axes[1, 1].set_title('Vocabulary Growth')

plt.tight_layout()
plt.show()
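
The character and word counts above compare raw code points, so two strings that render identically but are encoded differently are counted as distinct. The sketch below shows one way to guard against that with Unicode NFC normalisation from the standard library. `normalize_bangla` is a hypothetical helper, not the project's `BanglaTextProcessor`, and the whitespace collapsing is an added assumption.

```python
import re
import unicodedata

def normalize_bangla(text):
    """Apply NFC normalisation and collapse whitespace runs to a single space."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

# Vowel sign O written as two code points (U+09C7 + U+09BE)
# versus the single precomposed code point U+09CB.
decomposed = "\u0995\u09c7\u09be"
precomposed = "\u0995\u09cb"

print(decomposed == precomposed)                                      # False
print(normalize_bangla(decomposed) == normalize_bangla(precomposed))  # True
```

Zero-width joiners are deliberately left untouched here, since they can affect how Bangla conjuncts render.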

In [None]:
# Adversarial Examples Exploration
print("⚔️ ADVERSARIAL EXAMPLES ANALYSIS")
print("=" * 50)

# Initialize adversarial attack generator
adversarial = AdversarialAttacks()

# Generate different types of adversarial examples
sample_original = "আমি বাংলাদেশে থাকি"
sample_punctuated = "আমি বাংলাদেশে থাকি।"

print("🎯 Original sample:")
print(f"  Without punctuation: {sample_original}")
print(f"  With punctuation: {sample_punctuated}")

print("\n🔄 Generating adversarial examples...")

# Character-level attacks
char_attacks = []
try:
    char_sub = adversarial.character_substitution(sample_original)
    char_attacks.append(("Character Substitution", char_sub))
    
    char_del = adversarial.character_deletion(sample_original)
    char_attacks.append(("Character Deletion", char_del))
    
    char_ins = adversarial.character_insertion(sample_original)
    char_attacks.append(("Character Insertion", char_ins))
    
except Exception as e:
    print(f"⚠️ Error in character attacks: {e}")

# Word-level attacks  
word_attacks = []
try:
    word_sub = adversarial.word_substitution(sample_original)
    word_attacks.append(("Word Substitution", word_sub))
    
    word_swap = adversarial.word_swap(sample_original)
    word_attacks.append(("Word Swap", word_swap))
    
except Exception as e:
    print(f"⚠️ Error in word attacks: {e}")

# Noise attacks
noise_attacks = []
try:
    space_noise = adversarial.add_spacing_noise(sample_original)
    noise_attacks.append(("Spacing Noise", space_noise))
    
    case_noise = adversarial.add_case_noise(sample_original)
    noise_attacks.append(("Case Noise", case_noise))
    
except Exception as e:
    print(f"⚠️ Error in noise attacks: {e}")

# Display attacks
all_attacks = char_attacks + word_attacks + noise_attacks
print(f"\n🎪 Generated {len(all_attacks)} adversarial examples:")

# Define the helper once, before the loop, so it also exists for the
# dataset-level analysis below even when no attacks were generated
def edit_distance(s1, s2):
    """Levenshtein distance between two strings, computed row by row."""
    if len(s1) < len(s2):
        return edit_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

for i, (attack_type, attacked_text) in enumerate(all_attacks, 1):
    print(f"\n{i}. {attack_type}:")
    print(f"   Original: {sample_original}")
    print(f"   Attacked: {attacked_text}")

    edit_dist = edit_distance(sample_original, attacked_text)
    print(f"   Edit Distance: {edit_dist}")

# Generate adversarial examples for multiple samples
print(f"\n📊 Generating adversarial examples for dataset analysis...")
adversarial_samples = []

for i, (orig, punct) in enumerate(dataset[:5]):  # Analyze first 5 samples
    try:
        # Apply different attack types
        char_sub = adversarial.character_substitution(orig)
        word_swap = adversarial.word_swap(orig)
        noise = adversarial.add_spacing_noise(orig)
        
        adversarial_samples.extend([
            (orig, char_sub, "char_substitution"),
            (orig, word_swap, "word_swap"),
            (orig, noise, "spacing_noise")
        ])
    except Exception as e:
        print(f"⚠️ Error processing sample {i}: {e}")

# Analyze adversarial impact
if adversarial_samples:
    edit_distances = []
    attack_types = []
    
    for orig, adv, attack_type in adversarial_samples:
        edit_dist = edit_distance(orig, adv)
        edit_distances.append(edit_dist)
        attack_types.append(attack_type)
    
    # Create analysis DataFrame
    adv_df = pd.DataFrame({
        'Attack_Type': attack_types,
        'Edit_Distance': edit_distances
    })
    
    print(f"\n📈 Adversarial attack impact analysis:")
    print(adv_df.groupby('Attack_Type')['Edit_Distance'].describe())
    
    # Visualize adversarial impact
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    sns.boxplot(data=adv_df, x='Attack_Type', y='Edit_Distance')
    plt.xticks(rotation=45)
    plt.title('Edit Distance by Attack Type')
    plt.ylabel('Edit Distance')
    
    plt.subplot(1, 2, 2)
    plt.hist(edit_distances, bins=10, alpha=0.7, color='red')
    plt.xlabel('Edit Distance')
    plt.ylabel('Frequency')
    plt.title('Distribution of Adversarial Edit Distances')
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ No adversarial samples generated for analysis")
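
Raw edit distances grow with sentence length, which makes attacks on short and long samples hard to compare. One common remedy is to divide by the length of the longer string. The sketch below assumes that normalisation; `levenshtein` and `normalized_edit_distance` are illustrative names, and the swapped sentence is written by hand rather than produced by `AdversarialAttacks`.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance over code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

original = "আমি বাংলাদেশে থাকি"
swapped = "বাংলাদেশে আমি থাকি"  # hand-written word swap, for illustration only

print(levenshtein(original, swapped))
print(round(normalized_edit_distance(original, swapped), 3))
```

Both functions count Unicode code points, so a Bangla consonant and its vowel sign count as two units; a grapheme-aware distance would need a different tokenisation.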

## 📋 Summary and Recommendations

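The findings and recommendations below summarize the exploration above. Read them against the printed statistics and plots: if the loader failed and the fallback samples were used, every figure describes five sentences repeated 20 times rather than a real corpus.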

### Key Findings:

1. **Dataset Characteristics:**
   - Sentence lengths and word counts vary across samples (see the length and word-count histograms)
   - Common punctuation marks include দাঁড়ি (।), comma (,), and question mark (?)
   - Most sentences are statements, followed by questions

2. **Text Patterns:**
   - The text draws on a wide range of characters from the Bangla Unicode block (U+0980 to U+09FF)
   - Word lengths vary significantly across samples
   - The vocabulary growth curve shows how diverse the vocabulary is; a curve that flattens early points to repeated sentences

3. **Adversarial Robustness:**
   - Character-level attacks can significantly alter text appearance
   - Word-level attacks preserve semantic meaning better
   - Models need to be robust against various noise types

### Recommendations:

1. **Data Preprocessing:**
   - Implement robust text cleaning for Unicode normalization
   - Handle edge cases with mixed punctuation styles
   - Consider augmentation strategies for minority punctuation types

2. **Model Training:**
   - Use balanced sampling for different sentence types
   - Implement adversarial training for robustness
   - Consider multi-task learning with related NLP tasks

3. **Evaluation:**
   - Test on adversarial examples during development
   - Evaluate punctuation-specific metrics
   - Monitor performance across different text lengths

4. **Future Work:**
   - Collect more diverse data sources
   - Implement domain adaptation techniques
   - Explore transformer-based architectures optimized for Bangla
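
As one possible reading of "punctuation-specific metrics", the sketch below computes precision, recall, and F1 per punctuation class from aligned token-level labels. It assumes gold and predicted labels come as equal-length lists with `"O"` meaning no punctuation; `per_class_prf` and the toy label sequences are illustrative, not output from any model in this project.

```python
from collections import Counter

def per_class_prf(gold, pred, ignore="O"):
    """Return {label: (precision, recall, f1)} for every label except `ignore`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it does not belong
            fn[g] += 1  # missed the gold label g
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        if label == ignore:
            continue
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy example: one comma is missed and one is predicted in the wrong place.
gold = ["O", ",", "O", ",", "।", "O", "?"]
pred = ["O", ",", "O", "O", "।", ",", "?"]

for label, (p, r, f) in per_class_prf(gold, pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting each mark separately keeps a frequent class such as দাঁড়ি from hiding weak performance on rarer marks, and the same function can be applied to subsets of the data bucketed by text length.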