# Combine Train and Test Data for Sinhala ASR

This notebook combines train.csv and test.csv files, extracts file paths and sentences, and saves the combined dataset to a new CSV file for Sinhala Automatic Speech Recognition (ASR) research.

## Overview
- Load train and test CSV files
- Extract file paths and sentences
- Combine datasets with source identification
- Save to new CSV file
- Display dataset statistics

## 1. Import Required Libraries

Import pandas for data manipulation. NumPy is imported alongside it, although the cells shown below do not call it directly.

In [30]:
import pandas as pd
import numpy as np

## 2. Read CSV Files

Load train.csv and test.csv with pandas, then display their shapes, column names, and first few rows.

In [31]:
# Read train CSV file
print("Reading train.csv...")
train_df = pd.read_csv('train.csv')
print(f"Train data shape: {train_df.shape}")
print(f"Train data columns: {list(train_df.columns)}")
print(f"First few rows of train data:")
train_df.head()

Reading train.csv...
Train data shape: (132574, 6)
Train data columns: ['Unnamed: 0', 'filename', 'x', 'sentence', 'full', 'file']
First few rows of train data:


Unnamed: 0.1,Unnamed: 0,filename,x,sentence,full,file
0,69861,70d725d61b,6e92f,ශ්‍රාවක චරිත නිදසුන් කොට පැහැදිලි කරන්න.,70d725d61b6e92f,asr_sinhala/data/70/70d725d61b.flac
1,77001,7f2e316d14,3b15d,වෙන්න පුළුවනි,7f2e316d143b15d,asr_sinhala/data/7f/7f2e316d14.flac
2,108545,bc470065df,4b5d0,එය තමයි ඔවුන්ට,bc470065df4b5d0,asr_sinhala/data/bc/bc470065df.flac
3,47802,4b48bbffc5,8e991,සප්ත ආර්ය ධනයෙහි එක් කොටසකි.,4b48bbffc58e991,asr_sinhala/data/4b/4b48bbffc5.flac
4,28441,2f2951d11c,936a6,මුදලාලි නිවසේ දොර විවෘත කිරීමෙන් පසු එය වසා නො...,2f2951d11c936a6,asr_sinhala/data/2f/2f2951d11c.flac


In [32]:
# Read test CSV file
print("Reading test.csv...")
test_df = pd.read_csv('test.csv')
print(f"Test data shape: {test_df.shape}")
print(f"Test data columns: {list(test_df.columns)}")
print(f"First few rows of test data:")
test_df.head()

Reading test.csv...
Test data shape: (23396, 6)
Test data columns: ['Unnamed: 0', 'filename', 'x', 'sentence', 'full', 'file']
First few rows of test data:


Unnamed: 0.1,Unnamed: 0,filename,x,sentence,full,file
0,147490,f44918e4dd,cb1fe,පිරිමින්ට දීල තියෙනවා,f44918e4ddcb1fe,asr_sinhala/data/f4/f44918e4dd.flac
1,149997,f7d50f4b2c,59778,එමෙන්ම මාර්ගය ඉදිකරන ස්ථානයට,f7d50f4b2c59778,asr_sinhala/data/f7/f7d50f4b2c.flac
2,12622,12d4081d0c,199c3,එතකොට සිරිගුත්ත හයියෙන් කෑ ගහලා,12d4081d0c199c3,asr_sinhala/data/12/12d4081d0c.flac
3,124199,d2a53f5f0a,e1e01,එතන ගොඩගැහෙන්නේ සුන්බුන් ගොඩක් පමණයි.,d2a53f5f0ae1e01,asr_sinhala/data/d2/d2a53f5f0a.flac
4,139037,e841b4dfe0,8f885,එක විශ්ව ධර්මය,e841b4dfe08f885,asr_sinhala/data/e8/e841b4dfe0.flac
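
The `Unnamed: 0` column in both files looks like a row index written out by an earlier `to_csv` call. Assuming that is what it is, one way to keep it out of the column list is to pass `index_col=0` when reading. The sketch below uses a two-row in-memory stand-in; `sample_csv` and its placeholder sentences are illustrative, not the real file.

```python
import io

import pandas as pd

# Two-row stand-in for train.csv; the first, unnamed column mimics a saved index.
sample_csv = io.StringIO(
    ",filename,x,sentence,full,file\n"
    "69861,70d725d61b,6e92f,sentence one,70d725d61b6e92f,asr_sinhala/data/70/70d725d61b.flac\n"
    "77001,7f2e316d14,3b15d,sentence two,7f2e316d143b15d,asr_sinhala/data/7f/7f2e316d14.flac\n"
)

# index_col=0 makes pandas treat the first column as the index, not as data.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(list(sample_df.columns))
print(sample_df.shape)
```

Since only `file` and `sentence` are used downstream, this is cosmetic for this notebook; it mainly keeps the column listing free of the stray index.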


## 3. Extract Required Columns

Extract 'file' and 'sentence' columns from both train and test datasets.

In [33]:
# Extract file path and sentence columns from both datasets
train_extracted = train_df[['file', 'sentence']].copy()
test_extracted = test_df[['file', 'sentence']].copy()

print(f"Train extracted shape: {train_extracted.shape}")
print(f"Test extracted shape: {test_extracted.shape}")
print(f"\nSample from train_extracted:")
print(train_extracted.head(3))
print(f"\nSample from test_extracted:")
print(test_extracted.head(3))

Train extracted shape: (132574, 2)
Test extracted shape: (23396, 2)

Sample from train_extracted:
                                  file  \
0  asr_sinhala/data/70/70d725d61b.flac   
1  asr_sinhala/data/7f/7f2e316d14.flac   
2  asr_sinhala/data/bc/bc470065df.flac   

                                   sentence  
0  ශ්‍රාවක චරිත නිදසුන් කොට පැහැදිලි කරන්න.  
1                             වෙන්න පුළුවනි  
2                            එය තමයි ඔවුන්ට  

Sample from test_extracted:
                                  file                         sentence
0  asr_sinhala/data/f4/f44918e4dd.flac            පිරිමින්ට දීල තියෙනවා
1  asr_sinhala/data/f7/f7d50f4b2c.flac     එමෙන්ම මාර්ගය ඉදිකරන ස්ථානයට
2  asr_sinhala/data/12/12d4081d0c.flac  එතකොට සිරිගුත්ත හයියෙන් කෑ ගහලා


## 4. Add Source Identification

Add a 'source' column to identify whether each record came from train or test dataset.

In [34]:
# Add a source column to identify which dataset each row came from
train_extracted['source'] = 'train'
test_extracted['source'] = 'test'

print("Train data with source column:")
print(train_extracted.head(3))
print(f"\nTest data with source column:")
print(test_extracted.head(3))

print(f"\nTrain records: {len(train_extracted)}")
print(f"Test records: {len(test_extracted)}")

Train data with source column:
                                  file  \
0  asr_sinhala/data/70/70d725d61b.flac   
1  asr_sinhala/data/7f/7f2e316d14.flac   
2  asr_sinhala/data/bc/bc470065df.flac   

                                   sentence source  
0  ශ්‍රාවක චරිත නිදසුන් කොට පැහැදිලි කරන්න.  train  
1                             වෙන්න පුළුවනි  train  
2                            එය තමයි ඔවුන්ට  train  

Test data with source column:
                                  file                         sentence source
0  asr_sinhala/data/f4/f44918e4dd.flac            පිරිමින්ට දීල තියෙනවා   test
1  asr_sinhala/data/f7/f7d50f4b2c.flac     එමෙන්ම මාර්ගය ඉදිකරන ස්ථානයට   test
2  asr_sinhala/data/12/12d4081d0c.flac  එතකොට සිරිගුත්ත හයියෙන් කෑ ගහලා   test

Train records: 132574
Test records: 23396


## 5. Combine Datasets

Use pandas concat to merge the train and test datasets into a single DataFrame.

In [35]:
# Combine both datasets
combined_df = pd.concat([train_extracted, test_extracted], ignore_index=True)

print(f"Combined data shape: {combined_df.shape}")
print(f"Total records: {len(combined_df)}")
print(f"Columns: {list(combined_df.columns)}")

# Verify the combination
print(f"\nData distribution by source:")
print(combined_df['source'].value_counts())

Combined data shape: (155970, 3)
Total records: 155970
Columns: ['file', 'sentence', 'source']

Data distribution by source:
source
train    132574
test      23396
Name: count, dtype: int64
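
The extract, label, and concatenate steps can also be written as a single expression. The following is a minimal sketch on tiny placeholder frames (`train_demo` and `test_demo` are hypothetical stand-ins for `train_df` and `test_df`), using `DataFrame.assign` so that the input frames are not modified.

```python
import pandas as pd

# Tiny stand-ins for train_df and test_df (placeholder values, not real data).
train_demo = pd.DataFrame({
    "file": ["a.flac", "b.flac"],
    "sentence": ["s1", "s2"],
    "full": ["x1", "x2"],
})
test_demo = pd.DataFrame({
    "file": ["c.flac"],
    "sentence": ["s3"],
    "full": ["x3"],
})

cols = ["file", "sentence"]

# assign() returns a new frame, so the inputs are left unmodified.
combined_demo = pd.concat(
    [
        train_demo[cols].assign(source="train"),
        test_demo[cols].assign(source="test"),
    ],
    ignore_index=True,
)

print(combined_demo)
```

With `ignore_index=True` the result gets a fresh 0-based index, so row labels from the two inputs cannot collide.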


## 6. Save Combined Data

Export the combined dataset to a new CSV file named 'combined_file_path_sentence.csv'.

In [36]:
# Save to new CSV file
output_file = 'combined_file_path_sentence.csv'
combined_df.to_csv(output_file, index=False)
print(f"Combined data saved to: {output_file}")

# Verify the file was created and check its size
import os
if os.path.exists(output_file):
    file_size = os.path.getsize(output_file) / (1024 * 1024)  # MB
    print(f"File size: {file_size:.2f} MB")
    print("✅ File saved successfully!")

Combined data saved to: combined_file_path_sentence.csv
File size: 19.25 MB
✅ File saved successfully!
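
Checking that the file exists does not confirm that the Sinhala text survived the write. A round-trip check is one way to verify that; the sketch below writes a one-row placeholder frame (`demo_df`, hypothetical) to a temporary directory and reads it back with an explicit UTF-8 encoding.

```python
import os
import tempfile

import pandas as pd

# One-row placeholder frame in the same layout as combined_df.
demo_df = pd.DataFrame({
    "file": ["asr_sinhala/data/7f/7f2e316d14.flac"],
    "sentence": ["වෙන්න පුළුවනි"],
    "source": ["train"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "combined_demo.csv")
    # Stating the encoding on both sides keeps the check platform independent.
    demo_df.to_csv(demo_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(demo_path, encoding="utf-8")

# Compare as plain Python records to avoid dtype-related false negatives.
round_trip_ok = reloaded.to_dict("records") == demo_df.to_dict("records")
print(f"Round trip preserved the data: {round_trip_ok}")
```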


## 7. Display Data Overview

Show the first and last rows of the combined dataset and print its shape, columns, and memory usage.

In [37]:
# Display first few rows
print("First 5 rows of combined data:")
print(combined_df.head())

print(f"\nLast 5 rows of combined data:")
print(combined_df.tail())

print(f"\nDataset info:")
print(f"- Shape: {combined_df.shape}")
print(f"- Columns: {list(combined_df.columns)}")
print(f"- Memory usage: {combined_df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

First 5 rows of combined data:
                                  file  \
0  asr_sinhala/data/70/70d725d61b.flac   
1  asr_sinhala/data/7f/7f2e316d14.flac   
2  asr_sinhala/data/bc/bc470065df.flac   
3  asr_sinhala/data/4b/4b48bbffc5.flac   
4  asr_sinhala/data/2f/2f2951d11c.flac   

                                            sentence source  
0           ශ්‍රාවක චරිත නිදසුන් කොට පැහැදිලි කරන්න.  train  
1                                      වෙන්න පුළුවනි  train  
2                                     එය තමයි ඔවුන්ට  train  
3                       සප්ත ආර්ය ධනයෙහි එක් කොටසකි.  train  
4  මුදලාලි නිවසේ දොර විවෘත කිරීමෙන් පසු එය වසා නො...  train  

Last 5 rows of combined data:
                                       file                    sentence source
155965  asr_sinhala/data/7f/7f65556b1e.flac      එදා දවසෙන් වැඩි කොටසක්   test
155966  asr_sinhala/data/dc/dcd0c96642.flac             little boy bomb   test
155967  asr_sinhala/data/e2/e21b29cf06.flac  අනෙකුත් වැඩිහිටියන්ට පමණයි   test

## 8. Generate Dataset Statistics

Calculate and display dataset summary statistics including total sentences, unique file paths, and average sentence length.

In [38]:
# Calculate comprehensive dataset statistics
print("📊 DATASET SUMMARY STATISTICS")
print("=" * 50)

# Basic counts
print(f"📁 Total sentences: {len(combined_df):,}")
print(f"🎯 Unique file paths: {combined_df['file'].nunique():,}")

# Text analysis
sentence_lengths = combined_df['sentence'].str.len()
print(f"\n📝 SENTENCE LENGTH ANALYSIS:")
print(f"   • Average: {sentence_lengths.mean():.1f} characters")
print(f"   • Median: {sentence_lengths.median():.1f} characters")
print(f"   • Min: {sentence_lengths.min()} characters")
print(f"   • Max: {sentence_lengths.max()} characters")
print(f"   • Std Dev: {sentence_lengths.std():.1f} characters")

# Source distribution
print(f"\n📊 DATA SOURCE DISTRIBUTION:")
source_counts = combined_df['source'].value_counts()
for source, count in source_counts.items():
    percentage = (count / len(combined_df)) * 100
    print(f"   • {source.capitalize()}: {count:,} records ({percentage:.1f}%)")

# Missing values check
print(f"\n🔍 DATA QUALITY:")
missing_files = combined_df['file'].isnull().sum()
missing_sentences = combined_df['sentence'].isnull().sum()
print(f"   • Missing file paths: {missing_files}")
print(f"   • Missing sentences: {missing_sentences}")

# File path analysis
print(f"\n📂 FILE PATH ANALYSIS:")
unique_directories = combined_df['file'].str.extract(r'data/([a-f0-9]{2})/')[0].nunique()
print(f"   • Unique directories: {unique_directories}")

print(f"\n✅ Dataset processing completed successfully!")

📊 DATASET SUMMARY STATISTICS
📁 Total sentences: 155,970
🎯 Unique file paths: 155,970

📝 SENTENCE LENGTH ANALYSIS:
   • Average: 34.1 characters
   • Median: 24.0 characters
   • Min: 2 characters
   • Max: 156305 characters
   • Std Dev: 799.2 characters

📊 DATA SOURCE DISTRIBUTION:
   • Train: 132,574 records (85.0%)
   • Test: 23,396 records (15.0%)

🔍 DATA QUALITY:
   • Missing file paths: 0
   • Missing sentences: 0

📂 FILE PATH ANALYSIS:
   • Unique directories: 238

✅ Dataset processing completed successfully!
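
A maximum of 156,305 characters against a median of 24 suggests that at least one extreme outlier is inflating the mean and standard deviation. The sketch below shows one way to look for such rows, using placeholder data; `demo_df` and `length_cutoff` are hypothetical, with the cutoff chosen to mirror the `max_length=500` default used by the filtering step later in this notebook.

```python
import pandas as pd

# Placeholder sentences: four short ones plus one absurdly long outlier.
demo_df = pd.DataFrame({
    "sentence": ["aa", "bbbb", "cccccc", "dddddddd", "e" * 5000],
})

lengths = demo_df["sentence"].str.len()

# Quantiles are far less sensitive to a single outlier than the mean or max.
print(lengths.quantile([0.5, 0.75]))
print(f"Mean length: {lengths.mean():.1f}")

# Hypothetical cutoff for illustration only.
length_cutoff = 500
outliers = demo_df[lengths > length_cutoff]
print(f"Rows longer than {length_cutoff} characters: {len(outliers)}")
```

On the real data, printing the `file` values of the outlier rows would show which recordings carry the oversized transcripts.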


# Data Preprocessing for Sinhala ASR Training

This section preprocesses the combined dataset to prepare it for Sinhala Automatic Speech Recognition (ASR) model training. The steps are text normalization and cleaning, audio file validation, quality filtering, character-level vocabulary construction, an 80/20 train/test split, and export.

## 9. Text Normalization and Cleaning

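Clean and normalize the Sinhala text while changing as little as possible: apply Unicode NFC normalization, collapse runs of three or more spaces, strip surrounding whitespace, and remove control characters (other than tab, newline, and carriage return) and format characters other than the zero-width joiner (ZWJ) and zero-width non-joiner (ZWNJ).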

In [39]:
import re
import unicodedata

def clean_sinhala_text(text):
    """
    Minimal cleaning for Sinhala text to preserve original structure
    Only removes truly problematic characters while preserving Sinhala integrity
    """
    if pd.isna(text) or text == "":
        return ""
    
    # Convert to string if not already
    text = str(text)
    
    # Only normalize Unicode to NFC (canonical composition) - preserve ZWJ and other important chars
    text = unicodedata.normalize('NFC', text)
    
    # Only normalize excessive whitespace (3+ spaces to single space)
    text = re.sub(r' {3,}', ' ', text)
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    # Only remove obviously problematic characters (control chars except common ones)
    # Preserve all Sinhala chars, ZWJ (\u200D), ZWNJ (\u200C), and normal punctuation
    # Remove only control characters that are not needed for text rendering
    text = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '', text)
    
    # Remove some rare Unicode categories that are clearly not text
    # But preserve Sinhala, Latin, punctuation, symbols, joiners, etc.
    cleaned_chars = []
    for char in text:
        category = unicodedata.category(char)
        # Remove only format characters that are not joiners
        if category == 'Cf' and char not in ['\u200C', '\u200D']:  # Preserve ZWNJ and ZWJ
            continue
        cleaned_chars.append(char)
    
    text = ''.join(cleaned_chars)
    
    return text.strip()

# Apply minimal cleaning to the sentences
print("🧹 Applying minimal cleaning to preserve Sinhala text integrity...")
combined_df['sentence_cleaned'] = combined_df['sentence'].apply(clean_sinhala_text)

# Compare before and after cleaning
print("\n📝 Text Cleaning Results:")
print("Before and after cleaning examples:")
for i in range(min(5, len(combined_df))):
    original = combined_df.iloc[i]['sentence']
    cleaned = combined_df.iloc[i]['sentence_cleaned']
    if original != cleaned:
        # Only changed rows show both versions
        print(f"{i+1}. Original: '{original}'")
        print(f"   Cleaned:  '{cleaned}'")
    else:
        print(f"{i+1}. Text preserved: '{cleaned}'")
    print()

# Check for empty sentences after cleaning
empty_after_cleaning = combined_df['sentence_cleaned'].str.len() == 0
print(f"⚠️  Empty sentences after cleaning: {empty_after_cleaning.sum()}")

if empty_after_cleaning.sum() > 0:
    print("Examples of sentences that became empty:")
    empty_examples = combined_df[empty_after_cleaning]['sentence'].head(5)
    for idx, sentence in empty_examples.items():
        print(f"  - '{sentence}'")

# Check how many texts were actually changed
texts_changed = (combined_df['sentence'] != combined_df['sentence_cleaned']).sum()
print(f"\n📊 Cleaning Impact:")
print(f"   • Texts modified: {texts_changed:,} out of {len(combined_df):,} ({texts_changed/len(combined_df)*100:.2f}%)")
print(f"   • Texts preserved: {len(combined_df) - texts_changed:,} ({(len(combined_df) - texts_changed)/len(combined_df)*100:.2f}%)")

# Update sentence length statistics after cleaning
cleaned_lengths = combined_df['sentence_cleaned'].str.len()
original_lengths = combined_df['sentence'].str.len()
print(f"\n📊 Length Comparison:")
print(f"   • Original avg length: {original_lengths.mean():.1f} characters")
print(f"   • Cleaned avg length: {cleaned_lengths.mean():.1f} characters")
print(f"   • Length difference: {(cleaned_lengths.mean() - original_lengths.mean()):.1f} characters")
print(f"   • Min length: {cleaned_lengths.min()} characters")
print(f"   • Max length: {cleaned_lengths.max()} characters")

🧹 Applying minimal cleaning to preserve Sinhala text integrity...

📝 Text Cleaning Results:
Before and after cleaning examples:
1. Text preserved: 'ශ්‍රාවක චරිත නිදසුන් කොට පැහැදිලි කරන්න.'

2. Text preserved: 'වෙන්න පුළුවනි'

3. Text preserved: 'එය තමයි ඔවුන්ට'

4. Text preserved: 'සප්ත ආර්ය ධනයෙහි එක් කොටසකි.'

5. Text preserved: 'මුදලාලි නිවසේ දොර විවෘත කිරීමෙන් පසු එය වසා නොදමා ගෙතුළට ගොස්‌ තිබේ.'

⚠️  Empty sentences after cleaning: 0

📊 Cleaning Impact:
   • Texts modified: 14 out of 155,970 (0.01%)
   • Texts preserved: 155,956 (99.99%)

📊 Length Comparison:
   • Original avg length: 34.1 characters
   • Cleaned avg length: 34.1 characters
   • Length difference: -0.0 characters
   • Min length: 2 characters
   • Max length: 156305 characters
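
The reason for preserving ZWJ is that Sinhala conjuncts depend on it: the cluster in "ශ්‍රාවක" is encoded as a consonant, al-lakuna, ZWJ, and a second consonant, and dropping the joiner changes how the cluster renders. The sketch below isolates the format-character step on a hand-built string. `strip_format_chars` is a hypothetical helper written for this illustration, not the notebook's `clean_sinhala_text`.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"


def strip_format_chars(text):
    """Drop Unicode format (Cf) characters except the two joiners."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cf" or ch in (ZWNJ, ZWJ)
    )


# U+0DC1 (sha) + U+0DCA (al-lakuna) + ZWJ + U+0DBB (ra): a conjunct cluster.
conjunct = "\u0dc1\u0dca\u200d\u0dbb"

# Surround it with a byte order mark and a zero width space, both category Cf.
noisy = "\ufeff" + conjunct + "\u200b"

cleaned = strip_format_chars(noisy)
print([f"U+{ord(ch):04X}" for ch in cleaned])
```

The byte order mark and zero width space are removed while the joiner inside the cluster is kept, which is the behaviour the cleaning function aims for.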


## 10. Audio File Validation

Check that the audio files referenced by the dataset exist on disk. For speed, only a random sample is checked, and the files are not opened or decoded.

In [40]:
import os
from pathlib import Path

def validate_audio_files(df, sample_size=1000):
    """
    Validate existence of audio files
    """
    print("🔍 Validating audio files...")
    
    # Check if files exist (sample for performance)
    sample_df = df.sample(n=min(sample_size, len(df)), random_state=42)
    
    existing_files = []
    missing_files = []
    
    for idx, row in sample_df.iterrows():
        file_path = row['file']
        if os.path.exists(file_path):
            existing_files.append(file_path)
        else:
            missing_files.append(file_path)
    
    return existing_files, missing_files

# Validate a sample of audio files
existing, missing = validate_audio_files(combined_df, sample_size=100)

print(f"📁 Audio File Validation Results (sample of 100):")
print(f"   • Files found: {len(existing)}")
print(f"   • Files missing: {len(missing)}")

if missing:
    print(f"\n⚠️  Missing files examples:")
    for i, file_path in enumerate(missing[:5]):
        print(f"   {i+1}. {file_path}")

# Check unique file extensions
file_extensions = combined_df['file'].str.extract(r'\.([a-zA-Z0-9]+)$')[0].value_counts()
print(f"\n📊 File Extensions Distribution:")
for ext, count in file_extensions.items():
    print(f"   • .{ext}: {count:,} files")

# Create a column for absolute file paths (if needed)
combined_df['file_absolute'] = combined_df['file'].apply(lambda x: os.path.abspath(x) if os.path.exists(x) else x)

🔍 Validating audio files...
📁 Audio File Validation Results (sample of 100):
   • Files found: 100
   • Files missing: 0

📊 File Extensions Distribution:
   • .flac: 155,970 files
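
The check above resolves each path relative to the current working directory and looks at a sample only. A possible variant, sketched below under the assumption that the audio lives under a known root directory, checks every row and returns the rows whose file is missing. `find_missing_audio` and its `audio_root` parameter are hypothetical names introduced for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd


def find_missing_audio(df, audio_root):
    """Return the rows of df whose 'file' does not exist under audio_root."""
    root = Path(audio_root)
    exists = df["file"].map(lambda rel: (root / rel).is_file())
    return df[~exists]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one empty placeholder file; the second path is left absent.
    present = Path(tmp_dir) / "data" / "70" / "present.flac"
    present.parent.mkdir(parents=True)
    present.touch()

    demo_df = pd.DataFrame({
        "file": ["data/70/present.flac", "data/7f/absent.flac"],
    })
    missing_demo = find_missing_audio(demo_df, tmp_dir)

print(missing_demo["file"].tolist())
```

Like the notebook's check, this confirms existence only; it does not verify that a file is a decodable FLAC stream.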



## 11. Data Filtering and Quality Control

Filter out problematic samples and ensure data quality for ASR training.

In [41]:
def filter_data_for_asr(df, min_length=2, max_length=500):
    """
    Filter data based on quality criteria for ASR training
    """
    print("🔄 Applying quality filters...")
    
    initial_count = len(df)
    
    # Filter 1: Remove empty sentences
    df_filtered = df[df['sentence_cleaned'].str.len() > 0].copy()
    empty_removed = initial_count - len(df_filtered)
    
    # Filter 2: Remove sentences that are too short or too long
    count_before = len(df_filtered)
    lengths = df_filtered['sentence_cleaned'].str.len()
    df_filtered = df_filtered[(lengths >= min_length) & (lengths <= max_length)].copy()
    length_filtered = count_before - len(df_filtered)
    
    # Filter 3: Remove sentences with mostly non-Sinhala characters
    def is_mostly_sinhala(text):
        if not text:
            return False
        sinhala_chars = len(re.findall(r'[\u0D80-\u0DFF]', text))
        total_chars = len(re.findall(r'[^\s]', text))  # Non-whitespace chars
        return total_chars > 0 and (sinhala_chars / total_chars) >= 0.5
    
    count_before = len(df_filtered)
    df_filtered = df_filtered[df_filtered['sentence_cleaned'].apply(is_mostly_sinhala)].copy()
    non_sinhala_removed = count_before - len(df_filtered)
    
    # Filter 4: Remove duplicates based on cleaned sentences
    # (combined_df is in train-then-test order, so keep='first' favours train rows)
    count_before = len(df_filtered)
    df_filtered = df_filtered.drop_duplicates(subset=['sentence_cleaned'], keep='first').copy()
    duplicates_removed = count_before - len(df_filtered)
    
    print(f"📊 Filtering Results:")
    print(f"   • Initial records: {initial_count:,}")
    print(f"   • Empty sentences removed: {empty_removed:,}")
    print(f"   • Length filtered: {length_filtered:,}")
    print(f"   • Non-Sinhala removed: {non_sinhala_removed:,}")
    print(f"   • Duplicates removed: {duplicates_removed:,}")
    print(f"   • Final records: {len(df_filtered):,}")
    print(f"   • Retention rate: {len(df_filtered)/initial_count*100:.1f}%")
    
    return df_filtered

# Apply filtering
filtered_df = filter_data_for_asr(combined_df)

# Update statistics after filtering
print(f"\n📈 Post-filtering Statistics:")
filtered_lengths = filtered_df['sentence_cleaned'].str.len()
print(f"   • Average length: {filtered_lengths.mean():.1f} characters")
print(f"   • Median length: {filtered_lengths.median():.1f} characters")
print(f"   • Length range: {filtered_lengths.min()} - {filtered_lengths.max()} characters")

# Show distribution by source after filtering
print(f"\n📊 Source Distribution After Filtering:")
source_dist = filtered_df['source'].value_counts()
for source, count in source_dist.items():
    percentage = (count / len(filtered_df)) * 100
    print(f"   • {source.capitalize()}: {count:,} records ({percentage:.1f}%)")

🔄 Applying quality filters...
📊 Filtering Results:
   • Initial records: 155,970
   • Empty sentences removed: 0
   • Length filtered: 33
   • Non-Sinhala removed: 5,751
   • Duplicates removed: 59,992
   • Final records: 90,194
   • Retention rate: 57.8%

📈 Post-filtering Statistics:
   • Average length: 26.7 characters
   • Median length: 25.0 characters
   • Length range: 2 - 132 characters

📊 Source Distribution After Filtering:
   • Train: 82,166 records (91.1%)
   • Test: 8,028 records (8.9%)
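
The Sinhala-ratio filter is the least obvious of the four, so a standalone sketch may help. It computes the fraction of non-whitespace characters that fall in the Sinhala Unicode block (U+0D80 to U+0DFF) and applies the same 0.5 threshold. `sinhala_ratio` is a hypothetical helper written for this illustration; it mirrors the logic of `is_mostly_sinhala` but returns the ratio itself.

```python
import re

SINHALA_RE = re.compile(r"[\u0D80-\u0DFF]")
NON_SPACE_RE = re.compile(r"\S")


def sinhala_ratio(text):
    """Fraction of non-whitespace characters that fall in the Sinhala block."""
    total = len(NON_SPACE_RE.findall(text))
    if total == 0:
        return 0.0
    return len(SINHALA_RE.findall(text)) / total


# The third example is two Sinhala letters followed by two Latin letters.
examples = ["එය තමයි", "little boy bomb", "\u0d91\u0dba ok"]
for text in examples:
    ratio = sinhala_ratio(text)
    verdict = "keep" if ratio >= 0.5 else "drop"
    print(f"{ratio:.2f} {verdict} {text!r}")
```

Note that joiners such as ZWJ count as non-whitespace but lie outside the Sinhala block, so they lower the ratio slightly for text that contains conjuncts.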


## 12. Character-Level Analysis

Analyze the character distribution for ASR vocabulary preparation.

In [42]:
from collections import Counter

def analyze_characters(df):
    """
    Analyze character distribution in the dataset
    """
    print("🔤 Analyzing character distribution...")
    
    # Combine all text
    all_text = ' '.join(df['sentence_cleaned'].tolist())
    
    # Count characters
    char_counts = Counter(all_text)
    
    # Separate different character types
    sinhala_chars = {}
    punctuation_chars = {}
    latin_chars = {}
    other_chars = {}
    
    for char, count in char_counts.items():
        if '\u0D80' <= char <= '\u0DFF':  # Sinhala Unicode range
            sinhala_chars[char] = count
        elif char.isalpha() and ord(char) < 128:  # Basic Latin
            latin_chars[char] = count
        elif char in '.,!?;:"\'-()[]{}':  # Common punctuation
            punctuation_chars[char] = count
        elif char != ' ':  # Skip spaces
            other_chars[char] = count
    
    return sinhala_chars, latin_chars, punctuation_chars, other_chars, char_counts

# Analyze character distribution
sinhala_chars, latin_chars, punct_chars, other_chars, all_chars = analyze_characters(filtered_df)

print(f"📊 Character Distribution Summary:")
print(f"   • Total unique characters: {len(all_chars):,}")
print(f"   • Sinhala characters: {len(sinhala_chars):,}")
print(f"   • Latin characters: {len(latin_chars):,}")
print(f"   • Punctuation: {len(punct_chars):,}")
print(f"   • Other characters: {len(other_chars):,}")

# Show top Sinhala characters
print(f"\n🇱🇰 Top 20 Sinhala Characters:")
top_sinhala = sorted(sinhala_chars.items(), key=lambda x: x[1], reverse=True)[:20]
for i, (char, count) in enumerate(top_sinhala, 1):
    print(f"   {i:2d}. '{char}' : {count:,}")

# Show punctuation distribution
if punct_chars:
    print(f"\n📝 Punctuation Distribution:")
    for char, count in sorted(punct_chars.items(), key=lambda x: x[1], reverse=True):
        print(f"   • '{char}' : {count:,}")

# Create vocabulary for ASR
def create_vocab(char_counts, min_frequency=10):
    """Create vocabulary list for ASR training"""
    vocab = []
    
    # Add special tokens
    special_tokens = ['<pad>', '<unk>', '<sos>', '<eos>']
    vocab.extend(special_tokens)
    
    # Add space
    vocab.append(' ')
    
    # Add characters with frequency >= min_frequency
    for char, count in sorted(char_counts.items(), key=lambda x: x[1], reverse=True):
        if char != ' ' and count >= min_frequency:
            vocab.append(char)
    
    return vocab

# Create vocabulary
vocab = create_vocab(all_chars, min_frequency=5)
print(f"\n📚 ASR Vocabulary:")
print(f"   • Vocabulary size: {len(vocab)}")
print(f"   • Characters included: {vocab[:20]}...")  # Show first 20

# Build a vocabulary table (written to disk later, in the export step)
vocab_df = pd.DataFrame({'character': vocab, 'index': range(len(vocab))})
print(f"\n💾 Vocabulary created with {len(vocab)} characters")

🔤 Analyzing character distribution...
📊 Character Distribution Summary:
   • Total unique characters: 107
   • Sinhala characters: 76
   • Latin characters: 0
   • Punctuation: 8
   • Other characters: 22

🇱🇰 Top 20 Sinhala Characters:
    1. '්' : 179,293
    2. 'න' : 164,429
    3. 'ි' : 146,687
    4. 'ව' : 116,263
    5. 'ක' : 107,634
    6. 'ය' : 103,131
    7. 'ම' : 98,461
    8. 'ත' : 94,662
    9. 'ා' : 94,047
   10. 'ර' : 79,568
   11. 'ු' : 75,894
   12. 'ස' : 64,996
   13. 'ද' : 61,699
   14. 'ෙ' : 55,223
   15. 'ප' : 51,924
   16. 'ල' : 51,695
   17. 'ේ' : 49,952
   18. 'හ' : 47,500
   19. 'ට' : 45,513
   20. 'ග' : 39,015

📝 Punctuation Distribution:
   • '.' : 23,771
   • '?' : 1,882
   • ',' : 1,323
   • '!' : 263
   • ''' : 142
   • '"' : 100
   • ';' : 2
   • ':' : 1

📚 ASR Vocabulary:
   • Vocabulary size: 101
   • Characters included: ['<pad>', '<unk>', '<sos>', '<eos>', ' ', '්', 'න', 'ි', 'ව', 'ක', 'ය', 'ම', 'ත', 'ා', 'ර', 'ු', 'ස', 'ද', 'ෙ', 'ප']...

💾 Vocabulary created with 101 characters
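
To show how a character vocabulary of this shape is typically used, the sketch below maps text to index sequences and back. It is an assumption about downstream use rather than part of the notebook: `demo_vocab` is a hypothetical miniature vocabulary in the same layout (special tokens, then space, then characters), and `encode` and `decode` are illustrative helpers.

```python
# Hypothetical miniature vocabulary in the same layout as the notebook's:
# special tokens first, then space, then characters.
demo_vocab = ["<pad>", "<unk>", "<sos>", "<eos>", " ", "a", "b", "c"]

char_to_id = {ch: i for i, ch in enumerate(demo_vocab)}
unk_id = char_to_id["<unk>"]


def encode(text):
    """Map each character to its index, falling back to <unk>."""
    return [char_to_id.get(ch, unk_id) for ch in text]


def decode(ids):
    """Map indices back to characters, skipping <pad>, <sos>, and <eos>."""
    skip = {char_to_id[t] for t in ("<pad>", "<sos>", "<eos>")}
    return "".join(demo_vocab[i] for i in ids if i not in skip)


ids = encode("ab cz")
print(ids)
print(decode(ids))
```

Characters that fall below `min_frequency` are absent from the vocabulary, so they would be encoded as `<unk>` in the same way as `z` here.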

## 13. Create Training and Test Splits (80/20)

Split the preprocessed data into 80% training and 20% testing sets for ASR model training.

In [43]:
from sklearn.model_selection import train_test_split

def create_asr_splits(df, test_size=0.20, random_state=42):
    """
    Create train/test splits for ASR training (80/20)
    """
    print("🔄 Creating train/test splits (80/20)...")
    
    # Split into train (80%) and test (20%)
    train_df, test_df = train_test_split(
        df, 
        test_size=test_size, 
        random_state=random_state,
        stratify=df['source']  # Maintain source distribution
    )
    
    return train_df, test_df

# Create splits
train_data, test_data = create_asr_splits(filtered_df)

print(f"📊 Data Split Summary:")
print(f"   • Training set: {len(train_data):,} samples ({len(train_data)/len(filtered_df)*100:.1f}%)")
print(f"   • Test set: {len(test_data):,} samples ({len(test_data)/len(filtered_df)*100:.1f}%)")

# Check source distribution in each split
print(f"\n📈 Source Distribution by Split:")
for split_name, split_data in [("Training", train_data), ("Test", test_data)]:
    source_counts = split_data['source'].value_counts()
    print(f"\n{split_name}:")
    for source, count in source_counts.items():
        percentage = (count / len(split_data)) * 100
        print(f"   • {source.capitalize()}: {count:,} ({percentage:.1f}%)")

# Check sentence length distribution across splits
print(f"\n📏 Average Sentence Length by Split:")
for split_name, split_data in [("Training", train_data), ("Test", test_data)]:
    avg_length = split_data['sentence_cleaned'].str.len().mean()
    print(f"   • {split_name}: {avg_length:.1f} characters")

# Create a validation subset from training data if needed for model development
print(f"\n💡 Note: You can create a validation subset from training data during model training if needed.")
print(f"   Recommended: Use 10-15% of training data for validation during training.")

🔄 Creating train/test splits (80/20)...
📊 Data Split Summary:
   • Training set: 72,155 samples (80.0%)
   • Test set: 18,039 samples (20.0%)

📈 Source Distribution by Split:

Training:
   • Train: 65,733 (91.1%)
   • Test: 6,422 (8.9%)

Test:
   • Train: 16,433 (91.1%)
   • Test: 1,606 (8.9%)

📏 Average Sentence Length by Split:
   • Training: 26.7 characters
   • Test: 26.7 characters

💡 Note: You can create a validation subset from training data during model training if needed.
   Recommended: Use 10-15% of training data for validation during training.
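
The note above suggests holding out part of the training data for validation. A minimal sketch of one way to do that with pandas alone follows; `train_demo` is a placeholder frame and `val_fraction` is a hypothetical setting at the low end of the suggested range.

```python
import pandas as pd

# Placeholder training frame with 100 rows.
train_demo = pd.DataFrame({
    "file": [f"clip_{i}.flac" for i in range(100)],
    "sentence_cleaned": [f"sentence {i}" for i in range(100)],
})

# Hypothetical 10% validation fraction.
val_fraction = 0.10

# A fixed random_state makes the held-out rows reproducible across runs.
val_demo = train_demo.sample(frac=val_fraction, random_state=42)
# Everything not sampled for validation stays in the training subset.
train_subset = train_demo.drop(index=val_demo.index)

print(len(train_subset), len(val_demo))
```

Dropping by index relies on the index being unique, which holds here because the frame has a default 0-based index.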

## 14. Export Preprocessed Data

Save the preprocessed and split data in formats suitable for ASR training.

In [44]:
import json

# Create output directory for processed data
output_dir = "processed_asr_data"
os.makedirs(output_dir, exist_ok=True)

print(f"💾 Exporting preprocessed data to '{output_dir}' directory...")

# Export splits as CSV files
train_data.to_csv(f"{output_dir}/train_data.csv", index=False)
test_data.to_csv(f"{output_dir}/test_data.csv", index=False)

# Export vocabulary
vocab_df.to_csv(f"{output_dir}/vocabulary.csv", index=False)

# Create metadata file
metadata = {
    "dataset_info": {
        "total_samples": len(filtered_df),
        "train_samples": len(train_data),
        "test_samples": len(test_data),
        "train_percentage": round(len(train_data)/len(filtered_df)*100, 1),
        "test_percentage": round(len(test_data)/len(filtered_df)*100, 1),
        "vocabulary_size": len(vocab),
        "avg_sentence_length": float(filtered_df['sentence_cleaned'].str.len().mean()),
        "max_sentence_length": int(filtered_df['sentence_cleaned'].str.len().max()),
        "min_sentence_length": int(filtered_df['sentence_cleaned'].str.len().min())
    },
    "split_strategy": "80% training, 20% testing",
    "preprocessing_steps": [
        "Unicode NFC normalization",
        "Whitespace normalization (runs of 3+ spaces collapsed, ends stripped)",
        "Control and format character removal (ZWJ and ZWNJ preserved)",
        "Empty sentence removal",
        "Length filtering (2-500 characters)",
        "Sinhala character ratio filtering (>=50%)",
        "Duplicate removal"
    ],
    "file_info": {
        "audio_format": "flac",
        "text_encoding": "utf-8",
        "language": "Sinhala (si)"
    }
}

# Save metadata
with open(f"{output_dir}/metadata.json", 'w', encoding='utf-8') as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

# Create manifest files for ASR training (common format)
def create_manifest(data, filename):
    """Create manifest file in JSON Lines format for ASR training"""
    with open(f"{output_dir}/{filename}", 'w', encoding='utf-8') as f:
        for _, row in data.iterrows():
            manifest_entry = {
                "audio_filepath": row['file'],
                "text": row['sentence_cleaned'],
                "duration": -1,  # To be filled by audio processing
                "source": row['source']
            }
            f.write(json.dumps(manifest_entry, ensure_ascii=False) + '\n')

# Create manifest files
create_manifest(train_data, "train_manifest.jsonl")
create_manifest(test_data, "test_manifest.jsonl")

print(f"✅ Export completed! Files created:")
print(f"   📄 CSV Files:")
print(f"      • train_data.csv ({len(train_data):,} samples - 80%)")
print(f"      • test_data.csv ({len(test_data):,} samples - 20%)")
print(f"      • vocabulary.csv ({len(vocab)} characters)")
print(f"   📄 Manifest Files (JSONL format):")
print(f"      • train_manifest.jsonl")
print(f"      • test_manifest.jsonl")
print(f"   📄 Metadata:")
print(f"      • metadata.json")

# Show file sizes
total_size = 0
for filename in os.listdir(output_dir):
    file_path = os.path.join(output_dir, filename)
    if os.path.isfile(file_path):
        size_mb = os.path.getsize(file_path) / (1024 * 1024)
        total_size += size_mb
        print(f"      📁 {filename}: {size_mb:.2f} MB")

print(f"\n📊 Total exported data size: {total_size:.2f} MB")
print(f"\n🎯 Data is ready for Sinhala ASR training with 80/20 split!")

# Provide guidance on validation
print(f"\n💡 Training Tips:")
print(f"   • Use the full training set (80%) for model training")
print(f"   • Reserve test set (20%) for final evaluation only")
print(f"   • Consider using cross-validation or a subset of training data for validation during development")
print(f"   • Many ASR frameworks can automatically split training data into train/val during training")

💾 Exporting preprocessed data to 'processed_asr_data' directory...
✅ Export completed! Files created:
   📄 CSV Files:
      • train_data.csv (72,155 samples - 80%)
      • test_data.csv (18,039 samples - 20%)
      • vocabulary.csv (101 characters)
   📄 Manifest Files (JSONL format):
      • train_manifest.jsonl
      • test_manifest.jsonl
   📄 Metadata:
      • metadata.json
      📁 metadata.json: 0.00 MB
      📁 test_data.csv: 4.92 MB
      📁 test_manifest.jsonl: 3.07 MB
      📁 train_data.csv: 19.67 MB
      📁 train_manifest.jsonl: 12.27 MB
      📁 vocabulary.csv: 0.00 MB

📊 Total exported data size: 39.94 MB

🎯 Data is ready for Sinhala ASR training with 80/20 split!

💡 Training Tips:
   • Use the full training set (80%) for model training
   • Reserve test set (20%) for final evaluation only
   • Consider using cross-validation or a subset of training data for validation during development
   • Many ASR frameworks can automatically split training data into train/val during training
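
The tips above suggest holding out part of the training data for validation. Here is a minimal sketch of one way to do that with pandas, assuming a random hold-out and a fixed seed. The names `split_train_val`, `val_fraction`, and `seed` are illustrative and not part of the notebook, and the toy frame only stands in for `train_data`.

```python
import pandas as pd

def split_train_val(train_df, val_fraction=0.1, seed=42):
    """Hold out a random fraction of the training rows for validation."""
    # Sample validation rows reproducibly, then drop them from the training frame.
    # Dropping by index assumes the frame's index labels are unique.
    val_df = train_df.sample(frac=val_fraction, random_state=seed)
    train_sub = train_df.drop(index=val_df.index)
    return train_sub, val_df

# Toy frame standing in for train_data (hypothetical paths and text)
demo = pd.DataFrame({
    "file": [f"clip_{i}.flac" for i in range(20)],
    "sentence_cleaned": [f"sentence {i}" for i in range(20)],
})
train_sub, val_df = split_train_val(demo, val_fraction=0.25)
print(len(train_sub), len(val_df))
```

Because the validation rows come only from the training portion, the 20% test set stays untouched for final evaluation.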

## 15. Create Clean Files with Essential Columns Only

Keep only the two columns ASR training needs (`file` and `sentence_cleaned`) from the train and test sets, save them as separate clean CSV files, and report how much disk space the reduction saves.

In [45]:
# Create clean versions with only essential columns
print("🧹 Creating clean versions of train and test files...")

# Extract only file and sentence_cleaned columns from both datasets
train_clean = train_data[['file', 'sentence_cleaned']].copy()
test_clean = test_data[['file', 'sentence_cleaned']].copy()

print(f"📊 Clean Data Summary:")
print(f"   • Training data: {len(train_clean):,} samples")
print(f"   • Test data: {len(test_clean):,} samples")

# Save clean versions
train_clean.to_csv(f"{output_dir}/train_data_clean.csv", index=False)
test_clean.to_csv(f"{output_dir}/test_data_clean.csv", index=False)

print(f"\n💾 Clean files saved:")
print(f"   • train_data_clean.csv")
print(f"   • test_data_clean.csv")

# Display first few rows of clean data
print(f"\n📝 Sample from clean training data:")
print(train_clean.head(3))

print(f"\n📝 Sample from clean test data:")
print(test_clean.head(3))

# Calculate file sizes
train_clean_size = os.path.getsize(f"{output_dir}/train_data_clean.csv") / (1024 * 1024)
test_clean_size = os.path.getsize(f"{output_dir}/test_data_clean.csv") / (1024 * 1024)

print(f"\n📊 Clean File Sizes:")
print(f"   • train_data_clean.csv: {train_clean_size:.2f} MB")
print(f"   • test_data_clean.csv: {test_clean_size:.2f} MB")
print(f"   • Total clean size: {train_clean_size + test_clean_size:.2f} MB")

# Show space savings
original_train_size = os.path.getsize(f"{output_dir}/train_data.csv") / (1024 * 1024)
original_test_size = os.path.getsize(f"{output_dir}/test_data.csv") / (1024 * 1024)
total_original = original_train_size + original_test_size
total_clean = train_clean_size + test_clean_size
space_saved = total_original - total_clean

print(f"\n💾 Space Optimization:")
print(f"   • Original total size: {total_original:.2f} MB")
print(f"   • Clean total size: {total_clean:.2f} MB")
print(f"   • Space saved: {space_saved:.2f} MB ({space_saved/total_original*100:.1f}%)")

print(f"\n✅ Clean files ready for ASR training!")

🧹 Creating clean versions of train and test files...
📊 Clean Data Summary:
   • Training data: 72,155 samples
   • Test data: 18,039 samples

💾 Clean files saved:
   • train_data_clean.csv
   • test_data_clean.csv

📝 Sample from clean training data:
                                       file  \
9583    asr_sinhala/data/98/983e3c6613.flac   
83995   asr_sinhala/data/29/29ab15c6d4.flac   
110010  asr_sinhala/data/b0/b0072f9ac0.flac   

                                     sentence_cleaned  
9583     එය මිලිටරිය තුළ ති‍යන ප්‍රධානම ප්‍රතිමානයක්.  
83995   සාහිත්‍යකරුවාට ඊට වැඩිය ලොකු වගකීමක් තියෙනවා.  
110010                        ඕගොල්ලන්ට දකින්න ලැබෙයි  

📝 Sample from clean test data:
                                       file  \
116168  asr_sinhala/data/b6/b63bfb7a34.flac   
48318   asr_sinhala/data/f1/f134378295.flac   
71784   asr_sinhala/data/d3/d3bf1e10b0.flac   
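
Before training on the clean files, it may be worth confirming that the two remaining columns are actually usable. The sketch below counts missing paths, blank transcripts, and repeated audio paths. `check_clean_frame` is a hypothetical helper, and the toy frame only imitates the `file` / `sentence_cleaned` layout of `train_clean`.

```python
import pandas as pd

def check_clean_frame(df):
    """Count rows that would likely cause trouble during ASR training."""
    text = df["sentence_cleaned"]
    # A transcript counts as empty if it is missing or only whitespace.
    empty_text = text.isna() | (text.fillna("").str.strip() == "")
    return {
        "rows": len(df),
        "missing_file": int(df["file"].isna().sum()),
        "empty_text": int(empty_text.sum()),
        "duplicate_files": int(df["file"].duplicated().sum()),
    }

# Toy frame with one problem of each kind (hypothetical values)
demo = pd.DataFrame({
    "file": ["a.flac", "b.flac", "b.flac", None],
    "sentence_cleaned": ["hello", "  ", "world", None],
})
report = check_clean_frame(demo)
print(report)
```

On the real data, something like `check_clean_frame(train_clean)` would be expected to report zeros for every problem count if the earlier filtering steps did their job.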

                            

## 16. Update Manifest Files for Clean Data

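Write new JSONL manifest files from the clean DataFrames. Each entry keeps only `audio_filepath`, `text`, and a placeholder `duration`; the `source` field used in the earlier manifests is dropped. A matching `metadata_clean.json` is saved alongside them.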

In [46]:
# Create clean manifest files for ASR training
def create_clean_manifest(data, filename):
    """Create manifest file using clean data structure"""
    with open(f"{output_dir}/{filename}", 'w', encoding='utf-8') as f:
        for _, row in data.iterrows():
            manifest_entry = {
                "audio_filepath": row['file'],
                "text": row['sentence_cleaned'],
                "duration": -1  # To be filled by audio processing
            }
            f.write(json.dumps(manifest_entry, ensure_ascii=False) + '\n')

print("📄 Creating clean manifest files...")

# Create clean manifest files
create_clean_manifest(train_clean, "train_manifest_clean.jsonl")
create_clean_manifest(test_clean, "test_manifest_clean.jsonl")

print(f"✅ Clean manifest files created:")
print(f"   • train_manifest_clean.jsonl")
print(f"   • test_manifest_clean.jsonl")

# Update metadata for clean version
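clean_metadata = {
    "dataset_info": {
        "total_samples": len(train_clean) + len(test_clean),
        "train_samples": len(train_clean),
        "test_samples": len(test_clean),
        # Derive the percentages from the actual counts so they stay correct if the split changes
        "train_percentage": round(100 * len(train_clean) / (len(train_clean) + len(test_clean)), 1),
        "test_percentage": round(100 * len(test_clean) / (len(train_clean) + len(test_clean)), 1),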
        "vocabulary_size": len(vocab),
        "avg_sentence_length": float(pd.concat([train_clean['sentence_cleaned'], test_clean['sentence_cleaned']]).str.len().mean()),
        "file_format": "clean (file, sentence_cleaned only)"
    },
    "split_strategy": "80% training, 20% testing",
    "data_structure": {
        "columns": ["file", "sentence_cleaned"],
        "description": "Minimal structure for efficient ASR training"
    },
    "preprocessing_steps": [
        "Text normalization and cleaning",
        "Unicode normalization (NFC)",
        "Minimal whitespace normalization",
        "Control character removal",
        "Empty sentence removal",
        "Length filtering (2-500 characters)",
        "Sinhala character ratio filtering (>=50%)",
        "Duplicate removal",
        "Column reduction to essentials only"
    ],
    "file_info": {
        "audio_format": "flac",
        "text_encoding": "utf-8",
        "language": "Sinhala (si)"
    }
}

# Save clean metadata
with open(f"{output_dir}/metadata_clean.json", 'w', encoding='utf-8') as f:
    json.dump(clean_metadata, f, indent=2, ensure_ascii=False)

print(f"   • metadata_clean.json")

print(f"\n🎯 Summary of Clean Data Files:")
print(f"   📄 CSV Files (essential columns only):")
print(f"      • train_data_clean.csv ({len(train_clean):,} samples)")
print(f"      • test_data_clean.csv ({len(test_clean):,} samples)")
print(f"   📄 Manifest Files (JSONL format):")
print(f"      • train_manifest_clean.jsonl")
print(f"      • test_manifest_clean.jsonl")
print(f"   📄 Metadata:")
print(f"      • metadata_clean.json")
print(f"\n✨ Ready for efficient ASR training with minimal data footprint!")

📄 Creating clean manifest files...
✅ Clean manifest files created:
   • train_manifest_clean.jsonl
   • test_manifest_clean.jsonl
   • metadata_clean.json

🎯 Summary of Clean Data Files:
   📄 CSV Files (essential columns only):
      • train_data_clean.csv (72,155 samples)
      • test_data_clean.csv (18,039 samples)
   📄 Manifest Files (JSONL format):
      • train_manifest_clean.jsonl
      • test_manifest_clean.jsonl
   📄 Metadata:
      • metadata_clean.json

✨ Ready for efficient ASR training with minimal data footprint!
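
Since every manifest entry is written with `"duration": -1` as a placeholder, a quick check of the JSONL files can show how many entries still need real durations. The sketch below uses only the standard library. `validate_manifest` and `REQUIRED_KEYS` are illustrative names, and the demo entries are made up.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"audio_filepath", "text", "duration"}

def validate_manifest(path):
    """Return (entry_count, placeholder_duration_count); raise ValueError on a bad entry."""
    entries = 0
    placeholder_durations = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)  # JSONDecodeError is a ValueError subclass
            missing = REQUIRED_KEYS - set(entry)
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            entries += 1
            if entry["duration"] < 0:
                placeholder_durations += 1
    return entries, placeholder_durations

# Demo on a throwaway manifest with hypothetical entries
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_manifest.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for item in (
            {"audio_filepath": "a.flac", "text": "එය", "duration": -1},
            {"audio_filepath": "b.flac", "text": "ඔව්", "duration": 2.5},
        ):
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    counts = validate_manifest(demo_path)
print(counts)
```

Run against `train_manifest_clean.jsonl`, the second number would be expected to equal the first until the durations are filled in by a later audio-processing step.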