# Vietnamese Text Correction using Transformers Pipeline

This notebook demonstrates how to use a Vietnamese text correction model to fix spelling and grammar errors in text data. We'll process a dataset and show before/after examples of the corrections.

## 1. Import Required Libraries

Import necessary libraries including transformers, pandas, torch, and other utilities for text processing and model operations.

In [1]:
# Import required libraries
from transformers import pipeline, AutoTokenizer
import pandas as pd
import pickle
import os
import unicodedata
import re
import torch
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"CUDA available: {torch.cuda.is_available()}")


Libraries imported successfully!
CUDA available: True


## 2. Configure Model and Parameters

Set up configuration variables including model name, file paths, batch size, and column specifications for processing.

In [2]:
# Configuration parameters
MODEL = "bmd1905/vietnamese-correction-v2"
ROOT_DIR = os.path.dirname(os.getcwd())
DATA_DIR = os.path.join(ROOT_DIR, "data")
INPUT_CSV = os.path.join(DATA_DIR, "vihallu-public-test.csv")
OUTPUT_CSV = os.path.join(DATA_DIR, "fixed-vihallu-public-test-pipeline.csv")
CACHE_FILE = os.path.join(ROOT_DIR, "cache", "pipeline_corrections.pkl")
BATCH_SIZE = 8  # Reduced for better memory management
COLS = ['context', 'prompt', 'response']

print("Configuration:")
print(f"Model: {MODEL}")
# print(f"Input CSV: {INPUT_CSV}")
print(f"Output CSV: {OUTPUT_CSV}")
print(f"Cache file: {CACHE_FILE}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Columns to process: {COLS}")


Configuration:
Model: bmd1905/vietnamese-correction-v2
Output CSV: /home/guest/Projects/DSC2025/BAN/data/fixed-vihallu-public-test-pipeline.csv
Cache file: /home/guest/Projects/DSC2025/BAN/cache/pipeline_corrections.pkl
Batch size: 8
Columns to process: ['context', 'prompt', 'response']


## 3. Load and Initialize the Correction Pipeline

Initialize the text2text-generation pipeline with the Vietnamese correction model and configure device settings.

In [3]:
# Initialize the correction pipeline
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

try:
    corrector = pipeline(
        "text2text-generation", 
        model=MODEL, 
        device=device, 
        truncation=True,
        max_length=512
    )
    print("✅ Vietnamese correction pipeline loaded successfully!")
except Exception as e:
    print(f"❌ Error loading pipeline: {e}")
    raise  # Re-raise so later cells do not fail with an undefined `corrector`


Using device: GPU


Device set to use cuda:0


✅ Vietnamese correction pipeline loaded successfully!


## 4. Define Text Postprocessing Functions

Create helper functions for text postprocessing including prefix removal, Unicode normalization, and whitespace cleanup.

In [6]:
def postprocess_text(text):
    """
    Postprocess corrected text to clean up common issues
    """
    if text is None or text == "":
        return ""
    
    # Remove common prefixes that the model might add
    prefixes = [
        "Sửa lỗi chính tả và ngữ pháp:", 
        "Sửa lỗi chính tả:", 
        "Corrected:",
        "Văn bản đã sửa:",
        "Kết quả:"
    ]
    
    for prefix in prefixes:
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    
    # Unicode normalization
    text = unicodedata.normalize("NFC", text)
    
    # Fix spacing around punctuation
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    
    return text

def create_sample_text():
    """Create sample Vietnamese text with errors for testing"""
    return [
        "Tôi đang đi đến trường học vào buổi sang.",
        "Con ch của tôi rất thông minh và ngoan ngoãn.",
        "Hôm nay trời đẹp, chúng ta đi chơi nhé!",
        "Việt Nam là một đất nước xinh đẹp với nhiều cảnh đẹp.",
        "Tôi thích ăn ph và bánh mi vào buổi sng."
    ]

# Test the postprocessing function
sample_output = "Sửa lỗi chính tả: Đây là một câu đã được sửa lỗi  ."
cleaned = postprocess_text(sample_output)
print(f"Original: '{sample_output}'")
print(f"Cleaned: '{cleaned}'")


Original: 'Sửa lỗi chính tả: Đây là một câu đã được sửa lỗi  .'
Cleaned: 'Đây là một câu đã được sửa lỗi.'
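
The Unicode normalization step matters for Vietnamese because the same visible letter can be stored either as one precomposed code point or as a base letter followed by combining marks. The following is a minimal sketch of that effect using only the standard library; `same_text` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
import unicodedata

# "ế" stored in decomposed form: base letter + circumflex + acute accent.
decomposed = "e\u0302\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# NFC merges the three code points into the single precomposed letter U+1EBF.
print(len(decomposed), len(composed))
print(composed == "\u1ebf")


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Without normalization, a plain `==` comparison between the original and corrected text could report a change even when both strings render identically, which would inflate any "changed" count computed later.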


## 5. Load Sample Data

Load the CSV dataset and prepare sample texts for correction, handling missing values and data types.

In [7]:
# Create cache directory
os.makedirs(os.path.dirname(CACHE_FILE) or ".", exist_ok=True)

# Load existing cache if available
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        cache = pickle.load(f)
    print(f"✅ Loaded cache with {len(cache)} entries")
else:
    cache = {}
    print("📝 Starting with empty cache")

# Try to load the CSV file, create sample data if not available
try:
    df = pd.read_csv(INPUT_CSV)
    print(f"✅ Loaded CSV with {len(df)} rows and {len(df.columns)} columns")
    print(f"Columns: {list(df.columns)}")
except FileNotFoundError:
    print("⚠️ Input CSV not found. Creating sample data...")
    # Create sample data with Vietnamese text containing errors
    sample_data = {
        'context': [
            "Ngày hôm nay trời rat đẹp và nắng ấm",
            "Tôi đang học tại trường đại học Bách Khoaaa",
            "Con meo của tôi rất de thương và ngoan",
            "Việt Nam có nhiều món n ngon và đặc sắc",
            "Chúng ta nên bảo vệ môi trường sống",
        ] * 6,  # Repeat to get 30 samples
        'prompt': [
            "Hãy kể cho tôi nghe về quê hương bạn",
            "Bạn thích mn ăn nào nhất ở Việt Nam?",
            "Sở thích cua bạn trong thời gian rảnh là gì?",
            "Bạn có kế hoạch gì cho tương lai không?",
            "Điều gì làm bạn cảm thấy hạnh phúoc nhất ?",
        ] * 6,
        'response': [
            "Quê tôi ở miền Bắac, nơi có nhiều cảnh đẹp và con người thân thiện.",
            "Tôi rất thích phở bởi vì nó có hương vị đậm đà và thơm ngonn.",
            "Tôi thích đọc sach và nghe nhạc khii rảnh rỗi.",
            "Tôi mun trở thành một ky sư giỏi trong tương lai.",
            "Được ở bên gia đình là điều la tôi hạnh phúc nhat.",
        ] * 6
    }
    df = pd.DataFrame(sample_data)
    print(f"✅ Created sample dataset with {len(df)} rows")

# Display basic info about the dataset
print(f"\nDataset shape: {df.shape}")
print(f"Available columns for processing: {[col for col in COLS if col in df.columns]}")


📝 Starting with empty cache
✅ Loaded CSV with 1000 rows and 5 columns
Columns: ['id', 'context', 'prompt', 'response', 'predict_label']

Dataset shape: (1000, 5)
Available columns for processing: ['context', 'prompt', 'response']
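
The cache-loading step above assumes the pickle file is intact. As a hedged sketch, a loader could fall back to an empty cache when the file is missing, truncated, or does not contain a dict. `load_cache` is a hypothetical helper, not the notebook's code, and because unpickling can execute arbitrary code it should only ever be pointed at files you created yourself.

```python
import os
import pickle
import tempfile


def load_cache(path):
    """Return the cached dict, or {} if the file is missing or unreadable."""
    if not os.path.exists(path):
        return {}
    try:
        with open(path, "rb") as f:
            data = pickle.load(f)
    except (pickle.UnpicklingError, EOFError):
        # Truncated or corrupted file: start over rather than crash.
        return {}
    return data if isinstance(data, dict) else {}


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_cache.pkl")
    missing = load_cache(demo_path)      # file does not exist yet
    open(demo_path, "wb").close()        # create an empty (truncated) file
    truncated = load_cache(demo_path)
    with open(demo_path, "wb") as f:
        pickle.dump({"mun": "muốn"}, f)
    restored = load_cache(demo_path)
```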


## 6. Implement Batch Text Correction

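Process the texts in chunks of `BATCH_SIZE`, caching each result so that a repeated text is corrected only once. Note that the loop below still calls the pipeline on one text at a time; the chunks only set the progress-bar granularity, which is why the log shows the "using the pipelines sequentially on GPU" warning.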

In [8]:
def correct_texts_batch(texts, column_name):
    """
    Correct a list of texts in batches with caching
    """
    corrected_texts = []
    
    print(f"\n🔄 Processing {len(texts)} texts for column '{column_name}'...")
    
    for i in tqdm(range(0, len(texts), BATCH_SIZE), desc=f"Correcting {column_name}"):
        batch = texts[i:i+BATCH_SIZE]
        batch_results = []
        
        for text in batch:
            # Skip empty texts
            if not text.strip():
                batch_results.append(text)
                continue
                
            # Check cache first
            if text in cache:
                batch_results.append(cache[text])
                continue
            
            try:
                # Generate correction
                correction_result = corrector(text, max_length=512, num_beams=3, do_sample=False)
                corrected_text = correction_result[0]['generated_text'] if correction_result else text
                corrected_text = postprocess_text(corrected_text)
                
                # Cache the result
                cache[text] = corrected_text
                batch_results.append(corrected_text)
                
            except Exception as e:
                print(f"Error correcting text: {e}")
                batch_results.append(text)  # Return original text if correction fails
        
        corrected_texts.extend(batch_results)
    
    return corrected_texts

# Process each column that exists in the dataframe
original_df = df.copy()  # Keep original for comparison

for col in COLS:
    if col in df.columns:
        texts = df[col].fillna("").astype(str).tolist()
        corrected_texts = correct_texts_batch(texts, col)
        df[col] = corrected_texts
        print(f"✅ Completed correction for column '{col}'")
    else:
        print(f"⚠️ Column '{col}' not found in dataset")

print(f"\n✅ Correction process completed!")
print(f"Cache now contains {len(cache)} entries")



🔄 Processing 1000 texts for column 'context'...


Correcting context:   1%|          | 1/125 [00:20<42:27, 20.55s/it]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Correcting context: 100%|██████████| 125/125 [22:39<00:00, 10.87s/it]


✅ Completed correction for column 'context'

🔄 Processing 1000 texts for column 'prompt'...


Correcting prompt: 100%|██████████| 125/125 [06:01<00:00,  2.89s/it]


✅ Completed correction for column 'prompt'

🔄 Processing 1000 texts for column 'response'...


Correcting response: 100%|██████████| 125/125 [08:27<00:00,  4.06s/it]

✅ Completed correction for column 'response'

✅ Correction process completed!
Cache now contains 2919 entries
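
The run above sends one text per pipeline call. One possible alternative, sketched here under stated assumptions, is to collect the distinct uncached texts first and hand them to the model a chunk at a time. `correct_unique_in_batches` and its `correct_batch` argument are hypothetical names; `correct_batch` stands in for any callable that maps a list of strings to a list of corrected strings. With a Hugging Face pipeline it could wrap a call that passes a list together with `batch_size`, but the exact return shape should be checked against the installed version.

```python
def correct_unique_in_batches(texts, correct_batch, cache, batch_size=8):
    """Correct each distinct, non-empty, uncached text once, in chunks."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    pending = list(dict.fromkeys(
        t for t in texts if t.strip() and t not in cache
    ))
    for start in range(0, len(pending), batch_size):
        chunk = pending[start:start + batch_size]
        # One model call per chunk instead of one per text.
        for source, fixed in zip(chunk, correct_batch(chunk)):
            cache[source] = fixed
    # Empty or still-uncached texts are returned unchanged.
    return [cache.get(t, t) for t in texts]


# Stand-in for the model: records each chunk it receives and upper-cases it.
calls = []

def fake_correct_batch(chunk):
    calls.append(list(chunk))
    return [s.upper() for s in chunk]

demo_cache = {}
result = correct_unique_in_batches(
    ["a", "b", "a", "", "c"], fake_correct_batch, demo_cache, batch_size=2
)
```

In the demonstration, the duplicate `"a"` and the empty string never reach the stand-in model, so three distinct texts are handled in two calls.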





## 7. Display Before and After Examples

Show up to 30 examples (at most 10 per column) of original texts alongside their corrected versions, first as a printed listing and then as a pandas comparison table.

In [9]:
def display_comparison_examples(original_df, corrected_df, num_examples=30):
    """
    Display before and after examples in a nice format
    """
    print("=" * 100)
    print(f"📋 DISPLAYING {num_examples} BEFORE/AFTER CORRECTION EXAMPLES")
    print("=" * 100)
    
    examples_shown = 0
    
    for col in COLS:
        if col not in original_df.columns or examples_shown >= num_examples:
            continue
            
        print(f"\n🔍 COLUMN: {col.upper()}")
        print("-" * 80)
        
        for i in range(min(10, len(original_df), num_examples - examples_shown)):
            original_text = str(original_df[col].iloc[i])
            corrected_text = str(corrected_df[col].iloc[i])
            
            # Only show if there's actual content
            if original_text.strip() and len(original_text.strip()) > 10:
                print(f"\n📝 Example {examples_shown + 1}:")
                print(f"BEFORE:  {original_text}")
                print(f"AFTER:   {corrected_text}")

                # Highlight if there were changes
                if original_text != corrected_text:
                    print("✨ CHANGED")
                else:
                    print("📌 NO CHANGE")
                
                examples_shown += 1
                
                if examples_shown >= num_examples:
                    break
    
    print(f"\n📊 Summary: Displayed {examples_shown} examples")
    return examples_shown

# Display the examples
examples_count = display_comparison_examples(original_df, df, 30)


📋 DISPLAYING 30 BEFORE/AFTER CORRECTION EXAMPLES

🔍 COLUMN: CONTEXT
--------------------------------------------------------------------------------

📝 Example 1:
BEFORE:  Putin ngày 14 tháng 10 năm 2009, đưa ra đề nghị là Trung Quốc, các nước Trung Á và Nga nên tổ chức một cuộc thi hát hàng năm để có thể gia tăng các mối liên lạc văn hóa. Putin cũng đề nghị là cuộc thi hát này có thể được gọi là "Intervision" để đối đầu với cuộc thi hát nổi tiếng thường niên của lục địa châu Âu mang tên Eurovision. Một cuộc thi như vậy sẽ cho thấy các nam, nữ ca sĩ Trung Quốc tranh tài với các ca sĩ Uzbeek, Tadjik, Kazakh, Nga và Kyrgyzstan. Thông tấn xã Interfax tường thuật lời của Putin, nói thêm là: "Việc tổ chức một cuộc thi hát quốc tế hiện đại, Intervision, sẽ củng cố các mối liên lạc văn hóa giữa các nước chún

In [None]:
# Create a more structured comparison table using pandas
def create_comparison_table(original_df, corrected_df, num_examples=15):
    """
    Create a structured comparison table
    """
    comparison_data = []
    
    for col in COLS:
        if col not in original_df.columns:
            continue
            
        for i in range(min(num_examples//len(COLS), len(original_df))):
            original_text = str(original_df[col].iloc[i])
            corrected_text = str(corrected_df[col].iloc[i])
            
            if original_text.strip() and len(original_text.strip()) > 10:
                comparison_data.append({
                    'Column': col,
                    'Example': i + 1,
                    'Original Text': original_text[:100] + "..." if len(original_text) > 100 else original_text,
                    'Corrected Text': corrected_text[:100] + "..." if len(corrected_text) > 100 else corrected_text,
                    'Changed': 'Yes' if original_text != corrected_text else 'No'
                })
    
    comparison_df = pd.DataFrame(comparison_data)
    return comparison_df

# Create and display comparison table
comparison_table = create_comparison_table(original_df, df, 30)
print(f"\n📋 STRUCTURED COMPARISON TABLE")
print("=" * 120)
print(comparison_table.to_string(index=False, max_colwidth=50))

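# Show statistics
total_examples = len(comparison_table)
# An empty table has no 'Changed' column, so guard both the lookup and the division
total_changes = int((comparison_table['Changed'] == 'Yes').sum()) if total_examples else 0
print(f"\n📈 CORRECTION STATISTICS:")
print(f"Total examples: {total_examples}")
print(f"Texts changed: {total_changes}")
if total_examples:
    print(f"Change rate: {total_changes/total_examples*100:.1f}%")
else:
    print("Change rate: n/a (no examples)")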
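
The comparison above only records whether a text changed. To see what changed, one option is a word-level diff. This is a sketch using the standard-library `difflib`; `word_changes` is a hypothetical helper and not part of the notebook.

```python
import difflib


def word_changes(before, after):
    """List (operation, old_words, new_words) for each differing word span."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Illustrative pair: one misspelled word replaced.
demo_changes = word_changes("Tôi mun trở thành", "Tôi muốn trở thành")
print(demo_changes)
```

Applied to the `original_df` and `df` columns, such a helper would make it easier to spot cases where the model rewrites more than a spelling error, for example by altering a proper noun.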



## 8. Export Corrected Results

Save the corrected dataset to a new CSV file and update the cache with processed corrections.

In [None]:
# Create output directory if it doesn't exist
os.makedirs(os.path.dirname(OUTPUT_CSV) or ".", exist_ok=True)

# Save the corrected dataframe
try:
    df.to_csv(OUTPUT_CSV, index=False, encoding="utf-8-sig")
    print(f"✅ Corrected dataset saved to: {OUTPUT_CSV}")
    print(f"📊 Saved {len(df)} rows with {len(df.columns)} columns")
except Exception as e:
    print(f"❌ Error saving CSV: {e}")

# Save the updated cache
try:
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(cache, f)
    print(f"✅ Cache saved with {len(cache)} entries to: {CACHE_FILE}")
except Exception as e:
    print(f"❌ Error saving cache: {e}")

# Display final summary
print("\n" + "="*60)
print("🎉 VIETNAMESE TEXT CORRECTION COMPLETED!")
print("="*60)
print(f"📁 Input file: {INPUT_CSV}")
print(f"📁 Output file: {OUTPUT_CSV}")
print(f"🔧 Model used: {MODEL}")
print(f"💾 Cache entries: {len(cache)}")
print(f"📊 Processed rows: {len(df)}")
print(f"🔄 Batch size: {BATCH_SIZE}")
print(f"🖥️ Device used: {'GPU' if device == 0 else 'CPU'}")
print("="*60)

# Show file sizes
if os.path.exists(OUTPUT_CSV):
    file_size = os.path.getsize(OUTPUT_CSV) / (1024 * 1024)  # MB
    print(f"📏 Output file size: {file_size:.2f} MB")

if os.path.exists(CACHE_FILE):
    cache_size = os.path.getsize(CACHE_FILE) / (1024 * 1024)  # MB
    print(f"📏 Cache file size: {cache_size:.2f} MB")
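
Writing the cache directly to its final path means an interrupted write could leave a truncated pickle behind. A common pattern, sketched here with a hypothetical `save_cache_atomic` helper, is to write to a temporary file in the same directory and then rename it over the target with `os.replace`.

```python
import os
import pickle
import tempfile


def save_cache_atomic(cache, path):
    """Write the cache to a temp file, then rename it over the target path."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file, then let the error propagate.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "cache", "demo.pkl")
    save_cache_atomic({"sach": "sách"}, target)
    with open(target, "rb") as f:
        round_trip = pickle.load(f)
    leftovers = [n for n in os.listdir(os.path.dirname(target)) if n.endswith(".tmp")]
```

Calling such a helper every few hundred corrections, rather than only at the end, would also preserve progress if a long run is interrupted.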



## Summary

This notebook demonstrates Vietnamese text correction using the `bmd1905/vietnamese-correction-v2` model. The key features include:

- ✅ **Chunked Processing**: Iterates over each column in chunks of a configurable `BATCH_SIZE`, sending one text at a time to the pipeline
- ✅ **Caching System**: Avoids redundant corrections by caching results keyed on the input text
- ✅ **Text Postprocessing**: Strips known prefixes from model outputs, applies NFC normalization, and cleans up whitespace
- ✅ **Before/After Comparison**: Prints up to 30 examples of corrections and builds a comparison table
- ✅ **Error Handling**: Falls back to the original text when a correction fails, and reports CSV or cache write errors
- ✅ **GPU Support**: Automatically uses the GPU when CUDA is available

The corrected dataset has been saved and is ready for further analysis or use in downstream tasks.