# Bengali OCR Correction Tool
This notebook corrects Bengali words that were corrupted in OCR output. It handles two kinds of input: a file with one corrupted word per line (e.g. োনো, িদ্যু, াবিক্রিয়া) and a paragraph that contains some corrupted words (e.g. ্যাসীয পদাথ ুব গুরুত্বপূর্ণ। োনো  িদ্যু াবিক্রিয়া ঘটলে উত্তপ সৃষ্টি হয়।).

**Approach**: Three-tier correction strategy
1. Dictionary-based exact matching (fastest)
2. Pattern-based rules (missing consonants)
3. Fuzzy string matching (unknown corruptions)


In [173]:
import re
from typing import Dict, List, Tuple, Optional
from rapidfuzz import fuzz

#### Assume that we have:
1. A dictionary that maps known corrupted words to their corrections
2. A list of common Bengali consonants that might be missing from the start of a word
3. A set of valid Bengali dictionary words

In [174]:
# Known corruptions dictionary
corrections_dict = {
            '্যাসীয়': 'গ্যাসীয়',
            '্যাসীয': 'গ্যাসীয়',
            'ুব': 'খুব',
            'োনো': 'কোনো',
            'িদ্যুৎ': 'বিদ্যুৎ',
            'াবিক্রিয়া': 'প্রতিক্রিয়া',
            'পদাথ': 'পদার্থ',
            'উত্তপ': 'উত্তাপ',
        }
        
# Common Bengali consonants that might be missing
common_prefixes = [
            'ক', 'খ', 'গ', 'ঘ', 'ঙ',
            'চ', 'ছ', 'জ', 'ঝ', 'ঞ',
            'ট', 'ঠ', 'ড', 'ঢ', 'ণ',
            'ত', 'থ', 'দ', 'ধ', 'ন',
            'প', 'ফ', 'ব', 'ভ', 'ম',
            'য', 'র', 'ল', 'শ', 'ষ',
            'স', 'হ', 'ড়', 'ঢ়', 'য়',
        ]
        
# Valid Bengali dictionary (expandable)
bengali_dictionary = set([
            'গ্যাসীয়', 'খুব', 'কোনো', 'বিদ্যুৎ', 'প্রতিক্রিয়া',
            'পদার্থ', 'উত্তাপ', 'রাসায়নিক', 'পানি', 'তাপমাত্রা',
            'চাপ', 'শক্তি', 'গতি', 'ভর', 'আয়তন', 'ঘনত্ব',
            'তরল', 'কঠিন', 'বায়বীয়', 'পরমাণু', 'অণু',
            'ইলেকট্রন', 'প্রোটন', 'নিউট্রন', 'বিক্রিয়া',
            'সমীকরণ', 'দ্রবণ', 'অম্ল', 'ক্ষার', 'লবণ',
            'হয়', 'এবং', 'থেকে', 'একটি', 'যখন', 'তখন',
            'সাথে', 'মধ্যে', 'উপর', 'নিচে', 'ভিতরে', 'বাইরে',
        ])
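
Exact lookups in these tables compare raw code points, so a letter such as য় that can be encoded either as one code point (U+09DF) or as য followed by a nukta (U+09AF U+09BC) only matches when both sides use the same encoding. One way to guard against this, sketched below, is to normalize every key and every incoming word to the same Unicode form before lookup. The helper names `normalize_word` and `normalize_table` are illustrative assumptions, not part of the notebook.

```python
import unicodedata
from typing import Dict


def normalize_word(word: str) -> str:
    """Return the NFC form so canonically equivalent encodings compare equal."""
    return unicodedata.normalize("NFC", word)


def normalize_table(table: Dict[str, str]) -> Dict[str, str]:
    """Normalize both the keys and the values of a corrections table."""
    return {normalize_word(k): normalize_word(v) for k, v in table.items()}


# The same visible letter written two ways (assumed example input).
precomposed = "\u09df"        # single code point
decomposed = "\u09af\u09bc"   # base letter followed by a nukta

print(precomposed == decomposed)                                  # raw comparison
print(normalize_word(precomposed) == normalize_word(decomposed))  # after NFC
```

If the tables are normalized once when they are built, only the incoming word needs to be normalized at lookup time.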


#### This function decides whether a word is corrupted by checking two things:
1. Whether the word starts with a dependent vowel sign (কার: া ি ী ু ূ ৃ ে ৈ ো ৌ) or a hasanta (্), neither of which can begin a valid Bengali word.
2. Whether the word appears in the known-corruptions dictionary. A word that matches neither check is treated as not corrupted.

In [175]:
def is_corrupted(word: str) -> bool:
        if not word:
            return False
        
        # Check if starts with vowel mark
        vowel_marks = ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ', '্']
        if word[0] in vowel_marks:
            return True
        
        # Check if it's a known corrupted word
        if word in corrections_dict:
            return True
        
        return False
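
The leading-character test can also be expressed with Unicode general categories instead of a hand-written list: Bengali dependent vowel signs and the hasanta are combining marks (categories `Mc` and `Mn`), while consonants and independent vowels are letters. The sketch below shows that alternative. `starts_with_combining_mark` is a hypothetical helper, and it is broader than the list above because it also flags signs such as ং, ঃ and ঁ.

```python
import unicodedata


def starts_with_combining_mark(word: str) -> bool:
    """True if the first code point is a combining mark (category M*)."""
    if not word:
        return False
    return unicodedata.category(word[0]).startswith("M")


# Corrupted fragments next to their intact forms.
for sample in ["োনো", "কোনো", "ুব", "খুব"]:
    print(sample, starts_with_combining_mark(sample))
```

A category check does not need updating when a new sign shows up in the data, at the cost of being less explicit about which characters are covered.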


#### This function corrects a corrupted word by direct lookup in the known-corruptions dictionary. If the word is a key in that dictionary, it returns the corresponding correct word; otherwise it returns None.

In [176]:
def correct_with_dictionary( word: str) -> Optional[str]:
        return corrections_dict.get(word)
 

#### This function generates and returns all candidate corrections for a word. It works as follows:
1. It first checks whether the word is corrupted.
2. If it is, it builds candidates by prepending each common consonant (e.g. গ, খ, ক) to the word.
3. It also tries to repair common OCR mistakes such as a dropped reph (রেফ, র্) inside a Bengali cluster, for example "পদাথ" → "পদার্থ".


In [177]:
def correct_with_patterns(word: str) -> List[str]:
        candidates = []
        # If the word looks corrupted, try prepending each common consonant
        if is_corrupted(word):
            for prefix in common_prefixes:
                candidates.append(prefix + word)

        # Restore a dropped রেফ (র্) before থ, e.g. পদাথ -> পদার্থ
        if 'াথ' in word:
            candidates.append(word.replace('াথ', 'ার্থ'))

        # Restore a dropped আ-কার before a final প or থ, e.g. উত্তপ -> উত্তাপ.
        # The vowel sign goes before the last consonant, not after it.
        if word.endswith(('প', 'থ')) and not word.endswith(('াপ', 'াথ')):
            candidates.append(word[:-1] + 'া' + word[-1])

        return candidates
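
Most candidates produced by prepending a consonant are not real words, so they only become useful once they are filtered against a word list. Here is a small self-contained sketch of that generate-then-filter step, using a shortened consonant list and a toy vocabulary chosen purely for illustration:

```python
from typing import List, Set

# Shortened consonant list and toy vocabulary, for illustration only.
CONSONANTS = "কখগঘচজতদনপবমরলসহ"
VOCABULARY: Set[str] = {"খুব", "কোনো", "বিদ্যুৎ"}


def prefix_candidates(fragment: str) -> List[str]:
    """Prepend every consonant in CONSONANTS to the fragment."""
    return [consonant + fragment for consonant in CONSONANTS]


def valid_candidates(fragment: str, vocabulary: Set[str]) -> List[str]:
    """Keep only the candidates that are present in the vocabulary."""
    return [c for c in prefix_candidates(fragment) if c in vocabulary]


print(len(prefix_candidates("ুব")))        # one candidate per consonant
print(valid_candidates("ুব", VOCABULARY))  # candidates that survive the filter
```

When several candidates survive the filter, something else has to pick between them, and a similarity score is one way to do that.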
    
  

#### This function finds the best correction for a word using pattern candidates and fuzzy matching. It works as follows:
1. It generates candidate corrections for the word with the pattern rules.
2. It keeps the candidates that appear in the Bengali dictionary and scores each by its similarity to the input word.
3. If the best candidate scores below the threshold, it compares the input word with every dictionary word and takes the most similar one, provided its similarity reaches the threshold. If the scan finds nothing, a below-threshold pattern candidate is still returned.

In [178]:
  
def fuzzy_match(word: str, threshold: float = 0.75) -> Optional[Tuple[str, float]]:
        best_match = None
        best_score = 0.0
        
        # First try candidates from pattern rules
        candidates = correct_with_patterns(word)
        
        # Check pattern candidates against dictionary
        for candidate in candidates:
            if candidate in bengali_dictionary:
                similarity = fuzz.ratio(word, candidate) / 100.0
                if similarity > best_score:
                    best_score = similarity
                    best_match = candidate
        
        # If no good match from patterns, try full dictionary scan
        if best_score < threshold:
            for dict_word in bengali_dictionary:
                similarity = fuzz.ratio(word, dict_word) / 100.0
                if similarity > best_score and similarity >= threshold:
                    best_score = similarity
                    best_match = dict_word
        
        return (best_match, best_score) if best_match else None
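
To see how the similarity score behaves on a truncated word, here is a self-contained sketch that swaps in the standard library's `difflib.SequenceMatcher` for `rapidfuzz`, so it runs without extra dependencies. Its `ratio()` is on a 0 to 1 scale and will not always agree exactly with `fuzz.ratio`. `best_fuzzy_match` and the toy vocabulary are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: twice the matched length over the total length."""
    return SequenceMatcher(None, a, b).ratio()


def best_fuzzy_match(word: str, vocabulary: Iterable[str],
                     threshold: float = 0.75) -> Optional[Tuple[str, float]]:
    """Return the most similar vocabulary word at or above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for candidate in vocabulary:
        score = similarity(word, candidate)
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


toy_vocabulary = ["বিদ্যুৎ", "পানি", "চাপ"]
print(best_fuzzy_match("িদ্যু", toy_vocabulary))
print(best_fuzzy_match("xyz", toy_vocabulary))
```

The scores are computed over code points, so a word that has lost one or two of its seven code points still clears a 0.75 threshold, while very short fragments are much more sensitive to a single missing character.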
   

#### This function corrects a single word in the following steps:
1. It returns the word unchanged if it does not look corrupted.
2. It looks the word up directly in the known-corruptions dictionary.
3. If that fails, it falls back to fuzzy matching with pattern rules.
4. It returns a dictionary with the correction details.

In [179]:
 
def correct_word(word: str) -> Dict:

        result = {
            'original': word,
            'corrected': word,
            'method': 'no_correction',
            'confidence': 1.0,
            'is_corrupted': False
        }
        
        # Check if word appears corrupted
        if not is_corrupted(word):
            return result
        
        result['is_corrupted'] = True
        
        # Step 1: Try dictionary lookup (highest confidence)
        dict_correction = correct_with_dictionary(word)
        if dict_correction:
            result['corrected'] = dict_correction
            result['method'] = 'dictionary'
            result['confidence'] = 1.0
            return result
        
        # Step 2: Try fuzzy matching with pattern rules
        fuzzy_result = fuzzy_match(word, threshold=0.75)
        if fuzzy_result:
            matched_word, score = fuzzy_result
            result['corrected'] = matched_word
            result['method'] = 'fuzzy_match'
            result['confidence'] = score
            return result
        
        # No correction found
        result['method'] = 'no_match_found'
        result['confidence'] = 0.0
        return result
    
    

#### This function takes a whole paragraph that contains corrupted words, corrects each corrupted word while preserving whitespace and trailing punctuation, and returns the corrected paragraph.

In [180]:

def correct_paragraph(paragraph: str) -> str:
        # Split paragraph into words while preserving punctuation and spacing
        # Use regex to split on whitespace but keep the whitespace
        words = re.findall(r'\S+|\s+', paragraph)
        
        corrected_words = []
        corrections_list = []
        
        for word in words:
            # If it's whitespace, keep it as is
            if word.isspace():
                corrected_words.append(word)
                continue
            
            # Remove punctuation for processing but remember it
            punctuation = ''
            clean_word = word
            
            # Check for trailing punctuation
            if word and not word[-1].isalnum() and word[-1] not in ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ', '্', 'ং', 'ঃ', 'ঁ']:
                punctuation = word[-1]
                clean_word = word[:-1]
            
            # Correct the word
            result = correct_word(clean_word)
            
            # If correction was made, track it
            if result['corrected'] != result['original']:
                corrections_list.append(result)
            
            # Add corrected word with punctuation back
            corrected_words.append(result['corrected'] + punctuation)
        
        corrected_paragraph = ''.join(corrected_words)
        return corrected_paragraph
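
The pattern `\S+|\s+` is what lets the paragraph be rebuilt exactly: every character lands in either a word token or a whitespace token, so joining the tokens reproduces the input, double spaces included. The sketch below demonstrates that round trip, along with an alternative way to peel off one trailing punctuation mark using Unicode categories. `split_trailing_punctuation` is a hypothetical helper, not the notebook's implementation.

```python
import re
import unicodedata
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Split into alternating word and whitespace tokens, losing nothing."""
    return re.findall(r"\S+|\s+", text)


def split_trailing_punctuation(token: str) -> Tuple[str, str]:
    """Separate one trailing punctuation character (category P*), if any."""
    if token and unicodedata.category(token[-1]).startswith("P"):
        return token[:-1], token[-1]
    return token, ""


text = "ুব গুরুত্বপূর্ণ। োনো  িদ্যু"
tokens = tokenize(text)
print("".join(tokens) == text)                # round trip preserves spacing
print(split_trailing_punctuation(tokens[2]))  # the word and its danda
```

Because the danda (।) is classified as punctuation while vowel signs are classified as marks, the category test does not need an explicit exclusion list for Bengali signs.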
    

#### Process an input file containing corrupted words and a corrupted paragraph.
Expected format (the two sections are separated by a `---` line):
- Section 1: CORRUPTED_WORDS (one word per line)
- Section 2: CORRUPTED_PARAGRAPH (text with errors)

In [181]:

def process_input_file(input_file: str = 'input.txt'):

        try:
            with open(input_file, 'r', encoding='utf-8') as f:
                content = f.read()
        
            # Split content into sections
            sections = content.split('---')
            
            # Process corrupted words section
            corrupted_words_section = sections[0].strip()
            corrupted_words = [line.strip() for line in corrupted_words_section.split('\n') if line.strip()]
            
            # Remove header if present
            if corrupted_words and 'CORRUPTED_WORDS' in corrupted_words[0]:
                corrupted_words = corrupted_words[1:]
            
            # Process corrupted paragraph section
            corrupted_paragraph = sections[1].strip()
            
            # Remove header if present
            if corrupted_paragraph.startswith('CORRUPTED_PARAGRAPH'):
                lines = corrupted_paragraph.split('\n')
                corrupted_paragraph = '\n'.join(lines[1:]).strip()
            
            return corrupted_words, corrupted_paragraph, None
    
        except Exception as e:
            print(f"Error processing: {e}")
            # Return the same 3-tuple shape so callers can always unpack it
            return None, None, str(e)
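
For reference, the parser above implies an input layout like the one below: an optional `CORRUPTED_WORDS` header followed by one word per line, a `---` separator, then an optional `CORRUPTED_PARAGRAPH` header followed by the text. The sample content is assembled from examples used in this notebook, and `parse_sections` is a simplified, hypothetical restatement of the same splitting logic that works on a string instead of a file.

```python
from typing import List, Tuple

# Assumed sample input, built from words that appear in this notebook.
SAMPLE_INPUT = """CORRUPTED_WORDS
ুব
োনো
---
CORRUPTED_PARAGRAPH
্যাসীয পদাথ ুব গুরুত্বপূর্ণ।
"""


def parse_sections(content: str) -> Tuple[List[str], str]:
    """Split on the first '---' and drop the optional section headers."""
    words_part, _, paragraph_part = content.partition("---")
    words = [line.strip() for line in words_part.splitlines() if line.strip()]
    if words and "CORRUPTED_WORDS" in words[0]:
        words = words[1:]
    lines = paragraph_part.strip().splitlines()
    if lines and lines[0].startswith("CORRUPTED_PARAGRAPH"):
        lines = lines[1:]
    return words, "\n".join(lines).strip()


words, paragraph = parse_sections(SAMPLE_INPUT)
print(words)
print(paragraph)
```

Unlike the file-based version, this sketch treats a missing separator as an empty paragraph rather than an error.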


#### Save all results to output file in a structured format.

In [182]:
def save_results(output_file: str, word_results: List[Dict], 
                    original_paragraph: str, corrected_paragraph: str):

        try:
            with open(output_file, 'w', encoding='utf-8') as f:
                # Header
                f.write("BENGALI OCR CORRECTION RESULTS\n")
                # Section 1: Individual Word Corrections
                f.write("SECTION 1: INDIVIDUAL WORD CORRECTIONS\n")
                f.write("-" * 70 + "\n\n")
                
                if word_results:
                    
                    for result in word_results:
                        f.write(f"{result['original']:<25} "
                               f"{result['corrected']:<25} "
                               f"{result['method']:<15} "
                               f"{result['confidence']:.2f}\n")
                    
                    f.write("\n")
                else:
                    f.write("No individual words processed.\n\n")
                
                # Section 2: Paragraph Correction
    
                f.write("SECTION 2: PARAGRAPH CORRECTION\n")
                f.write("-" * 70 + "\n\n")
                f.write("ORIGINAL PARAGRAPH:\n")
                f.write(original_paragraph + "\n\n")
                f.write("CORRECTED PARAGRAPH:\n")
                f.write(corrected_paragraph + "\n\n")
                
            
        except Exception as e:
            print(f"Error saving results: {e}")

#### The main processing function works in the following steps:
1. Reads the input file, which holds both the corrupted word list and the corrupted paragraph.
2. Corrects the corrupted words.
3. Prints the results.
4. Saves the results to an output file.

In [183]:
def process_and_save(input_file: str = 'input.txt', output_file: str = 'output.txt'):

        print("Bengali OCR Correction Tool - Processing")
        print("=" * 70)
        
        # Read input file
        corrupted_words, corrupted_paragraph, _ = process_input_file(input_file)
        
        if corrupted_words is None and corrupted_paragraph is None:
            return
        
        # Process individual corrupted words
        print("CORRECTING INDIVIDUAL WORDS")
        print("─" * 70)
        
        word_results = []
        if corrupted_words:
            
            for word in corrupted_words:
                result = correct_word(word)
                word_results.append(result)
                
                # Print correction
                if result['corrected'] != result['original']:
                    print(f"{result['original']:20} → {result['corrected']:20} "
                          f"[{result['method']}, confidence: {result['confidence']:.2f}]")
                else:
                    print(f"{result['original']:20} → (no correction found)")
        
        # Process paragraph

        print("\nCORRECTING PARAGRAPH")
        print("─" * 70)
        
        corrected_paragraph = corrupted_paragraph
        
        if corrupted_paragraph:
            print("\nOriginal Paragraph:")

            print(corrupted_paragraph)

            
            corrected_paragraph = correct_paragraph(corrupted_paragraph)
            
            print("\nCorrected Paragraph:")

            print(corrected_paragraph)

            
        save_results(output_file, word_results, corrupted_paragraph, 
                         corrected_paragraph)
        
        

#### Main execution function.

In [184]:
def main():
    # Process the file
    process_and_save(input_file='input.txt', output_file='output.txt')

if __name__ == "__main__":
    main()

Bengali OCR Correction Tool - Processing
CORRECTING INDIVIDUAL WORDS
──────────────────────────────────────────────────────────────────────
্যাসী                → গ্যাসীয়             [fuzzy_match, confidence: 0.77]
ুব                   → খুব                  [dictionary, confidence: 1.00]
োনো                  → কোনো                 [dictionary, confidence: 1.00]
িদ্যু                → বিদ্যুৎ              [fuzzy_match, confidence: 0.83]
াবিক্রিয়া           → প্রতিক্রিয়া         [dictionary, confidence: 1.00]
পদাথ                 → পদার্থ               [dictionary, confidence: 1.00]
উত্তপ                → উত্তাপ               [dictionary, confidence: 1.00]

CORRECTING PARAGRAPH
──────────────────────────────────────────────────────────────────────

Original Paragraph:
্যাসীয পদাথ ুব গুরুত্বপূর্ণ। োনো  িদ্যু াবিক্রিয়া ঘটলে উত্তপ সৃষ্টি হয়। পদাথ বিজ্ঞানে এই ধরনের ঘটনা ুব সাধারণ।

Corrected Paragraph:
গ্যাসীয় পদার্থ খুব গুরুত্বপূর্ণ। কোনো  বিদ্যুৎ প্রতিক্রিয়া ঘটলে উত্তাপ সৃষ্টি হয়।