****Prothom Alo Corpus****

**Batch Filter “body” Column from the archived Prothom Alo corpus and Save to a New Folder**

This code scans a folder of CSV files, extracts the body column from each file (when present), and saves a new CSV per source file containing only sentences (data) into a separate output directory. It creates the output directory if it doesn’t already exist and prints a status message for each file processed.

In [7]:
# Import the standard library module for filesystem path handling and directory operations
import os
# Import pandas for reading and writing CSV files as DataFrames
import pandas as pd

# Absolute path to the folder containing the original CSV files to process
input_folder = r'C:\Users\Student\Documents\Bangla\archive\csv'                # Folder with your original CSVs
# Absolute path to the destination folder where filtered CSVs (body-only) will be written
output_folder = r'C:\Users\Student\Documents\Bangla\Filtered_archive'          # Folder to save body-only files

# Create the output directory if it doesn't already exist (no error if it does)
os.makedirs(output_folder, exist_ok=True)

# Build a list of filenames in input_folder that end with ".csv"
csv_files = [f for f in os.listdir(input_folder) if f.endswith('.csv')]

# Iterate over each CSV filename discovered in the input folder
for fname in csv_files:
    # Construct the full path to the current CSV file
    fpath = os.path.join(input_folder, fname)
    try:
        # Read the CSV into a pandas DataFrame (uses pandas’ default dtype inference)
        df = pd.read_csv(fpath)
        # Check whether the DataFrame contains a column named "body"
        if 'body' in df.columns:
            # Slice the DataFrame to keep only the "body" column
            body_df = df[['body']]
            # Construct the output filename by appending "_body_only" before the .csv extension
            out_fname = fname.replace('.csv', '_body_only.csv')
            # Build the full output path under output_folder
            out_path = os.path.join(output_folder, out_fname)
            # Write the single-column DataFrame to disk (no index column)
            body_df.to_csv(out_path, index=False)
            # Print a success message with the destination path
            print(f"Saved: {out_path}")
        else:
            # If "body" does not exist, report that this file was skipped
            print(f"Skipped {fname}: No 'body' column found.")
    except Exception as e:
        # If anything goes wrong while reading/writing, log a warning with the error message
        print(f"Error processing {fname}: {e}")


Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-07_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-08_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-09_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-10_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-11_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2009-12_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-01_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-02_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-03_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-04_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-05_body_only.csv
Saved: C:\Users\Student\Documents\Bangla\Filtered_archive\2010-06_body_only.csv
Saved: C:\Users\Student\Documents\Bangla

**Bangla Sentence Tokenization Pipeline (Regex-Based Tokenization with Length Filter) for CSV “body” Fields**

This code scans a folder of CSV files, reads the body column from each file, splits the text into Bangla sentences using a regex-based tokenizer (splitting on ।, !, ?), filters out very short sentences (≤ 2 words), and writes the remaining sentences to a .txt file (one sentence per line) in an output directory. It also creates the output directory if it doesn’t exist and prints a status line per file.

In [None]:
import os                                         # Filesystem utilities for paths and directory listing
import pandas as pd                               # DataFrame-based CSV loading/processing
import re                                         # Regular expressions for sentence splitting

input_folder = r'C:\Users\Student\Documents\Bangla\Filtered_archive'     # Source directory containing input CSV files
output_folder = r'C:\Users\Student\Documents\Bangla\Tokenized_archive'   # Destination directory for sentence text files
os.makedirs(output_folder, exist_ok=True)         # Ensure the output directory exists (no error if already present)


# Function bn_sentence_tokenize(text)
# Purpose: Split a Bangla text into sentences by recognizing end-of-sentence punctuation (Dari ।, exclamation !, and question mark ?).
# Inputs:
#   text (Any): The input text; will be coerced to str.
# Outputs:
#   List[str]: A list of trimmed sentence strings with empty fragments removed.
def bn_sentence_tokenize(text):                   # Define a simple Bangla sentence tokenizer using regex
    # Split by Bangla sentence enders (।, !, ?)
    return [s.strip() for s in re.split(r'(?<=[।!?])', str(text)) if s.strip()]  # Split on enders, trim, and drop empties

csv_files = [f for f in os.listdir(input_folder) if f.endswith('.csv')]  # Collect all .csv filenames from the input folder

for fname in csv_files:                           # Iterate over each CSV file to process
    in_path = os.path.join(input_folder, fname)   # Build absolute path to the input CSV
    out_fname = fname.replace('.csv', '_sentences.txt')  # Derive output filename by changing extension
    out_path = os.path.join(output_folder, out_fname)    # Build absolute path to the output text file
    
    try:                                          # Guard the per-file pipeline with error handling
        df = pd.read_csv(in_path)                 # Read the CSV into a DataFrame
        sentences = []                            # Initialize accumulator for tokenized sentences
        for text in df['body'].dropna():          # Iterate over non-null entries in the 'body' column
            sentences.extend(bn_sentence_tokenize(text))  # Tokenize the text and append all sentences
        # Filter out sentences with only 1 or 2 words
        filtered_sentences = [                     # Build a filtered list of sentences
            sent for sent in sentences 
            if len(sent.split()) > 2               # Keep sentences strictly longer than 2 words
        ]
        with open(out_path, 'w', encoding='utf-8') as f:  # Open the destination file for UTF-8 writing
            for sent in filtered_sentences:        # Write each filtered sentence on its own line
                f.write(sent.strip() + '\n')       # Trim whitespace and append newline
        print(f"Tokenized and saved: {out_path}")  # Success message for this file
    except Exception as e:                         # Catch any exception that occurs while processing this file
        print(f"Error processing {fname}: {e}")   # Print a diagnostic error message


Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-07_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-08_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-09_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-10_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-11_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2009-12_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2010-01_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2010-02_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Documents\Bangla\Tokenized_archives\2010-03_body_only_sentences.txt
Tokenized and saved: C:\Users\Student\Document

In [None]:
import sys # Access the current Python executable path
!{sys.executable} -m pip install bangla-stemmer # Install 'bangla-stemmer' into this environment


Collecting bangla-stemmer
  Using cached bangla_stemmer-1.0-py3-none-any.whl.metadata (2.4 kB)
Using cached bangla_stemmer-1.0-py3-none-any.whl (9.1 kB)
Installing collected packages: bangla-stemmer
Successfully installed bangla-stemmer-1.0



[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


****Prothom Alo Corpus****

**Bangla Keyword-Aware Sentence Retriever (Rule-Based with Suffix Matching)**

This code scans a folder of Bangla sentence files (.txt) and extracts sentences that contain any target keyword or a keyword followed by common Bangla inflectional suffixes. It compiles a comprehensive regex for each keyword (lemma + suffix variants with proper word boundaries), searches every line across all files, and writes the matched sentences to per-keyword text files in an output directory. It prints a summary of how many matches were found for each keyword.

In [None]:
# ---- Standard library imports: filesystem paths/listing, regex utilities, CSV (unused; left for potential extensions)
import os
import re
import csv


# Target keyword list (lemmas) to search for in the corpus
keywords = [
    'অর্থ', 'কপাল', 'রাস্তা', 'মাথা', 'বল', 'হাত', 'ফল', 'চোখ', 'কর', 'পাকা', 'বর', 'জল', 'এঁটে', 'বর্ণ', 'মধু',
    'গভীর', 'তার', 'মূল', 'মুখ', 'কাজ', 'অন্ধকার', 'ঘর', 'ছত্র', 'বক', 'জাম', 'ধরা', 'দাঁড়া', 'গোলা', 'ঝুল', 'কামাই',
    'চড়', 'সন্ধি', 'ফাঁক', 'ঝড়', 'সোজা', 'পটল', 'কাটা', 'জানালা', 'আম', 'দিক', 'কুঁজো', 'বাঁশ', 'বেলা', 'বাজি',
    'তাড়া', 'সারা', 'আগুন', 'সই', 'পথ', 'দল', 'পদ', 'কালা', 'ডিম', 'গুলি', 'বিষয়', 'বিরোধ', 'নয়', 'কান', 'দম',
    'হেলা', 'কুল', 'মাত', 'বন্ধ', 'ঝোলা', 'চড়া', 'রস', 'মাটি', 'মালা', 'ধন', 'পাল', 'কর্ণ', 'আকাশ', 'शীর্ষ',
    'চিহ্ন', 'লাঠি', 'আকা', 'পুষ্কর', 'নাক', 'টোকা', 'শিশু', 'গজ', 'লাই', 'হল', 'স্বামী', 'স্পষ্ট', 'সামান্য',
    'পিচ', 'পাহাড়', 'গা', 'ময়না', 'ঠান্ডা', 'ঘোর', 'পড়া', 'আগ', 'উঠা', 'চাঁদা', 'ঢাকা', 'উপায়', 'তুলা', 'ধোঁয়া',
    'ঘন', 'পর', 'কপি', 'জিন', 'সার', 'পালা', 'পাড়া', 'ফিট', 'চাপ', 'বাড়ি', 'বয়স', 'বাটা', 'চা', 'পাতি', 'ফেলা',
    'জাল', 'পেঁচ', 'হাওয়া', 'মটকা', 'জমি', 'চাবি','কথা','রূপ','বিন্দু','খাতা','শাখা','দরজা','কান্ড','বাকী','নাম','বাকী',
    'ঘটনা','বাজে','দাগ','পাত্র','দ্বার', 'ডাক'
]


# Comprehensive list of common Bangla inflectional suffixes and standalone vowel diacritics
# These are appended to keywords in a regex to catch inflected forms, with word-boundary enforcement.
# Full list of inflectional suffixes + single vowels
suffixes = [
    '', 'কে', 'কেই', 'কেও', 'কেওই', 'কো', 'গুলা', 'গুলা-গুলা', 'গুলাগুলি', 'গুলাগো',
    'গুলান', 'গুলানা', 'গুলানো', 'গুলার', 'গুলারই', 'গুলারও', 'গুলারাই', 'গুলারে', 'গুলারেই', 'গুলারেইও',
    'গুলারটাও', 'গুলারি', 'গুলাটারে', 'গুলাডা', 'গুলাডারে', 'গুলিকেও', 'গুলিকে', 'গুলিকো', 'গুলিতে',
    'গুলিতো', 'গুলিতেও', 'গুলিতেই', 'গুলির', 'গুলিরই', 'গুলিরও', 'গুলিরেই', 'গুলিও', 'গুলিয়েই', 'গুলিয়েও',
    'গুলো', 'গুলাই', 'গুলাইও', 'গুলোই', 'গুলো-গুলো', 'গুলোকেই', 'গুলোকেও', 'গুলোতে', 'গুলোয়', 'গুলোর',
    'গুলোরই', 'গুলোরও', 'গুলোরটাই', 'গুলোরে', 'গুলোরেই', 'গুলান', 'খানা', 'খানাকে', 'খানাতে', 'খানার', 'খানায়',
    'চ্ছে', 'ছে', 'ছেন', 'ছো', 'ছি', 'ছিস', 'ছিসি', 'ছিসেন', 'ছিল', 'ছিলা', 'ছিলাই', 'ছিলাম', 'ছিলি',
    'ছিলে', 'ছিলে না', 'ছিলেন', 'ছিলেনা', 'ছিলনা', 'ছিলো', 'জন', 'জনগো', 'জনকে', 'জনেরা', 'জনের', 'জনে',
    'তা', 'টা', 'টা-টা', 'টাগো', 'টাই', 'টাইও', 'টাইনা', 'টাইওনা', 'টার', 'টারই', 'টারও', 'টারেও',
    'টারটাও', 'টারে', 'টারেই', 'টি', 'টি-টি', 'টিই', 'টিা', 'টিাে', 'টিাও', 'টিায়', 'টিেই', 'টিতে', 'টিতেও',
    'টিতেই', 'টিকে', 'টিকেই', 'টিকেও', 'টির', 'টিরই', 'টিরও', 'টিরে', 'টিরেই', 'টিরটাই', 'টিও', 'টিওনা',
    'টিনাও', 'টিনাওনা', 'টাতো', 'টাতেও', 'টাতে', 'টাকে', 'টায়', 'তাচ্ছি', 'তাছি', 'তবে', 'ত', 'তাম',
    'তি', 'তে', 'তেও', 'তেওই', 'তেওনা', 'তেই', 'তেইও', 'তেন', 'তের', 'তেরই', 'তেরও', 'তেরাই', 'তেরা', 'তেও', 'তো', 'তো',
    'তোও', 'থেকে', 'থেকেই', 'থেকেই না', 'থেকেও', 'থেকেও না', 'দিয়ে', 'দিয়েও', 'দিয়ো', 'দিয়েও না',
    'দের', 'দেরকেই', 'দেরকে', 'দেরকেও', 'দেররাও', 'দেররেই', 'দেররে', 'দেরই', 'দেরও', 'দেরওয়', 'দেরে', 'দেরেই',
    'দেরটাই', 'দেরটাইও', 'দেরনা', 'নাই', 'নাছে', 'নাছি', 'না', 'না’', 'নেই', 'নেও', 'নেওনা', 'নাও', 'নাই', 'নেই', 'ই',
    'ইও', 'ইনা', 'ইওনা', 'ইরা', 'ইগো', 'ইনাই', 'ইতে', 'ইয়', 'ইয়েই', 'ইয়েও', 'ও', 'ওগো', 'ওরা', 'ওরাও', 'ওরাওনা', 'ওরাই',
    'ওদের', 'ওদেরকেও', 'ওদেরকে', 'ওদেররাও', 'ওদেরনা', 'ওদেরই', 'ওদেরও', 'ওদেরটাই', 'ওনি', 'ওন', 'ওনা',
    'এ', 'এতে', 'এরই', 'আমেরটাই', 'ব', 'বে', 'বেন', 'বেনা', 'বেনে', 'বিনা', 'বিনা না', 'বি', 'বো', 'বো না', 'বোনে',
    'বুলছি', 'মেয়েরা', 'পান', 'পানোর', 'পানেও', 'রা', 'রা’', 'রাই', 'রাইও', 'রাইনা', 'রাে', 'রাগো', 'রাও', 'রাওই', 'রাওনা',
    'রাটা', 'রেই', 'রেইও', 'র', 'রের', 'রেরই', 'রেরও', 'রেরে', 'রও', 'লাম', 'লামও', 'লামেই', 'লামনা', 'লি', 'লে', 'লেন', 'লো',
    'তাম', 'তেন', 'হয়', 'হয়েই', 'হয়না', 'সহ', 'সহেই', 'সহেই না', 'সহেও', 'সহেও না', 'য়', 'য়াটা', 'য়ে', 'য়েই', 'য়েই না',
    'য়েইও', 'য়ো', 'য়ো না', 'য়েও', 'য়েও না', 'য়েটা', 'য়াছে', 'য়াছি', 'য়াছিল', 'য়াছিলাম', 'য়াছেন', 'য়াছিলে', 'য়াছিলেন'
] + ['া', 'ি', 'ু', 'ে', 'ো', 'ৈ', 'ৌ']  # Vowel endings (no derivation)


# Function: find_sentences_with_word
# Purpose : For a given keyword (lemma), search all .txt files in a directory and collect sentences containing 
# the lemma or lemma+suffix (from the list). Builds a regex that matches lemma + any suffix from `suffixes`, then enforces a
# word boundary by requiring a trailing whitespace, punctuation, or string end. Uses a negative lookbehind for 
# ASCII word chars to avoid partial matches there. Results are written to <lemma>.txt in the output directory.
# Inputs  :
#   input_directory (str): Path to folder with input .txt files (one sentence per line).
#   lemma_word      (str): The base keyword/lemma to search for.
#   output_directory(str): Path to folder where matched sentences file will be saved.
# Outputs :
#   None (side effects): Writes a text file with one matched sentence per line
#   Prints a completion summary with match count.
def find_sentences_with_word(input_directory, lemma_word, output_directory):
    # Prebuild a suffix alternation pattern, sorted by length (longest first) to ensure greedy matching
    # Make regex pattern for lemma + suffixes, must be followed by boundary (not a Bengali char)
    suffix_regex = '|'.join(sorted([re.escape(s) for s in suffixes if s], key=len, reverse=True))
    # Compose the final pattern:
    #   - capture lemma + one of the suffixes
    #   - enforce ending at boundary: whitespace, Bangla danda/ punctuation, bracket/quote, or string end
    # Force match to end at space, punctuation, or string end
    # Note: [\s।,;:.?!\]\["'\)\(]|$ will match boundary after lemma+suffix
    pattern = re.compile(
        rf'(?<!\w)({re.escape(lemma_word)}({suffix_regex}))(?:[\s।,;:.?!\]\["\'\)\(]|$)'
    )

    # Accumulate matched sentences
    matches = []
    # Iterate all files in the input directory
    for file_name in os.listdir(input_directory):
        # Process only .txt files
        if file_name.endswith('.txt'):
            # Full path to the file
            file_path = os.path.join(input_directory, file_name)
            # Open the file in UTF-8 and scan line by line
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line in infile:
                    # Strip whitespace/newlines
                    line = line.strip()
                    # If the compiled pattern matches anywhere in the line, keep it
                    if pattern.search(line):
                        matches.append([line])

    # Build output filename for this lemma and write one sentence per line
    # Write to txt
    output_file = os.path.join(output_directory, f"{lemma_word}.txt")
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for row in matches:
            outfile.write(row[0] + '\n')

    # Print completion summary with count and destination path
    print(f"Search complete! Found {len(matches)} matching sentences saved to {output_file}.")


# ---- Input/Output directories for the corpus and where to save per-keyword matches
input_directory = r"C:\Users\Student\Documents\Bangla\Tokenized_archive"
output_directory = r"C:\Users\Student\Documents\Bangla\extracted_sentences_prothom_alo"

# ---- Run the search for every keyword in the list, producing one output file per lemma
for lemma_word in keywords:
    find_sentences_with_word(input_directory, lemma_word, output_directory)

Search complete! Found 13569 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\অর্থ.txt.
Search complete! Found 5415 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\কপাল.txt.
Search complete! Found 65006 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\রাস্তা.txt.
Search complete! Found 58411 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\মাথা.txt.
Search complete! Found 1300518 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\বল.txt.
Search complete! Found 173001 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\হাত.txt.
Search complete! Found 147783 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\ফল.txt.
Search complete! Found 42640 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences\চোখ.txt.
Search complete! Found 5061746 ma

****IndicNLP Bangla Corpus (Tokenized)****

In [None]:
# ---- Standard library imports: filesystem paths/listing, regex utilities, CSV (unused; left for potential extensions)
import os
import re
import csv


# Target keyword list (lemmas) to search for in the corpus
keywords = [
    'অর্থ', 'কপাল', 'রাস্তা', 'মাথা', 'বল', 'হাত', 'ফল', 'চোখ', 'কর', 'পাকা', 'বর', 'জল', 'এঁটে', 'বর্ণ', 'মধু',
    'গভীর', 'তার', 'মূল', 'মুখ', 'কাজ', 'অন্ধকার', 'ঘর', 'ছত্র', 'বক', 'জাম', 'ধরা', 'দাঁড়া', 'গোলা', 'ঝুল', 'কামাই',
    'চড়', 'সন্ধি', 'ফাঁক', 'ঝড়', 'সোজা', 'পটল', 'কাটা', 'জানালা', 'আম', 'দিক', 'কুঁজো', 'বাঁশ', 'বেলা', 'বাজি',
    'তাড়া', 'সারা', 'আগুন', 'সই', 'পথ', 'দল', 'পদ', 'কালা', 'ডিম', 'গুলি', 'বিষয়', 'বিরোধ', 'নয়', 'কান', 'দম',
    'হেলা', 'কুল', 'মাত', 'বন্ধ', 'ঝোলা', 'চড়া', 'রস', 'মাটি', 'মালা', 'ধন', 'পাল', 'কর্ণ', 'আকাশ', 'शীর্ষ',
    'চিহ্ন', 'লাঠি', 'আকা', 'পুষ্কর', 'নাক', 'টোকা', 'শিশু', 'গজ', 'লাই', 'হল', 'স্বামী', 'স্পষ্ট', 'সামান্য',
    'পিচ', 'পাহাড়', 'গা', 'ময়না', 'ঠান্ডা', 'ঘোর', 'পড়া', 'আগ', 'উঠা', 'চাঁদা', 'ঢাকা', 'উপায়', 'তুলা', 'ধোঁয়া',
    'ঘন', 'পর', 'কপি', 'জিন', 'সার', 'পালা', 'পাড়া', 'ফিট', 'চাপ', 'বাড়ি', 'বয়স', 'বাটা', 'চা', 'পাতি', 'ফেলা',
    'জাল', 'পেঁচ', 'হাওয়া', 'মটকা', 'জমি', 'চাবি','কথা','রূপ','বিন্দু','খাতা','শাখা','দরজা','কান্ড','বাকী','নাম','বাকী',
    'ঘটনা','বাজে','দাগ','পাত্র','দ্বার', 'ডাক'
]


# Comprehensive list of common Bangla inflectional suffixes and standalone vowel diacritics
# These are appended to keywords in a regex to catch inflected forms, with word-boundary enforcement.
# Full list of inflectional suffixes + single vowels
suffixes = [
    '', 'কে', 'কেই', 'কেও', 'কেওই', 'কো', 'গুলা', 'গুলা-গুলা', 'গুলাগুলি', 'গুলাগো',
    'গুলান', 'গুলানা', 'গুলানো', 'গুলার', 'গুলারই', 'গুলারও', 'গুলারাই', 'গুলারে', 'গুলারেই', 'গুলারেইও',
    'গুলারটাও', 'গুলারি', 'গুলাটারে', 'গুলাডা', 'গুলাডারে', 'গুলিকেও', 'গুলিকে', 'গুলিকো', 'গুলিতে',
    'গুলিতো', 'গুলিতেও', 'গুলিতেই', 'গুলির', 'গুলিরই', 'গুলিরও', 'গুলিরেই', 'গুলিও', 'গুলিয়েই', 'গুলিয়েও',
    'গুলো', 'গুলাই', 'গুলাইও', 'গুলোই', 'গুলো-গুলো', 'গুলোকেই', 'গুলোকেও', 'গুলোতে', 'গুলোয়', 'গুলোর',
    'গুলোরই', 'গুলোরও', 'গুলোরটাই', 'গুলোরে', 'গুলোরেই', 'গুলান', 'খানা', 'খানাকে', 'খানাতে', 'খানার', 'খানায়',
    'চ্ছে', 'ছে', 'ছেন', 'ছো', 'ছি', 'ছিস', 'ছিসি', 'ছিসেন', 'ছিল', 'ছিলা', 'ছিলাই', 'ছিলাম', 'ছিলি',
    'ছিলে', 'ছিলে না', 'ছিলেন', 'ছিলেনা', 'ছিলনা', 'ছিলো', 'জন', 'জনগো', 'জনকে', 'জনেরা', 'জনের', 'জনে',
    'তা', 'টা', 'টা-টা', 'টাগো', 'টাই', 'টাইও', 'টাইনা', 'টাইওনা', 'টার', 'টারই', 'টারও', 'টারেও',
    'টারটাও', 'টারে', 'টারেই', 'টি', 'টি-টি', 'টিই', 'টিা', 'টিাে', 'টিাও', 'টিায়', 'টিেই', 'টিতে', 'টিতেও',
    'টিতেই', 'টিকে', 'টিকেই', 'টিকেও', 'টির', 'টিরই', 'টিরও', 'টিরে', 'টিরেই', 'টিরটাই', 'টিও', 'টিওনা',
    'টিনাও', 'টিনাওনা', 'টাতো', 'টাতেও', 'টাতে', 'টাকে', 'টায়', 'তাচ্ছি', 'তাছি', 'তবে', 'ত', 'তাম',
    'তি', 'তে', 'তেও', 'তেওই', 'তেওনা', 'তেই', 'তেইও', 'তেন', 'তের', 'তেরই', 'তেরও', 'তেরাই', 'তেরা', 'তেও', 'তো', 'তো',
    'তোও', 'থেকে', 'থেকেই', 'থেকেই না', 'থেকেও', 'থেকেও না', 'দিয়ে', 'দিয়েও', 'দিয়ো', 'দিয়েও না',
    'দের', 'দেরকেই', 'দেরকে', 'দেরকেও', 'দেররাও', 'দেররেই', 'দেররে', 'দেরই', 'দেরও', 'দেরওয়', 'দেরে', 'দেরেই',
    'দেরটাই', 'দেরটাইও', 'দেরনা', 'নাই', 'নাছে', 'নাছি', 'না', 'না’', 'নেই', 'নেও', 'নেওনা', 'নাও', 'নাই', 'নেই', 'ই',
    'ইও', 'ইনা', 'ইওনা', 'ইরা', 'ইগো', 'ইনাই', 'ইতে', 'ইয়', 'ইয়েই', 'ইয়েও', 'ও', 'ওগো', 'ওরা', 'ওরাও', 'ওরাওনা', 'ওরাই',
    'ওদের', 'ওদেরকেও', 'ওদেরকে', 'ওদেররাও', 'ওদেরনা', 'ওদেরই', 'ওদেরও', 'ওদেরটাই', 'ওনি', 'ওন', 'ওনা',
    'এ', 'এতে', 'এরই', 'আমেরটাই', 'ব', 'বে', 'বেন', 'বেনা', 'বেনে', 'বিনা', 'বিনা না', 'বি', 'বো', 'বো না', 'বোনে',
    'বুলছি', 'মেয়েরা', 'পান', 'পানোর', 'পানেও', 'রা', 'রা’', 'রাই', 'রাইও', 'রাইনা', 'রাে', 'রাগো', 'রাও', 'রাওই', 'রাওনা',
    'রাটা', 'রেই', 'রেইও', 'র', 'রের', 'রেরই', 'রেরও', 'রেরে', 'রও', 'লাম', 'লামও', 'লামেই', 'লামনা', 'লি', 'লে', 'লেন', 'লো',
    'তাম', 'তেন', 'হয়', 'হয়েই', 'হয়না', 'সহ', 'সহেই', 'সহেই না', 'সহেও', 'সহেও না', 'য়', 'য়াটা', 'য়ে', 'য়েই', 'য়েই না',
    'য়েইও', 'য়ো', 'য়ো না', 'য়েও', 'য়েও না', 'য়েটা', 'য়াছে', 'য়াছি', 'য়াছিল', 'য়াছিলাম', 'য়াছেন', 'য়াছিলে', 'য়াছিলেন'
] + ['া', 'ি', 'ু', 'ে', 'ো', 'ৈ', 'ৌ']  # Vowel endings (no derivation)


# Function: find_sentences_with_word
# Purpose : For a given keyword (lemma), search all .txt files in a directory and collect sentences containing 
# the lemma or lemma+suffix (from the list). Builds a regex that matches lemma + any suffix from `suffixes`, then enforces a
# word boundary by requiring a trailing whitespace, punctuation, or string end. Uses a negative lookbehind for 
# ASCII word chars to avoid partial matches there. Results are written to <lemma>.txt in the output directory.
# Inputs  :
#   input_directory (str): Path to folder with input .txt files (one sentence per line).
#   lemma_word      (str): The base keyword/lemma to search for.
#   output_directory(str): Path to folder where matched sentences file will be saved.
# Outputs :
#   None (side effects): Writes a text file with one matched sentence per line
#   Prints a completion summary with match count.
def find_sentences_with_word(input_directory, lemma_word, output_directory):
    # Prebuild a suffix alternation pattern, sorted by length (longest first) to ensure greedy matching
    # Make regex pattern for lemma + suffixes, must be followed by boundary (not a Bengali char)
    suffix_regex = '|'.join(sorted([re.escape(s) for s in suffixes if s], key=len, reverse=True))
    # Compose the final pattern:
    #   - capture lemma + one of the suffixes
    #   - enforce ending at boundary: whitespace, Bangla danda/ punctuation, bracket/quote, or string end
    # Force match to end at space, punctuation, or string end
    # Note: [\s।,;:.?!\]\["'\)\(]|$ will match boundary after lemma+suffix
    pattern = re.compile(
        rf'(?<!\w)({re.escape(lemma_word)}({suffix_regex}))(?:[\s।,;:.?!\]\["\'\)\(]|$)'
    )

    # Accumulate matched sentences
    matches = []
    # Iterate all files in the input directory
    for file_name in os.listdir(input_directory):
        # Process only .txt files
        if file_name.endswith('.txt'):
            # Full path to the file
            file_path = os.path.join(input_directory, file_name)
            # Open the file in UTF-8 and scan line by line
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line in infile:
                    # Strip whitespace/newlines
                    line = line.strip()
                    # If the compiled pattern matches anywhere in the line, keep it
                    if pattern.search(line):
                        matches.append([line])

    # Build output filename for this lemma and write one sentence per line
    # Write to txt
    output_file = os.path.join(output_directory, f"{lemma_word}.txt")
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for row in matches:
            outfile.write(row[0] + '\n')

    # Print completion summary with count and destination path
    print(f"Search complete! Found {len(matches)} matching sentences saved to {output_file}.")


# ---- Input/Output directories for the corpus and where to save per-keyword matches
input_directory = r"C:\Users\Student\Documents\Bangla\split_parts"
output_directory = r"C:\Users\Student\Documents\Bangla\extracted_sentences_indic"

# ---- Run the search for every keyword in the list, producing one output file per lemma
for lemma_word in keywords:
    find_sentences_with_word(input_directory, lemma_word, output_directory)

Search complete! Found 32334 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\অর্থ.txt.
Search complete! Found 24405 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\কপাল.txt.
Search complete! Found 94268 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\রাস্তা.txt.
Search complete! Found 61598 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\মাথা.txt.
Search complete! Found 3682752 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\বল.txt.
Search complete! Found 531707 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\হাত.txt.
Search complete! Found 486865 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\ফল.txt.
Search complete! Found 148729 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentence_indic\

****BNWaC Bengali Corpus via SketchEngine (Tokenized)****

In [None]:
# ---- Standard library imports: filesystem paths/listing, regex utilities, CSV (unused; left for potential extensions)
import os
import re
import csv


# Target keyword list (lemmas) to search for in the corpus
keywords = [
    'অর্থ', 'কপাল', 'রাস্তা', 'মাথা', 'বল', 'হাত', 'ফল', 'চোখ', 'কর', 'পাকা', 'বর', 'জল', 'এঁটে', 'বর্ণ', 'মধু',
    'গভীর', 'তার', 'মূল', 'মুখ', 'কাজ', 'অন্ধকার', 'ঘর', 'ছত্র', 'বক', 'জাম', 'ধরা', 'দাঁড়া', 'গোলা', 'ঝুল', 'কামাই',
    'চড়', 'সন্ধি', 'ফাঁক', 'ঝড়', 'সোজা', 'পটল', 'কাটা', 'জানালা', 'আম', 'দিক', 'কুঁজো', 'বাঁশ', 'বেলা', 'বাজি',
    'তাড়া', 'সারা', 'আগুন', 'সই', 'পথ', 'দল', 'পদ', 'কালা', 'ডিম', 'গুলি', 'বিষয়', 'বিরোধ', 'নয়', 'কান', 'দম',
    'হেলা', 'কুল', 'মাত', 'বন্ধ', 'ঝোলা', 'চড়া', 'রস', 'মাটি', 'মালা', 'ধন', 'পাল', 'কর্ণ', 'আকাশ', 'शীর্ষ',
    'চিহ্ন', 'লাঠি', 'আকা', 'পুষ্কর', 'নাক', 'টোকা', 'শিশু', 'গজ', 'লাই', 'হল', 'স্বামী', 'স্পষ্ট', 'সামান্য',
    'পিচ', 'পাহাড়', 'গা', 'ময়না', 'ঠান্ডা', 'ঘোর', 'পড়া', 'আগ', 'উঠা', 'চাঁদা', 'ঢাকা', 'উপায়', 'তুলা', 'ধোঁয়া',
    'ঘন', 'পর', 'কপি', 'জিন', 'সার', 'পালা', 'পাড়া', 'ফিট', 'চাপ', 'বাড়ি', 'বয়স', 'বাটা', 'চা', 'পাতি', 'ফেলা',
    'জাল', 'পেঁচ', 'হাওয়া', 'মটকা', 'জমি', 'চাবি','কথা','রূপ','বিন্দু','খাতা','শাখা','দরজা','কান্ড','বাকী','নাম','বাকী',
    'ঘটনা','বাজে','দাগ','পাত্র','দ্বার', 'ডাক'
]


# Comprehensive list of common Bangla inflectional suffixes and standalone vowel diacritics
# These are appended to keywords in a regex to catch inflected forms, with word-boundary enforcement.
# Full list of inflectional suffixes + single vowels
suffixes = [
    '', 'কে', 'কেই', 'কেও', 'কেওই', 'কো', 'গুলা', 'গুলা-গুলা', 'গুলাগুলি', 'গুলাগো',
    'গুলান', 'গুলানা', 'গুলানো', 'গুলার', 'গুলারই', 'গুলারও', 'গুলারাই', 'গুলারে', 'গুলারেই', 'গুলারেইও',
    'গুলারটাও', 'গুলারি', 'গুলাটারে', 'গুলাডা', 'গুলাডারে', 'গুলিকেও', 'গুলিকে', 'গুলিকো', 'গুলিতে',
    'গুলিতো', 'গুলিতেও', 'গুলিতেই', 'গুলির', 'গুলিরই', 'গুলিরও', 'গুলিরেই', 'গুলিও', 'গুলিয়েই', 'গুলিয়েও',
    'গুলো', 'গুলাই', 'গুলাইও', 'গুলোই', 'গুলো-গুলো', 'গুলোকেই', 'গুলোকেও', 'গুলোতে', 'গুলোয়', 'গুলোর',
    'গুলোরই', 'গুলোরও', 'গুলোরটাই', 'গুলোরে', 'গুলোরেই', 'গুলান', 'খানা', 'খানাকে', 'খানাতে', 'খানার', 'খানায়',
    'চ্ছে', 'ছে', 'ছেন', 'ছো', 'ছি', 'ছিস', 'ছিসি', 'ছিসেন', 'ছিল', 'ছিলা', 'ছিলাই', 'ছিলাম', 'ছিলি',
    'ছিলে', 'ছিলে না', 'ছিলেন', 'ছিলেনা', 'ছিলনা', 'ছিলো', 'জন', 'জনগো', 'জনকে', 'জনেরা', 'জনের', 'জনে',
    'তা', 'টা', 'টা-টা', 'টাগো', 'টাই', 'টাইও', 'টাইনা', 'টাইওনা', 'টার', 'টারই', 'টারও', 'টারেও',
    'টারটাও', 'টারে', 'টারেই', 'টি', 'টি-টি', 'টিই', 'টিা', 'টিাে', 'টিাও', 'টিায়', 'টিেই', 'টিতে', 'টিতেও',
    'টিতেই', 'টিকে', 'টিকেই', 'টিকেও', 'টির', 'টিরই', 'টিরও', 'টিরে', 'টিরেই', 'টিরটাই', 'টিও', 'টিওনা',
    'টিনাও', 'টিনাওনা', 'টাতো', 'টাতেও', 'টাতে', 'টাকে', 'টায়', 'তাচ্ছি', 'তাছি', 'তবে', 'ত', 'তাম',
    'তি', 'তে', 'তেও', 'তেওই', 'তেওনা', 'তেই', 'তেইও', 'তেন', 'তের', 'তেরই', 'তেরও', 'তেরাই', 'তেরা', 'তেও', 'তো', 'তো',
    'তোও', 'থেকে', 'থেকেই', 'থেকেই না', 'থেকেও', 'থেকেও না', 'দিয়ে', 'দিয়েও', 'দিয়ো', 'দিয়েও না',
    'দের', 'দেরকেই', 'দেরকে', 'দেরকেও', 'দেররাও', 'দেররেই', 'দেররে', 'দেরই', 'দেরও', 'দেরওয়', 'দেরে', 'দেরেই',
    'দেরটাই', 'দেরটাইও', 'দেরনা', 'নাই', 'নাছে', 'নাছি', 'না', 'না’', 'নেই', 'নেও', 'নেওনা', 'নাও', 'নাই', 'নেই', 'ই',
    'ইও', 'ইনা', 'ইওনা', 'ইরা', 'ইগো', 'ইনাই', 'ইতে', 'ইয়', 'ইয়েই', 'ইয়েও', 'ও', 'ওগো', 'ওরা', 'ওরাও', 'ওরাওনা', 'ওরাই',
    'ওদের', 'ওদেরকেও', 'ওদেরকে', 'ওদেররাও', 'ওদেরনা', 'ওদেরই', 'ওদেরও', 'ওদেরটাই', 'ওনি', 'ওন', 'ওনা',
    'এ', 'এতে', 'এরই', 'আমেরটাই', 'ব', 'বে', 'বেন', 'বেনা', 'বেনে', 'বিনা', 'বিনা না', 'বি', 'বো', 'বো না', 'বোনে',
    'বুলছি', 'মেয়েরা', 'পান', 'পানোর', 'পানেও', 'রা', 'রা’', 'রাই', 'রাইও', 'রাইনা', 'রাে', 'রাগো', 'রাও', 'রাওই', 'রাওনা',
    'রাটা', 'রেই', 'রেইও', 'র', 'রের', 'রেরই', 'রেরও', 'রেরে', 'রও', 'লাম', 'লামও', 'লামেই', 'লামনা', 'লি', 'লে', 'লেন', 'লো',
    'তাম', 'তেন', 'হয়', 'হয়েই', 'হয়না', 'সহ', 'সহেই', 'সহেই না', 'সহেও', 'সহেও না', 'য়', 'য়াটা', 'য়ে', 'য়েই', 'য়েই না',
    'য়েইও', 'য়ো', 'য়ো না', 'য়েও', 'য়েও না', 'য়েটা', 'য়াছে', 'য়াছি', 'য়াছিল', 'য়াছিলাম', 'য়াছেন', 'য়াছিলে', 'য়াছিলেন'
] + ['া', 'ি', 'ু', 'ে', 'ো', 'ৈ', 'ৌ']  # Vowel endings (no derivation)


# Function: find_sentences_with_word
# Purpose : For a given keyword (lemma), search all .txt files in a directory and collect sentences containing 
# the lemma or lemma+suffix (from the list). Builds a regex that matches lemma + any suffix from `suffixes`, then enforces a
# word boundary by requiring a trailing whitespace, punctuation, or string end. Uses a negative lookbehind for 
# ASCII word chars to avoid partial matches there. Results are written to <lemma>.txt in the output directory.
# Inputs  :
#   input_directory (str): Path to folder with input .txt files (one sentence per line).
#   lemma_word      (str): The base keyword/lemma to search for.
#   output_directory(str): Path to folder where matched sentences file will be saved.
# Outputs :
#   None (side effects): Writes a text file with one matched sentence per line
#   Prints a completion summary with match count.
def find_sentences_with_word(input_directory, lemma_word, output_directory):
    # Prebuild a suffix alternation pattern, sorted by length (longest first) to ensure greedy matching
    # Make regex pattern for lemma + suffixes, must be followed by boundary (not a Bengali char)
    suffix_regex = '|'.join(sorted([re.escape(s) for s in suffixes if s], key=len, reverse=True))
    # Compose the final pattern:
    #   - capture lemma + one of the suffixes
    #   - enforce ending at boundary: whitespace, Bangla danda/ punctuation, bracket/quote, or string end
    # Force match to end at space, punctuation, or string end
    # Note: [\s।,;:.?!\]\["'\)\(]|$ will match boundary after lemma+suffix
    pattern = re.compile(
        rf'(?<!\w)({re.escape(lemma_word)}({suffix_regex}))(?:[\s।,;:.?!\]\["\'\)\(]|$)'
    )

    # Accumulate matched sentences
    matches = []
    # Iterate all files in the input directory
    for file_name in os.listdir(input_directory):
        # Process only .txt files
        if file_name.endswith('.txt'):
            # Full path to the file
            file_path = os.path.join(input_directory, file_name)
            # Open the file in UTF-8 and scan line by line
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line in infile:
                    # Strip whitespace/newlines
                    line = line.strip()
                    # If the compiled pattern matches anywhere in the line, keep it
                    if pattern.search(line):
                        matches.append([line])

    # Build output filename for this lemma and write one sentence per line
    # Write to txt
    output_file = os.path.join(output_directory, f"{lemma_word}.txt")
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for row in matches:
            outfile.write(row[0] + '\n')

    # Print completion summary with count and destination path
    print(f"Search complete! Found {len(matches)} matching sentences saved to {output_file}.")


# ---- Input/Output directories for the corpus and where to save per-keyword matches
input_directory = r"C:\Users\Student\Documents\Bangla\Sketch engine"
output_directory = r"C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine"

# ---- Run the search for every keyword in the list, producing one output file per lemma
for lemma_word in keywords:
    find_sentences_with_word(input_directory, lemma_word, output_directory)

Search complete! Found 2062 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\অর্থ.txt.
Search complete! Found 1596 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\কপাল.txt.
Search complete! Found 11415 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\রাস্তা.txt.
Search complete! Found 7492 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\মাথা.txt.
Search complete! Found 116495 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\বল.txt.
Search complete! Found 27488 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\হাত.txt.
Search complete! Found 22118 matching sentences saved to C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine\ফল.txt.
Search complete! Found 13715 matching sentences saved to C:\

**Bangla Multi-Corpus Sentence Merger & Shuffler**

This script consolidates per-lemma sentence lists from multiple corpora. For every lemma (one lemma per .txt filename), it:

 1) Discovers all lemmas present across the given source folders.
 2) Loads every available file for that lemma, strips blank lines, and concatenates all sentences.
 3) Shuffles the merged sentences to reduce corpus-order bias.
 4) Writes a single combined file per lemma to a target folder, reporting how many sentences were saved.

In [None]:
import os                                                     # Filesystem utilities: paths, directory ops
import glob                                                   # Pattern-based file discovery (e.g., *.txt)
import random                                                 # Random shuffling to reduce corpus-order bias

# Folders containing your txt files for each corpus
folders = [                                                   # List of source directories to merge from
    r'C:\Users\Student\Documents\Bangla\extracted_sentences_indic',            # Corpus 1
    r'C:\Users\Student\Documents\Bangla\extracted_sentences_prothom_alo',      # Corpus 2
    r'C:\Users\Student\Documents\Bangla\extracted_sentences_sketch_engine',    # Corpus 3
]
output_folder = r'C:\Users\Student\Documents\Bangla\final_dataset'             # Destination directory for merged files
os.makedirs(output_folder, exist_ok=True)                      # Create destination if missing (no error if exists)

# Get all unique word filenames from all corpora
all_files = []                                                 # Accumulator for all discovered .txt paths
for folder in folders:                                         # Iterate each source directory
    all_files.extend(glob.glob(os.path.join(folder, '*.txt'))) # Append all .txt files from this folder
words = set(os.path.splitext(os.path.basename(f))[0] for f in all_files)  # Unique lemma names (filename without extension)

for word in words:                                             # Process each lemma independently
    sentences = []                                             # Collector for sentences across all corpora for this lemma
    for folder in folders:                                     # Check each source folder for this lemma’s file
        file_path = os.path.join(folder, word + '.txt')        # Expected path for the lemma file in this folder
        if os.path.isfile(file_path):                          # Only proceed if the file exists here
            with open(file_path, 'r', encoding='utf-8') as f:  # Open lemma file (UTF-8)
                lines = [line.strip() for line in f if line.strip()]  # Read non-empty lines, strip whitespace
                sentences.extend(lines)                        # Add sentences from this corpus to the pool
    # Shuffle to remove bias from corpus order
    random.shuffle(sentences)                                  # Randomize order to mitigate source ordering effects
    # Save the combined file
    output_path = os.path.join(output_folder, word + '.txt')   # Path for the merged output file for this lemma
    with open(output_path, 'w', encoding='utf-8') as out:      # Create/overwrite the merged file
        for line in sentences:                                 # Write each sentence on its own line
            out.write(line + '\n')                             # Persist sentence with newline
    print(f"{word}: {len(sentences)} sentences combined and saved.")  # Progress report per lemma

print("All datasets combined and shuffled.")                   # Final completion message

তার: 1962429 sentences combined and saved.
পাল: 96179 sentences combined and saved.
বন্ধ: 240872 sentences combined and saved.
খাতা: 10507 sentences combined and saved.
চা: 1764345 sentences combined and saved.
পেঁচ: 1803 sentences combined and saved.
বাটা: 8736 sentences combined and saved.
কান্ড: 6647 sentences combined and saved.
উঠা: 18063 sentences combined and saved.
চড়: 75160 sentences combined and saved.
চিহ্ন: 6121 sentences combined and saved.
পাতি: 7550 sentences combined and saved.
রাস্তা: 170689 sentences combined and saved.
পাহাড়: 61591 sentences combined and saved.
ঢাকা: 334607 sentences combined and saved.
পুষ্কর: 110 sentences combined and saved.
রস: 86418 sentences combined and saved.
গুলি: 172409 sentences combined and saved.
বাজে: 2975 sentences combined and saved.
নাক: 359389 sentences combined and saved.
ধন: 148487 sentences combined and saved.
বাজি: 29126 sentences combined and saved.
জল: 102350 sentences combined and saved.
এঁটে: 1916 sentences combined and sa