# Telugu Tokenization Pipeline using IndicCorpV2

This notebook demonstrates how to:
- Load Telugu text from the AI4Bharat IndicCorpV2 dataset
- Tokenize text into sentences and words using Unicode-aware regex
- Save tokenized output to a file
- Compute corpus-level statistics
- Tokenize user input interactively

---

In [None]:
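# Load dataset
# !pip install datasets
from datasets import load_dataset

# streaming=True iterates over the Telugu split lazily instead of downloading it in full
dataset = load_dataset("ai4bharat/IndicCorpV2", "indiccorp_v2", split="tel_Telu", streaming=True)

# Print the first record as a sanity check
first_item = next(iter(dataset))
print(first_item['text'])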

అమెరికా అధ్యక్షుడు డొనాల్డ్ ట్రంప్ కు రాష్ట్రపతి  భవన్ వద్ద ఘనస్వాగతం లభించింది. ఆయనకు రాష్ట్రపతి రామ్ నాథ్ కోవింద్ దంపతులు, ప్రధాని మోదీ సాదరంగా ఆహ్వానం పలకడంతో పాటు సైనికులు గౌరవ వందనాన్ని అందించారు.


## Tokenization Functions

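We define two functions:
- `tokenize_sentences`: Splits text into sentences on runs of sentence-ending punctuation (`।`, `.`, `!`, `?`) and discards the punctuation.
- `tokenize_words`: Uses a Unicode-aware regex to extract dates, URLs, emails, numbers, and Telugu words; every other non-space character (punctuation, symbols, Latin letters) becomes a single-character token.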
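
The word tokenizer depends on regex alternation trying its alternatives from left to right, which is why specific patterns such as dates need to come before generic ones such as plain digits. The following is a minimal sketch of that effect using shortened, illustrative patterns (not the full pattern from this notebook); the variable names are hypothetical.

```python
import re

text = "తేదీ 12/08/2025"

# Specific alternative (date) listed before the generic one (digits).
specific_first = r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d+|[\u0C00-\u0C7F]+|[^\s\u0C00-\u0C7F]'
# Same alternatives, but with the generic digit pattern listed first.
generic_first = r'\d+|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|[\u0C00-\u0C7F]+|[^\s\u0C00-\u0C7F]'

date_kept = re.findall(specific_first, text)   # date stays one token
date_split = re.findall(generic_first, text)   # date is cut into digits and slashes

print(date_kept)
print(date_split)
```

With the generic pattern first, `\d+` consumes `12` before the date alternative is ever tried, so the date falls apart into five tokens.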

In [2]:
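# Define tokenizers
import re

def tokenize_sentences(text):
    """Split text on runs of sentence-ending punctuation (danda, '.', '!', '?').

    The delimiters are discarded. Every '.' counts as a boundary, so decimals,
    URLs, and emails are cut at their dots before tokenize_words sees them.
    """
    sentences = re.split(r'[।.!?]+\s*', text)
    return [s.strip() for s in sentences if s.strip()]

def tokenize_words(text):
    """Return the tokens of text in order; alternatives are tried left to right."""
    pattern = r'''
        \d{1,2}[/-]\d{1,2}[/-]\d{2,4}     |  # Dates like 12/08/2025
        \d{4}-\d{2}-\d{2}                 |  # Dates like 2025-08-05
        https?://\S+                     |  # URLs with http or https
        www\.\S+                         |  # URLs starting with www
        [\w._%+-]+@[\w.-]+\.\w+          |  # Email addresses
        \d+\.\d+                         |  # Decimal numbers
        \d+                              |  # Whole numbers
        [\u0C00-\u0C7F]+                 |  # Telugu words
        [^\s\u0C00-\u0C7F]               # Any other single non-space character
    '''
    return re.findall(pattern, text, re.VERBOSE)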
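
Because `tokenize_sentences` treats every `.` as a boundary, text containing decimals or URLs is split inside those items. The sketch below shows the effect on a made-up sample string and, as one possible alternative (an assumption, not the notebook's method), a stricter split that only fires when the punctuation is followed by whitespace or the end of the text. Switching splitters would change the corpus statistics reported later.

```python
import re

# Made-up sample: a decimal and a URL inside two sentences.
text = "ధర 3.14 రూపాయలు. www.example.com చూడండి!"

# Splitter as defined in this notebook: any run of । . ! ? is a boundary.
naive = [s.strip() for s in re.split(r'[।.!?]+\s*', text) if s.strip()]

# Hypothetical stricter variant: require whitespace or end of text after the punctuation.
strict = [s.strip() for s in re.split(r'[।.!?]+(?:\s+|$)', text) if s.strip()]

print(naive)   # the decimal and the URL are cut at their dots
print(strict)  # two sentences, decimal and URL intact
```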

## Tokenize Dataset and Save to File

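We tokenize each paragraph into sentences and words, then write one sentence per line, with tokens separated by single spaces, to a `.txt` file. For quick testing, only the first 1,000 paragraphs are processed.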


In [3]:
# Tokenize and save
output_file = "tokenized_telugu.txt"

sentence_list = []
word_list = []

with open(output_file, "w", encoding="utf-8") as f_out:
    count = 0
    for item in dataset:
        paragraph = item["text"]
        sentences = tokenize_sentences(paragraph)

        for sent in sentences:
            words = tokenize_words(sent)
            if words:
                f_out.write(" ".join(words) + "\n")
                sentence_list.append(words)
                word_list.extend(words)

        # Limit for quick testing
        count += 1
        if count >= 1000:
            break
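
The manual counter above caps the run at 1,000 paragraphs. An equivalent way to limit a streaming iterable is `itertools.islice`, sketched here with a hypothetical generator (`fake_stream`) standing in for the dataset so the example runs without a download.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for the streaming dataset: yields dicts with a 'text' key forever."""
    n = 0
    while True:
        n += 1
        yield {"text": f"paragraph {n}"}

limit = 3  # the notebook uses 1000

# islice stops after `limit` items without consuming the rest of the stream.
paragraphs = [item["text"] for item in islice(fake_stream(), limit)]
print(paragraphs)
```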


## Corpus Statistics

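We compute:
- Total sentences, words (all tokens, including punctuation and number tokens), and characters (summed over tokens, so whitespace is excluded)
- Average sentence length (tokens per sentence) and average word length (characters per token)
- Type/Token Ratio (TTR): the number of distinct tokens divided by the total number of tokens

Character counts use Python's `len`, which counts Unicode code points rather than visible Telugu syllables.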
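
To make these definitions concrete, here is a small worked example on a toy corpus of two tokenized sentences (made-up data, not drawn from IndicCorpV2). It also shows that `len` counts code points: the word for "Telugu" has three syllables but six code points.

```python
# Toy corpus: two tokenized sentences (illustrative data only).
sentences = [["ఇది", "ఒక", "వాక్యం"], ["ఇది", "తెలుగు"]]
words = [w for sent in sentences for w in sent]

avg_sentence_len = len(words) / len(sentences)   # 5 tokens / 2 sentences
ttr = len(set(words)) / len(words)               # 4 distinct tokens / 5 tokens

# "తెలుగు" spelled out as code points: consonants plus dependent vowel signs.
telugu = "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"
code_points = len(telugu)

print(avg_sentence_len, ttr, code_points)
```

Counting syllables instead of code points would need grapheme-cluster segmentation, which the standard library does not provide.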


In [4]:
# Compute Statistics
def compute_statistics(sentences, words):
    total_sentences = len(sentences)
    total_words = len(words)
    total_chars = sum(len(word) for word in words)
    avg_sentence_len = total_words / total_sentences if total_sentences > 0 else 0
    avg_word_len = total_chars / total_words if total_words > 0 else 0
    ttr = len(set(words)) / total_words if total_words > 0 else 0

    return {
        'Total Sentences': total_sentences,
        'Total Words': total_words,
        'Total Characters': total_chars,
        'Average Sentence Length': round(avg_sentence_len, 2),
        'Average Word Length': round(avg_word_len, 2),
        'Type/Token Ratio (TTR)': round(ttr, 4)
    }


In [5]:
stats = compute_statistics(sentence_list, word_list)
print("\n--- Corpus Statistics ---")
for key, value in stats.items():
    print(f"{key}: {value}")


--- Corpus Statistics ---
Total Sentences: 1945
Total Words: 19367
Total Characters: 111631
Average Sentence Length: 9.96
Average Word Length: 5.76
Type/Token Ratio (TTR): 0.4289


## Interactive Tokenization

You can enter a Telugu sentence or paragraph to see how it gets tokenized.

In [None]:
# User Input
print("\n🔤 Enter a Telugu sentence or paragraph to tokenize (or press Enter to skip):")
user_input = input()

if user_input.strip():
    user_sentences = tokenize_sentences(user_input)
    print("\nTokenized Output:")
    for i, sent in enumerate(user_sentences, 1):
        words = tokenize_words(sent)
        print(f"Sentence {i}: {' '.join(words)}")
else:
    print("No input provided. Skipping user tokenization.")


🔤 Enter a Telugu sentence or paragraph to tokenize (or press Enter to skip):
