# The akshar Tokenizer

This notebook walks through the features of akshar, a linguistically aware tokenizer for Hindi, Sanskrit, and Hinglish.

## What You'll See:
- akshar segmentation (conjunct preservation)
- Code-switch detection
- Text normalization
- Hinglish processing
- Sanskrit handling
- Phonetic analysis, multi-script detection, and feature extraction


## Setup

Import the tokenizer, create an instance, and tokenize a first Hinglish sentence.


In [1]:
from akshar import aksharTokenizer


In [2]:
tokenizer = aksharTokenizer()


In [3]:
text = "aaj मौसम बहुत अच्छा है"
tokens = tokenizer.tokenize(text)
print(tokens)

['a', 'a', 'j', ' ', 'मौ', 'स', 'म', ' ', 'ब', 'हु', 'त', ' ', 'अ', 'च्छा', ' ', 'है']


## Word-Level Tokenization

Get word-level tokens (comparable to IndicNLP's `trivial_tokenize`) by running akshar's preprocessing and then splitting on whitespace.

In [4]:
# word-level tokenization with akshar intelligence
def word_tokenize(text):
    """
    get word-level tokens using akshar preprocessing
    
    uses akshar's normalization but returns words instead of akshars
    """
    tokenizer = aksharTokenizer()
    
    # use akshar's intelligent preprocessing
    normalized = tokenizer.preprocess(text)
    
    # split on whitespace
    words = normalized.split()
    
    return words

# test it
text = "aaj मौसम बहुत अच्छा है"
words = word_tokenize(text)

print("Word-Level Tokenization:")
print(f"\nInput:  {text}")
print(f"Output: {words}")
print(f"\nToken count: {len(words)}")

# compare with akshar-level
akshars = tokenizer.tokenize(text)
print(f"\nComparison:")
print(f"  Word-level:    {len(words)} tokens")
print(f"  akshar-level: {len(akshars)} tokens")

Word-Level Tokenization:

Input:  aaj मौसम बहुत अच्छा है
Output: ['aaj', 'मौसम', 'बहुत', 'अच्छा', 'है']

Token count: 5

Comparison:
  Word-level:    5 tokens
  akshar-level: 16 tokens


In [5]:
analysis = tokenizer.explain(text)
print(analysis)

{'original': 'aaj मौसम बहुत अच्छा है', 'normalized': 'aaj मौसम बहुत अच्छा है', 'akshars': ['a', 'a', 'j', ' ', 'मौ', 'स', 'म', ' ', 'ब', 'हु', 'त', ' ', 'अ', 'च्छा', ' ', 'है'], 'code_switches': [('aaj ', 'roman'), ('मौसम बहुत अच्छा है', 'devanagari')], 'tokens': ['a', 'a', 'j', ' ', 'मौ', 'स', 'म', ' ', 'ब', 'हु', 'त', ' ', 'अ', 'च्छा', ' ', 'है'], 'stats': {'akshar_count': 16, 'script_switches': 1, 'devanagari_ratio': 0.8181818181818182, 'roman_ratio': 0.18181818181818182}}


# akshar Features Demo

The sections below demonstrate akshar's features one at a time, each with a code cell and its recorded output.


## 1. Basic Tokenization

Tokenize text with no trained model loaded; in that case `tokenize()` falls back to akshar-level segmentation.


In [6]:
examples = [
    "आज मौसम बहुत अच्छा है",
    "नमस्ते दुनिया",
    "क्षेत्रे धर्मक्षेत्रे",
]

for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text}")
    print(f"Tokens ({len(tokens)}): {' | '.join(tokens)}")
    print()


Text: आज मौसम बहुत अच्छा है
Tokens (15): आ | ज |   | मौ | स | म |   | ब | हु | त |   | अ | च्छा |   | है

Text: नमस्ते दुनिया
Tokens (7): न | म | स्ते |   | दु | नि | या

Text: क्षेत्रे धर्मक्षेत्रे
Tokens (7): क्षे | त्रे |   | ध | र्म | क्षे | त्रे



## 2. akshar Segmentation

Shows how Devanagari conjuncts stay together as single units


In [7]:
from akshar import segment_akshars

sanskrit_examples = [
    "क्षेत्र",      # kShetra - notice क्ष stays together
    "ज्ञान",        # gyaan - ज्ञ stays together
    "त्रिशूल",      # trishul - त्र stays together
    "धर्मक्षेत्रे",  # dharmakshetre - multiple conjuncts
]

for text in sanskrit_examples:
    akshars = segment_akshars(text)
    print(f"Text: {text}")
    print(f"akshars: [ {' ] [ '.join(akshars)} ]")
    print(f"Count: {len(akshars)}")
    print()


Text: क्षेत्र
akshars: [ क्षे ] [ त्र ]
Count: 2

Text: ज्ञान
akshars: [ ज्ञा ] [ न ]
Count: 2

Text: त्रिशूल
akshars: [ त्रि ] [ शू ] [ ल ]
Count: 3

Text: धर्मक्षेत्रे
akshars: [ ध ] [ र्म ] [ क्षे ] [ त्रे ]
Count: 4
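
To make the grouping rule concrete, here is a minimal sketch of conjunct-preserving segmentation using only the standard library. It is an assumption about how such a segmenter can work, not akshar's actual implementation, and `segment_clusters` is a hypothetical name. A code point joins the current unit if it is a combining mark (vowel sign, virama, anusvara, visarga, nukta) or if it is a Devanagari character that directly follows a virama.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def is_devanagari(ch):
    """True for code points in the Devanagari block (U+0900 to U+097F)."""
    return "\u0900" <= ch <= "\u097f"


def segment_clusters(text):
    """Split text into akshar-like units (hypothetical helper, not akshar's API)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and is_devanagari(ch)
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch   # extend the current unit
        else:
            clusters.append(ch)  # start a new unit
    return clusters


print(segment_clusters("क्षेत्र"))       # ['क्षे', 'त्र']
print(segment_clusters("धर्मक्षेत्रे"))  # ['ध', 'र्म', 'क्षे', 'त्रे']
```

The sketch ignores zero-width joiners and other edge cases, so treat it as an illustration of the idea rather than a replacement for `segment_akshars`.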



## 3. Code-Switch Detection

Detects the boundaries where the script changes, as in Hinglish. In the output, whitespace stays attached to the segment that precedes it.


In [8]:
from akshar import detect_code_switches

hinglish_examples = [
    "aaj मौसम बहुत अच्छा है",
    "मैं California में रहता हूं",
    "hello यार kya हाल है",
    "आज का day बहुत nice था",
]

for text in hinglish_examples:
    switches = detect_code_switches(text)
    print(f"Text: {text}")
    print("Script Boundaries:")
    for segment, script in switches:
        print(f"  [{script:12}] '{segment}'")
    print()


Text: aaj मौसम बहुत अच्छा है
Script Boundaries:
  [roman       ] 'aaj '
  [devanagari  ] 'मौसम बहुत अच्छा है'

Text: मैं California में रहता हूं
Script Boundaries:
  [devanagari  ] 'मैं '
  [roman       ] 'California '
  [devanagari  ] 'में रहता हूं'

Text: hello यार kya हाल है
Script Boundaries:
  [roman       ] 'hello '
  [devanagari  ] 'यार '
  [roman       ] 'kya '
  [devanagari  ] 'हाल है'

Text: आज का day बहुत nice था
Script Boundaries:
  [devanagari  ] 'आज का '
  [roman       ] 'day '
  [devanagari  ] 'बहुत '
  [roman       ] 'nice '
  [devanagari  ] 'था'
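
The segments above can be reproduced with a simple run-grouping pass. The sketch below is an assumption about one way to do it, not akshar's code; `script_of` and `script_runs` are hypothetical names. Script-neutral characters such as spaces join whichever run is currently open, which is why each segment in the output keeps its trailing space.

```python
def script_of(ch):
    """Classify one character; None means script-neutral (space, digit, punctuation)."""
    if "\u0900" <= ch <= "\u097f":
        return "devanagari"
    if ch.isascii() and ch.isalpha():
        return "roman"
    return None


def script_runs(text):
    """Group text into (segment, script) runs; neutral characters join the open run."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or runs[-1][1] in (None, script)):
            segment, previous = runs[-1]
            # a run that began with neutral characters adopts the first real script
            runs[-1] = (segment + ch, previous or script)
        else:
            runs.append((ch, script))
    return runs


for segment, script in script_runs("hello यार kya हाल है"):
    print(f"[{script}] {segment!r}")
```

The number of script switches is then `len(runs) - 1` for non-empty text.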



## 4. Text Normalization

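Handles Unicode normalization and Hinglish cleaning. In the examples below, Roman text is lowercased, elongations are collapsed, and Devanagari is left unchanged.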


In [9]:
from akshar import normalize_text

normalization_examples = [
    "Hello नमस्ते WORLD",           # mixed case Roman
    "Heyyy यार kya HAAL hai",       # elongations + case
    "मैं California में रहता हूं",  # mixed scripts
]

print("Text Normalization (lowercase Roman, preserve Devanagari):\n")
for text in normalization_examples:
    normalized = normalize_text(text)
    print(f"Original:   {text}")
    print(f"Normalized: {normalized}")
    print()


Text Normalization (lowercase Roman, preserve Devanagari):

Original:   Hello नमस्ते WORLD
Normalized: hello नमस्ते world

Original:   Heyyy यार kya HAAL hai
Normalized: hey यार kya haal hai

Original:   मैं California में रहता हूं
Normalized: मैं california में रहता हूं



## 5. Hinglish Normalization

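Removes elongations and computes phonetic signatures for the spelling variants common in social media. In the output below, the signatures bring some variants together (`yaar`/`yar`, `nahi`/`nahee`) but not all of them (`nahii`, and the three spellings of `achha`).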


In [10]:
from akshar.normalize import remove_elongations, roman_phonetic_signature

hinglish_variations = [
    "heyyy",
    "yaaaaar", 
    "niceeee",
    "bohoooot",
]

print("Elongation Removal:\n")
for word in hinglish_variations:
    cleaned = remove_elongations(word)
    print(f"{word:15} → {cleaned}")

print("\n" + "="*50)
print("\nPhonetic Signatures (for alignment):\n")

variations = [
    ["nahi", "nahii", "nahee"],
    ["yaar", "yar"],
    ["achha", "acha", "accha"],
]

for group in variations:
    signatures = [roman_phonetic_signature(w) for w in group]
    print(f"Variations: {', '.join(group)}")
    print(f"Signatures: {', '.join(signatures)}")
    print()


Elongation Removal:

heyyy           → hey
yaaaaar         → yar
niceeee         → nice
bohoooot        → bohot


Phonetic Signatures (for alignment):

Variations: nahi, nahii, nahee
Signatures: nahi, nahii, nahi

Variations: yaar, yar
Signatures: yar, yar

Variations: achha, acha, accha
Signatures: acha, aca, acca



## 6. Detailed Analysis

The `explain()` method returns a complete breakdown: the normalized text, akshars, tokens, code-switch segments, and summary statistics.


In [11]:
text = "yaar aaj ka मौसम बहुत अच्छा hai"
analysis = tokenizer.explain(text)

print(f"Original Text: {analysis['original']}")
print(f"Normalized:    {analysis['normalized']}")
print(f"\nakshars ({len(analysis['akshars'])}): {' | '.join(analysis['akshars'][:20])}...")
print(f"\nTokens ({len(analysis['tokens'])}): {' | '.join(analysis['tokens'][:20])}...")

print(f"\nCode Switches:")
for segment, script in analysis['code_switches']:
    print(f"  [{script:12}] '{segment}'")

print(f"\nStatistics:")
stats = analysis['stats']
print(f"  akshar count:    {stats['akshar_count']}")
print(f"  Script switches:  {stats['script_switches']}")
print(f"  Devanagari ratio: {stats['devanagari_ratio']:.1%}")
print(f"  Roman ratio:      {stats['roman_ratio']:.1%}")


Original Text: yaar aaj ka मौसम बहुत अच्छा hai
Normalized:    yaar aaj ka मौसम बहुत अच्छा hai

akshars (26): y | a | a | r |   | a | a | j |   | k | a |   | मौ | स | म |   | ब | हु | त |  ...

Tokens (26): y | a | a | r |   | a | a | j |   | k | a |   | मौ | स | म |   | ब | हु | त |  ...

Code Switches:
  [roman       ] 'yaar aaj ka '
  [devanagari  ] 'मौसम बहुत अच्छा '
  [roman       ] 'hai'

Statistics:
  akshar count:    26
  Script switches:  2
  Devanagari ratio: 51.6%
  Roman ratio:      48.4%
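
The two ratios are consistent with counting every code point in each code-switch segment, attached whitespace included: 16 of 31 code points fall in the Devanagari segment and 15 in the Roman ones. This reading is inferred from the printed numbers, not from akshar's source; `script_ratios` below is a hypothetical helper that reproduces the arithmetic.

```python
def script_ratios(segments):
    """Share of code points per script, given (segment, script) pairs."""
    total = sum(len(segment) for segment, _ in segments) or 1  # avoid dividing by zero
    ratios = {}
    for segment, script in segments:
        ratios[script] = ratios.get(script, 0.0) + len(segment) / total
    return ratios


# segments copied from the "Code Switches" output above
segments = [
    ("yaar aaj ka ", "roman"),
    ("मौसम बहुत अच्छा ", "devanagari"),
    ("hai", "roman"),
]
ratios = script_ratios(segments)
print(f"Devanagari: {ratios['devanagari']:.1%}")  # 51.6%
print(f"Roman:      {ratios['roman']:.1%}")       # 48.4%
```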


### Feature 18: Morphological Segmentation

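Breaks words into morphemes. Requires Morfessor (`pip install morfessor`); in the run recorded below the model was not loaded, so only the fallback message is printed.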


In [12]:
from akshar import get_hindi_segmenter, segment_hindi

segmenter = get_hindi_segmenter()

if segmenter.is_model_loaded():
    print("✓ Hindi Morfessor model loaded!\n")
    print("Morphological Segmentation:\n")
    
    hindi_words = ["नमस्ते", "अच्छा", "समझना", "लड़की", "करना"]
    
    for word in hindi_words:
        morphemes = segmenter.segment_word(word)
        print(f"  {word:15} → {' + '.join(morphemes)}")
else:
    print("⚠ Morfessor not installed")
    print("Install with: pip install morfessor")
    print("\nNote: Morphological segmentation will fall back to basic mode")


⚠ Morfessor not installed
Install with: pip install morfessor

Note: Morphological segmentation will fall back to basic mode


### Feature 20: Complete Feature Comparison

akshar vs Basic Tokenization


## 7. Composition Analysis

Analyze text composition across different types


In [13]:
from akshar.segment import analyze_text_composition

test_texts = {
    "Pure Hindi": "आज मौसम बहुत अच्छा है",
    "Pure English": "today weather is very nice",
    "Hinglish": "aaj ka मौसम बहुत nice hai",
    "Sanskrit": "क्षेत्रे धर्मक्षेत्रे समवेता युयुत्सवः",
}

print("Text Composition Analysis:\n")
print(f"{'Type':<15} {'akshars':<10} {'Switches':<10} {'Devanagari':<12} {'Roman':<10}")
print("="*65)

for name, text in test_texts.items():
    comp = analyze_text_composition(text)
    print(f"{name:<15} {comp['akshar_count']:<10} {comp['script_switches']:<10} "
          f"{comp['devanagari_ratio']:<12.1%} {comp['roman_ratio']:<10.1%}")


Text Composition Analysis:

Type            akshars    Switches   Devanagari   Roman     
Pure Hindi      15         0          100.0%       0.0%      
Pure English    26         0          0.0%         100.0%    
Hinglish        23         2          40.0%        60.0%     
Sanskrit        17         0          100.0%       0.0%      


## 8. Tokenization with Metadata

Get tokens plus all analysis in one call


In [14]:
text = "hello नमस्ते दोस्तों"
result = tokenizer.tokenize(text, return_metadata=True)

print("Tokenization with Metadata:\n")
print(f"Original:       {result['original_text']}")
print(f"Normalized:     {result['normalized_text']}")
print(f"Token count:    {result['token_count']}")
print(f"akshar count:  {result['akshar_count']}")
print(f"Script switches: {result['script_switches']}")
print(f"Tokens:         {' | '.join(result['tokens'])}")


Tokenization with Metadata:

Original:       hello नमस्ते दोस्तों
Normalized:     hello नमस्ते दोस्तों
Token count:    12
akshar count:  12
Script switches: 1
Tokens:         h | e | l | l | o |   | न | म | स्ते |   | दो | स्तों


## 9. Script Detection

Identify which script each character belongs to


In [15]:
from akshar.segment import identify_script

test_chars = ['न', 'a', 'क', 'h', '5', '.', 'म', 'Z']

print("Character Script Detection:\n")
for char in test_chars:
    script = identify_script(char)
    print(f"'{char}' → {script}")


Character Script Detection:

'न' → devanagari
'a' → roman
'क' → devanagari
'h' → roman
'5' → digit
'.' → punct
'म' → devanagari
'Z' → roman
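
A per-character classifier with the same four labels can be sketched with the standard `unicodedata` module. This is an illustration under assumptions, not akshar's `identify_script`: `classify_char` is a hypothetical name, and the `"other"` label for spaces and symbols is invented here because the output above does not show what akshar returns for them.

```python
import unicodedata


def classify_char(ch):
    """Return a coarse script label for a single character (hypothetical helper)."""
    if "\u0900" <= ch <= "\u097f":      # Devanagari block, checked first
        return "devanagari"
    category = unicodedata.category(ch)
    if category.startswith("L") and ch.isascii():
        return "roman"
    if category == "Nd":                # decimal digits
        return "digit"
    if category.startswith("P"):        # all punctuation categories
        return "punct"
    return "other"                      # assumed label; not shown in the output above


for ch in ["न", "a", "5", ".", "Z", " "]:
    print(repr(ch), "->", classify_char(ch))
```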


## 10. Visual Comparison

Compare how different texts tokenize


In [16]:
import pandas as pd

comparison_texts = [
    "नमस्ते",
    "namaste",
    "hello नमस्ते",
    "क्षेत्र",
    "yaar kya haal hai",
    "आज का day",
]

data = []
for text in comparison_texts:
    analysis = tokenizer.explain(text)
    data.append({
        'Text': text,
        'akshars': analysis['stats']['akshar_count'],
        'Tokens': len(analysis['tokens']),
        'Switches': analysis['stats']['script_switches'],
        'Dev%': f"{analysis['stats']['devanagari_ratio']:.0%}",
        'Rom%': f"{analysis['stats']['roman_ratio']:.0%}",
    })

df = pd.DataFrame(data)
print("\nTokenization Comparison:\n")
print(df.to_string(index=False))



Tokenization Comparison:

             Text  akshars  Tokens  Switches Dev% Rom%
           नमस्ते        3       3         0 100%   0%
          namaste        7       7         0   0% 100%
     hello नमस्ते        9       9         1  50%  50%
          क्षेत्र        2       2         0 100%   0%
yaar kya haal hai       17      17         0   0% 100%
        आज का day        8       8         1  67%  33%


## 11. Real-World Examples

Social media style Hinglish text


## 21. 🆕 Phonetic Analysis

Analyze phonetic properties: vowels, consonants, aspiration, nasalization


In [17]:
from akshar import get_phonetic_analyzer

analyzer = get_phonetic_analyzer()

print("Phonetic Properties of Characters:\n")

# test various characters
chars = ['क', 'ख', 'न', 'म', 'अ', 'आ']

for char in chars:
    props = analyzer.get_properties(char)
    if props:
        print(f"{char}:")
        print(f"  Type: {'Vowel' if props['is_vowel'] else 'Consonant'}")
        if props['is_consonant']:
            print(f"  Aspirated: {props['aspirated']}")
            print(f"  Voiced: {props['voiced']}")
            print(f"  Nasal: {props['nasal']}")
            place = analyzer.get_place_of_articulation(char)
            if place:
                print(f"  Place: {place}")
        print()

print("="*50)
print("\nWord Analysis:\n")

words = ["नमस्ते", "भारत", "ख़ुशी"]

for word in words:
    analysis = analyzer.analyze_word(word)
    print(f"{word:10} → Vowels: {analysis['vowels']}, "
          f"Consonants: {analysis['consonants']}, "
          f"Nasals: {analysis['nasals']}, "
          f"Aspirated: {analysis['aspirated']}")


Phonetic Properties of Characters:

क:
  Type: Consonant
  Aspirated: False
  Voiced: False
  Nasal: False
  Place: velar

ख:
  Type: Consonant
  Aspirated: True
  Voiced: False
  Nasal: False
  Place: velar

न:
  Type: Consonant
  Aspirated: False
  Voiced: True
  Nasal: True
  Place: dental

म:
  Type: Consonant
  Aspirated: False
  Voiced: True
  Nasal: True
  Place: labial

अ:
  Type: Vowel

आ:
  Type: Vowel


Word Analysis:

नमस्ते     → Vowels: 1, Consonants: 4, Nasals: 2, Aspirated: 1
भारत       → Vowels: 1, Consonants: 3, Nasals: 0, Aspirated: 1
ख़ुशी      → Vowels: 2, Consonants: 2, Nasals: 0, Aspirated: 2
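
The character properties above follow the traditional varga grid: five rows by place of articulation, and within each row the order unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal. The sketch below derives the same properties from a consonant's position in that grid. It is an assumption about how such a lookup could be built, not akshar's implementation, and `consonant_properties` is a hypothetical name.

```python
PLACES = ["velar", "palatal", "retroflex", "dental", "labial"]

# The 25 grid consonants U+0915 (क) to U+092E (म); U+0929 is skipped
# because it is not part of the traditional 5x5 grid.
VARGA = [chr(cp) for cp in range(0x0915, 0x092F) if cp != 0x0929]


def consonant_properties(ch):
    """Look up place and manner for a varga consonant; None for anything else."""
    if ch not in VARGA:
        return None
    row, col = divmod(VARGA.index(ch), 5)
    return {
        "place": PLACES[row],
        "aspirated": col in (1, 3),
        "voiced": col >= 2,       # voiced stops and nasals
        "nasal": col == 4,
    }


for ch in ["क", "ख", "न", "म"]:
    print(ch, consonant_properties(ch))
```

Semivowels, sibilants, and nukta forms fall outside the grid and would need their own table entries.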


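## 22. 🆕 Multi-Script Detection

Detects Indic scripts beyond Devanagari. In the output below only Indic characters are counted, so Roman text does not appear in `scripts`, and mixed Roman/Devanagari input is reported with `is_multilingual` set to False.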


In [18]:
from akshar import identify_scripts, analyze_script

# pure devanagari
text1 = "नमस्ते भारत"
scripts1 = identify_scripts(text1)
print(f"Text: {text1}")
print(f"Scripts: {scripts1}\n")

# mixed content
text2 = "Hello नमस्ते World"
analysis2 = analyze_script(text2)
print(f"Text: {text2}")
print(f"Total chars: {analysis2['total_chars']}")
print(f"Indic chars: {analysis2['indic_chars']}")
print(f"Scripts: {analysis2['scripts']}")
print(f"Multilingual: {analysis2['is_multilingual']}\n")

print("="*50)
print("\nScript Identification Examples:\n")

test_texts = [
    "संस्कृतम्",           # Sanskrit/Devanagari
    "हिन्दी",              # Hindi
    "aaj ka din",         # Roman
    "मैं California जाता हूं",  # Mixed
]

for text in test_texts:
    scripts = identify_scripts(text)
    analysis = analyze_script(text)
    print(f"{text:30} → Scripts: {list(scripts.keys()) if scripts else ['roman']}")


Text: नमस्ते भारत
Scripts: {'devanagari': 10}

Text: Hello नमस्ते World
Total chars: 18
Indic chars: 6
Scripts: {'devanagari': 6}
Multilingual: False


Script Identification Examples:

संस्कृतम्                      → Scripts: ['devanagari']
हिन्दी                         → Scripts: ['devanagari']
aaj ka din                     → Scripts: ['roman']
मैं California जाता हूं        → Scripts: ['devanagari']
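
Counting characters per Indic script can be done from Unicode block ranges alone. The sketch below is one way to do that and is not akshar's implementation; `INDIC_BLOCKS` and `count_indic_scripts` are hypothetical names, and the table lists only a few scripts for brevity. Like the output above, it counts Indic characters only and ignores Roman text.

```python
from collections import Counter

# Unicode block ranges (inclusive) for a few Indic scripts
INDIC_BLOCKS = {
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
    "gurmukhi": (0x0A00, 0x0A7F),
    "gujarati": (0x0A80, 0x0AFF),
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
}


def count_indic_scripts(text):
    """Count code points per Indic script; non-Indic characters are skipped."""
    counts = Counter()
    for ch in text:
        for name, (low, high) in INDIC_BLOCKS.items():
            if low <= ord(ch) <= high:
                counts[name] += 1
                break
    return dict(counts)


print(count_indic_scripts("Hello नमस्ते World"))  # {'devanagari': 6}
```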


## 23. 🆕 Combined Intelligence Pipeline

Use all features together for complete analysis


## 24. 🆕 Character Deep Dive

Examine individual character properties in detail


In [19]:
from akshar.script_utils import ScriptAnalyzer

script_analyzer = ScriptAnalyzer()
phonetic = get_phonetic_analyzer()

# interesting characters
chars_to_examine = ['क', 'ख', 'न', 'म', 'अ', 'आ']

print("Deep Character Analysis:\n")
print("="*70)

for char in chars_to_examine:
    print(f"\nCharacter: {char}")
    print(f"Unicode: U+{ord(char):04X}")
    
    # unicode name
    name = script_analyzer.get_character_name(char)
    print(f"Name: {name}")
    
    # phonetic props
    props = phonetic.get_properties(char)
    if props:
        print(f"ITRANS: {props['itrans']}")
        if props['is_vowel']:
            print(f"Type: Vowel")
        elif props['is_consonant']:
            print(f"Type: Consonant")
            place = phonetic.get_place_of_articulation(char)
            if place:
                print(f"Articulation: {place}")
            print(f"Aspirated: {props['aspirated']}")
            print(f"Voiced: {props['voiced']}")
            print(f"Nasal: {props['nasal']}")
        
        if props['halanta']:
            print("⚠ Halanta (virama)")
        if props['anusvara']:
            print("⚠ Anusvara")

print("\n" + "="*70)


print("\nNote: Conjuncts like च् and ज्ञ are multiple unicode codepoints,")
print("      so they need special handling for character-level analysis.")

Deep Character Analysis:


Character: क
Unicode: U+0915
Name: DEVANAGARI LETTER KA
ITRANS: ka
Type: Consonant
Articulation: velar
Aspirated: False
Voiced: False
Nasal: False

Character: ख
Unicode: U+0916
Name: DEVANAGARI LETTER KHA
ITRANS: kha
Type: Consonant
Articulation: velar
Aspirated: True
Voiced: False
Nasal: False

Character: न
Unicode: U+0928
Name: DEVANAGARI LETTER NA
ITRANS: na
Type: Consonant
Articulation: dental
Aspirated: False
Voiced: True
Nasal: True

Character: म
Unicode: U+092E
Name: DEVANAGARI LETTER MA
ITRANS: ma
Type: Consonant
Articulation: labial
Aspirated: False
Voiced: True
Nasal: True

Character: अ
Unicode: U+0905
Name: DEVANAGARI LETTER A
ITRANS: a
Type: Vowel

Character: आ
Unicode: U+0906
Name: DEVANAGARI LETTER AA
ITRANS: A
Type: Vowel


Note: Conjuncts like च् and ज्ञ are multiple unicode codepoints,
      so they need special handling for character-level analysis.
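
The note about conjuncts can be checked with the standard library: iterating over a conjunct yields each underlying code point, and `unicodedata.name` gives its Unicode name. `describe_codepoints` is a hypothetical helper written for this illustration; whether akshar's `get_character_name` uses `unicodedata` internally is not stated here.

```python
import unicodedata


def describe_codepoints(text):
    """List (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for row in describe_codepoints("ज्ञ"):
    print(row)
# ('U+091C', 'DEVANAGARI LETTER JA')
# ('U+094D', 'DEVANAGARI SIGN VIRAMA')
# ('U+091E', 'DEVANAGARI LETTER NYA')
```

A single rendered conjunct is therefore three code points, which is why per-character property lookups need the text segmented first.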


## 25. 🆕 Linguistic Feature Extraction

Extract all linguistic features for ML/NLP tasks


In [20]:
from akshar import (
    aksharTokenizer,
    detect_code_switches,
    normalize_text,
    analyze_phonetics,
    analyze_script,
    segment_akshars
)

def extract_all_features(text):
    """
    Extract complete linguistic feature vector.
    Useful for ML models, text classification, etc.
    """
    tokenizer = aksharTokenizer()
    
    # get all analyses
    token_analysis = tokenizer.explain(text)
    script_analysis = analyze_script(text)
    phonetic_analysis = analyze_phonetics(text)
    akshars = segment_akshars(text)
    switches = detect_code_switches(text)
    
    features = {
        # basic
        'length': len(text),
        'tokens': len(token_analysis['tokens']),
        'akshars': len(akshars),
        
        # script composition
        'devanagari_ratio': token_analysis['stats']['devanagari_ratio'],
        'roman_ratio': token_analysis['stats']['roman_ratio'],
        'script_switches': len(switches),  # counts script segments: one more than stats['script_switches'] for non-empty text
        'is_multilingual': script_analysis['is_multilingual'],
        
        # phonetic
        'vowel_count': phonetic_analysis['vowels'],
        'consonant_count': phonetic_analysis['consonants'],
        'nasal_count': phonetic_analysis['nasals'],
        'aspirated_count': phonetic_analysis['aspirated'],
        
        # ratios
        'vowel_consonant_ratio': (
            phonetic_analysis['vowels'] / max(phonetic_analysis['consonants'], 1)
        ),
        'akshar_per_token': len(akshars) / max(len(token_analysis['tokens']), 1),
    }
    
    return features


# test it
test_sentences = [
    "नमस्ते भारत",
    "hello world",
    "aaj मौसम अच्छा है",
    "यार kya बात है"
]

print("Complete Feature Extraction:\n")
print("="*70)

import pandas as pd

features_list = []
for sent in test_sentences:
    features = extract_all_features(sent)
    features['text'] = sent
    features_list.append(features)

df = pd.DataFrame(features_list)

# reorder columns
cols = ['text', 'length', 'tokens', 'akshars', 'devanagari_ratio', 
        'roman_ratio', 'script_switches', 'vowel_count', 'consonant_count']
print(df[cols].to_string(index=False))

print("\n" + "="*70)
print("\n💡 These features can be used for:")
print("   • Text classification")
print("   • Language identification")
print("   • Code-mixing detection")
print("   • Sentiment analysis")
print("   • ML model training")


Complete Feature Extraction:

             text  length  tokens  akshars  devanagari_ratio  roman_ratio  script_switches  vowel_count  consonant_count
      नमस्ते भारत      11       7        7          1.000000     0.000000                1            2                7
      hello world      11      11       11          0.000000     1.000000                1            0                0
aaj मौसम अच्छा है      17      12       12          0.764706     0.235294                2            4                6
   यार kya बात है      14      11       11          0.714286     0.285714                3            3                5


💡 These features can be used for:
   • Text classification
   • Language identification
   • Code-mixing detection
   • Sentiment analysis
   • ML model training


In [21]:
social_media_texts = [
    "yaaaar aaj ka mausam bohot achaaaa hai",
    "मुझे traveling करना पसंद है",
    "आज का day बहुत productive था bro",
    "चलो coffee पीते हैं",
]

print("Social Media Hinglish Examples:\n")
for text in social_media_texts:
    analysis = tokenizer.explain(text)
    print(f"Original:   {text}")
    print(f"Normalized: {analysis['normalized']}")
    print(f"Tokens:     {' | '.join(analysis['tokens'][:15])}...")
    print()


Social Media Hinglish Examples:

Original:   yaaaar aaj ka mausam bohot achaaaa hai
Normalized: yar aaj ka mausam bohot acha hai
Tokens:     y | a | r |   | a | a | j |   | k | a |   | m | a | u | s...

Original:   मुझे traveling करना पसंद है
Normalized: मुझे traveling करना पसंद है
Tokens:     मु | झे |   | t | r | a | v | e | l | i | n | g |   | क | र...

Original:   आज का day बहुत productive था bro
Normalized: आज का day बहुत productive था bro
Tokens:     आ | ज |   | का |   | d | a | y |   | ब | हु | त |   | p | r...

Original:   चलो coffee पीते हैं
Normalized: चलो coffee पीते हैं
Tokens:     च | लो |   | c | o | f | f | e | e |   | पी | ते |   | हैं...



## 12. Sanskrit Text Processing

Complex conjuncts and classical texts


In [22]:
sanskrit_texts = [
    "क्षेत्रे धर्मक्षेत्रे समवेता युयुत्सवः",
    "धर्म की जय हो",
    "ज्ञान से ही मोक्ष मिलता है",
    "वेदाः ज्ञानस्य स्रोतः",
]

print("Sanskrit Text Processing:\n")
for text in sanskrit_texts:
    akshars = segment_akshars(text)
    tokens = tokenizer.tokenize(text)
    
    print(f"Text: {text}")
    print(f"akshars: {len(akshars)}, Tokens: {len(tokens)}")
    
    # keeps every akshar longer than one code point, so consonant + vowel-sign units are listed too
    conjuncts = [a for a in akshars if len(a) > 1 and a.strip()]
    if conjuncts:
        print(f"Conjuncts preserved: {', '.join(conjuncts[:5])}")
    print()


Sanskrit Text Processing:

Text: क्षेत्रे धर्मक्षेत्रे समवेता युयुत्सवः
akshars: 17, Tokens: 17
Conjuncts preserved: क्षे, त्रे, र्म, क्षे, त्रे

Text: धर्म की जय हो
akshars: 9, Tokens: 9
Conjuncts preserved: र्म, की, हो

Text: ज्ञान से ही मोक्ष मिलता है
akshars: 15, Tokens: 15
Conjuncts preserved: ज्ञा, से, ही, मो, क्ष

Text: वेदाः ज्ञानस्य स्रोतः
akshars: 9, Tokens: 9
Conjuncts preserved: वे, दाः, ज्ञा, स्य, स्रो
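
The filter in the cell above keeps any akshar longer than one code point, so the list also contains consonant plus vowel-sign units such as की and हो, which are not conjuncts. A stricter test checks for a virama inside the unit. The sketch below shows that check; `is_conjunct` is a hypothetical helper, and the akshar list is written out by hand to match the count of 9 reported above rather than copied from akshar's output.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def is_conjunct(akshar):
    """True when a virama joins consonants inside the unit (not merely at its end)."""
    return VIRAMA in akshar[:-1]


# hand-written segmentation of "धर्म की जय हो" (9 units, spaces included)
akshars = ["ध", "र्म", " ", "की", " ", "ज", "य", " ", "हो"]
found = [a for a in akshars if is_conjunct(a)]
print(found)  # ['र्म']
```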



## 13. Visualization Helpers

Color-coded output for terminal display


In [23]:
from akshar.viz import format_token_boundaries, format_akshar_boundaries

text = "aaj मौसम अच्छा है"
tokens = tokenizer.tokenize(text)
akshars = segment_akshars(text)

print("Visualization Formats:\n")
print("Token Boundaries:")
print(format_token_boundaries(text, tokens))
print("\nakshar Boundaries:")
print(format_akshar_boundaries(akshars))


Visualization Formats:

Token Boundaries:
a | a | j |   | मौ | स | म |   | अ | च्छा |   | है

akshar Boundaries:
[a] [a] [j] [ ] [मौ] [स] [म] [ ] [अ] [च्छा] [ ] [है]
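
For readers without `akshar.viz`, the two plain-text layouts shown above can be reproduced with string joins. These helpers are hypothetical stand-ins that match only the printed format; they are not the library's implementation and do not add any colour.

```python
def join_with_bars(tokens):
    """Render tokens separated by ' | ', as in the token-boundary output."""
    return " | ".join(tokens)


def wrap_in_brackets(akshars):
    """Render each akshar in square brackets, as in the akshar-boundary output."""
    return " ".join(f"[{a}]" for a in akshars)


units = ["a", "a", "j", " ", "है"]
print(join_with_bars(units))
print(wrap_in_brackets(units))
```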


## 14. Batch Processing

Process a list of texts in a loop and collect per-text statistics.


In [24]:
batch_texts = [
    "भारत एक महान देश है",
    "India is a great country",
    "मैं India से हूं",
    "यह बहुत interesting है",
    "sanskritam देवभाषा अस्ति",
]

print("Batch Processing Results:\n")
results = []
for text in batch_texts:
    tokens = tokenizer.tokenize(text)
    analysis = tokenizer.explain(text)
    results.append({
        'text': text,
        'tokens': len(tokens),
        'dev_ratio': analysis['stats']['devanagari_ratio']
    })

for r in results:
    print(f"{r['text']:<40} | Tokens: {r['tokens']:<3} | Dev: {r['dev_ratio']:.0%}")


Batch Processing Results:

भारत एक महान देश है                      | Tokens: 15  | Dev: 100%
India is a great country                 | Tokens: 24  | Dev: 0%
मैं India से हूं                         | Tokens: 11  | Dev: 62%
यह बहुत interesting है                   | Tokens: 20  | Dev: 45%
sanskritam देवभाषा अस्ति                 | Tokens: 18  | Dev: 54%


## 15. Summary Statistics

Inspect the tokenizer's configuration and recap the features covered.


In [25]:
print("Tokenizer Configuration:\n")
print(f"Model loaded: {tokenizer.model is not None}")
print(f"Model type: {tokenizer.model_type}")
print(f"Normalize Roman: {tokenizer.normalize_roman}")
print(f"Clean Hinglish: {tokenizer.clean_hinglish}")
print(f"Vocab size: {tokenizer.vocab_size()}")

print("\n" + "="*60)
print("\nakshar Features Demonstrated:")
print("  ✓ Basic tokenization")
print("  ✓ akshar segmentation (conjunct preservation)")
print("  ✓ Code-switch detection")
print("  ✓ Text normalization")
print("  ✓ Hinglish handling")
print("  ✓ Phonetic signatures")
print("  ✓ Detailed analysis")
print("  ✓ Composition analysis")
print("  ✓ Script detection")
print("  ✓ Metadata extraction")
print("  ✓ Sanskrit processing")
print("  ✓ Batch processing")
print("  ✓ Visualization helpers")


Tokenizer Configuration:

Model loaded: False
Model type: sentencepiece
Normalize Roman: True
Clean Hinglish: True
Vocab size: 0


akshar Features Demonstrated:
  ✓ Basic tokenization
  ✓ akshar segmentation (conjunct preservation)
  ✓ Code-switch detection
  ✓ Text normalization
  ✓ Hinglish handling
  ✓ Phonetic signatures
  ✓ Detailed analysis
  ✓ Composition analysis
  ✓ Script detection
  ✓ Metadata extraction
  ✓ Sanskrit processing
  ✓ Batch processing
  ✓ Visualization helpers


In [26]:
from indicnlp import common
common.set_resources_path("indic_nlp_resources")

In [27]:
from indicnlp.tokenize.indic_tokenize import trivial_tokenize

sentence = "aaj मौसम बहुत अच्छा है"
print(trivial_tokenize(sentence))

['aaj', 'मौसम', 'बहुत', 'अच्छा', 'है']


## Deriving the Same Result in akshar

In [28]:
text = "aaj मौसम बहुत अच्छा है"
tokens_char = tokenizer.tokenize(text)
print("akshar default (character-level):")
print(tokens_char)
print()

akshar default (character-level):
['a', 'a', 'j', ' ', 'मौ', 'स', 'म', ' ', 'ब', 'हु', 'त', ' ', 'अ', 'च्छा', ' ', 'है']



In [29]:
def word_tokenize(sentence):
    normalized = tokenizer.preprocess(sentence)
    words = normalized.split()
    return words

tokens_word = word_tokenize(text)
print(tokens_word)
print()

['aaj', 'मौसम', 'बहुत', 'अच्छा', 'है']



In [30]:
tokenizer.preprocess(sentence).split()

['aaj', 'मौसम', 'बहुत', 'अच्छा', 'है']

In [31]:
print(f"Character-level: {len(tokens_char)} tokens")
print(f"Word-level:      {len(tokens_word)} tokens")

Character-level: 16 tokens
Word-level:      5 tokens
