# Phonetic Exploration: Thai → Sanskrit → PIE

This notebook demonstrates:
1. Converting Thai words to IPA
2. Converting Sanskrit cognates to IPA
3. Calculating phonetic distance
4. Extracting phonetic features

In [None]:
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / "src"))

from data.phonetic_converter import PhoneticConverter
import pandas as pd

## Initialize Phonetic Converter

In [None]:
converter = PhoneticConverter()

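## Test Cases: Numerals and Kinship Terms

Numerals and kinship terms are basic vocabulary of the kind Swadesh lists target, and they are highly conserved across Indo-European languages. Thai is not Indo-European; the Thai forms below are loanwords from Sanskrit or Pali.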

In [None]:
# Test data: Thai, Sanskrit, PIE, Latin, and English forms for each meaning
cognate_sets = [
    {
        'meaning': 'three',
        'thai': 'ไตร',
        'sanskrit': 'tri',
        'pie': '*tréyes',
        'latin': 'trēs',
        'english': 'three'
    },
    {
        'meaning': 'ten',
        'thai': 'ทศ',
        'sanskrit': 'daśa',
        'pie': '*deḱm̥',
        'latin': 'decem',
        'english': 'ten'
    },
    {
        'meaning': 'mother',
        'thai': 'มารดา',
        'sanskrit': 'mātṛ',
        'pie': '*méh₂tēr',
        'latin': 'māter',
        'english': 'mother'
    },
    {
        'meaning': 'father',
        'thai': 'บิดา',
        'sanskrit': 'pitṛ',
        'pie': '*ph₂tḗr',
        'latin': 'pater',
        'english': 'father'
    }
]

## Convert to IPA

In [None]:
# Convert all to IPA
results = []

for cognate in cognate_sets:
    row = {
        'Meaning': cognate['meaning'],
        'Thai': cognate['thai'],
        'Thai IPA': converter.to_ipa(cognate['thai'], 'tha'),
        'Sanskrit': cognate['sanskrit'],
        'Sanskrit IPA': converter.to_ipa(cognate['sanskrit'], 'san'),
        'English': cognate['english'],
        'English IPA': converter.to_ipa(cognate['english'], 'eng'),
    }
    results.append(row)

df = pd.DataFrame(results)
df

## Calculate Phonetic Distances

In [None]:
# Calculate distances between cognates
distance_results = []

for _, row in df.iterrows():
    thai_ipa = row['Thai IPA']
    san_ipa = row['Sanskrit IPA']
    eng_ipa = row['English IPA']

    dist_thai_san = converter.phonetic_distance(thai_ipa, san_ipa)
    dist_san_eng = converter.phonetic_distance(san_ipa, eng_ipa)
    dist_thai_eng = converter.phonetic_distance(thai_ipa, eng_ipa)

    distance_results.append({
        'Meaning': row['Meaning'],
        'Thai ↔ Sanskrit': dist_thai_san,
        'Sanskrit ↔ English': dist_san_eng,
        'Thai ↔ English': dist_thai_eng
    })

dist_df = pd.DataFrame(distance_results)
dist_df
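
How `PhoneticConverter.phonetic_distance` computes its score is not shown in this notebook. As a rough point of comparison, the sketch below implements one common baseline: Levenshtein edit distance over IPA segments, normalized by the longer word's length. The function names (`ipa_segments`, `edit_distance`, `normalized_distance`) are hypothetical and not part of the converter, and the IPA strings in the example are hand-entered illustrations rather than converter output.

```python
import unicodedata


def ipa_segments(ipa: str) -> list:
    """Split an IPA string into segments, attaching combining marks and
    modifier letters (such as aspiration) to the preceding base symbol."""
    segments = []
    for ch in unicodedata.normalize("NFD", ipa):
        is_modifier = (
            unicodedata.combining(ch) > 0 or unicodedata.category(ch) == "Lm"
        )
        if is_modifier and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two segment sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(ipa_a: str, ipa_b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer segment sequence."""
    seg_a, seg_b = ipa_segments(ipa_a), ipa_segments(ipa_b)
    longest = max(len(seg_a), len(seg_b))
    if longest == 0:
        return 0.0
    return edit_distance(seg_a, seg_b) / longest


# Illustrative, hand-entered IPA strings (assumed, not converter output).
print(normalized_distance("tri", "traj"))
```

Unit substitution costs treat every segment mismatch as equally bad. A feature-weighted cost, where replacing `t` with `d` is cheaper than replacing `t` with `a`, would be a natural refinement.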

## Extract Phonetic Features

In [None]:
# Analyze features for the Sanskrit form of "three"
example_word = cognate_sets[0]['sanskrit']  # 'tri'
example_ipa = converter.to_ipa(example_word, 'san')

print(f"Word: {example_word}")
print(f"IPA: {example_ipa}")
print("\nPhonetic Features:")

features = converter.extract_features(example_ipa)
for feat in features:
    print(f"  {feat}")
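
The return format of `converter.extract_features` is not documented here. To make the idea concrete, the following sketch maps each IPA symbol to a small set of articulatory features using a hand-written table. The table `SEGMENT_FEATURES` and the function `describe_segments` are hypothetical illustrations, cover only a handful of symbols, and treat each character as one segment, so diacritics are not handled.

```python
# Hypothetical, hand-written feature table covering only a few IPA symbols.
SEGMENT_FEATURES = {
    "t": {"type": "consonant", "voiced": False, "place": "alveolar", "manner": "plosive"},
    "d": {"type": "consonant", "voiced": True, "place": "alveolar", "manner": "plosive"},
    "r": {"type": "consonant", "voiced": True, "place": "alveolar", "manner": "trill"},
    "i": {"type": "vowel", "height": "close", "backness": "front", "rounded": False},
    "a": {"type": "vowel", "height": "open", "backness": "front", "rounded": False},
}


def describe_segments(ipa: str) -> list:
    """Return one feature dict per character of an IPA string.

    Symbols missing from the table are reported with type 'unknown'.
    """
    described = []
    for symbol in ipa:
        features = SEGMENT_FEATURES.get(symbol, {"type": "unknown"})
        described.append({"segment": symbol, **features})
    return described


# Assumed IPA for Sanskrit 'tri', entered by hand for illustration.
for entry in describe_segments("tri"):
    print(entry)
```

Feature dicts like these can be turned into fixed-length numeric vectors, which is the form a later embedding or clustering step would need.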

## Visualization: Phonetic Space

Plot words in a 2D phonetic space by applying t-SNE to their feature vectors.

In [None]:
# TODO: Implement t-SNE visualization of phonetic embeddings
# This will be useful once we train the phonetic embedding model

import matplotlib.pyplot as plt

print("Phonetic space visualization coming soon...")
print("Will show clustering of cognates in feature space")

## Next Steps

1. ✅ IPA conversion works for basic cases
2. 🔄 Need to refine Thai → IPA mapping (currently simplified)
3. 🔜 Build dataset of confirmed cognate pairs for training
4. 🔜 Train phonetic embedding model
5. 🔜 Implement Siamese network for cognate prediction