## Undoing Assimilation and Reduction in the Bolinao Lexicon

Assimilation and reduction occur when a prefix ending in /ng/ (maNg-, naNg-, paNg-) attaches to a word beginning with a stop or /s/ (p, b, t, d, s, ', k, g). The /ng/ assimilates to the point of articulation of the word's initial consonant, and that consonant is then dropped. Otherwise (words beginning with y, w, l, r, m, n, ng), the N remains /ng/ and no reduction takes place. The same tendency of /ng/ to assimilate also appears across word boundaries, for example in compounding.


Rules:
*   maNg- + /p/ → ma- + m + reduced word
*   maNg- + /b/ → ma- + m + reduced word
*   naNg- + /p/ → na- + m + reduced word
*   naNg- + /b/ → na- + m + reduced word
*   paNg- + /p/ → pa- + m + reduced word
*   paNg- + /b/ → pa- + m + reduced word
------------------------------------------------
*   maNg- + /d/ → ma- + n + reduced word
*   maNg- + /t/ → ma- + n + reduced word
*   maNg- + /s/ → ma- + n + reduced word
*   naNg- + /d/ → na- + n + reduced word
*   naNg- + /t/ → na- + n + reduced word
*   naNg- + /s/ → na- + n + reduced word
*   paNg- + /d/ → pa- + n + reduced word
*   paNg- + /t/ → pa- + n + reduced word
*   paNg- + /s/ → pa- + n + reduced word
------------------------------------------------
*   maNg- + /k/ → ma- + ng + reduced word
*   maNg- + /g/ → ma- + ng + reduced word
*   maNg- + /'/ → ma- + ng + reduced word
*   naNg- + /k/ → na- + ng + reduced word
*   naNg- + /g/ → na- + ng + reduced word
*   naNg- + /'/ → na- + ng + reduced word
*   paNg- + /k/ → pa- + ng + reduced word
*   paNg- + /g/ → pa- + ng + reduced word
*   paNg- + /'/ → pa- + ng + reduced word
------------------------------------------------
*   maNg- + /y, w, l, r, m, n, ng/ → ma- + ng + unreduced word
*   naNg- + /y, w, l, r, m, n, ng/ → na- + ng + unreduced word
*   paNg- + /y, w, l, r, m, n, ng/ → pa- + ng + unreduced word

Examples:
* mang + basa → mamasa “to read”
* mang + pa + ta’gay → mamata’gay “to go up”
* mang + saliw → manaliw “to buy”
* mang + kalap → mangalap “to get, remove”
* mang + aluyon → mangaluyon “to accompany”
* mang + lipot → manglipot “to give over”
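
The rules and examples above can be summarized as a lookup from the root's first sound to the surface nasal. The sketch below is a minimal illustration, not the notebook's own code. It assumes the orthography used in this section (glottal stop written as an apostrophe) and treats vowel-initial roots such as aluyon as keeping the full `ng`. The names `NASAL_FOR` and `attach` are hypothetical.

```python
# Surface nasal chosen by the first consonant of the root (assumed orthography).
NASAL_FOR = {
    "p": "m", "b": "m",               # labials
    "t": "n", "d": "n", "s": "n",     # alveolars
    "k": "ng", "g": "ng", "'": "ng",  # velars and glottal stop
}

def attach(prefix_base, root):
    """Attach maNg-/naNg-/paNg- (passed as 'ma', 'na', 'pa') to a root."""
    first = root[0]
    if first in NASAL_FOR:
        # Assimilation + reduction: the nasal replaces the initial consonant.
        return prefix_base + NASAL_FOR[first] + root[1:]
    # Sonorant- or vowel-initial root: ng stays and the root is unchanged.
    return prefix_base + "ng" + root

# Roots taken from the Examples list above.
for root in ["basa", "pata'gay", "saliw", "kalap", "aluyon", "lipot"]:
    print(root, "->", attach("ma", root))
```

Running the loop reproduces the six forms listed under Examples (mamasa, mamata'gay, manaliw, mangalap, mangaluyon, manglipot).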


## 1. Load and Prepare Data

This step loads the CSV file containing the Bolinao lexicon and imports the libraries used for text processing.

In [119]:
import pandas as pd
import re

df = pd.read_csv("Bolinao Lexicon Final.csv", encoding='latin-1')


## 2. Filter Words with Assimilation Prefixes

This step filters the dataset to include only those words starting with the assimilation prefixes: "mam", "man", "mang", "nam", "nan", "nang", "pam", "pan", and "pang".

In [120]:
candidates = df[df["word"].str.startswith(("mam", "man", "mang", "nam", "nan",  "nang", "pam", "pan", "pang"), na=False)].copy()
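
Every word that begins with "mang" also begins with "man" (likewise for "nang" and "pang"), so the four-letter prefixes in the tuple above do not change the selection. The sketch below illustrates this with Python's built-in `str.startswith`, which also accepts a tuple; the sample words are taken from elsewhere in this notebook, and the names `SHORT`, `FULL`, `short_hits`, and `full_hits` are hypothetical.

```python
# Small sample; the notebook itself filters df["word"].
words = ["mamasa", "manaliw", "mangalap", "manok", "nanganak", "bali", "pananem"]

SHORT = ("mam", "man", "nam", "nan", "pam", "pan")
FULL = SHORT + ("mang", "nang", "pang")

short_hits = [w for w in words if w.startswith(SHORT)]
full_hits = [w for w in words if w.startswith(FULL)]

# Same selection: the four-letter prefixes add nothing to the filter.
print(short_hits == full_hits)

# "manok" is a root in its own right, yet it passes the prefix filter.
print("manok" in short_hits)
```

Both checks print `True`. The second one shows why the filter only yields candidates: unprefixed words that happen to start with one of these strings are selected too, which is what the later verification step is for.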

## 3. Define Root Word Recovery Function

This function is designed to reverse the sound changes caused by assimilation and reduction, aiming to find the original root word before prefixes were added.

In [121]:
# Define assimilation + reduction rules
def undo_assimilation_reduction(word):
    original = word
    process = []
    formula = []
    root = word  # default if no change

    if word.startswith("mang"):
        rest = word[4:]
        if rest and rest[0] in ["y", "w", "l", "r", "m", "n"]:  # y/w/l/r/m/n root
            root = rest
            process.append("no reduction")
            formula.append("mang- with root starting y/w/l/r/m/n/ng (ng stays)")
        else:  # k, g, ' root, or vowel-initial root (nothing dropped)
            root = "k" + rest + "|" + "g" + rest + "|" + "'" + rest + "|" + rest
            process.append("assimilation+reduction|no reduction for a/e/i/o/u")
            formula.append("mang- + (k/g/' root (consonant dropped, ng kept)|mang- with root starting a/e/i/o/u (ng stays))")

    elif word.startswith("mam"):  # mang- + p/b root
        root = "b" + word[3:] + "|" + "p" + word[3:]
        process.append("assimilation+reduction")
        formula.append("mam- from mang- + p/b root (p/b dropped, ng→m)")

    elif word.startswith("man"):  # mang- + t/d/s root
        root = "s" + word[3:] + "|" + "t" + word[3:] + "|" + "d" + word[3:]
        process.append("assimilation+reduction")
        formula.append("man- from mang- + t/d/s root (consonant dropped, ng→n)")

    elif word.startswith("nang"):
        rest = word[4:]
        if rest and rest[0] in ["y", "w", "l", "r", "m", "n"]:  # y/w/l/r/m/n root
            root = rest
            process.append("no reduction")
            formula.append("nang- with root starting y/w/l/r/m/n/ng (ng stays)")
        else:  # k, g, ' root, or vowel-initial root (nothing dropped)
            root = "k" + rest + "|" + "g" + rest + "|" + "'" + rest + "|" + rest
            process.append("assimilation+reduction|no reduction for a/e/i/o/u")
            formula.append("nang- + (k/g/' root (consonant dropped, ng kept)|nang- with root starting a/e/i/o/u (ng stays))")

    elif word.startswith("nam"):  # nang- + p/b root
        root = "b" + word[3:] + "|" + "p" + word[3:]
        process.append("assimilation+reduction")
        formula.append("nam- from nang- + p/b root (p/b dropped, ng→m)")

    elif word.startswith("nan"):  # nang- + t/d/s root
        root = "s" + word[3:] + "|" + "t" + word[3:] + "|" + "d" + word[3:]
        process.append("assimilation+reduction")
        formula.append("nan- from nang- + t/d/s root (consonant dropped, ng→n)")

    elif word.startswith("pang"):
        rest = word[4:]
        if rest and rest[0] in ["y", "w", "l", "r", "m", "n"]:  # y/w/l/r/m/n root
            root = rest
            process.append("no reduction")
            formula.append("pang- with root starting y/w/l/r/m/n/ng (ng stays)")
        else:  # k, g, ' root, or vowel-initial root (nothing dropped)
            root = "k" + rest + "|" + "g" + rest + "|" + "'" + rest + "|" + rest
            process.append("assimilation+reduction|no reduction for a/e/i/o/u")
            formula.append("pang- + (k/g/' root (consonant dropped, ng kept)|pang- with root starting a/e/i/o/u (ng stays))")

    elif word.startswith("pam"):  # pang- + p/b root
        root = "b" + word[3:] + "|" + "p" + word[3:]
        process.append("assimilation+reduction")
        formula.append("pam- from pang- + p/b root (p/b dropped, ng→m)")

    elif word.startswith("pan"):  # pang- + t/d/s root
        root = "s" + word[3:] + "|" + "t" + word[3:] + "|" + "d" + word[3:]
        process.append("assimilation+reduction")
        formula.append("pan- from pang- + t/d/s root (consonant dropped, ng→n)")

    else:
        process.append("unchanged")
        formula.append("does not match assimilation/reduction patterns")

    return pd.Series({
        "word": original,
        "root_word": root,
        "process": ", ".join(process),
        "formula": "; ".join(formula)
    })
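
The nine branches above differ only in the first two letters of the prefix, so the same candidate generation can be written once. The following is a compact sketch under that assumption, not a replacement for the function above: `candidate_roots` is a hypothetical helper that returns a list of candidates instead of a `|`-joined string and omits the `process` and `formula` descriptions.

```python
SONORANTS = ("y", "w", "l", "r", "m", "n")  # "n" also covers ng-initial roots

def candidate_roots(word):
    """Return possible roots for a maNg-/naNg-/paNg- word, or [] if none apply."""
    if len(word) < 3 or word[:2] not in ("ma", "na", "pa"):
        return []
    rest = word[2:]
    if rest.startswith("ng"):
        stem = rest[2:]
        if stem[:1] in SONORANTS:
            return [stem]  # ng kept, nothing dropped
        # Either k/g/' was dropped, or the root is vowel-initial and intact.
        return ["k" + stem, "g" + stem, "'" + stem, stem]
    if rest.startswith("m"):
        return ["b" + rest[1:], "p" + rest[1:]]  # ng -> m before p/b
    if rest.startswith("n"):
        return ["s" + rest[1:], "t" + rest[1:], "d" + rest[1:]]  # ng -> n before t/d/s
    return []

print(candidate_roots("mamasa"))
print(candidate_roots("mangalap"))
print(candidate_roots("manglipot"))
```

As with the original function, the output is a set of hypotheses: only one of the candidates for a given word, at most, is expected to be an actual root.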


## 4. Apply Root Word Recovery

This step applies the `undo_assimilation_reduction` function to each word selected in Step 2 to generate potential root word candidates.

In [122]:
results_df = candidates["word"].dropna().apply(undo_assimilation_reduction)

## 5. Combine Original Data with Analysis

This step merges the original word data (including meaning and part of speech) with the results of the root word analysis. The resulting data includes the columns: "word", "upos", "meaning_english", "root_word", "process", and "formula".

In [123]:
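# undo_assimilation_reduction depends only on the word, so drop repeated words
# from results_df first; otherwise homographs multiply rows in the merge.
results_full = candidates[["word", "upos", "meaning_english"]].merge(
    results_df.drop_duplicates(subset="word"), on="word"
)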

## 6. Verify Root Word Candidates

This step checks whether the root candidates generated in Step 4 actually exist in the original lexicon, and records the meanings of both forms so that their alignment can be reviewed.

In [124]:
confirmed = []

for idx, row in results_full.iterrows():
    word = row['word']
    meaning = row['meaning_english']
    upos_assimilated_reduced = row['upos']
    roots = row['root_word'].split('|')  # Split multiple root candidates

    for r in roots:
        r = r.strip()  # Remove extra spaces
        # Look for exact matches in the lexicon
        match = df[df['word'].str.strip() == r]  # r was already stripped above
        if not match.empty:
            record = {
                "assimilated": word,
                "root_candidate": r,
                "meaning_assimilated_reduced": meaning,
                # dropna/astype(str) guard against missing (NaN) cells, which would break join
                "meaning_root": "; ".join(match['meaning_english'].dropna().astype(str).unique()),
                "upos_assimilated_reduced": upos_assimilated_reduced,
                # Step 5 already requires an 'upos' column, so no fallback column is needed
                "upos_root": "; ".join(match['upos'].dropna().astype(str).unique())
            }
            confirmed.append(record)

confirmed_df = pd.DataFrame(confirmed)
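
The loop above scans the whole lexicon once per candidate. One possible alternative, sketched here with plain Python data rather than the notebook's DataFrame, is to build a dictionary keyed by the stripped word once and look candidates up in it. The toy lexicon and the names `index` and `confirm` are assumptions for illustration; the meanings are the ones given in the root list later in this notebook.

```python
from collections import defaultdict

# Toy lexicon rows (word, meaning); note the stray space in the first entry.
lexicon = [("bali ", "house"), ("anak", "child"), ("salay", "egg")]

# Build the index once: stripped word -> list of meanings.
index = defaultdict(list)
for word, meaning in lexicon:
    index[word.strip()].append(meaning)

def confirm(candidates):
    """Keep only the candidates that occur in the lexicon, with their meanings."""
    stripped = (c.strip() for c in candidates)
    return {c: index[c] for c in stripped if c in index}

print(confirm(["bali", "pali"]))                    # candidates for "mamali"
print(confirm(["kanak", "ganak", "'anak", "anak"])) # candidates for "manganak"
```

The exact-match criterion is the same as in the loop above; only the lookup strategy differs.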

## 7. Export Results

This step saves the confirmed root words and their corresponding information to a CSV file for further analysis or use.

In [125]:
confirmed_df.to_csv("bolinao_root_words_assimilation_and_reduction.csv", index=False)

## 8. Redo Process: Apply Assimilation and Reduction

This step performs the forward process: it applies the assimilation and reduction rules to root words to check that the assimilated/reduced forms can be reconstructed.

In [126]:
def apply_assimilation_reduction(prefix, root):
    """
    Apply assimilation and reduction rules to reconstruct the assimilated form.
    
    prefix: one of "mang", "nang", "pang"
    root: the root word
    
    Returns the assimilated/reduced form
    """
    if not root:
        return prefix
    
    first_char = root[0].lower()
    
    # Rule 1: prefix + p/b roots → assimilated form with 'm'
    if first_char in ['p', 'b']:
        # Remove 'ng' from prefix and add 'm', then drop first consonant of root
        base = prefix[:-2]  # ma, na, or pa
        return base + 'm' + root[1:]
    
    # Rule 2: prefix + t/d/s roots → assimilated form with 'n'
    elif first_char in ['t', 'd', 's']:
        # Remove 'ng' from prefix and add 'n', then drop first consonant of root
        base = prefix[:-2]  # ma, na, or pa
        return base + 'n' + root[1:]
    
    # Rule 3: prefix + k/g/' roots → keep 'ng', drop first consonant of root
    elif first_char in ['k', 'g', "'"]:
        # Keep the full prefix (mang, nang, pang) and drop first consonant
        return prefix + root[1:]
    
    # Rule 4: prefix + y/w/l/r/m/n/ng or vowels → keep 'ng', no reduction
    elif first_char in ['y', 'w', 'l', 'r', 'm', 'n'] or first_char in ['a', 'e', 'i', 'o', 'u']:
        # Keep the full prefix and full root
        return prefix + root
    
    else:
        # Default: no change
        return prefix + root
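
A simple way to check that the forward rules and the recovery rules agree is a round trip: prefix a root, undo the prefixing, and confirm that the root is among the candidates. The sketch below is self-contained and uses condensed, hypothetical re-implementations (`forward`, `reverse`) rather than the notebook's own functions; the sample roots come from the examples and root list in this notebook.

```python
NASAL = {"p": "m", "b": "m", "t": "n", "d": "n", "s": "n"}
DROPPED = {"m": "bp", "n": "std"}  # consonants a surface m/n may stand for

def forward(prefix, root):
    """Apply mang-/nang-/pang- to a root (condensed version of the rules)."""
    first = root[0]
    if first in NASAL:
        return prefix[:-2] + NASAL[first] + root[1:]
    if first in ("k", "g", "'"):
        return prefix + root[1:]
    return prefix + root

def reverse(word):
    """List root candidates; assumes word starts with ma/na/pa + nasal."""
    rest = word[2:]
    if rest.startswith("ng"):
        stem = rest[2:]
        if stem[:1] in ("y", "w", "l", "r", "m", "n"):
            return [stem]
        return [c + stem for c in "kg'"] + [stem]
    return [c + rest[1:] for c in DROPPED.get(rest[:1], "")]

roots = ["basa", "saliw", "kalap", "aluyon", "lipot", "ngaran", "tanem", "anak"]
failures = [
    (prefix, root)
    for root in roots
    for prefix in ("mang", "nang", "pang")
    if root not in reverse(forward(prefix, root))
]
print(failures)  # an empty list means every sample root was recovered
```

The round trip is one-to-many by design: `reverse` returns several candidates, and the check only asks that the true root is one of them.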

## 9. Generate All Possible Assimilated Forms from Root Words

This step starts from a list of root words and generates all three possible prefixed forms (mang-, nang-, pang-) for each root, without consulting the assimilated words already in the lexicon.

In [127]:
# ============================================
# WORD LIST - Place your root words here
# ============================================
root_word_list = [
    {"root": "gamet", "meaning": "hand"}, {"root": "wiri", "meaning": "left"},
    {"root": "wanan", "meaning": "right"}, {"root": "bitih", "meaning": "leg/foot"},
    {"root": "daan", "meaning": "road/path"}, {"root": "tangoy", "meaning": "to swim"},
    {"root": "tapok", "meaning": "dust"}, {"root": "katat", "meaning": "skin"},
    {"root": "gorot", "meaning": "back"}, {"root": "tyan", "meaning": "belly"},
    {"root": "botol", "meaning": "bone"}, {"root": "agtay", "meaning": "liver"},
    {"root": "soso", "meaning": "breast"}, {"root": "abaya", "meaning": "shoulder"},
    {"root": "daya", "meaning": "blood"}, {"root": "olo", "meaning": "head"},
    {"root": "leey", "meaning": "neck"}, {"root": "sabot", "meaning": "hair"},
    {"root": "arong", "meaning": "nose"}, {"root": "angot", "meaning": "to sniff/smell"},
    {"root": "bebey", "meaning": "mouth"}, {"root": "ngipin", "meaning": "tooth"},
    {"root": "dila", "meaning": "tongue"}, {"root": "kalis", "meaning": "to laugh"},
    {"root": "akis", "meaning": "to cry"}, {"root": "soka", "meaning": "to vomit"},
    {"root": "kan", "meaning": "to eat"}, {"root": "inom", "meaning": "to drink"},
    {"root": "kayat", "meaning": "to bite"}, {"root": "sepsep", "meaning": "to suck"},
    {"root": "toly", "meaning": "ear"}, {"root": "ingar", "meaning": "to hear"},
    {"root": "mata", "meaning": "eye"}, {"root": "kit", "meaning": "to see"},
    {"root": "elek", "meaning": "to sleep"}, {"root": "taynep", "meaning": "to dream"},
    {"root": "tekre", "meaning": "to sit"}, {"root": "ideng", "meaning": "to stand"},
    {"root": "lalaki", "meaning": "man/male"}, {"root": "babayi", "meaning": "woman/female"},
    {"root": "anak", "meaning": "child"}, {"root": "ahawa", "meaning": "spouse"},
    {"root": "ina", "meaning": "mother"}, {"root": "tatay", "meaning": "father"},
    {"root": "bali", "meaning": "house"}, {"root": "atep", "meaning": "roof"},
    {"root": "ngaran", "meaning": "name"}, {"root": "robir", "meaning": "rope"},
    {"root": "tayi", "meaning": "to sew"}, {"root": "kadayem", "meaning": "needle"},
    {"root": "takaw", "meaning": "to steal"}, {"root": "pati", "meaning": "to kill"},
    {"root": "tadem", "meaning": "sharp"}, {"root": "obra", "meaning": "to work"},
    {"root": "tanem", "meaning": "to plant"}, {"root": "pili", "meaning": "to choose"},
    {"root": "pespes", "meaning": "to squeeze"}, {"root": "kotkot", "meaning": "to dig"},
    {"root": "haliw", "meaning": "to buy"}, {"root": "bantak", "meaning": "to throw"},
    {"root": "aso", "meaning": "dog"}, {"root": "manok", "meaning": "bird/chicken"},
    {"root": "salay", "meaning": "egg"}, {"root": "pakpak", "meaning": "wing"},
    {"root": "lompad", "meaning": "to fly"}, {"root": "ikoy", "meaning": "tail"},
    {"root": "olay", "meaning": "snake"}, {"root": "bolati", "meaning": "worm"},
    {"root": "gigang", "meaning": "spider"}, {"root": "kona", "meaning": "fish"},
    {"root": "yamot", "meaning": "root"}, {"root": "bonga", "meaning": "fruit"},
    {"root": "bato", "meaning": "stone"}, {"root": "boyangin", "meaning": "sand"},
    {"root": "ranom", "meaning": "water"}, {"root": "asin", "meaning": "salt"},
    {"root": "langit", "meaning": "sky"}, {"root": "bulan", "meaning": "moon"},
    {"root": "bitoen", "meaning": "star"}, {"root": "gonem", "meaning": "cloud"},
    {"root": "rapeg", "meaning": "rain"}, {"root": "kodor", "meaning": "thunder"},
    {"root": "kimat", "meaning": "lightning"}, {"root": "emot", "meaning": "warm"},
    {"root": "rayep", "meaning": "cold"}, {"root": "albet", "meaning": "wet"},
    {"root": "byat", "meaning": "heavy"}
]

# Convert word list to DataFrame
word_list_df = pd.DataFrame(root_word_list)
unique_roots = word_list_df['root'].unique()

# Generate all 3 possible assimilated forms for each root
all_generated_forms = []

for idx, row in word_list_df.iterrows():
    root = row['root']
    meaning = row['meaning']
    
    # Apply all 3 prefixes to each root
    for prefix in ["mang", "nang", "pang"]:
        assimilated_form = apply_assimilation_reduction(prefix, root)
        
        # Check if this generated form exists in the original lexicon
        exists_in_lexicon = assimilated_form in df['word'].values
        
        all_generated_forms.append({
            'root_word': root,
            'root_meaning': meaning,
            'prefix_applied': prefix,
            'generated_form': assimilated_form,
            'exists_in_lexicon': exists_in_lexicon
        })

generated_df = pd.DataFrame(all_generated_forms)

# Display statistics
print(f"Total unique roots: {len(unique_roots)}")
print(f"Total generated forms: {len(generated_df)} (3 per root)")
print(f"Forms that exist in lexicon: {generated_df['exists_in_lexicon'].sum()}")
print(f"\nFirst 30 generated forms:")
generated_df.head(30)

Total unique roots: 87
Total generated forms: 261 (3 per root)
Forms that exist in lexicon: 9

First 30 generated forms:


| | root_word | root_meaning | prefix_applied | generated_form | exists_in_lexicon |
|---|---|---|---|---|---|
| 0 | gamet | hand | mang | mangamet | False |
| 1 | gamet | hand | nang | nangamet | False |
| 2 | gamet | hand | pang | pangamet | False |
| 3 | wiri | left | mang | mangwiri | False |
| 4 | wiri | left | nang | nangwiri | False |
| 5 | wiri | left | pang | pangwiri | False |
| 6 | wanan | right | mang | mangwanan | False |
| 7 | wanan | right | nang | nangwanan | False |
| 8 | wanan | right | pang | pangwanan | False |
| 9 | bitih | leg/foot | mang | mamitih | False |


## 10. Verify Which Generated Forms Match Original Lexicon

Check which of our generated forms actually appear in the original Bolinao lexicon.

In [128]:
# Filter to show only forms that exist in lexicon
existing_forms = generated_df[generated_df['exists_in_lexicon'] == True].copy()

# Group by root to see how many valid forms each root produces
forms_per_root = existing_forms.groupby('root_word').size()

print(f"Roots that produce 1 valid form: {(forms_per_root == 1).sum()}")
print(f"Roots that produce 2 valid forms: {(forms_per_root == 2).sum()}")
print(f"Roots that produce 3 valid forms: {(forms_per_root == 3).sum()}")
print(f"\nExisting forms in lexicon:")
existing_forms.head(30)

Roots that produce 1 valid form: 7
Roots that produce 2 valid forms: 1
Roots that produce 3 valid forms: 0

Existing forms in lexicon:


| | root_word | root_meaning | prefix_applied | generated_form | exists_in_lexicon |
|---|---|---|---|---|---|
| 89 | sepsep | to suck | pang | panepsep | True |
| 93 | ingar | to hear | mang | mangingar | True |
| 120 | anak | child | mang | manganak | True |
| 122 | anak | child | pang | panganak | True |
| 132 | bali | house | mang | mamali | True |
| 140 | ngaran | name | pang | pangngaran | True |
| 164 | tanem | to plant | pang | pananem | True |
| 180 | aso | dog | mang | mangaso | True |
| 186 | salay | egg | mang | manalay | True |


## 11. View Forms NOT in Lexicon (Potential New Words)

These are forms generated by the rules that do not appear in the current lexicon. They are candidates for new entries, not confirmed words.

In [129]:
# Show forms that DON'T exist in lexicon (potential new words)
non_existing_forms = generated_df[generated_df['exists_in_lexicon'] == False]

print(f"Total generated forms NOT in lexicon: {len(non_existing_forms)}")
print(f"\nThese are grammatically valid but not documented:")
non_existing_forms.head(30)

Total generated forms NOT in lexicon: 252

These are grammatically valid but not documented:



| | root_word | root_meaning | prefix_applied | generated_form | exists_in_lexicon |
|---|---|---|---|---|---|
| 0 | gamet | hand | mang | mangamet | False |
| 1 | gamet | hand | nang | nangamet | False |
| 2 | gamet | hand | pang | pangamet | False |
| 3 | wiri | left | mang | mangwiri | False |
| 4 | wiri | left | nang | nangwiri | False |
| 5 | wiri | left | pang | pangwiri | False |
| 6 | wanan | right | mang | mangwanan | False |
| 7 | wanan | right | nang | nangwanan | False |
| 8 | wanan | right | pang | pangwanan | False |
| 9 | bitih | leg/foot | mang | mamitih | False |


## 12. Export Generated Forms

Save all generated forms (both existing and potential new words) to CSV files.

In [130]:
# Export all generated forms
generated_df.to_csv("bolinao_all_generated_assimilated_forms.csv", index=False)
print("All generated forms saved to 'bolinao_all_generated_assimilated_forms.csv'")

# Export only existing forms
existing_forms.to_csv("bolinao_generated_forms_in_lexicon.csv", index=False)
print("Existing forms saved to 'bolinao_generated_forms_in_lexicon.csv'")

# Export potential new words
non_existing_forms.to_csv("bolinao_potential_new_words.csv", index=False)
print("Potential new words saved to 'bolinao_potential_new_words.csv'")

All generated forms saved to 'bolinao_all_generated_assimilated_forms.csv'
Existing forms saved to 'bolinao_generated_forms_in_lexicon.csv'
Potential new words saved to 'bolinao_potential_new_words.csv'
