# 📖 Detecting Number or Mass Indication in Bolinao Words

This notebook demonstrates how **Bolinao words** show number or mass indication based on linguistic rules and also attempts to recover the **root word**.

### 2.2 Number or Mass Indication Rules
Number or mass is indicated through:

1. **General sense of the word**
2. **Reduplication** (repeating part of the root word)
3. **Use of pronouns** to accompany the word
4. **Infixation of `-aw-` / `-u-`**

👉 Example transformations:
- *bato* → *bubato* / *bawbato* ('group of stones')
- *anak* → *uanak* / *awanak* ('group of children')
- *baboy* → *bubaboy* / *bawbaboy* ('group of pigs')
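
The example transformations above suggest a simple generative pattern: a consonant-initial root copies its first consonant in front of `u` or `aw`, while a vowel-initial root takes `u` or `aw` directly. The following is a minimal sketch of that pattern, not a full account of Bolinao morphology. The helper name `plural_forms` and the vowel set are assumptions made for illustration.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch


def plural_forms(root):
    """Return the (u-form, aw-form) pair suggested by the examples above."""
    # Consonant-initial roots copy their first consonant before the marker;
    # vowel-initial roots take the marker with no copied consonant.
    onset = "" if root[0] in VOWELS else root[0]
    return onset + "u" + root, onset + "aw" + root


for root in ("bato", "anak", "baboy"):
    print(root, "->", plural_forms(root))
```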

### Special cases:
- Gemination (sound doubling with vowel reduction):
  - *lalaki* → *lulalaki* → *lullaki* ('group of men')
  - *babayi* → *bubabayi* → *bubbiyi* ('group of women')
- Extended noun forms through reduplication of consonant + vowel:
  - *anak* → *aanak* → *a’nak* ('children')
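
The gemination step can be sketched as deleting the vowel that follows the copied consonant, so that two identical consonants end up adjacent. The hypothetical helper `geminate` below models only that deletion. It reproduces *lulalaki* → *lullaki*, but for *bubabayi* it yields *bubbayi*, because the further vowel change seen in *bubbiyi* is not modeled.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch


def geminate(form):
    """Drop the vowel after the copied consonant in a C-u-C-V... form."""
    if (len(form) >= 5 and form[0] not in VOWELS and form[1] == "u"
            and form[2] == form[0] and form[3] in VOWELS):
        return form[:3] + form[4:]
    return form  # leave forms that do not fit the pattern unchanged


print(geminate("lulalaki"))  # lullaki
print(geminate("bubabayi"))  # bubbayi (the attested form is bubbiyi)
```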



## 🔽 Step 1: Import Libraries
We will use **pandas** for handling the CSV file and **regex** for detecting the morphological patterns.

In [None]:
import pandas as pd
import re

## 🔽 Step 2: Load the CSV File
Upload your CSV with the following columns:

`word, part_of_speech, meaning_english, meaning_filipino, sample_bolinao, sample_english, upos`

We will preview the first rows to check that it loaded properly.


In [None]:
# Load CSV directly (make sure the file is in your Colab working directory)
df = pd.read_csv("bolinao_lexicon.csv")
df.head()


Unnamed: 0.1,Unnamed: 0,part_of_speech,meaning_english,meaning_filipino,sample_bolinao,sample_english,upos,word
0,a'lo,n,"Pestle, a rounded piece of wood about five inc...",Halo.,Kustoy byat nansi a'lonman'ipambayo kon irik.,Thepestlethat I am using to pound unhusked ric...,NOUN,
1,a'nak,n,"Referring to specific children individually, n...",Mga anak.,Si Ligaya a kaka sa sarba konran syam nin a'na...,Ligaya is the oldest of Gorio's nine children.,NOUN,
2,a'nem,n,"Six, the number following five.",Anim.,A'nem ray salay nan manok.,The chicken had six eggs.,NOUN,
3,a'pat,n,Four.,Apat.,Nagbakasyon ako nin a'pat nin awro.,I took a vacation for four months.,NOUN,
4,a'rong,n,Nose.,Ilong.,Say a'rong ran Pilipino ket ambo' tuloy nin ma...,The noses of Filipinos are not too pointed.,NOUN,
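
In the preview above, the headwords appear under `Unnamed: 0` while the `word` field is empty. A small fallback, sketched here with the standard-library `csv` module on a shortened two-row sample, picks whichever field actually holds the headword. The helper name `headword` and the sample rows are illustrative assumptions, not part of the notebook's pipeline.

```python
import csv
import io

# Two-row sample shaped like the preview: the headword is under "Unnamed: 0"
# and the "word" field is empty. Other columns are omitted for brevity.
SAMPLE = (
    "Unnamed: 0,upos,word\n"
    "a'nem,NOUN,\n"
    "a'pat,NOUN,\n"
)


def headword(row):
    """Prefer the 'word' field; fall back to 'Unnamed: 0' when it is empty."""
    return row.get("word") or row.get("Unnamed: 0") or ""


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
words = [headword(row) for row in rows]
print(words)
```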


## 🔽 Step 3: Define Detection and Root Extraction Rules

We implemented regex- and condition-based rules to identify number/mass indication and extract **rootword candidates**. The function `detect_and_extract` applies the following rules:

1. **Gemination (lullaki → lalaki)**  
   - If the second character is `u` and the third and fourth characters both repeat the initial consonant, we replace the `u` with `a` and collapse the geminated consonant.  
   - Example:  
     - *lullaki* → *lalaki*  
     - *bubbiyi* → *babiyi* (the reduced vowel is not restored, so the rule does not recover *babayi*)  

2. **Skip Rule: Two-syllable reduplication**  
   - If the first half of a word is identical to its second half (e.g., `abab`, `bibi`, `lolo`), no stemming is applied.  
   - These are full reduplications, not mass indicators.  

3. **Initial Gemination / Vowel Reduplication (aanakan → anakan)**  
   - If the word begins with two identical characters (consonants or vowels), the first one is dropped.  
   - Example: *aanakan* → *anakan*  

4. **Consonant-Vowel (CV) Reduplication (bubato → bato, bubaboy → baboy)**  
   - If the word begins with one of the prefixes `bu`, `u`, `aw`, or `baw`, we strip that prefix. The test looks at the prefix alone and does not confirm that the following consonant is actually repeated.  
   - Example:  
     - *bubato* → *bato*  
     - *bubaboy* → *baboy*  

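5. **Simple reduplication (fallback rule)**  
   - An earlier version dropped a repeated first character when none of the above matched. That fallback was removed from the function because Rule 3 already covers a doubled first character.  
   - Words that match none of the rules are returned without a root candidate.  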
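
Rules 2 and 3 reduce to two short string tests. The sketch below is illustrative only: the helper names `is_full_reduplication` and `drop_initial_double` are hypothetical and are not part of `detect_and_extract`.

```python
import re


def is_full_reduplication(word):
    """Rule 2 test: the first half of the word equals the second half."""
    half = len(word) // 2
    return len(word) % 2 == 0 and half > 0 and word[:half] == word[half:]


def drop_initial_double(word):
    """Rule 3 test: drop one of two identical leading characters, else None."""
    return word[1:] if re.match(r"^(.)\1", word) else None


print(is_full_reduplication("lolo"))   # True
print(is_full_reduplication("bato"))   # False
print(drop_initial_double("aanakan"))  # anakan
```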

In [None]:
def detect_and_extract(word, lexicon):
    """
    Detect number/mass indication and attempt to extract a rootword candidate.

    `lexicon` is currently unused; the caller verifies candidates against the lexicon.

    Returns:
        (root_candidate, applied_stemming, process)
    """

    root_candidate = None
    applied_stemming = "no"
    process = "None"

    # Rule 1: Gemination (lullaki → lalaki)
    # Matches C1 + "u" + C1 + C1: the second character is "u" and the third and
    # fourth characters both repeat the first.
    if (len(word) >= 4 and word[1] == "u" and word[2] == word[0] and word[3] == word[2]):
        root_candidate = word[0] + "a" + word[3:]  # replace "u" with "a" and collapse C1C1 to C1
        applied_stemming = "yes"
        process = "Gemination (lulalaki/bubbiyi)"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    # Rule 2: Skip Two-syllable exact reduplication (e.g., abab, lolo)
    # These are full reduplications, not mass indicators to be stemmed.
    elif len(word) % 2 == 0 and word[:len(word)//2] == word[len(word)//2:]:
        # print(f"[LOG] Skipping {word} (two-syllable reduplication)") # Re-enable for debugging
        return None, "no", "Skip: Two-syllable reduplication"

    # Rule 3: Initial Gemination / Vowel Reduplication (aanakan → anakan)
    # If the word begins with two identical consonants or vowels, the extra character is dropped.
    elif re.match(r"^(.)\1", word): # Matches patterns like 'aa', 'bb' at the start
        root_candidate = word[1:]  # Remove the first repeated character
        applied_stemming = "yes"
        process = "Initial Gemination/Vowel Reduplication"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    # Rule 4: Consonant-Vowel (CV) Reduplication / Prefix Removal for Number/Mass Indication
    # Based on explicit examples like bubato → bato, uanak → anak, awanak → anak, bawbato → bato.
    # These are specific prefixes that indicate number/mass and should be stripped.
    elif word.startswith("bu") and len(word) > 2: # e.g., bubato, bubaboy
        root_candidate = word[2:]
        applied_stemming = "yes"
        process = "CV Reduplication: 'bu-' prefix"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    elif word.startswith("u") and len(word) > 1: # e.g., uanak
        root_candidate = word[1:]
        applied_stemming = "yes"
        process = "CV Reduplication: 'u-' prefix"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    elif word.startswith("aw") and len(word) > 2: # e.g., awanak
        root_candidate = word[2:]
        applied_stemming = "yes"
        process = "Infixation/Prefix: 'aw-'"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    elif word.startswith("baw") and len(word) > 3: # e.g., bawbato, bawbaboy
        root_candidate = word[3:]
        applied_stemming = "yes"
        process = "Infixation/Prefix: 'baw-'"
        # print(f"[LOG] {word} → {root_candidate} | Rule: {process}") # Re-enable for debugging
        return root_candidate, applied_stemming, process

    # The previous rules 'elif re.match(r"^(.{2})\1", word):' and the final 'elif re.match(r"^(\w)\1", word):'
    # were either redundant or did not correctly match the described transformations for CV reduplication
    # (e.g., they would match 'bubu' but not 'bubato' as intended by the example) and have been removed
    # in favor of the more explicit prefix rules above.

    return root_candidate, applied_stemming, process
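
The prefix tests in Rule 4 accept any word that begins with `bu`, `u`, `aw`, or `baw`, whether or not a consonant is actually copied. One possible refinement, sketched here under the assumption that a consonant-initial plural always repeats its first consonant after the marker, is to require that copy before stripping. The function `strip_copied_onset` is hypothetical, handles consonant-initial words only, and is not part of the notebook's pipeline.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch


def strip_copied_onset(word):
    """Return the root if the word looks like C+u+C... or C+aw+C..., else None."""
    if not word or word[0] in VOWELS:
        return None  # vowel-initial forms (uanak, awanak) are out of scope here
    for marker in ("aw", "u"):
        rest = word[1 + len(marker):]
        # The marker must follow the first consonant, and the remainder
        # must start with that same consonant (the copied onset).
        if word[1:1 + len(marker)] == marker and rest.startswith(word[0]):
            return rest
    return None


for w in ("bubato", "bawbaboy", "bawet", "bawangan"):
    print(w, "->", strip_copied_onset(w))
```

Under this stricter test, *bawet* and *bawangan* are left alone because nothing after `baw` repeats the initial `b`.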

## 🔽 Step 4: Detection and Rootword Verification

After defining the detection rules, these were systematically applied to each word in the dataset.
The rules identify morphological processes such as reduplication, infixation, or gemination.
When a process was detected, a rootword candidate was extracted and subsequently verified against the lexicon.  

Only cases that satisfied two conditions were retained in the results table:  
1. The word underwent at least one detectable morphological process.  
2. The extracted rootword candidate was present in the lexicon.  

This filtering ensured that the analysis focused solely on linguistically valid forms, excluding surface words without clear morphological alternations or candidates lacking lexical evidence.  
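
The two retention conditions amount to a single filter over detection outputs. The sketch below uses a toy lexicon and hand-written detection tuples, both hypothetical stand-ins for the real data, to show which rows survive.

```python
# Toy lexicon: a few headwords with short glosses (illustrative only).
lexicon = {"adyan": "to hide", "bato": "stone"}

# Hand-written stand-ins for detection output: (word, candidate, applied).
detections = [
    ("aadyan", "adyan", "yes"),  # process detected, candidate in lexicon
    ("bubato", "bato", "yes"),   # process detected, candidate in lexicon
    ("uxx", "xx", "yes"),        # made-up form: candidate not in lexicon
    ("lolo", None, "no"),        # no process applied
]

# Keep a row only when a process was applied AND the candidate is attested.
retained = [
    (word, cand)
    for word, cand, applied in detections
    if applied == "yes" and cand in lexicon
]
print(retained)
```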

The results table includes the following columns:  

- **original_word** → the observed surface form  
- **rootword_candidate** → the extracted rootword candidate (must exist in the lexicon)  
- **meaning_original** → English gloss of the surface form  
- **meaning_candidate** → English gloss of the verified root candidate (from lexicon)  
- **upos_original** → part-of-speech tag of the original word  
- **upos_candidate** → part-of-speech tag of the root candidate (from lexicon)  


In [None]:
results = []

# The headwords are stored in the "Unnamed: 0" column; copy them into "word",
# replacing NaN with "" so that every entry is a string
df["word"] = df["Unnamed: 0"].fillna("")

# Build lookup for lexicon words → (meaning, UPOS)
lexicon_dict = dict(zip(df["word"], zip(df["meaning_english"], df["upos"])))

# Build the word list once instead of on every loop iteration
lexicon_words = df["word"].tolist()

for _, row in df.iterrows():
    word = row["word"]
    meaning_original = row["meaning_english"]
    upos_original = row["upos"]

    # Only process if 'word' is a non-empty string
    if isinstance(word, str) and word:
        # Detect and extract
        root_candidate, applied_stemming, process = detect_and_extract(word, lexicon_words)

        # Only include if processing/stemming was applied AND root candidate exists in lexicon
        if applied_stemming == "yes" and root_candidate in lexicon_dict:
            meaning_candidate, upos_candidate = lexicon_dict[root_candidate]

            results.append({
                "original_word": word,
                "rootword_candidate": root_candidate,
                "meaning_original": meaning_original,
                "meaning_candidate": meaning_candidate,
                "upos_original": upos_original,
                "upos_candidate": upos_candidate
            })

results_df = pd.DataFrame(results)

# Replace NaN with empty string for clean CSV
results_df = results_df.fillna("")

## 🔽 Step 5: View Results
We will display the table of all processed words with their meanings and UPOS tags, together with each extracted rootword and its meaning and UPOS tag from the original lexicon. Only words whose rootword candidate exists in the lexicon appear.


In [None]:
# Save final CSV with only the required columns
results_df.to_csv("rootword-number-or-mass-indication-removal.csv", index=False)

print("Results saved to rootword-number-or-mass-indication-removal.csv")



Results saved to rootword-number-or-mass-indication-removal.csv


In [None]:
# Reload the saved file to confirm it was written correctly
saved_results = pd.read_csv("rootword-number-or-mass-indication-removal.csv")
saved_results.head()

Unnamed: 0,original_word,rootword_candidate,meaning_original,meaning_candidate,upos_original,upos_candidate
0,aadyan,adyan,A hiding places.,To hide from s/o.,NOUN,VERB
1,away,ay,"A quarrel or fight over something, usually in ...","Oh my! An exclamation of astonishment, complai...",NOUN,INTJ
2,awey,ey,"Movement, actions, motions, acts.","Huh, gives emphasis and weight to a statement.",NOUN,INTJ
3,bawangan,angan,To spice something with garlic.,An opinion about a situation.,VERB,NOUN
4,bawet,et,A vine whose root is soaked and the mixture is...,"Further, a marker for comparative continuation...",NOUN,PART
