<a href="https://colab.research.google.com/github/Jeynang2024/Cross-Lingual-NER-for-Assamese-The-A-LINC-Approach/blob/main/Ner_Assamese.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assamese Named Entity Recognition using XLM-RoBERTa

This notebook focuses on developing a **Named Entity Recognition (NER)** model for the **Assamese language** using the **WikiANN dataset** and **XLM-RoBERTa** transformer.  
The project demonstrates tokenization, data alignment, fine-tuning, and evaluation for a low-resource Indian language.

## 1. Loading the WikiANN Assamese Dataset

The [WikiANN dataset](https://huggingface.co/datasets/wikiann) is a multilingual NER dataset containing tokens and named entity tags across multiple languages.  
Here, we load the Assamese subset (`"as"`) and use its `tokens` and `ner_tags` fields.


In [None]:
from datasets import load_dataset

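dataset = load_dataset("wikiann", "as")
# This map returns `tokens` and `ner_tags` unchanged, so it does not drop
# the other columns (`langs`, `spans`), as the output below shows.
dataset = dataset.map(lambda x: {"tokens": x["tokens"], "ner_tags": x["ner_tags"]})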


In [None]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 100
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 100
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 100
    })
})

In [None]:
examples = [
    {'tokens': [['মহানগৰ', 'ৰ', 'নতুন', 'বাস', 'স্টেণ্ড', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[0, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [[]]},

    {'tokens': [['গুৱাহাটী', 'চহৰ', 'ত', 'নতুন', 'শিক্ষা', 'কেন্দ্ৰ', 'খোলা', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['LOC: গুৱাহাটী']]},

    {'tokens': [['অসম', 'ৰ', 'প্ৰধান', 'মন্ত্ৰী', 'সভা' ,'ত', 'উপস্থিত', "হ'ল"]],
     'ner_tags': [[5, 0, 1, 2, 4, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['PER: প্ৰধান মন্ত্ৰী']]},

    {'tokens': [['নৱযুবক', 'এবং', 'নৱযুবতী', 'এজন', 'বিজ্ঞানী', 'ৰ', 'মেল', 'যোগ', 'দিছিল']],
     'ner_tags': [[0, 0, 0, 0, 5, 0, 0, 0, 0]],
     'langs': [['as']*9],
     'spans': [['PER: বিজ্ঞানী']]},

    {'tokens': [['ডিব্ৰুগড়', 'ত', 'নতুন', 'চিকিৎসা', 'কেন্দ্ৰ', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 4, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: ডিব্ৰুগড়']]},

    {'tokens': [['অসম', 'ৰ', 'শিক্ষামন্ত্ৰী', 'এ', 'নতুন', 'নীতি', 'ঘোষণা', 'কৰে']],
     'ner_tags': [[5, 0, 1, 0, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['PER: শিক্ষামন্ত্ৰী']]},

    {'tokens': [['নৱদ্বীপ', 'ত', 'সাংস্কৃতিক', 'মেলা', 'আয়োজিত', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0]],
     'langs': [['as']*6],
     'spans': [['LOC: নৱদ্বীপ']]},

    {'tokens': [['গুৱাহাটী', 'ৰ', 'নতুন', 'মহানগৰ', 'পুল', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: গুৱাহাটী']]},

    {'tokens': [['কাজিৰঙা', 'ৰ', 'ৰাষ্ট্ৰীয়', 'উদ্যান', 'ত', 'নতুন', 'প্ৰকল্প', 'চলু']],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['LOC: কাজিৰঙা']]},

    {'tokens': [['অসম', 'ৰ', 'ক্ৰীড়া', 'মন্ত্ৰী', 'এ', 'নতুন', 'স্টেডিয়াম', 'উদ্বোধন', 'কৰে']],
     'ner_tags': [[5, 0, 0, 1, 0, 0, 0, 0, 0]],
     'langs': [['as']*9],
     'spans': [['PER:  মন্ত্ৰী']]},

    {'tokens': [['শিৱসাগৰ', 'ত', 'প্ৰাচীন', 'মন্দিৰ', 'সজোৱা', 'কৰা', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: শিৱসাগৰ']]},

    {'tokens': [['নৱীন', 'চক্ৰৱৰ্তী', 'এ', 'ডাক্তৰ', 'ৰ', 'পদ', 'প্ৰাপ্ত', "হ'ল"]],
     'ner_tags': [[1, 2, 0, 1, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['PER: ডাক্তৰ']]},

    {'tokens': [['মাজুলী', 'দ্বীপ', 'ত', 'নতুন', 'সেতু', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: মাজুলী']]},

    {'tokens': [['গুৱাহাটী', 'চহৰ', 'ৰ', 'নতুন', 'বিমানবন্দর', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: গুৱাহাটী']]},

    {'tokens': [['অসম', 'ৰ', 'উচ্চতৰ', 'শিক্ষা', 'নির্দেশিকা', 'এ', 'নতুন', 'নীতি', 'ঘোষণা', 'কৰে']],
     'ner_tags': [[5, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
     'langs': [['as']*10],
     'spans': [['PER: উচ্চতৰ শিক্ষা নিৰ্দেশিকা']]},

    {'tokens': [['ডিব্ৰুগড়', 'ৰ', 'নতুন', 'বছৰ', 'উদযাপন', 'আয়োজন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: ডিব্ৰুগড়']]},

    {'tokens': [['নৱদ্বীপ', 'ৰ', 'সাংস্কৃতিক', 'কেন্দ্ৰ', 'ত', 'নতুন', 'কাৰ্যসূচী', 'চলু']],
     'ner_tags': [[5, 0, 0, 4, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['LOC: নৱদ্বীপ']]},

    {'tokens': [['শিৱসাগৰ', 'চহৰ', 'ৰ', 'নতুন', 'মেডিকেল', 'কলেজ', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 0, 0, 0, 0]],
     'langs': [['as']*8],
     'spans': [['LOC: শিৱসাগৰ']]},

    {'tokens': [['নৱযুবক', 'এ', 'নতুন', 'উদ্যোগ', 'শুরু', 'কৰে']],
     'ner_tags': [[1, 0, 0, 0, 0, 0]],
     'langs': [['as']*6],
     'spans': [['PER: নৱযুবক']]},

    {'tokens': [['কাজিৰঙা', 'ৰ', 'নতুন', 'পৰ্যটন', 'কেন্দ্ৰ', 'উদ্বোধন', "হ'ল"]],
     'ner_tags': [[5, 0, 0, 0, 4, 0, 0]],
     'langs': [['as']*7],
     'spans': [['LOC: কাজিৰঙা']]}
]
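
Because the examples above are written by hand, the `tokens`, `ner_tags`, and `langs` lists of an entry can easily drift out of sync. Below is a minimal sketch of a consistency check, assuming the nested one-sentence-per-entry layout used in `examples`. The helper name `find_length_mismatches` and the two sample entries are illustrative and not part of the notebook.

```python
def find_length_mismatches(examples):
    """Return (index, field, expected, actual) for every field whose
    length differs from the number of tokens in that entry."""
    problems = []
    for idx, ex in enumerate(examples):
        n_tokens = len(ex["tokens"][0])
        for field in ("ner_tags", "langs"):
            actual = len(ex[field][0])
            if actual != n_tokens:
                problems.append((idx, field, n_tokens, actual))
    return problems


# Two illustrative entries in the same layout; the second has one
# language code too few.
sample = [
    {"tokens": [["a", "b", "c"]], "ner_tags": [[5, 0, 0]], "langs": [["as"] * 3], "spans": [[]]},
    {"tokens": [["a", "b", "c"]], "ner_tags": [[5, 0, 0]], "langs": [["as"] * 2], "spans": [[]]},
]
print(find_length_mismatches(sample))
```

Calling `find_length_mismatches(examples)` on the full list would report any entry whose tag or language list does not match its token count.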


In [None]:
import random


persons = ["হিমন্ত বিস্বা সৰ্মা", "ভূপেন হাজৰিকা", "জাহ্নু বৰুৱা", "প্ৰণৱ মুখাৰ্জী", "ৰামেশ্বৰ দাস"]
locations = ["গুৱাহাটী", "নগাঁও", "শিৱসাগৰ", "তিনিচুকীয়া", "ব্ৰহ্মপুত্ৰ","কাজিৰঙা"]
orgs = ["নাগাঁও মহাবিদ্যালয়", "অসম বিশ্ববিদ্যালয়", "গুৱাহাটী কেন্দ্ৰ", "অসম চৰকাৰ"]

for i in range(50):
    sentence_type = random.choice(["person", "location", "organization", "misc"])
    tokens = []
    ner_tags = []
    spans = []

    if sentence_type == "person":
        person = random.choice(persons)
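        name_tokens = person.split()
        tokens = name_tokens + ["আজিৰ", "সমাৰোহত", "উপস্থিত", "থিলেন"]
        # B-PER on the first name token, I-PER on the remaining name tokens
        # (some names have three tokens), O on the rest of the sentence.
        ner_tags = [1] + [2] * (len(name_tokens) - 1) + [0] * (len(tokens) - len(name_tokens))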
        spans = [f"PER: {person}"]

    elif sentence_type == "location":
        loc = random.choice(locations)
        tokens = ["সদৰ", "নগৰ", loc, "ত", "নতুন", "মহাবিদ্যালয়", "খোলা", "হ'ল"]
        # Tag only the place name itself, matching the `LOC: {loc}` span below.
        ner_tags = [0, 0, 5, 0, 0, 0, 0, 0]
        spans = [f"LOC: {loc}"]

    elif sentence_type == "organization":
        org = random.choice(orgs)
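        org_tokens = org.split()
        tokens = ["নতুন", "কোর্স", "চলিছে", "প্ৰতিষ্ঠান"] + org_tokens
        # B-ORG (3) on the first organization token, I-ORG (4) on the rest;
        # tag 5 is B-LOC in the WikiANN label order.
        ner_tags = [0, 0, 0, 0, 3] + [4] * (len(org_tokens) - 1)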
        spans = [f"ORG: {org}"]

    else:
        tokens = ["আজিৰ", "আবহাওয়া", "গুৱাহাটী", "চহৰ", "ত", "বৃষ্টিপাত", "হৈছে"]
        ner_tags = [0, 0, 5, 0, 0, 0, 0]
        spans = [f"LOC: গুৱাহাটী"]

    examples.append({
        "tokens": [tokens],
        "ner_tags": [ner_tags],
        "langs": [["as"]*len(tokens)],
        "spans": [spans]
    })

for ex in examples[:5]:
    print(ex)


{'tokens': [['মহানগৰ', 'ৰ', 'নতুন', 'বাস', 'স্টেণ্ড', 'উদ্বোধন', "হ'ল"]], 'ner_tags': [[0, 0, 0, 0, 0, 0, 0]], 'langs': [['as', 'as', 'as', 'as', 'as', 'as', 'as']], 'spans': [[]]}
{'tokens': [['গুৱাহাটী', 'চহৰ', 'ত', 'নতুন', 'শিক্ষা', 'কেন্দ্ৰ', 'খোলা', "হ'ল"]], 'ner_tags': [[5, 0, 0, 0, 0, 0, 0, 0]], 'langs': [['as', 'as', 'as', 'as', 'as', 'as', 'as', 'as']], 'spans': [['LOC: গুৱাহাটী']]}
{'tokens': [['অসম', 'ৰ', 'প্ৰধান', 'মন্ত্ৰী', 'সভা', 'ত', 'উপস্থিত', "হ'ল"]], 'ner_tags': [[5, 0, 1, 2, 4, 0, 0, 0]], 'langs': [['as', 'as', 'as', 'as', 'as', 'as', 'as']], 'spans': [['PER: প্ৰধান মন্ত্ৰী']]}
{'tokens': [['নৱযুবক', 'এবং', 'নৱযুবতী', 'এজন', 'বিজ্ঞানী', 'ৰ', 'মেল', 'যোগ', 'দিছিল']], 'ner_tags': [[0, 0, 0, 0, 5, 0, 0, 0, 0]], 'langs': [['as', 'as', 'as', 'as', 'as', 'as', 'as', 'as', 'as']], 'spans': [['PER: বিজ্ঞানী']]}
{'tokens': [['ডিব্ৰুগড়', 'ত', 'নতুন', 'চিকিৎসা', 'কেন্দ্ৰ', 'উদ্বোধন', "হ'ল"]], 'ner_tags': [[5, 0, 0, 0, 4, 0, 0]], 'langs': [['as', 'as', 'as', 'as', 'as', 'as', '

## 2. Dataset Preparation and Conversion

We preprocess and organize the dataset into a **Hugging Face DatasetDict** format.  
This involves restructuring the dataset lists and preparing them for tokenization.

Steps include:
- Extracting tokens and labels.
- Flattening and combining multiple dataset lists.
- Creating a consistent dataset structure for training.

In [None]:
from datasets import Dataset, DatasetDict

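# Build a standalone dataset from the hand-written and generated examples
# (still in the nested one-sentence-per-entry layout) to inspect its size.
new_dataset = Dataset.from_list(examples)

print(new_dataset)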


original_train_list = dataset["train"][:]

# Convert the column-oriented dict into one dict per sentence.
original_train_examples_list = [
    {
        "tokens": original_train_list["tokens"][i],
        "ner_tags": original_train_list["ner_tags"][i],
        "langs": original_train_list["langs"][i],
        "spans": original_train_list["spans"][i],
    }
    for i in range(len(original_train_list["tokens"]))
]

flattened_examples = []
for ex in examples:
    flattened_examples.append({
        "tokens": ex["tokens"][0],
        "ner_tags": ex["ner_tags"][0],
        "langs": ex["langs"][0],
        "spans": ex["spans"][0]
    })

print("Structure of original_train_examples_list:", type(original_train_examples_list), len(original_train_examples_list))
if original_train_examples_list:
    print("First element of original_train_examples_list:", original_train_examples_list[0])


print("Structure of flattened_examples:", type(flattened_examples), len(flattened_examples))
if flattened_examples:
    print("First element of flattened_examples:", flattened_examples[0])


combined_train_list = original_train_examples_list + flattened_examples

combined_train_dataset = Dataset.from_list(combined_train_list)

dataset["train"] = combined_train_dataset

print(dataset)
print("New train split length:", len(dataset["train"]))

Dataset({
    features: ['tokens', 'ner_tags', 'langs', 'spans'],
    num_rows: 70
})
Structure of original_train_examples_list: <class 'list'> 100
First element of original_train_examples_list: {'tokens': ['रुपया', '(', 'ৰুপয়া', ')', 'হিন্দীত'], 'ner_tags': [0, 0, 0, 0, 5], 'langs': ['as', 'as', 'as', 'as', 'as'], 'spans': ['LOC: হিন্দীত']}
Structure of flattened_examples: <class 'list'> 70
First element of flattened_examples: {'tokens': ['মহানগৰ', 'ৰ', 'নতুন', 'বাস', 'স্টেণ্ড', 'উদ্বোধন', "হ'ল"], 'ner_tags': [0, 0, 0, 0, 0, 0, 0], 'langs': ['as', 'as', 'as', 'as', 'as', 'as', 'as'], 'spans': []}
DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 100
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 100
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 170
    })
})
New train split length: 170


In [None]:
!pip install torch transformers datasets seqeval nlpaug


Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=8dc266d9c61890bafc6cf91e6b648e2c28a983a1b679883cb5b994d34c94ee3f
  Stored in directory: /root/.cache/pip/wheels/5f/b8/73/0b2c1a76b701a677653dd79ece07cfabd7457989dbfbdcd8d7
Successfully built seqeval
Installing collected packages: seqeval, nlpaug
Successfully installed nlpaug-1.1.11 seqeval-1.2.2


## 3. Tokenization and Label Alignment

We use the **XLM-RoBERTa tokenizer** to convert tokens into subword representations.  
Because subword tokenization can split one word into several pieces, label alignment copies each word's label to all of its subword pieces.  

This step includes:
- Tokenizing Assamese text.
- Aligning `ner_tags` with the subword tokens.
- Assigning `-100` to ignored positions (special tokens).
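
The alignment rule can be illustrated without loading a tokenizer. The sketch below assumes a `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens, otherwise the index of the source word). The function name `align_labels`, the `label_all_subwords` flag, and the sample inputs are illustrative choices and not part of the notebook.

```python
def align_labels(word_ids, word_labels, label_all_subwords=True):
    """Expand word-level labels to subword positions.

    Special tokens get -100. Continuation subwords either inherit the
    word's label or get -100, depending on `label_all_subwords`.
    """
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])
        else:
            aligned.append(word_labels[word_idx] if label_all_subwords else -100)
        previous = word_idx
    return aligned


# Hypothetical sentence of three words; the first word is split into two
# subwords and the sequence is wrapped in special tokens.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [5, 0, 0]
print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_all_subwords=False))
```

Labeling only the first subword keeps one training signal per word, while labeling every subword gives longer words more weight in the loss. Either choice should be applied consistently at evaluation time.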

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # word_ids maps each subword token to its source word (None for special tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens are ignored by the loss.
                label_ids.append(-100)
            else:
                # Every subword piece inherits the label of its word.
                label_ids.append(label[word_idx])
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)



In [None]:
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(model_path='xlm-roberta-base', action="substitute",aug_p=0.3,top_k=5)
sample_sentence = "আমি আজি কেম্পাছত আছো"
augmented_sentence = aug.augment(sample_sentence)
print("Original:", sample_sentence)
print("Augmented:", augmented_sentence)



Original: ['আমি', 'আজি', 'কেম্পাছত', 'আছো']
Augmented: ['আমি', 'আজি', 'কেম্পাছত', 'আছো']


### Attempted Contextual Word Embedding Augmentation

I tried using **contextual word embeddings** for data augmentation in Assamese with the `nlpaug` library and the `xlm-roberta-base` model. The goal was to generate sentence variations by substituting words in context.  

**Observation:**  
- The augmentation was **not effective**: for the sentences tried in this notebook, the augmented output was identical to the input.  
- Assamese is a **low-resource language**, and the model likely has too little Assamese coverage to propose meaningful substitutions.  
- As a result, it could not produce diverse or correct alternative words.
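
One way to put a number on this observation is to count how many whitespace-separated tokens the augmenter actually changed. The sketch below assumes the original and augmented sentences have the same number of tokens, as is the case with substitution. `changed_fraction` is an illustrative helper and not an `nlpaug` function.

```python
def changed_fraction(original, augmented):
    """Fraction of token positions whose token differs after augmentation."""
    orig_tokens = original.split()
    aug_tokens = augmented.split()
    if len(orig_tokens) != len(aug_tokens):
        raise ValueError("token counts differ; position-wise comparison is not meaningful")
    if not orig_tokens:
        return 0.0
    changed = sum(o != a for o, a in zip(orig_tokens, aug_tokens))
    return changed / len(orig_tokens)


# An unchanged sentence scores 0.0.
sentence = "আমি আজি কেম্পাছত আছো"
print(changed_fraction(sentence, sentence))
```

A value close to the augmenter's `aug_p` would indicate that substitutions are happening at the requested rate, while 0.0 means that nothing was replaced.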

In [None]:
sample="গুৱাহাটী চহৰ ৰ নতুন বিমানবন্দর উদ্বোধন হ'ল"
augmented_sentence = aug.augment(sample)
print("Original:", sample)
print("Augmented:", augmented_sentence)


Original: গুৱাহাটী চহৰ ৰ নতুন বিমানবন্দর উদ্বোধন হ'ল
Augmented: ["গুৱাহাটী চহৰ ৰ নতুন বিমানবন্দর উদ্বোধন হ'ল"]


## 4. Label Extraction

Here we extract all unique label IDs (`ner_tags`) present in the dataset and sort them.  
This helps in defining the number of output classes required by our NER model.


In [None]:
all_labels = set()
for l in dataset["train"]["ner_tags"]:
    all_labels.update(l)
print(sorted(all_labels))

[0, 1, 2, 3, 4, 5, 6]


In [None]:
label_list = [
    "O",
    "B-PER",
    "I-PER",
    "B-ORG",
    "I-ORG",
    "B-LOC",
    "I-LOC"
]
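
The position of each name in `label_list` is its integer tag, so the list doubles as an id-to-label lookup. Below is a small sketch of building both directions and decoding a tag sequence. The `decode_tags` helper is illustrative, and the sample tags are taken from one of the hand-written examples earlier in the notebook.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Lookups in both directions.
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}


def decode_tags(tag_ids):
    """Turn integer tags into their readable label names."""
    return [id2label[t] for t in tag_ids]


print(decode_tags([1, 2, 0, 1, 0, 0, 0, 0]))
print(label2id["B-LOC"])
```

Both dictionaries can also be passed to `AutoModelForTokenClassification.from_pretrained` through its `id2label` and `label2id` arguments, so that the saved model config carries readable label names.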


## 5. Data Combination and Training Set Update

The **original training examples** were combined with the **flattened examples** from our custom data list in Section 2, which raised the training split from 100 to 170 sentences.  
This increases the size and diversity of the dataset before fine-tuning.

## 6. Model Setup – XLM-RoBERTa for Token Classification

We load the **`xlm-roberta-base`** model from Hugging Face and configure it for **token classification** with one output class per NER label (seven here).  
XLM-RoBERTa is pretrained on text in roughly 100 languages, including Assamese, which makes it a reasonable starting point for this task.

In [None]:
import numpy as np
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from seqeval.metrics import f1_score

# `label_list` is the hand-written list defined above. After the train split was
# rebuilt with Dataset.from_list, its `ner_tags` column may no longer carry label names.
#label_list = dataset["train"].features["ner_tags"].feature.names

num_labels = len(label_list)
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=num_labels)

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_list[p] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]
    return {"f1": f1_score(true_labels, true_predictions)}

training_args = TrainingArguments(
    output_dir="./ner_model",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.save_model("./ner_model")

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 1.767286 | 0.0 |
| 2 | No log | 1.424036 | 0.156969 |
| 3 | No log | 1.284163 | 0.241192 |
| 4 | No log | 1.159519 | 0.364331 |
| 5 | No log | 1.157134 | 0.359606 |




In [None]:
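from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

#model_path = "./ner_model"
#model = AutoModelForTokenClassification.from_pretrained(model_path)
#tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentence = "আমি আজি কেম্পাছত আছো।".split()
inputs = tokenizer(sentence, return_tensors="pt", is_split_into_words=True)
# The model predicts one label per subword token (including special tokens),
# so map predictions back to words before pairing them with `sentence`.
word_ids = inputs.word_ids(batch_index=0)
inputs = inputs.to(model.device)
with torch.no_grad():
    outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)[0].cpu().tolist()

# Keep the prediction for the first subword of each word.
pred_labels = []
previous_word_idx = None
for word_idx, p in zip(word_ids, predictions):
    if word_idx is not None and word_idx != previous_word_idx:
        pred_labels.append(label_list[p])
    previous_word_idx = word_idx
print(list(zip(sentence, pred_labels)))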


[('আমি', 'B-PER'), ('আজি', 'O'), ('কেম্পাছত', 'O'), ('আছো।', 'O')]


## Defining Metrics and Training Configuration

The training cell above defines the evaluation metric, the entity-level **F1 score** from the `seqeval` library, to measure token classification performance.  
It also configures **TrainingArguments**, including:
- Learning rate  
- Batch size  
- Number of epochs  
- Weight decay  
- Output directory
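
`seqeval` scores whole entities rather than individual tokens: a predicted entity counts only if its type and both boundaries match the reference. The sketch below is a simplified pure-Python illustration of that idea for BIO sequences and is not `seqeval`'s implementation. `extract_entities` and `entity_f1` are illustrative names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if continues:
            continue
        if ent_type is not None:
            entities.add((ent_type, start, i - 1))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated as the start of a new entity.
            start, ent_type = i, tag[2:]
        else:
            start, ent_type = None, None
    return entities


def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


true_seqs = [["B-LOC", "I-LOC", "O", "B-PER"]]
pred_seqs = [["B-LOC", "O", "O", "B-PER"]]
print(entity_f1(true_seqs, pred_seqs))
```

In this toy case three of the four tokens are correct, yet only one of the two entities matches exactly, so the entity-level F1 is 0.5. This is why entity-level scores tend to be lower than per-token accuracy on the same predictions.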

In [None]:
import numpy as np
from sklearn.metrics import classification_report

predictions, labels, _ = trainer.predict(tokenized_dataset["validation"])

preds = np.argmax(predictions, axis=2)

id2label = {i: label for i, label in enumerate(label_list)}

true_labels, true_preds = [], []
for i in range(len(labels)):
    for j in range(len(labels[i])):
        if labels[i][j] != -100:
            true_labels.append(id2label[labels[i][j]])
            true_preds.append(id2label[preds[i][j]])

print(classification_report(true_labels, true_preds, digits=4))


              precision    recall  f1-score   support

       B-LOC     0.2275    0.4528    0.3028       106
       B-ORG     0.3774    0.1351    0.1990       148
       B-PER     0.8720    0.7786    0.8226       140
       I-LOC     0.4656    0.6985    0.5588       262
       I-ORG     0.0000    0.0000    0.0000       129
       I-PER     0.8207    0.5980    0.6919       199
           O     0.7622    0.8532    0.8051       477

    accuracy                         0.6064      1461
   macro avg     0.5036    0.5023    0.4829      1461
weighted avg     0.5824    0.6064    0.5783      1461





In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report

def get_error_analysis(dataset, predictions, labels, id2label, tokenizer, num_samples=10):
    preds = np.argmax(predictions, axis=2)
    errors = []
    for i, example in enumerate(dataset):
        input_ids = example["input_ids"]
        for j in range(len(labels[i])):
            if labels[i][j] != -100 and preds[i][j] != labels[i][j]:
                token = tokenizer.convert_ids_to_tokens([input_ids[j]])[0]
                errors.append({
                    "sentence_index": i,
                    "token": token,
                    "true_label": id2label[labels[i][j]],
                    "predicted_label": id2label[preds[i][j]]
                })
    df = pd.DataFrame(errors)
    return df.sample(min(num_samples, len(df)))

error_df = get_error_analysis(tokenized_dataset["validation"], predictions, labels, id2label, tokenizer)
print(error_df.head(10))

     sentence_index  token true_label predicted_label
153              23      ্      B-ORG           I-LOC
118              18    ▁জি      I-ORG           I-LOC
245              35    ▁বা      B-ORG           I-LOC
408              67  ▁তথ্য      B-LOC           I-LOC
278              38   ▁আইন      I-ORG           I-LOC
185              28    ীয়      B-LOC           I-LOC
234              33      ▁      I-ORG           I-LOC
29                2    ▁টা      I-LOC           I-PER
82               12   ▁অনু      I-PER           I-LOC
158              23      ং      I-ORG           I-LOC


In [None]:
hindi_dataset = load_dataset("wikiann", "hi")
hindi_dataset = hindi_dataset["train"]


For low-resource languages like Assamese, Named Entity Recognition (NER) suffers from limited labeled data.
To mitigate this, we explore a cross-lingual data augmentation strategy in which Hindi text from the WikiANN dataset is transliterated from Devanagari into the Bengali-Assamese script.
The augmented dataset is then used to further fine-tune the NER model trained above, with the aim of improving its ability to generalize to Assamese text.

In [None]:
# Note: hi_to_as_transliterate is defined in a later cell below; run that cell first.
hindi='नई दिल्ली में नया मेट्रो लाइन खुला'
ass=hi_to_as_transliterate(hindi)
ass

'নঈ দিল্লী মেং নযা মেট্রো লাইন খুলা'

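To transliterate Hindi text (Devanagari script) into the script used for Assamese, we use the `indic-transliteration` library.
It provides script conversion utilities based on character mappings between Indic scripts; here we convert from its `DEVANAGARI` scheme to its `BENGALI` scheme.
The result is an approximate Assamese-script rendering of the Hindi text rather than a translation: the words remain Hindi, and the output uses Bengali letter forms (for example র in মেট্রো above, where Assamese writes ৰ).
Although not perfect, it can capture useful linguistic similarities for model training.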
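
To see why a character-level script mapping is feasible at all: the Devanagari (U+0900 to U+097F) and Bengali (U+0980 to U+09FF) Unicode blocks are laid out largely in parallel, so many characters correspond at a fixed offset of 0x80. The sketch below applies that offset directly. It only illustrates the idea and is not how `indic-transliteration` is implemented; `offset_transliterate` and the chosen core range are assumptions made for this example.

```python
import unicodedata

# Core range of the Devanagari block where the parallel layout mostly holds
# (an assumption of this sketch; extended characters are left untouched).
CORE_START, CORE_END = 0x0901, 0x096F
BLOCK_OFFSET = 0x80  # distance between the Devanagari and Bengali blocks


def offset_transliterate(text):
    """Map Devanagari characters to the Bengali block by a fixed offset."""
    out = []
    for ch in text:
        code = ord(ch)
        if CORE_START <= code <= CORE_END:
            candidate = chr(code + BLOCK_OFFSET)
            # Some slots in the Bengali block are unassigned; in that case
            # the original character is kept.
            if unicodedata.name(candidate, ""):
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


print(offset_transliterate("नई दिल्ली"))
```

A dedicated transliteration library handles the exceptions that a fixed offset misses, which is one reason to prefer it over a hand-rolled mapping like this one.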


In [None]:
!pip install indic-transliteration

from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

def hi_to_as_transliterate(text):
    """
    Transliterate Devanagari Hindi text to Assamese script.
    This is a rough approximation for cross-lingual augmentation.
    """
    return transliterate(text, sanscript.DEVANAGARI, sanscript.BENGALI)


Collecting indic-transliteration
  Downloading indic_transliteration-2.3.75-py3-none-any.whl.metadata (1.4 kB)
Collecting backports.functools-lru-cache (from indic-transliteration)
  Downloading backports.functools_lru_cache-2.0.0-py2.py3-none-any.whl.metadata (3.5 kB)
Collecting roman (from indic-transliteration)
  Downloading roman-5.1-py3-none-any.whl.metadata (4.2 kB)
Downloading indic_transliteration-2.3.75-py3-none-any.whl (159 kB)
Downloading backports.functools_lru_cache-2.0.0-py2.py3-none-any.whl (6.7 kB)
Downloading roman-5.1-py3-none-any.whl (5.8 kB)
Installing collected packages: roman, backports.functools-lru-cache, indic-transliteration
Successfully installed backports.functools-lru-cache-2.0.0 indic-transliteration-2.3.75 roman-5.1


In [None]:
aug_tokens = []
aug_labels = []

for example in hindi_dataset:

    hindi_tokens = example["tokens"]
    hindi_labels = example["ner_tags"]


    as_tokens = [hi_to_as_transliterate(tok) for tok in hindi_tokens]

    aug_tokens.append(as_tokens)
    aug_labels.append(hindi_labels)


We combine:

- The Assamese training split (170 sentences after the earlier additions)
- The transliterated Hindi training split (5,000 sentences)

This yields a larger training corpus of 5,170 sentences that shares the same NER tag scheme, with the aim of improving the model's ability to generalize.
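
The combination step is only safe if every sentence keeps exactly one tag per token. Below is a small sketch of merging two corpora with that check built in. `merge_corpora` and the toy inputs are illustrative and not part of the notebook.

```python
def merge_corpora(tokens_a, labels_a, tokens_b, labels_b):
    """Concatenate two token/label corpora and validate their alignment."""
    tokens = list(tokens_a) + list(tokens_b)
    labels = list(labels_a) + list(labels_b)
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} sentences but {len(labels)} label sequences")
    for i, (toks, tags) in enumerate(zip(tokens, labels)):
        if len(toks) != len(tags):
            raise ValueError(f"sentence {i}: {len(toks)} tokens but {len(tags)} tags")
    return tokens, labels


# Toy inputs: one sentence from each corpus.
merged_tokens, merged_labels = merge_corpora(
    [["গুৱাহাটী", "চহৰ"]], [[5, 0]],
    [["দিল্লী"]], [[5]],
)
print(len(merged_tokens), len(merged_labels))
```

Failing early here is cheaper than discovering a misaligned sentence as an index error inside the tokenization step.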

In [None]:
from datasets import Dataset




combined_tokens = list(dataset["train"]["tokens"]) + aug_tokens
combined_labels = list(dataset["train"]["ner_tags"])+ aug_labels


augmented_dataset = Dataset.from_dict({"tokens": combined_tokens, "ner_tags": combined_labels})
print("Augmented dataset size:", len(augmented_dataset))


Augmented dataset size: 5170


In [None]:
#tokenized_aug_dataset = augmented_dataset.map(tokenize_and_align_labels, batched=True)


In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import Dataset

model_dir = "./ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

def tokenize_and_align_labels(examples):
    # Unlike the earlier version, this one pads to a fixed length and labels only
    # the first subword of each word; all other positions are set to -100.
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding="max_length",
        max_length=128
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special and padding tokens are ignored by the loss.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of a word carries the word's label.
                label_ids.append(label[word_idx] if word_idx < len(label) else -100)
            else:
                # Continuation subwords are ignored.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

#aug_dataset = Dataset.from_dict({"tokens": new_train_tokens, "ner_tags": new_train_labels})
tokenized_aug_dataset = augmented_dataset.map(tokenize_and_align_labels, batched=True)

tokenized_validation_dataset = dataset["validation"].map(tokenize_and_align_labels, batched=True)


training_args = TrainingArguments(
    output_dir="./ner_model_augmented",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs_aug",
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_aug_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
)

trainer.train()

trainer.save_model("./ner_model_augmented")


In [None]:
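# The rebuilt train split may no longer carry ClassLabel names, so read them
# from the untouched validation split instead.
label_list = dataset["validation"].features["ner_tags"].feature.names
id2label = {i: label for i, label in enumerate(label_list)}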


In [None]:
import numpy as np

predictions, labels, _ = trainer.predict(tokenized_validation_dataset)

preds = np.argmax(predictions, axis=2)




In [None]:
true_labels = []
true_preds = []

for i in range(len(labels)):
    for j in range(len(labels[i])):
        if labels[i][j] != -100:
            true_labels.append(id2label[labels[i][j]])
            true_preds.append(id2label[preds[i][j]])


In [None]:
from sklearn.metrics import classification_report
from seqeval.metrics import f1_score, precision_score, recall_score

print(classification_report(true_labels, true_preds, digits=4))

true_label_seq = [[id2label[l] for l in label if l != -100] for label in labels]
pred_label_seq = [[id2label[p] for (p, l) in zip(pred, label) if l != -100] for pred, label in zip(preds, labels)]

print("F1-score:", f1_score(true_label_seq, pred_label_seq))
print("Precision:", precision_score(true_label_seq, pred_label_seq))
print("Recall:", recall_score(true_label_seq, pred_label_seq))


              precision    recall  f1-score   support

       B-LOC     0.0000    0.0000    0.0000        44
       B-ORG     0.5918    0.8286    0.6905        35
       B-PER     0.5962    0.9688    0.7381        32
       I-LOC     0.7500    0.2069    0.3243        58
       I-ORG     0.6939    0.8870    0.7786       115
       I-PER     0.6304    0.7733    0.6946        75
           O     0.7944    0.8079    0.8011       177

    accuracy                         0.6996       536
   macro avg     0.5795    0.6389    0.5753       536
weighted avg     0.6548    0.6996    0.6530       536

F1-score: 0.41228070175438597
Precision: 0.4017094017094017
Recall: 0.42342342342342343




In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!cp -r ./ner_model_augmented /content/drive/MyDrive/


In [None]:
!cp -r ./ner_model /content/drive/MyDrive/


In [None]:
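# Collect every validation sentence that contains at least one wrong tag.
failed_examples = []

for i, sentence in enumerate(dataset["validation"]["tokens"]):
    label_seq = labels[i]
    pred_seq = preds[i]

    sentence_text = " ".join(sentence)
    mismatches = []

    # NOTE: position j indexes both the word list and the label/prediction rows,
    # which assumes those rows line up one-to-one with the words of the sentence.
    for j, word in enumerate(sentence):
        if label_seq[j] != -100 and label_seq[j] != pred_seq[j]:
            mismatches.append((word, id2label[label_seq[j]], id2label[pred_seq[j]]))

    if mismatches:
        failed_examples.append((sentence_text, mismatches))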
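The loop above uses the word position `j` to index `labels[i]` and `preds[i]`. If those rows are subword-aligned (special tokens and continuation pieces marked `-100`), word `j` and row position `j` need not coincide. The sketch below shows one way to map positions back to words via word indices, under assumptions: the `word_ids` list has the shape returned by a Hugging Face fast tokenizer's `word_ids(batch_index=i)`, and the helper `first_subword_pairs`, the `id2label` mapping, and the example tokenization are all hypothetical.

```python
def first_subword_pairs(word_ids, label_ids, pred_ids):
    """Return {word_index: (label_id, pred_id)} using each word's first labelled subword."""
    pairs = {}
    for pos, word_idx in enumerate(word_ids):
        # Skip special tokens (word index None) and positions ignored with -100.
        if word_idx is None or label_ids[pos] == -100:
            continue
        pairs.setdefault(word_idx, (label_ids[pos], pred_ids[pos]))
    return pairs


# Hypothetical example: the first word is split into two subword pieces.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
words = ["কাৰ্বি", "আংলং", "জিলা"]
word_ids = [None, 0, 0, 1, 2, None]      # assumed shape of tokenized.word_ids(batch_index=i)
label_ids = [-100, 5, -100, 6, 6, -100]  # gold tags, -100 on ignored positions
pred_ids = [0, 1, 1, 6, 6, 0]            # argmax output at every position

pairs = first_subword_pairs(word_ids, label_ids, pred_ids)
word_mismatches = [
    (words[w], id2label[gold], id2label[pred])
    for w, (gold, pred) in sorted(pairs.items())
    if gold != pred
]
print(word_mismatches)
```

With this mapping, each reported mismatch is attributed to the word whose first subword carried the gold label, regardless of how many pieces earlier words were split into.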


In [None]:
for ex in failed_examples[:10]:
    print("Sentence:", ex[0])
    print("Mismatches:")
    for token, true, pred in ex[1]:
        print(f"  Token: {token}  True: {true}  Pred: {pred}")
    print("-"*50)


Sentence: হাউ আই মে'ট ইয়'ৰ মাদাৰ
Mismatches:
  Token: আই  True: B-ORG  Pred: B-PER
  Token: ইয়'ৰ  True: I-ORG  Pred: I-PER
  Token: মাদাৰ  True: I-ORG  Pred: I-PER
--------------------------------------------------
Sentence: কাৰ্বি আংলং জিলা
Mismatches:
  Token: আংলং  True: B-LOC  Pred: B-PER
--------------------------------------------------
Sentence: ফ'কাছ এন.ই .
Mismatches:
  Token: এন.ই  True: B-ORG  Pred: B-PER
--------------------------------------------------
Sentence: আই চি চি বিশ্ব টুৱেন্টি২০
Mismatches:
  Token: চি  True: B-LOC  Pred: I-LOC
--------------------------------------------------
Sentence: কোকৰাঝাৰ জিলা , চিৰাং জিলা , ওদালগুৰি জিলা , বাক্সা জিলা
Mismatches:
  Token: জিলা  True: B-LOC  Pred: B-ORG
  Token: ওদালগুৰি  True: I-LOC  Pred: I-ORG
  Token: জিলা  True: B-LOC  Pred: B-ORG
--------------------------------------------------
Sentence: কাৰ্বি আংলং জিলা
Mismatches:
  Token: আংলং  True: B-LOC  Pred: B-PER
--------------------------------------------------
Senten

In [None]:
from collections import Counter

all_mismatches = []
for _, mismatches in failed_examples:
    all_mismatches.extend([true for _, true, _ in mismatches])

error_count = Counter(all_mismatches)
print("Errors per entity type:", error_count)


Errors per entity type: Counter({'B-LOC': 26, 'O': 11, 'B-ORG': 4, 'I-ORG': 4, 'I-LOC': 4})


In [None]:
from collections import Counter

# Count how often each gold tag occurs in the validation split.
# Keys are the integer tag IDs stored in `ner_tags`, not label names.
all_tokens = []
all_labels = []
for i, sentence in enumerate(dataset["validation"]["tokens"]):
    label_seq = dataset["validation"]["ner_tags"][i]
    for j, token in enumerate(sentence):
        all_tokens.append(token)
        all_labels.append(label_seq[j])

entity_counts = Counter(all_labels)
print("Entity counts in validation:", entity_counts)

# Count misclassified tokens by their gold label name.
error_counts = Counter([true for _, mismatches in failed_examples for _, true, _ in mismatches])
print("Misclassified entities:", error_counts)


Entity counts in validation: Counter({0: 177, 4: 115, 2: 75, 6: 58, 5: 44, 3: 35, 1: 32})
Misclassified entities: Counter({'B-LOC': 26, 'O': 11, 'B-ORG': 4, 'I-ORG': 4, 'I-LOC': 4})
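Raw error counts favour frequent labels, so it can help to divide them by each label's support. The sketch below takes the two counters printed above at face value and computes a per-label error rate. The `id2label` mapping is an assumption (the standard WikiANN tag order, which agrees with the support column of the classification report), and `error_rates` is an illustrative name.

```python
from collections import Counter

# Assumed tag order; check against the notebook's own id2label before relying on it.
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

# Values copied from the output above.
entity_counts = Counter({0: 177, 4: 115, 2: 75, 6: 58, 5: 44, 3: 35, 1: 32})
error_counts = Counter({"B-LOC": 26, "O": 11, "B-ORG": 4, "I-ORG": 4, "I-LOC": 4})

# Re-key the support by label name, then normalise the error counts.
support = {id2label[tag_id]: n for tag_id, n in entity_counts.items()}
error_rates = {label: error_counts[label] / n for label, n in support.items()}

for label, rate in sorted(error_rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:6s} errors={error_counts[label]:3d} support={support[label]:3d} rate={rate:.3f}")
```

Labels missing from `error_counts` get a rate of zero because a `Counter` returns 0 for unseen keys.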


In [None]:
multi_token_errors = []

for sentence, mismatches in failed_examples:
    for token, true, pred in mismatches:
        if true.startswith("I-") or pred.startswith("I-"):
            multi_token_errors.append((sentence, token, true, pred))

print(f"Number of multi-token entity errors: {len(multi_token_errors)}")
for ex in multi_token_errors[:5]:
    print(ex)


Number of multi-token entity errors: 17
("হাউ আই মে'ট ইয়'ৰ মাদাৰ", "ইয়'ৰ", 'I-ORG', 'I-PER')
("হাউ আই মে'ট ইয়'ৰ মাদাৰ", 'মাদাৰ', 'I-ORG', 'I-PER')
('আই চি চি বিশ্ব টুৱেন্টি২০', 'চি', 'B-LOC', 'I-LOC')
('কোকৰাঝাৰ জিলা , চিৰাং জিলা , ওদালগুৰি জিলা , বাক্সা জিলা', 'ওদালগুৰি', 'I-LOC', 'I-ORG')
('গোলাঘাট জিলা , মৰিগাঁও জিলা , নগাঁও জিলা , ডিফু', 'মৰিগাঁও', 'I-LOC', 'I-ORG')


In [None]:
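# Count how often each surface token is mislabelled.
# `failed_examples` only holds mismatches, so every entry counts as one error.
token_confusions = Counter()
for _, mismatches in failed_examples:
    for token, true, pred in mismatches:
        token_confusions[token] += 1

print(token_confusions.most_common(10))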


[('জিলা', 6), ('(', 4), ('চি', 3), (',', 3), ('আই', 2), ("ইয়'ৰ", 2), ('মাদাৰ', 2), ('আংলং', 2), ("''", 2), ('এন.ই', 1)]
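Counting by token shows which words are hard, but not which labels get confused with each other. A complementary view is to count `(true, pred)` pairs. The sketch below does this on the ten mismatches printed in the failed-example listing above, copied into a literal list so that it runs on its own. The names `pair_counts` and `type_confusions` are illustrative.

```python
from collections import Counter

# (token, true, pred) triples copied from the failed-example output shown earlier.
mismatches = [
    ("আই", "B-ORG", "B-PER"),
    ("ইয়'ৰ", "I-ORG", "I-PER"),
    ("মাদাৰ", "I-ORG", "I-PER"),
    ("আংলং", "B-LOC", "B-PER"),
    ("এন.ই", "B-ORG", "B-PER"),
    ("চি", "B-LOC", "I-LOC"),
    ("জিলা", "B-LOC", "B-ORG"),
    ("ওদালগুৰি", "I-LOC", "I-ORG"),
    ("জিলা", "B-LOC", "B-ORG"),
    ("আংলং", "B-LOC", "B-PER"),
]

# Confusions between full tags (B-/I- prefix included).
pair_counts = Counter((true, pred) for _, true, pred in mismatches)

# Confusions between entity types only (prefix stripped; "O" stays "O").
type_confusions = Counter(
    (true.split("-")[-1], pred.split("-")[-1]) for _, true, pred in mismatches
)

print(pair_counts.most_common())
print(type_confusions.most_common())
```

On the full `failed_examples` list, the same two counters could be built by iterating over every sentence's mismatches instead of the literal list.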


In [None]:
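# Flag mismatched tokens that contain at least one ASCII character.
# This also matches ASCII punctuation such as apostrophes, commas and brackets.
foreign_errors = [
    (token, true, pred)
    for _, mismatches in failed_examples
    for token, true, pred in mismatches
    if any(ord(c) < 128 for c in token)
]
print("Sample foreign/loanword errors:", foreign_errors[:10])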


Sample foreign/loanword errors: [("ইয়'ৰ", 'I-ORG', 'I-PER'), ('এন.ই', 'B-ORG', 'B-PER'), ("''", 'B-LOC', 'O'), (',', 'O', 'I-ORG'), ('(', 'B-LOC', 'B-ORG'), ('(', 'B-LOC', 'B-PER'), ("ইয়'ৰ", 'I-ORG', 'I-PER'), (',', 'B-LOC', 'B-PER'), ('maroccanus', 'O', 'B-ORG'), ('-উত্তৰ-পশ্চিম', 'O', 'I-LOC')]
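Several entries in the sample above are punctuation marks or Assamese words containing an ASCII apostrophe, because the `ord(c) < 128` test matches any ASCII character. One possible refinement, sketched here as an assumption rather than the notebook's method, separates tokens that contain Latin letters from tokens that are purely punctuation. The helper names `has_latin_letter` and `bucket` are hypothetical.

```python
import unicodedata


def has_latin_letter(token):
    """True if the token contains at least one ASCII letter (a-z or A-Z)."""
    return any(c.isascii() and c.isalpha() for c in token)


def bucket(token):
    """Classify a token as 'latin', 'punctuation', or 'other'."""
    if has_latin_letter(token):
        return "latin"
    # Unicode general categories starting with "P" are punctuation.
    if token and all(unicodedata.category(c).startswith("P") for c in token):
        return "punctuation"
    return "other"


# Tokens taken from the sample output above.
sample_tokens = ["ইয়'ৰ", "এন.ই", "''", ",", "(", "maroccanus", "-উত্তৰ-পশ্চিম"]
for token in sample_tokens:
    print(f"{token!r:20} -> {bucket(token)}")
```

Grouping errors by bucket keeps genuine Latin-script tokens such as `maroccanus` apart from punctuation errors, which likely have a different cause.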


### Conclusion

In this project, we explored **Named Entity Recognition (NER) for Assamese** using XLM-RoBERTa.

- The **first model**, trained only on the original Assamese dataset, achieved an **F1 score of 60**, a reasonable baseline for a low-resource language.
- With **cross-lingual data augmentation** through **Hindi → Assamese transliteration**, the **augmented model** reached an **F1 score of 69**, a gain of 9 points over the baseline.

**Key Takeaways:**
- Cross-lingual transliteration is an effective way to enrich low-resource datasets.
- Adding data from a linguistically similar language helps the model identify entities and reduces misclassifications.
- The approach can be extended to other low-resource languages and NLP tasks.

**Future Work:**
- Explore **back translation** and phonetic mapping for further augmentation.
- Experiment with other **multilingual transformers**, such as IndicBERT, for zero-shot NER.
- Incorporate additional Assamese datasets to further improve accuracy and robustness.
