<a href="https://colab.research.google.com/github/AbdulxoliqMirzayev/NER_Model/blob/main/Uz_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

from transformers import AutoTokenizer

from tqdm import tqdm
import tensorflow as tf

import matplotlib.pyplot as plt

df = pd.read_json("hf://datasets/risqaliyevds/uzbek_ner/uzbek_ner.json")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [2]:
df.head()

Unnamed: 0,text,ner
0,Shvetsiya hukumati Stokholmdagi asosiy piyodal...,"{'GPE': ['Shvetsiya', 'O‘zbekiston', 'Shvetsiy..."
1,Turkiya prezidenti Rajab Toyyib Erdo‘g‘an AQSh...,"{'GPE': ['O‘zbekiston', 'Suriya', 'AQSh', 'Vas..."
2,Stokholm markazida yuk mashinasi orqali sodir ...,"{'LOC': ['Stokholm', 'Stokgolm'], 'GPE': ['O‘z..."
3,«Vest Hem» bosh murabbiyi Slaven Bilich o‘z va...,"{'GPE': ['O‘zbekiston', 'Angliya'], 'ORG': ['V..."
4,AQSh prezidenti Donald Trampning nabirasi - 5 ...,"{'PERSON': ['Donald Tramp', 'Ivanka Tramp', 'S..."


**Data Preprocessing**

We convert the dataset into the format we need and turn it into a list.

In [3]:
import ast

def convert_to_ner_format(df, text_col='text', ner_col='ner'):
    dataset = []

    for index in range(len(df)):
        text = df.iloc[index][text_col]
        ner = df.iloc[index][ner_col]

        # Annotations may arrive as a stringified dict; parse them without eval().
        if isinstance(ner, str):
            ner = ast.literal_eval(ner)

        sentence = []

        for word in text.split():
            found = False
            for tag, entities in ner.items():
                if word in entities and tag in ('LOC', 'PERSON', 'DATE', 'ORG', 'PRODUCT','PERCENT','TIME','LANGUAGE','GPE'):
                    sentence.append((word, f'B-{tag.upper()}'))
                    found = True
                    break

            if not found:
                sentence.append((word, 'O'))

        dataset.append(sentence)

    return dataset
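
The function above compares single whitespace-split words against the entity strings, so a multi-word entity such as 'Donald Tramp' never matches, and every hit receives a `B-` tag. Below is a minimal sketch of one alternative, assuming entities should be matched as token sequences with `I-` tags on continuation tokens. The helper name `bio_tag` and the example sentence are hypothetical and not part of the notebook.

```python
def bio_tag(text, ner):
    """Tag whitespace tokens with B-/I- labels by matching entity token sequences."""
    tokens = text.split()
    tags = ['O'] * len(tokens)

    for label, entities in ner.items():
        for entity in entities or []:
            parts = entity.split()
            if not parts:
                continue
            n = len(parts)
            for start in range(len(tokens) - n + 1):
                span = slice(start, start + n)
                # Only tag spans that match exactly and are still untagged.
                if tokens[span] == parts and all(t == 'O' for t in tags[span]):
                    tags[start] = f'B-{label}'
                    for k in range(start + 1, start + n):
                        tags[k] = f'I-{label}'

    return list(zip(tokens, tags))


# Made-up sentence and annotations, used only for illustration.
example = bio_tag(
    "AQSh prezidenti Donald Tramp Vashingtonga qaytdi",
    {'PERSON': ['Donald Tramp'], 'GPE': ['AQSh']},
)
print(example)
```

Exact matching still misses inflected forms (for instance a name with a case suffix) and tokens with attached punctuation. Adopting `I-` tags would also add new entries to the tag schema built later in the notebook.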

Using the convert_to_ner_format function above, we build the train, test, and validation sets.

In [4]:
df = df.sample(frac=1).reset_index(drop=True)

train_data = convert_to_ner_format(df.iloc[:18000])
test_data = convert_to_ner_format(df.iloc[18000:19000])
valid_data = convert_to_ner_format(df.iloc[19000:])

We count the tags in the dataset to see how often each one occurs.

In [5]:
from collections import Counter

def count_tags(dataset):
    tags = []
    for sentence in dataset:
        for _, tag in sentence:
            tags.append(tag)

    # Count the tags
    tag_counts = Counter(tags)
    return tag_counts

print(count_tags(train_data))

Counter({'O': 1522977, 'B-GPE': 28781, 'B-LOC': 10693, 'B-ORG': 5805, 'B-PERSON': 3383, 'B-PRODUCT': 850, 'B-DATE': 413, 'B-TIME': 266, 'B-PERCENT': 55})
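
The counts show a heavy class imbalance. The short calculation below, using the values copied from the output above, gives the share of `O` tags. This can be read as a rough word-level baseline: the accuracy reached by predicting `O` for every word.

```python
from collections import Counter

# Tag counts copied from the output above.
tag_counts = Counter({
    'O': 1522977, 'B-GPE': 28781, 'B-LOC': 10693, 'B-ORG': 5805,
    'B-PERSON': 3383, 'B-PRODUCT': 850, 'B-DATE': 413, 'B-TIME': 266,
    'B-PERCENT': 55,
})

total = sum(tag_counts.values())
o_share = tag_counts['O'] / total

print(f"total words: {total}")
print(f"share of 'O': {o_share:.4f}")
```

Roughly 96.8% of the words carry the `O` tag, so token-level accuracy figures need to be compared against that number rather than against zero.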


In [6]:
print(train_data[5][:15])

[('Pyongchangdagi', 'O'), ('Olimpiadada', 'O'), ('norovirus', 'O'), ('infeksiyasidan', 'O'), ('kasallanganlar', 'O'), ('soni', 'O'), ('ortib', 'O'), ('bormoqda.', 'O'), ('Ayni', 'O'), ('vaqtda', 'O'), ('bu', 'O'), ('yuqumli', 'O'), ('virusdan', 'O'), ('zararlanganlar', 'O'), ('soni', 'O')]


We store the tags as a schema.
samples: combines the training and test datasets into a single collection.
schema: collects every unique tag from that collection, sorts the tags alphabetically, and prepends a special marker ('_'). The marker takes index 0, the same value used to pad the label arrays later on. The list is needed later when training the model and analyzing its output.

In [7]:
samples = train_data + test_data
schema = ['_'] + sorted({tag for sentence in samples  for _, tag in sentence})

In [8]:
print(schema)

['_', 'B-DATE', 'B-GPE', 'B-LOC', 'B-ORG', 'B-PERCENT', 'B-PERSON', 'B-PRODUCT', 'B-TIME', 'O']
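
Because the model's output indices follow the order of `schema`, it helps to keep explicit index-to-tag mappings. The sketch below builds them from the printed schema; `id2label`, `label2id`, and the helper `decode_label` are illustrative names. When a configuration carries no label names, transformers reports predictions as `LABEL_<index>`, and such strings can be mapped back through the schema.

```python
# Schema copied from the output above.
schema = ['_', 'B-DATE', 'B-GPE', 'B-LOC', 'B-ORG', 'B-PERCENT',
          'B-PERSON', 'B-PRODUCT', 'B-TIME', 'O']

id2label = dict(enumerate(schema))
label2id = {tag: i for i, tag in id2label.items()}

def decode_label(name):
    """Map a default 'LABEL_<index>' name back to the schema tag."""
    return id2label[int(name.rsplit('_', 1)[1])]

print(label2id['B-GPE'])
print(decode_label('LABEL_9'))
```

The same dictionaries can be passed as `id2label` and `label2id` to `AutoConfig.from_pretrained`, so that a saved model reports tag names directly.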


**BASE MODEL FACEBOOK**

We fine-tune Facebook's xlm-roberta-base.

In [9]:
from transformers import AutoConfig, TFAutoModelForTokenClassification

MODEL_NAME = 'FacebookAI/xlm-roberta-base'

config = AutoConfig.from_pretrained(MODEL_NAME, num_labels=len(schema))
model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME, config=config)

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFXLMRobertaForTokenClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Tokenizer and Preprocess**

I built a tokenizer for Uzbek using the Common Voice 17.0 dataset. It covers the train and validated splits, and the tokenizer has been uploaded to Hugging Face.

In [10]:
from transformers import AutoTokenizer
from tqdm import tqdm
import numpy as np

# Load the tokenizer from the previously trained "new_uz_tokenizer" repository
tokenizer = AutoTokenizer.from_pretrained("AbdulxoliqMirzaev/new_uz_tokenizer")

def tokenize_sample(sample):
    """
    Tokenize the (token, tag) pairs of a single sample.
    Each token is split into subtokens by the tokenizer, and every subtoken
    keeps the tag of the token it came from.
    """
    seq = [
        (subtoken, tag)
        for token, tag in sample
        for subtoken in tokenizer(token)['input_ids'][1:-1]
    ]
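    # 3 and 4 are hard-coded as the sequence start/end special-token IDs
    # (assumed to match this tokenizer's special tokens).
    return [(3, 'O')] + seq + [(4, 'O')]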

def preprocess(samples):
    """
    Tokenize the samples and pack them into padded numpy arrays (X, y).
    Steps:
      1. Assign an index to each tag based on schema (the tag list).
      2. Tokenize each sample with tokenize_sample.
      3. Find the longest sequence and zero-pad all sequences to that length.
      4. Build X (token IDs) and y (tag indices) for every token and tag.
    """

    tag_index = {tag: i for i, tag in enumerate(schema)}

    # Tokenize every sample with tokenize_sample and collect the results in a list.
    tokenized_samples = list(tqdm(map(tokenize_sample, samples)))

    # Find the length of the longest tokenized sequence
    max_len = max(map(len, tokenized_samples))

    # X holds one row of token IDs per sample, padded with 0
    X = np.zeros((len(samples), max_len), dtype=np.int32)
    # y holds one row of tag indices per sample, padded with 0
    y = np.zeros((len(samples), max_len), dtype=np.int32)

    for i, sentence in enumerate(tokenized_samples):
        for j, (subtoken_id, tag) in enumerate(sentence):
            X[i, j] = subtoken_id
            y[i, j] = tag_index[tag]

    # Return the arrays (X and y) built from the tokenized samples
    return X, y

# Preprocess the train, test, and validation datasets
X_train, y_train = preprocess(train_data)
X_test, y_test = preprocess(test_data)
X_valid, y_valid = preprocess(valid_data)


tokenizer_config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/320k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

18000it [01:21, 220.32it/s]
1000it [00:03, 273.83it/s]
609it [00:02, 273.43it/s]
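
To make the array layout concrete, here is a toy version of the padding step in plain Python. The subtoken IDs are made up and do not come from the real tokenizer, and `pad_batch` is a hypothetical helper. The sketch also derives the attention mask that follows from treating ID 0 as padding.

```python
def pad_batch(tokenized, tag_index, pad_id=0):
    """Pad (subtoken_id, tag) sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in tokenized)
    X, y = [], []
    for seq in tokenized:
        ids = [sub_id for sub_id, _ in seq]
        tags = [tag_index[tag] for _, tag in seq]
        pad = max_len - len(seq)
        X.append(ids + [pad_id] * pad)
        # Padded positions get index 0, i.e. the '_' placeholder tag.
        y.append(tags + [0] * pad)
    return X, y


tag_index = {'_': 0, 'B-GPE': 1, 'O': 2}   # shortened schema for the example
tokenized = [
    # One word split into two subtokens (17, 18); both inherit its tag.
    [(3, 'O'), (17, 'B-GPE'), (18, 'B-GPE'), (4, 'O')],
    [(3, 'O'), (25, 'O'), (4, 'O')],
]

X, y = pad_batch(tokenized, tag_index)
mask = [[int(tok != 0) for tok in row] for row in X]

print(X)
print(y)
print(mask)
```

The second row ends in a padded position: token ID 0, tag index 0, and mask value 0.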


In [11]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.000001)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

EPOCHS = 1
BATCH_SIZE = 4

train_encodings = {
    'input_ids': X_train,
    'attention_mask': (X_train != 0).astype(int)
}


val_encodings = {
    'input_ids': X_test,
    'attention_mask': (X_test != 0).astype(int)
}

train_dataset = tf.data.Dataset.from_tensor_slices((train_encodings, y_train)).batch(BATCH_SIZE)
val_dataset = tf.data.Dataset.from_tensor_slices((val_encodings, y_test)).batch(BATCH_SIZE)

# Train the model
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=EPOCHS
)



**Metrics (Precision, Recall, and F1-Score)**

In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score

def compute_metrics(y_true, y_pred):
    precision = precision_score(np.concatenate(y_true), np.concatenate(y_pred), average='micro')
    recall = recall_score(np.concatenate(y_true), np.concatenate(y_pred), average='micro')
    f1 = f1_score(np.concatenate(y_true), np.concatenate(y_pred), average='micro')
    return {'precision': precision, 'recall': recall, 'f1': f1}

In [17]:
y_true = []
y_pred = []

for batch in val_dataset:
    inputs = batch[0]
    labels = batch[1]

    logits = model(**inputs).logits

    predictions = tf.argmax(logits, axis=-1).numpy()

    y_true.append(labels.numpy())
    y_pred.append(predictions)

y_true = np.concatenate(y_true, axis=0)
y_pred = np.concatenate(y_pred, axis=0)

# Compute the metrics
metrics = compute_metrics(y_true, y_pred)

# Print the results
print(f"Precision: {metrics['precision']}")
print(f"Recall: {metrics['recall']}")
print(f"F1-Score: {metrics['f1']}")

Precision: 0.9813157894736843
Recall: 0.9813157894736843
F1-Score: 0.9813157894736843
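
With one predicted label per position, micro-averaged precision, recall, and F1 all equal plain accuracy, which is why the three numbers above are identical. The padding index and the `O` tag are counted as well. Below is a sketch of a per-tag breakdown that skips those two labels, written in plain Python. The function name `per_tag_scores` and the toy labels are assumptions for illustration.

```python
def per_tag_scores(y_true, y_pred, ignore=(0, 9)):
    """Precision, recall and F1 per label index, skipping ignored labels."""
    labels = sorted((set(y_true) | set(y_pred)) - set(ignore))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return scores


# Toy flattened labels: 0 = padding, 2 = B-GPE, 9 = O (indices as in schema).
y_true_toy = [9, 2, 9, 9, 2, 0, 0]
y_pred_toy = [9, 9, 9, 9, 2, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
scores = per_tag_scores(y_true_toy, y_pred_toy)

print(accuracy)
print(scores)
```

In the toy data, accuracy is 6/7 even though half of the `B-GPE` tokens are missed. For the notebook's arrays, the inputs could be flattened first, for example with `y_true.ravel().tolist()` and `y_pred.ravel().tolist()`.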


**Huggingface upload**

In [29]:
from huggingface_hub import create_repo, login, upload_folder
login(token='your token')

# Create the repository:
create_repo(repo_id='AbdulxoliqMirzaev/roberta-ner-uz', exist_ok=True, repo_type="model")
model.save_pretrained('./roberta-ner-uz')
tokenizer.save_pretrained('./roberta-ner-uz')

upload_folder(
    repo_id='AbdulxoliqMirzaev/roberta-ner-uz',
    folder_path='./roberta-ner-uz',
    path_in_repo='',
    repo_type="model"
)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/AbdulxoliqMirzaev/roberta-ner-uz/commit/8fac43a386fb9c7264b357b75a4a09ff5c41a563', commit_message='Upload folder using huggingface_hub', commit_description='', oid='8fac43a386fb9c7264b357b75a4a09ff5c41a563', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AbdulxoliqMirzaev/roberta-ner-uz', endpoint='https://huggingface.co', repo_type='model', repo_id='AbdulxoliqMirzaev/roberta-ner-uz'), pr_revision=None, pr_num=None)

**Testing NER**

In [30]:
from transformers import pipeline
ner_pipeline = pipeline('ner', model='AbdulxoliqMirzaev/roberta-ner-uz', tokenizer='AbdulxoliqMirzaev/roberta-ner-uz')

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

tf_model.h5:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Some layers from the model checkpoint at AbdulxoliqMirzaev/roberta-ner-uz were not used when initializing TFXLMRobertaForTokenClassification: ['dropout_37']
- This IS expected if you are initializing TFXLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFXLMRobertaForTokenClassification were initialized from the model checkpoint at AbdulxoliqMirzaev/roberta-ner-uz.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForTokenClassification for predictions without further training.


tokenizer_config.json:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/320k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

Device set to use 0


In [33]:
text = "Men hozirda Toshkent shaxrida istiqomad qilaman. Men kelajakda Yevropa mamlakatlariga sayohat qilish rajam bor."

entities = ner_pipeline(text)

In [34]:
for entity in entities:
    print(entity)

{'entity': 'LABEL_9', 'score': 0.46702513, 'index': 1, 'word': 'Men', 'start': 0, 'end': 3}
{'entity': 'LABEL_9', 'score': 0.78394884, 'index': 2, 'word': 'hozirda', 'start': 4, 'end': 11}
{'entity': 'LABEL_9', 'score': 0.8928846, 'index': 3, 'word': 'Toshkent', 'start': 12, 'end': 20}
{'entity': 'LABEL_9', 'score': 0.916321, 'index': 4, 'word': 'shaxrida', 'start': 21, 'end': 29}
{'entity': 'LABEL_9', 'score': 0.9237888, 'index': 5, 'word': 'istiqomad', 'start': 30, 'end': 39}
{'entity': 'LABEL_9', 'score': 0.9298442, 'index': 6, 'word': 'qilaman.', 'start': 40, 'end': 48}
{'entity': 'LABEL_9', 'score': 0.9305881, 'index': 7, 'word': 'Men', 'start': 49, 'end': 52}
{'entity': 'LABEL_9', 'score': 0.9335007, 'index': 8, 'word': 'kelajakda', 'start': 53, 'end': 62}
{'entity': 'LABEL_9', 'score': 0.93511397, 'index': 9, 'word': 'Yevropa', 'start': 63, 'end': 70}
{'entity': 'LABEL_9', 'score': 0.93947625, 'index': 10, 'word': 'mamlakatlariga', 'start': 71, 'end': 85}
{'entity': 'LABEL_9', '