<a href="https://colab.research.google.com/github/AnuarKenzhibekov/Samruk_finetuned-xlm-roberta_0.1/blob/main/Samruk_finetuned-xlm-roberta_0.1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -U transformers



## Local Inference on GPU
Model page: https://huggingface.co/google-bert/bert-base-multilingual-uncased

⚠️ If the generated code snippets do not work, please open an issue on either the [model repo](https://huggingface.co/google-bert/bert-base-multilingual-uncased)
			and/or on [huggingface.js](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) 🙏

In [2]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="google-bert/bert-base-multilingual-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0


In [3]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-multilingual-uncased")

Some weights of the model checkpoint at google-bert/bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encodings = tokenizer("Какова миссия Фонда?", truncation=True, padding="max_length", max_length=256)


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [5]:
import pandas as pd
from datasets import Dataset

train_df = pd.read_csv("train-02.csv")
test_df  = pd.read_csv("test-02.csv")

# HuggingFace Dataset
train_ds = Dataset.from_pandas(train_df)
test_ds  = Dataset.from_pandas(test_df)


In [6]:
labels = sorted(train_df["label"].unique())
label2id = {l:i for i,l in enumerate(labels)}
id2label = {i:l for l,i in label2id.items()}

def encode_batch(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)
    enc["labels"] = [label2id[l] for l in batch["label"]]
    return enc

train_ds = train_ds.map(encode_batch, batched=True)
test_ds  = test_ds.map(encode_batch, batched=True)

cols = ["input_ids", "attention_mask", "labels"]
train_ds.set_format(type="torch", columns=cols)
test_ds.set_format(type="torch", columns=cols)


Map:   0%|          | 0/6298 [00:00<?, ? examples/s]

Map:   0%|          | 0/700 [00:00<?, ? examples/s]

In [12]:
import transformers
import numpy as np
import random
import torch
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    transformers.set_seed(seed)

set_seed(42)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    save_total_limit=3,
)

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    y_pred = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mkenzhibekovanuar[0m ([33mkenzhibekovanuar-aitu[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
100,2.4901
200,2.2004
300,1.6181
400,1.3848
500,1.1227
600,1.0718
700,0.8864
800,0.7988
900,0.63
1000,0.6339


TrainOutput(global_step=1970, training_loss=0.8115138959158495, metrics={'train_runtime': 3695.181, 'train_samples_per_second': 17.044, 'train_steps_per_second': 0.533, 'total_flos': 8286111042969600.0, 'train_loss': 0.8115138959158495, 'epoch': 10.0})

In [13]:
trainer.evaluate(test_ds)


{'eval_loss': 0.7818223237991333,
 'eval_accuracy': 0.7742857142857142,
 'eval_macro_precision': 0.7783408709164671,
 'eval_macro_recall': 0.7720200132612849,
 'eval_macro_f1': 0.7716568519347592,
 'eval_runtime': 9.5247,
 'eval_samples_per_second': 73.493,
 'eval_steps_per_second': 9.239,
 'epoch': 10.0}

In [14]:
trainer.save_model("Samruk_finetuned-xlm-roberta_0.1")
tokenizer.save_pretrained("Samruk_finetuned-xlm-roberta_0.1")


('Samruk_finetuned-xlm-roberta_0.1/tokenizer_config.json',
 'Samruk_finetuned-xlm-roberta_0.1/special_tokens_map.json',
 'Samruk_finetuned-xlm-roberta_0.1/sentencepiece.bpe.model',
 'Samruk_finetuned-xlm-roberta_0.1/added_tokens.json',
 'Samruk_finetuned-xlm-roberta_0.1/tokenizer.json')

In [2]:
from transformers import pipeline

clf = pipeline("text-classification", model="Samruk_finetuned-xlm-roberta_0.1", tokenizer=tokenizer)

print(clf("Какова миссия Фонда?"))
print(clf("Қордың миссиясы қандай?"))
print(clf("What is the mission of the Fund?"))


NameError: name 'tokenizer' is not defined

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


OSError: Samruk_finetuned-xlm-roberta_0.1 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`