# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English, through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is actually highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus or raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targetting.

Bob has found out that at this time his team has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already has support for, so you can start by checking how well this solution does and modify it to obtain better results. You should not waste any efforts on trying to decrypt the data - this will not help you build a better classifier and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours using an L4 GPU as the compute resources of the company are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [None]:
import pandas as pd

In [1]:
from google.colab import userdata

read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

TimeoutException: Requesting secret hf_read timed out. Secrets can only be fetched when running from the Colab UI.

### Dependencies

In [3]:
import importlib
import torch, transformers

if '2.3.0' not in torch.__version__:
  !pip install torch==2.3.0
if transformers.__version__!='4.41.2':
  !pip install transformers==4.41.2

if importlib.util.find_spec('datasets') is None:
  !pip install datasets==2.18.0s
  !pip install evaluate==0.4.2
  !pip install accelerate -U


Collecting torch==2.3.0
  Downloading torch-2.3.0-cp311-cp311-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.0)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.0)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.0)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.0)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.3.0)
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylin

If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

# Data

In [1]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token='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')
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token='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')

README.md:   0%|          | 0.00/397 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/126k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/19.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1524 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/218 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/281 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/90.6M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/611245 [00:00<?, ? examples/s]

In [66]:
pip install fasttext


Note: you may need to restart the kernel to use updated packages.


In [98]:
from sklearn.model_selection import train_test_split
import pandas as pd
X = pd.DataFrame(classification_dataset['train'])

X_train,X_test = train_test_split(X,test_size=0.2,random_state =42)
X_test = X_test.reset_index(drop=True)

In [90]:
import pandas as pd
import fasttext
import os

def save_fasttext_format(df, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for text, label in zip(df['text'], df['label']):
            f.write(f"__label__{label} {text.strip()}\n")

save_fasttext_format(pd.DataFrame(X_train), 'train.txt')
model = fasttext.train_supervised('train.txt', epoch=1200, lr=0.002, verbose=2)

model.save_model("model_fasttext.bin")


Read 0M words
Number of words:  3321
Number of labels: 5
Progress: 100.0% words/sec/thread:  315592 lr: -0.000000 avg.loss:  0.668558 ETA:   0h 0m 0s 0.000000 avg.loss:  0.668558 ETA:   0h 0m 0s


In [None]:
print(123)

In [117]:
y_pred = list(model.predict(X_test.loc[7,'text'],k=5))
local_preds = {}
for j in range(5):
    local_preds[y_pred[0][j]] = float(y_pred[1][j])
list(model.predict(X_test.loc[7,'text'],k=5))[0][0]
local_preds

{'__label__4': 0.9859292507171631,
 '__label__1': 0.010640477761626244,
 '__label__0': 0.0023608056362718344,
 '__label__2': 0.0006791295018047094,
 '__label__3': 0.00044029904529452324}

In [120]:
X_test = pd.DataFrame(classification_dataset['dev'])

In [131]:
from sklearn.metrics import accuracy_score, f1_score
X_test['preds'] = ''
preds = []
for i in range(X_test.shape[0]):
    y_pred = list(model.predict(X_test.loc[i,'text'],k=5))
    local_preds = {}
    for j in range(5):
        local_preds[y_pred[0][j]] = float(y_pred[1][j])
    preds.append(local_preds)
    # label = model.predict(X_test.loc[i,'text'],k=5)
    # X_test.loc[i,'preds'] = label


cols = [ '__label__'+i for i in ['0','1','2','3','4']]
pd.DataFrame(preds)[cols].to_numpy()

# X_test['preds'] = X_test.preds.apply(lambda x: str(x)[-4]).astype(int)
# f1_score(X_test['preds'],X_test['label'],average='macro')

array([[1.94499653e-03, 7.58887781e-03, 9.85762298e-01, 8.12053448e-04,
        3.94175202e-03],
       [6.52575731e-01, 1.03909001e-01, 3.70552130e-02, 7.16722533e-02,
        1.34837776e-01],
       [7.51080085e-03, 4.65761200e-02, 9.42970991e-01, 2.32021301e-03,
        6.71822287e-04],
       ...,
       [3.72130685e-02, 5.48836924e-02, 8.86468496e-03, 2.24635024e-02,
        8.76625061e-01],
       [3.67137072e-05, 9.97336974e-05, 9.99725640e-01, 9.62707200e-05,
        9.15986675e-05],
       [1.65863484e-01, 1.37453852e-02, 1.30314042e-03, 5.14733046e-02,
        7.67664790e-01]])

In [30]:
load_dataset(X_test)

TypeError: expected str, bytes or os.PathLike object, not DataFrame

In [6]:
# X_test = X_test.reset_index(drop=True)

In [92]:
X_test.preds.unique()

array([('__label__3',), ('__label__4',), ('__label__2',), ('__label__1',),
       ('__label__0',)], dtype=object)

In [41]:
X_test = X_test.reset_index(drop=True)

In [65]:
dop_data  = pd.DataFrame(pseudo_label)
dop_data.label.value_counts()

label
2    22585
3       88
0       40
1       18
4        5
Name: count, dtype: int64

In [71]:
pd.DataFrame(X_train)

Unnamed: 0,text,label
0,चढ𑀢𑀟 𑀣च णच 𑀳च ढच 𑀠न 𑀘च𑀟ण𑁦 पचललच𑀲𑀢𑀟 𑀲𑁦पन𑀪 ढच च ...,0
1,𑀙तनपच𑀪 लच𑀞च ढच पच 𑀫च𑀟च 𑀟𑀢 त𑀢𑀠𑀠च ढन𑀪𑀢𑀟च 𑀟च 𑀤च𑀠च...,4
2,लच𑀲𑀢णच 𑀤𑀢𑀟च𑀪𑀢णच𑀕 𑀱चण𑁦𑀱च त𑁦 पच 𑀳च 𑀠चपच 𑀳न𑀞च 𑀲𑀢 ...,4
3,त𑀫च𑀠ध𑀢𑁣𑀟𑀳 ल𑁦चबन𑁦𑀕 𑀤च च 𑀪चढच 𑀘च𑀣च𑀱चल𑀢𑀟 𑀤चबचण𑁦𑀟 ...,2
4,च𑀟च ढ𑀢𑀟त𑀢𑀞𑁦𑀟 𑀪𑁣𑀟चल𑀣𑁣 𑀳चढ𑁣𑀣च 𑀲च𑀳च 𑀱चणच𑀪 ञच𑀟 𑀞चलल𑁣,2
...,...,...
1214,𑀳𑀫𑀢𑀟 च𑀟 झच𑀪च 𑀞न𑀣𑀢𑀟 𑀲𑀢प𑁣𑀟 𑀳𑀫𑀢बच𑀪 𑀣च 𑀠𑁣प𑁣त𑀢 च 𑀟च...,0
1215,ढच𑀪त𑁦ल𑁣𑀟च 𑀤च पच 𑀳च𑀞𑁦 𑀣चन𑀞च𑀪 𑀣च𑀟 ढचणच𑀟 च𑀪𑀳𑁦𑀟चल ...,2
1216,लच𑀲𑀢णच 𑀤𑀢𑀟च𑀪𑀢णच𑀕 च𑀠𑀲च𑀟𑀢𑀟 ढच𑀪𑀢𑀟 पच𑀤च𑀪च प𑀳च𑀞च𑀟𑀢𑀟...,4
1217,𑀙ढ𑀢ल𑀢त𑀢𑀟 णच 𑀳च 𑀟च 𑀘𑀢 𑀞च𑀠च𑀪 च𑀟 𑀘𑁦𑀲च 𑀟𑀢 च ब𑀢𑀣च𑀟 ...,4


In [133]:
from sklearn.model_selection import train_test_split
import pandas as pd
X = pd.DataFrame(classification_dataset['train'])[['text']]
y = pd.DataFrame(classification_dataset['train'])[['label']]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state =42)

In [54]:
X_train = pd.concat([dop_data[['text']],X_train])
y_train = pd.concat([dop_data[['label']],y_train])
X_train.shape

(1325, 1)

In [134]:
from catboost import CatBoostClassifier



ctb = CatBoostClassifier(verbose = 50, iterations = 1500,task_type = 'GPU')
ctb.fit(X_train,y_train,eval_set=(X_test,y_test),text_features=['text'])



Learning rate set to 0.08986
0:	learn: 1.4502622	test: 1.4303001	best: 1.4303001 (0)	total: 237ms	remaining: 5m 54s
50:	learn: 0.6950814	test: 0.6372347	best: 0.6370243 (49)	total: 818ms	remaining: 23.3s
100:	learn: 0.6485179	test: 0.6121753	best: 0.6121753 (100)	total: 1.41s	remaining: 19.6s
150:	learn: 0.6180779	test: 0.6034310	best: 0.6033295 (146)	total: 1.95s	remaining: 17.4s
200:	learn: 0.5924592	test: 0.6007748	best: 0.6005537 (194)	total: 2.48s	remaining: 16s
250:	learn: 0.5691364	test: 0.5961830	best: 0.5961830 (250)	total: 3s	remaining: 14.9s
300:	learn: 0.5488409	test: 0.5949463	best: 0.5941170 (284)	total: 3.53s	remaining: 14.1s
350:	learn: 0.5307846	test: 0.5921930	best: 0.5917243 (343)	total: 4.07s	remaining: 13.3s
400:	learn: 0.5135891	test: 0.5911638	best: 0.5906565 (390)	total: 4.6s	remaining: 12.6s
450:	learn: 0.4972318	test: 0.5906079	best: 0.5900306 (446)	total: 5.13s	remaining: 11.9s
500:	learn: 0.4801563	test: 0.5904008	best: 0.5900306 (446)	total: 5.66s	remaining

<catboost.core.CatBoostClassifier at 0x7d36a2216e10>

In [138]:
# test_data = 
ctb.predict_proba(pd.DataFrame(classification_dataset['dev'])[['text']])

array([[0.00282817, 0.0031624 , 0.98993893, 0.00243975, 0.00163075],
       [0.67495886, 0.01782564, 0.00188756, 0.03783274, 0.26749519],
       [0.00343716, 0.00412888, 0.98780808, 0.00295632, 0.00166956],
       ...,
       [0.03589574, 0.22668599, 0.00527421, 0.02227044, 0.70987362],
       [0.00293864, 0.00269879, 0.98961555, 0.00304553, 0.00170149],
       [0.05953949, 0.01299258, 0.00217284, 0.04407551, 0.88121959]])

In [None]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

ctb_params = {
    'iterations': 1500,
    'verbose': 50,
    'task_type': 'GPU',
    'eval_metric': 'Accuracy'
}

fold_scores = []
predictions = np.zeros(len(X))

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f'\nFold {fold + 1}/{n_splits}')
    
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    model = CatBoostClassifier(**ctb_params)
    model.fit(
        X_train, y_train,
        eval_set=(X_val, y_val),
        text_features=['text']
    )
    
    val_preds = model.predict(X_val)
    fold_score = accuracy_score(y_val, val_preds)
    fold_scores.append(fold_score)
    predictions[val_idx] = val_preds
    
    print(f'Fold {fold + 1} Accuracy: {fold_score:.4f}')


In [None]:
https://www.kaggle.com/code/qacenn/dz-na-meznar
#metrics

In [None]:
https://www.kaggle.com/code/qacenn/dzzmeshnar

In [None]:
https://www.kaggle.com/code/qacenn/dz-meznar-abc

In [None]:
https://www.kaggle.com/code/qacenn/dz-mashnaric-gg

In [None]:
https://www.kaggle.com/code/qacenn/notebook6625fb602a
#sega

In [None]:
https://www.kaggle.com/code/qacenn/notebook4278f7d042

In [None]:
https://www.kaggle.com/code/qacenn/notebook64fc3354fe
#clean text

In [None]:
import optuna
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

def objective(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 500, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-8, 10.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 7, 64),
        'grow_policy': trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        'max_depth': trial.suggest_int('max_depth', 4, 12),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'random_strength': trial.suggest_float('random_strength', 1e-9, 10.0, log=True),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'task_type': 'GPU',
        'verbose': False,
        'eval_metric': 'Accuracy'
    }
    
    model = CatBoostClassifier(**params)
    
    # Используем кросс-валидацию
    scores = cross_val_score(
        model, 
        X_train, 
        y_train, 
        cv=3, 
        scoring='accuracy',
        n_jobs=-1
    )
    
    return scores.mean()

# Оптимизация
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, timeout=3600)

# Лучшие параметры
print("Best trial:")
trial = study.best_trial
print(f"  Accuracy: {trial.value:.4f}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

# Обучение финальной модели
best_params = trial.params
best_params.update({
    'task_type': 'GPU',
    'verbose': True,
    'eval_metric': 'Accuracy'
})

final_model = CatBoostClassifier(**best_params)
final_model.fit(X_train, y_train, eval_set=(X_test, y_test), text_features=['text'])

# Оценка
test_pred = final_model.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, test_pred):.4f}")

In [None]:
import optuna
from catboost import CatBoostRegressor, Pool
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

#
X = df.drop(col_to_drop, axis=1)
for i in categorial_features:
    X[i] = X[i].astype('category')
y = df['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=323)
train_pool = Pool(data=X_train, label=y_train, cat_features=categorial_features)
eval_pool = Pool(data=X_test, label=y_test, cat_features=categorial_features)


def objective(trial):
    params = {
        'iterations': 2500,
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-4, 0.1),
        'depth': trial.suggest_int('depth', 4, 10),
        'random_seed': 23634,
        'task_type': 'GPU',
        'early_stopping_rounds': 25,
        'use_best_model': True,
        'leaf_estimation_method': trial.suggest_categorical('leaf_estimation_method', ['Newton', 'Gradient']),  
        'l2_leaf_reg': trial.suggest_loguniform('l2_leaf_reg', 1e-4, 10.0),   
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 20),  
        'border_count': trial.suggest_int('border_count', 32, 254),     
        'grow_policy': trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),    
        'random_strength': trial.suggest_uniform('random_strength', 0, 1),     
        'bagging_temperature': trial.suggest_uniform('bagging_temperature', 0.0, 1.0),
        'leaf_estimation_iterations': trial.suggest_int('leaf_estimation_iterations', 1, 20)
    }

    model = CatBoostRegressor(**params)
    model.fit(train_pool, eval_set=eval_pool, verbose=0)
    
    y_pred_val = model.predict(X_test)
    rmse_val = sqrt(mean_squared_error(y_test, y_pred_val))
    return rmse_val


study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=5)


best_params = study.best_params

final_model = CatBoostRegressor(**best_params, task_type="GPU", random_seed=23634)
final_model.fit(train_pool, eval_set=eval_pool, verbose=100)

y_pred_test = final_model.predict(X_test)
rmse_test = sqrt(mean_squared_error(y_test, y_pred_test))
print(f"Финальный RMSE: {rmse_test}")

In [30]:
from sklearn.metrics import accuracy_score,f1_score
accuracy_score(y_test,ctb.predict(X_test)),f1_score(y_test,ctb.predict(X_test),average='macro')

(0.7901639344262295, 0.7720873390492875)

In [6]:
!pip install evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==20

# Baseline

In [2]:
with open ('new_lang_text.txt','w') as f:
    for i in raw_text['train'][:]['text']:
        f.write(i+'\n')

In [5]:
# load the pre-trained tokenizer and use it to process the data

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

Map:   0%|          | 0/1524 [00:00<?, ? examples/s]

Map:   0%|          | 0/218 [00:00<?, ? examples/s]

In [6]:
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=True,
)
tokenizer.train(
    files=['new_lang_text.txt'],
    vocab_size=30000,
    min_frequency=2,
    limit_alphabet=1000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)






In [8]:
import os
# os.mkdir('tokenizer_output')
tokenizer.save_model("tokenizer_output/")

['tokenizer_output/vocab.txt']

In [9]:
from transformers import (
    BertConfig,
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("tokenizer_output/") 

config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
    type_vocab_size=2,
)

model = BertForMaskedLM.from_pretrained("google-bert/bert-base-multilingual-uncased")

dataset = load_dataset("text", data_files={"train": "/kaggle/working/new_lang_text.txt"})

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of the model checkpoint at google-bert/bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/611245 [00:00<?, ? examples/s]

In [17]:
training_args = TrainingArguments(
    output_dir="./bert-from-scratch",
    overwrite_output_dir=True,
    num_train_epochs=0.5,
    per_device_train_batch_size=32,
    # train_batch_size = 24,
    save_steps=1000,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=20,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)


trainer.train()


model.save_pretrained("./bert-from-scratch")
tokenizer.save_pretrained("./bert-from-scratch")



Step,Training Loss
20,7.2901
40,7.5437
60,7.3478
80,7.3064
100,7.1798
120,7.0723
140,7.0872
160,6.9078
180,6.8711
200,6.9003




KeyboardInterrupt: 

In [39]:
trainer.train()

ValueError: The model did not return a loss from the inputs, only the following keys: prediction_logits,seq_relationship_logits. For reference, the inputs it received are input_ids,token_type_ids,attention_mask.

In [46]:
tokenizer(['agi is good','agi is not good'])

{'input_ids': [[101, 13353, 10116, 10127, 12050, 102], [101, 13353, 10116, 10127, 10497, 12050, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}

In [28]:
from sklearn.model_selection import KFold
import pandas as pd
kf = KFold(n_splits=5)
X = pd.DataFrame(classification_dataset['train'])

datasets = []
for i, (train_index, test_index) in enumerate(kf.split(X)):
    ds = DatasetDict()
    ds['train'] = Dataset.from_pandas(X.iloc[train_index])
    ds['val'] = Dataset.from_pandas(X.iloc[test_index])
    datasets.append(ds)

In [29]:
datasets

[DatasetDict({
     train: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 1219
     })
     val: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 305
     })
 }),
 DatasetDict({
     train: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 1219
     })
     val: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 305
     })
 }),
 DatasetDict({
     train: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 1219
     })
     val: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 305
     })
 }),
 DatasetDict({
     train: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 1219
     })
     val: Dataset({
         features: ['text', 'label', '__index_level_0__'],
         num_rows: 305
     })
 }),
 DatasetDict({
     train: D

In [32]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("/kaggle/working/bert-from-scratch/checkpoint-1000")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True,max_length=512)

for i in range(len(datasets)):
    
    datasets[i] = datasets[i].map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/1219 [00:00<?, ? examples/s]

Map:   0%|          | 0/305 [00:00<?, ? examples/s]

Map:   0%|          | 0/1219 [00:00<?, ? examples/s]

Map:   0%|          | 0/305 [00:00<?, ? examples/s]

Map:   0%|          | 0/1219 [00:00<?, ? examples/s]

Map:   0%|          | 0/305 [00:00<?, ? examples/s]

Map:   0%|          | 0/1219 [00:00<?, ? examples/s]

Map:   0%|          | 0/305 [00:00<?, ? examples/s]

Map:   0%|          | 0/1220 [00:00<?, ? examples/s]

Map:   0%|          | 0/304 [00:00<?, ? examples/s]

In [19]:
pip install evaluate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.9/183.9 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fsspec, evaluate
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==20

In [162]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')

In [39]:
import shutil
shutil.rmtree('/kaggle/working/basiline_bobai_1/checkpoint-400')
shutil.rmtree('/kaggle/working/basiline_bobai_3/checkpoint-400')
shutil.rmtree('/kaggle/working/basiline_bobai_2/checkpoint-400')

In [40]:
# define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# models = []
for i in range(4,len(datasets)):
    model = AutoModelForSequenceClassification.from_pretrained(
        "/kaggle/working/bert-from-scratch/checkpoint-1000", num_labels=5
    )
    
    training_args = TrainingArguments(
        output_dir=f"basiline_bobai_{i}",
        learning_rate=5e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=20,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=1,
        metric_for_best_model='f1',
        load_best_model_at_end=True,
        report_to='none',
        lr_scheduler_type="linear",
        # push_to_hub=True,
        # hub_strategy="checkpoint",
        # # hub_token=write_access_token,
        # hub_private_repo=True,
        # hub_model_id='baseline_bobai'
    
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=datasets[i]["train"],
        eval_dataset=datasets[i]["val"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    models.append(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /kaggle/working/bert-from-scratch/checkpoint-1000 and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss,F1
1,No log,1.238887,0.384075
2,No log,0.908724,0.501617
3,No log,0.776207,0.631246
4,No log,0.727961,0.69772
5,No log,0.887765,0.706189
6,No log,0.85183,0.686658
7,No log,0.835634,0.76249
8,No log,0.955149,0.730838
9,No log,1.024541,0.744806
10,No log,1.013903,0.764257




In [42]:
models[0] == models[1]

False

In [80]:
import os
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_DISABLED"] = "false"


In [81]:
# execute the model training
trainer.train()



Epoch,Training Loss,Validation Loss,F1
1,No log,1.052951,0.489758
2,No log,0.819005,0.637608
3,No log,0.760731,0.653115
4,No log,0.720225,0.693577
5,No log,0.627713,0.743934
6,No log,0.7252,0.748794
7,No log,0.741031,0.734587
8,No log,0.85378,0.732889
9,No log,0.775022,0.770116
10,No log,0.842236,0.776611




TrainOutput(global_step=960, training_loss=0.22283920298020046, metrics={'train_runtime': 82.6064, 'train_samples_per_second': 368.978, 'train_steps_per_second': 11.621, 'total_flos': 84265368866328.0, 'train_loss': 0.22283920298020046, 'epoch': 20.0})

# Inference

In [58]:
tokenizer = AutoTokenizer.from_pretrained("/kaggle/working/basiline_bobai_4/checkpoint-240")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/1524 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/218 [00:00<?, ? examples/s]

In [None]:
import numpy as np
from scipy.optimize import minimize

# Пример данных
# Предсказания моделей на валидационной выборке (n_samples x n_models)
predictions = np.array([
    [0.7, 0.6, 0.5],  # предсказания для 1-го примера
    [0.2, 0.3, 0.4],  # предсказания для 2-го примера
    [0.9, 0.8, 0.7]   # предсказания для 3-го примера
])  # shape: (n_samples, n_models)

# Истинные значения
true_values = np.array([0.65, 0.25, 0.85])  # shape: (n_samples,)

# Целевая функция (MSE)
def objective(weights, predictions, true_values):
    weighted_pred = np.dot(predictions, weights)  # Взвешенные предсказания
    mse = np.mean((true_values - weighted_pred) ** 2)
    return mse

# Ограничения: сумма весов = 1
constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1})

# Границы: веса >= 0
bounds = [(0, None)] * predictions.shape[1]

# Начальное предположение (равные веса)
initial_weights = np.array([1.0 / predictions.shape[1]] * predictions.shape[1])

# Оптимизация
result = minimize(
    objective,
    initial_weights,
    args=(predictions, true_values),
    method='SLSQP',
    bounds=bounds,
    constraints=constraints
)

# Результат
optimal_weights = result.x
print("Оптимальные веса:", optimal_weights)
print("Достигнутая MSE:", result.fun)

# Проверка итоговых предсказаний
final_predictions = np.dot(predictions, optimal_weights)
print("Итоговые предсказания:", final_predictions)

In [167]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = 0
skip = []
all_preds = []
for i in range(5):
    if (i not in skip):
        cur_trainer = Trainer(
                model=models[i],
                args=training_args,
                train_dataset=datasets[i]["train"],
                eval_dataset=datasets[i]["val"],
                tokenizer=tokenizer,
                data_collator=data_collator,
                compute_metrics=compute_metrics,
            )
        # if (i==0):
        eval_out = cur_trainer.predict(tokenized_data[data_split])
        labels = eval_out.label_ids
            # eval_out = eval_out.predictions
        # else:
            # eval_out += cur_trainer.predict(tokenized_data[data_split]).predictions
        all_preds.append(eval_out.predictions)
# eval_out+=pd.DataFrame(preds)[cols].to_numpy()
all_preds.append(pd.DataFrame(preds)[cols].to_numpy())
all_preds.append(ctb.predict_proba(pd.DataFrame(classification_dataset['dev'])[['text']]))
all_preds = np.array(all_preds)


# pd.DataFrame(preds)[cols].to_numpy().argmax(1)

# eval_out+=ctb.predict_proba(pd.DataFrame(classification_dataset['dev'])[['text']])

# eval_out/=(5-len(skip)+2)
# predictions = eval_out.argmax(1)
dev_f1 = f1.compute(predictions=pd.DataFrame(preds)[cols].to_numpy().argmax(1), references=labels, average='macro')
dev_f1

  cur_trainer = Trainer(


  cur_trainer = Trainer(


  cur_trainer = Trainer(


  cur_trainer = Trainer(


  cur_trainer = Trainer(


{'f1': 0.8238322400351489}

In [150]:
predictions.shape

(6, 218, 5)

In [172]:
import numpy as np
from scipy.optimize import minimize
from torchmetrics import F1Score
import torch
from scipy.special import softmax

# Данные (замените на ваши реальные данные)
eval_outs = all_preds  # Предполагается, что all_preds имеет форму (6, 218, 5)
labels = labels  # Предполагается, что labels имеет форму (218,) с значениями 0-4

# Проверка данных
print("Форма eval_outs:", eval_outs.shape)
print("Форма labels:", labels.shape)
print("Уникальные метки:", np.unique(labels))

# Нормализация вероятностей (если они не нормализованы)
eval_outs = softmax(eval_outs, axis=2)  # Применяем softmax по оси классов
print("Проверка нормализации (сумма вероятностей по классам):", 
      np.sum(eval_outs[0, 0, :]))  # Должно быть ~1.0

# Инициализация F1-меры
f1_metric = F1Score(task="multiclass", num_classes=5, average='macro')

# Целевая функция: минимизируем -F1 (максимизируем F1)
def objective(weights, eval_outs, labels):
    n_models, n_samples, n_classes = eval_outs.shape
    weighted_probs = np.zeros((n_samples, n_classes))
    for i in range(n_models):
        weighted_probs += weights[i] * eval_outs[i]
    predictions = np.argmax(weighted_probs, axis=1)
    # Используем torchmetrics F1
    predictions = torch.tensor(predictions, dtype=torch.int64)
    labels_tensor = torch.tensor(labels, dtype=torch.int64)
    f1 = f1_metric(predictions, labels_tensor)
    return -f1.item()  # Минимизируем отрицательную F1

# Диагностика: проверяем F1 для каждой модели отдельно
for i in range(eval_outs.shape[0]):
    predictions = np.argmax(eval_outs[i], axis=1)
    predictions = torch.tensor(predictions, dtype=torch.int64)
    labels_tensor = torch.tensor(labels, dtype=torch.int64)
    f1 = f1_metric(predictions, labels_tensor)
    print(f"F1 для модели {i+1}: {f1.item():.4f}")

# Ограничения: сумма весов = 1
constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1})
# Границы: веса в [0, 1]
bounds = [(0, None)] * eval_outs.shape[0]
# Начальное предположение: равные веса
initial_weights = np.array([1.0 / eval_outs.shape[0]] * eval_outs.shape[0])

# Оптимизация
result = minimize(
    objective,
    initial_weights,
    args=(eval_outs, labels),
    method='trust-constr',
    bounds=bounds,
    constraints=constraints,
    options={'disp': True, 'maxiter': 1000}  # Увеличиваем итерации
)

# Результат
optimal_weights = result.x
print("Оптимальные веса:", optimal_weights)
print("Достигнутая F1-мера:", -result.fun)

# Итоговые предсказания
weighted_probs = np.zeros((eval_outs.shape[1], eval_outs.shape[2]))
for i in range(eval_outs.shape[0]):
    weighted_probs += optimal_weights[i] * eval_outs[i]
final_predictions = np.argmax(weighted_probs, axis=1)
print("Итоговые предсказания:", final_predictions)

# Проверка F1 для итоговых предсказаний
final_f1 = f1_metric(torch.tensor(final_predictions, dtype=torch.int64), 
                     torch.tensor(labels, dtype=torch.int64))
print("F1 для итоговых предсказаний:", final_f1.item())

Форма eval_outs: (7, 218, 5)
Форма labels: (218,)
Уникальные метки: [0 1 2 3 4]
Проверка нормализации (сумма вероятностей по классам): 0.9999999999999999
F1 для модели 1: 0.7519
F1 для модели 2: 0.7734
F1 для модели 3: 0.7754
F1 для модели 4: 0.7370
F1 для модели 5: 0.7573
F1 для модели 6: 0.8238
F1 для модели 7: 0.7794
`gtol` termination condition is satisfied.
Number of iterations: 1, function evaluations: 8, CG iterations: 0, optimality: 4.16e-17, constraint violation: 2.22e-16, execution time: 0.001 s.
Оптимальные веса: [0.14285714 0.14285714 0.14285714 0.14285714 0.14285714 0.14285714
 0.14285714]
Достигнутая F1-мера: 0.8077743053436279
Итоговые предсказания: [2 0 2 0 4 3 1 1 2 3 3 3 1 1 2 3 4 4 1 1 4 3 4 0 3 2 4 3 4 3 0 3 2 2 3 3 2
 3 0 3 4 0 3 4 4 1 2 1 1 0 2 1 3 4 3 4 0 1 0 4 4 2 3 3 2 0 3 1 2 3 0 3 1 1
 1 4 2 3 4 0 4 0 3 4 3 2 0 4 2 2 2 4 3 2 0 2 3 1 3 4 0 4 2 4 1 4 1 1 2 2 3
 0 4 0 1 2 4 0 2 4 3 4 2 1 2 1 0 1 0 1 3 2 4 1 2 3 2 2 4 0 1 3 0 3 1 4 2 0
 3 4 0 3 2 0 1 2 1 2 4 4 0 

In [None]:
import optuna

def optuna_objective(trial):
    weights = [trial.suggest_float(f'w{i}', 0, 1) for i in range(7)]
    weights = np.array(weights) / np.sum(weights)  # Нормализация
    weighted_probs = np.zeros((218, 5))
    for i in range(7):
        weighted_probs += weights[i] * eval_outs[i]
    predictions = np.argmax(weighted_probs, axis=1)
    return -f1_score(labels, predictions, average='macro')

study = optuna.create_study(direction='minimize')
study.optimize(optuna_objective, n_trials=1000)
optimal_weights = np.array([study.best_params[f'w{i}'] for i in range(7)])
optimal_weights /= optimal_weights.sum()



[I 2025-05-03 16:10:46,280] A new study created in memory with name: no-name-d8ab239e-a8fc-4f09-a066-b8e063e02e67
[I 2025-05-03 16:10:46,288] Trial 0 finished with value: -0.8174223709714411 and parameters: {'w0': 0.2249217343796227, 'w1': 0.23675307294271997, 'w2': 0.33152905061583116, 'w3': 0.20005554452478613, 'w4': 0.09416924821445316, 'w5': 0.7935377509220644, 'w6': 0.68214153311441}. Best is trial 0 with value: -0.8174223709714411.
[I 2025-05-03 16:10:46,291] Trial 1 finished with value: -0.77745679167954 and parameters: {'w0': 0.39583878024461516, 'w1': 0.3837186675435742, 'w2': 0.4364172429723028, 'w3': 0.05287417401538297, 'w4': 0.9914388423361498, 'w5': 0.4967821632259347, 'w6': 0.10068750031350604}. Best is trial 0 with value: -0.8174223709714411.
[I 2025-05-03 16:10:46,294] Trial 2 finished with value: -0.8270100564682877 and parameters: {'w0': 0.4041334263737665, 'w1': 0.5669534043442175, 'w2': 0.411008448815139, 'w3': 0.33763569611590527, 'w4': 0.5891782150082215, 'w5': 0

In [202]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class F1Loss(nn.Module):
    def __init__(self, task='binary', epsilon=1e-7):
        super().__init__()
        self.task = task  # 'binary' или 'multiclass'
        self.epsilon = epsilon  # Для избежания деления на ноль

    def forward(self, logits, targets):
        """
        Вычисляет лосс как 1 - F1_score.
        
        Args:
            logits: Тензор предсказаний (логиты), размер зависит от задачи:
                    - binary: (batch_size, 1) или (batch_size,)
                    - multiclass: (batch_size, num_classes)
            targets: Тензор целевых значений:
                    - binary: (batch_size,) с 0 или 1
                    - multiclass: (batch_size,) с метками классов (0, 1, ..., num_classes-1)
        
        Returns:
            loss: Скаляр, 1 - F1_score
        """
        if self.task == 'binary':
            # Для бинарной классификации применяем сигмоиду к логитам
            probs = torch.sigmoid(logits).squeeze()
            targets = targets.float()

            # Истинные положительные (TP), ложные положительные (FP), ложные отрицательные (FN)
            tp = torch.sum(probs * targets)  # TP: предсказано 1 и target 1
            fp = torch.sum(probs * (1 - targets))  # FP: предсказано 1, но target 0
            fn = torch.sum((1 - probs) * targets)  # FN: предсказано 0, но target 1

            # Точность (precision) и полнота (recall)
            precision = tp / (tp + fp + self.epsilon)
            recall = tp / (tp + fn + self.epsilon)

            # F1-мера
            f1 = 2 * (precision * recall) / (precision + recall + self.epsilon)

        elif self.task == 'multiclass':
            # Для многоклассовой классификации применяем softmax к логитам
            probs = F.softmax(logits, dim=1)  # (batch_size, num_classes)
            targets_one_hot = F.one_hot(targets, num_classes=logits.shape[1]).float()

            # TP, FP, FN по всем классам
            tp = torch.sum(probs * targets_one_hot, dim=0)  # (num_classes,)
            fp = torch.sum(probs * (1 - targets_one_hot), dim=0)
            fn = torch.sum((1 - probs) * targets_one_hot, dim=0)

            # Точность и полнота для каждого класса
            precision = tp / (tp + fp + self.epsilon)
            recall = tp / (tp + fn + self.epsilon)

            # F1-мера для каждого класса
            f1_per_class = 2 * (precision * recall) / (precision + recall + self.epsilon)

            # Средняя F1-мера (macro F1)
            f1 = torch.mean(f1_per_class)

        else:
            raise ValueError("task должен быть 'binary' или 'multiclass'")

        # Лосс = 1 - F1
        loss = 1 - f1
        return loss

In [201]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import f1_score
import numpy as np

class EnsembleModel(nn.Module):
    def __init__(self, n_models):
        super(EnsembleModel, self).__init__()
        # Веса для каждой модели (инициализируем равномерно)\
        self.count_models = n_models
        self.params = nn.Parameter(torch.ones(n_models) / n_models)
        
    def forward(self, logits):
        preds = torch.zeros(logits.shape[1],5)
        print(preds.shape)
        print(logits.shape)
        normalized_weights = torch.softmax(self.params, dim=0)
        print(normalized_weights)
        for i in range(self.count_models):
            preds += logits[i] * normalized_weights[i]
        preds/=self.count_models
        
        # logits: (n_models, batch_size, n_classes)
        # Нормализуем веса через softmax
        # normalized_weights = torch.softmax(self.weights, dim=0)
        # # Взвешенное усреднение логитов
        # weighted_logits = torch.einsum('m,mbc->bc', normalized_weights, logits)
        return preds.argmax(1)

def train_ensemble(model, logits, targets, n_epochs=100, lr=0.01):
    """
    logits: tensor of shape (n_models, n_samples, n_classes)
    targets: tensor of shape (n_samples,)
    """
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    
    best_f1 = 0.0
    best_weights = None
    
    for epoch in range(n_epochs):
        model.train()
        optimizer.zero_grad()
        
        # Прямой проход
        output = model(logits)
        local_f1 = f1_score(labels, output, average='macro')
        if (local_f1>best_f1):
            best_f1=local_f1
            best_weight = model.params
        loss = criterion(torch.Tensor(np.array(local_f1)).unsqueeze(0), torch.Tensor(1.).unsqueeze(0))
        
        # Обратное распространение
        loss.backward()
        optimizer.step()
        
        # Оценка F1 macro
       
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{n_epochs}], Loss: {loss.item():.4f}, F1 Macro: {local_f1:.4f}')
    
model = EnsembleModel(7)
train_ensemble(model, torch.Tensor(all_preds), torch.Tensor(targets))
# Пример использования
# if __name__ == "__main__":
#     # Пример данных
#     n_models = 7
#     n_samples = 100
#     n_classes = 5
    
#     # Случайные логиты и метки
#     logits = torch.FloatTensor(all_preds)
#     targets = torch.Tensor(labels)
    
#     # Инициализация модели
#     model = EnsembleModel(n_models)
    
#     # Обучение
#     best_f1, best_weights = train_ensemble(model, logits, targets)
#     print(f"Best F1 Macro: {best_f1:.4f}")
#     print(f"Best Weights: {torch.softmax(best_weights, dim=0)}")

torch.Size([218, 5])
torch.Size([7, 218, 5])
tensor([0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429],
       grad_fn=<SoftmaxBackward0>)


TypeError: new(): data must be a sequence (got float)

# Testing

In [None]:
# UPDATE THIS CELL ACCORDINGLY

# define a funciton to load your tokenizer and model from a HF path
# the path variables can be strings or lists of strings (for ensemble solutions)
def load_model(path_to_tokenizer, path_to_model, token):
  # Example:
  tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer, token=token)
  model = AutoModelForSequenceClassification.from_pretrained(path_to_model, token=token)
  model.eval()

  return tokenizer, model

# define a "predict" function that takes the model and a list of input strings
# and returns the outputs as a list of integer classes
def predict(tokenizer, model, input_texts):
  #Example:
  predictions = []
  for input_text in input_texts:

    input_ids = tokenizer(input_text, return_tensors="pt")

    with torch.no_grad():
      logits = model(**input_ids).logits

    predictions.append(logits.argmax().item())

  return predictions

# set variables
path_to_model = "path/to/your/best/model/on/hf" # can be a list instead
path_to_tokenizer = "path/to/your/best/tokenizer/on/hf" # can be a list instead
model_access_token = "access token" # a fine-grained token with read rights for your model repository


In [None]:
# DO NOT CHANGE THIS CELL!!!

tokenizer, model = load_model(path_to_tokenizer, path_to_model, token=model_access_token)

test_data = load_dataset("InternationalOlympiadAI/NLP_problem_test")['test']['text']

predictions = predict(tokenizer, model, test_data)

with open('test_predictions.txt', 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions]))