### Task Description

**Legend:**

Young Alex has a beloved BERT model that he carries everywhere on his trusty flash drive. One day, during an excursion along the River Styx, a few drops of water landed on the precious device, corrupting the model's weights.

Heartbroken, Alex rushed home to fix the neural network. After a quick analysis, he discovered that only the token embeddings were damaged; the rest of the architecture (attention blocks and heads) remained perfectly intact. Now he needs to restore the model's performance on the sentiment analysis task.

**Task:**

Repair the broken vectors in the model's token embedding matrix so that the model performs better on text sentiment analysis.

**Restrictions:**

- You cannot use any other pre-trained transformer-based models or LLMs.

- You cannot use any additional data.

In [1]:
import numpy as np
import pandas as pd
import torch
np.random.seed(21)

### Load Dataset

In [3]:
val_data_path = "/kaggle/input/cyprus-ai-camp-broken-bert/val_dataset.csv"
test_data_path = "/kaggle/input/cyprus-ai-camp-broken-bert/test.csv"

val_df = pd.read_csv(val_data_path)
test_df = pd.read_csv(test_data_path)

### Load Tokenizer & Model

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Ilseyar-kfu/broken_bert")

In [5]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [6]:
val_encodings = tokenizer(val_df["text"].to_list(), truncation=True, padding=True, max_length=256)
val_dataset = Dataset(val_encodings, val_df["labels"].to_list())
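The labelled `val_dataset` is not used anywhere else in this notebook, but one possible use is to train only the embedding matrix while every other weight stays frozen. The sketch below illustrates that freezing pattern on a tiny stand-in model; `TinyClassifier`, its sizes, the random batch, and the optimizer settings are all hypothetical and chosen only so the example runs quickly. On the real model the trainable parameter would be `model.bert.embeddings.word_embeddings.weight`. Whether this is the intended solution is an assumption, not something the task statement confirms.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the BERT classifier: an embedding table plus a head."""

    def __init__(self, vocab_size=50, dim=8, num_labels=3):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_labels)

    def forward(self, input_ids):
        # Mean-pool the token embeddings, then classify.
        return self.head(self.word_embeddings(input_ids).mean(dim=1))


torch.manual_seed(21)
toy_model = TinyClassifier()

# Freeze every parameter, then re-enable gradients for the embedding matrix only.
for param in toy_model.parameters():
    param.requires_grad = False
toy_model.word_embeddings.weight.requires_grad = True

head_before = toy_model.head.weight.detach().clone()
emb_before = toy_model.word_embeddings.weight.detach().clone()

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-2)

# Random toy batch: 16 sequences of 5 token ids with 3 possible labels.
input_ids = torch.randint(0, 50, (16, 5))
labels = torch.randint(0, 3, (16,))

for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(toy_model(input_ids), labels)
    loss.backward()
    optimizer.step()

head_unchanged = torch.equal(head_before, toy_model.head.weight)
emb_changed = not torch.equal(emb_before, toy_model.word_embeddings.weight)
```

Training on the same rows that are later used for evaluation would inflate the validation score, so a held-out split of `val_df` would be needed to keep the reported F1 meaningful.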

In [7]:
texts_2_score = val_df["text"].to_list() + test_df["text"].to_list()

### Model changes (Task is here)

In [8]:
model = AutoModelForSequenceClassification.from_pretrained("Ilseyar-kfu/broken_bert")

new_embedings = model.bert.embeddings.word_embeddings.weight.detach().numpy().copy()

# Your solution goes here: compute the repaired embedding matrix and store it in `new_embedings`.

model.bert.embeddings.word_embeddings.weight = torch.nn.Parameter(torch.Tensor(new_embedings))
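Before repairing anything, it can help to find out which rows of the matrix look damaged. The task does not say how the weights were corrupted, so the sketch below rests on an assumption: that damaged rows either contain non-finite values or have an L2 norm far from the typical row norm. The helper `find_suspect_rows`, its threshold, and the toy matrix are hypothetical; on the real data the input would be `new_embedings`.

```python
import numpy as np


def find_suspect_rows(emb, z_thresh=6.0):
    """Return indices of rows that are non-finite or whose L2 norm is a robust outlier."""
    finite = np.isfinite(emb).all(axis=1)
    # Replace non-finite entries by 0 so the norm of every row is defined.
    norms = np.linalg.norm(np.where(np.isfinite(emb), emb, 0.0), axis=1)
    # Median and MAD are barely affected by a handful of corrupted rows.
    median = np.median(norms[finite])
    mad = np.median(np.abs(norms[finite] - median)) + 1e-12
    robust_z = 0.6745 * (norms - median) / mad
    return np.flatnonzero(~finite | (np.abs(robust_z) > z_thresh))


# Toy matrix standing in for the real embeddings (assumption: healthy rows look alike).
rng = np.random.default_rng(21)
toy = rng.normal(0.0, 0.02, size=(1000, 64))
toy[[3, 500]] = 0.0                        # rows wiped to zero
toy[42] = rng.normal(0.0, 5.0, size=64)    # row with exploded values
toy[7, 0] = np.nan                         # row with a non-finite entry

suspects = find_suspect_rows(toy)
print(suspects)
```

If the real corruption looks different (for example, rows replaced by plausible-looking noise), norm statistics will not reveal it and another signal would be needed.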

### Evaluation

In [9]:
from sklearn.metrics import f1_score
from transformers import pipeline

In [10]:
from sklearn.metrics import classification_report

def evaluate_on_validation(model, tokenizer, df_val):
    label_2_dict = {'LABEL_0': 'neutral', "LABEL_1" : 'positive', "LABEL_2": 'negative'}
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
    ans = classifier.predict(list(df_val["text"]))
    ans = [label_2_dict[el["label"]] for el in ans]
    
    print("f1_score = ", f1_score(df_val["labels"], ans, average='macro'))
    print(classification_report(df_val["labels"], ans))

In [11]:
evaluate_on_validation(model, tokenizer, val_df)

Device set to use cuda:0


f1_score =  0.28619377475000246
              precision    recall  f1-score   support

    negative       0.60      0.18      0.27       935
     neutral       0.32      0.91      0.47       759
    positive       0.62      0.06      0.11       806

    accuracy                           0.36      2500
   macro avg       0.52      0.38      0.29      2500
weighted avg       0.52      0.36      0.28      2500
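The headline metric is the macro-averaged F1: the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its support. With the rounded values in the report, (0.27 + 0.47 + 0.11) / 3 ≈ 0.28, which agrees with the printed `f1_score` up to rounding. The following is a minimal pure-Python sketch of that computation on made-up labels; the helper names and the toy lists are illustrative, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy example: the model over-predicts "neutral", as in the report.
y_true = ["negative", "neutral", "positive", "neutral", "negative", "positive"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "negative", "positive"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```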



### Model Scoring


In [13]:
import hashlib

def create_submission(model, tokenizer, df_test):
    label_2_dict = {'LABEL_0': 'neutral', "LABEL_1" : 'positive', "LABEL_2": 'negative'}
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
    ans = classifier.predict(list(df_test["text"]))
    ans = [label_2_dict[el["label"]] for el in ans]
    
    df = pd.DataFrame({"labels" : ans, "id": df_test['id']})
    hsh = hashlib.sha256(df.to_csv(index=False).encode('utf-8')).hexdigest()[:8]
    submit_path = f"submit_{hsh}.csv"
    print(f"SUBMIT_NAME: {submit_path}")
    print(df.head(10))
    df.to_csv(submit_path, index=False)
    df.to_csv("submission.csv", index=False)

In [14]:
create_submission(model, tokenizer, test_df)

Device set to use cuda:0


SUBMIT_NAME: submit_a1511d58.csv
     labels    id
0   neutral  5000
1   neutral  5001
2   neutral  5002
3   neutral  5003
4   neutral  5004
5   neutral  5005
6   neutral  5006
7  negative  5007
8  negative  5008
9   neutral  5009
