# ModernBERT

**Based on** [ModernBERT fine-tuning example](https://colab.research.google.com/drive/1iWIruk02fGib9RZWdS51SJStIrHx4hMK?usp=sharing)

https://unfoldai.com/modernbert/

In [None]:
!pip install git+https://github.com/huggingface/transformers.git datasets

In [2]:
# Import required PyTorch libraries for deep learning operations and model components
# - torch: Main PyTorch library for tensor operations and neural networks
# - AutoTokenizer: For converting text into tokens that the model can understand
# - AutoModelForSequenceClassification: Pre-built model architecture for classification tasks
# - Dataset, DataLoader: PyTorch utilities for handling data efficiently
# - load_dataset: HuggingFace utility to load datasets from their hub
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset

In [3]:
# Custom Dataset class for handling text data and preparing it for model training
# This class inherits from PyTorch's Dataset class and implements required methods:
# - __init__: Initializes the dataset by tokenizing text and preparing labels
# - __getitem__: Returns a single data item with its features and label
# - __len__: Returns the total number of items in the dataset
# The max_length parameter controls the maximum number of tokens per text sample
class TextDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.encodings = tokenizer(data['Resume'], truncation=True, padding=True,
                                 max_length=max_length, return_tensors='pt')
        self.labels = torch.tensor(data['Label'])

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [4]:
import gdown
url = 'https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1LhHBKx2wzJrT7XXVbE6Yk9AHqtcCo2w8'
gdown.download(url, '/content/')

Downloading...
From: https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1LhHBKx2wzJrT7XXVbE6Yk9AHqtcCo2w8
To: /content/df_with_labels.zip
100%|██████████| 6.43M/6.43M [00:00<00:00, 56.2MB/s]


'/content/df_with_labels.zip'

In [5]:
import os
os.mkdir("data")

In [6]:
!unzip -q /content/df_with_labels.zip  -d data

In [7]:
import pandas as pd
import numpy as np

In [8]:
df = pd.read_csv('/content/data/df_with_labels(2).csv')
df['split'] = np.random.randn(df.shape[0], 1)

msk = np.random.rand(len(df)) <= 0.7

train = df[msk]
test_val_df = df[~msk]

In [9]:
test_val_df.to_csv('/content/data/test_val.csv', index=False)

In [10]:
test_val = pd.read_csv('/content/data/test_val.csv')
test_val['split'] = np.random.randn(test_val.shape[0], 1)

msk = np.random.rand(len(test_val)) <= 0.5

test = test_val[msk]
val = test_val[~msk]

In [11]:
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
val.to_csv('val.csv', index=False)

In [12]:
# Load dataset
dataset = load_dataset('csv', data_files={'train': "/content/train.csv",'test': "/content/test.csv",'val': "/content/val.csv"})

# Initialize tokenizer and model
model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=7)

# Create datasets
train_dataset = TextDataset(dataset['train'], tokenizer)
val_dataset = TextDataset(dataset['val'], tokenizer)
test_dataset = TextDataset(dataset['test'], tokenizer)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)

# Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/20.8k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/599M [00:00<?, ?B/s]

Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Training loop
num_epochs = 20
best_acc = 0.63

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            val_loss += outputs.loss.item()

            predictions = torch.argmax(outputs.logits, dim=1)
            correct += (predictions == batch['labels']).sum().item()
            total += batch['labels'].size(0)

    avg_train_loss = total_loss / len(train_loader)
    avg_val_loss = val_loss / len(val_loader)
    accuracy = correct / total

    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Average training loss: {avg_train_loss:.4f}")
    print(f"Average validation loss: {avg_val_loss:.4f}")
    print(f"Validation accuracy: {accuracy:.4f}")

    if accuracy > best_acc:
       best_acc = accuracy
       torch.save(model.state_dict(), 'best.pt')
       model.save_pretrained('./modernbert_resumaizer')
       tokenizer.save_pretrained('./modernbert_resumaizer')
       print('save best.pt with acc = ', str(best_acc))

# Final test set evaluation
model.eval()
test_loss = 0
correct = 0
total = 0

with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        test_loss += outputs.loss.item()

        predictions = torch.argmax(outputs.logits, dim=1)
        correct += (predictions == batch['labels']).sum().item()
        total += batch['labels'].size(0)

test_accuracy = correct / total
print(f"\nFinal Test Accuracy: {test_accuracy:.4f}")

In [13]:
# We trained the model and saved it to Google Drive.
# Download the trained model:
url = 'https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1FQ9-NzozILQroq2cDZzSV82pBTXrHUxL'
gdown.download(url, '/content/')

Downloading...
From: https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1FQ9-NzozILQroq2cDZzSV82pBTXrHUxL
To: /content/ModernBERT.zip
100%|██████████| 556M/556M [00:08<00:00, 63.8MB/s]


'/content/ModernBERT.zip'

In [14]:
import os
os.mkdir("model")

In [15]:
!unzip -q /content/ModernBERT.zip  -d model

In [16]:
import torch.nn.functional as F

In [17]:
# Loading the tokenizer and model from the saved files
model_dir = "/content/model/ModernBERT"  # Specify the path to the directory where the model is saved
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

In [18]:
# Set the model to evaluation mode
model.eval()

ModernBertForSequenceClassification(
  (model): ModernBertModel(
    (embeddings): ModernBertEmbeddings(
      (tok_embeddings): Embedding(50368, 768, padding_idx=50283)
      (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (layers): ModuleList(
      (0): ModernBertEncoderLayer(
        (attn_norm): Identity()
        (attn): ModernBertAttention(
          (Wqkv): Linear(in_features=768, out_features=2304, bias=False)
          (rotary_emb): ModernBertRotaryEmbedding()
          (Wo): Linear(in_features=768, out_features=768, bias=False)
          (out_drop): Identity()
        )
        (mlp_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): ModernBertMLP(
          (Wi): Linear(in_features=768, out_features=2304, bias=False)
          (act): GELUActivation()
          (drop): Dropout(p=0.0, inplace=False)
          (Wo): Linear(in_features=1152, out_features=768, bias=False)
        )
      

In [None]:
#resume = df['Resume'][3430]

In [None]:
# https://hh.ru/resume/9312d92000088f4a3f0039ed1f6c4667475153

resume = 'Front End Developer 2 000 $ in hand Specializations: Programmer, developer Employment: full time Work schedule: full day \
Work experience 7 years 1 month February 2018 — currently 7 years 1 month Efusoft Front End Developer * Worked effectively with a diverse \
team to accomplish daily objectives and meet long-term goals. * Adapted websites to match changing user preferences and client \
demands with regular updates. * Developed and deployed successful Sportsbook strategies into clients websites. \
Skills Skill proficiency levels Работоспособность HTML5 CSS3 Less Sass JavaScript REACT REDUX WEBPACK Node.js Git \
About me Innovative Front End Developer with 3 years experience building and maintaining responsive websites in the recruiting industry. \
Proficient in HTML, CSS, JavaScript; plus modern libraries and frameworks. Passionate about usability and process working knowledge of \
Adobe Photoshop, Invisionapp, Figma \
Education Secondary education Languages Armenian — Native English — B1 — Intermediate Russian — B2 — Upper Intermediate \
Citizenship, travel time to work Citizenship: Armenia Permission to work: Russia Desired travel time to work: Doesnt matter'

In [21]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/232.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m232.6/232.6 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [22]:
import PyPDF2

def pdf_to_text(pdf_path):
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as pdf_file:
        # Create a PdfReader object instead of PdfFileReader
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Initialize an empty string to store the text
        text = ''

        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()

    return text

In [23]:
url = 'https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1BHGZmbkEh3lJs_euj3R_nVm_Z30SSe2P'
gdown.download(url, '/content/')

file_path = '/content/Resume-frontend-example.pdf'

resume_ru = pdf_to_text(file_path).replace('\n',' ')

Downloading...
From: https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1BHGZmbkEh3lJs_euj3R_nVm_Z30SSe2P
To: /content/Resume-frontend-example.pdf
100%|██████████| 46.4k/46.4k [00:00<00:00, 41.1MB/s]


In [24]:
# For Yandex Translate
oauth_token = ""
catalog_id = ""

In [25]:
# Translate resume from Russian to English
import requests

IAM_TOKEN = oauth_token
folder_id = catalog_id
target_language = 'en'
texts = [resume_ru]

body = {
    "targetLanguageCode": target_language,
    "texts": texts,
    "folderId": folder_id,
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Api-Key {}".format(IAM_TOKEN)
}

response = requests.post('https://translate.api.cloud.yandex.net/translate/v2/translate',
    json = body,
    headers = headers
)

print(response.text)

resume = response.text

{
 "translations": [
  {
   "text": "Frontend developer Specializations: • Programmer, developer Employment: full—time Work schedule: full-time, flexible schedule, remote work Experience 4 years 10 months March 2023 - present 2 years Open Solutions Penza Frontend developer Project: Omnichannel front-end solution \"Agent's Personal account\" in the insurance sector Tasks: - Optimization devkit architectures for the micro-frontend and backend-for-frontend being developed; - Participation in the development of a single ApiService suitable for several micro-frontends; - Development of functions for parsing and converting nested data structures to create a contract; - Development of various forms for creating a contract; - Development of UI components; - BFF development. Stack: - frontend (micro-frontend) - react, redux, redux-thunk, styled-components, react-hook-form, module-federation; - backend (back-for-frontend) - Node.js, express.  _____________________________________________________

In [26]:
# Tokenize the text
encoded_input = tokenizer(resume, padding=True, truncation=True, return_tensors='pt')

In [27]:
# We get a prediction from the model
with torch.no_grad():
    output = model(**encoded_input)

# We obtain logits - the output values ​​of the model before applying softmax
logits = output.logits

# We use softmax to obtain class probabilities
probabilities = F.softmax(logits, dim=1)

# We output the probabilities for each class
for i, prob in enumerate(probabilities[0]):
    print(f"Вероятность класса {i}: {prob.item()}")

# We obtain the class index with maximum probability
predicted_class = torch.argmax(probabilities, dim=1).item()
print(f"Предсказанный класс: {predicted_class}")

Compiling the model with `torch.compile` and using a `torch.cpu` device is not supported. Falling back to non-compiled mode.


Вероятность класса 0: 0.10254279524087906
Вероятность класса 1: 0.38966163992881775
Вероятность класса 2: 0.04467160254716873
Вероятность класса 3: 0.37534236907958984
Вероятность класса 4: 0.06892707943916321
Вероятность класса 5: 0.013792800717055798
Вероятность класса 6: 0.005061719100922346
Предсказанный класс: 1
