# Fine Tuning Embedding

<img src="https://miro.medium.com/v2/resize:fit:1400/0*AjX-xfa4UvNVu9js.jpg" width=600>




---
## Install Dependencies

In [None]:
%%capture
!pip install --upgrade sentence-transformers datasets transformers torch tensorboard

In [None]:
pip install sentence_transformers



In [1]:
import torch
from datasets import Dataset
from sentence_transformers.training_args import BatchSamplers
import pandas as pd
from datasets import load_dataset, concatenate_datasets
from google.colab import drive
from huggingface_hub import login
from google.colab import userdata
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.evaluation import InformationRetrievalEvaluator, SequentialEvaluator
from sentence_transformers.util import cos_sim
from collections import defaultdict
from sentence_transformers import  SentenceTransformerModelCardData
from sentence_transformers.models import Normalize, Pooling, Transformer
from sentence_transformers import  SentenceTransformerTrainingArguments, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, TripletLoss
from sentence_transformers.losses import TripletDistanceMetric
import os
import math
from transformers import AutoTokenizer

drive.mount('/content/drive', force_remount=True)

torch.set_float32_matmul_precision('high')

Mounted at /content/drive


**Login to Hugging Face**

Used for pushing the model to the Hugging Face Hub and for downloading gated models or datasets.

In [2]:

login(token=userdata.get('HF_TOKEN'), add_to_git_credential=True)


---
## IMPORT DATA: Preparation


In [3]:
# Load dataset from the Drive
#PUT THE ACTUAL PATH.
df_unseen_test = pd.read_csv('/content/drive/Shareddrives/Master_Thesis/Data/unseen_test.csv')
df_train = pd.read_csv('/content/drive/Shareddrives/Master_Thesis/Data/train.csv')
df_test = pd.read_csv('/content/drive/Shareddrives/Master_Thesis/Data/test.csv')
print(len(df_unseen_test))
print(len(df_test))
print(len(df_train))
t=len(df_unseen_test)+len(df_test)+len(df_train)
print(t )

210
8668
79897
88775


The dataset does not come with the (anchor, positive, negative) triplet structure. For that reason, in the **hard_negative_mining** notebook we used BM25 and cosine similarity to mine a hard negative for each content entry.

Only the training set needs the triplet structure, and only if we train with the TripletLoss function:
 https://sbert.net/docs/cross_encoder/loss_overview.html#loss-table
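
As a minimal sketch of the idea on made-up rows: assuming the mined result maps each `id` to a ranked list of candidate ids, the content of the first candidate becomes the negative, and rows without a usable candidate are dropped. The rows, the `mined` dictionary and the `build_triplets` helper are hypothetical and only illustrate the structure.

```python
# Toy rows standing in for the training data; ids and texts are made up.
rows = [
    {"id": 1, "question": "What does Theorem A bound?", "content": "Theorem A: ..."},
    {"id": 2, "question": "When does Lemma B hold?", "content": "Lemma B: ..."},
    {"id": 3, "question": "What is Definition C?", "content": "Definition C: ..."},
]
# Assumed shape of the mined result: id -> ranked list of hard-negative ids.
mined = {1: [3, 2], 2: [1]}  # id 3 has no mined candidate


def build_triplets(rows, mined):
    """Attach the top-ranked mined negative to each (question, content) pair."""
    id_to_content = {r["id"]: r["content"] for r in rows}  # constant-time lookups
    triplets = []
    for r in rows:
        candidates = mined.get(r["id"], [])
        negative = id_to_content.get(candidates[0]) if candidates else None
        if negative is None:
            continue  # rows without a usable negative are dropped
        triplets.append(
            {"anchor": r["question"], "positive": r["content"], "negative": negative}
        )
    return triplets


triplets = build_triplets(rows, mined)
print(len(triplets), triplets[0]["negative"])
```

The dictionary lookup avoids scanning the whole table once per row, which matters when the training set has tens of thousands of entries.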

In [5]:
import pickle
with open('/content/drive/Shareddrives/Master_Thesis/Fine-Tuning/Embeddings/hard_negatives_modernbert_dapt.pkl','rb') as f:
    hard_negs = pickle.load(f)

# Attach the top-ranked hard negative to each training row
for idx, row in df_train.iterrows():
    source_id = row['id']
    matches = hard_negs.get(source_id, [])

    if matches:
        match_id = matches[0]
        question_row = df_train.loc[df_train['id'] == match_id, 'content']

        if not question_row.empty:
            df_train.at[idx, 'negative'] = question_row.values[0]
        else:
            df_train.at[idx, 'negative'] = None
    else:
        df_train.at[idx, 'negative'] = None

Drop the rows without a negative (NaN)

In [6]:
# Count the rows without a negative, then drop them
print(len(df_train[df_train['negative'].isna()]))
dataset_train = df_train[df_train['negative'].notna()]

21


In [7]:
dataset_train.head()

Unnamed: 0,id,paper id,title,categories,type,content,question,negative
0,130,2501.00784,cloitre's self-generating sequence,"['math.co', 'cs.dm', 'cs.fl', 'math.nt']",theorems,Let $g_n$ be the number of $1$'s in the sequen...,What is the limit of the proportion of 1's in ...,\label{thm:bounds_initial}\n Le...
1,265,2501.00809,initial ideals of weighted forms and the genus...,"['math.ac', 'math.ag']",theorems,\label{ThmConjAreTrue}\nConjectures \ref{Conj1...,Does the statement of \textbf{ThmConjAreTrue} ...,[{\cite[Corollary 2.2.2 with $p=3$]{BSY}}]\n ...
2,266,2501.00809,initial ideals of weighted forms and the genus...,"['math.ac', 'math.ag']",propositions,}\n\newcommand{\ep}{,\\emph{Is the statement \emph{If $X$ is a comp...,\label{prop:coherence}\n\tIf $X$ is a qcqs sch...
3,267,2501.00809,initial ideals of weighted forms and the genus...,"['math.ac', 'math.ag']",definitions,}\n\newcommand{\ed}{,Is the statement $\ed{True}$?,\label{main result 3}\nThe statement that ever...
4,313,2501.00845,spectral spaces of normal subgroups,"['math.gr', 'math.gn']",theorems,\label{mth}\nLet $G$ be a group having a maxim...,Does the set $\mathcal{N}^+(G)$ of proper norm...,\label{maxodd}\r\nLet $G$ be a finite group. T...


We need to rename the columns.
We format the dataset into the structure expected by the upcoming training, with `anchor`, `positive` and `id` columns. We rename `question` to `anchor` and `content` to `positive`, drop the `title` and `type` columns, and keep the existing `id` column.

In [8]:
#TRAIN
dataset_train = Dataset.from_pandas(dataset_train, preserve_index=False)
# Clean & Format Columns
dataset_train = dataset_train.rename_column("question", "anchor")
dataset_train = dataset_train.rename_column("content", "positive")
dataset_train = dataset_train.remove_columns([ "title", "type"]) # keep category , paper id


#TEST:
df_test = Dataset.from_pandas(df_test, preserve_index=False)
df_test = df_test.rename_column("question", "anchor")
df_test = df_test.rename_column("content", "positive")
df_test = df_test.remove_columns([ "title", "type"])

The train/test split is already given by the CSV files, so no shuffling or splitting is needed here. We inspect both datasets and then save them to disk as JSON Lines files for easier loading.

In [9]:
print(dataset_train)
print(df_test)

Dataset({
    features: ['id', 'paper id', 'categories', 'positive', 'anchor', 'negative'],
    num_rows: 79876
})
Dataset({
    features: ['id', 'paper id', 'categories', 'positive', 'anchor'],
    num_rows: 8668
})


In [10]:
# Convert back to pandas to write JSON Lines files
train_df = dataset_train.to_pandas()
test_df  = df_test.to_pandas()
# orient="records" never writes the index, so no index argument is needed
train_df.to_json("train_dataset.json",
                 orient="records", lines=True)
test_df.to_json("test_dataset.json",
                orient="records", lines=True)

---
## Base Model Evaluation


We build a SentenceTransformer so we can use the InformationRetrievalEvaluator.

https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator


In [None]:


repo_id = "Master-thesis-NAP/ModernBert-DAPT-math"
# Download the model
word_model = models.Transformer(repo_id)
# Add mean pooling
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# Build a SentenceTransformer
model = SentenceTransformer(modules=[word_model, pooling])



To run our base evaluations, we need to prepare the data slightly differently for the [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator). This evaluator requires three key data structures:

1. A corpus dictionary mapping IDs to documents (`{id: statement}`)
2. A queries dictionary mapping IDs to questions (`{query_id: query}`)
3. A relevant_docs dictionary specifying which corpus documents are relevant for each query (`{query_id: [id, ...]}`)

To build these structures:
- We combine the train and test datasets into a single corpus_dataset to ensure all text chunks are available during evaluation
- The corpus dictionary is created from the combined corpus_dataset, so it contains the statements of all the papers
- The queries dictionary is created only from the test_dataset, as we want to evaluate on unseen questions
- For the relevance mapping, we use `paper id` as the connecting key: every corpus statement (**id**) that belongs to the same paper as a test query is marked as relevant to that query
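
The three structures can be sketched on a handful of made-up rows. The row contents below are hypothetical; only the column names (`id`, `paper id`, `positive`, `anchor`) follow this notebook. The sketch uses sets for the relevant ids, which avoids duplicates; the notebook code itself uses lists.

```python
from collections import defaultdict

# Made-up rows; "paper id" groups statements that come from the same paper.
train_rows = [
    {"id": 10, "paper id": "A", "positive": "Theorem 1 of paper A", "anchor": "q10"},
    {"id": 11, "paper id": "A", "positive": "Lemma 2 of paper A", "anchor": "q11"},
    {"id": 12, "paper id": "B", "positive": "Theorem 1 of paper B", "anchor": "q12"},
]
test_rows = [
    {"id": 20, "paper id": "A", "positive": "Corollary 3 of paper A", "anchor": "q20"},
]

corpus_rows = train_rows + test_rows
corpus = {r["id"]: r["positive"] for r in corpus_rows}  # every statement
queries = {r["id"]: r["anchor"] for r in test_rows}     # test questions only

# Group statement ids by paper, then mark a query's whole paper as relevant.
paper_to_ids = defaultdict(set)
for r in corpus_rows:
    paper_to_ids[r["paper id"]].add(r["id"])

relevant_docs = {r["id"]: paper_to_ids[r["paper id"]] for r in test_rows}
print(relevant_docs)
```

Note that a query's own statement sits in its relevant set together with every sibling statement from the same paper. With many relevant documents per query, recall@k stays small even when the top result is correct.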

In [11]:
# Load train and test datasets from their respective JSON files
train_df = pd.read_json("train_dataset.json", orient="records", lines=True)
test_df  = pd.read_json("test_dataset.json",  orient="records", lines=True)

# Convert them to Hugging-Face Datasets

train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset  = Dataset.from_pandas(test_df,  preserve_index=False)



corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

# Convert datasets into dictionary format required by the InformationRetrievalEvaluator
corpus = dict(zip(corpus_dataset["id"], corpus_dataset["positive"]))
queries = dict(zip(test_dataset["id"],     test_dataset["anchor"]))



# Build the relevant-docs dictionary.
# IMPORTANT: all statements (theorems, lemmas, etc.) from the same paper are considered relevant
DOC_ID_COL = "paper id"

paper_to_ids = defaultdict(list)
for row in corpus_dataset:  # iterating a Dataset yields one dict per row
    paper_to_ids[row[DOC_ID_COL]].append(row["id"])

relevant_docs = {
    row["id"]: paper_to_ids[row[DOC_ID_COL]]
    for row in test_dataset
}



In [None]:

# Metrics and cut-offs to report
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    precision_recall_at_k=[3, 5, 30],
    ndcg_at_k=[10],
    mrr_at_k=[10],
    map_at_k=[100],
    accuracy_at_k=[1, 3, 5],
    relevant_docs=relevant_docs,
    name="DAPT",
    score_functions={"cosine": cos_sim},
)
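
To make the reported metrics concrete, here is a small sketch of how accuracy@k, precision@k and recall@k are commonly defined for a single query (the evaluator then averages them over all queries). The helper `metrics_at_k` and the toy ranking are hypothetical and are not part of the library.

```python
def metrics_at_k(ranked_ids, relevant_ids, k):
    """Accuracy@k, precision@k and recall@k for a single query."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return {
        "accuracy": 1.0 if hits > 0 else 0.0,  # at least one relevant doc in the top k
        "precision": hits / k,                 # share of the top k that is relevant
        "recall": hits / len(relevant_ids),    # share of all relevant docs retrieved
    }


# Hypothetical query with 20 relevant statements (a paper with many theorems).
relevant = set(range(20))
ranking = [0, 1, 99, 2, 98]  # made-up ranking: three relevant ids in the top 5
result = metrics_at_k(ranking, relevant, k=5)
print(result)
```

Because every statement of the same paper counts as relevant, the denominator of recall@k can be large, so recall stays low even for a ranking that looks good by accuracy and precision.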

In [None]:
# Evaluate the model
base_results = evaluator(model)
print(base_results)

W0531 10:04:47.439000 776 torch/_inductor/utils.py:1137] [1/0] Not enough SMs to use max_autotune_gemm mode


{'DAPT_cosine_accuracy@1': 0.1915089986155976, 'DAPT_cosine_accuracy@3': 0.23038763267189663, 'DAPT_cosine_accuracy@5': 0.24884633133364098, 'DAPT_cosine_precision@3': 0.11736655899092448, 'DAPT_cosine_precision@5': 0.08774803876326719, 'DAPT_cosine_precision@30': 0.027780341485925245, 'DAPT_cosine_recall@3': 0.014179030659348972, 'DAPT_cosine_recall@5': 0.016992025331513266, 'DAPT_cosine_recall@30': 0.028731105394634734, 'DAPT_cosine_ndcg@10': 0.08082287866528948, 'DAPT_cosine_mrr@10': 0.2167149650969443, 'DAPT_cosine_map@100': 0.01940228326573222}


In [None]:
base_results

{'DAPT_cosine_accuracy@1': 0.1915089986155976,
 'DAPT_cosine_accuracy@3': 0.23038763267189663,
 'DAPT_cosine_accuracy@5': 0.24884633133364098,
 'DAPT_cosine_precision@3': 0.11736655899092448,
 'DAPT_cosine_precision@5': 0.08774803876326719,
 'DAPT_cosine_precision@30': 0.027780341485925245,
 'DAPT_cosine_recall@3': 0.014179030659348972,
 'DAPT_cosine_recall@5': 0.016992025331513266,
 'DAPT_cosine_recall@30': 0.028731105394634734,
 'DAPT_cosine_ndcg@10': 0.08082287866528948,
 'DAPT_cosine_mrr@10': 0.2167149650969443,
 'DAPT_cosine_map@100': 0.01940228326573222}

Base results for DAPT without fine-tuning, WITH the normalization layer:
{'TESTING_cosine_accuracy@1': 0.1915089986155976,
 'TESTING_cosine_accuracy@3': 0.23038763267189663,
 'TESTING_cosine_accuracy@5': 0.24884633133364098,
 'TESTING_cosine_accuracy@10': 0.2779187817258883,
 'TESTING_cosine_precision@1': 0.1915089986155976,
 'TESTING_cosine_precision@3': 0.11736655899092448,
 'TESTING_cosine_precision@5': 0.08774803876326719,
 'TESTING_cosine_precision@10': 0.057095062298107985,
 'TESTING_cosine_recall@1': 0.008326191921750556,
 'TESTING_cosine_recall@3': 0.014179030659348972,
 'TESTING_cosine_recall@5': 0.016992025331513266,
 'TESTING_cosine_recall@10': 0.020932749229215983,
 'TESTING_cosine_ndcg@10': 0.08082287866528948,
 'TESTING_cosine_mrr@10': 0.2167149650969443,
 'TESTING_cosine_map@100': 0.019401976985438323}

Base results for DAPT without fine-tuning, WITHOUT the normalization layer:
{'TESTING_cosine_accuracy@1': 0.1915089986155976,
 'TESTING_cosine_accuracy@3': 0.23038763267189663,
 'TESTING_cosine_accuracy@5': 0.24884633133364098,
 'TESTING_cosine_accuracy@10': 0.2779187817258883,
 'TESTING_cosine_precision@1': 0.1915089986155976,
 'TESTING_cosine_precision@3': 0.11736655899092448,
 'TESTING_cosine_precision@5': 0.08774803876326719,
 'TESTING_cosine_precision@10': 0.057095062298107985,
 'TESTING_cosine_recall@1': 0.008326191921750556,
 'TESTING_cosine_recall@3': 0.014179030659348972,
 'TESTING_cosine_recall@5': 0.016992025331513266,
 'TESTING_cosine_recall@10': 0.020932749229215983,
 'TESTING_cosine_ndcg@10': 0.08082287866528948,
 'TESTING_cosine_mrr@10': 0.2167149650969443,
 'TESTING_cosine_map@100': 0.01940228326573222}


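The results are the same, as expected (MAP@100 differs only by floating-point noise): cosine similarity already divides by the embedding norms, so the normalization layer cannot change the ranking. Normalization can only matter where raw dot products or distances are used, for example inside a loss function.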
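
A quick numerical check of this invariance, as a sketch with two made-up vectors in plain Python (the helper names `cosine` and `l2_normalize` are ours, not library functions):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_normalize(u):
    """Scale a vector to unit length, as a normalization layer would."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


query = [3.0, 4.0, 0.0]  # made-up embedding
doc = [1.0, 2.0, 2.0]    # made-up embedding

raw = cosine(query, doc)
normalized = cosine(l2_normalize(query), l2_normalize(doc))
print(raw, normalized)
```

Both values agree up to floating-point rounding, which is why the evaluator reports the same metrics with and without the normalization layer.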

---
## Training



### DAPT MODERNBERT
Now we fine-tune ModernBERT.

In [None]:

# Load the DAPT model with PyTorch SDPA attention
torch.set_float32_matmul_precision("high")

repo_id = "Master-thesis-NAP/ModernBert-DAPT-math"

# Assumption: the attention implementation has to be set on the Transformer module via model_args;
# model_kwargs on SentenceTransformer appears to be used only when loading from a model name
word_model = models.Transformer(repo_id, model_args={"attn_implementation": "sdpa"})
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)
normalize = Normalize()
model = SentenceTransformer(
    modules=[word_model, pooling, normalize],
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="ModernBERT DAPT Embed DAPT Math",
    ),
)


In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [None]:
#Loss function:
train_loss = MultipleNegativesRankingLoss(model)
#train_loss = TripletLoss(model,distance_metric=TripletDistanceMetric.COSINE, triplet_margin = 0.1)


In [None]:
# Training Arguments
args = SentenceTransformerTrainingArguments(
    output_dir="/content/drive/Shareddrives/Master_Thesis/Models/FT_math",
    num_train_epochs=4,                                        # number of epochs
    per_device_train_batch_size=16,                            # train batch size
    gradient_accumulation_steps=8,                             # effective batch size: 16 * 8 = 128
    per_device_eval_batch_size=16,                             # evaluation batch size
    warmup_ratio=0.1,                                          # warmup ratio
    learning_rate=2e-5,                                        # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                                # use cosine learning rate scheduler
    optim="adamw_torch_fused",                                 # use fused adamw optimizer
    tf32=True,                                                 # use tf32 precision
    bf16=True,                                                 # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,                 # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                                     # evaluate after each epoch
    save_strategy="epoch",                                     # save after each epoch
    logging_steps=10,                                          # log every 10 steps
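    save_total_limit=3,                                        # save only the last 3 models
    load_best_model_at_end=True,                               # load the best model when training ends
    metric_for_best_model="eval_TESTING_cosine_ndcg@10",       # key must match the evaluator name: eval_<name>_cosine_ndcg@10
    report_to="none"                                           # no external training logging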
)

Finally, we package our model, training arguments, dataset, loss function and evaluator together into a `SentenceTransformerTrainer`.

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.select_columns(
        [ "anchor","positive"]
    ),
    loss=train_loss,
    evaluator=evaluator,
)

Before starting the training run, we estimate the number of optimizer steps.

In [None]:


total_examples = len(train_dataset)
global_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
steps_per_epoch = math.ceil(total_examples / global_batch_size)
total_steps = steps_per_epoch * args.num_train_epochs

print(f"Estimated steps per epoch: {steps_per_epoch}")
print(f"Estimated total steps: {total_steps}")


In [None]:

# Set memory management behavior
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Clean up any old memory references
torch.cuda.empty_cache()
torch.cuda.ipc_collect()




In [None]:
# Start training
trainer.train()

# Save the best model based on our  criteria
trainer.save_model('/content/drive/Shareddrives/Master_Thesis/Models/FT_DAPT_MB/O-2')

W0527 15:59:52.014000 1046 torch/_inductor/utils.py:1137] [1/0] Not enough SMs to use max_autotune_gemm mode


| Epoch | Training Loss | Validation Loss | Testing Cosine Accuracy@1 | Testing Cosine Accuracy@3 | Testing Cosine Accuracy@5 | Testing Cosine Accuracy@10 | Testing Cosine Precision@1 | Testing Cosine Precision@3 | Testing Cosine Precision@5 | Testing Cosine Precision@10 | Testing Cosine Recall@1 | Testing Cosine Recall@3 | Testing Cosine Recall@5 | Testing Cosine Recall@10 | Testing Cosine NDCG@10 | Testing Cosine MRR@10 | Testing Cosine MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.2391 | No log | 0.838025 | 0.890632 | 0.909552 | 0.929626 | 0.838025 | 0.579219 | 0.464213 | 0.323846 | 0.040315 | 0.07856 | 0.101057 | 0.132426 | 0.422884 | 0.868531 | 0.150034 |
| 2 | 0.1081 | No log | 0.858329 | 0.909552 | 0.927434 | 0.944855 | 0.858329 | 0.600177 | 0.481864 | 0.338371 | 0.041394 | 0.08159 | 0.104801 | 0.138141 | 0.439159 | 0.887392 | 0.157644 |
| 3 | 0.0278 | No log | 0.86802 | 0.91832 | 0.93251 | 0.949585 | 0.86802 | 0.611867 | 0.493539 | 0.347589 | 0.041867 | 0.083153 | 0.107391 | 0.142074 | 0.449327 | 0.896366 | 0.163769 |


In [None]:
# from DAPT:
evaluator(model)

{'TESTING_cosine_accuracy@1': 0.868020304568528,
 'TESTING_cosine_accuracy@3': 0.9183202584217812,
 'TESTING_cosine_accuracy@5': 0.9325103830179973,
 'TESTING_cosine_accuracy@10': 0.9495846792801107,
 'TESTING_cosine_precision@1': 0.868020304568528,
 'TESTING_cosine_precision@3': 0.6118674050146131,
 'TESTING_cosine_precision@5': 0.49353945546838945,
 'TESTING_cosine_precision@10': 0.34758883248730965,
 'TESTING_cosine_recall@1': 0.04186710795480722,
 'TESTING_cosine_recall@3': 0.08315252408701693,
 'TESTING_cosine_recall@5': 0.1073909448198794,
 'TESTING_cosine_recall@10': 0.14207392775097807,
 'TESTING_cosine_ndcg@10': 0.4493273991613623,
 'TESTING_cosine_mrr@10': 0.8963655316764447,
 'TESTING_cosine_map@100': 0.16376932233660765}

In [None]:
output_dir='/content/drive/Shareddrives/Master_Thesis/Models/FT_DAPT_MB/O-4'
trainer.save_model(output_dir)

In [None]:
# Upload model to hub
trainer.model.push_to_hub("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math-v2")

'https://huggingface.co/Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math-v2/commit/f391a3df8fb1cefdccbfd951b412239241d7a3f3'

---
## Evaluating Trained Model

In [None]:

repo_id = "Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math-v2"

word_model = models.Transformer(repo_id )
pooling = models.Pooling(
    word_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# Build a SentenceTransformer model
model = SentenceTransformer(modules=[word_model, pooling])


In [None]:
ft_results=evaluator(model)

In [None]:
ft_results

{'DAPT_cosine_accuracy@1': 0.8685971389017074,
 'DAPT_cosine_accuracy@3': 0.9185509921550531,
 'DAPT_cosine_accuracy@5': 0.9329718504845408,
 'DAPT_cosine_precision@3': 0.6120596831256729,
 'DAPT_cosine_precision@5': 0.4931702814951546,
 'DAPT_cosine_precision@30': 0.17400399938471006,
 'DAPT_cosine_recall@3': 0.0831579210570278,
 'DAPT_cosine_recall@5': 0.1073902419834836,
 'DAPT_cosine_recall@30': 0.19437924561958866,
 'DAPT_cosine_ndcg@10': 0.44907151448652377,
 'DAPT_cosine_mrr@10': 0.8965814325268632,
 'DAPT_cosine_map@100': 0.16377609366370482}

In [None]:
df_unseen = Dataset.from_pandas(df_unseen_test, preserve_index=False)
df_unseen = df_unseen.rename_column("question", "anchor")
df_unseen = df_unseen.rename_column("content", "positive")
df_unseen = df_unseen.remove_columns([ "title", "type"])


corpus_dataset_unseen = concatenate_datasets([train_dataset, test_dataset, df_unseen])
corpus_unseen = dict(zip(corpus_dataset_unseen["id"], corpus_dataset_unseen["positive"]))
queries_unseen = dict(zip(df_unseen["id"],     df_unseen["anchor"]))

DOC_ID_COL = "paper id"

paper_to_ids = defaultdict(list)
for row in corpus_dataset_unseen:  # iterating a Dataset yields one dict per row
    paper_to_ids[row[DOC_ID_COL]].append(row["id"])


relevant_docs_unseen = {
    row["id"]: paper_to_ids[row[DOC_ID_COL]]
    for row in df_unseen
}
evaluator_unseen = InformationRetrievalEvaluator(
    queries=queries_unseen,
    corpus=corpus_unseen,
    precision_recall_at_k=[3, 5,30],
    ndcg_at_k=[10],
    mrr_at_k=[10],
    map_at_k=[100],
    accuracy_at_k=[1, 3, 5],
    relevant_docs=relevant_docs_unseen,
    name="TESTING",
    score_functions={"cosine": cos_sim},
)



In [None]:
unseen_results= evaluator_unseen(model)

In [None]:
unseen_results

{'TESTING_cosine_accuracy@1': 0.8238095238095238,
 'TESTING_cosine_accuracy@3': 0.8666666666666667,
 'TESTING_cosine_accuracy@5': 0.8952380952380953,
 'TESTING_cosine_precision@3': 0.611111111111111,
 'TESTING_cosine_precision@5': 0.4885714285714286,
 'TESTING_cosine_precision@30': 0.14761904761904762,
 'TESTING_cosine_recall@3': 0.13095238095238093,
 'TESTING_cosine_recall@5': 0.17448979591836733,
 'TESTING_cosine_recall@30': 0.31632653061224497,
 'TESTING_cosine_ndcg@10': 0.4250808487003156,
 'TESTING_cosine_mrr@10': 0.8534882842025697,
 'TESTING_cosine_map@100': 0.2553838491816305}

---
## Base vs FT Comparison
### 📊 Performance Improvement from Base to Fine-Tuned Model

| Metric                     | Base Value | Fine-Tuned Value | % Improvement |
|---------------------------|------------|------------------|----------------|
| **Accuracy@1**            | 0.1915     | 0.8680           | 353.25%        |
| **Accuracy@3**            | 0.2304     | 0.9183           | 298.60%        |
| **Accuracy@5**            | 0.2488     | 0.9325           | 274.73%        |
| **Accuracy@10**           | 0.2779     | 0.9496           | 241.68%        |
| **Precision@1**           | 0.1915     | 0.8680           | 353.25%        |
| **Precision@3**           | 0.1174     | 0.6119           | 421.33%        |
| **Precision@5**           | 0.0877     | 0.4935           | 462.45%        |
| **Precision@10**          | 0.0571     | 0.3476           | 508.79%        |
| **Recall@1**              | 0.0083     | 0.0419           | 402.84%        |
| **Recall@3**              | 0.0142     | 0.0832           | 486.45%        |
| **Recall@5**              | 0.0170     | 0.1074           | 532.01%        |
| **Recall@10**             | 0.0209     | 0.1421           | 578.72%        |
| **NDCG@10**               | 0.0808     | 0.4493           | 455.94%        |
| **MRR@10**                | 0.2167     | 0.8964           | 313.61%        |
| **MAP@100**               | 0.0194     | 0.1638           | 744.07%        |

Percentages are relative changes computed from the unrounded `TESTING` metric values reported above.
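
We read the last column as the relative change, (fine-tuned - base) / base × 100. That formula is our assumption about how the table was produced; the helper `relative_improvement` below is hypothetical and only illustrates the arithmetic on the Accuracy@1 values reported earlier in this notebook.

```python
def relative_improvement(base, fine_tuned):
    """Relative change from base to fine-tuned, in percent."""
    return (fine_tuned - base) / base * 100.0


base_acc1 = 0.1915089986155976  # Accuracy@1 of the base model
ft_acc1 = 0.868020304568528     # Accuracy@1 after fine-tuning

print(f"{relative_improvement(base_acc1, ft_acc1):.2f}%")
```

Small differences in the last digits can come from rounding the inputs or from which evaluation run the values are taken.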


In [None]:
%%capture
!pip install --upgrade sentence-transformers
!pip install git+https://github.com/huggingface/transformers

In [None]:
#some visual intuition.

model = SentenceTransformer("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math")

sentences = [
    "What is the error estimate for the difference between the exact solution and the local oscillation decomposition (LOD) solution in terms of the $L_0$ norm?",
    "\\label{RL1}\nThe system \\eqref{R3} has the following positive fixed points if $0 <\\alpha\\leq1$ and $b>d$\n$$E^*=\\left(\\dfrac{d}{b}, \\dfrac{(b-d) r}{b^2}\\right)$$",
    "\\label{theo1d}\nWith the assumptions and setting is this section,  the finite difference solution  computed using the improved harmonic average method applied to \\eqn{eq1d} or \\eqn{eq1dB}  has second order convergence in the infinity norm, that is,\n\\eqm\n  \\|\\mathbf{E} \\|_{\\infty}\\le C h^2,\n\\enm\nassuming that the true solution of \\eqn{eq1d} is piecewise $C^4$ excluding the interface $\\alf$, that is, \n$u(x) \\in C^4(0,\\alf)  \\cup C^4(\\alf,1)$. \n%where $C$ is a generic error constant.",
    "\\label{Corollary}\n     Let Assumptions~\\ref{assum_1} and~\\ref{assump2} be satisfied. Let $u$ be the solution of~\\eqref{WeakForm} and let $u_{H,k}$ be the LOD solution of~\\eqref{local_probelm }. Then we have \n     \\begin{equation}\\label{L2Estimate}\n         \\|u-I_Hu_{H,k}\\|_0\\lesssim  \\|u-I_Hu\\|_0+\\|u-u_{H,k}\\|_0 +H|u-u_{H,k}|_1.\n     \\end{equation}\n     %\\[\\|u-I_Hu_{H,k}\\|_0\\lesssim H |u|_1 +|u-u_{H,k}|_1.\\]"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

torch.Size([4, 4])


In [None]:
similarities[0]

tensor([1.0000, 0.3755, 0.3706, 0.6477])
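
To read this row: index 0 is the question compared with itself, and indices 1 to 3 are the three candidate statements. A small sketch in plain Python, with the values copied from the output above, ranks the candidates:

```python
# First row of the similarity matrix: the question vs. all four sentences.
row = [1.0000, 0.3755, 0.3706, 0.6477]

# Skip index 0 (the question compared with itself) and rank the three statements.
ranked = sorted(range(1, len(row)), key=lambda i: row[i], reverse=True)
print(ranked)
```

The top-ranked index is 3, the corollary about the LOD solution and the $\|\cdot\|_0$ estimate, which is the statement the question asks about.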

In [None]:
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities[0])

For comparison, output from our base model nomic-ai/modernbert-embed-base: `tensor([1.0000, 0.6490, 0.4759])`



---
## Evaluation of MathBERT
Check the base results with MathBERT.



In [None]:

# Load tokenizer and model
mathbert_id = "tbs17/MathBERT"
tokenizer = AutoTokenizer.from_pretrained(mathbert_id)
word_model_mathbert = models.Transformer(mathbert_id)

pooling = models.Pooling(
    word_model_mathbert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

# Combine into SentenceTransformer model
model_mathbert = SentenceTransformer(modules=[word_model_mathbert, pooling])



In [None]:
base_results_mathbert = evaluator(model_mathbert)

In [None]:
print(base_results_mathbert)

{'TESTING_cosine_accuracy@1': 0.420973696354407, 'TESTING_cosine_accuracy@3': 0.48673281033687127, 'TESTING_cosine_accuracy@5': 0.5171896631287495, 'TESTING_cosine_accuracy@10': 0.5597600369173973, 'TESTING_cosine_precision@1': 0.420973696354407, 'TESTING_cosine_precision@3': 0.2848023381018305, 'TESTING_cosine_precision@5': 0.22634979233964006, 'TESTING_cosine_precision@10': 0.15647208121827413, 'TESTING_cosine_recall@1': 0.018471265713603746, 'TESTING_cosine_recall@3': 0.03498644320395509, 'TESTING_cosine_recall@5': 0.044857397602586536, 'TESTING_cosine_recall@10': 0.05867424903717985, 'TESTING_cosine_ndcg@10': 0.20608851025347755, 'TESTING_cosine_mrr@10': 0.46289783256788425, 'TESTING_cosine_map@100': 0.060336401234208074}


In [None]:
base_results_mathbert

{'TESTING_cosine_accuracy@1': 0.420973696354407,
 'TESTING_cosine_accuracy@3': 0.48673281033687127,
 'TESTING_cosine_accuracy@5': 0.5171896631287495,
 'TESTING_cosine_accuracy@10': 0.5597600369173973,
 'TESTING_cosine_precision@1': 0.420973696354407,
 'TESTING_cosine_precision@3': 0.2848023381018305,
 'TESTING_cosine_precision@5': 0.22634979233964006,
 'TESTING_cosine_precision@10': 0.15647208121827413,
 'TESTING_cosine_recall@1': 0.018471265713603746,
 'TESTING_cosine_recall@3': 0.03498644320395509,
 'TESTING_cosine_recall@5': 0.044857397602586536,
 'TESTING_cosine_recall@10': 0.05867424903717985,
 'TESTING_cosine_ndcg@10': 0.20608851025347755,
 'TESTING_cosine_mrr@10': 0.46289783256788425,
 'TESTING_cosine_map@100': 0.060336401234208074}

In [None]:

# Initial Loss
train_loss = MultipleNegativesRankingLoss(model_mathbert)

In [None]:
# Training Arguments
args = SentenceTransformerTrainingArguments(
    output_dir="/content/drive/Shareddrives/Master_Thesis/Models/FT_math", # output directory
    num_train_epochs=4,                                        # number of epochs
    per_device_train_batch_size=16,                            # train batch size
    gradient_accumulation_steps=8,                            # for an effective batch size of 16 * 8 = 128
    per_device_eval_batch_size=16,                             # evaluation batch size
    warmup_ratio=0.1,                                          # warmup ratio
    learning_rate=2e-5,                                        # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                                # use cosine learning rate scheduler
    optim="adamw_torch_fused",                                 # use fused adamw optimizer
    tf32=True,                                                 # use tf32 precision
    bf16=True,                                                 # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,                 # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                                     # evaluate after each epoch
    save_strategy="epoch",                                     # save after each epoch
    logging_steps=10,                                          # log every 10 steps
    save_total_limit=3,                                        # save only the last 3 models
    load_best_model_at_end=True,                               # load the best model when training ends
    metric_for_best_model="eval_DAPT_cosine_ndcg@10",       # select the best checkpoint by NDCG@10 (key must match the evaluator name)
    report_to="none"                                           # Turning off training logging for now, input 'wandb' etc. if desired.
)

In [None]:
trainer = SentenceTransformerTrainer(
    model=model_mathbert,
    args=args,
    train_dataset=train_dataset.select_columns(
        [ "anchor","positive"]
    ),
    loss=train_loss,
    evaluator=evaluator,
)

In [None]:


total_examples = len(train_dataset)
global_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
steps_per_epoch = math.ceil(total_examples / global_batch_size)
total_steps = steps_per_epoch * args.num_train_epochs

print(f"Estimated steps per epoch: {steps_per_epoch}")
print(f"Estimated total steps: {total_steps}")


Estimated steps per epoch: 625
Estimated total steps: 2500
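
The estimate above can be reproduced by hand. The sketch below repeats the arithmetic with the numbers from this notebook (79,876 training rows, batch size 16, 8 accumulation steps, 4 epochs) and assumes a single GPU; the Trainer's own step count may differ slightly depending on how it handles the last partial accumulation.

```python
import math

num_examples = 79876             # training rows after dropping missing negatives
per_device_batch_size = 16
gradient_accumulation_steps = 8
num_epochs = 4

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

Note that gradient accumulation enlarges the optimizer batch to 128, but, as far as we know, the in-batch negatives used by `MultipleNegativesRankingLoss` still come only from the 16 examples of each forward pass.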


### FT mathbert

In [None]:
# Start training
trainer.train()

# Save the best model based on our  criteria
trainer.save_model('/content/drive/Shareddrives/Master_Thesis/Models/FT_math')

| Epoch | Training Loss | Validation Loss | Dapt Cosine Accuracy@1 | Dapt Cosine Accuracy@3 | Dapt Cosine Accuracy@5 | Dapt Cosine Precision@3 | Dapt Cosine Precision@5 | Dapt Cosine Precision@30 | Dapt Cosine Recall@3 | Dapt Cosine Recall@5 | Dapt Cosine Recall@30 | Dapt Cosine NDCG@10 | Dapt Cosine MRR@10 | Dapt Cosine MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1324 | No log | 0.744924 | 0.820374 | 0.84737 | 0.509268 | 0.404868 | 0.144593 | 0.068046 | 0.08684 | 0.160039 | 0.370932 | 0.788887 | 0.128413 |
| 2 | 0.1716 | No log | 0.766844 | 0.836179 | 0.86156 | 0.524342 | 0.418551 | 0.149427 | 0.07048 | 0.089756 | 0.1652 | 0.383212 | 0.807441 | 0.133615 |
| 3 | 0.0916 | No log | 0.783802 | 0.852792 | 0.878057 | 0.540878 | 0.433479 | 0.155649 | 0.072686 | 0.09321 | 0.171996 | 0.395622 | 0.823723 | 0.140295 |
| 4 | 0.0639 | No log | 0.787494 | 0.857176 | 0.879903 | 0.544532 | 0.436594 | 0.157107 | 0.073302 | 0.09409 | 0.173688 | 0.398593 | 0.827123 | 0.142188 |


In [None]:
# Results:
ft_math_results=evaluator(model_mathbert)

In [None]:
ft_math_results

{'DAPT_cosine_accuracy@1': 0.7874942316566682,
 'DAPT_cosine_accuracy@3': 0.8571758191047532,
 'DAPT_cosine_accuracy@5': 0.8799030918320259,
 'DAPT_cosine_precision@3': 0.5445316105214583,
 'DAPT_cosine_precision@5': 0.4365943700969082,
 'DAPT_cosine_precision@30': 0.15710659898477158,
 'DAPT_cosine_recall@3': 0.07330216839356178,
 'DAPT_cosine_recall@5': 0.09409036887768508,
 'DAPT_cosine_recall@30': 0.1736880419786027,
 'DAPT_cosine_ndcg@10': 0.3985931936035165,
 'DAPT_cosine_mrr@10': 0.8271233912731359,
 'DAPT_cosine_map@100': 0.14218839668722377}