# Model Training on Academic Dataset

In this notebook, we fine-tune two base models on arXiv research data: one for summarization and one for semantic search.

## Summarization

For summarization, we use `google/flan-t5-small` as the base model.
Training runs on a single T4 GPU.

We chose `google/flan-t5-small` for several reasons:
1. It is an instruction-fine-tuned version of T5 (Text-to-Text Transfer Transformer).
2. Its instruction tuning already covers summarization-style tasks, so it tends to perform better than a plain T5 model of the same size and needs less fine-tuning.
3. It is an encoder-decoder (Seq2Seq) model, which suits summarization: the encoder reads the article and the decoder generates the summary.
4. It is lightweight at roughly 80 million parameters, so it needs less GPU memory and trains and runs inference quickly. For this project it offers a good balance of speed, memory usage, and accuracy.
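
To put the "lightweight" point in rough numbers, the sketch below estimates the memory needed for weights, gradients, and optimizer state during training. It assumes full-precision (fp32) training and an Adam-style optimizer that stores two extra values per parameter; activation memory is not counted. `estimate_training_memory_gb` is a hypothetical helper written for this illustration, not a library function.

```python
def estimate_training_memory_gb(num_params: int, bytes_per_value: int = 4) -> dict:
    """Rough memory estimate in GB (1 GB = 1e9 bytes), excluding activations."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    # Assumption: Adam-style optimizer keeping two moment estimates per parameter.
    optimizer_state = 2 * num_params * bytes_per_value
    total = weights + gradients + optimizer_state
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": gradients / 1e9,
        "optimizer_gb": optimizer_state / 1e9,
        "total_gb": total / 1e9,
    }


# Approximate parameter count quoted above for flan-t5-small.
estimate = estimate_training_memory_gb(80_000_000)
print(estimate)
```

Under these assumptions the fixed cost is a little over 1 GB, which leaves most of a T4's 16 GB for activations and batches.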


## Semantic Search
For semantic search, we fine-tune the `sentence-transformers/all-MiniLM-L6-v2` model on pairs of similar papers. During training, the loss is computed from the similarity scores between the paired texts, so the model learns what similarity means for research and academic papers.

We chose `sentence-transformers/all-MiniLM-L6-v2` for several reasons:
1. Unlike general-purpose transformers such as BERT, sentence-transformers models are already trained for sentence similarity tasks.
2. It has about 22 million parameters, which makes it lightweight and fast.
3. Academic papers use complex terminology, so we start from a model that already produces useful sentence- and paragraph-level embeddings and then adapt it to this domain.
4. It encodes text into dense vectors, so its output can be used directly with vector search tools such as FAISS, Pinecone, and Elasticsearch.
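
As a rough illustration of what "dense vectors" buy us, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made up for readability, and `rank_by_similarity` is a hypothetical helper; a vector search library performs the same kind of comparison at much larger scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real all-MiniLM-L6-v2 vectors have 384 dimensions.
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
results = rank_by_similarity(query, docs)
print(results)
```

The first document points in almost the same direction as the query, so it ranks first; the third is orthogonal to the query and drops out of the top two.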

## Methodology

1. Both models are fine-tuned on academic data for the specific use case of summarizing and searching academic papers.
2. The fine-tuned models are compared with their base models to check that training had an effect.
3. The fine-tuned model files are pushed to a GitHub repo, and the application loads these models directly in its code.

## Step 1 - Install and Import Libraries

In [None]:
!pip install accelerate evaluate datasets faiss-cpu pandas rouge_score sentence-transformers tensorboard torch transformers

In [None]:
# Standard Library Imports (Built-in Python Modules)
import os
import sys
import json
import logging
import random
import re

# Third-Party Libraries (External Dependencies)
import numpy as np
import pandas as pd
import torch
import faiss

# Google Colab Utilities
from google.colab import drive

# Hugging Face Libraries (Datasets, Transformers, Trainer)
from datasets import load_dataset, DatasetDict
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Trainer,
    DataCollatorForSeq2Seq,
    TrainingArguments
)

# Sentence Transformers (Semantic Search)
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

## Step 2 - Environment Setup

1. Check GPU availability
2. Load the search and summarization models
3. Load the dataset for both models

In [None]:
# Confirm whether a GPU is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("CUDA available:", torch.cuda.is_available())
if device == "cuda":
    print(torch.cuda.get_device_name(0))

os.environ["WANDB_DISABLED"] = "true"

# load the pre-trained, general-purpose SBERT model (it is fine-tuned on academic papers later)
search_model = "sentence-transformers/all-MiniLM-L6-v2"
sbert_model = SentenceTransformer(search_model, device=device)

# load transformer model and tokenizer
summarization_model = "google/flan-t5-small"
# The tokenizer is responsible for converting human-readable text into numerical representations (tokens) that a model can understand.
tokenizer = T5Tokenizer.from_pretrained(summarization_model)
t5_model = T5ForConditionalGeneration.from_pretrained(summarization_model).to(device)


# load academic papers dataset
dataset = load_dataset("scientific_papers", "arxiv")
print("Dataset splits:", dataset.keys())
print(dataset)

# **Training the Summarization Model**

---



## Step 1 - Preprocessing the Dataset (summarization model)

1. Select a subset of the dataset for training, validation, and testing.
2. Define a preprocessing function that tokenizes each sample into `input_ids`, `attention_mask`, and `labels` for training the summarization model.
3. Build `tokenized_dataset` by applying the function to each split.
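
The three fields named above can be illustrated without a tokenizer. The sketch below uses made-up token ids and hypothetical helper names to show how padding produces the attention mask and how padded label positions are replaced with -100 so the loss skips them.

```python
PAD_ID = 0           # T5 tokenizers use 0 as the padding id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def mask_label_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions in the labels so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


# Hypothetical token ids standing in for a tokenized article and abstract.
input_ids, attention_mask = pad_and_mask([71, 1040, 30, 1], max_length=6)
label_ids, _ = pad_and_mask([3, 9, 1], max_length=5)
labels = mask_label_padding(label_ids)
print(input_ids, attention_mask, labels)
```

Real token positions get a mask value of 1 and padding gets 0; in the labels, only the padding positions become -100.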

In [None]:
training_size = 2000  # adjust as needed
validation_size = 500
test_size = 500

# select small subsets from train, validation, and test
small_train_dataset = dataset["train"].select(range(min(training_size, len(dataset["train"]))))
small_validation_dataset = dataset["validation"].select(range(min(validation_size, len(dataset["validation"]))))
small_test_dataset = dataset["test"].select(range(min(test_size, len(dataset["test"]))))

In [None]:
# tokenize articles and abstracts; replace padding in the labels with -100 so the loss ignores it
def preprocess_function(samples):
    inputs = tokenizer(["summarize: " + doc for doc in samples["article"]],
                        max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(samples["abstract"],
                        max_length=64, truncation=True, padding="max_length")

    # Convert PAD tokens to -100 for loss computation
    labels["input_ids"] = np.where(
        np.array(labels["input_ids"]) == tokenizer.pad_token_id, -100, np.array(labels["input_ids"])
    ).tolist()


    return {
        "input_ids": inputs["input_ids"],
        # attention_mask marks real tokens as 1 and padding tokens to ignore as 0. This helps the model ignore any padding tokens.
        "attention_mask": inputs["attention_mask"],
        "labels": labels["input_ids"]
    }

In [None]:
# Tokenize datasets
tokenized_dataset = {
    "train": small_train_dataset.map(preprocess_function, batched=True, remove_columns=["article", "abstract", "section_names"]),
    "validation": small_validation_dataset.map(preprocess_function, batched=True, remove_columns=["article", "abstract", "section_names"]),
    "test": small_test_dataset.map(preprocess_function, batched=True, remove_columns=["article", "abstract", "section_names"]),
}

# Convert to DatasetDict format
tokenized_dataset = DatasetDict(tokenized_dataset)

# Set format for PyTorch training
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Print dataset sizes
print(f"Train size: {len(tokenized_dataset['train'])}")
print(f"Validation size: {len(tokenized_dataset['validation'])}")
print(f"Test size: {len(tokenized_dataset['test'])}")

## Step 2 - Train the Summarization Model
1. Use a custom trainer to log runs and metrics for training loss, GPU consumption, validation loss and learning rate.
2. Train the summarization model
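
How often metrics are logged depends on how many optimizer steps the run takes. The sketch below works this out from the settings used in this notebook (2000 training examples, batch size 4, gradient accumulation 4, 10 epochs, logging every 50 steps). `training_schedule` is a hypothetical helper, and the rounding rule is an assumption; the exact step count reported by the trainer may differ slightly when the sizes do not divide evenly.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, logging_steps, num_devices=1):
    """Derive effective batch size and step counts from the training settings."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: one optimizer step per `grad_accum_steps` batches, rounding up.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "logged_points": total_steps // logging_steps,
    }


# Settings from this notebook, assuming a single GPU.
schedule = training_schedule(2000, 4, 4, 10, 50)
print(schedule)
```

With these settings each weight update sees 16 examples, an epoch is 125 optimizer steps, and the full run produces 25 training-loss points.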

### Goal

> Training and validation loss should decrease with every epoch.
>
> The final loss should be below or close to 3.
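
The goal above can be checked programmatically once training finishes. The sketch below is one possible check, assuming log entries shaped like the dictionaries in `trainer.state.log_history`; `check_loss_goal` is a hypothetical helper, the threshold of 3.0 is a strict reading of "close to 3", and the example numbers are invented.

```python
def check_loss_goal(log_history, max_final_loss=3.0):
    """Check that eval loss falls after every epoch and ends at or below the target."""
    eval_losses = [entry["eval_loss"] for entry in log_history if "eval_loss" in entry]
    if not eval_losses:
        return {"decreasing": False, "final_ok": False, "eval_losses": []}
    decreasing = all(later < earlier for earlier, later in zip(eval_losses, eval_losses[1:]))
    return {
        "decreasing": decreasing,
        "final_ok": eval_losses[-1] <= max_final_loss,
        "eval_losses": eval_losses,
    }


# Hypothetical log entries; the numbers are made up for illustration.
example_history = [
    {"loss": 3.9, "step": 50},
    {"eval_loss": 3.4, "epoch": 1.0},
    {"loss": 3.3, "step": 150},
    {"eval_loss": 3.1, "epoch": 2.0},
    {"eval_loss": 2.9, "epoch": 3.0},
]
report = check_loss_goal(example_history)
print(report)
```

After a real run, the same function could be called as `check_loss_goal(trainer.state.log_history)`.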

In [None]:
training_args = TrainingArguments(
    output_dir="./model_checkpoints/summarization_model",
    eval_strategy="epoch",  # Validate after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_strategy="steps",  # ✅ Log training loss regularly
    logging_steps=50,  # ✅ Log every 50 steps
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    weight_decay=0.01,
    max_grad_norm=1.0,
    fp16=False,
    save_total_limit=2,
    report_to="tensorboard",
)

# Data collator for batch padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=t5_model)

In [None]:
# ✅ Create a SummaryWriter instance to store logs
writer = SummaryWriter("./model_checkpoints/runs/summarization_model_training")

class CustomTrainer(Trainer):
    def log(self, logs, *args, **kwargs):
        # Forward any extra arguments (e.g. start_time in newer transformers versions)
        super().log(logs, *args, **kwargs)

        # `logs` does not contain the step, so read it from the trainer state
        step = self.state.global_step

        # ✅ Log additional metrics
        if "loss" in logs:
            writer.add_scalar("Training Loss", logs["loss"], step)
        if "eval_loss" in logs:
            writer.add_scalar("Validation Loss", logs["eval_loss"], step)
        if "learning_rate" in logs:
            writer.add_scalar("Learning Rate", logs["learning_rate"], step)
        if torch.cuda.is_available():
            writer.add_scalar("GPU Memory (MB)", torch.cuda.memory_allocated() / (1024 * 1024), step)

        # ✅ Save logs in real-time
        writer.flush()

trainer = CustomTrainer(
    model=t5_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # ✅ Use the small train dataset
    eval_dataset=tokenized_dataset["validation"],  # ✅ Use the small validation dataset
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
trainer.train()

## Step 3 - Evaluate the trained summarization model

Check how well the model summarizes by evaluating it on the test dataset and inspecting a generated summary.
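
The install cell includes `rouge_score` and `evaluate`, but the evaluation in this step reports only the loss. As a rough, dependency-free stand-in, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap, no stemming). It is not the `rouge_score` implementation, `rouge1_f1` is a hypothetical helper, and the example texts are invented.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference summary and a generated one."""
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    cand_tokens = re.findall(r"[a-z0-9]+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference abstract and generated summary.
reference = "we propose a fast method for sparse matrix factorization"
candidate = "a fast method for matrix factorization is proposed"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Averaging such a score over the test split, for both the base and the fine-tuned model, would give a summary-quality comparison to sit alongside the loss.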

In [None]:
trainer.evaluate(tokenized_dataset["test"])

In [None]:
def generate_summary(text):
    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    inputs = {k: v.to(t5_model.device) for k, v in inputs.items()}  # move inputs to the model's device (GPU if available)

    summary_ids = t5_model.generate(
        **inputs,
        max_length=256,
        min_length=100,
        num_beams=5,  # use beam search to improve quality
        repetition_penalty=2.0,  # discourages repeated phrases
        length_penalty=1.0,  # exponent applied to sequence length when scoring beams (1.0 is the default)
        early_stopping=True  # stops beam search once num_beams complete candidates are found
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# test with a sample
sample_text = dataset["test"][9]["article"]
print("\nOriginal Article:", sample_text[:500])  # Print first 500 chars
print("\nGenerated Summary:", generate_summary(sample_text))

## Step 4 - Save the fine-tuned summarization model

In [None]:
t5_model.save_pretrained("./model_checkpoints/fine_tuned_summarization_model")
tokenizer.save_pretrained("./model_checkpoints/fine_tuned_summarization_model")

# **Training the Search Model**

---

## Step 1 - Preprocessing for Search Model

1. Fetch the arXiv metadata from Google Drive.
2. Truncate the data to a subset of up to 20,000 records (set by `N` in the code).
3. Load the truncated data into a dataframe.
4. Clean the text by removing newlines, extra spaces, and special characters.
5. Prepare the training dataset by pairing academic papers that belong to the same category.
6. Manually check that the pairs are in fact similar.
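
The pairing step can be illustrated on toy data. The sketch below uses invented records and a hypothetical `build_pairs` helper to show that grouping is by the exact category string, so a paper listed under several categories only pairs with papers carrying the identical list.

```python
from collections import defaultdict


def build_pairs(papers):
    """Pair each paper with the next one that shares its exact category string."""
    by_category = defaultdict(list)
    for paper in papers:
        by_category[paper["categories"]].append(paper["title"])
    pairs = []
    for titles in by_category.values():
        # A category with n papers yields n - 1 neighbouring pairs.
        pairs.extend(zip(titles, titles[1:]))
    return pairs


# Hypothetical records using the same field names as the arXiv metadata.
toy_papers = [
    {"title": "Paper A", "categories": "hep-ph"},
    {"title": "Paper B", "categories": "hep-ph"},
    {"title": "Paper C", "categories": "cs.LG"},
    {"title": "Paper D", "categories": "hep-ph"},
    {"title": "Paper E", "categories": "cs.LG stat.ML"},
]
pairs = build_pairs(toy_papers)
print(pairs)
```

Only the three `hep-ph` papers form pairs here; `cs.LG` and `cs.LG stat.ML` are treated as different categories and each has a single paper.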

In [None]:
drive.mount('/content/drive', force_remount=True)

file_path = "/content/drive/MyDrive/Project_Talina_Manjula_Mar_2025/arxiv-metadata-oai-snapshot.json"
output_file = "./model_checkpoints/truncated_arxiv.json"

N = 20000  # Change this to the number of records you want

os.makedirs(os.path.dirname(output_file), exist_ok=True)

written = 0
with open(file_path, "r") as infile, open(output_file, "w") as outfile:
    for i, line in enumerate(infile):
        if i >= N:
            break  # Stop after N lines
        outfile.write(line)
        written += 1

print(f"Truncated file saved as {output_file} with {written} records!")

In [None]:
df = pd.read_json("./model_checkpoints/truncated_arxiv.json", lines=True)


# view dataset structure
print(df.head())
print(df.columns)
print(df.info())

df = df[["title", "abstract", "categories"]].dropna()

print(df.head())

In [None]:
def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9,.?!:;()\-]", " ", text)  # replace special characters (and newlines) with spaces
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into a single space
    return text.strip()

# apply cleaning
df["title"] = df["title"].apply(clean_text)
df["abstract"] = df["abstract"].apply(clean_text)

print(df.head())

In [None]:
# pair papers that share the exact same category
train_examples = []
grouped_papers = df.groupby("categories")

print("Unique Categories in Dataset:")
print(df["categories"].unique())

for _, group in grouped_papers:
    papers = group.sample(n=min(200, len(group)), random_state=42)
    for i in range(len(papers) - 1):
        train_examples.append(InputExample(texts=[papers.iloc[i]["title"] + " " + papers.iloc[i]["abstract"],
                                                  papers.iloc[i + 1]["title"] + " " + papers.iloc[i + 1]["abstract"]]))

# shuffle data
random.shuffle(train_examples)

In [None]:
# check 5 random training pairs
for i in range(5):
    text1, text2 = train_examples[i].texts
    print(f"**Pair {i+1}**")
    print(f"**Text 1:** {text1[:300]}...")  # Print first 300 characters
    print(f"**Text 2:** {text2[:300]}...")
    print("=" * 100)

print(len(train_examples))

## Step 2 - Train the Search Model

### Goal

> The training loss should be in the range of 1-2.5 (this step does not use a separate validation set).
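
To make the loss goal easier to interpret, the sketch below re-derives the idea behind an in-batch-negatives ranking loss in plain Python: for each pair, every other second text in the batch acts as a negative, and the loss is the cross-entropy of picking the correct partner. It assumes cosine similarity scaled by 20, which is believed to match the defaults of `MultipleNegativesRankingLoss`; the function names and the 2-d vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical 2-d embeddings for a batch of three pairs.
well_separated = in_batch_negatives_loss(
    anchors=[[1, 0], [0, 1], [-1, 0]],
    positives=[[1, 0], [0, 1], [-1, 0]],
)
indistinguishable = in_batch_negatives_loss(
    anchors=[[1, 0], [1, 0], [1, 0]],
    positives=[[1, 0], [1, 0], [1, 0]],
)
print(f"{well_separated:.4f} {indistinguishable:.4f}")
```

When the model cannot tell candidates apart, the loss equals the natural log of the batch size. With the batch size of 20 used in this step that is about 3.0, so a loss in the 1-2.5 range means the model ranks the true partner above chance.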

In [None]:
# convert to SBERT input format
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=20)

# loss function: maximizes similarity of paired texts
train_loss = losses.MultipleNegativesRankingLoss(sbert_model)

# Train SBERT model for 10 epochs
sbert_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10, warmup_steps=100)

# Save trained model
sbert_model.save("./model_checkpoints/fine_tuned_search_model")

print("SBERT Training Complete!")

In [None]:
# Load fine-tuned SBERT model
model = SentenceTransformer("./model_checkpoints/fine_tuned_search_model")
# Encode all research papers into vectors
paper_texts = df["title"] + " " + df["abstract"]
paper_embeddings = model.encode(paper_texts.tolist())

# Create FAISS index
index = faiss.IndexFlatL2(paper_embeddings.shape[1])
index.add(np.array(paper_embeddings))

# Save FAISS index
faiss.write_index(index, "./model_checkpoints/paper_index.faiss")

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load fine-tuned and baseline models
fine_tuned_model = SentenceTransformer("./model_checkpoints/fine_tuned_search_model")  # Your fine-tuned model
baseline_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # Pretrained baseline

# Sample text for testing
sample_text = "Artificial Intelligence in Healthcare"

# Generate embeddings
embedding_finetuned = fine_tuned_model.encode(sample_text)
embedding_baseline = baseline_model.encode(sample_text)

# Compare embeddings
similarity = np.dot(embedding_finetuned, embedding_baseline) / (np.linalg.norm(embedding_finetuned) * np.linalg.norm(embedding_baseline))

print(f"Cosine Similarity Between Fine-tuned and Baseline Model: {similarity:.4f}")


In [None]:
import torch

# Load models
fine_tuned_model = SentenceTransformer("./model_checkpoints/fine_tuned_search_model")
baseline_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Get model parameters
fine_tuned_params = list(fine_tuned_model.parameters())
baseline_params = list(baseline_model.parameters())

# Check if parameters changed (absolute differences, so positive and negative changes do not cancel out)
with torch.no_grad():
    param_changes = [torch.sum(torch.abs(ft - base)).item() for ft, base in zip(fine_tuned_params, baseline_params)]

print(f"Total Absolute Parameter Change: {sum(param_changes):.6f}")

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Load fine-tuned and baseline models
fine_tuned_model = SentenceTransformer("./model_checkpoints/fine_tuned_search_model")  # Your fine-tuned model
baseline_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # Baseline model

# Load dataset (Make sure it's the same one used for fine-tuning)
df = pd.read_json("./model_checkpoints/truncated_arxiv.json", lines=True)
df = df[["title", "abstract"]].dropna()  # Keep only relevant columns

# Search Query
query = "Give me some papers that talk about Albert Einstein and Theory of Relativity"
query_embedding_finetuned = fine_tuned_model.encode([query])
query_embedding_baseline = baseline_model.encode([query])

# Encode all research paper titles using both models
paper_embeddings_finetuned = fine_tuned_model.encode(df["title"].tolist())
paper_embeddings_baseline = baseline_model.encode(df["title"].tolist())

# Compute similarity scores
similarity_scores_finetuned = cosine_similarity(query_embedding_finetuned, paper_embeddings_finetuned)[0]
similarity_scores_baseline = cosine_similarity(query_embedding_baseline, paper_embeddings_baseline)[0]

# Get top 5 most similar papers for both models
top_indices_finetuned = similarity_scores_finetuned.argsort()[-5:][::-1]
top_indices_baseline = similarity_scores_baseline.argsort()[-5:][::-1]

# Get top scores
top_scores_finetuned = similarity_scores_finetuned[top_indices_finetuned]
top_scores_baseline = similarity_scores_baseline[top_indices_baseline]

# Print Results for Fine-tuned Model
print("\n**Fine-Tuned Model Results**:")
for i, (index, score) in enumerate(zip(top_indices_finetuned, top_scores_finetuned)):
    row = df.iloc[index]
    print(f"**Rank {i+1}** (Similarity: {score:.4f})")
    print(f"**Title:** {row['title']}")
    print(f"**Abstract:** {row['abstract'][:300]}...")
    print("=" * 100)

# Print Results for Baseline Model
print("\n**Baseline Model Results**:")
for i, (index, score) in enumerate(zip(top_indices_baseline, top_scores_baseline)):
    row = df.iloc[index]
    print(f"**Rank {i+1}** (Similarity: {score:.4f})")
    print(f"**Title:** {row['title']}")
    print(f"**Abstract:** {row['abstract'][:300]}...")
    print("=" * 100)

# Download the Full Model Checkpoints

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!cp -r /content/model_checkpoints /content/drive/MyDrive/