# Data

We use the following dataset for fine-tuning:

- [arXiv papers](https://www.kaggle.com/datasets/neelshah18/arxivdataset)

arXiv also hosts papers on computational biology, genomics, and related fields.

An alternative is the [dataset](https://zenodo.org/record/7695390) from [a recent study](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1.full.pdf) with titles and labels of papers from PubMed. It contains 20 million papers, but only titles are listed (no abstracts).

In this notebook, we use data and tags from arXiv.

# Models

We use BERT trained on biomedical data (from PubMed) as a base model.

- [BiomedNLP-PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)

---

# Imports

In [1]:
import torch
import transformers
import numpy as np
import pandas as pd
from tqdm import tqdm

from datasets import Dataset, ClassLabel
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import pipeline
import evaluate



# Load data

Let's load the data for fine-tuning. In particular, we need each article's title, abstract, and tags.

In [2]:
import os
cur_dir = os.getcwd()
print(cur_dir)
df = pd.read_json(f"{cur_dir}/output2.json")


c:\Users\PC02\Desktop\DataScrap\AI_topic_recog


Let's combine the titles and abstracts and save the text in the appropriate column:

In [3]:
df['text'] = df['title'] + "\n" + df['abstract']
print(df['text'])

0        Sparsity-certifying Graph Decompositions\n  We...
1        The evolution of the Earth-Moon system based o...
2        A determinant of Stirling cycle numbers counts...
3        From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...
4        Bosonic characters of atomic Cooper pairs acro...
                               ...                        
41679    Deuteron production in p-Be interactions at 45...
41680    Statistical relativistic temperature transform...
41681    The time-dependent Born-Oppenheimer approximat...
41682    The design of the time-of-flight system for MI...
41683    Vortices in Quantum Rontgen Effect\n  By the a...
Name: text, Length: 41684, dtype: object


In [4]:
df.head(2)

Unnamed: 0,id,authors,title,categories,abstract,update_date,authors_parsed,text
0,704.0002,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,math.CO cs.CG,"We describe a new algorithm, the $(k,\ell)$-...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]",Sparsity-certifying Graph Decompositions\n We...
1,704.0003,Hongjun Pan,The evolution of the Earth-Moon system based o...,physics.gen-ph,The evolution of Earth-Moon system is descri...,2008-01-13,"[[Pan, Hongjun, ]]",The evolution of the Earth-Moon system based o...


## Labels

We will use categories from arXiv, such as `astro-ph` for astrophysics articles or `cs.CV` for computer vision (computer science). When a paper lists several categories, we keep only the first one as its label.

In [5]:
df['category'] = [i.split()[0].strip() for i in df['categories']]
categories = np.unique(df['category'])
num_labels = len(categories)
print(f"Total: {num_labels} labels such as {categories[0]}, {categories[1]}, ..., {categories[-1]}")
# df['category'] = [eval(i)[0]['term'].strip() for i in df['categories']]
# categories = np.unique(df['category'])
# num_labels = len(categories)
# print(f"Total: {num_labels} labels such as {categories[0]}, {categories[1]}, ..., {categories[-1]}")

Total: 133 labels such as astro-ph, cond-mat.dis-nn, ..., stat.ML
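
As a small illustration of the step above, the sketch below extracts the first listed category from two `categories` strings taken from the `df.head(2)` output and builds a sorted label list from them. The helper name `primary_category` is hypothetical and not part of the notebook.

```python
def primary_category(categories: str) -> str:
    """Return the first whitespace-separated arXiv category in the string."""
    return categories.split()[0]

# Sample values copied from the `categories` column shown earlier.
samples = ["math.CO cs.CG", "physics.gen-ph"]
primary = [primary_category(s) for s in samples]
print(primary)  # ['math.CO', 'physics.gen-ph']

# The label vocabulary is the sorted set of primary categories.
label_names = sorted(set(primary))
print(label_names)
```

Note that `str.split()` with no arguments already discards surrounding whitespace, so no extra `strip()` is needed in this sketch.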


In [6]:
pd.DataFrame({
    "category": categories,
    "category_index": np.arange(num_labels),
}).head()

Unnamed: 0,category,category_index
0,astro-ph,0
1,cond-mat.dis-nn,1
2,cond-mat.mes-hall,2
3,cond-mat.mtrl-sci,3
4,cond-mat.other,4


In [7]:
df = pd.DataFrame({
    "category": categories,
    "category_index": np.arange(num_labels),
}).set_index("category").join(df.set_index("category"), how="right", sort=False).reset_index()

# Model

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Tokenizer (title + abstract -> tokens):

In [9]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

The model itself. `AutoModelForSequenceClassification` loads the pretrained encoder and adds a newly initialized classification head with `num_labels` outputs:

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", num_labels=num_labels).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

# Training

## Data Loaders

The `datasets` library is a convenient way to feed data to `transformers`.

Let's create a Hugging Face [dataset](https://huggingface.co/docs/datasets/tabular_load#pandas-dataframes) from the pandas DataFrame:

In [12]:
np.random.seed(42)
train_indices = np.sort(np.random.choice(np.arange(len(df)), size=37_000, replace=False))
# All row positions that were not sampled for training (sorted, same result as a membership loop)
test_indices = np.setdiff1d(np.arange(len(df)), train_indices)

In [13]:
train_df = df.loc[:,["text", "category"]].iloc[train_indices]
test_df = df.loc[:,["text", "category"]].iloc[test_indices]

train_ds = Dataset.from_pandas(train_df, split="train")
test_ds = Dataset.from_pandas(test_df, split="test")
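
A quick sanity check for an index-based split like the one above is to confirm that the two index sets are disjoint and together cover every row. The sketch below does this on toy sizes; `n_rows`, `n_train`, and the use of `np.random.default_rng` are assumptions made for the illustration and differ from the notebook, which seeds with `np.random.seed`.

```python
import numpy as np

n_rows, n_train = 10, 7  # toy sizes; the notebook uses len(df) and 37_000

rng = np.random.default_rng(42)
train_idx = np.sort(rng.choice(n_rows, size=n_train, replace=False))
test_idx = np.setdiff1d(np.arange(n_rows), train_idx)

# The two index sets are disjoint and together cover every row.
assert np.intersect1d(train_idx, test_idx).size == 0
assert train_idx.size + test_idx.size == n_rows
print(train_idx, test_idx)
```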

In [14]:
def tokenize_text(row):
    return tokenizer(
        row["text"],
        max_length=512,
        truncation=True,
        padding='max_length',
    )

train_ds = train_ds.map(tokenize_text, batched=True)
test_ds = test_ds.map(tokenize_text, batched=True)

Map: 100%|██████████| 37000/37000 [00:11<00:00, 3171.80 examples/s]
Map: 100%|██████████| 4684/4684 [00:01<00:00, 3358.03 examples/s]
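
With `padding='max_length'`, every example is stored as 512 tokens regardless of its real length, and activation memory during training grows with batch size times sequence length. The sketch below compares that fixed padding with padding to the longest example in a batch. The token counts are hypothetical and were not measured on this dataset. In `transformers`, dynamic padding is usually done by tokenizing without padding and letting `DataCollatorWithPadding` pad each batch.

```python
MAX_LENGTH = 512

# Hypothetical per-example token counts for one batch (not measured from the data).
token_counts = [180, 240, 310, 150]

def padded_tokens_fixed(counts, max_length=MAX_LENGTH):
    """Tokens per batch when every example is padded (or truncated) to max_length."""
    return len(counts) * max_length

def padded_tokens_dynamic(counts, max_length=MAX_LENGTH):
    """Tokens per batch when examples are padded to the longest one in the batch."""
    longest = min(max(counts), max_length)
    return len(counts) * longest

fixed = padded_tokens_fixed(token_counts)
dynamic = padded_tokens_dynamic(token_counts)
print(fixed, dynamic)  # 2048 1240
```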


In [15]:
labels_map = ClassLabel(num_classes=num_labels, names=list(categories))

def transform_labels(row):
    # default name for a label (label or label_ids)
    return {"label": labels_map.str2int(row["category"])}

# OR:
#
# labels_map = pd.Series(
#     np.arange(num_labels),
#     index=categories,
# )
#
# def transform_labels(row):
#     return {"label": labels_map[row["category"]]}

train_ds = train_ds.map(transform_labels, batched=True)
test_ds = test_ds.map(transform_labels, batched=True)

train_ds = train_ds.cast_column('label', labels_map)
test_ds = test_ds.cast_column('label', labels_map)

Map: 100%|██████████| 37000/37000 [00:00<00:00, 190685.32 examples/s]
Map: 100%|██████████| 4684/4684 [00:00<00:00, 234189.06 examples/s]
Casting the dataset: 100%|██████████| 37000/37000 [00:00<00:00, 331775.35 examples/s]
Casting the dataset: 100%|██████████| 4684/4684 [00:00<00:00, 360340.42 examples/s]


## Prepare training

In [16]:
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    num_labels=num_labels,
    id2label={i:labels_map.names[i] for i in range(len(categories))},
    label2id={labels_map.names[i]:i for i in range(len(categories))},
).to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
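
The `id2label` and `label2id` arguments are plain dictionaries that let the saved model report category names instead of integer indices. A minimal sketch of the two mappings, using only the first five category names from the table shown earlier (the real list has 133 entries):

```python
# First five category names from the table above; the real list has 133 entries.
names = [
    "astro-ph",
    "cond-mat.dis-nn",
    "cond-mat.mes-hall",
    "cond-mat.mtrl-sci",
    "cond-mat.other",
]

id2label = dict(enumerate(names))
label2id = {name: i for i, name in id2label.items()}

# Round trip: every index maps to a name and back to the same index.
assert all(label2id[id2label[i]] == i for i in range(len(names)))
print(id2label[0], label2id["cond-mat.other"])  # astro-ph 4
```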


In [17]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

We will compute accuracy:

In [18]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
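
To make the computation concrete, the sketch below reproduces what `compute_metrics` does on a tiny, made-up batch: take the arg-max over the class dimension and measure the fraction of predictions that match the labels. The logits and labels are hypothetical, and plain NumPy stands in for the `evaluate` metric.

```python
import numpy as np

# Hypothetical logits for 4 examples and 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.2, 3.1],
    [1.2, 0.9, 0.1],
])
labels = np.array([0, 1, 2, 1])

predictions = np.argmax(logits, axis=-1)  # index of the largest logit per row
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)  # [0 1 2 0] 0.75
```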

In [20]:
training_args = TrainingArguments(
    output_dir="bert-paper-classifier-arxiv",
    eval_strategy="epoch",
    per_device_train_batch_size=32,
    num_train_epochs=5,
    logging_steps=10,
)

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

  0%|          | 1/5790 [02:40<258:42:49, 160.89s/it]

OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 20.08 GiB is allocated by PyTorch, and 290.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
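
One common way to respond to an out-of-memory error like this is to shrink the per-device batch and recover the same effective batch size through gradient accumulation, optionally with mixed precision. The sketch below is an assumption about what might help on a 6 GiB GPU, not a tested fix. The values are illustrative, while the keys are existing `TrainingArguments` parameters.

```python
# Hypothetical memory-friendlier settings; values are illustrative, not tuned.
memory_friendly_args = dict(
    output_dir="bert-paper-classifier-arxiv",
    eval_strategy="epoch",
    per_device_train_batch_size=4,   # was 32
    gradient_accumulation_steps=8,   # accumulate 8 small batches per optimizer step
    per_device_eval_batch_size=8,
    fp16=True,                       # mixed precision; requires a CUDA GPU
    num_train_epochs=5,
    logging_steps=10,
)

effective_batch = (
    memory_friendly_args["per_device_train_batch_size"]
    * memory_friendly_args["gradient_accumulation_steps"]
)
print(effective_batch)  # 32, on a single GPU

# These could then be passed as TrainingArguments(**memory_friendly_args).
```

Shortening `max_length` in the tokenizer call is another lever, since memory use also grows with sequence length.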


In [None]:
# Convert to a python file and run training:
#! jupyter nbconvert finetuning-arxiv.ipynb --to python

# Save and share

In [None]:
trainer.args.hub_model_id = "bert-paper-classifier-arxiv"

In [None]:
tokenizer.save_pretrained("bert-paper-classifier-arxiv")

('bert-paper-classifier/tokenizer_config.json',
 'bert-paper-classifier/special_tokens_map.json',
 'bert-paper-classifier/vocab.txt',
 'bert-paper-classifier/added_tokens.json',
 'bert-paper-classifier/tokenizer.json')

In [None]:
trainer.save_model("bert-paper-classifier-arxiv")

Let's push the model to the HF Hub:

In [None]:
trainer.push_to_hub()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

To https://huggingface.co/oracat/bert-paper-classifier
   915ccf0..862abb7  main -> main





# Inference

Now let's try loading the model from the HF Hub:

In [None]:
inference_tokenizer = AutoTokenizer.from_pretrained("oracat/bert-paper-classifier-arxiv")
inference_model = AutoModelForSequenceClassification.from_pretrained("oracat/bert-paper-classifier-arxiv")

Downloading (…)okenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/225k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/679k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [None]:
pipe = pipeline("text-classification", model=inference_model, tokenizer=inference_tokenizer, top_k=None)

In [None]:
def top_pct(preds, threshold=.95):
    preds = sorted(preds, key=lambda x: -x["score"])

    cum_score = 0
    for i, item in enumerate(preds):
        cum_score += item["score"]
        if cum_score >= threshold:
            break

    preds = preds[:(i+1)]

    return preds

In [None]:
def format_predictions(preds) -> str:
    """
    Prepare predictions and their scores for printing to the user
    """
    out = ""
    for i, item in enumerate(preds):
        out += f"{i+1}. {item['label']} (score {item['score']:.2f})\n"
    return out
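
The two helpers can be tried without loading a model by feeding them hand-written predictions in the same shape the pipeline returns (a list of `{"label", "score"}` dictionaries). The scores below are hypothetical. This variant of `top_pct` follows the same cumulative-score rule but collects items in a list, so it also returns an empty list for empty input.

```python
def top_pct(preds, threshold=0.95):
    """Keep the highest-scoring predictions until their scores sum to `threshold`."""
    preds = sorted(preds, key=lambda x: -x["score"])
    cum_score = 0.0
    kept = []
    for item in preds:
        kept.append(item)
        cum_score += item["score"]
        if cum_score >= threshold:
            break
    return kept

def format_predictions(preds) -> str:
    """Render predictions as a numbered list with two-decimal scores."""
    return "".join(
        f"{i + 1}. {item['label']} (score {item['score']:.2f})\n"
        for i, item in enumerate(preds)
    )

# Hypothetical pipeline output for a single text.
fake_preds = [
    {"label": "cs.AI", "score": 0.30},
    {"label": "cs.LG", "score": 0.60},
    {"label": "cs.NE", "score": 0.08},
    {"label": "stat.ML", "score": 0.02},
]

kept = top_pct(fake_preds)
print(format_predictions(kept))
```

Here the first two scores sum to about 0.90, which is below the 0.95 threshold, so the third label is included as well.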

In [None]:
print(
    format_predictions(
        top_pct(
            pipe("Attention Is All You Need\nThe dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.")[0]
        )
    )
)

1. cs.LG (score 0.88)
2. cs.AI (score 0.07)
3. cs.NE (score 0.03)

