
## Classifying ArXiv articles into categories with a BERT model (fine-tuning approach)

In [1]:
pip install transformers



## Read the ArXiv dataset and preprocess it into a suitable form

In [2]:
!pip install kaggle



In [3]:
# Upload Kaggle API Key
from google.colab import files
files.upload()  # Upload the kaggle.json file

Saving kaggle_mill.json to kaggle_mill.json


{'kaggle_mill.json': b'{"username":"airbornbird88","key":"<redacted>"}'}

In [4]:
# Set up Kaggle configuration
# The Kaggle CLI looks for a file named kaggle.json, so rename the upload while moving it
!mkdir -p ~/.kaggle
!mv kaggle_mill.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
# Download the arXiv dataset
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
100% 1.29G/1.29G [00:14<00:00, 121MB/s]
100% 1.29G/1.29G [00:14<00:00, 96.5MB/s]


In [6]:
# Extract the dataset
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [7]:
ls /content

arxiv-metadata-oai-snapshot.json  arxiv.zip  sample_data/


In [8]:
import json

# File path
data_file = "/content/arxiv-metadata-oai-snapshot.json"

# List to store parsed JSON objects
samples = []

# Maximum number of samples to parse
max_samples = 5000

# Open the file and read line by line
with open(data_file, 'r') as file:
    for line in file:
        # Parse each line as a JSON object and append it to the samples list
        samples.append(json.loads(line))

        # Check if we've reached the maximum number of samples
        if len(samples) >= max_samples:
            break

# samples now contains the parsed JSON objects
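
The loop above treats the snapshot as JSON Lines (one JSON object per line) and stops after `max_samples` records. A minimal self-contained sketch of the same pattern using `itertools.islice`; the helper name `read_jsonl_head` and the tiny temporary file are assumptions made purely for illustration:

```python
import itertools
import json
import os
import tempfile


def read_jsonl_head(path, max_records):
    """Parse at most `max_records` JSON objects from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        # islice stops after max_records lines, so later lines are never parsed.
        return [json.loads(line) for line in itertools.islice(handle, max_records)]


# Illustrative data only: three fake records written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for i in range(3):
        tmp.write(json.dumps({"id": f"demo-{i}", "categories": "hep-ph"}) + "\n")

head = read_jsonl_head(tmp.name, max_records=2)
os.remove(tmp.name)
print(len(head), head[0]["id"])
```

Reading line by line matters here because the full snapshot is far larger than the 5000 records that are kept.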


In [9]:
# Assuming 'samples' contains the parsed JSON objects
# Print example of multiple JSON objects in samples
for i in range(5):  # Adjust the range as needed
    print(json.dumps(samples[i], indent=2))


{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report-no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": null,
  "abstract": "  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with

In [10]:
import pandas as pd

# Convert JSON data to DataFrame
df = pd.json_normalize(samples)

In [11]:
df.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"


In [12]:
# Function to concatenate all fields into a single text column
def concatenate_fields(row):
    text = []
    for key, value in row.items():
        if key not in ['id', 'categories'] and value is not None:
            text.append(f"{key}: {value}")
    return ". ".join(text)
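
To make the effect of `concatenate_fields` concrete, here is a small sketch that applies the same logic to a plain dictionary (a dict exposes `.items()` just like a DataFrame row). The record below is hypothetical, not taken from the dataset:

```python
def concatenate_fields(row):
    """Join every non-null field except `id` and `categories` as 'key: value'."""
    parts = []
    for key, value in row.items():
        if key not in ("id", "categories") and value is not None:
            parts.append(f"{key}: {value}")
    return ". ".join(parts)


# Hypothetical record for illustration only.
record = {
    "id": "0000.0000",
    "submitter": "A. Author",
    "title": "An example title",
    "doi": None,
    "categories": "math.CO",
}
text = concatenate_fields(record)
print(text)
```

Leaving `categories` out of the text keeps the label from leaking into the model input.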

In [13]:
# Apply the function to create the Text column
df['Text'] = df.apply(concatenate_fields, axis=1)

In [14]:
# Keep only the necessary columns
df = df[['id', 'Text', 'categories']]
df.columns = ['ArticleId', 'Text', 'Categories']

In [15]:
df.head()

Unnamed: 0,ArticleId,Text,Categories
0,704.0001,submitter: Pavel Nadolsky. authors: C. Bal\'az...,hep-ph
1,704.0002,submitter: Louis Theran. authors: Ileana Strei...,math.CO cs.CG
2,704.0003,submitter: Hongjun Pan. authors: Hongjun Pan. ...,physics.gen-ph
3,704.0004,submitter: David Callan. authors: David Callan...,math.CO
4,704.0005,submitter: Alberto Torchinsky. authors: Wael A...,math.CA math.FA


In [16]:
print(f"Total records: {len(df)}")
categories = df['Categories'].unique()
print(f"Categories: {categories.tolist()}")
df.head()

Total records: 5000
Categories: ['hep-ph', 'math.CO cs.CG', 'physics.gen-ph', 'math.CO', 'math.CA math.FA', 'cond-mat.mes-hall', 'gr-qc', 'cond-mat.mtrl-sci', 'astro-ph', 'math.NT math.AG', 'math.NT', 'math.CA math.AT', 'hep-th', 'math.PR math.AG', 'hep-ex', 'nlin.PS physics.chem-ph q-bio.MN', 'math.NA', 'nlin.PS', 'cond-mat.str-el cond-mat.stat-mech', 'math.RA', 'math.CA math.PR', 'cond-mat.str-el', 'physics.optics physics.comp-ph', 'q-bio.PE q-bio.CB quant-ph', 'q-bio.QM q-bio.MN', 'hep-ph hep-lat nucl-th', 'math.OA math.FA', 'math.QA math-ph math.MP', 'physics.gen-ph quant-ph', 'cond-mat.stat-mech cond-mat.mtrl-sci', 'astro-ph nlin.CD physics.plasm-ph physics.space-ph', 'nlin.PS nlin.SI', 'quant-ph cs.IT math.IT', 'cs.NE cs.AI', 'gr-qc astro-ph', 'physics.ed-ph quant-ph', 'math.DG gr-qc', 'cond-mat.soft cond-mat.mtrl-sci', 'physics.pop-ph', 'nucl-th', 'math.FA', 'cs.DS', 'math.AG math.CO', 'math.NT math.CV', 'math.DS', 'physics.soc-ph', 'math-ph math.MP', 'math.AG', 'hep-ph hep-ex n

Unnamed: 0,ArticleId,Text,Categories
0,704.0001,submitter: Pavel Nadolsky. authors: C. Bal\'az...,hep-ph
1,704.0002,submitter: Louis Theran. authors: Ileana Strei...,math.CO cs.CG
2,704.0003,submitter: Hongjun Pan. authors: Hongjun Pan. ...,physics.gen-ph
3,704.0004,submitter: David Callan. authors: David Callan...,math.CO
4,704.0005,submitter: Alberto Torchinsky. authors: Wael A...,math.CA math.FA


In [17]:
# Group by 'Categories' and get the size of each group
category_counts = df.groupby(['Categories']).size().reset_index(name='Count')

# Sort the categories by count for better readability
category_counts = category_counts.sort_values(by='Count', ascending=False)

# Display the resulting DataFrame
print(category_counts)

                         Categories  Count
0                          astro-ph    915
283                          hep-ph    231
910                        quant-ph    205
305                          hep-th    193
50                cond-mat.mtrl-sci    111
..                              ...    ...
441                 math.CA math.DS      1
69   cond-mat.other hep-th quant-ph      1
443                 math.CA math.OA      1
444                 math.CA math.PR      1
942  stat.ME physics.soc-ph stat.AP      1

[943 rows x 2 columns]


In [18]:
# Split the categories into separate rows
df_exploded = df.assign(Categories=df['Categories'].str.split()).explode('Categories')

# Group by 'Categories' and get the size of each group
category_counts = df_exploded.groupby(['Categories']).size().reset_index(name='Count')

# Sort the categories by count for better readability
category_counts = category_counts.sort_values(by='Count', ascending=False)

# Display the resulting DataFrame
print(category_counts)

      Categories  Count
0       astro-ph   1076
51        hep-th    463
50        hep-ph    439
129     quant-ph    308
47         gr-qc    256
..           ...    ...
34         cs.MA      1
122     q-bio.TO      1
1    astro-ph.EP      1
87       nlin.CG      1
37         cs.NA      1

[135 rows x 2 columns]
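
The two counts above differ because the first treats every distinct category string as one class, while the second splits each string into its individual categories. A minimal sketch of that difference with `collections.Counter`, using made-up label strings in the same space-separated format:

```python
from collections import Counter

# Illustrative label strings only.
labels = ["astro-ph", "gr-qc astro-ph", "gr-qc", "astro-ph gr-qc", "hep-th gr-qc"]

# Every distinct string counts as its own class.
combo_counts = Counter(labels)
# Splitting on whitespace counts each individual category instead.
single_counts = Counter(cat for label in labels for cat in label.split())

print(len(combo_counts), len(single_counts))
print(single_counts.most_common(1))
```

Note that `"gr-qc astro-ph"` and `"astro-ph gr-qc"` are different strings, so the combination view counts them as two separate classes.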


## Load the BERT model and try out its tokenizer

In [19]:
from transformers import BertTokenizer
from transformers.tokenization_utils_base import BatchEncoding

transformer_model_id = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(transformer_model_id)

example_text = 'This is an example of the New Revelation'
tokens = tokenizer.tokenize(text=example_text)
print(f"Tokens: {tokens}")

bert_input = tokenizer(example_text, padding='max_length', max_length = 50,
                       truncation=True, return_tensors="pt")
print(type(bert_input))
print(bert_input['input_ids'])
print(bert_input['token_type_ids'])
print(bert_input['attention_mask'])

Tokens: ['This', 'is', 'an', 'example', 'of', 'the', 'New', 'Rev', '##ela', '##tion']
<class 'transformers.tokenization_utils_base.BatchEncoding'>
tensor([[  101, 10747, 10124, 10151, 14351, 10108, 10105, 10287, 24774, 15108,
         10822,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])


In [20]:
token_ids = bert_input['input_ids'][0]
print(f"Tokens again: {tokenizer.convert_ids_to_tokens(token_ids)}")
print(f"Words: {tokenizer.decode(token_ids)}")


Tokens again: ['[CLS]', 'This', 'is', 'an', 'example', 'of', 'the', 'New', 'Rev', '##ela', '##tion', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Words: [CLS] This is an example of the New Revelation [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
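
The `input_ids` and `attention_mask` tensors above follow a simple layout: real tokens come first, `[PAD]` (id 0) fills the remainder up to `max_length`, and the mask is 1 for real tokens and 0 for padding. A plain-Python illustration of that layout follows. It is not the tokenizer's implementation; `pad_to_length` is a hypothetical helper, and the ids are copied from the output above:

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Cut or right-pad `token_ids` and build the matching attention mask.

    Simplification: the real tokenizer keeps the final [SEP] when truncating.
    """
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, attention_mask


# [CLS] This is an example of the New Rev ##ela ##tion [SEP]
ids = [101, 10747, 10124, 10151, 14351, 10108, 10105, 10287, 24774, 15108, 10822, 102]
input_ids, attention_mask = pad_to_length(ids, max_length=50)
print(len(input_ids), sum(attention_mask))
```

The mask lets the model ignore the padded positions, which is why 12 ones are followed by 38 zeros in the printed tensor.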


## Prepare the ArXiv dataset for model training

In [21]:
# Get unique categories from your DataFrame
unique_categories = df['Categories'].unique()

# Create a category2index mapping dictionary
category2index = {category: index for index, category in enumerate(unique_categories)}

# Display the mapping dictionary
print(category2index)

{'hep-ph': 0, 'math.CO cs.CG': 1, 'physics.gen-ph': 2, 'math.CO': 3, 'math.CA math.FA': 4, 'cond-mat.mes-hall': 5, 'gr-qc': 6, 'cond-mat.mtrl-sci': 7, 'astro-ph': 8, 'math.NT math.AG': 9, 'math.NT': 10, 'math.CA math.AT': 11, 'hep-th': 12, 'math.PR math.AG': 13, 'hep-ex': 14, 'nlin.PS physics.chem-ph q-bio.MN': 15, 'math.NA': 16, 'nlin.PS': 17, 'cond-mat.str-el cond-mat.stat-mech': 18, 'math.RA': 19, 'math.CA math.PR': 20, 'cond-mat.str-el': 21, 'physics.optics physics.comp-ph': 22, 'q-bio.PE q-bio.CB quant-ph': 23, 'q-bio.QM q-bio.MN': 24, 'hep-ph hep-lat nucl-th': 25, 'math.OA math.FA': 26, 'math.QA math-ph math.MP': 27, 'physics.gen-ph quant-ph': 28, 'cond-mat.stat-mech cond-mat.mtrl-sci': 29, 'astro-ph nlin.CD physics.plasm-ph physics.space-ph': 30, 'nlin.PS nlin.SI': 31, 'quant-ph cs.IT math.IT': 32, 'cs.NE cs.AI': 33, 'gr-qc astro-ph': 34, 'physics.ed-ph quant-ph': 35, 'math.DG gr-qc': 36, 'cond-mat.soft cond-mat.mtrl-sci': 37, 'physics.pop-ph': 38, 'nucl-th': 39, 'math.FA': 40, 

In [22]:
# Number of distinct category strings (each combination of categories is its own class)
num_categories = len(unique_categories)
num_categories

943
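
With 943 distinct label strings for 5000 articles, most classes have very few examples (the counts above show many combinations occurring only once). One possible alternative, which is not what this notebook does, is to label each article with only its first listed category. The sketch below assumes that the first entry is the article's primary category; the helper `primary_category` and the mapping `primary2index` are hypothetical names:

```python
def primary_category(categories: str) -> str:
    """Return the first space-separated category (assumed to be the primary one)."""
    return categories.split()[0]


# Illustrative label strings taken from the output above.
raw_labels = ["hep-ph", "math.CO cs.CG", "math.CO", "math.CA math.FA", "hep-ph hep-lat nucl-th"]

primary = [primary_category(label) for label in raw_labels]
# dict.fromkeys keeps first-seen order while dropping duplicates.
primary2index = {cat: i for i, cat in enumerate(dict.fromkeys(primary))}

print(primary2index)
```

In this toy example five distinct label strings collapse to three classes; the trade-off is that secondary categories are discarded.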

In [23]:
from pandas import DataFrame
import numpy as np
import torch


class ARXIVDataset(torch.utils.data.Dataset):
    def __init__(self, data_frame: DataFrame):
        self.labels = [category2index[category_label]
                       for category_label in data_frame['Categories']]
        # Keep the raw strings; tokenization happens lazily in __getitem__
        self.texts = list(data_frame['Text'])

    def classes(self) -> list[int]:
        return self.labels

    def __len__(self) -> int:
        return len(self.labels)

    def get_batch_labels(self, idx: int) -> np.ndarray:
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx: int) -> str:
        return self.texts[idx]

    def __getitem__(self, idx: int) -> tuple:
        text = self.texts[idx]
        label = self.labels[idx]
        # Pad or truncate every text to 512 tokens
        encoding = tokenizer(text, padding='max_length', max_length=512, truncation=True, return_tensors="pt")
        input_ids = encoding['input_ids'].squeeze(0)  # Remove batch dimension
        attention_mask = encoding['attention_mask'].squeeze(0)  # Remove batch dimension
        return input_ids, attention_mask, label

In [24]:
# Shuffle all examples (reproducibly, via random_state), then split them
# 80/10/10 into training, validation and test sets:
df_shuffled = df.sample(frac=1, random_state=42)
train_end, val_end = int(.8 * len(df)), int(.9 * len(df))
df_train = df_shuffled.iloc[:train_end]
df_val = df_shuffled.iloc[train_end:val_end]
df_test = df_shuffled.iloc[val_end:]

print(f"{len(df)} split into: {len(df_train)} train, {len(df_val)} validation, {len(df_test)} test")

5000 split into: 4000 train, 500 validation, 500 test
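
The split logic is independent of pandas: shuffle once with a fixed seed, then cut the shuffled order at the 80% and 90% marks. A minimal standard-library sketch under that reading; `split_indices` is a hypothetical helper, and integer percentages are used to avoid floating-point rounding at the cut points:

```python
import random


def split_indices(n, train_pct=80, val_pct=10, seed=42):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(5000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three slices come from one shuffled list, no record can land in more than one set.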


## Network architecture based on the BERT transformer

In [25]:
from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, num_categories, dropout=0.5):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained(transformer_model_id)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, num_categories)
        self.relu = nn.ReLU()
        # self.relu = nn.Softmax(dim=1)

    def forward(self, input_ids, mask):
        contextualized_token_embeddings, pooled_output = self.bert(input_ids = input_ids, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)
        return final_layer


## Train/fine-tune the model on the classification task using the training and validation data

---



In [26]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
if use_cuda:
    print("Using CUDA")

Using CUDA


In [27]:
"""from torch.optim import Adam
from tqdm import tqdm

def train(model: BertClassifier, train_data: DataFrame, val_data: DataFrame, learning_rate: float, epochs: int):

    train, val = ARXIVDataset(train_data), ARXIVDataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    if use_cuda:
            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):
            total_acc_train = 0
            total_loss_train = 0
            for train_input_ids, train_attention_mask, train_label in tqdm(train_dataloader):
                train_label = train_label.to(device)
                train_input_ids = train_input_ids.to(device)
                train_attention_mask = train_attention_mask.to(device)

                output = model(train_input_ids, train_attention_mask)

                batch_loss = criterion(output, train_label)
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()


            total_acc_val = 0
            total_loss_val = 0

            with torch.no_grad():

                for val_input_ids, val_attention_mask, val_label in val_dataloader:
                    val_label = val_label.to(device)
                    val_input_ids = val_input_ids.to(device)
                    val_attention_mask = val_attention_mask.to(device)

                    output = model(val_input_ids, val_attention_mask)

                    batch_loss = criterion(output, val_label)
                    total_loss_val += batch_loss.item()

                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc


            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} \
                | Train Accuracy: {total_acc_train / len(train_data): .3f} \
                | Val Loss: {total_loss_val / len(val_data): .3f} \
                | Val Accuracy: {total_acc_val / len(val_data): .3f}')

EPOCHS = 5
model = BertClassifier(num_categories)
LR = 1e-6

train(model, df_train, df_val, LR, EPOCHS)
"""

In [28]:
from sklearn.metrics import precision_score, recall_score, f1_score
from tqdm import tqdm

def train_and_evaluate(model: BertClassifier, train_data: DataFrame, val_data: DataFrame, learning_rate: float, epochs: int):
    train = ARXIVDataset(train_data)
    val = ARXIVDataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    if use_cuda:
        model = model.cuda()

    # Use the caller's learning rate (the run logged below was trained with 1e-5)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        total_acc_train = 0
        total_loss_train = 0
        all_preds_train = []
        all_labels_train = []

        for train_batch in tqdm(train_dataloader, desc=f"Training Epoch {epoch+1}"):
            input_ids, attention_mask, train_label = train_batch

            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            train_label = train_label.to(device)

            optimizer.zero_grad()
            output = model(input_ids, attention_mask)
            loss = loss_fn(output, train_label)
            loss.backward()
            optimizer.step()

            preds = output.argmax(dim=1)
            acc = (preds == train_label).sum().item()
            total_acc_train += acc
            total_loss_train += loss.item()

            all_preds_train.extend(preds.cpu().numpy())
            all_labels_train.extend(train_label.cpu().numpy())

        train_accuracy = total_acc_train / len(train_data)
        train_loss = total_loss_train / len(train_dataloader)
        train_precision = precision_score(all_labels_train, all_preds_train, average='weighted', zero_division=0)
        train_recall = recall_score(all_labels_train, all_preds_train, average='weighted', zero_division=0)
        train_f1 = f1_score(all_labels_train, all_preds_train, average='weighted', zero_division=0)

        model.eval()
        total_acc_val = 0
        total_loss_val = 0
        all_preds_val = []
        all_labels_val = []

        with torch.no_grad():
            for val_batch in tqdm(val_dataloader, desc=f"Validation Epoch {epoch+1}"):
                input_ids, attention_mask, val_label = val_batch

                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                val_label = val_label.to(device)

                output = model(input_ids, attention_mask)
                loss = loss_fn(output, val_label)

                preds = output.argmax(dim=1)
                acc = (preds == val_label).sum().item()
                total_acc_val += acc
                total_loss_val += loss.item()

                all_preds_val.extend(preds.cpu().numpy())
                all_labels_val.extend(val_label.cpu().numpy())

        val_accuracy = total_acc_val / len(val_data)
        val_loss = total_loss_val / len(val_dataloader)
        val_precision = precision_score(all_labels_val, all_preds_val, average='weighted', zero_division=0)
        val_recall = recall_score(all_labels_val, all_preds_val, average='weighted', zero_division=0)
        val_f1 = f1_score(all_labels_val, all_preds_val, average='weighted', zero_division=0)

        print(f'Epochs: {epoch + 1} | '
              f'Train Loss: {train_loss: .3f} | Train Accuracy: {train_accuracy: .3f} | '
              f'Train Precision: {train_precision: .3f} | Train Recall: {train_recall: .3f} | Train F1 Score: {train_f1: .3f} | '
              f'Val Loss: {val_loss: .3f} | Val Accuracy: {val_accuracy: .3f} | '
              f'Val Precision: {val_precision: .3f} | Val Recall: {val_recall: .3f} | Val F1 Score: {val_f1: .3f}')

In [29]:
EPOCHS = 5
LR = 1e-5  # learning rate used for the run logged below
model = BertClassifier(num_categories)

# Assuming `model`, `df_train`, and `df_val` are already defined
train_and_evaluate(model, df_train, df_val, LR, EPOCHS)

Training Epoch 1: 100%|██████████| 2000/2000 [08:30<00:00,  3.92it/s]
Validation Epoch 1: 100%|██████████| 250/250 [00:17<00:00, 14.61it/s]


Epochs: 1 | Train Loss:  5.350 | Train Accuracy:  0.210 | Train Precision:  0.085 | Train Recall:  0.210 | Train F1 Score:  0.117 | Val Loss:  4.858 | Val Accuracy:  0.254 | Val Precision:  0.193 | Val Recall:  0.254 | Val F1 Score:  0.196


Training Epoch 2: 100%|██████████| 2000/2000 [08:31<00:00,  3.91it/s]
Validation Epoch 2: 100%|██████████| 250/250 [00:18<00:00, 13.84it/s]


Epochs: 2 | Train Loss:  4.604 | Train Accuracy:  0.303 | Train Precision:  0.203 | Train Recall:  0.303 | Train F1 Score:  0.222 | Val Loss:  4.517 | Val Accuracy:  0.300 | Val Precision:  0.212 | Val Recall:  0.300 | Val F1 Score:  0.227


Training Epoch 3: 100%|██████████| 2000/2000 [08:33<00:00,  3.89it/s]
Validation Epoch 3: 100%|██████████| 250/250 [00:17<00:00, 14.00it/s]


Epochs: 3 | Train Loss:  4.319 | Train Accuracy:  0.340 | Train Precision:  0.241 | Train Recall:  0.340 | Train F1 Score:  0.255 | Val Loss:  4.337 | Val Accuracy:  0.330 | Val Precision:  0.225 | Val Recall:  0.330 | Val F1 Score:  0.251


Training Epoch 4: 100%|██████████| 2000/2000 [08:31<00:00,  3.91it/s]
Validation Epoch 4: 100%|██████████| 250/250 [00:18<00:00, 13.21it/s]


Epochs: 4 | Train Loss:  4.133 | Train Accuracy:  0.372 | Train Precision:  0.250 | Train Recall:  0.372 | Train F1 Score:  0.284 | Val Loss:  4.233 | Val Accuracy:  0.356 | Val Precision:  0.237 | Val Recall:  0.356 | Val F1 Score:  0.269


Training Epoch 5: 100%|██████████| 2000/2000 [08:33<00:00,  3.89it/s]
Validation Epoch 5: 100%|██████████| 250/250 [00:16<00:00, 14.81it/s]

Epochs: 5 | Train Loss:  3.979 | Train Accuracy:  0.395 | Train Precision:  0.287 | Train Recall:  0.395 | Train F1 Score:  0.312 | Val Loss:  4.148 | Val Accuracy:  0.362 | Val Precision:  0.247 | Val Recall:  0.362 | Val F1 Score:  0.282
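
For context on the numbers above, two simple reference points can be computed from figures printed earlier in this notebook: the cross-entropy of a classifier that assigns equal probability to each of the 943 classes, and the accuracy of always predicting the most frequent label (`astro-ph`, 915 of the 5000 sampled rows). This is a rough sketch, since the majority share is measured on the whole sample rather than on the validation split:

```python
import math

num_categories = 943        # distinct label strings (cell 22)
most_frequent_count = 915   # rows labelled exactly 'astro-ph' (cell 17)
total_records = 5000

# Cross-entropy of a uniform guess over all classes.
uniform_loss = math.log(num_categories)
# Accuracy of always predicting the most frequent label.
majority_accuracy = most_frequent_count / total_records

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"majority-class accuracy: {majority_accuracy:.3f}")
```

Comparing the logged validation loss and accuracy against these two values gives a quick sense of how much the model has learned beyond trivial strategies.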





## Evaluate the model on unseen test data

In [30]:
def evaluate(model: BertClassifier, test_data: DataFrame):
    test = ARXIVDataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    if use_cuda:
        model = model.cuda()

    model.eval()  # make sure dropout is disabled during evaluation
    total_acc_test = 0
    with torch.no_grad():
        for test_batch in test_dataloader:
            input_ids, attention_mask, test_label = test_batch

            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            test_label = test_label.to(device)

            output = model(input_ids, attention_mask)
            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

evaluate(model, df_test)

Test Accuracy:  0.348


In [31]:
pip install scikit-learn



We also want the model's precision, recall, and F1 score.

In [33]:
import torch
from sklearn.metrics import precision_score, recall_score, f1_score
from torch.utils.data import DataLoader

def evaluate(model, test_data, batch_size=2, use_cuda=True):
    test = ARXIVDataset(test_data)
    test_dataloader = DataLoader(test, batch_size=batch_size)

    device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")
    model = model.to(device)

    total_acc_test = 0
    all_preds_test = []
    all_labels_test = []

    model.eval()
    with torch.no_grad():
        for test_batch in tqdm(test_dataloader, desc="Evaluating"):
            input_ids, attention_mask, test_label = [item.to(device) for item in test_batch]

            output = model(input_ids, attention_mask)
            preds = output.argmax(dim=1)
            acc = (preds == test_label).sum().item()
            total_acc_test += acc

            all_preds_test.extend(preds.cpu().numpy())
            all_labels_test.extend(test_label.cpu().numpy())

    test_accuracy = total_acc_test / len(test_data)
    test_precision = precision_score(all_labels_test, all_preds_test, average='weighted', zero_division=0)
    test_recall = recall_score(all_labels_test, all_preds_test, average='weighted', zero_division=0)
    test_f1 = f1_score(all_labels_test, all_preds_test, average='weighted', zero_division=0)

    print(f'Test Accuracy: {test_accuracy:.3f}')
    print(f'Test Precision: {test_precision:.3f}')
    print(f'Test Recall: {test_recall:.3f}')
    print(f'Test F1 Score: {test_f1:.3f}')

# Assuming `model` and `df_test` are already defined
evaluate(model, df_test)

Evaluating: 100%|██████████| 250/250 [00:27<00:00,  9.26it/s]

Test Accuracy: 0.348
Test Precision: 0.249
Test Recall: 0.348
Test F1 Score: 0.278
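
In the results above, recall equals accuracy (0.348). That is expected with `average='weighted'`: each class's recall is weighted by its share of the true labels, so the weighted sum reduces to the fraction of correct predictions. A pure-Python sketch of support-weighted metrics that shows this; it is an illustration of the definition, not scikit-learn's code, and the labels are made up:

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 (0 when a denominator is 0)."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / len(y_true)  # share of true labels belonging to this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Illustrative labels only, not taken from the model.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 0]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

Class 2 is never predicted, so its precision falls back to 0, mirroring the `zero_division=0` setting used in the notebook.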





We may also be interested in how many parameters the model has, or in the specific prediction for a random observation.

In [37]:
# Count the total number of parameters
total_params = sum(p.numel() for p in model.parameters())

print(f"Total number of parameters in the model: {total_params}")

Total number of parameters in the model: 178578607
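
The total can be split into the classification head and the BERT encoder with simple arithmetic: the linear layer maps 768 features to 943 classes, so it holds 768 × 943 weights plus 943 biases. A short sketch using only numbers that appear in this notebook:

```python
hidden_size = 768            # input width of the linear layer in BertClassifier
num_categories = 943         # output width (cell 22)
total_params = 178_578_607   # value printed above

# Weights plus biases of the final linear layer.
head_params = hidden_size * num_categories + num_categories
# Everything else belongs to the pretrained BERT model.
encoder_params = total_params - head_params

print(f"classification head: {head_params:,}")
print(f"BERT encoder:        {encoder_params:,}")
```

Almost all of the parameters sit in the pretrained encoder; the task-specific head accounts for well under one percent.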


In [38]:
# Prediction for random observation (article)

import random
import json
import pandas as pd
import torch

# File path
data_file = "/content/arxiv-metadata-oai-snapshot.json"

# Function to load a random sample from the JSON data
def load_random_sample(file_path):
    with open(file_path, 'r') as file:
        # Read the entire file into memory
        lines = file.readlines()
        # Choose a random line
        random_line = random.choice(lines)
        # Parse the JSON object from the chosen line
        random_sample = json.loads(random_line)
    return random_sample

# Function to concatenate all fields into a single text column
def concatenate_fields(row):
    text = []
    for key, value in row.items():
        if key not in ['id', 'categories'] and value is not None:
            text.append(f"{key}: {value}")
    return ". ".join(text)

# Load a random sample
random_sample = load_random_sample(data_file)

# Normalize the random sample into a one-row DataFrame
# (a separate name avoids overwriting the dataset `df` used above)
sample_df = pd.json_normalize([random_sample])

# Apply the function to create the Text column
sample_df['Text'] = sample_df.apply(concatenate_fields, axis=1)

# Keep only the necessary columns
sample_df = sample_df[['id', 'Text', 'categories']]
sample_df.columns = ['ArticleId', 'Text', 'Categories']

# Assuming `tokenizer`, `model`, `device`, and `category2index` are already defined

# Tokenize the text using the BERT tokenizer
bert_input = tokenizer(
    sample_df['Text'].iloc[0],
    padding='max_length',
    max_length=512,  # Adjust the maximum length if needed
    truncation=True,
    return_tensors="pt"
)

# Pass the tokenized input to the model to obtain predictions
with torch.no_grad():
    model_output = model(
        bert_input['input_ids'].to(device),
        bert_input['attention_mask'].to(device)
    )

    # Obtain the index of the predicted category
    best_category_idx = model_output.argmax(dim=1).item()

    # Decode the predicted category index into its corresponding label
    best_category = list(category2index.keys())[best_category_idx]

    # Print predictions along with article ID, text, and categories
    print(f"Article ID: {sample_df['ArticleId'].iloc[0]}")
    print(f"Text: {sample_df['Text'].iloc[0]}")
    print(f"Categories: {sample_df['Categories'].iloc[0]}")
    print(f"Predicted Category: {best_category}")


Article ID: 0801.3685
Text: submitter: Parthapratim Biswas. authors: Parthapratim Biswas and D. A. Drabold. title: Inverse approach to atomistic modeling: Applications to a-Si:H and
  g-GeSe2. comments: 3 pages, 5 figures, submitted to Journal of Non-crystalline Solids. doi: 10.1016/j.jnoncrysol.2007.09.043. abstract:   We discuss an inverse approach for atomistic modeling of glassy materials.
The focus is on structural modeling and electronic properties of hydrogenated
amorphous silicon and glassy GeSe2 alloy. The work is based upon a new approach
"experimentally constrained molecular relaxation (ECMR)". Unlike conventional
approaches (such as molecular dynamics (MD) and Monte Carlo simulations(MC),
where a potential function is specified and the system evolves either
deterministically (MD) or stochastically (MC), we develop a novel scheme to
model structural configurations using experimental data in association with
density functional calculations. We have applied this approach to mo
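
`load_random_sample` reads every line of the snapshot into memory before choosing one, which can be heavy for a file of well over a gigabyte. A possible alternative is single-pass reservoir sampling, which keeps only one line in memory at a time. The sketch below is an assumption-labelled illustration; `random_line` is a hypothetical helper, and a list of fake lines stands in for the open file:

```python
import random


def random_line(lines, rng=random):
    """Pick one item uniformly at random from an iterable in a single pass."""
    chosen = None
    for count, line in enumerate(lines, start=1):
        # Replace the current choice with probability 1/count (reservoir sampling, k=1).
        if rng.randrange(count) == 0:
            chosen = line
    return chosen


# Works on any iterable, including an open file handle; a list stands in here.
demo_lines = [f'{{"id": "demo-{i}"}}\n' for i in range(1000)]
picked = random_line(demo_lines, rng=random.Random(0))
print(picked.strip())
```

Passing an open file handle instead of `demo_lines` would iterate over the file line by line, so memory use stays constant regardless of file size.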

Let's inspect the model and save its parameters.

In [59]:
# Print the model architecture
print(model)

# Get the total number of parameters in the model
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params}")

# Get the number of trainable parameters in the model
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params}")

# Get detailed information about each layer and its parameters
for name, param in model.named_parameters():
    print(f"Layer: {name}, Size: {param.size()}")

BertClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [48]:
# Save the trained model
torch.save(model.state_dict(), 'bert_classifier_final_ArXiv.pt')

In [52]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [57]:
import os

# Create the models directory on Google Drive if it doesn't exist
models_dir = '/content/drive/My Drive/models'
os.makedirs(models_dir, exist_ok=True)


In [58]:
# Define the path where you want to save the model in Google Drive
model_save_path = '/content/drive/My Drive/models/bert_classifier_final_ArXiv.pt'

# Save the trained model to Google Drive
torch.save(model.state_dict(), model_save_path)