# Papers classification

***Our goal is to build a model that uses a paper's abstract and title to predict whether it will be rejected or not.***

This article comes with code for fine-tuning BERT to perform text classification on a dataset of accepted and rejected scientific papers.

In this article, we will:

- Load the Papers-Dataset dataset
- Load a BERT model from [Huggingface](https://huggingface.co/)
- Build our own model by combining BERT with a classifier
- Train the model by fine-tuning BERT on our task
- Save the model so it can be used to classify papers

By the end, you will have an architecture that you can reuse in your next text classification projects.

### What is BERT?

Introduced in 2018, [BERT: Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805) is, according to its authors, designed to pre-train deep bidirectional representations from unlabeled text by **jointly conditioning on both left and right context** in all layers.

BERT arrived to complement two word embedding techniques: [ELMo](https://paperswithcode.com/method/elmo) and [GPT](https://paperswithcode.com/method/gpt).
ELMo encodes context bidirectionally but relies on task-specific architectures, whereas GPT is task-agnostic but encodes context from left to right only.

We can summarize the characteristics of these models as follows:

| Model | Context              | Task             | Encoding         |
| ----- | -------------------- | ---------------- | ---------------- |
| ELMo  | context-sensitive ✅ | task-specific    | bidirectional ✅ |
| GPT   | context-sensitive ✅ | task-agnostic ✅ | left-to-right    |
| BERT  | context-sensitive ✅ | task-agnostic ✅ | bidirectional ✅ |
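
To make the difference between left-to-right and bidirectional encoding concrete, here is a minimal sketch in plain Python. It builds the visibility matrix that each encoding style implies for a short sentence: entry `[i][j]` is 1 when token `i` is allowed to look at token `j`. The helper names and the example sentence are illustrative assumptions, not code taken from BERT or GPT.

```python
def left_to_right_mask(n):
    """GPT-style: token i sees only tokens 0..i (itself and its left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every token sees every token, to its left and to its right."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "paper", "was", "accepted"]  # hypothetical example sentence

for name, mask in [("left-to-right", left_to_right_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>8} {row}")
```

With the left-to-right mask, the representation of `paper` cannot depend on `accepted`; with the bidirectional mask it can.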

### Tools and prerequisites
To build our model, we will work with the PyTorch framework and [PyTorch Lightning](https://www.pytorchlightning.ai/).

## 1- Data preparation

### 1-1 Loading the data

You can find the dataset used at this [address](https://raw.githubusercontent.com/Godwinh19/Papers-Dataset/main/data/ICLR%20papers%20datasets.csv).

In [1]:
import numpy as np
import pandas as pd
import requests
import io
import warnings
warnings.filterwarnings('ignore')

In [2]:
dataset_url = "https://raw.githubusercontent.com/Godwinh19/Papers-Dataset/main/data/ICLR%20papers%20datasets.csv"
s = requests.get(dataset_url).content
data = pd.read_csv(io.StringIO(s.decode('utf-8')), usecols=['title', 'abstract', 'accepted'])
# Display the first rows of our data
data.head()

|     | title                                             | abstract                                          | accepted |
| --- | ------------------------------------------------- | ------------------------------------------------- | -------- |
| 0   | What Matters for On-Policy Deep Actor-Critic M... | In recent years, reinforcement learning (RL) h... | 1        |
| 1   | Theoretical Analysis of Self-Training with Dee... | Self-training algorithms, which train a model ... | 1        |
| 2   | Learning to Reach Goals via Iterated Supervise... | Current reinforcement learning (RL) algorithms... | 1        |
| 3   | Deep symbolic regression: Recovering mathemati... | Discovering the underlying mathematical expres... | 1        |
| 4   | Optimal Rates for Averaged Stochastic Gradient... | We analyze the convergence of the averaged sto... | 1        |


In this table, we have:
- title: the title of the paper
- abstract: the abstract
- accepted: whether the paper was accepted (1) or not (0)

These three fields, *title, abstract, accepted*, are the only ones we keep from the CSV file (via `usecols`).

### 1-2 Transforming the columns

We merge the *title* and *abstract* columns into a single column called *description*, then rename the *accepted* field to *label* to reflect its role.

In [3]:
data['description'] = data['title'] + " - " + data['abstract']
transformed_data = data[['description', 'accepted']].rename(columns={'accepted': 'label'}).copy()
transformed_data.head()

|     | description                                       | label |
| --- | ------------------------------------------------- | ----- |
| 0   | What Matters for On-Policy Deep Actor-Critic M... | 1     |
| 1   | Theoretical Analysis of Self-Training with Dee... | 1     |
| 2   | Learning to Reach Goals via Iterated Supervise... | 1     |
| 3   | Deep symbolic regression: Recovering mathemati... | 1     |
| 4   | Optimal Rates for Averaged Stochastic Gradient... | 1     |


### 1-3 Transforming the data into model inputs

> Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity.
>
> <cite>PyTorch docs</cite>

With PyTorch, we load the data through the `Dataset` class:

In [4]:
import torch
from torch.utils.data import Dataset

class PapersDataset(Dataset):
    def __init__(self, description, targets, tokenizer, max_length):
        self.description = description
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.description)
    
    def __getitem__(self, item):
        description = str(self.description[item])
        target = self.targets[item]
        
        encoding = self.tokenizer.encode_plus(
            description,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding="max_length",
            return_attention_mask=True,
            return_tensors="pt",
            truncation=True,
        )
        
        return {
            "article_text": description,
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "targets": torch.tensor(target, dtype=torch.long),
        }
        

Above, we introduced `tokenizer`. Simply put, tokenization is the process of splitting a large sample of text into words (BERT's tokenizer goes one step further and splits rare words into sub-word pieces). It is a fundamental requirement in natural language processing tasks, where each unit must be captured separately for further analysis. [Read about tokenization here](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)
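
To show what the tokenizer call in `PapersDataset` produces, here is a rough sketch in plain Python of the two ideas involved: greedy longest-match-first sub-word splitting (the WordPiece strategy used by BERT) and padding or truncating to a fixed length with an attention mask. The vocabulary, the ids and the helper names `wordpiece` and `encode` are hypothetical; the real tokenizer also handles punctuation and uses different ids.

```python
# Toy vocabulary: hypothetical, far smaller than the 30,522 entries of bert-base-uncased.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "deep": 4, "learn": 5, "##ing": 6, "model": 7, "##s": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), truncated and padded to max_length."""
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    tokens = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    mask = [1] * len(ids)  # 1 = real token, 0 = padding
    padding = max_length - len(ids)
    return ids + [vocab["[PAD]"]] * padding, mask + [0] * padding


input_ids, attention_mask = encode("Deep learning models", VOCAB, max_length=8)
print(input_ids)       # [2, 4, 5, 6, 7, 8, 3, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask tells the model which positions hold real tokens, so that the padding added to reach `max_length` is ignored.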

## 2- Loading the data into dataloaders

Still with the goal of keeping the code tidy and easy to maintain, we apply one last transformation: loading the data into `DataLoader`s. To set this up, we use *PyTorch Lightning*.
For more details, see the documentation of Lightning's data module, which explains each step of the process, [here](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.core.datamodule.html#pytorch_lightning.core.datamodule.LightningDataModule).

In [5]:
import pytorch_lightning as pl
import torch
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

In [6]:
class BertDataModule(pl.LightningDataModule):
    def __init__(self, **kwargs):
        """
        Initialization of inherited lightning data module
        """
        super(BertDataModule, self).__init__()
        self.BERT_PRE_TRAINED_MODEL_NAME = "bert-base-uncased"
        self.df_train = None
        self.df_val = None
        self.df_test = None
        
        self.train_data_loader = None
        self.val_data_loader = None
        self.test_data_loader = None
        
        self.MAX_LEN = 100
        self.encoding = None
        self.tokenizer = None
    
    def setup(self, stage=None):
        """
        Read the data, parse it and split the data into train, test, validation data

        :param stage: Stage - training or testing
        """
        
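        # Seed first so that both the sampling and the splits are reproducible
        RANDOM_SEED = 0
        np.random.seed(RANDOM_SEED)
        torch.manual_seed(RANDOM_SEED)

        # Work on a random sample of 80 papers
        num_samples = 80
        df = transformed_data.sample(num_samples, random_state=RANDOM_SEED)

        self.tokenizer = BertTokenizer.from_pretrained(self.BERT_PRE_TRAINED_MODEL_NAME)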
        
        df_train, df_test = train_test_split(
            df, test_size=0.3, random_state=RANDOM_SEED, stratify=df["label"]
        )
        
        df_val, df_test = train_test_split(
            df_test, test_size=0.5, random_state=RANDOM_SEED, stratify=df_test["label"]
        )
        
        self.df_train, self.df_val, self.df_test = df_train, df_val, df_test
    
    def create_data_loader(self, df, tokenizer, max_len, batch_size=8):
        """
        Generic data loader function

        :param df: Input dataframe
        :param tokenizer: bert tokenizer
        :param max_len: Max token length of each description
        :param batch_size: Batch size for training

        :return: Returns the constructed dataloader
        """
        dataset = PapersDataset(
            description=df.description.to_numpy(),
            targets=df.label.to_numpy(),
            tokenizer=tokenizer,
            max_length=max_len
        )
        
        return DataLoader(
            dataset, batch_size=batch_size, num_workers=0
        )
    
    def train_dataloader(self):
        """
        :return: output - Train data loader for the given input
        """
        self.train_data_loader = self.create_data_loader(
            self.df_train, self.tokenizer, self.MAX_LEN 
        )
        
        return self.train_data_loader
    
    def val_dataloader(self):
        """
        :return: output - Validation data loader for the given input
        """
        self.val_data_loader = self.create_data_loader(
            self.df_val, self.tokenizer, self.MAX_LEN
        )
        return self.val_data_loader

    def test_dataloader(self):
        """
        :return: output - Test data loader for the given input
        """
        self.test_data_loader = self.create_data_loader(
            self.df_test, self.tokenizer, self.MAX_LEN
        )
        return self.test_data_loader
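
As a sanity check on the two `train_test_split` calls in `setup()`, the arithmetic below works out how many papers end up in each split and how many batches one training epoch contains. It is a small sketch in plain Python that assumes scikit-learn rounds the held-out share up and gives the remainder to the other part, and that the `DataLoader` keeps the last, possibly smaller, batch.

```python
import math

num_samples = 80   # rows sampled in setup()
batch_size = 8     # default of create_data_loader()

# First split: 30% of the sample is held out, the rest is used for training.
n_holdout = math.ceil(0.3 * num_samples)
n_train = num_samples - n_holdout

# Second split: the held-out part is halved into validation and test sets.
n_test = math.ceil(0.5 * n_holdout)
n_val = n_holdout - n_test

# Without drop_last, a DataLoader yields ceil(n / batch_size) batches.
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, n_test)   # 56 12 12
print(steps_per_epoch)          # 7
```

Seven steps per epoch is consistent with the training log in section 4, where the global step reaches 6, 13, 20 and so on at the end of successive epochs.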

## 3- Building the network
In this step, we build our classifier on top of a pre-trained BERT model.

Configuring a model with PyTorch Lightning is explained [here](https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html).

In [7]:
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    LearningRateMonitor,
)
from sklearn.metrics import accuracy_score
from torch import nn
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated
from transformers import BertModel

class BertPapersClassifier(pl.LightningModule):
    def __init__(self, **kwargs):
        """
        Initializes the network, optimizer and scheduler
        """
        super(BertPapersClassifier, self).__init__()
        self.BERT_PRE_TRAINED_MODEL_NAME = "bert-base-uncased"
        
        self.bert_model = BertModel.from_pretrained(self.BERT_PRE_TRAINED_MODEL_NAME)
    
        
        # Freeze BERT's weights: only the classification head defined below is trained
        for param in self.bert_model.parameters():
            param.requires_grad = False
        
        self.drop = nn.Dropout(p=0.2)
        n_classes = 2
        
        self.fc1 = nn.Linear(self.bert_model.config.hidden_size, 512)
        self.out = nn.Linear(512, n_classes)
        
        self.scheduler = None
        self.optimizer = None
    
    def forward(self, input_ids, attention_mask):
        """
        :param input_ids: Input data
        :param attention_mask: Attention mask value

        :return: output - Accepted or not for the given papers snippet
        """
        output = self.bert_model(input_ids=input_ids, attention_mask=attention_mask)
        output = F.relu(self.fc1(output.pooler_output))
        output = self.drop(output)
        output = self.out(output)
        return output
    
    def training_step(self, train_batch, batch_idx):
        """
        Training the data as batches and returns training loss on each batch

        :param train_batch: Batch data
        :param batch_idx: Batch indices

        :return: output - Training loss
        """
        input_ids = train_batch["input_ids"].to(self.device)
        attention_mask = train_batch["attention_mask"].to(self.device)
        targets = train_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        loss = F.cross_entropy(output, targets)
        self.log("train loss", loss)
        return {"loss": loss}
    
    def test_step(self, test_batch, batch_idx):
        """
        Performs test and computes the accuracy of the model

        :param test_batch: Batch data
        :param batch_idx: Batch indices

        :return: output - Testing accuracy
        """
        input_ids = test_batch["input_ids"].to(self.device)
        attention_mask = test_batch["attention_mask"].to(self.device)
        targets = test_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        _, y_hat = torch.max(output, dim=1)
        test_acc = accuracy_score(y_hat.cpu(), targets.cpu())
        
        return {"test_acc": torch.tensor(test_acc)}
    
    def validation_step(self, val_batch, batch_idx):
        """
        Performs validation of data in batches

        :param val_batch: Batch data
        :param batch_idx: Batch indices

        :return: output - valid step loss
        """

        input_ids = val_batch["input_ids"].to(self.device)
        attention_mask = val_batch["attention_mask"].to(self.device)
        targets = val_batch["targets"].to(self.device)
        
        output = self.forward(input_ids, attention_mask)
        loss =  F.cross_entropy(output, targets)
        
        return {"val_step_loss": loss}
    
    def validation_epoch_end(self, outputs):
        """
        Computes average validation loss

        :param outputs: outputs after every epoch end

        :return: output - average valid loss
        """
        avg_loss = torch.stack([x["val_step_loss"] for x in outputs]).mean()
        self.log("val_loss", avg_loss, sync_dist=True)

    def test_epoch_end(self, outputs):
        """
        Computes average test accuracy score

        :param outputs: outputs after every epoch end

        :return: output - average test accuracy
        """
        avg_test_acc = torch.stack([x["test_acc"] for x in outputs]).mean()
        self.log("avg_test_acc", avg_test_acc)
    
    def configure_optimizers(self):
        """
        Initializes the optimizer and learning rate scheduler

        :return: output - Initialized optimizer and scheduler
        """
        self.optimizer = AdamW(self.parameters(), lr=0.001)
        self.scheduler = {
            "scheduler": torch.optim.lr_scheduler.ReduceLROnPlateau(
                self.optimizer,
                mode="min",
                factor=0.2,
                patience=2,
                min_lr=1e-6,
                verbose=True
            ),
            "monitor": "val_loss",
        }
        return [self.optimizer], [self.scheduler]
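
To clarify what `F.cross_entropy(output, targets)` and `torch.max(output, dim=1)` compute from the two logits returned by `forward`, here is a small worked sketch in plain Python for a single sample. The logit values and the helper names `cross_entropy` and `predict` are hypothetical; PyTorch additionally averages the loss over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def predict(logits):
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.3, 1.1]  # hypothetical scores for [rejected, accepted]

print(predict(logits))                      # 1
print(round(cross_entropy(logits, 1), 4))   # small loss: the prediction agrees with the label
print(round(cross_entropy(logits, 0), 4))   # larger loss: the prediction disagrees
```

A model that gives both classes equal logits gets a loss of ln 2 ≈ 0.693, which is a handy reference point when reading `val_loss` values.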

## 4- Training

In [11]:
import os
from pytorch_lightning import Trainer

torch.cuda.empty_cache()

data_module = BertDataModule()
data_module.setup(stage="fit")

b_model = BertPapersClassifier()
early_stopping = EarlyStopping(monitor="val_loss", mode="min", verbose=True)

checkpoint_callback = ModelCheckpoint(
        dirpath=os.getcwd(),
        save_top_k=1,
        verbose=True,
        monitor="val_loss",
        
        mode="min",
    )
lr_logger = LearningRateMonitor()

trainer = pl.Trainer(
    max_epochs=10, accelerator="gpu", devices=1,
    callbacks=[lr_logger, early_stopping, checkpoint_callback],
)

trainer.fit(b_model, data_module)
trainer.test(datamodule=data_module)

torch.save(b_model.state_dict(), "bert_model_dict.pt")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

Metric val_loss improved. New best score: 0.783
Epoch 0, global step 6: val_loss reached 0.78308 (best 0.78308), saving model to "C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=0-step=6.ckpt" as top 1

Metric val_loss improved by 0.127 >= min_delta = 0.0. New best score: 0.656
Epoch 1, global step 13: val_loss reached 0.65584 (best 0.65584), saving model to "C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=1-step=13.ckpt" as top 1

Epoch 2, global step 20: val_loss was not in top 1

Epoch 3, global step 27: val_loss was not in top 1

Metric val_loss improved by 0.026 >= min_delta = 0.0. New best score: 0.630
Epoch 4, global step 34: val_loss reached 0.63005 (best 0.63005), saving model to "C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=4-step=34.ckpt" as top 1

Metric val_loss improved by 0.033 >= min_delta = 0.0. New best score: 0.597
Epoch 5, global step 41: val_loss reached 0.59723 (best 0.59723), saving model to "C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=5-step=41.ckpt" as top 1

Epoch 6, global step 48: val_loss was not in top 1

Epoch 7, global step 55: val_loss was not in top 1

Monitored metric val_loss did not improve in the last 3 records. Best score: 0.597. Signaling Trainer to stop.
Epoch 8, global step 62: val_loss was not in top 1

Epoch     9: reducing learning rate of group 0 to 2.0000e-04.

Restoring states from the checkpoint path at C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=5-step=41.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at C:\Users\RD_3\Documents\Python\PapersDataset\Papers-classification\epoch=5-step=41.ckpt

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'avg_test_acc': 0.6875}
--------------------------------------------------------------------------------
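
The log line `reducing learning rate of group 0 to 2.0000e-04` is the `ReduceLROnPlateau` scheduler from `configure_optimizers` at work: after more than `patience=2` epochs without a better `val_loss`, the learning rate 0.001 is multiplied by `factor=0.2`. The sketch below reproduces that rule in plain Python. It is a simplification that ignores the scheduler's improvement threshold and cooldown, and the loss values fed to it are hypothetical.

```python
def reduce_on_plateau(val_losses, lr=1e-3, factor=0.2, patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # more than `patience` epochs without improvement
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses: two improvements, then three stalled epochs.
losses = [0.70, 0.60, 0.61, 0.62, 0.63]
print(reduce_on_plateau(losses))
```

In this run the reduction came too late to matter: `EarlyStopping`, with its default patience of 3, stopped training on the same third epoch without improvement.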


## 5- Test


In [12]:
import torch

article = "Sparse Quantized Spectral Clustering - Given a large data matrix, sparsifying, quantizing, and/or performing other entry-wise nonlinear operations can have numerous benefits, ranging from speeding up iterative algorithms for core numerical linear algebra problems to providing nonlinear filters to design state-of-the-art neural network models. Here, we exploit tools from random matrix theory to make precise statements about how the eigenspectrum of a matrix changes under such nonlinear transformations. In particular, we show that very little change occurs in the informative eigenstructure, even under drastic sparsification/quantization, and consequently that very little downstream performance loss occurs when working with very aggressively sparsified or quantized spectral clustering problems.\
We illustrate how these results depend on the nonlinearity, we characterize a phase transition beyond which spectral clustering becomes possible, and we show when such nonlinear transformations can introduce spurious non-informative eigenvectors."
#original label = 1 : accepted

# Predict on a single article description.
model = BertPapersClassifier()

# map_location="cpu" lets the weights load even on a machine without a GPU
model.load_state_dict(torch.load("bert_model_dict.pt", map_location="cpu"))
model.eval()

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer(article, padding=True)

input_ids = torch.tensor(inputs["input_ids"]).unsqueeze(0)
attention_mask = torch.tensor(inputs["attention_mask"]).unsqueeze(0)


print(model)
print(input_ids)

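# Inference only: no gradients are needed
with torch.no_grad():
    out = model(input_ids, attention_mask)
print(out)
print(torch.max(out, 1))
# True if the predicted class is 1 (accepted)
print(torch.max(out, 1).indices == torch.tensor([1]))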



BertPapersClassifier(
  (bert_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

Taking [this paper](https://openreview.net/forum?id=pBqLS-7KYAF), which was not in our data, as an example, our model predicts an acceptance 😁.

## Closing note

The full code is available [here](https://github.com/Godwinh19/Papers-Dataset/).
Our dataset is very small and we plan to expand it; even so, the model does its job.
In a future article, we will see how to monitor our models and put the best-performing ones into production with the best hyperparameters.

Cheers ☕!