<a href="https://colab.research.google.com/github/Dash400air/SRWS_PSG/blob/main/SRWS_PSG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPU check

In [1]:
!nvidia-smi

Sun Aug 15 07:14:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setup

In [2]:
!mkdir -p srws
%cd ./srws

/content/srws


In [3]:
!pip install transformers==4.5.0 pytorch-lightning==1.2.7



In [4]:
import os
import random
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import pytorch_lightning as pl

from sklearn.metrics import fbeta_score
from sklearn.model_selection import StratifiedKFold

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/SRWS-PSG/train (2).csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/SRWS-PSG/test (1).csv')

# Data

In [7]:
train.head()

Unnamed: 0,id,title,abstract,judgement
0,0,One-year age changes in MRI brain volumes in o...,Longitudinal studies indicate that declines in...,0
1,1,Supportive CSF biomarker evidence to enhance t...,The present study was undertaken to validate t...,0
2,2,Occurrence of basal ganglia germ cell tumors w...,Objective: To report a case series in which ba...,0
3,3,New developments in diagnosis and therapy of C...,The etiology and pathogenesis of idiopathic ch...,0
4,4,Prolonged shedding of SARS-CoV-2 in an elderly...,,0


In [8]:
train.describe()

Unnamed: 0,id,judgement
count,27145.0,27145.0
mean,13572.0,0.023282
std,7836.230865,0.150802
min,0.0,0.0
25%,6786.0,0.0
50%,13572.0,0.0
75%,20358.0,0.0
max,27144.0,1.0
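
The mean of a 0/1 column is its positive rate, so the `judgement` summary above implies a heavily imbalanced dataset. A minimal sketch of that arithmetic, using only the rounded figures printed by `describe()` (the variable names are illustrative, and the positive count is an estimate recovered from the rounded mean):

```python
# Figures copied from the describe() output above.
n_rows = 27145
positive_rate = 0.023282  # mean of the 0/1 judgement column (rounded)

# Estimated number of positive rows, recovered from the rounded mean.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

# A classifier that always predicts 0 is right on every negative row
# and misses every positive one.
always_negative_accuracy = n_negative / n_rows

print(n_positive, n_negative)
print(f"{always_negative_accuracy:.4f}")
```

Plain accuracy says little at this ratio, which fits the notebook's choice of a recall-weighted F-beta score.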


In [9]:
test.head()

Unnamed: 0,id,title,abstract
0,27145,Estimating the potential effects of COVID-19 p...,The objective of the paper is to analyse chang...
1,27146,Leukoerythroblastic reaction in a patient with...,
2,27147,[15O]-water PET and intraoperative brain mappi...,[15O]-water PET was performed on 12 patients w...
3,27148,Adaptive image segmentation for robust measure...,We present a method that significantly improve...
4,27149,Comparison of Epidemiological Variations in CO...,The objective of this study is to compare the ...


In [10]:
# Keep only the first 250 rows; rows 250 to 27144 are dropped.
train = train.drop(index=range(250,27145))

In [11]:
train.tail()

Unnamed: 0,id,title,abstract,judgement
245,245,Inhibition of amyloid fibrillogenesis and toxi...,Aggregation of proteins in tissues is associat...,0
246,246,Patterns of cortical thinning in idiopathic ra...,Idiopathic rapid eye movement sleep behavior d...,0
247,247,Breast Cancer Patients' Response to COVID-19-R...,,0
248,248,Diagnosis of aortitis in 18F-FDG-PET,,0
249,249,COVID-19 patients with hypertension have more ...,This study aims to explore the effect of hyper...,0


# Config

In [12]:
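class _Config:
    def __init__(self):
        self.model = "roberta-large"  # Hugging Face model name
        self.title_max = 134          # max token length for titles
        self.abstract_max = 3302      # abstract length limit (not used in this notebook)
        self.seed = 471               # seed for the RNGs and the fold split

# Instantiate under a separate class name so the class itself is not shadowed.
Config = _Config()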

# Seed

In [13]:
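def seed_torch(seed=42):
    # Seed every RNG the notebook touches so runs are repeatable.
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable its autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_torch(Config.seed)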
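
Seeding makes pseudo-random draws repeatable: the same seed always produces the same sequence. The sketch below shows this with Python's `random` module only. The helper `draw` is hypothetical and not part of the notebook.

```python
import random

def draw(seed, n=3):
    # Re-seed Python's RNG, then draw n pseudo-random numbers.
    random.seed(seed)
    return [random.random() for _ in range(n)]

same_a = draw(471)
same_b = draw(471)
other = draw(42)

print(same_a == same_b)  # same seed, same sequence
print(same_a == other)   # different seed, different sequence
```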

# Kfold

In [14]:
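def get_train_data(train):
    # Stratified 5-fold split so each fold keeps roughly the same share of positive labels.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=Config.seed)
    for n, (train_index, val_index) in enumerate(skf.split(train, train["judgement"])):
        # val_index holds row positions; they match the index labels because the index runs 0..n-1.
        train.loc[val_index, "fold"] = int(n)
    train["fold"] = train["fold"].astype(np.uint8)

    return train

train = get_train_data(train)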



In [15]:
train.head()

Unnamed: 0,id,title,abstract,judgement,fold
0,0,One-year age changes in MRI brain volumes in o...,Longitudinal studies indicate that declines in...,0,1
1,1,Supportive CSF biomarker evidence to enhance t...,The present study was undertaken to validate t...,0,3
2,2,Occurrence of basal ganglia germ cell tumors w...,Objective: To report a case series in which ba...,0,3
3,3,New developments in diagnosis and therapy of C...,The etiology and pathogenesis of idiopathic ch...,0,2
4,4,Prolonged shedding of SARS-CoV-2 in an elderly...,,0,0


In [16]:
test.head()

Unnamed: 0,id,title,abstract
0,27145,Estimating the potential effects of COVID-19 p...,The objective of the paper is to analyse chang...
1,27146,Leukoerythroblastic reaction in a patient with...,
2,27147,[15O]-water PET and intraoperative brain mappi...,[15O]-water PET was performed on 12 patients w...
3,27148,Adaptive image segmentation for robust measure...,We present a method that significantly improve...
4,27149,Comparison of Epidemiological Variations in CO...,The objective of this study is to compare the ...


# Dataset

In [17]:
class BaseDataset(Dataset):
    def __init__(self, df, include_labels=True):
        tokenizer = RobertaTokenizer.from_pretrained(Config.model)

        self.df = df
        self.include_labels = include_labels

        self.title = df["title"].tolist()
        self.encoded = tokenizer(
            self.title,
            return_tensors='pt',
            max_length = Config.title_max,
            padding = 'max_length', 
            truncation = True,
            return_attention_mask=True
        )
        
        if self.include_labels:
            self.labels = df["judgement"].values

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        output = {
            'input_ids': self.encoded['input_ids'][idx],
            'attention_mask': self.encoded['attention_mask'][idx],
        }

        if self.include_labels:
            # Float labels: with num_labels=1 the model computes a regression (MSE) loss.
            output['labels'] = torch.tensor(self.labels[idx]).float()

        return output
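
The tokenizer call above combines `padding='max_length'` with `truncation=True`, so every title becomes exactly `Config.title_max` token ids plus a matching attention mask. The sketch below imitates that behaviour on plain lists. `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary, and `pad_id=1` is assumed to be RoBERTa's pad id. It ignores special-token handling, which the real tokenizer takes care of.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Hypothetical helper: fix a token id list to exactly max_length entries."""
    # Truncation: keep at most max_length ids.
    ids = token_ids[:max_length]
    # Attention mask: 1 for real tokens, 0 for padding.
    mask = [1] * len(ids)
    # Padding: fill the remainder with pad_id.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# Arbitrary illustrative ids, not real tokenizer output.
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row has the same length, the encoded titles can be stacked into a single tensor and indexed by row in `__getitem__`.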

# Model

In [18]:
class RobertaForSequenceClassification_pl(pl.LightningModule):

  def __init__(self, model_name, num_labels, lr):
    # model_name: name of the Transformers model
    # num_labels: number of labels
    # lr: learning rate

    super().__init__()

    # Store model_name, num_labels, and lr in self.hparams (and in checkpoints).
    self.save_hyperparameters()
    
    self.bert_sc = RobertaForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )

  def training_step(self, batch, batch_idx):
    output = self.bert_sc(**batch)
    loss = output.loss
    self.log('train_loss', loss)
    return loss

  def validation_step(self, batch, batch_idx):
    output = self.bert_sc(**batch)
    val_loss = output.loss
    self.log('val_loss', val_loss)

  def test_step(self, batch, batch_idx):
    # fbeta_score does not accept GPU tensors, so move the labels to the CPU and convert them to NumPy.
    labels = batch.pop('labels').detach().cpu().numpy().astype(int)
    output = self.bert_sc(**batch)
    # With num_labels=1 the logits have shape (batch, 1), so argmax(-1) would always return 0.
    # Threshold the single score instead; 0.5 is an assumed cutoff, not a tuned value.
    scores = output.logits.detach().cpu().numpy().reshape(-1)
    labels_predicted = (scores > 0.5).astype(int)
    fbeta = fbeta_score(labels, labels_predicted, beta=7)
    self.log('fbeta_score', fbeta)
  
  def configure_optimizers(self):
    return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
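
`test_step` scores predictions with `fbeta_score(..., beta=7)`. F-beta combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), so β = 7 weights recall far more heavily than precision. The sketch below evaluates that formula from confusion counts. The function name `fbeta` and all counts are illustrative, not results from this notebook.

```python
def fbeta(tp, fp, fn, beta):
    """F-beta from confusion counts; returns 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts (not results from this notebook).
high_recall = fbeta(tp=9, fp=91, fn=1, beta=7)     # recall 0.90, precision 0.09
high_precision = fbeta(tp=1, fp=0, fn=9, beta=7)   # recall 0.10, precision 1.00

print(round(high_recall, 3), round(high_precision, 3))
```

With β = 7, the high-recall case scores well above the high-precision one even though its precision is only 0.09. That matches a screening task where a missed positive costs more than a false alarm.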

# Dataloader

In [19]:
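# Fold 0 is used for validation, fold 1 as a held-out test split, and folds 2-4 for training.
# The names valid_fold/test_fold keep the `test` DataFrame loaded earlier from being overwritten.
valid_fold = 0
test_fold = 1

trn_idx = train[train["fold"] > test_fold].index
val_idx = train[train["fold"] == valid_fold].index
test_idx = train[train["fold"] == test_fold].index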

train_folds = train.loc[trn_idx].reset_index(drop=True)
valid_folds = train.loc[val_idx].reset_index(drop=True)
test_folds = train.loc[test_idx].reset_index(drop=True)

train_dataset = BaseDataset(train_folds)
valid_dataset = BaseDataset(valid_folds)
test_dataset = BaseDataset(test_folds)

train_loader = DataLoader(
        train_dataset,
        batch_size=16,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )

valid_loader = DataLoader(
        valid_dataset,
        batch_size=16,
        shuffle=False,
        num_workers=4,
        pin_memory=True,
        drop_last=False,
    )

test_loader = DataLoader(
        test_dataset,
        batch_size=16,
    )
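
The cell above turns the five folds into a three-way split: one fold for validation, one for a held-out test set, and the remaining three for training. With 250 rows and five equal folds, that is 150 / 50 / 50 rows. The sketch below shows the same partition on a toy fold column. `split_by_fold` is a hypothetical helper, and only the first five fold values come from the `train.head()` output shown earlier; the rest are made up.

```python
def split_by_fold(folds, valid_fold=0, test_fold=1):
    """Hypothetical helper: return row positions for train, validation, and test."""
    train_rows, valid_rows, test_rows = [], [], []
    for row, fold in enumerate(folds):
        if fold == valid_fold:
            valid_rows.append(row)
        elif fold == test_fold:
            test_rows.append(row)
        else:
            # Every fold that is neither validation nor test is used for training.
            train_rows.append(row)
    return train_rows, valid_rows, test_rows

# Toy fold column; the first five values match train.head(), the rest are made up.
folds = [1, 3, 3, 2, 0, 4, 2, 0, 1, 4]
train_rows, valid_rows, test_rows = split_by_fold(folds)

print(train_rows, valid_rows, test_rows)
```

Selecting training rows as "everything else" gives the same rows as the `fold >` comparison in the cell only because the validation and test folds are 0 and 1. The explicit form stays correct if other folds are chosen.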


# Run

In [20]:
checkpoint = pl.callbacks.ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=1,
    save_weights_only=True,
    dirpath='model/',
)

trainer = pl.Trainer(
    gpus=1,
    max_epochs=5,
    callbacks = [checkpoint]
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores


In [21]:
model = RobertaForSequenceClassification_pl(
    Config.model, num_labels=1, lr=1e-5)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.weight', 'classif

In [22]:
trainer.fit(model, train_loader, valid_loader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                             | Params
-------------------------------------------------------------
0 | bert_sc | RobertaForSequenceClassification | 355 M 
-------------------------------------------------------------
355 M     Trainable params
0         Non-trainable params
355 M     Total params
1,421.443 Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

1

# Test

In [23]:
tokenizer = RobertaTokenizer.from_pretrained(Config.model)

encoded = tokenizer(
    test_folds["title"].tolist(),
    return_tensors='pt',
    max_length=Config.title_max,
    padding='max_length',
    truncation=True,  # match BaseDataset so over-length titles cannot break tensor creation
)

encoded = { k: v.cuda() for k, v in encoded.items() }

In [24]:
best_model = checkpoint.best_model_path
model = RobertaForSequenceClassification_pl.load_from_checkpoint(best_model)
# Move the model to the GPU and make sure dropout is disabled for inference.
bert_sc = model.bert_sc.cuda().eval()

In [25]:
with torch.no_grad():
  output = bert_sc(**encoded)
score = output.logits

In [28]:
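# Softmax over dim=0 normalizes across the 50 test rows rather than per row,
# so the values below sum to 1 and each sits near 1/50 = 0.02; they are not class probabilities.
m = nn.Softmax(dim=0)
print(m(score))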

tensor([[0.0203],
        [0.0200],
        [0.0201],
        [0.0197],
        [0.0202],
        [0.0198],
        [0.0202],
        [0.0199],
        [0.0201],
        [0.0200],
        [0.0199],
        [0.0198],
        [0.0200],
        [0.0199],
        [0.0198],
        [0.0197],
        [0.0202],
        [0.0198],
        [0.0199],
        [0.0198],
        [0.0202],
        [0.0203],
        [0.0198],
        [0.0200],
        [0.0200],
        [0.0199],
        [0.0199],
        [0.0200],
        [0.0199],
        [0.0200],
        [0.0200],
        [0.0201],
        [0.0203],
        [0.0204],
        [0.0197],
        [0.0198],
        [0.0199],
        [0.0198],
        [0.0205],
        [0.0206],
        [0.0199],
        [0.0200],
        [0.0202],
        [0.0204],
        [0.0200],
        [0.0201],
        [0.0200],
        [0.0198],
        [0.0199],
        [0.0197]], device='cuda:0')
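
Every value printed above is close to 0.02 because a softmax taken across the 50 rows of a single-column tensor has to sum to 1 over those rows, so near-equal scores each receive about 1/50. The sketch below reproduces that effect in pure Python with made-up scores and contrasts it with a per-row threshold on the raw score. The 0.5 cutoff is an assumption, not a tuned value.

```python
import math
import random

random.seed(0)
# Stand-in for the (50, 1) logits: 50 nearly equal raw scores (made-up values).
scores = [0.02 + random.gauss(0, 0.01) for _ in range(50)]

# Softmax across the rows (the dim=0 case): the 50 outputs sum to 1.
top = max(scores)
exps = [math.exp(s - top) for s in scores]
total = sum(exps)
softmax_dim0 = [e / total for e in exps]

# Per-row alternative: compare each raw score with a cutoff (0.5 is assumed).
predicted = [int(s > 0.5) for s in scores]

print(round(sum(softmax_dim0), 6), round(min(softmax_dim0), 4), round(max(softmax_dim0), 4))
print(sum(predicted))
```

With `num_labels=1` the model is trained as a regressor toward the 0/1 label, so each row's raw score can be read on its own without normalizing across rows.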
