<a href="https://colab.research.google.com/github/Tyanakai/medical_paper_classification/blob/main/medical_summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Medical Paper Automatic Classification Challenge: Summary Generation</h1>

# 1. Introduction

This notebook generates summaries from the text data in the files (`p_train.csv`, `p_test.csv`) that were preprocessed and saved in [medical_EDA.ipynb](https://github.com/Tyanakai/medical_paper_classification/blob/main/medical_EDA.ipynb).<br>
The model is this [Longformer Encoder-Decoder (LED)](https://huggingface.co/patrickvonplaten/led-large-16384-pubmed), which was trained on the PubMed dataset.<br>
The notebook is intended to be run on Colaboratory with the runtime type set to GPU.


# 2. Prerequisites

- [medical_EDA.ipynb](https://github.com/Tyanakai/medical_paper_classification/blob/main/medical_EDA.ipynb) has been run

# 3. Environment Setup

## 3.1 GPU
The computation is heavy, so we use a GPU.

In [None]:
!nvidia-smi

Sat Oct 30 03:02:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
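If the GPU information is needed inside Python rather than read off the table above, one option is to query `nvidia-smi` in machine-readable form. The following is a minimal sketch and not part of the original notebook; the helper name `query_gpus` is hypothetical, and it returns an empty list when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def query_gpus():
    """Return one "name, total memory" string per visible GPU, or [] if none."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No GPU detected; switch the Colab runtime type to GPU.")
```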

## 3.2 Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 3.3 Installing and Importing Libraries

In [None]:
! pip install -q transformers

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import torch
import torch.nn as nn
import transformers as T


# 4. Preparing Files and the Model

## 4.1 config
Set the configuration values.<br>
model : model path on Hugging Face<br>
output_max_length : maximum number of tokens to output

In [None]:
class Config:
    model = "patrickvonplaten/led-large-16384-pubmed"
    output_max_length = 512
   
    train_file = "p_train.csv" #@param
    test_file = "p_test.csv" #@param
    text_col = "text"
    debug = False #@param {"type": "boolean"}

DRIVE = "/content/drive/MyDrive/signate/medical_paper"
INPUT = os.path.join(DRIVE, "input")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 4.2 Loading and Preprocessing the CSV Files
The string formed by concatenating `title` and `abstract` is used as the input.

In [None]:
def get_data(file_path):

    df = pd.read_csv(os.path.join(INPUT, file_path))
    if Config.debug:
        df = df.iloc[:10]
    df["text"] = df["title"] + " " + df["abstract"].fillna("")
    return df

train_df = get_data(Config.train_file)
test_df = get_data(Config.test_file)
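
The `fillna("")` call matters because concatenating a string with a missing value yields a missing value in pandas. A small sketch on made-up rows (the titles and abstracts below are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "abstract": ["Some abstract.", None],  # the second abstract is missing
})

# Without fillna, the missing abstract turns the whole result into NaN.
naive = toy["title"] + " " + toy["abstract"]

# With fillna(""), the title survives (followed by a trailing space).
safe = toy["title"] + " " + toy["abstract"].fillna("")

print(naive.isna().tolist())
print(safe.tolist())
```

The same propagation would apply to a missing `title`, which `get_data` does not guard against.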

## 4.3 Extracting Target Indices
Extract the indices of rows whose `text` contains 512 or more whitespace-separated words.

In [None]:
# train_df
tr_over512_idx = train_df[train_df.text.str.split().map(lambda x: len(x))>=512].index
# test_df
te_over512_idx = test_df[test_df.text.str.split().map(lambda x: len(x))>=512].index

tr_over512_idx.shape[0], te_over512_idx.shape[0]

(175, 315)
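
Note that the threshold is applied to whitespace-separated words, not to model tokens; a subword tokenizer usually splits a text into more tokens than words. The selection itself can be sketched on toy data as follows, with a threshold of 3 words standing in for 512 (the rows and the `min_words` name are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"text": ["one two", "one two three", "a b c d e", ""]})
min_words = 3  # stand-in for the 512-word threshold

# .str.split() splits on runs of whitespace; map(len) counts the pieces.
word_counts = toy["text"].str.split().map(len)
long_idx = toy.index[word_counts >= min_words]

print(word_counts.tolist())
print(long_idx.tolist())
```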

## 4.4 Loading the tokenizer and model

In [None]:
tokenizer = T.LEDTokenizer.from_pretrained(Config.model)
model = T.LEDForConditionalGeneration.from_pretrained(
    Config.model, 
    return_dict_in_generate=True).to(device)


# 5. Execution

In [None]:
tr_summary_list = []
for idx in tqdm(tr_over512_idx):
    torch.cuda.empty_cache()
    # Summarize the original text; the "summary" column is only created in the next cell.
    input_ids = tokenizer(train_df.loc[idx, Config.text_col], return_tensors="pt").input_ids.to(device)
    # Put global attention on the first token only.
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1

    outputs = model.generate(input_ids,
                             global_attention_mask=global_attention_mask,
                             num_beams=1,
                             max_length=Config.output_max_length,
                             early_stopping=True,
                             return_dict_in_generate=True)

    # batch_decode returns one string per sequence; each batch holds a single text.
    summary = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
    tr_summary_list.append(summary)

tr_summary_list[:10]

In [None]:
train_df["summary"] = train_df["text"]
train_df.loc[tr_over512_idx, "summary"] = tr_summary_list
train_df.head()
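
The assignment above relies on the list of summaries having the same length and order as the index passed to `.loc`. A toy sketch of this pattern, using made-up texts and placeholder summaries:

```python
import pandas as pd

toy = pd.DataFrame({"text": ["short text", "a very long text", "another long text"]})
long_idx = pd.Index([1, 2])             # rows that were summarized
summaries = ["summary 1", "summary 2"]  # one entry per index, in the same order

toy["summary"] = toy["text"]              # default: keep the original text
toy.loc[long_idx, "summary"] = summaries  # overwrite only the selected rows

print(toy["summary"].tolist())
```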

In [None]:
te_summary_list = []
for idx in tqdm(te_over512_idx):
    torch.cuda.empty_cache()
    # Summarize the original text; the "summary" column is only created in the next cell.
    input_ids = tokenizer(test_df.loc[idx, Config.text_col], return_tensors="pt").input_ids.to(device)
    # Put global attention on the first token only.
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1

    outputs = model.generate(input_ids,
                             global_attention_mask=global_attention_mask,
                             num_beams=1,
                             max_length=Config.output_max_length,
                             early_stopping=True,
                             return_dict_in_generate=True)

    # batch_decode returns one string per sequence; each batch holds a single text.
    summary = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
    te_summary_list.append(summary)

te_summary_list[:10]


In [None]:
test_df["summary"] = test_df["text"]
test_df.loc[te_over512_idx, "summary"] = te_summary_list
test_df.head()

| | id | title | abstract | judgement | text | summary |
|---|---|---|---|---|---|---|
| 0 | 27145 | Estimating the potential effects of COVID-19 p... | The objective of the paper is to analyse chang... | | Estimating the potential effects of COVID-19 p... | Estimating the potential effects of COVID-19 p... |
| 1 | 27146 | Leukoerythroblastic reaction in a patient with... | | | Leukoerythroblastic reaction in a patient with... | Leukoerythroblastic reaction in a patient with... |
| 2 | 27147 | [15O]-water PET and intraoperative brain mappi... | [15O]-water PET was performed on 12 patients w... | | [15O]-water PET and intraoperative brain mappi... | [15O]-water PET and intraoperative brain mappi... |
| 3 | 27148 | Adaptive image segmentation for robust measure... | We present a method that significantly improve... | | Adaptive image segmentation for robust measure... | Adaptive image segmentation for robust measure... |
| 4 | 27149 | Comparison of Epidemiological Variations in CO... | The objective of this study is to compare the ... | | Comparison of Epidemiological Variations in CO... | Comparison of Epidemiological Variations in CO... |
| ... | ... | ... | ... | ... | ... | ... |
| 40829 | 67974 | Knowledge, Attitude, and Practices of Healthca... | In the current outbreak of novel coronavirus (... | | Knowledge, Attitude, and Practices of Healthca... | Knowledge, Attitude, and Practices of Healthca... |
| 40830 | 67975 | Safety and Efficacy of Anti-Il6-Receptor Tocil... | BACKGROUND: As the novel SARS-CoV-2 pandemic o... | | Safety and Efficacy of Anti-Il6-Receptor Tocil... | Safety and Efficacy of Anti-Il6-Receptor Tocil... |
| 40831 | 67976 | Functional imaging of head and neck tumors usi... | Positron emission tomography (PET) is an imagi... | | Functional imaging of head and neck tumors usi... | Functional imaging of head and neck tumors usi... |
| 40832 | 67977 | Effectiveness of 3D virtual imaging | | | Effectiveness of 3D virtual imaging | Effectiveness of 3D virtual imaging |


# 6. Saving

In [None]:
train_df.to_csv(os.path.join(INPUT, "ps_train.csv"), index=False)
test_df.to_csv(os.path.join(INPUT, "ps_test.csv"), index=False)
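
Passing `index=False` keeps the row index out of the file. Without it, pandas writes the index as an unnamed first column, which comes back as `Unnamed: 0` on the next `read_csv`. A minimal sketch using an in-memory buffer instead of Google Drive (the toy frame and the helper name `roundtrip_columns` are illustrative):

```python
import io

import pandas as pd

toy = pd.DataFrame({"id": [27145, 27146], "summary": ["first", "second"]})


def roundtrip_columns(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and return the column names read back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()


print(roundtrip_columns(toy))               # index written as an extra column
print(roundtrip_columns(toy, index=False))  # only the data columns
```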