<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Training-Notebook" data-toc-modified-id="Training-Notebook-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Training Notebook</a></span><ul class="toc-item"><li><span><a href="#Base-Training" data-toc-modified-id="Base-Training-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Base Training</a></span></li><li><span><a href="#Datset-Initialization" data-toc-modified-id="Datset-Initialization-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Dataset Initialization</a></span></li><li><span><a href="#Finally:-Training!" data-toc-modified-id="Finally:-Training!-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Finally: Training!</a></span></li><li><span><a href="#Saving" data-toc-modified-id="Saving-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Saving</a></span></li></ul></li></ul></div>

# Training Notebook
---
Running this notebook trains the model I am presenting for this challenge. The code is straightforward to follow, but you need access to a machine with a GPU. If you are running it on Colab, enable the GPU runtime before you run the notebook.
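
Before running the rest of the notebook, it can help to confirm that a GPU is actually visible. The snippet below is a minimal sketch, not part of the original code: `gpu_visible` is a hypothetical helper that only checks whether the `nvidia-smi` command-line tool is installed and lists at least one device.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU.

    Hypothetical helper for illustration; it is not part of `resources`.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools installed, so no usable NVIDIA GPU.
        return False
    try:
        # `nvidia-smi -L` prints one line per device, e.g. "GPU 0: Tesla T4 (...)".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_visible())
```

If this prints `False` on Colab, the runtime type is most likely still set to CPU.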

In [0]:
!pip install transformers

In [2]:
from google.colab import drive
import sys
drive.mount('/gdrive')
sys.path.append('../gdrive/My Drive/')
from resources.data_utils import TextDataset
from resources.model_utils import train, generate
from resources.general_utils import set_seed, gpu_information_summary
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import os

Mounted at /gdrive


In [3]:
n_gpu, device = gpu_information_summary()
set_seed(93, n_gpu)

+----------------+----------+
|      Key       |  Value   |
+----------------+----------+
|      GPU       | Tesla T4 |
| Number of GPUs |    1     |
+----------------+----------+


## Base Training
---
To start training, we need a base model architecture and a matching tokenizer. Below we load the pretrained `distilgpt2` version of each.

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")


## Dataset Initialization
---
Now we create an instance of our dataset, which converts the raw `.txt` file into inputs for our neural network. One twentieth (5%) of the examples are held out for validation.

In [None]:
file_path = "../gdrive/My Drive/councelchat.txt"
all_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path)
n_valid = len(all_dataset) // 20
n_train = len(all_dataset) - n_valid
train_dataset, valid_dataset = torch.utils.data.random_split(
    all_dataset, [n_train, n_valid]
)
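
The split sizes in the cell above come from integer division by 20. The sketch below isolates that arithmetic in a hypothetical helper, `split_sizes` (not part of the original code), to show how the sizes behave, including the edge case where the dataset is too small to yield any validation examples.

```python
def split_sizes(n_examples, valid_divisor=20):
    """Mirror the notebook's arithmetic: hold out 1/valid_divisor for validation."""
    n_valid = n_examples // valid_divisor  # integer division rounds down
    n_train = n_examples - n_valid         # everything else is used for training
    return n_train, n_valid


print(split_sizes(1000))  # (950, 50)
print(split_sizes(19))    # (19, 0): fewer than 20 examples leaves validation empty
```

Because the two sizes always add up to the dataset length, they can be passed directly to `torch.utils.data.random_split`, as the cell above does.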

## Finally: Training!
---

In [0]:
train_input = {
    "train_dataset": train_dataset,
    "model": model,
    "tokenizer": tokenizer,
    "per_gpu_train_batch_size": 4,
    "learning_rate": 5e-5,
    "num_train_epochs": 4,
    "pad_values": {"input_ids": tokenizer.eos_token_id, "attention_mask": 0},
    "evaluate_during_training": True,
    "valid_dataset": valid_dataset,
    "max_steps": -1,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0,
    "adam_epsilon": 1e-8,
    "warmup_steps": 0,
    "max_grad_norm": 1,
    "fp16": False,
    "fp16_opt_level": "O1",
    "seed_value": 93,
    "logging_steps": 50,
}
train(**train_input)
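
The `pad_values` entry tells the training code what to fill shorter sequences with when examples of different lengths are batched together. The internals of `train` are not shown in this notebook, so the following is only a sketch of the general idea: `pad_batch` is a hypothetical function, and the assumption that each example is a dict with `input_ids` and `attention_mask` lists is mine.

```python
def pad_batch(examples, pad_values):
    """Right-pad every field of every example to the longest sequence in the batch.

    Hypothetical illustration; the real collation inside `train` may differ.
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {key: [] for key in pad_values}
    for ex in examples:
        for key, pad in pad_values.items():
            seq = list(ex[key])
            batch[key].append(seq + [pad] * (max_len - len(seq)))
    return batch


EOS_ID = 50256  # eos_token_id of the GPT-2 tokenizer, also used by distilgpt2

examples = [
    {"input_ids": [10, 11, 12], "attention_mask": [1, 1, 1]},
    {"input_ids": [20], "attention_mask": [1]},
]
batch = pad_batch(examples, {"input_ids": EOS_ID, "attention_mask": 0})
print(batch["input_ids"])       # [[10, 11, 12], [20, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

GPT-2 has no dedicated padding token, which is why the end-of-sequence id is reused here. The zeros in the attention mask mark the padded positions so the model can ignore them.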

## Saving
---
Now we save the fine-tuned model and tokenizer so that we can later use them in `sequence_generation.ipynb` to generate **Reflections** on a given patient case.


In [0]:
out_dir = "../gdrive/My Drive/fine_tuned/"
model.save_pretrained(out_dir)
tokenizer.save_pretrained(out_dir)
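
For reference, a directory written with `save_pretrained` can be read back with `from_pretrained`. The helper below is a minimal sketch of how a later notebook might do that; `load_fine_tuned` is a hypothetical name, and the actual loading code in `sequence_generation.ipynb` may look different.

```python
def load_fine_tuned(out_dir):
    """Reload the tokenizer and model saved by the cell above.

    Hypothetical helper for illustration only.
    """
    # Imported here so that defining the helper does not require loading the library.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(out_dir)
    model = GPT2LMHeadModel.from_pretrained(out_dir)
    model.eval()  # switch off dropout for generation
    return tokenizer, model


# Example usage, assuming the directory from the saving cell exists:
# tokenizer, model = load_fine_tuned("../gdrive/My Drive/fine_tuned/")
```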