# Train

This notebook trains and saves a GPT-2 PyTorch language model on Hansard text. It should be run with a GPU in Google Colab.

Clones the GitHub repo and installs the required module (Hugging Face transformers)

In [None]:
!git clone https://github.com/CallumRai/Hansard/
!pip install transformers

Imports required classes and changes working directory

In [None]:
import os
os.chdir("Hansard")

from hansard import Hansard, Corpus, DataLoader, Trainer
import torch

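Since the notebook is meant to run on a GPU, a quick check along the following lines can confirm that one is visible before the long training step. This is only a minimal sketch: `describe_device` is a hypothetical helper, not part of the Hansard repo, and it falls back gracefully if PyTorch is not installed.

```python
import importlib.util


def describe_device():
    """Return "cuda", "cpu", or a note that torch is missing (hypothetical helper)."""
    # Avoid an ImportError when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch

    # torch.cuda.is_available() is True only when a CUDA GPU is usable.
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Running on:", describe_device())
```

In Colab, a result of `cpu` usually means the runtime type has not been switched to GPU.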
Downloads Hansard between two dates as ```.html``` and converts it into ```.json```

The start and end dates can be changed; they are given in YYYY-MM-DD form

In [None]:
print("Downloading and extracting hansard ...")

s_date = "2019-01-01"
e_date = "2020-06-01"

hansard = Hansard(s_date, e_date)
hansard.download()
hansard.extract()

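Because the dates are passed as plain strings, it may be worth validating them before starting a long download. The sketch below assumes that `Hansard` expects `YYYY-MM-DD` strings and that the start date should not fall after the end date; `check_date_range` is a hypothetical helper, not part of the repo.

```python
from datetime import datetime


def check_date_range(s_date, e_date):
    """Parse two YYYY-MM-DD strings and check their order (hypothetical helper)."""
    # strptime raises ValueError if a string does not match the format.
    start = datetime.strptime(s_date, "%Y-%m-%d").date()
    end = datetime.strptime(e_date, "%Y-%m-%d").date()

    if start > end:
        raise ValueError(f"start date {s_date} is after end date {e_date}")

    return start, end


start, end = check_date_range("2019-01-01", "2020-06-01")
print(f"Range covers {(end - start).days} days")
```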
Creates a corpus of utterances from Hansard

The date range must already have been downloaded, but it can differ from the range used in the previous cell.

In [None]:
print("Creating corpus ...")

s_date = "2019-01-01"
e_date = "2020-06-01"

corpus = Corpus(s_date, e_date)
corpus.full()

Creates the dataloader for training the model


In [None]:
print("Creating dataloaders ...")

dataloader_path = "hansard/data/dataloaders/"
train_path = dataloader_path + 'train_loader_debug.pth'

# If the saved train loader file does not exist, build it and save it
if not os.path.isfile(train_path):
    loader_class = DataLoader(s_date, e_date)
    train_loader = loader_class.train(1)
    torch.save(train_loader, train_path)

# Load the saved train loader from disk
train_loader = torch.load(train_path)

Trains the model

```epochs```, ```lr``` (learning rate) and ```warmup_steps``` can be changed.

Note: The training step currently takes upwards of 2 hours to run.

In [None]:
trainer = Trainer(train_loader)

epochs = 2
lr = 2e-5
warmup_steps = 100

print("Training ...\n")
trainer.train(epochs, lr, warmup_steps)

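To give a feel for how ```lr``` and ```warmup_steps``` interact, the sketch below computes a learning rate that rises linearly during warmup and then decays linearly to zero. The notebook does not say which schedule `Trainer` uses, so this is an assumption modelled on the common linear-warmup pattern; `linear_warmup_lr` and `total_steps` are hypothetical names used for illustration only.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step under linear warmup then linear decay.

    Hypothetical illustration; the actual schedule in Trainer may differ.
    """
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)

    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed total of 1000 optimizer steps, with the values used in this notebook.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=1000))
```

Under this assumption, a larger ```warmup_steps``` delays the point at which the full learning rate is reached, which can make early training more stable.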
The model should now be saved in ```Hansard/hansard/data/model``` as ```pretrained.pth```.

It can be downloaded as a PyTorch model. If you intend to upload it as a Hugging Face model, it is easiest to follow the instructions in the GitHub README.