# GPU Training Notebook

This notebook trains our model on a GPU in Google Colab.

**Before running this notebook, follow these steps:**

1. In your Google Drive, go to MyDrive and create a folder `inf265_project_3`.
2. Put the required files in this folder: `tokenizer.py`, `train.py`, `config.py` and `utils.py`.
3. In the upper-right corner, click the down arrow and select `Change runtime type`.
4. Choose `Runtime: Python3` and `Hardware accelerator: T4 GPU`. Do not select the `High-RAM` option.
5. If required, click `Connect`. The bottom status bar should read something like `Connected to Python 3 Google Compute Engine backend (GPU)`.
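To confirm step 5 from code rather than from the status bar, one option is to ask the NVIDIA driver tool `nvidia-smi` whether it can see a GPU. The following is a minimal sketch and not part of the project files. The helper name `describe_gpu` is hypothetical, and the sketch assumes `nvidia-smi` is on the `PATH` whenever a GPU runtime is attached.

```python
import shutil
import subprocess
from typing import Optional


def describe_gpu() -> Optional[str]:
    """Return the GPU listing from nvidia-smi, or None if no GPU is visible.

    Hypothetical helper: assumes nvidia-smi is installed on GPU runtimes.
    """
    if shutil.which("nvidia-smi") is None:
        return None  # Tool not installed, so most likely a CPU-only runtime
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    listing = result.stdout.strip()
    if result.returncode != 0 or not listing:
        return None  # Tool present, but it reported no usable GPU
    return listing


gpu = describe_gpu()
print(gpu if gpu else "No GPU detected. Check 'Change runtime type'.")
```

If this prints the fallback message, the runtime is probably CPU-only and training will be much slower.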

**Warning:** You get a limited amount of free compute time every 24 hours. Any time you are connected to a GPU runtime counts towards your quota, whether or not you are training. When you are not training your model, click `Runtime >> Disconnect and delete runtime` so you don't waste your free compute.

## Install Python Libraries

Start by running the following cell to install required libraries:

In [None]:
!pip install datasets tokenizers

## Imports and Mounting Google Drive

To save the tokenizer, model and optimizer checkpoints, we will mount Google Drive in the next code cell. Make sure you have created a directory `inf265_project_3` in your Google Drive under `MyDrive` and put your Python files there.

We also use the `autoreload` Jupyter extension, which lets us re-import external files without restarting the kernel. This is useful if you need to make small changes to the Python files. You can find the files in the file browser (the folder icon in the left sidebar). Note that you need to mount your Google Drive before you can access the files from Colab. After saving, it might take a few seconds before the file is updated.

Run the following cell to mount Google Drive and import the necessary files.

In [None]:
%load_ext autoreload
%autoreload 2

from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/inf265_project_3')

from pathlib import Path
from tokenizer import train_tokenizer
from train import train_model
from config import config
from utils import print_config

# Prepend the Google Drive folder to the filenames so files are saved on Google Drive
gdrive_base_path = "/content/drive/MyDrive/inf265_project_3/"

if "MyDrive" not in config.tokenizer_filename: # Only prepend once, so the cell is safe to re-run
  config.tokenizer_filename = gdrive_base_path + config.tokenizer_filename
  config.model_filename = gdrive_base_path + config.model_filename
  config.optimizer_filename = gdrive_base_path + config.optimizer_filename

print_config(config)
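The `"MyDrive" not in ...` check above keeps the cell safe to re-run, because `autoreload` does not reset the already-imported `config` object. As a hedged alternative, the same idempotent behaviour can be sketched with `pathlib`. The helper name `with_base_dir` and the filename `tokenizer.json` are hypothetical and only serve as illustration.

```python
from pathlib import PurePosixPath

BASE_DIR = "/content/drive/MyDrive/inf265_project_3"


def with_base_dir(base_dir: str, filename: str) -> str:
    """Place filename under base_dir, unless it is already located there."""
    base = PurePosixPath(base_dir)
    path = PurePosixPath(filename)
    try:
        path.relative_to(base)  # Succeeds only if path is already inside base
        return str(path)
    except ValueError:
        return str(base / path)


# Hypothetical filename, used only to demonstrate the behaviour
once = with_base_dir(BASE_DIR, "tokenizer.json")
twice = with_base_dir(BASE_DIR, once)  # Second call leaves the path unchanged
print(once)
print(once == twice)
```

Checking the path prefix instead of searching for the substring `MyDrive` avoids surprises if a filename happens to contain that text.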

## Training the Tokenizer

Train and save the tokenizer. This might take a few minutes, but you only have to do it once: the tokenizer is saved to Google Drive for later use.

In [None]:
if not Path(config.tokenizer_filename).exists():
  tokenizer = train_tokenizer(config)
else:
  print(f"Tokenizer already exists at {config.tokenizer_filename}")

## Training Your Model

We use the `train_model` function from `train.py`. This saves a model (and optimizer) checkpoint every 500 training steps. If you get disconnected or use up your daily compute, you can resume training later.

When you have trained your model for around 3-5 epochs, download the model and tokenizer files from Google Drive and put them in your local `temp` folder. Then you can use these when doing inference (text generation).

A single epoch might take around 30 minutes to complete, so 3-5 epochs take roughly 1.5 to 2.5 hours.
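Before resuming after a disconnect, it can help to check which checkpoint files actually made it to Google Drive. The following is a minimal sketch using only the standard library. The helper name `checkpoint_status` is hypothetical and not part of `train.py` or `utils.py`.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def checkpoint_status(filenames: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each filename to its size in bytes, or None if the file is missing."""
    status = {}
    for name in filenames:
        path = Path(name)
        status[name] = path.stat().st_size if path.is_file() else None
    return status


def print_status(status: Dict[str, Optional[int]]) -> None:
    """Print one line per file with its size in MB, or 'missing'."""
    for name, size in status.items():
        label = f"{size / 1e6:.1f} MB" if size is not None else "missing"
        print(f"{name}: {label}")


# In this notebook, after the import cell has run, one might call:
# print_status(checkpoint_status([config.model_filename, config.optimizer_filename]))
```

A missing or unexpectedly small file suggests the last checkpoint was not fully written before the runtime disconnected.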

In [None]:
train_model(config)