<a href="https://colab.research.google.com/github/lucapernice/MLOPS_Project/blob/main/research/demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FINETUNING with lmqg

--------------------------------------------------------------------------------

For more information on lmqg, see the [lm-question-generation repository](https://github.com/asahi417/lm-question-generation).

--------------------------------------------------------------------------------





## Necessary libraries
We import `os` and install *lmqg* with `pip` in this virtual machine.



In [None]:
import os

In [None]:
!pip install lmqg

Collecting lmqg
  Downloading lmqg-0.1.1.tar.gz (100 kB)
  Preparing metadata (setup.py) ... done
Collecting pytextrank (from lmqg)
  Downloading pytextrank-3.2.5-py3-none-any.whl (30 kB)
Collecting sentencepiece (from lmqg)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting datasets (from lmqg)
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)

Finally, we can import the model class, the grid searcher, and the dataset loader.

In [None]:
import torch
from lmqg import TransformersQG, GridSearcher
from datasets import load_dataset

## Quick test

Here we simply perform a test:


1.   English model is selected;
2.   Context is provided:
    `William Turner was an English painter who specialised in watercolour landscapes. He is often known as William Turner of Oxford or just Turner of Oxford to distinguish him from his contemporary, J. M. W. Turner. Many of Turner's paintings depicted the countryside around Oxford. One of his best known pictures is a view of the city of Oxford from Hinksey Hill.`;

3. The model generates question-answer pairs from the given context.



In [None]:
model = TransformersQG(language="en")
context = "William Turner was an English painter who specialised in watercolour landscapes. He is often known " \
          "as William Turner of Oxford or just Turner of Oxford to distinguish him from his contemporary, " \
          "J. M. W. Turner. Many of Turner's paintings depicted the countryside around Oxford. One of his " \
          "best known pictures is a view of the city of Oxford from Hinksey Hill."
qa = model.generate_qa(context)

In [None]:
qa

[('What language was William Turner?', 'English'),
 ('What is William Turner often known as to distinguish him from J. M. W. Turner?',
  'William Turner of Oxford or just Turner of Oxford'),
 ("Where did many of William Turner's paintings depict?", 'countryside'),
 ("What is one of William Turner's best known pictures?", 'Hinksey Hill.')]

## TRAINING

The dataset can be downloaded as follows:

In [None]:
dataset = load_dataset("lmqg/qg_squad")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/61.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.13M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.16M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
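
Once the download finishes, it can be useful to check how many examples each split contains. The sketch below relies only on the object behaving like a mapping from split names to sized collections, which holds for the `DatasetDict` returned by `load_dataset`. The helper name `split_sizes` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def split_sizes(splits):
    """Map each split name to its number of examples."""
    return {name: len(rows) for name, rows in splits.items()}


# Stand-in data so the sketch runs on its own; in the notebook, call split_sizes(dataset).
toy_splits = {
    "train": [{"question": "q1"}, {"question": "q2"}],
    "validation": [{"question": "q3"}],
}
print(split_sizes(toy_splits))
```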

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

For more information on how the grid search is performed, see the [lmqg repository](https://github.com/asahi417/lm-question-generation).




In [None]:
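from lmqg import GridSearcher

trainer = GridSearcher(
    checkpoint_dir='tmp_ckpt',            # directory where checkpoints are written
    dataset_path='lmqg/qg_squad',         # same dataset as loaded above
    model='t5-small',                     # base model to fine-tune
    epoch=10,
    epoch_partial=5,
    batch=64,
    n_max_config=5,
    gradient_accumulation_steps=[2, 4],   # candidate values for the search
    lr=[1e-04, 5e-04, 1e-03],             # candidate learning rates
    label_smoothing=[0, 0.15]             # candidate label-smoothing values
)
trainer.run()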



KeyboardInterrupt: 
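
To get a feel for the size of the search, the sketch below enumerates the combinations of the three list-valued arguments passed to `GridSearcher`. It assumes the search space is the plain Cartesian product of those lists, which is how the arguments read but is not confirmed here. `search_space` and `configs` are illustrative names only.

```python
from itertools import product

# List-valued arguments copied from the GridSearcher call above.
search_space = {
    "gradient_accumulation_steps": [2, 4],
    "lr": [1e-04, 5e-04, 1e-03],
    "label_smoothing": [0, 0.15],
}

# Assumption: every combination of the listed values is one candidate configuration.
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(len(configs))
for config in configs[:3]:
    print(config)
```

Under that assumption the grid has 2 × 3 × 2 = 12 candidates, which helps explain why a full run on `t5-small` can take long enough to be interrupted.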

## ZIP MODELS

Zip the model files so they can be downloaded to your machine.

In [None]:
!ls


sample_data  tmp_ckpt


In [None]:
!zip -r tmp_ckpt.zip tmp_ckpt
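
If the `zip` command is not available, a similar archive can be built from Python with the standard library. This is a sketch that assumes the checkpoint directory is named `tmp_ckpt` as above; the helper name `zip_directory` is hypothetical.

```python
import shutil
from pathlib import Path


def zip_directory(directory, archive_stem=None):
    """Create <archive_stem>.zip containing `directory` and return the archive path."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    stem = str(archive_stem or directory)
    # root_dir/base_dir keep the top-level folder name inside the archive,
    # similar to what `zip -r tmp_ckpt.zip tmp_ckpt` produces.
    return shutil.make_archive(
        stem, "zip", root_dir=str(directory.parent), base_dir=directory.name
    )


# Only runs when the checkpoint directory from the training step exists.
if Path("tmp_ckpt").is_dir():
    print(zip_directory("tmp_ckpt"))
```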

Then download the archive:

In [None]:
from google.colab import files
files.download('tmp_ckpt.zip')


## EVALUATION
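
The evaluation cell in this section points `lmqg-eval` at a specific model directory under `tmp_ckpt`, so that directory has to exist in the current runtime. A small sketch for listing what is actually there before running the command; the helper name `list_checkpoint_dirs` is hypothetical, and it makes no assumption about how lmqg names its model folders.

```python
from pathlib import Path


def list_checkpoint_dirs(checkpoint_dir="tmp_ckpt"):
    """Return the sorted sub-directories found directly under `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_dir())


# Prints nothing when tmp_ckpt is missing or has no sub-directories.
for path in list_checkpoint_dirs():
    print(path)
```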

In [None]:
!lmqg-eval -m "tmp_ckpt/model_gjvfnu" -e "./eval_metrics" -d "lmqg/qg_squad" -l "en"

2024-01-25 17:19:21.103513: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-25 17:19:21.103571: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-25 17:19:21.105156: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Traceback (most recent call last):
  File "/usr/local/bin/lmqg-eval", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/lmqg/lmqg_cl/model_evaluation.py", line 50, in main
    metric = evaluate(
  File "/usr/local/lib/python3.10/dist-packages/lmqg/automatic_evaluation.py", line 204, in evaluate
    lm = TransformersQG(model,