**Copyright 2021 Antoine SIMOULIN.**

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Evaluating GPT-fr on the Wikitext-fr benchmark 🇫🇷

<img src="https://raw.githubusercontent.com/AntoineSimoulin/gpt-fr/main/imgs/logo.png" alt="GPT-fr logo" width="200">

**GPT-fr** is a GPT model for French developed by [Quantmetry](https://www.quantmetry.com/) and the [Laboratoire de Linguistique Formelle (LLF)](http://www.llf.cnrs.fr/en).

If you are opening this notebook on Colab, you will probably need to install 🤗 Transformers, 🤗 Tokenizers and 🤗 Datasets. You may also want to switch the runtime hardware to **GPU**, since the evaluation will run much faster.

In [None]:
%%capture
!pip install git+https://github.com/huggingface/transformers.git
!pip install tokenizers
!pip install datasets

## Requirements

In [None]:
import torch
import transformers
import datasets
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from datasets import load_dataset
from tqdm.auto import tqdm

In [None]:
# Check that a GPU is available and print the library versions
print('PyTorch version ...............{}'.format(torch.__version__))
print('Transformers version ..........{}'.format(transformers.__version__))
print('Datasets version ..............{}'.format(datasets.__version__))
print('GPU available .................{}'.format('\u2705' if torch.cuda.device_count() > 0 else '\u274c'))
print('Available devices .............{}'.format(torch.cuda.device_count()))
# torch.cuda.current_device() raises without a GPU, so only query it when CUDA is available
print('Active CUDA device ............{}'.format(torch.cuda.current_device() if torch.cuda.is_available() else 'none'))

## Loading the model

In [None]:
if torch.cuda.is_available():
  device = torch.device('cuda:0')
else:
  device = torch.device('cpu')

# Load the pretrained model and tokenizer.
# The model will be downloaded from the Hugging Face Hub and cached.
# It may take ~5 minutes on the first execution.

model = GPT2LMHeadModel.from_pretrained("asi/gpt-fr-cased-base").to(device)
tokenizer = GPT2Tokenizer.from_pretrained("asi/gpt-fr-cased-base")
tokenizer.add_special_tokens({
  "eos_token": "</s>",
  "bos_token": "<s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
})

In [None]:
# Set model in eval mode (do not apply dropout)
model = model.eval()

In [None]:
# We concatenate all paragraphs from the dataset and encode them using the tokenizer.

dataset = load_dataset(
  "asi/wikitext_fr",
  "wikitext-72"
)

# Strip trailing whitespace and drop empty paragraphs before joining
paragraphs = [row['paragraph'].rstrip() for row in dataset['test']]
paragraphs = [p for p in paragraphs if p]

encodings = tokenizer(' '.join(paragraphs), return_tensors='pt')
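
For reference, the quantity estimated in the next cell is the usual token-level perplexity of the encoded test set. The notation is introduced here only for illustration: $x_1, \dots, x_N$ are the tokens in `encodings.input_ids`, $N$ is their count, and $p_\theta$ is the model's next-token distribution (in practice conditioned on at most `max_length` previous tokens).

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$

Because $N$ counts tokens produced by this tokenizer, the value is only directly comparable between models that share the same tokenization.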

In [None]:
max_length = model.config.n_positions
stride = 1024

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
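    # Window [begin_loc, end_loc) holds at most max_length tokens
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    # Mask context tokens with -100 so only the last trg_len tokens are scored
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        # outputs[0] is the mean cross-entropy over the scored tokens, so this
        # product is the window's summed negative log-likelihood
        log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)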

# After the loop, end_loc equals the total number of tokens in the test set
ppl = torch.exp(torch.stack(lls).sum() / end_loc)

print("perplexity on the wikitext test set is {:.2f}".format(ppl))

# perplexity on the wikitext test set is 12.9 with gpt-fr-base
# perplexity on the wikitext test set is 109.2 with gpt-fr-small
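
The windowing arithmetic of the loop above can be checked without a model or a GPU. The following is a minimal sketch in plain Python: `window_bounds` and `perplexity` are hypothetical helper names introduced here for illustration, and the sequence length of 2500 tokens is an arbitrary example value, not the size of the Wikitext-fr test set. Like the notebook, it treats each window's loss as a mean over `trg_len` tokens.

```python
import math


def window_bounds(seq_len, max_length, stride):
    """Yield (begin_loc, end_loc, trg_len) for each evaluation window.

    Mirrors the index arithmetic of the evaluation loop: each window holds
    at most max_length tokens, of which the last trg_len are scored.
    """
    for i in range(0, seq_len, stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, seq_len)
        trg_len = end_loc - i  # shorter than stride on the last window
        yield begin_loc, end_loc, trg_len


def perplexity(mean_nlls, trg_lens):
    """Combine per-window mean negative log-likelihoods into one perplexity."""
    total_nll = sum(nll * n for nll, n in zip(mean_nlls, trg_lens))
    return math.exp(total_nll / sum(trg_lens))


# Example values only: 2500 tokens, with stride equal to max_length
windows = list(window_bounds(seq_len=2500, max_length=1024, stride=1024))
print(windows)
# [(0, 1024, 1024), (1024, 2048, 1024), (2048, 2500, 452)]
```

With `stride` equal to `max_length`, as in the notebook, the windows do not overlap and the scored lengths sum to the sequence length, which is why dividing by the final `end_loc` gives a per-token average. Choosing a smaller `stride` would give each scored token more left context, at the cost of more forward passes.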