## Notebook setup

### Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Dependencies

In [None]:
!pip install -r "/content/drive/MyDrive/ml_projects/tdt12_nlp_creative/requirements.txt"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting demoji
  Downloading demoji-1.1.0-py3-none-any.whl (42 kB)
Collecting pytorch-lightning==1.7.7
  Downloading pytorch_lightning-1.7.7-py3-none-any.whl (708 kB)
Collecting aitextgen
  Downloading aitextgen-0.6.0.tar.gz (572 kB)
Collecting pyDeprecate>=0.3.1
  Downloading pyDeprecate-0.3.2-py3-none-any.whl (10 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.10.3-py3-none-any.whl (529 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-

### Python Imports

In [None]:
# File/string operations
import os
import shutil

# Logging
import logging

# Project folder on the mounted Google Drive
DATA = "/content/drive/MyDrive/ml_projects/tdt12_nlp_creative"

logging.basicConfig(
    format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

# Model operations
from aitextgen import aitextgen

### Verify active GPU in Colab

In [None]:
!nvidia-smi

Wed Nov 23 16:31:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
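
To check available GPU memory programmatically rather than reading the table by eye, the memory column can be parsed out of the `nvidia-smi` text. The following is a minimal sketch; the helper name `parse_gpu_memory` is hypothetical, and it assumes the `used MiB / total MiB` layout shown in the output above.

```python
import re
from typing import Optional, Tuple


def parse_gpu_memory(smi_output: str) -> Optional[Tuple[int, int]]:
    """Return (used_mib, total_mib) for the first GPU row, or None if absent."""
    # Assumes the "<used>MiB / <total>MiB" column format printed by nvidia-smi.
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output in this notebook.
sample = "| N/A   32C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |"
used, total = parse_gpu_memory(sample)
print(f"{used} MiB used of {total} MiB")
```

On the Tesla T4 row above this reports 0 MiB used of 15109 MiB.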

## Load TokenDataset

In [None]:
file_name = "dataset_cache.tar.gz"
shutil.copy(f"{DATA}/dataset/{file_name}", f"./{file_name}")

'./dataset_cache.tar.gz'
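
Copying from Drive on every run can be slow for a large cache. As a sketch only, the copy could be skipped when a local file of the same size already exists. The helper name `copy_if_missing` is hypothetical, and comparing sizes is a cheap heuristic that does not verify file contents.

```python
import shutil
from pathlib import Path


def copy_if_missing(src: str, dst: str) -> Path:
    """Copy src to dst unless dst already exists with the same size."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_file():
        raise FileNotFoundError(f"Source archive not found: {src_path}")
    # Heuristic: identical size is treated as "already copied".
    if dst_path.is_file() and dst_path.stat().st_size == src_path.stat().st_size:
        return dst_path
    shutil.copy(src_path, dst_path)
    return dst_path


# Usage with the names defined in this notebook:
# copy_if_missing(f"{DATA}/dataset/{file_name}", f"./{file_name}")
```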

## Model fine-tuning

### Load Model


There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also fine-tune a GPT Neo model instead. It is better suited to longer texts, and its base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.
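
To catch a mistyped size before a multi-gigabyte download starts, the GPT-2 sizes listed above can be kept in a small lookup. This is an illustrative sketch: the names `GPT2_DISK_SIZE` and `check_gpt2_size` are hypothetical and not part of aitextgen, and the disk sizes are the ones quoted in the list.

```python
# Disk sizes as quoted in the list above (hypothetical helper, not an aitextgen API).
GPT2_DISK_SIZE = {"124M": "500MB", "355M": "1.5GB", "774M": "3GB"}


def check_gpt2_size(name: str) -> str:
    """Return name unchanged if it is a listed GPT-2 size, else raise ValueError."""
    if name not in GPT2_DISK_SIZE:
        valid = ", ".join(GPT2_DISK_SIZE)
        raise ValueError(f"Unknown GPT-2 size {name!r}; expected one of: {valid}")
    return name


size = check_gpt2_size("355M")
print(f"{size} needs about {GPT2_DISK_SIZE[size]} on disk")
```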

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [None]:
ai = aitextgen(tf_gpt2="355M", to_gpu=True)

# Comment out the above line and uncomment the below line to use GPT Neo instead.
# ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

INFO:aitextgen:Downloading the 355M GPT-2 TensorFlow weights/config from Google's servers


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

Fetching hparams.json:   0%|          | 0.00/91.0 [00:00<?, ?it/s]

Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/1.42G [00:00<?, ?it/s]

Fetching model.ckpt.index:   0%|          | 0.00/10.4k [00:00<?, ?it/s]

Fetching model.ckpt.meta:   0%|          | 0.00/927k [00:00<?, ?it/s]

INFO:aitextgen:Converting the 355M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/355M
Loading TF weight model/h0/attn/c_attn/b with shape [3072]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 1024, 3072]
Loading TF weight model/h0/attn/c_proj/b with shape [1024]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 1024, 1024]
Loading TF weight model/h0/ln_1/b with shape [1024]
Loading TF weight model/h0/ln_1/g with shape [1024]
Loading TF weight model/h0/ln_2/b with shape [1024]
Loading TF weight model/h0/ln_2/g with shape [1024]
Loading TF weight model/h0/mlp/c_fc/b with shape [4096]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 1024, 4096]
Loading TF weight model/h0/mlp/c_proj/b with shape [1024]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 4096, 1024]
Loading TF weight model/h1/attn/c_attn/b with shape [3072]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 1024, 3072]
Loading TF weight model/h1/attn

Save PyTorch model to aitextgen/pytorch_model.bin


INFO:aitextgen:Loading 355M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


INFO:aitextgen:GPT2 loaded with 354M parameters.
INFO:aitextgen:Gradient checkpointing enabled for model training.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


### Fine-tuning

In [None]:
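ai.train(file_name,
         line_by_line=True,    # Treat each line of the source as a separate text
         from_cache=True,      # file_name is a cached TokenDataset, not raw text
         num_steps=5000,       # Total number of training steps
         generate_every=1000,  # Print sample text every 1,000 steps
         save_every=1000,      # Save the model every 1,000 steps
         save_gdrive=True,     # Also copy saved models to Google Drive
         learning_rate=1e-3,
         batch_size=1,
         )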

INFO:aitextgen:Loading text from dataset_cache.tar.gz with generation length of 1024.
INFO:aitextgen.TokenDataset:TokenDataset containing 19,287,928 subsets loaded via cache.
  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was deprecated in v1.6 and"
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/5000 [00:00<?, ?it/s]

1,000 steps reached: saving model to /trained_model


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


1,000 steps reached: generating sample texts.
, but it also works, and the film has been a great, great film. The film is great. It has a lot of moments, and it's a great movie. It is a great film, but it does not follow the book. The acting is great, the story starts as a beautiful house with the main character. It is too long, as well as the film itself. It is the film. The plot is a little boring, the film is very predictable and very predictable, the story has some very funny moments, but the acting is very strange. The acting is very good, the writing is over-dramatise, the film has lots of moments where the film is shown in a musical hall and a musical hall. It's a good movie, but the direction is very bad. The film may not be a good movie, but it's still a very cheesy one. And you could't imagine what other is making. It's a good movie, and if you don't go to see it, you're not expecting to be disappointed."

2,000 steps reached: saving model to /trained_model


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


2,000 steps reached: generating sample texts.
 to the cast. But the film is just a few hours of film."

3,000 steps reached: saving model to /trained_model


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


3,000 steps reached: generating sample texts.
 I am just gonna just start this review on this site. I think this should be his first exposure to this movie. I mean it was so bad that I was scared of seeing it on DVD, and I did not understand it. I'm just too happy that this was made, and I don't understand what the movie has to offer. I mean. this is so bad that I was even disappointed in this movie. A movie is great and I think it should have been made. I'll go down as the main character has been getting on with his wife. I will always be interested in the movie, and I'm really glad to see this movie. It's a real movie, and I was the biggest person of all time. This movie is good, but I mean it's all that bad, I could not get on this movie when it finished. I cannot believe that many people would be angry about this movie, but I did not agree with them. If you like it, then I will certainly be able to go see this movie. The movie is very good, I am afraid. The acting and the d

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


4,000 steps reached: generating sample texts.
, it is a very bad movie, but the acting was very good. I enjoyed the action, the plot, and the plot. The only thing I wish was the best way I could describe it was that it was really bad, but not that bad. Some of the dialog did show the characters, but the actors didn't show the outcome. I had to say I was disappointed, the movie is so predictable that I was asking the question 'Is it really about the plot? A simple question I can say I am just telling you. There were a few episodes in it, but that's about it. It just is the plot, I had problems with the plot, and the plot was just dull. The fact that I could watch this with my friends was not a mystery, but it was just the plot, and the acting. I'm not sure if it was ever made, but it was just plain boring, I could watch it without the characters. The storyline was pretty thin, and nothing a real mystery was resolved. I was thinking about it, and the storyline was more complicate

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


5,000 steps reached: generating sample texts.


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=5000` reached.


 was not really too sure what to do with the titular character. It was so much more like a big disappointment. It was in the sense that it would be the worst movie we've ever seen in a while. How can you get an answer to what you have already just just watched? If you're going to spend time with this movie, you might have a better time with this movie. I'm glad I did. I can't blame the acting. I thought the actors who gave an excellent job did not make a good job. They did not make a perfect job for the movie. I'm sure they could have given the movie a 7.0. I'm sure there were a few people who had no idea what to do with the character. I don't know what to do with their actions. The movie just didn't work. I can relate to them. They seemed to tell the story that they could have done it too well. I don't know what to do with those two. I'm sure if they were just trying to make a movie about what would happen with the other characters. Then they built a lot of suspense. I'll just assume 

INFO:aitextgen:Saving trained model pytorch_model.bin to /trained_model
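
The log above shows saves and sample generations at every 1,000th step up to `num_steps=5000`. As a rough sketch, the expected milestones for a given interval can be listed ahead of a run. The helper `milestone_steps` is hypothetical, and it assumes the callbacks fire whenever the step count is a multiple of the interval, which matches the log but is not taken from the aitextgen source.

```python
from typing import List


def milestone_steps(num_steps: int, every: int) -> List[int]:
    """List the steps at which a periodic callback with interval `every` would fire."""
    if every <= 0:
        # Assumption: a non-positive interval disables the callback.
        return []
    return list(range(every, num_steps + 1, every))


print(milestone_steps(5000, 1000))  # [1000, 2000, 3000, 4000, 5000]
```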


## Inference

### Reload the model for faster inference

In [None]:
ai = aitextgen(model_folder="trained_model", to_gpu=True)

INFO:aitextgen:Loading model from provided weights and config in /trained_model.
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
INFO:aitextgen:GPT2 loaded with 354M parameters.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


### Inference for each rating
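
The cell in this section conditions generation on a rating by prompting with `{rating}<|rating|>` and then drops everything up to that token when printing. A minimal sketch of the two string operations is given here. The helper names `build_prompt` and `extract_review` are hypothetical, and the whitespace stripping is an addition that the notebook cell does not perform.

```python
RATING_TOKEN = "<|rating|>"


def build_prompt(rating: int) -> str:
    """Build the conditioning prompt for a rating from 1 to 10."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    return f"{rating}{RATING_TOKEN}"


def extract_review(generated: str) -> str:
    """Return the text after the last rating token, without surrounding whitespace."""
    return generated.split(RATING_TOKEN)[-1].strip()


example = build_prompt(7) + " A solid film."
print(extract_review(example))  # A solid film.
```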

In [None]:
reviews = []
for i in range(1, 11):
  reviews.append(ai.generate(n=5,
                             prompt=f"{i}<|rating|>",
                             prepend_bos=None,
                             min_length=None,
                             max_length=500,
                             temperature=0.8,
                             top_k=40, # Limits sampled tokens to the top k values
                             #top_p=0.9, # Limits sampled tokens to the smallest set whose cumulative probability reaches top_p (a value between 0 and 1)
                             #num_beams=5, # If greater than 1, executes beam search for cleaner text
                             #repetition_penalty=1.0, # If greater than 1.0, penalizes repetition in a text to avoid infinite loops
                             #length_penalty=1.0, # If greater than 1.0, penalizes text proportional to the length
                             do_sample=True,
                             return_as_list=True,
                             ))

# Print the five generated reviews for each rating, without the prompt prefix
for i, rating in enumerate(reviews):
  rating_text = "-"*15 + f"\n  Rating: {i+1}\n" + "-"*15
  print(rating_text)
  for review in rating:
    print("'" + review.split("<|rating|>")[-1] + "'")
  print()

---------------
  Rating: 1
---------------
'I do not really believe that this film was ever seen on video. It was simply an insult to the intelligence of the main characters.If I had a problem with this movie I would have no problem with the film. The story and atmosphere of the film had absolutely nothing to do with the movie. I was glad to see what I could do with the film. I was laughing at the movie, and I did not have the same feeling as the film. And it's pretty clear that it was a very poorly executed setup. Don't blame the actors for this movie. They should be ashamed to be ashamed for the poor acting. The music was very badly written and the sound was horrible. I can only think about how much better it was. The film is a lot of the film. I didn't know what to do with the plot. It seemed to be a nod to films like the 'Killer and the Bird' and the 'Killer. I cannot deny this. I have no idea what director/actor/actor/FX was thinking about. I can't believe what people were thinki