## Notebook setup

### Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Dependencies

In [2]:
!pip install -r "/content/drive/MyDrive/ml_projects/tdt12_nlp_creative/requirements.txt"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting demoji
  Downloading demoji-1.1.0-py3-none-any.whl (42 kB)
Collecting pytorch-lightning==1.7.7
  Downloading pytorch_lightning-1.7.7-py3-none-any.whl (708 kB)
Collecting aitextgen
  Downloading aitextgen-0.6.0.tar.gz (572 kB)
Collecting torchmetrics>=0.7.0
  Downloading torchmetrics-0.10.3-py3-none-any.whl (529 kB)
Collecting pyDeprecate>=0.3.1
  Downloading pyDeprecate-0.3.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-n

### Python Imports

In [3]:
# File/string operations
import os
import shutil

# Project root on the mounted Google Drive
DATA = "/content/drive/MyDrive/ml_projects/tdt12_nlp_creative"
# Logging
import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
# Model operations
from aitextgen import aitextgen

## Model Inference

### Copy fine-tuned model from Google Drive to Colaboratory VM


In [4]:
model_dir = "ATG_20221123_163129"
shutil.copytree(f"{DATA}/trained_models/{model_dir}", "./trained_model")

'./trained_model'

### Load model

In [5]:
ai = aitextgen(model_folder="./trained_model", to_gpu=True)

INFO:aitextgen:Loading model from provided weights and config in /./trained_model.
  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "
INFO:aitextgen:GPT2 loaded with 354M parameters.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


### Define the sampling parameter sweep


In [9]:
setup = {
    #"temperature": [0.1, 0.3, 0.5, 0.7, 0.9],
    "top_k": [10, 20, 30, 40, 50],
    #"top_p": [0.6, 0.75, 0.9, 1.0]
}

# Top-p (nucleus sampling) is usually set to a high value (e.g. 0.75) to cut off the
# long tail of low-probability tokens that could otherwise be sampled.
# Top-k and top-p can be combined; when both are enabled, top-p is applied after top-k.
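
To make the order of the two filters concrete, here is a minimal pure-Python sketch of top-k followed by top-p on a toy distribution. The function name `top_k_top_p_filter` and the toy vocabulary are hypothetical and exist only for illustration; this is not the code that aitextgen or transformers runs internally. The sketch assumes that top-p is measured on the distribution renormalized over the tokens that survive top-k.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Illustrative sketch: apply top-k, then top-p, and renormalize.

    probs: dict mapping token -> probability (assumed to sum to 1).
    top_k: keep only the k most likely tokens (0 disables this filter).
    top_p: keep the smallest set of most likely tokens whose cumulative
           probability reaches top_p (1.0 disables this filter).
    """
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # Step 1: top-k keeps the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Renormalize so that top-p is measured on the surviving tokens only
    # (an assumption of this sketch).
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2: top-p keeps tokens until the cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Final renormalization: the result is the distribution to sample from.
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical next-token distribution
toy = {"the": 0.5, "a": 0.2, "film": 0.15, "movie": 0.1, "zzz": 0.05}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.75)
print(filtered)  # only "the" and "a" survive both filters
```

With `top_k=3` the candidates are "the", "a" and "film"; after renormalizing, "the" and "a" together already cover more than 0.75 of the probability mass, so top-p drops "film" as well.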

### Run inference and save to file

In [10]:
from collections import defaultdict

res = defaultdict(list)
for key, values in setup.items():
  for i in range(10):
    out = []
    for value in values:
      texts = ai.generate(n=1,
                          prompt=f"{i+1}<|rating|>",
                          prepend_bos=None,
                          min_length=100,
                          max_length=300,
                          temperature=value if key == "temperature" else 0.7,
                          top_k=value if key == "top_k" else 50, # Sample only from the k most likely tokens
                          top_p=value if key == "top_p" else 1.0, # Sample only from the smallest set of tokens whose cumulative probability reaches p
                          #num_beams=2, # If greater than 1, executes beam search for cleaner text
                          #repetition_penalty=1.0, # If greater than 1.0, penalizes repetition in a text to avoid infinite loops
                          #length_penalty=1.0, # If greater than 1.0, penalizes text proportional to the length
                          do_sample=True,
                          return_as_list=True,
                          )
      out.append(f"{key}:{value}")
      for text in texts:
        out.append(text)
    res[key].append(out)

# Flatten the results into printable lines, grouped by parameter and rating
out_array = []
for key, item in res.items():
  header = "-"*3 + key + "-"*3
  print(header)
  out_array.append(header)
  for i, rating in enumerate(item):
    rating_text = "-"*15 + f"\n  Rating: {i+1}\n" + "-"*15
    print(rating_text)
    out_array.append(rating_text)
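    for review in rating:
      # Drop the "<rating><|rating|>" prompt prefix and quote the remaining text;
      # label entries such as "top_k:10" contain no prefix and are only quoted
      review_text = ("'" + review.split("<|rating|>")[-1] + "'").rstrip()
      print(review_text)
      out_array.append(review_text)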
    print()
    out_array.append("")

# Save to the first unused experiment_run{i}.txt so earlier runs are not overwritten
i = 0
while os.path.exists(f"{DATA}/output/aitextgen/experiment_run{i}.txt"):
  i += 1
file_name = f"{DATA}/output/aitextgen/experiment_run{i}.txt"
with open(file_name, "w") as f:
  for item in out_array:
    f.write(f"{item}\n")


---top_k---
---------------
  Rating: 1
---------------
'top_k:10'
'I don't understand the high ratings for this film. I was so bored and I couldn't stop thinking about it. I don't blame the film for that. The movie itself is so horrible. The acting is atrocious. It seems like the actors were in it, for the most part. It's like they were having too much money. They were given the material that they were given. I don't know what they were thinking, for example, the one scene that he was having a conversation with the girl, the other time he said ""I don't know what you will do"". I don't know what the hell they were doing. It's too much. The only good thing I can say is that they should have left it to them. It's too much. If they should have just kept telling that this movie would have been better. I would have given it a zero. If I had gone out there, I would have given it a zero if that would have been possible. I am not sure what the hell was. I'm sure that the film is going to be a