# Baseline Model
This is a first attempt at generating blog posts. The model is meant to be quick and easy to set up and to produce some result, not necessarily usable text. I will try GPT-2 first, and maybe something else later.


## GPT2 only, no fine-tuning
This one is really easy to set up: we basically only need the pipeline. The task at hand is text generation, and our model will be distilgpt2, a smaller distilled version of GPT-2.

In [None]:
!pip install transformers

In [None]:
from transformers import pipeline
blogPoster = pipeline(task='text-generation', model='distilgpt2')

We can now call this with an empty string as a starting point, and it will generate some text. However, by default it won't write about NFTs but about the wide range of topics found in the texts it was trained on.

To get an idea, we can generate a few of those. 

In [None]:
blogPoster("", num_return_sequences=5)

We can push it a bit in the right direction by giving it something to start with, e.g. "The next NFT-based".

In [None]:
blogPoster("The next NFT-based", num_return_sequences=5)
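
To compare prompts more easily, the generated strings can be pulled out of the pipeline output, which is a list of dictionaries with a `generated_text` key. The helper below is only a minimal sketch: `collect_texts` and its default values are made up for illustration, and it accepts any callable that behaves like `blogPoster`.

```python
def collect_texts(generator, prompt, n=5, max_new_tokens=40):
    """Run a text-generation callable and return only the generated strings.

    `generator` is assumed to behave like the `blogPoster` pipeline:
    it returns a list of dicts, each with a "generated_text" entry.
    """
    outputs = generator(
        prompt,
        num_return_sequences=n,
        do_sample=True,                 # sampling gives n different continuations
        max_new_tokens=max_new_tokens,  # hypothetical default, tune as needed
    )
    return [out["generated_text"].strip() for out in outputs]
```

In the notebook this could be used as `collect_texts(blogPoster, "The next NFT-based")`, printing each returned string on its own line.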

OK, so that was quite bad, as was probably to be expected. I think we will have to do some fine-tuning.

## Fine-tune GPT2 with the NFT articles data set 
The idea is to take the gpt2 model and train it for a few epochs on the NFT articles data set. This should make it more likely to generate relevant output without the need for a full training from scratch. So we use transfer learning to get there faster. This approach is called domain adaptation: we want to adapt the gpt2 model to the domain of NFT-related articles.

There are two tutorials to follow here. The first, https://huggingface.co/course/chapter7/6?fw=pt, is on training a causal language model from scratch. The second, https://huggingface.co/course/chapter3/3?fw=tf, is on fine-tuning a pretrained model. Hopefully, the two can be combined to give the desired result.

## Load the model and tokenizer
Now we import the necessary modules. We will use AutoTokenizer and TFAutoModelForCausalLM to load the tokenizer and the model from the pretrained checkpoint. We will start with the "gpt2" model.

In [46]:
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForCausalLM

checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForCausalLM.from_pretrained(checkpoint)

model.compile()  # no loss passed, so the model's internal loss is used; earlier try: optimizer="adam", loss="sparse_categorical_crossentropy"
model.summary()

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Model: "tfgpt2lm_head_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124439808 
 r)                                                              
                                                                 
Total params: 124,439,808
Trainable params: 124,439,808
Non-trainable params: 0
_________________________________________________________________
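
The parameter count in the summary can be reproduced by hand. The sketch below assumes the standard hyperparameters of the smallest GPT-2 checkpoint (vocabulary of 50257 tokens, 1024 positions, hidden size 768, 12 layers) and that the output head shares its weights with the token embedding; the function name is made up.

```python
def gpt2_param_count(vocab=50257, n_ctx=1024, d=768, n_layer=12):
    """Count the weights of a GPT-2 style decoder with tied input/output embeddings."""
    embeddings = vocab * d + n_ctx * d               # token + position embeddings
    layer_norm = 2 * d                               # scale and bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused q/k/v projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # expand to 4*d, project back to d
    block = 2 * layer_norm + attention + mlp         # two layer norms per block
    return embeddings + n_layer * block + layer_norm  # plus the final layer norm

print(f"{gpt2_param_count():,}")
# 124,439,808
```

Roughly 39 million of these weights sit in the embeddings, the rest in the 12 transformer blocks.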


## Load the data set
First, we need to load the data set into Colab. We will download the data from GitHub.

In [None]:
!pip install datasets

In [None]:
!wget -O nft_content.csv "https://github.com/DemocracyStudio/NFT_text_generation/blob/from-dev-alwin/Data/nft_content.csv?raw=true"

In [6]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("csv", data_files="nft_content.csv")

In [8]:
print(dataset["train"][10]["content_raw"])

Step Hero Multiverse
Follow
Aug 24, 2021
·
3 min read
·
Listen
20,000 $BUSD And 8% Commission Rate For Affiliate Promoter Of HERO NFT Mystery Chests
This August 31st, accompanying the launch of HERO NFT Marketplace, Step Hero will launch our first sales of HERO NFT Mystery Chests.
Furthermore, we will hold a Sales Affiliate Program where the top 33 best promoters having highest commission earn a total prize pool of 20,000 $BUSD, and all participants gain a commission rate of 8% of the total value of successful transactions.
About HERO NFT Mystery Chests
Description
Each HERO Mystery Chest contains a random HERO NFT Collectible in the form of a HERO Character. You can use these HERO Characters to play Step Hero RPG, trade them on the marketplace, integrate them into other digital artworks, or keep them for your collection.
These HERO Collectibles include 4 HERO Characters with different attributes, stats, classes, and rarity.
4 HERO Characters appearing in this sale include:
1. King Art

## Prepare the data for training
Here we prepare the data as input for training the model.

My first approach was to just tokenize the articles and then call model.fit on the tokenized data. However, this naive approach did not work.

Tokenizing one article or all articles at once seems to work, i.e. it does not give an error.

In [49]:
tokenizer.pad_token = tokenizer.eos_token

In [47]:
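def tokenize_function(examples):
    # return_tensors="tf" on a single article yields tensors of shape (1, n_tokens)
    output = tokenizer(examples["content_raw"], padding=True, truncation=True, return_tensors="tf")
    # for causal language modeling, the labels are the input ids themselves
    output["labels"] = output["input_ids"]
    return output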

In [50]:
tokenized_artcls = dataset.map(
    tokenize_function,
    remove_columns=["content_raw",
                    "articleUrl",
                    "date",
                    "title",
                    "authorName",
                    "clapCount",
                    "category",
                    "length",
                    "content_clean",
                    "sentiment"]
)

  0%|          | 0/3541 [00:00<?, ?ex/s]

In [52]:
print(len(tokenized_artcls["train"][0]["input_ids"][0]))

717
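
A possible reason for the trouble further down is that `return_tensors="tf"` gives every article an extra leading dimension, which is why the length check above needs the additional `[0]`. The sketch below is an alternative, not the function used in this notebook: it keeps plain Python lists, truncates to a fixed length and leaves padding to a later step. `tokenize_flat` and `max_length=256` are assumptions for illustration.

```python
def tokenize_flat(examples, tokenizer, max_length=256):
    """Tokenize a batch of articles into plain lists of token ids.

    `examples` is a dict of columns, as passed by `dataset.map(..., batched=True)`;
    `tokenizer` is any Hugging Face style tokenizer callable.
    """
    output = tokenizer(
        examples["content_raw"],
        truncation=True,
        max_length=max_length,  # no return_tensors: one flat list per article
    )
    # For causal language modeling the labels are the inputs themselves;
    # the model shifts them by one position internally.
    output["labels"] = [ids.copy() for ids in output["input_ids"]]
    return output
```

It would be applied with something like `dataset.map(lambda ex: tokenize_flat(ex, tokenizer), batched=True, remove_columns=dataset["train"].column_names)`, after which each `input_ids` entry is a flat list of at most 256 integers.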


## Optional: group articles into even chunks
This does not work either.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 64

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
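
To see what the grouping step is meant to do, here is the same idea on toy data. It is a sketch that assumes every column holds one flat list of integers per article; the nested lists produced with `return_tensors="tf"` do not have that shape, which may be why this step fails here. `group_texts_demo` and the toy numbers are made up.

```python
from itertools import chain


def group_texts_demo(examples, block_size):
    """Concatenate all token lists per column and cut them into equal blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
print(group_texts_demo(toy, block_size=4)["input_ids"])
# [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The three toy articles hold 10 tokens in total, so with a block size of 4 the last two tokens are dropped and article boundaries no longer line up with block boundaries.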

In [None]:
lm_datasets = tokenized_artcls.map(
    group_texts,
    batched=True,
)

In [None]:
type(lm_datasets["train"])

In [None]:
lm_datasets["train"][0]["input_ids"][0]

In [None]:
len(lm_datasets["train"][21]["input_ids"][0])

In [None]:
tokenizer.decode(lm_datasets["train"][2]["input_ids"][0])

In [None]:
#tokenized_artcls["train"][0]["input_ids"]

## Go on with other stuff

In [53]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=256, return_tensors="tf")

In [54]:
train_Set = tokenized_artcls["train"].to_tf_dataset(
    columns = ["attention_mask", "input_ids", "labels"],
    shuffle=True, 
    batch_size=8,
    collate_fn=data_collator
)

ValueError: ignored
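
One plausible cause of this error (an assumption, since the full traceback is not shown) is that the articles have different lengths: `padding="max_length"` does not shorten articles longer than 256 tokens, and `DataCollatorWithPadding` pads the tokenizer's own fields but not `labels`. The sketch below shows, with NumPy only, what a collate step would have to do. `collate_lm`, `pad_id` and the toy batch are made-up names, the function expects flat `input_ids` lists, and -100 is the label value that the Transformers loss ignores.

```python
import numpy as np


def collate_lm(features, pad_id, max_length=256):
    """Truncate or pad every example to max_length and mask padded labels."""
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for feat in features:
        ids = list(feat["input_ids"])[:max_length]
        n_pad = max_length - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # -100 marks positions that should not contribute to the loss.
        batch["labels"].append(ids + [-100] * n_pad)
    return {k: np.array(v, dtype=np.int64) for k, v in batch.items()}


toy_batch = [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9, 10, 11, 12, 13]}]
out = collate_lm(toy_batch, pad_id=0, max_length=4)
print(out["input_ids"].shape)
# (2, 4)
```

With the GPT-2 tokenizer, `pad_id` would be `tokenizer.pad_token_id`, which equals the end-of-text id after the assignment made earlier. Masking the labels by position rather than by token id keeps genuine end-of-text tokens in the loss.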

## Train the model

In [None]:
# train one batch to see if it works (it does not at the moment)
batch = next(iter(train_Set))  # take a single batch from the tf.data.Dataset built above
model.train_on_batch(batch)

In [None]:
# train_step expects one batch of tensors, not the whole DatasetDict; fit iterates over the batches
model.fit(train_Set, epochs=1)