### 0. Install requirements

In [None]:
%pip install -r requirements.txt

### 1. Testing Model *GPT2*

Use the model through ***pipeline***, a high-level helper.  
You can see the model details [here](https://huggingface.co/openai-community/gpt2).

In [48]:
from transformers import pipeline

pipe = pipeline("text-generation", model="openai-community/gpt2")

pipe("The future of AI is ", max_length=50, num_return_sequences=1, max_new_tokens=50)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=50) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of AI is \xa0to get better at it.\nWhat is the current state of AI?\nThe top thinkers in our field are all focused on AI. The next major AI research group will be called the AI Institute for Artificial Intelligence (AII). This is'}]

The generated answers are not factually reliable. For example:
'US is a country in ʻal-Arabia, the region with' or 'The future of AI is \xa0beyond the human brain. It's coming'
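
The "Both `max_new_tokens` and `max_length` seem to have been set" warning in the output above comes from passing two length limits at once; the log states that `max_new_tokens` takes precedence. Below is a minimal sketch that passes only `max_new_tokens`, assuming the same model and prompt. The helper names `build_generation_kwargs` and `generate` are illustrative and not part of the original notebook.

```python
def build_generation_kwargs(max_new_tokens=50, num_return_sequences=1):
    """Return generation arguments that set a single length limit."""
    return {
        "max_new_tokens": max_new_tokens,              # cap on newly generated tokens
        "num_return_sequences": num_return_sequences,  # number of samples to return
    }


def generate(prompt, model_name="openai-community/gpt2"):
    """Run text generation; downloads the model on first use."""
    # Imported lazily so build_generation_kwargs stays usable on its own.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)
    return pipe(prompt, **build_generation_kwargs())


# Usage (commented out because it downloads the model):
# generate("The future of AI is ")
```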

### 2. Fine-Tuning

Here we ***fine-tune*** the model on the ***wikitext*** dataset.

- #### 2-1. Load datasets

In [23]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.select(range(200))  # keep only the first 200 rows so the demo runs quickly

dataset

Dataset({
    features: ['text'],
    num_rows: 200
})
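
The raw wikitext split appears to contain blank rows, which contribute only padding once tokenized. One possible sketch for dropping them before tokenization is shown below. `has_text` is a hypothetical helper, and whether filtering is worthwhile for this small sample is a judgment call.

```python
def has_text(example):
    """Return True when the row contains something other than whitespace."""
    return bool(example["text"].strip())


# Usage with the dataset loaded above (commented out because it needs the download):
# dataset = dataset.filter(has_text)

print(has_text({"text": " = Title = \n"}))  # True
print(has_text({"text": ""}))               # False
```

If the filter runs after `select(range(200))`, fewer than 200 rows remain, so the step counts later in the notebook would change.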

- #### 2-2. Load model & tokenizer

In [24]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token, so reuse EOS for padding
model = AutoModelForCausalLM.from_pretrained(model_name)

- #### 2-3. Tokenize the dataset

In [33]:
def tokenize_dataset(ds):
    return tokenizer(ds["text"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_dataset, batched=True, remove_columns="text")

tokenized_dataset

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 200
})
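
To see what `padding="max_length"`, `truncation=True`, and `max_length=128` do to each row, here is a pure-Python sketch of the effect. It is an illustration, not the tokenizer's implementation. `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary, and the sketch assumes right-side padding with the EOS id 50256 reported in the earlier log.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=50256):
    """Mimic fixed-length encoding: cut long rows, right-pad short ones."""
    ids = list(token_ids)[:max_length]            # truncation
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad            # padding on the right
    attention_mask = [1] * n_real + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([464, 2003, 286], max_length=6)
print(short["input_ids"])       # [464, 2003, 286, 50256, 50256, 50256]
print(short["attention_mask"])  # [1, 1, 1, 0, 0, 0]
```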

- #### 2-4. Set up the *Data-Collator*

In [34]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is a causal language model, not a masked one
)
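
With `mlm=False`, the collator builds `labels` for causal language modeling from `input_ids` and replaces padding positions with `-100` so the loss ignores them. The sketch below imitates that step in plain Python. `make_causal_labels` is a hypothetical name, and the token ids are arbitrary. Because the pad token here is the EOS token, genuine EOS tokens would be masked as well.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def make_causal_labels(input_ids, pad_token_id=50256):
    """Copy input_ids to labels, masking pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


labels = make_causal_labels([464, 2003, 286, 50256, 50256])
print(labels)  # [464, 2003, 286, -100, -100]
```

The shift by one position (predicting token *t+1* from tokens up to *t*) happens inside the model, so the labels are not shifted here.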

- #### 2-5. Define *Training-Arguments* & initialize the *Trainer*

In [43]:
import torch
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./fine-tuned-gpt2",  # Directory to save the model
    overwrite_output_dir=True,       # Overwrite the output directory if it already exists
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=8,   # Batch size per device
    save_steps=500,                  # Save a checkpoint every 500 steps
    save_total_limit=2,              # Keep only the last 2 checkpoints
    logging_dir="./logs",            # Directory for logs
    logging_steps=100,               # Log every 100 steps
    eval_strategy="steps",           # Evaluate every `eval_steps`
    eval_steps=500,                  # Evaluation frequency
    learning_rate=5e-5,              # Learning rate
    weight_decay=0.01,               # Weight decay
    fp16=torch.cuda.is_available(),  # Use mixed precision only when a CUDA GPU is present
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # Use the same dataset for evaluation
    data_collator=data_collator,
)

- #### 2-6. Training

In [44]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|


TrainOutput(global_step=75, training_loss=3.5896175130208334, metrics={'train_runtime': 485.7532, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.154, 'total_flos': 39193804800000.0, 'train_loss': 3.5896175130208334, 'epoch': 3.0})
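
The empty progress table and `global_step=75` follow from the settings above. Here is a quick calculation, assuming a single device and no gradient accumulation (neither is set in the training arguments). `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_rows, batch_size, epochs):
    """Optimizer steps for a run with one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * epochs


steps = total_steps(num_rows=200, batch_size=8, epochs=3)
print(steps)         # 75, matching global_step in the output above
print(steps >= 100)  # False: logging_steps=100 is never reached
print(steps >= 500)  # False: eval_steps=500 and save_steps=500 are never reached
```

Lowering `logging_steps` and `eval_steps` below 75 (for example to 25) would likely populate the table.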

- #### 2-7. Save *Fine-Tuned-gpt2*

In [45]:
model.save_pretrained("./fine-tuned-gpt2")
tokenizer.save_pretrained("./fine-tuned-gpt2")

('./fine-tuned-gpt2\\tokenizer_config.json',
 './fine-tuned-gpt2\\special_tokens_map.json',
 './fine-tuned-gpt2\\vocab.json',
 './fine-tuned-gpt2\\merges.txt',
 './fine-tuned-gpt2\\added_tokens.json',
 './fine-tuned-gpt2\\tokenizer.json')

- #### 2-8. Testing new model

In [47]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./fine-tuned-gpt2")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-gpt2")

pipe = pipeline("text-generation", model=fine_tuned_model, tokenizer=fine_tuned_tokenizer)

pipe("The future of AI is ", max_length=50, num_return_sequences=1, max_new_tokens=50)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=50) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The future of AI is \n\nAI is already a highly complex subject that is still under research and development. Many of the problems that we face today are within the scope of the present research and development efforts. This requires a broad assessment of the present state of the art and'}]
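
To compare the base and fine-tuned models side by side, it can help to look only at the continuation rather than the full text. The sketch below assumes the pipeline output shape shown above (a list of dicts with a `generated_text` key). `continuation` is a hypothetical helper, and the sample string is shortened from the output above.

```python
def continuation(outputs, prompt):
    """Return the generated text after the prompt for the first sequence."""
    text = outputs[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


sample = [{"generated_text": "The future of AI is \n\nAI is already a highly complex subject"}]
print(continuation(sample, "The future of AI is "))  # AI is already a highly complex subject
```

Because generation is sampled, a single pair of outputs is only a rough indication of how fine-tuning changed the model.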