## Encode/Tokenize
1. To feed the model tweets and have them treated "independently" (in reality several are read in a single batch), we separate tweets with the special token "<|endoftext|>", which OpenAI used in pre-training to separate documents. Our dataset then becomes something like:
<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>

Note: having no space around <|endoftext|> empirically leads to better predictions.

2. Shuffle the tweets for each epoch, then reapply step 1.
3. Split the data: 80% --> training, 20% --> validation
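
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual preprocessing: the helper names `build_corpus` and `split_tweets`, the sample tweets, and the seeds are all made up for the example.

```python
import random

EOS = "<|endoftext|>"  # GPT-2's document separator token

def build_corpus(tweets, seed=None):
    """Shuffle the tweets and join them with EOS, with no spaces around the token."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    return EOS + EOS.join(shuffled) + EOS

def split_tweets(tweets, train_fraction=0.8, seed=0):
    """Shuffle once, then split into (train, validation) lists."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample data
tweets = ["This is my first tweet!", "Second tweet already!", "Third.", "Fourth.", "Fifth."]
train, val = split_tweets(tweets)

# Rebuild the training text with a different seed for every epoch (step 2)
epoch_text = build_corpus(train, seed=1)
```

Splitting before joining keeps every tweet whole, so no tweet ends up half in training and half in validation.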



## Fine Tune Model:
1. Import GPT-2 small from: https://huggingface.co/docs/transformers/index
2. Fine-tune the model (take a look at run_language_modeling.py and run_generation.py)
3. Consider using sweeps to find optimal hyperparameters
4. Consider fitting a random forest (RF) on the input hyperparameters vs. validation loss to generate a feature importance table

The best choices reported by the source (huggingtweets) were:
    cosine learning scheduler
    no gradient accumulation
    no warmup
    4 epochs
    learning rate of 1.37e-4
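
To get a feel for what a cosine scheduler with no warmup does to the learning rate, here is a small pure-Python sketch. It assumes a single half-cosine decay from the base rate down to zero, which is my understanding of what `get_cosine_schedule_with_warmup` does with zero warmup steps and default cycles. The function `cosine_lr` and the value of `total_steps` are illustrative only.

```python
import math

def cosine_lr(step, total_steps, base_lr=1.37e-4):
    """Learning rate at `step` under half-cosine decay with no warmup."""
    progress = min(step, total_steps) / max(1, total_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps):.3e}")
```

The rate starts at the base value, falls slowly at first, fastest around the midpoint, and flattens out again near zero at the end of training.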

Things to consider:
learning rate scheduler, number of epochs, learning rate, etc
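
One possible way to lay out a search over these settings is a Weights & Biases sweep configuration, sketched below as a plain dictionary. The ranges and value lists are placeholders rather than tuned values, and the metric name `eval/loss` assumes the validation loss is logged under that key.

```python
# Hypothetical search space; every range and value list here is a placeholder.
sweep_config = {
    "method": "random",
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "num_train_epochs": {"values": [1, 2, 3, 4]},
        "lr_scheduler_type": {"values": ["linear", "cosine", "constant"]},
        "warmup_steps": {"values": [0, 50, 100]},
        "gradient_accumulation_steps": {"values": [1, 2, 4]},
    },
}

# With wandb installed and a training function defined (here called train_fn,
# a hypothetical name), the sweep would be launched along these lines:
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train_fn, count=20)
```

Each run's hyperparameters and final validation loss could then be exported as the table for the random forest feature importance idea above.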


## NOTE:
The code blocks below are a draft. A final working version with wandb is in the Colab notebook linked below.
To run it, you will need to upload the .txt data file for the user, which is generated in "Scrapping and Cleaning.ipynb".

https://colab.research.google.com/drive/19iV11ZU4mX9JeZfVchC21uROXRugYkw4?authuser=1#scrollTo=ZzPjnbH8YXdp

## Draft: 
### Do not run the cells below, as they will lag out your computer

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import (
            AutoTokenizer, AutoModelForCausalLM,
            TextDataset, DataCollatorForLanguageModeling,
            Trainer, TrainingArguments,
            get_cosine_schedule_with_warmup)
# from huggingface_hub.hf_api import HfAp
import api.filter


import os
import numpy as np
import re
import pandas as pd
import tensorflow as tf
import torch
import pathlib
import random

In [3]:
#Tokenize:
file_path = "./data/cleaned_SenSanders_5000.txt"

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
block_size = tokenizer.model_max_length
train_dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=block_size, overwrite_cache=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Token indices sequence length is longer than the specified maximum sequence length for this model (913001 > 1024). Running this sequence through the model will result in indexing errors
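
The warning above seems to be expected here. As far as I understand `TextDataset`, it tokenizes the whole file as one long sequence (hence 913001 tokens) and only afterwards slices it into blocks of at most `block_size` tokens, so the model never sees the over-long sequence. The sketch below shows that slicing on toy data. `chunk_token_ids` is a made-up helper, not the library's code.

```python
def chunk_token_ids(token_ids, block_size):
    """Split one long token sequence into full blocks; a short tail is dropped."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

token_ids = list(range(10))  # stand-in for the tokenized file
blocks = chunk_token_ids(token_ids, block_size=4)
print(blocks)
```

Because blocks are cut at fixed positions, a block can start or end in the middle of a tweet, which is one reason the "<|endoftext|>" separators matter.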


In [2]:
!nvidia-smi

'nvidia-smi' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
#Train:
#Largely taken from: https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb#scrollTo=ZSCf6QyF8AG-

ALLOW_NEW_LINES = False     # seems to work better <--- from source
LEARNING_RATE = 1.372e-4
EPOCHS = 4
seed = random.randint(0,2**32-1)
training_args = TrainingArguments(
    output_dir="./model_files",
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=1,  # one epoch per train() call; EPOCHS above is not used in this draft
    per_device_train_batch_size=1,
    prediction_loss_only=True,
    logging_steps=5,
    save_steps=0,
    seed=seed,
    learning_rate=LEARNING_RATE)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)



In [None]:
trainer.train()
trainer.model.config.task_specific_params['text-generation'] = {
    'do_sample': True,
    'min_length': 10,
    'max_length': 160,
    'temperature': 1.,
    'top_p': 0.95,
    'prefix': '<|endoftext|>'}

In [None]:
#Save Model:
trainer.save_model()

In [None]:
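# Prediction step
start = ""  # placeholder prompt (not defined in the original draft); set to the text each tweet should begin with
start_with_bos = '<|endoftext|>' + start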
encoded_prompt = trainer.tokenizer(start_with_bos, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt = encoded_prompt.to(trainer.model.device)

# prediction
output_sequences = trainer.model.generate(
    input_ids=encoded_prompt,
    max_length=160,
    min_length=10,
    temperature=1.,
    top_p=0.95,
    do_sample=True,
    num_return_sequences=10
    )
generated_sequences = []

# decode prediction
predictions = []
for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    generated_sequence = generated_sequence.tolist()
    text = trainer.tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)
    if not ALLOW_NEW_LINES:
        limit = text.find('\n')
        text = text[: limit if limit != -1 else None]
    generated_sequences.append(text.strip())

for i, g in enumerate(generated_sequences):
    predictions.append([start, g])