# Train GPT-2 to Generate Tweets

<font size="2">*Adapted from [HuggingFace](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb).*</font>

Generating realistic text has become increasingly practical with models such as [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf). These models are trained on very large datasets and require heavy computing resources (and time!).

However, we can use Transfer Learning and a single GPU to quickly fine-tune a pre-trained model on a given task.

We test whether we can imitate the writing style of a Twitter user using only some of their tweets. The Twitter API lets us download "only" the 3200 most recent tweets from any single user, which we then filter (to remove retweets, short content, etc.).

Here is an example for Elon Musk's next breakthrough 😉

![HuggingTweets Illustration](https://raw.githubusercontent.com/borisdayma/huggingtweets/master/img/example.png)

In [None]:
#####################
# Check if we have  #
# access to the GPU #
#####################

from IPython.display import clear_output
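
The cell above only imports `clear_output`, so here is a minimal sketch of an actual GPU check. It assumes an NVIDIA GPU exposed through the `nvidia-smi` command-line tool; the helper name `gpu_available` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi` is on the PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    # A zero exit code means the driver could talk to at least one GPU.
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0


print("GPU detected" if gpu_available() else "No GPU detected")
```

If no GPU is detected, fine-tuning still works on CPU but will be much slower.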

In [None]:
!pip install tweepy
!pip install torch transformers wandb -qqq
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs

clear_output()

## Set up a Twitter Development Account

In order to access Twitter data, we need to:

* [Create a Twitter development account](https://developer.twitter.com/en/apply-for-access)
* [Create a Twitter app](https://developer.twitter.com/en/apps)
* Get your consumer API keys: `API key` and `API Secret Key`

The entire process only takes a few minutes.

In [None]:
#############################
# Enter your credentials    #
# (don't share with anyone) #
#############################

consumer_key    = 'CONSUMER_KEY'
consumer_secret = 'CONSUMER_SECRET'
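
Since these keys should not be shared, one option is to keep them out of the notebook and read them from environment variables. The sketch below shows that approach; the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `load_twitter_credentials` are assumptions made for illustration, not something Twitter or Tweepy requires.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the consumer key and secret from a mapping of environment variables."""
    key = env.get("TWITTER_CONSUMER_KEY")        # hypothetical variable name
    secret = env.get("TWITTER_CONSUMER_SECRET")  # hypothetical variable name
    if not key or not secret:
        raise RuntimeError(
            "Set TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET first."
        )
    return key, secret


# Usage, once both variables are set in the environment:
# consumer_key, consumer_secret = load_twitter_credentials()
```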

## Download @User Tweets

We download the latest tweets associated with a user account through [Tweepy](http://docs.tweepy.org/).

In [None]:
import tweepy

auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api  = tweepy.API(auth)

We grab all available tweets (limited to 3200 by the API) for a given Twitter handle.

**Note**: Protected users may only be requested when the authenticated user either "*owns*" the timeline or is an approved follower of the owner.

In [None]:
handle = 'elonmusk'

In [None]:
"""Adapted from https://gist.github.com/onmyeoin/62c72a7d61fc840b2689b2cf106f583c"""

############################################
# Initialize a list to hold all the tweepy #
# Tweets & list with no retweets           #
############################################

user_tweets = []

#################################################
# make initial request for most recent tweets   #
# with extended mode enabled to get full tweets #
#################################################

latest_tweets = api.user_timeline(screen_name=handle, tweet_mode='extended', count=200)

if latest_tweets:

    user_tweets.extend(latest_tweets)

    ####################################################
    # save the id of the oldest tweet decreased by one #
    ####################################################

    oldest = user_tweets[-1].id - 1

    while True:

        ######################################
        # all subsequent requests use the    #
        # max_id param to prevent duplicates #
        ######################################

        extra_tweets = api.user_timeline(screen_name=handle, tweet_mode='extended', count=200, max_id=oldest)

        ##################################
        # stop once the API returns no   #
        # more tweets for this handle    #
        ##################################

        if not extra_tweets: break

        user_tweets.extend(extra_tweets)

        oldest = extra_tweets[-1].id - 1

        print(f'Downloaded {len(extra_tweets)} tweets so far.')

n_tweets = len(user_tweets)

print(f'\nGrabbed {n_tweets} tweets from @{handle}.')

Downloaded 200 tweets so far.
Downloaded 199 tweets so far.
Downloaded 199 tweets so far.
Downloaded 200 tweets so far.
Downloaded 200 tweets so far.
Downloaded 200 tweets so far.
Downloaded 199 tweets so far.
Downloaded 198 tweets so far.
Downloaded 200 tweets so far.
Downloaded 200 tweets so far.
Downloaded 200 tweets so far.
Downloaded 199 tweets so far.
Downloaded 200 tweets so far.
Downloaded 199 tweets so far.
Downloaded 200 tweets so far.

Grabbed 3193 tweets from @elonmusk.
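
The download loop above pages backwards through the timeline by passing `max_id = oldest_id - 1` on every request. The sketch below reproduces that pattern against a fake in-memory timeline, so it can run without API access. Both `make_fake_timeline` and `fetch_all` are hypothetical helpers, and the fake only imitates the `count`/`max_id` behaviour described above.

```python
def make_fake_timeline(n_items):
    """Build a stand-in for `api.user_timeline` over ids n_items..1 (newest first)."""
    ids = list(range(n_items, 0, -1))

    def fetch_page(count, max_id=None):
        # Return at most `count` items whose id is <= max_id, newest first.
        eligible = [i for i in ids if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return fetch_page


def fetch_all(fetch_page, page_size=200):
    """Collect every item by repeatedly asking for ids below the oldest one seen."""
    collected = fetch_page(count=page_size)
    while collected:
        oldest = collected[-1]["id"] - 1  # subtract one so the last item is not repeated
        page = fetch_page(count=page_size, max_id=oldest)
        if not page:
            break
        collected.extend(page)
    return collected


tweets = fetch_all(make_fake_timeline(450))
print(f"Collected {len(tweets)} items in total.")
```

Subtracting one from the oldest id matters because `max_id` is inclusive in this sketch: without it, the last item of each page would come back again at the top of the next page.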


Extract the text from the tweets and drop retweets (`RT`).

In [None]:
tweet_list = [tweet.full_text for tweet in user_tweets if not hasattr(tweet, 'retweeted_status')]

print(f'Found {n_tweets} tweets, including {n_tweets - len(tweet_list)} RT, keeping {len(tweet_list)}')

Found 3193 tweets, including 175 RT, keeping 3018


## Create a dataset from downloaded tweets

We remove:
* Retweets (since they are not written in the style of the target author)
* Tweets with no interesting content (limited to URLs, user mentions, `thank you`…)

We clean up remaining tweets:
* We remove URLs
* We correct special characters
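
As a compact illustration of these two clean-up steps, the sketch below uses the standard library's `html.unescape` and a single URL regex. This is an alternative formulation, not the notebook's own code: `normalize_tweet` is a hypothetical name, and `html.unescape` decodes every HTML entity rather than only `&amp;`, `&lt;` and `&gt;`.

```python
import html
import re

URL_PATTERN = re.compile(r"https?:\S+")


def normalize_tweet(text):
    """Decode HTML entities, drop URLs, and collapse all whitespace."""
    text = html.unescape(text)        # e.g. '&amp;' becomes '&'
    text = URL_PATTERN.sub("", text)  # remove anything starting with http: or https:
    return " ".join(text.split())     # collapse spaces and newlines, trim the ends


print(normalize_tweet("Tesla &amp; SpaceX &lt;3 https://t.co/abc123"))
# prints: Tesla & SpaceX <3
```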

In [None]:
import random, re, torch

In [None]:
print(f'Total number of tweets: {len(tweet_list)} / {len(user_tweets)}')

Total number of tweets: 3018 / 3193


In [None]:
def fix_text(text):
    text = text.replace('&amp;', '&')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    return text

In [None]:
def clean_tweet(tweet, allow_new_lines = False):

    bad_start = ['http:', 'https:']

    for w in bad_start:
        tweet = re.sub(f" {w}\\S+", "", tweet)      # removes white space before url
        tweet = re.sub(f"{w}\\S+ ", "", tweet)      # in case a tweet starts with a url
        tweet = re.sub(f"\n{w}\\S+ ", "", tweet)    # in case the url is on a new line
        tweet = re.sub(f"\n{w}\\S+", "", tweet)     # in case the url is alone on a new line
        tweet = re.sub(f"{w}\\S+", "", tweet)       # any other case?

    tweet = re.sub(' +', ' ', tweet)                # collapse runs of spaces left behind by URL removal

    if not allow_new_lines: tweet = ' '.join(tweet.split())

    return tweet.strip()

In [None]:
def boring_tweet(tweet):
    """Check if this is a boring tweet"""

    boring_stuff     = ['http', '@', '#']
    not_boring_words = len([None for w in tweet.split() if all(bs not in w.lower() for bs in boring_stuff)])

    return not_boring_words < 3
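
To make the filter's behaviour concrete, here is a small self-contained restatement of the same rule with a few made-up inputs. The names `count_informative_words` and `is_boring`, and the sample tweets, are illustrative only.

```python
BORING_MARKERS = ("http", "@", "#")


def count_informative_words(tweet):
    """Count words that contain no URL, mention, or hashtag marker."""
    return sum(
        1
        for word in tweet.split()
        if not any(marker in word.lower() for marker in BORING_MARKERS)
    )


def is_boring(tweet, min_words=3):
    """A tweet is boring when it has fewer than `min_words` informative words."""
    return count_informative_words(tweet) < min_words


samples = [
    "@elonmusk thank you",
    "#Starship https://t.co/x",
    "Starship will fly again next month",
]
for sample in samples:
    print(is_boring(sample), "|", sample)
```

Note that the check is a substring test, so any word containing `@` or `#` (for example an email address) is also treated as uninformative.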

In [None]:
curated_tweets     = [fix_text(tweet) for tweet in tweet_list]
clean_tweets       = [clean_tweet(tweet) for tweet in curated_tweets]
informative_tweets = [tweet for tweet in clean_tweets if not boring_tweet(tweet)]

In [None]:
print(f'Total number of tweets: {len(informative_tweets)} / {len(tweet_list)}')

Total number of tweets: 1922 / 3018


In [None]:
######################################
# Join the tweets into one training  #
# string, separated by <|endoftext|> #
######################################

total_text = '<|endoftext|>' + '<|endoftext|>'.join(informative_tweets) + '<|endoftext|>'

## Model Training

For GPT-2 fine-tuning, we use only the `124M` model here, but GPT-2 is also available as a `355M` or `774M` model.

In [None]:
allow_new_lines = False
learning_rate   = 1.372e-4
epochs          = 5         # matches num_train_epochs passed to TrainingArguments

In [None]:
import transformers

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    get_cosine_schedule_with_warmup
)

Save the `informative tweets` corpus as a text file.

In [None]:
# Write the corpus as UTF-8 so emoji and other non-ASCII characters survive
with open('tweet.txt', 'w', encoding='utf-8') as file:
    file.write(total_text)

In [None]:
import pathlib

tokenizer = AutoTokenizer.from_pretrained('gpt2')

model     = AutoModelForCausalLM.from_pretrained('gpt2', cache_dir=pathlib.Path('cache').resolve())

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-g

In [None]:
block_size    = tokenizer.model_max_length
train_dataset = TextDataset(tokenizer=tokenizer, file_path="tweet.txt", block_size=block_size, overwrite_cache=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

seed = random.randint(0,2**32-1)

training_args = TrainingArguments(
    output_dir                  = f"output/{handle}",
    overwrite_output_dir        = True,
    do_train                    = True,
    num_train_epochs            = 5,
    per_device_train_batch_size = 1,
    prediction_loss_only        = True,
    logging_steps               = 5,
    save_steps                  = 0,
    seed                        = seed,
    learning_rate               = learning_rate
)

Creating features from dataset file at 
Saving features into cached file cached_lm_GPT2TokenizerFast_1024_tweet.txt [took 0.001 s]
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
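
The dataset step turns one long token stream into fixed-size training examples. The sketch below shows the chunking idea in plain Python, under the assumption that `TextDataset` cuts the tokenized file into non-overlapping blocks of `block_size` tokens and drops the incomplete tail. `chunk_token_ids` is a hypothetical helper, not a `transformers` function.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a token stream into full, non-overlapping blocks; drop the remainder."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


toy_stream = list(range(10))  # stand-in for real token ids
blocks = chunk_token_ids(toy_stream, block_size=4)
print(blocks)
# prints: [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under that assumption, the 42 examples reported in the training log with a block size of 1024 would correspond to a corpus of between 43,008 and 44,031 tokens.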


In [None]:
combined_dict = {**model.config.to_dict(), **training_args.to_sanitized_dict()}

In [None]:
trainer = Trainer(
  model         = model,
  tokenizer     = tokenizer,
  args          = training_args,
  data_collator = data_collator,
  train_dataset = train_dataset
)

In [None]:
train_dataloader = trainer.get_train_dataloader()
num_train_steps  = len(train_dataloader) * int(training_args.num_train_epochs)  # total optimizer steps across all epochs

trainer.create_optimizer_and_scheduler(num_train_steps)

trainer.lr_scheduler = get_cosine_schedule_with_warmup(
    trainer.optimizer,
    num_warmup_steps   = 0,
    num_training_steps = num_train_steps
)
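
To show what a cosine schedule with warmup does to the learning rate, here is a plain-Python sketch of a half-cycle cosine decay. It follows my understanding of the usual formula and is not copied from `transformers`; `cosine_multiplier` is a hypothetical name. The step count of 210 comes from the training log (42 examples × 5 epochs at batch size 1).

```python
import math


def cosine_multiplier(step, total_steps, warmup_steps=0):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave: 1.0 at the start, 0.0 at the end of training.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 1.372e-4
total_steps = 42 * 5
for step in (0, 52, 105, 157, 210):
    print(step, base_lr * cosine_multiplier(step, total_steps))
```

The multiplier only reaches zero when `total_steps` matches the number of steps actually run; a shorter value makes the curve bottom out early and then rise again.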

In [None]:
trainer.train()

***** Running training *****
  Num examples = 42
  Num Epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 210
  Number of trainable parameters = 124439808
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss
5,4.6282
10,4.4095
15,4.2547
20,4.1968
25,4.1586
30,4.0083
35,4.1078
40,3.9552
45,3.7137
50,3.9255




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=210, training_loss=3.4006070091610865, metrics={'train_runtime': 84.4011, 'train_samples_per_second': 2.488, 'train_steps_per_second': 2.488, 'total_flos': 109742653440000.0, 'train_loss': 3.4006070091610865, 'epoch': 5.0})

In [None]:
trainer.model.config.task_specific_params['text-generation'] = {
  'do_sample':   True,
  'min_length':  10,
  'max_length':  160,
  'temperature': 1.0,
  'top_p':       0.95,
  'prefix':      '<|endoftext|>'
}

## GPT-2 Tweet Generation

Let’s take a look at some of the tweets that our fine-tuned model can generate.

In [None]:
start = 'My dream is'

In [None]:
start_with_bos = '<|endoftext|>' + start

encoded_prompt = trainer.tokenizer(start_with_bos, add_special_tokens=False, return_tensors="pt").input_ids
encoded_prompt = encoded_prompt.to(trainer.model.device)

* `temperature`: controls how random, and therefore how varied, the generated text is. As the temperature approaches 0, the model almost always selects the highest-probability token. A higher temperature lets the model select lower-probability tokens more often, which leads to more variation and creativity. A very high temperature increases the risk of “hallucination”, meaning that the model starts selecting words that make no sense or are off topic.

* `num_beams`: enables beam search. Instead of greedily committing to the single most likely next token at each step, beam search expands the possible next tokens and keeps the k most likely partial sequences, where k is the user-specified number of beams (parallel searches through the sequence of probabilities). This parameter is not set in the call below, which samples with `top_p` instead.
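
The effect of `temperature` and of the `top_p` value used in this notebook can be illustrated on a toy distribution. The sketch below is a plain-Python illustration with made-up logits, not the code path that `generate` runs internally; both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    print(temperature, softmax_with_temperature(logits, temperature))

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

Sampling then draws the next token from the filtered, renormalized distribution, which is why low-probability tokens outside the nucleus never appear.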

In [None]:
output_sequences = trainer.model.generate(
  input_ids            = encoded_prompt,
  max_length           = 160,              # The maximum length the generated tokens can have.
  min_length           = 10,               # The minimum length of the sequence to be generated.
  temperature          = 1.0,              # The value used to modulate the next token probabilities.

  top_p                = 0.95,             # If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
  do_sample            = True,             # Whether or not to use sampling ; use greedy decoding otherwise.
  num_return_sequences = 10
)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.1"
}

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
generated_sequences = []
predictions         = []

for generated_sequence in output_sequences:

    generated_sequence = generated_sequence.tolist()

    text = trainer.tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True, skip_special_tokens=True)

    if not allow_new_lines:
        limit = text.find('\n')
        text  = text[: limit if limit != -1 else None]

    generated_sequences.append(text.strip())

predictions.extend([start, g] for g in generated_sequences)

In [None]:
for s in generated_sequences: print(s)

My dream is to be able to upload videos & music for my friends & followers simultaneously in both my home & office. We will take that next step.
My dream is to make a game engine that is more beautiful for every person, not just for some. Will do so by setting the minimum goal for a game size of 1Ghz for the rest of Earth.
My dream is to bring people that love the game and understand the value of the product.
My dream is to become the first Russian citizen to hold an honorary Federal Reserve Chair. As it happens, it is the least he can do:
My dream is to play that game in the dark. That is the goal. We will add a second option later this week.
My dream is that my children will play an epic video game on the back of my home computer
My dream is that I will try my best to do my part to make The New York Times look better!
My dream is for the future of Twitter!
My dream is that the future, but at present the facts don't match.
My dream is a world where robots are the go-to-read of the peo