# Creating an AI Esports Pro Twitter Account with gpt-2-simple

## Preface

GPT-2 is a prominent open-source (MIT) machine-learning language model that can be fine-tuned on training data to generate new text. In this project, we train GPT-2 on the tweets of 21 current and former esports professionals to generate original tweets in the same style.

We use the TWINT (Twitter Intelligence Tool) and gpt-2-simple libraries to achieve this goal.

## Scraping Data

We used TWINT to scrape 41,379 tweets from various esports professionals and saved the raw text into a single-column .csv file.

Our basic query, run from PowerShell, was: `twint -u username -s " -filter:replies -filter:links" -fr -o file.csv --csv`

This command scrapes all tweets from the user "username" (excluding replies, retweets, and tweets that include links), along with their associated metadata, into a .csv file. We repeated this process for 21 different esports pros, deleted every column except the raw tweet text, and merged the .csv files into a final "tweets.csv" as our input for GPT-2.
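
The column trimming and merging described above can also be scripted instead of done by hand. The following is a minimal sketch, not the procedure we actually followed: the function name `merge_tweet_columns`, the per-user file names, and the assumption that the scraped files store the tweet text in a column named `tweet` are all illustrative and should be checked against your own files.

```python
import csv


def merge_tweet_columns(csv_paths, out_path, column="tweet"):
    """Copy one text column from several CSV files into a single-column CSV.

    `column` is an assumption about the scraped files' header; adjust it
    if your files name the tweet text differently.
    Returns the number of tweets written.
    """
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow([column])  # keep a header row in the merged file
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.DictReader(in_file):
                    text = (row.get(column) or "").strip()
                    if text:  # skip rows with no tweet text
                        writer.writerow([text])
                        count += 1
    return count


# Hypothetical usage, with one scraped file per pro:
# merge_tweet_columns(["pro_a.csv", "pro_b.csv"], "tweets.csv")
```

The sketch uses only the standard library, so it can run on the same machine that did the scraping before the result is uploaded to Google Drive.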

## Training the Model

Colab is a convenient development environment for this project as it comes preinstalled with TensorFlow (a requirement for gpt-2-simple) and allows us to present our project to you in an orderly fashion.

Here we tell Colab to use the latest version of TensorFlow 1, as gpt-2-simple does not support TensorFlow 2. We then install gpt-2-simple onto the VM (Virtual Machine) and import it as gpt2.

In [1]:
%tensorflow_version 1.x
!pip install gpt-2-simple
import gpt_2_simple as gpt2

TensorFlow 1.x selected.
Collecting gpt-2-simple
  Downloading https://files.pythonhosted.org/packages/6f/e4/a90add0c3328eed38a46c3ed137f2363b5d6a07bf13ee5d5d4d1e480b8c3/gpt_2_simple-0.7.1.tar.gz
Collecting toposort
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: gpt-2-simple
  Building wheel for gpt-2-simple (setup.py) ... done
  Created wheel for gpt-2-simple: filename=gpt_2_simple-0.7.1-cp36-none-any.whl size=23581 sha256=fcd0d1512d4072f231432f437973d55d5f5a0257c6a064189849a483c3b459c4
  Stored in directory: /root/.cache/pip/wheels/0c/f8/23/b53ce437504597edff76bf9c3b8de08ad716f74f6c6baaa91a
Successfully built gpt-2-simple
Installing collected packages: toposort, gpt-2-simple
Successfully installed gpt-2-simple-0.7.1 toposort-1.5
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please 

We download a GPT-2 model of size 355M.

In [2]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 262Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 115Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 407Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:05, 251Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 247Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 110Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 160Mit/s]                                                       


We mount one of our personal Google Drives onto the VM for easy access to our input data.

In [3]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We load tweets.csv from Google Drive, create a TensorFlow session object, and finetune (train) our fresh model on tweets.csv for 1001 steps. When training finishes, our checkpoint will be stored under the run name "trained_model" (in checkpoint/trained_model, as the log output shows). Every 50 steps, the program prints the training progress; every 200 steps, it prints a sample of generated text and saves a checkpoint.
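
To picture the logging cadence described above, here is a small standalone sketch. It does not call gpt-2-simple; it assumes each action fires whenever the step counter is a multiple of its interval, which matches the log output but is our simplification of the library's training loop. The helper name `training_events` is hypothetical.

```python
def training_events(steps, print_every, sample_every, save_every):
    """Map each step number to the actions assumed to fire on that step."""
    events = {}
    for step in range(1, steps + 1):
        fired = []
        if step % print_every == 0:
            fired.append("print")
        if step % save_every == 0:
            fired.append("save")
        if step % sample_every == 0:
            fired.append("sample")
        if fired:
            events[step] = fired
    return events


schedule = training_events(steps=1001, print_every=50,
                           sample_every=200, save_every=200)
for step in (50, 200, 1000):
    print(step, schedule[step])
```

Under this assumption, the run prints progress 20 times and samples and saves 5 times, with the last of each at step 1000.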

In [4]:
gpt2.copy_file_from_gdrive("tweets.csv")

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset="tweets.csv",
              model_name="355M",
              steps=1001,
              restore_from="fresh",
              run_name="trained_model",
              print_every=50,
              sample_every=200,
              save_every=200
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


100%|██████████| 1/1 [00:00<00:00,  8.21it/s]

Loading dataset...
dataset has 1592043 tokens
Training...
[50 | 91.08] loss=2.42 avg=2.42
[100 | 173.36] loss=3.01 avg=2.71
[150 | 255.85] loss=2.63 avg=2.69
[200 | 338.04] loss=2.06 avg=2.53
Saving checkpoint/trained_model/model-200
||<|startoftext|>I really like the idea of the top half of the stage being frozen when they enter a wall. Maybe it will be fun to see them slide down the wall when they enter it in a wall rush?<|endoftext|>
<|startoftext|>I really enjoy how competitive, and unique, Brawl can be. When you see a lot of the same people with the same things, it's fun<|endoftext|>
<|startoftext|>I really enjoyed the ESEA games today, still having a lot of practice going into it. Hope everyone still shows up, and I'll come back to the stage again tomorrow!<|endoftext|>
<|startoftext|>I really like getting a little creative with my Smash 4 sets and creating fun, exciting moments to share with my family, while also making the game more fun to play if you're not too used or don't know what to do with

You will notice <|startoftext|> and <|endoftext|> tokens in this output that, of course, did not come with the original text data. gpt-2-simple treats single-column .csv files as a special case and wraps each row in these tokens, so the model learns where one tweet ends and the next begins.
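
The effect of this special case can be illustrated with a short sketch. This is not gpt-2-simple's own code; it assumes the library skips a header row and wraps the first column of every remaining row, and the helper name `wrap_rows` is hypothetical.

```python
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_rows(csv_text, has_header=True):
    """Wrap the first column of each CSV row in start/end tokens."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # assumption: the first row is a header
    return [START_TOKEN + row[0] + END_TOKEN for row in reader if row]


sample_csv = 'tweet\ngg wp\n"hello, world"\n'
for line in wrap_rows(sample_csv):
    print(line)
```

Each tweet becomes its own delimited document, which is what lets us later ask the model for one tweet at a time.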

We save the checkpoint model to Google Drive in case we want to load it again at a later time.

In [5]:
gpt2.copy_checkpoint_to_gdrive(run_name="trained_model")
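
For completeness, here is a sketch of how the saved checkpoint might be restored in a later Colab session. It relies on gpt-2-simple's `mount_gdrive`, `copy_checkpoint_from_gdrive`, and `load_gpt2` helpers; the wrapper name `restore_trained_model` is our own, and we did not run this cell as part of the notebook.

```python
def restore_trained_model(run_name="trained_model"):
    """Copy a saved checkpoint back from Google Drive and load it.

    Intended for a fresh Colab session with gpt-2-simple installed.
    """
    import gpt_2_simple as gpt2  # imported here so the sketch loads without it

    gpt2.mount_gdrive()
    gpt2.copy_checkpoint_from_gdrive(run_name=run_name)
    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)
    return sess


# Hypothetical usage in a new session:
# sess = restore_trained_model("trained_model")
```

The returned session can then be passed to the generation calls in the next section without retraining.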

## Generating Text

We generate 20 text samples in parallel (a single batch of 20), excluding the prefix "<|startoftext|>" from each sample and truncating at "<|endoftext|>". The length of 100 tokens is rather arbitrary, since our model was trained on tweet-length data anyway. Higher temperatures (0.7 to 1.0) generate more "interesting" text, according to the gpt-2-simple documentation.

In [6]:
gpt2.generate(sess,
              run_name="trained_model",
              length=100,
              temperature=0.7,
              prefix="<|startoftext|>",
              truncate="<|endoftext|>",
              include_prefix=False,
              nsamples=20,
              batch_size=20
              )

vamoa is a izz
He tried to take me out in the other room, but I'm the one that got the stick. He's a good teammate, he's a good player, and I already have a team to look forward to with my team.  He tried to take me out, I don't know why, but I did get the stick back. I'm sorry. I took the wrong idea out of the other room.
I have introduced a new score so the Season 3 match will be played at 9:00pm EST. I have no idea how to use the new game mode, but it makes me so happy that it is being introduced, so I will work on it for Season 3 and see if I can make it work for everyone's enjoyment. I have an idea but I need to reach out to the community for feedback first =/
Lola 3-0s her
The first half of the year has been EXTREMELY rough for us, still just got the 12th win in a row with a relatively low margin of victory.  We'll be playing the new season on the 3rd side, and I can't help but look forward to the new map, map, and map design.
For half the day I've been feeling really bad. I’ve b
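
To give some intuition for the temperature setting used above, here is a toy sketch of temperature-scaled sampling. It is a generic illustration rather than gpt-2-simple's implementation: the logits are made up, and the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

Lower temperatures concentrate probability on the top-scoring token, while values closer to 1 leave more room for less likely tokens, which is where the more surprising output comes from.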

We use the same parameters as above, but this time write 1,000 samples to a text file on the VM's disk. We will then curate this file, picking out the most interesting and humorous tweets for our AI esports pro! You can follow his wacky antics at: https://twitter.com/SolidMaldo

In [7]:
gpt2.generate_to_file(sess,
                      run_name="trained_model",
                      destination_path="1000_generated_tweets.txt",
                      length=100,
                      temperature=0.7,
                      prefix="<|startoftext|>",
                      truncate="<|endoftext|>",
                      include_prefix=False,
                      nsamples=1000,
                      batch_size=20
                      )
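
Before curating by hand, it can help to split the generated file into individual tweets and discard duplicates and anything too long to post. The sketch below is one possible first pass, not part of our original workflow. It assumes samples in the file are separated by a line of 20 "=" characters (our understanding of the library's default delimiter; check your file) and uses a 280-character limit; the function name is hypothetical.

```python
def split_generated_samples(text, delimiter="=" * 20, max_chars=280):
    """Split generated text into unique, tweet-length samples, preserving order."""
    seen = set()
    tweets = []
    for chunk in text.split(delimiter):
        tweet = chunk.strip()
        if tweet and len(tweet) <= max_chars and tweet not in seen:
            seen.add(tweet)
            tweets.append(tweet)
    return tweets


# Hypothetical usage on the file written above:
# with open("1000_generated_tweets.txt", encoding="utf-8") as f:
#     candidates = split_generated_samples(f.read())
```

The final choice of what is funny enough to post still has to be made by a human.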