# Tweet-generator for #66DaysofData

In this notebook I train a <a href="https://openai.com/blog/better-language-models/">GPT-2</a> model on collected tweets. I collected almost 15,000 tweets from the #66DaysOfData challenge on Twitter. You can read about the process on my <a href="https://markusmueller-ds.github.io/portfolio/66days_analysis.html">website</a>. The goal is to create a tweet generator based on this dataset.

The creator of the <a href="https://github.com/minimaxir/gpt-2-simple">gpt-2-simple</a> library published a great <a href="https://minimaxir.com/2020/01/twitter-gpt2-bot/">blog post</a> explaining the process of using GPT-2 to create a tweet generator.

### What is GPT-2?
> GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages (40GB text data).

### Training parameters
25.04.2021
- model: '124M'
- steps: 2000
- run_name: 'run1'

Training time: 01:18 (hh:mm)
Evaluation: avg loss = 1.20




## Imports

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# check gpu
# best case: Nvidia P100 GPU
!nvidia-smi

Sun Apr 25 08:24:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Load and prepare data

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv('/content/drive/MyDrive/66Days-Generator/finalFrame.csv')

In [None]:
data

Unnamed: 0,tweet_id,user_id,user_name,created_at,full_text,retweets,favorite
0,1299601482749181952,1292469347370360839,DuckPython,2020-08-29 06:55:13+00:00,@KenJee_DS looking forward to #66DaysOfData,0,1
1,1299734773456203777,1159830350102781953,KenJee_DS,2020-08-29 15:44:52+00:00,Very excited to announce the #66daysofdata ini...,51,269
2,1299735515923505153,719854244,Sachin_g_here,2020-08-29 15:47:49+00:00,@KenJee_DS Looking fwd to #66Daysofdata,0,1
3,1299735809004769282,1001046433695285249,gautham53814486,2020-08-29 15:48:59+00:00,Let’s start #66daysofdata https://t.co/IPm1WhHaHB,0,2
4,1299736210575769605,1652520728,khudiamayankino,2020-08-29 15:50:35+00:00,@KenJee_DS count me in #66daysofdata,0,1
...,...,...,...,...,...,...,...
17241,1381306986067750918,731856877139558400,ABYA80,2021-04-11 18:03:43+00:00,R2: #66daysofdata with @KenJee_DS \n\nDay 27: ...,1,14
17242,1381326847527550983,324583975,georgekanellos,2021-04-11 19:22:38+00:00,Days 16-18(R2) of #66daysofdata by @KenJee_DS\...,0,1
17243,1381336589641646083,1300492664308146176,MarkusM99098101,2021-04-11 20:01:21+00:00,Day 40 of #66DaysOfData r2:\n\nread the first...,0,5
17244,1381338886241157124,1282311789464760321,HeqiqetEhmedova,2021-04-11 20:10:29+00:00,Day 4 of #100DaysOfCode ; #66daysofdata \n ✔...,10,5


In [None]:
data.duplicated('tweet_id').sum()

2500

In [None]:
# drop duplicates
data = data.drop_duplicates(subset=['tweet_id'], keep='last')

In [None]:
# drop irrelevant columns, keep only the tweet text
data = data[['full_text']]

In [None]:
data.reset_index(inplace=True, drop=True)

In [None]:
data

Unnamed: 0,full_text
0,@KenJee_DS looking forward to #66DaysOfData
1,Very excited to announce the #66daysofdata ini...
2,@KenJee_DS Looking fwd to #66Daysofdata
3,Let’s start #66daysofdata https://t.co/IPm1WhHaHB
4,@KenJee_DS count me in #66daysofdata
...,...
14741,R2: #66daysofdata with @KenJee_DS \n\nDay 27: ...
14742,Days 16-18(R2) of #66daysofdata by @KenJee_DS\...
14743,Day 40 of #66DaysOfData r2:\n\nread the first...
14744,Day 4 of #100DaysOfCode ; #66daysofdata \n ✔...


In [None]:
# remove new line char
data.full_text = data.full_text.replace(r'\n','', regex=True)

In [None]:
data

Unnamed: 0,full_text
0,@KenJee_DS looking forward to #66DaysOfData
1,Very excited to announce the #66daysofdata ini...
2,@KenJee_DS Looking fwd to #66Daysofdata
3,Let’s start #66daysofdata https://t.co/IPm1WhHaHB
4,@KenJee_DS count me in #66daysofdata
...,...
14741,R2: #66daysofdata with @KenJee_DS Day 27: Had ...
14742,Days 16-18(R2) of #66daysofdata by @KenJee_DSF...
14743,Day 40 of #66DaysOfData r2:read the first sec...
14744,Day 4 of #100DaysOfCode ; #66daysofdata ✔️D...
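Note that replacing `\n` with an empty string fuses the words on either side of a line break (see `@KenJee_DSF...` and `r2:read` in the output above). A minimal sketch of an alternative, assuming a single space is an acceptable substitute for a line break; the helper name `flatten_newlines` is hypothetical and not part of this notebook:

```python
import re

def flatten_newlines(text):
    """Collapse each run of line breaks (plus surrounding whitespace) into one space."""
    return re.sub(r"\s*\n+\s*", " ", text).strip()

# Example text modelled on one of the tweets shown above
sample = "Day 40 of #66DaysOfData r2:\n\nread the first section"
flattened = flatten_newlines(sample)
print(flattened)
```

On the DataFrame, the same idea could be applied with `data.full_text.str.replace(r"\s*\n+\s*", " ", regex=True)`. The model in this notebook was trained on the fused text, so the samples further down reflect that.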


In [None]:
# save file
data.to_csv('finalData.csv', index=False)

In [None]:
final_data = pd.read_csv('/content/finalData.csv')

In [None]:
final_data

Unnamed: 0,full_text
0,@KenJee_DS looking forward to #66DaysOfData
1,Very excited to announce the #66daysofdata ini...
2,@KenJee_DS Looking fwd to #66Daysofdata
3,Let’s start #66daysofdata https://t.co/IPm1WhHaHB
4,@KenJee_DS count me in #66daysofdata
...,...
14743,R2: #66daysofdata with @KenJee_DS Day 27: Had ...
14744,Days 16-18(R2) of #66daysofdata by @KenJee_DSF...
14745,Day 40 of #66DaysOfData r2:read the first sec...
14746,Day 4 of #100DaysOfCode ; #66daysofdata ✔️D...


In [None]:
gpt2.encode_csv('/content/drive/MyDrive/66Days-Generator/finalData.csv')
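As far as I understand it, `gpt2.encode_csv` writes a text file in which every row is wrapped in start and end tokens, so the model can learn where one tweet ends and the next begins; the `<|startoftext|>` and `<|endoftext|>` markers show up again in the generated samples further down. The sketch below illustrates the idea with the standard library only. It is not the library's implementation, and the function name `wrap_rows` and the file names are hypothetical.

```python
import csv
import os
import tempfile

START, END = "<|startoftext|>", "<|endoftext|>"

def wrap_rows(csv_path, out_path, header=True):
    """Write the first column of each CSV row to out_path, wrapped in START/END tokens."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        reader = csv.reader(src)
        if header:
            next(reader, None)  # skip the column name
        for row in reader:
            if row:
                dst.write(START + row[0] + END + "\n")

# Tiny demonstration with two made-up tweets
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "tweets.csv")
out_path = os.path.join(tmp_dir, "encoded.txt")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(
        [["full_text"], ["Day 1 of #66daysofdata"], ["Day 2, still going"]]
    )

wrap_rows(csv_path, out_path)
with open(out_path, encoding="utf-8") as f:
    encoded_lines = f.read().splitlines()
print(encoded_lines)
```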

## Train GPT-2
I used the 124M base model. There are larger models, but they use more disk space and are better suited to longer texts, which is not my use case.
- Other models: 355M, 774M and 1558M

In [None]:
# download the pretrained model
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 242Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 4.91Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 507Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:11, 42.2Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 334Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 7.97Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 6.05Mit/s]


The following code fine-tunes the GPT-2 model on the encoded tweets.

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset='/content/drive/MyDrive/66Days-Generator/csv_encoded.txt',
              model_name='124M',
              steps=2000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=500,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:04<00:00,  4.61s/it]


dataset has 979860 tokens
Training...
[10 | 29.27] loss=2.89 avg=2.89
[20 | 52.25] loss=2.78 avg=2.83
[30 | 75.78] loss=2.62 avg=2.76
[40 | 98.88] loss=2.63 avg=2.73
[50 | 121.79] loss=2.56 avg=2.69
[60 | 144.93] loss=2.63 avg=2.68
[70 | 168.11] loss=2.44 avg=2.65
[80 | 191.19] loss=2.51 avg=2.63
[90 | 214.29] loss=2.37 avg=2.60
[100 | 237.45] loss=2.44 avg=2.58
[110 | 260.60] loss=2.47 avg=2.57
[120 | 283.76] loss=2.33 avg=2.55
[130 | 306.92] loss=2.41 avg=2.54
[140 | 330.08] loss=2.44 avg=2.53
[150 | 353.21] loss=2.28 avg=2.51
[160 | 376.35] loss=2.42 avg=2.51
[170 | 399.48] loss=2.42 avg=2.50
[180 | 422.61] loss=2.35 avg=2.49
[190 | 445.73] loss=2.30 avg=2.48
[200 | 468.84] loss=2.43 avg=2.48
[210 | 491.96] loss=2.39 avg=2.47
[220 | 515.08] loss=2.19 avg=2.46
[230 | 538.22] loss=2.21 avg=2.45
[240 | 561.36] loss=2.37 avg=2.44
[250 | 584.49] loss=2.20 avg=2.43
[260 | 607.64] loss=2.17 avg=2.42
[270 | 630.74] loss=2.46 avg=2.42
[280 | 653.83] loss=2.08 avg=2.41
[290 | 676.92] loss=2.1
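Each complete log line has the form `[step | elapsed seconds] loss=... avg=...`. A small sketch for turning such lines into numbers, for example to plot the loss curve afterwards; the helper `parse_log_line` is hypothetical, and the two sample lines are copied from the output above.

```python
import re

LOG_PATTERN = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log_line(line):
    """Return (step, seconds, loss, avg) for a training log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    step, seconds, loss, avg = match.groups()
    return int(step), float(seconds), float(loss), float(avg)

log_lines = [
    "[10 | 29.27] loss=2.89 avg=2.89",
    "[20 | 52.25] loss=2.78 avg=2.83",
    "Training...",  # non-matching lines are skipped
]
records = [r for r in map(parse_log_line, log_lines) if r is not None]
print(records)
```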

In [None]:
# Zip run in checkpoint folder
!zip -r /content/run1.zip /content/checkpoint/run1

  adding: content/checkpoint/run1/ (stored 0%)
  adding: content/checkpoint/run1/model-2000.data-00000-of-00001 (deflated 7%)
  adding: content/checkpoint/run1/model-2000.meta (deflated 91%)
  adding: content/checkpoint/run1/counter (stored 0%)
  adding: content/checkpoint/run1/model-2000.index (deflated 62%)
  adding: content/checkpoint/run1/events.out.tfevents.1619339169.0cf9f22a178c (deflated 61%)
  adding: content/checkpoint/run1/encoder.json (deflated 67%)
  adding: content/checkpoint/run1/checkpoint (deflated 40%)
  adding: content/checkpoint/run1/vocab.bpe (deflated 53%)
  adding: content/checkpoint/run1/hparams.json (deflated 28%)


In [None]:
!unzip /content/run1.zip

Archive:  /content/run1.zip
   creating: content/checkpoint/run1/
  inflating: content/checkpoint/run1/model-2000.data-00000-of-00001  
  inflating: content/checkpoint/run1/model-2000.meta  
 extracting: content/checkpoint/run1/counter  
  inflating: content/checkpoint/run1/model-2000.index  
  inflating: content/checkpoint/run1/events.out.tfevents.1619339169.0cf9f22a178c  
  inflating: content/checkpoint/run1/encoder.json  
  inflating: content/checkpoint/run1/checkpoint  
  inflating: content/checkpoint/run1/vocab.bpe  
  inflating: content/checkpoint/run1/hparams.json  


In [None]:
gpt2.generate(sess, run_name='run1')

Ken Jee’s next step might be to start a data science book.  Not gonna lie. It's gonna take time.  But, ya, ya! :) #66daysofdata<|endoftext|>
<|startoftext|>Day 25 of #66daysofdata:I've completed the Introduction to the Tidyverse course on Dataquest. It's one of my favorite parts of the course, it's good people I know and use Text Analysis and Data Wrangling for data science.<|endoftext|>
<|startoftext|>Day 3 of #66daysofdata :I've completed the Pandas course on Kaggle! https://t.co/Zq2ubw2kYM<|endoftext|>
<|startoftext|>Day 25: I had watched: Tutorial for Understanding and Visualizing Machine Learning in Python: https://t.co/yVZ9qJit2F #66daysofdata #datascience https://t.co/Zj7m9Z5zPs<|endoftext|>
<|startoftext|>Day 25 of #66daysofdata:I finished the kaggle python course. I’m also going to start a data science one :)<|endoftext|>
<|startoftext|>Day 25 of #66daysofdataSQL Revisit:SQL statements are not required when working with large strings. (char, aggregate, timestamp)• working with
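The raw sample contains several tweets separated by the `<|startoftext|>` and `<|endoftext|>` tokens, and the last one is usually cut off by the length limit. A minimal sketch for splitting such output into individual, complete tweets; `split_samples` is a hypothetical helper, not a gpt-2-simple function, and the sample strings are shortened and loosely based on the output above.

```python
START, END = "<|startoftext|>", "<|endoftext|>"

def split_samples(raw_text):
    """Split raw generated text into complete tweets, dropping a trailing fragment without END."""
    chunks = raw_text.split(END)
    complete = chunks[:-1]  # whatever follows the last END token is unfinished
    tweets = [chunk.replace(START, "").strip() for chunk in complete]
    return [t for t in tweets if t]

raw = (
    "Ken Jee's next step might be a book. #66daysofdata<|endoftext|>\n"
    "<|startoftext|>Day 3 of #66daysofdata :I've completed the Pandas course<|endoftext|>\n"
    "<|startoftext|>Day 25 of #66daysofdataSQL Revisit: working with"
)
tweets = split_samples(raw)
print(tweets)
```

gpt-2-simple's `generate` also accepts arguments such as `prefix`, `truncate` and `return_as_list`, which can handle part of this at generation time; the library README describes them.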

## Test generated tweets for similarity

In [None]:
import pandas as pd
import numpy as np

In [None]:
tweet_data = pd.read_csv('/content/drive/MyDrive/66Days-Generator/finalData.csv')

In [None]:
tweet_data

Unnamed: 0,full_text
0,@KenJee_DS looking forward to #66DaysOfData
1,Very excited to announce the #66daysofdata ini...
2,@KenJee_DS Looking fwd to #66Daysofdata
3,Let’s start #66daysofdata https://t.co/IPm1WhHaHB
4,@KenJee_DS count me in #66daysofdata
...,...
14743,R2: #66daysofdata with @KenJee_DS Day 27: Had ...
14744,Days 16-18(R2) of #66daysofdata by @KenJee_DSF...
14745,Day 40 of #66DaysOfData r2:read the first sec...
14746,Day 4 of #100DaysOfCode ; #66daysofdata ✔️D...


### Check similarity the naive way

- simply check if a generated tweet is exactly the same as in the data

In [None]:
str_ = """Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them."""

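# regex=False treats the generated tweet as a literal string, not as a regular expression
if tweet_data['full_text'].str.contains(str_, regex=False).any():
  print('yes')
else:
  print('no')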

no
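An exact substring match misses near-copies that differ by a word or a few characters. One possible refinement, sketched here under the assumption that character-level overlap is a reasonable proxy for copying, is `difflib.SequenceMatcher` from the standard library. The helper `closest_match` and the three example tweets are hypothetical.

```python
from difflib import SequenceMatcher

def closest_match(candidate, corpus):
    """Return (best_ratio, best_text): the corpus entry most similar to candidate."""
    best_ratio, best_text = 0.0, None
    for text in corpus:
        ratio = SequenceMatcher(None, candidate.lower(), text.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_text = ratio, text
    return best_ratio, best_text

# Made-up stand-ins for the real tweets
corpus = [
    "Day 3 of #66daysofdata: brushed up on statistics.",
    "Let's start #66daysofdata",
    "Count me in #66daysofdata",
]
generated = "Day 3 of #66daysofdata: brushed up on some statistics."
ratio, match = closest_match(generated, corpus)
print(round(ratio, 2), match)
```

A ratio close to 1 suggests the generated tweet is largely memorized. Comparing one candidate against all real tweets this way is slow but feasible for a dataset of this size.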


## Use spaCy to describe similarity

In [None]:
# get the nlp model
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.1MB/s 
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp37-none-any.whl size=98051305 sha256=960b2e6d2f20da99df0596325e3174e228160257d9d1e7bc80f07c64bb563a7c
  Stored in directory: /tmp/pip-ephem-wheel-cache-4tysq1l8/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
import spacy
# load the language model
nlp = spacy.load("en_core_web_md")

In [None]:
# check similarity
doc1 = nlp("""Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them.""")
doc2 = nlp("""How do I turn tv on/of?""")
doc1.similarity(doc2)

0.8649768063352461
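A score of 0.86 for two unrelated sentences is less surprising once you recall that, by default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' averaged word vectors. Averaging discards word order and lets frequent words dominate. The sketch below reproduces the computation in plain Python on made-up 3-dimensional vectors; the vectors and names are purely illustrative, and the real `en_core_web_md` vectors have 300 dimensions.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vectors: two shared function words plus one distinct content word each
the, i = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]
statistics, tv = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

doc_a = mean_vector([the, i, statistics])
doc_b = mean_vector([the, i, tv])

content_only = cosine(statistics, tv)        # the content words alone are orthogonal
with_function_words = cosine(doc_a, doc_b)   # shared words pull the averages together
print(content_only, with_function_words)
```

Even though the two content words have nothing in common in this toy setup, the shared function words push the document-level score well above zero.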

In [None]:
# loop over the real tweets
# returns a list with the similarity to each tweet
similarity_scores = []
doc1 = nlp("""Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them.""")

for x in range(0,1000):
  score_ = doc1.similarity(nlp(tweet_data.full_text[x]))    
  similarity_scores.append(score_)

In [None]:
# Average similarity over the first 1000 real tweets
# higher means the generated tweet is, on average, closer to the real tweets
sum(similarity_scores) / len(similarity_scores)

0.8709303380014755

In [None]:
max(similarity_scores)

0.9673041213495218

In [None]:
# https://stackoverflow.com/questions/2474015/getting-the-index-of-the-returned-max-or-min-item-using-max-min-on-a-list
max(range(len(similarity_scores)), key=similarity_scores.__getitem__)

646

In [None]:
min(range(len(similarity_scores)), key=similarity_scores.__getitem__)

789

In [None]:
similarity_scores[646]

0.9673041213495218

In [None]:
tweet_data.full_text[646]

"Day 6 of #66daysofdataToday I performed Exploratory Data Analysis on Telco's Churn dataset. I really like that domain. I have seen other's kaggle kernels, and I realized that I have to learn doing some advanced visualizations. I also make note of the packages used. @KenJee_DS"

In [None]:
tweet_data.full_text[789]

'#Dia5 de #66daysofdata por @KenJee_DSMiren esta belleza!!Primer capitulo realizado ✅ Mi primera aplicacion entendida e implementada en #pythonprogramming usando #JupyterNotebooks 👩🏻\u200d💻🥳 https://t.co/WIBcZ45Isa'

In [None]:
min(similarity_scores)

0.09145792753401237

In [None]:
similarity_scores[789]

0.09145792753401237

Let's remove stop words, since they tend to inflate the similarity score.

https://stackoverflow.com/questions/52113939/spacy-strange-similarity-between-two-sentences

In [None]:
# https://betterprogramming.pub/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
def remove_stopwords_fast(text):
    doc = nlp(text.lower())
    result = [token.text for token in doc if token.text not in nlp.Defaults.stop_words]
    return " ".join(result)

In [None]:
text = """Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them."""
remove_stopwords_fast(text)

'day 3 # 66daysofdata . time brush statistics.i learnt distributions different means eda . interacted datasets played .'

In [None]:
similarity_scores_stop = []
doc1 = nlp(remove_stopwords_fast(text))

for x in range(0,1000):
  score_ = doc1.similarity(nlp(remove_stopwords_fast(tweet_data.full_text[x])))    
  similarity_scores_stop.append(score_)

In [None]:
# removing stop words reduced the average similarity by about 10 percentage points
sum(similarity_scores_stop) / len(similarity_scores_stop)

0.7756576540945388

In [None]:
max(similarity_scores_stop)

0.9211037812250895

In [None]:
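# note: this looks up the index of the maximum in the original scores (similarity_scores),
# not in similarity_scores_stop
max(range(len(similarity_scores)), key=similarity_scores.__getitem__)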

646

In [None]:
remove_stopwords_fast(tweet_data.full_text[646])

'day 6 # 66daysofdatatoday performed exploratory data analysis telco churn dataset . like domain . seen kaggle kernels , realized learn advanced visualizations . note packages . @kenjee_ds'

### Limitations of spaCy similarity scores
> Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose (https://spacy.io/usage/linguistic-features#vectors-similarity)

Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/4. 

Facebook's Infersent Encoder: https://github.com/facebookresearch/InferSent

## Google's Universal Sentence Encoder
>  encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

- trained and optimized for greater-than-word-length text
- INPUT: English text of variable length
- OUTPUT: a 512-dimensional vector

- no preprocessing (such as stop word removal) is necessary

In [2]:
# load the model
from absl import logging

import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [3]:
tweet_data = pd.read_csv('/content/drive/MyDrive/66Days-Generator/finalData.csv')

In [8]:
# choose tweet 646 since it had the highest similarity score using spaCy
real_tweet = tweet_data.full_text[646]
# generated tweet
generated_tweet = """Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them."""
tweets = [real_tweet, generated_tweet]

In [9]:
# Reduce logging output
logging.set_verbosity(logging.ERROR)

# compute the embeddings
message_embeddings = embed(tweets)

In [17]:
message_embeddings

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[-0.01785527, -0.06353536,  0.01804276, ...,  0.08779907,
        -0.05331371, -0.06075975],
       [-0.03645666, -0.04723918, -0.03063959, ...,  0.09627496,
        -0.03469933, -0.06400408]], dtype=float32)>

In [12]:
for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
  print("Message: {}".format(tweets[i]))
  print("Embedding size: {}".format(len(message_embedding)))
  message_embedding_snippet = ", ".join(
      (str(x) for x in message_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

Message: Day 6 of #66daysofdataToday I performed Exploratory Data Analysis on Telco's Churn dataset. I really like that domain. I have seen other's kaggle kernels, and I realized that I have to learn doing some advanced visualizations. I also make note of the packages used. @KenJee_DS
Embedding size: 512
Embedding: [-0.017855273559689522, -0.06353535503149033, 0.01804276369512081, ...]

Message: Day 3 of #66daysofdata. I used the time to brush up on some statistics.I learnt about the distributions and different means of doing the EDA. I also interacted with some datasets and played around with them.
Embedding size: 512
Embedding: [-0.03645665571093559, -0.04723918065428734, -0.030639585107564926, ...]



In [18]:
# The semantic similarity of two sentences can be trivially computed as the inner product of the encodings
np.inner(message_embeddings[0], message_embeddings[1])

0.5870047
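The inner product equals the cosine similarity only when both vectors have unit length. As far as I know, Universal Sentence Encoder embeddings are approximately normalized, which is why the plain inner product is commonly used. A hedged sketch that normalizes explicitly and builds the full pairwise similarity matrix, using plain Python lists and made-up 2-dimensional embeddings in place of the 512-dimensional ones:

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_matrix(embeddings):
    """Pairwise cosine similarities, computed as inner products of unit-length vectors."""
    unit = [normalize(e) for e in embeddings]
    return [[sum(x * y for x, y in zip(a, b)) for b in unit] for a in unit]

# Made-up embeddings standing in for the model output
embeddings = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
matrix = similarity_matrix(embeddings)
for row in matrix:
    print([round(value, 2) for value in row])
```

With the tensors from this notebook, `np.inner(message_embeddings, message_embeddings)` would give the corresponding 2x2 matrix without the explicit normalization step.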

Compared to the spaCy score for the same pair of tweets (0.97), this is quite low, which illustrates the stated limitations of spaCy's similarity approach.