<a href="https://colab.research.google.com/github/Chiosas/Trump_Tweets/blob/master/Trump_Tweets_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ***Initial Setup***

### ***Environment Setup***

In [0]:
import os
from pathlib import Path

In [0]:
dir_name = 'Trump_Tweets'
DATA_DIR = Path(f'data/{dir_name}')
MODEL_DIR = Path(f'model/{dir_name}')

In [0]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

FIRST_RUN = not os.path.exists(str(MODEL_DIR))

In [0]:
if not IN_COLAB:
    os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [0]:
if FIRST_RUN:
    os.makedirs(MODEL_DIR, exist_ok=True)
    os.makedirs(DATA_DIR, exist_ok=True)

if IN_COLAB and FIRST_RUN:
    !pip install -q --upgrade scikit-optimize
    !pip install -q -U --pre efficientnet
    # !pip install -q -U tensorflow-datasets
    !pip install -q -U --no-deps tensorflow-addons~=0.6
    !pip install -q -U tensorflow_hub
    !pip install -q -U git+https://github.com/huggingface/transformers    

### ***Kaggle Setup***

In [0]:
def setup_kaggle():
    x = !ls kaggle.json
    assert x == ['kaggle.json'], 'Upload kaggle.json'
    !mkdir -p /root/.kaggle
    !mv kaggle.json /root/.kaggle
    !chmod 600 /root/.kaggle/kaggle.json

In [0]:
# Make sure you've uploaded 'kaggle.json' file into Colab
if IN_COLAB and FIRST_RUN:
    setup_kaggle()

In [0]:
import kaggle

In [0]:
if IN_COLAB and FIRST_RUN:
    kaggle.api.authenticate()
    kaggle.api.dataset_download_files(
        dataset='austinvernsonger/donaldtrumptweets',
        path=DATA_DIR,
        unzip=True,
    )

### ***Importing Dependencies***

In [10]:
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

TensorFlow 2.x selected.


In [11]:
from imports import *

import tensorflow_hub as hub



In [12]:
tf.__version__

'2.0.0'

In [0]:
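if FIRST_RUN:
    # Presumably stops the kernel after the first-run installs so that the
    # upgraded packages are loaded on the next run.
    exit()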

In [0]:
%matplotlib inline

## ***Data Description***

### ***Content***

![Trump](https://cdn.cnn.com/cnnnext/dam/assets/180925135532-gfx-twitter-donald-trump-tweet-exlarge-169.jpg)

This dataset is a collection of more than 30,000 Donald Trump tweets, dating from 2009 to 2016.

The columns of the data file are:

* Text: the full message posted on Twitter,
* Date: the date and time the message was posted,
* Favorites: the number of times other users marked the message as a favorite,
* Retweets: the number of times other users re-posted or shared the message,
* Tweet ID: the ID of the Twitter message.


We will use this dataset to build a word-level text generator, comparing LSTM networks trained from scratch with models built on pretrained TF Hub text embeddings.

### ***Data Exploration***

In [0]:
# Reading the data from the file
raw_data = pd.read_csv(DATA_DIR/'data.csv', low_memory=False)

In [16]:
print('Number of rows in the dataset:', raw_data.shape[0])
print('Number of columns in the dataset:', raw_data.shape[1])

Number of rows in the dataset: 31175
Number of columns in the dataset: 5


In [17]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31175 entries, 0 to 31174
Data columns (total 5 columns):
Text         31174 non-null object
Date         31175 non-null object
Favorites    31175 non-null int64
Retweets     31175 non-null int64
Tweet ID     31175 non-null int64
dtypes: int64(3), object(2)
memory usage: 1.2+ MB


In [18]:
raw_data.describe(include='all')

| | Text | Date | Favorites | Retweets | Tweet ID |
|---|---|---|---|---|---|
| count | 31174 | 31175 | 31175.0 | 31175.0 | 31175.0 |
| unique | 31057 | 31174 | | | |
| top | MAKE AMERICA GREAT AGAIN! | 2016-01-14 05:45:41 | | | |
| freq | 11 | 2 | | | |
| mean | | | 3167.715926 | 1255.764555 | 4.654273e+17 |
| std | | | 11655.175669 | 4638.563418 | 1.789587e+17 |
| min | | | 0.0 | 0.0 | 1698309000.0 |
| 25% | | | 18.0 | 14.0 | 3.185713e+17 |
| 50% | | | 54.0 | 63.0 | 4.73849e+17 |
| 75% | | | 643.0 | 613.0 | 6.108192e+17 |


In [19]:
raw_data.head()

| | Text | Date | Favorites | Retweets | Tweet ID |
|---|---|---|---|---|---|
| 0 | I have not heard any of the pundits or comment... | 2016-12-21 13:29:38 | 14755 | 4055 | 811564284706689024 |
| 1 | I would have done even better in the election,... | 2016-12-21 13:24:29 | 11129 | 2789 | 811562990285848576 |
| 2 | Campaigning to win the Electoral College is mu... | 2016-12-21 13:15:14 | 14906 | 3925 | 811560662853939200 |
| 3 | Yes, it is true - Carlos Slim, the great busin... | 2016-12-20 20:27:57 | 51424 | 12578 | 811307169043849216 |
| 4 | especially how to get people, even with an unl... | 2016-12-20 13:09:18 | 35699 | 8008 | 811196778779463684 |


In [0]:
raw_data.dropna(inplace=True)

In [21]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31174 entries, 0 to 31174
Data columns (total 5 columns):
Text         31174 non-null object
Date         31174 non-null object
Favorites    31174 non-null int64
Retweets     31174 non-null int64
Tweet ID     31174 non-null int64
dtypes: int64(3), object(2)
memory usage: 1.4+ MB


In [0]:
text_col = 'Text'

## ***Data Preparation***

### ***Data Preprocessing - Vocabulary***

In [23]:
text_corpus = " \n<TWEETEND>\n ".join(raw_data[text_col].values)
print(text_corpus[:1000])

I have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss! 
<TWEETEND>
 I would have done even better in the election, if that is possible, if the winner was based on popular vote - but would campaign differently 
<TWEETEND>
 Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! 
<TWEETEND>
 Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY! 
<TWEETEND>
 especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states 
<TWEETEND>
 Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He "doesn't know much" ... 
<TWEETEND>
 "@mike_pence: Congratulations to @RealDonaldTrump; officially elected President o

In [24]:
all_chars = "".join(sorted(set(text_corpus)))
print(f'Length of all characters in text corpus - {len(all_chars)}\nCharacters:', all_chars)

Length of all characters in text corpus - 137
Characters: 
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{|}~ £«®´º»Éèéíïñıĺ​‎‏–—―‘’“”•…′€™●★☆☉☞♡《ＲＴ􏰀


In [0]:
text_corpus = re.sub(r'[‘’`´′]', "'", text_corpus)
text_corpus = re.sub(r'[“”«»]', '"', text_corpus)

In [0]:
text_corpus = re.sub(
    r"[^ \n<=>()*+-_,.'\":;?!…/0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#$£€%&@]",
    "",
    text_corpus,
)
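One detail of the character class in the cell above is easy to miss: inside `[...]`, the sequence `+-_` is read as a range from `+` (U+002B) to `_` (U+005F), not as three literal characters. That range also covers `[`, `\`, `]` and `^`, which would explain why the brackets and backslash still show up in the character listing printed by the next cell. The following is a minimal sketch of the effect on a made-up string; the shortened patterns and the assumed intent (three literal characters) are illustrative and not taken from the notebook.

```python
import re

# Hypothetical sample chosen to show the effect; not taken from the tweets.
sample = "A[b]c\\d^e+f-g_h"

# As in the notebook's pattern: '+-_' is parsed as the range U+002B..U+005F.
range_class = re.compile(r"[^a-zA-Z+-_]")
# Assumed intent: '+', '-' and '_' as three literal characters.
literal_class = re.compile(r"[^a-zA-Z+\-_]")

kept_by_range = range_class.sub("", sample)
kept_by_literal = literal_class.sub("", sample)

print(kept_by_range)    # brackets, backslash and caret survive
print(kept_by_literal)  # only letters, '+', '-' and '_' survive
```

Escaping the hyphen (or moving it to the end of the class) is the usual way to make it literal; doing so here would change the character counts reported below.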

In [27]:
all_chars = "".join(sorted(set(text_corpus)))
print(f'Length of all characters in text corpus - {len(all_chars)}\nCharacters:', all_chars)

Length of all characters in text corpus - 93
Characters: 
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_abcdefghijklmnopqrstuvwxyz£…€


In [0]:
def preprocess_text(text):
    return keras.preprocessing.text.text_to_word_sequence(text)

In [29]:
text_words = preprocess_text(text_corpus)
print(f'We have total {len(text_words)} words and {len(text_corpus)} characters in tweets.')

We have total 581149 words and 3927715 characters in tweets.


In [0]:
def make_vocabulary(word_list):
    # Counter maps each word to its number of occurrences
    return collections.Counter(word_list)

In [0]:
vocabulary = make_vocabulary(text_words)

In [32]:
vocabulary.most_common(11)

[('tweetend', 31173),
 ('the', 17631),
 ('to', 11121),
 ('a', 9416),
 ('realdonaldtrump', 8719),
 ('is', 7916),
 ('you', 7895),
 ('and', 7318),
 ('in', 7077),
 ('of', 6604),
 ('i', 6511)]

In [33]:
len(vocabulary)

42074

In [0]:
word_to_id = {word: index for index, word in enumerate(vocabulary)}

In [35]:
for word in preprocess_text('Make America Great Again!'):
    print(word_to_id.get(word, 'Not available'))

267
268
56
269


In [0]:
# Limiting the vocabulary to words detected 5 or more times in text corpus
THRESHOLD = 5

vocabulary = [word for word, count in vocabulary.most_common() if count >= THRESHOLD]
vocabulary_size = len(vocabulary)
n_oov_buckets = vocabulary_size // 10

words = tf.constant(vocabulary)
word_ids = tf.range(len(vocabulary), dtype=tf.int64)

vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
table = tf.lookup.StaticVocabularyTable(vocab_init, n_oov_buckets)

In [37]:
vocabulary_size

6701
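The thresholding and out-of-vocabulary (OOV) handling above can be illustrated without TensorFlow. The sketch below is a plain-Python approximation, not the behaviour of `tf.lookup.StaticVocabularyTable` itself: the toy corpus, the names `toy_lookup` and `TOY_THRESHOLD`, the `max(1, ...)` guard and the use of `zlib.crc32` as the bucket hash are all assumptions made for illustration (TensorFlow uses its own fingerprint function). What it does share with the real table is the id layout: known words get ids below the vocabulary size, and unknown words are hashed into ids at or above it.

```python
import collections
import zlib

# Toy corpus (hypothetical); the notebook uses the tweet word list instead.
toy_words = "the wall the wall the deal a deal covfefe".split()
TOY_THRESHOLD = 2  # the notebook uses THRESHOLD = 5

counts = collections.Counter(toy_words)
# most_common() sorts by descending count, so frequent words get small ids.
toy_vocabulary = [w for w, c in counts.most_common() if c >= TOY_THRESHOLD]
toy_word_to_id = {w: i for i, w in enumerate(toy_vocabulary)}
# The notebook uses vocabulary_size // 10; max(1, ...) keeps the toy case valid.
toy_oov_buckets = max(1, len(toy_vocabulary) // 10)

def toy_lookup(word):
    """Return the word's id, or an OOV id in [vocab_size, vocab_size + buckets)."""
    if word in toy_word_to_id:
        return toy_word_to_id[word]
    # crc32 is a stand-in hash, used here only to pick a bucket deterministically.
    bucket = zlib.crc32(word.encode("utf-8")) % toy_oov_buckets
    return len(toy_vocabulary) + bucket

print(toy_vocabulary)
print([toy_lookup(w) for w in ["the", "deal", "covfefe"]])
```

Because OOV ids always start at the vocabulary size, a simple comparison against that size is enough to tell known words from unknown ones.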

In [0]:
def encode_text(text):
    return table.lookup(tf.constant(preprocess_text(text)))

In [39]:
encode_text('Make America Great Again!')

<tf.Tensor: id=19, shape=(4,), dtype=int64, numpy=array([70, 65, 17, 89])>

### ***Data Preprocessing - Reducing Data***

In [0]:
# Keeping only the tweets whose words are all present in the vocabulary
reduced_data = raw_data.copy()

In [0]:
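def reduce_df(tweet_text):
    # True when every word id is below vocabulary_size,
    # i.e. no word of the tweet fell into an OOV bucket
    keep_tweet = max(encode_text(tweet_text).numpy()) < vocabulary_size
    return keep_tweet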

In [0]:
reduced_data['Leave'] = reduced_data['Text'].apply(reduce_df)

In [0]:
reduced_data = reduced_data[reduced_data['Leave']]
reduced_data = reduced_data.drop(['Leave'], axis=1)

In [44]:
reduced_data.head()

| | Text | Date | Favorites | Retweets | Tweet ID |
|---|---|---|---|---|---|
| 2 | Campaigning to win the Electoral College is mu... | 2016-12-21 13:15:14 | 14906 | 3925 | 811560662853939200 |
| 5 | Bill Clinton stated that I called him after th... | 2016-12-20 13:03:59 | 67369 | 16962 | 811195441710764032 |
| 6 | "@mike_pence: Congratulations to @RealDonaldTr..." | 2016-12-20 02:50:25 | 66605 | 14547 | 811041034323054592 |
| 7 | "@Franklin_Graham: Congratulations to Presiden..." | 2016-12-20 02:46:01 | 44713 | 9659 | 811039925571354624 |
| 11 | We should tell China that we don't want the dr... | 2016-12-18 00:59:25 | 62769 | 17611 | 810288321880555520 |


In [45]:
print('Number of rows in the dataset:', reduced_data.shape[0])
print('Number of columns in the dataset:', reduced_data.shape[1])

Number of rows in the dataset: 5490
Number of columns in the dataset: 5
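The filtering rule used above (keep a tweet only if none of its word ids reaches the vocabulary size) can be sketched in plain Python. The helper name `keep_tweet` and the encoded sample tweets are hypothetical. One difference from the notebook's `max(...)` check is that `all(...)` also accepts an empty tweet, whereas the built-in `max` raises on an empty sequence.

```python
def keep_tweet(word_ids, vocab_size):
    """True when every id is in-vocabulary (OOV ids are >= vocab_size)."""
    return all(i < vocab_size for i in word_ids)

# Hypothetical encoded tweets for a vocabulary of size 5; ids >= 5 mark OOV words.
encoded_tweets = [[0, 3, 1], [2, 5, 4], [4, 4], []]
kept = [ids for ids in encoded_tweets if keep_tweet(ids, vocab_size=5)]
print(kept)

# Share of tweets that survive, using the row counts printed in the notebook.
kept_share = f"{5490 / 31174:.1%}"
print(kept_share)
```

The filter is strict: a single rare word removes the whole tweet, which is why fewer than one in five tweets remain.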


In [46]:
text_corpus = " \n<TWEETEND>\n ".join(reduced_data[text_col].values)
print(text_corpus[:1000])

Campaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! 
<TWEETEND>
 Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He "doesn't know much" ... 
<TWEETEND>
 "@mike_pence: Congratulations to @RealDonaldTrump; officially elected President of the United States today by the Electoral College!" 
<TWEETEND>
 "@Franklin_Graham: Congratulations to President-elect @realDonaldTrump--the electoral votes are in and it's official." Thank you Franklin! 
<TWEETEND>
 We should tell China that we don't want the drone they stole back.- let them keep it! 
<TWEETEND>
 Mobile, Alabama today at 3:00 P.M. Last rally of the year - "THANK YOU ALABAMA AND THE SOUTH" Biggest of all crowds expected, see you there! 
<TWEETEND>
 Last night in Orlando, Florida, was incredible - massive crowd - THANK YOU FLORIDA! Today at 3:00 P.M. I will be in Alabama for last rally! 


In [0]:
text_corpus = re.sub(r'[‘’`´′]', "'", text_corpus)
text_corpus = re.sub(r'[“”«»]', '"', text_corpus)

In [0]:
text_corpus = re.sub(
    r"[^ \n<=>()*+-_,.'\":;?!…/0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#$£€%&@]",
    "",
    text_corpus,
)

In [49]:
all_chars = "".join(sorted(set(text_corpus)))
print(f'Total of all different characters in text corpus - {len(all_chars)}\nCharacters:', all_chars)

Total of all different characters in text corpus - 87
Characters: 
 !"#$%&'()+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz…


In [50]:
text_words = preprocess_text(text_corpus)
print(f'We have {len(text_words)} words and {len(text_corpus)} characters left in reduced number of tweets.')

We have 105029 words and 648330 characters left in reduced number of tweets.


### ***Dataset Creation***

In [0]:
BATCH_SIZE = 256
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [0]:
n_steps = 64
window_length = n_steps + 1

In [0]:
def make_dataset(data, vocabulary_size, window_length, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    dataset = dataset.window(window_length, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length, drop_remainder=True))
    dataset = dataset.shuffle(math.ceil(len(data) / (window_length - 1)))
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, -1:]), num_parallel_calls=AUTOTUNE)
    dataset = dataset.map(lambda xs, ys: (tf.one_hot(xs, depth=vocabulary_size), tf.squeeze(ys)), num_parallel_calls=AUTOTUNE)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    return dataset

In [0]:
train_data = encode_text(text_corpus)
train_size = len(train_data)

In [0]:
train_dataset = make_dataset(train_data, vocabulary_size, window_length, BATCH_SIZE)
train_data_steps = math.ceil(train_size / BATCH_SIZE)
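As a quick sanity check on `steps_per_epoch`, the arithmetic below compares the number of full batches the windowed dataset can produce with the step count derived from the word count. It assumes `train_size` equals the 105029 words reported for the reduced corpus; the variable names are illustrative.

```python
import math

# Figures taken from the notebook: corpus length in words, window and batch size.
n_words = 105029
window_length = 65
batch_size = 256

# One window starts at every position that still has window_length words left.
n_windows = n_words - window_length + 1
# drop_remainder=True discards the final partial batch.
full_batches = n_windows // batch_size
# The notebook derives the step count from the word count instead.
steps_per_epoch = math.ceil(n_words / batch_size)

print(n_windows, full_batches, steps_per_epoch)
```

Under that assumption, one pass over the data yields one batch fewer than `steps_per_epoch`; because the dataset ends with `repeat()`, the extra step simply draws from the next pass.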

In [56]:
train_dataset.element_spec

(TensorSpec(shape=(256, 64, 6701), dtype=tf.float32, name=None),
 TensorSpec(shape=(256,), dtype=tf.int64, name=None))

In [57]:
for xs, ys in train_dataset.take(1):
    print(xs.shape, ys.shape)
    print(xs[0].numpy().argmax(axis=-1))
    print(ys[0].numpy())

(256, 64, 6701) (256,)
[ 143   76    1   91  137  232  435  129  106    2  574    3  518  569
 4129    8    1 1022    9    1  493  165 4132   10   77   86   41    0
 2002  822    1  433  381    5  390   11  170    0 3636  274 2668 4133
    3   57 4840  106 4841    8    1  227   76    7  244 1089   19   18
 1154  274 2339  103  208    1   91  137]
0
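The window, split and batch pipeline in `make_dataset` boils down to sliding a fixed-length window over the id sequence and using the last id of each window as the target. Here is a minimal pure-Python sketch of that idea, with made-up ids and a short window so the result is easy to read. It leaves out shuffling, batching and one-hot encoding.

```python
def sliding_windows(ids, window_length):
    """Yield every contiguous run of window_length ids, shifting by one."""
    for start in range(len(ids) - window_length + 1):
        yield ids[start:start + window_length]

toy_ids = [10, 11, 12, 13, 14, 15]  # hypothetical encoded words
toy_window = 4                       # the notebook uses n_steps + 1 = 65

# All but the last id form the input; the last id is the word to predict.
pairs = [(w[:-1], w[-1]) for w in sliding_windows(toy_ids, toy_window)]
for inputs, target in pairs:
    print(inputs, "->", target)
```

Consecutive windows overlap in all but one position, so a corpus of N words yields N - window_length + 1 training examples.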


## ***Exploring Different Model Architectures***

In [0]:
tf.get_logger().setLevel('ERROR')

In [0]:
EPOCHS = 20
LEARN_RATE = 1e-4

### ***RNN Model - Simple LSTM***

In [0]:
def make_simple_lstm_model(
    n_categories,
    lstm_size,
    lstm_dropout,
    dropout,
    l1,
    l2,
):
    input_layer = keras.layers.Input(shape=[None, n_categories])
    lstm1_layer = keras.layers.LSTM(lstm_size, dropout=lstm_dropout, return_sequences=True)(input_layer)
    lstm2_layer = keras.layers.LSTM(lstm_size, dropout=lstm_dropout)(lstm1_layer)
    batch_norm_layer = keras.layers.BatchNormalization()(lstm2_layer)
    dropout_layer = keras.layers.Dropout(dropout)(batch_norm_layer)
    output_layer = keras.layers.Dense(
        n_categories,
        kernel_regularizer=keras.regularizers.l1_l2(l1, l2),
        activation=keras.activations.softmax,
    )(dropout_layer)

    return keras.Model(inputs=input_layer, outputs=output_layer)

In [0]:
model = make_simple_lstm_model(
    n_categories=vocabulary_size,
    lstm_size=256,
    lstm_dropout=0.0,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [0]:
model.compile(
    loss=keras.losses.sparse_categorical_crossentropy,
    optimizer=keras.optimizers.Adam(learning_rate=LEARN_RATE),
)

In [81]:
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, None, 6701)]      0         
_________________________________________________________________
lstm_4 (LSTM)                (None, None, 256)         7124992   
_________________________________________________________________
lstm_5 (LSTM)                (None, 256)               525312    
_________________________________________________________________
batch_normalization_2 (Batch (None, 256)               1024      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 6701)              1722157   
Total params: 9,373,485
Trainable params: 9,372,973
Non-trainable params: 512
_______________________________________________
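The parameter counts in the summary above can be reproduced by hand, which is a useful check that the one-hot input (6701 features per step) is what dominates the first LSTM layer. The helper functions below are illustrative names, not Keras APIs; the formulas are the standard ones for an LSTM with biases, a dense layer, and batch normalization.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

def batch_norm_params(features):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * features

vocab, units = 6701, 256
layer_params = {
    "lstm_4": lstm_params(vocab, units),
    "lstm_5": lstm_params(units, units),
    "batch_normalization_2": batch_norm_params(units),
    "dense_2": dense_params(units, vocab),
}
total = sum(layer_params.values())
non_trainable = 2 * units  # moving mean and variance of the batch norm layer

print(layer_params)
print(total, non_trainable)
```

Roughly three quarters of the weights sit in the first LSTM layer, a direct consequence of feeding it one-hot vectors rather than a lower-dimensional embedding.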

In [82]:
history = model.fit(
    x=train_dataset,
    steps_per_epoch=train_data_steps,
    epochs=EPOCHS,
    callbacks=[
        keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.3, patience=5),
        keras.callbacks.EarlyStopping(monitor="loss", patience=13, restore_best_weights=True),
        keras.callbacks.TerminateOnNaN(),
    ],
)

Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [0]:
def preprocess(text):
    input_text = encode_text(text)
    input_text = tf.one_hot(input_text, vocabulary_size)
    input_text = tf.expand_dims(input_text, axis=0)
    return input_text

In [0]:
def predict_next_word(model, text, temperature=1):
    prediction_input = preprocess(text)
    prediction_probs = model.predict(prediction_input, steps=1)
    rescaled_logits = tf.math.log(prediction_probs) / temperature
    word_index = tf.random.categorical(rescaled_logits, num_samples=1)
    word = vocabulary[int(tf.squeeze(word_index))]
    return word

In [0]:
def generate_text(model, text, n_words=10, temperature=1):
    for _ in range(n_words):
        text += f' {predict_next_word(model, text, temperature)}'
    return text

In [0]:
def generate_samples(model, seed_text, n_words, temperatures):
    for temperature in temperatures:
        print(f'Temperature is set at {temperature:.0%}')
        print(generate_text(model, seed_text, n_words=n_words, temperature=temperature))

In [87]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! tweetend tweetend i i i be realdonaldtrump president president i i miss miss miss pageant pageant i am miss miss
Temperature is set at 50%
Make America great again! tweetend tweetend i i realdonaldtrump president president i i agree i miss i miss miss miss pageant pageant i am
Temperature is set at 100%
Make America great again! tweetend tweetend i si wednesday president really seankesser support i treated i congratulations pays our wall pageant again i miss
Temperature is set at 150%
Make America great again! tweetend tweetend glennbeck choker mitt remember rumor realdonaldtrump i i didn't fold planes achievers miss apprentice 9 apprentice obama's miss
Temperature is set at 200%
Make America great again! tweetend obnoxious lyin' parade vattenfallgroup bump mrs due thedc ma prisoners i recommend pensacola ms 7 obnoxious lyin' wall miss
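The samples above show the usual effect of temperature: low values repeat the most likely words, high values drift towards rare ones. Dividing the log-probabilities by the temperature before sampling is equivalent to sampling from the distribution raised to the power 1/temperature and renormalized. The sketch below shows that rescaling on a made-up three-word distribution; `apply_temperature` is a hypothetical helper, and it assumes every probability is strictly positive.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector: softmax(log(p) / temperature)."""
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

toy_probs = [0.6, 0.3, 0.1]  # hypothetical next-word distribution
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

At a temperature of 1 the distribution is unchanged; below 1 the top word takes almost all of the probability mass, and above 1 the distribution flattens.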


### ***RNN Model - Bidirectional LSTM***

In [0]:
def make_bidir_lstm_model(
    n_categories,
    lstm_size,
    lstm_dropout,
    dropout,
    l1,
    l2,
):
    input_layer = keras.layers.Input(shape=[None, n_categories])
    lstm1_layer = keras.layers.Bidirectional(
        keras.layers.LSTM(lstm_size, dropout=lstm_dropout, return_sequences=True)
    )(input_layer)
    lstm2_layer = keras.layers.Bidirectional(
        keras.layers.LSTM(lstm_size, dropout=lstm_dropout)
    )(lstm1_layer)
    batch_norm_layer = keras.layers.BatchNormalization()(lstm2_layer)
    dropout_layer = keras.layers.Dropout(dropout)(batch_norm_layer)
    output_layer = keras.layers.Dense(
        n_categories,
        kernel_regularizer=keras.regularizers.l1_l2(l1, l2),
        activation=keras.activations.softmax,
    )(dropout_layer)

    return keras.Model(inputs=input_layer, outputs=output_layer)

In [0]:
model = make_bidir_lstm_model(
    n_categories=vocabulary_size,
    lstm_size=256,
    lstm_dropout=0.0,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [0]:
model.compile(
    loss=keras.losses.sparse_categorical_crossentropy,
    optimizer=keras.optimizers.Adam(learning_rate=LEARN_RATE),
)

In [91]:
model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, None, 6701)]      0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 512)         14249984  
_________________________________________________________________
bidirectional_3 (Bidirection (None, 512)               1574912   
_________________________________________________________________
batch_normalization_3 (Batch (None, 512)               2048      
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 6701)              3437613   
Total params: 19,264,557
Trainable params: 19,263,533
Non-trainable params: 1,024
___________________________________________

In [92]:
history = model.fit(
    x=train_dataset,
    steps_per_epoch=train_data_steps,
    epochs=EPOCHS,
    callbacks=[
        keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.3, patience=5),
        keras.callbacks.EarlyStopping(monitor="loss", patience=13, restore_best_weights=True),
        keras.callbacks.TerminateOnNaN(),
    ],
)

Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [93]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! tweetend s and s is of s is and are has has has has to should should has has been
Temperature is set at 50%
Make America great again! tweetend s and s is is s of are deal has has has to has should has been has has
Temperature is set at 100%
Make America great again! tweetend and doesn't s and are is s and of should should has has has has been has has should
Temperature is set at 150%
Make America great again! tweetend s tweetend should doesn't has wants are and has has of doesn't has should no clinton are has has
Temperature is set at 200%
Make America great again! tweetend tweetend and sanders than s are has is of respect are is has should been should s to meaning


### ***TF Hub Model - Wiki words 250 normalized***

In [0]:
def make_text_dataset(data, window_length, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    dataset = dataset.window(window_length, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length, drop_remainder=True))
    dataset = dataset.shuffle(math.ceil(len(data) / (window_length - 1)))
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, -1:]), num_parallel_calls=AUTOTUNE)
    dataset = dataset.map(
        lambda xs, ys: (tf.strings.reduce_join(xs, axis=1, separator=' '), table.lookup(tf.squeeze(ys))),
        num_parallel_calls=AUTOTUNE,
    )
    dataset = dataset.repeat()
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    return dataset

In [0]:
text_data = tf.constant(text_words)
text_data_size = len(text_data)

In [0]:
text_dataset = make_text_dataset(text_data, window_length, BATCH_SIZE)
text_data_steps = math.ceil(text_data_size / BATCH_SIZE)

In [97]:
text_dataset.element_spec

(TensorSpec(shape=(256,), dtype=tf.string, name=None),
 TensorSpec(shape=(256,), dtype=tf.int64, name=None))

In [98]:
for xs, ys in text_dataset.take(1):
    print(xs.shape, ys.shape)
    print(xs[0])
    print(ys[0])

(256,) (256,)
tf.Tensor(b'out to vote this election is far from over we are doing well but there is much time left go florida tweetend just out according to cnn utah officials report voting machine problems across entire country tweetend i will be watching the election results from trump tower in manhattan with my family and friends very exciting tweetend today we make america great again tweetend', shape=(), dtype=string)
tf.Tensor(70, shape=(), dtype=int64)


In [0]:
def make_hub_model(
    model_url,
    n_categories,
    dropout,
    l1,
    l2,
):
    input_layer = keras.layers.Input(shape=[], dtype=tf.string)
    hub_layer = hub.KerasLayer(model_url)(input_layer)
    batch_norm_layer = keras.layers.BatchNormalization()(hub_layer)
    dropout_layer = keras.layers.Dropout(dropout)(batch_norm_layer)
    output_layer = keras.layers.Dense(
        n_categories,
        kernel_regularizer=keras.regularizers.l1_l2(l1, l2),
        activation=keras.activations.softmax,
    )(dropout_layer)
    return keras.Model(inputs=input_layer, outputs=output_layer)
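One property of this architecture is worth keeping in mind when reading the generated samples: the hub layer turns the whole 64-word input string into a single fixed-size vector. Assuming the modules combine per-word vectors with a sqrt-N weighted sum, as their TF Hub pages describe, the result does not depend on word order. The sketch below illustrates that with made-up two-dimensional vectors; `toy_embeddings` and `sentence_embedding` are hypothetical and do not reproduce the real modules.

```python
import math

# Hypothetical 2-dimensional word vectors; the real modules use 20 to 500 dimensions.
toy_embeddings = {
    "make": (1.0, 0.0),
    "america": (0.0, 2.0),
    "great": (3.0, 1.0),
}

def sentence_embedding(sentence):
    """Sum the word vectors and divide by sqrt(number of words)."""
    words = sentence.lower().split()
    dims = len(next(iter(toy_embeddings.values())))
    summed = [sum(toy_embeddings[w][d] for w in words) for d in range(dims)]
    return [s / math.sqrt(len(words)) for s in summed]

a = sentence_embedding("make america great")
b = sentence_embedding("great america make")
print(a)
print(b)
```

If that assumption holds, the dense output layer sees the same input for any permutation of the last 64 words, so the model can learn which words are likely in context but not which word should come next in sequence.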

In [0]:
def train_model(
    model,
    epochs,
    lrs=None,
    optimizers=None,
    verbose=1,
):
    if optimizers is None:
        optimizers = [keras.optimizers.Adam(lr) for lr in lrs]

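    # NOTE: in a functional model built with keras.layers.Input, layers[0] is the
    # InputLayer, so this flag (and the one below) does not reach the hub layer at
    # model.layers[1]. hub.KerasLayer is non-trainable unless trainable=True is passed.
    model.layers[0].trainable = False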
    model.compile(
        loss=keras.losses.sparse_categorical_crossentropy,
        optimizer=optimizers[0],
    )
    model.fit(
        text_dataset,
        steps_per_epoch=text_data_steps,
        epochs=epochs[0],
        callbacks=[
            keras.callbacks.ReduceLROnPlateau(monitor="loss", patience=1, factor=0.3),
            keras.callbacks.EarlyStopping(monitor="loss", patience=3, restore_best_weights=True),
            keras.callbacks.TerminateOnNaN(),
        ],
        verbose=verbose,
    )

    model.layers[0].trainable = True
    model.compile(
        loss=keras.losses.sparse_categorical_crossentropy,
        optimizer=optimizers[1],
    )
    history = model.fit(
        text_dataset,
        steps_per_epoch=text_data_steps,
        epochs=epochs[1],
        callbacks=[
            keras.callbacks.ReduceLROnPlateau(monitor="loss", patience=5, factor=0.3),
            keras.callbacks.EarlyStopping(monitor="loss", patience=13, restore_best_weights=True),
            keras.callbacks.TerminateOnNaN(),
        ],
        verbose=verbose,
    )

    return model, history

In [0]:
url = 'https://tfhub.dev/google/Wiki-words-250-with-normalization/2'

In [0]:
model = make_hub_model(
    model_url=url,
    n_categories=vocabulary_size,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [103]:
model, history = train_model(
    model=model,
    epochs=[max(1, EPOCHS // 10), EPOCHS],
    optimizers=[keras.optimizers.Adam(learning_rate=LEARN_RATE * 0.3), keras.optimizers.Adam(learning_rate=LEARN_RATE)],
    verbose=1,
)

Train for 411 steps
Epoch 1/2
Epoch 2/2
Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [0]:
def predict_next_word(model, text, temperature=1):
    prediction_input = tf.constant([text], dtype=tf.string)
    prediction_probs = model.predict(prediction_input)
    rescaled_logits = tf.math.log(prediction_probs) / temperature
    word_index = tf.random.categorical(rescaled_logits, num_samples=1)
    word = vocabulary[int(tf.squeeze(word_index))]
    return word

In [105]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice
Temperature is set at 50%
Make America great again! realdonaldtrump celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice realdonaldtrump celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice
Temperature is set at 100%
Make America great again! virginia barackobama also celebapprentice actual watching uncle 2012 please scotland now never please is such learn i'll south i realdonaldtrump
Temperature is set at 150%
Make America great again! state

### ***TF Hub Model - Wiki words 500 normalized***

In [0]:
url = 'https://tfhub.dev/google/Wiki-words-500-with-normalization/2'

In [0]:
model = make_hub_model(
    model_url=url,
    n_categories=vocabulary_size,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [108]:
model, history = train_model(
    model=model,
    epochs=[max(1, EPOCHS // 10), EPOCHS],
    optimizers=[keras.optimizers.Adam(learning_rate=LEARN_RATE * 0.3), keras.optimizers.Adam(learning_rate=LEARN_RATE)],
    verbose=1,
)

Train for 411 steps
Epoch 1/2
Epoch 2/2
Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [109]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice
Temperature is set at 50%
Make America great again! celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice celebapprentice realdonaldtrump celebapprentice celebapprentice
Temperature is set at 100%
Make America great again! won't realdonaldtrump club 5 tremendous at like always ask happy our says we doral celebapprentice a prix stern marble loan
Temperature is set at 150%
Make America great again! economic he ally is equity

### ***TF Hub Model - nnlm 50 dims normalized***

In [0]:
url = 'https://tfhub.dev/google/tf2-preview/nnlm-en-dim50-with-normalization/1'

In [0]:
model = make_hub_model(
    model_url=url,
    n_categories=vocabulary_size,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [112]:
model, history = train_model(
    model=model,
    epochs=[max(1, EPOCHS // 10), EPOCHS],
    optimizers=[keras.optimizers.Adam(learning_rate=LEARN_RATE * 0.3), keras.optimizers.Adam(learning_rate=LEARN_RATE)],
    verbose=1,
)

Train for 411 steps
Epoch 1/2
Epoch 2/2
Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [113]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! tweetend tweetend on tweetend the be the the the tweetend the the of the best the the of the best
Temperature is set at 50%
Make America great again! tweetend tweetend on that tweetend the is and on in show the tweetend tweetend i the the new i enjoy
Temperature is set at 100%
Make America great again! tweetend towers more 30 not sr i'm for day fix pm thomas business star' respectful snowden i lateshow cnbc know
Temperature is set at 150%
Make America great again! bangor stand the years thanks trump al my famous epa dj aides lawrence 'president 000 pure homeland face realdonaldtrump electric
Temperature is set at 200%
Make America great again! dealmaker thank springfield would salmond form personal hurts representatives queen achieve face sharks doesn't earn apart egypt costs pastors downside


### ***TF Hub Model - nnlm 128 dims normalized***

In [0]:
url = 'https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2'

In [0]:
model = make_hub_model(
    model_url=url,
    n_categories=vocabulary_size,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [116]:
model, history = train_model(
    model=model,
    epochs=[max(1, EPOCHS // 10), EPOCHS],
    optimizers=[keras.optimizers.Adam(learning_rate=LEARN_RATE * 0.3), keras.optimizers.Adam(learning_rate=LEARN_RATE)],
    verbose=1,
)

Train for 411 steps
Epoch 1/2
Epoch 2/2
Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [117]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! tweetend the tweetend realdonaldtrump realdonaldtrump realdonaldtrump realdonaldtrump realdonaldtrump tweetend realdonaldtrump realdonaldtrump realdonaldtrump realdonaldtrump trump realdonaldtrump realdonaldtrump realdonaldtrump trump trump trump
Temperature is set at 50%
Make America great again! tweetend you is tweetend the is great and tweetend on will be and the a if the the to more
Temperature is set at 100%
Make America great again! show run tweetend for to bernie as a will if i tx your to people of without j again settle
Temperature is set at 150%
Make America great again! never could calm 8 believe love wasserman xx nc brilliant tweetend guide tweetend carolina instagram at will apprenticenbc achieve draw
Temperature is set at 200%
Make America great again! mid bought reduce tweetend conde trumpdoral currently caught jewish sp interested will spent than congratulations m really going hammer intelligence


### ***TF Hub Model - Google News Swivel 20 dims***

In [0]:
url = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'

In [0]:
model = make_hub_model(
    model_url=url,
    n_categories=vocabulary_size,
    dropout=0.0,
    l1=1e-4,
    l2=1e-6,
)

In [120]:
model, history = train_model(
    model=model,
    epochs=[max(1, EPOCHS // 10), EPOCHS],
    optimizers=[keras.optimizers.Adam(learning_rate=LEARN_RATE * 0.3), keras.optimizers.Adam(learning_rate=LEARN_RATE)],
    verbose=1,
)

Train for 411 steps
Epoch 1/2
Epoch 2/2
Train for 411 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [121]:
generate_samples(model, 'Make America great again!', 20, [0.2, 0.5, 1, 1.5, 2])

Temperature is set at 20%
Make America great again! the tweetend tweetend the tweetend the to tweetend the the tweetend tweetend tweetend tweetend the tweetend tweetend the tweetend tweetend
Temperature is set at 50%
Make America great again! on that to the you tweetend you the the tweetend the be tweetend be at will i the to tweetend
Temperature is set at 100%
Make America great again! get why ebola are helping marco i tweetend kravis soldier house extend will way electric company smart always by totally
Temperature is set at 150%
Make America great again! k amazing years dept www1 conrad donate veterans la 11746… polling linda with she any was aware hopeful which seem
Temperature is set at 200%
Make America great again! stuck stupidly drilling files himself championship snurk total amazing mind 00am iq exec twisting franklin monday joniernst crazy seth very
