### TrumpTweet Notebook
This notebook contains the code to generate a Trump-style tweet. It covers:
- Processing the Trump Tweet archive to create a clean file of Trump's tweets
- Defining and training a recurrent neural network built from GRU (gated recurrent unit) cells
- Feeding seed text into the trained model to generate a novel tweet

#### Data Prep

In [24]:
# Disable warnings
import warnings
warnings.filterwarnings('ignore')

In [34]:
# Run the helpers script, which provides the DataHelper and ModelHelper classes used below
%run scripts/helpers
#from scripts import helpers

In [3]:
# Verify GPU support (an empty list means no GPU was detected)
tf.config.list_physical_devices('GPU')


[]

##### Create a datahelper and process raw data

In [35]:
# Create a datahelper object and designate the input file
dh = DataHelper(file_name='tweets_12-29-2020.csv')

# Prep the raw data to create the tweet file
dh.prep_raw_data(start_date='2020-06-01', end_date='2020-12-29')

# Print the number of tweets
print('The number of Tweets sent by Trump during the period is {}'.format(dh.num_tweets))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56090 entries, 0 to 56089
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         56090 non-null  int64 
 1   text       56090 non-null  object
 2   isRetweet  56090 non-null  object
 3   isDeleted  56090 non-null  object
 4   device     56090 non-null  object
 5   favorites  56090 non-null  int64 
 6   retweets   56090 non-null  int64 
 7   date       56090 non-null  object
dtypes: int64(3), object(5)
memory usage: 3.4+ MB
None


TypeError: '>=' not supported between instances of 'datetime.datetime' and 'str'
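
The `TypeError` above says that a `datetime` value was compared against a string, which suggests the date bounds passed to `prep_raw_data` were still strings when the filter ran. The internals of `DataHelper` are not shown in this notebook, so the following is only a sketch of the usual remedy: parse the bounds before comparing. The `filter_by_date` function and the sample rows are hypothetical.

```python
from datetime import datetime

def filter_by_date(rows, start_date, end_date, fmt='%Y-%m-%d'):
    """Keep rows whose 'date' lies in [start_date, end_date] (hypothetical helper)."""
    # Parse the string bounds first; comparing datetime >= str raises TypeError
    start = datetime.strptime(start_date, fmt)
    end = datetime.strptime(end_date, fmt)
    return [row for row in rows if start <= row['date'] <= end]

# Made-up rows standing in for parsed tweets
rows = [
    {'id': 1, 'date': datetime(2020, 5, 31, 23, 59)},
    {'id': 2, 'date': datetime(2020, 6, 1, 8, 0)},
    {'id': 3, 'date': datetime(2020, 12, 29, 0, 0)},
]
kept = filter_by_date(rows, '2020-06-01', '2020-12-29')
print([row['id'] for row in kept])
```

Note that an end bound parsed this way is midnight at the start of that day, so tweets sent later on the end date would be excluded unless the bound is moved forward by a day.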

##### Tokenize the text and create the dataset for model training

In [5]:
# Tokenize the text and create the dataset
dataset, tokenizer = dh.create_tokenizer('inputdata/clean_tweet.txt')
print('The number of unique characters is {0:,} and the dataset size is {1:,} document(s).' \
      ' The number of windows in the dataset for processing is {2:,}.'.format(dh.num_unique_chars,dh.dataset_size,dh.num_data_windows))


Dataset and tokenizer creation complete.
The number of unique characters is 75 and the dataset size is 1 document(s). The number of windows in the dataset for processing is 4,926,622.
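
The counts above come from character-level tokenization followed by slicing the encoded text into overlapping windows. The exact logic inside `create_tokenizer` is not shown here, so this is a minimal sketch under assumptions: each unique character gets an integer id, and each window of `n_steps + 1` characters is split into an input and a target shifted one character ahead. The names `build_vocab`, `make_windows`, and `n_steps` are illustrative only.

```python
def build_vocab(text):
    """Map each unique character to an integer id."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

def make_windows(ids, n_steps):
    """Yield (input, target) pairs; the target is the input shifted by one character."""
    window_length = n_steps + 1
    for start in range(len(ids) - window_length + 1):
        window = ids[start:start + window_length]
        yield window[:-1], window[1:]

text = 'make america great again'
vocab = build_vocab(text)
ids = [vocab[ch] for ch in text]
windows = list(make_windows(ids, n_steps=5))
print('unique characters:', len(vocab))
print('windows:', len(windows))
```

With a shift of one character, the number of windows grows almost linearly with the length of the text, which is why a single document can yield millions of training windows.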


##### Create the Model and Train It
The model is a stateless RNN made up of GRU cells.

In [6]:
# Number of training epochs
EPOCHS = 20

# Create the modelhelper object
mh = ModelHelper(epochs=EPOCHS)

# Create the model
model = mh.create_model(tokenizer)

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Save the full model after every epoch; monitor and mode have no effect
# while save_best_only=False
checkpoint_filepath = 'checkpoints/weights.{epoch:02d}.hdf5'
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    save_freq='epoch',
    monitor='val_loss',
    mode='min',
    save_best_only=False)

# Fit the model
history = model.fit(dataset, epochs=EPOCHS, callbacks=[model_checkpoint])


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


##### Start Training Model from Saved Checkpoint
Training the model takes a long time. Because a checkpoint is saved after every epoch, the code below lets you resume training from a saved checkpoint.

In [None]:
# Restart training from the checkpoint saved after epoch 20
new_model = tf.keras.models.load_model('checkpoints/weights.20.hdf5')

# Train one more epoch. initial_epoch continues the epoch numbering, so the
# new checkpoint is written as weights.21.hdf5 rather than overwriting weights.01.hdf5
INITIAL_EPOCH = 20
EPOCHS = INITIAL_EPOCH + 1
checkpoint_filepath = 'checkpoints/weights.{epoch:02d}.hdf5'
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    save_freq='epoch',
    monitor='val_loss',
    mode='min',
    save_best_only=False)

# Fit the model
history = new_model.fit(dataset, initial_epoch=INITIAL_EPOCH, epochs=EPOCHS,
                        callbacks=[model_checkpoint])
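
The checkpoint file name depends on the epoch number, so the epoch at which a resumed run starts counting determines which file it writes. The sketch below assumes Keras fills the `{epoch:02d}` placeholder with the 1-based epoch number, which is consistent with `weights.20.hdf5` being the file produced after 20 epochs. The `checkpoint_paths` helper is hypothetical; its `initial_epoch` argument mirrors the `fit` argument of the same name.

```python
def checkpoint_paths(pattern, epochs, initial_epoch=0):
    """List the checkpoint files a training run would write (hypothetical helper)."""
    # File names use 1-based epoch numbers
    return [pattern.format(epoch=epoch + 1) for epoch in range(initial_epoch, epochs)]

pattern = 'checkpoints/weights.{epoch:02d}.hdf5'

# One epoch counted from zero reuses the first checkpoint name
fresh_count = checkpoint_paths(pattern, epochs=1)
# One epoch counted from epoch 20 gets a new name
continued_count = checkpoint_paths(pattern, epochs=21, initial_epoch=20)

print(fresh_count)
print(continued_count)
```

Under that assumption, a resumed run that counts from zero overwrites the earliest checkpoint, while a run started with `initial_epoch=20` adds a new file after the existing ones.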


In [None]:
# Save model
model.save('model.h5')

#### Load a saved model and generate a Tweet

In [2]:
# Create a datahelper object and designate the input file
dh = DataHelper(file_name='tweets_11-06-2020.csv')

# Prep the raw data to create the tweet file
dh.prep_raw_data(start_date='2020-08-01', end_date='2020-11-19')

# Print the number of tweets
print('The number of Tweets sent by Trump during the period is {}'.format(dh.num_tweets))

Data processing complete.
The number of Tweets sent by Trump during the period is 2362


In [3]:
# Re-create the tokenizer and dataset from the clean tweet file
dataset, tok = dh.create_tokenizer('inputdata/clean_tweet.txt')
print('The number of unique characters is {0:,} and the dataset size is {1:,} document(s).' \
      ' The number of windows in the dataset for processing is {2:,}.'.format(dh.num_unique_chars,dh.dataset_size,dh.num_data_windows))

# Restore the saved model for inference
mh = ModelHelper(epochs=1, tokenizer=tok)
new_model = mh.restore_model('weights.20.hdf5')

Dataset and tokenizer creation complete.
The number of unique characters is 75 and the dataset size is 1 document(s). The number of windows in the dataset for processing is 273,569.


##### Generate some tweets

In [9]:
# Generate text from the seed 'kamala' at a very low sampling temperature
print(mh.create_tweet(text='kamala', model=new_model, n_chars=140, temperature=0.01))

kamala harring down — always hase always want to but the reporting and playens and failed the workers of delphi. i always put american workers and
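
The `temperature` argument controls how adventurous the character sampling is. How `create_tweet` applies it is not shown in this notebook; a common approach, sketched here as an assumption, divides the log-probabilities by the temperature before sampling. The `sample_with_temperature` function and the example probabilities are hypothetical.

```python
import math
import random

def sample_with_temperature(probs, temperature, rng=random):
    """Sample an index from probs after rescaling by temperature (probs must be > 0)."""
    # Dividing log-probabilities by a small temperature sharpens the distribution
    logits = [math.log(p) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
rng = random.Random(0)
cold = [sample_with_temperature(probs, 0.01, rng) for _ in range(100)]
hot = [sample_with_temperature(probs, 1.0, rng) for _ in range(100)]
print('distinct choices at temperature 0.01:', sorted(set(cold)))
print('distinct choices at temperature 1.0:', sorted(set(hot)))
```

At a temperature of 0.01 the most likely character is chosen almost every time, so generation is close to greedy decoding. Higher temperatures produce more varied but less coherent text.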


#### Create an updated tweet file with the latest 1,000 tweets

In [40]:
from datetime import datetime

import pandas as pd

# Get the latest 1000 tweets
df_latest_tweets = pd.read_json(path_or_buf='https://www.thetrumparchive.com/latest-tweets', orient='records')
df_latest_tweets = df_latest_tweets.drop(columns=['isFlagged'])

# Sort by date ascending (oldest first)
df_latest_tweets = df_latest_tweets.sort_values('date')

# Encode the boolean flags as 't'/'f' strings; restricting the replacement to
# these two columns leaves the tweet text untouched
for col in ['isRetweet', 'isDeleted']:
    df_latest_tweets[col] = df_latest_tweets[col].astype(str).replace({'True': 't', 'False': 'f'})

# Create consistent datetime format
df_latest_tweets['date'] = df_latest_tweets['date'].dt.strftime('%Y-%m-%d %H:%M:%S')

# Append to the 11-06-2020 archive (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_archive = pd.read_csv('./inputdata/tweets_11-06-2020.csv')
df_all_tweets = pd.concat([df_archive, df_latest_tweets], ignore_index=True)

# Save to a new csv file
today = datetime.today().strftime('%m-%d-%Y')
file_name = './inputdata/tweets_' + today + '.csv'
df_all_tweets.to_csv(file_name,index=False)


In [30]:
df_latest_tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 999 to 0
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1000 non-null   int64 
 1   text       1000 non-null   object
 2   isRetweet  1000 non-null   object
 3   isDeleted  1000 non-null   object
 4   device     1000 non-null   object
 5   favorites  1000 non-null   int64 
 6   retweets   1000 non-null   int64 
 7   date       1000 non-null   object
dtypes: int64(3), object(5)
memory usage: 70.3+ KB
