# RNN Assignment - Generating Tweets with LSTMs

We will use tweets from Tesla's CEO, Elon Musk, to train a model whose architecture must have two LSTM layers and one Dense layer. The trained model generates tweets automatically, one character at a time.

The dataset is available on Kaggle, the data science competition site (link below).

## Details

- **Dataset:** https://www.kaggle.com/kulgen/elon-musks-tweets
- **Student:** Plínio Larrubia Ferreira de Moura


## 1: Data acquisition


### Importing the Libraries


In [50]:
!pip install keras --upgrade
!pip install matplotlib --upgrade
!pip install numpy --upgrade
!pip install pandas --upgrade
!pip install seaborn --upgrade
!pip install tensorflow --upgrade




In [51]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import os  # required by os.path.join when building the checkpoint path
import time
import sys


### Loading the Dataset


In [52]:
# from google.colab import drive
# drive.mount('/content/drive')


In [53]:
elon_base = pd.read_csv('data/data_elonmusk.csv', encoding='latin1')


## 2: Exploratory Analysis (as far as possible)


In [54]:
elon_base.head()


Unnamed: 0,row ID,Tweet,Time,Retweet from,User
0,Row0,@MeltingIce Assuming max acceleration of 2 to ...,2017-09-29 17:39:19,,elonmusk
1,Row1,RT @SpaceX: BFR is capable of transporting sat...,2017-09-29 10:44:54,SpaceX,elonmusk
2,Row2,@bigajm Yup :),2017-09-29 10:39:57,,elonmusk
3,Row3,Part 2 https://t.co/8Fvu57muhM,2017-09-29 09:56:12,,elonmusk
4,Row4,Fly to most places on Earth in under 30 mins a...,2017-09-29 09:19:21,,elonmusk


In [55]:
elon_base.iloc[-5:, :]


Unnamed: 0,row ID,Tweet,Time,Retweet from,User
3213,Row3213,"@YOUSRC Amos's article was fair, but his edito...",2012-11-20 08:52:03,,elonmusk
3214,Row3214,These articles in Space News describe why Aria...,2012-11-20 08:38:31,,elonmusk
3215,Row3215,Was misquoted by BBC as saying Europe's rocket...,2012-11-20 08:30:44,,elonmusk
3216,Row3216,Just returned from a trip to London and Oxford...,2012-11-19 08:59:46,,elonmusk
3217,Row3217,RT @Jon_Favreau: My Model S just arrived and I...,2012-11-16 17:59:47,Jon_Favreau,elonmusk


In [56]:
np.unique(elon_base['User'], return_counts=True)


(array(['elonmusk'], dtype=object), array([3218]))

In [57]:
elon_base.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   row ID        3218 non-null   object
 1   Tweet         3218 non-null   object
 2   Time          3218 non-null   object
 3   Retweet from  525 non-null    object
 4   User          3218 non-null   object
dtypes: object(5)
memory usage: 125.8+ KB


In [58]:
elon_base.describe()


Unnamed: 0,row ID,Tweet,Time,Retweet from,User
count,3218,3218,3218,525,3218
unique,3218,3216,3217,201,1
top,Row0,RT @SpaceX: Webcast of Falcon 9 launch is now ...,2013-07-24 10:49:13,SpaceX,elonmusk
freq,1,2,2,109,3218


In [59]:
retweet_values = elon_base['Retweet from'].unique()
retweet_values


array([nan, 'SpaceX', 'mayemusk', 'MotorTrend', 'FastCompany', 'Gizmodo',
       'Tesla', 'karpathy', 'Hyperloop', 'Teslarati', 'cmleahey', 'NASA',
       'OpenAI', 'mkarolian', 'guardianeco', 'rj8w', 'TheOnion',
       'tanyaofmars', 'mcannonbrookes', 'BBCWorld', 'HuffPostAU',
       'justAfanDavid', 'cnntech', 'jovanik21', 'ElectrekCo',
       'businessinsider', 'newscientist', 'RobertIger', 'TheMarsSociety',
       'insideclimate', 'GeorgeTakei', 'waitbutwhy', 'creepypuppet',
       'wcax', 'business', 'MrJc2012', 'mashable', '350', 'XHNews',
       'realamberheard', 'AAAauto', 'ProfBrianCox', 'PokerVixen',
       'RH_Way', 'morrisonbrett', 'arctechinc', 'Herifin_teki',
       'RicardoTwumasi', 'Shkottt', 'danieltadros', 'bobbykansara',
       'ryanrossuk', 'verge', 'SteveCase', 'WIRED', 'techinsider',
       'neiltyson', 'tsrandall', 'BBC_TopGear', 'IridiumComm', 'thedrive',
       'TechCrunch', 'garrett_bauman', 'TheEconomist', 'NatGeo',
       'RealRonHoward', 'ggreenwald', 'jody

### Retweets from @SpaceX


In [60]:
by_retweets = elon_base.groupby(['Retweet from'])


In [61]:
pd.DataFrame(by_retweets['Tweet'].get_group(name='SpaceX'))


Unnamed: 0,Tweet
1,RT @SpaceX: BFR is capable of transporting sat...
5,RT @SpaceX: Supporting the creation of a perma...
10,"RT @SpaceX: Nine years ago today, Falcon 1 bec..."
44,RT @SpaceX: More photos from today?s Falcon 9 ...
110,RT @SpaceX: Successful deployment of FORMOSAT-...
...,...
2699,RT @SpaceX: [PHOTO] Falcon 9 and SES-8 liftoff...
2729,RT @SpaceX: Falcon 9 launch window opens tmrw ...
2733,RT @SpaceX: Falcon 9 targeted to launch SES-8 ...
2900,RT @SpaceX: [WATCH] SpaceX?s 5.2m satellite fa...


## 3: Data Preparation


In [62]:
tweets_data = elon_base['Tweet'].values
tweets_data

array(["@MeltingIce Assuming max acceleration of 2 to 3 g's, but in a comfortable direction. Will feel like a mild to moder? https://t.co/fpjmEgrHfC",
       'RT @SpaceX: BFR is capable of transporting satellites to orbit, crew and cargo to the @Space_Station and completing missions to the Moon an?',
       '@bigajm Yup :)', ...,
       "Was misquoted by BBC as saying Europe's rocket has no chance. Just said the [Franco-German] Ariane 5 has no chance, so go with Ariane 6.",
       'Just returned from a trip to London and Oxford, where I met with many interesting people. I really like Britain!',
       'RT @Jon_Favreau: My Model S just arrived and I went electric like Dylan! #FF @TeslaMotors @elonmusk'],
      dtype=object)

In [63]:
dataset_text = '\n'.join(tweets_data)


In [64]:
characteres = sorted(list(set(dataset_text)))
print('{} unique characters'.format(len(characteres)))


97 unique characters


In [65]:
characteres

['\n',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '^',
 '_',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '|',
 '~',
 '\xa0',
 '°',
 'Ü',
 'ä',
 'é']

In [66]:
char2idx = { char: index for index, char in enumerate(characteres) }
char2idx

{'\n': 0,
 ' ': 1,
 '!': 2,
 '#': 3,
 '$': 4,
 '%': 5,
 '&': 6,
 "'": 7,
 '(': 8,
 ')': 9,
 '*': 10,
 '+': 11,
 ',': 12,
 '-': 13,
 '.': 14,
 '/': 15,
 '0': 16,
 '1': 17,
 '2': 18,
 '3': 19,
 '4': 20,
 '5': 21,
 '6': 22,
 '7': 23,
 '8': 24,
 '9': 25,
 ':': 26,
 ';': 27,
 '<': 28,
 '=': 29,
 '>': 30,
 '?': 31,
 '@': 32,
 'A': 33,
 'B': 34,
 'C': 35,
 'D': 36,
 'E': 37,
 'F': 38,
 'G': 39,
 'H': 40,
 'I': 41,
 'J': 42,
 'K': 43,
 'L': 44,
 'M': 45,
 'N': 46,
 'O': 47,
 'P': 48,
 'Q': 49,
 'R': 50,
 'S': 51,
 'T': 52,
 'U': 53,
 'V': 54,
 'W': 55,
 'X': 56,
 'Y': 57,
 'Z': 58,
 '[': 59,
 ']': 60,
 '^': 61,
 '_': 62,
 '`': 63,
 'a': 64,
 'b': 65,
 'c': 66,
 'd': 67,
 'e': 68,
 'f': 69,
 'g': 70,
 'h': 71,
 'i': 72,
 'j': 73,
 'k': 74,
 'l': 75,
 'm': 76,
 'n': 77,
 'o': 78,
 'p': 79,
 'q': 80,
 'r': 81,
 's': 82,
 't': 83,
 'u': 84,
 'v': 85,
 'w': 86,
 'x': 87,
 'y': 88,
 'z': 89,
 '|': 90,
 '~': 91,
 '\xa0': 92,
 '°': 93,
 'Ü': 94,
 'ä': 95,
 'é': 96}

In [67]:
idx2char = np.array(characteres)
idx2char

array(['\n', ' ', '!', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',',
       '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F',
       'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
       'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', '_', '`', 'a',
       'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|',
       '~', '\xa0', '°', 'Ü', 'ä', 'é'], dtype='<U1')

In [68]:
char2idx['é'], idx2char[96]

(96, 'é')

In [69]:
text_as_int = np.array([char2idx[char] for char in dataset_text])
text_as_int.shape, text_as_int

((300787,), array([32, 45, 68, ..., 84, 82, 74]))

In [70]:
print('{} characters mapped to int ---> {}'.format(repr(dataset_text[:15]), text_as_int[:15]))

'@MeltingIce Ass' characters mapped to int ---> [32 45 68 75 83 72 77 70 41 66 68  1 33 82 82]
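
The mapping above can be checked with a round trip: encoding a string to integers and decoding it back should return the original text. The following is a minimal sketch on a toy corpus, not the notebook's data; `sample_text`, `encode` and `decode` are hypothetical names introduced only for illustration.

```python
# Toy corpus standing in for dataset_text (an assumption for illustration).
sample_text = "Mars City\nMoon Base Alpha"

# Same construction as in the notebook: sorted unique characters.
vocab = sorted(set(sample_text))
char_to_index = {char: index for index, char in enumerate(vocab)}
index_to_char = list(vocab)


def encode(text):
    """Map each character to its integer id."""
    return [char_to_index[char] for char in text]


def decode(ids):
    """Map integer ids back to a string."""
    return "".join(index_to_char[i] for i in ids)


encoded = encode(sample_text)
print(encoded[:9])
print(decode(encoded) == sample_text)
```

A character that never appears in the corpus has no entry in the dictionary, so encoding it raises a `KeyError`. The same applies to any start string later passed to `generate_text`.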


## 4: Creating the training examples and batches

In [71]:
len(dataset_text)

300787

In [72]:
seq_length = 100
examples_per_epoch = len(dataset_text) // seq_length
examples_per_epoch

3007
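
The value 3007 is only an approximation of the number of training examples. The sketch below redoes the arithmetic under the settings used in this notebook: chunks of `seq_length + 1` characters, a batch size of 64, and `drop_remainder=True` at both steps. The names `chunks` and `batches_per_epoch` are introduced here for illustration.

```python
text_length = 300787   # len(dataset_text) reported in the notebook
seq_length = 100
batch_size = 64        # value set in a later cell of the notebook

# Each chunk holds seq_length + 1 characters (the input plus the shifted target).
chunks = text_length // (seq_length + 1)

# drop_remainder=True discards the final, incomplete batch.
batches_per_epoch = chunks // batch_size

print(chunks)
print(batches_per_epoch)
```

Under these assumptions the text yields 2978 chunks and 46 full batches per epoch, so each epoch should run for 46 steps.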

In [73]:
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [74]:
sequences = char_dataset.batch(batch_size=seq_length + 1, drop_remainder=True)

In [75]:
for item in sequences.take(50):
  print(repr(''.join(idx2char[item.numpy()])))

"@MeltingIce Assuming max acceleration of 2 to 3 g's, but in a comfortable direction. Will feel like a"
' mild to moder? https://t.co/fpjmEgrHfC\nRT @SpaceX: BFR is capable of transporting satellites to orbi'
't, crew and cargo to the @Space_Station and completing missions to the Moon an?\n@bigajm Yup :)\nPart 2'
' https://t.co/8Fvu57muhM\nFly to most places on Earth in under 30 mins and anywhere in under 60. Cost '
'per seat should be? https://t.co/dGYDdGttYd\nRT @SpaceX: Supporting the creation of a permanent, self-'
'sustaining human presence on Mars. https://t.co/kCtBLPbSg8 https://t.co/ra6hKsrOcG\nBFR will take you '
'anywhere on Earth in less than 60 mins https://t.co/HWt9BZ1FI9\nMars City\nOpposite of Earth. Dawn and '
'dusk sky are blue on Mars and day sky is red. https://t.co/XHcZIdgqnb\nMoon Base Alpha https://t.co/vo'
"Y8qEW9kl\nWill be announcing something really special at today's talk https://t.co/plXTBJY6ia\nRT @Spac"
'eX: Nine years ago today, Falcon 1 became the first 

In [76]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

In [77]:
dataset = sequences.map(split_input_target)

In [78]:
for input_example, target_example in dataset.take(10):
  print('Input data:', repr(''.join(idx2char[input_example.numpy()])))
  print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data: "@MeltingIce Assuming max acceleration of 2 to 3 g's, but in a comfortable direction. Will feel like "
Target data: "MeltingIce Assuming max acceleration of 2 to 3 g's, but in a comfortable direction. Will feel like a"
Input data: ' mild to moder? https://t.co/fpjmEgrHfC\nRT @SpaceX: BFR is capable of transporting satellites to orb'
Target data: 'mild to moder? https://t.co/fpjmEgrHfC\nRT @SpaceX: BFR is capable of transporting satellites to orbi'
Input data: 't, crew and cargo to the @Space_Station and completing missions to the Moon an?\n@bigajm Yup :)\nPart '
Target data: ', crew and cargo to the @Space_Station and completing missions to the Moon an?\n@bigajm Yup :)\nPart 2'
Input data: ' https://t.co/8Fvu57muhM\nFly to most places on Earth in under 30 mins and anywhere in under 60. Cost'
Target data: 'https://t.co/8Fvu57muhM\nFly to most places on Earth in under 30 mins and anywhere in under 60. Cost '
Input data: 'per seat should be? https://t.co/dGYDdGttYd\nRT @SpaceX
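
To make the shift explicit, here is a small sketch that applies the same slicing to a five-character string. At every position the target is simply the next character of the input. A plain Python string is used instead of a tensor, which is an assumption made to keep the example self-contained.

```python
def split_input_target(chunk):
    # The input drops the last element; the target drops the first one.
    return chunk[:-1], chunk[1:]


chunk = "Tesla"
input_text, target_text = split_input_target(chunk)

# Pair each input character with the character the model should predict next.
for current_char, next_char in zip(input_text, target_text):
    print(repr(current_char), "->", repr(next_char))
```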

In [79]:
batch_size = 64
buffer_size = 10000

In [80]:
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder = True)

In [81]:
dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

## 5: Building the Recurrent Neural Network model (LSTM)


In [82]:
len(characteres)

97

In [83]:
vocab_size = len(characteres)
embedding_dim = 256
rnn_units = 1024

In [91]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential()
  model.add(Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]))
  model.add(LSTM(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'))
  model.add(LSTM(rnn_units, return_sequences=True, stateful=True))
  model.add(Dense(vocab_size))
  return model

In [92]:
model = build_model(vocab_size = len(characteres), embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=batch_size)

In [93]:
model.summary()


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (64, None, 256)           24832     
                                                                 
 lstm_4 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 lstm_5 (LSTM)               (64, None, 1024)          8392704   
                                                                 
 dense_3 (Dense)             (64, None, 97)            99425     
                                                                 
Total params: 13,763,937
Trainable params: 13,763,937
Non-trainable params: 0
_________________________________________________________________
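
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and one bias vector. `lstm_params` is a hypothetical helper, not a Keras function.

```python
vocab_size, embedding_dim, rnn_units = 97, 256, 1024


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


embedding = vocab_size * embedding_dim          # one vector per character
lstm_1 = lstm_params(embedding_dim, rnn_units)  # fed by the embedding
lstm_2 = lstm_params(rnn_units, rnn_units)      # fed by the first LSTM
dense = rnn_units * vocab_size + vocab_size     # weights plus biases
total = embedding + lstm_1 + lstm_2 + dense

print(embedding, lstm_1, lstm_2, dense, total)
```

The five values match the `Param #` column and the total of 13,763,937 reported by `model.summary()`.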


## 6: Training the model

### Optimizer and loss function

In [94]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
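
Because the Dense layer has no activation, the model outputs raw logits, which is why `from_logits=True` is passed. The following pure-Python sketch shows the quantity computed for a single character. It illustrates the formula and is not the Keras implementation; `sparse_crossentropy_from_logits` is a hypothetical name.

```python
import math


def sparse_crossentropy_from_logits(label, logits):
    """Cross-entropy for one integer label given unnormalized scores."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Equivalent to -log(softmax(logits)[label]).
    return log_sum_exp - logits[label]


# An untrained model that scores all 97 characters equally.
uniform_loss = sparse_crossentropy_from_logits(0, [0.0] * 97)
# A model that strongly favors the correct class (index 2).
confident_loss = sparse_crossentropy_from_logits(2, [0.0, 0.0, 5.0])

print(uniform_loss, math.log(97))
print(confident_loss)
```

A uniform guess over 97 characters gives a loss of ln(97), about 4.57. That is a useful baseline when reading the training loss, such as the 1.6133 recorded in the last checkpoint name.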

In [95]:
model.compile(optimizer='Adam', loss=loss)

### Checkpoints

In [96]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch:02d}-{loss:.4f}')
checkpoint_callback = ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

### Running the Training

In [97]:
%%time
epochs = 12
history = model.fit(dataset, epochs = epochs, callbacks=[checkpoint_callback])

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12
CPU times: user 1min 39s, sys: 7.66 s, total: 1min 47s
Wall time: 2min 20s


## 7: Generating the Tweets


### Restoring the latest checkpoint

In [98]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_12-1.6133'

In [99]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size = 1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [100]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (1, None, 256)            24832     
                                                                 
 lstm_6 (LSTM)               (1, None, 1024)           5246976   
                                                                 
 lstm_7 (LSTM)               (1, None, 1024)           8392704   
                                                                 
 dense_4 (Dense)             (1, None, 97)             99425     
                                                                 
Total params: 13,763,937
Trainable params: 13,763,937
Non-trainable params: 0
_________________________________________________________________


### Prediction loop

In [101]:
def generate_text(model, start_string):
  # Number of characters to generate
  num_generate = 100

  # Convert the start string from characters to integer ids
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # List that collects the characters generated by the network
  text_generated = []

  # Temperature parameter
  # Lower values give more predictable text, higher values more surprising text
  # (should be tuned by experiment)
  temperature = 1.0

  # Generation loop
  for i in range(num_generate):
    # Predictions (logits) for every position of the current input
    predictions = model(input_eval)

    # Drop the batch dimension, scale by the temperature and sample the next id
    predictions = tf.squeeze(predictions, 0)
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # Feed the sampled character back in as the next input;
    # the stateful LSTM layers carry the context forward
    input_eval = tf.expand_dims([predicted_id], 0)
    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))

In [114]:
print(generate_text(model, start_string='Tesla '))

Tesla http://t.co/AgfZSTdlwq to all homured another upssate) but be designed by @TandashRvoaldoch Model 3 
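
The effect of the `temperature` parameter can be seen without the trained model. The sketch below divides a made-up vector of logits by different temperatures before applying a softmax, then draws one sample. It uses only the standard library as a stand-in for `tf.random.categorical`; the logits and the helper names are assumptions for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    # Dividing by the temperature sharpens (< 1) or flattens (> 1) the scores.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature, rng):
    # Draw one index according to the temperature-adjusted probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

rng = random.Random(0)
print(sample_id(logits, 0.5, rng))
```

At a low temperature most of the probability mass goes to the highest-scoring character, so the generated text tends to be more repetitive but more coherent. At a high temperature the distribution flattens and the output becomes more varied and more error-prone.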
