### Exercise 2d

In [3]:
import tensorflow as tf
import pandas as pd
import numpy as np
import os
import time
#tf.enable_eager_execution()

I load the Trump tweets from a GitHub repository that I found.

In [4]:
path_to_file = tf.keras.utils.get_file('tweets.csv', 'https://raw.githubusercontent.com/mkearney/trumptweets/master/data/trumptweets-1515775693.tweets.csv')
df = pd.read_csv(path_to_file,low_memory = False )

Example tweet:

In [5]:
print(df["text"][100])

Wishing everyone a wonderful Independence Day weekend. We have a lot to be thankful for.


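Next, I convert the tweets into text that can be processed further: URLs are removed, everything is lowercased, and every character other than a-z, 0-9, comma, period, and space is dropped.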

In [6]:
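# regex=True is required since pandas 2.0, where str.replace defaults to literal matching
url_pattern = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
tweets = (df["text"]
          .str.replace(url_pattern, '', regex=True)      # strip URLs
          .str.lower()
          .str.replace('[^a-z0-9,. ]', '', regex=True))  # keep a-z, 0-9, comma, period, space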
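
To see what the cleaning step does to a single tweet, here is a minimal sketch that uses Python's `re` module instead of pandas. The simplified URL pattern `https?://\S+`, the function name `clean_tweet`, and the sample tweet are assumptions for illustration and are not part of the notebook.

```python
import re

# Simplified URL pattern (assumption); the notebook uses a longer expression.
URL_PATTERN = re.compile(r'https?://\S+')
# Everything except a-z, 0-9, comma, period and space is dropped, as in the notebook.
DISALLOWED = re.compile(r'[^a-z0-9,. ]')

def clean_tweet(raw):
    """Remove URLs, lowercase, and strip characters outside the allowed set."""
    without_urls = URL_PATTERN.sub('', raw)
    return DISALLOWED.sub('', without_urls.lower())

sample = "Don't miss it! https://t.co/abc123 #MAGA"  # made-up example tweet
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Removing a URL or a symbol leaves the surrounding spaces in place, which is why double spaces show up in the cleaned text.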

Example text:

In [8]:
text = ' '.join(tweets)
print(text[6600:7000])

forget to tune in tonight to see another unpredictable and exciting episode of the apprentice 10 pm on nbc trump international tower in chicago ranked 6th tallest building in world by council on tall buildings  urban habitat  eric did a great job with his eric trump foundation annual charity outing. im proud of him.   today is donald trumps birthday send him your bday wishes here  tonights episode


Now I build a vocabulary from the individual characters that occur in the tweets.

In [10]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

39 unique characters


Next, I map each character to an index.

In [11]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}

In [12]:
text_as_int = np.array([char2idx[c] for c in text])
print ('So, {} looks like -> {}'.format(repr(text[6800:6811]), text_as_int[6800:6811]))

So, 'buildings  ' looks like -> [14 33 21 24 16 21 26 19 31  0  0]
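
The mapping can be followed by hand on a tiny input. The sketch below is an illustration on a made-up string (`toy_text` is an assumption), using the same `sorted(set(...))` construction as the notebook and a plain list as the reverse lookup.

```python
toy_text = "hello world"  # made-up stand-in for the joined tweets

# Sorted unique characters; the space sorts before all letters.
vocab = sorted(set(toy_text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = vocab  # list position i holds the character with index i

encoded = [char2idx[ch] for ch in "hello"]
decoded = ''.join(idx2char[i] for i in encoded)
print(vocab)
print(encoded, decoded)
```

Because the space character sorts before commas, periods, digits, and letters, it receives index 0, which matches the two trailing zeros in the output above.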


Now the maximum sequence length is set to 280 characters, and the number of examples per epoch is derived from it.

In [13]:
# The maximum length text we want
seq_length = 280
examples_per_epoch = len(text)//seq_length
# Create training inputs / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
idx2char = np.array(vocab)

In [14]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [15]:
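def split_input_target(chunk):
    # The input is the chunk without its last character,
    # the target is the same chunk shifted one character to the right.
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)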
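
The chunking and the input/target shift can be reproduced without TensorFlow. This is a rough sketch in plain Python; `make_chunks` is a hypothetical helper that imitates `batch(seq_length+1, drop_remainder=True)`, and the toy string with `seq_length=4` is chosen only to keep the output readable.

```python
def make_chunks(ids, seq_length):
    """Cut ids into chunks of seq_length + 1, dropping an incomplete final chunk."""
    size = seq_length + 1
    n_full = len(ids) // size
    return [ids[i * size:(i + 1) * size] for i in range(n_full)]

def split_input_target(chunk):
    # Same shift as in the notebook: predict character t+1 from character t.
    return chunk[:-1], chunk[1:]

toy = list("hello world")  # characters instead of indices, for readability
chunks = make_chunks(toy, seq_length=4)
pairs = [split_input_target(c) for c in chunks]
for inp, tgt in pairs:
    print(''.join(inp), '->', ''.join(tgt))
```

Each chunk holds one more character than the sequence length so that input and target both end up with exactly `seq_length` characters.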

I chose a batch size of 32 and a shuffle buffer size of 1000.

In [16]:
BATCH_SIZE = 32
# Buffer size to shuffle the dataset
BUFFER_SIZE = 1000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
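
Since both `batch` calls drop incomplete remainders, the number of batches in one pass over the data follows from integer division. The sketch below assumes a hypothetical text length of 1,000,000 characters, since the notebook does not print the real length; `batches_per_pass` is an illustrative helper, not a TensorFlow function.

```python
def batches_per_pass(text_length, seq_length, batch_size):
    """Number of full batches when both batching steps drop their remainder."""
    n_sequences = text_length // (seq_length + 1)  # chunks of seq_length + 1 characters
    return n_sequences // batch_size

# text_length is a made-up value; seq_length and batch_size are the notebook's settings.
n = batches_per_pass(text_length=1_000_000, seq_length=280, batch_size=32)
print(n)
```

Running the same calculation with the actual `len(text)` gives a number that can be compared with the `steps_per_epoch` value passed to `model.fit`.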

Now the neural network is built. I use three layers: an Embedding layer first, followed by a stateful LSTM with the "Glorot Uniform" recurrent initializer and `return_sequences=True`, and a Dense output layer with 39 outputs, one per character in the vocabulary.

In [17]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 256
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [18]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (32, None, 256)           9984      
_________________________________________________________________
lstm (LSTM)                  (32, None, 256)           525312    
_________________________________________________________________
dense (Dense)                (32, None, 39)            10023     
Total params: 545,319
Trainable params: 545,319
Non-trainable params: 0
_________________________________________________________________
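
The parameter counts in the summary can be checked by hand. The following sketch recomputes them from the layer sizes, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size, embedding_dim, rnn_units = 39, 256, 256

# Embedding: one vector of length embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weight matrix plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```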


As the loss function I use sparse categorical crossentropy, and as the optimizer I use Adam.

In [20]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
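
For a single time step, this loss is the negative log of the softmax probability assigned to the correct character. The sketch below computes it in plain Python for illustration; the function name `sparse_crossentropy_from_logits` and the example logits are assumptions, and this is not the Keras implementation.

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """-log(softmax(logits)[label]), computed via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Made-up logits for a 3-class problem where class 0 is correct.
loss_value = sparse_crossentropy_from_logits(0, [2.0, 1.0, 0.1])

# Uniform logits over 39 characters: the loss of a model that guesses at random.
uniform_loss = sparse_crossentropy_from_logits(5, [0.0] * 39)
print(round(loss_value, 3), round(uniform_loss, 3))
```

The uniform case equals ln(39), which is a useful baseline: a training loss well below this value suggests the model has learned something about the character distribution.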

In [21]:
adam = tf.compat.v1.train.AdamOptimizer()
model.compile(optimizer=adam, loss=loss)

In [22]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

I fit the model for only 3 epochs of 100 steps each, because training would otherwise take too long on my machine.

In [24]:
EPOCHS=3
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback], steps_per_epoch= 100)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [25]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

I also found a helper function online that can generate a tweet for me. I only need to provide the start word.

In [26]:
def generate_text(model, start_string):
  # Number of characters to generate
  num_generate = 280
  # Converting our start string to numbers
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)
  text_generated = []
  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 0.6
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      predictions = tf.squeeze(predictions, 0)
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      input_eval = tf.expand_dims([predicted_id], 0)
      text_generated.append(idx2char[predicted_id])
  return (start_string + ''.join(text_generated))
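
The effect of `temperature` is easier to see on explicit probabilities. Dividing the logits by the temperature before sampling is equivalent to sampling from the softmax shown in this sketch; the helper `softmax_with_temperature` and the example logits are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.6, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With the notebook's setting of 0.6, the most likely character receives a larger share of the probability mass than at temperature 1.0, so the generated text is less random.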

Now we print a test tweet with the start word "america" (in lowercase, since the vocabulary contains no uppercase letters).

In [27]:
print(generate_text(model, start_string="america"))

americagrand trump of the leomed cotini got great and boke for corent the we is tha toul a hillory on he is jug thent dicling goter the more ow our in lecy ond th sump en i jund in wank is amp to noped coint conouter at presile wank we  that and on ejored iblice and got co perent lust c


And now we start another tweet with "china".

In [28]:
print(generate_text(model, start_string="china"))

chinadstrump trump in and thane hill thus shempest in are we wall gollery to kis a dos mecuring his soil bout anoumary a got is now thank count lut for dor and bum kernoul he im to tre pooder ist a thow she will ned sedence eprentiep at mofar ant cation the and of of the is an cand th
