# Creating a poem with Keras


Natural Language Processing - Tokenization (NLP Zero to Hero - Part 1) : https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S

The goal is to build a program with TensorFlow and Keras that generates coherent poetry, using a neural network that predicts words from other words.

## 0 - General library imports and text loading

In [1]:
import tensorflow 
from tensorflow import keras
import numpy as np
import requests
import random
import os

In [2]:
base_url  = 'https://raw.githubusercontent.com/MrCabss69/KerasTextClassification/main/resources/'
train_url = base_url + 'train.txt'
test_url  = base_url + 'test.txt'
textos    = []
for url in [train_url,test_url]:
  textos += requests.get(url).content.decode("utf-8").split('\n')
  textos.remove('')

In [3]:
N = len(textos)
print('Total samples: ',len(textos))
print(textos[:10])
random.shuffle(textos)

Total samples:  18000
['i didnt feel humiliated;sadness', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake;sadness', 'im grabbing a minute to post i feel greedy wrong;anger', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property;love', 'i am feeling grouchy;anger', 'ive been feeling a little burdened lately wasnt sure why that was;sadness', 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny;surprise', 'i feel as confused about life as a teenager or as jaded as a year old man;fear', 'i have been with petronas for years i feel that petronas has performed well and made a huge profit;joy', 'i feel romantic too;love']
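
Each line in the output above pairs a sentence with an emotion label, separated by a semicolon. A minimal sketch of splitting such a line in plain Python, using two of the printed samples; the helper name `split_sample` is hypothetical and is not part of the notebook:

```python
def split_sample(line):
    """Split 'text;label' at the last semicolon into (text, label)."""
    text, label = line.rsplit(';', 1)
    return text.strip(), label.strip()

samples = [
    'i didnt feel humiliated;sadness',
    'i feel romantic too;love',
]
pairs = [split_sample(s) for s in samples]
print(pairs)
```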


## 1 - Tokenization and text preprocessing

The model's training inputs are text strings (sentences) of fixed size, drawn from a vocabulary of limited size. The texts must be turned into numeric vectors by encoding each individual word.

Keras provides the `Tokenizer` class in `tensorflow.keras.preprocessing.text`, with several useful predefined methods.
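
To make the encoding step concrete, here is a simplified plain-Python sketch of the idea behind a word index: words are numbered from 1, most frequent first, and index 0 is left free for padding. The functions `build_word_index` and `to_sequence` are hypothetical names for illustration; they skip the punctuation filtering that the real `Tokenizer` applies.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

word_index = build_word_index(['i feel romantic too', 'i am feeling grouchy'])
print(word_index)
print(to_sequence('i feel grouchy', word_index))
```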
 


In [4]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Sequential, regularizers

In [5]:
# create the tokenizer with no limit on the number of words
tkn = Tokenizer()

# fit it on the texts
tkn.fit_on_texts(textos)

# the vocabulary is now indexed
w_idx = tkn.word_index

print(len(w_idx))
print(w_idx)

16184


In [6]:
seq = tkn.texts_to_sequences(textos)
print(seq)

[[1, 2, 6, 87, 402, 9, 84, 768, 219, 1306, 6, 867, 4, 1733, 25, 195, 21, 27, 7940, 885, 127, 1915, 3, 12, 2945, 1815, 27, 1239, 2453, 88, 134, 121, 203, 4, 127, 34, 14, 18, 949, 7941, 37, 6, 56, 11], [1, 59, 371, 27, 5, 294, 9, 1, 104, 47, 114, 2, 856, 18, 5, 537, 1, 295, 51, 3279, 3, 18, 5, 67, 1, 31, 13, 83, 7], [1, 20, 4, 4472, 5, 8, 10, 261, 3, 318, 11], [1, 26, 62, 1306, 927, 5, 200, 4473, 3, 1, 2, 256, 79], [1, 52, 538, 655, 9, 3764, 132, 25, 4, 19, 68, 408, 81, 38, 68, 800, 81, 3, 68, 80, 197, 112, 120, 18, 19, 134, 596, 7], [1, 62, 2, 16, 526, 4, 7942, 1916, 3, 45, 229, 72, 89, 4, 32, 380, 10, 7], [77, 51, 8, 82, 467, 162, 333, 11], [1, 75, 2, 460, 38, 1, 419, 108, 5, 5624, 4474, 5625, 7943, 37, 7944, 152, 12, 1917, 63, 310, 64, 13, 82, 701, 7], [1, 2, 376, 483, 3, 4475, 7], [1, 2, 65, 49, 5, 175, 10, 5, 409, 5626, 3280, 1432, 7945, 2150, 1208, 21, 328, 5, 420, 2454, 23], [17, 257, 1, 2, 9, 154, 637, 622, 22, 1077, 36, 18, 19, 11], [1, 246, 97, 74, 125, 17, 760, 41, 328, 29, 5,

To compare sentences of different lengths, or to give every sample the same number of variables/columns, the shorter sequences have to be filled with 0s. The longest sequence sets the maximum size (the initial number of columns).

In [7]:
padded = np.array(pad_sequences(seq, padding='pre',maxlen=max([len(s) for s in seq])))
s = padded.shape

In [8]:
print(padded.shape)
print('There are currently ',s[0],' sample sentences')
print('Each padded sequence currently has a length of ',s[1])

(18000, 67)
There are currently  18000  sample sentences
Each padded sequence currently has a length of  67


The padding function also allows filling with 0s in the highest-index columns (`padding='post'`) and setting a maximum sequence length different from the one found in the texts (for example `maxlen=5`).
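
The padding options can be illustrated with a small plain-Python sketch. The helper `pad` is hypothetical; it mirrors the `pad_sequences` defaults of padding at the front and, when a sequence is longer than `maxlen`, dropping items from the front.

```python
def pad(seq, maxlen, padding='pre', value=0):
    """Pad or truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]          # keep the last maxlen items when too long
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == 'pre' else seq + fill

seqs = [[1, 2, 6], [1, 59, 371, 27, 5]]
longest = max(len(s) for s in seqs)
pre = [pad(s, longest) for s in seqs]
post = [pad(s, longest, padding='post') for s in seqs]
short = pad([1, 59, 371, 27, 5], maxlen=3)
print(pre)
print(post)
print(short)
```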

## 2 - Splitting the data for training

In [9]:
idx_w = {v:k for k,v in w_idx.items()}

In [10]:
# train_s, test_s, val_s
T_SIZE = int(0.8*N)
# every column but the last holds the sentence; the last column holds the label token
train_x, labels = padded[:T_SIZE,:-1],padded[:T_SIZE,-1]

In [11]:
print('First sample: ', train_x[0])
print('As text: ', [ idx_w[w] for w in train_x[0] if w != 0])
print('Category', idx_w[labels[0]])

First sample:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    1    2    6   87  402
    9   84  768  219 1306    6  867    4 1733   25  195   21   27 7940
  885  127 1915    3   12 2945 1815   27 1239 2453   88  134  121  203
    4  127   34   14   18  949 7941   37    6   56]
As text:  ['i', 'feel', 'a', 'bit', 'ashamed', 'that', 'its', 'taken', 'us', 'nearly', 'a', 'month', 'to', 'build', 'this', 'thing', 'but', 'with', 'nathans', 'crazy', 'work', 'schedule', 'and', 'my', 'limited', 'abilities', 'with', 'power', 'tools', 'we', 'were', 'only', 'able', 'to', 'work', 'on', 'it', 'for', 'short', 'spurts', 'at', 'a', 'time']
Category sadness
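
The split works because `;` is among the characters the `Tokenizer` filters out by default, so the label becomes the final token of each sentence, and with front padding it always lands in the last column. A minimal sketch with plain lists; the rows are shortened and made up, with 11 and 7 standing for `sadness` and `joy` as in the recorded outputs:

```python
# Two front-padded rows; the final id in each row is the label token
rows = [
    [0, 0, 1, 2, 6, 11],
    [0, 1, 2, 65, 49, 7],
]
features = [row[:-1] for row in rows]   # like padded[:, :-1]
labels = [row[-1] for row in rows]      # like padded[:, -1]
print(features)
print(labels)
```

With `padding='post'` the last column of a short row would hold a 0 instead of the label, so this slicing depends on the front padding used earlier.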


In [12]:
#train_x = train_x / np.linalg.norm(train_x)
train_y = tensorflow.keras.utils.to_categorical(labels)
print(train_x.shape)
print(train_y.shape)

(14400, 66)
(14400, 80)
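
The 80 columns of `train_y` follow from how one-hot encoding sizes its output: when the number of classes is not given, it is the largest label value plus one. Since the labels here are word ids rather than compact class ids, most columns are never used. A plain-Python sketch of that behavior; `one_hot` is a hypothetical stand-in for `to_categorical`, and the remapping at the end is an optional alternative, not something the notebook does:

```python
def one_hot(labels, num_classes=None):
    """One row per label with a single 1 at the label's position."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1 if i == lab else 0 for i in range(num_classes)] for lab in labels]

labels = [11, 7, 11, 79]            # word ids used as class ids
wide = one_hot(labels)
print(len(wide[0]))                 # width is max(labels) + 1

# Hypothetical alternative: remap the ids to 0..k-1 before encoding
classes = sorted(set(labels))
compact = one_hot([classes.index(lab) for lab in labels])
print(classes, len(compact[0]))
```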


## 3 - Creating the model

Model code taken from the TF class

In [13]:
n_filters= 10
k_size = 100
p_size = 10

In [26]:
model = Sequential([
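    # input_dim is the number of rows in the embedding table; covering the whole
    # vocabulary would take len(w_idx) + 1, while this run passed the sequence length
    Embedding(input_dim=train_x.shape[1],output_dim=200),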
    Bidirectional(LSTM(units=150,kernel_regularizer=regularizers.L2(l=0.01))),
    Dropout(rate=0.1),
    Dense(units=200,activation='relu'),
    Dropout(rate=0.2),
    # softmax is the usual output activation for categorical_crossentropy; this run used sigmoid
    Dense(units=train_y.shape[1],activation='sigmoid')
])
model.compile(optimizer='adam', loss='categorical_crossentropy',  metrics=['accuracy'])
model.build()
history = model.fit(train_x, train_y, epochs=10, verbose=1)


Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
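
`categorical_crossentropy` compares a one-hot target with a probability distribution over the classes, which is what a softmax output layer produces; independent sigmoid outputs are not constrained to sum to 1. A small plain-Python sketch of the difference, with made-up logits:

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash a single value into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 1.0, 0.1]
soft_total = sum(softmax(logits))
sig_total = sum(sigmoid(x) for x in logits)
print(soft_total)    # a distribution over the classes
print(sig_total)     # independent scores, no such constraint
```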


In [27]:
#tensorflow.keras.utils.plot_model(model)
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 200)         13200     
                                                                 
 bidirectional_1 (Bidirectio  (None, 300)              421200    
 nal)                                                            
                                                                 
 dropout_2 (Dropout)         (None, 300)               0         
                                                                 
 dense_2 (Dense)             (None, 200)               60200     
                                                                 
 dropout_3 (Dropout)         (None, 200)               0         
                                                                 
 dense_3 (Dense)             (None, 80)                16080     
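
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper names are hypothetical.

```python
def embedding_params(input_dim, output_dim):
    # one output_dim-sized vector per row of the table
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

emb = embedding_params(66, 200)         # embedding_1
bi = 2 * lstm_params(200, 150)          # bidirectional_1: forward + backward LSTM
d2 = dense_params(300, 200)             # dense_2
d3 = dense_params(200, 80)              # dense_3
print(emb, bi, d2, d3)
```

The embedding count of 13200 = 66 × 200 shows that the table has 66 rows, one per position of the padded sequence, rather than one per word of the 16184-word vocabulary.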
                                                      

## 4 - Evaluating and testing the model



In [16]:
test_x, test_l = padded[T_SIZE:T_SIZE+500,:-1],padded[T_SIZE:T_SIZE+500,-1]
test_y = tensorflow.keras.utils.to_categorical(test_l)
print(test_x.shape)
print(test_y.shape)


(500, 66)
(500, 80)


In [17]:
model.evaluate(x=test_x,y=test_y,verbose=1)



[1.5914115905761719, 0.34200000762939453]
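
`model.evaluate` returns the loss followed by the compiled metrics, so the pair above reads as a loss of about 1.59 and an accuracy of about 0.34. A plain-Python sketch of how both numbers are defined, on made-up targets and predictions (not the notebook's data):

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over the samples of -sum(t * log(p))."""
    losses = [-sum(t * math.log(max(p, eps)) for t, p in zip(ts, ps))
              for ts, ps in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class is the true one."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]                 # made-up one-hot targets
y_pred = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]     # made-up predicted probabilities
loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```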

In [18]:
sample_text = ' It is a sad day'
tkn2 = Tokenizer()
tkn2.fit_on_texts([sample_text])
dw = tkn2.word_index

In [19]:
input = np.array([w_idx[k] for k in dw.keys()])
print(input)

[ 14  24   6 257 101]


In [20]:
padded = pad_sequences([input],maxlen=len(w_idx))
print(padded.shape)
print(padded)
padded = padded / np.linalg.norm(padded)


(1, 16184)
[[  0   0   0 ...   6 257 101]]
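
The model was trained on rows of 66 integer ids, and an `Embedding` layer looks those ids up in a table, so a sample fed to `predict` is normally padded to the same length and kept as integers. A plain-Python sketch under that assumption, using the ids printed above for the sample text; `TRAIN_LEN` is a hypothetical name for `train_x.shape[1]`:

```python
TRAIN_LEN = 66                          # train_x.shape[1] in the recorded output
sample_ids = [14, 24, 6, 257, 101]      # ids printed above for the sample text

# Front-pad with zeros up to the training length, keeping integer ids
row = [0] * (TRAIN_LEN - len(sample_ids)) + sample_ids
print(len(row), row[-5:])

# Dividing by the vector norm turns every id into a fraction below 1,
# so the values no longer identify words in the vocabulary
norm = sum(v * v for v in sample_ids) ** 0.5
scaled = [v / norm for v in sample_ids]
print([round(v, 3) for v in scaled])
```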


In [21]:
p = model.predict(padded)
print(p.shape)

(1, 80)


In [22]:
l = np.where(p[0] == max(p[0]))[0]

In [23]:
print(l)
print(idx_w[l[0]])

[7]
joy


In [24]:
idx = np.unravel_index(np.argmax(p), p.shape)

In [25]:
print(idx)
print()

(0, 7)

