# Predicting book genres from their summaries

This notebook describes the text mining technique used to predict a book's genre as accurately as possible. For now, the model predicts a single genre per book. The technique applied is word embedding (see the TensorFlow [tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings#word_embeddings_2)).

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D, Dropout
# TextVectorization is exposed directly under tensorflow.keras.layers since TF 2.6
from tensorflow.keras.layers import TextVectorization

In [None]:
# General model settings
BATCH_SIZE = 32
VOCAB_SIZE = 12000
SEQUENCE_LENGTH = 150
EMBEDDING_DIM = 128
EPOCHS = 2
N_GRAMS = 5

## Step 1: Load the preprocessed data

In [None]:
# Load the data used for single-genre prediction
data = pd.read_csv('./datos/datos_p1.csv')

In [None]:
display(data)

Compute the total number of genres.

In [None]:
genres = np.unique(data['Main Genre'])
num_genres = len(genres)

# Show the first 5 genres
print("The first 5 genres are: " + str(genres[:5]))
print("Total number of genres: " + str(num_genres))

Shuffle the data to avoid possible bias from the original row order.

In [None]:
data = data.sample(frac=1).reset_index(drop=True)

One-hot encode the output labels: each book gets a vector with a 1 in the position of its genre and 0 elsewhere.

In [None]:
label_vector = []
for i, label in enumerate(data["Main Genre"].values):
    label_vector.append([])
    for genre in genres:
        if label == genre:
            label_vector[i].append(1)
        else:
            label_vector[i].append(0)

label_vector = np.array(label_vector)
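
The nested loop above can also be written with NumPy broadcasting. The following is a minimal sketch of that alternative on toy data; the `labels` array is a hypothetical stand-in for `data["Main Genre"].values`, not the real dataset.

```python
import numpy as np

# Hypothetical toy labels standing in for data["Main Genre"].values
labels = np.array(["Fantasy", "History", "Fantasy", "Romance"])

# np.unique returns the sorted distinct genres, as in the notebook
genres = np.unique(labels)

# Broadcasting compares every label (rows) against every genre (columns)
label_vector = (labels[:, None] == genres[None, :]).astype(int)

print(genres)
print(label_vector)
```

Each row contains exactly one 1, so `np.argmax` on a row recovers the index of the genre in `genres`.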

In [None]:
label_vector.shape

In [None]:
label_vector[0]

Split the data into training and test sets (70% / 30%).

In [None]:
train_size = int(0.7 * data.shape[0])
train_X = data["Description"][:train_size].values
train_y = label_vector[:train_size]
test_X = data["Description"][train_size:].values
test_y = label_vector[train_size:]
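
Because the shuffle in this notebook is unseeded, each run produces a different split. The sketch below shows one way to make the shuffle and the 70/30 split repeatable. The seed value and the toy arrays are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

# Hypothetical stand-ins for the descriptions and their one-hot labels
descriptions = np.array([f"summary {i}" for i in range(10)])
labels = np.eye(2, dtype=int)[np.arange(10) % 2]

# A fixed seed (an assumption; the notebook shuffles without one) makes the split repeatable
rng = np.random.default_rng(seed=42)
order = rng.permutation(len(descriptions))

# Same 70/30 proportion as the notebook
train_size = int(0.7 * len(descriptions))
train_idx, test_idx = order[:train_size], order[train_size:]

train_X, train_y = descriptions[train_idx], labels[train_idx]
test_X, test_y = descriptions[test_idx], labels[test_idx]

print(len(train_X), len(test_X))
```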

In [None]:
# Convert the sets to the tf.data format expected by TensorFlow,
# batching with the BATCH_SIZE constant defined above
train_dataset = tf.data.Dataset.from_tensor_slices((train_X, train_y)).batch(BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_X, test_y)).batch(BATCH_SIZE)

## Step 2: Build the model


In [None]:
vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH,
    ngrams=N_GRAMS)
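
When `ngrams` is an integer, `TextVectorization` builds n-grams of every size up to that value, so `N_GRAMS = 5` adds word groups of 2 to 5 tokens alongside the single words. The plain-Python sketch below only illustrates the idea; `ngrams_up_to` is a hypothetical helper, and the order in which it emits the n-grams is not claimed to match the layer's.

```python
def ngrams_up_to(tokens, max_n):
    """Return all contiguous n-grams of sizes 1..max_n, joined by spaces."""
    grams = []
    for n in range(1, max_n + 1):
        # A sequence of length L has L - n + 1 n-grams (none if n > L)
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "a young wizard".split()
print(ngrams_up_to(tokens, 2))
# ['a', 'young', 'wizard', 'a young', 'young wizard']
```

Since all n-grams compete for the same `VOCAB_SIZE` slots and the same `SEQUENCE_LENGTH` positions, larger values of `ngrams` leave less room for individual words.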

In [None]:
# Compute token frequencies on the training text and build the vocabulary from them
vectorize_layer.adapt(train_X)
# To inspect the vocabulary: vectorize_layer.get_vocabulary()

In [None]:
model = Sequential([
  vectorize_layer,
  Embedding(VOCAB_SIZE, EMBEDDING_DIM, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(128, activation='relu'),
  Dropout(0.3),
  Dense(256, activation='relu'),
  Dropout(0.3),
  Dense(num_genres)
])
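
The `Embedding` and `GlobalAveragePooling1D` layers above can be pictured as a table lookup followed by a mean over the token positions. The NumPy sketch below illustrates this with toy sizes and random values; it is an assumption-laden illustration, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not the notebook's VOCAB_SIZE = 12000 and EMBEDDING_DIM = 128
vocab_size, embedding_dim = 6, 4
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# One batch with two already vectorized sequences of three token ids each
token_ids = np.array([[1, 2, 3],
                      [4, 5, 1]])

# Embedding: each token id selects one row of the matrix
embedded = embedding_matrix[token_ids]   # shape (2, 3, 4)

# Global average pooling: mean over the sequence axis
pooled = embedded.mean(axis=1)           # shape (2, 4)

print(embedded.shape, pooled.shape)
```

The pooled vector has a fixed size regardless of the sequence length, which is what lets the `Dense` layers that follow operate on summaries of any length.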

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.CategoricalAccuracy()])

In [None]:
train_dataset

In [None]:
historic = model.fit(
    train_dataset,
    validation_data=test_dataset, 
    epochs=EPOCHS)

In [None]:
model.summary()

In [None]:
plt.plot(historic.history['categorical_accuracy'])
plt.plot(historic.history['val_categorical_accuracy'])
plt.title('model categorical accuracy')
plt.ylabel('categorical accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.ylim([0, 1])
plt.show()

In [None]:
# Index of the epoch with the highest validation accuracy
index = historic.history['val_categorical_accuracy'].index(max(historic.history['val_categorical_accuracy']))

print("Final training accuracy:", historic.history['categorical_accuracy'][-1])
print("Final test accuracy:", historic.history['val_categorical_accuracy'][-1])

print()
print()

print("Training accuracy before overfitting:", historic.history['categorical_accuracy'][index])
print("Test accuracy before overfitting:", historic.history['val_categorical_accuracy'][index])

print("Epochs until overfitting (0-based index of the best epoch):", index)

In [None]:
vectorize_layer.get_vocabulary()

In [None]:
prediction_model = Sequential([model, tf.keras.layers.Softmax()])
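
The model was compiled with `from_logits=True`, so its last `Dense` layer outputs raw scores; the `Softmax` layer added here turns them into probabilities. The NumPy sketch below shows the transformation on hypothetical logits and checks that the predicted class does not change.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs for three genres
logits = np.array([[2.0, 0.5, -1.0]])
probabilities = softmax(logits)

print(probabilities.sum(axis=-1))                    # each row sums to 1
print(np.argmax(logits), np.argmax(probabilities))   # same index either way
```

Since softmax preserves the ordering of the scores, it matters only when the probabilities themselves are of interest, not for choosing the top genre.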

In [None]:
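# Compare the predicted and actual genre for the fourth example of the first test batch
for descriptions, labels in test_dataset.take(1):
  predicted_genre = np.argmax(prediction_model.predict(descriptions)[3])
  print("Predicted:", genres[predicted_genre])
  print("Actual:", genres[np.argmax(labels[3])])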

Use case



In [None]:
prediction = prediction_model.predict(["The Kingkiller Chronicle takes place in the fictional world of Temerant, a large continent of which the known part, called the Four Corners of Civilization, is divided into several distinct nations and cultures. Much of the world follows a religion similar, though not identical, to medieval Christianity. Coexisting alongside the mortal world is the realm of The Fae, a parallel universe inhabited by supernatural creatures which can move between the two realms only when the moon is full. Magic exists in Temerant, too, but obeys a well-defined set of rules and principles that can only be exploited by those who have trained in its professional and scientific use."])

In [None]:
predicted_genre = np.argmax(prediction)

In [None]:
print(genres[predicted_genre])

Actual: fantasy, adventure
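
Since the real book belongs to more than one genre, it can be informative to look at the highest-ranked genres rather than only the `argmax`. The sketch below does this with NumPy; the genre names and probabilities are made-up values for illustration, not outputs of the model.

```python
import numpy as np

# Hypothetical genre names and probabilities, not real model output
genre_names = np.array(["Adventure", "Fantasy", "History", "Romance"])
probabilities = np.array([0.30, 0.45, 0.05, 0.20])

# Sort indices by descending probability and keep the first top_k
top_k = 2
top_indices = np.argsort(probabilities)[::-1][:top_k]

for idx in top_indices:
    print(f"{genre_names[idx]}: {probabilities[idx]:.2f}")
```

With the notebook's objects, the same idea would apply to `prediction[0]` and `genres`.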

In [None]:
prediction = prediction_model.predict(["Our species, Homo sapiens, started out as just another, biologically insignificant, animal among countless others, but at some point made a very quick “jump” to domination over other species that has left man anxious, destructive, and often miserable despite all our efforts to be happy. Yuval Noah Harari’s book, ‘Sapiens,’ traces the origins, mechanisms, and effects of what we think of as “human progress” from small bands of hunter gatherers 100,000 years ago to the present-day global network through which our species has come to dominate the entire Earth. This bird’s-eye view of human history delves into the development of sapiens’ social organization and the structural features of human cultures through which our species has spread, replicated, and evolved, including the breakthroughs of human language with its infinite expressions, the use of the human imagination, and the rise of the concept of a better future that lies just beyond the horizon. Through these mechanisms, our species has evolved socially, rather than biologically, into a species that’s increasingly in control of our own destiny. This book considers not only how this evolution has occurred but questions of whether this “progress” has truly been beneficial and how we might, with our awareness of how we got here, decide where we want to go."])

In [None]:
predicted_genre = np.argmax(prediction)

In [None]:
print(genres[predicted_genre])

Actual: history, science, philosophy