# Estimating statistics from a *corpus*

Starting from the commentary of a football match, we apply several **Natural Language Processing** approaches and *machine learning* techniques to estimate the number of goals scored in the match.

## Match link
The match transcript used comes from the following link:

https://youtu.be/YV1eb5-JPrw

***

In [3]:
import numpy as np

## Labeling the sentences that correspond to a goal
For the models presented below, we take the commentators' sentences and define a vector with 1s at the positions where a goal occurred.

In [4]:
# Occurrences of the word "goal" with 20 words of context before and after
docs = ["this locker room with this score lane the threat that his team have produced [music] still looking for the go-ahead goal but it defended as well as he could against the barcelona team that isn't doing too much wrong could",
 'come back he will tonight in his dreams now keeping a few numbers back do not want to surrender a goal before the break if they can avoid it what a sonido got the ball and neymar needed to get',
 "be the goat as plenty of superstars out there to make the difference i think there's going to be a goal or two in this second novel but real madrid who's got the bench it could perhaps make a difference",
 'ricky by [applause] do not actually scored in his first two classic rows andreessen rica followed up of a game-tying goal the next year only on the warriors today i send a change between nes that and messi been going',
 "just can't find enough rainbow on that pass to take it over the top of marcel winter capisce the breakthrough goal coming from the beheaded boy niki sanders would we're back his little boy and shakira it's a wonderful hello",
 'even for the costa rica man the greatest gear for the greatest game in the world at soccer calm second goal of the season for pk raqqa ditch puts it in perfectly here i said there was a golden ray',
 "in recent years not quite back to his sevilla days we'll pass it off this time gets messy with the goal barcelona be a bit more patient not steal against roma is not the only derby of note the classical",
 "upstairs wanted to see a little bit more magic from this classical encounter knotted up at one won't see a goal for 56 minutes we see two and seven who is never satisfied that's for sure sounding like a man",
 'air set the table for kareem and the cream rises to the top again inside for iniesta when he first goal for ben zoma tying his watermark in spain still in the race for the pichichi all those looks away',
 'juan carlos moon zuy he steps up in place of luis enrique for every set play a free-kick around the goal and also a corner that was not an accident that was purposely by design neymar came across and when',
 'accident that was purposely by design neymar came across and when all the players were celebrating after bk scored the goal javier mascherano ran across to warn carlos and zuy and congratulated him back to you phil on the defensive',
 'benzema the player coming out a different look they see different animal the content here to place been fitting wonderful goal scorer for real madrid he brought them back into this game gets the caned i trace from the world-class',
 'chance for anybody once it got over and took that deflection but he catapulted that one horn when he first goal of the season in league play for benzema now this will likely push ronaldo higher up more into that',
 'earns the whistle real question now is for real madrid how much do they want to push for a second goal as barcelona looks content to bleed another game off the schedule as he stripped conseguido again si out whines',
 "feeds i don't see anything wrong there people i'm sorry on that angle there's nothing wrong that's a legitimately grid goal by gareth from this angle and again that's a very harsh call against the mayor breathing dragon that rises",
 'zinedine zidane he has come to life in the second half here on the bench for real madrid when the goal went in he celebrated with his technical staff with the substitute with everybody and he seemed to be the',
 'with the substitute with everybody and he seemed to be the last person inside camp now that realized that the goal was disallowed it was phenomenal viewing see dan clearly furious afterwards phil thank you very much ter daughter sobrino',
 "attempt at defending from alba and gareth again so close to not making lo do bravo and getting the go-ahead goal yet again there will be plenty to argue about back in the nation's capital oh ramos going in strong",
 'always categorizes not perhaps getting along with each other only got along each other lake heckle and jeckle their wonderful goal for the lead 7th minute is neymar bursting forward looking for an equaliser top of the box lays it',
 'back messi the flex kicked off by any and cleared we still might have some change but what a huge goal for ronaldo only the eight league clásico goal he is scored but the 16th of all time only disgusting',
 'and cleared we still might have some change but what a huge goal for ronaldo only the eight league clásico goal he is scored but the 16th of all time only disgusting chance for swat has been a push and',
 "of it now it seems as though barcelona with a spark they've got a man advantage but they're down a goal and two minutes left of regulation did they take it too easy the 39 match unbeaten streak about to"]

print('Number of sentences (examples): ', len(docs))

Number of sentences (examples):  22
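
The notebook does not show how the windows in `docs` were produced. As a minimal sketch, assuming simple whitespace tokenization, windows of `radius` words on each side of every occurrence of a keyword could be extracted as follows. The helper `context_windows` and the sample transcript are hypothetical and only illustrate the idea.

```python
def context_windows(text, keyword="goal", radius=20):
    """Return one window of words around each occurrence of `keyword`."""
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        if word == keyword:
            start = max(0, i - radius)  # do not run past the start of the text
            windows.append(" ".join(words[start:i + radius + 1]))
    return windows

# Made-up transcript fragment, with a small radius to keep the output short
transcript = "what a strike and that is a goal for the home side just before the break"
windows = context_windows(transcript, radius=3)
print(windows)
```

Two occurrences that are close together produce overlapping windows, which matches the repeated text visible in some consecutive entries of `docs`.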


In [5]:
# Mark the sentences in which a goal occurred
labels = np.zeros(len(docs))
labels[[8, 9, 21]] = 1.
labels.reshape(-1, 1)

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.]])

***
## Recurrent neural network (RNN) model with GloVe *embeddings*

Next, we set up a recurrent neural network to learn which sentences (sequences) indicate a higher probability of a goal.

### *Transfer learning* with GloVe
For this we use a pretrained GloVe *embedding* (https://nlp.stanford.edu/projects/glove/), specifically the 25-dimensional Twitter embedding file, because it captures semantic proximity between some of the football players mentioned in the transcript.

### LSTM with Keras
We fit a recurrent neural network with Keras to learn the sequences with the highest probability of a goal.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Flatten, Embedding
from tensorflow.keras.initializers import Constant

import tensorflow as tf
# Enable memory growth on the first GPU, if one is available
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)

In [6]:
# Useful constants
MAX_SEQUENCE_LENGTH = 45  # length of every padded sequence
EMBEDDING_DIM = 25        # dimension of the GloVe vectors

In [7]:
# Set up the Tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
# Convert the sentences to sequences of integer indices
sequences = t.texts_to_sequences(docs)

# Vocabulary and vocabulary size (index 0 is reserved for padding)
word_index = t.word_index
vocab_size = len(word_index) + 1
print("Vocabulary size: %d" % (vocab_size))

# Pad the sequences to a uniform length
data = sequence.pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding="pre")

Vocabulary size: 394


In [8]:
# Sequence for sentence 21
data[21, :]

array([  0,   0,   0,   0,   0,  11,  10,  24,  10, 379,  19, 380,  34,
        12,   2, 381, 382,  25,   2,  60, 383,  14, 384, 385,   2,   3,
         5,  39, 101, 386,  11, 387, 388,  36,  89,  10,  76, 389,   1,
       390, 391, 392, 393, 130,   6])
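
The leading zeros in the sequence above come from pre-padding: index 0 is reserved for padding, which is why the vocabulary size is `len(word_index) + 1`. The following plain-Python sketch mimics, in simplified form, what the tokenizer and `pad_sequences` do. The helpers `fit_word_index` and `pre_pad` are hypothetical, and unlike the Keras `Tokenizer` this version does not strip punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent words first."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def pre_pad(seq, maxlen):
    """Keep the last `maxlen` items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_docs = ["what a goal", "a goal for messi"]
toy_index = fit_word_index(toy_docs)
toy_seqs = [[toy_index[w] for w in d.split()] for d in toy_docs]
toy_data = [pre_pad(s, maxlen=5) for s in toy_seqs]
print(toy_index)
print(toy_data)
```

With pre-padding, the actual words sit at the end of each sequence, so the LSTM reads them last, just before producing its output.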

In [9]:
# Word index used to train the model
# word_index

## Load the *embedding* into memory

In [10]:
from gensim.models import KeyedVectors

In [11]:
# Load the Stanford GloVe Twitter model
word2vec_file = '../../Glove/glove.twitter.27B.25d.word2vec'
glove_model = KeyedVectors.load_word2vec_format(word2vec_file, binary=False)
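
The file loaded above has a `.word2vec` extension, which suggests the original GloVe text file was converted to the word2vec text format. The notebook does not show that step. The two formats differ only in a header line with the number of vectors and their dimension, so the conversion can be sketched in plain Python as below. The helper `glove_to_word2vec` and the toy file are hypothetical; gensim also ships a `glove2word2vec` script for this purpose.

```python
import os
import tempfile

def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<n_words> <dim>' header."""
    # First pass: read the dimension from the first line and count the vectors
    with open(glove_path, encoding="utf-8") as src:
        first = src.readline()
        dim = len(first.rstrip("\n").split(" ")) - 1
        n_words = 1 + sum(1 for _ in src)
    # Second pass: write the header, then copy every line unchanged
    with open(glove_path, encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{n_words} {dim}\n")
        for line in src:
            dst.write(line)
    return n_words, dim

# Toy GloVe-style file with two made-up 3-dimensional vectors
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy.glove.txt")
w2v_path = os.path.join(tmp_dir, "toy.word2vec")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("goal 0.1 0.2 0.3\nmessi 0.4 0.5 0.6\n")

n_words, dim = glove_to_word2vec(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)
```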

In [12]:
# Embedding of the word "hello"
glove_model['hello']

array([-0.77069  ,  0.12827  ,  0.33137  ,  0.0050893, -0.47605  ,
       -0.50116  ,  1.858    ,  1.0624   , -0.56511  ,  0.13328  ,
       -0.41918  , -0.14195  , -2.8555   , -0.57131  , -0.13418  ,
       -0.44922  ,  0.48591  , -0.6479   , -0.84238  ,  0.61669  ,
       -0.19824  , -0.57967  , -0.65885  ,  0.43928  , -0.50473  ],
      dtype=float32)

In [13]:
# Words closest to "cristiano" (Cristiano Ronaldo)
glove_model.most_similar(positive=["cristiano"], topn=5)

[('iniesta', 0.9716404676437378),
 ('messi', 0.9683734178543091),
 ('xavi', 0.9600971341133118),
 ('casillas', 0.9525176882743835),
 ('falcao', 0.945814847946167)]
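
The scores returned by `most_similar` are cosine similarities between word vectors. As a small illustration with made-up 2-dimensional vectors (the function name `cosine_similarity` is ours, not part of gensim), the measure can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors: b points 45 degrees away from a
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
sim_ab = cosine_similarity(a, b)
sim_aa = cosine_similarity(a, a)
print(sim_ab, sim_aa)
```

Values close to 1, such as those listed above for the other players, mean the vectors point in nearly the same direction.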

In [14]:
# Words closest to "guatemala"
glove_model.most_similar(positive=["guatemala"], topn=5)

[('paraguay', 0.958189845085144),
 ('ecuador', 0.9491326808929443),
 ('bolivia', 0.9409966468811035),
 ('nicaragua', 0.940276026725769),
 ('honduras', 0.9361501336097717)]

In [15]:
# Looking up a key that is not in the vocabulary raises a KeyError
try:
    print(glove_model['NO'])
except KeyError:
    print("The key is not defined")

The key is not defined


### Applying the *transfer learning* technique

In [16]:
# Build an embedding matrix for the vocabulary
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    try:
        embedding_vector = glove_model[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # Words not found in the GloVe model keep an all-zero vector
        continue

In [17]:
embedding_matrix.shape

(394, 25)
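
Words missing from GloVe end up as all-zero rows, so the model cannot tell them apart from padding. The notebook does not report how many words are affected. A minimal sketch of how that coverage could be measured, using a plain dictionary (`toy_vectors`) as a hypothetical stand-in for `glove_model`:

```python
import numpy as np

# Hypothetical stand-ins for glove_model and word_index
toy_vectors = {"goal": np.array([0.1, 0.2]), "messi": np.array([0.3, 0.4])}
toy_word_index = {"goal": 1, "messi": 2, "capisce": 3}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
missing = []
for word, i in toy_word_index.items():
    if word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]
    else:
        missing.append(word)  # row i stays at zero

coverage = 1 - len(missing) / len(toy_word_index)
print("Missing words:", missing)
print("Coverage: %.2f" % coverage)
```

Because the transcript appears to be automatically generated, misrecognized words are likely candidates for the missing list.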

In [18]:
# Load the embedding matrix into an Embedding layer
# We set trainable=False to keep the embedding fixed
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [19]:
embedding_layer

<tensorflow.python.keras.layers.embeddings.Embedding at 0x1c13a63fac8>

This layer is set as non-trainable because it already holds the pretrained weight vectors for the words in the model's sequences.

## Define the RNN model with Keras

In [20]:
# Define the sequential model
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(10))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Print the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 45, 25)            9850      
_________________________________________________________________
lstm (LSTM)                  (None, 10)                1440      
_________________________________________________________________
dense (Dense)                (None, 1)                 11        
Total params: 11,301
Trainable params: 1,451
Non-trainable params: 9,850
_________________________________________________________________
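
The parameter counts in the summary can be checked by hand. The embedding has one 25-dimensional vector per vocabulary entry. An LSTM has four gates, each with weights for the input and the recurrent state plus a bias. The dense layer has one weight per LSTM unit plus a bias. A short check of that arithmetic:

```python
vocab_size, embedding_dim, units = 394, 25, 10

# One vector per vocabulary entry (frozen, hence non-trainable)
embedding_params = vocab_size * embedding_dim
# Four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (embedding_dim + units) + units)
# One weight per LSTM unit plus a bias for the single sigmoid output
dense_params = units * 1 + 1

print(embedding_params, lstm_params, dense_params)
print("Total:", embedding_params + lstm_params + dense_params)
```

Only the 1,451 LSTM and dense parameters are updated during training.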


In [22]:
# Fit the model
#model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
history = model.fit(data, labels, epochs=100, verbose=1)

Train on 22 samples
Epoch 1/100
...
Epoch 76/100
...

In [23]:
# Compare the model's fitted values with the observed labels
np.column_stack((model.predict(data) , labels.reshape(-1, 1)))

array([[0.02021797, 0.        ],
       [0.01971275, 0.        ],
       [0.02188036, 0.        ],
       [0.02423301, 0.        ],
       [0.02097651, 0.        ],
       [0.02472274, 0.        ],
       [0.02494118, 0.        ],
       [0.01752526, 0.        ],
       [0.86930841, 1.        ],
       [0.82751679, 1.        ],
       [0.02429537, 0.        ],
       [0.01819284, 0.        ],
       [0.02426372, 0.        ],
       [0.02509282, 0.        ],
       [0.0235166 , 0.        ],
       [0.01813381, 0.        ],
       [0.02105876, 0.        ],
       [0.02858564, 0.        ],
       [0.03306217, 0.        ],
       [0.01912416, 0.        ],
       [0.02335004, 0.        ],
       [0.79572475, 1.        ]])
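
These probabilities are for the same 22 sentences used to fit the model, so they describe the fit rather than performance on unseen commentary. To turn them into a count, one option is to threshold them. The sketch below uses a few rows copied from the output above and assumes a 0.5 cutoff, which the notebook does not specify.

```python
import numpy as np

# Rows 0, 7, 8, 9, 18 and 21 of the table above: (probability, observed label)
probs = np.array([0.02021797, 0.01752526, 0.86930841,
                  0.82751679, 0.03306217, 0.79572475])
observed = np.array([0., 0., 1., 1., 0., 1.])

threshold = 0.5  # assumed cutoff
predicted = (probs >= threshold).astype(int)
flagged_sentences = int(predicted.sum())
accuracy = float((predicted == observed).mean())
print("Sentences flagged as goal:", flagged_sentences)
print("Accuracy on these rows:", accuracy)
```

Several windows can refer to the same goal, so the number of flagged sentences is not necessarily the number of goals in the match.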

In [24]:
# Predict a single value
model.predict(data[0].reshape(1, -1))

array([[0.02021797]], dtype=float32)

### Save the model

In [25]:
# Serialize the model architecture to JSON
model_json = model.to_json()
with open("nlp-football-model-lstm.json", "w") as json_file:
    json_file.write(model_json)

# Write the weights in HDF5 format
model.save_weights("nlp-football-model-lstm-weights.h5")
print("The model and its weights have been saved")

The model and its weights have been saved


### Load the trained model

In [27]:
from tensorflow.keras.models import model_from_json

In [29]:
# Load the JSON file with the model architecture
with open('nlp-football-model-lstm.json', 'r') as json_file:
    loaded_model_json = json_file.read()
# Build the model
loaded_model = model_from_json(loaded_model_json)
# Load the weights into the model
loaded_model.load_weights("nlp-football-model-lstm-weights.h5")
print("Model loaded from disk")

Model loaded from disk


In [30]:
loaded_model.predict(data[0].reshape(1, -1))

array([[0.02021798]], dtype=float32)
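
The reloaded model gives 0.02021798 where the original gave 0.02021797. The difference is in the last printed digit, which is within float32 rounding. When checking that a saved model round-trips correctly, a tolerance-based comparison is therefore safer than exact equality. A small sketch with the two values copied from the outputs above; the tolerance is an assumption:

```python
import numpy as np

original = np.array([[0.02021797]], dtype=np.float32)
reloaded = np.array([[0.02021798]], dtype=np.float32)

# Compare with an absolute tolerance instead of exact equality
same = bool(np.allclose(original, reloaded, atol=1e-6))
print("Predictions match:", same)
```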