# Gensim

## Introduction

Gensim is an open-source library for unsupervised topic modeling and natural language processing that uses modern statistical machine learning. This notebook walks through one way of using it.

Load the required libraries

In [1]:
import gensim.downloader as api
from gensim.models import Word2Vec

Check that the dataset metadata is not empty, then load the data and train a Word2Vec model on it

In [2]:
info = api.info("text8")
assert(len(info) > 0)
dataset = api.load("text8")
model = Word2Vec(dataset)

Save the trained model for later use

In [4]:
model.save('./text8-word2vec.bin')

Exploring the embedding space with gensim

In [5]:
from gensim.models import Word2Vec
# The file was written by Word2Vec.save(), so load it back with Word2Vec.load()
model = Word2Vec.load("text8-word2vec.bin")
word_vectors = model.wv

Print the first words of the vocabulary

In [7]:
words = word_vectors.vocab.keys()
print([x for i, x in enumerate(words) if i <10])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


In [8]:
assert("king" in words)

Find the words most similar to "king"

In [9]:
def print_most_similar(word_conf_pairs,k):
  for i, (word,conf) in enumerate(word_conf_pairs):
    print("{:.3f} {:s}".format(conf,word))
    if i >= k-1:
      break
print_most_similar(word_vectors.most_similar("king"),5)

Test relationships between words. In this case:
Paris is to France as Berlin is to **Germany**.

In [11]:
print_most_similar(word_vectors.most_similar(
    positive=['france','berlin'], negative=['paris']
),1)

0.792 germany
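
The analogy query above can be read as vector arithmetic: add the vectors of the positive words, subtract the negative ones, and look for the nearest remaining word by cosine similarity. The sketch below illustrates that idea on made-up 4-dimensional vectors. The `toy_vectors` values and the `analogy` helper are assumptions for illustration, not gensim code; gensim additionally normalizes each vector to unit length before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; axes are [country, capital, french, german].
toy_vectors = {
    "france":  [1.0, 0.0, 1.0, 0.0],
    "paris":   [0.0, 1.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 0.0, 1.0],
    "germany": [1.0, 0.0, 0.0, 1.0],
    "spain":   [1.0, 0.0, 0.0, 0.0],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    # The query words themselves are never returned as answers.
    excluded = set(positive) | set(negative)
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in excluded}
    best = max(scores, key=scores.get)
    return best, scores[best]

best_word, best_score = analogy(["france", "berlin"], ["paris"], toy_vectors)
print(best_word, round(best_score, 3))
```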


Using another similarity measure

In [12]:
print_most_similar(word_vectors.most_similar_cosmul(
    positive=['france','berlin'],negative=['paris']
),1)

0.971 germany
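
The multiplicative measure combines similarities by product instead of by sum: each cosine similarity is shifted into the range [0, 1], the positive terms are multiplied together, and the result is divided by the product of the negative terms. The sketch below shows that scoring rule on made-up vectors. The `toy_vectors` values and the `cosmul_score` helper are assumptions for illustration, and the small `eps` only guards against division by zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings; axes are [country, capital, french, german].
toy_vectors = {
    "france":  [1.0, 0.0, 1.0, 0.0],
    "paris":   [0.0, 1.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 0.0, 1.0],
    "germany": [1.0, 0.0, 0.0, 1.0],
    "spain":   [1.0, 0.0, 0.0, 0.0],
}

def cosmul_score(candidate, positive, negative, vectors, eps=1e-6):
    """Product of shifted positive similarities over product of negative ones."""
    pos = 1.0
    for w in positive:
        pos *= (1.0 + cosine(vectors[candidate], vectors[w])) / 2.0
    neg = 1.0
    for w in negative:
        neg *= (1.0 + cosine(vectors[candidate], vectors[w])) / 2.0
    return pos / (neg + eps)

positive, negative = ["france", "berlin"], ["paris"]
candidates = [w for w in toy_vectors if w not in positive + negative]
scores = {w: cosmul_score(w, positive, negative, toy_vectors) for w in candidates}
best_word = max(scores, key=scores.get)
print(best_word, {w: round(s, 3) for w, s in scores.items()})
```

Because the score is a ratio, it is not bounded by 1 the way a cosine similarity is.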


The **doesnt_match()** function can be used to detect the word that does not belong in a list

In [13]:
print(word_vectors.doesnt_match(["hindus", "parsis", "singapore",
"christians"]))

singapore
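
One way to picture how the odd word is found: normalize every word vector, average them, and return the word whose vector is least similar to that average. The sketch below does this with made-up 2-dimensional vectors. The `toy_vectors` values and the `odd_one_out` helper are assumptions for illustration, not gensim code.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the normalized vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Dot product of unit vectors equals their cosine similarity.
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical embeddings: axis 0 ~ "religious group", axis 1 ~ "place".
toy_vectors = {
    "hindus":     [0.90, 0.10],
    "parsis":     [0.80, 0.20],
    "christians": [0.85, 0.15],
    "singapore":  [0.10, 0.90],
}

odd = odd_one_out(["hindus", "parsis", "singapore", "christians"], toy_vectors)
print(odd)
```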




The word vectors can also be used to measure the similarity between two words

In [14]:
for word in ["woman","dog","whale","tree"]:
  print("similarity({:s}, {:s}) = {:.3f}".format(
      "man",word,
      word_vectors.similarity("man",word)
  ))

similarity(man, woman) = 0.712
similarity(man, dog) = 0.461
similarity(man, whale) = 0.286
similarity(man, tree) = 0.285


The **similar_by_word()** function is equivalent to calling most_similar() with a single word; the vectors are normalized before the similarity is computed

In [15]:
print_most_similar(word_vectors.similar_by_word('singapore'),5)

0.897 malaysia
0.844 indonesia
0.840 nepal
0.836 uganda
0.827 kenya


Compute the distance between two words with the **distance()** function.
It is equivalent to 1 - similarity().

In [16]:
print('distance(singapore,malaysia) = {:.3f}'.format(
    word_vectors.distance('singapore','malaysia')
))

distance(singapore,malaysia) = 0.103
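
The relation between the two measures is easy to check by hand: the similarity of 0.897 reported for "malaysia" and the distance of 0.103 add up to 1. The sketch below shows the same relation on made-up vectors. The vector values and the helper names are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical 3-dimensional embeddings for two words.
vec_singapore = [0.7, 0.2, 0.4]
vec_malaysia = [0.6, 0.3, 0.5]

sim = cosine_similarity(vec_singapore, vec_malaysia)
dist = cosine_distance(vec_singapore, vec_malaysia)
print("similarity = {:.3f}, distance = {:.3f}".format(sim, dist))
```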


And look up the vector for a given word

In [17]:
vec_song = word_vectors["song"]
vec_song_2 = word_vectors.word_vec("song", use_norm=True)

# Using word embeddings for spam detection

In [18]:
import argparse
import numpy as np
import os
import shutil
import tensorflow as tf

In [19]:
from sklearn.metrics import accuracy_score, confusion_matrix

Download the data

In [20]:
def download_and_read(url):
  local_file = url.split('/')[-1]
  p = tf.keras.utils.get_file(local_file, url,extract=True, cache_dir=".")
  labels, texts = [], []
  local_file = os.path.join("datasets", "SMSSpamCollection")
  with open(local_file, "r") as fin:
    for line in fin:
      label, text = line.strip().split('\t')
      labels.append(1 if label == "spam" else 0)
      texts.append(text)
  return texts, labels

In [21]:
DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
texts, labels = download_and_read(DATASET_URL)

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip


Prepare the data by converting the texts to integer sequences of equal length

In [22]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
print("{:d} sentences, max length: {:d}".format(num_records, max_seqlen))

5574 sentences, max length: 189
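
To make the preparation step concrete, here is a simplified pure-Python sketch of what the tokenizer and the padding do: words are numbered by decreasing frequency starting at 1, each text becomes a list of those numbers, and shorter lists are padded with zeros at the front (the default in `pad_sequences`). The helper names `build_word_index`, `to_sequences`, and `pad_front` are hypothetical, and the sketch skips the punctuation filtering that the Keras tokenizer performs.

```python
from collections import Counter

def build_word_index(texts):
    """Number words by decreasing frequency; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_front(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [[pad_value] * (max_len - len(s)) + s for s in sequences]

sample_texts = ["free prize now", "call me now please ok"]
word_index = build_word_index(sample_texts)
sequences = to_sequences(sample_texts, word_index)
padded = pad_front(sequences)
print(word_index)
print(padded)
```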


One-hot encode the labels

In [23]:
NUM_CLASSES = 2
cat_labels = tf.keras.utils.to_categorical(labels, num_classes=NUM_CLASSES)
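
One-hot encoding turns each integer label into a vector with a single 1 at the position of its class. A minimal pure-Python sketch of the same transformation follows; the `one_hot` helper is a hypothetical name used only for illustration.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the column of that label."""
    return [[1.0 if column == label else 0.0 for column in range(num_classes)]
            for label in labels]

# 0 = not spam, 1 = spam, following the labeling used in this notebook.
encoded = one_hot([0, 1, 0], num_classes=2)
print(encoded)
```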

Access the vocabulary through the **word_index** attribute, a dictionary that maps each word to its index

In [24]:
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)
print("vocab size: {:d}".format(vocab_size))

vocab size: 9010


Create the dataset for our classifier

In [25]:
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))
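# Shuffle once only; the default reshuffles on every pass, which would mix
# records between the test, validation, and training splits taken below.
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)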
test_size = num_records // 4
val_size = (num_records - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)

In [26]:
BATCH_SIZE = 128
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
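
The split and batch sizes follow from integer arithmetic on the 5574 records reported earlier. The sketch below works them out, including how many records `drop_remainder=True` discards from each split; the variable names are chosen for illustration.

```python
# Sizes implied by the notebook's own numbers (5574 records, batches of 128).
num_records = 5574
batch_size = 128

test_size = num_records // 4
val_size = (num_records - test_size) // 10
train_size = num_records - test_size - val_size

kept = {}
for name, size in [("test", test_size), ("val", val_size), ("train", train_size)]:
    full_batches = size // batch_size      # drop_remainder=True keeps full batches only
    kept[name] = full_batches * batch_size
    print("{}: {} records -> {} batches, {} kept, {} dropped".format(
        name, size, full_batches, kept[name], size - kept[name]))
```

The 1280 test records that survive batching are the same 1280 examples counted in the evaluation at the end of the notebook.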

Build the embedding matrix, using only the words in the vocabulary

In [27]:
def build_embedding_matrix(sequences, word2idx, embedding_dim,embedding_file):
  if os.path.exists(embedding_file):
    E = np.load(embedding_file)
  else:
    vocab_size = len(word2idx)
    E = np.zeros((vocab_size, embedding_dim))
    word_vectors = api.load(EMBEDDING_MODEL)
    for word, idx in word2idx.items():
      try:
        E[idx] = word_vectors.word_vec(word)
      except KeyError: # word not in embedding
          pass
    np.save(embedding_file, E)
  return E

EMBEDDING_DIM = 300
DATA_DIR = "./"
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")
EMBEDDING_MODEL = "glove-wiki-gigaword-300"
E = build_embedding_matrix(text_sequences, word2idx,
EMBEDDING_DIM,
EMBEDDING_NUMPY_FILE)
print("Embedding matrix:", E.shape)

Embedding matrix: (9010, 300)


# Defining the spam classifier

In [28]:
class SpamClassifierModel(tf.keras.Model):
    def __init__(self, vocab_sz, embed_sz, input_length,
            num_filters, kernel_sz, output_sz, 
            run_mode, embedding_weights, 
            **kwargs):
        super(SpamClassifierModel, self).__init__(**kwargs)
        if run_mode == "scratch":
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                trainable=True)
        elif run_mode == "vectorizer":
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                weights=[embedding_weights],
                trainable=False)
        else:
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                weights=[embedding_weights],
                trainable=True)
        self.dropout = tf.keras.layers.SpatialDropout1D(0.2)
        self.conv = tf.keras.layers.Conv1D(filters=num_filters,
            kernel_size=kernel_sz,
            activation="relu")
        self.pool = tf.keras.layers.GlobalMaxPooling1D()
        self.dense = tf.keras.layers.Dense(output_sz, 
            activation="softmax"
        )

    def call(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.conv(x)
        x = self.pool(x)
        x = self.dense(x)
        return x

Instantiate the model

In [29]:
conv_num_filters = 256
conv_kernel_size = 3
model = SpamClassifierModel(
    vocab_size, EMBEDDING_DIM, max_seqlen,
    conv_num_filters, conv_kernel_size, NUM_CLASSES,
    'scratch', E)
model.build(input_shape=(None, max_seqlen))

In [30]:
model.summary()

Model: "spam_classifier_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        multiple                  2703000   
_________________________________________________________________
spatial_dropout1d (SpatialDr multiple                  0         
_________________________________________________________________
conv1d (Conv1D)              multiple                  230656    
_________________________________________________________________
global_max_pooling1d (Global multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  514       
Total params: 2,934,170
Trainable params: 2,934,170
Non-trainable params: 0
_________________________________________________________________
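
The parameter counts in the summary can be reproduced by hand from the layer shapes: the embedding stores one 300-dimensional vector per vocabulary entry, each convolution filter spans 3 positions of 300 values plus a bias, and the dense layer maps 256 pooled features to 2 classes plus biases. A short check, with variable names chosen for illustration:

```python
vocab_size, embed_dim = 9010, 300
num_filters, kernel_size, num_classes = 256, 3, 2

embedding_params = vocab_size * embed_dim                           # one vector per word
conv_params = kernel_size * embed_dim * num_filters + num_filters   # weights + biases
dense_params = num_filters * num_classes + num_classes              # weights + biases
total_params = embedding_params + conv_params + dense_params

print(embedding_params, conv_params, dense_params, total_params)
```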


Compile the model

In [31]:
model.compile(optimizer="adam", loss="categorical_crossentropy",
metrics=["accuracy"])

Train and evaluate the model

Because the data is imbalanced (747 spam versus 4827 non-spam messages), **CLASS_WEIGHTS** makes an error on a spam message count 8 times as much as an error on a non-spam message

In [32]:
NUM_EPOCHS = 3
CLASS_WEIGHTS = { 0: 1, 1: 8 }
BATCH_SIZE = 128
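
The effect of class weights can be sketched as a weighted cross-entropy: each example's loss is multiplied by the weight of its true class before averaging. The `weighted_cross_entropy` helper below is a hypothetical illustration of that idea, not the exact reduction Keras applies internally. It also shows that the chosen weight of 8 is somewhat above the raw class ratio of roughly 6.5.

```python
import math

def weighted_cross_entropy(true_labels, predicted_probs, class_weights):
    """Average of -log p(true class), scaled by the weight of the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += class_weights[label] * -math.log(probs[label])
    return total / len(true_labels)

# Class counts quoted in the notebook: 4827 non-spam, 747 spam.
class_ratio = 4827 / 747
print("non-spam to spam ratio: {:.2f}".format(class_ratio))

# One spam message (label 1) that the model rates as only 10% likely to be spam.
true_labels = [1]
predicted_probs = [[0.9, 0.1]]

unweighted = weighted_cross_entropy(true_labels, predicted_probs, {0: 1, 1: 1})
weighted = weighted_cross_entropy(true_labels, predicted_probs, {0: 1, 1: 8})
print("loss without weights: {:.3f}, with weights: {:.3f}".format(unweighted, weighted))
```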

In [33]:
model.fit(train_dataset, epochs=NUM_EPOCHS, validation_data=val_dataset,class_weight=CLASS_WEIGHTS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f82915da510>

Evaluate on the test data

In [34]:
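labels, predictions = [], []
for Xtest, Ytest in test_dataset:
  Ytest_ = model.predict_on_batch(Xtest)
  ytest = np.argmax(Ytest, axis=1)    # true class indices
  ytest_ = np.argmax(Ytest_, axis=1)  # predicted class indices
  labels.extend(ytest.tolist())
  predictions.extend(ytest_.tolist())  # collect the predictions, not the true labels
print("test accuracy: {:.3f}".format(accuracy_score(labels,predictions)))
print("confusion matrix")
print(confusion_matrix(labels, predictions))

The output recorded with this notebook (test accuracy 1.000, confusion matrix [[1123 0] [0 157]]) was produced with `predictions` filled from `ytest`, the true labels. It therefore only shows the class counts in the test batches (1123 non-spam, 157 spam), not the accuracy of the classifier.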
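
Both metrics are simple counts over pairs of true and predicted labels. The sketch below computes them by hand on a small made-up example, with rows of the confusion matrix indexed by the true class and columns by the predicted class, the same layout scikit-learn uses. The helper names and the sample labels are assumptions for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where prediction and true label agree."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

def confusion(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts examples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Made-up labels: 0 = not spam, 1 = spam.
sample_true = [0, 0, 1, 1, 0]
sample_pred = [0, 1, 1, 0, 0]

acc = accuracy(sample_true, sample_pred)
matrix = confusion(sample_true, sample_pred, num_classes=2)
print("accuracy: {:.3f}".format(acc))
print(matrix)
```

A matrix with zeros off the diagonal and an accuracy of exactly 1 means every prediction equals its label, which is worth double-checking whenever it appears.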
