<a href="https://colab.research.google.com/github/Backto77/Machine-Learning/blob/master/Clasificador_de_Spam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Installing and Importing Dependencies**

In [0]:
!pip3 install keras scikit-learn tqdm numpy keras_metrics

Collecting keras_metrics
  Downloading https://files.pythonhosted.org/packages/32/c9/a87420da8e73de944e63a8e9cdcfb1f03ca31a7c4cdcdbd45d2cdf13275a/keras_metrics-1.1.0-py2.py3-none-any.whl
Installing collected packages: keras-metrics
Successfully installed keras-metrics-1.1.0


In [0]:
import tqdm
import numpy as np
import keras_metrics # for recall and precision metrics
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, TensorBoard
from sklearn.model_selection import train_test_split
import time
import pickle

Using TensorFlow backend.


**Let's define some hyperparameters:**

In [0]:
SEQUENCE_LENGTH = 100 # length of every sequence (number of words per sample)
EMBEDDING_SIZE = 100  # using 100-dimensional GloVe embedding vectors
TEST_SIZE = 0.25 # fraction of the data held out as the test set

BATCH_SIZE = 64
EPOCHS = 20 # number of epochs

# to convert labels to integers and vice-versa
label2int = {"ham": 0, "spam": 1}
int2label = {0: "ham", 1: "spam"}

# **2. Loading the Dataset**

**We will use this dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip and extract it into a folder called "data".**

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

--2019-09-24 04:43:12--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2019-09-24 04:43:12 (1.65 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]



In [0]:
from zipfile import ZipFile
with ZipFile('smsspamcollection.zip', 'r') as zf:
    zf.extractall('data/')

In [0]:
def load_data():
    """
    Loads SMS Spam Collection dataset
    """
    texts, labels = [], []
    with open("data/SMSSpamCollection") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())
            texts.append(' '.join(split[1:]).strip())
    return texts, labels
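To make the parsing step concrete, here is a minimal sketch of how a single line is split into a label and a message. The sample lines are made up for illustration, and the sketch assumes, as `load_data` does, that the label is the first whitespace-delimited token on each line.

```python
def parse_line(line):
    """Split one raw line into (label, text), mirroring the loop in load_data."""
    parts = line.split()          # split on any whitespace (tabs, spaces, newline)
    label = parts[0]              # first token is the label: "ham" or "spam"
    text = " ".join(parts[1:])    # the rest is the message, re-joined with single spaces
    return label, text

# Hypothetical sample lines, not taken from the dataset
samples = [
    "ham\tSee you at lunch?",
    "spam\tWIN a   FREE prize now\n",
]
parsed = [parse_line(s) for s in samples]
print(parsed)
```

Note that splitting and re-joining collapses runs of whitespace inside the message, which is harmless here because the tokenizer splits on whitespace anyway.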

**Let's call the function:**

In [0]:
# load the data
X, y = load_data()

# **3. Preparing the Dataset**

In [0]:
# Text tokenization
# vectorize the text, turning each message into a sequence of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# convert to sequences of integers
X = tokenizer.texts_to_sequences(X)
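The idea behind this step can be pictured with a simplified, plain-Python stand-in. It is only a sketch: the regular expression and the helper names `build_word_index` and `to_sequence` are assumptions made for illustration, and the real `Tokenizer` applies its own filtering rules. What it shares with the real thing is the principle that more frequent words get smaller indices and that index 0 is left free.

```python
from collections import Counter
import re

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # index 0 is left unused so it can serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Convert a text to a list of integers, skipping words not in the vocabulary."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[w] for w in words if w in word_index]

texts = ["free prize now", "call me now", "free free call"]  # hypothetical messages
word_index = build_word_index(texts)
sequences = [to_sequence(t, word_index) for t in texts]
print(word_index)
print(sequences)
```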

**Let's print the first sample:**

In [0]:
print(X[0])

[49, 471, 4435, 842, 755, 658, 64, 8, 1327, 88, 123, 351, 1328, 148, 2996, 1329, 67, 58, 4436, 144]


**The output is a list of integers, each corresponding to a word in the vocabulary, which is the form the neural network needs. However, the samples do not all have the same length, so we need a way to produce fixed-length sequences.**

**To do that, we use *keras.preprocessing.sequence.pad_sequences()*, which pads each sequence with zeros at the beginning:**

In [0]:
# convert the labels to a numpy array
y = np.array(y)
# pad each sequence with zeros at the beginning
# (pad_sequences accepts the list of lists directly and returns a numpy array)
# for example, if SEQUENCE_LENGTH=4:
# [[5, 3, 2], [5, 1, 2, 3], [3, 4]]
# will be transformed to:
# [[0, 5, 3, 2], [5, 1, 2, 3], [0, 0, 3, 4]]
X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)

**As you may recall, we set SEQUENCE_LENGTH to 100, so every sequence now has a length of 100.**
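The padding behavior can be sketched in plain Python. This is an illustration rather than the Keras implementation: `pad_pre` is a hypothetical helper, and it assumes the default settings, where zeros are added at the front and sequences longer than `maxlen` lose their earliest items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

short = pad_pre([[5, 3, 2], [5, 1, 2, 3], [3, 4]], maxlen=4)
long = pad_pre([[1, 2, 3, 4, 5]], maxlen=4)
print(short)
print(long)
```

The first call reproduces the example from the code comment above; the second shows that a message longer than `maxlen` keeps its final words.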

**Our labels are also text, but we will handle them differently: since the only labels are "spam" and "ham", we one-hot encode them:**

In [0]:
# One-hot encode the labels
# [spam, ham, spam, ham, ham] will be converted to:
# [1, 0, 1, 0, 0] and then to:
# [[0, 1], [1, 0], [0, 1], [1, 0], [1, 0]]

y = [ label2int[label] for label in y ]
y = to_categorical(y)

**We use *keras.utils.to_categorical()* here, which converts the integer class labels into one-hot vectors. Let's print the first label:**

In [0]:
print(y[0])

[1. 0.]


**That means the first sample is "ham".**
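The two-step label encoding can be reproduced in plain Python to see why `[1. 0.]` means "ham". The `one_hot` helper below is a hypothetical stand-in for `to_categorical`, written only for this illustration.

```python
label2int = {"ham": 0, "spam": 1}

def one_hot(index, num_classes=2):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

labels = ["spam", "ham", "spam", "ham", "ham"]   # hypothetical labels
as_ints = [label2int[label] for label in labels]
encoded = [one_hot(i) for i in as_ints]
print(as_ints)
print(encoded)
```

Position 0 of each vector stands for "ham" and position 1 for "spam", so a 1 in the first slot marks a legitimate message.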

**Next, let's shuffle the data and split it into training and test sets:**

In [0]:
# shuffle and split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)
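As a rough picture of what a shuffled split does, here is a plain-Python sketch. `shuffled_split` is a hypothetical helper, not scikit-learn's algorithm, so the actual ordering will differ; it assumes the test size is rounded up, which is consistent with the 4180/1394 shapes printed later in the notebook for 5,574 samples.

```python
import math
import random

def shuffled_split(items, test_size, seed):
    """Shuffle indices with a fixed seed, then carve off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))   # assumed rounding: test size rounds up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

data = list(range(5574))   # stand-in for the 5,574 samples
train, test = shuffled_split(data, test_size=0.25, seed=7)
print(len(train), len(test))
```

Fixing the seed, as `random_state=7` does above, makes the split reproducible from run to run.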

# **4. Building the Model**

**Let's start by writing a function that loads the pretrained embedding vectors:**

In [0]:
def get_embedding_vectors(tokenizer, dim=100):
    embedding_index = {}
    with open(f"data/glove.6B.{dim}d.txt", encoding='utf8') as f:
        for line in tqdm.tqdm(f, "Reading GloVe"):
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embedding_index[word] = vectors

    word_index = tokenizer.word_index
    embedding_matrix = np.zeros((len(word_index)+1, dim))
    for word, i in word_index.items():
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            # words not found will be 0s
            embedding_matrix[i] = embedding_vector
            
    return embedding_matrix
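The second half of this function is easier to follow on a toy example. The sketch below uses made-up 3-dimensional vectors and a made-up vocabulary (both are assumptions for illustration) and plain lists instead of numpy, but the lookup logic is the same: row `i` holds the pretrained vector for the word with index `i`, and words without a pretrained vector keep a row of zeros.

```python
# Hypothetical 3-dimensional "pretrained" vectors; the notebook uses 100-dimensional GloVe vectors
pretrained = {
    "free": [0.1, 0.2, 0.3],
    "call": [0.4, 0.5, 0.6],
}
word_index = {"free": 1, "call": 2, "xxxprize": 3}   # hypothetical tokenizer vocabulary
dim = 3

# one row per vocabulary index, plus row 0, which is reserved for padding
embedding_matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)   # words not found stay all zeros

for row in embedding_matrix:
    print(row)
```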

In [0]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2019-09-24 04:16:38--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-09-24 04:16:38--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2019-09-24 04:16:39--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-0

In [0]:
from zipfile import ZipFile
with ZipFile('glove.6B.zip', 'r') as zf:
    zf.extractall('data/')

**Let's define the function that builds the model:**

In [0]:
def get_model(tokenizer, lstm_units):
    """
    Constructs the model,
    Embedding vectors => LSTM => 2 output Fully-Connected neurons with softmax activation
    """
    # get the GloVe embedding vectors
    embedding_matrix = get_embedding_vectors(tokenizer, dim=EMBEDDING_SIZE)
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index)+1,
              EMBEDDING_SIZE,
              weights=[embedding_matrix],
              trainable=False,
              input_length=SEQUENCE_LENGTH))

    model.add(LSTM(lstm_units, recurrent_dropout=0.2))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation="softmax"))
    # compile with the rmsprop optimizer,
    # tracking precision and recall in addition to accuracy
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy", keras_metrics.precision(), keras_metrics.recall()])
    model.summary()
    return model

**Note that accuracy alone is not enough to judge whether the model is performing well, because this dataset is imbalanced: only a small fraction of the samples are spam. For that reason, we also track precision and recall.**
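A small thought experiment shows why accuracy can mislead on imbalanced data. The label counts below are hypothetical (they are not the dataset's real proportions), and the "model" is a deliberately useless one that always answers "ham".

```python
# Hypothetical label distribution: 1 = spam, 0 = ham (not the dataset's real counts)
y_true = [1] * 13 + [0] * 87
y_pred = [0] * len(y_true)   # a useless model that always answers "ham"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"spam messages caught: {spam_caught} of {sum(y_true)}")
```

The accuracy looks respectable even though not a single spam message is detected, which is exactly the failure that recall exposes.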

**Let's call the function:**

In [0]:
# build the model with 128 LSTM units
model = get_model(tokenizer=tokenizer, lstm_units=128)

Reading GloVe: 400000it [00:12, 31641.63it/s]







Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          901000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258       
Total params: 1,018,506
Trainable params: 117,506
Non-trainable params: 901,000
_________________________________________________________________


# **5. Training the Model**

In [0]:
# initialize our ModelCheckpoint and TensorBoard callbacks
# model checkpoint for saving the best weights
model_checkpoint = ModelCheckpoint("logs/spam_classifier_{val_loss:.2f}", save_best_only=True,
                                    verbose=1)
# for better visualization
tensorboard = TensorBoard(f"logs/spam_classifier_{time.time()}")
# print our data shapes
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("y_train.shape:", y_train.shape)
print("y_test.shape:", y_test.shape)
# train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE, epochs=EPOCHS,
          callbacks=[tensorboard, model_checkpoint],
          verbose=1)

X_train.shape: (4180, 100)
X_test.shape: (1394, 100)
y_train.shape: (4180, 2)
y_test.shape: (1394, 2)
Train on 4180 samples, validate on 1394 samples
Epoch 1/20

Epoch 00001: val_loss improved from inf to 0.08234, saving model to logs/spam_classifier_0.08
Epoch 2/20

Epoch 00002: val_loss improved from 0.08234 to 0.07535, saving model to logs/spam_classifier_0.08
Epoch 3/20

Epoch 00003: val_loss improved from 0.07535 to 0.06858, saving model to logs/spam_classifier_0.07
Epoch 4/20

Epoch 00004: val_loss did not improve from 0.06858
Epoch 5/20

Epoch 00005: val_loss did not improve from 0.06858
Epoch 6/20

Epoch 00006: val_loss did not improve from 0.06858
Epoch 7/20

Epoch 00007: val_loss did not improve from 0.06858
Epoch 8/20

Epoch 00008: val_loss improved from 0.06858 to 0.06414, saving model to logs/spam_classifier_0.06
Epoch 9/20

Epoch 00009: val_loss did not improve from 0.06414
Epoch 10/20

Epoch 00010: val_loss improved from 0.06414 to 0.06355, saving model to logs/spam_clas

<keras.callbacks.History at 0x7f135065cb00>
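One detail in the log is worth a closer look: epochs 1 and 2 both save to `logs/spam_classifier_0.08`. The checkpoint path is a format template that, as far as we understand the callback, is filled in with the logged metrics at each save, so two different losses can round to the same file name. The sketch below simply applies Python string formatting to the validation losses from the log.

```python
template = "logs/spam_classifier_{val_loss:.2f}"

# validation losses copied from the training log above
val_losses = [0.08234, 0.07535, 0.06858, 0.06414]
names = [template.format(val_loss=v) for v in val_losses]

for v, name in zip(val_losses, names):
    print(v, "->", name)
```

With two decimal places the first two checkpoints share a name, so the later one overwrites the earlier. Adding more decimals or an `{epoch}` field to the template would keep them apart, if that matters for your workflow.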

# **6. Evaluating the Model**

In [0]:
# get the loss and the metrics
result = model.evaluate(X_test, y_test)
# unpack them
loss = result[0]
accuracy = result[1]
precision = result[2]
recall = result[3]

print(f"[+] Accuracy: {accuracy*100:.2f}%")
print(f"[+] Precision:   {precision*100:.2f}%")
print(f"[+] Recall:   {recall*100:.2f}%")

[+] Accuracy: 98.78%
[+] Precision:   99.09%
[+] Recall:   99.50%


Here is what each metric means:

*   **Accuracy:** percentage of predictions that are correct.
*   **Recall:** percentage of actual spam messages that were correctly identified as spam.
*   **Precision:** percentage of messages classified as spam that really were spam.
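These three definitions translate directly into counts of true positives, false positives, and false negatives. The sketch below computes them in plain Python for a handful of made-up predictions, treating "spam" as the positive class; `spam_metrics` is a hypothetical helper and not part of `keras_metrics`.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall) with `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)   # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)   # spam missed
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predictions, for illustration only
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

accuracy, precision, recall = spam_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```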



**Great! Let's try it out:**

In [0]:
def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad the sequence
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # get the prediction
    prediction = model.predict(sequence)[0]
    # the output is a vector of class probabilities; map it back to a label with np.argmax
    return int2label[np.argmax(prediction)]
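The last line of the function picks the class with the highest predicted probability. A plain-Python sketch of that decoding step, using made-up probability vectors in place of real model output, looks like this:

```python
int2label = {0: "ham", 1: "spam"}

def decode(probabilities):
    """Return the label whose position holds the largest probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return int2label[best_index]

# Hypothetical softmax outputs: [P(ham), P(spam)]
print(decode([0.03, 0.97]))
print(decode([0.91, 0.09]))
```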

**Let's write a spam message:**

In [0]:
text = "Congratulations! you have won 100,000$ this week, click here to claim fast"
print(get_predictions(text))

spam


**OK, now a legitimate one:**

In [0]:
text = "Hi man, I was wondering if we can meet tomorrow."
print(get_predictions(text))

ham
