# Introduction to Sklearn 🧠

In [None]:
import csv
import pandas as pd
import numpy as np

Spam or ham (spam or not spam)?

You can download the dataset [here](https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv).

#### Spam 
> Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's  

#### Ham
> Oops, I'll let you know when my roommate's done


In [None]:
# Download the dataset
![ ! -f spam.csv ] && wget https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv

--2019-10-14 19:13:40--  https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503663 (492K) [text/plain]
Saving to: ‘spam.csv’


2019-10-14 19:13:40 (24.9 MB/s) - ‘spam.csv’ saved [503663/503663]



In [None]:
spam_or_ham = pd.read_csv("spam.csv", encoding='latin-1')[["v1", "v2"]]
spam_or_ham.columns = ["label", "text"]
spam_or_ham.head()

|       | label | text                                                |
|-------|-------|-----------------------------------------------------|
| **0** | ham   | Go until jurong point, crazy.. Available only ...   |
| **1** | ham   | Ok lar... Joking wif u oni...                       |
| **2** | spam  | Free entry in 2 a wkly comp to win FA Cup fina...   |
| **3** | ham   | U dun say so early hor... U c already then say...   |
| **4** | ham   | Nah I don't think he goes to usf, he lives aro...   |


In [None]:
spam_or_ham["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## Vectorization

**Tokenization**: splitting a paragraph or sentence into units (tokens); usually each word is a token.

Here, our `tokenize` function is quite simple (and inefficient), but it serves our simple purpose.

**Stopword removal**: removing irrelevant tokens, such as common words and sometimes punctuation marks.

In our case, we only remove punctuation symbols, with the help of the `punctuation` set.

In [None]:
import string
punctuation = set(string.punctuation)

def tokenize(sentence):
    tokens = []
    for token in sentence.split():
        new_token = []
        for character in token:
            if character not in punctuation:
                new_token.append(character.lower())
        if new_token:
            tokens.append("".join(new_token))
    return tokens

In [None]:
tokenize("Go until jurong point, crazy.. ")

['go', 'until', 'jurong', 'point', 'crazy']
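
Since the `tokenize` function above is described as simple but inefficient, here is one possible faster variant. It is only a sketch: `tokenize_fast` is a hypothetical name, not part of the original notebook, and it relies on `str.translate` to delete punctuation in a single pass. For ordinary ASCII text such as these SMS messages it should produce the same tokens as `tokenize`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def tokenize_fast(sentence):
    """Strip punctuation, lowercase, and split on whitespace."""
    return sentence.translate(_DROP_PUNCTUATION).lower().split()

print(tokenize_fast("Go until jurong point, crazy.. "))
# ['go', 'until', 'jurong', 'point', 'crazy']
```

Because it takes a string and returns a list of tokens, a function like this could be passed to `CountVectorizer` through the `tokenizer` parameter in the same way as `tokenize`.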

In [None]:
spam_or_ham.head()["text"].apply(tokenize)

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: text, dtype: object
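
The notebook only strips punctuation, but the same token lists could also be filtered for common words. The following is a minimal sketch under stated assumptions: `STOPWORDS` is a tiny hand-picked set used purely for illustration, and `remove_stopwords` is a hypothetical helper that is not used anywhere else in this notebook.

```python
# Hypothetical, hand-picked stopword set for illustration only;
# real projects usually rely on a curated list.
STOPWORDS = {"a", "i", "in", "to", "u", "he", "so", "until"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, keeping their order."""
    return [token for token in tokens if token not in stopwords]

# Tokens as produced by tokenize("Go until jurong point, crazy.. ")
tokens = ["go", "until", "jurong", "point", "crazy"]
print(remove_stopwords(tokens))
# ['go', 'jurong', 'point', 'crazy']
```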

**Stemming/Lemmatization**: reducing each token to its base form, for example (in Spanish): {“biblioteca”, “bibliotecario”, “bibliotecas”} → “bibliotec”.

We skip this step here, but if you need it, see resources such as [NLTK - stemming](https://pythonspot.com/nltk-stemming/) or [Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).
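
To make the idea concrete without installing anything, here is a toy suffix stripper. It is not a real stemming algorithm: `SUFFIXES` and `toy_stem` are hypothetical and were chosen only so that the three example words collapse to the same stem. A library stemmer applies far more careful rules.

```python
# Hypothetical suffix list chosen only to cover the example words,
# ordered so that longer suffixes are tried first.
SUFFIXES = ("arios", "ario", "as", "os", "a", "o")

def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

for word in ["biblioteca", "bibliotecario", "bibliotecas"]:
    print(word, "->", toy_stem(word))
# biblioteca -> bibliotec
# bibliotecario -> bibliotec
# bibliotecas -> bibliotec
```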

**One-Hot encoding**: after tokenization, lay out every token in the vocabulary as the columns of a table, with one row per text, and mark a 1 in the column of each token that occurs in that text. For example, consider the following two sentences:

 1. Call FREEPHONE 0800 542 0578 now!
 2. Did you call me just now ah?
 
We would get something like this:
 
|       | 0578 | 0800 | 542 | ah | call | did | freephone | just | me | now | you |
|-------|------|------|-----|----|------|-----|-----------|------|----|-----|-----|
| **1** | 1    | 1    | 1   | 0  | 1    | 0   | 1         | 0    | 0  | 1   | 0   |
| **2** | 0    | 0    | 0   | 1  | 1    | 1   | 0         | 1    | 1  | 1   | 1   |  
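
The table above can be reproduced by hand in a few lines of plain Python. This is only a sketch of the idea: `simple_tokens` is a hypothetical stand-in for the `tokenize` function defined earlier, included so the block runs on its own.

```python
import string

sentences = [
    "Call FREEPHONE 0800 542 0578 now!",
    "Did you call me just now ah?",
]

def simple_tokens(sentence):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table).lower().split()

tokenized = [simple_tokens(s) for s in sentences]

# The vocabulary is every distinct token, sorted so the column order is stable.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per sentence, one column per vocabulary entry: 1 if present, else 0.
matrix = [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
# ['0578', '0800', '542', 'ah', 'call', 'did', 'freephone', 'just', 'me', 'now', 'you']
# [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
# [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
```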

This is where **Scikit-Learn** comes in, through the `CountVectorizer` class from the `sklearn.feature_extraction.text` module.

[Slides]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
demo_vectorizer = CountVectorizer(
    tokenizer = tokenize,
    binary=True
)

Parameter explanation:

 - **tokenizer = tokenize**: `CountVectorizer` has a default tokenizer; passing our function replaces it with the one we wrote.
 - **binary = True**: by default, `CountVectorizer` stores the number of occurrences of each token rather than a `1`; setting `binary = True` tells it that however many times a word occurs in a text, it should be counted only once.
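
The effect of `binary` is easiest to see on a text that repeats a word. The sketch below uses a made-up sentence and, to stay short, the default tokenizer instead of our `tokenize` function; both choices are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["win win win now"]  # made-up example with a repeated word

counts = CountVectorizer(binary=False).fit_transform(texts).toarray()
flags = CountVectorizer(binary=True).fit_transform(texts).toarray()

# Columns follow the alphabetically sorted vocabulary: ['now', 'win']
print(counts)  # [[1 3]]
print(flags)   # [[1 1]]
```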

In [None]:
examples = [
    "Call FREEPHONE 0800 542 0578 now!",
    "Did you call me just now ah?"
]
demo_vectorizer.fit(examples)
vectors = demo_vectorizer.transform(examples).toarray()

We call `fit` and `transform` separately, although in this case we could have used `fit_transform`.

**Note**: we use `toarray` to obtain a *numpy array*, because by default `transform` returns a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix) which, while memory-efficient, is not as convenient for showing what the data looks like.
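
Both points can be checked directly. The sketch below is self-contained and, as an assumption made for brevity, uses the default tokenizer rather than `tokenize`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

examples = [
    "Call FREEPHONE 0800 542 0578 now!",
    "Did you call me just now ah?",
]

# Route 1: fit and transform as two separate calls.
two_step = CountVectorizer(binary=True)
two_step.fit(examples)
sparse_a = two_step.transform(examples)

# Route 2: a single fit_transform call.
one_step = CountVectorizer(binary=True)
sparse_b = one_step.fit_transform(examples)

# Both routes yield the same matrix.
same = bool((sparse_a.toarray() == sparse_b.toarray()).all())
print(same)

# The sparse matrix stores only the non-zero entries (nnz of them),
# while the dense array materializes every cell.
print(sparse_a.shape, sparse_a.nnz)
dense = sparse_a.toarray()
print(type(dense).__name__, dense.shape)
```

With only two sentences the saving is negligible, but on the full training set most cells are zero, which is why the sparse representation is the default.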

In [None]:
headers = sorted(demo_vectorizer.vocabulary_.keys())
pd.DataFrame(vectors, columns=headers)

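|       | 0578 | 0800 | 542 | ah | call | did | freephone | just | me | now | you |
|-------|------|------|-----|----|------|-----|-----------|------|----|-----|-----|
| **0** | 1    | 1    | 1   | 0  | 1    | 0   | 1         | 0    | 0  | 1   | 0   |
| **1** | 0    | 0    | 0   | 1  | 1    | 1   | 0         | 1    | 1  | 1   | 1   |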


[Slides]

In [None]:
from sklearn.model_selection import train_test_split

# Stratify on the label so both splits keep the same spam/ham ratio.
train_text, test_text, train_labels, test_labels = train_test_split(
    spam_or_ham["text"],
    spam_or_ham["label"],
    stratify=spam_or_ham["label"],
)
print(f"Training examples: {len(train_text)}, testing examples {len(test_text)}")

Training examples: 4179, testing examples 1393


With the data split, we can now start training our algorithm, beginning by creating a new vectorizer:

In [None]:
real_vectorizer = CountVectorizer(tokenizer = tokenize, binary=True)

train_X = real_vectorizer.fit_transform(train_text)
test_X = real_vectorizer.transform(test_text)

train_X.shape

(4179, 8244)
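
Note that the vectorizer is fit on the training text only, so every column corresponds to a token seen during training. Tokens that appear only in the test text have no column and are dropped by `transform`. The sketch below shows this with made-up sentences and, as an assumption for brevity, the default tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["free tickets now", "see you later"]
test = ["free pizza later"]  # "pizza" never appears in the training text

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train)

vocabulary = sorted(vectorizer.vocabulary_)
test_vector = vectorizer.transform(test).toarray()[0]

print(vocabulary)
print(test_vector.tolist())
# ['free', 'later', 'now', 'see', 'tickets', 'you']
# [1, 1, 0, 0, 0, 0]
```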

[Slides]

In [None]:
from sklearn.svm import LinearSVC

In [None]:
classifier = LinearSVC()
classifier.fit(train_X, train_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [None]:
from sklearn.metrics import accuracy_score

predicciones = classifier.predict(test_X)

accuracy = accuracy_score(test_labels, predicciones)

print(f"Accuracy: {accuracy:.4%}")

Accuracy: 98.4925%
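
Because the classes are imbalanced, it helps to compare this accuracy against a trivial baseline that always predicts `ham`. The arithmetic below uses the class counts printed by `value_counts()` earlier; since the split is stratified, the proportion in the test set should be roughly the same.

```python
# Class counts reported by value_counts() earlier in the notebook.
ham_count = 4825
spam_count = 747

# A classifier that always answers "ham" is right on every ham message.
baseline_accuracy = ham_count / (ham_count + spam_count)
print(f"Majority-class baseline: {baseline_accuracy:.2%}")
# Majority-class baseline: 86.59%
```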


### Predictions on new data

In [None]:
spam = "Want to win FREE tickets to a football match? txt WIN"
ham = "Do you want to go to a football match with me?"

examples = [
    spam,
    ham
]

examples_X = real_vectorizer.transform(examples)
predicciones = classifier.predict(examples_X)

In [None]:
for text, label in zip(examples, predicciones):
    print(f"{label:5} - {text}")

spam  - Want to win FREE tickets to a football match? txt WIN
ham   - Do you want to go to a football match with me?
