In [0]:
# Spam message filtering

## Description of the real-world problem

Receiving unwanted advertising through SMS (Short Message Service) text messages is a problem that affects many mobile phone users. The problem is that users must pay for the messages they receive, so it is very important that service providers can filter unwanted messages before delivering them to the final recipient. Messages have a maximum length of 160 characters, so there is little text available for classification compared with longer texts (such as emails). In addition, typing errors make automatic detection harder.

## Description of the problem in terms of the data

The sample contains 5574 messages in English, unencoded and labeled as legitimate (ham) or spam (http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/). The data is stored in the file `datos/spam-sms.zip`. In terms of the data, the problem is to classify whether an SMS message is legitimate or spam by analyzing the words it contains, on the assumption that certain words are more frequent depending on the type of message. This means that the data preparation phase must extract the words contained in each message so that the analysis can be carried out.

## Possible approaches

In this case, the goal is to compare the results of an artificial neural network model with those of other statistical techniques for performing the classification.

## Requirements

You must:

* Preprocess the data to represent it using bag-of-words.


* Build a logistic regression model as a baseline for comparison with more complex models.


* Build an artificial neural network model. You must also determine the number of neurons in the hidden layer or layers.


* Use a technique such as cross-validation, or a similar one, to establish the robustness of the model.


* Report performance metrics to establish the strengths and weaknesses of each classifier.
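
As a rough illustration of the baseline and robustness requirements above, the following sketch cross-validates a logistic regression on a binary bag-of-words representation. The toy corpus, the pipeline layout, and the choice of three stratified folds are assumptions made for the example; they are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real notebook would use the SMS collection.
texts = [
    "free entry win cash prize now", "call now to claim your free prize",
    "win a free mobile text now", "urgent claim your cash prize today",
    "free cash call now", "text win to claim free entry",
    "are you coming home for dinner", "ok see you later tonight",
    "sorry i will call you later", "can we meet for lunch tomorrow",
    "i am at home see you soon", "did you get home ok",
]
labels = ["spam"] * 6 + ["ham"] * 6

# The vectorizer sits inside the pipeline so each fold learns its own vocabulary
baseline = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the ham/spam ratio similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(baseline, texts, labels, cv=cv, scoring="accuracy")
print("accuracy per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Keeping the vectorizer inside the pipeline is a design choice that avoids leaking vocabulary from the validation fold into training. With classes as imbalanced as ham and spam, a metric such as recall on the spam class may be more informative than accuracy.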

In [0]:
!pip install -q wordcloud

In [4]:
!pip install tensorflow==1.14

Collecting tensorflow==1.14
  Downloading https://files.pythonhosted.org/packages/de/f0/96fb2e0412ae9692dbf400e5b04432885f677ad6241c088ccc5fe7724d69/tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (109.2MB)
Collecting tensorboard<1.15.0,>=1.14.0
  Downloading https://files.pythonhosted.org/packages/91/2d/2ed263449a078cd9c8a9ba50ebd50123adf1f8cfbea1492f9084169b89d9/tensorboard-1.14.0-py3-none-any.whl (3.1MB)
Collecting tensorflow-estimator<1.15.0rc0,>=1.14.0rc0
  Downloading https://files.pythonhosted.org/packages/3c/d5/21860a5b11caf0678fbc8319341b0ae21a07156911132e0e71bffed0510d/tensorflow_estimator-1.14.0-py2.py3-none-any.whl (488kB)
Installing collected packages: tensorboard, tensorflow-estimator, tensorflow
  Found existing installation: tensorboard 2.2.0
    Uninstalling tensorboard-2.2.

In [109]:
# Select TensorFlow 1.x before anything imports tensorflow (Colab magic);
# it has no effect once TensorFlow is already loaded in the runtime
%tensorflow_version 1.x

import glob
import re
import string

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import wordcloud

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import tensorflow as tf
from tensorflow.python.keras import Sequential
from tensorflow.python.keras.layers import Dense

# Download the NLTK resources used in the notebook
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
TensorFlow is already loaded. Please restart the runtime to change versions.


In [110]:
# Before running this cell, move the shared folder "evaluacion-semoralesco-master" to your Drive
# so that the path used in the following cells is correct.
# Mount Google Drive to read the data used by the model.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
df = pd.DataFrame(columns = ['type', 'text'])

In [0]:
# Check that the path is correct once the folder has been moved to "My Drive"
import csv
sms = open("/content/drive/My Drive/evaluacion-semoralesco-master/datos/SMSSpamCollection.txt", "r")
csv_reader = csv.reader(sms, delimiter='\t')

In [0]:
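# Read the tab-separated txt file into the dataframe (column 0: label, column 1: message)
sms_data = []
sms_labels = []
for row in csv_reader:
    sms_labels.append(row[0])
    sms_data.append(row[1])
sms.close()  # the reader is exhausted, so the file can be closed
df['type'] = sms_labels
df['text'] = sms_data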

In [114]:
df.describe()

Unnamed: 0,type,text
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [115]:
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [116]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Stem each whitespace-separated token; punctuation stays attached to its token
df['stemmed'] = df.text.apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split()]))

df.head(10)

Unnamed: 0,type,text,stemmed
0,ham,"Go until jurong point, crazy.. Available only ...","Go until jurong point, crazy.. avail onli in b..."
1,ham,Ok lar... Joking wif u oni...,Ok lar... joke wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entri in 2 a wkli comp to win FA cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so earli hor... U c alreadi then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah I don't think he goe to usf, he live aroun..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,freemsg hey there darl it' been 3 week' now an...
6,ham,Even my brother is not like to speak with me. ...,even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...,As per your request 'mell mell (oru minnaminun...
8,spam,WINNER!! As a valued network customer you have...,winner!! As a valu network custom you have bee...
9,spam,Had your mobile 11 months or more? U R entitle...,had your mobil 11 month or more? U R entitl to...


In [117]:
# Build a binary bag-of-words document-term matrix (DTM) from the stemmed texts
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(
    analyzer='word',        # tokenize at the word level
    lowercase=True,         # convert to lowercase
    stop_words='english',   # remove English stop words
    binary=True,            # every nonzero count is set to 1
    min_df=5                # ignore words that appear in fewer than 5 messages
)


dtm = count_vect.fit_transform(df.stemmed)
dtm.shape

(5572, 1535)
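
To make the effect of `binary=True` and `min_df` more concrete, here is a minimal sketch on three made-up messages. The documents and the `min_df=2` threshold are assumptions chosen so the result is easy to check by hand; the terms are read from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages (not taken from the SMS collection)
toy_docs = ["free free prize call now", "call me later", "free call"]

# binary=True records presence/absence; min_df=2 keeps terms found in at least 2 documents
toy_vect = CountVectorizer(binary=True, min_df=2)
toy_dtm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index; sort terms by that index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)
print(toy_dtm.toarray())
```

Only "call" and "free" appear in two or more documents, so the matrix has two columns. The repeated "free" in the first message is recorded as 1 rather than 2 because the representation is binary.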

In [118]:
# Vocabulary learned from the text messages
# (scikit-learn >= 1.0 provides get_feature_names_out() for the same purpose)
vocabulary = count_vect.get_feature_names()
len(vocabulary)

1535

In [119]:
# First words of the vocabulary
vocabulary[0:10]

['00',
 '000',
 '02',
 '03',
 '04',
 '06',
 '0800',
 '08000839402',
 '08000930705',
 '0870']

In [121]:
# Recover the words of each selected message from the DTM
def dtm2words(dtm, vocabulary, index):
    # One dense row per requested document, in the same order as `index`
    as_list = dtm[index, :].toarray().tolist()
    docs = []
    for row in as_list:
        # Keep the vocabulary terms whose entry in this row is nonzero
        docs.append([vocabulary[iword] for iword, ifreq in enumerate(row) if ifreq > 0])
    return docs

index = list(range(11))
for i, x in zip(index, dtm2words(dtm, vocabulary, index)):
    print('Org: ', df.text[i])
    print('Mod: ', ' '.join(x))
    print('')

Org:  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Mod:  avail bugi cine got great la onli point wat world

Org:  Ok lar... Joking wif u oni...
Mod:  joke lar ok wif

Org:  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Mod:  appli comp cup entri final free question rate receiv std text txt win wkli

Org:  U dun say so early hor... U c already then say...
Mod:  alreadi dun earli say

Org:  Nah I don't think he goes to usf, he lives around here though
Mod:  don goe live nah think usf

Org:  FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
Mod:  50 darl freemsg fun hey like ok send std week word xxx

Org:  Even my brother is not like to speak with me. They treat me like aids patent.
Mod:  brother like speak treat

Org

In [0]:
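# Create the training and test sets by position.
# Note: the training slice ends at row 4167 and the test slice starts at row 4169,
# so row 4168 belongs to neither set.
X_train      = dtm[0:4168,]
X_test       = dtm[4169:,]
y_train_true = df.type[0:4168]
y_test_true  = df.type[4169:]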
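
A positional slice depends on the row order of the file, and it is easy to leave a boundary row out of both sets. One possible alternative, sketched here on made-up labels, is a stratified random split with `train_test_split`. The toy labels, the 25% test size, and the random seed are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the class ratio seen in the notebook
toy_labels = ["ham"] * 87 + ["spam"] * 13
toy_rows = list(range(len(toy_labels)))  # stand-in for row indices of the DTM

# stratify keeps the ham/spam proportion similar in both sets
rows_train, rows_test, y_train, y_test = train_test_split(
    toy_rows, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# Every row ends up in exactly one of the two sets
covered = sorted(rows_train + rows_test)
print(len(rows_train), len(rows_test), covered == toy_rows)
print("spam in test:", y_test.count("spam"))
```

With the real data, the returned row indices could be used to slice both the sparse matrix and the label column consistently.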

In [124]:
# Class distribution (%) of the training data
round(100 * y_train_true.value_counts() / sum(y_train_true.value_counts()), 1)

ham     86.5
spam    13.5
Name: type, dtype: float64

In [125]:
# Class distribution (%) of the test data
round(100 * y_test_true.value_counts() / sum(y_test_true.value_counts()), 1)

ham     87.0
spam    13.0
Name: type, dtype: float64

In [126]:
# Model training
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes classifier (suited to binary word-presence features)
clf = BernoulliNB()

# Fit the classifier on the training data
clf.fit(X_train.toarray(), y_train_true)
clf

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [127]:
# Predictions (labels and class probabilities) for the test data
y_test_pred = clf.predict(X_test.toarray())
y_test_pred_prob = clf.predict_proba(X_test.toarray())
y_test_pred

array(['spam', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [128]:
# Performance metrics: confusion matrix (rows: true class, columns: predicted class)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true = y_test_true,
                 y_pred = y_test_pred)

array([[1215,    5],
       [  19,  164]])
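
The confusion matrix above can be summarized with the usual per-class metrics. The sketch below takes the reported counts and treats spam as the positive class. It assumes the label order used by `confusion_matrix` when no `labels` argument is given, which is sorted order (ham, then spam).

```python
# Confusion matrix reported above: rows = true class, columns = predicted class,
# with labels in sorted order (ham, spam)
cm = [[1215, 5],
      [19, 164]]

tn, fp = cm[0]  # true ham: predicted ham / predicted spam
fn, tp = cm[1]  # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For these counts the accuracy is about 0.983, spam precision about 0.970, and spam recall about 0.896. In other words, most errors are spam messages that pass as ham (19) rather than legitimate messages flagged as spam (5).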

In [129]:
# Inspect the predicted class probabilities (columns: ham, spam)
clf.predict_proba(X_test.toarray())

array([[5.75630664e-15, 1.00000000e+00],
       [9.99634400e-01, 3.65599574e-04],
       [9.99669927e-01, 3.30073060e-04],
       ...,
       [9.99991916e-01, 8.08391898e-06],
       [9.99938624e-01, 6.13761686e-05],
       [9.99978078e-01, 2.19217806e-05]])

In [130]:

# New table with the true label, the predicted label, and the class probabilities
results = pd.DataFrame(data = {
    'actual_type':  y_test_true,
    'predict_type': y_test_pred,
    'prob_ham':     [v[0] for v in y_test_pred_prob],
    'prob_spam':    [v[1] for v in y_test_pred_prob]})

results.head(5)

Unnamed: 0,actual_type,predict_type,prob_ham,prob_spam
4169,spam,spam,5.756307e-15,1.0
4170,ham,ham,0.9996344,0.0003655996
4171,ham,ham,0.9996699,0.0003300731
4172,ham,ham,1.0,2.023721e-09
4173,ham,ham,0.9999999,6.049725e-08


In [131]:
# Show the messages that the classifier got wrong
results[results['actual_type'] != results['predict_type']]

Unnamed: 0,actual_type,predict_type,prob_ham,prob_spam
4213,spam,ham,0.999482,0.000518
4249,spam,ham,0.763449,0.236551
4256,spam,ham,0.97118,0.02882
4297,spam,ham,0.730392,0.269608
4298,spam,ham,0.634119,0.365881
4344,ham,spam,0.356805,0.643195
4373,spam,ham,0.992373,0.007627
4394,spam,ham,0.989484,0.010516
4399,ham,spam,0.392947,0.607053
4514,spam,ham,0.999775,0.000225


In [132]:
# Extract messages with probability close to 0.5, since these are ambiguous for the classifier
results[(results['prob_spam'] > 0.4) & (results['prob_spam'] < 0.6)]

Unnamed: 0,actual_type,predict_type,prob_ham,prob_spam
4253,ham,ham,0.52086,0.47914
4931,spam,spam,0.46691,0.53309
5324,ham,ham,0.568126,0.431874
5370,spam,ham,0.540665,0.459335
5377,spam,ham,0.589744,0.410256


In [133]:
# Show misclassified messages with probability close to 0.5
results[(results['prob_spam'] > 0.4) &
        (results['prob_spam'] < 0.6) &
        (results['actual_type'] != results['predict_type'])]

Unnamed: 0,actual_type,predict_type,prob_ham,prob_spam
5370,spam,ham,0.540665,0.459335
5377,spam,ham,0.589744,0.410256
