# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of logistic regression through one of the first problems to be solved with Machine Learning techniques: SPAM detection.


## Exercise statement

The goal is to build an automatic system that predicts whether a given email is SPAM or not. The following dataset is used:

[2007 TREC Public Spam Corpus](Terabox)

The folder contains 75,419 email messages:

* 25,220 are ham
* 50,199 are SPAM

These messages constitute all the messages delivered to a particular server between these dates:

Sun,  8 Apr 2007 13:07:21 -0400
Fri,  6 Jul 2007 07:03:53 -0400

Logistic regression has the following characteristics:

* **Supervised** learning.
* **Model-based** learning.
* It is a **generalized linear model**.
* It makes predictions by computing a **weighted sum of the input features** plus a constant known as the bias, and then applies the logistic function to the result.
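
The prediction step described in the last point can be sketched in a few lines of plain Python. This is only an illustration: the weights, the bias, and the feature values below are made-up numbers, and `sigmoid` and `predict_proba` are hypothetical helper names rather than part of any library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Weighted sum of the input features plus the bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights for three word-count features (illustrative values only)
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [2, 0, 1]  # e.g. counts of three words in one email

p = predict_proba(features, weights, bias)
label = "spam" if p >= 0.5 else "ham"
print(round(p, 3), label)
```

The output of the sigmoid can be read as the estimated probability of the positive class, and the 0.5 threshold turns it into a label.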

## 1. Helper functions

In this case study on SPAM detection, the available dataset consists of raw emails, including their headers and additional fields. They therefore need to be preprocessed before being fed to the Machine Learning algorithm.

In [1]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes
        # character references such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.fed = []
        
    def handle_data(self, d):
        self.fed.append(d)
        
    def get_data(self):
        return ''.join(self.fed)

In [2]:
# Function that removes the HTML tags found in a text
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'

Besides removing the HTML tags found in the email, other preprocessing steps are needed to keep unnecessary noise out of the messages. These include removing punctuation marks, removing email fields that are not relevant, and stripping the affixes of each word to keep only its root (stemming). The class below performs these transformations.

In [4]:
import email
import string 
import nltk

class Parser:
    
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['subject']) if msg['subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Return the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}
    
    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body
                                            
    def tokenize(self, text):
        """Transform a text string in tokens. Perform two main actions,
        clean the punctuation symbols and stemming of the text."""
        for c in self.punctuation:
            text = text.replace(c, "")
            text = text.replace("\t", " ")
            text = text.replace("\n", " ")
            tokens = list(filter(None, text.split(" ")))
            # Stemming of the tokens
            return[self.stemmer.stem(w) for w in tokens if w not in self.stopwords]
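
In the `tokenize` method above, the replacement steps and the `return` statement are indented inside the `for` loop, so the method returns during the first iteration, after only the first punctuation character (`!`) has been removed. This is why tokens such as `cialis,` keep their punctuation in the outputs recorded in this notebook. A minimal sketch of a variant that strips every punctuation character before splitting is shown below. It is an illustration and not the code that produced those outputs: `tokenize_sketch`, `PUNCT_TABLE`, and the small `STOPWORDS` set are hypothetical names, and stemming is left out so that the snippet runs without NLTK.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration
STOPWORDS = {"the", "a", "of", "you"}

# Translation table that deletes every punctuation character in one pass
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize_sketch(text):
    """Remove punctuation, split on whitespace and drop stop words."""
    cleaned = text.translate(PUNCT_TABLE)
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings
    tokens = cleaned.split()
    return [w for w in tokens if w not in STOPWORDS]

tokens = tokenize_sketch("Generic Cialis, branded quality@ \n")
print(tokens)
```

Switching to a variant like this would change the tokens, the vocabulary size, and possibly the accuracy reported later, so the recorded outputs would need to be regenerated.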

##### Reading an email in raw format

In [5]:
inmail = open("datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

##### Parsing an email


In [6]:
import nltk
nltk.download('stopwords')
p = Parser()
p.parse("datasets/trec07p/data/inmail.1")

[nltk_data] Downloading package stopwords to /home/ivan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'subject': ['gener', 'cialis,', 'brand', 'quality@'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occasion??',
  'tri',
  'viagra.....',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self.'],
 'content_type': 'multipart/alternative'}

##### Reading the index

These helper functions load into memory the path of each email and its corresponding label {spam, ham}.

In [7]:
index = open("datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [8]:
import os

DATASET_PATH = "datasets/trec07p"

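def parse_index(path_to_index, n_elements):
    """Return the label and file path of the first n_elements index entries."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()

    for i in range(n_elements):
        # 'spam ../data/inmail.1\n' -> ['spam ', 'data/inmail.1\n']
        mail = index[i].split("../")
        label = mail[0]
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({"label": label,
                            "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes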

In [9]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [10]:
indexes = parse_index("datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.1'},
 {'label': 'ham ', 'email_path': 'datasets/trec07p/data/inmail.2'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.3'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.4'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.5'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.6'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.7'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.8'},
 {'label': 'spam ', 'email_path': 'datasets/trec07p/data/inmail.9'},
 {'label': 'ham ', 'email_path': 'datasets/trec07p/data/inmail.10'}]
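
The labels above carry a trailing space (`'spam '`) because each index line is split on `"../"`. This does not affect training, since the same strings are used consistently, but a whitespace split yields clean labels. The sketch below assumes the index format shown earlier (`<label> ../data/inmail.N`); `parse_index_line` is a hypothetical helper and not part of the notebook.

```python
import os

DATASET_PATH = "datasets/trec07p"

def parse_index_line(line, dataset_path=DATASET_PATH):
    """Split one index line into a clean label and a path under the dataset root."""
    # 'spam ../data/inmail.1\n' -> ['spam', '../data/inmail.1']
    label, rel_path = line.split()
    # Drop the leading '../' so the path can be joined to the dataset root
    if rel_path.startswith("../"):
        rel_path = rel_path[3:]
    return {"label": label, "email_path": os.path.join(dataset_path, rel_path)}

entry = parse_index_line("spam ../data/inmail.1\n")
print(entry)
```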

## 2. Data processing


The functions above make it possible to read the emails programmatically and to process them, removing the components that are not useful for SPAM detection. However, each email is still represented as a Python dictionary containing lists of words.

In [11]:
# Load the index and the labels into memory.
index = parse_index("datasets/trec07p/full/index", 1)

In [12]:
# Read the first email
import os

open(index[0]["email_path"]).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [13]:
# Parse the first email
mail, label = parse_email(index[0])
print("The email is: ", label)
print(mail)

The email is:  spam 
{'subject': ['gener', 'cialis,', 'brand', 'quality@'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occasion??', 'tri', 'viagra.....', 'anxieti', 'thing', 'past', 'back', 'old', 'self.'], 'content_type': 'multipart/alternative'}


The logistic regression algorithm cannot ingest text as part of the dataset, so the parsed emails must first be transformed into a numerical representation.
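
One common numerical representation is the bag of words: build a vocabulary from the training texts and represent each text by how many times each vocabulary word occurs in it. The sketch below illustrates the idea in plain Python. It is a simplification that does not reproduce the tokenization rules of scikit-learn's `CountVectorizer`, and the function names `fit_vocabulary` and `transform` are hypothetical.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct word in the training texts to a column index."""
    words = sorted({w for text in texts for w in text.split()})
    return {word: i for i, word in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a vector of word counts over the vocabulary."""
    rows = []
    for text in texts:
        counts = Counter(text.split())
        # Words that are missing from the vocabulary are simply ignored
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_texts = ["cheap pills cheap offer", "meeting agenda attached"]
vocab = fit_vocabulary(train_texts)
X = transform(train_texts + ["cheap meeting tomorrow"], vocab)
print(list(vocab))
print(X)
```

The third text contains the word `tomorrow`, which is not in the vocabulary, so it contributes nothing to its vector. The same thing happens to unseen words when a fitted vectorizer is applied to new emails.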

##### Applying CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Join the email tokens into a single text string
prep_email = [" ".join(mail["subject"]) + " ".join(mail["body"])]

vectorizer = CountVectorizer()
X = vectorizer.fit(prep_email)
print("e-mail: ", prep_email, "\n")
print("Input features: ", vectorizer.get_feature_names_out())

e-mail:  ['gener cialis, brand quality@do feel pressur perform rise occasion?? tri viagra..... anxieti thing past back old self.'] 

Input features:  ['anxieti' 'back' 'brand' 'cialis' 'do' 'feel' 'gener' 'occasion' 'old'
 'past' 'perform' 'pressur' 'quality' 'rise' 'self' 'thing' 'tri' 'viagra']
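
Note that the e-mail string above contains the fused fragment `quality@do`: the expression `" ".join(mail["subject"]) + " ".join(mail["body"])` concatenates the two joined strings without a separator, so the last subject token and the first body token run together. A small sketch of the effect, and of one way to avoid it, using made-up token lists:

```python
subject = ["gener", "brand"]      # made-up tokens for illustration
body = ["feel", "pressur"]

# Joining each list separately and concatenating fuses 'brand' and 'feel'
fused = " ".join(subject) + " ".join(body)

# Joining the combined token list keeps every token separate
separated = " ".join(subject + body)

print(fused)
print(separated)
```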


In [15]:
X = vectorizer.transform(prep_email)
print("\nValues: \n", X.toarray())


Values: 
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


In [16]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out())
print("\nValues\n", X.toarray())

Features:
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_cialis,' 'x0_do' 'x0_feel'
 'x0_gener' 'x0_occasion??' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur'
 'x0_quality@' 'x0_rise' 'x0_self.' 'x0_thing' 'x0_tri' 'x0_viagra.....']

Values
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 

##### Helper functions for preprocessing the dataset

In [17]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end = '')
        mail, label = parse_email(indexes[i])
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))
        y.append(label)
    return X, y

## 3. Training the algorithm

In [18]:
# Read only a subset of 100 emails
X_train, y_train = create_prep_dataset("datasets/trec07p/full/index", 100)
X_train

Parsing email: 100

['gener cialis, brand quality@do feel pressur perform rise occasion?? tri viagra..... anxieti thing past back old self.',
 'typo /debian/readmhi, i\'v updat gulu i check mirrors. it seem littl typo /debian/readm file example: http://gulus.usherbrooke.ca/debian/readm ftp://ftp.fr.debian.org/debian/readm "testing, lenny. access releas dists/testing. the current test develop snapshot name etch. packag test unstabl pass autom test propog release." etch replac lenni like readme.html -- yan morin consult en logiciel libr yan.morin@savoirfairelinux.com 514-994-1556 -- to unsubscribe, email debian-mirrors-request@lists.debian.org subject "unsubscribe". trouble? contact listmaster@lists.debian.org',
 'authent viagramega authenticv i a g r a $ discount pricec i a l i s $discount pricedo miss it, click here. http://www.moujsjkhchum.com authent viagra mega authenticv i a g r a $ discount pricec i a l i s $discount pricedo miss it, click here.',
 "nice talk yahey billy, realli fun go night talking,

##### Vectorizing the data

In [19]:
# Create a fresh vectorizer and fit it on the training emails
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [20]:
print(X_train.toarray())
print("\nFeatures: ", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features:  5121


In [21]:
import pandas as pd
# One column per vocabulary word, one row per email
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

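| | 00 | 000 | 0000 | 000000 | 000099 | 0001pt | 000m | 0085 | 009 | 01000u | ... | ҳϊ | ӧanz | ӭѯ | ԡšݡ淶 | լһʽ | չҵϣ | سŵþʊʊݾѯ | ڶҵţ | 㶫иï26 | 饻jwk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |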


In [22]:
y_train

['spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ']

##### Training the logistic regression algorithm on the preprocessed dataset

In [23]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
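
Fitting a logistic regression means finding the weights and the bias that make the predicted probabilities match the training labels. The sketch below fits a tiny model by plain batch gradient descent on the log-loss, in pure Python. It only illustrates the idea: scikit-learn's `LogisticRegression` uses more sophisticated solvers and applies regularization by default, and the toy data, learning rate, and number of epochs here are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            error = p - yi  # derivative of the log-loss w.r.t. the weighted sum
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        # Move against the average gradient
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy word-count vectors: the first column counts a made-up word typical
# of spam, the second a made-up word typical of ham
X = [[2, 0], [3, 0], [1, 0], [0, 2], [0, 1], [0, 3]]
y = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

w, b = fit_logistic(X, y)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
         for xi in X]
print(preds)
```

After training, the weight of the first feature is positive and the weight of the second is negative, which is how the model encodes that one word points towards spam and the other towards ham.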

## 4. Prediction

### Reading the set of emails

In [24]:
# Read 150 emails from the dataset and keep only
# the last 50. These 50 emails were not used to
# train the algorithm.
X, y = create_prep_dataset("datasets/trec07p/full/index", 150)
X_test = X[100:]
y_test = y[100:]

Parsing email: 150

##### Preprocessing the emails with the vectorizer fitted earlier

In [25]:
X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [26]:
y_pred = clf.predict(X_test)
y_pred

array(['spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'spam '], dtype='<U5')

In [27]:
print("Prediction: \n", y_pred)
print("True labels \n", y_test)

Prediction: 
 ['spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'ham ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam '
 'spam ' 'spam ' 'spam ' 'spam ' 'spam ']
True labels 
 ['spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ']


##### Evaluating the results

In [28]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.920
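
Accuracy is the fraction of predictions that match the true labels. Because most emails in this test set are spam, it helps to compare the result with a baseline that always predicts the majority class. The sketch below rebuilds the two label lists from the outputs printed above (`ham` at 1-based positions 11, 17, 24, 27 and 30 among the true labels, and at position 17 among the predictions) and computes both figures by hand. The helper name `accuracy` is hypothetical, and the labels are written without the trailing space.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction and the true label agree."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Labels transcribed from the printed outputs (1-based positions of 'ham')
ham_true = {11, 17, 24, 27, 30}
ham_pred = {17}
y_test = ["ham" if i in ham_true else "spam" for i in range(1, 51)]
y_pred = ["ham" if i in ham_pred else "spam" for i in range(1, 51)]

model_acc = accuracy(y_test, y_pred)
baseline_acc = accuracy(y_test, ["spam"] * len(y_test))  # always predict spam

print(f"model: {model_acc:.3f}  baseline: {baseline_acc:.3f}")
```

The model classifies 46 of the 50 emails correctly, while always answering spam would get 45 right. With only 100 training emails and 50 test emails the gap over the baseline is small, so a larger sample would be needed for a reliable estimate.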
