# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Problem statement

The goal is to build a machine learning system that predicts whether a given email is SPAM or not. The following dataset is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07)
The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400

### 1. Helper functions

In [195]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let the base class initialise the parser state; character
        # references are not converted, as in the original configuration
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        # Collect the text found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [196]:
# This function removes the HTML tags found in the text of the email
def strip_tags(html):
    s=MLStripper()
    s.feed(html)
    return s.get_data()

In [197]:
# Example of removing the HTML tags from a text
t = '<tr><td><a href="unurl">PhrackWorld</a></td>'
strip_tags(t)

'PhrackWorld'

In [198]:
import email
import string
import nltk
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup  # Used to strip HTML tags from the email body

class parser:
    def __init__(self) -> None:
        self.stemmer = PorterStemmer()
        # Requires the NLTK stopword list: nltk.download('stopwords')
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        # Read the raw email, ignoring bytes that cannot be decoded
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
            return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        # Tokenize the subject and the body, and keep the content type
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(), msg.get_content_type())
        content_type = msg.get_content_type()
        return {"Subject": subject, "Body": body, "content_type": content_type}

    def get_email_body(self, payload, content_type):
        body = []
        if isinstance(payload, str) and content_type == "text/plain":
            return self.tokenize(payload)
        elif isinstance(payload, str) and content_type == "text/html":
            return self.tokenize(self.strip_tags(payload))
        elif isinstance(payload, list):
            # Multipart message: process each part recursively
            for p in payload:
                body += self.get_email_body(p.get_payload(), p.get_content_type())
        return body

    def strip_tags(self, html):
        # Remove the HTML tags and keep only the text
        soup = BeautifulSoup(html, "html.parser")
        return soup.get_text()

    def tokenize(self, text):
        # Remove punctuation
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        # Split on spaces, drop empty tokens and stopwords, then stem
        tokens = list(filter(None, text.split(" ")))
        return [self.stemmer.stem(w) for w in tokens if w.lower() not in self.stopwords]


##### Reading an email in raw format

In [199]:
inmail=open("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica\\1.eml").read()
print(inmail)

Delivered-To: joseguadalupesantossanchez3@gmail.com
Received: by 2002:ad4:45b2:0:b0:6c3:3327:49d1 with SMTP id y18csp2299149qvu;
        Mon, 7 Oct 2024 19:23:34 -0700 (PDT)
X-Google-Smtp-Source: AGHT+IHCu8Xgb3wMfuITC2VN/TQ/+avSq1nUsIokez7jKMj3H82byLDW2OBkbiZ740tk5i1AZHwU
X-Received: by 2002:a05:622a:118e:b0:458:3116:f06d with SMTP id d75a77b69052e-45d9ba3ad6emr215244881cf.22.1728354214291;
        Mon, 07 Oct 2024 19:23:34 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1728354214; cv=none;
        d=google.com; s=arc-20240605;
        b=CIdcI/X9ZNAORj914aihJP/yPLN2pKG7jU8VRyimYrKLlY4cWSi33JqU77ZL0UgKUA
         Z7ARhIXi1jqcojoAkqDIJjwI2XiCpBOMBnaPf0s1cWQFVXc+jQF/0Fjx5AltRdGaBt5I
         L6HW/I2nWEczkjPx/1nESN/0Ezba+1+HhEwN7C8P35jq3/hFfyeVLkL+6j5ZuuTKTjEC
         fxmzUIYjMFsKPmsQ0L9Pv/HlrTMEvnnvvXWkSf5jfT1Xy4N2x28mT0sbicl6TGTRfr9C
         tDBxp6DhlqhSg+LAd49VK6md/Dw0vf1JPsHvI9ela9eUgchfLPCBK1kzC/tVhLO64/yl
         sqxg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=go

##### Parsing the email

In [200]:
p=parser()
p.parse("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica\\1.eml")

{'Subject': ['utf8qtemutemencionc3b3'],
 'Body': ['temu',
  'properli',
  'view',
  'full',
  'messag',
  'content',
  'pleas',
  'open',
  'email',
  'high',
  'version',
  'mail',
  'client',
  'browser',
  'visit',
  'websit',
  'see',
  'recommend',
  'product',
  'httpsapptemucommb',
  'slandinggoodshtmlbgfs3d1xcid3dtextmailplanding3d1xsrc3dm',
  'ailmsgid3d1282024100810b785693021583249409449f2aypmmt',
  'unsubscrib',
  'httpswwwtemucombgmsunsubscribeemailhtmlplanding3d1xsrc3dma',
  'il1xcid3dtextmailxsid3dunsubscribemsgid3d1282024100810b785693',
  '021583249409449f2aypmmt',
  'term',
  'condit',
  'httpswwwtemucomtermsofusehtmlplanding3d1xsrc3dmailxcid',
  '3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt',
  'privaci',
  'cooki',
  'polici',
  'httpswwwtemucombgpprivatepolicyhtmlplanding3d1xsrc3dmail',
  'xcid3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt',
  'f0',
  '9f92b3todavc3ada',
  'tien',
  'una',
  'tarjeta',
  'de',
  'regalo',
  'sin',
  'rec
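
The `Subject` token in the output above (`utf8qtemutemencionc3b3`) looks like the residue of a MIME encoded-word header that was tokenized without being decoded first. The following is a minimal sketch of decoding such a header with the standard library before tokenizing. The helper name `decode_subject` and the sample header are hypothetical illustrations and are not part of the original `parser` class.

```python
from email.header import decode_header, make_header

def decode_subject(raw_subject):
    """Hypothetical helper: decode an RFC 2047 encoded-word header to text."""
    return str(make_header(decode_header(raw_subject)))

# Hypothetical sample header, not taken from the corpus
encoded = "=?UTF-8?Q?Caf=C3=A9_menu?="
decoded = decode_subject(encoded)
plain = decode_subject("Hello")  # headers without encoding pass through unchanged
print(decoded, plain)
```

Under this assumption, the decoded string could be passed to `tokenize` in place of the raw `msg['Subject']` value.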

##### Reading the index

In [201]:
index = open("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\full\\index2").readlines()
index

['spam ../practica/1.eml\n',
 'spam ../practica/2.eml\n',
 'spam ../practica/3.eml\n',
 'spam ../practica/4.eml\n',
 'spam ../practica/5.eml\n',
 'ham ../practica/6.eml\n',
 'ham ../practica/7.eml\n',
 'ham ../practica/8.eml\n',
 'ham ../practica/9.eml\n',
 'ham ../practica/10.eml\n',
 'ham ../practica/11.eml\n',
 'ham ../practica/12.eml\n',
 'ham ../practica/13.eml\n',
 'ham ../practica/14.eml\n',
 'ham ../practica/15.eml\n',
 'ham ../practica/16.eml\n',
 'ham ../practica/17.eml\n',
 'ham ../practica/18.eml\n',
 'ham ../practica/19.eml\n',
 'ham ../practica/20.eml\n',
 '\t']

In [202]:
import os
DATASET_PATH= "C:\\Users\\joses\\Escritorio\\trec07p\\trec07p"

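def parse_index(path_to_index, n_elements):
    # Each index line looks like "spam ../practica/1.eml"
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        mail = index[i].split('../')
        label = mail[0]      # keeps the trailing space, e.g. 'spam '
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({
            "label": label,
            "email_path": os.path.join(DATASET_PATH, path)
        })
    return ret_indexes

def parse_email(index):
    p = parser()  # instantiate the class before calling parse
    pemail = p.parse(index["email_path"])
    return pemail, index["label"]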

In [203]:
def parse_email(index):
    p=parser()
    pemail=p.parse(index["email_path"])
    return pemail,index["label"]

In [204]:
indexes=parse_index("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\full\\index2", 19)
indexes

[{'label': 'spam ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/1.eml'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/2.eml'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/3.eml'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/4.eml'},
 {'label': 'spam ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/5.eml'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/6.eml'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/7.eml'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/8.eml'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/9.eml'},
 {'label': 'ham ',
  'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\tr
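
The output above shows two side effects of splitting each index line on `'../'`: every label keeps a trailing space (`'spam '`), and the paths mix `\\` and `/` separators. The sketch below shows one possible alternative that splits on whitespace instead. The helper `parse_index_line` is hypothetical and is not the function used in the rest of this exercise, whose outputs keep the original labels.

```python
import os

def parse_index_line(line, dataset_path):
    """Hypothetical helper: parse one 'label ../relative/path' index line."""
    parts = line.split()          # split on any whitespace, dropping '\n'
    if len(parts) != 2:
        return None               # skip blank or malformed lines such as '\t'
    label, rel_path = parts
    if rel_path.startswith("../"):
        rel_path = rel_path[3:]
    # Build the path from its components so the separators match the OS
    full_path = os.path.join(dataset_path, *rel_path.split("/"))
    return {"label": label, "email_path": full_path}

entry = parse_index_line("spam ../practica/1.eml\n", "trec07p")
blank = parse_index_line("\t", "trec07p")
print(entry, blank)
```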

### 2. Preprocessing the dataset

The functions presented above make it possible to read the emails programmatically and process them to remove the components that are not useful for SPAM detection. However, each email is still represented by a Python dictionary containing lists of words.

In [205]:
index=parse_index("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\full\\index2", 1)
print(index)


[{'label': 'spam ', 'email_path': 'C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\practica/1.eml'}]


In [206]:
import os

open(index[0]["email_path"]).read()

'Delivered-To: joseguadalupesantossanchez3@gmail.com\nReceived: by 2002:ad4:45b2:0:b0:6c3:3327:49d1 with SMTP id y18csp2299149qvu;\n        Mon, 7 Oct 2024 19:23:34 -0700 (PDT)\nX-Google-Smtp-Source: AGHT+IHCu8Xgb3wMfuITC2VN/TQ/+avSq1nUsIokez7jKMj3H82byLDW2OBkbiZ740tk5i1AZHwU\nX-Received: by 2002:a05:622a:118e:b0:458:3116:f06d with SMTP id d75a77b69052e-45d9ba3ad6emr215244881cf.22.1728354214291;\n        Mon, 07 Oct 2024 19:23:34 -0700 (PDT)\nARC-Seal: i=1; a=rsa-sha256; t=1728354214; cv=none;\n        d=google.com; s=arc-20240605;\n        b=CIdcI/X9ZNAORj914aihJP/yPLN2pKG7jU8VRyimYrKLlY4cWSi33JqU77ZL0UgKUA\n         Z7ARhIXi1jqcojoAkqDIJjwI2XiCpBOMBnaPf0s1cWQFVXc+jQF/0Fjx5AltRdGaBt5I\n         L6HW/I2nWEczkjPx/1nESN/0Ezba+1+HhEwN7C8P35jq3/hFfyeVLkL+6j5ZuuTKTjEC\n         fxmzUIYjMFsKPmsQ0L9Pv/HlrTMEvnnvvXWkSf5jfT1Xy4N2x28mT0sbicl6TGTRfr9C\n         tDBxp6DhlqhSg+LAd49VK6md/Dw0vf1JPsHvI9ela9eUgchfLPCBK1kzC/tVhLO64/yl\n         sqxg==\nARC-Message-Signature: i=1; a=rsa-sha256; c=relaxe

In [207]:
mail, label = parse_email(index[0])
print("The email is:", label)
print(mail)

The email is: spam 
{'Subject': ['utf8qtemutemencionc3b3'], 'Body': ['temu', 'properli', 'view', 'full', 'messag', 'content', 'pleas', 'open', 'email', 'high', 'version', 'mail', 'client', 'browser', 'visit', 'websit', 'see', 'recommend', 'product', 'httpsapptemucommb', 'slandinggoodshtmlbgfs3d1xcid3dtextmailplanding3d1xsrc3dm', 'ailmsgid3d1282024100810b785693021583249409449f2aypmmt', 'unsubscrib', 'httpswwwtemucombgmsunsubscribeemailhtmlplanding3d1xsrc3dma', 'il1xcid3dtextmailxsid3dunsubscribemsgid3d1282024100810b785693', '021583249409449f2aypmmt', 'term', 'condit', 'httpswwwtemucomtermsofusehtmlplanding3d1xsrc3dmailxcid', '3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt', 'privaci', 'cooki', 'polici', 'httpswwwtemucombgpprivatepolicyhtmlplanding3d1xsrc3dmail', 'xcid3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt', 'f0', '9f92b3todavc3ada', 'tien', 'una', 'tarjeta', 'de', 'regalo', 'sin', 'reclamar', 'c2a1ac', 'c3a9ptala', 'ant', 'del', '9', 'oct', '2024', 'c

The Logistic Regression algorithm cannot ingest text directly as part of the dataset. A series of additional functions must therefore be applied to transform the text of the parsed emails into a numerical representation.
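
One common numerical representation is a vector of word counts. The sketch below illustrates the idea in plain Python: a vocabulary is learned from the training documents, and each document then becomes a list of counts with one entry per vocabulary word. It is a simplified illustration of the concept, not the implementation of `CountVectorizer`, which by default also lowercases the text and ignores single-character tokens. The function names and the sample documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({w for doc in documents for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def transform(documents, vocabulary):
    """Turn each document into a list of word counts, one entry per column."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(w, 0) for w in vocabulary])
    return rows

# Hypothetical, already tokenized and stemmed documents
train_docs = ["discount viagra discount", "debian mirror updat"]
vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)

# Words that were not seen when the vocabulary was built are ignored
new_rows = transform(["discount lottery"], vocab)
print(vocab, train_rows, new_rows)
```

The same separation applies later in the exercise: the vocabulary is learned once from the training emails, and new emails are only transformed with it.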

##### Applying CountVectorizer

In [208]:
from sklearn.feature_extraction.text import CountVectorizer

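# Prepare the email as a single text string
# Note: no separator is added between subject and body, so the last
# subject token is fused with the first body token
prep_email = [" ".join(mail['Subject']) + " ".join(mail['Body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)  # learns the vocabulary; fit returns the vectorizer itself

print("Email:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())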

Email: ['utf8qtemutemencionc3b3temu properli view full messag content pleas open email high version mail client browser visit websit see recommend product httpsapptemucommb slandinggoodshtmlbgfs3d1xcid3dtextmailplanding3d1xsrc3dm ailmsgid3d1282024100810b785693021583249409449f2aypmmt unsubscrib httpswwwtemucombgmsunsubscribeemailhtmlplanding3d1xsrc3dma il1xcid3dtextmailxsid3dunsubscribemsgid3d1282024100810b785693 021583249409449f2aypmmt term condit httpswwwtemucomtermsofusehtmlplanding3d1xsrc3dmailxcid 3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt privaci cooki polici httpswwwtemucombgpprivatepolicyhtmlplanding3d1xsrc3dmail xcid3dtextmailmsgid3d1282024100810b785693021583249409449f2aypmmt f0 9f92b3todavc3ada tien una tarjeta de regalo sin reclamar c2a1ac c3a9ptala ant del 9 oct 2024 cd8f e2808c c2a0 e28087 c2ad cd8f e2808c c2a0 e28087 c2adcd8f e2808c c2a0 e28087 c2adcd8f e2808c c2a0 e28087 c2adcd8f e2808c c2a0 e2 8087 c2adcd8f e2808c c2a0 e28087 c2adcd8f e2808c c2 a0 e2808

In [209]:
X=vectorizer.transform(prep_email)
print("\n")





##### Applying OneHotEncoding

In [210]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['Subject'] + mail['Body']]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", X.toarray())

Features:
 ['x0_0' 'x0_00' 'x0_00000fontfamilygraphikmedium' ... 'x0_\xad͏'
 'x0_\u2007' 'x0_\u200c']

Values:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


##### Helper functions for preprocessing the dataset

In [211]:
def create_prep_dataset(index_path, n_elements):
    # Returns one text string per email (X) and its label (y)
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end='')
        mail, label = parse_email(indexes[i])
        # Subject and body are concatenated without a separator between them
        X.append(" ".join(mail['Subject']) + " ".join(mail['Body']))
        y.append(label)
    return X, y


### 3. Training the algorithm

In [212]:
# Read only a subset of 100 emails
X_train, y_train = create_prep_dataset("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\full\\index", 100)
X_train

Parsing email: 100

['gener ciali brand qualitifeel pressur perform rise occas tri viagra anxieti thing past back old self',
 'typo debianreadmhi ive updat gulu check mirror seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairelinuxcom 5149941556 unsubscrib email debianmirrorsrequestlistsdebianorg subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv g r discount pricec l discount pricedo miss click httpwwwmoujsjkhchumcom authent viagra mega authenticv g r discount pricec l discount pricedo miss click',
 'nice talk yahey billi realli fun go night talk said felt insecur manhood notic toilet quit small area worri websit tell secret weapon extra 3 inch trust girl love bigger one ive 5 time mani chick sinc use pi

##### Applying the vectorization to the data

In [213]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [214]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 3933


In [215]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,0000,00085,003,00450,01,0107,014,01417,020,023,...,ӧanz,ӭѯ,ԡšݡ淶,լһʽ,չҵϣ,سŵþʊʊݾѯ,ڶҵţ,㶫иï26,饻jwk,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [216]:
y_train

['spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'ham ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'ham ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ',
 'spam ']

###### Training the logistic regression algorithm with the preprocessed dataset

In [217]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
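
For a binary problem such as this one, a fitted logistic regression model scores a count vector with a weighted sum and passes the result through the sigmoid function to obtain a probability. The sketch below shows that computation in plain Python. The weights, bias and input vector are hypothetical values chosen for illustration; they are not the coefficients learned by `clf`.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(counts, weights, bias):
    """Probability of the positive class for one count vector."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

# Hypothetical parameters for a three-word vocabulary
weights = [1.5, -2.0, 0.5]
bias = -0.5

p = predict_proba([2, 0, 1], weights, bias)  # score z = -0.5 + 3.0 + 0.0 + 0.5 = 3.0
label = "spam" if p >= 0.5 else "ham"
print(round(p, 3), label)
```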

### 4. Prediction

##### Reading a set of new emails

In [218]:
# Read the first 20 emails listed in index2
# These emails were not used to train the algorithm
X, y = create_prep_dataset("C:\\Users\\joses\\Escritorio\\trec07p\\trec07p\\full\\index2", 20)
X_test = X
y_test = y

Parsing email: 12

ParserRejectedMarkup: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 AssertionError: unknown status keyword 'endi' in marked section
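
The traceback above shows the HTML parser rejecting the markup of one email, which stops the whole loop. One possible way to keep processing is to catch the failure and fall back to a cruder tag removal. The sketch below assumes a hypothetical wrapper `safe_strip_tags`; in the `parser` class, `strip_fn` would be the BeautifulSoup-based `strip_tags` method. Here a stand-in function raises the error so that the example runs without the corpus.

```python
import re

# Crude pattern for anything that looks like an HTML tag
TAG_RE = re.compile(r"<[^>]*>")

def safe_strip_tags(html, strip_fn):
    """Try strip_fn first; fall back to a regex if the parser fails."""
    try:
        return strip_fn(html)
    except Exception:
        # Fallback: replace tag-like spans with spaces
        return TAG_RE.sub(" ", html)

def rejecting_parser(html):
    # Hypothetical stand-in for a parser that rejects the markup
    raise AssertionError("unknown status keyword in marked section")

text = safe_strip_tags("<p>free <b>gift</b></p>", rejecting_parser)
print(text.split())
```

The regex fallback is less accurate than a real HTML parser, so it is best treated as a last resort for the few emails that cannot be parsed.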

##### Preprocessing the emails with the vectorizer created earlier

In [145]:
X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [146]:
y_pred = clf.predict(X_test)
y_pred

array(['ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
       'ham ', 'spam ', 'spam ', 'spam '], dtype='<U5')

In [147]:
print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['ham ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'spam ' 'ham ' 'spam '
 'spam ' 'spam ']

True labels:
 ['spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'ham ', 'ham ', 'ham ', 'ham ', 'ham ', 'ham ']


##### Evaluating the results

In [148]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.455
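
To see where the errors come from, the accuracy can be recomputed by hand and broken down by pairs of true and predicted labels. The sketch below reuses the predictions and true labels printed above, including the trailing space in each label.

```python
from collections import Counter

# Values copied from the outputs above
y_pred = ['ham ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ', 'spam ',
          'ham ', 'spam ', 'spam ', 'spam ']
y_test = ['spam ', 'spam ', 'spam ', 'spam ', 'spam ',
          'ham ', 'ham ', 'ham ', 'ham ', 'ham ', 'ham ']

# Count each (true label, predicted label) combination
pairs = Counter(zip(y_test, y_pred))
correct = sum(n for (true, pred), n in pairs.items() if true == pred)
accuracy = correct / len(y_test)

print("Accuracy: {:.3f}".format(accuracy))
for (true, pred), n in sorted(pairs.items()):
    print("true={!r} predicted={!r}: {}".format(true, pred, n))
```

Five of the six ham emails are predicted as spam, so most of the errors in this small test set are false positives.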
