<a href="https://colab.research.google.com/drive/1eQOkYB4XMQdSeHV_pKvt9kSUW8vQo6LQ">Open this Jupyter notebook in Google Colab</a>

# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Exercise statement

The goal is to build a machine learning system that predicts whether a given email is SPAM or not. The following dataset is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07)
The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400
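
These counts describe an imbalanced dataset, which is worth keeping in mind when reading accuracy figures. As a rough illustration (the variable names are ours, not part of the corpus documentation), the share of each class and the accuracy of a trivial classifier that always answers `spam` can be computed as follows:

```python
# Class counts as stated in the corpus description above
n_ham = 25220
n_spam = 50199
n_total = n_ham + n_spam

spam_share = n_spam / n_total
ham_share = n_ham / n_total

# A classifier that always predicts "spam" is right exactly on the spam emails,
# so its accuracy over the whole corpus equals the spam share.
baseline_accuracy = spam_share

print("Total messages:", n_total)                 # Total messages: 75419
print("Spam share: {:.3f}".format(spam_share))    # Spam share: 0.666
print("Ham share: {:.3f}".format(ham_share))      # Ham share: 0.334
```

Any model evaluated on a sample with a similar class mix should therefore be compared against roughly two thirds accuracy, not against 50%.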

### 0. Imports

In [12]:
# Install external libraries
!pip install scikit-learn
!pip install nltk

### 1. Helper functions

In this SPAM-detection case study, the dataset consists of raw emails, including their headers and additional fields. They therefore need to be preprocessed before being fed to the Machine Learning algorithm.

In [107]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self,d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)

In [108]:
# This function removes any HTML tags found in the email text
def strip_tags(html: str) -> str:
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Example: removing the HTML tags from a text
tag = '<div><p align="left">Hello everyone<span align="center">this is a tag</span></p></div>'
print(strip_tags(tag))

Hello everyonethis is a tag
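
The output above shows that text from adjacent tags is glued together, because `get_data` joins the fragments with an empty string. If that is undesirable, one possible variant is sketched below; `SpacedStripper` and `strip_tags_spaced` are hypothetical names, not part of the original notebook.

```python
from html.parser import HTMLParser


class SpacedStripper(HTMLParser):
    """Hypothetical variant of MLStripper that keeps text fragments apart."""

    def __init__(self):
        super().__init__()
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        # Join fragments with a space, then collapse repeated whitespace
        return " ".join(" ".join(self.fed).split())


def strip_tags_spaced(html: str) -> str:
    s = SpacedStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


sample = '<div><p align="left">first fragment<span align="center">second fragment</span></p></div>'
print(strip_tags_spaced(sample))  # first fragment second fragment
```

The trade-off is that inline markup inside a word (for example `<b>H</b>ello`) would now be split into two tokens, so which behaviour is preferable depends on the emails being processed.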


Besides removing any HTML tags from the email, other preprocessing steps are needed to keep unnecessary noise out of the messages. These include removing punctuation marks, discarding email fields that are not relevant, and stripping the affixes of each word so that only its root remains (stemming). The class shown below performs these transformations.

In [109]:
import email
import string
import nltk
nltk.download('stopwords')

class Parser:
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors= 'ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content"""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(), msg.get_content_type())
        content_type = msg.get_content_type()
        return {
            "subject": subject,
            "body": body,
            "content_type": content_type
        }

    def get_email_body(self, payload, content_type):
        """Extract the body of the email"""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(), p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string in tokens. Perform two main actions, clean
        the punctuation symbols and do stemming of the text."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rafae\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
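
One detail of `tokenize` is worth noting: the stop-word check compares each token exactly as written, while NLTK's English stop-word list is lowercase, so a capitalised stop word passes the filter and is only lowercased afterwards by the stemmer. The sketch below isolates that behaviour with the standard library only. The tiny `STOPWORDS` set, the sample sentence and the `lowercase_first` parameter are illustrative assumptions, and stemming is left out.

```python
import string

# Illustrative stand-in for nltk.corpus.stopwords.words('english') (assumption)
STOPWORDS = {"do", "you", "the", "a", "to"}


def clean_tokens(text, lowercase_first=False):
    """Strip punctuation, split on whitespace and drop stop words."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


sentence = "Do you feel the pressure to perform?"
print(clean_tokens(sentence))                        # ['Do', 'feel', 'pressure', 'perform']
print(clean_tokens(sentence, lowercase_first=True))  # ['feel', 'pressure', 'perform']
```

Lowercasing before the stop-word check removes such tokens, at the cost of changing the vocabulary the rest of the notebook is built on.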


##### Reading an email in raw format

In [110]:
import os

file_path = "datasets/trec07p/data/inmail.1"
inmail = open(file_path).read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

##### Parsing the email

In [111]:
file_path = "datasets/trec07p/data/inmail.1"
p = Parser()
p.parse(file_path)

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occas',
  'tri',
  'viagra',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}

##### Reading the index

These helper functions load into memory the path of each email and its corresponding label {spam, ham}.

In [112]:

index_path = "datasets/trec07p/full/index"
print(open(index_path).readlines())


['spam ../data/inmail.1\n', 'ham ../data/inmail.2\n', 'spam ../data/inmail.3\n', 'spam ../data/inmail.4\n', 'spam ../data/inmail.5\n', 'spam ../data/inmail.6\n', 'spam ../data/inmail.7\n', 'spam ../data/inmail.8\n', 'spam ../data/inmail.9\n', 'ham ../data/inmail.10\n', 'spam ../data/inmail.11\n', 'spam ../data/inmail.12\n', 'spam ../data/inmail.13\n', 'spam ../data/inmail.14\n', 'spam ../data/inmail.15\n', 'spam ../data/inmail.16\n', 'spam ../data/inmail.17\n', 'spam ../data/inmail.18\n', 'spam ../data/inmail.19\n', 'ham ../data/inmail.20\n', 'ham ../data/inmail.21\n', 'spam ../data/inmail.22\n', 'spam ../data/inmail.23\n', 'spam ../data/inmail.24\n', 'spam ../data/inmail.25\n', 'spam ../data/inmail.26\n', 'spam ../data/inmail.27\n', 'spam ../data/inmail.28\n', 'ham ../data/inmail.29\n', 'spam ../data/inmail.30\n', 'ham ../data/inmail.31\n', 'spam ../data/inmail.32\n', 'spam ../data/inmail.33\n', 'ham ../data/inmail.34\n', 'spam ../data/inmail.35\n', 'spam ../data/inmail.36\n', 'spam .

In [113]:
import os
from pathlib import Path
def parse_index(path_to_index, n_elements=None, must_exist=True):
    rows = []
    base_dir = Path(path_to_index).parent  # .../trec07p/full

    with open(path_to_index, encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            label, rel_path = parts
            abs_path = (base_dir / rel_path).resolve()
            if must_exist and not abs_path.exists():
                continue  # skip entries that are not present on disk

            rows.append({"label": label, "email_path": str(abs_path)})

            if n_elements is not None and len(rows) >= n_elements:
                break

    return rows

In [114]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [115]:
indexes = parse_index(index_path,10, must_exist=False)
indexes

[{'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.1'},
 {'label': 'ham',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.2'},
 {'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.3'},
 {'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.4'},
 {'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.5'},
 {'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.6'},
 {'label': 'spam',
  'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.7'},
 {'label': 'spam',
  'email_path': 'C:\\Us

### 2. Preprocessing the dataset

The functions presented above read the emails programmatically and process them to remove the components that are not useful for SPAM detection. However, each email is still represented as a Python dictionary containing lists of words.

In [116]:
# Load the index and the labels into memory

idx = parse_index(index_path, n_elements=3)
import os
for r in idx: 
    print(os.path.exists(r["email_path"]), r["email_path"])


True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.1
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.10
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.11


In [118]:
# Read the first email
idx1 = indexes[0]["email_path"]
open(idx1).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [119]:
# Parse the first email
print(indexes[0])
mail, label = parse_email(indexes[0])
print("The email is:", label)
print(mail)

{'label': 'spam', 'email_path': 'C:\\Users\\rafae\\Documents\\curso_machine_learning\\practices\\datasets\\trec07p\\data\\inmail.1'}
The email is: spam
{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occas', 'tri', 'viagra', 'anxieti', 'thing', 'past', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


The Logistic Regression algorithm cannot ingest text directly. Therefore, additional functions must be applied to transform the text of the parsed emails into a numerical representation.
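
To make the idea of a numerical representation concrete, here is a minimal bag-of-words sketch using only the standard library. It illustrates the concept and is not scikit-learn's implementation; the two token lists and the function name `bag_of_words` are made up for the example.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocabulary = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


docs = [
    ["viagra", "discount", "viagra"],
    ["meet", "discount"],
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['discount', 'meet', 'viagra']
print(vectors)     # [[1, 0, 2], [1, 1, 0]]
```

Each email becomes a fixed-length vector with one position per vocabulary word, which is the kind of input a linear model can work with.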

##### Applying CountVectorizer

In [120]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the email as a single text string.
# Note: subject and body are concatenated without a separator, so the last
# subject token and the first body token are fused ('qualiti' + 'do').
prep_email = [" ".join(mail["subject"]) + " ".join(mail["body"])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)

print("Email:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())

Email: ['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self'] 

Input features: ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occas' 'old' 'past'
 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'tri' 'viagra']


In [121]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
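
Note that the feature list contains `qualitido` rather than `qualiti` and `do`: the subject and body strings are concatenated with no separator between them. A possible alternative, sketched here with the token lists from the parsed email (the body is shortened to its first four tokens), is to join all tokens in a single call. The outputs shown in this notebook correspond to the original concatenation.

```python
subject = ["gener", "ciali", "brand", "qualiti"]
body = ["do", "feel", "pressur", "perform"]  # first tokens of the parsed body

# Concatenation used above: no space between the two joined strings
fused = " ".join(subject) + " ".join(body)

# Alternative: join every token with a single space
separated = " ".join(subject + body)

print(fused)      # gener ciali brand qualitido feel pressur perform
print(separated)  # gener ciali brand qualiti do feel pressur perform
```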


##### Applying OneHotEncoding

In [122]:
from sklearn.preprocessing import OneHotEncoder
prep_email = [[w] for w in mail['subject'] + mail['body']]
enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)
print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", X.toarray())

Features:
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali' 'x0_do' 'x0_feel' 'x0_gener'
 'x0_occas' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur' 'x0_qualiti'
 'x0_rise' 'x0_self' 'x0_thing' 'x0_tri' 'x0_viagra']

Values:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0

##### Helper functions for preprocessing the dataset

In [123]:
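def create_prep_dataset(index_path, n_elements):
    """Return the preprocessed texts (X) and labels (y) of the first n_elements emails."""
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    # Iterate over the rows actually returned: parse_index may yield fewer
    # than n_elements when some files are missing on disk.
    for i, index in enumerate(indexes):
        print("\nParsing email:", i, " "*5, end=' ')
        mail, label = parse_email(index)
        # Subject and body are joined without a separator between them
        X.append(" ".join(mail["subject"]) + " ".join(mail["body"]))
        y.append(label)
    return X, y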

### 3. Training the algorithm

In [124]:
# Read only the first 100 emails
index_path = "datasets/trec07p/full/index"
test_rows = parse_index(index_path, n_elements=5)
for r in test_rows:
    print(os.path.isfile(r["email_path"]), r["email_path"])

X_train, y_train = create_prep_dataset(index_path, 100)
X_train

True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.1
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.10
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.11
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.100
True C:\Users\rafae\Documents\curso_machine_learning\practices\datasets\trec07p\data\inmail.101

Parsing email: 0       
Parsing email: 1       
Parsing email: 2       
Parsing email: 3       
Parsing email: 4       
Parsing email: 5       
Parsing email: 6       
Parsing email: 7       
Parsing email: 8       
Parsing email: 9       
Parsing email: 10       
Parsing email: 11       
Parsing email: 12       
Parsing email: 13       
Parsing email: 14       
Parsing email: 15       
Parsing email: 16       
Parsing email: 17       
Parsing email: 18       
Parsing email: 19       
Parsing email: 20       
Parsing 

['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self',
 'r confidenceinterv helphi i use r find 90 confidenceinterv sensit specif follow diagnost test a particular diagnost test multipl sclerosi conduct 20 ms patient 20 healthi subject 6 ms patient classifi healthi 8 healthi subject classifi suffer ms furthermor i need find number ms patient requir sensit 1 is simpl rcommand i complet new r help pleas jochen view messag context httpwwwnabblecomconfidenceintervalshelptf3544217htmla9894014 sent r help mail list archiv nabblecom rhelpstatmathethzch mail list httpsstatethzchmailmanlistinforhelp pleas read post guid httpwwwrprojectorgpostingguidehtml provid comment minim selfcontain reproduc code',
 'for smilegood dayvisit new onlin drug store save upto 85today special offer viagra for as low as 162 per dose ciali super viagra for as low as 438 per dose levitra for as low as 444 per dose much much special offer todayy need 15 minut to be 

##### Vectorizing the data

In [125]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [126]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 5281


In [127]:
import pandas as pd
# Pass the feature names directly; wrapping them in a list would create MultiIndex columns
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())


Unnamed: 0,0000,000000,000066,0000ff,00085,00450,0089,009,00jpi7u6ib3q35ow3goutqqrfx8bm5nufxwssut98e53cwfxzdpo49s1prqu7aswei,0100,...,zeroon,zi8jrt1b9s7y1g2xgbz9mb06s1cj5ahmvzpmj0w5ujrmmnvg16viygcqo98k9ltub8y97x9,zimbabw,zip,zoom,zou7xy78kmvtjrzmlaqm0eao66l2lzi5hhvjgwq2bvwvp5dwmstl2vdqtk0vbagbax829,zrcb4q1k51epi4nlzwzyvkyfpmvdwyz2vqvuqe41uu2t9arorus0vqo9yc72ypodkpkzo,zs4vzzfnwm5prxuab3stwqftoh7jfwao0okh9mfjt2j9tnor6vt9fbqdxqrtx63ixu8ud,zvezda,zzf257kjlyfqixqvoi96q89fyuhjak6vnxpoihuooupyak2pgnwy8js1vo9bbxltxxx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [128]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam']

###### Training the Logistic Regression algorithm on the preprocessed dataset

In [129]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
                   intercept_scaling=1, class_weight=None, random_state=None,
                   solver='lbfgs', max_iter=100)

### 4. Prediction

##### Reading a set of new emails

In [130]:
# Read 150 emails from the dataset and keep only the last 50
# These 50 emails were not used to train the algorithm
X, y = create_prep_dataset(index_path, 150)
X_test = X[100:]
y_test = y[100:]
X_test


['real viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click httpwwwsilvejkkbrandcom mega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'notemotorway footbridg theatric protest tabloid press home sit couch night i say short essay concern journey anew impression global interconnect differ cul ture subject comput screen accomplish noth sit long hour hallucinog nic narcot avail market peopl industri replica tion rampant certain degre problem thi enabl c linton administr view would still abl everyth hand when i left painter believ brush tool paint program could thu permit af fordabl price peopl wish live arcad game could fulfil ed simpli sit front becom realiti some believ charact alreadi exist like you avoid place you meet mani peopl you keep wood metal enhanc elev nativ artwork newspap borrow book li braryget video play point peopl never leav termin entir thr ough comput languag if believ theori anywher those p retti h

##### Preprocessing the emails with the previously fitted vectorizer

In [131]:
X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [132]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam'],
      dtype='<U4')

In [133]:
print("Prediction:", y_pred)
print("True labels:", y_test)

Prediction: ['spam' 'spam' 'ham' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham'
 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'ham' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']
True labels: ['spam', 'spam', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


##### Evaluating the results

In [134]:
from sklearn.metrics import accuracy_score
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.940
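
Accuracy alone does not say which kind of mistake the model makes. Comparing the predictions and true labels printed above pair by pair gives the breakdown below. This is a standard-library sketch in which the two 50-element sequences are transcribed from that output as compact strings (`s` for spam, `h` for ham); `sklearn.metrics.confusion_matrix` applied to `y_test` and `y_pred` would report the same counts.

```python
from collections import Counter

# Transcribed from the output above: s = spam, h = ham
pred = ("sshhssssss" "hhhssshsss" "ssshshssss" "ssssshhhhs" "ssssssssss")
true = ("sshhsssssh" "hhhssshsss" "sshhshssss" "ssssshhhhs" "shssssssss")

# Count (true label, predicted label) pairs
pairs = Counter(zip(true, pred))

accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
print("Accuracy: {:.3f}".format(accuracy))           # Accuracy: 0.940
print("spam predicted as spam:", pairs[("s", "s")])  # 35
print("spam predicted as ham: ", pairs[("s", "h")])  # 0
print("ham predicted as ham:  ", pairs[("h", "h")])  # 12
print("ham predicted as spam: ", pairs[("h", "s")])  # 3
```

In this small sample all three errors are legitimate emails flagged as spam, which for a mail filter is usually the more costly kind of mistake.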


### 5. Increasing the size of the dataset

In [143]:
# Read 1,200 emails
X, y = create_prep_dataset(index_path, 1200)


In [144]:
# Use 1,000 emails to train the algorithm and 200 for testing
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

In [145]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [146]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True,
                   intercept_scaling=1, class_weight=None, random_state=None,
                   solver='lbfgs', max_iter=100)


In [147]:
X_test = vectorizer.transform(X_test)

In [148]:
y_pred = clf.predict(X_test)

In [149]:
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.985
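
The vectorize, fit, transform and predict steps repeated in this notebook can also be bundled into a single estimator. As one possible arrangement (not what the notebook does above), scikit-learn's `make_pipeline` chains a `CountVectorizer` and a `LogisticRegression`, so the vectorizer is always fitted on the training texts only. The toy texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the preprocessed emails (assumption)
train_texts = [
    "cheap viagra discount",
    "viagra discount offer",
    "meeting agenda attached",
    "project meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the raw strings and then fits the classifier
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# New texts go through the same fitted vocabulary before prediction
predictions = model.predict(["discount viagra", "meeting notes"]).tolist()
print(predictions)
```

With the real data, `model.fit(X[:1000], y[:1000])` followed by `model.predict(X[1000:])` would replace the separate `fit_transform`, `fit`, `transform` and `predict` cells.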
