# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Problem statement

The goal is to build a machine learning system that predicts whether a given email is SPAM or not. The following dataset is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07)
The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400

### 1. Helper functions

In [105]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes character references
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [106]:
# This function removes any HTML tags found in the email text
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [107]:
# Example of stripping the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'

In [108]:
import email
import string
import nltk
nltk.download('stopwords')

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}

    def get_email_body(self, payload, content_type):
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body

    def tokenize(self, text):
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stemming of the tokens
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sebma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
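
One detail of `tokenize` is worth illustrating: tokens are compared with the stopword list exactly as written, and the NLTK English stopwords are all lowercase, so capitalised words such as "The" are not filtered out. The following minimal sketch reproduces that effect with the standard library only. `STOPWORDS` is a small hypothetical stand-in for the NLTK list, and `simple_tokenize` is an illustrative helper that skips stemming; neither is part of the notebook.

```python
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english') (assumption)
STOPWORDS = {"the", "is", "a", "to", "it"}

def simple_tokenize(text, lowercase_first=False):
    """Remove punctuation, split on whitespace and drop stopwords (no stemming)."""
    for c in string.punctuation:
        text = text.replace(c, "")
    tokens = text.split()
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

sample = "The offer is valid. Click to claim it!"
print(simple_tokenize(sample))                        # "The" survives the filter
print(simple_tokenize(sample, lowercase_first=True))  # "the" is removed
```

Lowercasing before the stopword check would change the resulting vocabulary, so it is shown here only as a possible variant.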


##### Reading an email in raw format

In [320]:
inmail = open("C:\\lrec07p\\dataset\\inmail1.1").read()
print(inmail)


Received: from CH3PR22MB4513.namprd22.prod.outlook.com (::1) by
 MW4PR22MB3160.namprd22.prod.outlook.com with HTTPS; Wed, 11 Sep 2024 13:36:27
 +0000
Received: from BN9PR03CA0721.namprd03.prod.outlook.com (2603:10b6:408:110::6)
 by CH3PR22MB4513.namprd22.prod.outlook.com (2603:10b6:610:1a0::21) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7939.26; Wed, 11 Sep
 2024 13:36:23 +0000
Received: from DS3PEPF0000C37A.namprd04.prod.outlook.com
 (2603:10b6:408:110:cafe::b9) by BN9PR03CA0721.outlook.office365.com
 (2603:10b6:408:110::6) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7939.24 via Frontend
 Transport; Wed, 11 Sep 2024 13:36:22 +0000
Authentication-Results: spf=pass (sender IP is 54.240.11.128)
 smtp.mailfrom=amazonses.com; dkim=pass (signature was verified)
 header.d=amazon.com;dmarc=pass action=none
 header.from=amazon.com;compauth=pass reason=100
Received-SPF: Pass (protec

##### Parsing the email

In [321]:
p = Parser()
p.parse("C:\\lrec07p\\dataset\\inmail1.1") 

{'subject': ['comienc', 'utilizar', 'aw', 'hoy', 'mismo'],
 'body': ['aquc3ad',
  'comienza',
  'su',
  'recorrido',
  'httpsemailawscloudcommteylvrats03njyaaagvgevkswubrkljzsxmzlu5gs6kv2',
  'mg8jptnvlbutsskkdlx91kelwcjx7qhgv7snorz1iue3d',
  'explor',
  'la',
  'nube',
  'de',
  'aw',
  'amazon',
  'web',
  'servic',
  'aw',
  'cuenta',
  'con',
  'lo',
  'servicio',
  'para',
  'ayudarlo',
  'crear',
  'ap',
  'licacion',
  'sofisticada',
  'con',
  'mayor',
  'flexibilidad',
  'escalabilidad',
  'fiabilidad',
  'configur',
  'su',
  'cuenta',
  'httpsemailawscloudcommteylvrats03njyaaagvgevks0i',
  'fihfrda1qapm4xqxc1aebxxlrptqfmiemqzsn3oc1eo1l0bwtmzehalhuwr1ge3d',
  'vea',
  'la',
  'documentacic3b3n',
  'httpsemailawscloudcommteylvrats03njyaaagvg',
  'evksnfmtf6aevqhmw98qesewi1u6xbjxhuv2dhgvw1z2nwyc3fflighax0bbv9ll5vtai3d',
  'proteja',
  'su',
  'usuario',
  'rac3adz',
  'activ',
  'la',
  'autenticacic3b3n',
  'multifactor',
  'para',
  'proteg',
  'lo',
  'recurso',
  'de',
  'l

##### Reading the index

In [323]:
index = open("C:\\lrec07p\\full_2\\index").readlines()
index

['spam ../dataset/inmail1.1\n',
 'spam ../dataset/inmail2.2\n',
 'spam ../dataset/inmail3.3\n',
 'spam ../dataset/inmail4.4\n',
 'spam ../dataset/inmail5.5\n',
 'ham ../dataset/inmail6.6\n',
 'ham ../dataset/inmail7.7\n',
 'ham ../dataset/inmail8.8\n',
 'ham ../dataset/inmail9.9\n',
 'ham ../dataset/inmail10.10\n',
 'ham ../dataset/inmail11.11\n',
 'ham ../dataset/inmail12.12\n',
 'ham ../dataset/inmail13.13\n',
 'ham ../dataset/inmail14.14\n',
 'ham ../dataset/inmail15.15\n',
 'ham ../dataset/inmail16.16\n',
 'ham ../dataset/inmail17.17\n',
 'ham ../dataset/inmail18.18\n',
 'ham ../dataset/inmail19.19\n',
 'ham ../dataset/inmail20.20']

In [324]:
import os

DATASET_PATH = "C:\\lrec07p"

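def parse_index(path_to_index, n_elements):
    """Return the label and full path of the first n_elements index entries."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        mail = index[i].split(" ../")
        label = mail[0]
        # strip() instead of [:-1]: the last line of the index may lack a trailing newline
        path = mail[1].strip()
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes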

In [325]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [326]:
indexes = parse_index("C:\\lrec07p\\full_2\\index", 20)
indexes

[{'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail1.1'},
 {'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail2.2'},
 {'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail3.3'},
 {'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail4.4'},
 {'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail5.5'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail6.6'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail7.7'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail8.8'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail9.9'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail10.10'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail11.11'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail12.12'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail13.13'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\dataset/inmail14.14'},
 {'label': 'ham', 'email_path': 'C:\\lrec07p\\d

### 2. Preprocessing the dataset

The functions presented above read the emails programmatically and process them to remove the components that are not useful for SPAM detection. However, each email is still represented as a Python dictionary holding lists of tokens.

In [327]:
index = parse_index("C:\\lrec07p\\full_2\\index", 1)  # label the email
print(index)

[{'label': 'spam', 'email_path': 'C:\\lrec07p\\dataset/inmail1.1'}]


In [328]:
import os
open(index[0]["email_path"]).read()

'Received: from CH3PR22MB4513.namprd22.prod.outlook.com (::1) by\n MW4PR22MB3160.namprd22.prod.outlook.com with HTTPS; Wed, 11 Sep 2024 13:36:27\n +0000\nReceived: from BN9PR03CA0721.namprd03.prod.outlook.com (2603:10b6:408:110::6)\n by CH3PR22MB4513.namprd22.prod.outlook.com (2603:10b6:610:1a0::21) with\n Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7939.26; Wed, 11 Sep\n 2024 13:36:23 +0000\nReceived: from DS3PEPF0000C37A.namprd04.prod.outlook.com\n (2603:10b6:408:110:cafe::b9) by BN9PR03CA0721.outlook.office365.com\n (2603:10b6:408:110::6) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7939.24 via Frontend\n Transport; Wed, 11 Sep 2024 13:36:22 +0000\nAuthentication-Results: spf=pass (sender IP is 54.240.11.128)\n smtp.mailfrom=amazonses.com; dkim=pass (signature was verified)\n header.d=amazon.com;dmarc=pass action=none\n header.from=amazon.com;compauth=pass reason=100\nReceived

In [329]:
mail,label = parse_email(index[0])
print("The email is:", label)
print(mail)

The email is: spam
{'subject': ['comienc', 'utilizar', 'aw', 'hoy', 'mismo'], 'body': ['aquc3ad', 'comienza', 'su', 'recorrido', 'httpsemailawscloudcommteylvrats03njyaaagvgevkswubrkljzsxmzlu5gs6kv2', 'mg8jptnvlbutsskkdlx91kelwcjx7qhgv7snorz1iue3d', 'explor', 'la', 'nube', 'de', 'aw', 'amazon', 'web', 'servic', 'aw', 'cuenta', 'con', 'lo', 'servicio', 'para', 'ayudarlo', 'crear', 'ap', 'licacion', 'sofisticada', 'con', 'mayor', 'flexibilidad', 'escalabilidad', 'fiabilidad', 'configur', 'su', 'cuenta', 'httpsemailawscloudcommteylvrats03njyaaagvgevks0i', 'fihfrda1qapm4xqxc1aebxxlrptqfmiemqzsn3oc1eo1l0bwtmzehalhuwr1ge3d', 'vea', 'la', 'documentacic3b3n', 'httpsemailawscloudcommteylvrats03njyaaagvg', 'evksnfmtf6aevqhmw98qesewi1u6xbjxhuv2dhgvw1z2nwyc3fflighax0bbv9ll5vtai3d', 'proteja', 'su', 'usuario', 'rac3adz', 'activ', 'la', 'autenticacic3b3n', 'multifactor', 'para', 'proteg', 'lo', 'recurso', 'de', 'la', 'cuenta20', 'activ', 'la', 'mfa', 'e280ba', 'httpsemailawscloudcommteylvrats03njyaaa

The Logistic Regression algorithm cannot take raw text as input. A set of additional functions must therefore be applied to transform the text of the parsed emails into a numerical representation.
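
As a rough illustration of what such a numerical representation looks like, the sketch below builds bag-of-words count vectors by hand using only the standard library. The two toy documents and the helper name `bag_of_words` are assumptions made for this example; the notebook itself relies on scikit-learn for this step.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["cheap viagra cheap", "meeting notes attached"]  # toy documents (assumption)
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for v in vectors:
    print(v)
```

Each document becomes a row of integers with one column per vocabulary word, which is the kind of input a linear model can consume.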

##### Applying CountVectorizer

In [330]:
from sklearn.feature_extraction.text import CountVectorizer
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)
print("email: ", prep_email, "\n")
print("features: ", vectorizer.get_feature_names_out())

email:  ['comienc utilizar aw hoy mismoaquc3ad comienza su recorrido httpsemailawscloudcommteylvrats03njyaaagvgevkswubrkljzsxmzlu5gs6kv2 mg8jptnvlbutsskkdlx91kelwcjx7qhgv7snorz1iue3d explor la nube de aw amazon web servic aw cuenta con lo servicio para ayudarlo crear ap licacion sofisticada con mayor flexibilidad escalabilidad fiabilidad configur su cuenta httpsemailawscloudcommteylvrats03njyaaagvgevks0i fihfrda1qapm4xqxc1aebxxlrptqfmiemqzsn3oc1eo1l0bwtmzehalhuwr1ge3d vea la documentacic3b3n httpsemailawscloudcommteylvrats03njyaaagvg evksnfmtf6aevqhmw98qesewi1u6xbjxhuv2dhgvw1z2nwyc3fflighax0bbv9ll5vtai3d proteja su usuario rac3adz activ la autenticacic3b3n multifactor para proteg lo recurso de la cuenta20 activ la mfa e280ba httpsemailawscloudcommteylvrats03njyaaagvgev ks96ihu3expkksuzeplyvzjgked7wbv46v4ay3yzy2vu6482eltncxp74rxhcwvuny3d conozca la consola de administracic3b3n de aw comienc con lo aspecto bc3a1sico de la nube conozca lo concepto bc3a1sico explor la prc3a1ctica recomenda

In [331]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[  2   1   1   1  10   2   2  11   6   1   1   1   1   1 441   1   2   1
    2   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1
    1   1   2   1   1   3   1   1   1   2   1   1   1   1   1   2   1   1
    2   1   1   1   1   1   2   1   1   2   6   1   2   1   2   4   1   1
    1  14   6   4   1   3   1   2   1   1   1   1   1   3   3   4   2  10
    2   2  18   4   1   1   1   1   2   1   4   2   1   6   1   2   3   1
    2   1   2   1   1   4   4   2   1   1   2   2   2   1   1   1   2   1
    1  22   1   2   4   2   1   9   2   2   5   4   2   2   2   2   2   7
    1   1   1   6   2   2   2  15   1  32   2   2   1   3   2   1   1   2
    9   2   8   2   1   4   2  12   2   1   2   1   1   4   1   1   1   1
    1   1   3   2   1   2   4   2   2   2   1   1   2   6   7   1   1   1
    1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1
    6   1   1   1   1   2   1   1   1   4   1   1   1   4  10   1   6   1
    1   4   8   3   2   2   

##### Applying OneHotEncoding

In [332]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", X.toarray())

Features:
 ['x0_0' 'x0_000000' 'x0_007eb9' 'x0_022' 'x0_09' 'x0_0972d3' 'x0_0pt'
 'x0_0px' 'x0_100' 'x0_10px' 'x0_12px' 'x0_13px' 'x0_15px' 'x0_16px'
 'x0_18px' 'x0_2' 'x0_20' 'x0_2022' 'x0_20px' 'x0_22px' 'x0_232f3e'
 'x0_23px' 'x0_28px' 'x0_30px'
 'x0_31qmpdgkzwdtckwnmnkoqswfuav3eloot53bitlqz0nnzx4zrikicz3gpsygdu3d'
 'x0_32f6y9hlggeye3crauk3omlppl43d' 'x0_360px' 'x0_38px'
 'x0_3brh02ibbgiz5kacfucap7we2zu4anpyzgpbonxupgbnhr0r9qn8ciuxwjccnxyljxqimho'
 'x0_3ghggwb7zhiplg9wvisbcyaknwqbnzuffnpf7wpa3swvnxn9dhresfvc0zznm91n0hotqxodol'
 'x0_3gycltg9ph64cwzzocmly6ffj4yfioieoeolms9cjnmdlxui3d'
 'x0_3ptfygvk01eervonf98k03d'
 'x0_3rpkczcwrm7hq6yy3jp6xjpdmdj6wrgse3ktkr3vvxbvikbovtzy14mbyi8wy4v9mm13h4'
 'x0_410' 'x0_44ed82' 'x0_46px' 'x0_480px' 'x0_48px'
 'x0_4pizdj7mbovapexfa9ag4mh9pu8qehpqxybo79wxjle2rs87qubwjzdmvmvafdfhdp0lfcaio'
 'x0_4vzvdnfe2b7e81ro4ga0vv6zuqr4em4rz9armbljxqkh2k5x5se4txxkcxqovwzulxluxpld'
 'x0_50px'
 'x0_54wmh3k4hidysjazjazto9poty2fgg43pusywaq8edxl0vc5w1aktj6js43d'
 'x0_56px'

##### Helper functions for preprocessing the dataset

In [333]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end='')
        try:
            mail, label = parse_email(indexes[i])
            X.append(" ".join(mail['subject']) + " ".join(mail['body']))
            y.append(label)
        except Exception:
            # Skip emails that cannot be read or parsed
            pass
    return X, y
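
Note that the expression `" ".join(mail['subject']) + " ".join(mail['body'])` places no space between the two parts, so the last subject token and the first body token end up fused into a single word. A possible variant that keeps them separate is sketched below. `join_tokens` is a hypothetical helper, not part of the notebook, and switching to it would change the vocabulary and therefore the results reported here.

```python
def join_tokens(subject, body):
    """Join subject and body tokens into one space-separated string."""
    return " ".join(subject + body)

subject = ["gener", "ciali", "brand", "qualiti"]  # toy tokens (assumption)
body = ["do", "feel", "pressur"]

fused = " ".join(subject) + " ".join(body)  # concatenation as used in create_prep_dataset
separated = join_tokens(subject, body)

print(fused)      # last subject token and first body token are glued together
print(separated)  # every token stays a separate word
```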

### 3. Training the algorithm

In [334]:
# Read only a subset of 100 emails
X_train, y_train = create_prep_dataset("C:\\lrec07p\\full\\index", 100)
X_train

Parsing email: 100

['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self',
 'typo debianreadmhi ive updat gulu i check mirror it seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest the current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairelinuxcom 5149941556 to unsubscrib email debianmirrorsrequestlistsdebianorg subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click httpwwwmoujsjkhchumcom authent viagra mega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'nice talk yahey billi realli fun go night talk said felt insecur manhood i notic toilet quit small area worri websit i tell secret weapon extra 3 inch trust g

##### Applying vectorization to the data

In [335]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [336]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 4911


In [337]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=[vectorizer.get_feature_names_out()])

Unnamed: 0,0000,000000,00085,002,003,00450,009,01,01000u,0107,...,ӧanz,ӭѯ,ԡšݡ淶,լһʽ,չҵϣ,سŵþʊʊݾѯ,ڶҵţ,㶫иï26,饻jwk,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [338]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam']

###### Training the logistic regression algorithm on the preprocessed dataset

In [339]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
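
For intuition about what the fitted classifier computes, a binary logistic regression model scores an email by taking a weighted sum of its word counts plus a bias and passing the result through the sigmoid function to obtain a probability. The sketch below shows that computation with the standard library. The weights, bias and count vector are invented for illustration and are not the values learned by `clf`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, counts):
    """Probability of the positive class for one count vector."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

# Invented values for a three-word vocabulary (assumption)
weights = [1.5, -2.0, 0.5]
bias = -0.5
counts = [2, 0, 1]

p = predict_proba(weights, bias, counts)
label = "spam" if p >= 0.5 else "ham"
print(round(p, 3), label)
```

Training amounts to choosing the weights and bias so that these probabilities agree as closely as possible with the labels in `y_train`.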

### 4. Prediction

##### Reading a set of new emails

In [340]:
# Read 20 emails from a second index (full_2) and discard the first one
# These emails were not used to train the algorithm
X, y = create_prep_dataset("C:\\lrec07p\\full_2\\index", 20)
X_test = X[1:]
y_test = y[1:]

Parsing email: 20

##### Preprocessing the emails with the vectorizer created earlier

In [341]:
X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [342]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'ham', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham',
       'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam'],
      dtype='<U4')

In [343]:
print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['spam' 'ham' 'spam' 'spam' 'spam' 'ham' 'ham' 'spam' 'ham' 'spam' 'ham'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']

True labels:
 ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham']


##### Evaluating the results

In [344]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.353
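
The accuracy can be checked by hand from the two lists printed above, and counting the kinds of errors shows where the model fails. The sketch below copies the predictions and labels from that output and uses only the standard library. Treating 'spam' as the positive class is a choice made for this example.

```python
# Predictions and true labels copied from the output above
y_pred = ['spam', 'ham', 'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham',
          'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']
y_true = ['spam'] * 3 + ['ham'] * 14

pairs = list(zip(y_true, y_pred))
correct = sum(t == p for t, p in pairs)
accuracy = correct / len(pairs)

# Errors split by type, with 'spam' as the positive class
false_positives = sum(t == 'ham' and p == 'spam' for t, p in pairs)
false_negatives = sum(t == 'spam' and p == 'ham' for t, p in pairs)

print('Accuracy: {:.3f}'.format(accuracy))  # Accuracy: 0.353
print('ham predicted as spam:', false_positives)
print('spam predicted as ham:', false_negatives)
```

Most of the mistakes are legitimate emails flagged as spam, which may be related to the small training subset of 100 emails, most of them labelled spam.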


### 5. Enlarging the dataset

In [133]:
# Read 12000 emails
X, y = create_prep_dataset("C:\\lrec07p\\full\\index", 12000)

Parsing email: 12000

In [134]:
# Use 10000 emails to train the algorithm and 2000 for testing
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

In [215]:
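# Run this cell once: re-running it after X_train already holds the sparse matrix
# raises the AttributeError shown below
vectorizer = CountVectorizer()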
X_train = vectorizer.fit_transform(X_train)

AttributeError: 'csr_matrix' object has no attribute 'lower'

In [36]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [44]:
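# Run this cell once: re-running it after X_test has already been vectorized
# raises the AttributeError shown below
X_test = vectorizer.transform(X_test)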

AttributeError: 'csr_matrix' object has no attribute 'lower'

In [49]:
y_pred = clf.predict(X_test)

In [46]:
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.986
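
The `AttributeError` outputs in this section are what `CountVectorizer` raises when it receives an already vectorized sparse matrix instead of raw strings, which happens if a cell that overwrites `X_train` or `X_test` is executed twice. One way to avoid overwriting the raw text is to wrap both steps in a scikit-learn pipeline, as in the sketch below. The toy documents and variable names are assumptions for illustration, and this is an alternative arrangement rather than the notebook's own code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the parsed emails (assumption)
train_texts = ["cheap viagra discount", "meeting notes attached",
               "discount price click", "project schedule update"]
train_labels = ["spam", "ham", "spam", "ham"]
test_texts = ["click discount", "schedule meeting"]

# The pipeline vectorizes and classifies in one object, so the raw
# text lists are never replaced by sparse matrices
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
model.fit(train_texts, train_labels)  # fitting again is safe: inputs are still strings

predictions = model.predict(test_texts)
print(predictions)
```

With this arrangement the same cell can be re-executed in any order, because `train_texts` and `test_texts` always remain lists of strings.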
