# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Problem statement
The goal is to build a machine learning system that predicts whether a given e-mail is SPAM or not. The following DataSet is used:

[DataSet]()

The corpus trec07p contains 75,419 messages:

25,220 Ham
50,199 SPAM

These messages constitute all the messages delivered to a particular server between these dates:

Sun, 8 Apr 2007 13:07:21 -0400
Fri, 6 Jul 2007 07:04:53 -0400

### 1.- Helper Functions

In this case study on SPAM e-mail detection, the available DataSet consists of raw e-mails, with their headers and additional fields. They therefore require preprocessing before being fed to the Machine Learning algorithm.
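To make "headers and additional fields" concrete, the following minimal sketch parses a made-up message with Python's standard `email` module. The addresses and text are placeholders invented for illustration; they are not taken from the corpus.

```python
import email

# A tiny hand-written message: headers, a blank line, then the body.
raw = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Hello there\n"
    "Content-Type: text/plain\n"
    "\n"
    "This is the body.\n"
)

msg = email.message_from_string(raw)

subject = msg["Subject"]               # header lookup by name
content_type = msg.get_content_type()  # MIME type of the message
body = msg.get_payload()               # body text for a non-multipart message

print(subject)
print(content_type)
print(body.strip())
```

For multipart messages, `get_payload()` returns a list of sub-messages instead of a string, which is why the parser in this notebook recurses over the parts.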

In [1]:
# This class helps process e-mails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state; convert_charrefs=True
        # turns character references (e.g. &amp;) into text before handle_data.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes the HTML tags found in the e-mail text.

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data() 

In [3]:
# Example of stripping the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/21/16.html#article">Phrack Wordl News</a><td>'
strip_tags(t)

'Phrack Wordl News'

Besides removing any HTML tags found in the e-mail, other preprocessing steps are needed so that the messages do not contain unnecessary noise. These include removing punctuation, discarding e-mail fields that are not relevant, and reducing each word to its root by stripping affixes (stemming). The class below performs these transformations.
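As a rough illustration of the punctuation and noise-removal steps, the sketch below uses only the standard library. It leaves out stemming, and its stopword set is a hypothetical, deliberately tiny list rather than the NLTK list used in this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and"}

def simple_tokenize(text):
    """Remove punctuation, split on whitespace, and drop stopwords."""
    cleaned = text.translate(PUNCT_TABLE)
    return [w for w in cleaned.split() if w.lower() not in STOPWORDS]

tokens = simple_tokenize("Generic Cialis, branded quality@ and the rest")
print(tokens)
```

`str.translate` removes all punctuation in one pass, an alternative to calling `replace` once per symbol.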

In [4]:
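import email
import string
import nltk

# The Parser below needs the NLTK stopword list; if it is missing, run once:
# nltk.download('stopwords')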


In [5]:
class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)
    
    def get_email_content(self, msg):
        """Extract the email content"""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}
    
    def get_email_body(self, payload, content_type):
        """Extract the body of the email"""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body
    
    def tokenize(self, text):
        """Split a text string into tokens. Removes punctuation, drops
        stopwords, and stems the remaining tokens."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stemming of the tokens
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

Reading an e-mail in raw format

In [6]:
inmail = open("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.1").read()

print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

Parsing the e-mail


In [7]:
p = Parser()
p.parse("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occas',
  'tri',
  'viagra',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}

Reading the index

These helper functions load into memory the path of each e-mail and its corresponding label {ham, spam}.
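Each index line pairs a label with a relative path, for example `spam ../data/inmail.1`. The sketch below splits a few such lines and counts the labels. The helper name `split_index_line` is an assumption made for this illustration and is not part of the notebook.

```python
from collections import Counter

# A few index lines in the "label ../relative/path" format.
sample_lines = [
    "spam ../data/inmail.1\n",
    "ham ../data/inmail.2\n",
    "spam ../data/inmail.3\n",
]

def split_index_line(line):
    """Return (label, relative_path) for one index line."""
    label, rel_path = line.strip().split(" ../")
    return label, rel_path

pairs = [split_index_line(line) for line in sample_lines]
label_counts = Counter(label for label, _ in pairs)

print(pairs)
print(label_counts)
```

Counting labels this way is a quick check of class balance before training.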

In [8]:
index = open("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [9]:
import os 

DATASET_PATH = "/home/SpringyB6/Documents/Simulacion/datasets/trec07p"

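def parse_index(path_to_index, n_elements):
    """Return the label and absolute path of the first n_elements e-mails."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like "spam ../data/inmail.1\n"
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes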


#import os
#
#DATASET_PATH = "datasets/trec07p"
#
#def parse_index(path_to_index, n_elements):
#    ret_indexes = []
#    index = open(path_to_index).readlines()
#    for i in range(n_elements):
#        label, rel_path = index[i].strip().split(" ../")
#        email_path = os.path.join(DATASET_PATH, rel_path)
#        ret_indexes.append({"label": label, "email_path": email_path})
#    return ret_indexes

In [10]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [11]:
indexes = parse_index("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.1'},
 {'label': 'ham',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.2'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.3'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.4'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.5'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.6'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.7'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.8'},
 {'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.9'},
 {'label': 'ham',
  'email_path': '/ho

## 2.- Preprocessing the DataSet

The functions above make it possible to read the e-mails programmatically and preprocess them, removing the components that are not useful for SPAM detection. However, each e-mail is still represented as a Python dictionary containing lists of words.

In [12]:
# Load the index and the labels into memory
index = parse_index("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index", 1)
index

[{'label': 'spam',
  'email_path': '/home/SpringyB6/Documents/Simulacion/datasets/trec07p/data/inmail.1'}]

In [13]:
# Read the first e-mail
import os 

open(index[0]["email_path"]).readlines()

['From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\n',
 'Return-Path: <RickyAmes@aol.com>\n',
 'Received: from 129.97.78.23 ([211.202.101.74])\n',
 '\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n',
 '\tSun, 8 Apr 2007 13:07:21 -0400\n',
 'Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\n',
 'Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\n',
 'From: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'To: the00@speedy.uwaterloo.ca\n',
 'Subject: Generic Cialis, branded quality@ \n',
 'Date: Sun, 08 Apr 2007 21:00:48 +0300\n',
 'X-Mailer: Microsoft Outlook Express 6.00.2600.0000\n',
 'MIME-Version: 1.0\n',
 'Content-Type: multipart/alternative;\n',
 '\tboundary="--8896484051606557286"\n',
 'X-Priority: 3\n',
 'X-MSMail-Priority: Normal\n',
 'Status: RO\n',
 'Content-Length: 988\n',
 'Lines: 24\n',
 '\n',
 '----8896484051606557286\n',
 'Content-Type: text/html;\n',
 'Content-Transfer-Encoding: 7Bi

In [14]:
# Parse the e-mail
mail, label = parse_email(index[0])
print("The e-mail is", label)
print(mail)

The e-mail is spam
{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occas', 'tri', 'viagra', 'anxieti', 'thing', 'past', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


The Logistic Regression algorithm cannot ingest text as part of the DataSet. A series of additional functions must therefore be applied to transform the text of the parsed e-mails into a numerical representation.
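One common numerical representation is a bag of words: a fixed vocabulary plus, for each document, a vector of word counts. The pure-Python sketch below shows the idea on a made-up token list. It is a simplified illustration, not the scikit-learn implementation.

```python
from collections import Counter

# Made-up tokens standing in for a parsed e-mail.
tokens = ["viagra", "discount", "viagra", "click"]

# The vocabulary is the sorted set of distinct tokens.
vocabulary = sorted(set(tokens))

# The document becomes a vector of per-token counts in vocabulary order.
counts = Counter(tokens)
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['click', 'discount', 'viagra']
print(vector)      # [1, 1, 2]
```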

Applying CountVectorizer

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the e-mail as a single text string.
# Note: subject and body are concatenated without a separator, so the last
# subject token fuses with the first body token ('qualiti' + 'do' -> 'qualitido').

prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)
print("e-mail: ", prep_email, "\n")
print("Input features: ", vectorizer.get_feature_names_out())

e-mail:  ['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self'] 

Input features:  ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occas' 'old' 'past'
 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'tri' 'viagra']


In [16]:
x = vectorizer.transform(prep_email)
print("\nValues:\n", x.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
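The feature list above contains `qualitido`, a fusion of the last subject token (`qualiti`) and the first body token (`do`), because the two joined strings are concatenated with no space between them. A possible variant, shown here only as a sketch and not what the notebook runs, joins the two token lists first. The outputs recorded in this notebook come from the original concatenation.

```python
# Token lists copied from the parsed e-mail shown earlier in this notebook.
mail = {
    "subject": ["gener", "ciali", "brand", "qualiti"],
    "body": ["do", "feel", "pressur"],  # shortened for brevity
}

# Concatenating the lists first keeps every token separated by a space.
prep_email = [" ".join(mail["subject"] + mail["body"])]
print(prep_email)
```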


In [17]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]

enc =  OneHotEncoder(handle_unknown='ignore')
x = enc.fit_transform(prep_email)

print("Features: \n", enc.get_feature_names_out())
print("\nValues:\n", x.toarray())

Features: 
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali' 'x0_do' 'x0_feel' 'x0_gener'
 'x0_occas' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur' 'x0_qualiti'
 'x0_rise' 'x0_self' 'x0_thing' 'x0_tri' 'x0_viagra']

Values:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

Helper functions for preprocessing the DataSet


In [18]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\nParsing email: {}".format(i+1), end='')
        mail, label = parse_email(indexes[i])
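        # Subject and body are concatenated without a separator, so the last
        # subject token fuses with the first body token.
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))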
        y.append(label)
    return X, y

## 3.- Training the Algorithm

In [19]:
# Read only a subset of 100 e-mails
X_train, y_train = create_prep_dataset("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index", 100)
X_train


Parsing email: 1
Parsing email: 2
Parsing email: 3
Parsing email: 4
Parsing email: 5
Parsing email: 6
Parsing email: 7
Parsing email: 8
Parsing email: 9
Parsing email: 10
Parsing email: 11
Parsing email: 12
Parsing email: 13
Parsing email: 14
Parsing email: 15
Parsing email: 16
Parsing email: 17
Parsing email: 18
Parsing email: 19
Parsing email: 20
Parsing email: 21
Parsing email: 22
Parsing email: 23
Parsing email: 24
Parsing email: 25
Parsing email: 26
Parsing email: 27
Parsing email: 28
Parsing email: 29
Parsing email: 30
Parsing email: 31
Parsing email: 32
Parsing email: 33
Parsing email: 34
Parsing email: 35
Parsing email: 36
Parsing email: 37
Parsing email: 38
Parsing email: 39
Parsing email: 40
Parsing email: 41
Parsing email: 42
Parsing email: 43
Parsing email: 44
Parsing email: 45
Parsing email: 46
Parsing email: 47
Parsing email: 48
Parsing email: 49
Parsing email: 50
Parsing email: 51
Parsing email: 52
Parsing email: 53
Parsing email: 54
Parsing email: 55
Parsing email: 56


['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self',
 'typo debianreadmhi ive updat gulu i check mirror it seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest the current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairelinuxcom 5149941556 to unsubscrib email debianmirrorsrequestlistsdebianorg subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click httpwwwmoujsjkhchumcom authent viagra mega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'nice talk yahey billi realli fun go night talk said felt insecur manhood i notic toilet quit small area worri websit i tell secret weapon extra 3 inch trust g

In [20]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)


In [21]:
print(X_train.toarray())
print("\nFeatures: ", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features:  4911


In [22]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,0000,000000,00085,002,003,00450,009,01,01000u,0107,...,ӧanz,ӭѯ,ԡšݡ淶,լһʽ,չҵϣ,سŵþʊʊݾѯ,ڶҵţ,㶫иï26,饻jwk,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam']

Training the logistic regression algorithm with the preprocessed DataSet.

In [24]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train, y_train)

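LogisticRegression parameters:

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |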
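For intuition about what `clf.fit` estimates: a logistic regression model computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability. The sketch below uses hypothetical weights for a three-word vocabulary; they are not learned from the corpus, and treating spam as the positive class is an assumption of this example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class for one count vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical weights for a three-word vocabulary; not learned from the corpus.
weights = [1.5, -0.5, 2.0]
bias = -1.0
x = [1, 0, 2]  # word counts for one e-mail

p = predict_proba(x, weights, bias)
label = "spam" if p >= 0.5 else "ham"
print(round(p, 3), label)
```

Training amounts to choosing the weights and bias so that these probabilities agree with the labelled examples as closely as possible, subject to the L2 penalty listed above.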


### 4.- Prediction

Reading a DataSet of e-mails


In [25]:
# Read 150 e-mails from the DataSet and keep only the last 50.
# These 50 e-mails were not used to train the algorithm.

X, y = create_prep_dataset("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index", 150)
X_test = X[100:]
y_test = y[100:]



Preprocessing the e-mails with the vectorizer created earlier.

In [26]:
X_test = vectorizer.transform(X_test)
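The test set goes through `transform` rather than `fit_transform` so that it is mapped onto the same columns the classifier was trained on. The pure-Python sketch below illustrates the principle with made-up documents: the vocabulary is learned from the training data only, and words it has never seen are ignored. It is a simplification, not scikit-learn's implementation.

```python
train_docs = ["cheap viagra click", "meeting agenda attached"]
test_doc = "cheap meeting tomorrow"

# "fit": learn a fixed word -> column mapping from the training data only.
train_words = sorted({w for d in train_docs for w in d.split()})
vocabulary = {w: i for i, w in enumerate(train_words)}

def transform(doc, vocabulary):
    """Count known words; words absent from the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

test_vector = transform(test_doc, vocabulary)
print(train_words)
print(test_vector)
```

Here "tomorrow" does not appear in the training documents, so it contributes nothing to the test vector.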

Predicting the e-mail type

In [27]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam'], dtype='<U4')

In [28]:
print("Prediction: \n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction: 
 ['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']

True labels:
 ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


Evaluating the results

In [29]:
from sklearn.metrics import accuracy_score
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.940
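Accuracy alone can hide how each class fares, especially when spam is far more frequent than ham. The sketch below recomputes the accuracy by hand and adds the recall for ham. The label lists are reconstructed from the printed predictions and true labels above, with the ham positions transcribed manually.

```python
# Labels reconstructed from the printed output above: 50 test e-mails,
# with 'ham' at the listed positions and 'spam' everywhere else.
y_test = ["spam"] * 50
for i in (10, 16, 23, 26, 29):
    y_test[i] = "ham"

y_pred = ["spam"] * 50
for i in (10, 16):
    y_pred[i] = "ham"

correct = sum(t == p for t, p in zip(y_test, y_pred))
accuracy = correct / len(y_test)

# Recall for ham: fraction of true ham messages that were predicted as ham.
ham_total = y_test.count("ham")
ham_found = sum(t == p == "ham" for t, p in zip(y_test, y_pred))
ham_recall = ham_found / ham_total

print(accuracy, ham_recall)
```

On this small sample the three errors are all ham messages classified as spam, which is worth monitoring in a spam filter because it means legitimate mail is lost.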


### 5.- Scaling Up the DataSet

In [30]:
# Read 12,000 e-mails: 10,000 to train the algorithm and 2,000 to test it.
X, y = create_prep_dataset("/home/SpringyB6/Documents/Simulacion/datasets/trec07p/full/index", 12000)



In [31]:
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:12000], y[10000:12000]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [32]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

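LogisticRegression parameters:

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |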


In [33]:
X_test = vectorizer.transform(X_test)
y_pred = clf.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.987
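The split above takes the first 10,000 e-mails for training and the next 2,000 for testing, in index order. If the order of the index were suspected to affect the result, one alternative would be to shuffle before splitting. The following is a minimal sketch of that idea on toy data; the helper `shuffled_split` and its `seed` parameter are assumptions made for this illustration and are not part of the notebook.

```python
import random

def shuffled_split(X, y, n_train, seed=0):
    """Shuffle (X, y) pairs with a fixed seed, then split at n_train."""
    pairs = list(zip(X, y))
    random.Random(seed).shuffle(pairs)
    train, test = pairs[:n_train], pairs[n_train:]
    X_train, y_train = [x for x, _ in train], [l for _, l in train]
    X_test, y_test = [x for x, _ in test], [l for _, l in test]
    return X_train, y_train, X_test, y_test

# Toy data standing in for the 12,000 preprocessed e-mails.
X = ["mail {}".format(i) for i in range(12)]
y = ["spam" if i % 3 else "ham" for i in range(12)]

X_train, y_train, X_test, y_test = shuffled_split(X, y, n_train=10)
print(len(X_train), len(X_test))
```

Shuffling documents and labels as pairs keeps each e-mail attached to its label, and the fixed seed makes the split reproducible.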
