# Logistic Regression: Spam Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: spam detection.

**Linear Regression** predicts a continuous numeric value, whereas **Logistic Regression** predicts the probability that an example belongs to a class (here, that an e-mail is spam).
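As a minimal sketch of that difference: logistic regression computes the same weighted sum as linear regression and then passes it through the sigmoid function, which maps any real number to a value between 0 and 1 that can be read as a probability. The weights and feature values below are made up for illustration and are not taken from the exercise.

```python
import math

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(weights, bias, features):
    """Weighted sum used by both linear and logistic regression."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical weights for two word-count features (illustrative values only)
weights, bias = [1.5, -2.0], 0.0
features = [2, 1]  # e.g. counts of two words in an e-mail

score = linear_score(weights, bias, features)   # unbounded real number
probability = sigmoid(score)                    # value between 0 and 1
label = "spam" if probability >= 0.5 else "ham"
print(score, probability, label)
```

The 0.5 threshold used to turn the probability into a label is a common default, not something fixed by the model itself.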

## Exercise statement

The goal is to build a machine learning system capable of predicting whether a given e-mail is spam or not. The following dataset is used:
[DataSet](https://www.kaggle.com/datasets/imdeepmind/preprocessed-trec-2007-public-corpus-dataset)

The corpus trec07p contains 75,419 messages:

25,220 ham
50,199 spam

These messages constitute all the messages delivered to a particular server between these dates:

Sun, 8 Apr 2007 13:07:21 -0400
Fri, 6 Jul 2007 07:04:53 -0400

## 1.- Helper functions

In this case study on spam e-mail detection, the available dataset consists of e-mails with their headers and additional fields. They therefore require preprocessing before being fed to the Machine Learning algorithm.

In [44]:
# This class helps preprocess e-mails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Called by the parser for every run of text found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [45]:
# This function removes the HTML tags found in the text of the e-mail
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [46]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td></tr>'
strip_tags(t)

'Phrack World News'

Besides removing any HTML tags found in the e-mail, other preprocessing steps are needed so that the messages do not contain unnecessary noise. These include removing punctuation, removing e-mail fields that are not relevant, and removing the affixes of each word so that only its root remains (stemming). The class shown below performs these transformations.
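A minimal sketch of two of these steps, punctuation removal and stop-word filtering, using only the standard library. The stop-word set is a tiny illustrative assumption and stemming is left out, so this is not the implementation used in the exercise.

```python
import string

# Tiny illustrative stop-word list (an assumption; the exercise uses NLTK's English list)
STOPWORDS = {"the", "a", "of", "to", "and"}

def clean_tokens(text):
    """Remove punctuation, split on whitespace and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).split()
    # Stop words are compared case-insensitively in this sketch
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(clean_tokens("Generic Cialis, branded quality@ of the day!"))
```

Stemming would then reduce each remaining token to its root, which is what produces forms such as `gener` and `qualiti` in the parsed output of this exercise.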

import nltk
nltk.download('stopwords')

In [47]:
import email 
import string
import nltk

In [48]:
class Parser:
    
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation) 
        
    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)
    
    def get_email_content(self, msg):
        """Extract the email content"""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),msg.get_content_type())
        content_type = msg.get_content_type()
        # Returning the content of the email 
        return {"subject": subject,
                "body": body,
                "content_type": content_type}
    
    def get_email_body(self, payload, content_type):
        """Extract the body of the email"""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body
    
    def tokenize(self, text):
        """Split a text string into tokens, removing punctuation and stop words and stemming each token."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stem the tokens, skipping stop words (the comparison is case-sensitive)
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]
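The `Parser` class recurses manually over `get_payload()` to reach the text parts of a multipart message. As a hedged illustration of the same traversal, the sketch below builds a small made-up message in memory and visits its parts with the standard library's `walk()`. It uses the newer `EmailMessage` API rather than the `email.message_from_file` call used above, so `get_content()` is an assumption of this example and not part of the exercise code.

```python
from email.message import EmailMessage

# Build a small multipart/alternative message in memory (made-up content)
msg = EmailMessage()
msg["Subject"] = "Hello"
msg.set_content("plain text body")                       # text/plain part
msg.add_alternative("<p>html body</p>", subtype="html")  # text/html part

def text_parts(message):
    """Return (content_type, text) for every text/* part of the message."""
    parts = []
    for part in message.walk():  # walk() visits the container and all sub-parts
        if part.get_content_maintype() == "text":
            parts.append((part.get_content_type(), part.get_content().strip()))
    return parts

print(msg.get_content_type())
print(text_parts(msg))
```

The container itself reports `multipart/alternative`, which is why the parsed e-mail in this exercise shows that content type even though its body comes from a `text/html` part.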

Reading an e-mail in raw format

In [49]:
inmail = open("datasets/datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

Parsing the e-mail

In [50]:
p = Parser()
p.parse("datasets/datasets/trec07p/data/inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occas',
  'tri',
  'viagra',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}

Reading the index

These helper functions load into memory the path of each e-mail and its corresponding label {ham, spam}.

In [51]:
index = open("datasets/datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [52]:
import os 
DATASET_PATH="datasets/datasets/trec07p"

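def parse_index(path_to_index, n_elements):
    """Return the label and file path of the first n_elements index entries."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like 'spam ../data/inmail.1\n'
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]  # drop the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes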

In [53]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [54]:
indexes = parse_index("datasets/datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.1'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.2'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.3'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.4'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.5'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.6'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.7'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.8'},
 {'label': 'spam', 'email_path': 'datasets/datasets/trec07p/data/inmail.9'},
 {'label': 'ham', 'email_path': 'datasets/datasets/trec07p/data/inmail.10'}]
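As an alternative sketch of the line-splitting step, the hypothetical helper below (`parse_index_line` is not part of the exercise) separates the label from the path with `str.split(maxsplit=1)` and strips the leading `../` explicitly. It assumes Python 3.9 or later for `str.removeprefix`.

```python
import os

def parse_index_line(line, dataset_path):
    """Split one index line such as 'spam ../data/inmail.1' into label and path."""
    label, relative_path = line.split(maxsplit=1)
    # Index entries point one directory up from the index file, hence the leading '../'
    relative_path = relative_path.strip().removeprefix("../")
    return {"label": label, "email_path": os.path.join(dataset_path, relative_path)}

sample = ["spam ../data/inmail.1\n", "ham ../data/inmail.2\n"]
for line in sample:
    print(parse_index_line(line, "datasets/datasets/trec07p"))
```

Working line by line like this also makes it straightforward to read the index lazily instead of loading every line with `readlines()`.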

## 2.- Preprocessing the dataset

The functions presented above make it possible to read the e-mails programmatically and to preprocess them, removing the components that are not useful for spam detection. However, each e-mail is still represented by a Python dictionary containing lists of words.

In [55]:
# Load the index and the labels into memory
index = parse_index("datasets/datasets/trec07p/full/index", 1)

In [56]:
# Read the first e-mail
import os
open(index[0]["email_path"]).readlines()

['From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\n',
 'Return-Path: <RickyAmes@aol.com>\n',
 'Received: from 129.97.78.23 ([211.202.101.74])\n',
 '\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n',
 '\tSun, 8 Apr 2007 13:07:21 -0400\n',
 'Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\n',
 'Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\n',
 'From: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>\n',
 'To: the00@speedy.uwaterloo.ca\n',
 'Subject: Generic Cialis, branded quality@ \n',
 'Date: Sun, 08 Apr 2007 21:00:48 +0300\n',
 'X-Mailer: Microsoft Outlook Express 6.00.2600.0000\n',
 'MIME-Version: 1.0\n',
 'Content-Type: multipart/alternative;\n',
 '\tboundary="--8896484051606557286"\n',
 'X-Priority: 3\n',
 'X-MSMail-Priority: Normal\n',
 'Status: RO\n',
 'Content-Length: 988\n',
 'Lines: 24\n',
 '\n',
 '----8896484051606557286\n',
 'Content-Type: text/html;\n',
 'Content-Transfer-Encoding: 7Bi

In [57]:
# Parse the first e-mail
mail, label = parse_email(index[0])
print("The e-mail is: ", label)
print(mail)

The e-mail is:  spam
{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occas', 'tri', 'viagra', 'anxieti', 'thing', 'past', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


The Logistic Regression algorithm cannot take text as input. A set of additional functions must therefore be applied to transform the text of the parsed e-mails into a numeric representation.
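One common numeric representation is the bag of words: every distinct token becomes a column and each e-mail becomes a vector of token counts. The sketch below illustrates the idea in plain Python with two made-up documents; the function names are hypothetical and it is a simplified stand-in for the scikit-learn vectorizer, not a description of its internals.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of every distinct token seen in the documents."""
    return sorted({token for doc in documents for token in doc.split()})

def vectorize(document, vocabulary):
    """Count how many times each vocabulary token occurs in the document."""
    counts = Counter(document.split())
    return [counts[token] for token in vocabulary]

docs = ["viagra discount viagra", "debian mirror updat"]
vocab = build_vocabulary(docs)
print(vocab)
print([vectorize(d, vocab) for d in docs])
```

Each row has one entry per vocabulary word, so the number of columns grows with the number of distinct tokens in the training e-mails.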

Applying CountVectorizer

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the e-mail as a single text string
prep_email = [" ".join(mail['subject'])+ " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)
print("e-mail: ", prep_email,"\n")
print("Input features: ", vectorizer.get_feature_names_out())

e-mail:  ['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self'] 

Input features:  ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occas' 'old' 'past'
 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'tri' 'viagra']
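Note the feature `qualitido` in the output above: the subject and the body are concatenated without a separator, so the last subject token (`qualiti`) and the first body token (`do`) are fused into one word. If that is not intended, one possible fix is to join the combined token list instead, as sketched below with a shortened copy of the parsed e-mail. The results reported in this exercise were produced with the original concatenation.

```python
mail = {"subject": ["gener", "ciali", "brand", "qualiti"],
        "body": ["do", "feel", "pressur"]}   # body shortened for the example

# Concatenation as written in the cell above: no separator between the two parts
fused = " ".join(mail["subject"]) + " ".join(mail["body"])

# Joining the combined token list keeps every token separate
separated = " ".join(mail["subject"] + mail["body"])

print(fused)
print(separated)
```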


In [59]:
# Transform the e-mail into a count vector (one column per feature)
x = vectorizer.transform(prep_email)
print("\nValues:\n",x.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


In [60]:
# Build the one-hot matrix
from sklearn.preprocessing import OneHotEncoder
# One single-element row per token (similar to a nested loop), so each token is encoded separately
prep_email = [[w] for w in mail['subject']+ mail['body']]
enc = OneHotEncoder(handle_unknown='ignore')
x = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", x.toarray())

Features:
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali' 'x0_do' 'x0_feel' 'x0_gener'
 'x0_occas' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur' 'x0_qualiti'
 'x0_rise' 'x0_self' 'x0_thing' 'x0_tri' 'x0_viagra']

Values:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0

Helper functions for preprocessing the dataset

In [61]:
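def create_prep_dataset(index_path, n_elements):
    """Parse the first n_elements e-mails and return their texts (X) and labels (y)."""
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\nParsing e-mail: {0}".format(i+1), end = '')
        mail, label = parse_email(indexes[i])
        # Subject and body tokens are joined into one string (no separator is added between the two parts)
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))
        y.append(label)
    return X,y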

## 3.- Training the algorithm

Training the logistic regression algorithm with the preprocessed dataset

In [62]:
# Read only a subset of 100 e-mails
X_train, y_train = create_prep_dataset("datasets/datasets/trec07p/full/index",100)
X_train


Parsing e-mail: 1
Parsing e-mail: 2
Parsing e-mail: 3
...

['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self',
 'typo debianreadmhi ive updat gulu i check mirror it seem littl typo debianreadm file exampl httpgulususherbrookecadebianreadm ftpftpfrdebianorgdebianreadm test lenni access releas diststest the current test develop snapshot name etch packag test unstabl pass autom test propog releas etch replac lenni like readmehtml yan morin consult en logiciel libr yanmorinsavoirfairelinuxcom 5149941556 to unsubscrib email debianmirrorsrequestlistsdebianorg subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click httpwwwmoujsjkhchumcom authent viagra mega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'nice talk yahey billi realli fun go night talk said felt insecur manhood i notic toilet quit small area worri websit i tell secret weapon extra 3 inch trust g

In [63]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [64]:
print(X_train.toarray())
print("\nFeatures: ",len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features:  4911


In [65]:
import pandas as pd 
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,0000,000000,00085,002,003,00450,009,01,01000u,0107,...,ӧanz,ӭѯ,ԡšݡ淶,լһʽ,չҵϣ,سŵþʊʊݾѯ,ڶҵţ,㶫иï26,饻jwk,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
# Inspect the training labels (y_train)
y_train

['spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'ham',
 'ham',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'ham',
 'spam',
 'spam',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam']

In [67]:
# Train the logistic regression algorithm with the preprocessed dataset
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train,y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


## 4.- Prediction
Reading a set of e-mails

In [68]:
# Read 150 e-mails from the dataset and keep only the last 50. These 50 e-mails were not used to train the algorithm.

X, y = create_prep_dataset("datasets/datasets/trec07p/full/index", 150)
X_test = X[100:]
y_test = y[100:]


Parsing e-mail: 1
Parsing e-mail: 2
Parsing e-mail: 3
...

Preprocessing the e-mails with the vectorizer created earlier.

In [69]:
X_test = vectorizer.transform(X_test)
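The test e-mails are passed through `transform`, not `fit_transform`, so that they are encoded with the vocabulary learned from the training e-mails. One way to make that hard to get wrong is to chain both steps in a scikit-learn pipeline. The sketch below uses a handful of made-up texts in place of the real e-mails and is only an illustration, not the setup used to obtain the results in this exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up training data standing in for the preprocessed e-mails
train_texts = ["viagra discount click", "cheap viagra click",
               "debian mirror updat", "meet lunch tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["discount viagra unseenword"]

# fit() learns the vocabulary from the training texts only;
# predict() calls transform(), so words never seen in training are ignored
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(test_texts))
print(model.classes_, model.predict_proba(test_texts))
```

`predict_proba` returns one probability per class, in the order given by `classes_`, which is the quantity logistic regression actually estimates before a label is chosen.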

Predicting the type of e-mail

In [70]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam'], dtype='<U4')

In [71]:
# To do: program the mathematical model with a regression model for the Titanic dataset, as an API

In [72]:
print("Prediction:\n",y_pred)
print("\nActual labels:\n",y_test)

Prediction:
 ['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']

Actual labels:
 ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


Evaluating the results

In [73]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test,y_pred)))

Accuracy: 0.940
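Accuracy alone can hide which class the errors fall on, which matters here because about two thirds of the corpus is spam. The sketch below rebuilds the 50 real and predicted labels from the printed output above and counts each (real, predicted) pair in plain Python. The per-class recall figures are an addition for illustration and are not part of the original exercise.

```python
from collections import Counter

# Labels reconstructed from the printed output above:
# 50 test e-mails, real 'ham' at positions 10, 16, 23, 26 and 29,
# predicted 'ham' only at positions 10 and 16
y_test = ["ham" if i in {10, 16, 23, 26, 29} else "spam" for i in range(50)]
y_pred = ["ham" if i in {10, 16} else "spam" for i in range(50)]

confusion = Counter(zip(y_test, y_pred))   # keys are (real, predicted) pairs
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
ham_recall = confusion[("ham", "ham")] / y_test.count("ham")
spam_recall = confusion[("spam", "spam")] / y_test.count("spam")

print(dict(confusion))
print(accuracy, ham_recall, spam_recall)
```

All three mistakes are legitimate e-mails classified as spam, so with only 100 training e-mails the model recovers 2 of the 5 ham messages in this test set even though the overall accuracy is 0.94.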


## 5.- Enlarging the dataset

In [74]:
# Read 12,000 e-mails: 10,000 to train the algorithm and 2,000 to test it
X,Y = create_prep_dataset("./datasets/datasets/trec07p/full/index", 12000)


Parsing e-mail: 1
Parsing e-mail: 2
Parsing e-mail: 3
...

In [80]:
# Use 10,000 for training and 2,000 for testing
X_train,Y_train = X[:10000], Y[:10000]
X_test, Y_test = X[10000:], Y[10000:]


In [81]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [82]:
clf = LogisticRegression()
clf.fit(X_train,Y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


In [83]:
X_test = vectorizer.transform(X_test)
y_pred = clf.predict(X_test)
print('Accuracy: {:.3f}'.format(accuracy_score(Y_test, y_pred)))

Accuracy: 0.987
