# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of logistic regression by working through one of the first problems to be solved with Machine Learning techniques: SPAM detection.


##  Exercise statement
The goal is to build a machine learning system able to predict whether a given email is SPAM or not. The following DataSet is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)
The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400

In [2]:
# This class simplifies the processing of emails
# that contain HTML code.
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes
        # character references such as &amp; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [3]:
# This function removes the HTML tags
# found in the text of the emails
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [4]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack world News </a><td>'
strip_tags(t)

'Phrack world News '

Besides removing any HTML tags found in the email, other steps are needed to keep the messages free of unnecessary noise. These include removing punctuation, removing email fields that are not relevant, and removing the affixes of each word so that only its root remains (stemming). The class shown below performs these transformations.
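As a rough illustration of the cleaning steps just described, the following sketch removes punctuation and drops stopwords using only the standard library. The stopword set is a deliberately tiny, hypothetical stand-in for NLTK's English list, lowercasing is an added assumption, and stemming is left out on purpose.

```python
import string

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Remove punctuation, lowercase, split on whitespace and drop stopwords."""
    no_punct = text.translate(PUNCT_TABLE)
    tokens = no_punct.lower().split()
    return [t for t in tokens if t not in STOPWORDS]


tokens = clean_text("Hello, this is the BEST offer: click to win!!!")
print(tokens)
```

A stemmer such as `nltk.PorterStemmer` would then be applied to each remaining token, as the `Parser` class does.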

In [6]:
import email
import string
import nltk


class Parser:
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))

        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors = 'ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                  msg.get_content_type())
        content_type = msg.get_content_type()
        # Return the content of the email
        return {"subject": subject,
               "body": body,
               "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(), 
                                           p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string into tokens. Performs two main actions:
        removes punctuation symbols and stems the remaining words."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Stem the tokens, skipping stopwords
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

Reading an email in raw format

In [176]:
inmail = open("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv").read()
print(inmail)

Year,Market Size (USD Million),CAGR (%),Material Type,Product Type,Distribution Channel,Event Name,Event Date,Company Involved,Event Details,Region,Market Penetration,Growth Rate (%),Brand Name,Market Share (%),Revenue Contribution (%),Innovation Index,Regulatory Impact,Awareness Campaign Impact
2027,1119.93,10.2,Non-latex,Male Condoms,Drug Stores,Product Launch,2027-06-20,Sirona Hygiene Private Limited,Launched new product,West India,Medium,12.27,Durex,31.01,6.66,6.28,High,1.23
2024,1479.27,12.52,Latex,Male Condoms,Drug Stores,Campaign,2020-03-14,Godrej Consumer Products Limited,Acquired smaller brand,West India,Low,12.89,Playboy,35.78,14.97,2.7,Low,6.43
2023,894.44,9.31,Latex,Male Condoms,Drug Stores,Product Launch,2027-03-23,TTK,Ran awareness campaign,North India,Medium,10.55,Skore,29.22,4.99,5.66,Medium,3.64
2030,1181.72,12.77,Non-latex,Male Condoms,Mass Merchandisers,Product Launch,2028-07-01,Godrej Consumer Products Limited,Ran awareness campaign,West India,Low,9.68,Durex,28.97,2

##### Parsing the email

In [178]:
p = Parser()
p.parse("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv")

##### Reading the index
These helper functions load into memory the path of each email and its corresponding label, one of {spam, ham}.
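To make the expected index format concrete, here is a small sketch that parses a few in-memory lines. It assumes each index line has the form `<label> ../<relative path>`; the sample lines, the `parse_index_lines` helper and the `trec07p` root are hypothetical and only serve the illustration.

```python
import os

# Hypothetical sample lines in the assumed "<label> ../<relative path>" format.
SAMPLE_INDEX = [
    "spam ../data/inmail.1\n",
    "ham ../data/inmail.2\n",
    "this line has no path\n",
]


def parse_index_lines(lines, dataset_root):
    """Turn index lines into {"label", "email_path"} dictionaries."""
    entries = []
    for line in lines:
        parts = line.strip().split(" ../")
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed format
        label, rel_path = parts
        entries.append({"label": label,
                        "email_path": os.path.join(dataset_root, rel_path)})
    return entries


entries = parse_index_lines(SAMPLE_INDEX, "trec07p")
for entry in entries:
    print(entry["label"], entry["email_path"])
```

A file that does not follow this format, such as a plain CSV, yields no entries at all, because no line contains the ` ../` separator.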

In [180]:
index = open("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv").readlines()
index

['Year,Market Size (USD Million),CAGR (%),Material Type,Product Type,Distribution Channel,Event Name,Event Date,Company Involved,Event Details,Region,Market Penetration,Growth Rate (%),Brand Name,Market Share (%),Revenue Contribution (%),Innovation Index,Regulatory Impact,Awareness Campaign Impact\n',
 '2027,1119.93,10.2,Non-latex,Male Condoms,Drug Stores,Product Launch,2027-06-20,Sirona Hygiene Private Limited,Launched new product,West India,Medium,12.27,Durex,31.01,6.66,6.28,High,1.23\n',
 '2024,1479.27,12.52,Latex,Male Condoms,Drug Stores,Campaign,2020-03-14,Godrej Consumer Products Limited,Acquired smaller brand,West India,Low,12.89,Playboy,35.78,14.97,2.7,Low,6.43\n',
 '2023,894.44,9.31,Latex,Male Condoms,Drug Stores,Product Launch,2027-03-23,TTK,Ran awareness campaign,North India,Medium,10.55,Skore,29.22,4.99,5.66,Medium,3.64\n',
 '2030,1181.72,12.77,Non-latex,Male Condoms,Mass Merchandisers,Product Launch,2028-07-01,Godrej Consumer Products Limited,Ran awareness campaign,West In

In [188]:
import os  # used to build the email file paths
DATASET_PATH = "/home/rosy/SIMULACION/datasets/datasets/trec07p"

def parse_index(path_to_index, n_elements):
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(min(n_elements, len(index))):
        mail = index[i].strip().split(" ../")
        if len(mail) < 2:
            continue  # Skip lines that do not have the expected format
        label = mail[0]
        path = mail[1]  # strip() above has already removed the trailing newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes



In [192]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [194]:
indexes = parse_index("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv", 10)
print(indexes)

[]


##### Preprocessing the DataSet

The functions presented above make it possible to read the emails programmatically and to process them, removing the components that are not useful for SPAM detection. However, each email is still represented by a Python dictionary containing a series of words.

In [208]:
# Load the index and the labels into memory
index = parse_index("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv", 1)

In [210]:
# Read the first email

import os

# indexes holds the results returned by parse_index
indexes = parse_index("datasets/datasets/trec07p/data/dataset_small.csv", 1)

# Open the first email in the list
with open(indexes[0]["email_path"], 'r') as email_file:
    email_content = email_file.read()

print(email_content)


IndexError: list index out of range

In [212]:
# Parse the first email

# index is expected to hold the results of parse_index
print("Contents of index:", index)  # Show what index contains

try:
    mail, label = parse_email(index[0])
    print("The email is: \n", label)
    print(mail)
except IndexError as e:
    print(f"Error: {e}. Make sure index is not empty and has enough elements.")


Contents of index: []
Error: list index out of range. Make sure index is not empty and has enough elements.


The Logistic Regression algorithm cannot take raw text as input. A series of additional functions must therefore be applied to transform the text of the parsed emails into a numerical representation.

### Applying CountVectorizer

In [216]:
from sklearn.feature_extraction.text import CountVectorizer

# Definition of the mail dictionary
mail = {
    'subject': ['Asunto del correo'],
    'body': ['Este es el cuerpo del correo']
}

# Join the email's subject and body into a single text string.
prep_email = [" ".join(mail['subject']) + " " + " ".join(mail['body'])]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(prep_email)

print("\n\nE-mail:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())




E-mail: ['Asunto del correo Este es el cuerpo del correo'] 

Input features: ['asunto' 'correo' 'cuerpo' 'del' 'el' 'es' 'este']


In [218]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 2 1 2 1 1 1]]
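The counts can be reproduced by hand, which may help to see what the vectorizer computes. The sketch below is a simplified stand-in, not scikit-learn's implementation: it lowercases and splits on whitespace, whereas `CountVectorizer` uses its own tokenization rules. The `count_vectorize` helper is hypothetical.

```python
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, count_matrix) for a list of strings."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all distinct tokens.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary word; absent words count as 0.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocab, matrix = count_vectorize(["Asunto del correo Este es el cuerpo del correo"])
print(vocab)
print(matrix)
```

Each column corresponds to one vocabulary word, and each value is the number of times that word occurs in the email.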


#### Applying OneHotEncoding

In [204]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]
enc = OneHotEncoder(handle_unknown = 'ignore')
X = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out(), "\n")
print("Values:", X.toarray())

Features:
 ['x0_Asunto del correo' 'x0_Este es el cuerpo del correo'] 

Values: [[1. 0.]
 [0. 1.]]
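In the output above, the subject and the body are each a single string, so the encoder treats the two whole strings as the categories. For comparison, this sketch one-hot encodes individual tokens by hand; the token list and the `one_hot_encode` helper are hypothetical.

```python
def one_hot_encode(tokens):
    """Return (categories, rows); each row contains exactly one 1."""
    categories = sorted(set(tokens))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for tok in tokens:
        row = [0] * len(categories)
        row[position[tok]] = 1  # mark the column of this token's category
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["free", "offer", "free", "meeting"])
print(categories)
for row in rows:
    print(row)
```

Unlike the count representation, each row here describes one token rather than one email, so the rows would still need to be combined to obtain a single vector per message.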


#### Helper functions for processing the DataSet

In [257]:
import pandas as pd

def create_prep_dataset(file_path, num_rows):
    # Read the CSV file
    datos = pd.read_csv(file_path)
    # Select the first `num_rows` records and their labels
    X_train = datos['Event Details'].head(num_rows).tolist()
    y_train = datos['Event Name'].head(num_rows).tolist()
    return X_train, y_train

# Path to the CSV file
file_path = "datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv"

# Read a subset of 1000 records
X_train, y_train = create_prep_dataset(file_path, 1000)

# Show the first 5 processed records
for i in range(min(5, len(X_train))):
    print(f"Record {i + 1}: {X_train[i]}")

# Check the size of X_train (only the length is printed to avoid flooding the output)
print(f"Total records in X_train: {len(X_train)}")




Record 1: Launched new product
Record 2: Acquired smaller brand
Record 3: Ran awareness campaign
Record 4: Ran awareness campaign
Record 5: Acquired smaller brand
Total records in X_train: 1000


In [259]:
# Read only a subset of 1000 records.
X_train, y_train = create_prep_dataset("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv", 1000)

# Show the first 5 processed records
for i in range(min(5, len(X_train))):
    print(f"Email {i + 1}: {X_train[i]}")

# Inspect the full contents of X_train
X_train


Email 1: Launched new product
Email 2: Acquired smaller brand
Email 3: Ran awareness campaign
Email 4: Ran awareness campaign
Email 5: Acquired smaller brand


['Launched new product',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Launched new product',
 'Ran awareness campaign',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Acquired smaller brand',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Launched new product',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Launched new product',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Acquired smaller brand',
 'Acquired smaller brand',
 'Launched new product',
 'Ran awareness campaign',
 'Launched new product',
 'Ran awareness campaign',
 'Ran awareness campaign',
 'Ran awareness campaign',
 'Launched new product',
 'Acquired smaller brand',
 'Ran awareness campaign',
 'Launched new product',
 'Launched new product',
 'Launched new product',
 'Launched new produc

##### Applying vectorization to the data

In [261]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [263]:
print(X_train.toarray())
print("\nFeatures", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 1 0 0]
 [1 0 1 ... 0 0 1]
 [0 1 0 ... 0 1 0]
 ...
 [1 0 1 ... 0 0 1]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]

Features 9


In [265]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,acquired,awareness,brand,campaign,launched,new,product,ran,smaller
0,0,0,0,0,1,1,1,0,0
1,1,0,1,0,0,0,0,0,1
2,0,1,0,1,0,0,0,1,0
3,0,1,0,1,0,0,0,1,0
4,1,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,1,1,1,0,0
996,0,1,0,1,0,0,0,1,0
997,1,0,1,0,0,0,0,0,1
998,0,0,0,0,1,1,1,0,0


In [267]:
y_train

['Product Launch',
 'Campaign',
 'Product Launch',
 'Product Launch',
 'Campaign',
 'Product Launch',
 'Acquisition',
 'Acquisition',
 'Acquisition',
 'Campaign',
 'Product Launch',
 'Acquisition',
 'Campaign',
 'Acquisition',
 'Campaign',
 'Campaign',
 'Campaign',
 'Acquisition',
 'Acquisition',
 'Acquisition',
 'Campaign',
 'Campaign',
 'Product Launch',
 'Product Launch',
 'Campaign',
 'Campaign',
 'Product Launch',
 'Acquisition',
 'Product Launch',
 'Product Launch',
 'Product Launch',
 'Campaign',
 'Campaign',
 'Product Launch',
 'Product Launch',
 'Campaign',
 'Product Launch',
 'Campaign',
 'Campaign',
 'Product Launch',
 'Campaign',
 'Acquisition',
 'Campaign',
 'Campaign',
 'Campaign',
 'Campaign',
 'Acquisition',
 'Product Launch',
 'Product Launch',
 'Acquisition',
 'Acquisition',
 'Campaign',
 'Acquisition',
 'Acquisition',
 'Product Launch',
 'Acquisition',
 'Campaign',
 'Product Launch',
 'Campaign',
 'Campaign',
 'Campaign',
 'Acquisition',
 'Acquisition',
 'Acquisition

#### Training the Logistic Regression algorithm with the preprocessed DataSet

In [269]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
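For intuition about what the fitted model computes, here is a minimal sketch of the binary case: a weighted sum of the features is passed through the sigmoid function to obtain a probability. The weights, bias and feature vector are hypothetical values chosen for illustration, not parameters learned by `clf`. With more than two labels, as in the data used here, scikit-learn applies a multiclass extension of the same idea.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters for three word-count features.
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [2, 0, 1]  # word counts of a single email

p = predict_proba(features, weights, bias)
label = "spam" if p >= 0.5 else "ham"
print(f"P(spam) = {p:.3f} -> {label}")
```

Training consists of finding the weights and bias that make these probabilities agree as closely as possible with the labels in the training set.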

# 4.- Prediction

In [298]:
# Reading the records used for evaluation.

# Read the first 250 records of the DataSet and keep the last 150 (rows 100 to 249) as the test set.
# Note: these rows are among the first 1000 records already used to train the algorithm.
X,y = create_prep_dataset("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv", 250)
X_test = X[100:]
y_test = y[100:]
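A held-out evaluation needs test records that the model never saw during training. The following is a minimal sketch of a positional split that guarantees this; the `holdout_split` helper and the toy records are hypothetical.

```python
def holdout_split(records, labels, n_train):
    """Use the first n_train items for training and the rest for testing."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    return (records[:n_train], labels[:n_train],
            records[n_train:], labels[n_train:])


# Toy data standing in for the texts and their labels.
records = [f"text {i}" for i in range(10)]
labels = ["a" if i % 2 == 0 else "b" for i in range(10)]

X_tr, y_tr, X_te, y_te = holdout_split(records, labels, 7)
print(len(X_tr), "training records,", len(X_te), "test records")
```

The vectorizer and the classifier would then be fitted on the training part only, and the test part would be passed through `vectorizer.transform` before prediction.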

##### Preprocessing the emails with the vectorizer created earlier

In [301]:
X_test =  vectorizer.transform(X_test)

In [303]:
y_pred = clf.predict(X_test)
y_pred

array(['Product Launch', 'Product Launch', 'Acquisition',
       'Product Launch', 'Product Launch', 'Acquisition',
       'Product Launch', 'Product Launch', 'Product Launch',
       'Product Launch', 'Acquisition', 'Acquisition', 'Acquisition',
       'Product Launch', 'Acquisition', 'Product Launch',
       'Product Launch', 'Product Launch', 'Product Launch',
       'Acquisition', 'Acquisition', 'Product Launch', 'Product Launch',
       'Acquisition', 'Acquisition', 'Acquisition', 'Product Launch',
       'Product Launch', 'Acquisition', 'Product Launch',
       'Product Launch', 'Product Launch', 'Product Launch',
       'Product Launch', 'Product Launch', 'Acquisition',
       'Product Launch', 'Acquisition', 'Product Launch', 'Acquisition',
       'Product Launch', 'Product Launch', 'Acquisition',
       'Product Launch', 'Product Launch', 'Product Launch',
       'Product Launch', 'Acquisition', 'Acquisition', 'Product Launch',
       'Acquisition', 'Product Launch', 'Product 

In [305]:
print("Prediction\n", y_pred)
print("\nTrue labels", y_test)

Prediction
 ['Product Launch' 'Product Launch' 'Acquisition' 'Product Launch'
 'Product Launch' 'Acquisition' 'Product Launch' 'Product Launch'
 'Product Launch' 'Product Launch' 'Acquisition' 'Acquisition'
 'Acquisition' 'Product Launch' 'Acquisition' 'Product Launch'
 'Product Launch' 'Product Launch' 'Product Launch' 'Acquisition'
 'Acquisition' 'Product Launch' 'Product Launch' 'Acquisition'
 'Acquisition' 'Acquisition' 'Product Launch' 'Product Launch'
 'Acquisition' 'Product Launch' 'Product Launch' 'Product Launch'
 'Product Launch' 'Product Launch' 'Product Launch' 'Acquisition'
 'Product Launch' 'Acquisition' 'Product Launch' 'Acquisition'
 'Product Launch' 'Product Launch' 'Acquisition' 'Product Launch'
 'Product Launch' 'Product Launch' 'Product Launch' 'Acquisition'
 'Acquisition' 'Product Launch' 'Acquisition' 'Product Launch'
 'Product Launch' 'Acquisition' 'Product Launch' 'Product Launch'
 'Product Launch' 'Product Launch' 'Acquisition' 'Acquisition'
 'Acquisition' 'Acq

#### Evaluating the Results

In [308]:
from sklearn.metrics import accuracy_score
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.347
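Accuracy is the fraction of predictions that exactly match the true labels. The sketch below computes it by hand on a few hypothetical labels. As a point of reference, if the three labels seen in `y_train` were roughly balanced (an assumption, not something verified here), random guessing would score close to 1/3.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
y_true = ["Campaign", "Acquisition", "Product Launch", "Campaign"]
y_pred = ["Campaign", "Product Launch", "Product Launch", "Acquisition"]

score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.3f}")
```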


# 5.- Enlarging the DataSet

In [296]:
# Read up to 25000 records
X, y = create_prep_dataset("datasets/datasets/trec07p/data/India_Condom_Market_Dataset.csv", 25000)

In [285]:
# Use the first 15,000 records to train the algorithm and the remaining ones for testing
X_train, y_train = X[:15000], y[:15000]
X_test, y_test = X[15000:], y[15000:]

In [287]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [289]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [291]:
X_test = vectorizer.transform(X_test)
y_pred = clf.predict(X_test)

In [293]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.337
