# Logistic Regression: 50K Analysis

This exercise presents the fundamentals of logistic regression by posing one of the small problems that have been solved with Machine Learning techniques: SPAM detection.
About Dataset
Songs Dataset (2000-2020)
Description

This dataset contains information about songs released between 2000 and 2020. The dataset includes the song's title, artist, album, genre, release date, duration, and popularity.
Columns

    Title: Rolling in the Deep
    Artist: Adele
    Album: 21
    Genre: Pop
    Release Date: 2011-01-24
    Duration: 231
    Popularity: 85

Usage

This dataset can be used for various data analysis and machine learning tasks related to music.
License

This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).

In [1]:
# This class simplifies the preprocessing of emails that contain HTML code

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state (this also calls reset())
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Collect every text fragment found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes the HTML tags
# found in the text of the emails

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../.issues/51/16.html#article">Phrack World News</a><td>'
strip_tags(t)

'Phrack World News'

In addition to removing any HTML tags found in the email, other steps are needed to keep the messages free of unnecessary noise. These include removing punctuation marks, removing email fields that are not relevant, and stripping the affixes from each word so that only its stem remains (stemming).
The class shown below performs these transformations.

In [4]:
import email
import string
import nltk

# Requires the NLTK stop-word corpus; fetch it once with nltk.download('stopwords')

class Parser:
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email file and return its tokenized content."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Return the content of the email
        return {"subject": subject,
                "body": body,
                "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if isinstance(payload, str) and content_type == 'text/plain':
            return self.tokenize(payload)
        elif isinstance(payload, str) and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif isinstance(payload, list):
            # Multipart message: process each part recursively
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body

    def tokenize(self, text):
        """Turn a text string into tokens.

        Performs two main actions: removes punctuation symbols and stems the text.
        """
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        # Drop stop words, then stem the remaining tokens
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]
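
As a rough illustration of the cleaning steps that `tokenize` performs, the sketch below strips punctuation and drops stop words using only the standard library. It is not the `Parser` implementation: `simple_tokenize` and `SAMPLE_STOPWORDS` are hypothetical names, the stop-word list is a tiny hand-picked stand-in for NLTK's English list, the text is lowercased (which `Parser.tokenize` does not do), and stemming is left out.

```python
import string

# Hypothetical, deliberately tiny stop-word list (the Parser class uses NLTK's English list)
SAMPLE_STOPWORDS = {"the", "a", "of", "in", "and", "is"}

def simple_tokenize(text, stopwords=SAMPLE_STOPWORDS):
    """Strip punctuation, split on whitespace and drop stop words (no stemming)."""
    # str.translate removes every punctuation character in a single pass
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() without arguments also handles tabs, newlines and repeated spaces
    return [w for w in text.lower().split() if w not in stopwords]

tokens = simple_tokenize("Phrack World News: the latest issue, in HTML!")
print(tokens)
```

Lowercasing before the stop-word check is a design choice of this sketch: it lets capitalised words such as "The" match a lowercase stop-word list.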



Reading an email in raw format

In [6]:
import pandas as pd
file_path = "/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv"
dataset = pd.read_csv(file_path)
dataset.head(10)


Unnamed: 0,Title,Artist,Album,Genre,Release Date,Duration,Popularity
0,Include name this.,Patrick Anderson,Care.,R&B,2008-01-09,262,71
1,Manage west energy.,Eric Miller,Raise get.,Jazz,2011-08-20,187,37
2,Evening court painting.,Richard Curry,Sport.,Electronic,2010-05-30,212,58
3,Section turn hour.,James Smith,Full.,Hip-Hop,2014-10-12,272,59
4,Five agreement teach.,Amy Rodriguez,Eat.,Blues,2005-06-09,131,34
5,Turn child.,Jessica Martin,Cold according.,R&B,2006-09-16,207,58
6,Old.,Cheyenne Powell,Oil.,Country,2010-04-23,163,72
7,Clear fly over.,Aaron Coleman,Strategy development.,Classical,2010-02-06,183,73
8,Agency employee present.,Brandon Henderson,Might live.,Country,2020-02-18,243,69
9,Face become we.,Raymond White,Probably camera.,Blues,2011-11-07,177,55


## Parsing the email

In [7]:
p = Parser()
p.parse("/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv")

##### Reading the index

These helper functions load into memory the path of each email and its corresponding label.

{Spam, ham}

In [8]:
with open("/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv") as f:
    index = f.readlines()
index

['Title,Artist,Album,Genre,Release Date,Duration,Popularity\n',
 'Include name this.,Patrick Anderson,Care.,R&B,2008-01-09,262,71\n',
 'Manage west energy.,Eric Miller,Raise get.,Jazz,2011-08-20,187,37\n',
 'Evening court painting.,Richard Curry,Sport.,Electronic,2010-05-30,212,58\n',
 'Section turn hour.,James Smith,Full.,Hip-Hop,2014-10-12,272,59\n',
 'Five agreement teach.,Amy Rodriguez,Eat.,Blues,2005-06-09,131,34\n',
 'Turn child.,Jessica Martin,Cold according.,R&B,2006-09-16,207,58\n',
 'Old.,Cheyenne Powell,Oil.,Country,2010-04-23,163,72\n',
 'Clear fly over.,Aaron Coleman,Strategy development.,Classical,2010-02-06,183,73\n',
 'Agency employee present.,Brandon Henderson,Might live.,Country,2020-02-18,243,69\n',
 'Face become we.,Raymond White,Probably camera.,Blues,2011-11-07,177,55\n',
 'Couple bank.,Paul Stephens,And.,Reggae,2016-03-31,245,76\n',
 'Wife subject.,Julie Martin,Of.,Rock,2003-01-27,191,97\n',
 'Business research.,Michael Glass,Speak.,Blues,2019-10-12,145,92\n',
 '

In [9]:
import csv

DATASET_PATH = "/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv"  # Adjust this to the path of your dataset

def parse_dataset(path_to_index, n_elements):
    ret_indexes = []
    with open(path_to_index, newline='') as csvfile:
        index_reader = csv.reader(csvfile, delimiter=',')
        # Note: the header row is not skipped, so it counts as the first element
        for i, row in enumerate(index_reader):
            if i >= n_elements:
                break
            if len(row) > 3:  # Check that the row has at least four columns (Title, Artist, Album, Genre)
                title = row[0].strip()
                artist = row[1].strip()
                album = row[2].strip()
                genre = row[3].strip()
                ret_indexes.append({"title": title, "artist": artist, "album": album, "genre": genre})
    return ret_indexes

# Load the index and the labels into memory
index = parse_dataset(DATASET_PATH, 1)


In [10]:
indexes = parse_dataset("/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv", 50000)
indexes

[{'title': 'Title', 'artist': 'Artist', 'album': 'Album', 'genre': 'Genre'},
 {'title': 'Include name this.',
  'artist': 'Patrick Anderson',
  'album': 'Care.',
  'genre': 'R&B'},
 {'title': 'Manage west energy.',
  'artist': 'Eric Miller',
  'album': 'Raise get.',
  'genre': 'Jazz'},
 {'title': 'Evening court painting.',
  'artist': 'Richard Curry',
  'album': 'Sport.',
  'genre': 'Electronic'},
 {'title': 'Section turn hour.',
  'artist': 'James Smith',
  'album': 'Full.',
  'genre': 'Hip-Hop'},
 {'title': 'Five agreement teach.',
  'artist': 'Amy Rodriguez',
  'album': 'Eat.',
  'genre': 'Blues'},
 {'title': 'Turn child.',
  'artist': 'Jessica Martin',
  'album': 'Cold according.',
  'genre': 'R&B'},
 {'title': 'Old.',
  'artist': 'Cheyenne Powell',
  'album': 'Oil.',
  'genre': 'Country'},
 {'title': 'Clear fly over.',
  'artist': 'Aaron Coleman',
  'album': 'Strategy development.',
  'genre': 'Classical'},
 {'title': 'Agency employee present.',
  'artist': 'Brandon Henderson',
  

# Preprocessing the dataset

The functions presented above make it possible to read the emails programmatically and to process them so that the components that are not useful for SPAM detection are removed. However, each email is still represented as a Python dictionary containing a series of words.

In [11]:
# Read the first record
print(index[0])

{'title': 'Title', 'artist': 'Artist', 'album': 'Album', 'genre': 'Genre'}


In [12]:
# Parse the first record
record = index[0]
print("The record is: ", record["title"])
print("Artist: ", record["artist"])
print("Album: ", record["album"])
print("Genre: ", record["genre"])


The record is:  Title
Artist:  Artist
Album:  Album
Genre:  Genre
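
The first record printed above is the CSV header row (`Title`, `Artist`, ...) rather than a song, because `parse_dataset` treats every line as data. One possible way to avoid this, sketched here under assumptions, is to let `csv.DictReader` consume the header. `read_records` and `SAMPLE_CSV` are hypothetical names; the sample rows are copied from the output shown earlier and are read from memory instead of from the real file.

```python
import csv
import io

# Hypothetical in-memory sample mirroring the first rows of the dataset
SAMPLE_CSV = (
    "Title,Artist,Album,Genre,Release Date,Duration,Popularity\n"
    "Include name this.,Patrick Anderson,Care.,R&B,2008-01-09,262,71\n"
    "Manage west energy.,Eric Miller,Raise get.,Jazz,2011-08-20,187,37\n"
)

def read_records(csvfile, n_elements):
    """Return up to n_elements data rows as dicts; the header only provides the keys."""
    reader = csv.DictReader(csvfile)  # consumes the first line as field names
    records = []
    for i, row in enumerate(reader):
        if i >= n_elements:
            break
        records.append({
            "title": row["Title"].strip(),
            "artist": row["Artist"].strip(),
            "album": row["Album"].strip(),
            "genre": row["Genre"].strip(),
        })
    return records

records = read_records(io.StringIO(SAMPLE_CSV), 1)
print(records[0])
```

With this variant, asking for one element yields the first song instead of the column names, and `n_elements` counts data rows only.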


The logistic regression algorithm cannot ingest text as part of the dataset. A series of additional functions must therefore be applied to transform the text of the parsed emails into a numerical representation.

### Applying CountVectorizer

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Prepare the dataset as a list of text strings
prep_data = [" ".join([record['title'], record['artist'], record['album'], record['genre']]) for record in index]

vectorizer = CountVectorizer()
vectorizer.fit(prep_data)

print("\n\nData:\n", prep_data, "\n")
print("Input features: \n", vectorizer.get_feature_names_out())

X = vectorizer.transform(prep_data)
print("\nValues:\n", X.toarray())




Data:
 ['Title Artist Album Genre'] 

Input features: 
 ['album' 'artist' 'genre' 'title']

Values:
 [[1 1 1 1]]
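
With a single document, every count in the output above is 1. To make the idea of a count matrix more visible, here is a hand-rolled bag-of-words sketch over three short, made-up documents. It only illustrates the concept: `bag_of_words` is a hypothetical helper, not scikit-learn code, and `CountVectorizer` applies its own tokenization rules that this sketch does not reproduce.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Made-up documents, for illustration only
vocabulary, matrix = bag_of_words(["turn child", "section turn hour", "turn turn hour"])
print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary term and each row to one document, which is the same layout as the `Values` array printed by the notebook.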


### Applying OneHotEncoding

In [14]:
from sklearn.preprocessing import OneHotEncoder

# Prepare the dataset as a list of lists
prep_data = [[record['title'], record['artist'], record['album'], record['genre']] for record in index]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_data)

print("Features: \n", enc.get_feature_names_out(), "\n")
print("Values:\n", X.toarray())


Features: 
 ['x0_Title' 'x1_Artist' 'x2_Album' 'x3_Genre'] 

Values:
 [[1. 1. 1. 1.]]
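
Because only one record was encoded, each column has a single category and the output above is all ones. The sketch below shows, for one column with several values, what a one-hot encoding looks like. `one_hot` is a hypothetical standard-library helper written for illustration and is not how `OneHotEncoder` is implemented; the genre values are taken from the dataset sample.

```python
def one_hot(values):
    """Encode a list of categorical values as rows with a single 1 each."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the column of this value's category
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["R&B", "Jazz", "Blues", "R&B"])
print(categories)
for row in rows:
    print(row)
```

Every row contains exactly one 1, and repeated values (here `R&B`) map to identical rows.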


### Helper functions for preprocessing the dataset

In [15]:
import csv

def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    with open(index_path, newline='') as csvfile:
        index_reader = csv.reader(csvfile, delimiter=',')
        for i, row in enumerate(index_reader):
            if i >= n_elements:
                break
            if len(row) > 3:  # Check that the row has at least four columns (Title, Artist, Album, Genre)
                title = row[0].strip()
                artist = row[1].strip()
                album = row[2].strip()
                genre = row[3].strip()
                X.append(" ".join([title, artist, album]))
                y.append(genre)
    return X, y

# Read only a subset of 1000 rows (the header row is counted as one of them)
X_train, y_train = create_prep_dataset('/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv', 1000)
X_train

['Title Artist Album',
 'Include name this. Patrick Anderson Care.',
 'Manage west energy. Eric Miller Raise get.',
 'Evening court painting. Richard Curry Sport.',
 'Section turn hour. James Smith Full.',
 'Five agreement teach. Amy Rodriguez Eat.',
 'Turn child. Jessica Martin Cold according.',
 'Old. Cheyenne Powell Oil.',
 'Clear fly over. Aaron Coleman Strategy development.',
 'Agency employee present. Brandon Henderson Might live.',
 'Face become we. Raymond White Probably camera.',
 'Couple bank. Paul Stephens And.',
 'Wife subject. Julie Martin Of.',
 'Business research. Michael Glass Speak.',
 'Act majority. Raymond Ramos Lead.',
 'Party yet. Mr. Jeffery Harris North.',
 'Traditional war with. Jason Davis Reveal senior.',
 'None possible. Joseph Allen Catch from.',
 'Give statement. Charles Gentry Can.',
 'Dog call ready member. Juan Gutierrez Company modern.',
 'Make. Brenda Wood Only local.',
 'Artist sort. Thomas Webster By.',
 'College grow. Lisa Rodgers Gas.',
 'National 

### Vectorizing the data

In [16]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [17]:
print(X_train.toarray())
print("\nFeatures", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features 1766


In [18]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,aaron,abigail,ability,able,about,above,accept,according,account,across,...,yet,yolanda,you,young,your,yourself,yvonne,zachary,zavala,zimmerman
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
y_train

['Genre',
 'R&B',
 'Jazz',
 'Electronic',
 'Hip-Hop',
 'Blues',
 'R&B',
 'Country',
 'Classical',
 'Country',
 'Blues',
 'Reggae',
 'Rock',
 'Blues',
 'Jazz',
 'R&B',
 'Electronic',
 'Electronic',
 'Classical',
 'Country',
 'Country',
 'Classical',
 'Jazz',
 'Classical',
 'R&B',
 'Jazz',
 'Jazz',
 'Jazz',
 'Classical',
 'Reggae',
 'Electronic',
 'Classical',
 'Pop',
 'Classical',
 'Hip-Hop',
 'Rock',
 'Hip-Hop',
 'Rock',
 'Country',
 'Rock',
 'Pop',
 'Pop',
 'Jazz',
 'Blues',
 'Reggae',
 'Country',
 'Reggae',
 'Rock',
 'Classical',
 'Rock',
 'Electronic',
 'R&B',
 'Electronic',
 'R&B',
 'Country',
 'Reggae',
 'Blues',
 'Electronic',
 'Jazz',
 'Blues',
 'Jazz',
 'Electronic',
 'Electronic',
 'Pop',
 'Jazz',
 'Blues',
 'Reggae',
 'Rock',
 'Classical',
 'Electronic',
 'Blues',
 'Reggae',
 'Electronic',
 'Classical',
 'Pop',
 'Blues',
 'Hip-Hop',
 'Pop',
 'Country',
 'Classical',
 'Reggae',
 'Classical',
 'Blues',
 'Hip-Hop',
 'Jazz',
 'Country',
 'Rock',
 'Hip-Hop',
 'Classical',
 'Blues'

#### Training the logistic regression algorithm with the preprocessed dataset

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
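
For intuition about what the fitted classifier does at prediction time, here is a minimal sketch of the softmax step used by the multinomial formulation of logistic regression: per-class scores are turned into probabilities and the most probable class is chosen. The class names, the scores and the `softmax` helper are hypothetical, and whether scikit-learn applies exactly this scheme depends on its version and solver settings.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Blues", "Jazz", "Rock"]   # hypothetical subset of genres
scores = [0.2, 1.5, -0.3]             # hypothetical linear scores w.x + b per class

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The class with the largest score always receives the largest probability, so the prediction is simply the highest-scoring class.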

# 4.- Prediction

In [21]:
# Read a dataset of new records.

# Read 1500 rows from our dataset and keep only the last 500,
# which were not used to train the algorithm.
X, y = create_prep_dataset('/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv', 1500)

X_test = X[1000:]
y_test = y[1000:]

# Show a few results as a sanity check
for i in range(5):  # Show the first 5 elements of the test set
    print(f"Record {i+1}: X_test = {X_test[i]}, y_test = {y_test[i]}")

Record 1: X_test = Exactly. Cheryl Williamson Baby., y_test = Electronic
Record 2: X_test = Hard better. Dan Bond Too., y_test = Rock
Record 3: X_test = Physical movie should. Jose Marshall Kind technology., y_test = Rock
Record 4: X_test = Name source happen consumer. Dawn Wilcox Window pull., y_test = Jazz
Record 5: X_test = Team top. Jacqueline Schneider Set interesting., y_test = Hip-Hop


# Preprocessing the emails with the vectorizer created earlier

In [22]:
X_test_vect = vectorizer.transform(X_test)

In [23]:
y_pred = clf.predict(X_test_vect)
y_pred

array(['Pop', 'Jazz', 'Country', 'Pop', 'Country', 'Classical',
       'Classical', 'Hip-Hop', 'Electronic', 'Reggae', 'Blues', 'Hip-Hop',
       'Pop', 'Electronic', 'Hip-Hop', 'Reggae', 'Pop', 'R&B', 'Reggae',
       'Classical', 'Pop', 'Hip-Hop', 'Reggae', 'Classical', 'Hip-Hop',
       'Jazz', 'Country', 'Pop', 'Country', 'Country', 'Jazz', 'Hip-Hop',
       'Reggae', 'Blues', 'Hip-Hop', 'Jazz', 'Jazz', 'Pop', 'Pop', 'Pop',
       'Electronic', 'Jazz', 'Electronic', 'Jazz', 'Jazz', 'Blues', 'Pop',
       'Jazz', 'Pop', 'Country', 'Classical', 'Hip-Hop', 'Classical',
       'Reggae', 'Blues', 'Jazz', 'Jazz', 'Classical', 'Hip-Hop',
       'Country', 'Country', 'Classical', 'Jazz', 'Electronic', 'Pop',
       'Blues', 'Reggae', 'Rock', 'Jazz', 'Classical', 'Country',
       'Country', 'Electronic', 'Rock', 'Pop', 'Country', 'Pop', 'R&B',
       'R&B', 'Jazz', 'Country', 'Reggae', 'Pop', 'Rock', 'Country',
       'Rock', 'Reggae', 'Hip-Hop', 'Rock', 'Pop', 'Jazz', 'Pop', 'Pop',
      

In [24]:
print("Prediction\n", y_pred)
print("\nTrue labels", y_test)

Prediction
 ['Pop' 'Jazz' 'Country' 'Pop' 'Country' 'Classical' 'Classical' 'Hip-Hop'
 'Electronic' 'Reggae' 'Blues' 'Hip-Hop' 'Pop' 'Electronic' 'Hip-Hop'
 'Reggae' 'Pop' 'R&B' 'Reggae' 'Classical' 'Pop' 'Hip-Hop' 'Reggae'
 'Classical' 'Hip-Hop' 'Jazz' 'Country' 'Pop' 'Country' 'Country' 'Jazz'
 'Hip-Hop' 'Reggae' 'Blues' 'Hip-Hop' 'Jazz' 'Jazz' 'Pop' 'Pop' 'Pop'
 'Electronic' 'Jazz' 'Electronic' 'Jazz' 'Jazz' 'Blues' 'Pop' 'Jazz' 'Pop'
 'Country' 'Classical' 'Hip-Hop' 'Classical' 'Reggae' 'Blues' 'Jazz'
 'Jazz' 'Classical' 'Hip-Hop' 'Country' 'Country' 'Classical' 'Jazz'
 'Electronic' 'Pop' 'Blues' 'Reggae' 'Rock' 'Jazz' 'Classical' 'Country'
 'Country' 'Electronic' 'Rock' 'Pop' 'Country' 'Pop' 'R&B' 'R&B' 'Jazz'
 'Country' 'Reggae' 'Pop' 'Rock' 'Country' 'Rock' 'Reggae' 'Hip-Hop'
 'Rock' 'Pop' 'Jazz' 'Pop' 'Pop' 'Hip-Hop' 'Reggae' 'R&B' 'Classical'
 'Hip-Hop' 'Jazz' 'Rock' 'Hip-Hop' 'Pop' 'Jazz' 'Pop' 'R&B' 'Blues' 'Pop'
 'Reggae' 'Country' 'R&B' 'Pop' 'Pop' 'Pop' 'Pop' 'Pop' 'Elect

## Evaluating the results

In [25]:
from sklearn.metrics import accuracy_score
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.088
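
An accuracy figure is easier to read next to a baseline. The labels printed earlier show ten distinct genres, so if the classes are roughly balanced (an assumption that is not verified here), random guessing would score around 0.1. The sketch below computes accuracy by hand and compares it with a majority-class baseline; the helper names and the small label lists are hypothetical and serve only to show the calculation.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_test, [most_common_label] * len(y_test))

# Hypothetical toy labels, for illustration only
y_train_demo = ["Pop", "Pop", "Jazz", "Rock"]
y_test_demo = ["Pop", "Jazz", "Rock", "Pop", "Blues"]
y_pred_demo = ["Pop", "Jazz", "Rock", "Jazz", "Pop"]

model_acc = accuracy(y_test_demo, y_pred_demo)
baseline_acc = majority_baseline(y_train_demo, y_test_demo)
print(f"model: {model_acc:.3f}, baseline: {baseline_acc:.3f}")
```

A model whose accuracy does not clearly exceed such a baseline has learned little from its features.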


# 5.- Enlarging the dataset

In [26]:
# Read 20,000 records
X, y = create_prep_dataset('/home/raul/Escritorio/proyectoRegresion50k/songs_2000_2020_50k.csv', 20000)


In [27]:
# Use 15,000 records to train the algorithm and 5,000 for testing.
X_train, y_train = X[:15000], y[:15000]
X_test, y_test = X[15000:], y[15000:]

In [28]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [29]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [30]:
X_test = vectorizer.transform(X_test)
y_pred = clf.predict(X_test)

In [36]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.106
