# IA AntiSpam
<details>
<summary>The project involves the following key components:</summary>

1. Data Collection: Gathering a diverse and representative dataset of spam and non-spam messages to train the machine learning models.

2. Data Preprocessing: Cleaning and preprocessing the collected data to remove noise, standardize formats, and extract relevant features.

3. Feature Engineering: Transforming the preprocessed data into a suitable format for training the machine learning models, including the extraction of text-based features such as word frequency, n-grams, and semantic analysis.

4. Model Training: Developing and training machine learning models using various algorithms such as Naive Bayes, Support Vector Machines, or Deep Learning models like Recurrent Neural Networks (RNNs) or Transformers.

5. Model Evaluation: Assessing the performance of the trained models using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score.

6. Model Deployment: Integrating the trained model into a production environment, such as an email server or messaging platform, to automatically classify incoming messages as spam or non-spam.
</details>






## Imports

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import random
import os
import mailbox
from collections import Counter

nltk.download('stopwords')  # Uncomment if stopwords haven't been downloaded
nltk.download('wordnet')  # Uncomment if WordNetLemmatizer isn't present

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...


True

## Data Collection

In [5]:
# install required packages
!sudo apt update
!sudo apt install unrar
# unrar spam.rar and easy_ham.rar on data folder
!unrar x spam.rar spam/
!unrar x easy_ham.rar easy_ham/

946.01s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


Hit:1 https://packages.microsoft.com/repos/microsoft-ubuntu-focal-prod focal InRelease
Hit:2 https://dl.yarnpkg.com/debian stable InRelease                           [0m
Hit:3 https://repo.anaconda.com/pkgs/misc/debrepo/conda stable InRelease       [0m
Hit:4 http://security.ubuntu.com/ubuntu focal-security InRelease            [0m[33m
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease                 [0mm[33m[33m
Hit:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease                 [0m
Hit:5 https://packagecloud.io/github/git-lfs/ubuntu focal InRelease            [33m
Hit:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done[33m
Building dependency tree       
Reading state information... Done
38 packages can be upgraded. Run 'apt list --upgradable' to see them.


953.07s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


Reading package lists... Done
Building dependency tree       
Reading state information... Done
unrar is already the newest version (1:5.6.6-2build1).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


959.67s - pydevd: Sending message related to process being replaced timed-out after 5 seconds



UNRAR 5.61 beta 1 freeware      Copyright (c) 1993-2018 Alexander Roshal


Extracting from spam.rar

Creating    spam                                                      OK
Extracting  spam/0390.176f9525715411d7e2ce36e5bab4c770                   0  OK 
Extracting  spam/0391.a52ab775baefe8b277a285560cac7d78                   0  OK 
Extracting  spam/0392.9e194dfff92f7d9957171b04a8d4b957                   0  OK 
Extracting  spam/0393.d3a4d296a35c6a7f39429247c007eeae                   0  OK 
Extracting  spam/0394.9c882c72ddfd810b56776fdaa1c727a6                   0  OK 
Extracting  spam/0395.bb934e8b4c39d5eab38f828a26f760b4                   1  OK 
Extracting  spam/0396.8ea0610e30c94adefd9b3489df436ad9                   1  OK 
Extracting  spam/0397.c02eba1386b00d640c954e5117dd1aa0                   1  OK 
Extracting  spam/0398.93e6be09b12b93697185c881c739605d                   1  OK 
Extracting  spam/0399.b9eab4251d9263129290cf7fc2aa4c7a                   1  OK 
Extracting  spam/0400.a15

965.12s - pydevd: Sending message related to process being replaced timed-out after 5 seconds



UNRAR 5.61 beta 1 freeware      Copyright (c) 1993-2018 Alexander Roshal


Extracting from easy_ham.rar

Creating    easy_ham                                                  OK
Extracting  easy_ham/2000.f5754a180fc6394657dc27921f88aaae               0  OK 
Extracting  easy_ham/0001.ea7e79d3153e7469e7a9c3e0af6a357e               0  OK 
Extracting  easy_ham/0002.b3120c4bcbf3101e661161ee7efcb8bf               0  OK 
Extracting  easy_ham/0003.acfc5ad94bbd27118a0d8685d18c89dd               0  OK 
Extracting  easy_ham/0004.e8d5727378ddde5c3be181df593f1712               0  OK 
Extracting  easy_ham/0005.8c3b9e9c0f3f183ddaf7592a11b99957               0  OK 
Extracting  easy_ham/0006.ee8b0dba12856155222be180ba122058               0  OK 
Extracting  easy_ham/0007.c75188382f64b090022fa3b095b020b0               0  OK 
Extracting  easy_ham/0008.20bc0b4ba2d99aae1c7098069f611a9b               0  OK 
Extracting  easy_ham/0009.435ae292d75abb1ca492dcc2d5cf1570               0  OK 
Extracting  easy_ham/

## Data preprocesing

In [10]:
email_list = []
carpetas = ['easy_ham', 'spam']
for carpeta in carpetas:
    for archivo in os.listdir(carpeta):
        # Abrir el archivo como un buzón de correo
        mbox = mailbox.mbox(carpeta + '/' + archivo)
        # Agregar cada correo en el buzón a la lista de correos
        for correo in mbox:
            email_list.append(correo)
    #guardar la cantidad de correos en cada carpeta en ua variable
    if carpeta == 'easy_ham':
        easy_ham_count = len(email_list)
    else:
        spam_count = len(email_list) - easy_ham_count

print("Correos Cargados")

def flatten_payload(email):
    email_text = email.get_payload()
    if isinstance(email_text, list):
        return ' '.join(flatten_payload(part) for part in email_text)
    else:
        return email_text


def preprocess_email(email):
    email_text = email.get_payload()
    email_text = flatten_payload(email)
    # 1. Conversión a minúsculas:
    email_text = email_text.lower()

    # 2. Eliminación de caracteres especiales y URLs:
    email_text = re.sub(r'[^a-z0-9\s]', ' ', email_text)  # Sustituye caracteres especiales
    email_text = re.sub(r'http\S+', ' ', email_text)  # Elimina URLs

    # 3. Tokenización (separa el texto en palabras):
    words = email_text.split()

    # 4. Eliminación de stopwords (palabras de enlace frecuentes):
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # 5. Lematización (reduce palabras a su raíz):
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

def create_vocabulary(email_list, frequency_threshold=5):
    # Initialize a Counter object to store word frequencies
    word_freq = Counter()

    # Iterate over each email in the list
    for email in email_list:
        # Convert the email into a list of preprocessed words
        words = preprocess_email(email).split()

        # Update word frequencies
        word_freq.update(words)

    # Filter out uncommon words
    vocab_list = [word for word, freq in word_freq.items() if freq >= frequency_threshold]

    # Sort the vocabulary list
    vocab_list.sort()

    return vocab_list


# Guarda el vocabulario en un archivo de texto
if os.path.exists('vocab.txt'):
    with open('vocab.txt', 'r') as f:
        vocab = [line.strip() for line in f]
    print("Vocabulario Cargado")
else:
    vocab = create_vocabulary(email_list,5)
    with open('vocab.txt', 'w') as f:
        for word in vocab:
            f.write(word + '\n')
    print("Vocabulario creado y guardado")



Correos Cargados
Vocabulario creado y guardado


## Data Encoding

In [12]:
def one_hot_encode(text, vocab_dict):
    # Convierte el texto en una lista de palabras preprocesadas
    words = preprocess_email(text)

    # Crea un vector de ceros del tamaño del vocabulario
    encoded_email = np.zeros(len(vocab))

    # Asigna 1 a la posición correspondiente a cada palabra en el vocabulario
    for word in words:
        if word in vocab_dict:
            encoded_email[vocab_dict[word]] = 1

    return encoded_email

def DataFrame_GEN(email_list, spam_count, easy_ham_count):
    # Divide la lista de correos en correos de spam y ham
    spam_emails = email_list[:spam_count]
    ham_emails = email_list[spam_count:]

    # Agrega los correos electrónicos codificados a un DataFrame

    # Crea un DataFrame para almacenar los correos electrónicos y sus etiquetas
    email_df = pd.DataFrame(columns=['text', 'label'])

    # Crea una lista de correos electrónicos codificados
    encoded_emails = []
    vocab_dict = {word: i for i, word in enumerate(vocab)}

    # Codifica los correos electrónicos de spam
    encoded_emails = [one_hot_encode(email, vocab_dict) for email in spam_emails]

    print("Spam Codificado")

    # Codifica los correos electrónicos de ham
    encoded_emails += [one_hot_encode(email, vocab_dict) for email in ham_emails]

    print("Ham Codificado")

    # Agrega los correos electrónicos codificados al DataFrame

    email_df['text'] = encoded_emails
    email_df['label'] = ['spam'] * spam_count + ['ham'] * easy_ham_count

    # Guarda el DataFrame en un archivo CSV
    email_df.to_pickle('emails.pkl')

if os.path.exists('emails.pkl'):
    email_df = pd.read_pickle('emails.pkl')
    print("DataFrame Cargado")
else:
    DataFrame_GEN(email_list, spam_count, easy_ham_count)
    print("DataFrame Creado y Guardado")

Spam Codificado
Ham Codificado
DataFrame Creado y Guardado


## Model Trainig

In [2]:
from sklearn.model_selection import train_test_split
from tensorflow import keras
from keras import layers
from keras import Input

with open('vocab.txt', 'r') as file:
    vocab = file.read().splitlines()

# Read the emails.pkl dataframe
emails_df = pd.read_pickle('emails.pkl')

# Shuffle the dataframe
emails_df = emails_df.sample(frac=1).reset_index(drop=True)


# change train_set['label'] to 1 and 0
emails_df['label'] = emails_df['label'].apply(lambda x: 1 if x == 'spam' else 0)


# Separate the data into train, test, and evaluation sets
train_set, test_set = train_test_split(emails_df, test_size=0.2)
test_set, eval_set = train_test_split(test_set, test_size=0.5)


# Create he anti-spam AI model
model = keras.Sequential([
    Input(shape=(len(vocab),)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])




# matrix with the encoded emails
flatten_data = np.array(train_set['text'].to_list())

# Create the input matrix
X = np.zeros((flatten_data.shape[0], len(vocab)))
for i, email in enumerate(flatten_data):
    X[i] = email
# Create the output matrix
Y = np.array(train_set['label'].to_list())


# Train the model
model.fit(X, Y, epochs=5, batch_size=32)

# Evaluate the model
flatten_data = np.array(test_set['text'].to_list())
X_test = np.zeros((flatten_data.shape[0], len(vocab)))
for i, email in enumerate(flatten_data):
    X_test[i] = email
Y_test = np.array(test_set['label'].to_list())

test_loss, test_accuracy = model.evaluate(X_test, Y_test)

print(f'Test accuracy: {test_accuracy}')

2024-04-26 08:20:39.278725: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-26 08:20:39.279942: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-26 08:20:39.966149: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-26 08:20:41.971858: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


ValueError: Unrecognized data type: x=[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]] (of type <class 'numpy.ndarray'>)