# Exercise 3
## Spam Classification
### Context
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

### Content
The files contain one message per line. Each line has two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
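
As a rough illustration of this layout, the sketch below parses label/text pairs from tab-separated lines, which is how the notebook later reads the UCI file (`sep='\t'`, no header). The helper name `parse_messages` and the second sample line are hypothetical and only serve the example.

```python
import csv
import io

# Two lines in the tab-separated layout: label, then raw text.
# The spam line is a made-up message used only for illustration.
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tYou have won a prize, reply now\n"
)

def parse_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quote characters inside messages as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

messages = parse_messages(io.StringIO(sample))
for label, text in messages:
    print(label, "|", text)
```

Disabling quote handling is a deliberate choice in this sketch: SMS text often contains unbalanced quote characters, and treating them as field delimiters can merge several messages into one row.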

This corpus has been collected from free or free-for-research sources on the Internet:

- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link](http://www.grumbletext.co.uk/).
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link](http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/).
- A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at [Web Link](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf).
- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link](http://www.esp.uem.es/jmgomez/smsspamcorpus/). This corpus has also been used in academic research.

### Acknowledgements
The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The creators ask that, if you find the dataset useful, you cite the paper listed below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

In [None]:
# Install the "wget" package to download files from the web easily from Python
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=417d0b82c6268a8a7e6d212011baa8ef741f832d9438b2c3c49e820c859175e9
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


The libraries needed for text analysis are imported: utilities for downloading data from the web and handling files; the English stop words from `nltk.corpus` (words that carry little relevant information); `word_tokenize` for splitting text into words or tokens; the `string` module for working with punctuation characters and letters; and scikit-learn model builders such as logistic regression and random forests. In addition, scikit-learn metrics are imported to assess predictive performance, and Word2Vec is imported from gensim as the natural language processing (NLP) model.

In [None]:
import pandas as pd
import numpy as np
import wget
import os
from zipfile import ZipFile

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, auc, roc_curve

import gensim
from gensim.models import Word2Vec
import warnings

# Suppress warnings and download the additional NLTK resources needed by the tokenizer and the stop-word list:
warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Download and unzip the SMS Spam Collection archive from the UCI repository, then load the data into a pandas DataFrame
try:
    from google.colab import files  # only importable when running inside Google Colab
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    !unzip smsspamcollection.zip
    df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['target', 'text'])
except ModuleNotFoundError:
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    # Build paths with os.path.join so the code also works outside Windows
    path = os.path.join(os.getcwd(), 'Data')
    os.makedirs(path, exist_ok=True)  # create the target directory so the archive is saved inside it
    wget.download(url, path)
    temp = os.path.join(path, 'smsspamcollection.zip')
    with ZipFile(temp) as file:
        file.extractall(path)
    df = pd.read_csv(os.path.join(path, 'SMSSpamCollection'), sep='\t', header=None, names=['target', 'text'])

--2024-03-20 12:18:56--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z     [ <=>                ] 198.65K  --.-KB/s    in 0.1s    

2024-03-20 12:18:57 (1.34 MB/s) - ‘smsspamcollection.zip’ saved [203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [None]:
# The DataFrame has two fields: a categorical field indicating whether the message is spam, and a field with the message text
df.head()

| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
display(df.shape) #Number of rows (instances) and columns in the dataset
df["target"].value_counts()/df.shape[0] #Class distribution in the dataset

(5572, 2)

ham     0.865937
spam    0.134063
Name: target, dtype: float64

About 13.4% of the messages in the dataset are spam.
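
Because the classes are this imbalanced, accuracy alone can be misleading. The sketch below computes the accuracy of a trivial baseline that always predicts the majority class. The counts 4825 ham and 747 spam are an assumption chosen to reproduce the 5572 rows and the proportions printed above.

```python
from collections import Counter

# Assumed class counts: 4825 + 747 = 5572 rows, matching the proportions above.
labels = ["ham"] * 4825 + ["spam"] * 747

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always answers with the majority class.
baseline_accuracy = majority_count / len(labels)
# Such a classifier never flags spam, so its spam recall is zero.
spam_recall = 1.0 if majority_label == "spam" else 0.0

print(f"Always predicting '{majority_label}': accuracy {baseline_accuracy:.3f}")
print(f"Spam recall of that baseline: {spam_recall:.1f}")
```

Any model evaluated on this dataset is worth comparing against this baseline, and against per-class precision and recall, before its accuracy is interpreted.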

In [None]:
# The text field is the explanatory variable; the categorical target is encoded as a binary variable, with 1 meaning spam:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})

In [None]:
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

Preprocess the text data by removing stop words, converting all text to lowercase, and removing punctuation using NLTK package.


In [None]:
# Clean the data: convert to lowercase, remove punctuation, tokenize, and remove English stop words
stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    # Iterate over characters and keep those that are not punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)
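
To make the effect of these steps concrete, here is a dependency-free sketch of the same pipeline. It is an approximation, not the notebook's function: `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list, and whitespace splitting stands in for `word_tokenize`.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"i", "a", "the", "to", "is", "you", "in", "until"}

def preprocess_sketch(text):
    """Lowercase, drop ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

first = preprocess_sketch("Ok lar... Joking wif u oni...")
second = preprocess_sketch("Go until jurong point, crazy..")
print(first)
print(second)
```

One consequence of removing punctuation before filtering is that contractions lose their apostrophe ("don't" becomes "dont"), so they would likely no longer match stop-word entries that contain one.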

Train a Word2Vec model on the preprocessed training data using Gensim package.

The Word2Vec model is trained with the following settings:
- Word vectors with 100 dimensions (`vector_size=100`)
- A maximum window of 5 neighbouring words to capture context (`window=5`)
- 20 negative samples per positive example (`negative=20`)
- Words must appear at least once to be included in the vocabulary (`min_count=1`)
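
To illustrate what the `window` setting controls, the sketch below lists the (target, context) pairs that fall within a given distance of each word. The function `context_pairs` is hypothetical and always uses the full window; actual Word2Vec implementations may vary the effective window per word, so treat this as a simplified picture.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["free", "entry", "win", "cup"]
for window in (1, 2):
    pairs = context_pairs(tokens, window)
    print(f"window={window}: {len(pairs)} pairs, first few: {pairs[:3]}")
```

A larger window produces more training pairs per sentence and mixes in words that are further apart, which is why it is one of the parameters varied later in the notebook.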

In [None]:
sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=5, negative=20, min_count=1, workers=4)

Convert the preprocessed text data to a vector representation using the Word2Vec model.

In [None]:
# Convert each sentence into the average of its word vectors, for both the training and test sets:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        # No known words: fall back to a zero vector of the model's dimensionality
        return np.zeros(model.vector_size)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
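
The averaging step can be checked by hand on a toy example. The sketch below uses a hypothetical two-dimensional embedding dictionary (`toy`) in place of the trained Word2Vec model and plain Python lists in place of NumPy arrays.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy = {"free": [1.0, 0.0], "win": [0.0, 1.0]}

print(mean_vector(["free", "win", "unknownword"], toy, 2))  # [0.5, 0.5]
print(mean_vector(["unknownword"], toy, 2))                 # [0.0, 0.0]
```

Out-of-vocabulary words are simply skipped, so a test message made up entirely of unseen words collapses to the zero vector and carries no signal for the classifier.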

Train a classification model such as logistic regression, random forests, or support vector machines using the vectorised training data and the spam labels.

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

Evaluate the performance of the classification model on the testing set with the accuracy, precision, recall and F1 score.

In [None]:
y_pred = clf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))

Accuracy: 0.8660287081339713
AUC: 0.5


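Logistic regression was used, yielding an AUC of 0.5, which indicates that the model is no better than random at deciding whether a message is spam. The accuracy of 0.866 matches the share of ham messages in the data, which suggests that the model assigns almost every message to the ham class.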
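
Note that the AUC in the cell above is computed from hard class predictions. The sketch below, using made-up labels and scores, shows how the same classifier can yield an AUC of 0.5 from thresholded labels yet a higher AUC from its continuous scores. The function `auc_score` is a hypothetical pairwise implementation of the ROC AUC, not a scikit-learn call.

```python
def auc_score(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted spam probabilities.
y_true = [0, 0, 0, 0, 1, 1]
probabilities = [0.05, 0.10, 0.20, 0.30, 0.25, 0.40]
# Thresholding at 0.5 turns every prediction into class 0.
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

print("AUC from hard labels:  ", auc_score(y_true, hard_labels))    # 0.5
print("AUC from probabilities:", auc_score(y_true, probabilities))  # 0.875
```

In scikit-learn terms, this corresponds to passing `clf.predict_proba(X_test)[:, 1]` rather than `clf.predict(X_test)` to `roc_curve`, which gives a threshold-independent view of how well the model ranks spam above ham.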

A Random Forest classifier on bag-of-words counts (`CountVectorizer`) is used next; about 98% of its predictions on the test set are correct.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split



# Split the DataFrame into training and test sets
X = df['text']
y = df['target'].map({'ham': 0, 'spam': 1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bag-of-words features with English stop words removed
vectorizer = CountVectorizer(stop_words='english')

# Fit on the training data and transform the test data
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Initialise and train the Random Forest classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train_vectorized, y_train)

# Predictions and model evaluation
y_pred = random_forest.predict(X_test_vectorized)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')


Accuracy: 0.98


The target is then predicted using `TfidfVectorizer` features with a Random Forest classifier, which yields the same accuracy as the previous model.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


X = df['text']
y = df['target'].map({'ham': 0, 'spam': 1})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialise the TfidfVectorizer with English stop words removed
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit on the training data and transform the test data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train_tfidf, y_train)

# Predictions and model evaluation
y_pred = random_forest.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy using TF-IDF features: {accuracy:.2f}')


Accuracy using TF-IDF features: 0.98
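
For intuition about how the two feature types differ, the sketch below computes TF-IDF weights by hand for three made-up tokenised messages. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults; the function `tfidf` itself is hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenised messages, for illustration only.
docs = [["free", "prize", "free"], ["call", "me"], ["free", "call"]]

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)  # these raw counts are the bag-of-words features
        weights = {term: c * idf[term] for term, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents ("me", "prize") receive a larger IDF than terms shared across documents ("free", "call"), whereas raw counts treat every occurrence equally.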


The target is predicted using `CountVectorizer` and `TfidfVectorizer` features with logistic regression; accuracy is slightly higher with `CountVectorizer`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


X = df['text']
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Using CountVectorizer
vectorizer_count = CountVectorizer(stop_words='english')
X_train_count = vectorizer_count.fit_transform(X_train)
X_test_count = vectorizer_count.transform(X_test)

# Using TfidfVectorizer
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)

# Logistic regression model
log_reg = LogisticRegression(max_iter=1000)

# Training and evaluation with CountVectorizer
log_reg.fit(X_train_count, y_train)
y_pred_count = log_reg.predict(X_test_count)
print(f'Accuracy with CountVectorizer: {accuracy_score(y_test, y_pred_count):.2f}')

# Training and evaluation with TfidfVectorizer
log_reg.fit(X_train_tfidf, y_train)
y_pred_tfidf = log_reg.predict(X_test_tfidf)
print(f'Accuracy with TfidfVectorizer: {accuracy_score(y_test, y_pred_tfidf):.2f}')


Accuracy with CountVectorizer: 0.99
Accuracy with TfidfVectorizer: 0.97


# Exercise 3.4

Increase and decrease the values of the parameters vector_size, window and negative, then predict the target.

Plot the different values of the parameters with the performance of the model.

Use a Random Forest classifier and a classification model of your choice, and justify your choice.

In [None]:
from sklearn.model_selection import train_test_split


X = df['text']
y = df['target'].map({'ham': 0, 'spam': 1})

# Split again into 70% training and 30% test, stratifying so that the class proportions in both sets are similar to those in the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)


X_train_vect = np.array([vectorize(text) for text in X_train])
X_test_vect = np.array([vectorize(text) for text in X_test])


print(f'X_train_vect shape: {X_train_vect.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test_vect shape: {X_test_vect.shape}')
print(f'y_test shape: {y_test.shape}')



X_train_vect shape: (3900, 100)
y_train shape: (3900,)
X_test_vect shape: (1672, 100)
y_test shape: (1672,)


In [None]:
# Random Forests cope well with large volumes of data, which makes them a reasonable choice for text classification, especially when the vector representations can be high-dimensional
# An SVM is used as the second classifier because it is effective in high-dimensional spaces and efficient for binary or multiclass classification
# Note: X_train_vectorized / X_test_vectorized are the CountVectorizer features from the earlier non-stratified split, so their rows are not aligned with this cell's y_train / y_test; the Word2Vec features built for this split are X_train_vect / X_test_vect
# Random Forest classifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf.fit(X_train_vectorized, y_train)
predictions_rf = clf_rf.predict(X_test_vectorized)
print("Random Forest classifier:")
print(classification_report(y_test, predictions_rf))

# SVM classifier
clf_svm = SVC(kernel='linear', random_state=42)
clf_svm.fit(X_train_vectorized, y_train)
predictions_svm = clf_svm.predict(X_test_vectorized)
print("SVM classifier:")
print(classification_report(y_test, predictions_svm))

# Random Forest has slightly higher spam precision (0.13 vs 0.11), but both classifiers perform at chance level here (macro-average recall of about 0.50)


Random Forest classifier:
              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1448
           1       0.13      0.04      0.06       224

    accuracy                           0.84      1672
   macro avg       0.50      0.50      0.48      1672
weighted avg       0.77      0.84      0.80      1672

SVM classifier:
              precision    recall  f1-score   support

           0       0.86      0.93      0.89      1448
           1       0.11      0.06      0.08       224

    accuracy                           0.81      1672
   macro avg       0.49      0.49      0.49      1672
weighted avg       0.76      0.81      0.78      1672



parameter variation: `vector_size`

In [None]:


# Function to vectorize the dataset with Word2Vec (mean of the word vectors of each sentence)
def vectorize(sentences, model):
    return np.array([np.mean([model.wv[word] for word in sentence if word in model.wv] or [np.zeros(model.vector_size)], axis=0) for sentence in sentences])

# Values for the vector_size parameter
vector_sizes = [50, 100, 150, 200, 250]

for size in vector_sizes:
    print(f"Training Word2Vec with vector_size={size}")

    model = Word2Vec(sentences=[sentence.split() for sentence in X_train], vector_size=size, window=5, min_count=1, workers=4)

    X_train_vect = vectorize([sentence.split() for sentence in X_train], model)
    X_test_vect = vectorize([sentence.split() for sentence in X_test], model)

    # Training and evaluation with Random Forest
    clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_rf.fit(X_train_vect, y_train)
    predictions_rf = clf_rf.predict(X_test_vect)
    print(f"Random Forest result with vector_size={size}:")
    print(classification_report(y_test, predictions_rf))

    # Training and evaluation with SVM
    clf_svm = SVC(kernel='linear', random_state=42)
    clf_svm.fit(X_train_vect, y_train)
    predictions_svm = clf_svm.predict(X_test_vect)
    print(f"SVM result with vector_size={size}:")
    print(classification_report(y_test, predictions_svm))

Training Word2Vec with vector_size=50
Random Forest result with vector_size=50:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      1448
           1       0.97      0.32      0.48       224

    accuracy                           0.91      1672
   macro avg       0.94      0.66      0.72      1672
weighted avg       0.91      0.91      0.89      1672

SVM result with vector_size=50:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672

Training Word2Vec with vector_size=100
Random Forest result with vector_size=100:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1448
           1       0.98      

parameter variation: `window`

In the `vector_size` experiment, Random Forest remains a better classifier than the support vector machine: for every vector size it achieves higher precision than the SVM. In addition, as the vector size increases, the Random Forest's precision on spam messages rises, reaching 100% when the vector size is 150.

In [None]:
# Parameter values to vary
window_sizes = [2, 5, 8, 10, 15]

for window_size in window_sizes:
    print(f"\nTraining Word2Vec with window_size={window_size}")

    model = Word2Vec(sentences=[sentence.split() for sentence in X_train], vector_size=100, window=window_size, min_count=1, workers=4)

    X_train_vect = vectorize([sentence.split() for sentence in X_train], model)
    X_test_vect = vectorize([sentence.split() for sentence in X_test], model)

    # Training and evaluating the Random Forest classifier
    clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_rf.fit(X_train_vect, y_train)
    predictions_rf = clf_rf.predict(X_test_vect)
    print(f"Random Forest result with window_size={window_size}:")
    print(classification_report(y_test, predictions_rf))

    # Training and evaluating the SVM classifier
    clf_svm = SVC(kernel='linear', random_state=42)
    clf_svm.fit(X_train_vect, y_train)
    predictions_svm = clf_svm.predict(X_test_vect)
    print(f"SVM result with window_size={window_size}:")
    print(classification_report(y_test, predictions_svm))
# Random Forest remains the better model with Word2Vec features; a window size of 2 gives the best spam precision (100%), although spam recall is only 0.36


Training Word2Vec with window_size=2
Random Forest result with window_size=2:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1448
           1       1.00      0.36      0.53       224

    accuracy                           0.91      1672
   macro avg       0.95      0.68      0.74      1672
weighted avg       0.92      0.91      0.90      1672

SVM result with window_size=2:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672


Training Word2Vec with window_size=5
Random Forest result with window_size=5:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1448
           1       0.98      0.38 

parameter variation: `negative`

In [None]:
# Values for the negative (number of negative samples) parameter
negative_values = [5, 10, 15, 20]

for negative in negative_values:
    print(f"\nTraining Word2Vec with negative={negative}")

    model = Word2Vec(sentences=[sentence.split() for sentence in X_train], vector_size=100, window=5, min_count=1, workers=4, negative=negative)

    X_train_vect = vectorize([sentence.split() for sentence in X_train], model)
    X_test_vect = vectorize([sentence.split() for sentence in X_test], model)

    # Training and evaluating the Random Forest classifier
    clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_rf.fit(X_train_vect, y_train)
    predictions_rf = clf_rf.predict(X_test_vect)
    print(f"Random Forest result with negative={negative}:")
    print(classification_report(y_test, predictions_rf))

    # Training and evaluating the SVM classifier
    clf_svm = SVC(kernel='linear', random_state=42)
    clf_svm.fit(X_train_vect, y_train)
    predictions_svm = clf_svm.predict(X_test_vect)
    print(f"SVM result with negative={negative}:")
    print(classification_report(y_test, predictions_svm))



Training Word2Vec with negative=5
Random Forest result with negative=5:
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      1448
           1       1.00      0.38      0.55       224

    accuracy                           0.92      1672
   macro avg       0.96      0.69      0.75      1672
weighted avg       0.92      0.92      0.90      1672

SVM result with negative=5:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672


Training Word2Vec with negative=10
Random Forest result with negative=10:
              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1448
           1       0.94      0.46      0.62       224

With fewer negative samples, the Random Forest predicts spam with higher precision (1.00 at `negative=5`); as `negative` grows, spam precision decreases (0.94 at `negative=10`), while spam recall rises from 0.38 to 0.46.

Running best parameters `vector_size, window and negative` (All at the same time)

In [None]:
# The best parameters for the Random Forest were taken to be vector_size=200, window=10 and negative=20; the SVM did not improve with parameter tuning
# Selected configuration
best_vector_size = 200
best_window = 10
best_negative = 20

model_best = Word2Vec(sentences=[sentence.split() for sentence in X_train], vector_size=best_vector_size, window=best_window, min_count=1, workers=4, negative=best_negative)


# Mean of the word vectors of each sentence (zero vector when no word is in the vocabulary)
def vectorize(sentences, model):
    return np.array([np.mean([model.wv[word] for word in sentence if word in model.wv] or [np.zeros(model.vector_size)], axis=0) for sentence in sentences])


X_train_vect_best = vectorize([sentence.split() for sentence in X_train], model_best)
X_test_vect_best = vectorize([sentence.split() for sentence in X_test], model_best)


clf_rf_best = RandomForestClassifier(n_estimators=100, random_state=42)
clf_rf_best.fit(X_train_vect_best, y_train)
predictions_rf_best = clf_rf_best.predict(X_test_vect_best)
print("Random Forest result with the best parameters:")
print(classification_report(y_test, predictions_rf_best))


clf_svm_best = SVC(kernel='linear', random_state=42)
clf_svm_best.fit(X_train_vect_best, y_train)
predictions_svm_best = clf_svm_best.predict(X_test_vect_best)
print("SVM result with the best parameters:")
print(classification_report(y_test, predictions_svm_best))


Random Forest result with the best parameters:
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1448
           1       0.89      0.53      0.66       224

    accuracy                           0.93      1672
   macro avg       0.91      0.76      0.81      1672
weighted avg       0.93      0.93      0.92      1672

SVM result with the best parameters:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672



Combining the best parameter from each experiment, the Random Forest, which was already the better model, still outperforms the SVM, with a spam precision of 89% and an accuracy of 93%. The SVM again predicts no message as spam (spam precision and recall of 0.00).
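
The F1 scores in the reports combine precision and recall as their harmonic mean. As a quick check, the sketch below recomputes the spam-class F1 from the rounded precision and recall printed above; the helper `f1` is hypothetical and defines the score as 0 when both inputs are 0.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spam-class values taken from the final classification reports.
rf_f1 = f1(0.89, 0.53)   # Random Forest
svm_f1 = f1(0.00, 0.00)  # SVM

print(f"Random Forest spam F1: {rf_f1:.2f}")
print(f"SVM spam F1: {svm_f1:.2f}")
```

The Random Forest value agrees with the 0.66 in the report, and it shows how a recall of 0.53 pulls the F1 well below the 0.89 precision: roughly half of the spam messages are still missed.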