
# Building a Spam Email Classifier Using Scikit-Learn

Spam has been a persistent problem for email users since the early days of electronic communication. With machine learning libraries such as Scikit-Learn, we can build a classifier that automatically filters out unwanted messages. This article demonstrates how to build a spam email classifier using a real-world text dataset.
We will cover feature extraction with TF-IDF, training three classification algorithms (Logistic Regression, Naive Bayes, and Support Vector Machines), and evaluating their performance.

**Dataset**

For this project, we'll use the Email Spam Detection Dataset from [Kaggle](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification). Each message is labeled as either "spam" or "ham" (non-spam), which makes the dataset well suited to training a binary spam classifier.

**Step 1: Importing Essential Libraries**

Our first step is to import the Python libraries used in this project:

In [104]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

**Step 2: Loading and Preprocessing the Dataset**

After downloading and extracting the dataset, we load it into a pandas DataFrame and preprocess it.

In [105]:
# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [106]:
# Drop the extra unnamed columns, which hold NaN values
# (`columns=` already selects the axis, so `axis=1` is not needed)
data = data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

# Rename columns for clarity:
data.columns = ['label', 'text']

# Convert to numerical values where 'ham' is 0 and 'spam' is 1
label_mapping = {'ham': 0, 'spam': 1}
data['label'] = data['label'].map(label_mapping)
data.head()

| | label | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
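
One thing worth checking after the label conversion is whether every label was covered by the dictionary, since `Series.map` returns NaN for any value missing from the mapping. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; the third label contains a typo on purpose
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'spm'],
    'text': ['see you at 5', 'win a free prize now', 'claim your reward'],
})

label_mapping = {'ham': 0, 'spam': 1}
mapped = toy['label'].map(label_mapping)

# Labels that are missing from the mapping become NaN
unmapped_rows = toy[mapped.isna()]
print(unmapped_rows)

# Class balance of the rows that mapped cleanly
class_counts = mapped.dropna().astype(int).value_counts()
print(class_counts)
```

Running the same two checks on the real `data` frame would show whether any rows need cleaning and how imbalanced the two classes are.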


In [107]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

**Step 3: Feature Extraction with TF-IDF Vectorization**

We will use TF-IDF vectorization to convert the text into numerical features. TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a common technique in natural language processing and information retrieval that turns a collection of text documents into numeric feature vectors, weighting each word by how often it appears in a document and down-weighting words that appear in many documents.
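
To make the idea concrete, here is a small sketch that applies `TfidfVectorizer` with its default settings to three made-up messages (the `docs` list is hypothetical). Within the first message, the word "free", which occurs in every message, should receive a lower weight than "prize", which occurs in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "free" appears in every document
docs = [
    "free prize now",
    "free lunch today",
    "free parking here",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Look up the column of each word through the learned vocabulary
vocab = vectorizer.vocabulary_
first_row = tfidf.toarray()[0]
weight_free = first_row[vocab['free']]
weight_prize = first_row[vocab['prize']]

print("weight of 'free' :", round(weight_free, 3))
print("weight of 'prize':", round(weight_prize, 3))
```

By default each row is also scaled to unit length, so the weights are comparable across messages of different lengths.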


In [108]:
# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training data and transform it
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data using the same vectorizer
X_test_tfidf = tfidf_vectorizer.transform(X_test)
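
The reason for fitting only on the training split is that the vocabulary and IDF weights are learned there and then reused unchanged on the test split. A minimal sketch with made-up messages (`train_docs` and `test_docs` are hypothetical) suggests what this means in practice: words never seen during fitting are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy splits
train_docs = ["win a free prize", "are we meeting today"]
test_docs = ["free prize inside", "totally unseen words"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses it unchanged

# Both matrices share the same columns (one per training-vocabulary word)
print(train_matrix.shape, test_matrix.shape)

# Number of non-zero features per test message
nnz_per_row = test_matrix.getnnz(axis=1)
print(nnz_per_row)
```

The second test message shares no words with the training vocabulary, so it is encoded as an all-zero row.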

**Step 4: Training and Evaluating Classification Algorithms**

Next, we will define and train three well-known classifiers: Logistic Regression, Naive Bayes, and Support Vector Machines (SVM).

In [110]:
# Create a dictionary to store models
models = {
    'Logistic Regression': LogisticRegression(),
    'Naive Bayes Classifier': MultinomialNB(),
    'Support Vector Classifier': SVC(),
}

# Train and evaluate each model
results = {}
for model_name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results[model_name] = {'Accuracy': accuracy, 'F1': f1}
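
Once a model is trained, new messages have to pass through the same vectorizer before prediction. One way to keep the two steps together is scikit-learn's `make_pipeline`. The sketch below uses a tiny made-up training set (`texts` and `labels` are hypothetical), so its predictions only illustrate the mechanics, not real-world quality:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free reward",
    "are we meeting for lunch today",
    "see you at the office",
    "call me after lunch",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then classifies it
spam_filter = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

new_messages = ["free prize claim", "see you at lunch"]
predictions = spam_filter.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(message, "->", "spam" if label == 1 else "ham")
```

Bundling the vectorizer with the classifier also avoids accidentally refitting TF-IDF on new data.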

It is time to evaluate our trained classifiers. We will use two metrics: accuracy and F1 score.

First, what is accuracy? Accuracy measures the proportion of a classification model's predictions that are correct, that is, the number of correctly classified instances divided by the total number of instances. It applies to both binary and multi-class problems and is typically expressed as a percentage, from 0% (every prediction wrong) to 100% (every prediction correct).
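
As a quick check of the definition, accuracy can be computed by hand as the share of predictions that equal the true labels and then compared with `accuracy_score`. The labels below are made up for illustration:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_hat = [0, 0, 1, 0, 0, 1, 1, 0]

# Fraction of positions where the prediction matches the true label
manual_accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```

Six of the eight predictions match, so both values come out at 0.75.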

Second, what is the F1 score? The F1 score is a widely used metric for classification models, particularly in binary classification. It combines precision and recall into a single number by taking their harmonic mean, so it is high only when both are high. The F1 score is especially useful for imbalanced datasets in which one class significantly outnumbers the other. In such cases accuracy can be misleading: a model that always predicts the majority class may achieve high accuracy while failing to identify the minority class at all.
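
The point about imbalanced data can be seen with a made-up example: a "model" that always predicts ham on a set of 9 ham messages and 1 spam message. The sketch assumes label 1 (spam) is the positive class, which is also the default for `f1_score`, and it recomputes F1 from precision and recall for a second hypothetical set of predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 9 ham (0) and 1 spam (1)
y_true = [0] * 9 + [1]

# A "model" that always predicts the majority class
always_ham = [0] * 10
acc_majority = accuracy_score(y_true, always_ham)
f1_majority = f1_score(y_true, always_ham, zero_division=0)
print("always ham  -> accuracy:", acc_majority, "F1:", f1_majority)

# A model that catches the spam message but raises one false alarm
y_hat = [0] * 8 + [1, 1]
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

# F1 is the harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
library_f1 = f1_score(y_true, y_hat)
print("second model -> precision:", precision, "recall:", recall, "F1:", manual_f1)
```

Both sets of predictions have the same accuracy (9 of 10 correct), yet the F1 score separates them clearly.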
                

In [111]:
results = pd.DataFrame(results)
results

| | Logistic Regression | Naive Bayes Classifier | Support Vector Classifier |
|---|---|---|---|
| Accuracy | 0.965919 | 0.962332 | 0.982063 |
| F1 | 0.856061 | 0.837209 | 0.928571 |
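
To read the table programmatically, the scores can be transposed so that each model is a row and then sorted by F1. The sketch below rebuilds the frame from the numbers reported above rather than retraining anything:

```python
import pandas as pd

# Scores copied from the results table above
results = pd.DataFrame({
    'Logistic Regression': {'Accuracy': 0.965919, 'F1': 0.856061},
    'Naive Bayes Classifier': {'Accuracy': 0.962332, 'F1': 0.837209},
    'Support Vector Classifier': {'Accuracy': 0.982063, 'F1': 0.928571},
})

# One row per model, best F1 first
ranked = results.T.sort_values('F1', ascending=False)
best_model = ranked.index[0]

print(ranked)
print("Best model by F1:", best_model)
```

Note how the three accuracy values are much closer together than the F1 values, which is what one would expect when ham messages dominate the test set.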


**Conclusion**

Building a spam email classifier with Scikit-Learn and real text data is a practical exercise. In our experiment, the Support Vector Classifier achieved the highest scores (accuracy of about 0.982 and F1 of about 0.929), followed by Logistic Regression and Naive Bayes. By following the steps in this article, you can build a spam filter of your own. Model performance can likely be improved further by tuning hyperparameters and exploring more advanced NLP techniques.
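
As one possible starting point for hyperparameter tuning, the sketch below runs `GridSearchCV` over a pipeline of `TfidfVectorizer` and `SVC`. The toy messages, the parameter grid, and the choice of F1 as the scoring metric are all assumptions for illustration; on the real dataset you would pass the training split and probably a wider grid with more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = spam, 0 = ham
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free reward today",
    "free entry to win cash",
    "are we meeting for lunch today",
    "see you at the office",
    "call me after lunch",
    "meeting moved to the office",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', SVC()),
])

# Step name + double underscore + parameter name addresses a pipeline step
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

# cv=2 only because the toy set is tiny
search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=2)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, which keeps the cross-validation scores free of leakage from the held-out fold.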