<a href="https://colab.research.google.com/github/Sayandip2023/CBTCIP/blob/main/e_mail_Spam_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROJECT

We have the text of nearly 5,600 e-mails, each labelled as spam or ham. We will train a model on this data and then observe its predictions on unseen e-mail texts.

## Importing the necessary libraries
We will do all the imports required in this project.

In [28]:
import pandas as pd

import numpy as np

import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Loading the dataset
We will load the dataset from the CSV file into a dataframe and keep only the label column (`v1`) and the message column (`v2`).

In [9]:
path="/content/Dataset.csv"
df = pd.read_csv(path)
df = df[['v1', 'v2']]
print(df.head())

     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


## Preprocessing the dataset
We will clean each message before feature extraction: punctuation is removed, the text is lowercased, and only purely alphabetic tokens are kept.

In [10]:
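def preprocess_text(text):
    # Remove every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    tokens = text.split()
    # Strip any remaining punctuation; this also removes underscores, which \w keeps
    table = str.maketrans('', '', string.punctuation)
    tokens = [word.translate(table) for word in tokens]
    # Keep purely alphabetic tokens only (numbers and mixed tokens are dropped)
    tokens = [word for word in tokens if word.isalpha()]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text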

df['preprocessed_text'] = df['v2'].apply(preprocess_text)
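
To make the effect of the cleaning steps concrete, here is a small sketch that applies the same function to the sample message used in the prediction section of this notebook. The function is repeated only so that the snippet runs on its own.

```python
import re
import string


def preprocess_text(text):
    """Same cleaning steps as in the notebook cell, repeated for a standalone run."""
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation such as '$', ',', '!' and '.'
    text = text.lower()
    tokens = text.split()
    table = str.maketrans('', '', string.punctuation)
    tokens = [word.translate(table) for word in tokens]
    tokens = [word for word in tokens if word.isalpha()]  # drops '10000' and '5551234'
    return ' '.join(tokens)


sample = ("You have won a cash prize of $10,000! Claim your prize now by "
          "calling 555-1234 or visit www.prizewinner.com.")
cleaned = preprocess_text(sample)
print(cleaned)
# you have won a cash prize of claim your prize now by calling or visit wwwprizewinnercom
```

Note that the amount and the phone number disappear entirely, and the URL collapses into the single token `wwwprizewinnercom`, because its dots are removed before tokenization.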

## Extracting the features from the Dataset
Feature extraction converts the cleaned text into numerical TF-IDF features that the model can learn from.

In [11]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(df['preprocessed_text'])
y = df['v1']
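
As a rough illustration of what the TF-IDF weights represent, the sketch below computes them by hand for a tiny made-up corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row); the corpus and helper names are hypothetical and not part of the project.

```python
import math
from collections import Counter

# Made-up corpus, only for illustration; not the project data.
docs = ["free prize call now", "call me now", "see you at lunch"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency (assumed default: smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1


def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalize so that every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in only one document ("free", "prize") receive a higher weight than terms shared across documents ("call", "now"), which is why rare, spam-specific words tend to carry more signal.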

## Splitting the dataset
We will keep 80% of the data in the training set and the remaining 20% in the test set.

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
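
The classes are imbalanced (the test report in this notebook shows 150 spam messages out of 1115), so one option worth considering is a stratified split, which keeps the ham/spam ratio the same in both sets. The following is a minimal sketch on made-up labels, not the split used for the results in this notebook; it relies on the `stratify` parameter of `train_test_split`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up messages and labels with a ham/spam imbalance, only to illustrate the idea.
messages = [f"message {i}" for i in range(100)]
labels = ["ham"] * 80 + ["spam"] * 20

_, _, y_train_demo, y_test_demo = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both sets keep the original 80/20 ham-to-spam ratio.
print("train:", Counter(y_train_demo))
print("test :", Counter(y_test_demo))
```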

## Finding the best parameters
We will use a grid search with 5-fold cross-validation to find the best hyperparameters (`C`, `gamma` and `kernel`) for the SVM.

In [21]:
param_grid = {'C': [0.01, 0.1, 1, 10], 'gamma': ['scale', 'auto'], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(x_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
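
In this notebook the vectorizer is fitted on the whole dataset before the split, so the IDF statistics also see the test messages. A common alternative is to wrap the vectorizer and the classifier in a `Pipeline`, so that the vectorizer is refitted on the training folds only during the search. The sketch below shows the idea on a tiny made-up corpus; the texts, step names and the reduced grid are assumptions for illustration and will not reproduce the results reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, only to keep the sketch runnable; not the project data.
texts = [
    "win a free prize now", "free cash claim your prize",
    "call now to win cash", "urgent claim your free reward",
    "are we meeting for lunch", "see you at home tonight",
    "can you send me the notes", "thanks for the help today",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refitted on each training fold
    ("svm", SVC()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

# cv=2 only because the toy corpus is so small.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
# The fitted pipeline accepts raw text directly.
demo_prediction = search.predict(["claim your free prize"])
print(demo_prediction)
```

A side benefit of this design is that prediction takes raw text, so the vectorizer and the classifier cannot get out of sync.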


## Training the model
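The grid search reports C = 10 and a linear kernel as the best combination. With a linear kernel the gamma parameter has no effect, so we do not set it. The model below is a linear SVM trained with C = 1, a more strongly regularized setting than the one the search selected.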

In [22]:
# Named svm_model because the evaluation and prediction cells refer to it by that name
svm_model = SVC(kernel='linear', C=1)
svm_model.fit(x_train, y_train)

## Evaluating the model
Now that our model is trained, we will evaluate it on the test set using accuracy, precision, recall and F1-score.

In [26]:
y_pred = svm_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['ham', 'spam'])
print("SVM Model Evaluation:")
print("Accuracy :", accuracy)
print(report)

SVM Model Evaluation:
Accuracy : 0.9757847533632287
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99       965
        spam       1.00      0.82      0.90       150

    accuracy                           0.98      1115
   macro avg       0.99      0.91      0.94      1115
weighted avg       0.98      0.98      0.97      1115
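
To see how the numbers in the report relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the report (965 ham and 150 spam test messages, 1088 of 1115 predictions correct, spam precision of 1.00); they are an assumption and not an output of the notebook.

```python
# Counts inferred from the classification report (an assumption, not notebook output):
# tn = ham predicted as ham, fp = ham predicted as spam,
# fn = spam predicted as ham, tp = spam predicted as spam.
tn, fp, fn, tp = 965, 0, 27, 123

accuracy = (tp + tn) / (tp + tn + fp + fn)

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# For the ham class the roles of the counts are mirrored.
ham_precision = tn / (tn + fn)
ham_recall = tn / (tn + fp)
ham_f1 = 2 * ham_precision * ham_recall / (ham_precision + ham_recall)

print(f"accuracy: {accuracy:.4f}")
print(f"spam: precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
print(f"ham : precision={ham_precision:.2f} recall={ham_recall:.2f} f1={ham_f1:.2f}")
```

Read this way, the report says that the model almost never flags a legitimate message as spam, but lets roughly one in six spam messages through.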



## Cross Validation
We will evaluate the best estimator found by the grid search using 5-fold cross-validation on the full dataset.

In [29]:
best_svm_classifier = grid_search.best_estimator_
cv_scores = cross_val_score(best_svm_classifier, x, y, cv=5)
print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Accuracy:", np.mean(cv_scores))

Cross-Validation Scores: [0.9793722  0.97399103 0.97666068 0.98025135 0.97935368]
Mean Cross-Validation Accuracy: 0.977925787571149
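
Alongside the mean, it can help to report how much the fold scores vary. The following sketch uses the rounded fold scores printed above and the standard library's `statistics` module; the variable names are hypothetical.

```python
import statistics

# Fold scores copied (rounded) from the cross-validation output above.
fold_scores = [0.9793722, 0.97399103, 0.97666068, 0.98025135, 0.97935368]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"Mean accuracy: {mean_score:.4f} +/- {std_score:.4f}")
print(f"Range: {min(fold_scores):.4f} to {max(fold_scores):.4f}")
```

A small spread across folds suggests that the accuracy does not depend strongly on which messages end up in the validation fold.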


## Predicting with the model
We will now pass an unseen e-mail text to our trained model and observe its prediction.

In [33]:
text = "You have won a cash prize of $10,000! Claim your prize now by calling 555-1234 or visit www.prizewinner.com."
preprocessed_random_text = preprocess_text(text)
tfidf_features = vectorizer.transform([preprocessed_random_text])
prediction = svm_model.predict(tfidf_features)
print("The text is detected to be", prediction[0],".")

The text is detected to be spam .
