
# **GOLDEN PHASE - CODERSCAVE**

In this notebook I develop a spam email filter using Natural Language Processing (NLP) techniques and a machine learning classifier: TF-IDF features fed to a linear support vector machine. The goal is a system that accurately classifies emails as either spam or legitimate (ham) based on their content and linguistic features.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [15]:
data = pd.read_csv("/content/emails.csv")
data

| Unnamed: 0 | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |
| ... | ... | ... |
| 5723 | Subject: re : research and development charges... | 0 |
| 5724 | Subject: re : receipts from visit jim , than... | 0 |
| 5725 | Subject: re : enron case study update wow ! a... | 0 |
| 5726 | Subject: re : interest david , please , call... | 0 |


In [5]:
X = data['text']
y = data['spam']

In [6]:
# Hold out 20% of the emails for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
vectorizer = TfidfVectorizer(max_features=5000)


In [8]:
# Learn the vocabulary and IDF weights from the training text only
X_train_tfidf = vectorizer.fit_transform(X_train)

In [9]:
X_test_tfidf = vectorizer.transform(X_test)
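
The two cells above fit the vectorizer on the training text and then reuse it, unchanged, on the test text. A minimal sketch of what that means in practice, using a made-up toy corpus (the documents and the `toy_`/`train_`/`test_` names are illustrative assumptions, not part of the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, for illustration only
train_docs = ["win free money now", "meeting agenda for monday", "free prize win"]
test_docs = ["free lottery money"]  # "lottery" never appears in train_docs

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them; learns nothing new

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))                          # words seen during fitting
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print("lottery" in vocab)                     # unseen test words are dropped
```

Because the test matrix is built from the training vocabulary, words that only occur in test emails contribute nothing to the prediction, and no information from the test set leaks into the features.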

In [10]:
# Linear-kernel SVM; C=1.0 is scikit-learn's default regularization strength
svm_classifier = SVC(kernel='linear', C=1.0)
svm_classifier.fit(X_train_tfidf, y_train)
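
As an alternative sketch, not what this notebook does, the vectorizer and the classifier can be bundled into a single scikit-learn pipeline so that raw text goes in and labels come out. The toy emails and the name `spam_pipeline` below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy emails: 1 = spam, 0 = ham
toy_texts = [
    "win free money now", "claim your free prize today",
    "cheap pills free offer", "free cash win big",
    "meeting agenda for monday", "please review the attached report",
    "lunch at noon tomorrow", "project update and schedule",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same settings as the notebook, chained so fit/predict accept raw strings
spam_pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    SVC(kernel="linear", C=1.0),
)
spam_pipeline.fit(toy_texts, toy_labels)

toy_predictions = spam_pipeline.predict(toy_texts)
print(list(toy_predictions))
```

The design benefit is that the vectorizer can only ever be fitted together with the classifier, which removes the risk of accidentally refitting it on test data.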

In [11]:
predictions = svm_classifier.predict(X_test_tfidf)

In [12]:
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.9921465968586387


In [13]:
print("\nClassification Report:")
print(classification_report(y_test, predictions))


Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       856
           1       0.99      0.98      0.98       290

    accuracy                           0.99      1146
   macro avg       0.99      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146
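
The report above summarises errors as rates. A confusion matrix shows the raw counts behind those rates. In this notebook the call would be `confusion_matrix(y_test, predictions)`; the sketch below uses made-up labels so that it runs on its own, and its counts are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = spam, 0 = ham
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print("ham kept (tn):", tn)
print("ham flagged as spam (fp):", fp)
print("spam missed (fn):", fn)
print("spam caught (tp):", tp)
print("spam precision:", tp / (tp + fp))
print("spam recall:", tp / (tp + fn))
```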



In [14]:
new_email = ["Hey there, just checking in to see if you're available for a call today?"]
new_email_tfidf = vectorizer.transform(new_email)
prediction = svm_classifier.predict(new_email_tfidf)
if prediction[0] == 1:
    print("This email is classified as spam.")
else:
    print("This email is classified as ham.")

This email is classified as ham.
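
The single-email check above can be generalised to a batch. The sketch below is one possible helper, not part of the original notebook: `classify_emails` is a hypothetical name, and the tiny training set exists only so the example is self-contained. It also reports the SVM decision score, where a positive value leans towards class 1 (spam) for a binary `SVC`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def classify_emails(texts, vectorizer, classifier):
    """Return (label, score) pairs for raw email texts."""
    features = vectorizer.transform(texts)           # reuse the fitted vocabulary
    labels = classifier.predict(features)
    scores = classifier.decision_function(features)  # signed distance-like score
    return [
        ("spam" if label == 1 else "ham", float(score))
        for label, score in zip(labels, scores)
    ]

# Hypothetical toy training data: 1 = spam, 0 = ham
toy_texts = [
    "win free money now", "free prize claim today", "cheap free offer",
    "meeting agenda for monday", "please review the attached report",
    "lunch at noon tomorrow",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_classifier = SVC(kernel="linear", C=1.0)
toy_classifier.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

results = classify_emails(
    ["free prize money", "agenda for the meeting"], toy_vectorizer, toy_classifier
)
for label, score in results:
    print(label, round(score, 3))
```

With the notebook's own objects, the call would be `classify_emails(new_email, vectorizer, svm_classifier)`.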


# **CONCLUSION**

**High Precision and Recall**:

Precision is 0.99 for both classes, and recall is 1.00 for ham and 0.98 for spam on the 1,146-email test set. In practice, the model rarely flags a legitimate email as spam, and it catches nearly all spam, with roughly 2% of spam messages slipping through as ham.

**Excellent F1-scores**:

The F1-score, the harmonic mean of precision and recall, is 0.99 for ham and 0.98 for spam. This indicates that the model balances false positives (legitimate emails misclassified as spam) against false negatives (spam emails misclassified as legitimate) rather than trading one for the other.
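
As a quick check on how F1 relates to the other two columns, the sketch below recomputes it from the rounded precision and recall values in the classification report. `f1_from` is a hypothetical helper name, and because the inputs are already rounded to two decimals the results are approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report
ham_f1 = f1_from(0.99, 1.00)
spam_f1 = f1_from(0.99, 0.98)

print(round(ham_f1, 2), round(spam_f1, 2))
```

The harmonic mean is pulled towards the smaller of its two inputs, so a high F1 requires both precision and recall to be high.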

**High Accuracy**:

Overall accuracy is 99.2%, which corresponds to 9 misclassified emails out of 1,146 in the test set. On this held-out split the model reliably distinguishes spam from legitimate email; these figures describe a single split of this dataset.
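
To put the accuracy figure in context, the sketch below compares it with a trivial baseline that always predicts the majority class (ham), using only the support counts and accuracy printed earlier in the notebook. The variable names are illustrative.

```python
# Figures copied from the notebook's outputs
test_size = 1146
ham_support = 856
spam_support = 290
reported_accuracy = 0.9921465968586387

# Accuracy of always predicting the larger class
baseline_accuracy = max(ham_support, spam_support) / test_size

# Number of test emails the model got wrong
misclassified = round((1 - reported_accuracy) * test_size)

print(f"majority-class baseline: {baseline_accuracy:.3f}")
print(f"model accuracy:          {reported_accuracy:.3f}")
print(f"misclassified emails:    {misclassified}")
```

Since about three quarters of the test emails are ham, accuracy alone would look respectable even for a model that never flags spam, which is why the per-class precision and recall above matter.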