## **Name:** Affan Zulfiqar
## **Reg No:** B22F0144AI050
## **Course:** ANN LAB (01)

# **Lab Task 1**
**Scenario:**

You are a data scientist working at a tech company that provides email
filtering solutions to its users. Your team has been tasked with enhancing
the spam detection system. You have access to a labeled email dataset
containing spam and non-spam emails. Your goal is to preprocess this
dataset, develop an SVM-based classification model, and fine-tune it for
optimal performance. The final model will predict whether an incoming
email is spam or not, based on its content.

To achieve this, you will:

1. Preprocess email content to extract meaningful numerical features.
2. Train and evaluate the model using various SVM kernels (e.g., linear, RBF, polynomial).
3. Tune hyperparameters to maximize accuracy.

Importing Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib


Load and split dataset into features and labels

In [None]:
emails_df = pd.read_csv('emails.csv')

X = emails_df['text']
y = emails_df['spam']
print(emails_df.head())

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1


Split into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
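
The test set reported later contains 856 non-spam and 290 spam emails, so the classes are not balanced. One optional variation, not used in this notebook, is to pass `stratify=y` so that both partitions keep the same class ratio. The sketch below uses made-up labels (`y_toy`, `X_toy` are placeholders, not the real dataset) only to show the effect. Applying it to the real split would change the partition and therefore the numbers reported below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels standing in for the 'spam' column: 80 non-spam, 20 spam (assumed data).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

# stratify=y_toy asks for the same class ratio in the train and test partitions.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr)  # [non-spam count, spam count] in the training part
test_counts = np.bincount(y_te)   # same for the test part
print("train:", train_counts, "test:", test_counts)
```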

Preprocess the text data

In [None]:
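# Keep the 5,000 most frequent terms and drop English stop words.
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
# Learn the vocabulary and IDF weights from the training split only, then reuse them on the test split.
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)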
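
To make the fit/transform distinction concrete, here is a minimal sketch on a hypothetical three-email corpus (the texts and the name `toy_vectorizer` are invented for illustration and are not taken from `emails.csv`). It shows that words unseen during fitting are simply ignored at transform time, and that both matrices share the same columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from emails.csv.
train_docs = [
    "win a free prize now",
    "free vacation prize inside",
    "lunch meeting tomorrow at noon",
]
test_docs = ["free lunch tomorrow", "quarterly budget review"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))   # terms learned from the training docs
print(test_matrix.toarray().round(2))       # second row is all zeros: no known terms
```

The second test document contains only words that were never seen during fitting, so its row is entirely zero. The classifier has no evidence for such an email, which is one reason to fit the vectorizer on a reasonably large training set.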

SVM Model

In [None]:
print("Training basic SVM model...")
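# Linear-kernel baseline. random_state is only used when probability=True, so it has no effect here.
basic_model = SVC(kernel='linear', C=1, random_state=42)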
basic_model.fit(X_train_tfidf, y_train)
basic_y_pred = basic_model.predict(X_test_tfidf)

Training basic SVM model...


Evaluate the model

In [None]:
print("Basic SVM Model Evaluation:")
print("Accuracy:", accuracy_score(y_test, basic_y_pred))
print("Classification Report:\n", classification_report(y_test, basic_y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, basic_y_pred))

Basic SVM Model Evaluation:
Accuracy: 0.9921465968586387
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       856
           1       0.99      0.98      0.98       290

    accuracy                           0.99      1146
   macro avg       0.99      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146

Confusion Matrix:
 [[854   2]
 [  7 283]]
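
The headline numbers in the report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes in sorted order), so the matrix reads `[[TN, FP], [FN, TP]]` with spam as the positive class.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 854, 2   # true non-spam: correctly kept, wrongly flagged as spam
fn, tp = 7, 283   # true spam: missed, correctly caught

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_spam = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall_spam = tp / (tp + fn)      # of all real spam, how much was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision_spam:.4f}")
print(f"recall    = {recall_spam:.4f}")
print(f"f1        = {f1_spam:.4f}")
```

Rounded to two decimals these agree with the class `1` row of the classification report. In plain terms, the baseline misses 7 spam emails and wrongly flags 2 legitimate ones out of 1146.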


Saving Basic Model

In [None]:
joblib.dump(basic_model, 'svm_basic_model.pkl')
print("Basic model saved as 'svm_basic_model.pkl'.")

Basic model saved as 'svm_basic_model.pkl'.
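
Note that the saved file contains only the classifier. To score new emails in another session, the fitted `TfidfVectorizer` is needed as well, because the SVM expects the same feature columns it was trained on. One possible approach, sketched here on hypothetical toy data with a made-up file name (`svm_pipeline_demo.pkl`), is to wrap both steps in a scikit-learn pipeline and persist that single object.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the notebook itself trains on emails.csv.
texts = [
    "win a free prize now", "claim your free vacation", "cheap software cds offer",
    "lunch tomorrow at noon", "meeting notes attached", "see you at the office",
]
labels = [1, 1, 1, 0, 0, 0]

# One object holds both the fitted vectorizer and the fitted classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SVC(kernel="linear", C=1),
)
spam_pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_pipeline_demo.pkl")  # hypothetical file name
    joblib.dump(spam_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly, with no separate transform step.
new_emails = ["free prize inside", "notes from the meeting"]
original_pred = spam_pipeline.predict(new_emails)
restored_pred = restored.predict(new_emails)
print(restored_pred)
```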


Hyperparameter tuning with GridSearchCV

In [None]:
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [1, 0.1, 0.01],
    'kernel': ['linear', 'rbf', 'poly']
}

print("Starting hyperparameter tuning...")
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=3, verbose=2)
grid.fit(X_train_tfidf, y_train)

Starting hyperparameter tuning...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   4.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   5.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   4.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   5.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   6.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   5.3s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=  11.4s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   9.7s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   9.8s
[CV] END ....................C=0.1, gamma=0.1, kernel=linear; total time=   6.2s
[CV] END ....................C=0.1, gamma=0.1, kernel=linear; total time=   4.1s
[CV] END .....
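
The log shows the linear kernel being fitted again for each `gamma` value, although `SVC` only uses `gamma` for the `rbf`, `poly`, and `sigmoid` kernels. Those repeated linear fits are redundant. `GridSearchCV` also accepts a list of grids, so a possible alternative (a sketch, not what this notebook ran) is to vary `gamma` only where it matters. `ParameterGrid` is used here just to count candidates without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the notebook: every kernel is crossed with every gamma.
flat_grid = {
    "C": [0.1, 1, 10],
    "gamma": [1, 0.1, 0.01],
    "kernel": ["linear", "rbf", "poly"],
}

# Alternative: a list of grids, so gamma is only varied for kernels that use it.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": [1, 0.1, 0.01]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
cv_folds = 3  # same number of folds as the notebook's GridSearchCV call
print("flat grid :", n_flat, "candidates,", n_flat * cv_folds, "fits")
print("split grid:", n_split, "candidates,", n_split * cv_folds, "fits")
```

Passing `split_grid` as `param_grid` would search the same distinct configurations with fewer fits.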

Best model from grid search

In [None]:
best_model = grid.best_estimator_
best_y_pred = best_model.predict(X_test_tfidf)

Evaluate the tuned model

In [None]:
print("Tuned SVM Model Evaluation:")
print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, best_y_pred))
print("Classification Report:\n", classification_report(y_test, best_y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, best_y_pred))

Tuned SVM Model Evaluation:
Best Parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
Accuracy: 0.9921465968586387
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       856
           1       0.99      0.98      0.98       290

    accuracy                           0.99      1146
   macro avg       0.99      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146

Confusion Matrix:
 [[853   3]
 [  6 284]]
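
The tuned model reports exactly the same accuracy as the baseline. A quick comparison of the two confusion matrices explains why. The helper `summarize` below is a hypothetical name introduced only for this check.

```python
# Confusion matrices copied from the two evaluation outputs in this notebook.
basic = [[854, 2], [7, 283]]
tuned = [[853, 3], [6, 284]]


def summarize(cm):
    """Return (correct, false_positives, false_negatives) for a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    return tn + tp, fp, fn


basic_correct, basic_fp, basic_fn = summarize(basic)
tuned_correct, tuned_fp, tuned_fn = summarize(tuned)
total = sum(map(sum, basic))

print(f"basic: {basic_correct}/{total} correct, FP={basic_fp}, FN={basic_fn}")
print(f"tuned: {tuned_correct}/{total} correct, FP={tuned_fp}, FN={tuned_fn}")
```

Both models classify the same number of test emails correctly. The tuned RBF model catches one more spam email and flags one more legitimate email, so on this test split the tuning changes the kind of error rather than the error count.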


Save the best model

In [None]:
joblib.dump(best_model, 'svm_best_model.pkl')
print("Tuned model saved as 'svm_best_model.pkl'.")

Tuned model saved as 'svm_best_model.pkl'.


Email Predictions:

In [None]:
emails_to_predict = [
    "Congratulations! You've won a free vacation. Click here to claim your prize!",
    "Hey, just checking if we're still on for lunch tomorrow at 1 PM."
]
emails_tfidf = vectorizer.transform(emails_to_predict)
predictions = best_model.predict(emails_tfidf)

for email, prediction in zip(emails_to_predict, predictions):
    print(f"\nEmail: {email}")
    print("Prediction:", "Spam" if prediction == 1 else "Not Spam")


Email: Congratulations! You've won a free vacation. Click here to claim your prize!
Prediction: Spam

Email: Hey, just checking if we're still on for lunch tomorrow at 1 PM.
Prediction: Not Spam
