# Spam Classification using AdaBoost and Naive Bayes
This assignment guides you through building a spam classifier using AdaBoost and Naive Bayes.
You will write pseudocode, preprocess data, train models, evaluate accuracy, and compare results.

## 1. Assignment Objectives
- Load and preprocess text data
- Implement pseudocode for both AdaBoost and Naive Bayes
- Train models
- Evaluate performance
- Compare results

## 2. Pseudocode: Naive Bayes Classifier
```
START
INPUT: Training text data with labels
PREPROCESS: Clean text → tokenize → remove stopwords → convert to vectors
FOR each class:
    CALCULATE prior probability P(class)
    FOR each word in vocabulary:
        CALCULATE smoothed likelihood P(word | class)
STORE log priors and log likelihoods
DURING prediction:
    FOR each class:
        COMPUTE log P(class) + sum of log P(word | class) over words in the text
    SELECT class with highest log probability
END
```
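The pseudocode above can be sketched in plain Python as follows. This is a minimal illustration rather than the scikit-learn implementation: the helper names (`tokenize`, `train_naive_bayes`, `predict_naive_bayes`), the toy messages, and the add-one (Laplace) smoothing constant `alpha=1.0` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified preprocessing (assumption): lowercase and split on whitespace
    return text.lower().split()


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(labels)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        # Denominator: all word occurrences in class c plus smoothing mass
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Hypothetical toy training set
train_texts = [
    "win free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes([tokenize(t) for t in train_texts], train_labels)
pred_spam = predict_naive_bayes(model, tokenize("free cash prize"))
pred_ham = predict_naive_bayes(model, tokenize("lunch at home"))
print(pred_spam, pred_ham)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.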

## 3. Pseudocode: AdaBoost Classifier
```
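START
INPUT: Preprocessed feature vectors with labels
INITIALIZE: Equal weights for all samples
FOR t = 1 to T (number of weak learners):
    Train weak learner (e.g., decision stump) on the weighted samples
    Compute weighted error of the weak learner
    Compute alpha (learner weight) from the error
    UPDATE sample weights: increase for misclassified samples, decrease for correct ones
    NORMALIZE sample weights to sum to 1
END FOR
FINAL prediction = sign of alpha-weighted sum of weak learner predictions
RETURN predicted class
END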
```
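As a rough illustration of the loop above, here is a small from-scratch sketch of discrete AdaBoost with decision stumps. It assumes labels are encoded as +1 (spam) and -1 (ham); the function names, the toy feature matrix, and the choice of `n_rounds=3` are hypothetical and chosen only to keep the example short. scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity


def adaboost_fit(X, y, n_rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # equal initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        # Clip the error away from 0 and 1 so alpha stays finite
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight
        w = [
            wi * math.exp(-alpha * yi * stump_predict(stump, row))
            for wi, yi, row in zip(w, y, X)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy features: two word counts per message
X = [[2, 0], [1, 0], [3, 1], [0, 1], [0, 2], [1, 2]]
y = [1, 1, 1, -1, -1, -1]  # +1 = spam, -1 = ham (assumed encoding)

ensemble = adaboost_fit(X, y, n_rounds=3)
train_preds = [adaboost_predict(ensemble, row) for row in X]
print(train_preds)
```

No single stump separates this toy set, so the example shows how reweighting lets later stumps concentrate on the samples that earlier ones got wrong.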

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report


In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ashfakyeafi/spam-email-classification")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/ashfakyeafi/spam-email-classification/versions/3





In [None]:
import os
# Show all files in the dataset folder
print("Files:", os.listdir(path))

# Load the main dataset
data_path = os.path.join(path, "email.csv")
data = pd.read_csv(data_path)

data.head()

Files: ['email.csv']


| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
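# 6. Preprocessing
X = data['Message']   # raw message text
y = data['Category']  # labels: 'ham' or 'spam'

# Convert messages to TF-IDF vectors.
# Note: the vectorizer is fit on the full dataset before the split,
# so its vocabulary and IDF weights also reflect the test messages.
vectorizer = TfidfVectorizer()
X_vec = vectorizer.fit_transform(X)

# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)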
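
One possible variation on the preprocessing step is to split the raw text first and fit the vectorizer on the training messages only, so that no test-set statistics reach the model. The sketch below uses a scikit-learn `Pipeline` for this; the toy `messages` and `labels` lists are hypothetical stand-ins for `data['Message']` and `data['Category']`, and the scores reported in this notebook were produced by the cell above, not by this variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real dataset columns
messages = [
    "win free prize now",
    "free entry win cash",
    "claim your free cash prize",
    "urgent win cash now",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me after the meeting",
    "lunch at home today",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text before any vectorization; stratify keeps class balance
text_train, text_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

# At prediction time the same fitted vocabulary is applied to the test text
predictions = model.predict(text_test)
print(list(predictions))
```

Keeping the vectorizer inside the pipeline also means the same object can be passed to cross-validation helpers without refitting by hand.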

In [None]:
# 7. Train Naive Bayes Classifier
# X_train and X_test are already TF-IDF vectors from the preprocessing step (6).
nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred_nb = nb.predict(X_test)

print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb, zero_division=0))

Naive Bayes Accuracy: 0.9605381165919282
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       958
        spam       1.00      0.72      0.84       157

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115



In [None]:
# 8. Train AdaBoost Classifier
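# Weak learner: a decision stump (depth-1 tree)
weak_learner = DecisionTreeClassifier(max_depth=1)
# `estimator` is the keyword in scikit-learn >= 1.2 (older releases used `base_estimator`)
ada = AdaBoostClassifier(estimator=weak_learner, n_estimators=50)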
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
print('AdaBoost Accuracy:', accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))

AdaBoost Accuracy: 0.9443946188340807
              precision    recall  f1-score   support

         ham       0.94      0.99      0.97       958
        spam       0.94      0.64      0.77       157

    accuracy                           0.94      1115
   macro avg       0.94      0.82      0.87      1115
weighted avg       0.94      0.94      0.94      1115



## 9. Conclusion
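- Compare Naive Bayes vs AdaBoost performance: on this split, Naive Bayes reaches 0.96 accuracy with 0.72 spam recall, while AdaBoost reaches 0.94 accuracy with 0.64 spam recall
- Discuss errors and improvements: both models classify nearly all ham correctly but miss a substantial share of spam messages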
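
To support the comparison, the spam-class metrics can be computed side by side from each model's predictions. The helper below is a minimal sketch: `binary_metrics` is a hypothetical name, and the label lists are made-up toy values, not results from this notebook. In the notebook itself you would pass `y_test` together with `y_pred_nb` or `y_pred_ada`.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Precision, recall, and F1 for one class, plus the raw error counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": fp,  # ham flagged as spam
        "false_negatives": fn,  # spam that slipped through
    }


# Hypothetical toy labels and predictions for two models
y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
pred_a = ["spam", "spam", "ham", "ham", "ham", "ham"]
pred_b = ["spam", "ham", "ham", "ham", "ham", "spam"]

metrics_a = binary_metrics(y_true, pred_a)
metrics_b = binary_metrics(y_true, pred_b)
for name, m in (("model A", metrics_a), ("model B", metrics_b)):
    print(name, m)
```

Looking at false negatives and false positives separately is useful here because the two error types have different costs: missed spam is an annoyance, while ham flagged as spam can hide a legitimate message.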