
# Spam Filter Creation using Naive Bayes

In this exercise, you will create a spam filter using the Naive Bayes classifier. This exercise will help you understand how to preprocess text data, implement a Naive Bayes model, and evaluate its performance.

## Instructions

Complete the code in the sections marked `# TODO`. Make sure to run each cell in order to see the output of your code.



## Step 1: Data Loading and Preprocessing

First, we need to load and preprocess our dataset. The dataset consists of emails labeled as either 'spam' or 'non-spam' (ham).


### Required Libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Read Dataset

In [2]:
# A forward slash avoids the invalid '\s' escape sequence and works on every platform
dataset = pd.read_csv('Dataset/spam_ham_dataset.csv')

### Pre-Process Data

In [3]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Preprocessing steps:
1. Convert the text to lowercase
2. Remove punctuation
3. Remove stopwords

In [4]:
STOP_WORDS = set(stopwords.words('english'))  # build the set once, not once per email

def preprocess_text(text):
    """Lowercase the text, strip punctuation, and drop English stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(words)

In [5]:
dataset['text'] = dataset['text'].apply(preprocess_text)

dataset = dataset.drop(['Unnamed: 0', 'label'], axis=1)

In [6]:
dataset.head()

| | text | label_num |
|---|---|---|
| 0 | subject enron methanol meter 988291 follow not... | 0 |
| 1 | subject hpl nom january 9 2001 see attached fi... | 0 |
| 2 | subject neon retreat ho ho ho around wonderful... | 0 |
| 3 | subject photoshop windows office cheap main tr... | 1 |
| 4 | subject indian springs deal book teco pvr reve... | 0 |



## Step 2: Feature Extraction

Next, we convert the text into numerical features, using either TF-IDF weights or raw bag-of-words counts.
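To make the TF-IDF weighting concrete, here is a minimal plain-Python sketch on a toy corpus. It assumes the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default, but it tokenizes by whitespace only, so it is an illustration rather than a drop-in replacement. The function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency: rarer words get larger weights.
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

toy_docs = ["cheap meds cheap", "meeting schedule attached", "cheap meeting"]
vocab, rows = tfidf(toy_docs)
for word, weight in zip(vocab, rows[0]):
    print(f"{word:10s} {weight:.3f}")
```

Words that do not occur in a document get weight 0, which is why the printed feature matrix is mostly zeros.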


In [7]:
def extract_features(data, mode="tf-idf"):
    """Vectorize data['text'] as TF-IDF weights or bag-of-words counts (sparse matrix)."""
    if mode == "tf-idf":
        vectorizer = TfidfVectorizer()
    elif mode == "bag-of-words":
        vectorizer = CountVectorizer()
    else:
        raise ValueError("Invalid mode. Choose 'tf-idf' or 'bag-of-words'.")

    features = vectorizer.fit_transform(data['text'])
    return features

In [8]:
mode = "tf-idf"
features = extract_features(dataset, mode=mode)
print(features.toarray())

[[0.10400081 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


### Train Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(features, dataset['label_num'].values, test_size=0.2, random_state=42)
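
Note that `extract_features` fits the vectorizer on the whole dataset before the split, so the vocabulary (and the IDF weights) already reflect the test emails. A stricter setup would fit on the training texts only and reuse that fitted vocabulary for the test texts; with scikit-learn this means calling `fit_transform` on the training text and `transform` on the test text. The sketch below shows the idea with a plain-Python stand-in; the helper names and toy texts are hypothetical.

```python
from collections import Counter

def fit_vocabulary(train_texts):
    """Build a word -> column index map from the training texts only."""
    words = sorted({w for text in train_texts for w in text.split()})
    return {w: i for i, w in enumerate(words)}

def to_counts(texts, vocabulary):
    """Map texts to count vectors; words unseen during fitting are dropped."""
    rows = []
    for text in texts:
        counts = Counter(w for w in text.split() if w in vocabulary)
        rows.append([counts[w] for w in vocabulary])
    return rows

train_texts = ["cheap meds online", "meeting schedule attached"]
test_texts = ["cheap pills online"]  # "pills" never appears in training

vocabulary = fit_vocabulary(train_texts)
X_train_toy = to_counts(train_texts, vocabulary)
X_test_toy = to_counts(test_texts, vocabulary)
print(X_test_toy)
```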


## Step 3: Model Training

Next, implement and train the Naive Bayes classifier using the features extracted in the previous step.
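
For reference, the quantities the classifier below computes can be written as follows. The notation is introduced here for illustration: $N_c$ is the number of training emails in class $c$, $N$ the total number of training emails, $x_{d,w}$ the feature value of word $w$ in email $d$, and $|V|$ the vocabulary size (`num_words`).

$$
P(c) = \frac{N_c}{N}, \qquad
P(w \mid c) = \frac{1 + \sum_{d \in c} x_{d,w}}{|V| + \sum_{d \in c} \sum_{w' \in V} x_{d,w'}}
$$

$$
\hat{c}(d) = \arg\max_{c} \Big[ \log P(c) + \sum_{w \,:\, x_{d,w} > 0} \log P(w \mid c) \Big]
$$

The $+1$ and $+|V|$ terms are add-one (Laplace) smoothing, which keeps the likelihood of a word that never occurs in a class from being exactly zero.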


In [10]:
class NaiveBayesClassifier:
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.class_priors = None
        self.likelihoods = None


    def fit(self, features, labels):
        num_docs, num_words = features.shape
        self.class_priors = np.zeros(self.num_classes)
        self.likelihoods = np.zeros((self.num_classes, num_words))

        for c in range(self.num_classes):
            # Calculate priors
            docs_in_class = labels == c
            self.class_priors[c] = np.sum(docs_in_class) / num_docs

            # Calculate likelihoods with add-one (Laplace) smoothing
            self.likelihoods[c] = (np.sum(features[docs_in_class], axis=0) + 1) / (np.sum(features[docs_in_class]) + num_words)

    
    def predict(self, features):
        num_docs, _ = features.shape
        predictions = []

        for doc in range(num_docs):
            log_probs = np.zeros(self.num_classes)

            for c in range(self.num_classes):
                # Log prior plus the log-likelihoods of the words present in the document
                # (the feature values act only as a presence mask here)
                log_probs[c] = np.log(self.class_priors[c]) + np.sum(np.log(self.likelihoods[c, features[doc, :] > 0]))

            # Choose the class with the highest log probability
            predicted_class = np.argmax(log_probs)
            predictions.append(predicted_class)

        return predictions
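
The `predict` method sums log-probabilities instead of multiplying raw probabilities. A short sketch of why, using made-up per-word likelihoods: the product of many small numbers underflows to zero in floating point, while the sum of their logs stays finite and still ranks the classes correctly.

```python
import math

# Hypothetical per-word likelihoods for one long email (illustrative values only).
word_probs = [1e-5] * 100

product = 1.0
for p in word_probs:
    product *= p  # eventually drops below the smallest representable float

log_score = sum(math.log(p) for p in word_probs)

print(product)    # the raw product has underflowed
print(log_score)  # the log-space score is still usable
```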

### Train Model

In [11]:
num_classes = 2  # ham (0), spam (1)
nb_classifier = NaiveBayesClassifier(num_classes)
nb_classifier.fit(X_train.toarray(), y_train)

### Test Model

In [12]:
predictions = nb_classifier.predict(X_test.toarray())


## Step 4: Model Evaluation

Evaluate the performance of your model by calculating accuracy, precision, recall, and F1 score.


### Precision

In [13]:
precision = precision_score(y_test, predictions)
print(f"Precision: {precision}")

Precision: 0.9954545454545455


### Recall

In [14]:
recall = recall_score(y_test, predictions)
print(f"Recall: {recall}")


Recall: 0.7474402730375427


### F1 Score

In [15]:
f1 = f1_score(y_test, predictions)
print(f"F1 Score: {f1}")

F1 Score: 0.8538011695906433


### Accuracy

In [16]:
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Accuracy: 0.927536231884058
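
The four scores can be tied together through the confusion-matrix counts. The counts below are the smallest integers consistent with the printed scores; they are inferred here, not printed by the notebook, so treat them as an assumption.

```python
# Counts inferred from the printed scores (an assumption, not notebook output):
# true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 219, 1, 74, 741

precision = tp / (tp + fp)                           # flagged emails that really are spam
recall = tp / (tp + fn)                              # spam emails that were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)           # all correct decisions

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

Under these counts the model almost never flags a legitimate email, but it lets roughly a quarter of the spam through, which is why recall is the metric with the most room for improvement.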


### Save Results to a CSV File

In [17]:
results_df = pd.DataFrame({'Spam': predictions})
results_df.to_csv('Q3.csv', index=False)


## Step 5: Improvement and Discussion

Discuss potential improvements to increase the model's performance. What changes could be made in preprocessing or model tuning?


### 1. Feature Engineering
Currently, we use TF-IDF (Term Frequency-Inverse Document Frequency) for feature extraction. We can try other feature representations and check whether they improve performance. For example, we can experiment with word embeddings (e.g., Word2Vec, GloVe), raw word counts, or n-grams.
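
As a small illustration of the n-gram idea, the sketch below generates unigrams and bigrams from a preprocessed string. The helper `word_ngrams` is hypothetical; in scikit-learn the same effect is usually obtained by passing `ngram_range=(1, 2)` to `TfidfVectorizer` or `CountVectorizer`.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("free prize claim now"))
# ['free', 'prize', 'claim', 'now', 'free prize', 'prize claim', 'claim now']
```

Bigrams such as "free prize" can carry a stronger spam signal than either word alone, at the cost of a much larger vocabulary.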

### 2. Advanced Text Preprocessing
The current preprocessing step converts text to lowercase, removes punctuation, and removes stopwords. You can explore more advanced techniques such as lemmatization, stemming, or dedicated handling of the URLs and special characters that are common in spam emails.
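
One possible way to handle URLs and numbers is to replace them with placeholder tokens before punctuation is stripped, so that "contains a link" becomes a feature of its own. This is a sketch under that assumption; the regular expressions and the token names `urltoken` and `numtoken` are illustrative choices, not part of the original pipeline.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalize_tokens(text):
    """Replace URLs and numbers with placeholder tokens and collapse whitespace."""
    text = URL_RE.sub(" urltoken ", text)
    text = NUMBER_RE.sub(" numtoken ", text)
    return " ".join(text.split())

print(normalize_tokens("Win $1,000 now at http://spam.example/win?id=42 today"))
# Win $ numtoken now at urltoken today
```

This step would need to run before `preprocess_text`, because removing punctuation first breaks URLs into unrecognizable fragments.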

### 3. Ensemble Methods
As we recently learned in our course, instead of using a single Naive Bayes classifier, you can try ensemble methods like bagging or boosting. Ensemble methods combine multiple models to improve performance and robustness.
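
A minimal sketch of the two building blocks of bagging: drawing a bootstrap sample of row indices to train each ensemble member, and combining the members' predictions by majority vote. The function names and the example predictions are hypothetical, and training the individual classifiers is left out.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bagging round)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(per_model_predictions):
    """Combine predictions from several models (one label list per model)."""
    combined = []
    for votes in zip(*per_model_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

rng = random.Random(42)
print(bootstrap_indices(5, rng))  # rows used to train one ensemble member

# Hypothetical predictions from three classifiers on four emails.
model_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
print(majority_vote(model_predictions))
```

Using an odd number of models avoids ties when there are only two classes.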

### 4. Handling Class Imbalance
If the dataset is imbalanced (i.e., one class has significantly more samples than the other), the model's performance can suffer, particularly on the minority class. You can try oversampling the minority class, undersampling the majority class, or more advanced methods such as SMOTE (Synthetic Minority Over-sampling Technique), as we did in the previous homework.
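
The simplest of these options, random oversampling, can be sketched in a few lines. Note that this is plain duplication of minority rows, not SMOTE, and that it should be applied to the training split only so that duplicated emails never end up in the test set. The function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(rows) for rows in by_class.values())
    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_texts.extend(rows + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

texts = ["ham a", "ham b", "ham c", "ham d", "spam x"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Checking `dataset['label_num'].value_counts()` first shows whether the imbalance is large enough to be worth addressing.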

### 5. Feature Selection
Not all features generated from text data are relevant for distinguishing spam from non-spam emails. Feature selection techniques can help identify the most informative ones. You can explore methods such as the chi-squared test, mutual information, or feature importance from tree-based models to select the most discriminative features.
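
To illustrate the chi-squared idea, the sketch below scores a single word from a 2x2 table of document counts (spam or ham, word present or absent). The function name and the counts are made up for illustration; scikit-learn's `SelectKBest` with the `chi2` score function offers a related ready-made route.

```python
def chi2_term(n_spam_with, n_spam_without, n_ham_with, n_ham_without):
    """Chi-squared statistic of a 2x2 class-by-word-presence contingency table."""
    a, b, c, d = n_spam_with, n_spam_without, n_ham_with, n_ham_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the word occurs in every document or in none
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts: "prize" is far more common in spam than in ham ...
print(chi2_term(40, 60, 5, 195))
# ... while "meeting" occurs at the same rate in both classes.
print(chi2_term(10, 90, 20, 180))
```

A high score means the word's presence depends strongly on the class, so it is worth keeping; a score near zero suggests the word can be dropped.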