# NLP Project: Spam vs. Ham Classifier using Bag of Words

## Overview

This project builds a classifier that distinguishes spam emails from legitimate (ham) emails using Natural Language Processing (NLP) techniques. Each email is represented with a Bag of Words model, and a classifier is trained on that representation to label new emails as spam or ham.

## Dataset

The dataset consists of 5,572 labeled emails, each marked as spam or ham (4,825 ham and 747 spam). The data is split 80/20 into a training set of 4,457 emails for model development and a test set of 1,115 emails for evaluating the classifier's performance.

## Methodology

### 1. Data Preprocessing

- **Text Cleaning:** Remove any irrelevant characters, punctuation, and HTML tags.
- **Tokenization:** Break down emails into individual words (tokens).
- **Lowercasing:** Convert all words to lowercase for uniformity.
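
The three preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `preprocess` and the sample message are hypothetical, and the ASCII-only character filter is an assumption that would also drop characters such as "ü".

```python
import html
import re


def preprocess(text):
    """Clean, lowercase, and tokenize one message (hypothetical helper)."""
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = text.lower()                       # lowercase for uniformity
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation and other symbols with spaces
    return text.split()                       # whitespace tokenization


print(preprocess("<p>WIN a FREE prize!!! Call now &amp; claim.</p>"))
# ['win', 'a', 'free', 'prize', 'call', 'now', 'claim']
```

Note that `CountVectorizer`, used later in the notebook, already lowercases and tokenizes by default, so a separate step like this is mainly useful for stripping markup.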

### 2. Bag of Words Representation

Implement the Bag of Words model, which involves:

- **Tokenization:** Convert each email into a list of words.
- **Count Vectorization:** Create a matrix representing the frequency of each word in the dataset.
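
To make the count matrix concrete, here is a small hand-rolled sketch of the idea using only the standard library. The two toy messages and the helper `to_counts` are hypothetical; the notebook itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

# Hypothetical toy messages, already cleaned and lowercased
docs = [
    "free entry win free prize",
    "ok see you at home",
]

tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens
vocab = sorted({tok for toks in tokenized for tok in toks})


def to_counts(tokens, vocab):
    """Return one row of the count matrix: the frequency of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


matrix = [to_counts(toks, vocab) for toks in tokenized]

print(vocab)   # ['at', 'entry', 'free', 'home', 'ok', 'prize', 'see', 'win', 'you']
print(matrix)  # [[0, 1, 2, 0, 0, 1, 0, 1, 0], [1, 0, 0, 1, 1, 0, 1, 0, 1]]
```

Each row is one message and each column is one vocabulary word, so word order within a message is discarded.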

### 3. Model Development

Train a predictive classifier (e.g., Naive Bayes, Logistic Regression) using the Bag of Words representation. This involves:

- **Splitting Data:** Divide the dataset into training and testing sets.
- **Training the Model:** Use the training set to train the classifier.

### 4. Model Evaluation

Evaluate the classifier's performance using metrics such as accuracy, precision, recall, and F1-score on the test set.
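
As a reminder of what these metrics measure, the sketch below computes precision, recall, and F1 for the positive (spam) class from raw predictions. The function name and the toy labels are hypothetical and unrelated to the project's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 4 spam and 6 ham messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```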

## Tools and Libraries

- Python
- Jupyter Notebooks
- Scikit-learn for machine learning
- NLTK (Natural Language Toolkit) for NLP tasks

## Results

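On the 1,115-email test set, the Multinomial Naive Bayes classifier trained on Bag of Words counts reaches 0.99 accuracy. For the spam class, precision is 0.97, recall is 0.92, and the F1-score is 0.95; for the ham class, precision is 0.99 and recall is 1.00. The lower spam recall shows that the main remaining error is spam that passes as ham.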

## Conclusion

Summarize the key findings and discuss potential improvements or future work for enhancing the spam vs. ham classification model.

---



### Feature Pipeline
- Load data
- Exploratory data analysis (if needed)
- Feature engineering

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the dataset; a forward slash keeps the path portable and avoids backslash escapes
df = pd.read_csv("Input_data/spam.csv")
print(df.shape)
print('')
df.head()

(5572, 2)



  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [3]:
# Check the distribution of the target variable for class imbalance
df.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

In [4]:
# Encode the target as an integer flag: 1 for spam, 0 for ham
df['spam_flag'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)

df.spam_flag.value_counts()

spam_flag
0    4825
1     747
Name: count, dtype: int64
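
The counts above imply a noticeable class imbalance. A quick calculation, using only the two counts printed by `value_counts()`, shows the spam share and the accuracy a trivial "always predict ham" baseline would reach, which is why precision and recall on the spam class matter more than accuracy alone.

```python
# Counts taken from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

print(f"total messages: {total}")                                  # total messages: 5572
print(f"spam share: {spam / total:.3f}")                           # spam share: 0.134
print(f"majority-class baseline accuracy: {ham / total:.3f}")      # majority-class baseline accuracy: 0.866
```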

### Training Pipeline
- Train, test, and inference split
- Train pipeline creation
- Train, tune and validate the classifier

In [5]:
# Train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam_flag, test_size=0.2, random_state=123)
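
The split above is purely random. With imbalanced classes, one optional variation (not used in this notebook) is a stratified split, which keeps the spam ratio the same in both sets via the `stratify` argument of `train_test_split`. The sketch below uses hypothetical toy data with 80 ham and 20 spam labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1) messages
X = [f"message {i}" for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y preserves the 80/20 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

print(Counter(y_tr))  # 64 ham and 16 spam
print(Counter(y_te))  # 16 ham and 4 spam
```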

In [6]:
print(X_train.shape)
print('')
print(y_train.shape)
print('')
print(X_test.shape)
print('')
print(y_test.shape)
print('')

(4457,)

(4457,)

(1115,)

(1115,)



In [7]:
X_train[:6]

385     Double mins and txts 4 6months FREE Bluetooth ...
4003    Did you get any gift? This year i didnt get an...
1283    Ever green quote ever told by Jerry in cartoon...
2327    The Xmas story is peace.. The Xmas msg is love...
1103           Black shirt n blue jeans... I thk i c ü...
5295    Alex says he's not ok with you not being ok wi...
Name: Message, dtype: object

#### Create bag of words representation using CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

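# Learn the vocabulary from the training messages only
v = CountVectorizer()

# Sparse document-term matrix: one row per message, one column per vocabulary word
X_train_cv = v.fit_transform(X_train.values)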

print(X_train.shape)
print('')
print(X_train_cv.shape)


(4457,)

(4457, 7689)


In [9]:
X_train_cv.toarray().shape

(4457, 7689)

In [10]:
print(v.vocabulary_)



In [11]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [12]:
np.where(X_train_np[0]!=0)

(array([  45,  598,  954, 1139, 1388, 1589, 1591, 2394, 2940, 4474, 4523,
        4573, 4651, 4781, 4913, 4952, 4955, 5165, 6279, 7058], dtype=int64),)

#### Train the Naive Bayes model

In [13]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)
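
For intuition about what a multinomial Naive Bayes model learns from word counts, here is a simplified from-scratch sketch: log class priors plus Laplace-smoothed log word likelihoods. It is an illustration under stated assumptions, not scikit-learn's implementation; the function names and toy documents are hypothetical, and `alpha=1.0` mirrors the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for c, n_docs in class_counts.items():
        total = sum(word_counts[c].values())
        log_prior = math.log(n_docs / len(labels))
        # Laplace smoothing: add alpha to every word count so no probability is zero
        log_lik = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_lik)
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {
        c: log_prior + sum(log_lik[w] for w in doc if w in log_lik)
        for c, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data: 1 = spam, 0 = ham
docs = [["free", "prize", "win"], ["win", "cash", "free"],
        ["see", "you", "home"], ["call", "you", "later"]]
labels = [1, 1, 0, 0]

nb = train_nb(docs, labels)
print(predict_nb(nb, ["free", "cash"]))         # 1
print(predict_nb(nb, ["see", "you", "later"]))  # 0
```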

In [14]:
X_test_cv = v.transform(X_test)
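
Note that the test set goes through `transform`, not `fit_transform`: the vocabulary learned from the training data is reused, and words never seen during training are dropped. A small sketch on hypothetical toy messages shows the effect.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

# fit_transform learns the vocabulary from the (hypothetical) training messages
train_counts = vec.fit_transform(["free prize today", "see you today"])

# transform reuses that vocabulary; "lottery" was never seen, so it is ignored
test_counts = vec.transform(["free lottery today"])

print(train_counts.shape)     # (2, 5)
print(test_counts.shape)      # (1, 5)
print(test_counts.toarray())  # [[1 0 0 1 0]]
```

Both matrices have the same number of columns, which is what lets the trained model score the test messages.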

#### Evaluate Performance

In [15]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       962
           1       0.97      0.92      0.95       153

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115



In [16]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

#### Train the model with a scikit-learn Pipeline to reduce the amount of code

In [17]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [18]:
clf.fit(X_train, y_train)
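
Because the vectorizer and the classifier live in one `Pipeline`, hyperparameters can be tuned across both steps with cross-validation. The sketch below is one possible approach to the tuning step, not something this notebook runs: the toy texts, the `alpha` grid, and the choice of F1 as the score are all assumptions. Step parameters are addressed as `<step name>__<parameter>`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash offer today",
    "claim your free reward", "exclusive offer win cash",
    "see you at lunch", "call me when you are home",
    "are you coming home today", "lunch at noon tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Search over the smoothing strength of the "nb" step with 2-fold cross-validation
grid = GridSearchCV(
    pipe,
    param_grid={"nb__alpha": [0.1, 0.5, 1.0]},
    scoring="f1",
    cv=2,
)
grid.fit(texts, labels)

print(grid.best_params_)
```

On the real dataset a larger `cv` value would be more appropriate; 2 folds are used here only because the toy data is tiny.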

In [19]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       962
           1       0.97      0.92      0.95       153

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115



In [21]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!',
    'Get 40% refund on parking, using digital payment. Dont miss this reward!'
]

clf.predict(emails)

array([0, 1, 1], dtype=int64)
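
`predict` returns only the hard label. To see how confident the model is, a fitted pipeline that ends in `MultinomialNB` also exposes `predict_proba`. The sketch below trains a separate toy pipeline on hypothetical messages, so the probabilities it prints say nothing about the project's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data: 1 = spam, 0 = ham
train_texts = [
    "win a free prize now", "free cash offer",
    "see you at lunch", "call me when you are home",
]
train_labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_clf.fit(train_texts, train_labels)

# Each row holds one probability per class, ordered as in toy_clf.classes_
proba = toy_clf.predict_proba(["free prize offer", "see you at home"])

print(toy_clf.classes_)
print(proba.round(3))
```

A probability threshold other than 0.5 can then be applied when the cost of missing spam differs from the cost of flagging ham.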