# Building a Spam Filter From Scratch
Did you know that one of the earliest mainstream applications of machine learning was the email spam filter back in the 90s? It may not be as flashy as a self-aware Skynet, but it definitely qualifies as ML! 

## Imports and Load Dataset

In [12]:
import pandas as pd
import numpy as np

df = pd.read_csv('TextFiles/smsspamcollection.tsv', sep='\t')
df.head()

|   | label | message | length | punct |
|---|-------|---------|--------|-------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


## Checking for missing values

In [13]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

## Looking at the *ham* and *spam* `label` column

In [14]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

4825 out of 5572 messages, or 86.6%, are ham. A model that labels every message as ham would already be 86.6% accurate, so any text classification model we create has to perform **better than 86.6%** to beat that majority-class baseline.
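
As a quick check on that figure, here is a minimal sketch that recomputes the share of the larger class from the counts printed above. A model that always answers `ham` would score exactly this. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the most common label.
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting '{majority_label}' scores {baseline_accuracy:.1%}")
```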

## Split the data into train & test sets

In [16]:
from sklearn.model_selection import train_test_split

X = df['message']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

# Fit the vectorizer to the training data and transform it into a count matrix
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)

This shows that our training set comprises 3733 documents (messages) and 7082 features (unique words).
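
To make the shape easier to read, here is a small sketch on a made-up three-message corpus (the messages and variable names are hypothetical, not taken from the SMS dataset). Each row is one message, each column is one unique token, and each cell holds how often that token appears in that message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only.
toy_messages = ["free entry free prize", "call me later", "free call"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_messages)

# One row per message, one column per unique token.
print(toy_counts.shape)

# The learned vocabulary maps each token to its column index.
print(sorted(toy_vect.vocabulary_))

# Dense view of the counts (fine for a toy corpus, wasteful for real data).
print(toy_counts.toarray())
```

The real matrix is stored in sparse form for the same reason the dense view is only used here: most messages contain only a handful of the 7082 tokens.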

## Transform Counts to Frequencies with Tf-idf

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)

Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count matrix to a tf-idf representation.
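
The two operations can also be called separately, which matters later: unseen messages should go through `transform()` only, so they are weighted with what was learned from the training data. A minimal sketch on a made-up corpus (all document strings and variable names here are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini-corpus, for illustration only.
train_docs = ["free prize now", "see you at lunch", "free lunch"]
new_docs = ["free prize at lunch"]

counts_vect = CountVectorizer()
train_counts = counts_vect.fit_transform(train_docs)

# One call: learn the idf weights and apply them.
one_step = TfidfTransformer().fit_transform(train_counts)

# Two calls: fit() learns the idf weights, transform() applies them.
fitted_tfidf = TfidfTransformer().fit(train_counts)
two_step = fitted_tfidf.transform(train_counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)

# Unseen text reuses the already-fitted objects: transform() only, no refit.
new_tfidf = fitted_tfidf.transform(counts_vect.transform(new_docs))
print(new_tfidf.shape)
```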

## Combine Steps with TfidfVectorizer

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(3733, 7082)
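
The shape is unchanged because `TfidfVectorizer` does the counting and the tf-idf weighting in one object. The sketch below checks that on a made-up corpus, assuming default settings for all three classes. The documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini-corpus, for illustration only.
docs = ["win a free prize", "lunch at noon", "free lunch today"]

# Two stages: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_stage = TfidfTransformer().fit_transform(counts)

# One stage: TfidfVectorizer goes from raw text to tf-idf directly.
one_stage = TfidfVectorizer().fit_transform(docs)

matches = np.allclose(two_stage.toarray(), one_stage.toarray())
print(matches)
```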

## Train a Classifier

In [20]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

## Build a Pipeline

In [21]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  
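
Because the vectorizer is the first step of the pipeline, the fitted pipeline accepts raw strings. The sketch below uses a tiny made-up training set (far too small for a real model) to show that calling `predict` on the pipeline gives the same answer as running its two fitted steps by hand. The data, the step names, and `random_state=0` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, for illustration only.
toy_X = [
    "win a free prize now",
    "free entry to win cash",
    "are we still on for lunch",
    "see you at home tonight",
]
toy_y = ["spam", "spam", "ham", "ham"]

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
toy_clf.fit(toy_X, toy_y)

new_messages = ["free cash prize", "lunch at home"]

# The pipeline vectorizes and classifies in one call...
via_pipeline = toy_clf.predict(new_messages)

# ...which matches running its two fitted steps by hand.
features = toy_clf.named_steps["tfidf"].transform(new_messages)
by_hand = toy_clf.named_steps["clf"].predict(features)

print(list(via_pipeline), list(by_hand))
```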

## Test the classifier and display results

In [22]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [23]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [24]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [25]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037


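The model performs very well: it classified **98.97%** of the test messages correctly, comfortably above the 86.6% baseline. For spam specifically, the report above shows a recall of 0.95, meaning it caught about 95% of the spam messages.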
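
Accuracy and the spam-specific scores all come from the same four numbers in the confusion matrix. As a worked example, the sketch below recomputes them by hand from the matrix printed earlier, assuming rows are true labels and columns are predicted labels in the order ham, spam. The variable names are illustrative.

```python
# Confusion matrix values copied from the output above.
# Rows are true labels, columns are predicted labels, ordered ham, spam.
tn, fp = 1586, 7    # true ham:  predicted ham, predicted spam
fn, tp = 12, 234    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

# Share of all test messages that were labelled correctly.
accuracy = (tn + tp) / total

# Of the messages flagged as spam, how many really were spam.
spam_precision = tp / (tp + fp)

# Of the real spam messages, how many were flagged.
spam_recall = tp / (tp + fn)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
```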

In [26]:
text_clf.predict(["Hi how are you doing"])

array(['ham'], dtype=object)

In [29]:
text_clf.predict(["Congratulations, You have been selected as a Winner. TEXT WON on 45454 today"])

array(['spam'], dtype=object)

In [34]:
messages = ["Hi, how are you doing","Congratulations, You have been selected as a Winner. TEXT WON on 45454 today"]

# Perform prediction using your classifier model
predictions = text_clf.predict(messages)

# Mapping the prediction labels to more user-friendly representations
label_mapping = {
    'ham': 'Not spam',
    'spam': 'Spam'
}

# Process each prediction and display the improved output
for message, prediction in zip(messages, predictions):
    predicted_label = label_mapping[prediction]
    print(f"Prediction: The message '{message}' is classified as '{predicted_label}'.")

Prediction: The message 'Hi, how are you doing' is classified as 'Not spam'.
Prediction: The message 'Congratulations, You have been selected as a Winner. TEXT WON on 45454 today' is classified as 'Spam'.


In [36]:
import pickle

# Pickle the fitted vectorizer (the file is closed when the block exits)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# Pickle the trained classifier model
with open('spam_classifier_model.pkl', 'wb') as f:
    pickle.dump(clf, f)
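
The cell above saves the vectorizer and the classifier as two separate files, so both have to be loaded and applied in order later. One alternative, shown here only as a sketch, is to pickle a whole fitted pipeline and reload it as a single object. The toy data, the file name `spam_pipeline.pkl`, and the temporary directory are all hypothetical stand-ins for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the SMS training set.
toy_X = [
    "win a free prize now",
    "free entry to win cash",
    "are we still on for lunch",
    "see you at home tonight",
]
toy_y = ["spam", "spam", "ham", "ham"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
model.fit(toy_X, toy_y)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another process: load the pipeline and predict on raw text.
with open(path, "rb") as f:
    restored = pickle.load(f)

sample = ["free prize", "see you at lunch"]
before = list(model.predict(sample))
after = list(restored.predict(sample))
print(before == after)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that wrote them.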