# Spam filter

In [1]:
# import the dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("spam.csv")
df.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# check the number of spam and ham emails
df.groupby('Category').describe()

| Category | Message count | Message unique | Message top | Message freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 641 | Please call our customer service representativ... | 4 |


In [4]:
# use numerical representation of 1 for spam and 0 for ham
df['spam']=df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
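
The same label column can also be built with a vectorized comparison instead of `apply` and a lambda. The sketch below uses a small hypothetical frame (`toy` is an illustrative name, not part of the dataset) and is only meant as an alternative; the cell above works as written.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real spam.csv data
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Compare every row to 'spam' at once, then cast True/False to 1/0
toy["spam"] = (toy["Category"] == "spam").astype(int)
print(toy["spam"].tolist())
```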


In [5]:
# split the data into training and testing portions
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message,df.spam)
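
The split above relies on the defaults of `train_test_split`: 25% of the rows go to the test set and the shuffle is not seeded, so the scores reported later will vary slightly between runs. A possible variant, sketched here on hypothetical toy data (`messages` and `labels` are made up for illustration), fixes the seed and keeps the ham/spam ratio equal in both parts:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 ham (0) and 4 spam (1) messages
messages = [f"message {i}" for i in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.25,    # hold out 25% of the rows for testing
    stratify=labels,   # keep the ham/spam ratio the same in both parts
    random_state=42,   # fix the shuffle so the split is reproducible
)
print(len(X_tr), len(X_te), sum(y_te))
```

Stratifying is worth considering for this dataset because spam is the minority class; whether it changes the final score noticeably has not been checked here.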

Converting text into a matrix of token counts, for example with scikit-learn's CountVectorizer, is a basic step in many natural language processing (NLP) tasks. Here is what it means and why it matters:

**1. Tokenization:**
Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, characters, or subwords depending on the tokenization strategy. For example, in English, words are commonly used as tokens.
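
As a rough illustration of word-level tokenization, the sketch below lowercases a message and extracts tokens with the same regular expression that `CountVectorizer` uses by default (runs of two or more word characters). The helper name `tokenize` is hypothetical; it only approximates what the vectorizer does internally.

```python
import re

# Default CountVectorizer token pattern: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Ok lar... Joking wif u oni...")
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'oni']
```

Note that the single-character word "u" is dropped, because the pattern only keeps tokens of at least two characters.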

**2. Counting Tokens:**
Once the text is tokenized, the next step is to count how many times each token occurs in each document. This gives a numerical representation of the text in which every distinct token is a feature and its count records how often it appears in that document.
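
A minimal sketch of the counting step for one document, using only the standard library. The token list is hypothetical and assumed to be already tokenized.

```python
from collections import Counter

# Hypothetical, already-tokenized message
tokens = ["free", "entry", "win", "free", "prize", "win", "free"]

# Map each distinct token to the number of times it occurs
counts = Counter(tokens)
print(counts["free"], counts["win"], counts["cash"])  # 3 2 0
```

Tokens that never occur, such as "cash" here, simply have a count of zero.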

**3. Matrix Representation:**
The matrix of token counts is a tabular representation where rows correspond to documents or samples, and columns correspond to tokens or features. Each cell in the matrix contains the count of a particular token in the corresponding document.

**4. Significance:**
Converting text data into a matrix of token counts has several significant implications:

- **Numerical Representation:** It converts text into a numerical format that machine learning algorithms can process.
- **Feature Extraction:** It captures how often each word (token) occurs in each document, which is a useful feature for classification, clustering, and other NLP tasks.
- **Fixed-Length Vectors:** Every document, whatever its length, becomes a vector with one entry per vocabulary token. The result is high-dimensional and mostly zeros, which is why it is stored as a sparse matrix rather than a dense array.
- **Compatibility with Algorithms:** Many machine learning algorithms, such as Naive Bayes, Logistic Regression, and Support Vector Machines, require numerical input. Token count matrices provide that input for text data.
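
To get a feel for the size and sparsity of a token count matrix, the sketch below inspects one built from four hypothetical messages. The variable names are illustrative; the same attributes (`shape`, `nnz`) can be read from any matrix returned by `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = [
    "win a free prize now",
    "call me later",
    "free entry to win cash",
    "see you at lunch",
]

matrix = CountVectorizer().fit_transform(docs)

n_docs, n_tokens = matrix.shape             # rows = documents, columns = vocabulary
density = matrix.nnz / (n_docs * n_tokens)  # share of cells that are non-zero
print(n_docs, n_tokens, matrix.nnz, round(density, 2))
```

Even in this tiny corpus most cells are zero; with a vocabulary of thousands of words the proportion of non-zero cells becomes far smaller.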

In [6]:
# import the CountVectorizer class. CountVectorizer is a scikit-learn class
# that converts a collection of text documents into a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

# create an instance of the CountVectorizer class. This instance will be used
# to transform text data into a matrix of token counts.
v = CountVectorizer()

# use the fit_transform method of the CountVectorizer object v to convert the text data
# in X_train.values into a matrix of token counts.
# the fit_transform method learns the vocabulary dictionary of the training data
# (X_train.values) and returns the document-term matrix (matrix of token counts)
# for the training data
X_train_count = v.fit_transform(X_train.values)

# X_train_count is a memory-efficient sparse matrix; convert it into a dense
# array so it is easier to inspect, then select the first two rows to get a
# glimpse of the token counts for the first two training documents
X_train_count.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [7]:
# check out some of the unique words
v.get_feature_names_out()[1000:1020]

array(['apologize', 'apology', 'app', 'apparently', 'appeal', 'appear',
       'appendix', 'applebees', 'apples', 'application', 'apply',
       'applyed', 'applying', 'appointment', 'appointments', 'appreciate',
       'appreciated', 'approaches', 'approaching', 'appropriate'],
      dtype=object)

In [8]:
# import the multinomial Naive Bayes model, instantiate it, and fit it on the training counts
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)
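
As a small, self-contained illustration of what the fitted model does with token counts, the sketch below trains `MultinomialNB` on four hypothetical messages and asks for class probabilities on two new ones. All data and variable names here are made up; the probabilities on the real dataset will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 1 = spam, 0 = ham
train_docs = [
    "win free prize now",
    "free entry win cash",
    "call me later today",
    "see you at lunch",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB()  # default alpha=1.0 smooths counts of unseen tokens
nb.fit(vec.fit_transform(train_docs), train_labels)

new_docs = ["free cash prize", "lunch later"]
new_counts = vec.transform(new_docs)

pred = nb.predict(new_counts)         # hard labels
proba = nb.predict_proba(new_counts)  # columns follow nb.classes_ (0, then 1)
print(pred)
print(proba.round(3))
```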

In [9]:
emails = [
    'Hey mohan, can we get together to watch footbal game tomorrow?',
    'Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

# transform the new text into a document-term matrix using the vocabulary that
# was learned by the earlier fit_transform call. transform applies the same
# tokenization and counting as fit_transform but does not re-learn the
# vocabulary, so the new data ends up in the same format (same columns) as
# the training data
emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1])

In [10]:
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9856424982053122
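
Accuracy alone can be flattering on this dataset: with 4825 ham and 747 spam messages, a model that always answers "ham" would already be right about 87% of the time. The sketch below shows how a confusion matrix, precision, and recall could be computed; the label vectors are hypothetical and are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # of messages flagged as spam, share that are spam
recall = recall_score(y_true, y_pred)        # of actual spam, share that was flagged

print(cm)
print(round(precision, 3), recall)
```

On the real data the same functions could be called with `y_test` and `model.predict(X_test_count)`.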

## Sklearn Pipeline

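Scikit-learn provides a Pipeline class that chains the transformation and the classifier into a single estimator. With a pipeline you no longer have to convert the text into token counts by hand, once for training and again for testing; the pipeline applies the vectorizer automatically whenever you fit, score, or predict.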


In [11]:
# create a pipeline of estimators by importing Pipeline from sklearn.pipeline
from sklearn.pipeline import Pipeline

# define a classification pipeline with steps - vectorizer and nb
# the vectorizer step performs text feature extraction using CountVectorizer.
# It analyzes the text data in X_train and converts it into a matrix of token
# counts during the pipeline's fit step
# the nb step uses a Multinomial NB classifier. It takes the transformed features
# (token count matrix) from the previous step and learns to classify text documents
# into different categories based on the training data (X_train, y_train)
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [12]:
# fit the pipeline on the training data; this runs both steps above in order
clf.fit(X_train, y_train)

In [13]:
clf.score(X_test,y_test)

0.9856424982053122

In [14]:
clf.predict(emails)

array([0, 1])
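
Because a pipeline is a single estimator, it can also be passed directly to utilities such as `cross_val_score`, which then refits the vectorizer and the classifier on each training fold. This is a sketch on a hypothetical eight-message corpus with two folds; on the real data one would pass `df.Message` and `df.spam` and likely use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham
docs = [
    "win free prize now", "free entry win cash",
    "claim your free prize", "cash prize call now",
    "call me later today", "see you at lunch",
    "are we meeting later", "lunch at noon today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Each fold learns its own vocabulary from its training portion only
scores = cross_val_score(pipe, docs, labels, cv=2)
print(scores, scores.mean())
```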