# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam-style text classifier! Most of the code is just loading our training data, a set of short messages each carrying a 0/1 label in the boolean column, into a pandas DataFrame that we can play with:

In [1]:
import io
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the labeled messages; change this path to wherever dataset.csv lives on your machine.
data = pd.read_csv(r'C:\Users\saket\Desktop\dataset.csv')

Let's have a look at that DataFrame:

In [2]:
data.head()

|   | boolean | message |
|---|---------|---------|
| 0 | 1 | "California is 49th out of 50 in the United S... |
| 1 | 1 | "Fox News barely covered (the Duggar family) ... |
| 2 | 1 | "Just about half of rural hospitals operate i... |
| 3 | 1 | "One out of every 30 people in the Greater Bo... |
| 4 | 1 | "The world food demand is going to double som... |


In [3]:
data.describe()

|       | boolean  |
|-------|----------|
| count | 2818.0   |
| mean  | 0.479063 |
| std   | 0.49965  |
| min   | 0.0      |
| 25%   | 0.0      |
| 50%   | 0.0      |
| 75%   | 1.0      |
| max   | 1.0      |
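
A quick way to read that summary: if the boolean column only ever holds 0 and 1 (the min, max, and quartiles suggest it does), its mean is simply the share of rows labeled 1, so roughly 48% of the 2,818 messages carry a 1. The sketch below illustrates the idea on a tiny made-up DataFrame; the `toy` frame and its rows are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same column names, made-up rows.
toy = pd.DataFrame({
    "boolean": [1, 1, 0, 0, 0],
    "message": ["a", "b", "c", "d", "e"],
})

# For a column that only contains 0 and 1, the mean is the share of rows labeled 1.
share_of_ones = toy["boolean"].mean()

# value_counts(normalize=True) reports the share of each label explicitly.
class_shares = toy["boolean"].value_counts(normalize=True)

print(share_of_ones)
print(class_shares)
```

Running `data['boolean'].value_counts()` on the real DataFrame would give the exact class counts.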


Now we will use a CountVectorizer to turn each message into a vector of word counts, and throw those counts into a MultinomialNB classifier along with the 0/1 labels. Call fit() and we've got a trained classifier ready to go! It's just that easy.

In [4]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['boolean'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
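
To see what the vectorizer actually hands to the classifier, it can help to run it on a couple of tiny messages. This is only a sketch on made-up text (the `toy_` names and messages are hypothetical), using CountVectorizer's default settings: each row is a message, each column is a word from the learned vocabulary, and each cell is how often that word appears.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, purely for illustration.
toy_messages = ["free money now", "money money later"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to its column index in the count matrix.
vocab = toy_vectorizer.vocabulary_
columns = sorted(vocab, key=vocab.get)

# fit_transform returns a sparse matrix; toarray() makes it easy to inspect.
dense = toy_counts.toarray()
print(columns)
print(dense)

# Words that were not seen during fitting are silently ignored by transform().
unseen = toy_vectorizer.transform(["unknown word"])
print(unseen.sum())
```

The last lines matter when trying new examples: a test message made up entirely of words the vectorizer never saw becomes an all-zero row, so the prediction then rests on the class priors alone.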

Let's try it out:

In [5]:
examples = ["Bernie hates Medicare4All."]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)

predictions

array([0], dtype=int64)
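
predict() only returns the winning label. If you also want to know how confident the model is, MultinomialNB offers predict_proba(), which returns one probability per class in the order given by classes_. The sketch below shows the call on a small made-up training set; the `toy_` names, messages, and labels are hypothetical and only stand in for the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: label 1 for the "cash" messages, 0 for the rest.
toy_messages = ["win cash now", "cheap cash offer", "meeting at noon", "lunch at noon"]
toy_targets = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_targets)

# One row per example, one column per class (column order follows classes_).
example_counts = toy_vectorizer.transform(["cheap cash now"])
probabilities = toy_classifier.predict_proba(example_counts)

print(toy_classifier.classes_)
print(probabilities)
```

With the notebook's own objects, the equivalent call would be classifier.predict_proba(example_counts).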

## Activity

Our data set is small (2,818 messages), so our classifier isn't actually very good. Try running some different test messages through it and see if you get the results you expect.

If you really want to challenge yourself, try applying a train/test split to this classifier: hold out a subset of the labeled messages, train on the rest, and see how well it predicts the held-out labels.
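
If you get stuck on the train/test challenge, here is one possible shape for it, sketched on made-up data rather than the real dataset. The `toy_` and `split_` names, the messages, and the 25% test size are all assumptions chosen for illustration. The point to notice is that the vectorizer is fitted on the training messages only, and the held-out messages are merely transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up labeled messages standing in for data['message'] and data['boolean'].
toy_messages = [
    "win cash now", "cheap cash offer", "free cash prize", "win a free prize",
    "meeting at noon", "lunch at noon", "notes from the meeting", "lunch tomorrow",
]
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the rows; stratify keeps both labels present in each part.
train_msgs, test_msgs, train_y, test_y = train_test_split(
    toy_messages, toy_targets, test_size=0.25, random_state=0, stratify=toy_targets
)

# Learn the vocabulary from the training messages only.
split_vectorizer = CountVectorizer()
train_counts = split_vectorizer.fit_transform(train_msgs)
test_counts = split_vectorizer.transform(test_msgs)

split_classifier = MultinomialNB()
split_classifier.fit(train_counts, train_y)

# score() returns the fraction of held-out messages labeled correctly.
accuracy = split_classifier.score(test_counts, test_y)
print(accuracy)
```

On the real data you would pass data['message'].values and data['boolean'].values in place of the toy lists.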