# Spam classifier

From [UCI](https://archive.ics.uci.edu/ml/datasets/sms%20spam%20collection) we have a collection of SMS messages, each labeled as ham (legitimate) or spam. The readme file states that the collection contains 4,827 legitimate messages (86.6%) and 747 spam messages (13.4%).

Text preprocessing, tokenizing, and (optionally, via ``stop_words``) stop-word filtering are all handled by ``sklearn.feature_extraction.text.CountVectorizer``, which builds a dictionary of features and transforms documents into feature vectors.
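
To make the vectorizing step concrete, here is a minimal sketch on three made-up messages (the `docs` list is hypothetical, not taken from the dataset). It assumes the default settings, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, for illustration only.
docs = [
    "Free entry to win a cup",
    "Ok, see you later",
    "Win a FREE prize now",
]

# Learn the vocabulary and build the document-term count matrix in one step.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in the matrix.
vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(counts.toarray())

# Stop-word filtering is opt-in: pass stop_words="english" to drop common words.
filtered = CountVectorizer(stop_words="english").fit(docs)
print(sorted(filtered.vocabulary_))
```

Each row of `counts` is one message and each column is one vocabulary entry, so a classifier only ever sees word counts, not word order.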

In [20]:
# Load the data using pandas.
# By inspection, the file is tab-separated and has no header row.
import pandas as pd
df = pd.read_csv('smsspamcollection/SMSSpamCollection', 
                sep='\t', names=['ham or spam', 'message'])
df['label'] = df['ham or spam'].apply(lambda s: 1 if s == 'spam' else 0)
df.head(5)

Unnamed: 0,ham or spam,message,label
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [21]:
# split the dataset into train and test data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(df['message'], df['label'])
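
The call above relies on the defaults, so each run draws a different random 75/25 partition and the reported score will vary slightly. If a repeatable split is wanted, one option is to pass `random_state`, and `stratify` to keep the ham/spam ratio the same in both parts. The sketch below uses hypothetical toy data (`messages`, `labels`) rather than the real file.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 4 of them spam (label 1).
messages = [f"message {i}" for i in range(20)]
labels = [1 if i % 5 == 0 else 0 for i in range(20)]

# random_state fixes the shuffle; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0
)

print(len(X_tr), len(X_te))   # sizes of the two parts
print(sum(y_tr), sum(y_te))   # spam count in each part

# Repeating the call with the same random_state reproduces the split.
X_tr2, X_te2, _, _ = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=0
)
```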

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Transform the message data (X_train and X_test) into vectors
bow_vectorizer = CountVectorizer()
training_vectors = bow_vectorizer.fit_transform(X_train)
test_vectors = bow_vectorizer.transform(X_test)

# Init a Naive Bayes Classifier and train it on the 
# vectorized training data (training_vectors) and 
# the training labels (y_train) 
spam_classifier = MultinomialNB()
spam_classifier.fit(training_vectors, y_train)

# Measure mean accuracy on the held-out test set using ``score``
predictions_score = spam_classifier.score(test_vectors, y_test)
print(predictions_score)

0.9827709978463748
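
Because the readme reports that 86.6% of the messages are ham, it can help to read the accuracy against the score of a classifier that always answers "ham", and to look at precision and recall for the spam class. The sketch below computes that baseline from the readme counts; the `y_true` and `y_pred` lists are hypothetical and only show how the metric functions are called.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Class counts taken from the readme figures quoted at the top.
ham_count, spam_count = 4827, 747
majority_baseline = ham_count / (ham_count + spam_count)
print(f"always-ham accuracy: {majority_baseline:.3f}")

# Hypothetical labels and predictions (1 = spam, 0 = ham), for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
spam_precision = precision_score(y_true, y_pred)  # pos_label defaults to 1 (spam)
spam_recall = recall_score(y_true, y_pred)
print(accuracy, spam_precision, spam_recall)
```

With the real data, the same functions could be applied to `y_test` and `spam_classifier.predict(test_vectors)`.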


In [28]:
# We could use our classifier to classify other messages
message = 'Yo boi! ya want a bigger schlong? take these pills ya woman\'s gonna be wet wet wet call this number now click on our link, first order free!!! usual price $50, no catch! call today, '
message_vector = bow_vectorizer.transform([message])
print(spam_classifier.predict(message_vector))


[1]
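
Classifying a new message requires applying the same fitted vectorizer before calling `predict`. One way to keep the two steps together is scikit-learn's `make_pipeline`, sketched here on a tiny hypothetical training set (not the UCI data) so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham), for illustration only.
train_messages = [
    "win a free prize now call today",
    "free entry claim your cash prize",
    "urgent call now to win cash",
    "are we still meeting for lunch",
    "see you at home later tonight",
    "can you send me the notes",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw strings and then feeds the counts to Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_messages, train_labels)

# Raw strings go straight in; no separate transform call is needed.
predictions = model.predict(
    ["call now to claim your free prize", "see you at lunch"]
)
print(predictions)
```

The same pattern would apply to the notebook's data by fitting the pipeline on `X_train` and `y_train`.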
