# Exercise 2: Spam Detection
### Spam Data Set: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
The objective is to train a model that can automatically detect spam messages.<br>
We will rely on the following observations from experience:
- messages containing words like 'free', 'win', 'winner', 'cash', 'prize' and the like are usually spam
- spam messages tend to have words written in all capitals
- spam messages also tend to use a lot of exclamation marks
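
To make the three observations above concrete, here is a minimal sketch that counts them for a single message. The function name `spam_signals` and the word list `SPAM_WORDS` are hypothetical helpers made up for this illustration; the exercise itself does not use such hand-made features and relies on a bag-of-words representation instead.

```python
import re

# Hypothetical trigger-word list, taken from the examples in the text above
SPAM_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_signals(message):
    """Return three simple hand-made spam signals for one message."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # Number of words that are known trigger words (case-insensitive)
        "spam_words": sum(w.lower() in SPAM_WORDS for w in words),
        # Number of words with at least two characters written in all capitals
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        # Number of exclamation marks in the message
        "exclamation_marks": message.count("!"),
    }

signals = spam_signals("WINNER!! Claim your FREE cash prize now!")
print(signals)
# {'spam_words': 4, 'all_caps_words': 2, 'exclamation_marks': 3}
```

A typical ham message such as "Ok lar... Joking wif u oni..." scores zero on all three signals.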

## Step 1: Get to Know the Dataset
We will be using a [dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from the UCI Machine Learning repository.

In [1]:
import pandas as pd
# It is a pre-processed table with two columns - a label and a message
# Import the table into a pandas dataframe using the read_table method
df = pd.read_table('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
df.shape

(5572, 2)

In [2]:
# Print the first 200 rows to get an idea of the data
df.head(200)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
195,ham,How are you doing? Hope you've settled in for ...
196,ham,Gud mrng dear hav a nice day
197,ham,Did u got that persons story
198,ham,is your hamster dead? Hey so tmr i meet you at...


## Step 2: Data Preprocessing

### 2.1 Digitize the Labels

In [3]:
# Convert the labels into numerical values, map 'ham' to 0 and 'spam' to 1
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head() 

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


### 2.2 Bag-of-Words Processing

Bag-of-words is a model that represents a piece of text, such as a sentence or a document, as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
The words are stored as tokens, each with a count of how often it appears.

1. Convert strings to lower case
2. Remove punctuation
3. Tokenize the message and give an integer ID to each token
4. Count frequencies
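
The four steps above can be sketched by hand in plain Python. This is only an illustration under simple assumptions (whitespace tokenization, alphabetical token IDs); the function name `bag_of_words` is hypothetical. `CountVectorizer`, used in the cells that follow, performs the same kind of processing but with its own tokenizer, which by default keeps only tokens of two or more alphanumeric characters.

```python
import string
from collections import Counter

def bag_of_words(messages):
    """Return (vocabulary, rows) where rows[i][j] counts token j in message i."""
    remove_punctuation = str.maketrans("", "", string.punctuation)
    tokenized = []
    for message in messages:
        lowered = message.lower()                        # 1. lower case
        cleaned = lowered.translate(remove_punctuation)  # 2. remove punctuation
        tokenized.append(cleaned.split())                # 3. tokenize on whitespace
    # 3. (continued) give every distinct token an integer ID, in alphabetical order
    distinct = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {token: index for index, token in enumerate(distinct)}
    # 4. count how often each vocabulary token occurs in each message
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts.get(token, 0) for token in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["Free entry, free prize!", "Ok, see you"])
print(vocabulary)
# {'entry': 0, 'free': 1, 'ok': 2, 'prize': 3, 'see': 4, 'you': 5}
print(rows)
# [[1, 2, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1]]
```

Each row has one column per vocabulary token, which is why the matrices produced later have as many columns as there are distinct tokens in the training messages.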

In [4]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [5]:
# Create an instance of CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
training_data.shape

(4179, 7456)

In [6]:
# Transform the testing data and return the matrix
# Note that we only transform here: the CountVectorizer is not fitted again on the testing data
test_data = count_vector.transform(X_test)
test_data.shape

(1393, 7456)

## Step 3: Train and Test

In [7]:
# Create a Multinomial Naive Bayes classifier and train the model
from sklearn.naive_bayes import MultinomialNB
myNB = MultinomialNB()
myNB.fit(training_data, y_train)

MultinomialNB()
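
For intuition about what the classifier learns from the word counts, here is a small hand-written sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, the same default value that `MultinomialNB` uses). The functions `train_nb` and `predict_nb` and the toy messages are made up for this illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate a log prior and smoothed log word likelihoods for each class."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        log_prior = math.log(class_counts[c] / len(labels))
        # Additive smoothing keeps unseen (class, word) pairs from getting probability 0
        log_likelihood = {
            w: math.log((word_counts[c][w] + alpha) / (total_words + alpha * len(vocab)))
            for w in vocab
        }
        model[c] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score for a tokenized message."""
    scores = {}
    for c, (log_prior, log_likelihood) in model.items():
        # Words that never occurred in training are skipped
        scores[c] = log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(scores, key=scores.get)

# Toy training data: 1 = spam, 0 = ham
toy_docs = [["free", "prize", "win"], ["free", "cash"],
            ["see", "you", "later"], ["ok", "see", "you"]]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, ["free", "cash", "prize"]))  # 1
print(predict_nb(toy_model, ["see", "you"]))             # 0
```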

In [8]:
# Test on the test data, try prediction
predictions = myNB.predict(test_data)

In [60]:
my_data = count_vector.transform(["Free entry"])
my_data

<1x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [61]:
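# Store the single-message result under its own name,
# so that `predictions` keeps the results for the whole test set
my_prediction = myNB.predict(my_data)

In [62]:
my_prediction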

array([1], dtype=int64)

In [12]:
predictions.shape

(1393,)

## Step 4: Validate

In [13]:
# Validate the accuracy of the predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
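
The four scores printed above can be traced back to the counts of true and false positives and negatives. The sketch below uses counts that were inferred from the reported scores and the test-set size of 1393; the notebook does not print a confusion matrix, so treat these counts as an assumption. The function name `classification_scores` is hypothetical.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted spam that really is spam
    recall = tp / (tp + fn)      # share of real spam that was detected
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed counts, inferred from the scores reported above (not printed by the notebook):
# 174 spam detected, 5 ham flagged as spam, 11 spam missed, 1203 ham kept
accuracy, precision, recall, f1 = classification_scores(tp=174, fp=5, fn=11, tn=1203)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9885 0.9721 0.9405 0.956
```

Under these counts, the classifier makes 16 mistakes on 1393 test messages, and most of them are spam messages that slip through as ham.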


## <span style="color:red">Task</span>
Repeat the training, testing and validation with the Decision Tree method previously researched.
Upload in the Assignment section the answer to the question: Which of the two methods gives better results?
Provide evidence for your answer.

In [29]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
clf = DecisionTreeClassifier()

In [35]:
clf.fit(training_data, y_train)

DecisionTreeClassifier()

In [37]:
# Test on the test data, try prediction
predictions = clf.predict(test_data)
predictions

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [43]:
# Validate the accuracy of the predictions
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))

Accuracy score:  0.9626704953338119
Precision score:  0.844559585492228
Recall score:  0.8810810810810811
F1 score:  0.8624338624338624


### Results

In [39]:
myNB.score(training_data, y_train)

0.9923426657094999

In [40]:
clf.score(training_data, y_train)

1.0

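## Conclusion: 
### Multinomial Naive Bayes (myNB) gives the better results: on the test data it beats the DecisionTreeClassifier (clf) on accuracy (0.989 vs. 0.963), precision (0.972 vs. 0.845), recall (0.941 vs. 0.881) and F1 (0.956 vs. 0.862). The perfect score of clf on the training data (1.0 vs. 0.992 for myNB) does not make it the better model; combined with its lower test scores, it suggests that the tree overfits the training data.
### The lower recall of clf means it produces more false negatives (spam that is missed) than myNB. The same holds for precision: the lower precision of clf means it also produces more false positives (ham flagged as spam).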
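
One quick way to compare the two models is to look at the gap between training and test accuracy. The sketch below uses the accuracy values from the outputs above, rounded to four decimals; a large gap is usually read as a sign of overfitting, although a single train/test split is only rough evidence.

```python
# Accuracy values copied (rounded) from the notebook outputs above
scores = {
    "myNB": {"train": 0.9923, "test": 0.9885},
    "clf": {"train": 1.0, "test": 0.9627},
}

# Gap between training and test accuracy for each model
gaps = {name: round(s["train"] - s["test"], 4) for name, s in scores.items()}
print(gaps)
# {'myNB': 0.0038, 'clf': 0.0373}
```

The decision tree fits the training data perfectly but loses almost four percentage points on unseen messages, while Naive Bayes performs nearly the same on both sets.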