# Exercise 2: Spam Detection
From github: https://github.com/datsoftlyngby/soft2022spring-DS/blob/main/Code/E10-2-Spam-Detection.ipynb

Spam Data Set: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

The objective is to train a model that can automatically detect spam messages.
We will rely on a few common observations about spam:

- messages containing words like 'free', 'win', 'winner', 'cash', or 'prize' are often spam
- spam messages tend to contain words written in all capitals
- spam messages tend to use a lot of exclamation marks
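
As a rough illustration of these three observations, the sketch below counts each signal in a single message. The function name `spam_signals`, the keyword set `SPAM_WORDS`, and the sample message are hypothetical and used only for illustration; this is not the model trained in this exercise.

```python
import re

# Hypothetical keyword list, built from the example words named above
SPAM_WORDS = {"free", "win", "winner", "cash", "prize"}

def spam_signals(message):
    """Count three simple spam indicators in one message (illustrative only)."""
    # Extract alphabetic words, ignoring digits and punctuation
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # Words that appear in the keyword list, compared case-insensitively
        "spam_words": sum(w.lower() in SPAM_WORDS for w in words),
        # Words of at least two letters written entirely in capitals
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        # Number of exclamation marks anywhere in the message
        "exclamation_marks": message.count("!"),
    }

# Made-up sample message, not taken from the dataset
signals = spam_signals("WINNER!! Claim your FREE cash prize now!")
print(signals)
```

Counts like these could serve as hand-crafted features, but the rest of the exercise works with word frequencies instead.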



## Step 1. Get to know the data

We will be using a dataset from the UCI Machine Learning repository.
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [4]:
import pandas as pd
# The file is a pre-processed, tab-separated table with two columns: a label and a message.
# It has no header row, so we supply the column names when reading it into a pandas DataFrame.
df = pd.read_table('./smsspamcollection/SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])
df.shape

(5572, 2)

In [2]:
df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Step 2: Data processing

### 2.1 Digitize the Labels

In [3]:
# Convert the labels into numerical values, map 'ham' to 0 and 'spam' to 1
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head() 

|   | label | message |
|---|-------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### 2.2 Bag-of-Words Processing

The bag-of-words model represents a piece of text, such as a sentence or a document, as the bag (multiset) of its words. It disregards grammar and even word order but keeps multiplicity. The words are stored as tokens, each with a count of how often it appears.

The processing steps are:

1. Convert strings to lower case
2. Remove punctuation
3. Tokenize the message and give an integer ID to each token
4. Count frequencies
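
The four steps above can be sketched with the Python standard library alone. The sample `documents` and the variable names below are hypothetical, chosen only to make the steps concrete. Splitting on whitespace is an assumed, simplified tokenizer.

```python
import string
from collections import Counter

# Hypothetical sample messages, for illustration only (not taken from the dataset)
documents = [
    "Hello, how are you!",
    "Win money, win from home.",
    "Call me now.",
    "Hello, call hello you tomorrow?",
]

# 1. Convert strings to lower case
lowered = [doc.lower() for doc in documents]

# 2. Remove punctuation (every character listed in string.punctuation)
remove_punctuation = str.maketrans("", "", string.punctuation)
cleaned = [doc.translate(remove_punctuation) for doc in lowered]

# 3. Tokenize on whitespace and give each distinct token an integer ID,
#    assigned in order of first appearance
tokenized = [doc.split() for doc in cleaned]
vocabulary = {}
for tokens in tokenized:
    for token in tokens:
        vocabulary.setdefault(token, len(vocabulary))

# 4. Count how often each token occurs in each message
counts = [Counter(tokens) for tokens in tokenized]

print(vocabulary)
print(counts)
```

Each `Counter` in `counts` is the bag of words for one message: word order is gone, but a repeated word such as 'win' in the second message keeps its count of 2.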

