### Exercise 4: SMS spam classifier

Spam filtering is a text categorization task, like language identification and sentiment analysis.

How do we represent text by means of features?
- _bag of words_: a boolean encoding the presence of each indexed word (`nltk.FreqDist`)
- _term frequency (tf)_: frequency of the indexed word in the sentence (`nltk.text.TextCollection.tf`)
- _term frequency times inverse document frequency (tf-idf)_: as above, but re-weighted to reduce the effect of very common words such as the English word _the_ (`nltk.text.TextCollection.tf_idf`)
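
The three representations above can be compared on a tiny corpus. The following is a minimal sketch that uses scikit-learn's vectorizers rather than the NLTK helpers named in the list; the three messages are made up for illustration, and `CountVectorizer` yields raw counts rather than normalized frequencies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for this illustration
docs = ["free prize free entry", "are you free tonight", "call me tonight"]

# Bag of words: 1 if the word occurs in the message, 0 otherwise
bow = CountVectorizer(binary=True)
X_bow = bow.fit_transform(docs).toarray()

# Term counts: how many times each word occurs in the message
tf = CountVectorizer()
X_tf = tf.fit_transform(docs).toarray()

# tf-idf: counts re-weighted so that words found in many messages count less
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs).toarray()

# All three vectorizers index the same vocabulary in the same order
free_col = bow.vocabulary_["free"]
are_col = bow.vocabulary_["are"]

print("presence of 'free' in message 0:", X_bow[0, free_col])
print("count of 'free' in message 0:   ", X_tf[0, free_col])
print("tf-idf of 'are' vs 'free' in message 1:",
      X_tfidf[1, are_col], X_tfidf[1, free_col])
```

In the second message both _are_ and _free_ occur once, but _free_ appears in more messages of the corpus, so its tf-idf weight is lower.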

#### SMS Spam Collection Data Set

source: UCI repository
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
- 5574 examples
- 2 classes: ham and spam
- Example: `ham    What you doing?how are you?`
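
Each line of the file holds the class label and the message, separated by a tab character. A minimal sketch of parsing one such line follows; `parse_sms_line` is a hypothetical helper name introduced only for this illustration.

```python
def parse_sms_line(line):
    """Split one dataset line into a (label, message) pair."""
    # Split only on the first tab so that tabs inside the message are kept
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


label, message = parse_sms_line("ham\tWhat you doing?how are you?")
print(label)    # class: ham or spam
print(message)  # raw text of the SMS
```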

**Experiment design:**
- single hold-out validation (50% train, 50% test)
- examples randomly shuffled
- punctuation removed
- strings lowercased

**Steps:**
1. Prepare the train and test datasets following the experiment design listed above
2. Using the bag of words, convert the train and test sets into vectors of occurrences, so that they can be used as input for a classifier.
3. Train and test a simple kNN classifier
4. Compute the confusion matrix and analyze the results
5. **Optional:** Perform additional experiments with different experiment design. Some suggestions:
    - A different train/test ratio for the single validation
    - k-Fold cross-validation
    - Different text processing (punctuation vs. no punctuation, strings lowercased vs. not lowercased)
    - Different parameters for the k-NN classifier (k, distance)
    - Different algorithms for the classifier (e.g. SVM)
    
**Optional:** Extend the solution to use lemmas and other preprocessing steps (after tomorrow's class).
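
Several of the optional experiments (k-fold cross-validation, different k-NN parameters, a different algorithm) can be combined in one loop. The sketch below is only one possible setup: the twelve messages are invented, and the choice of 3 folds, the cosine distance and `SVC(kernel="linear")` are assumptions made for illustration, not part of the exercise statement.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data, invented for this illustration (6 ham, 6 spam)
messages = [
    "are you coming home tonight", "see you at lunch tomorrow",
    "call me when you get home", "what are you doing later",
    "i will be late for dinner", "thanks for the lift today",
    "win a free prize now", "free entry claim your prize",
    "urgent claim your free cash now", "you have won a cash prize",
    "free ringtone text now to claim", "urgent call now to win cash",
]
labels = ["ham"] * 6 + ["spam"] * 6

# Candidate classifiers: two k-NN settings and a linear SVM
candidates = {
    "kNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "kNN (k=3, cosine)": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
    "linear SVM": SVC(kernel="linear"),
}

scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is re-fitted on the
    # training part of every fold and never sees the held-out messages
    model = make_pipeline(CountVectorizer(binary=True), clf)
    scores[name] = cross_val_score(model, messages, labels, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary from the test fold into training.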

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from random import shuffle
from nltk.metrics.scores import accuracy
from nltk.metrics import ConfusionMatrix
import string
import nltk

from nltk.metrics.distance import jaccard_distance

In [4]:
# The file is assumed to be UTF-8 encoded
with open('smsspamcollection/SMSSpamCollection', 'r', encoding='utf-8') as f:
    raw_text = f.read()

In [5]:
# Translation table that deletes every punctuation character
table_punct = str.maketrans('', '', string.punctuation)
#table_digits = str.maketrans('', '', string.digits)

# Split the text into lines, skipping empty ones (e.g. the trailing line)
raw_lines = [line for line in raw_text.split('\n') if line]
# Remove punctuation (the tab separator is not punctuation, so it survives)
raw_lines = [line.translate(table_punct) for line in raw_lines]

# Lowercase each line and split it into [label, message] on the first tab
registers = [line.lower().split('\t', 1) for line in raw_lines]

# Shuffle the examples in place (call random.seed(...) first for a reproducible split)
shuffle(registers)

# Integer division: the first half of the set is used for training
length = len(registers) // 2
train = registers[:length]
# And the other half for testing
test = registers[length:]
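
Steps 2 to 4 could then continue along the following lines. This is a minimal sketch on invented data, not the reference solution: `train_toy` and `test_toy` stand in for the `train` and `test` lists of `[label, message]` pairs, k=1 is an arbitrary choice, and the metrics come from `sklearn.metrics` instead of the `nltk.metrics` functions imported earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the train/test lists: [label, message] pairs
train_toy = [
    ["ham", "are you coming home tonight"],
    ["ham", "see you at lunch tomorrow"],
    ["ham", "call me when you get home"],
    ["spam", "win a free prize now"],
    ["spam", "free entry claim your prize"],
    ["spam", "urgent claim your free cash now"],
]
test_toy = [
    ["ham", "are you home"],
    ["spam", "claim your free prize"],
]

train_labels = [label for label, _ in train_toy]
train_texts = [text for _, text in train_toy]
test_labels = [label for label, _ in test_toy]
test_texts = [text for _, text in test_toy]

# Step 2: bag of words. Fit the vocabulary on the training set only,
# then reuse it to encode the test set
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Step 3: train and test a simple k-NN classifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, train_labels)
predicted = knn.predict(X_test)

# Step 4: confusion matrix (rows: true class, columns: predicted class)
cm = confusion_matrix(test_labels, predicted, labels=["ham", "spam"])
acc = accuracy_score(test_labels, predicted)
print(cm)
print("accuracy:", acc)
```

Words that appear only in the test messages are ignored by `transform`, since they are not part of the vocabulary learned from the training set.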
