# Preprocessing

## Agenda
1. Identify the tokenization rules that will be used to split the content into words
2. Convert SMS content into word vectors

In [1]:
import pandas as pd

In [2]:
spam_df = pd.read_csv(filepath_or_buffer="./spam.csv")

In [3]:
spam_df

Unnamed: 0,Spam,Text
0,False,"Go until jurong point, crazy.. Available only ..."
1,False,Ok lar... Joking wif u oni...
2,True,Free entry in 2 a wkly comp to win FA Cup fina...
3,False,U dun say so early hor... U c already then say...
4,False,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,True,This is the 2nd time we have tried 2 contact u...
5570,False,Will ü b going to esplanade fr home?
5571,False,"Pity, * was in mood for that. So...any other s..."
5572,False,The guy did some bitching but I acted like i'd...


## Tokens
From: Almeida, T. A., Hidalgo, J. M. G., & Yamakami, A. (2011, September). Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM symposium on Document engineering (pp. 259-262).
1. Tokens start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern.
2. Any sequence of characters separated by blanks, tabs, returns, dots, commas, colons and dashes is considered a token.
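
The two rules can be approximated with separator-based splitting. This is a minimal sketch under one possible reading of the paper's wording, not the authors' implementation: rule 1 is assumed to split on whitespace, dots, commas and colons, and rule 2 additionally on dashes. The names `tok1`, `tok2` and `split_tokens` are hypothetical.

```python
import re

# Assumed reading of rule 1: whitespace, dots, commas and colons separate tokens.
TOK1_SEPARATORS = re.compile(r"[\s.,:]+")
# Assumed reading of rule 2: the same separators plus dashes.
TOK2_SEPARATORS = re.compile(r"[\s.,:\-]+")


def split_tokens(text, separators):
    """Split text on the given separator pattern and drop empty strings."""
    return [token for token in separators.split(text) if token]


def tok1(text):
    return split_tokens(text, TOK1_SEPARATORS)


def tok2(text):
    return split_tokens(text, TOK2_SEPARATORS)


print(tok1("I was under-prepared?"))  # ['I', 'was', 'under-prepared?']
print(tok2("I was under-prepared?"))  # ['I', 'was', 'under', 'prepared?']
```

Under this reading, question marks and exclamation marks are never treated as separators, so they stay attached to the preceding word.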

### Criticism
The tokenization rules described in the paper do not adequately capture the words in this data set, because some unnecessary punctuation remains attached to the tokens.

**Example:**
> ``` Python
> # Example sentences
> "I was under-prepared."
> "I was under-prepared?"
> "I was under-prepared!"
> 
> # Rule 1: tokens for each sentence, in order
> tok1 = "I", "was", "under-prepared"
> tok1 = "I", "was", "under-prepared?"
> tok1 = "I", "was", "under-prepared!"
> 
> # Rule 2: tokens for each sentence, in order
> tok2 = "I", "was", "under", "prepared"
> tok2 = "I", "was", "under", "prepared?"
> tok2 = "I", "was", "under", "prepared!"
> ```

**Expectation:**
> ``` Python
> "I was under-prepared."
> "I was under-prepared?"
> "I was under-prepared!"
>
> tok = "I", "was", "under", "prepared"
> tok = "I", "was", "under", "prepared"
> tok = "I", "was", "under", "prepared"
> ```

### Solution
[Regular expression solution from sklearn library](https://github.com/scikit-learn/scikit-learn/blob/d5082d32d/sklearn/feature_extraction/text.py#L350): `(?u)\b\w\w+\b`

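By default, `TfidfVectorizer` converts every document in the corpus to lower case and then extracts tokens with this pattern. The pattern matches runs of two or more word characters (letters, digits and underscores), so punctuation acts as a separator and never ends up inside a token. One side effect is that single-character tokens such as "I" are dropped.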
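
A minimal sketch of how this pattern behaves on the example sentences, using only Python's `re` module rather than the vectorizer itself. The helper name `sklearn_style_tokens` is hypothetical, and the explicit `lower()` call is an assumption standing in for the vectorizer's default lower-casing.

```python
import re

# Token pattern quoted above: two or more consecutive word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def sklearn_style_tokens(text):
    """Lower-case the text, then return every match of the token pattern."""
    return TOKEN_PATTERN.findall(text.lower())


for sentence in (
    "I was under-prepared.",
    "I was under-prepared?",
    "I was under-prepared!",
):
    # Prints ['was', 'under', 'prepared'] for each sentence.
    print(sklearn_style_tokens(sentence))
```

All three sentences yield the same tokens, which is the behaviour asked for in the expectation, apart from the one-letter word "I" being discarded.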

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Build the TF-IDF matrix: one row per document (SMS), one column per vocabulary word
vectorizer = TfidfVectorizer()
spam_sparse = vectorizer.fit_transform(spam_df["Text"])
# Convert the sparse matrix to a dense NumPy array (memory-heavy for large vocabularies)
spam_dense = spam_sparse.toarray()
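
To make the scores in this matrix concrete, the following sketch recomputes TF-IDF in plain Python on a three-message toy corpus. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows). The function name `toy_tfidf` and the toy messages are made up for illustration, and this is not the library's implementation.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of word j in doc i."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency (assumed library default).
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[word] * idf[word] for word in vocab]
        # L2-normalise the row; an empty document keeps an all-zero row.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = toy_tfidf(["Free entry to win", "Ok lar", "Win win, free!"])
print(vocab)  # ['entry', 'free', 'lar', 'ok', 'to', 'win']
for row in rows:
    print([round(score, 3) for score in row])
```

Words that do not occur in a message score 0, and within a message a word that is repeated more often, or that is rarer across the corpus, receives a larger share of the row's unit length.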

In [6]:
# Vocabulary words, in the same order as the columns of the TF-IDF matrix
feature_names = vectorizer.get_feature_names_out()

## Export Results

In [7]:
spam_vec = pd.DataFrame(
    spam_dense, columns=feature_names
)

In [8]:
spam_vec.to_csv(
    path_or_buf="spam_vec.csv", index=False
)