## Text classification: Spam or Ham

In this example, based on the classic Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/spambase), we will build our own spam filter using the scikit-learn library. The dataset is a corpus of 5,574 text messages, each labeled "spam" or "ham".

### Data

The data are attached to the task description for your convenience.

In [59]:
import pandas as pd
df = pd.read_csv('3_data.csv', encoding='latin-1')

We keep only the two columns of interest, the labels and the text messages:

In [60]:
df = df[['v1', 'v2']]
df = df.rename(columns = {'v1': 'label', 'v2': 'text'})
df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Delete duplicates:

In [61]:
df = df.drop_duplicates('text')

Change labels to binary:

In [62]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
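
One thing worth checking after these two steps is that every label was actually mapped: `Series.map` leaves `NaN` for any value missing from the dictionary. The sketch below illustrates this on a small made-up frame (the `toy` DataFrame and its rows are assumptions for illustration, not part of the task data).

```python
import pandas as pd

# Hypothetical toy data: the first and third rows share the same text.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "text": ["see you soon", "win a prize now", "see you soon"],
})

# Keep the first occurrence of each text; .copy() avoids chained-assignment warnings.
toy = toy.drop_duplicates("text").copy()

# Map string labels to integers; unmapped values would become NaN.
toy["label"] = toy["label"].map({"ham": 0, "spam": 1})

# Fail early if any label was not covered by the mapping.
assert toy["label"].notna().all()
print(toy)
```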

### Text pre-processing (Task 1)

We need to complete the text pre-processing function so that it:
* converts the text to lowercase;
* removes stop words;
* removes punctuation marks;
* normalizes the text using the Snowball stemmer.

We recommend using the NLTK library so that you do not have to compile a list of stop words or implement the stemming algorithm yourself. Examples of applying stemmers can be found at https://www.nltk.org/howto/stem.html.

In [63]:
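from nltk import stem
from nltk.corpus import stopwords
import re

stemmer = stem.SnowballStemmer('english')
# Requires the NLTK stop-word corpus, e.g. via nltk.download('stopwords').
# A separate name keeps the imported `stopwords` module from being shadowed.
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Remove punctuation marks
    text = re.sub(r'[^\w\s]', '', text)
    # Your code here
    text = text.split(sep=" ")
    # Drop stop words (matched on the original casing), then lowercase
    text = [word.lower() for word in text if word not in stop_words]
    # Reduce each word to its stem
    text = [stemmer.stem(word) for word in text]
    return " ".join(text)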

Check that the function works correctly:

In [64]:
assert preprocess("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.") == "im gonna home soon dont want talk stuff anymor tonight k ive cri enough today"
assert preprocess("Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...") == "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
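
The order of the steps matters. The `preprocess` function above matches stop words before lowercasing, so a capitalized stop word such as "The" is not removed. The sketch below shows the difference using only the standard library; `TOY_STOP_WORDS` and both helper functions are hypothetical names introduced for illustration, and the tiny word list stands in for NLTK's English list.

```python
import re

# Hypothetical, hand-written stop-word list used only for this illustration.
TOY_STOP_WORDS = {"the", "is", "to", "a"}

def filter_then_lower(text):
    """Match stop words on the original casing, then lowercase."""
    words = re.sub(r"[^\w\s]", "", text).split()
    return [w.lower() for w in words if w not in TOY_STOP_WORDS]

def lower_then_filter(text):
    """Lowercase first, then match stop words."""
    words = re.sub(r"[^\w\s]", "", text).lower().split()
    return [w for w in words if w not in TOY_STOP_WORDS]

message = "The prize is yours, call to claim!"
print(filter_then_lower(message))  # keeps 'the', because "The" != "the"
print(lower_then_filter(message))  # drops it
```

Both asserted examples pass with either order, but lowercasing first would probably change the features and therefore the reference metrics reported in the self-check, so treat it as an experiment rather than a drop-in fix.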

Apply to the text:

In [65]:
df['text'] = df['text'].apply(preprocess)

### Split the data into training and test sets (Task 2)

In [66]:
y = df['label'].values

Now we need to split the data into training and test sets. The scikit-learn library provides a ready-made tool for this.

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.2, random_state=41)
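
The split above is purely random. Because spam is the minority class, one option to experiment with is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The following is a minimal sketch on placeholder data (the `texts` and `labels` arrays are made up for illustration); using it in the notebook would change the split and therefore the reference results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder messages, 20% of them labeled spam (1).
texts = np.array([f"message {i}" for i in range(100)])
labels = np.array([1] * 20 + [0] * 80)

# stratify=labels preserves the 80/20 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=41, stratify=labels
)

print("spam share in train:", y_tr.mean())
print("spam share in test: ", y_te.mean())
```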

### Classifier training (Task 3)

Now we move on to training the classifier.

First we extract features from the texts. It is strongly recommended to try several methods and check how each one influences the result (more information on different text representation methods is available at https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

Then we train the classifier. We use an SVM, but you can try other algorithms.

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# extract features from the texts
vectorizer = TfidfVectorizer(decode_error='ignore')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
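
To get a feel for how the two imported vectorizers differ, the sketch below fits both on a three-message toy corpus (the `corpus` list is an assumption for illustration). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` reweights them by inverse document frequency and, with its default settings, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, already pre-processed.
corpus = ["free prize call now", "call me later", "free free entry"]

# Raw term counts per message.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

# TF-IDF weights over the same vocabulary.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)

free_col = count_vec.vocabulary_["free"]
print("matrix shapes:", counts.shape, tfidf.shape)
print("count of 'free' in third message:", counts[2, free_col])

# With the default norm='l2', every TF-IDF row has length 1.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print("TF-IDF row norms:", row_norms)
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the cell above is enough to compare the two representations with the same classifier.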

In [69]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# train the SVM model

model = LinearSVC(random_state = 41, C = 1.1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
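
One possible way to try different algorithms is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that swapping the classifier is a one-line change and raw text can be passed straight to `predict`. The sketch below is an illustration under assumptions: the toy messages, the `candidates` dictionary, and the choice of `MultinomialNB` as the alternative are not part of the original task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data: 1 = spam, 0 = ham.
train_texts = [
    "free prize call now",
    "win cash prize today",
    "urgent claim your free entry",
    "see you at home tonight",
    "ok talk to you later",
    "are you coming for dinner",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Classifiers to compare; each gets its own freshly fitted vectorizer.
candidates = {
    "LinearSVC": LinearSVC(random_state=41, C=1.1),
    "MultinomialNB": MultinomialNB(),
}

fitted = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    fitted[name] = pipe
    print(name, pipe.predict(["free prize, call now"]))
```

On the real data, each fitted pipeline could be evaluated with `classification_report` in the same way as the model above.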

Self-check: if the function ```preprocess``` is completed correctly, you should get the following model evaluation results.

In [70]:
print(classification_report(y_test, predictions, digits=3))

              precision    recall  f1-score   support

           0      0.979     0.996     0.987       898
           1      0.967     0.860     0.911       136

    accuracy                          0.978      1034
   macro avg      0.973     0.928     0.949      1034
weighted avg      0.978     0.978     0.977      1034
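
To see where these numbers come from, the spam-class metrics can be recomputed from confusion-matrix counts. The counts below are not printed by the notebook; they are an assumption, inferred as the integer values consistent with the rounded report and the supports of 898 and 136.

```python
# Counts for the spam class (1), inferred from the rounded report above.
tp = 117  # spam predicted as spam
fp = 4    # ham predicted as spam
fn = 19   # spam predicted as ham
tn = 894  # ham predicted as ham

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Read this way, the report says the model rarely flags ham as spam, while roughly one spam message in seven slips through.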



Let's predict the label for the following message:

In [71]:
txt = "Take your prize, more than 100 computers, smartphones and TVs are supposed to be played in a free quiz. Call by phone 8 800 243 456"
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [72]:
model.predict(txt)

array([1], dtype=int64)

The prediction is 1, so the message is classified as spam.