# SMS spam

## Implementation steps

1. Data preprocessing
  - take a copy of the SMS Spam Collection dataset
  - split it 90/10 into train and test sets
2. Data modelling
  - prepare a vector representation of the messages
  - find a robust classifier
3. Data and results visualization
  - decision boundaries

### Acknowledgements 
- [SMS Spam Classifier (Natural Language Processing)](https://medium.com/analytics-vidhya/sms-spam-classifier-natural-language-processing-1751e2b324ed)
- [Create a SMS spam classifier in python](https://towardsdatascience.com/create-a-sms-spam-classifier-in-python-b4b015f7404b)




## Technical stack used:
- [dataset](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) 
- **TF-IDF** (Term Frequency-Inverse Document Frequency) **Vectorizer** - because it weights each word by how often it occurs in a message, discounted by how common the word is across all messages
- **Linear SVC Model** - because it showed relatively good results in comparison with Linear Regression and the MNNB (Multinomial Naive Bayes) model.
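
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation; the messages and helper names are invented for illustration and are not part of the notebook.

```python
import math
from collections import Counter

# Toy messages (hypothetical, not taken from the dataset)
docs = [
    "free entry win cash",
    "ok see you later",
    "win free prize now",
    "see you at home",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenized message."""
    counts = Counter(tokens)
    raw = {t: c * smoothed_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term:>6}: {weight:.3f}")
```

In the first message, "entry" and "cash" appear in only one message each, so they receive a larger weight than "free" and "win", which appear in two.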

### Dataset description (from README):
- There are 4,827 legitimate (ham) SMS messages (86.6%) and 747 spam messages (13.4%).
- The file contains one message per line. Each line has two columns: the label (ham or spam) and the raw text.
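
As a quick sanity check of the figures above, the following sketch recomputes the class shares from the quoted counts and parses a few lines in the described layout. The sample lines are hypothetical stand-ins, and the tab separator is assumed from the conversion step in the notebook.

```python
from collections import Counter

# Counts quoted in the dataset description
ham_total, spam_total = 4827, 747
total = ham_total + spam_total
ham_share = ham_total / total
spam_share = spam_total / total
print(f"{total} messages: {ham_share:.1%} ham, {spam_share:.1%} spam")

# Hypothetical lines in the "label<TAB>raw text" layout
sample_lines = [
    "ham\tOk lar... Joking wif u oni...",
    "spam\tYou have won a prize, reply to claim",
    "ham\tSee you at home",
]

# Split only on the first tab so the message text stays intact
parsed = [line.split("\t", 1) for line in sample_lines]
label_counts = Counter(label for label, _ in parsed)
print(label_counts)
```

Because ham makes up about 86.6% of the data, a classifier that always predicts ham would already reach that accuracy, which is why the recall on the spam class matters more than overall accuracy.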

In [1]:
from google.colab import files

import csv
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from sklearn.svm import LinearSVC

In [2]:
uploaded = files.upload()

Saving SMSSpamCollection.txt to SMSSpamCollection.txt


In [3]:
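# convert .txt dataset file into .csv for convenience
filename = list(uploaded.keys())[0]

with open(filename, 'r', encoding='utf-8') as in_file:
    stripped = (line.strip() for line in in_file)
    # split on the first tab only, so each row has exactly two columns
    lines = (line.split("\t", 1) for line in stripped if line)
    # newline='' stops the csv module from writing blank rows on some platforms
    with open('SMSSpamCollection.csv', 'w', newline='', encoding='utf-8') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('Label', 'Message'))
        writer.writerows(lines)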
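
As an alternative to the intermediate .csv file, pandas can read tab-separated text directly. The sketch below uses an in-memory string with hypothetical rows instead of the uploaded file, so it runs on its own. The column names mirror the ones used in the notebook, and `quoting=csv.QUOTE_NONE` is an assumption that quote characters inside messages should be kept verbatim.

```python
import csv
import io

import pandas as pd

# Toy tab-separated content standing in for the raw file (hypothetical rows)
raw = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWin a \"free\" prize, reply now\n"
    "ham\tSee you at home, ok?\n"
)

frame = pd.read_csv(
    io.StringIO(raw),        # replace with the file path when reading from disk
    sep="\t",                # columns are separated by a tab
    header=None,             # the raw file has no header row
    names=["Label", "Message"],
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
)
print(frame)
```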

In [4]:
# load the dataset, encode the labels and split into training and testing sets
dataset = pd.read_csv('SMSSpamCollection.csv')
# ham -> 0, spam -> 1; plain column assignment yields an integer column
dataset['Label'] = dataset['Label'].map({'ham': 0, 'spam': 1})
dataset['Length'] = dataset['Message'].apply(len)

print("First 5 rows of the processed dataset:")
print(dataset.head(5))

x = dataset['Message'].values
y = dataset['Label'].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)
print("\n Splitting complete.")

First 5 rows of the processed dataset:
   Label                                            Message  Length
0      0  Go until jurong point, crazy.. Available only ...     111
1      0                      Ok lar... Joking wif u oni...      29
2      1  Free entry in 2 a wkly comp to win FA Cup fina...     155
3      0  U dun say so early hor... U c already then say...      49
4      0  Nah I don't think he goes to usf, he lives aro...      61

 Splitting complete.
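
The split above is purely random. With roughly 13% spam, one optional variation is a stratified split, which keeps the class ratio about the same in both parts. The sketch below uses toy labels rather than the real dataset; applying `stratify` in the notebook would change which messages land in the test set and therefore the reported scores.

```python
from sklearn.model_selection import train_test_split

# Toy data with roughly the dataset's class ratio (87 ham, 13 spam)
labels = [0] * 87 + [1] * 13
messages = [f"message {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.1,
    random_state=0,
    stratify=labels,  # preserve the ham/spam proportions in both splits
)
print(len(y_tr), len(y_te), sum(y_te))
```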


In [5]:
# transform the messages into a numeric representation using a TF-IDF vectorizer;
# ignore terms with a document frequency lower than 5 (min_df parameter)
# and use word n-grams of length 1 to 3 (ngram_range must be a tuple)
vectorizer = TfidfVectorizer(min_df=5, stop_words='english', ngram_range=(1, 3))
x_train_transformed = vectorizer.fit_transform(x_train)
x_test_transformed = vectorizer.transform(x_test)
print('Fitting and transforming complete.')

Fitting and transforming complete.


In [18]:
# build a classifier and make predictions
clf = LinearSVC(penalty='l2', loss='hinge', random_state=0, C=10)
clf.fit(x_train_transformed, y_train)
y_predicted = clf.predict(x_test_transformed)
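
For classifying new messages it can be convenient to bundle the vectorizer and the classifier into one object. The following is a minimal sketch using scikit-learn's `make_pipeline` on a tiny hypothetical training set; it is not the notebook's model, and the messages and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical training set: 1 = spam, 0 = ham
train_texts = [
    "win free prize now",
    "free cash prize claim",
    "claim your free prize",
    "see you at home",
    "are you coming home",
    "call me when you are home",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies TF-IDF first, then feeds the vectors to the SVM
spam_pipeline = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(C=10, random_state=0),
)
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go in; the fitted vectorizer is reused automatically
preds = spam_pipeline.predict(["free prize waiting", "coming home soon"]).tolist()
print(preds)
```

Keeping both steps in one pipeline also avoids accidentally fitting the vectorizer on test data.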

In [19]:
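# analyse the results
# note: roc_auc_score is given hard 0/1 predictions here, so the value equals
# balanced accuracy, (TPR + specificity) / 2, rather than a ranking-based AUC
print('AUC-ROC score of the model: ', roc_auc_score(y_test, y_predicted))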

tn, fp, fn, tp = confusion_matrix(y_test, y_predicted).ravel()
print(f'True Positive Rate: { (tp / (tp + fn))}')
print(f'Specificity: { (tn / (tn + fp))}')
print(f'False Positive Rate: { (fp / (fp + tn))}')

print('Classification report for the model:')
print(classification_report(y_test, y_predicted))

AUC-ROC score of the model:  0.9529239836132339
True Positive Rate: 0.9101123595505618
Specificity: 0.9957356076759062
False Positive Rate: 0.0042643923240938165
Classification report for the model:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       469
           1       0.98      0.91      0.94        89

    accuracy                           0.98       558
   macro avg       0.98      0.95      0.97       558
weighted avg       0.98      0.98      0.98       558
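
The AUC-ROC value printed above, 0.9529, is the mean of the true positive rate (0.9101) and the specificity (0.9957), because `roc_auc_score` received hard 0/1 predictions. A ranking-based AUC needs continuous scores instead, for example the output of `clf.decision_function(x_test_transformed)`. The sketch below illustrates the difference with hand-written toy labels and scores; these numbers are hypothetical and not taken from the model.

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and hypothetical decision scores (signed distance to the boundary)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [-2.0, -1.5, -0.5, 0.3, -0.2, 0.8, 1.2, 2.0]

# Hard labels obtained by thresholding the scores at 0, as predict() does
hard = [1 if s > 0 else 0 for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, hard)    # collapses to balanced accuracy

print(f"AUC from scores: {auc_from_scores}")
print(f"AUC from hard labels: {auc_from_labels}")
```

In this toy case 15 of the 16 spam/ham pairs are ranked correctly, giving 0.9375 from the scores, while the thresholded labels give 0.75.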

