# SPAM DETECTION GUIDE

This notebook presents a step-by-step guide to building and training an ML algorithm for a binary classification task (in this case, spam detection).

## Preparation
In this part you will set up everything you need.

First of all, you need to import all necessary external modules.
In this scenario we will use the following libraries:<br>
tensorflow_hub  -   provides pre-trained word embeddings (requires tensorflow to be installed);<br>
numpy           -   a library for basic math and linear algebra calculations;<br>
pandas          -   provides a convenient way to work with large amounts of data through DataFrames;<br>
sklearn         -   contains many algorithms and tools useful for ML projects.<br>

In [None]:
import tensorflow_hub as hub
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()
import numpy as np
import pandas as pd
import sklearn
from sklearn.ensemble import RandomForestClassifier

Secondly, you need to specify the locations of all external files (for example, your dataset). You should also store all parameters of your code in variables. This way, you will be protected from accidental typos :)

In [None]:
# specify the location of your dataset here
DATA_PATH = ''

# give names to the label column and the text column
COLUMN_LABEL = ''
COLUMN_TEXT = ''

# these are the labels that indicate the type of message
LABEL_LEGIT = 'LEGI'
LABEL_SPAM = 'SPAM'
LABEL_SMISHING = 'SMIS'

## Dataset 
In this part, we will load the dataset and check it for errors.<br>
Here the dataset is loaded from a file into a DataFrame. A DataFrame is basically a table with named columns where all the data is stored.<br>
You can access those columns by name:<br>
<code> column = dataframe\[column_name\] </code>
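
As a small illustration of column access, the sketch below builds a tiny DataFrame by hand. The column names `label` and `text` and the sample messages are made up for this example and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for this sketch
toy = pd.DataFrame({
    'label': ['LEGI', 'SPAM', 'LEGI'],
    'text': ['See you at 5', 'WIN a free prize now', 'Call me later'],
})

# Access a single column by name (returns a pandas Series)
toy_labels = toy['label']

# Select rows with a boolean mask built from a column
toy_spam = toy[toy['label'] == 'SPAM']

print(toy_labels.tolist())
print(toy_spam.shape[0])
```

The same masking pattern is what the counting and filtering cells in this part rely on.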

In [None]:
dataset = pd.read_csv(DATA_PATH, sep='\t', names=[COLUMN_LABEL, COLUMN_TEXT], header=None)
print('Total size:', dataset.shape[0])
print('Legit messages:', dataset[dataset[COLUMN_LABEL] == LABEL_LEGIT].shape[0])
print('Spam messages:', dataset[dataset[COLUMN_LABEL] == LABEL_SPAM].shape[0])
print('Smishing messages:', dataset[dataset[COLUMN_LABEL] == LABEL_SMISHING].shape[0])

For now we don't really need smishing messages, so we will remove them.

In [None]:
dataset = dataset[((dataset[COLUMN_LABEL] == LABEL_LEGIT) | (dataset[COLUMN_LABEL] == LABEL_SPAM))]

# Let's check if they are gone
print('Smishing messages:', dataset[dataset[COLUMN_LABEL] == LABEL_SMISHING].shape[0])

If the dataset looks fine, you may proceed to the next step.

## Data preprocessing
The difference between humans and machines is that machines don't understand words, so it is very difficult for them to process human language.<br>
To help them, scientists invented a number of methods to translate words and sentences into sets of numbers, which machines can work with. Those methods are called "embeddings".<br>
The point of an embedding is to map words (or complete sentences) into a new feature space (usually of high dimensionality), so that every word (sentence) is represented as a vector in this space.<br>
In this example we will use ELMo embeddings (but you can use something else if you wish).<br>
You can also use a simpler approach and transform your messages into feature vectors using properties of the message itself (for example, the length of the message or the number of URLs it contains).<br>
Adding more informative features usually improves your chances of success, although irrelevant features can also add noise.
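
The simpler approach mentioned above could look roughly like the sketch below. The function name `handcrafted_features`, the URL pattern, and the choice of three features are assumptions made for illustration; the pattern is deliberately simple and does not cover every valid URL.

```python
import re

# Simple pattern for URLs; an assumption for this sketch, not a full URL grammar
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def handcrafted_features(messages):
    '''Returns one [length, number of URLs, number of digits] row per message.'''
    rows = []
    for message in messages:
        n_urls = len(URL_PATTERN.findall(message))
        n_digits = sum(ch.isdigit() for ch in message)
        rows.append([len(message), n_urls, n_digits])
    return rows

example = handcrafted_features([
    'Hi, call me',
    'Win 100 now: http://example.com www.example.org',
])
print(example)
```

Such rows can be wrapped in `np.asarray(...)` and used instead of, or next to, the embedding vectors.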

In [None]:
def messages2vectors(messages):
    '''
    Transforms messages into feature-vectors;
    Parameters:
        messages    -   array of strings;
    Returns:
        features    -   array of feature-vectors (one 1024-dimensional row per message);
    '''

    # embedding = nlu.load('elmo').predict(messages, output_level='token').elmo_embeddings
    # features = np.matrix(embedding)
    elmo = hub.Module("https://tfhub.dev/google/elmo/1")

    # Plain list, so that slicing below is positional
    messages = list(messages)

    features = np.zeros((0, 1024))
    n = 100  # batch size
    # Step through the messages in batches of n; the last batch may be smaller
    for start in range(0, len(messages), n):
        batch = messages[start : start + n]
        embedds = elmo(batch, signature="default", as_dict=True)["default"]

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            # Harmless if the module has no lookup tables
            sess.run(tf.tables_initializer())
            embedds = sess.run(embedds)
            features = np.concatenate([features, embedds])

    return features
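
The function above processes the messages in batches of 100 rather than all at once. The index arithmetic behind such batching can be looked at in isolation; `iter_batches` below is a hypothetical helper written for this sketch, not part of the notebook.

```python
def iter_batches(items, batch_size):
    '''Yields consecutive slices of items, each with at most batch_size elements.'''
    # range() with a step visits 0, batch_size, 2 * batch_size, ...
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so the last batch is simply shorter
        yield items[start:start + batch_size]

batch_sizes = [len(batch) for batch in iter_batches(list(range(250)), 100)]
print(batch_sizes)  # [100, 100, 50]
```

Every item appears in exactly one batch, and no special case is needed for the final, shorter batch.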

The next step is to prepare the labels. You need to convert all 'LEGI' labels to 0s and all 'SPAM' labels to 1s.<br>
For example: \['LEGI', 'LEGI', 'SPAM', 'LEGI', 'SPAM'\] -> \[0, 0, 1, 0, 1\]
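
One possible way to do this conversion is a vectorized comparison with NumPy. This is only a sketch: the name `convert_labels_sketch` and the `positive_label` parameter are assumptions, and it treats every label other than the positive one as 0.

```python
import numpy as np

def convert_labels_sketch(labels_raw, positive_label='SPAM'):
    '''Returns 1 for every label equal to positive_label and 0 otherwise.'''
    # Element-wise comparison gives a boolean array, which is then cast to 0/1
    return (np.asarray(labels_raw) == positive_label).astype(int)

converted = convert_labels_sketch(['LEGI', 'LEGI', 'SPAM', 'LEGI', 'SPAM'])
print(converted)
```

Returning a NumPy array (rather than a list) makes boolean masks such as `labels == 1` work later on.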

In [None]:
def convert_labels(labels_raw):
    '''
    Transforms labels into numerical values;
    Parameters:
        labels_raw    -   array of text-labels;
    Returns:
        labels        -   array of numerical labels;
    ''' 

    # add your code here


    return labels

Now let's transform the messages into features and convert the labels to numerical values.

In [None]:
features = messages2vectors(dataset[COLUMN_TEXT])
labels = convert_labels(dataset[COLUMN_LABEL])
print(features.shape)
print(labels.shape)

To ensure that our model's predictions are valid, we need to split our data into 2 parts: training and testing.<br>
By their nature, ML algorithms tend to overfit the training data (which means that the algorithm no longer 'predicts' a class based on the input but rather memorizes that a particular input corresponds to a particular output).<br>
Overfitting causes the model to perform much worse on data not present in the training set.<br>
Separate data (independent of the training set) gives us an unbiased estimate of the 'real' performance.
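
For comparison with the hand-written split in this notebook, scikit-learn ships a ready-made helper, `train_test_split`. The sketch below uses synthetic data invented for illustration; `stratify` keeps the spam/legit proportion equal in both parts, and `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 20 samples, 8 'spam' (1) and 12 'legit' (0)
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 8 + [0] * 12)

# stratify=y keeps the spam/legit proportion the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())
```

No sample ends up in both parts, which is what makes the test results independent of training.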

In [None]:
def split_data(features, labels, ratio=0.7):
    '''
    Splits dataset into train/test parts using given ratio;
    Parameters:
        features    -   array of features;
        labels      -   array of corresponding numerical labels (1 = spam, 0 = legit);
        ratio       -   share of samples of each class used for training;
    Returns:
        train_data      -   array of training features;
        train_labels    -   array of training labels;
        test_data       -   array of testing features;
        test_labels     -   array of testing labels;
    '''    


    positive_data = features[labels == 1] # all spam messages
    negative_data = features[labels == 0] # all legit messages

    # We shuffle indices to get random samples later
    random_indices_positive = np.arange(positive_data.shape[0])
    np.random.shuffle(random_indices_positive)
    random_indices_negative = np.arange(negative_data.shape[0])
    np.random.shuffle(random_indices_negative)

    n_positive_train = int(positive_data.shape[0] * ratio)
    n_negative_train = int(negative_data.shape[0] * ratio)

    # Training data are the samples at the first 'ratio' share of the shuffled indices
    train_data = np.concatenate([positive_data[random_indices_positive[:n_positive_train]], 
                                negative_data[random_indices_negative[:n_negative_train]]])
    
    train_labels = np.asarray([1] * n_positive_train + [0] * n_negative_train)

    # Testing data are the samples at all remaining indices
    test_data = np.concatenate([positive_data[random_indices_positive[n_positive_train:]], 
                                negative_data[random_indices_negative[n_negative_train:]]])

    test_labels = np.asarray([1] * (positive_data.shape[0]  - n_positive_train) + [0] * (negative_data.shape[0] - n_negative_train))

    return train_data, train_labels, test_data, test_labels



## Metrics
To see how well (or badly) our spam detector works, we have to use some metrics. In the assignment you are required to compare the FAR and FRR of different algorithms.<br>
FAR (False Acceptance Rate) - the share of positive samples (spam in our case) wrongly predicted as negative (legitimate);<br>
FRR (False Rejection Rate) - the share of negative samples (legitimate) wrongly predicted as positive (spam);<br>
With spam as the positive class, FAR is the False Negative rate and FRR is the False Positive rate (you may also know these errors as Type II and Type I errors, respectively).
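
The definitions above can be turned into a small worked example. This is one possible sketch, not the reference solution: the name `far_frr_sketch` is made up, labels are assumed to be 1 for spam and 0 for legitimate, and both classes are assumed to be present (otherwise a division by zero occurs).

```python
import numpy as np

def far_frr_sketch(labels, predictions):
    '''Returns (FAR, FRR) for binary labels where 1 = spam and 0 = legitimate.'''
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)

    # Spam predicted as legitimate (false negatives)
    false_negatives = np.sum((labels == 1) & (predictions == 0))
    # Legitimate predicted as spam (false positives)
    false_positives = np.sum((labels == 0) & (predictions == 1))

    far = false_negatives / np.sum(labels == 1)
    frr = false_positives / np.sum(labels == 0)
    return far, frr

# 4 spam messages (1 missed) and 6 legitimate messages (2 flagged as spam)
far, frr = far_frr_sketch(labels=[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                          predictions=[1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
print(far, frr)
```

Here FAR is 1/4 and FRR is 2/6, since each rate is normalized by the size of its own true class rather than by the total number of samples.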

Your task is to compute FAR and FRR from the given true labels and predicted labels.

In [None]:
def get_metrics(labels, predictions):
    '''
    Computes metrics;
    Parameters:
        labels       -   array of true labels;
        predictions  -   array of predicted labels;
    Returns:
        FAR -   False Acceptance Rate;
        FRR -   False Rejection Rate;
    '''  
    # add your code here
    FAR =
    FRR = 
    return FAR, FRR

## Model initialization
In this part we will create a classifier and set it up. We will use Random Forest as an example (note that in your assignment you must use all of the given algorithms).<br>
Note that every algorithm has its own set of parameters (called hyperparameters).<br>
Also, algorithms from the <code>sklearn</code> library usually share the common methods <code>fit(X, Y)</code> and <code>predict(X)</code>.
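
To show how a hyperparameter dictionary and the `fit`/`predict` pair fit together, here is a minimal sketch on synthetic data. The data, the variable name `toy_hyperparameters`, and the small `n_estimators` value are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, well-separated toy data (an assumption for this sketch)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(20, 4)),
                    rng.normal(5.0, 1.0, size=(20, 4))])
y = np.array([0] * 20 + [1] * 20)

toy_hyperparameters = {'n_estimators': 10, 'max_depth': None, 'random_state': 0}

# ** unpacks the dictionary into keyword arguments of the constructor
model = RandomForestClassifier(**toy_hyperparameters)
model.fit(X, y)                 # learn from features X and labels y
predictions = model.predict(X)  # one predicted label per row of X

print(predictions.shape)
```

Because other `sklearn` classifiers expose the same two methods, swapping the class and the dictionary is usually enough to try a different algorithm.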

In [None]:
classifierType = sklearn.ensemble.RandomForestClassifier
hyperparameters = {'n_estimators' : 100,
                'criterion' : 'gini',
                'max_depth' : None,
                'min_samples_split' : 2}

## Model Training and evaluation
Complete the function below to do the following:<br>
1) Split <code>features</code> and <code>labels</code> into training and testing sets;<br>
2) Fit your model on the training set using the <code>fit(X, Y)</code> method;<br> 
3) Make predictions for the training set using the <code>predict(X)</code> method;<br>
4) Compute FAR/FRR for the training set;<br>
5) Make predictions for the testing set;<br>
6) Compute FAR/FRR for the testing set.<br>
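
The six steps can be sketched end to end on synthetic data as follows. This is an illustration under assumptions, not the expected solution: it uses scikit-learn's `train_test_split` instead of the notebook's own split function, and the helper `error_rates` and all data are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def error_rates(labels, predictions):
    '''FAR = share of spam (1) predicted as 0; FRR = share of legit (0) predicted as 1.'''
    far = np.mean(predictions[labels == 1] == 0)
    frr = np.mean(predictions[labels == 0] == 1)
    return far, frr

# Synthetic stand-in for the real features and labels
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(50, 4)),
                    rng.normal(3.0, 1.0, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

# 1) split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
# 2) fit the model on the training part only
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
# 3-4) predictions and metrics for the training set
train_far, train_frr = error_rates(y_train, model.predict(X_train))
# 5-6) predictions and metrics for the testing set
test_far, test_frr = error_rates(y_test, model.predict(X_test))

print('Train FAR/FRR:', train_far, train_frr)
print('Test FAR/FRR:', test_far, test_frr)
```

A large gap between the train and test rates is a typical sign of overfitting.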


In [None]:
def evaluate(classifierType, hyperparameters, features, labels):
    '''
    Trains a model and computes FAR/FRR on the train and test parts of the dataset;
    Parameters:
        classifierType      -   type of ML algorithm to use;
        hyperparameters     -   dictionary of model's parameters;
        features            -   array of features;
        labels              -   array of labels;
    Returns:
        trainFAR    -   False Acceptance Rate for train dataset;
        trainFRR    -   False Rejection Rate for train dataset;
        testFAR     -   False Acceptance Rate for test dataset;
        testFRR     -   False Rejection Rate for test dataset;
    '''    

    model = classifierType(**hyperparameters)

    # Split data
    # add your code here
    train_data, train_labels, test_data, test_labels = 

    print('Train set shape:', train_data.shape)
    print('Train labels shape:', train_labels.shape)
    print('Test set shape:', test_data.shape)
    print('Test labels shape:', test_labels.shape)

    # Fit your model
    # add your code here


    # Make predictions for training dataset
    # add your code here


    # Compute train FAR/FRR
    # add your code here
    trainFAR, trainFRR = 

    # Make predictions for testing dataset
    # add your code here
    predictions_test = 

    # Compute test FAR/FRR
    # add your code here
    testFAR, testFRR = 

    return trainFAR, trainFRR, testFAR, testFRR

In [None]:
# check if it works :)
trainFAR, trainFRR, testFAR, testFRR = evaluate(classifierType, hyperparameters, features, labels)
print('Train:')
print('\tFAR:', trainFAR)
print('\tFRR:', trainFRR)

print('Test:')
print('\tFAR:', testFAR)
print('\tFRR:', testFRR)

## Final Task
Combine what you have learned and compute FAR/FRR metrics for the other algorithms from the assignment.<br>
Don't forget to fill in the report and share it on Google Class.<br>
Good luck :)
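
One way to organize such a comparison is a dictionary that maps a name to a classifier type and its hyperparameters. In the sketch below the three candidates, their hyperparameters, and the synthetic data are all assumptions; replace them with the algorithms from your assignment and your real features. The rates are read from scikit-learn's `confusion_matrix`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical candidates; replace them with the algorithms from your assignment
candidates = {
    'random_forest': (RandomForestClassifier, {'n_estimators': 10, 'random_state': 0}),
    'logistic_regression': (LogisticRegression, {'max_iter': 1000}),
    'knn': (KNeighborsClassifier, {'n_neighbors': 3}),
}

rng = np.random.default_rng(0)

def make_data(n_per_class):
    '''Synthetic stand-in data: class 0 around 0.0, class 1 around 3.0.'''
    X = np.concatenate([rng.normal(0.0, 1.0, size=(n_per_class, 4)),
                        rng.normal(3.0, 1.0, size=(n_per_class, 4))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(20)

results = {}
for name, (classifier_type, params) in candidates.items():
    model = classifier_type(**params).fit(X_train, y_train)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(
        y_test, model.predict(X_test), labels=[0, 1]).ravel()
    results[name] = {'FAR': fn / (fn + tp), 'FRR': fp / (fp + tn)}

for name, rates in results.items():
    print(name, rates)
```

Keeping the candidates in one dictionary means every algorithm is evaluated on exactly the same split.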

In [None]:
# Add your code here




