# Machine Learning for Security Analysts - Workbook <br> Spam Filter (Naive Bayes)

---


Author: GTKlondike
<br>
Email: GTKlondike@gmail.com
<br>
YouTube: [NetSec Explained](https://www.youtube.com/channel/UCsKK7UIiYqvK35aWrCCgUUA)

**Dataset:** https://github.com/NetsecExplained/Machine-Learning-for-Security-Analysts

**Goal:** This workbook will walk you through the steps to build, train, test, and evaluate a Naive Bayes spam classifier from the ground up.

**Outline:** 
* Initial Setup
* Tokenization
* Load Training Data
* Create Predict Function
* Test and Evaluate the Model


## Instructions
To use Jupyter notebooks:
* To run a cell, click the play button to the left of the code or press Shift+Enter
* You will see a busy indicator in the top left area while the runtime is executing
* A number will appear when the cell is done

# Initial Setup
We'll start by downloading the data and loading the needed libraries.

In [0]:
# Download data from Github
! git clone https://github.com/NetsecExplained/Machine-Learning-for-Security-Analysts.git
  
# Install dependencies
! pip install nltk scikit-learn pandas matplotlib seaborn
data_dir = "Machine-Learning-for-Security-Analysts"

In [0]:
# Common imports
import re, os, math, string, json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Import Seaborn heatmap graphs
import seaborn as sns

%matplotlib inline

# Import Natural Language ToolKit library and download dictionaries
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data used by word_tokenize in newer NLTK releases

print("\n### Libraries Imported ###\n")

In [0]:
# Test email from lecture slides
test_email = """
Re: Re: East Asian fonts in Lenny. Thanks for your support.  Installing unifonts did it well for me. ;)
Nima
--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
"""
print(test_email)

# Tokenization
Continuing from where we left off with the slides, we'll start by creating our tokenizer.
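
As a rough picture of what the tokenizer in the next cell does, here is a simplified sketch of the same stages using only the standard library: lowercase the text, split it into words, strip punctuation, and drop stopwords. The `toy_tokenizer` name and the tiny `TOY_STOPWORDS` set are assumptions made up for this example. The workbook's tokenizer uses NLTK's word tokenizer, its full English stopword list, and a Porter stemmer, and stemming is left out here.

```python
import string

# Hypothetical, tiny stopword list for illustration only (NLTK's list is much longer)
TOY_STOPWORDS = {"for", "your", "it", "me", "did", "to", "a", "of", "with"}

def toy_tokenizer(text):
    """Simplified stand-in for the workbook's tokenizer (no stemming)."""
    tokens = []
    for raw in text.lower().split():
        # Strip leading/trailing punctuation such as "." or ";)"
        word = raw.strip(string.punctuation)
        if word and word not in TOY_STOPWORDS:
            tokens.append(word)
    return tokens

print(toy_tokenizer("Thanks for your support. Installing unifonts did it well for me. ;)"))
# ['thanks', 'support', 'installing', 'unifonts', 'well']
```

Notice that the smiley disappears entirely: once the punctuation is stripped, nothing is left of it, so it is dropped.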

In [0]:
# Define tokenizer
#   The purpose of a tokenizer is to separate the features from the raw data

def tokenizer(text):
  """Separates feature words from the raw data
  Keyword arguments:
    text ---- The full email body
    
  :Returns -- The tokenized words; returned as a list
  """
  
  # Retrieve a list of punctuation characters, a list of stopwords, and a stemmer function
  punctuations = list(string.punctuation)
  stopwords = nltk.corpus.stopwords.words('english')
  stemmer = nltk.stem.PorterStemmer()
  
  
  # Set email body to lowercase, separate words and strip out punctuation
  tokens = nltk.word_tokenize(text.lower())
  tokens = [i.strip(''.join(punctuations)) 
            for i in tokens 
            if i not in punctuations]
  
  
  # Use Porter Stemmer on each token
  tokens = [stemmer.stem(i)
            for i in tokens]
  return [w for w in tokens if w not in stopwords and w != ""]


print("\n### Tokenizer defined ###\n")

## Task 1 - Tokenize an email
1. Print the full email, **test_email**
2. Print the results of **tokenizer(test_email)**

In [0]:
# Let's see how our tokenizer changes our email
print("\n- Test Email Body -\n")
# (Write code here)





# Tokenize test email
print("\n - Tokenized Output -\n")
# (Write code here)






## Task 2 - Define readEmail() function
1. Given an email as input, use the **tokenizer()** function to get a list of words
2. Manually collect the word count of each word, and save the results in a *dict()* table
  - Ex: {'earth':2, 'water':9, 'fire':2, 'air':1}
3. Collect the total word count of the email, and save the results as an *int()*
4. Return the *dictionary* table and the word count
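
One possible way to do the counting in steps 2 and 3, shown on a hand-made token list rather than a real email. This is only a sketch: `count_tokens` is a hypothetical helper, not the workbook's `readEmail()`, and it skips the tokenizer step entirely.

```python
def count_tokens(tokens):
    """Count how often each token appears and how many tokens there are in total."""
    table = dict()
    for token in tokens:
        # Start at 0 for unseen tokens, then add one occurrence
        table[token] = table.get(token, 0) + 1
    word_count = len(tokens)
    return table, word_count

table, word_count = count_tokens(["earth", "water", "earth", "air"])
print(table)       # {'earth': 2, 'water': 1, 'air': 1}
print(word_count)  # 4
```

Using `dict.get(token, 0)` avoids a separate "have I seen this word yet?" check.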

In [0]:
# Define email reader
 
def readEmail(email):
  """Reads an email and returns word counts
  Keyword arguments:
    email --- The email body to be read

  :Returns -- The count of each word; returned as a dict
           -- The total number of words; returned as an int
  """
  
  # Retrieve list of tokens
  # (Write code here)
  
  
  
  
  # Build table
  # (Write code here)

  
  
    
  # Get word count
  # (Write code here)

  
  
  

  # (Keep the following lines)
  return table, word_count



# (Keep the following lines)
print("\n### Email Reader Defined ###\n")

## Task 2a (optional) - Read the test_email
1. Collect the results of **readEmail(test_email)**
2. Print the word counts of each word token
3. Print the count of total words 

In [0]:
# Let's see how readEmail interprets our test email
# (Write code here)




print("\n- Word Counts -\n")
# (Write code here)




print("\n- Total Words in Email -\n")
# (Write code here)





# Load Training Data
With our tokenizer defined, let's take a look at our training data.

In [0]:
# 5 things to keep track of:

#   1. The NUMBER of unique words
#     This will be calculated as everything is loaded

unique_words_table = set()


#   2. The TOTAL NUMBER of words in Spam
#   3. The TOTAL NUMBER of words in Ham

spam_table_len = 0
ham_table_len = 0


#   4. The COUNT of each word in Spam
#   5. The COUNT of each word in Ham

spam_table = dict()
ham_table = dict()


print("\n### Initialized Feature Tables ###\n")

## Task 3a - Define learnHam() function
1. Collect the results of a provided email using the **readEmail()** function
2. Update the **unique_words_table** with new word tokens found in the email
3. Add the email word counts to the **ham_table_len** counter
4. Add word tokens and their word counts to the **ham_table**

\* This function should be nearly identical to the *learnSpam()* function
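
The four steps can be pictured on toy data as follows. This is a sketch under assumptions: `merge_counts` is a hypothetical helper that takes the tables as arguments instead of using the workbook's global variables, and its input is a pre-counted email rather than the output of `readEmail()`.

```python
def merge_counts(email_table, email_len, class_table, class_len, unique_words):
    """Fold one email's word counts into a class table; returns the new class length."""
    # Record any words not seen before (sets ignore duplicates)
    unique_words.update(email_table.keys())
    # Add each word's count to the running count for this class
    for word, count in email_table.items():
        class_table[word] = class_table.get(word, 0) + count
    # Grow the class's total word count
    return class_len + email_len

toy_ham_table, toy_ham_len, toy_unique = {"meet": 1}, 1, {"meet"}
toy_ham_len = merge_counts({"meet": 2, "lunch": 1}, 3, toy_ham_table, toy_ham_len, toy_unique)
print(toy_ham_table)       # {'meet': 3, 'lunch': 1}
print(toy_ham_len)         # 4
print(sorted(toy_unique))  # ['lunch', 'meet']
```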

In [0]:
# Define Ham learning

def learnHam(email):
  """Reads an email, updates Ham table and word counts
  Keyword arguments:
    email --- The email body to be read

  :Returns -- N/A
  """
  
  # Include global variables
  global ham_table
  global ham_table_len
  global unique_words_table

  # Read the email
  # (Write code here)
  
  
  
  

  # Add UNIQUE words
  # (Write code here)
  
  
  

  # Add word count to TOTAL number of Ham words
  # (Write code here)
  
  
  

  # Add word tokens to Ham table
  # (Write code here)
  
  

    
# (Keep the following lines)
print("\n### Ham Learner Defined ###\n")

## Task 3b - Define learnSpam() function
1. Collect the results of a provided email using the **readEmail()** function
2. Update the **unique_words_table** with new word tokens found in the email
3. Add the email word counts to the **spam_table_len** counter
4. Add word tokens and their word counts to the **spam_table**

\* This function should be nearly identical to the *learnHam()* function

In [0]:
# Define Spam learning

def learnSpam(email):
  """Reads an email, updates Spam table and word counts
  Keyword arguments:
    email --- The email body to be read

  :Returns -- N/A
  """
  
  # Include global variables
  global spam_table
  global spam_table_len
  global unique_words_table

  # Read the email
  # (Write code here)
  
  
  


  # Add UNIQUE words
  # (Write code here)
  
  
  
  
  
  # Add word count to TOTAL number of Spam words
  # (Write code here)
  
  
  
  
  
  
  # Add word tokens to Spam table
  # (Write code here)
  
    
    
    
    
    
# (Keep the following lines)    
print("\n### Spam Learner Defined ###\n")

## Task 4 - Load training data
1. Initialize two *int()* counters, named **spam_count** and **ham_count**, set to **0**
2. Load the email bodies from the **/ham** directory using the **learnHam()** function
  - Increment the *ham_count* counter for each email read
3. Load the email bodies from the **/spam** directory using the **learnSpam()** function
  - Increment the *spam_count* counter for each email read
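
Reading every file in a directory and counting as you go might look like the sketch below. It runs against a throwaway temporary directory, so it makes no claim about how the dataset repository is laid out. `read_directory` is a hypothetical helper; in the workbook, each body would be handed to `learnHam()` or `learnSpam()` instead of being printed.

```python
import os
import tempfile

def read_directory(path):
    """Yield the text of every file in a directory."""
    for filename in sorted(os.listdir(path)):
        # errors="ignore" skips bytes that are not valid in the default encoding
        with open(os.path.join(path, filename), "r", errors="ignore") as f:
            yield f.read()

# Demo on a throwaway directory with two made-up "emails"
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "hello"), ("b.txt", "win money now")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(body)

    email_count = 0
    for body in read_directory(tmp):
        email_count += 1  # one increment per email read
        print(len(body.split()), "words")

print(email_count)  # 2
```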

In [0]:
# Load the training data

# Store count for calculating priors
# (Write code here)




# Load all of the emails from the "ham" directory
print("- Training Ham -")
# (Write code here)



    

# Load all of the emails from the "spam" directory
print("- Training Spam -")
# (Write code here)

    
    


# (Keep the following lines)
print("\n### Training complete ###\n")

## Task 4a (Optional) - View training data
1. Show the first 5 words in the **ham_table** along with their counts
2. Show the first 5 words in the **spam_table** along with their counts
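
A small sketch of peeking at the first few entries of a dictionary, using a made-up `toy_table` in place of the real tables. "First" here means insertion order, which is the order Python dictionaries keep.

```python
from itertools import islice

# Made-up word counts for illustration
toy_table = {"free": 3, "money": 2, "click": 2, "now": 1, "offer": 1, "win": 1}

# islice stops after 5 items without copying the whole dict
for word, count in islice(toy_table.items(), 5):
    print(word, count)
```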

In [0]:
# Let's see how our ham_table and spam_table look


print("- Showing ham_table Elements -")
#(Write code here)





print("- Showing spam_table Elements -")
#(Write code here)






# Create Predict Function
Now that the training data has been loaded, let's create a repeatable function that can perform predictions.


## The mathy bits
### Multinomial Naive Bayes function
$P(Spam|email) = \dfrac{P(email|Spam) \cdot P(Spam)}{P(email)}\\$
$P(word|Spam) = \dfrac{Count(word,Spam) + \alpha}{Count(Spam) + \alpha \cdot Count(unique\ words)}\\$
$P(Spam) = \dfrac{spam\ count}{spam\ count + ham\ count}\\$
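
To make the smoothing formula concrete, here is a worked example on made-up counts. It reads $Count(Spam)$ as the total number of words seen in Spam, which is how the `predict()` function below uses `spam_table_len`. All of the numbers and the `p_word_given_spam` helper are assumptions for illustration.

```python
# Toy counts (made up for illustration)
spam_counts = {"free": 3, "money": 2}  # Count(word, Spam)
spam_total = 5                         # total words seen in Spam
unique_words = 4                       # unique words across Spam and Ham
alpha = 1

def p_word_given_spam(word):
    """Smoothed P(word|Spam) following the formula above."""
    return (spam_counts.get(word, 0) + alpha) / (spam_total + alpha * unique_words)

print(p_word_given_spam("free"))     # (3 + 1) / (5 + 4) = 0.444...
print(p_word_given_spam("meeting"))  # (0 + 1) / (5 + 4) = 0.111..., not zero
```

Without the smoothing term, a single unseen word would drive the whole product to zero.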

### Properties of logarithms 
#### Multiplication

$log(A \cdot B) = log(A) + log(B)\\$


#### Division

$log(\dfrac{A}{B}) = log(A) - log(B)\\$
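
Why bother with logarithms? Multiplying many small probabilities quickly runs out of floating-point range. The sketch below uses 400 made-up word probabilities to show the product collapsing to zero while the sum of logs stays usable.

```python
import math

# 400 made-up word probabilities, each small
probs = [1e-3] * 400

# Direct product underflows to 0.0 in floating point
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logs keeps a usable (very negative) score
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -2763.1
```

Since the logarithm is increasing, comparing two log scores gives the same winner as comparing the original products.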


In [0]:
# Define the predict function

def predict(email, alpha=1, print_probs=False):
  """Reads an email and predicts whether it is spam or ham
  Keyword arguments:
    email -------- The email body to be read
    alpha -------- Smoothing alpha to be applied (almost always 1)
    print_probs -- Print probabilities for debugging purposes

  :Returns ------- The predicted class label; as a str
  """
  
  # Read the email
  tokens = tokenizer(email)

  # Retrieve N from (1. The NUMBER of unique words)
  N = len(unique_words_table)
  
  # Calculate priors - P(spam) and P(ham)
  spam_prior = spam_count / (spam_count + ham_count)
  ham_prior  =  ham_count / (spam_count + ham_count)
  
  # Retrieve denominator values for Spam and Ham calculations
  spam_denominator = spam_table_len + N*alpha
  ham_denominator = ham_table_len + N*alpha
  
  
  
  # Accumulate the log-likelihood of each word
  #   log(A * B) = log(A) + log(B), so summing logs replaces the running product
  spam_log_likelihood = 0.0
  ham_log_likelihood = 0.0
  
  for word in tokens:
    
    # Use a count of 0 in case the word doesn't exist
    #   (.get() leaves the training tables unmodified)
    spam_log_likelihood += math.log(spam_table.get(word, 0) + alpha) - math.log(spam_denominator)
    ham_log_likelihood  += math.log( ham_table.get(word, 0) + alpha) - math.log( ham_denominator)
  
  
  
  # Calculate the (log) probabilities
  #   Using log properties to prevent overflows/underflows  
  spam_probability = math.log(spam_prior) + spam_log_likelihood
  ham_probability  = math.log(ham_prior) + ham_log_likelihood

  
  
  # Print probabilities for debugging purposes
  if print_probs:
    print("- Probabilities -")
    print("Spam Probability: {}".format(spam_probability))
    print("Ham Probability:  {}".format(ham_probability))
  
  
  # Make classification decision
  if (spam_probability > ham_probability):
    return "spam"
  else:
    return "ham"
  

print("\n### Prediction Function Defined ###\n")





## Task 5 - Predict test_email
1. Execute the **predict()** function on **test_email**
  - Set **print_probs=True** to display probability calculations
2. Print the predicted class
3. Print *test_email*

In [0]:
# Predict our test email
# (Write code here)



print("\n- Predicted Class -\n")
# (Write code here)




print("\n- Email Body -\n")
# (Write code here)




# Test and Evaluate the Model
OK, we have our training data loaded and a function to perform predictions. Now it's time to test and evaluate our model.

But first, we're going to define a helper function to display our evaluation reports.
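
For reference, the accuracy score used in this section is just the share of emails that were classified correctly. The sketch below works it out on made-up counts, using the same `<true label>_<predicted label>` naming and the same matrix layout as the evaluation code later in this section.

```python
# Made-up counts, named <true label>_<predicted label>
ham_ham, ham_spam = 90, 10   # true ham: predicted ham / predicted spam
spam_spam, spam_ham = 45, 5  # true spam: predicted spam / predicted ham

# Rows are predicted labels, columns are actual labels
cmatrix = [[ham_ham, spam_ham],
           [ham_spam, spam_spam]]

correct = ham_ham + spam_spam
total = ham_ham + ham_spam + spam_spam + spam_ham
accuracy = correct / total
print(accuracy)  # 0.9
```

The diagonal of the matrix holds the correct predictions. The off-diagonal cells are the two kinds of mistakes: ham flagged as spam, and spam let through as ham.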

In [0]:
# Define report generator

def generate_report(cmatrix, score):
  """Generates and displays graphical reports
  Keyword arguments:
    cmatrix - Confusion matrix generated by the model
    score --- Score generated by the model
    
  :Returns -- N/A
  """
  
  # Generate confusion matrix heatmap
  plt.figure(figsize=(5,5))
  sns.heatmap(cmatrix, 
              annot=True, 
              fmt="d", 
              linewidths=.5, 
              square = True, 
              cmap = 'Blues', 
              annot_kws={"size": 16}, 
              xticklabels=['ham', 'spam'], 
              yticklabels=['ham', 'spam'])

  plt.xticks(rotation='horizontal', fontsize=16)
  plt.yticks(rotation='horizontal', fontsize=16)
  plt.xlabel('Actual Label', size=20);
  plt.ylabel('Predicted Label', size=20);

  title = 'Accuracy Score: {0:.4f}'.format(score)
  plt.title(title, size = 20);

  # Display confusion matrix
  plt.show()
  
  
print("\n### Report Generator Defined ###\n")

In [0]:
# Define a function to test the model

def testModel(alpha=1):
  """Evaluates the model with the given alpha
  Keyword arguments:
    alpha --- Smoothing alpha to be applied (almost always 1)

  :Returns -- N/A
  """
  
  # Initialize confusion matrix (true label vs predicted label)
  spam_spam = 0
  spam_ham  = 0
  ham_ham   = 0
  ham_spam  = 0
  
  
  # Predict testing emails
  print("- Predicting Testing Emails -")
  for filename in os.listdir(data_dir + '/test'):
      with open(data_dir + "/test/" + filename, 'r') as f:
          prediction = predict(f.read(), alpha=alpha)
          true_label = re.split(r"txt\.", filename)[1]
          
          # Craft confusion matrix counts
          if (true_label == 'ham'):
            if (prediction == 'ham'):
              ham_ham += 1
            else:
              ham_spam += 1
          elif (true_label == 'spam'):
            if (prediction == 'spam'):
              spam_spam += 1
            else:
              spam_ham += 1
              
              
  # Calculate statistics
  cmatrix = [[ham_ham, spam_ham], 
             [ham_spam, spam_spam]]
  correctly_classified = ham_ham + spam_spam
  total_predictions   = ham_ham + spam_spam + ham_spam + spam_ham
  accuracy = float(correctly_classified) / total_predictions

  
  # Print testing statistics
  print("\n- Printing Test Statistics -\n")
  print("Total Emails: ", total_predictions)
  print("Correctly classified: ", correctly_classified)
  generate_report(cmatrix, accuracy)
  
  
print("\n### Model Evaluator Defined ###\n")

In [0]:
testModel()