<a href="https://colab.research.google.com/github/M-H-Amini/MachineLearning-AUT/blob/master/MLe_Lec5_SpamClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In The Name Of ALLAH
# Machine Learning *elementary* Course
## Amirkabir University of Technology
### Mohammad Hossein Amini (mhamini@aut.ac.ir)
# Lecture 5 - Bayesian Classification

<img src="https://drive.google.com/uc?id=144SDpgv7EEy6Og1ZFNIv_nBaugKGiSCE" width="400">



# Spam Classification
Using a relatively tiny dataset of emails and some very crude assumptions, we will build a spam classifier and measure its performance. Today, we will use only **glob** and **numpy**.

In [0]:
import numpy as np
import glob

## Preparing Dataset
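Let us prepare the dataset. We unzip the archive, then list the spam and ham files for training and testing. Spam files are matched by the `sp` prefix, and every file whose name does not start with `s` is treated as ham.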

In [0]:
!unzip EmailsTrainingSet.zip > /dev/null

In [0]:
train_spam_files = glob.glob('TrainingSet/sp*.txt')
train_ham_files = glob.glob('TrainingSet/[!s]*.txt')
test_spam_files = glob.glob('TestSet/sp*.txt')
test_ham_files = glob.glob('TestSet/[!s]*.txt')

In [0]:
print(len(train_spam_files), len(train_ham_files))
print(len(test_spam_files), len(test_ham_files))
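
The two patterns used above can be checked on a few names without touching the disk. The sketch below uses `fnmatchcase` from the standard library, which follows the same wildcard rules as `glob`; the file names are hypothetical and only illustrate the matching, so the real dataset may be named differently. Note that a name starting with `s` but not with `sp` would fall into neither group.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only; the real dataset may differ.
names = ['spmsg1.txt', 'spmsg2.txt', '3-1msg1.txt', 'notes.txt', 'summary.txt']

# 'sp*.txt' matches names that start with 'sp' and end with '.txt'.
spam_like = [n for n in names if fnmatchcase(n, 'sp*.txt')]

# '[!s]*.txt' matches names whose first character is anything except 's'.
ham_like = [n for n in names if fnmatchcase(n, '[!s]*.txt')]

# Names matched by neither pattern are silently left out of both lists.
skipped = [n for n in names if n not in spam_like and n not in ham_like]

print(spam_like)  # ['spmsg1.txt', 'spmsg2.txt']
print(ham_like)   # ['3-1msg1.txt', 'notes.txt']
print(skipped)    # ['summary.txt']
```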

## Estimating Parameters Of The Classifier
We store word counts in two dictionaries, one for spam and one for ham. The **updateDict** function takes an email file and a dictionary and adds the email's word counts to the dictionary. The total number of words in each class is then stored in its dictionary under the key `0`.

In [0]:
spam_dict = {}
ham_dict = {}
spam_no = len(train_spam_files)
ham_no = len(train_ham_files)

def updateDict(file, dictionary, show_details=False):
  """Add the word counts of one email file to the given dictionary."""
  with open(file) as f:
    content = f.read()
  # Tokenize on whitespace; case and punctuation are kept as-is.
  words = content.split()

  if show_details:
    print('Content: ', content)
    print('Words: ', words)

  for word in words:
    dictionary[word] = dictionary.get(word, 0) + 1

for email in train_spam_files:
  updateDict(email, spam_dict)

for email in train_ham_files:
  updateDict(email, ham_dict)

# Total number of word occurrences in each class.
spam_words_no = sum(spam_dict.values())
ham_words_no = sum(ham_dict.values())

# Store each total under the reserved integer key 0 (words are strings, so there is no clash).
spam_dict[0] = spam_words_no
ham_dict[0] = ham_words_no
print(spam_dict)
print(ham_dict)
print(spam_words_no, ham_words_no)
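
As a side note, the same word counting can be sketched with `collections.Counter` from the standard library. This is only an illustration of the idea, not the notebook's method; the two email texts are hypothetical stand-ins for the contents of the training files.

```python
from collections import Counter

# Hypothetical email texts, standing in for the contents of the training files.
emails = ['win money now', 'win a free prize now']

counts = Counter()
for text in emails:
    # Same whitespace tokenization as updateDict.
    counts.update(text.split())

# Total number of word occurrences across all emails.
total_words = sum(counts.values())

print(counts['win'], counts['now'], counts['prize'])  # 2 2 1
print(total_words)  # 8
```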

<img src="https://drive.google.com/uc?id=1KfSj_4JzPe9Xvz4YXZcTzxHhcjo3_14Z" width="700">



## Building The Classifier
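The **isSpam** function takes an email and predicts whether it is spam (1) or not (0). It sums the smoothed log-probabilities of the email's words under each class and predicts spam when the spam score exceeds the ham score by more than 50.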
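
Written out, the scoring rule that the code in this section follows looks roughly like this. The symbols are introduced here for illustration and do not appear in the notebook: $\text{count}_c(w)$ is the number of times word $w$ occurs in class $c$, $N_c$ is the total word count of the class (the value stored under key `0`), and $V_c$ is a vocabulary-size term derived from the number of keys in the class dictionary. Class priors are not part of the score.

```latex
\text{score}(c) = \sum_{w \in \text{email}} \log \frac{\text{count}_c(w) + 1}{N_c + V_c},
\qquad c \in \{\text{spam}, \text{ham}\}

\text{predict spam} \iff \text{score}(\text{spam}) - \text{score}(\text{ham}) > 50
```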

In [0]:
def isSpam(file, spam_dict, ham_dict, show_details=False):
  """Return 1 if the email is predicted to be spam, 0 otherwise."""
  with open(file) as f:
    content = f.read()
  words = content.split()
  # Despite their names, these hold sums of log-probabilities, not probabilities.
  spam_prob = 0
  ham_prob = 0

  # Add-one (Laplace) smoothing denominators: total words + vocabulary size.
  # Key 0 holds the total word count, so it is excluded from the vocabulary size.
  spam_denom = spam_dict[0] + len(spam_dict) - 1
  ham_denom = ham_dict[0] + len(ham_dict) - 1

  for word in words:
    # Unseen words have a count of 0, so their smoothed probability is 1 / denominator.
    spam_prob += np.log((spam_dict.get(word, 0) + 1) / spam_denom)
    ham_prob += np.log((ham_dict.get(word, 0) + 1) / ham_denom)

  if show_details:
    print('Spam Probability: {}\nHam Probability: {}'.format(spam_prob, ham_prob))
  # Predict spam only when the spam score beats the ham score by a margin of 50.
  if spam_prob - ham_prob > 50:
    return 1
  return 0

print(isSpam(train_ham_files[1], spam_dict, ham_dict, True))
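
One reason to add logarithms instead of multiplying word probabilities is numerical underflow. The sketch below is a minimal illustration with made-up values: it assumes a 400-word email in which every word has probability 0.001, which is not taken from the dataset.

```python
import math

# Hypothetical per-word probabilities for a 400-word email (not from the dataset).
word_probs = [1e-3] * 400

# Multiplying many small numbers underflows to exactly 0.0 in floating point.
product = 1.0
for p in word_probs:
    product *= p

# Summing logarithms keeps the score finite and comparable between classes.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # a finite negative number, about -2763.1
```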

## Performance Evaluation
To measure performance, we implement the **evaluate** function, which reports the accuracy on spam, on ham, and overall.

In [0]:
def evaluate(test_spam_files, test_ham_files, spam_dict, ham_dict, show_details=False):
  spam_corrects = 0
  spam_wrongs = 0
  ham_corrects = 0
  ham_wrongs = 0

  for spam in test_spam_files:
    predicted = isSpam(spam, spam_dict, ham_dict)
    if predicted == 1:
      spam_corrects += 1
    else:
      spam_wrongs += 1
    if show_details:
      print(f'Target: 1, Predicted: {predicted}')

  for ham in test_ham_files:
    predicted = isSpam(ham, spam_dict, ham_dict)
    if predicted == 0:
      ham_corrects += 1
    else:
      ham_wrongs += 1
    if show_details:
      print(f'Target: 0, Predicted: {predicted}')

  print('Spam Accuracy: ', spam_corrects/(spam_corrects + spam_wrongs))
  print('Ham Accuracy: ', ham_corrects/(ham_corrects + ham_wrongs))
  print('Overall Accuracy: ', (spam_corrects + ham_corrects) / (spam_corrects + spam_wrongs + ham_corrects + ham_wrongs))

evaluate(test_spam_files, test_ham_files, spam_dict, ham_dict, False)
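
To make the three reported numbers concrete, here is a small sketch that computes them from a list of targets and a list of predictions. The labels are hypothetical and the helper `class_accuracy` is introduced only for this illustration; it is not part of the notebook.

```python
# Hypothetical targets and predictions (1 = spam, 0 = ham), for illustration only.
targets = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

def class_accuracy(targets, predicted, label):
    """Fraction of emails with the given true label that were predicted correctly."""
    pairs = [(t, p) for t, p in zip(targets, predicted) if t == label]
    return sum(t == p for t, p in pairs) / len(pairs)

spam_accuracy = class_accuracy(targets, predicted, 1)  # 2 of 3 spam emails caught
ham_accuracy = class_accuracy(targets, predicted, 0)   # 4 of 5 ham emails kept
overall_accuracy = sum(t == p for t, p in zip(targets, predicted)) / len(targets)

print('Spam Accuracy: ', spam_accuracy)
print('Ham Accuracy: ', ham_accuracy)
print('Overall Accuracy: ', overall_accuracy)
```

With unbalanced classes, the overall accuracy is dominated by the larger class, which is why the per-class numbers are worth reporting separately.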

<img src="https://drive.google.com/uc?id=1B8JvIcioH0a7W-zwFvceOm-bT_1x6YTS" width="400">
