<a href="https://colab.research.google.com/github/Juanfra21/nlp-yu/blob/main/M3_Part_I_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 3: Naïve Bayes and Sentiment Classification and Logistic Regression
Instructions
* Read the following Chapter 4: Naive Bayes and Sentiment Classification. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. I have tried to pull out relevant notes for you below, but it is encouraged that you read each chapter provided.
* Read the following Chapter 5: Logistic Regression. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. I have tried to pull out relevant notes for you below, but it is encouraged that you read each chapter provided.

Summary
Classification is one of the most important tasks in NLP and in machine learning. In NLP it often means text categorization, for tasks such as sentiment analysis, spam detection, and topic modeling. Naïve Bayes is often one of the first classification algorithms introduced in NLP. The intuition behind the classifier lies in Bayesian inference, which uses Bayes' rule and conditional probabilities.

Here’s a reminder of Bayes' rule:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

Bayes' rule answers the question “what is the probability of x given y?”. Naïve Bayes is a generative model: it models how each class would generate the input (the likelihood of the observed features given the class, together with the prior probability of the class) and then uses Bayes' rule to decide which class is most probable. Said differently, “to train a generative model we first collect a large amount of data in some domain (e.g., think millions of images, sentences, or sounds, etc.) and then train a model to generate data like it.” [6]
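
To make Bayes' rule concrete, here is a small worked example. The numbers are made up purely for illustration (they are not estimated from any corpus), and `bayes_posterior` is a hypothetical helper name: x stands for "the email is spam" and y for "the email contains the word 'free'".

```python
def bayes_posterior(prior_x, likelihood_y_given_x, likelihood_y_given_not_x):
    """Return P(x | y) using Bayes' rule.

    P(y) is expanded with the law of total probability:
    P(y) = P(y | x) P(x) + P(y | not x) P(not x)
    """
    evidence = (likelihood_y_given_x * prior_x
                + likelihood_y_given_not_x * (1 - prior_x))
    return likelihood_y_given_x * prior_x / evidence


# Made-up numbers for illustration only
p_spam = 0.25             # P(x): prior probability that an email is spam
p_free_given_spam = 0.60  # P(y | x): 'free' appears in a spam email
p_free_given_ham = 0.05   # P(y | not x): 'free' appears in a ham email

p_spam_given_free = bayes_posterior(p_spam, p_free_given_spam, p_free_given_ham)
print(f"P(spam | 'free') = {p_spam_given_free:.2f}")
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.25 to 0.15 / 0.1875 = 0.80.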

So in the case of Naïve Bayes, we ask: given the words in a document, which class is most likely to have generated them? By contrast, discriminative models such as logistic regression learn directly from the provided features which values distinguish the classes, and then predict the class. [7]


With Naïve Bayes, the assumption is that the features (here, the words) are conditionally independent given the class. The Naïve Bayes classifier is often described as a bag-of-words approach, because we essentially throw all the words of a document into a ‘bag’ and keep only how often each word occurs; those frequencies are what feed the Bayesian inference. Context (the position of the words) is therefore ignored. Despite this, the Naïve Bayes approach can be accurate and effective at tasks such as determining whether an email is spam.
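
As a quick illustration of what the bag-of-words view throws away, the sketch below counts words with `collections.Counter`. The two sentences and the helper name `bag_of_words` are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Both sentences produce the same bag, even though they mean different things
print(first)
print(first == second)
```

The two bags compare equal because only the word counts survive, which is exactly the information a Naïve Bayes classifier gets to see.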

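Back to bag of words. With bag of words, we assume that the position of the words is not relevant, that is, that dependency or context within a phrase or sentence doesn’t matter. Relatedly, the naive Bayes assumption implies that the conditional probabilities of the words are independent given the class, which is a rather strange assumption to make for words in a sentence! The equation for the naive Bayes classifier is outlined below:

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$$

Here $C$ is the set of classes and $w_i$ is the word at position $i$ of the document.
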
You can apply Naive Bayes by building an index of words (a vocabulary) and walking through every word position in a test document.
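
The walk through the word positions can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used later in this notebook: the toy training documents are made up, the function names are hypothetical, add-one (Laplace) smoothing is assumed, words outside the vocabulary are simply skipped, and log probabilities are summed to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Estimate log priors and add-one-smoothed log likelihoods."""
    vocabulary = {token for tokens in documents for token in tokens}
    doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)

    log_prior = {c: math.log(n / len(documents)) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c in doc_counts:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocabulary)))
            for w in vocabulary
        }
    return log_prior, log_likelihood, vocabulary


def classify(tokens, log_prior, log_likelihood, vocabulary):
    """Walk through every word position and add up the log probabilities."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocabulary
        )
    return max(scores, key=scores.get), scores


# Toy, made-up training data (already tokenized)
train_docs = [
    ["win", "cash", "now"],
    ["cheap", "cash", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "offer", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
predicted, scores = classify(["cash", "offer", "now", "please"], *model)
print(predicted, scores)
```

In this toy run the unseen word "please" is ignored, and the remaining three words give spam a score of 0.5 · (3/15) · (2/15) · (2/15) against 0.5 · (1/16) · (2/16) · (1/16) for ham, so the sketch predicts spam.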


It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this Assignment, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).   One example corpus:   https://spamassassin.apache.org/old/publiccorpus/

You may work alone or in a group on this project.  You're welcome to use any tools or approach that you like.  Due before our next meetup. Starter code provided below.

Test example is provided at the end.

Libraries you may wish to use

In [1]:
import pandas as pd
import numpy as np
from os import makedirs, path, remove, rename, rmdir
from tarfile import open as open_tar
from shutil import rmtree
from urllib import request, parse
from glob import glob
from re import sub
from email import message_from_file
from sklearn.model_selection import StratifiedShuffleSplit
from collections import defaultdict
from functools import partial
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score)
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
import gc

Download the corpus using the following function.

Note: you may need to mount your Google Drive first and run this cell from that location. See previous exercises.

In [2]:
def download_corpus(dataset_dir: str = 'data'):
    base_url = 'https://spamassassin.apache.org'
    corpus_path = 'old/publiccorpus'
    files = {
        '20021010_easy_ham.tar.bz2': 'ham',
        '20021010_hard_ham.tar.bz2': 'ham',
        '20021010_spam.tar.bz2': 'spam',
        '20030228_easy_ham.tar.bz2': 'ham',
        '20030228_easy_ham_2.tar.bz2': 'ham',
        '20030228_hard_ham.tar.bz2': 'ham',
        '20030228_spam.tar.bz2': 'spam',
        '20030228_spam_2.tar.bz2': 'spam',
        '20050311_spam_2.tar.bz2': 'spam' }

    # Create the folders: downloads, ham and spam
    downloads_dir = path.join(dataset_dir, 'downloads')
    ham_dir = path.join(dataset_dir, 'ham')
    spam_dir = path.join(dataset_dir, 'spam')

    makedirs(downloads_dir, exist_ok=True)
    makedirs(ham_dir, exist_ok=True)
    makedirs(spam_dir, exist_ok=True)


    for file, spam_or_ham in files.items():
        # download files from URL of each specific .bz2 file
        url = parse.urljoin(base_url, f'{corpus_path}/{file}')
        tar_filename = path.join(downloads_dir, file)
        request.urlretrieve(url, tar_filename)

        # List the e-mails in the compressed .bz2 file
        emails = []
        with open_tar(tar_filename) as tar:
            tar.extractall(path=downloads_dir)
            for tarinfo in tar:
                if len(tarinfo.name.split('/')) > 1:
                    emails.append(tarinfo.name)

        # move e-mails to ham or spam directory
        for email in emails:
            directory, filename = email.split('/')
            directory = path.join(downloads_dir, directory)

            if not path.exists(path.join(dataset_dir, spam_or_ham, filename)):
                rename(path.join(directory, filename),
                   path.join(dataset_dir, spam_or_ham, filename))

        rmtree(directory)

download_corpus()

How many e-mails in our dataset are classified as spam or not spam?


In [3]:
#How many e-mails are classified in our dataset as either Spam or not Spam?
ham_dir = path.join('data', 'ham')
spam_dir = path.join('data', 'spam')

print('Number of Non-Spam E-mails:', len(glob(f'{ham_dir}/*')))
print('\nNumber of Spam E-mails:', len(glob(f'{spam_dir}/*')))

Number of Non-Spam E-mails: 6952

Number of Spam E-mails: 2399


# Classifier

## Read and store emails

First we create two functions that help us read and store the emails.

Emails can be multipart, so the first function concatenates the parts of a multipart email; if the email is not multipart, it simply returns the email's content.

The second function reads through a directory and returns each email's subject and body as a single string.
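
To see what "multipart" means in practice, the sketch below builds a small two-part message with the standard-library `email` package, parses it back, and flattens it. The message content is made up, and `flatten_payload` is a hypothetical name for a helper that uses the same recursive idea as the function defined in the next cell.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def flatten_payload(msg):
    """Concatenate the payloads of all parts of a (possibly nested) message."""
    if msg.is_multipart():
        # For multipart messages, get_payload() returns a list of sub-messages
        return ''.join(flatten_payload(part) for part in msg.get_payload())
    # For single-part messages, get_payload() returns the content itself
    return msg.get_payload()


# Build a made-up message with a plain-text part and an HTML part
outer = MIMEMultipart()
outer['Subject'] = 'Demo'
outer.attach(MIMEText('plain part\n', 'plain'))
outer.attach(MIMEText('<p>html part</p>\n', 'html'))

# Round-trip through text, as if the message had been read from a file
parsed = message_from_string(outer.as_string())
text = flatten_payload(parsed)

print(parsed.is_multipart())
print(text)
```

Note that the flattened text still contains the HTML markup of the second part; nothing in this sketch strips tags.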

In [4]:
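def get_payload_recursive(msg):
    if msg.is_multipart():
        # A multipart message holds a list of sub-messages: flatten them recursively
        parts = [get_payload_recursive(part) for part in msg.get_payload()]
        return ''.join(parts)
    else:
        # A single-part message holds its content directly
        return msg.get_payload()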

def read_emails(directory):
    email_files = glob(f'{directory}/*')
    emails = []
    for email_file in email_files:
        with open(email_file, 'r', encoding='latin1') as file:
            msg = message_from_file(file)
            subject = msg['subject'] if msg['subject'] is not None else ''  # Handle potential None values for subject
            payload = get_payload_recursive(msg) if msg.get_payload() is not None else ''  # Handle potential None values for content
            emails.append(subject + '\n' + payload)
    return emails

# Read emails from each directory
ham_emails = read_emails(ham_dir)
spam_emails = read_emails(spam_dir)

Now we have two lists containing the emails as strings. Let's see what one of them looks like:

In [5]:
print(spam_emails[10])

FORTUNE 500 COMPANY HIRING, AT HOME REPS.
Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


7749doNL1-136DfsE5701lGxl2-486pAKM7127JwoR4-054PCfq9499xMtW0-594hucS91l66




## Text Pre-Processing

Now, we will create a function for our text pre-processing pipeline, which will do the following:

1. Tokenize the emails.
2. Remove all punctuation within the emails.
3. Lowercase the emails.
4. Remove stop words.
5. Perform lemmatization on the emails.
6. Remove empty tokens (tokens left with no characters once punctuation has been stripped).
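
The last step matters because stripping punctuation can leave empty strings behind. The short sketch below shows this using only the standard library; the token list is made up for illustration and stands in for the output of a tokenizer.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', punctuation)

# Made-up tokens, standing in for the output of a tokenizer
tokens = ['Hello', ',', 'type=trivial', '4.995', '!', "you're"]

# Tokens made only of punctuation become empty strings
stripped = [token.translate(translator) for token in tokens]

# Dropping the empty strings is what step 6 does
non_empty = [token for token in stripped if len(token) >= 1]

print(stripped)
print(non_empty)
```

Stripping also glues the pieces of a token together: `type=trivial` becomes `typetrivial` and `4.995` becomes `4995`.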

In [6]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
# Define English stop words (a set makes membership tests fast)
stop_words = set(stopwords.words('english'))

# Translation table that deletes all punctuation characters
translator = str.maketrans('', '', punctuation)

# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text)

    # Remove punctuation
    tokens = [token.translate(translator) for token in tokens]

    # Lowercase
    tokens = [token.lower() for token in tokens]

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Remove empty tokens left over from punctuation removal
    tokens = [token for token in tokens if len(token) >= 1]

    return tokens

preprocessed_ham = [preprocess_text(email) for email in ham_emails]
preprocessed_spam = [preprocess_text(email) for email in spam_emails]

Let's see a before vs after example:

In [8]:
print(ham_emails[129])

Re: Java is for kiddies
JoeBar wrote:
>C is more reliable than Java??

Depends who writes it.  One guy will write a bug every 5 lines,
another every 5000 lines.  Put them both on a project and that will
average out to a bug every 4.995 lines.
<observation type=trivial>(Irrespective of language.  Pick the one
that best suits what you're trying to do.)</observation>

R




In [9]:
print(preprocessed_ham[129])

['java', 'kiddy', 'joebar', 'wrote', 'c', 'reliable', 'java', 'depends', 'writes', 'one', 'guy', 'write', 'bug', 'every', '5', 'line', 'another', 'every', '5000', 'line', 'put', 'project', 'average', 'bug', 'every', '4995', 'line', 'observation', 'typetrivial', 'irrespective', 'language', 'pick', 'one', 'best', 'suit', 'trying', 'observation', 'r']


## Test and train split and Term-Document Matrix

Now we split the preprocessed emails into training and test sets and build the term-document matrix from them (in scikit-learn's layout, one row per email and one column per vocabulary term).
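
To make the matrix concrete, here is a minimal pure-Python sketch of the counting involved. It is only a conceptual stand-in: the toy documents and the helper names are made up, and it does not reproduce scikit-learn's own tokenization or sparse storage. The point to notice is that the vocabulary comes from the training documents only, so words seen only at test time get no column.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct training token to a column index (sorted for stability)."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    return {token: column for column, token in enumerate(vocab)}


def to_count_rows(token_lists, vocabulary):
    """Turn each document into a row of counts, one column per vocabulary term."""
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Dicts preserve insertion order, so columns follow the vocabulary order
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


# Toy, made-up documents (already tokenized)
train_docs = [['cash', 'offer', 'cash'], ['meeting', 'noon']]
test_docs = [['cash', 'lottery']]

vocabulary = build_vocabulary(train_docs)
train_matrix = to_count_rows(train_docs, vocabulary)
test_matrix = to_count_rows(test_docs, vocabulary)

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

In the test row, "lottery" is silently dropped because it never appeared in the training documents.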

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer from scikit-learn
vectorizer = CountVectorizer()

def create_train_test_sets(preprocessed_ham, preprocessed_spam):
    # Combine the emails into a single list of emails
    all_emails = preprocessed_ham + preprocessed_spam

    # Create labels for the emails (1 for ham, 0 for spam)
    labels = [1] * len(preprocessed_ham) + [0] * len(preprocessed_spam)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(all_emails, labels, test_size=0.2, random_state=2)

    # Fit the vectorizer to the training emails and transform them into a term-document matrix
    X_train = vectorizer.fit_transform([' '.join(email) for email in X_train])

    # Transform the testing emails into a term-document matrix
    X_test = vectorizer.transform([' '.join(email) for email in X_test])

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_train_test_sets(preprocessed_ham, preprocessed_spam)

## Naive Bayes Model

We now create a Multinomial Naive Bayes model for spam detection, fit it to the training data, and then make predictions on the test data.

In [11]:
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes model
spam_detection_mnb = MultinomialNB()

# Fit the model to the training data
spam_detection_mnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = spam_detection_mnb.predict(X_test)

Let's check the accuracy of our model on the test set:

In [12]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.9796900053447354
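
Because the dataset has many more ham than spam e-mails, accuracy alone can hide how well spam is actually caught. As a hedged sketch, the snippet below computes accuracy, precision, and recall by hand, treating spam (label 0) as the class of interest. The labels are made up for illustration and are not the model's predictions.

```python
# Made-up labels for illustration: 1 = ham, 0 = spam (same encoding as above)
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

SPAM = 0
pairs = list(zip(y_true_demo, y_pred_demo))

true_pos = sum(1 for t, p in pairs if t == SPAM and p == SPAM)   # spam caught
false_pos = sum(1 for t, p in pairs if t != SPAM and p == SPAM)  # ham flagged as spam
false_neg = sum(1 for t, p in pairs if t == SPAM and p != SPAM)  # spam missed

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(accuracy, precision, recall)
```

The same quantities are available for the real predictions through the `precision_score` and `recall_score` functions imported at the top of this notebook.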

Now let's test the model with the test email. To do that, we need to preprocess and vectorize it, which is the format our model requires.

By using `predict_proba`, we can obtain the model's estimated probability that an email is ham or spam.
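
Naive Bayes probabilities tend to be very close to 0 or 1 for long texts, because every word multiplies in another factor. The sketch below shows the general mechanism of turning per-class log scores into normalized probabilities; the log scores are made up, `posteriors_from_log_scores` is a hypothetical helper, and this is an illustration of the idea rather than a description of the library's internals.

```python
import math


def posteriors_from_log_scores(log_scores):
    """Normalize per-class log scores into probabilities (log-sum-exp trick)."""
    highest = max(log_scores.values())
    # Subtracting the maximum keeps exp() from underflowing to zero for every class
    shifted = {label: math.exp(score - highest) for label, score in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}


# Made-up log scores: log P(c) plus the sum of log P(w | c) over the words
short_email = {'spam': -10.0, 'ham': -11.0}
long_email = {'spam': -400.0, 'ham': -430.0}

short_post = posteriors_from_log_scores(short_email)
long_post = posteriors_from_log_scores(long_email)

print({label: round(p * 100, 2) for label, p in short_post.items()})
print({label: round(p * 100, 2) for label, p in long_post.items()})
```

A gap of 1 in log score gives roughly 73 % against 27 %, while a gap of 30 already rounds to 100.0 % against 0.0 %, so extreme percentages should be read as "strongly favored by the model" rather than as certainty.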







In [13]:
spam_email = """
Subject: Get Rich Quick!

Dear Friend,

Congratulations! You've been selected to participate in an exclusive opportunity to make thousands of dollars from the comfort of your own home. Our revolutionary system guarantees quick and easy cash with minimal effort.

No more struggling to pay bills or worrying about financial security. With our proven method, you can start earning massive amounts of money in no time.

Here's what some of our satisfied customers have to say:
- "I was skeptical at first, but I'm now living my dream life thanks to this incredible system!" - John S.
- "I never thought making money online could be this simple. It's changed my life!" - Sarah L.

Don't miss out on this limited-time offer. Act now to secure your spot and start enjoying a life of financial freedom.

Click the link below to get started:
www.getrichquick.com

Remember, this opportunity is exclusive and won't last long. Take control of your financial future today!

Best regards,
The Get Rich Quick Team
"""

In [14]:
def test_email(email):
    # Preprocess the custom email with the same pipeline used for training
    preprocessed_custom_email = preprocess_text(email)

    # Convert the preprocessed email into a term-document matrix
    X_custom = vectorizer.transform([' '.join(preprocessed_custom_email)])

    # Predict the class of the custom email (1 = ham, 0 = spam)
    predicted_label = spam_detection_mnb.predict(X_custom)[0]

    # Columns follow spam_detection_mnb.classes_, i.e. [0 (spam), 1 (ham)]
    spam_proba, ham_proba = spam_detection_mnb.predict_proba(X_custom)[0]

    if predicted_label == 1:
        print("This email is predicted to be ham:")
    else:
        print("This email is predicted to be spam:")

    print("- The probability of this email to be ham is:", round(ham_proba*100,2),"%")
    print("- The probability of this email to be spam is:", round(spam_proba*100,2),"%")

test_email(spam_email)

This email is predicted to be spam:
- The probability of this email to be ham is: 0.0 %
- The probability of this email to be spam is: 100.0 %


As we can see, the email is correctly classified as spam.