**Name of student:** Rohan Karthikeyan  
**Roll Number:** MDS202226

## Basic imports


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
pd.set_option('display.max_colwidth', 100)

In [None]:
import re
import string
import time

import wordcloud
from nltk.corpus import stopwords

from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

## Working on the Enron spam subset

In [None]:
enron_spam = pd.read_csv('/kaggle/input/spamemails/enronSpamSubset.csv', usecols = [2, 3])
enron_spam

We have 10000 emails in our dataset. Let us now look at a sample email text:

In [None]:
enron_spam.iloc[3244, 0]

One can observe that this is clearly spam.

## Cleaning the dataset using `regex`

We perform the following cleaning steps:
* Convert text to lowercase;
* Replace newlines with spaces;
* Replace punctuation (including commas) with spaces;
* Collapse runs of whitespace (including tabs) into a single space; and
* Remove stop words.

In [None]:
stopword_list = stopwords.words('english')
stopword_list.append('subject')  # A commonly repeating word in emails

In [None]:
def preprocess_text(text):
    text = text.lower()  # Lowercase words
    text = re.sub(r"\n+", " ", text)  # Replace newlines with a space

    # Replace punctuation with spaces; re.escape keeps characters such as
    # `\`, `]`, `^` and `-` literal inside the character class
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    text = re.sub(r"\s+", " ", text)  # Collapse whitespace (including tabs)

    # Remove stopwords from text
    text = " ".join([word for word in text.split() if word not in stopword_list])
    return text

In [None]:
start = time.time()
enron_spam['Body'] = enron_spam['Body'].apply(preprocess_text)
end = time.time()
print('Time to execute: {:.3f} secs.\n'.format(end - start))

# Display a few sample texts
display(enron_spam.sample(10))

In [None]:
# Display the number of spam and non-spam emails
pd.DataFrame(enron_spam['Label'].value_counts())

### Display a wordcloud of the email payloads

We display the 100 most common words found in the email payloads.

In [None]:
content = ' '.join(enron_spam['Body'].values)
fig, ax = plt.subplots(figsize = (14, 10))
wc = wordcloud.WordCloud(width = 800, height = 600, max_words = 100, stopwords = stopword_list).generate(content)
ax.imshow(wc)
plt.axis('off')
plt.show()

#### **Observations:** 
1. One can see that the word `enron` occurs frequently, as would be expected. 
2. The word `ect` also occurs frequently; it stands for *Enron Capital and Trade Resources*, which was Enron's trading and financial arm.

## Building a Logistic regression classifier

Before we embark on our goal of spam classification, it is important to first extract features from the text. 
This is done by converting the raw text into a vector representation.

`scikit-learn` provides two main classes for this task:
1. `CountVectorizer`: It converts a collection of text documents to a matrix of ***token counts***.
2. `TfidfVectorizer`: It converts a collection of text documents to a matrix of ***TF-IDF features***.

More details will be given under the respective sections.

Additionally, we report the **odds ratios** of the top 15 features, obtained from the coefficients of the `LogisticRegression`
classifier. Each coefficient is the change in the log odds of spam per unit increase in the feature, so exponentiating it, $e^{\text{coefficient}}$, gives the factor by which the odds are multiplied (the odds ratio).
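
As a small numeric illustration of exponentiating a coefficient, the sketch below uses a made-up intercept and coefficient (both are hypothetical values, not taken from the fitted model) and checks that $e^{\text{coefficient}}$ equals the factor by which the odds change when the feature increases by one.

```python
import math

# Hypothetical values, chosen only for illustration (not from the fitted model)
intercept = -1.0
coef = 0.7  # coefficient of a single feature, e.g. the count of some word

def odds(x):
    """Odds of the positive class (spam) when the feature equals x."""
    log_odds = intercept + coef * x
    return math.exp(log_odds)

# Exponentiating the coefficient ...
odds_ratio = math.exp(coef)
# ... gives the same factor as comparing the odds at x + 1 and at x
ratio_from_odds = odds(3) / odds(2)

print(round(odds_ratio, 4), round(ratio_from_odds, 4))
```

The two printed values agree, which is why a feature with a large positive coefficient can be read as multiplying the odds of spam each time it occurs.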

In [None]:
X_raw = enron_spam['Body'].copy()
y = enron_spam['Label'].copy()

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y, test_size = 0.20, random_state = 49)
X_train_raw

### CountVectorizer aka Bag-of-words

This does two very simple things to convert text to numerical feature vectors:
1. It tokenizes the strings (either by word or by character) and assigns each token a unique ID.
2. It counts the number of times each token occurs in the document.

The `scikit-learn` implementation comes with a number of useful arguments to the function, which we make use of.
Of importance to us is the `ngram_range` argument. We create a helper function that helps us change this argument and check the performance.
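
To see what `ngram_range` does before applying it to the emails, the sketch below fits `CountVectorizer` on two made-up sentences. The toy corpus and the `vocab_for` helper are assumptions for illustration only and are not part of the dataset or the pipeline used later.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up purely for illustration
toy_docs = ["free money now", "meeting now"]

def vocab_for(ngram_range):
    """Return the sorted vocabulary learned for a given `ngram_range`."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(toy_docs)
    return sorted(vectorizer.vocabulary_)

unigrams = vocab_for((1, 1))  # single words only
bigrams = vocab_for((2, 2))   # pairs of adjacent words only
both = vocab_for((1, 2))      # single words and adjacent pairs

print(unigrams)
print(bigrams)
print(len(both))

# Token counts per document for the unigram vocabulary
# (columns follow the alphabetical order of the vocabulary)
counts = CountVectorizer().fit_transform(toy_docs).toarray()
print(counts)
```

With `(1, 2)` the vocabulary is the union of the unigram and bigram vocabularies, so it grows quickly on a real corpus; this is one reason to cap it with `max_features`.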



In [None]:
def perform_train_test(vectorizer, arguments):
    """Configure the given vectorizer, then train and evaluate a classifier."""
    # set_params validates the parameter names, unlike a bare setattr
    vectorizer.set_params(**arguments)
    print('Vectorizer used: {}\n'.format(vectorizer))
    X_train = vectorizer.fit_transform(X_train_raw)

    classifier = LogisticRegression(class_weight = 'balanced')
    classifier.fit(X_train, y_train)

    X_test = vectorizer.transform(X_test_raw)
    y_pred = classifier.predict(X_test)

    # Note that 0 == Not spam
    print('-'*50)
    print(classification_report(y_test, y_pred, target_names = ['Not spam', 'Spam']))

    # Get top 15 features with the odds ratio
    print_odds_ratio(vectorizer, classifier)


def print_odds_ratio(vectorizer, classifier, num_features=15):
    """Print the odds ratio of the top `num_features` features of the classifier."""
    # Get mapping of terms to feature indices
    learned_vocab = vectorizer.vocabulary_

    # Obtain coefficient of features in the decision fn.
    coeffs = classifier.coef_[0]

    # Get indices of the `num_features` largest coefficients
    indices = np.argpartition(coeffs, -num_features)[-num_features:]

    # Search for the `indices` in the learned vocabulary
    # First create a new dict from the old dict reversing the keys and values
    new = dict(zip(learned_vocab.values(), learned_vocab.keys()))

    # Calculate odds
    odds = {new[index]: np.round(np.exp(coeffs[index]), decimals=4) for index in indices}
    sorted_odds = sorted(odds.items(), key=lambda x:x[1])
    odds_ratio = dict(sorted_odds)  # Sort by value (asc.)
    print('-'*50)
    print('Odds ratio of the top {} features are:'.format(num_features))
    pprint(odds_ratio, width=1)

In [None]:
## Model 1: Only unigrams
cnt_vectorizer = CountVectorizer()
cnt_args = {'min_df': 5, 'max_features': 6000, 'ngram_range': (1, 1)}
perform_train_test(cnt_vectorizer, cnt_args)

In [None]:
## Model 2: Only bigrams
cnt_args['ngram_range'] = (2, 2)
perform_train_test(cnt_vectorizer, cnt_args)

In [None]:
## Model 3: Unigrams and bigrams
cnt_args['ngram_range'] = (1, 2)
perform_train_test(cnt_vectorizer, cnt_args)

#### **Inferences:** 

1. There is a ***pronounced drop*** in recall when using only bigrams.
2. In terms of predictive precision, there is ***no difference*** in using both unigrams and bigrams as opposed to using only unigrams.
3. There is a slight difference in the top 15 features recognized by the unigram-bigram model vs. the unigram model.
4. While some of the top features clearly increase the odds of an email being spam (e.g., *paliourg*, *click*), others could also occur in non-spam emails (e.g., *software*, *mobile*).

### TfidfVectorizer aka Term-frequency inverse document frequency

While `CountVectorizer` only takes into account the number of times a word appears in a document, `TfidfVectorizer` also weighs each word
by how informative it is across the corpus: words that occur in many documents are down-weighted relative to words concentrated in a few documents.
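
The weighting can be sketched by hand on a made-up corpus. The code below assumes the smoothed IDF, $\ln\frac{1 + n}{1 + \text{df}} + 1$, and the L2 row normalization that `scikit-learn` documents as its defaults; the tokenization is a plain `split`, which is simpler than what `TfidfVectorizer` does, so treat it as an approximation rather than a reimplementation.

```python
import math

# Toy corpus, made up purely for illustration
docs = [
    "click here free offer",
    "meeting schedule attached here",
    "free free offer click",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    raw = {term: tokens.count(term) * idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(tokenized[1])
# "here" occurs in two documents and "meeting" in only one, so "meeting"
# receives the larger weight even though both appear once in this email
print(round(weights["here"], 3), round(weights["meeting"], 3))
```

Because each row is normalized, every weight is at most 1, unlike raw counts.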

In [None]:
## Model 1: Only unigrams
tfidf_vectorizer = TfidfVectorizer()
tfidf_args = {'min_df': 5, 'max_features': 6000, 'ngram_range': (1, 1)}
perform_train_test(tfidf_vectorizer, tfidf_args)

In [None]:
## Model 2: Only bigrams
tfidf_args['ngram_range'] = (2, 2)
perform_train_test(tfidf_vectorizer, tfidf_args)

In [None]:
## Model 3: Unigrams and bigrams
tfidf_args['ngram_range'] = (1, 2)
perform_train_test(tfidf_vectorizer, tfidf_args)

#### **Inferences:** 

1. The performance of `TfidfVectorizer` is **very similar** to that of `CountVectorizer` in all three cases.
2. *Some* differences are observed between the corresponding top-feature lists of `TfidfVectorizer` and `CountVectorizer`.
3. The odds ratios are higher than in the previous model, suggesting that the classifier ***strongly*** believes the presence of one (or more) of these features makes the email likely to be spam. Part of this gap may come from scaling: TF-IDF values are normalized to at most 1 whereas counts are not, so the coefficients of the two models are not directly comparable.
4. There is ***no difference*** in the top 15 features recognized by the unigram-bigram model vs. the unigram model.
5. The feature with the ***highest*** odds in two variants starts with *http*: a hyperlink! It's the second top feature in the bigram model too.