Load the emails dataset

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [30]:
email_path = './emails.csv'
# Read the CSV file into a DataFrame (columns: text, spam)
email = pd.read_csv(email_path)

print(email)

                                                   text  spam
0     Subject: naturally irresistible your corporate...     1
1     Subject: the stock trading gunslinger  fanny i...     1
2     Subject: unbelievable new homes made easy  im ...     1
3     Subject: 4 color printing special  request add...     1
4     Subject: do not have money , get software cds ...     1
...                                                 ...   ...
5723  Subject: re : research and development charges...     0
5724  Subject: re : receipts from visit  jim ,  than...     0
5725  Subject: re : enron case study update  wow ! a...     0
5726  Subject: re : interest  david ,  please , call...     0
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0

[5728 rows x 2 columns]


In [31]:
# Inspect the type of the dataset and its number of rows
print(type(email))
print(len(email))

<class 'pandas.core.frame.DataFrame'>
5728


In this class, we have been using the notation $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ to represent a dataset. 

Please answer what $x^{(1)}$ and $y^{(1)}$ are in this dataset:

In [32]:
email.spam.value_counts()

spam
0    4360
1    1368
Name: count, dtype: int64

In [33]:
print(email.iloc[0])

text    Subject: naturally irresistible your corporate...
spam                                                    1
Name: 0, dtype: object


Split the dataset into training and testing datasets

In [34]:
from sklearn.model_selection import train_test_split

# Access the 'text' column of the DataFrame
email_text = email.text
Y = email.spam

# Split the data into training and test sets
email_train, email_test, y_train, y_test = train_test_split(email_text, Y, test_size=0.33, random_state=42)

Check whether the labels are distributed equally in the training and testing datasets

In [36]:
print(sum(y_train)/len(y_train))
print(sum(y_test)/len(y_test))

0.22804274172530622
0.2607086197778953
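The two fractions above differ by about three percentage points. If matching class proportions are wanted, one option is the `stratify` argument of `train_test_split`. The sketch below is only an illustration on synthetic labels: `toy_x` and `toy_y` are made-up stand-ins, not the emails data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the emails data (assumption): 1000 examples, roughly 24% positive
rng = np.random.default_rng(0)
toy_y = (rng.random(1000) < 0.24).astype(int)
toy_x = np.arange(1000)

# stratify=toy_y keeps the fraction of positives (nearly) identical in both splits
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.33, random_state=42, stratify=toy_y
)

print(toy_y_train.mean(), toy_y_test.mean())
```

With the notebook's variables, the analogous call would pass `stratify=Y` to the existing `train_test_split` line.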


**Bag of words representation of the data** 

We start by defining a vocabulary $V$ containing all the possible words we are interested in, e.g.:
$$ V = \{\text{church}, \text{doctor}, \text{fervently}, \text{purple}, \text{slow}, ...\} $$


A bag of words representation of a document $x$ is a function $\phi: x \mapsto \phi(x) \in \{0,1\}^{|V|}$ that outputs a feature vector
$$
\phi(x) = \left( 
\begin{array}{c}
0 \\
1 \\
0 \\
\vdots \\
0 \\
\vdots \\
\end{array}
\right)
\begin{array}{l}
\;\text{church} \\
\;\text{doctor} \\
\;\text{purple} \\
\\
\;\text{slow} \\
\\
\end{array}
$$
of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise.
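The definition can be sketched in a few lines of plain Python. The vocabulary order, the example sentence, and the helper name `bag_of_words` are assumptions made for illustration only.

```python
# Hypothetical toy vocabulary, in a fixed order (assumption for illustration)
toy_vocab = ["church", "doctor", "fervently", "purple", "slow"]

def bag_of_words(document, vocab):
    """Return the binary vector phi(x): entry j is 1 if the j-th vocabulary word occurs in the document."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

toy_phi = bag_of_words("The doctor walked to the purple house", toy_vocab)
print(toy_phi)  # [0, 1, 0, 1, 0]
```

Only presence matters here: repeating a word in the document would not change the vector.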

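We will construct the vocabulary dictionary and transform the data at the same time using `CountVectorizer`. Setting `binary=True` gives 0/1 features instead of counts, and `max_features=1000` keeps only the 1000 most frequent terms in the training corpus.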

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the CountVectorizer object with the desired parameters
vectorizer = CountVectorizer(binary = True, max_features=1000)

# Fit the vectorizer to the training data and transform the data
X_train = vectorizer.fit_transform(email_train).toarray()
X_test =  vectorizer.transform(email_test).toarray()


Store the vocabulary dictionary and print the shapes of the feature matrices

In [53]:
count = vectorizer.vocabulary_
print(X_train.shape)
print(X_test.shape)

(3837, 1000)
(1891, 1000)


Check whether the words "comma" and "and" are in the vocabulary dictionary

In [54]:
print('comma' in count)
print('and' in count)

False
True
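Note that the check above looks up the word "comma", not the character `,`. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters, so punctuation and one-letter words never enter the vocabulary. The sketch below shows this on a made-up sentence; `toy_analyzer` and `toy_tokens` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default analyzer lowercases text and keeps tokens of 2+ alphanumeric characters
toy_analyzer = CountVectorizer().build_analyzer()
toy_tokens = toy_analyzer("Subject: do not have money , get software CDs and a lot more !")
print(toy_tokens)
```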


Visualize the feature representation

In [58]:
import seaborn as sns
import matplotlib.pyplot as plt

# Get the feature names in column order (requires scikit-learn >= 1.0)
feature_names = vectorizer.get_feature_names_out()
# X_train is already a dense NumPy array, so no .toarray() call is needed
X = X_train

# Create a DataFrame with the feature matrix and column names
df = pd.DataFrame(X, columns=feature_names)

# Compute the correlation matrix
corr = df.corr()

# Create a heatmap of the correlation matrix
sns.heatmap(corr, cmap='coolwarm')

# Show the plot
plt.show()

Record the number of training examples $n$, the feature dimension $p$, and the number of classes $K$

In [59]:
n = X_train.shape[0]
p = X_train.shape[1]
K = 2

**Discriminative model**

Should we run linear regression or logistic regression here? Why?

In [None]:
from sklearn.linear_model import LogisticRegression
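As a rough sketch of how the classifier could be fit and scored, the example below uses a tiny made-up binary feature matrix. All `toy_*` names and the data are assumptions, and `max_iter=1000` is an arbitrary choice rather than a tuned setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary bag-of-words data (assumption): feature 0 appears only in class 1
toy_X = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]])
toy_y = np.array([1, 1, 1, 0, 0, 0])

toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_X, toy_y)

toy_train_acc = toy_clf.score(toy_X, toy_y)   # mean accuracy on the given data
toy_proba = toy_clf.predict_proba(toy_X[:1])  # class probabilities; each row sums to 1
print(toy_train_acc, toy_proba)
```

With the notebook's variables, the same pattern would be `fit(X_train, y_train)` followed by `score(X_train, y_train)` and `score(X_test, y_test)`.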



Calculate training and testing accuracy

The label classes are not evenly distributed in this example (about 76% of the emails are non-spam), so accuracy alone can be misleading and we also need to look at precision and recall.


|                   | Predicted Positive    | Predicted Negative         |
|-------------------|-----------------------|----------------------------|
|Actual Positive    | True Positive (TP)    | False Negative (FN)        |
|Actual Negative    | False Positive (FP)   | True Negative (TN)         |




precision = TP / (TP + FP) = (# true positives) / (# examples predicted to be positive)

recall = TP / (TP + FN) = (# true positives) / (# actual positives)

The F1 score can be interpreted as the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 scores of the individual classes, with a weighting that depends on the `average` parameter.
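The formulas above can be checked on a small example. The labels and predictions below are made up (not output of any model in this notebook). Note that `metrics.confusion_matrix` orders rows and columns by label value, so for labels 0 and 1 the flattened matrix reads TN, FP, FN, TP, which is a different layout from the table above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions (assumption, not model output from this notebook)
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Rows are actual labels, columns are predicted labels, both in the order 0, 1
tn, fp, fn, tp = metrics.confusion_matrix(toy_true, toy_pred).ravel()

toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(toy_precision, toy_recall, toy_f1)
```

The hand-computed values should agree with `metrics.precision_score`, `metrics.recall_score`, and `metrics.f1_score` on the same arrays.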

In [None]:
from sklearn import metrics


## Generative Model and Naive Bayes estimator

Recall that in generative models, we would fit two models on a corpus of emails $x$ with spam/non-spam labels $y$:

\begin{align*}
P(\mathbf{x}|y=\text{0}) && \text{and} && P(\mathbf{x}|y=\text{1})
\end{align*}

* $P(\mathbf{x} | y=1)$ *scores* each $\mathbf{x}$ based on how much it looks like spam.
* $P(\mathbf{x} | y=0)$ *scores* each $\mathbf{x}$ based on how much it looks like non-spam.

We also need to learn the distributions of labels:

* $P(y=k)$ encodes our prior beliefs for each class $k$. In spam classification, we might set $P(y=k)$ to the fraction of training examples with class $k$.

**How do we model these distributions?**

* Bernoulli distributions 
* They model the distribution of a random variable whose outcomes are binary: 
 * e.g., coin flips: if I flip a coin, I see heads with probability 0.4
 * 0.4 is the parameter of the Bernoulli distribution
 * in practice this parameter is unknown and we need to learn it
 
* What does "unknown parameter" mean here? 
 * When I flip this coin, I do not know ahead of time that I will see heads with probability 0.4. 
 * Instead, I only observe realizations, e.g., {H, T, T, H, T, ...}
 
**How do we learn the parameter/mean of a Bernoulli distribution?**
* Empirical mean
* Coin flip: if I flip the coin 100 times and see 39 heads and 61 tails, then the estimated mean of the Bernoulli distribution is 39/100 = 0.39
* Recall that the empirical mean is also the maximum likelihood estimate of the parameter
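The coin example can be reproduced numerically. The sketch below encodes heads as 1 and tails as 0 (an assumption about the encoding), computes the empirical mean, and checks on a grid of candidate parameters that the Bernoulli log-likelihood peaks at the same value. The helper `bernoulli_log_likelihood` is a hypothetical name.

```python
import numpy as np

# 100 hypothetical coin flips encoded as 1 = heads, 0 = tails: 39 heads, 61 tails
toy_flips = np.array([1] * 39 + [0] * 61)

# The empirical mean is the maximum likelihood estimate of the Bernoulli parameter
toy_theta_hat = toy_flips.mean()

def bernoulli_log_likelihood(theta, flips):
    """Log-likelihood of i.i.d. Bernoulli(theta) flips."""
    heads = flips.sum()
    tails = len(flips) - heads
    return heads * np.log(theta) + tails * np.log(1 - theta)

# Numerical check: on a grid of candidates, the log-likelihood peaks at the empirical mean
toy_grid = np.linspace(0.01, 0.99, 99)
toy_best = toy_grid[np.argmax(bernoulli_log_likelihood(toy_grid, toy_flips))]
print(toy_theta_hat, toy_best)
```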

In the spam detection problem, we use Bernoulli distributions to model:
* $P(x_j=1| y=1)$ for all j
* $P(x_j=1| y=0)$ for all j
* $P(y=1)$ 

* e.g., $P(x_1=1| y=1) = 0.3$: the probability that the word "some" appears in an email is 0.3 if the email is spam
* e.g., $P(x_1=1| y=0) = 0.2$: the probability that the word "some" appears in an email is 0.2 if the email is not spam

In the training data, 
* Calculate $P(x_1=1| y=1)$ using the empirical mean: the fraction of spam emails in which the first word appears 


Calculate $P(x_j=1| y=1)$ for all j, and store the resulting values in a variable called psi

Repeat the same exercise, calculate:
* Create a variable called psis that has shape K by d.
    * The first row contains $P(x_j=1| y=0)$ for all j
    * The second row contains $P(x_j=1| y=1)$ for all j
* Create a variable called phis with shape K
    * stores $P(y=k)$ for all k = 0, 1

In [None]:
d = X_train.shape[1]  # feature dimension (named p in the cell above)
psis = np.zeros([K, d])
phis = np.zeros([K])
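One possible way to fill in both arrays is to take empirical means class by class. The sketch below is a minimal version on made-up data; the function name `fit_bernoulli_nb` and the `toy_*` arrays are assumptions for illustration.

```python
import numpy as np

def fit_bernoulli_nb(X, y, K=2):
    """Estimate psis[k, j] = P(x_j = 1 | y = k) and phis[k] = P(y = k) by empirical means."""
    n, d = X.shape
    psis = np.zeros([K, d])
    phis = np.zeros([K])
    for k in range(K):
        X_k = X[y == k]              # rows whose label is k
        psis[k] = X_k.mean(axis=0)   # fraction of class-k examples containing each word
        phis[k] = X_k.shape[0] / n   # fraction of examples with label k
    return psis, phis

# Toy data (assumption): 4 documents, 3 vocabulary words
toy_X = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
toy_y = np.array([0, 0, 1, 1])
toy_psis, toy_phis = fit_bernoulli_nb(toy_X, toy_y)
print(toy_psis)
print(toy_phis)
```

With the notebook's variables this could be called as `fit_bernoulli_nb(X_train, y_train.to_numpy())`, converting the label Series to an array first.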



In [None]:
a = np.array([[1,2,3], [4,5,6]])
print(np.mean(a, axis=0))
print(np.mean(a, axis=1))

**Meaning of the first row of psis**
* Entry $j$ is the fraction of non-spam training emails that contain word $j$

**Meaning of the second row of psis**
* Entry $j$ is the fraction of spam training emails that contain word $j$

In [None]:
# Return the indices of the elements sorted from the smallest to the biggest
sorted_spamicity = np.argsort(psis[1])
# Retrieve the column indices of the 10 words that occur most frequently in spam
print(sorted_spamicity[-10:])



## Predictions

Given a new $x'$, we return the most likely class to have generated it:

\begin{align*}
\arg \max_k P_\theta(y=k | x') & = \arg \max_k  \frac{P_\theta(x' | y=k) P_\theta(y=k)}{P_\theta(x')} \\
& = \arg \max_k P_\theta(x' | y=k) P_\theta(y=k),
\end{align*}

where we have applied Bayes' rule in the first line and dropped the denominator $P_\theta(x')$ in the second line because it does not depend on $k$.

## Naive Bayes Assumption

The Naive Bayes assumption is a __general technique__ that can be used with any $d$-dimensional $x$ to construct tractable models $P(x|y)$.
* We simplify the model for $x$ as:
$$ P(x|y) = \prod_{j=1}^d P(x_j \mid y) $$

Thus, the prediction problem boils down to
\begin{align*}
\arg \max_k P(y=k | x') 
& = \arg \max_k P(x' | y=k) P(y=k),\\
& =  \arg \max_k \prod_{j=1}^d P(x_j' | y=k;\psi_{jk}) \phi_k
\end{align*}

**Log-Likelihood**
\begin{align*}
\arg \max_k \log P(y=k | x') 
& =  \arg \max_k \log \left( \prod_{j=1}^d P(x_j' | y=k;\psi_{jk}) \, \phi_k \right)\\
& = \arg \max_k \sum_{j=1}^d \log P(x_j' | y=k;\psi_{jk}) + \log \phi_k
\end{align*}
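A practical reason to work with sums of logs is numerical: a product of many small probabilities underflows in floating point, while the sum of their logs stays finite. The sketch below illustrates this with 1000 made-up probabilities of 0.01 each.

```python
import numpy as np

# 1000 hypothetical per-word probabilities, each equal to 0.01
toy_probs = np.full(1000, 0.01)

toy_product = np.prod(toy_probs)         # underflows to exactly 0.0 in float64
toy_log_sum = np.sum(np.log(toy_probs))  # finite: 1000 * log(0.01)
print(toy_product, toy_log_sum)
```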

**How do we calculate $P(x_j' | y=k;\psi_{jk})$?**
* If I have a coin with head probability 0.4 and encode heads as 1 and tails as 0, what is the probability that I observe 1?
    * 0.4
* What is the probability that I observe 0?
    * 0.6

Similarly, here we have that 
* $P(x_j'=1 | y=k;\psi_{jk}) = \psi_{jk}$
* $P(x_j'=0 | y=k;\psi_{jk}) = 1 - \psi_{jk}$
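Both cases can be combined into a single expression, which is convenient when computing the score. This is the standard way of writing the Bernoulli probability mass function; it is stated here as a reminder and uses only the symbols defined above:

\begin{align*}
P(x_j' | y=k;\psi_{jk}) & = \psi_{jk}^{x_j'} (1 - \psi_{jk})^{1 - x_j'}, \\
\log P(x_j' | y=k;\psi_{jk}) & = x_j' \log \psi_{jk} + (1 - x_j') \log (1 - \psi_{jk}).
\end{align*}

Setting $x_j' = 1$ recovers $\psi_{jk}$ and setting $x_j' = 0$ recovers $1 - \psi_{jk}$.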

In [None]:
# Remember: P(x|y) indicates P(x1|y)*P(x2|y)* ... * P(xd|y)
def spam_predict(x, psis, phis, K=2):
    r"""This returns class assignments under the NB model.
    We compute \arg\max_y p(y|x) as \arg\max_y p(x|y)p(y).
    """
    n, d = x.shape
    psis = psis.clip(1e-14, 1-1e-14)  # avoid log(0)
    log_py = np.log(phis)
    score = np.zeros((n, K))
    for i in range(n):
        for k in range(K):
            # log P(x_i | y=k) + log P(y=k), using the Bernoulli likelihood of each word
            log_pxy = np.sum(x[i] * np.log(psis[k]) + (1 - x[i]) * np.log(1 - psis[k]))
            score[i, k] = log_pxy + log_py[k]

    return score.argmax(axis=1)
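The double loop can also be replaced by matrix products, which is usually much faster when there are thousands of emails. The sketch below is one possible vectorized version; `nb_predict_vectorized` and the toy parameters are hypothetical and chosen only to make the result easy to verify by hand.

```python
import numpy as np

def nb_predict_vectorized(x, psis, phis):
    """Bernoulli Naive Bayes prediction via matrix products (hypothetical helper)."""
    psis = psis.clip(1e-14, 1 - 1e-14)  # avoid log(0)
    # score[i, k] = sum_j [x_ij log psi_kj + (1 - x_ij) log(1 - psi_kj)] + log phi_k
    score = x @ np.log(psis).T + (1 - x) @ np.log(1 - psis).T + np.log(phis)
    return score.argmax(axis=1)

# Toy parameters (assumption): word 0 is typical of class 1, word 1 of class 0
toy_psis = np.array([[0.1, 0.8], [0.9, 0.2]])
toy_phis = np.array([0.5, 0.5])
toy_x = np.array([[1, 0], [0, 1]])
toy_pred = nb_predict_vectorized(toy_x, toy_psis, toy_phis)
print(toy_pred)  # [1 0]
```

Test accuracy could then be estimated by comparing the predictions on `X_test` with `y_test`.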


Check whether a word has only appeared in the y=0 but not y=1 or vice versa
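This check matters because an estimate of exactly 0 leads to $\log 0$ in the score, which is why `spam_predict` clips `psis`. A minimal sketch of the check on made-up estimates follows; `toy_psis` and the two index arrays are illustrative names.

```python
import numpy as np

# Toy estimates (assumption): rows are classes y=0 and y=1, columns are words
toy_psis = np.array([[0.0, 0.3, 0.5, 0.0],
                     [0.2, 0.0, 0.4, 0.0]])

only_in_spam = np.where((toy_psis[0] == 0) & (toy_psis[1] > 0))[0]      # seen only when y=1
only_in_non_spam = np.where((toy_psis[1] == 0) & (toy_psis[0] > 0))[0]  # seen only when y=0
print(only_in_spam, only_in_non_spam)
```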

Modify the dictionary to remove stop words

Process stop words in batches:

In [None]:
!pip3 install --user nltk

In [None]:
import nltk
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')
print(stopwords.words('english'))
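One way to remove stop words from the vocabulary is to pass a list to the `stop_words` argument of `CountVectorizer`. The sketch below uses a short made-up list so that it runs without any download; the NLTK list printed above could presumably be passed in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Short hypothetical stop-word list; stopwords.words('english') could be passed instead
toy_stop_words = ["the", "and", "to", "of"]
toy_docs = ["the price of the stock", "send money to the bank and wait"]

toy_vectorizer = CountVectorizer(binary=True, stop_words=toy_stop_words)
toy_vectorizer.fit(toy_docs)
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
```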

Preprocess the data to keep only the stem of the words

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
new_dict = {}
value = 0
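A possible way to continue from here is to map every vocabulary word to its stem and give each distinct stem one column index. The sketch below shows the idea on a made-up word list; `toy_words` and `toy_stem_index` are hypothetical names, and this is not necessarily how the cell above was meant to be completed.

```python
from nltk.stem import PorterStemmer

toy_ps = PorterStemmer()

# Hypothetical vocabulary; several words share the same stem
toy_words = ["run", "running", "runs", "email", "emails"]

toy_stem_index = {}  # maps each stem to a new column index
for word in toy_words:
    stem = toy_ps.stem(word)
    if stem not in toy_stem_index:
        toy_stem_index[stem] = len(toy_stem_index)

print(toy_stem_index)
```

Words that share a stem end up in the same column, so the feature dimension shrinks.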


**Written Exercises**

**Naive Bayes with Binary Features.** Consider a group of 50 Cornell students. 
20 of them are Master's students, and the remaining 30 are PhD students. 
5 of the Master's students bike, and 5 of the Master's students ski.
Among the PhD students, 20 bike and 15 ski. 

We can formulate this as a machine learning problem by modeling the students with features $x =(x_1, x_2) \in \{0,1\}^2$, where $x_1$ is a binary indicator of whether the students bike and $x_2$ is a binary indicator of whether they ski, and the target $y$ equals $1$ if they are PhD students and $0$ if they are Master's students.

* Please explain what the Naive Bayes assumption means in this context.
* Under the Naive Bayes assumption, find the probability that a student in this group who neither bikes nor skis is a Master's student.
    
* Suppose we know that every PhD student who skis also bikes. Does it still make sense to assume that biking and skiing are conditionally independent for a PhD student? If not, how would your answer to the previous part change with this knowledge (you can still assume that biking and skiing are conditionally independent for a Master's student)?

