In [1]:
import numpy as np
import pandas as pd
import textwrap
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Explore and clean the data

    1. (2pt) Load the lingspam-emails.csv.bz2 dataset. 
    Browse a handful of emails, both spam and non-spam ones, to see what kind of text we are working with here.
    Hint: check out textwrap module to print long strings on multiple lines.

In [2]:
# Load the dataset
email_df = pd.read_csv('/home/jovyan/INFO371PS/Data/lingspam-emails.csv', sep = "\t")
# Browse a handful of emails
email_df.sample(5)

Unnamed: 0,spam,files,message
1394,False,8-1149msg0.txt,"Subject: toc for linguist list dear sirs , pr..."
2063,False,6-474msg2.txt,Subject: [ n + v ] verbal compounding content...
614,False,5-1326msg1.txt,"Subject: a n n o u n c i n g cunyforum 18 , ..."
727,False,5-1462msg1.txt,Subject: re : 5 . 1448 comparative method the...
2215,False,9-288msg1.txt,Subject: mental lexicon first international c...


In [3]:
# Browse two full emails; printing more would be too long to display
textwrap.wrap(text = email_df.iloc[1140].message), textwrap.wrap(text = email_df.iloc[2127].message)

(['Subject: dear website operator  hi , i thought this could help your',
  'success . feel free to call me with any questions . sincerely ,',
  'jennifer powers 904-441 - 8080 env associates you will never receive a',
  'message from me again . * * * first time ever offered ! * * * keep',
  'your prospect pipeline - tm filled ! disappointed with traditional',
  "marketing ? maybe it 's time to consider ' business to business '",
  'direct e - mail . forget the " get rich quick " schemes and $ 395 +',
  'software . forget the " 60 million " address cd \'s that are filled',
  'with duplicates and even invalid , " generated " addresses , hidden in',
  'many different files that rarely add up to even a million prospects',
  'which are still unqualified . over 90 % are private personal addresses',
  'of people who do not want to be invaded and unless you have duplicate',
  'filtering software , you would be mailing many of them multiple times',
  ', with the same message ! no wonder they ca

    2. (3pt) Ensure the data is clean: remove all cases with missing spam and empty message field. We do not care about the file names.

In [4]:
# Drop the NA values and print basic information
email_df = email_df.dropna(subset=['spam', 'message'])
email_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2893 entries, 0 to 2892
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   spam     2893 non-null   bool  
 1   files    2893 non-null   object
 2   message  2893 non-null   object
dtypes: bool(1), object(2)
memory usage: 70.6+ KB


# Create Document-term matrix (DTM)

    1. (2pt) Choose ∼10 words which might be good to distinguish between spam/non-spam. Use these four: viagra, deadline, million, and and. Choose more words yourself (you may want to return here and reconsider your choice later).

In [5]:
# Pick a list of 10 words
list_of_words = ['viagra', 'deadline', 'million', 'and', 'money', 'bank', 'support', 'desperate', 'donation', 'congratulations']

    2. (10pt) Convert your messages into DTM. We do not use the full 60k-words DTM here but only a baby-DTM of the 10 words you picked above. You may add the DTM columns to the original data frame, or keep those in a separate structure.
    Creating the DTM involves finding whether the word is contained in the message for all emails in data. You can loop over emails and check each one individually, but pandas string methods make life much easier. You will want to do case-insensitive matching, checking for both upper and lower case. You may consider something like this:
    It is more intuitive to work with your data if you convert the logical values returned by contains to numbers.

In [6]:
# Add one 0/1 DTM column per word to the original df
# (case-insensitive substring match on the message text)
for w in list_of_words:
    email_df[w] = email_df.message.str.lower().str.contains(w, regex=False).astype(int)
email_df

Unnamed: 0,spam,files,message,viagra,deadline,million,and,money,bank,support,desperate,donation,congratulations
0,False,3-1msg1.txt,Subject: re : 2 . 882 s - > np np > date : su...,0,0,0,1,0,0,1,0,0,0
1,False,3-1msg2.txt,Subject: s - > np + np the discussion of s - ...,0,0,0,0,0,0,0,0,0,0
2,False,3-1msg3.txt,Subject: 2 . 882 s - > np np . . . for me it ...,0,0,0,0,0,0,0,0,0,0
3,False,3-375msg1.txt,"Subject: gent conference "" for the listserv ""...",0,0,0,1,1,1,0,0,0,0
4,False,3-378msg1.txt,Subject: query : causatives in korean could a...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2888,True,spmsgc50.txt,Subject: . international driver ' s license n...,0,0,0,1,0,0,0,0,0,0
2889,True,spmsgc51.txt,Subject: new on 95 . 8 capital fm this is new...,0,0,1,1,0,0,0,0,0,0
2890,True,spmsgc52.txt,Subject: re : new medical technology company ...,0,0,0,1,0,0,0,0,0,0
2891,True,spmsgc53.txt,Subject: re : your request for an overview ye...,0,0,1,1,1,0,0,0,0,0
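The matching above is substring-based, so a short word such as "and" also matches inside longer words. The sketch below contrasts substring matching with whole-word matching on three made-up messages; the `messages` series and the word-boundary pattern are illustrative assumptions, not what this notebook uses.

```python
import pandas as pd

# Hypothetical messages, for illustration only
messages = pd.Series([
    "Win a MILLION dollars",
    "The band played on",
    "Deadline and fees",
])

# Case-insensitive substring match: 'and' also matches inside 'band'
substring = messages.str.contains("and", case=False, regex=False).astype(int)

# Case-insensitive whole-word match using regex word boundaries
whole_word = messages.str.contains(r"\band\b", case=False, regex=True).astype(int)

print(pd.DataFrame({"substring": substring, "whole_word": whole_word}))
```

Whole-word matching would change the counts for short words, so the numbers reported in this notebook apply to the substring version only.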


    3. (3pt) Split your work data (i.e. the DTM) and target (the spam indicator) into training and validation chunks (80/20 is a good split).

In [7]:
# Create X and y variables
X = email_df[list_of_words]
y = email_df.spam*1
# Split the dataset into training and validation 
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)
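The split above is unseeded, so every run gives slightly different probabilities. A minimal sketch of a reproducible split follows, assuming one wants repeatable numbers; `random_state=42`, `stratify`, and the toy data are illustrative choices and were not used to produce the outputs in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 emails, 4 of them spam
X_toy = pd.DataFrame({"million": [1, 1, 0, 0] + [0] * 16})
y_toy = pd.Series([1, 1, 1, 1] + [0] * 16)

def split(seed):
    # random_state fixes the shuffle; stratify keeps the spam share
    # similar in the training and validation parts
    return train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed, stratify=y_toy
    )

X_tr, X_va, y_tr, y_va = split(seed=42)
X_tr2, X_va2, y_tr2, y_va2 = split(seed=42)

print(len(X_tr), len(X_va))
# Same seed, same rows in the validation set
print(list(X_va.index) == list(X_va2.index))
```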

# Estimate and validate

    1. (2pt) Design a scheme for your variable names that describes these probabilities so that a) you understand what they mean; and b) the others (including your grader) will understand those!
    Hint: you may get some ideas from the Python notes, Section 2.3 Base Language. The first task is to compute these probabilities. Use only training data for this task.

| Variable Name | Notation                   | Definition                                                           |
|---------------|----------------------------|----------------------------------------------------------------------|
| Pr_S1         | $$Pr(S = 1)$$              | Probability that the email is spam                                   |
| Pr_S0         | $$Pr(S = 0)$$              | Probability that the email is non-spam                               |
| Pr_W1         | $$Pr(W = 1)$$              | Probability that the word is present in the email                    |
| Pr_W0         | $$Pr(W = 0)$$              | Probability that the word is not present in the email                |
| Pr_W1_S1      | $$Pr(W = 1 \mid S = 1)$$   | Conditional probability that the word is present, given that the email is spam     |
| Pr_W1_S0      | $$Pr(W = 1 \mid S = 0)$$   | Conditional probability that the word is present, given that the email is non-spam |
| Pr_S1_W1      | $$Pr(S = 1 \mid W = 1)$$   | Conditional probability that the email is spam, given that the word is present         |
| Pr_S0_W1      | $$Pr(S = 0 \mid W = 1)$$   | Conditional probability that the email is non-spam, given that the word is present     |
| Pr_S1_W0      | $$Pr(S = 1 \mid W = 0)$$   | Conditional probability that the email is spam, given that the word is not present     |
| Pr_S0_W0      | $$Pr(S = 0 \mid W = 0)$$   | Conditional probability that the email is non-spam, given that the word is not present |

    2. (4pt) Compute the priors, the unconditional probabilities for an email being spam and non-spam, Pr(category = S) and Pr(category = NS). These probabilities are based on the spam variable alone, not on the text.

In [8]:
# Create variable Pr_S1, Pr_S0
Pr_S1 = y_train.mean()
Pr_S0 = 1 - Pr_S1

The next tasks involve computing the following probabilities for each word in the list of 10 you picked above. I recommend avoiding unnecessary complexity: just write a loop over the words, compute the answers 3–8, and print the word and the corresponding results there.

    3. (4pt) For each word w, compute the normalizers, Pr(w = 1) and Pr(w = 0).
    Hint: this is Pr(million = 1) = 0.0484. But note this value (and the following hints) depends on your random training/validation split!

In [9]:
# For each word w, compute the normalizers (vectorized over all word columns)
Pr_W1 = X_train.mean()
Pr_W0 = 1 - Pr_W1
print('Pr(W = 0):', Pr_W0, sep='\n')
print('Pr(W = 1)', Pr_W1, sep='\n')

Pr(W = 0):
viagra             0.999568
deadline           0.853500
million            0.949870
and                0.056180
money              0.914866
bank               0.939931
support            0.896283
desperate          0.997839
donation           0.999136
congratulations    0.997839
dtype: float64
Pr(W = 1)
viagra             0.000432
deadline           0.146500
million            0.050130
and                0.943820
money              0.085134
bank               0.060069
support            0.103717
desperate          0.002161
donation           0.000864
congratulations    0.002161
dtype: float64


    4. (7pt) For each word w, compute Pr(w = 1|category = S) and Pr(w = 1|category = NS). These probabilities are based on both the spam variable and on the DTM component that corresponds to the word w.
    Hint: Pr(million = 1|category = S) = 0.252

In [10]:
# Compute Pr(W = 1|S = 1), Pr(W = 1|S = 0) for each w:
# the share of training spam (non-spam) emails that contain the word
Pr_W1_S1 = np.mean(X_train[y_train == 1], axis=0)
Pr_W1_S0 = np.mean(X_train[y_train == 0], axis=0)
print('Pr(W = 1|S = 1):', Pr_W1_S1, sep='\n')
print('Pr(W = 1|S = 0):', Pr_W1_S0, sep='\n')

Pr(W = 1|S = 1):
viagra             0.002625
deadline           0.000000
million            0.249344
and                0.921260
money              0.383202
bank               0.165354
support            0.125984
desperate          0.010499
donation           0.002625
congratulations    0.007874
dtype: float64
Pr(W = 1|S = 0):
viagra             0.000000
deadline           0.175375
million            0.010864
and                0.948267
money              0.026384
bank               0.039317
support            0.099327
desperate          0.000517
donation           0.000517
congratulations    0.001035
dtype: float64


    5. (5pt) Finally, compute the probabilities of interest, Pr(category = S|w = 1) and Pr(category = S|w = 0). Compute this value using Bayes theorem, not directly by counting!
    For the check, you may also compute Pr(category = NS|w = 1) and Pr(category = NS|w = 0).
    Hint: Pr(category = S|million = 1) = 0.843. But note this number depends on your random training-validation split!

In [11]:
# Compute Pr_S1_W1, Pr_S1_W0, Pr_S0_W1, Pr_S0_W0 for each word with Bayes' theorem
Pr_S1_W1 = (Pr_W1_S1 * Pr_S1) / Pr_W1
Pr_S1_W0 = ((1 - Pr_W1_S1) * Pr_S1) / Pr_W0
# The non-spam posteriors are the complements of the spam posteriors
Pr_S0_W1 = 1 - Pr_S1_W1
Pr_S0_W0 = 1 - Pr_S1_W0
print('Pr(S = 1|W = 1):', Pr_S1_W1, sep='\n')
print('Pr(S = 1|W = 0):', Pr_S1_W0, sep='\n')

Pr(S = 1|W = 1):
viagra             1.000000
deadline           0.000000
million            0.818966
and                0.160714
money              0.741117
bank               0.453237
support            0.200000
desperate          0.800000
donation           0.500000
congratulations    0.600000
dtype: float64
Pr(S = 1|W = 0):
viagra             0.164289
deadline           0.192911
million            0.130118
and                0.230769
money              0.111006
bank               0.146207
support            0.160559
desperate          0.163274
donation           0.164360
congratulations    0.163707
dtype: float64


    6. (6pt) Which of these probabilities have to sum to one? (E.g. Pr(category = 1) + Pr(category = 0) = 1.) Which ones do not? Explain!

Now we are done with the estimator. Your fitted model is completely described by these probabilities. Let’s now turn to prediction, using your validation data. Note that we are still inside the loop over each word w!

As we can see from the results above, Pr(Category = S|W = 1) + Pr(Category = NS|W = 1) = 1, and likewise Pr(Category = S|W = 0) + Pr(Category = NS|W = 0) = 1. In each pair, both probabilities are conditioned on the same event: either the word is present in the email, or it is not. Given that event, the email is either spam or non-spam, so the two probabilities have to sum to 1.

On the other hand, probabilities that are conditioned on different events, such as Pr(Category = S|W = 1) and Pr(Category = S|W = 0), do not have to sum to 1, because they describe two different groups of emails. They add up to 1 only by coincidence.

In addition, Pr(w = 1) + Pr(w = 0) = 1 and Pr(category = S) + Pr(category = NS) = 1, because each describes a binary situation: an email is either spam or not spam, and a given word is either present in an email or not.
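The difference between the two kinds of sums can be illustrated on a small made-up sample. The series `w` and `s` below are hypothetical, and the probabilities are obtained by simple counting purely for illustration.

```python
import pandas as pd

# Hypothetical toy sample: word indicator w and spam label s for 10 emails
w = pd.Series([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
s = pd.Series([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

pr_s1_w1 = s[w == 1].mean()        # Pr(S = 1|W = 1) = 2/3
pr_s0_w1 = (1 - s)[w == 1].mean()  # Pr(S = 0|W = 1) = 1/3
pr_s1_w0 = s[w == 0].mean()        # Pr(S = 1|W = 0) = 1/7

# Same condition (W = 1): the two outcomes exhaust all cases
same_condition = pr_s1_w1 + pr_s0_w1

# Different conditions (W = 1 and W = 0): no reason to sum to 1
different_condition = pr_s1_w1 + pr_s1_w0

print(same_condition, different_condition)
```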

    7. (8pt) For each email in your validation set, predict whether it is spam or non-spam. 
    Hint: you should check if it contains the word w and use the appropriate probability, Pr(category = S|w = 1) or Pr(category = S|w = 0).

In [12]:
# Create a copy of the X_val to avoid warnings
prediction = X_val.copy()
# Create for loop to predict whether an email is spam or not based on the probabilities
for w in prediction.columns:
    new = np.where(X_val[w] > 0, Pr_S1_W1[w], Pr_S1_W0[w])
    spam = np.where(new > 0.5, 1, 0)
    prediction[w] = spam

prediction

Unnamed: 0,viagra,deadline,million,and,money,bank,support,desperate,donation,congratulations
1442,0,0,0,0,1,0,0,0,0,0
2585,0,0,0,0,0,0,0,0,0,0
1091,0,0,0,0,0,0,0,0,0,0
2002,0,0,0,0,0,0,0,0,0,0
579,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
1267,0,0,0,0,0,0,0,0,0,0
751,0,0,0,0,0,0,0,0,0,0
818,0,0,0,0,0,0,0,0,0,0
346,0,0,0,0,0,0,0,0,0,0


    8. (5pt) Print the resulting confusion matrix and compute accuracy, precision and recall.

In [13]:
# Create for loop to create confusion matrix, accuracy, precision, and recall for each word
for w in prediction.columns:
    cm = confusion_matrix(y_val, prediction[w])
    accuracy = accuracy_score(y_val, prediction[w])
    # zero_division=0 reports 0.0 without a warning when no email is predicted as spam
    precision = precision_score(y_val, prediction[w], zero_division=0)
    recall = recall_score(y_val, prediction[w], zero_division=0)
    print(w, cm, sep='\n')
    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)

viagra
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
deadline
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
million
[[476   3]
 [ 79  21]]
Accuracy: 0.8583765112262521
Precision: 0.875
Recall: 0.21
and
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
money
[[463  16]
 [ 66  34]]
Accuracy: 0.8583765112262521
Precision: 0.68
Recall: 0.34
bank
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
support
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
desperate
[[478   1]
 [ 99   1]]
Accuracy: 0.8272884283246977
Precision: 0.5
Recall: 0.01
donation
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
congratulations
[[479   0]
 [ 98   2]]
Accuracy: 0.8307426597582038
Precision: 1.0
Recall: 0.02


    9. (5pt) Which steps above constitute model training? In which steps do you use trained model? What is a trained model in this case? Explain!
    Hint: a trained model is all you need to make predictions.

Model training happens in tasks 3.2-3.5. In 3.2, we compute the priors, the probabilities that an email is spam or non-spam. In 3.3, we compute the normalizers, the probabilities that each word does or does not appear in an email. In 3.4, we compute the conditional probabilities that each word appears given that the email is spam or non-spam. Then, in 3.5, we combine the previous three with Bayes' theorem to obtain the probabilities of interest, which we use for predictions later on. All of these steps use only the training data.

The trained model is then used in 3.7 and 3.8. In 3.7, we use it to predict, for each email in the validation set, whether the email is spam or not. In 3.8, we compare those predictions with the true labels to construct the confusion matrices and compute the accuracy, precision, and recall of the model.

The trained model is all we need to make predictions. The trained model, in this case, is the set of these probabilities.
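To make this concrete, the sketch below predicts from nothing but the two fitted probabilities for one word. The helper `predict_spam` is a hypothetical name introduced here; the two numbers are the values for "million" printed in cell 11, and the 0.5 threshold matches the one used in cell 12.

```python
import numpy as np

def predict_spam(word_present, pr_s1_w1, pr_s1_w0, threshold=0.5):
    """Predict 1 (spam) or 0 (non-spam) from a 0/1 word indicator."""
    # Pick Pr(S = 1|W = 1) where the word is present, Pr(S = 1|W = 0) otherwise
    posterior = np.where(np.asarray(word_present) > 0, pr_s1_w1, pr_s1_w0)
    return (posterior > threshold).astype(int)

# Fitted values for "million", copied from the output of cell 11
preds = predict_spam([1, 0, 1], pr_s1_w1=0.818966, pr_s1_w0=0.130118)
print(preds)
```

No training data is needed at this point, which is what makes the pair of probabilities a complete trained model for that word.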

    10. (4pt) Comment on the overall performance of the model: what do accuracy, precision and recall look like?

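Accuracy looks fairly high, between 0.83 and 0.86 for every word. However, always predicting non-spam would already give 479/579 ≈ 0.83 on this validation set, and most of the words do exactly that: they never predict spam. Only "million" and "money" (both about 0.86) and, marginally, "congratulations" beat that baseline. Precision varies greatly, from 0 (reported when a word never predicts spam) to 1.0. Recall is low throughout: the best is 0.34 for "money", and every word is below 0.5.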

    11. (8pt) Explain why you see very low recall while the other indicators do not look that bad.

Recall is low because each single word appears in only a small share of the spam emails, so a one-word rule can detect only that share. This is not too surprising: spam emails use a lot of different words, but we predict from just one. Precision can still be high because, for a word such as "million", the few validation emails that are flagged as spam mostly are spam. Accuracy stays high because only 100 of the 579 validation emails (about 17%) are spam, so predicting non-spam for nearly everything is right most of the time. When a word never predicts spam, there are no true positives (predicted as 1, actual is 1): recall is 0, and precision, which is undefined without any positive predictions, is reported as 0.
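The three metrics can be recomputed by hand from a confusion matrix. The sketch below uses the counts printed above for "million" and also computes the accuracy of always predicting non-spam; the variable names are illustrative.

```python
# Confusion matrix for "million" from the output above:
# rows are actual (non-spam, spam), columns are predicted (non-spam, spam)
tn, fp = 476, 3
fn, tp = 79, 21
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all emails classified correctly
precision = tp / (tp + fp)     # share of predicted spam that is spam
recall = tp / (tp + fn)        # share of actual spam that is detected

# Accuracy of a rule that labels every email as non-spam
baseline = (tn + fp) / total

print(accuracy, precision, recall, baseline)
```

Only 21 of the 100 spam emails are caught, which is the low recall, while the 476 correctly labeled non-spam emails keep accuracy close to the baseline.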

    12. (8pt) Explain why some words work well and others not:

    (a) why does “million” improve accuracy?

In [14]:
# Total number of emails we have in the data
email_df["million"].count()

2893

In [15]:
# Number of emails that contain the word million
email_df["million"].sum()

140

In [16]:
# Number of spam emails that contain the word million
email_df[email_df.spam == 1]["million"].sum()

116

As we can see from the numbers above, the word "million" occurs in 140 emails, and 116 of those (about 83%) are spam. Because an email that contains "million" is spam much more often than not, predicting spam whenever the word is present is usually right, which improves accuracy.

    (b) why does “viagra” not work?

In [17]:
# Total number of emails we have in the data
email_df["viagra"].count()

2893

In [18]:
# Number of emails that contain the word viagra
email_df["viagra"].sum()

1

In [19]:
# Number of spam emails that contain the word viagra
email_df[email_df.spam == 1]["viagra"].sum()

1

The word "viagra" does not improve accuracy because it barely shows up in the emails at all: only one of the 2893 emails contains it. That email is spam, so the estimated Pr(S = 1|W = 1) is high, but the rule almost never fires, so the word tells us nothing about the remaining emails.

    (c) why does “deadline” not work?

In [20]:
# Number of emails that contain the word deadline
email_df["deadline"].sum()

434

In [21]:
# Number of spam emails that contain the word deadline
email_df[email_df.spam == 1]["deadline"].sum()

0

"Deadline" does not work because none of the 434 emails that contain the word is spam. The word is a strong sign of non-spam, but the model already predicts non-spam when the word is absent, since Pr(S = 1|W = 0) is well below 0.5. It therefore never predicts spam and cannot detect any. "Deadline" is extremely likely to appear in non-spam emails, such as emails about company or school projects.

    (d) why does “and” not work?
    Hint: You may just see where in which emails these words occur, and how frequently. These are all different reasons!

In [22]:
# Number of emails that contain the word and
email_df["and"].sum()

2725

In [23]:
# Number of spam emails that contain the word and
email_df[email_df.spam == 1]["and"].sum()

443

The word "and" does not work because it is a very commonly used word: 2725 of the 2893 emails contain it, spam and non-spam alike. Its presence therefore tells us almost nothing about whether an email is spam.

    13. (5pt) Add such smoothing to the model. You can either literally add two such lines of data, or alternatively manipulate the way you compute the probabilities.

In [24]:
# Make a new dataframe for smoothing: training features plus the spam label
t_emails = X_train.copy()
t_emails.insert(loc = 0, column = 'spam', value = y_train.values)
# Create four ghost emails: spam and non-spam, each with all words present and all words absent
ghost = pd.DataFrame(
    [[s] + [present] * len(list_of_words) for s in (1, 0) for present in (1, 0)],
    columns = ['spam'] + list_of_words,
)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
t_emails = pd.concat([t_emails, ghost], ignore_index=True)
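The task also allows changing how the probabilities are computed instead of adding rows. A minimal sketch of that alternative follows, assuming the same four ghost emails: each count is increased by the number of ghost rows that would contribute to it. The function name `smoothed_posteriors` and the toy data are hypothetical.

```python
import pandas as pd

def smoothed_posteriors(X, y):
    """Return (Pr(S=1|W=1), Pr(S=1|W=0)) per word, as if four ghost rows were added."""
    n = len(y) + 4                 # four ghost emails in total
    n_s1 = y.sum() + 2             # two of the ghosts are spam
    n_w1 = X.sum() + 2             # each word is present in two ghosts
    n_w1_s1 = X[y == 1].sum() + 1  # one of those two is spam

    pr_s1 = n_s1 / n               # prior
    pr_w1 = n_w1 / n               # normalizer
    pr_w1_s1 = n_w1_s1 / n_s1      # likelihood

    pr_s1_w1 = pr_w1_s1 * pr_s1 / pr_w1
    pr_s1_w0 = (1 - pr_w1_s1) * pr_s1 / (1 - pr_w1)
    return pr_s1_w1, pr_s1_w0

# Hypothetical toy data: "viagra" occurs once, in the only spam email
X_toy = pd.DataFrame({"viagra": [1, 0, 0, 0], "and": [1, 1, 1, 0]})
y_toy = pd.Series([1, 0, 0, 0])

pr_s1_w1, pr_s1_w0 = smoothed_posteriors(X_toy, y_toy)
print(pr_s1_w1)
print(pr_s1_w0)
```

Without smoothing, the toy "viagra" posterior would be exactly 1; with the ghost counts it is pulled back to 2/3, so a single observation no longer produces a probability of 0 or 1.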

    14. (5pt) Repeat the tasks above: compute the probabilities, do predictions, compute the accuracy, precision, recall for all words.

In [25]:
X_train = t_emails[t_emails.columns.drop('spam')]
y_train = t_emails.spam
# Create variable Pr_S1, Pr_S0
Pr_S1 = y_train.mean()
Pr_S0 = 1 - Pr_S1

# For each word w, compute the normalizers
Pr_W1 = X_train.mean()
Pr_W0 = 1- Pr_W1

# Compute Pr(W = 1|S = 1), Pr(W = 1|S = 0) for each w
Pr_W1_S1 = np.mean(X_train[y_train == 1], axis=0)
Pr_W1_S0 = np.mean(X_train[y_train == 0], axis=0)

Pr_S1_W1 = (Pr_W1_S1 * Pr_S1) / Pr_W1
Pr_S1_W0 = ((1 - Pr_W1_S1) * Pr_S1) / Pr_W0
Pr_S0_W1 = 1 - Pr_S1_W1
Pr_S0_W0 = 1 - Pr_S1_W0

prediction = X_val.copy()

for w in prediction.columns:
    new = np.where(X_val[w] > 0, Pr_S1_W1[w], Pr_S1_W0[w])
    spam = np.where(new > 0.5, 1, 0)
    prediction[w] = spam
    
for w in prediction.columns:
    cm = confusion_matrix(y_val, prediction[w])
    accuracy = accuracy_score(y_val, prediction[w])
    # zero_division=0 reports 0.0 without a warning when no email is predicted as spam
    precision = precision_score(y_val, prediction[w], zero_division=0)
    recall = recall_score(y_val, prediction[w], zero_division=0)
    print(w, cm, sep='\n')
    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)

viagra
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
deadline
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
million
[[476   3]
 [ 79  21]]
Accuracy: 0.8583765112262521
Precision: 0.875
Recall: 0.21
and
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
money
[[463  16]
 [ 66  34]]
Accuracy: 0.8583765112262521
Precision: 0.68
Recall: 0.34
bank
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
support
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
desperate
[[478   1]
 [ 99   1]]
Accuracy: 0.8272884283246977
Precision: 0.5
Recall: 0.01
donation
[[479   0]
 [100   0]]
Accuracy: 0.8272884283246977
Precision: 0.0
Recall: 0.0
congratulations
[[479   0]
 [ 98   2]]
Accuracy: 0.8307426597582038
Precision: 1.0
Recall: 0.02


    15. (4pt) Comment on the results. Does smoothing improve the overall performance?

In [26]:
# Print the new probabilities
print(Pr_S1_W1)
print(Pr_S1_W0)

viagra             0.666667
deadline           0.002933
million            0.813559
and                0.161025
money              0.738693
bank               0.453901
support            0.202479
desperate          0.714286
donation           0.500000
congratulations    0.571429
dtype: float64
viagra             0.164579
deadline           0.193222
million            0.130455
and                0.234848
money              0.111373
bank               0.146532
support            0.160886
desperate          0.163566
donation           0.164650
congratulations    0.163998
dtype: float64


After adding smoothing to the model, the confusion matrices, accuracies, precisions, and recalls do not change: the validation set is the same as before, and every prediction is identical. Smoothing does change Pr(S = 1|W = 1) and Pr(S = 1|W = 0) a bit, since we added four ghost emails covering four different situations. For example, Pr(S = 1|viagra = 1) drops from 1.0 to 0.667, and Pr(S = 1|deadline = 1) rises from 0 to 0.003. None of these shifts moves a probability across the 0.5 threshold, so smoothing does not improve the overall performance here.
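The unchanged metrics can be traced back to the decision threshold. The sketch below compares the Pr(S = 1|W = 1) values printed before and after smoothing and counts how many words end up on a different side of 0.5. The numbers are copied from the outputs above and are therefore rounded to six decimals, which matters for "donation", printed as exactly 0.5.

```python
import pandas as pd

words = ["viagra", "deadline", "million", "and", "money",
         "bank", "support", "desperate", "donation", "congratulations"]

# Pr(S = 1|W = 1) as printed in cell 11 (before smoothing)
before = pd.Series([1.000000, 0.000000, 0.818966, 0.160714, 0.741117,
                    0.453237, 0.200000, 0.800000, 0.500000, 0.600000],
                   index=words)

# Pr(S = 1|W = 1) as printed in cell 26 (after smoothing)
after = pd.Series([0.666667, 0.002933, 0.813559, 0.161025, 0.738693,
                   0.453901, 0.202479, 0.714286, 0.500000, 0.571429],
                  index=words)

# A prediction can only change if the probability crosses the 0.5 threshold
flipped = (before > 0.5) != (after > 0.5)
print(flipped.sum())
```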