# DAT405 Introduction to Data Science and AI 
## Assignment 4: Spam classification using Naïve Bayes 

The exercise takes place in a notebook environment where you can chose to use Jupyter or Google Colabs. We recommend you use Google Colabs as it will facilitate remote group-work and makes the assignment less technical. 
Hints:
You can execute certain linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. The first you can use for documenting and explaining your results the second you can use writing code snippets that execute the tasks required.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you chose to use Jupyter notebooks you will have to run the commands in the cell below on your local computer, with Windows you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.



## Contributors:
### Lukas Andersson - 35 Hours
### Ramapriya Navalpakkam -  30 hours
###

In [2]:
#Download and extract data
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
# !wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
# !tar -xjf 20021010_easy_ham.tar.bz2
# !tar -xjf 20021010_hard_ham.tar.bz2
# !tar -xjf 20021010_spam.tar.bz2

*The* data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [3]:
# !ls -lah

###1. Preprocessing: 
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that and run on the entire text. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets in a training set and a test set. (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) **0.5p**


In [4]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

dirs_easy = os.listdir('../data/easy_ham')
dirs_hard = os.listdir('../data/hard_ham')
dirs_spam = os.listdir('../data/spam')

easyham = pd.DataFrame([open("../data/easy_ham/" + file, "r", encoding="iso-8859-1").read() for file in dirs_easy]) 
hardham = pd.DataFrame([open("../data/hard_ham/" + file, "r", encoding="iso-8859-1").read() for file in dirs_hard]) 
spam = pd.DataFrame([open("../data/spam/" + file, "r", encoding="iso-8859-1").read() for file in dirs_spam])

easyham['label'] = 0
easyham.columns = ['message', 'label']
hardham['label'] = 0
hardham.columns = ['message', 'label']
spam['label'] = 1
spam.columns = ['message', 'label']

all_frames = [easyham, hardham, spam]
easy_frames = [easyham, spam]

all_data = pd.concat(all_frames)
easy_data = pd.concat(easy_frames)

X = CountVectorizer().fit_transform(all_data['message'])    
y = all_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

###2. Write a Python program that: 
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, that classifies the test sets and reports True Positive and True Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes Classifier in SKlearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem
    - Multinomial Naive Bayes  
    - Bernoulli Naive Bayes. 






In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn))
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn))


Multinomial Naive Bayes: True Positive rate = 0.8896551724137931 and True Negative rate = 0.9911894273127754
Bernoulli   Naive Bayes: True Positive rate = 0.21379310344827587 and True Negative rate = 0.9823788546255506


a) Explain how the classifiers differ. What different interpretations do they have? **1p** 

Multinomial counts how many times a specific feature occurs while Bernoulli simply models if the feature occurs or not, but does not count.

Above we printed out the True Positive and True Negative for both methods, in our tests Multinomial performed way better, especially on finding real ham (True Positive).

### 3.Run your program on 
-	Spam versus easy-ham 
-	Spam versus (hard-ham + easy-ham). 
-   Discuss your results **2.5p** 

In [6]:
#Easy-ham

X = CountVectorizer().fit_transform(easy_data['message'])    
y = easy_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy-ham ONLY")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy-ham ONLY")


#Hard-ham + Easy-ham

X = CountVectorizer().fit_transform(all_data['message'])    
y = all_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy- and Hard-ham")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy- and Hard-ham")

Multinomial Naive Bayes: True Positive rate = 0.9310344827586207 and True Negative rate = 0.9984544049459042 for Easy-ham ONLY
Bernoulli   Naive Bayes: True Positive rate = 0.46551724137931033 and True Negative rate = 0.9969088098918083 for Easy-ham ONLY
Multinomial Naive Bayes: True Positive rate = 0.8896551724137931 and True Negative rate = 0.9911894273127754 for Easy- and Hard-ham
Bernoulli   Naive Bayes: True Positive rate = 0.21379310344827587 and True Negative rate = 0.9823788546255506 for Easy- and Hard-ham


They both performed better on the easy-ham only version, even though the difference was quite slight for Multinomial while it was quite large for Bernoulli. This however is probably just because it is easier data. The hard ham is much more similar to spam and therefore harder to distinguish as ham. This also means that the easy only got worse training than the mixed one, since it did not train on hard ham. It would probably be way better performance if it only trained on hard-ham since those are the features we want to distinguish to make better predictions.

# 4.	To avoid classification based on common and uninformative words it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. **1p** 



**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report and discuss your results. You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making. **1p** 


- Words that are too common as we see are mostly words like 'a', 'is', 'and'. These do not give us any information about the text and would certainly not aid in determining if the words are ham or spam words. Therefore removing these words would optimise the performance of the model as it would have fewer and more meaningful words to train on. Below we remove special characters, numbers and escape sequences before we fit the model to the counter. This ensures that the model skips learning from these unimportant characters and symbols when trying to classify ham and spam.

- We chose to use Sklearn's algorithm to filter out the most common words from the dataset.  The `max_df` parameter allows us to set a threashold above with the words having their frequency is ignored. We have set this value to 60% so the algorithm gives a 10% extra slack for words to have a slightly higher frequency the the others. This is a very convenient method to filter out the words compared to loop through each word found in the previous section to eliminiate it in the dataset. This can take a veyr long time. Both of these methods can however be applied only to the dataset after a preproccessing steps where we eliminate symbols, numbers and escape sequences. 


In [7]:

from collections import Counter
import string
import itertools
import re

def count_words():

    for message in all_data['message']:
        
        message = message.translate ({ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\|`~-=_+="})
        message = re.sub(r'[0-9]+', ' ', message) 
        message = message.replace('\n', ' ')
        message = message.replace('\t', ' ')
        
        
        # split the mails into words
        mails_splitted = message.split()
        word_counter = Counter()
        # count how many times a word occurs in all emails
        word_counter = word_counter + Counter(mails_splitted)
        
    return word_counter



word_counter = count_words()

#number of words
num_words = 30

#the least common words
most_uncommon_words = word_counter.most_common()[:-num_words-1:-1]
print("The most uncommon words :", most_uncommon_words)

#the top common words
most_common_words = word_counter.most_common(num_words)
print('\nThe most common words are:', most_common_words)

#popular_words = sorted(word_counter, key = word_counter.get, reverse = False)

#print(popular_words[:-num_words-1:-1])

The most uncommon words : [('Remove', 1), ('subject', 1), ('cn', 1), ('btamail', 1), ('bm', 1), ('mailto', 1), ('email', 1), ('blank', 1), ('send', 1), ('link', 1), ('click', 1), ('list', 1), ('address', 1), ('remove', 1), ('return', 1), ('convenient', 1), ('phone', 1), ('name', 1), ('Leave', 1), ('automated', 1), ('enjoy', 1), ('Beyond', 1), ('Priority', 1), ('international', 1), ('domestic', 1), ('Postal', 1), ('US', 1), ('next', 1), ('shipped', 1), ('orders', 1)]

The most common words are: [('of', 84), ('is', 63), ('and', 62), ('the', 59), ('to', 56), ('a', 50), ('for', 48), ('or', 45), ('as', 27), ('One', 27), ('Shangrila', 26), ('not', 24), ('tm', 23), ('Zowie', 23), ('Wowie', 23), ('that', 23), ('in', 22), ('be', 20), ('it', 19), ('botanical', 19), ('at', 19), ('this', 19), ('oz', 19), ('any', 18), ('product', 17), ('with', 16), ('non', 16), ('are', 15), ('has', 15), ('Kathmandu', 15)]


In [8]:
#!pip install nltk
#nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords

#Easy-ham

X = CountVectorizer(max_df=0.6, stop_words=stopwords.words('english')).fit_transform(easy_data['message'])

y = easy_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy-ham ONLY")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy-ham ONLY")


#Hard-ham + Easy-ham

X = CountVectorizer(max_df=0.6,stop_words=stopwords.words('english')).fit_transform(all_data['message'])
y = all_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy- and Hard-ham")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy- and Hard-ham")

Multinomial Naive Bayes: True Positive rate = 0.9396551724137931 and True Negative rate = 0.9984544049459042 for Easy-ham ONLY
Bernoulli   Naive Bayes: True Positive rate = 0.3879310344827586 and True Negative rate = 0.9969088098918083 for Easy-ham ONLY
Multinomial Naive Bayes: True Positive rate = 0.9655172413793104 and True Negative rate = 0.9882525697503671 for Easy- and Hard-ham
Bernoulli   Naive Bayes: True Positive rate = 0.2 and True Negative rate = 0.9809104258443465 for Easy- and Hard-ham


- In the results we see that there is a significant improvement the identification of true positives in the Multinomial bayes classification of easy-ham and hard-ham vs spam than that of easy-ham vs spam. On the other hand, the there is a drop in the prediction of true positives by the Bernoulli's classification. 



### 5. Eeking out further performance
**a.**  Use a lemmatizer to normalize the text (for example from the `nltk` library). For one implementation look at the documentation ([here](https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes)). Run your program again and answer the following questions: 
  - Why can lemmatization help?
  -	Does the result improve from 3 and 4? Discuss. **1.5p** 







- Lemmatisation transforms words to their orginial or base forms. For example, serialise, serialize, serialisation are different forms of the same word serial. A lemmatization on any of these words we can obtain the word serial. Lemmatization of words makes analysis on the dataset simpler by reducing the number of words to be analysed. In our case, the model can now easily learn to differentiate between ham or spam words and avoids unnecessary time learning different forms of the same word. 



In [9]:
#nltk.download('wordnet')

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize 

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]


#Easy-ham

X = CountVectorizer(max_df=0.6, tokenizer=LemmaTokenizer()
).fit_transform(easy_data['message'])

y = easy_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy-ham ONLY")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy-ham ONLY")


#Hard-ham + Easy-ham

X = CountVectorizer(max_df=0.7,tokenizer=LemmaTokenizer()).fit_transform(all_data['message'])
y = all_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy- and Hard-ham")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy- and Hard-ham")

Multinomial Naive Bayes: True Positive rate = 0.9827586206896551 and True Negative rate = 0.9984544049459042 for Easy-ham ONLY
Bernoulli   Naive Bayes: True Positive rate = 0.5 and True Negative rate = 0.9969088098918083 for Easy-ham ONLY
Multinomial Naive Bayes: True Positive rate = 0.9793103448275862 and True Negative rate = 0.9853157121879589 for Easy- and Hard-ham
Bernoulli   Naive Bayes: True Positive rate = 0.2896551724137931 and True Negative rate = 0.973568281938326 for Easy- and Hard-ham


We can see that adding lemmatisation to the Vectorization has made slight improvements in the Mulitnomial Naive Bayes classificaition and a reduction in the accuracy for the Bernoulli's classification. 


**b.** The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies? 
 What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages?  **1p** 

This is because of the exact thing the later part of the question is suggesting, it could be very bad balanced and have big overrepresentations of either class in either test or training sets, which would impact the result a lot. If we trained it on mostly spam messages but had mostly ham in the test set it would probably perform very well at detecting spam, True Negative, but would perform worse on detecting ham, True Positive. It would also probably have quite a lot of False Negative, predicting that it is spam when it in fact is ham.

You could always preprocess the data in a way that you handle spam and ham separately and split them separately to then concatenate them into a set after both are individually split. This would ensure that the balance always is the same.

**c.** Re-estimate your classifier using `fit_prior` parameter set to `false`, and answer the following questions:
  - What does this parameter mean?
  - How does this alter the predictions? Discuss why or why not. **0.5p** 

In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize 

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]


#Easy-ham

X = CountVectorizer(max_df=0.6, tokenizer=LemmaTokenizer()
).fit_transform(easy_data['message'])

y = easy_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB(fit_prior=False)
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB(fit_prior=False)
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy-ham ONLY")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy-ham ONLY")


#Hard-ham + Easy-ham

X = CountVectorizer(max_df=0.7,tokenizer=LemmaTokenizer()).fit_transform(all_data['message'])
y = all_data['label']

hamtrain, hamtest, spamtrain, spamtest = train_test_split(X, y, test_size=0.25, random_state = 0)

mnb_classifier = MultinomialNB()
mnb_classifier.fit(hamtrain, spamtrain)
mnb_tn, mnb_fp, mnb_fn, mnb_tp = confusion_matrix(spamtest, mnb_classifier.predict(hamtest), normalize='true').ravel()

bnb_classifier = BernoulliNB()
bnb_classifier.fit(hamtrain, spamtrain)
bnb_tn, bnb_fp, bnb_fn, bnb_tp = confusion_matrix(spamtest, bnb_classifier.predict(hamtest), normalize='true').ravel()

print("Multinomial Naive Bayes: True Positive rate = " + str(mnb_tp) + " and True Negative rate = " + str(mnb_tn) + " for Easy- and Hard-ham")
print("Bernoulli   Naive Bayes: True Positive rate = " + str(bnb_tp) + " and True Negative rate = " + str(bnb_tn) + " for Easy- and Hard-ham")

Multinomial Naive Bayes: True Positive rate = 0.9827586206896551 and True Negative rate = 0.9984544049459042 for Easy-ham ONLY
Bernoulli   Naive Bayes: True Positive rate = 0.5 and True Negative rate = 0.9969088098918083 for Easy-ham ONLY
Multinomial Naive Bayes: True Positive rate = 0.9793103448275862 and True Negative rate = 0.9853157121879589 for Easy- and Hard-ham
Bernoulli   Naive Bayes: True Positive rate = 0.2896551724137931 and True Negative rate = 0.973568281938326 for Easy- and Hard-ham


This parameter tells the model to weather to learn from its priors or use a uniform prior. Prior is nothing but the predictions that can be made by just looking at the data without any evidence at hand. Here we can see that there is no difference between setting the fit_prior to false or true.

**d.** The python model includes smoothing (`alpha` parameter ), explain why this can be important. 
  - What would happen if in the training data set the word 'money' only appears in spam examples? What would the model predict about a message containing the word 'money'? Does the prediction depend on the rest of the message and is that reasonable? Explain your reasoning  **1p** 

-

### What to report and how to hand in.

- You will need to clearly report all results in the notebook in a clear and appropriate way, either using plots or code output (f.x. "print statements"). 
- The notebook must be reproducible, that means, we must be able to use the `Run all` function from the `Runtime` menu and reproduce all your results. **Please check this before handing in.** 
- Save the notebook and share a link to the notebook (Press share in upper left corner, and use `Get link` option. **Please make sure to allow all with the link to open and edit.**
- Edits made after submission deadline will be ignored, graders will recover the last saved version before deadline from the revisions history.
- **Please make sure all cells are executed and all the output is clearly readable/visible to anybody opening the notebook.**