# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Data Science 2: <br>Advanced Topics in Data Science

## Homework 6 - Language Modelling and Text Classification

**Harvard University**<br/>
**Spring 2021**<br/>
**Instructors**: Mark Glickman, Pavlos Protopapas, and Chris Tanner <br/>
**Release Date**: March 24, 2021<br/>
**Due Date**: <font color="red">April 7 (11:59pm EST), 2021</font><br/>
<hr style="height:2pt">

In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get(
    "https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/"
    "content/styles/cs109.css"
).text

HTML(styles)

<a id="overview"></a>
    
# Overview 

In this homework, your goal is to learn the basic principles of language modelling and text classification. As you learned in lecture, these are distinct tasks, but language modelling often produces useful representations of language (e.g., word embeddings). These representations can help with _other_ NLP tasks, such as text classification, question answering, named entity recognition, and many more.

Toward this goal of understanding both language models (LMs) and text classification, you will build a simple language model from scratch (a _bi-gram_ language model). You will then gain experience working with popular pre-trained, open-source language models (e.g., `BERT`). Finally, you will develop text classification models for a real-life application: assisting in systematic reviews of medical research abstracts.

<hr style="height:2pt">

### INSTRUCTIONS

- To submit your assignment follow the instructions given in Canvas.

- Running cells out of order is a common pitfall in Jupyter Notebooks. To make sure your code works, please restart the kernel and run the entire notebook again before you submit.

- We have tried to include all the libraries you may need to do the assignment in the imports cell provided below. **Please use only the libraries provided in those imports.**

- Please use .head() when viewing data. Do not submit a notebook that is **excessively long**. 

- In questions that require code to answer, such as "calculate the $R^2$", do not just output the value from a cell. Write a `print()` function that clearly labels the output, includes a reference to the calculated value, and rounds it to a reasonable number of digits. **Do not hard code values in your printed output**. For example, this is an appropriate print statement:
```python
print(f"The R^2 is {R:.4f}")
```
- Your plots should be clearly labeled, including clear labels for the $x$ and $y$ axes as well as a descriptive title ("MSE plot" is NOT a descriptive title; "95% confidence interval of coefficients of polynomial degree 5" on the other hand is descriptive).

<hr style="height:2pt">

### Learning Objectives
- Become familiar with language models and what they actually do (going beyond using them as black boxes)
- Gain experience with popular, pre-trained language models like `BERT` and `GPT2`.
- Understand metrics for assessing the quality of language models
- Become comfortable using classifiers on text data
- Reflect on what further options you could pursue with the skills you learned in this homework (e.g., Could you implement an auto-complete or auto-correct app for text messages? What else can you do with the embeddings produced from `BERT` or similar models?)

<hr style="height:2pt">

### Notes
- Creating the `BigramLM` requires careful programming, as it's easy to make a mistake. Make sure you allocate appropriate time to complete it.
- You will train your own word embeddings. The takeaway from this part is to understand what embeddings are and how to create them on a new dataset.
- The text classification task is for you to compare and contrast the various modeling techniques. Your goal is to apply deep learning models for classification on a new dataset.


<hr style="height:2pt">

<a id="contents"></a>

## Notebook Contents

- [**Part 0 (Setup Notebook)**](#part0)

- [**PART 1 [ 50 pts ]: Language Modelling**](#part1)
  - [Overview](#part1intro)
  - [Questions](#part1questions)
  - [Solutions](#part1solutions)


- [**PART 2 [ 15 pts ]: Word Embeddings**](#part2)
  - [Overview](#part2intro)
  - [Questions](#part2questions)
  - [Solutions](#part2solutions)


- [**PART 3 [ 35 pts ]: Text Classification**](#part3)
  - [Overview](#part3intro)
  - [Questions](#part3questions)
  - [Solutions](#part3solutions)

NOTE: Exact point values may change a little

<a id="part0"></a>

## Part 0. Setup Notebook

<div class='exercise'><b> Exercise 0 [0 pt]: Install HuggingFace transformers, download datasets</b>

[HuggingFace](https://huggingface.co/transformers/) is revolutionary in providing well-coded, open-source implementations of many state-of-the-art NLP models. It also offers hundreds of corpora for research and development. Be sure to install the transformers package from HuggingFace. (See the cells below for installing HuggingFace transformers and downloading the datasets.)
    
    
- Install transformers 4.4.1
- Download hw_utils.py (Helper functions for homework)
- Download Harry Potter dataset - Part 1&2
- Download Medical Abstract dataset - Part 3
    
</div>


<font color="red">**Installation for Jupyter Hub**</font>



In [2]:
# !sudo /usr/share/anaconda3/bin/conda install -c huggingface transformers==4.4.1 -y

# !wget https://raw.githubusercontent.com/Harvard-IACS/2021-CS109B/master/content/misc/hw_utils.py

# import hw_utils

# !mkdir data
# hw_utils.download_file('https://cs109b-course-data.s3.amazonaws.com/hw6/harry_potter.zip',extract=True,base_path='data/harry_potter/',)

# hw_utils.download_file("https://cs109b-course-data.s3.amazonaws.com/project_1.zip", base_path="datasets", extract=True)


<font color="red">**Installation for Google Colab**</font>

Remember to enable the GPU by going to "Runtime > Change runtime type", selecting "GPU" as the hardware accelerator, and clicking "Save".

In [3]:
# !pip3 install transformers==4.4.1

# !wget https://raw.githubusercontent.com/Harvard-IACS/2021-CS109B/master/content/misc/hw_utils.py

# import hw_utils

# !mkdir data
# hw_utils.download_file('https://cs109b-course-data.s3.amazonaws.com/hw6/harry_potter.zip',extract=True,base_path='data/harry_potter/',)

# hw_utils.download_file("https://cs109b-course-data.s3.amazonaws.com/project_1.zip", base_path="datasets", extract=True)


Regardless of your working environment (e.g., Google Colab, JupyterHub, locally on your own machine), run the following cell to import all necessary libraries.

In [4]:
# import the necessary libraries
import os 
os.environ['TF_CPP_MIN_LOG_LEVEL']='2' #Trying to reduce tensorflow warnings
import re
import math
import string
import hw_utils # LOADS HW CODE (helps de-clutter this notebook)
import time
import json
import random
import numpy as np
import pandas as pd
import nltk
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# useful structures and functions for experiments 
from time import sleep
from collections import Counter
from collections import defaultdict
from glob import glob

# specific machine learning functionality
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.utils import to_categorical
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.utils.layer_utils import count_params
from sklearn.model_selection import train_test_split
from sklearn import manifold
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import f1_score, confusion_matrix
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

In [3]:
# download nltk's punkt sentence tokenizer
nltk.download('punkt')
# download nltk's stop words
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/christanner/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/christanner/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The following cell prints installation/version details, which is very helpful for debugging potential package issues.

In [6]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.experimental.get_visible_devices()
print("Devices:", devices)
print(tf.config.experimental.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.experimental.AUTOTUNE

tensorflow version 2.4.1
keras version 2.4.0
Eager Execution Enabled: True
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of replicas: 1
Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
GPU Available:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
All Physical Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


<a id="part1"></a>


# PART 1. Language Modelling [50 pts]


[Return to contents](#contents)

<a id="part1intro"></a>
## Overview

Recall from lecture that language models (LMs), by definition, allow us to compute the probability of any given sequence $S$. In doing so, they implicitly allow us to compute the most probable <i>next</i> word, too (i.e., given a sequence of $i$ words, the most likely next word is the one that yields the most probable sequence of $i+1$ words).

Let's build a language model from scratch, starting with the most simple one, a `unigram` model. Next, you will build a `bi-gram` model. You'll also use a pre-trained `GPT-2` model. For all of these language models, we will use the text from Harry Potter books. You will evaluate these models both objectively, via perplexity, and subjectively, via your impressions of their outputs.


<a id="part1questions"></a>

## PART 1: Questions


<a id="q11"></a>

**[1.1:](#s11)** **Preprocessing + Base Function for Optimization** <font color='red'>(do not edit)</font>

<a id="q12"></a>

**[1.2:](#s12)** **Unigram Model** <font color='red'>(do not edit)</font>

Below, we provide the functionality for a unigram LM. It inherits the `optimize_parameters()` method from the base class, `NGramModel`. Notice that `optimize_parameters()` expects one to pass a class-specific `calculate_likelihood()` method to it. For example, our `UnigramModel` class defines its own `calculate_likelihood()` method, which accepts an `alpha` parameter. To be clear, the optimal `alpha` value is determined by `optimize_parameters()`.

The `UnigramModel` constructor only accepts a list of tokens (should always be the training tokens). For clarity, we now briefly describe each of its methods:

* `convert_tokens()` simply outputs the same input tokens it received, but after optionally UNKifying (\*U\*) the ones that were not present within the training tokens. Thus, the training tokens are left unedited, but any **dev** or **test** tokens will need to be UNKified.

* `get_word_counts()` returns a dictionary-style count of all tokens (_key_ = word type; _value_ = # of occurrences)

* `calculate_likelihood()` returns the unigram likelihood (negative log likelihood) of the passed-in tokens, based on the training data.
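
The provided `NGramModel` code describes `optimize_parameters()` as a golden-section search over the smoothing parameter. As a rough, standalone illustration of that idea (an iterative sketch, not the recursive implementation used in this homework), the hypothetical helper `golden_section_minimize()` below shrinks a bracket `[lo, hi]` around the minimum of a one-dimensional function. The quadratic `toy_loss` is an assumption standing in for a negative log likelihood.

```python
import math

def golden_section_minimize(f, lo, hi, tol=1e-4):
    """Shrink the bracket [lo, hi] around the minimum of a unimodal function f."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618 (equals 1 - resphi)
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # the minimum lies in [lo, x2]; the old x1 becomes the new x2
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = f(x1)
        else:
            # the minimum lies in [x1, hi]; the old x2 becomes the new x1
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = f(x2)
    return (lo + hi) / 2

# hypothetical stand-in for a negative log likelihood with its minimum at x = 3
def toy_loss(x):
    return (x - 3.0) ** 2 + 1.0

best_x = golden_section_minimize(toy_loss, 0.01, 1000)
print(f"approximate minimizer: {best_x:.3f}")
```

Each iteration reuses one of the two interior evaluations, so only one new call to `f` is needed per step. That matters when each call is a full pass over the dev counts.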
    

<a id="q13"></a>
**[1.3:](#s13)** **Bigram Model** <font color='red'>(you create)</font>  

<a id="q131"></a>

**[1.3.1:](#s131)** **Implementing a Bigram LM**
    
Your task is to write a Bigram model below. You have full freedom to design it however you wish, including adding any methods you wish, as long as it can be executed by the cell that succeeds it. That is:

* the `BigramModel` constructor must accept a list of training tokens, just like how we instantiate the `UnigramModel`.

* you must provide functionality to return `dev_bigram_counts` (a `Counter` or `dict` of the bigram tokens).

* you must write `calculate_likelihood()`, which accepts and works with `dev_bigram_counts`.

* using our provided code, you'll find the optimal $\beta$ (while using the optimal $\alpha$ found from the UnigramModel)

* edit the "BIGRAM TESTING" CELL so that you can actually run your BigramModel

**NOTES:**

* You can design the structure of your bigram tokens to be any format you wish. I find it useful to stitch them together with an `_`, since an underscore is never present within our corpus. Using tuples would also be great.

* To be clear, let's say our **dev set** is `demo_dev.txt`, which has one sentence: __I love NLP.__ The above `parse_file()` will convert this to `dev_data`, a list of six tokens: **[\< s\>, i, love, nlp, ., \< s \>]**. Regardless of how you format the bigram tokens (e.g., tuples or concatenated with `_` or other special characters), your code should produce `dev_bigram_counts` as having five distinct keys/bigrams (illustrated here with the `_` format):

`Counter({'\< s\>\_i': 1, 'i_love': 1, 'love_\*U\*': 1, '\*U\*_.': 1, '._\< s \>': 1})`

* If you set the **train set** to be `demo_train.txt` and the **dev set** to be `demo_dev.txt`, then running the above UnigramModel will yield an optimal $\alpha$ of 4.97. The last line of the **BIGRAM TESTING** cell will invoke `bigram_lm.optimize_parameters()`. If you implemented the Bigram model correctly, this should find the optimal $\beta$ to be 1.97 (when using $\alpha$ = 4.97). Also, when using an $\alpha$ of 1 and a $\beta$ of 1, your Bigram should produce a likelihood of 6.6.

* Ultimately, you need to run your code on the provided Harry Potter datasets, but the `demo_*` files are provided merely as a sanity check.
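
To make the expected `dev_bigram_counts` above concrete, here is a minimal sketch of the counting step alone, using the `_` format. The helper name `count_bigrams()` and its arguments are hypothetical, and this is only one possible design; it does not cover smoothing or the likelihood.

```python
from collections import Counter

def count_bigrams(tokens, vocab, unk="*U*", sep="_"):
    """UNKify out-of-vocabulary tokens, then count adjacent token pairs."""
    tokens = [t if t in vocab else unk for t in tokens]
    return Counter(sep.join(pair) for pair in zip(tokens, tokens[1:]))

# demo tokens, as parse_file() produces them for the demo_* files
train = ['<s>', 'i', 'love', 'cs109b', '.', '<s>', 'i', 'love', 'cs109b', '.', '<s>']
dev = ['<s>', 'i', 'love', 'nlp', '.', '<s>']

dev_bigram_counts = count_bigrams(dev, set(train))
print(dev_bigram_counts)
```

Passing the vocabulary as a `set` keeps each membership check constant-time, which helps once the training data is a full book rather than the demo file.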


<a id="q132"></a>

**[1.3.2:](#s132)** **Evaluate text**
    
This is a continuation of your ongoing work. That is, we are still working with your Bigram LM, which was fit on `training_data` and whose optimal $\alpha$ and $\beta$ values were based on the `dev_data`.

Now, your task is to implement the perplexity metric, so that we have a standardized way of evaluating our Bigram LM. Specifically:
1. Add a method `calculate_perplexity()` to your `BigramModel` class above, then
2. Edit the code cell below to get the perplexity scores for two different texts, `test_data1` and `test_data2`.
3. Print to the screen these two perplexity scores.

**NOTES:**

* As a reminder, perplexity is defined as $2^{-l}$, where $l = \frac{1}{M} \sum_{i=1}^{M}\text{log}(p(w_{i}))$.

* Notice, the $\text{log}(p(w_{i}))$ part of perplexity equation is just **log-likelihood.** In the previous exercise, you computed the **negative log-likelihood.**

* $M$ represents the size of our data that is being evaluated.
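
As a small numerical illustration of the definition above, the sketch below computes perplexity for a made-up list of per-token probabilities. It assumes the logarithm in $l$ is base 2 (as the base of $2^{-l}$ suggests). The function names are hypothetical, and this is not the required `calculate_perplexity()` method. The second function shows the equivalent computation when the negative log-likelihood was accumulated with natural logs, as `math.log` does.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity 2^(-l), with l the average base-2 log-probability per token."""
    M = len(token_probs)
    l = sum(math.log2(p) for p in token_probs) / M
    return 2 ** (-l)

def perplexity_from_nll(nll, M):
    """Same quantity, starting from a total negative log-likelihood in natural logs."""
    return math.exp(nll / M)

# made-up example: a model that assigns probability 1/4 to each of 8 tokens
probs = [0.25] * 8
nll = -sum(math.log(p) for p in probs)

pp_direct = perplexity_from_probs(probs)
pp_from_nll = perplexity_from_nll(nll, len(probs))
print(f"perplexity (direct): {pp_direct:.4f}")
print(f"perplexity (from NLL): {pp_from_nll:.4f}")
```

Both routes agree because $2^{-\frac{1}{M}\sum \log_2 p} = e^{-\frac{1}{M}\sum \ln p}$. A perplexity of 4 here means the model is as uncertain as a uniform choice among four words.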


<a id="q133"></a>
    
**[1.3.3:](#s133)** **Interpret results**
    
Reflect on the two perplexity scores you received in the previous cell. Specifically, answer and discuss the following (1 or 2 sentences per question is sufficient, no need for fluff please):

1. Do the perplexity scores seem reasonable to you (i.e., would you expect higher or lower values)?
2. Relative to each other, would you expect the perplexity scores to be switched -- in terms of which test set yielded the higher or lower perplexity? Why or why not?
3. Imagine we significantly trimmed the `test_data2` data to being only half in size. What would you expect to happen to its perplexity score?
4. In our work above, we trained on `training_data`, which is the first Harry Potter book. We found the optimal $\alpha$ and $\beta$ values based on `dev_data`, which is the second Harry Potter Book. One of the perplexity scores corresponds to `test_data2`, which is the first Lord of the Rings book. However, instead, if we let `dev_data` be a different Lord of the Rings book (e.g., the __second__ Lord of the Rings book), what would you expect to happen to our perplexity score for `test_data2`?


<a id="q14"></a>

**[1.4:](#s14)** **N-Gram Text Generation**

 
We now have some objective, quantitative indication of our `Unigram` and `Bigram` models' abilities to _model_ language. Let's additionally, subjectively inspect how well they can actually generate text.

We should generate text according to the following:

* use the optimal $\alpha$ and $\beta$ values that we learned from the original `dev_data` set (Harry Potter Book #2)
* **probabilistically** (not deterministically) generate each token. That is, we don't simply generate the maximum likely token at each time step. Instead, we randomly sample which token to generate, based on all tokens' likelihood.
* exclude the possibility of generating an UNK (\*U\*) token
* force our model to start w/ a \< s\> token
* stop once our model has generated a total of $N+1$ \< s\> tokens (i.e., $N$ sentences in total)

Below, we provide code that __probabilistically__ generates five sentences, using our `Unigram` LM.
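
As a rough, independent sketch of the sampling rules listed above (not the provided implementation), the hypothetical function `generate_unigram_text()` below draws each token at random in proportion to its add-$\alpha$ smoothed training count. The fixed `seed` argument is an assumption added only to make the sketch reproducible.

```python
import random
from collections import Counter

def generate_unigram_text(training_tokens, alpha, num_sentences, seed=0):
    """Sample tokens from smoothed unigram counts until N+1 <s> tokens exist."""
    rng = random.Random(seed)
    counts = Counter(training_tokens)
    # the UNK token is left out of the candidates, so it can never be sampled
    vocab = sorted(w for w in counts if w != '*U*')
    weights = [counts[w] + alpha for w in vocab]

    generated = ['<s>']          # forced starting token
    boundaries = 1
    while boundaries < num_sentences + 1:
        token = rng.choices(vocab, weights=weights, k=1)[0]
        generated.append(token)
        if token == '<s>':
            boundaries += 1
    return generated

demo_tokens = ['<s>', 'i', 'love', 'cs109b', '.', '<s>', 'i', 'love', 'cs109b', '.', '<s>']
sample = generate_unigram_text(demo_tokens, alpha=4.97, num_sentences=5)
print(' '.join(sample))
```

Because every draw ignores the preceding token, two `<s>` tokens can appear back to back, producing empty "sentences". That is worth keeping in mind when reflecting on sentence lengths.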
    

<a id="q141"></a>
 
    
**[1.4.1:](#s141)** **Interpret results**

As you can see, the UnigramLM generates pretty nonsensical text. Write 2-3 sentences about the lengths of the sentences that are generated from a UnigramLM. Specifically, are the sentence lengths, on average, expected to be shorter than, equal to, or longer than sentences seen in the training data? How would this change as a function of the size of the training corpus?


<a id="q142"></a>
 
    
**[1.4.2:](#s142)** **Bigram Text Generation**
    
Write code below to generate $N$ sentences from your `BigramLM`, a la the UnigramLM text generation above. Your generation should adhere to the same requirements as listed above. That is:
    
* use the optimal $\alpha$ and $\beta$ values that we learned from the original `dev_data` set (Harry Potter Book #2)
* **probabilistically** (not deterministically) generate each token. That is, we don't simply generate the maximum likely token at each time step. Instead, we randomly sample which token to generate, based on all tokens' likelihood.
* exclude the possibility of generating an UNK (\*U\*) token
* force your model to start w/ a \< s\> token
* stop once your model has generated a total of $N+1$ \< s\> tokens (i.e., $N$ sentences in total)

Use your code to generate **five** sentences.


<a id="q143"></a>

**[1.4.3:](#s143)** **Interpret results**

Reflect on your BigramLM's output. Please write 3-6 sentences, in total (not per bulleted point), about:
1. its semantic and syntactic quality compared to the UnigramLM's output
2. its expected sentence length compared to the training corpus
3. explain one or two weaknesses with generating text with n-gram models


<a id="q15"></a>

**[1.5:](#s15)** **GPT-2**

In lecture, we learned about advanced LMs such as Transformers. Specifically, `GPT-2` is an autoregressive LM built from a stack of Transformer decoder blocks (it has no separate encoder). It represents words as distributed, contextualized word embeddings. Using masked self-attention, it does a great job at modelling language. To see its power, let's use GPT-2 to model our Harry Potter corpus!

As a reminder, `GPT-2` has been pre-trained on a _vast_ amount of text data (roughly 40 GB). So, we will not train it on Harry Potter from scratch; we will simply __fine-tune__ it on Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`).


<a id="q151"></a>


**[1.5.1:](#s151)** **Load and create tokenizer for GPT-2**
    
Load `GPT-2` from `HuggingFace`'s libraries. Then, fine-tune it on our Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`).

* Use `GPT2Tokenizer` to tokenize the input text
* NOTE: In [GPT2Tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer), use `distilgpt2` as the `vocab_file` argument.
* Split the input text into blocks of length **100**
* Generate inputs and labels by shifting the input by one position:
    * For example if your tokens are [1,2,3,4,5] then your inputs will be: [1,2,3,4] and labels will be: [2,3,4,5]
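
The blocking and shifting steps above can be sketched in plain Python. Here, integers stand in for the token IDs that the tokenizer would return, and the helper names `make_blocks()` and `shift_block()` are hypothetical. Dropping the leftover tokens that do not fill a complete block is an assumption of this sketch, not a stated requirement.

```python
def make_blocks(token_ids, block_size=100):
    """Split a flat list of token IDs into consecutive full blocks."""
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]

def shift_block(block):
    """Inputs are all tokens but the last; labels are all tokens but the first."""
    return block[:-1], block[1:]

# stand-in for tokenizer output: 250 fake token IDs
token_ids = list(range(250))
blocks = make_blocks(token_ids, block_size=100)

inputs, labels = [], []
for block in blocks:
    x, y = shift_block(block)
    inputs.append(x)
    labels.append(y)

print(len(blocks), len(inputs[0]), len(labels[0]))
print(shift_block([1, 2, 3, 4, 5]))
```

With this scheme, each label is the token that immediately follows the corresponding input position, which is what a next-token prediction loss expects.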
 


<a id="q152"></a>
 
    
**[1.5.2:](#s152)** **Prepare data for GPT-2**
   
* Create TF Datasets with your inputs and labels
* Use a batch size of 12
* When creating the `tf.data` pipeline for the training dataset, build it in this order:
  * Shuffle
  * Batch
  * Prefetch
  
<a id="q153"></a>

**[1.5.3:](#s153)** **Finetune GPT-2**

* Build a model using `TFGPT2LMHeadModel` from the `transformers` package from Hugging Face
* Load the pre-trained weights using `distilgpt2`



<a id="q154"></a>
    
**[1.5.4:](#s154)** **GPT-2 - model training**
    
* Use the `Adam` optimizer with a `learning_rate = 3e-5`, epsilon = `1e-08` and clipnorm = `1.0`
* Use the loss `SparseCategoricalCrossentropy(from_logits=True)`. The pretrained `distilgpt2` has **6** Transformer layers. When fine-tuning our language model, we do not want to compute a loss for every layer. So, when defining the loss functions, remember to set `None` for all of the Transformer layer outputs. We will need to pass an _array_ of loss functions to `model.compile(...)`, rather than just one.
* Use the metrics `SparseCategoricalAccuracy('accuracy')`
* Train the model for at least **30 epochs**. The more you train, the better it will be at generating high-quality text (as you will see in the next question).


<a id="q155"></a>

**[1.5.5:](#s155)** **Generate text with GPT-2**

Now, using your GPT-2 model, generate 5 sentences of Harry Potter text!
    
* Use a prompt (2-3 words) and tokenize the words
* Use the `model.generate(...)` function to generate output text
* Here's an example of text generated from the fine-tuned GPT-2 model:
    * `hufflepuff hates the slytherins.” “but what about you slytherin they hate me.” “oh they hate me!” said harry. “well i’ve got to go — i’ve got to go hogwarts!” “oh you are not going to be allowed goin’ around hogwarts is that funny professor — you know you can’t afford to go off to the bloody baron`
    
<a id="q156"></a> 
    
**[1.5.6:](#s156)** **Interpret results**
    
Reflect on your GPT-2's output. Please write 3-5 sentences, in total (not per bulleted point), about:
1. its semantic and syntactic quality compared to the NGram LM's output
2. weaknesses and strengths you see with the generated text



<a id="part1solutions"></a>

## PART 1: Solutions

[Return to contents](#contents)

<a id="s11"></a>


## **[1.1:](#q11)** **Preprocessing + Base Function for Optimization** <font color='red'>(do not edit)</font>

In [5]:
# parses a text file into tokens, while ignoring lines that have a
# specific "Page #" footer, which is present in the Harry Potter book data
def parse_file(file):

    # extracts all of the pertinent text, while ignoring the Page # lines
    # (the context manager closes the file once it has been read)
    with open(file, 'r') as file_contents:
        raw_text = file_contents.read()
    text_filtered = re.sub(r'Page \| \d+ .*', '', raw_text).replace('\n', ' ')

    # converts the data into a list of tokens, with a
    # <s> denoting sentence boundaries
    sentences = sent_tokenize(text_filtered)
    
    # constructs all tokens
    tokens = ['<s>']
    for sent in sentences:
        sent = word_tokenize(sent)
        tokens.extend([token.lower() for token in sent])
        tokens.append('<s>')
    
    return tokens

#####################################
#   represents a generic n-gram model.
#
#   UnigramModel and BigramModel classes
#   will inherit from this class
#####################################
class NGramModel:
    def __init__(self): pass
        
    def optimize_parameters(self, calculate_likelihood, a, b, c, resphi, counts, alpha=-1):        

        if ((c-a) <= .02):        
            # checks if we need to optimize for alpha (i.e., the unigram model's case)
            if alpha == -1:
                f_a = calculate_likelihood(a, counts)
                f_b = calculate_likelihood(b, counts)
                f_c = calculate_likelihood(c, counts)
                
            # we already have an alpha and will optimize other params (i.e., bigram's case)
            else: 
                f_a = calculate_likelihood(alpha, a, counts)
                f_b = calculate_likelihood(alpha, b, counts)
                f_c = calculate_likelihood(alpha, c, counts)

            if (f_a <= f_b):
                if f_a <= f_c: return a, f_a
                return c, f_c
            elif f_b <= f_c: return b, f_b
            return c, f_c
        
        x = 0
        if c-b > b-a:
            x = b + math.floor(resphi*(c-b))
        else:
            x = b - math.floor(resphi*(b-a))
        
        # checks if we need to optimize for alpha (i.e., the unigram model's case)
        if alpha == -1:
            f_b = calculate_likelihood(b, counts)
            f_x = calculate_likelihood(x, counts)
            
        # we already have an alpha and will optimize other params (i.e., bigram's case)
        else:
            f_b = calculate_likelihood(alpha, b, counts)
            f_x = calculate_likelihood(alpha, x, counts)
        
        if f_x < f_b:
            if c-b > b-a:
                return self.optimize_parameters(calculate_likelihood, b, x, c, resphi, counts, alpha)
            return self.optimize_parameters(calculate_likelihood, a, x, b, resphi, counts, alpha)

        else:
            if c-b > b-a:
                return self.optimize_parameters(calculate_likelihood, a, b, x, resphi, counts, alpha)
            return self.optimize_parameters(calculate_likelihood, x, b, c, resphi, counts, alpha)

##=========================#
##   necessary variables   #
##=========================#

# starting boundaries for Golden-section search algorithm
resphi = 2 - (1 + math.sqrt(5)) / 2
a = 0.01
c = 1000
b = resphi*c

# constructs data
training_file = "data/harry_potter/demo_train.txt" # Book_1_The_Philosophers_Stone.txt" #
dev_file = "data/harry_potter/demo_dev.txt" #Book_2_The_Chamber_of_Secrets.txt" #
test_file1 = "data/harry_potter/Book_3_The_Prisoner_of_Azkaban.txt"
test_file2 = "data/harry_potter/the_fellowship_of_the_ring.txt"

training_data = parse_file(training_file)
dev_data = parse_file(dev_file)
test_data1 = parse_file(test_file1)
test_data2 = parse_file(test_file2)

print(training_data)

['<s>', 'i', 'love', 'cs109b', '.', '<s>', 'i', 'love', 'cs109b', '.', '<s>']


<a id="s12"></a>

## **[1.2:](#q12)** **Unigram Model** <font color='red'>(do not edit)</font>

<div class='exercise'>
    
Below, we provide the functionality for a unigram LM. It inherits the `optimize_parameters()` method from the base class, `NGramModel`. Notice that `optimize_parameters()` expects one to pass a class-specific `calculate_likelihood()` method to it. For example, our `UnigramModel` class defines its own `calculate_likelihood()` method, which accepts an `alpha` parameter. To be clear, the optimal `alpha` value is determined by `optimize_parameters()`.

The `UnigramModel` constructor only accepts a list of tokens (should always be the training tokens). For clarity, we now briefly describe each of its methods:

* `convert_tokens()` simply outputs the same input tokens it received, but after optionally UNKifying (\*U\*) the ones that were not present within the training tokens. Thus, the training tokens are left unedited, but any **dev** or **test** tokens will need to be UNKified.

* `get_word_counts()` returns a dictionary-style count of all tokens (_key_ = word type; _value_ = # of occurrences)

* `calculate_likelihood()` returns the unigram likelihood (negative log likelihood) of the passed-in tokens, based on the training data.
    
</div>

In [6]:
class UnigramModel(NGramModel):
    def __init__(self, training_tokens):
        
        self.training_unigrams = self.convert_tokens(training_tokens, False)
        self.training_word_counts = self.get_word_counts(self.training_unigrams)

    # dev and test tokens should UNK the out-of-vocabulary (OOV) words
    def convert_tokens(self, tokens, convert_to_unks):
        if convert_to_unks:
            tokens = ["*U*" if token not in self.training_unigrams else token for token in tokens]      
        return tokens
    
    def get_word_counts(self, tokens): return Counter(tokens)
    
    def calculate_likelihood(self, alpha, unigram_counts):

        likelihood = 0.0

        # we add 1 for the *U* type
        num_training_types = len(self.training_word_counts) + 1
        for token in unigram_counts:
            numerator = self.training_word_counts[token] + alpha
            denom = len(self.training_unigrams) + alpha*num_training_types
            likelihood += -1.0*unigram_counts[token]*math.log(numerator / denom)            

        return likelihood

Now, let's instantiate the model, fit it to our **training** data, and calculate the optimal $\alpha$ that maximizes the likelihood of the **dev** set.

In [9]:
# constructs a unigram LM
unigram_lm = UnigramModel(training_data)

# parses the dev data. for large corpora, this isn't fast
dev_tokens = unigram_lm.convert_tokens(dev_data, True)
dev_counts = unigram_lm.get_word_counts(dev_tokens)

# determines the MLE for alpha 
optimal_alpha, optimal_likelihood = \
    unigram_lm.optimize_parameters(unigram_lm.calculate_likelihood, a, b, c, resphi, dev_counts)

In [10]:
print(f"optimal_alpha: {optimal_alpha:.2f} (dev data's unigram likelihood: {optimal_likelihood:0.1f})")

optimal_alpha: 4.97 (dev data's unigram likelihood: 10.7)
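
As a sanity check on the add-$\alpha$ smoothing used in `calculate_likelihood()`, the sketch below recomputes the dev likelihood for the demo files by hand. The token lists are typed in directly (the dev sentence "I love NLP." after UNKifying), and the function name `smoothed_unigram_nll()` is hypothetical. Each probability is $(\text{count} + \alpha) / (N + \alpha (V + 1))$, where $N$ is the number of training tokens and $V$ the number of training word types.

```python
import math
from collections import Counter

training_tokens = ['<s>', 'i', 'love', 'cs109b', '.', '<s>', 'i', 'love', 'cs109b', '.', '<s>']
dev_tokens = ['<s>', 'i', 'love', '*U*', '.', '<s>']  # 'nlp' was replaced by *U*

def smoothed_unigram_nll(alpha, train, dev):
    """Negative log likelihood of dev tokens under an add-alpha unigram model."""
    train_counts = Counter(train)
    num_types = len(train_counts) + 1            # +1 for the *U* type
    denom = len(train) + alpha * num_types
    nll = 0.0
    for token, n in Counter(dev).items():
        nll -= n * math.log((train_counts[token] + alpha) / denom)
    return nll

nll_at_optimum = smoothed_unigram_nll(4.97, training_tokens, dev_tokens)
print(f"dev NLL at alpha=4.97: {nll_at_optimum:.1f}")
```

Adding $\alpha$ to every type, including `*U*`, keeps the probabilities summing to one while ensuring that an unseen dev word never receives zero probability.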


<a id="s13"></a>

## **[1.3:](#q13)** **Bigram Model** <font color='red'>(you create)</font> [25 pts]

<a id="s131"></a>
<div class='exercise'>

**[1.3.1:](#q131)** **Implementing a Bigram LM**
    
Your task is to write a Bigram model below. You have full freedom to design it however you wish, including adding any methods you wish, as long as it can be executed by the cell that succeeds it. That is:

* the `BigramModel` constructor must accept a list of training tokens, just like how we instantiate the `UnigramModel`.

* you must provide functionality to return `dev_bigram_counts` (a `Counter` or `dict` of the bigram tokens).

* you must write `calculate_likelihood()`, which accepts and works with `dev_bigram_counts`.

* using our provided code, you'll find the optimal $\beta$ (while using the optimal $\alpha$ found from the UnigramModel)

* edit the "BIGRAM TESTING" CELL so that you can actually run your BigramModel

**NOTES:**

* You can design the structure of your bigram tokens to be any format you wish. I find it useful to stitch them together with an `_`, since an underscore is never present within our corpus. Using tuples would also be great.

* To be clear, let's say our **dev set** is `demo_dev.txt`, which has one sentence: __I love NLP.__ The above `parse_file()` will convert this to `dev_data`, a list of six tokens: **[\< s\>, i, love, nlp, ., \< s \>]**. Regardless of how you format the bigram tokens (e.g., tuples or concatenated with `_` or other special characters), your code should produce `dev_bigram_counts` as having five distinct keys/bigrams (illustrated here with the `_` format):
> Counter({'\< s\>\_i': 1, 'i_love': 1, 'love_\*U\*': 1, '\*U\*_.': 1, '._\< s \>': 1})

* If you set the **train set** to be `demo_train.txt` and the **dev set** to be `demo_dev.txt`, then running the above UnigramModel will yield an optimal $\alpha$ of 4.97. The last line of the **BIGRAM TESTING** cell will invoke `bigram_lm.optimize_parameters()`. If you implemented the Bigram model correctly, this should find the optimal $\beta$ to be 1.97 (when using $\alpha$ = 4.97). Also, when using an $\alpha$ of 1 and a $\beta$ of 1, your Bigram should produce a likelihood of 6.6.

* Ultimately, you need to run your code on the provided Harry Potter datasets, but the `demo_*` files are provided merely as a sanity check.

</div>
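
As a sanity check on the counting convention described in the notes above, here is a minimal sketch that reproduces the five bigrams from the `demo_dev.txt` example. The helper name `count_bigrams`, the toy vocabulary, and the choice of the `_` joining format are assumptions made for illustration; your own design may differ.

```python
from collections import Counter

UNK = "*U*"  # unknown-token symbol used in this notebook


def count_bigrams(tokens, vocab):
    """Count adjacent token pairs, mapping out-of-vocabulary tokens to UNK.

    `count_bigrams` and `vocab` are hypothetical names for illustration only.
    """
    mapped = [tok if tok in vocab else UNK for tok in tokens]
    # pair each token with its successor and join the pair with "_"
    return Counter(f"{w1}_{w2}" for w1, w2 in zip(mapped, mapped[1:]))


# toy vocabulary in which "nlp" was never seen during training (an assumption)
toy_vocab = {"<s>", "i", "love", "."}
demo_dev_tokens = ["<s>", "i", "love", "nlp", ".", "<s>"]
demo_counts = count_bigrams(demo_dev_tokens, toy_vocab)
print(demo_counts)
```

Six tokens give five adjacent pairs, and the unseen word `nlp` appears as `*U*` in both bigrams it takes part in.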

In [11]:
# NOTE: FEEL FREE TO ADD ANY METHODS THAT YOU WISH
class BigramModel(NGramModel):
    def __init__(self, training_tokens):
        #### Your code here ####

        #### End code here ####
                
    # returns the bigram head counts (i.e., token#1)
    # and the full bigram counts (i.e., token#1_token#2)
    def get_word_counts(self, bigrams):
        #### Your code here ####

        #### End code here ####
            
    def calculate_perplexity(self, alpha, beta, num_tokens, bigram_counts):
        #### Your code here ####

        #### End code here ####
        
    def calculate_likelihood(self, alpha, beta, bigram_counts):
                
        total_likelihood = 0.0

        #### Your code here ####

        #### End code here ####
            
        return total_likelihood



In [12]:
###########################
## "BIGRAM TESTING" CELL ##
###########################

# constructs a bigram LM
bigram_lm = BigramModel(training_data)

#### Your code here ####

#### End code here ####

# determines the MLE for beta, using the pre-computed optimal unigram alpha
optimal_beta, optimal_likelihood = \
    bigram_lm.optimize_parameters(bigram_lm.calculate_likelihood, a, b, c, resphi, dev_bigram_counts, optimal_alpha)

In [13]:
print(f"optimal_alpha: {optimal_alpha:.2f}; optimal_beta: {optimal_beta:.2f} (dev data's bigram likelihood: {optimal_likelihood:0.1f})")

optimal_alpha: 4.97; optimal_beta: 1.97 (dev data's bigram likelihood: 6.1)


<a id="s132"></a>
<div class='exercise'>

**[1.3.2:](#q132)** **Evaluate text**
    
This is a continuation of your ongoing work. That is, we're still working with your Bigram LM, which was fit on `training_data` and whose optimal $\alpha$ and $\beta$ values were based on `dev_data`.

Now, your task is to implement the perplexity metric, so that we have a standardized way of evaluating our Bigram LM. Specifically:
1. Add a method `calculate_perplexity()` to your `BigramModel` class above, then
2. Edit the code cell below to get the perplexity scores for two different texts, `test_data1` and `test_data2`.
3. Print to the screen these two perplexity scores.

**NOTES:**

* As a reminder, perplexity is defined as $2^{-l}$, where $l = \frac{1}{M} \sum_{i=1}^{M}\text{log}(p(w_{i}))$.

* Notice, the $\text{log}(p(w_{i}))$ part of the perplexity equation is just the **log-likelihood.** In the previous exercise, you computed the **negative log-likelihood.**

* $M$ represents the size of our data that is being evaluated.
</div>
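
To make the formula concrete, here is a small sketch that computes perplexity from a list of per-token probabilities. The function name `perplexity_from_probs` is hypothetical, and the sketch assumes base-2 logarithms to match the $2^{-l}$ form. If your likelihood uses natural logs, exponentiate with $e$ instead, since the base of the exponent must match the base of the log.

```python
import math


def perplexity_from_probs(token_probs):
    """Return 2 ** (-l), where l is the mean base-2 log-probability."""
    m = len(token_probs)  # M: number of tokens being evaluated
    mean_log2 = sum(math.log2(p) for p in token_probs) / m
    return 2 ** (-mean_log2)


# a model that assigns probability 1/4 to each of 8 tokens
uniform_ppl = perplexity_from_probs([0.25] * 8)
print(uniform_ppl)
```

A model that is always choosing uniformly among four options has a perplexity of 4, which is a useful intuition when judging the scores you obtain.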

In [14]:
#### Your code here ####

#### End code here ####

print(f"test_data1's perplexity: {test1_perplexity:0.2f}")
print(f"test_data2's perplexity: {test2_perplexity:0.2f}")

test_data1's perplexity: 4.15
test_data2's perplexity: 4.16


<a id="s133"></a>
<div class='exercise'>
    
**[1.3.3:](#q133)** **Interpret results**
    
Reflect on the two perplexity scores you received in the previous cell. Specifically, answer and discuss the following (1 or 2 sentences per question is sufficient, no need for fluff please):

1. Do the perplexity scores seem reasonable to you? (i.e., would you expect higher or lower values)?
2. Relative to each other, would you expect the perplexity scores to be switched -- in terms of which test set yielded the higher or lower perplexity? Why or why not?
3. Imagine we significantly trimmed the `test_data2` data to being only half in size. What would you expect to happen to its perplexity score?
4. In our work above, we trained on `training_data`, which is the first Harry Potter book. We found the optimal $\alpha$ and $\beta$ values based on `dev_data`, which is the second Harry Potter Book. One of the perplexity scores corresponds to `test_data2`, which is the first Lord of the Rings book. However, instead, if we let `dev_data` be a different Lord of the Rings book (e.g., the __second__ Lord of the Rings book), what would you expect to happen to our perplexity score for `test_data2`?
</div>

**INTERPRETATION:**

YOUR ANSWER HERE

<a id="s14"></a>


## **[1.4:](#q14)** **N-Gram Text Generation**

<div class='exercise'>
We now have some objective, quantitative indication of our `Unigram` and `Bigram` models' abilities to _model_ language. Let's also subjectively inspect how well they can actually generate text.

We should generate text according to the following:

* use the optimal $\alpha$ and $\beta$ values that we learned from the original `dev_data` set (Harry Potter Book #2)
* **probabilistically** (not deterministically) generate each token. That is, we don't simply generate the maximum likely token at each time step. Instead, we randomly sample which token to generate, based on all tokens' likelihood.
* exclude the possibility of generating an UNK (\*U\*) token
* force our model to start w/ a \< s\> token
* stop once our model has generated a total of $N+1$ \< s\> tokens (i.e., $N$ sentences in total)

Below, we provide code that __probabilistically__ generates five sentences, using our `Unigram` LM.
    
</div>


In [None]:
# probabilistically generates text from our UnigramLM
def generate_unigram_text(unigram_lm, num_sentences):
    
    # notice, we don't include an UNK (*U*) token, so no +1 adjustments
    num_training_unigram_types = len(unigram_lm.training_word_counts)

    # precomputes the likelihood of generating each token
    token_probs = {}
    total = 0
    for word_type in unigram_lm.training_word_counts:
        numerator = unigram_lm.training_word_counts[word_type] + optimal_alpha
        denom = len(unigram_lm.training_unigrams) + optimal_alpha*(num_training_unigram_types)
        prob = numerator / denom
        token_probs[word_type] = prob
        total += prob

    # notice that they sum to 1. the same should hold true for your bigram LM
    print("marginal probability:", total)

    # convenient format for probabilistic sampling
    tokens, token_probs = zip(*[(k, token_probs[k]) for k in token_probs.keys()])
    tokens = list(tokens)
    token_probs = list(token_probs)

    # let's generate sentences!
    output = "<s>"
    s_count = 0
    while s_count < num_sentences:
        rand_token = random.choices(tokens, token_probs)[0]
        output += " " + rand_token
        if rand_token == "<s>":
            s_count += 1
    return output

In [None]:
print(generate_unigram_text(unigram_lm, 5))

<a id="s141"></a>

<div class='exercise'> 
    
**[1.4.1:](#q141)** **Interpret results**

As you can see, the UnigramLM generates pretty nonsensical text. Write 2-3 sentences about the lengths of the sentences that are generated from a UnigramLM. Specifically, are the sentence lengths, on average, expected to be shorter than, equal to, or longer than sentences seen in the training data? How would this change as a function of the size of the training corpus?
</div>

**INTERPRETATION:**

YOUR ANSWER HERE

<a id="s142"></a>

<div class='exercise'> 
    
**[1.4.2:](#q142)** **Bigram Text Generation**
    
Write code below to generate $N$ sentences from your `BigramLM`, a la the UnigramLM text generation above. Your generation should adhere to the same requirements as listed above. That is:
    
* use the optimal $\alpha$ and $\beta$ values that we learned from the original `dev_data` set (Harry Potter Book #2)
* **probabilistically** (not deterministically) generate each token. That is, we don't simply generate the maximum likely token at each time step. Instead, we randomly sample which token to generate, based on all tokens' likelihood.
* exclude the possibility of generating an UNK (\*U\*) token
* force your model to start w/ a \< s\> token
* stop once your model has generated a total of $N+1$ \< s\> tokens (i.e., $N$ sentences in total)

Use your code to generate **five** sentences.
</div>
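
The key difference from unigram generation is that each draw is conditioned on the previous token. The sketch below illustrates only that sampling step on a toy table of conditional probabilities. The names `sample_next_token` and `toy_cond_probs`, and the probabilities themselves, are made up for illustration and are not part of the provided code.

```python
import random


def sample_next_token(prev_token, cond_probs, rng=random):
    """Sample one token from the distribution conditioned on prev_token."""
    candidates = cond_probs[prev_token]
    tokens = list(candidates.keys())
    weights = list(candidates.values())
    # random.choices draws proportionally to the weights (not an argmax)
    return rng.choices(tokens, weights=weights)[0]


# toy conditional distributions p(next | previous); the values are made up
toy_cond_probs = {
    "<s>": {"harry": 0.7, "the": 0.3},
    "harry": {"said": 0.6, "<s>": 0.4},
    "the": {"harry": 1.0},
    "said": {"<s>": 1.0},
}

rng = random.Random(0)  # seeded only so the demo is repeatable
sampled = sample_next_token("<s>", toy_cond_probs, rng)
print(sampled)
```

In a full solution, the conditional probabilities would come from your smoothed bigram estimates, with the UNK token excluded and the remaining probabilities renormalized.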

In [11]:
def generate_bigram_text(bigram_lm, num_sentences):
#### Your code here ####

#### End code here ####


In [18]:
print(generate_bigram_text(bigram_lm, 5))

None


<a id="s143"></a>

<div class='exercise'> 
    
**[1.4.3:](#q143)** **Interpret results**


Reflect on your BigramLM's output. Please write 3-6 sentences, in total (not per bulleted point), about:
1. its semantic and syntactic quality compared to the UnigramLM's output
2. its expected sentence length compared to the training corpus
3. explain one or two weaknesses with generating text with n-gram models
</div>

**INTERPRETATION:**

*Your answer here*

<a id="s15"></a>

## **[1.5:](#q15)** **GPT-2**

<br>
    
In lecture, we learned about advanced LMs such as Transformers. Specifically, `GPT-2` is an autoregressive LM built from a stack of Transformer Decoder blocks (unlike the original Transformer, it has no separate Encoder). It represents words as distributed, contextualized word embeddings. Using masked self-attention, it's able to do a great job at modelling language. To see its power, let's use GPT-2 to model our Harry Potter corpus!

As a reminder, `GPT-2` has been pre-trained on a _vast_ amount of text data (40 GB). So, we will not train it on Harry Potter from scratch; we will simply __fine-tune__ it on Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`).

<a id="s151"></a>

<div class='exercise'> 
    
**[1.5.1:](#q151)** **Load and create tokenizer for GPT-2**
    
Load `GPT-2` from `HuggingFace`'s libraries. Then, fine-tune it on our Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`).

* Use `GPT2Tokenizer` to tokenize the input text
* NOTE: When loading the [GPT2Tokenizer](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer) with `from_pretrained(...)`, pass `distilgpt2` as the pretrained model name.
* Split the input text into blocks of length **100**
* Generate inputs and labels by shifting the input by one position:
    * For example if your tokens are [1,2,3,4,5] then your inputs will be: [1,2,3,4] and labels will be: [2,3,4,5]
 
</div>
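
The blocking and shifting steps can be sketched on plain Python lists before applying them to real tokenizer output. The helper names `make_blocks` and `shift_inputs_labels` are hypothetical, and dropping the final partial block is an assumption rather than a requirement of the assignment.

```python
def make_blocks(token_ids, block_size):
    """Split token_ids into consecutive full blocks, dropping any remainder."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


def shift_inputs_labels(block):
    """Inputs are all but the last token; labels are all but the first."""
    return block[:-1], block[1:]


toy_ids = list(range(1, 11))      # stand-in for tokenizer output
blocks = make_blocks(toy_ids, 5)  # the homework uses a block size of 100
inputs, labels = shift_inputs_labels(blocks[0])
print(inputs, labels)
```

Each label is simply the token that follows the corresponding input position, which is exactly the next-token prediction objective.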


In [12]:
# Load training text
training_file = "data/harry_potter/Book_1_The_Philosophers_Stone.txt"
with open(training_file) as file:
    training_data = file.read()

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

#### Your code here ####

#### End code here ####


<a id="s152"></a>

<div class='exercise'> 
    
**[1.5.2:](#q152)** **Prepare data for GPT-2**
   
* Create TF Datasets with your inputs and labels
* Use a batch size of 12
* When creating the `tf.data` pipeline for the training dataset, follow this order:
  * Shuffle
  * Batch
  * Prefetch
</div>

In [None]:
# Your code here



<a id="s153"></a>

<div class='exercise'> 
    
**[1.5.3:](#q153)** **Finetune GPT-2**

* Build a model using `TFGPT2LMHeadModel` from the `transformers` package from Hugging Face
* Load the pre-trained weights using `distilgpt2`
</div>


In [13]:
# Your code here




<a id="s154"></a>

<div class='exercise'>
    
**[1.5.4:](#q154)** **GPT-2 - model training**
    
* Use the `Adam` optimizer with `learning_rate = 3e-5`, `epsilon = 1e-08`, and `clipnorm = 1.0`
* Use the loss `SparseCategoricalCrossentropy(from_logits=True)`. The pretrained `distilgpt2` has **6** Transformer layers. When fine-tuning our language model, we do not want to compute a loss for every layer. So, when defining the loss functions, remember to set `None` for all of the Transformer output layers. We will need to pass an _array_ of loss functions to `model.compile(...)`, rather than just one.
* Use the metrics `SparseCategoricalAccuracy('accuracy')`
* Train the model for at least **30 epochs**. The more you train, the better it will be at generating high-quality text (as you will see in the next question).

</div>

In [14]:
# Your code here



<a id="s155"></a>


<div class='exercise'>  

**[1.5.5:](#q155)** **Generate text with GPT-2**

Now, using your GPT-2 model, generate 5 sentences of Harry Potter text!
    
* Use a prompt (2-3 words) and tokenize the words
* Use the `model.generate(...)` function to generate output text
* Here is an example of text generated by the fine-tuned GPT-2 model:
    * `hufflepuff hates the slytherins.” “but what about you slytherin they hate me.” “oh they hate me!” said harry. “well i’ve got to go — i’ve got to go hogwarts!” “oh you are not going to be allowed goin’ around hogwarts is that funny professor — you know you can’t afford to go off to the bloody baron`
    
</div>

In [15]:
# Your code here



<a id="s156"></a>

<div class='exercise'>
    
**[1.5.6:](#q156)** **Interpret results**
    
Reflect on your GPT-2's output. Please write 3-5 sentences, in total (not per bulleted point), about:
1. its semantic and syntactic quality compared to the NGram LM's output
2. weaknesses and strengths you see with the generated text
</div>

**INTERPRETATION:**

*Your answer here*

<a id="part2"></a>
    
<!-- <div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA"> -->

# PART 2. Word Embeddings [15 pts]

[Return to contents](#contents)

<a id="part2intro"></a>

## Overview

[Return to contents](#contents)
    
As we saw in lecture, computers have no inherent way of making sense of any text. NLP, at large, concerns this very process of trying to intelligently "understand" and make use of language. The representation of language, especially at the word-level, is incredibly connected to a model's ability to perform well on any task. With a poor representation, models may never be able to use the text data.

Here, in Part 2, you will gain more hands-on experience working with and understanding word embeddings -- specifically, we will work with distributed word embeddings (word2vec), which have been the de facto standard in NLP since 2013.

<a id="part2questions"></a>

## PART 2: Questions
The goal of this task is to train a model to create a word2vec embedding using text from the Harry Potter books.

[Return to contents](#contents)

<a id="q21"></a>

<hr style="height:1pt">

**[2.1:](#s21)** **Build Word2Vec Embeddings**

<a id="q211"></a>

**[2.1.1:](#s211)** **Load and preprocess Data**
    
For this section we will use Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`). You are welcome to train on more Harry Potter books. The more data you train with, the better your word embeddings will be.

* Load text from book 1 `Book_1_The_Philosophers_Stone.txt`
* Tokenize text using `nltk`'s word tokenizer, removing all punctuation
* Remove stopwords
* Use the provided helper function, `hw_utils.build_dataset(words, vocab_size)`, to generate sequential data. Your vocabulary size is the number of unique word types.


<a id="q212"></a>

**[2.1.2:](#s212)** **Create a Skip Grams dataset**

* Use the `skipgrams` function from tensorflow.keras to create the training dataset
* Use `window_size` of 4
* The output from `skipgrams` will consist of word couples and labels. Split the couples into a target and a context
    * For example if the outputs from `skipgrams` are `couples:[[4, 8], [966, 334], [87, 5326], [452, 4139], [744, 1988]] labels: [1, 1, 0, 0, 0]`
    * We need our training data in this format `target: [4, 966, 87, 452, 744], context: [8, 334, 5326, 4139, 1988], labels: [1, 1, 0, 0, 0]`


<a id="q213"></a> 

**[2.1.3:](#s213)** **Build the Data Pipeline**

* Create TF Datasets with your target, context, and labels
* Use a batch size of 1024
* When creating the `tf.data` pipeline for the training dataset, follow this order:
  * Shuffle
  * Batch
  * Prefetch
  
 
<a id="q214"></a>
    
**[2.1.4:](#s214)** **Build & Train a Word2Vec Skip Gram Model**

* Build a Word2Vec with an embedding dimension of **128**
* Your model would require a word embedding and context embedding
* Take the output from the word embedding and context embedding and perform a dot product
* Use `sigmoid` for your output activation
* Use `Adam` optimizer with `learning_rate = 0.01`
* Use `binary_crossentropy` as the loss function
* Use `accuracy` as the metric
* Train for 10 or more epochs

<a id="q22"></a>

<hr style="height:1pt">

**[2.2:](#s22)** **Analyze Word2Vec**

<a id="q221"></a>

**[2.2.1:](#s221)** **Pairwise Similarity**

Let us analyze the trained word2vec embeddings
    
* From the trained model get the weights from the `word embedding` layer
* Find the top `5` words similar to these words `'hogwarts','quidditch','dumbledore','gryffindor','hermione','hagrid'`
* Use the util function provided `find_similar_words(...)`  to find similar words
 

<a id="q222"></a>
**[2.2.2:](#s222)** **Interpret Similarity**
* Discuss your thoughts on what "similar words" means for these embeddings

<a id="q223"></a>

**[2.2.3:](#s223)** **Visualize Embeddings**

* Get a list of similar words and embeddings by passing the whole `word_list = ['hogwarts','quidditch','dumbledore','gryffindor','hermione','hagrid']` to the `find_similar_words(...)` function
* Use the given plotting code to plot and visualize your embeddings

<a id="q224"></a>

**[2.2.4:](#s224)** **Interpret Embeddings**

* What can you interpret from the embeddings plotted above? How might you make the embeddings better?
</div>

<a id="part2solutions"></a>

## PART 2: Solutions

[Return to contents](#contents)

<a id="s21"></a>

### **[2.1:](#q21)** **Build Word2Vec Embeddings**




<a id="s211"></a>
<div class='exercise'>  

**[2.1.1:](#s211)** **Load and preprocess Data**
    
For this section we will use Harry Potter Book #1 (`Book_1_The_Philosophers_Stone.txt`). You are welcome to train on more Harry Potter books. The more data you train with, the better your word embeddings will be.

* Load text from book 1 `Book_1_The_Philosophers_Stone.txt`
* Tokenize text using `nltk`'s word tokenizer, removing all punctuation
* Remove stopwords
* Use the provided helper function, `hw_utils.build_dataset(words, vocab_size)`, to generate sequential data. Your vocabulary size is the number of unique word types.


</div>

In [17]:
### your code here

<a id="s212"></a>
<div class='exercise'>  

**[2.1.2:](#q212)** **Create a Skip Grams dataset**

* Use the `skipgrams` function from tensorflow.keras to create the training dataset
* Use `window_size` of 4
* The output from `skipgrams` will consist of word couples and labels. Split the couples into a target and a context
    * For example if the outputs from `skipgrams` are `couples:[[4, 8], [966, 334], [87, 5326], [452, 4139], [744, 1988]] labels: [1, 1, 0, 0, 0]`
    * We need our training data in this format `target: [4, 966, 87, 452, 744], context: [8, 334, 5326, 4139, 1988], labels: [1, 1, 0, 0, 0]`

  
</div>
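
Using the example values from the bullets above, splitting the couples into targets and contexts amounts to unzipping a list of pairs. This is only a plain-Python sketch; the variable names are illustrative, and with real data you may prefer array slicing instead.

```python
couples = [[4, 8], [966, 334], [87, 5326], [452, 4139], [744, 1988]]
labels = [1, 1, 0, 0, 0]

# unzip the pairs: the first element is the target word, the second the context word
target, context = (list(column) for column in zip(*couples))
print(target, context, labels)
```

The labels are left untouched: a `1` marks a pair that truly co-occurred within the window, and a `0` marks a negative sample.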

In [18]:
### your code here

<a id="s213"></a>
<div class='exercise'>  

**[2.1.3:](#q213)** **Build the Data Pipeline**

* Create TF Datasets with your target, context, and labels
* Use a batch size of 1024
* When creating the `tf.data` pipeline for the training dataset, follow this order:
  * Shuffle
  * Batch
  * Prefetch

</div>

In [19]:
### your code here


<a id="s214"></a>
<div class='exercise'>  
    
**[2.1.4:](#q214)** **Build & Train a Word2Vec Skip Gram Model**

* Build a Word2Vec with an embedding dimension of **128**
* Your model would require a word embedding and context embedding
* Take the output from the word embedding and context embedding and perform a dot product
* Use `sigmoid` for your output activation
* Use `Adam` optimizer with `learning_rate = 0.01`
* Use `binary_crossentropy` as the loss function
* Use `accuracy` as the metric
* Train for 10 or more epochs

</div>
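
Before building the Keras model, it may help to see what the dot product, sigmoid, and binary cross-entropy compute for a single (word, context) pair. The sketch below uses plain Python with made-up 2-dimensional vectors. The function names are hypothetical, and the real model uses trainable 128-dimensional embedding layers instead.

```python
import math


def skipgram_score(word_vec, context_vec):
    """Sigmoid of the dot product between a word and a context embedding."""
    dot = sum(w * c for w, c in zip(word_vec, context_vec))
    return 1.0 / (1.0 + math.exp(-dot))


def binary_crossentropy(label, prob, eps=1e-7):
    """Loss for one pair; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


aligned = skipgram_score([1.0, 2.0], [1.0, 2.0])     # dot product = 5
opposed = skipgram_score([1.0, 2.0], [-1.0, -2.0])   # dot product = -5
orthogonal = skipgram_score([1.0, 0.0], [0.0, 1.0])  # dot product = 0
print(aligned, opposed, orthogonal)
```

Training pushes true pairs (label 1) toward aligned vectors and negative samples (label 0) toward opposed ones, which is why nearby vectors end up reflecting shared contexts.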

In [20]:
### your code here


<a id="s22"></a>

### **[2.2:](#q22)** **Analyze Word2Vec**

<a id="s221"></a>
<div class='exercise'>  

**[2.2.1:](#q221)** **Pairwise Similarity**

Let us analyze the trained word2vec embeddings
    
* From the trained model get the weights from the `word embedding` layer
* Find the top `5` words similar to these words `'hogwarts','quidditch','dumbledore','gryffindor','hermione','hagrid'`
* Use the util function provided `find_similar_words(...)`  to find similar words

</div>
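
For intuition about what a similarity lookup does, here is a small sketch that ranks words by cosine similarity. It is not the implementation of the provided `find_similar_words(...)` helper; whether that helper uses cosine similarity is an assumption, and the function names and 2-dimensional vectors below are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_similar(query, embeddings, k=5):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(embeddings[query], vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# made-up 2-d vectors purely for illustration
toy_embeddings = {
    "hogwarts": [1.0, 0.1],
    "castle": [0.9, 0.2],
    "broom": [0.1, 1.0],
}
nearest = top_k_similar("hogwarts", toy_embeddings, k=2)
print(nearest)
```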

In [22]:
### your code here

In [23]:
### your code here

In [24]:
### your code here

<a id="s222"></a>
<div class='exercise'>  
    
**[2.2.2:](#q222)** **Interpret Similarity**

* Discuss your thoughts on what "similar words" means for these embeddings

</div>

**INTERPRETATION:**

*Your answer here*

<a id="s223"></a>
<div class='exercise'>  

**[2.2.3:](#q223)** **Visualize Embeddings**

* Get a list of similar words and embeddings by passing the whole `word_list = ['hogwarts','quidditch','dumbledore','gryffindor','hermione','hagrid']` to the `find_similar_words(...)` function
* Use the given plotting code to plot and visualize your embeddings

</div>

In [25]:
### your code here

In [26]:
### your code here

<a id="s224"></a>
<div class='exercise'>  

**[2.2.4:](#q224)** **Interpret Embeddings**

* What can you interpret from the embeddings plotted above? How might you make the embeddings better?

</div>

**INTERPRETATION:**

*Your answer here*

<a id="part3"></a>
    
<!-- <div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA"> -->

# PART 3. Text Classification [35 pts]

[Return to contents](#contents)
    

<a id="part3intro"></a>

## Overview

[Return to contents](#contents)
    
Throughout CS109A and CS109B, we have often modelled classification tasks. There is an infinite number of classification tasks that one could perform with text data, as there is no limit to the contents (and goals) of language. In fact, NLP has several sub-fields/ popular problems that are largely treated as classification tasks (e.g., sentiment analysis, natural language entailment, and generic 'text classification' like spam detection). Moreover, _nearly all_ NLP problems have at least some classification component.

Here, in Part 3, you will gain experience with text classification by working on an actual real-world task that Chris Tanner encountered this past year:

Medical research is produced at an astronomical rate ([a few thousand articles are published daily](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3191655/)). Thus, when embarking on a particular research study, conducting a proper literature search can be unwieldy and overwhelming. One typically provides very carefully crafted search terms, then has to sift through several thousand results. A doctor reads the **Abstracts** of thousands of these candidate papers. This process is referred to as a **Systematic Review**.

Oftentimes, the **Systematic Review** only yields a handful of useful research papers. It's akin to "searching for a needle in a haystack". The research papers that aren't of interest are considered `irrelevant`, and the potentially useful ones are considered `not irrelevant` (meaning they will be read further and potentially used in the study). Clearly, this process of classifying papers for a particular research topic requires having expert-level knowledge of the research and medical domain. It's really important to not overlook useful (non-irrelevant) papers. Oftentimes two doctors will read over the same exact list of thousands of abstracts, just to ensure no papers are overlooked.

If the Systematic Review yields _many_ useful papers, then one might be able to conduct a **Meta Analysis**, allowing one to draw new insights and research conclusions from the myriad of independent, regionalized research throughout the world. So, one needs to be incredibly meticulous when reading through thousands of abstracts. Wouldn't it be great if NLP could assist in this task? Well, let's find out if it can!

For Part 3, you will help implement a text classifier for the **Systematic Review** process. In this real-life situation, an infectious disease doctor is researching sexually transmitted infections (STIs) in women who have HIV and are living in sub-Saharan Africa. STIs like __gonorrhea__ and __chlamydia__ are under-treated in low-resource communities. Because there isn't affordable and accessible STI testing in the area, there is no population-wide screening. So, doctors don't have a good understanding of the epidemiology and prevalence of STIs -- especially amongst women who have HIV, which carries extra, serious health risks.

Let's build a text classifier to see if we can help find `not irrelevant` abstracts. We can train the model by providing many _already-annotated_ abstracts, where each abstract is labelled as being `irrelevant` or `not irrelevant`. At _test time_, let's see if your model can help "suggest" which papers to strongly consider.

<a id="part3questions"></a>

## PART 3: Questions
The goal of this task is to use medical abstracts from the conference paper dataset and build/train a model to classify whether paper abstracts are irrelevant or not.

[Return to contents](#contents)
<hr/>
<a id="q31"></a>

#### **[3.1:](#s31)** **Load and preprocess Data**

<a id="q311"></a>

**[3.1.1:](#s311)** 

Download the dataset (code provided below). The dataset consists of 3 files, which you'll need to load:
- `review_78678_irrelevant.csv`
- `review_78678_not_irrelevant_included.csv` and
- `review_78678_not_irrelevant_excluded.csv`

All of these three should be present inside `datasets/project_1/` 

<a id="q312"></a>

**[3.1.2:](#s312)** 
* All 3 files have the same number of columns.
* **We will be using just the `Abstract` column from all the files.**
* Load the data from all the files into 3 dataframes. Create a new column (in all 3 dataframes) called `target` and assign `0` to it for the data from file `review_78678_irrelevant.csv` and `1` for the other two files.

<a id="q313"></a>

**[3.1.3:](#s313)** 
* Concatenate all the dataframes into one and keep just the columns `Abstract` and `target`. Apply `dropna()` on the dataframe.


<a id="q32"></a>
<hr/> 

#### **[3.2:](#s32)** **Build Data pipelines**

For this section we will be using the `tf.data` API to build a simple but efficient data pipeline. The `tf.data` API enables you to build complex input pipelines from simple, reusable pieces. [Reference](https://www.tensorflow.org/guide/data)

<a id="q321"></a>
**[3.2.1:](#s321)** 
* Set the variables - VOCABULARY_SIZE to 15000, SEQUENCE_SIZE (pick appropriate value), and EMBEDDING_SIZE to 100. Use [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) class from keras to build a text vectorizer for the entire dataset (Abstracts). Use output sequence length as 256 and `standardize_text()` provided below. 

<a id="q321a"></a>
**[3.2.1a:](#s321a)** What value did you choose for the sequence length? Justify your choice.

<a id="q322"></a>
**[3.2.2:](#s322)** 
* Generate the vocabulary and display its length. Remember to call `adapt()` on the `TextVectorization` layer.

<a id="q323"></a>
**[3.2.3:](#s323)** 
* Split the dataset, holding out 20% for validation. Use a batch size of 32. Use train and validation shuffle buffer sizes equal to the number of data points in the train and validation sets, respectively.

<a id="q324"></a>
**[3.2.4:](#s324)** 
* Build tf.data pipelines for the training and validation dataset. Follow this order when building the pipeline:
  * Shuffle (if necessary) 
  * Batch
  * Map (Convert your text to vectors)
  * Prefetch
* We will use this data pipeline for Model FFNN and LSTM.

<a id="q324b"></a>
**[3.2.4b:](#s324b)** Explain why you should shuffle the data and describe when you should not shuffle the data. 


<hr/>

<a id="q33"></a>
#### **[3.3:](#s33)** **Model 1: Feed Forward Neural Network**

<a id="q331"></a>
**[3.3.1:](#s331)** 
* The first model will be a simple Feed Forward Neural Network. When building a model, be sure to give it a unique name, since we will need this for comparing our results later. The model needs to have an Embedding layer followed by a Flatten layer and one or more Dense layers.


<a id="q332"></a>
**[3.3.2:](#s332)** 
* Display model summary. 
* Train the model with a `learning_rate = 0.003` and `epochs = 10`
* The metric we want to monitor is `accuracy`.
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`

<a id="q332a"></a>
**[3.3.2a:](#s332a)** Name two reasons why using one-hot encoded vectors instead of the `Embedding` layer is not the way to go.



<a id="q332b"></a>
**[3.3.2b:](#s332b)** Explain what the inputs and outputs of the `Embedding` layer are. Also comment on the dimension going in and what is coming out.



<a id="q333"></a>
**[3.3.3:](#s333)** 
* Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.

<hr/>


<a id="q34"></a>
### **[3.4:](#s34)** **Model 2: LSTM**

The next model will be a bidirectional LSTM. When building a model, be sure to give it a unique name, since we will need this for comparing our results later.

<a id="q341"></a>
**[3.4.1:](#s341)** 
* Add an `Embedding` layer in your model
* The Embedding layer is followed by a Bidirectional LSTM
* The Bidirectional LSTM layer (with 64 units) can be followed by a few Dense layers


<a id="q342"></a>
**[3.4.2:](#s342)** 
* Display model summary.
* Train the model with a `learning_rate = 1e-4` and `epochs = 10`
* The metric we want to monitor is `accuracy` 
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`


<a id="q342a"></a>
**[3.4.2a:](#s342a)** How many parameters does the Bidirectional layer have? (Show us how you calculated it.)
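
One way to check your count is with the standard LSTM parameter formula: each of the four gates has an input weight matrix, a recurrent weight matrix, and a bias vector, and a bidirectional wrapper holds two such LSTMs. The sketch below assumes the layer's input is the embedding output with `EMBEDDING_SIZE = 100` and that the layer has 64 units per direction. The function names are hypothetical, so compare the result against your own `model.summary()`.

```python
def lstm_param_count(input_dim, units):
    """Parameters in one LSTM layer: 4 gates, each with W, U, and b."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_param_count(input_dim, units):
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_param_count(input_dim, units)


n_params = bidirectional_lstm_param_count(input_dim=100, units=64)
print(n_params)
```

If your embedding dimension or number of units differs, the same formula applies with those values substituted.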

<a id="q343"></a>
**[3.4.3:](#s343)** 
* Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.

<hr/>

<a id="q35"></a>

### **[3.5:](#s35)** **Build Data Pipelines for BERT**

BERT requires the data to be tokenized in a specific way, for this you need to use the `BertTokenizer` from the `transformers` package from Hugging Face. Steps to prepare your dataset:

<a id="q351"></a>
**[3.5.1:](#s351)** 

* Use `BertTokenizer` to tokenize the input text
* When loading the [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with `from_pretrained(...)`, pass `bert-base-uncased` as the pretrained model name. Also set `do_lower_case=True`
- When using `tokenizer.encode_plus(...)`, set the `max_length` to a value suitable for the dataset. It would need to be `<=512`.
- The output of `tokenizer.encode_plus(...)` is a dictionary with the keys `'input_ids', 'token_type_ids', 'attention_mask'`

<a id="q352"></a>
**[3.5.2:](#s352)**
- Create TF Datasets using the tokenized results. When using `tf.data.Dataset.from_tensor_slices(...,...)` look out for the x values passed in. `BERT` requires 3 inputs `'input_ids', 'token_type_ids', 'attention_mask'` as a tuple.
* Use a batch size of 32
* Use train and validation shuffle buffer size equal to the number of data points in train and validation sets respectively
* When creating `tf.data` pipelines for the training and validation datasets, follow this order:
  * Shuffle (if necessary)
  * Batch
  * Prefetch

 
<hr/>

<a id="q36"></a>
### **[3.6:](#s36)** **Model 3: Finetune BERT**

In this section we will finetune BERT for sequence classification.

<a id="q361"></a>
**[3.6.1:](#s361)** 

* Build a model using `TFBertForSequenceClassification` from the `transformers` package from Hugging Face
* Load the pre-trained weights using `bert-base-uncased`. Make sure to set the `name` argument to a unique name, since we will need it to compare our results later.

<a id="q362"></a>
**[3.6.2:](#s362)** 

* Train the model with a `learning_rate = 2e-5` and `epochs = 5`
* The metric we want to monitor is `accuracy`
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`
* Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.


<a id="q362a"></a>
**[3.6.2a:](#s362a)** 
* How is BERT related to the original Transformer (encoder + decoder) model introduced in the *Attention Is All You Need* paper? Describe both the similarities and the differences.


<a id="q362b"></a>
**[3.6.2b:](#s362b)** What is the difference between attention and self-attention in the context of the Transformer architecture?

 
<hr/>

 <a id="q37"></a>
### **[3.7:](#s37)** **Compare all Models**

 
<a id="q371"></a>
**[3.7.1:](#s371)**

Do you think accuracy is a good metric here? Why or why not? If we were to incorporate an `f1_score` metric into model training, how would you go about it? Create a confusion matrix, and report and interpret precision and recall. Interpret the best F1 score reported with respect to the problem at hand. Which metric is most important for the researcher: precision, recall, or F1 score (and why)?
 
 
 <hr/>

<a id="part3solutions"></a>

## PART 3: Solutions

[Return to contents](#contents)

<a id="s31"></a>

### **[3.1:](#q31)** **Load and preprocess Data**


<a id="s311"></a>
<div class='exercise'>  

**[3.1.1:](#q311)** Download the dataset (code provided below). The dataset consists of 3 files, which you'll need to load: `review_78678_irrelevant.csv`, `review_78678_not_irrelevant_included.csv`, and `review_78678_not_irrelevant_excluded.csv` all of which should be present inside `datasets/project_1/` 

</div>

In [34]:
# Run this cell to load all the data required for this part
DATA_DIR = "datasets"
start_time = time.time()
hw_utils.download_file("https://cs109b-course-data.s3.amazonaws.com/project_1.zip", base_path=DATA_DIR, extract=True)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

Download execution time (mins) 0.0038926124572753905


<a id="s312"></a>
<div class='exercise'>  

**[3.1.2:](#q312)** All 3 files have the same number of columns. We will be using just the **Abstract** column from all the files. Load the data from all the files into 3 dataframes. Create a new column (in all 3 dataframes) called target and assign 0 to it for the data from file `review_78678_irrelevant.csv` and 1 for the other two files.

</div>

In [27]:
### your code here

<a id="s313"></a>

<div class='exercise'>
    
**[3.1.3:](#q313)**  Concatenate all the dataframes into a single dataframe. Keep only the columns **Abstract** and **target**. Apply dropna() on the dataframe.

</div>
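The labelling and concatenation steps in 3.1.2 and 3.1.3 can be sketched as follows. This is only an illustration under stated assumptions: the three small in-memory frames (and their `Title` column and rows) are made up to stand in for the CSV files, which in the notebook would be read with `pd.read_csv` from `datasets/project_1/`.

```python
import pandas as pd

# Hypothetical stand-ins for the three CSV files; the real frames would be
# loaded with pd.read_csv(...) from datasets/project_1/.
irrelevant = pd.DataFrame({"Title": ["a", "b"], "Abstract": ["text one", None]})
included = pd.DataFrame({"Title": ["c"], "Abstract": ["text two"]})
excluded = pd.DataFrame({"Title": ["d"], "Abstract": ["text three"]})

# Label the irrelevant abstracts 0 and the two "not irrelevant" files 1.
irrelevant["target"] = 0
included["target"] = 1
excluded["target"] = 1

# Stack the frames, keep only the two columns of interest, drop missing rows.
data = pd.concat([irrelevant, included, excluded], ignore_index=True)
data = data[["Abstract", "target"]].dropna().reset_index(drop=True)

print(data.shape)
print(data["target"].value_counts().to_dict())
```

Selecting the two columns before calling `dropna()` means a row is dropped only when its abstract (or label) is missing, not because some unused column happens to be empty.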

In [28]:
### your code here

<a id="s32"></a>

### **[3.2:](#q32)** **Build Data Pipelines**




<a id="s321"></a>
<div class='exercise'>  

**[3.2.1:](#q321)**

Set these variables as follows:

* VOCABULARY_SIZE to 15000
* SEQUENCE_SIZE (pick appropriate value), and
* EMBEDDING_SIZE to 100.

Use the `TextVectorization` class from Keras to build a text vectorizer for the entire dataset (Abstracts). Use an output sequence length of **256**, along with the `standardize_text()` function provided below.

</div>

In [37]:
# Standardize text util function
def standardize_text(input_text):
  # Convert to lowercase
  lowercase = tf.strings.lower(input_text)
  # Remove HTML tags
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  return tf.strings.regex_replace(
      stripped_html, "[%s]" % re.escape(string.punctuation), ""
  )

In [29]:
### your code here

<a id="s321a"></a>
<div class='exercise'>  

**[3.2.1a:](#q321a)** What value did you choose for the _output_sequence_length_? Justify your choice. 

</div>

*your answer here*

Hint: look at the number of words in each abstract (for example, the median word count) to pick a length that covers most abstracts.
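One way to look at word counts is sketched below. The three abstracts and the helper names `word_count_summary` and `coverage` are hypothetical; in the notebook the texts would come from the **Abstract** column, and splitting on whitespace is only a rough proxy for the vectorizer's tokenization.

```python
import statistics

# Hypothetical abstracts; in the notebook these would come from the Abstract column.
abstracts = [
    "short abstract here",
    "a slightly longer abstract with more words in it",
    "one two three four five six seven eight nine ten eleven twelve",
]

def word_count_summary(texts):
    """Return the median and maximum whitespace-token count of the texts."""
    counts = [len(text.split()) for text in texts]
    return statistics.median(counts), max(counts)

def coverage(texts, length):
    """Fraction of texts that fit within `length` tokens without truncation."""
    return sum(len(text.split()) <= length for text in texts) / len(texts)

median_len, max_len = word_count_summary(abstracts)
print(median_len, max_len)
print(coverage(abstracts, median_len))
```

A sequence length near the maximum avoids truncation but pads most inputs heavily, while a length near the median truncates roughly half of them, so the coverage fraction is a useful number to quote when justifying the choice.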

<a id="s322"></a>
<div class='exercise'>  

**[3.2.2:](#q322)** Use the `adapt()` method of `TextVectorization` to generate the vocabulary. Display the vocabulary length.
    
</div>

In [30]:
### your code here

<a id="s323"></a>
<div class='exercise'>  

**[3.2.3:](#q323)**

* Use `train_test_split()` to split the dataset, creating a validation set of size 20%.
* Create TF Dataset. (**HINT:** use `tf.data.Dataset.from_tensor_slices`) 

</div>


In [31]:
### your code here

In [32]:
### your code here


<a id="s324"></a>
<div class='exercise'>  

**[3.2.4:](#q324)** Build `tf.data` pipelines for the training and validation dataset. Follow this order when building the pipeline:
  * Shuffle your training data by using a buffer size that's equal to the number of training data points.
  * Use a batch size of **32.**
  * Map (Convert your text to vectors)
  * `prefetch()` your validation set while specifying a buffer size 

We will use this data pipeline for Model FFNN and LSTM.

</div>
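The ordering above (shuffle, batch, map, prefetch) might look like the sketch below. It is not the reference solution: the toy texts, the small vectorizer settings, and the helper name `vectorize` are assumptions for illustration. It also assumes a TensorFlow version where `TextVectorization` is available as `tf.keras.layers.TextVectorization`; in older releases it lives under `tf.keras.layers.experimental.preprocessing`.

```python
import tensorflow as tf

# Hypothetical toy data standing in for the abstracts and their labels.
texts = ["first toy abstract", "second toy abstract", "third one", "fourth one"]
labels = [0, 1, 0, 1]

# Small vectorizer for illustration; the assignment uses larger settings.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=50, output_sequence_length=8)
vectorizer.adapt(texts)

def vectorize(text, label):
    # Convert a batch of raw strings into a batch of integer token ids.
    return vectorizer(text), label

train_data = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .shuffle(buffer_size=len(texts))   # buffer covers the whole training set
    .batch(2)                          # the assignment uses 32
    .map(vectorize, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

for token_ids, batch_labels in train_data.take(1):
    print(token_ids.shape, batch_labels.shape)
```

Mapping after batching lets the vectorizer process a whole batch of strings per call, which is usually cheaper than vectorizing one example at a time.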

In [33]:
### your code here


<a id="s324b"></a>
<div class='exercise'>  

**[3.2.4b:](#q324b)** Explain why you should shuffle the data, and describe when you should **not** shuffle the data. 

</div>

*Your answer here*

<a id="s33"></a>

### **[3.3:](#q33)** **MODEL 1: Feed Forward Neural Network**



<a id="s331"></a>
<div class='exercise'>  

**[3.3.1:](#q331)** [WE PROVIDE; 0 pts]
* The first model will be a simple Feed Forward Neural Network. When building a model, be sure to give it a unique name, since we will need it later to compare our results. The model needs to have an Embedding layer followed by a Flatten layer and one or more Dense layers. 

</div>


In [34]:
# free code
def build_ffnn():
    
  # Set the model name as
  model_name = 'ffnn_'+str(int(time.time()))

  # Create a FFNN Model
  model = tf.keras.models.Sequential(name=model_name)
  model.add(tf.keras.Input(shape=(SEQUENCE_SIZE,)))
  model.add(tf.keras.layers.Embedding(input_dim=VOCABULARY_SIZE, output_dim=EMBEDDING_SIZE))
  model.add(tf.keras.layers.Flatten())
  model.add(tf.keras.layers.Dense(512, activation="relu"))
  model.add(tf.keras.layers.Dense(1,activation="sigmoid"))

  return model


<a id="s332"></a>
<div class='exercise'>

**[3.3.2:](#q332)** [WE PROVIDE; 0 pts]
* Display model summary. 
* Train the model with a `learning_rate = 0.003` and `epochs = 10`
* The metric we want to monitor is `accuracy`.
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`
   
</div>

In [35]:
############################
# Training Params
############################
learning_rate = 0.003
epochs = 10

# Free up memory
K.clear_session()

# Build the model
model = build_ffnn()

# Print the model architecture (summary() prints directly and returns None)
model.summary()

# Optimizer
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
# Loss
loss = keras.losses.binary_crossentropy

# Compile
model.compile(loss=loss,
              optimizer=optimizer,
              metrics=['accuracy'])

# Train model
start_time = time.time()
training_results = model.fit(
        train_data,
        validation_data=validation_data,
        epochs=epochs, 
        verbose=1)
execution_time = (time.time() - start_time)/60.0
print("Training execution time (mins)",execution_time)

<a id="s332a"></a>
<div class='exercise'>

**[3.3.2a:](#q332a)** Name two reasons why one should use an **Embedding layer** instead of **one-hot encoded vectors**.

</div>

**INTERPRETATION:**

*Your answer here*

<a id="s332b"></a>
<div class='exercise'>  

**[3.3.2b:](#q332b)** Explain what the inputs and outputs of the `Embedding` layer are. Also comment on the dimensions of what goes in and what comes out.

</div>
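To make the shapes concrete, the sketch below mimics an embedding lookup with plain NumPy. It is an illustration under stated assumptions, not Keras's implementation: the table is random rather than trained, and the tiny vocabulary and embedding sizes are chosen only to keep the output readable.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4   # the assignment uses 15000 and 100
rng = np.random.default_rng(0)
# One row of weights per vocabulary id.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each padded to 5 integer token ids.
token_ids = np.array([[1, 7, 3, 0, 0],
                      [2, 2, 9, 4, 0]])

# The lookup replaces every id with its row of the table.
embedded = embedding_table[token_ids]
print(token_ids.shape, "->", embedded.shape)

# The same result via explicit one-hot vectors and a matrix product.
one_hot = np.eye(vocab_size)[token_ids]   # shape (2, 5, 10)
same_result = np.allclose(one_hot @ embedding_table, embedded)
print(same_result)
```

The one-hot route produces the same vectors but first materializes a tensor whose last dimension is the vocabulary size, which is the cost the lookup avoids.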

**INTERPRETATION:**

*Your answer here*

<a id="s333"></a>
<div class='exercise'>  
    
**[3.3.3:](#q333)** 
* Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.
    
</div>

In [None]:
hw_utils.evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)


<a id="s34"></a>
<div class='exercise'>  

### **[3.4:](#q34)** **MODEL 2: LSTM**

</div>

<a id="s341"></a>
<div class='exercise'>  

**[3.4.1:](#q341)** 
* Add an `Embedding` layer in your model
* Next, add a `Bidirectional` LSTM layer with 64 units
* Next, add at least one `Dense` layer

</div>

In [36]:
#### your code here


<a id="s342"></a>
<div class='exercise'>  

**[3.4.2:](#q342)** 
* Display model summary.
* Train the model with a `learning_rate = 1e-4` and `epochs = 10`
* The metric we want to monitor is `accuracy` 
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`

</div>

In [37]:
#### your code here


<a id="s342a"></a>
<div class='exercise'>  

##### **[3.4.2a:](#q342a)**

How many parameters does the `Bidirectional` layer have? Show us your calculations.

</div>
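A quick way to sanity-check a hand calculation against `model.summary()` is sketched below. The helper names are hypothetical, and the numbers assume a standard LSTM with biases that sits directly on an embedding of size 100; a different architecture will give a different count.

```python
def lstm_param_count(input_dim, units):
    """Trainable weights in one standard LSTM layer with biases.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units), and a bias (units).
    """
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_param_count(input_dim, units):
    # The forward and backward LSTMs have separate weights.
    return 2 * lstm_param_count(input_dim, units)

# Assumes the LSTM receives 100-dimensional embeddings and has 64 units.
print(bidirectional_lstm_param_count(input_dim=100, units=64))  # 84480
```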

**INTERPRETATION:**

*Your answer here*

<a id="s343"></a>
<div class='exercise'>  
    
**[3.4.3:](#q343)** 

Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.
    
</div>

In [38]:
#### your code here


<a id="s35"></a>
<div class='exercise'>  

### **[3.5:](#q35)** **Build Data Pipelines for BERT**

</div>

<a id="s351"></a>
<div class='exercise'> 
    
**[3.5.1:](#q351)** 

* Use `BertTokenizer` to tokenize the input text.
* Load the [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with `from_pretrained`, passing `bert-base-uncased` as the pre-trained vocabulary name. Also set `do_lower_case=True`.
* When using `tokenizer.encode_plus(...)`, set `max_length` to a value suitable for the dataset. It must be `<=512`.
* The output of `tokenizer.encode_plus(...)` is a dictionary with the keys `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`.
                                                                                                                       
</div>
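The sketch below illustrates how the three keys relate to each other for a single padded sentence. It is a hand-rolled imitation, not the tokenizer's implementation: the helper `build_bert_inputs` and the word-piece ids passed to it are hypothetical, and only the special-token ids are taken from the `bert-base-uncased` vocabulary. With the real tokenizer, padding and truncation to `max_length` are typically requested through the `padding='max_length'` and `truncation=True` arguments.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def build_bert_inputs(wordpiece_ids, max_length):
    """Mimic the layout of the encode_plus output for a single sentence."""
    # Truncate so that [CLS] and [SEP] still fit within max_length.
    body = wordpiece_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,            # single segment
        "attention_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }

# Illustrative word-piece ids for a two-token sentence.
encoded = build_bert_inputs([7592, 2088], max_length=6)
print(encoded)
```

All three lists have length `max_length`, which is what allows them to be stacked into fixed-shape tensors for `tf.data.Dataset.from_tensor_slices`.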

In [39]:
#### your code here


In [40]:
#### your code here


In [41]:
#### your code here


<a id="s352"></a>
<div class='exercise'> 
    
**[3.5.2:](#q352)**
* Create TF Datasets using the tokenized results. When using `tf.data.Dataset.from_tensor_slices(...,...)`, look out for the x values passed in: `BERT` requires the 3 inputs `'input_ids'`, `'token_type_ids'`, and `'attention_mask'` as a tuple.
* Use a batch size of **32**.
* Use shuffle buffer sizes for the train and validation sets equal to the number of data points in each set.
* When creating `tf.data` pipelines for the training and validation datasets, follow this order:
  * Shuffle (if necessary)
  * Batch
  * Prefetch
</div>


In [42]:
#### your code here


<a id="s36"></a>
<div class='exercise'>  

### **[3.6:](#q36)** **MODEL 3: Finetune BERT**

</div>


<a id="s361"></a>

<div class='exercise'> 

**[3.6.1:](#q361)** 

* Build a model using `TFBertForSequenceClassification` from the `transformers` package from Hugging Face
* Load the pre-trained weights using `bert-base-uncased`. Make sure to set the `name` argument to a unique name, since we will need it to compare our results later.

</div>    
    

In [43]:
#### your code here


<a id="s362"></a>

<div class='exercise'> 
    
**[3.6.2:](#q362)** 
    
    
* Train the model with a `learning_rate = 2e-5` and `epochs = 5`
* The metric we want to monitor is `accuracy`
* When calling the `model.fit(...)` function make sure to get the training results. This can be done like `training_results = model.fit(...)`
* Pass the training results to the `evaluate_save_model(model, validation_data, training_results, execution_time, learning_rate, epochs)`. This util function will plot your training history and save the model and metrics so we can compare all the models at the end.

</div>

In [44]:
#### your code here


<a id="s362a"></a>
<div class='exercise'>  

##### **[3.6.2a:](#q362a)**

How is BERT related to the original Transformer (encoder + decoder) model introduced in the *Attention Is All You Need* paper? Describe both the similarities and the differences.

</div>

**INTERPRETATION:**

*Your answer here*

<a id="s362b"></a>
<div class='exercise'>  

##### **[3.6.2b:](#q362b)**

What is the difference between attention and self-attention in the context of the Transformer architecture?

</div>
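As a reference point for this question, the scaled dot-product attention used in the *Attention Is All You Need* paper can be written as below, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of the keys:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The formula is the same in every attention block of the Transformer. What changes between blocks is which sequences the queries, keys, and values are computed from, which is the distinction this question asks about.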

**INTERPRETATION:**

*Your answer here*

<a id="s37"></a>
<div class='exercise'>  

### **[3.7:](#q37)** **Compare all Models**

</div>

In [None]:
## Code to compare all models 

models_store_path = "models"

models_metrics_list = glob(models_store_path+"/*_metrics.json")

all_models_metrics = []
for mm_file in models_metrics_list:
  with open(mm_file) as json_file:
    model_metrics = json.load(json_file)
    all_models_metrics.append(model_metrics)

# Load metrics to dataframe
view_metrics = pd.DataFrame(data=all_models_metrics)

# Format columns
view_metrics['accuracy'] = view_metrics['accuracy']*100
view_metrics['accuracy'] = view_metrics['accuracy'].map('{:,.2f}%'.format)

view_metrics['trainable_parameters'] = view_metrics['trainable_parameters'].map('{:,.0f}'.format)
view_metrics['execution_time'] = view_metrics['execution_time'].map('{:,.2f} mins'.format)
view_metrics['loss'] = view_metrics['loss'].map('{:,.2f}'.format)
view_metrics['f1_score'] = view_metrics['f1_score'].map('{:,.2f}'.format)
view_metrics['model_size'] = view_metrics['model_size']/1000000
view_metrics['model_size'] = view_metrics['model_size'].map('{:,.0f} MB'.format)

# Filter columns
view_metrics = view_metrics[["trainable_parameters","execution_time","loss","accuracy","f1_score","model_size","learning_rate","epochs","name"]]

view_metrics = view_metrics.sort_values(by=['f1_score'],ascending=False)
view_metrics.head(10)

*Your answer here*

<a id="s371"></a>
<div class='exercise'>  

**[3.7.1:](#q371)**

* Do you think accuracy is a good metric here, why or why not?
* Create a "confusion matrix" that simply includes the False Positive, False Negative, True Positive, and True Negative counts.
* Also report and interpret Precision and Recall.
* Interpret the best F1 score reported with respect to the problem at hand.
* Which metric is most important for the doctor (i.e., the researcher): precision, recall, or F1 score (and why)? HINT: see the problem description for Part 3.

</div>
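The quantities requested above can be computed directly from true and predicted labels, as in the sketch below. The helper `classification_metrics` and the label lists are hypothetical; in the notebook the predictions would come from thresholding the model outputs on the validation set, with label 1 treated as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Confusion counts plus precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this setting a false negative corresponds to a relevant abstract labelled as irrelevant, and a false positive to an irrelevant abstract that still has to be read.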



**INTERPRETATION:**

*Your answer here*