Children often ask their caregivers to find episodes of their favorite TV shows based only on a very short (and loosely relevant!) description of it ("the one where Arthur has a wiggly tooth") but video services like Netflix and Amazon don't currently provide such content-based search. Given summaries of each episode, can we use sequence embeddings to solve this retrieval problem?

Before beginning this homework, install the following libraries:

```sh
conda install -c huggingface transformers
pip install -U sentence-transformers
conda install -c conda-forge ipywidgets
```

First, let's read in our data for the TV show "Wild Kratts" (from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Wild_Kratts_episodes)), which has the following (tab-separated) form:

|Episode|Title|Summary|
|:-|:-|:-|
|1|Mom of a Croc|At the Nile River, zoologists Chris and Martin Kratt (voiced by their real-world selves) are on a mission to show one of their fellow Wild Kratts team members—brilliant young inventor Aviva Corcovado (Athena Karkanis)—that there's more to crocodiles than just violence and snapping jaws. After shrinking themselves down to a few inches tall by using Aviva's Miniaturizer invention, the Kratt Brothers disguise themselves as crocodile eggs and sneak into a mother crocodile's new nest. In the Wild Kratts team's turtle-shaped aircraft and headquarters—the Tortuga, one of Aviva's greatest inventions—the Wild Kratts tech team, consisting of Aviva, communications expert and mechanic Koki (Heather Bambrick), and skilled pilot Jimmy Z (Jonathan Malen) monitor Chris and Martin and watch as the mother crocodile faithfully guards her nest against predators for months without even eating anything. Eventually, as the crocodile eggs hatch and the crocodile mom uses her mouth to carry several of her newly hatched babies to the river, Aviva changes her mind about crocodiles and decides that these reptiles are in fact caring and dedicated mothers. But when the mother crocodile leaves the river to go get more hatchlings from her nest, predators threaten the first batch of baby crocodiles. The Kratt Brothers must use the incredible Creature Power Suits—two of Aviva's inventions—to gain the abilities of crocodiles and protect the vulnerable crocodile hatchlings.|
|2|Whale of a Squid|The Kratt Brothers use Aviva's amphipod-inspired submersible, the Amphisub, to dive into the deep waters of the Southern Ocean. There, they witness a never-before-seen wildlife moment: a battle between a sperm whale and a giant squid. However, the water pressure at the extreme depths where the battle is taking place badly damages and partially crushes the Amphisub, forcing Aviva to use her new ExtendoArm invention to pull the submersible back to the Tortuga. To allow Chris and Martin to return to the site of the whale-versus-squid battle, Aviva programs two new Creature Power Suits—Sperm Whale Power for Chris, and Squid Power for Martin. The Kratt Brothers use their new Creature Powers to dive back into the deep sea, where the sperm whale and the giant squid are still locked in combat. Suddenly, the sperm whale becomes entangled in a discarded fishing net and begins sinking toward an area full of underwater volcanoes. To make matters worse, a colossal squid attacks the sperm whale's calf. Chris and Martin must put their Creature Powers of both sperm whale and squid to good use to rescue the mother sperm whale and her calf.|


In [1]:
def read_data(filename):
    data=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            episode=cols[0]
            title=cols[1]
            summary=cols[2]
            data.append((episode, title, summary))
    return data

In [2]:
data=read_data("../data/wild_kratts_episodes.txt")

In [3]:
def get_document_reps_for_data(data, sequence_embedding_function, model):
    
    # This function applies the sequence_embedding_function argument (a function itself) to the summary
    # element in the input data list, and returns a copy of that list with an embedding of that summary attached.
    
    data_with_reps=[]
    
    for episode, title, summary in data:
        data_with_reps.append((episode, title, summary, sequence_embedding_function(model, summary)))
    
    return data_with_reps

In [4]:
def cosine_similarity(a, b):
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

First, we may be tempted to use the [CLS] token for BERT to represent an entire input string (as is often done in *supervised* document classification models).  How well does this work as an out-of-the-box document representation not optimized for our particular task?

In [5]:
from transformers import BertModel, BertTokenizer
import numpy as np
from sentence_transformers import SentenceTransformer

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Q1**: Fill out the `get_cls_token_for_doc` function to return the [CLS] embedding for the input string.  The output should be a single 768-dimensional numpy vector (see `4.embeddings/BERT.ipynb` for converting between a pytorch tensor and a numpy object).

In [68]:
def get_cls_token_for_doc(model, string):
    inputs = tokenizer(string, return_tensors="pt")
    # your code goes here
    
    #getting the tokens
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    outputs = model(**inputs)
    
    #last_hidden_states = outputs.last_hidden_state
    
    return outputs.last_hidden_state[0].detach().numpy()[2]

In [69]:
bert_cls_data=get_document_reps_for_data(data, get_cls_token_for_doc, model)

**Q2**: Use these representations to find the episode that is most similar to the description "The one where they bounce back in time" by having the highest cosine similarity between representations.  A sample function shell `run_query` is provided below, along with the only arguments you need, but feel free to adapt it as you see fit.

In [71]:
query="The one where they bounce back in time"

In [82]:
def run_query(query, data_with_reps, sequence_embedding_function, model):
    # your code goes here
    vals = []
    
    for eps in data_with_reps:
        comp_rep = sequence_embedding_function(model, query)
        cos_sim = cosine_similarity(eps[3], comp_rep)
        vals.append((cos_sim, query, eps[0]))
        
    for c, q, s in reversed(sorted(vals)):
        print("%.3f\t%s\t%s" % (c, q, s))

In [83]:
run_query(query, bert_cls_data, get_cls_token_for_doc, model)

0.396	The one where they bounce back in time	62
0.377	The one where they bounce back in time	111
0.373	The one where they bounce back in time	75
0.369	The one where they bounce back in time	92
0.366	The one where they bounce back in time	44
0.365	The one where they bounce back in time	81
0.350	The one where they bounce back in time	65
0.348	The one where they bounce back in time	32
0.347	The one where they bounce back in time	1
0.347	The one where they bounce back in time	118
0.345	The one where they bounce back in time	102
0.342	The one where they bounce back in time	150
0.333	The one where they bounce back in time	5
0.332	The one where they bounce back in time	96
0.330	The one where they bounce back in time	30
0.328	The one where they bounce back in time	36
0.323	The one where they bounce back in time	46
0.318	The one where they bounce back in time	91
0.318	The one where they bounce back in time	88
0.315	The one where they bounce back in time	121
0.312	The one where they bounce back 

Now let's try a sentence embedding model that was optimized for generating sentence representations: Sentence-BERT ([Reimers and Gurevych 2019](https://arxiv.org/pdf/1908.10084.pdf)).  Example usage (in the context of the Huggingface transformers library) can be found [here](https://www.sbert.net).

In [84]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

Downloading:   0%|          | 0.00/737 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/9.86k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/653 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/15.7k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/329M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/333 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

**Q3**: Fill out the `get_sentence_embedding` function below to return the sentence embedding for the input string, and use it again to find the episode that is most similar to the description "The one where they bounce back in time" by having the highest cosine similarity between representations.  Which method for generating sentence embeddings appears better for this task?

In [105]:
def get_sentence_embedding(model, string):
    # your code goes here
    embeddings = sentence_model.encode(string)
    return embeddings

In [106]:
get_sentence_embedding(model, query)

array([ 2.21291883e-03,  1.46007640e-02,  2.03430578e-02,  1.27137015e-02,
        1.35774920e-02,  3.19657773e-02, -7.94165395e-03,  5.24583161e-02,
        2.90296637e-02, -1.41363693e-02,  4.94237766e-02, -2.31884580e-04,
       -4.03048992e-02, -6.15077540e-02,  9.52694193e-03, -7.15107704e-03,
       -6.67671785e-02, -4.34239171e-02, -1.10399239e-02, -2.21695676e-02,
       -1.34791841e-03, -3.53775956e-02, -7.93617591e-02,  8.38638619e-02,
       -3.64556462e-02,  1.69570819e-02, -8.85298103e-02,  1.07683558e-02,
       -8.18700530e-03,  3.06761004e-02,  1.76261924e-02, -1.28581850e-02,
        2.35316176e-02,  6.67219236e-02,  5.61144166e-02,  3.27455346e-03,
       -4.85332571e-02, -9.01158492e-04,  6.27200492e-03, -5.37139736e-03,
        1.40093248e-02,  1.00574549e-03,  2.96798851e-02,  1.23534417e-02,
       -1.32677779e-02, -6.48900867e-03,  1.86410528e-02,  4.17390689e-02,
       -1.89139359e-02, -3.09142452e-02,  3.27500366e-02,  2.01369170e-02,
        3.56953824e-03, -

In [94]:
def get_document_reps_for_data(data, sequence_embedding_function, model):
    
    # This function applies the sequence_embedding_function argument (a function itself) to the summary
    # element in the input data list, and returns a copy of that list with an embedding of that summary attached.
    
    data_with_reps=[]
    
    for episode, title, summary in data:
        data_with_reps.append((episode, title, summary, sequence_embedding_function(model, summary)))
    
    return data_with_reps

In [107]:
sentence_transformer_data=get_document_reps_for_data_new(data, get_sentence_embedding, sentence_model)

In [97]:
def get_document_reps_for_data_new(data, sequence_embedding_function, model):
    
    # This function applies the sequence_embedding_function argument (a function itself) to the summary
    # element in the input data list, and returns a copy of that list with an embedding of that summary attached.
    
    data_with_reps=[]
    
    for episode, title, summary in data:
        data_with_reps.append((episode, title, summary, sequence_embedding_function(sentence_model, summary)))
    
    return data_with_reps

In [109]:
query="The one where they bounce back in time"
run_query(query, sentence_transformer_data, get_sentence_embedding, sentence_model)

0.349	The one where they bounce back in time	91
0.341	The one where they bounce back in time	76
0.335	The one where they bounce back in time	143
0.334	The one where they bounce back in time	82
0.330	The one where they bounce back in time	47
0.325	The one where they bounce back in time	17
0.321	The one where they bounce back in time	83
0.313	The one where they bounce back in time	121
0.306	The one where they bounce back in time	123
0.304	The one where they bounce back in time	29
0.303	The one where they bounce back in time	84
0.293	The one where they bounce back in time	15
0.292	The one where they bounce back in time	130
0.290	The one where they bounce back in time	43
0.289	The one where they bounce back in time	98
0.285	The one where they bounce back in time	27
0.283	The one where they bounce back in time	13
0.281	The one where they bounce back in time	92
0.278	The one where they bounce back in time	59
0.277	The one where they bounce back in time	6
0.277	The one where they bounce back 