# Assignment 4: "Search your transcripts. You will know it to be true." (Part 2)

## Â© Cristian Danescu-Niculescu-Mizil 2019

## CS/INFO 4300 Language and Information

### Due by 11:59pm on Wednesday February 20th


This is an **individual** assignment.

If you use any outside sources (e.g. research papers, StackOverflow) please list your sources.

In our last assignment we have explored edit distance to retrieve similar sounding quotes from the transcripts. Our overall goal in this assignment is to build a system that efficiently searches for documents similar to a query in large data sets. We will explore the tradeoffs of information retrieval systems by finding newspaper quotes from "Keeping Up With The Kardashians".

**Guidelines**

All cells that contain the blocks that read `# YOUR CODE HERE` are editable and are to be completed to ensure you pass the test-cases. Make sure to write your code where indicated.

All cells that read `YOUR ANSWER HERE` are free-response cells that are editable and are to be completed.

You may use any number of notebook cells to explore the data and test out your functions, although you will only be graded on the solution itself.

You are unable to modify the read-only cells and should never delete any of the given cells.

You should also use Markdown cells to explain your code and discuss your results when necessary.
Instructions can be found [here](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html).

All floating point values should be printed with **2 decimal places** precision. You can do so using the built-in round function.

No cell in this assignment should take longer than **1 second** to run. If a cell takes longer than 1 second to run, it will be marked as incorrect.

**Grading**

For code-completion questions you will be graded on passing the public test cases we have included, as well as any hidden test cases that we have supplemented to ensure that your logic is correct.

For free-response questions you will be manually graded on the quality of your answer.

**Learning Objectives**

- Develop an understanding of the inverted index and its applications
- Explore use cases of boolean search
- Examine how the inverted index can be used to efficiently compute IDF values
- Introduce cosine similarity as an efficient search model

In [483]:
from collections import defaultdict
from collections import Counter
import json
import math
import string
import time
import numpy as np
from nltk.tokenize import TreebankWordTokenizer
from IPython.core.display import HTML

In [484]:
with open("kardashian-transcripts.json", "r") as f:
    transcripts = json.load(f)
print(len(transcripts[0]))

851


In [485]:
treebank_tokenizer = TreebankWordTokenizer()
flat_msgs = [m for transcript in transcripts for m in transcript]
queries = [u"It's like a bunch of people running around talking about nothing.",
           u"Never say to a famous person that this possible endorsment would bring 'er to the spot light.",
           u"Your yapping is making my head ache!",
           u"I'm going to Maryland, did I tell you?"]

## Finding the most similar messages (cosine similarity)

### A high-level overview

Our overall goal of the last part of this assignment is to build a system where we can compute the cosine similarity between queries and our datasets quickly. To accomplish queries and compute cosine similarities, we will need to represent documents as vectors. A common method of representing documents as vectors is by using "term frequency-inverse document frequency" (tf-idf) scores. More details about this method can be found [on the course website](http://www.cs.cornell.edu/courses/cs4300/2020sp/Slides//vsm_cheatsheet.pdf). The notation here is consistent with the hand out, so if you haven't read over it -- you should!

Consider the tf-idf representation of a document and a query: $\vec{d_j}$ and $\vec{q}$, respectively. Elements of these vectors are very often zero because the term frequency of most words in most documents is zero. Stated differently, most words don't appear in most documents! Consider a query that has 5 words in it and a vocabulary that has 20K words in it -- only .025% of the elements of the vector representation of the query are nonzero! When a vector (or a matrix) has very few nonzero entries, it is called "sparse." We can take advantage of the sparsity of tf-idf document representations to compute cosine similarity quickly. We will first build some data stuctures that allow for faster querying of statistics, and then we will build a function that quickly computes cosine similarity between queries and documents.

### A starting point
We will use an **inverted index** for efficiency. This is a sparse term-centered representation that allows us to quickly find all documents that contain a given term.

## Q1 Write a function to construct the inverted index (Code Completion)

As in class, the inverted index is a key-value structure where the keys are terms and the values are lists of *postings*. In this case, we record the documents a term occurs in as well as the **count** of that term in that document.

In [486]:
def build_inverted_index(msgs):
    """ Builds an inverted index from the messages.
    
    Arguments
    =========
    
    msgs: list of dicts.
        Each message in this list already has a 'toks'
        field that contains the tokenized message.
    
    Returns
    =======
    
    inverted_index: dict
        For each term, the index contains 
        a sorted list of tuples (doc_id, count_of_term_in_doc)
        such that tuples with smaller doc_ids appear first:
        inverted_index[term] = [(d1, tf1), (d2, tf2), ...]
        
    Example
    =======
    
    >> test_idx = build_inverted_index([
    ...    {'toks': ['to', 'be', 'or', 'not', 'to', 'be']},
    ...    {'toks': ['do', 'be', 'do', 'be', 'do']}])
    
    >> test_idx['be']
    [(0, 2), (1, 2)]
    
    >> test_idx['not']
    [(0, 1)]
    
    """
    # YOUR CODE HERE
    d = defaultdict(list)
    for k in range(len(msgs)):
        for word in set(msgs[k]['toks']):
            d[word].append((k, msgs[k]['toks'].count(word)))
    return d

In [487]:
# This is an autograder test. Here we can test the function you just wrote above.
start_time = time.time()
inv_idx = build_inverted_index(flat_msgs)
execution_time = time.time() - start_time

assert len(inv_idx) <= 10000 
assert [i[0] for i in inv_idx['bruce']] == sorted([i[0] for i in inv_idx['bruce']])
assert len(inv_idx['bruce']) < len(inv_idx['kim'])
assert len(inv_idx['bruce']) >= 400 and len(inv_idx['bruce']) <= 435
assert len(inv_idx['baby']) >= 250 and len(inv_idx['baby']) <= 300
assert execution_time <= 1.0


## Q2 Using the inverted index for boolean search (Code Completion)

In this section we will use the inverted index you constructed to perform an efficient boolean search. The boolean model was one of the early information retrieval models, and continues to be used in applications today.

A boolean search works by searching for documents which match the boolean expression of the query. Three main operators in a boolean search are `AND` `OR` and `NOT`. For example, the query `"Ned" and "Rob"` would return any document which contains both the words "Ned" and "Rob".

Here, we will treat a query as a simple two-word search with exclusion. For example, the query words "kardashian", "kim" would be equivalent to the boolean expression `"kardashian" NOT "kim"`.

#### In class we implemented the Merge Postings Algorithm, review the code [here](https://www.cs.cornell.edu/courses/cs4300/2020sp/Demos/demo05.html).

The Merge Postings Algorithm we implemented can be thought of a boolean search with the `AND` operator. Write a function `boolean_search` that implements a similar algorithm with the `NOT` operator using the inverted index.

**Note:** Make sure you convert the `query_word` and `not_word` to lowercase. 


------------------------------------------
    Initialize empty list (called merged list M)

    Create sorted list A of documents containing the query_word

    Create sorted list B of documents containing the not_word

    Start: Pointer at the first element of both A and B

    Do: Does it point to the same document ID in each list?

        Yes: advance pointer in both A and B
    
        No: 
            If the pointer with the smaller document ID is in list A:
                Append the smaller document ID to list M
            
            Advance (to the right) the pointer with the smaller ID
    
    End: When we attempt to advance a pointer already at the end of its list

    Finally: if there are remaining document IDs in list A that were not evaluated in the above loop, then append them to list M.

------------------------------------------

**Note:** The objective is to demonstrate your knowledge in building an efficient search algorithm. If you use the Python `set.difference` function, you will lose points.


In [488]:
def boolean_search(query_word,not_word, inverted_index):
    """ Search the collection of documents for the given query_word 
        provided that the documents do not include the not_word
    
    Arguments
    =========
    
    query_word: string,
        The word we are searching for in our documents.
    
    not_word: string,
        The word excluded from our documents.
    
    index: an inverted index as above
    
    
    Returns
    =======
    
    results: list of ints
        Sorted List of results (in increasing order) such that every element is a `doc_id`
        that points to a document that satisfies the boolean
        expression of the query.
        
    """
    # YOUR CODE HERE
    postings1 = [ i for (i,j) in inverted_index[query_word.lower()]]
    postings2 = [ i for (i,j) in inverted_index[not_word.lower()]]
    
    diff_posting=[ i for (i,j) in inverted_index[query_word.lower()]]
    i,j=0,0
    while i<len(postings1) and j<len(postings2):
        if postings1[i]==postings2[j]:
            diff_posting.remove(postings1[i])
            i+=1
            j+=1
        elif postings1[i]<postings2[j]:
            i+=1
        else:
            j+=1
            
    return diff_posting

In [489]:
result0_start_time = time.time()
result0 = boolean_search('ice','cream', inv_idx)
result0_execution_time = time.time() - result0_start_time
result3 = boolean_search('puppy','dog', inv_idx)
result1= boolean_search('Kardashian','Kim',inv_idx)
result4= boolean_search('cake','cake',inv_idx)
assert result0_execution_time < 1.0
assert type(result1) == list
assert len(result3) == 7
assert len(result4)==0


## Q2b Using the inverted index for boolean search (Free Response)

In A3 we already explored search techniques which are able to find a wider variety of relevant results. Why might you want to use a boolean search with an inverted index instead? Give a specific example in which a boolean search would be a better choice than a search with edit distance, and justify why a boolean search would be preferable.

<div style="border-bottom: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold;">Write your answer in the provided cell below</div>

YOUR ANSWER HERE

The benefits of using boolean search: 

1. It is much faster than edit distance
2. It gives you the relevent results with the words you want to emphasize.

**For example, if you use edit distance, but there are a lot of words that are same but not what you really care about, ie, not relevent to your query.**

Query: i do not like breakfast because i do not like sausages.

Doc2: i do not like break because i do not like my age.

Doc3: i hate breakfast because i never eat sausage.

when you try to calculate the edit distance, you may notice that doc2 and query are similar than doc3 and query, because, there are a lot of same unrelevent words (like "i, do, not, like"). However, in the boolean search, if we want the words: breakfast and sausages appear together, then doc3 is the most relevent and similar documents for this query.

<div style="border-top: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold; text-align: center;"></div>

## Q3 Compute IDF *using* the inverted index (Code Completion)

Write a function `compute_idf` that uses the inverted index to efficiently compute IDF values.

Words that occur in a very small number of documents are not useful in many cases, so we ignore them. Use a parameter `min_df`
to ignore all terms that occur in strictly fewer than `min_df=10` documents.

Similarly, words that occur in a large *fraction* of the documents don't bring any more information for some tasks. Use a parameter `max_df_ratio` to trim out such words. For example, `max_df_ratio=0.95` means ignore all words that occur in more than 95% of the documents.

As a reminder, we define the IDF statistic as...
$$ IDF(t) = \log \left(\frac{N}{1 + DF(t)} \right) $$

where $N$ is the total number of docs and $DF(t)$ is the number of docs containing $t$. Keep in mind, there are other definitions if IDF out there, so if you go looking for resources on the internet, you might find differing (but also valid) accounts. In practice the base of the log doesn't really matter, however you should use base 2 here.

In [490]:
def compute_idf(inv_idx, n_docs, min_df=10, max_df_ratio=0.95):
    """ Compute term IDF values from the inverted index.
    Words that are too frequent or too infrequent get pruned.
    
    Hint: Make sure to use log base 2.
    
    Arguments
    =========
    
    inv_idx: an inverted index as above
    
    n_docs: int,
        The number of documents.
        
    min_df: int,
        Minimum number of documents a term must occur in.
        Less frequent words get ignored. 
        Documents that appear min_df number of times should be included.
    
    max_df_ratio: float,
        Maximum ratio of documents a term can occur in.
        More frequent words get ignored.
    
    Returns
    =======
    
    idf: dict
        For each term, the dict contains the idf value.
        
    """
    
    # YOUR CODE HERE
    d = defaultdict(float)
    for word, value in inv_idx.items():
        count = len(value)
        if (count >=min_df and count<=n_docs*max_df_ratio):
            d[word] = round(np.log2(n_docs/(1+count)),2)
    return d

In [491]:
# This is an autograder test. Here we can test the function you just wrote above.
start_time = time.time()
idf_dict = compute_idf(inv_idx, len(flat_msgs))
execution_time = time.time() - start_time

assert len(idf_dict) < len(inv_idx)
assert 'blah' not in idf_dict
assert 'blah' in inv_idx 
assert '.' in idf_dict
assert '3' not in idf_dict
assert idf_dict['bruce'] >= 6.0 and idf_dict['bruce'] <= 7.0
assert idf_dict['baby'] >= 6.0 and idf_dict['baby'] <= 8.0
assert execution_time <= 1.0


## Q4 Compute the norm of each document using the inverted index (Code Completion)

Recalling our tf-idf vector representation of documents, we can compute the "norm" as the norm (length) of the vector representation of that document. More specifically, the norm of a document $j$, denoted as $\left|\left| d_j \right|\right|$, is given as follows...

$$ \left|\left| d_j \right|\right| = \sqrt{\sum_{\text{word } i} (tf_{ij} \cdot idf_i)^2} $$

In [492]:
def compute_doc_norms(index, idf, n_docs):
    """ Precompute the euclidean norm of each document.
    
    Arguments
    =========
    
    index: the inverted index as above
    
    idf: dict,
        Precomputed idf values for the terms.
    
    n_docs: int,
        The total number of documents.
    
    Returns
    =======
    
    norms: np.array, size: n_docs
        norms[i] = the norm of document i.
    """
    
    # YOUR CODE HERE
    arr = np.zeros(n_docs)

    for word, ridx in index.items():
        for doc_id, count in ridx:
            try:
                arr[doc_id] += count*idf[word] ** 2
            except KeyError:
                pass
    arr = arr ** 0.5
    return arr

In [493]:
# This is an autograder test. Here we can test the function you just wrote above.
start_time = time.time()
doc_norms = compute_doc_norms(inv_idx, idf_dict, len(flat_msgs))
execution_time = time.time() - start_time

assert len(flat_msgs) == len(doc_norms)
assert doc_norms[3722] == 0
assert max(doc_norms) < 80
assert doc_norms[1] >= 15.5 and doc_norms[1] <= 17.5
assert doc_norms[5] >= 6.5 and doc_norms[5] <= 8.5
assert execution_time <= 1.0


## Q5 Find the most similar messages to the quotes (Code Completion)

The goal of this section is to implement `index_search`, a fast implementation of cosine similarity. You will then test your answer by running the search function using the code provided. Briefly discuss why it worked, or why it might not have worked, for each query.

The goal of `index_search` is to compute the cosine similarity between the query and each document in the dataset. Naively, this computation requires you to compute dot products between the query tf-idf vector $q$ and each document's tf-idf vector $d_i$.

However, you should be able to use the sparsity of the tf-idf representation and the data structures you created to your advantage. More specifically, consider the cosine similarity...

$$ cossim(\vec{q}, \vec{d_j}) = \frac{\vec{q} \cdot \vec{d_j}}{\|\vec{q}\| \cdot \|\vec{d_j}\|}$$

Specifically, focusing on the numerator...

$$ \vec{q} \cdot \vec{d_j} = \sum_{i} {q_i} * {d_i}_j $$

Here ${q_i}$ and ${d_i}_j$ are the $i$-th dimension of the vectors $q$ and ${d_j}$ respectively.
Because many ${q_i}$ and ${d_i}_j$ are zero, it is actually a bit wasteful to actually create the vectors $q$ and $d_j$ as numpy arrays; this is the method that you saw in class.

A faster approach to computing the numerator term of cosine similarity involves quickly computing the above summation using the inverted index, pre-computed idf scores, and pre-computed document norms.

A good "first step" to implementing this efficiently is to only loop over ${q}_j$ that are nonzero (i.e. ${q}_j$ such that the word $j$ appears in the query). 

**Note:** Convert the query to lowercase, and use the `nltk.tokenize.TreebankWordTokenizer` to tokenize the query (provided to you as the `tokenizer` parameter). The transcripts have already been tokenized this way. <br>

**Note 2:** For `index_search`, you need not remove punctuation tokens from the tokenized query before searching.

**Aside:** Precomputation

In many settings, we will need to repeat the same kind of operation many times. Often, part of the input doesn't change.
Queries against the Kardashians transcript are like this: we want to run more queries (in the real world we'd want to run a lot of them every second, even) but the data we are searching doesn't change.

We could write an `index_search` function with the same signature as A3's `verbatim_search`, taking the `query` and the `msgs` as input, and the function would look like:

    def index_search(query, msgs):
        inv_idx = build_inverted_index(msgs)
        idf = compute_idf(inv_idx, len(msgs))
        doc_norms = compute_doc_norms(inv_idx)
        # do actual search


But notice that the first three lines only depend on the messages. Imagine if we run this a million times with different queries but the same collection of documents: we'd wastefully recompute the index, the IDFs and the norms every time and discard them. It's a better idea, then, to precompute them just once, and pass them as arguments.

In [494]:
inv_idx = build_inverted_index(flat_msgs)

idf = compute_idf(inv_idx, len(flat_msgs),
                  min_df=10,
                  max_df_ratio=0.1)  # documents are very short so we can use a small value here
                                     # examine the actual DF values of common words like "the"
                                     # to set these values

inv_idx = {key: val for key, val in inv_idx.items()
           if key in idf}            # prune the terms left out by idf

doc_norms = compute_doc_norms(inv_idx, idf, len(flat_msgs))

In [495]:
import nltk
from nltk.tokenize import TreebankWordTokenizer
import string
from collections import Counter

def index_search(query, index, idf, doc_norms, tokenizer=treebank_tokenizer):
    """ Search the collection of documents for the given query
    
    Arguments
    =========
    
    query: string,
        The query we are looking for.
    
    index: an inverted index as above
    
    idf: idf values precomputed as above
    
    doc_norms: document norms as computed above
    
    tokenizer: a TreebankWordTokenizer
    
    Returns
    =======
    
    results, list of tuples (score, doc_id)
        Sorted list of results such that the first element has
        the highest score, and `doc_id` points to the document
        with the highest score.
    
    Note: 
        
    """
    
    # YOUR CODE HERE
    query = query.lower()
    query_toks = tokenizer.tokenize(query)
    query_counter = Counter(query_toks)

    
    query_norm =0
    for word, count in query_counter.items():
        query_norm += np.power(count*idf[word],2)
    query_norm  = np.sqrt(query_norm)


    dot_list=np.zeros(len(doc_norms))
    for word, count in query_counter.items():
        if word in index.keys():
            tuple_list = index[word]
            for tup in tuple_list:
                dot_list[tup[0]] += count*tup[1]*idf[word]*idf[word]

    dot_n, doc_n, idx = [], [], []
    for i in range(len(doc_norms)):
        if doc_norms[i] != 0:
            dot_n.append(dot_list[i])
            doc_n.append(doc_norms[i])
            idx.append(i)
    score_list = np.array(dot_n)/(np.array(doc_n) * query_norm)

    final = [(score_list[i], i) for i in range(len(score_list))]
    
    final_sorted = sorted(final, key = lambda x: x[0], reverse = True)

    return final_sorted


In [496]:
# This is an autograder test. Here we can test the function you just wrote above.
start_time = time.time()
results = index_search(queries[1], inv_idx, idf, doc_norms)
execution_time = time.time() - start_time

assert type(results[0]) == tuple
assert max(results)[0] == results[0][0]
assert results[0][0] >= 0.4 and results[0][0] <= 0.48
assert execution_time <= 1.0


for query in queries:
    print("#" * len(query))
    print(query)
    print("#" * len(query))

    for score, msg_id in index_search(query, inv_idx, idf, doc_norms)[:10]:
        print("[{:.2f}] {}: {}\n\t({})".format(
            score,
            flat_msgs[msg_id]['speaker'],
            flat_msgs[msg_id]['text'],
            flat_msgs[msg_id]['episode_title'])) 
    print()

#################################################################
It's like a bunch of people running around talking about nothing.
#################################################################
[1.00] KRIS: They brought me over here so I could talk to you.
	(Keeping Up With the Kardashians - Kourt's First Cover)
[0.61] BRUCE: Really?
	(Keeping Up With the Kardashians - Kris ``The Cougar'' Jenner)
[0.46] KIM: Can you play catch over there?
	(Keeping Up With the Kardashians - Shape Up or Ship Out)
[0.46] KOURTNEY: Do you think I'm going to help you if talk to me like that?
	(Keeping Up With the Kardashians - Must Love Dogs)
[0.43] BRUCE: Yeah.
	(Keeping Up With the Kardashians - Botox and Cigarettes)
[0.43] LAUREN: I'm just trying to help her.
	(Keeping Up With the Kardashians - What's Yours Is Mine)
[0.43] LAUREN: "Oh, my God, I'm a fan.
	(Keeping Up With the Kardashians - What's Yours Is Mine)
[0.42] KIM: I'm just afraid if you go to New York, you'll come back hurt."- it.
	(Keeping

## Q5b Find the most similar messages to the quotes (Free Response)

Briefly discuss why cosine similarity worked, or why it might not have worked, **for each query**.

<div style="border-bottom: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold;">Write your answer in the provided cell below</div>

YOUR ANSWER HERE

1. Some of doc_norms[i] maybe zeros, because when we calculate idfs, we ignore all words that occur in more than 95% of the documents and occurs less than 10. Therefore, if a document contains all the words that have been ingored by us, then the denominator could be 0, in this case, we cannot compare this documents this the query. We have to ignore the whole sentence. 

2. Except for this case, all other sentences could compare with the query, as some the unnecessary words have been ignored by the filter. The comparison score result would be better than edit distance seach.

<div style="border-top: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold; text-align: center;"></div>

## Q6EC: Extra credit question 1 (optional)

### Updating precomputed values.

In many real-world applications, the collection of documents will not stay the same forever. At Internet-scale, however, it could possibly even be worth recomputing things every second, if during that second we're going to answer millions of queries.

However, there's a better way: in reality, the document set will not change radically, but incrementally.  In particular, it's most common to add or remove a bunch of new documents to the index.

Write functions `add_docs` and `remove_docs` that update the index, idf and document norms.  Think of the implications this has on how we store the IDF. Is there a better way of storing it, that minimizes the memory we need to touch when updating?

Think of adequate test cases for these functions and implement them.

**Note:** You can get up to 0.5 EC for completing this question. *Do not delete the cell below.*

In [474]:
# YOUR CODE HERE
def add_docs(docs, index, msgs, tokenizer=treebank_tokenizer):
    # update the index
    for doc in docs:
        doc = doc.lower()
        doc_toks = tokenizer.tokenize(doc)
        doc_counter = Counter(doc_toks)

        # update index:
        for word, count in doc_counter.items():
            index[word].append((len(msgs), count))

        
        msgs.append({'text': doc, 'toks': doc_toks})
    idf = compute_idf(index, len(msgs))
    doc_norm = compute_doc_norms(index, idf, len(msgs))
    
    return idf, doc_norm
    

def remove_docs(docs_number, index, msgs):
    # update index:
    new_index = index.copy()
    for num in docs_number:
        for word, value in new_index.items():
            for tup in value:
                if tup[0]==num:
                    index[word].remove(tup)
                               
    idf = compute_idf(index, len(msgs))
    doc_norm = compute_doc_norms(index, idf, len(msgs))
                               
    return idf, doc_norm
        

In [475]:
# This is an autograder test. Here we can test the function you just wrote above.
docs = ['You have to be in sync every step of the way.', 'You have to be in sync every step of the way.', 
        'I\'m not up own my ass-- I\'m just really busy doing stuff.']
# since the index will be updated, therefore, we need to record the original value before head.
l1 = len(inv_idx['ass'])
l2 = len(inv_idx['step'])

# now, it is time to run the method(ie. operation)
start_time = time.time()
add_docs(docs, inv_idx, flat_msgs,tokenizer=treebank_tokenizer)
execution_time = time.time() - start_time

assert len(inv_idx) <= 10000 
assert len(inv_idx['ass']) == l1 + 1
assert len(inv_idx['step']) == l2 + 2
assert execution_time <= 1.0

# after the update, the index now is the new one.

In [476]:
docs = [len(flat_msgs)-1]
l1 = len(inv_idx['ass'])
l2 = len(inv_idx['stuff'])

# now, it is time to run the method(ie. operation)
start_time = time.time()
idf, doc_norm = remove_docs(docs, inv_idx, flat_msgs)
execution_time = time.time() - start_time


assert len(inv_idx) <= 10000 
assert len(inv_idx['ass']) == l1 - 1
assert len(inv_idx['stuff']) == l2 - 1
assert execution_time <= 1.0
# after the update, the index now is the new one.

[39680]
74
148
73
45


## Q7EC: Extra credit question 2 (optional)

### Finding your own similarity metric

We've explored using cosine similarity and edit distance to find similar messages to input queries. However, there's a whole world of ways to measure the similarity between two documents. Go forth, and research!

(Fun fact: Fundamental information retrieval techniques were in fact developed at Cornell, so you would not be the first Cornellian to disrupt the field)

For this question, find a new way of measuring similarity between two documents, and implement a search using your new metric. Your new way of measuring document similarity should be different enough from the two approaches we already implemented. It can be a method you devise or an existing method from somewhere else (make sure to reveal your sources).

**Note:** The amount of EC awarded for this question will be determined based on creativity, originality, implementation, and analysis. *Do not delete the cell below.*

In [None]:
# You need to install gensim
!conda install -cy conda-forge gensim

In [479]:
# YOUR CODE HERE
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
def word2vec_sumVector_similarity(transcript, query, vector_size):
    text = []
    for tran in transcript:
        for dic in tran:
            text.append(dic['toks'])
    query_toks = nltk.tokenize.TreebankWordTokenizer().tokenize(query)

    text.append(query_toks)

    model = Word2Vec(text, min_count=0,size= vector_size, workers=5, window =3, sg = 1)
    
    score = []
    i=0
    arr1 = np.zeros(vector_size)
    for word in query_toks:
        arr1 += model.wv.__getitem__(word)
    for tran in transcript:
        for dic in tran:
            arr2 = np.zeros(vector_size)
            for word in dic['toks']:
                arr2 += model.wv.__getitem__(word)
    
    # next is cosine similarity:
            arr_difference = np.sum(np.absolute(arr1-arr2))
            i=i+1
            score.append((arr_difference, i))
    score_sorted = sorted(score, key = lambda x: x[0], reverse = True)  
    
    return score_sorted

In [480]:
score = doc2vec_sumVector_similarity(transcripts, queries[1], 50)

In [481]:
score

[(1196.378324575504, 38367),
 (904.1096134463733, 23392),
 (904.1096134463733, 30914),
 (831.5588263881509, 12205),
 (813.6027307417535, 24099),
 (813.6027307417535, 31630),
 (796.6861806639936, 8620),
 (705.1169240947929, 19058),
 (689.4845316393767, 22982),
 (689.4845316393767, 30486),
 (658.2664172488148, 17615),
 (651.5606833014754, 26482),
 (625.6189396629925, 4882),
 (618.2968378780643, 23004),
 (618.2968378780643, 30508),
 (617.2576028854528, 22947),
 (617.2576028854528, 30451),
 (600.4785345961573, 24095),
 (600.4785345961573, 31626),
 (595.7245208604436, 29926),
 (595.7245208604436, 30395),
 (591.7070633434632, 19834),
 (573.0796258652408, 34304),
 (569.2535770292452, 34631),
 (558.5888929793, 8794),
 (551.2901567779481, 33177),
 (548.1896139680757, 16037),
 (548.1896139680757, 16598),
 (546.3483432164066, 39149),
 (533.0017258305015, 11747),
 (531.8071531411842, 9219),
 (525.582943109097, 13830),
 (524.8601823651697, 14083),
 (524.7072964233812, 22343),
 (523.4293034612783, 3

In [482]:
print(queries[1])

for score, msg_id in score[:10]:
    print("[{:.2f}] {}: {}\n\t({})".format(
        score,
        flat_msgs[msg_id]['speaker'],
        flat_msgs[msg_id]['text'],
        flat_msgs[msg_id]['episode_title'])) 
    print()

Never say to a famous person that this possible endorsment would bring 'er to the spot light.
[1196.38] KIM: And you, too, okay?
	(Keeping Up With the Kardashians - The Price of Fame)

[904.11] KRIS: That last thing that I want to do is have Bruce get his feelings hurt.
	(Keeping Up With the Kardashians - The Wedding)

[904.11] KRIS: That last thing that I want to do is have Bruce get his feelings hurt.
	(The Wedding: Keeping Up With the Kardashians)

[831.56] KIM: That it was the best way to kick off our meeting.
	(Keeping Up With the Kardashians - Body Blows)

[813.60] BRUCE: So I said to her father that I'd always take care of her.
	(Keeping Up With the Kardashians - The Wedding)

[813.60] BRUCE: So I said to her father that I'd always take care of her.
	(The Wedding: Keeping Up With the Kardashians)

[796.69] KOURTNEY: I really cannot believe Scott.
	(Keeping Up With the Kardashians - Blame It on the Alcohol)

[705.12] KRIS: Tonight, I'm taking the BG5 girls-- the band I manage-- o