# 1.3 Introduction to Information Retrieval

Here we work with a data set scraped from eBay.  The data contains 9895 item titles and descriptions.

First we load the data file and _normalise_ the text - removing certain characters and converting to lower case.  

In [1]:
import csv
import re

with open("data/bike-items.txt") as f:
    r = csv.reader(f, delimiter=',', quotechar='"')
    rgx = re.compile(r'\b[a-zA-Z]+\b') 
    docs = [ (' '.join(re.findall(rgx, x[0])).lower(), ' '.join(re.findall(rgx, x[1])).lower())  \
                for i,x in enumerate(r) if i > 1 ]
    
print(docs[0][0],docs[0][1])

items_t = [ d[0] for d in docs ] # item titles
items_d = [ d[1] for d in docs ] # item descriptions
items_i = range(0, len(items_t)) # item id


('cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar', 'description feature easy to use made of high quality carbon fiber with the special design can save for a long time the carbon fiber handlebar is made of high quality carbon fiber so that you can use it relieved this quick disassembling carbon fiber handlebar is easy to use and one of the best gifts to your friends specification material carbon fiber color black handlebar clamp diameter mm length package included x cycling carbon fiber rise')
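
To make the normalisation step concrete, the sketch below applies the same regular expression to a single made-up title. The `normalise` helper and the sample string are illustrative assumptions and are not part of the data set.

```python
import re

# Same pattern as the loading cell: runs of letters bounded by word boundaries
rgx = re.compile(r'\b[a-zA-Z]+\b')

def normalise(text):
    """Keep purely alphabetic tokens and convert them to lower case."""
    return ' '.join(rgx.findall(text)).lower()

# Hypothetical raw title, invented for illustration
sample = 'NEW 700c Road-Bike Tyres (2 x Black)'
print(normalise(sample))  # new road bike tyres x black
```

Digits and punctuation are dropped, and so are mixed tokens such as `700c`, because no word boundary separates the digits from the letter. This is why stray single-letter tokens such as `x` can appear in the normalised titles.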


## Exercise Set 1 - Term Frequency

Let's start with the first 5 item titles from our corpus:

In [2]:
corpus = items_t[0:5]
print(corpus)

['cycling bicycle mtb bike fixie gloss carbon fiber riser bar handlebar', 'bicycle rims x red speed internal hub wheel set beach cruiser bike', 'mavic crossride mountain bike wheels and wtb weirwolf tires', 'new kcnc arrow alloy stem black', 'rotor qxl aero oval road chainring']


We will start by computing the frequency of terms in the *entire* corpus.  We do this by iterating over the documents in the corpus, tokenizing each document and counting how often each token occurs.  The easiest way is to build a Python dictionary where the key is the token and the value is the count.  You can review Python dictionaries in the [docs](https://docs.python.org/2/tutorial/datastructures.html).

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Here is a partially completed code snippet to compute term frequency.  
Complete this code to correctly populate the term frequency dictionary.

In [3]:
tf = {}
for doc in corpus:
    for word in doc.split():
        # << COMPUTE TERM FREQUENCY DICTIONARY >> CODE HERE
        ## HIDE
        if word in tf:
            tf[word] += 1
        else:
            tf[word] = 1
        ## HIDE

print(tf)

{'and': 1, 'set': 1, 'bicycle': 2, 'cruiser': 1, 'tires': 1, 'fixie': 1, 'oval': 1, 'speed': 1, 'internal': 1, 'mountain': 1, 'cycling': 1, 'handlebar': 1, 'gloss': 1, 'chainring': 1, 'bike': 3, 'black': 1, 'new': 1, 'beach': 1, 'red': 1, 'kcnc': 1, 'wheel': 1, 'rotor': 1, 'fiber': 1, 'hub': 1, 'rims': 1, 'mavic': 1, 'aero': 1, 'stem': 1, 'alloy': 1, 'wtb': 1, 'carbon': 1, 'riser': 1, 'bar': 1, 'qxl': 1, 'crossride': 1, 'arrow': 1, 'weirwolf': 1, 'mtb': 1, 'x': 1, 'wheels': 1, 'road': 1}


We can simplify by using a [Counter](https://docs.python.org/2/library/collections.html#collections.Counter) rather than a dictionary.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'> 
Take a look at the docs for the `Counter` collection.  
Complete this function definition to compute term frequency using the `Counter`.

In [4]:
from collections import Counter

def get_tf(corpus):
    tf = Counter()
    for doc in corpus:
        for word in doc.split():
            # << CODE HERE
            ## HIDE
            tf[word] += 1
    return tf

tf = get_tf(corpus)
print(tf)

Counter({'bike': 3, 'bicycle': 2, 'and': 1, 'set': 1, 'cruiser': 1, 'tires': 1, 'fixie': 1, 'oval': 1, 'speed': 1, 'internal': 1, 'mountain': 1, 'cycling': 1, 'handlebar': 1, 'gloss': 1, 'chainring': 1, 'black': 1, 'new': 1, 'beach': 1, 'red': 1, 'kcnc': 1, 'wheel': 1, 'rotor': 1, 'fiber': 1, 'hub': 1, 'rims': 1, 'mavic': 1, 'aero': 1, 'stem': 1, 'alloy': 1, 'wtb': 1, 'carbon': 1, 'riser': 1, 'bar': 1, 'qxl': 1, 'crossride': 1, 'arrow': 1, 'weirwolf': 1, 'mtb': 1, 'x': 1, 'wheels': 1, 'road': 1})


The `Counter` does not necessarily give us a speed advantage, since it does more work on each update.  For these tiny data sets we do not see any difference.  In Python 3, however, `Counter` can be faster when it counts a whole iterable in a single call, because that path is implemented in C.  Often the best way to test performance is to time code execution.

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> 
We should get used to thinking about performance.   
We can use the Jupyter Notebook [magics](http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb) to time the execution.
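
Outside a notebook, the standard library `timeit` module does a similar job to the timing magics. The sketch below compares a plain dictionary with a `Counter` on a small synthetic corpus. The function names `tf_dict` and `tf_counter` and the repeated three-document corpus are assumptions made for illustration, and the timings will vary by machine and Python version.

```python
import timeit
from collections import Counter

# Synthetic corpus: the three-document example repeated many times
docs = ['mountain bike red', 'road bike carbon', 'bike helmet'] * 1000

def tf_dict(corpus):
    """Term frequency with a plain dictionary."""
    tf = {}
    for doc in corpus:
        for word in doc.split():
            tf[word] = tf.get(word, 0) + 1
    return tf

def tf_counter(corpus):
    """Term frequency with a Counter, updating one document at a time."""
    tf = Counter()
    for doc in corpus:
        tf.update(doc.split())
    return tf

# Both approaches must agree before their timings are worth comparing
assert tf_dict(docs) == dict(tf_counter(docs))

for fn in (tf_dict, tf_counter):
    seconds = timeit.timeit(lambda: fn(docs), number=20)
    print('%s: %.4f s for 20 runs' % (fn.__name__, seconds))
```

Checking that both functions return the same counts first is a cheap safeguard against timing code that is fast only because it is wrong.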

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Run the code to compute the term frequency for the full corpus of item titles.  
What is the frequency of the terms 'unicycle', 'bicycle' and 'tricycle'?

In [5]:
tf = get_tf(items_t)

# Print tf for 'unicycle'
## HIDE
print(tf['unicycle'])
print(tf['bicycle'])
print(tf['tricycle'])


5
3544
13


The term frequency can also be computed for each document - the term frequency is a crude measure of the "aboutness" of a document.  For short documents, such as eBay item titles, terms do not occur very frequently.  In longer documents the term frequency is a form of compression and summarization.

We can store the document term frequency in a dictionary, where the key is the document id and the value is a nested dictionary of document terms and their counts.

For example consider the corpus of three documents:

1. 'mountain bike red'
2. 'road bike carbon'
3. 'bike helmet'

The document term frequencies would be:

| id | document term frequencies |
|----|------------------|
| 1  | { 'mountain' : 1, 'bike' : 1, 'red' : 1 } |
| 2  | { 'road' : 1, 'bike' : 1, 'carbon' : 1 } |
| 3  | { 'bike' : 1, 'helmet' : 1 } |
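
The table above can be reproduced with one `Counter` per document. This is only a sketch: the names `mini_corpus` and `tfd_example` are hypothetical, and the ids start at 1 to match the table rather than the 0-based ids used in the notebook code.

```python
from collections import Counter

mini_corpus = ['mountain bike red', 'road bike carbon', 'bike helmet']

# One Counter per document, keyed by a 1-based document id as in the table
tfd_example = {doc_id: Counter(doc.split())
               for doc_id, doc in enumerate(mini_corpus, start=1)}

for doc_id in sorted(tfd_example):
    print(doc_id, dict(tfd_example[doc_id]))

# most_common(1) returns the highest-frequency term and its count
print(tfd_example[1].most_common(1))
```

In these short documents every term occurs once, so the highest-frequency term is a tie. Longer documents, such as item descriptions, give more informative counts.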


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Now compute **document term frequencies** for the full corpus of item titles.  
Print out the document term frequencies for 3 randomly selected documents - what was the highest frequency term for each?

In [6]:
def get_tfd(corpus):
    tfd = {}
    for i,doc in enumerate(corpus):
        tfd[i]={}
        # << DOCUMENT TERM FREQUENCY >> CODE HERE
        ## HIDE
        for word in doc.split():
            if word in tfd[i]:
                tfd[i][word] += 1
            else:
                tfd[i][word] = 1
    return tfd
            
    
tfd = get_tfd(items_t)
tfd[234]

{'black': 1,
 'blue': 1,
 'giro': 1,
 'ii': 1,
 'milky': 1,
 'nib': 1,
 's': 1,
 'sante': 1,
 'shoes': 1,
 'white': 1,
 'women': 1}

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Repeat this time computing the **document term frequencies** for the full corpus of item descriptions.  
Print out the document term frequency for 3 randomly selected documents - what was the highest frequency term for each?

In [7]:
# << COMPUTE TFD FOR ITEM DESCRIPTIONS >> CODE HERE
## HIDE
tfd = get_tfd(items_d)
tfd[234]

{'accurately': 2,
 'also': 1,
 'and': 1,
 'answer': 1,
 'answered': 1,
 'are': 1,
 'as': 4,
 'ask': 2,
 'at': 1,
 'best': 1,
 'black': 1,
 'blue': 1,
 'but': 1,
 'buying': 1,
 'can': 1,
 'clothing': 1,
 'describe': 1,
 'descriptions': 1,
 'experience': 1,
 'feel': 2,
 'for': 1,
 'free': 2,
 'giro': 1,
 'have': 4,
 'if': 3,
 'ii': 1,
 'in': 2,
 'interested': 1,
 'it': 1,
 'items': 1,
 'jae': 1,
 'make': 1,
 'may': 1,
 'milky': 1,
 'more': 1,
 'nib': 1,
 'not': 1,
 'of': 1,
 'one': 1,
 'our': 3,
 'please': 1,
 'positive': 1,
 'possible': 1,
 'promptly': 1,
 'questions': 2,
 's': 1,
 'sale': 1,
 'sales': 1,
 'sante': 1,
 'shoes': 1,
 'specific': 1,
 'than': 1,
 'them': 1,
 'these': 1,
 'to': 4,
 'try': 2,
 'we': 6,
 'which': 1,
 'white': 1,
 'will': 1,
 'women': 1,
 'you': 3,
 'your': 1}

## Exercise Set 2 - Term Frequency Ranking, Boolean Matching and Inverted Indexes

Whilst the document term frequency dictionary `tfd` from the previous section is a compact way to store term frequencies, it is not efficient for analysis.  A term frequency matrix is a more effective way to store the data.  

For example consider the corpus of three documents:

1. 'mountain bike red'
2. 'road bike carbon'
3. 'bike helmet'

There is a total vocabulary of six terms [ 'mountain', 'bike' , 'red', 'road', 'carbon', 'helmet' ].

Each document can be represented as a vector with six elements, one per term, so the corpus gives three such vectors:

| Document ID | mountain | bike | red | road | carbon | helmet |
|-------------|----------|------|-----|------|--------|--------|
| 1 - 'mountain bike red' | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 - 'road bike carbon' | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 - 'bike helmet' | 0 | 1 | 0 | 0 | 0 | 1 |

Those arrays can be stacked naturally into a matrix - one row per document, one column per term.  We call this matrix the term frequency matrix.

To compute the term frequency matrix we have to first compute the lexicon (set of terms) in our corpus.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Review the docs for the [set](https://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset) type. Note - sets do not contain duplicates and can be used to dedupe tokens.
Complete the `get_lexicon()` function definition so that it returns a list of unique terms across a given corpus of documents.  Validate with the small test corpus.

In [8]:
def get_lexicon(corpus):
    lexicon = set()
    # << COMPUTE SET OF TERMS IN CORPUS >> CODE HERE
    ## HIDE
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    ## HIDE
    return list(lexicon)
    
test_corpus = ['mountain bike red','road bike carbon','bike helmet']
lexicon = get_lexicon(test_corpus)

Now we have our lexicon we can compute a document term frequency matrix.  We will store our document term frequency vectors in a `list`.  Note we could also store them in a `dictionary` where the key is the document_id.  

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Complete the code snippet below to compute the term frequency vector for each document.  
Store the term frequency vectors in the list `tfm`.  Validate the results with the test corpus.

In [9]:
lexicon = get_lexicon(test_corpus)

tfm =[]
for doc in test_corpus:
    tfv = [0]*len(lexicon)
    for term in doc.split():
        # << COMPUTE DOCUMENT TERM FREQUENCY VECTOR tfv AND APPEND TO tfm >> CODE HERE
        ## HIDE
        tfv[lexicon.index(term)] += 1
        ## HIDE
    tfm.append(tfv)
    
print(tfm)

[[1, 0, 1, 1, 0, 0], [0, 0, 1, 0, 1, 1], [0, 1, 1, 0, 0, 0]]


Since we are going to reuse the tfm let's create a function that takes as argument the corpus and returns the lexicon and the tfm.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Copy your code snippets from the previous two exercises into the function definition below.    
Test your function by computing `tfm` on the test corpus and verify your results before computing `tfm` for the item titles corpus.

In [10]:
def get_tfm(corpus):
    
    def get_lexicon(corpus):
        lexicon = set()
        # << COMPUTE SET OF TERMS IN CORPUS >> CODE HERE
        ## HIDE
        for doc in corpus:
            lexicon.update([word for word in doc.split()])
        return list(lexicon)
        ## HIDE
        
    lexicon = get_lexicon(corpus)

    tfm =[]
    for doc in corpus:
        tfv = [0]*len(lexicon)
        for term in doc.split():
            # << COMPUTE DOCUMENT TERM FREQUENCY VECTOR AND APPEND TO tfm >> CODE HERE
            ## HIDE
            tfv[lexicon.index(term)] += 1
            ## HIDE
        tfm.append(tfv)
        
    return tfm, lexicon


test_corpus = ['mountain bike red','road bike carbon','bike helmet']
tfm, lexicon = get_tfm(test_corpus)


<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> 
As our corpus increases so does the sparsity of the term frequency matrix - most elements have value zero.  
We can use more efficient [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix) storage to save memory.  More details [here](http://localhost:8888/notebooks/1.3%20Introduction%20to%20Information%20Retrieval.ipynb#Sparse-Term-Frequency-Matrix).  
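
As a rough illustration of sparse storage, the sketch below keeps only the non-zero counts of each row in a dictionary. The function `get_sparse_tfm` and the `term_to_col` mapping are hypothetical names, not part of the notebook, and a real project would more likely use a dedicated sparse matrix library.

```python
def get_sparse_tfm(corpus):
    """Return (rows, term_to_col); each row maps column index -> count."""
    term_to_col = {}   # term -> column index, assigned on first sight
    rows = []
    for doc in corpus:
        row = {}
        for term in doc.split():
            # Dictionary lookup instead of a linear lexicon.index(term) search
            col = term_to_col.setdefault(term, len(term_to_col))
            row[col] = row.get(col, 0) + 1
        rows.append(row)
    return rows, term_to_col

rows, term_to_col = get_sparse_tfm(['mountain bike red',
                                    'road bike carbon',
                                    'bike helmet'])

stored = sum(len(row) for row in rows)    # non-zero entries actually stored
total = len(rows) * len(term_to_col)      # entries in the dense matrix
print(stored, total)  # 8 18
```

For the mini-corpus only 8 of the 18 dense entries are stored, and the gap widens as the lexicon grows.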

In [11]:
import pandas as pd
from bokeh.plotting import figure, output_notebook, show, vplot
#from bokeh.charts import Bar, Scatter, BoxPlot
#from bokeh.charts.attributes import CatAttr
#from bokeh.models import ColumnDataSource

# Sparsity as a function of document count
n = []
s = []
for i in range(100,1000,100):
    corpus = items_t[0:i]
    tfm, lexicon = get_tfm(corpus)
    # Sparsity is the fraction of zero entries; counts above 1 are non-zero too
    n_zero = sum(x.count(0) for x in tfm)
    n_total = sum(len(x) for x in tfm)
    s.append(float(n_zero) / n_total)
    n.append(i)
    
output_notebook(hide_banner=True)
p = figure(x_axis_label='Documents', y_axis_label='Sparsity',
          plot_width=400, plot_height=400)
p.line(n, s, line_width=2)
p.circle(n, s, fill_color="white", size=8)
show(p)


### Boolean Search

We are now in a position to write our first ranking function.  With the term frequency matrix we can find documents that contain words from a user-specified query.  We will start by simply returning the documents from the corpus that match any terms in the query, ranked by the raw frequency of matching terms. 

More specifically our algorithm for 'boolean search' proceeds as follows:

1. Convert query to query vector using the lexicon for the corpus
2. Compute a ranking score for each document by taking the [dot product](https://en.wikipedia.org/wiki/Dot_product) of the query vector and each document's term frequency vector
3. Sort the documents by score
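
The three steps can be traced by hand on the mini-corpus. The sketch below hard-codes the lexicon and term frequency matrix from the table earlier in this section; the query `'red bike'` and the helper `dot` are assumptions chosen for illustration.

```python
lexicon = ['mountain', 'bike', 'red', 'road', 'carbon', 'helmet']
tfm = [[1, 1, 1, 0, 0, 0],   # 1 - 'mountain bike red'
       [0, 1, 0, 1, 1, 0],   # 2 - 'road bike carbon'
       [0, 1, 0, 0, 0, 1]]   # 3 - 'bike helmet'

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Step 1: query vector with a 1 for every lexicon term in the query
query_terms = 'red bike'.split()
qrv = [1 if term in query_terms else 0 for term in lexicon]

# Step 2: score each document by its dot product with the query vector
scores = [(dot(qrv, tfv), doc_id) for doc_id, tfv in enumerate(tfm, start=1)]

# Step 3: sort by score, highest first
ranked = sorted(scores, key=lambda pair: -pair[0])
print(ranked)  # [(2, 1), (1, 2), (1, 3)]
```

Document 1 matches both query terms and ranks first. Documents 2 and 3 match only 'bike' and tie on a score of 1.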

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
The function definition `get_results_tf()` converts the user query `qry` into a vector using the supplied lexicon.
Complete the function by providing the code to compute the score of each document.  Test using a bike related query such as 'led bike light'.  Do you get relevant results?  

HINT : Here is a [gist](https://gist.github.com/mattwg/60910d90a8987e271212) that shows how to compute the dot product between two vectors!

In [14]:
def get_results_tf(qry, tfm, lexicon):
    qrv = [0]*len(lexicon)
    for term in qry.split():
        if term in lexicon:
            qrv[lexicon.index(term)] = 1

    results = []      
    for i, tfv in enumerate(tfm):
        score = 0
        # << COMPUTE DOCUMENT SCORE >> CODE HERE
        ## HIDE
        score = sum([ xy[0] * xy[1] for xy in zip(qrv,tfv)])
        ## HIDE
        results.append([score, i])
    
    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results


def print_results(results,n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.2f - %s'%(r[0],items_t[r[1]]))
    

tfm, lexicon = get_tfm(items_t)
results = get_results_tf('led bike light', tfm , lexicon)
print_results(results,10)



Top 10 from recall set of 9893 items:
	6.00 - frog waterproof bike light set led white front light led red rear light
	4.00 - cycling bike bicycle led front light head light torch mount aaa
	4.00 - waterproof usb rechargeable led bike light set bright headlight free light
	4.00 - niterider tl sl led bike tail light red rear flashing bike safety
	4.00 - sets bright bike bicycle waterproof led head light led rearlight us seller
	4.00 - lm cree led cycling front bike bicycle light headlight only light
	4.00 - led tire valve stem caps neon light bike bicycle car auto wheel light
	4.00 - ultra bright waterproof silicon led bicycle light set led front rear light
	4.00 - usb cycling xml led front bike light bicycle light headlamp headlight


### Inverted Index

This search across documents is expensive - especially if the score for many documents is zero!  To solve this problem we can create an inverted index.  An inverted index can be used to filter out documents that do not contain any of the keywords in the query before computing the ranking score.  

Using our example mini-corpus:

1. 'mountain bike red'
2. 'road bike carbon'
3. 'bike helmet'

There is a total vocabulary of six terms [ 'mountain', 'bike' , 'red', 'road', 'carbon', 'helmet' ].  An inverted index maps each of these terms to the documents in which the term can be found.  

| key | value |
|-----|-------|
| 'mountain' | [ 1 ] |  
| 'bike' | [1, 2, 3] |  
| 'red' | [1] |  
| 'road' |  [2] | 
| 'carbon' | [2] | 
|  'helmet' | [3] |

We could store an inverted index in a dictionary where the key is the term and the value is the list of ids of the documents that contain it.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
We will create an inverted index as a python dictionary keyed on the token.  
Complete the code snippet below to create the inverted index.  Validate with the test corpus. 

In [15]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        # << POPULATE INVERTED INDEX >> CODE HERE
        ## HIDE
        for word in doc.split():
            if word in idx:
                idx[word].append(i)
            else:
                idx[word] = [i]
        ## HIDE
    return idx

test_corpus = ['mountain bike red','road bike carbon','bike helmet']
idx = create_inverted_index(test_corpus)
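
One way to validate an index on the test corpus is to look up a few terms and combine their posting lists with set operations. The sketch below is self-contained and uses a `defaultdict` variant named `build_index`, which is an assumption for illustration rather than the notebook's own function. Document ids are 0-based here, as in the code above.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each term to the list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for word in doc.split():
            index[word].append(doc_id)
    return index

index = build_index(['mountain bike red', 'road bike carbon', 'bike helmet'])

# Documents containing both terms (AND) and either term (OR)
both_terms = set(index['road']) & set(index['bike'])
either_term = set(index['road']) | set(index['helmet'])

print(index['bike'])        # [0, 1, 2]
print(sorted(both_terms))   # [1]
print(sorted(either_term))  # [1, 2]
```

Note that indexing a `defaultdict` with an unknown term silently adds an empty entry, so `index.get(term, [])` is the safer lookup for user-supplied query terms.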

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Now we can create an inverted index for all the item titles.  We can use the set intersection method to find all the documents that match the query 'led bike light'.  Run the code below checking the titles of some of the results that match query terms.

In [16]:
idx = create_inverted_index(items_t)
print(set(idx['led']).intersection(set(idx['bike'])).intersection(set(idx['light'])))
print(items_t[2061])

set([9559, 2061, 2062, 8212, 31, 8229, 4134, 8238, 4143, 8248, 8250, 4159, 8258, 8261, 2118, 80, 4181, 8278, 91, 4193, 6244, 6247, 9237, 6282, 2195, 2196, 6297, 166, 4274, 2227, 6175, 2239, 4294, 4296, 6345, 4299, 4300, 8406, 8408, 6361, 2267, 2272, 8426, 4331, 4335, 6386, 6388, 245, 6407, 6417, 6419, 6424, 8474, 6434, 2339, 8485, 4398, 306, 8499, 4404, 4406, 316, 4413, 8515, 2372, 6469, 2375, 2387, 6497, 2402, 8553, 6514, 2451, 374, 8571, 384, 6533, 8583, 8590, 6544, 6547, 4501, 2459, 4512, 423, 8623, 4532, 2498, 8652, 4557, 8655, 472, 8665, 8670, 2529, 6629, 8681, 4592, 6230, 6662, 6664, 2569, 8715, 2589, 6686, 6235, 6694, 6697, 4658, 8756, 8761, 8763, 6733, 594, 8790, 599, 4699, 8798, 8807, 618, 8811, 4722, 6772, 2682, 2697, 2706, 4758, 4762, 4769, 4770, 2734, 6839, 2749, 8896, 2753, 2754, 8901, 6859, 717, 4817, 6867, 4821, 731, 6880, 6883, 8934, 4839, 4841, 9684, 8954, 4865, 773, 6918, 6920, 8970, 8972, 6926, 8978, 6934, 796, 8992, 4897, 2850, 9692, 6965, 6971, 4924, 829, 6974, 290

We can now improve on our first ranking function, this time scoring only the documents that match at least one keyword in the query:

In [17]:
def get_results_tf(qry, idx):
    score = Counter()
    for term in qry.split():
        # Skip query terms that are missing from the index (avoids a KeyError)
        if term in idx:
            for doc in idx[term]:
                score[doc] += 1
            
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results


idx = create_inverted_index(items_t)
results = get_results_tf('led bike light', idx)
print_results(results,10)


Top 10 from recall set of 4630 items:
	6.00 - frog waterproof bike light set led white front light led red rear light
	4.00 - cree xm led cycling head bike bicycle light headlight torch light
	4.00 - cycling bike bicycle led front light head light torch mount aaa
	4.00 - waterproof usb rechargeable led bike light set bright headlight free light
	4.00 - lm cree led cycling front bike bicycle light headlight only light
	4.00 - niterider tl sl led bike tail light red rear flashing bike safety
	4.00 - sets bright bike bicycle waterproof led head light led rearlight us seller
	4.00 - lm cree led cycling front bike bicycle light headlight only light
	4.00 - led tire valve stem caps neon light bike bicycle car auto wheel light
	4.00 - ultra bright waterproof silicon led bicycle light set led front rear light


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Run a few different queries some longer some shorter - for example 'front rear led light', 'led light', 'led'.  What do you notice about the ranking score?
Try the query 'mountain bike suspension' - do the results look relevant?  What might be going on?

In [18]:
# << ENTER DIFFERENT QUERIES >> 
results = get_results_tf('mountain bike suspension', idx)
print_results(results, 10)


Top 10 from recall set of 4618 items:
	3.00 - mountain bike mtb bicycle disc brake suspension front fork lock
	3.00 - salsa front black bike wheel skewer road or mountain bike quick release qr
	3.00 - oem jagwire brake shifter cable housing kit road bike mountain bike
	3.00 - oakley mens automatic mountain mtb factory lite mountain bmx bike gloves large
	3.00 - kmc xsp speed chain bike bicycle links mtb mountain bike new
	3.00 - turbo r full suspension mountain bike size shimano suntour
	3.00 - fat bike mountain bike frame and fork plus all components no wheels or tires
	3.00 - new gloves mountain bike motocross bike bmx blue black size l large
	3.00 - oem jagwire brake shifter cable housing kit road bike mountain bike
	3.00 - kmc xxsp speed chain bike bicycle links mtb mountain bike new


<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'> 
The term frequency ranking is dominated by high frequency terms.  
For example the term bike is present in nearly every other document.

In [19]:
import pandas as pd
from bokeh.plotting import output_notebook, show
from bokeh.charts import Bar
from bokeh.charts.attributes import CatAttr
#from bokeh.models import ColumnDataSource

df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()]})

output_notebook(hide_banner=True)
p = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)
show(p)



## Exercise Set 3 - TF-IDF

We already have all the information we need to compute IDF.  The number of documents in which a term appears is simply the length of the list of documents for a given key in our index.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Create a function definition to compute IDF.  Arguments to the function should be the term, the inverted index and the number of documents in the corpus.  The log function is in the [math](https://docs.python.org/2/library/math.html) module.  Compute the IDF of the terms 'led', 'bike' and 'light'.

In [20]:
import math

def idf(term, idx, n):
    # << IMPLEMENT IDF FUNCTION >> CODE HERE
    ## HIDE
    return math.log( float(n) / (1 + len(idx[term])))    
    ## HIDE


## HIDE    
print(idf('led',idx,len(items_t)))
print(idf('bike',idx,len(items_t)))
print(idf('light',idx,len(items_t)))
## HIDE

2.83139552897
0.780002352774
2.66579387739
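
The `idf()` function above adds 1 to the document count inside a natural logarithm. A small worked example on the three-document mini-corpus shows one consequence of that choice. The helper `idf_smoothed` is a hypothetical stand-alone version that takes the counts directly instead of an index.

```python
import math

def idf_smoothed(n_docs, doc_freq):
    """log(N / (1 + df)), the same form as the idf() function above."""
    return math.log(float(n_docs) / (1 + doc_freq))

n_docs = 3  # 'mountain bike red', 'road bike carbon', 'bike helmet'

# 'bike' appears in all three documents, 'red' in only one
for term, doc_freq in [('bike', 3), ('red', 1)]:
    print(term, round(idf_smoothed(n_docs, doc_freq), 3))
```

With this form, a term that appears in every document gets a slightly negative IDF, since N / (1 + N) is below 1. Rare terms get larger positive values, which is the behaviour we want from the weighting.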


To rank based on TF-IDF we only need to make a few small changes to the previous TF ranking function:

1. We need to know how many times the term `t` appears in document `D`.  We can store this in our inverted index: instead of storing only the document ID, we store the document ID together with the number of times the term appears in it.  Previously this was captured in the TF matrix.  We can avoid computing the TF matrix if we adjust our index.
2. We have to change the function signature in the ranking function - passing in the total size of the corpus - and we have to change the score calculation.

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>
Complete the `get_results_tfidf()` function to score documents based on TF-IDF rather than just TF.  Run a few different queries, some longer and some shorter - for example 'front rear led light', 'led light', 'led'.  What do you notice about the ranking score?  How do they compare to the TF ranking?  Try the query 'mountain bike suspension' - do the results look more or less relevant than TF?

In [21]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    # Update document's frequency
                    idx[word][i] += 1
                else:
                    # Add document
                    idx[word][i] = 1
            else:
                # Add term
                idx[word] = {i:1}
    return idx

def get_results_tfidf(qry, idx, n):
    score = Counter()
    for term in qry.split():
        # << IMPLEMENT TF-IDF SCORING >> CODE HERE
        ## HIDE
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                score[doc] += idx[term][doc] * i
        ## HIDE
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            results.append([x[1],x[0]])
    
    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results

idx = create_inverted_index(items_t)
## HIDE
# results = get_results_tfidf('front led bike light', idx, len(items_t))
# results = get_results_tfidf('led bike light', idx, len(items_t))
# results = get_results_tfidf('led', idx, len(items_t))
results = get_results_tfidf('mountain bike suspension', idx, len(items_t))
## HIDE
print_results(results,10)


Top 10 from recall set of 4618 items:
	8.35 - mountain bike mtb bicycle disc brake suspension front fork lock
	8.35 - turbo r full suspension mountain bike size shimano suntour
	8.35 - marzocchi bomber free ride mountain bike mx race suspension decal sticker
	8.35 - men s mountain bike speed full suspension sturdy steel frame new
	8.35 - full suspension mongoose women s mountain bike bicycle aluminum frame new
	8.35 - cannondale scalpel carbon full suspension mountain bike sram lefty
	8.35 - next cyclone men s mountain bike shimano suspension fork new
	8.35 - trek oclv carbon full suspension mountain bike medium made in the usa
	8.35 - schwinn black front suspension disc brake hybrid mountain bike sale
	8.35 - bmc trailfox mountain bike full suspension frame aluminum rockshox


<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>We can plot the relationship between TF and IDF to get more intuition for what TF and IDF are all about.  Plotting data to understand algorithms is good practice - not only to develop intuition but also to spot bugs in your code!  We will revisit this later in the day.

In [22]:
from bokeh.charts import vplot

idx = create_inverted_index(items_t)

df = pd.DataFrame({'term':[x for x in idx.keys()],'freq':[len(x) for x in idx.values()],
                  'idf':[idf(x, idx, len(items_t)) for x in idx.keys()]})

output_notebook(hide_banner=True)
p1 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='freq',
        plot_width=800, plot_height=400)
p2 = Bar(df.sort_values('freq', ascending=False)[:30], label=CatAttr(columns=['term'], sort=False), values='idf',
        plot_width=800, plot_height=400)
p = vplot(p1, p2)
show(p)


### Problematic Queries

<img src='files/resources/ic_info_outline_black_24dp_2x.png' align='left'>  

Although we have fixed the suspension query, we get another problem with the query "mountain bike", which seems to return mostly accessories:


In [23]:
idx = create_inverted_index(items_t)
results = get_results_tfidf('mountain bike', idx, len(items_t))
print_results(results,10)


Top 10 from recall set of 4593 items:
	5.73 - oakley mens automatic mountain mtb factory lite mountain bmx bike gloves large
	5.73 - mavic crossride wheelset mountain bike xc all mountain qr flat speed
	4.06 - salsa front black bike wheel skewer road or mountain bike quick release qr
	4.06 - kmc xsp speed chain bike bicycle links mtb mountain bike new
	4.06 - fat bike mountain bike frame and fork plus all components no wheels or tires
	4.06 - new gloves mountain bike motocross bike bmx blue black size l large
	4.06 - oem jagwire brake shifter cable housing kit road bike mountain bike
	4.06 - kmc xxsp speed chain bike bicycle links mtb mountain bike new
	4.06 - oem jagwire brake shifter cable housing kit road bike mountain bike
	4.06 - mtb road bike mountain bicycle adjustable alloy bike kick stand side kickstand


We need to penalise items whose titles contain many terms that are not in the query.  For example the terms "mountain" and "bike" account for only 3 of the 12 tokens in "oakley mens automatic mountain mtb factory lite mountain bmx bike gloves large" yet it scores highly because there is no penalty for all the other terms in the item title.

In addition this scheme creates discrete score levels, one for each combination of query term frequencies:

In [24]:
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'title':[items_t[x[1]] for x in results]})

d = df.groupby('score').first().reset_index()

r1 = re.compile('(bike)')
r2 = re.compile('(mountain)')

for i, t in enumerate(d.title):
    n1 = r1.findall(t)
    n2 = r2.findall(t)
    print('%d x Bike, %d x Mountain, Score = %0.2f'%(len(n1),len(n2),d.score[i]))

1 x Bike, 0 x Mountain, Score = 0.80
2 x Bike, 0 x Mountain, Score = 1.59
0 x Bike, 1 x Mountain, Score = 2.47
1 x Bike, 1 x Mountain, Score = 3.26
2 x Bike, 1 x Mountain, Score = 4.06
1 x Bike, 2 x Mountain, Score = 5.73
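
The discrete levels follow directly from the scoring rule: each score is the count of each query term multiplied by that term's IDF, summed over the query terms. The sketch below rebuilds the levels from the two single-term scores printed above. Those two values are rounded to two decimals, so the reconstructed sums can differ from the printed scores in the last digit.

```python
# Rounded single-term scores read from the output above (assumed IDF values)
idf_bike = 0.80
idf_mountain = 2.47

def level(n_bike, n_mountain):
    """TF-IDF score for a title with the given query term counts."""
    return n_bike * idf_bike + n_mountain * idf_mountain

for n_bike, n_mountain in [(1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (1, 2)]:
    print('%d x Bike, %d x Mountain, Score ~ %0.2f'
          % (n_bike, n_mountain, level(n_bike, n_mountain)))
```

Because every title with the same pair of counts lands on the same level, thousands of documents share just a handful of scores.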


In [25]:
from bokeh.plotting import output_notebook, show
from bokeh.charts import Scatter

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})

output_notebook(hide_banner=True)
p = Scatter(df, x='score', y='length')
show(p)


Ideally we do not want scores to be the same for lots of documents.  A high TF-IDF score in a shorter document should indicate more relevance, so we could try boosting the score for documents that are shorter than average.

In [26]:
def get_results_tfidf_boost(qry, corpus):
    idx = create_inverted_index(corpus)
    n = len(corpus)
    d = [len(x.split()) for x in corpus]
    d_avg = float(sum(d)) / len(d)
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                f = float(idx[term][doc])
                score[doc] += i *  ( f / (float(d[doc]) / d_avg) )
        
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            # output [0] score, [1] doc_id
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results
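
To see what the length adjustment does in isolation, the sketch below applies the same per-term formula to two documents that differ only in length. All numbers are hypothetical and chosen to keep the arithmetic simple; `boosted_term_score` is an illustrative helper, not part of the notebook.

```python
def boosted_term_score(idf_value, f, doc_len, avg_len):
    """Per-term score from the cell above: idf * f / (doc_len / avg_len)."""
    return idf_value * (f / (float(doc_len) / avg_len))

# Hypothetical values: same term, same frequency, different title lengths
idf_value = 2.0
avg_len = 10.0

short_doc = boosted_term_score(idf_value, 1, 5, avg_len)    # half the average length
long_doc = boosted_term_score(idf_value, 1, 20, avg_len)    # twice the average length
print(short_doc, long_doc)  # 4.0 1.0
```

A title half the average length doubles its score, and a title twice the average length halves it. The adjustment is unbounded for very short titles, which is one motivation for the damped normalisation in BM25.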

In [27]:
results = get_results_tfidf_boost('mountain bike', items_t)
print_results(results, 10)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})

output_notebook()
p = Scatter(df, x='score', y='length')
show(p)


Top 10 from recall set of 4593 items:
	10.76 - trek mountain bike
	8.13 - rocky mountain etsx
	8.13 - rocky mountain slayer
	8.07 - shadow nine mountain bike
	8.07 - cannondale jekyll mountain bike
	6.46 - mens mongoose mountain bike new
	6.46 - magna glacier point mountain bike
	6.46 - truvativ xx mountain bike chainring
	6.46 - gt timberline mountain bike small
	6.46 - banshee scream downhill mountain bike



That looks better!  This intuition will be built upon in the next section.

## Exercise Set 4 - Implementing BM25

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'> In this exercise the goal is to implement the BM25 algorithm.   
To help you I have provided comments breaking this down into steps 1 through 5.  

In [28]:
def get_results_bm25(qry, corpus, k1=1.5, b=0.75):
    idx = create_inverted_index(corpus)
    # 1.Assign (integer) n to be the length of the corpus
    ## HIDE
    n = len(corpus)
    # 2.Assign (list) d with elements corresponding to the length of each document in the corpus
    ## HIDE
    d = [len(x.split()) for x in corpus]
    # 3.Assign (float) d_avg as the average document length of the documents in the corpus
    ## HIDE
    d_avg = float(sum(d)) / len(d)                
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                # 4.Assign (float) f equal to the number of times the term appears in doc
                ## HIDE
                f = float(idx[term][doc])
                # 5.Assign (float) s the BM25 score for this (term, document) pair
                # HIDE
                s = i * (( f * (k1 + 1) ) / (f + k1 * (1 - b + (b * (float(d[doc]) / d_avg)))))
                score[doc] += s
                
    results=[]
    for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
        if x[1] > 0:
            results.append([x[1],x[0]])

    sorted_results = sorted(results, key=lambda t: t[0] * -1 )
    return sorted_results

In [None]:
results = get_results_bm25('mountain bike', items_t)
print_results(results, 10)


Top 10 from recall set of 4593 items:
	4.75 - trek mountain bike
	4.46 - shadow nine mountain bike
	4.46 - cannondale jekyll mountain bike
	4.20 - mens mongoose mountain bike new
	4.20 - magna glacier point mountain bike
	4.20 - truvativ xx mountain bike chainring
	4.20 - gt timberline mountain bike small
	4.20 - banshee scream downhill mountain bike
	4.20 - lake winter mountain bike shoe
	4.20 - mountain bike tubular tires new


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'> Test BM25 with different parameter values.  
Can you observe the effect of varying k1 and b?  What happens if k1=0?

In [None]:
results = get_results_bm25('mountain bike', items_t, k1=1.5, b=0.75)

# Plot score vs item length
df = pd.DataFrame({'score':[float(x[0]) for x in results],
                   'length':[len(items_t[x[1]].split()) for x in results]})
output_notebook()
p = Scatter(df, x='score', y='length')
show(p)

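
The effect of `k1` and `b` is easier to see on a single (term, document) pair than on a full ranking. The sketch below isolates the per-term expression from `get_results_bm25()` in a hypothetical helper, `bm25_term`, and evaluates it with made-up values for the IDF, term frequency and document lengths.

```python
def bm25_term(f, doc_len, avg_len, idf_value, k1=1.5, b=0.75):
    """Per-term BM25 score, the same expression as step 5 above."""
    norm = 1 - b + b * (float(doc_len) / avg_len)
    return idf_value * (f * (k1 + 1)) / (f + k1 * norm)

idf_value = 2.0   # hypothetical IDF
avg_len = 10.0    # hypothetical average document length

# Default parameters: a short document beats a long one for the same f
print(bm25_term(1, 3, avg_len, idf_value), bm25_term(1, 30, avg_len, idf_value))

# b=0 switches length normalisation off
print(bm25_term(1, 3, avg_len, idf_value, b=0),
      bm25_term(1, 30, avg_len, idf_value, b=0))

# k1=0 removes the influence of term frequency and length
print(bm25_term(1, 3, avg_len, idf_value, k1=0),
      bm25_term(3, 30, avg_len, idf_value, k1=0))
```

With `k1=0` the expression collapses to the IDF for any matching document, so the ranking depends only on which query terms are present. Larger `k1` lets repeated terms count for more, and `b` controls how strongly document length is penalised.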