During day 1 we saw how to build an inverted index to efficiently generate a recall set of documents. We then ranked documents using term frequency, inverse document frequency and BM25.

Today we have looked at building machine learning models for ranking. This notebook shows how to use a machine-learned model in a search engine by "deploying" one of the models we created.

In [2]:
import os 
import pandas as pd

In [3]:
data = pd.read_csv("data/fullDataset.tsv", sep="\t",header=0)
data = data.dropna()


## Build inverted index

In [78]:
data.columns

Index([u'key', u'query', u'Title', u'LeafCats', u'ItemID', u'X_unit_id',
       u'SCORE', u'label_relevanceGrade', u'label_relevanceBinary',
       u'feature_1', u'feature_2', u'feature_3', u'feature_4', u'feature_5',
       u'feature_6', u'feature_7', u'feature_8', u'feature_9', u'feature_10'],
      dtype='object')

In [79]:
import re

rgx = re.compile(r'\b[a-zA-Z]+\b')
corpus = [ ' '.join(re.findall(rgx, str(x))).lower() for x in data.Title]
corpus[0:5]

['disney world all star music all inclusive package feb',
 'mgtc td tf mga morris minor top wing bolts set of',
 'new cross vice gel ink pen gift set document marker',
 'white paper towel roll holder cabinet wall mount sturdy',
 'engine cooling head traxxas jato trx nitro rustler']

In [80]:
def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        # << POPULATE INVERTED INDEX >> CODE HERE
        ## HIDE
        for word in doc.split():
            if word in idx:
                idx[word].append(i)
            else:
                idx[word] = [i]
        ## HIDE
    return idx

idx = create_inverted_index(corpus)

We still have bike items in our corpus:

In [81]:
print(idx['bike'][0:10])
print(corpus[55])

[55, 421, 546, 559, 648, 691, 702, 983, 1234, 1262]
waterproof anti shock led bike taillight tail rear light caution aaa battery


Now that we have the index, we can generate a recall set. All that is left is to score the documents.
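
As a reminder of what the recall step looks like, here is a minimal sketch. It assumes a document belongs to the recall set if it contains at least one query term, with an optional stricter mode that requires every term. The helper name `recall_set` and the toy index are illustrative and not part of the notebook.

```python
def recall_set(qry, idx, match_all=False):
    """Return the sorted doc ids matching a query.

    match_all=False keeps documents containing any query term (union);
    match_all=True keeps only documents containing every term (intersection).
    """
    # set() over a list of doc ids or over a dict keyed by doc id both work
    postings = [set(idx.get(term, [])) for term in qry.lower().split()]
    if not postings:
        return []
    combine = set.intersection if match_all else set.union
    return sorted(combine(*postings))


# Toy index in the same shape as the one returned by create_inverted_index
toy_idx = {"bike": [0, 2], "light": [0, 1], "pump": [2]}

any_term = recall_set("bike light", toy_idx)
all_terms = recall_set("bike light", toy_idx, match_all=True)
print(any_term)   # documents with "bike" or "light"
print(all_terms)  # documents with both terms
```

The union mode gives higher recall and leaves it to the ranker to push weak matches down; the intersection mode trades recall for precision.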

### Question?
Here I normalise the features. What would we do in a real MLR problem as new data accumulates? Would we use algorithms that do not require the features to be normalised?
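
One common answer, sketched here under assumptions, is to estimate the scaling statistics once on the training data and reuse them unchanged for every new item, which is the role the saved scaler loaded below plays. `FrozenScaler` and the numbers are hypothetical; the class is plain Python that mimics standardisation and is not the scikit-learn implementation.

```python
import math


class FrozenScaler(object):
    """Hypothetical stand-in for a fitted standardising scaler."""

    def fit(self, rows):
        n = float(len(rows))
        cols = list(zip(*rows))
        self.mean = [sum(c) / n for c in cols]
        # Population standard deviation; fall back to 1.0 for constant columns
        self.std = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
                    for c, m in zip(cols, self.mean)]
        return self

    def transform(self, rows):
        # Uses the statistics stored at fit time; never re-estimates them
        return [[(v - m) / s for v, m, s in zip(row, self.mean, self.std)]
                for row in rows]


train = [[1.0, 10.0], [3.0, 30.0]]
toy_scaler = FrozenScaler().fit(train)

# A new item arriving after training is scaled with the training statistics
scaled_new = toy_scaler.transform([[5.0, 20.0]])
print(scaled_new)
```

If the feature distributions drift, the statistics (and usually the model) need to be refitted together. Tree-based models split on thresholds, so they are generally insensitive to this kind of monotonic rescaling, which is one reason they are a popular choice in this setting.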

In [82]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
features = data.loc[:,'feature_1':'feature_10']
scaler = joblib.load('models/scaler.pkl')
features = scaler.transform(features)

For efficiency we are going to store the feature vector for each document in the index:

In [83]:
def create_inverted_index(corpus, features):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i not in idx[word]:
                    # Add document
                    idx[word][i] = features[i]
            else:
                # Add term
                idx[word] = {i:features[i]}
    return idx


idx = create_inverted_index(corpus, features)
print(idx['tricycle'].keys())
print(idx['tricycle'].values())

idx['tricycle']

[52481, 14155, 40812, 50745, 63003, 41183]
[array([-0.04813072, -1.12769441, -0.44239811,  0.67552043,  0.59559434,
       -0.40688642, -0.22357568,  0.46916959, -0.84460605, -0.76963843]), array([-0.04813072, -1.12769441, -0.13007418, -2.1655071 ,  3.61571009,
        1.21961306, -0.22357568,  0.46916959,  0.93098746,  1.3424747 ]), array([-0.04813072, -1.12769441, -1.21733325, -0.09324576, -0.10931327,
        1.04045348, -0.22357568,  0.46916959, -0.74731325,  1.28736161]), array([-0.04813072, -1.12769441, -0.41255737, -1.05129674,  2.9225102 ,
        1.21961306, -0.22357568,  0.46916959,  0.88234106,  1.3424747 ]), array([-0.04813072, -1.12769441,  0.0542711 , -1.61285698, -0.4537387 ,
       -1.25492903, -0.21621783, -1.40998272, -0.01761729, -0.76963843]), array([-0.51155529,  0.88676506, -0.58065482, -0.28452294, -1.56581144,
       -0.70496106,  7.68161255,  0.66289515, -0.86892925, -0.76963843])]


{14155: array([-0.04813072, -1.12769441, -0.13007418, -2.1655071 ,  3.61571009,
         1.21961306, -0.22357568,  0.46916959,  0.93098746,  1.3424747 ]),
 40812: array([-0.04813072, -1.12769441, -1.21733325, -0.09324576, -0.10931327,
         1.04045348, -0.22357568,  0.46916959, -0.74731325,  1.28736161]),
 41183: array([-0.51155529,  0.88676506, -0.58065482, -0.28452294, -1.56581144,
        -0.70496106,  7.68161255,  0.66289515, -0.86892925, -0.76963843]),
 50745: array([-0.04813072, -1.12769441, -0.41255737, -1.05129674,  2.9225102 ,
         1.21961306, -0.22357568,  0.46916959,  0.88234106,  1.3424747 ]),
 52481: array([-0.04813072, -1.12769441, -0.44239811,  0.67552043,  0.59559434,
        -0.40688642, -0.22357568,  0.46916959, -0.84460605, -0.76963843]),
 63003: array([-0.04813072, -1.12769441,  0.0542711 , -1.61285698, -0.4537387 ,
        -1.25492903, -0.21621783, -1.40998272, -0.01761729, -0.76963843])}

We can also load the model, which in this case is a gradient boosted tree model:

In [84]:
clf = joblib.load('models/mlr.pkl')

In [93]:
def get_results_mlr(qry, idx, model):
    # Score every document that contains at least one query term
    score = {}
    for term in qry.split():
        # .get() skips query terms that are missing from the index
        for doc, features in idx.get(term, {}).items():
            # Probability of the positive (relevant) class
            score[doc] = model.predict_proba(features.reshape(1, -1))[0][1]

    # Keep [score, doc_id] pairs with a positive score, best first
    results = [[s, doc] for doc, s in score.items() if s > 0]
    return sorted(results, key=lambda t: t[0], reverse=True)

def print_results(results,n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.6f - %s'%(r[0],corpus[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.6f - %s'%(r[0],corpus[r[1]]))
            
            
            
results = get_results_mlr('nike air yeezy', idx, clf)
print_results(results,10)
print_results(results,10,head=False)



Top 10 from recall set of 2053 items:
	1.000000 - air conditioning vent clock time thermometer celsius digital blue led backlight
	1.000000 - nike air jordan retro db doernbecher sz free shipping
	1.000000 - nike air yeezy sp red october nrg solar platinum ii blink net zen yeezys i
	1.000000 - chevrolet bel air sports coupe
	0.999999 - chevrolet bel air belair
	0.999998 - shock absorber max air air rear monroe
	0.999998 - chevrolet bel air belair
	0.999997 - flexible silicon decal keyboard cover keypad skin for mac macbook air pro
	0.999997 - youth girls nike shox classic sneakers shoes size white silver
	0.999996 - matte hard cover scrub case for macbook air pro retina

Bottom 10 from recall set of 2053 items:
	0.000751 - briggs stratton sunbelt oregon air filter replacement
	0.000706 - chevrolet impala ss bel air tilt steering column wheel black
	0.000602 - pu leather w stand smart for apple ipad air mini case cover transform
	0.000554 - nike air jordan high dw black red white aj wi

In [95]:
results = get_results_mlr('white iphone', idx, clf)
print_results(results,10)
print_results(results,10,head=False)


Top 10 from recall set of 4347 items:
	1.000000 - lebron x p s elite white metallic gold black brand new dead stock
	1.000000 - apple iphone white verizon smartphone
	1.000000 - bia cordon bleu white ceramic large mixing serving bowl painted fruit flowers
	1.000000 - new electro brand in home micro stereo cd player ipod w remote white
	1.000000 - mens rolex datejust stainless steel date watch w black dial white gold bezel
	1.000000 - new painted romantic poem hard back skin cover case for apple iphone
	1.000000 - samsung galaxy s iii verizon unlocked gsm sch marble white
	0.999999 - gibson les paul traditional pro ii white w case upgd duncan jb jazz
	0.999999 - jtt karaoke noiseless mute microphone for ipad iphone smartphone japan
	0.999999 - apple iphone factory unlocked smartphone gold silver gray

Bottom 10 from recall set of 4347 items:
	0.000259 - stainless steel rolex explorer ii white p serial
	0.000254 - express fitted button down shirt long sleeves white sizes xs s m l xl nwo
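
`get_results_mlr` calls `predict_proba` once per document, which can be slow for recall sets of several thousand items. A possible variant, sketched under assumptions, collects the feature vectors first and scores them in a single call. `StubModel` is a hypothetical stand-in so the example runs without the pickled model; it only mimics the `predict_proba` output shape (one `[p_irrelevant, p_relevant]` row per input row). The name `get_results_batched` is also illustrative.

```python
class StubModel(object):
    """Hypothetical model: relevance probability is the clipped first feature."""

    def predict_proba(self, rows):
        probs = [min(max(row[0], 0.0), 1.0) for row in rows]
        return [[1.0 - p, p] for p in probs]


def get_results_batched(qry, idx, model):
    # Gather each matching document once, even if several terms hit it
    candidates = {}
    for term in qry.split():
        candidates.update(idx.get(term, {}))
    if not candidates:
        return []

    doc_ids = list(candidates)
    # One predict_proba call for the whole recall set
    proba = model.predict_proba([list(candidates[d]) for d in doc_ids])
    results = [[row[1], d] for row, d in zip(proba, doc_ids) if row[1] > 0]
    return sorted(results, key=lambda t: t[0], reverse=True)


# Toy index in the same word -> {doc_id: features} shape as the real one
toy_feature_idx = {
    "nike": {0: [0.9, 0.1], 1: [0.2, 0.5]},
    "air": {1: [0.2, 0.5], 2: [0.0, 0.3]},
}
batched = get_results_batched("nike air", toy_feature_idx, StubModel())
print(batched)
```

With the real index and model, the call would be `get_results_batched('white iphone', idx, clf)`; the ranking logic is unchanged, only the number of model calls differs.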