# Sugges_ #

One of the strategies to improve user experience is to provide user with hints, or, otherwise, to autocomplete his queries. Let's consider suggest.

Today we will practice generating suggestions using [Trie](https://en.wikipedia.org/wiki/Trie) datastructure (prefix tree), see the example below.

Plan:

1. Build Trie based on real search query data provided by AOL company;
2. Generate suggestion based on trie;
3. Measure suggestion speed;
4. Add spellcheck to suggest (optional).


![image](https://www.ritambhara.in/wp-content/uploads/2017/05/Screen-Shot-2017-05-01-at-4.01.38-PM.png)

## Install Trie DS support

You are free to use any library implementation of Trie, as well as the one we suggest.

https://github.com/google/pygtrie

In [1]:
# !pip install pygtrie

In [1]:
import pygtrie
t = pygtrie.CharTrie()
t["this is 3"] = "A"
t["this is 2"] = ["G",1]
t["this"] = "B"
t["that.is 3"] = "C"

print(t)

n = t.has_node('this') == pygtrie.Trie.HAS_VALUE
s = t.has_node('this') == pygtrie.Trie.HAS_SUBTRIE

print(f"Node = {n}; Subtree = {s}")

for key, val in t.iteritems("this"):
    print(key, '~', val)

CharTrie(this: B, this is 3: A, this is 2: ['G', 1], that.is 3: C)
Node = False; Subtree = False
this ~ B
this is 3 ~ A
this is 2 ~ ['G', 1]


## 1. Build a trie upon a dataset ##

### 1.1 Read dataset

Download the [dataset](https://drive.google.com/drive/folders/1rOE5eed37Jy2ANQItZVwDIFgPmkCoFu6) (we provide only the first part of the original data for simplicity (~3.5 mln queries)).
Explore the data, see readme file. Load the dataset.

In [2]:
import pandas as pd 
import tqdm

aol_data = pd.read_csv("user-ct-test-collection-01.txt.zip",sep="\t")

#Remove rows with NaN Query
aol_data = aol_data.query("not Query.isnull()")
print("DS size:", aol_data.shape[0])
aol_data.head()

DS size: 3558238


Unnamed: 0,AnonID,Query,QueryTime,ItemRank,ClickURL
0,142,rentdirect.com,2006-03-01 07:17:12,,
1,142,www.prescriptionfortime.com,2006-03-12 12:31:06,,
2,142,staple.com,2006-03-17 21:19:29,,
3,142,staple.com,2006-03-17 21:19:45,,
4,142,www.newyorklawyersite.com,2006-03-18 08:02:58,,


### 1.2 Build Trie

We want suggest function to be non-sensitive to stop words because we don't want to upset the user if he confuses/omits prepositions, for example. Consider "public events in Innopolis" vs "public events at Innopolis" or "public events Innopolis" - they all mean the same.

Build Trie based on the dataset, storing query statistics such as query frequency, urls and ranks in nodes. Some queries may not have associated urls, others may have multiple ranked urls. Think of the way to store this information.

In [3]:
#TODO: build trie based on data
aol_trie = pygtrie.CharTrie()
for index, row in tqdm.tqdm_notebook(aol_data.iterrows()):
    #if type(row['Query']) is not str : continue
    if aol_trie.has_node(row['Query']) == 0 or aol_trie.has_subtrie(row['Query']):
        link = dict() if type(row['ClickURL']) is not str else {row['ClickURL']:row["ItemRank"]}
        aol_trie[row['Query']] = {"word":row['Query'], "count":1, "links":link}

    else:
        aol_trie[row['Query']]["count"] += 1
        if type(row['ClickURL']) is str : aol_trie[row['Query']]["links"].update({row['ClickURL']:row["ItemRank"]})

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  This is separate from the ipykernel package so we can avoid doing imports until


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




In [4]:
# test trie
for key, val in aol_trie.iteritems("sample q"):
    print(key, '~', val)

sample question surveys ~ {'word': 'sample question surveys', 'count': 5, 'links': {'http://www.surveyconnect.com': 7.0, 'http://www.custominsight.com': 4.0, 'http://www.askemployees.com': 10.0, 'http://www.lg-employers.gov.uk': 1.0}}
sample questions for immigration interview ~ {'word': 'sample questions for immigration interview', 'count': 1, 'links': {}}
sample questions for interview ~ {'word': 'sample questions for interview', 'count': 1, 'links': {'http://www.quintcareers.com': 1.0}}
sample questions for family interview ~ {'word': 'sample questions for family interview', 'count': 3, 'links': {'http://www.grandparents-day.com': 2.0, 'http://www.quintcareers.com': 5.0, 'http://jobsearchtech.about.com': 3.0}}
sample questions for us citizenship test ~ {'word': 'sample questions for us citizenship test', 'count': 1, 'links': {'http://uscis.gov': 1.0}}
sample questions sociology race and ethnicity ~ {'word': 'sample questions sociology race and ethnicity', 'count': 1, 'links': {}}
sa

## 2. Write a suggest function which is non-sensitive to stop words ##

Suggest options for user query based on Trie you just built.
Output results sorted by frequency, print query count for each suggestion. If there is an url available, print the url too. If multiple url-s are available, print the one with the highest rank (the less the better).

Q: What is the empirical threshold for minimal prefix for suggest?

In [5]:
def complete_user_query(query, trie, top_k=5):
    if not trie.has_key(query) and not trie.has_subtrie(query): 
        print("Nothing to Suggest")
        return 
    
    if trie.has_subtrie(query):
        sd = list(trie.iteritems(query))
        sd = sorted(sd,key=lambda x : x[1]["count"])[-top_k:]
        res = []
        for _ , i in reversed(sd):
            if len(i.get("links",[])) == 0 :
                print(f"Count {i.get('count')} : {i.get('word')}")
                res.append(i["word"])
            else:
                best_link = sorted(i["links"],key=lambda x : x[1]).pop()
                print(f'Count {i["count"]} : {i["word"]} {best_link}')
    else:
        res = aol_trie[query]
        best_link = sorted(res["links"],key=lambda x : x[1]).pop() if len(res.get("links",[])) > 0 else ""
        print(f"Count {res['count']} : {query} , {best_link}")


inp = "trie"
print("Query:", inp)
print("Results:")
complete_user_query(inp, aol_trie)

Query: trie
Results:
Count 5 : tried and true tattoo http://www.tattoonow.com
Count 3 : triethanalomine http://www.amazon.com
Count 3 : triest
Count 2 : tried and failed
Count 1 : triethanolamine http://www.dermaxime.com


## 3. Measure suggest speed ##

Check how fast your search is working. Consider changing your code if it takes too long on average.

In [6]:
inp_queries = ["inf", "the best ", "information retrieval", "sherlock hol", "carnegie mell", 
               "babies r", "new york", "googol", "inter", "USA sta", "Barbara "]

from time import time
query_times = dict()
for q in inp_queries:
    print("Query : ", q)
    print("Results:")
    stat_t = time()
    complete_user_query(q.lower(), aol_trie)
    end_time = time()
    query_times[q] = end_time - stat_t
    print(f"Time elapsed : {round(query_times[q],5)} sec")
    print("\n")


Query :  inf
Results:
Count 94 : information clearing house http://www.informationclearinghouse.info
Count 72 : information on training puppy http://www.dogbreedinfo.com
Count 59 : inflatable slides
Count 40 : infolanka http://www.infolanka.net
Count 36 : inflatable pool water slide http://www.hullaballoorental.com
Time elapsed : 0.08343 sec


Query :  the best 
Results:
Count 30 : the best chocolate cake http://www.nebraska.tv
Count 15 : the best of word jazz http://www.hip-oselect.com
Count 12 : the best nfl mock drafts http://nfldraft.rivals.com
Count 11 : the best way to lose bulky muscle http://www.youronlinefitness.com
Count 7 : the best face products http://www.nativeremedies.com
Time elapsed : 0.00979 sec


Query :  information retrieval
Results:
Nothing to Suggest
Time elapsed : 7e-05 sec


Query :  sherlock hol
Results:
Count 2 : sherlock holmes chronological order http://www.geocities.com
Count 2 : sherlock holmes society http://www.sherlockian.net
Count 1 : sherlock holmes 

## 4. Bonus task ##

Add spellchecking to your suggest.

In [7]:
# Make vocabulary from nltk words
import nltk, re
from collections import Counter
# nltk.download('brown')
# nltk.download('reuters')

# from nltk.corpus import reuters

# words = [w.lower() for w in reuters.words()]
# WORDS = Counter(list(filter(lambda x: x.isalpha() and len(x) > 1 , words)))

WORDS = Counter(aol_data.Query.values.tolist())

In [48]:
def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxy z'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [54]:
inp_queries = ["infun", "the beeast ", "information retrieval", "innopolis", "russia girls", "holmes"]
query_times = dict()
for q in inp_queries:
    if not aol_trie.has_key(q) and not aol_trie.has_subtrie(q) :
        corrected_q = correction(q.lower())
        print("Correcting Query : {} -> {} ".format(q,corrected_q))
        if q == corrected_q : 
            print("Query cannot be corrected!")
        else : q = corrected_q 
            
    print("Query : ", q)
    print("Results:")
    stat_t = time()
    complete_user_query(q.lower(), aol_trie)
    end_time = time()
    query_times[q] = end_time - stat_t
    print(f"Time elapsed : {round(query_times[q],5)} sec")
    print("\n")

Correcting Query : infun -> fun 
Query :  fun
Results:
Count 94 : funnyjunk.com http://geekissues.org
Count 76 : fungal meningitis and coma http://www.bmb.leeds.ac.uk
Count 51 : funny shit http://www.funnyhumor.com
Count 40 : funny fishing pictures
Count 39 : funny sound bytes http://www.bustercollings.com
Time elapsed : 0.03324 sec


Correcting Query : the beeast  -> the beast 
Query :  the beast
Results:
Count 7 : the beast from x-men
Count 1 : the beast roller coaster ride
Count 1 : the beast cast
Count 1 : the beast out of the sea and the beast out of the earth rev 13 http://www.apocalipsis.org
Count 1 : the beast http://www.thebeastmovie.com
Time elapsed : 0.00093 sec


Correcting Query : information retrieval -> information retrieval 
Query cannot be corrected!
Query :  information retrieval
Results:
Nothing to Suggest
Time elapsed : 7e-05 sec


Correcting Query : innopolis -> annapolis 
Query :  annapolis
Results:
Count 4 : annapolis maryland http://www.capitalonline.com
Count 2