# Sugges_ #

One way to improve the user experience is to offer hints as the user types, or to autocomplete their queries. Let's look at suggest.

Today we will practice generating suggestions using the [Trie](https://en.wikipedia.org/wiki/Trie) data structure (prefix tree); see the example below.

Plan:

1. Build a trie from real search query data released by AOL;
2. Generate suggestions based on the trie;
3. Measure suggestion speed;
4. Add spellchecking to suggest (optional).


![image](https://www.ritambhara.in/wp-content/uploads/2017/05/Screen-Shot-2017-05-01-at-4.01.38-PM.png)
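To make the idea of a prefix tree concrete, here is a minimal sketch in plain Python. The names `TrieNode`, `SimpleTrie`, `insert`, and `items_with_prefix` are hypothetical and chosen only for illustration; this is not the implementation used by any particular library.

```python
class TrieNode:
    """One node of a prefix tree: children keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_key = False  # True if a stored key ends at this node
        self.value = None


class SimpleTrie:
    """Hypothetical minimal trie, for illustration only."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        # Walk down the tree, creating missing nodes along the way.
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True
        node.value = value

    def items_with_prefix(self, prefix):
        """Return all (key, value) pairs whose key starts with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored key has this prefix
        found = []
        stack = [(prefix, node)]
        while stack:
            key, current = stack.pop()
            if current.is_key:
                found.append((key, current.value))
            for ch, child in current.children.items():
                stack.append((key + ch, child))
        return sorted(found)


trie = SimpleTrie()
trie.insert("this is 1", "A")
trie.insert("this is 2", "B")
trie.insert("that is 3", "C")
print(trie.items_with_prefix("this"))
```

The lookup first follows the prefix character by character, then collects every key stored in the subtree below that point, which is exactly the set of possible completions.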

## Install Trie DS support

You are free to use any library implementation of a trie, including the one we suggest:

https://github.com/google/pygtrie

In [None]:
!pip install pygtrie

In [1]:
import pygtrie
t = pygtrie.CharTrie()
t["this is 1"] = "A"
t["this is 2"] = "B"
t["that is 3"] = "C"

print(t)

n = t.has_node('this') == pygtrie.Trie.HAS_VALUE
s = t.has_node('this') == pygtrie.Trie.HAS_SUBTRIE

print(f"Node = {n}; Subtree = {s}")

for key, val in t.iteritems("this"):
    print(key, '~', val)

CharTrie(this is 1: A, this is 2: B, that is 3: C)
Node = False; Subtree = True
this is 1 ~ A
this is 2 ~ B


## 1. Build a trie from a dataset ##

### 1.1 Read dataset

Download the [dataset](https://drive.google.com/drive/folders/1rOE5eed37Jy2ANQItZVwDIFgPmkCoFu6). For simplicity, we provide only the first part of the original data (~3.5 million queries).
Explore the data (see the readme file), then load the dataset.

In [2]:
#TODO: Read the dataset

aol_data = None
print("DS size:", aol_data.shape[0])
print("DS head:")
print(aol_data.head())
print("DS tail:")
print(aol_data.tail())

DS size: 3558411
DS head:
   AnonID                        Query            QueryTime  ItemRank ClickURL
0     142               rentdirect.com  2006-03-01 07:17:12       NaN      NaN
1     142  www.prescriptionfortime.com  2006-03-12 12:31:06       NaN      NaN
2     142                   staple.com  2006-03-17 21:19:29       NaN      NaN
3     142                   staple.com  2006-03-17 21:19:45       NaN      NaN
4     142    www.newyorklawyersite.com  2006-03-18 08:02:58       NaN      NaN
DS tail:
           AnonID                      Query            QueryTime  ItemRank  \
3558406  24968114                          -  2006-05-31 01:04:20       NaN   
3558407  24969251  sp.trafficmarketplace.com  2006-05-31 15:51:23       NaN   
3558408  24969374            orioles tickets  2006-05-31 12:24:51       NaN   
3558409  24969374            orioles tickets  2006-05-31 12:31:57       2.0   
3558410  24969374          baltimore marinas  2006-05-31 12:43:40       NaN   

                

### 1.2 Build Trie

We want the suggest function to be insensitive to stop words, so that the user is not penalized for confusing or omitting prepositions, for example. Consider "public events in Innopolis", "public events at Innopolis", and "public events Innopolis": they all mean the same.
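One possible way to get this behavior is to normalize every query before it is used as a trie key. The sketch below assumes a tiny hand-picked stop-word list and a hypothetical helper named `normalize_query`; a real solution would likely use a fuller list.

```python
# Assumption: a small illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "the", "in", "at", "on", "of", "for", "to", "and"}


def normalize_query(query):
    """Lowercase the query and drop stop words, keeping word order."""
    words = query.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


print(normalize_query("public events in Innopolis"))
print(normalize_query("public events at Innopolis"))
print(normalize_query("public events Innopolis"))
```

All three variants collapse to the same key, so they would share one trie entry. If the original wording should be shown to the user, it has to be kept in the stored value, since the key no longer contains the stop words.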

Build a trie based on the dataset, storing query statistics such as query frequency, URLs, and ranks in the nodes. Some queries have no associated URLs; others have multiple ranked URLs. Think about how to store this information.
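As one option among many, the statistics for each query could be kept in a small dictionary holding a count and the best rank seen for every clicked URL. The sketch below is an assumption about the layout, not a required design; `collect_stats` is a hypothetical helper, and the sample rows and `example.com` URLs are made up.

```python
from collections import defaultdict


def new_stats():
    # Per-query record: how often it was asked, and best rank per URL.
    return {"count": 0, "urls": {}}


def collect_stats(rows):
    """Aggregate (query, rank, url) rows into per-query statistics."""
    stats = defaultdict(new_stats)
    for query, rank, url in rows:
        entry = stats[query]
        entry["count"] += 1
        if url is None or rank is None:
            continue  # this query had no click
        best = entry["urls"].get(url)
        if best is None or rank < best:
            entry["urls"][url] = rank  # lower rank is better
    return dict(stats)


# Hypothetical sample rows: (query, item rank, clicked URL).
rows = [
    ("sample query", None, None),
    ("sample query", 2, "http://example.com/a"),
    ("sample query", 1, "http://example.com/b"),
    ("another query", None, None),
]
stats = collect_stats(rows)
print(stats["sample query"])
```

Each record could then become the value stored under the query's key, for example `aol_trie[query] = entry`.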

In [None]:
#TODO: build trie based on data
aol_trie = pygtrie.CharTrie()


In [None]:
# test trie
for key, val in aol_trie.iteritems("sample q"):
    print(key, '~', val)

## 2. Write a suggest function which is insensitive to stop words ##

Suggest completions for a user query based on the trie you just built.
Output the results sorted by frequency, and print the query count for each suggestion. If a URL is available, print it too. If multiple URLs are available, print the one with the best rank (lower is better).
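The ranking step can be sketched separately from the trie lookup. The function below is a hypothetical helper, `top_suggestions`, that takes `(query, stats)` pairs such as those a prefix iteration over the trie might yield. It assumes each `stats` value is a dictionary with a `"count"` and a `"urls"` mapping from URL to best rank; that layout and the `example` URLs are assumptions for illustration.

```python
import heapq


def top_suggestions(candidates, top_k=5):
    """Pick the top_k most frequent candidates and format one line each."""
    best = heapq.nlargest(top_k, candidates, key=lambda kv: kv[1]["count"])
    lines = []
    for query, stats in best:
        urls = stats["urls"]
        # Choose the URL with the smallest (best) rank, if there is one.
        url = min(urls, key=urls.get) if urls else ""
        lines.append(f"Count {stats['count']} : {query} {url}".rstrip())
    return lines


# Hypothetical candidates for the prefix "trie".
candidates = [
    ("tried and failed", {"count": 2, "urls": {}}),
    ("tried and true tattoo",
     {"count": 5, "urls": {"http://a.example": 3, "http://b.example": 1}}),
    ("triest", {"count": 3, "urls": {}}),
]
for line in top_suggestions(candidates, top_k=2):
    print(line)
```

Using `heapq.nlargest` avoids sorting every candidate when only a few results are needed, which may matter for short prefixes that match many queries. With pygtrie, the prefix iteration is expected to raise `KeyError` when no node matches the prefix, so a full implementation would need to handle that case.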

Q: Empirically, what is the minimum prefix length for which suggest gives useful results?

In [5]:
def complete_user_query(query, trie, top_k=5):
    #TODO: suggest top_k options for a user query
    # sort results by frequency, suggest first ranked urls if available
    pass

        
inp = "trie"
print("Query:", inp)
print("Results:")
complete_user_query(inp, aol_trie)

Query: trie
Results:
Count 5 : tried and true tattoo http://www.triedntruetattoo.com
Count 3 : triest 
Count 3 : triethanalomine http://avalon.unomaha.edu
Count 2 : tried and failed 
Count 2 : when you tried and failed 


## 3. Measure suggest speed ##

Check how fast your suggest function works. Consider changing your code if it takes too long on average.
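A minimal way to measure this, assuming the suggest function can be called with a single query argument, is to time a loop with `time.perf_counter`. Both `average_seconds` and `dummy_suggest` below are hypothetical names; the dummy function only stands in for the real one.

```python
import time


def average_seconds(func, queries):
    """Return the mean wall-clock time of func(query) over all queries."""
    start = time.perf_counter()
    for q in queries:
        func(q)
    elapsed = time.perf_counter() - start
    return elapsed / len(queries)


def dummy_suggest(query):
    # Stand-in for the real suggest function; performs no trie lookup.
    return [query.upper()]


avg = average_seconds(dummy_suggest, ["inf", "the best ", "new york"])
print(f"Average time per query: {avg * 1000:.3f} ms")
```

For a real measurement, printing inside the suggest function should probably be excluded or redirected, since console output can dominate the timing.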

In [6]:
inp_queries = ["inf", "the best ", "information retrieval", "sherlock hol", "carnegie mell", 
               "babies r", "new york", "googol", "inter", "USA sta", "Barbara "]

#TODO: measure avg execution time per query


Query: inf
Results:
Count 94 : information clearing house http://www.informationclearinghouse.info
Count 72 : information on training puppy http://www.101-dog-training-tips.com
Count 59 : inflatable slides 
Count 40 : infolanka http://www.infolanka.com
Count 36 : inflatable pool water slide http://www.bizrate.com

Query: best
Results:
Count 257 : bestcounter.biz 
Count 43 : best buy.com http://www.bestbuy.com
Count 30 : the best chocolate cake http://www.cacaoweb.net
Count 29 : best place to buy outdoor cushions http://www.woodclassics.com
Count 23 : best-replicas.com http://www.best-replicas.com

Query: information retrieval
Results:
Sorry, nothing to suggest!

Query: sherlock hol
Results:
Count 2 : sherlock holmes society http://www.realtime.net
Count 2 : sherlock holmes chronological order http://www.geocities.com
Count 1 : sherlock holmes 
Count 1 : sherlock holmes address 
Count 1 : sherlock holmes audiotapes 

Query: carnegie mell
Results:
Count 1 : carnegie mellon 
Count 1 : ca

## 4. Bonus task ##

Add spellchecking to your suggest.
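
As a starting point, one simple approach is to replace each unknown word with the closest word from a known vocabulary. The sketch below uses the standard library's `difflib.get_close_matches`, which ranks by a sequence similarity ratio rather than a true edit distance; `correct_words`, the `cutoff` value, and the tiny vocabulary are assumptions for illustration.

```python
import difflib


def correct_words(query, vocabulary, cutoff=0.8):
    """Replace each word that is not in vocabulary with its closest match."""
    fixed = []
    for word in query.lower().split():
        if word in vocabulary:
            fixed.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the word unchanged when nothing is similar enough.
        fixed.append(matches[0] if matches else word)
    return " ".join(fixed)


# Hypothetical vocabulary; in practice it could be built from the dataset.
vocabulary = ["sherlock", "holmes", "carnegie", "mellon"]
print(correct_words("sherlok holms", vocabulary))
```

Note that the last word of a prefix is often incomplete (for example, "hol"), so it may be better to correct only the words before it and leave the final fragment to the trie lookup.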