In [None]:
#### IR Part 3 - Wild card queries

# Parts 1 and 2 of this series demonstrated how to build an inverted index for document retrieval, using dicts and a BST.
# The inverted index so far did not support wild card queries. In this part I will implement very simple wild card query support.


In [6]:
class InvertedIndex:
    def __init__(self):
        self.index = {}

    def add(self, doc_id, doc):
        for word in doc.split():
            if word not in self.index:
                self.index[word] = set()
            self.index[word].add(doc_id)

    def search(self, query):
        if '*' in query:
            return self.wild_card_search(query)
        return self.index.get(query, set())

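    def wild_card_search(self, query):
        # Very simple approach: drop the '*' and treat the rest as a substring,
        # so 'th*' matches any word that contains 'th', not only words starting with it.
        query = query.replace('*', '')
        result = set()
        for word in self.index:
            if query in word:
                result |= self.index[word]
        return result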
    
# Test
index = InvertedIndex()
index.add(1, 'hello world')
index.add(2, 'hello python')
index.add(3, 'world python')

print(index.search('hello'))
print(index.search('th*'))

{1, 2}
{2, 3}
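
The wild_card_search above drops the '\*' and does a substring test, which is why "th\*" matches 'python' in documents 2 and 3. If '\*' should instead stand for "any run of characters" with the rest of the pattern anchored to the whole word, one possible sketch uses fnmatch.fnmatchcase from the standard library. The function name anchored_wild_card_search and the postings dict are hypothetical and not part of the class above. Note that fnmatchcase also gives '?' and '[...]' a special meaning.

```python
from fnmatch import fnmatchcase

def anchored_wild_card_search(index, pattern):
    """Return doc ids of every indexed word that matches the whole pattern.

    `index` is a dict mapping word -> set of doc ids, like InvertedIndex.index.
    """
    result = set()
    for word, doc_ids in index.items():
        # fnmatchcase anchors the pattern at both ends of the word
        if fnmatchcase(word, pattern):
            result |= doc_ids
    return result

# Same three documents as in the test above, written out as a plain dict
postings = {'hello': {1, 2}, 'world': {1, 3}, 'python': {2, 3}}

print(anchored_wild_card_search(postings, 'th*'))    # set(): no word starts with 'th'
print(anchored_wild_card_search(postings, '*th*'))   # {2, 3}: 'python' contains 'th'
print(anchored_wild_card_search(postings, 'h*o'))    # {1, 2}: 'hello'
```

This still scans every word in the index, so it only changes what counts as a match, not the cost of the lookup.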


The previous example indexes each whitespace-separated word as its own term.
This simple approach is memory efficient and works well for exact matches, but it cannot capture the context of the words.
We can use n-grams to capture that context, at the cost of added complexity and more memory usage.

#### General wild card queries

We will look at two techniques to support general wild card queries

* permuterm indexes
* k-gram indexes

In [34]:
# Permute the word and store the result in the index.
# Here the permutations are rotations of the word, not character swaps.
# e.g. hello --> hello$    Append '$' at the end of the word
# rotate 'hello$' --> 'hello$' 'ello$h' 'llo$he' 'lo$hel' 'o$hell' '$hello'
# Store each rotation as a key and the '$'-terminated word as its value in the index
vocab = ['hello', 'world', 'python']

class PermutermIndex:
    def __init__(self):
        self.index = {}

    def insert(self, word):
        word = word + '$'
        for i in range(len(word)):
            self.index[word[i:] + word[:i]] = word
    
    def get_index(self):
        return self.index
    
    def search(self, query):
        query = query + '$'
        return self.index.get(query, None)
    
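    def wild_card_search(self, query):
        # Handles a single '*' (X*Y): rotate the query so the '*' comes first,
        # drop it, and look for rotations that start with Y$X.
        query = query + '$'
        star_idx = query.index('*')
        query = query[star_idx:] + query[:star_idx]
        query = query.replace('*', '')

        result = set()
        for word in self.index:
            if word.startswith(query):
                result.add(self.index[word])
        return result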


idx = PermutermIndex()
for word in vocab:
    idx.insert(word)


print(idx.get_index())
print(idx.wild_card_search('h*o'))

{'hello$': 'hello$', 'ello$h': 'hello$', 'llo$he': 'hello$', 'lo$hel': 'hello$', 'o$hell': 'hello$', '$hello': 'hello$', 'world$': 'world$', 'orld$w': 'world$', 'rld$wo': 'world$', 'ld$wor': 'world$', 'd$worl': 'world$', '$world': 'world$', 'python$': 'python$', 'ython$p': 'python$', 'thon$py': 'python$', 'hon$pyt': 'python$', 'on$pyth': 'python$', 'n$pytho': 'python$', '$python': 'python$'}
{'hello$'}


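Permuterm works well for simple wild card queries with a single '\*', where the starting and ending characters are provided, e.g. "h\*o". Try "he\*l\*": it returns an empty set even though 'hello' matches, because the code rotates the query around the first '\*' only.
The dictionary also grows with every rotation: a word of length n adds n + 1 keys.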
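
One way a permuterm index could cope with queries such as "he\*l\*" is to rotate around the first and last literal pieces only, do a prefix lookup on sorted rotations, and then filter the candidates against the full pattern. The sketch below is a minimal illustration under those assumptions, not a replacement for the class above: build_rotations and permuterm_search are hypothetical names, a sorted list with bisect stands in for a real tree structure, and the returned words have no trailing '$'.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

def build_rotations(vocab):
    """Return a sorted list of (rotation, word) pairs for all words."""
    pairs = []
    for word in vocab:
        marked = word + '$'
        for i in range(len(marked)):
            pairs.append((marked[i:] + marked[:i], word))
    pairs.sort()
    return pairs

def permuterm_search(pairs, query):
    """Prefix lookup on the rotations, then a filter on the full pattern."""
    pieces = query.split('*')
    if len(pieces) == 1:
        prefix = query + '$'                    # no wildcard in the query
    else:
        prefix = pieces[-1] + '$' + pieces[0]   # X*...*Y  ->  Y$X
    # (prefix,) sorts before every (rotation, word) pair whose rotation
    # starts with prefix, so this is the first possible match
    start = bisect_left(pairs, (prefix,))
    matches = set()
    for rotation, word in pairs[start:]:
        if not rotation.startswith(prefix):
            break                               # left the block of matching rotations
        # Post-filter: checks any middle pieces such as the 'l' in 'he*l*'
        if fnmatchcase(word, query):
            matches.add(word)
    return matches

pairs = build_rotations(['hello', 'world', 'python'])
print(permuterm_search(pairs, 'h*o'))     # {'hello'}
print(permuterm_search(pairs, 'he*l*'))   # {'hello'}
```

Because the rotations are sorted, all keys sharing a prefix sit next to each other, so the lookup does not need to scan the whole dictionary.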


#### K-gram indexes

A k-gram index is built by breaking each word into sequences of k characters and using those sequences as index keys.
e.g. 'hello' --> '\$hel', 'hell', 'ello', 'llo\$' gives 4 terms to index (k = 4), where \$ marks the start and end of the term.
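
The split described above can be sketched as a small helper. The function name kgrams is hypothetical and is used here only for illustration.

```python
def kgrams(word, k):
    """Return the k-grams of word after padding it with '$' on both sides."""
    padded = '$' + word + '$'
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams('hello', 4))   # ['$hel', 'hell', 'ello', 'llo$']
print(kgrams('hello', 3))   # ['$he', 'hel', 'ell', 'llo', 'lo$']
```

A padded word of length n + 2 yields n + 3 - k k-grams, so a smaller k gives more, shorter keys per word.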


In [82]:
from collections import defaultdict

vocab = ['hello', 'hello123', 'hello1234', 'hello142' ]

class KGramIndex:
    def __init__(self, k):
        self.k = k
        self.index = defaultdict(set)   
    
    def insert(self, word):
        word = '$' + word + '$'
        for i in range(len(word) - self.k + 1):
            self.index[word[i:i+self.k]].add(word)

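    def _preprocess(self, query):
        # Pad the query like an indexed word, split it on '*', and collect
        # every k-gram from the pieces that are at least k characters long.
        final_query = set()
        query = '$' + query + '$'
        for t in query.split('*'):
            # range() is empty when len(t) < self.k, so short pieces are skipped
            for i in range(len(t) - self.k + 1):
                final_query.add(t[i:i+self.k])
        return final_query

    def search(self, query):
        qterms = self._preprocess(query)
        print(qterms)
        if len(qterms) == 0:
            print('No query terms')
            return set()
        # Union of the postings: any word sharing at least one k-gram with
        # the query is a candidate, so the result can contain false positives.
        result = set()
        for q in qterms:
            # .get() avoids adding empty entries to the defaultdict
            result |= self.index.get(q, set())
        return result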
        
    
    def get_index(self):
        return self.index

kgram = KGramIndex(3)
for word in vocab:
    kgram.insert(word)

print(kgram.get_index())
print(kgram.search('ell*o14*'))



defaultdict(<class 'set'>, {'$he': {'$hello1234$', '$hello$', '$hello123$', '$hello142$'}, 'hel': {'$hello1234$', '$hello$', '$hello123$', '$hello142$'}, 'ell': {'$hello1234$', '$hello$', '$hello123$', '$hello142$'}, 'llo': {'$hello1234$', '$hello$', '$hello123$', '$hello142$'}, 'lo$': {'$hello$'}, 'lo1': {'$hello1234$', '$hello123$', '$hello142$'}, 'o12': {'$hello1234$', '$hello123$'}, '123': {'$hello1234$', '$hello123$'}, '23$': {'$hello123$'}, '234': {'$hello1234$'}, '34$': {'$hello1234$'}, 'o14': {'$hello142$'}, '142': {'$hello142$'}, '42$': {'$hello142$'}})
{'$el', 'ell', 'o14'}
{'$hello1234$', '$hello$', '$hello123$', '$hello142$'}


The k-gram approach is better at handling complex wildcard queries, but it is still an expensive operation.
e.g. the search 'ell\*o14\*' breaks the expression into three query terms {'\$el', 'ell', 'o14'}. We have to look up all three terms to find potential matches. Once the candidates are collected, one more step is required to get the precise matches; it could be implemented as a string find operation using the user-provided query.

In the example above, the query returns hello1234, hello, hello123 and hello142. Reading the query as unanchored (the literal pieces may appear anywhere in the word, in order), it should return only hello142. This can be achieved with the filtering step mentioned above.
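
A minimal sketch of that filtering step, using str.find on the literal pieces of the query. It assumes the query is unanchored, so 'ell\*o14\*' is read the same way as '\*ell\*o14\*'; under strict wildcard semantics 'ell\*o14\*' would require the word to start with 'ell'. The names matches_in_order and post_filter are hypothetical.

```python
def matches_in_order(word, query):
    """True if the literal pieces of query occur in word, left to right.

    Assumption: the query is unanchored, so 'ell*o14*' is read the same
    way as '*ell*o14*'.
    """
    pos = 0
    for piece in query.split('*'):
        if not piece:
            continue                  # empty piece from a leading/trailing '*'
        found = word.find(piece, pos)
        if found == -1:
            return False
        pos = found + len(piece)      # the next piece must start after this one
    return True

def post_filter(candidates, query):
    # Candidates carry the '$' markers added by KGramIndex.insert
    return {c for c in candidates if matches_in_order(c.strip('$'), query)}

candidates = {'$hello1234$', '$hello$', '$hello123$', '$hello142$'}
print(post_filter(candidates, 'ell*o14*'))   # {'$hello142$'}
```

The filter only looks at the candidates returned by the k-gram lookup, so its cost depends on the size of that set rather than on the whole vocabulary.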
