# Information Retrieval II

* Boolean filtering
* Relevance ranking with tf-idf
* Ranked search in Elasticsearch

## Boolean Filtering

The simplest kind of query involves looking for texts that contain particular words. 

In [1]:
from nltk.corpus import brown

brown_files = {fileid:brown.words(fileid) for fileid in brown.fileids()}
brown_files = {fileid:[w.lower() for w in words] for fileid, words in brown_files.items()}
                       
def get_texts_with_words(word1,word2):
    '''returns a list of brown fileids that contain the provided words'''
    texts = set()
    for fileid, words in brown_files.items():
        has_word1 = word1 in words 
        has_word2 = word2 in words 
        if has_word1 and has_word2:
            texts.add(fileid)
    return texts

For example, we can look for texts in the Brown corpus that contain the words "black" and "blue". However, even for a corpus of only 500 texts, scanning every text for each query is slow, since each `in` test walks through a whole list of words.

In [2]:
%%time

for i in range(1000):
    get_texts_with_words("black","blue")
print(get_texts_with_words("black","blue"))

{'ck06', 'ck13', 'cl10', 'cf36', 'ca25', 'cg50', 'cn28', 'cg27', 'cp26', 'cn19', 'cg40', 'ca18', 'ck15', 'cg41', 'ce23', 'ck10', 'ca33', 'cn20', 'cp05', 'cl21', 'cp01', 'cn15', 'cp28', 'cp23', 'cp21', 'ce25', 'cp15', 'cp04', 'cl19', 'cb13'}
CPU times: user 35.3 s, sys: 605 ms, total: 35.9 s
Wall time: 36.7 s


Scanning the whole corpus in response to every query is not practical. That's why we need an *inverted index*. We can implement this as a hash map (i.e. a Python dict) from each word to the set of ids of the documents that contain it. Let's create this for the Brown corpus:

In [3]:
from collections import defaultdict

def create_inverted_index(nltk_corpus):
    '''map each lowercased word to the set of fileids that contain it'''
    inverted_index = defaultdict(set)
    for fileid in nltk_corpus.fileids():
        for word in nltk_corpus.words(fileid):
            inverted_index[word.lower()].add(fileid)
    return inverted_index

brown_inverted_index = create_inverted_index(brown)

Finding the documents which contain one specific word is now very fast.

In [4]:
%%time

for i in range(1000):
    brown_inverted_index["brown"]
    
print(brown_inverted_index["brown"])

{'cj61', 'cb04', 'cf35', 'ck13', 'ch25', 'cf34', 'ce18', 'cj58', 'cn10', 'cp02', 'ch29', 'cp12', 'ce15', 'cl18', 'cg03', 'ca24', 'cn17', 'cp16', 'ce14', 'cc08', 'cp26', 'cb24', 'ck25', 'ck29', 'cj14', 'cf30', 'cg55', 'cr07', 'cn06', 'ca17', 'ca18', 'cn07', 'cf26', 'cn22', 'cg47', 'cc14', 'ch26', 'cj15', 'cl13', 'ca21', 'cf22', 'ck18', 'cn20', 'cn27', 'cn23', 'cp05', 'ca29', 'cp14', 'cn15', 'cg14', 'cc04', 'cg12', 'cn16', 'cf32', 'ce13', 'ck01', 'ch06', 'ck16', 'cn26', 'ca11', 'ce11', 'cp04', 'cl17', 'cl14', 'cb11', 'cb14', 'cp10', 'cb02', 'cg51'}
CPU times: user 195 µs, sys: 57 µs, total: 252 µs
Wall time: 238 µs


We can now use set operations to implement Boolean filtering:

In [5]:
%%time

for i in range(1000):
    set1 = brown_inverted_index["black"]
    set2 = brown_inverted_index["blue"]
    set1 & set2
    
print(set1 & set2)

{'ck06', 'ck13', 'cl10', 'cf36', 'ca25', 'cg50', 'cn28', 'cp26', 'cg27', 'cn19', 'cg40', 'ca18', 'ck15', 'cg41', 'ce23', 'ck10', 'ca33', 'cn20', 'cp05', 'cl21', 'cp01', 'cn15', 'cp28', 'cp23', 'cp21', 'ce25', 'cp15', 'cp04', 'cl19', 'cb13'}
CPU times: user 2.55 ms, sys: 142 µs, total: 2.69 ms
Wall time: 2.66 ms


Sets also provide simple ways to implement other kinds of Boolean logic, like *or*:

In [6]:
%%time

for i in range(1000):
    set1 = brown_inverted_index["black"]
    set2 = brown_inverted_index["blue"]
    set1 | set2
    
print(set1 | set2)

{'cf18', 'cf35', 'ck13', 'cf02', 'cl20', 'ce32', 'ck11', 'cf28', 'cr02', 'ca24', 'ca25', 'cg50', 'cg40', 'cr05', 'cn05', 'ca17', 'cg75', 'cf39', 'cc14', 'cj48', 'cf38', 'cn13', 'cn29', 'cf22', 'cn23', 'ch27', 'cn12', 'ca30', 'cp20', 'ck16', 'cg04', 'cl19', 'cr09', 'ck14', 'cg51', 'cb26', 'cb05', 'cb09', 'cf34', 'cp25', 'cp29', 'cf36', 'cm01', 'cp16', 'ca32', 'cn28', 'cf06', 'ca39', 'ca18', 'cp22', 'cj62', 'cg41', 'cn22', 'cb27', 'ce23', 'cj70', 'cj53', 'ca33', 'cj66', 'ca15', 'cg17', 'cn27', 'ck04', 'cl21', 'ca22', 'cn15', 'cp21', 'cg09', 'ce25', 'cf29', 'cn26', 'cf10', 'ce11', 'cb06', 'cp15', 'ck22', 'cf44', 'cp10', 'cj09', 'cn24', 'ca16', 'ca05', 'cc05', 'cl10', 'ce34', 'cn01', 'cg18', 'cp26', 'cc08', 'cn19', 'ck29', 'cn04', 'ck15', 'cl09', 'ck24', 'cf01', 'ck10', 'ck26', 'cm04', 'cg05', 'ck12', 'cn20', 'cp01', 'cl22', 'cg12', 'ce13', 'cf42', 'ce05', 'cp04', 'ck28', 'cj10', 'ca01', 'cm02', 'cp06', 'ce12', 'ca02', 'ck06', 'ce19', 'ck19', 'ck23', 'cb17', 'cd03', 'cp12', 'cg69', 'cl07',

Boolean *not* can be implemented as the set difference between the set of all documents and the set being negated. E.g.

In [7]:
%%time

for i in range(1000):
    all_documents = set(brown.fileids())
    has_black = brown_inverted_index["black"]

    not_black = all_documents - has_black
    
print(not_black)

{'cf25', 'cj51', 'cf47', 'cj11', 'cf13', 'cg74', 'cb15', 'cb18', 'ch11', 'cg66', 'cg32', 'cg67', 'cn10', 'ce18', 'cg33', 'cl20', 'ck11', 'cn08', 'cb01', 'cg19', 'cd08', 'cj77', 'cf03', 'cl23', 'cn17', 'ch17', 'cg21', 'cj14', 'cf30', 'cg07', 'ch19', 'cf12', 'cc01', 'ch09', 'cj38', 'ce36', 'cl02', 'ca13', 'cj64', 'cf22', 'cb25', 'cp18', 'ch27', 'cj80', 'cb19', 'ck01', 'cc06', 'ch22', 'ca11', 'cb12', 'cj21', 'cp27', 'cj19', 'ch21', 'cc07', 'cg73', 'cg63', 'ce10', 'cf08', 'cb04', 'cj07', 'cb05', 'cj52', 'cb09', 'cg53', 'ca37', 'cg38', 'cd01', 'ch23', 'cp02', 'cg08', 'cf04', 'cj73', 'cp16', 'cn25', 'ce16', 'cj35', 'ca39', 'cd16', 'ca06', 'cp22', 'ce21', 'cj72', 'cn22', 'ce22', 'ch26', 'cg23', 'cj70', 'ca07', 'ck18', 'ca15', 'ce01', 'cg45', 'cf24', 'cp19', 'ce31', 'cl03', 'cg06', 'cf29', 'cn26', 'cf10', 'cd02', 'cg71', 'ck22', 'cf09', 'ch20', 'ce20', 'cp10', 'cj44', 'cj09', 'ck27', 'cg54', 'ch25', 'cj08', 'cf16', 'cd11', 'cg62', 'ck21', 'cj74', 'ca31', 'cm03', 'ce35', 'cn01', 'cj79', 'cg16',
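
The three operations can be combined freely. Here is a minimal sketch on a hypothetical toy corpus (the documents `d1` to `d4` and the helper names `build_index` and `postings` are made up for illustration, and are not part of NLTK), small enough that the results can be checked by eye:

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for the Brown files.
toy_docs = {
    "d1": "the black cat sat on the blue mat",
    "d2": "a blue sky over the sea",
    "d3": "black coffee and nothing else",
    "d4": "green fields",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

toy_index = build_index(toy_docs)
all_ids = set(toy_docs)

def postings(word):
    # .get avoids creating an empty entry for unseen words in the defaultdict
    return toy_index.get(word, set())

black_and_blue = postings("black") & postings("blue")   # AND: intersection
black_or_blue = postings("black") | postings("blue")    # OR: union
black_not_blue = postings("black") - postings("blue")   # AND NOT: difference
not_black = all_ids - postings("black")                 # NOT: complement

print(sorted(black_and_blue))  # ['d1']
print(sorted(black_or_blue))   # ['d1', 'd2', 'd3']
print(sorted(black_not_blue))  # ['d3']
print(sorted(not_black))       # ['d2', 'd4']
```

Note that a plain *not* needs the set of all document ids, whereas *and not* only needs the two posting sets.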

## Relevance ranking with TF-IDF

Relevance ranking means ordering documents based on their relevance to a query. We can apply what we've already learned to build a vector space model for relevance ranking:

* Building a document/term matrix for the corpus
* Carrying out term weighting
* Transforming the query into the same vector space
* Identifying most relevant documents using cosine similarity

In [8]:
from sklearn.feature_extraction import DictVectorizer
from collections import Counter
from nltk.corpus import movie_reviews, stopwords

en_stopwords = set(stopwords.words("english"))

raw_feature_dicts = []
for document in movie_reviews.fileids():
    # count lowercased alphabetic tokens that are not stopwords
    words = [word.lower() for word in movie_reviews.words(document)
             if word.isalpha() and word.lower() not in en_stopwords]
    raw_feature_dicts.append(Counter(words))

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(raw_feature_dicts)
print(X.shape)

(2000, 38738)


Let's convert counts in our document vectors into TF-IDF weights.

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X).toarray()
print(X_tfidf.shape)


(2000, 38738)
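
To make the weighting concrete, here is a rough pure-Python sketch of what this step computes. It assumes the `TfidfTransformer` default settings (smoothed idf, then L2 normalization of each row); the 3x3 count matrix is a hypothetical toy example and not taken from the movie review data.

```python
import math

# Hypothetical toy count matrix: 3 documents (rows) x 3 terms (columns).
counts = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term occurs.
df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

# Smoothed idf (assumption: smooth_idf=True, the scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + df[t])) + 1 for t in range(n_terms)]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [count * w for count, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

weights = [tfidf_row(row) for row in counts]
for row in weights:
    print([round(v, 3) for v in row])
```

The middle term occurs in every document, so it gets the lowest idf and contributes least to each row; rarer terms dominate.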


We can use our vectorizer to create a query vector. Note that the vectorizer expects a list of dictionaries, so we wrap the single query dictionary in a list. Query words that the vectorizer did not see during fitting are ignored.

In [12]:
def convert_query(text_query):
    '''turn a text query string into a vector representation'''
    words = Counter([word.lower() for word in text_query.split()])
    return vectorizer.transform([words]).toarray()
    
query_vector = convert_query("chamber of secrets")
print(query_vector.shape)

(1, 38738)


We can then use the `cdist` function to compare the vector of each document with the query and pick out the best one. Note that "cosine" refers to cosine **distance** here, so the best match is in fact the one with **the lowest value.**

In [13]:
from scipy.spatial.distance import cdist
import numpy as np

distances = cdist(query_vector,X_tfidf,"cosine")
doc_id = np.argmin(distances)
print(doc_id)
print(" ".join(movie_reviews.words(movie_reviews.fileids()[doc_id])))

1863
not since attending an ingmar bergman retrospective a few years ago have i seen a film as uncompromising in its portrayal of emotional truth as secrets & lies . like bergman , director mike leigh is interested in probing his characters ' inner depths through hypernaturally blunt confrontations . also like bergman , leigh engages in frequent closeups of his characters ' ravished and wracked faces . and the prominent mournfulness of a cello on the soundtrack recalls bergman ' s own use of a bach cello suite in an earlier film . all that is missing is a discussion of god . which is not to say that secrets & lies is nothing more than an homage to the swedish master . in fact , it is quite possible leigh had no such intentions in mind . nonetheless , what we get is so far removed from the average moviegoing experience -- even from the reason we go to the movies in the first place -- that it takes some effort to adjust to the film ' s rhythms . once the adjustment is made , however , th
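
As a small illustration of why we take the minimum, here is a sketch of cosine distance written out by hand. The vectors and the names `cosine_distance`, `toy_docs_vectors` and `toy_query` are hypothetical; cosine distance is taken to be one minus cosine similarity, which is what the "cosine" metric of `cdist` computes.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical document vectors over a 3-term vocabulary.
toy_docs_vectors = [
    [0.0, 1.0, 0.0],  # shares no term with the query
    [0.5, 0.0, 0.5],  # points in the same direction as the query
    [0.9, 0.1, 0.0],  # shares one term with the query
]
toy_query = [1.0, 0.0, 1.0]

toy_distances = [cosine_distance(toy_query, d) for d in toy_docs_vectors]

# Smallest distance first: this is the relevance ranking.
ranking = sorted(range(len(toy_distances)), key=toy_distances.__getitem__)
best = ranking[0]

print(ranking)  # [1, 2, 0]
```

Sorting all the distances instead of taking only `argmin` gives the full ranked list rather than the single best document.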

## Ranked retrieval with Elasticsearch


In [14]:
from elasticsearch_dsl import Document, Text, Keyword, analyzer, tokenizer, Index
from elasticsearch_dsl.connections import connections
from nltk.corpus import brown

connections.create_connection(hosts=['localhost'])

brown_analyzer = analyzer('brown', tokenizer="whitespace", filter=["lowercase","stop"])

class BrownDocument(Document):
    text = Text(analyzer=brown_analyzer)
    genre = Keyword()
    
brown_index = Index("brown")

### We don't need to run this if Brown index already exists:
"""
brown_index.document(BrownDocument)
brown_index.create()

for fileid in brown.fileids():
    text = " ".join(brown.words(fileid))
    genre = brown.categories(fileid)[0]
    doc = BrownDocument(text=text, genre=genre)
    doc.meta.id = fileid
    doc.save()
"""


The default relevance ranking algorithm used in Elasticsearch, which is known as [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), uses the vector space intuition and involves *tf-idf*, but it does not make use of dimensionality reduction.

It works as follows: When the documents are indexed, statistics which allow for a quick *tf-idf* calculation for every term for every document in the corpus are collected. Then, given a query, each document is scored by summing the *tf-idf* scores of the query terms that appear in it. That is, if we have a document $D$ and a query $Q$ with terms $q_1\ldots q_n$, the relevance score is:

$$\text{score}(D,Q) = \sum_{i=1}^{n} \hat{\text{tf}}(q_i, D) \cdot \text{idf}(q_i)$$ 

In the typical context of relatively short queries, and in combination with techniques like stopword removal and the use of inverted indices for Boolean filtering (with $O(1)$ lookup using our old friend the hash map!), this calculation is extremely efficient!
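
The sum above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus is hypothetical, the saturating $\hat{\text{tf}}$ and the idf use the common BM25 forms, and the parameters `K1 = 1.2` and `B = 0.75` are assumed values, not something read from our index.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and lowercased.
toy_corpus = {
    "d1": "black and blue bruises black eye".split(),
    "d2": "blue sky blue sea".split(),
    "d3": "green fields and grey skies".split(),
}
K1, B = 1.2, 0.75  # assumed BM25 parameters

# Statistics collected once, at "indexing" time.
term_counts = {doc_id: Counter(words) for doc_id, words in toy_corpus.items()}
n_docs = len(toy_corpus)
avg_len = sum(len(words) for words in toy_corpus.values()) / n_docs

def idf(term):
    """Rarer terms get a higher weight."""
    n = sum(1 for counts in term_counts.values() if term in counts)
    return math.log(1 + (n_docs - n + 0.5) / (n + 0.5))

def tf_hat(term, doc_id):
    """Term frequency that saturates and is normalized by document length."""
    freq = term_counts[doc_id][term]
    doc_len = len(toy_corpus[doc_id])
    return freq / (freq + K1 * (1 - B + B * doc_len / avg_len)) if freq else 0.0

def score(doc_id, query_terms):
    """Sum of tf-idf contributions over the query terms."""
    return sum(tf_hat(q, doc_id) * idf(q) for q in query_terms)

query = ["black", "blue"]
scores = {doc_id: score(doc_id, query) for doc_id in toy_corpus}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Only the query terms are ever looked up, which is why short queries are cheap to score.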

We can actually look under the hood of Elasticsearch and see how specific relevance calculations are done. Let's create a new search query.

In [15]:
from elasticsearch_dsl.query import Match

s = brown_index.search()
s = s.query(Match(text="black and blue"))

Note that instead of an individual word, this time the query is a multi-word string. Under the hood, Elasticsearch uses its analyzer to generate tokens from our query and identifies texts which contain at least one of the resulting terms. The order of terms doesn't matter, so, given that we have stopword removal, the above is equivalent to `s.query(Match(text="black") | Match(text="blue"))`. For relevance ranking, every term we are trying to match is included as a query term in the Okapi BM25 calculation. Let's execute the search and look at the scores of the top 10:

In [16]:
response = s.execute()
for hit in response:
    print(hit)
    print(hit.meta.score)



<Hit(brown/cp26): {'text': "I was thinking of the heat and of water that morni...}>
4.560643
<Hit(brown/cg50): {'text': "As he had done on his first Imperial sortie a year...}>
4.493673
<Hit(brown/ca33): {'text': "At last the White House is going to get some much-...}>
4.4184194
<Hit(brown/cb13): {'text': "Sizzling temperatures and hot summer pavements are...}>
4.4184194
<Hit(brown/ck10): {'text': "That summer the gambling houses were closed , desp...}>
4.4184194
<Hit(brown/cl21): {'text': "But the police have dropped the case . I want you ...}>
4.4184194
<Hit(brown/ck13): {'text': "In the dim underwater light they dressed and strai...}>
4.2184463
<Hit(brown/cl10): {'text': "`` Not since last night . I didn't think there was...}>
4.176656
<Hit(brown/ce23): {'text': "Roy Mason is essentially a landscape painter whose...}>
4.1663623
<Hit(brown/cf36): {'text': "It was John who found the lion tracks . He found t...}>
4.0997477


To see exactly how those numbers are being calculated, we have to use the low-level API. First, let's get the underlying query using `to_dict`:

In [17]:
s.to_dict()

{'query': {'match': {'text': 'black and blue'}}}

We can pass that query to a special `explain` function available in the `elasticsearch` API, with the `id` and `body` keywords.

In [18]:
from elasticsearch import Elasticsearch
client = Elasticsearch()

response = client.explain(
    index="brown",
    id="cp26",
    body={'query': {'match': {'text': 'black and blue'}}}
)
response

{'_index': 'brown',
 '_type': '_doc',
 '_id': 'cp26',
 'matched': True,
 'explanation': {'value': 4.560643,
  'description': 'sum of:',
  'details': [{'value': 2.8859284,
    'description': 'weight(text:black in 487) [PerFieldSimilarity], result of:',
    'details': [{'value': 2.8859284,
      'description': 'score(freq=7.0), computed as boost * idf * tf from:',
      'details': [{'value': 2.2, 'description': 'boost', 'details': []},
       {'value': 1.529856,
        'description': 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
        'details': [{'value': 108,
          'description': 'n, number of documents containing term',
          'details': []},
         {'value': 500,
          'description': 'N, total number of documents with field',
          'details': []}]},
       {'value': 0.8574569,
        'description': 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:',
        'details': [{'value': 7.0,
          'description': 'freq, occurrences of
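
As a sanity check, we can recompute the contribution of "black" from the numbers in the explain output above. This is a small sketch: the values are copied from that output, and the `tf` component is used as reported rather than recomputed, because the document length statistics are cut off in the output shown.

```python
import math

# Values copied from the explain output for the term "black" in cp26.
boost = 2.2
N = 500          # total number of documents with the field
n = 108          # number of documents containing the term
tf = 0.8574569   # tf component as reported by Elasticsearch

# idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))

# score, computed as boost * idf * tf
term_score = boost * idf * tf

print(round(idf, 4))         # 1.5299
print(round(term_score, 4))  # 2.8859
```

Both values agree with the reported `1.529856` and `2.8859284` up to rounding.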

A summary of important observations from our low-level investigation of Okapi BM25 scoring in Elasticsearch:

* We can see the final score is *basically* a sum of tf-idf scores
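* However, each term score is multiplied by a boost of 2.2. This does not seem to mark a more important field: it appears to be the default boost of 1 multiplied by $(k_1 + 1)$, assuming the usual BM25 default $k_1 = 1.2$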