# Boolean IR


## Introduction

* We will be learning how to use the open-source [Elasticsearch](https://www.elastic.co/) search engine, which is in turn based on [Lucene](https://lucene.apache.org/).
* The underlying methods (often unsupervised) are still worth knowing, and can be implemented efficiently in Python for small-to-medium sized applications.

## Setting up

Elasticsearch is a free, Java-based IR system. You'll need to install a recent version of [Java](https://java.com/en/download/manual.jsp), and get the latest version (v7+) of [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) for your operating system. After you have unzipped or installed Elasticsearch (I recommend downloading the archive rather than using an installer), run `elasticsearch` in the `bin` directory to start the Elasticsearch server.

In Python, you will need the `elasticsearch_dsl` package.

`!pip install elasticsearch_dsl`

## Setting up an Elasticsearch index

Elasticsearch is written in Java, but it runs as a REST API that we can access using Python.

The code below assumes that the Elasticsearch API is running at the default port.

There are actually two Python packages for Elasticsearch: one that provides high-level Pythonic access ([elasticsearch_dsl](https://elasticsearch-dsl.readthedocs.io/en/latest/index.html)), and one that provides lower-level, more flexible access ([elasticsearch](https://elasticsearch-py.readthedocs.io/en/7.10.0/index.html)).

We'll mostly be using `elasticsearch_dsl` here. We'll start by setting up the connection to the Java API, and testing to see if it works.

In [1]:
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=['localhost'])

connections.get_connection().info()



{'name': 'seventypercent.local',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'QvrhnTD5SEWUK7gx3wVCBA',
 'version': {'number': '7.16.3',
  'build_flavor': 'default',
  'build_type': 'tar',
  'build_hash': '4e6e4eab2297e949ec994e688dad46290d018022',
  'build_date': '2022-01-06T23:43:02.825887787Z',
  'build_snapshot': False,
  'lucene_version': '8.10.1',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

For us, the important unit of Elasticsearch is the *index*, which contains a document collection as well as the information needed to search it efficiently. The first step in setting up an Elasticsearch index is to define what a document in our collection will consist of. As usual, we are going to use the Brown corpus as an example. We define `BrownDocument` as a subclass of `Document` and declare that it contains the `text` as well as the `genre`.

In [2]:
from elasticsearch_dsl import Document, Text, Keyword, analyzer, tokenizer

class BrownDocument(Document):
    text = Text()       # Text fields are run through an analyzer (tokenization plus filtering)
    genre = Keyword()   # Keyword fields are stored as a single term (treated like an atomic unit)

Note that we are treating the main text a bit differently from the genre, because we prefer to store the genre as a single indivisible value rather than as a text that will be processed into tokens. In Elasticsearch, `Text` fields are associated with an analyzer that converts a raw text string into a sequence of tokens.
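
To make the distinction concrete, here is a toy pure-Python sketch that does not involve Elasticsearch at all. The helper names `keyword_matches` and `text_matches` are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for a real analyzer.

```python
def keyword_matches(field_value, query):
    """Keyword-style field: the whole value is one atomic term."""
    return field_value == query


def text_matches(field_value, query):
    """Text-style field: the value is analyzed into tokens before matching.

    Lowercasing and whitespace splitting stand in for a real analyzer.
    """
    tokens = field_value.lower().split()
    return query.lower() in tokens


genre = "science_fiction"
text = "The quick Brown fox"

print(keyword_matches(genre, "science_fiction"))  # True: exact value
print(keyword_matches(genre, "science"))          # False: no partial match
print(text_matches(text, "brown"))                # True: token match after lowercasing
```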

The default (`standard`) analyzer for Elasticsearch tokenizes on word boundaries, dropping most punctuation, and then lowercases the tokens. It does not stem, and stopword removal is available but disabled by default. This is probably fine, but there might be better options; see [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html) for a list of built-in analyzers. We could select one by passing an `analyzer="..."` keyword argument to `Text`.

As it turns out, the built-in `classic` tokenizer does not lowercase on its own, which is something we'd like. Let's just build a custom analyzer, which allows us to choose exactly the preprocessing steps we want (though unfortunately no lemmatization!), and test it out on the fly using the `simulate` method. Here's a list of built-in [tokenizers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) and possible [filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html).

In [3]:
from elasticsearch_dsl import Document, Text, Keyword, analyzer, tokenizer

brown_analyzer = analyzer('brown', tokenizer="classic", filter=["lowercase","stop"])     # 'brown' is the name of the analyzer

analyzed = brown_analyzer.simulate("This is a test of your Elasticsearch custom analyzer. How did it go?")['tokens']
for token in analyzed:
    print(token)

{'token': 'test', 'start_offset': 10, 'end_offset': 14, 'typ...}
{'token': 'your', 'start_offset': 18, 'end_offset': 22, 'typ...}
{'token': 'elasticsearch', 'start_offset': 23, 'end_offset':...}
{'token': 'custom', 'start_offset': 37, 'end_offset': 43, 't...}
{'token': 'analyzer', 'start_offset': 44, 'end_offset': 52, ...}
{'token': 'how', 'start_offset': 54, 'end_offset': 57, 'type...}
{'token': 'did', 'start_offset': 58, 'end_offset': 61, 'type...}
{'token': 'go', 'start_offset': 65, 'end_offset': 67, 'type'...}
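
For intuition, the same three stages (tokenize, lowercase, remove stopwords) can be approximated in a few lines of plain Python. This is only a rough sketch: the regular expression is a crude stand-in for the `classic` tokenizer, `STOP_WORDS` is an illustrative subset rather than the list Elasticsearch actually uses, and `toy_analyze` is a hypothetical name.

```python
import re

# Illustrative subset of English stop words (an assumption, not the exact
# list used by Elasticsearch's `stop` filter).
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to"}


def toy_analyze(text):
    """Tokenize, lowercase, then drop stop words, in that order."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)        # crude tokenizer
    tokens = [t.lower() for t in tokens]               # 'lowercase' filter
    return [t for t in tokens if t not in STOP_WORDS]  # 'stop' filter


print(toy_analyze("This is a test of your Elasticsearch custom analyzer. How did it go?"))
# ['test', 'your', 'elasticsearch', 'custom', 'analyzer', 'how', 'did', 'go']
```

The order of the filters matters: lowercasing first is what lets the lowercase stop list catch the capitalized "This".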


You can also create [a custom tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html), for instance the one below, which tokenizes into character 3-grams:

In [4]:
tri_analyzer = analyzer('my_analyzer',
    tokenizer=tokenizer('trigram', 'ngram', min_gram=3, max_gram=3),
    filter=['lowercase']
)
for token in tri_analyzer.simulate("what's going on here?")['tokens']:
    print(token)

{'token': 'wha', 'start_offset': 0, 'end_offset': 3, 'type':...}
{'token': 'hat', 'start_offset': 1, 'end_offset': 4, 'type':...}
{'token': "at'", 'start_offset': 2, 'end_offset': 5, 'type':...}
{'token': "t's", 'start_offset': 3, 'end_offset': 6, 'type':...}
{'token': "'s ", 'start_offset': 4, 'end_offset': 7, 'type':...}
{'token': 's g', 'start_offset': 5, 'end_offset': 8, 'type':...}
{'token': ' go', 'start_offset': 6, 'end_offset': 9, 'type':...}
{'token': 'goi', 'start_offset': 7, 'end_offset': 10, 'type'...}
{'token': 'oin', 'start_offset': 8, 'end_offset': 11, 'type'...}
{'token': 'ing', 'start_offset': 9, 'end_offset': 12, 'type'...}
{'token': 'ng ', 'start_offset': 10, 'end_offset': 13, 'type...}
{'token': 'g o', 'start_offset': 11, 'end_offset': 14, 'type...}
{'token': ' on', 'start_offset': 12, 'end_offset': 15, 'type...}
{'token': 'on ', 'start_offset': 13, 'end_offset': 16, 'type...}
{'token': 'n h', 'start_offset': 14, 'end_offset': 17, 'type...}
{'token': ' he', 'start_o
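
Character n-grams are just a sliding window over the string, so the token sequence above is easy to reproduce in plain Python. A minimal sketch, where `char_ngrams` is a hypothetical helper and not part of `elasticsearch_dsl`:

```python
def char_ngrams(text, n=3):
    """Return every overlapping window of n characters, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("what's going on here?")
print(grams[:7])
# ['wha', 'hat', "at'", "t's", "'s ", 's g', ' go']
print(len(grams))  # 19 windows for a 21-character string
```

Note that spaces and punctuation are kept inside the windows, just as in the Elasticsearch output.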

Let's redefine BrownDocument using the `brown_analyzer`:

In [5]:
class BrownDocument(Document):
    text = Text(analyzer=brown_analyzer)
    genre = Keyword()

Now we are ready to set up our index. First, we initialize an `Index` object and tie it to the `BrownDocument` type defined above using its `document` method. The `create` method then actually creates the index on the Elasticsearch cluster by calling the API with the relevant information.

In [8]:
from elasticsearch_dsl import Index

brown_index = Index("brown")
brown_index.document(BrownDocument)
brown_index.create()

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'brown'}

Then we can fill the index by creating each `BrownDocument` and saving it to the index. We use keyword arguments to define the fields for each document. Every document in an Elasticsearch index has a unique id, which is contained in a special `meta.id` field; here we assign the original Brown filenames as the ids.

In [9]:
from nltk.corpus import brown

for fileid in brown.fileids():
    text = " ".join(brown.words(fileid))
    genre = brown.categories(fileid)[0]
    doc = BrownDocument(text=text, genre=genre)    # feeding the text and genre into the BrownDocument class
    doc.meta.id = fileid
    doc.save()



Note that the index you create in Elasticsearch will persist after you close your notebook and even after you shut down Elasticsearch. If you want to start again from scratch on an index, you will need to delete your old index first.

**Don't run unless you want to remove the index** 

In [7]:
# brown_index.delete()

Let's do boolean filtering in Elasticsearch. In order to do a search of any kind, we start by creating a search object using the `search` method of the index:

In [10]:
s = brown_index.search()

There are two basic methods which are used to define the search: `filter` and `query`. Use `filter` if you ONLY want to do boolean filtering. The `query` method does both boolean filtering and relevance ranking, which is slightly inefficient if you don't care about the ranking of documents.
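
The difference between the two can be illustrated with a toy example in plain Python. This is a sketch only: the corpus is made up, `toy_filter` and `toy_query` are hypothetical names, and the raw term count is a stand-in for a real relevance score, not what Elasticsearch computes.

```python
# Toy corpus: document id -> list of tokens (made-up data).
docs = {
    "d1": ["brown", "fox", "brown", "dog"],
    "d2": ["blue", "sky"],
    "d3": ["brown", "bear"],
}


def toy_filter(term):
    """Boolean filtering: an unordered set of matching ids."""
    return {doc_id for doc_id, tokens in docs.items() if term in tokens}


def toy_query(term):
    """Filtering plus ranking: matching ids, best score first.

    Raw term count is a hypothetical stand-in for a real relevance score.
    """
    scores = {doc_id: tokens.count(term) for doc_id, tokens in docs.items()}
    matches = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matches, key=lambda doc_id: scores[doc_id], reverse=True)


print(toy_filter("brown"))  # a set containing d1 and d3, in no particular order
print(toy_query("brown"))   # ['d1', 'd3']
```

Filtering only has to answer yes or no for each document, while ranking has to compute and sort scores, which is where the extra cost comes from.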

The other thing we need in order to set up a query using `elasticsearch_dsl` is a `Match` object. `Match` objects serve two purposes: they let us define the document field we will be searching, and they can be combined into complex boolean expressions. Note that you don't have to use `Match` objects (Elasticsearch has other query syntax), but they are very handy for boolean filtering.

Let's set up a simple query which demonstrates the use of `query` and `Match`:

In [11]:
from elasticsearch_dsl.query import Match

s = s.query(Match(text="brown"))

A few important points here. 

1. Calling `s.query` has no effect unless we assign the result back to `s` (search objects can be treated as immutable, like strings, in this regard).
2. The `Match` object is generated on the fly and passed as the argument to the `query` method.
3. We indicate which field we are searching in by using keyword arguments to the `Match` constructor.
4. The above does not actually do the search. It simply constructs an API call, which will be passed to the Java backend in JSON format when we execute the search. We can look at the underlying API call for a search by using the `to_dict` method.

In [12]:
s.to_dict()

{'query': {'match': {'text': 'brown'}}}

Now let's `execute` the query, storing the result in a response variable.

In [13]:
response = s.execute()

In [14]:
response

<Response: [BrownDocument(index='brown', id='ch26'), BrownDocument(index='brown', id='cj58'), BrownDocument(index='brown', id='cj15'), BrownDocument(index='brown', id='ce14'), BrownDocument(index='brown', id='ch29'), BrownDocument(index='brown', id='ce15'), BrownDocument(index='brown', id='ce13'), BrownDocument(index='brown', id='cf32'), BrownDocument(index='brown', id='cn22'), BrownDocument(index='brown', id='cc14')]>

We didn't get all the results (there are 70 in total). Like any good search engine, Elasticsearch returns only 10 results at a time by default. We can override this by using slicing syntax:

In [15]:
s = s[:500]
print(s.to_dict())
response = s.execute()
len(response)

{'query': {'match': {'text': 'brown'}}, 'from': 0, 'size': 500}


70
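
As the `to_dict()` output shows, the slice is translated into the `from` (offset) and `size` (number of hits) fields of the request body. A minimal sketch of that translation using plain dicts, where `with_slice` is a hypothetical helper rather than an `elasticsearch_dsl` function:

```python
import json


def with_slice(body, start, stop):
    """Return a copy of a search body with Python-style slice bounds
    translated into 'from' (offset) and 'size' (number of hits)."""
    paged = dict(body)
    paged["from"] = start
    paged["size"] = stop - start
    return paged


body = {"query": {"match": {"text": "brown"}}}

print(with_slice(body, 0, 500))
# {'query': {'match': {'text': 'brown'}}, 'from': 0, 'size': 500}

# The second page of ten hits (like s[10:20]), serialized as JSON
print(json.dumps(with_slice(body, 10, 20)))
```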

If we match on the genre name instead, we can quickly pull out documents from particular genres:

In [16]:
s = brown_index.search()
s = s.query(Match(genre="government"))
s = s[:500]
response = s.execute()
for i, hit in enumerate(response):
    print(i+1, hit.text[:100])

1 The Office of Business Economics ( OBE ) of the U.S. Department of Commerce provides basic measures 
2 In most of the less developed countries , however , such programing is at best inadequate and at wor
3 You have heard him tell these young people that during his almost 50 years of service in the Congres
4 Origin of state automobile practices . The practice of state-owned vehicles for use of employees on 
5 The Rhode Island property tax There was a time some years ago when local taxation by the cities and 
6 Local industry's investment in Rhode Island was the big story in 1960's industrial development effor
7 Special districts in Rhode island . It is not within the scope of this report to elaborate in any gr
8 Rhode Island Heritage Week proclamation by John A. Notte , Jr. , governor The theme of Rhode Island 
9 Be it enacted by the Senate and House of Representatives of the United States of America in Congress
10 In the same period , 431 presentations by members of the staff were ma

> Notice we only have 30 matches.

`Match` objects in `elasticsearch_dsl` support three boolean operators: `&` (and), `|` (or), and `~` (not). We can simply combine `Match` objects to create new query objects, in the same way we would combine sets.

These operators give us the same flexibility as Python's own boolean operators. For example, let's find documents which have neither the word *black* nor *blue*.

In [17]:
s = brown_index.search()
s = s.query(~(Match(text="black") | Match(text="blue")))
s = s[:500]
response = s.execute()
print(len(response))

331


And documents containing both "black" and "blue":

In [18]:
s = brown_index.search()
s = s.query(Match(text="black") & Match(text="blue"))

s = s[:500]
response = s.execute()
print(len(response))

35
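
In Python, operators like `&`, `|`, and `~` can be given custom meanings through the special methods `__and__`, `__or__`, and `__invert__`. The toy class below sketches the idea by evaluating queries directly against sets of document ids. It is not how `elasticsearch_dsl` is implemented (that library builds a query description to send to the server), and `ToyQuery` and the data are made up for illustration.

```python
class ToyQuery:
    """Toy query node supporting &, | and ~ (hypothetical, for illustration)."""

    def __init__(self, evaluate):
        # evaluate: function (index, all_ids) -> set of matching doc ids
        self.evaluate = evaluate

    @classmethod
    def term(cls, word):
        """Query matching documents that contain `word`."""
        return cls(lambda index, all_ids: set(index.get(word, set())))

    def __and__(self, other):
        return ToyQuery(lambda i, a: self.evaluate(i, a) & other.evaluate(i, a))

    def __or__(self, other):
        return ToyQuery(lambda i, a: self.evaluate(i, a) | other.evaluate(i, a))

    def __invert__(self):
        return ToyQuery(lambda i, a: a - self.evaluate(i, a))


# Made-up index: term -> set of document ids.
index = {"black": {"d1", "d2"}, "blue": {"d2", "d3"}}
all_ids = {"d1", "d2", "d3", "d4"}

both = ToyQuery.term("black") & ToyQuery.term("blue")
neither = ~(ToyQuery.term("black") | ToyQuery.term("blue"))

print(both.evaluate(index, all_ids))     # {'d2'}
print(neither.evaluate(index, all_ids))  # {'d4'}
```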


## Boolean Filtering

We'll now reimplement some of the Elasticsearch functionality in native Python.

The simplest kind of query involves looking for texts that contain particular words. For example, we can look for texts in the Brown corpus that contain the words "black" and "blue". However, even for a corpus of only 500 texts, iterating over the texts is pretty slow. 

This corresponds to the grep approach to IR.

In [19]:
from nltk.corpus import brown

def get_texts_with_words(word1, word2):
    '''returns a list of brown fileids that contain both words (which must be given in lowercase)'''
    texts = []
    for filename in brown.fileids():
        has_word1 = False
        has_word2 = False
        for word in brown.words(filename):
            if word.lower() == word1:
                has_word1 = True
            if word.lower() == word2:
                has_word2 = True
        if has_word1 and has_word2:
            texts.append(filename)
    return texts

print(get_texts_with_words("black","blue"))

['ca18', 'ca25', 'ca33', 'cb13', 'ce23', 'ce25', 'cf36', 'cg27', 'cg40', 'cg41', 'cg50', 'ck06', 'ck10', 'ck13', 'ck15', 'cl10', 'cl19', 'cl21', 'cn15', 'cn19', 'cn20', 'cn28', 'cp01', 'cp04', 'cp05', 'cp15', 'cp21', 'cp23', 'cp26', 'cp28']
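
The same scan can be generalized to any number of words with a subset test. The sketch below runs on a made-up toy corpus so that it is self-contained; `texts_with_all_words` is a hypothetical helper, and it is still a linear pass over every document.

```python
def texts_with_all_words(docs, *words):
    """Return ids of documents containing every word in `words`.

    docs: mapping of document id -> iterable of tokens.
    This is still a linear scan over every document.
    """
    wanted = {w.lower() for w in words}
    hits = []
    for doc_id, tokens in docs.items():
        vocabulary = {t.lower() for t in tokens}
        if wanted <= vocabulary:  # subset test: all wanted words are present
            hits.append(doc_id)
    return hits


toy_docs = {
    "d1": ["Black", "and", "blue"],
    "d2": ["blue", "sky"],
    "d3": ["black", "cat"],
}
print(texts_with_all_words(toy_docs, "black", "blue"))  # ['d1']
```

With NLTK available, the Brown corpus could presumably be passed in as `{f: brown.words(f) for f in brown.fileids()}`.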


Scanning the whole corpus in response to every query is not practical. That's why we need to build an index. The underlying data structure used in IR is known as an *inverted index*. The classic setup for an inverted index is a hash map (i.e. a Python dict) from each word to a list of the ids of the documents that contain it (we use a set rather than a list here, which makes boolean operations easy). Let's create this for the Brown corpus.

In [20]:
from collections import defaultdict

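def create_inverted_index(nltk_corpus):
    '''maps each lowercased word to the set of fileids that contain it'''
    inverted_index = defaultdict(set)
    # sorted() returns a new list, leaving the corpus reader's own list untouched
    sorted_ids = sorted(nltk_corpus.fileids())

    for filename in sorted_ids:
        # read from the corpus that was passed in, not the global `brown`
        for word in nltk_corpus.words(filename):
            inverted_index[word.lower()].add(filename)

    return inverted_index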

brown_inverted_index = create_inverted_index(brown)
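
To see what the resulting structure looks like, here is the same idea applied to a three-document toy corpus. This is a sketch with made-up data; `build_inverted_index` is a hypothetical variant that takes a plain dict instead of an NLTK corpus so that it can run on its own.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token.lower()].add(doc_id)
    return index


toy_docs = {
    "d1": ["The", "brown", "fox"],
    "d2": ["the", "blue", "sky"],
    "d3": ["Brown", "bear"],
}
toy_index = build_inverted_index(toy_docs)

print(sorted(toy_index["brown"]))  # ['d1', 'd3']
print(sorted(toy_index["the"]))    # ['d1', 'd2']
print(sorted(toy_index["green"]))  # [] (unseen words map to an empty set)
```

One side effect of using a `defaultdict` is that looking up an unseen word adds an empty entry for it to the index.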

Finding the documents which contain one specific word is now very fast.

In [21]:
print(brown_inverted_index["brown"])

{'cj14', 'cc08', 'cn27', 'ck18', 'cp02', 'ch06', 'ck25', 'cl14', 'cn16', 'cj61', 'ce15', 'cp10', 'cb14', 'cf22', 'cg51', 'cn22', 'cf35', 'cp16', 'ce14', 'cl18', 'cb02', 'cb11', 'cf34', 'ca18', 'cr07', 'cn07', 'cn06', 'ce11', 'cn26', 'cb04', 'cg03', 'ck29', 'cn10', 'cp05', 'cl17', 'ca17', 'cg12', 'ca21', 'cf32', 'ca29', 'cc04', 'cg55', 'cg47', 'cn15', 'cc14', 'cj15', 'ca11', 'cj58', 'cf26', 'cn23', 'cp12', 'ck01', 'ca24', 'cp14', 'ch26', 'cg14', 'ce18', 'cn20', 'ck16', 'cp26', 'ck13', 'cl13', 'cp04', 'cb24', 'cn17', 'ch29', 'ce13', 'cf30', 'ch25'}


We can now use our index to do Boolean filtering:

In [22]:
black_matches = brown_inverted_index["black"]
blue_matches = brown_inverted_index["blue"]

print(len(black_matches))
print(len(blue_matches))

black_matches & blue_matches

100
85


{'ca18',
 'ca25',
 'ca33',
 'cb13',
 'ce23',
 'ce25',
 'cf36',
 'cg27',
 'cg40',
 'cg41',
 'cg50',
 'ck06',
 'ck10',
 'ck13',
 'ck15',
 'cl10',
 'cl19',
 'cl21',
 'cn15',
 'cn19',
 'cn20',
 'cn28',
 'cp01',
 'cp04',
 'cp05',
 'cp15',
 'cp21',
 'cp23',
 'cp26',
 'cp28'}
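
Python sets make intersection a one-liner. When the document ids for each word are instead stored as sorted lists (the classic setup mentioned earlier), two lists can be intersected with a single linear merge. A minimal sketch with made-up posting lists, where `intersect_sorted` is a hypothetical helper:

```python
def intersect_sorted(postings_a, postings_b):
    """Intersect two sorted lists of document ids with a linear merge."""
    i, j = 0, 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])  # id appears in both lists
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return result


black = ["ca18", "ca25", "cb13", "ck06"]         # made-up posting lists
blue = ["ca18", "cb13", "cc04", "ck06", "cp01"]
print(intersect_sorted(black, blue))  # ['ca18', 'cb13', 'ck06']
```

Each list is walked at most once, so the cost is proportional to the combined length of the two lists.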

Sets also provide simple ways to implement other kinds of boolean logic, like *or*:

In [23]:
black_matches | blue_matches   

{'ca01',
 'ca02',
 'ca05',
 'ca15',
 'ca16',
 'ca17',
 'ca18',
 'ca22',
 'ca23',
 'ca24',
 'ca25',
 'ca29',
 'ca30',
 'ca32',
 'ca33',
 'ca39',
 'ca40',
 'cb05',
 'cb06',
 'cb09',
 'cb10',
 'cb13',
 'cb17',
 'cb26',
 'cb27',
 'cc04',
 'cc05',
 'cc08',
 'cc14',
 'cc15',
 'cd03',
 'ce05',
 'ce11',
 'ce12',
 'ce13',
 'ce19',
 'ce23',
 'ce25',
 'ce32',
 'ce34',
 'cf01',
 'cf02',
 'cf06',
 'cf10',
 'cf18',
 'cf20',
 'cf22',
 'cf26',
 'cf28',
 'cf29',
 'cf34',
 'cf35',
 'cf36',
 'cf38',
 'cf39',
 'cf42',
 'cf44',
 'cg04',
 'cg05',
 'cg09',
 'cg12',
 'cg14',
 'cg17',
 'cg18',
 'cg27',
 'cg40',
 'cg41',
 'cg50',
 'cg51',
 'cg55',
 'cg69',
 'cg70',
 'cg75',
 'ch27',
 'cj01',
 'cj09',
 'cj10',
 'cj16',
 'cj43',
 'cj48',
 'cj53',
 'cj62',
 'cj66',
 'cj70',
 'ck03',
 'ck04',
 'ck06',
 'ck10',
 'ck11',
 'ck12',
 'ck13',
 'ck14',
 'ck15',
 'ck16',
 'ck17',
 'ck19',
 'ck22',
 'ck23',
 'ck24',
 'ck26',
 'ck28',
 'ck29',
 'cl07',
 'cl09',
 'cl10',
 'cl11',
 'cl19',
 'cl20',
 'cl21',
 'cl22',
 'cm01',
 

Boolean *not* can be implemented by taking the set difference between the set of all documents and the set being negated. E.g.

In [24]:
all_documents = set(brown.fileids())
print(len(all_documents - (black_matches | blue_matches)))

345
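
The same logic can be checked on a small made-up example, which also shows that "neither A nor B" can be written in two equivalent ways (De Morgan's law). The document ids below are hypothetical.

```python
all_ids = {"d1", "d2", "d3", "d4", "d5"}  # made-up document ids
black = {"d1", "d2"}
blue = {"d2", "d3"}

# NOT (black OR blue): complement of the union
neither = all_ids - (black | blue)

# Equivalent form: (NOT black) AND (NOT blue)
neither_alt = (all_ids - black) & (all_ids - blue)

print(sorted(neither))         # ['d4', 'd5']
print(neither == neither_alt)  # True

# "black AND NOT blue" needs no complement at all: plain set difference
print(sorted(black - blue))    # ['d1']
```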


We will see this again in the next notebook.