In [1]:
%pip install --no-cache-dir --force-reinstall https://dm.cs.tu-dortmund.de/nats/nats25_01_02_information_retrieval-0.1-py3-none-any.whl
import nats25_01_02_information_retrieval

Collecting nats25-01-02-information-retrieval==0.1
  Downloading https://dm.cs.tu-dortmund.de/nats/nats25_01_02_information_retrieval-0.1-py3-none-any.whl (3.6 kB)
Installing collected packages: nats25-01-02-information-retrieval
Successfully installed nats25-01-02-information-retrieval-0.1
Note: you may need to restart the kernel to use updated packages.


# Foundations
## Information Retrieval

This week, we will learn some basics of information retrieval, and build a simple search engine.

### Hamlet sentences

We want to build a full-text search index for Hamlet in this assignment.

First load the Hamlet data from the previous assignment, and split it into sentences. Beware of the particular structure of this document: sentences are not separated only by a period.

Then tokenize the sentences as in the previous assignment, such that each sentence is a sequence of *words* (no punctuation tokens, lowercase). Do *not* remove stopwords. Do not use a library, write the code yourself.
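
As a small illustration of the tokenization step only (not the sentence splitting), the following sketch lowercases one string and keeps runs of letters. The pattern and the name `tokenize` are assumptions made for this example; how to treat apostrophes and hyphens is a design choice, and the previous assignment may have handled it differently.

```python
import re

# Assumed pattern: a token is a run of letters, optionally with inner apostrophes.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(sentence):
    """Return the lowercase word tokens of one sentence, without punctuation."""
    return TOKEN.findall(sentence.lower())

example = tokenize("To be, or not to be: that is the question!")
print(example)
```

Compiling the pattern once outside the function avoids repeating that work for every sentence.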

In [None]:
import re, urllib.request  # `import urllib` alone does not guarantee that urllib.request is loaded
# You can add some setup code here (e.g., re.compile)
pass # Your solution here

# Read the entire file:
file_path, _ = urllib.request.urlretrieve("https://dm.cs.tu-dortmund.de/nats/data/hamlet.txt")
with open(file_path, "rt") as file:
    full = file.read()

sentences = [] # Store your output in this list
# First split Hamlet into sentences, then tokenize each sentence.
pass # Your solution here

print(f"Hamlet contains {len(sentences)} sentences, {sum(len(s) for s in sentences)} tokens.")

In [None]:
nats25_01_02_information_retrieval.hidden_tests_3_0(sentences)

Find the longest sentence (as a list of tokens) and print it.

In [None]:
longest = [] # store the answer here, as array
pass # Your solution here
print("Length of longest sentence:", len(longest))
print(*longest)

In [None]:
nats25_01_02_information_retrieval.hidden_tests_6_0(longest)

Count how many sentences have exactly one token. Why are there so many? Find the 10 most frequent one-word sentences.
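
For counting, `collections.Counter` from the standard library may be convenient; its `most_common(n)` method returns the `n` most frequent items as `(item, count)` pairs. A sketch on made-up data (the list `toy` is hypothetical and not taken from Hamlet):

```python
from collections import Counter

# Hypothetical tokenized sentences, only for illustration.
toy = [["yes"], ["my", "lord"], ["no"], ["yes"], ["well"], ["yes"], ["no"]]

# Count only the sentences that consist of exactly one token.
one_word = Counter(s[0] for s in toy if len(s) == 1)
n_single = sum(one_word.values())
top2 = one_word.most_common(2)
print(n_single, top2)
```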

In [None]:
singletons = 0 # Store your answer in this variable
pass # Your solution here
print(f"There are {singletons} sentences with just one word.")

most_common = [] # Store the 10 most common one-word sentences and their counts
pass # Your solution here

for word, count in most_common:
    print(word, count, sep="\t")

In [None]:
nats25_01_02_information_retrieval.hidden_tests_9_0(singletons, most_common)

## Build an inverted index

For full-text search, we need an inverted index. Build a lookup table that allows us to find all sentence numbers that contain a particular word. List each sentence number at most once per word, even if the word occurs several times in that sentence.
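
To make the idea concrete, here is a minimal sketch on three made-up token lists (the names `docs` and `toy_index` are hypothetical). It assumes the documents are visited in increasing order, so a repeated word within the same document can be detected by looking at the last entry of its list.

```python
from collections import defaultdict

# Hypothetical mini-corpus: three already tokenized "sentences".
docs = [["to", "be", "or", "not", "to", "be"],
        ["not", "a", "mouse"],
        ["to", "a", "nunnery"]]

toy_index = defaultdict(list)  # word -> sorted list of document numbers
for number, tokens in enumerate(docs):
    for token in tokens:
        postings = toy_index[token]
        # Documents are processed in order, so a duplicate can only be the last entry.
        if not postings or postings[-1] != number:
            postings.append(number)

print(toy_index["to"], toy_index["not"], toy_index["a"])
```

Because each list is filled in increasing order, it is already sorted, which merge-based operations require.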

In [None]:
from collections import defaultdict
index = defaultdict(list) # words to occurrences
pass # Your solution here
print(f"The index contains {len(index)} words and {sum([len(x) for x in index.values()])} occurrences")

In [None]:
nats25_01_02_information_retrieval.hidden_tests_12_0(sentences, index)

# Excursus: Generators in Python

Python has a (rather uncommon) powerful feature called [*generators*](https://wiki.python.org/moin/Generators).

- When writing a generator, it looks like a function that can "return" multiple values (using `yield`) and is paused in between
- When consuming a generator, it behaves essentially like an iterator
- Generators are *lazy*: they do *not* build a list of all their output, but produce one item at a time, only when it is requested
- Generators *could* produce an infinite stream of values
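
The last two points can be sketched as follows: a generator that never terminates, consumed lazily with `itertools.islice` so that only the requested values are ever computed. The generator name `squares` is made up for this illustration.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... without ever stopping."""
    i = 0
    while True:
        yield i * i
        i += 1

# islice consumes only as many values as requested, so this terminates.
first_five = list(islice(squares(), 5))
print(first_five)
```

Calling `list(squares())` directly would never finish, because the generator has no end.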

In the following assignments, please use generators for efficiency. Here is a simple example of how generators work:

In [None]:
def upto(x):
    i = 0
    while i <= x:
        print("gen: generating", i)
        yield i # Return value and pause!
        print("gen: continuing")
        i += 1

print("Use generator in for loop:")
for j in upto(2):
    print("use: generated:", j)
    print("use: next")

print("Use generator object directly:")
a = upto(1)
print("Type of a:", type(a))
print(next(a))
print("Wait")
print(next(a))
try:
    print(next(a))
except StopIteration:
    print("No further values.")
    
print(*upto(2)) # The star expands an iterable/generator

Write a simple generator yourself that enumerates an existing list: given an input list `[a,b,c]`, generate pairs `(i,v)` where `i` is the 0-based index of the value `v` in the list.

In [None]:
def my_enumerate(existing):
    """Enumerate the values in the existing list."""
    pass # Your solution here

for i, string in my_enumerate(["apple", "banana", "coconut"]):
    print("Index", i, "value", string)

In [None]:
enumerate=enumerate # Weird fix for JupyterLite
nats25_01_02_information_retrieval.hidden_tests_17_0(my_enumerate, enumerate)

# Intersection of sorted lists

Back to Hamlet: write a *generator* for the *sorted* intersection of two sorted iterators (e.g., list iterators or other generators). Use a **merge** operation as discussed in class!

You may assume that the input is ordered and does not contain duplicates.
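
The merge idea can be sketched with plain list indices first; a generator version works on the same principle, but has to pull values with `next` instead of indexing. The function name `merge_intersect_lists` is hypothetical, and unlike the exercise it builds and returns a list.

```python
def merge_intersect_lists(xs, ys):
    """Intersect two sorted, duplicate-free lists with two cursors."""
    i, j, out = 0, 0, []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])  # common element: keep it, advance both cursors
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1  # all remaining ys are larger, so xs[i] has no match
        else:
            j += 1  # all remaining xs are larger, so ys[j] has no match
    return out

print(merge_intersect_lists([1, 3, 5, 7, 9], [3, 4, 5, 6, 9, 12]))
```

Each step advances at least one cursor, so the running time is linear in the combined length of the two inputs.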

In [None]:
def intersect(itera, iterb):
    """Generate the intersection of the two iterators. Do *not* use a list or set!"""
    itera, iterb = iter(itera), iter(iterb)
    try:
        a, b = next(itera), next(iterb)
        pass # Your solution here
    except StopIteration:
        pass # Figure out why this is the right thing to do here!

print(*intersect(range(27,51), [7,23,42,99]))
print(*intersect("abc","abc"))
# We want to compute the intersection of intersections!
print(*intersect("abcdef", intersect("cdefgh", "efghij")))

In [None]:
nats25_01_02_information_retrieval.hidden_tests_20_0(set, intersect, list)

## Search!

We want to use the index and functions above to find all sentences that contain both `hamlet` and `horatio`.

Write a function `search` that, given a list of keywords, finds all sentences containing all of them.
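
One way to extend a two-argument operation to any number of inputs is `functools.reduce`, which folds a sequence from left to right. The sketch below uses a set-based stand-in (`common`, hypothetical) in place of a merge, and a made-up `toy_postings` table, purely to show the folding pattern.

```python
from functools import reduce

# Hypothetical posting lists: word -> sorted sentence numbers.
toy_postings = {"a": [0, 2, 3, 7], "b": [2, 3, 5, 7], "c": [1, 2, 7, 8]}

def common(xs, ys):
    """Stand-in for a merge-based intersection (uses a set, only for illustration)."""
    ys = set(ys)
    return [x for x in xs if x in ys]

def toy_search(*words):
    """Fold `common` over the posting lists of all given words."""
    return reduce(common, (toy_postings[w] for w in words))

print(toy_search("a", "b", "c"))
```

With a single word, `reduce` simply returns that word's posting list; with no words at all it raises a `TypeError`, so that case would need separate handling.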

In [None]:
def search(*words):
    """Find all sentence numbers that contain each word in `words`"""
    pass # Your solution here

for i,s in enumerate(search("hamlet", "horatio")): print(i,s," ",*sentences[s])
print()
for i,s in enumerate(search("to", "be", "or", "not")): print(i,s," ",*sentences[s])

In [None]:
nats25_01_02_information_retrieval.hidden_tests_23_0(search, intersect, index)

## Compute the union

In order to perform "OR" searches, e.g., to find all sentences that contain "hamlet" or "horatio", we need a different merge operation. Also implement the `union` merge using generators as above.

You may assume that the input is ordered and does not contain duplicates.
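
For cross-checking a hand-written merge, the standard library offers a reference: `heapq.merge` lazily merges sorted inputs (keeping duplicates), and `itertools.groupby` can then collapse runs of equal neighbours. This is a different technique from the merge loop asked for here; the name `reference_union` is hypothetical.

```python
from heapq import merge
from itertools import groupby

def reference_union(xs, ys):
    """Yield the sorted union of two sorted inputs, each value once."""
    # merge() keeps equal values next to each other; groupby() yields one key per run.
    for value, _run in groupby(merge(xs, ys)):
        yield value

print(*reference_union([2, 4, 6], [1, 3, 5]))
print(*reference_union(range(0, 7), range(4, 10)))
```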

In [None]:
def union(itera, iterb):
    """Generate the union of the two iterators. Do *not* use a list or set!"""
    def safe_next(i):
        """Helper function because exceptions are not too elegant."""
        try:
            return next(i)
        except StopIteration:
            return None
    itera, iterb = iter(itera), iter(iterb)
    a, b = safe_next(itera), safe_next(iterb)
    pass # Your solution here

print(*union([2,4,6],[1,3,5]))
print(*union("abc","abc"))
print(*union(range(0,7), range(4,10)))

In [None]:
nats25_01_02_information_retrieval.hidden_tests_26_0(set, list, union)

## Search with AND and OR

Perform a more complex search using the functions above.

Search for all sentences that contain ("hamlet" or "horatio") and "shall".

In [None]:
answer = [] # Store your result in this variable
pass # Your solution here
answer = list(answer) # in case your answer was a generator
for i,s in enumerate(answer): print(i, s, " ", *sentences[s])

In [None]:
nats25_01_02_information_retrieval.hidden_tests_29_0(answer, sentences)