# Exercise 5: Text Retrieval


To complete the exercise, follow the instructions, fill in the missing code, and write the answers where required. All tasks are mandatory except those marked with **(N points)**, which are optional. The optional tasks require more independent work and some extra effort. Without completing them you can get at most 75 points for the exercise (the total is 100 points, which corresponds to grade 10). Sometimes there are more optional tasks than you need, so you do not have to complete all of them; you can get at most 100 points in any case.



## Introduction

In this exercise you will implement some indexing operations used to retrieve information from a corpus of text documents. As the volume of readily available text grows beyond measure, methods for fast and user-friendly querying of text files are needed.

In [1]:
# IMPORTS
import math
import re
import sys
from collections import defaultdict

import nltk
import numpy as np
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# NOTE: this only assigns a Python variable and does not change the I/O encoding;
# set PYTHONIOENCODING in the shell environment before starting Python if needed.
PYTHONIOENCODING = "UTF-8"

## Assignment 1: Using the nltk library and corpora

In this assignment, you will learn how to use the <a href="https://www.nltk.org/"><b>nltk</b> library</a> and how to load and preprocess a corpus of documents.

a) The <b>nltk</b> library includes mechanisms for loading a number of different text corpora. Download the <b>Gutenberg corpus</b> by using the function ``nltk.download()``. Then, with the help of the <a href="http://www.nltk.org/book/">NLTK Book</a>, familiarize yourself with the contents and structure of the corpus. If at any point in the exercise you find the Gutenberg corpus too small or otherwise unsuitable for your needs, you are welcome to try other corpora available on the internet.

In [2]:
# Download Gutenberg corpus
nltk.download('gutenberg')
# Download Punkt Tokenizer Models
nltk.download('punkt')
# Download Stop Words
nltk.download('stopwords')

[nltk_data] Downloading package gutenberg to /home/jovyan/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# TODO
# Show structure of Gutenberg corpus

In [4]:
# TODO: familiarize/experiment with the content of Gutenberg corpus
#       e.g. show the number of words, display first few words, ...

b) Preprocess the raw data of each document in the Gutenberg corpus using the built-in nltk tokenizer. You can use the function ``nltk.word_tokenize()``. Remove stop words and punctuation to further reduce the amount of data you will need to process later on.
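One possible shape for the filtering step is sketched below. To stay self-contained it uses a tiny hand-written stop word set (`TOY_STOP_WORDS`, a hypothetical stand-in); in the exercise you would presumably pass `set(stopwords.words('english'))` instead. The function name `filter_tokens` and the choice to lowercase tokens are assumptions, not requirements of the task.

```python
# Hypothetical stand-in for set(stopwords.words('english')); far from complete.
TOY_STOP_WORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}


def filter_tokens(tokens, stop_words=TOY_STOP_WORDS):
    """Lowercase tokens and drop stop words and pure-punctuation tokens."""
    filtered = []
    for token in tokens:
        token = token.lower()
        # Skip tokens without any letter or digit, such as ',' or '--'.
        if not any(ch.isalnum() for ch in token):
            continue
        if token in stop_words:
            continue
        filtered.append(token)
    return filtered


tokens = ["The", "whale", ",", "a", "creature", "of", "the", "sea", "."]
print(filter_tokens(tokens))  # ['whale', 'creature', 'sea']
```

Passing the stop words as a set keeps each membership test constant-time, which matters once the function runs over whole books.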

In [5]:
# TODO
# Write a function that receives an array of words obtained from a tokenized file
# as an input parameter and returns an array of filtered words

In [6]:
# TODO
# Preprocess raw data of documents in Gutenberg corpus (use the built-in tokenizer 'word_tokenize')

c) <b>(15 points)</b> Write your own tokenizer for preprocessing. You can use regular expressions. Show how its results differ from those of the tokenizer you used in the previous task. Which kinds of tokens did you keep or ignore relative to the built-in tokenizer?
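As a starting point, a regular-expression tokenizer can be as small as the sketch below. The pattern and the name `regex_tokenize` are illustrative assumptions: the pattern keeps lowercase words with internal apostrophes or hyphens and silently drops digits and punctuation, which is only one of many reasonable designs.

```python
import re

# Words made of letters, optionally joined by internal apostrophes or hyphens
# (e.g. "don't", "well-known"). Digits and punctuation are ignored.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def regex_tokenize(text):
    """Return the lowercase word tokens found in text."""
    return TOKEN_PATTERN.findall(text.lower())


print(regex_tokenize("Don't panic: the well-known whale, aged 42, swam away."))
# ["don't", 'panic', 'the', 'well-known', 'whale', 'aged', 'swam', 'away']
```

When comparing against `word_tokenize`, contractions and punctuation are good places to look: the NLTK tokenizer typically splits "don't" into "do" and "n't" and emits punctuation marks as separate tokens, whereas this pattern does neither.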

In [7]:
# TODO (Bonus +15 points)
# 1) Write your own tokenizer
# 2) Compare results with the built-in tokenizer

## Assignment 2: Indexing

To simplify and speed up operations on sets of documents, the documents need to be indexed. An index allows fast lookup of whether and where a query word or phrase appears in the corpus.

a) Build an inverted index. For each unique token appearing in your corpus, make a list of the documents it appears in. Make some queries using Boolean logic (operation AND is an intersection of lists etc.).
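The idea can be illustrated on a toy corpus before applying it to Gutenberg. The sketch below assumes the documents are already tokenized and filtered, and it stores postings as sets rather than lists so that Boolean operators map directly onto set operations; the names `docs` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict

# Toy corpus: document id -> already preprocessed tokens (assumed input format).
docs = {
    "doc1": ["whale", "sea", "ship"],
    "doc2": ["ship", "captain", "storm"],
    "doc3": ["whale", "captain", "sea"],
}


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


index = build_inverted_index(docs)

# Boolean queries become set operations on the postings.
whale_and_sea = index["whale"] & index["sea"]     # AND: intersection
whale_or_storm = index["whale"] | index["storm"]  # OR: union
not_ship = set(docs) - index["ship"]              # NOT: complement

print(sorted(whale_and_sea))   # ['doc1', 'doc3']
print(sorted(whale_or_storm))  # ['doc1', 'doc2', 'doc3']
print(sorted(not_ship))        # ['doc3']
```

Sorted lists of document ids would work as well and allow the classic linear-time merge for intersections; sets are used here only for brevity.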

In [8]:
# TODO
# Build an inverted index

In [9]:
# TODO
# Make some queries

b) To allow querying of tokens that occur together (i.e. common phrases), a positional index can be used. Build a positional index on your corpus. This is an extension of the inverted index in which each list element holding a document index also stores a list of the positions in that document where the token appears. Make sure you have removed stop words properly in order to keep your computation relatively fast.
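A minimal sketch of such a structure, again on a toy corpus, is a nested mapping from token to document id to a list of positions. The function name and the use of 0-based positions in the filtered token list are assumptions made for illustration.

```python
from collections import defaultdict

# Toy corpus: document id -> already preprocessed tokens (assumed input format).
docs = {
    "doc1": ["white", "whale", "sea", "white", "whale"],
    "doc2": ["whale", "white", "ship"],
}


def build_positional_index(docs):
    """Map token -> {doc_id: [positions]}, with 0-based positions in the token list."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


pos_index = build_positional_index(docs)
print(dict(pos_index["white"]))  # {'doc1': [0, 3], 'doc2': [1]}
print(dict(pos_index["whale"]))  # {'doc1': [1, 4], 'doc2': [0]}
```

Note that positions counted after stop word removal no longer match positions in the raw text, so two tokens that look adjacent in the index may have had a stop word between them originally.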

In [10]:
# TODO
# Build positional index

In [11]:
# TODO
# Make some queries

c) Use the positional index to query phrases. That is, return the positions in documents where each of the words in your phrase occurs at approximately the same position in the document.
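One way to read "approximately the same position" is sketched below: the i-th word of the phrase must occur within `slack` positions of where an exact phrase match would place it. The `slack` parameter, the function names, and the toy corpus are all assumptions; other tolerance rules are equally valid.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index


def phrase_query(index, phrase, slack=0):
    """Return {doc_id: [start positions]} where the phrase terms occur in order.

    Term number `offset` must appear within `slack` positions of start + offset.
    """
    if not phrase or any(term not in index for term in phrase):
        return {}
    # Only documents that contain every term can match.
    candidate_docs = set(index[phrase[0]])
    for term in phrase[1:]:
        candidate_docs &= set(index[term])

    matches = {}
    for doc_id in sorted(candidate_docs):
        starts = []
        for start in index[phrase[0]][doc_id]:
            if all(
                any(abs(pos - (start + offset)) <= slack
                    for pos in index[term][doc_id])
                for offset, term in enumerate(phrase)
            ):
                starts.append(start)
        if starts:
            matches[doc_id] = starts
    return matches


docs = {
    "doc1": ["white", "whale", "sea", "white", "whale"],
    "doc2": ["whale", "white", "ship"],
    "doc3": ["white", "old", "whale"],
}
index = build_positional_index(docs)
print(phrase_query(index, ["white", "whale"]))           # {'doc1': [0, 3]}
print(phrase_query(index, ["white", "whale"], slack=1))  # {'doc1': [0, 3], 'doc3': [0]}
```

With `slack=0` this is an exact phrase match; a larger value tolerates intervening words, such as "old" in `doc3`.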

In [12]:
# TODO
# Return the positions in documents where each of the words in your query phrase
# occurs at approximately the same position in the document

## Assignment 3: Text statistics

a) The relevance of the documents in your corpus to the user's query can be measured by the term frequency, i.e. how many times the query term appears in each document. Implement a function that counts the number of appearances of each token in each document.
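Counting can be delegated to `collections.Counter` from the standard library. The sketch below assumes each document is already a list of filtered tokens; `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter

# Toy corpus: document id -> already preprocessed tokens (assumed input format).
docs = {
    "doc1": ["whale", "sea", "whale", "ship"],
    "doc2": ["ship", "captain", "ship"],
}


def term_frequencies(docs):
    """Map doc_id -> Counter of token occurrences in that document."""
    return {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}


tf = term_frequencies(docs)
print(tf["doc1"]["whale"])  # 2
print(tf["doc2"]["whale"])  # 0 (a Counter returns 0 for missing keys)
```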

In [13]:
# TODO
# Count the number of appearances of each token in each document

b) The absolute number of appearances is a biased measure: it favors long documents and terms that are common across the whole corpus. Therefore a different metric called <b>TFIDF</b> (short for term frequency–inverse document frequency) is commonly used to rank the relevance of documents containing a query. TFIDF is computed as follows:

\begin{equation}
\mathrm{tfidf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t,
\end{equation}

where $\mathrm{tf}_{t,d}$ is the frequency of the term $t$ in document $d$ and $\mathrm{idf}_{t}$ is defined as

\begin{equation}
\log_{10}{\frac{N}{\mathrm{df}_t}},
\end{equation}

where $N$ is the number of documents in the corpus and $\mathrm{df}_t$ is the number of documents of the corpus in which the term $t$ appears.

Implement a system that returns the $5$ most relevant documents from the corpus for a given query. Note that your queries can contain more than one word. The score in that case is calculated as

\begin{equation}
s(q,d) = \sum_{t \in q}{\mathrm{tfidf}_{t,d}},
\end{equation}

where $q$ is your query. 
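The formulas above translate fairly directly into code. The following sketch works on a toy corpus of pre-tokenized documents; the function name `rank_documents`, the `top_k` parameter, and the decision to skip query terms that do not occur anywhere in the corpus (where $\mathrm{df}_t = 0$ would cause a division by zero) are assumptions.

```python
import math
from collections import Counter


def rank_documents(docs, query, top_k=5):
    """Return up to top_k (doc_id, score) pairs sorted by descending TF-IDF score."""
    n_docs = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # df[t]: number of documents in which term t appears at least once.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in tf.items():
        score = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term absent from the corpus: contributes nothing
            idf = math.log10(n_docs / df[term])
            score += counts[term] * idf
        scores[doc_id] = score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


docs = {
    "doc1": ["whale", "sea", "whale", "ship"],
    "doc2": ["ship", "captain", "ship"],
    "doc3": ["sea", "storm", "captain"],
}
for doc_id, score in rank_documents(docs, ["whale", "ship"]):
    print(doc_id, round(score, 3))
```

Here `doc1` ranks first because "whale" is both frequent in it and rare in the corpus. A term that appears in every document has $\mathrm{idf}_t = \log_{10} 1 = 0$ and therefore does not influence the ranking at all.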

In [14]:
# TODO
# Return 5 most relevant documents from the corpus given a query

c) <b>(10 points)</b> Implement a system for handling typographical errors in the user's queries. The choice of method is up to you. You need to show that your system returns relevant results for misspelled queries.
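Since the method is left open, the sketch below shows just one option: replace each unknown query term with the closest vocabulary term by Levenshtein edit distance before running the query. The helper names, the `max_distance` threshold of 2, and the toy vocabulary are assumptions; n-gram overlap or `difflib.get_close_matches` from the standard library would be alternatives.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct_term(term, vocabulary, max_distance=2):
    """Return the closest vocabulary term, or the term itself if none is close enough."""
    if term in vocabulary or not vocabulary:
        return term
    # Sorting makes tie-breaking deterministic (alphabetically first wins).
    best = min(sorted(vocabulary), key=lambda word: edit_distance(term, word))
    return best if edit_distance(term, best) <= max_distance else term


vocabulary = {"whale", "captain", "ship", "storm"}
query = ["whael", "captian", "ship", "xyzzy"]
print([correct_term(term, vocabulary) for term in query])
# ['whale', 'captain', 'ship', 'xyzzy']
```

Scanning the whole vocabulary for every unknown term is slow on a real corpus; restricting candidates to words of similar length or with a shared first letter is a common way to cut the search down.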

In [15]:
# TODO (Bonus +10 points)
# Handle typographical errors in user's query and return relevant results for misspelled queries.