# Building an inverted index

  - You are given a sample (1000 documents) from the [The Reuters-21578 data collection](http://www.daviddlewis.com/resources/testcollections/reuters21578/) in `data/reuters21578-000.xml`
  - The code that parses the XML and extract a list of preprocessed terms (tokenized, lowercased, stopwords removed) is already given.
  - You are also given an InvIndex class that manages the posting lists operations.
  - Build an inverted index from the input collection with the term frequencies stored.
  - Save the inverted index to a text file. E.g., `termID docID1:freq1 docID2:freq2 ...`.

In [6]:
from xml.dom import minidom
from collections import Counter
import re

## Parsing documents

Stopwords list

In [7]:
stopwords = ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]

Stripping tags inside <> using regex

In [8]:
def striptags(text):
    p = re.compile(r'<.*?>')
    return p.sub('', text)

Parse input text and return a list of indexable terms

In [9]:
def parse(text):
    terms = []
    # Replace specific characters with space
    chars = ["'", ".", ":", ",", "!", "?", "(", ")"]
    for ch in chars:
        if ch in text:
            text = text.replace(ch, " ")

    # Remove tags
    text = striptags(text)

    # Tokenization
    for term in text.split():  # default behavior of the split is to split on one or more whitespaces
        # Lowercasing
        term = term.lower()
        # Stopword removal
        if term in stopwords:
            continue
        terms.append(term)

    return terms

## Processing the input document collection

  - The collection is given as a single XML file. 
  - Each document is inside `<REUTERS ...> </REUTERS>`.
  - We extract the contents of the `<DATE>`, `<TITLE>`, and `<BODY>` tags.
  - After each extracted document, the provided callback function is called and all document data is passed in a single dict argument.

In [10]:
def process_collection(input_file, callback):
    xmldoc = minidom.parse(input_file)
    # Iterate documents in the XML file
    itemlist = xmldoc.getElementsByTagName("REUTERS")
    doc_id = 0
    for doc in itemlist:
        doc_id += 1
        date = doc.getElementsByTagName("DATE")[0].firstChild.nodeValue
        # Skip documents without a title or body
        if not (doc.getElementsByTagName("TITLE") and doc.getElementsByTagName("BODY")):
            continue
        title = doc.getElementsByTagName("TITLE")[0].firstChild.nodeValue
        body = doc.getElementsByTagName("BODY")[0].firstChild.nodeValue
        callback({
            "doc_id": doc_id,
            "date": date,
            "title": title,
            "body": body
            })

Prints a document's contents (used as a callback function passed to `process_collection`)

In [11]:
def print_doc(doc):
    if doc["doc_id"] <= 5:  # print only the first 5 documents
        print("docID:", doc["doc_id"])
        print("date:", doc["date"])
        print("title:", doc["title"])
        print("body:", doc["body"])
        print("--")

In [12]:
process_collection("data/reuters21578-000.xml", print_doc)

docID: 1
date: 26-FEB-1987 15:01:01.79
title: BAHIA COCOA REVIEW
body: Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
   

## Inverted index

  - The inverted index is an object with methods for adding and fetching postings.
  - The data is stored in a map, where keys are terms and values are lists of postings.
  - Each posting is an object that holds the doc_id and an optional payload.

In [13]:
class Posting(object):
    def __init__(self, doc_id, payload=None):
        self.doc_id = doc_id
        self.payload = payload

In [14]:
class InvIndex(object):

    def __init__(self):
        self.index = {}

    # Add a document to the posting list of a term
    def add_posting(self, term, doc_id, payload=None):
        if term not in self.index:  # if term not in index, initialize empty posting list
            self.index[term] = []
        # append new posting to the posting list
        self.index[term].append(Posting(doc_id, payload))

    # Get the posting list for a given term
    def get_postings(self, term):
        if term in self.index:
            return self.index[term]
        return None

    # Returns all unique terms in the index
    def get_terms(self):
        return self.index.keys() 

### Creating an inverted index from the input collection

In [18]:
ind = InvIndex()

def index_doc(doc):
    text = doc["title"] + " " + doc["body"]
    terms = parse(text)  # list of terms in the document
    tc = Counter(terms)
    for terms, freq in tc.items():
        ind.add_posting(terms, doc["docID"], freq)
    if doc["doc_id"] <= 5:  # print preprocessed contents for the first 5 docs
        print("docID:", doc["doc_id"])        
        print(terms)

    # TODO index the document (add all terms with freqs using `ind.add_posting()`)
    
    
    
process_collection("data/reuters21578-000.xml", index_doc)

docID: 1
['bahia', 'cocoa', 'review', 'showers', 'continued', 'throughout', 'week', 'bahia', 'cocoa', 'zone', 'alleviating', 'drought', 'since', 'early', 'january', 'improving', 'prospects', 'coming', 'temporao', 'although', 'normal', 'humidity', 'levels', 'have', 'been', 'restored', 'comissaria', 'smith', 'said', 'its', 'weekly', 'review', 'dry', 'period', 'means', 'temporao', 'late', 'year', 'arrivals', 'week', 'ended', 'february', '22', 'were', '155', '221', 'bags', '60', 'kilos', 'making', 'cumulative', 'total', 'season', '5', '93', 'mln', 'against', '5', '81', 'same', 'stage', 'last', 'year', 'again', 'seems', 'cocoa', 'delivered', 'earlier', 'consignment', 'included', 'arrivals', 'figures', 'comissaria', 'smith', 'said', 'still', 'some', 'doubt', 'how', 'much', 'old', 'crop', 'cocoa', 'still', 'available', 'harvesting', 'has', 'practically', 'come', 'end', 'total', 'bahia', 'crop', 'estimates', 'around', '6', '4', 'mln', 'bags', 'sales', 'standing', 'almost', '6', '2', 'mln', 'fe

TypeError: add_posting() missing 2 required positional arguments: 'term' and 'doc_id'

#### Saving inverted index to file

In [16]:
# TODO
ind.write

## Questions

  - How much space does the inverted index occupy?
  - How much space would be needed if the same information was stored in a document-term matrix?