# Assignment 2A, Part 1: Indexer

Index the document collection and save the index to disk.  

**IMPORTANT**: The collection and the index take up several hundred megabytes. Do NOT push them to GitHub!

Work on a small sample of documents while developing your solution. You only need to build the full index once you reach Part 2 of the assignment, since you may discover by then that the indexer needs refinements.
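One way to get a small development sample is to index only a handful of the gzip files. The sketch below assumes the collection lives under `data/aquaint/` (the path used later in this notebook); `sample_files` is a hypothetical helper, not part of the provided skeleton.

```python
import glob
import random

def sample_files(file_names, n=5, seed=0):
    """Return up to n file names, chosen reproducibly from file_names."""
    file_names = sorted(file_names)  # glob order is not guaranteed, so sort first
    rng = random.Random(seed)        # a fixed seed gives the same sample on every run
    return rng.sample(file_names, min(n, len(file_names)))

# Assumed location of the collection; adjust to your own layout.
dev_files = sample_files(glob.glob("data/aquaint/**/*.gz", recursive=True), n=5)
```

The resulting `dev_files` list can be passed wherever the full list of gzip files would otherwise go.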

You have two options for implementing the inverted index: (1) write it yourself from scratch, or (2) use the [HashedIndex](https://pypi.org/project/hashedindex/) Python library. There is no third option.
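For option (1), a minimal sketch of a from-scratch inverted index is a dictionary that maps each term to a per-document counter. The class name `SimpleInvertedIndex` and its methods are illustrative assumptions, not a required interface, and the input is assumed to be already tokenized.

```python
from collections import Counter, defaultdict

class SimpleInvertedIndex:
    """Hypothetical in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(Counter)
        self.doc_len = {}

    def add_document(self, doc_id, terms):
        """Record every term occurrence of one already-tokenized document."""
        self.doc_len[doc_id] = len(terms)
        for term in terms:
            self.postings[term][doc_id] += 1

    def term_frequency(self, term, doc_id):
        """Number of times `term` occurs in `doc_id` (0 if absent)."""
        # .get avoids creating an empty entry for unseen terms in the defaultdict
        return self.postings.get(term, {}).get(doc_id, 0)

    def document_frequency(self, term):
        """Number of documents that contain `term`."""
        return len(self.postings.get(term, {}))

index = SimpleInvertedIndex()
index.add_document("d1", ["new", "york", "times", "new"])
index.add_document("d2", ["york", "city"])
print(index.term_frequency("new", "d1"))  # 2
print(index.document_frequency("york"))   # 2
```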

You are required to adhere to the structure provided below.

The code for parsing the gzip files in the collection is already given.

You may decide to build two separate indices for the two document fields (title and content) or to keep them together in the same structure.

In [1]:
import re
import gzip
from bs4 import BeautifulSoup
import hashedindex
from hashedindex import textparser
import glob
import pickle

import nltk

from IPython.display import clear_output  # clears a cell's output so progress messages do not pile up

In [2]:
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is not installed.
nltk_stopwords = set(nltk.corpus.stopwords.words('english'))

In [3]:
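def add_docs_bulk(docs, section):
    """Index one field (`section`) of every document in `docs`.

    Returns (postings, doclen, tC, total_term_count), where postings maps
    term -> {doc_id: term frequency}, doclen maps doc_id -> number of indexed
    terms, and tC maps term -> frequency within this batch.
    """
    index = hashedindex.HashedIndex()
    doclen = {}
    tC = {}
    total_term_count = 0

    for doc_id, doc in docs.items():
        # word_tokenize yields one tuple per n-gram (unigrams by default) and drops stopwords
        terms = list(textparser.word_tokenize(doc[section].lower(), stopwords=nltk_stopwords))
        doclen[doc_id] = len(terms)

        for term in terms:
            index.add_term_occurrence(term[0], doc_id)  # term[0] unwraps the unigram tuple

    # Convert each posting list to a plain dict and accumulate the term counts
    postings = {}
    for term, doc_freqs in index.items().items():
        postings[term] = dict(doc_freqs)
        term_count = sum(doc_freqs.values())
        tC[term] = term_count
        total_term_count += term_count

    return (postings, doclen, tC, total_term_count)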

## Indexing a given data file

**NOTE**: Each source gzip file contains several documents. The function below parses the source files and then calls `add_docs_bulk()` to bulk-index all documents in each file.

In [4]:
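def combine_indexes(prev_indexes, new_indexes):
    """Merge new_indexes into prev_indexes in place and return the merged index."""
    for term, postings in new_indexes.items():
        if term in prev_indexes:
            # update() overwrites the count if a doc ID appears in both inputs;
            # doc IDs are assumed to be unique across files
            prev_indexes[term].update(postings)
        else:
            prev_indexes[term] = postings

    return prev_indexes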

# Simple example of this function
example_dict_1 = {'a': {'1':100, '2': 100}, 'b': {'1':200, '3': 100}}
example_dict_2 = {'a': {'5':100, '3': 100}, 'c': {'1':200, '3': 100}}

combine_indexes(example_dict_1, example_dict_2)

{'a': {'1': 100, '2': 100, '5': 100, '3': 100},
 'b': {'1': 200, '3': 100},
 'c': {'1': 200, '3': 100}}
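Because `dict.update` replaces existing values, `combine_indexes` keeps only the newer count if the same document ID shows up for a term in both inputs. That should not happen when every document ID is unique, but if it might, a variant that adds the counts could look like the sketch below (`combine_indexes_summing` is a hypothetical name).

```python
def combine_indexes_summing(prev_indexes, new_indexes):
    """Merge new_indexes into prev_indexes in place, adding counts for shared doc IDs."""
    for term, postings in new_indexes.items():
        merged = prev_indexes.setdefault(term, {})
        for doc_id, freq in postings.items():
            merged[doc_id] = merged.get(doc_id, 0) + freq
    return prev_indexes

a = {'x': {'1': 2}}
b = {'x': {'1': 3, '2': 1}, 'y': {'2': 4}}
print(combine_indexes_summing(a, b))
# {'x': {'1': 5, '2': 1}, 'y': {'2': 4}}
```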

In [7]:
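def index_file(file_names, section):
    """Index one field (`section`) of every document in the given gzip files.

    Returns (indexes, doc_len, P_tc): postings per term, document lengths,
    and collection term probabilities P(t|C).
    """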
    doc_len = {}
    indexes = {}
    P_tc = {}
    total_term_count = 0
    
    total_files_indexed = 0
    gz_files_read = 0
    for file_name in file_names:
        gz_files_read += 1
        clear_output()
        print("Processing", file_name)
        docs = {}
        with gzip.open(file_name, "rt") as fin:
            is_body = False
            doc_id, body = None, None
            
            for line in fin:
                line = line.strip()
                if line.startswith("<DOCNO>"):  # get doc id
                    doc_id = re.sub("<DOCNO> | </DOCNO>", "", line)
                elif line.startswith("<BODY>"):  # start to parse body
                    is_body = True
                    body = []
                elif line.startswith("</BODY>"):  # finished reading body
                    soup = BeautifulSoup("\n".join(body), "lxml")
                    headline = soup.find("headline")
                    text = soup.find("text")
                    docs[doc_id] = {
                        "title": headline.text if headline is not None else "",  # use an empty string if no <HEADLINE> found
                        "content": text.text if text is not None else ""  # everything inside <TEXT> is indexed as content
                    }
                    # get ready for next document
                    doc_id = None
                    is_body = False
                elif is_body:  # accumulate body content
                    body.append(line)

            # bulk index the collected documents
            total_files_indexed += len(docs)
            print("Bulk indexed:", len(docs), "documents.")
            print("Total files indexed so far:", total_files_indexed)
            print(gz_files_read,"/",len(file_names), "gz files finished reading.")
            new_indexes, doclen, tC, total_tc = add_docs_bulk(docs, section)
            
            # Merge the new postings into the running index
            indexes = combine_indexes(indexes, new_indexes)
            
            # Concatenate the new document lengths
            doc_len.update(doclen)
            
            # Sum all term counts
            for term, count in tC.items():
                P_tc[term] = P_tc.get(term, 0) + count
                    
            # Add the total term count
            total_term_count += total_tc
    
    # Convert the collection term counts into probabilities P(t|C), needed for the language model
    for term, count in P_tc.items():
        P_tc[term] = count / total_term_count
        
    clear_output()
    print("Finished indexing", total_files_indexed, "files in", len(file_names), "gz files.")
    return (indexes, doc_len, P_tc)

## Indexing a single file for testing

In [8]:
indexes, doc_len, P_tC = index_file(glob.glob("data/aquaint/nyt/2000/20000101_NYT.gz"), section = 'content')
print(len(indexes), len(doc_len), len(P_tC))

Finished indexing 243 files in 1 gz files.
16379 243 16379


**TODO**: Save the index to disk (make sure that the index directory is added to `.gitignore`)
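A minimal sketch of saving and reloading an index with `pickle` follows. The helper names `save_pickle` and `load_pickle` are assumptions for illustration; any serialization format that round-trips the dictionaries would work as well.

```python
import os
import pickle

def save_pickle(obj, path):
    """Write obj to path, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:  # `with` closes the file even if dump() fails
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Whatever directory the files are written to, listing it in `.gitignore` keeps the index out of the repository.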

## Indexing the entire collection and writing it to file.

In [9]:
# Indexing all files for the content
indexes, doc_len, P_tC = index_file(glob.glob("data/aquaint/**/*.gz", recursive=True), section = 'content')
print(len(indexes), len(doc_len), len(P_tC))

# Writing the data to the appropriate files; `with` closes each file once it is written
with open("data/indexes_content.p", "wb") as f:
    pickle.dump(indexes, f)
with open("data/doc_len_content.p", "wb") as f:  # Needed for BM25 and LM (Jelinek-Mercer smoothing)
    pickle.dump(doc_len, f)
with open("data/P_tC_content.p", "wb") as f:  # Needed for LM (Jelinek-Mercer smoothing)
    pickle.dump(P_tC, f)

Finished indexing 1033461 files in 3344 gz files.
824277 1033461 824277


In [10]:
# Indexing all files for the title
indexes, doc_len, P_tC = index_file(glob.glob("data/aquaint/**/*.gz", recursive=True), section = 'title')
print(len(indexes), len(doc_len), len(P_tC))

# Writing the data to the appropriate files; `with` closes each file once it is written
with open("data/indexes_title.p", "wb") as f:
    pickle.dump(indexes, f)
with open("data/doc_len_title.p", "wb") as f:  # Needed for BM25 and LM (Jelinek-Mercer smoothing)
    pickle.dump(doc_len, f)
with open("data/P_tC_title.p", "wb") as f:  # Needed for LM (Jelinek-Mercer smoothing)
    pickle.dump(P_tC, f)

Finished indexing 1033461 files in 3344 gz files.
79448 1033461 79448
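To show why `doc_len` and `P_tC` are stored alongside the postings, here is a rough sketch of how the three structures could feed a Jelinek-Mercer smoothed language model score. The function name `lm_jm_score`, the default `lam=0.1`, the handling of unseen terms, and the convention that `lam` weights the collection model are all assumptions; the actual retrieval model belongs to Part 2.

```python
import math

def lm_jm_score(query_terms, doc_id, indexes, doc_len, P_tC, lam=0.1):
    """Log-likelihood of the query under a Jelinek-Mercer smoothed document model.

    Assumed convention: P(t|d) = (1 - lam) * tf / |d| + lam * P(t|C).
    """
    score = 0.0
    dl = doc_len.get(doc_id, 0)
    for term in query_terms:
        if term not in P_tC:
            continue  # term never occurs in the collection; skipped here by assumption
        tf = indexes.get(term, {}).get(doc_id, 0)
        p_doc = tf / dl if dl > 0 else 0.0
        score += math.log((1 - lam) * p_doc + lam * P_tC[term])
    return score
```

The query terms are assumed to be tokenized the same way as the documents, so that they match the keys of `indexes` and `P_tC`.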
