# Assignment 2A, Part 1: Indexer

Index the document collection and save the index to disk.  

**IMPORTANT**: The collection and index take up several hundred megabytes. Do NOT push them to GitHub!

Work on a small sample of documents while developing your solution. You only need to build the full index once you reach Part 2 of the assignment, since you may realize later that certain refinements are needed.
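
One way to follow the sampling advice is to index only a handful of source files during development. The sketch below assumes the `data/aquaint/**/*.gz` layout used later in this notebook; `sample_files`, `k`, and `seed` are hypothetical names introduced for illustration only.

```python
import glob
import random


def sample_files(pattern, k, seed=0):
    """Return up to k file paths matching pattern, chosen reproducibly."""
    files = sorted(glob.glob(pattern, recursive=True))
    rng = random.Random(seed)  # fixed seed so every run sees the same sample
    return rng.sample(files, min(k, len(files)))


# Assumed location of the collection; adjust to wherever the data lives.
dev_files = sample_files("data/aquaint/**/*.gz", k=3)
print(len(dev_files), "files selected for development")
```

Passing `dev_files` instead of the full file list to the indexing loop keeps each development run short.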

You have two main options to implement the inverted index: (1) all by yourself from scratch or (2) using the [HashedIndex](https://pypi.org/project/hashedindex/) Python library. There is no third option.
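
For option (1), a from-scratch inverted index can be as simple as a dictionary that maps each term to its postings. The following is a minimal sketch, not a reference solution: `build_inverted_index` and the two toy documents are made up for illustration, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a {doc_id: term frequency} posting dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; a real indexer would also strip
        # punctuation and remove stopwords.
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


# Hypothetical two-document collection, for illustration only.
sample_docs = {"d1": "new home sales", "d2": "home sales rise in new home market"}
index = build_inverted_index(sample_docs)
print(index["home"])
```

The same term-to-postings shape is what the HashedIndex-based code in this notebook ends up storing.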

You are required to adhere to the structure provided below.

The code for parsing the gzip files in the collection is already given.

You may decide to build two separate indices for the two document fields (title and content) or to keep them together in the same structure.
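
If you keep both fields in the same structure, one possible layout is a nested dictionary keyed by field first and term second. This is only a sketch under that assumption; `add_posting` and `combined_index` are hypothetical names, and the notebook itself builds two separate indices.

```python
from collections import defaultdict


def add_posting(index, field, term, doc_id):
    """Increment the frequency of term in the given field of document doc_id."""
    postings = index[field].setdefault(term, {})
    postings[doc_id] = postings.get(doc_id, 0) + 1


# field -> term -> {doc_id: frequency}
combined_index = defaultdict(dict)
add_posting(combined_index, "title", "election", "doc1")
add_posting(combined_index, "content", "election", "doc1")
add_posting(combined_index, "content", "election", "doc1")

print(combined_index["title"]["election"])
print(combined_index["content"]["election"])
```

Separate indices make it easy to save and load each field on its own, while a combined structure keeps everything in a single file.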

In [1]:
import re
import gzip
from bs4 import BeautifulSoup
import hashedindex
from hashedindex import textparser
import glob
import pickle

import nltk
nltk.download('stopwords', quiet=True)  # fetch the stopword list if it is not installed yet
nltk_stopwords = set(nltk.corpus.stopwords.words('english'))

from IPython.display import clear_output # Using IPython.display.clear_output to clear the output of a cell.

In [2]:
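def add_docs_bulk(docs, param_type):
    """Index one field (param_type: 'title' or 'content') of a batch of documents.

    Returns (index, document lengths, per-term collection counts, total term count).
    """
    indx = hashedindex.HashedIndex()
    dlen = {}
    tC = {}
    term_count = 0

    for doc_id, doc in docs.items():
        # word_tokenize yields one tuple per token; term[0] is the token itself.
        terms = list(textparser.word_tokenize(doc[param_type].lower(), stopwords=nltk_stopwords))
        dlen[doc_id] = len(terms)

        for term in terms:
            indx.add_term_occurrence(term[0], doc_id)

    # Switch to the plain term -> {doc_id: frequency} mapping behind the index.
    indx = indx.items()
    for term, doc_freq_pair in indx.items():
        doc_freq_pair = dict(doc_freq_pair)
        indx[term] = doc_freq_pair
        # Occurrences of this term across all documents in the batch.
        term_freq = sum(doc_freq_pair.values())

        tC[term] = term_freq
        term_count += term_freq  # running total; kept separate from the per-term count

    return (indx, dlen, tC, term_count)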

## Indexing a given data file

**NOTE**: Each source gzip file contains several documents. The function below parses a source file and then calls `add_docs_bulk()` to index all of its documents in bulk.

In [3]:
def index_file(file_name, param_type):
    docs = {}
    with gzip.open(file_name, "rt") as fin:
        print("Working with:", file_name)
        is_body = False
        doc_id, body = None, None

        for line in fin:
            line = line.strip()
            if line.startswith("<DOCNO>"):  # get doc id
                doc_id = re.sub("<DOCNO> | </DOCNO>", "", line)
            elif line.startswith("<BODY>"):  # start to parse body
                is_body = True
                body = []
            elif line.startswith("</BODY>"):  # finished reading body
                soup = BeautifulSoup("\n".join(body), "lxml")
                headline = soup.find("headline")
                text = soup.find("text")
                docs[doc_id] = {
                    "title": headline.text if headline is not None else "",  # use an empty string if no <HEADLINE> found
                    "content": text.text if text is not None else ""  # everything inside <TEXT> is indexed as content
                }
                # get ready for next document
                doc_id = None
                is_body = False
            elif is_body:  # accumulate body content
                body.append(line)
        clear_output()
        return add_docs_bulk(docs, param_type)
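
The document ID extraction in `index_file()` relies on the `<DOCNO>` tags being separated from the ID by single spaces. The sketch below shows that behavior on a made-up line (the ID is hypothetical) and, as an assumption about what a more tolerant variant could look like, a pattern that also accepts missing or extra whitespace.

```python
import re

# Hypothetical input line in the shape the parser expects.
line = "<DOCNO> APW19980601.0001 </DOCNO>"

# Pattern used in index_file(): strips "<DOCNO> " and " </DOCNO>".
doc_id = re.sub("<DOCNO> | </DOCNO>", "", line)
print(doc_id)

# A more tolerant alternative (an assumption, not part of the given code):
# removes the tags together with any surrounding whitespace.
tolerant_id = re.sub(r"\s*</?DOCNO>\s*", "", "<DOCNO>APW19980601.0001</DOCNO>")
print(tolerant_id)
```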

## Indexing all the files and writing the results to pickle files

In [4]:
all_files = glob.glob("data/aquaint/**/*.gz", recursive=True)

In [5]:
# For Content
indx = {}
d_len = {}
Ptc = {}
term_count = 0

for file in all_files:
    new_indx, dlen, tC, tc = index_file(file, 'content')
    
    for k, val in new_indx.items():
        if k in indx.keys():
            indx[k].update(new_indx[k])
        else:
            indx[k] = new_indx[k]
    
    d_len.update(dlen)
    
    for term, count in tC.items():
        if term in Ptc.keys():
            Ptc[term] = Ptc[term] + count
        else:
            Ptc[term] = count
            
    term_count += tc
    
for term, count in Ptc.items():
    Ptc[term] = Ptc[term]/term_count

In [6]:
# Write each structure to its own pickle file; `with` closes the file after writing.
for name, obj in (("indx", indx), ("d_len", d_len), ("PtC", Ptc)):
    with open("data/content_" + name + ".p", "wb") as fout:
        pickle.dump(obj, fout)
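
Before relying on the saved files in Part 2, it can be worth checking that a pickled structure loads back unchanged. The round trip below is a small sketch on a toy index written to a temporary directory; `toy_index` and the file name are illustrative and are not the real data.

```python
import os
import pickle
import tempfile

# Toy index in the same term -> {doc_id: frequency} shape as `indx` above.
toy_index = {"election": {"doc1": 2, "doc7": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_indx.p")
    with open(path, "wb") as fout:
        pickle.dump(toy_index, fout)
    with open(path, "rb") as fin:
        restored = pickle.load(fin)

print(restored == toy_index)
```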

In [8]:
# For Title
indx = {}
d_len = {}
Ptc = {}
term_count = 0

for file in all_files:
    new_indx, dlen, tC, tc = index_file(file, 'title')
    
    for k, val in new_indx.items():
        if k in indx.keys():
            indx[k].update(new_indx[k])
        else:
            indx[k] = new_indx[k]
    
    d_len.update(dlen)
    
    for term, count in tC.items():
        if term in Ptc.keys():
            Ptc[term] = Ptc[term] + count
        else:
            Ptc[term] = count
            
    term_count += tc
    
for term, count in Ptc.items():
    Ptc[term] = Ptc[term]/term_count
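
The merge-and-normalize step in the two indexing cells above turns per-file term counts into collection-level relative frequencies (count of the term divided by the total number of term occurrences). A compact way to express the same idea, sketched here with `collections.Counter` on made-up counts, is:

```python
from collections import Counter

# Hypothetical per-file term counts, in the shape of `tC` from add_docs_bulk().
per_file_counts = [{"home": 3, "sales": 1}, {"home": 1, "market": 3}]

collection_counts = Counter()
for counts in per_file_counts:
    collection_counts.update(counts)  # adds the counts term by term

total = sum(collection_counts.values())
p_t_c = {term: count / total for term, count in collection_counts.items()}
print(p_t_c["home"])
```

Because every count is divided by the same total, the resulting values sum to 1 over the vocabulary.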

In [9]:
# Write each structure to its own pickle file; `with` closes the file after writing.
for name, obj in (("indx", indx), ("d_len", d_len), ("PtC", Ptc)):
    with open("data/title_" + name + ".p", "wb") as fout:
        pickle.dump(obj, fout)