<a href="https://colab.research.google.com/github/Salvoaf/labComputerVision/blob/main/3_Inverted_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building an inverted index

  - You are given a sample (1000 documents) from the [The Reuters-21578 data collection](http://www.daviddlewis.com/resources/testcollections/reuters21578/) in `data/reuters21578-000.xml`
  - The code that parses the XML and extract a list of preprocessed terms (tokenized, lowercased, stopwords removed) is already given
  - You are also given an `InvertedIndex` class that manages the posting lists operations
  - Your task is to build an inverted index from the input collection.

In [1]:
!git clone https://github.com/Salvoaf/labComputerVision.git

Cloning into 'labComputerVision'...
remote: Enumerating objects: 21, done.[K
remote: Counting objects: 100% (21/21), done.[K
remote: Compressing objects: 100% (18/18), done.[K
remote: Total 21 (delta 8), reused 6 (delta 1), pack-reused 0[K
Unpacking objects: 100% (21/21), done.


In [2]:
!pip install ipytest

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ipytest
  Downloading ipytest-0.12.0-py3-none-any.whl (15 kB)
Collecting pytest>=5.4
  Downloading pytest-7.1.3-py3-none-any.whl (298 kB)
[K     |████████████████████████████████| 298 kB 5.1 MB/s 
Collecting pluggy<2.0,>=0.12
  Downloading pluggy-1.0.0-py2.py3-none-any.whl (13 kB)
Collecting iniconfig
  Downloading iniconfig-1.1.1-py2.py3-none-any.whl (5.0 kB)
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 44.6 MB/s 
Installing collected packages: pluggy, jedi, iniconfig, pytest, ipytest
  Attempting uninstall: pluggy
    Found existing installation: pluggy 0.7.1
    Uninstalling pluggy-0.7.1:
      Successfully uninstalled pluggy-0.7.1
  Attempting uninstall: pytest
    Found existing installation: pytest 3.6.4
    Uninstalling pytest-3.6.4:
      Successfully uninstalled pytest-3.6.4
Successfull

In [3]:
import ipytest
import re

from typing import List, Dict, Union, Any, Callable
from collections import Counter, defaultdict
from xml.dom import minidom
from dataclasses import dataclass

ipytest.autoconfig()

## Parsing documents

Stopwords list

In [4]:
STOPWORDS = ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]

Stripping tags inside <> using regex

In [5]:
def striptags(text: str) -> str:
    """Removes xml tags.

    Args:
        text: Text string with xml tags.

    Returns:
        String without xml tags.
    """
    p = re.compile(r"<.*?>")
    return p.sub("", text)

Parse input text and return a list of indexable terms

In [6]:
def parse(text: str) -> List[str]:
    """Parses documents and removes xml tags and punctuation.

    Args:
        text: Text to parse.

    Returns:
        List of tokens.
    """
    terms = []
    # Replace specific characters with space
    chars = ["'", ".", ":", ",", "!", "?", "(", ")"]
    for ch in chars:
        text = text.replace(ch, " ")

    # Remove tags
    text = striptags(text)

    # Tokenization
    # default behavior of the split is to split on one or more whitespaces
    return [term.lower() for term in text.split() if term not in STOPWORDS]

## Processing the input document collection

  - The collection is given as a single XML file. 
  - Each document is inside `<REUTERS ...> </REUTERS>`.
  - We extract the contents of the `<DATE>`, `<TITLE>`, and `<BODY>` tags.
  - After each extracted document, the provided callback function is called and all document data is passed in a single dict argument.

In [7]:
def process_collection(input_file:str, callback: Callable) -> None:
    """Processes file and calls the callback function for each document in the
    file.

    Args:
        input_file: Path to file to process.
        callback: Function that will be called for each document.
    """
    xmldoc = minidom.parse(input_file)
    # Iterate documents in the XML file
    itemlist = xmldoc.getElementsByTagName("REUTERS")
    for doc_id, doc in enumerate(itemlist):
        date = doc.getElementsByTagName("DATE")[0].firstChild.nodeValue
        # Skip documents without a title or body
        if not (doc.getElementsByTagName("TITLE") and doc.getElementsByTagName("BODY")):
            continue
        title = doc.getElementsByTagName("TITLE")[0].firstChild.nodeValue
        body = doc.getElementsByTagName("BODY")[0].firstChild.nodeValue
        callback({
            "doc_id": doc_id+1,
            "date": date,
            "title": title,
            "body": body
            })

Prints a document"s contents (used as a callback function passed to `process_collection`)

In [8]:
def print_doc(doc: Dict[str, Union[str, int]]) -> None:
    """Print details of the first 5 documents.

    Args:
        doc: Dictionary with document details.
    """
    if doc["doc_id"] <= 5:  # print only the first 5 documents
        print("docID:", doc["doc_id"])
        print("date:", doc["date"])
        print("title:", doc["title"])
        print("body:", doc["body"])
        print("--")

In [9]:
import os
print(os.getcwd())

/content


In [10]:
process_collection("labComputerVision/reuters21578-000.xml", print_doc)

docID: 1
date: 26-FEB-1987 15:01:01.79
title: BAHIA COCOA REVIEW
body: Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
   

## Task 1: Complete the inverted index class

  - The inverted index is an object with methods for adding and fetching postings.
  - The data is stored in a map, where keys are terms and values are lists of postings.
  - Each posting is an object that holds the doc_id and an optional payload.

In [11]:
# Since this is a simple data class, intializing it can be abstracted with
# the use of dataclass decorator.
# https://docs.python.org/3/library/dataclasses.html

@dataclass
class Posting:
    doc_id: int
    payload: Any = None

In [12]:
d = defaultdict(list)
d["a"].append(1)
d["a"].append(5)
d["b"] = 2

print(d["a"])
print(d["b"])
print(d["c"])

[1, 5]
2
[]


In [13]:
class InvertedIndex:

    def __init__(self):
        self._index = defaultdict(list)  #term->[]->[]->[]  =>viene associata ad una word una lista
    
    def add_posting(self, term: str, doc_id: int, payload: Any=None) -> None:
        """Adds a document to the posting list of a term."""
        # append new posting to the posting list
        # TODO: append new posting to the posting list
        newValue = Posting(doc_id,payload)
        self._index[term].append(newValue)


    def get_postings(self, term: str) -> List[Posting]:
        """Fetches the posting list for a given term."""
        # TODO: complete
        if self._index[term]:
          return self._index[term]
        return None

    def get_terms(self) -> List[str]:
        """Returns all unique terms in the index."""
        return self._index.keys() 

Tests.

In [14]:
%%ipytest

def test_postings():
    ind = InvertedIndex()
    ind.add_posting("term", 1, 1)
    ind.add_posting("term", 2, 4)
    # Testing existing term
    postings = ind.get_postings("term")
    assert len(postings) == 2
    assert postings[0].doc_id == 1
    assert postings[0].payload == 1
    assert postings[1].doc_id == 2
    assert postings[1].payload == 4
    # Testing non-existent term
    assert ind.get_postings("xyx") is None

def test_vocabulary():
    ind = InvertedIndex()
    ind.add_posting("term1", 1)
    ind.add_posting("term2", 1)
    ind.add_posting("term3", 2)
    ind.add_posting("term2", 3)
    assert set(ind.get_terms()) == set(["term1", "term2", "term3"])

[32m.[0m[32m.[0m[32m                                                                                           [100%][0m
[32m[32m[1m2 passed[0m[32m in 0.02s[0m[0m


## Task 2: Build an inverted index from the input collection

**TODO**: Complete the code to index the entire document collection.  (The content for each document should be the title and body concatenated)

In [15]:
from typing import List, Tuple
def get_doc_term_matrix(docs: List[str]) -> Tuple[List[int], List[str]]:
    """Generates a document-term matrix and the corresponding vocabulary.
    
    Args:
        docs: List of documents, each given by a list of tokenized terms.
        
    Returns:
        Tuple consisting of the document-term matrix and the corresponding vocabulary.
        In the document-term matrix row `i` corresponds to `docs[i]` and column `j`
        corresponds to the jth element of the vocabulary. Values represent the number
        of times the term appears in the document.
        Terms may be in any order in the vocabulary.
    """
    vocabulary = [] #totale parole presenti in tutti i documenti
    doc_term_matrix = []
    vector = [] #una lista di dizionari che memorizzano per ogni parola del documento(key) la proprio occorrenza(value)
    dictionary = {} #per ogni parola del documento(key) la proprio occorrenza(value)

   
    dictionary = {}
    for word in docs:#estraggo ogni parola del documento in questione
      if word in dictionary.keys(): #controllo se nel dizionario è presente una certa parola del doc
        dictionary[word] = dictionary[word] +1 #se ho già la parola aumento l'occorrenza
      else:
        dictionary[word] =  1 #se è la prima volta che la parola si presenta nel dizionario
      if word not in vocabulary: 
        vocabulary.append(word)


   
    list_vec = [] #utilizzo questa lista per popolare il nostro indice
    for word in vocabulary:
      if word in dictionary.keys():
        list_vec.append(dictionary[word])
      else:
        list_vec.append(0)
    return list_vec, vocabulary

In [16]:
ind = InvertedIndex()

def index_doc(doc: Dict[str, Union[str, int]]) -> None:
    """Index document by concatenating document title and body.

    Args:
        doc: Document details.
    """
    text = doc["title"] + " " + doc["body"]
    terms = parse(text)
    matrix, vocabolario = get_doc_term_matrix(terms)  # list of terms in the document
    # TODO: index the document (add all terms with freqs using `ind.add_posting()`)
    for i , count_word in enumerate(matrix):
      ind.add_posting(vocabolario[i], i, count_word)

process_collection("labComputerVision/reuters21578-000.xml", index_doc)


In [17]:
c = list(ind.get_terms())
t = ind.get_postings(c[1])
print(t)

[Posting(doc_id=1, payload=7), Posting(doc_id=2, payload=7)]


## Task 3: Save the inverted index to a file

Save the inverted index to a file (`data/index.dat`). Use a simple text format with `termID docID1:freq1 docID2:freq2 ...` per line, e.g.,

```
xxx 1:1 2:1 3:2
yyy 2:1 4:2
zzz 1:3 3:1 5:2
...
```

Implement this by (1) adding a `write_to_file(self, filename)` method to the `InvertedIndex` class and then (2) invoking that method in the cell below.

In [18]:
!ls

labComputerVision  sample_data


In [23]:
# TODO
all_word = list(ind.get_terms())
file = open("labComputerVision/InvertedIndex.dat"  ,"w")
for word in all_word:
  index_of_word = ind.get_postings(word)
  data = ""
  for i, word_doc in enumerate(index_of_word):
    doc_id = word_doc.doc_id
    payload= word_doc.payload
    if i==0:
      data = f"{word} {doc_id}:{payload}"
    else:
      data = data + f" {doc_id}:{payload}"
  data = data + "\n\n"
  file.write(data)
  
file.close()

In [24]:
from google.colab import files
import pandas as pd

files.download('labComputerVision/InvertedIndex.dat')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Task 4 (advanced, optional): Plot collection size against index size

Create a plot that compares the size of the document collection (bytes) against the size of the corresponding index (bytes) on the y-axis vs. with respect to the number of documents on the x-axis. You may use [Matplotlib](https://www.tutorialspoint.com/jupyter/jupyter_notebook_plotting.htm) for plotting.

In our solution, we create a different callback function and use that one for indexing.

In [None]:
# TODO