## Inverted index
An inverted index (also called a postings file or inverted file) is a database index that stores a mapping from content, such as words or numbers, to its locations in a table, a document, or a set of documents. It is named in contrast to a forward index, which maps from documents to content.
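
As a minimal sketch of the contrast between the two structures, the snippet below builds a forward index and then inverts it. The two-document collection and the names `docs`, `forward_index`, and `inverted_index` are made up for illustration.

```python
# Hypothetical toy collection: document ID -> text
docs = {1: 'red apple', 2: 'green apple'}

# Forward index: document ID -> list of terms in that document
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> sorted list of IDs of documents containing it
inverted_index = {}
for doc_id, doc_terms in forward_index.items():
    for term in set(doc_terms):  # set() so each document is added once per term
        inverted_index.setdefault(term, []).append(doc_id)
for doc_ids in inverted_index.values():
    doc_ids.sort()

print(forward_index)
print(inverted_index)
```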

### Exercise 1.1
Draw the inverted index that would be built for the following document collection:

**Doc 1** new home sales top forecasts  
**Doc 2** home sales rise in july  
**Doc 3** increase in home sales in july  
**Doc 4** july new home sales rise  

In [1]:
from collections import OrderedDict

In [2]:
collection = [
    'new home sales top forecasts',    # Doc 1
    'home sales rise in july',         # Doc 2
    'increase in home sales in july',  # Doc 3
    'july new home sales rise'         # Doc 4
]

Tokenize the text, turning each document into a list of tokens.

In [3]:
# simple tokenization: split each document on single spaces
tokenized_collection = [text.split(' ') for text in collection]
tokenized_collection

[['new', 'home', 'sales', 'top', 'forecasts'],
 ['home', 'sales', 'rise', 'in', 'july'],
 ['increase', 'in', 'home', 'sales', 'in', 'july'],
 ['july', 'new', 'home', 'sales', 'rise']]
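
Splitting on a single space is enough for this collection, but it would leave punctuation attached to tokens and keep the original case. A slightly more forgiving tokenizer might look like the following sketch; the helper name `tokenize`, the lowercasing, and the `\w+` pattern are assumptions, not part of the exercise.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r'\w+', text.lower())

print(tokenize('Home sales rise in July.'))
```

On the four documents above, this helper produces the same tokens as the whitespace split.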

Build the inverted index: a dictionary of terms, each pointing to a postings list of the documents in which the term occurs.

In [4]:
postings = {}

for index, document in enumerate(tokenized_collection):
    doc_id = index + 1  # 1-based IDs to match the Doc 1..4 labels
    for term in document:
        doc_ids = postings.setdefault(term, [])
        # record each document at most once per term
        if not doc_ids or doc_ids[-1] != doc_id:
            doc_ids.append(doc_id)

In [5]:
# the document frequency of a term = the length of its postings list
ordered_postings = OrderedDict(sorted(postings.items(), key=lambda t: t[0]))
ordered_postings

OrderedDict([('forecasts', [1]),
             ('home', [1, 2, 3, 4]),
             ('in', [2, 3]),
             ('increase', [3]),
             ('july', [2, 3, 4]),
             ('new', [1, 4]),
             ('rise', [2, 4]),
             ('sales', [1, 2, 3, 4]),
             ('top', [1])])
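
The index can then answer a conjunctive query by intersecting postings lists. The following is a sketch of a two-pointer merge over sorted lists; the function name `intersect` and the query `home AND july` are illustrative choices, and the two lists are copied from the output above.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear merge."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # document appears in both lists
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

home_postings = [1, 2, 3, 4]
july_postings = [2, 3, 4]
print(intersect(home_postings, july_postings))
```

The merge relies on both lists being sorted by document ID, which holds here because documents are indexed in order.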

### Exercise 1.2

#### Consider these documents:  
**Doc 1** breakthrough drug for schizophrenia  
**Doc 2** new schizophrenia drug  
**Doc 3** new approach for treatment of schizophrenia  
**Doc 4** new hopes for schizophrenia patients  

Draw the term-document incidence matrix for this document collection.
***
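*A term-document incidence matrix is a binary matrix that records which terms occur in which documents: an entry is 1 if the term occurs in the document and 0 otherwise.  
Here rows correspond to terms and columns correspond to documents. The transposed layout, with documents as rows and terms as columns, is called a document-term matrix, and a variant that stores term frequencies instead of 0/1 values is also common.*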
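
The difference between term counts and binary incidence shows up whenever a term repeats inside a document. As a small illustration, the sketch below uses Doc 3 from Exercise 1.1, where `in` occurs twice; the variable names are arbitrary.

```python
from collections import Counter

doc = 'increase in home sales in july'

# Frequency: how many times each term occurs in the document
counts = Counter(doc.split())

# Incidence: 1 for every term that occurs at least once
incidence = {term: 1 for term in counts}

print(counts['in'], incidence['in'])
```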

In [6]:
import numpy as np
import pandas as pd

In [7]:
collection = [
    'breakthrough drug for schizophrenia',          # Doc 1
    'new schizophrenia drug',                       # Doc 2
    'new approach for treatment of schizophrenia',  # Doc 3
    'new hopes for schizophrenia patients'          # Doc 4
]

In [8]:
# collect all terms from the documents

terms = set()

for text in collection:
    terms.update(text.split(' '))
terms_list = sorted(terms)
print('Terms:', ', '.join(terms_list))

Terms: approach, breakthrough, drug, for, hopes, new, of, patients, schizophrenia, treatment


In [9]:
columns_count = len(collection)
rows_count = len(terms_list)
term_doc_matrix = np.zeros(shape=(rows_count, columns_count), dtype=np.int8)

In [10]:
for term_index, term in enumerate(terms_list):
    for text_index, text in enumerate(collection):
        # match whole tokens; str.find would also match inside longer words
        if term in text.split(' '):
            term_doc_matrix[term_index, text_index] = 1

In [11]:
rows_names = pd.Index(terms_list)
# column labels are 0-based: Doc_0 is Doc 1 in the exercise text, Doc_3 is Doc 4
columns_names = pd.Index(['Doc_%d' % i for i in range(len(collection))])
df = pd.DataFrame(data=term_doc_matrix, index=rows_names, columns=columns_names)
df

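| | Doc_0 | Doc_1 | Doc_2 | Doc_3 |
|---|---|---|---|---|
| approach | 0 | 0 | 1 | 0 |
| breakthrough | 1 | 0 | 0 | 0 |
| drug | 1 | 1 | 0 | 0 |
| for | 1 | 0 | 1 | 1 |
| hopes | 0 | 0 | 0 | 1 |
| new | 0 | 1 | 1 | 1 |
| of | 0 | 0 | 1 | 0 |
| patients | 0 | 0 | 0 | 1 |
| schizophrenia | 1 | 1 | 1 | 1 |
| treatment | 0 | 0 | 1 | 0 |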


### Exercise 1.3
For the document collection shown in **Exercise 1.2**, what are the returned results for
these queries:  
*a.* ```schizophrenia AND drug```  
*b.* ```for AND NOT(drug OR approach)```

In [12]:
# schizophrenia AND drug
schizophrenia_vector = df.loc['schizophrenia']
drug_vector = df.loc['drug']
query_res = np.bitwise_and(schizophrenia_vector, drug_vector)
query_res

Doc_0    1
Doc_1    1
Doc_2    0
Doc_3    0
dtype: int8

In [13]:
# for AND NOT(drug OR approach)
for_vector = df.loc['for']
drug_vector = df.loc['drug']
approach_vector = df.loc['approach']

drug_or_approach = np.bitwise_or(drug_vector, approach_vector)
# logical NOT of a 0/1 vector; np.bitwise_not on int8 would give -1/-2 instead of 1/0
not_doa = 1 - drug_or_approach
query_res = np.bitwise_and(for_vector, not_doa)
query_res

Doc_0    0
Doc_1    0
Doc_2    0
Doc_3    1
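
To read the answers off as document names, each 0/1 vector can be used to select labels. The sketch below repeats both queries in plain Python; the incidence rows are copied from the matrix in Exercise 1.2, while `doc_labels`, `rows`, and `matching_docs` are hypothetical names introduced here.

```python
doc_labels = ['Doc 1', 'Doc 2', 'Doc 3', 'Doc 4']

# Incidence rows copied from the term-document matrix
rows = {
    'schizophrenia': [1, 1, 1, 1],
    'drug':          [1, 1, 0, 0],
    'for':           [1, 0, 1, 1],
    'approach':      [0, 0, 1, 0],
}

def matching_docs(vector):
    """Return the labels of the documents whose entry in the vector is 1."""
    return [label for label, bit in zip(doc_labels, vector) if bit]

# schizophrenia AND drug
query_a = [s & d for s, d in zip(rows['schizophrenia'], rows['drug'])]

# for AND NOT(drug OR approach); 1 - x is the logical NOT of a 0/1 value
query_b = [f & (1 - (d | a))
           for f, d, a in zip(rows['for'], rows['drug'], rows['approach'])]

print(matching_docs(query_a))
print(matching_docs(query_b))
```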
dtype: int8