# Indexing environmental documents

In this notebook we will write a script that indexes our documents. Each document is given an ID. For each word that appears in the documents, we record the IDs of the documents in which it appears.
Suppose the word 'car' appears in documents with IDs 3, 13 and 103. Then our index dictionary will contain the key 'car' with the value [3, 13, 103].
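
To make the idea concrete before touching the database, here is a minimal sketch of such an index built from a few made-up documents. The names `build_toy_index`, `toy_docs` and `toy_index` are hypothetical and exist only for this illustration; splitting on whitespace is a simplification of the tokenization used later in the notebook.

```python
from collections import defaultdict

def build_toy_index(docs):
    """Map each word to the set of IDs of the documents that contain it."""
    toy_index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            toy_index[word].add(doc_id)
    return toy_index

# Made-up documents, keyed by document ID.
toy_docs = {
    3: "electric car emissions",
    13: "car tax",
    20: "river water quality",
    103: "Car recycling targets",
}

toy_index = build_toy_index(toy_docs)
print(sorted(toy_index["car"]))  # [3, 13, 103]
```

The word 'car' maps to exactly the three documents that mention it, regardless of capitalization.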

In [2]:
import psycopg2 # Database access
import json # Output file
from collections import defaultdict

Connect to the database:

In [3]:
conn = psycopg2.connect(user='postgres', password='dbpass', database='eurlex_environment_only')
pg = conn.cursor()
print(conn.get_dsn_parameters())

{'user': 'postgres', 'dbname': 'eurlex_environment_only', 'port': '5432', 'tty': '', 'options': '', 'sslmode': 'prefer', 'sslcompression': '0', 'krbsrvname': 'postgres', 'target_session_attrs': 'any'}


Collect all documents from the database:

In [7]:
pg.execute("""
SELECT * FROM documents
""")
all_documents = pg.fetchall()

Now we need to prepare a mapping that connects document IDs to CELEX numbers and vice versa. We also need to create an index dictionary.

In [9]:
doc2id = {} # Dictionary that will map from document CELEX number to ID
id2doc = {} # Dictionary that will map from document ID to CELEX

index = defaultdict(set) # Our index dictionary: keys (words), values (sets of document IDs)
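
The following sketch shows how these three structures behave on a tiny example. The rows and their IDs are invented for illustration, and the names `sample_rows`, `celex_to_id`, `id_to_celex` and `postings` are hypothetical stand-ins chosen so that the notebook's own variables are not overwritten.

```python
from collections import defaultdict

# Illustrative rows: (ID, CELEX number). The IDs are made up for this sketch.
sample_rows = [(1, "32008L0098"), (2, "32000L0060")]

celex_to_id = {}
id_to_celex = {}
for sample_id, celex in sample_rows:
    celex_to_id[celex] = sample_id
    id_to_celex[sample_id] = celex

# The two dictionaries are inverses of each other.
assert all(id_to_celex[celex_to_id[c]] == c for c in celex_to_id)

# defaultdict(set) creates an empty set on first access,
# and a set silently ignores repeated additions of the same ID.
postings = defaultdict(set)
for word, sample_id in [("waste", 1), ("waste", 1), ("water", 2), ("waste", 2)]:
    postings[word].add(sample_id)

print(sorted(postings["waste"]))  # [1, 2]
```

Using sets means a document that repeats a word many times is still recorded only once for that word.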

With the function below we transform the document data into a friendlier structure: a list of words.

In [22]:
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation

# A set gives fast membership tests; a separate name avoids shadowing the nltk module.
english_stopwords = set(stopwords.words('english'))

def tokenize_document(doc):
    """
    With this function we will tokenize the document. We will remove punctuation, put everything to
    lower case, remove stopwords.
    """
    doc_id, celex, title, author, form, date, text = doc

    words = []
    for part in [title, author, form, text]:
        if part is None:
            continue  # Skip missing fields
        tokens = preprocess_string(part, [str.lower, strip_punctuation])
        words += [w for w in tokens if w not in english_stopwords]

    return words
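
To see what this kind of tokenization produces without needing gensim or the NLTK corpus, here is a standard-library sketch that approximates the same steps. It is not the function above: `simple_tokenize`, `SAMPLE_STOPWORDS` and `PUNCT_RE` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, and the punctuation handling is only our assumption of roughly what `strip_punctuation` does.

```python
import re
from string import punctuation

# Tiny stand-in stop-word list; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"the", "of", "on", "and", "in", "a"}

# One or more punctuation characters, to be replaced by a single space.
PUNCT_RE = re.compile("[%s]+" % re.escape(punctuation))

def simple_tokenize(text, stop_words=SAMPLE_STOPWORDS):
    """Lower-case, strip punctuation, split on whitespace, drop stop words."""
    text = PUNCT_RE.sub(" ", text.lower())
    return [w for w in text.split() if w not in stop_words]

print(simple_tokenize("Directive on the protection of forests, soil and water."))
# ['directive', 'protection', 'forests', 'soil', 'water']
```

Note that repeated words are kept at this stage; duplicates only disappear once the document ID is added to each word's set in the index.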

We are now ready to start indexing. For each document we take every word that appears in its title, author, form and text, and add the document ID to that word's entry in the index dictionary.


In [None]:
for i, doc in enumerate(all_documents):
    
    doc_id, celex = doc[0], doc[1]
    doc2id[celex] = doc_id
    id2doc[doc_id] = celex
    
    words = tokenize_document(doc)
    for word in words:
        index[word].add(doc_id)
    
    if i % 10000 == 0:
        print(i, len(index))

In [26]:
# Average number of documents per indexed word
print(sum(len(e) for e in index.values())/len(index))
# Number of documents containing the most widespread word
print(max(len(e) for e in index.values()))

16.672238906345452
122887


And now save the output into the index.json file.

In [29]:
with open('index.json', 'w') as outfile:
    json.dump({
        'doc2id' : doc2id,
        'id2doc' : id2doc,
        'index' : {k : list(v) for k,v in index.items()}}, # We can't serialize set objects.
        outfile, indent=1)
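
When the file is read back later, two details of JSON matter: object keys are always strings, so the integer keys of `id2doc` come back as strings, and the sets were stored as lists. A minimal loading sketch, assuming document IDs are integers (as the printed index suggests), could look like this. The helper `load_index` and the sample data are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile

def load_index(path):
    """Read a file written in the layout above and restore the Python types."""
    with open(path) as infile:
        data = json.load(infile)
    doc2id = data['doc2id']
    # JSON object keys are always strings, so convert the IDs back to integers.
    id2doc = {int(k): v for k, v in data['id2doc'].items()}
    # Posting lists were stored as lists; turn them back into sets.
    index = {word: set(ids) for word, ids in data['index'].items()}
    return doc2id, id2doc, index

# Round trip on a tiny made-up index, written to a temporary file.
sample = {'doc2id': {'32008L0098': 1},
          'id2doc': {1: '32008L0098'},
          'index': {'waste': [1]}}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample_index.json')
    with open(sample_path, 'w') as outfile:
        json.dump(sample, outfile)
    loaded_doc2id, loaded_id2doc, loaded_index = load_index(sample_path)

print(loaded_id2doc)  # {1: '32008L0098'}
```

Without the `int(k)` conversion, a lookup such as `id2doc[1]` would fail on the loaded data because the key would be the string `'1'`.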

In [30]:
print(index['deforestation'])

{51225, 116762, 45085, 94239, 94240, 51234, 51241, 57386, 57387, 118827, 98351, 51249, 96316, 51270, 59481, 57435, 2141, 57437, 57447, 57448, 59499, 57453, 59503, 79995, 57469, 57470, 98437, 57480, 51343, 49311, 104609, 104610, 55460, 90277, 34997, 55485, 8399, 30930, 104659, 49372, 51423, 108771, 2276, 28912, 53493, 49399, 49405, 110847, 92431, 45329, 49426, 53523, 104721, 49447, 4399, 117039, 49458, 94514, 55606, 55607, 55613, 43338, 49486, 47439, 55639, 29018, 49499, 45406, 39267, 370, 371, 39295, 33164, 49557, 57752, 55714, 29099, 12716, 57773, 98732, 53680, 2484, 2485, 57781, 57782, 53692, 39358, 33219, 2500, 484, 49640, 49641, 41456, 31227, 41468, 16901, 53774, 43536, 53781, 43542, 53782, 53786, 33309, 53791, 53792, 53795, 57894, 57895, 55849, 53810, 115261, 53825, 53826, 57929, 51786, 53839, 57937, 4692, 43608, 94811, 49758, 49765, 57957, 57962, 57965, 55928, 57979, 17032, 57994, 98967, 12956, 55967, 101026, 53931, 78513, 53938, 53949, 53950, 53951, 51906, 51907, 94917, 45770, 5