# Indexing environmental documents

In this notebook we write a script that indexes our documents. Each document is given an ID. For each word that appears in the documents, we record the IDs of the documents that contain it.
Suppose the word 'car' appears in the documents with IDs 3, 13 and 103. Then our index dictionary will contain the key 'car' with the value [3, 13, 103].
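
As a minimal sketch of the idea, the snippet below builds such an index over a made-up toy corpus. The names `toy_docs` and `toy_index` and the texts are hypothetical and only serve the illustration; the real documents come from the database further down.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> text (not taken from the database)
toy_docs = {
    3: "the car emits exhaust",
    7: "air quality report",
    13: "electric car subsidies",
    103: "car free city centre",
}

# word -> set of IDs of the documents that contain the word
toy_index = defaultdict(set)
for doc_id, text in toy_docs.items():
    for word in text.split():
        toy_index[word].add(doc_id)

print(sorted(toy_index["car"]))  # [3, 13, 103]
```

Using a set per word means a document ID is stored once even if the word occurs many times in that document.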

In [1]:
import psycopg2 # Database access
import json # Output file
from collections import defaultdict

Connect to the database:

In [2]:
conn = psycopg2.connect(user='postgres', password='dbpass', database='eurlex_environment_only')
pg = conn.cursor()
print(conn.get_dsn_parameters())

{'user': 'postgres', 'dbname': 'eurlex_environment_only', 'port': '5432', 'tty': '', 'options': '', 'sslmode': 'prefer', 'sslcompression': '0', 'krbsrvname': 'postgres', 'target_session_attrs': 'any'}


Collect all documents from the database:

In [3]:
pg.execute("""
SELECT * FROM documents
""")
all_documents = pg.fetchall()

Now we need to prepare mappings that connect document IDs to CELEX numbers and vice versa. We also need to create the index dictionary.

In [4]:
doc2id = {} # Dictionary that will map from document CELEX number to ID
id2doc = {} # Dictionary that will map from document ID to CELEX

index = defaultdict(set) # Our index dictionary: keys (words), values (sets of document IDs)

With the function below we transform the document data into a friendlier structure: a flat list of tokens.

In [5]:
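from nltk.corpus import stopwords
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation

# Note: this rebinds the name `stopwords` from the NLTK corpus reader to the English stopword collection.
# A set makes the membership tests in the loops below fast.
stopwords = set(stopwords.words('english'))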

def tokenize_document(doc):
    """
    Tokenize a document row: lowercase the text, strip punctuation and
    remove English stopwords. Returns a list of tokens.
    """

    # Relies on the column order returned by SELECT * FROM documents
    doc_id, celex, title, author, form, date, text = doc

    words = []

    for part in [title, author, form, text]:
        if part is None:
            continue
        # Lowercase, replace punctuation with spaces, then split on whitespace
        tokens = preprocess_string(part, [str.lower, strip_punctuation])
        words += [w for w in tokens if w not in stopwords]

    return words
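
To make the effect of the tokenizer concrete, here is a standard-library approximation of the same three steps (lowercase, strip punctuation, drop stopwords). It is only a sketch: `toy_tokenize` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and only ASCII punctuation is handled.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself uses NLTK's full English list.
TOY_STOPWORDS = {"the", "of", "on", "and", "a", "in"}

# Map every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})

def toy_tokenize(text):
    """Lowercase, replace punctuation with spaces, split, drop stopwords."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(toy_tokenize("Directive on the protection of air-quality, 2008."))
# ['directive', 'protection', 'air', 'quality', '2008']
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated terms such as "air-quality" as two separate tokens.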

We are now ready to start indexing. For each document, we tokenize its contents and add the document ID to the set stored under every word that appears in it. Along the way we also fill the two CELEX/ID mappings.


In [6]:
for i, doc in enumerate(all_documents):
    
    doc_id, celex = doc[0], doc[1]
    doc2id[celex] = doc_id
    id2doc[doc_id] = celex
    
    words = tokenize_document(doc)
    for word in words:
        index[word].add(doc_id)
    
    if i % 10000 == 0:
        print(i, len(index))

0 85
10000 286227
20000 418779
30000 792069
40000 1219448
50000 1527521
60000 1849271
70000 1871015
80000 2018149
90000 2019121
100000 2019645
110000 2020269
120000 2026726


We follow the same procedure for document descriptors, with one small adaptation: we also add each full descriptor to the index vocabulary. For example, the descriptor 'Air pollution' adds the entries 'air', 'pollution' and 'air pollution' to the index vocabulary.

In [7]:
# execute() only runs the query; the rows are retrieved with fetchall()
pg.execute(
    """SELECT * FROM document_descriptors"""
)
all_descriptors = pg.fetchall()

In [8]:
for i, (celex, descriptor) in enumerate(all_descriptors):
    
    doc_id = doc2id[celex]
    
    # Replace punctuation with spaces and make the string lowercase
    descriptor = strip_punctuation(descriptor).lower()
    # split() without arguments skips the empty strings that repeated spaces would produce
    for word in descriptor.split():
        if word not in stopwords:
            index[word].add(doc_id)
    
    index[descriptor].add(doc_id) # we include the whole descriptor in our vocabulary
    
    if i % 100000 == 0:
        print(i, celex)

0 12001C_DCL_09
100000 32002D0995
200000 32014R1211
300000 52007SC0274
400000 62014CA0167
500000 91985E002108
600000 91993E000577
700000 92002E003658
800000 92013E005407
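
The descriptor expansion described above can be sketched in isolation. The helper `descriptor_terms` is a hypothetical name introduced for this illustration; it skips punctuation stripping and, unlike the cell above, normalises whitespace in the full phrase.

```python
def descriptor_terms(descriptor, stopwords=frozenset()):
    """Return the index keys a descriptor contributes: each word plus the full phrase."""
    # Lowercase and collapse any repeated whitespace
    phrase = " ".join(descriptor.lower().split())
    terms = {w for w in phrase.split() if w not in stopwords}
    terms.add(phrase)
    return terms

print(sorted(descriptor_terms("Air pollution")))
# ['air', 'air pollution', 'pollution']
```

Keeping the full phrase as its own key lets an exact multi-word query such as 'air pollution' be answered with a single dictionary lookup.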


We repeat a similar procedure for document subjects.

In [9]:
# execute() only runs the query; the rows are retrieved with fetchall()
pg.execute(
    """SELECT * FROM document_subjects"""
)
all_subjects = pg.fetchall()

In [10]:
for i, (celex, subject) in enumerate(all_subjects):
    
    doc_id = doc2id[celex]
    
    # Replace punctuation with spaces and make the string lowercase
    subject = strip_punctuation(subject).lower()
    # split() without arguments skips the empty strings that repeated spaces would produce
    for word in subject.split():
        if word not in stopwords:
            index[word].add(doc_id)
    
    index[subject].add(doc_id) # we include the whole subject in our vocabulary
    
    if i % 100000 == 0:
        print(i, celex)

0 12001C_DCL_09
100000 92011E002990


In [11]:
# Average and maximum number of documents per index entry
print(sum(len(e) for e in index.values())/len(index))
print(max(len(e) for e in index.values()))

17.494677999018144
122911


And now save the output into the index.json file.

In [13]:
index_reformated = {k: sorted(v) for k, v in index.items()} # Sets are not JSON serializable, so convert them to sorted lists.

with open('index.json', 'w') as outfile:
    json.dump({
        'doc2id' : doc2id,
        'id2doc' : id2doc,
        'index' : index_reformated},
        outfile, indent=1)
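
One detail to keep in mind when reading the file back: JSON object keys are always strings, so the integer keys of `id2doc` come back as strings. The sketch below shows the round trip and a simple AND query on hypothetical toy data with the same three top-level keys. The file name, the CELEX placeholders and the helper `search_all` are assumptions made for the illustration.

```python
import json
import os
import tempfile

# Hypothetical toy data with the same three top-level keys as index.json
toy = {
    "doc2id": {"CELEX_A": 1, "CELEX_B": 2},
    "id2doc": {1: "CELEX_A", 2: "CELEX_B"},
    "index": {"air": [1, 2], "pollution": [2]},
}

path = os.path.join(tempfile.mkdtemp(), "toy_index.json")
with open(path, "w") as f:
    json.dump(toy, f)

with open(path) as f:
    loaded = json.load(f)

# Integer keys were written as strings; convert them back
id2doc = {int(k): v for k, v in loaded["id2doc"].items()}

def search_all(index, words):
    """Return the sorted IDs of documents that contain every word (AND query)."""
    postings = [set(index.get(w, [])) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

hits = search_all(loaded["index"], ["air", "pollution"])
print([id2doc[i] for i in hits])  # ['CELEX_B']
```

Intersecting the ID sets is the simplest way to combine several query words; ranking the results is outside the scope of this sketch.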

In [14]:
print(index_reformated['deforestation'])

[370, 371, 484, 924, 1310, 1659, 1673, 1906, 2004, 2141, 2276, 2484, 2485, 2500, 2809, 4399, 4692, 8399, 12716, 12956, 13965, 16901, 17032, 27917, 28058, 28912, 29018, 29099, 30009, 30702, 30930, 31227, 32123, 33164, 33219, 33309, 34298, 34811, 34997, 35696, 38310, 38417, 38806, 39267, 39295, 39358, 40195, 40564, 40601, 41456, 41468, 41508, 41522, 41804, 41805, 41806, 41866, 42109, 42110, 42114, 42125, 42162, 42409, 42421, 42653, 42688, 42718, 42867, 42947, 42959, 43238, 43338, 43536, 43542, 43608, 43739, 43881, 44068, 44566, 44725, 44756, 45085, 45329, 45406, 45770, 45792, 45795, 45803, 46004, 46266, 46273, 46301, 46304, 46889, 46918, 46931, 46943, 47007, 47439, 47979, 47989, 47995, 48038, 48082, 48252, 48286, 48535, 48647, 48669, 48675, 48687, 48708, 48721, 48730, 48780, 48941, 48957, 48965, 48966, 48978, 48984, 48994, 49311, 49372, 49399, 49405, 49426, 49447, 49458, 49486, 49499, 49557, 49640, 49641, 49758, 49765, 50052, 50053, 50062, 50084, 50243, 50271, 50300, 50321, 50332, 50346,