# Demonstration

The following demonstration will use the training set of the OHSUMED corpus. This training set was used in the Filtering Track of the 9th edition of the Text REtrieval Conference (TREC-9). We will use it for the information retrieval exercises of this workshop. Download [ohsumed.zip](ohsumed.zip) into the same folder as this notebook. The file is part of the git repository, so if you have cloned or downloaded the entire repository you will have the file in the right folder.

The following code unzips the file:

In [2]:
import zipfile
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile('ohsumed.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

To help you read the data, we provide the file `ohsumed.py` (included in the zip file above), which offers a simple API to the data. Importing it makes the following variables available:


1. `index`: a dictionary with document IDs as keys, and document text as values.
2. `questions`: a dictionary with query IDs as keys, and query text as values.
3. `answers`: a dictionary with query IDs as keys, and a set with the IDs of known relevant documents as values. This information is used for evaluation.

Below are some examples:

In [3]:
import ohsumed

Reading OHSUMED data


In [5]:
len(ohsumed.index)

54710

In [6]:
sorted(list(ohsumed.index.keys()))[:10]

['87049087',
 '87049088',
 '87049089',
 '87049090',
 '87049091',
 '87049092',
 '87049093',
 '87049094',
 '87049095',
 '87049096']

In [7]:
ohsumed.index['87097544']

'Serum lipids and lipoproteins were examined in a group of 45 healthy postmenopausal women who were treated for 2 years with either 3 mg of percutaneous estradiol (n = 20) or placebo (n = 25). Percutaneous estradiol was given alone during the first year of treatment and in combination with oral micronized progesterone (200 mg) for 12 days of each cycle during the second year. The women were examined every 3 months throughout the 2 years. Percutaneous estrogen therapy significantly reduced total serum cholesterol and low-density lipoprotein cholesterol, whereas no significant differences were observed in serum triglycerides and high-density lipoprotein cholesterol. Addition of oral progesterone during the second year of treatment did not produce any significant alterations in serum total cholesterol or low-density lipoprotein cholesterol, both of which remained significantly reduced. Serum triglycerides remained virtually unchanged, whereas a slight but significant increase (p less than

In [8]:
len(ohsumed.questions)

63

In [9]:
sorted(list(ohsumed.questions.keys()))[:10]

['OHSU1',
 'OHSU10',
 'OHSU11',
 'OHSU12',
 'OHSU13',
 'OHSU14',
 'OHSU15',
 'OHSU16',
 'OHSU17',
 'OHSU18']

In [10]:
ohsumed.questions['OHSU1']

'60 year old menopausal woman without hormone replacement therapy Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy'

In [9]:
len(ohsumed.answers)

63

In [11]:
ohsumed.answers['OHSU1']

{'87097544', '87157536', '87157537', '87202778', '87316316', '87316326'}
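Because `answers` stores the known relevant documents of each query as a set, retrieval results can be scored with plain set operations. The following is a minimal sketch, assuming the system output is also a set of document IDs; the function name `precision_recall` and the retrieved IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Relevant documents of OHSU1, copied from the output above
relevant = {'87097544', '87157536', '87157537', '87202778', '87316316', '87316326'}
# Hypothetical system output: two relevant and two non-relevant documents
retrieved = {'87097544', '87157536', '87049087', '87049088'}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the six relevant documents were found (recall 2/6).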

## Inverted index

We are going to build an inverted index of the non-stop words with frequency higher than 5.

The following code reads the files and creates a counter of all words in the corpus (including stop words). We will use NLTK's word tokeniser (read the beginning of [chapter 3 of NLTK's book](http://www.nltk.org/book/ch03.html#processing-raw-text)) to convert each document into a list of tokens. **Note that this code may take some time to run**.

In [12]:
import nltk, collections
nltk.download('stopwords')
nltk.download('punkt')
stop = nltk.corpus.stopwords.words('english')
wordcounter = collections.Counter([w.lower() for k in ohsumed.index
                                             for s in nltk.sent_tokenize(ohsumed.index[k])
                                             for w in nltk.word_tokenize(s)])

[nltk_data] Downloading package stopwords to /Users/jakob/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jakob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
wordcounter.most_common(10)

[('the', 305806),
 ('of', 271953),
 ('.', 254858),
 (',', 239656),
 ('and', 179604),
 ('in', 172449),
 ('to', 107431),
 (')', 96259),
 ('(', 95948),
 ('a', 95281)]

In [18]:
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

The following code creates the inverted index of all non-stop words with frequency higher than 5. **Note that this code may take some time to run.**

In [14]:
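inverted = dict()
stop_set = set(stop)  # membership tests on a set are much faster than on a list
for d in ohsumed.index:
    for w in nltk.word_tokenize(ohsumed.index[d]):
        w = w.lower()
        # Skip stop words and rare words (corpus frequency of 5 or less)
        if w in stop_set or wordcounter[w] <= 5:
            continue
        # Add the document ID to the posting set of this word
        inverted.setdefault(w, set()).add(d)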

In [24]:
print(list(inverted.keys())[:20])

['patients', 'converted', 'ventricular', 'fibrillation', 'organized', 'rhythms', 'ambulance', 'technicians', '(', ')', 'hospital', 'arrival', '.', 'authors', 'analyzed', '271', 'cases', 'managed', 'working', 'without']


In [20]:
sorted(list(inverted.keys()))[2990:3010]

['acc',
 'accelerate',
 'accelerated',
 'accelerates',
 'accelerating',
 'acceleration',
 'accelerations',
 'accelerator',
 'accentuate',
 'accentuated',
 'accentuation',
 'accept',
 'acceptability',
 'acceptable',
 'acceptably',
 'acceptance',
 'accepted',
 'accepting',
 'acceptor',
 'acceptors']

In [16]:
inverted['acceptability']

{'87057543',
 '87067994',
 '87073895',
 '87074134',
 '87114326',
 '87119697',
 '87121859',
 '87129900',
 '87149032',
 '87153185',
 '87193350',
 '87223625',
 '87223856',
 '87224779',
 '87232524',
 '87251875',
 '87273001',
 '87282178',
 '87295871',
 '87297008'}
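The structure of the inverted index is easier to see on a small example. The sketch below builds one for a made-up three-document corpus; `toy_docs` and `toy_stop` are hypothetical, and a whitespace split stands in for NLTK's tokeniser.

```python
from collections import defaultdict

# Toy corpus standing in for ohsumed.index (document ID -> text)
toy_docs = {
    'd1': 'estrogen therapy reduced serum cholesterol',
    'd2': 'serum triglycerides remained unchanged',
    'd3': 'estrogen and progesterone therapy',
}
toy_stop = {'and'}  # hypothetical stop word list

# Each word maps to the set of documents that contain it
toy_inverted = defaultdict(set)
for doc_id, text in toy_docs.items():
    for word in text.lower().split():  # whitespace split, not NLTK's tokeniser
        if word not in toy_stop:
            toy_inverted[word].add(doc_id)

print(sorted(toy_inverted['estrogen']))  # ['d1', 'd3']
print(sorted(toy_inverted['serum']))     # ['d1', 'd2']
```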

The following code saves the inverted index into a pickle file. This way we do not need to compute the inverted index again. Read [Python's documentation on pickle files](https://docs.python.org/3/library/pickle.html) for more detail. Note that the file we created is opened for writing in binary mode, following the advice of this [stackoverflow post about saving pickle files](http://stackoverflow.com/questions/13906623/using-pickle-dump-typeerror-must-be-str-not-bytes).

In [25]:
import pickle
with open('inverted.pickle', 'wb') as f:
    pickle.dump(inverted,f)
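As a quick check that a dictionary of sets survives the round trip, the sketch below pickles a toy index into a temporary folder and reads it back. The names `toy_index` and `toy.pickle` are made up for the example.

```python
import os
import pickle
import tempfile

toy_index = {'serum': {'d1', 'd2'}, 'estrogen': {'d1', 'd3'}}

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy.pickle')
    with open(path, 'wb') as f:   # binary mode for writing
        pickle.dump(toy_index, f)
    with open(path, 'rb') as f:   # binary mode for reading
        restored = pickle.load(f)

print(restored == toy_index)  # True
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or that come from a trusted source.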

## Boolean retrieval

The following code reads the pickle file and returns the set of documents that match this Boolean query:

1. (menopausal OR pregnant) AND woman AND NOT healthy

In [26]:
import pickle
with open('inverted.pickle', 'rb') as f:
    inverted = pickle.load(f)

In [27]:
((inverted['menopausal'] | inverted['pregnant']) & inverted['woman']) - inverted['healthy']

{'87060673',
 '87066899',
 '87097274',
 '87097518',
 '87099263',
 '87114245',
 '87117852',
 '87128881',
 '87134330',
 '87138205',
 '87153548',
 '87153568',
 '87169457',
 '87185313',
 '87226668',
 '87231479',
 '87235637',
 '87251241',
 '87252385',
 '87261426',
 '87281235',
 '87290433',
 '87296136',
 '87316210',
 '87316220',
 '87316328',
 '87324028',
 '87325497'}

Note that it took very little time to run the query. In general, creating the index may take some time but it is needed only once if the files do not change. Queries on the index are very fast.
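When translating Boolean queries into set expressions (`|` for OR, `&` for AND, `-` for AND NOT), operator precedence matters: in Python `-` binds more tightly than `&`, which in turn binds more tightly than `|`. The sketch below illustrates this with made-up posting sets; the variable names and document IDs are hypothetical.

```python
# Toy posting sets (hypothetical document IDs)
menopausal = {'d1', 'd2'}
pregnant = {'d3'}
woman = {'d1', 'd3', 'd4'}
healthy = {'d3'}

# '-' binds more tightly than '&', yet both forms select the same documents,
# because A & (B - C) and (A & B) - C are the same set
implicit = (menopausal | pregnant) & woman - healthy
explicit = ((menopausal | pregnant) & woman) - healthy
print(implicit == explicit)  # True

# Without parentheses around the OR, '-' and '&' are applied before '|'
no_parens = menopausal | pregnant & woman - healthy
print(sorted(explicit), sorted(no_parens))  # ['d1'] ['d1', 'd2']
```

Writing the parentheses explicitly is the safer habit, since the OR group always needs them.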

# Your Turn

## 1. Vector Retrieval

### Exercise 1.1: Boolean Information Retrieval

Create an inverted index of the **NLTK Gutenberg corpus** and save it into a file "gutenbergindex.pickle". To create this index there is no need to look for stop words or word frequencies, since the corpus is not that large. Simply use all the words. Use this index to find the documents that match the following Boolean queries:

1. Brutus OR Caesar
2. Brutus AND NOT Caesar
3. (Brutus AND Caesar) OR Calpurnia


In [67]:
import pickle
import nltk
nltk.download("gutenberg")

gutenberg = nltk.corpus.gutenberg
guten_ids = list(gutenberg.fileids())

# Map every word (case-sensitive) to the set of files that contain it
gutenberg_index = dict()
for doc in guten_ids:
    for w in gutenberg.words(doc):
        gutenberg_index.setdefault(w, set()).add(doc)

print("Saving index into file gutenbergindex.pickle...")
with open("gutenbergindex.pickle","wb") as f:
    pickle.dump(gutenberg_index,f)


[nltk_data] Downloading package gutenberg to /Users/jakob/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


Saving index into file gutenbergindex.pickle...


In [68]:
with open('gutenbergindex.pickle','rb') as f:
    gutenberg_index = pickle.load(f)

In [69]:
# Write your code for searching for Brutus OR Caesar
gutenberg_index['Brutus'] | gutenberg_index['Caesar']

{'bible-kjv.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt'}

In [71]:
gutenberg.words('bible-kjv.txt')

['[', 'The', 'King', 'James', 'Bible', ']', 'The', ...]

In [57]:
# Write your code for searching for Brutus AND NOT Caesar
gutenberg_index['Brutus'] - gutenberg_index['Caesar']

set()

In [74]:
# Write your code for searching for (Brutus AND Caesar) OR Calpurnia
try:
    (gutenberg_index['Brutus'] & gutenberg_index['Caesar']) | gutenberg_index['Calpurnia']
except KeyError as key:
    print('Key not found in Gutenberg Index',key)


Key not found in Gutenberg Index 'Calpurnia'
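A term that is absent from the index matches no documents, so one option is to treat it as an empty posting set instead of an error; the rest of the query can then still be evaluated. The sketch below assumes this interpretation. The helper name `postings`, the toy index, and its file names are hypothetical.

```python
def postings(index, term):
    """Return the set of documents containing term, or an empty set if unseen."""
    return index.get(term, set())

# Toy index with made-up file names
toy_index = {
    'Brutus': {'play1.txt'},
    'Caesar': {'play1.txt', 'play2.txt'},
}

# (Brutus AND Caesar) OR Calpurnia, where 'Calpurnia' is not in the index
result = (postings(toy_index, 'Brutus') & postings(toy_index, 'Caesar')) \
    | postings(toy_index, 'Calpurnia')
print(result)  # {'play1.txt'}
```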


### Exercise 1.2: tf.idf

Using scikit-learn, compute the tf.idf of all words in the OHSUMED corpus. Use the English list of stop words, and leave all other settings to their default values. In particular, do not stem the words. Pickle the resulting tf.idf vectoriser into a file tfidf.pickle. **Note that in this exercise you should use the sklearn functions, not nltk. In particular, do not use NLTK's list of stop words or its tokeniser.**

In [81]:
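# Write your code to compute the tf.idf
from sklearn.feature_extraction.text import TfidfVectorizer

# Use sklearn's built-in English stop word list; all other settings are defaults
tfidf = TfidfVectorizer(stop_words='english')

tfidf.fit([ohsumed.index[k] for k in ohsumed.index])
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from scikit-learn 1.0
feature_names = tfidf.get_feature_names_out()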

In [83]:
# Write your code to save the results in a pickle file
with open("tfidf.pickle","wb") as f:
    pickle.dump(tfidf,f)
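To make the weights behind the vectoriser concrete, the following pure-Python sketch computes tf.idf by hand for a tiny, already tokenised corpus. It assumes the default settings of `TfidfVectorizer` (raw term counts, smoothed idf, L2 normalisation); the toy documents and the function name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens
toy_docs = [
    ['serum', 'cholesterol', 'serum'],
    ['serum', 'triglycerides'],
    ['estrogen', 'therapy'],
]

n_docs = len(toy_docs)
# Document frequency: number of documents that contain each word
df = Counter(word for doc in toy_docs for word in set(doc))
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in df}

def tfidf_vector(doc):
    """Return a dict word -> tf.idf weight, normalised to unit length."""
    counts = Counter(doc)
    weights = {word: counts[word] * idf[word] for word in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

vector = tfidf_vector(toy_docs[0])
# Words of the first document, highest tf.idf first
print(sorted(vector, key=vector.get, reverse=True))  # ['serum', 'cholesterol']
```

The word `cholesterol` has the higher idf because it occurs in a single document, but `serum` still ranks first in this document because it occurs twice.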

### Exercise 1.3: Sort by tf.idf

Write a function that returns the words of a document with the highest tf.idf scores. The resulting list of words should be sorted by tf.idf score in descending order.

In [26]:
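def best_tfidf(tfidf, docID, numwords=10):
    """Return the words with highest tf.idf, in descending order of score
    >>> best_tfidf(tfidf, '87049087', numwords=3)
    ['rhythms', 'refibrillation', 'organized']
    """
    # Write your code here
    # Vocabulary of the fitted vectoriser (requires scikit-learn 1.0 or later)
    words = tfidf.get_feature_names_out()
    # tf.idf weights of this document as a dense 1-D array
    scores = tfidf.transform([ohsumed.index[docID]]).toarray()[0]
    # Positions of the numwords largest weights, highest first
    best = scores.argsort()[::-1][:numwords]
    return [str(words[i]) for i in best]
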

In [27]:
best_tfidf(tfidf,'87049087')

['rhythms',
 'refibrillation',
 'organized',
 'refibrillated',
 'converted',
 'emt',
 'paramedic',
 'ds',
 'defibrillation',
 'hospital']

### Optional exercise: tf.idf cosine similarity

Use the OHSUMED collection for the following exercise. Write a function that takes a query string and an optional parameter $n$ (the number of results), and returns the IDs of the $n$ documents that are most relevant according to tf.idf and cosine similarity. The results must be sorted in descending order of the cosine similarity score.

In [30]:
# The following function implements cosine similarity by using the formulas we have seen in the lectures.
# Feel free to use sklearn's implementation of cosine similarity instead.

def best_documents(querystring,n=10):
    """Return the indices of the best n documents using cosine similarity
    >>> best_documents(ohsumed.questions['OHSU1'], n=3)
    ['87285549', '87162574', '87068356']"""
    # Write your code here


In [31]:
best_documents(ohsumed.questions['OHSU1'])

['87052846',
 '87053030',
 '87057603',
 '87057561',
 '87054719',
 '87053640',
 '87053630',
 '87055106',
 '87057550',
 '87053614']
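As an outline of the idea only, the sketch below ranks a made-up three-document corpus against a query with hand-computed tf.idf vectors and cosine similarity. It is not a solution that reproduces the output above: the corpus, the whitespace tokenisation, and the names `vectorise` and `toy_best_documents` are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus standing in for ohsumed.index
toy_docs = {
    'd1': 'estrogen therapy reduced serum cholesterol',
    'd2': 'serum triglycerides remained unchanged',
    'd3': 'progesterone given with estrogen therapy',
}

tokenised = {doc_id: text.split() for doc_id, text in toy_docs.items()}
# Document frequency of every word in the corpus
df = Counter(word for words in tokenised.values() for word in set(words))
n_docs = len(tokenised)

def vectorise(words):
    """Return a unit-length tf.idf vector (dict) over the known vocabulary."""
    counts = Counter(w for w in words if w in df)
    weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
               for w, c in counts.items()}
    norm = math.sqrt(sum(x * x for x in weights.values()))
    return {w: x / norm for w, x in weights.items()} if norm else {}

doc_vectors = {doc_id: vectorise(words) for doc_id, words in tokenised.items()}

def toy_best_documents(querystring, n=10):
    """Return the IDs of the n documents most similar to the query."""
    query = vectorise(querystring.lower().split())
    # For unit-length vectors the cosine similarity is just the dot product
    scores = {doc_id: sum(weight * vec.get(w, 0.0) for w, weight in query.items())
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(toy_best_documents('progesterone with estrogen', n=2))  # ['d3', 'd1']
```

Normalising every vector to unit length up front means each query costs only one sparse dot product per document. With scikit-learn, the same effect can be obtained by multiplying the transformed query with the transformed document matrix, since `TfidfVectorizer` applies L2 normalisation by default.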