# Example of an IR system based on the Vector Space Model on CISI

The CISI dataset can be downloaded at the following address: [CISI dataset](https://www.kaggle.com/datasets/dmaso01dsta/cisi-a-dataset-for-information-retrieval/code?select=CISI.REL)

In this example, we access a local, parsed version of CISI stored in MongoDB.

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [2]:
from pymongo import MongoClient

In [3]:
db = MongoClient()['cisi']

## Explore the dataset

In [4]:
db['documents'].find_one()

{'_id': ObjectId('63ff54cf5881e58ca7d9bb75'),
 'id': 1,
 '.T': '18 Editions of the Dewey Decimal Classifications',
 '.A': 'Comaromi, J.P.',
 '.W': "The present study is a history of the DEWEY Decimal Classification. The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed. In spite of the DDC's long and healthy life, however, its full story has never been told. There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.",
 '.X': [['1', '5', '1'],
  ['92', '1', '1'],
  ['262', '1', '1'],
  ['556', '1', '1'],
  ['1004', '1', '1'],
  ['1024', '1', '1'],
  ['1024', '1', '1']]}

In [5]:
db['queries'].find_one()

{'_id': ObjectId('63ff54fb5881e58ca7d9c129'),
 'id': 1,
 '.W': 'What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles?'}

In [6]:
db['rel'].find_one()

{'_id': ObjectId('63ff56f65881e58ca7d9c199'), 'query': 1, 'doc': 28}

## Get documents, queries and ground truth

In [7]:
from collections import defaultdict

In [8]:
documents = [(r['id'], ". ".join([r['.T'], r['.W']])) for r in db['documents'].find()]
queries = [(r['id'], r['.W']) for r in db['queries'].find()]

In [9]:
ground_truth = defaultdict(list)

for r in db['rel'].find():
    ground_truth[r['query']].append(r['doc'])

In [10]:
qid, q = queries[1]
print(q, '\n')
doc_index = dict(documents)
for doc_id in ground_truth[qid]:
    print(doc_index[doc_id], '\n')

How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests? 

Some Questions Concerning "Information Need". The expression "satisfying a requester's information need" is often used, but its meaning is obscure. The literature on "information need" in relation to retrieval suggests three different (though not inconsistent) possible interpretations. However, each of these interpretations is itself fundamentally unclear. The various obscurities involved are indicated by critical questions, which those who write of information need are invited to answer. 

Retrieval of Answer-Providing Documents. (I) Better understanding of subject document retrieval might result if different functions of subject document retrieval system are studied separately.. This paper is concerned with retrieval of documents, in response to a question, from which answers to that question can be inferred ("answer-providing docu

## Tokenization

In [16]:
from nltk.tokenize import word_tokenize

In [18]:
_, doc = documents[0]
tokens = word_tokenize(doc)

## Count vectorizer

In [19]:
from collections import defaultdict

In [20]:
TFidx = defaultdict(lambda: defaultdict(lambda: 0))
for doc_id, doc_text in tqdm(documents):
    for token in word_tokenize(doc_text):
        TFidx[token][doc_id] += 1

In [25]:
TF = pd.DataFrame(TFidx).fillna(0)

In [26]:
TF.head()

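|  | 18 | Editions | of | the | Dewey | Decimal | Classifications | . | The | present | ... | subsidization | morn | religiously | prompts | soundness | supposition | Thought | abstraction | certificates | 100-150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 7.0 | 8.0 | 2.0 | 2.0 | 1.0 | 5.0 | 2.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 197 | 1.0 | 0.0 | 20.0 | 18.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 354 | 1.0 | 0.0 | 11.0 | 9.0 | 2.0 | 2.0 | 0.0 | 8.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 581 | 1.0 | 0.0 | 16.0 | 9.0 | 0.0 | 0.0 | 0.0 | 5.0 | 6.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 609 | 1.0 | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |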


In [27]:
TF.shape

(1460, 13455)

In [28]:
_, query_text = queries[1]
query_text

'How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests?'

In [37]:
query_tokens = word_tokenize(query_text)
vocabulary = list(TF.columns.values)
q = np.zeros(TF.shape[1])
for token in query_tokens:
    try:
        i = vocabulary.index(token)
        q[i] += 1
    except ValueError:
        # list.index raises ValueError for tokens that are not in the vocabulary
        pass
q = q / q.max()
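
The query vector above is built by searching the vocabulary list once per token. A possible alternative, sketched here on a toy vocabulary, is to precompute a term-to-column dictionary so that each lookup is a constant-time operation. The names `term_to_col` and `query_vector` and the toy terms are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Toy vocabulary standing in for TF.columns (assumption for illustration).
vocabulary = ["information", "retrieval", "data", "library"]
term_to_col = {term: i for i, term in enumerate(vocabulary)}

def query_vector(tokens, term_to_col):
    """Count query tokens per vocabulary column, then scale by the largest count."""
    q = np.zeros(len(term_to_col))
    for token in tokens:
        col = term_to_col.get(token)  # None for out-of-vocabulary tokens
        if col is not None:
            q[col] += 1
    peak = q.max()
    # Avoid dividing by zero when no query token is in the vocabulary.
    return q / peak if peak > 0 else q

q_toy = query_vector(["data", "retrieval", "data", "unknown"], term_to_col)
print(q_toy)
```

Out-of-vocabulary tokens are silently skipped, which mirrors the `try`/`except` in the cell above.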

In [42]:
NormTF = (TF.T / TF.max(axis=1)).T
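
The cell above divides each document row by that row's largest term count, so the most frequent term of every document gets weight 1. A minimal sketch of the same operation on a made-up count matrix (the values and the names `tf` and `norm_tf` are assumptions for illustration):

```python
import numpy as np

# Toy term-frequency matrix: one row per document, one column per term.
tf = np.array([
    [8.0, 7.0, 2.0, 0.0],
    [4.0, 0.0, 1.0, 2.0],
])

# Maximum count of each row, kept as a column so it broadcasts across terms.
row_max = tf.max(axis=1, keepdims=True)

# Maximum-TF normalization: every row is scaled into the range [0, 1].
norm_tf = tf / row_max
print(norm_tf)
```

With pandas, `TF.div(TF.max(axis=1), axis=0)` should express the same row-wise division without the double transpose.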

## Similarity

In [44]:
from sklearn.metrics.pairwise import cosine_similarity

In [45]:
match = cosine_similarity(q.reshape(1, -1), NormTF)

In [46]:
match.shape

(1, 1460)
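
To make the score concrete, here is a small sketch of what cosine similarity computes for one query against a few documents: the dot product divided by the product of the vector norms. The vectors and the helper name `cosine` are made up for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

query = np.array([1.0, 0.0, 1.0])
docs = np.array([
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # shares no term with the query
    [2.0, 0.0, 0.0],  # shares one term with the query
])

scores = [cosine(query, d) for d in docs]
# Rank document positions from the highest score to the lowest.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because the score depends only on direction, a long document is not favored merely for containing more tokens.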

In [48]:
M = pd.DataFrame(match, columns=NormTF.index.values)

In [79]:
answers = M.loc[0].sort_values(ascending=False).head(2000).index.values

In [80]:
print(query_text, '\n')
for answer in answers:
    print(doc_index[answer], '\n')
    break

How can actually pertinent data, as opposed to references or entire articles themselves, be retrieved automatically in response to information requests? 

Rules for a Dictionary Catalog. No code of cataloguing could be adopted in all points by everyone, because the libraries for study and the libraries for reading have different objects, and those which combine the two do so in different proportions. Again, the preparation of a catalogue must vary as it is to be manuscript or printed, and, if the latter, as it is to be merely an index to the library, giving in the shortest possible compass clues by which the public can find books, or is to attempt to furnish more information on various points, or finally is to be made with a certain regard to what may be called style. 



In [81]:
R = set(list(answers))

In [82]:
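# Note: ground_truth[1] holds the judgments for query id 1, whereas query_text
# was taken from queries[1]; if qid differs from 1, ground_truth[qid] is the
# list that matches the displayed query.
E = set(ground_truth[1])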

In [83]:
print('Precision', len(R.intersection(E)) / len(R))
print('Recall', len(R.intersection(E)) / len(E))

Precision 0.031506849315068496
Recall 1.0
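
Since `head(2000)` exceeds the 1460 documents in the collection, every document is returned: recall is trivially 1.0 and precision is just the share of relevant documents (46 / 1460). A hedged sketch of evaluating at a cutoff `k` instead, using a hypothetical ranking and ground truth (the function name `precision_recall_at_k` and all ids are assumptions):

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall computed over the first k items of a ranked list."""
    retrieved = set(ranking[:k])
    hits = len(retrieved & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranking = [7, 3, 9, 1, 4, 8, 2, 5, 6, 10]  # hypothetical ranked document ids
relevant = {3, 4, 10}                        # hypothetical ground truth

for k in (2, 5, len(ranking)):
    p, r = precision_recall_at_k(ranking, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

In the notebook, `answers` could play the role of `ranking` and `E` the role of `relevant`; smaller cutoffs trade recall for precision.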


In [84]:
cm = np.zeros((2, 2))
cm

array([[0., 0.],
       [0., 0.]])

In [85]:
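# Rows: retrieved / not retrieved; columns: relevant / not relevant
for doc_id, _ in documents:
    if doc_id in R and doc_id in E:
        cm[0,0] += 1  # retrieved and relevant
    elif doc_id in R and doc_id not in E:
        cm[0,1] += 1  # retrieved but not relevant
    elif doc_id not in R and doc_id in E:
        cm[1,0] += 1  # relevant but not retrieved
    else:
        cm[1,1] += 1  # neither retrieved nor relevant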

In [86]:
cm

array([[  46., 1414.],
       [   0.,    0.]])

In [87]:
cm[0,0] / cm[0].sum()

0.031506849315068496

In [88]:
cm[0,0] / cm[:,0].sum()

1.0

## Understand the error

In [91]:
doc_id, doc = documents[0]
print(doc_id)
print(doc)

1
18 Editions of the Dewey Decimal Classifications. The present study is a history of the DEWEY Decimal Classification. The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed. In spite of the DDC's long and healthy life, however, its full story has never been told. There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.


In [94]:
NormTF.loc[1].sort_values(ascending=False).head(20)

the        1.000
of         0.875
,          0.625
.          0.625
in         0.375
and        0.375
history    0.250
has        0.250
that       0.250
DDC        0.250
edition    0.250
first      0.250
this       0.250
to         0.250
is         0.250
The        0.250
been       0.250
Decimal    0.250
Dewey      0.250
a          0.250
Name: 1, dtype: float64

In [96]:
import nltk
stopwords = nltk.corpus.stopwords.words('english')

In [97]:
stopwords[:20]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']
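
The top-weighted terms of document 1 are mostly function words and punctuation, and both `the` and `The` appear as separate terms. One possible next step, sketched below, is to lowercase tokens and drop stopwords and punctuation before counting. The small `STOPWORDS` set is a stand-in so the sketch runs without NLTK data (the notebook's `stopwords` list could be passed instead), and `filter_tokens` is a hypothetical helper, not part of the notebook.

```python
import string

# Stand-in stopword set (assumption); the NLTK list loaded above is also lowercase.
STOPWORDS = {"the", "of", "in", "and", "is", "a", "to", "that", "this", "has", "been"}

def filter_tokens(tokens, stopwords=STOPWORDS):
    """Lowercase tokens, then drop stopwords and tokens made only of punctuation."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token in stopwords:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

tokens = ["The", "present", "study", "is", "a", "history", "of", "the",
          "DEWEY", "Decimal", "Classification", "."]
print(filter_tokens(tokens))
```

Applying the same filter to both documents and queries keeps the two vector spaces aligned.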