# First Term Project: Cranfield Collection
“The Cranfield collection [...] was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness [...]. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.” [1, Section 8.2]

Your tasks, reviewed by your colleagues and the course instructors, are the following:

1.   *Implement an unsupervised ranked retrieval system* [1, Chapter 6], which will produce a list of documents from the Cranfield collection in descending order of relevance to a query from the Cranfield collection. You MUST NOT use relevance judgements from the Cranfield collection in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

2.   *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.  
     *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).

3.   *Reach at least 22% mean average precision* [1, Section 8.4] with your system on the Cranfield collection. You MUST record your score either in [the public leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vT0FoFzCptIYKDsbcv8LebhZDe_20GFeBAPmS-VyImlWbqET0T7I2iWy59p9SHbUe3LX1yJMhALPcCY/pubhtml) or in this Jupyter notebook. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], and others discussed in the course.

4.   _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).

[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge University Press, 2008.

## Loading the Cranfield collection

First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the Cranfield collection.

In [2]:
%%capture
! pip install git+https://github.com/MIR-MU/pv211-utils.git
! pip install gensim==3.6.0
! pip install sentence-transformers nltk

### Loading the documents

Next, we will define a class named `Document` that will represent a preprocessed document from the Cranfield collection. Tokenization and preprocessing of the `title` and `body` attributes of the individual documents as well as the creative use of the `authors`, `bibliography`, and `title` attributes is left to your imagination and craftsmanship.

In [3]:
from pv211_utils.cranfield.entities import CranfieldDocumentBase

class Document(CranfieldDocumentBase):
    """
    A preprocessed Cranfield collection document.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    authors : list of str
        Unique identifiers of the authors of the document.
    bibliography : str
        The bibliographical entry for the document.
    title : str
        The title of the document.
    body : str
        The abstract of the document.

    """
    def __init__(self, document_id: str, authors: str, bibliography: str, title: str, body: str):
        super().__init__(document_id, authors, bibliography, title, body)

We will load documents into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each document is an instance of the `Document` class that we have just defined.

In [4]:
from pv211_utils.cranfield.loader import load_documents

documents = load_documents(Document)
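
The preprocessing of document text is left open above. As a starting point, here is a minimal sketch of tokenization with stop-word removal that uses only the standard library. The names `STOP_WORDS` and `tokenize` and the tiny stop list are assumptions made for illustration; they are not part of `pv211_utils`.

```python
import re

# A tiny illustrative stop list (an assumption); a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


example_tokens = tokenize("The boundary layer in simple shear flow past a flat plate.")
print(example_tokens)
```

A function like this could be called from the constructor of `Document` to store a tokenized version of `body` next to the raw text.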

### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the Cranfield collection. Tokenization and preprocessing of the `body` attribute of the individual queries is left to your craftsmanship.

In [5]:
from pv211_utils.cranfield.entities import CranfieldQueryBase

class Query(CranfieldQueryBase):
    """
    A preprocessed Cranfield collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """
    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)

We will load queries into the `queries` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined.

In [6]:
from pv211_utils.cranfield.loader import load_queries

queries = load_queries(Query)

## Implementation of your information retrieval system
Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.

The example implementation returns documents in decreasing order of the bag-of-words cosine similarity between the document and the query. You can use the example implementation as a basis of your system, or you can replace it with your own implementation.
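
To make the idea of bag-of-words cosine similarity concrete, the following sketch ranks three made-up documents against a made-up query using raw term counts. It is not the course's example implementation: `cosine_similarity`, `toy_documents`, and `toy_query` are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(first: Counter, second: Counter) -> float:
    """Return the cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * second[term] for term, count in first.items())
    norm_first = sqrt(sum(count * count for count in first.values()))
    norm_second = sqrt(sum(count * count for count in second.values()))
    if norm_first == 0 or norm_second == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_first * norm_second)


# Made-up data for illustration only.
toy_documents = {
    "d1": "flow over a flat plate",
    "d2": "heat transfer in a flat plate boundary layer",
    "d3": "structural fatigue of aircraft wings",
}
toy_query = "flat plate flow"

query_vector = Counter(toy_query.split())
ranking = sorted(
    toy_documents,
    key=lambda doc_id: cosine_similarity(query_vector, Counter(toy_documents[doc_id].split())),
    reverse=True,
)
print(ranking)
```

Because the cosine divides by the vector lengths, a long document is not favored merely for containing more words.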

## Evaluation
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
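
For intuition, here is a small sketch of MAP following the textbook definition [1, Section 8.4]. It is an illustration on made-up rankings, not the evaluation code from `pv211_utils`, which may differ in details such as the handling of queries without relevant documents. All names below are hypothetical.

```python
def average_precision(ranking: list[str], relevant: set[str]) -> float:
    """Average the precision at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(rankings: dict[str, list[str]], judgements: dict[str, set[str]]) -> float:
    """Average the per-query average precision over all judged queries."""
    scores = [average_precision(rankings[query_id], judgements[query_id]) for query_id in judgements]
    return sum(scores) / len(scores)


# Made-up rankings and relevance judgements for two queries.
toy_rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d3", "d1", "d2", "d4"]}
toy_judgements = {"q1": {"d1", "d3"}, "q2": {"d2"}}
toy_map = mean_average_precision(toy_rankings, toy_judgements)
print(f"{toy_map:.4f}")
```

In this toy case, query `q1` finds its relevant documents at ranks 1 and 3, giving (1/1 + 2/3) / 2, and query `q2` finds its single relevant document at rank 3, giving 1/3; MAP is the mean of the two.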

In [9]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from pv211_utils.cranfield.irsystem import CranfieldIRSystemBase
from tqdm import tqdm
from typing import Iterable
from sentence_transformers import SentenceTransformer, util
from gensim.corpora import Dictionary
from gensim.similarities import SparseMatrixSimilarity
from gensim.utils import simple_preprocess
from gensim.models import TfidfModel


def preprocess_text(txt: str, stemmer: PorterStemmer, stop_words: set[str]) -> list[str]:
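    """
    Tokenize a text, remove stop words, and stem the remaining tokens.

    Parameters
    ----------
    txt : str
        The text to be preprocessed.
    stemmer : PorterStemmer
        The stemmer applied to every remaining token.
    stop_words : set of str
        The words to be removed, such as "and" or "is".

    Returns
    -------
    list of str
        The stemmed tokens of the text without stop words.

    """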
    data = simple_preprocess(txt)
    return [stemmer.stem(i) for i in data if i not in stop_words]


def add_weights_to_document(doc: Document) -> str:
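    """
    Weight the parts of a document by repeating their text.

    Parameters
    ----------
    doc : Document
        The document whose parts will be weighted.

    Returns
    -------
    str
        The body, followed by the title (twice), the authors (four times),
        and the bibliography (twice).

    """
    # Repeating a string n times raises the term frequency of its tokens roughly n-fold.
    # No separator is added between the repeated parts, so tokens at the seams
    # may be fused unless the fields end in whitespace.
    return doc.body + 2 * doc.title + 4 * doc.authors + 2 * doc.bibliography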


class IRSystem(CranfieldIRSystemBase):
    """
    An information retrieval system that combines two similarity scores.

    - The first part is a sentence transformer from https://www.sbert.net/.
    - The second part is a TF-IDF model from gensim, see
      https://radimrehurek.com/gensim/models/tfidfmodel.html.

    """
    def __init__(self):
        # load stopwords and stemmer
        nltk.download('stopwords')
        self._stemmer = PorterStemmer()
        self._stop_words = set(stopwords.words('english'))

        # the first part -> sentence transformer
        # load model and transform documents to vectors
        self._model = SentenceTransformer('all-mpnet-base-v2')
        self._semantic_document_vectors = []
        for doc in list(documents.values()):
            self._semantic_document_vectors.append(self._model.encode(doc.body, convert_to_tensor=True))

        # the second part -> TF-IDF model from gensim
        # preprocess documents
        document_bodies = (preprocess_text(add_weights_to_document(doc), self._stemmer, self._stop_words) for doc in documents.values())
        document_bodies = tqdm(document_bodies, desc='Building the dictionary', total=len(documents))
        # create the TF-IDF model from the preprocessed documents
        dictionary = Dictionary(document_bodies)
        tfidf_model = TfidfModel(dictionary=dictionary, smartirs='lnc')
        # transform preprocessed documents to vectors
        document_vectors = [tfidf_model[dictionary.doc2bow(preprocess_text(add_weights_to_document(doc), self._stemmer, self._stop_words))] for doc in documents.values()]
        document_vectors = tqdm(document_vectors, desc='Building the index', total=len(documents))
        # create the SparseMatrixSimilarity index and keep index and index_to_document for later use
        index = SparseMatrixSimilarity(document_vectors, num_docs=len(documents), num_terms=len(dictionary))
        index_to_document = dict(enumerate(documents.values()))

        self.dictionary = dictionary
        self.index = index
        self.index_to_document = index_to_document


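    def search(self, query: Query) -> Iterable[Document]:
        """
        Rank all documents by their combined similarity to a query.

        Parameters
        ----------
        query : Query
            The query to be answered.

        Yields
        ------
        Document
            The documents in descending order of the combined similarity.

        """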
        # dict that contains weighted sum of both similarities
        similarities_dict = {}

        # second part
        # preprocess the query and transform it to a TF-IDF vector
        tfidf_query = TfidfModel(dictionary=self.dictionary, smartirs='atc')[self.dictionary.doc2bow(preprocess_text(query.body, self._stemmer, self._stop_words))]
        # pairs of a document number and its similarity to the query
        tfidf_similarities = enumerate(self.index[tfidf_query])
        # add the weighted similarities to similarities_dict
        for document_number, similarity in tfidf_similarities:
            similarities_dict[document_number] = 1.3 * similarity  # normalization
            # the weight 1.3 is not arbitrary: it is the maximum of the TF-IDF similarities
            # divided by the maximum of the sentence transformer similarities, rounded

        # first part
        # transform query to vector using sentence transformer
        query_transformer = self._model.encode(query.body)
        # add similarities to similarities_dict
        for i in range(len(self._semantic_document_vectors)):
            similarities_dict[i] += util.cos_sim(query_transformer, self._semantic_document_vectors[i]).tolist()[0][0]

        # sort similarities and return the best ones
        result = [(key, value) for key, value in similarities_dict.items()]
        result = sorted(result, key=lambda item: item[1], reverse=True)

        for document_number, _ in result:
            document = self.index_to_document[document_number]
            yield document
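
The `search` method above adds a rescaled TF-IDF similarity to a sentence-embedding similarity for every document and sorts by the sum. The standalone sketch below isolates that fusion step on made-up scores. The function `fuse_scores` and the toy values are hypothetical; only the weight 1.3 is taken from the code above.

```python
def fuse_scores(lexical: list[float], semantic: list[float], lexical_weight: float = 1.3) -> list[int]:
    """Return document indices ordered by the weighted sum of two score lists."""
    if len(lexical) != len(semantic):
        raise ValueError("Both score lists must cover the same documents.")
    combined = [lexical_weight * lex + sem for lex, sem in zip(lexical, semantic)]
    # Sort the document indices by their combined score, best first.
    return sorted(range(len(combined)), key=combined.__getitem__, reverse=True)


# Made-up similarities for three documents.
toy_lexical = [0.10, 0.40, 0.25]   # e.g. TF-IDF cosine similarities
toy_semantic = [0.70, 0.20, 0.60]  # e.g. sentence-embedding cosine similarities
fused_order = fuse_scores(toy_lexical, toy_semantic)
print(fused_order)
```

A fixed weight like this only makes sense when both scores live on comparable scales, which is why the original code derives 1.3 from the observed maxima of the two similarities.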

In [10]:
from pv211_utils.cranfield.loader import load_judgements
from pv211_utils.cranfield.leaderboard import CranfieldLeaderboard
from pv211_utils.cranfield.eval import CranfieldEvaluation
submit_result = True
author_name = 'Strompová, Alžbeta'

print('Initializing your system ...')
system = IRSystem()
evaluation = CranfieldEvaluation(system, load_judgements(queries, documents), CranfieldLeaderboard(), author_name)
evaluation.evaluate(tqdm(queries.values(), desc='Querying your system'), submit_result)

Initializing your system ...


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AlžbetaStrompová\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Building the dictionary: 100%|██████████| 1400/1400 [00:02<00:00, 566.25it/s]
Building the index: 100%|██████████| 1400/1400 [00:00<00:00, 34147.63it/s]
Querying your system: 100%|██████████| 225/225 [00:21<00:00, 10.53it/s]


Your system achieved **49.10% MAP score**.

Congratulations, you passed the **22%** minimum! 🥳

Your result has been submitted to [the leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vSXuOTclZfHWYxh2rf7hfMeLvcCuE5UsJu7BzteyunhPw3z4YNZjCovjmMB6SnDdgjGyenOgdochaEq/pubhtml)! 🏆