
# Term Project
“The Cranfield collection [...] was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness [...]. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.” [1, Section 8.2]

Your tasks, reviewed by your colleagues and the course instructors, are the following:

1.   *Implement a ranked retrieval system* [1, Chapter 6] that produces a list of documents from the Cranfield collection in descending order of relevance to a query from the Cranfield collection. You MUST NOT use relevance judgements from the Cranfield collection in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

2.   *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises.  
     *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).

3.   *Reach at least 35% mean average precision* [1, Section 8.4] with your system on the Cranfield collection. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], and others discussed in the course.

4.   *Upload a link to your Google Colaboratory document to the homework vault in IS MU.* You MAY also include a brief description of your information retrieval system.
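
Task 1 refers to the vector space model of [1, Chapter 6]. As a rough illustration only, the following sketch shows tf-idf weighting with cosine ranking in pure Python. The helper names (`tfidf_vectors`, `cosine`, `rank`) and the toy documents are assumptions made for this example; they are not part of `pv211_utils`, and the weighting used here (raw term frequency times `log(N / df)`) is only one of several possible variants.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, doc_vectors, idf):
    """Return document indices in descending order of cosine similarity."""
    query_vector = {
        term: tf * idf[term]
        for term, tf in Counter(query_tokens).items()
        if term in idf  # ignore terms that never occur in the collection
    }
    scores = [cosine(query_vector, vector) for vector in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda i: scores[i], reverse=True)


# Toy collection (hypothetical, not Cranfield data).
toy_docs = [
    "boundary layer transition at supersonic speeds".split(),
    "heat conduction in a multilayer slab".split(),
    "shear flow past a flat plate".split(),
]
toy_vectors, toy_idf = tfidf_vectors(toy_docs)
ranking = rank("heat conduction in composite slabs".split(), toy_vectors, toy_idf)
print(ranking)  # the heat-conduction document (index 1) comes first
```

Note that `slabs` in the toy query does not match `slab` in the toy document; stemming or lemmatization [1, Section 2.2] would be one way to close that gap.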

#### Install the latest version of utils

In [252]:
! pip install git+https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git@master | grep '^Successfully'

  Running command git clone -q https://gitlab.fi.muni.cz/xstefan3/pv211-utils.git /tmp/pip-req-build-3rx_36_x
Successfully built pv211-utils


## Loading the Cranfield collection

### Loading the documents
The following code loads documents from the Cranfield collection into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Tokenization of the `title` and `body` attributes of the individual documents as well as the creative use of the `authors`, `bibliography`, and `title` attributes is left to your imagination and craftsmanship.

In [0]:
from pv211_utils.entities import DocumentBase

class Document(DocumentBase):
    """
    A Cranfield collection document.

    Parameters
    ----------
    document_id : int
        A unique identifier of the document.
    authors : list of str
        Unique identifiers of the authors of the document.
    bibliography : str
        The bibliographical entry for the document.
    title : str
        The title of the document.
    body : str
        The abstract of the document.

    """
    def __init__(self, document_id, authors, bibliography, title, body):
        super().__init__(document_id, authors, bibliography, title, body)
        # preprocessing?

In [0]:
from pv211_utils.loader import load_documents

documents = load_documents(Document)
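
The `# preprocessing?` placeholder in `Document.__init__` is one place where tokenization could happen. A minimal sketch of such a tokenizer follows; the `tokenize` function, the `TOKEN_PATTERN` regular expression, and the tiny `STOPWORDS` list are assumptions made for illustration, not part of `pv211_utils`.

```python
import re

# A tiny illustrative stop list (an assumption, far from exhaustive).
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})

# Runs of lowercase letters and digits; punctuation and hyphens act as separators.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase `text`, split it into alphanumeric runs, and drop stopwords."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOPWORDS]


tokens = tokenize("piston theory - a new aerodynamic tool for the aeroelastician .")
print(tokens)
# ['piston', 'theory', 'new', 'aerodynamic', 'tool', 'aeroelastician']
```

With this pattern, hyphenated terms such as `high-speed` are split into two tokens. Whether that helps or hurts retrieval on this collection is something to measure rather than assume.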

In [255]:
print('\n'.join(repr(document) for document in list(documents.values())[:9]) + '\n...')

<Document 1 titled “experimental investigation of the aerodynamics of a wing in a slipstream .”>
<Document 2 titled “simple shear flow past a flat plate in an incompressible fluid of small viscosity .”>
<Document 3 titled “the boundary layer in simple shear flow past a flat plate .”>
<Document 4 titled “approximate solutions of the incompressible laminar boundary layer equations for a plate in shear flow .”>
<Document 5 titled “one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .”>
<Document 6 titled “one-dimensional transient heat flow in a multilayer slab .”>
<Document 7 titled “the effect of controlled three-dimensional roughness on boundary layer transition at supersonic speeds .”>
<Document 8 titled “measurements of the effect of two-dimensional and three-dimensional roughness elements on boundary layer transition .”>
<Document 9 titled “transition studies and skin friction measurements on an insulated flat

In [256]:
document = documents[14]
document

<Document 14 titled “piston theory - a new aerodynamic tool for the aeroelastician .”>

In [257]:
document.authors

'ashley,h. and zartarian,g.'

In [258]:
document.bibliography

'j. ae. scs. 23, 1956, 1109.'

In [259]:
document.title

'piston theory - a new aerodynamic tool for the aeroelastician .'

In [260]:
document.body

"piston theory - a new aerodynamic tool for the aeroelastician .   representative applications are described which illustrate the extent to which simplifications in the solutions of high-speed unsteady aeroelastic problems can be achieved through the use of certain aerodynamic techniques known collectively as /piston theory ./  based on a physical model originally proposed by hayes and lighthill, piston theory for airfoils and finite wings has been systematically developed by landahl, utilizing expansions in powers of the thickness ratio and the inverse of the flight mach number m .  when contributions of orders and are negligible, the theory predicts a point-function relationship between the local pressure on the surface of a wing and the normal component of fluid velocity produced by the wing's motion .  the computation of generalized forces in aeroelastic equations, such as the flutter determinant, is then always reduced to elementary integrations of the assumed modes of motion .   

### Loading the queries
The following code loads queries from the Cranfield collection into the `queries` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Tokenization of the `body` attribute of the individual queries is left to your imagination and craftsmanship.

In [0]:
from pv211_utils.entities import QueryBase

class Query(QueryBase):
    """
    A Cranfield collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """
    def __init__(self, query_id, body):
        super().__init__(query_id, body)
        # preprocessing!

In [0]:
from pv211_utils.loader import load_queries

queries = load_queries(Query)

In [263]:
print('\n'.join(repr(query) for query in list(queries.values())[:9]) + '\n...')

<Query 1 “what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft .”>
<Query 2 “what are the structural and aeroelastic problems associated with flight of high speed aircraft .”>
<Query 3 “what problems of heat conduction in composite slabs have been solved so far .”>
<Query 4 “can a criterion be developed to show empirically the validity of flow solutions for chemically reacting gas mixtures based on the simplifying assumption of instantaneous local chemical equilibrium .”>
<Query 5 “what chemical kinetic system is applicable to hypersonic aerodynamic problems .”>
<Query 6 “what theoretical and experimental guides do we have as to turbulent couette flow behaviour .”>
<Query 7 “is it possible to relate the available pressure distributions for an ogive forebody at zero angle of attack to the lower surface pressures of an equivalent ogive forebody at angle of attack .”>
<Query 8 “what methods -dash exact or approximate -dash are presently av

In [264]:
query = queries[14]
query

<Query 14 “papers on shock-sound wave interaction .”>

In [265]:
query.body

'papers on shock-sound wave interaction .'

### Loading the relevance judgements
The following code loads relevance judgements from the Cranfield collection into the `relevant` set. Relevance judgements MUST NOT be used in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.

In [0]:
from pv211_utils.loader import load_judgements

relevant = load_judgements(queries, documents)

In [267]:
query = queries[1]
query

<Query 1 “what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft .”>

In [268]:
relevant_document = documents[486]
relevant_document

<Document 486 titled “similarity laws for aerothermoelastic testing .”>

In [269]:
(query, relevant_document) in relevant

True

In [270]:
irrelevant_document = documents[487]
irrelevant_document

<Document 487 titled “theory for supersonic two-dimensional, laminar, base-type flows using the crocco-lees mixing concepts .”>

In [271]:
(query, irrelevant_document) in relevant

False
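
Because `relevant` behaves like a set of (query, document) pairs, membership tests are enough to build simple evaluation helpers. The sketch below uses hypothetical integer identifiers in place of `Query` and `Document` objects; `relevant_documents` and `precision_at_k` are illustrative names, not functions from `pv211_utils`. Such helpers belong in evaluation code only, never in the retrieval system itself.

```python
def relevant_documents(query, judgements):
    """Return the set of documents judged relevant to `query`."""
    return {document for (judged_query, document) in judgements if judged_query == query}


def precision_at_k(ranking, query, judgements, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum((query, document) in judgements for document in ranking[:k])
    return hits / k


# Hypothetical judgements over integer identifiers, not Cranfield data.
toy_judgements = {(1, 10), (1, 12), (2, 11)}
toy_ranking = [12, 11, 10, 13]

print(relevant_documents(1, toy_judgements))
print(precision_at_k(toy_ranking, 1, toy_judgements, k=2))  # 0.5
```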

## Implementation of your information retrieval system
The following code provides an example implementation of an information retrieval system in the `search` function. This example implementation returns documents in a random order and achieves a very weak mean average precision. Replace this implementation with your own implementation.

In [0]:
import multiprocessing

from gensim.models import doc2vec
from gensim.models.doc2vec import TaggedDocument
from pv211_utils.loader import load_documents, load_queries
from sklearn.metrics.pairwise import cosine_similarity

# Doc2Vec
cores = multiprocessing.cpu_count()

# Alternative configurations, left commented out:
# PV-DBOW
# Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=19, iter=10, workers=cores)
# PV-DM w/average
# Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, workers=cores)

# The model in use: PV-DBOW (dm=0) with negative sampling.
doc2vec_model = doc2vec.Doc2Vec(dm=0, vector_size=300, negative=12, hs=0, min_count=5, workers=cores, alpha=0.1, window=8)

# Keep the documents as a list of (document_id, Document) pairs so that
# position i in `vec_docs` corresponds to `documents[i]`.
documents = list(load_documents(Document).items())
taggeddocs = []
for document_id, document in documents:
    # `tags` must be a list; a bare string is iterated character by character.
    taggeddocs.append(TaggedDocument(words=document.body.split(), tags=[str(document_id)]))

doc2vec_model.build_vocab(taggeddocs)
doc2vec_model.train(taggeddocs, total_examples=doc2vec_model.corpus_count, epochs=10)

# Infer one vector per document and one per query with the trained model.
vec_docs = [doc2vec_model.infer_vector(doc.words) for doc in taggeddocs]

queries = list(load_queries(Query).items())
vec_queries = [doc2vec_model.infer_vector(query.body.split()) for _, query in queries]

# Sanity checks: vector dimensionality and one similarity score per document.
print(len(vec_docs[1]))
print(len(vec_queries[1]))
len(cosine_similarity(vec_queries[0:1], vec_docs)[0])

In [0]:
from random import seed, shuffle

from pv211_utils.irsystem import IRSystem

class SillyRandomIRSystem(IRSystem):
    """
    A system that ranks documents by Doc2Vec cosine similarity to the query.

    It relies on the module-level `doc2vec_model`, `vec_docs`, and
    `documents` objects built in the previous cell.

    """
   
    def search(self, query):
        """The ranked retrieval results for a query.

        Parameters
        ----------
        query : Query
            A query.
        
        Returns
        -------
        list of Document
            The ranked retrieval results for a query.

        """
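        vec_query = doc2vec_model.infer_vector(query.body.split())
        sim = cosine_similarity(vec_query.reshape(1, -1), vec_docs)[0]
        # Sort document positions by decreasing similarity to the query.
        ranking = sorted(range(len(sim)), key=lambda i: sim[i], reverse=True)
        # `documents` holds (document_id, Document) pairs; return the Documents.
        return [documents[i][1] for i in ranking]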

## Evaluation

The following code evaluates your information retrieval system using the Mean Average Precision evaluation measure.
You can [check out on GitLab](https://gitlab.fi.muni.cz/xstefan3/pv211-utils/blob/master/pv211_utils/eval.py) how Mean Average Precision is computed.
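
For orientation, here is a minimal sketch of the textbook definition of mean average precision [1, Section 8.4]. It is not the `pv211_utils.eval` implementation, which may differ in details; the function names and the toy data (integer identifiers instead of `Query` and `Document` objects) are assumptions made for this example.

```python
def average_precision(ranking, relevant_docs):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, document in enumerate(ranking, start=1):
        if document in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero precision.
    return precision_sum / len(relevant_docs)


def mean_average_precision_sketch(rankings, relevant_by_query):
    """Mean of the average precision over all queries."""
    scores = [
        average_precision(rankings[query], relevant_by_query[query])
        for query in relevant_by_query
    ]
    return sum(scores) / len(scores)


# Hypothetical rankings and judgements over integer identifiers.
toy_rankings = {1: [12, 11, 10, 13], 2: [10, 11, 12, 13]}
toy_relevant = {1: {10, 12}, 2: {11}}
toy_map = mean_average_precision_sketch(toy_rankings, toy_relevant)
print(toy_map)
```

For query 1 the relevant documents appear at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2 = 5/6; for query 2 the single relevant document appears at rank 2, giving 1/2. The mean of the two is 2/3.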

If you choose to `submit_result`, the result of your run will appear among our [Leaderboard submissions](https://docs.google.com/spreadsheets/d/1f9P3bn17n2rHGCxBnn3GVr57PF5hMWJEILp06Uq7Jnk/edit?usp=sharing).

Then, your best score for each week will be submitted and ranked in the Leaderboard sheet. The best solvers will get small **awards during the semester**, or some **seriously big awards** after the personal check at the end of the competition (that's the 8th of May for now).

In [0]:
from pv211_utils.eval import mean_average_precision

my_system = SillyRandomIRSystem()

# Quick sanity checks: `queries` holds (query_id, Query) pairs, and the
# ranking for a query should be as long as the collection.
print(documents)
print(my_system.search(queries[0][1]))
print(len(documents))
print(len(my_system.search(queries[0][1])))

mean_average_precision(my_system, submit_result=False, author_name="Červeňanský, Michal")

Please be polite and do not spoil the game for the others ;)

**Have fun!**

## Bibliography
[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge University Press, 2008.