Indexation & Data search
This project is an intelligent search engine looking into 327 HTML documents, which are scientific articles. It takes into account requests in natural langage.
- The documents were parsed with Perl in order to create a XML file containing essential data in a tree view.
- The tf-idf method was used to score and rank the words in each article. The terms defined as negligible were suppress of the XML corpus, the other words were lemmatized.
- A spelling checker wad developed in Java, notably using the Levenshtein algorithm.
- A grammar was created with AntlrWorks in order to process natural language requests.
- A Java API was used to connect the programs with the Prostgresql data base.
- An interface was designed with servlets
Examples of requests :
- Which articles were written between 2010 and 2012 ?
- I would like the articles dealing with laboratories in the section "focus".
- Who are the authors of the articles from the bulletin 279 ?
- How many articles talk about horses ?
By Céline Aschenbrenner & Anaig Marechal