LO17

Indexing & Data Search

This project is a search engine over a corpus of 327 HTML documents, all of them scientific articles. It accepts requests written in natural language.

The documents were parsed with Perl to create an XML file containing the essential data in a tree structure.
The tf-idf method was used to score and rank the words in each article. Terms judged negligible were removed from the XML corpus, and the remaining words were lemmatized.
A spelling checker was developed in Java, notably using the Levenshtein algorithm.
A grammar was created with ANTLRWorks to process natural language requests.
A Java API was used to connect the programs to the PostgreSQL database.
An interface was designed with servlets.
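
To illustrate the scoring step, here is a minimal sketch of tf-idf in Java. It is not the project's actual code: the class name `TfIdfSketch`, the method `score`, and the toy corpus are hypothetical, and it assumes the common definition tf-idf(t, d) = tf(t, d) × ln(N / df(t)), with each document already tokenized into a list of words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and data are illustrative, not taken from the project.
class TfIdfSketch {

    // Returns the tf-idf score of every distinct term in `doc`.
    // Assumes `doc` is one of the documents in `corpus`, so df >= 1 for each term.
    static Map<String, Double> score(List<String> doc, List<List<String>> corpus) {
        // Count raw occurrences of each term in the document.
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Document frequency: number of documents containing the term.
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double tf = (double) e.getValue() / doc.size();
            double idf = Math.log((double) corpus.size() / df);
            scores.put(e.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("article", "horse", "stable"),
                List.of("article", "horse", "laboratory"),
                List.of("article", "laboratory", "focus"));
        // Print the scores of the first document's terms.
        System.out.println(score(corpus.get(0), corpus));
    }
}
```

In this toy corpus, "article" appears in every document, so its idf is ln(1) = 0 and its score is 0. Terms scoring this low are natural candidates for the "negligible" category, although the threshold actually used in the project is not stated here.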

Examples of requests:

Which articles were written between 2010 and 2012?
I would like the articles dealing with laboratories in the section "focus".
Who are the authors of the articles from bulletin 279?
How many articles talk about horses?
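
A misspelled word in a request like these could be corrected by picking the closest word in the indexed lexicon. The sketch below shows one way to do that with the Levenshtein distance. It is an illustration under assumptions, not the project's spelling checker: the class `SpellingSketch`, the method `closest`, and the sample lexicon are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: names and lexicon are illustrative, not taken from the project.
class SpellingSketch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn `a` into `b`.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of `a` to each prefix of `b`.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the lexicon entry with the smallest distance to `word`
    // (the first one found in case of a tie), or null if the lexicon is empty.
    static String closest(String word, List<String> lexicon) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : lexicon) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lexicon = List.of("horses", "laboratories", "authors");
        System.out.println(closest("horsses", lexicon));
    }
}
```

A real checker would probably also cap the accepted distance, so that a word with no close match is reported rather than silently replaced.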

By Céline Aschenbrenner & Anaig Marechal

Repository contents:

1-preparation_du_corpus
2-indexation
3-correcteur_orthographique
4-analyse_syntaxique
5-version_finale
Rapports
sujetsTD
README.md
