A simple search engine that uses the environmental news dataset as its corpus.
The search engine pre-processes query terms by removing stop words and lemmatizing the remaining terms to normalize them.
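The pre-processing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `STOP_WORDS`, `LEMMAS`, and `preprocess` are hypothetical names, and the tiny hand-written tables stand in for whatever stop-word list and lemmatizer the project really uses.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only; a real
# project would more likely load these from an NLP library.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}
LEMMAS = {"emissions": "emission", "rising": "rise", "forests": "forest"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map each token to its lemma."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Tokens without an entry in the lemma table are kept unchanged.
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The emissions are rising in the forests"))
# ['emission', 'rise', 'forest']
```

Applying the same function to both documents and queries keeps the index terms and the query terms in the same normalized form.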
The search engine supports the following queries:
- bag-of-words unions and intersections
- positional queries
- wildcard queries
- combinations of these queries
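One common way to support the query types listed above is a positional inverted index. The sketch below assumes that approach; the function names and the three sample documents are made up for illustration and are not taken from the project. Wildcards are handled here with glob matching over the vocabulary, which is a simple stand-in for structures such as a permuterm or k-gram index.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def docs_with(index, term):
    """Set of document ids that contain the term."""
    return set(index.get(term, {}))


def union(index, terms):
    """Documents containing at least one of the terms."""
    return set().union(*(docs_with(index, t) for t in terms))


def intersection(index, terms):
    """Documents containing every term."""
    sets = [docs_with(index, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def phrase(index, terms):
    """Positional query: documents where the terms occur consecutively."""
    hits = set()
    for doc_id in intersection(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


def wildcard(index, pattern):
    """Documents containing any term that matches a glob such as 'clim*'."""
    return union(index, [t for t in index if fnmatchcase(t, pattern)])


# Made-up sample documents.
docs = {
    1: "climate change threatens coral reefs",
    2: "coral reefs and climate policy",
    3: "policy change on emissions",
}
index = build_index(docs)

print(intersection(index, ["climate", "coral"]))                 # {1, 2}
print(phrase(index, ["climate", "change"]))                      # {1}
print(phrase(index, ["coral", "reefs"]) & wildcard(index, "pol*"))  # {2}
```

Because every query returns a plain set of document ids, combined queries reduce to ordinary set operations on the individual results.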
The search engine ranks the documents using a vector-space model.
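As a rough illustration of vector-space ranking, the sketch below scores documents by the cosine similarity between TF-IDF vectors of the query and of each document. The exact weighting scheme used by the project is not stated here, so the raw term frequency and `log(N / df)` weighting, along with all function names and sample documents, are assumptions.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return (idf, vectors): idf per term and a TF-IDF vector per document."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(len(docs) / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Document ids with a non-zero score, best match first."""
    idf, vectors = build_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    query_vec = {term: tf * idf[term]
                 for term, tf in Counter(query.lower().split()).items()
                 if term in idf}
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in vectors.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: scores[d], reverse=True)


# Made-up sample documents.
docs = {
    1: "coral reefs bleaching coral",
    2: "climate policy debate",
    3: "coral policy",
}
print(rank("coral", docs))
```

Cosine similarity normalizes by vector length, so a short document that is mostly about the query term can outrank a longer one that mentions the term more often.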
- The dataset can be found here, but it has already been added to the project for convenience.
- Run `pip install -r requirements.txt`
- Run `python3 driver.py` to start the search engine.