SearchEngine

About: Implementation of an inverted-index-based query search engine.

Libraries required

sys
re
nltk
collections
json
math
operator

Preprocessing

Parsing the JSON data to load it into the respective fields.
Concatenating the text from all the fields of a file's data.
Replacing the non-alphabetic characters with spaces.
Removing the stopwords from the text data.
Stemming the words (using the Porter stemmer from nltk.stem).
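
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from createIndex.py: the function name `preprocess`, the tiny `STOPWORDS` set, the sample document, and the lowercasing step are all assumptions made for the example. If nltk is not installed, the sketch falls back to leaving words unstemmed.

```python
import json
import re

try:
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without nltk: no stemming at all.
    def stem(word):
        return word

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def preprocess(raw_json):
    """Turn one JSON document into a list of stemmed, stopword-free tokens."""
    record = json.loads(raw_json)
    # Concatenate the text of every field (values are assumed to be strings).
    text = " ".join(str(value) for value in record.values())
    # Replace every non-alphabetic character with a space, then lowercase
    # (lowercasing is an assumption of this sketch).
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [stem(tok) for tok in tokens]


# Hypothetical sample document.
doc = '{"title": "The cat and the dog", "body": "A dog in 2 hats!"}'
tokens = preprocess(doc)
print(tokens)
```

The digit and the exclamation mark become spaces, the stopwords are dropped, and the remaining words are passed to the stemmer.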

Creating Index

Two types of index are created:
One maps each word to the docIds of the documents that contain it, along with its frequency in each document.
The other maps each word to the docIds of the documents that contain it, along with the word's tf-idf score in each document.

You can uncomment some parts of the code to write these indexes to files and inspect them yourself.
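
A frequency index of the first kind might look like the sketch below. The function name `build_frequency_index`, the nested-dictionary layout, and the sample documents are assumptions for illustration; the actual structure in createIndex.py may differ.

```python
from collections import Counter, defaultdict


def build_frequency_index(docs):
    """Map each word to {doc_id: number of times the word occurs in that doc}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for word, count in Counter(tokens).items():
            index[word][doc_id] = count
    return dict(index)


# Hypothetical, already-preprocessed documents keyed by docId.
docs = {
    "d1": ["cat", "dog", "dog"],
    "d2": ["dog", "hat"],
}
freq_index = build_frequency_index(docs)
print(freq_index["dog"])  # {'d1': 2, 'd2': 1}
```

Keeping the postings as a dictionary per word makes it cheap to look up every document that contains a query token.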

Calculating Scores

For each word we calculate the tf-idf score.
For more insight into how tf and idf are each formulated, please visit here.
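
As a rough illustration, a tf-idf index can be derived from a frequency index as shown below. The README does not state which tf and idf variants are used, so this sketch assumes raw term counts for tf and `log(N / df)` (natural log) for idf; the function name `build_tfidf_index` and the sample data are hypothetical.

```python
import math


def build_tfidf_index(freq_index, num_docs):
    """Convert {word: {doc_id: count}} into {word: {doc_id: tf-idf score}}."""
    tfidf_index = {}
    for word, postings in freq_index.items():
        # Document frequency = number of documents that contain the word.
        idf = math.log(num_docs / len(postings))
        tfidf_index[word] = {doc_id: tf * idf for doc_id, tf in postings.items()}
    return tfidf_index


# Hypothetical frequency index over two documents.
freq_index = {"cat": {"d1": 1}, "dog": {"d1": 2, "d2": 1}, "hat": {"d2": 1}}
tfidf_index = build_tfidf_index(freq_index, num_docs=2)
print(tfidf_index)
```

Under this variant, a word that appears in every document (here "dog") gets an idf of zero and so contributes nothing to ranking, while rarer words are weighted more heavily.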

Query processing

loadQueries is the function that loads the queries already present in the file inside /resources/.
Each query is tokenized and stemmed in the same way as the document texts were when preparing the indexes.
Tokens not present in the index are dropped from the query tokens; see line #149 of createIndex.py.
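
The query handling described above could be sketched like this. The function name `process_query` and its parameters are hypothetical, and the stemmer defaults to the identity function so the example runs without nltk; in practice you would pass the same Porter stemmer and stopword list used for the documents.

```python
import re


def process_query(query, index, stem=lambda word: word, stopwords=frozenset()):
    """Tokenize and stem a query, keeping only tokens that exist in the index."""
    # Same cleaning as for documents: non-alphabetic characters become spaces.
    text = re.sub(r"[^a-zA-Z]", " ", query).lower()
    tokens = [stem(tok) for tok in text.split() if tok not in stopwords]
    # Tokens the index has never seen cannot match any document, so drop them.
    return [tok for tok in tokens if tok in index]


# Hypothetical index containing only two words.
index = {"cat": {"d1": 1}, "dog": {"d1": 2, "d2": 1}}
print(process_query("Cat vs. unicorn dog!", index))  # ['cat', 'dog']
```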

Results

Our system returns a list of docIds ordered from higher to lower relevance.
The docIds at the beginning are more relevant to the query than the later ones.
Documents are compared on the tf-idf scores computed for all the words of the query across all the documents.
The system's nDCG (relevance) score is ~0.6, which is a good score for a method as basic and naive as the one implemented here.
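
One way to picture the ranking and its evaluation is the sketch below. It assumes a document's score is the sum of the tf-idf weights of the query tokens it contains, and it uses a common nDCG variant (linear gain, log2 discount, ideal ranking truncated to the length of the returned list). Neither choice is confirmed by this README, evaluation.py may compute things differently, and the names `rank` and `ndcg` and all the data are hypothetical.

```python
import math
from collections import defaultdict


def rank(query_tokens, tfidf_index):
    """Return docIds sorted by summed tf-idf score, highest first."""
    scores = defaultdict(float)
    for token in query_tokens:
        for doc_id, weight in tfidf_index.get(token, {}).items():
            scores[doc_id] += weight
    return sorted(scores, key=scores.get, reverse=True)


def ndcg(ranked_doc_ids, relevance):
    """nDCG of a ranking given {doc_id: graded relevance} judgments."""
    def dcg(gains):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
        return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)[:len(ranked_doc_ids)]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical tf-idf index and relevance judgments.
tfidf_index = {
    "cat": {"d1": 0.5, "d3": 0.2},
    "hat": {"d2": 0.4, "d3": 0.4},
}
ranked = rank(["cat", "hat"], tfidf_index)
print(ranked)  # ['d3', 'd1', 'd2']
score = ndcg(ranked, {"d1": 2, "d3": 1})
print(round(score, 2))
```

A score of 1.0 would mean the returned order matches the ideal order of the judged documents; here the most relevant document is ranked second, so the score falls below 1.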

Future Work

Semantics can be used to find additional words related to each word in the query.
Field-specific queries still need to be handled.

Discussions

Join for discussions and doubts regarding the workflow of the system.
