IR-Wikipedia-Search-Engine

In this project we build a search engine for the entire Wikipedia corpus. The engine workflow is described below:

Search body:

Returns up to 100 search results for the query, ranked by the cosine similarity between the TF-IDF vectors of the query and of the article bodies.
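The body ranking can be illustrated with a minimal in-memory sketch. It is not the code from backend.py: the function names (`build_index`, `search_body`), the log10 IDF weighting, and the dictionary-based index are assumptions chosen for illustration, and the real engine reads its postings from an inverted index stored in a bucket.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build a toy TF-IDF index. docs maps doc_id -> list of body tokens."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    # Assumed weighting: idf = log10(N / df)
    idf = {term: math.log10(n_docs / len(p)) for term, p in postings.items()}
    # Precompute the Euclidean norm of every document vector
    squared = defaultdict(float)
    for term, p in postings.items():
        for doc_id, tf in p.items():
            squared[doc_id] += (tf * idf[term]) ** 2
    norms = {doc_id: math.sqrt(v) for doc_id, v in squared.items()}
    return postings, idf, norms


def search_body(query_tokens, postings, idf, norms, top_n=100):
    """Return up to top_n (doc_id, cosine similarity) pairs, best first."""
    q_tf = Counter(t for t in query_tokens if t in postings)
    q_weights = {t: tf * idf[t] for t, tf in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []
    dot = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in postings[term].items():
            dot[doc_id] += q_w * tf * idf[term]
    ranked = [
        (doc_id, score / (q_norm * norms[doc_id]))
        for doc_id, score in dot.items()
        if norms[doc_id] > 0
    ]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

Only documents that share at least one term with the query are scored, so the cost depends on the length of the posting lists rather than on the size of the corpus.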

Search title:

Returns all search results that contain a query word in the article title, sorted in descending order of the number of query words that appear in the title. For example, a document with a title that matches two of the query words will be ranked before a document with a title that matches only one query word.
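The title ordering can be sketched as a simple count over a term-to-documents map. The name `search_title`, the shape of `title_index`, and the choice to count each distinct query word once are assumptions for illustration, not the project's actual implementation.

```python
from collections import Counter


def search_title(query_tokens, title_index):
    """Rank documents by how many distinct query words occur in their title.

    title_index is a hypothetical mapping: term -> set of doc_ids whose
    title contains that term.
    """
    matches = Counter()
    for term in set(query_tokens):  # each distinct query word counts once
        for doc_id in title_index.get(term, ()):
            matches[doc_id] += 1
    # Most matched words first; ties broken by doc_id for a stable order
    return sorted(matches.items(), key=lambda pair: (-pair[1], pair[0]))
```

Unlike the body search, no result limit is applied here, since every matching title is returned.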

Search anchor:

Returns all search results that contain a query word in the anchor text pointing to the article, sorted in descending order of the number of query words that appear in the text linking to the page. For example, a document with anchor text that matches two of the query words will be ranked before a document with anchor text that matches only one query word.
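For anchor text, the words are credited to the page being linked to, not to the page that contains the link. A minimal sketch of that grouping follows, assuming links arrive as `(target_doc_id, anchor_text)` pairs and using a simplified lowercase whitespace tokenizer; both are illustrative assumptions, as are the function names.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Map each anchor term to the set of pages it links to.

    links is a hypothetical iterable of (target_doc_id, anchor_text) pairs.
    """
    index = defaultdict(set)
    for target_id, anchor_text in links:
        for term in anchor_text.lower().split():  # simplified tokenizer
            index[term].add(target_id)
    return index


def search_anchor(query_tokens, anchor_index):
    """Rank target pages by the number of distinct query words in their anchors."""
    counts = defaultdict(int)
    for term in set(query_tokens):
        for target_id in anchor_index.get(term, ()):
            counts[target_id] += 1
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
```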

Search:

In this part we create a Word2Vec model for the entire Wikipedia corpus. With this model, we can find the words in the corpus that are semantically most similar to each query word provided by the user. The model also lets us measure the similarity between the words of the query itself (for example, the words “information” and “retrieval” get a high similarity score). We created the model using the gensim package, saved the trained model to a bin file, and uploaded it to the bucket. First, we tokenize the query and check how many words it contains. This count determines how we use the model.
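One way the query length could steer the use of the model is query expansion with a different number of neighbours for single-word and multi-word queries. The sketch below is an assumption, not the project's logic: the function name `expand_query`, the neighbour counts, and the similarity threshold are all hypothetical. The `vectors` argument is expected to behave like gensim's `KeyedVectors`, that is, to support `word in vectors` and `most_similar(positive=[...], topn=...)`.

```python
def expand_query(query_tokens, vectors, single_topn=3, multi_topn=1,
                 min_similarity=0.6):
    """Append similar words to the query, depending on query length.

    Hypothetical policy: a one-word query gets more neighbours than a
    longer query, where the extra words are more likely to add noise.
    """
    topn = single_topn if len(query_tokens) == 1 else multi_topn
    expanded = list(query_tokens)
    for term in query_tokens:
        if term not in vectors:  # skip out-of-vocabulary words
            continue
        for candidate, similarity in vectors.most_similar(positive=[term], topn=topn):
            if similarity >= min_similarity and candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Possible usage with gensim, assuming the bin file is in word2vec binary
# format (the path is a placeholder):
#   from gensim.models import KeyedVectors
#   vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)
#   expand_query(["retrieval"], vectors)
```

The expanded token list can then be passed to any of the searches above in place of the original query.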

Repository files:

IndexCreationCode.ipynb
README.md
backend.py
bins_path.txt
inverted_index_gcp.py
run_frontend_in_colab.ipynb
search_frontend.py
startup_script_gcp.sh
w2vec wikipedia.py

OmerIdgar/IR-Wikipedia-Search-Engine
