FarshidNooshi/Persian-Search-Engine

In The Name Of GOD

Persian Search Engine

License: APACHE-2.0 GitHub contributors

About The Project

This is my Information Retrieval course project. It was carried out in three phases and covers various information retrieval techniques such as TF-IDF scoring and clustering. Below is a brief introduction to the project and the course.

Preprocessing Section

All query answering approaches need a positional posting list, which is built in the Phase 1 section. For word embedding purposes, a word-to-vector model is also constructed.
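As a rough illustration of what a positional posting list looks like, here is a minimal sketch. It assumes whitespace tokenization and English toy documents; the function name `build_positional_index` and the sample data are hypothetical and are not taken from the project code, which would also need Persian normalization and stemming before indexing.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: tokens are separated by whitespace and already normalized.
        for position, token in enumerate(text.split()):
            index[token][doc_id].append(position)
    # Return plain dicts so that looking up a missing term raises KeyError.
    return {term: dict(postings) for term, postings in index.items()}


# Toy documents, used only for illustration.
docs = {
    1: "iran national football team",
    2: "football team of iran",
}
index = build_positional_index(docs)
print(index["football"])  # positions of "football" in each document
```

Keeping positions, and not only document IDs, is what makes phrase queries possible: two terms form a phrase in a document when their positions differ by one.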

Categorization

To answer queries fast enough, we need to categorize documents and queries. In the third section, a KNN approach is implemented, along with an unsupervised approach that uses the K-means algorithm.
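The KNN idea can be sketched as follows. This is a minimal example under stated assumptions: documents are already represented as small dense vectors, similarity is cosine similarity, and the category is chosen by majority vote. The names `cosine` and `knn_category`, the value of `k`, and the toy vectors are hypothetical and do not come from the project.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_category(query_vec, labeled_vecs, k=3):
    """Return the majority category among the k most similar labeled vectors."""
    ranked = sorted(
        labeled_vecs,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled vectors, used only for illustration.
training = [
    ([1.0, 0.1], "sports"),
    ([0.9, 0.2], "sports"),
    ([0.1, 1.0], "economy"),
    ([0.2, 0.9], "economy"),
    ([0.0, 1.0], "economy"),
]
predicted = knn_category([0.8, 0.3], training, k=3)
print(predicted)
```

Once a query has a category, the search can be restricted to documents in that category, which is where the speed-up comes from.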

Query answering

Multiple query answering approaches are implemented:

  • A simple common-word counting approach in Phase 1 and 2.
  • An approach based on the inner product and tf-idf in Phase 2, which uses champion lists to speed up retrieval.
  • A word embedding-based approach with an inner product criterion in Phase 1.
  • A fast version of the word embedding approach in Phase 2, which uses clusters to speed up retrieval.
  • A category-aware approach in Phase 3, in which the user enters the category along with the query itself.
  • Bulk-inserting the data into Elasticsearch in Phase 3 and using Elasticsearch as another solution, suitable for use in products.
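To make the tf-idf approach with champion lists more concrete, here is a minimal sketch. It assumes log-scaled term frequency, `log10(N / df)` as idf, an unnormalized inner product as the score, and champion lists that keep the `r` documents with the highest tf weight per term. The function names, the parameter `r`, and the toy documents are hypothetical; the weighting used in the project may differ.

```python
import math
from collections import Counter, defaultdict


def build_champion_lists(docs, r=2):
    """Return (champions, idf); champions[term] keeps the r docs with highest tf weight."""
    n_docs = len(docs)
    postings = defaultdict(dict)  # term -> {doc_id: log-scaled tf}
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            postings[term][doc_id] = 1 + math.log10(count)
    idf = {term: math.log10(n_docs / len(p)) for term, p in postings.items()}
    champions = {
        term: dict(sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:r])
        for term, p in postings.items()
    }
    return champions, idf


def search(query, champions, idf, top_k=3):
    """Score only documents found in the champion lists of the query terms."""
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, tf_weight in champions.get(term, {}).items():
            scores[doc_id] += tf_weight * idf[term]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy documents, used only for illustration.
docs = {
    1: "football football league",
    2: "football economy",
    3: "economy market market market",
    4: "league table",
}
champions, idf = build_champion_lists(docs, r=2)
results = search("football league", champions, idf)
print(results)
```

The trade-off is that a smaller `r` makes scoring faster but can miss relevant documents that fall outside every champion list for the query terms.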

Information Retrieval

Information retrieval is the process through which a computer system can respond to a user's query for text-based information on a specific topic. Web search is one of the most important applications of information retrieval techniques and the area in which most people interact with IR systems. The goal of this course is to introduce students to the basics, models, tools, and applications of modern information retrieval.

Contents

This repository is for my Information Retrieval course project at the Amirkabir University of Technology and contains the following files:

Topics

The topics in this course are:

  • Text Preprocessing and vocabulary construction
    • Document and word separation
    • Normalization
    • Stemming and lemmatization
    • Spelling correction
  • Indexing
    • Index construction
    • Index compression
  • Retrieval and ranking methods
    • Boolean, Vector-based, Probabilistic Retrieval
  • Performance evaluation for information retrieval methods
  • Query languages and operators
  • Document classification and clustering
  • Web search
    • Basics
    • Crawling
    • Link analysis
  • IR-based applied systems

Instructor

The Information Retrieval course in Spring 2022 at the Computer Engineering Department of Amirkabir University of Technology was taught by Assoc. Prof. Ahmad Nickabadi.

Textbook

C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

About

Amirkabir University of Technology- Computer Engineering department- Information Retrieval Course Project
