boonious/information_retrieval

IR

IR is an Elixir-based exercise in information retrieval: in-memory indexing and full-text search over a CSV dataset.

Usage - Interactive Elixir (IEx)

The application can be invoked interactively through IEx.

  # from application home directory,
  # start IEx with the application running
  # this will also compile the application if it hasn't been compiled
  ...$ iex -S mix
    
  # or start IEx with shell history
  ...$ iex --erl "-kernel shell_history enabled" -S mix

Dataset parsing and indexing:

  # build an in-memory corpus for the entire dataset
  iex> {:ok, corpus} = IR.parse(:all)

  # build a corpus from a specific number of documents
  iex> {:ok, corpus} = IR.parse(2)

  # build an in-memory index for the entire dataset
  iex> {:ok, index} = IR.indexing(:all)
  
  # index a specific number of documents
  iex> {:ok, index} = IR.indexing(500)
  ... # %{ "term" => "postings"..}

  # build the index and the text corpus together
  iex> {:ok, index, corpus} = IR.indexing(5000, corpus: true)
  ...
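
To make the `%{"term" => postings}` shape more concrete, here is a minimal sketch of an in-memory inverted index. It is not IR's implementation: the module `IndexSketch`, its `tokenize/1` helper, the `%{id => text}` corpus shape and the use of a `MapSet` for postings are all assumptions made for illustration.

```elixir
defmodule IndexSketch do
  # Hypothetical helper (not part of IR): lowercase the text and
  # split it on runs of non-word characters.
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/\W+/u, trim: true)
  end

  # Build %{term => MapSet of doc ids} from a corpus given as %{id => text}.
  def build(corpus) do
    for {id, text} <- corpus, term <- tokenize(text), reduce: %{} do
      index -> Map.update(index, term, MapSet.new([id]), &MapSet.put(&1, id))
    end
  end
end

# Toy corpus, made up for this example.
corpus = %{
  1 => "Christopher Columbus",
  2 => "Jan van Eyck, Northern Renaissance",
  3 => "Renaissance painting"
}

index = IndexSketch.build(corpus)

# Doc ids whose text contains "renaissance"
postings = index["renaissance"] |> MapSet.to_list() |> Enum.sort()
IO.inspect(postings)
# => [2, 3]
```

Looking up a term is then a single map access, which is why building the index once and re-using it pays off for repeated queries.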

The generated corpus and index can be used in search queries.

For quick tests, the application's search function, IR.q, automatically generates a corpus and an index of up to 1000 documents from the CSV dataset when a pre-built index and corpus are not supplied. The query then runs against this small index.

Results are currently returned as a list of {id, score} tuples, pairing each document ID with its score.

  # quick search test over at most 1000 docs
  # returns a list of {id, score} tuples
  iex> IR.q "northern renaissance van eyck"
  Indexing..
  Found 4 results.
  .... # <- search results
  [{4, 1.79218}, {5, 1.79218}, {6, 1.79218}, {7, 0.67294}]

  # stricter AND boolean search
  iex> IR.q "northern renaissance van eyck", op: :and
  Indexing..
  Found 3 results.
  ....
  [{4, 1.79218}, {5, 1.79218}, {6, 1.79218}]


  # create an in-memory index and corpus for the entire CSV dataset
  # (> 1000 docs); this can take a while for a large dataset
  iex> {:ok, index, corpus} = IR.indexing(:all, corpus: true)
  ...

  # use the index / corpus in search
  iex> IR.q "renaissance", index: index, corpus: corpus
  ...

  # re-use the corpus / index for another search
  # without waiting for indexing
  iex> IR.q "van eyck", index: index, corpus: corpus, op: :and
  ...
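
The difference between the default search and `op: :and` can be sketched as set operations over postings. This is an illustration only: `BooleanSketch.match/3` is a hypothetical helper, the index is assumed to map each term to a `MapSet` of doc ids, and the default is assumed to behave like a boolean OR. The postings below are toy data.

```elixir
defmodule BooleanSketch do
  # Hypothetical helper: returns the sorted doc ids matching the query terms.
  # `index` is assumed to be %{term => MapSet of doc ids}.
  def match(index, terms, op \\ :or) do
    postings = Enum.map(terms, &Map.get(index, &1, MapSet.new()))

    combined =
      case {op, postings} do
        {_, []} -> MapSet.new()
        # OR: a doc matches if it contains any of the terms
        {:or, _} -> Enum.reduce(postings, &MapSet.union/2)
        # AND: a doc matches only if it contains every term
        {:and, _} -> Enum.reduce(postings, &MapSet.intersection/2)
      end

    combined |> MapSet.to_list() |> Enum.sort()
  end
end

# Toy postings, made up for this example.
toy_index = %{
  "van" => MapSet.new([1, 2]),
  "eyck" => MapSet.new([1, 2, 3]),
  "renaissance" => MapSet.new([2, 3, 4])
}

any_term = BooleanSketch.match(toy_index, ["van", "eyck", "renaissance"])
# => [1, 2, 3, 4]
all_terms = BooleanSketch.match(toy_index, ["van", "eyck", "renaissance"], :and)
# => [2]
```

An AND query can only return a subset of what the OR query returns, which matches the shorter result list shown for `op: :and` above.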

Ranking and sorting of results can be toggled with the :sort option.

  iex> IR.q "christopher columbus carlo eyck galileo galilei", sort: false
  Indexing..
  Found 5 results.
  ...
  [{1, 1.25276}, {4, 0.55962}, {5, 0.55962}, {6, 0.55962}, {7, 5.01105}]

  # rank results by relevance score
  iex> IR.q "christopher columbus carlo eyck galileo galilei", sort: true
  Indexing..
  Found 5 results.
  [{7, 5.01105}, {1, 1.25276}, {4, 0.55962}, {5, 0.55962}, {6, 0.55962}]
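
Because results are plain `{id, score}` tuples, an unsorted result list can also be ranked after the fact with the standard library. The sketch below re-sorts the unsorted output shown above; it assumes Elixir 1.10 or later for the `:desc` argument to `Enum.sort_by/3`, and it is not a description of how IR sorts internally.

```elixir
# Unsorted results, copied from the `sort: false` example above.
results = [{1, 1.25276}, {4, 0.55962}, {5, 0.55962}, {6, 0.55962}, {7, 5.01105}]

# Highest score first; the sort is stable, so tied docs keep their original order.
ranked = Enum.sort_by(results, fn {_id, score} -> score end, :desc)

IO.inspect(ranked)
# => [{7, 5.01105}, {1, 1.25276}, {4, 0.55962}, {5, 0.55962}, {6, 0.55962}]
```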

CSV data

The dataset is supplied as a CSV file named data.csv in the application home directory. Only the title and description columns are currently imported, and the file must identify them with the header names title and description. Data in other columns is ignored.
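
Selecting columns by header name can be sketched as follows. This is not IR's parser: `ColumnSketch.select/2` is a hypothetical helper that works on rows already split into fields (first row being the header), and the rows are toy data.

```elixir
defmodule ColumnSketch do
  # Hypothetical helper: keep only the wanted columns, looked up by header name.
  # `rows` is a list of rows, each a list of fields; the first row is the header.
  def select([header | rows], wanted \\ ["title", "description"]) do
    positions = Enum.map(wanted, fn name -> Enum.find_index(header, &(&1 == name)) end)

    Enum.map(rows, fn row ->
      wanted
      |> Enum.zip(positions)
      # a column missing from the header yields nil for that key
      |> Map.new(fn {name, pos} -> {name, pos && Enum.at(row, pos)} end)
    end)
  end
end

# Toy rows, made up for this example; "id" and "year" are dropped.
rows = [
  ["id", "title", "description", "year"],
  ["1", "Arnolfini Portrait", "Oil painting by Jan van Eyck", "1434"]
]

docs = ColumnSketch.select(rows)
IO.inspect(docs)
# => [%{"description" => "Oil painting by Jan van Eyck", "title" => "Arnolfini Portrait"}]
```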

The default dataset filename and path can be configured in config/config.exs:

  config :ir,
    data_filepath: "another_path/another_filename.csv"
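
For reference, a value set this way can be read at runtime with `Application.get_env/3`. The snippet is a sketch rather than IR's own lookup code; the `"data.csv"` fallback is an assumption that mirrors the default filename described above.

```elixir
# Read the configured dataset path, falling back to an assumed default
# of "data.csv" when nothing is set under the :ir application.
path = Application.get_env(:ir, :data_filepath, "data.csv")

IO.puts("Dataset path: #{path}")
```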

Requirements

This application is written in Elixir, which runs on the Erlang VM, so Elixir is required to compile and build it. On macOS, Elixir can be installed via Homebrew:

  brew install elixir

The above installs both Elixir and Erlang. For other operating systems, see the installation instructions on elixir-lang.org.

To compile the application, run the following from the application home directory:

  ...$ mix deps.get; mix compile

Documentation

API documentation can be generated with the following command:

  ...$ mix docs
  Docs successfully generated.
  View them at "doc/index.html".
