Python Document Indexer

A minimal document indexer written in Python. It computes a positional index and a term-document matrix for a document collection, then answers ranked queries using both structures. Written as a college assignment.
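To make the idea of a positional index concrete, here is a minimal sketch. It is not the project's actual code: the function names `build_positional_index` and `phrase_documents`, the whitespace tokenizer, and the in-memory `documents` dictionary are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {document name: [positions where the term occurs]}.

    `documents` is a hypothetical {name: text} dictionary; tokenization is a
    plain lowercase whitespace split, which is an assumption.
    """
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(name, []).append(position)
    return dict(index)


def phrase_documents(index, phrase):
    """Return the names of documents that contain the phrase as consecutive terms."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    # Only documents that contain every term can contain the whole phrase.
    candidates = set.intersection(*(set(index[term]) for term in terms))
    matches = set()
    for name in candidates:
        for start in index[terms[0]][name]:
            # The phrase matches if term k appears at position start + k.
            if all(start + offset in index[term][name]
                   for offset, term in enumerate(terms)):
                matches.add(name)
                break
    return matches


documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the brown quick fox",
}
index = build_positional_index(documents)
print(index["quick"])                          # {'a.txt': [1], 'b.txt': [2]}
print(phrase_documents(index, "quick brown"))  # {'a.txt'}
```

Storing positions rather than plain document membership is what allows the phrase check: both sample documents contain "quick" and "brown", but only one has them adjacent and in order.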

Dependencies

NumPy.
SciPy.

Run

General:

usage: python -m indexer [-h] ACTION ...

options:
  -h, --help  show this help message and exit

Actions:
  The action to apply over the input document collection.

  ACTION
    index     Compute the index and term document matrix for collection.
    query     Query the documents that include the input phrase and sort them by cosine similarity.
    dump      Dump the positional index for collection.

Index:

usage: python -m indexer index [-h] [--no-stop-words] [COLLECTION]

Compute the index and term document matrix for collection.

positional arguments:
  COLLECTION           The path to the directory containing the document collection. Defaults to the current working directory.

options:
  -h, --help           show this help message and exit
  --no-stop-words, -n  Do not remove stop words.
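
As a rough illustration of a term-document matrix and of what removing stop words changes, the sketch below builds a matrix of raw term counts. The `STOP_WORDS` set, the function name `term_document_matrix`, and the use of raw counts as weights are assumptions; the project's own stop word list and weighting are not shown in this README. The project depends on NumPy and SciPy, but this sketch uses only the standard library.

```python
from collections import Counter

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and"}


def term_document_matrix(documents, remove_stop_words=True):
    """Return (terms, names, matrix), where matrix[i][j] counts terms[i] in names[j]."""
    names = sorted(documents)
    counts = {}
    for name in names:
        tokens = documents[name].lower().split()
        if remove_stop_words:
            tokens = [token for token in tokens if token not in STOP_WORDS]
        counts[name] = Counter(tokens)
    # One row per distinct term, one column per document.
    terms = sorted(set().union(*counts.values()))
    matrix = [[counts[name][term] for name in names] for term in terms]
    return terms, names, matrix


documents = {"a.txt": "the fox and the hound", "b.txt": "the fox"}
terms, names, matrix = term_document_matrix(documents)
print(terms)   # ['fox', 'hound']
print(matrix)  # [[1, 1], [1, 0]]
```

Passing `remove_stop_words=False` mirrors the effect of `--no-stop-words` in this sketch: rows for "and" and "the" are kept in the matrix.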

Query:

usage: python -m indexer query [-h] PHRASE [COLLECTION]

Query the documents that include the input phrase.

positional arguments:
  PHRASE      The phrase to query for.
  COLLECTION  The path to the directory containing the document collection. Defaults to the current working directory.

options:
  -h, --help  show this help message and exit
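
The ranking step can be sketched as follows: each document and the query phrase become term-count vectors, and documents are sorted by the cosine of the angle between their vector and the query vector. This is a minimal sketch, not the project's implementation; the names `cosine_similarity` and `rank_documents`, the whitespace tokenizer, and the raw-count weighting are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def rank_documents(documents, phrase):
    """Return document names sorted from most to least similar to the phrase."""
    query = Counter(phrase.lower().split())
    scores = {
        name: cosine_similarity(query, Counter(text.lower().split()))
        for name, text in documents.items()
    }
    return sorted(scores, key=lambda name: scores[name], reverse=True)


documents = {
    "a.txt": "brown fox",
    "b.txt": "brown fox jumps over the lazy dog",
    "c.txt": "lazy dog",
}
print(rank_documents(documents, "brown fox"))  # ['a.txt', 'b.txt', 'c.txt']
```

Cosine similarity normalizes by vector length, so a short document that is entirely about the phrase scores higher than a long one that merely mentions it.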

Dump:

usage: python -m indexer dump [-h] {index,matrix} [COLLECTION]

Dump the positional index for collection.

positional arguments:
  {index,matrix}  Which structure to dump.
  COLLECTION      The path to the directory containing the document collection. Defaults to the current working directory.

options:
  -h, --help      show this help message and exit

Test

This project uses pytest for unit testing. Run the tests as follows:

pytest

OmarEmaraDev/python_document_indexer
