Latent Semantic Analysis

Introduction

The program lsi.py implements a simple latent semantic analysis engine using singular value decomposition (SVD). With small changes, the program can also be used to try out non-negative matrix factorization or vector quantization.
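
For readers new to the technique, the following is a minimal sketch of SVD-based latent semantic analysis using numpy. It is not taken from lsi.py: the function names, the whitespace tokenisation and the toy documents are all assumptions made for illustration.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i, j] counts term i in document j."""
    # Assumption: tokens are lower-cased, whitespace-separated words.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[row[word], j] += 1.0
    return vocab, matrix


def truncated_svd(matrix, z):
    """Keep only the z largest singular values and their singular vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    z = min(z, len(s))
    return u[:, :z], s[:z], vt[:z, :]


# Toy corpus (hypothetical, not from the repository).
docs = ["cat sat mat", "cat chased mouse", "stock market fell", "market prices rose"]
vocab, a = build_term_document_matrix(docs)
u, s, vt = truncated_svd(a, z=2)

term_vectors = u * s      # one z-dimensional row per term
doc_vectors = vt.T * s    # one z-dimensional row per document
```

Here `z` plays the role of the `-z` option: the number of dimensions kept in the lower-dimensional space.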

It implements the following:

Given a document title, it outputs k similar documents.
Given any word, it outputs k related words from all the documents. If the word occurs in none of the documents, it outputs k random words.
Given a query, it outputs k relevant documents for the query.
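
The three lookups above can be sketched as cosine-similarity searches in the latent space. This is an illustrative sketch only, not the code in lsi.py: the helper names, the toy corpus and the choice to fold a query in as q^T U are assumptions.

```python
import random

import numpy as np


def latent_vectors(docs, z):
    """Return (vocab, u_z, term_vectors, doc_vectors) from a truncated SVD."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    row = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[row[w], j] += 1.0
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    z = min(z, len(s))
    return vocab, u[:, :z], u[:, :z] * s[:z], vt[:z].T * s[:z]


def top_k_cosine(target, vectors, k, exclude=None):
    """Indices of the k rows of `vectors` most cosine-similar to `target`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    sims = vectors @ target / np.where(norms == 0, 1.0, norms)
    order = [int(i) for i in np.argsort(-sims) if i != exclude]
    return order[:k]


def related_words(word, vocab, term_vectors, k, rng=random):
    """k related words, or k random words if `word` is not in the vocabulary."""
    if word not in vocab:
        return rng.sample(vocab, min(k, len(vocab)))
    i = vocab.index(word)
    return [vocab[j] for j in top_k_cosine(term_vectors[i], term_vectors, k, exclude=i)]


def query_vector(query, vocab, u_z):
    """Fold a bag-of-words query into the latent space as q^T U."""
    q = np.zeros(len(vocab))
    for w in query.lower().split():
        if w in vocab:
            q[vocab.index(w)] += 1.0
    return q @ u_z


# Toy corpus (hypothetical); document indices stand in for titles.
docs = ["cat sat mat", "cat chased mouse", "stock market fell", "market prices rose"]
vocab, u_z, term_vectors, doc_vectors = latent_vectors(docs, z=2)

similar_docs = top_k_cosine(doc_vectors[0], doc_vectors, k=1, exclude=0)
similar_terms = related_words("cat", vocab, term_vectors, k=4)
relevant_docs = top_k_cosine(query_vector("market prices", vocab, u_z), doc_vectors, k=2)
```

Folding the query in as q^T U puts it in the same coordinates as the document vectors, which equal A^T U when they are taken as V scaled by the singular values.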

The code has been optimised to work well on large inputs too. The additional steps of stopword removal, tf-idf weighting and normalisation are implemented in the same file but are commented out. To use them, uncomment the relevant lines; only slight modifications are needed.
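
As a rough idea of what these optional steps do, here is a minimal sketch. It does not reproduce the commented-out code in lsi.py: the stopword list, the function names and the particular tf-idf formula (raw count times log(N / df)) are assumptions.

```python
import numpy as np

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def tfidf(counts):
    """Weight a terms x documents count matrix by log(N / document frequency)."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1))
    return counts * idf[:, np.newaxis]


def normalise_columns(matrix):
    """Scale every document column to unit Euclidean length (zero columns stay zero)."""
    norms = np.linalg.norm(matrix, axis=0)
    return matrix / np.where(norms == 0, 1.0, norms)


counts = np.array([[1.0, 1.0], [2.0, 0.0]])   # 2 terms x 2 documents
weighted = normalise_columns(tfidf(counts))
```

A term that occurs in every document gets an idf of zero under this formula, so it no longer influences the decomposition.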

Setting up the environment?

Requires scipy, numpy and a few other basic Python libraries. To save yourself the struggle of setting up the environment by hand, use the requirements.txt file to set up a Python virtual environment:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

To deactivate the virtualenv use: deactivate

Running the latent semantic search engine?

lsi.py can be run as follows:

python lsi.py -z 200 -k 10 --dir Directory --doc_in <name of input document file> --doc_out <name of output document file to be generated by code> --term_in <name of input term file> --term_out <name of output term file to be generated by code> --query_in <name of input query file> --query_out <name of output query file to be generated by code>

where
-z: Dimensionality of the lower-dimensional space
-k: Number of similar terms/documents to be returned
--dir: Directory containing the input documents
--doc_in: Input file containing a list of document titles (one per line) for which k similar documents are to be returned.
--doc_out: Output file. Each line holds the titles of the k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the document on the corresponding line of doc_in.
--term_in: Input file containing a list of words (one per line) for which k similar words/terms are to be returned.
--term_out: Output file. Each line holds the k words (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the word on the corresponding line of term_in.
--query_in: Input file containing a list of queries (one per line) for which k relevant documents are to be returned.
--query_out: Output file. Each line holds the titles of the k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are relevant to the query on the corresponding line of query_in.
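
To make the interface concrete, here is a hedged sketch of how these options and the ';<tab>' output format could be handled with Python's argparse. Whether lsi.py actually uses argparse is not stated here; the defaults (200 and 10, copied from the example command) and the helper names are assumptions.

```python
import argparse


def build_parser():
    """Command-line options mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Latent semantic search engine (sketch)")
    parser.add_argument("-z", type=int, default=200,
                        help="dimensionality of the lower-dimensional space")
    parser.add_argument("-k", type=int, default=10,
                        help="number of similar terms/documents to return")
    parser.add_argument("--dir", required=True,
                        help="directory containing the input documents")
    for name in ("doc", "term", "query"):
        parser.add_argument(f"--{name}_in", help=f"input {name} file, one entry per line")
        parser.add_argument(f"--{name}_out", help=f"output {name} file to be generated")
    return parser


def format_results(items):
    """Join one line of results with a semicolon followed by a tab."""
    return ";\t".join(items)


def write_results(path, rows):
    """Write one formatted line per input entry."""
    with open(path, "w", encoding="utf-8") as handle:
        for items in rows:
            handle.write(format_results(items) + "\n")


args = build_parser().parse_args(
    ["-z", "50", "-k", "3", "--dir", "Documents",
     "--doc_in", "doc_in.txt", "--doc_out", "doc_out.txt"]
)
```

Line i of each output file corresponds to line i of the matching input file, so a reader can split each output line on ';<tab>' to recover the k results.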

Note: The documents in the directory must be numbered 1, 2, 3, 4, ..., n.
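
A loader that relies on this numbering might look like the sketch below. It is an assumption-laden illustration rather than the repository's code: in particular, it assumes each file name is the document number, optionally followed by an extension, and the function name is hypothetical.

```python
from pathlib import Path


def load_numbered_documents(directory):
    """Read documents named 1, 2, ..., n (any extension) in numeric order."""
    paths = {}
    for path in Path(directory).iterdir():
        # Files such as .DS_Store are skipped because their stem is not a number.
        if path.is_file() and path.stem.isdigit():
            paths[int(path.stem)] = path
    expected = list(range(1, len(paths) + 1))
    if sorted(paths) != expected:
        raise ValueError("documents must be numbered 1, 2, ..., n without gaps")
    return [paths[i].read_text(encoding="utf-8", errors="replace") for i in expected]
```

Sorting by the integer value rather than by file name keeps document 10 after document 9, so the order of the returned list matches the document numbers.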

Prakhar0409/Latent-Semantic-Indexing