Threaded-Sparse-TFIDF

A repository for multithreaded TF-IDF vectorization for similarity search, using sparse matrices for the computations.

PyPI

Usage:

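from TF_IDF import TF_IDF_Vectorizer

# k is the number of workers; 4 was the fastest setting in the performance table below.
k = 4

tf_idf = TF_IDF_Vectorizer(use_cached=True, print_output=False)
# The first return value is discarded here; ranking holds the ranked results for the query.
_, ranking = tf_idf.get_similarity_score("science fiction super hero movie", num_workers=k)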
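To illustrate the general idea behind the call above, here is a minimal standard-library sketch of partitioned TF-IDF similarity scoring. It is not this package's implementation: the function names (`build_sparse_tfidf`, `score_partition`, `rank`), the smoothed IDF formula, and the use of dictionaries as sparse vectors are all assumptions made for illustration. Pure-Python threads are also limited by the interpreter lock, so this sketch shows the partitioning pattern rather than a real speedup.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_sparse_tfidf(docs):
    """Return (idf, vectors); each vector stores only its non-zero weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption), so common terms keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(normalize({t: (c / len(tokens)) * idf[t] for t, c in counts.items()}))
    return idf, vectors


def score_partition(query_vec, vectors, start, stop):
    """Cosine similarity of the query against documents start..stop-1."""
    return [
        (i, sum(w * vectors[i].get(t, 0.0) for t, w in query_vec.items()))
        for i in range(start, stop)
    ]


def rank(query, docs, num_workers=2):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    idf, vectors = build_sparse_tfidf(docs)
    counts = Counter(t for t in query.lower().split() if t in idf)
    total = sum(counts.values())
    query_vec = normalize({t: (c / total) * idf[t] for t, c in counts.items()})
    # Split the document range into roughly equal partitions, one per worker.
    size = max(1, math.ceil(len(vectors) / num_workers))
    bounds = [(s, min(s + size, len(vectors))) for s in range(0, len(vectors), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(lambda b: score_partition(query_vec, vectors, *b), bounds)
        scores = [pair for part in parts for pair in part]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample_docs = [
        "science fiction super hero movie",
        "romantic comedy in paris",
        "a hero saves the city",
    ]
    for doc_index, score in rank("super hero", sample_docs, num_workers=2):
        print(doc_index, round(score, 3))
```

Because each partition is scored independently, the result does not depend on the number of workers; only the wall-clock time should change.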

Performance:

Image:

Table:

num_workers	time	partition_size
1	1.1117637634277344	6.7785
2	0.8195240020751953	3.4149
3	0.7357232332229614	2.2773
4	0.7232689380645752	1.7081
5	0.7375946760177612	1.3556
6	0.7682486534118652	1.1307
7	0.7640876531600952	0.9618
8	0.7513441801071167	0.8506
9	0.7795052766799927	0.7587
10	0.8141436100006103	0.6807
11	0.8003325223922729	0.6195
12	0.8441393852233887	0.5697
13	0.8490614175796509	0.5258
14	0.9322290658950806	0.4874
15	0.8824400186538697	0.4573
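The table can be summarized with a few lines of Python. This sketch copies the `num_workers` and `time` columns from the table above and reports the fastest setting and its speedup over a single worker. The variable names are illustrative, and the unit of `time` is not stated in the source, so the ratio is the only quantity interpreted here.

```python
# (num_workers, time) pairs copied from the performance table.
timings = [
    (1, 1.1117637634277344),
    (2, 0.8195240020751953),
    (3, 0.7357232332229614),
    (4, 0.7232689380645752),
    (5, 0.7375946760177612),
    (6, 0.7682486534118652),
    (7, 0.7640876531600952),
    (8, 0.7513441801071167),
    (9, 0.7795052766799927),
    (10, 0.8141436100006103),
    (11, 0.8003325223922729),
    (12, 0.8441393852233887),
    (13, 0.8490614175796509),
    (14, 0.9322290658950806),
    (15, 0.8824400186538697),
]

baseline = timings[0][1]  # time measured with a single worker
best_workers, best_time = min(timings, key=lambda row: row[1])
speedup = baseline / best_time

print(f"fastest: num_workers={best_workers}, speedup over 1 worker: {speedup:.2f}x")
```

In these measurements the time stops improving after 4 workers and drifts upward as more are added, so more workers do not help on this data.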

Data

A subset of the Information Retrieval Dataset - Internet Movie Database (IMDB), restricted to movies released after 2007.


AmanPriyanshu/Threaded-Sparse-TFIDF
