Science Concierge

A Python repository for content-based recommendation based on latent semantic analysis (LSA) topic distance and the Rocchio algorithm. Science Concierge is the backend algorithm for Scholarfy (www.scholarfy.net), an automatic scheduler for conferences.

See the full article on PLOS ONE or arXiv, or the full TeX manuscript and presentation in this repository. You can also see a scaled-up version of Scholarfy covering 14.3M PubMed articles at pubmed.scholarfy.net.

Usage

First, clone the repository.

$ git clone https://github.com/titipata/science_concierge

Install the dependencies using pip:

$ pip install -r requirements.txt

Install the library using setup.py:

$ python setup.py develop install

Download example data

We provide example CSV files from the PubMed Open Access Subset that you can download and experiment with (parsed using pubmed_parser). Each file contains the columns pmc, pmid, title, abstract, and publication_year. Use the download function to fetch the example data:

import science_concierge
science_concierge.download(['pubmed_oa_2015.csv', 'pubmed_oa_2016.csv'])

We provide pubmed_oa_{year}.csv for {year} = 2007, ..., 2016 (note that the 2007 file contains all publications before 2008). Alternatively, use awscli to download everything:

$ aws s3 cp s3://science-of-science-bucket/science_concierge/data/ . --recursive
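
To fetch every year with the download function instead of awscli, the file names can be generated from the naming pattern above. This is a minimal sketch; it assumes download accepts a list of any of the provided file names, as in the two-file example earlier, and the call itself is left commented out because it fetches large files over the network.

```python
# Build the list of yearly file names: pubmed_oa_2007.csv ... pubmed_oa_2016.csv.
years = range(2007, 2017)  # the upper bound is exclusive, so the last year is 2016
file_names = ['pubmed_oa_{}.csv'.format(year) for year in years]

# Assumed usage, mirroring the two-file example above:
# import science_concierge
# science_concierge.download(file_names)
```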

Example usage of Science Concierge

You can build a quick recommender by importing the ScienceConcierge class and calling its fit method on a list of documents. Then call recommend to get recommendations based on the documents you like or dislike.

import pandas as pd
from science_concierge import ScienceConcierge

df = pd.read_csv('data/pubmed_oa_2016.csv', encoding='utf-8')
docs = list(df.abstract) # provide list of abstracts
titles = list(df.title) # titles
# select weighting from 'count', 'tfidf', or 'entropy'
recommend_model = ScienceConcierge(stemming=True, ngram_range=(1,1),
                                   weighting='entropy', norm=None,
                                   n_components=200, n_recommend=200,
                                   verbose=True)
recommend_model.fit(docs) # input list of documents or abstracts
index = recommend_model.recommend(likes=[10000], dislikes=[]) # lists of liked/disliked document indices (here we like titles[10000])
docs_recommend = [titles[i] for i in index[0:10]] # titles of the top 10 recommended documents
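
For intuition about what a Rocchio-style recommendation does with liked and disliked documents, here is a small pure-Python sketch on toy topic vectors. It is not the library's implementation: the function names, the alpha and beta weights, the cosine ranking, and the choice to exclude already-rated documents are all assumptions made for illustration.

```python
import math

def rocchio_query(vectors, likes, dislikes, alpha=1.0, beta=0.5):
    """Return alpha * mean(liked vectors) - beta * mean(disliked vectors).

    alpha and beta are illustrative defaults, not the library's values.
    """
    dim = len(vectors[0])
    query = [0.0] * dim
    for indices, weight in ((likes, alpha), (dislikes, -beta)):
        if not indices:
            continue
        for i in indices:
            for k in range(dim):
                query[k] += weight * vectors[i][k] / len(indices)
    return query

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def recommend(vectors, likes, dislikes=(), n_recommend=3):
    """Rank unrated documents by similarity to the Rocchio query vector."""
    query = rocchio_query(vectors, likes, dislikes)
    rated = set(likes) | set(dislikes)
    candidates = [i for i in range(len(vectors)) if i not in rated]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:n_recommend]

# Toy 2-dimensional "topic" vectors standing in for LSA output.
topic_vectors = [
    [1.0, 0.0],  # document 0
    [0.9, 0.1],  # document 1, close to document 0
    [0.0, 1.0],  # document 2
    [0.1, 0.9],  # document 3, close to document 2
]
ranked = recommend(topic_vectors, likes=[0], dislikes=[2])
```

Liking document 0 and disliking document 2 moves the query toward the first topic and away from the second, so document 1 is ranked ahead of document 3.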

Vectorizer available

We also provide add-on vectorizer classes, including LogEntropyVectorizer and BM25Vectorizer, for computing document-term weights from a list of input documents. Here is an example:

from science_concierge import LogEntropyVectorizer
l_model = LogEntropyVectorizer(norm=None, ngram_range=(1,2),
                               stop_words='english', min_df=1, max_df=0.8)
X = l_model.fit_transform(docs) # where docs is list of documents

When you already have a sparse document-term matrix, such as X above, you can pass it directly to the fit_document_matrix method.

recommend_model = ScienceConcierge(n_components=200, n_recommend=200)
recommend_model.fit_document_matrix(X)
index = recommend_model.recommend(likes=[10000], dislikes=[])
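
As background on the 'entropy' weighting option, the sketch below computes the textbook log-entropy weights in pure Python: a local weight log(1 + tf) multiplied by a global weight 1 + sum_j p_j log(p_j) / log(n), where p_j is the share of a term's occurrences that fall in document j and n is the number of documents. This is the standard definition, written with a hypothetical helper name; LogEntropyVectorizer may differ in details such as smoothing or normalization.

```python
import math
from collections import Counter

def log_entropy_weights(tokenized_docs):
    """Return (per-document weights, per-term global weights).

    tokenized_docs is a list of token lists; at least two documents
    are required because the global weight divides by log(n_docs).
    """
    n_docs = len(tokenized_docs)
    counts = [Counter(doc) for doc in tokenized_docs]

    # Total number of occurrences of each term across all documents.
    global_freq = Counter()
    for doc_counts in counts:
        global_freq.update(doc_counts)

    # Global weight: 1 + (sum of p * log p over documents) / log(n_docs).
    entropy_weight = {}
    for term, gf in global_freq.items():
        total = 0.0
        for doc_counts in counts:
            tf = doc_counts.get(term, 0)
            if tf > 0:
                p = tf / gf
                total += p * math.log(p)
        entropy_weight[term] = 1.0 + total / math.log(n_docs)

    # Final weight: local log(1 + tf) times the term's global weight.
    weighted = [
        {term: math.log(1 + tf) * entropy_weight[term]
         for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]
    return weighted, entropy_weight

toy_docs = [["gene", "cell"], ["gene", "protein"]]
weighted, entropy_weight = log_entropy_weights(toy_docs)
```

In the toy input, "gene" is spread evenly over both documents, so its global weight is zero, while "cell" and "protein" each occur in a single document and keep the full global weight of 1.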

Dependencies

numpy
pandas
unidecode
nltk with the whitespace tokenizer and Porter stemmer; use science_concierge.download_nltk() to download the required corpora (there is a stemmer bug in nltk==3.2.2)
scikit-learn
cachetools
joblib
