# Creating a document similarity microservice for the Reuters-21578 dataset

First download the Reuters-21578 dataset in JSON format into the local folder:

```bash
git clone https://github.com/fergiemcdowall/reuters-21578-json
```

Next, convert the downloaded JSON files into the default corpus format used by `DefaultJsonCorpus`: a list of dictionaries, each with `id`, `text`, `title`, and `tags` fields.


In [1]:
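import json
import os

data_dir = "reuters-21578-json/data/full"
docs = []
for filename in os.listdir(data_dir):
    # Open each file in a with-block so the handle is closed after loading.
    with open(os.path.join(data_dir, filename)) as f:
        js = json.load(f)
    for j in js:
        # Keep only articles that have both topic labels and body text.
        if 'topics' in j and 'body' in j:
            d = {}
            d["id"] = j['id']
            # Newlines are removed rather than replaced by spaces, which likely
            # explains joined tokens such as 'europeancommission' in the log
            # further down; use " " instead to keep the words separate.
            d["text"] = j['body'].replace("\n","")
            d["title"] = j['title']
            d["tags"] = ",".join(j['topics'])
            docs.append(d)
print "loaded ",len(docs)," documents"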

loaded  10377  documents
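
As a quick illustration of the target format, the sketch below applies the same field mapping to a single record. The `to_corpus_doc` helper and the sample record are hypothetical, made up for this example rather than taken from the dataset or the library.

```python
def to_corpus_doc(record):
    """Map one Reuters JSON record to the corpus dict used above.

    Hypothetical helper: it mirrors the loop in the previous cell and
    returns None for records without topics or body text.
    """
    if 'topics' not in record or 'body' not in record:
        return None
    return {
        "id": record['id'],
        "text": record['body'].replace("\n", ""),
        "title": record['title'],
        "tags": ",".join(record['topics']),
    }

# Made-up record for illustration only.
sample = {
    "id": "1",
    "title": "SAMPLE TITLE",
    "body": "First line\nsecond line",
    "topics": ["grain", "wheat"],
}
doc = to_corpus_doc(sample)
```

Each element of `docs` has this shape, with the topic list flattened into a single comma-separated `tags` string.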


Create a gensim LSI document similarity model from the corpus:

In [2]:
from seldon.text import DocumentSimilarity, DefaultJsonCorpus
import logging

# Log at INFO level so that gensim and seldon report their progress.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

corpus = DefaultJsonCorpus(docs)
ds = DocumentSimilarity(model_type='gensim_lsi')
ds.fit(corpus)
print "done"


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary(71167 unique tokens: [u'yetunspecified', u'europeancommission', u'overheadcosts', u'mdbl', u'othersecurities']...)
INFO:gensim.corpora.dictionary:built Dictionary(73530 unique tokens: [u'yetunspecified', u'europeancommission', u'overheadcosts', u'mdbl', u'othersecurities']...) from 10377 documents (total 1255015 corpus positions)
INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #10000
INFO:gensim.models.tfidfmodel:calculating IDF weights for 10377 documents and 73529 features (752872 matrix non-zeros)
INFO:seldon.text:Building gensim lsi model
INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:prepa

done
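
The log shows a TF-IDF weighting step before the LSI model is built. As an illustration of that weighting only, here is a minimal standard-library sketch. It is not the gensim or seldon implementation and it does not perform LSI; the log2-based IDF, the unit-length normalisation, the helper names, and the toy documents are all assumptions chosen for the example.

```python
import math
from collections import Counter

def tfidf_vectors(token_docs):
    """Return one unit-length {token: weight} dict per tokenised document."""
    n_docs = len(token_docs)
    # Document frequency: the number of documents each token occurs in.
    df = Counter(tok for doc in token_docs for tok in set(doc))
    vectors = []
    for doc in token_docs:
        tf = Counter(doc)
        # Weight = term frequency * log2(N / document frequency).
        vec = {t: c * math.log(float(n_docs) / df[t], 2) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalise so the dot product of two vectors is their cosine.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Dot product of two sparse unit-length vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)

# Toy documents, made up for illustration.
toy = [
    "wheat grain export".split(),
    "grain export tonnes".split(),
    "bank interest rate".split(),
]
vecs = tfidf_vectors(toy)
sim_related = cosine(vecs[0], vecs[1])    # two shared tokens
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared tokens
```

Documents that share rare tokens score higher than documents that share none. LSI then projects such vectors into a lower-dimensional space (100 features in the log of the scoring step) before similarities are computed.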


Run a test over the documents: for each document, find its nearest neighbour and compute the Jaccard similarity between their tags, using the "tags" field of the metadata as the ground truth. The score is the average over all documents.

In [3]:
ds.score()

INFO:gensim.similarities.docsim:creating matrix with 10377 documents and 100 features
INFO:gensim.similarities.docsim:creating dense shard #0
INFO:gensim.similarities.docsim:saving index shard to /tmp/gensim_index.0
INFO:gensim.utils:saving MatrixSimilarity object under /tmp/gensim_index.0, separately None
INFO:gensim.utils:loading MatrixSimilarity object from /tmp/gensim_index.0
INFO:seldon.text:accuracy: 0.837711 time: 5 secs avg_call_time: 0.000482


0.837711403799443
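
For reference, the Jaccard similarity of two tag sets is the size of their intersection divided by the size of their union. The sketch below computes it for comma-separated tag strings like those in the `tags` field. The `jaccard` helper and the example tags are illustrative, and how `ds.score()` handles edge cases such as empty tag lists is not shown in this notebook.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two comma-separated tag strings."""
    a = set(t for t in tags_a.split(",") if t)
    b = set(t for t in tags_b.split(",") if t)
    if not a and not b:
        # Assumption: two empty tag sets count as no overlap.
        return 0.0
    return len(a & b) / float(len(a | b))

same = jaccard("grain,wheat", "wheat,grain")    # identical sets
partial = jaccard("grain,wheat", "grain,corn")  # one shared tag out of three
none = jaccard("earn", "acq")                   # no shared tags
```

A document whose nearest neighbour carries exactly the same tags contributes 1 to the average, and one whose neighbour shares no tags contributes 0.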

Run the test again, this time using the Annoy approximate nearest-neighbour index instead of the exact index.

In [4]:
ds.score(approx=True)

INFO:seldon.text:accuracy: 0.836532 time: 1 secs avg_call_time: 0.000096


0.836531742819262
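
Comparing the two runs with the figures logged above gives a rough sense of the trade-off. This is plain arithmetic on those numbers; the variable names are just for illustration.

```python
# Values copied from the log lines of the two score() runs above.
exact_accuracy, exact_call_time = 0.837711, 0.000482
approx_accuracy, approx_call_time = 0.836532, 0.000096

# How many times faster an average call is with the approximate index.
speedup = exact_call_time / approx_call_time
# Absolute accuracy lost by switching to the approximate index.
accuracy_drop = exact_accuracy - approx_accuracy

print("speedup: %.1fx" % speedup)
print("accuracy drop: %.6f" % accuracy_drop)
```

On these figures the approximate index answers roughly five times faster per call for an accuracy loss of about 0.001.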

Save the fitted model to the local folder `reuters_recommender`:

In [5]:
import seldon
rw = seldon.Recommender_wrapper()
rw.save_recommender(ds,"reuters_recommender")
print "done"

INFO:seldon.util:creating folder /tmp/recommender_tmp615134
INFO:gensim.utils:saving Similarity object under /tmp/recommender_tmp615134/gensim_index, separately None
INFO:seldon.fileutil:copy /tmp/recommender_tmp615134 to reuters_recommender
INFO:seldon.fileutil:copying /tmp/recommender_tmp615134/gensim_index to reuters_recommender/gensim_index
INFO:seldon.fileutil:copying /tmp/recommender_tmp615134/annoy_index to reuters_recommender/annoy_index
INFO:seldon.fileutil:copying /tmp/recommender_tmp615134/gensim_index.0 to reuters_recommender/gensim_index.0
INFO:seldon.fileutil:copying /tmp/recommender_tmp615134/rec to reuters_recommender/rec
INFO:seldon.fileutil:copying /tmp/recommender_tmp615134/meta to reuters_recommender/meta


done


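Load the saved recommender into a microservice and start it. The call to `app.run` blocks while the server listens on port 5000:

In [None]: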
from seldon.microservice import Microservices
m = Microservices()
app = m.create_recommendation_microservice("reuters_recommender")
app.run(host="0.0.0.0",debug=False)

INFO:seldon.util:creating folder /tmp/recommender_tmp797941
INFO:seldon.fileutil:copy reuters_recommender to /tmp/recommender_tmp797941
INFO:seldon.fileutil:copying reuters_recommender/gensim_index to /tmp/recommender_tmp797941/gensim_index
INFO:seldon.fileutil:copying reuters_recommender/annoy_index to /tmp/recommender_tmp797941/annoy_index
INFO:seldon.fileutil:copying reuters_recommender/gensim_index.0 to /tmp/recommender_tmp797941/gensim_index.0
INFO:seldon.fileutil:copying reuters_recommender/rec to /tmp/recommender_tmp797941/rec
INFO:seldon.fileutil:copying reuters_recommender/meta to /tmp/recommender_tmp797941/meta
INFO:gensim.utils:loading Similarity object from /tmp/recommender_tmp797941/gensim_index
INFO:werkzeug: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
INFO:gensim.utils:loading MatrixSimilarity object from /tmp/recommender_tmp615134/gensim_index.0
INFO:werkzeug:127.0.0.1 - - [30/Nov/2015 10:11:45] "GET /recommend?recent_interactions=6003&client=reuters&user_i

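With the service running, query it from a separate session, asking for up to 4 recommendations for document 6003:

In [None]: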
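import requests

# Query parameters: the id passed as recent_interactions, the maximum
# number of results, and the client name.
params = {
    "recent_interactions": "6003",
    "limit": 4,
    "client": "reuters",
}
r = requests.get("http://127.0.0.1:5000/recommend", params=params)
print r.status_code
# Decode the JSON body of the response.
j = r.json()
print j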
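
To see what request these parameters produce, the query string can be built with the standard library. This is only a sketch of the URL encoding: it does not contact the service, and it makes no assumption about the format of the response.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# Same parameters as above, as an ordered list so the output is stable.
params = [
    ("recent_interactions", "6003"),
    ("limit", 4),
    ("client", "reuters"),
]
url = "http://127.0.0.1:5000/recommend?" + urlencode(params)
print(url)
```

The resulting URL has the same `/recommend?recent_interactions=6003&...` shape as the request that appears in the server log.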