# Creating a document similarity microservice for the Reuters-21578 dataset

First download the Reuters-21578 dataset in JSON format into the local folder:

```bash
git clone https://github.com/fergiemcdowall/reuters-21578-json
```

The first step is to convert the downloaded JSON files into the default corpus format we use:


In [1]:
import json
import os

data_dir = "reuters-21578-json/data/full"
docs = []
for filename in os.listdir(data_dir):
    # Open each file in a with-block so the handle is closed after parsing
    with open(os.path.join(data_dir, filename)) as f:
        js = json.load(f)
    for j in js:
        if 'topics' in j and 'body' in j:
            d = {}
            d["id"] = j['id']
            d["text"] = j['body'].replace("\n","")
            d["title"] = j['title']
            d["tags"] = ",".join(j['topics'])
            docs.append(d)
print "loaded ",len(docs)," documents"

loaded  10377  documents
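
The conversion above deletes newlines without inserting a space, which can fuse the last word of one line to the first word of the next; tokens such as `europeancommission` in the log further down are consistent with that. Below is a minimal sketch of a variant that collapses whitespace instead. The helper name `to_corpus_doc`, the fallback for a missing title, and the sample record are illustrative assumptions, not part of seldon or the dataset.

```python
def to_corpus_doc(record):
    """Map one Reuters JSON record to the id/text/title/tags layout used above.

    Returns None when the record lacks 'topics' or 'body'.
    """
    if "topics" not in record or "body" not in record:
        return None
    return {
        "id": record["id"],
        # Collapse every run of whitespace (including newlines) to one space
        "text": " ".join(record["body"].split()),
        # Assumption: tolerate records without a title
        "title": record.get("title", ""),
        "tags": ",".join(record["topics"]),
    }

# Made-up record, only to show the whitespace handling
sample = {
    "id": "1",
    "title": "SAMPLE",
    "topics": ["grain", "wheat"],
    "body": "European\nCommission said\n  today",
}
print(to_corpus_doc(sample)["text"])
```

Using this variant would change the vocabulary gensim builds, so the token counts in the log output would no longer match the ones recorded in this notebook.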


## Create a gensim LSI document similarity model

In [2]:
from seldon.text import DocumentSimilarity, DefaultJsonCorpus
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

corpus = DefaultJsonCorpus(docs)
ds = DocumentSimilarity(model_type='gensim_lsi')
ds.fit(corpus)
print "done"


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:adding document #10000 to Dictionary(71167 unique tokens: [u'yetunspecified', u'europeancommission', u'overheadcosts', u'mdbl', u'othersecurities']...)
INFO:gensim.corpora.dictionary:built Dictionary(73530 unique tokens: [u'yetunspecified', u'europeancommission', u'overheadcosts', u'mdbl', u'othersecurities']...) from 10377 documents (total 1255015 corpus positions)
INFO:gensim.models.tfidfmodel:collecting document frequencies
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #0
INFO:gensim.models.tfidfmodel:PROGRESS: processing document #10000
INFO:gensim.models.tfidfmodel:calculating IDF weights for 10377 documents and 73529 features (752872 matrix non-zeros)
INFO:seldon.text.docsim:Building gensim lsi model
INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents


KeyboardInterrupt: 

## Run accuracy tests

Run a test over the documents to compute the average Jaccard similarity between each document and its single nearest neighbour, using the "tags" field of the metadata as the ground truth.

In [None]:
ds.score()
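
To make the metric concrete, here is a small standalone sketch of Jaccard similarity over comma-separated tag strings, averaged over (document, neighbour) pairs. It is not seldon's implementation of `score`; the function names, the handling of empty tag sets, and the example pairs are assumptions for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two comma-separated tag strings."""
    a = set(t for t in tags_a.split(",") if t)
    b = set(t for t in tags_b.split(",") if t)
    if not a and not b:
        return 0.0  # convention chosen here for two empty tag sets
    return len(a & b) / float(len(a | b))

def average_jaccard(pairs):
    """Mean Jaccard similarity over (query_tags, neighbour_tags) pairs."""
    scores = [jaccard(q, n) for q, n in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up tag pairs: (tags of a document, tags of its nearest neighbour)
pairs = [
    ("grain,wheat", "grain,corn"),  # 1 shared tag out of 3 distinct
    ("earn", "earn"),               # identical tag sets
    ("crude", "trade"),             # no tags in common
]
print(average_jaccard(pairs))
```

A score near 1 means nearest neighbours mostly carry the same topic tags as the query document; a score near 0 means they rarely do.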

Run the test again, this time using the Annoy approximate nearest-neighbour index that would have been built alongside the model. This should be much faster.

In [None]:
ds.score(approx=True)

## Run single nearest neighbour query

Run a nearest-neighbour query on a single document and print the title and tag metadata.

In [None]:
query_doc=6023
print "Query doc: ",ds.get_meta(query_doc)['title'],"Tagged:",ds.get_meta(query_doc)['tags']
neighbours = ds.nn(query_doc,k=5,translate_id=True,approx=True)
print neighbours
for (doc_id,_) in neighbours:
    j = ds.get_meta(doc_id)
    print "Doc id",doc_id,j['title'],"Tagged:",j['tags']
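
For intuition about what a k-nearest-neighbour query returns, the following is an exact brute-force sketch using cosine similarity over toy vectors. It is not how seldon, gensim, or Annoy perform the lookup internally; the function names and the vectors are made up, and only the `(doc_id, score)` result shape mirrors the loop above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(vectors, query_id, k):
    """Return the k ids most similar to query_id as (id, score), best first."""
    q = vectors[query_id]
    scored = [(i, cosine(q, v)) for i, v in vectors.items() if i != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d "document vectors" keyed by document id
vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.5, 0.5],
}
for doc_id, score in nearest(vectors, 0, 2):
    print(str(doc_id) + " " + str(round(score, 3)))
```

The brute-force scan compares the query against every document, which is the cost an approximate index is meant to avoid.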

## Save recommender

Save the recommender to the filesystem in the `reuters_recommender` folder.

In [None]:
import seldon
rw = seldon.Recommender_wrapper()
rw.save_recommender(ds,"reuters_recommender")
print "done"

## Start a microservice to serve the recommender

In [3]:
from seldon.microservice import Microservices
m = Microservices()
app = m.create_recommendation_microservice("reuters_recommender")
app.run(host="0.0.0.0",port=5000,debug=False)

INFO:seldon.util:creating folder /tmp/recommender_tmp915987
INFO:seldon.fileutil:copy reuters_recommender to /tmp/recommender_tmp915987
INFO:seldon.fileutil:copying reuters_recommender/gensim_index to /tmp/recommender_tmp915987/gensim_index
INFO:seldon.fileutil:copying reuters_recommender/annoy_index to /tmp/recommender_tmp915987/annoy_index
INFO:seldon.fileutil:copying reuters_recommender/gensim_index.0 to /tmp/recommender_tmp915987/gensim_index.0
INFO:seldon.fileutil:copying reuters_recommender/rec to /tmp/recommender_tmp915987/rec
INFO:seldon.fileutil:copying reuters_recommender/meta to /tmp/recommender_tmp915987/meta
INFO:gensim.utils:loading Similarity object from /tmp/recommender_tmp915987/gensim_index
INFO:werkzeug: * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
INFO:gensim.utils:loading MatrixSimilarity object from /tmp/recommender_tmp915987/gensim_index.0
INFO:werkzeug:127.0.0.1 - - [28/Mar/2016 12:34:27] "GET /recommend?recent_interactions=6023&client=reuters&user_i
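
The last log line shows the service answering a GET request on `/recommend` with `recent_interactions` and `client` query parameters; the line is cut off, so any further parameters are not visible here. The sketch below only builds a URL from the two parameters that can be read in the log. The helper name `build_recommend_url` and the `localhost` base URL are assumptions, and the service may require additional parameters that this URL omits.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

def build_recommend_url(base_url, client, doc_id):
    """Build a GET URL for /recommend from the parameters visible in the log."""
    params = [
        ("recent_interactions", str(doc_id)),
        ("client", client),
    ]
    return base_url.rstrip("/") + "/recommend?" + urlencode(params)

# Assumption: the microservice started above is reachable on localhost:5000
url = build_recommend_url("http://localhost:5000", "reuters", 6023)
print(url)
```

The resulting URL can be opened with any HTTP client while the microservice is running.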