# MongoDB GridFS demonstration

by Dr. Rikard Sandström, rsandstroem@kpmg.com

If you have not done so already, I recommend that you first look at "MongoDB_demo.ipynb" in this repository. It explains the basics of MongoDB and introduces pymongo.

This demonstration shows how to use GridFS for storing large files in MongoDB. MongoDB limits a single BSON document to 16 MB. GridFS circumvents this by splitting a file into smaller chunks, each stored as its own document. I find it convenient for storing binary data such as images and video. For more information please see https://docs.mongodb.org/manual/core/gridfs/
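
To make the chunking idea concrete, here is a minimal sketch in plain Python 3. It is not GridFS's actual implementation; the helper name `split_into_chunks` is hypothetical, and the chunk size of 261120 bytes (255 KiB) is simply the `chunkSize` value that shows up in the fs.files output later in this notebook.

```python
def split_into_chunks(data, chunk_size=261120):
    """Split a bytes object into consecutive pieces of at most chunk_size bytes.

    Illustrative only: GridFS does the equivalent internally when a file is stored.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


payload = b"x" * 600000          # stand-in for the contents of a large file
chunks = split_into_chunks(payload)

print(len(chunks))               # number of chunks needed
print([len(c) for c in chunks])  # only the last chunk may be shorter
```

Joining the pieces in order gives back the original bytes, which is what GridFS does when a file is read.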

The demonstration stores a scanned PDF in GridFS together with the text extracted from it, which is added as another field (a "column", if you think in tables) of the same collection. This allows retrieval of PDFs whose content matches user-defined search terms. This use case was brought to life by my frustration over searching through piles of papers when filling out my tax returns.

## Create a MongoClient

First, import things we will need. Use pymongo to connect to the "test" database on port 27017 of the local machine.

In [1]:
import os
import pandas as pd
import numpy as np
import pymongo
from pymongo import MongoClient
import bson
import gridfs
client = MongoClient('localhost', 27017)
db = client.test

## Create a GridFS

To start from a clean slate, begin by dropping the collections GridFS might have created earlier. GridFS creates a "files" collection (fs.files), containing metadata, and a "chunks" collection (fs.chunks), containing the small chunks of the large files.

In [2]:
db.drop_collection('fs.files')
db.drop_collection('fs.chunks')
fs = gridfs.GridFS(db)

## Hello World

Let's insert an object and see if we can retrieve it. The object in this case is just a string containing "Hello World". Extra keyword arguments to put() are stored as metadata fields; here we add a free-text description called "text", a list of tags and a username.

In [3]:
a = fs.put("Hello World", text='I am a tiny document', tags=['foo', 'bar'], username='skywalker')
print 'put returned:\t', a
print 'database contains:\t', fs.get(a).read()

put returned:	56fe679cb143762514358152
database contains:	Hello World
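
The cell above relies on Python 2, where a plain string is already a byte string. Under Python 3, `put()` expects bytes (or a `str` together with an `encoding` keyword). A minimal sketch of a wrapper, assuming Python 3; the helper name `put_text` is hypothetical, and `fs` stands for any `gridfs.GridFS` instance such as the one created above.

```python
def put_text(fs, text, **metadata):
    """Store a text string in GridFS as UTF-8 bytes and return the new file's _id.

    Hypothetical helper: `fs` is expected to be a gridfs.GridFS object. The
    extra keyword arguments end up as fields of the fs.files document.
    """
    return fs.put(text.encode("utf-8"), **metadata)


# Usage against the `fs` object from this notebook (needs a running MongoDB):
# file_id = put_text(fs, "Hello World", text="I am a tiny document",
#                    tags=["foo", "bar"], username="skywalker")
# print(fs.get(file_id).read().decode("utf-8"))
```

Reading the file back returns bytes, so decode them with the same encoding if you need a string.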


## Extract text from a scanned PDF file

Let's try something more complex. "scansmpl.pdf" is a free sample of a scanned PDF file containing text in image format. For PDF files that already contain a text layer we can just grab the text with pdftotext. If that yields nothing, we need to run OCR to optically reconstruct the characters from the image, which is much slower.

Todo: use better OCR software, preferably without dropping to system.

In [4]:
filename = 'scansmpl.pdf'
tags = ['default']
os.system('pdftotext ' + filename + ' temp.txt')
text = open('temp.txt').read()
if len(text)<2: # No text found, so we need to do an OCR 
    os.system('ocrmypdf ' + filename + ' ocr.pdf')
    os.system('pdftotext ocr.pdf temp.txt')
    text = open('temp.txt').read()
print 'Text consists of', len(text), 'characters'

Text consists of 1230 characters
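
Regarding the todo about not dropping to the system shell: one option is `subprocess` with an argument list, which avoids building command strings and raises an error when a tool fails. The sketch below assumes Python 3 and that the same `pdftotext` and `ocrmypdf` command line tools used above are installed; the function names `needs_ocr` and `extract_text` are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def needs_ocr(text, min_chars=2):
    """Treat output that is empty apart from whitespace (e.g. a lone form feed) as 'no text'.

    This is a variant of the len(text) < 2 check in the cell above.
    """
    return len(text.strip()) < min_chars


def extract_text(pdf_path):
    """Return the text of a PDF, running OCR first if no text layer is found."""
    for tool in ("pdftotext", "ocrmypdf"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " is not installed or not on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        txt = Path(tmp) / "out.txt"
        subprocess.run(["pdftotext", str(pdf_path), str(txt)], check=True)
        text = txt.read_text(encoding="utf-8", errors="replace")
        if needs_ocr(text):
            ocr_pdf = Path(tmp) / "ocr.pdf"
            subprocess.run(["ocrmypdf", str(pdf_path), str(ocr_pdf)], check=True)
            subprocess.run(["pdftotext", str(ocr_pdf), str(txt)], check=True)
            text = txt.read_text(encoding="utf-8", errors="replace")
    return text


# Usage (requires both tools and the sample file):
# text = extract_text("scansmpl.pdf")
```

Working in a temporary directory also avoids leaving temp.txt and ocr.pdf behind in the notebook folder.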


## Insert the document into the database

First read the PDF as raw bytes and wrap them in bson.Binary. Then insert this binary copy of the original file into MongoDB, together with the text extracted in the previous step and the file name.

In [5]:
data = bson.Binary(open(filename, 'rb').read()) # read in binary mode, a PDF is not text
fs.put(filename=filename, data=data, text=text, tags=tags, username='dvader')

ObjectId('56fe67a3b143762514358154')

## Inspect database contents

Use list() to get a list of the distinct file names that were inserted. There should only be one file at this point, but if you inserted it multiple times the name would still appear only once in the list, so let's count the copies just to be sure.

In [6]:
print fs.list()
print fs.find({"filename" : "scansmpl.pdf"}).count()

[u'scansmpl.pdf']
1


Remember that the metadata is stored in fs.files? You can inspect that collection directly to see what each file document contains, as shown below. The "_id" of each entry is what links it to its chunks in fs.chunks, where every chunk stores it as "files_id". Very convenient!

In [7]:
list(db.fs.files.find())

[{u'_id': ObjectId('56fe679cb143762514358152'),
  u'chunkSize': 261120,
  u'length': 11,
  u'md5': u'b10a8db164e0754105b7a99be72e3fe5',
  u'tags': [u'foo', u'bar'],
  u'text': u'I am a tiny document',
  u'uploadDate': datetime.datetime(2016, 4, 1, 12, 20, 44, 60000),
  u'username': u'skywalker'},
 {u'_id': ObjectId('56fe67a3b143762514358154'),
  u'chunkSize': 261120,
  u'filename': u'scansmpl.pdf',
  u'length': 21530,
  u'md5': u'8d2fc6d280b1385302910fd5162eaad2',
  u'tags': [u'default'],
  u'text': u'THE SLEREXE COMPANY LIMITED\nSAPORS LANE\nTELEPHONE\n\nOur Ref.\n\n350/PJC/EAC\n\nDr.\n\nCundall,\n\nP.N.\n\n.\n\nBOOLE\n\nBoots\n\n-\n\nDORSET\n\n(94513) 51617\n\n-\n\n.\n\nBHZS SER\n\n123456\n\nnuax\n\n18th\n\nJanuary, 1972.\n\nMining Surveys Ltd.,\nHolroyd Road,\n\nReading,\nBerks.\n\nDear\n\nPete,\n\nPermit me\ntransmission.\n\nintroduce you\n\nto\n\nto\n\nfacility\n\nthe\n\nof\n\nfacsimile\n\nphotocell is caused to perform a raster scan over\nvariations of print density on the docume
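
The link between the two collections can be sketched without a database. In the GridFS layout, each chunk document carries `files_id` (the `_id` of its fs.files entry), a sequence number `n` and the bytes themselves in `data`. The helper `reassemble` and the sample documents below are made up for illustration, written for Python 3.

```python
def reassemble(chunk_docs, file_id):
    """Join the chunks belonging to file_id in order of their sequence number n."""
    mine = [c for c in chunk_docs if c["files_id"] == file_id]
    return b"".join(c["data"] for c in sorted(mine, key=lambda c: c["n"]))


# Made-up chunk documents for two files, deliberately out of order.
fake_chunks = [
    {"files_id": "A", "n": 1, "data": b"World"},
    {"files_id": "B", "n": 0, "data": b"other file"},
    {"files_id": "A", "n": 0, "data": b"Hello "},
]

print(reassemble(fake_chunks, "A"))
```

Against the real database, the corresponding query would be `db.fs.chunks.find({'files_id': file_id}).sort('n', 1)`, although in practice `fs.get(file_id).read()` does this for you.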

## Search database for a file

Suppose you want to get your hands on all documents with a specific filename. Just do a standard find() on the GridFS object, then do whatever you want with the documents (here I print the upload date).

In [8]:
cursor = fs.find({"filename" : "scansmpl.pdf"}).limit(3)
print "Found", cursor.count(), "documents"
for doc in cursor:
    print doc.uploadDate

Found 1 documents
2016-04-01 12:20:51.911000


If you want to grab the file using the "_id" you must convert it to an ObjectId first.

In [10]:
from bson.objectid import ObjectId
fs.find_one({'_id': ObjectId('56fe67a3b143762514358154')})._id

ObjectId('56fe67a3b143762514358154')

If you are only interested in the last version you could retrieve it like this:

In [11]:
print fs.get_last_version("scansmpl.pdf").uploadDate

2016-04-01 12:20:51.911000


The purpose of extracting the text from the scanned pdf earlier was to enable searching for matching substrings to retrieve documents. To do this we need to create an index of the text field. 

In pymongo this is done with create_index(), the counterpart of the mongo shell one-liner db.fs.files.createIndex({ "text": "text" }).

In [12]:
db.fs.files.create_index([('text', pymongo.TEXT)]) # text index on the "text" field of fs.files
founddocs = fs.find({"$text": { "$search": "Cundall" }})
print founddocs.count()

1


## Write the found documents to disk

In [13]:
for i,doc in enumerate(founddocs):
    with open(doc.filename+'_'+str(i), 'wb') as f: # binary mode, the PDF is not text
        f.write(fs.get(doc._id).read())

You should now have the original file(s) matching the search terms with a counter suffix in the local folder.
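
Appending the counter after the extension produces names such as scansmpl.pdf_0, which PDF viewers may not recognise by extension. A possible alternative, sketched for Python 3 with a hypothetical helper `numbered_name`, puts the counter before the extension instead.

```python
import os


def numbered_name(filename, index):
    """Return filename with _<index> inserted before the extension, if any."""
    stem, ext = os.path.splitext(filename)
    return "{}_{}{}".format(stem, index, ext)


print(numbered_name("scansmpl.pdf", 0))
```

In the loop above this would replace the expression `doc.filename+'_'+str(i)`.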