# Simple MongoDB demonstration

by Rikard Sandström, rsandstroem@kpmg.com

## Upgrade MongoDB

This tutorial requires MongoDB version 3.0 or later, plus pymongo to connect to MongoDB from Python. NB: Upgrading an older release that is used in production can cause data loss, so take care!

If you are running a KAVE machine on the Amazon cloud, please follow these instructions:
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-red-hat.
In short, create a file /etc/yum.repos.d/mongodb-org-3.0.repo containing: 

then install the latest version with 

Next install pymongo:

If all went well, "mongo --version" should now report version 3.0 or later, and "import pymongo" at a Python prompt should complete without an error message.
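
The version check can also be scripted. The sketch below is one possible way to do it, assuming the shell prints a line such as "MongoDB shell version: 3.0.4". The sample string and the helper name `parse_version` are illustrative and not part of the original setup.

```python
import re

def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups() if part is not None)

# Illustrative sample; on a real machine capture the output of "mongo --version".
sample_output = "MongoDB shell version: 3.0.4"
version = parse_version(sample_output)
meets_requirement = version >= (3, 0)
```

On a real installation the text could come from `subprocess.check_output(['mongo', '--version'])`, decoded to a string, and `pymongo.version` holds the driver version in the same dotted form.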

## Create a MongoClient

First, import the modules we will need. Then use pymongo to connect to the "test" database and select the collection "people" in it.

In [1]:
import os
import pandas as pd
import numpy as np
import pymongo
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.test
collection = db.people

## Import data into the database

Import data from a JSON file into the MongoDB database "test", collection "people".
We could do this with the insert method, but for simplicity we run "mongoimport" from a shell. Before importing, we drop the collection in case it already exists.

In [2]:
collection.drop()
os.system('mongoimport -d test -c people dummyData.json')

0
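
For comparison, the insert route mentioned above could look roughly like this. It is a sketch that assumes dummyData.json holds one JSON document per line, and that pymongo 3.0 or later is installed so that `insert_many` is available. The helper names `read_json_lines` and `import_people` are made up for this illustration.

```python
import json

def read_json_lines(path):
    """Parse a file with one JSON document per line, skipping blank lines."""
    documents = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                documents.append(json.loads(line))
    return documents

def import_people(collection, path):
    """Drop the collection, insert every document from path, return the count."""
    documents = read_json_lines(path)
    collection.drop()
    if documents:  # insert_many raises an error on an empty list
        collection.insert_many(documents)
    return len(documents)
```

With the client from the first cell this would be called as `import_people(db.people, 'dummyData.json')`. If the file spreads a document over several lines, the line-by-line parsing here would need to be adapted.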

## Check that you can access the data in MongoDB

We use find() to get a cursor over the documents in the collection. Let's see who the three youngest persons in this data set are.
We sort the results by the field "Age" and print the first three documents.
Note the structure of the documents: it is the same as in the JSON file we imported, except that each document now has a unique value in the new "_id" field.

In [3]:
cursor = collection.find().sort('Age',pymongo.ASCENDING).limit(3)
for doc in cursor:
    print(doc)

{u'Country': u'Serbia', u'Age': 18.0, u'_id': ObjectId('55d5b8b85a1aaba39446ff29'), u'Name': u'Sawyer, Neve M.', u'Location': u'-34.37446, 174.0838'}
{u'Country': u'Somalia', u'Age': 19.0, u'_id': ObjectId('55d5b8b85a1aaba39446fee7'), u'Name': u'Townsend, Cadman I.', u'Location': u'-87.69188, -144.16138'}
{u'Country': u'Eritrea', u'Age': 20.0, u'_id': ObjectId('55d5b8b85a1aaba39446ff0f'), u'Name': u'Graham, Emerald O.', u'Location': u'61.35398, 28.04381'}


## Aggregation in MongoDB

Here is a small demonstration of the aggregation framework. 
We want to create a table with the number of persons in each country and their average age.
To do this we group the documents by country.
We then load the aggregation results into a pandas DataFrame and use the country as the index.

In [4]:
pipeline = [
        {"$group": {"_id":"$Country",
             "AvgAge":{"$avg":"$Age"},
             "Count":{"$sum":1},
        }},
        {"$sort":{"Count":-1,"AvgAge":1}}
]
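# pymongo 3.0+ returns a cursor from aggregate(); older 2.x releases returned a dict with a "result" key
aggResult = list(collection.aggregate(pipeline))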
df1 = pd.DataFrame(aggResult)
df1 = df1.set_index("_id")
df1[:5]

| _id | AvgAge | Count |
| --- | --- | --- |
| China | 46.25 | 4 |
| Antarctica | 46.333333 | 3 |
| Guernsey | 48.333333 | 3 |
| Puerto Rico | 26.5 | 2 |
| Heard Island and Mcdonald Islands | 29.0 | 2 |
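
To make clear what the pipeline computes, here is roughly the same grouping written in plain Python. It is only an illustration on a handful of invented documents (not taken from dummyData.json), and the function name `summarise_by_country` is likewise hypothetical.

```python
from collections import defaultdict

def summarise_by_country(documents):
    """Mimic the $group and $sort stages: count and mean age per country."""
    ages = defaultdict(list)
    for doc in documents:
        ages[doc["Country"]].append(doc["Age"])
    rows = [
        {"_id": country,
         "AvgAge": sum(values) / float(len(values)),
         "Count": len(values)}
        for country, values in ages.items()
    ]
    # Same ordering as {"$sort": {"Count": -1, "AvgAge": 1}}
    return sorted(rows, key=lambda row: (-row["Count"], row["AvgAge"]))

# Invented sample documents, for illustration only.
sample = [
    {"Country": "China", "Age": 32},
    {"Country": "China", "Age": 43},
    {"Country": "Serbia", "Age": 18},
    {"Country": "Eritrea", "Age": 20},
    {"Country": "Eritrea", "Age": 40},
]
summary = summarise_by_country(sample)
```

The difference in practice is that the aggregation framework does this work inside the database, so only the small summary table travels to Python.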


For simple selections one can either call find() with a query filter and iterate over the cursor, or use the "$match" stage in the aggregation framework, like this:

In [5]:
pipeline = [
        {"$match": {"Country":"China"}},
]
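# pymongo 3.0+ returns a cursor from aggregate(); older 2.x releases returned a dict with a "result" key
aggResult = list(collection.aggregate(pipeline))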
df2 = pd.DataFrame(aggResult)
df2[:5]

| | Age | Country | Location | Name | _id |
| --- | --- | --- | --- | --- | --- |
| 0 | 32 | China | 39.9127, 116.3833 | Holman, Hasad O. | 55d5b8b85a1aaba39446fee4 |
| 1 | 43 | China | 31.2, 121.5 | Byrd, Dante A. | 55d5b8b85a1aaba39446ff21 |
| 2 | 57 | China | 45.75, 126.6333 | Carney, Tamekah I. | 55d5b8b85a1aaba39446ff2a |
| 3 | 53 | China | 40, 95 | Mayer, Violet U. | 55d5b8b85a1aaba39446ff36 |
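
The find() route for the same selection might look like the sketch below. The helper name `people_in_country` is invented for this example. It only relies on find() accepting a filter document, and it wraps the cursor in `list` so that pandas receives plain dictionaries.

```python
def people_in_country(collection, country):
    """Return all documents whose "Country" field equals country."""
    # The filter document is the same one given to "$match" above.
    cursor = collection.find({"Country": country})
    return list(cursor)
```

With the collection from the first cell, `pd.DataFrame(people_in_country(collection, "China"))` should give a table with the same rows as the "$match" pipeline.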


## Use the MongoDB data

Let's do something with the data from the last aggregation: we put each person's location on a map.
Click on the markers to see the personal details of the four persons located in China.

In [6]:
from IPython.display import HTML
import folium

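def inline_map(folium_map):
    # Render the map to HTML and embed it in an iframe so the notebook shows it inline
    folium_map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=folium_map.HTML.replace('"', '&quot;')))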


world_map = folium.Map(location=[35, 100], 
                    zoom_start=4)
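for i in range(len(df2)):
    # "Location" is a "lat, lon" string; convert it to a pair of floats
    coordinates = [float(x) for x in df2.Location[i].split(',')]
    world_map.simple_marker(location=coordinates, popup=df2.Name[i]+', age:'+str(df2.Age[i]))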

inline_map(world_map)
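
The "Location" field is stored as a single string such as "39.9127, 116.3833", so it has to be split before it can be used as a coordinate pair. A small helper, shown here as a sketch with the hypothetical name `parse_location`, makes the conversion to numbers explicit and rejects values outside the valid latitude and longitude ranges.

```python
def parse_location(text):
    """Convert a "lat, lon" string into a [lat, lon] list of floats."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError("expected 'lat, lon', got %r" % text)
    lat, lon = float(parts[0]), float(parts[1])
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range: %r" % text)
    return [lat, lon]

# One of the location strings from the table of persons in China.
example = parse_location("39.9127, 116.3833")
```

Validating the ranges up front means a malformed record fails with a clear message instead of producing a marker in an unexpected place.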

In case no map is shown, try the following command from a terminal window and retry:

For more information on how to use maps, color by region, etc., please check out GeoMapsFoliumDemo.