# MongoDB Primer

# 1. Connect to mongoDB

In [None]:
from pymongo import MongoClient

This assumes you already have MongoDB running in the background:

In [None]:
client = MongoClient('localhost', 27017)

# 2. Create a database

We can create a database and call it `my_database`.

In [None]:
db = client.my_database

In fact, this command doesn't even create the database: it simply assumes that it exists. The database will not actually be created until we insert our first document. We can check this by listing the databases: the only ones currently there will be the databases you've used previously for atomate or fireworks, and `my_database` is not in the list.

In [None]:
client.list_database_names()  # database_names() was removed in PyMongo 4

# 3. Create a collection in your database

Generally, you will only have one database, but you may have many collections in that database. If you're using atomate, you could have a `tasks` collection, a `materials` collection, etc. If you're storing your own data, you might want a collection for each type of experimental data.

Here, let's create a collection called `nobel`.

In [None]:
c = db.nobel

This is equivalent to `c = db['nobel']`: in `pymongo` there are (confusingly) two ways to do the same thing.

# 4. Insert a single document

Let's insert a document. The 'power' of MongoDB is that it doesn't require a schema: your document can contain any valid Python `list`, `dict`, `float`, `str` or `bool`, in any combination you like. This makes it very flexible!

In [None]:
my_doc = {
    'year': 1939,
    'field': 'physics',
    'name': 'Ernest Orlando Lawrence',
    'country': 'USA'
}

In [None]:
result = c.insert_one(my_doc)

The `result` variable can give us the `ObjectId` of the document we inserted: this is a unique key that MongoDB uses for book-keeping.

In [None]:
result.inserted_id

Now, if we look at our database we see our new database (`my_database`) has been added:

In [None]:
client.list_database_names()

And our `nobel` collection has been added to `my_database`:

In [None]:
db.list_collection_names()  # collection_names() was removed in PyMongo 4

# 5. Delete a document

To delete a document, you can use the `delete_one` or `delete_many` methods on your collection. The argument to this method is simply a query. For example, `c.delete_many({'year': 1939})` would delete all documents that have `year` set as 1939.

Here, since we still have the `my_doc` dict, we can simply set the query to the dict itself!

In [None]:
result = c.delete_one(my_doc)

Assigning the output of this method to a variable (`result`) is optional, but allows us to obtain information on how many documents were deleted:

In [None]:
print(result.deleted_count)

# 6. Insert many documents

Let's load some test data. The Nobel Prize now has an API! (There's an API for everything now)


We will load data taken from: http://api.nobelprize.org/v1/laureate.json

Note that we are loading data from JSON, but we could just as easily load data from a .csv file, an Excel file, or similar, into a Python dictionary: numpy is a good module that can help with this.
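As a rough illustration of the CSV route, here is a minimal sketch that uses the standard library `csv` module (rather than numpy) to turn rows into dictionaries of the kind `insert_many` accepts. The column names and the two sample rows are made up for illustration, and the text is read from an in-memory string so that the example runs without a file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from open('my_file.csv').
csv_text = """year,field,name,country
1939,physics,Ernest Orlando Lawrence,USA
1951,chemistry,Glenn Theodore Seaborg,USA
"""

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# csv reads every value as a string, so convert numeric fields explicitly.
for row in rows:
    row['year'] = int(row['year'])

print(rows[0])
```

Each element of `rows` is now an ordinary `dict`, so the list could be passed to `insert_many` in the same way as the JSON data below.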

In [None]:
from monty.serialization import loadfn

In [None]:
nobel_data = loadfn('nobel_laureates.json')

And we can see what it contains:

In [None]:
nobel_data[0]

Note that the structure of this document is different from the document we inserted previously (`my_doc`): this again emphasises the schemaless nature of mongoDB. It is up to you to perform validation that your inputs are correct!
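What "validation" means depends on your data. As one possible sketch, the helper below (a hypothetical function, not part of `pymongo`) checks that a laureate document has the fields we rely on later, before it is inserted. The choice of required fields is an assumption made for this example.

```python
def validate_laureate(doc):
    """Return a list of problems found in a laureate document (empty if none)."""
    problems = []
    # Assumed required fields; adjust to whatever your own data needs.
    for field in ('firstname', 'prizes'):
        if field not in doc:
            problems.append('missing field: {}'.format(field))
    # Each prize is expected to be a dict with a 'year' and a 'category'.
    for i, prize in enumerate(doc.get('prizes', [])):
        for field in ('year', 'category'):
            if field not in prize:
                problems.append('prizes[{}] missing field: {}'.format(i, field))
    return problems


# Hand-written sample documents in the same shape as the laureate data.
good_doc = {'firstname': 'Marie', 'surname': 'Curie',
            'prizes': [{'year': '1903', 'category': 'physics'}]}
bad_doc = {'surname': 'Curie', 'prizes': [{'year': '1911'}]}

print(validate_laureate(good_doc))
print(validate_laureate(bad_doc))
```

You could then insert only the documents for which the returned list is empty.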

We can see how many entries `nobel_data` contains:

In [None]:
len(nobel_data)

Now we can insert all of these into our `nobel` collection at once:

In [None]:
result = c.insert_many(nobel_data)

If we want to, we can now check that the insertion occurred, and print the `ObjectId`s of all the inserted documents.

In [None]:
result.acknowledged

In [None]:
result.inserted_ids

Like `insert_one()` and `insert_many()`, many methods in `pymongo` have a `_one()` or a `_many()` version.

# 7. Query documents

This uses the exact same syntax as `MPRester().query()`, so hopefully you will be familiar with it!

Let's test a simple query:

In [None]:
cursor = c.find({'surname': 'Bragg'})

This returns a cursor: an object that helps us keep track of where we are in the database. In Python, it is iterable, which means to see our results we can write:

In [None]:
# pprint works like print, but lays out nested dicts and lists so they are easier to read
from pprint import pprint

In [None]:
for document in cursor:
    pprint(document)

Now, let's run the exact same command again:

In [None]:
for document in cursor:
    pprint(document)

This time it prints nothing! That's because our cursor is now at the end of the results from our query. To see our results again, the query has to be run again.

In [None]:
cursor = c.find({'surname': 'Bragg'})
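This single-pass behaviour is not specific to MongoDB: any Python iterator works this way. The sketch below uses a plain generator as a stand-in for a cursor, so it does not touch the database at all. The two documents are simplified, hand-written examples.

```python
def fake_cursor():
    # Stand-in for c.find(...): yields documents one at a time.
    yield {'surname': 'Bragg', 'firstname': 'William Henry'}
    yield {'surname': 'Bragg', 'firstname': 'William Lawrence'}


cursor_like = fake_cursor()
first_pass = [doc['firstname'] for doc in cursor_like]
second_pass = [doc['firstname'] for doc in cursor_like]  # already exhausted

print(first_pass)
print(second_pass)
```

The second list is empty for the same reason the second loop above printed nothing. As an alternative to re-running the query, `pymongo` cursors also provide a `rewind()` method that resets the cursor to its unevaluated state.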

If you want to store the results to use later, you can store them in a list:

In [None]:
my_query_results = list(cursor)

In [None]:
pprint(my_query_results)

But use caution with this: it's not a good idea to store all the results of a very large query (this is what a database is for, after all).

It is also possible to do projections, to only return the document fields you're interested in:

In [None]:
cursor = c.find({'surname': 'Bragg'},
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

Here, the `find` method has taken two arguments: the first is the query (a `dict`), and the second is the projection, given here as a `list` of the fields you want to return.

In [None]:
# doc is a dict
# we can just print the information we're interested in
for doc in cursor:
    print("{} {} ({}, {}): {}\n".format(doc['firstname'],
                                    doc['surname'],
                                    doc['prizes'][0]['category'],
                                    doc['prizes'][0]['year'],
                                    # prize 'motivation' is not always defined, so we use
                                    # .get() to return a default value of an empty string ""
                                    # if 'motivation' is not defined
                                    doc['prizes'][0].get('motivation', "")))

Since our projection is just a `list`, if we use it a lot, we can store it like so:

In [None]:
my_projection = ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation']
cursor = c.find({'surname': 'Bragg'}, my_projection)

This is equivalent to the previous example.
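The list form is a shorthand. A projection can also be written as a `dict` that maps each field name to `1` (include) or `0` (exclude), which is the form MongoDB itself uses. Here is a minimal sketch of building the dict form from the list. Excluding `_id` is an optional extra, shown because `_id` is otherwise returned by default.

```python
fields = ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation']

# Map every requested field to 1, meaning "include this field".
dict_projection = dict.fromkeys(fields, 1)

# '_id' is returned unless it is explicitly excluded with 0.
dict_projection['_id'] = 0

print(dict_projection)
```

Passing `dict_projection` as the second argument to `c.find` should then return the same fields as before, minus `_id`.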

## 7.1 More complex queries: element matching

Let's try a more complex query. We want to find all laureates who won a Nobel prize in a specified year. If we look at the structure of the documents above, we see that each laureate has a list, `prizes`, with one element for each of their Nobel prizes. Each element of that `list` is a `dict` containing information on the prize, such as its `year`.

To match all laureates from '1905', we use `$elemMatch`, which matches a document if at least one *element* of its `prizes` list satisfies the condition.

In [None]:
cursor = c.find({'prizes': {'$elemMatch': {'year': '1905'}}},
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

In [None]:
for doc in cursor:
    print("{} {} ({}): {}\n".format(doc['firstname'],
                                    doc['surname'],
                                    doc['prizes'][0]['category'],
                                    doc['prizes'][0].get('motivation', "")))
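One caveat about the loop above: it prints `prizes[0]`, which for a laureate with several prizes is not necessarily the prize that matched the query. The sketch below mimics in plain Python what `$elemMatch` checks for simple equality conditions, and shows one way to pick out the matching prize. The helper `elem_match` and the sample document are written for this example only.

```python
def elem_match(items, condition):
    """Mimic $elemMatch for simple equality conditions: True if at least one
    dict in `items` has all the key/value pairs in `condition`."""
    return any(all(item.get(k) == v for k, v in condition.items())
               for item in items)


# Simplified, hand-written document in the same shape as the laureate data.
sample_doc = {'surname': 'Curie',
              'prizes': [{'year': '1903', 'category': 'physics'},
                         {'year': '1911', 'category': 'chemistry'}]}

print(elem_match(sample_doc['prizes'], {'year': '1911'}))
print(elem_match(sample_doc['prizes'], {'year': '1905'}))

# Select the prize(s) that matched, rather than assuming it is prizes[0].
matching = [p for p in sample_doc['prizes'] if p['year'] == '1911']
print(matching[0]['category'])
```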

This example highlights one of the problems with a schema-less database like mongoDB. You may have been surprised that the year was entered as a string (`'1905'`) and not an int (`1905`).

It would make more sense to store the year as an integer, but the source data had the years as strings, and these were inserted into the database without complaint. Likewise, mongoDB has support for date objects, which would be more appropriate for the `born` and `died` fields.
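One way to handle this is to clean each document before inserting it. The following is a sketch under the assumption that `year` is always a numeric string and `born` uses the `YYYY-MM-DD` format. The function name `clean_laureate` and the sample document are ours, written for this example.

```python
from datetime import datetime

# Hand-written sample in the shape of the laureate data.
laureate = {'surname': 'Lenard', 'born': '1862-06-07',
            'prizes': [{'year': '1905', 'category': 'physics'}]}


def clean_laureate(doc):
    """Return a copy with prize years as int and 'born' as a datetime."""
    cleaned = dict(doc)
    # Build new prize dicts so the original document is left unchanged.
    cleaned['prizes'] = [dict(p, year=int(p['year'])) for p in doc['prizes']]
    try:
        cleaned['born'] = datetime.strptime(doc['born'], '%Y-%m-%d')
    except (KeyError, ValueError):
        # Leave missing or unparseable dates untouched.
        pass
    return cleaned


cleaned = clean_laureate(laureate)
print(cleaned['prizes'][0]['year'], cleaned['born'].year)
```

`pymongo` stores `datetime` objects as MongoDB dates, and with integer years a range query such as `{'prizes.year': {'$gte': 1950}}` becomes possible.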

Even though `mongoDB` is schema-less, it's worth thinking about the form you want your data to take, because it can make querying your data easier later.

To use a concrete example, some people store a value with its units as a string, like so: `'0.1 eV'`. While storing the units is useful, it is now not possible to use, for example, 'greater than' queries on that field, since it is not a number. A more useful format might be to store the value as a dict, `{'value': 0.1, 'unit': 'eV'}`. Now, the unit is still stored with the value, but the value is stored as a number (`float`)  that is easier to query.
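A minimal sketch of that conversion, assuming the string always has the form "number, space, unit" (the helper name `split_value_and_unit` is hypothetical):

```python
def split_value_and_unit(text):
    """Turn a string such as '0.1 eV' into {'value': 0.1, 'unit': 'eV'}."""
    value, unit = text.split(maxsplit=1)
    return {'value': float(value), 'unit': unit}


band_gap = split_value_and_unit('0.1 eV')
print(band_gap)
```

If this dict were stored under a field called, say, `band_gap` (a made-up field name), a 'greater than' query could then look like `{'band_gap.value': {'$gt': 0.05}}`.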

## 7.2 More complex queries: OR logic

In a similar way, we can query all of the winners of the Nobel Prize in Chemistry.

In [None]:
cursor = c.find({'prizes': {'$elemMatch': {'category': 'chemistry'}}},
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

In [None]:
for doc in cursor:
    print("{} {} ({}): {}\n".format(doc['firstname'],
                                    doc['surname'],
                                    doc['prizes'][0]['category'],
                                    doc['prizes'][0].get('motivation', "")))

If we wanted to ask for `chemistry` OR `physics`, we would modify the query by replacing `'category': 'chemistry'` with `'category': {'$in': ['chemistry', 'physics']}`, like so:

In [None]:
cursor = c.find({'prizes': {'$elemMatch': {'category': {'$in': ['chemistry', 'physics']}}}},
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

In [None]:
for doc in cursor:
    print("{} {} ({}, {}): {}\n".format(doc['firstname'],
                                        doc['surname'],
                                        doc['prizes'][0]['category'],
                                        doc['prizes'][0]['year'],
                                        doc['prizes'][0].get('motivation', "")))

## 7.3 More complex queries: counting

Let's see which laureates have won two Nobel prizes!

In [None]:
cursor = c.find({'prizes': {'$size': 2}},
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

In [None]:
for doc in cursor:
    pprint(doc)
    print("\n")

## 7.4 More complex queries: regular expressions

We can search for strings that contain a given word, for example to find Nobel prizes awarded to laureates affiliated with a specific city or institution.

In [None]:
MY_CITY = "Berkeley"

This will be a fairly complex query, so let's write it separately to make it easier to read (the query is only a dict after all):

In [None]:
# the format of the document looks like this:
# 'prizes': [{'affiliations': [{'city': 'Manchester',
#                               'country': 'United Kingdom',
#                               'name': 'Victoria University'}],

# you can write this all on one line too
query = {
    'prizes': {
        '$elemMatch': {
            'affiliations': {
                '$elemMatch': {
                    'city': {
                        '$regex': MY_CITY
                    }
                }
            }
        }
    }
}

For this query, we use the `$elemMatch` we introduced earlier.

We also use `$regex`, which takes a 'regular expression'. In its simplest form, it will match any string that contains its argument: for example, `{'$regex': 'California'}` would return results with `University of California`. Far more complicated regular expressions are possible for more complicated queries: Google is your friend!
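To get a feel for how such patterns behave without touching the database, you can try them with Python's own `re` module. MongoDB uses Perl-compatible regular expressions, which behave the same way as `re` for simple patterns like these, so treat this as an approximation rather than an exact emulation. The list of names is a small hand-picked sample.

```python
import re

names = ['University of California',
         'California Institute of Technology',
         'Victoria University']

# An unanchored pattern matches anywhere in the string.
contains = [n for n in names if re.search('California', n)]
print(contains)

# '^' anchors the pattern to the start of the string.
starts_with = [n for n in names if re.search('^California', n)]
print(starts_with)

# re.escape protects search text containing special characters such as '.',
# so that it is matched literally.
literal_pattern = re.escape('St. Louis')
print(bool(re.search(literal_pattern, 'St. Louis')))
print(bool(re.search(literal_pattern, 'Stx Louis')))
```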

In [None]:
cursor = c.find(query,
                ['firstname', 'surname', 'prizes.category', 'prizes.year', 'prizes.motivation'])

In [None]:
for doc in cursor:
    print("{} {} ({}, {}): {}\n".format(doc['firstname'],
                                        doc['surname'],
                                        doc['prizes'][0]['category'],
                                        doc['prizes'][0]['year'],
                                        doc['prizes'][0].get('motivation', "")))

You can replace `city` with `name` in the above query to search for your specific institution! (or `city` with `country` to search for your country)

# 8. More advanced topics

MongoDB provides very powerful aggregation and document validation features. The MongoDB documentation and StackOverflow are two great resources to learn more. Good luck! :)

The only caveat: the syntax used in the MongoDB documentation (the MongoDB shell) is very similar to pymongo, but slightly different (usually Python requires some extra quotation marks). If you're trying something and it doesn't work, try searching for the pymongo equivalent.