# MongoDB Experiments

* Author: Johannes Maucher
* Last Update: 18.05.2018

## Prerequisites
For the experiments in this notebook, MongoDB must be installed. It can be downloaded from [MongoDB Community Server Download](https://www.mongodb.com/download-center#community). After installation, the MongoDB environment must be set up as described in the [MongoDB installation and setup tutorial](https://docs.mongodb.com/manual/tutorial). As described in that tutorial, a data directory for MongoDB must be defined. Assuming the MongoDB data directory is `d:\test\mongodb\data`, enter the following command in a shell to start the MongoDB server:

`"C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath d:\test\mongodb\data`

> Notes JM:
> * The first part of the command above is the executable that starts the MongoDB server. Instead of the full path, `mongod` is sufficient if the following line has been added to the *.bash_profile* file:
> `export PATH="/Users/maucher/mongodb/bin:$PATH"`
> * The second part of the command defines the path to the MongoDB data directory. On my MacBook this is `/Users/maucher/DataSets/mongodb/data`; on the ThinkPad it is `C:\Users\maucher\DataSets\world-food-fact`.

The example data used in this notebook comes from [OpenFoodFacts](https://world.openfoodfacts.org/data). A MongoDB dump of this data can be downloaded from [MongoDB dump of OpenFoodFacts](http://world.openfoodfacts.org/data/openfoodfacts-mongodbdump.tar.gz). Decompress the downloaded archive and move the `.tsv` file to your MongoDB data directory. 

Apply the *mongoimport*-command to import the *.tsv*-dump into the MongoDB database by typing the following command into a shell:

`"C:\Program Files\MongoDB\Server\3.4\bin\mongoimport.exe" --db fooddata --type tsv --file d:\test\mongodb\data\en.openfoodfacts.org.products.tsv --ignoreBlanks --headerline`

Finally, the `pymongo` package must be installed, e.g. with `conda install pymongo`. The local *Open Food Facts* MongoDB can then be accessed as demonstrated in this notebook. The [pymongo tutorial](http://api.mongodb.com/python/current/tutorial.html) provides a quick introduction to *pymongo*.

## Access local Open Food Facts MongoDB

> Note that the MongoDB server must be running (as described above) in order to execute the following client actions.

In [2]:
import pymongo
import pprint
import pandas as pd
from IPython.display import display

Create MongoDB client and connect to the running MongoDB server:

In [3]:
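from pymongo import MongoClient
# Connect to the server on the default host and port
# (MongoClient() without arguments is equivalent).
client = MongoClient('mongodb://localhost:27017/')
print(client.address)
# list_database_names() replaces database_names(), which was removed in PyMongo 4
print(client.list_database_names())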

('localhost', 27017)
['admin', 'config', 'fooddata', 'local', 'progCourseData']


Select the `progCourseData`-database. This database contains several datasets, applied in the *Programming for Data Science*-course.

In [4]:
db = client['progCourseData']

In MongoDB, *collections* correspond to tables in relational databases. A list of all collection names can be obtained as follows:

In [5]:
print(db.name)
# list_collection_names() replaces collection_names(), which was removed in PyMongo 4
print(db.list_collection_names())

progCourseData
['lobbyPediaParteispenden', 'insurance', 'humanResources', 'VideoGamesSales-22-12-2016', 'EnergyMixGeoClust']


## Import data from a CSV file into MongoDB
The following command can be run (either from a shell or in a Jupyter notebook) to import data into the MongoDB database. Note that this command should be executed only once: a second run imports the same documents again, so the collection then contains every record twice.

In [5]:
#!/Users/maucher/mongodb/bin/mongoimport --db progCourseData --type csv --file /Users/maucher/DataSets/DataFromProgramming/lobbyPediaParteispenden.csv --ignoreBlanks --headerline

A better option is to use the `--drop` option, which drops the collection, if it exists, before importing the data.

In [6]:
!/Users/maucher/mongodb/bin/mongoimport --db progCourseData --type csv --file /Users/maucher/DataSets/DataFromProgramming/lobbyPediaParteispenden.csv --ignoreBlanks --headerline --drop

2018-10-02T21:00:16.169+0200	no collection specified
2018-10-02T21:00:16.169+0200	using filename 'lobbyPediaParteispenden' as collection
2018-10-02T21:00:16.183+0200	connected to: localhost
2018-10-02T21:00:16.183+0200	dropping: progCourseData.lobbyPediaParteispenden
2018-10-02T21:00:16.329+0200	imported 2466 documents
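The log above shows that `mongoimport` falls back to the file name when no collection is specified. The following is a minimal sketch of how the call could be assembled from Python with an explicit `--collection` argument. The helper name `build_mongoimport_args` and the assumption that `mongoimport` is on the `PATH` are illustrative and not part of the original notebook.

```python
import shlex

def build_mongoimport_args(db, csv_path, collection=None, drop=True,
                           executable="mongoimport"):
    """Assemble the argument list for a mongoimport call on a CSV file."""
    args = [executable, "--db", db, "--type", "csv", "--file", csv_path,
            "--ignoreBlanks", "--headerline"]
    if collection is not None:
        # Name the target collection explicitly instead of relying on the file name.
        args += ["--collection", collection]
    if drop:
        # Drop an existing collection first, so repeated runs do not duplicate data.
        args.append("--drop")
    return args

args = build_mongoimport_args(
    "progCourseData",
    "/Users/maucher/DataSets/DataFromProgramming/lobbyPediaParteispenden.csv",
    collection="lobbyPediaParteispenden",
)
print(shlex.join(args))
# To actually run the import: subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.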


In [7]:
print(db.list_collection_names())

['lobbyPediaParteispenden', 'insurance', 'humanResources', 'VideoGamesSales-22-12-2016', 'EnergyMixGeoClust']


## Drop a collection

In [8]:
#db.drop_collection("lobbyPediaParteispenden")

## Access a database collection (a table of the database)
Create the interface to the collection (table). [The MongoDB documentation](https://docs.mongodb.com/manual/reference/method/db.getCollection/) says that the following two options are functionally equivalent:

* `db.get_collection("collectionName")`
* `db.collectionName`


In [9]:
#coll=db.EnergyMixGeoClust
coll=db.get_collection("EnergyMixGeoClust")
#coll=db.get_collection("lobbyPediaParteispenden")

Determine the number of items in the collection (rows in the table):

In [10]:
# count_documents({}) replaces count(), which was removed in PyMongo 4
coll.count_documents({})

65

## Write MongoDB data to a pandas DataFrame

`find()` without any arguments (i.e. without a query) returns all documents of the collection as a cursor object:

In [11]:
datadict=coll.find()

In [12]:
print(type(datadict))

<class 'pymongo.cursor.Cursor'>


Write the collection contents to a pandas DataFrame:

In [13]:
energyDF=pd.DataFrame(list(datadict))

In [14]:
energyDF.shape

(65, 13)

In [15]:
display(energyDF.head())

Unnamed: 0,Unnamed: 1,CO2Emm,Cluster,Coal,Country,Gas,Hydro,Lat,Long,Nuclear,Oil,Total2009,_id
0,1,602.7,5,26.5,Canada,85.2,90.2,56.130366,-106.346771,20.3,97.0,319.2,5afddb89ef93e059be279628
1,4,409.4,5,11.7,Brazil,18.3,88.5,-14.235004,-51.92528,2.9,104.3,225.7,5afddb89ef93e059be279629
2,0,5941.9,6,498.0,US,588.7,62.2,37.09024,-95.712891,190.2,842.9,2182.0,5afddb89ef93e059be27962a
3,2,436.8,6,6.8,Mexico,62.7,6.0,23.634501,-102.552784,2.2,85.6,163.2,5afddb89ef93e059be27962b
4,5,70.3,5,4.1,Chile,3.0,5.6,-35.675147,-71.542969,0.0,15.4,28.1,5afddb89ef93e059be27962c


In [16]:
energyDF.columns

Index(['', 'CO2Emm', 'Cluster', 'Coal', 'Country', 'Gas', 'Hydro', 'Lat',
       'Long', 'Nuclear', 'Oil', 'Total2009', '_id'],
      dtype='object')

In [17]:
energyDF=energyDF.drop(columns=["_id"])

In [18]:
display(energyDF.head())

Unnamed: 0,Unnamed: 1,CO2Emm,Cluster,Coal,Country,Gas,Hydro,Lat,Long,Nuclear,Oil,Total2009
0,1,602.7,5,26.5,Canada,85.2,90.2,56.130366,-106.346771,20.3,97.0,319.2
1,4,409.4,5,11.7,Brazil,18.3,88.5,-14.235004,-51.92528,2.9,104.3,225.7
2,0,5941.9,6,498.0,US,588.7,62.2,37.09024,-95.712891,190.2,842.9,2182.0
3,2,436.8,6,6.8,Mexico,62.7,6.0,23.634501,-102.552784,2.2,85.6,163.2
4,5,70.3,5,4.1,Chile,3.0,5.6,-35.675147,-71.542969,0.0,15.4,28.1
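Since the cursor yields plain dictionaries, the `_id` field can also be removed before the DataFrame is built. The sketch below uses two shortened stand-in documents modeled on the output above; the helper name `without_id` is hypothetical.

```python
def without_id(documents):
    """Return copies of the documents with the '_id' key removed."""
    return [{k: v for k, v in doc.items() if k != "_id"} for doc in documents]

# Two shortened documents modeled on the output above (most fields omitted).
docs = [
    {"_id": "5afddb89ef93e059be279628", "Country": "Canada", "Nuclear": 20.3},
    {"_id": "5afddb89ef93e059be279629", "Country": "Brazil", "Nuclear": 2.9},
]
rows = without_id(docs)
print(rows)
# pd.DataFrame(rows) would then contain no '_id' column.
# With a live collection, the projection coll.find({}, {"_id": 0})
# asks the server to leave the field out in the first place.
```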


### Get items of the collection
Retrieve the first document of the collection:

In [19]:
item=coll.find_one()
pprint.pprint(item)

{'': 1,
 'CO2Emm': 602.7,
 'Cluster': 5,
 'Coal': 26.5,
 'Country': 'Canada',
 'Gas': 85.2,
 'Hydro': 90.2,
 'Lat': 56.130366,
 'Long': -106.346771,
 'Nuclear': 20.3,
 'Oil': 97.0,
 'Total2009': 319.2,
 '_id': ObjectId('5afddb89ef93e059be279628')}


Get subset according to specified query:

In [20]:
nuclear=coll.find({"Nuclear": { '$gt': 0.0 }})

In [21]:
nuclearDF=pd.DataFrame(list(nuclear))
display(nuclearDF)

Unnamed: 0,Unnamed: 1,CO2Emm,Cluster,Coal,Country,Gas,Hydro,Lat,Long,Nuclear,Oil,Total2009,_id
0,1,602.7,5,26.5,Canada,85.2,90.2,56.130366,-106.346771,20.3,97.0,319.2,5afddb89ef93e059be279628
1,4,409.4,5,11.7,Brazil,18.3,88.5,-14.235004,-51.92528,2.9,104.3,225.7,5afddb89ef93e059be279629
2,0,5941.9,6,498.0,US,588.7,62.2,37.09024,-95.712891,190.2,842.9,2182.0,5afddb89ef93e059be27962a
3,2,436.8,6,6.8,Mexico,62.7,6.0,23.634501,-102.552784,2.2,85.6,163.2,5afddb89ef93e059be27962b
4,3,164.2,4,1.1,Argentina,38.8,9.2,-38.416097,-63.616672,1.8,22.3,73.3,5afddb89ef93e059be27962d
5,13,172.8,6,4.6,Belgium,15.6,0.1,50.503887,4.469936,10.7,38.5,69.4,5afddb89ef93e059be279635
6,14,43.7,2,6.3,Bulgaria,2.2,0.9,42.733883,25.48583,3.5,4.4,17.4,5afddb89ef93e059be279636
7,15,109.5,2,15.8,Czech_Republic,7.4,0.7,49.817492,15.472962,6.1,9.7,39.6,5afddb89ef93e059be279637
8,17,52.5,6,3.7,Finland,3.2,2.9,61.92411,25.748151,5.4,9.9,25.0,5afddb89ef93e059be279639
9,18,398.7,3,10.1,France,38.4,13.1,46.227638,2.213749,92.9,87.5,241.9,5afddb89ef93e059be27963a


Each document of a collection is returned as a Python dictionary. Hence the attributes (keys) can be obtained as follows:

In [22]:
pprint.pprint(list(item.keys()))

['_id',
 '',
 'Country',
 'Oil',
 'Gas',
 'Coal',
 'Nuclear',
 'Hydro',
 'Total2009',
 'CO2Emm',
 'Lat',
 'Long',
 'Cluster']


Queries can be specified as key-value pairs. E.g. in order to retrieve one country whose name matches the regular expression `.*[gG]erman.*`, the following command can be applied:

In [23]:
germ=coll.find_one({'Country' : {'$regex':'.*[gG]erman.*'}})
pprint.pprint(germ)

{'': 19,
 'CO2Emm': 795.6,
 'Cluster': 6,
 'Coal': 71.0,
 'Country': 'Germany',
 'Gas': 70.2,
 'Hydro': 4.2,
 'Lat': 51.165691,
 'Long': 10.451526,
 'Nuclear': 30.5,
 'Oil': 113.9,
 'Total2009': 289.8,
 '_id': ObjectId('5afddb89ef93e059be27963b')}


CO2 emission of the queried country:

In [24]:
pprint.pprint(germ['CO2Emm'])

795.6
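A `$regex` condition matches anywhere in the string unless the pattern is anchored, so the leading and trailing `.*` in the query above are not strictly needed. The following sketch illustrates this with Python's `re` module as a stand-in for the server-side matching. The list of country names is a small hand-picked sample.

```python
import re

pattern = re.compile(r".*[gG]erman.*")  # pattern used in the query above
simpler = re.compile(r"[gG]erman")      # same matches, because the search is unanchored

countries = ["Canada", "Germany", "France", "Czech_Republic"]

# re.search looks for a match anywhere in the string, like an unanchored $regex.
matches = [c for c in countries if pattern.search(c)]
print(matches)  # ['Germany']
print([c for c in countries if simpler.search(c)] == matches)  # True
```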


More complex queries can be built by combining logical and comparison operators, such as `$and`, `$or`, `$gt` (greater than) and `$lt` (less than).

In [25]:
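# Keep the filter in a variable so it can be reused for counting
query = {
     '$and': [
            { 'Nuclear' : { '$gt': 50.0 } },
            { 'Oil'    : { '$gt': 80.0 } }
          ]
}
risky=coll.find(query)

In [26]:
# count_documents() replaces Cursor.count(), which was removed in PyMongo 4
coll.count_documents(query)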

3

In [27]:
display(pd.DataFrame(list(risky)))

Unnamed: 0,Unnamed: 1,CO2Emm,Cluster,Coal,Country,Gas,Hydro,Lat,Long,Nuclear,Oil,Total2009,_id
0,0,5941.9,6,498.0,US,588.7,62.2,37.09024,-95.712891,190.2,842.9,2182.0,5afddb89ef93e059be27962a
1,18,398.7,3,10.1,France,38.4,13.1,46.227638,2.213749,92.9,87.5,241.9,5afddb89ef93e059be27963a
2,56,1222.1,6,108.8,Japan,78.7,16.7,36.204824,138.252924,62.1,197.6,463.9,5afddb89ef93e059be279661
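To make the semantics of such filter documents more concrete, here is a minimal pure-Python sketch that evaluates a small subset of the syntax (`$and`, `$or`, `$gt`, `$lt` and plain equality) against dictionaries. It is only an illustrative model, not how MongoDB evaluates queries internally. The names `matches` and `COMPARISONS` are hypothetical, and the sample documents are shortened versions of records shown above.

```python
import operator

# Comparison operators supported by this sketch (MongoDB has many more).
COMPARISONS = {"$gt": operator.gt, "$lt": operator.lt}

def matches(doc, query):
    """Return True if doc satisfies a small subset of MongoDB's filter syntax."""
    for key, condition in query.items():
        if key == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Field with operator expressions, e.g. {"Nuclear": {"$gt": 50.0}}
            value = doc.get(key)
            for op, operand in condition.items():
                if value is None or not COMPARISONS[op](value, operand):
                    return False
        elif doc.get(key) != condition:
            # A plain key-value pair means equality.
            return False
    return True

docs = [
    {"Country": "US", "Nuclear": 190.2, "Oil": 842.9},
    {"Country": "Germany", "Nuclear": 30.5, "Oil": 113.9},
    {"Country": "France", "Nuclear": 92.9, "Oil": 87.5},
    {"Country": "Chile", "Nuclear": 0.0, "Oil": 15.4},
]
query = {"$and": [{"Nuclear": {"$gt": 50.0}}, {"Oil": {"$gt": 80.0}}]}
risky_countries = [d["Country"] for d in docs if matches(d, query)]
print(risky_countries)  # ['US', 'France']
```

Conditions on several fields inside one filter document are combined with a logical AND, so `{"Nuclear": {"$gt": 50.0}, "Oil": {"$gt": 80.0}}` selects the same documents as the explicit `$and` form.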


The built-in `next()` function can be used to iterate through the items of a query result:

In [28]:
nuclear=coll.find({"Nuclear": { '$gt': 0.0 }})

In [29]:
for i in range(10): #list the first 10 items
    p=next(nuclear)
    print('-'*10)
    print(p['Country'])
    print(p['Nuclear'])
    print(p['Hydro'])

----------
Canada
20.3
90.2
----------
Brazil
2.9
88.5
----------
US
190.2
62.2
----------
Mexico
2.2
6.0
----------
Argentina
1.8
9.2
----------
Belgium
10.7
0.1
----------
Bulgaria
3.5
0.9
----------
Czech_Republic
6.1
0.7
----------
Finland
5.4
2.9
----------
France
92.9
13.1
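The loop above raises `StopIteration` if the query returns fewer than 10 documents. Because a cursor is an ordinary Python iterator, `itertools.islice` offers a way to take at most `n` items. The sketch below uses a plain iterator over three shortened documents as a stand-in for the cursor; the helper name `first_n` is hypothetical.

```python
from itertools import islice

def first_n(cursor, n):
    """Return up to n items from an iterator without raising StopIteration."""
    return list(islice(cursor, n))

# Stand-in for a cursor: any iterator over documents behaves the same way here.
fake_cursor = iter([
    {"Country": "Canada", "Nuclear": 20.3, "Hydro": 90.2},
    {"Country": "Brazil", "Nuclear": 2.9, "Hydro": 88.5},
    {"Country": "US", "Nuclear": 190.2, "Hydro": 62.2},
])

for p in first_n(fake_cursor, 10):  # only 3 items exist; no error is raised
    print('-' * 10)
    print(p["Country"])
    print(p["Nuclear"])
    print(p["Hydro"])
```

With a live collection, `coll.find(...).limit(10)` restricts the result on the server side instead.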


The distinct values of a field can be obtained with `distinct()`:

In [30]:
coll.distinct('Cluster')

[5, 6, 4, 2, 3, 1]