# Playing with MongoDB

[MongoDB](https://en.wikipedia.org/wiki/MongoDB) has to be installed separately, unlike SQLite, which ships with Python's standard library (the `sqlite3` module).

```bash
sudo eopkg install mongodb
pip install --user pymongo
sudo mkdir -p /data/db
sudo chown $USER /data/db
```

Interesting links:
- http://api.mongodb.com/python/current/tutorial.html

Let's play with our DGEMM dataset, used in the paper.

In [1]:
!wget -c https://github.com/Ezibenroc/calibration_analysis/raw/master/dahu/blas/dgemm_calibration.csv -O /tmp/data.csv
!cut -d, -f1,2,3,4,5,6,10,11 /tmp/data.csv > /tmp/dgemm.csv
!head /tmp/dgemm.csv

--2019-06-11 15:14:59--  https://github.com/Ezibenroc/calibration_analysis/raw/master/dahu/blas/dgemm_calibration.csv
Resolving github.com… 140.82.118.3
Connecting to github.com|140.82.118.3|:443… connected.
HTTP request sent, awaiting response… 302 Found
Location: https://media.githubusercontent.com/media/Ezibenroc/calibration_analysis/master/dahu/blas/dgemm_calibration.csv [following]
--2019-06-11 15:14:59--  https://media.githubusercontent.com/media/Ezibenroc/calibration_analysis/master/dahu/blas/dgemm_calibration.csv
Resolving media.githubusercontent.com… 151.101.120.133
Connecting to media.githubusercontent.com|151.101.120.133|:443… connected.
HTTP request sent, awaiting response… 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

function,m,n,k,timestamp,duration,node,core
dgemm,378,7640,2427,3473.428414,0.48594659999999995,10,0
dgemm,378,7640,2427,3473.914385,0.4861293,10,0
dgemm,378,7640,2427,3474.400522,0.486

## The basics

First, let's load our huge CSV file containing dgemm data and dump it into a MongoDB collection.

In [2]:
import pandas
import pymongo

In [3]:
%time df = pandas.read_csv('/tmp/dgemm.csv')
print(len(df))
df.head()

CPU times: user 2.62 s, sys: 286 ms, total: 2.9 s
Wall time: 2.9 s
5004288


Unnamed: 0,function,m,n,k,timestamp,duration,node,core
0,dgemm,378,7640,2427,3473.428414,0.485947,10,0
1,dgemm,378,7640,2427,3473.914385,0.486129,10,0
2,dgemm,378,7640,2427,3474.400522,0.486853,10,0
3,dgemm,9441,640,1160,3474.887383,0.455139,10,0
4,dgemm,9441,640,1160,3475.34253,0.453528,10,0


In [4]:
client = pymongo.MongoClient()
db = client.test_database
collection = db.test_collection

In [5]:
%%time
df_dict = df.to_dict('records')

CPU times: user 1min 13s, sys: 977 ms, total: 1min 14s
Wall time: 1min 14s


In [6]:
%%time
collection.insert_many(df_dict)

CPU times: user 46.8 s, sys: 1.16 s, total: 48 s
Wall time: 58.6 s


<pymongo.results.InsertManyResult at 0x7fce4bcac388>

In [7]:
del df_dict  # releasing some memory
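
Building `df_dict` took longer (1 min 14 s) than the insertion itself (58.6 s), and it keeps all 5,004,288 dicts in memory at once. A minimal sketch of a batched alternative, assuming `collection` is the pymongo collection created above. The helper names `batched` and `insert_in_batches` and the default batch size are hypothetical, and this variant was not timed here:

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, records, batch_size=100_000):
    """Insert records batch by batch and return the number of documents sent.

    `collection` only needs an insert_many(list_of_dicts) method, which a
    pymongo Collection provides.
    """
    total = 0
    for batch in batched(records, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total
```

With pandas, `records` could for instance be a generator that walks `pandas.read_csv('/tmp/dgemm.csv', chunksize=100_000)` and yields the rows of each `chunk.to_dict('records')`, so that only one chunk is held as Python dicts at a time.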

In [8]:
!du -sh /data
!du -sh /tmp/dgemm.csv

681M	/data
250M	/tmp/dgemm.csv


Now, let's see how much time is needed to read from this database.

In [9]:
%time tmp = pandas.DataFrame(list(collection.find()))
print(len(tmp))
tmp.head()

CPU times: user 24.9 s, sys: 1.82 s, total: 26.7 s
Wall time: 27.7 s
5004288


Unnamed: 0,_id,core,duration,function,k,m,n,node,timestamp
0,5cffa9acec24b4aa8e35170f,0,0.485947,dgemm,2427,378,7640,10,3473.428414
1,5cffa9acec24b4aa8e351710,0,0.486129,dgemm,2427,378,7640,10,3473.914385
2,5cffa9acec24b4aa8e351711,0,0.486853,dgemm,2427,378,7640,10,3474.400522
3,5cffa9acec24b4aa8e351712,0,0.455139,dgemm,1160,9441,640,10,3474.887383
4,5cffa9acec24b4aa8e351713,0,0.453528,dgemm,1160,9441,640,10,3475.34253


Wow, reading from the database takes much longer than reading from the CSV file (27.7 s versus 2.9 s of wall time). This feels weird. Let's see if reading only a subset is at least fast.

In [10]:
%time tmp = pandas.DataFrame(list(collection.find({'node': 20})))
print(len(tmp))
tmp.head()

CPU times: user 650 ms, sys: 4 ms, total: 654 ms
Wall time: 2.05 s
156384


Unnamed: 0,_id,core,duration,function,k,m,n,node,timestamp
0,5cffa9b6ec24b4aa8e4f56af,0,0.486023,dgemm,2427,378,7640,20,3485.569752
1,5cffa9b6ec24b4aa8e4f56b0,0,0.486281,dgemm,2427,378,7640,20,3486.055798
2,5cffa9b6ec24b4aa8e4f56b1,0,0.485554,dgemm,2427,378,7640,20,3486.542086
3,5cffa9b6ec24b4aa8e4f56b2,0,0.461661,dgemm,1160,9441,640,20,3487.027647
4,5cffa9b6ec24b4aa8e4f56b3,0,0.458144,dgemm,1160,9441,640,20,3487.489317
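
A query can also be narrowed column-wise: `find` accepts a projection as its second argument, so fields that are not needed (for example `_id`) are not sent to the client. A minimal sketch, assuming `collection` is the collection used above. `make_projection` and `fetch_rows` are hypothetical helper names, and the effect on the timings was not measured here:

```python
def make_projection(fields, keep_id=False):
    """Build a MongoDB projection that keeps only the given fields."""
    projection = {field: 1 for field in fields}
    if not keep_id:
        # _id is returned by default and must be excluded explicitly.
        projection['_id'] = 0
    return projection


def fetch_rows(collection, query, fields):
    """Return the matching documents as a list of dicts with only `fields`."""
    cursor = collection.find(query, make_projection(fields))
    return list(cursor)
```

For instance, `pandas.DataFrame(fetch_rows(collection, {'node': 20}, ['m', 'n', 'k', 'duration']))` would build a frame with only those four columns.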


Alright, so reading a subset of the database is faster than reading everything (about 2 s for 156,384 documents), but it is still too slow. What if we add an index?

In [11]:
collection.profiles.create_index('node', unique=False)

'node_1'

In [12]:
%time tmp = pandas.DataFrame(list(collection.find({'node': 20})))
print(len(tmp))
tmp.head()

CPU times: user 644 ms, sys: 4.06 ms, total: 648 ms
Wall time: 2.02 s
156384


Unnamed: 0,_id,core,duration,function,k,m,n,node,timestamp
0,5cffa9b6ec24b4aa8e4f56af,0,0.486023,dgemm,2427,378,7640,20,3485.569752
1,5cffa9b6ec24b4aa8e4f56b0,0,0.486281,dgemm,2427,378,7640,20,3486.055798
2,5cffa9b6ec24b4aa8e4f56b1,0,0.485554,dgemm,2427,378,7640,20,3486.542086
3,5cffa9b6ec24b4aa8e4f56b2,0,0.461661,dgemm,1160,9441,640,20,3487.027647
4,5cffa9b6ec24b4aa8e4f56b3,0,0.458144,dgemm,1160,9441,640,20,3487.489317


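OK, it did not change anything: the wall time is still about 2 s. A likely reason is that `collection.profiles.create_index(...)` does not index `collection`. In pymongo, attribute access on a collection returns a sub-collection (here `test_collection.profiles`), so the index was created on that other, empty collection, and the query on `collection` still scans every document.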
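
A minimal sketch of how the index could be created on `collection` itself rather than on `collection.profiles`, and how the query plan could be inspected afterwards. The helper names `plan_stages` and `index_and_check` are hypothetical, and the layout of the `explain()` output (keys `queryPlanner`, `winningPlan`, `inputStage`, `inputStages`, `queryPlan`) is an assumption that may vary with the server version:

```python
def plan_stages(plan):
    """Collect the stage names of an explain() winning plan, outermost first."""
    stages = []
    if 'stage' in plan:
        stages.append(plan['stage'])
    # Child plans are assumed to sit under these keys (server-dependent).
    children = []
    if 'queryPlan' in plan:
        children.append(plan['queryPlan'])
    if 'inputStage' in plan:
        children.append(plan['inputStage'])
    children.extend(plan.get('inputStages', []))
    for child in children:
        stages.extend(plan_stages(child))
    return stages


def index_and_check(collection, field, query):
    """Create an ascending index on `field` of `collection` itself, then
    report the index names and whether the query plan uses an index scan."""
    index_name = collection.create_index(field)
    explanation = collection.find(query).explain()
    stages = plan_stages(explanation['queryPlanner']['winningPlan'])
    return {
        'index_name': index_name,
        'indexes': sorted(collection.index_information()),
        'stages': stages,
        'uses_index': 'IXSCAN' in stages,
    }
```

Calling `index_and_check(collection, 'node', {'node': 20})` would be expected to report an `IXSCAN` stage once the index exists on `test_collection`; a `COLLSCAN` stage would mean that every document is still scanned. Whether this actually lowers the 2 s wall time remains to be measured.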