# Experimenting with MongoDB Docker container

## Setting up the Docker container on the local machine

For this experiment I created a local folder with the following structure on my machine.

```
docker_test
    └── data
        ├── csv_data
        │   ├── 1652951631.csv
        │   ├── 1652951755.csv
        │   ├── 1652951877.csv
        │   ├── 1652951992.csv
        │   ├── 1652952112.csv
        │   ├── 1652952240.csv
        │   ├── 1652952353.csv
        │   ├── 1652952472.csv
        │   ├── 1652952598.csv
        │   ├── 1652952711.csv
        │   ├── 1652952837.csv
        │   ├── 1652952961.csv
        │   ├── 1652953072.csv
        │   ├── 1652953199.csv
        │   ├── 1652953311.csv
        │   ├── 1652953435.csv
        │   ├── 1652953552.csv
        │   └── 1652953679.csv
        └── mongo_db
```

- The `csv_data` folder contains a few .csv files which I downloaded from seneca.
- The `mongo_db` folder is where MongoDB stores its data on the local machine, so the database is not deleted when we stop the container (i.e. it is the volume mounted into this Docker container).

To create and run a Docker container with MongoDB, I changed my working directory to the `docker_test` folder and ran the following command in the terminal:
```
docker run -d -p 27017:27017 -v `pwd`/data/mongo_db:/data/db --name mongo_db mongo   
```
Now the container should be running in the background (you can check with `docker container ls`).
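The same command can also be assembled from Python, which avoids relying on the shell's `pwd` substitution. This is only a minimal sketch: the helper name `build_docker_run_args` and its parameters are my own assumptions, and it only builds the argument list without starting anything.

```python
from pathlib import Path


def build_docker_run_args(project_dir, container_name="mongo_db", port=27017):
    """Return the argument list for the `docker run` command shown above."""
    # Docker bind mounts need an absolute host path, so resolve it first.
    volume_dir = Path(project_dir).resolve() / "data" / "mongo_db"
    return [
        "docker", "run", "-d",
        "-p", f"{port}:{port}",
        "-v", f"{volume_dir}:/data/db",
        "--name", container_name,
        "mongo",
    ]


args = build_docker_run_args(".")
print(" ".join(args))
# To actually start the container: subprocess.run(args, check=True)
```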

## Python code to connect to MongoDB and insert data

In [1]:
# imports
from pymongo import MongoClient
from pprint import pprint

In [2]:
# Client connects to "localhost" by default 
client = MongoClient()

In [3]:
# Select the database (MongoDB creates it lazily on the first insert)
db = client['TravelDashboard']
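Because `MongoClient()` connects lazily, the cells above succeed even if the container is not running. A small check like the following can make that visible early. It is a sketch, not part of the original notebook: `mongo_is_reachable` is a hypothetical helper, and catching every exception is a simplification.

```python
def mongo_is_reachable(client):
    """Return True if the server answers a ping, False otherwise."""
    try:
        # "ping" is a cheap server command that forces a real round trip.
        client.admin.command("ping")
        return True
    except Exception:  # deliberately broad for this sketch
        return False


# Usage with pymongo (assumes the container from above is running):
# from pymongo import MongoClient
# client = MongoClient(serverSelectionTimeoutMS=2000)  # give up after 2 s
# print(mongo_is_reachable(client))
```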

### Test whether we can add documents which will be persisted even after stopping and restarting the container

In [10]:
# this is just a random example of a document which could be entered
courses = {'title': 'Data Science',
         'lecturer': {
         'name': 'Markus Löcher',
         'department': 'Math',
         'status': 'Professor'}}

In [12]:
db.courses.insert_one(courses)

<pymongo.results.InsertOneResult at 0x112412190>

In between these steps I stopped and restarted the Docker container:
```
docker stop mongo_db
docker restart mongo_db
```
Now let's check if the entry is still there.

In [13]:
# Print all documents
cursor = db.courses.find()

for document in cursor:
    pprint(document)

{'_id': ObjectId('62a8684d3138824bd05e4e39'),
 'lecturer': {'department': 'Math',
              'name': 'Markus Löcher',
              'status': 'Professor'},
 'title': 'Data Science'}


The entry is still there :) That's good.

In [14]:
# let's drop the courses collection
db.courses.drop()

### Add all sample .csv files to the MongoDB using python

In [5]:
# Check current directory where notebook is located
import os 
import glob
# os.path.abspath("")  # in a plain Python script (not a notebook), use os.path.abspath(__file__) instead

In [73]:
# get path of all .csv files in csv_data folder
all_files = glob.glob(
    os.path.join("/Users/philippheitmann/Desktop/docker_test/data/csv_data", "*.csv"))
        #"/Volumes/Dateien/Coding/Datasets/Flight_Data/flights_csv", "*.csv"))
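The hard-coded absolute path only works on one machine. As a sketch of an alternative, the files could be collected relative to the project folder and sorted chronologically. The helper `list_csv_files` is hypothetical, and it assumes that every file name is a plain Unix timestamp such as `1652951631.csv`.

```python
from pathlib import Path


def list_csv_files(csv_dir):
    """Return all .csv files in csv_dir, sorted by their numeric file stem."""
    files = Path(csv_dir).glob("*.csv")
    # File names are assumed to be Unix timestamps, so a numeric sort
    # yields chronological order.
    return sorted(files, key=lambda path: int(path.stem))


# Hypothetical usage, relative to the docker_test folder:
# all_files = list_csv_files("data/csv_data")
```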


In [74]:
import pandas as pd
df = pd.read_csv(all_files[0], index_col=0)

In [76]:
cols = [
    "geo_altitude", "vertical_rate", 'country_cc', "callsign", "avg_no_seats", "time"
]


In [78]:
df_clean = df[cols]

In [79]:
df_clean  # display the selected columns (all rows of this file share one "time" value)

| | geo_altitude | vertical_rate | country_cc | callsign | avg_no_seats | time |
|---|---|---|---|---|---|---|
| 0 | 11871.96 | 0.00 | SE | SWR17A | 164.0 | 1652953199 |
| 1 | 12245.34 | 0.33 | HR | SWR185N | 164.0 | 1652953199 |
| 2 | 10142.22 | 0.00 | AU | JST969 | 180.4 | 1652953199 |
| 3 | 5082.54 | -17.56 | AU | JST526 | 180.4 | 1652953199 |
| 4 | 11353.80 | -13.33 | JP | ADO121 | 151.8 | 1652953199 |
| ... | ... | ... | ... | ... | ... | ... |
| 3322 | 8374.38 | 7.80 | FR | AFR53HX | 180.0 | 1652953199 |
| 3323 | 1363.98 | -0.98 | PT | TAP1939 | 102.0 | 1652953199 |
| 3324 | 12009.12 | 0.00 | JP | CES7504 | 309.0 | 1652953199 |
| 3325 | 5219.70 | 3.58 | AE | KNE502 | 180.4 | 1652953199 |


In [None]:
{
    "time": 1652953199,
    "flights": [
        {
            "callsign": "SWR17A",
            "geo_altitude": 2500,
            "vertical_rate": -4,
            "country_cc": "JP",
            "avg_no_seats": 12
        },
        {
            "callsign": "SWR17A",
            "geo_altitude": 2500,
            "vertical_rate": -4,
            "country_cc": "JP",
            "avg_no_seats": 12
        }
    ]
}


In [None]:
# First attempt: one dict per timestamp, mapping callsign -> geo_altitude
(df_clean.groupby("time")
    .apply(lambda x: dict(zip(x.callsign, x.geo_altitude)))
    .reset_index()
    .rename(columns={0: "Flights"})
    .to_json(orient="records"))


In [90]:
cols = ["geo_altitude", "vertical_rate", 'country_cc', "callsign", "avg_no_seats"]


In [96]:
df_clean.groupby("time").apply(
    lambda x: x[cols].to_dict("records")).reset_index().rename(columns={
        0: "flights"
    }).to_json(orient="records")


'[{"time":1652953199,"Flights":[{"geo_altitude":11871.96,"vertical_rate":0.0,"country_cc":"SE","callsign":"SWR17A","avg_no_seats":164.0},{"geo_altitude":12245.34,"vertical_rate":0.33,"country_cc":"HR","callsign":"SWR185N","avg_no_seats":164.0},{"geo_altitude":10142.22,"vertical_rate":0.0,"country_cc":"AU","callsign":"JST969","avg_no_seats":180.4},{"geo_altitude":5082.54,"vertical_rate":-17.56,"country_cc":"AU","callsign":"JST526","avg_no_seats":180.4},{"geo_altitude":11353.8,"vertical_rate":-13.33,"country_cc":"JP","callsign":"ADO121","avg_no_seats":151.8},{"geo_altitude":4008.12,"vertical_rate":12.03,"country_cc":"MY","callsign":"AIQ356","avg_no_seats":180.4},{"geo_altitude":11483.34,"vertical_rate":0.0,"country_cc":"US","callsign":"UAL436","avg_no_seats":149.0},{"geo_altitude":1714.5,"vertical_rate":-2.6,"country_cc":"DE","callsign":"EDW403Y","avg_no_seats":144.0},{"geo_altitude":11849.1,"vertical_rate":-0.33,"country_cc":"FR","callsign":"SWR400N","avg_no_seats":144.0},{"geo_altitude
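The reshaping into one document per timestamp can also be expressed without pandas, which may make the target structure easier to follow. The following is a minimal sketch under that assumption: `group_flights_by_time` and `FLIGHT_FIELDS` are hypothetical names, and the sample rows are taken from the outputs shown in this notebook.

```python
from collections import defaultdict

FLIGHT_FIELDS = ["callsign", "geo_altitude", "vertical_rate",
                 "country_cc", "avg_no_seats"]


def group_flights_by_time(rows):
    """Turn flat flight rows into one document per timestamp."""
    grouped = defaultdict(list)
    for row in rows:
        # Keep only the per-flight fields; "time" moves to the parent document.
        grouped[row["time"]].append({field: row[field] for field in FLIGHT_FIELDS})
    return [{"time": time, "flights": flights}
            for time, flights in sorted(grouped.items())]


rows = [
    {"time": 1652953199, "callsign": "SWR17A", "geo_altitude": 11871.96,
     "vertical_rate": 0.0, "country_cc": "SE", "avg_no_seats": 164.0},
    {"time": 1652953199, "callsign": "SWR185N", "geo_altitude": 12245.34,
     "vertical_rate": 0.33, "country_cc": "HR", "avg_no_seats": 164.0},
    {"time": 1652951631, "callsign": "SWR17A", "geo_altitude": 11932.92,
     "vertical_rate": -0.33, "country_cc": "DE", "avg_no_seats": 164.0},
]
docs = group_flights_by_time(rows)
print(len(docs), [len(doc["flights"]) for doc in docs])
```

With a DataFrame, the input could come from `df_clean.to_dict("records")`, and the resulting list could be passed to `insert_many`.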

In [36]:
dict_test = df[cols].to_dict(orient='records')

In [39]:
dict_compl = {df.time.unique()[0]:dict_test}

In [43]:
dict_compl[1652951631]

[{'geo_altitude': 11932.92,
  'vertical_rate': -0.33,
  'country_cc': 'DE',
  'callsign': 'SWR17A',
  'avg_no_seats': 164.0},
 {'geo_altitude': nan,
  'vertical_rate': nan,
  'country_cc': 'CH',
  'callsign': 'SWR169A',
  'avg_no_seats': 180.0},
 {'geo_altitude': 1303.02,
  'vertical_rate': -5.2,
  'country_cc': 'DE',
  'callsign': 'SWR166C',
  'avg_no_seats': 180.0},
 {'geo_altitude': 11102.34,
  'vertical_rate': 0.0,
  'country_cc': 'AT',
  'callsign': 'SWR185N',
  'avg_no_seats': 164.0},
 {'geo_altitude': 6050.28,
  'vertical_rate': 9.75,
  'country_cc': 'AU',
  'callsign': 'JST969',
  'avg_no_seats': 180.4},
 {'geo_altitude': 12077.7,
  'vertical_rate': 0.0,
  'country_cc': 'AU',
  'callsign': 'JST526',
  'avg_no_seats': 180.4},
 {'geo_altitude': 11513.82,
  'vertical_rate': 0.0,
  'country_cc': 'US',
  'callsign': 'UAL436',
  'avg_no_seats': 149.0},
 {'geo_altitude': 9296.4,
  'vertical_rate': -7.8,
  'country_cc': 'AT',
  'callsign': 'EDW403Y',
  'avg_no_seats': 144.0},
 {'geo_al
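Some records contain `nan` (for example the aircraft on the ground above). When such a dict is inserted as is, the value ends up as a floating-point NaN in the document, whereas `None` is stored as `null`. Whether `null` is preferable is a modelling choice; the sketch below shows one way to convert, with `nan_to_none` being a hypothetical helper.

```python
import math


def nan_to_none(record):
    """Return a copy of record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


record = {"geo_altitude": float("nan"), "vertical_rate": float("nan"),
          "country_cc": "CH", "callsign": "SWR169A", "avg_no_seats": 180.0}
cleaned = nan_to_none(record)
print(cleaned)
```

Applied to the whole list, this would be `[nan_to_none(r) for r in dict_test]`.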

In [9]:
# Store csv as list of dictionaries (each row/flight will be one dict)
df_dict = df.to_dict(orient='records')

In [11]:
df.head()

| | icao24 | callsign | origin_country | long | lat | baro_altitude | on_ground | velocity | true_track | vertical_rate | ... | manufacturericao | model | icaoaircrafttype | model_no_stripped | avg_no_seats | coord | city_name | reg_admin1 | reg_admin2 | country_cc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4b1814 | SWR17A | Switzerland | 10.7348 | 52.1512 | 11590.02 | False | 230.25 | 15.15 | -0.33 | ... | AIRBUS | A320-271N | L2J | A320 | 164.0 | (52.1512, 10.7348) | Vahlberg | Lower Saxony | | DE |
| 1 | 4b1816 | SWR169A | Switzerland | 8.5592 | 47.4504 | | True | 0.32 | 5.62 | | ... | | | | | 180.0 | (47.4504, 8.5592) | Glattbrugg / Rohr/Platten-Balsberg | Zurich | Bezirk Buelach | CH |
| 2 | 4b1817 | SWR166C | Switzerland | 8.401 | 47.5808 | 1127.76 | False | 104.1 | 137.0 | -5.2 | ... | | | | | 180.0 | (47.5808, 8.401) | Hohentengen am Hochrhein | Baden-Wuerttemberg | Freiburg Region | DE |
| 3 | 4b1813 | SWR185N | Switzerland | 12.7821 | 46.7718 | 10683.24 | False | 228.92 | 116.57 | 0.0 | ... | AIRBUS | A320-271N | L2J | A320 | 164.0 | (46.7718, 12.7821) | Tristach | Tyrol | Politischer Bezirk Lienz | AT |
| 4 | 7c6b2b | JST969 | Australia | 152.6294 | -28.1269 | 5783.58 | False | 211.02 | 225.99 | 9.75 | ... | AIRBUS | A320 232 | L2J | A320 | 180.4 | (-28.1269, 152.6294) | Cedar Vale | Queensland | Logan | AU |


In [30]:
df[["vertical_rate", "callsign"]].to_dict(orient="records")

[{'vertical_rate': -0.33, 'callsign': 'SWR17A'},
 {'vertical_rate': nan, 'callsign': 'SWR169A'},
 {'vertical_rate': -5.2, 'callsign': 'SWR166C'},
 {'vertical_rate': 0.0, 'callsign': 'SWR185N'},
 {'vertical_rate': 9.75, 'callsign': 'JST969'},
 {'vertical_rate': 0.0, 'callsign': 'JST526'},
 {'vertical_rate': 0.0, 'callsign': 'UAL436'},
 {'vertical_rate': -7.8, 'callsign': 'EDW403Y'},
 {'vertical_rate': 2.6, 'callsign': 'SWR400N'},
 {'vertical_rate': nan, 'callsign': 'SWR1210'},
 {'vertical_rate': 8.78, 'callsign': 'SWR2277'},
 {'vertical_rate': -5.53, 'callsign': 'SWR37EZ'},
 {'vertical_rate': 0.0, 'callsign': 'JST527'},
 {'vertical_rate': -4.88, 'callsign': 'SWR33U'},
 {'vertical_rate': 0.0, 'callsign': 'SWR101G'},
 {'vertical_rate': 0.0, 'callsign': 'JST609'},
 {'vertical_rate': -5.85, 'callsign': 'TAP762K'},
 {'vertical_rate': 7.15, 'callsign': 'SAA332'},
 {'vertical_rate': 0.0, 'callsign': 'JST906'},
 {'vertical_rate': -8.78, 'callsign': 'JST889'},
 {'vertical_rate': -9.1, 'callsign'

In [84]:
# insert the list of dicts into MongoDB
db.travel_data.insert_many(df_dict)

<pymongo.results.InsertManyResult at 0x11ebaa0d0>

In between these steps I stopped and restarted the Docker container:
```
docker stop mongo_db
docker restart mongo_db
```
Now let's check if the entries are still there.

In [95]:
# Count number of documents in MongoDB collection travel_data
db.travel_data.count_documents({})

3327

In [98]:
# Check if number of documents is equal to number of rows from the dataframe
db.travel_data.count_documents({})==df.shape[0]

True

### Insert all csv files in the mongodb
Great :) Now let's insert all files from the `csv_data` folder into the MongoDB collection.

In [71]:
# first let's drop the collection 
db.travel_data.drop()

In [108]:
# Let's see how big the mongo folder is before including the documents
!du -hs data/mongo_db     

 35M	data/mongo_db


In [65]:
%%time
# Loop over all files in csv_data folder and insert them into the MongoDB
for file in all_files[100:]:
    db.travel_data.insert_many(pd.read_csv(file, index_col=0).to_dict(orient='records'))

KeyboardInterrupt: 
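One possible way to make a long-running import easier to control is to send the records in fixed-size batches. The sketch below is an assumption on my side, not what the notebook did: `chunked` and `insert_in_batches` are hypothetical helpers that work with any object offering `insert_many`, such as `db.travel_data`.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(collection, records, batch_size=1000):
    """Insert records via collection.insert_many in batches; return the count."""
    inserted = 0
    for batch in chunked(records, batch_size):
        # ordered=False lets the server attempt every document in the batch
        # instead of stopping at the first failing one.
        collection.insert_many(batch, ordered=False)
        inserted += len(batch)
    return inserted


# Hypothetical usage with the objects from this notebook:
# records = pd.read_csv(file, index_col=0).to_dict(orient="records")
# insert_in_batches(db.travel_data, records)
```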

In [66]:
db.travel_data.count_documents({})

KeyboardInterrupt: 

In [67]:
import pymongo
db.travel_data.create_index([("time", pymongo.DESCENDING),
                             ("geo_altitude", pymongo.DESCENDING),
                             ("vertical_rate", pymongo.DESCENDING)])


'time_-1_geo_altitude_-1_vertical_rate_-1'

Inserting 10 .csv files took me ~12 seconds, and our seneca folder already contains more than 16k files :P

In [None]:
fromm = 1652951631
to = 1654834794
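The two timestamps above look like the bounds of a time window. A range query on the indexed `time` field could be built as follows; this is a sketch, `time_range_filter` is a hypothetical helper, and the commented `find` call assumes the `db` object from earlier cells.

```python
def time_range_filter(start, end):
    """Build a MongoDB filter matching documents with start <= time <= end."""
    if start > end:
        raise ValueError("start must not be later than end")
    return {"time": {"$gte": start, "$lte": end}}


query = time_range_filter(1652951631, 1654834794)
print(query)
# With pymongo: db.travel_data.find(query).sort("time", -1)
```

Since `time` is the leading key of the compound index created above, such a filter should be able to use that index.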


In [69]:
db.travel_data.find_one(sort=[("time", -1)])

{'_id': ObjectId('62be8ed4cc870da1045a843b'),
 'icao24': '3c08c9',
 'callsign': 'AHO9764',
 'origin_country': 'Germany',
 'long': 6.6983,
 'lat': 38.0347,
 'baro_altitude': 13716.0,
 'on_ground': False,
 'velocity': 221.84,
 'true_track': 1.86,
 'vertical_rate': 0.33,
 'geo_altitude': 14249.4,
 'time': 1654834794.0,
 'callsign_carrier': 'AHO',
 'carrier_company': 'Air Hamburg',
 'carrier_type': 'P+',
 'manufacturericao': nan,
 'model': nan,
 'icaoaircrafttype': nan,
 'model_no_stripped': nan,
 'avg_no_seats': 180.0,
 'coord': '(38.0347, 6.6983)',
 'city_name': 'Hikone',
 'reg_admin1': 'Shiga Prefecture',
 'reg_admin2': nan,
 'country_cc': 'JP'}

In [110]:
# At roughly 1 second per file, inserting all ~16k csv files into MongoDB would take about 4.5 hours
16_000/60/60

4.444444444444445

In [111]:
# Let's see how big the mongo folder is after including the documents
!du -hs data/mongo_db  

 56M	data/mongo_db


So at least for this sample, the database grew by ~2 MB per .csv file (from 35 MB to 56 MB for 10 files).

In [112]:
# So if we include all .csv files (currently ~16k) we will need ~32 GB of disk space on our local machine
2*16_000/1000

32.0
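Both estimates are simple linear extrapolations, so they can be combined into one small function. This is a sketch using the rounded figures from the cells above (about 1 second and 2 MB per file); `estimate_full_import` is a hypothetical helper, and real throughput will vary with file size and indexes.

```python
def estimate_full_import(n_files, seconds_per_file, mb_per_file):
    """Extrapolate import time (hours) and disk usage (GB) linearly."""
    hours = n_files * seconds_per_file / 3600
    gigabytes = n_files * mb_per_file / 1000  # decimal GB, as in the cell above
    return hours, gigabytes


hours, gigabytes = estimate_full_import(16_000, seconds_per_file=1.0, mb_per_file=2.0)
print(f"{hours:.1f} h, {gigabytes:.0f} GB")  # prints "4.4 h, 32 GB"
```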