---
# Data Modeling, Importing, Indexing and Querying GPX Datasets

Date: 27-11-2019 <br>
Concept version: 0.9 <br>
Author: Pieter Lems  <br>

© Copyright 2019 Ministerie van Defensie


This notebook explains how to create data models for MongoDB using Python and MongoEngine.
It also shows how to import the data into MongoDB, how to index it, and how query execution differs before and after indexing.

## Contents of notebook

- Importing the required modules
- Reading the datasets
- Validating the datasets
- Connecting to the database
  - Create Docker MongoDB database (if needed)
  - Connect
- Creating the model
- Loading the data using the model
  - Creating the import functions
  - Load the data
- Querying the data (pre-indexing)
  - Indexing the data
  - Querying the data (post-indexing)
  
### The datasets used in this notebook can be found in the folder ("../Data/Trail_JSON/")
---

### Importing the required modules

In [1]:
import pandas as pd

from mongoengine import * 

from datetime import datetime

### Reading the datasets

In [2]:
Biesbosch = pd.read_json(
    '../Data/Trail_JSON/Trail_Biesbosch.json')

Biesbosch_Lib = pd.read_json(
    '../Data/Trail_JSON/Trail-Biesbosch-Libellen.json')

Zeeland = pd.read_json(
    '../Data/Trail_JSON/Trail_ZeelandMNV.json')

Hamert_Hike = pd.read_json(
    '../Data/Trail_JSON/Trail-Hamert-Hike.json')

Hamert_Bike = pd.read_json(
    '../Data/Trail_JSON/Trail-Hamert-Bike.json')
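
The contents list mentions validating the datasets, so a minimal sketch of such a check is given here. It assumes each row can be handed over as a plain dict (for example via `df.to_dict("records")`) and that the required keys are the `time`, `lat`, `lon` and `alt` columns read by `load_data` later in this notebook. The name `validate_records` is hypothetical and not part of the original notebook.

```python
import math

# Columns that load_data reads from every row later in this notebook.
REQUIRED_KEYS = ("time", "lat", "lon", "alt")


def _is_missing(value):
    """True for None and for the NaN floats pandas uses for empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def validate_records(records, required=REQUIRED_KEYS):
    """Return a list of (row index, problem) tuples; empty means no problems found."""
    problems = []
    for i, rec in enumerate(records):
        missing = [key for key in required if _is_missing(rec.get(key))]
        if missing:
            problems.append((i, "missing " + ", ".join(missing)))
            continue
        # Coordinates must be valid longitudes and latitudes in degrees.
        if not -180 <= rec["lon"] <= 180:
            problems.append((i, "longitude out of range"))
        if not -90 <= rec["lat"] <= 90:
            problems.append((i, "latitude out of range"))
    return problems
```

With a dataframe loaded as above, a call such as `validate_records(Biesbosch.to_dict("records"))` would be expected to return an empty list when every trackpoint is complete and within range.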

### Connecting to the database

#### Create Docker container
Uncomment the next line if you don't have a MongoDB Docker container
and you want to import the data into one.

This command downloads the MongoDB Docker image and runs the container on port 27017 (localhost:27017).

In [3]:
#!docker run -d -p 27017:27017 mongo:latest

#### Connect to a database called: "Trail_Database"

In [4]:
connect('Trail_Database')

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

### Creating the model

In [12]:
class Trail(Document):
    
    # Name of the Trail 
    name = StringField()

    # Abbreviation of the name
    abr = StringField()
    
    # Start date
    s_date = DateTimeField()
    
    # End date
    e_date = DateTimeField()
    
    # Trail type (Biking,Hiking,Driving)
    r_type = StringField() 

    # Amount of trackpoints in the dataset
    t_points = IntField()

class Geometry(EmbeddedDocument):
 
    # coordinates of signal (coord=[1,2])
    coord = PointField()
    
    # altitude of the signal
    alt = FloatField()
    
class Signal(Document):
    
    # Timestamp of signal
    time = DateTimeField()
    
    # Geometry of signal
    geometry = EmbeddedDocumentField(Geometry)
    
    # Reference to the trail the signal belongs to
    trail = ReferenceField(Trail)
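
To make the model more concrete, the sketch below builds a plain dict with roughly the layout one stored Signal document has. It relies on MongoEngine storing a PointField as a GeoJSON Point with `[longitude, latitude]` coordinates, which is why the coordinates are addressed as `geometry.coord.coordinates`. The helper name `signal_as_document` and the sample coordinates are illustrative assumptions. In the database the trail reference is an ObjectId rather than a string, and an `_id` field is added automatically.

```python
from datetime import datetime


def signal_as_document(time, lon, lat, alt, trail_id):
    """Build a plain dict that mirrors the layout of one Signal document."""
    return {
        "time": time,
        "geometry": {
            # A PointField is stored as a GeoJSON Point: [longitude, latitude].
            "coord": {"type": "Point", "coordinates": [lon, lat]},
            "alt": alt,
        },
        # The ReferenceField holds the _id of the referenced Trail document.
        "trail": trail_id,
    }


# Illustrative values only; the trail ID is the one printed later in this notebook.
doc = signal_as_document(datetime(2019, 11, 27, 10, 0), 4.77, 51.75, 1.2,
                         "5e8b12e35ac6a6c5a797895b")
print(doc["geometry"]["coord"])
```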

---

### Loading the data using the model

In [6]:
def load_data(df,name,abreviation,type):
    
    # Get the value of the time column of the first row in the dataframe.
    # The timestamps are in milliseconds since the epoch, so we divide by 1000
    # to get the seconds that datetime.fromtimestamp expects.
    s_date = datetime.fromtimestamp((df.at[0,'time']/1000))
    
    # Get the value of the time column of the last row in the dataframe,
    # which is located at index len(df.index)-1.
    e_date = datetime.fromtimestamp((df.at[len(df.index)-1,'time']/1000))
    
    # Get the number of rows in the dataframe, which equals
    # the number of signals in the dataset.
    t_points = df.shape[0]
    

    # Create the trail document by creating a new instance of the Trail document,
    # passing in the required values.
    trail = Trail(name = name,
                  s_date = s_date,
                  e_date = e_date,
                  abr = abreviation,
                  r_type = type,
                  t_points = t_points)
    
    # Save the Trail document to the database.
    trail.save()

    # Create an empty list of signals to which we will append all the signal
    # documents after they have been created. We will pass the list to the
    # mongodb bulk insert feature.
    signals = []
    
    
    # Iterate through all the rows in the dataframe.
    # For each row the following code is executed.
    for index,row in df.iterrows():
        
        # Convert the millisecond timestamp to a datetime (fromtimestamp expects seconds).
        time = datetime.fromtimestamp(row['time']/1000)
        
        # Here we create the geometry document in which we pass the required values.
        geometry = Geometry(coord = [row['lon'],row['lat']], 
                            alt = row['alt'])
        # Here we create a signal document in which we pass the required values.
        signal = Signal(time = time,
                        geometry = geometry,
                        trail = trail)
    
        # Here we append the created document to the signals list.
        signals.append(signal)

    # Bulk insert the populated signals list into the database.
    Signal.objects.insert(signals,load_bulk=True)

    # Print a confirmation once the insert has succeeded.
    print("Inserted " + str(len(df.index))+" trackpoints from dataset: " + str(name))
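
The import function divides the `time` values by 1000 because they are epoch milliseconds, while `datetime.fromtimestamp` expects seconds. Called without a timezone, `fromtimestamp` returns a naive datetime in the local timezone of the machine, so the stored values depend on where the notebook runs. The sketch below shows a timezone-explicit variant; the helper name `ms_to_datetime` is an assumption and is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # fromtimestamp expects seconds, so scale the millisecond value down.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


print(ms_to_datetime(1574850000000))
```

Passing an explicit timezone makes the conversion reproducible across machines, which matters when the time field is later used in range queries.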

#### Load the data

In [7]:
load_data(Biesbosch,'Biesbosch','B','Boat & Hike')
load_data(Zeeland,"Zeeland Camper",'ZC',"Car")
load_data(Biesbosch_Lib,"Biesbosch Libellen",'BL',"Hike")
load_data(Hamert_Hike,"Hamert Hike",'HH',"Hike")
load_data(Hamert_Bike,"Hamert Bike",'HB',"Bike")

Inserted 739 trackpoints from dataset: Biesbosch
Inserted 1174 trackpoints from dataset: Zeeland Camper
Inserted 493 trackpoints from dataset: Biesbosch Libellen
Inserted 1483 trackpoints from dataset: Hamert Hike
Inserted 422 trackpoints from dataset: Hamert Bike


---
### Querying the data pre-index

First we run a couple of queries before creating any indexes on the database. This lets us compare how long it takes to return the same data with and without indexes. To see how a query was executed, append .explain() to the query.
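
The dictionaries returned by .explain() are long, so it can help to pull out only the fields used for the comparison. The following is a minimal sketch that assumes the result layout printed in this notebook (`queryPlanner.winningPlan` plus `executionStats`); other server versions may structure the output differently. The function name `summarize_explain` is hypothetical, and the sample dict is a trimmed copy of the output shown further down.

```python
def summarize_explain(plan):
    """Pull the headline numbers out of an explain() result dict."""
    stage = plan["queryPlanner"]["winningPlan"]
    stages = [stage["stage"]]
    # Walk down nested input stages, e.g. FETCH -> IXSCAN.
    while "inputStage" in stage:
        stage = stage["inputStage"]
        stages.append(stage["stage"])
    stats = plan.get("executionStats", {})
    return {
        "stages": stages,
        "nReturned": stats.get("nReturned"),
        "executionTimeMillis": stats.get("executionTimeMillis"),
        "totalKeysExamined": stats.get("totalKeysExamined"),
        "totalDocsExamined": stats.get("totalDocsExamined"),
    }


# Trimmed copy of the pre-index output shown in this notebook.
sample = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN", "direction": "forward"}},
    "executionStats": {"nReturned": 0, "executionTimeMillis": 4,
                       "totalKeysExamined": 0, "totalDocsExamined": 4311},
}
print(summarize_explain(sample))
```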

#### Query to find ID of the trail : Biesbosch

In [8]:
Trail.objects(name = 'Biesbosch').only('name','id').to_json()

'[{"_id": {"$oid": "5e8b12e35ac6a6c5a797895b"}, "name": "Biesbosch"}]'
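
Rather than copying the trail ID by hand into the next query, it can be read from the JSON string that `to_json()` returns. The sketch below uses only the standard library and the output shown above; the helper name `first_oid` is an assumption.

```python
import json


def first_oid(queryset_json):
    """Return the ObjectId string of the first document in a to_json() result, or None."""
    docs = json.loads(queryset_json)
    if not docs:
        return None
    return docs[0]["_id"]["$oid"]


# The JSON string printed by the query above.
result = '[{"_id": {"$oid": "5e8b12e35ac6a6c5a797895b"}, "name": "Biesbosch"}]'
trail_id = first_oid(result)
print(trail_id)
```

The resulting string can then be passed as the `trail` argument of `Signal.objects(...)`, which avoids querying with an ID that belongs to an earlier import.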

#### Query to return all signals related to Trail: Biesbosch

In [9]:
Signal.objects(trail='5e1db4c3a602d099584a91cb').explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'Trail_Database.signal',
  'indexFilterSet': False,
  'parsedQuery': {'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
  'winningPlan': {'stage': 'COLLSCAN',
   'filter': {'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
   'direction': 'forward'},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 0,
  'executionTimeMillis': 4,
  'totalKeysExamined': 0,
  'totalDocsExamined': 4311,
  'executionStages': {'stage': 'COLLSCAN',
   'filter': {'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
   'nReturned': 0,
   'executionTimeMillisEstimate': 0,
   'works': 4313,
   'advanced': 0,
   'needTime': 4312,
   'needYield': 0,
   'saveState': 33,
   'restoreState': 33,
   'isEOF': 1,
   'invalidates': 0,
   'direction': 'forward',
   'docsExamined': 4311},
  'allPlansExecution': []},
 'serverInfo': {'host': 'geostack-system',
  'port': 27017,
  'version': '3.6.3',
  'gitVersion': '9586e

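##### It took 15 milliseconds to return 739 results using a COLLSCAN (collection scan)

Note that the captured output above reports an nReturned of 0 because the hard-coded trail ID differs from the ID returned by the previous query. Substitute the ID from the previous query to reproduce the 739 results.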

#### Query to return all signals related to Trail: Biesbosch, between 2008-08-26 and 2009-09-27

In [10]:
Signal.objects(Q(trail='5e1db4c3a602d099584a91cb')&
                     Q(time__gte=datetime(2008,8,26)) &
                     Q(time__lte=datetime(2009,9,27))).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'Trail_Database.signal',
  'indexFilterSet': False,
  'parsedQuery': {'$and': [{'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
    {'time': {'$lte': datetime.datetime(2009, 9, 27, 0, 0)}},
    {'time': {'$gte': datetime.datetime(2008, 8, 26, 0, 0)}}]},
  'winningPlan': {'stage': 'COLLSCAN',
   'filter': {'$and': [{'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
     {'time': {'$lte': datetime.datetime(2009, 9, 27, 0, 0)}},
     {'time': {'$gte': datetime.datetime(2008, 8, 26, 0, 0)}}]},
   'direction': 'forward'},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 0,
  'executionTimeMillis': 2,
  'totalKeysExamined': 0,
  'totalDocsExamined': 4311,
  'executionStages': {'stage': 'COLLSCAN',
   'filter': {'$and': [{'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
     {'time': {'$lte': datetime.datetime(2009, 9, 27, 0, 0)}},
     {'time': {'$gte': datetime.datetime(2008, 8,

##### It took 16 milliseconds to return 739 results using a COLLSCAN (collection scan)

### Indexing the database

##### There are 3 ways to create indexes on the data.
- Create an index when modeling the data.<br>
To create an index while creating the data model, we add a meta field to the document we want to index. Embedded documents have no collection of their own, so an index on an embedded field is declared on the parent document using dot notation. For example, to create an index on the altitude field of the embedded geometry document, we add the following meta field to the Signal document:

In [11]:
class Signal(Document):

    time = DateTimeField()
    geometry = EmbeddedDocumentField(Geometry)
    trail = ReferenceField(Trail)

    meta = {
        'indexes': [
          {'fields': ['geometry.alt']}
        ]
    }

 - Create indexes after modeling the data <br>
We can also create the indexes after the data model has been created. This is the approach used below. For example, to create an index on the altitude field after creating the data model, we would run the following command: <br>
Signal.create_index("geometry.alt")


  
 - Create indexes using the mongo shell or PyMongo <br>
For example, to add a 2d index to the coordinates field from the mongo shell: db.signal.createIndex({"geometry.coord.coordinates":"2d"});

##### We want to create 4 indexes:

   - 2D Sphere index: used to query the coordinates of the trackpoint. (This was created automatically when we assigned PointField() to the coordinates entry in the database model.)
   - 2D index: needed to find coordinates within a certain box.
   - Time index: needed because we will query on the time field frequently.
   - Trail index (in the signal collection): needed because we will look up the signals of a trail by its trail reference.


#### Create an index on the trail reference field in the signal collection

In [13]:
Signal.create_index(("trail"))

'trail_1'

####  Create an index on the time field in the signal collection

In [14]:
Signal.create_index(("time"))

'time_1'

#### Create an index on the coordinates field in the signal collection

In [15]:
Signal.create_index(("geometry.coord"))

'geometry.coord_1'

---
### Querying the data post-index

#### Query to return all signals related to Trail: Biesbosch

In [16]:
Signal.objects(trail='5e1db4c3a602d099584a91cb').explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'Trail_Database.signal',
  'indexFilterSet': False,
  'parsedQuery': {'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
  'winningPlan': {'stage': 'FETCH',
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'trail': 1},
    'indexName': 'trail_1',
    'isMultiKey': False,
    'multiKeyPaths': {'trail': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'trail': ["[ObjectId('5e1db4c3a602d099584a91cb'), ObjectId('5e1db4c3a602d099584a91cb')]"]}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 0,
  'executionTimeMillis': 0,
  'totalKeysExamined': 0,
  'totalDocsExamined': 0,
  'executionStages': {'stage': 'FETCH',
   'nReturned': 0,
   'executionTimeMillisEstimate': 0,
   'works': 1,
   'advanced': 0,
   'needTime': 0,
   'needYield': 0,
   'saveState': 0,
   'restoreState': 0,
   'isEOF': 1,


##### It took 0 milliseconds to return 739 results using an IXSCAN (index scan)
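
The difference between the two plans can be illustrated with a toy in-memory analogy. This is not how MongoDB implements its indexes; it is only a sketch in which a sorted key list stands in for the index on the trail field, and the trail abbreviations stand in for the ObjectId references. The document counts per trail are the ones printed during the import.

```python
from bisect import bisect_left, bisect_right


def collection_scan(docs, trail_id):
    """Examine every document, as a COLLSCAN does; return (matches, docs examined)."""
    matches = [d for d in docs if d["trail"] == trail_id]
    return matches, len(docs)


def build_index(docs):
    """Return sorted keys and their document positions, a stand-in for an index."""
    pairs = sorted((d["trail"], pos) for pos, d in enumerate(docs))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions


def index_scan(docs, index, trail_id):
    """Binary-search the keys and fetch only matching documents (IXSCAN + FETCH)."""
    keys, positions = index
    lo = bisect_left(keys, trail_id)
    hi = bisect_right(keys, trail_id)
    return [docs[pos] for pos in positions[lo:hi]], hi - lo


# Trackpoint counts per trail as reported by the import step.
counts = {"B": 739, "ZC": 1174, "BL": 493, "HH": 1483, "HB": 422}
docs = [{"trail": abr, "n": i} for abr, n in counts.items() for i in range(n)]

scan_matches, scanned = collection_scan(docs, "B")
index_matches, examined = index_scan(docs, build_index(docs), "B")
print(len(scan_matches), scanned, len(index_matches), examined)
```

Both approaches return the same documents, but the scan touches all 4311 entries while the indexed lookup touches only the matching ones, which mirrors the totalDocsExamined values in the explain output.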

#### Query to return all signals related to Trail: Biesbosch, between 2008-08-26 and 2009-09-27

In [17]:
Signal.objects(Q(trail='5e1db4c3a602d099584a91cb')&
                     Q(time__gte=datetime(2008,8,26)) &
                     Q(time__lte=datetime(2009,9,27))).explain()

{'queryPlanner': {'plannerVersion': 1,
  'namespace': 'Trail_Database.signal',
  'indexFilterSet': False,
  'parsedQuery': {'$and': [{'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}},
    {'time': {'$lte': datetime.datetime(2009, 9, 27, 0, 0)}},
    {'time': {'$gte': datetime.datetime(2008, 8, 26, 0, 0)}}]},
  'winningPlan': {'stage': 'FETCH',
   'filter': {'$and': [{'time': {'$lte': datetime.datetime(2009, 9, 27, 0, 0)}},
     {'time': {'$gte': datetime.datetime(2008, 8, 26, 0, 0)}}]},
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'trail': 1},
    'indexName': 'trail_1',
    'isMultiKey': False,
    'multiKeyPaths': {'trail': []},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'trail': ["[ObjectId('5e1db4c3a602d099584a91cb'), ObjectId('5e1db4c3a602d099584a91cb')]"]}}},
  'rejectedPlans': [{'stage': 'FETCH',
    'filter': {'trail': {'$eq': ObjectId('5e1db4c3a602d099584a91cb')}

##### It took 0 milliseconds to return 739 results using an IXSCAN (index scan)

# Some GeoQueries

Select all signals within a certain polygon.<br>
Use https://www.keene.edu/campus/maps/tool/ to find the desired polygon.<br>
Parameters (each a [longitude, latitude] pair):
- point 1
- point 2
- point 3
- point 4

In [18]:
import json

def select_signals_in_polygon(p1,p2,p3,p4):
    # Close the ring by repeating the first point, as GeoJSON polygons require.
    signals_in_polygon = Signal.objects(geometry__coord__geo_within=[[p1,p2,p3,p4,p1]]).to_json()
    # json.loads parses the result safely; eval would execute it as Python code.
    return pd.DataFrame(json.loads(signals_in_polygon))
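
A polygon used in a geo query must be a closed ring, meaning the first and last coordinate pairs are identical. The sketch below shows one way to build such a ring from any number of corner points before handing it to a query. The helper names `closed_ring` and `polygon` and the sample coordinates are illustrative assumptions.

```python
def closed_ring(points):
    """Return the points as a GeoJSON linear ring, repeating the first point at the end."""
    ring = [list(p) for p in points]
    if len(ring) < 3:
        raise ValueError("a polygon needs at least three points")
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return ring


def polygon(points):
    """Wrap a single outer ring in a GeoJSON Polygon object."""
    return {"type": "Polygon", "coordinates": [closed_ring(points)]}


# Made-up [longitude, latitude] corners for illustration.
corners = [[4.70, 51.70], [4.90, 51.70], [4.90, 51.80], [4.70, 51.80]]
print(polygon(corners))
```

Each point is given as [longitude, latitude], the same order used for the coord field when the signals were imported.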

Select all signals near a certain point.<br>
Parameters:
- longitude of point
- latitude of point
- distance around point (in meters)

In [19]:
def signals_near_point(lon,lat,distance):
    
    signals_near_point_json = Signal.objects(geometry__coord__near=[lon, lat],
                                             geometry__coord__max_distance=distance).to_json()
    return pd.read_json(signals_near_point_json)
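
To sanity-check the results of a distance query on the client side, the great-circle distance between two points can be approximated with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6371008.8 meters; it is not claimed to reproduce MongoDB's own calculation exactly. The function name `haversine_m` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in meters between two lon/lat points."""
    dlon = radians(lon2 - lon1)
    dlat = radians(lat2 - lat1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# One degree of latitude is roughly 111 km.
print(round(haversine_m(0.0, 0.0, 0.0, 1.0)))
```

Every signal returned for a given point and distance should lie within roughly that many meters of the point according to this function.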