---
# Data modeling Employee

Date: 02-11-2019 <br>
Concept version: 0.9 <br>
Author: Pieter Lems  <br>
Lecturer: J. Kroon<br>

© Copyright 2019 Ministerie van Defensie

This notebook provides information and scripts for creating data models for MongoDB.<br>
To create the data models we are going to use Python. 


The data sets used in this notebook can be found in the folder "Data/".

## Contents of notebook
- Importing and exploring the data sets
    - Show data sets
    - Explore data information
    - Clean data set
- Creating the model
    - Referenced
        - Defining the Employee document
        - Defining the Address document
        - Defining the Payment document
    - Embedded
        - Defining the Employee document
        - Defining the Address document
        - Defining the Payment document
    - Optimize models
        - Rename variables
        - Embedding and referencing where needed
- Loading the data
- Querying the data


## Discussed topics
- Handle null values
- Handle different column names
- Handle different
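
For the null-value topic, a minimal sketch of one possible approach: drop missing values from a record before it is turned into a document, so that optional fields are simply absent in MongoDB. The helpers `is_missing` and `drop_nulls`, the set of values treated as missing, and the example row are assumptions for illustration; the rows shown in this notebook contain no missing values.

```python
import math

# Strings treated as "missing" in this sketch (an assumption, adjust as needed).
_MISSING_STRINGS = {"", "null", "none", "nan"}


def is_missing(value):
    """Return True for None, NaN floats, and empty or placeholder strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip().lower() in _MISSING_STRINGS:
        return True
    return False


def drop_nulls(record):
    """Return a copy of `record` without the keys whose value is missing."""
    return {key: value for key, value in record.items() if not is_missing(value)}


# Hypothetical row in which the previous job is unknown.
row = {"employee_id": 4, "first_name": "John", "last_name": "Doe",
       "previous_job": float("nan")}
clean_row = drop_nulls(row)
print(clean_row)
```

Whether a missing value should be dropped, replaced by a default, or rejected depends on whether the model marks the field as `required`.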

## Sources
https://leportella.com/english/2018/08/23/mongo-db-python-and-mongoengine.html <br>
https://realpython.com/introduction-to-mongodb-and-python/ <br>
https://www.techrunnr.com/how-to-install-mongodb-compass/ <br>

For info related to MongoDB (crash course)<br>
https://www.youtube.com/watch?v=PIWVFUtBV1Q


For finding the coordinates of a polygon<br>
https://www.keene.edu/campus/maps/tool/


Info about geospatial queries<br>
https://docs.mongodb.com/manual/geospatial-queries/#geospatial-indexes

HAE -> MSL <br>
https://www.unavco.org/software/geodetic-utilities/geoid-height-calculator/geoid-height-calculator.html


Embedded or referenced<br>
https://stackoverflow.com/questions/5373198/mongodb-relationships-embed-or-reference

openmymind.net/Multiple-Collections-Versus-Embedded-Documents


## Tools 
- MongoDB Docker container <br>
docker run -d -p 27017:27017 mongo:latest
- MongoDB Compass <br>
wget https://downloads.mongodb.com/compass/mongodb-compass_1.15.1_amd64.deb <br>
sudo dpkg -i mongodb-compass_1.15.1_amd64.deb

- Mongoengine <br>
Connect to a database:

connect(alias='Referenced',db='Employee_Database_Referenced')

Disconnect from a database:

disconnect(alias='Referenced')

Store a model's data in a specific database (set in the model):

meta = {'db_alias': 'Referenced'}

## Use-case

We are developers at the personnel department of the Dutch Ministry of Defense. We overheard that there is a problem with the fragmented data of military personnel: their information is spread over 3 different data stores.

1) The first data store contains their personal information.
2) The second data store contains their address information.
3) The third data store contains their salary / payment information.

Because of this, it's hard to retrieve data quickly and efficiently. That's where we as developers / data scientists come in.

Our goal is to store the data in one centralized database that can be queried quickly and efficiently. The extraction and transformation phases of the ETL process are already completed.  
The data sets can be found in the folder: “Data/CSV/”

---

---
# Importing and exploring the data sets
---

In [1]:
# Import the pandas library
import pandas as pd

---
## Showing data sets
---

In [2]:
# Read employee data in dataframe
Employees_df = pd.read_csv("Data/CSV/Employee_data.csv")

# Display dataframe
Employees_df

Unnamed: 0,Employee-Id,First-Name,Last-Name,Previous-Job
0,1,Pieter,Lems,Sales Person
1,2,Jawed,Balkhi,Dentist
2,3,Joost,Bakker,Lawyer
3,4,John,Doe,Factory worker


In [3]:
# Read address data in dataframe
Addresses_df = pd.read_csv("Data/CSV/Address_data.csv")

# Display dataframe
Addresses_df

Unnamed: 0,address-Id,house-number,city
0,1,10,Rotterdam
1,2,29,Utrecht
2,3,119,Drente
3,4,1232,Amesfoort


In [4]:
# Read payment data in dataframe
Payments_df = pd.read_csv("Data/CSV/Payment_data.csv")

# Display dataframe
Payments_df

Unnamed: 0,Payment_Id,Payment_Amount,Payment_Date
0,1,1000,2016-01-18 00:00:00.000
1,2,2900,2017-06-12 00:00:00.000
2,3,11900,2018-12-01 00:00:00.000
3,4,31500,2018-12-01 00:00:00.000
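
The three data sets have no explicit foreign key. A minimal sketch of combining them, under the assumption (not stated in the data) that the employee, address and payment with the same id belong together. The CSV text is copied from the first rows of the outputs above without the unnamed index column, and `read_rows` and `by_id` are hypothetical helpers built on the standard `csv` module.

```python
import csv
import io

EMPLOYEES_CSV = """Employee-Id,First-Name,Last-Name,Previous-Job
1,Pieter,Lems,Sales Person
2,Jawed,Balkhi,Dentist
"""
ADDRESSES_CSV = """address-Id,house-number,city
1,10,Rotterdam
2,29,Utrecht
"""
PAYMENTS_CSV = """Payment_Id,Payment_Amount,Payment_Date
1,1000,2016-01-18 00:00:00.000
2,2900,2017-06-12 00:00:00.000
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def by_id(rows, id_column):
    """Index rows by the integer value of their id column."""
    return {int(row[id_column]): row for row in rows}


employees = by_id(read_rows(EMPLOYEES_CSV), "Employee-Id")
addresses = by_id(read_rows(ADDRESSES_CSV), "address-Id")
payments = by_id(read_rows(PAYMENTS_CSV), "Payment_Id")

# Assumption: equal ids across the three data sets describe the same person.
combined = [
    {"employee": employees[i], "address": addresses.get(i), "payment": payments.get(i)}
    for i in sorted(employees)
]
print(combined[0]["employee"]["First-Name"], combined[0]["address"]["city"])
```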


## Exploring data information

As you can see in the table below, the data sets all have 3 attributes in common: employee_id, first_name and last_name.

| Employee 1 | Employee 2 |
| --- | --- |
| employee_id | employeeid |
| first_name | firstname |
| last_name | lastname | 
| age | previousjob |
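
The headers in the CSV files (`Employee-Id`, `house-number`, `Payment_Amount`) follow different conventions, while the model fields later in this notebook use lower-case names with underscores. A minimal sketch of normalizing headers to that style; `to_snake_case` is a hypothetical helper, not part of pandas or mongoengine.

```python
import re


def to_snake_case(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


headers = ["Employee-Id", "First-Name", "house-number", "Payment_Amount", "city"]
normalized = [to_snake_case(header) for header in headers]
print(normalized)
```

With pandas, the same function can be applied to a whole frame, for example `Employees_df.rename(columns=to_snake_case)`.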

---
# Creating the model
---

In [5]:
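from datetime import datetime
from mongoengine import Document, EmbeddedDocument, connect
from mongoengine import (IntField, FloatField, StringField, DateTimeField,
                         ReferenceField, EmbeddedDocumentField)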

---
## Referenced model
---

In [6]:
connect(alias='Referenced',db='Employee_Database_Referenced')

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

In [7]:
# Note: max_length only applies to string fields,
# so it is not set on the IntField and FloatField fields.
class Employee(Document):

    employee_id = IntField(required=True)

    first_name = StringField(required=True, max_length=20)

    last_name = StringField(required=True, max_length=20)

    # Add this line to save the model to the Referenced db
    meta = {'db_alias': 'Referenced'}

class Address(Document):

    address_id = IntField(required=True)

    house_number = IntField(required=True)

    city = StringField(required=True, max_length=20)

    # Reference to the Employee document
    employee = ReferenceField(Employee)

    # Add this line to save the model to the Referenced db
    meta = {'db_alias': 'Referenced'}

class Payment(Document):

    payment_id = IntField(required=True)

    payment_amount = FloatField(required=True)

    payment_date = DateTimeField(default=datetime.now)

    # Reference to the Employee document
    employee = ReferenceField(Employee)

    # Add this line to save the model to the Referenced db
    meta = {'db_alias': 'Referenced'}
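
As a rough illustration of what the referenced model stores, the sketch below uses plain dictionaries in place of MongoDB collections. The keys `e1`, `a1` and `p1` and the helper `documents_for` are made up for this example. It shows that reading an employee together with the address and payments takes one lookup per collection.

```python
# Stand-ins for three collections; every document lives on its own.
employees = {
    "e1": {"employee_id": 1, "first_name": "Pieter", "last_name": "Lems"},
}
addresses = {
    "a1": {"address_id": 1, "house_number": 10, "city": "Rotterdam", "employee": "e1"},
}
payments = {
    "p1": {"payment_id": 1, "payment_amount": 1000.0, "employee": "e1"},
}


def documents_for(collection, employee_key):
    """Return all documents in `collection` that reference `employee_key`."""
    return [doc for doc in collection.values() if doc["employee"] == employee_key]


# Two extra lookups are needed to assemble the full picture of employee "e1".
employee_addresses = documents_for(addresses, "e1")
employee_payments = documents_for(payments, "e1")
print(employee_addresses[0]["city"], len(employee_payments))
```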

---
## Embedded model
---

In [8]:
connect('Employee_Database_Embedded',alias='Embedded')

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

In [9]:
# Embedded documents are stored inside their parent document,
# so only Employee needs a db_alias.
class Payment(EmbeddedDocument):

    payment_id = IntField(required=True)

    payment_amount = FloatField(required=True)

    payment_date = DateTimeField(default=datetime.now)

class Address(EmbeddedDocument):

    address_id = IntField(required=True)

    house_number = IntField(required=True)

    city = StringField(required=True, max_length=20)

class Employee(Document):

    employee_id = IntField(required=True)

    first_name = StringField(required=True, max_length=20)

    last_name = StringField(required=True, max_length=20)

    # The embedded documents (one payment and one address per employee)
    payments = EmbeddedDocumentField(Payment)

    adresses = EmbeddedDocumentField(Address)

    # Add this line to save the model to the Embedded db
    meta = {'db_alias': 'Embedded'}
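
For comparison, a sketch of the shape the embedded model produces, with a plain dictionary standing in for a MongoDB document. The field names follow the model above (including `adresses`); the values are illustrative and based on the first rows of the data sets.

```python
from datetime import datetime

# One document holds the employee together with the payment and the address.
employee_document = {
    "employee_id": 1,
    "first_name": "Pieter",
    "last_name": "Lems",
    "payments": {
        "payment_id": 1,
        "payment_amount": 1000.0,
        "payment_date": datetime(2016, 1, 18),
    },
    "adresses": {"address_id": 1, "house_number": 10, "city": "Rotterdam"},
}

# A single read returns everything; nested values are reached by walking the document.
city = employee_document["adresses"]["city"]
amount = employee_document["payments"]["payment_amount"]
print(city, amount)
```

Because `EmbeddedDocumentField` holds a single sub-document, each employee has exactly one payment and one address in this model; mongoengine offers `EmbeddedDocumentListField` for a list of sub-documents.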


---
## Optimized model

We can combine referenced and embedded documents in one model.

Referenced: Payment, since an Employee document with embedded payments would keep growing over time.

Embedded: Address, which is stored inside the Employee document.
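
A rough sketch of this split, using plain dictionaries: the address is stored inside the employee, while payments accumulate in a collection of their own and point back to the employee. MongoDB limits a single document to 16 MB, which is one reason to keep ever-growing lists out of a parent document. The `_id` value, the helper `add_payment`, and the list standing in for the payment collection are assumptions for illustration.

```python
employee = {
    "_id": "e1",  # made-up identifier for this sketch
    "employee_id": 1,
    "first_name": "Pieter",
    "last_name": "Lems",
    "adresses": {"address_id": 1, "house_number": 10, "city": "Rotterdam"},
}
payment_collection = []  # stand-in for a separate MongoDB collection


def add_payment(collection, employee_doc, payment_id, amount):
    """Append a payment that references the employee; the employee is not modified."""
    payment = {
        "payment_id": payment_id,
        "payment_amount": float(amount),
        "employee": employee_doc["_id"],
    }
    collection.append(payment)
    return payment


add_payment(payment_collection, employee, 1, 1000)
add_payment(payment_collection, employee, 2, 2900)

# Summing the payments of one employee means filtering the payment collection.
total = sum(p["payment_amount"] for p in payment_collection
            if p["employee"] == employee["_id"])
print(total)
```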

---

In [10]:
connect('Employee_Database_Optimized',alias='Optimized')

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

In [19]:
# Define the Address document.
# This document will be embedded in the Employee document,
# so it has no collection (and no db_alias) of its own.
class Address(EmbeddedDocument):

    # The ID of the address, of type Int
    address_id = IntField(required=True)

    # The house number, of type Int
    house_number = IntField(required=True)

    # The city, of type String
    city = StringField(required=True, max_length=20)

# Define the Employee document
class Employee(Document):

    # The ID of the employee, of type Int
    employee_id = IntField(required=True)

    # The first name of the employee, of type String
    first_name = StringField(required=True, max_length=20)

    # The last name of the employee, of type String
    last_name = StringField(required=True, max_length=20)

    # The embedded address document
    adresses = EmbeddedDocumentField(Address)

    # Since we want to add the data to a specific database,
    # we define that database (alias) in the metadata
    meta = {'db_alias': 'Optimized'}

# Define the Payment document.
# It contains a reference to the Employee document it belongs to,
# so Employee has to be defined first.
class Payment(Document):

    # The ID of the payment
    payment_id = IntField(required=True)

    # The amount of the payment
    payment_amount = FloatField(required=True)

    # The date the payment was made (datetime)
    payment_date = DateTimeField(default=datetime.now)

    # Reference to the employee the payment belongs to
    employee = ReferenceField(Employee)

    meta = {'db_alias': 'Optimized'}

---
# Loading the data

Test data referenced 

---

In [None]:
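# Field values are passed as keyword arguments,
# so they do not depend on the order of the fields.
def add_test_data_employee_1():
    employee1 = Employee(employee_id=1, first_name="Pieter", last_name="Lems").save()
    Address(address_id=1, house_number=10, city="Rotterdam", employee=employee1).save()
    Payment(payment_id=1, payment_amount=1000, payment_date=datetime.now(), employee=employee1).save()

def add_test_data_employee_2():
    employee2 = Employee(employee_id=2, first_name="Dennis", last_name="Strik").save()
    Address(address_id=2, house_number=10, city="Utrecht", employee=employee2).save()
    Payment(payment_id=2, payment_amount=1000, payment_date=datetime.now(), employee=employee2).save()

add_test_data_employee_1()
add_test_data_employee_2()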

---

Test data embedded

---

In [None]:
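# Field values are passed as keyword arguments,
# so they do not depend on the order of the fields.
def add_test_data_employee_1():
    address_employee1 = Address(address_id=1, house_number=10, city="Rotterdam")

    payment_employee1 = Payment(payment_id=1, payment_amount=1000, payment_date=datetime.now())

    Employee(employee_id=1, first_name="Pieter", last_name="Lems",
             payments=payment_employee1,
             adresses=address_employee1).save()

def add_test_data_employee_2():
    address_employee2 = Address(address_id=2, house_number=10, city="Utrecht")

    payment_employee2 = Payment(payment_id=2, payment_amount=1000, payment_date=datetime.now())

    Employee(employee_id=2, first_name="Dennis", last_name="Strik",
             payments=payment_employee2,
             adresses=address_employee2).save()


add_test_data_employee_1()
add_test_data_employee_2()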
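
The test data above is typed by hand. When the rows come from the CSV files instead, the `Payment_Date` column arrives as text (pandas only parses dates when asked to, for example via `parse_dates`), while the model stores it as a datetime. A minimal sketch of converting one payment row; `parse_payment_row` is a hypothetical helper and the date format is inferred from the `Payment_Date` values shown earlier.

```python
from datetime import datetime

# Format inferred from values such as "2016-01-18 00:00:00.000".
DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_payment_row(row):
    """Convert the string values of one CSV row to the types used by the model."""
    return {
        "payment_id": int(row["Payment_Id"]),
        "payment_amount": float(row["Payment_Amount"]),
        "payment_date": datetime.strptime(row["Payment_Date"], DATE_FORMAT),
    }


row = {"Payment_Id": "1", "Payment_Amount": "1000",
       "Payment_Date": "2016-01-18 00:00:00.000"}
fields = parse_payment_row(row)
print(fields)
```

The resulting dictionary can be unpacked into a model, for example `Payment(**fields)`.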
    
    

---
# Querying the data 
---


In [17]:
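# Employees whose first name contains 'D' (case-sensitive)
Employee.objects(first_name__contains='D').to_json()
# Payments of at most 10050 (only this last expression is displayed)
Payment.objects(payment_amount__lte=10050).to_json()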

'[]'
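
The double-underscore suffixes are mongoengine query operators: `contains` is a case-sensitive substring match and `lte` means less than or equal. The same filters written as plain Python over in-memory records, purely to illustrate the semantics; the records are illustrative values based on the test data and the payment data set, and no database is involved.

```python
employees = [
    {"first_name": "Pieter", "last_name": "Lems"},
    {"first_name": "Dennis", "last_name": "Strik"},
]
payments = [
    {"payment_id": 1, "payment_amount": 1000.0},
    {"payment_id": 3, "payment_amount": 11900.0},
]

# Equivalent of first_name__contains='D': case-sensitive substring test.
names_with_d = [e["first_name"] for e in employees if "D" in e["first_name"]]

# Equivalent of payment_amount__lte=10050: less than or equal.
small_payments = [p["payment_id"] for p in payments if p["payment_amount"] <= 10050]

print(names_with_d, small_payments)
```

An empty result such as `'[]'` means that no document in the queried collection matched the filter.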

In [None]:
# The payment embedded in the first employee (embedded model only)
Employee.objects.first().payments.to_json()