# DS Data Engineer (Data Eng)

### _Objectives_
- Understand the data pipelines and business problems well enough to prescribe analytical solutions.
- Apply a diverse set of tactics, including statistics and quantitative reasoning, to solve problems and to research and produce relevant product insights.
- Build ETL data pipelines and infrastructure to support the Product and Data Science departments.
- Understand business requirements and design scalable engineering solutions for data storage and retrieval.
- Understand and articulate data engineering and infrastructure decisions.
- Analyze data trends to inform stakeholders' decisions using a variety of visualization tools.
- Bring strong database and data visualization skills, which are very valuable for this role.

### _Foundational Skills_
- Solid Understanding of DS Units 1 & 3
    - Pandas
    - Databases
        - SQL (PostgreSQL)
        - NoSQL (MongoDB)
    - Graphing Libraries
        - Matplotlib
        - Plotly
        - Seaborn
        - Altair

### _Skills to Strengthen_
- Build a Database Interface: MongoDB
    - Data Seeding: Mock Data
    - Filters & Projections
    - JSON Backup & Restore
- Build a Data Visualization Component: Plotly
    - Parameterized Abstraction Encapsulating Plotly


In [1]:
# First, we're going to pip install some dependencies
%pip install colab-env
%pip install pymongo[srv]
%pip install MonsterLab

Collecting colab-env
  Downloading colab-env-0.2.0.tar.gz (4.7 kB)
Collecting python-dotenv<1.0,>=0.10.0
  Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
Building wheels for collected packages: colab-env
  Building wheel for colab-env (setup.py) ... done
  Created wheel for colab-env: filename=colab_env-0.2.0-py3-none-any.whl size=3838 sha256=326b0a698ccd3bb4c7e6a9be6bb360ef59ca8a4ee403a6e7158fdffabbe570b1
  Stored in directory: /root/.cache/pip/wheels/bb/ca/e8/3d25b6abb4ac719ecb9e837bb75f2a9b980430005fb12a9107
Successfully built colab-env
Installing collected packages: python-dotenv, colab-env
Successfully installed colab-env-0.2.0 python-dotenv-0.20.0
Collecting dnspython<3.0.0,>=1.16.0
  Downloading dnspython-2.2.1-py3-none-any.whl (269 kB)
Installing collected packages: dnspython
Successfully installed dnspython-2.2.1
Collecting MonsterLab
  Downloading MonsterLab-1.2.2-py3-none-any.whl (4.4 kB

# MongoDB Basics
MongoDB Interface Class: see the `database.py` file in the `data_model` package

In [2]:
import os
import json
from typing import Sequence

from MonsterLab import Monster
from pymongo import MongoClient
from pandas import DataFrame
from dotenv import load_dotenv
import plotly.graph_objects as go
import plotly.express as px

## Load the environment variables from `.env` file
Create your own `.env` file containing the connection URL provided when you set up your Mongo account. See the `.env-example` file.

### *Note* 
One way to import your `.env` file into Colab:
- Click the folder icon on the left panel.
- Under the word "Files", click the page icon with an up arrow in it.
- Find the text file you created containing your `MONGO_URL` (the variable name the connection code reads).
- Once the file is uploaded, right-click it and rename it to `.env` in Colab. The file will "disappear" from the file browser because its name now starts with a dot; this is expected.
- Restart your runtime.

Now you're ready to move on. 

In [3]:
load_dotenv()

True
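
The `True` above does not by itself confirm that the specific variable we need is set. As a minimal sketch, a small helper (hypothetical name `require_env`, not part of `python-dotenv`) can fail early with a readable message instead of passing `None` along to the database client:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Usage, after load_dotenv() has run:
# url = require_env("MONGO_URL")
```

This is only one way to do the check; the point is to surface a missing `.env` entry before any connection attempt.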

## Connect to the database
- URL: The URL given to us by MongoDB
    - Example: `mongodb+srv://<USER>:<PASS>@<CLUSTER>.<PROJECT_UUID>.mongodb.net`
- Database: This can be any name we like
    - Example: `MonsterLab`
- Collection: This can be any name we like
    - Example: `Monsters`

If we refer to a Database or Collection that doesn't exist yet, Mongo will create it for us the first time we write to it, as long as we have a Project set up and ready to go.

### Connection Info

In [4]:
url = os.getenv("MONGO_URL")
database = "MonsterLab"
collection = "Monsters"

### Instantiate the MongoClient with the connection info

In [5]:
db = MongoClient(url)[database][collection]

### Cleanup: Reset the collection by deleting all entries
**This is not required if it is the first time connecting to this collection.**

Passing an empty dictionary as the filter matches every document, so `delete_many({})` deletes them all. To delete only a subgroup, pass a filter with one or more key/value pairs instead. See the section on Filters.


Example:

If you want to delete all Dragons:
```
db.delete_many({"type": "Dragon"})
```

We are going to delete all entries, essentially resetting our collection.

In [6]:
db.delete_many({})

<pymongo.results.DeleteResult at 0x7f8f420309b0>

### Example Data Point: Monster

In [7]:
monster = Monster()
monster

Name: Pit Lord
Type: Devilkin
Level: 9
Rarity: Rank 1
Damage: 9d4+1
Health: 37.97
Energy: 37.21
Sanity: 35.9
Time Stamp: 2022-04-07 08:05:25

### Convert the Monster class instance into a dictionary

In [8]:
monster_dict = vars(monster)
monster_dict

{'damage': '9d4+1',
 'energy': 37.21,
 'health': 37.97,
 'level': 9,
 'name': 'Pit Lord',
 'rarity': 'Rank 1',
 'sanity': 35.9,
 'time_stamp': '2022-04-07 08:05:25',
 'type': 'Devilkin'}
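
One detail worth knowing: `vars(obj)` returns the object's own `__dict__`, not a copy. PyMongo adds an `_id` key to the dictionary it is given when one is missing, so a key added to that dictionary also appears on the object. The sketch below uses a hypothetical `Creature` class as a stand-in for `Monster` and simulates the added key by hand:

```python
class Creature:
    """Hypothetical stand-in for a Monster-like object."""

    def __init__(self, name: str, level: int):
        self.name = name
        self.level = level


creature = Creature("Pit Lord", 9)

live_view = vars(creature)        # the instance's own __dict__
snapshot = dict(vars(creature))   # an independent shallow copy

live_view["_id"] = "added-later"  # simulates a key being added to the dict

print(hasattr(creature, "_id"))   # True: the object itself changed
print("_id" in snapshot)          # False: the copy is unaffected
```

If the original object should stay untouched, passing `dict(vars(monster))` is one option.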

### Insert the Monster into the Monsters collection

In [9]:
db.insert_one(monster_dict)

<pymongo.results.InsertOneResult at 0x7f8f42047550>

### Find all data points in the Monsters collection
`find()` returns a cursor, which is an iterator, so we need to cast it to a list or loop over it to see the data.

In [10]:
# Cast to a list
list(db.find())

[{'_id': ObjectId('624efdb801cdd511e4c54b0c'),
  'damage': '9d4+1',
  'energy': 37.21,
  'health': 37.97,
  'level': 9,
  'name': 'Pit Lord',
  'rarity': 'Rank 1',
  'sanity': 35.9,
  'time_stamp': '2022-04-07 08:05:25',
  'type': 'Devilkin'}]

In [11]:
# Used in a loop
for monster in db.find():
    print(str(monster))

{'_id': ObjectId('624efdb801cdd511e4c54b0c'), 'type': 'Devilkin', 'name': 'Pit Lord', 'level': 9, 'rarity': 'Rank 1', 'damage': '9d4+1', 'time_stamp': '2022-04-07 08:05:25', 'health': 37.97, 'energy': 37.21, 'sanity': 35.9}


### Create Many Monsters

In [12]:
many_monsters = (vars(Monster()) for _ in range(999))

### Insert Many Monsters

In [13]:
db.insert_many(many_monsters)

<pymongo.results.InsertManyResult at 0x7f8f41f771e0>
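
Note that `many_monsters` is a generator expression, so it can only be consumed once. The sketch below illustrates that behavior with a hypothetical `make_record` function standing in for `vars(Monster())`; a list is the usual choice when the same records are needed again after the insert:

```python
def make_record(i: int) -> dict:
    """Hypothetical stand-in for vars(Monster())."""
    return {"name": f"monster-{i}", "level": i}


lazy_records = (make_record(i) for i in range(3))   # generator: single pass
first_pass = list(lazy_records)
second_pass = list(lazy_records)                    # already exhausted

print(len(first_pass), len(second_pass))            # 3 0

eager_records = [make_record(i) for i in range(3)]  # list: reusable
print(len(list(eager_records)), len(list(eager_records)))  # 3 3
```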

### Check that the new monsters were added
We started with 1 Monster, then added 999 more, so we should have a total of 1000 Monsters.

In [14]:
db.count_documents({})

1000

### Convert the Monsters Collection into a DataFrame

In [15]:
df = DataFrame(db.find())

### Save the data for ML Eng to use for training the ML model

In [29]:
df.to_csv("training_data.csv")

## Filters
It's common to want only a subgroup of the data in a collection. We could download everything and filter it with pandas, but it is more efficient to let Mongo filter the data with a query.

This does not affect the database in any way. The data is still in Mongo; the documents that don't match are simply not downloaded for this query.

In [30]:
DataFrame(db.find({"type": "Dragon"}))

|  | _id | type | name | level | rarity | damage | time_stamp | health | energy | sanity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 624efdc401cdd511e4c54ef6 | Dragon | Faerie Dragon | 6 | Rank 0 | 6d2 | 2022-04-07 08:05:32 | 11.27 | 12.99 | 12.13 |
| 1 | 624efdc401cdd511e4c54efa | Dragon | Black Wyrmling | 13 | Rank 1 | 13d4+3 | 2022-04-07 08:05:32 | 50.95 | 53.10 | 53.25 |
| 2 | 624efdc401cdd511e4c54efd | Dragon | Ruby Wyrmling | 11 | Rank 0 | 11d2+2 | 2022-04-07 08:05:32 | 22.64 | 21.18 | 21.82 |
| 3 | 624efdc401cdd511e4c54f00 | Dragon | Wyvern | 4 | Rank 3 | 4d8+2 | 2022-04-07 08:05:32 | 35.41 | 28.56 | 34.00 |
| 4 | 624efdc401cdd511e4c54f03 | Dragon | Copper Wyrmling | 4 | Rank 1 | 4d4+2 | 2022-04-07 08:05:32 | 17.49 | 16.42 | 16.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 157 | 624efdc401cdd511e4c552bf | Dragon | Wyvern | 13 | Rank 4 | 13d10+1 | 2022-04-07 08:05:32 | 128.38 | 126.15 | 134.46 |
| 158 | 624efdc401cdd511e4c552c4 | Dragon | White Drake | 16 | Rank 1 | 16d4+1 | 2022-04-07 08:05:32 | 62.10 | 64.04 | 63.57 |
| 159 | 624efdc401cdd511e4c552cd | Dragon | Silver Drake | 10 | Rank 2 | 10d6+1 | 2022-04-07 08:05:32 | 61.73 | 58.70 | 61.59 |
| 160 | 624efdc401cdd511e4c552d0 | Dragon | Faerie Dragon | 12 | Rank 0 | 12d2 | 2022-04-07 08:05:32 | 24.55 | 23.27 | 24.07 |
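
To make the semantics of a simple filter concrete, here is a simplified pure-Python model of equality matching. The helper names `matches` and `find` are hypothetical, and this is not how Mongo implements queries: it ignores query operators, nested fields, and arrays. It only shows that every key/value pair in the filter must match, and that an empty filter matches everything:

```python
from typing import Any, Iterable, Iterator, Mapping


def matches(document: Mapping[str, Any], query: Mapping[str, Any]) -> bool:
    """True if every key/value pair in the query equals the document's value."""
    return all(
        key in document and document[key] == value
        for key, value in query.items()
    )


def find(documents: Iterable[Mapping[str, Any]], query: Mapping[str, Any]) -> Iterator[Mapping[str, Any]]:
    """Lazily yield the documents that satisfy the query."""
    return (doc for doc in documents if matches(doc, query))


# Illustrative sample data, not read from the database.
monsters = [
    {"type": "Dragon", "name": "Wyvern", "level": 4},
    {"type": "Undead", "name": "Ghostly Guard", "level": 4},
    {"type": "Dragon", "name": "White Drake", "level": 16},
]

print([m["name"] for m in find(monsters, {"type": "Dragon"})])
# ['Wyvern', 'White Drake']
print(len(list(find(monsters, {}))))
# 3
print([m["name"] for m in find(monsters, {"type": "Dragon", "level": 4})])
# ['Wyvern']
```

The practical difference with Mongo is where the work happens: the server applies the filter, so only matching documents travel over the network.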


## Projections
For machine learning, we only want the fields we intend to use for training. We could drop the extra columns with pandas, but it is more efficient to have Mongo leave them out than to download everything and discard it afterwards. Let's see how to exclude the `_id` and `time_stamp` fields with a Mongo projection.

This does not affect the database in any way. The data is still in Mongo; the excluded fields are simply not downloaded for this query.

In [31]:
DataFrame(db.find(projection={"_id": False, "time_stamp": False}))

|  | type | name | level | rarity | damage | health | energy | sanity |
|---|---|---|---|---|---|---|---|---|
| 0 | Devilkin | Pit Lord | 9 | Rank 1 | 9d4+1 | 37.97 | 37.21 | 35.90 |
| 1 | Undead | Ghostly Guard | 4 | Rank 0 | 4d2+2 | 7.63 | 8.61 | 8.52 |
| 2 | Dragon | Faerie Dragon | 6 | Rank 0 | 6d2 | 11.27 | 12.99 | 12.13 |
| 3 | Elemental | Djinni | 15 | Rank 0 | 15d2+1 | 30.02 | 30.61 | 29.30 |
| 4 | Demonic | Nightmare | 4 | Rank 1 | 4d4+1 | 16.67 | 15.89 | 14.85 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Demonic | Pit Fiend | 5 | Rank 2 | 5d6+2 | 28.30 | 32.55 | 28.76 |
| 996 | Dragon | White Drake | 4 | Rank 1 | 4d4+3 | 16.75 | 16.39 | 16.97 |
| 997 | Devilkin | Succubus | 11 | Rank 0 | 11d2 | 22.44 | 21.80 | 21.66 |
| 998 | Demonic | Nightmare | 9 | Rank 4 | 9d10+2 | 89.32 | 87.61 | 86.24 |
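
A projection can either exclude fields (values of `False`, as above) or include only the listed fields (values of `True`, the form used later in the graphs section). The sketch below is a simplified pure-Python model of those two modes; the `project` helper is hypothetical and does not validate its input the way Mongo does:

```python
from typing import Any, Dict, Mapping


def project(document: Mapping[str, Any], projection: Mapping[str, bool]) -> Dict[str, Any]:
    """Apply a simplified include/exclude projection to one document."""
    include = {key for key, keep in projection.items() if keep}
    exclude = {key for key, keep in projection.items() if not keep}
    if include:
        # Inclusion mode: keep the listed fields; `_id` stays unless excluded.
        keep_keys = set(include)
        if "_id" not in exclude:
            keep_keys.add("_id")
        return {k: v for k, v in document.items() if k in keep_keys}
    # Exclusion mode: keep everything except the listed fields.
    return {k: v for k, v in document.items() if k not in exclude}


# Illustrative document, not read from the database.
doc = {
    "_id": 1,
    "type": "Dragon",
    "name": "Wyvern",
    "time_stamp": "2022-04-07 08:05:32",
}

print(project(doc, {"_id": False, "time_stamp": False}))
# {'type': 'Dragon', 'name': 'Wyvern'}
print(project(doc, {"_id": False, "type": True}))
# {'type': 'Dragon'}
print(project(doc, {"type": True}))
# {'_id': 1, 'type': 'Dragon'}
```

The last call shows why the notebook excludes `_id` explicitly: in inclusion mode it is returned unless it is switched off.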


## A Filtered Projection
Sometimes we want to use a filter and a projection together.

In [32]:
DataFrame(db.find(
    {"type": "Dragon"},
    projection={"_id": False, "time_stamp": False},
))

|  | type | name | level | rarity | damage | health | energy | sanity |
|---|---|---|---|---|---|---|---|---|
| 0 | Dragon | Faerie Dragon | 6 | Rank 0 | 6d2 | 11.27 | 12.99 | 12.13 |
| 1 | Dragon | Black Wyrmling | 13 | Rank 1 | 13d4+3 | 50.95 | 53.10 | 53.25 |
| 2 | Dragon | Ruby Wyrmling | 11 | Rank 0 | 11d2+2 | 22.64 | 21.18 | 21.82 |
| 3 | Dragon | Wyvern | 4 | Rank 3 | 4d8+2 | 35.41 | 28.56 | 34.00 |
| 4 | Dragon | Copper Wyrmling | 4 | Rank 1 | 4d4+2 | 17.49 | 16.42 | 16.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 157 | Dragon | Wyvern | 13 | Rank 4 | 13d10+1 | 128.38 | 126.15 | 134.46 |
| 158 | Dragon | White Drake | 16 | Rank 1 | 16d4+1 | 62.10 | 64.04 | 63.57 |
| 159 | Dragon | Silver Drake | 10 | Rank 2 | 10d6+1 | 61.73 | 58.70 | 61.59 |
| 160 | Dragon | Faerie Dragon | 12 | Rank 0 | 12d2 | 24.55 | 23.27 | 24.07 |


### Backup Database Monsters to JSON File
The auto-generated `_id` field is an `ObjectId`, a custom object type that the standard `json` module cannot serialize, so we exclude it with a projection.

The following cell should create a backup of our data and save it to "monsters.json" locally.

In [33]:
with open("monsters.json", "w") as file:
    json.dump(tuple(db.find(projection={"_id": False})), file)
db.count_documents({})

1000
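
The serialization problem can be reproduced without a database. The sketch below uses a hypothetical `FakeObjectId` class as a stand-in for the real id type and shows two ways around the error: dropping the field, or converting it to a string with the `default` hook of `json.dumps`:

```python
import json


class FakeObjectId:
    """Hypothetical stand-in for a database-generated id object."""

    def __init__(self, value: str):
        self.value = value

    def __str__(self) -> str:
        return self.value


document = {"_id": FakeObjectId("624efdb801cdd511e4c54b0c"), "name": "Pit Lord"}

try:
    json.dumps(document)
except TypeError as error:
    print(f"Cannot serialize: {error}")

# Option 1: drop the id, which is what the projection does on the server side.
without_id = {k: v for k, v in document.items() if k != "_id"}
print(json.dumps(without_id))
# {"name": "Pit Lord"}

# Option 2: keep the id as a plain string via the `default` hook.
print(json.dumps(document, default=str))
# {"_id": "624efdb801cdd511e4c54b0c", "name": "Pit Lord"}
```

Option 2 keeps the ids in the backup, but a restore would then insert them as plain strings rather than as the original id type, so dropping the field is the simpler choice here.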

### Delete All Monsters to Test our Backup & Restore Procedure

In [34]:
db.delete_many({})
db.count_documents({})

0

### Restore Database Monsters from JSON File

In [35]:
with open("monsters.json", "r") as file:
    db.insert_many(json.load(file))
db.count_documents({})

1000
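
Matching counts show that the right number of documents came back, not that their contents survived the round trip. As a minimal sketch with illustrative records and a temporary file (no database involved), the restored data can be compared directly against the original:

```python
import json
import tempfile
from pathlib import Path

# Illustrative records, not read from the database.
records = [
    {"name": "Pit Lord", "level": 9, "health": 37.97},
    {"name": "Wyvern", "level": 4, "health": 35.41},
]

with tempfile.TemporaryDirectory() as folder:
    backup_path = Path(folder) / "monsters.json"
    # Backup: write the records to disk as JSON.
    with open(backup_path, "w") as file:
        json.dump(records, file)
    # Restore: read them back into Python objects.
    with open(backup_path, "r") as file:
        restored = json.load(file)

print(restored == records)  # True
```

This works because the records contain only strings, integers, and floats. Values that JSON cannot represent directly would need an explicit conversion in both directions.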

# Monster Graphs: Plotly

In [36]:
df_type = DataFrame(db.find(projection={"_id": False, "type": True}))
type_value_counts = df_type["type"].value_counts()
type_value_counts

Devilkin     174
Demonic      171
Undead       168
Fey          164
Dragon       162
Elemental    161
Name: type, dtype: int64

In [37]:
data = go.Pie(
    labels=type_value_counts.index,
    values=type_value_counts.values,
    hole=0.5,
    textinfo="percent+label",
    textfont={"size": 12},
    hoverinfo="text+label+value",
    textposition='inside',
    showlegend=False,
)
layout = go.Layout(
    title={
        "text": "Monster Count by Type",
        "font": {"color": "white", "size": 24},
    },
    colorway=px.colors.qualitative.Antique,
    width=640,
    height=640,
    paper_bgcolor="#333333",
)
figure = go.Figure(data, layout)
figure.show()

### Wrap it in a Function
This will encapsulate our graphing code. That means we can change the implementation later without too much trouble.
For example, we could change to using Altair instead of Plotly, and the rest of the system wouldn't break.

In [38]:
def pie_chart(title: str, labels: Sequence, values: Sequence) -> go.Figure:
    return go.Figure(
        go.Pie(
            labels=labels,
            values=values,
            hole=0.5,
            textinfo="percent+label",
            textfont={"size": 12},
            hoverinfo="text+label+value",
            textposition='inside',
            showlegend=False,
        ),
        go.Layout(
            title={
                "text": title,
                "font": {"color": "white", "size": 24},
            },
            colorway=px.colors.qualitative.Antique,
            width=640,
            height=640,
            paper_bgcolor="#333333",
        ),
    )

In [39]:
# This should produce the same graph as above
pie_chart(
    "Monster Count by Type",
    type_value_counts.index,
    type_value_counts.values,
).show()

### New Data, New Graph, Same Function

In [40]:
df_dragon_rarity = DataFrame(
    db.find({"type": "Dragon"}, projection={"_id": False, "rarity": True}),
)
dragon_rarity_counts = df_dragon_rarity["rarity"].value_counts().sort_index()
dragon_rarity_counts

Rank 0    41
Rank 1    36
Rank 2    32
Rank 3    27
Rank 4    17
Rank 5     9
Name: rarity, dtype: int64

In [41]:
pie_chart(
    title="Dragon Count by Rarity",
    labels=dragon_rarity_counts.index,
    values=dragon_rarity_counts.values,
).show()
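
Because `pie_chart` only asks for a title and two sequences, the counts do not have to come from pandas. As a sketch, a hypothetical `count_by` helper built on `collections.Counter` produces the same kind of labels and values from a list of documents:

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping, Tuple


def count_by(documents: Iterable[Mapping[str, Any]], field: str) -> Tuple[List[Any], List[int]]:
    """Count documents per distinct value of a field, sorted by label."""
    counts = Counter(doc[field] for doc in documents)
    labels = sorted(counts)
    values = [counts[label] for label in labels]
    return labels, values


# Illustrative sample data, not read from the database.
sample = [
    {"type": "Dragon", "rarity": "Rank 1"},
    {"type": "Dragon", "rarity": "Rank 0"},
    {"type": "Dragon", "rarity": "Rank 1"},
]

labels, values = count_by(sample, "rarity")
print(labels, values)  # ['Rank 0', 'Rank 1'] [1, 2]

# The result can be passed straight to the function defined above:
# pie_chart("Dragon Count by Rarity", labels, values).show()
```

Keeping the chart function independent of where its inputs come from is what makes this kind of swap possible.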