# Data Engineer & Analyst

Finding just the right data for a project can be a challenge. Finding perfectly clean data ready for machine learning can be impossible. For our first assignment we'll generate some data, but before we do that we'll need a place to store it! For that I've included a MongoDB interface; all you need to do is enter your credentials when prompted.

## MongoDB Database Interface

Our first assignment is to generate some random data, but first we need a place to put it!

MongoDB is a good place to start when learning database operations in Python. Unlike relational databases, working with a NoSQL database such as MongoDB is more like working with other Python libraries and less like writing obscure SQL queries. For some of these assignments you will need a free [MongoDB account](https://www.mongodb.com/). In Assignment 2 you'll create your own database interface, much like the one below.


In [None]:
# PyMongo needs dnspython to resolve mongodb+srv:// URLs
!pip install pymongo dnspython

In [None]:
from typing import Dict, Iterable
from pymongo import MongoClient
import pandas as pd

In [None]:
class MongoDB:

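    def __init__(self, url, collection, table):
        # In PyMongo terms, `collection` here names the database and
        # `table` names the collection inside it (see connect below).
        self.url = url
        self.collection = collection
        self.table = table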

    def connect(self):
        return MongoClient(self.url)[self.collection][self.table]

    def find(self, query_obj: Dict) -> pd.DataFrame:
        return pd.DataFrame(self.connect().find(query_obj))

    def insert(self, insert_obj: Iterable[Dict]):
        self.connect().insert_many(insert_obj)

    def update(self, query: Dict, data: Dict):
        self.connect().update_many(query, {"$set": data})

    def delete(self, query_obj: Dict):
        self.connect().delete_many(query_obj)

    def get_df(self) -> pd.DataFrame:
        # find() already returns a DataFrame, so no extra wrapping is needed
        return self.find({})

It's best practice to store passwords and credentials in a `.env` file rather than in the notebook itself. To keep things simple, this notebook instead asks for your database info when you run the cell below. Make sure your MongoDB account is set up first.

In [None]:
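from getpass import getpass
from urllib.parse import quote_plus

base_url = input("URL? ")
user_name = input("Username? ")
password = getpass("Password? ")  # hides the password as you type

collection = input("Collection? ")
table = input("Table? ")

# Percent-encode the credentials so characters like "@" or "/" don't break the URL
url = f"mongodb+srv://{quote_plus(user_name)}:{quote_plus(password)}@{base_url}"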
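
The `.env` practice mentioned above usually boils down to reading credentials from environment variables instead of typing them in. Here is a minimal sketch of that idea. The variable names `MONGO_USER`, `MONGO_PASSWORD`, and `MONGO_HOST`, the helper `build_mongo_url`, and the example host are all made up for illustration. A package such as python-dotenv can load a `.env` file into `os.environ`, but this sketch sticks to the standard library and uses a plain dict as a stand-in.

```python
import os
from urllib.parse import quote_plus


def build_mongo_url(env=os.environ) -> str:
    """Build a mongodb+srv URL from environment-style variables.

    The variable names below are hypothetical; use whatever suits your project.
    """
    user = quote_plus(env["MONGO_USER"])          # percent-encode special characters
    password = quote_plus(env["MONGO_PASSWORD"])
    host = env["MONGO_HOST"]
    return f"mongodb+srv://{user}:{password}@{host}"


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {
    "MONGO_USER": "student",
    "MONGO_PASSWORD": "p@ss/word",
    "MONGO_HOST": "cluster0.example.mongodb.net",
}
example_url = build_mongo_url(example_env)
print(example_url)
```

With real environment variables in place, calling `build_mongo_url()` with no arguments would read from `os.environ`, and the resulting string could be passed to the `MongoDB` class in place of `url`.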

In [None]:
db = MongoDB(url, collection, table)
print(db.get_df())

# Assignment 1: Mock Data

Now we need some data! Random data can be generated in many ways. Here's an example using MonsterLab and the database interface above. Mock data should have the same shape as the *real data* is expected to have.

[MonsterLab](https://pypi.org/project/MonsterLab/) is built on [Fortuna](https://pypi.org/project/Fortuna/). MonsterLab is how Bandersnatch generates its data; more on that later...

For this assignment, review the code cells below and have a go at playing with the generators in MonsterLab. See what you can do!


1. Sign up for MongoDB if you don't already have an account.
2. Run the cells below to get a feel for it. Edit the code, and have fun.
3. Create and store at least 1000 monsters using the database interface and MonsterLab's Monster class.
4. Find all the Dragons and print them as a DataFrame.
5. Get all the monsters into a pandas DataFrame.

Bandersnatch should be very happy, indeed!

## Random Monsters: MonsterLab & Fortuna

Fortuna is a random value toolkit by Robert Sharp. If you would like to know more, here's the [Fortuna Documentation](https://pypi.org/project/Fortuna/). Unfortunately, Fortuna is currently incompatible with Windows. As such, it is recommended to run this notebook with Colab or Jupyter on WSL. Fortuna is 100% compatible with all *nix systems including macOS.

In [None]:
# Colab is highly recommended for those on Windows.
!pip install MonsterLab

In [None]:
from MonsterLab import Monster

In [None]:
help(Monster)

### A Random Monster

In [None]:
m1 = Monster()
m1

### Monster as a Dict

In [None]:
m2 = Monster()
m2.to_dict()

### Insert a Single Custom Monster into the database

In [None]:
monster = Monster(
    name="Bandersnatch", 
    monster_type="Demonic", 
    level=20, 
    rarity="Rank 5",
)

db.insert([monster.to_dict()])

db.get_df()

### Insert Many Random Monsters

In [None]:
n_monsters = 1024
db.insert(Monster().to_dict() for _ in range(n_monsters))

In [None]:
db.get_df()

### Find all the monsters that match a single-field query

In [None]:
db.find({"Name": "Vampire"})

### Find the monsters that match a multi-field query

In [None]:
db.find({"Type": "Undead", "Level": 10})
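
A query with several fields, like the one above, is an implicit AND of equality checks: a document matches only when every listed field has the listed value, and an empty query `{}` matches everything. The pure-Python sketch below imitates that behavior on a few hand-written documents. The `matches` helper and the sample data are made up for illustration, it is not how MongoDB works internally, and it ignores query operators such as `$gt`.

```python
from typing import Any, Dict, List


def matches(document: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Hypothetical helper: True when every query field equals the document's value."""
    return all(document.get(key) == value for key, value in query.items())


# Hand-written sample documents (not real MonsterLab output).
sample_monsters: List[Dict[str, Any]] = [
    {"Name": "Ghoul", "Type": "Undead", "Level": 10},
    {"Name": "Vampire", "Type": "Undead", "Level": 12},
    {"Name": "Imp", "Type": "Demonic", "Level": 10},
]

query = {"Type": "Undead", "Level": 10}
found = [doc for doc in sample_monsters if matches(doc, query)]
print(found)

# An empty query places no conditions, so every document matches.
everything = [doc for doc in sample_monsters if matches(doc, {})]
print(len(everything))
```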

### Find all the Dragons.

In [None]:
# YOUR CODE HERE

### Get all the monsters into a pandas DataFrame.

In [None]:
# YOUR CODE HERE

# Assignment 2: Database Interface


Write a database interface class for the database of your choice, typically MongoDB or PostgreSQL. It's recommended to choose the same database type as your primary Labs Project.

This interface will serve as an [Abstraction Layer](https://en.wikipedia.org/wiki/Abstraction_layer) for your database. Abstraction layers are one of the most overlooked and undervalued constructs in all of programming. In this assignment, we will [encapsulate](https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)), or abstract away, the type of database we're using by creating an interface. This interface could be replaced by another one that accesses a different type of database. As long as both expose the same methods with the same signatures, the rest of the app won't even know. The polymorphic abstraction layer gives us this ability without reworking the rest of the app, because all calls to the database travel through our matching interfaces.

Objects that can replace each other like this are said to be [Polymorphic](https://en.wikipedia.org/wiki/Polymorphism_(computer_science)).
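
To make the idea concrete, here is a small sketch with two in-memory stores that share method names and signatures. Everything in it (`Store`, `ListStore`, `TypeIndexedStore`, `count_dragons`, and the sample records) is invented for illustration and only covers `create` and `read`; a real interface for this assignment would talk to an actual database and add `update` and `delete`.

```python
from typing import Dict, Iterable, List, Protocol


class Store(Protocol):
    """The shared interface: anything with these two methods will do."""

    def create(self, records: Iterable[Dict]) -> None: ...
    def read(self, query: Dict) -> List[Dict]: ...


class ListStore:
    """Keeps records in a plain list."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def create(self, records: Iterable[Dict]) -> None:
        self._rows.extend(dict(r) for r in records)

    def read(self, query: Dict) -> List[Dict]:
        return [r for r in self._rows
                if all(r.get(k) == v for k, v in query.items())]


class TypeIndexedStore:
    """Groups records by their "Type" field; same methods, different internals."""

    def __init__(self) -> None:
        self._by_type: Dict = {}

    def create(self, records: Iterable[Dict]) -> None:
        for r in records:
            self._by_type.setdefault(r.get("Type"), []).append(dict(r))

    def read(self, query: Dict) -> List[Dict]:
        if "Type" in query:
            candidates = self._by_type.get(query["Type"], [])
        else:
            candidates = [r for rows in self._by_type.values() for r in rows]
        return [r for r in candidates
                if all(r.get(k) == v for k, v in query.items())]


def count_dragons(store: Store) -> int:
    """Caller code depends only on the interface, not on a concrete class."""
    return len(store.read({"Type": "Dragon"}))


seed = [{"Name": "Smaug", "Type": "Dragon"}, {"Name": "Ghoul", "Type": "Undead"}]
results = []
for store in (ListStore(), TypeIndexedStore()):
    store.create(seed)
    results.append(count_dragons(store))
print(results)
```

`count_dragons` never checks which class it was handed, which is the property that lets one store be swapped for another without touching the calling code.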

Your custom interface should implement the following methods at minimum:
1. `create`
2. `read`
3. `update`
4. `delete`

If you need a refresher on how to build Python classes, look no further!
- [Basic Python Classes](https://sharpdesigndigital.com/class-objects/)
- [Advanced Python Classes](https://sharpdesigndigital.com/advanced-classes/)

In [None]:
class DataInterface:
    # YOUR CODE HERE
    pass  # placeholder so the cell runs; remove once you add methods

For extra data points, use your own interface as the source of data for the next few assignments. For this to work well, your interface must be 100% polymorphic with the provided `MongoDB` interface above or you'll need to refactor the code in the rest of the notebook to match your interface. It is NOT recommended to do some of each. You should choose, but choose wisely.

# Assignment 3: Visualizations

In [None]:
import plotly.graph_objects as go
import plotly.express as px

Change the implementation of the pie chart "Monsters by Rarity" to use the function `rank_lookup`, so that the legend displays the rank names (Very Common through Legendary) rather than the rank values (Rank 0 through Rank 5).

In [None]:
def rank_lookup(rank: str) -> str:
    return {
        "Rank 0": "Very Common",
        "Rank 1": "Common",
        "Rank 2": "Uncommon",
        "Rank 3": "Rare",
        "Rank 4": "Epic",
        "Rank 5": "Legendary",
    }.get(rank, "Unknown")

## Pie Chart: Monsters by Rarity

In [None]:
target = "Rarity"

df = db.get_df()[target].value_counts()
data = go.Pie(labels=df.index, values=df.values, hole=0.5)

layout = go.Layout(
    title=f"Monsters by {target}",
    colorway=px.colors.qualitative.Antique,
    height=700,
    width=770,
)

figure = go.Figure(data, layout)

figure.update_traces(
    textfont_size=14,
    textinfo='percent+label',
)

figure.show()

## Line Chart: Monster Rarity Totals Over Time

In [None]:
from itertools import accumulate

In [None]:
target = "Rarity"  # ["Rarity", "Level", "Type"]

df = db.get_df()

df_cross = pd.crosstab(df['Time Stamp'], df[target])

for column in df_cross.columns:
    df_cross[column] = list(accumulate(df_cross[column]))

title = f"Monster {target} Totals Over Time"

data = [go.Scatter(
    x=df_cross.index, 
    y=df_cross[col],
    name=col,
    line={"width": 1.5},
) for col in df_cross.columns]

layout = go.Layout(
    title=title,
    colorway=px.colors.qualitative.Antique,
    height=600,
    width=800,
    yaxis={"title": "Monster Count"},
    xaxis={"title": "Time Stamp"},
)

figure = go.Figure(data, layout)
figure.show()
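
The loop over `df_cross.columns` above turns per-timestamp counts into running totals with `itertools.accumulate`. A tiny standalone example, using made-up counts, shows the effect on a single column:

```python
from itertools import accumulate

# Hypothetical number of new monsters recorded at four consecutive time stamps.
new_per_step = [2, 0, 1, 3]

# Each entry becomes the sum of itself and everything before it.
running_total = list(accumulate(new_per_step))
print(running_total)  # [2, 2, 3, 6]
```

The last running total equals the overall count, which is why each line in the chart ends at that category's total. pandas also provides `DataFrame.cumsum()`, which should produce the same running totals without an explicit loop.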

## Stacked Bar Chart Crosstab: Rarity by Level

Dynamically add the name of the target to the title of the y-axis.

In [None]:
feature = "Level"  # ["Level", "Type", "Rarity"]
target = "Rarity"  # ["Level", "Type", "Rarity"]

df = db.get_df()

df_cross = pd.crosstab(df[feature], df[target])

title = f"{target} by {feature}"

data = [
    go.Bar(name=col, x=df_cross.index, y=df_cross[col])
    for col in df_cross.columns
]

layout = go.Layout(
    title=title,
    colorway=px.colors.qualitative.Antique,
    height=600,
    width=810,
    barmode="stack",
    yaxis={"title": "Monster Count"},
    xaxis={'title': feature}
)

figure = go.Figure(data, layout)

figure.show()

## Altair

In [None]:
import altair as alt
import numpy as np

Replace all the `...` in the code below with the right variable names. Play with it! What can you do with Altair?

In [None]:
x_axis = "Health"  # ["Energy", "Sanity", "Health"]
y_axis = "Energy"  # ["Energy", "Sanity", "Health"]
target = "Rarity"  # ["Rarity", "Level", "Type"]
rarity = "All"     # ["All", "Rank 0", ... "Rank 5"]

monsters = db.get_df().drop(columns=['_id'])

if rarity != "All":
    monsters = monsters[monsters['Rarity'] == ...]

graph = alt.Chart(
    monsters,
    title=f"{rarity} Monsters",
).mark_circle(size=100).encode(
    x=alt.X(
        ...,
        axis=alt.Axis(title=x_axis),
    ),
    y=alt.Y(
        ...,
        axis=alt.Axis(title=y_axis),
    ),
    color=...,
    tooltip=alt.Tooltip(list(monsters.columns)),
).properties(
    height=500,
    width=500,
)

graph

### Abstraction, Encapsulation, Polymorphism

Below is one example of an abstraction that encapsulates a graph and exposes some customization points. Here we'll use a functional interface, but classes work too.

You can parameterize every aspect of the graph by adding function arguments. Be mindful: you don't want to overdo it here. Keep your calling signature simple and usable. Provide good defaults and well-named arguments, and your users will enjoy using your code. Make it super complicated and they may as well just use Altair themselves.

A good interface should always encapsulate the core logic in such a way that the rest of the app is totally unaware of how it works, but can still interact with the core logic in a general way. One might say that the interface is more abstract than the core logic it encapsulates. At this higher abstraction level it becomes easier to replace our core logic without disrupting parallel development on other parts of the app. And now a word from our sponsor, Polymorphism.

One hypothetical example of Polymorphism: suppose we designed more than one graph, possibly with two different graphing libraries, and gave them compatible interfaces. This would give us the ability to trade one graph library for another without rewriting the whole app. We could do that at any time during development without disrupting anything.
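
As a toy version of that hypothetical, the sketch below defines two text-based "graph" functions with the same calling signature and sensible defaults. Neither uses a real graphing library; `bar_chart`, `table_chart`, `report`, and the counts are all invented for illustration. The point is only that `report` can be handed either one without changing.

```python
from typing import Callable, Dict

Renderer = Callable[..., str]


def bar_chart(counts: Dict[str, int], title: str = "Counts", width: int = 20) -> str:
    """Render counts as horizontal text bars scaled to `width` characters."""
    peak = max(counts.values(), default=0) or 1
    lines = [title]
    for label, value in counts.items():
        lines.append(f"{label:<10} {'#' * round(width * value / peak)} {value}")
    return "\n".join(lines)


def table_chart(counts: Dict[str, int], title: str = "Counts", width: int = 20) -> str:
    """Render the same data as a plain table; `width` sets the rule length."""
    lines = [title, "-" * width]
    lines += [f"{label:<10} {value}" for label, value in counts.items()]
    return "\n".join(lines)


def report(counts: Dict[str, int], render: Renderer = bar_chart) -> str:
    """The app calls `render` without knowing which implementation it received."""
    return render(counts, title="Monsters by Rarity")


rarity_counts = {"Rank 0": 4, "Rank 1": 2}  # made-up numbers
print(report(rarity_counts))
print(report(rarity_counts, render=table_chart))
```

Both functions accept the same arguments and return the same kind of value, so swapping one for the other is a one-argument change at the call site.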

In [None]:
def scatter(x_axis="Health", y_axis="Energy", target="Rarity", rarity="All"):

    monsters = db.get_df().drop(columns=['_id'])

    if rarity != "All":
        monsters = monsters[monsters['Rarity'] == rarity]

    graph = alt.Chart(
        monsters,
        title=f"{rarity} Monsters",
    ).mark_circle(size=200).encode(
        x=alt.X(
            x_axis,
            axis=alt.Axis(title=x_axis),
        ),
        y=alt.Y(
            y_axis,
            axis=alt.Axis(title=y_axis),
        ),
        color=target,
        tooltip=alt.Tooltip(list(monsters.columns)),
    ).properties(
        height=500,
        width=500,
    )

    return graph

Try other columns below. In terms of data science, what's the most interesting graph you can make with the `scatter` function?

What feature(s) are missing from this function? 

What's cool about designing software this way? 

What's lacking about designing software this way?

In [None]:
scatter_plot = scatter(
    x_axis="Time Stamp", 
    y_axis="Level", 
    target="Type", 
    rarity="Rank 0",
)
scatter_plot

The same graph as above, as a JSON string...

Juggling JSON is tricky until you get the hang of it, then it's really easy!

In [None]:
# The Altair Library provides the `.to_json()` method.
# This creates a dirty json string. In the next assignment we'll see about fixing it.
scatter_plot_json = scatter_plot.to_json()
scatter_plot_json

# Assignment 4: Data I/O Juggling

### JSON Library

In [None]:
import json

The Python `json` library has four functions to help with JSON I/O:
- 2 for File I/O
    - `json.load` -> reads JSON data from a file and returns a Python object (here, a dict)
    - `json.dump` -> takes a dict and writes it to a file as JSON data. Best used with a context manager.
- 2 for Memory I/O
    - `json.loads` -> creates a dict from JSON data in memory (a string held in some variable)
    - `json.dumps` -> turns a dict into JSON data in memory (a string held in some variable)
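
A quick way to see all four functions working together is a round trip on a small dict. This sketch uses a made-up record and an in-memory `io.StringIO` buffer as a stand-in for a real file, so it runs without touching the disk.

```python
import io
import json

monster = {"Name": "Bandersnatch", "Level": 20}  # made-up record

# Memory I/O: dict -> str -> dict
text = json.dumps(monster)
restored = json.loads(text)

# File I/O: any file-like object works; io.StringIO stands in for a real file.
buffer = io.StringIO()
json.dump(monster, buffer)
buffer.seek(0)  # rewind before reading back
from_file = json.load(buffer)

# Indentation changes how the text looks, not the data it decodes to.
pretty = json.dumps(monster, indent=2)
print(text)
print(pretty)
print(restored == from_file == json.loads(pretty))
```

The trailing `s` in `loads` and `dumps` is a handy reminder that those two work with strings.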


In [None]:
# json.loads will turn dirty json into a dictionary
scatter_plot_dict = json.loads(scatter_plot_json)
scatter_plot_dict

In [None]:
# json.dumps will create a clean JSON string from a dict
scatter_plot_json = json.dumps(scatter_plot_dict)
scatter_plot_json

Context managers help manage resources, like files, that need to be closed when we're done with them. The context manager automatically closes its resource when execution leaves the `with` block, even if an error occurs inside it. Two context managers are used below; each begins with the `with` keyword.
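
The following sketch checks that behavior directly and then writes out, by hand, roughly what a `with` block does for a file. The temporary path and the file contents are arbitrary choices for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as file:
    file.write("hello")
    open_inside = not file.closed   # still open inside the block

closed_after = file.closed          # closed as soon as the block exits

# Roughly what the `with` statement does for a file, written out by hand:
file2 = open(path, "r")
try:
    content = file2.read()
finally:
    file2.close()                   # runs even if read() raises

print(open_inside, closed_after, content)
```

The `with` form is shorter and harder to get wrong, which is why it is the usual choice for file handling.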

In [None]:
# json.dump will save a dict to a JSON file
with open("scatter_plot.json", "w") as file:
    json.dump(scatter_plot_dict, file)

In [None]:
# json.load will read an open JSON file and turn it into a dict
with open("scatter_plot.json", "r") as file:
    json_dict = json.load(file)
json_dict

Save your favorite graph as a JSON file and post it in Slack with a screenshot of the graph.