# DS Machine Learning Engineer (ML Eng)

### _Objectives_
- Build scalable machine learning models that meet product requirements.
- Understand the big picture: which machine learning models to use or build, and which features to include in the modeling process.
- Understand which product and data metrics matter most to optimize, and have a good feel for the levers that move those metrics.
- Apply a diverse set of tactics, including statistics, quantitative reasoning, and research, to solve problems and produce relevant product insights. Strong modeling skills are a plus.

### _Foundational Skills_
- Solid understanding of DS Unit 2 and/or 4
- Types of models and their use cases
- Model Metrics
  - Precision vs Accuracy
- ETL Pipelines
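
The "Precision vs Accuracy" item above is easiest to see on imbalanced labels. The following is a minimal sketch in plain Python; the helper functions and the "common"/"rare" labels are hypothetical and chosen only for illustration. A model that always predicts the majority class scores well on accuracy while its precision for the minority class is zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        # Convention chosen for this sketch when nothing is predicted positive.
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


# Hypothetical, imbalanced labels: 8 common monsters and 2 rare ones.
y_true = ["common"] * 8 + ["rare"] * 2
# A lazy model that always predicts the majority class.
y_pred = ["common"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(precision(y_true, y_pred, "rare"))  # 0.0
```

Accuracy alone would hide the fact that this model never identifies a rare monster, which is why the choice of metric should follow from the product goal.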

### _Skills to Strengthen_
- Build a scikit-learn Model Interface
    - Model Selection (Regression vs Classification)
    - Train/Test Split
    - Model Training
    - Hyperparameter Tuning
    - Model Validation
    - Performance Evaluation
        - Is it small enough and fast enough to make predictions in the cloud?
        - Does it train fast enough to do dynamic training in the cloud?
- Joblib or Pickle for Model Serialization
    - Typically required for deployment when dynamic training isn’t viable
    - Dump to save the model to a file
    - Load to open the model from a file


In [2]:
# First, let's pip install some dependencies
%pip install MonsterLab

Collecting MonsterLab
  Downloading MonsterLab-1.2.2-py3-none-any.whl (4.4 kB)
Installing collected packages: MonsterLab
Successfully installed MonsterLab-1.2.2


## ML Modeling Basics
scikit-learn Random Forest Classifier (RFC) interface class: see the `model.py` file in the `machine_learning` package.


In [3]:
from functools import reduce
from operator import mul
from random import randint

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import plotly.graph_objects as go
import plotly.express as px
from MonsterLab import Monster

## Data Schema: MonsterLab, Monster

In [4]:
Monster()

Name: Lich King
Type: Undead
Level: 14
Rarity: Rank 4
Damage: 14d10+1
Health: 140.65
Energy: 137.81
Sanity: 139.52
Time Stamp: 2022-04-07 08:00:28

### Load Data from CSV

Load the `training_data.csv` file that was created earlier into a DataFrame.

In [6]:
# TODO: consider hosting the CSV on GitHub and loading it from a URL instead
df = pd.read_csv("training_data.csv")
df

| Unnamed: 0.1 | Unnamed: 0 | _id | type | name | level | rarity | damage | time_stamp | health | energy | sanity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 624efdb801cdd511e4c54b0c | Devilkin | Pit Lord | 9 | Rank 1 | 9d4+1 | 2022-04-07 08:05:25 | 37.97 | 37.21 | 35.90 |
| 1 | 1 | 624efdbc01cdd511e4c54b0d | Undead | Ghostly Guard | 4 | Rank 0 | 4d2+2 | 2022-04-07 08:05:32 | 7.63 | 8.61 | 8.52 |
| 2 | 2 | 624efdbc01cdd511e4c54b0e | Dragon | Faerie Dragon | 6 | Rank 0 | 6d2 | 2022-04-07 08:05:32 | 11.27 | 12.99 | 12.13 |
| 3 | 3 | 624efdbc01cdd511e4c54b0f | Elemental | Djinni | 15 | Rank 0 | 15d2+1 | 2022-04-07 08:05:32 | 30.02 | 30.61 | 29.30 |
| 4 | 4 | 624efdbc01cdd511e4c54b10 | Demonic | Nightmare | 4 | Rank 1 | 4d4+1 | 2022-04-07 08:05:32 | 16.67 | 15.89 | 14.85 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 995 | 624efdbc01cdd511e4c54eef | Demonic | Pit Fiend | 5 | Rank 2 | 5d6+2 | 2022-04-07 08:05:32 | 28.30 | 32.55 | 28.76 |
| 996 | 996 | 624efdbc01cdd511e4c54ef0 | Dragon | White Drake | 4 | Rank 1 | 4d4+3 | 2022-04-07 08:05:32 | 16.75 | 16.39 | 16.97 |
| 997 | 997 | 624efdbc01cdd511e4c54ef1 | Devilkin | Succubus | 11 | Rank 0 | 11d2 | 2022-04-07 08:05:32 | 22.44 | 21.80 | 21.66 |
| 998 | 998 | 624efdbc01cdd511e4c54ef2 | Demonic | Nightmare | 9 | Rank 4 | 9d10+2 | 2022-04-07 08:05:32 | 89.32 | 87.61 | 86.24 |
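
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are typical of a DataFrame index that was written to CSV and read back as ordinary data, here apparently twice. How `training_data.csv` was actually produced is not shown, so the following is only a sketch of the usual cause and two possible remedies, using a tiny hypothetical frame and an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Tiny hypothetical frame standing in for the monster data.
original = pd.DataFrame({"level": [9, 4], "rarity": ["Rank 1", "Rank 0"]})

# Writing with the default index=True stores the index as an unnamed column.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
leaky = pd.read_csv(buffer)
print(leaky.columns.tolist())  # ['Unnamed: 0', 'level', 'rarity']

# Option 1: do not write the index in the first place.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)

# Option 2: drop leftover index columns after loading.
repaired = leaky.loc[:, ~leaky.columns.str.startswith("Unnamed")]
print(repaired.columns.tolist())  # ['level', 'rarity']
```

The extra columns do no harm in this notebook because only the columns named in `features` and `target` are used, but removing them keeps the schema easier to read.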


## Target & Features

In [7]:
target = "rarity"
features = ["level", "health", "energy", "sanity"]

## Hyperparameter Tuning

### Random Seed

In [8]:
# random_seed = randint(123456789, 987654321)
random_seed = 831592708

### Train/Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df[features],
    df[target],
    test_size=0.20,
    random_state=random_seed,
    stratify=df[target],
)
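
The `stratify` argument in the split above keeps the class proportions of the target the same in the train and test sets, which matters when some rarity ranks are much less common than others. A minimal sketch with hypothetical labels (80 of one rank, 20 of another) shows the effect; the `_demo` names are illustrative and not part of the notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 "Rank 0" and 20 "Rank 4".
labels = ["Rank 0"] * 80 + ["Rank 4"] * 20
rows = list(range(len(labels)))  # stand-in for feature rows

_, _, y_train_demo, y_test_demo = train_test_split(
    rows,
    labels,
    test_size=0.20,
    random_state=0,
    stratify=labels,
)

# Both splits keep the original 80/20 class ratio.
print(Counter(y_train_demo))
print(Counter(y_test_demo))
```

Without `stratify`, a random split could by chance leave very few examples of a rare rank in the test set, which would make the test score for that rank unreliable.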

### Parameters

In [10]:
param_dist = {
    "criterion": ("gini", "entropy"),
    "max_depth": (9, 10, 11),
    "max_features": (2, 3, 4),
    "n_estimators": (33, 66, 99),
}

### Calculate the number of parameter combinations
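Every entry in `param_dist` is a finite list, so `RandomizedSearchCV` samples candidates without replacement. Setting `n_iter` to the total number of combinations (2 × 3 × 3 × 3 = 54) therefore tries each combination exactly once, which is equivalent to using `GridSearchCV`.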

In [11]:
n_iter = reduce(mul, map(len, param_dist.values()))
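
As a sanity check on the count, the same number can be obtained by enumerating the combinations directly. This sketch uses only the standard library and repeats `param_dist` from the cell above so that it runs on its own; `all_combinations` is an illustrative name.

```python
from functools import reduce
from itertools import product
from math import prod
from operator import mul

param_dist = {
    "criterion": ("gini", "entropy"),
    "max_depth": (9, 10, 11),
    "max_features": (2, 3, 4),
    "n_estimators": (33, 66, 99),
}

# Product of the number of options per parameter, as in the notebook.
n_iter = reduce(mul, map(len, param_dist.values()))

# Cross-check by listing every combination explicitly.
all_combinations = list(product(*param_dist.values()))

print(n_iter)                 # 54
print(len(all_combinations))  # 54
# math.prod (Python 3.8+) is a shorter equivalent of reduce(mul, ...).
print(prod(len(v) for v in param_dist.values()))  # 54
```

Each added parameter value multiplies the number of fits, and with `cv=5` every combination is trained five times, so the grid size is worth checking before starting a search.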

### RandomizedSearchCV

In [12]:
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=random_seed),
    param_distributions=param_dist,
    n_iter=n_iter,
    n_jobs=-1,
    cv=5,
    random_state=random_seed,
)
search.fit(X_train, y_train)

RandomizedSearchCV(cv=5,
                   estimator=RandomForestClassifier(random_state=831592708),
                   n_iter=54, n_jobs=-1,
                   param_distributions={'criterion': ('gini', 'entropy'),
                                        'max_depth': (9, 10, 11),
                                        'max_features': (2, 3, 4),
                                        'n_estimators': (33, 66, 99)},
                   random_state=831592708)

## Best Model

In [13]:
search.best_estimator_

RandomForestClassifier(max_depth=9, max_features=3, n_estimators=99,
                       random_state=831592708)

## Cross-Validation Score

In [14]:
search.best_score_

0.95
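
Note that `best_score_` is the mean cross-validated score of the best candidate, computed on held-out folds of the training data, rather than the accuracy on the full training set. The sketch below checks this on synthetic data; the small parameter grid and the `demo_search` name are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, not the monster data set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": (2, 3), "n_estimators": (5, 10)},
    n_iter=4,  # all 2 x 2 combinations
    cv=5,
    random_state=0,
)
demo_search.fit(X, y)

# One mean cross-validated score per candidate.
mean_cv_scores = demo_search.cv_results_["mean_test_score"]
best_cv = mean_cv_scores[demo_search.best_index_]

# Accuracy of the refitted best model on the data it was trained on.
train_accuracy = demo_search.score(X, y)

print(demo_search.best_score_ == best_cv)  # True
print(train_accuracy)
```

The accuracy on the training data is usually higher than the cross-validated score, which is why the cross-validated value is the more honest number to compare against the test score.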

## Test Score

In [15]:
search.score(X_test, y_test)

0.955

## Feature Importances Graph

In [16]:
data = go.Pie(
    labels=search.feature_names_in_.tolist(),
    values=search.best_estimator_.feature_importances_.tolist(),
    hole=0.5,
    showlegend=False,
    hoverinfo="value",
    textfont={"size": 14},
    textinfo="percent+label",
)

layout = go.Layout(
    title={
        "text": "Feature Importances",
        "font": {"color": "white", "size": 24},
    },
    colorway=px.colors.qualitative.Antique,
    height=700,
    width=770,
    paper_bgcolor="#333333",
)

figure = go.Figure(data, layout)
figure.show()