# Machine Learning Engineer

## Introduction: Welcome to the Machine Learning Engineering Role!

This course walks you through what you need to know to be a rockstar
ML engineer. Unlike other courses, the DS Role courses present some
assignments as Jupyter (IPython) notebooks. We highly recommend using
Google Colab to complete these assignments.

### Course Outline
- ML Model Interfaces
- Model Tuning & Pipelines
- Clustering Model & Lookup Tables
- Model Serialization

### Random Monsters: MonsterLab

In [2]:
!pip install MonsterLab



In [3]:
import joblib
from typing import List, Tuple

import pandas as pd
from pandas import DataFrame
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

from MonsterLab import Monster

Before we can do machine learning, we need some data!

### Random Monster

In [4]:
Monster()

Name: Diamond Archfey
Type: Fey
Level: 11
Rarity: Rank 1
Damage: 11d4+2
Health: 45.98
Energy: 44.36
Sanity: 44.74
Time Stamp: 2022-03-31 06:49:44

Generate mock monster data. 5000 rows should make for a good model, but experiment with different values of the `number` variable below and see how the results change.

In [5]:
number = 5000
df = pd.DataFrame(Monster().to_dict() for _ in range(number))
df.to_csv("monsters.csv", index=False)
df

Index,Name,Type,Level,Rarity,Damage,Health,Energy,Sanity,Time Stamp
0,Mummy Lord,Undead,4,Rank 2,4d6+3,23.42,25.73,23.34,2022-03-31 06:49:44
1,Wyvern,Dragon,8,Rank 1,8d4+1,30.41,33.54,32.28,2022-03-31 06:49:44
2,Lightning Elemental,Elemental,5,Rank 3,5d8,38.77,42.68,43.58,2022-03-31 06:49:44
3,Skeletal Archer,Undead,9,Rank 2,9d6,54.32,55.16,53.69,2022-03-31 06:49:44
4,Kobold Knight,Devilkin,2,Rank 0,2d2,4.42,4.82,4.00,2022-03-31 06:49:44
...,...,...,...,...,...,...,...,...,...
4995,Steam Archfey,Fey,5,Rank 1,5d4+2,21.84,19.02,20.14,2022-03-31 06:49:44
4996,Night Hag,Demonic,13,Rank 0,13d2+3,25.50,25.79,25.15,2022-03-31 06:49:44
4997,Platinum Dragon,Dragon,8,Rank 2,8d6+1,45.80,49.13,46.77,2022-03-31 06:49:44
4998,Dust Spirit,Fey,2,Rank 1,2d4+1,7.12,8.09,8.64,2022-03-31 06:49:44


In [6]:
target = "Rarity"
features = ["Level", "Health", "Energy", "Sanity"]

## Hyperparameter Tuning

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    df[features],
    df[target],
    test_size=0.20,
    random_state=42,
    stratify=df[target],
)

param_dist = {
    "criterion": ("gini", "entropy"),
    "max_depth": (3, 6, 9),
    "max_features": (2, 3, 4)
}

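# n_iter is the size of the full grid (2 * 3 * 3 = 18). Because every
# parameter is given as a list of options, the randomized search samples
# without replacement and therefore tries every combination.
n_iter = 1
for arr in param_dist.values():
    n_iter *= len(arr)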

search = RandomizedSearchCV(
    RandomForestClassifier(
        n_estimators=333,
        random_state=42,
    ),
    param_distributions=param_dist,
    n_iter=n_iter,
    n_jobs=-1,
    cv=7,
    random_state=42,
)
search.fit(X_train, y_train)

RandomizedSearchCV(cv=7,
                   estimator=RandomForestClassifier(n_estimators=333,
                                                    random_state=42),
                   n_iter=18, n_jobs=-1,
                   param_distributions={'criterion': ('gini', 'entropy'),
                                        'max_depth': (3, 6, 9),
                                        'max_features': (2, 3, 4)},
                   random_state=42)

In [8]:
search.best_estimator_

RandomForestClassifier(criterion='entropy', max_depth=9, max_features=3,
                       n_estimators=333, random_state=42)

In [9]:
search.best_score_

0.9829990499867909
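
Because `n_iter` is the product of the option counts, the search above covers every combination in `param_dist`. A minimal sketch of that arithmetic, using only the standard library and a copy of the same dictionary:

```python
from itertools import product
from math import prod

# Same search space as the tuning cell above.
param_dist = {
    "criterion": ("gini", "entropy"),
    "max_depth": (3, 6, 9),
    "max_features": (2, 3, 4),
}

# Size of the full grid: 2 * 3 * 3 = 18.
n_iter = prod(len(options) for options in param_dist.values())

# Enumerate every combination as a dict of keyword arguments.
combinations = [
    dict(zip(param_dist, values))
    for values in product(*param_dist.values())
]

print(n_iter)           # 18
print(combinations[0])  # {'criterion': 'gini', 'max_depth': 3, 'max_features': 2}
```

With `cv=7`, those 18 candidates amount to 126 cross-validation fits, plus one final refit of the best candidate on the full training set.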

## Model Interface: RandomForestClassifier

### Abstraction, Encapsulation, Polymorphism

Below is one example of an abstraction that encapsulates an ML model and exposes a few customization points. Here we'll use a class interface, but functions can work too.

You can parameterize every aspect of the model by adding arguments to the `__init__` method, but be careful not to overdo it. Keep your calling signature simple and usable. Provide good defaults and well-named arguments, and your users will enjoy using your code. Make it too complicated, and they may as well use scikit-learn directly.
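
One way to keep a signature small while still allowing customization is to group the tunable values in a settings object with sensible defaults. The sketch below is only an illustration: `ForestSettings` is a hypothetical name, not part of MonsterLab or scikit-learn, and its defaults simply mirror the values found by the search above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ForestSettings:
    """Hypothetical settings object with defaults taken from the tuning step."""

    criterion: str = "entropy"
    max_depth: int = 9
    max_features: int = 3
    test_size: float = 0.20
    random_state: int = 42

    def __post_init__(self):
        # Reject values that would make the train/test split meaningless.
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be between 0 and 1")


# Callers who are happy with the defaults pass nothing at all.
default_settings = ForestSettings()

# Callers who need a change override only the argument they care about.
shallow_settings = ForestSettings(max_depth=6)

print(asdict(shallow_settings)["max_depth"])  # 6
```

A class such as `Model` could then accept a single optional settings argument instead of a long list of keyword arguments.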

A good interface encapsulates the core logic so that the rest of the app is unaware of how it works but can still interact with it. In other words, a good interface is more abstract than the core logic it encapsulates. At this higher level of abstraction, it becomes easier to replace the core logic without disrupting other parts of the app.

...and now a word from our sponsor, Polymorphism!

One hypothetical example of polymorphism: suppose we designed more than one ML model, possibly with two different ML libraries, and gave them compatible interfaces. We could then swap one model library for another at any time during development without rewriting the whole app. A polymorphic system is built to be modular from the start.
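
The idea can be sketched without any ML library at all. Everything below is hypothetical: `Classifier`, `ThresholdModel`, `ConstantModel`, and `describe` are illustrative names, and the two stand-in models use toy rules rather than real training. What matters is that the application code depends only on the shared calling convention.

```python
from typing import List, Protocol, Sequence, Tuple


class Classifier(Protocol):
    """Hypothetical shared interface: rows in, (label, confidence) pairs out."""

    def __call__(self, rows: Sequence[dict]) -> List[Tuple[str, float]]: ...


class ThresholdModel:
    """Stand-in for a model built with one library."""

    def __call__(self, rows):
        return [("Rank 1" if row["Level"] >= 5 else "Rank 0", 1.0) for row in rows]


class ConstantModel:
    """Stand-in for a model built with a different library."""

    def __call__(self, rows):
        return [("Rank 0", 0.5) for _ in rows]


def describe(model: Classifier, rows: Sequence[dict]) -> List[str]:
    """Application code: it relies on the interface, not on a concrete class."""
    return [f"{label} ({confidence:.0%})" for label, confidence in model(rows)]


rows = [{"Level": 2}, {"Level": 8}]
print(describe(ThresholdModel(), rows))  # ['Rank 0 (100%)', 'Rank 1 (100%)']
print(describe(ConstantModel(), rows))   # ['Rank 0 (50%)', 'Rank 0 (50%)']
```

Swapping one model for the other changes a single constructor call; `describe` stays untouched.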

In [10]:
class Model:

    def __init__(self, df: DataFrame, target: str, features: List[str]):
        X_train, X_test, y_train, y_test = train_test_split(
            df[features],
            df[target],
            test_size=0.20,
            random_state=42,
            stratify=df[target],
        )
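        # Hyperparameters come from the search above. n_estimators is left at
        # the scikit-learn default rather than the 333 used while tuning.
        self.model = RandomForestClassifier(
            criterion="entropy",
            max_depth=9,
            max_features=3,
            n_jobs=-1,
            random_state=42,
        )
        self.model.fit(X_train, y_train)
        # Baseline: accuracy expected from guessing uniformly among the classes.
        self.baseline_score = 1 / df[target].unique().shape[0]
        # Accuracy on the held-out 20% of the data.
        self.test_score = self.model.score(X_test, y_test)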

    def __call__(self, feature_basis: DataFrame) -> List[Tuple]:
        """Return a (predicted label, highest class probability) pair per row."""
        prediction = self.model.predict(feature_basis)
        probability = self.model.predict_proba(feature_basis)
        return list(zip(prediction, map(max, probability)))

    def __repr__(self):
        output = (
            "Model: Random Forest Classifier",
            f"Baseline Score: {self.baseline_score:.1%}",
            f"Testing Score: {self.test_score:.1%}",
        )
        return "\n".join(output)

    def __str__(self):
        return repr(self)

    def save(self, filepath: str) -> None:
        joblib.dump(self, filepath)

    @staticmethod
    def open(filepath: str) -> "Model":
        # joblib.load unpickles the file, so only open files you trust.
        return joblib.load(filepath)

In [11]:
model = Model(df=df, target=target, features=features)
model

Model: Random Forest Classifier
Baseline Score: 16.7%
Testing Score: 98.7%
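
The baseline reported here is `1 / number of distinct target values`, the accuracy expected from guessing uniformly; 16.7% corresponds to six classes. A minimal sketch of that calculation, alongside the majority-class baseline as a common alternative, using hypothetical labels that are not drawn from MonsterLab:

```python
from collections import Counter

# Hypothetical labels for illustration only.
labels = ["Rank 0"] * 5 + ["Rank 1"] * 3 + ["Rank 2"] * 2

# Uniform-guess baseline, as in Model.__init__: 1 / number of classes.
uniform_baseline = 1 / len(set(labels))

# Majority-class baseline: always predict the most common label.
majority_baseline = Counter(labels).most_common(1)[0][1] / len(labels)

print(f"{uniform_baseline:.1%}")   # 33.3%
print(f"{majority_baseline:.1%}")  # 50.0%
print(f"{1 / 6:.1%}")              # 16.7%, the six-class value reported above
```

When the classes are imbalanced, the majority-class figure is the higher of the two and is arguably the fairer bar for the model to clear.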

In [12]:
model.model

RandomForestClassifier(criterion='entropy', max_depth=9, max_features=3,
                       n_jobs=-1, random_state=42)

### Make Predictions

In [17]:
test_cases = DataFrame([
    {
        "Level": 1,
        "Health": 5,
        "Energy": 5,
        "Sanity": 5,
    },
    {
        "Level": 10,
        "Health": 75,
        "Energy": 75,
        "Sanity": 75,
    },
])

In [19]:
model(test_cases)

[('Rank 1', 0.7921387626688972), ('Rank 2', 0.71)]
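
Each tuple pairs a predicted label with the highest class probability for that row. The pairing in `Model.__call__` can be reproduced with plain lists; the numbers below are hypothetical stand-ins for the outputs of `predict` and `predict_proba`, not real model output.

```python
# Hypothetical outputs standing in for predict() and predict_proba().
prediction = ["Rank 1", "Rank 2"]
probability = [
    [0.10, 0.79, 0.11],  # one row per input, one column per class
    [0.05, 0.24, 0.71],
]

# Same pairing as Model.__call__: label plus its highest class probability.
result = list(zip(prediction, map(max, probability)))
print(result)  # [('Rank 1', 0.79), ('Rank 2', 0.71)]
```

For a scikit-learn random forest, the predicted label is the class with the highest averaged probability, so the maximum of each row is the probability of the label it is paired with.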

### Serialization

In [20]:
model.save("model.job")

In [21]:
model = Model.open("model.job")
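
After loading, it is worth confirming that the restored object matches what was saved. joblib builds on Python's pickle protocol, so this sketch swaps in the standard-library `pickle` module to show the round trip; the `state` dictionary and the `model.pkl` file name are hypothetical stand-ins for a fitted `Model` and its file.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: plain picklable state, purely illustrative.
state = {"name": "Random Forest Classifier", "test_score": 0.987}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "model.pkl"

    # Write the object to disk.
    with path.open("wb") as file:
        pickle.dump(state, file)

    # Read it back; only unpickle files you trust.
    with path.open("rb") as file:
        restored = pickle.load(file)

print(restored == state)  # True
```

One caveat applies to pickle-based formats in general: the class of the saved object must be importable wherever the file is loaded, so a `Model` class defined inside a notebook may need to move into a module before another program can open the file.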