# Machine Learning Engineer

## Introduction

### Random Monsters: MonsterLab & Fortuna

Fortuna is a random value toolkit by Robert Sharp. If you would like to know more, see the [Fortuna Documentation](https://pypi.org/project/Fortuna/). Unfortunately, Fortuna is currently incompatible with Windows, so it is recommended to run this notebook with Colab or Jupyter on WSL. Fortuna is 100% compatible with all *nix systems, including macOS.

In [None]:
!pip install MonsterLab --upgrade

In [None]:
import pandas as pd
import datetime
from time import sleep
from MonsterLab import Monster

Before we can do machine learning we need some data!

A Random Monster

In [None]:
Monster()

Generate Mock Monster Data.

5000 rows should make for a good model, but experiment with different values for the `number` variable below and see what you find.

In [None]:
number = 5000

df = pd.DataFrame(Monster().to_dict() for _ in range(number))

df.to_csv("monsters.csv", index=False)

df

## Assignment 1 - ML Model Interface Intro

### Abstraction, Encapsulation, Polymorphism

Below is one example of an abstraction that encapsulates an ML model and exposes some customization points. Here we'll use a class interface, but functions can work too.

You can parameterize every aspect of the model by adding arguments to the init method. Be mindful: you don't want to overdo it here. Keep your calling signature simple and usable. Provide good defaults and well-named arguments, and your users will enjoy using your code. Make it super complicated and they may as well just use Scikit themselves.
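As a rough illustration of a simple signature with good defaults, here is a minimal sketch. `split_rows` is a hypothetical helper written only for this example (it is not part of Scikit or of this notebook); the default values mirror the `test_size` and `random_state` used in the model interface further down.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence,
               *,
               test_size: float = 0.20,
               random_state: int = 42) -> Tuple[List, List]:
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Most callers never touch the optional arguments...
train, test = split_rows(range(100))
print(len(train), len(test))  # 80 20

# ...but the customization point is there when it is needed.
train, test = split_rows(range(100), test_size=0.5)
print(len(train), len(test))  # 50 50
```

The keyword-only arguments (everything after `*`) keep call sites readable, since nobody can pass a bare `0.5` without naming what it means.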

A good interface should always encapsulate the core logic in such a way that the rest of the app is totally unaware of how it works, but can still interact with the core logic in a general way. One might say that a good interface is always more abstract than the core logic it encapsulates. At this higher abstraction level it becomes easier to replace our core logic without disrupting parallel development on other parts of the app. And now a word from our sponsor, Polymorphism.

One hypothetical example of Polymorphism is if we designed more than one ML model, possibly with two different ML libraries, and then gave them compatible interfaces. This gives us the ability to trade one model library for another without rewriting the whole app. We could do that at any time during development without disrupting anything.

A Polymorphic system is built to be modular from the start.
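To make the idea concrete, here is a small sketch with two toy "models" that share one calling convention. `ThresholdModel`, `ConstantModel` and `report` are hypothetical names invented for this illustration; real models would wrap actual ML libraries, but the app-side code would look much the same.

```python
from typing import Sequence, Tuple


class ThresholdModel:
    """Hypothetical stand-in for a model built with one library."""

    def __call__(self, pred_input: Sequence[float]) -> Tuple[str, float]:
        label = "Rank 1" if sum(pred_input) > 10 else "Rank 0"
        return label, 0.75


class ConstantModel:
    """Hypothetical stand-in for a model built with another library."""

    def __call__(self, pred_input: Sequence[float]) -> Tuple[str, float]:
        return "Rank 0", 0.50


def report(model, pred_input: Sequence[float]) -> str:
    """App code: relies only on the shared call signature."""
    pred, conf = model(pred_input)
    return f"Prediction: {pred}, Confidence: {100 * conf:.0f}%"


# Either model can be dropped in without touching report()
for model in (ThresholdModel(), ConstantModel()):
    print(report(model, [1, 2, 2, 2]))
```

Because `report` only depends on "call it, get back a prediction and a confidence", swapping one model for the other requires no changes anywhere else.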

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Model Interface: RandomForest Example

In [None]:
class Model:

    def __init__(self, target: pd.Series, features: pd.DataFrame):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            features,
            target,
            test_size=0.20,
            stratify=target,
            random_state=42,
        )
        self.model = RandomForestClassifier(
            n_jobs=-1,
            random_state=42,
        )
        self.model.fit(self.X_train, self.y_train)
        self.train_score = self.model.score(self.X_train, self.y_train) - 0.0001
        self.test_score = self.model.score(self.X_test, self.y_test)
    
    def __call__(self, pred_input):
        prediction, *_ = self.model.predict([pred_input])
        probability, *_ = self.model.predict_proba([pred_input])
        confidence = max(probability) * self.test_score
        return prediction, confidence
    
    def __repr__(self):
        train = f"Training Score: {100*self.train_score:.2f}%"
        test = f"Validation Score: {100*self.test_score:.2f}%"
        return f"Model(target, features)\n{train}\n{test}"

    def __str__(self):
        return self.__repr__()
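The confidence returned by `__call__` is the highest class probability scaled by the test score. A quick worked example, using made-up numbers rather than output from the model above:

```python
# Illustrative numbers only, not output from the model above
probability = [0.10, 0.70, 0.20]   # one row of predict_proba output
test_score = 0.90                  # accuracy on the held-out test split

# Highest class probability, discounted by how often the model is right
confidence = max(probability) * test_score
print(f"Confidence: {100 * confidence:.0f}%")  # Confidence: 63%
```

Scaling by the test score is a heuristic of this interface: a model that is only right 90% of the time on unseen data reports at most 90% confidence.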

Create a model interface for your favorite Scikit model by completing the code below.

In [None]:
class MyModel:

    def __init__(self, target: pd.Series, features: pd.DataFrame):
        # YOUR CODE HERE
        ...
    
    def __call__(self, pred_input):
        prediction, *_ = self.model.predict([pred_input])
        probability, *_ = self.model.predict_proba([pred_input])
        confidence = max(probability) * self.test_score
        return prediction, confidence
    
    def __repr__(self):
        train = f"Training Score: {100*self.train_score:.2f}%"
        test = f"Validation Score: {100*self.test_score:.2f}%"
        return f"Model(target, features)\n{train}\n{test}"

    def __str__(self):
        return self.__repr__()

Read data from the `monsters.csv` file we created in a previous step.

In [None]:
df = pd.read_csv(...)
df

Drop or encode non-numeric data, except for our rarity target. We'll need that one.

In [None]:
df = df.drop(...)
df

### Set Target & Features

Target

In [None]:
target = df["Rarity"]
target

Features

In [None]:
features = df.drop(columns=["Rarity"])
features

### Model Scoring

Train the model on target and features defined above.

In [None]:
model = Model(...)
print(model)

In [None]:
model.model

### Prediction Function

In [None]:
def prediction(pred_input, model):
    pred, conf = model([*pred_input])
    return f"Prediction: {pred}", f"Confidence: {100*conf:.0f}%"

In [None]:
test_case = {
    "level": 1,
    "health": 2,
    "energy": 2,
    "sanity": 2, 
}

### Make Predictions

In [None]:
pred, confidence = prediction(list(test_case.values()), model)

print(pred)
print(confidence)

Make some more test cases and make some predictions.

In [None]:
# YOUR CODE HERE

# Assignment 2 - Pipelines & Tuning Review

## Hyper-parameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
features.shape

In [None]:
target.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size=0.20,
    stratify=target,
    random_state=42,
)

In [None]:
X_train.shape

In [None]:
y_train.shape

Calculate the Population Target Class Weight

In [None]:
total = target.shape[0]
counts = target.value_counts()

# Flat weights are like a baseline for comparison
class_weight_flat = dict(zip(counts.index, [0.1666] * len(counts)))
class_weight = dict(zip(counts.index, map(lambda x: x / total, counts)))

print(class_weight_flat, class_weight, sep='\n')
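To see what the two weighting strategies look like on a small scale, here is a sketch using a hypothetical toy target in place of the monster data. All `toy_` names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy target, not the notebook's monster data
toy_target = ["Rank 0"] * 6 + ["Rank 1"] * 3 + ["Rank 2"] * 1

toy_counts = Counter(toy_target)
toy_total = len(toy_target)

# Flat: every class gets the same weight
toy_flat = {label: 1 / len(toy_counts) for label in toy_counts}

# Population: each class is weighted by its share of the data
toy_population = {label: n / toy_total for label, n in toy_counts.items()}

print(toy_flat)
print(toy_population)  # {'Rank 0': 0.6, 'Rank 1': 0.3, 'Rank 2': 0.1}
```

Both dictionaries sum to 1, but only the population weights reflect how common each class actually is.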

Compare the two class weight strategies by completing the code below. Which is better - flat weights or the custom class weights, and why?

In [None]:
param_dist = {
    "bootstrap": (True, False),
    "criterion": ("gini", "entropy"),
    "max_depth": (None, 3, 6, 9),
    "class_weight": (...),
}

n_iter = 1
for arr in param_dist.values():
    n_iter *= len(arr)
n_iter

search = RandomizedSearchCV(
    RandomForestClassifier(
        class_weight=class_weight,
        n_estimators=333,
        random_state=42,
    ),
    param_distributions=param_dist,
    n_iter=n_iter,
    n_jobs=-1,
    cv=7,
    random_state=42,
)

search.fit(X_train, y_train)
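The `n_iter` loop above multiplies together the number of options for each parameter, so `n_iter` equals the total number of combinations. The sketch below repeats that calculation with `math.prod`, using only the three fully specified entries (`class_weight` is left out because the exercise asks you to fill it in).

```python
from math import prod

# Only the fully specified entries from param_dist are repeated here
known_options = {
    "bootstrap": (True, False),
    "criterion": ("gini", "entropy"),
    "max_depth": (None, 3, 6, 9),
}

# Product of the option counts = number of distinct combinations
n_combinations = prod(len(options) for options in known_options.values())
print(n_combinations)  # 2 * 2 * 4 = 16
```

Assuming Scikit-learn samples without replacement when every entry is a plain sequence of options, setting `n_iter` to this product means the randomized search should end up visiting every combination once, much like an exhaustive grid search.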

In [None]:
def validation_accuracy(model):
    validation_set = [Monster().to_dict() for _ in range(1000)]
    df = pd.DataFrame(validation_set)
    df = df.drop(columns=["Name", "Damage", "Type", "Time Stamp"])
    y_val = df['Rarity']
    x_val = df.drop(columns=['Rarity'])
    naive_baseline = 1 / len(y_val.unique())
    weighted_baseline = max(class_weight.values())
    print(f"Naive Baseline: {100 * naive_baseline:.2f}%")
    print(f"Weighted Baseline: {100 * weighted_baseline:.2f}%")
    print(f"Test Accuracy: {100 * model.score(X_test, y_test):.2f}%")
    print(f"Validation Accuracy: {100 * model.score(x_val, y_val):.2f}%")

In [None]:
validation_accuracy(search)
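The two baselines printed by `validation_accuracy` answer the question "how well would a model do without learning anything?" Here is a minimal sketch with hypothetical toy labels (the `demo_` names are assumptions for this example):

```python
from collections import Counter

# Hypothetical toy labels, for illustration only
demo_labels = ["Rank 0"] * 6 + ["Rank 1"] * 3 + ["Rank 2"] * 1
demo_counts = Counter(demo_labels)

# Guessing uniformly at random among the classes
demo_naive = 1 / len(demo_counts)

# Always guessing the most common class
demo_weighted = max(demo_counts.values()) / len(demo_labels)

print(f"Naive Baseline: {100 * demo_naive:.2f}%")        # 33.33%
print(f"Weighted Baseline: {100 * demo_weighted:.2f}%")  # 60.00%
```

On imbalanced data the weighted baseline is the tougher one to beat, so it is usually the more honest comparison for the test and validation accuracy.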

## Machine Learning Pipeline

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(
        class_weight=class_weight,
        n_estimators=333,
        random_state=42,
        n_jobs=-1,
    ),
).fit(features, target)
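Note that the pipeline above is fit on all of `features` and `target`, so scoring it on `X_test` would not give an independent estimate. As a hedged sketch of the alternative, the example below fits a similar pipeline on a training split only. It uses synthetic data from `make_classification` and a smaller forest to keep it quick; the `demo_` names and dataset settings are assumptions for this illustration, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 4 numeric features, 2 classes
demo_X, demo_y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X,
    demo_y,
    test_size=0.20,
    stratify=demo_y,
    random_state=42,
)

# The scaler and the forest are fit together, on the training split only
demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
).fit(demo_X_train, demo_y_train)

demo_score = demo_pipeline.score(demo_X_test, demo_y_test)
print(f"Test Accuracy: {100 * demo_score:.2f}%")
```

Keeping the scaler inside the pipeline means it only ever sees training rows during fitting, which avoids leaking test-set statistics into the model.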

# Assignment 3 - Clustering Model & Lookup Table

## Custom Lookup Table - Synthetic Distribution

A lookup table is better than machine learning if you have **all the data**. 

Here we do not have all the data, so we'll use the lookup table to design a hard-mode test rather than to make predictions.

This gives us systematic domain coverage with less-random data. It's still random, but not as random as before, because we're specifying exact levels and rarity ranks. In fact, we have every possible combination of those two features. This is good domain coverage, but it is not complete. Something like this is not suitable for making predictions, but it can be useful for designing a hard-mode validation test.

A test based on this lookup table is hard-mode because our data typically follows a non-flat distribution of level and rarity. This table flattens that distribution: we'll have exactly one of each combination of level and rarity, even though in the wild it would take a very long time to naturally get at least one of each combination.

In [None]:
ranks = [
    "Rank 0",
    "Rank 1",
    "Rank 2",
    "Rank 3",
    "Rank 4",
    "Rank 5",
]

levels = range(1, 21)

monsters = [
    Monster(level=level, rarity=rank).to_dict() 
    for rank in ranks 
    for level in levels
]
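The nested comprehension above visits every (rank, level) pair exactly once. The sketch below counts those pairs with `itertools.product`, without creating any monsters; the `demo_` names are assumptions for this example.

```python
from itertools import product

demo_ranks = [f"Rank {n}" for n in range(6)]
demo_levels = range(1, 21)

# Same ordering as the comprehension: ranks outer, levels inner
demo_pairs = list(product(demo_ranks, demo_levels))

print(len(demo_pairs))  # 6 ranks * 20 levels = 120
print(demo_pairs[0])    # ('Rank 0', 1)
print(demo_pairs[-1])   # ('Rank 5', 20)
```

So the lookup table holds 120 monsters, one per combination, regardless of how rare that combination would be in randomly generated data.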

A Lookup Table

In [None]:
level = 1
rarity = "Rank 0"

df_lookup = pd.DataFrame(monsters)
targets = df_lookup[
    (df_lookup["Level"] == level) & (df_lookup["Rarity"] == rarity)
].to_dict(orient="records")
targets

Parameterize the above code by defining an interface for the core logic with inputs for level and rarity. It should return a target list of best matches. You can use a class or function.

In [None]:
# YOUR CODE HERE

Prediction Test: Hard Mode

In [None]:
def pred_test(target, model):
    pred, conf = model([
        target["Level"], 
        target["Health"], 
        target["Energy"], 
        target["Sanity"],
    ])
    keys = ["Actual", "Prediction", "Confidence", "Correct"]
    # Use the target's own rarity as the actual label, not the global `rarity`
    values = [target["Rarity"], pred, conf, pred == target["Rarity"]]
    return dict(zip(keys, values))

Why is this prediction test "hard mode"?

In [None]:
df = pd.DataFrame(pred_test(target, model) for target in monsters)
df

Get the average of the "Correct" column.

In [None]:
# YOUR CODE HERE

Get the average of the "Confidence" column.

In [None]:
# YOUR CODE HERE

What can be said about the correctness average vs. the confidence average? Is this result what you expected, and why or why not?

## Clustering Model: KNN

When you don't know what else to do, when you don't even know what the target should be, clustering can help.
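Conceptually, a nearest-neighbors lookup measures the distance from a query point to every stored row and returns the closest few. Here is a pure-Python sketch of that idea using Euclidean distance. The `points` rows and the `nearest_indices` helper are hypothetical, written only for this illustration; Scikit's `NearestNeighbors` does the same job with faster data structures.

```python
from heapq import nsmallest
from math import dist

# Hypothetical [level, health, energy, sanity] rows, for illustration only
points = [
    [1, 2, 2, 2],
    [2, 3, 3, 3],
    [2, 4, 3, 3],
    [10, 20, 20, 20],
    [20, 40, 40, 40],
]


def nearest_indices(query, rows, n_neighbors):
    """Return indices of the rows closest to query (Euclidean distance)."""
    return nsmallest(
        n_neighbors,
        range(len(rows)),
        key=lambda i: dist(query, rows[i]),
    )


print(nearest_indices([2, 3, 3, 3], points, n_neighbors=3))  # [1, 2, 0]
```

The result is a list of row indices, which is why the model below keeps a separate lookup list: the indices are used to fetch the full monster records.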

In [None]:
from sklearn.neighbors import NearestNeighbors
from typing import Iterable, Iterator, Dict

Make 5000 fresh data points with a list comprehension.

In [None]:
# Cluster Lookup Data
monsters = [...]

Drop or encode non-numeric values.

In [None]:
# Cluster Training Data
df = pd.DataFrame(monsters).drop(columns=["Name", "Damage", "Type", "Time Stamp", "Rarity"])

Complete the code below.

In [None]:
class ClusterModel:

    def __init__(self, 
                 lookup_data: Iterable[Dict], 
                 training_data: pd.DataFrame, 
                 n_neighbors: int):
        self.lookup = lookup_data
        self.knn = NearestNeighbors(...)
        self.knn.fit(...)

    def __call__(self, inputs: Iterable[int]) -> Iterator[Dict]:
        nearest = self.knn.kneighbors([inputs], return_distance=False)[0]
        return map(lambda n: self.lookup[n], nearest)

In [None]:
cluster = ClusterModel(
    lookup_data=..., 
    training_data=..., 
    n_neighbors=...,
)

In [None]:
for monster in cluster([2, 3, 3, 3]):
    print(monster)

# Assignment 4 - Model Serialization

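Pickle is another option for serialization, but joblib is recommended when building production apps, in part because it handles objects that carry large NumPy arrays (such as fitted Scikit models) more efficiently.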
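For comparison, here is what a save-and-load round trip looks like with the standard library's `pickle`. The `state` dictionary is a hypothetical stand-in for a fitted model, and the file lives in a temporary folder so nothing is left behind.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model's state
state = {"classes": ["Rank 0", "Rank 1"], "test_score": 0.9}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "state.pkl"

    # Save: serialize the object to a file
    with open(path, "wb") as file:
        pickle.dump(state, file)

    # Load: rebuild an equivalent object from that file
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == state)  # True
```

Whichever library you use, only load files you trust: unpickling can execute arbitrary code, and that caveat applies to joblib files as well.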

In [None]:
from joblib import dump, load

Save the `model.job` with `dump()`

In [None]:
...(model, "model.job")

Open the `model.job` with `load()`

In [None]:
saved_model = ...("model.job")

## Bonus Round - The Machine

If assignment 4 took less than 10 minutes, you should do the Bonus Round.

Implement the model interface below. Use your favorite classification model from any library.

The Machine should take a `target` pd.Series and a `features` pd.DataFrame as input. It should then do a train/test split, define the model you want to use, and fit it with the training set. See the Model interface in assignment 1 for inspiration.

This interface will serve as an [Abstraction Layer](https://en.wikipedia.org/wiki/Abstraction_layer) for your model. Abstraction layers are one of the most overlooked and undervalued constructs in all of programming. In this assignment, we will [encapsulate](https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)) or abstract away the type of model we're using by creating an interface class. This interface could be replaced by another one that wraps a different type of model. As long as the same methods with the same signatures are on both interfaces, the rest of the app won't even know. The polymorphic abstraction layer gives us this ability, without the rest of the app being reworked, because all calls to the model travel through the same interface.

Objects that can replace each other like this are said to be [Polymorphic](https://en.wikipedia.org/wiki/Polymorphism_(computer_science)).
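One optional way to write the shared interface down is `typing.Protocol`, which lets a type checker confirm that two unrelated classes have the same shape. The sketch below is an illustration under assumptions: `Predictor`, `AlwaysCommon`, `LevelBased` and `run` are hypothetical names, and the toy classes stand in for real models.

```python
from typing import Protocol, Sequence, Tuple


class Predictor(Protocol):
    """Hypothetical name for the interface every machine must satisfy."""

    test_score: float

    def __call__(self, features: Sequence[float]) -> Tuple[str, float]:
        ...


class AlwaysCommon:
    """Toy machine: always predicts the same rank."""

    test_score = 0.5

    def __call__(self, features: Sequence[float]) -> Tuple[str, float]:
        return "Rank 0", self.test_score


class LevelBased:
    """Toy machine: predicts from the first feature only."""

    test_score = 0.8

    def __call__(self, features: Sequence[float]) -> Tuple[str, float]:
        label = "Rank 1" if features[0] >= 10 else "Rank 0"
        return label, self.test_score


def run(machine: Predictor, features: Sequence[float]) -> str:
    """App code: depends on the Predictor shape, not on a concrete class."""
    label, confidence = machine(features)
    return f"{label} ({100 * confidence:.0f}%)"


print(run(AlwaysCommon(), [1, 2, 2, 2]))    # Rank 0 (50%)
print(run(LevelBased(), [12, 20, 20, 20]))  # Rank 1 (80%)
```

Neither toy class inherits from `Predictor`; they qualify simply by having a matching `test_score` attribute and `__call__` signature.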

In [None]:
from typing import Tuple  # needed for the return annotation below


class Machine:

    def __init__(self, target: pd.Series, features: pd.DataFrame):
        ...

    def __call__(self, features) -> Tuple[str, float]:
        ...

In [None]:
custom_machine = Machine(target, features)

In [None]:
dump(custom_machine, "machine.job")

In [None]:
machine = load("machine.job")

In [None]:
total = target.shape[0]
counts = target.value_counts()
class_weight = dict(zip(counts.index, map(lambda x: x / total, counts)))

def validation_accuracy(custom_model):
    validation_set = [Monster().to_dict() for _ in range(1000)]
    df = pd.DataFrame(validation_set)
    df = df.drop(columns=["Name", "Damage", "Type", "Time Stamp"])
    y_val = df['Rarity']
    x_val = df.drop(columns=['Rarity'])
    naive_baseline = 1 / len(y_val.unique())
    weighted_baseline = max(class_weight.values())
    print(f"Naive Baseline: {100 * naive_baseline:.2f}%")
    print(f"Weighted Baseline: {100 * weighted_baseline:.2f}%")
    print(f"Test Accuracy: {100 * custom_model.test_score:.2f}%")
    print(f"Validation Accuracy: {100 * custom_model.model.score(x_val, y_val):.2f}%")

In [None]:
validation_accuracy(machine)