# Key features of NannyML

NannyML is an open-source library that helps data scientists monitor their models in production. Do you recall what its key features are?

### Possible Answers

    Performance estimation and calculation{Answer}


    Multivariate and univariate drift detection{Answer}


    Measuring operational metrics


    Data quality monitoring{Answer}


    Business value calculation and estimation {Answer}

In [1]:
# exercise 01

"""
Load the dataset

NannyML comes with a set of internal datasets in order to make it easier to demo use cases and test different algorithms. To load the dataset, you only need to use the nannyml.load_us_census_ma_employment_data() function.

The function returns three pandas DataFrame objects: the reference set (the test set), the analysis set (unseen production data), and the ground truth for the analysis set. By convention, these data frames are named reference, analysis, and analysis_gt.

In this exercise, you will load the US Census Employment dataset and print the data frames to understand what they look like.
"""

# Instructions

"""

    Import the nannyml library.

    Load the US Census Employment dataset from the nannyml library.

    Print the head of the reference data.

    Print the head of the analysis data.

"""

# solution

# Import nannyml
import nannyml

# Load US Census Employment dataset
reference, analysis, analysis_gt = nannyml.load_us_census_ma_employment_data()

# Print head of the reference data
print(reference.head())

# Print head of the analysis data
print(analysis.head())

#----------------------------------#

# Conclusion

"""
Great work! Now you know the basics of NannyML. In the next video, you will learn how to create reference and analysis sets from any raw data!
"""


# Reference or analysis period?

In the last video, you learned about the two data periods NannyML operates on: the reference and analysis periods. You need to understand these periods before you can move on to monitoring performance and covariate shift in production.

Now, let's test your memory. Can you remember the characteristics of each of these periods?

![Answer](images/ch01-01.png)
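
To make the two periods concrete, here is a minimal pandas sketch that is not taken from the course. It assumes a hypothetical monitoring table with `timestamp`, `y_pred`, and `y_true` columns and a hypothetical deployment date. The reference period keeps its labels, while the analysis period is treated as if the ground truth had not arrived yet.

```python
import pandas as pd

# Hypothetical monitoring data: one row per prediction, ordered in time.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="D"),
    "y_pred": [1, 0, 1, 1, 0, 1],
    "y_true": [1, 0, 0, 1, 0, 1],
})

# Hypothetical date on which the model went to production.
deployment_date = pd.Timestamp("2024-01-04")

# Reference period: data with known model behavior (for example, the test set),
# so both predictions and ground truth are available.
reference = df[df["timestamp"] < deployment_date].copy()

# Analysis period: production data, where the ground truth may be delayed or missing.
analysis = df[df["timestamp"] >= deployment_date].drop(columns="y_true")

print(reference.shape, analysis.shape)
```

In practice the split usually follows how the data was produced (test set versus production logs) rather than a single cut-off date; the date is only used here to keep the example short.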

In [None]:
# exercise 02

"""
Loading and splitting the data

To deploy and monitor a model in production, you must first create it. In the last video, you were introduced to loading and processing data, building the model, and creating reference and analysis sets.

In this exercise, you'll follow a similar process, but to simplify matters, you'll use the NYC Green Taxi dataset provided in a CSV file that's already been processed.

For this exercise, pandas has been imported as pd and is ready for you to use.
"""

# Instructions

"""

    Assign green_taxi_dataset.csv to the dataset_name variable.
    Use pd.read_csv() to load the dataset.
    Show the head of the dataset.
---

    Split the dataset into a training set, a test set, and a production (prod) set.

"""

# solution

# Load the dataset
import pandas as pd
dataset_name = "datasets/green_taxi_dataset.csv"
data = pd.read_csv(dataset_name, parse_dates=['lpep_pickup_datetime'], low_memory=False)
# The timestamp is left out of the features: it is not a model input and is joined back later
features = ['PULocationID', 'DOLocationID', 'trip_distance', 'fare_amount', 'pickup_time']
target = 'tip_amount'

display(data.head())

# Split the training data
X_train = data.loc[data['partition'] == 'train', features]
y_train = data.loc[data['partition'] == 'train', target]

# Split the test data
X_test = data.loc[data['partition'] == 'test', features]
y_test = data.loc[data['partition'] == 'test', target]

# Split the prod data
X_prod = data.loc[data['partition'] == 'prod', features]
y_prod = data.loc[data['partition'] == 'prod', target]

#----------------------------------#

# Conclusion

"""
Great job! Loading and splitting the data are the first steps in creating reference and analysis sets for monitoring in production!
"""

Unnamed: 0,lpep_pickup_datetime,PULocationID,DOLocationID,trip_distance,fare_amount,tip_amount,pickup_time,partition
0,2016-12-01 00:00:02,82,129,0.6,5.0,1.0,0,train
1,2016-12-01 00:01:57,255,7,4.53,17.5,4.7,0,train
2,2016-12-01 00:04:17,65,195,1.94,9.5,0.0,0,train
3,2016-12-01 00:06:45,41,41,1.0,5.5,2.0,0,train
4,2016-12-01 00:09:18,74,42,2.02,8.5,1.0,0,train



In [None]:
# exercise 03

"""
Creating reference and analysis set

After your data is split into train, test, and production sets, you can build and deploy your model. The test and production data will later be used to create the reference and analysis sets.

In this exercise, you will go through this process. The X_train, X_test, X_prod, y_train, y_test, and y_prod datasets created in the previous exercise are already loaded here.

For this exercise, pandas has been imported as pd and is ready for use.
"""

# Instructions

"""

    Train the model by calling the fit method with X_train and y_train.
    Make predictions on the train and test sets.
    Deploy the model by making predictions on the production data.
---

    Add a y_pred (model's predictions) column to the reference and analysis sets, assigning the model's predictions on the test and production sets, respectively.
    Add a tip_amount column to the reference set and assign the values from y_test (labels) to it.
    Join the lpep_pickup_datetime timestamp column to the reference and analysis sets.


"""

# solution

from lightgbm import LGBMRegressor

# Fit the model
model = LGBMRegressor(random_state=111, n_estimators=50, n_jobs=1)
model.fit(X_train, y_train)

# Get the model's predictions on the train, test, and production sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
y_pred_prod = model.predict(X_prod)

# Create the reference and analysis sets
reference = X_test.copy() # Copy test set features
reference['y_pred'] = y_pred_test # Add the model's predictions on the test set
reference['tip_amount'] = y_test # Add labels (ground truth)
reference = reference.join(data['lpep_pickup_datetime']) # Add timestamp column

analysis = X_prod.copy() # Copy production set features
analysis['y_pred'] = y_pred_prod # Add the model's predictions on the production set
analysis = analysis.join(data['lpep_pickup_datetime']) # Add timestamp column

#----------------------------------#

# Conclusion

"""
Amazing! Creating reference and analysis sets is the first thing you need to do to keep an eye on your model in production. In the following video, we will use the sets created in this exercise to estimate our model's performance!
"""

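The timestamp join works because `DataFrame.join` aligns rows on the index, and subsets created with `.loc` keep the row labels of the original frame. A small sketch with made-up values shows the mechanism; the frame `raw` and its column `pickup_datetime` are hypothetical stand-ins, not the course dataset.

```python
import pandas as pd

# Hypothetical raw data with a timestamp, one feature, and a partition label.
raw = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2016-12-01", "2016-12-02", "2016-12-03", "2016-12-04"]
    ),
    "trip_distance": [0.6, 4.5, 1.9, 1.0],
    "partition": ["train", "test", "test", "prod"],
})

# Selecting rows with .loc keeps the original index labels (1 and 2 here).
test_features = raw.loc[raw["partition"] == "test", ["trip_distance"]]

# join() matches rows by index, so each test row gets its own timestamp back.
reference = test_features.join(raw["pickup_datetime"])
print(reference)
```

Note that `join` raises an error when both frames share a column name and no suffix is given, so the timestamp column must not already be among the selected features.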
# Specify the algorithm and problem type

Imagine you are a data science consultant working for a hotel chain. Your task is to build a model that predicts whether a customer will arrive or not. Many bookings are made months in advance, which means you're dealing with delayed ground truth.

Which performance estimation algorithm would you use, and how would you describe the problem type?

### Possible Answers


    CBPE, multi-label classification
    
    
    DLE, binary classification
    
    
    CBPE, binary classification {Answer}
    
    
    None of the above
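
The intuition behind confidence-based estimation can be sketched in a few lines of plain Python. This is a simplified illustration for accuracy only, not NannyML's implementation, and it assumes the model's probabilities are well calibrated. The function name `estimate_accuracy` and the sample scores are made up for the example.

```python
def estimate_accuracy(y_pred_proba, threshold=0.5):
    """Estimate accuracy without labels from calibrated positive-class probabilities."""
    expected_correct = 0.0
    for p in y_pred_proba:
        predicted_positive = p >= threshold
        # A calibrated score p means the positive class occurs with probability p,
        # so a positive prediction is right with probability p, a negative one with 1 - p.
        expected_correct += p if predicted_positive else 1.0 - p
    return expected_correct / len(y_pred_proba)


# Hypothetical arrival probabilities for four bookings; no ground truth is needed.
scores = [0.9, 0.8, 0.2, 0.6]
estimated_accuracy = estimate_accuracy(scores)
print(round(estimated_accuracy, 3))
```

The same reasoning extends to other classification metrics by working with expected confusion-matrix counts, which is why the approach suits problems where labels arrive late.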

# Interpreting results

In this scenario, you've successfully implemented your performance estimation algorithm in a production environment. As a result, you have a plot of the estimated ROC AUC metric.

Your task now is to select the option that correctly describes the plot:

![image](images/lesson_3_exercise_cbpe.png)

### Possible Answers


    chunk period = d daily, upper threshold = 1, lower threshold = 0.6, alert month = September
    
    
    chunk period = m monthly, upper threshold = 0.9, lower threshold = 0.7, alert month = November
    
    
    chunk period = m monthly, upper threshold = 1, lower threshold = 0.7, alert month = September {Answer}
    
    
    chunk period = w weekly, upper threshold = 1, lower threshold = 0.7, alert month = September
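
How monthly chunks, thresholds, and alerts fit together can be sketched with plain pandas rather than NannyML. The metric values below are invented, and setting the thresholds to the reference mean plus or minus three standard deviations is an assumption made for illustration, not a statement about how the plot above was produced.

```python
import pandas as pd

# Hypothetical per-month metric values (for example, estimated ROC AUC per chunk).
reference_metric = pd.Series(
    [0.90, 0.91, 0.89, 0.90],
    index=pd.period_range("2023-01", periods=4, freq="M"),
)
analysis_metric = pd.Series(
    [0.90, 0.89, 0.80],
    index=pd.period_range("2023-05", periods=3, freq="M"),
)

# Thresholds are derived from the reference period only (assumed rule: mean +/- 3 std).
mean, std = reference_metric.mean(), reference_metric.std()
lower_threshold = mean - 3 * std
upper_threshold = min(mean + 3 * std, 1.0)  # ROC AUC cannot exceed 1

# A chunk raises an alert when its metric falls outside the thresholds.
alerts = (analysis_metric < lower_threshold) | (analysis_metric > upper_threshold)
print(alerts[alerts].index.tolist())
```

Reading a plot like the one in this exercise follows the same logic: the spacing of the points gives the chunk period, the horizontal lines give the thresholds, and a point outside them marks the alert.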
    

# CBPE and DLE workflow

Recall that NannyML provides a repeatable workflow similar to scikit-learn for estimating your model's technical performance.

Reorder the provided pseudo-code to reflect the initialization of the CBPE algorithm, the estimation of performance, and its visualization.

![Answer](images/ch01-02.png)

In [None]:
# exercise 04

"""
Performance estimation for tip prediction

In the previous exercises, you prepared a reference and analysis set for the NYC Green Taxi dataset. In this one, you will use that data to estimate the model's performance in production.

First, you must initialize the DLE algorithm with the provided parameters and then plot the results.

The reference and analysis sets are already loaded and saved in the reference and analysis variables. Additionally, nannyml is already imported.
"""

# Instructions

"""


    Initialize the DLE algorithm with a daily chunk period, tip_amount as y_true, and the MSE metric.

    Fit the DLE estimator on the reference set, estimate performance for the analysis set, and store the output in the results variable.

    Visualize the results using the plot() and show() methods.

"""

# solution

# Initialize the DLE estimator with daily chunks and the MSE metric
estimator = nannyml.DLE(
    y_pred='y_pred',
    timestamp_column_name='lpep_pickup_datetime',
    feature_column_names=features,
    chunk_period='d',
    y_true='tip_amount',
    metrics=['mse'],
)

# Fit the DLE estimator on the reference data
estimator.fit(reference)

# Estimate the performance on the analysis data
results = estimator.estimate(analysis)

# Plot and show the results
results.plot().show()

#----------------------------------#

# Conclusion

"""
Great job! Now you have the knowledge and tools to estimate your model's performance in production. In the next chapter, we'll dive into comparing the estimated performance with the actual performance, when the ground truth becomes available!
"""
