In [1]:
# run this to shorten the data import from the files
import os
cwd = os.path.dirname(os.getcwd())+'/'
path_data = os.path.join(os.path.dirname(os.getcwd()), 'datasets/')


# Identifying relevant drifts

Recall the Green Taxi dataset example from Chapter 2, where the model was predicting the tip amount. In this exercise, we've prepared a comparison plot that illustrates the daily values of the reconstruction error obtained from the multivariate drift detection method, shown in light blue, alongside the realized performance calculated using the MAE metric, which is plotted in dark blue.

Your task is to identify the day on which a drift alert coincides with an alert in the model's performance.

![image](images/Lesson_3_1_exercise_plot.png)

### Possible Answers


    31st of December
    
    
    27th of December
    
    
    25th of December {Answer}
    
    
    18th of December

**Great job! As you can see, there are a lot of false alerts, and only one of the drifts actually hurts the performance.**
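
The same overlap check can be done in code once both alert series are available. The snippet below is a minimal sketch using made-up alert flags (not the values behind the plot above); the dictionary names and day labels are hypothetical.

```python
# Hypothetical per-day alert flags; illustrative values only,
# not the ones shown in the exercise plot.
drift_alerts = {"day_1": True, "day_2": False, "day_3": True, "day_4": True}
performance_alerts = {"day_1": False, "day_2": False, "day_3": True, "day_4": False}

# A drift alert is relevant only when the performance metric alerts on the same day.
relevant_days = [
    day for day, drifted in drift_alerts.items()
    if drifted and performance_alerts.get(day, False)
]

# Drift alerts without a matching performance alert are treated as false alerts.
false_alerts = [
    day for day, drifted in drift_alerts.items()
    if drifted and not performance_alerts.get(day, False)
]

print("relevant:", relevant_days)
print("false alerts:", false_alerts)
```

Separating the two lists makes it easy to see how many drift alerts would have been noise if acted upon in isolation.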

In [None]:
# exercise 01

"""
Drift in hotel booking dataset

In the previous chapter, you calculated the business value and ROC AUC performance for a model that predicts booking cancellations. You noticed a few alerts in the resulting plots, which is why you need to investigate the presence of drift in the analysis data.

In this exercise, you will initialize the multivariate drift detection method and compare its results with the performance results calculated in the previous chapter.

StandardDeviationThreshold is already imported. The business value and ROC AUC results are stored in the perf_results variable, and feature_column_names is already defined.
"""

# Instructions

"""


    Initialize the StandardDeviationThreshold method, setting std_lower_multiplier to 2 and std_upper_multiplier to 1.

    Add the following feature names: country, lead_time, parking_spaces, and hotel. Retain their order.

    Pass the previously defined thresholds and feature names to the DataReconstructionDriftCalculator.

    Show the comparison plot featuring both the multivariate drift detection results (mv_results) and the performance results (perf_results).


"""

# solution

# Create standard deviation thresholds
stdt = StandardDeviationThreshold(std_lower_multiplier=2, std_upper_multiplier=1)

# Define feature columns
feature_column_names = ['country', 'lead_time', 'parking_spaces', 'hotel']

# Initialize, fit, and show results of multivariate drift calculator
mv_calc = nannyml.DataReconstructionDriftCalculator(
    column_names=feature_column_names,
    threshold=stdt,
    timestamp_column_name='timestamp',
    chunk_period='m')
mv_calc.fit(reference)
mv_results = mv_calc.calculate(analysis)
mv_results.filter(period='analysis').compare(perf_results).plot().show()

#----------------------------------#

# Conclusion

"""
Great job! It's important to observe that two of the alerted drifts had a significant impact, resulting in a large decrease and increase in the business value. Additionally, the alert in the ROC AUC value appears to be linked to these drifts. In the next video, we will delve into tools to explain these findings!
"""

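
To make the effect of the two multipliers concrete, here is a minimal sketch of how standard deviation thresholds can be derived. It assumes the lower threshold is the reference mean minus `std_lower_multiplier` standard deviations and the upper threshold is the mean plus `std_upper_multiplier` standard deviations. The helper names and the reconstruction error values are hypothetical, and NannyML's own implementation may differ in details such as the standard deviation estimator.

```python
from statistics import mean, pstdev


def std_thresholds(reference_values, std_lower_multiplier, std_upper_multiplier):
    """Return (lower, upper) alert thresholds derived from reference-period values."""
    center = mean(reference_values)
    spread = pstdev(reference_values)  # population std; an assumption of this sketch
    lower = center - std_lower_multiplier * spread
    upper = center + std_upper_multiplier * spread
    return lower, upper


# Hypothetical reconstruction errors for the reference chunks.
reference_errors = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = std_thresholds(
    reference_errors, std_lower_multiplier=2, std_upper_multiplier=1
)


def is_alert(value):
    """A chunk alerts when its value falls outside the [lower, upper] band."""
    return value < lower or value > upper


print(lower, upper)
print(is_alert(4.5), is_alert(1.0))
```

With a lower multiplier of 2 and an upper multiplier of 1, the band is asymmetric: increases in the reconstruction error trigger alerts sooner than decreases.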

In [1]:
# exercise 02

"""
Univariate drift detection for hotel booking dataset

In the previous exercise, the multivariate drift detection method showed that the data shift in January is responsible for the alert in the ROC AUC metric and the negative business value of the model.

In this exercise, you will use a univariate drift detection method to find the feature and explanation behind the drift.

The reference and analysis sets are already pre-loaded for you.
"""

# Instructions

"""


    Specify the Wasserstein and Jensen-Shannon methods for continuous features and L-infinity and Chi2 for categorical features.

    Fit the reference and calculate results on the analysis set.

    Plot the results.

"""

# solution

# Initialize the univariate drift calculator
uv_calc = nannyml.UnivariateDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_period='m',
    continuous_methods=['wasserstein', 'jensen_shannon'],
    categorical_methods=['l_infinity', 'chi2'],
)

# Fit, calculate, and plot the results
uv_calc.fit(reference)
uv_results = uv_calc.calculate(analysis)
uv_results.plot().show()

#----------------------------------#

# Conclusion

"""
Great work! Notice we got 8 different plots: some have a lot of alerts, and some have none at all. To make the results more insightful, let's rank them by the number of alerts and their correlation with performance!
"""

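
For intuition about what two of these methods measure, the sketch below implements simplified versions in plain Python: L-infinity as the largest difference in relative category frequency, and the Wasserstein distance for the special case of two samples of equal size. The function names and sample values are hypothetical; this is an illustration of the definitions, not NannyML's implementation.

```python
from collections import Counter


def l_infinity(reference, analysis):
    """Largest absolute difference in relative frequency across all categories."""
    ref_freq = Counter(reference)
    ana_freq = Counter(analysis)
    categories = set(ref_freq) | set(ana_freq)
    return max(
        abs(ref_freq[c] / len(reference) - ana_freq[c] / len(analysis))
        for c in categories
    )


def wasserstein_equal_size(reference, analysis):
    """1-D Wasserstein distance, simplified to samples of equal size."""
    if len(reference) != len(analysis):
        raise ValueError("this simplified version needs samples of equal size")
    pairs = zip(sorted(reference), sorted(analysis))
    return sum(abs(r - a) for r, a in pairs) / len(reference)


# Hypothetical samples for one categorical and one continuous feature.
ref_hotel = ["City", "City", "Resort", "Resort"]
ana_hotel = ["City", "City", "City", "Resort"]
ref_lead_time = [10, 20, 30, 40]
ana_lead_time = [15, 25, 35, 45]

hotel_drift = l_infinity(ref_hotel, ana_hotel)
lead_time_drift = wasserstein_equal_size(ref_lead_time, ana_lead_time)
print(hotel_drift, lead_time_drift)
```

Note that the Wasserstein value is expressed in the feature's own units (here, days of lead time), while L-infinity is always between 0 and 1.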

In [2]:
# exercise 03

"""
Ranking the univariate results

In the previous exercise, you ended up with eight plots. In this exercise, your task is to rank them based on the number of alerts and the correlation with the ROC AUC performance.

The univariate results are pre-loaded and stored in the uv_results variable, and the performance results are stored in the perf_results variable.
"""

# Instructions

"""

    Initialize AlertCountRanker without any initial parameters.
    Call the .rank() method and pass the uv_results filtered for the Wasserstein and L-infinity methods.
---

    Initialize CorrelationRanker without any initial parameters.
    Fit the correlation ranker with perf_results filtered for the reference period.
    Call the .rank() method and pass the uv_results filtered for the Wasserstein and L-infinity methods, along with perf_results.

"""

# solution

# Initialize the alert count ranker

alert_count_ranker = nannyml.AlertCountRanker()
alert_count_ranked_features = alert_count_ranker.rank(
    uv_results.filter(methods=['wasserstein', 'l_infinity']))

display(alert_count_ranked_features)

# Initialize the correlation ranker
correlation_ranker = nannyml.CorrelationRanker()
correlation_ranker.fit(perf_results.filter(period='reference'))

correlation_ranked_features = correlation_ranker.rank(
    uv_results.filter(methods=['wasserstein', 'l_infinity']),
    perf_results)
display(correlation_ranked_features)

#----------------------------------#

# Conclusion

"""
Great! The alert count ranking appears to include a high number of false alerts for parking spaces and lead time. According to the correlation ranker, it seems that drifts in the hotel and country features are the ones affecting the performance. Now, let's visualize the drift and distribution values for these features!
"""

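
The difference between the two ranking strategies can be illustrated with a small sketch. It assumes that alert count ranking simply counts alerting chunks per feature, and that correlation ranking scores each feature by the absolute Pearson correlation between its drift values and the absolute change in performance relative to the reference level. All function names and numbers below are hypothetical and only loosely mirror what the NannyML rankers do.

```python
from math import sqrt


def rank_by_alert_count(alerts_by_feature):
    """Sort features by how many chunks raised an alert, most alerts first."""
    counts = {name: sum(flags) for name, flags in alerts_by_feature.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(drift_by_feature, performance, reference_performance):
    """Sort features by |correlation| between drift and absolute performance change."""
    change = [abs(p - reference_performance) for p in performance]
    scores = {
        name: abs(pearson(values, change))
        for name, values in drift_by_feature.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-chunk alerts, drift values, and ROC AUC values.
alerts = {
    "lead_time": [True, True, True, False],
    "country": [False, False, True, False],
}
drift = {
    "lead_time": [0.3, 0.1, 0.2, 0.3],
    "country": [0.1, 0.1, 0.6, 0.2],
}
roc_auc = [0.80, 0.80, 0.65, 0.78]

by_count = rank_by_alert_count(alerts)
by_correlation = rank_by_correlation(drift, roc_auc, reference_performance=0.80)
print(by_count)
print(by_correlation)
```

In this toy data, the feature with the most alerts is not the one whose drift tracks the performance change, which is why the two rankings can disagree.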

In [3]:
# exercise 04

"""
Visualizing drifting features

After ranking the univariate results, you know that drift in the hotel and country features impacts the model's performance the most. In this exercise, you will look at their drift results and distribution plots to determine the root cause of the problem.

The results from the univariate drift calculator are stored in the uv_results variable.
"""

# Instructions

"""


    Set the period argument to analysis for drift_results.

    Pass hotel and country to column_names for drift_results.

    Set the kind argument in the .plot() method to "drift".

    Do the same for distribution_results, but set the kind argument in the .plot() method to "distribution".

"""

# solution

# Filter and create drift plots
drift_results = uv_results.filter(
    period='analysis',
    column_names=['hotel', 'country']
    ).plot(kind='drift')

# Filter and create distribution plots
distribution_results = uv_results.filter(
    period='analysis',
    column_names=['hotel', 'country']
    ).plot(kind='distribution')

# Show the plots
drift_results.show()
distribution_results.show()

#----------------------------------#

# Conclusion

"""
Fantastic work! It's now evident that the country distribution is notably shifting, particularly in January, which provides a clear explanation for the performance drops. The root cause appears to be an increase in international travelers during the winter season, attracted by Portugal's warm climate.
"""


In [4]:
# exercise 05

"""
Data quality checks

As you learned in the previous video, missing values can result in a loss of valuable information and potentially lead to incorrect interpretations. Similarly, the presence of unseen values can also affect your model's confidence.

In this exercise, your goal is to explore whether the hotel booking dataset contains missing values and identify any unseen values. The reference and analysis datasets are already loaded, along with the nannyml library.

A quick reminder: if you can't recall the column types, you can easily explore the data using the .head() method.
"""

# Instructions

"""

    Initialize the missing value calculator, passing the selected columns to column_names and setting the chunk_period to monthly.
---
    Add two categorical column names, country and hotel, initialize the unseen values calculator, and pass categorical_columns to column_names.

"""

# solution

# Define analyzed columns
selected_columns = ['country', 'lead_time', 'parking_spaces', 'hotel']

# Initialize missing values calculator
ms_calc = nannyml.MissingValuesCalculator(
    column_names=selected_columns,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
ms_calc.fit(reference)
ms_results = ms_calc.calculate(analysis)
ms_results.plot().show()

#----------------------------------#

# Define analyzed categorical columns
categorical_columns = ['country', 'hotel']

# Initialize unseen values calculator
us_calc = nannyml.UnseenValuesCalculator(
    column_names=categorical_columns,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
us_calc.fit(reference)
us_results = us_calc.calculate(analysis)
us_results.filter(period='analysis').plot().show()

#----------------------------------#

# Conclusion

"""
You got it! The country is the only feature with unseen values. The highest number of occurrences is in January, and that's when we also observed the most significant drop in performance. In our next exercise, we'll take a closer look at the statistical summary of our continuous features.
"""

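
The two data quality checks boil down to simple per-chunk computations. The sketch below shows one way to express them in plain Python, assuming missing entries are represented as `None` and that an unseen value is any category absent from the reference data. The function names and country codes are hypothetical sample data, not values from the hotel booking dataset.

```python
def missing_rate(values):
    """Share of entries in a chunk that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def unseen_count(reference_values, chunk_values):
    """Number of non-missing entries whose category never appeared in the reference data."""
    seen = set(reference_values)
    return sum(v is not None and v not in seen for v in chunk_values)


# Hypothetical reference categories and one analysis chunk.
reference_country = ["PRT", "PRT", "ESP", "FRA"]
chunk_country = ["PRT", "BRA", None, "USA", "ESP"]

rate = missing_rate(chunk_country)
unseen = unseen_count(reference_country, chunk_country)
print(rate, unseen)
```

Unseen values only make sense for categorical columns, which is why the exercise restricts that calculator to country and hotel.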

In [None]:
# exercise 06

"""
Summary statistics

Recall from the previous lesson that NannyML provides five methods for tracking statistical changes in your features.

In this exercise, you will focus on examining the lead_time feature from the Hotel Booking dataset, which indicates how many days in advance a booking was made. By using summation, median, and standard deviation statistics, you can gain valuable insights into how customer booking behavior has evolved over time.

It's important to note that both the reference and analysis sets, as well as the nannyml library, are already pre-loaded and ready for use.
"""

# Instructions

"""

    Set the analyzed column to lead_time, initialize SummaryStatsSumCalculator, and pass analyzed_column to the column_names parameter.
---
    Initialize SummaryStatsMedianCalculator, pass analyzed_column to the column_names parameter, and filter the results for the analysis period only.
---
    Initialize SummaryStatsStdCalculator.

"""

# solution

# Define analyzed column
analyzed_column = ['lead_time']

# Initialize sum values calculator
sum_calc = nannyml.SummaryStatsSumCalculator(
    column_names=analyzed_column,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
sum_calc.fit(reference)
sum_calc_res = sum_calc.calculate(analysis)
sum_calc_res.plot().show()

#----------------------------------#

# Define analyzed column
analyzed_column = ['lead_time']

# Initialize median values calculator
med_calc = nannyml.SummaryStatsMedianCalculator(
    column_names=analyzed_column,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
med_calc.fit(reference)
med_calc_res = med_calc.calculate(analysis)
med_calc_res.filter(period='analysis').plot().show()

#----------------------------------#

# Define analyzed column
analyzed_column = ['lead_time']

# Initialize standard deviation values calculator
std_calc = nannyml.SummaryStatsStdCalculator(
    column_names=analyzed_column,
    chunk_period='m',
    timestamp_column_name='timestamp'
)

# Fit, calculate and plot the results
std_calc.fit(reference)
std_calc_res = std_calc.calculate(analysis)
std_calc_res.filter(period="analysis").plot().show()

#----------------------------------#

# Conclusion

"""
Fantastic! For January, the standard deviation of lead times drops below the threshold. A low standard deviation suggests that lead times in January are relatively consistent and clustered around the mean. This typically makes predictions more straightforward. However, it's worth noting that the substantial shift in the country feature had such a significant impact that it resulted in a decrease in model performance. Now, let's see how we can resolve these problems!
"""
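
As a rough illustration of what the three summary statistics calculators report, the sketch below groups a handful of made-up bookings by month and computes the sum, median, and standard deviation of the lead time for each group. The records and variable names are hypothetical, and the population standard deviation is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import median, pstdev

# Hypothetical (month, lead_time in days) records.
bookings = [
    ("2017-01", 10), ("2017-01", 12), ("2017-01", 14),
    ("2017-02", 5), ("2017-02", 45), ("2017-02", 100),
]

# Group lead times into monthly chunks.
lead_times_by_month = defaultdict(list)
for month, lead_time in bookings:
    lead_times_by_month[month].append(lead_time)

# Compute the three statistics for every chunk.
summary = {
    month: {
        "sum": sum(values),
        "median": median(values),
        "std": pstdev(values),  # population std; an assumption of this sketch
    }
    for month, values in lead_times_by_month.items()
}

for month, stats in summary.items():
    print(month, stats)
```

In this toy data, the first month has tightly clustered lead times and therefore a much smaller standard deviation than the second, which is the kind of pattern the conclusion above describes for January.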

# What is the resolution?

Imagine a scenario where a company launched an app to help people take notes on their tablets. Internally, the app runs an ML model to recognize handwritten characters and help users do a text search across their notes.

The company notices that the more people interact with the app, the worse the ML model performs. After careful examination, the company realizes that as users get more comfortable writing in the app, their handwriting becomes sloppier and less readable.

How should the issue be resolved in that case?

### Possible Answers


    Do nothing
    
    
    Revert back to the previous model
    
    
    Retrain the model with new data {Answer}
    
    
    Change the downstream processes

# Should you do nothing or not?

Let's picture a real estate company that primarily buys houses at fair prices, renovates them, and then sells them at higher prices.

Lately, the company's price prediction model has overestimated a few prices, resulting in huge profit reductions. This means that the model's predictions have a significant impact on the company's bottom line.

Is it a viable solution to do nothing in this scenario, and why?

### Possible Answers


    Yes, because overestimating prices doesn't significantly affect the business's bottom line.
    
    
    No, because the model's consistent overestimations could potentially result in a loss for the business. {Answer}
    
    
    No, because the model could also underestimate prices, which would reduce profits.

In [6]:
# exercise 07

"""
Implementing a monitoring workflow

Throughout the course, you've learned about the monitoring workflow. The first step is performance monitoring. If there are negative changes, the next steps involve multivariate drift detection to identify if drift caused the performance drop, followed by univariate drift detection to pinpoint the cause in individual features. Once the investigation results are in, you can take steps to resolve the issue.

To solidify this knowledge, in this exercise you'll apply the process to the US Census dataset. The reference and analysis datasets are pre-loaded, and you have access to the CBPE estimator (estimator), the univariate drift calculator (uv_calculator), and an alert_count_ranker for feature drift ranking.
"""

# Instructions

"""

    Fit the estimator on the reference set, estimate the results on the analysis set, and show the results.
---

    Set the chunk_size parameter to 5000 for DataReconstructionDriftCalculator. Next, filter the mv_results for the analysis period, then compare them with the estimated performance results.
---
Question

The top drifting feature is AGEP, which represents the person's age. Your task is to use the distribution plot to explain the drift in Chunks 27 and 28, which was responsible for the performance drop.

Possible answers:
    
    During that period, there is an increased number of older individuals in the data.
    
    During that period, there is a greater presence of younger individuals in the data. {Answer}
    
    During that period, there is a higher number of middle-aged individuals in the data.
"""

# solution

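# Estimate performance with the estimator and show the results
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)
estimated_performance.plot().show()

# Detect multivariate drift and compare it with the estimated performance
mv_calc = nannyml.DataReconstructionDriftCalculator(column_names=features, chunk_size=5000)
mv_calc.fit(reference)
mv_results = mv_calc.calculate(analysis)
mv_results.filter(period='analysis').compare(estimated_performance).plot().show()

# Calculate univariate drift
uv_calculator.fit(reference)
uv_results = uv_calculator.calculate(analysis)

# Check the most drifting features
alert_count_ranked_features = alert_count_ranker.rank(uv_results)
display(alert_count_ranked_features.head())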
#----------------------------------#

# Conclusion

"""
Congratulations! The shift in the distribution in Chunks 27 and 28 is primarily centered around younger individuals, many of whom are students without families. This change caused a drop in the model's performance because it had limited exposure to this particular demographic.
"""

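
The order of the monitoring workflow can also be written down as a small decision routine. The sketch below is a simplified illustration using per-chunk boolean alerts: a chunk is investigated only if performance alerts, multivariate drift is checked next, and drifting features are listed last. The function name, the alert lists, and the `other_feature` column are hypothetical; only AGEP comes from the exercise.

```python
def triage(performance_alerts, multivariate_alerts, feature_alerts):
    """For each chunk with a performance alert, list the drifting features to inspect."""
    findings = {}
    for chunk, perf_alert in enumerate(performance_alerts):
        if not perf_alert:
            continue  # Step 1: no performance problem, nothing to investigate.
        if not multivariate_alerts[chunk]:
            findings[chunk] = []  # Step 2: performance dropped, but no drift detected.
            continue
        # Step 3: collect the features whose univariate drift alerted in this chunk.
        findings[chunk] = sorted(
            name for name, flags in feature_alerts.items() if flags[chunk]
        )
    return findings


# Hypothetical alerts for three chunks.
performance_alerts = [False, True, True]
multivariate_alerts = [True, True, False]
feature_alerts = {
    "AGEP": [False, True, False],
    "other_feature": [True, False, False],
}

findings = triage(performance_alerts, multivariate_alerts, feature_alerts)
print(findings)
```

Chunk 0 drifts without a performance alert and is skipped, which reflects the idea that drift alone does not warrant an investigation.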