# **redis-feast-gcp**: 02 - Feature Drift

In this notebook, we will use the Feast SDK to load historical data and apply some simple data drift detection techniques. In practice, you will want these metrics to be monitored continuously and acted upon.

**This notebook assumes that you've already set up your feature store and model repo in GCP.**


![architecture](img/redis-feast-gcp-architecture.png)

# Fetching Historical Data

Now that our Feast feature store has been provisioned and GCP infra is ready to go, we can use Feast to generate a training dataset. To do this, we need an entity dataframe, alongside the list of features we want.

To make things simpler, we use the [`DataFetcher`](utils/data_fetcher.py) class that wraps Feast, which wraps Redis and BigQuery.

In [None]:
from feature_store.repo import config
from feature_store.utils import (
    DataFetcher,
    storage
)

# Load fs
fs = storage.get_feature_store(
    config_path=config.REPO_CONFIG,
    bucket_name=config.BUCKET_NAME
)

# Load data fetcher
data_fetcher = DataFetcher(fs)

# Feature Data Drift w/ Feast

For time-series data, it's important to consider that different windows of time behave differently due to seasonality and other temporal influences. Concept drift is likely in these scenarios because the inference/serving (or testing) environment may be very different from the training environment.

Fortunately, with an orchestration layer like **Feast** sitting over Redis and an "offline" source like BigQuery, you can:
- Measure distribution shifts in your historical data
- Detect whether "online" (real-time) data is "out of bounds" relative to some baseline distribution

This is important in order to:
- Make sure that training and testing windows behave similarly
- Identify possible remedies for drift
- Make sure that real-time features and model predictions stay in line with the baseline
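
The "out of bounds" check from the first list can be sketched without any Feast calls. This is a minimal sketch under assumptions: `baseline_bounds` and `is_out_of_bounds` are hypothetical helper names, the 1st/99th percentile cut-offs are an arbitrary choice, and the baseline values are made up for illustration. In this notebook, the baseline would plausibly be a historical feature column and the value to check would come from the online store.

```python
from statistics import quantiles


def baseline_bounds(baseline, lower_pct=1, upper_pct=99):
    """Return (low, high) percentile bounds computed from a baseline sample."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(baseline, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_out_of_bounds(value, bounds):
    """True when a single value falls outside the baseline bounds."""
    low, high = bounds
    return value < low or value > high


# Hypothetical baseline: evenly spread values from 0 to 100
baseline = [float(v) for v in range(0, 101)]
bounds = baseline_bounds(baseline)

# Hypothetical "online" readings to compare against the baseline
typical_reading = 50.0
extreme_reading = 150.0
flags = [is_out_of_bounds(v, bounds) for v in (typical_reading, extreme_reading)]
```

Percentile bounds are only one possible baseline; a mean plus or minus a few standard deviations would follow the same shape.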

## Detecting Distribution Shifts

In [None]:
import seaborn as sns
from datetime import datetime
from matplotlib import pyplot

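def fetch_window(start, end):
    """Fetch historical feature data for all states between start and end."""
    # The query below defines the entity rows (state + event_timestamp) for the window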
    d = data_fetcher.get_training_data(
        entity_query=f"""
            select
                state,
                date as event_timestamp
            from
                {config.BIGQUERY_DATASET_NAME}.{config.WEEKLY_VACCINATIONS_TABLE}
            where
                date between timestamp('{start.isoformat()}') and timestamp('{end.isoformat()}')
        """
    )

    # Clean up any nulls
    d.dropna(inplace=True)
    d.sort_values(['event_timestamp', 'state'], axis=0, inplace=True)
    # Tag each row with the window's starting year (used to facet plots later)
    d['year'] = start.year
    return d

In [None]:
# Define and fetch windows of data
window_1_start = datetime.strptime('2021-01-01', '%Y-%m-%d')
window_1_end = datetime.strptime('2021-12-31', '%Y-%m-%d')
ds_2021 = fetch_window(window_1_start, window_1_end)

window_2_start = datetime.strptime('2022-01-01', '%Y-%m-%d')
window_2_end = datetime.strptime('2022-12-31', '%Y-%m-%d')
ds_2022 = fetch_window(window_2_start, window_2_end)

In [None]:
ds_2021.head()

In [None]:
ds_2022.head()

### Vaccine Demand Example

Given the nature of the global pandemic, we would expect the underlying feature distributions to differ between 2021 and 2022.
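
The plots in this section compare the two years visually. A single number can complement them, and one common choice is the two-sample Kolmogorov-Smirnov statistic: the largest vertical gap between the two empirical CDFs. The function below is a dependency-free sketch, not something the original notebook uses; `ks_statistic` is a hypothetical name, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Made-up samples: identical, partially overlapping, and fully separated
same = ks_statistic([1, 2, 3], [1, 2, 3])
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
separate = ks_statistic([1, 2, 3], [10, 11, 12])
```

A value near 0 suggests the two windows look alike, and a value near 1 suggests they barely overlap. With the data in this notebook, a call such as `ks_statistic(ds_2021.lag_1_vaccine_interest, ds_2022.lag_1_vaccine_interest)` should work, since the function only needs something sortable. If SciPy is available, `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.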

In [None]:
sns.histplot(ds_2021.lag_1_vaccine_interest, kde=True, stat='density', label='2021 Vaccine Search Interest')
sns.histplot(ds_2022.lag_1_vaccine_interest, kde=True, stat='density', label='2022 Vaccine Search Interest')
pyplot.legend()
pyplot.show()

In [None]:
sns.histplot(ds_2021.lag_1_vaccine_intent, kde=True, stat='density', label='2021 Vaccine Search Intent')
sns.histplot(ds_2022.lag_1_vaccine_intent, kde=True, stat='density', label='2022 Vaccine Search Intent')
pyplot.legend()
pyplot.show()

In [None]:
sns.histplot(ds_2021.lag_1_vaccine_safety, kde=True, stat='density', label='2021 Vaccine Search Safety')
sns.histplot(ds_2022.lag_1_vaccine_safety, kde=True, stat='density', label='2022 Vaccine Search Safety')
pyplot.legend()
pyplot.show()

In [None]:
sns.histplot(ds_2021.weekly_vaccinations_count, kde=True, stat='density', label='2021 Weekly Vaccinations Count')
sns.histplot(ds_2022.weekly_vaccinations_count, kde=True, stat='density', label='2022 Weekly Vaccinations Count')
pyplot.legend()
pyplot.show()

In [None]:
# Look at the relationship between a lagged feature and the target, per year
import pandas as pd

g = sns.FacetGrid(pd.concat([ds_2021, ds_2022]), col="year", height=5)
g.map(sns.scatterplot, "lag_2_vaccine_safety", "weekly_vaccinations_count", s=100, alpha=.5)
g.add_legend()

Next, we'll use a tool like Evidently to check for data drift across all features at once.
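
As a point of comparison, the idea of scoring every feature in one pass can be sketched by hand. The snippet below uses the Population Stability Index (PSI) with equal-width bins. This is a stand-in chosen for illustration, not a description of what any drift library computes; `psi` and `drift_summary` are hypothetical names, and the data is made up.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins over the reference range."""
    ref = list(reference)
    cur = list(current)
    lo, hi = min(ref), max(ref)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(ref)
    p_cur = proportions(cur)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


def drift_summary(reference, current, features):
    """Compute PSI for each named feature; inputs are indexable by column name."""
    return {name: psi(reference[name], current[name]) for name in features}


# Made-up data: one feature left unchanged, one shifted upward by 50
reference = {"same": list(range(100)), "shifted": list(range(100))}
current = {"same": list(range(100)), "shifted": [v + 50 for v in range(100)]}
summary = drift_summary(reference, current, ["same", "shifted"])
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a substantial shift, though the threshold depends on the binning and should be treated as a starting point rather than a fixed rule. Because `drift_summary` only indexes by column name, it should also accept the `ds_2021` and `ds_2022` dataframes with a list of numeric feature columns.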

In [None]:
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab, NumTargetDriftTab

In [None]:
# With column_mapping=None, Evidently is expected to pick up a column named 'target'
ds_2021['target'] = ds_2021.weekly_vaccinations_count
ds_2022['target'] = ds_2022.weekly_vaccinations_count

# weekly_vaccinations_count is numeric, so use the numerical target drift tab
data_drift_report = Dashboard(tabs=[DataDriftTab(verbose_level=1),
                                    NumTargetDriftTab(verbose_level=1)])
# Reference window first (2021), current window second (2022)
data_drift_report.calculate(ds_2021, ds_2022, column_mapping=None)
data_drift_report.show(mode="inline")