<div align="center">
<h1><img width="30" src="https://madewithml.com/static/images/rounded_logo.png">&nbsp;<a href="https://madewithml.com/">Made With ML</a></h1>
Applied ML · MLOps · Production
<br>
Join 30K+ developers in learning how to responsibly <a href="https://madewithml.com/about/">deliver value</a> with ML.
    <br>
</div>

<br>

<div align="center">
    <a target="_blank" href="https://newsletter.madewithml.com"><img src="https://img.shields.io/badge/Subscribe-30K-brightgreen"></a>&nbsp;
    <a target="_blank" href="https://github.com/GokuMohandas/MadeWithML"><img src="https://img.shields.io/github/stars/GokuMohandas/MadeWithML.svg?style=social&label=Star"></a>&nbsp;
    <a target="_blank" href="https://www.linkedin.com/in/goku"><img src="https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social"></a>&nbsp;
    <a target="_blank" href="https://twitter.com/GokuMohandas"><img src="https://img.shields.io/twitter/follow/GokuMohandas.svg?label=Follow&style=social"></a>
    <br>
    🔥&nbsp; Among the <a href="https://github.com/topics/mlops" target="_blank">top MLOps</a> repositories on GitHub
</div>

<br>
<hr>

# Set up

In [1]:
# Install Feast
!pip install feast==0.10.5 -q
!pip freeze | grep feast

feast==0.10.5


We're going to create a feature repository at the root of our project. [Feast](https://feast.dev/) will create a configuration file for us and we'll add a [features.py](https://github.com/GokuMohandas/MLOps/blob/main/features/features.py) file to define our features.

> Traditionally, the feature repository would be its own isolated repository that other services read features from and write features to, but we're going to simplify things and create it directly in our application's repository.

In [2]:
%%bash
cd ../
feast init --minimal --template local features
cd features
touch features.py


Creating a new Feast repository in /Users/goku/Documents/madewithml/mlops/features.



```bash
features/
├── feature_store.yaml  - configuration
└── features.py         - feature definitions
```

We're going to configure the locations for our registry and online store (SQLite) in our [feature_store.yaml](https://github.com/GokuMohandas/MLOps/blob/main/features/feature_store.yaml) file. 
- **registry**: contains information about our feature repository, such as data sources, feature views, etc. Since it's in a DB, instead of a Python file, it can very quickly be accessed in production.
- **online store**: DB (SQLite for local) that stores the (latest) features for defined entities to be used for online inference.

When we later apply our definitions, Feast validates them and, if they all look valid, syncs the metadata about Feast objects to the registry. The registry is a tiny database storing most of the same information you have in the feature repository. This step is necessary because the production feature serving infrastructure won't be able to access Python files in the feature repository at run time, but it will be able to efficiently and securely read the feature definitions from the registry.

```yaml
project: features
registry: ../stores/feature/registry.db
provider: local
online_store:
    path: ../stores/feature/online_store.db
```

# Data

Feast requires its [data sources](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/data_source.py) to come from either a file ([Parquet](https://databricks.com/glossary/what-is-parquet)), a data warehouse ([BigQuery](https://cloud.google.com/bigquery)) or a data stream ([Kafka](https://kafka.apache.org/) / [Kinesis](https://aws.amazon.com/kinesis/)). We'll convert our generated features file (`features.json`) into a Parquet file.

> Read more about these data sources in our [pipelines](https://madewithml.com/courses/mlops/pipelines/#data) and [deployment](https://madewithml.com/courses/mlops/deployment/#batch-processing) lessons.

In [3]:
import pandas as pd
from pathlib import Path
from config import config
from tagifai import utils

In [4]:
# Load features to df
features_fp = Path(config.DATA_DIR, "features.json")
features = utils.load_dict(filepath=features_fp)
df = pd.DataFrame(features)

In [5]:
# Format timestamp
df.created_on = pd.to_datetime(df.created_on)

In [6]:
# Convert to parquet
df.to_parquet(
    Path(config.DATA_DIR, "features.parquet"),
    compression=None,
    allow_truncated_timestamps=True,
)

# Feature definitions

Now that we have our data source prepared, we can define our features for the feature store.

In [7]:
from datetime import datetime
from pathlib import Path  # needed when these definitions live in features.py
from feast import Entity, Feature, FeatureView, ValueType
from feast.data_source import FileSource
from google.protobuf.duration_pb2 import Duration
from config import config

The first step is to define the location of the features (FileSource in our case) and the timestamp column for each data point.

In [8]:
# Read data
START_TIME = "2020-02-17"
project_details = FileSource(
    path=str(Path(config.DATA_DIR, "features.parquet")),
    event_timestamp_column="created_on",
)


Next, we need to define the main entity that each data point pertains to. In our case, each project has a unique ID with features such as text and tags.

In [9]:
# Define an entity
project = Entity(
    name="id",
    value_type=ValueType.INT64,
    description="project id",
)

Finally, we're ready to create a [FeatureView](https://docs.feast.dev/concepts/feature-views) that loads specific features (`features`), of various [value types](https://api.docs.feast.dev/python/feast.html?highlight=valuetype#feast.value_type.ValueType), from a data source (`input`) for a specific period of time (`ttl`).

In [10]:
# Define a Feature View for each project
project_details_view = FeatureView(
    name="project_details",
    entities=["id"],
    ttl=Duration(
        seconds=(datetime.today() - datetime.strptime(START_TIME, "%Y-%m-%d")).days * 24 * 60 * 60
    ),
    features=[
        Feature(name="text", dtype=ValueType.STRING),
        Feature(name="tags", dtype=ValueType.STRING_LIST),
    ],
    online=True,
    input=project_details,
    tags={},
)
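The `ttl` above is the number of whole days between `START_TIME` and today, expressed in seconds. Here is a minimal standalone sketch of that arithmetic using only the standard library. The helper name `ttl_seconds` and the pinned `today` value are illustrative assumptions and not part of Feast:

```python
from datetime import datetime

START_TIME = "2020-02-17"


def ttl_seconds(start: str, today: datetime) -> int:
    """Whole days elapsed since `start`, converted to seconds."""
    start_dt = datetime.strptime(start, "%Y-%m-%d")
    return (today - start_dt).days * 24 * 60 * 60


# Pinning `today` (hypothetical value) makes the result reproducible.
example_ttl = ttl_seconds(START_TIME, datetime(2021, 6, 7))
print(example_ttl)  # 476 days -> 41126400 seconds
```

Because `.days` drops any partial day, the window is rounded down to whole days.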

Once we've defined our feature view, we can `apply` it to push a version-controlled definition of our features to the registry for fast access. This also configures the registry and online store that we defined in our [feature_store.yaml](https://github.com/GokuMohandas/MLOps/blob/main/features/feature_store.yaml).

In [11]:
%%bash
cd ../features
feast apply

Registered entity id
Registered feature view project_details
Deploying infrastructure for project_details


# Historical features

Once we've registered our feature definition, along with the data source, entity definition, etc., we can use it to fetch historical features. This is done via point-in-time joins on the provided timestamps, using pandas (local) or BigQuery (production).

In [12]:
import pandas as pd
from feast import FeatureStore


In [13]:
# Identify entities
project_ids = [1, 2, 3]
now = datetime.now()
timestamps = [datetime(now.year, now.month, now.day)]*len(project_ids)
entity_df = pd.DataFrame.from_dict({"id": project_ids, "event_timestamp": timestamps})
entity_df.head()

|  | id | event_timestamp |
|---|---|---|
| 0 | 1 | 2021-06-07 |
| 1 | 2 | 2021-06-07 |
| 2 | 3 | 2021-06-07 |


In [14]:
# Get historical features
store = FeatureStore(repo_path=Path(config.BASE_DIR, "features"))
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=["project_details:text", "project_details:tags"],
).to_df()
training_df.head()


|  | event_timestamp | id | project_details__text | project_details__tags |
|---|---|---|---|---|
| 0 | 2021-06-07 00:00:00+00:00 | 1 | Machine Learning Basics A practical set of not... | [code, tutorial, keras, pytorch, tensorflow, d... |
| 1 | 2021-06-07 00:00:00+00:00 | 2 | Deep Learning with Electronic Health Record (E... | [article, tutorial, deep-learning, health, ehr] |
| 2 | 2021-06-07 00:00:00+00:00 | 3 | Automatic Parking Management using computer vi... | [code, tutorial, video, python, machine-learni... |
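Conceptually, the historical retrieval above pairs each entity row with the most recent feature row that is no newer than the entity's `event_timestamp` and no older than the `ttl`. The following is a rough pure-Python sketch of that idea only; it is not how Feast implements the join, and the function name `point_in_time_join` and the sample rows are made up for illustration:

```python
from datetime import datetime, timedelta


def point_in_time_join(entity_rows, feature_rows, ttl):
    """Attach to each entity row the newest feature row whose timestamp is at or
    before the entity's event timestamp and at most `ttl` older than it."""
    joined = []
    for entity in entity_rows:
        best = None
        for feat in feature_rows:
            if feat["id"] != entity["id"]:
                continue
            age = entity["event_timestamp"] - feat["created_on"]
            if timedelta(0) <= age <= ttl:
                if best is None or feat["created_on"] > best["created_on"]:
                    best = feat
        joined.append({**entity, "text": best["text"] if best else None})
    return joined


# Made-up rows for illustration only.
feature_rows = [
    {"id": 1, "created_on": datetime(2021, 1, 1), "text": "old"},
    {"id": 1, "created_on": datetime(2021, 5, 1), "text": "new"},
    {"id": 1, "created_on": datetime(2021, 7, 1), "text": "future"},
    {"id": 2, "created_on": datetime(2019, 1, 1), "text": "stale"},
]
entity_rows = [
    {"id": 1, "event_timestamp": datetime(2021, 6, 7)},
    {"id": 2, "event_timestamp": datetime(2021, 6, 7)},
]
result = point_in_time_join(entity_rows, feature_rows, ttl=timedelta(days=476))
for row in result:
    print(row["id"], row["text"])
```

In this sketch, project 1 gets the May row (the July row lies in the future relative to the entity timestamp), and project 2 gets `None` because its only row is older than the `ttl` window. This avoids leaking future values into training data.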


# Materialize

For online inference, we want to retrieve features very quickly via our online store, as opposed to fetching them from slow joins. However, the features are not in our online store just yet, so we'll need to [materialize](https://docs.feast.dev/quickstart#4-materializing-features-to-the-online-store) them first.

In [15]:
%%bash
cd ../features
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME


Materializing 1 feature views to 2021-06-07 13:14:52-07:00 into the sqlite online store.

project_details from 2020-02-17 13:14:53-08:00 to 2021-06-07 13:14:52-07:00:


100%|████████████████████████████████████████████████████████| 2030/2030 [00:00<00:00, 14949.12it/s]


This has moved the features for all of our projects into the online store since this was the first time materializing to the online store. When we subsequently run the [`materialize-incremental`](https://docs.feast.dev/how-to-guides/load-data-into-the-online-store#2-b-materialize-incremental-alternative) command, Feast keeps track of previous materializations, so we'll only materialize data that is new since the last run.
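To make the bookkeeping concrete, here is a toy sketch of incremental materialization: it remembers where the previous run ended and only loads rows that fall after that point. The class `ToyOnlineStore`, its attributes and the sample rows are all hypothetical. This is not Feast's implementation, and unlike Feast it treats the first run as loading everything up to `end`:

```python
from datetime import datetime


class ToyOnlineStore:
    """Illustrative stand-in for an online store; not Feast's implementation."""

    def __init__(self):
        self.latest = {}               # entity id -> newest feature row seen
        self.last_materialized = None  # end of the previous materialization

    def materialize_incremental(self, offline_rows, end):
        """Load rows with last_materialized < created_on <= end; return the count."""
        start = self.last_materialized
        loaded = 0
        for row in offline_rows:
            ts = row["created_on"]
            if ts > end or (start is not None and ts <= start):
                continue  # outside the window for this run
            current = self.latest.get(row["id"])
            if current is None or ts >= current["created_on"]:
                self.latest[row["id"]] = row
            loaded += 1
        self.last_materialized = end
        return loaded


# Made-up offline rows for illustration only.
offline_rows = [
    {"id": 1, "created_on": datetime(2021, 1, 1), "text": "first project"},
    {"id": 2, "created_on": datetime(2021, 3, 1), "text": "second project"},
]
toy_store = ToyOnlineStore()
first_run = toy_store.materialize_incremental(offline_rows, end=datetime(2021, 6, 7))

# A new row arrives; the second run only picks up that row.
offline_rows.append({"id": 3, "created_on": datetime(2021, 6, 8), "text": "third project"})
second_run = toy_store.materialize_incremental(offline_rows, end=datetime(2021, 6, 9))
print(first_run, second_run)
```

The first run loads both existing rows, while the second run skips them and loads only the newly added row.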

# Online features

In [16]:
# Get online features
store = FeatureStore(repo_path=Path(config.BASE_DIR, "features"))
feature_vector = store.get_online_features(
    feature_refs=["project_details:text", "project_details:tags"],
    entity_rows=[{"id": 1000}],
).to_dict()
feature_vector


{'project_details__tags': [['code',
   'course',
   'tutorial',
   'video',
   'natural-language-processing',
   'low-resource']],
 'id': [1000],
 'project_details__text': ['CMU LTI Low Resource NLP Bootcamp 2020 A low-resource natural language and speech processing bootcamp held by the Carnegie Mellon University Language Technologies Institute in May 2020.']}
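The dictionary above is column-oriented: each key maps to a list with one value per requested entity row. If per-entity records are more convenient for inference code, a small helper can regroup it. The helper `to_rows` and the sample values are illustrative assumptions, and the sketch assumes every list has the same length and ordering:

```python
def to_rows(feature_vector):
    """Turn {"column": [v1, v2, ...]} into [{"column": v1}, {"column": v2}, ...]."""
    columns = list(feature_vector)
    return [dict(zip(columns, values)) for values in zip(*feature_vector.values())]


# Made-up values in the same shape as the output above.
example_vector = {
    "id": [1000, 1001],
    "project_details__text": ["first text", "second text"],
    "project_details__tags": [["code", "course"], ["article"]],
}
rows = to_rows(example_vector)
print(rows[0]["project_details__tags"])
```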