# MLOps Graded Assignment - Week 3 Resources 

# Time-Aware Iris Dataset for Feast Tutorial

## Overview

This directory contains a modified, time-series version of the classic Iris dataset. It has been specifically generated to be compatible with the [Feast feature store](https://feast.dev/) and is intended for use in a hands-on tutorial.

Unlike the original static dataset, this version simulates the tracking of features for a few individual iris plants over a period of time, making it suitable for demonstrating real-world feature store concepts.

---

## The Problem with the Standard Iris Dataset

The standard Iris dataset is a simple table of 150 samples, each with four measurements and a species label. While excellent for basic classification tasks, it is unsuitable for demonstrating a feature store because it lacks:

1.  **An Entity**: There is no unique identifier for the object being measured (e.g., a specific plant ID). Feast requires an entity to associate features with.
2.  **Timestamps**: All data exists at a single, unknown point in time. Feast is built around time-series data to provide point-in-time correctness and prevent data leakage in training sets.

This dataset solves these issues by introducing an `iris_id` as the entity and an `event_timestamp` for each feature measurement.
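
To make the idea of point-in-time correctness concrete, here is a minimal sketch in plain Python. It is not Feast's implementation (Feast also applies a TTL and works on whole tables); the sample rows, the dates, and the helper name `point_in_time_lookup` are assumptions made for illustration. For a given `iris_id` and a label timestamp, it returns the most recent measurement recorded at or before that timestamp, so later measurements cannot leak into a training row.

```python
from datetime import datetime, timezone

# Hypothetical measurements for one plant: (iris_id, event_timestamp, sepal_length)
FEATURE_ROWS = [
    (1001, datetime(2025, 1, 1, tzinfo=timezone.utc), 5.1),
    (1001, datetime(2025, 1, 2, tzinfo=timezone.utc), 5.2),
    (1001, datetime(2025, 1, 3, tzinfo=timezone.utc), 5.0),
]


def point_in_time_lookup(rows, iris_id, as_of):
    """Return the latest value recorded at or before `as_of`, or None."""
    candidates = [
        (ts, value) for rid, ts, value in rows if rid == iris_id and ts <= as_of
    ]
    if not candidates:
        return None
    # Pick the candidate with the most recent event timestamp
    return max(candidates, key=lambda item: item[0])[1]


# A label observed at noon on 2 January must not see the 3 January measurement
label_time = datetime(2025, 1, 2, 12, tzinfo=timezone.utc)
print(point_in_time_lookup(FEATURE_ROWS, 1001, label_time))  # 5.2
```

Without the `ts <= as_of` filter, the lookup would return the 3 January value, which is exactly the kind of leakage the timestamps are meant to prevent.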

---

## Dataset Schema

The data is stored in the `iris_data_adapted_for_feast.csv` file and has the following columns:

| Column Name         | Data Type | Description                                                                                                                                                             | Feast Role             |
| ------------------- | --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------- |
| `event_timestamp`   | Timestamp | The exact UTC timestamp when the measurement was recorded. This is crucial for historical lookups and point-in-time joins.                                               | **Timestamp Field** |
| `iris_id`           | Integer   | A unique identifier for each individual iris plant being tracked.                                                                                                         | **Entity Key** |
| `sepal_length`      | Float     | The length of the sepal in centimeters.                                                                                                                                 | Feature                |
| `sepal_width`       | Float     | The width of the sepal in centimeters.                                                                                                                                  | Feature                |
| `petal_length`      | Float     | The length of the petal in centimeters.                                                                                                                                 | Feature                |
| `petal_width`       | Float     | The width of the petal in centimeters.                                                                                                                                  | Feature                |
| `species`           | String    | The species of the iris plant (`setosa`, `versicolor`, or `virginica`). Can be used as a feature or a prediction target (label).                                          | Feature / Label        |
| `created_timestamp` | Timestamp | The UTC timestamp when the data row was created or ingested. Feast can use this to break ties between rows that share the same entity and `event_timestamp`, keeping the most recently created one. | **Created Timestamp** |
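
As a quick sanity check of the schema above, the following sketch parses CSV text with the standard library and converts each column to the type listed in the table. The sample row, the column order, and the timestamp format are assumptions; the real file may differ, and `parse_rows` is a hypothetical helper rather than part of Feast.

```python
import csv
import io
from datetime import datetime

FLOAT_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
TIMESTAMP_COLUMNS = ["event_timestamp", "created_timestamp"]
EXPECTED_COLUMNS = TIMESTAMP_COLUMNS + ["iris_id", "species"] + FLOAT_COLUMNS
VALID_SPECIES = {"setosa", "versicolor", "virginica"}

# Hypothetical sample row; values and timestamp format are illustrative only
SAMPLE_CSV = (
    "event_timestamp,iris_id,sepal_length,sepal_width,petal_length,"
    "petal_width,species,created_timestamp\n"
    "2025-01-01 00:00:00+00:00,1001,5.1,3.5,1.4,0.2,setosa,"
    "2025-01-16 00:00:00+00:00\n"
)


def parse_rows(text):
    """Parse CSV text and convert every column to its documented type."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        row = {"iris_id": int(raw["iris_id"]), "species": raw["species"]}
        if row["species"] not in VALID_SPECIES:
            raise ValueError(f"unexpected species: {row['species']}")
        for name in FLOAT_COLUMNS:
            row[name] = float(raw[name])
        for name in TIMESTAMP_COLUMNS:
            row[name] = datetime.fromisoformat(raw[name])
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE_CSV)
print(rows[0]["iris_id"], rows[0]["species"])
```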

---

## How The Data Was Generated

This dataset was synthetically generated using a Python script:

1.  The base data comes from the `scikit-learn` Iris dataset.
2.  We simulated **3 unique iris plants** and assigned each an `iris_id` (1001, 1002, 1003).
3.  For each plant, we generated **15 days of sequential data**, creating a unique `event_timestamp` for each day.
4.  To simulate real-world variance, a small amount of random noise was added to the feature measurements (`sepal_length`, etc.) for each timestamp.
5.  The final DataFrame was saved in the efficient Parquet file format.
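
The steps above can be sketched roughly as follows. This is not the original generation script: it uses only the standard library, hard-codes the first sample of each species from the scikit-learn Iris data as the base measurements, and skips the final save step. The mapping of species to `iris_id`, the start date, the noise range, and the choice of `created_timestamp` are all assumptions.

```python
import random
from datetime import datetime, timedelta, timezone

FEATURE_NAMES = ("sepal_length", "sepal_width", "petal_length", "petal_width")

# First sample of each species in the Iris data; which species gets
# which iris_id is an assumption made for this sketch.
BASE_PLANTS = {
    1001: ("setosa", (5.1, 3.5, 1.4, 0.2)),
    1002: ("versicolor", (7.0, 3.2, 4.7, 1.4)),
    1003: ("virginica", (6.3, 3.3, 6.0, 2.5)),
}


def generate_rows(start, days=15, noise=0.05, seed=42):
    """Create one noisy measurement per plant per day."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    created = start + timedelta(days=days)  # assumed ingestion time
    rows = []
    for iris_id, (species, base) in BASE_PLANTS.items():
        for day in range(days):
            row = {
                "event_timestamp": start + timedelta(days=day),
                "iris_id": iris_id,
            }
            for name, value in zip(FEATURE_NAMES, base):
                # Small uniform noise simulates day-to-day variance
                row[name] = round(value + rng.uniform(-noise, noise), 2)
            row["species"] = species
            row["created_timestamp"] = created
            rows.append(row)
    return rows


rows = generate_rows(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(len(rows))  # 45 (3 plants x 15 days)
```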

---

### Overview
Making a prediction using a linear regression model is a common use case in ML. In this tutorial, we build a model that predicts whether a driver will complete a trip, based on a number of features ingested into Feast.

The basic local mode gives you the ability to try Feast quickly, while the advanced mode shows how you can use Feast in a production setting, in particular on Google Cloud Platform (GCP).

This tutorial uses Feast with scikit-learn to:

* Train a model locally using data from BigQuery
* Test the model for online inference using SQLite (for fast iteration)
* Test the model for online inference using Firestore (to represent production)


## Step 1: Install feast, scikit-learn

Install Feast, its GCP dependencies, and scikit-learn.


In [4]:
!pip install --quiet feast scikit-learn 'feast[gcp]'

#### Check feast version

In [3]:
!feast version

[1m[34mFeast SDK Version: [1m[32m"0.55.0"


## Step 2: Clone the Git repo (Not needed for assignment)

Clone the Driver Ranking Git repo into your Colab Folder

In [5]:
# !git clone https://github.com/feast-dev/feast-driver-ranking-tutorial.git

Cloning into 'feast-driver-ranking-tutorial'...
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 65 (delta 26), reused 43 (delta 14), pack-reused 0 (from 0)
Receiving objects: 100% (65/65), 21.31 KiB | 5.33 MiB/s, done.
Resolving deltas: 100% (26/26), done.


## Step 3: Set up your Google Cloud Platform (GCP) Configurations

Set the following configuration values, which are used throughout the tutorial:

* `PROJECT_ID`: Your GCP project ID.
* `BUCKET_NAME`: The name of a bucket used to store the feature store registry and model artifacts.
* `BIGQUERY_DATASET_NAME`: The name of a dataset in which tables containing features will be created.
* `AI_PLATFORM_MODEL_NAME`: The name of the model that will be created in AI Platform.

In [7]:
PROJECT_ID= "ivory-totem-474120-s5" #@param {type:"string"}
BUCKET_NAME= "21f2000143-mlops-week3-ga-3-feast" #@param {type:"string"} custom
BIGQUERY_DATASET_NAME="21f2000143-mlops-week3-ga-3-feast" #@param {type:"string"} custom
AI_PLATFORM_MODEL_NAME="21f2000143-mlops-week3-ga-3-feast" #@param {type:"string"}

! gcloud config set project $PROJECT_ID
%env GOOGLE_CLOUD_PROJECT=$PROJECT_ID
!echo project_id = $PROJECT_ID > ~/.bigqueryrc

Updated property [core/project].
env: GOOGLE_CLOUD_PROJECT=ivory-totem-474120-s5


In [8]:
# Only run if your bucket doesn't already exist!
! gsutil mb gs://$BUCKET_NAME

Creating gs://21f2000143-mlops-week3-ga-3-feast/...
ServiceException: 409 A Cloud Storage bucket named '21f2000143-mlops-week3-ga-3-feast' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


## Step 4: Apply and deploy feature definitions

`feast apply` scans the Python files in the current directory for feature definitions and deploys infrastructure according to `feature_store.yaml`.

In [12]:
!cd feast-driver-ranking-tutorial/driver_ranking && feast apply

Traceback (most recent call last):
  File "/opt/conda/bin/feast", line 7, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/feast/cli/cli.py", line 272, in apply_total_command
    apply_to

### Inspect the files created under your local folder

In [None]:
%%shell
cd /content/feast-driver-ranking-tutorial/driver_ranking/data/
ls -l

total 20
-rw-r--r-- 1 root root 16384 Mar 20 02:49 online.db
-rw-r--r-- 1 root root   787 Mar 20 02:57 registry.db




## Step 5: Train your model

In [None]:
!pip install "numpy<2" "pandas==2.2.2"



In [None]:
import feast
from joblib import dump
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load driver order data
orders = pd.read_csv("/content/feast-driver-ranking-tutorial/driver_orders.csv", sep="\t")
orders["event_timestamp"] = pd.to_datetime(orders["event_timestamp"])

# Connect to your feature store provider
fs = feast.FeatureStore(repo_path="/content/feast-driver-ranking-tutorial/driver_ranking")

# Retrieve training data from BigQuery
training_df = fs.get_historical_features(
    entity_df=orders,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

# Train model
target = "trip_completed"

reg = LinearRegression()
train_X = training_df[training_df.columns.drop(target).drop("event_timestamp")]
train_Y = training_df.loc[:, target]
reg.fit(train_X[sorted(train_X)], train_Y)

# Save model
dump(reg, "driver_model.bin")



----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   event_timestamp  10 non-null     datetime64[us, UTC]
 1   driver_id        10 non-null     int64              
 2   trip_completed   10 non-null     int64              
 3   conv_rate        10 non-null     float64            
 4   acc_rate         10 non-null     float64            
 5   avg_daily_trips  10 non-null     int64              
dtypes: datetime64[us, UTC](1), float64(2), int64(3)
memory usage: 612.0 bytes
None

----- Example features -----

            event_timestamp  driver_id  trip_completed  conv_rate  acc_rate  \
0 2021-04-17 04:29:28+00:00       1002               0   0.586277  0.374841   
1 2021-04-19 04:29:28+00:00       1002               0   0.586277  0.374841   
2 2021-04-18 04:29:28+00:00       1002               0 

['driver_model.bin']

## Step 6: Materialize your online store
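Materialize data from the offline store into the online store (Firestore in the GCP setup; the run shown below writes to the local SQLite store).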

In [None]:
!cd /content/feast-driver-ranking-tutorial/driver_ranking/ && feast materialize 2021-01-01T00:00:00 2022-01-01T00:00:00

Materializing 1 feature views from 2021-01-01 00:00:00+00:00 to 2022-01-01 00:00:00+00:00 into the sqlite online store.

driver_hourly_stats:
100%|██████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 376.42it/s]
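
Conceptually, materialization copies the newest feature values in a time window from the offline store into a key-value online store. The sketch below shows that idea in plain Python and is not how Feast implements it: the rows and the `materialize` helper are hypothetical, the window is treated as inclusive on both ends as a simplifying assumption, and TTLs and provider-specific writes are ignored.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Build a timezone-aware UTC datetime."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical offline rows; values are illustrative only
OFFLINE_ROWS = [
    {"driver_id": 1001, "event_timestamp": utc(2021, 4, 1), "conv_rate": 0.40},
    {"driver_id": 1001, "event_timestamp": utc(2021, 4, 12), "conv_rate": 0.55},
    {"driver_id": 1002, "event_timestamp": utc(2021, 4, 10), "conv_rate": 0.61},
    {"driver_id": 1002, "event_timestamp": utc(2022, 3, 1), "conv_rate": 0.99},
]


def materialize(offline_rows, start, end):
    """Keep the newest row per driver whose timestamp lies in [start, end]."""
    online_store = {}
    for row in offline_rows:
        ts = row["event_timestamp"]
        if not (start <= ts <= end):
            continue  # outside the materialization window
        current = online_store.get(row["driver_id"])
        if current is None or ts > current["event_timestamp"]:
            online_store[row["driver_id"]] = row
    return online_store


online_store = materialize(OFFLINE_ROWS, utc(2021, 1, 1), utc(2022, 1, 1))
print(online_store[1001]["conv_rate"])  # 0.55
```

After this step, an online lookup for a driver is a single key read, which is what makes low-latency inference possible.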


## Step 7: Make a prediction

In [None]:
import pandas as pd
import feast
from joblib import load


class DriverRankingModel:
    def __init__(self):
        # Load model
        self.model = load("/content/driver_model.bin")

        # Set up feature store
        self.fs = feast.FeatureStore(repo_path="/content/feast-driver-ranking-tutorial/driver_ranking/")

    def predict(self, driver_ids):
        # Read features from Feast
        driver_features = self.fs.get_online_features(
            entity_rows=[{"driver_id": driver_id} for driver_id in driver_ids],
            features=[
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:acc_rate",
                "driver_hourly_stats:avg_daily_trips",
            ],
        )
        df = pd.DataFrame.from_dict(driver_features.to_dict())

        # Make prediction
        df["prediction"] = self.model.predict(df[sorted(df)])

        # Choose best driver
        best_driver_id = df["driver_id"].iloc[df["prediction"].argmax()]

        # return best driver
        return best_driver_id
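
Both the training cell and `predict` pass `sorted(...)` column names to the model. The sketch below illustrates why, using plain dictionaries in place of DataFrames (the values and the `to_feature_vector` helper are hypothetical): sorting by name gives the same feature order at training and inference time, regardless of the order in which the store returns the columns. Note that `driver_id` ends up in the feature vector in both places, because the tutorial code does not drop it.

```python
def to_feature_vector(features):
    """Order feature values by sorted feature name."""
    return [features[name] for name in sorted(features)]


# Same values, different key order, as might happen between the
# offline (training) and online (inference) retrieval paths
training_row = {"driver_id": 1001, "conv_rate": 0.5, "acc_rate": 0.9, "avg_daily_trips": 20}
online_row = {"acc_rate": 0.9, "avg_daily_trips": 20, "conv_rate": 0.5, "driver_id": 1001}

print(to_feature_vector(training_row))  # [0.9, 20, 0.5, 1001]
print(to_feature_vector(online_row))    # [0.9, 20, 0.5, 1001]
```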

In [None]:
def make_drivers_prediction():
    drivers = [1001, 1002, 1003, 1004]
    model = DriverRankingModel()
    best_driver = model.predict(drivers)
    print(f"Prediction for best driver id: {best_driver}")

In [None]:
make_drivers_prediction()

Prediction for best driver id: 1001


