# Welcome to Full Stack Machine Learning's Week 4 Project!

In the final week, you will return to the workflow you built last week on the [taxi dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). 

## Task 1: Deploy the champion
Use what you have learned in the last two weeks to make necessary modifications and to deploy your latest version of the `TaxiFarePrediction` flow to Argo. Use `--branch champion` to denote this deployment as the champion model.

In [1]:
%load_ext autoreload
%autoreload 2

In [8]:

%%writefile taxi_fare_prediction_champion.py

# Only the names this flow actually uses are imported.
from metaflow import FlowSpec, Parameter, conda_base, project, retry, step, timeout, trigger

import pandas as pd

URL = "https://outerbounds-datasets.s3.us-west-2.amazonaws.com/taxi/latest.parquet"
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    obviously_bad_data_filters = [
        "fare_amount > 0",  # fare_amount in US Dollars
        "trip_distance <= 100",  # trip_distance in miles
        "trip_distance > 0",
        "passenger_count > 0",
        "tpep_pickup_datetime < tpep_dropoff_datetime",
        "tip_amount >= 0",
        "tolls_amount >= 0",
        "improvement_surcharge >= 0",
        "total_amount >= 0",
        "congestion_surcharge >= 0",
        "airport_fee >= 0",
        # TODO: add some logic to filter out what you decide is bad data!
        # TIP: Don't spend too much time on this step for this project, though; in practice it is a never-ending process.
    ]

    df = df.query(" & ".join(obviously_bad_data_filters))

    if len(df) == 0:
        raise ValueError("No entries remain after filtering.")

    return df

@project(name="taxi_fare_prediction")
@trigger(events=["s3"])
@conda_base(
    libraries={
        "pandas": "2.1.2",  # bump version
        "pyarrow": "13.0.0",  # bump version
        # "numpy": "1.21.2",  # omitted: pandas pulls in numpy as a dependency
        "scikit-learn": "1.3.2",  # bump version
    }
)
class TaxiFarePrediction(FlowSpec):
    data_url = Parameter("data_url", default=URL)

    @retry(times=3, minutes_between_retries=1)
    @step
    def start(self):
        """Read data separately to allow retries."""
        import pandas as pd

        self.df = pd.read_parquet(self.data_url)

        self.next(self.transform_features)

    @step
    def transform_features(self):
        """Clean data."""

        self.df = clean_data(self.df)

        self.X = self.df["trip_distance"].values.reshape(-1, 1)
        self.y = self.df["total_amount"].values

        self.next(self.train_linear_model)

    @timeout(minutes=5)
    @step
    def train_linear_model(self):
        "Train linear model."
        from sklearn.linear_model import LinearRegression

        self.model = LinearRegression()

        self.model.fit(self.X, self.y)

        self.next(self.predict)
    
    @step
    def predict(self):
        "Run in-sample prediction."
        from sklearn.metrics import mean_absolute_error

        self.y_hat = self.model.predict(self.X)
        self.score = mean_absolute_error(self.y, self.y_hat)

        self.next(self.end)

    @step
    def end(self):
        """
        End of flow!
        """
        print('Scores:')
        print(f"The in-sample MAE of the linear model is {self.score:.2f}.")


if __name__ == "__main__":
    TaxiFarePrediction()

Overwriting taxi_fare_prediction_champion.py


In [9]:
! python taxi_fare_prediction_champion.py --environment=conda --production --branch champion argo-workflows create

Metaflow 2.10.6+ob(v1) executing TaxiFarePrediction for user:sandbox
Project: taxi_fare_prediction, Branch: prod.champion
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Deploying taxifareprediction.prod.champion.taxifareprediction to Argo Workflows...
It seems this is the first time you are deploying taxifareprediction.prod.champion.taxifareprediction to Argo Workflows.

A new production token generated.

The namespace of this production flow is
    production

In [10]:
! python taxi_fare_prediction_champion.py --environment=conda --production --branch champion argo-workflows trigger

Metaflow 2.10.6+ob(v1) executing TaxiFarePrediction for user:sandbox
Project: taxi_fare_prediction, Branch: prod.champion
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Workflow taxifareprediction.prod.champion.taxifareprediction triggered on Argo Workflows (run-id argo-taxifareprediction.prod.champion.taxifareprediction-pvmdm).
See the run in the UI at https://ui-pw-623054330.outerbounds.dev/TaxiFarePrediction/argo-taxifareprediction.prod.champion.taxifareprediction-pvmdm


## Task 2: Build the challenger
Develop a second model using the same `TaxiFarePrediction` architecture. Then deploy the flow to Argo with `--branch challenger`.
<br>
<br>
Hint: Modify the `train_linear_model` step.
<br>
Bonus: Write a one-paragraph summary of how you developed the second model and how you tested it before deploying the challenger flow. Let us know in Slack what you found challenging about the task.
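
One possible shape for the challenger, offered as a sketch rather than the expected solution: swap the linear model for a tree-based regressor and score it on held-out rows instead of in-sample. The helper names `make_challenger` and `holdout_mae` are hypothetical, and the choice of `HistGradientBoostingRegressor` is an assumption; any scikit-learn regressor with `fit` and `predict` would slot in the same way.

```python
def make_challenger():
    """Build a candidate challenger model (an assumed choice, not prescribed).

    The import sits inside the function, mirroring how the flow's steps
    import scikit-learn only where it is needed.
    """
    from sklearn.ensemble import HistGradientBoostingRegressor

    return HistGradientBoostingRegressor(max_iter=100, random_state=0)


def holdout_mae(model, X, y, test_fraction=0.2):
    """Fit on the leading rows and return the MAE on the held-out tail.

    Rows are split by position, not shuffled, so the score reflects how the
    model does on rows it has never seen.
    """
    n_test = max(1, int(len(y) * test_fraction))
    split = len(y) - n_test

    model.fit(X[:split], y[:split])
    y_hat = model.predict(X[split:])

    # Plain-Python MAE so the helper works with lists and NumPy arrays alike.
    errors = [abs(float(t) - float(p)) for t, p in zip(y[split:], y_hat)]
    return sum(errors) / len(errors)
```

Inside the modified training step this could be used as `self.model = make_challenger()` followed by `self.score = holdout_mae(self.model, self.X, self.y)`. Note that the champion above reports in-sample MAE, so the two scores are only comparable if both flows evaluate the same way.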

## Task 3: Analyze the results
Return to this notebook, and read in the results of the challenger and champion flow using the Metaflow Client API.
<br><br>
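
A minimal sketch of reading both results with the Client API, under a few assumptions: both deployments store their metric in the `score` artifact (as the champion flow above does), the production branches are tagged `project_branch:prod.champion` and `project_branch:prod.challenger`, and the helper names `latest_score` and `compare_mae` are hypothetical.

```python
def latest_score(branch, flow_name="TaxiFarePrediction"):
    """Return the `score` artifact of the latest successful run on a branch."""
    from metaflow import Flow, namespace

    # Production runs live outside your user namespace, so switch to the
    # global namespace. This changes the namespace for the whole session.
    namespace(None)

    # Runs are returned newest first; the tag filter is an assumption about
    # how @project labels runs of each branch.
    for run in Flow(flow_name).runs(f"project_branch:{branch}"):
        if run.successful:
            return run.data.score
    raise LookupError(f"No successful run found for branch {branch!r}.")


def compare_mae(champion_mae, challenger_mae):
    """Summarize which model has the lower MAE and by how much."""
    improvement = champion_mae - challenger_mae
    relative = improvement / champion_mae if champion_mae else float("nan")
    return {
        "winner": "challenger" if improvement > 0 else "champion",
        "absolute_improvement": improvement,
        "relative_improvement": relative,
    }
```

In the notebook, something like `compare_mae(latest_score("prod.champion"), latest_score("prod.challenger"))` would then give a first, rough comparison once both flows have completed a run.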

#### Questions
- Does your model perform better on the metrics you selected? 
- Think about your day job: how would you go about assessing whether to roll forward the production "champion" to your new model?
    - What gives you confidence one model is better than another?
    - What kinds of information do you need to monitor to get buy-in from stakeholders that model A is preferable to model B?  
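
One way to approach the confidence question, sketched here as an illustration rather than a required method: a paired percentile bootstrap on the per-row difference in absolute error. It assumes you have predictions from both models on the same held-out rows; the function name and its parameters are hypothetical.

```python
import random


def bootstrap_mae_difference(y_true, champion_pred, challenger_pred,
                             n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for (champion MAE - challenger MAE).

    A positive interval that excludes zero suggests the challenger's lower
    error is unlikely to be a sampling fluke on this data.
    """
    rng = random.Random(seed)

    # Per-row difference in absolute error; positive favors the challenger.
    row_diffs = [
        abs(t - a) - abs(t - b)
        for t, a, b in zip(y_true, champion_pred, challenger_pred)
    ]
    n = len(row_diffs)

    # Resample rows with replacement and record the mean difference each time.
    means = []
    for _ in range(n_resamples):
        sample = [row_diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval like this is only one input to a rollout decision; it says nothing about latency, cost, or how the model behaves on data that drifts away from the evaluation set.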

## CONGRATULATIONS! 🎉✨🍾
If you made it this far, you have completed the Full Stack Machine Learning CoRise course.
We are so glad that you chose to learn with us, and hope to see you again in future courses. Stay tuned for more content and come join us in [Slack](http://slack.outerbounds.co/) to keep learning about Metaflow!