### Scenario: MLflow running remotely on EC2

MLflow setup:

- Tracking server: yes, remote server (EC2).
- Backend store: PostgreSQL database.
- Artifact store: S3 bucket.

The experiments can be explored by accessing the remote server.

This example uses AWS to host the remote server, so you need an AWS account to run it. Follow the steps in [`mlflow_on_aws.md`](https://github.com/DataTalksClub/mlops-zoomcamp/blob/main/02-experiment-tracking/mlflow_on_aws.md) to create an AWS account and launch the tracking server.
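Before pointing the client at the server, it can help to confirm that the server is reachable at all. The sketch below assumes the tracking server listens on port 5000 and exposes a `/health` endpoint, as `mlflow server` does; `build_tracking_uri` and `is_server_healthy` are hypothetical helper names for illustration, not part of MLflow.

```python
import urllib.request


def build_tracking_uri(host: str, port: int = 5000) -> str:
    """Build the HTTP tracking URI for an MLflow server (hypothetical helper)."""
    return f"http://{host}:{port}"


def is_server_healthy(tracking_uri: str, timeout: float = 5.0) -> bool:
    """Return True if GET <tracking_uri>/health answers with HTTP 200.

    Assumption: the server exposes a /health endpoint.
    """
    try:
        with urllib.request.urlopen(f"{tracking_uri}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False
```

If the check returns `False`, one thing worth verifying is that the EC2 security group allows inbound traffic on the port the server uses.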

In [1]:
import mlflow
import os

TRACKING_SERVER_HOST = "ec2-13-233-121-70.ap-south-1.compute.amazonaws.com" # fill in with the public DNS of the EC2 instance
mlflow.set_tracking_uri(f"http://{TRACKING_SERVER_HOST}:5000")

In [2]:
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")


tracking URI: 'http://ec2-13-233-121-70.ap-south-1.compute.amazonaws.com:5000'


In [3]:
mlflow.search_experiments()  # list_experiments() has been removed; use search_experiments() instead


[<Experiment: artifact_location='s3://mlartifact-s3/5', creation_time=1753734733192, experiment_id='5', last_update_time=1753734733192, lifecycle_stage='active', name='store-sales-prediction-orchestration', tags={}>,
 <Experiment: artifact_location='s3://mlartifact-s3/4', creation_time=1753733399891, experiment_id='4', last_update_time=1753733399891, lifecycle_stage='active', name='store-sales-hyperopt-random-forest', tags={}>,
 <Experiment: artifact_location='s3://mlartifact-s3/3', creation_time=1753732199567, experiment_id='3', last_update_time=1753732199567, lifecycle_stage='active', name='random-forest-best-models', tags={}>,
 <Experiment: artifact_location='s3://mlartifact-s3/2', creation_time=1753727040615, experiment_id='2', last_update_time=1753727040615, lifecycle_stage='active', name='store-sales-prediction', tags={}>,
 <Experiment: artifact_location='s3://mlartifact-s3/0', creation_time=1751400670126, experiment_id='0', last_update_time=1751400670126, lifecycle_stage='active

In [4]:
# set up the MLflow experiment
mlflow.set_experiment("sales-prediciton-rf-reg")

2025/07/29 03:13:09 INFO mlflow.tracking.fluent: Experiment with name 'sales-prediciton-rf-reg' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlartifact-s3/6', creation_time=1753740789943, experiment_id='6', last_update_time=1753740789943, lifecycle_stage='active', name='sales-prediciton-rf-reg', tags={}>

In [5]:
import pickle

import pandas as pd

from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline


In [6]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
# Read a CSV file into a DataFrame
def read_dataframe(filename):
    df = pd.read_csv(filename)
    return df

In [8]:
# Convert DataFrame rows into a list of feature dictionaries
def prepare_features(df):
    df["date"] = pd.to_datetime(df["date"])

    # Feature engineering: date-based features
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    df["dayofweek"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    df["is_weekend"] = df["dayofweek"].isin([5, 6]).astype(int)  # Saturday or Sunday

    # Features passed to DictVectorizer; note that it one-hot encodes only
    # string values, so numeric columns here are kept as numeric features
    categorical = ["store", "promo", "holiday", "year", "month", "dayofweek", "is_weekend"]
    df_dicts = df[categorical].to_dict(orient="records")

    return df_dicts
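To make the date-derived features concrete, here is a small standard-library sketch of the values computed for a single date. It relies on `datetime.date.weekday()`, which uses the same Monday=0 to Sunday=6 convention as pandas' `dt.dayofweek`. The function name `date_features` and the sample dates are illustrative and not part of the notebook.

```python
from datetime import date


def date_features(iso_date: str) -> dict:
    """Mirror the date-based features of prepare_features for one ISO date string."""
    d = date.fromisoformat(iso_date)
    dayofweek = d.weekday()  # Monday=0 ... Sunday=6
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "dayofweek": dayofweek,
        "is_weekend": int(dayofweek in (5, 6)),  # Saturday or Sunday
    }


print(date_features("2025-07-26"))  # a Saturday: dayofweek=5, is_weekend=1
print(date_features("2025-07-29"))  # a Tuesday: dayofweek=1, is_weekend=0
```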

In [9]:
# load and read datasets
df_train = read_dataframe("./input_data/train.csv")
df_val = read_dataframe("./input_data/test.csv")

target = 'sales'
y_train = df_train[target].values
y_val = df_val[target].values

In [10]:
# feature_engineering
dict_train = prepare_features(df_train)
dict_val = prepare_features(df_val)

### Train RandomForestRegressor

In [11]:
with mlflow.start_run():
    params = dict(max_depth=20, n_estimators=100, min_samples_leaf=10, random_state=0)
    mlflow.log_params(params)

    pipeline = make_pipeline(
        DictVectorizer(),
        RandomForestRegressor(**params, n_jobs=-1)
    )

    pipeline.fit(dict_train, y_train)
    y_pred = pipeline.predict(dict_val)

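    # Arguments are (y_true, y_pred); squared=False returns the RMSE.
    # Note: newer scikit-learn releases deprecate `squared` in favor of root_mean_squared_error.
    rmse = mean_squared_error(y_val, y_pred, squared=False)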
    print(params, rmse)
    mlflow.log_metric('rmse', rmse)

    mlflow.sklearn.log_model(pipeline, artifact_path="model")

{'max_depth': 20, 'n_estimators': 100, 'min_samples_leaf': 10, 'random_state': 0} 7.558145882869018




🏃 View run sincere-auk-734 at: http://ec2-13-233-121-70.ap-south-1.compute.amazonaws.com:5000/#/experiments/6/runs/080e0226c1fc49cc818d3c023625b36d
🧪 View experiment at: http://ec2-13-233-121-70.ap-south-1.compute.amazonaws.com:5000/#/experiments/6
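
The run URL printed above encodes both the experiment ID and the run ID. As a rough sketch, the helper below (a hypothetical `parse_run_url`, not an MLflow function) extracts them and builds a `runs:/<run_id>/model` URI, the form MLflow accepts for referring to a model logged under the `model` artifact path. The URL layout is assumed from the output shown here and may differ across MLflow versions.

```python
import re


def parse_run_url(url: str) -> dict:
    """Extract experiment and run IDs from an MLflow UI run URL (hypothetical helper)."""
    match = re.search(
        r"#/experiments/(?P<experiment_id>\d+)/runs/(?P<run_id>[0-9a-f]{32})", url
    )
    if match is None:
        raise ValueError(f"not a recognizable MLflow run URL: {url}")
    return match.groupdict()


def model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI pointing at a logged model artifact."""
    return f"runs:/{run_id}/{artifact_path}"


run_url = (
    "http://ec2-13-233-121-70.ap-south-1.compute.amazonaws.com:5000"
    "/#/experiments/6/runs/080e0226c1fc49cc818d3c023625b36d"
)
ids = parse_run_url(run_url)
print(ids["experiment_id"], model_uri(ids["run_id"]))
```

With the tracking URI set as in the first cell, the resulting URI could then be passed to `mlflow.pyfunc.load_model(...)` to load the logged pipeline for prediction.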
