<h1 style="text-align: center; font-size: 50px;">😷 Register Model for COVID Movement Patterns with VAR (Vector Autoregression)</h1>
This notebook prepares COVID-19 mobility and case data for two cities, New York and London, registers the VAR forecasting model in MLflow, and runs a test inference.

## Notebook Overview
- Imports
- Configurations
- Verify Assets
- Preparing the Data
- Logging Model to MLflow
- Fetching the Latest Model Version from MLflow
- Loading the Model and Running Inference

## Imports

In [1]:
# ------------------------ Data Manipulation ------------------------
import pandas as pd
import numpy as np

# ------------------------ System Utilities ------------------------
import warnings
import logging
from pathlib import Path
import os
import pickle
import time
import json
from typing import Any, Optional, Dict

# ------------------------ MLflow for Experiment Tracking and Model Management ------------------------
import mlflow
from mlflow import MlflowClient
from mlflow.models.signature import ModelSignature
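from mlflow.types.schema import Schema, ColSpec, TensorSpec, ParamSchema, ParamSpec

# ------------------------ Time Series Utilities ------------------------
from statsmodels.tsa.stattools import adfuller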

## Configurations

In [2]:
# Suppress Python warnings
warnings.filterwarnings("ignore")

In [3]:
# Create logger
logger = logging.getLogger("cities_analysis_logger")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers.clear()

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s", 
                              datefmt="%Y-%m-%d %H:%M:%S")  

stream_handler = logging.StreamHandler()
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)

In [4]:
# ------------------------- Paths -------------------------
DATA_PATH = "/home/jovyan/datafabric/tutorial/"
ARTIFACTS_PATH = "../artifacts"

# ------------------------ MLflow Integration ------------------------
EXPERIMENT_NAME = "Two_Cities_Experiment"
RUN_NAME = "Two_Cities_Run"
MODEL_NAME = "Two_Cities_Model"

In [5]:
start_time = time.time()  
logger.info('Notebook execution started.')

2025-09-11 01:46:01 - INFO - Notebook execution started.


## Verify Assets


In [6]:
# Check whether the dataset folder exists
is_dataset_available = Path(DATA_PATH).exists()

# Log the configuration status of the dataset
if is_dataset_available:
    logger.info("The Dataset is properly configured.")
else:
    logger.warning(
        "The Dataset is not properly configured. Please create and download the required assets "
        "in your project on AI Studio."
    )


2025-09-11 01:46:01 - INFO - The Dataset is properly configured.
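
The check above only confirms that the folder exists. A minimal sketch of a stricter check follows, assuming the four CSV file names that are read later in this notebook; `find_missing_files` and `EXPECTED_FILES` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the read_csv calls in the "Preparing the Data" section.
EXPECTED_FILES = [
    "NewYork_mobility.csv",
    "London_mobility.csv",
    "daily_data_NewYork.csv",
    "daily_data_London.csv",
]


def find_missing_files(folder: str, names: Iterable[str]) -> List[str]:
    """Return the names that do not exist as regular files inside `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # The folder path mirrors DATA_PATH from the configuration cell.
    missing = find_missing_files("/home/jovyan/datafabric/tutorial/", EXPECTED_FILES)
    print("Missing files:", missing or "none")
```

Reporting the missing names, rather than a single boolean, makes it easier to see which asset still has to be downloaded.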


## Preparing the Data

### Acknowledgments:
I'd like to thank the original authors of these data sources!

| Data | Original Source |
| --- | --- |
| Mobility Data | [COVID-19 Community Mobility Reports](https://www.google.com/covid19/mobility/) |
| NYC Cases | [NYC Department of Health and Mental Hygiene](https://www1.nyc.gov/site/doh/index.page) |
| London Cases | [GOV.UK Coronavirus (COVID-19) in the UK](https://coronavirus.data.gov.uk/) |

In [7]:
source_folder = DATA_PATH
ny_mobility = pd.read_csv(f"{source_folder}NewYork_mobility.csv")
ldn_mobility = pd.read_csv(f"{source_folder}London_mobility.csv")
ny_cases = pd.read_csv(f"{source_folder}daily_data_NewYork.csv")
ldn_cases = pd.read_csv(f"{source_folder}daily_data_London.csv")

In [8]:
def rename_mobility_cols(df: pd.DataFrame) -> pd.DataFrame:
    """
    Renames and cleans the dataset columns.

    Parameters:
        df (pd.DataFrame): A DataFrame containing the mobility data.

    Returns:
        pd.DataFrame: The cleaned DataFrame with the renamed columns.
    """
    # Rename columns
    df = df.rename(columns={'country_region':'country'})
    df = df.rename(columns={'retail_and_recreation_percent_change_from_baseline':'retail'})
    df = df.rename(columns={'grocery_and_pharmacy_percent_change_from_baseline':'pharmacy'})
    df = df.rename(columns={'parks_percent_change_from_baseline':'parks'})
    df = df.rename(columns={'transit_stations_percent_change_from_baseline':'transit_station'})
    df = df.rename(columns={'workplaces_percent_change_from_baseline':'workplaces'})
    df = df.rename(columns={'residential_percent_change_from_baseline':'residential'})
    df.drop(['country_region_code','sub_region_1', 'sub_region_2', 'residential'], axis=1, inplace = True)
    return df


ny_mobility = ny_mobility.loc[ny_mobility['sub_region_2'] == "New York County"].reset_index(drop=True)
ldn_mobility = ldn_mobility.loc[ldn_mobility['sub_region_2'] == "City of London"].reset_index(drop=True)

ny_mobility = rename_mobility_cols(ny_mobility)
ldn_mobility = rename_mobility_cols(ldn_mobility)

mobility_features = ny_mobility.columns[6:]

| Mobility Features     | Description                                                                                                                           |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------|
| country          | Country Name                                                                         |
| metro_area       | Metropolitan area                                                                    |
| iso_3166_2_code  | Codes for the names of the principal subdivisions (e.g. provinces or states)         |
| census_fips_code | Census fips code                                                                     |
| place_id         | Place IDs uniquely identify a place in the Google Places database and on Google Maps |
| date             | Date                                                                                 |
| retail          | Mobility trends for places like restaurants, cafes, shopping centers, theme parks, museums, libraries, and movie theaters.            |
| pharmacy        | Mobility trends for places like grocery markets, food warehouses, farmers markets, specialty food shops, drug stores, and pharmacies. |
| parks           | Mobility trends for places like local parks, national parks, public beaches, marinas, dog parks, plazas, and public gardens.          |
| transit_station | Mobility trends for places like public transport hubs such as subway, bus, and train stations.                                        |
| workplaces      | Mobility trends for places of work.                                                                                                   |
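
The chained `rename` calls used for the mobility data can also be written as a single mapping. A minimal sketch on a toy frame, assuming the Google Mobility column names shown in the code above; `MOBILITY_RENAMES` and `rename_with_mapping` are illustrative names, and the `residential` column is left out because the notebook drops it anyway.

```python
import pandas as pd

# Long source column names mapped to the short names used in this notebook.
MOBILITY_RENAMES = {
    "country_region": "country",
    "retail_and_recreation_percent_change_from_baseline": "retail",
    "grocery_and_pharmacy_percent_change_from_baseline": "pharmacy",
    "parks_percent_change_from_baseline": "parks",
    "transit_stations_percent_change_from_baseline": "transit_station",
    "workplaces_percent_change_from_baseline": "workplaces",
}


def rename_with_mapping(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the mobility columns in one call; keys missing from df are ignored."""
    return df.rename(columns=MOBILITY_RENAMES)


# Toy frame with two of the source columns (values are illustrative only).
toy = pd.DataFrame({
    "country_region": ["United Kingdom"],
    "parks_percent_change_from_baseline": [-12.0],
})
renamed = rename_with_mapping(toy)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary also gives a single place to check when Google's column names change.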

In [9]:
ldn_mobility.head()

Unnamed: 0,country,metro_area,iso_3166_2_code,census_fips_code,place_id,date,retail,pharmacy,parks,transit_station,workplaces
0,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,2020-02-15,-5.0,-9.0,-12.0,-11.0,
1,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,2020-02-16,-1.0,-21.0,-23.0,-13.0,
2,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,2020-02-17,-3.0,-2.0,4.0,-1.0,-4.0
3,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,2020-02-18,-2.0,-2.0,-1.0,-2.0,-2.0
4,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,2020-02-19,-7.0,-4.0,5.0,0.0,-4.0


Try it out yourself 🚀 : [Get the Address for a Place ID 🌎](https://developers.google.com/maps/documentation/javascript/examples/geocoding-place-id)

![](https://i.imgur.com/B69162k.png)

In [10]:
def rename_dailyData_cols(df: pd.DataFrame) -> pd.DataFrame:
    """
    Renames the daily data DataFrame columns to standardized names.

    Parameters:
        df (pd.DataFrame): A Dataframe containing daily COVID-19 statistics.

    Returns:
        pd.DataFrame: The Dataframe with the renamed columns.
    """
    mapping = {df.columns[0]:'date', df.columns[1]: 'case_count', df.columns[2]:'hospitalized_count', df.columns[3]: 'death_count'} 
    df = df.rename(columns = mapping)
    return df

ny_cases = ny_cases[['date_of_interest','CASE_COUNT','HOSPITALIZED_COUNT','DEATH_COUNT']]
ldn_cases = ldn_cases[['date','newCasesBySpecimenDate', 'newAdmissions', 'newDeaths28DaysByDeathDate']]

ny_cases = rename_dailyData_cols(ny_cases)
ldn_cases = rename_dailyData_cols(ldn_cases)

cases_features = ny_cases.columns[1:]

| Cases Features     | Description                    |
|--------------------|--------------------------------|
| date               | Date                           |
| case_count         | Number of daily cases recorded |
| hospitalized_count | Number of people hospitalized  |
| death_count        | Number of deaths recorded      |

In [11]:
ny_cases.head()

Unnamed: 0,date,case_count,hospitalized_count,death_count
0,02/29/2020,1,1,0
1,03/01/2020,0,1,0
2,03/02/2020,0,2,0
3,03/03/2020,1,7,0
4,03/04/2020,5,2,0


In [12]:
for df in ny_mobility, ldn_mobility, ny_cases, ldn_cases:
    df['date'] = df['date'].astype('datetime64[ns]')
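
The cast above relies on pandas inferring the date layout, while the two sources use different ones: the NYC case file uses month/day/year (`02/29/2020`) and the mobility files use ISO dates (`2020-02-15`). A minimal sketch of parsing with explicit formats, using toy series in place of the real columns:

```python
import pandas as pd

ny_dates = pd.Series(["02/29/2020", "03/01/2020"])   # month/day/year, as in ny_cases
iso_dates = pd.Series(["2020-02-15", "2020-02-16"])  # ISO dates, as in the mobility data

# An explicit format removes any day/month ambiguity and fails loudly on bad rows.
ny_parsed = pd.to_datetime(ny_dates, format="%m/%d/%Y")
iso_parsed = pd.to_datetime(iso_dates, format="%Y-%m-%d")

print(ny_parsed.dt.strftime("%Y-%m-%d").tolist())
print(iso_parsed.dt.strftime("%Y-%m-%d").tolist())
```

This matters most for dates such as `03/04/2020`, which could be read as either 4 March or 3 April without a format.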

In [13]:
def merge_data(mobility: pd.DataFrame, cases: pd.DataFrame) -> pd.DataFrame:
    """
    Merges mobility and case data on the 'date' column using an inner join.

    Parameters:
        mobility (pd.DataFrame): DataFrame containing mobility indicators.
        cases (pd.DataFrame): DataFrame containing COVID-19 case statistics.

    Returns:
        pd.DataFrame: Merged DataFrame containing both mobility and case data, aligned by date.
    """
    merged_df = pd.merge(mobility, cases, how='inner', on = 'date')
    return merged_df

# setting date as the index column
ny_df = merge_data(ny_mobility, ny_cases).set_index('date')
ldn_df = merge_data(ldn_mobility, ldn_cases).set_index('date')
ldn_df = ldn_df.iloc[:-2,:]

features = ny_df.columns[5:]
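
Because `merge_data` uses an inner join, only dates present in both frames survive, which is why the merged frames start later than the mobility data. A minimal sketch of how to audit the dropped dates, using invented toy values and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy frames: mobility starts two days before the case data.
mobility = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-27", "2020-02-28", "2020-02-29"]),
    "retail": [2.0, 1.0, 1.0],
})
cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-29", "2020-03-01"]),
    "case_count": [1, 0],
})

# Same join as merge_data: keeps only dates found in both frames.
inner = pd.merge(mobility, cases, how="inner", on="date")

# An outer merge with indicator=True labels each row by where its date was found.
audit = pd.merge(mobility, cases, how="outer", on="date", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", "date"]

print(len(inner), "row(s) kept;", len(dropped), "date(s) dropped by the inner join")
```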

In [14]:
ldn_df.head()

date,country,metro_area,iso_3166_2_code,census_fips_code,place_id,retail,pharmacy,parks,transit_station,workplaces,case_count,hospitalized_count,death_count
2020-03-19,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,-78.0,-47.0,-74.0,-70.0,-63.0,332,240,25
2020-03-20,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,-82.0,-54.0,-77.0,-73.0,-64.0,424,272,46
2020-03-21,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,-86.0,-53.0,-86.0,-78.0,,348,311,46
2020-03-22,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,-85.0,-67.0,-86.0,-81.0,,439,335,53
2020-03-23,United Kingdom,,GB-LND,,ChIJ4Y3fTlUDdkgR0Gbsoi2uDgQ,-87.0,-70.0,-82.0,-79.0,-76.0,685,505,56


In [15]:
ny_df.head()

date,country,metro_area,iso_3166_2_code,census_fips_code,place_id,retail,pharmacy,parks,transit_station,workplaces,case_count,hospitalized_count,death_count
2020-02-29,United States,,,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,1.0,2.0,-2.0,-1.0,6.0,1,1,0
2020-03-01,United States,,,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,-3.0,2.0,-7.0,-3.0,1.0,0,1,0
2020-03-02,United States,,,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,3.0,6.0,17.0,-3.0,4.0,0,2,0
2020-03-03,United States,,,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,0.0,6.0,5.0,-1.0,3.0,1,7,0
2020-03-04,United States,,,36061.0,ChIJOwE7_GTtwokRFq0uOwLSE9g,3.0,9.0,13.0,-3.0,2.0,5,2,0


## Logging Model to MLflow

Read the JSON file with the training metrics that were saved during training.

In [16]:
artifacts_dir = ARTIFACTS_PATH
os.makedirs(artifacts_dir, exist_ok=True)

with open(f"{artifacts_dir}/training_metrics.json", "r") as metrics:
    metrics_dict = json.load(metrics)
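
`mlflow.log_metrics` expects a flat mapping of metric names to numeric values, so any other entry in the JSON file would have to be filtered out first. A minimal sketch of such a filter follows; the structure of `training_metrics.json` is not shown in this notebook, so the example content and the helper name `numeric_metrics` are assumptions.

```python
import json
from numbers import Real
from typing import Any, Dict


def numeric_metrics(raw: Dict[str, Any]) -> Dict[str, float]:
    """Keep only top-level entries whose values are real numbers (bools excluded)."""
    return {
        key: float(value)
        for key, value in raw.items()
        if isinstance(value, Real) and not isinstance(value, bool)
    }


# Hypothetical file content; the real metric names are not shown in this notebook.
example = json.loads('{"ny_rmse": 12.5, "ldn_rmse": 9, "note": "toy values", "ok": true}')
print(numeric_metrics(example))
```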

In [17]:
class TwoCitiesModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context: mlflow.pyfunc.PythonModelContext) -> None:
        """Load model artifacts from the MLflow context."""
        try:
            def load_pickle(key: str) -> Any:
                """Unpickle the artifact registered under `key` and close the file."""
                with open(context.artifacts[key], "rb") as file:
                    return pickle.load(file)

            # Load the trained models
            self.ny_model = load_pickle("ny_model")
            self.ldn_model = load_pickle("ldn_model")

            # Load the last values for forecasting and reverse transformation
            self.ny_last_values = load_pickle("ny_last_values")
            self.ldn_last_values = load_pickle("ldn_last_values")

            # Read the lag order from each fitted model
            self.ny_lag_order = self.ny_model.k_ar
            self.ldn_lag_order = self.ldn_model.k_ar

            # Load the last raw values needed for transforming differenced forecasts back
            self.ny_last_raw_value = load_pickle("ny_last_raw_value")
            self.ldn_last_raw_value = load_pickle("ldn_last_raw_value")

            # Load feature names
            self.features = load_pickle("features")

        except Exception as e:
            logger.error(f"Error loading context: {str(e)}")
            raise

    def check_stationarity(self, series: pd.Series) -> bool:
        """Check if a time series is stationary using Augmented Dickey-Fuller test."""
        result = adfuller(series, autolag='AIC')
        return result[1] <= 0.05 
    
    def difference_data(self, data: pd.DataFrame) -> pd.DataFrame:
        """Apply differencing to make time series stationary."""
        return data.diff().dropna()
    
    def rolling_back_transformation(self, last_raw_value: Any, forecast_output: pd.DataFrame) -> pd.DataFrame:
        """Convert differenced forecasts back to original scale."""
        forecast_final = forecast_output.copy()
        
        for i, col in enumerate(self.features):
            col_forecast = f"{col}_forecast"
            # Cumulatively add differences starting from the last known value
            forecast_final[col_forecast] = last_raw_value[i] + forecast_output[col_forecast].cumsum()
        
        return forecast_final

    
    def predict(self, context: Any, model_input: Dict[str, Any], params: Optional[dict] = None) -> pd.DataFrame:
        """
        Computes the predicted forecast.
        """
        try:
            city = model_input.get("city", ["New York"])[0]
            steps = int(model_input.get("steps", [7])[0])
            new_data = model_input.get("new_data", None)
        
            if city == "New York":
                model = self.ny_model
                forecast_input = self.ny_last_values
                last_raw_value = self.ny_last_raw_value
                lag_order = self.ny_lag_order
            else:  # London
                model = self.ldn_model
                forecast_input = self.ldn_last_values
                last_raw_value = self.ldn_last_raw_value
                lag_order = self.ldn_lag_order
            
            # If new data is provided, update the forecast input
            if new_data is not None:
                new_df = pd.DataFrame(new_data)

                # Keep the last raw observation before any differencing;
                # it anchors the roll-back to the original scale
                last_raw_value = new_df.iloc[-1].values

                # Check if data needs differencing
                stationary = all(self.check_stationarity(new_df[col]) for col in new_df.columns)

                if not stationary:
                    new_df = self.difference_data(new_df)

                forecast_input = new_df.values[-lag_order:]
            
            # Make forecast
            fc = model.forecast(y=forecast_input, steps=steps)
            
            # Create DataFrame with forecasted values
            dates = pd.date_range(start=pd.Timestamp.today(), periods=steps)
            forecast_output = pd.DataFrame(fc, index=dates, columns=[f"{col}_forecast" for col in self.features])
            
            # Transform differenced forecasts back to original scale
            forecast_final = self.rolling_back_transformation(last_raw_value, forecast_output)
            
            return forecast_final
        except Exception as e:
            logger.error(f"Error performing prediction: {str(e)}")
            raise
    
    @classmethod
    def log_model(cls, model_name: str) -> None:
        """
        Logs the model to MLflow with artifacts for demo and config.
        """
        try:
            # Define input and output schema
            input_schema = Schema([
                ColSpec("string","city"),
                ColSpec("long","steps"),
                ])
            output_schema = Schema([
                ColSpec("string", "class"),
            ])
            
            # Define model signature
            signature = ModelSignature(inputs=input_schema, outputs=output_schema)
            
            # Prepare artifacts dictionary with model files
            artifacts = {
                "ny_model": f"{artifacts_dir}/ny_model.pkl",
                "ldn_model": f"{artifacts_dir}/ldn_model.pkl",
                "ny_last_values": f"{artifacts_dir}/ny_last_values.pkl",
                "ldn_last_values": f"{artifacts_dir}/ldn_last_values.pkl",
                "ny_last_raw_value": f"{artifacts_dir}/ny_last_raw_value.pkl",
                "ldn_last_raw_value": f"{artifacts_dir}/ldn_last_raw_value.pkl",
                "features": f"{artifacts_dir}/features.pkl"
            }
            
            # Add demo folder as artifact
            demo_folder = "../demo"
            if os.path.exists(demo_folder):
                artifacts["demo"] = demo_folder
                logger.info(f"✅ Demo folder added to artifacts: {demo_folder}")
            else:
                logger.warning(f"⚠️  Demo folder not found: {demo_folder}")
                
            # Add config file as artifact
            config_path = "../configs/config.yaml"
            if os.path.exists(config_path):
                artifacts["config"] = config_path
                logger.info(f"✅ Config file added to artifacts: {config_path}")
            else:
                logger.warning(f"⚠️  Config file not found: {config_path}")
            
            # Log the model in MLflow
            mlflow.pyfunc.log_model(
                artifact_path=model_name,
                python_model=cls(),
                artifacts=artifacts,
                signature=signature
            )
        except Exception as e:
            logger.error(f"Error logging model: {str(e)}")
            raise
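
The roll-back in `rolling_back_transformation` works because first differencing is undone by a running sum anchored at the last observed raw value. A minimal sketch on a toy series (invented values, one feature), where the last two differences stand in for a perfect differenced forecast:

```python
import pandas as pd

# Toy raw series; the first three rows are "observed", the last two are "future".
raw = pd.DataFrame({"retail": [-30.0, -33.0, -31.0, -36.0, -34.0]})
history = raw.iloc[:3]
future = raw.iloc[3:]

# First differences of the whole series; the last two rows play the role of
# the differenced forecast returned by the model.
diffs = raw.diff().dropna()
forecast_diffs = diffs.iloc[-2:]

# Roll back: last observed raw value plus the running sum of the differences.
last_raw_value = history.iloc[-1]
restored = last_raw_value + forecast_diffs.cumsum()

print(restored["retail"].tolist())
print(future["retail"].tolist())
```

Both printed lists match, which is the property the class relies on when it adds `last_raw_value[i]` to the cumulative sum of each forecast column.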


In [18]:
logger.info(f'Starting the experiment: {EXPERIMENT_NAME}')


mlflow.set_tracking_uri("/phoenix/mlflow")
# Set the MLflow experiment name
mlflow.set_experiment(experiment_name=EXPERIMENT_NAME)

# Start an MLflow run
with mlflow.start_run(run_name=RUN_NAME) as run:    
    
    # Log the training metrics to MLflow
    mlflow.log_metrics(metrics_dict)
    
    # Log the artifact URI for reference
    logger.info(f"Run's Artifact URI: {run.info.artifact_uri}")
    
    # Log the model to MLflow
    TwoCitiesModel.log_model(model_name=MODEL_NAME)

    # Register the logged model in MLflow Model Registry
    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/{MODEL_NAME}", 
        name=MODEL_NAME
    )

logger.info(f'Registered the model: {MODEL_NAME}')

2025-09-11 01:46:02 - INFO - Starting the experiment: Two_Cities_Experiment
2025/09/11 01:46:02 INFO mlflow.tracking.fluent: Experiment with name 'Two_Cities_Experiment' does not exist. Creating a new experiment.
2025-09-11 01:46:03 - INFO - ✅ Demo folder added to artifacts: ../demo
2025-09-11 01:46:03 - INFO - ✅ Config file added to artifacts: ../configs/config.yaml
2025/09/11 01:46:03 INFO mlflow.models.signature: Inferring model signature from type hints
Successfully registered model 'Two_Cities_Model'.
Created version '1' of model 'Two_Cities_Model'.
2025-09-11 01:46:10 - INFO - Registered the model: Two_Cities_Model


## Fetching the Latest Model Version from MLflow

In [19]:
# Initialize the MLflow client
client = MlflowClient()

# Retrieve the latest version of the model
model_metadata = client.get_latest_versions(MODEL_NAME, stages=["None"])
latest_model_version = model_metadata[0].version  # Extract the latest model version

# Fetch model information, including its signature
model_info = mlflow.models.get_model_info(f"models:/{MODEL_NAME}/{latest_model_version}")

# Print the latest model version and its signature
print(f"Latest Model Version: {latest_model_version}")
print(f"Model Signature: {model_info.signature}")

Latest Model Version: 1
Model Signature: inputs: 
  ['city': string (required), 'steps': long (required)]
outputs: 
  ['class': string (required)]
params: 
  None
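
The signature accepts any string for `city`, and `predict` treats every value other than `"New York"` as London. A minimal sketch of a stricter lookup that rejects unknown cities; `normalize_city` and `SUPPORTED_CITIES` are hypothetical names, not part of the registered model.

```python
SUPPORTED_CITIES = ("New York", "London")


def normalize_city(value: str) -> str:
    """Return the canonical city name or raise ValueError for unsupported input."""
    cleaned = value.strip().lower()
    for city in SUPPORTED_CITIES:
        if cleaned == city.lower():
            return city
    raise ValueError(f"Unsupported city {value!r}; expected one of {SUPPORTED_CITIES}")


print(normalize_city(" london "))
```

A check like this could run at the top of `predict`, so that a typo such as `"Londn"` fails instead of silently returning the London forecast.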



## Loading the Model and Running Inference

In [20]:
model = mlflow.pyfunc.load_model(model_uri=f"models:/{MODEL_NAME}/{latest_model_version}")
# Make predictions for New York
ny_prediction = model.predict({
    "city": ["New York"],
    "steps": [3]  # Forecast for the next 3 days
})

# Make predictions for London
ldn_prediction = model.predict({
    "city": ["London"],
    "steps": [3]  # Forecast for the next 3 days
})

print(f"New York: {ny_prediction}" )
print("\n")
print(f"London: {ldn_prediction}")

New York:                             retail_forecast  pharmacy_forecast  \
2025-09-11 01:46:10.861316       -33.728463         -30.846299   
2025-09-12 01:46:10.861316       -38.215550         -30.564491   
2025-09-13 01:46:10.861316       -35.413120         -22.604498   

                            parks_forecast  transit_station_forecast  \
2025-09-11 01:46:10.861316       10.141208                -22.428500   
2025-09-12 01:46:10.861316       -4.275643                -39.144546   
2025-09-13 01:46:10.861316       11.427796                -30.944809   

                            workplaces_forecast  case_count_forecast  \
2025-09-11 01:46:10.861316           -22.994693          2408.753695   
2025-09-12 01:46:10.861316           -56.921910          4896.935950   
2025-09-13 01:46:10.861316           -40.361528          4383.142165   

                            hospitalized_count_forecast  death_count_forecast  
2025-09-11 01:46:10.861316                    98.599755            
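
The forecast index above carries the time of day (`01:46:10.861316`) because `predict` starts the date range at `pd.Timestamp.today()`. A minimal sketch of a midnight-aligned daily index, as one possible alternative and not the notebook's current behavior:

```python
import pandas as pd

steps = 3

# normalize() resets the time of day to 00:00:00, so each row is a plain date.
start = pd.Timestamp.today().normalize()
dates = pd.date_range(start=start, periods=steps, freq="D")

print(dates)
```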

In [21]:
end_time: float = time.time()
elapsed_time: float = end_time - start_time
elapsed_minutes: int = int(elapsed_time // 60)
elapsed_seconds: float = elapsed_time % 60

logger.info(f"⏱️ Total execution time: {elapsed_minutes}m {elapsed_seconds:.2f}s")
logger.info("✅ Notebook execution completed successfully.")

2025-09-11 01:46:10 - INFO - ⏱️ Total execution time: 0m 9.10s
2025-09-11 01:46:10 - INFO - ✅ Notebook execution completed successfully.


Built with ❤️ using [**HP AI Studio**](https://hp.com/ai-studio).