This notebook is the first step of a proof of concept (POC) to evaluate whether it makes sense to use Databricks as one of the ML options in GoodData.

What this notebook does:

- Fetches data from GoodData.
- Creates an ARIMA model based on the data.
- Registers the ARIMA model in the Databricks Unity Catalog.

What it doesn't do:

- Set up data in GoodData (the .csv and instructions are in the .md of the linked [GitHub Repo](https://github.com/Mara3l/Databricks_ML))
- Create the endpoint for ML consumption
    - This step is intentionally done in the Databricks UI. If you are unsure how to do that, please refer to the [Databricks Documentation](https://docs.databricks.com/en/machine-learning/model-serving/create-manage-serving-endpoints.html).

In [None]:
%pip install gooddata-pandas

In [None]:
from gooddata_pandas import GoodPandas

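# Replace the placeholders with your GoodData host URL, API token,
# workspace ID, and visualization ID.
host = "YOUR HOST"
token = "GD_TOKEN"
workspace_id = "GD_WORKSPACE"
visualization_id = "GD_VISUALIZATION"

gp = GoodPandas(host, token)

# Load the visualization's data into a pandas DataFrame
frames = gp.data_frames(workspace_id)
df = frames.for_visualization(visualization_id)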


Check the head of the dataframe.

In [None]:
df.head()

Check the columns. If the metric column is named differently from the metric name used in the setup section, update `metric_name` in the next cell to match.

In [None]:
df.columns

In [None]:
metric_name = "Stock_Price"
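
The column check can also be done in code, so that a mismatch fails early with a readable message. The following is a minimal sketch, assuming a hypothetical helper named `require_column` (it is not part of gooddata-pandas or pandas) and a stand-in list of column names in place of `df.columns`.

```python
def require_column(columns, name):
    """Return `name` if it is among `columns`, otherwise raise a KeyError.

    Hypothetical helper for illustration; `columns` can be `df.columns`
    or any iterable of column labels.
    """
    available = [str(c) for c in columns]
    if name not in available:
        raise KeyError(f"Column {name!r} not found. Available columns: {available}")
    return name


# Stand-in for df.columns so the sketch runs on its own (assumed names)
example_columns = ["Date", "Stock_Price"]
checked_metric = require_column(example_columns, "Stock_Price")
```

In the notebook, the call would look like `require_column(df.columns, metric_name)`.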

In [None]:
from statsmodels.tsa.arima.model import ARIMA
import mlflow
import mlflow.pyfunc

def train_arima_model(data, order=(5, 1, 0)):
    """Fit an ARIMA model with the given (p, d, q) order and return the fitted result."""
    model = ARIMA(data, order=order)
    return model.fit()


arima_model = train_arima_model(df[metric_name])

forecast_steps = 5  # Forecast the next 5 months
forecast = arima_model.forecast(steps=forecast_steps)
print(forecast)
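
In `order=(5,1,0)`, the three values are the autoregressive order p, the degree of differencing d, and the moving-average order q. With d=1 the model is fitted on the first differences of the series rather than on the raw values. The sketch below illustrates first-order differencing and its inverse in plain Python; the helper names and the sample values are made up for illustration and do not come from the GoodData dataset.

```python
from itertools import accumulate


def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Invert first-order differencing by cumulative summation."""
    return list(accumulate(diffs, initial=first_value))


# Made-up values for illustration only
prices = [100.0, 102.0, 101.0, 105.0]
diffs = difference(prices)
restored = undifference(prices[0], diffs)
```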

In [None]:
import mlflow
import mlflow.pyfunc
import pandas as pd


class ArimaModelWrapper(mlflow.pyfunc.PythonModel):
    """Expose the fitted ARIMA model through the MLflow pyfunc interface."""

    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        # Only the number of input rows is used: it sets the forecast horizon.
        return self.model.forecast(steps=len(model_input))


# arima_model and forecast_steps come from the previous cell.
wrapped_model = ArimaModelWrapper(model=arima_model)

# Log the model with MLflow
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="arima_model",
        python_model=wrapped_model
    )

# Load the model for prediction
logged_model = f"runs:/{run.info.run_id}/arima_model"
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Make predictions with the loaded model
model_input = pd.DataFrame([0]*forecast_steps)  # Dummy input for the forecast steps
predictions = loaded_model.predict(model_input)
print(predictions)
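
Because `ArimaModelWrapper.predict` only uses the number of input rows, the input values themselves are placeholders. The sketch below shows how a JSON request body with one dummy row per forecast step could be built in the `dataframe_split` format, one of the input formats accepted by MLflow model serving. The helper name `build_forecast_payload` and the column name `step` are hypothetical.

```python
import json


def build_forecast_payload(steps, column="step"):
    """Build a dataframe_split request body with one placeholder row per step."""
    if steps < 1:
        raise ValueError("steps must be a positive integer")
    return {
        "dataframe_split": {
            "columns": [column],
            "data": [[0] for _ in range(steps)],
        }
    }


payload = build_forecast_payload(5)
body = json.dumps(payload)  # JSON string to send as the request body
```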


In [None]:
CATALOG_NAME = "main"
SCHEMA_NAME = "default"
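
Unity Catalog addresses models by a three-level name of the form `<catalog>.<schema>.<model>`. The sketch below shows one way the catalog and schema values could be combined with a model name; the helper `uc_model_name` is hypothetical. Note that Databricks documents a model signature as a requirement for models registered in Unity Catalog, so the logged model may need one before such a name can be used.

```python
def uc_model_name(catalog, schema, model):
    """Join catalog, schema, and model into a three-level Unity Catalog name."""
    parts = [catalog, schema, model]
    for part in parts:
        # Each part must be non-empty and must not contain the separator
        if not part or "." in part:
            raise ValueError(f"Invalid name part: {part!r}")
    return ".".join(parts)


full_name = uc_model_name("main", "default", "ARIMA_PREDICTIONS_STOCK_PRICE")
```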

Here you will need to add the name of your MLflow run, which is shown in the experiment panel on the right side of the notebook. If you are unsure about what an experiment or a run is, please refer to the [Databricks Documentation](https://docs.databricks.com/en/mlflow/quick-start.html).

In [None]:
# Name of the MLflow run that logged the model
run_name = "kindly-owl-69"
# String values in an MLflow search filter must be quoted
run_id = mlflow.search_runs(filter_string=f"tags.mlflow.runName = '{run_name}'").iloc[0].run_id
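
MLflow search filters compare tags against quoted string values, so the run name has to be wrapped in quotes inside the filter string. The following is a minimal sketch of a helper that builds such a filter; the name `run_name_filter` is hypothetical, and run names containing single quotes are deliberately rejected rather than escaped.

```python
def run_name_filter(run_name):
    """Build an MLflow search filter that matches a run by its name."""
    if "'" in run_name:
        raise ValueError("run names containing single quotes are not handled in this sketch")
    return f"tags.mlflow.runName = '{run_name}'"


filter_string = run_name_filter("kindly-owl-69")
```

The result can be passed as the `filter_string` argument of `mlflow.search_runs`.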

In [None]:
import time

# Register the model to Unity Catalog.
model_name = "ARIMA_PREDICTIONS_STOCK_PRICE"
model_version = mlflow.register_model(f"runs:/{run_id}/arima_model", model_name)

# Registering the model takes a few seconds, so add a small delay
time.sleep(15)
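
A fixed delay can be too short or needlessly long. As an alternative, the registration status could be polled until it is ready. The sketch below is a generic polling loop with a hypothetical helper `wait_until` and a stand-in check function, so it runs without MLflow.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check that succeeds on the third call (for illustration only)
calls = {"count": 0}


def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3


became_ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
```

In the notebook, `check` could wrap a comparison such as `MlflowClient().get_model_version(model_name, model_version.version).status == "READY"`; treat that wiring as an assumption to verify against your MLflow version.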