This notebook trains a linear regression model that predicts purchases from views and revenue. It uses MLflow to log the run's parameters, the R² score on the test set, and the trained model artifact, so that runs can be compared later.

### Import the Required Libraries

In [0]:
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlflow.models import infer_signature

### Load the Cleaned Data

In [0]:
# Load the cleaned Gold table into a pandas DataFrame for modeling
# (`spark` is the session object that the Databricks notebook provides)
table_name = "ecommerce.fact_product_performance"
df = spark.table(table_name).toPandas()

### Define Features and Split the Data

In [0]:
# Features (inputs): views and revenue
# Target: purchases (the number of purchases to predict)

X = df[["views", "revenue"]]
y = df["purchases"]

# Splitting 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
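
To make the role of `test_size` and `random_state` concrete, here is a minimal pure-Python sketch of a seeded shuffle split. It is an illustration of the idea only, not scikit-learn's implementation: the helper `seeded_split` and the sample rows are hypothetical and will not reproduce the exact rows that `train_test_split` selects.

```python
import random


def seeded_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


# Hypothetical (views, revenue, purchases) rows, not taken from the Gold table
rows = [(100 + 10 * i, 50.0 + 5 * i, 3 + i) for i in range(10)]

train_rows, test_rows = seeded_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Because the seed is fixed, rerunning the cell yields the same partition, which is what keeps metrics comparable across runs that share the same data.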

### Configure MLflow Experiment

In [0]:
# Set the storage location for this experiment
user_name = spark.sql("SELECT current_user()").collect()[0][0]
mlflow.set_experiment(f"/Users/{user_name}/product_purchase_model")

2026/01/20 15:24:00 INFO mlflow.tracking.fluent: Experiment with name '/Users/tbhavya054@gmail.com/product_purchase_model' does not exist. Creating a new experiment.


<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/2630815116402525', creation_time=1768922640163, experiment_id='2630815116402525', last_update_time=1768922640163, lifecycle_stage='active', name='/Users/tbhavya054@gmail.com/product_purchase_model', tags={'mlflow.experiment.sourceName': '/Users/tbhavya054@gmail.com/product_purchase_model',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'tbhavya054@gmail.com',
 'mlflow.ownerId': '4128649050082485'}>

### Model Training

In [0]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

### Performance Evaluation

In [0]:
# Compute R² (coefficient of determination) on the held-out test set
r2 = model.score(X_test, y_test)

# Define the model's input and output schema
signature = infer_signature(X_test, model.predict(X_test))
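
For a regressor such as `LinearRegression`, `model.score` returns R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes that quantity by hand on made-up numbers; the function name `r2_score_manual` and the sample values are assumptions for illustration, not part of the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Hypothetical purchase counts and three sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # matches every value
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean of y_true
reversed_fit = [9.0, 7.0, 5.0, 3.0]  # worse than predicting the mean

print(r2_score_manual(y_true, perfect))       # 1.0
print(r2_score_manual(y_true, mean_only))     # 0.0
print(r2_score_manual(y_true, reversed_fit))  # -3.0
```

The last case shows that R² is not a percentage of correct predictions: it measures explained variance relative to a mean-only baseline and can fall below zero.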



### Track Experiment Results

In [0]:
with mlflow.start_run(run_name="Linear_Regression_Final"):
    # 1. Calculate the R² score on the test set
    score = model.score(X_test, y_test)
    
    # 2. Record model settings
    mlflow.log_param("test_size", 0.2)
    
    # 3. Record the R² metric (using the 'score' variable)
    mlflow.log_metric("r2_score", score)
    
    # 4. Save the trained model artifact
    mlflow.sklearn.log_model(model, "model", signature=signature)

# 5. Print the score
# R² of 1.0 means perfect predictions; 0.0 means no better than always predicting the mean (it can also be negative)
print(f"R² Score: {score:.4f}")

R² Score: 0.9300


In [0]:
print(f"Model training complete. Test R²: {score:.2f}. Results are saved in MLflow.")

Model training complete. Test R²: 0.93. Results are saved in MLflow.
