<a href="https://colab.research.google.com/github/DigitalSocrates/Experiments_in_DataScience/blob/master/MLFLow_Experiments/LinearRegression_with_MlFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Linear Regression model and MLflow

In [None]:
# imports
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# Load the dataset - file is located in the same folder as the notebook
raw_ccpp_data = pd.read_csv('CCPP_data.csv')

In [None]:
# let us examine the dataframe
raw_ccpp_data.info()

In [None]:
# summary statistics
raw_ccpp_data.describe()

We want to check that our data falls within the following ranges:
- Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (the target we are trying to predict)

In [None]:
# query data and create new data frame
ccpp_data = raw_ccpp_data.query('(AT >= 1.81 & AT <= 37.11) & (V >= 25.36 & V <= 81.56) & (AP >= 992.89 & AP <= 1033.30) & (RH >= 25.56 & RH <= 100.16)')
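
The same range check can also be written with a dictionary of bounds, which makes it easy to report how many rows were dropped. This is only a minimal sketch: `VALID_RANGES`, `filter_to_ranges`, and the two-row `sample` frame are hypothetical names and made-up data used for illustration, not part of the CCPP file.

```python
import pandas as pd

# Bounds copied from the ranges listed above; keys follow the column names in the query.
VALID_RANGES = {
    "AT": (1.81, 37.11),
    "V": (25.36, 81.56),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
}


def filter_to_ranges(df, ranges):
    """Keep only the rows where every listed column lies inside its range."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in ranges.items():
        mask &= df[column].between(low, high)  # inclusive on both ends
    return df[mask]


# Made-up frame: the second row has an out-of-range temperature.
sample = pd.DataFrame({
    "AT": [14.96, 45.00],
    "V": [41.76, 41.76],
    "AP": [1024.07, 1024.07],
    "RH": [73.17, 73.17],
    "PE": [463.26, 463.26],
})

filtered = filter_to_ranges(sample, VALID_RANGES)
print(f"Dropped {len(sample) - len(filtered)} of {len(sample)} rows")
```

Calling `filter_to_ranges(raw_ccpp_data, VALID_RANGES)` should select the same rows as the query above, assuming the file uses the same column names.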

In [None]:
# summary statistics including all columns
ccpp_data.describe(include='all')

Features and target selection

In [None]:
X = ccpp_data.drop('PE', axis=1)  # Features
y = ccpp_data['PE']  # Target variable

In [None]:
# Split the data into training and testing sets
# 70% data for training and 30% data for test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# training set
print(X_train.describe(include='all'))
# test set
print(X_test.describe(include='all'))

Prepare for mlflow

The MLflow tracking server MUST be running before the model is trained. Start it with
mlflow ui
and confirm that the service is up by opening
http://127.0.0.1:5000
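
The check can also be done from the notebook itself. The sketch below simply sends an HTTP request to the address above using the standard library; `tracking_server_is_up` is a hypothetical helper name, and the two-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.request


def tracking_server_is_up(url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if an HTTP request to the tracking UI succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


if not tracking_server_is_up():
    print("MLflow is not reachable; run `mlflow ui` first.")
```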

A few important points:
1. mlflow is set to autolog for sklearn
2. log_models is set to True
3. log_datasets is set to True

Note: If you use set_tracking_uri(), call set_experiment() after it.

In [None]:
# enable mlflow autologging
import mlflow

mlflow.set_tracking_uri('http://127.0.0.1:5000')
experiment_name = 'Training Linear Regression model on CCPP data <jira id>'
mlflow.set_experiment(experiment_name)
experiment = mlflow.get_experiment_by_name(experiment_name)

mlflow.sklearn.autolog(disable=False,
                       log_models=True,
                       log_datasets=True,
                       registered_model_name="CCPP Linear Regression",
                       )

Set tags that will be useful when comparing experiment runs

In [None]:
# set tags - adding metadata about the model
tags = {"team": "Engineering Team Name",
        "dataset": "CCPP model",
        "release.version": "1.2.3",
        "inputs": ", ".join(X_train.columns),  # tag values are stored as strings
        "target": "PE"}

now = datetime.now() # current date and time
experiment_date = now.strftime("%m/%d/%Y, %H:%M:%S")

Run our first experiment with a linear regression model

A few points here:
1. Set up the run and name it with a datetime stamp or some other unique identifier
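
One way to get such an identifier is to combine a timestamp with a short random suffix. This is only a sketch: `make_run_name` is a hypothetical helper, and the `%Y%m%d_%H%M%S` format is an assumption chosen because it sorts chronologically and avoids slashes and commas in the run name.

```python
import uuid
from datetime import datetime


def make_run_name(prefix="linear_regression_exp"):
    """Build a run name from a sortable timestamp plus a short random suffix."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # guards against two runs started in the same second
    return f"{prefix}__{timestamp}_{suffix}"


print(make_run_name())
```

The result could be passed as `run_name` to `mlflow.start_run()` in place of the string built from `experiment_date`.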

In [None]:
# Using linear regression model
with mlflow.start_run(experiment_id=experiment.experiment_id,
                      run_name='linear_regression_exp__' + experiment_date):
    mlflow.set_tags(tags)
    model = LinearRegression(n_jobs=5)
    model.fit(X_train, y_train)

    # Model Evaluation
    y_pred = model.predict(X_test)

    # calculate different metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # square root of MSE; avoids the deprecated squared=False argument
    r2 = r2_score(y_test, y_pred)

    mlflow.log_metric('mae', mae)
    mlflow.log_metric('mse', mse)
    mlflow.log_metric('rmse', rmse)
    mlflow.log_metric('r2', r2)

Display metrics collected

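- MAE (mean absolute error): the average absolute difference between actual and predicted PE, in MW
- MSE (mean squared error): the average squared difference, in MW squared; large errors are penalized more heavily
- RMSE (root mean squared error): the square root of MSE, which brings the error back to MW
- R-squared: the proportion of the variance in PE that the model explains; 1.0 is a perfect fit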

In [None]:
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R-squared: {r2}")
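
To make the four metrics concrete, they can be reproduced by hand on a few numbers. The sketch below uses plain Python and three made-up value pairs; `regression_metrics` is a hypothetical helper, not a scikit-learn function.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R-squared from two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # average absolute error
    mse = sum(e * e for e in errors) / n    # average squared error
    rmse = math.sqrt(mse)                   # same units as the target

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up values: the errors are 1, 0 and -2
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

For these values the absolute errors average to 1.0, and the residual sum of squares (5) against the total sum of squares (8) gives an R-squared of 0.375.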

Using model to predict

In [None]:
# let's try to predict
# Input features for prediction (replace with your own feature values)
new_data = pd.DataFrame({
    'AT': 14.96,
    'V': 41.76,
    'AP': 1024.07,
    'RH': 73.17,
}, index=[0])

# PE
expected_output = 463.26

# Predict the electrical energy output for the new data
predicted_energy_output = model.predict(new_data)

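# relative error of the single prediction, as a percentage of the expected value
relative_error = abs(predicted_energy_output[0] - expected_output) / expected_output * 100
print(f"Predicted Electrical Energy Output: {predicted_energy_output[0]:.2f} MW, expected {expected_output} MW. Relative error: {relative_error:.2f}%")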

Visualize predicted vs actual

In [None]:
import matplotlib.pyplot as plt

# scatter plot to visualize predicted vs actual
# note: y_test contains the actual energy output values, and y_pred contains the predicted values

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Energy Output')
plt.ylabel('Predicted Energy Output')
plt.title('Actual vs. Predicted Energy Output')
plt.grid(True)
plt.show()
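
A diagonal reference line often makes this kind of plot easier to read, because points on the line are perfect predictions. The following is a sketch under assumptions: `plot_actual_vs_predicted` is a hypothetical helper, and the four value pairs are made up rather than taken from the CCPP data.

```python
import matplotlib.pyplot as plt


def plot_actual_vs_predicted(actual, predicted):
    """Scatter predicted against actual values and add a y = x reference line."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(actual, predicted, alpha=0.5)

    # The diagonal spans the full range covered by both series
    low = min(min(actual), min(predicted))
    high = max(max(actual), max(predicted))
    ax.plot([low, high], [low, high], linestyle="--", color="red",
            label="perfect prediction (y = x)")

    ax.set_xlabel("Actual Energy Output")
    ax.set_ylabel("Predicted Energy Output")
    ax.set_title("Actual vs. Predicted Energy Output")
    ax.legend()
    ax.grid(True)
    return ax


# Made-up values for illustration only
ax = plot_actual_vs_predicted([430.0, 450.0, 470.0, 490.0],
                              [433.0, 448.0, 474.0, 486.0])
```

In the notebook, the same function could be called as `plot_actual_vs_predicted(y_test, y_pred)`, followed by `plt.show()` if the figure does not render inline.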