![tower_bridge](tower_bridge.jpg)

As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project that aims to build a pipeline for predicting the climate in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Because the weather depends on many different factors, you will want to run many experiments to determine the best approach to predicting it. In this project, you will run experiments with different regression models that predict the mean temperature, using a combination of `sklearn` and `mlflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in watts per square meter (W/m²) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - **target** mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)
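
Before modelling, it can help to turn the integer `date` column into a real datetime and to count missing values per column. The sketch below uses a tiny made-up frame in place of `london_weather.csv`, and it assumes the integer encodes the date as `YYYYMMDD`; neither the sample values nor that encoding are confirmed by the column list above.

```python
import pandas as pd

# Tiny stand-in for london_weather.csv; the values are made up for illustration.
sample = pd.DataFrame(
    {
        "date": [19790101, 19790102, 19790103],
        "mean_temp": [-4.1, None, -2.8],
        "snow_depth": [9.0, 8.0, None],
    }
)

# Assumption: the integer date encodes YYYYMMDD.
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Count missing values per column before deciding how to impute them.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```

If the real file uses a different date encoding, only the `format` string would need to change.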

In [107]:
# Run this cell to install mlflow
!pip install mlflow

Defaulting to user installation because normal site-packages is not writeable


In [108]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv")

features = [
    "cloud_cover",
    "sunshine",
    "global_radiation",
    "max_temp",
    "min_temp",
    "precipitation",
    "pressure",
    "snow_depth",
]
X = weather[features]
y = weather["mean_temp"]

In [109]:
imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)

In [110]:
scaler = StandardScaler()
X = scaler.fit_transform(X)


In [111]:
# X was already imputed and standardized in the two cells above,
# so only the target needs handling here.

# Handle missing values in y (a separate imputer keeps the one fitted on X intact)
target_imputer = SimpleImputer(strategy="mean")
y = np.array(weather["mean_temp"]).reshape(-1, 1)
y = target_imputer.fit_transform(y).flatten()

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
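
The cells above fit the imputer and scaler on the full dataset before splitting, so statistics from the test rows influence the preprocessing. One common alternative, sketched here under assumptions, is to split first and wrap the preprocessing and the model in an `sklearn` `Pipeline`, so that the imputer and scaler are fitted on the training rows only. The data below is synthetic, and the names `X_demo`, `y_demo`, and `pipeline` are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features (not real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Inject some missing values so the imputer has work to do.
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Split first, so the test rows never influence the preprocessing statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]
)

# fit() learns the imputation means and scaling statistics from X_tr only.
pipeline.fit(X_tr, y_tr)

# score() reuses those training statistics when transforming X_te.
r2 = pipeline.score(X_te, y_te)
print(f"R^2 on held-out rows: {r2:.3f}")
```

A fitted pipeline like this can also be passed to `mlflow.sklearn.log_model`, which keeps the preprocessing and the model together in one logged artifact.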

In [113]:
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42),
}

In [114]:
mlflow.set_experiment("London_Temperature_Prediction")

<Experiment: artifact_location='file:///work/files/workspace/mlruns/466394272889679740', creation_time=1737222360182, experiment_id='466394272889679740', last_update_time=1737222360182, lifecycle_stage='active', name='London_Temperature_Prediction', tags={}>

In [116]:
for model_name, model in models.items():
    try:
        # Ensure any previous run is ended
        if mlflow.active_run() is not None:
            mlflow.end_run()

        with mlflow.start_run(run_name=model_name):
            # Train the model
            model.fit(X_train, y_train)

            # Make predictions
            y_pred = model.predict(X_test)

            # Calculate RMSE
            rmse = np.sqrt(mean_squared_error(y_test, y_pred))

            # Log model and metrics
            mlflow.log_param("model_name", model_name)
            mlflow.log_metric("rmse", rmse)
            mlflow.sklearn.log_model(model, model_name)

            print(f"Model: {model_name}, RMSE: {rmse}")
    finally:
        # Ensure the run is ended
        mlflow.end_run()

Model: LinearRegression, RMSE: 0.9166133728599348
Model: DecisionTree, RMSE: 1.2750102381614619
Model: RandomForest, RMSE: 0.9166052782128887
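
The RMSE values above are in the same unit as the target (°C): the metric is the square root of the mean squared difference between predictions and true values. As a small worked example with made-up numbers (not taken from the dataset), the manual formula and the `mean_squared_error` route used in the loop give the same result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up temperatures in °C, purely for illustration.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_hat = np.array([11.0, 12.0, 6.0, 15.0])

# Errors are -1, 0, 2, 0, so the squared errors are 1, 0, 4, 0 (mean 1.25).
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same computation as in the training loop.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```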


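In [117]:
# Leftover from the loop above: `model` still refers to the last entry in `models` (RandomForest)
y_pred = model.predict(X_test)

In [118]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [119]:
# Note: no run is active here, so these calls implicitly start a new run
# and log the RandomForest results a second time.
mlflow.log_param("model_name", model_name)
mlflow.log_metric("rmse", rmse)
mlflow.sklearn.log_model(model, model_name)

print(f"Model: {model_name}, RMSE: {rmse}")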

Model: RandomForest, RMSE: 0.9166052782128887


In [120]:
experiment_id = mlflow.get_experiment_by_name("London_Temperature_Prediction").experiment_id
experiment_results = mlflow.search_runs(experiment_ids=[experiment_id])

In [121]:
print("Experiment Results:")
print(experiment_results)

Experiment Results:
                              run_id  ... tags.mlflow.user
0   74ee6108c531498a8f389ad9186e7413  ...             repl
1   e8912651389f47b8b3dbd15186c6ac40  ...             repl
2   2e29dc84f612419a8823594cf960fe52  ...             repl
3   23ce4789553a44109e88c90b0a04da2f  ...             repl
4   8a378b1ef1d94495b8dca244cd5749d5  ...             repl
5   03339d0123af4addaf6001d4a50e9ccd  ...             repl
6   a110573794a44ac280343c8501d7e97e  ...             repl
7   3d67bea5b6ef48e3a4c8a98bb9ae181a  ...             repl
8   c581a5ee7cb942f68f83352005ba95f8  ...             repl
9   fcd676b9a9134a1c9711dbb9f332976a  ...             repl
10  35e972691e3c4267bc27ca6e9e832794  ...             repl
11  6b07484873aa44b48a793a354b86c81c  ...             repl
12  54c286e117aa4d8db745910f92d1245c  ...             repl
13  3751c519f2e841beafda6f455671744f  ...             repl
14  b163b26134f844a1b078427a40480089  ...             repl
15  eac0459a013b4bcc9b233f80afd22d0d
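
The printed frame hides most columns behind `...`. `mlflow.search_runs` returns a pandas DataFrame whose metric and parameter columns are prefixed with `metrics.` and `params.`, so one way to get a readable comparison is to select a few columns and sort by RMSE. The sketch below uses a hypothetical stand-in frame with made-up run IDs and RMSE values, not the results of this experiment.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by mlflow.search_runs();
# the run IDs and RMSE values are illustrative, not real results.
runs = pd.DataFrame(
    {
        "run_id": ["run_a", "run_b", "run_c"],
        "params.model_name": ["LinearRegression", "DecisionTree", "RandomForest"],
        "metrics.rmse": [0.95, 1.30, 0.90],
        "tags.mlflow.user": ["repl", "repl", "repl"],
    }
)

# Keep only the columns needed for comparison and put the lowest RMSE first.
columns_of_interest = ["run_id", "params.model_name", "metrics.rmse"]
summary = (
    runs[columns_of_interest]
    .sort_values("metrics.rmse")
    .reset_index(drop=True)
)

print(summary.to_string(index=False))

# The first row now holds the run with the lowest RMSE.
best_model_name = summary.loc[0, "params.model_name"]
```

Applied to `experiment_results`, the same selection would show one row per logged run, including repeated runs from earlier executions of the notebook.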

In [122]:
best_runs = experiment_results[experiment_results["metrics.rmse"] <= 3]
if not best_runs.empty:
    print("Best models with RMSE <= 3:")
    print(best_runs)
else:
    print("No model achieved RMSE <= 3.")


Best models with RMSE <= 3:
                              run_id  ... tags.mlflow.user
0   74ee6108c531498a8f389ad9186e7413  ...             repl
1   e8912651389f47b8b3dbd15186c6ac40  ...             repl
2   2e29dc84f612419a8823594cf960fe52  ...             repl
3   23ce4789553a44109e88c90b0a04da2f  ...             repl
4   8a378b1ef1d94495b8dca244cd5749d5  ...             repl
5   03339d0123af4addaf6001d4a50e9ccd  ...             repl
6   a110573794a44ac280343c8501d7e97e  ...             repl
7   3d67bea5b6ef48e3a4c8a98bb9ae181a  ...             repl
8   c581a5ee7cb942f68f83352005ba95f8  ...             repl
9   fcd676b9a9134a1c9711dbb9f332976a  ...             repl
10  35e972691e3c4267bc27ca6e9e832794  ...             repl
11  6b07484873aa44b48a793a354b86c81c  ...             repl
12  54c286e117aa4d8db745910f92d1245c  ...             repl
13  3751c519f2e841beafda6f455671744f  ...             repl
14  b163b26134f844a1b078427a40480089  ...             repl
15  eac0459a013b4bcc9b233f80