

As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project that aims to build a pipeline for predicting the climate in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Since the weather depends on many different factors, you will want to run several experiments to determine the best approach to predicting it. In this project, you will run experiments for different regression models predicting the mean temperature, using a combination of `sklearn` and `mlflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement, stored as `YYYYMMDD` - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in watts per square meter (W/m²) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - **target** mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)
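
The `date` column is an integer, and the first rows of the file show values such as `19790101`, which suggests a `YYYYMMDD` encoding. Below is a minimal sketch, assuming that encoding, of how such values could be converted to datetimes and how missing values could be counted per column. The small `sample` frame and its temperature values are made up for illustration and stand in for `london_weather.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of london_weather.csv (values are illustrative)
sample = pd.DataFrame({
    "date": [19790101, 19790102, 19790103],
    "cloud_cover": [2.0, 6.0, np.nan],
    "mean_temp": [-4.1, np.nan, -2.8],
})

# Assumption: the integers encode dates as YYYYMMDD
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Calendar features that a model could use in place of the raw integer
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

# Count missing values per column before deciding how to handle them
missing_counts = sample.isna().sum()
print(missing_counts)
```

Counting missing values first helps decide whether a column should be imputed, or whether rows with a missing target should be dropped instead.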

In [None]:
# Run this cell to install mlflow
!pip install mlflow

Defaulting to user installation because normal site-packages is not writeable


In [None]:
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv")

# Show the first few rows to inspect the data
print(weather.head())

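# Handle missing values with mean imputation.
# Note: the imputer is fitted on the full dataset, including the target column,
# before the train/test split, so test-set statistics leak into training.
imputer = SimpleImputer(strategy="mean")
weather_imputed = pd.DataFrame(imputer.fit_transform(weather), columns=weather.columns)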

# Features and target
X = weather_imputed.drop(columns=["mean_temp", "date"])  # Drop 'mean_temp' (target) and 'date'
y = weather_imputed["mean_temp"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MLflow setup: Start tracking experiment
mlflow.set_experiment("London_Weather_Prediction")

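# Function to log the model and RMSE
def log_model_and_rmse(model, model_name):
    """Evaluate a fitted model on the scaled test set and log it to MLflow.

    Relies on the module-level X_test_scaled and y_test defined above.
    """
    # Predict and calculate RMSE
    y_pred = model.predict(X_test_scaled)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    # Log the model name, RMSE, and fitted model in a named run
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model", model_name)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, model_name)

    return rmse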

# Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_rmse = log_model_and_rmse(lr_model, "Linear_Regression")

# Decision Tree Regressor Model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train_scaled, y_train)
dt_rmse = log_model_and_rmse(dt_model, "Decision_Tree_Regressor")

# Random Forest Regressor Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
rf_rmse = log_model_and_rmse(rf_model, "Random_Forest_Regressor")

# Displaying the RMSE for each model
print(f"Linear Regression RMSE: {lr_rmse}")
print(f"Decision Tree RMSE: {dt_rmse}")
print(f"Random Forest RMSE: {rf_rmse}")

# Search for all runs in the active experiment (most recent run listed first)
experiment_results = mlflow.search_runs()
print(experiment_results[["run_id", "metrics.rmse"]])

# The best model is the one with the lowest RMSE
best_model = min([(lr_rmse, "Linear_Regression"),
                  (dt_rmse, "Decision_Tree_Regressor"),
                  (rf_rmse, "Random_Forest_Regressor")], key=lambda x: x[0])

print(f"Best Model: {best_model[1]} with RMSE: {best_model[0]}")


2024/12/22 21:11:43 INFO mlflow.tracking.fluent: Experiment with name 'London_Weather_Prediction' does not exist. Creating a new experiment.


       date  cloud_cover  sunshine  ...  precipitation  pressure  snow_depth
0  19790101          2.0       7.0  ...            0.4  101900.0         9.0
1  19790102          6.0       1.7  ...            0.0  102530.0         8.0
2  19790103          5.0       0.0  ...            0.0  102050.0         4.0
3  19790104          8.0       0.0  ...            0.0  100840.0         2.0
4  19790105          6.0       2.0  ...            0.0  102250.0         1.0

[5 rows x 10 columns]
Linear Regression RMSE: 0.9166133728599348
Decision Tree RMSE: 1.2706354827458726
Random Forest RMSE: 0.9166267459704752
                             run_id  metrics.rmse
0  ebd6c380400d47df9886f27b2f0e5c38      0.916627
1  f8773392a03c490aa1b32265f986097b      1.270635
2  670d5572330d415a8947133990dc01c5      0.916613
Best Model: Linear_Regression with RMSE: 0.9166133728599348
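
The `search_runs()` table above lists only run IDs, so matching a run to a model means comparing RMSE values by eye. The sketch below ranks such a table with pandas. The frame is rebuilt by hand from the printed output, and the `params.model` column is an assumption: MLflow exposes logged parameters as `params.<name>` columns, and the model names here are inferred by matching each RMSE to the values printed per model.

```python
import pandas as pd

# Rebuilt by hand from the printed output; params.model is inferred by matching RMSE values
runs = pd.DataFrame({
    "run_id": [
        "ebd6c380400d47df9886f27b2f0e5c38",
        "f8773392a03c490aa1b32265f986097b",
        "670d5572330d415a8947133990dc01c5",
    ],
    "params.model": [
        "Random_Forest_Regressor",
        "Decision_Tree_Regressor",
        "Linear_Regression",
    ],
    "metrics.rmse": [0.916627, 1.270635, 0.916613],
})

# Sort ascending so the lowest-error run comes first
ranked = runs.sort_values("metrics.rmse").reset_index(drop=True)
best = ranked.iloc[0]
print(f"Best run: {best['params.model']} ({best['run_id']}), RMSE {best['metrics.rmse']:.6f}")

# Gap between the two best runs, to judge whether the ranking is meaningful
gap = ranked.loc[1, "metrics.rmse"] - ranked.loc[0, "metrics.rmse"]
print(f"Gap to runner-up: {gap:.6f}")
```

With a live tracking store, `mlflow.search_runs(order_by=["metrics.rmse ASC"])` should return the runs already sorted. The gap between linear regression and random forest is on the order of 1e-5 °C, so the two are effectively tied on this split.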

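
In the code cell above, the imputer is fitted on the whole dataset, target included, before the train/test split, so statistics from the test rows influence the training data. One possible alternative is to wrap imputation, scaling, and the model in an `sklearn` `Pipeline` so that every preprocessing step is fitted on the training split only. The sketch below uses synthetic data, not the London dataset; the coefficients and noise levels are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the London dataset): two features, linear target
rng = np.random.default_rng(42)
features = pd.DataFrame({
    "max_temp": rng.normal(15, 6, 500),
    "min_temp": rng.normal(7, 5, 500),
})
target = 0.5 * features["max_temp"] + 0.5 * features["min_temp"] + rng.normal(0, 0.5, 500)

# Blank out 10% of one feature to mimic missing measurements (target stays complete)
missing_idx = features.sample(frac=0.1, random_state=0).index
features.loc[missing_idx, "max_temp"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Imputer and scaler are fitted on the training split only, inside the pipeline
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))
print(f"RMSE on held-out data: {rmse:.3f}")
```

Because the pipeline is a single estimator, it could be passed to `mlflow.sklearn.log_model` in place of the bare model, which would keep the preprocessing and the model together in one logged artifact.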