![tower_bridge](tower_bridge.jpg)

As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project that aims to build a pipeline for predicting the climate in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Because the weather depends on many factors, you will want to run many experiments to determine the best approach to predicting it. In this project, you will run experiments with different regression models that predict the mean temperature, using a combination of `sklearn` and `mlflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement, stored as a `YYYYMMDD` integer - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in watts per square meter (W/m²) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - **target** mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)
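
The `date` column is stored as an integer, and the sample rows shown later in this notebook (for example `19790101`) suggest a `YYYYMMDD` layout. The following is a minimal sketch, assuming that layout holds for every row, of converting the column to a datetime and deriving calendar features. The `year` and `month` column names are illustrative and are not part of the original dataset.

```python
import pandas as pd

# Small illustrative frame; both dates appear in the notebook's sample rows
sample = pd.DataFrame({"date": [19790101, 19790102]})

# Assumption: dates are encoded as YYYYMMDD integers
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Hypothetical derived features that a model could use instead of the raw integer
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

print(sample)
```

Calendar features such as the month may capture seasonality more directly than the raw integer, although this notebook feeds the integer to the models as-is.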

In [26]:
# Run this cell to install mlflow
!pip install mlflow

Defaulting to user installation because normal site-packages is not writeable


In [27]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv")


In [28]:
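# Inspect the first rows and the column dtypes.
# Note: weather.info() prints its summary and returns None, so display() also renders None.
display(weather.head(), weather.info())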

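# Impute missing values with each column's mean.
# Caveat: the imputer is fitted on the full dataset before splitting and also fills the target (mean_temp).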
imputer = SimpleImputer(strategy='mean').set_output(transform='pandas')
weather_imputed = imputer.fit_transform(weather)

# Split the data
X = weather_imputed.drop('mean_temp', axis=1)
y = weather_imputed['mean_temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Scale Features
scaler = StandardScaler().set_output(transform='pandas')
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Check final data
display(X_train,X_test,y_train,y_test)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15341 entries, 0 to 15340
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              15341 non-null  int64  
 1   cloud_cover       15322 non-null  float64
 2   sunshine          15341 non-null  float64
 3   global_radiation  15322 non-null  float64
 4   max_temp          15335 non-null  float64
 5   mean_temp         15305 non-null  float64
 6   min_temp          15339 non-null  float64
 7   precipitation     15335 non-null  float64
 8   pressure          15337 non-null  float64
 9   snow_depth        13900 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 1.2 MB


| | date | cloud_cover | sunshine | global_radiation | max_temp | mean_temp | min_temp | precipitation | pressure | snow_depth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19790101 | 2.0 | 7.0 | 52.0 | 2.3 | -4.1 | -7.5 | 0.4 | 101900.0 | 9.0 |
| 1 | 19790102 | 6.0 | 1.7 | 27.0 | 1.6 | -2.6 | -7.5 | 0.0 | 102530.0 | 8.0 |
| 2 | 19790103 | 5.0 | 0.0 | 13.0 | 1.3 | -2.8 | -7.2 | 0.0 | 102050.0 | 4.0 |
| 3 | 19790104 | 8.0 | 0.0 | 13.0 | -0.3 | -2.6 | -6.5 | 0.0 | 100840.0 | 2.0 |
| 4 | 19790105 | 6.0 | 2.0 | 29.0 | 5.6 | -0.8 | -1.4 | 0.0 | 102250.0 | 1.0 |


None

| | date | cloud_cover | sunshine | global_radiation | max_temp | min_temp | precipitation | pressure | snow_depth |
|---|---|---|---|---|---|---|---|---|---|
| 4209 | -0.786987 | -2.063391 | 2.671493 | 2.479511 | 1.875136 | 0.324548 | -0.443250 | 1.225786 | -0.072005 |
| 4098 | -0.790187 | -0.131415 | 0.660554 | 0.396502 | -0.395879 | 0.193291 | -0.443250 | 0.442434 | -0.072005 |
| 13529 | 1.347290 | -1.097403 | 0.089547 | -0.797006 | -1.660941 | -1.719320 | 0.512470 | 1.283104 | -0.072005 |
| 6282 | -0.296598 | 0.834573 | -0.357329 | -0.256550 | -0.472088 | -1.550560 | -0.443250 | -0.847231 | -0.072005 |
| 12646 | 1.106222 | -1.097403 | -0.481461 | 0.148793 | 0.838700 | 1.730879 | -0.230868 | -0.035220 | -0.072005 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5191 | -0.543385 | -1.580397 | 1.504652 | 0.734287 | -0.258704 | -1.419302 | -0.443250 | 1.474165 | -0.072005 |
| 13418 | 1.271687 | -1.580397 | 1.032950 | 0.351464 | 0.457657 | 0.043282 | -0.443250 | 1.751204 | -0.072005 |
| 5390 | -0.537749 | -0.131415 | 0.387464 | -0.042619 | 0.183306 | 0.474557 | -0.151224 | -1.563711 | -0.072005 |
| 860 | -1.529123 | 1.317567 | -0.878683 | -0.143954 | 0.945392 | 0.962085 | 0.087706 | -1.076504 | -0.072005 |


| | date | cloud_cover | sunshine | global_radiation | max_temp | min_temp | precipitation | pressure | snow_depth |
|---|---|---|---|---|---|---|---|---|---|
| 9261 | 0.363221 | 0.351579 | 0.834339 | 1.240965 | 0.046131 | 0.943334 | -0.443250 | -0.197622 | -0.072005 |
| 5376 | -0.538440 | 1.317567 | -1.077295 | -0.830785 | 0.640558 | 0.962085 | 1.600928 | -0.063879 | -0.072005 |
| 12578 | 1.104518 | 1.317567 | -1.002815 | -0.234031 | 0.107098 | 0.549561 | -0.443250 | 0.127183 | -0.072005 |
| 927 | -1.527428 | 0.834573 | -0.754551 | 0.081236 | 0.396690 | 0.943334 | -0.443250 | -0.121197 | -0.072005 |
| 14109 | 1.435341 | -0.131415 | 0.710207 | 0.959477 | 0.823458 | 1.393360 | 1.415094 | -0.407789 | -0.072005 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7563 | -0.044818 | 0.351579 | -0.481461 | -0.234031 | 0.716766 | 0.530810 | 1.176164 | -0.761253 | -0.072005 |
| 573 | -1.609614 | -0.131415 | -0.084238 | 0.632952 | 1.463610 | 1.149596 | -0.443250 | -0.226281 | -0.072005 |
| 12452 | 1.101178 | 0.834573 | -1.077295 | -1.112272 | -0.670230 | -1.663066 | -0.443250 | 0.337350 | -0.072005 |
| 7333 | -0.051293 | 0.834573 | -1.077295 | -1.134791 | -1.615216 | -1.119285 | -0.283963 | 1.340422 | -0.072005 |


4209     18.3
4098     11.1
13529     1.8
6282      2.4
12646    20.2
         ... 
5191      6.3
13418    13.6
5390     13.3
860      14.0
7270      8.0
Name: mean_temp, Length: 12272, dtype: float64

9261     16.0
5376     14.6
12578    12.9
927      17.0
14109    17.6
         ... 
7563     14.6
573      17.4
12452     4.6
7333      4.3
13825     6.1
Name: mean_temp, Length: 3069, dtype: float64
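
The preprocessing cell above fits the imputer on the whole dataset, including the `mean_temp` target, before the train/test split. One possible alternative, sketched here under stated assumptions, is to drop rows with a missing target and wrap the imputer and scaler in a scikit-learn `Pipeline` so that both are fitted on the training split only. The data below is synthetic and made up purely for illustration; it is not the London dataset, and this is not the approach used in the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for london_weather.csv (values are invented for illustration)
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "max_temp": rng.normal(15, 6, n),
    "min_temp": rng.normal(7, 5, n),
    "sunshine": rng.uniform(0, 14, n),
})
demo["mean_temp"] = (demo["max_temp"] + demo["min_temp"]) / 2 + rng.normal(0, 0.5, n)

# Knock out some values to mimic missing measurements
demo.loc[demo.sample(frac=0.05, random_state=0).index, "sunshine"] = np.nan
demo.loc[demo.sample(frac=0.02, random_state=1).index, "mean_temp"] = np.nan

# Drop rows whose target is missing instead of imputing the target
demo = demo.dropna(subset=["mean_temp"])

X = demo.drop(columns="mean_temp")
y = demo["mean_temp"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The imputer and scaler learn their statistics from the training split only
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipeline.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))
print(f"RMSE on held-out data: {rmse:.2f}")
```

Keeping the preprocessing inside the pipeline means the test split never influences the imputation means or the scaling statistics, which should give a more honest estimate of out-of-sample error.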

In [29]:
# Model Training and eval

# Linear Regression
with mlflow.start_run():  # Start a new run for mlflow logging
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    y_pred_lr = lr_model.predict(X_test)

    mse_lr = mean_squared_error(y_test, y_pred_lr)
    rmse_lr = np.sqrt(mse_lr)  # Calculate RMSE

    # Log the model name as a parameter and the RMSE as a metric
    mlflow.log_metric("rmse", rmse_lr)
    mlflow.log_params({"model": "LinearRegression"})

    print(f'mse_lr: {mse_lr:.2f}')

# Decision Tree Regressor
with mlflow.start_run():
    dt_model = DecisionTreeRegressor(random_state=42)
    dt_model.fit(X_train, y_train)
    y_pred_dt = dt_model.predict(X_test)

    mse_dt = mean_squared_error(y_test, y_pred_dt)
    rmse_dt = np.sqrt(mse_dt)  # Calculate RMSE

    # Log the model name as a parameter and the RMSE as a metric
    mlflow.log_metric("rmse", rmse_dt)
    mlflow.log_params({"model": "DecisionTreeRegressor"})

    print(f"Decision Tree MSE: {mse_dt:.2f}")

# Random Forest Regressor
with mlflow.start_run():
    rf_model = RandomForestRegressor(random_state=42)
    rf_model.fit(X_train, y_train)
    y_pred_rf = rf_model.predict(X_test)

    mse_rf = mean_squared_error(y_test, y_pred_rf)
    rmse_rf = np.sqrt(mse_rf)  # Calculate RMSE

    # Log the model name as a parameter and the RMSE as a metric
    mlflow.log_metric("rmse", rmse_rf)
    mlflow.log_params({"model": "RandomForestRegressor"})

    print(f"Random Forest MSE: {mse_rf:.2f}")

mse_lr: 0.84
Decision Tree MSE: 1.59
Random Forest MSE: 0.83
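
The values printed above are mean squared errors, whereas the metric logged to MLflow is the root mean squared error, so the two sets of numbers differ by a square root. The small sketch below shows the relationship; the arrays are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

# Errors are 1, 0, -1, 2, so the squared errors are 1, 0, 1, 4
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f}")   # MSE:  1.50
print(f"RMSE: {rmse:.2f}")  # RMSE: 1.22
```

RMSE is expressed in the same unit as the target (°C here), which makes it easier to interpret than MSE. For the linear regression run, the printed MSE of 0.84 is consistent with the logged RMSE of about 0.917, since 0.917² ≈ 0.84.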


In [30]:
# Search all mlflow runs and store results
experiment_results = mlflow.search_runs()
print("All experiment results:")
for index, run in experiment_results.iterrows():
    print(f" - Run ID: {run.run_id}")
    print(f"   - Model: {run['params.model']}")
    print(f"   - RMSE: {run['metrics.rmse']}")

All experiment results:
 - Run ID: b68d0dc2c255443c9f02ae48e2c2353e
   - Model: RandomForestRegressor
   - RMSE: 0.9137548696834175
 - Run ID: 4b08432d1ab8450d86150ff6c0af24af
   - Model: DecisionTreeRegressor
   - RMSE: 1.2613161137692555
 - Run ID: 08a5bbe18f2a4f29950c8bd257c999d9
   - Model: LinearRegression
   - RMSE: 0.9165913965895832
 - Run ID: f9bac45db72a470d8bbab9b18d70e3a9
   - Model: RandomForestRegressor
   - RMSE: 0.9137548696834175
 - Run ID: dc625a7298e541debed60b435d680621
   - Model: DecisionTreeRegressor
   - RMSE: 1.2613161137692555
 - Run ID: e7255106d54f4128a2ee150267bbe9e6
   - Model: LinearRegression
   - RMSE: 0.9165913965895832
 - Run ID: 9b2090cb06a645ceb401b80cf8bb6dee
   - Model: RandomForestRegressor
   - RMSE: 0.9137548696834175
 - Run ID: b18816a4c1e84fedb5f7285dc776915f
   - Model: DecisionTreeRegressor
   - RMSE: 1.2613161137692555
 - Run ID: b2fd452fc3c4432e8f4253bcc262a4c4
   - Model: LinearRegression
   - RMSE: 0.9165913965895832
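
Each model appears three times in the results above, presumably because the training cell was executed more than once. A minimal sketch of collapsing the repeats and ranking the models by RMSE follows. The DataFrame is a hand-built stand-in for the one returned by `mlflow.search_runs()`: the model names and RMSE values are copied from the output above, while the short `run_id` labels are hypothetical placeholders for the real run IDs.

```python
import pandas as pd

# Stand-in for mlflow.search_runs() output; run_id labels are hypothetical
runs = pd.DataFrame({
    "run_id": ["rf_3", "dt_3", "lr_3", "rf_2", "dt_2", "lr_2", "rf_1", "dt_1", "lr_1"],
    "params.model": [
        "RandomForestRegressor", "DecisionTreeRegressor", "LinearRegression",
    ] * 3,
    "metrics.rmse": [
        0.9137548696834175, 1.2613161137692555, 0.9165913965895832,
    ] * 3,
})

# Keep the first row per model (assumed to be the newest run, as the
# output order suggests), then rank by RMSE in ascending order
summary = (
    runs.drop_duplicates(subset="params.model", keep="first")
        .sort_values("metrics.rmse")
        .reset_index(drop=True)
)
print(summary[["params.model", "metrics.rmse"]])

best = summary.iloc[0]
print(f"Best model: {best['params.model']} (RMSE {best['metrics.rmse']:.3f})")
```

On the real results, passing `order_by=["metrics.rmse ASC"]` to `mlflow.search_runs()` should return the runs already sorted by RMSE. In these results the random forest and linear regression are nearly tied (about 0.914 versus 0.917), while the decision tree is clearly worse.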
