

As the climate changes, predicting the weather becomes ever more important for businesses. Because the weather depends on many different factors, it pays to run a range of experiments to find the best prediction approach. In this project, you will run experiments with several regression models that predict the mean temperature, using a combination of `sklearn` and `MLflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement, stored as a `YYYYMMDD` integer - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in Watt per square meter (W/m2) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)
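
Because `date` is stored as an integer such as `19790101`, a model sees it as one large number rather than a calendar date. The sketch below shows one possible way to parse it and derive calendar features. It uses a small stand-in frame with values from the first rows of `london_weather.csv`; the `year` and `month` column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Stand-in frame: the first three rows of the dataset, reduced to two columns.
sample = pd.DataFrame({
    "date": [19790101, 19790102, 19790103],
    "mean_temp": [-4.1, -2.6, -2.8],
})

# Parse the YYYYMMDD integers into real timestamps.
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Derive calendar features (hypothetical column names) that a model could use.
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

print(sample.dtypes)
```

Whether such features help is itself something worth tracking as a separate experiment.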

In [None]:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv")


weather.head()

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
0,19790101,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0
1,19790102,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0
2,19790103,5.0,0.0,13.0,1.3,-2.8,-7.2,0.0,102050.0,4.0
3,19790104,8.0,0.0,13.0,-0.3,-2.6,-6.5,0.0,100840.0,2.0
4,19790105,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0


In [None]:
# Drop every row that has a missing value, then split features from the target
df_cleaned = weather.dropna()
X = df_cleaned.drop(columns=["mean_temp"])
y = df_cleaned["mean_temp"]
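
Dropping every incomplete row is simple but discards data. `SimpleImputer` is imported at the top of the notebook but never used, so here is a minimal sketch of the alternative: filling gaps in the features with the column mean. The toy frame and its gaps are made up for illustration. In the real workflow the imputer would be fit on the training split only, and rows with a missing `mean_temp` target would still have to be dropped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; the real columns come from london_weather.csv.
toy = pd.DataFrame({
    "cloud_cover": [2.0, np.nan, 5.0, 8.0],
    "sunshine": [7.0, 1.7, np.nan, 0.0],
})

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Unlike `dropna()`, this keeps all four rows, at the cost of introducing estimated values.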


In [43]:
# Confirm that no missing values remain in the features
X.isna().sum()

date                0
cloud_cover         0
sunshine            0
global_radiation    0
max_temp            0
min_temp            0
precipitation       0
pressure            0
snow_depth          0
dtype: int64

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [45]:
# Fit the scaler on the training split only, then apply the same scaling to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
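
Fitting the scaler on the training split and reusing it on the test split keeps test information out of preprocessing. One way to make that pattern harder to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data, and all variable names in it are illustrative assumptions rather than part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: a noiseless linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 3.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales using training statistics only; predict() reuses those statistics.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
demo_pred = pipe.predict(X_te)
```

A pipeline like this can be passed to `mlflow.sklearn.log_model` as a single object, so the logged model carries its preprocessing with it.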

In [46]:
models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
    # No random_state is set here, so this model's results may differ slightly between runs
    "DecisionTreeRegressor": DecisionTreeRegressor()
}

In [47]:
# Baseline: fit a single linear regression before tracking runs with MLflow
model = LinearRegression()
model.fit(X_train_scaled, y_train)
pred = model.predict(X_test_scaled)

# Evaluate on the held-out test split
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")

Mean Squared Error: 0.8121621069635554
Root Mean Squared Error: 0.9012003700418434
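
RMSE is the square root of the mean squared error, so it is expressed in the same unit as the target: an RMSE of about 0.90 corresponds to a typical error of just under 1 °C. As a quick sanity check, the sketch below computes RMSE by hand and compares it with the `sklearn` route used above. The `y_true` values are the `mean_temp` entries from the first five rows of the data; the `y_hat` predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([-4.1, -2.6, -2.8, -2.6, -0.8])  # mean_temp, first five rows
y_hat = np.array([-3.5, -2.0, -3.0, -2.0, -1.0])   # hypothetical predictions

# RMSE from its definition: square the errors, average them, take the root.
manual_rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# The same quantity via scikit-learn, as computed in the cell above.
sklearn_rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print(manual_rmse, sklearn_rmse)
```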


In [48]:

mlflow.set_experiment("London_Weather_Prediction")

best_rmse = float("inf")
best_model_name = None

for model_name, model in models.items():
    # Each model gets its own MLflow run
    with mlflow.start_run():
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)

        # Evaluate on the held-out test split
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))

        # Log the model name, its RMSE, and the fitted model itself
        mlflow.log_param("model_name", model_name)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, model_name)

        print(f"{model_name} - RMSE: {rmse:.3f}")

        # Keep track of the best (lowest-RMSE) model seen so far
        if rmse < best_rmse:
            best_rmse = rmse
            best_model_name = model_name

print(f"\nBest Model: {best_model_name} RMSE: {best_rmse:.3f}")


LinearRegression - RMSE: 0.901
RandomForestRegressor - RMSE: 0.884
DecisionTreeRegressor - RMSE: 1.267

Best Model: RandomForestRegressor RMSE: 0.884


In [49]:
# Retrieve every run logged to the active experiment as a DataFrame
experiment_results = mlflow.search_runs()

print(experiment_results)


                             run_id  ...                      tags.mlflow.log-model.history
0  1815758e5ccd406aac7a9de5a56ef8d7  ...  [{"run_id": "1815758e5ccd406aac7a9de5a56ef8d7"...
1  4ee9468acfc14c339f44089da007e4c6  ...  [{"run_id": "4ee9468acfc14c339f44089da007e4c6"...
2  446e687f9fbc488aa81faf2e6caafd1e  ...  [{"run_id": "446e687f9fbc488aa81faf2e6caafd1e"...
3  b34a535e8ee541f3b25aab69aa45a5a2  ...  [{"run_id": "b34a535e8ee541f3b25aab69aa45a5a2"...
4  e23d85cef88d49e8b69a95e4ad98cd0f  ...  [{"run_id": "e23d85cef88d49e8b69a95e4ad98cd0f"...
5  a1d299845860472da6988067dab28ea5  ...  [{"run_id": "a1d299845860472da6988067dab28ea5"...
6  de25a455376041a0b4c1de663c983c45  ...  [{"run_id": "de25a455376041a0b4c1de663c983c45"...
7  89be2c03fe28482e950933cc1855f9ce  ...  [{"run_id": "89be2c03fe28482e950933cc1855f9ce"...
8  1e71a9ea99e842f9b9b10d37b73252c5  ...  [{"run_id": "1e71a9ea99e842f9b9b10d37b73252c5"...

[9 rows x 13 columns]
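
The printed frame is truncated to its first and last columns, which hides the values of interest. `mlflow.search_runs()` names logged parameters `params.<name>` and logged metrics `metrics.<name>`, so a small helper can pull out just the model name and RMSE. The sketch below is one possible approach. `summarize_runs` is a hypothetical helper, and `runs_demo` is a stand-in frame with placeholder run IDs that reuses the RMSE values printed by the training loop.

```python
import pandas as pd


def summarize_runs(runs: pd.DataFrame) -> pd.DataFrame:
    """Keep the model name and RMSE of each run, lowest RMSE first."""
    cols = ["run_id", "params.model_name", "metrics.rmse"]
    return runs[cols].sort_values("metrics.rmse").reset_index(drop=True)


# Hypothetical stand-in for mlflow.search_runs(); run IDs are placeholders.
runs_demo = pd.DataFrame({
    "run_id": ["run-a", "run-b", "run-c"],
    "params.model_name": [
        "LinearRegression",
        "RandomForestRegressor",
        "DecisionTreeRegressor",
    ],
    "metrics.rmse": [0.901, 0.884, 1.267],
    "status": ["FINISHED", "FINISHED", "FINISHED"],
})

summary = summarize_runs(runs_demo)
print(summary)
```

In the notebook itself, the call would be `summarize_runs(experiment_results)`.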
