# **Predicting weather in London with MLflow**

> *Perform a machine learning experiment to find the best model that predicts the temperature in London!*






As the climate changes, predicting the weather becomes ever more important for businesses. The task is to support a machine learning project that aims to build a pipeline for predicting the weather in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Because the weather depends on many factors, we will run several experiments to determine the best approach to predicting it. In this project, we run experiments with different regression models that predict the mean temperature, using a combination of sklearn and mlflow.

We will be working with data stored in london_weather.csv, which contains the following columns:

- date - recorded date of measurement - (int)
- cloud_cover - cloud cover measurement in oktas - (float)
- sunshine - sunshine measurement in hours (hrs) - (float)
- global_radiation - irradiance measurement in watts per square meter (W/m²) - (float)
- max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
- mean_temp - target mean temperature in degrees Celsius (°C) - (float)
- min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
- precipitation - precipitation measurement in millimeters (mm) - (float)
- pressure - pressure measurement in Pascals (Pa) - (float)
- snow_depth - snow depth measurement in centimeters (cm) - (float)
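
The date column is stored as an integer rather than a date type. Below is a minimal sketch of turning such a value into calendar fields. It assumes the integers are encoded as YYYYMMDD (for example 19790101), which the column list does not state, and the helper name `split_date` is hypothetical.

```python
from datetime import date, datetime


def split_date(raw: int) -> date:
    """Parse an integer ASSUMED to be encoded as YYYYMMDD into a date."""
    return datetime.strptime(str(raw), "%Y%m%d").date()


# Hypothetical example value, not read from the dataset
parsed = split_date(19790101)
print(parsed.year, parsed.month, parsed.day)
```

Under the same assumption, `pd.to_datetime(weather["date"], format="%Y%m%d")` would convert the whole column at once, after which the month or day of year could serve as an extra seasonal feature.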





Dataset link: https://www.kaggle.com/datasets/emmanuelfwerr/london-weather-data

The steps are:

- Use machine learning to predict the mean temperature in London, England, logging your root mean squared error (RMSE) metrics using mlflow.
- Build a model to predict "mean_temp" with an RMSE of 3 or less.
- Use MLflow to log any models you train, their hyperparameters, and their respective RMSE scores (ensuring you include "rmse" as part of the metric name).
- Search all of your mlflow runs and store the results as a variable called experiment_results.
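
To make the target metric concrete: RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (°C here). The following is a minimal pure-Python sketch for illustration only. The notebook itself computes the metric with sklearn's `mean_squared_error`, and the temperatures below are made-up values.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt of the mean of the squared residuals."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical temperatures in °C; the residuals are 1, -1, and 2
print(rmse([10.0, 12.0, 15.0], [9.0, 13.0, 13.0]))  # sqrt(6 / 3) = sqrt(2)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.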


In [1]:
!pip install mlflow

In [2]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv")


In [5]:
# Mean imputer and standard scaler for the features
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# Select the feature columns (the target is handled in the next cell)
features = ['cloud_cover', 'sunshine', 'global_radiation', 'max_temp',
            'min_temp', 'precipitation', 'pressure', 'snow_depth']
X = weather[features]

# Impute missing feature values, then standardize.
# Note: both transformers are fit on the full dataset, before the train/test split.
X_imputed = imputer.fit_transform(X)
X_scaled = scaler.fit_transform(X_imputed)
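
The cell above fits the imputer and scaler on every row, so statistics from rows that later land in the test set influence the preprocessing. One common alternative, which is not what this notebook does, is to split first and wrap the steps in a scikit-learn pipeline so that they are fit on the training rows only. The sketch below uses synthetic stand-in data rather than london_weather.csv, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features (NOT the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values

# Split BEFORE any preprocessing statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the imputation means and scaling parameters from X_tr only;
# predict() re-applies those learned parameters to X_te
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler(), LinearRegression()
)
pipeline.fit(X_tr, y_tr)
predictions = pipeline.predict(X_te)
print(predictions.shape)
```

With a pipeline, the fitted preprocessing travels with the model, so logging the pipeline object would also capture the imputation and scaling steps.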

In [6]:
# Fill missing values in the target column (mean_temp) with the column mean:
# convert to NumPy, reshape to 2D for the imputer, then flatten back to 1D
y_imputer = SimpleImputer(strategy='mean')
y = y_imputer.fit_transform(weather['mean_temp'].values.reshape(-1, 1)).ravel()

# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


# Function to train a model and log it to MLflow
def train_and_log_model(model, model_name, X_train, X_test, y_train, y_test):
    with mlflow.start_run():
        # Train the model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate RMSE
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))

        # Log the fitted model and its RMSE to MLflow
        mlflow.sklearn.log_model(model, model_name)
        mlflow.log_metric("rmse", rmse)

        print(f"{model_name}: RMSE = {rmse}")
        return rmse

# Initialize models
linear_model = LinearRegression()
tree_model = DecisionTreeRegressor(random_state=42)
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train models and log results
train_and_log_model(linear_model, "LinearRegression", X_train, X_test, y_train, y_test)
train_and_log_model(tree_model, "DecisionTreeRegressor", X_train, X_test, y_train, y_test)
train_and_log_model(forest_model, "RandomForestRegressor", X_train, X_test, y_train, y_test)



LinearRegression: RMSE = 0.9166133728599348
DecisionTreeRegressor: RMSE = 1.2750102381614619
RandomForestRegressor: RMSE = 0.9166052782128887
0.9166052782128887

In [7]:
# Search for MLflow runs and store them
experiment_results = mlflow.search_runs(order_by=["metrics.rmse ASC"])
experiment_results

| | run_id | experiment_id | status | artifact_uri | start_time | end_time | metrics.rmse | tags.mlflow.log-model.history | tags.mlflow.runName | tags.mlflow.user | tags.mlflow.source.type | tags.mlflow.source.name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8c53d9bf04e14f0c9895a859b8592f31 | 0 | FINISHED | file:///content/mlruns/0/8c53d9bf04e14f0c9895a... | 2024-09-21 20:12:29.001000+00:00 | 2024-09-21 20:12:39.863000+00:00 | 0.916605 | [{"run_id": "8c53d9bf04e14f0c9895a859b8592f31"... | mercurial-shrew-914 | root | LOCAL | /usr/local/lib/python3.10/dist-packages/colab_... |
| 1 | af511605ebcf40f984f3615c1ccbb836 | 0 | FINISHED | file:///content/mlruns/0/af511605ebcf40f984f36... | 2024-09-21 20:12:21.399000+00:00 | 2024-09-21 20:12:26.455000+00:00 | 0.916613 | [{"run_id": "af511605ebcf40f984f3615c1ccbb836"... | dazzling-ox-698 | root | LOCAL | /usr/local/lib/python3.10/dist-packages/colab_... |
| 2 | 86227f22bfbb401dbe1887897487459c | 0 | FINISHED | file:///content/mlruns/0/86227f22bfbb401dbe188... | 2024-09-21 20:12:26.460000+00:00 | 2024-09-21 20:12:28.997000+00:00 | 1.27501 | [{"run_id": "86227f22bfbb401dbe1887897487459c"... | loud-gull-256 | root | LOCAL | /usr/local/lib/python3.10/dist-packages/colab_... |
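
Since `mlflow.search_runs` returns a pandas DataFrame, the results can be checked against the RMSE target with ordinary filtering. The sketch below uses a small stand-in frame holding only the run names and RMSE values shown in the output above. The names `results_demo` and `RMSE_TARGET` are hypothetical.

```python
import pandas as pd

# Stand-in for experiment_results: only the run names and RMSE values
# that appear in the notebook output, not the full search_runs frame
results_demo = pd.DataFrame(
    {
        "tags.mlflow.runName": [
            "mercurial-shrew-914",
            "dazzling-ox-698",
            "loud-gull-256",
        ],
        "metrics.rmse": [0.916605, 0.916613, 1.27501],
    }
)

RMSE_TARGET = 3.0  # threshold taken from the task description

# Runs that meet the target, and the single best run by RMSE
passing = results_demo[results_demo["metrics.rmse"] <= RMSE_TARGET]
best = results_demo.loc[results_demo["metrics.rmse"].idxmin()]
print(len(passing), best["tags.mlflow.runName"])
```

The same filter can be pushed into the query itself, for example `mlflow.search_runs(filter_string="metrics.rmse <= 3")`. Note that the auto-generated run names do not say which model produced a run, so passing `run_name=model_name` to `mlflow.start_run` may make the table easier to read.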
