![tower_bridge](tower_bridge.jpg)

As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project whose aim is to build a pipeline that predicts the weather in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).

Since the weather depends on a lot of different factors, you will want to run a lot of experiments to determine what the best approach is to predict the weather. In this project, you will run experiments for different regression models predicting the mean temperature, using a combination of `sklearn` and `mlflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in watts per square meter (W/m²) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - **target** mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)
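
Because `date` is listed as an integer column, it can help to parse it with an explicit format instead of relying on automatic detection. The following is a minimal sketch, assuming the integers follow a `YYYYMMDD` layout. The sample values and the `month` column are made up for illustration and are not taken from `london_weather.csv`.

```python
import pandas as pd

# Hypothetical sample: integer dates assumed to be in YYYYMMDD form
sample = pd.DataFrame({"date": [19790101, 19790102, 20201231]})

# Convert to strings first so the format string applies unambiguously
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Calendar parts such as the month become available through the .dt accessor
sample["month"] = sample["date"].dt.month
print(sample)
```

An explicit format fails loudly on malformed values, which may be preferable to a silent misparse.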

In [1]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read in the data
weather = pd.read_csv("london_weather.csv",parse_dates=['date'])

# Start coding here
# Use as many cells as you like

In [2]:
weather

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
0,1979-01-01,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0
1,1979-01-02,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0
2,1979-01-03,5.0,0.0,13.0,1.3,-2.8,-7.2,0.0,102050.0,4.0
3,1979-01-04,8.0,0.0,13.0,-0.3,-2.6,-6.5,0.0,100840.0,2.0
4,1979-01-05,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0
...,...,...,...,...,...,...,...,...,...,...
15336,2020-12-27,1.0,0.9,32.0,7.5,7.5,7.6,2.0,98000.0,
15337,2020-12-28,7.0,3.7,38.0,3.6,1.1,-1.3,0.2,97370.0,
15338,2020-12-29,7.0,0.0,21.0,4.1,2.6,1.1,0.0,98830.0,
15339,2020-12-30,6.0,0.4,22.0,5.6,2.7,-0.1,0.0,100200.0,


In [3]:
# Number of null values in each column
weather.isna().sum()

date                   0
cloud_cover           19
sunshine               0
global_radiation      19
max_temp               6
mean_temp             36
min_temp               2
precipitation          6
pressure               4
snow_depth          1441
dtype: int64

In [4]:
# Columns missing in at most 5% of rows are handled by dropping those rows
threshold = int(len(weather) * 0.05)

In [5]:
# These columns are kept; only the rows where they are missing get dropped
cols_to_dropna = weather.columns[weather.isna().sum() <= threshold]
weather.dropna(subset=cols_to_dropna, inplace=True)
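
The 5% rule above can be easier to follow on a small example. This sketch applies the same logic to a made-up 40-row frame; the column names `a` and `b` and all values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with 40 rows: "a" has 1 missing value, "b" has 10 (made-up data)
toy = pd.DataFrame({
    "a": [np.nan] + [1.0] * 39,
    "b": [np.nan] * 10 + [2.0] * 30,
})

threshold = int(len(toy) * 0.05)  # 5% of 40 rows -> 2
missing_counts = toy.isna().sum()

# Columns at or below the threshold: fix them by dropping the affected rows
sparse_missing_cols = toy.columns[missing_counts <= threshold]
cleaned = toy.dropna(subset=sparse_missing_cols)

print(list(sparse_missing_cols))  # ['a']
print(len(cleaned))               # 39
print(cleaned["b"].isna().sum())  # 9
```

Column `b` exceeds the threshold, so its gaps are left for a later imputation step. Its missing count still falls from 10 to 9 because one dropped row was also missing `b`, the same effect that moves `snow_depth` from 1441 to 1418 in the notebook.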

In [6]:
weather.isna().sum()

date                   0
cloud_cover            0
sunshine               0
global_radiation       0
max_temp               0
mean_temp              0
min_temp               0
precipitation          0
pressure               0
snow_depth          1418
dtype: int64

In [7]:
weather.snow_depth.nunique()

19

In [8]:
# Fill the remaining gaps (only snow_depth) with each column's most frequent value
cols_with_missing_values = weather.columns[weather.isna().sum() > 0]
for col in cols_with_missing_values:
    weather[col] = weather[col].fillna(weather[col].mode()[0])
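
Filling with a mode computed on the full dataset lets rows that later land in the test set influence the fill value. One possible alternative, sketched here on made-up data, is to use `SimpleImputer` (already imported above) with `strategy="most_frequent"` and fit it on the training split only. The array below merely stands in for a column such as `snow_depth`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up single-feature data with gaps, standing in for snow_depth
X = np.array([[0.0], [0.0], [np.nan], [1.0], [0.0], [np.nan], [2.0], [0.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Learn the most frequent value from the training rows only
imputer = SimpleImputer(strategy="most_frequent")
X_train_filled = imputer.fit_transform(X_train)

# Reuse the learned value for the test rows
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
```

Whether this changes the results in practice depends on the data; for a column dominated by a single value, the two approaches will likely agree.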

In [9]:
weather=weather.drop('date',axis=1)

In [10]:
features=weather.drop('mean_temp',axis=1).values
target = weather['mean_temp'].values
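
`StandardScaler` is imported above, but the features are passed to the models unscaled. Scaling does not change the predictions of plain least squares or tree-based models, though it matters for regularized or distance-based models that might be added to the experiments later. The sketch below shows one way to attach scaling to a model with a scikit-learn pipeline. The synthetic data and its coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target is a noiseless linear function of two features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# The scaler is fitted inside the pipeline, so it only ever sees the data
# passed to fit() and is reapplied automatically at predict time
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

predictions = pipeline.predict(X[:3])
print(predictions.shape)  # (3,)
```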

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}
MSE_values = []
mlflow.set_experiment("Predicting London Weather")
for name, model in models.items():
    # The context manager ends the run even if fitting or logging fails
    with mlflow.start_run(run_name=name):
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        MSE_values.append(np.round(mse, 2))
        mlflow.sklearn.log_model(model, name)
        # Log the MSE under its own name and the RMSE as its square root
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("rmse", np.sqrt(mse))
MSE_values
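
The values collected in `MSE_values` are mean squared errors, expressed in squared degrees. Taking the square root gives the root mean squared error (RMSE), which is in the same unit as the target (°C). A small sketch with made-up temperatures shows the relationship:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up temperatures in degrees Celsius, for illustration only
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([12.0, 10.0, 16.0, 14.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors, in (°C)^2
rmse = np.sqrt(mse)                       # back in °C

print(mse, rmse)  # 4.0 2.0
```

Every prediction here is off by exactly 2 °C, so the RMSE of 2.0 reads directly as a typical error size, while the MSE of 4.0 does not.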

In [None]:
experiment_name = "Predicting London Weather"
experiment = mlflow.get_experiment_by_name(experiment_name)
experiment_id = experiment.experiment_id

# Search all runs in the experiment; experiment_ids expects a list of IDs
experiment_results = mlflow.search_runs(experiment_ids=[experiment_id])

# search_runs already returns a pandas DataFrame by default
experiment_results_df = experiment_results

print(experiment_results_df.head())
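
The DataFrame returned by `mlflow.search_runs` stores each logged metric in a column prefixed with `metrics.`, so the runs can be ranked with ordinary pandas operations. The sketch below uses a stand-in frame so that it runs without a tracking store. The run IDs and metric values are placeholders, not results from this experiment.

```python
import pandas as pd

# Stand-in for the DataFrame returned by mlflow.search_runs();
# run IDs and metric values are placeholders, not real results
runs = pd.DataFrame({
    "run_id": ["run_a", "run_b", "run_c"],
    "metrics.rmse": [1.2, 0.9, 1.5],
})

# Lower error is better, so sort ascending and take the first row
best_first = runs.sort_values("metrics.rmse").reset_index(drop=True)
best_run_id = best_first.loc[0, "run_id"]
print(best_run_id)  # run_b
```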

In [15]:
MSE_values

[0.77, 1.52, 0.76]