# From Open-Meteo to RFR pollutant prediction

##### Info:
* Source training set: file "p3_WildAir\Open_Meteo_com\OpenMeteo_data\CSV\CSV_meteopollu_final\250219_meteopolluwind2124_nooutliers.csv";
* source prediction: file "?????";

* Each meteo feature has its unit of measure in the column name;
* For the wind components "u10" and "v10" (the number is the speed):
    - A positive "u" means wind blowing from West to East (a negative "u" means East to West);
    - A positive "v" means wind blowing from South to North (a negative "v" means North to South);
* Each pollutant is in micrograms (one-millionth of a gram) per cubic meter of air, or "µg/m3".
* Hybrid Machine Learning model using a Random Forest Regressor optimized with GridSearchCV:
    - The RFR is trained with meteo data from 2021 to 2024;
    - The RFR then predicts from new meteo forecasts from Open-Meteo.

* We predict pollution levels for the 5 pollutants most dangerous to human health, according to bodies such as the World Health Organization (WHO), for the region of Lyon, France, with a Machine Learning model (Random Forest Regressor) that can be adjusted for other regions;
* The goal of the project is to forecast days with a dangerous air quality index (AQI), with an alert being activated if one of the observed pollutants is above the European regulated levels;
* If the alert is activated, a list of recommendations is delivered to guide the population in adopting the best behaviour to protect themselves and to help improve the AQI.
-----------------------
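
The alert rule described above could be sketched roughly as follows. This is only an illustration: the function name `flag_alerts` and the threshold numbers are placeholders chosen for the example, not the regulatory values applied in the project, and the `<pollutant>_Predicted` column names are assumed to match the forecast table produced at the end of this notebook.

```python
import pandas as pd

# Placeholder thresholds in µg/m3, for illustration only.
# Replace them with the regulated limits actually applied in the project.
ALERT_THRESHOLDS = {"no2": 200.0, "o3": 120.0, "pm10": 50.0, "pm2.5": 25.0, "so2": 125.0}

def flag_alerts(df_forecast, thresholds=ALERT_THRESHOLDS):
    """Return a copy of df_forecast with one boolean column per pollutant and an overall 'alert' column."""
    flagged = df_forecast.copy()
    exceed_cols = []
    for pollutant, limit in thresholds.items():
        col = f"{pollutant}_Predicted"
        if col in flagged.columns:
            flagged[f"{pollutant}_exceeds"] = flagged[col] > limit
            exceed_cols.append(f"{pollutant}_exceeds")
    # The alert is raised as soon as any single pollutant is above its threshold
    flagged["alert"] = flagged[exceed_cols].any(axis=1)
    return flagged

# Small made-up forecast with two days and two pollutants
demo = pd.DataFrame({
    "date_id": pd.to_datetime(["2025-02-19", "2025-02-20"]),
    "pm10_Predicted": [54.2, 39.1],
    "no2_Predicted": [91.4, 71.0],
})
demo_flagged = flag_alerts(demo)
print(demo_flagged[["date_id", "alert"]])
```

With these placeholder thresholds, only the first day is flagged, because its pm10 value is above 50.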

### Taking the Data from the Open-Meteo API

In [1]:
##### LIBRARIES IN USE #########

##### To get the Open-Meteo data #####
import openmeteo_requests
import requests_cache
from retry_requests import retry

##### Working the Data ######
import pandas as pd
import numpy as np

##### DataViz #####
import matplotlib.pyplot as plt
import seaborn as sns

##### Machine Learning #####
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import GridSearchCV


## 1. DataSet for training ML with meteo_pollutants 2021-2024

In [60]:
df_metepollu_trainingset = pd.read_csv(r"C:\Users\sophi\FrMarques\LyonData WCS new\P3 wildAir\p3_WildAir\Open_Meteo_com\OpenMeteo_data\CSV\CSV_meteopollu_final\250219_meteopolluwind2124_buckets.csv")


* We had two extreme outliers for pm10 that could distort the forecast. We replaced them with the values of the following day so that the model learns from more balanced data.
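
The replacement itself is not shown in this notebook. A minimal sketch of one way to do it, on toy data, is given below; the cut-off `OUTLIER_LIMIT` and the column `pm10_clean` are hypothetical names and values introduced for the example.

```python
import pandas as pd

# Toy daily series; the spike on 2021-01-03 stands in for an extreme pm10 reading.
toy = pd.DataFrame(
    {"pm10": [24.0, 23.0, 900.0, 31.0, 28.0]},
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

OUTLIER_LIMIT = 500.0  # hypothetical cut-off, not taken from the notebook

# shift(-1) aligns each row with the value of the following day
next_day = toy["pm10"].shift(-1)
is_outlier = toy["pm10"] > OUTLIER_LIMIT

# Keep the original value unless it is an outlier, in which case use the next day's value
toy["pm10_clean"] = toy["pm10"].where(~is_outlier, next_day)
print(toy)
```

Note that this approach would leave a missing value if the outlier fell on the last day of the series, since there is no following day to copy from.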

In [65]:
print(df_metepollu_trainingset.shape)
display(df_metepollu_trainingset.head(1))

(1461, 14)


Unnamed: 0,date_id,temp_c,humidity_%,rain_mm,snowfall_cm,atmopressure_hpa,cloudcover_%,u10,v10,NO2,O3,PM10,PM2.5,SO2
0,2021-01-01,2.599667,85.773796,0.075,0.002917,988.1066,95.291664,1.368064,-3.283859,71.0,58.3,24.2,24.2,8.1


### 1.1. Dropping and reorganizing the necessary columns

In [64]:
print(f"Before: {df_metepollu_trainingset.shape}")
df_metepollu_trainingset = df_metepollu_trainingset[['date_id', 'temp_c', \
    'humidity_%', 'rain_mm', 'snowfall_cm', 'atmopressure_hpa', 'cloudcover_%', 'u10', 'v10', \
       'NO2', 'O3', 'PM10', 'PM2.5', 'SO2']]
print(f"After: {df_metepollu_trainingset.shape}")

Before: (1461, 21)
After: (1461, 14)


### 1.2. Setting "date_id" as a datetime object and as the index

In [66]:
df_metepollu_trainingset["date_id"] = pd.to_datetime(df_metepollu_trainingset["date_id"])
df_metepollu_trainingset.set_index("date_id", inplace=True)

### 1.3. Renaming all columns in lower case

In [67]:
df_metepollu_trainingset.rename(columns=str.lower, inplace=True)

In [68]:
df_metepollu_trainingset.to_csv("meteo_pollu_trainingset.csv", index=True)

## 2. Getting the Open-Meteo forecast for the next 14 days

### 2.1. The Open-Meteo forecast: from download to dataframe 

* Source: https://open-meteo.com/en/docs#latitude=45.756&longitude=4.827&current=&minutely_15=&hourly=temperature_2m,relative_humidity_2m,rain,snowfall,surface_pressure,cloud_cover,wind_speed_10m,wind_direction_10m&daily=&timezone=Europe%2FBerlin&forecast_days=14&models=

In [16]:
import openmeteo_requests

import requests_cache
import pandas as pd
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = 3600)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://api.open-meteo.com/v1/forecast"
params = {
	"latitude": 45.756,
	"longitude": 4.827,
	"hourly": ["temperature_2m", "relative_humidity_2m", "rain", "snowfall", "surface_pressure", "cloud_cover", "wind_speed_10m", "wind_direction_10m"],
	"timezone": "Europe/Berlin",
	"forecast_days": 14
}
responses = openmeteo.weather_api(url, params=params)

# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()} {response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")

# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
hourly_rain = hourly.Variables(2).ValuesAsNumpy()
hourly_snowfall = hourly.Variables(3).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(4).ValuesAsNumpy()
hourly_cloud_cover = hourly.Variables(5).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(6).ValuesAsNumpy()
hourly_wind_direction_10m = hourly.Variables(7).ValuesAsNumpy()

hourly_data = {"date": pd.date_range(
	start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
	end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
	freq = pd.Timedelta(seconds = hourly.Interval()),
	inclusive = "left"
)}

hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["rain"] = hourly_rain
hourly_data["snowfall"] = hourly_snowfall
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["cloud_cover"] = hourly_cloud_cover
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_direction_10m"] = hourly_wind_direction_10m

hourly_dataframe = pd.DataFrame(data = hourly_data)
print(hourly_dataframe)

Coordinates 45.76000213623047°N 4.820000171661377°E
Elevation 161.0 m asl
Timezone b'Europe/Berlin' b'GMT+1'
Timezone difference to GMT+0 3600 s
                         date  temperature_2m  relative_humidity_2m  rain  \
0   2025-02-18 23:00:00+00:00          5.3110                  87.0   0.0   
1   2025-02-19 00:00:00+00:00          4.8110                  86.0   0.0   
2   2025-02-19 01:00:00+00:00          4.4610                  87.0   0.0   
3   2025-02-19 02:00:00+00:00          4.2110                  89.0   0.0   
4   2025-02-19 03:00:00+00:00          3.8110                  80.0   0.0   
..                        ...             ...                   ...   ...   
331 2025-03-04 18:00:00+00:00          6.4945                  65.0   0.0   
332 2025-03-04 19:00:00+00:00          6.0945                  68.0   0.0   
333 2025-03-04 20:00:00+00:00          5.7445                  70.0   0.0   
334 2025-03-04 21:00:00+00:00          5.3945                  72.0   0.0   
335 2025

### 2.2 Correcting the timezone to CET (Europe/Paris, for Lyon)

In [17]:
hourly_dataframe["date"] = pd.to_datetime(hourly_dataframe["date"]).dt.tz_convert("Europe/Paris")
hourly_dataframe.head(2)

Unnamed: 0,date,temperature_2m,relative_humidity_2m,rain,snowfall,surface_pressure,cloud_cover,wind_speed_10m,wind_direction_10m
0,2025-02-19 00:00:00+01:00,5.311,87.0,0.0,0.0,999.401306,98.0,1.297998,56.309914
1,2025-02-19 01:00:00+01:00,4.811,86.0,0.0,0.0,999.659973,100.0,1.938659,111.801476
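
The conversion matters for the daily grouping that follows, because the calendar day of a timestamp depends on the timezone. A small sketch with made-up timestamps (not taken from the API response) shows the effect:

```python
import pandas as pd

# Three hourly UTC timestamps around midnight (made-up example values)
utc_times = pd.date_range(
    "2025-02-18 22:00", periods=3, freq=pd.Timedelta(hours=1), tz="UTC"
)
local_times = utc_times.tz_convert("Europe/Paris")

# .date gives the calendar day as seen in each timezone
comparison = pd.DataFrame({
    "utc_date": utc_times.date,
    "paris_date": local_times.date,
})
print(comparison)
```

The 23:00 UTC reading belongs to 18 February in UTC but to 19 February in Paris time, so grouping by day before converting would put it in the wrong daily mean.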


### 2.3 Adding a new column "date_only" to group by day

In [85]:
hourly_dataframe["date_only"] = hourly_dataframe["date"].dt.date

### 2.4 Setting the daily mean values for all columns
#### 2.4.1 Resetting the index from "date" to "date_only"
#### 2.4.2 Dropping the "date" column

In [24]:
daily_dataframe = hourly_dataframe.groupby("date_only").mean().reset_index()
daily_dataframe["date_only"] = pd.to_datetime(daily_dataframe["date_only"])
daily_dataframe = daily_dataframe.drop(columns="date")
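
One caveat with taking the plain mean of `wind_direction_10m`: directions are circular, so the arithmetic mean of 350° and 10° is 180°, although both readings describe a roughly northerly wind. A possible alternative, sketched here on made-up values and not the approach used in this notebook, is to compute the u and v components per hour and average those instead.

```python
import numpy as np
import pandas as pd

# Two made-up hourly readings on the same day, both close to a northerly wind
hourly = pd.DataFrame({
    "date_only": ["2025-02-19", "2025-02-19"],
    "wind_speed_10m": [10.0, 10.0],        # km/h
    "wind_direction_10m": [350.0, 10.0],   # degrees, direction the wind comes from
})

# Components per hour: speed converted to m/s, direction to radians
speed_ms = hourly["wind_speed_10m"] / 3.6
direction_rad = np.radians(hourly["wind_direction_10m"])
hourly["u10"] = -speed_ms * np.sin(direction_rad)
hourly["v10"] = -speed_ms * np.cos(direction_rad)

# Daily means: the averaged direction is misleading, the averaged components are not
daily = hourly.groupby("date_only")[["wind_direction_10m", "u10", "v10"]].mean()
print(daily)
```

Here the averaged direction comes out as 180° (a southerly wind), while the averaged components give u close to zero and a negative v, which matches a wind blowing from North to South.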

### 2.5 Renaming all columns

In [26]:
daily_dataframe = daily_dataframe.rename(columns={
    "date_only": "date_id",
    "temperature_2m": "temp_c",
    "relative_humidity_2m": "humidity_%",
    "rain": "rain_mm",
    "snowfall": "snowfall_cm",
    "surface_pressure": "atmopressure_hpa",
    "cloud_cover": "cloudcover_%",
    "wind_speed_10m": "windspeed_kmh",
    "wind_direction_10m": "winddirection_360"
    })

### 2.6 Setting "date_id" as the DataFrame index

In [28]:
daily_dataframe.set_index("date_id", inplace=True)

### 2.7 Making the wind features usable by the Machine Learning model

In [29]:
daily_dataframe["windspeed_ms"] = daily_dataframe["windspeed_kmh"] * 0.277778   # convert speed from km/h to m/s
wind_dir_rad = np.radians(daily_dataframe["winddirection_360"])                   # convert direction from degrees to radians
# Meteorological convention: the direction is where the wind comes FROM, hence the minus signs
daily_dataframe["u10"] = -daily_dataframe["windspeed_ms"] * np.sin(wind_dir_rad)   # West-to-East component
daily_dataframe["v10"] = -daily_dataframe["windspeed_ms"] * np.cos(wind_dir_rad)   # South-to-North component
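
As a sanity check of the sign convention, the same formulas can be applied to two cardinal directions. The helper name `wind_to_uv` is introduced only for this sketch.

```python
import numpy as np

def wind_to_uv(speed_ms, direction_deg):
    """Convert speed and meteorological direction (degrees the wind comes FROM) to (u, v)."""
    direction_rad = np.radians(direction_deg)
    u = -speed_ms * np.sin(direction_rad)
    v = -speed_ms * np.cos(direction_rad)
    return u, v

# Wind coming from the west (270°) blows towards the east: u should be positive
u_west, v_west = wind_to_uv(5.0, 270.0)

# Wind coming from the south (180°) blows towards the north: v should be positive
u_south, v_south = wind_to_uv(5.0, 180.0)

print(u_west, v_west)
print(u_south, v_south)
```

Up to floating-point rounding, the westerly wind gives u equal to the full speed with v near zero, and the southerly wind gives v equal to the full speed with u near zero, which agrees with the definitions of "u" and "v" given in the introduction.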

### 2.8 Dropping unnecessary wind features

In [30]:
columns2drop = ["windspeed_ms", "windspeed_kmh", "winddirection_360"]

daily_dataframe = daily_dataframe.drop(columns=columns2drop)

### 2.9 Saving the Dataframe into a CSV keeping the index ("date_id")

In [33]:
daily_dataframe.to_csv("df_meteoforecast_19fev_04mar25.csv", index=True)

* New df_predict_csv path: "\p3_WildAir\Prediction_final_ML\CSV to predict\df_meteoforecast_19fev_04mar25.csv"

## 3. Machine Learning prep

### 3.1 Setting the DataFrames for training and prediction

In [69]:
df_training_2124 = pd.read_csv(r"C:\Users\sophi\FrMarques\LyonData WCS new\P3 wildAir\p3_WildAir\Open_Meteo_com\OpenMeteo_data\CSV\CSV_meteopollu_final\meteo_pollu_trainingset.csv")
df_predict_pm25 = pd.read_csv(r"C:\Users\sophi\FrMarques\LyonData WCS new\P3 wildAir\p3_WildAir\Prediction_final_ML\CSV to predict\df_meteoforecast_19fev_04mar25.csv")
print(df_training_2124.shape)
display(df_training_2124.head(2))
print(df_predict_pm25.shape)
display(df_predict_pm25.head(2))

(1461, 14)


Unnamed: 0,date_id,temp_c,humidity_%,rain_mm,snowfall_cm,atmopressure_hpa,cloudcover_%,u10,v10,no2,o3,pm10,pm2.5,so2
0,2021-01-01,2.599667,85.773796,0.075,0.002917,988.1066,95.291664,1.368064,-3.283859,71.0,58.3,24.2,24.2,8.1
1,2021-01-02,1.618417,79.047424,0.008333,0.023333,989.22394,95.791664,2.309296,-5.272886,46.0,49.8,23.2,16.4,31.4


(14, 9)


Unnamed: 0,date_id,temp_c,humidity_%,rain_mm,snowfall_cm,atmopressure_hpa,cloudcover_%,u10,v10
0,2025-02-19,8.656834,77.333336,0.066667,0.0,1001.31006,95.0,-0.908649,1.795909
1,2025-02-20,10.983917,81.0,0.0,0.0,1006.14264,66.166664,-0.266752,1.271984


### 3.2 Setting "date_id" as index on both dataframes

In [70]:
df_training_2124["date_id"] = pd.to_datetime(df_training_2124["date_id"])
df_training_2124.set_index("date_id", inplace=True)

df_predict_pm25["date_id"] = pd.to_datetime(df_predict_pm25["date_id"])
df_predict_pm25.set_index("date_id", inplace=True)

display(df_training_2124.head(2))
display(df_predict_pm25.head(2))

date_id,temp_c,humidity_%,rain_mm,snowfall_cm,atmopressure_hpa,cloudcover_%,u10,v10,no2,o3,pm10,pm2.5,so2
2021-01-01,2.599667,85.773796,0.075,0.002917,988.1066,95.291664,1.368064,-3.283859,71.0,58.3,24.2,24.2,8.1
2021-01-02,1.618417,79.047424,0.008333,0.023333,989.22394,95.791664,2.309296,-5.272886,46.0,49.8,23.2,16.4,31.4


date_id,temp_c,humidity_%,rain_mm,snowfall_cm,atmopressure_hpa,cloudcover_%,u10,v10
2025-02-19,8.656834,77.333336,0.066667,0.0,1001.31006,95.0,-0.908649,1.795909
2025-02-20,10.983917,81.0,0.0,0.0,1006.14264,66.166664,-0.266752,1.271984


In [71]:
df_training_2124[["no2", "o3", "pm10", "pm2.5", "so2"]].describe()

Unnamed: 0,no2,o3,pm10,pm2.5,so2
count,1461.0,1461.0,1461.0,1461.0,1461.0
mean,81.040999,72.644969,44.362765,21.719986,17.556509
std,24.201867,28.31761,24.706171,12.981523,23.476302
min,5.2,3.3,11.9,3.0,0.0
25%,66.0,55.3,28.1,13.5,3.0
50%,80.7,73.5,38.3,17.7,7.9
75%,95.3,90.2,53.6,25.5,22.5
max,164.4,184.7,224.6,115.1,175.9


## 4. Machine Learning: Random Forest Regressor optimized

In [72]:
df_train = df_training_2124.copy()

# Meteo features (model inputs) and pollutants (one target, and one model, per pollutant)
features = ['temp_c', 'humidity_%', 'rain_mm', 'snowfall_cm',
            'atmopressure_hpa', 'cloudcover_%', 'u10', 'v10']

pollutants = ["no2", "o3", "pm10", "pm2.5", "so2"]

X = df_train[features]
y_dict = {pollutant: df_train[pollutant] for pollutant in pollutants}

# 80/20 train/test split per pollutant; the fixed random_state gives every pollutant the same rows
X_train, X_test, y_train_dict, y_test_dict = {}, {}, {}, {}

for pollutant in pollutants:
    X_train[pollutant], X_test[pollutant], y_train_dict[pollutant], y_test_dict[pollutant] = train_test_split(
        X, y_dict[pollutant], test_size=0.2, random_state=42
    )

# Hyperparameters shared by all pollutant models
best_params = {
    "n_estimators": 100,
    "max_depth": 15,
    "min_samples_split": 2,
    "min_samples_leaf": 5,
    "random_state": 42
}

rfr_models = {}  # trained model per pollutant
metrics = {}     # evaluation metrics per pollutant

for pollutant in pollutants:
    print(f" Training the RFR for {pollutant}...")

    rfr = RandomForestRegressor(**best_params)
    rfr.fit(X_train[pollutant], y_train_dict[pollutant])

    rfr_models[pollutant] = rfr  

    y_pred = rfr.predict(X_test[pollutant])  

    mae = mean_absolute_error(y_test_dict[pollutant], y_pred)
    mse = mean_squared_error(y_test_dict[pollutant], y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test_dict[pollutant], y_pred)

    metrics[pollutant] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

print("\n Evaluating the final model:")

for pollutant, values in metrics.items():
    print(f"\n {pollutant}:")
    for metric, value in values.items():
        print(f"   -{metric}: {value:.2f}")

def predict_pollution(forecast_meteo, trained_models):
    """
    Forecast pollution levels from meteo forecasts.

    Parameters:
    - forecast_meteo (DataFrame): meteo forecasts, with a "date_id" column and the
      same feature columns used for training.
    - trained_models (dict): trained models, as {pollutant: corresponding model}.

    Returns:
    - df_predictions (DataFrame): "date_id" plus one "<pollutant>_Predicted" column
      per pollutant.

    Example of use:
        forecast_meteo = pd.read_csv("forecast_meteo_14days.csv")
        df_results = predict_pollution(forecast_meteo, rfr_models)
        print(df_results)
    """
    # Keep the dates so each prediction can be matched to its day
    df_predictions = forecast_meteo[["date_id"]].copy()

    # The models expect only the meteo features, not the date
    forecast_meteo = forecast_meteo.drop(columns=["date_id"], errors="ignore")

    for pollutant, model in trained_models.items():
        print(f" Calculating a forecast for {pollutant}...")
        df_predictions[f"{pollutant}_Predicted"] = model.predict(forecast_meteo)

    return df_predictions




 Training the RFR for no2...
 Training the RFR for o3...
 Training the RFR for pm10...
 Training the RFR for pm2.5...
 Training the RFR for so2...

 Evaluating the final model:

 no2:
   -MAE: 14.15
   -MSE: 328.02
   -RMSE: 18.11
   -R2: 0.34

 o3:
   -MAE: 12.35
   -MSE: 269.02
   -RMSE: 16.40
   -R2: 0.68

 pm10:
   -MAE: 14.51
   -MSE: 483.49
   -RMSE: 21.99
   -R2: 0.20

 pm2.5:
   -MAE: 6.60
   -MSE: 97.23
   -RMSE: 9.86
   -R2: 0.47

 so2:
   -MAE: 11.99
   -MSE: 367.77
   -RMSE: 19.18
   -R2: 0.24
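
The grid search that produced `best_params` is not part of this notebook. For reference, a GridSearchCV run over a Random Forest Regressor could look roughly like the sketch below. The synthetic data, the parameter grid, the number of folds and the scoring metric are all assumptions made so that the example runs on its own; they are not the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch is self-contained; not the project's dataset
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(120, 3)), columns=["temp_c", "u10", "v10"])
y_demo = 2.0 * X_demo["temp_c"] - X_demo["u10"] + rng.normal(scale=0.1, size=120)

# Hypothetical grid; the values actually searched are not shown in this notebook
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [5, 15],
    "min_samples_leaf": [1, 5],
}

# Every combination in the grid is evaluated with 3-fold cross-validation
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

The resulting `search.best_params_` dictionary has the same shape as `best_params` above and can be passed to `RandomForestRegressor(**...)` in the same way. In practice a separate search per pollutant may be worth trying, since the R2 values above differ a lot between pollutants.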


## 5. Machine Learning: RFR pollutant prediction

In [None]:
df_predict_pm25.reset_index(inplace=True)

In [78]:
df_forecast_19fev_04mars25 = predict_pollution(df_predict_pm25, rfr_models)
df_forecast_19fev_04mars25

 Calculating a forecast for no2...
 Calculating a forecast for o3...
 Calculating a forecast for pm10...
 Calculating a forecast for pm2.5...
 Calculating a forecast for so2...


Unnamed: 0,date_id,no2_Predicted,o3_Predicted,pm10_Predicted,pm2.5_Predicted,so2_Predicted
0,2025-02-19,91.381367,57.74524,54.169395,25.050778,6.654289
1,2025-02-20,100.687508,55.031135,59.901501,32.421581,16.400155
2,2025-02-21,81.258329,61.939708,58.65004,30.702983,3.939529
3,2025-02-22,71.042423,77.206496,39.146821,13.626208,22.731234
4,2025-02-23,86.550961,55.625564,41.304872,22.11787,9.616478
5,2025-02-24,90.325343,55.111128,57.891753,30.728425,5.665025
6,2025-02-25,78.573694,58.338091,40.860268,20.237236,15.489302
7,2025-02-26,88.863788,54.273545,44.732135,27.292548,10.124637
8,2025-02-27,94.522165,60.987574,57.821003,39.069522,18.577048
9,2025-02-28,89.910978,63.37404,39.594288,22.961058,17.228414


In [81]:
df_metepollu_trainingset[["no2", "o3", "pm10", "pm2.5", "so2"]].describe()

Unnamed: 0,no2,o3,pm10,pm2.5,so2
count,1461.0,1461.0,1461.0,1461.0,1461.0
mean,81.040999,72.644969,44.362765,21.719986,17.556509
std,24.201867,28.31761,24.706171,12.981523,23.476302
min,5.2,3.3,11.9,3.0,0.0
25%,66.0,55.3,28.1,13.5,3.0
50%,80.7,73.5,38.3,17.7,7.9
75%,95.3,90.2,53.6,25.5,22.5
max,164.4,184.7,224.6,115.1,175.9


In [84]:
df_forecast_19fev_04mars25[['no2_Predicted', 'o3_Predicted', 'pm10_Predicted',
       'pm2.5_Predicted', 'so2_Predicted']].describe()

Unnamed: 0,no2_Predicted,o3_Predicted,pm10_Predicted,pm2.5_Predicted,so2_Predicted
count,14.0,14.0,14.0,14.0,14.0
mean,84.730679,62.149523,46.645145,24.951392,10.848924
std,8.302927,7.284186,9.095417,6.518052,6.356613
min,71.042423,54.273545,33.043037,13.626208,3.441216
25%,78.888662,56.155483,39.910783,20.707394,5.091343
50%,83.904645,60.726861,44.022174,23.054545,9.870558
75%,90.221751,65.184709,56.908101,29.850374,16.172442
max,100.687508,77.206496,59.901501,39.069522,22.731234


In [86]:
df_forecast_19fev_04mars25.to_csv("1st_pollutants_forecast 19f_04m25.csv", index=False)