## Playful project
This project forecasts the air temperature for the next 5 days. It produces and compares 4 types of predictions:
1- the forecast from the API at <https://api.weatherapi.com>
2- the user's prediction (a guess of the next 5 days' temperatures)
3- a model prediction based on 2 datasets from <https://www.knmi.nl/>
4- the historical average for the same day and month across years, for each of the next five days

The results of this project are:
1- The model prediction can be evaluated by comparing it with the API's prediction and the historical average temperature.
2- Users can see how accurate their forecast is, and the program gives them suggestions based on the forecast temperature.

## Technology
Predictive Analytics: AI-powered predictive analytics models analyze historical air temperature data to forecast future air temperature patterns. This predictive capability lets the system and its users assess the accuracy of their predictions (Eichholtz, 2021; He et al., 2022).

## ML Model Implementation 

The proposed solution employs the following ML models:

* Regression models and a neural network, used to predict future air temperature.

## Data
Two datasets from KNMI.nl will be used:

1- monthly average air temperature from 1981 to 2024

2- daily average air temperature from 1981 to 2011

together with API data from api.weatherapi.com.

For predicting the next five days' air temperatures, the daily temperature dataset is the most relevant, as it provides the granularity needed for daily predictions. The monthly data could provide long-term trend and seasonality insights, but the daily dataset is more directly applicable to short-term forecasts.

## Solution 1 : predict based on historical data

* Read the dataframes

In [25]:
# importing the seaborn and matplotlib.pyplot libraries along with pandas. 
# These libraries are commonly used for data visualization in Python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

* Load the dataset

In [26]:
# Preprocessing
import pandas as pd
from datetime import datetime

# Load the dataset
daily_temp_df = pd.read_csv('daily_temp1_1981_2011_one_decimal.csv')

# Convert the 'years' column to a datetime format for easier manipulation
daily_temp_df['date'] = pd.to_datetime(daily_temp_df['years'], format='%Y%m%d')

# Drop the original 'years' column as it's no longer needed
daily_temp_df.drop('years', axis=1, inplace=True)

# Set the 'date' column as the index of the dataframe
daily_temp_df.set_index('date', inplace=True)

# Check for missing values in the dataset
missing_values = daily_temp_df.isnull().sum()

# Display the first few rows and missing values count for review
(daily_temp_df.head(), missing_values)


(            station 1  station 2  station 3  station 4  station 5  station 6  \
 date                                                                           
 1981-01-01        6.6        5.4        5.7        5.2        5.0        3.7   
 1981-01-02        7.7        7.5        7.5        7.5        7.6        6.6   
 1981-01-03        8.7        8.1        8.3        9.0        8.3        7.9   
 1981-01-04        5.6        5.1        5.2        4.6        5.0        3.7   
 1981-01-05        4.6        3.8        3.4        3.0        3.4        1.9   
 
             station 7  station 8  station 9  station 10  station 11  \
 date                                                                  
 1981-01-01        4.2        4.3        7.2         6.1         5.5   
 1981-01-02        6.8        6.4        7.5         7.9         7.2   
 1981-01-03        7.9        8.3        8.7         9.1         8.9   
 1981-01-04        4.4        3.9        6.1         5.1         4.4   

* Feature Engineering

In [27]:
# Feature engineering on the preprocessed dataframe `daily_temp_df`

# Day of the year and day of the week to capture seasonal and weekly effects
daily_temp_df['day_of_year'] = daily_temp_df.index.dayofyear
daily_temp_df['day_of_week'] = daily_temp_df.index.dayofweek

# Lag features - temperatures from previous days to capture recent trends
num_lags = 5  # Number of lag days
for lag in range(1, num_lags + 1):
    daily_temp_df[f'temp_lag_{lag}'] = daily_temp_df['daily average'].shift(lag)

# Rolling average - to smooth out short-term fluctuations and highlight longer-term trends
# Note: this window includes the current day's value, so the feature overlaps with the target
rolling_window_size = 7  # 7-day rolling window
daily_temp_df['rolling_avg_7d'] = daily_temp_df['daily average'].rolling(window=rolling_window_size).mean()

# Fill any NaN values created by the lag and rolling features with backward fill
# (DataFrame.bfill replaces the deprecated fillna(method='bfill'))
daily_temp_df.bfill(inplace=True)

# Display the updated dataframe to verify the new features
daily_temp_df.head()



date,station 1,station 2,station 3,station 4,station 5,station 6,station 7,station 8,station 9,station 10,...,station 14,daily average,day_of_year,day_of_week,temp_lag_1,temp_lag_2,temp_lag_3,temp_lag_4,temp_lag_5,rolling_avg_7d
1981-01-01,6.6,5.4,5.7,5.2,5.0,3.7,4.2,4.3,7.2,6.1,...,5.7,5.3,1,3,5.3,5.3,5.3,5.3,5.3,4.5
1981-01-02,7.7,7.5,7.5,7.5,7.6,6.6,6.8,6.4,7.5,7.9,...,6.4,7.3,2,4,5.3,5.3,5.3,5.3,5.3,4.5
1981-01-03,8.7,8.1,8.3,9.0,8.3,7.9,7.9,8.3,8.7,9.1,...,8.4,8.6,3,5,7.3,5.3,5.3,5.3,5.3,4.5
1981-01-04,5.6,5.1,5.2,4.6,5.0,3.7,4.4,3.9,6.1,5.1,...,4.1,4.7,4,6,8.6,7.3,5.3,5.3,5.3,4.5
1981-01-05,4.6,3.8,3.4,3.0,3.4,1.9,2.4,2.0,4.8,3.7,...,2.4,3.2,5,0,4.7,8.6,7.3,5.3,5.3,4.5
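
A note on the features above: `rolling(window=7).mean()` includes the current day, and backward fill copies later values into the first rows, so both can pass information about the target into the features. The following is a minimal, pandas-free sketch of lag and rolling features built only from earlier days. `build_features` is a hypothetical helper, and the temperatures are illustrative values, not taken from the KNMI data.

```python
from statistics import mean


def build_features(temps, num_lags=5, window=7):
    """Return (features, target) pairs that use only values from earlier days."""
    rows = []
    start = max(num_lags, window)  # first index that has a full history
    for t in range(start, len(temps)):
        lags = [temps[t - k] for k in range(1, num_lags + 1)]  # t-1 ... t-num_lags
        past_window = temps[t - window:t]  # the 7 days before t, excluding temps[t]
        rows.append((lags + [mean(past_window)], temps[t]))
    return rows


# Illustrative series (assumption, not real measurements)
temps = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rows = build_features(temps)
features, target = rows[0]
print(features, target)
```

Rows without a full history are dropped here instead of being filled, which costs a few samples but keeps every feature strictly in the past.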


In [28]:
# Step 3: Splitting the Data

# For time-series forecasting, the most recent data is held out for testing
# and the rest is used for training. There are no true future observations here,
# so the last part of the dataset serves as validation data that is unseen during training.

# Define features and target variable
# Note: the station columns are same-day measurements, and 'daily average' appears to be
# derived from them, so keeping them in X lets the models see the target's inputs directly.
X = daily_temp_df.drop(columns=['daily average'])  # All other columns are features
y = daily_temp_df['daily average']  # Target variable

# Since this is time-series data, we won't shuffle the rows randomly.
# Let's take the last 5% of data for testing which simulates predicting future temperatures.
test_size = int(len(X) * 0.05)  # Adjust the test size if needed

# Split the data into training and testing sets
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]

# Verify the sizes of the datasets
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


((10410, 22), (547, 22), (10410,), (547,))
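
A single hold-out block gives one estimate of the error. As a sketch of a possible extension, the split can be repeated in a walk-forward fashion, where each test block always lies after its training block. `walk_forward_splits` is a hypothetical helper; the sample count 10957 is the sum of the training and test sizes shown above.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); each test block follows its training block."""
    for k in range(n_splits):
        test_end = n_samples - (n_splits - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield range(0, test_start), range(test_start, test_end)


splits = list(walk_forward_splits(n_samples=10957, n_splits=3, test_size=547))
for train_idx, test_idx in splits:
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```

The last split reproduces the split used in this notebook (10410 training rows, 547 test rows); the earlier ones reuse older parts of the series as additional test blocks.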

* Modeling - Random Forest


In [29]:
# Step 4: Modeling - Random Forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)  # adjust these parameters

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Predict on the testing data
y_pred_rf = rf_model.predict(X_test)

# Calculate the performance metrics
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)

# Output the performance metrics for the Random Forest model
(rmse_rf, mae_rf)



(0.1506255208521871, 0.10553564899451548)
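
To judge whether these error values are good, it helps to compare them with a naive baseline. The sketch below computes RMSE and MAE by hand and applies them to a persistence forecast (tomorrow equals today). The helper names and the temperature series are assumptions for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series):
    """Predict each day as the previous day's value; returns (actual, predicted)."""
    return series[1:], series[:-1]


# Illustrative temperatures (assumption, not real measurements)
temps = [6.0, 7.0, 9.0, 8.0, 5.0]
actual, predicted = persistence_forecast(temps)
baseline_rmse = rmse(actual, predicted)
baseline_mae = mae(actual, predicted)
print(f"persistence RMSE: {baseline_rmse:.3f}, MAE: {baseline_mae:.3f}")
```

A model error far below the persistence error on real data can be a sign that the features contain same-day information about the target.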

* Modeling - Neural Network

In [30]:
# Step 4: Modeling - Neural Network

from sklearn.neural_network import MLPRegressor

# Initialize the MLPRegressor (Neural Network)
# This is a basic neural network setup; parameters can be adjusted for optimization
nn_model = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)

# Train the model on the training data
nn_model.fit(X_train, y_train)

# Predict on the testing data
y_pred_nn = nn_model.predict(X_test)

# Calculate the performance metrics
rmse_nn = np.sqrt(mean_squared_error(y_test, y_pred_nn))
mae_nn = mean_absolute_error(y_test, y_pred_nn)

# Output the performance metrics for the Neural Network model
(rmse_nn, mae_nn)



(0.14389893243012583, 0.11980014209589204)

To predict the next five days' temperatures with the trained Random Forest and Neural Network models, I would ideally need recent actual data to update the lag features. Since I don't have future data, I use the last row of the training data to build a simplified version of "future data". This is a makeshift approach due to the absence of real future data.

With the models and Python packages set up as in the previous cells, `future_data` is built from the latest available information in the training set (`X_train`): the lagged temperature values are shifted by one day to simulate "future" inputs based on the last known data.

In [38]:
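# Create future_data from the last day of the training data.
# For simplicity, the last row is replicated and its lag features are shifted by one day, assuming stable conditions.
# Note: this is a simplification. The loop body does not depend on `i`, so all five rows are identical,
# and every other feature (stations, day_of_year, rolling average) still describes the last training day.
# Normally, I should update this based on actual recent weather data.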

import numpy as np

# Prepare the 'future_data' for the next 5 days based on the shifted values of the last day in the training set
future_data_list = []
for i in range(1, 6):  # Create data for the next 5 days
    new_row = X_train.iloc[-1].copy()  # Start with the last day in training data
    # Shift temperature lags; in real scenario, update this with actual temperature forecasts or measurements
    for lag in range(1, 6):  # Assuming I have 5 lags
        if lag == 1:  # For the first lag, it's the last known value
            new_row[f'temp_lag_{lag}'] = y_train.iloc[-1]  # Last known actual temperature
        else:  # For other lags, shift the previous lags
            new_row[f'temp_lag_{lag}'] = X_train.iloc[-1][f'temp_lag_{lag - 1}']
    future_data_list.append(new_row)

# Convert the list of future data rows into a DataFrame
future_data = pd.DataFrame(future_data_list)

# Predict the next 5 days with both models
future_pred_rf = rf_model.predict(future_data)
future_pred_nn = nn_model.predict(future_data)

# Display the predictions
print("Next 5 days' temperature predictions:")
print("--------------------------------------")
print("Day | Random Forest | Neural Network")
print("--------------------------------------")
for day in range(5):
    print(f"Day {day + 1} | {future_pred_rf[day]:.2f}°C        | {future_pred_nn[day]:.2f}°C")



Next 5 days' temperature predictions:
--------------------------------------
Day | Random Forest | Neural Network
--------------------------------------
Day 1 | 24.21°C        | 24.25°C
Day 2 | 24.21°C        | 24.25°C
Day 3 | 24.21°C        | 24.25°C
Day 4 | 24.21°C        | 24.25°C
Day 5 | 24.21°C        | 24.25°C
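
All five days receive the same value because the five input rows are identical. One common alternative is a recursive forecast, where each prediction is fed back in as the newest lag before predicting the following day. The sketch below shows only the mechanism: `recursive_forecast` is a hypothetical helper, and `mean_of_lags` is a toy stand-in for a trained model, not one of the models above.

```python
def recursive_forecast(predict_one, recent_temps, horizon=5):
    """Forecast `horizon` days, feeding each prediction back in as the newest lag."""
    lags = list(recent_temps)  # most recent value first: [t-1, t-2, ...]
    forecasts = []
    for _ in range(horizon):
        next_temp = predict_one(lags)
        forecasts.append(next_temp)
        lags = [next_temp] + lags[:-1]  # shift: the prediction becomes lag 1
    return forecasts


def mean_of_lags(lags):
    """Toy model (assumption): predict the mean of the lag values."""
    return sum(lags) / len(lags)


forecasts = recursive_forecast(mean_of_lags, [10.0, 8.0, 6.0, 4.0, 2.0])
print([round(value, 2) for value in forecasts])
```

With the notebook's models, `predict_one` would also need to update the calendar features (`day_of_year`, `day_of_week`) for each future day; that part is left out of this sketch.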


## Solution 2 : predict based on API from <https://api.weatherapi.com>

In [35]:
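import os
import requests

# Read the WeatherAPI key from an environment variable instead of hard-coding it
# (the variable name WEATHERAPI_KEY is an assumption; use whatever name you export)
api_key = os.environ["WEATHERAPI_KEY"]
location = "Gouda, Netherlands"  # The location for which I want the forecast
url = f"https://api.weatherapi.com/v1/forecast.json?key={api_key}&q={location}&days=5&aqi=no&alerts=no"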

# Send the request to WeatherAPI
response = requests.get(url)
forecast_data = response.json()  # Convert the response to JSON format

# Now, let's extract, calculate, and print the average temperature for each of the next 5 days
print(f"Average Daily Temperatures for the Next 5 Days in {location}:")
print("--------------------------------------------------------------")
for day in forecast_data['forecast']['forecastday']:
    date = day['date']
    max_temp = day['day']['maxtemp_c']
    min_temp = day['day']['mintemp_c']
    avg_temp = (max_temp + min_temp) / 2  # Calculate the daily average temperature
    print(f"Date: {date}, Average Temp: {avg_temp:.2f}°C")





Average Daily Temperatures for the Next 5 Days in Gouda, Netherlands:
--------------------------------------------------------------
Date: 2024-03-09, Average Temp: 9.20°C
Date: 2024-03-10, Average Temp: 8.90°C
Date: 2024-03-11, Average Temp: 7.85°C
Date: 2024-03-12, Average Temp: 6.80°C
Date: 2024-03-13, Average Temp: 7.05°C
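
The cell above assumes the request succeeds and the response has the expected shape. A more defensive sketch of the parsing step is shown below, run against a hand-written sample payload so that no network call is needed. `daily_midpoints` is a hypothetical helper, the sample values are illustrative, and the top-level `error` key is my assumption about how the API reports failures. Note that `(max + min) / 2` is the midpoint of the daily range, which is not necessarily the same as the true daily mean.

```python
def daily_midpoints(forecast_data):
    """Return [(date, (max + min) / 2), ...]; raise ValueError on an error payload."""
    if "error" in forecast_data:  # assumed error shape
        raise ValueError(f"API error: {forecast_data['error']}")
    days = forecast_data.get("forecast", {}).get("forecastday", [])
    return [
        (day["date"], (day["day"]["maxtemp_c"] + day["day"]["mintemp_c"]) / 2)
        for day in days
    ]


# Hand-written sample with the same field names as the cell above (illustrative values)
sample = {
    "forecast": {
        "forecastday": [
            {"date": "2024-03-09", "day": {"maxtemp_c": 12.0, "mintemp_c": 6.0}},
            {"date": "2024-03-10", "day": {"maxtemp_c": 11.0, "mintemp_c": 6.0}},
        ]
    }
}
midpoints = daily_midpoints(sample)
for date, temp in midpoints:
    print(f"Date: {date}, Average Temp: {temp:.2f}°C")
```

In the notebook, calling `response.raise_for_status()` before `response.json()` would additionally surface HTTP-level failures.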


* Compare how each source performs for the upcoming five days

In [39]:
import os
import requests
import pandas as pd
import numpy as np

# Fetch the forecast from WeatherAPI
# The key is read from an environment variable (the name WEATHERAPI_KEY is an assumption)
api_key = os.environ["WEATHERAPI_KEY"]
location = "Gouda, Netherlands"
url = f"https://api.weatherapi.com/v1/forecast.json?key={api_key}&q={location}&days=5&aqi=no&alerts=no"
response = requests.get(url)
forecast_data = response.json()  # Convert response to JSON

# Extract average forecast temperatures from the API data
api_forecast_temps = [(day['day']['maxtemp_c'] + day['day']['mintemp_c']) / 2 for day in forecast_data['forecast']['forecastday']]

# Compare with your models' predictions
print("Comparing 5-day temperature forecasts:")
print("---------------------------------------")
print("Day | API Avg Temp | Random Forest | Neural Network")
print("---------------------------------------")
for day in range(5):
    print(f"Day {day + 1} | {api_forecast_temps[day]:.2f}°C        | {future_pred_rf[day]:.2f}°C        | {future_pred_nn[day]:.2f}°C")


Comparing 5-day temperature forecasts:
---------------------------------------
Day | API Avg Temp | Random Forest | Neural Network
---------------------------------------
Day 1 | 9.20°C        | 24.21°C        | 24.25°C
Day 2 | 8.90°C        | 24.21°C        | 24.25°C
Day 3 | 7.85°C        | 24.21°C        | 24.25°C
Day 4 | 6.80°C        | 24.21°C        | 24.25°C
Day 5 | 7.05°C        | 24.21°C        | 24.25°C


* Calculate the historical averages for the same day and month across years for the next five days:

In [40]:
import pandas as pd

# Ensure 'daily_temp_df' is indexed by date if it's not already
if not isinstance(daily_temp_df.index, pd.DatetimeIndex):
    daily_temp_df['date'] = pd.to_datetime(daily_temp_df['years'], format='%Y%m%d')  # Adjust format as necessary
    daily_temp_df.set_index('date', inplace=True)


In [43]:
import pandas as pd
from datetime import datetime, timedelta

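# `today` must be defined before the loop; without it this cell raises a NameError when run on its own
today = datetime.now()

# Historical average for each of the next five calendar days
for i in range(5):
    target_date = today + timedelta(days=i)
    # Make sure to group the conditions with parentheses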
    same_day = daily_temp_df[(daily_temp_df.index.month == target_date.month) & (daily_temp_df.index.day == target_date.day)]
    # Calculate the average of daily average temperatures for this date
    historical_avg = same_day['daily average'].mean()
    print(f"Date: {target_date.strftime('%Y-%m-%d')}: Historical Average Temperature: {historical_avg:.2f}°C")


Date: 2024-03-09: Historical Average Temperature: 6.44°C
Date: 2024-03-10: Historical Average Temperature: 6.76°C
Date: 2024-03-11: Historical Average Temperature: 6.72°C
Date: 2024-03-12: Historical Average Temperature: 6.54°C
Date: 2024-03-13: Historical Average Temperature: 6.50°C
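
The same-day lookup can also be written without pandas by grouping the records once by (month, day). The sketch below is a minimal illustration: `same_day_averages` is a hypothetical helper and the records are made-up values. It returns `None` for dates with no history (for example 29 February when the data has no leap-day record) instead of producing a NaN.

```python
from collections import defaultdict
from datetime import date, timedelta


def same_day_averages(records, start, days=5):
    """records: iterable of (date, temp). Returns [(target_date, average or None), ...]."""
    by_month_day = defaultdict(list)
    for day, temp in records:
        by_month_day[(day.month, day.day)].append(temp)

    result = []
    for offset in range(days):
        target = start + timedelta(days=offset)
        temps = by_month_day.get((target.month, target.day))
        result.append((target, sum(temps) / len(temps) if temps else None))
    return result


# Made-up records for illustration (assumption, not KNMI data)
records = [
    (date(1981, 3, 9), 5.0),
    (date(1982, 3, 9), 7.0),
    (date(1981, 3, 10), 6.0),
]
averages = same_day_averages(records, start=date(2024, 3, 9), days=3)
for target, avg in averages:
    print(target.isoformat(), avg)
```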


* This script first fetches the forecast from WeatherAPI, calculates the historical average temperatures for the same dates over the years from my dataset, and then prints a comparison table that includes the API's forecast, my Random Forest and Neural Network predictions, and the historical averages. This will provide a comprehensive view, showing how each source compares for the upcoming five days.

In [44]:
import os
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Assuming daily_temp_df is my historical data DataFrame and it's already loaded and indexed by date
# The WeatherAPI key is read from an environment variable (the name WEATHERAPI_KEY is an assumption)
api_key = os.environ["WEATHERAPI_KEY"]
location = "Gouda, Netherlands"
url = f"https://api.weatherapi.com/v1/forecast.json?key={api_key}&q={location}&days=5&aqi=no&alerts=no"
response = requests.get(url)
forecast_data = response.json()  # Convert response to JSON

# Extract average forecast temperatures from the API data
api_forecast_temps = [(day['day']['maxtemp_c'] + day['day']['mintemp_c']) / 2 for day in forecast_data['forecast']['forecastday']]

# Prepare for historical average calculation
today = datetime.now()
historical_averages = []

for i in range(5):
    target_date = today + timedelta(days=i)
    same_day = daily_temp_df[(daily_temp_df.index.month == target_date.month) & (daily_temp_df.index.day == target_date.day)]
    historical_avg = same_day['daily average'].mean()
    historical_averages.append(historical_avg)

# Display comparison
print("Comparing 5-day temperature forecasts:")
print("---------------------------------------------------------------")
print("Day | API Avg Temp | Random Forest | Neural Network | Historical Avg")
print("---------------------------------------------------------------")
for day in range(5):
    print(f"Day {day + 1} | {api_forecast_temps[day]:.2f}°C        | {future_pred_rf[day]:.2f}°C        | {future_pred_nn[day]:.2f}°C        | {historical_averages[day]:.2f}°C")


Comparing 5-day temperature forecasts:
---------------------------------------------------------------
Day | API Avg Temp | Random Forest | Neural Network | Historical Avg
---------------------------------------------------------------
Day 1 | 9.20°C        | 24.21°C        | 24.25°C        | 6.44°C
Day 2 | 8.90°C        | 24.21°C        | 24.25°C        | 6.76°C
Day 3 | 7.85°C        | 24.21°C        | 24.25°C        | 6.72°C
Day 4 | 6.80°C        | 24.21°C        | 24.25°C        | 6.54°C
Day 5 | 7.05°C        | 24.21°C        | 24.25°C        | 6.50°C


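The significant difference between the machine learning model predictions (Random Forest and Neural Network) and both the WeatherAPI forecast and the historical averages could be due to several factors:

* Forecast Inputs: The five "future" rows are identical copies of the last training row with shifted lags, so each model returns one value for all five days, and that value reflects the conditions on the last training day rather than March 2024.

* Data Discrepancy: The models were trained on historical data that may not accurately reflect current weather conditions or trends due to climate change or other factors.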

* Model Complexity and Overfitting: The models may be overfitted to the historical training data, capturing noise rather than the underlying patterns, thus failing to generalize to current conditions.

* Inadequate Features: The models might be missing critical features that influence temperature, such as atmospheric pressure, humidity, or specific seasonal indicators, limiting their prediction accuracy.

* Model Parameters: The hyperparameters for the Random Forest and Neural Network might not be optimally tuned for the best predictive performance.

To improve the model predictions, consider reassessing the training data for relevance and accuracy, enhancing the feature set, evaluating model performance more rigorously, and tuning model parameters. Comparing model outputs with established meteorological predictions can also provide valuable benchmarks for model accuracy.
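
As a small sketch of such a benchmark, the values from the comparison table above can be scored against the API forecast with a mean absolute difference. The numbers are copied from the table; treating the API forecast as the reference is an assumption, since it is itself a forecast and not an observation.

```python
# Values copied from the comparison table above
api = [9.20, 8.90, 7.85, 6.80, 7.05]
sources = {
    "Random Forest": [24.21] * 5,
    "Neural Network": [24.25] * 5,
    "Historical Avg": [6.44, 6.76, 6.72, 6.54, 6.50],
}


def mean_abs_diff(values, reference):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(v - r) for v, r in zip(values, reference)) / len(values)


gaps = {name: mean_abs_diff(values, api) for name, values in sources.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f}°C from the API forecast")
```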



## Compare user prediction and API

* Integrate user predictions, compare them with the API predictions, and provide weather-related tips based on the results

The cell below asks the user for temperature predictions for the next five days, compares them with the WeatherAPI forecast, and offers advice based on whether the weather is expected to be cold or hot:

In [45]:
import os
import requests

# API and location setup
# The key is read from an environment variable (the name WEATHERAPI_KEY is an assumption)
api_key = os.environ["WEATHERAPI_KEY"]
location = "Gouda, Netherlands"
url = f"https://api.weatherapi.com/v1/forecast.json?key={api_key}&q={location}&days=5&aqi=no&alerts=no"
response = requests.get(url)
forecast_data = response.json()

# Collect user's temperature predictions
user_predictions = []
for i in range(1, 6):
    temp = float(input(f"Enter your prediction for Day {i} average temperature (°C): "))
    user_predictions.append(temp)

# Fetch API forecast temperatures
api_forecast_temps = [(day['day']['maxtemp_c'] + day['day']['mintemp_c']) / 2 for day in forecast_data['forecast']['forecastday']]

# Compare and give tips
print("\nComparing your predictions with actual forecasts:")
print("--------------------------------------------------")
print("Day | Your Prediction | API Forecast | Tips")
print("--------------------------------------------------")
for i, (user_pred, api_pred) in enumerate(zip(user_predictions, api_forecast_temps), start=1):
    # Determine if it's cold or hot for simplistic tips
    tips = "It's a typical day, enjoy!"
    if api_pred < 5:
        tips = "The weather is cold, don't forget to wear a jacket!"
    elif api_pred > 25:
        tips = "The weather is hot, remember to stay hydrated and wear light clothing!"
    
    print(f"Day {i} | {user_pred:.2f}°C            | {api_pred:.2f}°C       | {tips}")



Comparing your predictions with actual forecasts:
--------------------------------------------------
Day | Your Prediction | API Forecast | Tips
--------------------------------------------------
Day 1 | 9.00°C            | 9.20°C       | It's a typical day, enjoy!
Day 2 | 6.00°C            | 8.90°C       | It's a typical day, enjoy!
Day 3 | 7.00°C            | 7.85°C       | It's a typical day, enjoy!
Day 4 | 8.00°C            | 6.80°C       | It's a typical day, enjoy!
Day 5 | 9.00°C            | 7.05°C       | It's a typical day, enjoy!
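
The table shows both values side by side but does not yet report how accurate the user's guesses were. A minimal sketch of that step is given below, using the numbers from the output above. `parse_temperature` and `score_predictions` are hypothetical helpers; the first shows one way to handle invalid input instead of letting `float(input(...))` raise.

```python
def parse_temperature(text):
    """Return the temperature as a float, or None if the text is not a number."""
    try:
        return float(text.strip().replace(",", "."))
    except ValueError:
        return None


def score_predictions(user, reference):
    """Return the per-day absolute errors and their mean."""
    errors = [abs(u - r) for u, r in zip(user, reference)]
    return errors, sum(errors) / len(errors)


# Values copied from the output above
user_predictions = [9.0, 6.0, 7.0, 8.0, 9.0]
api_forecast_temps = [9.20, 8.90, 7.85, 6.80, 7.05]

errors, mean_error = score_predictions(user_predictions, api_forecast_temps)
for day, error in enumerate(errors, start=1):
    print(f"Day {day} | off by {error:.2f}°C")
print(f"Mean absolute error: {mean_error:.2f}°C")
```

In the input loop, `parse_temperature` could be called repeatedly until it returns a number, so that a typo does not abort the whole cell.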
