# **Weather Forecasting**

---
## **1. The Setup**

### **1.1 Libraries and API**
The following libraries are needed for the model. The API key is for the OpenWeatherMap weather API (*openweathermap.org*).

A helper function makes the API request for the city of interest.

In [31]:
import requests
import pandas as pd
import numpy as np
import pytz
from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from datetime import datetime, timedelta

In [32]:
import os
from dotenv import load_dotenv

load_dotenv()  # Read variables from a local .env file into the environment
API_KEY = os.getenv("API_KEY")
BASE_URL = 'https://api.openweathermap.org/data/2.5/'

In [33]:
def get_current_weather(city):
  # Passing the query via params lets requests URL-encode city names that contain spaces
  params = {'q': city, 'appid': API_KEY, 'units': 'metric'}
  response = requests.get(f"{BASE_URL}weather", params=params, timeout=10)
  response.raise_for_status()  # Fail early on an HTTP error instead of a later KeyError
  data = response.json()

  # Map the response onto the column names used in the historical data
  return {
      'city': data['name'],
      'country': data['sys']['country'],
      'temperature_2m_mean': round(data['main']['temp']),
      'apparent_temperature_mean': round(data['main']['feels_like']),
      'temperature_2m_min': round(data['main']['temp_min']),
      'temperature_2m_max': round(data['main']['temp_max']),
      'relative_humidity_2m_mean': data['main']['humidity'],
      'wind_speed_10m_max': data['wind']['speed'],
      'wind_direction_10m_dominant': data['wind']['deg'],
      'surface_pressure_mean': data['main']['pressure'],
      'description': data['weather'][0]['description'],
  }
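
Separating the field mapping from the network call makes it possible to check the mapping offline. The following is a minimal sketch of that idea, not part of the notebook itself: `parse_current_weather` is a hypothetical helper, and the sample payload values are made up, although its nesting mirrors the keys read by `get_current_weather`.

```python
def parse_current_weather(data):
    """Map a current-weather payload onto the column names of the historical data."""
    main = data['main']
    return {
        'city': data['name'],
        'temperature_2m_mean': round(main['temp']),
        'relative_humidity_2m_mean': main['humidity'],
        'surface_pressure_mean': main['pressure'],
        'wind_speed_10m_max': data['wind']['speed'],
        'wind_direction_10m_dominant': data['wind']['deg'],
        'description': data['weather'][0]['description'],
    }

# Hypothetical payload for illustration only; the values are made up
sample_payload = {
    'name': 'London',
    'sys': {'country': 'GB'},
    'main': {'temp': 15.6, 'feels_like': 15.2, 'temp_min': 14.8,
             'temp_max': 16.9, 'humidity': 52, 'pressure': 1015},
    'wind': {'speed': 4.1, 'deg': 240},
    'weather': [{'description': 'broken clouds'}],
}

parsed = parse_current_weather(sample_payload)
print(parsed['temperature_2m_mean'], parsed['description'])
```

With a parser like this, the request function only needs to fetch the JSON and hand it over.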

### **1.2 Reading historical data**

The data was obtained from *open-meteo.com*. For more information, see Appendix A.

In [34]:
df = pd.read_csv('weather_data.csv', index_col = 'date')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

def read_historical_data(file_path):
  df = pd.read_csv(file_path) # Load the csv file

  df = df.dropna()
  df = df.drop_duplicates()
  df = df.reset_index(drop=True)
  return df

---
## **2. Predicting Rain**

### **2.1 To Encode, or not to Encode?**

Note that the data being used is already numerical:

*   Any column that would require label encoding is already stored as numbers, and
*   No column is suitable for one-hot encoding.

Thus no encoder will be used in the preprocessing.

[NB: It may be a good idea to consider categorising data. In this case, wind direction is measured as a bearing in degrees. However, it may be better to describe the general wind direction as a compass point, in other words to categorise it as N, NE, E, SE, S, SW, W or NW. For now, the bearing continues to be used and is later converted to the closest corresponding compass direction.]
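
The conversion from a bearing to the closest of the eight compass points can be sketched as follows. This is an illustrative helper rather than code from the notebook: the name `bearing_to_compass` is hypothetical, and it assumes bearings are given in degrees clockwise from north.

```python
COMPASS_POINTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def bearing_to_compass(bearing_deg):
    """Return the closest of the eight compass points for a bearing in degrees."""
    # Each sector spans 45 degrees and is centred on its compass point,
    # so shifting by half a sector (0.5) before truncating rounds to the nearest point.
    sector = int((bearing_deg % 360) / 45 + 0.5) % 8
    return COMPASS_POINTS[sector]

for bearing in (0, 44, 180, 225, 350):
    print(bearing, bearing_to_compass(bearing))
```

The final `% 8` wraps bearings just below 360° back to north.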



### **2.2 Features, targets... bullseye**

It is better to start simple and then build up the complexity (potentially including a neural network eventually). Thus, the first model predicts the `rain_tomorrow` variable. A sensible next step may be to predict `precipitation_hours`.

In [35]:
def preprocessing_for_classification(data):
  # Define the features and targets
  X = data[['temperature_2m_min', 'temperature_2m_max', 'wind_speed_10m_max', 'wind_direction_10m_dominant', 'relative_humidity_2m_mean', 'surface_pressure_mean', 'apparent_temperature_mean']]
  y = data['rain_tomorrow']  # 1-D target; a column DataFrame triggers a shape warning in fit()

  # Not yet used as features: 'precipitation_hours', 'wind_gusts_10m_max'
  return X, y

### **2.3 Training the model**

Since this is a binary classification problem (it either rains tomorrow or it doesn't), one might choose log loss (i.e. binary cross-entropy) as the split criterion. However, since computational efficiency is desired early on, the default Gini impurity criterion of the Random Forest Classifier is used. Log loss is still reported as an evaluation metric on the test set.

In [36]:
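from sklearn.metrics import log_loss  # mean_squared_error is already imported above

def train_rain_model(X, y):
  # Hold out 15% of the rows for evaluation
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
  model = RandomForestClassifier(n_estimators=200, random_state=42)

  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)  # Hard 0/1 labels, not probabilities

  # Note: log loss is computed on hard labels here, so every misclassification
  # receives the maximum penalty. On 0/1 labels the mean squared error equals
  # the misclassification rate.
  print(f"Log Loss Error: {log_loss(y_test, y_pred)}")
  print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")

  return model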
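
Log loss is designed for predicted probabilities, so a variant of the evaluation could pass the output of `predict_proba` instead of hard labels. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the weather dataset; the numbers it prints say nothing about the rain model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with seven features (not the weather dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(X_train, y_train)

proba = demo_model.predict_proba(X_test)   # One probability column per class
labels = demo_model.predict(X_test)        # Hard 0/1 labels

prob_loss = log_loss(y_test, proba, labels=demo_model.classes_)
accuracy = accuracy_score(y_test, labels)
error_rate = mean_squared_error(y_test, labels)  # Equals 1 - accuracy on 0/1 labels

print(f"Log loss on probabilities: {prob_loss:.3f}")
print(f"Accuracy: {accuracy:.3f}, MSE on labels: {error_rate:.3f}")
```

Reporting accuracy alongside the probability-based log loss separates "how often is the label right" from "how well calibrated are the probabilities".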

## **3. Predicting Temperature and Wind Speed**

### **3.1 Preprocessing for Regression**

The previous classification model predicts whether or not it will rain the next day. The following models instead predict continuous variables such as temperature and wind speed. We begin by preprocessing the data again, this time pairing each day's value with the value of the following day.

In [37]:
def preprocessing_for_regression(data, feature):
  values = data[feature].to_numpy()

  X = values[:-1].reshape(-1, 1)  # Features: the value on day i
  y = values[1:]                  # Targets: the value on day i+1

  return X, y
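
To make the pairing concrete, here is a small worked example on a made-up four-day series. The helper name `make_lag_pairs` and the values are hypothetical; the slicing is one way to express the same day-to-next-day pairing.

```python
import pandas as pd

def make_lag_pairs(series):
    """Pair each value with the value that follows it."""
    values = series.to_numpy()
    X = values[:-1].reshape(-1, 1)  # All values except the last, as a column
    y = values[1:]                  # All values except the first
    return X, y

# Made-up temperatures for four consecutive days
toy = pd.Series([10.0, 12.0, 11.0, 13.0], name='temperature_2m_mean')
X_toy, y_toy = make_lag_pairs(toy)

print(X_toy.ravel())  # Inputs:  days 1 to 3
print(y_toy)          # Targets: days 2 to 4
```

A series of n days therefore yields n - 1 training pairs.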

### **3.2 Making the model and its prediction**

Since several continuous variables are predicted with Random Forest Regressors, the training function is written once and reused to build a separate model for each variable. The prediction function is likewise shared across the models.

In [38]:
def train_regression_model(X, y):
  model = RandomForestRegressor(n_estimators=200, random_state=42)
  model.fit(X, y)
  return model

In [39]:
def predict(model, current_value):
  predictions = [current_value]

  # Feed each prediction back in as the input for the following day (5 days ahead)
  for _ in range(5):
    next_value = model.predict(np.array([[predictions[-1]]]))
    predictions.append(next_value[0])

  return predictions[1:]
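
The prediction loop is a recursive multi-step forecast: each output becomes the next input. The sketch below reproduces that idea on a synthetic noisy sine series, which is an assumption made purely for illustration; `recursive_forecast` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic daily series (not weather data): a noisy sine wave around 15
rng = np.random.default_rng(0)
series = 15 + 5 * np.sin(np.arange(120) / 10) + rng.normal(0, 0.5, 120)
X_toy = series[:-1].reshape(-1, 1)
y_toy = series[1:]

toy_model = RandomForestRegressor(n_estimators=50, random_state=42)
toy_model.fit(X_toy, y_toy)

def recursive_forecast(model, current_value, steps=5):
    """Predict `steps` values ahead, feeding each prediction back in as input."""
    forecasts = []
    last_value = current_value
    for _ in range(steps):
        last_value = float(model.predict(np.array([[last_value]]))[0])
        forecasts.append(last_value)
    return forecasts

# Start from a value far outside the training range
forecast = recursive_forecast(toy_model, current_value=40.0)
print([round(value, 1) for value in forecast])
```

Two properties are worth keeping in mind with this design. A random forest averages training targets, so its forecasts never leave the range seen during training, even for an out-of-range starting value. Errors also compound, because every step builds on the previous prediction rather than on an observation.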

## **4. Bringing it all together**

In [40]:
def weather_forecast():
  city = input("Enter a city: ")
  current_weather = get_current_weather(city)


  data = read_historical_data('weather_data.csv')

  X, y = preprocessing_for_classification(data)
  rain_model = train_rain_model(X, y)

  current_data = {
      'temperature_2m_min': current_weather['temperature_2m_min'],
      'temperature_2m_max': current_weather['temperature_2m_max'],
      # 'wind_gusts_10m_max': data['wind']['gust'],
      'wind_speed_10m_max': current_weather['wind_speed_10m_max'],
      'wind_direction_10m_dominant': current_weather['wind_direction_10m_dominant'],
      'relative_humidity_2m_mean': current_weather['relative_humidity_2m_mean'],
      'surface_pressure_mean': current_weather['surface_pressure_mean'],
      'apparent_temperature_mean': current_weather['apparent_temperature_mean'],
      # 'precipitation_hours': data['rain']['3h'],
      # 'weather_code': data[],
      # 'precipitation_sum': data[],
  }

  current_df = pd.DataFrame([current_data])

  rain_prediction = rain_model.predict(current_df)[0]

  # Predict temperature
  X_temp, y_temp = preprocessing_for_regression(data, 'temperature_2m_mean')
  temp_model = train_regression_model(X_temp, y_temp)
  temp_prediction = predict(temp_model, current_weather['temperature_2m_mean'])

  #Predict Min Temp
  X_min, y_min = preprocessing_for_regression(data, 'temperature_2m_min')
  min_temp_model = train_regression_model(X_min, y_min)
  min_temp_prediction = predict(min_temp_model, current_weather['temperature_2m_min'])

  #Predict Max Temp
  X_max, y_max = preprocessing_for_regression(data, 'temperature_2m_max')
  max_temp_model = train_regression_model(X_max, y_max)
  max_temp_prediction = predict(max_temp_model, current_weather['temperature_2m_max'])

  # Predict humidity
  X_hum, y_hum = preprocessing_for_regression(data, 'relative_humidity_2m_mean')
  hum_model = train_regression_model(X_hum, y_hum)
  hum_prediction = predict(hum_model, current_weather['relative_humidity_2m_mean'])

  # Build date labels for the next five days
  timezone = pytz.timezone('Europe/London')
  current_time = datetime.now(timezone)
  next_day = current_time + timedelta(days=1)

  future_times = [(next_day + timedelta(days=i)).strftime('%d-%m-%Y') for i in range(5)]

  print(f"City: {city}")
  print(f"Country: {current_weather['country']}")
  print(f"Current Temperature: {current_weather['temperature_2m_mean']}°C")
  print(f"Current Humidity: {current_weather['relative_humidity_2m_mean']}%")
  print(f"Minimum Temperature: {current_weather['temperature_2m_min']}")
  print(f"Maximum Temperature: {current_weather['temperature_2m_max']}")
  print(f"Weather Prediction: {current_weather['description']}")
  print(f"Rain Prediction: {'Yes' if rain_prediction else 'No'}")

  print("\nTemperature Forecast:")
  for time, temp, mini, maxi in zip(future_times, temp_prediction, min_temp_prediction, max_temp_prediction):
    print(f"{time}: {round(temp, 1)}°C. Min: {round(mini, 1)}. Max: {round(maxi, 1)}.")

  print("\nHumidity Forecast:")
  for time, hum in zip(future_times, hum_prediction):
    print(f"{time}: {round(hum, 1)}%")
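
The date labels are computed in the `Europe/London` timezone regardless of the city entered. One possible generalisation, sketched below, derives the labels from a UTC offset in seconds. It assumes such an offset is available for the requested city (for example from the weather response); `forecast_dates` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def forecast_dates(now, utc_offset_seconds, days=5):
    """Return date labels for the next `days` days in the city's local time."""
    local_tz = timezone(timedelta(seconds=utc_offset_seconds))
    local_now = now.astimezone(local_tz)
    return [
        (local_now + timedelta(days=i)).strftime('%d-%m-%Y')
        for i in range(1, days + 1)
    ]

# A fixed instant keeps the example reproducible: 23:30 UTC on 5 October 2025
instant = datetime(2025, 10, 5, 23, 30, tzinfo=timezone.utc)
print(forecast_dates(instant, utc_offset_seconds=0))
print(forecast_dates(instant, utc_offset_seconds=3600))
```

At the same instant, a city one hour ahead of UTC is already on the next calendar day, so its labels start one day later.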

In [41]:
weather_forecast()

Log Loss Error: 5.898052372764625
Mean Squared Error: 0.16363636363636364
City: London
Country: GB
Current Temperature: 16°C
Current Humidity: 52%
Minimum Temperature: 15
Maximum Temperature: 17
Weather Prediction: broken clouds
Rain Prediction: No

Temperature Forecast:
06-10-2025: 15.6°C. Min: 15.3. Max: 18.6.
07-10-2025: 14.5°C. Min: 15.6. Max: 18.3.
08-10-2025: 16.0°C. Min: 13.1. Max: 15.4.
09-10-2025: 16.4°C. Min: 8.4. Max: 15.3.
10-10-2025: 18.9°C. Min: 9.9. Max: 13.4.

Humidity Forecast:
06-10-2025: 61.1%
07-10-2025: 50.1%
08-10-2025: 67.9%
09-10-2025: 68.0%
10-10-2025: 67.9%


---
# Appendix A

The following commented code was generated on *open-meteo.com*. It uses the site's Historical Weather API to build a pandas dataframe with the variables selected on the website. I exported the dataframe as a .csv file so that an additional column, `rain_tomorrow`, could easily be added in Excel. This column matters because it is the target of the rain prediction.

(I named the file w_data so that it is not confused with the dataset used for training.)
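
The `rain_tomorrow` column could also be derived in pandas rather than in Excel. The sketch below assumes a definition the text does not spell out: a day counts as "rain tomorrow" when the following day's `precipitation_sum` exceeds a threshold (0 mm here). The helper name `add_rain_tomorrow`, the threshold, and the sample values are all hypothetical.

```python
import pandas as pd

def add_rain_tomorrow(df, threshold_mm=0.0):
    """Add a 0/1 column flagging whether the next day's precipitation exceeds the threshold."""
    out = df.copy()
    next_day_precip = out['precipitation_sum'].shift(-1)  # Align tomorrow's value with today
    out['rain_tomorrow'] = (next_day_precip > threshold_mm).astype(int)
    return out.iloc[:-1]  # The last day has no "tomorrow", so it is dropped

# Made-up precipitation totals (mm) for four consecutive days
sample = pd.DataFrame({'precipitation_sum': [0.0, 2.5, 0.0, 1.0]})
labelled = add_rain_tomorrow(sample)
print(labelled)
```

Deriving the label in code keeps the step reproducible if the date range or location changes.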

In [42]:
# !pip install openmeteo-requests
# !pip install requests-cache retry-requests

# import openmeteo_requests

# import requests_cache
# from retry_requests import retry

# # Setup the Open-Meteo API client with cache and retry on error
# cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
# retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
# openmeteo = openmeteo_requests.Client(session = retry_session)

# # Make sure all required weather variables are listed here
# # The order of variables in hourly or daily is important to assign them correctly below
# url = "https://archive-api.open-meteo.com/v1/archive"
# params = {
# 	"latitude": 51.65,
# 	"longitude": -0.08,
# 	"start_date": "2024-07-01",
# 	"end_date": "2025-07-01",
# 	"daily": ["temperature_2m_min", "temperature_2m_max", "wind_gusts_10m_max", "wind_speed_10m_max", "wind_direction_10m_dominant", "relative_humidity_2m_mean", "surface_pressure_mean", "temperature_2m_mean", "precipitation_hours", "weather_code", "precipitation_sum"],
# 	"timezone": "auto",
# }
# responses = openmeteo.weather_api(url, params=params)

# # Process first location. Add a for-loop for multiple locations or weather models
# response = responses[0]
# print(f"Coordinates: {response.Latitude()}°N {response.Longitude()}°E")
# print(f"Elevation: {response.Elevation()} m asl")
# print(f"Timezone: {response.Timezone()}{response.TimezoneAbbreviation()}")
# print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")

# # Process daily data. The order of variables needs to be the same as requested.
# daily = response.Daily()
# daily_temperature_2m_min = daily.Variables(0).ValuesAsNumpy()
# daily_temperature_2m_max = daily.Variables(1).ValuesAsNumpy()
# daily_wind_gusts_10m_max = daily.Variables(2).ValuesAsNumpy()
# daily_wind_speed_10m_max = daily.Variables(3).ValuesAsNumpy()
# daily_wind_direction_10m_dominant = daily.Variables(4).ValuesAsNumpy()
# daily_relative_humidity_2m_mean = daily.Variables(5).ValuesAsNumpy()
# daily_surface_pressure_mean = daily.Variables(6).ValuesAsNumpy()
# daily_temperature_2m_mean = daily.Variables(7).ValuesAsNumpy()
# daily_precipitation_hours = daily.Variables(8).ValuesAsNumpy()
# daily_weather_code = daily.Variables(9).ValuesAsNumpy()
# daily_precipitation_sum = daily.Variables(10).ValuesAsNumpy()

# daily_data = {"date": pd.date_range(
# 	start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
# 	end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
# 	freq = pd.Timedelta(seconds = daily.Interval()),
# 	inclusive = "left"
# )}

# daily_data["temperature_2m_min"] = daily_temperature_2m_min
# daily_data["temperature_2m_max"] = daily_temperature_2m_max
# daily_data["wind_gusts_10m_max"] = daily_wind_gusts_10m_max
# daily_data["wind_speed_10m_max"] = daily_wind_speed_10m_max
# daily_data["wind_direction_10m_dominant"] = daily_wind_direction_10m_dominant
# daily_data["relative_humidity_2m_mean"] = daily_relative_humidity_2m_mean
# daily_data["surface_pressure_mean"] = daily_surface_pressure_mean
# daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
# daily_data["precipitation_hours"] = daily_precipitation_hours
# daily_data["weather_code"] = daily_weather_code
# daily_data["precipitation_sum"] = daily_precipitation_sum

# daily_dataframe = pd.DataFrame(data = daily_data)
# print("\nDaily data\n", daily_dataframe)
# daily_dataframe.to_csv('w_data.csv')