<h1><center><font size=6>Forecasting the daily precipitation in the Pra Basin of Ghana.</font></center></h1>

### What to expect in this notebook

*  Introduction
*  Objectives
*  Essential Libraries
*  Exploratory Data Analysis
*  Data Preprocessing
*  Train Test Split
*  KNN Regression
*  Decision Tree Regression
*  Support Vector Machines (SVM)
*  Random Forest Regression
*  Gradient Boosting Regression
*  Ridge and Lasso Regression
*  XGBoost Regression
*  Neural Network Regression
*  Model tuning using GridSearch
*  Model Evaluation
*  Model deployment

# **Introduction**
In this machine learning project, I focus on forecasting the daily precipitation in the Pra Basin of Ghana, an area critical for local agriculture, water resources, and climate research. The intricacies of precipitation patterns, influenced by both local and global climatic factors, make this an ideal subject for advanced predictive modeling.

For the analysis, I used climate data from NASA's POWER Data Access Viewer (https://power.larc.nasa.gov/data-access-viewer/), a reputable source providing high-resolution global climate data. **Location: Latitude 6.625, Longitude -0.875**. This dataset offers comprehensive meteorological information crucial for understanding and forecasting weather patterns in the region and beyond.

Employing Python and various machine learning techniques, this notebook aims to construct a reliable forecast model. This endeavor not only enhances our understanding of the Pra Basin's climatic conditions but also contributes to the development of effective water management and agricultural planning strategies in response to the anticipated precipitation trends.

## **Objectives:**

The primary objective of this project is to develop and validate a machine learning model capable of accurately forecasting daily precipitation levels in the Pra Basin of Ghana. By leveraging historical climate data from NASA's POWER Data Access Viewer, I aim to:

- Analyze historical weather patterns and precipitation data within the region.
- Identify key climatic features and variables influencing precipitation in the Pra Basin.
- Construct and train predictive models using state-of-the-art machine learning techniques.
- Evaluate the models' performance and accuracy in forecasting daily precipitation.

In [1]:
# Mounting the drive in google colab (should be commented out when working on a local machine)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Importing essential libraries & the dataset:**

In [2]:
# importing libraries.
import numpy as np
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_rows', 100)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings("ignore")
from plotly.subplots import make_subplots
import plotly.graph_objs as go
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

In [74]:
# Importing the data set in google colab
data = pd.read_csv('/content/drive/MyDrive/Copilot/Pra_Nasa_data.csv')

In [75]:
df=data.copy()

In [76]:
data.head()

Unnamed: 0,Year,Month,Day,Wind_speed,Surf_Pressure,Precipitation,R_Humidity,T_Max,T_Min,Dew
0,1982,1,1,2.13,98.34,3.27,72.38,34.26,22.54,21.43
1,1982,1,2,1.49,98.23,3.52,72.0,33.26,22.41,21.0
2,1982,1,3,1.55,98.28,6.01,71.81,33.15,22.24,20.71
3,1982,1,4,1.87,98.33,7.64,71.69,32.55,21.74,20.33
4,1982,1,5,2.39,98.33,1.1,64.12,32.73,20.91,17.32


In [77]:
# Calculate the mean of 'T_Min' and 'T_Max' and add it as a 'T_mean' column
data['T_mean'] = df[['T_Min', 'T_Max']].mean(axis=1)

In [78]:
data.head()

Unnamed: 0,Year,Month,Day,Wind_speed,Surf_Pressure,Precipitation,R_Humidity,T_Max,T_Min,Dew,T_mean
0,1982,1,1,2.13,98.34,3.27,72.38,34.26,22.54,21.43,28.4
1,1982,1,2,1.49,98.23,3.52,72.0,33.26,22.41,21.0,27.84
2,1982,1,3,1.55,98.28,6.01,71.81,33.15,22.24,20.71,27.7
3,1982,1,4,1.87,98.33,7.64,71.69,32.55,21.74,20.33,27.14
4,1982,1,5,2.39,98.33,1.1,64.12,32.73,20.91,17.32,26.82


In [79]:
# drop the T_Min and T_Max columns from a dataframe
data = data.drop(['T_Min', 'T_Max'], axis=1)
data.head()

Unnamed: 0,Year,Month,Day,Wind_speed,Surf_Pressure,Precipitation,R_Humidity,Dew,T_mean
0,1982,1,1,2.13,98.34,3.27,72.38,21.43,28.4
1,1982,1,2,1.49,98.23,3.52,72.0,21.0,27.84
2,1982,1,3,1.55,98.28,6.01,71.81,20.71,27.7
3,1982,1,4,1.87,98.33,7.64,71.69,20.33,27.14
4,1982,1,5,2.39,98.33,1.1,64.12,17.32,26.82


In [80]:
# Build a datetime column from the 'Year', 'Month', and 'Day' columns
data['date'] = pd.to_datetime(data[['Year', 'Month', 'Day']])

In [81]:
data.head()

Unnamed: 0,Year,Month,Day,Wind_speed,Surf_Pressure,Precipitation,R_Humidity,Dew,T_mean,date
0,1982,1,1,2.13,98.34,3.27,72.38,21.43,28.4,1982-01-01
1,1982,1,2,1.49,98.23,3.52,72.0,21.0,27.84,1982-01-02
2,1982,1,3,1.55,98.28,6.01,71.81,20.71,27.7,1982-01-03
3,1982,1,4,1.87,98.33,7.64,71.69,20.33,27.14,1982-01-04
4,1982,1,5,2.39,98.33,1.1,64.12,17.32,26.82,1982-01-05


In [82]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15340 entries, 0 to 15339
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Year           15340 non-null  int64         
 1   Month          15340 non-null  int64         
 2   Day            15340 non-null  int64         
 3   Wind_speed     15340 non-null  float64       
 4   Surf_Pressure  15340 non-null  float64       
 5   Precipitation  15340 non-null  float64       
 6   R_Humidity     15340 non-null  float64       
 7   Dew            15340 non-null  float64       
 8   T_mean         15340 non-null  float64       
 9   date           15340 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(3)
memory usage: 1.2 MB


## **Observation:**
The dataset contains no missing (NaN) records: every column reports 15340 non-null values.
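
`data.info()` only counts NaN values. NASA POWER exports commonly mark missing observations with a fill value of -999 rather than an empty cell, so a quick check for that sentinel can be worthwhile. A minimal sketch on a toy frame (the helper name `count_missing` and the toy values are illustrative, and the -999 fill value is an assumption to verify against the header of the downloaded file):

```python
import numpy as np
import pandas as pd

def count_missing(frame, fill_value=-999):
    """Count NaN cells and cells equal to a sentinel fill value, per numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "nan": numeric.isna().sum(),
        "fill_value": (numeric == fill_value).sum(),
    })

# Toy data with one NaN and one sentinel value (not taken from the real file)
toy = pd.DataFrame({
    "Precipitation": [3.27, -999.0, 6.01, np.nan],
    "R_Humidity": [72.38, 72.00, 71.81, 71.69],
})
report = count_missing(toy)
print(report)
```

Calling `count_missing(data)` on the real frame would show whether any sentinel values slipped past the non-null count.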

## Plotting the time series of the climate variables

In [83]:
# plotting the data.

fig = make_subplots(rows=6, cols=1,
                    vertical_spacing=0.1,
                    subplot_titles=('Wind_speed', 'Surf_Pressure',
                                    "Precipitation" ,'R_Humidity', 'Dew', 'T_mean'))

fig.add_trace(go.Scatter(x=data['date'], y=data['Wind_speed']),
              row=1, col=1)

fig.add_trace(go.Scatter(x=data['date'], y=data['Surf_Pressure']),
              row=2, col=1)

fig.add_trace(go.Scatter(x=data['date'], y=data['Precipitation']),
              row=3, col=1)

fig.add_trace(go.Scatter(x=data['date'], y=data['R_Humidity']),
              row=4, col=1)

fig.add_trace(go.Scatter(x=data['date'], y=data['Dew']),
              row=5, col=1)

fig.add_trace(go.Scatter(x=data['date'], y=data['T_mean']),
              row=6, col=1)

fig.update_layout(height=1800,showlegend=False)
fig.show()


## Data preprocessing.

## **Outlier detection**

In [84]:
# Determine the percentage of outlier data points per column (1.5 * IQR rule)
def find_outliers(data):
    for column in data.columns:
        if data[column].dtype != object:
            Q1 = data[column].quantile(0.25)
            Q3 = data[column].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
            percentage = (outliers / data[column].shape[0]) * 100
            print(f"Column {column} has {round(percentage, 2)}% outliers")
            print("-" * 50)

In [85]:
find_outliers(data)

Column Year has 0.0% outliers
--------------------------------------------------
Column Month has 0.0% outliers
--------------------------------------------------
Column Day has 0.0% outliers
--------------------------------------------------
Column Wind_speed has 0.4% outliers
--------------------------------------------------
Column Surf_Pressure has 0.08% outliers
--------------------------------------------------
Column Precipitation has 6.47% outliers
--------------------------------------------------
Column R_Humidity has 2.05% outliers
--------------------------------------------------
Column Dew has 7.95% outliers
--------------------------------------------------
Column T_mean has 0.34% outliers
--------------------------------------------------
Column date has 0.0% outliers
--------------------------------------------------


In [69]:
data.shape

(15340, 10)

## **Outlier removal and feature engineering using a defined function**


In [86]:
def remove_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return series[(series >= lower_bound) & (series <= upper_bound)]

# Use the function
data['yearday'] = data.date.dt.dayofyear # making day of year column.
data.rename(columns={'Day':'Monthday'},inplace=True) # changing the day of the month to Monthday.
data.rename(columns={'date':'ds'},inplace=True) # changing the date column name to "ds" for NeuralProphet model.
for col in ["Wind_speed" ,'Surf_Pressure', 'Precipitation', 'R_Humidity', 'Dew', 'T_mean']:
    data[col] = remove_outliers(data[col]) # removing the column's outliers
data.dropna(inplace=True)
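
Because `remove_outliers` returns a shortened Series, assigning it back to the column aligns on the index and leaves NaN wherever a value was filtered out. `dropna` then removes the whole row, so a single outlier in any column discards that day for every variable. A small sketch on made-up numbers (the toy frame is illustrative only):

```python
import pandas as pd

def remove_outliers(series):
    """Keep values within 1.5 * IQR of the quartiles (same rule as above)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series[(series >= q1 - 1.5 * iqr) & (series <= q3 + 1.5 * iqr)]

toy = pd.DataFrame({
    "Precipitation": [0.0, 0.1, 0.2, 0.3, 0.4, 50.0],  # last value is an outlier
    "Wind_speed":    [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],   # no outliers
})

for col in toy.columns:
    toy[col] = remove_outliers(toy[col])  # filtered positions become NaN

n_nan = int(toy["Precipitation"].isna().sum())
cleaned = toy.dropna()  # drops the full row, including the valid Wind_speed value
print(n_nan, len(cleaned))
```

For a skewed variable such as daily rainfall, the values removed this way tend to be the wettest days, which is worth keeping in mind when reading the scores further down.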

In [87]:
data.head()

Unnamed: 0,Year,Month,Monthday,Wind_speed,Surf_Pressure,Precipitation,R_Humidity,Dew,T_mean,ds,yearday
0,1982,1,1,2.13,98.34,3.27,72.38,21.43,28.4,1982-01-01,1
1,1982,1,2,1.49,98.23,3.52,72.0,21.0,27.84,1982-01-02,2
2,1982,1,3,1.55,98.28,6.01,71.81,20.71,27.7,1982-01-03,3
3,1982,1,4,1.87,98.33,7.64,71.69,20.33,27.14,1982-01-04,4
11,1982,1,12,2.48,98.4,0.0,70.94,19.71,26.82,1982-01-12,12


In [88]:
data.shape

(13020, 11)

In [89]:
# plotting the data.
fig = make_subplots(rows=2, cols=1,
                    vertical_spacing=0.1,
                    subplot_titles=("Wind_speed" ,'Surf_Pressure'))

fig.add_trace(go.Scatter(x=data['ds'], y=data['Wind_speed']),
              row=1, col=1)

fig.add_trace(go.Scatter(x=data['ds'], y=data['Surf_Pressure']),
              row=2, col=1)

fig.update_layout(height=800,showlegend=False)
fig.show()


## Train test split and feature scaling for the precipitation models.

In [90]:
# Define your target variables (the variables you want to predict)
y = data[['Precipitation']]

# Define your feature variables (the variables you will use to predict the targets)
X = data[['Year', 'Month', 'Monthday', 'yearday', 'ds', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']]

# Split your data into training and testing sets based on the 'Year'
X_train1 = X[(X['Year'] < 2010)]

X_train = X_train1.drop(['ds', 'yearday'], axis = 1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Convert the scaled data back into a DataFrame
# Note that 'ds' and 'yearday' are not included in the columns as they were dropped before scaling
X_train = pd.DataFrame(X_train, columns=['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew'])

X_test1 = X[(X['Year'] >= 2010) & (X['Year'] <= 2023)]
X_test = X_test1.drop(['ds', 'yearday'], axis = 1)

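# Note: a second scaler is fitted on the test rows here, so the test set is
# standardized with its own mean and standard deviation
scaler = StandardScaler()
X_test = scaler.fit_transform(X_test)

# Convert the scaled data back into a DataFrame
# Note that 'ds' and 'yearday' are not included in the columns as they were dropped before scaling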
X_test = pd.DataFrame(X_test, columns=['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew'])

y_train = y[(X['Year'] < 2010)]
y_test = y[(X['Year'] >= 2010) & (X['Year'] <= 2023)]
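
The cell above fits a second `StandardScaler` on the test rows, so the test features are standardized with their own mean and standard deviation rather than those learned from the training years. A common alternative is to fit the scaler once on the training data and reuse it for the test data. A minimal sketch on synthetic numbers (the arrays and variable names are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=27.0, scale=1.0, size=(200, 1))  # stands in for earlier years
test = rng.normal(loc=28.0, scale=1.0, size=(50, 1))    # stands in for later years

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training data only
test_scaled = scaler.transform(test)        # reuse the training statistics

# Refitting on the test set re-centers it at zero and hides the shift between periods
test_refit = StandardScaler().fit_transform(test)
print(test_scaled.mean(), test_refit.mean())
```

With a shared scaler, a given raw value maps to the same scaled value in both sets, which is what distance-based models such as KNN and SVR implicitly assume.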

In [91]:
X_train.head()

Unnamed: 0,Year,Month,Monthday,T_mean,Surf_Pressure,Wind_speed,R_Humidity,Dew
0,-1.7,-1.75,-1.67,1.07,-1.11,-0.97,-0.91,-0.18
1,-1.7,-1.75,-1.56,0.75,-1.76,-1.88,-0.96,-0.58
2,-1.7,-1.75,-1.45,0.68,-1.47,-1.79,-0.99,-0.85
3,-1.7,-1.75,-1.33,0.36,-1.17,-1.34,-1.0,-1.21
4,-1.7,-1.75,-0.43,0.18,-0.76,-0.47,-1.1,-1.8


In [92]:
X_test.head()

Unnamed: 0,Year,Month,Monthday,T_mean,Surf_Pressure,Wind_speed,R_Humidity,Dew
0,-1.6,-1.7,-1.68,1.37,-0.42,-0.67,-1.7,-1.44
1,-1.6,-1.7,-1.56,1.41,-0.35,-1.23,-1.76,-1.45
2,-1.6,-1.7,-1.45,1.6,-0.55,-1.08,-2.19,-2.36
3,-1.6,-1.7,-1.33,1.22,-0.55,-1.32,-1.93,-2.45
4,-1.6,-1.7,-1.22,1.52,-0.67,-1.19,-2.34,-2.78


In [93]:
y_train.head()

Unnamed: 0,Precipitation
0,3.27
1,3.52
2,6.01
3,7.64
11,0.0


In [94]:
y_test.head()

Unnamed: 0,Precipitation
10227,0.03
10228,0.07
10229,0.07
10230,0.09
10231,0.15


## **KNN Regressor Algorithm**

In [95]:
from sklearn.neighbors import KNeighborsRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
KNNforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = KNeighborsRegressor(n_neighbors=27)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast = model.predict(X_test)

# Store the forecast and print the R2 score
KNNforecasts[column] = forecast
print(f'r2 score for {column} : {r2_score(y_test[column], forecast)}')

r2 score for Precipitation : 0.3123801129238638


## Comparing the forecast with the test data

In [96]:
fig = make_subplots(rows=1, cols=1, vertical_spacing=0.08, subplot_titles=('Precipitation',))

# Pre-calculate X_test1['ds']
x_values = X_test1['ds']

for col, forecast in KNNforecasts.items():
    fig.add_trace(go.Scatter(x=x_values, y=y_test[col], name=col), row=1, col=1)
    fig.add_trace(go.Scatter(x=x_values, y=forecast, name=col+' forecast'), row=1, col=1)

fig.update_layout(height=400)
fig.show()

## **KNN Optimization using the GridSearch method**

In [97]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = KNeighborsRegressor()

# Define the parameter grid
param_grid = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_KN_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_KN_op)}')

Fitting 3 folds for each of 120 candidates, totalling 360 fits
Best parameters: {'metric': 'manhattan', 'n_neighbors': 30, 'weights': 'distance'}
r2 score for Precipitation : 0.33626603048458736


In [26]:
fig = make_subplots(rows=1, cols=1, vertical_spacing=0.08, subplot_titles=('Precipitation',))

# Pre-calculate X_test1['ds']
x_values = X_test1['ds']

# Plot the observed test values against the tuned KNN forecast
fig.add_trace(go.Scatter(x=x_values, y=y_test[column], name=column), row=1, col=1)
fig.add_trace(go.Scatter(x=x_values, y=forecast_KN_op, name=column+' forecast'), row=1, col=1)

fig.update_layout(height=400)
fig.show()

## **Random Forest Regressor Algorithm**

In [27]:
from sklearn.ensemble import RandomForestRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
RFforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = RandomForestRegressor(n_estimators=100, random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_rf = model.predict(X_test)

# Store the forecast and print the R2 score
RFforecasts[column] = forecast_rf
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_rf)}')

r2 score for Precipitation : 0.3577635598327874


## **Optimization of Random Forest Regressor using GridSearch**

In [28]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = RandomForestRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_rf_op = grid_search.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_rf_op)}')

Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}
r2 score for Precipitation : 0.3762966815283265


In [30]:
# Create the subplot
fig = make_subplots(rows=1, cols=1, vertical_spacing=0.08, subplot_titles=('Precipitation',))

# Pre-calculate X_test1['ds']
x_values = X_test1['ds']

# Add traces to the plot
fig.add_trace(go.Scatter(x=x_values, y=y_test[column], name=column), row=1, col=1)
fig.add_trace(go.Scatter(x=x_values, y=forecast_rf_op, name=column+' forecast'), row=1, col=1)

# Update the layout and show the plot
fig.update_layout(height=400)
fig.show()

## **Decision Tree Regressor Algorithm**

In [32]:
from sklearn.tree import DecisionTreeRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
DTforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = DecisionTreeRegressor(random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_dt = model.predict(X_test)

# Store the forecast and print the R2 score
DTforecasts[column] = forecast_dt
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_dt)}')

r2 score for Precipitation : -0.3703963858887731


## **Optimization of Decision Tree Regressor using GridSearch**

In [33]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = DecisionTreeRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_dt_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_dt_op)}')

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10}
r2 score for Precipitation : 0.21855111656920956


In [34]:
# Create the subplot
fig = make_subplots(rows=1, cols=1, vertical_spacing=0.08, subplot_titles=('Precipitation',))

# Pre-calculate X_test1['ds']
x_values = X_test1['ds']

# Add traces to the plot
fig.add_trace(go.Scatter(x=x_values, y=y_test[column], name=column), row=1, col=1)
fig.add_trace(go.Scatter(x=x_values, y=forecast_dt_op, name=column+' forecast'), row=1, col=1)

# Update the layout and show the plot
fig.update_layout(height=400)
fig.show()

## **XGBoost Regressor Algorithm**

In [35]:
from xgboost import XGBRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
XGBforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = XGBRegressor(n_estimators=100, random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_XG = model.predict(X_test)

# Store the forecast and print the R2 score
XGBforecasts[column] = forecast_XG
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_XG)}')

r2 score for Precipitation : 0.2821731578678305


## **Optimization of XGBoost Regressor using GridSearch**

In [36]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday','T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = XGBRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_XG_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_XG_op)}')

Fitting 3 folds for each of 324 candidates, totalling 972 fits
Best parameters: {'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': None, 'n_estimators': 200, 'subsample': 0.5}
r2 score for Precipitation : 0.3717285057191947
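
The "324 candidates" reported by `GridSearchCV` is the product of the number of values per hyperparameter (3 × 4 × 3 × 3 × 3), and each candidate is fitted once per fold. The size of a search can be checked before launching it with scikit-learn's `ParameterGrid`; a small sketch using the grid from the cell above (the names `cv_folds`, `n_candidates`, and `n_fits` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
}
cv_folds = 3

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 * 3 combinations
n_fits = n_candidates * cv_folds               # one fit per candidate per fold
print(n_candidates, n_fits)
```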


## **Gradient Boosting Regressor Algorithm**

In [38]:
from sklearn.ensemble import GradientBoostingRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
GBforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = GradientBoostingRegressor(random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_gb = model.predict(X_test)

# Store the forecast and print the R2 score
GBforecasts[column] = forecast_gb
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_gb)}')

r2 score for Precipitation : 0.38180110767291553


## **Optimization of Gradient Boosting Regressor using GridSearch**

In [40]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = GradientBoostingRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.7, 1.0]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_gb_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_gb_op)}')

Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters: {'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 200, 'subsample': 0.5}
r2 score for Precipitation : 0.3757191939175545


## **Neural Network Regression**

In [41]:
from sklearn.neural_network import MLPRegressor

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
NNforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = MLPRegressor(random_state=42, max_iter=500)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_nn = model.predict(X_test)

# Store the forecast and print the R2 score
NNforecasts[column] = forecast_nn
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_nn)}')

r2 score for Precipitation : 0.38230619204903105


## **Optimization of the Neural Network regressor using GridSearch**

In [42]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = MLPRegressor(random_state=42)

# Define the parameter grid
param_grid = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_nn_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_nn_op)}')

Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best parameters: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
r2 score for Precipitation : 0.3781589327537981


## **Ridge Regression**

In [56]:
from sklearn.linear_model import Ridge

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'yearday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
Ridgeforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = Ridge(random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_ridge = model.predict(X_test)

# Store the forecast and print the R2 score
Ridgeforecasts[column] = forecast_ridge
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_ridge)}')

r2 score for Precipitation : 0.3280594307224415


## **Lasso Regression**

In [57]:
from sklearn.linear_model import Lasso

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'yearday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
Lassoforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = Lasso(random_state=42)
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_lasso = model.predict(X_test)

# Store the forecast and print the R2 score
Lassoforecasts[column] = forecast_lasso
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_lasso)}')

r2 score for Precipitation : 0.10119498728269505
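
Ridge and Lasso are run here with their default regularization strength (`alpha=1.0`) and are not tuned like the other models. With standardized features, an `alpha` of 1.0 can shrink most or all Lasso coefficients to zero, which may partly explain the low score. A hedged sketch of tuning `alpha` with `GridSearchCV` on synthetic data (the data, grid values, and names are assumptions, not results from this notebook):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))                  # stands in for standardized features
coef = np.array([0.8, -0.5, 0.3, 0.0, 0.0, 0.2, 0.0, 0.1])
y_demo = X_demo @ coef + rng.normal(scale=1.0, size=500)

search = GridSearchCV(
    estimator=Lasso(max_iter=10000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# Compare how many coefficients survive at the default alpha and at the tuned alpha
default_fit = Lasso(alpha=1.0).fit(X_demo, y_demo)
n_nonzero_default = int(np.sum(default_fit.coef_ != 0))
n_nonzero_best = int(np.sum(search.best_estimator_.coef_ != 0))
print(search.best_params_, n_nonzero_default, n_nonzero_best)
```

Whether a smaller `alpha` helps on the Pra Basin data would need to be checked by running the search on `X_train` and `y_train`.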


## **Support Vector Machine Regression**

In [98]:
from sklearn.svm import SVR

# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Initialize the forecast dictionary
SVMforecasts = dict()

# Define the column
column = 'Precipitation'

# Fit the model
m = SVR()
X = X_train[features]
y = y_train[column]
model = m.fit(X, y)

# Make predictions
X_test = X_test[features]
forecast_svm = model.predict(X_test)

# Store the forecast and print the R2 score
SVMforecasts[column] = forecast_svm
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_svm)}')

r2 score for Precipitation : 0.35666967980384867


## **Optimization of the SVM regressor using GridSearch**

In [28]:
# Pre-calculate the features list
features = ['Year', 'Month', 'Monthday', 'T_mean', 'Surf_Pressure', 'Wind_speed', 'R_Humidity', 'Dew']

# Define the column
column = 'Precipitation'

# Define the model
m = SVR()

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'epsilon': [0.1, 0.2, 0.3],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=m, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the model
X = X_train[features]
y = y_train[column]
grid_search.fit(X, y)

# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

# Make predictions with the best model
X_test = X_test[features]
forecast_svm_op = grid_search.best_estimator_.predict(X_test)

# Print the R2 score
print(f'r2 score for {column} : {r2_score(y_test[column], forecast_svm_op)}')

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best parameters: {'C': 1, 'epsilon': 0.3, 'kernel': 'rbf'}
r2 score for Precipitation : 0.3591608933415579


In [99]:
# Create the subplot
fig = make_subplots(rows=1, cols=1, vertical_spacing=0.08, subplot_titles=('Precipitation',))

# Pre-calculate X_test1['ds']
x_values = X_test1['ds']

# Add traces to the plot
fig.add_trace(go.Scatter(x=x_values, y=y_test[column], name=column), row=1, col=1)
fig.add_trace(go.Scatter(x=x_values, y=forecast_svm_op, name=column+' forecast'), row=1, col=1)

# Update the layout and show the plot
fig.update_layout(height=400)
fig.show()

## **Conclusions:**


In conclusion, this notebook applied nine machine learning algorithms to forecast daily precipitation in the Pra Basin, Ghana, West Africa. Model performance was evaluated with the R² score on the held-out test period (2010 to 2023), which measures the proportion of variance in the observed precipitation that the predictions explain.

Among the evaluated models, Neural Network, Gradient Boosting, and Random Forest (RF) achieved the highest R² values, all about 0.38, with the tuned XGBoost model close behind at 0.37. This suggests that these models have a comparable, moderate ability to predict daily precipitation: they explain roughly 38% of the variance in the test period.

On the other hand, Lasso Regression, run with its default settings, had the lowest final score (R² of 0.10), indicating that it is not suitable for this forecasting task without further tuning. The untuned Decision Tree scored even lower (R² of -0.37) but improved to 0.22 after grid search.

Given these outcomes, future studies or applications aiming to predict precipitation in this region should focus on refining and optimizing the Random Forest, Gradient Boosting, and Neural Network models, which performed best in this context and hold the most promise for developing accurate and reliable precipitation forecasts for the Pra Basin.
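
For reference, the R² scores printed in the cells above can be gathered into one table. The sketch below hard-codes those printed values, rounded to four decimals; NaN marks models that were not tuned in this notebook, and the column names are illustrative:

```python
import numpy as np
import pandas as pd

# R² on the 2010 to 2023 test period, copied from the outputs above
scores = pd.DataFrame(
    [
        ("KNN", 0.3124, 0.3363),
        ("Random Forest", 0.3578, 0.3763),
        ("Decision Tree", -0.3704, 0.2186),
        ("XGBoost", 0.2822, 0.3717),
        ("Gradient Boosting", 0.3818, 0.3757),
        ("Neural Network", 0.3823, 0.3782),
        ("Ridge", 0.3281, np.nan),
        ("Lasso", 0.1012, np.nan),
        ("SVM", 0.3567, 0.3592),
    ],
    columns=["model", "r2_initial", "r2_grid_search"],
)

# Best score per model, ignoring missing grid-search entries
scores["r2_best"] = scores[["r2_initial", "r2_grid_search"]].max(axis=1)
ranking = scores.sort_values("r2_best", ascending=False).reset_index(drop=True)
print(ranking)
```

The table also shows that grid search did not always help on the test set: Gradient Boosting and the Neural Network scored slightly lower after tuning than with their initial settings.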