# Predicting GBP/USD Prices Using Random Forest
# a. Objective: Build a Random Forest model to predict prices based on historical data and technical indicators.
# b. Focus: Machine learning, feature engineering, and model evaluation.

In [None]:
'''
Methodology 

1.	Data Loading and Preprocessing:
    	•	The script loads an hourly candlestick dataset for GBP/USD.
    	•	The Local time column is parsed into a datetime format and set as the index.
    	•	Technical indicators are calculated for feature engineering:
    		•	Moving Averages (MA): MA_5 (5-period) and MA_50 (50-period) simple moving averages.
    		•	Exponential Moving Average (EMA): EMA_10 (10-period).
    		•	Relative Strength Index (RSI): Measures the magnitude of recent gains versus losses over a 14-period window.
    		•	Moving Average Convergence Divergence (MACD): The difference between 12-period and 26-period EMAs.
    		•	Bollinger Bands: Bollinger_Upper and Bollinger_Lower, set two standard deviations above and below the 20-period moving average.

2.	Target Creation:
    	•	The target variable, Target, is the next hour’s closing price, obtained by shifting the Close column up by one row (shift(-1)) so that each row is paired with the close that follows it.

3.	Data Splitting:
    	•	The data is split into training and testing sets (70% training, 30% testing) without shuffling to preserve the time series order.

4.	Model Training:
    	•	A RandomForestRegressor model is initialized with 100 trees (n_estimators=100) and a maximum depth of 10 (max_depth=10).
    	•	The model is trained on the training data.

5.	Prediction and Evaluation:
    	•	Predictions are made on the test set.
    	•	The model’s performance is evaluated using:
    		•	R-squared: Indicates how well the model explains the variability of the target.
    		•	Mean Absolute Error (MAE): Average magnitude of errors.
    		•	Root Mean Squared Error (RMSE): Square root of the average squared error; it penalizes large errors more heavily than MAE.
    	•	A plot shows actual vs. predicted closing prices over time for visual inspection.

6.	Cross-Validation:
    	•	TimeSeriesSplit cross-validation with six sequential folds is used to check robustness: each fold trains on earlier data and tests on the block that follows it.
    	•	R-squared scores for each fold are calculated, and their mean is displayed.
'''     
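
The target creation and unshuffled split described in steps 2 and 3 can be illustrated on a toy series. This is a minimal sketch: the ten made-up closes and the names `toy`, `X_toy`, and `y_toy` are assumptions for illustration and do not come from the GBP/USD dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy hourly closes (made-up values, not taken from the GBP/USD data)
idx = pd.date_range("2024-01-01", periods=10, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Close": [1.0 + 0.01 * i for i in range(10)]}, index=idx)

# shift(-1) moves the next row's close onto the current row
toy["Target"] = toy["Close"].shift(-1)
toy = toy.dropna()  # the final row has no following hour, so it is dropped

X_toy, y_toy = toy[["Close"]], toy["Target"]

# shuffle=False keeps the rows in time order: earliest rows train, latest rows test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=False)

print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}")
print(f"train ends before test starts: {X_tr.index.max() < X_te.index.min()}")
```

With shuffling disabled, every test timestamp is later than every training timestamp, which is what keeps future prices out of the training set.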

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset and perform existing preprocessing steps
file_path = 'GBPUSD_Candlestick_1_Hour_BID_01.01.2020-31.08.2024.csv'
data = pd.read_csv(file_path)
data['Local time'] = pd.to_datetime(data['Local time'], format='%d.%m.%Y %H:%M:%S.%f GMT%z', utc=True)
data.set_index('Local time', inplace=True)
# Trend features: simple and exponential moving averages of the close
data['MA_5'] = data['Close'].rolling(window=5).mean()
data['MA_50'] = data['Close'].rolling(window=50).mean()
data['EMA_10'] = data['Close'].ewm(span=10, adjust=False).mean()

# RSI over 14 periods (simple rolling averages of gains and losses)
window_length = 14
delta = data['Close'].diff(1)
gain = (delta.where(delta > 0, 0)).rolling(window=window_length).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=window_length).mean()
rs = gain / loss
data['RSI'] = 100 - (100 / (1 + rs))

# MACD: 12-period EMA minus 26-period EMA
ema_12 = data['Close'].ewm(span=12, adjust=False).mean()
ema_26 = data['Close'].ewm(span=26, adjust=False).mean()
data['MACD'] = ema_12 - ema_26

# Bollinger Bands: 20-period mean plus/minus two rolling standard deviations
data['MA_20'] = data['Close'].rolling(window=20).mean()
std_20 = data['Close'].rolling(window=20).std()
data['Bollinger_Upper'] = data['MA_20'] + 2 * std_20
data['Bollinger_Lower'] = data['MA_20'] - 2 * std_20

# Target: next hour's close; dropna removes indicator warm-up rows and the last row
data['Target'] = data['Close'].shift(-1)
data = data.dropna()

# Chronological 70/30 split (no shuffling)
X = data.drop(['Target'], axis=1)
Y = data['Target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=False)

# Fit the Random Forest
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, Y_train)


# Predict and evaluate the model
Y_pred = rf_model.predict(X_test)
print(f"R-squared: {r2_score(Y_test, Y_pred)}")
print(f"MAE: {mean_absolute_error(Y_test, Y_pred)}")
# np.sqrt of the MSE works across scikit-learn versions (squared=False is deprecated in newer releases)
print(f"RMSE: {np.sqrt(mean_squared_error(Y_test, Y_pred))}")
plt.figure(figsize=(12, 6))
plt.plot(Y_test.index, Y_test, label='Actual Prices')
plt.plot(Y_test.index, Y_pred, label='Predicted Prices', alpha=0.7)
plt.title('Actual vs Predicted Closing Prices')
plt.legend()
plt.show()


# Cross-validation for robustness
tscv = TimeSeriesSplit(n_splits=6)
cross_val_scores = cross_val_score(rf_model, X, Y, cv=tscv, scoring='r2')
print(f"Cross-Validation R-squared scores: {cross_val_scores}")
print(f"Mean R-squared score from cross-validation: {cross_val_scores.mean()}")


FileNotFoundError: [Errno 2] No such file or directory: 'GBPUSD_Candlestick_1_Hour_BID_01.01.2020-31.08.2024.csv'
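
The Bollinger Band step from the cell above can be checked in isolation. This is a minimal sketch on synthetic prices, using the conventional definition of a 20-period mean plus or minus two rolling standard deviations; the random-walk series and the `bollinger_bands` helper are assumptions for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd

def bollinger_bands(close, window=20, num_std=2):
    """Hypothetical helper: return (middle, upper, lower) bands for a price series."""
    middle = close.rolling(window=window).mean()
    std = close.rolling(window=window).std()
    return middle, middle + num_std * std, middle - num_std * std

# Synthetic random walk standing in for hourly closes
rng = np.random.default_rng(0)
close = pd.Series(1.25 + rng.normal(0, 0.001, 200).cumsum())

middle, upper, lower = bollinger_bands(close)
bands = pd.DataFrame({"middle": middle, "upper": upper, "lower": lower}).dropna()

# Sanity check: the moving average should lie between the two bands
inside = (bands["lower"] <= bands["middle"]) & (bands["middle"] <= bands["upper"])
print(f"rows after warm-up: {len(bands)}, average between bands on all rows: {inside.all()}")
```

Keeping the moving average in its own variable makes it harder to overwrite it with one of the bands by accident.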

In [None]:
'''
Interpretation:

1.	R-squared (0.9973):
	•	This value, close to 1, indicates that the model explains about 99.73% of the variance in the target (GBP/USD closing prices).
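	•	A high R-squared suggests that the model tracks the price level closely. Because consecutive hourly closes are strongly autocorrelated and the current Close is among the features, part of this score reflects persistence in prices rather than predictive skill.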
2.	Mean Absolute Error (MAE, 0.00076):
	•	This metric shows the average absolute difference between the predicted and actual prices.
	•	An MAE of 0.00076 means that, on average, the model’s predictions are off by 0.00076 in the GBP/USD rate, or about 7.6 pips, which is small relative to the price level.
3.	Root Mean Squared Error (RMSE, 0.0011):
	•	RMSE gives more weight to larger errors and is often used to penalize bigger deviations.
	•	An RMSE of 0.0011 (about 11 pips) indicates that typical prediction errors are small and that large errors are relatively rare.
'''
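
Because hourly closes are strongly autocorrelated, these metrics are easier to judge next to a naive baseline that predicts the next close as the current close. This is a minimal sketch on a synthetic random walk; the data and variable names are assumptions for illustration, and none of the numbers it prints come from the GBP/USD dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic random walk standing in for hourly closes
rng = np.random.default_rng(42)
close = pd.Series(1.25 + rng.normal(0, 0.001, 5000).cumsum())

# Same target construction as the script: next period's close
frame = pd.DataFrame({"Close": close, "Target": close.shift(-1)}).dropna()
test = frame.iloc[int(len(frame) * 0.7):]  # last 30%, mirroring the unshuffled split

# Persistence forecast: the next close is assumed equal to the current close
naive_pred = test["Close"]

naive_r2 = r2_score(test["Target"], naive_pred)
naive_mae = mean_absolute_error(test["Target"], naive_pred)
naive_rmse = np.sqrt(mean_squared_error(test["Target"], naive_pred))
print(f"Naive R-squared: {naive_r2:.4f}, MAE: {naive_mae:.5f}, RMSE: {naive_rmse:.5f}")
```

Running the same baseline on the real test set would show how much of the Random Forest's score goes beyond simple persistence.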

In [None]:
'''
1.	Cross-Validation R-squared Scores:
	•	The R-squared scores for each fold vary: [0.4803, 0.9968, 0.9990, 0.8526, 0.9973, 0.9963].
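	•	The model performs well on most splits, but the first fold, which trains on the smallest window of data, scores only 0.4803, and the fourth fold drops to 0.8526.
	•	This inconsistency could indicate that some segments of the data are more challenging to predict accurately, possibly due to changes in market patterns or volatility. A Random Forest also cannot predict values outside the range of targets seen in training, so a fold whose test prices move beyond that range will tend to score poorly.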
2.	Mean R-squared Score (0.8870):
	•	The average R-squared across all folds is approximately 0.8870, indicating that, on average, the model explains about 88.7% of the variance across different time periods.
'''	
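
How TimeSeriesSplit arranges the six folds can be seen on placeholder data. This is a minimal sketch: the 70-row array `X_demo` is an assumption for illustration and stands in for the feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(70).reshape(-1, 1)  # 70 placeholder rows in time order
tscv_demo = TimeSeriesSplit(n_splits=6)

# Each fold trains on an expanding window and tests on the block right after it
folds = list(tscv_demo.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")
```

The first fold has the least training data, so its score is the one most likely to differ from the others.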

In [None]:
#Burlyn
