# Gold Price Forecasting using Machine Learning

## Overview
This project forecasts gold prices from historical data using machine learning. Daily gold futures prices (ticker `GC=F`) are retrieved from Yahoo Finance, and technical indicators are computed as additional features. A k-Nearest Neighbors (k-NN) regression model is then trained to forecast several days ahead (multi-step forecasting).

The project includes the following steps:
1. **Data Collection and Preprocessing**
2. **Exploratory Data Analysis (EDA)**
3. **Feature Engineering**
4. **Time Series Splitting**
5. **Model Training and Optimization**
6. **Forecast Evaluation**
7. **Visualization of Predictions**

## Code Breakdown

### 1. Importing Libraries
The necessary Python libraries for data analysis, visualization, and machine learning are imported.

In [3]:
!pip install ta

[0m[31mERROR: Could not find a version that satisfies the requirement ta (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for ta[0m[31m
[0m

In [4]:
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
import ta


### 2. Data Collection
Historical gold futures prices (Yahoo Finance ticker `GC=F`) are retrieved with `yfinance`. The dataset includes Open, High, Low, Close, and Volume data for the ten years from 2014-02-28 to 2024-02-29.

In [None]:
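# "zloto" is Polish for "gold" and "dane" for "data"; GC=F is the gold futures ticker.
zloto = yf.Ticker("GC=F")
dane = zloto.history(start="2014-02-28", end="2024-02-29").reset_index()
# Drop the timezone and the time of day so each row is keyed by a plain calendar date.
dane['Date'] = pd.to_datetime(dane['Date']).dt.tz_localize(None).dt.normalize()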

### 3. Data Selection and Initial Analysis
A subset of relevant columns is selected, and basic data statistics are displayed to understand the dataset better.

In [None]:
data_selected = dane[['Date', 'Close','Open','High','Low','Volume']].copy()
print("Dataset Shape:", data_selected.shape)
print("\nColumn Info:")
print(data_selected.info())
print("\nDescriptive Statistics:")
print(data_selected.describe())
print("\nMissing Values:")
print(data_selected.isnull().sum())

### 4. Data Visualization
Plots are created to analyze the price trends and distribution of gold prices.

In [None]:
plt.figure(figsize=(15, 7))
plt.plot(data_selected['Date'], data_selected['Close'])
plt.title('Historical Gold Prices')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.grid(True)
plt.show()

### 5. Splitting Data into Training and Testing Sets
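The dataset is split chronologically, without shuffling: the first 80% of rows form the training set and the last 20% form the test set, so the model is always evaluated on dates later than those it was trained on.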

In [None]:
data_selected = data_selected.set_index('Date')
total_size = len(data_selected)
train_size = int(total_size * 0.8)
data_train = data_selected[:train_size]
data_test = data_selected[train_size:]
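
The split above can be checked on a toy frame. The following is a minimal sketch, assuming ten synthetic daily closes (the names `toy`, `toy_train`, and `toy_test` are hypothetical and not part of the project); it only illustrates that a positional 80/20 cut keeps the two sets in time order.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes stand in for the real data (hypothetical values).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({"Close": np.arange(100.0, 110.0)}, index=dates)

split_at = int(len(toy) * 0.8)      # the first 80% of rows go to training
toy_train = toy.iloc[:split_at]
toy_test = toy.iloc[split_at:]

# Every training date precedes every test date, so no future rows leak into training.
print(len(toy_train), len(toy_test))                    # 8 2
print(toy_train.index.max() < toy_test.index.min())     # True
```

A shuffled split such as the default `train_test_split` would mix future rows into the training set, which is why a positional cut is used for time series.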

### 6. Feature Engineering
Technical indicators such as Simple Moving Averages (SMA), Relative Strength Index (RSI), and Bollinger Bands are added to enhance model performance.

In [None]:
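def create_features(df, lag=5):
    df_features = df.copy()
    # Target: the closing price of the next row (the next trading day).
    df_features['Target'] = df_features['Close'].shift(-1)
    # Technical indicators computed from the closing price.
    df_features['SMA_5'] = ta.trend.sma_indicator(df_features['Close'], window=5)
    df_features['RSI'] = ta.momentum.rsi(df_features['Close'], window=14)
    # Build the Bollinger Bands object once and reuse it for both bands.
    bollinger = ta.volatility.BollingerBands(df_features['Close'])
    df_features['BB_high'] = bollinger.bollinger_hband()
    df_features['BB_low'] = bollinger.bollinger_lband()

    # Lagged closes: the closing prices of the previous `lag` rows.
    for i in range(1, lag+1):
        df_features[f'Lag_Close_{i}'] = df_features['Close'].shift(i)

    # Drop rows left incomplete by the indicator warm-up, the lags, and the target shift.
    df_clean = df_features.dropna()
    X = df_clean.drop(columns=['Target'])
    y = df_clean['Target']
    return X, y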
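
For readers who cannot install `ta`, the three indicators can be approximated with pandas alone. The following is a minimal sketch under stated assumptions: the helper names `sma`, `rsi`, and `bollinger` are hypothetical, the RSI uses Wilder-style exponential smoothing, and the Bollinger Bands use a 20-row window with 2 standard deviations. It is not guaranteed to reproduce the values returned by `ta` exactly.

```python
import numpy as np
import pandas as pd

def sma(close: pd.Series, window: int = 5) -> pd.Series:
    """Simple moving average over the last `window` rows."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder-style smoothing (assumed variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # upward moves, zero otherwise
    loss = -delta.clip(upper=0)      # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

def bollinger(close: pd.Series, window: int = 20, n_std: float = 2.0):
    """Upper and lower bands: rolling mean plus/minus n_std rolling std devs."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)
    return mid + n_std * std, mid - n_std * std

# Synthetic random-walk prices (hypothetical data, not real gold prices).
rng = np.random.default_rng(0)
close = pd.Series(1500 + rng.normal(0, 5, size=60).cumsum())

features = pd.DataFrame({"Close": close, "SMA_5": sma(close), "RSI": rsi(close)})
features["BB_high"], features["BB_low"] = bollinger(close)
print(features.dropna().head())
```

The leading rows are `NaN` until each window has enough history, which is why `create_features` ends with `dropna()`.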

### 7. Multi-Step Forecasting
A function prepares the data for multi-step forecasting by building one target column per forecast horizon: `Target_k` holds the closing price `k` rows (trading days) ahead, for `k` from 1 to `steps`.

In [None]:
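def create_multistep_features(df, lag=5, steps=5):
    # Note: `lag` is accepted but not used in this function as written.
    df_features = df.copy()
    target_cols = [f'Target_{step}' for step in range(1, steps + 1)]
    # Target_k is the closing price k rows (trading days) ahead.
    for step in range(1, steps + 1):
        df_features[f'Target_{step}'] = df_features['Close'].shift(-step)
    # Rows with missing values are dropped, including the last `steps` rows,
    # which have no complete set of targets.
    df_clean = df_features.dropna()
    X = df_clean.drop(columns=target_cols)
    y = df_clean[target_cols]
    return X, y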
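
To make the target layout concrete, here is a small worked example on six synthetic closes. It is a sketch only: `make_targets` is a hypothetical helper that repeats the shifting logic of the function above so that the block runs on its own.

```python
import pandas as pd

def make_targets(df: pd.DataFrame, steps: int):
    """Hypothetical helper: same shifting logic as create_multistep_features."""
    out = df.copy()
    target_cols = [f"Target_{k}" for k in range(1, steps + 1)]
    for k in range(1, steps + 1):
        out[f"Target_{k}"] = out["Close"].shift(-k)   # close k rows ahead
    out = out.dropna()                                # last `steps` rows lack targets
    return out.drop(columns=target_cols), out[target_cols]

toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
X_toy, y_toy = make_targets(toy, steps=2)

print(X_toy.shape, y_toy.shape)   # (4, 1) (4, 2)
print(y_toy.iloc[0].tolist())     # [11.0, 12.0]
```

With `steps=2`, the first row (close 10.0) is paired with the closes one and two rows later, and the final two rows are dropped because their targets would lie beyond the end of the data.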

### 8. Model Training and Hyperparameter Optimization
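A k-NN regression model is tuned with `GridSearchCV`, using `TimeSeriesSplit` cross-validation so that each validation fold lies after its training fold in time. The grid covers the number of neighbors, the weighting scheme, and the distance metric.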

In [None]:
param_grid = {
    'n_neighbors': range(3, 31, 2),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

tscv = TimeSeriesSplit(n_splits=5)
knn = KNeighborsRegressor()
grid_search = GridSearchCV(knn, param_grid, cv=tscv, scoring='neg_mean_squared_error', n_jobs=-1, verbose=0)
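
The following sketch shows how `TimeSeriesSplit` arranges its folds and how many parameter combinations the grid above contains. The eight-row array `X_toy` and the three-fold setting are assumptions chosen to keep the output short; the project itself uses five folds.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

# Eight rows in time order (hypothetical toy data).
X_toy = np.arange(8).reshape(-1, 1)

folds = [(train.tolist(), val.tolist())
         for train, val in TimeSeriesSplit(n_splits=3).split(X_toy)]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
# train: [0, 1] validate: [2, 3]
# train: [0, 1, 2, 3] validate: [4, 5]
# train: [0, 1, 2, 3, 4, 5] validate: [6, 7]

# Same grid as in the project, written with a list so it can be counted directly.
grid = {
    "n_neighbors": list(range(3, 31, 2)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
n_candidates = len(ParameterGrid(grid))
print(n_candidates)   # 56
```

Each training window grows while the validation window always comes later in time. With 56 candidates and 5 folds, each call to `grid_search.fit` evaluates 280 model fits before refitting the best candidate.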

### 9. Model Evaluation and Visualization
A separate grid search is run for each forecast step, and the resulting predictions on the test set are stored so they can be compared with the actual values.

In [None]:
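# Assumes `steps`, `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test`
# were prepared in an earlier step that this README does not show.
y_preds = []
for step in range(steps):
    # Column `step` of y holds the close (step + 1) rows ahead.
    y_step_train = y_train.iloc[:, step]
    y_step_test = y_test.iloc[:, step]
    # Re-run the grid search for this horizon, then predict with the best model.
    grid_search.fit(X_train_scaled, y_step_train)
    y_step_pred = grid_search.predict(X_test_scaled)
    y_preds.append(y_step_pred)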
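
The loop above uses several names whose preparation is not shown. The following is a minimal end-to-end sketch of how they might be built, under stated assumptions: synthetic random-walk prices replace the downloaded data, `make_multistep` is a hypothetical stand-in for `create_multistep_features`, the only feature is `Close`, and the grid is deliberately small so the block runs quickly. It is not the project's actual preparation code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def make_multistep(df: pd.DataFrame, steps: int):
    """Hypothetical stand-in for create_multistep_features."""
    out = df.copy()
    target_cols = [f"Target_{k}" for k in range(1, steps + 1)]
    for k in range(1, steps + 1):
        out[f"Target_{k}"] = out["Close"].shift(-k)
    out = out.dropna()
    return out.drop(columns=target_cols), out[target_cols]

# Synthetic random-walk prices (hypothetical data, not real gold prices).
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 1500 + rng.normal(0, 5, size=300).cumsum()},
    index=pd.date_range("2023-01-02", periods=300, freq="B"),
)

steps = 3
split_at = int(len(prices) * 0.8)
X_train, y_train = make_multistep(prices.iloc[:split_at], steps)
X_test, y_test = make_multistep(prices.iloc[split_at:], steps)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7]},          # reduced grid for this sketch
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)

y_preds = []
for step in range(steps):
    search.fit(X_train_scaled, y_train.iloc[:, step])
    y_preds.append(search.predict(X_test_scaled))

print(len(y_preds), y_preds[0].shape, y_test.shape)
```

Scaling matters for k-NN because neighbors are found by distance, so a feature with a large numeric range would otherwise dominate the others. Fitting the scaler on the training rows alone keeps test-set statistics out of the model.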

### 10. Performance Metrics and Results Table
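Mean Squared Error (MSE), Mean Absolute Error (MAE), and the coefficient of determination (R²) are computed on the test set for each forecasting step. The last column of the table reports the model's prediction for the final row of the test set.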

In [None]:
metrics_dict = {
    'Step (Days)': [],
    'MSE': [],
    'MAE': [],
    'R²': [],
    'Predicted Price ($)': []
}

for step in range(steps):
    metrics_dict['Step (Days)'].append(step + 1)
    metrics_dict['MSE'].append(mean_squared_error(y_test.iloc[:, step], y_preds[step]))
    metrics_dict['MAE'].append(mean_absolute_error(y_test.iloc[:, step], y_preds[step]))
    metrics_dict['R²'].append(r2_score(y_test.iloc[:, step], y_preds[step]))
    metrics_dict['Predicted Price ($)'].append(y_preds[step][-1])

metrics_df = pd.DataFrame(metrics_dict).round(3)
print(metrics_df.to_string(index=False))
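
As a sanity check on how the three metrics behave, here is a small worked example on four hypothetical prices. The arrays are invented for illustration and are not project results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices; every prediction is off by exactly 1.
y_true = np.array([100.0, 102.0, 104.0, 106.0])
y_pred = np.array([101.0, 101.0, 105.0, 107.0])

mse = mean_squared_error(y_true, y_pred)    # mean of squared errors: (1+1+1+1)/4
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors: (1+1+1+1)/4
r2 = r2_score(y_true, y_pred)               # 1 - SS_res/SS_tot = 1 - 4/20

print(mse, mae, round(r2, 3))   # 1.0 1.0 0.8
```

MSE and MAE are in squared price units and price units respectively, so they are only comparable across steps for the same series. R² compares the model against always predicting the mean of the test targets, and it can be negative when the model does worse than that baseline.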

## Conclusion
This project implements a multi-step forecasting model for gold prices using k-NN regression. Hyperparameters are tuned with a time-series-aware grid search, and results are visualized to compare predictions against actual market data. Deep learning models such as LSTMs or transformers are a possible refinement and may improve longer-horizon forecasts.

---
**Next Steps:**
- Experimenting with other regression models (e.g., XGBoost, LSTMs)
- Incorporating external features (e.g., macroeconomic indicators)
- Testing different feature engineering techniques

This project provides a baseline for financial time series forecasting that future work can build on.

