# 01 Data Collection – Forex Price Prediction Project

## Objectives

* Fetch historical OHLCV (Open, High, Low, Close, Volume) data for 10 selected forex pairs using the OANDA API.
* Save the raw data in CSV format in the `data/raw/` directory for further processing.

## Inputs

* OANDA API key stored in a `.env` file (kept out of version control).
* Selected forex pairs:
  1. GBP/USD  
  2. EUR/GBP  
  3. EUR/USD  
  4. AUD/USD  
  5. NZD/USD  
  6. USD/JPY  
  7. USD/SGD  
  8. USD/CHF  
  9. USD/CAD  
  10. GBP/CAD
* Desired number of historical candles per pair and timeframe (this notebook uses the last 8,760 H1 candles).

## Outputs

* 10 CSV files stored in `data/raw/` named as:
  * GBP_USD_forex_data.csv
  * EUR_GBP_forex_data.csv
  * EUR_USD_forex_data.csv
  * AUD_USD_forex_data.csv
  * NZD_USD_forex_data.csv
  * USD_JPY_forex_data.csv
  * USD_SGD_forex_data.csv
  * USD_CHF_forex_data.csv
  * USD_CAD_forex_data.csv
  * GBP_CAD_forex_data.csv 

* Each CSV contains:
  - `timestamp`  
  - `open`  
  - `high`  
  - `low`  
  - `close`  
  - `volume`  

## Additional Comments

* This notebook covers the **Data Collection** step of CRISP-DM.
* The raw files will be used in **02_data_cleaning_feature_engineering.ipynb** for feature creation.
* Ensure that the `.env` file with `OANDA_API_KEY` is correctly configured before running this notebook.


---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/forex-price-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/forex-price-predictor'

# Import Libraries & Load API Key

Purpose: Import dependencies and load the OANDA API key from the `.env` file.

In [4]:
import os
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve OANDA API key
OANDA_API_KEY = os.getenv("OANDA_API_KEY")

# Verify that the key is loaded
print("OANDA_API_KEY loaded:", bool(OANDA_API_KEY))


OANDA_API_KEY loaded: True
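
A boolean printout makes a missing key easy to overlook until the first request fails. One option is to stop early with an explicit error. The helper below, `require_env`, is a hypothetical name and not part of this project; it is only a minimal sketch of the idea.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file before running this notebook."
        )
    return value
```

Called after `load_dotenv()`, this could replace the plain lookup: `OANDA_API_KEY = require_env("OANDA_API_KEY")`.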


---

## Define Forex Pairs & Data Collection Plan

In this step, we will define the **10 forex pairs** for which we will collect **1-hour (H1) OHLCV data**.

- Each pair will have approximately **8,760 candles (~1 year)**.
- The raw CSV files will be saved in the folder: `data/raw/`.
- The raw data will include:
  - `timestamp`
  - `open`
  - `high`
  - `low`
  - `close`
  - `volume`

**Selected Pairs:**
1. GBP/USD  
2. EUR/GBP  
3. EUR/USD  
4. AUD/USD  
5. NZD/USD  
6. USD/JPY  
7. USD/SGD  
8. USD/CHF  
9. USD/CAD  
10. GBP/CAD

In [5]:
PAIRS = [
    "GBP_USD", "EUR_GBP", "EUR_USD", "AUD_USD", "NZD_USD",
    "USD_JPY", "USD_SGD", "USD_CHF", "USD_CAD", "GBP_CAD"
]

# Number of candles ~1 year of 1-hour data
NUM_CANDLES = 8760
TIMEFRAME = "H1"

# Ensure raw data directory exists
import os
os.makedirs("data/raw", exist_ok=True)

print(f"Collecting {NUM_CANDLES} candles per pair ({TIMEFRAME})")
print("Pairs:", PAIRS)


Collecting 8760 candles per pair (H1)
Pairs: ['GBP_USD', 'EUR_GBP', 'EUR_USD', 'AUD_USD', 'NZD_USD', 'USD_JPY', 'USD_SGD', 'USD_CHF', 'USD_CAD', 'GBP_CAD']
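
The figure 8,760 is 24 × 365, which assumes one candle for every hour of the year. Spot forex generally trades about 24 hours a day, five days a week, so 8,760 hourly candles probably span more than one calendar year. The rough estimate below assumes 120 trading hours per week and ignores holidays; the constant names are illustrative only.

```python
HOURS_PER_DAY = 24
TRADING_DAYS_PER_WEEK = 5  # assumption: no candles are produced over the weekend
NUM_CANDLES = 8760

candles_per_week = HOURS_PER_DAY * TRADING_DAYS_PER_WEEK  # 120 hourly candles
weeks_covered = NUM_CANDLES / candles_per_week            # trading weeks spanned
candles_per_calendar_year = candles_per_week * 52         # hourly candles in 52 weeks

print(f"{NUM_CANDLES} H1 candles cover about {weeks_covered:.0f} trading weeks")
print(f"One calendar year holds roughly {candles_per_calendar_year} H1 candles")
```

Under these assumptions the request reaches back roughly 73 weeks. If exactly one year is needed, requesting by date range or using a smaller candle count may be a better fit.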


---

## Test Data Fetch for One Pair

Before fetching data for all 10 pairs,  
we will **test the `fetch_live_data()` function** for a single pair (EUR/USD):

- Fetch **1 year (~8,760) of 1-hour candles**  
- Preview the first few rows to confirm:
  - Columns: `timestamp, open, high, low, close, volume`  
  - Correct number of rows fetched


In [6]:
from src.data_collection import fetch_live_data

# Fetch sample data
sample_df = fetch_live_data("EUR_USD", candles=NUM_CANDLES, timeframe=TIMEFRAME)

# Display first 5 rows
sample_df.head()


| | timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2024-03-06 03:00:00+00:00 | 1.08527 | 1.08533 | 1.08504 | 1.08508 | 1905 |
| 1 | 2024-03-06 04:00:00+00:00 | 1.08508 | 1.08524 | 1.08495 | 1.08519 | 1408 |
| 2 | 2024-03-06 05:00:00+00:00 | 1.08518 | 1.0857 | 1.08516 | 1.08552 | 1866 |
| 3 | 2024-03-06 06:00:00+00:00 | 1.08549 | 1.08607 | 1.08549 | 1.08595 | 2183 |
| 4 | 2024-03-06 07:00:00+00:00 | 1.08594 | 1.08666 | 1.08586 | 1.08621 | 4008 |
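
The implementation of `fetch_live_data()` lives in `src/data_collection.py` and is not shown in this notebook. As a rough illustration of how a candle response could be flattened into the six columns above, the sketch below assumes the payload shape of OANDA's v20 candles endpoint: a `candles` list whose items carry `time`, `volume`, `complete`, and a `mid` object with `o`, `h`, `l`, `c` strings. The function name `candles_to_rows` and the mock payload are hypothetical, and the project's actual code may differ.

```python
from typing import Any


def candles_to_rows(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an assumed OANDA-style candles payload into OHLCV rows."""
    rows = []
    for candle in payload.get("candles", []):
        # Skip a still-forming candle if the payload flags one as incomplete.
        if not candle.get("complete", True):
            continue
        mid = candle["mid"]  # assumption: mid-point prices were requested
        rows.append(
            {
                "timestamp": candle["time"],
                "open": float(mid["o"]),
                "high": float(mid["h"]),
                "low": float(mid["l"]),
                "close": float(mid["c"]),
                "volume": int(candle["volume"]),
            }
        )
    return rows


# Mock payload for illustration only; it is not a recorded API response.
example = {
    "instrument": "EUR_USD",
    "granularity": "H1",
    "candles": [
        {
            "complete": True,
            "volume": 1905,
            "time": "2024-03-06T03:00:00.000000000Z",
            "mid": {"o": "1.08527", "h": "1.08533", "l": "1.08504", "c": "1.08508"},
        },
        {
            "complete": False,
            "volume": 12,
            "time": "2024-03-06T04:00:00.000000000Z",
            "mid": {"o": "1.08508", "h": "1.08510", "l": "1.08506", "c": "1.08509"},
        },
    ],
}

rows = candles_to_rows(example)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give a frame with the same column names as the preview above.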


---

## Fetch and Save Data for All 10 Forex Pairs

Now that we have successfully tested fetching data for a single pair (EUR/USD),  
we will proceed to collect **1 year (~8,760 1-hour candles)** for all 10 selected forex pairs.

- Each dataset will be saved in the folder: `data/raw/`
- Files will follow the naming convention `<PAIR>_forex_data.csv`, for example `GBP_USD_forex_data.csv`
- Columns included:
  - `timestamp`
  - `open`
  - `high`
  - `low`
  - `close`
  - `volume`

This will complete the **Data Collection step** of our CRISP-DM workflow.


In [None]:
import time

failed_pairs = []

for pair in PAIRS:
    print(f"Fetching data for {pair} ...")
    
    try:
        df = fetch_live_data(pair, candles=NUM_CANDLES, timeframe=TIMEFRAME)
        
        if not df.empty:
            save_path = f"data/raw/{pair}_forex_data.csv"
            df.to_csv(save_path, index=False)
            print(f"Saved {len(df)} rows to {save_path}\n")
        else:
            print(f"No data fetched for {pair}\n")
            failed_pairs.append(pair)
            
    except Exception as e:
        print(f"Error fetching {pair}: {e}\n")
        failed_pairs.append(pair)
    
    # Pause to avoid hitting API rate limits
    time.sleep(3)

print("\n--- Bulk Fetch Completed ---")
if failed_pairs:
    print("⚠ The following pairs failed and need retrying:", failed_pairs)
else:
    print("All pairs fetched successfully!")


Fetching data for GBP_USD ...
❌ Error fetching data for GBP_USD: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
❌ No data fetched for GBP_USD

Fetching data for EUR_GBP ...
✅ Saved 8759 rows to data/raw/EUR_GBP_forex_data.csv

Fetching data for EUR_USD ...
✅ Saved 8759 rows to data/raw/EUR_USD_forex_data.csv

Fetching data for AUD_USD ...
✅ Saved 8759 rows to data/raw/AUD_USD_forex_data.csv

Fetching data for NZD_USD ...
✅ Saved 8759 rows to data/raw/NZD_USD_forex_data.csv

Fetching data for USD_JPY ...
✅ Saved 8759 rows to data/raw/USD_JPY_forex_data.csv

Fetching data for USD_SGD ...
✅ Saved 8759 rows to data/raw/USD_SGD_forex_data.csv

Fetching data for USD_CHF ...
✅ Saved 8759 rows to data/raw/USD_CHF_forex_data.csv

Fetching data for USD_CAD ...
✅ Saved 8759 rows to data/raw/USD_CAD_forex_data.csv

Fetching data for GBP_CAD ...
✅ Saved 8759 rows to data/raw/GBP_CAD_forex_data.csv


--- Bulk Fetch Completed ---
⚠ The following pairs fai

**Note:**  
GBP/USD data was initially difficult to fetch due to an OANDA API timeout.  
It was successfully collected after isolating the pair and fetching in smaller batches.  
All 10 forex pairs now have 1-year (H1) OHLCV data stored in `data/raw/`.
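
The batching mentioned in this note can be sketched as follows. OANDA's v20 candles endpoint documents a maximum `count` of 5000 per request, so 8,760 candles need at least two requests. The helper name `split_into_batches` is an assumption for illustration; how `fetch_live_data()` actually splits its requests is not shown in this notebook.

```python
def split_into_batches(total: int, batch_size: int = 5000) -> list[int]:
    """Split a total candle count into per-request counts no larger than batch_size."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])


print(split_into_batches(8760))        # [5000, 3760]
print(split_into_batches(8760, 2000))  # [2000, 2000, 2000, 2000, 760]
```

Smaller batches mean more requests but shorter individual responses, which can help when a connection is dropped partway through a large download.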


In [None]:
# Retry only GBP/USD separately
pair = "GBP_USD"
df = fetch_live_data(pair, candles=NUM_CANDLES, timeframe=TIMEFRAME)

if not df.empty:
    save_path = f"data/raw/{pair}_forex_data.csv"
    df.to_csv(save_path, index=False)
    print(f"Saved {len(df)} rows to {save_path}")
else:
    print(f"Still no data for {pair}")


✅ Saved 8759 rows to data/raw/GBP_USD_forex_data.csv
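
Transient connection errors like the one seen for GBP/USD are often handled with an automatic retry instead of a manual re-run. The sketch below wraps any fetch callable in a retry loop with exponential backoff. `fetch_with_retry` and its parameters are hypothetical and not part of `src/data_collection.py`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: network errors vary by client library
            if attempt == attempts:
                raise  # out of attempts, surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

A call might look like `fetch_with_retry(lambda: fetch_live_data("GBP_USD", candles=NUM_CANDLES, timeframe=TIMEFRAME))`. The earlier output suggests `fetch_live_data()` may report errors and return an empty frame rather than raise; if so, the callable would need to raise on an empty result for the retry to trigger.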


In [14]:
import glob

# List all raw CSV files
raw_files = glob.glob("data/raw/*_forex_data.csv")

print("Raw files saved:")
for file in raw_files:
    print(" -", file)

print(f"\nTotal files saved: {len(raw_files)}")

# Optional: Preview first few rows of one file
if raw_files:
    sample_preview = pd.read_csv(raw_files[0]).head()
    print("\nPreview of first file:")
    display(sample_preview)


Raw files saved:
 - data/raw/GBP_USD_forex_data.csv
 - data/raw/USD_JPY_forex_data.csv
 - data/raw/USD_SGD_forex_data.csv
 - data/raw/USD_CHF_forex_data.csv
 - data/raw/USD_CAD_forex_data.csv
 - data/raw/NZD_USD_forex_data.csv
 - data/raw/EUR_USD_forex_data.csv
 - data/raw/EUR_GBP_forex_data.csv
 - data/raw/GBP_CAD_forex_data.csv
 - data/raw/AUD_USD_forex_data.csv

Total files saved: 10

Preview of first file:


| | timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2024-03-06 04:00:00+00:00 | 1.26993 | 1.27028 | 1.26978 | 1.2702 | 2121 |
| 1 | 2024-03-06 05:00:00+00:00 | 1.27022 | 1.27068 | 1.27018 | 1.27038 | 1940 |
| 2 | 2024-03-06 06:00:00+00:00 | 1.27037 | 1.27074 | 1.27028 | 1.27048 | 2582 |
| 3 | 2024-03-06 07:00:00+00:00 | 1.27049 | 1.27156 | 1.27032 | 1.27107 | 5915 |
| 4 | 2024-03-06 08:00:00+00:00 | 1.27106 | 1.27256 | 1.2709 | 1.27256 | 5703 |
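
Beyond listing the files, a quick structural check can confirm that each CSV has the expected header and a plausible number of rows. The sketch below uses only the standard library; `check_raw_csv` is a hypothetical helper, and the expected column list is taken from the Outputs section of this notebook.

```python
import csv
from pathlib import Path

EXPECTED_COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]


def check_raw_csv(path: Path, min_rows: int = 1) -> int:
    """Return the number of data rows, raising ValueError on a bad header or too few rows."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty file yields an empty header
        if header != EXPECTED_COLUMNS:
            raise ValueError(f"{path.name}: unexpected header {header}")
        row_count = sum(1 for _ in reader)
    if row_count < min_rows:
        raise ValueError(f"{path.name}: only {row_count} rows, expected at least {min_rows}")
    return row_count
```

Looping over `sorted(Path("data/raw").glob("*_forex_data.csv"))` and calling `check_raw_csv(path, min_rows=8000)` on each file would flag any pair whose download was cut short; the threshold of 8000 is an arbitrary example value.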


# Push files to Repo

### 1. Check current git status



In [18]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .gitignore
	modified:   jupyter_notebooks/01_data_collection.ipynb
	modified:   src/data_collection.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/

no changes added to commit (use "git add" and/or "git commit -a")


### 2. Stage all new/updated files

In [19]:
!git add data/raw/*.csv notebooks/01_data_collection.ipynb

fatal: pathspec 'notebooks/01_data_collection.ipynb' did not match any files

**Note:** This command failed because the notebook lives in `jupyter_notebooks/`, not `notebooks/`. When a pathspec does not match, `git add` stages nothing, which is why the commit and push steps that follow report no changes. The corrected command is `!git add data/raw/*.csv jupyter_notebooks/01_data_collection.ipynb`.


### 3. Commit with a descriptive message

In [20]:
!git commit -m "Add raw forex OHLCV data for 10 pairs (Notebook 01)"

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .gitignore
	modified:   jupyter_notebooks/01_data_collection.ipynb
	modified:   src/data_collection.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/

no changes added to commit (use "git add" and/or "git commit -a")


### 4. Push to the main branch

In [21]:
!git push origin main

Everything up-to-date

