# 02 - Data Cleaning and Resampling

**Objective:**  
1. Validate the raw 1H OHLCV data collected from OANDA.  
2. Identify and handle missing data, duplicates, and any data quality issues.  
3. Resample 1H data to 4H, Daily, and Weekly timeframes for multi-timeframe analysis.  
4. Validate the resampled data.  
5. Apply the process to all 7 currency pairs and save processed files.

**Key Steps:**  
- Data Validation (Check completeness & integrity)  
- Resampling & Feature Aggregation  
- Save Processed Data for Model Training

---

## Change Working Directory

The notebook lives in `jupyter_notebooks/`, so we change the working directory from that folder to its parent folder (the project root).
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/forex-mtf-strategy-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/forex-mtf-strategy-predictor'
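Re-running the `os.chdir` cell above moves one more level up each time, which can silently break the relative paths used later. A minimal sketch of a guarded version follows; the helper name `ensure_project_root` and the default folder name are illustrative assumptions, not part of the project code.

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only when we are still inside the notebook folder.

    `notebook_dir_name` is an assumed folder name taken from the path shown above.
    Calling this function repeatedly is safe: once the working directory is no
    longer the notebook folder, it does nothing.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Usage in a notebook cell (safe to re-run):
# project_root = ensure_project_root()
```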

## Import Libraries and Setup Paths

We import pandas for data handling, `pathlib.Path` for file management,  
and our custom validation and resampling functions from `src/data_cleaning.py`.

In [20]:
from pathlib import Path
import pandas as pd
from src.data_cleaning import validate_ohlc_data, resample_ohlc, validate_resampled_data

# Set correct paths relative to project root
raw_path = Path("data/raw")
processed_path = Path("data/processed")
processed_path.mkdir(parents=True, exist_ok=True)

# Display settings
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', None)

print("Raw CSVs:", list(raw_path.glob("*.csv")))


Raw CSVs: [PosixPath('data/raw/EUR_USD_1H.csv'), PosixPath('data/raw/USD_JPY_1H.csv'), PosixPath('data/raw/AUD_USD_1H.csv'), PosixPath('data/raw/NZD_USD_1H.csv'), PosixPath('data/raw/USD_CAD_1H.csv'), PosixPath('data/raw/USD_CHF_1H.csv'), PosixPath('data/raw/GBP_USD_1H.csv')]


---

## Load and Inspect 1H Raw Data

We start with one currency pair (`EUR_USD`) to:
1. Inspect data structure
2. Check columns and data types
3. Look for missing values or duplicates


In [21]:
pair = "EUR_USD"
file_path = raw_path / f"{pair}_1H.csv"

print("Looking for:", file_path.resolve())  # Debug

df = pd.read_csv(file_path, parse_dates=["timestamp"])
print(df.info())
df.head()


Looking for: /workspaces/forex-mtf-strategy-predictor/data/raw/EUR_USD_1H.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43800 entries, 0 to 43799
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   timestamp  43800 non-null  datetime64[ns, UTC]
 1   open       43800 non-null  float64            
 2   high       43800 non-null  float64            
 3   low        43800 non-null  float64            
 4   close      43800 non-null  float64            
 5   volume     43800 non-null  int64              
dtypes: datetime64[ns, UTC](1), float64(4), int64(1)
memory usage: 2.0 MB
None


|   | timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2018-07-19 08:00:00+00:00 | 1.1608 | 1.16186 | 1.16041 | 1.16128 | 6352 |
| 1 | 2018-07-19 09:00:00+00:00 | 1.16133 | 1.1616 | 1.15951 | 1.16062 | 4379 |
| 2 | 2018-07-19 10:00:00+00:00 | 1.1606 | 1.16142 | 1.15954 | 1.15956 | 3570 |
| 3 | 2018-07-19 11:00:00+00:00 | 1.1596 | 1.16006 | 1.15858 | 1.16004 | 4001 |
| 4 | 2018-07-19 12:00:00+00:00 | 1.16006 | 1.1601 | 1.15748 | 1.15864 | 4080 |


---

## Validate Raw Data

We use our `validate_ohlc_data` function to check:
1. Missing values
2. Duplicate rows
3. Timestamp continuity

In [22]:
issues_raw = validate_ohlc_data(df)
print(issues_raw)

{'row_count': 43800, 'missing_values': {'timestamp': 0, 'open': 0, 'high': 0, 'low': 0, 'close': 0, 'volume': 0}, 'duplicate_rows': np.int64(0), 'gaps_found': 375}


  gaps = df_sorted[df_sorted["diff"] > pd.Timedelta("1H")]
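The report above counts 375 gaps but does not say what they are. As a rough illustration of how the four reported fields could be computed, here is a minimal sketch. `ohlc_quality_report` is a hypothetical stand-in and not the actual `validate_ohlc_data` from `src/data_cleaning.py`, whose implementation is not shown in this notebook. Since spot FX generally does not trade over the weekend, many of the gaps are probably weekend closures, but that is an assumption to verify against the gap timestamps.

```python
import pandas as pd


def ohlc_quality_report(
    df: pd.DataFrame,
    expected_step: pd.Timedelta = pd.Timedelta(hours=1),
) -> dict:
    """Hypothetical sketch of the checks listed above (not the project's function).

    Assumes a `timestamp` column of datetimes, as in the raw 1H files.
    """
    ordered = df.sort_values("timestamp")
    # Time elapsed since the previous candle; the first row has no predecessor (NaT).
    step = ordered["timestamp"].diff()
    # A gap is any jump larger than the expected candle spacing.
    gaps = ordered.loc[step > expected_step, "timestamp"]
    return {
        "row_count": len(df),
        "missing_values": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "gaps_found": int(len(gaps)),
    }


# Tiny illustrative frame: the jump from 09:00 to 12:00 is one gap.
sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-02 08:00", "2024-01-02 09:00",
             "2024-01-02 12:00", "2024-01-02 13:00"],
            utc=True,
        ),
        "close": [1.10, 1.11, 1.12, 1.13],
    }
)
report = ohlc_quality_report(sample)
```

Passing the expected spacing as `pd.Timedelta(hours=1)` avoids relying on a string frequency alias.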


---

## Resample Data for One Pair

We resample `EUR_USD` 1H data to:
- 4H (H4)
- 1D (Daily)
- 1W (Weekly)

We aggregate:
- **Open** = first value
- **High** = max value
- **Low** = min value
- **Close** = last value
- **Volume** = sum
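The aggregation rules above map directly onto a pandas `resample(...).agg(...)` call. The sketch below illustrates the idea on a toy frame; `resample_ohlcv_sketch` and `OHLCV_AGG` are illustrative names, and the project's own `resample_ohlc` in `src/` may differ in detail.

```python
import pandas as pd

# One aggregation per column, matching the rules listed above.
OHLCV_AGG = {
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
}


def resample_ohlcv_sketch(df: pd.DataFrame, rule) -> pd.DataFrame:
    """Aggregate candles into larger bins (illustrative, not the project's function).

    `df` must have a DatetimeIndex. `rule` can be a pandas offset alias
    or a `pd.Timedelta`.
    """
    out = df.resample(rule).agg(OHLCV_AGG)
    # Bins with no source candles (e.g. market closed) have a NaN open; drop them.
    return out.dropna(subset=["open"])


# Toy data: eight hourly candles starting at midnight UTC.
idx = pd.date_range("2024-01-01", periods=8, freq=pd.Timedelta(hours=1), tz="UTC")
base = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
hourly = pd.DataFrame(
    {
        "open": base,
        "high": [b + 0.05 for b in base],
        "low": [b - 0.05 for b in base],
        "close": [b + 0.02 for b in base],
        "volume": [10] * 8,
    },
    index=idx,
)
four_hour = resample_ohlcv_sketch(hourly, pd.Timedelta(hours=4))
```

Each 4-hour row takes its open from the first hourly candle in the bin and its close from the last, so no price information is invented by the aggregation.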

In [23]:
# Set timestamp as index for resampling
df.set_index("timestamp", inplace=True)

df_4h = resample_ohlc(df, "4H")
df_1d = resample_ohlc(df, "1D")
df_1w = resample_ohlc(df, "1W")

print("Sample 4H data:")
display(df_4h.head())

Sample 4H data:


  df_resampled = df.resample(timeframe).apply(ohlc).dropna()


| timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|
| 2018-07-19 08:00:00+00:00 | 1.1608 | 1.16186 | 1.15858 | 1.16004 | 18302 |
| 2018-07-19 12:00:00+00:00 | 1.16006 | 1.16222 | 1.15748 | 1.16032 | 15041 |
| 2018-07-19 16:00:00+00:00 | 1.16033 | 1.16789 | 1.16024 | 1.16364 | 22171 |
| 2018-07-19 20:00:00+00:00 | 1.16364 | 1.1656 | 1.16321 | 1.16548 | 5751 |
| 2018-07-20 00:00:00+00:00 | 1.16548 | 1.16599 | 1.16264 | 1.16501 | 15689 |


---

## Validate Resampled Data

We check the resampled data for:
1. Correct timeframe intervals
2. Missing values
3. Logical OHLC structure
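"Logical OHLC structure" is usually taken to mean that each candle's high is at least as large as its open and close, its low is at most as small as both, and high is never below low. A minimal sketch of that check is shown here; `count_ohlc_violations` is a hypothetical helper, and whether `validate_resampled_data` performs the check in this way is an assumption.

```python
import pandas as pd


def count_ohlc_violations(df: pd.DataFrame) -> int:
    """Count candles whose high/low do not bracket the open and close.

    Hypothetical helper for illustration; expects open, high, low, close columns.
    """
    body_top = df[["open", "close"]].max(axis=1)
    body_bottom = df[["open", "close"]].min(axis=1)
    bad = (
        (df["high"] < body_top)        # high below the open or close
        | (df["low"] > body_bottom)    # low above the open or close
        | (df["high"] < df["low"])     # inverted range
    )
    return int(bad.sum())


candles = pd.DataFrame(
    {
        "open":  [1.10, 1.10, 1.10],
        "high":  [1.12, 1.11, 1.13],
        "low":   [1.09, 1.09, 1.11],
        "close": [1.11, 1.12, 1.12],
    }
)
# Row 0 is consistent; row 1 has high < close; row 2 has low > open.
violations = count_ohlc_violations(candles)
```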

In [24]:
print("4H Validation:", validate_resampled_data(df_4h, "4H"))
print("1D Validation:", validate_resampled_data(df_1d, "1D"))
print("1W Validation:", validate_resampled_data(df_1w, "1W"))

4H Validation: {'row_count': 11326, 'missing_values': {'open': 0, 'high': 0, 'low': 0, 'close': 0, 'volume': 0}, 'duplicate_rows': np.int64(0), 'timeframe': '4H'}
1D Validation: {'row_count': 2198, 'missing_values': {'open': 0, 'high': 0, 'low': 0, 'close': 0, 'volume': 0}, 'duplicate_rows': np.int64(0), 'timeframe': '1D'}
1W Validation: {'row_count': 368, 'missing_values': {'open': 0, 'high': 0, 'low': 0, 'close': 0, 'volume': 0}, 'duplicate_rows': np.int64(0), 'timeframe': '1W'}


---

## Apply Resampling to All Currency Pairs

After confirming that resampling works for one pair,
we loop through all pairs and save processed CSVs in `data/processed/`.
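The pair list in the next cell is hard-coded. As an alternative sketch, the pairs could be derived from the file names in `data/raw`, so that newly collected pairs are picked up automatically. `discover_pairs` is a hypothetical helper, and the `_1H.csv` suffix is taken from the file listing earlier in this notebook.

```python
from pathlib import Path


def discover_pairs(raw_dir, suffix: str = "_1H.csv") -> list:
    """Return sorted pair names such as 'EUR_USD' from files named '<PAIR>_1H.csv'.

    Hypothetical helper for illustration only.
    """
    return sorted(
        path.name[: -len(suffix)]
        for path in Path(raw_dir).glob(f"*{suffix}")
    )


# Usage, assuming the working directory is the project root:
# pairs = discover_pairs("data/raw")
```

Sorting the names keeps the processing order stable across runs, since `glob` does not guarantee any particular order.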

In [25]:
pairs = ["EUR_USD", "USD_JPY", "GBP_USD", "USD_CHF", "AUD_USD", "USD_CAD", "NZD_USD"]

for pair in pairs:
    df = pd.read_csv(raw_path / f"{pair}_1H.csv", parse_dates=["timestamp"])
    df.set_index("timestamp", inplace=True)
    
    df_4h = resample_ohlc(df, "4H")
    df_1d = resample_ohlc(df, "1D")
    df_1w = resample_ohlc(df, "1W")
    
    df_4h.to_csv(processed_path / f"{pair}_4H.csv")
    df_1d.to_csv(processed_path / f"{pair}_1D.csv")
    df_1w.to_csv(processed_path / f"{pair}_1W.csv")
    
    print(f"Processed {pair}: 4H={len(df_4h)}, 1D={len(df_1d)}, 1W={len(df_1w)}")

print("All pairs processed and saved to data/processed/")


Processed EUR_USD: 4H=11326, 1D=2198, 1W=368
Processed USD_JPY: 4H=11322, 1D=2197, 1W=368
Processed GBP_USD: 4H=11326, 1D=2198, 1W=368
Processed USD_CHF: 4H=11325, 1D=2198, 1W=368
Processed AUD_USD: 4H=11328, 1D=2198, 1W=368
Processed USD_CAD: 4H=11320, 1D=2197, 1W=368
Processed NZD_USD: 4H=11327, 1D=2198, 1W=368
All pairs processed and saved to data/processed/


---

## Push Files to Repo

### 1. Check current git status

In [26]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   jupyter_notebooks/01_data_collection.ipynb
	modified:   jupyter_notebooks/02_data_cleaning.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/processed/
	src/data_cleaning.py

no changes added to commit (use "git add" and/or "git commit -a")


### 2. Stage all new/updated files

In [None]:
!git add .

### 3. Commit with a descriptive message

In [None]:
!git commit -m "Validate 1H data for all pairs, and resample the data to 4H, Daily, and Weekly OHLC data"