# Part 1: Data Download and Preparation 

## Step 1: Download Intraday Market Data

Objective: Acquire historical intraday data for a selected equity or cryptocurrency.

Requirements:

- Use yfinance for equities (e.g., AAPL).
- Use a crypto API (e.g., Binance) for cryptocurrencies.
- Save data as CSV with columns: Datetime, Open, High, Low, Close, Volume.

Deliverable:

- A CSV file containing intraday market data for the chosen asset.

Example (Equity):

python

import yfinance as yf

data = yf.download(tickers='AAPL', period='7d', interval='1m')

data.to_csv('market_data.csv')
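
The example above covers equities only. For the cryptocurrency requirement, a minimal sketch might look like the following, assuming Binance's public `/api/v3/klines` endpoint and the `BTCUSDT` symbol. The helper names `fetch_klines` and `klines_to_frame` are hypothetical, and the two sample rows are hand-written so the conversion can be run without a network connection.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd

# Public Binance REST endpoint for candlestick (kline) data; no API key needed.
BINANCE_KLINES_URL = "https://api.binance.com/api/v3/klines"
COLUMNS = ["Datetime", "Open", "High", "Low", "Close", "Volume"]


def fetch_klines(symbol="BTCUSDT", interval="1m", limit=1000):
    """Fetch up to `limit` recent klines as raw JSON rows (network call)."""
    query = urllib.parse.urlencode(
        {"symbol": symbol, "interval": interval, "limit": limit}
    )
    with urllib.request.urlopen(f"{BINANCE_KLINES_URL}?{query}", timeout=10) as resp:
        return json.load(resp)


def klines_to_frame(rows):
    """Convert raw kline rows into the Datetime/OHLCV layout used for equities."""
    # Each row starts with: open time (ms), open, high, low, close, volume.
    frame = pd.DataFrame([row[:6] for row in rows], columns=COLUMNS)
    frame["Datetime"] = pd.to_datetime(frame["Datetime"], unit="ms", utc=True)
    value_cols = COLUMNS[1:]
    frame[value_cols] = frame[value_cols].astype(float)  # API returns strings
    return frame.set_index("Datetime")


# Offline example: two hand-written rows in the kline layout (not real prices).
sample_rows = [
    [1700000000000, "100.0", "101.0", "99.5", "100.5", "12.5", 1700000059999],
    [1700000060000, "100.5", "102.0", "100.4", "101.8", "8.0", 1700000119999],
]
crypto = klines_to_frame(sample_rows)
print(crypto)

# To save real data instead:
# klines_to_frame(fetch_klines()).to_csv("BTCUSDT_1m.csv")
```

Keeping the fetch and the conversion separate means the column layout can be checked offline before any request is made.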

---

In [3]:
import yfinance as yf
import pandas as pd
from datetime import datetime
import os

For our version of the end-to-end trading system, we will use data on the top 10 stocks by market capitalization. Note that this can easily be changed, and our use of the Alpaca API will provide additional support for cryptocurrencies and other asset classes. For now, our focus will be the top 10 stocks listed here: https://finance.yahoo.com/research-hub/screener/largest_market_cap/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAIv-Bw2JL9C_WgD2UYHX8tuGb-x89S-SvE_P5yZAdbRsLggdSzI64TaWt-4xTu8IaLG-rW0_loydqM6sFidpeQdyyiFd4kx11oM3-sqEWjsXMpgfjQyFBkrnXmiaN0MlV6wLBDP2eL7Z5rbOUaBtX5t2iJBhqRsbrrKX_OL0t1FI

In [5]:
'''
- Both GOOGL and GOOG appear in the top-10 market-cap list. Only GOOGL, the
higher-ranked of the two, is kept so the same company is not represented twice,
which leaves nine tickers.

- Berkshire Hathaway (A) was also included in the list, but its share price was too high,
so BRK-B was chosen instead.

Data Download Configuration:
- Period: 8 days (yfinance typically allows 7-30 days of 1-minute data)
- Interval: 1-minute bars
- Each stock is downloaded separately to ensure a consistent CSV format
'''

TICKERS = ['NVDA', 'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'AVGO', 'META', 'TSM', 'BRK-B']
PERIOD = '8d'
INTERVAL = '1m'
OUTPUT_DIR = 'market_data'

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download each ticker separately
for i, ticker in enumerate(TICKERS, 1):
    print(f"\n[{i}/{len(TICKERS)}] {ticker}")
    
    try:
        # Download data for single ticker
        data = yf.download(
            tickers=ticker,
            period=PERIOD,
            interval=INTERVAL,
            progress=False
        )
        # Flatten MultiIndex columns if they exist
        if isinstance(data.columns, pd.MultiIndex):
            data.columns = data.columns.droplevel(1)
        
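        # Skip tickers that returned no rows or lack a required column
        required_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
        missing_cols = [col for col in required_cols if col not in data.columns]
        if data.empty or missing_cols:
            print(f"    No usable data (rows: {len(data)}, missing columns: {missing_cols})")
            continue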
        
        # Select only required columns in correct order
        data = data[required_cols]
        
        # Save to CSV; the DatetimeIndex is written as the 'Datetime' column
        output_file = f"{OUTPUT_DIR}/{ticker}_1m.csv"
        data.to_csv(output_file, index_label='Datetime')
        
        # Verification
        print(f"    Rows: {len(data):,}")
        print(f"    Range: {data.index[0]} to {data.index[-1]}")
        print(f"    Saved: {output_file}")
    except Exception as e:
        print(f"    Error: {e}")

print("\nSample CSV format:")
sample_file = f"{OUTPUT_DIR}/{TICKERS[0]}_1m.csv"
if os.path.exists(sample_file):
    sample = pd.read_csv(sample_file, nrows=3)
    print(sample.to_string())


[1/9] NVDA
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/NVDA_1m.csv

[2/9] AAPL
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/AAPL_1m.csv

[3/9] MSFT
    Rows: 2,812
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/MSFT_1m.csv

[4/9] GOOGL
    Rows: 2,812
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/GOOGL_1m.csv

[5/9] AMZN
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/AMZN_1m.csv

[6/9] AVGO
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/AVGO_1m.csv

[7/9] META
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/META_1m.csv

[8/9] TSM
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/TSM_1m.csv

[9/9] BRK-B
    Rows: 2,814
    Range: 2025-11-07 14:30:00+00:00 to 2025-11-18 15:53:00+00:00
    Saved: market_data/BRK-B_1m.csv

Sample CSV format:
                    Datetime        Open        High         Low       Close   Volume
0  2025-11-07 14:30:00+00:00  184.860001  184.860001  183.654999  184.139999  8109544
1  2025-11-07 14:31:00+00:00  184.160004  185.229996  184.160004  184.899994  1052345
2  2025-11-07 14:32:00+00:00  184.889999  185.550003  184.839996  185.538803   750338
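
The download loop flattens MultiIndex columns before selecting the OHLCV fields. Depending on the yfinance version, `yf.download` may return two-level columns (price field, ticker) even for a single ticker, which would otherwise add an extra header row to the CSV. The sketch below reproduces that step on a synthetic frame. The two-level layout and the level names `Price` and `Ticker` are assumptions, and the values are rounded from the sample output above.

```python
import pandas as pd

# Synthetic stand-in for a single-ticker download with two-level columns:
# level 0 holds the price field, level 1 the ticker symbol (assumed layout).
index = pd.date_range(
    "2025-11-07 14:30", periods=3, freq="min", tz="UTC", name="Datetime"
)
columns = pd.MultiIndex.from_product(
    [["Close", "High", "Low", "Open", "Volume"], ["NVDA"]],
    names=["Price", "Ticker"],
)
raw = pd.DataFrame(
    [
        [184.14, 184.86, 183.65, 184.86, 8109544],
        [184.90, 185.23, 184.16, 184.16, 1052345],
        [185.54, 185.55, 184.84, 184.89, 750338],
    ],
    index=index,
    columns=columns,
)

flat = raw.copy()
if isinstance(flat.columns, pd.MultiIndex):
    # Drop the ticker level so only the price-field names remain.
    flat.columns = flat.columns.droplevel(1)

# Reorder to the layout the deliverable asks for.
flat = flat[["Open", "High", "Low", "Close", "Volume"]]
print(flat.to_csv(index_label="Datetime"))
```

Dropping the ticker level is safe here only because each download contains one ticker. With several tickers per request, the level would be needed to tell the columns apart.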


---

## Step 2: Data Cleaning and Organization

Objective: Prepare the raw data for modeling and strategy development.

Requirements:

- Remove missing or duplicate rows.
- Set Datetime as index and sort chronologically.
- Add derived features (e.g., returns, moving averages).

Deliverable:

- A cleaned pandas DataFrame ready for analysis.

Example:

python

import pandas as pd

data = pd.read_csv('market_data.csv')

data.dropna(inplace=True)

data.set_index('Datetime', inplace=True)

data.sort_index(inplace=True)
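
`dropna` only removes rows that exist but contain missing values. A 1-minute feed can also omit a bar entirely or repeat a timestamp, and neither shows up as a NaN. The sketch below is one possible way to audit both cases before cleaning. The helper `audit_minute_bars` is hypothetical, the data is synthetic, and a "session" is assumed to be one calendar date in the index's time zone.

```python
import pandas as pd


def audit_minute_bars(frame):
    """Report duplicate timestamps and minutes missing inside each session.

    Hypothetical helper: a session is one calendar date, and the expected
    grid runs from that date's first bar to its last bar.
    """
    duplicates = int(frame.index.duplicated(keep="first").sum())
    unique = frame[~frame.index.duplicated(keep="first")].sort_index()

    missing = []
    for _, session in unique.groupby(unique.index.date):
        # Every minute between the session's first and last observed bar.
        expected = pd.date_range(session.index[0], session.index[-1], freq="min")
        missing.extend(expected.difference(session.index))
    return duplicates, missing


# Synthetic example: 14:32 is absent and 14:33 appears twice.
stamps = pd.to_datetime(
    [
        "2025-11-07 14:30",
        "2025-11-07 14:31",
        "2025-11-07 14:33",
        "2025-11-07 14:33",
        "2025-11-07 14:34",
    ],
    utc=True,
)
bars = pd.DataFrame({"Close": [184.1, 184.9, 185.2, 185.2, 185.0]}, index=stamps)

n_dupes, gaps = audit_minute_bars(bars)
print(f"duplicates: {n_dupes}, missing bars: {gaps}")
```

Whether a missing bar should be forward-filled or left out depends on the strategy, so the sketch only reports gaps and does not fill them.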

---

In [6]:
'''
Step 2: Clean and Organize Data
- Remove missing or duplicate rows
- Set Datetime as index and sort chronologically
- Add derived features (returns, moving averages)
'''

# Configuration
INPUT_DIR = 'market_data'
OUTPUT_DIR = 'cleaned_data'
TICKERS = ['NVDA', 'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'AVGO', 'META', 'TSM', 'BRK-B']

os.makedirs(OUTPUT_DIR, exist_ok=True)

for i, ticker in enumerate(TICKERS, 1):
    print(f"\n[{i}/{len(TICKERS)}] {ticker}")
    
    input_file = f"{INPUT_DIR}/{ticker}_1m.csv"
    
    if not os.path.exists(input_file):
        print(f"    File not found: {input_file}")
        continue
    
    try:
        # Read CSV
        data = pd.read_csv(input_file)
        initial_rows = len(data)
        
        # Set Datetime as index
        data['Datetime'] = pd.to_datetime(data['Datetime'])
        data.set_index('Datetime', inplace=True)
        
        # Sort chronologically
        data.sort_index(inplace=True)
        
        # Remove duplicates
        data = data[~data.index.duplicated(keep='first')]
        
        # Remove missing values
        data.dropna(inplace=True)
        
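        # Add derived features. Each is computed over the whole series, so the
        # first bars of a session look back into the previous session.
        data['Returns'] = data['Close'].pct_change()
        data['MA_20'] = data['Close'].rolling(window=20).mean()
        data['MA_50'] = data['Close'].rolling(window=50).mean()
        
        # Remove warm-up rows where an indicator is still NaN (the first 49 for MA_50)
        data.dropna(inplace=True)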
        
        final_rows = len(data)
        print(f"    Initial: {initial_rows:,} rows")
        print(f"    Final: {final_rows:,} rows")
        
        # Save cleaned data
        output_file = f"{OUTPUT_DIR}/{ticker}_cleaned.csv"
        data.to_csv(output_file)
        print(f"    Saved: {output_file}")
        
    except Exception as e:
        print(f"    Error: {e}")

# Show sample
print("\nSample (first ticker):")
sample_file = f"{OUTPUT_DIR}/{TICKERS[0]}_cleaned.csv"
if os.path.exists(sample_file):
    sample = pd.read_csv(sample_file, index_col='Datetime', parse_dates=True)
    print(f"Shape: {sample.shape}")
    print(sample.head(3))


[1/9] NVDA
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/NVDA_cleaned.csv

[2/9] AAPL
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/AAPL_cleaned.csv

[3/9] MSFT
    Initial: 2,812 rows
    Final: 2,763 rows
    Saved: cleaned_data/MSFT_cleaned.csv

[4/9] GOOGL
    Initial: 2,812 rows
    Final: 2,763 rows
    Saved: cleaned_data/GOOGL_cleaned.csv

[5/9] AMZN
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/AMZN_cleaned.csv

[6/9] AVGO
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/AVGO_cleaned.csv

[7/9] META
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/META_cleaned.csv

[8/9] TSM
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/TSM_cleaned.csv

[9/9] BRK-B
    Initial: 2,814 rows
    Final: 2,765 rows
    Saved: cleaned_data/BRK-B_cleaned.csv
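
Every ticker in the log above loses exactly 49 rows (for example 2,814 to 2,765). That is what the 50-bar moving average would account for on its own: it is undefined until 50 closes are available, and the single NaN from `pct_change` and the 19 NaNs from `MA_20` fall inside those same 49 rows. The sketch below checks this arithmetic on synthetic closes. It assumes a series with no gaps or missing values, so it illustrates the warm-up effect and does not re-run the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic closes: 200 one-minute bars with no gaps or missing values.
index = pd.date_range("2025-11-07 14:30", periods=200, freq="min", tz="UTC")
data = pd.DataFrame({"Close": 100 + np.arange(200) * 0.01}, index=index)

data["Returns"] = data["Close"].pct_change()             # NaN in row 0
data["MA_20"] = data["Close"].rolling(window=20).mean()  # NaN in rows 0-18
data["MA_50"] = data["Close"].rolling(window=50).mean()  # NaN in rows 0-48

nan_counts = data.isna().sum()
cleaned = data.dropna()
rows_dropped = len(data) - len(cleaned)

print(nan_counts.to_dict())
print(f"rows dropped: {rows_dropped}")
```

If a longer window is added later, the number of dropped rows grows to that window's length minus one.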

Sample (first ticker):
Shape: (2765, 8)
                                 Open        High         Low     