# Step 1: Data Collection - Turkish Financial Data

## 📋 Setup Instructions

### Before You Start:
1. **Get CBRT EVDS API Key** (for macroeconomic data):
   - Visit: https://evds2.tcmb.gov.tr/
   - Register for a free account
   - Copy the API key from your profile
   - See `CBRT_API_SETUP.md` for detailed instructions

2. **Optional: Use .env file** (recommended for security):
   - Copy `.env.example` to `.env`
   - Add your API key: `EVDS_API_KEY=your_key_here`
   - The notebook will automatically load it
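The notebook relies on `load_env.get_evds_api_key`, whose implementation is not shown here. As a rough sketch of what such a loader might do, the hypothetical helper `read_env_key` below reads a `KEY=value` file and falls back to the process environment. The function name and parsing rules are assumptions for illustration, not the project's actual code.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_key(name: str, env_path: Path = Path(".env")) -> Optional[str]:
    """Return `name` from a KEY=value file, else from os.environ, else None.

    Hypothetical helper: the project's real loader may behave differently.
    """
    if env_path.exists():
        for raw_line in env_path.read_text(encoding="utf-8").splitlines():
            line = raw_line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() == name:
                # Drop optional surrounding quotes around the value.
                return value.strip().strip("'\"") or None
    return os.environ.get(name)
```

Keeping the key in `.env` (and out of version control) avoids committing it along with the notebook.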

### What This Notebook Does:
- ✅ Collects macroeconomic data from CBRT (inflation, interest rates, exchange rates)
- ✅ Collects BIST stock prices from Yahoo Finance
- ✅ Combines datasets for comprehensive analysis
- ✅ Saves all data to `data/raw/` folder

# Step 1: Data Collection

## Goal
Collect Turkish financial datasets from various sources:
1. Financial Ratios Dataset for BIST Manufacturing Firms (Zenodo)
2. TÜİK financial statistics
3. BIST stock data (if accessible)

## Tasks
- Download datasets
- Load into pandas DataFrames
- Initial data inspection
- Save raw data to `../data/raw/`

In [1]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path

# Add src to path
project_root = Path().resolve().parent
sys.path.append(str(project_root / "src"))

# Set up paths
data_raw_dir = project_root / "data" / "raw"
data_raw_dir.mkdir(parents=True, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Raw data directory: {data_raw_dir}")

# Import our data collector
from data_collection import TurkishFinancialDataCollector, get_evds_api_key_instructions

Project root: C:\Users\cihan\turkish_finance_ml
Raw data directory: C:\Users\cihan\turkish_finance_ml\data\raw


## 🧪 Test Your API Key (Optional)

Run this cell to check that your API key works before collecting the full dataset.

In [2]:
# Quick API key test (run this after setting EVDS_API_KEY below)
# Uncomment and run to test your key:
# from test_api_key import test_api_key
# test_api_key(EVDS_API_KEY)

## Option 1: CBRT EVDS - Macroeconomic Data (RECOMMENDED) ⭐

**Source:** Central Bank of the Republic of Turkey (CBRT) EVDS API

**What you'll get:**
- Consumer Price Index (CPI/TÜFE) - Inflation
- Producer Price Index (PPI/ÜFE)
- Policy Interest Rates
- USD/TRY Exchange Rates
- Time-series format (monthly data from 2000+)

**Steps:**
1. Get API key from: https://evds2.tcmb.gov.tr/ (FREE, just register)
2. Run the cell below
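The macro collection cell in this notebook passes dates as `DD-MM-YYYY` strings (`"01-01-2000"`), while the stock cell uses ISO `YYYY-MM-DD`. A small converter, sketched here under the hypothetical name `to_evds_date`, lets both calls share one set of ISO dates.

```python
from datetime import date, datetime
from typing import Union


def to_evds_date(value: Union[str, date]) -> str:
    """Convert an ISO date string (YYYY-MM-DD) or date object to DD-MM-YYYY."""
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d").date()
    return value.strftime("%d-%m-%Y")


START, END = "2000-01-01", "2024-12-31"
print(to_evds_date(START), to_evds_date(END))
```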

In [None]:
# Initialize collector
# IMPORTANT: Get your API key from https://evds2.tcmb.gov.tr/
# You can either:
# 1. Set it here: EVDS_API_KEY = "your_key_here"
# 2. Or use .env file (recommended): Add EVDS_API_KEY=your_key_here to .env file

# Try to load API key from .env file first
from load_env import get_evds_api_key
EVDS_API_KEY = get_evds_api_key()

# If not in .env, try manual setting
if EVDS_API_KEY is None:
    EVDS_API_KEY = "YOUR_API_KEY"  # ⚠️ Replace with your actual API key!

if EVDS_API_KEY is None or EVDS_API_KEY == "YOUR_API_KEY":
    print("⚠️  Please set your EVDS API key first!")
    print("\n📋 How to get API key:")
    get_evds_api_key_instructions()
    print("\n💡 Tip: You can also add it to .env file: EVDS_API_KEY=your_key_here")
    macro_data = pd.DataFrame()  # Initialize as empty to avoid NameError
else:
    collector = TurkishFinancialDataCollector(
        data_dir=data_raw_dir,
        evds_api_key=EVDS_API_KEY
    )
    
    # Collect macroeconomic data
    print("📊 Collecting macroeconomic data from CBRT EVDS...")
    macro_data = collector.collect_cbrt_macroeconomic_data(
        start_date="01-01-2000",
        end_date="31-12-2024"
    )
    
    if not macro_data.empty:
        print(f"\n✅ Successfully collected macroeconomic data!")
        print(f"   Shape: {macro_data.shape}")
        print(f"   Columns: {macro_data.columns.tolist()}")
        print(f"   Date range: {macro_data['Date'].min()} to {macro_data['Date'].max()}")
        print("\nFirst few rows:")
        display(macro_data.head(10))
        print("\nData info:")
        macro_data.info()  # info() prints directly and returns None
        print("\nStatistical summary:")
        display(macro_data.describe())

## Option 2: BIST Stock Prices (Yahoo Finance) ⭐

**Source:** Yahoo Finance (via yfinance library)

**What you'll get:**
- Historical daily stock prices (Open, High, Low, Close, Volume)
- BIST-100 index data
- Individual company stocks
- Time-series format (daily data from 2000+)

**No API key needed!** This is free and easy to use.
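The collector's internals are not shown in this notebook. For orientation, a minimal sketch of fetching one ticker directly with `yfinance` might look like the following. `fetch_history` and `tidy_history` are hypothetical names, and the real `collect_bist_stock_data` may work differently. The column layout matches what `Ticker.history()` returns (Open, High, Low, Close, Volume, Dividends, Stock Splits).

```python
import pandas as pd


def tidy_history(history: pd.DataFrame, ticker: str) -> pd.DataFrame:
    """Move the date index into a 'Date' column and tag rows with the ticker."""
    out = history.rename_axis("Date").reset_index()
    out["Ticker"] = ticker
    return out


def fetch_history(ticker: str, start: str, end: str) -> pd.DataFrame:
    """Download daily prices for one ticker (needs network access and yfinance)."""
    import yfinance as yf  # imported lazily so tidy_history works without it

    history = yf.Ticker(ticker).history(start=start, end=end)
    return tidy_history(history, ticker)


# Example (requires internet access):
# bist = fetch_history("XU100.IS", "2000-01-01", "2024-12-31")
```

Note that `yfinance` treats `end` as exclusive, so the last returned row falls before the `end` date.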

In [4]:
# Initialize collector (no API key needed for stock data)
collector_stocks = TurkishFinancialDataCollector(data_dir=data_raw_dir)

# Option A: Collect BIST-100 index only
print("📊 Collecting BIST-100 Index data...")
bist_index = collector_stocks.collect_bist_stock_data(
    tickers=['XU100.IS'],  # BIST-100 index
    start_date="2000-01-01",
    end_date="2024-12-31"
)

if not bist_index.empty:
    print(f"\n✅ Successfully collected BIST-100 data!")
    print(f"   Shape: {bist_index.shape}")
    print(f"   Date range: {bist_index['Date'].min()} to {bist_index['Date'].max()}")
    display(bist_index.head(10))

# Option B: Collect multiple major BIST companies
print("\n" + "="*60)
print("📊 Collecting data for major BIST-100 companies...")
print("="*60)
major_stocks = collector_stocks.collect_bist_100_companies()

if not major_stocks.empty:
    print(f"\n✅ Successfully collected stock data for {major_stocks['Ticker'].nunique()} companies!")
    print(f"   Shape: {major_stocks.shape}")
    print(f"   Tickers: {major_stocks['Ticker'].unique()}")
    display(major_stocks.head(10))

📊 Collecting BIST-100 Index data...
📊 Collecting XU100.IS...
   ✅ XU100.IS: 6253 records (2000-01-04 00:00:00+02:00 to 2024-12-30 00:00:00+03:00)

✅ Saved stock data to: C:\Users\cihan\turkish_finance_ml\data\raw\bist_stock_prices.csv
   Shape: (6253, 9)
   Tickers: ['XU100.IS']

✅ Successfully collected BIST-100 data!
   Shape: (6253, 9)
   Date range: 2000-01-04 00:00:00+02:00 to 2024-12-30 00:00:00+03:00


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Ticker
0,2000-01-04 00:00:00+02:00,152.087189,176.392068,152.087189,175.121063,5453870000,0.0,0.0,XU100.IS
1,2000-01-05 00:00:00+02:00,175.121063,178.02005,162.376136,169.319107,6672090000,0.0,0.0,XU100.IS
2,2000-01-06 00:00:00+02:00,169.319114,174.606074,160.867148,161.999146,6609500000,0.0,0.0,XU100.IS
3,2000-01-07 00:00:00+02:00,161.999142,163.055136,156.234173,158.373154,2544440000,0.0,0.0,XU100.IS
4,2000-01-11 00:00:00+02:00,158.37315,163.882124,152.931193,163.473129,5361840000,0.0,0.0,XU100.IS
5,2000-01-12 00:00:00+02:00,163.47313,173.041087,163.47313,169.335098,5969170000,0.0,0.0,XU100.IS
6,2000-01-13 00:00:00+02:00,169.335101,182.567026,169.335101,181.381042,7364170000,0.0,0.0,XU100.IS
7,2000-01-14 00:00:00+02:00,181.381048,193.31998,181.381048,191.10199,6785520000,0.0,0.0,XU100.IS
8,2000-01-17 00:00:00+02:00,191.101986,206.17791,183.012032,184.582016,7324420000,0.0,0.0,XU100.IS
9,2000-01-18 00:00:00+02:00,184.582014,195.771957,181.824033,195.771957,5954870000,0.0,0.0,XU100.IS
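The timestamps above switch from `+02:00` to `+03:00` offsets over the sample period. When such values are read back from CSV as strings, parsing them into a single datetime column needs a common reference. One option, sketched here on two sample strings taken from the output above, is to parse as UTC, convert to `Europe/Istanbul`, and then drop the timezone to get plain calendar dates.

```python
import pandas as pd

raw = pd.Series([
    "2000-01-04 00:00:00+02:00",
    "2024-12-30 00:00:00+03:00",
])

dates = (
    pd.to_datetime(raw, utc=True)        # mixed offsets -> one UTC column
    .dt.tz_convert("Europe/Istanbul")    # back to local exchange time
    .dt.tz_localize(None)                # drop the timezone
    .dt.normalize()                      # keep midnight-aligned dates
)
print(dates.tolist())
```

Timezone-naive dates are also easier to join against macro series, which typically carry no timezone.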



📊 Collecting data for major BIST-100 companies...
📊 Collecting AKBNK.IS...
   ✅ AKBNK.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting GARAN.IS...
   ✅ GARAN.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting THYAO.IS...
   ✅ THYAO.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting TUPRS.IS...
   ✅ TUPRS.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting SAHOL.IS...
   ✅ SAHOL.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting BIMAS.IS...
   ✅ BIMAS.IS: 5252 records (2005-07-22 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting ARCLK.IS...
   ✅ ARCLK.IS: 6609 records (2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting KOZAL.IS...
   ✅ KOZAL.IS: 4091 records (2010-02-12 00:00:00+02:00 to 2026-01-16 00:00:00+03:00)
📊 Collecting SASA.IS...

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Ticker
0,2000-05-10 00:00:00+03:00,-117775.390625,-123009.857107,-115158.171109,-117775.390625,1284067,0.0,0.0,AKBNK.IS
1,2000-05-11 00:00:00+03:00,-117775.390625,-120392.619291,-112540.933293,-117775.390625,608846,0.0,0.0,AKBNK.IS
2,2000-05-12 00:00:00+03:00,-120392.601563,-123009.838993,-115158.154152,-120392.601562,2283592,0.0,0.0,AKBNK.IS
3,2000-05-15 00:00:00+03:00,-117775.390625,-120392.619291,-115158.171109,-117775.390625,122956,0.0,0.0,AKBNK.IS
4,2000-05-16 00:00:00+03:00,-115158.164062,-117775.383418,-112540.926406,-115158.164062,1331524,0.0,0.0,AKBNK.IS
5,2000-05-17 00:00:00+03:00,-115158.164062,-117775.383418,-109923.6979,-115158.164062,1062260,0.0,0.0,AKBNK.IS
6,2000-05-18 00:00:00+03:00,-107306.46875,-109923.706614,-102072.002173,-107306.46875,1716238,0.0,0.0,AKBNK.IS
7,2000-05-19 00:00:00+03:00,-107306.46875,-107306.46875,-107306.46875,-107306.46875,0,0.0,0.0,AKBNK.IS
8,2000-05-22 00:00:00+03:00,-102595.460938,-104689.247722,-99454.780761,-102595.460938,1728168,0.0,0.0,AKBNK.IS
9,2000-05-23 00:00:00+03:00,-102595.460938,-104689.247722,-99454.780761,-102595.460938,1728168,0.0,0.0,AKBNK.IS
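The AKBNK.IS rows above show negative Open/High/Low/Close values, which cannot be traded prices. They may stem from how historical prices are back-adjusted, though that is an assumption; the notebook does not say. Before modelling, it can help to flag such rows. A minimal sketch, with `flag_nonpositive_prices` as a hypothetical helper and a small sample frame whose positive values are made up for illustration:

```python
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]


def flag_nonpositive_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-ticker row counts and counts of rows with any price <= 0."""
    bad = (df[PRICE_COLS] <= 0).any(axis=1)
    return (
        df.assign(bad_price=bad)
        .groupby("Ticker")["bad_price"]
        .agg(rows="size", nonpositive="sum")
        .reset_index()
    )


sample = pd.DataFrame({
    "Ticker": ["AKBNK.IS", "AKBNK.IS", "XU100.IS"],
    "Open": [-117775.39, 10.0, 152.09],   # second row is synthetic
    "High": [-123009.86, 11.0, 176.39],
    "Low": [-115158.17, 9.5, 152.09],
    "Close": [-117775.39, 10.5, 175.12],
})
print(flag_nonpositive_prices(sample))
```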


## Option 3: Combine Stock and Macro Data

Merge stock prices with macroeconomic indicators for comprehensive analysis.
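Because macro observations are monthly and prices are daily, an exact-date join only matches on days that coincide with a macro timestamp. One common alternative is an as-of merge, which gives each trading day the latest macro value dated on or before it. The sketch below uses synthetic numbers and a hypothetical `CPI` column purely for illustration.

```python
import pandas as pd

stock = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-31", "2024-02-01", "2024-02-15"]),
    "Close": [100.0, 104.0, 105.0, 103.0],
})
macro = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "CPI": [64.8, 67.1],  # synthetic values
})

# Both frames must be sorted by the key; 'backward' picks the last
# macro row whose Date is <= the stock row's Date.
combined = pd.merge_asof(
    stock.sort_values("Date"),
    macro.sort_values("Date"),
    on="Date",
    direction="backward",
)
print(combined)
```

Here the January macro row is dated on a non-trading day, so an exact join would never match it, while the as-of merge carries it onto every January trading day.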

In [5]:
# Combine stock and macro data
# Note: Stock data is daily, macro data is monthly
# We'll forward-fill macro data to match daily frequency

# Fall back to empty frames if the earlier collection cells were not run
try:
    bist_index
except NameError:
    bist_index = pd.DataFrame()

try:
    macro_data
except NameError:
    macro_data = pd.DataFrame()

# Try loading from saved files if variables are empty
if bist_index.empty:
    stock_file = data_raw_dir / "bist_stock_prices.csv"
    if stock_file.exists():
        df_stock = pd.read_csv(stock_file)
        # Filter for BIST-100 index if available
        if 'Ticker' in df_stock.columns:
            bist_index = df_stock[df_stock['Ticker'] == 'XU100.IS'].copy()
        else:
            bist_index = df_stock.copy()
        print("✅ Loaded stock data from saved file")

if macro_data.empty:
    macro_file = data_raw_dir / "cbrt_macroeconomic_data.csv"
    if macro_file.exists():
        macro_data = pd.read_csv(macro_file)
        macro_data['Date'] = pd.to_datetime(macro_data['Date'])
        print("✅ Loaded macro data from saved file")
    else:
        print("⚠️  Macro data file not found. Please run the macro collection cell first.")

# Now combine the data
if not bist_index.empty and not macro_data.empty:
    # Prepare stock data. Yahoo timestamps carry mixed +02:00/+03:00 offsets,
    # so parse as UTC, convert to Istanbul time, then drop the timezone.
    stock_daily = bist_index[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']].copy()
    stock_daily['Date'] = (
        pd.to_datetime(stock_daily['Date'], utc=True)
        .dt.tz_convert('Europe/Istanbul')
        .dt.tz_localize(None)
        .dt.normalize()
    )
    stock_daily = stock_daily.sort_values('Date').reset_index(drop=True)
    
    # Prepare macro data
    macro_monthly = macro_data.copy()
    macro_monthly['Date'] = pd.to_datetime(macro_monthly['Date'])
    macro_monthly = macro_monthly.sort_values('Date').reset_index(drop=True)
    
    # As-of merge: each trading day takes the most recent macro observation
    # dated on or before it. An exact-date left join would only match days
    # that coincide with a monthly timestamp, and fillna(method='ffill')
    # is deprecated in recent pandas versions.
    combined = pd.merge_asof(stock_daily, macro_monthly, on='Date', direction='backward')
    
    # Save combined dataset
    output_file = data_raw_dir / "combined_stock_macro_data.csv"
    combined.to_csv(output_file, index=False)
    
    print(f"\n✅ Combined dataset created!")
    print(f"   Shape: {combined.shape}")
    print(f"   Saved to: {output_file}")
    print(f"   Date range: {combined['Date'].min()} to {combined['Date'].max()}")
    print(f"\nColumns: {combined.columns.tolist()}")
    print("\nFirst few rows:")
    display(combined.head(10))
    print("\nMissing values:")
    print(combined.isnull().sum())
    
elif bist_index.empty:
    print("⚠️  Stock data not available. Run the stock collection cell first.")
elif macro_data.empty:
    print("⚠️  Macro data not available. Set your EVDS API key and run macro collection cell first.")
    print("   Or ensure cbrt_macroeconomic_data.csv exists in data/raw/ folder")
else:
    print("⚠️  Both datasets needed for combination.")

⚠️  Macro data file not found. Please run the macro collection cell first.
⚠️  Macro data not available. Set your EVDS API key and run macro collection cell first.
   Or ensure cbrt_macroeconomic_data.csv exists in data/raw/ folder


## Initial Data Inspection

Check the collected datasets:

In [6]:
# Check all collected datasets
print("="*60)
print("DATA COLLECTION SUMMARY")
print("="*60)

datasets = {}

# Check for macro data
macro_file = data_raw_dir / "cbrt_macroeconomic_data.csv"
if macro_file.exists():
    df_macro = pd.read_csv(macro_file)
    datasets['Macroeconomic Data'] = df_macro
    print(f"\n✅ Macroeconomic Data:")
    print(f"   File: {macro_file}")
    print(f"   Shape: {df_macro.shape}")
    print(f"   Columns: {df_macro.columns.tolist()}")
    print(f"   Date range: {df_macro['Date'].min()} to {df_macro['Date'].max()}")

# Check for stock data
stock_file = data_raw_dir / "bist_stock_prices.csv"
if stock_file.exists():
    df_stock = pd.read_csv(stock_file)
    datasets['Stock Prices'] = df_stock
    print(f"\n✅ Stock Prices Data:")
    print(f"   File: {stock_file}")
    print(f"   Shape: {df_stock.shape}")
    print(f"   Columns: {df_stock.columns.tolist()}")
    if 'Date' in df_stock.columns:
        print(f"   Date range: {df_stock['Date'].min()} to {df_stock['Date'].max()}")
    if 'Ticker' in df_stock.columns:
        print(f"   Tickers: {df_stock['Ticker'].unique()}")

# Check for combined data
combined_file = data_raw_dir / "combined_stock_macro_data.csv"
if combined_file.exists():
    df_combined = pd.read_csv(combined_file)
    datasets['Combined Data'] = df_combined
    print(f"\n✅ Combined Data:")
    print(f"   File: {combined_file}")
    print(f"   Shape: {df_combined.shape}")
    print(f"   Columns: {df_combined.columns.tolist()}")

print("\n" + "="*60)
print(f"Total datasets collected: {len(datasets)}")
print("="*60)

if len(datasets) == 0:
    print("\n⚠️  No datasets found. Please run the collection cells above.")
else:
    print("\n✅ Data collection complete! Ready for EDA.")
    print("   Next step: Run 02_eda_exploration.ipynb")

DATA COLLECTION SUMMARY

✅ Stock Prices Data:
   File: C:\Users\cihan\turkish_finance_ml\data\raw\bist_stock_prices.csv
   Shape: (62216, 9)
   Columns: ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits', 'Ticker']
   Date range: 2000-05-10 00:00:00+03:00 to 2026-01-16 00:00:00+03:00
   Tickers: ['AKBNK.IS' 'GARAN.IS' 'THYAO.IS' 'TUPRS.IS' 'SAHOL.IS' 'BIMAS.IS'
 'ARCLK.IS' 'KOZAL.IS' 'SASA.IS' 'PETKM.IS']

Total datasets collected: 1

✅ Data collection complete! Ready for EDA.
   Next step: Run 02_eda_exploration.ipynb
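In the run above only one of the three expected files exists, yet the summary still reports that collection is complete. A stricter check could compare the files on disk against the expected list. The sketch below assumes the three file names used in this notebook; `missing_datasets` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = [
    "cbrt_macroeconomic_data.csv",
    "bist_stock_prices.csv",
    "combined_stock_macro_data.csv",
]


def missing_datasets(raw_dir: Path, expected: List[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are absent from raw_dir."""
    return [name for name in expected if not (raw_dir / name).exists()]


# Example usage:
# missing = missing_datasets(Path("data/raw"))
# print("All datasets present" if not missing else f"Missing: {missing}")
```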


## Additional Data Sources (Optional)

### Kaggle Datasets
If you want pre-processed datasets, consider downloading from Kaggle:

1. **Borsa Istanbul Stock Exchange Dataset**
   - Link: https://www.kaggle.com/datasets/gokhankesler/borsa-istanbul-turkish-stock-exchange-dataset
   - Download and place in `data/raw/` folder

2. **BIST100 Turkish Stock Market**
   - Link: https://www.kaggle.com/datasets/hakanetin/bist100turkishstaockmarketturkhissefiyatlar
   - Download and place in `data/raw/` folder

### Financial Ratios Dataset (Zenodo)
- Link: https://zenodo.org/records/15551015
- Download and place in `data/raw/` folder
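Once manually downloaded files are in `data/raw/`, they can be loaded in one pass. The sketch below reads every CSV in a folder into a dictionary keyed by file stem. `load_csv_folder` is a hypothetical helper, and downloaded archives may need unzipping or different parsing options first.

```python
from pathlib import Path
from typing import Dict

import pandas as pd


def load_csv_folder(folder: Path) -> Dict[str, pd.DataFrame]:
    """Read each *.csv in `folder` into a DataFrame keyed by file stem."""
    return {path.stem: pd.read_csv(path) for path in sorted(folder.glob("*.csv"))}


# Example usage:
# frames = load_csv_folder(Path("data/raw"))
# for name, frame in frames.items():
#     print(name, frame.shape)
```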

In [7]:
# Load any additional datasets you downloaded manually
# Example:
# kaggle_data = pd.read_csv(data_raw_dir / "kaggle_dataset.csv")
# print(f"Kaggle dataset shape: {kaggle_data.shape}")

print("\n" + "="*60)
print("✅ DATA COLLECTION COMPLETE!")
print("="*60)
print("\n📋 Next Steps:")
print("   1. Review collected datasets above")
print("   2. Check data quality and completeness")
print("   3. Proceed to: 02_eda_exploration.ipynb")
print("\n💡 Tip: All data is saved in data/raw/ folder for reproducibility")


✅ DATA COLLECTION COMPLETE!

📋 Next Steps:
   1. Review collected datasets above
   2. Check data quality and completeness
   3. Proceed to: 02_eda_exploration.ipynb

💡 Tip: All data is saved in data/raw/ folder for reproducibility
