# Data Collection for Vietnamese FDI Stocks

This notebook collects and structures price and fundamental data for Vietnamese FDI (foreign direct investment) enterprises.

**Outputs**:
- `stocks.csv` - Stock metadata
- `fundamentals.csv` - Fundamental features
- `values.csv` - Daily closing prices (aligned)
- `adj.npy` - Adjacency matrix for GNN

In [None]:
import sys
import os

# Add src directory to path
sys.path.append('../src')

import pandas as pd
import numpy as np
from VNStocks import VNStocksDataset

print("✓ Libraries imported successfully!")
print(f"Current directory: {os.getcwd()}")

In [None]:
# Load stock list
stock_list = pd.read_csv('../data/fdi_stocks_list.csv')

print(f"Total FDI companies: {len(stock_list)}")
print("\nSector distribution:")
print(stock_list['sector'].value_counts())
print("\nFDI Status distribution:")
print(stock_list['fdi_status'].value_counts())
print("\nAll stocks:")
stock_list
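
Before going further, it can help to check the list for structural problems. The sketch below assumes a `ticker` column alongside the `sector` and `fdi_status` columns used above. The `ticker` column name and the helper `check_stock_list` are assumptions for illustration, not names confirmed by `fdi_stocks_list.csv`.

```python
import pandas as pd

def check_stock_list(df, required=("ticker", "sector", "fdi_status")):
    """Return a list of human-readable problems found in the stock list."""
    problems = []
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    dupes = df.loc[df["ticker"].duplicated(), "ticker"].tolist()
    if dupes:
        problems.append(f"duplicate tickers: {dupes}")
    n_null = int(df[list(required)].isna().sum().sum())
    if n_null:
        problems.append(f"{n_null} empty cells in required columns")
    return problems

# Toy rows with placeholder tickers (not taken from the real list)
demo = pd.DataFrame({
    "ticker": ["T1", "T2", "T1"],
    "sector": ["Food", None, "Food"],
    "fdi_status": ["FDI", "FDI", "FDI"],
})
issues = check_stock_list(demo)
print(issues)
```

An empty list means none of these particular checks failed; it does not say anything about whether the tickers themselves are valid.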

In [None]:
# Configuration
stock_list_path = '../data/fdi_stocks_list.csv'
start_date = '2022-01-01'
end_date = '2024-12-31'
data_dir = '../data'

# Initialize dataset
dataset = VNStocksDataset(
    stock_list_path=stock_list_path,
    start_date=start_date,
    end_date=end_date,
    data_dir=data_dir
)

print("\n✓ Dataset initialized!")
print(f"  Stocks: {dataset.num_stocks}")
print(f"  Period: {start_date} to {end_date}")

In [None]:
# Collect price data
# source='manual' generates sample data; source='vnstock' fetches real data
# (requires `pip install vnstock`)
dataset.collect_price_data(source='manual')

print(f"\n✓ Price data collected for {len(dataset.price_data)} stocks")

# Display sample data for one stock
sample_ticker = dataset.tickers[0]
print(f"\nSample data for {sample_ticker}:")
print(dataset.price_data[sample_ticker].head())
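
`values.csv` is described as holding aligned daily closing prices. One way such an alignment could work is sketched below, assuming each per-ticker frame has `Date` and `Close` columns. Those column names and the helper `align_closes` are assumptions; the alignment inside `VNStocksDataset` may differ.

```python
import pandas as pd

def align_closes(price_data, date_col="Date", close_col="Close"):
    """Build one wide frame of closes, keeping only dates shared by all tickers."""
    series = {
        ticker: df.set_index(date_col)[close_col]
        for ticker, df in price_data.items()
    }
    # axis=1 gives one column per ticker; join="inner" drops unshared dates
    wide = pd.concat(series, axis=1, join="inner").sort_index()
    wide.index.name = date_col
    return wide

# Toy frames with placeholder tickers and made-up prices
demo_prices = {
    "T1": pd.DataFrame({"Date": ["2022-01-03", "2022-01-04", "2022-01-05"],
                        "Close": [10.0, 10.5, 10.2]}),
    "T2": pd.DataFrame({"Date": ["2022-01-04", "2022-01-05", "2022-01-06"],
                        "Close": [20.0, 19.5, 19.8]}),
}
aligned = align_closes(demo_prices)
print(aligned)
```

An inner join is the strictest choice: a single ticker with a short history shrinks the whole panel. An outer join followed by forward-filling is a common alternative when that loss is too large.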

In [None]:
# Collect fundamentals
dataset.collect_fundamentals()

print(f"\n✓ Fundamentals collected for {len(dataset.fundamentals_data)} stocks")

# Display sample fundamentals
fundamentals_df = pd.DataFrame(dataset.fundamentals_data)
print("\nSample fundamentals:")
fundamentals_df.head()

In [None]:
# Save all processed data
dataset.save_all_data()

print("\n" + "="*60)
print("DATA COLLECTION COMPLETE!")
print("="*60)

In [None]:
# Check generated files (os, pd, np and data_dir come from earlier cells)
data_files = ['stocks.csv', 'fundamentals.csv', 'values.csv', 'adj.npy']

print("Generated files:")
for file in data_files:
    filepath = os.path.join(data_dir, file)
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024  # KB
        print(f"  ✓ {file} ({size:.2f} KB)")
    else:
        print(f"  ✗ {file} (not found)")

# Load and preview values.csv
print("\n" + "="*60)
print("Preview: values.csv")
print("="*60)
values_df = pd.read_csv(os.path.join(data_dir, 'values.csv'))
print(f"Shape: {values_df.shape}")
print(f"Date range: {values_df['Date'].min()} to {values_df['Date'].max()}")
print("\nFirst 5 rows:")
print(values_df.head())
print("\nLast 5 rows:")
print(values_df.tail())

# Load and preview adjacency matrix
print("\n" + "="*60)
print("Preview: adj.npy")
print("="*60)
adj_matrix = np.load(os.path.join(data_dir, 'adj.npy'))
num_nodes = adj_matrix.shape[0]
# Count each undirected edge once (assumes a symmetric matrix)
num_edges = int(np.count_nonzero(np.triu(adj_matrix, k=1)))
max_edges = num_nodes * (num_nodes - 1) / 2
print(f"Shape: {adj_matrix.shape}")
print(f"Number of edges: {num_edges}")
print(f"Graph density: {num_edges / max_edges:.4f}")
print("\nAdjacency matrix (first 5x5):")
print(adj_matrix[:5, :5])
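
The edge and density figures in the preview are only meaningful if `adj.npy` is square, symmetric, and free of self-loops. The helper below is a minimal sketch that reports those properties alongside the counts. `graph_stats` is a hypothetical name, and the demo matrix is made up.

```python
import numpy as np

def graph_stats(adj):
    """Summarize an undirected adjacency matrix."""
    adj = np.asarray(adj)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("adjacency matrix must be square")
    n = adj.shape[0]
    symmetric = bool(np.array_equal(adj, adj.T))
    self_loops = int(np.count_nonzero(np.diag(adj)))
    # Nonzero entries strictly above the diagonal: each undirected edge once
    edges = int(np.count_nonzero(np.triu(adj, k=1)))
    max_edges = n * (n - 1) // 2
    density = edges / max_edges if max_edges else 0.0
    return {"symmetric": symmetric, "self_loops": self_loops,
            "edges": edges, "density": density}

# Three nodes, with node 0 connected to nodes 1 and 2
demo_adj = np.array([[0, 1, 1],
                     [1, 0, 0],
                     [1, 0, 0]])
stats = graph_stats(demo_adj)
print(stats)
```

If `symmetric` comes back `False` or `self_loops` is nonzero, the edge count should be read with care, since it only looks at the upper triangle.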

## Summary & Next Steps

**What we've accomplished:**
1. ✅ Defined FDI stock screening criteria
2. ✅ Created stock list (15 Vietnamese FDI companies)
3. ✅ Built data collection pipeline
4. ✅ Generated structured datasets:
   - `stocks.csv` - Metadata for 15 stocks
   - `fundamentals.csv` - Financial metrics (PE, ROE, etc.)
   - `values.csv` - Time-series of daily prices (aligned)
   - `adj.npy` - Graph adjacency matrix

**Next steps:**
1. **Replace sample data with real data**
   - Install vnstock: `pip install vnstock`
   - Change `source='manual'` to `source='vnstock'` in the `collect_price_data` call
   
2. **Expand stock list**
   - Research more FDI companies in Vietnam
   - Add to `fdi_stocks_list.csv`
   
3. **Run data preparation**
   - Open `1_data_preparation.ipynb`
   - Calculate volatility and risk metrics
   - Engineer features for GNN
   
4. **Build GNN model**
   - Design graph neural network architecture
   - Train on volatility prediction task
   - Evaluate performance
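
As a preview of the volatility step, a common starting point is the rolling standard deviation of daily log returns, annualized by the square root of the number of trading days. The sketch below assumes a wide frame of closing prices with one column per ticker. The 20-day window and 252 trading days are assumptions, and `1_data_preparation.ipynb` may define volatility differently.

```python
import numpy as np
import pandas as pd

def rolling_volatility(closes, window=20, periods_per_year=252):
    """Annualized rolling volatility of daily log returns, per column."""
    log_returns = np.log(closes / closes.shift(1))
    return log_returns.rolling(window).std() * np.sqrt(periods_per_year)

# Made-up prices for two placeholder tickers; T2 never moves
demo_closes = pd.DataFrame({
    "T1": [100.0, 101.0, 99.0, 102.0, 103.0],
    "T2": [50.0, 50.0, 50.0, 50.0, 50.0],
})
vol = rolling_volatility(demo_closes, window=3)
print(vol)
```

The first `window` rows are `NaN` because the first return is undefined and the rolling window needs `window` valid returns before it produces a value.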

**For real Vietnamese stock data:**
```python
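# Install vnstock (notebook shell command)
!pip install vnstock

# Update VNStocks.py or call vnstock directly.
# stock_historical_data belongs to the older function-style vnstock API;
# check the installed version's documentation if this import fails.
from vnstock import stock_historical_data
df = stock_historical_data(symbol='VNM', start_date='2022-01-01', end_date='2024-12-31')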
```
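
To avoid editing the collection cell by hand on each machine, one option is to choose the `source` value based on whether `vnstock` is importable. This is a minimal sketch; `pick_price_source` is a hypothetical helper, not part of `VNStocks.py`.

```python
import importlib.util

def pick_price_source(preferred="vnstock", fallback="manual"):
    """Return `preferred` if a module of that name is installed, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback

source = pick_price_source()
print(f"Using price source: {source}")
```

The result could then be passed as `dataset.collect_price_data(source=source)`. This only checks that the package is installed, not that its API matches what `VNStocks.py` expects.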

## Step 1: Import Libraries and Setup

## Step 2: Review Stock List

Let's examine the FDI companies we'll be analyzing.

## Step 3: Initialize Dataset

Configure the data collection parameters.

## Step 4: Collect Price Data

**Note**: Currently using sample data generation. To use real Vietnamese stock data:
- Install: `pip install vnstock`
- Change `source='manual'` to `source='vnstock'`

## Step 5: Collect Fundamental Data

Gather financial metrics for each stock.

## Step 6: Save All Data

Generate final data files: `stocks.csv`, `fundamentals.csv`, `values.csv`, `adj.npy`.

## Step 7: Verify Generated Files