# Data Collection for Vietnamese FDI Stocks

This notebook collects and structures price data on Vietnamese FDI enterprises, for use in volatility prediction and multi-algorithm comparison.

**Inputs**: 
- `fdi_stocks_list.csv` - Stock metadata (100 FDI companies)

**Outputs**: 
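- `values.csv` - Daily time series with engineered features, indexed by (Symbol, Date)
- `adj.npy` - Correlation-based adjacency matrix
- `fdi_prices.csv`, `fundamentals.csv` - Listed as outputs (daily closing prices; financial metrics such as PE and ROE) but not produced by the current run, since fundamentals collection is not implemented in `VNStocksDataset`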

In [40]:
import sys
import os

# Add src directory to path
sys.path.insert(0, '../src')  # Use insert to ensure our module is loaded first

# Force reload of VNStocks module in case of previous imports
if 'VNStocks' in sys.modules:
    del sys.modules['VNStocks']
if 'utils' in sys.modules:
    del sys.modules['utils']

import pandas as pd
import numpy as np
from VNStocks import VNStocksDataset

print("✓ Libraries imported successfully!")
print(f"Current directory: {os.getcwd()}")

✓ Libraries imported successfully!
Current directory: /Users/hoc/Documents/NCKH/notebooks


In [41]:
# Load stock list
stock_list = pd.read_csv('../data/raw/fdi_stocks_list.csv')

print(f"Total FDI companies: {len(stock_list)}")
print("\nSector distribution:")
print(stock_list['sector'].value_counts())
print("\nFDI Status distribution:")
print(stock_list['fdi_status'].value_counts())
print("\nAll stocks:")
stock_list

Total FDI companies: 100

Sector distribution:
sector
Financials                22
Industrials               14
Consumer Staples          13
Materials                 13
Energy                     9
Real Estate                8
Health Care                7
Consumer Discretionary     6
Utilities                  5
Information Technology     3
Name: count, dtype: int64

FDI Status distribution:
fdi_status
Low       54
Medium    41
High       5
Name: count, dtype: int64

All stocks:


| | ticker | name | sector | exchange | fdi_status | market_cap |
|---|--------|------|--------|----------|------------|------------|
| 0 | VNM | Vietnam Dairy Products Joint Stock Company | Consumer Staples | HOSE | High | Large |
| 1 | SAB | Sabeco - Vietnam Brewery | Consumer Staples | HOSE | High | Large |
| 2 | MSN | Masan Group Corporation | Consumer Staples | HOSE | Medium | Large |
| 3 | VIC | Vingroup Joint Stock Company | Real Estate | HOSE | Medium | Large |
| 4 | VHM | Vinhomes Joint Stock Company | Real Estate | HOSE | Medium | Large |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | AAA | An Phat Plastic and Green Environment JSC | Materials | HOSE | Medium | Medium |
| 96 | NTP | Nam Tan Uyen Plastic JSC | Materials | HOSE | Low | Small |
| 97 | TNG | TNG Investment and Trading JSC | Consumer Discretionary | HOSE | Low | Small |
| 98 | EVE | Eve Pharmatech Group JSC | Health Care | HOSE | Medium | Small |


In [42]:
# Configuration
stock_list_path = '../data/raw/fdi_stocks_list.csv'
start_date = '2022-01-01'
end_date = '2024-12-31'
raw_dir = '../data/raw'
processed_dir = '../data/processed'
# Use processed_dir as default data directory for saved outputs
data_dir = processed_dir

# Initialize dataset
dataset = VNStocksDataset(
    stock_list_path=stock_list_path,
    start_date=start_date,
    end_date=end_date,
    raw_dir=raw_dir,
    processed_dir=processed_dir
)

print("\n✓ Dataset initialized!")

Initialized: 98 stocks, 2022-01-01 to 2024-12-31

✓ Dataset initialized!


In [43]:
# Collect price data
# Use 'vnstock' for real data or 'manual' for sample data
dataset.collect_price_data(source='manual')

print(f"\n✓ Price data collected for {len(dataset.price_data)} stocks")

# Display sample data for one stock
sample_ticker = dataset.tickers[0]
print(f"\nSample data for {sample_ticker}:")
print(dataset.price_data[sample_ticker].head())


[1/3] Collecting price data...
  [  1/98] VNM: ✓
  [  2/98] SAB: ✓
  [  3/98] MSN: ✓
  [  4/98] VIC: ✓
  [  5/98] VHM: ✓
  [  6/98] HPG: ✓
  [  7/98] VCB: ✓
  [  8/98] BID: ✓
  [  9/98] CTG: ✓
  [ 10/98] GAS: ✓
  [ 11/98] POW: ✓
  [ 12/98] VRE: ✓
  [ 13/98] MWG: ✓
  [ 14/98] PLX: ✓
  [ 15/98] TCB: ✓
  [ 16/98] FPT: ✓
  [ 17/98] VPB: ✓
  [ 18/98] SSI: ✓
  [ 19/98] MBB: ✓
  [ 20/98] VJC: ✓
  [ 21/98] HVN: ✓
  [ 22/98] NVL: ✓
  [ 23/98] PDR: ✓
  [ 24/98] KDH: ✓
  [ 25/98] DIG: ✓
  [ 26/98] BCM: ✓
  [ 27/98] GVR: ✓
  [ 28/98] DPM: ✓
  [ 29/98] DCM: ✓
  [ 30/98] DHG: ✓
  [ 31/98] DMC: ✓
  [ 32/98] IMP: ✓
  [ 33/98] DHT: ✓
  [ 34/98] VPI: ✓
  [ 35/98] PVT: ✓
  [ 36/98] PVD: ✓
  [ 37/98] PVS: ✓
  [ 38/98] BSR: ✓
  [ 39/98] PGV: ✓
  [ 40/98] NT2: ✓
  [ 41/98] PC1: ✓
  [ 42/98] REE: ✓
  [ 43/98] GMD: ✓
  [ 44/98] HAH: ✓
  [ 45/98] VSC: ✓
  [ 46/98] HT1: ✓
  [ 47/98] VGC: ✓
  [ 48/98] HSG: ✓
  [ 49/98] NKG: ✓
  [ 50/98] TLG: ✓
  [ 51/98] DTL: ✓
  [ 52/98] CMG: ✓
  [ 53/98] VGI: ✓
  [ 54/98] STB

In [44]:
# Fundamentals collection is not implemented in VNStocksDataset.
# Skipping this step; proceed directly to processing price data.
print("Skipping fundamentals collection (not implemented in VNStocksDataset).")

Skipping fundamentals collection (not implemented in VNStocksDataset).


In [45]:
# Save processed outputs (values.csv, adj.npy)
dataset.process_and_save()

print("\n" + "="*60)
print("DATA COLLECTION COMPLETE!")
print("="*60)


[2/3] Processing data...

[3/3] Saving...
  ✓ values.csv: (75754, 9)
    Columns: ['Close', 'NormClose', 'DailyLogReturn', 'ALR1W', 'ALR2W', 'ALR1M', 'ALR2M', 'RSI', 'MACD']
Adjacency matrix saved to ../data/processed/adj.npy
  ✓ adj.npy: (98, 98)

✓ Complete!

DATA COLLECTION COMPLETE!


In [46]:
import os
import numpy as np
import pandas as pd

# Use processed_dir defined earlier; fallback to default
if 'processed_dir' not in globals():
    processed_dir = '../data/processed'

print("Generated files:")
files = ['values.csv', 'adj.npy', 'fdi_prices.csv', 'fundamentals.csv']
for file in files:
    filepath = os.path.join(processed_dir, file)
    exists = os.path.exists(filepath)
    size = os.path.getsize(filepath) / 1024 / 1024 if exists else 0
    print(f"  {file}: {'FOUND' if exists else 'MISSING'} ({size:.2f} MB)")

# Optional: load and preview key outputs if they exist
if os.path.exists(os.path.join(processed_dir, 'values.csv')):
    values_df = pd.read_csv(os.path.join(processed_dir, 'values.csv'), index_col=[0, 1])
    print(f"\nvalues.csv shape: {values_df.shape}")
    print("values sample:")
    print(values_df.head())

if os.path.exists(os.path.join(processed_dir, 'adj.npy')):
    correlation_matrix = np.load(os.path.join(processed_dir, 'adj.npy'))
    print(f"\nadj.npy shape: {correlation_matrix.shape}")
    if correlation_matrix.size > 0:
        print(f"Number of edges: {np.count_nonzero(correlation_matrix)}")
        print(f"Graph density: {np.count_nonzero(correlation_matrix) / correlation_matrix.size:.6f}")

Generated files:
  values.csv: FOUND (13.72 MB)
  adj.npy: FOUND (0.07 MB)
  fdi_prices.csv: MISSING (0.00 MB)
  fundamentals.csv: MISSING (0.00 MB)

values.csv shape: (75754, 9)
values sample:
                        Close  NormClose  DailyLogReturn    ALR1W     ALR2W  \
Symbol Date                                                                   
VNM    2022-01-03  102.414353  -1.719669       -0.018057 -1.21506  1.026645   
       2022-01-04  100.581621  -1.785634       -0.018057 -1.21506  1.026645   
       2022-01-05   98.388958  -1.864554       -0.022041 -1.21506  1.026645   
       2022-01-06   99.302215  -1.831683        0.009239 -1.21506  1.026645   
       2022-01-07   97.020489  -1.913809       -0.023246 -1.21506  1.026645   

                      ALR1M     ALR2M        RSI      MACD  
Symbol Date                                                 
VNM    2022-01-03  0.660087  0.644657  51.054457  0.000000  
       2022-01-04  0.660087  0.644657  51.054457 -0.041119  
       2

## Expected Data Output Format

**This notebook generates data matching the [SP100AnalysisWithGNNs](https://github.com/timothewt/SP100AnalysisWithGNNs) structure:**

### 1. `values.csv` - Time-series with Symbol/Date Multi-Index
- **Index**: (Symbol, Date)  
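- **Columns**: 
  - `Close`: Closing price
  - `NormClose`: Normalized close (z-score)
  - `DailyLogReturn`: Daily log return
  - `ALR1W`, `ALR2W`, `ALR1M`, `ALR2M`: Return features over 1-week, 2-week, 1-month, and 2-month horizons
  - `RSI`: 14-period Relative Strength Index
  - `MACD`: Moving Average Convergence Divergence
- **Shape**: (number of Symbol/Date rows, number of features); (75754, 9) in the run above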
- **Format**: CSV with multi-index
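
The feature columns listed above can be sketched with pandas as follows. This is a minimal illustration, not the `VNStocksDataset` implementation: the helper name `add_features`, the simple-moving-average variant of RSI, and the 12/26 EMA spans for MACD are all assumptions, and the prices are synthetic.

```python
import numpy as np
import pandas as pd


def add_features(close: pd.Series, rsi_period: int = 14) -> pd.DataFrame:
    """Hypothetical helper: derive feature columns from one stock's closing prices."""
    out = pd.DataFrame({"Close": close})
    # Z-score of the closing price over the whole sample
    out["NormClose"] = (close - close.mean()) / close.std()
    # Log return between consecutive trading days (first row is NaN)
    out["DailyLogReturn"] = np.log(close / close.shift(1))
    # RSI from simple moving averages of gains and losses (one common variant)
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_period).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_period).mean()
    out["RSI"] = 100 - 100 / (1 + avg_gain / avg_loss)
    # MACD line: fast EMA minus slow EMA (12 and 26 periods are assumed spans)
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    out["MACD"] = ema_fast - ema_slow
    return out


# Synthetic prices for illustration only
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", periods=60)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 60))), index=dates)
features = add_features(prices)
print(features.tail())
```

With `adjust=False`, both EMAs start at the first closing price, so the first MACD value is zero, which is consistent with the `values.csv` sample shown earlier.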

### 2. `adj.npy` - Stock Correlation Adjacency Matrix
- **Shape**: (N, N) square matrix with one row and column per ticker; (98, 98) in the run above
- **Values**: Correlation weights between stocks
- **Edges**: Pairs with correlation above 0.3 (the threshold is adjustable)
- **Structure**: Matches node ordering in values.csv
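
One way to build such a matrix is sketched below. It is an assumption-laden illustration rather than the code behind `process_and_save()`: the helper name `correlation_adjacency` is hypothetical, Pearson correlation is assumed, and zeroing the diagonal (no self-loops) is a design choice the notebook does not confirm.

```python
import numpy as np
import pandas as pd


def correlation_adjacency(log_returns: pd.DataFrame, threshold: float = 0.3) -> np.ndarray:
    """Hypothetical helper: thresholded correlation matrix with a zeroed diagonal."""
    corr = log_returns.corr().to_numpy()         # Pearson correlation between columns
    adj = np.where(corr > threshold, corr, 0.0)  # keep only weights above the threshold
    np.fill_diagonal(adj, 0.0)                   # drop self-loops (an assumption)
    return adj


# Synthetic returns: columns are tickers, rows are dates (illustration only)
rng = np.random.default_rng(42)
market = rng.normal(0, 0.01, 250)
returns = pd.DataFrame({
    "AAA": market + rng.normal(0, 0.005, 250),  # shares the market factor
    "BBB": market + rng.normal(0, 0.005, 250),  # shares the market factor
    "CCC": rng.normal(0, 0.01, 250),            # independent of the others
})
adj = correlation_adjacency(returns)
print(adj.shape)
print(np.count_nonzero(adj))
```

Because the columns of `log_returns` define the node order, keeping them in the same ticker order as `values.csv` preserves the alignment described above.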

### Data Pipeline Flow

```
fdi_stocks_list.csv (input: 100 tickers)
         ↓
[Collect price data for each stock]
         ↓
[Align dates & create multi-index (Symbol, Date)]
         ↓
[Engineer features: NormClose, DailyLogReturn, RSI, MACD]
         ↓
values.csv (output: indexed time-series with features)
         ↓
[Calculate correlations from log returns]
         ↓
adj.npy (output: N×N adjacency matrix; 98×98 in the run above)
```

## Step 1: Import Libraries and Setup

## Step 2: Review Stock List

Let's examine the FDI companies we'll be analyzing.

## Step 3: Initialize Dataset

Configure the data collection parameters.

## Step 4: Collect Price Data

**Note**: This notebook currently uses generated sample data. To use real Vietnamese stock data:
- Install: `pip install vnstock`
- Change `source='manual'` to `source='vnstock'`

## Step 5: Collect Fundamental Data

Gather financial metrics for each stock. This step is currently skipped because fundamentals collection is not implemented in `VNStocksDataset`.

## Step 6: Save All Data

Generate the final data files. In the current run, `process_and_save()` writes `values.csv` and `adj.npy`.

## Step 7: Verify Generated Files

# Complete Data Collection Pipeline for Vietnamese FDI Stocks

## Overview

This notebook collects and processes historical price data for 100 Vietnamese FDI companies and generates two key outputs that match the **SP100AnalysisWithGNNs** reference structure:

1. **`values.csv`** - Time-series data with multi-index (Symbol, Date) + engineered features
2. **`adj.npy`** - Stock correlation adjacency matrix (N×N, one row and column per ticker; 98×98 in the run above)

## Key Design Principles

✅ **Reproducibility**: Fixed random seed for sample data; easy switch to vnstock for real data  
✅ **Alignment with Reference**: Data structure matches SP100 exactly  
✅ **Feature-Rich**: Includes NormClose, DailyLogReturn, RSI, MACD  
✅ **Graph-Ready**: Adjacency matrix built from correlations for GNN pipeline  
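
The reproducibility point can be illustrated with a seeded random-walk generator. This is a sketch under stated assumptions, not the `source='manual'` implementation: the function name `sample_prices`, the starting price of 100, the 2% daily volatility, and the per-ticker seed derivation are all hypothetical.

```python
import numpy as np
import pandas as pd


def sample_prices(ticker: str, start: str, end: str, seed: int = 42) -> pd.DataFrame:
    """Hypothetical generator: seeded random-walk closing prices on business days."""
    dates = pd.bdate_range(start, end)
    # Derive a per-ticker seed so each stock gets a different but repeatable path
    ticker_seed = seed + sum(ord(ch) for ch in ticker)
    rng = np.random.default_rng(ticker_seed)
    log_returns = rng.normal(loc=0.0, scale=0.02, size=len(dates))
    close = 100.0 * np.exp(np.cumsum(log_returns))
    return pd.DataFrame({"Close": close}, index=pd.Index(dates, name="Date"))


first = sample_prices("VNM", "2022-01-01", "2024-12-31")
second = sample_prices("VNM", "2022-01-01", "2024-12-31")
print(first.equals(second))
```

The seed is derived from character codes rather than Python's built-in `hash()`, because string hashes vary between interpreter sessions unless `PYTHONHASHSEED` is fixed.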

## Input Requirements

- **Stock List**: `data/raw/fdi_stocks_list.csv` (100 Vietnamese FDI tickers)
- **Date Range**: 2022-01-01 to 2024-12-31 (configured in notebook)
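
The stock list has 100 rows while the dataset reports 98 initialized stocks, and the notebook does not say why. A check like the following can at least rule out missing columns or repeated tickers before collection starts. The helper `check_stock_list` is hypothetical, the column names are taken from the table displayed earlier, and the duplicate row in the demo data is made up.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["ticker", "name", "sector", "exchange", "fdi_status", "market_cap"]


def check_stock_list(stock_list: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical check: require the expected columns and drop repeated tickers."""
    missing = [col for col in REQUIRED_COLUMNS if col not in stock_list.columns]
    if missing:
        raise ValueError(f"Stock list is missing columns: {missing}")
    duplicated = stock_list["ticker"].duplicated()
    if duplicated.any():
        print(f"Dropping duplicate tickers: {stock_list.loc[duplicated, 'ticker'].tolist()}")
    return stock_list.loc[~duplicated].reset_index(drop=True)


# Tiny in-memory stand-in for fdi_stocks_list.csv (the repeated VNM row is invented)
csv_text = """ticker,name,sector,exchange,fdi_status,market_cap
VNM,Vietnam Dairy Products Joint Stock Company,Consumer Staples,HOSE,High,Large
SAB,Sabeco - Vietnam Brewery,Consumer Staples,HOSE,High,Large
VNM,Vietnam Dairy Products Joint Stock Company,Consumer Staples,HOSE,High,Large
"""
clean = check_stock_list(pd.read_csv(io.StringIO(csv_text)))
print(clean["ticker"].tolist())
```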

## Expected Outputs

After running all cells:

| File | Purpose | Format |
|------|---------|--------|
| `values.csv` | Time-series with features | CSV, multi-index (Symbol, Date) |
| `adj.npy` | Correlation adjacency | NumPy binary, N×N (98×98 in the run above) |

## Usage

1. Run cells in order: **Setup → Collect Data → Save Data**
2. Verify outputs in the `data/processed/` directory
3. Proceed to `1_data_preparation.ipynb` for feature engineering
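
Beyond checking that the files exist, step 2 can also confirm that the two outputs agree with each other. The sketch below assumes the layout described in this notebook (a two-level index in `values.csv`, a square matrix in `adj.npy`). The helper `check_outputs` is hypothetical, and the demo writes throwaway files so it runs without the real outputs.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def check_outputs(processed_dir: str) -> dict:
    """Hypothetical check: compare the symbols in values.csv with the size of adj.npy."""
    values = pd.read_csv(os.path.join(processed_dir, "values.csv"), index_col=[0, 1])
    adj = np.load(os.path.join(processed_dir, "adj.npy"))
    n_symbols = values.index.get_level_values(0).nunique()
    return {
        "n_symbols": n_symbols,
        "adj_shape": adj.shape,
        "square": adj.ndim == 2 and adj.shape[0] == adj.shape[1],
        "sizes_match": adj.shape[0] == n_symbols,
    }


# Demo on temporary files (two tickers, three dates each)
with tempfile.TemporaryDirectory() as tmp:
    index = pd.MultiIndex.from_product(
        [["VNM", "SAB"], pd.bdate_range("2022-01-03", periods=3)],
        names=["Symbol", "Date"],
    )
    pd.DataFrame({"Close": np.arange(6.0)}, index=index).to_csv(os.path.join(tmp, "values.csv"))
    np.save(os.path.join(tmp, "adj.npy"), np.zeros((2, 2)))
    report = check_outputs(tmp)
print(report)
```

Against the real outputs, the call would be `check_outputs('../data/processed')`.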

## Notes

- **Current mode**: Using sample data (realistic synthetic prices)  
- **Real data mode**: Install vnstock and change `source='manual'` → `source='vnstock'`
- **Performance**: Typical run ~1-2 min with sample data; ~5-10 min with live data download