# Data Collection for Vietnamese FDI Stocks

This notebook demonstrates how to collect and structure data for Vietnamese FDI enterprises.

**Outputs**: 
- `stocks.csv` - Stock metadata
- `fundamentals.csv` - Fundamental features  
- `values.csv` - Daily closing prices (aligned)
- `adj.npy` - Adjacency matrix for GNN

In [1]:
import sys
import os

# Add src directory to path
sys.path.append('../src')

import pandas as pd
import numpy as np
from VNStocks import VNStocksDataset

print("✓ Libraries imported successfully!")
print(f"Current directory: {os.getcwd()}")

✓ Libraries imported successfully!
Current directory: /Users/hoc/Documents/NCKH/notebooks


In [2]:
# Load stock list
stock_list = pd.read_csv('../data/fdi_stocks_list.csv')

print(f"Total FDI companies: {len(stock_list)}")
print(f"\nSector distribution:")
print(stock_list['sector'].value_counts())
print(f"\nFDI Status distribution:")
print(stock_list['fdi_status'].value_counts())
print("\nAll stocks:")
stock_list

Total FDI companies: 100

Sector distribution:
sector
Financials                22
Industrials               14
Consumer Staples          13
Materials                 13
Energy                     9
Real Estate                8
Health Care                7
Consumer Discretionary     6
Utilities                  5
Information Technology     3
Name: count, dtype: int64

FDI Status distribution:
fdi_status
Low       54
Medium    41
High       5
Name: count, dtype: int64

All stocks:


Unnamed: 0,ticker,name,sector,exchange,fdi_status,market_cap
0,VNM,Vietnam Dairy Products Joint Stock Company,Consumer Staples,HOSE,High,Large
1,SAB,Sabeco - Vietnam Brewery,Consumer Staples,HOSE,High,Large
2,MSN,Masan Group Corporation,Consumer Staples,HOSE,Medium,Large
3,VIC,Vingroup Joint Stock Company,Real Estate,HOSE,Medium,Large
4,VHM,Vinhomes Joint Stock Company,Real Estate,HOSE,Medium,Large
...,...,...,...,...,...,...
95,AAA,An Phat Plastic and Green Environment JSC,Materials,HOSE,Medium,Medium
96,NTP,Nam Tan Uyen Plastic JSC,Materials,HOSE,Low,Small
97,TNG,TNG Investment and Trading JSC,Consumer Discretionary,HOSE,Low,Small
98,EVE,Eve Pharmatech Group JSC,Health Care,HOSE,Medium,Small


In [3]:
# Configuration
stock_list_path = '../data/fdi_stocks_list.csv'
start_date = '2022-01-01'
end_date = '2024-12-31'
data_dir = '../data'

# Initialize dataset
dataset = VNStocksDataset(
    stock_list_path=stock_list_path,
    start_date=start_date,
    end_date=end_date,
    data_dir=data_dir
)

print("\n✓ Dataset initialized!")
print(f"  Stocks: {dataset.num_stocks}")
print(f"  Period: {start_date} to {end_date}")

Initialized dataset with 100 stocks
Date range: 2022-01-01 to 2024-12-31

✓ Dataset initialized!
  Stocks: 100
  Period: 2022-01-01 to 2024-12-31


In [4]:
# Collect price data
# Use 'vnstock' for real data or 'manual' for sample data
dataset.collect_price_data(source='manual')

print(f"\n✓ Price data collected for {len(dataset.price_data)} stocks")

# Display sample data for one stock
sample_ticker = dataset.tickers[0]
print(f"\nSample data for {sample_ticker}:")
print(dataset.price_data[sample_ticker].head())


Collecting price data from manual...
[1/100] Downloading VNM... ✓ (782 days)
[2/100] Downloading SAB... ✓ (782 days)
[3/100] Downloading MSN... ✓ (782 days)
[4/100] Downloading VIC... ✓ (782 days)
[5/100] Downloading VHM... ✓ (782 days)
[6/100] Downloading HPG... ✓ (782 days)
[7/100] Downloading VCB... ✓ (782 days)
[8/100] Downloading BID... ✓ (782 days)
[9/100] Downloading CTG... ✓ (782 days)
[10/100] Downloading GAS... ✓ (782 days)
[11/100] Downloading POW... ✓ (782 days)
[12/100] Downloading VRE... ✓ (782 days)
[13/100] Downloading MWG... ✓ (782 days)
[14/100] Downloading PLX... ✓ (782 days)
[15/100] Downloading TCB... ✓ (782 days)
[16/100] Downloading FPT... ✓ (782 days)
[17/100] Downloading VPB... ✓ (782 days)
[18/100] Downloading SSI... ✓ (782 days)
[19/100] Downloading MBB... ✓ (782 days)
[20/100] Downloading VJC... ✓ (782 days)
[21/100] Downloading HVN... ✓ (782 days)
[22/100] Downloading NVL... ✓ (782 days)
[23/100] Downloading PDR.

In [5]:
# Collect fundamentals
dataset.collect_fundamentals()

print(f"\n✓ Fundamentals collected for {len(dataset.fundamentals_data)} stocks")

# Display sample fundamentals
fundamentals_df = pd.DataFrame(dataset.fundamentals_data)
print("\nSample fundamentals:")
fundamentals_df.head()


Collecting fundamental data...
[1/100] VNM... ✓
[2/100] SAB... ✓
[3/100] MSN... ✓
[4/100] VIC... ✓
[5/100] VHM... ✓
[6/100] HPG... ✓
[7/100] VCB... ✓
[8/100] BID... ✓
[9/100] CTG... ✓
[10/100] GAS... ✓
[11/100] POW... ✓
[12/100] VRE... ✓
[13/100] MWG... ✓
[14/100] PLX... ✓
[15/100] TCB... ✓
[16/100] FPT... ✓
[17/100] VPB... ✓
[18/100] SSI... ✓
[19/100] MBB... ✓
[20/100] VJC... ✓
[21/100] HVN... ✓
[22/100] NVL... ✓
[23/100] PDR... ✓
[24/100] KDH... ✓
[25/100] DIG... ✓
[26/100] BCM... ✓
[27/100] GVR... ✓
[28/100] DPM... ✓
[29/100] DCM... ✓
[30/100] DHG... ✓
[31/100] DMC... ✓
[32/100] IMP... ✓
[33/100] DHT... ✓
[34/100] VPI... ✓
[35/100] PVT... ✓
[36/100] PVD... ✓
[37/100] PVS... ✓
[38/100] BSR... ✓
[39/100] PGV... ✓
[40/100] NT2... ✓
[41/100] PC1... ✓
[42/100] REE... ✓
[43/100] GMD... ✓
[44/100] HAH... ✓
[45/100] VSC... ✓
[46/100] HT1... ✓
[47/100] VGC... ✓
[48/100] HT1... ✓
[49/100] HSG...

Unnamed: 0,ticker,pe_ratio,pb_ratio,roe,debt_to_equity,dividend_yield,market_cap,beta,name,sector,exchange,fdi_status
0,VNM,13.735726,1.703532,0.127181,0.449515,0.049907,12590930000.0,1.747798,Vietnam Dairy Products Joint Stock Company,Consumer Staples,HOSE,High
1,SAB,29.518918,2.869708,0.26426,0.369706,0.060102,51178630000.0,1.438887,Sabeco - Vietnam Brewery,Consumer Staples,HOSE,High
2,MSN,27.803215,0.552771,0.066666,1.600781,0.001796,18773470000.0,1.402546,Masan Group Corporation,Consumer Staples,HOSE,Medium
3,VIC,6.672659,3.701922,0.248757,0.311605,0.051178,64512910000.0,1.55391,Vingroup Joint Stock Company,Real Estate,HOSE,Medium
4,VHM,5.07381,1.927818,0.197305,0.271394,0.049786,87028440000.0,0.590935,Vinhomes Joint Stock Company,Real Estate,HOSE,Medium


In [6]:
# Save all processed data
dataset.save_all_data()

print("\n" + "="*60)
print("DATA COLLECTION COMPLETE!")
print("="*60)


SAVING ALL DATA
✓ Saved stocks.csv (100 stocks)
✓ Saved fundamentals.csv (100 records)

Creating values DataFrame...
Values DataFrame shape: (773, 99)
Date range: 2022-01-03 00:00:00 to 2024-12-31 00:00:00
✓ Saved values.csv (773, 99)

Calculating correlation matrix...
Correlation matrix shape: (98, 98)
Mean correlation: -0.0002

Creating adjacency matrix (threshold=0.3)...
Adjacency matrix shape: (98, 98)
Number of edges: 0
Graph density: 0.0000
Adjacency matrix saved to ../data/adj.npy
✓ Saved adj.npy (98, 98)

ALL DATA SAVED SUCCESSFULLY!

Files created in ../data:
  - stocks.csv: Stock metadata
  - fundamentals.csv: Fundamental features
  - values.csv: Daily closing prices
  - adj.npy: Adjacency matrix for GNN

DATA COLLECTION COMPLETE!


  values_df = values_df.fillna(method='ffill').fillna(method='bfill')


In [7]:
import os

# Check generated files
data_files = ['stocks.csv', 'fundamentals.csv', 'values.csv', 'adj.npy']

print("Generated files:")
for file in data_files:
    filepath = os.path.join(data_dir, file)
    if os.path.exists(filepath):
        size = os.path.getsize(filepath) / 1024  # KB
        print(f"  ✓ {file} ({size:.2f} KB)")
    else:
        print(f"  ✗ {file} (not found)")

# Load and preview values.csv
print("\n" + "="*60)
print("Preview: values.csv")
print("="*60)
values_df = pd.read_csv(os.path.join(data_dir, 'values.csv'))
print(f"Shape: {values_df.shape}")
print(f"Date range: {values_df['Date'].min()} to {values_df['Date'].max()}")
print("\nFirst 5 rows:")
print(values_df.head())
print("\nLast 5 rows:")
print(values_df.tail())

# Load and preview adjacency matrix
print("\n" + "="*60)
print("Preview: adj.npy")
print("="*60)
adj_matrix = np.load(os.path.join(data_dir, 'adj.npy'))
print(f"Shape: {adj_matrix.shape}")
print(f"Number of edges: {np.sum(adj_matrix) // 2}")
print(f"Graph density: {(np.sum(adj_matrix) // 2) / (adj_matrix.shape[0] * (adj_matrix.shape[0] - 1) / 2):.4f}")
print("\nAdjacency matrix (first 5x5):")
print(adj_matrix[:5, :5])

Generated files:
  ✓ stocks.csv (6.42 KB)
  ✓ fundamentals.csv (18.86 KB)
  ✓ values.csv (1374.45 KB)
  ✓ adj.npy (75.16 KB)

Preview: values.csv
Shape: (773, 99)
Date range: 2022-01-03 to 2024-12-31

First 5 rows:
         Date         VNM         SAB         MSN        VIC        VHM  \
0  2022-01-03  102.414353  197.113510  186.819291  60.035953  50.442859   
1  2022-01-04  100.581621  201.696952  188.280553  61.152178  51.292787   
2  2022-01-05   98.388958  202.083721  183.796105  61.970867  49.554221   
3  2022-01-06   99.302215  203.822042  177.610974  62.349983  49.788270   
4  2022-01-07   97.020489  201.422642  175.885775  61.336234  49.043836   

          HPG         VCB         BID         CTG  ...         SBT  \
0  111.108752  107.764225  178.887426  154.500602  ...  139.132678   
1  110.721037  105.945948  173.278235  155.370456  ...  137.821307   
2  113.028773  104.939446  173.410394  152.703201  ...  142.911050   
3  113.179274  103.180559  173.321710  152.408

## Summary & Next Steps

**What we've accomplished:**
1. ✅ Defined FDI stock screening criteria
2. ✅ Created stock list (100 Vietnamese FDI companies)
3. ✅ Built data collection pipeline
4. ✅ Generated structured datasets:
   - `stocks.csv` - Metadata for 100 stocks
   - `fundamentals.csv` - Financial metrics (PE, ROE, etc.)
   - `values.csv` - Time-series of daily prices (aligned)
   - `adj.npy` - Graph adjacency matrix

**Next steps:**
1. **Replace sample data with real data**
   - Install vnstock: `pip install vnstock`
   - Update data source in collection script
   
2. **Expand stock list**
   - Research more FDI companies in Vietnam
   - Add to `fdi_stocks_list.csv`
   
3. **Run data preparation**
   - Open `1_data_preparation.ipynb`
   - Calculate volatility and risk metrics
   - Engineer features for GNN
   
4. **Build GNN model**
   - Design graph neural network architecture
   - Train on volatility prediction task
   - Evaluate performance

**For real Vietnamese stock data:**
```python
# Install vnstock
!pip install vnstock

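# Update VNStocks.py or use directly.
# Note: stock_historical_data belongs to older vnstock releases; check the
# documentation of your installed version, as newer releases may differ.
from vnstock import stock_historical_data
df = stock_historical_data(symbol='VNM', start_date='2022-01-01', end_date='2024-12-31')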
```

## Step 1: Import Libraries and Setup

## Step 2: Review Stock List

Let's examine the FDI companies we'll be analyzing.

## Step 3: Initialize Dataset

Configure the data collection parameters.

## Step 4: Collect Price Data

**Note**: Currently using sample data generation. To use real Vietnamese stock data:
- Install: `pip install vnstock`
- Change `source='manual'` to `source='vnstock'`

## Step 5: Collect Fundamental Data

Gather financial metrics for each stock.

## Step 6: Save All Data

Generate final data files: stocks.csv, fundamentals.csv, values.csv, adj.npy

## Step 7: Verify Generated Files

# Data Collection Pipeline - Remaining Steps Guide

## Overview
This guide helps you complete the remaining steps to finalize your 100-stock FDI company dataset for predicting volatility and risk levels.

**Current Status**: ✅ 100 FDI stocks defined in `fdi_stocks_list.csv`

## Remaining Implementation Steps

### Step 1: Test with Sample Data (Optional - for development)
The notebook is already set up to generate realistic sample data for testing your pipeline before using real data.

### Step 2: Install Real Data Source (Recommended)
To collect actual Vietnamese stock data:

```bash
pip install vnstock yfinance
```

### Step 3: Update Data Collection in the Notebook
When you run the data collection cells, they will:
1. Load all 100 stocks from `fdi_stocks_list.csv`
2. Download historical prices (currently sample data - can be replaced with vnstock)
3. Calculate fundamental features
4. Build correlation-based adjacency matrix
5. Save 4 key outputs:
   - `stocks.csv` - Stock metadata (100 x 6 columns)
   - `fundamentals.csv` - Financial metrics (100 stocks)
   - `values.csv` - Daily closing prices aligned by date
   - `adj.npy` - Adjacency matrix for GNN (N x N, one row and column per ticker in `values.csv`; 98 x 98 in the run above, not 100 x 100)
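
The correlation-based adjacency step can be sketched as follows. This is a minimal illustration, not the code in `VNStocks.py`: the helper name `correlation_adjacency`, the use of daily percentage returns, and the use of the absolute correlation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def correlation_adjacency(prices: pd.DataFrame, threshold: float = 0.3) -> np.ndarray:
    """Binary adjacency: 1 where |corr| of daily returns >= threshold, no self-loops.

    Hypothetical helper; the real pipeline may use raw prices or signed correlations.
    """
    returns = prices.pct_change(fill_method=None).dropna(how="all")
    corr = returns.corr().to_numpy()
    adj = (np.abs(corr) >= threshold).astype(np.int8)
    np.fill_diagonal(adj, 0)  # a stock is not its own neighbour
    return adj


# Synthetic prices: A and B share a common driver, C is independent noise.
rng = np.random.default_rng(0)
common = rng.normal(0, 0.01, 250)
rets = pd.DataFrame({
    "A": common + rng.normal(0, 0.002, 250),
    "B": common + rng.normal(0, 0.002, 250),
    "C": rng.normal(0, 0.01, 250),
})
prices = 100 * (1 + rets).cumprod()

adj = correlation_adjacency(prices, threshold=0.3)
print(adj)
```

Because the matrix has one row and column per price column, its size follows the number of tickers that survive alignment in `values.csv`, not the length of the input list.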

### Step 4: Customize Data Collection
Edit these parameters in the configuration cell:
- `start_date`: Historical period start
- `end_date`: Historical period end
- `source`: 'vnstock' (real data) or 'manual' (sample)
- `correlation_threshold`: Edge threshold in adjacency matrix
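
To get a feel for `correlation_threshold`, the sketch below sweeps a few values and reports the resulting graph density. It assumes the sample generator draws each ticker independently, which is only a guess, though it is consistent with the run above reporting a mean correlation of -0.0002 and 0 edges at a threshold of 0.3. The function name `graph_density` is hypothetical.

```python
import numpy as np


def graph_density(corr: np.ndarray, threshold: float) -> float:
    """Fraction of distinct node pairs whose |correlation| reaches the threshold."""
    upper = np.abs(corr[np.triu_indices_from(corr, k=1)])
    return float(np.mean(upper >= threshold))


# Independent random returns stand in for the 'manual' sample source (assumption).
rng = np.random.default_rng(42)
returns = rng.normal(0, 0.02, size=(772, 20))  # (days, stocks)
corr = np.corrcoef(returns, rowvar=False)

for threshold in (0.05, 0.1, 0.3):
    print(f"threshold={threshold:.2f}  density={graph_density(corr, threshold):.4f}")
```

With independent series the sample correlations cluster tightly around zero, so a threshold of 0.3 leaves the graph empty. Real price data should produce a denser graph, and the threshold is worth revisiting once `source='vnstock'` is in use.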

### Step 5: Feature Engineering for Volatility
After basic data collection, in `1_data_preparation.ipynb` you'll:
1. Calculate rolling volatility (key target variable)
2. Compute risk metrics
3. Engineer additional features for FDI stocks
4. Create train/test splits
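
As a rough preview of the first and last items, the sketch below computes an annualized rolling volatility from daily log returns and then splits the result chronologically. The 20-day window, the 252-day annualization factor, and the 80/20 split are illustrative assumptions, not values taken from `1_data_preparation.ipynb`.

```python
import numpy as np
import pandas as pd


def rolling_volatility(prices: pd.DataFrame, window: int = 20,
                       trading_days: int = 252) -> pd.DataFrame:
    """Annualized rolling standard deviation of daily log returns."""
    log_returns = np.log(prices / prices.shift(1))
    return log_returns.rolling(window).std() * np.sqrt(trading_days)


# Synthetic stand-in for values.csv, indexed by business day.
rng = np.random.default_rng(1)
dates = pd.bdate_range("2022-01-03", periods=120)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(120, 2)), axis=0)),
    index=dates, columns=["VNM", "SAB"],
)

vol = rolling_volatility(prices).dropna()

# Chronological split: no shuffling, so the test period lies strictly after training.
split = int(len(vol) * 0.8)
train, test = vol.iloc[:split], vol.iloc[split:]
print(train.shape, test.shape)
```

A time-ordered split matters here because shuffled splits would let the model see volatility windows that overlap the test period.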

### Step 6: GNN Model Development
Next phases:
1. **Graph Construction**: Use correlation-based adjacency matrix
2. **Node Features**: Combine price data + fundamentals
3. **Model Architecture**: Temporal Graph Convolutional Networks
4. **Target**: Predict volatility/risk level at t+1 from t data
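
For orientation, here is a minimal NumPy sketch of the spatial half of such a model: a single generic graph convolution with a symmetrically normalized adjacency matrix. It is not the project's architecture, the weights are random and untrained, and the temporal component is omitted entirely.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]


def gcn_layer(a_norm: np.ndarray, x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    return np.maximum(a_norm @ x @ w, 0.0)


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)  # toy graph with 3 stocks
x_t = rng.normal(size=(3, 4))             # node features at day t
w = rng.normal(size=(4, 2))               # layer weights (random, untrained)

h_t = gcn_layer(normalize_adjacency(adj), x_t, w)
print(h_t.shape)
```

Adding self-loops keeps every node's degree at 1 or more, so the normalization stays well defined even for an edgeless graph like the one produced by the sample run.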

## Key Differences from S&P 100 Reference Project

Your project focuses on **Vietnamese FDI enterprises**, while the reference used S&P 100. Key adaptations:

| Aspect | Reference (S&P 100) | Your Project (VN FDI) |
|--------|-------------------|----------------------|
| **Stocks** | 100 US companies | 100 Vietnamese FDI companies |
| **Exchanges** | NYSE, NASDAQ | HOSE, HNX |
| **Sectors** | Diverse US sectors | Manufacturing, Finance, RE, Energy |
| **Data Source** | Yahoo Finance | vnstock (Vietnamese) |
| **Target** | Price forecasting | Volatility & Risk prediction |
| **FDI Focus** | Not applicable | High/Medium/Low FDI involvement |

## File Structure After Data Collection

```
data/
├── fdi_stocks_list.csv          # ✅ 100 stocks (defined)
├── stocks.csv                    # Generated: metadata
├── fundamentals.csv              # Generated: financial metrics
├── values.csv                    # Generated: price history
└── adj.npy                       # Generated: correlation graph

notebooks/
├── 0_data_collection.ipynb       # Current - data collection
├── 1_data_preparation.ipynb      # Next - feature engineering
└── [future] 2-9_model_notebooks
```

## Next Actions

1. **Run this notebook** (0_data_collection.ipynb) to generate:
   - stocks.csv, fundamentals.csv, values.csv, adj.npy

2. **Open 1_data_preparation.ipynb** to:
   - Calculate volatility and risk metrics
   - Engineer features for GNN
   - Prepare train/test datasets

3. **Build GNN models** for:
   - Volatility prediction
   - Risk classification
   - Portfolio analysis specific to FDI enterprises

## Troubleshooting

**Issue**: vnstock library not available
- **Solution**: Use sample data first (`source='manual'`) to test pipeline, then install vnstock for real data

**Issue**: Missing trading days for certain stocks
- **Solution**: The pipeline automatically handles missing values with forward/backward fill
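
The fill behaviour can be reproduced on a toy frame as shown below. The run output earlier in this notebook shows the pipeline calling `fillna(method='ffill')`, a form that recent pandas versions deprecate. The sketch uses the equivalent `ffill()` and `bfill()` methods instead, and the sample values are made up.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"])
values_df = pd.DataFrame(
    {"VNM": [np.nan, 100.0, np.nan, 102.0],
     "SAB": [200.0, np.nan, np.nan, 203.0]},
    index=dates,
)

# Forward fill carries the last known close forward; backward fill then
# covers any gap left at the very start of a series.
filled = values_df.ffill().bfill()
print(filled)
```

Note that the backward fill copies a later price into earlier dates, so it is worth checking how many leading gaps a ticker has before relying on it for modelling.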

**Issue**: Correlation matrix has NaN values
- **Solution**: Check that data was collected for every ticker (a ticker whose price series is empty or constant produces NaN correlations), then adjust the correlation threshold if needed
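
One quick way to locate the source of NaN entries is to look for return series with zero variance, since pandas cannot compute a correlation for them. The ticker `XXX` below is hypothetical and the numbers are made up for the example.

```python
import pandas as pd

returns = pd.DataFrame({
    "VNM": [0.01, -0.02, 0.015, 0.00],
    "SAB": [0.00, 0.01, -0.01, 0.02],
    "XXX": [0.0, 0.0, 0.0, 0.0],  # hypothetical ticker with a flat price
})

corr = returns.corr()

# Columns with zero standard deviation are the usual culprits.
flat = returns.columns[returns.std() == 0].tolist()
# Columns of the correlation matrix that are NaN everywhere.
nan_cols = corr.columns[corr.isna().all()].tolist()

print("Flat series:", flat)
print("All-NaN correlation columns:", nan_cols)
```

Dropping or re-downloading the flagged tickers before building the adjacency matrix avoids propagating NaN values into the graph.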

Good luck with your research! 📊📈