# Dim Date - Date Dimension Table

## Objective
Create the `dim_date` table to support time-based analysis in EDA:
- Links to `fact_orders` through `date_id`
- Derived by normalizing the `inserted_at` column in orders
- Supports analysis by day, week, month, quarter, and year
- Provides holiday and working-day information for business analysis
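
The `date_id` key named in the list above is a plain integer in `YYYYMMDD` form. A minimal sketch of the conversion in both directions, using only the standard library; the helper names `to_date_id` and `from_date_id` are illustrative and do not appear in the notebook:

```python
from datetime import date, datetime

def to_date_id(ts: datetime) -> int:
    """Drop the time part and encode the calendar day as YYYYMMDD."""
    return ts.year * 10000 + ts.month * 100 + ts.day

def from_date_id(date_id: int) -> date:
    """Decode a YYYYMMDD integer back into a date (raises ValueError if invalid)."""
    return datetime.strptime(str(date_id), "%Y%m%d").date()

order_ts = datetime(2025, 8, 16, 2, 19, 59)  # an inserted_at value like those in fact_orders
key = to_date_id(order_ts)
print(key)                # 20250816
print(from_date_id(key))  # 2025-08-16
```

Because the key is derived from the date alone, every order placed on the same calendar day maps to the same `dim_date` row.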

## Workflow
1. **Import & Setup** - Set up the environment
2. **Load fact_orders** - Read data from fact_orders
3. **Extract Date Range** - Determine the date range
4. **Create Date Dimension** - Build the dim_date table
5. **Add Business Attributes** - Flag weekends, holidays, and period boundaries
6. **Load into Database** - Load into the Silver database
7. **Update fact_orders** - Add date_id to fact_orders
8. **Data Dictionary** - Generate the data dictionary
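
Step 3 pads the observed order dates by 30 days on each side before the dimension is built. A minimal sketch of that arithmetic with the standard library, using the earliest and latest order dates reported in the Cell 2 output; the function name `padded_range` and the `pad_days` parameter are assumptions for illustration:

```python
from datetime import date, timedelta

def padded_range(first_order: date, last_order: date, pad_days: int = 30):
    """Return (start, end, n_days) for a date dimension padded on both sides."""
    start = first_order - timedelta(days=pad_days)
    end = last_order + timedelta(days=pad_days)
    n_days = (end - start).days + 1  # +1 because both endpoints are included
    return start, end, n_days

start, end, n_days = padded_range(date(2021, 12, 30), date(2025, 8, 16))
print(start, end, n_days)  # 2021-11-30 2025-09-15 1386
```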

---


## Cell 1: Import and Setup


In [5]:
# Import the required libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
from datetime import datetime, timedelta
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# Database connection
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")
DB_HOST = os.getenv("DB_HOST")
DB_PORT = os.getenv("DB_PORT")
DB_SILVER = os.getenv("DB_SILVER")

# Create the Silver database connection
silver_engine = create_engine(f"mysql+pymysql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_SILVER}")

print("✅ Import & Setup completed successfully")
print(f"📊 Connected to Silver database: {DB_SILVER}")
print(f"📅 Dim Date creation started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


✅ Import & Setup completed successfully
📊 Connected to Silver database: winner_silver
📅 Dim Date creation started at: 2025-10-17 11:55:39


## Cell 2: Load fact_orders and Extract Date Range


In [6]:
# Load fact_orders to get the date range
print("=== LOADING FACT_ORDERS FOR DATE RANGE ===")

# Query date-range information from fact_orders
date_query = """
SELECT 
    MIN(inserted_at) as min_date,
    MAX(inserted_at) as max_date,
    COUNT(DISTINCT DATE(inserted_at)) as unique_days,
    COUNT(*) as total_orders
FROM fact_orders
WHERE inserted_at IS NOT NULL;
"""

date_info = pd.read_sql(date_query, silver_engine)

print("📊 Date Range Information:")
print(f"📅 Earliest order date: {date_info['min_date'].iloc[0]}")
print(f"📅 Latest order date: {date_info['max_date'].iloc[0]}")
print(f"📅 Unique days with orders: {date_info['unique_days'].iloc[0]}")
print(f"📅 Total orders: {date_info['total_orders'].iloc[0]:,}")

# Fetch sample rows to understand the structure
sample_query = """
SELECT 
    order_id,
    inserted_at,
    DATE(inserted_at) as order_date,
    YEAR(inserted_at) as order_year,
    MONTH(inserted_at) as order_month,
    DAY(inserted_at) as order_day
FROM fact_orders 
WHERE inserted_at IS NOT NULL
LIMIT 5;
"""

sample_data = pd.read_sql(sample_query, silver_engine)
print(f"\n📋 Sample data structure:")
print(sample_data)

# Determine the date range used to build dim_date
min_date = date_info['min_date'].iloc[0]
max_date = date_info['max_date'].iloc[0]

# Pad the range so it also covers days without orders
start_date = min_date.date() - timedelta(days=30)  # 30 days before the first order
end_date = max_date.date() + timedelta(days=30)    # 30 days after the last order

print(f"\n📅 Date dimension range:")
print(f"📅 Start date: {start_date}")
print(f"📅 End date: {end_date}")
print(f"📅 Total days to create: {(end_date - start_date).days + 1}")


=== LOADING FACT_ORDERS FOR DATE RANGE ===
📊 Date Range Information:
📅 Earliest order date: 2021-12-30 03:13:15
📅 Latest order date: 2025-08-16 02:19:59
📅 Unique days with orders: 503
📅 Total orders: 40,236

📋 Sample data structure:
  order_id         inserted_at  order_date  order_year  order_month  order_day
0    40616 2025-08-16 02:19:59  2025-08-16        2025            8         16
1    40615 2025-08-15 12:56:51  2025-08-15        2025            8         15
2    40614 2025-08-15 12:45:17  2025-08-15        2025            8         15
3    40613 2025-08-15 12:04:39  2025-08-15        2025            8         15
4    40612 2025-08-15 11:43:06  2025-08-15        2025            8         15

📅 Date dimension range:
📅 Start date: 2021-11-30
📅 End date: 2025-09-15
📅 Total days to create: 1386


## Cell 3: Create Date Dimension Table


In [8]:
# Create the date dimension table
print("=== CREATING DATE DIMENSION TABLE ===")

def create_date_dimension(start_date, end_date):
    """Build the dim_date table with a full set of calendar attributes."""
    
    # Build the list of dates
    date_list = []
    current_date = start_date
    
    while current_date <= end_date:
        date_list.append(current_date)
        current_date += timedelta(days=1)
    
    # Create the DataFrame with a datetime column
    df = pd.DataFrame({'date': date_list})
    df['date'] = pd.to_datetime(df['date'])  # ensure datetime dtype
    
    # Add calendar attributes
    df['date_id'] = df['date'].dt.strftime('%Y%m%d').astype(int)  # Primary key
    df['full_date'] = df['date']
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
    df['day_of_year'] = df['date'].dt.dayofyear
    df['week_of_year'] = df['date'].dt.isocalendar().week
    df['quarter'] = df['date'].dt.quarter
    
    # Weekday names (Vietnamese)
    weekday_names = ['Thứ Hai', 'Thứ Ba', 'Thứ Tư', 'Thứ Năm', 'Thứ Sáu', 'Thứ Bảy', 'Chủ Nhật']
    df['weekday_name'] = df['day_of_week'].map(lambda x: weekday_names[x])
    
    # Month names (Vietnamese)
    month_names = ['Tháng 1', 'Tháng 2', 'Tháng 3', 'Tháng 4', 'Tháng 5', 'Tháng 6',
                   'Tháng 7', 'Tháng 8', 'Tháng 9', 'Tháng 10', 'Tháng 11', 'Tháng 12']
    df['month_name'] = df['month'].map(lambda x: month_names[x-1])
    
    # Quarter name
    quarter_names = {1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4'}
    df['quarter_name'] = df['quarter'].map(quarter_names)
    
    # Year-Month format
    df['year_month'] = df['date'].dt.strftime('%Y-%m')
    df['year_quarter'] = df['date'].dt.to_period('Q').astype(str)
    
    # Business attributes
    df['is_weekend'] = df['day_of_week'].isin([5, 6])  # Saturday, Sunday
    df['is_weekday'] = ~df['is_weekend']
    df['is_month_start'] = df['date'].dt.is_month_start
    df['is_month_end'] = df['date'].dt.is_month_end
    df['is_quarter_start'] = df['date'].dt.is_quarter_start
    df['is_quarter_end'] = df['date'].dt.is_quarter_end
    df['is_year_start'] = df['date'].dt.is_year_start
    df['is_year_end'] = df['date'].dt.is_year_end
    
    # Vietnamese public holidays (main ones only). Only 2024 and 2025 are listed,
    # so is_holiday stays False for holidays in the earlier years of the range.
    # Tết keys follow the lunar new year: Mồng 1 is 2024-02-10 and 2025-01-29.
    vietnamese_holidays = {
        '2024-01-01': 'Tết Dương lịch',
        '2024-02-10': 'Tết Nguyên Đán (Mồng 1)',
        '2024-02-11': 'Tết Nguyên Đán (Mồng 2)',
        '2024-02-12': 'Tết Nguyên Đán (Mồng 3)',
        '2024-02-13': 'Tết Nguyên Đán (Mồng 4)',
        '2024-02-14': 'Tết Nguyên Đán (Mồng 5)',
        '2024-04-18': 'Giỗ Tổ Hùng Vương',
        '2024-04-30': 'Ngày Giải phóng miền Nam',
        '2024-05-01': 'Ngày Quốc tế Lao động',
        '2024-09-02': 'Ngày Quốc khánh',
        '2025-01-01': 'Tết Dương lịch',
        '2025-01-29': 'Tết Nguyên Đán (Mồng 1)',
        '2025-01-30': 'Tết Nguyên Đán (Mồng 2)',
        '2025-01-31': 'Tết Nguyên Đán (Mồng 3)',
        '2025-02-01': 'Tết Nguyên Đán (Mồng 4)',
        '2025-02-02': 'Tết Nguyên Đán (Mồng 5)',
        '2025-04-07': 'Giỗ Tổ Hùng Vương',
        '2025-04-30': 'Ngày Giải phóng miền Nam',
        '2025-05-01': 'Ngày Quốc tế Lao động',
        '2025-09-02': 'Ngày Quốc khánh'
    }
    
    df['holiday_name'] = df['date'].dt.strftime('%Y-%m-%d').map(vietnamese_holidays)
    df['is_holiday'] = df['holiday_name'].notna()
    
    # Days from start/end
    df['days_from_start'] = (df['date'] - pd.to_datetime(start_date)).dt.days
    df['days_to_end'] = (pd.to_datetime(end_date) - df['date']).dt.days
    
    # Reorder columns
    column_order = [
        'date_id', 'full_date', 'year', 'month', 'day',
        'day_of_week', 'weekday_name', 'day_of_year', 'week_of_year',
        'quarter', 'quarter_name', 'month_name',
        'year_month', 'year_quarter',
        'is_weekend', 'is_weekday', 'is_holiday', 'holiday_name',
        'is_month_start', 'is_month_end', 'is_quarter_start', 'is_quarter_end',
        'is_year_start', 'is_year_end',
        'days_from_start', 'days_to_end'
    ]
    
    return df[column_order]

# Build dim_date
dim_date = create_date_dimension(start_date, end_date)

print(f"✅ Created dim_date with {len(dim_date)} rows")
print(f"📅 Date range: {dim_date['full_date'].min()} to {dim_date['full_date'].max()}")
print(f"📋 Columns: {len(dim_date.columns)}")

print(f"\n📊 Sample data:")
print(dim_date.head(10))

print(f"\n📊 Data types:")
print(dim_date.dtypes)

print(f"\n📊 Summary statistics:")
print(f"- Weekends: {dim_date['is_weekend'].sum()} days")
print(f"- Holidays: {dim_date['is_holiday'].sum()} days")
print(f"- Unique years: {dim_date['year'].nunique()}")
print(f"- Unique quarters: {dim_date['quarter'].nunique()}")
print(f"- Unique months: {dim_date['month'].nunique()}")


=== CREATING DATE DIMENSION TABLE ===
✅ Created dim_date with 1386 rows
📅 Date range: 2021-11-30 00:00:00 to 2025-09-15 00:00:00
📋 Columns: 26

📊 Sample data:
    date_id  full_date  year  month  day  day_of_week weekday_name  \
0  20211130 2021-11-30  2021     11   30            1       Thứ Ba   
1  20211201 2021-12-01  2021     12    1            2       Thứ Tư   
2  20211202 2021-12-02  2021     12    2            3      Thứ Năm   
3  20211203 2021-12-03  2021     12    3            4      Thứ Sáu   
4  20211204 2021-12-04  2021     12    4            5      Thứ Bảy   
5  20211205 2021-12-05  2021     12    5            6     Chủ Nhật   
6  20211206 2021-12-06  2021     12    6            0      Thứ Hai   
7  20211207 2021-12-07  2021     12    7            1       Thứ Ba   
8  20211208 2021-12-08  2021     12    8            2       Thứ Tư   
9  20211209 2021-12-09  2021     12    9            3      Thứ Năm   

   day_of_year  week_of_year  
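
`week_of_year` comes from `isocalendar()`, so it is an ISO 8601 week number. Around New Year the ISO week can belong to a different year than the calendar `year` column, which matters when grouping by both. A small standard-library sketch of the effect; the `iso_year_week` helper and its label format are assumptions, not a column the notebook creates:

```python
from datetime import date

def iso_year_week(d: date) -> str:
    """Label a date with its ISO year and ISO week, e.g. '2021-W52'."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"

d = date(2022, 1, 1)               # a Saturday inside the dim_date range
print(d.year, d.isocalendar()[1])  # 2022 52  (calendar year paired with ISO week)
print(iso_year_week(d))            # 2021-W52 (ISO year keeps the week intact)
```

Grouping by `year` and `week_of_year` would place 2022-01-01 in "week 52 of 2022", alongside the last days of December 2022; grouping by the ISO year instead keeps each week contiguous.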

## Cell 4: Load dim_date into the Database


In [10]:
# Load dim_date into the Silver database
print("=== LOADING DIM_DATE TO DATABASE ===")

# Import SQLAlchemy types
from sqlalchemy import Integer, Date, String, Boolean

# Define the MySQL column types using SQLAlchemy types
dtype_dim_date = {
    'date_id': Integer,
    'full_date': Date,
    'year': Integer,
    'month': Integer, 
    'day': Integer,
    'day_of_week': Integer,
    'weekday_name': String(10),
    'day_of_year': Integer,
    'week_of_year': Integer,
    'quarter': Integer,
    'quarter_name': String(3),
    'month_name': String(10),
    'year_month': String(7),
    'year_quarter': String(7),
    'is_weekend': Boolean,
    'is_weekday': Boolean,
    'is_holiday': Boolean,
    'holiday_name': String(50),
    'is_month_start': Boolean,
    'is_month_end': Boolean,
    'is_quarter_start': Boolean,
    'is_quarter_end': Boolean,
    'is_year_start': Boolean,
    'is_year_end': Boolean,
    'days_from_start': Integer,
    'days_to_end': Integer
}

try:
    # Load into the database
    dim_date.to_sql(
        'dim_date', 
        silver_engine, 
        if_exists='replace', 
        index=False,
        dtype=dtype_dim_date
    )
    
    print("✅ Successfully loaded dim_date to Silver database")
    
    # Verify the data
    verification_query = """
    SELECT 
        COUNT(*) as total_rows,
        MIN(full_date) as min_date,
        MAX(full_date) as max_date,
        COUNT(DISTINCT year) as unique_years,
        COUNT(DISTINCT quarter) as unique_quarters,
        COUNT(DISTINCT month) as unique_months,
        SUM(is_weekend) as weekend_days,
        SUM(is_holiday) as holiday_days
    FROM dim_date;
    """
    
    verification = pd.read_sql(verification_query, silver_engine)
    print(f"\n📊 Verification Results:")
    print(f"- Total rows: {verification['total_rows'].iloc[0]:,}")
    print(f"- Date range: {verification['min_date'].iloc[0]} to {verification['max_date'].iloc[0]}")
    print(f"- Unique years: {verification['unique_years'].iloc[0]}")
    print(f"- Unique quarters: {verification['unique_quarters'].iloc[0]}")
    print(f"- Unique months: {verification['unique_months'].iloc[0]}")
    print(f"- Weekend days: {verification['weekend_days'].iloc[0]}")
    print(f"- Holiday days: {verification['holiday_days'].iloc[0]}")
    
    # Sample data from database
    sample_query = """
    SELECT * FROM dim_date 
    WHERE is_holiday = 1 OR is_weekend = 1
    ORDER BY full_date
    LIMIT 10;
    """
    
    sample_db = pd.read_sql(sample_query, silver_engine)
    print(f"\n📋 Sample data from database (holidays & weekends):")
    print(sample_db[['date_id', 'full_date', 'weekday_name', 'is_weekend', 'is_holiday', 'holiday_name']])
    
except Exception as e:
    print(f"❌ Error loading dim_date: {str(e)}")
    raise


=== LOADING DIM_DATE TO DATABASE ===
✅ Successfully loaded dim_date to Silver database

📊 Verification Results:
- Total rows: 1,386
- Date range: 2021-11-30 to 2025-09-15
- Unique years: 5
- Unique quarters: 4
- Unique months: 12
- Weekend days: 396.0
- Holiday days: 20.0

📋 Sample data from database (holidays & weekends):
    date_id   full_date weekday_name  is_weekend  is_holiday holiday_name
0  20211204  2021-12-04      Thứ Bảy           1           0         None
1  20211205  2021-12-05     Chủ Nhật           1           0         None
2  20211211  2021-12-11      Thứ Bảy           1           0         None
3  20211212  2021-12-12     Chủ Nhật           1           0         None
4  20211218  2021-12-18      Thứ Bảy           1           0         None
5  20211219  2021-12-19     Chủ Nhật           1           0         None
6  20211225  2021-12-25      Thứ Bảy           1           0         None
7  20211226  2021-12-26     Chủ Nhật       
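
The holiday lookup in Cell 3 lists only 2024 and 2025, while the dimension starts in 2021, so earlier holidays are not flagged. A minimal sketch of generating the fixed-date holidays for every year in a range with the standard library; lunar-calendar holidays (Tết Nguyên Đán, Giỗ Tổ Hùng Vương) move each year and would still need a per-year lookup. The names `FIXED_HOLIDAYS` and `fixed_holidays` are assumptions for illustration:

```python
from datetime import date

# Public holidays that fall on the same solar date every year
FIXED_HOLIDAYS = {
    (1, 1): "Tết Dương lịch",
    (4, 30): "Ngày Giải phóng miền Nam",
    (5, 1): "Ngày Quốc tế Lao động",
    (9, 2): "Ngày Quốc khánh",
}

def fixed_holidays(start: date, end: date) -> dict:
    """Map 'YYYY-MM-DD' strings to holiday names for all years in [start, end]."""
    result = {}
    for year in range(start.year, end.year + 1):
        for (month, day), name in FIXED_HOLIDAYS.items():
            d = date(year, month, day)
            if start <= d <= end:
                result[d.isoformat()] = name
    return result

holidays = fixed_holidays(date(2021, 11, 30), date(2025, 9, 15))
print(len(holidays))           # 16
print(holidays["2022-09-02"])  # Ngày Quốc khánh
```

The resulting dictionary has the same key format as `vietnamese_holidays`, so it could be merged with a hand-maintained table of lunar holidays before the `map` call.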

## Cell 5: Update fact_orders with date_id


In [11]:
# Update fact_orders with date_id
print("=== UPDATING FACT_ORDERS WITH DATE_ID ===")

try:
    # Inspect the current structure of fact_orders
    check_query = """
    DESCRIBE fact_orders;
    """
    
    fact_orders_structure = pd.read_sql(check_query, silver_engine)
    print("📋 Current fact_orders structure:")
    print(fact_orders_structure)
    
    # Check whether the date_id column already exists
    if 'date_id' not in fact_orders_structure['Field'].values:
        print("\n➕ Adding date_id column to fact_orders...")
        
        # Add the date_id column
        add_column_query = """
        ALTER TABLE fact_orders 
        ADD COLUMN date_id INT AFTER inserted_at;
        """
        
        with silver_engine.connect() as conn:
            conn.execute(text(add_column_query))
            conn.commit()
        
        print("✅ Added date_id column to fact_orders")
    else:
        print("✅ date_id column already exists in fact_orders")
    
    # Populate date_id for all records
    print("\n🔄 Updating date_id for all orders...")
    
    update_query = """
    UPDATE fact_orders 
    SET date_id = CAST(DATE_FORMAT(inserted_at, '%Y%m%d') AS UNSIGNED)
    WHERE inserted_at IS NOT NULL;
    """
    
    with silver_engine.connect() as conn:
        result = conn.execute(text(update_query))
        conn.commit()
        updated_rows = result.rowcount
    
    print(f"✅ Updated {updated_rows:,} rows with date_id")
    
    # Verify the update
    verification_query = """
    SELECT 
        COUNT(*) as total_orders,
        COUNT(date_id) as orders_with_date_id,
        MIN(date_id) as min_date_id,
        MAX(date_id) as max_date_id,
        COUNT(DISTINCT date_id) as unique_date_ids
    FROM fact_orders;
    """
    
    verification = pd.read_sql(verification_query, silver_engine)
    print(f"\n📊 Update Verification:")
    print(f"- Total orders: {verification['total_orders'].iloc[0]:,}")
    print(f"- Orders with date_id: {verification['orders_with_date_id'].iloc[0]:,}")
    print(f"- Date_id range: {verification['min_date_id'].iloc[0]} to {verification['max_date_id'].iloc[0]}")
    print(f"- Unique date_ids: {verification['unique_date_ids'].iloc[0]}")
    
    # Sample rows with date_id
    sample_query = """
    SELECT 
        order_id,
        inserted_at,
        date_id,
        total_price,
        status_name
    FROM fact_orders 
    WHERE date_id IS NOT NULL
    ORDER BY inserted_at DESC
    LIMIT 10;
    """
    
    sample_data = pd.read_sql(sample_query, silver_engine)
    print(f"\n📋 Sample orders with date_id:")
    print(sample_data)
    
    # Test the join with dim_date
    join_test_query = """
    SELECT 
        fo.order_id,
        fo.inserted_at,
        fo.date_id,
        dd.full_date,
        dd.weekday_name,
        dd.is_weekend,
        dd.is_holiday,
        dd.holiday_name,
        fo.total_price
    FROM fact_orders fo
    LEFT JOIN dim_date dd ON fo.date_id = dd.date_id
    WHERE fo.date_id IS NOT NULL
    ORDER BY fo.inserted_at DESC
    LIMIT 5;
    """
    
    join_test = pd.read_sql(join_test_query, silver_engine)
    print(f"\n🔗 Test JOIN with dim_date:")
    print(join_test)
    
except Exception as e:
    print(f"❌ Error updating fact_orders: {str(e)}")
    raise


=== UPDATING FACT_ORDERS WITH DATE_ID ===
📋 Current fact_orders structure:
                             Field          Type Null Key Default Extra
0                         order_id  varchar(100)  YES        None      
1                        system_id        bigint  YES        None      
2                          shop_id        bigint  YES        None      
3                       order_link  varchar(255)  YES        None      
4               link_confirm_order  varchar(255)  YES        None      
5                   order_currency  varchar(255)  YES        None      
6                      total_price         float  YES        None      
7   total_price_after_sub_discount           int  YES        None      
8                   total_discount           int  YES        None      
9                   total_quantity           int  YES        None      
10                    items_length  varchar(255)  YES        None      
11                             tax  varchar(255)  YES     
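
The test join in this cell uses `LEFT JOIN`, so an order whose `date_id` has no match in `dim_date` would still be returned, just with NULL date attributes. A sketch of an explicit orphan check, run here against an in-memory SQLite database as a stand-in for the MySQL Silver database; the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE fact_orders (order_id TEXT, date_id INTEGER);
    INSERT INTO dim_date VALUES (20250815, '2025-08-15'), (20250816, '2025-08-16');
    INSERT INTO fact_orders VALUES ('A1', 20250815), ('A2', 20250816), ('A3', 20991231);
""")

# Orders whose date_id has no matching row in dim_date
orphan_query = """
SELECT fo.order_id, fo.date_id
FROM fact_orders fo
LEFT JOIN dim_date dd ON fo.date_id = dd.date_id
WHERE fo.date_id IS NOT NULL AND dd.date_id IS NULL;
"""
orphans = conn.execute(orphan_query).fetchall()
print(orphans)  # [('A3', 20991231)]
conn.close()
```

The query text itself is plain SQL and should carry over to MySQL unchanged; an empty result means every populated `date_id` resolves to a dimension row.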

## Cell 6: Data Dictionary and Summary


In [12]:
# Generate the data dictionary for dim_date
print("=== GENERATING DATA DICTIONARY ===")

def get_business_meaning(column_name):
    """Get business meaning for each column"""
    business_meanings = {
        "date_id": "Primary key - Date identifier in YYYYMMDD format",
        "full_date": "Full date in DATE format",
        "year": "Year (4 digits)",
        "month": "Month (1-12)",
        "day": "Day of month (1-31)",
        "day_of_week": "Day of week (0=Monday, 6=Sunday)",
        "weekday_name": "Day name in Vietnamese",
        "day_of_year": "Day of year (1-366)",
        "week_of_year": "Week number in year (1-53)",
        "quarter": "Quarter (1-4)",
        "quarter_name": "Quarter name (Q1-Q4)",
        "month_name": "Month name in Vietnamese",
        "year_month": "Year-Month format (YYYY-MM)",
        "year_quarter": "Year-Quarter format (YYYY-Q#)",
        "is_weekend": "Boolean - Is weekend (Saturday/Sunday)",
        "is_weekday": "Boolean - Is weekday (Monday-Friday)",
        "is_holiday": "Boolean - Is Vietnamese holiday",
        "holiday_name": "Holiday name if applicable",
        "is_month_start": "Boolean - Is first day of month",
        "is_month_end": "Boolean - Is last day of month",
        "is_quarter_start": "Boolean - Is first day of quarter",
        "is_quarter_end": "Boolean - Is last day of quarter",
        "is_year_start": "Boolean - Is first day of year",
        "is_year_end": "Boolean - Is last day of year",
        "days_from_start": "Days from start of date range",
        "days_to_end": "Days to end of date range"
    }
    return business_meanings.get(column_name, "No business meaning defined")

# Build the data dictionary
dict_data = []
for col in dim_date.columns:
    col_info = {
        "table_name": "dim_date",
        "column_name": col,
        "dtype": str(dim_date[col].dtype),
        "sql_type": str(dtype_dim_date.get(col, "Not defined")),
        "null_count": dim_date[col].isnull().sum(),
        "null_pct": round(dim_date[col].isnull().mean() * 100, 2),
        "unique_count": dim_date[col].nunique(),
        "sample_values": str(dim_date[col].dropna().unique()[:3].tolist()),
        "business_meaning": get_business_meaning(col),
        "extraction_date": datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }
    dict_data.append(col_info)

data_dictionary = pd.DataFrame(dict_data)

# Display the data dictionary
print(f"Generated Data Dictionary for {len(data_dictionary)} columns")
print("\n=== DATA DICTIONARY ===")
print(data_dictionary)

# Append the data dictionary to the Excel file
excel_path = "Technical_Document/Dictionary.xlsx"
try:
    from openpyxl import load_workbook
    
    # Check whether the Excel file exists
    try:
        # Load the existing workbook
        wb = load_workbook(excel_path)
        
        # Use the active sheet
        ws = wb.active
        
        # Check whether the sheet already holds data
        if ws.max_row > 1:
            print(f"Found existing data in {excel_path}, appending new data...")
        else:
            print(f"File {excel_path} exists but is empty, adding header and data...")
            
    except FileNotFoundError:
        print(f"File {excel_path} not found, creating new file...")
        wb = None
    except Exception as e:
        print(f"Error loading {excel_path}: {str(e)}, creating new file...")
        wb = None
    
    if wb is None:
        # Create a new file with a header row
        data_dictionary.to_excel(excel_path, index=False, sheet_name='Data_Dictionary')
        print(f"✅ Created new file: {excel_path}")
    else:
        # Append to the existing file
        from openpyxl.utils.dataframe import dataframe_to_rows
        
        # Find the last row that holds data
        last_row = ws.max_row
        
        # Write the new rows starting on the next line
        for r in dataframe_to_rows(data_dictionary, index=False, header=False):
            last_row += 1
            for c_idx, value in enumerate(r, 1):
                ws.cell(row=last_row, column=c_idx, value=value)
        
        # Save the file
        wb.save(excel_path)
        print(f"✅ Appended {len(data_dictionary)} rows to: {excel_path}")
    
except Exception as e:
    print(f"❌ Error appending to Data Dictionary: {str(e)}")
    # Fallback: create a new file
    try:
        data_dictionary.to_excel(excel_path, index=False)
        print(f"✅ Created new file as fallback: {excel_path}")
    except Exception as e2:
        print(f"❌ Error creating fallback file: {str(e2)}")

# Summary Report
print(f"\n=== TRANSFORMATION SUMMARY ===")
print(f"Source: Date range from fact_orders")
print(f"Target records: {len(dim_date):,}")
print(f"Columns created: {len(dim_date.columns)}")
print(f"Target table: Silver.dim_date")
print(f"Date range: {dim_date['full_date'].min()} to {dim_date['full_date'].max()}")
print(f"Business attributes: Weekend/Holiday detection, Quarter/Month analysis")
print(f"Data Dictionary: {excel_path}")
print(f"fact_orders updated with date_id for JOIN capability")
print(f"Transformation completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print(f"\n=== EDA READINESS ===")
print(f"✅ Time series analysis ready")
print(f"✅ Seasonal pattern analysis ready") 
print(f"✅ Weekend vs weekday comparison ready")
print(f"✅ Holiday impact analysis ready")
print(f"✅ Quarter/Month/Year aggregation ready")
print(f"✅ Business calendar analysis ready")
print(f"✅ Date filtering and grouping ready")


=== GENERATING DATA DICTIONARY ===
Generated Data Dictionary for 26 columns

=== DATA DICTIONARY ===
   table_name       column_name           dtype  \
0    dim_date           date_id           int64   
1    dim_date         full_date  datetime64[ns]   
2    dim_date              year           int32   
3    dim_date             month           int32   
4    dim_date               day           int32   
5    dim_date       day_of_week           int32   
6    dim_date      weekday_name          object   
7    dim_date       day_of_year           int32   
8    dim_date      week_of_year          UInt32   
9    dim_date           quarter           int32   
10   dim_date      quarter_name          object   
11   dim_date        month_name          object   
12   dim_date        year_month          object   
13   dim_date      year_quarter          object   
14   dim_date        is_weekend            bool   
15   dim_date        is_weekday            bool   
16   dim_date        is_holiday 
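
Because this cell appends to `Dictionary.xlsx` whenever the file already exists, re-running the notebook adds the same 26 `dim_date` rows again. A minimal sketch of one way to keep the dictionary idempotent, replacing earlier rows for the same table and column before writing; it works on plain dictionaries so it runs without Excel, and `merge_dictionary_rows` is a hypothetical helper:

```python
def merge_dictionary_rows(existing, new):
    """Merge data-dictionary rows, letting new rows replace old ones
    that share the same (table_name, column_name) key."""
    merged = {(r["table_name"], r["column_name"]): r for r in existing}
    for row in new:
        merged[(row["table_name"], row["column_name"])] = row
    return list(merged.values())

existing = [
    {"table_name": "dim_date", "column_name": "date_id", "null_count": 0},
    {"table_name": "fact_orders", "column_name": "order_id", "null_count": 0},
]
new = [
    {"table_name": "dim_date", "column_name": "date_id", "null_count": 0},
    {"table_name": "dim_date", "column_name": "full_date", "null_count": 0},
]
rows = merge_dictionary_rows(existing, new)
print(len(rows))  # 3
```

With this approach the sheet would be rewritten from the merged rows rather than extended row by row.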