# Data Warehousing Structure Fundamentals

Understanding the fundamental structure of data warehouses is essential for designing effective analytical solutions. This section covers the core principles of dimensional modeling and data warehouse design.

## Deciding What Your Data Warehouse Will Be Used For

Before designing your data warehouse structure, determine its primary purpose:

### Common Use Cases:

1. **Business Intelligence and Reporting**
   - Standard reports
   - Ad-hoc queries
   - Executive dashboards
   - Operational reports

2. **Analytics and Data Mining**
   - Predictive analytics
   - Statistical analysis
   - Pattern recognition
   - Trend analysis

3. **Performance Management**
   - KPIs and metrics
   - Scorecards
   - Balanced scorecards
   - Performance monitoring

4. **Decision Support**
   - What-if analysis
   - Scenario planning
   - Forecasting
   - Strategic planning

### Design Implications:
- **Business Intelligence and Reporting**: Pre-aggregated data, star schemas
- **Analytics and Data Mining**: Detailed (atomic) data, flexible structures
- **Performance Management**: Pre-calculated metrics, optimized queries
- **Decision Support**: Flexible models, drill-down capabilities

## The Basic Principles of Dimensionality

**Dimensionality** is the foundation of dimensional modeling. It involves organizing data around business processes and measurements.

### Key Principles:

1. **Business Process Focus**
   - Model around business processes (sales, orders, shipments)
   - Not around departments or applications
   - Each process is typically modeled as its own fact table

2. **Grain Definition**
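   - Declare the grain: exactly what one fact table row represents, at the lowest level of detail needed
   - One row in fact table = one measurement event
   - Critical for accurate analysis, because mixing grains in one fact table leads to double counting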

3. **Dimension Tables**
   - Contain descriptive attributes
   - Provide context for measurements
   - Enable filtering and grouping

4. **Fact Tables**
   - Contain measurements (facts)
   - Foreign keys to dimensions
   - Numeric values, ideally additive

5. **Additivity**
   - Facts should be additive across dimensions
   - Some facts are semi-additive (balances)
   - Some are non-additive (ratios, percentages)
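The grain principle above can be checked mechanically. The following is a minimal sketch using Python's standard-library `sqlite3` module; the table layout, the `order_number` and `line_number` columns, and the assumed grain (one row per order line) are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical fact table; assumed grain: one row per order line.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        order_number  TEXT,
        line_number   INTEGER,
        product_key   INTEGER,
        quantity_sold INTEGER,
        sales_amount  REAL
    )
    """
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
    [
        ("ORD-1", 1, 10, 2, 20.0),
        ("ORD-1", 2, 11, 1, 15.0),
        ("ORD-2", 1, 10, 3, 30.0),
        ("ORD-2", 1, 10, 3, 30.0),  # duplicate load: violates the declared grain
    ],
)

# Any (order_number, line_number) pair that appears more than once breaks the grain.
grain_violations = conn.execute(
    """
    SELECT order_number, line_number, COUNT(*) AS row_count
    FROM sales_fact
    GROUP BY order_number, line_number
    HAVING COUNT(*) > 1
    """
).fetchall()

print(grain_violations)
```

A check like this is often run after each load, since a duplicated row silently inflates every additive fact that is summed over it.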

### Dimensional Model Benefits:
- **Intuitive**: Matches how business users think
- **Performance**: Optimized for queries
- **Flexible**: Easy to add new dimensions
- **Understandable**: Business-friendly structure

## Compare Facts, Fact Tables, Dimensions, and Dimension Tables

### Facts
**Definition**: Numeric measurements that represent business events

**Characteristics:**
- Numeric values
- Measurable quantities
- Business metrics
- Examples: sales amount, quantity sold, profit

**Types:**
- **Additive**: Can be summed (sales amount)
- **Semi-additive**: Can be summed across some dimensions but not others, typically not across time (account balance)
- **Non-additive**: Cannot be summed (ratios, percentages)

### Fact Tables
**Definition**: Tables that store facts and foreign keys to dimensions

**Structure:**
- Foreign keys to dimension tables
- Fact columns (measures)
- Degenerate dimensions (transaction numbers)
- Optional: date/time stamps

**Characteristics:**
- Large number of rows
- Narrow tables (few columns)
- Sparse (rows exist only for dimension combinations where an event actually occurred)
- Designed for bulk loading and fast aggregate queries

**Example:**
```
Sales_Fact
- date_key (FK)
- product_key (FK)
- customer_key (FK)
- store_key (FK)
- sales_amount
- quantity_sold
- profit
```

### Dimensions
**Definition**: Descriptive attributes that provide context for facts

**Characteristics:**
- Textual or descriptive
- Used for filtering and grouping
- Relatively stable over time
- Examples: product, customer, time, geography

**Types:**
- **Conformed Dimensions**: Shared across multiple fact tables
- **Role-Playing Dimensions**: Same dimension used multiple times (e.g., order date, ship date)
- **Junk Dimensions**: Group of flags and indicators
- **Degenerate Dimensions**: Transaction identifiers stored directly in the fact table, with no dimension table of their own
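A role-playing dimension is easiest to see in a query. This sketch, again using Python's `sqlite3`, joins one physical date dimension twice under different aliases; the `order_fact` table and its `order_date_key` and `ship_date_key` columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables for illustration only.
    CREATE TABLE date_dimension (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE order_fact (
        order_number   TEXT,
        order_date_key INTEGER REFERENCES date_dimension(date_key),
        ship_date_key  INTEGER REFERENCES date_dimension(date_key),
        sales_amount   REAL
    );
    INSERT INTO date_dimension VALUES (20240101, '2024-01-01'), (20240103, '2024-01-03');
    INSERT INTO order_fact VALUES ('ORD-2024-001', 20240101, 20240103, 99.5);
    """
)

# The same physical dimension is joined twice, once per role.
rows = conn.execute(
    """
    SELECT f.order_number,
           od.calendar_date AS order_date,
           sd.calendar_date AS ship_date
    FROM order_fact AS f
    JOIN date_dimension AS od ON od.date_key = f.order_date_key
    JOIN date_dimension AS sd ON sd.date_key = f.ship_date_key
    """
).fetchall()

print(rows)
```

In practice the two roles are often exposed as views over the single date table, so reporting tools see distinct "order date" and "ship date" dimensions.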

### Dimension Tables
**Definition**: Tables that store dimension attributes

**Structure:**
- Primary key (surrogate key)
- Natural key (business key)
- Descriptive attributes
- Hierarchies
- Slowly changing dimension (SCD) tracking columns, such as effective dates and a current-row flag

**Characteristics:**
- Smaller number of rows
- Wide tables (many columns)
- Relatively static
- Rich descriptive attributes

**Example:**
```
Product_Dimension
- product_key (PK, surrogate)
- product_id (natural key)
- product_name
- category
- brand
- price
- description
```

### Relationships:

```
Dimension Table (one row) ──< Fact Table (many rows)
        ↓                            ↓
Descriptive                  Measurements
Attributes                   and Metrics
```
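To make the relationship concrete, here is a small sketch in Python's `sqlite3` in which several fact rows reference a single dimension row, and the dimension attribute is used to group the measurements. The rows and the reduced column set are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dimension (
        product_key INTEGER PRIMARY KEY,  -- surrogate key
        product_id  TEXT,                 -- natural key
        category    TEXT
    );
    CREATE TABLE sales_fact (
        product_key   INTEGER REFERENCES product_dimension(product_key),
        sales_amount  REAL,
        quantity_sold INTEGER
    );
    INSERT INTO product_dimension VALUES (1, 'PROD-1', 'Books'), (2, 'PROD-2', 'Games');
    -- Two fact rows point at product_key 1, one at product_key 2.
    INSERT INTO sales_fact VALUES (1, 10.0, 1), (1, 12.5, 2), (2, 40.0, 1);
    """
)

# The dimension supplies the grouping attribute; the fact table supplies the measure.
sales_by_category = conn.execute(
    """
    SELECT p.category, COUNT(*) AS fact_rows, SUM(f.sales_amount) AS total_sales
    FROM sales_fact AS f
    JOIN product_dimension AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()

print(sales_by_category)
```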

## Compare Different Forms of Additivity in Facts

### 1. **Fully Additive Facts**
**Definition**: Can be summed across all dimensions

**Examples:**
- Sales amount
- Quantity sold
- Cost
- Revenue

**Characteristics:**
- Can be aggregated across any dimension
- Most common type of fact
- Support roll-up and drill-down

**SQL Example:**
```sql
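-- Fully additive: SUM is valid across any dimension.
-- The category lives in the product dimension, so join to it.
SELECT
    p.category,
    SUM(f.sales_amount) AS total_sales
FROM sales_fact AS f
JOIN product_dimension AS p
    ON p.product_key = f.product_key
GROUP BY p.category;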
```

### 2. **Semi-Additive Facts**
**Definition**: Can be summed across some dimensions but not others

**Examples:**
- Account balance (can sum across accounts, not across time)
- Inventory levels (can sum across products, not across time)
- Headcount (can sum across departments, not across time)

**Characteristics:**
- Time dimension is usually the exception
- Need special handling for time aggregations
- Use average or snapshot for time

**SQL Example:**
```sql
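-- Balances cannot be summed across time.
-- Instead, sum across accounts within a single snapshot date (here, the latest).
-- Assumes account_fact holds one row per account per date_key.
SELECT
    account_type,
    SUM(account_balance) AS closing_balance
FROM account_fact
WHERE date_key = (SELECT MAX(date_key) FROM account_fact)
GROUP BY account_type;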
```
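The effect of summing a semi-additive fact across time can be seen with a few rows. This sketch uses Python's `sqlite3` and a hypothetical daily snapshot table (one row per account per day); the `date_key` and `account_id` columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical daily snapshot: one row per account per day.
    CREATE TABLE account_fact (date_key INTEGER, account_id TEXT, account_balance REAL);
    INSERT INTO account_fact VALUES
        (20240101, 'A', 100.0), (20240101, 'B', 50.0),
        (20240102, 'A', 120.0), (20240102, 'B', 50.0);
    """
)

# Misleading: counts the same money once per snapshot day.
naive_total = conn.execute(
    "SELECT SUM(account_balance) FROM account_fact"
).fetchone()[0]

# Valid: sum across accounts within the latest snapshot only.
closing_total = conn.execute(
    """
    SELECT SUM(account_balance)
    FROM account_fact
    WHERE date_key = (SELECT MAX(date_key) FROM account_fact)
    """
).fetchone()[0]

# Also valid: average of the per-day totals.
average_daily_total = conn.execute(
    """
    SELECT AVG(daily_total)
    FROM (
        SELECT SUM(account_balance) AS daily_total
        FROM account_fact
        GROUP BY date_key
    )
    """
).fetchone()[0]

print(naive_total, closing_total, average_daily_total)
```

The naive sum is roughly double the money actually held, because each account is counted once per day.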

### 3. **Non-Additive Facts**
**Definition**: Cannot be summed across any dimension

**Examples:**
- Ratios (profit margin, conversion rate)
- Percentages (discount percentage)
- Averages (average order value)
- Unit prices

**Characteristics:**
- Must be calculated from additive facts
- Store numerator and denominator separately
- Calculate ratio at query time

**SQL Example:**
```sql
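-- Calculate the ratio from additive facts at query time
SELECT
    p.category,
    SUM(f.profit) / NULLIF(SUM(f.sales_amount), 0) AS profit_margin  -- NULLIF avoids division by zero
FROM sales_fact AS f
JOIN product_dimension AS p
    ON p.product_key = f.product_key
GROUP BY p.category;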
```
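A short numeric example shows why stored ratios should not be aggregated directly. In this `sqlite3` sketch with two invented sales rows, averaging the per-row margins gives a very different answer from dividing the summed profit by the summed sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_fact (sales_amount REAL, profit REAL);
    -- Row margins are 50% and 10%, but the second sale is nine times larger.
    INSERT INTO sales_fact VALUES (100.0, 50.0), (900.0, 90.0);
    """
)

avg_of_ratios, ratio_of_sums = conn.execute(
    """
    SELECT AVG(profit / sales_amount),       -- treats both rows as equally important
           SUM(profit) / SUM(sales_amount)   -- weights each row by its sales amount
    FROM sales_fact
    """
).fetchone()

print(round(avg_of_ratios, 2), round(ratio_of_sums, 2))
```

The ratio of sums (about 0.14) is the overall margin; the average of ratios (about 0.30) overstates it because the small, high-margin sale gets the same weight as the large one.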

### Comparison:

| Type | Sum Across Dimensions | Examples | Aggregation Method |
|------|----------------------|-----------|-------------------|
| Fully Additive | All dimensions | Sales, Quantity | SUM |
| Semi-Additive | Some dimensions | Balance, Inventory | SUM (except across time); AVG or period-end value across time |
| Non-Additive | None | Ratios, Percentages | Calculate from additive facts |

## Compare a Star Schema to a Snowflake Schema

### Star Schema
**Definition**: A dimensional model with a central fact table surrounded by denormalized dimension tables

**Structure:**
- One fact table in the center
- Multiple dimension tables around it
- Direct relationships (no intermediate tables)
- Denormalized dimensions

**Example:**
```
 Product_Dimension     Time_Dimension
                  \   /
                Sales_Fact
                  /   \
Customer_Dimension     Store_Dimension
```

**Characteristics:**
- Simple structure
- Denormalized dimensions
- Fewer joins
- Better query performance
- More storage space

**Advantages:**
- Simple to understand
- Fast queries (fewer joins)
- Easy to navigate
- Better for reporting tools

**Disadvantages:**
- More storage space
- Data redundancy
- Potential data inconsistency

### Snowflake Schema
**Definition**: A variant of the star schema in which dimension tables are normalized into multiple related tables

**Structure:**
- One fact table in the center
- Normalized dimension tables
- Hierarchical relationships
- Multiple levels of dimension tables

**Example:**
```
Category_Dimension ─── Product_Dimension ───┐
                                            │
                       Time_Dimension ───── Sales_Fact
                                            │
Region_Dimension ───── Customer_Dimension ──┘
```

**Characteristics:**
- Normalized structure
- Hierarchical dimensions
- More joins required
- Less storage space
- More complex queries

**Advantages:**
- Less storage space
- Less data redundancy
- Better data consistency
- Supports complex hierarchies

**Disadvantages:**
- More complex structure
- Slower queries (more joins)
- Harder to understand
- More complex for reporting tools

### Comparison:

| Aspect | Star Schema | Snowflake Schema |
|--------|------------|------------------|
| **Normalization** | Denormalized | Normalized |
| **Joins** | Fewer | More |
| **Performance** | Typically faster | Typically slower |
| **Storage** | More | Less |
| **Complexity** | Simple | Complex |
| **Use Case** | Reporting, BI | Complex hierarchies, storage optimization |

### When to Use Each:
- **Star Schema**: Most common, better performance, simpler
- **Snowflake Schema**: Complex hierarchies, storage constraints, normalized requirements
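The trade-off can be sketched by answering the same question against both layouts. In this `sqlite3` example the table names (`product_dim_star`, `product_dim_snow`, `category_dim`) are hypothetical; the star version stores the category name on the product row, while the snowflake version moves it to its own table and needs one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star: category is stored directly on the product dimension.
    CREATE TABLE product_dim_star (
        product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT
    );
    -- Snowflake: category is normalized into its own table.
    CREATE TABLE category_dim (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE product_dim_snow (
        product_key INTEGER PRIMARY KEY, product_name TEXT,
        category_key INTEGER REFERENCES category_dim(category_key)
    );
    CREATE TABLE sales_fact (product_key INTEGER, sales_amount REAL);

    INSERT INTO category_dim VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO product_dim_snow VALUES (10, 'Atlas', 1), (11, 'Novel', 1), (12, 'Chess', 2);
    INSERT INTO product_dim_star VALUES
        (10, 'Atlas', 'Books'), (11, 'Novel', 'Books'), (12, 'Chess', 'Games');
    INSERT INTO sales_fact VALUES (10, 30.0), (11, 20.0), (12, 25.0);
    """
)

# Star schema: one join from fact to dimension.
star_result = conn.execute(
    """
    SELECT p.category_name, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dim_star AS p ON p.product_key = f.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
    """
).fetchall()

# Snowflake schema: an extra join to reach the category name.
snowflake_result = conn.execute(
    """
    SELECT c.category_name, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dim_snow AS p ON p.product_key = f.product_key
    JOIN category_dim AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
    """
).fetchall()

print(star_result == snowflake_result, star_result)
```

Both queries return the same totals. The star version repeats the string 'Books' on every book row, which is the redundancy the snowflake version removes at the cost of the additional join.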

## Database Keys for Data Warehousing

### 1. **Surrogate Keys**
**Definition**: System-generated keys with no business meaning, assigned inside the data warehouse

**Characteristics:**
- Integer or sequential numbers
- No business meaning
- Stable and unchanging
- Used as primary keys

**Advantages:**
- Handle SCDs (Slowly Changing Dimensions)
- Independent of source systems
- Better performance (integer joins)
- Handle multiple source systems

**Example:**
```sql
CREATE TABLE product_dimension (
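    product_key INT PRIMARY KEY,  -- Surrogate key: warehouse-generated, no business meaning
    product_id VARCHAR(50),       -- Natural key: identifier from the source system
    product_name VARCHAR(100)
    -- ... further descriptive attributes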
);
```
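One reason surrogate keys help with slowly changing dimensions is that the same natural key can appear on several dimension rows, one per version. The sketch below uses `sqlite3` with a hypothetical `is_current` flag and invented category values; it is an illustration of the idea rather than a full SCD implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dimension (
        product_key INTEGER PRIMARY KEY,  -- surrogate key, unique per version
        product_id  TEXT,                 -- natural key, repeated across versions
        category    TEXT,
        is_current  INTEGER               -- hypothetical current-row flag
    );
    -- The same product appears twice because its category changed over time.
    INSERT INTO product_dimension VALUES
        (1, 'PROD-12345', 'Toys', 0),
        (2, 'PROD-12345', 'Games', 1);

    CREATE TABLE sales_fact (product_key INTEGER, sales_amount REAL);
    -- Each sale points at the version that was current when it happened.
    INSERT INTO sales_fact VALUES (1, 10.0), (2, 25.0);
    """
)

# Sales are reported under the category that applied at the time of sale.
history = conn.execute(
    """
    SELECT p.product_id, p.category, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dimension AS p ON p.product_key = f.product_key
    GROUP BY p.product_id, p.category
    ORDER BY p.category
    """
).fetchall()

print(history)
```

With the natural key as the primary key, the second version could not be stored alongside the first, and the historical category would be lost.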

### 2. **Natural Keys (Business Keys)**
**Definition**: Keys from source systems that have business meaning

**Characteristics:**
- Come from source systems
- Have business meaning
- May change over time
- Used for lookups and matching

**Example:**
- Product ID: "PROD-12345"
- Customer ID: "CUST-67890"
- Order Number: "ORD-2024-001"

### 3. **Composite Keys**
**Definition**: Keys made up of multiple columns

**Characteristics:**
- Multiple columns together form key
- Used when single column isn't unique
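- Common in fact tables, where the key is often the combination of dimension foreign keys, sometimes plus a degenerate dimension such as an order number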

**Example:**
```sql
-- Composite key in fact table, written as a table-level constraint
PRIMARY KEY (date_key, product_key, customer_key, order_number)
```

### 4. **Foreign Keys**
**Definition**: Keys that reference primary keys in other tables

**Characteristics:**
- Link fact tables to dimension tables
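- Express referential integrity (some warehouse platforms treat these constraints as informational and rely on the load process to enforce them)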
- Enable joins

**Example:**
```sql
CREATE TABLE sales_fact (
    date_key INT REFERENCES date_dimension(date_key),
    product_key INT REFERENCES product_dimension(product_key),
    customer_key INT REFERENCES customer_dimension(customer_key),
    sales_amount DECIMAL(10,2)
);
```
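Whether a foreign key is actually enforced depends on the database engine. As one concrete case, SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; the sketch below uses Python's `sqlite3` and a cut-down version of the tables above to show an orphan fact row being rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.execute("CREATE TABLE product_dimension (product_key INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dimension(product_key),
        sales_amount REAL
    )
    """
)
conn.execute("INSERT INTO product_dimension VALUES (1)")
conn.execute("INSERT INTO sales_fact VALUES (1, 10.0)")  # valid reference

try:
    # product_key 999 does not exist in the dimension table.
    conn.execute("INSERT INTO sales_fact VALUES (999, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

fact_row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(orphan_rejected, fact_row_count)
```

On platforms where constraints are not enforced, an equivalent check is usually done in the load process, for example by looking up every incoming key in the dimension before inserting the fact row.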

### Key Strategy Best Practices:
- Use surrogate keys for dimension tables
- Store natural keys for lookups
- Use foreign keys in fact tables
- Index key columns where the platform supports indexes
- Maintain referential integrity

## Summarize Data Warehousing Structure

### Key Concepts:

1. **Dimensional Modeling**: Organize data around business processes
2. **Facts**: Numeric measurements of business events
3. **Dimensions**: Descriptive attributes providing context
4. **Star vs. Snowflake**: Denormalized vs. normalized schemas
5. **Additivity**: Fully additive, semi-additive, non-additive facts
6. **Keys**: Surrogate keys for dimensions, foreign keys for relationships

### Design Principles:
- Business process focus
- Clear grain definition
- Conformed dimensions
- Appropriate additivity
- Proper key management

### Next Steps:
- Learn dimensional modeling techniques
- Understand fact table types
- Explore slowly changing dimensions
