## 1. Dataset Overview

The dataset contains 4209 observations describing vehicle configurations and their corresponding test bench duration.

- **Target (y):** continuous, positive real-valued (test time)
- **Identifier (ID):** unique row identifier, not a predictive feature
- **Categorical features (X0–X8):** nominal, string-valued; eight columns (there is no X7) with between 4 and 47 distinct levels each
- **Binary features (X10+):** high-dimensional sparse one-hot encodings (0/1)

This structure corresponds to a static, tabular regression problem with mixed feature types.

---

## 2. Target Variable Characterization (y)

- `y` is continuous and strictly positive.
- The distribution shows heterogeneity consistent with different vehicle configurations.
- Extreme values are interpreted as real operational cases, not measurement errors.
- Skewness is present, indicating that robust loss functions may be appropriate.

**Interpretation:**  
The target distribution supports regression modeling. Outliers represent rare but valid configurations and should not be removed.
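The skewness claim can be checked descriptively without removing anything. The following is a minimal sketch on toy values (not taken from `train.csv`); the 1.5 × IQR fence is an assumed convention, used here only to flag extreme rows for review, not to drop them.

```python
import pandas as pd

# Toy stand-in for the target column; values are illustrative only.
y = pd.Series([88.5, 92.1, 95.0, 99.2, 101.7, 104.3, 109.0, 265.3], name="y")

skewness = y.skew()  # sample skewness; > 0 means a long right tail

# Tukey-style upper fence (assumed convention), used to flag rows, not to remove them.
q1, q3 = y.quantile(0.25), y.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = y[y > upper_fence]

print(f"skewness = {skewness:.2f}")
print(f"rows above upper fence ({upper_fence:.2f}): {len(flagged)}")
```

On the real data, `y` would be `df['y']`; the flagged rows stay in the dataset and only inform the choice of loss function.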

---

## 3. Feature Structure & Cardinality

### Nominal Features (X0–X8)

- Categorical variables with low to moderate cardinality (4 to 47 distinct levels per feature).
- Each feature encodes a discrete configuration choice.
- Some category levels appear infrequently, introducing sparsity at the configuration level.

### Binary Features (X10+)

- One-hot encoded configuration flags.
- High dimensionality with strong sparsity.
- Many features are zero for most observations.

**Risk identified:**  
Rare combinations of active binary flags may limit generalization.
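Sparsity and rare flags can be quantified with simple counts. The sketch below uses a toy frame with four hypothetical flag columns; the `min_rows` threshold for calling a flag "rare" is an assumption chosen for illustration.

```python
import pandas as pd

# Toy frame: four hypothetical binary flags over six rows.
flags = pd.DataFrame({
    "X10": [0, 0, 0, 0, 0, 1],
    "X11": [0, 0, 0, 0, 0, 0],
    "X12": [1, 0, 1, 1, 0, 1],
    "X13": [0, 1, 0, 0, 0, 0],
})

# Share of zero cells across the whole flag matrix.
overall_sparsity = 1 - flags.to_numpy().mean()

# Flags that never change value carry no information.
n_unique = flags.nunique()
constant_flags = n_unique[n_unique == 1].index.tolist()

# Flags that are active in fewer than min_rows rows (assumed threshold).
min_rows = 2
active_rows = flags.sum()
rare_flags = active_rows[(active_rows > 0) & (active_rows < min_rows)].index.tolist()

print(f"overall sparsity: {overall_sparsity:.2%}")
print(f"constant flags: {constant_flags}")
print(f"rare flags: {rare_flags}")
```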

---

## 4. Missingness Logic

- No explicit NaN values observed.
- Absence of a configuration option is encoded as `0`.
- Missingness is therefore structural, not stochastic.

**Consequence:**  
Zero does not mean "measured zero" but "option not selected".

No imputation is required at this stage.
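A quick way to confirm both points, no explicit NaN values and zeros that encode absence, is sketched below on a toy frame. Column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy frame; values are illustrative, not taken from train.csv.
df = pd.DataFrame({
    "X0": ["az", "t", "az"],
    "X10": [0, 1, 0],
    "X11": [0, 0, 1],
})

# Explicit missing values per column and in total.
nan_per_column = df.isna().sum()
total_nan = int(nan_per_column.sum())

# A column counts as a binary flag if every non-null value is 0 or 1.
binary_cols = [c for c in df.columns if df[c].dropna().isin([0, 1]).all()]

# For flags, 0 means "option not selected", so this is an absence rate,
# not a missing-data rate.
absence_rate = (df[binary_cols] == 0).mean()

print(f"explicit NaN cells: {total_nan}")
print(absence_rate)
```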

---

## 5. Leakage Risk Assessment

- `ID` may correlate with production order or historical testing sequences.
- Repeated or near-identical configurations may exist.
- Temporal or process coupling is possible but not observable directly from the dataset.

**Governance note:**  
Leakage risk exists conceptually and must be controlled in later validation stages.
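One descriptive probe for ordering information in `ID` is a rank correlation between `ID` and `y`. The sketch below uses toy values and computes Spearman's rho as the Pearson correlation of the ranks, which avoids any dependency beyond pandas. It is a diagnostic only, not a model fit or a feature-selection step.

```python
import pandas as pd

# Toy data; on the real dataset these would be df["ID"] and df["y"].
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "y": [90.0, 92.0, 91.0, 95.0, 99.0, 104.0],
})

# Spearman's rho equals the Pearson correlation of the rank-transformed series.
rho = df["ID"].rank().corr(df["y"].rank())

print(f"Spearman rho between ID and y: {rho:.3f}")
```

A value near zero would suggest that `ID` order carries little monotonic information about test time; a large absolute value would be a reason to investigate how IDs were assigned.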

---

## 6. Correlation & Dependency (Descriptive)

- Binary features may correlate indirectly via shared configurations.
- Correlations are descriptive only, not causal.
- Feature relevance is treated as a hypothesis, not evidence.

**Example hypothesis:**  
"Certain configuration flags appear to be associated with longer test durations."
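Such a hypothesis can be stated descriptively by comparing the target across the two values of a flag. The sketch below uses toy data and a hypothetical flag `X10`; the resulting difference is an association to be tested later, not evidence of an effect.

```python
import pandas as pd

# Toy data with one hypothetical binary flag.
df = pd.DataFrame({
    "X10": [0, 0, 1, 1, 0, 1],
    "y": [90.0, 94.0, 110.0, 106.0, 92.0, 108.0],
})

# Descriptive summary of y for flag off (0) and flag on (1).
summary = df.groupby("X10")["y"].agg(["count", "mean", "median"])
mean_difference = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print(f"difference in mean test time (flag on minus flag off): {mean_difference:.2f}")
```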

---

## 7. ML Feasibility Statement

Based on the data structure and descriptive diagnostics:

**Regression modeling is feasible.**

**Constraints:**
- High-dimensional sparsity  
- Rare configuration combinations  
- Potential implicit leakage via configuration repetition  

No fundamental data pathology prevents supervised regression.


In [1]:
"""
Deliverable 2: Minimal EDA for Regression Feasibility
=====================================================

THEORY
------
This script performs a strictly descriptive and diagnostic
Exploratory Data Analysis (EDA).

No predictive modeling is performed.
No knowledge is generated.
Only data characteristics and risks are identified.

INPUTS
------
- train.csv : tabular dataset with target y and features X*

OUTPUTS
-------
- Console summaries of:
  - dataset structure
  - feature types
  - target distribution statistics
  - cardinality diagnostics
  - configuration uniqueness
  - potential leakage indicators

TERMINAL USAGE
--------------
python minimal_eda.py train.csv

EPISTEMIC SCOPE
---------------
Allowed:
- Counting
- Describing
- Summarizing
- Diagnosing risks

Forbidden:
- Model fitting
- Feature selection
- Performance metrics
"""

import sys
import pandas as pd
from tqdm import tqdm

def main(path):
    print("Loading dataset...")
    # Load the dataset from the path given on the command line
    df = pd.read_csv(path)

    print("\n=== Dataset Overview ===")
    print(f"Dataset shape: {df.shape}")
    print(f"Rows: {df.shape[0]}")
    print(f"Columns: {df.shape[1]}")

    print("\n=== Target Summary (y): Test Duration ===")
    print(df['y'].describe())

    print("\n=== Feature Type Breakdown ===")
    print("Analyzing feature types...")
    
    # Analyze column types with progress
    categorical = []
    binary = []
    
    for col in tqdm(df.columns, desc="Identifying feature types"):
        if col.startswith("X"):
            if df[col].dtype == object:
                categorical.append(col)
            elif df[col].dropna().isin([0, 1]).all():
                binary.append(col)

    print(f"\nCategorical features (nominal): {len(categorical)}")
    print(f"Binary features (one-hot): {len(binary)}")

    # Cardinality analysis for nominal features (X0–X8)
    print("\n=== Cardinality: Nominal Features (X0–X8) ===")
    print("Unique values per categorical feature:")
    for col in tqdm(categorical, desc="Analyzing X0–X8"):
        unique_count = df[col].nunique()
        print(f"{col}: {unique_count} unique values")

    # Cardinality analysis for binary features (X10+)
    if binary:
        print("\n=== Cardinality: Binary Features (X10+) ===")
        print("Features with non-zero variance (actually used):")
        active_binary = []
        for col in tqdm(binary, desc="Analyzing X10–X368"):
            if df[col].mean() > 0:  # Feature is not always 0
                active_binary.append(col)
        
        print(f"Total active (used) binary features: {len(active_binary)}")
        print(f"Completely inactive (always 0) binary features: {len(binary) - len(active_binary)}")

    # Configuration uniqueness: Nominal features only (X0-X8)
    print("\n=== Configuration Uniqueness (X0–X8) ===")
    print("Analyzing nominal configuration only...")
    
    nominal_combinations = df[categorical].drop_duplicates()
    print(f"Total unique configurations: {len(nominal_combinations)}")
    print(f"Total rows: {df.shape[0]}")
    print(f"Configuration reuse rate: {(1 - len(nominal_combinations) / df.shape[0]) * 100:.2f}%")

    # Configuration uniqueness: Binary features only (X10-X368)
    if binary:
        print("\n=== Configuration Uniqueness (X10–X368) ===")
        print("Analyzing binary configuration only...")
        
        binary_combinations = df[binary].drop_duplicates()
        print(f"Total unique configurations: {len(binary_combinations)}")
        print(f"Total rows: {df.shape[0]}")
        print(f"Configuration reuse rate: {(1 - len(binary_combinations) / df.shape[0]) * 100:.2f}%")

    # Configuration uniqueness: Full feature combination (X0-X368)
    print("\n=== Configuration Uniqueness (X0–X368) ===")
    print("Analyzing complete configuration...")
    
    # Create a configuration tuple for each row
    config_cols = categorical + binary
    configuration_combinations = df[config_cols].drop_duplicates()
    
    print(f"Total unique configurations: {len(configuration_combinations)}")
    print(f"Total rows: {df.shape[0]}")
    print(f"Configuration reuse rate: {(1 - len(configuration_combinations) / df.shape[0]) * 100:.2f}%")
    
    # Configuration frequency distribution
    print("\n=== Configuration Repetition Pattern ===")
    df_with_config = df.copy()
    df_with_config['_config_id'] = df[config_cols].apply(tuple, axis=1)
    config_counts = df_with_config['_config_id'].value_counts()
    
    print(f"Vehicles tested per configuration:")
    print(f"  Min: {config_counts.min()}")
    print(f"  Max: {config_counts.max()}")
    print(f"  Mean: {config_counts.mean():.2f}")
    print(f"  Median: {config_counts.median():.0f}")
    
    # Show configurations tested most frequently
    print(f"\nTop 10 most-tested configurations (vehicle repetition):")
    for i, (config, count) in enumerate(config_counts.head(10).items(), 1):
        print(f"  {i}. {count} vehicles with same configuration")

    # Mean test duration for most-repeated configuration
    most_repeated_config = config_counts.idxmax()
    most_repeated_count = config_counts.max()
    mask = df_with_config['_config_id'] == most_repeated_config
    repeated_config_rows = df[mask]
    mean_test_time = repeated_config_rows['y'].mean()
    
    print(f"\n=== Mean Test Duration for Most-Repeated Configuration ===")
    print(f"Configuration ({most_repeated_count} vehicles): Mean test time = {mean_test_time:.2f} seconds")

    # Detailed analysis of the most-repeated configuration.
    # Reuses most_repeated_count and repeated_config_rows computed above.
    print("\n=== Leakage Risk: Most-Repeated Configuration Analysis ===")
    
    print(f"Most frequently tested configuration appears {most_repeated_count} times")
    print(f"\nTest duration (y) statistics for these {most_repeated_count} vehicles:")
    print(f"  Count: {len(repeated_config_rows)}")
    print(f"  Min: {repeated_config_rows['y'].min():.2f}")
    print(f"  Q1 (25th percentile): {repeated_config_rows['y'].quantile(0.25):.2f}")
    print(f"  Median (50th percentile): {repeated_config_rows['y'].median():.2f}")
    print(f"  Mean: {repeated_config_rows['y'].mean():.2f}")
    print(f"  Q3 (75th percentile): {repeated_config_rows['y'].quantile(0.75):.2f}")
    print(f"  Max: {repeated_config_rows['y'].max():.2f}")
    print(f"  Std: {repeated_config_rows['y'].std():.2f}")
    
    print(f"\nVariability measures across same configuration:")
    print(f"  Range: {repeated_config_rows['y'].max() - repeated_config_rows['y'].min():.2f}")
    print(f"  IQR (Interquartile Range): {(repeated_config_rows['y'].quantile(0.75) - repeated_config_rows['y'].quantile(0.25)):.2f}")
    print(f"  Coefficient of Variation: {(repeated_config_rows['y'].std() / repeated_config_rows['y'].mean() * 100):.2f}%")
    
    # Assessment
    cov_value = (repeated_config_rows['y'].std() / repeated_config_rows['y'].mean() * 100)
    print(f"\nLeakage Risk Assessment:")
    if cov_value < 15:
        print(f"  ✅ LOW RISK: CoV < 15% indicates high consistency within same configuration")
    elif cov_value < 30:
        print(f"  ⚠️  MEDIUM RISK: CoV 15-30% suggests moderate variability")
    else:
        print(f"  🔴 HIGH RISK: CoV > 30% indicates high variability, check for temporal effects")

    # Leakage risk assessment
    print("\n=== Potential Leakage Check ===")
    print(f"Unique IDs: {df['ID'].nunique()}")
    print(f"Duplicate rows (entire): {df.duplicated().sum()}")
    print(f"Duplicate configurations: {df.shape[0] - len(configuration_combinations)}")
    
    if df.shape[0] - len(configuration_combinations) > 0:
        print("\n⚠️  WARNING: Same configurations tested multiple times")
        print("   This indicates potential leakage risk via configuration ordering.")

    print("\n=== Analysis Complete ===")

if __name__ == "__main__":
    main(sys.argv[1])


Loading dataset...

=== Dataset Overview ===
Dataset shape: (4209, 378)
Rows: 4209
Columns: 378

=== Target Summary (y): Test Duration ===
count    4209.000000
mean      100.669318
std        12.679381
min        72.110000
25%        90.820000
50%        99.150000
75%       109.010000
max       265.320000
Name: y, dtype: float64

=== Feature Type Breakdown ===
Analyzing feature types...


Identifying feature types: 100%|██████████| 378/378 [00:00<00:00, 1340.06it/s]



Categorical features (nominal): 8
Binary features (one-hot): 368

=== Cardinality: Nominal Features (X0–X8) ===
Unique values per categorical feature:


Analyzing X0–X8: 100%|██████████| 8/8 [00:00<00:00, 85.22it/s]


X0: 47 unique values
X1: 27 unique values
X2: 44 unique values
X3: 7 unique values
X4: 4 unique values
X5: 29 unique values
X6: 12 unique values
X8: 25 unique values

=== Cardinality: Binary Features (X10+) ===
Features with non-zero variance (actually used):


Analyzing X10–X368: 100%|██████████| 368/368 [00:00<00:00, 6516.85it/s]


Total active (used) binary features: 356
Completely inactive (always 0) binary features: 12

=== Configuration Uniqueness (X0–X8) ===
Analyzing nominal configuration only...
Total unique configurations: 3866
Total rows: 4209
Configuration reuse rate: 8.15%

=== Configuration Uniqueness (X10–X368) ===
Analyzing binary configuration only...
Total unique configurations: 2652
Total rows: 4209
Configuration reuse rate: 36.99%

=== Configuration Uniqueness (X0–X368) ===
Analyzing complete configuration...
Total unique configurations: 3911
Total rows: 4209
Configuration reuse rate: 7.08%

=== Configuration Repetition Pattern ===
Vehicles tested per configuration:
  Min: 1
  Max: 9
  Mean: 1.08
  Median: 1

Top 10 most-tested configurations (vehicle repetition):
  1. 9 vehicles with same configuration
  2. 7 vehicles with same configuration
  3. 5 vehicles with same configuration
  4. 5 vehicles with same configuration
  5. 5 vehicles with same configuration
  6. 4 vehicles with same con

## Configuration Repetition Pattern & Leakage Risk Analysis

### 📊 **Configuration Repetition Pattern**

The output above shows the **frequency distribution** of how often each vehicle configuration was tested:

- **Min: 1** = Some configurations were tested only once (unique)
- **Max: 9** = The most frequently tested configuration was tested 9 times (9 vehicles with exactly the same configuration)
- **Mean: 1.08, Median: 1** = Most configurations were tested exactly once; repetition is the exception

---

### ⚠️ **The Leakage Risk**

The warning is **critical**:

```
WARNING: Same configurations tested multiple times
This indicates potential leakage risk via configuration ordering.
```

**What does this mean?**

1. **Configurations repeat** → up to 9 vehicles share identical options
2. **ID and configuration may be correlated**:
   - If IDs are assigned chronologically and identical configurations are tested back to back,
   - → the model could learn **ID ordering** instead of the actual configuration!

**Example scenario:**
```
ID 1000–1008: All with configuration "Red, Diesel, Premium"
ID 1009–1020: All with configuration "Blue, Gasoline, Standard"
```

A naive model might think: "ID 1000–1008 are quick to test". This is **leakage**, not true prediction!

---

### 📈 **Detailed Leakage Analysis**

The extended code now shows:

1. **Most-repeated configuration** = how many times the most common config appears
2. **Test duration statistics** for vehicles with identical configurations:
   - Min, Max, Mean, Median, Std
   - **Range** = Max - Min (variation within same config)
   - **Coefficient of Variation** = Std / Mean (relative variability)

**Interpretation:**
- If CoV is **low** (< 15%) → High consistency, low leakage risk
- If CoV is **medium** (15–30%) → Moderate variability
- If CoV is **high** (> 30%) → High variability, possible process differences, check for temporal correlation
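The script reports the CoV only for the single most-repeated configuration. As a sketch of how the same measure could be computed for every repeated configuration at once, the example below groups a toy frame by its configuration columns. The column names and values are illustrative, and `config_cols` stands in for the `categorical + binary` list built in the script.

```python
import pandas as pd

# Toy frame: two configuration columns and the target.
df = pd.DataFrame({
    "X0": ["a", "a", "a", "b", "b", "c"],
    "X10": [1, 1, 1, 0, 0, 1],
    "y": [100.0, 110.0, 90.0, 80.0, 80.0, 120.0],
})
config_cols = ["X0", "X10"]

# Per-configuration count, mean and sample standard deviation of y.
stats = df.groupby(config_cols)["y"].agg(["count", "mean", "std"])

# Keep only configurations that occur more than once; CoV is undefined otherwise.
repeated = stats[stats["count"] > 1].copy()
repeated["cov_pct"] = repeated["std"] / repeated["mean"] * 100

print(repeated)
```

Looking at the whole distribution of `cov_pct` is less sensitive to one unusual configuration than a single-group check.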

---

### ✅ **What to do?**

1. **Split train and test correctly** → Split by configuration, not by ID
2. **Remove ID from the model** (already done)
3. **Group cross-validation folds** → by configuration group, not randomly, so identical configurations never fall on both sides of a split
4. **Check temporal ordering** → Does ID order correlate with test sequence?

This is **governance-critical**! 👍
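
Splitting by configuration can be sketched with pandas alone. The example below assigns each distinct configuration to exactly one fold, so identical configurations never appear in both training and validation data. The toy frame, `config_cols` and `n_folds` are assumptions for illustration; scikit-learn's `GroupKFold` is a common alternative that also balances fold sizes.

```python
import pandas as pd

# Toy frame; on the real data config_cols would be categorical + binary.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5, 6, 7],
    "X0": ["a", "a", "b", "b", "c", "a", "d", "c"],
    "X10": [1, 1, 0, 0, 1, 1, 0, 1],
})
config_cols = ["X0", "X10"]
n_folds = 3  # assumed number of folds

# One integer label per distinct configuration.
group_id = df.groupby(config_cols, sort=False).ngroup()

# Assign whole groups to folds; rows sharing a configuration share a fold.
df["fold"] = group_id % n_folds

# Sanity check: every configuration lives in exactly one fold.
folds_per_config = df.groupby(config_cols)["fold"].nunique()

print(df[["ID", "fold"]])
print(f"max folds per configuration: {folds_per_config.max()}")
```

The modulo assignment is deliberately simple and does not balance fold sizes when a few configurations are much more frequent than the rest.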