# Bias Detection & Data Quality Analysis

**Purpose**: Identify data bias, missing-data patterns, outliers, and systematic data quality issues

**Date**: January 12, 2026

## Objectives
1. Detect temporal bias (missing years, incomplete periods)
2. Identify geographic bias (missing countries, regions)
3. Find systematic outliers and anomalies
4. Analyze data completeness patterns
5. Detect measurement bias and inconsistencies

In [None]:
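import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Database connection
# The password is read from the environment instead of being stored in the notebook.
# The variable name LIANEL_DB_PASSWORD is an assumption; adjust it to your deployment.
DB_CONFIG = {
    'host': '172.18.0.1',
    'port': 5432,
    'database': 'lianel_energy',
    'user': 'airflow',
    'password': os.environ['LIANEL_DB_PASSWORD']
}

# URL.create escapes any special characters in the credentials
connection_url = URL.create(
    'postgresql',
    username=DB_CONFIG['user'],
    password=DB_CONFIG['password'],
    host=DB_CONFIG['host'],
    port=DB_CONFIG['port'],
    database=DB_CONFIG['database']
)
engine = create_engine(connection_url)

# create_engine is lazy: the connection itself opens on the first query
print("✅ Database engine created")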

## 1. Load Data

In [None]:
# Load ML forecasting dataset
query = """
SELECT 
    cntr_code,
    year,
    total_energy_gwh,
    renewable_energy_gwh,
    fossil_energy_gwh,
    pct_renewable,
    pct_fossil,
    yoy_change_total_energy_pct,
    yoy_change_renewable_pct,
    energy_density_gwh_per_km2,
    area_km2
FROM ml_dataset_forecasting_v1
ORDER BY cntr_code, year
"""

df = pd.read_sql(query, engine)
print(f"✅ Loaded {len(df)} records")
print(f"Countries: {df['cntr_code'].nunique()}")
print(f"Years: {df['year'].min()} - {df['year'].max()}")
df.head()

In [None]:
# Check for missing values
missing_analysis = df.isnull().sum()
missing_pct = (missing_analysis / len(df)) * 100

print("📋 Missing Value Analysis:")
missing_df = pd.DataFrame({
    'Missing Count': missing_analysis,
    'Missing Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
print(missing_df[missing_df['Missing Count'] > 0].to_string())

# Check for zero values (potential data quality issues)
zero_analysis = {}
for col in ['total_energy_gwh', 'renewable_energy_gwh', 'fossil_energy_gwh']:
    zero_count = (df[col] == 0).sum()
    zero_analysis[col] = {
        'zero_count': zero_count,
        'zero_pct': (zero_count / len(df)) * 100
    }

print("\n⚠️ Zero Value Analysis:")
for col, stats in zero_analysis.items():
    print(f"  {col}: {stats['zero_count']} zeros ({stats['zero_pct']:.2f}%)")

# Check for invalid percentages
invalid_pct = df[
    (df['pct_renewable'] < 0) | 
    (df['pct_renewable'] > 100) |
    (df['pct_renewable'].isnull())
]
print(f"\n❌ Invalid Renewable Percentages: {len(invalid_pct)} records")
if len(invalid_pct) > 0:
    print(invalid_pct[['cntr_code', 'year', 'pct_renewable', 'total_energy_gwh']].to_string(index=False))

# Check for data completeness by year.
# A value counts as present only if it is > 0, so NaN and zero are both treated as missing.
completeness_by_year = df.groupby('year').agg({
    'cntr_code': 'count',
    'total_energy_gwh': lambda x: (x > 0).sum(),
    'renewable_energy_gwh': lambda x: (x > 0).sum(),
    'fossil_energy_gwh': lambda x: (x > 0).sum()
})
completeness_by_year.columns = ['total_records', 'has_total_energy', 'has_renewable', 'has_fossil']
completeness_by_year['fossil_completeness_pct'] = (completeness_by_year['has_fossil'] / completeness_by_year['total_records']) * 100

print("\n📊 Data Completeness by Year:")
print(completeness_by_year.to_string())

# Visualize completeness
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Completeness by year
ax1 = axes[0]
x = completeness_by_year.index
width = 0.25
ax1.bar(x - width, completeness_by_year['has_total_energy'], width, label='Total Energy', alpha=0.7)
ax1.bar(x, completeness_by_year['has_renewable'], width, label='Renewable', alpha=0.7)
ax1.bar(x + width, completeness_by_year['has_fossil'], width, label='Fossil', alpha=0.7)
ax1.set_xlabel('Year')
ax1.set_ylabel('Number of Records')
ax1.set_title('Data Completeness by Year')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Plot 2: Fossil completeness percentage
ax2 = axes[1]
ax2.plot(completeness_by_year.index, completeness_by_year['fossil_completeness_pct'], 
         marker='o', linewidth=2, markersize=8, color='red')
ax2.axhline(y=100, color='green', linestyle='--', linewidth=2, label='100% Complete')
ax2.set_xlabel('Year')
ax2.set_ylabel('Fossil Data Completeness (%)')
ax2.set_title('Fossil Energy Data Completeness Over Time')
ax2.set_ylim([0, 105])
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

incomplete_fossil_years = completeness_by_year[completeness_by_year['fossil_completeness_pct'] < 100].index.tolist()

print("\n✅ Key Findings:")
print(f"  - Years with incomplete fossil data: {incomplete_fossil_years}")
print(f"  - Total missing values across all columns: {missing_analysis.sum()}")

## 5. Summary & Recommendations

### Key Bias Issues Identified

1. **Temporal Bias**: Some countries have no records for certain years
2. **Data Completeness Bias**: Fossil energy data is missing for 2016-2017
3. **Geographic Bias**: Some countries may be absent from the dataset
4. **Outlier Bias**: Extreme values may skew the analysis
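
The temporal gaps listed above can be located by comparing the observed country-year pairs against the full grid of expected pairs. The following is a minimal sketch, not the notebook's own method: the helper `find_missing_years` and the toy data are hypothetical, and it assumes the `cntr_code` and `year` columns from `ml_dataset_forecasting_v1`. It only finds gaps for countries that appear at least once; detecting countries that are absent entirely would need an external reference list.

```python
import pandas as pd

def find_missing_years(df, country_col="cntr_code", year_col="year"):
    """Return one row per (country, year) pair that is absent from df.

    Hypothetical helper: the expected grid is every observed country
    crossed with every year between the global min and max year.
    """
    countries = df[country_col].unique()
    years = range(int(df[year_col].min()), int(df[year_col].max()) + 1)
    expected = pd.MultiIndex.from_product(
        [countries, years], names=[country_col, year_col]
    )
    observed = pd.MultiIndex.from_frame(
        df[[country_col, year_col]].drop_duplicates()
    )
    # Pairs in the expected grid that never occur in the data
    return expected.difference(observed).to_frame(index=False)

# Toy data (made-up values): DE has no 2017 record, FR is complete
sample = pd.DataFrame({
    "cntr_code": ["DE", "DE", "FR", "FR", "FR"],
    "year": [2016, 2018, 2016, 2017, 2018],
})
gaps = find_missing_years(sample)
print(gaps)
```

Applied to the notebook's data, `find_missing_years(df)` would return the country-year pairs to investigate.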

### Recommendations

1. **Flag incomplete data**: Add data quality flags to the ML datasets
2. **Handle outliers**: Decide on an outlier treatment (remove, cap, or investigate)
3. **Fill missing years**: Investigate why some countries have missing years
4. **Re-ingest incomplete periods**: Re-run ingestion for 2016-2017
5. **Document data limitations**: Create data quality documentation
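
Recommendations 1 and 2 could be prototyped together. The sketch below is one possible approach under stated assumptions, not a prescribed treatment: the helper `add_quality_flags`, the flag column names, and the per-country IQR rule with multiplier `k=1.5` are illustrative choices, and the sample values are made up. It marks fossil values as incomplete when they are missing or non-positive (the same convention as the completeness check in this notebook) and keeps a capped copy of the value column next to the original.

```python
import numpy as np
import pandas as pd

def add_quality_flags(df, value_col="total_energy_gwh",
                      fossil_col="fossil_energy_gwh", k=1.5):
    """Return a copy of df with quality flags and a capped value column.

    Hypothetical helper; the flag names and the IQR rule are assumptions.
    """
    out = df.copy()
    # NaN or non-positive fossil values count as incomplete
    out["flag_fossil_incomplete"] = ~(out[fossil_col] > 0)

    # IQR fences per country, so large and small countries are not compared
    grouped = out.groupby("cntr_code")[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    out["flag_outlier"] = (out[value_col] < lower) | (out[value_col] > upper)
    # Capping keeps the row but limits its influence; the original stays intact
    out[f"{value_col}_capped"] = out[value_col].clip(lower=lower, upper=upper)
    return out

# Toy data (made-up values): the last total is an obvious outlier
sample = pd.DataFrame({
    "cntr_code": ["DE"] * 5,
    "year": [2015, 2016, 2017, 2018, 2019],
    "total_energy_gwh": [100, 102, 98, 101, 500],
    "fossil_energy_gwh": [50, 0, 49, np.nan, 51],
})
flagged = add_quality_flags(sample)
print(flagged)
```

Keeping flags and capped values as extra columns leaves the raw data untouched, so the choice between removing, capping, or investigating can be made later per model.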