## Demographic Development of Vienna since 2008: Analysis of Population Structure and Birth Trends at the District Level
<u>**Big Data project by:**</u>
<br>
Johannes Reitterer <br>
Johannes Mantler <br>
Nicolas Nemeth <br>
<br>

# ETL Pipeline
This ETL pipeline loads demographic data for the City of Vienna, cleans it, and stores it in MongoDB for further analysis.

**Data sources:**
- Population by province of birth (2008-present): ~500,000 records https://www.data.gv.at/datasets/98b782ca-8e46-43d7-a061-e196d0e0160a?locale=de
- Birth statistics (2002-present): ~50,000 records https://www.data.gv.at/datasets/f54e6828-3d75-4a82-89cb-23c58057bad4?locale=de

## Pipeline Flow

### 1. Extract (load data)

The raw data is loaded from CSV files published on data.gv.at.
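
A minimal sketch of this step, assuming the files are semicolon-separated and start with one line that is not the header (the pipeline passes `sep=';'` and `skiprows=1`). The inline sample text and its values are made up for illustration, and `load_csv` is a hypothetical helper name.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the assumed file layout: one leading
# title line, then a header row and semicolon-separated values.
SAMPLE_CSV = """Illustrative title line
NUTS;DISTRICT_CODE;REF_YEAR;SEX;BIR
AT13;90101;2020;1;50
AT13;90201;2020;2;61
"""


def load_csv(source):
    """Read a semicolon-separated CSV, skipping the leading title line."""
    return pd.read_csv(source, sep=";", skiprows=1)


df_sample = load_csv(io.StringIO(SAMPLE_CSV))
print(df_sample.columns.tolist())
print(len(df_sample), "rows")
```

The same helper accepts a file path instead of the in-memory buffer; for files on disk the pipeline additionally sets `encoding='utf-8-sig'`.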

Because querying the API is currently unreliable, the CSV files must be downloaded manually and placed in the project folder.
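
Since the files are placed by hand, it can help to check that they are present before running anything else. The following is a sketch only: `find_missing_files` is a hypothetical helper, and the file names are the ones used in the pipeline configuration.

```python
from pathlib import Path


def find_missing_files(filenames, base_dir="."):
    """Return the names from `filenames` that do not exist in `base_dir`."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


# File names as used by the pipeline configuration
expected = [
    "vie-bdl-pop-sex-age1-stk-cob-geoat10-2008f.csv",
    "vie-bdl-pop-sex-age1-bir-2002f.csv",
]

missing = find_missing_files(expected)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found")
```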

### 2. Transform (clean data)

**Column renaming:**
- English column names are converted to German names
- Example: `REF_YEAR` → `Jahr`, `DISTRICT_CODE` → `Bezirk_Roh`
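
The renaming can be sketched with a plain dictionary and `DataFrame.rename`. The mapping below is a subset of the one in the pipeline; the sample row is made up.

```python
import pandas as pd

# Subset of the mapping used by the pipeline
COLUMN_MAP = {"REF_YEAR": "Jahr", "DISTRICT_CODE": "Bezirk_Roh", "SEX": "Geschlecht"}

# Made-up row; NUTS has no entry in the mapping and keeps its name
raw = pd.DataFrame(
    {"NUTS": ["AT13"], "DISTRICT_CODE": [90101], "REF_YEAR": [2020], "SEX": [1]}
)

# Only rename columns that are actually present in the DataFrame
rename_map = {old: new for old, new in COLUMN_MAP.items() if old in raw.columns}
renamed = raw.rename(columns=rename_map)
print(renamed.columns.tolist())
```

Filtering the mapping against the existing columns keeps the step tolerant of files that lack some of the expected columns.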

**District code transformation:**

Vienna's source data uses five-digit statistical district codes starting with 9 (for example, 90101 for the 1st district). These are converted to postal codes (1010 to 1230):

```
90101 → 1010 (1st district)
90201 → 1020 (2nd district)
90301 → 1030 (3rd district)
...
```
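
The mapping takes the second and third digits of the statistical code as the district number. The sketch below is a stricter, hypothetical variant of that idea: `statistical_to_postal` is not the pipeline's function, and unlike it, this version rejects district numbers outside 1 to 23 and returns `None` instead of 0.

```python
def statistical_to_postal(code):
    """Convert a code such as 90101 to 1010; return None if it does not fit."""
    text = str(code).strip()
    # Expect exactly five digits with a leading 9
    if len(text) != 5 or not text.isdigit() or text[0] != "9":
        return None
    district = int(text[1:3])
    if not 1 <= district <= 23:  # Vienna has 23 districts
        return None
    return 1000 + district * 10


for code in (90101, 90201, 90301, 90000):
    print(code, "->", statistical_to_postal(code))
```

With the range check in place, a city-wide code such as 90000 is flagged as not being a district instead of being mapped to 1000.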

**Data cleaning:**
- Remove invalid district codes
- Fill missing values with 0
- Correct negative values
- Convert data types to integer

Gender coding: 1 = male, 2 = female.
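
The cleaning rules for count columns could look roughly like this. This is a sketch under assumptions: `clean_counts` is a hypothetical helper, and clipping negative counts to zero is only one possible reading of "correcting" them.

```python
import pandas as pd


def clean_counts(df, columns):
    """Return a copy with the given columns as non-negative integers."""
    out = df.copy()
    for col in columns:
        numeric = pd.to_numeric(out[col], errors="coerce")  # non-numeric -> NaN
        # Missing -> 0, negative -> 0 (assumed rule), then cast to int
        out[col] = numeric.fillna(0).clip(lower=0).astype(int)
    return out


# Made-up values covering a missing, a negative, and a non-numeric entry
sample = pd.DataFrame({"Wien": [1234, None, -5], "Ausland": ["789", "x", 12]})
cleaned = clean_counts(sample, ["Wien", "Ausland"])
print(cleaned)
```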

### 3. Load (store data)

The cleaned data is stored in MongoDB:
- **Collection `population`**: population data by district, year, age, gender, and origin
- **Collection `births`**: birth data by district, year, and gender

**Example document structure:**
```json
{
  "Jahr": 2020,
  "Bezirk": 1010,
  "Geschlecht": 1,
  "Alter": 25,
  "Wien": 1234,
  "Ausland": 789
}
```
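
The load step follows a clear-then-insert pattern. The sketch below illustrates it without a running database: `replace_collection` and `InMemoryCollection` are hypothetical names, and the stand-in class only mimics the two collection methods used here (`delete_many` and `insert_many`, as provided by a pymongo collection).

```python
def replace_collection(collection, records):
    """Clear `collection`, insert `records`, and return the inserted count.

    `collection` is expected to behave like a pymongo Collection, i.e. to
    provide `delete_many` and `insert_many`.
    """
    collection.delete_many({})  # full refresh: remove old documents
    if records:  # skip the insert when there is nothing to store
        collection.insert_many(records)
    return len(records)


class InMemoryCollection:
    """Tiny stand-in for a real collection, used only for this demo."""

    def __init__(self):
        self.docs = []

    def delete_many(self, query):
        self.docs.clear()

    def insert_many(self, documents):
        self.docs.extend(documents)


documents = [
    {"Jahr": 2020, "Bezirk": 1010, "Geschlecht": 1,
     "Alter": 25, "Wien": 1234, "Ausland": 789},
]
demo = InMemoryCollection()
print(replace_collection(demo, documents), "document(s) stored")
```

With a real database, the records would typically come from `DataFrame.to_dict("records")`, and the collection object from the connected database.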

## Usage

```python
# Run the pipeline
run_pipeline()

# Result: data stored in MongoDB under wien_demografie_db
# - population: population data
# - births: birth data
```

In [76]:
"""
Authors: Johannes Mantler, Johannes Reitterer, Nicolas Nemeth

Data Sources: 
- Population by province of birth (2008-present) https://www.data.gv.at/datasets/98b782ca-8e46-43d7-a061-e196d0e0160a?locale=de
- Birth statistics (2002-present) https://www.data.gv.at/datasets/f54e6828-3d75-4a82-89cb-23c58057bad4?locale=de
"""

import pandas as pd
import sys
from pymongo import MongoClient
import os

MONGO_CONFIG = {
    'uri': "mongodb://admin:admin123@localhost:27017/",
    'auth_source': "admin",
    'database': "wien_demografie_db",
    'use_docker': True
}

# File paths
DATA_FILES = {
    'population': 'vie-bdl-pop-sex-age1-stk-cob-geoat10-2008f.csv',
    'births': 'vie-bdl-pop-sex-age1-bir-2002f.csv'
}

# Data schema mappings
POPULATION_COLUMNS = {
    'REF_YEAR': 'Jahr',
    'DISTRICT_CODE': 'Bezirk_Roh',
    'SUB_DISTRICT_CODE': 'Sub_Bezirk',
    'REF_DATE': 'Datum',
    'SEX': 'Geschlecht',
    'AGE1': 'Alter',
    'UNK': 'Unbekannt',
    'BGD': 'Burgenland',
    'KTN': 'Kaernten',
    'NOE': 'Niederoesterreich',
    'OOE': 'Oberoesterreich',
    'SBG': 'Salzburg',
    'STK': 'Steiermark',
    'TIR': 'Tirol',
    'VBG': 'Vorarlberg',
    'VIE': 'Wien',
    'FOR': 'Ausland'
}

BIRTH_COLUMNS = {
    'REF_YEAR': 'Jahr',
    'DISTRICT_CODE': 'Bezirk_Roh',
    'SUB_DISTRICT_CODE': 'Sub_Bezirk',
    'REF_DATE': 'Datum',
    'SEX': 'Geschlecht',
    'BIR': 'Anzahl_Geburten'
}

BUNDESLAND_COLUMNS = [
    'Unbekannt', 'Burgenland', 'Kaernten', 'Niederoesterreich',
    'Oberoesterreich', 'Salzburg', 'Steiermark', 'Tirol',
    'Vorarlberg', 'Wien', 'Ausland'
]


def setup_mongodb():
    """
    Establish connection to MongoDB database.
    
    Returns:
        tuple: (MongoClient, Database) objects
        
    Raises:
        SystemExit: If connection fails
    """
    try:
        if MONGO_CONFIG['use_docker']:
            client = MongoClient(
                MONGO_CONFIG['uri'],
                serverSelectionTimeoutMS=5000,
                authSource=MONGO_CONFIG['auth_source']
            )
        else:
            client = MongoClient(
                MONGO_CONFIG['uri'],
                serverSelectionTimeoutMS=5000
            )
        
        client.server_info()
        db = client[MONGO_CONFIG['database']]
        
        print("MongoDB connection established successfully")
        print(f"Database: {MONGO_CONFIG['database']}")
        
        return client, db
        
    except Exception as e:
        print(f"ERROR: MongoDB connection failed - {e}")
        print("\nTroubleshooting:")
        print("1. Start MongoDB: docker-compose up -d")
        print("2. Check connection string in MONGO_CONFIG")
        print("3. Verify MongoDB is running: docker ps")
        sys.exit(1)


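def clean_district_code(code):
    """
    Transform district codes from statistical format to Vienna postal codes.

    Five-digit statistical codes starting with '9' are mapped using their
    second and third digits as the district number. Any other numeric code
    is returned unchanged as an integer.

    Note: the district number is not range-checked, so the city-wide code
    90000 maps to 1000 rather than to 0.

    Args:
        code: District code from source data

    Returns:
        int: Postal-style district code (1010-1230 for districts 1-23),
             or 0 if the code cannot be parsed as an integer

    Examples:
        90101 -> 1010
        90201 -> 1020
    """
    try:
        code_str = str(code).strip()

        # Statistical format: '9' + two-digit district number + two more digits
        if code_str.startswith('9') and len(code_str) == 5:
            district_num = int(code_str[1:3])
            return 1000 + district_num * 10

        # Anything else (e.g. an existing four-digit postal code) is kept as-is
        return int(code_str)

    except (ValueError, TypeError):
        return 0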



def extract_data():
    """
    Load CSV files from disk.
    
    Returns:
        tuple: (population_df, births_df) pandas DataFrames
        
    Raises:
        Exception: If file loading or parsing fails
    """
    print("\n" + "="*70)
    print("PHASE 1: EXTRACT")
    print("="*70)
    
    print("\nLoading CSV files...")
    
    try:
        df_pop = pd.read_csv(
            DATA_FILES['population'],
            sep=';',
            encoding='utf-8-sig',
            skiprows=1
        )
        print(f"  Population: {len(df_pop):,} rows, {len(df_pop.columns)} columns")
        
        df_birth = pd.read_csv(
            DATA_FILES['births'],
            sep=';',
            encoding='utf-8-sig',
            skiprows=1
        )
        print(f"  Births: {len(df_birth):,} rows, {len(df_birth.columns)} columns")
        
        print("\nData preview:")
        print(f"  Population columns: {list(df_pop.columns)[:8]}")
        print(f"  Birth columns: {list(df_birth.columns)[:8]}")
        
        if len(df_pop) > 0:
            print(f"\n  First population row: {df_pop.iloc[0].to_dict()}")
        
        return df_pop, df_birth
        
    except Exception as e:
        print(f"\nERROR: Failed to load data files - {e}")
        raise


def transform_population_data(df):
    """
    Clean and transform population data.
    
    Transformations:
    - Rename columns to German
    - Convert district codes to postal codes
    - Convert numeric columns to integers
    - Remove invalid records
    
    Args:
        df: Raw population DataFrame
        
    Returns:
        DataFrame: Cleaned population data
    """
    print("\nTransforming population data...")
    
    print(f"  Actual columns: {list(df.columns)[:10]}")
    
    rename_map = {k: v for k, v in POPULATION_COLUMNS.items() if k in df.columns}
    df = df.rename(columns=rename_map)
    
    district_col = None
    for col in ['Bezirk_Roh', 'DISTRICT_CODE', 'district_code']:
        if col in df.columns:
            district_col = col
            break
    
    if district_col:
        df['Bezirk'] = df[district_col].apply(clean_district_code)
    else:
        print("  WARNING: No district code column found")
        df['Bezirk'] = 0
    
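    # Remove invalid districts (copy so that the column assignments below
    # operate on an independent DataFrame rather than a filtered view)
    initial_rows = len(df)
    df = df[df['Bezirk'] > 0].copy()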
    removed = initial_rows - len(df)
    if removed > 0:
        print(f"  Removed {removed} rows with invalid district codes")
    
    # Convert bundesland columns to integers
    for col in BUNDESLAND_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)
    
    # Ensure Jahr is integer
    if 'Jahr' in df.columns:
        df['Jahr'] = pd.to_numeric(df['Jahr'], errors='coerce').fillna(0).astype(int)
    
    print(f"  Processed: {len(df):,} records")
    
    return df


def transform_birth_data(df):
    """
    Clean and transform birth data.
    
    Transformations:
    - Rename columns to German
    - Convert district codes to postal codes
    - Convert birth counts to integers
    - Remove invalid records
    
    Args:
        df: Raw birth DataFrame
        
    Returns:
        DataFrame: Cleaned birth data
    """
    print("\nTransforming birth data...")
    
    rename_map = {k: v for k, v in BIRTH_COLUMNS.items() if k in df.columns}
    df = df.rename(columns=rename_map)
    
    # Clean district codes
    if 'Bezirk_Roh' in df.columns:
        df['Bezirk'] = df['Bezirk_Roh'].apply(clean_district_code)
        
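        # Remove invalid districts (copy so that the column assignments below
        # operate on an independent DataFrame rather than a filtered view)
        initial_rows = len(df)
        df = df[df['Bezirk'] > 0].copy()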
        removed = initial_rows - len(df)
        if removed > 0:
            print(f"  Removed {removed} rows with invalid district codes")
    
    # Convert birth counts to integers
    if 'Anzahl_Geburten' in df.columns:
        df['Anzahl_Geburten'] = pd.to_numeric(
            df['Anzahl_Geburten'],
            errors='coerce'
        ).fillna(0).astype(int)
    
    # Ensure Jahr is integer
    if 'Jahr' in df.columns:
        df['Jahr'] = pd.to_numeric(df['Jahr'], errors='coerce').fillna(0).astype(int)
    
    print(f"  Processed: {len(df):,} records")
    
    return df


def transform_data(df_pop, df_birth):
    """
    Apply transformations to both datasets.
    
    Args:
        df_pop: Raw population DataFrame
        df_birth: Raw birth DataFrame
        
    Returns:
        tuple: (cleaned_population_df, cleaned_birth_df)
    """
    print("\n" + "="*70)
    print("PHASE 2: TRANSFORM")
    print("="*70)
    
    df_pop_clean = transform_population_data(df_pop)
    df_birth_clean = transform_birth_data(df_birth)
    
    return df_pop_clean, df_birth_clean


def load_data(db, df_pop, df_birth):
    """
    Load cleaned data into MongoDB.
    
    Args:
        db: MongoDB database object
        df_pop: Cleaned population DataFrame
        df_birth: Cleaned birth DataFrame
        
    Returns:
        tuple: (population_count, birth_count) document counts
    """
    print("\n" + "="*70)
    print("PHASE 3: LOAD")
    print("="*70)
    
    print("\nLoading data into MongoDB...")
    
    try:
        # Clear existing collections
        db["population"].delete_many({})
        db["births"].delete_many({})
        print("  Cleared existing collections")
        
        # Insert new data
        if len(df_pop) > 0:
            records_pop = df_pop.to_dict("records")
            db["population"].insert_many(records_pop)
        
        if len(df_birth) > 0:
            records_birth = df_birth.to_dict("records")
            db["births"].insert_many(records_birth)
        
        # Verify counts
        count_pop = db["population"].count_documents({})
        count_birth = db["births"].count_documents({})
        
        print(f"  Population: {count_pop:,} documents inserted")
        print(f"  Births: {count_birth:,} documents inserted")
        
        return count_pop, count_birth
        
    except Exception as e:
        print(f"\nERROR: Failed to load data into MongoDB - {e}")
        raise


def generate_statistics(df_pop, df_birth):
    """
    Generate and display summary statistics.
    
    Args:
        df_pop: Population DataFrame
        df_birth: Birth DataFrame
    """
    print("\n" + "="*70)
    print("DATA STATISTICS")
    print("="*70)
    
    # Time range
    if 'Jahr' in df_pop.columns:
        years_pop = sorted(df_pop['Jahr'].unique())
        print("\nPopulation data:")
        print(f"  Time range: {min(years_pop)} - {max(years_pop)}")
        print(f"  Years covered: {len(years_pop)}")
    
    if 'Jahr' in df_birth.columns:
        years_birth = sorted(df_birth['Jahr'].unique())
        print("\nBirth data:")
        print(f"  Time range: {min(years_birth)} - {max(years_birth)}")
        print(f"  Years covered: {len(years_birth)}")
    
    # Districts
    if 'Bezirk' in df_pop.columns:
        districts = sorted([d for d in df_pop['Bezirk'].unique() if d > 0])
        print(f"\nDistricts covered: {len(districts)}")
        print(f"  Districts: {districts}")
    
    # Gender distribution
    if 'Geschlecht' in df_pop.columns:
        genders = df_pop['Geschlecht'].unique()
        print(f"\nGender codes: {sorted(genders)}")


def run_pipeline():
    """
    Execute the complete ETL pipeline.
    
    Pipeline stages:
    1. Connect to MongoDB
    2. Extract data from CSV files
    3. Transform and clean data
    4. Load data into MongoDB
    5. Generate statistics
    """
    print("\n" + "="*70)
    print("WIEN DEMOGRAFIE ETL PIPELINE")
    print("="*70)
    print(f"\nWorking directory: {os.getcwd()}")

    
    # Setup database
    client, db = setup_mongodb()
    
    try:
        # Extract
        df_pop, df_birth = extract_data()
        
        # Transform
        df_pop_clean, df_birth_clean = transform_data(df_pop, df_birth)
        
        # Load
        count_pop, count_birth = load_data(db, df_pop_clean, df_birth_clean)
        
        # Statistics
        generate_statistics(df_pop_clean, df_birth_clean)
        
        print("\n" + "="*70)
        print("PIPELINE COMPLETED SUCCESSFULLY")
        print("="*70)
        
    except Exception as e:
        print(f"\nPIPELINE FAILED: {e}")
        sys.exit(1)
        
    finally:
        client.close()
        print("\nMongoDB connection closed")


run_pipeline()


WIEN DEMOGRAFIE ETL PIPELINE

Working directory: C:\Users\Johannes\Desktop\Uni\5.Semester\BigData\BigDataProject
MongoDB connection established successfully
Database: wien_demografie_db

PHASE 1: EXTRACT

Loading CSV files...
  Population: 3,636 rows, 18 columns
  Births: 1,661 rows, 8 columns

Data preview:
  Population columns: ['NUTS', 'DISTRICT_CODE', 'SUB_DISTRICT_CODE', 'REF_YEAR', 'REF_DATE', 'SEX', 'AGE1', 'UNK']
  Birth columns: ['NUTS', 'DISTRICT_CODE', 'SUB_DISTRICT_CODE', 'REF_YEAR', 'REF_DATE', 'SEX', 'AGE', 'BIR']

  First population row: {'NUTS': 'AT13', 'DISTRICT_CODE': 90000, 'SUB_DISTRICT_CODE': 90000, 'REF_YEAR': 2008, 'REF_DATE': 20080101, 'SEX': 1, 'AGE1': 0, 'UNK': 0, 'BGD': 21, 'KTN': 10, 'NOE': 356, 'OOE': 19, 'SBG': 6, 'STK': 18, 'TIR': 3, 'VBG': 4, 'VIE': 7672, 'FOR': 181}

PHASE 2: TRANSFORM

Transforming population data...
  Actual columns: ['NUTS', 'DISTRICT_CODE', 'SUB_DISTRICT_CODE', 'REF_YEAR', 'REF_DATE', 'SEX', 'AGE1', 'UNK', 'BGD', 'KTN']
  Processed