# Data Model Documentation
## Overview

This document outlines the data model and processing framework required to support the following business intelligence (BI) and machine learning (ML) use cases for analyzing public net electricity production in Germany:

1. **Trend Analysis** of daily public net electricity production by type.
2. **Prediction** of underperformance of public net electricity production at 30-minute intervals.
3. **Analysis** of daily electricity prices against net power for offshore and onshore wind production.

## Data Sources

The following JSON files serve as the primary data sources:

**1. public_power_data**
```json
{
  "unix_seconds": [int],
  "production_types": [
    {
      "name": "string",
      "data": [float]
    }
  ],
  "deprecated": boolean
}
```

**2. price_data**
```json
{
  "license_info": "string",
  "unix_seconds": [int],
  "price": [float],
  "unit": "string",
  "deprecated": boolean
}
```

**3. installed_power_data**
```json
{
  "time": [string],
  "production_types": [
    {
      "name": "string",
      "data": [float]
    }
  ],
  "deprecated": boolean
}
```

## Data Model Design

The data model consists of three main CSV files to support the specified use cases:

### `public_power_cleaned`

**Columns:**
- `timestamp` (TIMESTAMP): UNIX timestamp converted to human-readable format.
- `production_type_name` (STRING): Name of the production type (e.g., solar, wind).
- `production_value` (FLOAT): Electricity production value.

**Purpose:**  
To track daily electricity production trends for each production type.
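The reshaping from the nested `public_power_data` JSON into these three columns can be sketched as follows. This is a minimal plain-Python illustration, not the project's PySpark script: the function name `flatten_public_power`, the sample values, the UTC conversion, and the assumption that each `data` array is aligned index-by-index with `unix_seconds` are all illustrative.

```python
from datetime import datetime, timezone


def flatten_public_power(payload):
    """Turn the nested public_power_data payload into long-format rows.

    Assumption: every "data" array has one value per entry in "unix_seconds".
    """
    rows = []
    for series in payload["production_types"]:
        for ts, value in zip(payload["unix_seconds"], series["data"]):
            rows.append({
                # Assumption: timestamps are rendered in UTC.
                "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc)
                                     .strftime("%Y-%m-%d %H:%M:%S"),
                "production_type_name": series["name"],
                "production_value": value,
            })
    return rows


# Made-up sample payload in the shape described under "Data Sources".
sample = {
    "unix_seconds": [1704067200, 1704068100],
    "production_types": [
        {"name": "Solar", "data": [0.0, 1.5]},
        {"name": "Wind onshore", "data": [12000.0, 11850.5]},
    ],
    "deprecated": False,
}

rows = flatten_public_power(sample)
for row in rows:
    print(row)
```

Each production type contributes one row per timestamp, so the two series with two timestamps above yield four rows. The long format keeps the CSV schema stable when production types are added or removed.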

---

### `price_data_cleaned`

**Columns:**
- `license_info` (STRING): License information for the price data.
- `price` (FLOAT): The day-ahead spot market price for a specified bidding zone in EUR/MWh.
- `unit` (STRING): Unit in which the price is expressed.
- `timestamp` (TIMESTAMP): UNIX timestamp converted to human-readable format.

**Purpose:**  
To store electricity prices for analysis.

---

### `installed_power_cleaned`

**Columns:**
- `time` (TIMESTAMP): Month and year of the record.
- `production_name` (STRING): Name of the production type.
- `production_value` (FLOAT): Installed power value for the production type.

**Purpose:**  
To provide context for the installed capacity of different energy sources.

---

## Data Flow and Processing

The data is ingested from the JSON files, transformed with PySpark, and written to CSV files. The data flow is as follows:

1. **Extract:** Read the JSON files.
2. **Transform:** Clean and reshape the data to match the table structures.
3. **Load:** Write the transformed data into the CSV files.
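The three steps above can be sketched end to end for `price_data`. This is a hedged, standard-library illustration of the flow rather than the project's PySpark code: the function names `extract`, `transform`, and `load`, the file names, the sample values, and the UTC timestamp format are assumptions made for the example.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

COLUMNS = ["license_info", "price", "unit", "timestamp"]


def extract(path):
    """Extract: load one raw JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def transform(payload):
    """Transform: one row per timestamp, matching price_data_cleaned."""
    rows = []
    for ts, price in zip(payload["unix_seconds"], payload["price"]):
        rows.append({
            "license_info": payload["license_info"],
            "price": price,
            "unit": payload["unit"],
            # Assumption: timestamps are rendered in UTC.
            "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc)
                                 .strftime("%Y-%m-%d %H:%M:%S"),
        })
    return rows


def load(rows, path):
    """Load: write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory with made-up sample values.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "price_data.json"
raw_path.write_text(json.dumps({
    "license_info": "sample license",
    "unix_seconds": [1704067200, 1704070800],
    "price": [45.2, 39.9],
    "unit": "EUR/MWh",
    "deprecated": False,
}), encoding="utf-8")

out_path = workdir / "price_data_cleaned.csv"
load(transform(extract(raw_path)), out_path)
csv_lines = out_path.read_text(encoding="utf-8").splitlines()
print("\n".join(csv_lines))
```

Keeping the three stages as separate functions mirrors the Extract/Transform/Load split and lets each stage be tested on its own. The scalar fields `license_info` and `unit` are repeated on every row so that the CSV is self-describing.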

---

## PySpark Transformation Scripts

The PySpark transformation scripts are in the `data_transformation_scripts` folder, and the cleaned data is in the `cleaned_data_csv` folder.

---

## SQL Queries for BI/ML Use Cases

The SQL scripts are in the `sql_queries_bi_ml_usecase` folder and can be run on the data saved in the `cleaned_data_csv` folder.

The BI/ML use cases taken into consideration are:

1. Trend of daily public net electricity production in Germany for each production type.
2. Analysis of daily average price against net power for offshore and onshore wind.
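As a rough illustration of what queries for these two use cases could look like, the sketch below runs SQL against an in-memory SQLite database from Python. It is not a copy of the scripts in `sql_queries_bi_ml_usecase`: the SQLite engine, the sample rows, the production type names `Wind onshore` and `Wind offshore`, and the choice of `AVG` as the daily aggregate are assumptions (the source does not state the unit of `production_value`, so a sum may be more appropriate in practice).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE public_power_cleaned (
    timestamp TEXT, production_type_name TEXT, production_value REAL);
CREATE TABLE price_data_cleaned (
    license_info TEXT, price REAL, unit TEXT, timestamp TEXT);
""")

# Made-up sample rows; real data would be loaded from the cleaned CSV files.
conn.executemany("INSERT INTO public_power_cleaned VALUES (?, ?, ?)", [
    ("2024-01-01 00:00:00", "Wind onshore", 100.0),
    ("2024-01-01 12:00:00", "Wind onshore", 300.0),
    ("2024-01-01 00:00:00", "Wind offshore", 50.0),
    ("2024-01-01 12:00:00", "Wind offshore", 70.0),
    ("2024-01-01 00:00:00", "Solar", 0.0),
    ("2024-01-02 00:00:00", "Wind onshore", 400.0),
])
conn.executemany("INSERT INTO price_data_cleaned VALUES (?, ?, ?, ?)", [
    ("sample license", 40.0, "EUR/MWh", "2024-01-01 00:00:00"),
    ("sample license", 60.0, "EUR/MWh", "2024-01-01 12:00:00"),
    ("sample license", 30.0, "EUR/MWh", "2024-01-02 00:00:00"),
])

# Use case 1: daily production trend per production type.
daily_trend = conn.execute("""
    SELECT date(timestamp) AS day,
           production_type_name,
           AVG(production_value) AS avg_production
    FROM public_power_cleaned
    GROUP BY date(timestamp), production_type_name
    ORDER BY day, production_type_name
""").fetchall()

# Use case 2: daily average price against daily average wind net power.
price_vs_wind = conn.execute("""
    WITH daily_price AS (
        SELECT date(timestamp) AS day, AVG(price) AS avg_price
        FROM price_data_cleaned
        GROUP BY date(timestamp)
    ),
    daily_wind AS (
        SELECT date(timestamp) AS day,
               production_type_name,
               AVG(production_value) AS avg_net_power
        FROM public_power_cleaned
        WHERE production_type_name IN ('Wind onshore', 'Wind offshore')
        GROUP BY date(timestamp), production_type_name
    )
    SELECT w.day, w.production_type_name, w.avg_net_power, p.avg_price
    FROM daily_wind AS w
    JOIN daily_price AS p ON p.day = w.day
    ORDER BY w.day, w.production_type_name
""").fetchall()

for row in daily_trend:
    print(row)
for row in price_vs_wind:
    print(row)
```

Aggregating both tables to one row per day before joining avoids multiplying rows when production and price data have different sampling intervals.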