# Task 2 - Data Exploration, Analysis, and Preprocessing 

This notebook assesses data quality, integrates the datasets, explores the data in depth, and prepares it for the modeling tasks.

## 2.1 Setup and Data Loading

In [None]:
# Import the libraries used in this notebook
import pandas as pd

# Load the two primary datasets (power generation and weather sensor data)
power_generation_df = pd.read_csv('../data/Plant_1_Generation_Data.csv')
weather_sensor_df = pd.read_csv('../data/Plant_1_Weather_Sensor_Data.csv')

## 2.2 Data Quality and Integration

### 2.2.1 Data Quality Assessment

#### Missing Values

In [63]:
def get_missing_data_report(df):
    """Generate a report of missing data percentages for each column in the DataFrame."""
    missing_data_report = pd.DataFrame({
        'Columns': df.columns,
        'Missing Values': df.isna().sum().values,
        'Percentage Missing': ((df.isna().sum().values / len(df)) * 100).round(2)
    })
    return missing_data_report

# Generate and print missing data report for the datasets
print("Power Generation - Missing Values Report:")
print(get_missing_data_report(power_generation_df), "\n")
print("Weather Sensor - Missing Values Report:")
print(get_missing_data_report(weather_sensor_df))

Power Generation - Missing Values Report:
               Columns  Missing Values  Percentage Missing
0            DATE_TIME               0                0.00
1             PLANT_ID               0                0.00
2           SOURCE_KEY               0                0.00
3             DC_POWER               0                0.00
4             AC_POWER               0                0.00
5          DAILY_YIELD               0                0.00
6          TOTAL_YIELD               0                0.00
7                  day               0                0.00
8  Operating_Condition           23098                2.26 

Weather Sensor - Missing Values Report:
               Columns  Missing Values  Percentage Missing
0            DATE_TIME               0                 0.0
1             PLANT_ID               0                 0.0
2           SOURCE_KEY               0                 0.0
3  AMBIENT_TEMPERATURE               0                 0.0
4   MODULE_TEMPERATURE         

- About 2.26% of the values (23,098 rows) in the _Operating_Condition_ column of the __Power Generation__ dataset are missing.
- There are no missing values in the __Weather Sensor__ dataset.
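
The report above says how much of _Operating_Condition_ is missing, but not where the gaps sit. One possible follow-up is to check whether they cluster in particular inverters, and to mark them with an explicit placeholder rather than leaving them as NaN. The sketch below runs on a small made-up frame, not on the real data; the toy values, the inverter names, and the `'Unknown'` label are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for power_generation_df; every value here is made up.
toy_df = pd.DataFrame({
    "SOURCE_KEY": ["inv_a", "inv_a", "inv_b", "inv_b"],
    "Operating_Condition": ["Optimal", None, "Suboptimal", None],
})

# Share of missing Operating_Condition values per inverter (SOURCE_KEY).
missing_share_by_source = (
    toy_df["Operating_Condition"].isna().groupby(toy_df["SOURCE_KEY"]).mean()
)

# Replace the gaps with an explicit placeholder category (hypothetical label).
filled_condition = toy_df["Operating_Condition"].fillna("Unknown")

print(missing_share_by_source)
print(filled_condition.value_counts())
```

Whether a placeholder, a mode imputation, or dropping the rows is appropriate depends on how _Operating_Condition_ is used in the modeling tasks, so this is only one option.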

#### Data Types

In [None]:
# Report the pandas dtype and the unique Python types (per cell) for each column of a DataFrame
def get_types_report(df):
    """Generate a report of pandas dtypes and unique Python types for each column in the DataFrame."""
    types = {}
    for col in df.columns:
        py_types = df[col].map(lambda x: type(x).__name__).unique().tolist()
        types[col] = py_types

    result_df = pd.DataFrame({
        'column': list(types.keys()),
        'pandas_dtype': [df[col].dtype for col in types.keys()],
        'python_types': [types[col] for col in types.keys()]
    })
    return result_df

print("\nWeather Sensor - Data Types Report:")
print(get_types_report(weather_sensor_df))
print("\nPower Generation - Data Types Report:")
print(get_types_report(power_generation_df))


Weather Sensor - Data Types Report:
                column pandas_dtype python_types
0            DATE_TIME       object        [str]
1             PLANT_ID        int64        [int]
2           SOURCE_KEY       object        [str]
3  AMBIENT_TEMPERATURE      float64      [float]
4   MODULE_TEMPERATURE      float64      [float]
5          IRRADIATION      float64      [float]

Power Generation - Data Types Report:
                column pandas_dtype  python_types
0            DATE_TIME       object         [str]
1             PLANT_ID        int64         [int]
2           SOURCE_KEY       object         [str]
3             DC_POWER      float64       [float]
4             AC_POWER      float64       [float]
5          DAILY_YIELD      float64       [float]
6          TOTAL_YIELD      float64       [float]
7                  day        int64         [int]
8  Operating_Condition       object  [str, float]
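
Two things stand out in these reports. The `[str, float]` entry for _Operating_Condition_ is consistent with its missing values, since pandas represents a missing cell in an object column as a float NaN. In addition, _DATE_TIME_ is stored as plain strings in both datasets, so it would need parsing before any time-based analysis. The sketch below illustrates both points on a made-up frame; the timestamp format string and all toy values are assumptions, and the real files may use a different format.

```python
import pandas as pd

# Toy frame; values and the timestamp format are assumptions for illustration.
toy_df = pd.DataFrame({
    "DATE_TIME": ["2020-05-15 00:00:00", "2020-05-15 00:15:00", "not a date"],
    "Operating_Condition": pd.Series(
        ["Optimal", float("nan"), "Suboptimal"], dtype=object
    ),
})

# Python types among the non-missing cells only: dropping NaN removes 'float'.
non_missing_types = (
    toy_df["Operating_Condition"]
    .dropna()
    .map(lambda x: type(x).__name__)
    .unique()
    .tolist()
)

# Parse timestamps with an explicit format; unparseable strings become NaT
# instead of raising, so they can be counted and inspected afterwards.
parsed_date_time = pd.to_datetime(
    toy_df["DATE_TIME"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

print(non_missing_types)
print(parsed_date_time.isna().sum(), "timestamp(s) failed to parse")
```

Passing an explicit `format` avoids silent day/month confusion, and counting the resulting NaT values gives a quick check that the assumed format actually matches the data.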


### 2.2.2 Data Handling

#### Addressing Missing Values

#### Data Ranges

## 2.3 Exploratory Data Analysis

### 2.3.1 Statistical Summary

### 2.3.2 Visualizations

### 2.3.3 Trend Analysis

### 2.3.4 Correlation Analysis

### 2.3.5 Pattern Identification

## 2.4 Feature Engineering

### 2.4.1 Feature Scaling

### 2.4.2 Feature Selection