# 02 - Exploratory Data Analysis (EDA)

### Objective
This notebook explores the cleaned and merged dataset containing daily weather parameters and corresponding solar energy generation for Ireland in 2024.  
The goal is to understand the data's structure, identify trends and patterns, and uncover relationships between variables that can inform the modeling strategy.

---

### Key Steps

**Dataset Overview**
   - Use `pandas` to inspect data shape, column types, missing values, and summary statistics.
   - Examine basic distributions of key variables like `solargen`, `glorad`, `maxtp`, `rain`.
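
As a rough illustration of the overview step, the sketch below runs the usual `pandas` inspection calls on a small synthetic frame. The `demo` DataFrame and its values are made up for this example and stand in for the real CSV; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV (values are made up for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D").strftime("%Y-%m-%d"),
    "rain": rng.uniform(0, 10, 10),
    "glorad": rng.uniform(100, 2000, 10),
    "solargen": rng.uniform(0, 500, 10),
})
demo.loc[3, "rain"] = np.nan  # inject one missing value so the check has something to find

print(demo.shape)            # (rows, columns)
print(demo.dtypes)           # column types
missing = demo.isna().sum()  # missing values per column
print(missing)
summary = demo.describe()    # count, mean, std, min, quartiles, max (numeric columns only)
print(summary)
```

On the real data, the same calls would be applied to `df` after loading.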

**Time Series Trend Analysis**
   - Plot daily values using `matplotlib` to explore seasonal trends in solar generation and weather (e.g., radiation, temperature).
   - Look for periodic patterns or anomalies across the year.
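
One possible way to make seasonal trends easier to see is to overlay a rolling mean on the daily values. The sketch below uses a synthetic sine-shaped series in place of the real `solargen` column, and the 7-day window is an assumption chosen for illustration, not a setting taken from this notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series with a seasonal shape (illustrative only, not real data)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
day = np.arange(len(dates))
demo = pd.DataFrame({
    "date": dates,
    "solargen": 100 + 80 * np.sin(2 * np.pi * (day - 80) / 366),
}).set_index("date")

# A 7-day centred rolling mean smooths day-to-day noise so the seasonal trend stands out
demo["solargen_7d"] = demo["solargen"].rolling(window=7, center=True).mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(demo.index, demo["solargen"], alpha=0.4, label="daily")
ax.plot(demo.index, demo["solargen_7d"], label="7-day rolling mean")
ax.set_xlabel("date")
ax.set_ylabel("solargen")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. The centred window leaves the first and last three days without a rolling value.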

**Correlation & Linear Relationships**
   - Use `.corr()` and `matplotlib` to create a correlation heatmap.
   - Generate scatter plots (e.g., `glorad` vs. `solargen`) to visually assess linear relationships.
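
A correlation heatmap can be drawn with plain `matplotlib` by passing the matrix from `.corr()` to `imshow`. The sketch below is a minimal version on synthetic data: the `demo` frame is made up, and `solargen` is deliberately constructed to correlate with `glorad`, so the coefficients shown say nothing about the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (illustrative only); solargen is built to depend on glorad
rng = np.random.default_rng(42)
glorad = rng.uniform(100, 2000, 200)
demo = pd.DataFrame({
    "glorad": glorad,
    "solargen": 0.2 * glorad + rng.normal(0, 20, 200),
    "rain": rng.uniform(0, 15, 200),
})

corr = demo.corr()  # Pearson correlation by default

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Annotate each cell with its coefficient
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson r")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale symmetric around zero, so positive and negative correlations are directly comparable.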

**Outlier & Distribution Inspection**
   - Use histograms and boxplots (via `matplotlib.pyplot`) to explore distributions and detect outliers for each variable.
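
Boxplots in `matplotlib` draw whiskers at 1.5 times the interquartile range by default (`whis=1.5`), and points beyond them appear as outliers. The same rule can be applied numerically, as in the sketch below. The helper `iqr_outliers` is a hypothetical name introduced here, and the `rain` values are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Made-up values with one obviously extreme day
rain = pd.Series([0.0, 0.2, 0.5, 1.0, 1.2, 1.5, 2.0, 2.1, 2.5, 30.0], name="rain")
mask = iqr_outliers(rain)
print(rain[mask])  # only the 30.0 reading is flagged
```

A flagged value is not necessarily an error. For a skewed variable such as rainfall, heavy days may be genuine and worth keeping.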

**Initial Observations**
   - Summarise key findings, strong predictors, and potential data quality concerns.
   - Identify which weather variables are promising candidates for the regression model.

---

**Input**: `Cleaned_National_Irish_Weather_Solar_2024.csv` (daily aggregated weather and solar generation)  
**Output**: Graphs, correlations, and insights to guide feature selection for modeling in the next notebook.

### Step 1: Import the libraries needed for EDA and load Cleaned_National_Irish_Weather_Solar_2024.csv
- Import `numpy`, `pandas`, `matplotlib`, and `statsmodels`.
- Load the cleaned dataset and view its column info.

In [None]:
# Standard Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Statsmodels (for statistical analysis)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm


In [None]:
# Load the cleaned data and inspect its structure (columns, dtypes, non-null counts)
file_path = "../Cleaned Data/Cleaned_National_Irish_Weather_Solar_2024.csv"
df = pd.read_csv(file_path)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      366 non-null    object 
 1   rain      366 non-null    float64
 2   maxtp     366 non-null    float64
 3   mintp     366 non-null    float64
 4   cbl       366 non-null    float64
 5   glorad    366 non-null    float64
 6   solargen  366 non-null    float64
dtypes: float64(6), object(1)
memory usage: 20.1+ KB
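
The summary above shows `date` stored as `object`, that is, as plain strings. For time series plots it is usually convenient to convert it to a datetime type and use it as the index. The sketch below shows one way to do this on a tiny made-up frame. It assumes the dates are in an ISO-like `YYYY-MM-DD` format, which the output above does not confirm.

```python
import pandas as pd

# Tiny made-up frame standing in for df; the date format is an assumption
demo = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-01", "2024-01-03"],
    "solargen": [12.5, 10.0, 9.8],
})

demo["date"] = pd.to_datetime(demo["date"])  # object (string) -> datetime64
demo = demo.set_index("date").sort_index()   # chronological DatetimeIndex

print(demo.index.dtype)
print(demo.index.is_unique)  # True if there is one row per day
```

With 366 rows, the real dataset is consistent with one row per day of the leap year 2024, which the uniqueness check can help confirm.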
