# 🌞 MoonLight Energy Solutions: Solar Investment Analysis, Sierra Leone 🇸🇱

In [1]:
#all imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
sys.path.append('../scripts')
import warnings
from scipy.stats import zscore  # needed by the z-score strip plot in the outlier section
from data_quality_utils import (
    columns_with_significant_missing_values,
    detect_outliers_zscore,
    find_columns_with_invalid_values,
    conditional_impute,
    impute_multiple_targets_with_model,
    impute_ghi_with_linear_regression,
    replace_negative_irradiance_with_nan,
)
from visualization_utils import (
    plot_continuous_histograms,
    plot_scatter_relationships,
    plot_rh_relationships,
    plot_bubble_ghi_vs_tamb,
    plot_mod_cleaning_effect,
    plot_irradiance_temperature_timeseries,
    plot_hourly_irradiance_temperature,
    plot_monthly_irradiance_temperature,
    plot_ghi_anomalies,
    plot_wind_rose,
    plot_correlation_heatmap,
    plot_pairplot,
)
from feature_engineering_utils import log_transform_columns
from windrose import WindroseAxes

In [2]:
#suppress all warnings
warnings.filterwarnings("ignore")

## 📚 Table of Contents

1. [📊 Introduction & Objective](#1-introduction--objective)
2. [📦 Data Loading & Overview](#2-data-loading--overview)
3. [📐 Data Types & Basic Stats](#3-data-types--basic-stats)
4. [🔍 Data Quality Analysis](#4-data-quality-analysis)
5. [🧹 Data Cleaning](#5-data-cleaning)
6. [📈 Univariate Analysis (Single Variable)](#6-univariate-analysis-single-variable)
7. [📉 Bivariate/Multivariate Analysis](#7-bivariatemultivariate-analysis)
8. [🧮 Feature Engineering](#8-feature-engineering)
9. [📅 Time Series Trends](#9-time-series-trends)
10. [🧠 Key Insights](#10-key-insights)
11. [🔚 Conclusion & Next Steps](#11-conclusion--next-steps)

## 📊 1. Introduction & Objective <a id='1-introduction--objective'></a>

### Background in the Subject Matter

Understanding lead and lag measures is crucial in solar energy analytics to identify what drives performance (lead) and what reflects performance outcomes (lag).

---

#### 🔹 Lead Measures

| Parameter         | Description |
|------------------|-------------|
| **Cleaning (1/0)** | Indicates whether a cleaning event occurred. A direct action that can influence panel efficiency. |
| **Precipitation (mm/min)** | Natural cleaning mechanism. Affects panel cleanliness and performance. |
| **RH (Relative Humidity)** | Can contribute to soiling or panel fogging. A predictive factor for efficiency. |
| **WS (Wind Speed)** | Can help remove dust/debris. High wind may act as a natural cleaning factor. |
| **TModA / TModB (°C)** | Module temperatures. Impact the conversion efficiency — monitored to optimize performance. |



#### 🔹 Lag Measures

| Parameter         | Description |
|------------------|-------------|
| **GHI (Global Horizontal Irradiance)** | Total solar radiation on a horizontal surface — reflects solar availability. |
| **DNI (Direct Normal Irradiance)** | Direct solar radiation received perpendicularly — outcome of atmospheric conditions. |
| **DHI (Diffuse Horizontal Irradiance)** | Scattered sunlight received — indicates sky clarity. |
| **ModA / ModB (W/m²)** | Actual power received by panels — outcome of environmental and maintenance factors. |
| **Tamb (Ambient Temperature)** | Environmental factor — affects efficiency but cannot be controlled. |
| **BP (Barometric Pressure)** | Reflects atmospheric conditions — no direct control. |
| **WD / WDstdev** | Wind direction and its variability — background environmental effects. |
| **WSstdev / WSgust** | Wind variability and gusts — lag indicators of natural impacts. |



### 🇸🇱 Background on Sierra Leone

#### ☀️ Solar Power Potential of Sierra Leone
Sierra Leone lies in West Africa, between latitudes 7° and 10° north of the Equator, and has significant untapped solar energy potential. With average Global Horizontal Irradiance (GHI) of 4.5 to 5.5 kWh/m²/day, the country receives consistent, abundant sunlight throughout the year. This makes it well suited to solar photovoltaic (PV) generation, especially in rural and off-grid regions.

### 🎯 Business Objective

Perform a quick yet insightful analysis of solar radiation and environmental data to:
- Identify **key trends and performance drivers**.
- Understand the **impact of environmental conditions** on solar energy potential.
- Evaluate the **effect of soiling and cleaning** on solar module performance.
- Recommend **ideal conditions or locations** for sustainable solar installations.


### 🧠 Key Questions to Explore

1. **Solar Potential**  
   - Where and when is solar radiation (`GHI`, `DNI`, `DHI`) strongest and most consistent?

2. **Environmental Impact on Performance**  
   - How do temperature, humidity, wind, and pressure affect solar metrics?

3. **Sensor/Module Performance**  
   - How do `ModA` and `ModB` correlate with irradiance data?
   - Are performance improvements observed after cleaning?

4. **Cleaning Effectiveness**  
   - What is the impact of cleaning events on solar performance?
   - Can a cleaning schedule be recommended?

## 📦 2. Data Loading & Overview  <a id= '2-data-loading--overview'></a>

#### Loading Data

In [3]:
# Load the data from the locally stored dataset (it can also be loaded from the GitHub repo)
df=pd.read_csv('../data/sierraleone-bumbuna.csv')
df

Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
0,2021-10-30 00:01,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.1,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
1,2021-10-30 00:02,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
2,2021-10-30 00:03,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
3,2021-10-30 00:04,-0.7,0.0,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.1,22.3,22.6,
4,2021-10-30 00:05,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
525595,2022-10-29 23:56,-1.6,-0.1,-2.9,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,999,0,0.0,24.2,24.5,
525596,2022-10-29 23:57,-1.7,-0.1,-3.0,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,999,0,0.0,24.2,24.5,
525597,2022-10-29 23:58,-1.7,-0.1,-3.1,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,1000,0,0.0,24.1,24.4,
525598,2022-10-29 23:59,-1.7,-0.2,-3.3,0.0,0.0,23.9,100.0,0.0,0.0,0.0,0.0,0.0,1000,0,0.0,24.1,24.4,


#### Data Overview

In [4]:
df.head()

Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
0,2021-10-30 00:01,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.1,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
1,2021-10-30 00:02,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
2,2021-10-30 00:03,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.2,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,
3,2021-10-30 00:04,-0.7,0.0,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.1,22.3,22.6,
4,2021-10-30 00:05,-0.7,-0.1,-0.8,0.0,0.0,21.9,99.3,0.0,0.0,0.0,0.0,0.0,1002,0,0.0,22.3,22.6,


In [5]:
df.tail()

Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
525595,2022-10-29 23:56,-1.6,-0.1,-2.9,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,999,0,0.0,24.2,24.5,
525596,2022-10-29 23:57,-1.7,-0.1,-3.0,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,999,0,0.0,24.2,24.5,
525597,2022-10-29 23:58,-1.7,-0.1,-3.1,0.0,0.0,24.0,100.0,0.0,0.0,0.0,0.0,0.0,1000,0,0.0,24.1,24.4,
525598,2022-10-29 23:59,-1.7,-0.2,-3.3,0.0,0.0,23.9,100.0,0.0,0.0,0.0,0.0,0.0,1000,0,0.0,24.1,24.4,
525599,2022-10-30 00:00,-1.7,-0.1,-3.4,0.0,0.0,23.9,100.0,0.0,0.0,0.0,0.0,0.0,1000,0,0.0,24.1,24.4,


In [6]:
df.sample(10)

Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
358314,2022-07-05 19:55,-1.4,-0.4,-1.6,0.0,0.0,26.3,89.2,0.1,1.1,0.4,190.8,1.5,1001,0,0.0,25.8,26.0,
411292,2022-08-11 14:53,441.8,1.2,438.1,424.4,403.7,27.2,84.3,1.8,2.4,0.4,216.2,14.7,999,0,0.0,35.1,35.8,
434160,2022-08-27 12:01,269.1,1.2,267.0,294.0,287.4,28.3,79.5,0.4,1.4,0.6,80.9,5.0,1000,0,0.0,48.2,46.4,
391723,2022-07-29 00:44,-1.8,-0.1,-2.3,0.0,0.0,22.8,99.6,0.0,0.0,0.0,0.0,0.0,1003,0,0.0,23.2,23.5,
137952,2022-02-02 19:13,-15.2,-1.1,-16.4,0.1,0.1,27.5,39.2,1.8,2.1,0.3,53.6,3.3,998,0,0.0,25.9,26.4,
457612,2022-09-12 18:53,-0.3,-0.1,-0.9,1.2,1.2,25.0,94.1,3.1,4.6,1.0,197.5,18.1,1000,0,0.0,25.6,25.8,
148651,2022-02-10 05:32,-11.8,-0.5,-11.7,0.0,0.0,20.1,92.0,0.0,0.0,0.0,0.0,0.0,998,0,0.0,19.1,19.2,
495587,2022-10-09 03:48,-1.0,-0.2,-2.3,0.0,0.0,22.5,100.0,0.1,1.1,0.4,55.9,2.4,999,0,0.0,22.6,22.9,
135383,2022-02-01 00:24,-10.6,-0.5,-10.9,0.0,0.0,22.7,81.4,0.0,0.0,0.0,0.0,0.0,1001,0,0.0,21.6,21.8,
185780,2022-03-08 00:21,-9.0,-0.4,-8.3,0.0,0.0,23.4,71.8,0.0,0.0,0.0,0.0,0.0,998,0,0.0,22.4,22.8,


In [7]:
#shape of the dataset
df.shape

(525600, 19)

In [8]:
#list of columns of the dataset
df.columns

Index(['Timestamp', 'GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'Tamb', 'RH', 'WS',
       'WSgust', 'WSstdev', 'WD', 'WDstdev', 'BP', 'Cleaning', 'Precipitation',
       'TModA', 'TModB', 'Comments'],
      dtype='object')

## 📐 3. Data Types & Basic Stats <a id='3-data-types--basic-stats'></a>

#### Data Summaries: Basic Stats

In [9]:
#Numerical Columns
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
GHI,525600.0,201.957515,298.49515,-19.5,-2.8,0.3,362.4,1499.0
DNI,525600.0,116.376337,218.652659,-7.8,-0.3,-0.1,107.0,946.0
DHI,525600.0,113.720571,158.946032,-17.9,-3.8,-0.1,224.7,892.0
ModA,525600.0,206.643095,300.896893,0.0,0.0,3.6,359.5,1507.0
ModB,525600.0,198.114691,288.889073,0.0,0.0,3.4,345.4,1473.0
Tamb,525600.0,26.319394,4.398605,12.3,23.1,25.3,29.4,39.9
RH,525600.0,79.448857,20.520775,9.9,68.7,85.4,96.7,100.0
WS,525600.0,1.146113,1.239248,0.0,0.0,0.8,2.0,19.2
WSgust,525600.0,1.691606,1.617053,0.0,0.0,1.6,2.6,23.9
WSstdev,525600.0,0.363823,0.295,0.0,0.0,0.4,0.6,4.1


In [11]:
#for columns of object type
df.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
Timestamp,525600,525600,2022-10-29 23:44,1


#### Data Types

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525600 entries, 0 to 525599
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Timestamp      525600 non-null  object 
 1   GHI            525600 non-null  float64
 2   DNI            525600 non-null  float64
 3   DHI            525600 non-null  float64
 4   ModA           525600 non-null  float64
 5   ModB           525600 non-null  float64
 6   Tamb           525600 non-null  float64
 7   RH             525600 non-null  float64
 8   WS             525600 non-null  float64
 9   WSgust         525600 non-null  float64
 10  WSstdev        525600 non-null  float64
 11  WD             525600 non-null  float64
 12  WDstdev        525600 non-null  float64
 13  BP             525600 non-null  int64  
 14  Cleaning       525600 non-null  int64  
 15  Precipitation  525600 non-null  float64
 16  TModA          525600 non-null  float64
 17  TModB          525600 non-nul

### Distinct Values

In [13]:
df.nunique().sort_values(ascending=False)

Timestamp        525600
ModA              10188
GHI                8742
ModB               8524
DNI                8205
DHI                7183
WD                 3601
RH                  902
WDstdev             712
TModA               620
TModB               572
Tamb                276
WS                  145
WSgust               88
WSstdev              40
Precipitation        24
BP                   14
Cleaning              2
Comments              0
dtype: int64

## 🔍 4. Data Quality Analysis <a id='4-data-quality-analysis'></a>

### Missing-Values Analysis

In [14]:
#count of missing values per column
df.isna().sum() 

Timestamp             0
GHI                   0
DNI                   0
DHI                   0
ModA                  0
ModB                  0
Tamb                  0
RH                    0
WS                    0
WSgust                0
WSstdev               0
WD                    0
WDstdev               0
BP                    0
Cleaning              0
Precipitation         0
TModA                 0
TModB                 0
Comments         525600
dtype: int64

##### Columns with significant number of missing values

In [16]:
#column with >5% nulls
columns_with_significant_missing_values(df, threshold=5)

Unnamed: 0,#missing_values,percentage
Comments,525600,100.00%
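
The implementation of `columns_with_significant_missing_values` lives in `data_quality_utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical `missing_value_summary` below counts nulls per column and keeps only columns above a percentage threshold. The function name, the toy data, and the strict `>` comparison are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_value_summary(df: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Hypothetical stand-in: list columns whose share of nulls exceeds `threshold` percent."""
    n_missing = df.isna().sum()
    percentage = 100.0 * n_missing / len(df)
    summary = pd.DataFrame({"#missing_values": n_missing, "percentage": percentage})
    # Keep only the columns above the threshold, worst first
    return summary[summary["percentage"] > threshold].sort_values(
        "percentage", ascending=False
    )


# Toy frame: GHI is 25% missing, Comments is entirely missing
toy = pd.DataFrame(
    {
        "GHI": [1.0, 2.0, np.nan, 4.0],
        "RH": [50.0, 60.0, 70.0, 80.0],
        "Comments": [np.nan] * 4,
    }
)
summary = missing_value_summary(toy, threshold=30)
print(summary)
```

With a 30% threshold only `Comments` is reported, which mirrors the situation in the real dataset where `Comments` is the only column above 5%.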


### Duplicated Values Analysis

In [None]:
#check for duplicates
print(df.duplicated().sum())

### Detect Invalid Data

#### Check if data lies within the valid range

In [None]:
# Dictionary of variable name and valid range (min, max)
valid_ranges = {
    'GHI': (0, 1300),
    'DNI': (0, 1300),
    'DHI': (0, 1000),
    'ModA': (0, 1300),
    'ModB': (0, 1300),
    'Tamb': (-40, 60),
    'TModA': (-40, 80),
    'TModB': (-40, 80),
    'RH': (0, 100),
    'WS': (0, 60),
    'WSgust': (0, 80),
    'WSstdev': (0, 20),
    'WD': (0, 360),
    'WDstdev': (0, 180),
    'BP': (800, 1100),
    'Precipitation': (0, 10),
    'Cleaning': (0, 1)
}


In [None]:
# Find columns that violate the valid ranges
find_columns_with_invalid_values(df,valid_ranges)
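
`find_columns_with_invalid_values` is defined in `data_quality_utils` and its body is not shown here. A minimal sketch of a range check in the same spirit, assuming the helper simply counts values below the minimum and above the maximum for each column, could look like this. The name `count_out_of_range` and the output column names are hypothetical.

```python
import pandas as pd


def count_out_of_range(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Hypothetical stand-in: count values outside (min, max) for each listed column."""
    rows = []
    for col, (low, high) in valid_ranges.items():
        if col not in df.columns:
            continue  # skip ranges for columns the frame does not have
        below = int((df[col] < low).sum())
        above = int((df[col] > high).sum())
        if below or above:
            rows.append({"column": col, "below_min": below, "above_max": above})
    return pd.DataFrame(rows, columns=["column", "below_min", "above_max"])


# Toy data: one negative GHI reading and one reading above 1300 W/m²
toy = pd.DataFrame({"GHI": [-0.7, 250.0, 1499.0], "RH": [99.1, 85.0, 100.0]})
report = count_out_of_range(toy, {"GHI": (0, 1300), "RH": (0, 100)})
print(report)
```

Reporting the two directions separately is a design choice: negative night-time irradiance and implausibly high daytime peaks usually call for different treatment.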

##### Looking into Negative Irradiance values

In [None]:
# Count rows where any of the three values is negative
invalid_rows = df[(df['GHI'] < 0) | (df['DHI'] < 0) | (df['DNI'] < 0)]

# Print count
print(f"Rows with at least one invalid irradiance value (GHI, DHI, or DNI < 0): {len(invalid_rows)}")


##### Check if these negative values for GHI,DNI and DHI occur simultaneously

In [None]:
# Create a boolean mask for each condition
neg_ghi = df['GHI'] < 0
neg_dhi = df['DHI'] < 0
neg_dni = df['DNI'] < 0

# Create a new column to capture the combination as a label
df['irradiance_negative_combo'] = (
    neg_ghi.astype(int).astype(str) + 
    neg_dhi.astype(int).astype(str) + 
    neg_dni.astype(int).astype(str)
)

# Count frequency of each combination
combo_counts = df['irradiance_negative_combo'].value_counts().sort_index()
print(combo_counts)


##### Check whether these negative values are recorded during the night, to see if we can impute 0 as their value

In [None]:
#add hour and is_night columns to help with analysis
df['hour'] = pd.to_datetime(df['Timestamp']).dt.hour
df['is_night'] = (df['hour'] < 6) | (df['hour'] > 18)  # Example: night is before 06:00 or from 19:00 onward
df


In [None]:
# Filter rows where GHI, DNI, DHI < 0
irradiance_neg = df[(df['GHI'] < 0) | (df['DNI'] < 0) | (df['DHI'] < 0)]

# Check how many of those rows occurred at night
irradiance_neg['is_night'].value_counts()


##### Negative GHI, DNI and DHI values occurring simultaneously during the night are candidates to be imputed as zero

In [None]:
# Group by combo and night/day
combo_night_counts = df.groupby(['irradiance_negative_combo', 'is_night']).size().unstack(fill_value=0)

# Rename columns for clarity
combo_night_counts.columns = ['day_count', 'night_count']  # False → Day, True → Night

# Optional: Add total count per combo
combo_night_counts['total'] = combo_night_counts['day_count'] + combo_night_counts['night_count']

# Display the result
print(combo_night_counts)


<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Course of Action:</font></h3>

Negative GHI, DHI and DNI values that occur simultaneously during the night will be imputed as zero.
</div>

##### Exploring Irradiance values beyond 1300

In [None]:
# Filter rows where GHI, ModA or ModB is greater than 1300
high_irradiance_df = df[(df['GHI'] > 1300) | (df['ModA'] > 1300) | (df['ModB'] > 1300)]

# Display the filtered rows
print(high_irradiance_df)

# Optionally, see how many such rows exist
print(f"Number of rows with GHI, ModA or ModB > 1300: {len(high_irradiance_df)}")


<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Observations:</font></h3>
1. All entries occur at midday (solar peak hours):
hour values range from 11 to 14, the window around solar noon when irradiance peaks.

is_night is False in all cases, so these are daytime observations.

2. High GHI values:
GHI ranges from 1302 to 1413 W/m², slightly above the ~1300 W/m² upper bound used in the range check.

Values such as 1390 and 1413 W/m² exceed even the solar constant of about 1361 W/m², which is measured at the top of the atmosphere, so they are unusually high for a surface sensor. Short spikes like these may come from cloud-edge reflection (cloud enhancement) or from sensor error.

3. High ModA/ModB (module-plane irradiance):
ModA and ModB go up to 1342.3 W/m² and track GHI closely. This is plausible if the modules are tilted favorably or receive additional diffuse or reflected radiation.
</div>



In [None]:
# Drop the helper columns added for the negative-irradiance analysis
df.drop(columns=['irradiance_negative_combo', 'hour', 'is_night'], inplace=True)
df

### Outlier Detection

#### Detect outliers

In [None]:
#checking for outliers in select columns
#using z-score method
columns_to_check_for_outliers = ['ModA','ModB','WS','WSgust','GHI','DHI','DNI']
outlier_counts = {
    "column": [],
    "num_outliers": []
}

for col in columns_to_check_for_outliers:
    outliers = detect_outliers_zscore(df, col)
    outlier_counts["column"].append(col)
    outlier_counts["num_outliers"].append(len(outliers))

outlier_df = pd.DataFrame(outlier_counts)
print(outlier_df)
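
The z-score rule used above flags a value when it lies more than a chosen number of standard deviations from the column mean. `detect_outliers_zscore` itself is not shown in this notebook, so the sketch below is only an assumption about its behavior: the name `zscore_outliers`, the default threshold of 3, and the use of the population standard deviation (`ddof=0`, matching the default of `scipy.stats.zscore`) are illustrative choices.

```python
import numpy as np
import pandas as pd


def zscore_outliers(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Hypothetical stand-in: return rows where |z-score| of `column` exceeds `threshold`."""
    values = df[column]
    std = values.std(ddof=0)  # population standard deviation
    if std == 0 or np.isnan(std):
        return df.iloc[0:0]  # constant or empty column: nothing to flag
    z = (values - values.mean()) / std
    return df[z.abs() > threshold]


# Toy data: twenty calm readings and one 19.2 m/s wind-speed spike
toy = pd.DataFrame({"WS": [1.0] * 20 + [19.2]})
outliers = zscore_outliers(toy, "WS")
print(outliers)
```

Because irradiance columns are dominated by night-time zeros, a global z-score tends to flag ordinary bright midday readings; computing it on daytime rows only may give a more meaningful picture.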
    

#### Visualize outliers

In [None]:
# Z-score outlier strip plot
n_cols = 2  # adjust as needed
n_rows = (len(columns_to_check_for_outliers) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 2 * n_rows))
axes = axes.flatten()

for i, col in enumerate(columns_to_check_for_outliers):
    df['z'] = zscore(df[col].dropna())
    sns.stripplot(x='z', data=df.dropna(subset=[col]), color='orange', ax=axes[i])
    axes[i].axvline(3, color='red', linestyle='--')
    axes[i].axvline(-3, color='red', linestyle='--')
    axes[i].set_title(f'Z-score Strip Plot: {col}')
    axes[i].set_xlabel('Z-score')

# Remove unused axes if any
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()


In [None]:
#Visualizing outliers using boxplots
plt.figure(figsize=(max(8, len(columns_to_check_for_outliers)* 1.5), 6))  # Auto-adjust width
sns.set_context("notebook", font_scale=1.1)

sns.boxplot(data=df[columns_to_check_for_outliers], palette="Set2")
plt.title("Boxplot of Selected Columns (Outliers Visualized)", fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()


In [None]:
#Singling out WS and WSgust
plt.figure(figsize=(max(8, len(columns_to_check_for_outliers) * 1.5), 6))  # Auto-adjust width
sns.set_context("notebook", font_scale=1.1)

sns.boxplot(data=df[columns_to_check_for_outliers[2:4]], palette="Set2")
plt.title("Boxplot of WS and WSgust (Outliers Visualized)", fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()


In [None]:
# drop z column
df.drop(columns=['z'], inplace=True)    

## 🧹 5. Data Cleaning  <a id='5-data-cleaning'></a>

In [None]:
#saving the original dataset for later use
df_original=df.copy()

#### Handle Missing Values

In [None]:
# The Comments column has no values, so it can be dropped
columns_to_delete = ['Comments']
existing_columns = [col for col in columns_to_delete if col in df.columns]
df = df.drop(existing_columns, axis=1)

In [None]:
#inspect the data after dropping the columns
df.sample(10)

### Handle Inconsistencies / Inaccuracies

#### Impute zero when all GHI,DHI and DNI are negative during the night

In [None]:
# Impute zero when all GHI,DHI and DNI are negative during the night
conditions = {
    'GHI': '<= 0',
    'DHI': '<= 0',
    'DNI': '<= 0',
    'is_night': '== True'
}

updates = {
    'GHI': 0,
    'DHI': 0,
    'DNI': 0
}
df = conditional_impute(df, 'Timestamp', conditions, updates)
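
`conditional_impute` comes from `data_quality_utils`, and how it derives `is_night` from the `Timestamp` column is not shown. The sketch below illustrates the same rule under stated assumptions: night is taken as hours before 6 or after 18 (the definition used earlier in this notebook), and the function name `zero_night_negatives` is hypothetical.

```python
import pandas as pd


def zero_night_negatives(
    df: pd.DataFrame,
    ts_col: str = "Timestamp",
    cols: tuple = ("GHI", "DHI", "DNI"),
) -> pd.DataFrame:
    """Hypothetical stand-in: set irradiance to 0 where all three are <= 0 at night."""
    out = df.copy()
    hour = pd.to_datetime(out[ts_col]).dt.hour
    is_night = (hour < 6) | (hour > 18)  # assumed night window
    all_non_positive = (out[list(cols)] <= 0).all(axis=1)
    out.loc[is_night & all_non_positive, list(cols)] = 0.0
    return out


toy = pd.DataFrame(
    {
        "Timestamp": ["2021-10-30 00:01", "2021-10-30 12:00", "2021-10-30 23:00"],
        "GHI": [-0.7, 500.0, -1.2],
        "DHI": [-0.8, 200.0, 3.0],
        "DNI": [-0.1, 400.0, -0.2],
    }
)
fixed = zero_night_negatives(toy)
print(fixed)
```

Only the first row is changed. The last row is at night but has a positive DHI, so it does not satisfy the "all three non-positive" condition and is left for the later imputation steps.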

In [None]:
df.sample(10)

In [None]:
# GHI has a near-linear relationship with ModA and ModB, so regression can be used
# to impute the remaining negative GHI values.

# Set all negative values of GHI, DHI and DNI to NaN
df = replace_negative_irradiance_with_nan(df)
# Impute GHI using ModA and ModB
df = impute_ghi_with_linear_regression(df)
df.sample(10)
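
The body of `impute_ghi_with_linear_regression` is not shown here. As an illustration of the idea only, the sketch below fits an ordinary least-squares model `GHI ≈ b0 + b1·ModA + b2·ModB` on rows where GHI is known and predicts the missing ones. The function name `impute_ghi_from_modules` and the use of `numpy.linalg.lstsq` are assumptions; the real helper may use a different library or model.

```python
import numpy as np
import pandas as pd


def impute_ghi_from_modules(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: fill missing GHI from a least-squares fit on ModA and ModB."""
    out = df.copy()
    predictors = out[["ModA", "ModB"]].to_numpy(dtype=float)
    design = np.column_stack([np.ones(len(out)), predictors])  # intercept + predictors
    known = out["GHI"].notna().to_numpy()
    # Fit only on rows where GHI is observed
    coef, *_ = np.linalg.lstsq(
        design[known], out.loc[known, "GHI"].to_numpy(dtype=float), rcond=None
    )
    out.loc[~known, "GHI"] = design[~known] @ coef
    return out


# Toy data built so that GHI = 5 + ModA + ModB exactly on the known rows
toy = pd.DataFrame(
    {
        "ModA": [0.0, 100.0, 200.0, 300.0, 400.0],
        "ModB": [0.0, 90.0, 210.0, 280.0, 410.0],
        "GHI": [5.0, 195.0, np.nan, 585.0, 815.0],
    }
)
imputed = impute_ghi_from_modules(toy)
print(imputed)
```

Since ModA and ModB are themselves strongly correlated, the individual coefficients are not very stable, but the predictions used for imputation are still usable. A fit like this can return slightly negative values near zero, so clipping at 0 afterward may be worth considering.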

In [None]:
#impute DNI and DHI using other features
df=impute_multiple_targets_with_model(df)


In [None]:
df.sample(10)

In [None]:
#count of missing values per column
df.isna().sum() 

## 📈 6. Univariate Analysis (Single Variable) <a id='6-univariate-analysis-single-variable'></a>

### Distributions of Variables

In [None]:
#plot histograms for continuous variables
plot_continuous_histograms(df)

## 📉 7. Bivariate/Multivariate Analysis <a id='7-bivariatemultivariate-analysis'></a>

### Correlation & Relationship Analysis

In [None]:
columns=df.columns.to_list()
columns

In [None]:
# Pairplot of selected columns
columns_for_pairplot = [*columns_to_check_for_outliers, 'Tamb', 'RH', 'BP', 'Precipitation', 'TModA', 'TModB', 'WD']
plot_pairplot(df, columns_for_pairplot)

#### Heatmap of correlations (GHI, DNI, DHI, TModA, TModB).

In [None]:
columns = ['GHI', 'DNI', 'DHI', 'TModA', 'TModB']
plot_correlation_heatmap(df,columns)

#### Scatter plots: WS, WSgust, WD vs. GHI; RH vs. Tamb or RH vs. GHI.


In [None]:
plot_scatter_relationships(df)

#### Wind & Distribution Analysis

In [None]:
plot_wind_rose(df)

### Temperature Analysis

In [None]:
#Examine how relative humidity (RH) might influence temperature readings and solar radiation
plot_rh_relationships(df)

#### Bubble Chart

In [None]:
#GHI vs. Tamb with bubble size = RH or BP.
plot_bubble_ghi_vs_tamb(df)

### Cleaning Effect

In [None]:
# Group by 'Cleaning' and calculate mean for ModA and ModB
plot_mod_cleaning_effect(df)
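
`plot_mod_cleaning_effect` is a plotting helper whose body is not shown. Based on the comment in the cell above, the numbers behind such a plot are presumably a group-by mean of ModA and ModB per `Cleaning` flag. The toy values below are made up purely to show the aggregation.

```python
import pandas as pd

# Toy data: three readings without cleaning, two flagged as cleaning events
toy = pd.DataFrame(
    {
        "Cleaning": [0, 0, 0, 1, 1],
        "ModA": [200.0, 210.0, 190.0, 260.0, 280.0],
        "ModB": [195.0, 205.0, 185.0, 250.0, 270.0],
    }
)

# Mean module reading for each value of the Cleaning flag
cleaning_means = toy.groupby("Cleaning")[["ModA", "ModB"]].mean()
print(cleaning_means)
```

A raw comparison like this can be misleading on the real data: if cleaning events are logged mostly during daylight, the `Cleaning == 1` group would show higher readings regardless of any soiling effect. Restricting both groups to comparable hours is one way to reduce that bias.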

## 🧮 8. Feature Engineering <a id='8-feature-engineering'></a>

#### GHI, DNI, DHI, ModA and ModB have skewed distributions; a log transform reduces the skew, which helps if we later use parametric tests

In [None]:
#log transform the columns GHI, DNI, DHI, ModA and ModB
df=log_transform_columns(df, ['GHI', 'DNI', 'DHI', 'ModA', 'ModB'])
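
The exact transform inside `log_transform_columns` is not shown. Because the cleaned irradiance columns contain many zeros and `log(0)` is undefined, a common choice is `log(1 + x)`. The sketch below assumes that variant; the function name `log1p_columns` and the clipping of any leftover negatives to zero are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def log1p_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in: apply log(1 + x) to the given non-negative columns."""
    out = df.copy()
    for col in columns:
        # clip guards against stray negatives, for which log1p is undefined below -1
        out[col] = np.log1p(out[col].clip(lower=0))
    return out


toy = pd.DataFrame({"GHI": [0.0, 9.0, 99.0, 999.0]})
transformed = log1p_columns(toy, ["GHI"])
print(transformed)
print("skew before:", toy["GHI"].skew(), "after:", transformed["GHI"].skew())
```

The transform compresses the long right tail, but it does not make the data normal: the large spike of night-time zeros remains, so distribution checks are still worth running before any parametric test.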

In [None]:
# recheck the distributions after log transformation
plot_continuous_histograms(df)

## 📅 9. Time Series Trends  <a id='9-time-series-trends'></a>

#### Line or bar charts of GHI, DNI, DHI, Tamb vs. Timestamp.

In [None]:
plot_irradiance_temperature_timeseries(df)

#### Observe patterns by month, trends throughout day, or anomalies, such as peaks in solar irradiance or temperature fluctuations. 

##### 🔍 1. Monthly Patterns

In [None]:
# Extract month from timestamp
plot_monthly_irradiance_temperature(df)

##### 🕒 2. Daily Trends

In [None]:
plot_hourly_irradiance_temperature(df)
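
`plot_hourly_irradiance_temperature` is not shown, but an hourly profile of this kind is typically built by averaging each variable over the hour of day. The sketch below shows only that aggregation step, with a hypothetical helper name `hourly_profile` and made-up toy values.

```python
import pandas as pd


def hourly_profile(
    df: pd.DataFrame,
    value_cols: tuple = ("GHI", "Tamb"),
    ts_col: str = "Timestamp",
) -> pd.DataFrame:
    """Hypothetical stand-in: mean of each value column per hour of day."""
    hours = pd.to_datetime(df[ts_col]).dt.hour.rename("hour")
    return df[list(value_cols)].groupby(hours).mean()


toy = pd.DataFrame(
    {
        "Timestamp": ["2021-10-30 12:00", "2021-10-31 12:30", "2021-10-30 00:01"],
        "GHI": [400.0, 600.0, 0.0],
        "Tamb": [30.0, 32.0, 22.0],
    }
)
profile = hourly_profile(toy)
print(profile)
```

The same pattern with `.dt.month` in place of `.dt.hour` would give the monthly view.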

##### ⚠️ 3. Anomaly Detection (Peaks & Drops)

In [None]:
plot_ghi_anomalies(df)

#### Save the preprocessed dataset

In [None]:
# Save the preprocessed Sierra Leone dataset
df.to_csv('../data/sierraleone_clean.csv', index=False)
df.sample(10)

## 🧠 10. Key Insights <a id='10-key-insights'></a>

<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Observations:</font></h3>

* WS (Wind Speed) and WSgust (Wind Gust Speed) have a near-linear relationship, so for this project (analyzing solar energy potential and sensor performance) including both may be redundant.
* GHI has a near-linear relationship with both ModA and ModB.
* ModA and ModB have a near-linear relationship with each other.
* WD has little correlation with any of the other variables.
* TModA and TModB are linearly related, and their correlations with all other variables are nearly identical.
</div>

<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Observations:</font></h3>

**☀️ Solar Irradiance Variables (GHI, DNI, DHI, ModA, ModB)**
Distributions are right-skewed: Most values are close to zero, with a long tail of high values.

**Implication:** These are only non-zero during daytime → confirms irradiance-based splitting logic (e.g., GHI > 0 → daytime).

**Action:** Log transformation or clipping may be needed when using them for modeling or visualization.
</div>

<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Observations:</font></h3>

**💧 Humidity (RH)**
Fairly uniform or slightly U-shaped: High frequency at both low and high RH levels.

**Implication:** Reflects variability in atmospheric moisture (from dry to humid).

**Action:** RH can be a good input to model heat dissipation or fog effects on panels.
</div>

<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Observations:</font></h3>

**🧪 BP (Barometric Pressure)**
Shows cyclical fluctuations (possibly measurement artifact or elevation-influenced).

**Implication:** Limited predictive power unless used in atmospheric modeling.

**Optional:** Could be dropped if irrelevant for power forecasting.
</div>

## 🔚 11. Conclusion & Next Steps <a id='11-conclusion--next-steps'></a>

<div style="border-radius:10px; border:orange solid; padding: 15px; font-size:100%; text-align:left; font-color:#325939;background-color:#2c2c2c">
<h3 align="left"><font color='orange'>💡 Recommendations:</font></h3>

* Given the strong correlations among solar variables (GHI, DNI, DHI, ModA, ModB), consider dimensionality reduction (e.g., PCA) or selecting a subset to avoid multicollinearity in modeling.
* Temperature and humidity variables are moderately correlated with solar irradiance, so including both can help models capture environmental conditions affecting your system.
* Log-transforming skewed variables like solar irradiance and wind speeds could improve model performance by normalizing their distributions.
* Variables like Cleaning and Precipitation are mostly zeros. Ensure models handle this class imbalance properly or use specialized techniques for rare events.
* Consider interaction terms between solar radiation and humidity or temperature, as their interplay may affect your system’s behavior.
</div>