# Benin Solar Farm EDA

## Overview
This notebook performs exploratory data analysis (EDA) on the Benin (Malanville) solar farm dataset, which holds minute-level sensor readings.  
We focus on:
- Profiling and cleaning the data
- Outlier detection
- Time-series analysis
- Sensor cleaning impact
- Correlation and relationship analysis
- Wind & distribution visualization
- Temperature & humidity analysis
- Bubble chart visualization
- Summary insights and strategic recommendations


## 0. Setup — Import Libraries & Helper Functions
We import the necessary Python libraries and define helper functions for:
- Loading CSV
- Saving cleaned data locally
- Computing z-scores for outlier detection


In [13]:
# Cell 1 — Setup
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from datetime import datetime

plt.rcParams["figure.figsize"] = (12,5)
sns.set(style="whitegrid")

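# Helper functions
def load_csv(path):
    """Read the CSV and parse the Timestamp column as datetime."""
    return pd.read_csv(path, parse_dates=["Timestamp"])

# The default output name is a suggestion; it differs from the raw input path
# so that saving the cleaned frame does not overwrite the source CSV.
def save_clean(df, out_path="../data/benin_clean.csv"):
    """Write the cleaned frame to CSV without the index column."""
    df.to_csv(out_path, index=False)
    print(f"Saved cleaned dataset to: {out_path}")

def zscore_df(df, cols):
    """Return z-scores for the given columns, filling NaNs with the column median first."""
    return df[cols].apply(lambda col: zscore(col.fillna(col.median())))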
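
As a rough illustration of how `zscore_df` might be used for outlier detection, the sketch below flags rows where any selected column has an absolute z-score above a threshold. The threshold of 3, the helper name `flag_outliers`, and the toy data are assumptions made for this example and are not taken from the notebook.

```python
import pandas as pd
from scipy.stats import zscore

def zscore_df(df, cols):
    # Same helper as in the setup cell: fill NaNs with the column median, then z-score.
    return df[cols].apply(lambda col: zscore(col.fillna(col.median())))

def flag_outliers(df, cols, threshold=3.0):
    # Hypothetical helper: True for rows where any column has |z| above the threshold.
    z = zscore_df(df, cols)
    return (z.abs() > threshold).any(axis=1)

# Toy data (made up): 50 ordinary readings plus one extreme spike in the last row.
toy = pd.DataFrame({"GHI": [500.0] * 25 + [510.0] * 25 + [5000.0]})
mask = flag_outliers(toy, ["GHI"])
print(mask.sum(), "outlier row(s) flagged")
```

Because the median fill happens before scoring, missing values land near the centre of the distribution and are not flagged themselves; whether that is appropriate depends on how much data is missing.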


## 1. Load Dataset & Initial Profiling
- Load Benin solar dataset
- Inspect shape, head, dtypes
- Summary statistics & missing-value report


In [14]:
DATA_PATH = "../data/benin-malanville.csv"
df = load_csv(DATA_PATH)

print("Shape:", df.shape)
display(df.head())
display(df.dtypes)

# Summary & missing report
display(df.describe(include="all"))
missing_counts = df.isna().sum()
missing_pct = (missing_counts / len(df)) * 100
print("Columns with >5% missing values:")
display(missing_pct[missing_pct > 5])


Shape: (525600, 19)
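
The row count matches one full year of one-minute readings (365 days × 1,440 minutes = 525,600). A minimal sketch of that arithmetic follows, using the first and last timestamps reported in the profiling output. It assumes one reading per minute with no gaps or duplicates, which the notebook has not verified at this point.

```python
import pandas as pd

# First and last timestamps as reported in the profiling output.
start = pd.Timestamp("2021-08-09 00:01:00")
end = pd.Timestamp("2022-08-09 00:00:00")

# Assumption: exactly one reading per minute between start and end (inclusive).
expected = pd.date_range(start, end, freq="min")

print(len(expected))   # minute-level timestamps in the range
print(365 * 24 * 60)   # minutes in a 365-day year
```

On the loaded frame, comparing `df["Timestamp"].diff().value_counts()` against a single one-minute step would be one way to confirm the spacing directly.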


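| | Timestamp | GHI | DNI | DHI | ModA | ModB | Tamb | RH | WS | WSgust | WSstdev | WD | WDstdev | BP | Cleaning | Precipitation | TModA | TModB | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-08-09 00:01:00 | -1.2 | -0.2 | -1.1 | 0.0 | 0.0 | 26.2 | 93.4 | 0.0 | 0.4 | 0.1 | 122.1 | 0.0 | 998 | 0 | 0.0 | 26.3 | 26.2 | NaN |
| 1 | 2021-08-09 00:02:00 | -1.1 | -0.2 | -1.1 | 0.0 | 0.0 | 26.2 | 93.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 998 | 0 | 0.0 | 26.3 | 26.2 | NaN |
| 2 | 2021-08-09 00:03:00 | -1.1 | -0.2 | -1.1 | 0.0 | 0.0 | 26.2 | 93.7 | 0.3 | 1.1 | 0.5 | 124.6 | 1.5 | 997 | 0 | 0.0 | 26.4 | 26.2 | NaN |
| 3 | 2021-08-09 00:04:00 | -1.1 | -0.1 | -1.0 | 0.0 | 0.0 | 26.2 | 93.3 | 0.2 | 0.7 | 0.4 | 120.3 | 1.3 | 997 | 0 | 0.0 | 26.4 | 26.3 | NaN |
| 4 | 2021-08-09 00:05:00 | -1.0 | -0.1 | -1.0 | 0.0 | 0.0 | 26.2 | 93.3 | 0.1 | 0.7 | 0.3 | 113.2 | 1.0 | 997 | 0 | 0.0 | 26.4 | 26.3 | NaN |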


Timestamp        datetime64[ns]
GHI                     float64
DNI                     float64
DHI                     float64
ModA                    float64
ModB                    float64
Tamb                    float64
RH                      float64
WS                      float64
WSgust                  float64
WSstdev                 float64
WD                      float64
WDstdev                 float64
BP                        int64
Cleaning                  int64
Precipitation           float64
TModA                   float64
TModB                   float64
Comments                float64
dtype: object

| | Timestamp | GHI | DNI | DHI | ModA | ModB | Tamb | RH | WS | WSgust | WSstdev | WD | WDstdev | BP | Cleaning | Precipitation | TModA | TModB | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 525600 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 525600.0 | 0.0 |
| mean | 2022-02-07 12:00:30.000000512 | 240.559452 | 167.187516 | 115.358961 | 236.589496 | 228.883576 | 28.179683 | 54.487969 | 2.121113 | 2.809195 | 0.47339 | 153.435172 | 8.582407 | 994.197199 | 0.000923 | 0.001905 | 35.246026 | 32.471736 | NaN |
| min | 2021-08-09 00:01:00 | -12.9 | -7.8 | -12.6 | 0.0 | 0.0 | 11.0 | 2.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 985.0 | 0.0 | 0.0 | 9.0 | 8.1 | NaN |
| 25% | 2021-11-08 06:00:45 | -2.0 | -0.5 | -2.1 | 0.0 | 0.0 | 24.2 | 28.8 | 1.0 | 1.3 | 0.4 | 59.0 | 3.7 | 993.0 | 0.0 | 0.0 | 24.2 | 23.6 | NaN |
| 50% | 2022-02-07 12:00:30 | 1.8 | -0.1 | 1.6 | 4.5 | 4.3 | 28.0 | 55.1 | 1.9 | 2.6 | 0.5 | 181.0 | 8.6 | 994.0 | 0.0 | 0.0 | 30.0 | 28.9 | NaN |
| 75% | 2022-05-09 18:00:15 | 483.4 | 314.2 | 216.3 | 463.7 | 447.9 | 32.3 | 80.1 | 3.1 | 4.1 | 0.6 | 235.1 | 12.3 | 996.0 | 0.0 | 0.0 | 46.9 | 41.5 | NaN |
| max | 2022-08-09 00:00:00 | 1413.0 | 952.3 | 759.2 | 1342.3 | 1342.3 | 43.8 | 100.0 | 19.5 | 26.6 | 4.2 | 360.0 | 99.4 | 1003.0 | 1.0 | 2.5 | 81.0 | 72.5 | NaN |
| std | NaN | 331.131327 | 261.710501 | 158.691074 | 326.894859 | 316.536515 | 5.924297 | 28.073069 | 1.603466 | 2.02912 | 0.273395 | 102.332842 | 6.385864 | 2.474993 | 0.030363 | 0.037115 | 14.807258 | 12.348743 | NaN |
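
The summary statistics above show negative minimums for GHI, DNI, and DHI (GHI goes down to -12.9), and the 25th percentiles of all three are negative, so at least a quarter of the readings in each column are below zero. Irradiance cannot physically be negative, so these values are most likely small night-time sensor offsets. One common way to handle them is to measure their share and clip them to zero. The sketch below does this on toy rows and is an assumption, not necessarily the treatment this notebook applies.

```python
import pandas as pd

IRRADIANCE_COLS = ["GHI", "DNI", "DHI"]

# Toy rows: the first two mirror the head output, the third uses the
# 75th-percentile values from the summary table.
toy = pd.DataFrame({
    "GHI": [-1.2, -1.1, 483.4],
    "DNI": [-0.2, -0.2, 314.2],
    "DHI": [-1.1, -1.1, 216.3],
})

# Share of negative readings per irradiance column, in percent.
neg_pct = (toy[IRRADIANCE_COLS] < 0).mean() * 100
print(neg_pct.round(1))

# Assumption: treat negative irradiance as zero rather than dropping the rows.
clipped = toy[IRRADIANCE_COLS].clip(lower=0)
print(clipped.min())
```

Clipping keeps the one-minute time index intact, which matters for later time-series plots; dropping the rows instead would remove most night-time records.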


Columns with >5% missing values:


Comments    100.0
dtype: float64
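
Since `Comments` is empty in every row, it carries no information for the analysis. A minimal sketch of removing fully empty columns with pandas follows, shown on a toy frame. Whether the notebook drops the column at this stage is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame: two populated columns and one column that is entirely NaN.
toy = pd.DataFrame({
    "GHI": [-1.2, -1.1, -1.1],
    "Tamb": [26.2, 26.2, 26.2],
    "Comments": [np.nan, np.nan, np.nan],
})

# Record which columns are 100% missing, then drop them.
all_missing = toy.columns[toy.isna().all()].tolist()
cleaned = toy.dropna(axis=1, how="all")

print("Dropped:", all_missing)
print("Remaining:", cleaned.columns.tolist())
```

Using `how="all"` only removes columns with no values at all, so partially populated columns are left for a separate imputation decision.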