## 1: Imports and Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway, kruskal

# Plot settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

## 🔄 Load and Combine Cleaned Country Datasets

We load cleaned solar datasets for Benin, Sierra Leone, and Togo. A `Country` column is added to allow grouping and comparison across datasets.

In [None]:
# Load cleaned CSVs
benin = pd.read_csv("../data/benin_clean.csv")
benin["Country"] = "Benin"

sierraleone = pd.read_csv("../data/sierraleone_clean.csv")
sierraleone["Country"] = "Sierra Leone"

togo = pd.read_csv("../data/togo_clean.csv")
togo["Country"] = "Togo"

# Concatenate into a single DataFrame
df_all = pd.concat([benin, sierraleone, togo], ignore_index=True)
df_all.head()
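The load-and-label steps above repeat the same pattern for each country. As a minimal sketch, the labeling and concatenation could be wrapped in a helper that also checks for the metric columns the later cells use. The names `combine_countries` and `REQUIRED` are hypothetical, and the helper takes already-loaded frames so it can be tried without the CSV files.

```python
import pandas as pd

# Columns the later cells rely on (assumption: every country file carries them).
REQUIRED = ["GHI", "DNI", "DHI"]


def combine_countries(frames):
    """Label each country's frame and stack them into one DataFrame.

    `frames` maps a country name to its already-loaded DataFrame.
    Raises ValueError if a frame lacks one of the REQUIRED columns.
    """
    labeled = []
    for country, frame in frames.items():
        missing = [col for col in REQUIRED if col not in frame.columns]
        if missing:
            raise ValueError(f"{country} is missing columns: {missing}")
        # assign() returns a copy, so the caller's frame is left untouched.
        labeled.append(frame.assign(Country=country))
    return pd.concat(labeled, ignore_index=True)


# Tiny made-up frames, for illustration only (not real measurements).
demo = combine_countries({
    "Benin": pd.DataFrame({"GHI": [1.0, 2.0], "DNI": [0.5, 0.7], "DHI": [0.1, 0.2]}),
    "Togo": pd.DataFrame({"GHI": [3.0], "DNI": [0.9], "DHI": [0.3]}),
})
print(demo)
```

With the real data, the same call would take the three frames returned by `pd.read_csv`, keyed by country name.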

## 📦 Boxplot Comparison of Solar Metrics

The following plots compare the distribution of **GHI**, **DNI**, and **DHI** across Benin, Sierra Leone, and Togo.

In [None]:
for metric in ["GHI", "DNI", "DHI"]:
    plt.figure()
    sns.boxplot(data=df_all, x="Country", y=metric, palette="Set2")
    plt.title(f"{metric} Distribution by Country")
    plt.ylabel(f"{metric} (W/m²)")
    plt.xlabel("Country")
    plt.tight_layout()
    plt.show()

## 📊 Summary Statistics by Country

This table summarizes the **mean**, **median**, and **standard deviation** for each key solar metric grouped by country.

In [None]:
summary = df_all.groupby("Country")[["GHI", "DNI", "DHI"]].agg(["mean", "median", "std"]).round(2)
summary
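The aggregation above returns a table whose columns are (metric, statistic) pairs. If a flat header is easier to export or merge, the pairs can be joined into single labels. The sketch below uses a small made-up frame and hypothetical names (`toy`, `summary_toy`, `flat`) and is not part of the original analysis.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GHI": [100.0, 200.0, 300.0, 500.0],
    "DNI": [10.0, 30.0, 20.0, 40.0],
})

summary_toy = (
    toy.groupby("Country")[["GHI", "DNI"]]
    .agg(["mean", "median", "std"])
    .round(2)
)

# Columns are (metric, statistic) pairs; join each pair into one label.
flat = summary_toy.copy()
flat.columns = [f"{metric}_{stat}" for metric, stat in flat.columns]
print(flat)
```

The same two lines can be applied to `summary` to obtain columns such as `GHI_mean` and `DHI_std`.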

## 🧪 Statistical Test for GHI Differences

To evaluate whether **GHI** differs significantly across countries, we run two tests:

- **One-Way ANOVA** (parametric; assumes approximately normal residuals and similar variances across groups)
- **Kruskal–Wallis Test** (non-parametric and rank-based; does not assume normality)

We report the p-values and interpret the results.


In [None]:
ghi_benin = benin["GHI"].dropna()
ghi_sierra = sierraleone["GHI"].dropna()
ghi_togo = togo["GHI"].dropna()

# One-Way ANOVA
anova_stat, anova_p = f_oneway(ghi_benin, ghi_sierra, ghi_togo)

# Kruskal-Wallis
kruskal_stat, kruskal_p = kruskal(ghi_benin, ghi_sierra, ghi_togo)

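# Scientific notation keeps very small p-values from printing as 0.0000
print(f"ANOVA: F = {anova_stat:.2f}, p-value = {anova_p:.3e}")
print(f"Kruskal–Wallis: H = {kruskal_stat:.2f}, p-value = {kruskal_p:.3e}")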
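With large samples, a p-value can be very small even when the difference between groups is modest, so an effect size is a useful companion to the tests above. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) in plain Python. The function name `eta_squared` is hypothetical and not part of SciPy, and the demo values are made up.

```python
def eta_squared(*groups):
    """Eta squared for a one-way layout: SS_between / SS_total.

    Each group is an iterable of numbers with no missing values
    (drop NaNs first) and must not be empty.
    """
    groups = [[float(v) for v in g] for g in groups]
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of every observation around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total


# Made-up groups, for illustration only.
demo_eta = eta_squared([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"eta squared: {demo_eta:.3f}")
```

For the notebook's data, the call would be `eta_squared(ghi_benin, ghi_sierra, ghi_togo)`. Values near 0 mean that country explains little of the GHI variance, and values near 1 mean it explains most of it.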

## 📌 Key Observations

- **Togo** shows the **highest median GHI**, suggesting strong solar potential, but also exhibits higher variability.
- **Benin** has relatively stable GHI and lower standard deviation, potentially indicating more consistent solar conditions.
- **Sierra Leone** records the **lowest average GHI**, possibly due to cloudier coastal conditions.

## 🥇 Country Ranking by Average GHI


In [None]:
avg_ghi = df_all.groupby("Country")["GHI"].mean().sort_values(ascending=False)

sns.barplot(x=avg_ghi.values, y=avg_ghi.index, palette="viridis")
plt.xlabel("Average GHI (W/m²)")
plt.title("Average GHI by Country")
plt.tight_layout()
plt.show()