# Benin Solar Data - Exploratory Data Analysis

**Dataset:** benin-malanville.csv  
**Objective:** Profile, clean, and explore the Benin solar dataset to identify key trends and insights for solar investment decisions.

## Table of Contents
1. Setup & Data Loading
2. Summary Statistics & Missing Values
3. Outlier Detection & Data Quality
4. Data Cleaning
5. Time Series Analysis
6. Cleaning Impact Analysis
7. Correlation & Relationship Analysis
8. Wind Analysis & Distributions
9. Temperature & Humidity Analysis
10. Bubble Chart Analysis
11. Key Insights & Conclusions


## 1. Setup & Data Loading


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
from IPython.display import display  # explicit import so display() also works outside Jupyter

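# Configuration
# Note: this silences all warnings, including pandas/seaborn deprecation notices
warnings.filterwarnings('ignore')
# The 'seaborn-v0_8-*' style names require matplotlib >= 3.6
plt.style.use('seaborn-v0_8-darkgrid')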
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Figure size defaults
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")


Libraries imported successfully!


In [None]:
# Load the dataset
data_path = '../data/benin-malanville.csv'
df = pd.read_csv(data_path)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
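
If the file has a timestamp column, parsing it at load time saves a conversion step before the time series analysis. The following is a minimal sketch, assuming the column is named `Timestamp` (an assumption; check the column list printed above). It uses a small in-memory sample in place of the real file, and the sample values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV;
# the column names and values are assumptions, not taken from the dataset.
sample_csv = io.StringIO(
    "Timestamp,value\n"
    "2021-01-01 00:02,2.0\n"
    "2021-01-01 00:01,1.0\n"
)

# parse_dates converts the named column to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["Timestamp"])

# A sorted DatetimeIndex enables time-based slicing and resampling later on
sample = sample.set_index("Timestamp").sort_index()
print(sample.index.dtype)
```

For the real file, the same `parse_dates` argument could be passed to the `pd.read_csv(data_path)` call, provided the column name matches.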


In [None]:
# Initial data inspection
print("=" * 80)
print("FIRST 5 ROWS:")
print("=" * 80)
display(df.head())

print("\n" + "=" * 80)
print("LAST 5 ROWS:")
print("=" * 80)
display(df.tail())

print("\n" + "=" * 80)
print("DATA TYPES:")
print("=" * 80)
display(df.dtypes)


## 2. Summary Statistics & Missing Values


In [None]:
# Summary statistics for all numeric columns
print("=" * 80)
print("SUMMARY STATISTICS FOR NUMERIC COLUMNS")
print("=" * 80)
summary_stats = df.describe()
display(summary_stats)


In [None]:
# Missing values analysis
print("=" * 80)
print("MISSING VALUES ANALYSIS")
print("=" * 80)

missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Missing Percentage': missing_percentage.values
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Percentage', ascending=False)

if len(missing_df) > 0:
    display(missing_df)
    
    # Flag columns with >5% missing values
    high_missing = missing_df[missing_df['Missing Percentage'] > 5]
    if not high_missing.empty:
        print("\n⚠️ Columns with >5% missing values:")
        for col, pct in zip(high_missing['Column'], high_missing['Missing Percentage']):
            print(f"  - {col}: {pct:.2f}%")
else:
    print("✓ No missing values found in the dataset!")
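
The missing-value check can also be wrapped in a small helper so the same report can be rerun after cleaning. This is a sketch only: `missing_report` and its `threshold` parameter are hypothetical names, not part of the notebook, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first.

    Hypothetical helper: only columns with at least one missing value are
    listed, and 'Above Threshold' marks those over `threshold` percent.
    """
    counts = frame.isnull().sum()
    percentages = counts / len(frame) * 100
    report = pd.DataFrame({
        "Missing Count": counts,
        "Missing Percentage": percentages,
        "Above Threshold": percentages > threshold,
    })
    # Keep only columns that actually have missing values
    report = report[report["Missing Count"] > 0]
    return report.sort_values("Missing Percentage", ascending=False)


# Toy frame: 'a' is complete, 'b' is half missing, 'c' is entirely missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan] * 4,
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` before and after cleaning would give two tables that can be compared directly.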
