# Task 2: Data Profiling, Cleaning & EDA
**Objective:** Profile, clean, and explore the solar dataset for Benin so it’s ready for comparison and region-ranking tasks.

This notebook includes:
- Summary statistics and missing-value report
- Outlier detection and cleaning
- Time series analysis
- Correlation and scatter plots
- Wind and temperature analysis
- Bubble charts

## Importing the dependencies

In [14]:
# import dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## Setting the plotting style and loading the data

In [15]:
# Set both plotting and display settings
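# Apply the matplotlib style first: it sets its own color cycle,
# which would otherwise override the seaborn palette.
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")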
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 20)  # large enough for the 10-row previews and 19-entry summaries below

# Load dataset
df = pd.read_csv(r"D:\Python\Week_01\data\data\benin-malanville.csv")
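
The absolute Windows path above only resolves on the machine where the notebook was written. A minimal sketch of a more portable loader follows, assuming the CSV sits in a `data/` folder relative to the notebook. `DATA_DIR` and `load_solar_csv` are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: adjust to wherever the CSV actually lives.
DATA_DIR = Path("data")


def load_solar_csv(filename: str, data_dir: Path = DATA_DIR) -> pd.DataFrame:
    """Read one site's CSV, failing early with a clear message if it is missing."""
    csv_path = data_dir / filename
    if not csv_path.exists():
        raise FileNotFoundError(f"Expected dataset at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

With that layout, the load step would become `df = load_solar_csv("benin-malanville.csv")`.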


In [16]:
# Show basic info
print(f"Dataset shape: {df.shape}")
df.info()

Dataset shape: (525600, 19)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525600 entries, 0 to 525599
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Timestamp      525600 non-null  object 
 1   GHI            525600 non-null  float64
 2   DNI            525600 non-null  float64
 3   DHI            525600 non-null  float64
 4   ModA           525600 non-null  float64
 5   ModB           525600 non-null  float64
 6   Tamb           525600 non-null  float64
 7   RH             525600 non-null  float64
 8   WS             525600 non-null  float64
 9   WSgust         525600 non-null  float64
 10  WSstdev        525600 non-null  float64
 11  WD             525600 non-null  float64
 12  WDstdev        525600 non-null  float64
 13  BP             525600 non-null  int64  
 14  Cleaning       525600 non-null  int64  
 15  Precipitation  525600 non-null  float64
 16  TModA          525600 non-null  float64
 17  T
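
The `info()` output reports `Timestamp` as `object`, meaning the values are plain strings. The time series analysis listed in the notebook outline would normally need a datetime type. The sketch below shows one way to convert it, assuming the `YYYY-MM-DD HH:MM` format that appears in the first rows of the file. It runs on a small stand-in frame (with one deliberately bad value) rather than the real dataset.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the loaded `df`.
sample = pd.DataFrame(
    {
        "Timestamp": ["2021-08-09 00:01", "2021-08-09 00:02", "not a date"],
        "GHI": [-1.2, -1.1, -1.1],
    }
)

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so bad timestamps can be counted rather than stopping the notebook.
sample["Timestamp"] = pd.to_datetime(
    sample["Timestamp"], format="%Y-%m-%d %H:%M", errors="coerce"
)

n_bad = int(sample["Timestamp"].isna().sum())
print(f"Unparseable timestamps: {n_bad}")
print(sample.dtypes)
```

Counting the coerced `NaT` values before going further makes it clear whether any rows need attention before the column is used as an index.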

In [17]:
# Display column names
print("\nColumn names:")
print(df.columns.tolist())


Column names:
['Timestamp', 'GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'Tamb', 'RH', 'WS', 'WSgust', 'WSstdev', 'WD', 'WDstdev', 'BP', 'Cleaning', 'Precipitation', 'TModA', 'TModB', 'Comments']


In [18]:
# Display the first 5 rows
print("\nFirst 5 rows:")
display(df.head())


First 5 rows:


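| | Timestamp | GHI | DNI | DHI | ModA | ... | Cleaning | Precipitation | TModA | TModB | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-08-09 00:01 | -1.2 | -0.2 | -1.1 | 0.0 | ... | 0 | 0.0 | 26.3 | 26.2 | |
| 1 | 2021-08-09 00:02 | -1.1 | -0.2 | -1.1 | 0.0 | ... | 0 | 0.0 | 26.3 | 26.2 | |
| 2 | 2021-08-09 00:03 | -1.1 | -0.2 | -1.1 | 0.0 | ... | 0 | 0.0 | 26.4 | 26.2 | |
| 3 | 2021-08-09 00:04 | -1.1 | -0.1 | -1.0 | 0.0 | ... | 0 | 0.0 | 26.4 | 26.3 | |
| 4 | 2021-08-09 00:05 | -1.0 | -0.1 | -1.0 | 0.0 | ... | 0 | 0.0 | 26.4 | 26.3 | |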
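
The first rows show slightly negative GHI, DNI, and DHI readings just after midnight. Irradiance cannot be negative, so counting such readings is a reasonable first sanity check before any cleaning. The sketch below does this on a stand-in frame built from the five rows printed above. How to treat these values (keep, clip to zero, or flag) is a cleaning decision this example does not make.

```python
import pandas as pd

# Stand-in frame using the five rows printed above; in the notebook use `df`.
sample = pd.DataFrame(
    {
        "GHI": [-1.2, -1.1, -1.1, -1.1, -1.0],
        "DNI": [-0.2, -0.2, -0.2, -0.1, -0.1],
        "DHI": [-1.1, -1.1, -1.1, -1.0, -1.0],
    }
)

irradiance_cols = ["GHI", "DNI", "DHI"]

# Count and percentage of negative readings per irradiance column.
is_negative = sample[irradiance_cols] < 0
report = pd.DataFrame(
    {
        "negative_count": is_negative.sum(),
        "negative_pct": (is_negative.mean() * 100).round(2),
    }
)
print(report)
```

On the full dataset the same two lines that build `is_negative` and `report` would show how much of each column is affected.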


In [None]:
# Display the last 10 rows
print("\nLast 10 rows:")
display(df.tail(10))

In [None]:
# Display 10 random sample rows
print("\nRandom sample of 10 rows:")
display(df.sample(10, random_state=42))

In [None]:
# Check for missing values
print("\nMissing values per column:")
print(df.isna().sum())

# Percentage of missing values per column
print("\nPercentage of missing values:")
print((df.isna().mean() * 100).round(2))

# Show only columns with any missing values
missing = df.isna().sum()
missing = missing[missing > 0]
print("\nColumns with missing values:")
print(missing)
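
The three printouts above can also be combined into a single table, which is easier to scan and to reuse for other datasets. The sketch below is one possible helper. The `missing_report` name and the 5% `threshold_pct` default are assumptions made for illustration, not values taken from the notebook, and the demo frame is stand-in data.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, threshold_pct: float = 5.0) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    report = pd.DataFrame(
        {
            "missing_count": frame.isna().sum(),
            "missing_pct": (frame.isna().mean() * 100).round(2),
        }
    )
    # threshold_pct is an assumed cut-off, not one taken from the notebook.
    report["above_threshold"] = report["missing_pct"] > threshold_pct
    return report.sort_values("missing_count", ascending=False)


# Stand-in data: one fully empty column, one partially empty, one complete.
demo = pd.DataFrame(
    {
        "GHI": [1.0, 2.0, np.nan, 4.0],
        "Tamb": [26.3, 26.3, 26.4, 26.4],
        "Comments": [np.nan] * 4,
    }
)
print(missing_report(demo))
```

In the notebook, `missing_report(df)` would put any fully empty column at the top of the table, which makes it an obvious candidate for dropping during cleaning.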


