## Data Preparation for Benin
This notebook focuses on preparing the wind and solar dataset for analysis.

The goal is to clean, structure, and validate the dataset so that it's ready for exploration and visualization.

It follows the assignment requirements step by step: loading data, summary statistics, missing value report, outlier detection and flagging, data cleaning and imputation, and saving the cleaned dataset.

All steps use the reusable functions defined in `data_preparation.py`, ensuring consistency and reproducibility across multiple datasets.

## Imports

In [1]:
import sys
import pandas as pd

sys.path.append('../../../scripts')
from data_preparation import (
    load_data,
    get_summary_report,
    calculate_zscore_and_flag_outliers,
    clean_and_impute,
    save_cleaned_data
)

## Load Raw Data

In [3]:
raw_file_path = "../../../data/benin/benin-malanville.csv"
output_dir = "../../../data/benin/"
df = load_data(raw_file_path) 

## Observe the Data

In [4]:
print("## üìä Head: First 5 Rows for Visual Check\n")
df.head()

## 📊 Head: First 5 Rows for Visual Check



Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB,Comments
0,2021-08-09 00:01,-1.2,-0.2,-1.1,0.0,0.0,26.2,93.4,0.0,0.4,0.1,122.1,0.0,998,0,0.0,26.3,26.2,
1,2021-08-09 00:02,-1.1,-0.2,-1.1,0.0,0.0,26.2,93.6,0.0,0.0,0.0,0.0,0.0,998,0,0.0,26.3,26.2,
2,2021-08-09 00:03,-1.1,-0.2,-1.1,0.0,0.0,26.2,93.7,0.3,1.1,0.5,124.6,1.5,997,0,0.0,26.4,26.2,
3,2021-08-09 00:04,-1.1,-0.1,-1.0,0.0,0.0,26.2,93.3,0.2,0.7,0.4,120.3,1.3,997,0,0.0,26.4,26.3,
4,2021-08-09 00:05,-1.0,-0.1,-1.0,0.0,0.0,26.2,93.3,0.1,0.7,0.3,113.2,1.0,997,0,0.0,26.4,26.3,


In [5]:
print("## ℹ️ Info: Data Types, Non-Null Counts, and Memory Usage\n")
df.info()

## ℹ️ Info: Data Types, Non-Null Counts, and Memory Usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525600 entries, 0 to 525599
Data columns (total 19 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Timestamp      525600 non-null  object 
 1   GHI            525600 non-null  float64
 2   DNI            525600 non-null  float64
 3   DHI            525600 non-null  float64
 4   ModA           525600 non-null  float64
 5   ModB           525600 non-null  float64
 6   Tamb           525600 non-null  float64
 7   RH             525600 non-null  float64
 8   WS             525600 non-null  float64
 9   WSgust         525600 non-null  float64
 10  WSstdev        525600 non-null  float64
 11  WD             525600 non-null  float64
 12  WDstdev        525600 non-null  float64
 13  BP             525600 non-null  int64  
 14  Cleaning       525600 non-null  int64  
 15  Precipitation  525600 non-null  float64
 16  TModA      

The raw dataset includes columns like GHI, DNI, DHI, ModA, ModB, WS, WSgust, WD, Tamb, RH, and Timestamp. Some columns may have missing values and potential outliers.

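The `info()` output above shows that the Timestamp column is stored with the generic `object` dtype, meaning the values are plain strings. Let's convert it to a proper datetime type so that time-based filtering and resampling work as expected.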

In [6]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
print("Data Type for Timestamp -", df['Timestamp'].dtype)

Data Type for Timestamp - datetime64[ns]
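
The conversion above relies on pandas inferring the timestamp format. As a minimal sketch (not part of `data_preparation.py`), the same step can be made stricter by passing an explicit format and coercing unparseable values to `NaT` so they can be counted. The format string is an assumption based on the rows shown by `head()`, and the `sample` frame with its invalid last value is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the raw Timestamp strings; the last value is deliberately invalid.
sample = pd.DataFrame(
    {"Timestamp": ["2021-08-09 00:01", "2021-08-09 00:02", "not a date"]}
)

# An explicit format avoids format inference; errors="coerce" turns
# unparseable strings into NaT instead of raising an exception.
sample["Timestamp"] = pd.to_datetime(
    sample["Timestamp"], format="%Y-%m-%d %H:%M", errors="coerce"
)

# Count how many rows failed to parse so they can be inspected before cleaning.
n_bad = int(sample["Timestamp"].isna().sum())
print(sample["Timestamp"].dtype, "| unparseable rows:", n_bad)
```

Counting the `NaT` values right after conversion makes silent parsing failures visible before they reach the summary statistics.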


## Summary Statistics & Missing-Value Report

In [7]:
get_summary_report(df)


--- 1. Summary Statistics ---
                           Timestamp            GHI            DNI  \
count                         525600  525600.000000  525600.000000   
mean   2022-02-07 12:00:30.000000512     240.559452     167.187516   
min              2021-08-09 00:01:00     -12.900000      -7.800000   
25%              2021-11-08 06:00:45      -2.000000      -0.500000   
50%              2022-02-07 12:00:30       1.800000      -0.100000   
75%              2022-05-09 18:00:15     483.400000     314.200000   
max              2022-08-09 00:00:00    1413.000000     952.300000   
std                              NaN     331.131327     261.710501   

                 DHI           ModA           ModB           Tamb  \
count  525600.000000  525600.000000  525600.000000  525600.000000   
mean      115.358961     236.589496     228.883576      28.179683   
min       -12.600000       0.000000       0.000000      11.000000   
25%        -2.100000       0.000000       0.000000      24.200

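The report shows which columns have a high share of missing values (more than 5%) and gives summary statistics for the numeric columns. Together these point to what needs cleaning. For example, the minimum values of GHI, DNI, and DHI are negative (-12.9, -7.8, and -12.6).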

The Comments column is 100% missing, so it provides no useful information for analysis. We drop it to simplify further processing.
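
The internals of `get_summary_report` are not shown in this notebook. The missing-value check described above could be sketched as follows, assuming a 5% threshold. The function name `missing_value_report`, its `threshold_pct` parameter, and the `toy` data are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame, threshold_pct: float = 5.0) -> pd.DataFrame:
    """Return per-column missing counts and percentages, flagging columns above threshold_pct."""
    report = pd.DataFrame(
        {
            "missing_count": frame.isna().sum(),
            "missing_pct": frame.isna().mean() * 100,
        }
    )
    report["above_threshold"] = report["missing_pct"] > threshold_pct
    # Worst columns first, so candidates for dropping are easy to spot.
    return report.sort_values("missing_pct", ascending=False)


# Toy data: one partially missing column, one complete, one fully empty.
toy = pd.DataFrame(
    {
        "GHI": [1.0, 2.0, np.nan, 4.0],
        "Tamb": [26.2, 26.2, 26.3, 26.4],
        "Comments": [np.nan] * 4,
    }
)

report = missing_value_report(toy)
# Columns with no values at all are candidates for dropping outright.
fully_empty = report.index[report["missing_pct"] == 100].tolist()
print(report)
print("Fully empty columns:", fully_empty)
```

Listing fully empty columns programmatically avoids hard-coding a column name when the same preparation steps are reused on other datasets.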

In [8]:
df = df.drop(columns=['Comments'])

print("✅ 'Comments' column dropped due to 100% missing values.")

print("After Dropping 'Comments' Column:")
df.head()

✅ 'Comments' column dropped due to 100% missing values.
After Dropping 'Comments' Column:


Unnamed: 0,Timestamp,GHI,DNI,DHI,ModA,ModB,Tamb,RH,WS,WSgust,WSstdev,WD,WDstdev,BP,Cleaning,Precipitation,TModA,TModB
0,2021-08-09 00:01:00,-1.2,-0.2,-1.1,0.0,0.0,26.2,93.4,0.0,0.4,0.1,122.1,0.0,998,0,0.0,26.3,26.2
1,2021-08-09 00:02:00,-1.1,-0.2,-1.1,0.0,0.0,26.2,93.6,0.0,0.0,0.0,0.0,0.0,998,0,0.0,26.3,26.2
2,2021-08-09 00:03:00,-1.1,-0.2,-1.1,0.0,0.0,26.2,93.7,0.3,1.1,0.5,124.6,1.5,997,0,0.0,26.4,26.2
3,2021-08-09 00:04:00,-1.1,-0.1,-1.0,0.0,0.0,26.2,93.3,0.2,0.7,0.4,120.3,1.3,997,0,0.0,26.4,26.3
4,2021-08-09 00:05:00,-1.0,-0.1,-1.0,0.0,0.0,26.2,93.3,0.1,0.7,0.3,113.2,1.0,997,0,0.0,26.4,26.3


## Outlier Detection

In [9]:
df = calculate_zscore_and_flag_outliers(df)

outlier_rows = df[df['Outliers_Flag'] == True]
print("Count of rows with outlier values - ", df['Outliers_Flag'].sum())
print(outlier_rows.head(5))


Count of rows with outlier values -  7740
              Timestamp     GHI    DNI    DHI    ModA    ModB  Tamb    RH  \
670 2021-08-09 11:11:00   836.0  235.0  610.6   778.8   783.8  30.3  68.2   
671 2021-08-09 11:12:00  1274.0  698.8  615.2  1210.3  1210.3  30.1  69.6   
672 2021-08-09 11:13:00   938.0  340.1  612.8   891.1   891.1  30.4  68.4   
673 2021-08-09 11:14:00   718.5  126.8  593.2   682.6   682.6  30.6  68.2   
674 2021-08-09 11:15:00  1349.0  771.8  618.0  1281.5  1281.5  30.9  67.1   

      WS  WSgust  WSstdev     WD  WDstdev   BP  Cleaning  Precipitation  \
670  3.2     4.1      0.7  190.5     18.4  999         0            0.0   
671  3.4     4.1      0.6  175.8     13.3  999         0            0.0   
672  3.4     4.6      0.9  171.5     11.9  999         0            0.0   
673  4.7     5.6      0.6  160.7      8.0  999         0            0.0   
674  4.0     4.6      0.4  169.1     12.3  999         0            0.0   

     TModA  TModB  Outliers_Flag  
670   63.
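
The implementation of `calculate_zscore_and_flag_outliers` lives in `data_preparation.py` and is not reproduced here. As a rough sketch of the general technique, the block below flags a row when any selected column has an absolute Z-score above 3. The threshold, the column choice, the helper name `flag_zscore_outliers`, and the `toy` data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def flag_zscore_outliers(frame: pd.DataFrame, columns: list, threshold: float = 3.0) -> pd.DataFrame:
    """Return a copy with a boolean 'Outliers_Flag' column.

    A row is flagged when any of the listed columns has |z| > threshold.
    """
    values = frame[columns]
    # Column-wise Z-score using the population standard deviation (ddof=0).
    z = (values - values.mean()) / values.std(ddof=0)
    flagged = frame.copy()
    flagged["Outliers_Flag"] = (z.abs() > threshold).any(axis=1)
    return flagged


# Toy data: 20 ordinary GHI readings plus one extreme spike in the last row.
toy = pd.DataFrame(
    {
        "GHI": [100.0] * 10 + [110.0] * 10 + [5000.0],
        "WS": np.linspace(1.0, 3.0, 21),
    }
)

flagged = flag_zscore_outliers(toy, ["GHI", "WS"])
print("Flagged rows:", int(flagged["Outliers_Flag"].sum()))
```

Z-scores assume roughly symmetric data. Irradiance columns are zero or slightly negative at night and large at midday, so a fixed threshold mostly flags bright midday peaks, as in the rows printed above.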

## Cleaning & Imputation

In [10]:
impute_cols = ['GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'WS', 'WSgust']
df_cleaned = clean_and_impute(df, impute_cols)
df_cleaned = df_cleaned.drop(columns=['Outliers_Flag'])

df_cleaned.head()

cleaned_file_path = output_dir + "benin_cleaned.csv"
save_cleaned_data(df_cleaned, cleaned_file_path) 


✅ Cleaned Data Saved successfully to: ../../../data/benin/benin_cleaned.csv
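
What `clean_and_impute` does internally is not shown in this notebook. One common choice for this kind of step is median imputation, sketched below under that assumption. The helper name `impute_with_median` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_with_median(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy where NaNs in the listed columns are replaced by each column's median."""
    out = frame.copy()
    # Medians are computed per column, ignoring NaNs.
    medians = out[columns].median()
    # Passing a Series to fillna fills each column with its own median.
    out[columns] = out[columns].fillna(medians)
    return out


# Toy data with one gap per column and one extreme GHI value.
toy = pd.DataFrame(
    {
        "GHI": [1.0, np.nan, 3.0, 100.0],
        "WS": [0.5, 0.7, np.nan, 0.9],
    }
)

imputed = impute_with_median(toy, ["GHI", "WS"])
remaining_missing = int(imputed.isna().sum().sum())
print(imputed)
print("Remaining missing values:", remaining_missing)
```

The median is less sensitive to extreme readings than the mean, which matters when the same columns also contain flagged outliers. Checking the remaining missing count before saving is a cheap way to validate the cleaned file.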
