# Solar Power Energy Prediction - Data Preprocessing 
## Phase 2: Data Preprocessing (Following Analysis Phase)

**Prerequisites:** analysis.ipynb must be completed first

## 2.1 Outlier Treatment & Feature Engineering

The analysis phase identified outliers in the numerical columns.  
This phase applies treatment strategies to handle them.

### 2.1.1 Re-import Libraries and Load Data

In [8]:
# Import pandas for data manipulation
import pandas as pd
# Import numpy for numerical operations
import numpy as np
# Import warnings to suppress warning messages
import warnings
# Suppress all warnings for cleaner output
warnings.filterwarnings("ignore")
# Import matplotlib for plotting
from matplotlib import pyplot as plt
# Import seaborn for statistical visualizations
import seaborn as sns

### 2.1.2 Load Dataset

In [9]:
# Load the dataset from CSV file (same as analysis phase)
data = pd.read_csv("../data/solarpowergeneration.csv")
# Display first few rows
data.head()

Unnamed: 0,distance-to-solar-noon,temperature,wind-direction,wind-speed,sky-cover,visibility,humidity,average-wind-speed-(period),average-pressure-(period),power-generated
0,0.859897,69,28,7.5,0,10.0,75,8.0,29.82,0
1,0.628535,69,28,7.5,0,10.0,77,5.0,29.85,0
2,0.397172,69,28,7.5,0,10.0,70,0.0,29.89,5418
3,0.16581,69,28,7.5,0,10.0,33,0.0,29.91,25477
4,0.065553,69,28,7.5,0,10.0,21,3.0,29.89,30069


### 2.1.3 Removing Missing Values

In [10]:
print("Drop the row with missing value.")

Drop the row with missing value.


In [11]:
# Check shape before
print(f"Original shape: {data.shape}")
# DROP ROWS with missing values
# inplace=True modifies the dataframe directly rather than creating a copy
data.dropna(inplace=True)
# Check shape after
print(f"Shape after dropping NaNs: {data.shape}")
# Verify clean status
if data.isnull().sum().sum() == 0:
    print("No missing values remain.")

Original shape: (2920, 10)
Shape after dropping NaNs: (2919, 10)
No missing values remain.


### 2.1.4 Removal of Physically Impossible Values (Rule-Based Filtering)

In [12]:
data = data[
    (data['wind-speed'] >= 0) &
    (data['average-pressure-(period)'] >= 0) &
    (data['humidity'] >= 0) &
    (data['average-wind-speed-(period)'] >= 0)
]
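
Before applying a filter like the one above, it can help to count how many rows each rule would remove. The sketch below uses a small made-up frame (`toy`) with the same column names; the values and the `rules` dictionary are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; values are invented.
toy = pd.DataFrame({
    "wind-speed": [7.5, -1.0, 3.2, 0.0],
    "average-pressure-(period)": [29.82, 29.85, -5.0, 29.91],
    "humidity": [75, 77, 70, 33],
    "average-wind-speed-(period)": [8.0, 5.0, 0.0, 3.0],
})

# Each rule is a boolean mask that is True where the value is physically plausible.
rules = {col: toy[col] >= 0 for col in toy.columns}

# Count violations per rule before dropping anything.
violations = {col: int((~mask).sum()) for col, mask in rules.items()}
print(violations)

# Keep only rows that satisfy every rule.
valid = pd.concat(rules, axis=1).all(axis=1)
filtered = toy[valid]
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

Reporting the per-rule counts makes it visible when a filter silently removes nothing, or removes far more than expected.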

### 2.1.5 Contextual Outliers (Domain Knowledge-Based)

In [13]:
# --- LOGIC CHECK 1: NIGHT TIME GENERATION ---
# "Night" is roughly defined by a high distance to solar noon.
# Assume distance > 1.2 radians is "dark".
# If power exceeds a small tolerance (10) during this time, it's suspicious.

night_threshold = 1.2
night_power_outliers = data[(data['distance-to-solar-noon'] > night_threshold) & (data['power-generated'] > 10)]

print(f"Suspicious Night Generation Count: {len(night_power_outliers)}")
if len(night_power_outliers) > 0:
    # Column names use hyphens, matching the dataset header
    print(night_power_outliers[['distance-to-solar-noon', 'power-generated']].head())


# --- LOGIC CHECK 2: DAY TIME ZERO GENERATION ---
# If it is near NOON (distance < 0.5) and CLEAR (sky-cover == 0)
# and POWER is 0, something is likely wrong with the sensor.

day_failure_outliers = data[
    (data['distance-to-solar-noon'] < 0.5) & 
    (data['sky-cover'] == 0) & 
    (data['power-generated'] == 0)
]

print(f"\nSuspicious Zero Power at Noon Count: {len(day_failure_outliers)}")
if len(day_failure_outliers) > 0:
    print(day_failure_outliers[['distance-to-solar-noon', 'sky-cover', 'power-generated']].head())

Suspicious Night Generation Count: 0

Suspicious Zero Power at Noon Count: 13
     distance-to-solar-noon  sky-cover  power-generated
322                0.434018          0                0
323                0.170088          0                0
325                0.357771          0                0
330                0.435294          0                0
331                0.170588          0                0
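
The comments in the cell above relate the distance thresholds to hours from solar noon. A minimal sketch of that conversion follows, assuming `distance-to-solar-noon` is an hour angle in radians (as the notebook's comments state) so that a full day corresponds to 2π radians. The helper name `radians_to_hours` is hypothetical.

```python
import math

def radians_to_hours(angle_rad: float) -> float:
    """Convert an hour angle in radians to hours, assuming 2*pi radians = 24 h."""
    return angle_rad * 24.0 / (2.0 * math.pi)

# Thresholds mentioned in the checks above.
for threshold in (0.5, 1.2, 1.5):
    hours = radians_to_hours(threshold)
    print(f"{threshold} rad is about {hours:.2f} h from solar noon")
```

Under this assumption, 1.2 rad corresponds to roughly 4.6 hours and 0.5 rad to roughly 1.9 hours from solar noon.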


In [14]:
# Drop the specific rows identified as sensor failures
# (Using the index of the outliers identified above)
rows_to_drop = day_failure_outliers.index

print(f"Dropping {len(rows_to_drop)} rows due to inconsistent solar physics...")
data = data.drop(rows_to_drop)
data.reset_index(drop=True, inplace=True)

print(f"Dataset Shape after handling outliers: {data.shape}")
#data_final.to_csv('solarpowergeneration_final.csv', index=False)

Dropping 13 rows due to inconsistent solar physics...
Dataset Shape after handling outliers: (2906, 10)


### 2.1.6 Cyclic Encoding of Wind Direction

In [15]:
# Convert 1–36 scale to degrees
data['wind_direction_deg'] = (data['wind-direction'] - 1) * 10
# Cyclic encoding
data['wind_dir_sin'] = np.sin(np.deg2rad(data['wind_direction_deg']))
data['wind_dir_cos'] = np.cos(np.deg2rad(data['wind_direction_deg']))
# Remove original direction columns
data.drop(columns=['wind-direction', 'wind_direction_deg'], inplace=True)
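
The reason for the sine/cosine pair can be checked numerically: on the raw 1–36 scale, codes 36 and 1 look 35 units apart even though they are compass neighbours. The sketch below reuses the same `(code - 1) * 10` degree rule; the helper names `encode_direction` and `encoded_distance` are hypothetical.

```python
import numpy as np

def encode_direction(code):
    """Map a 1-36 direction code to (sin, cos) via the (code - 1) * 10 degree rule."""
    angle = np.deg2rad((np.asarray(code, dtype=float) - 1) * 10)
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def encoded_distance(a, b):
    """Euclidean distance between two direction codes in (sin, cos) space."""
    return float(np.linalg.norm(encode_direction(a) - encode_direction(b)))

# Codes 36 and 1 are neighbours on the compass, just like 1 and 2.
print(encoded_distance(36, 1), encoded_distance(1, 2))

# Codes 1 and 19 point in opposite directions, so they are maximally far apart.
print(encoded_distance(1, 19))
```

After encoding, the wrap-around pair (36, 1) is exactly as close as any other adjacent pair, which is the property the raw integer code lacks.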

In [16]:
data.head()

Unnamed: 0,distance-to-solar-noon,temperature,wind-speed,sky-cover,visibility,humidity,average-wind-speed-(period),average-pressure-(period),power-generated,wind_dir_sin,wind_dir_cos
0,0.859897,69,7.5,0,10.0,75,8.0,29.82,0,-1.0,-1.83697e-16
1,0.628535,69,7.5,0,10.0,77,5.0,29.85,0,-1.0,-1.83697e-16
2,0.397172,69,7.5,0,10.0,70,0.0,29.89,5418,-1.0,-1.83697e-16
3,0.16581,69,7.5,0,10.0,33,0.0,29.91,25477,-1.0,-1.83697e-16
4,0.065553,69,7.5,0,10.0,21,3.0,29.89,30069,-1.0,-1.83697e-16


In [17]:
data.describe()

Unnamed: 0,distance-to-solar-noon,temperature,wind-speed,sky-cover,visibility,humidity,average-wind-speed-(period),average-pressure-(period),power-generated,wind_dir_sin,wind_dir_cos
count,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0
mean,0.504255,58.47488,10.104026,1.996559,9.558328,73.615623,10.128355,30.016886,7013.417756,-0.633423,-0.05658725
std,0.298257,6.847306,4.837807,1.409088,1.382782,14.951208,7.265156,0.141378,10325.774839,0.627948,0.4488413
min,0.050401,42.0,1.1,0.0,0.0,14.0,0.0,29.48,0.0,-1.0,-1.0
25%,0.255728,53.0,6.6,1.0,10.0,65.0,5.0,29.92,0.0,-0.984808,-0.3420201
50%,0.481741,59.0,10.0,2.0,10.0,77.0,9.0,29.99,427.5,-0.984808,-1.83697e-16
75%,0.740418,63.0,13.1,3.0,10.0,84.0,15.0,30.11,12778.0,-0.642788,0.1736482
max,1.141361,78.0,26.6,4.0,10.0,100.0,40.0,30.53,36580.0,1.0,1.0


## 2.2 Multicollinearity Check & Feature Selection

### 2.2.1 Multicollinearity Check (Variance Inflation Factor)

In [18]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant  # Import this

X = data.drop(columns=['power-generated'])
y=data['power-generated']
X_with_const = add_constant(X)

vif_data = pd.DataFrame()
vif_data['Feature'] = X_with_const.columns
vif_data['VIF'] = [variance_inflation_factor(X_with_const.values, i) 
                   for i in range(X_with_const.shape[1])]

vif_data = vif_data[vif_data['Feature'] != 'const']

display(vif_data)

Unnamed: 0,Feature,VIF
1,distance-to-solar-noon,1.314945
2,temperature,1.529166
3,wind-speed,2.426842
4,sky-cover,1.438077
5,visibility,1.279091
6,humidity,1.699462
7,average-wind-speed-(period),2.064488
8,average-pressure-(period),1.49964
9,wind_dir_sin,1.579454
10,wind_dir_cos,1.3038


VIF only measures multicollinearity: it tells us whether features are linearly redundant with each other, not whether they are useful for predicting solar power. All VIF values above are below 2.5, so no feature is close to a linear combination of the others. A feature can have VIF ≈ 1 and still contribute little or nothing to prediction. Because solar power generation is inherently non-linear (e.g., curved time-of-day and weather interactions), linear correlations and OLS p-values can be misleading, and feature usefulness should instead be evaluated with non-linear models or predictive performance metrics.
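
To make concrete what the VIF measures, here is a minimal sketch that computes it by hand as 1 / (1 - R²), where R² comes from regressing one feature on all the others plus an intercept. The data are synthetic and the function name `vif_by_hand` is hypothetical; this illustrates the definition and is not a replacement for the statsmodels call above.

```python
import numpy as np

def vif_by_hand(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Remaining columns plus an intercept term.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The independent column `b` gets a VIF close to 1, while `a` and `c` inflate each other. Nothing in the calculation involves the target, which is why a low VIF says nothing about predictive value.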

### 2.2.2 Feature Selection (Cross-Validated Drop-Feature Ablation)

In [19]:
# Cross-validated drop-feature ablation to get a stable estimate of each feature's predictive contribution
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cross_val_ablation(X, y, model=None, n_splits=5, random_state=42):
    if model is None:
        model = GradientBoostingRegressor(random_state=random_state)
    
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    baseline_errors = []
    feature_errors = {col: [] for col in X.columns}
    
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        
        # Baseline error with all features
        model.fit(X_train, y_train)
        baseline_pred = model.predict(X_test)
        baseline_mae = mean_absolute_error(y_test, baseline_pred)
        baseline_errors.append(baseline_mae)
        
        # Error dropping each feature
        for col in X.columns:
            X_tr_drop = X_train.drop(columns=[col])
            X_te_drop = X_test.drop(columns=[col])
            model.fit(X_tr_drop, y_train)
            pred_drop = model.predict(X_te_drop)
            mae_drop = mean_absolute_error(y_test, pred_drop)
            feature_errors[col].append(mae_drop)
    
    results = []
    mean_baseline = np.mean(baseline_errors)
    
    for col in X.columns:
        mean_drop = np.mean(feature_errors[col])
        mae_increase = mean_drop - mean_baseline
        results.append({"feature": col, "mean_mae_increase": mae_increase})
    
    return pd.DataFrame(results).sort_values("mean_mae_increase", ascending=False)

# Example usage:
results_df = cross_val_ablation(X, y)
display(results_df)


Unnamed: 0,feature,mean_mae_increase
0,distance-to-solar-noon,3234.077755
3,sky-cover,207.2775
5,humidity,82.255558
8,wind_dir_sin,16.973501
6,average-wind-speed-(period),9.097706
1,temperature,6.786497
9,wind_dir_cos,6.08321
2,wind-speed,5.624812
7,average-pressure-(period),0.434432
4,visibility,-0.814876


Based on cross-validated MAE increases, **distance-to-solar-noon, sky-cover, and humidity** are the strongest predictors and should be kept. **Temperature, wind-speed, average-wind-speed-(period), and the two wind-direction components** contribute smaller positive effects (MAE increases of roughly 6 to 17) and could be dropped for simplicity. **Visibility** has a slightly negative impact (removing it lowers MAE by about 0.8) and **average-pressure-(period)** a negligible one (about 0.4), so both are candidates for removal. Next, train the final model on the retained features, confirm performance with cross-validation, and optionally visualize feature importances.
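
The keep/drop reasoning above can be expressed as a simple threshold on the ablation table. The sketch below copies the `mean_mae_increase` values from the output above; the cut-off `MIN_MAE_INCREASE = 1.0` is a hypothetical choice for illustration, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Mean MAE increases copied from the ablation output above.
results_df = pd.DataFrame({
    "feature": [
        "distance-to-solar-noon", "sky-cover", "humidity", "wind_dir_sin",
        "average-wind-speed-(period)", "temperature", "wind_dir_cos",
        "wind-speed", "average-pressure-(period)", "visibility",
    ],
    "mean_mae_increase": [
        3234.077755, 207.2775, 82.255558, 16.973501, 9.097706,
        6.786497, 6.08321, 5.624812, 0.434432, -0.814876,
    ],
})

# Hypothetical cut-off: keep features whose removal raises MAE by more than 1.0.
MIN_MAE_INCREASE = 1.0

is_useful = results_df["mean_mae_increase"] > MIN_MAE_INCREASE
keep = results_df.loc[is_useful, "feature"].tolist()
drop = results_df.loc[~is_useful, "feature"].tolist()

print("keep:", keep)
print("drop:", drop)
```

With this cut-off only `average-pressure-(period)` and `visibility` fall below the line. A stricter or looser threshold would change the list, so the value is worth revisiting against the baseline MAE.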


### 2.2.3 Correlation-Based Redundancy Removal

In [20]:
# Pairwise correlations between all columns (including the target)
corr = data.corr()
high_corr_pairs = []
# Scan the lower triangle so each pair is checked once
for i in range(len(corr.columns)):
    for j in range(i):
        if abs(corr.iloc[i, j]) > 0.8:
            high_corr_pairs.append((corr.columns[i], corr.columns[j], corr.iloc[i, j]))
high_corr_pairs

[]

### 2.2.4 Feature Selection: Removing Low-Impact Features

In [21]:
# Drop the 'visibility' column from the feature dataframe
data_final = data.drop(columns=['visibility'])
print(data_final.columns)

Index(['distance-to-solar-noon', 'temperature', 'wind-speed', 'sky-cover',
       'humidity', 'average-wind-speed-(period)', 'average-pressure-(period)',
       'power-generated', 'wind_dir_sin', 'wind_dir_cos'],
      dtype='object')


## 2.3 Save Cleaned Data

**README Section 2.3 - Steps 1-3**

### 2.3.1 Export to CSV

In [22]:
# Execute save to CSV
data_final.to_csv("../data/solarpowergeneration_cleaned.csv", index=False)
# index=False: Don't save row indices
print("Data saved to solarpowergeneration_cleaned.csv")

Data saved to solarpowergeneration_cleaned.csv


### 2.3.2 Verify file creation

In [23]:
# Check file exists in directory
import os
if os.path.exists("../data/solarpowergeneration_cleaned.csv"):
    file_size = os.path.getsize("../data/solarpowergeneration_cleaned.csv")
    print("File created successfully!")
    print(f"File size: {file_size} bytes ({file_size/(1024*1024):.2f} MB)")
else:
    print("Error: File not created")

File created successfully!
File size: 224046 bytes (0.21 MB)


### 2.3.3 Document cleaning steps

In [24]:
# Record what transformations were applied
print("\n" + "="*60)
print("DATA PREPROCESSING STEPS APPLIED")
print("="*60)
print("1. Removal of Missing Values")
print("2. Handling Statistical outliers and Conceptual outliers")
print("3. Cyclic Encoding of Wind Direction")
print("4. Multicollinearity Reduction & Feature Selection")
print("="*60)


DATA PREPROCESSING STEPS APPLIED
1. Removal of Missing Values
2. Handling Statistical outliers and Conceptual outliers
3. Cyclic Encoding of Wind Direction
4. Multicollinearity Reduction & Feature Selection


## 2.4 Reload Cleaned Data

### 2.4.1 Load Cleaned Data

In [25]:
# Load outlier-treated dataset
data = pd.read_csv("../data/solarpowergeneration_cleaned.csv")
print("Cleaned data loaded successfully")
print(f"Shape: {data.shape}")

Cleaned data loaded successfully
Shape: (2906, 10)


### 2.4.2 Display Data

In [26]:
# View DataFrame to confirm correct loading
# Check shape and structure
data.head()

Unnamed: 0,distance-to-solar-noon,temperature,wind-speed,sky-cover,humidity,average-wind-speed-(period),average-pressure-(period),power-generated,wind_dir_sin,wind_dir_cos
0,0.859897,69,7.5,0,75,8.0,29.82,0,-1.0,-1.83697e-16
1,0.628535,69,7.5,0,77,5.0,29.85,0,-1.0,-1.83697e-16
2,0.397172,69,7.5,0,70,0.0,29.89,5418,-1.0,-1.83697e-16
3,0.16581,69,7.5,0,33,0.0,29.91,25477,-1.0,-1.83697e-16
4,0.065553,69,7.5,0,21,3.0,29.89,30069,-1.0,-1.83697e-16


## 2.5 Train-Test Split

In [27]:
from sklearn.model_selection import train_test_split
# Separate features (X) and target (y)
X = data.drop('power-generated', axis=1)
y = data['power-generated']

# Split Train and Test Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2.6 Feature Scaling

In [28]:
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.compose import ColumnTransformer
import pandas as pd

robust_cols = [
    'wind-speed',
    'humidity',
    'average-wind-speed-(period)'
]

standard_cols = [
    'distance-to-solar-noon',
    'temperature'
]

preprocessor = ColumnTransformer(
    transformers=[
        ('robust', RobustScaler(), robust_cols),
        ('standard', StandardScaler(), standard_cols)
    ],
    remainder='passthrough',
    verbose_feature_names_out=False   # keep original column names (no transformer prefixes)
)

# Fit scaler ONLY on Training data
X_train_scaled = preprocessor.fit_transform(X_train)

# Transform Test data (Do NOT fit)
X_test_scaled = preprocessor.transform(X_test)


In [29]:
display(X_train_scaled)
display(y_train)

array([[ 0.65909091, -0.36842105,  0.1       , ..., 29.99      ,
        -0.98480775, -0.17364818],
       [-0.29545455,  1.        , -0.4       , ..., 30.08      ,
        -0.17364818, -0.98480775],
       [-0.08333333, -0.78947368,  0.8       , ..., 29.97      ,
        -0.98480775, -0.17364818],
       ...,
       [-1.15909091, -0.10526316, -0.6       , ..., 30.02      ,
         0.5       ,  0.8660254 ],
       [-0.35606061, -0.26315789, -0.1       , ..., 30.12      ,
        -0.5       , -0.8660254 ],
       [-0.20454545,  0.15789474, -0.3       , ..., 30.12      ,
        -0.17364818, -0.98480775]], shape=(2324, 9))

163     12786
1947        0
252     19260
1831    17762
2043        0
        ...  
1638    31714
1095        0
1130     1488
1294        0
860      8699
Name: power-generated, Length: 2324, dtype: int64

In [30]:
y_train_log = np.log1p(y_train)  # Apply log1p to training target
y_test_log = np.log1p(y_test)
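
The `log1p` transform in the cell above is inverted with `np.expm1` when predictions are mapped back to the original scale. A minimal sketch of the round trip follows, using the first few target values shown earlier; `pred_log` holds hypothetical model outputs and the clipping step is an assumption about how small negative values might be handled.

```python
import numpy as np

# Target values from the first rows of the dataset, including a zero.
y = np.array([0, 5418, 25477, 30069], dtype=float)

y_log = np.log1p(y)        # log(1 + y), defined at y = 0
y_back = np.expm1(y_log)   # exp(y_log) - 1, the exact inverse

print(np.allclose(y, y_back))

# Predictions made in log space are mapped back the same way.
pred_log = np.array([-0.05, 8.6, 10.1])        # hypothetical model outputs
pred = np.clip(np.expm1(pred_log), 0, None)    # clip guards against small negatives
print(pred.round(1))
```

Because `log1p(0)` is 0, rows with zero power need no special handling, which a plain `log` would require.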

**Target variable:** `power-generated` was log-transformed (`log1p`) so that very large values do not dominate the model and prediction errors become more stable; results are converted back to the original scale after prediction.

**Feature scaling:** Features with extreme but meaningful values (wind-speed, humidity, average-wind-speed-(period)) were scaled using RobustScaler, while the more normally distributed features (distance-to-solar-noon, temperature) were scaled using StandardScaler.

**Reasoning:** The remaining variables (sky-cover, average-pressure-(period), and the wind-direction sine and cosine components) were passed through unscaled to keep their real-world meaning, so the data remains realistic while model learning improves.

Feature scaling is generally needed for models sensitive to feature magnitudes or distances, including linear models, distance-based models (KNN, K-means, DBSCAN), SVR, and neural networks.
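
The difference between the two scalers can be seen by writing out their default formulas by hand. The sketch below uses invented wind-speed-like values with one extreme reading; the function names `standard_scale` and `robust_scale` are hypothetical stand-ins for what StandardScaler and RobustScaler compute per column with default settings.

```python
import numpy as np

def standard_scale(x):
    """(x - mean) / std, using the population standard deviation."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """(x - median) / IQR, using the 25th and 75th percentiles."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

# Invented values with one extreme reading at the end.
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 40.0])

print(standard_scale(x).round(2))
print(robust_scale(x).round(2))
```

With these values the extreme reading maps to 13 under robust scaling while the five typical readings span -1 to 0.6; under standard scaling the same five readings are compressed into about -0.61 to -0.28, because the extreme value inflates both the mean and the standard deviation.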

## Preprocessing Complete!

### Summary of Preprocessing Phase:
1. **Missing Values**: Dropped the one row with missing values
2. **Outlier Treatment**: Filtered physically impossible (negative) values and dropped 13 contextual outliers (zero power near solar noon under a clear sky)
3. **Encoding**: Cyclic (sine/cosine) encoding of wind direction
4. **Multicollinearity Check and Feature Selection**: Dropped the low-impact visibility feature
5. **Train-Test Split**: Split the data into training (80%) and test (20%) sets
6. **Feature Scaling**: Input features are scaled using robust and standard scaling based on their distributions, and the target variable (power-generated) is log1p-transformed to stabilize variance

