# **Introduction**

In recent years, the shift towards renewable energy sources has been significant. The utilization of Photovoltaic (PV) Solar Power has become increasingly prominent due to advancements in PV technology, battery storage, and smart grid systems. PV Solar Plants, which convert solar energy directly into electrical power through solar panels, represent a key component of this green energy transition. These plants, characterized by their ability to harness sunlight to generate direct current (DC) power—which is then converted into alternating current (AC) power—are crucial for producing large-scale electrical power sustainably. However, the operation of solar power plants is accompanied by unique challenges, including variability in power output due to the diurnal and seasonal nature of sunlight, the need for immediate power usage or storage, and the extensive maintenance required to keep large arrays of solar panels functioning optimally.

# **Objectives**

The primary objective of this project is to analyze and predict solar power generation while addressing the inherent challenges of managing a solar power plant. This involves a detailed examination of solar power generation data alongside weather data to identify patterns, predict future power output, and suggest methods for enhancing grid stability and efficiency. Specifically, the project aims to: 

1) Analyze historical data to understand the impact of environmental conditions on power generation; and

2) Develop predictive models that forecast daily and seasonal power outputs.

By achieving these goals, the project seeks to contribute to the digital transformation of the solar sector, enhancing the reliability and efficiency of solar power systems and promoting wider adoption of sustainable energy practices.

In [1]:
import math
import numpy as np
import pandas as pd
import datetime as dt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

In [2]:
def preprocess_weather_data(file_path):
    df_weather = pd.read_csv(file_path)
    
    print("Missing values:")
    print(df_weather.isna().sum())
    
    print("\nData types:")
    print(df_weather.dtypes)
    
    df_weather['DATE_TIME'] = pd.to_datetime(df_weather['DATE_TIME'], format='%Y-%m-%d %H:%M:%S')
    df_weather.set_index('DATE_TIME', inplace=True)
    
    print("\nUnique sources:")
    print(df_weather['SOURCE_KEY'].unique())
    
    return df_weather

file_path = '../input/solar-power/Plant_2_Weather_Sensor_Data.csv'
df_weather = preprocess_weather_data(file_path)

Missing values:
DATE_TIME              0
PLANT_ID               0
SOURCE_KEY             0
AMBIENT_TEMPERATURE    0
MODULE_TEMPERATURE     0
IRRADIATION            0
dtype: int64

Data types:
DATE_TIME               object
PLANT_ID                 int64
SOURCE_KEY              object
AMBIENT_TEMPERATURE    float64
MODULE_TEMPERATURE     float64
IRRADIATION            float64
dtype: object

Unique sources:
['iq8k7ZNt4Mwm3w0']


In [3]:
df_weather

DATE_TIME,PLANT_ID,SOURCE_KEY,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
2020-05-15 00:00:00,4136001,iq8k7ZNt4Mwm3w0,27.004764,25.060789,0.0
2020-05-15 00:15:00,4136001,iq8k7ZNt4Mwm3w0,26.880811,24.421869,0.0
2020-05-15 00:30:00,4136001,iq8k7ZNt4Mwm3w0,26.682055,24.427290,0.0
2020-05-15 00:45:00,4136001,iq8k7ZNt4Mwm3w0,26.500589,24.420678,0.0
2020-05-15 01:00:00,4136001,iq8k7ZNt4Mwm3w0,26.596148,25.088210,0.0
...,...,...,...,...,...
2020-06-17 22:45:00,4136001,iq8k7ZNt4Mwm3w0,23.511703,22.856201,0.0
2020-06-17 23:00:00,4136001,iq8k7ZNt4Mwm3w0,23.482282,22.744190,0.0
2020-06-17 23:15:00,4136001,iq8k7ZNt4Mwm3w0,23.354743,22.492245,0.0
2020-06-17 23:30:00,4136001,iq8k7ZNt4Mwm3w0,23.291048,22.373909,0.0


With the 'DATE_TIME' field now set as the index, it's possible to create quick and simple graphs to investigate the data. Some potential issues with the data are apparent, suggesting that further investigation may be warranted later. However, the diurnal pattern of solar irradiation is clearly visible, with noticeable variations.

In [4]:
def plot_irradiation(df):
    
    fig = px.line(df, x=df.index, y='IRRADIATION', title='Irradiation Over Time')
    fig.show()

plot_irradiation(df_weather)
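
One way to follow up on the potential data issues noted above is to check whether any readings are missing from the time series. The sketch below is a minimal example on a small synthetic index, not the plant data. The helper name `find_missing_timestamps` is hypothetical, and the 15-minute frequency is an assumption based on the timestamps shown in the table output.

```python
import pandas as pd

def find_missing_timestamps(index, freq="15min"):
    """Return the timestamps expected at `freq` that are absent from `index`."""
    # Build the complete grid between the first and last observed reading.
    expected = pd.date_range(index.min(), index.max(), freq=freq)
    # Anything on the grid but not in the data is a gap.
    return expected.difference(index)

# Synthetic example: one hour of 15-minute readings with 00:30 removed.
full = pd.date_range("2020-05-15 00:00", "2020-05-15 01:00", freq="15min")
observed = full.delete(2)

missing = find_missing_timestamps(observed)
print(missing)
```

On the notebook's data this could be called as `find_missing_timestamps(df_weather.index)`.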

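Now, let's explore the solar generation data. Unlike the weather data, which comes from a single sensor, the generation data contains one row per source per timestamp, so it has many more rows. Both datasets are indexed by DATE_TIME at 15-minute intervals over the same period, which simplifies joining them for further analysis.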

In [5]:
def preprocess_solar_data(file_path):

    df_solar = pd.read_csv(file_path)
    
    df_solar['DATE_TIME'] = pd.to_datetime(df_solar['DATE_TIME'], format='%Y-%m-%d %H:%M:%S')
    df_solar.set_index('DATE_TIME', inplace=True)
    
    print("Missing values:")
    print(df_solar.isna().sum())
    
    print("\nNumber of unique SOURCE_KEY values:")
    print(df_solar['SOURCE_KEY'].nunique())
    
    return df_solar


df_solar = preprocess_solar_data(file_path = '../input/solar-power/Plant_2_Generation_Data.csv')

Missing values:
PLANT_ID       0
SOURCE_KEY     0
DC_POWER       0
AC_POWER       0
DAILY_YIELD    0
TOTAL_YIELD    0
dtype: int64

Number of unique SOURCE_KEY values:
22


In [6]:
df_solar

DATE_TIME,PLANT_ID,SOURCE_KEY,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD
2020-05-15 00:00:00,4136001,4UPUqMRk7TRMgml,0.0,0.0,9425.000000,2.429011e+06
2020-05-15 00:00:00,4136001,81aHJ1q11NBPMrL,0.0,0.0,0.000000,1.215279e+09
2020-05-15 00:00:00,4136001,9kRcWv60rDACzjR,0.0,0.0,3075.333333,2.247720e+09
2020-05-15 00:00:00,4136001,Et9kgGMDl729KT4,0.0,0.0,269.933333,1.704250e+06
2020-05-15 00:00:00,4136001,IQ2d7wF4YD8zU1Q,0.0,0.0,3177.000000,1.994153e+07
...,...,...,...,...,...,...
2020-06-17 23:45:00,4136001,q49J1IKaHRwDQnt,0.0,0.0,4157.000000,5.207580e+05
2020-06-17 23:45:00,4136001,rrq4fwE8jgrTyWY,0.0,0.0,3931.000000,1.211314e+08
2020-06-17 23:45:00,4136001,vOuJvMaM2sgwLmb,0.0,0.0,4322.000000,2.427691e+06
2020-06-17 23:45:00,4136001,xMbIugepa2P7lBB,0.0,0.0,4218.000000,1.068964e+08


With the date set as the index, it's possible to create graphs for each of the sources in the solar generation data. Typically, the generation is very similar for each source, but occasionally there is some variation that might warrant further investigation.
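
As a rough way to find the sources whose output departs from the rest, each source can be compared with the median across all sources at the same timestamp. The sketch below uses made-up values and a hypothetical helper, `max_deviation_from_median`. It assumes the frame is indexed by `DATE_TIME` and has `SOURCE_KEY` and `DC_POWER` columns, as in the table above.

```python
import pandas as pd

def max_deviation_from_median(df, value_col="DC_POWER", key_col="SOURCE_KEY"):
    """For each source, return its largest absolute gap from the cross-source median."""
    long = df.reset_index()  # DATE_TIME becomes an ordinary column
    # One row per timestamp, one column per source.
    wide = long.pivot_table(index="DATE_TIME", columns=key_col, values=value_col)
    median = wide.median(axis=1)
    return wide.sub(median, axis=0).abs().max().sort_values(ascending=False)

# Synthetic example: source "C" drops out at the second timestamp.
times = pd.date_range("2020-05-15 10:00", periods=3, freq="15min")
toy = pd.DataFrame({
    "DATE_TIME": list(times) * 3,
    "SOURCE_KEY": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "DC_POWER": [100.0, 110.0, 120.0, 101.0, 111.0, 121.0, 100.0, 20.0, 120.0],
}).set_index("DATE_TIME")

deviation = max_deviation_from_median(toy)
print(deviation)
```

Sources at the top of the returned series are the ones most worth plotting individually.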

# Combining Solar and Weather Data

In [7]:
def merge_solar_weather_data(df_solar, df_weather):
    merged_df = pd.merge(df_solar, df_weather, on='DATE_TIME', how='inner', suffixes=('_solar', '_weather'))
    
    return merged_df

df = merge_solar_weather_data(df_solar, df_weather)
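
Because each weather reading matches several generation rows, this is a many-to-one join, and an inner join silently drops timestamps that appear in only one frame. The sketch below shows one way to guard against both problems using the `validate` argument of `pd.merge` and a row-count comparison. The toy frames and their values are made up, and `DATE_TIME` is kept as an ordinary column here for simplicity.

```python
import pandas as pd

# Toy generation data: two sources reporting at two timestamps.
solar = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 00:00"] * 2 + ["2020-05-15 00:15"] * 2),
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_b"],
    "DC_POWER": [0.0, 0.0, 5.0, 6.0],
})
# Toy weather data: a single sensor, one row per timestamp.
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 00:00", "2020-05-15 00:15"]),
    "SOURCE_KEY": ["sensor_1", "sensor_1"],
    "IRRADIATION": [0.0, 0.01],
})

merged = pd.merge(
    solar, weather,
    on="DATE_TIME", how="inner",
    suffixes=("_solar", "_weather"),
    validate="many_to_one",  # raises MergeError if weather timestamps repeat
)

# Generation rows without a matching weather reading would be lost here.
rows_dropped = len(solar) - len(merged)
print(merged.shape, rows_dropped)
```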


# Dropping Columns

In [8]:
def drop_columns(df):
    df.drop(['SOURCE_KEY_solar', 'SOURCE_KEY_weather', 'PLANT_ID_weather'], axis=1, inplace=True)
    return df


df = drop_columns(df)
df.head(5)

DATE_TIME,PLANT_ID_solar,DC_POWER,AC_POWER,DAILY_YIELD,TOTAL_YIELD,AMBIENT_TEMPERATURE,MODULE_TEMPERATURE,IRRADIATION
2020-05-15,4136001,0.0,0.0,9425.0,2429011.0,27.004764,25.060789,0.0
2020-05-15,4136001,0.0,0.0,0.0,1215279000.0,27.004764,25.060789,0.0
2020-05-15,4136001,0.0,0.0,3075.333333,2247720000.0,27.004764,25.060789,0.0
2020-05-15,4136001,0.0,0.0,269.933333,1704250.0,27.004764,25.060789,0.0
2020-05-15,4136001,0.0,0.0,3177.0,19941530.0,27.004764,25.060789,0.0


# Calculating Mean Values Across Sources for Each Timestamp

In [9]:
def calculate_daily_mean(df):
    df_grouped = df[['DC_POWER', 'IRRADIATION', 'AMBIENT_TEMPERATURE', 'DAILY_YIELD']].groupby('DATE_TIME').mean()
    return df_grouped


df = calculate_daily_mean(df)
df.head()

DATE_TIME,DC_POWER,IRRADIATION,AMBIENT_TEMPERATURE,DAILY_YIELD
2020-05-15 00:00:00,0.0,0.0,27.004764,2222.724459
2020-05-15 00:15:00,0.0,0.0,26.880811,1290.954545
2020-05-15 00:30:00,0.0,0.0,26.682055,1290.954545
2020-05-15 00:45:00,0.0,0.0,26.500589,1290.954545
2020-05-15 01:00:00,0.0,0.0,26.596148,1205.272727


# Adding Scaled Irradiation Column for Graphing Solar Plant Data

In [10]:
def add_irradiation_times_1000(df):
    df['IRRADIATIONx1000'] = df['IRRADIATION'] * 1000
    return df

df = add_irradiation_times_1000(df)
df.head()


DATE_TIME,DC_POWER,IRRADIATION,AMBIENT_TEMPERATURE,DAILY_YIELD,IRRADIATIONx1000
2020-05-15 00:00:00,0.0,0.0,27.004764,2222.724459,0.0
2020-05-15 00:15:00,0.0,0.0,26.880811,1290.954545,0.0
2020-05-15 00:30:00,0.0,0.0,26.682055,1290.954545,0.0
2020-05-15 00:45:00,0.0,0.0,26.500589,1290.954545,0.0
2020-05-15 01:00:00,0.0,0.0,26.596148,1205.272727,0.0


# Graph Displaying the Correlation Between DC Power and Irradiation

In [11]:
def plot_irradiation_vs_dc_power(df):
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(x=df.index, y=df['DC_POWER'], mode='lines', name='DC_POWER'))
    fig.add_trace(go.Scatter(x=df.index, y=df['IRRADIATIONx1000'], mode='lines', name='IRRADIATIONx1000'))
    
    fig.update_layout(title='Irradiation (x1000 for scale) versus DC_POWER',
                      xaxis_title='Date Time',
                      yaxis_title='Values',
                      width=1000,
                      height=500)
    
    fig.show()

plot_irradiation_vs_dc_power(df)
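
Multiplying irradiation by 1000 is a hand-picked factor that happens to suit this dataset. A possible alternative, shown here only as a sketch and not as what the notebook does, is to rescale each series to the range 0 to 1 before plotting, so no factor has to be chosen. The helper `min_max_scale` and the toy values are hypothetical.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    span = series.max() - series.min()
    if span == 0:
        # A constant series has no range to scale; return zeros.
        return series * 0.0
    return (series - series.min()) / span

# Made-up values on very different scales that follow the same shape.
toy = pd.DataFrame({
    "DC_POWER": [0.0, 250.0, 500.0, 1000.0],
    "IRRADIATION": [0.0, 0.2, 0.4, 0.8],
})
scaled = toy.apply(min_max_scale)
print(scaled)
```

After scaling, both columns can share one y-axis, at the cost of losing the original units.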


# Correlation Analysis of Solar Power Plant Parameters

This correlation plot reveals a strong correlation between DC power and irradiation (a coefficient of about 0.93), illustrating the direct dependency of solar panel output on sunlight exposure. Additionally, ambient temperature is strongly linked to DC power and moderately to daily yield. This link most likely reflects that sunnier periods are also warmer, rather than heat itself improving generation, since PV module efficiency generally declines as temperature rises.

In [12]:
def plot_correlation_heatmap(df):
    corr = df.loc[:, df.columns != 'IRRADIATIONx1000'].corr()

    text = np.around(corr.values, decimals=2)
    annotations = []
    for i, row in enumerate(corr.index):
        for j, col in enumerate(corr.columns):
            annotations.append(
                dict(
                    text=str(text[i, j]),
                    x=col,
                    y=row,
                    xref='x1',
                    yref='y1',
                    font=dict(color='white'),
                    showarrow=False)
            )

    fig = go.Figure(data=go.Heatmap(
        z=corr.values,
        x=corr.index,
        y=corr.columns,
        colorscale='RdBu',
        colorbar=dict(title='Correlation'),
        hoverongaps=False,
        hoverinfo='text',
        text=text
    ))

    fig.update_layout(
        title='Correlation Heatmap (Excluding IRRADIATIONx1000)',
        xaxis_title='Columns',
        yaxis_title='Columns',
        annotations=annotations,
        width=800,
        height=600,
    )

    fig.show()


plot_correlation_heatmap(df)
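
Each cell in the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below writes the formula out with NumPy to show where such a number comes from. The helper name `pearson_r` and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up readings in which power rises almost linearly with irradiation.
irradiation = [0.0, 0.2, 0.5, 0.9]
dc_power = [0.0, 210.0, 480.0, 900.0]

r = pearson_r(irradiation, dc_power)
print(round(r, 4))
```

A value near 1 means the two series move together along a straight line. It says nothing about which one drives the other.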

# Quantifying the Relationship Between Irradiation and DC Power

The graph underscores a clear relationship between irradiation and DC power, with a high correlation coefficient of 0.93, indicating a strong linear association. By resampling the data to daily totals and scaling irradiation by 1000, we observe closely matching patterns between the two metrics. This comparison emphasizes patterns over exact values, focusing on how closely trends in irradiation align with changes in DC power output.

In [13]:
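def resample_and_scale_irradiation(df):
    # Daily totals. Note that sum() is applied to every column, so
    # AMBIENT_TEMPERATURE and DAILY_YIELD are summed as well.
    df_resample = df.resample('D').sum()

    # Rescale irradiation so it plots on a similar scale to DC_POWER.
    df_resample['IRRADIATIONx1000'] = df_resample['IRRADIATION'] * 1000

    return df_resample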

df = resample_and_scale_irradiation(df)
df.head()

DATE_TIME,DC_POWER,IRRADIATION,AMBIENT_TEMPERATURE,DAILY_YIELD,IRRADIATIONx1000
2020-05-15,30300.346861,28.559055,2903.769278,363518.728571,28559.05517
2020-05-16,25765.447273,23.676573,2829.946762,328848.08658,23676.57285
2020-05-17,25283.496282,21.233595,2858.890508,305102.967566,21233.595056
2020-05-18,24126.076234,21.49578,2624.322622,284503.949351,21495.780314
2020-05-19,20158.902137,20.345321,2517.562616,250553.021528,20345.320608
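
Summing every column treats all of them as additive, which is why AMBIENT_TEMPERATURE appears in the thousands in the table above. A possible alternative, offered as a sketch and not as the notebook's method, is to aggregate each column with a function that suits it. The choice of `max` for DAILY_YIELD assumes that column is a running total within each day, which the notebook does not state. The toy data is synthetic.

```python
import pandas as pd

# Two days of synthetic 15-minute readings (96 per day).
times = pd.date_range("2020-05-15 00:00", periods=192, freq="15min")
toy = pd.DataFrame({
    "DC_POWER": 1.0,
    "IRRADIATION": 0.5,
    "AMBIENT_TEMPERATURE": 25.0,
    "DAILY_YIELD": range(192),
}, index=times)

daily = toy.resample("D").agg({
    "DC_POWER": "sum",              # additive over the day
    "IRRADIATION": "sum",           # additive over the day
    "AMBIENT_TEMPERATURE": "mean",  # a daily average is more meaningful than a sum
    "DAILY_YIELD": "max",           # assumption: cumulative within the day
})
print(daily)
```

Switching to this aggregation would change the feature values and therefore the model results reported later.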


In [14]:
def plot_daily_values(df):
    trace1 = go.Scatter(x=df.index, y=df['DC_POWER'], mode='lines', name='DC_POWER')
    trace2 = go.Scatter(x=df.index, y=df['IRRADIATIONx1000'], mode='lines', name='IRRADIATIONx1000')

    fig = go.Figure(data=[trace1, trace2])

    fig.update_layout(title="Daily values for total IRRADIATION (x1000) and total DC_POWER",
                      xaxis_title="Date",
                      yaxis_title="Value")
    fig.show()

plot_daily_values(df)


# Forecasting DC Power Output Using Regression Models: A Comparative Analysis

Given the high correlation between irradiation and DC power output, we anticipate strong predictive performance from various forecasting models, ranging from simple linear regression to more complex XGBoost models. The analysis splits the dataset chronologically into a training set (80%) and a test set (20%), matching `test_size=0.2` and `shuffle=False` in the code below, to evaluate how well the models predict daily DC power. The features are the daily aggregates of irradiation, ambient temperature, and daily yield, computed from values averaged across all sources. This approach allows for a comparison of different predictive techniques in solar power generation forecasting.

In [15]:
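def prepare_data_for_prediction(df):
    # Features: every remaining column (IRRADIATION, AMBIENT_TEMPERATURE, DAILY_YIELD).
    X = df.drop(['DC_POWER', 'IRRADIATIONx1000'], axis=1)
    # Target: daily DC power.
    y = df['DC_POWER']
    
    return X, y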

X, y = prepare_data_for_prediction(df)


In [16]:
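def split_train_test_data(X, y, test_size=0.2, shuffle=False, random_state=42):
    # shuffle=False keeps chronological order, so the test set is the final 20% of days;
    # random_state has no effect when shuffling is disabled.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=shuffle, random_state=random_state)
    return X_train, X_test, y_train, y_test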

X_train, X_test, y_train, y_test = split_train_test_data(X, y)
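
With `shuffle=False`, the split is a simple cut in time: the earliest rows train the model and the latest rows test it. The sketch below reproduces the sizes in plain Python, with no scikit-learn calls. The helper name is hypothetical, and rounding the test size up reflects my understanding of how `train_test_split` treats a fractional `test_size`, so treat it as an assumption.

```python
import math

def chronological_split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for an unshuffled split with a fractional test size."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test size rounds up
    n_train = n_samples - n_test
    return n_train, n_test

# 2020-05-15 through 2020-06-17 gives 34 daily rows after resampling.
n_train, n_test = chronological_split_sizes(34)

days = list(range(34))          # stand-in for the daily index, oldest first
train_days = days[:n_train]     # earliest days
test_days = days[n_train:]      # most recent days
print(n_train, n_test)
```

Seven test rows is a very small sample, so the evaluation scores should be read with caution.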


In [17]:
def train_and_evaluate_model(X_train, X_test, y_train, y_test):
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    
    model_score = lr.score(X_test, y_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = math.sqrt(mse)
    
    model_score = round(model_score, 2)
    mae = round(mae, 2)
    mse = round(mse, 2)
    rmse = round(rmse, 2)
    
    results_df = pd.DataFrame({'Metric': ['Model Score', 'MAE', 'MSE', 'RMSE'],
                               'Value': [model_score, mae, mse, rmse]})
    
    fig = go.Figure(data=[go.Table(header=dict(values=['Metric', 'Value']),
                                   cells=dict(values=[results_df['Metric'], results_df['Value']]))
                         ])
    fig.update_layout(title='Evaluation Scores')
    fig.show()
    
    return lr, y_pred

model, y_pred = train_and_evaluate_model(X_train, X_test, y_train, y_test)


In [18]:
df_results_lr = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
df_results_lr

DATE_TIME,y_test,y_pred
2020-06-11,15579.675188,17326.303069
2020-06-12,18692.448312,16942.521974
2020-06-13,20695.717749,20790.528065
2020-06-14,22995.975087,23347.87745
2020-06-15,18681.768615,21151.025449
2020-06-16,21855.003117,20777.29223
2020-06-17,17282.329113,15122.242192
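
The evaluation metrics can be recomputed by hand from the seven rows above, which makes clear what each one measures. The sketch below uses only the values printed in the table. The "Model Score" returned by `LinearRegression.score` is the coefficient of determination (R²), computed here as one minus the ratio of residual to total sum of squares.

```python
import math

# Values copied from the y_test / y_pred table above.
y_test = [15579.675188, 18692.448312, 20695.717749, 22995.975087,
          18681.768615, 21855.003117, 17282.329113]
y_pred = [17326.303069, 16942.521974, 20790.528065, 23347.87745,
          21151.025449, 20777.29223, 15122.242192]

errors = [p - a for a, p in zip(y_test, y_pred)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # same units as DC_POWER

mean_actual = sum(y_test) / n
ss_res = sum(e ** 2 for e in errors)                   # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in y_test)   # total sum of squares
r2 = 1 - ss_res / ss_tot                               # what lr.score reports

print(round(mae, 2), round(mse, 2), round(rmse, 2), round(r2, 2))
```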


In [19]:
def plot_results_lr(df_results_lr):
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=df_results_lr.index, y=df_results_lr['y_test'], mode='lines', name='Actual', line=dict(color='blue')))

    fig.add_trace(go.Scatter(x=df_results_lr.index, y=df_results_lr['y_pred'], mode='lines', name='Predicted', line=dict(color='orange')))

    fig.update_layout(title='Linear Regression prediction of DC_POWER and the Actual values',
                      xaxis_title='Date',
                      yaxis_title='DC_POWER',
                      showlegend=True)
    
    fig.show()

plot_results_lr(df_results_lr)
