Resource - https://www.kaggle.com/code/muhmddanish/solar-energy-generation-eda-forecast

##### Problem Statement :

Solar power generation has increased in recent years, and so has the inefficient management of the electric grid. As solar generation scales up, excess energy may go unused, and when generation drops, it can become difficult to meet consumer demand.

This raises the question: is there a way to mitigate these issues?

In [40]:
import numpy as np   # mathematical functions to operate on multi-dimensional arrays.
import pandas as pd  # data manipulation and analysis library that provides data structures like DataFrames
import matplotlib.pyplot as plt      # create various types of plots and charts
import seaborn as sns            # creating informative and attractive statistical graphics


In [41]:
# used to read data from a CSV (Comma-Separated Values) file into a Pandas DataFrame. 
solar = pd.read_csv('Solar Power Plant Data.csv')

# displays the rows of a Pandas DataFrame
solar

Unnamed: 0,Date-Hour(NMT),WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction
0,01.01.2017-00:00,0.6,0,1003.8,-7.4,0.1,97,0.0
1,01.01.2017-01:00,1.7,0,1003.5,-7.4,-0.2,98,0.0
2,01.01.2017-02:00,0.6,0,1003.4,-6.7,-1.2,99,0.0
3,01.01.2017-03:00,2.4,0,1003.3,-7.2,-1.3,99,0.0
4,01.01.2017-04:00,4.0,0,1003.1,-6.3,3.6,67,0.0
...,...,...,...,...,...,...,...,...
8755,31.12.2017-19:00,4.1,0,988.2,-4.8,-0.7,94,0.0
8756,31.12.2017-20:00,2.1,0,987.3,-5.0,-0.3,95,0.0
8757,31.12.2017-21:00,1.8,0,986.7,-5.3,0.2,93,0.0
8758,31.12.2017-22:00,2.2,0,986.0,-5.4,0.3,92,0.0


In [42]:
# Parse the 'DD.MM.YYYY-HH:MM' strings into pandas datetimes (dtype datetime64[ns]).
# An explicit format avoids day/month ambiguity (02.01.2017 is 2 January, not 1 February).
# Note: astype(np.datetime64) without a unit may be rejected by recent pandas versions.
solar['Date-Hour(NMT)'] = pd.to_datetime(solar['Date-Hour(NMT)'], format='%d.%m.%Y-%H:%M')

# Create new columns 'hour' and 'day' from the timestamps
# using the .dt.hour and .dt.day accessors
solar['hour'] = solar['Date-Hour(NMT)'].dt.hour
solar['day'] = solar['Date-Hour(NMT)'].dt.day

# set date time column as index
# The set_index method in Pandas is used to set the DataFrame index using existing columns. 
# When the inplace parameter is set to True, the original DataFrame is modified. 
solar.set_index('Date-Hour(NMT)', inplace = True)

solar.head()

Date-Hour(NMT),WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction,hour,day
2017-01-01 00:00:00,0.6,0,1003.8,-7.4,0.1,97,0.0,0,1
2017-01-01 01:00:00,1.7,0,1003.5,-7.4,-0.2,98,0.0,1,1
2017-01-01 02:00:00,0.6,0,1003.4,-6.7,-1.2,99,0.0,2,1
2017-01-01 03:00:00,2.4,0,1003.3,-7.2,-1.3,99,0.0,3,1
2017-01-01 04:00:00,4.0,0,1003.1,-6.3,3.6,67,0.0,4,1
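
As a minimal sketch of how strings in the `DD.MM.YYYY-HH:MM` layout seen in the raw table can be parsed, the snippet below applies `pd.to_datetime` with an explicit format to three sample strings. The variable names `raw` and `parsed` are illustrative and not part of the notebook.

```python
import pandas as pd

# Three sample strings in the same layout as the 'Date-Hour(NMT)' column.
raw = pd.Series(["01.01.2017-00:00", "01.01.2017-01:00", "31.12.2017-23:00"])

# Day first, then month, then year, a hyphen, and the time of day.
parsed = pd.to_datetime(raw, format="%d.%m.%Y-%H:%M")

# The .dt accessor exposes the individual datetime components.
hours = parsed.dt.hour.tolist()
days = parsed.dt.day.tolist()
months = parsed.dt.month.tolist()

print(hours, days, months)
```

With the format spelled out, a string that does not match the layout should raise an error rather than being silently misread.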


In [44]:
# change the index name of Date-Hour(NMT)
solar.index.name = 'datetime'
solar.tail()

datetime,WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction,hour,day
2017-12-31 19:00:00,4.1,0,988.2,-4.8,-0.7,94,0.0,19,31
2017-12-31 20:00:00,2.1,0,987.3,-5.0,-0.3,95,0.0,20,31
2017-12-31 21:00:00,1.8,0,986.7,-5.3,0.2,93,0.0,21,31
2017-12-31 22:00:00,2.2,0,986.0,-5.4,0.3,92,0.0,22,31
2017-12-31 23:00:00,2.4,0,985.6,-5.9,0.4,96,0.0,23,31


## DATA WRANGLING

- Data wrangling is the process of converting raw data into a usable form by exploring, transforming, and validating datasets, turning messy and complex data into high-quality data.

In [45]:
# check missing values and datatypes of the dataset
solar.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2017-01-01 00:00:00 to 2017-12-31 23:00:00
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   WindSpeed            8760 non-null   float64
 1   Sunshine             8760 non-null   int64  
 2   AirPressure          8760 non-null   float64
 3   Radiation            8760 non-null   float64
 4   AirTemperature       8760 non-null   float64
 5   RelativeAirHumidity  8760 non-null   int64  
 6   SystemProduction     8760 non-null   float64
 7   hour                 8760 non-null   int64  
 8   day                  8760 non-null   int64  
dtypes: float64(5), int64(4)
memory usage: 684.4 KB


##### No missing values in the dataset
##### Dataset contains correct datatypes
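
Beyond `info()`, a few explicit checks can support the validation step. The sketch below is an assumption about what such checks might look like, not something the notebook runs: it uses a small synthetic frame named `demo` (hypothetical) in place of `solar` and tests for missing values, duplicated timestamps, and gaps in the hourly index. For the real data, the 8760 entries reported by `info()` match 365 × 24 hourly rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `solar`: 48 consecutive hourly rows.
idx = pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(48), unit="h")
demo = pd.DataFrame(
    {"Radiation": np.zeros(48), "SystemProduction": np.zeros(48)},
    index=idx,
)

# 1. Total number of missing cells across all columns.
n_missing = int(demo.isna().sum().sum())

# 2. Number of timestamps that appear more than once in the index.
n_duplicates = int(demo.index.duplicated().sum())

# 3. Every step between consecutive timestamps should be exactly one hour.
steps = demo.index.to_series().diff().dropna()
is_hourly = bool((steps == pd.Timedelta(hours=1)).all())

print(n_missing, n_duplicates, is_hourly)
```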

In [46]:
# Statistical summary
solar.describe()

Unnamed: 0,WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction,hour,day
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,2.639823,11.180479,1010.361781,97.538493,6.978893,76.719406,684.746071,11.5,15.720548
std,1.628754,21.171295,12.793971,182.336029,7.604266,19.278996,1487.454665,6.922582,8.796749
min,0.0,0.0,965.9,-9.3,-12.4,13.0,0.0,0.0,1.0
25%,1.4,0.0,1002.8,-6.2,0.5,64.0,0.0,5.75,8.0
50%,2.3,0.0,1011.0,-1.4,6.4,82.0,0.0,11.5,16.0
75%,3.6,7.0,1018.2,115.6,13.4,93.0,464.24995,17.25,23.0
max,10.9,60.0,1047.3,899.7,27.1,100.0,7701.0,23.0,31.0


##### The 50% value (median) of Radiation is negative (-1.4), at a point where there is no energy generation (the median of SystemProduction is 0). Physically, negative radiation is not feasible here: it would correspond to the PV panels, rather than the sun, emitting radiation. Negative readings can still appear in practical measurements because of inaccuracies in methods and devices, as well as noise in sensor readings. Since they carry no physical meaning for this analysis, setting these values to 0 is a reasonable option.
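
Before replacing the negative readings, it can help to look at how many there are and whether any of them coincide with production. The sketch below is illustrative only: `demo` is a hypothetical six-row frame whose first four Radiation values are taken from the rows displayed earlier, while the two daytime rows are made up for the example.

```python
import pandas as pd

# Hypothetical sample: four night-time rows from the displayed table,
# plus two invented daytime rows (assumption, for illustration only).
demo = pd.DataFrame(
    {
        "hour": [0, 1, 2, 3, 12, 13],
        "Radiation": [-7.4, -7.4, -6.7, -7.2, 115.6, 899.7],
        "SystemProduction": [0.0, 0.0, 0.0, 0.0, 464.2, 7701.0],
    }
)

# Boolean mask of rows with a negative radiation reading.
negative = demo["Radiation"] < 0
n_negative = int(negative.sum())

# If negative readings are only sensor noise at night,
# production in those rows should be zero.
max_production_when_negative = demo.loc[negative, "SystemProduction"].max()

# clip(lower=0) raises every value below 0 to 0 and leaves the rest untouched.
clipped = demo["Radiation"].clip(lower=0)

print(n_negative, max_production_when_negative, clipped.tolist())
```

On the full dataset, the same mask could be grouped by `hour` to see whether the negative readings are concentrated at night.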

In [47]:
# Replace the negative value with 0
# clip method to set any values in the "Radiation" column that are less than 0 to 0.
solar['Radiation'] = solar['Radiation'].clip(lower = 0)
solar

datetime,WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction,hour,day
2017-01-01 00:00:00,0.6,0,1003.8,0.0,0.1,97,0.0,0,1
2017-01-01 01:00:00,1.7,0,1003.5,0.0,-0.2,98,0.0,1,1
2017-01-01 02:00:00,0.6,0,1003.4,0.0,-1.2,99,0.0,2,1
2017-01-01 03:00:00,2.4,0,1003.3,0.0,-1.3,99,0.0,3,1
2017-01-01 04:00:00,4.0,0,1003.1,0.0,3.6,67,0.0,4,1
...,...,...,...,...,...,...,...,...,...
2017-12-31 19:00:00,4.1,0,988.2,0.0,-0.7,94,0.0,19,31
2017-12-31 20:00:00,2.1,0,987.3,0.0,-0.3,95,0.0,20,31
2017-12-31 21:00:00,1.8,0,986.7,0.0,0.2,93,0.0,21,31
2017-12-31 22:00:00,2.2,0,986.0,0.0,0.3,92,0.0,22,31


In [48]:
# statistical summary
solar.describe()

Unnamed: 0,WindSpeed,Sunshine,AirPressure,Radiation,AirTemperature,RelativeAirHumidity,SystemProduction,hour,day
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,2.639823,11.180479,1010.361781,100.594087,6.978893,76.719406,684.746071,11.5,15.720548
std,1.628754,21.171295,12.793971,180.614494,7.604266,19.278996,1487.454665,6.922582,8.796749
min,0.0,0.0,965.9,0.0,-12.4,13.0,0.0,0.0,1.0
25%,1.4,0.0,1002.8,0.0,0.5,64.0,0.0,5.75,8.0
50%,2.3,0.0,1011.0,0.0,6.4,82.0,0.0,11.5,16.0
75%,3.6,7.0,1018.2,115.6,13.4,93.0,464.24995,17.25,23.0
max,10.9,60.0,1047.3,899.7,27.1,100.0,7701.0,23.0,31.0


* About 75% of the observations have wind speeds of 3.6 or below.
* The mean Sunshine value is 11.18. Since the maximum is 60, the column appears to record minutes of sunshine per hour rather than hours.
* Air temperatures vary from -12.4 to 27.1 degrees Celsius.
* After clipping, Radiation ranges from 0.0 to 899.7, with a median of 0.0.
* System production spans from 0.0 to 7701.0, with a mean of 684.75 and a median of 0.0.
* The average relative humidity is 76.72%, with values ranging from 13% to 100%.
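
Statements like the ones above can be checked directly rather than read off the `describe()` table. As a minimal sketch, the snippet below computes a quantile and the share of values under a threshold for the ten WindSpeed values visible in the displayed head and tail rows. The series name `wind` is illustrative, and ten rows are of course not representative of the full dataset.

```python
import pandas as pd

# WindSpeed values from the first five and last five displayed rows.
wind = pd.Series([0.6, 1.7, 0.6, 2.4, 4.0, 4.1, 2.1, 1.8, 2.2, 2.4])

# Fraction of observations strictly below the threshold used in the summary.
share_below = (wind < 3.6).mean()

# The 75th percentile, the same statistic describe() reports as '75%'.
q75 = wind.quantile(0.75)

print(share_below, q75)
```

Applied to `solar['WindSpeed']`, the same two lines would show how much of the data actually falls below 3.6.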

## EDA