
#Exercise
A customer has a site with a combined heat and power (CHP) plant and three gas boilers.
A CHP plant is a plant that takes natural gas as input and generates electricity and heat.
What can you say about the on-site electricity and heat generation? Gas consumption? And
the different units?

---



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
!pwd

/content


## Data Set Overview

In [226]:
# Read the csv file
df = pd.read_csv('data_site.csv')

In [88]:
# Display the head of the dataset using the head() function
df.head()

Unnamed: 0,From Timestamp,CHP Electricity Generated Day Total (kWh),CHP Electricity Generated Night Total (kWh),Boiler 1 Heat Energy Total (MWh),Boiler 2 Heat Energy Total (MWh),Boiler 3 Heat Energy Total (MWh),CHP Heat Energy Total (MWh),Boiler 1 Gas Corrected Volume (Sm3),Boiler 2 Gas Corrected Volume (Sm3),Boiler 3 Gas Corrected Volume (Sm3),CHP 2G Natural Gas Corrected Volume Total (Sm3)
0,01/10/2022 00:00,0,540,0.0,0.1,0.0,0.5,0,19,0,131
1,01/10/2022 00:30,0,530,0.0,0.1,0.0,0.4,0,19,0,129
2,01/10/2022 01:00,0,480,0.0,0.3,0.0,0.4,0,25,0,118
3,01/10/2022 01:30,0,510,0.0,0.2,0.0,0.4,0,30,0,122
4,01/10/2022 02:00,0,510,0.0,0.1,0.0,0.5,0,15,0,123


In [89]:
# Display the bottom 5 rows from the dataset using the tail() function
df.tail()

Unnamed: 0,From Timestamp,CHP Electricity Generated Day Total (kWh),CHP Electricity Generated Night Total (kWh),Boiler 1 Heat Energy Total (MWh),Boiler 2 Heat Energy Total (MWh),Boiler 3 Heat Energy Total (MWh),CHP Heat Energy Total (MWh),Boiler 1 Gas Corrected Volume (Sm3),Boiler 2 Gas Corrected Volume (Sm3),Boiler 3 Gas Corrected Volume (Sm3),CHP 2G Natural Gas Corrected Volume Total (Sm3)
1483,31/10/2022 21:30,550,0,0.0,0.0,0.3,0.5,0,0,28,133
1484,31/10/2022 22:00,540,0,0.0,0.0,0.2,0.5,0,0,35,134
1485,31/10/2022 22:30,550,0,0.0,0.0,0.3,0.5,0,0,40,134
1486,31/10/2022 23:00,540,0,0.0,0.0,0.3,0.5,0,0,35,134
1487,31/10/2022 23:30,540,0,0.0,0.0,0.3,0.4,0,0,34,132


### Observations:

Column *CHP Heat Energy Total (MWh)* values are in distinct units from CHP Electricity. The reading interval is every 30 minutes, and columns CHP Electricity Generated per Day and Night may be redundant?


Formula to convert units:
$$ \begin{equation} kWh = MWh \times 1000 \end{equation}$$

In [227]:
# Converting the data of the data set to constant units (kWh)
df.iloc[:,3:7] *= 1000

In [228]:
# Renaming all the columns for simplicity and uniform format
df.columns = ['Datetime','CHP_Electricity_Day(kWh)', 'CHP_Electricity_Night(kWh)', 'Boiler_1_Heat(kWh)', 'Boiler_2_Heat(kWh)', 'Boiler_3_Heat(kWh)', 'CHP_Total_Energy(kWh)', 'Boiler_1_Volume(Sm3)', 'Boiler_2_Volume(Sm3)','Boiler_3_Volume(Sm3)','CHP_2G_Volume(Sm3)']

In [105]:
df.head()

Unnamed: 0,Datetime,CHP_Electricity_Day(kWh),CHP_Electricity_Night(kWh),Boiler_1_Heat(kWh),Boiler_2_Heat(kWh),Boiler_3_Heat(kWh),CHP_Total_Energy(kWh),Boiler_1_Volume(Sm3),Boiler_2_Volume(Sm3),Boiler_3_Volume(Sm3),CHP_2G_Volume(Sm3)
0,2022-01-10 00:00:00,0,540,0.0,100.0,0.0,500.0,0,19,0,131
1,2022-01-10 00:30:00,0,530,0.0,100.0,0.0,400.0,0,19,0,129
2,2022-01-10 01:00:00,0,480,0.0,300.0,0.0,400.0,0,25,0,118
3,2022-01-10 01:30:00,0,510,0.0,200.0,0.0,400.0,0,30,0,122
4,2022-01-10 02:00:00,0,510,0.0,100.0,0.0,500.0,0,15,0,123


## Feature Data Types

In [106]:
# Summary of the data set information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1488 entries, 0 to 1487
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Datetime                    1488 non-null   datetime64[ns]
 1   CHP_Electricity_Day(kWh)    1488 non-null   int64         
 2   CHP_Electricity_Night(kWh)  1488 non-null   int64         
 3   Boiler_1_Heat(kWh)          1488 non-null   float64       
 4   Boiler_2_Heat(kWh)          1488 non-null   float64       
 5   Boiler_3_Heat(kWh)          1488 non-null   float64       
 6   CHP_Total_Energy(kWh)       1488 non-null   float64       
 7   Boiler_1_Volume(Sm3)        1488 non-null   int64         
 8   Boiler_2_Volume(Sm3)        1488 non-null   int64         
 9   Boiler_3_Volume(Sm3)        1488 non-null   int64         
 10  CHP_2G_Volume(Sm3)          1488 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(6)
memory usage:

The provided data consists of over 1400 observations contained in 11 columns variables. 10 columns are numeric  and the one remaining is an object.  

In [229]:
# Converting date column column to date format
df['Datetime'] = pd.to_datetime(df['Datetime'])

## Feature Statistics Summary

Below table provides the statistical details for each column.

In [108]:
# Analytical summary of the data set
df.describe()

Unnamed: 0,CHP_Electricity_Day(kWh),CHP_Electricity_Night(kWh),Boiler_1_Heat(kWh),Boiler_2_Heat(kWh),Boiler_3_Heat(kWh),CHP_Total_Energy(kWh),Boiler_1_Volume(Sm3),Boiler_2_Volume(Sm3),Boiler_3_Volume(Sm3),CHP_2G_Volume(Sm3)
count,1488.0,1488.0,1488.0,1488.0,1488.0,1488.0,1488.0,1488.0,1488.0,1488.0
mean,364.798387,135.013441,0.134409,261.08871,19.556452,449.798387,0.076613,31.699597,2.52621,121.887769
std,274.689455,227.427649,3.664945,203.990183,70.928643,160.02677,1.355272,22.433114,9.000746,41.230504
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,100.0,0.0,400.0,0.0,19.0,0.0,123.0
50%,550.0,0.0,0.0,200.0,0.0,500.0,0.0,30.0,0.0,136.0
75%,580.0,460.0,0.0,300.0,0.0,500.0,0.0,39.0,0.0,142.0
max,920.0,900.0,100.0,1400.0,500.0,800.0,34.0,172.0,62.0,224.0


### Observations
1. We can see that within the 25% quartile, most of the variables have zero values, apart from *Boiler_2_Heat, CHP_Total_Energy*  and their respective volumes.
2. The standard deviations of *CHP_Electricity_Day(kWh), CHP_Electricity_Night(kWh)* and *Boiler_2_Heat(kWh)* depicts a similar behaviour, on the other hand, the mean values present a significant variation. 


## Exploratory Data Analysis

In [109]:
# Finding missing values in the data set
df.isnull().sum()

Datetime                      0
CHP_Electricity_Day(kWh)      0
CHP_Electricity_Night(kWh)    0
Boiler_1_Heat(kWh)            0
Boiler_2_Heat(kWh)            0
Boiler_3_Heat(kWh)            0
CHP_Total_Energy(kWh)         0
Boiler_1_Volume(Sm3)          0
Boiler_2_Volume(Sm3)          0
Boiler_3_Volume(Sm3)          0
CHP_2G_Volume(Sm3)            0
dtype: int64

In [150]:
#Finding rows containing duplicate data
duplicate_rows_df = df[df.duplicated()]
print(f'There are {0} duplicated rows in the dataset'.format(duplicate_rows_df))

There are 0 duplicated rows in the dataset


The data set does not contain either missing  nor duplicated data.

### Data Distribution Visualization - Part 1

In [207]:
# Plotting a histogram matrix for all the variables
fig = ff.create_scatterplotmatrix(df.drop(columns= 'Datetime'), diag='histogram', height=3000, width=2600)
fig.show()

### Observations
1. CHP Electricity generated during the day has 2 significant peaks: 0 kWh at q frequency of 500. The 600 kWh of electricity generated shows a frequency of over 700. While CHP Electricity generated during the night displays a peak at 0 kWh twice the count of the CHP used for the day.

2. In Boiler 1 all data is distributed at the mark of 0 Energy Total (MWh) 
Boiler 2 depicts a right-skewed distribution, a significant portion of the data is within the range of 0 to 0.4 MWh over 350 points. Reaching almost a frequency of 500. Also, the frequency decreased abruptly from 0.4 to 1.4 MWh having less than 100 counts. Boiler 3 shows a similar response as Boiler 1 but presents a slight difference, it delivers less than 100 counts at 0.15 MWh, and within 0.2-0.3 MWh.

3. Boiler 1 gas corrected volume is grouped at  $ 0\ Sm^{3}$ which is congruent with the Heat Energy plot. Boiler 2 gas volume has its peak around $20-30\ Sm^{3}$. Boiler 3 has an expected gas volume of about 0 and a few data within 20 and 50 $Sm^{3}$. The CHP gas volume is more likely to fall roughly 100 up to 160 $Sm^{3}$. And in comparison, the Boilers' volume is notably larger.

#### Note 
$Sm^{3}$ means the volume of Gas that occupies one cubic meter $(1\ m^{3})$ at Standard Conditions

### Data Distribution Visualization - Part 2

In [149]:
# Boxplots to get an idea of the distribution/outliers
fig = px.box(df.iloc[:,1:-1])
fig.show()

In [155]:
# Finding the relationship between the variables
fig = px.imshow(df.corr(), text_auto=True, aspect="auto", color_continuous_scale=px.colors.sequential.Blues, height=800, width=1200)
fig.show()

### Observations

1. Columns *CHP_Electricity_Day(kWh)* and *CHP_Electricity_Night(kWh)* are a bit misleading since they are representing half a day performance. The readings with zero value in column Day means the CHP is generating electricity at night and the same the other way around.

2. Considerable amount of outliers in Boiler_2_Heat(kWh), Boiler_3_Heat(kWh) and their volumes.

Graph 2: 
- A clear postivie linear relationship between boilers heat and volume.
- CHP total energy relationship with volumen is a linear as well. 


3. CHP Electricity during day has a larger linear relationship with the heat energy generated than at night.  

**Questions:**
Is the boiler usage due to any seasonal effect?
Do the CHP and boilers are used at the same intervals of time? How is it described the perfromance of the CHP regarding electricity and heat?

## Feature Engineering


To visualise the data by time steps values are going to be group by mean per day intervals.

In [193]:
df = df.set_index('Datetime').groupby(pd.Grouper(freq='D')).mean()
df= df.dropna().reset_index()

KeyError: ignored

In [212]:
CHP_Electricity_df = df.iloc[:,0:3]
fig = px.line(df, x='Datetime', y=CHP_Electricity_df.columns)
fig.update_layout(
    title='CHP Electricity Day/Night',
    xaxis_title="Date time",
    yaxis_title="Electricity kWh",
    legend_title="Variables"
    )
fig.show()

In [214]:
fig = px.line(df, x='Datetime', y='CHP_Total_Energy(kWh)')
fig.update_layout(
    title='CHP Total Heat Energy',
    xaxis_title="Date time",
    yaxis_title="Heat Energy kWh"
    )
fig.show()

- The plots above show a constant generation from 10th April to 10th August 2020 about the 500 kWh mark in Heat Energy. Additionally CHP electricity during Day performs about 400kWh and during the night roughly 150 kWh.
Subsequently, CHP electricity and heat indicator faced a fall, dropping down the 100kWh mark on Oct 10th. 
- Max peak was reached Oct 14th es remains stable till Oct 31st. 
- Another dropping point is detected on 10th Nov and continues rising from there.


In [216]:
Boilers_df = df.iloc[:,3:6]
fig = px.line(df, x="Datetime", y=Boilers_df.columns)
fig.update_layout(
    title='Boilers Comparative',
    xaxis_title="Date time",
    yaxis_title="Heat Energy kWh",
    legend_title="Variables"
    )
fig.show()

- In contrast, Boiler usage's highest points are the same dates in which CHP Electricity and Heat Energy experience a declined generation. It demostrates a inverse relationship.
- Furthermore, in the period with constant CHP Electricity and Heat generation, all boilers usage is minimal, and only Boiler_2 consumption is active at a minor level a cause may be that it compensates for the reduction of CHP electricity at the night.

In [231]:
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df['CHP_Total_Energy(kWh)'].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, df['Boiler_2_Heat(kWh)'], random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

x_range = np.linspace(X.min(), X.max(), 100)
y_range = model.predict(x_range.reshape(-1, 1))


fig = go.Figure([
    go.Scatter(x=X_train.squeeze(), y=y_train, name='train', mode='markers'),
    go.Scatter(x=X_test.squeeze(), y=y_test, name='test', mode='markers'),
    go.Scatter(x=x_range, y=y_range, name='prediction')
])
fig.show()

- Missing values/Nan values (mean) Porque?
- Linea del tiempo de la electricidad (graficar x la fecha, y electricidad) Linea del tiempo (meses con mas gasto)
- Matriz de correlacion
- Checar relaciones
- cambiar unidades (consistente)
- Energia y calor (Formula)
- $\frac{a}{b}$ usar latex en graficas y en explicaciones (labels y titulos)
- Lineal, Logistica (Encontrar Correlacion)
- Media, mediana blox plot que dicen
