# Correlation analysis of the apartments with themselves

<div class="alert alert-success" role="alert">
    <strong>Summary:</strong><br> This notebook calculates correlations between apartments in order to form groups with similar consumption patterns. The first section computes the correlation of each apartment with itself over two periods, starting from a 10-minute sampling and increasing the timestep until the correlation exceeds 0.5. In parallel, we test a second hypothesis: that the correlation is higher for the mean value of the three apartments. As before, we increase the timestep as long as the threshold is not reached.
</div>

The `os` module is used to list the entries of the data folder. This notebook explains the functions it uses; for more detail, see the [library's documentation](https://docs.python.org/3/library/os.html).  
The `pandas` module provides dataframes and data-analysis tools; see the [library's documentation](https://pandas.pydata.org/docs/).  
The `numpy` module handles arrays and numerical calculations; see the [library's documentation](https://numpy.org/doc/stable/).

In [1]:
import os
import numpy as np
import pandas as pd

The required Domestic Hot Water (DHW) files have names ending in "-IECS", a suffix that identifies DHW data.  
The `os` module is used to build the list of DHW files for all apartments.

In [2]:
folder = r"../Data/"
files = os.listdir(folder)
# Keep only the files whose name contains '-IECS'
list_IECS = [file for file in files if '-IECS' in file]   
list_IECS.sort()

Using the `pandas` module, we read the CSV files listed in the previous step. Then we use the `resample` method of `pandas` to average the data over 10-minute intervals.

In [3]:
data = {}  # Maps each file name (without its extension) to its time series
for file in list_IECS:
    df = pd.read_csv(folder + file)  # Read the CSV file
    ts = df.set_index('0')['Value']  # DataFrame -> time series indexed by column '0'
    ts.index = pd.to_datetime(ts.index, unit='s')  # Epoch seconds -> datetime index
    ts = ts.resample("10Min").mean()  # Average over 10-minute bins
    data[file[:-4]] = ts  # Strip the last four characters (the extension) from the key
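
To illustrate the conversion and resampling steps in isolation, here is a minimal sketch on made-up readings. The column names `'0'` and `'Value'` follow the cell above; the timestamps and flow values are hypothetical and only serve as an example.

```python
import pandas as pd

# Hypothetical raw readings: epoch seconds in column '0', flow in column 'Value'
raw = pd.DataFrame({
    "0": [1606824600, 1606824900, 1606825200, 1606825500],
    "Value": [0.0, 0.002, 0.004, 0.006],
})

ts = raw.set_index("0")["Value"]               # DataFrame -> time series
ts.index = pd.to_datetime(ts.index, unit="s")  # epoch seconds -> datetime index
ts_10min = ts.resample("10min").mean()         # average over 10-minute bins
print(ts_10min)
```

The four readings are five minutes apart, so they collapse into two 10-minute averages.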

We then build a dataframe from the series read above.  
As a preprocessing step, we remove the rows containing NaN values, which would otherwise add noise to the study.

In [4]:
df = pd.DataFrame(data)
#print(df)
df = df[~df.isnull().any(axis=1)]  # Remove every row that contains a NaN value
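
As a small check of what this filter does, the sketch below applies it to a toy dataframe (hypothetical values) and compares it with `dropna`, which should give the same result with its default settings.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one NaN in each of two different rows
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

kept_mask = toy[~toy.isnull().any(axis=1)]  # boolean-mask filter, as in the cell above
kept_dropna = toy.dropna()                  # drops rows containing any NaN

print(kept_mask.equals(kept_dropna))
```

Only the first row survives, because a row is removed as soon as one of its columns is NaN.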

We use the `drop` method of `pandas` to remove the columns that are not needed for our analysis.  
Then the `mean` method of `pandas` is used to compute the average value across the three remaining apartments.

In [5]:
df=df.drop(columns=['25-IECS', '26-IECS', '27-IECS', '35-IECS', '64-IECS', '65-IECS', '67-IECS', '9-IECS' ])
df['mean'] = df.mean(axis=1) 
#print(df)
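
The sketch below reproduces these two steps on a toy dataframe. The column names mimic the apartment labels, but the values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "34-IECS": [0.0, 6.0],
    "36-IECS": [0.0, 0.0],
    "63-IECS": [3.0, 0.0],
    "9-IECS": [1.0, 1.0],
})

toy = toy.drop(columns=["9-IECS"])  # remove a column we do not analyse
toy["mean"] = toy.mean(axis=1)      # row-wise mean of the remaining columns
print(toy)
```

Note that the row-wise mean is computed before the `mean` column exists. Running the assignment a second time would include the `mean` column itself in the average, so the cell is best executed once.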

We convert the water consumption from l/s to l/h.  
We then normalize the data using the `mean` and `std` methods of `pandas`: `mean` computes the mean of each column of the dataframe and `std` computes the standard deviation of each column. The normalized values are given by $\frac{x-\hat x}{\sigma}$, where $\hat x$ is the mean of the column and $\sigma$ its standard deviation.

In [6]:
df=df*3600
#print(df)
#Normalization of the data
df_mean = df.mean()
df_std=df.std()
dfn=(df-df_mean)/df_std #Called dfn: the global dataframe with all the data, n for normalized
#print(dfn)

                     34-IECS  36-IECS  63-IECS   mean
0                                                    
2020-12-01 12:10:00     0.00     0.00     0.00   0.00
2020-12-01 12:20:00     0.00     0.00     0.00   0.00
2020-12-01 12:30:00     0.00     0.00     0.00   0.00
2020-12-01 12:40:00     5.94     0.00     0.00   1.98
2020-12-01 12:50:00     0.00     0.00     5.76   1.92
...                      ...      ...      ...    ...
2021-06-08 14:50:00     0.00    59.40     0.00  19.80
2021-06-08 15:00:00     0.00    11.88     0.00   3.96
2021-06-08 15:10:00     0.00     0.00     0.00   0.00
2021-06-08 15:20:00     0.00    17.64     0.00   5.88
2021-06-08 15:30:00     0.00     0.00     0.00   0.00

[26089 rows x 4 columns]
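
As a sanity check of the normalization, the following sketch applies the same formula to a toy dataframe (hypothetical values) and verifies that each column ends up with mean 0 and standard deviation 1.

```python
import pandas as pd

toy = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [1.0, 1.0, 4.0]})

# Same formula as above: subtract the column mean, divide by the column std
toy_n = (toy - toy.mean()) / toy.std()

print(toy_n.mean())  # close to 0 for every column
print(toy_n.std())   # close to 1 for every column

# The Pearson correlation is unchanged by this rescaling
print(toy["a"].corr(toy["b"]), toy_n["a"].corr(toy_n["b"]))
```

Because the Pearson coefficient is invariant under a shift and a positive rescaling of each column, the normalization changes the printed values but not the correlations computed later.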


## Comparing the month of December

<div class="alert alert-info">
<strong>Details :</strong><br>
Working out the correlation of each apartment with itself and of their average value. If the correlation is not sufficient, we increase the resampling timestep. For this analysis we need to split our data into two dataframes to compare them. We want to compare the data within the month of December, so we divide the month into two equal datasets.
</div>

We start by dividing our data into two dataframes. The first analysis covers December 2020: we compare the two halves of the month to see whether the ten-minute samples of the first half of December correlate with the ten-minute samples of the second half.  
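
When splitting by date, it helps to remember that label-based slicing on a datetime index includes both endpoints. The sketch below shows this on a short toy series (hypothetical values); it uses `.loc` to make the label-based slicing explicit.

```python
import pandas as pd

idx = pd.date_range("2020-12-15 23:40", periods=5, freq="10min")
toy = pd.Series(range(5), index=idx)

# Both bounds are included, so the second slice must start one step later
first = toy.loc["2020-12-15 23:40:00":"2020-12-16 00:00:00"]
second = toy.loc["2020-12-16 00:10:00":"2020-12-16 00:20:00"]

print(first)
print(second)
```

Starting the second slice at `00:10:00` rather than `00:00:00` avoids counting the midnight sample in both halves.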

### Data preprocessing: 


Both dataframes must have the same number of rows and columns for this analysis; otherwise the correlation cannot be computed. The dataframes may have different numbers of rows even when the periods have the same length, because we removed NaN values from the dataset and they could be anywhere in it.  
To overcome this issue, we preprocess the data: we print both dataframes, compare their numbers of rows and columns, and use the `drop` method to remove rows until both dataframes have the same number of rows. Without this step, the two dataframes cannot be compared and an error is raised.
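
As an alternative to dropping rows by hand, the row counts can be equalized by truncating both dataframes to the shorter length. This is only a sketch under the assumption that removing rows from the end of the longer dataframe is acceptable; the notebook chooses the rows to drop manually, sometimes at the start of the period.

```python
import pandas as pd

# Two toy dataframes with different numbers of rows (hypothetical values)
a = pd.DataFrame({"x": range(5)},
                 index=pd.date_range("2020-12-01", periods=5, freq="D"))
b = pd.DataFrame({"x": range(4)},
                 index=pd.date_range("2020-12-16", periods=4, freq="D"))

n = min(len(a), len(b))               # common number of rows
a_trim, b_trim = a.iloc[:n], b.iloc[:n]

print(a_trim.shape, b_trim.shape)
```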

The dataframes are numbered 1 and 2 in the order in which they are created: dataframe 1 is the first dataframe of our analysis.  
We use the `corrwith` method of `pandas` to compute the correlation between two dataframes with the Pearson correlation coefficient, $r=\frac{\sum(x-\hat x)(y-\hat y)}{\sqrt{\sum(x-\hat x)^2\sum(y-\hat y)^2}}$, where $\hat x$ and $\hat y$ are the means of $x$ and $y$.  
The name `df_12_corr` means that the result contains correlation values (indicated by the suffix "corr") comparing dataframes 1 and 2.
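
To make the formula concrete, the sketch below computes the Pearson coefficient by hand and compares it with `corrwith` on two toy dataframes. The helper name `pearson` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson coefficient written exactly as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

left = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})
right = pd.DataFrame({"a": [2.0, 4.0, 6.0, 8.0], "b": [3.0, 4.0, 1.0, 2.0]})

by_pandas = left.corrwith(right)  # column-wise correlation
by_hand = pd.Series({c: pearson(left[c], right[c]) for c in left.columns})

print(by_pandas)
print(by_hand)
```

Column `a` is perfectly correlated and column `b` perfectly anti-correlated, so both methods should agree on values close to 1 and -1.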

In [7]:
df1=dfn['2020-12-01 00:00:00' : '2020-12-16 00:00:00']
df2=dfn['2020-12-16 00:10:00' : '2020-12-30 16:00:00']
print(df1)
print(df2)

                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2020-12-01 12:10:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:20:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:30:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:40:00  0.099491 -0.236966 -0.323985 -0.238126
2020-12-01 12:50:00 -0.224086 -0.236966  0.023660 -0.243958
...                       ...       ...       ...       ...
2020-12-15 23:10:00  0.099491 -0.236966 -0.323985 -0.238126
2020-12-15 23:30:00  0.746644 -0.236966 -0.323985  0.146802
2020-12-15 23:40:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-15 23:50:00  0.423068 -0.236966 -0.323985 -0.045662
2020-12-16 00:00:00 -0.224086 -0.236966 -0.323985 -0.430591

[2004 rows x 4 columns]
                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2020-12-16 00:10:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-16 00:2

Once both dataframes have the same number of rows, we can compute the correlation.  
To do so we use the `corrwith` method, which computes the column-wise correlation between two dataframes.  
However, `corrwith` matches rows by their index labels, so both dataframes must share the same row labels; otherwise the correlation returns NaN. We therefore use the `set_axis` method of `pandas` to give dataframe 2 the row labels of dataframe 1.
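
The effect of the row labels can be seen on a small example. In this sketch (toy data, hypothetical values), the two dataframes cover different dates, so `corrwith` finds no common labels until the index of the second one is replaced.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({"a": [1.0, 2.0, 4.0]},
                     index=pd.date_range("2020-12-01", periods=3, freq="D"))
second = pd.DataFrame({"a": [2.0, 3.0, 7.0]},
                      index=pd.date_range("2020-12-16", periods=3, freq="D"))

# No shared index labels: nothing to correlate, the result is NaN
unaligned = first.corrwith(second)

# Relabel the rows of `second` with the index of `first` before correlating
aligned = first.corrwith(second.set_axis(first.index, axis="index"))

print(unaligned)
print(aligned)
```

`set_axis` only relabels the rows; it does not reorder or modify the values, so the rows are compared position by position.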

In [8]:
df_12_corr =df1.corrwith(df2.set_axis(df1.index, axis='index', copy=False)) # Correlation of df1 with df2, after giving df2 the index of df1 so that the rows align
print(df_12_corr)

34-IECS    0.009986
36-IECS   -0.011536
63-IECS   -0.050793
mean      -0.064636
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to one hour and, as part of the preprocessing, drop the rows needed to get the same number of rows in both dataframes.

In [9]:
df3=df1.resample('H').mean()
df4=df2.resample('H').mean()

print(df3.shape)
print(df4.shape)

df4=df4.drop('2020-12-30 13:00:00')
df4=df4.drop('2020-12-30 14:00:00')
df4=df4.drop('2020-12-30 15:00:00')
df4=df4.drop('2020-12-30 16:00:00')

print(df4.shape)

df_34_corr =df3.corrwith(df4.set_axis(df3.index, axis='index', copy=False)) # Correlation of df3 with df4, after giving df4 the index of df3 so that the rows align
print(df_34_corr)

(349, 4)
(353, 4)
(349, 4)
34-IECS   -0.016868
36-IECS   -0.093109
63-IECS   -0.108098
mean      -0.194524
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to one day. If the correlation holds at this timestep, the apartment would need a tank able to cover one day of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [10]:
df5=df1.resample('D').mean()
df6=df2.resample('D').mean()

print(df5.shape)
print(df6.shape)

df5=df5.drop('2020-12-16')

df_56_corr =df5.corrwith(df6.set_axis(df5.index, axis='index', copy=False)) # Correlation of df5 with df6, after giving df6 the index of df5 so that the rows align
print(df_56_corr)

(16, 4)
(15, 4)
34-IECS    0.037100
36-IECS    0.461429
63-IECS   -0.120559
mean       0.108289
dtype: float64


As the correlation values are insufficient for all apartments but one, we increase the sampling timestep to five days. If the correlation holds at this timestep, the apartment would need a tank able to cover five days of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [11]:
df7=df1.resample('5D').mean()
df8=df2.resample('5D').mean()

print(df7.shape)
print(df8.shape)

df7=df7.drop('2020-12-16')

df_78_corr =df7.corrwith(df8.set_axis(df7.index, axis='index', copy=False)) # Correlation of df7 with df8, after giving df8 the index of df7 so that the rows align
print(df_78_corr)

(4, 4)
(3, 4)
34-IECS   -0.699922
36-IECS   -0.411685
63-IECS    0.609686
mean      -0.614913
dtype: float64


As we can see, only apartment 63 reaches a correlation above the 0.5 threshold, and this value is computed from only three five-day averages per dataframe.  
Contrary to what we expected, the correlation of the mean value remains low and does not increase with the timestep.  
We therefore shift our analysis to compare one month of data with another: December with January.

## Comparing December 2020 with January 2021

<div class="alert alert-info">
<strong>Details :</strong><br>
Working out the correlation of each apartment with itself and of their average value. If the correlation is not sufficient, we increase the resampling timestep. For this analysis we need to split our data into two dataframes to compare them.
</div>

We start by dividing our data into two dataframes. The second analysis covers December 2020 and January 2021: we compare the two months to see whether the ten-minute samples of December correlate with the ten-minute samples of January.  

### Data preprocessing: 


Both dataframes must have the same number of rows and columns for this analysis; otherwise the correlation cannot be computed. The dataframes may have different numbers of rows even when the periods have the same length, because we removed NaN values from the dataset and they could be anywhere in it.  
To overcome this issue, we preprocess the data: we print both dataframes, compare their numbers of rows and columns, and remove rows until both dataframes have the same number of rows. Without this step, the two dataframes cannot be compared and an error is raised.

In [12]:
df9=dfn['2020-12-01 00:00:00' : '2020-12-31 00:00:00']
df10=dfn['2021-01-01 00:00:00' : '2021-01-30 13:00:00']
print(df9)
print(df10)

                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2020-12-01 12:10:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:20:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:30:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:40:00  0.099491 -0.236966 -0.323985 -0.238126
2020-12-01 12:50:00 -0.224086 -0.236966  0.023660 -0.243958
...                       ...       ...       ...       ...
2020-12-30 23:20:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-30 23:30:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-30 23:40:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-30 23:50:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-31 00:00:00 -0.224086 -0.236966 -0.323985 -0.430591

[4053 rows x 4 columns]
                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2021-01-01 00:00:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-01-01 00:1

Once both dataframes have the same number of rows, we can compute the correlation.  
To do so we use the `corrwith` method, which computes the column-wise correlation between two dataframes.  
However, `corrwith` matches rows by their index labels, so both dataframes must share the same row labels; otherwise the correlation returns NaN. We therefore use the `set_axis` method of `pandas` to give the second dataframe the row labels of the first.

In [13]:
df_910_corr =df9.corrwith(df10.set_axis(df9.index, axis='index', copy=False)) # Correlation of df9 with df10, after giving df10 the index of df9 so that the rows align
print(df_910_corr)

34-IECS   -0.017422
36-IECS   -0.021889
63-IECS   -0.025402
mean      -0.058032
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to one hour.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [14]:
df11=df9.resample('H').mean()
df12=df10.resample('H').mean()

print(df11.shape)
print(df12.shape)

df12=df12.drop('2021-01-30 13:00:00')


df_1112_corr =df11.corrwith(df12.set_axis(df11.index, axis='index', copy=False)) # Correlation of df11 with df12, after giving df12 the index of df11 so that the rows align
print(df_1112_corr)

(709, 4)
(710, 4)
34-IECS   -0.057279
36-IECS   -0.065575
63-IECS   -0.108426
mean      -0.178663
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to five days. If the correlation holds at this timestep, the apartment would need a tank able to cover five days of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [15]:
df13=df9.resample('5D').mean()
df14=df10.resample('5D').mean()

print(df13.shape)
print(df14.shape)

df13=df13.drop('2020-12-31')
print(df13.shape)

df_1314_corr =df13.corrwith(df14.set_axis(df13.index, axis='index', copy=False)) # Correlation of df13 with df14, after giving df14 the index of df13 so that the rows align
print(df_1314_corr)

(7, 4)
(6, 4)
(6, 4)
34-IECS   -0.606976
36-IECS    0.635867
63-IECS   -0.331582
mean      -0.086563
dtype: float64
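
Coarser timesteps leave fewer averaged values to correlate. The sketch below counts how many points remain at each timestep for a hypothetical, gap-free 30-day series sampled every 10 minutes; the actual counts in this notebook differ slightly because of the removed NaN rows and the period boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical gap-free series: 30 days sampled every 10 minutes
idx = pd.date_range("2020-12-01", periods=30 * 144, freq="10min")
toy = pd.Series(np.zeros(len(idx)), index=idx)

# Number of averaged values left after resampling at each timestep
points = {rule: len(toy.resample(rule).mean())
          for rule in ["10min", "1h", "1D", "5D"]}
print(points)
```

With only a handful of points, as in the five-day case, a single unusual period can change the coefficient considerably, which is worth keeping in mind when reading the values above.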


As we can see, only apartment 36 reaches a correlation above the 0.5 threshold.  
Contrary to what we expected, the correlation of the mean value remains low and does not increase with the timestep.  
We therefore shift our analysis to compare three months of data with another three months.

## Comparing two dataframes of three months


<div class="alert alert-info">
<strong>Details :</strong><br>
We repeat the experiment with two dataframes of three months of data each. We start by working out the correlation of each apartment with itself and of their average value. If the correlation is not sufficient, we increase the resampling timestep. For this analysis we need to split our data into two equivalent dataframes to compare them.
</div>

We start by dividing our data into two dataframes. The third analysis covers two periods of three months: we compare them to see whether the ten-minute samples of the first three months correlate with the ten-minute samples of the following three months.  

### Data preprocessing:  

Both dataframes must have the same number of rows and columns for this analysis; otherwise the correlation cannot be computed. The dataframes may have different numbers of rows even when the periods have the same length, because we removed NaN values from the dataset and they could be anywhere in it.  
To overcome this issue, we preprocess the data: we print both dataframes, compare their numbers of rows and columns, and remove rows until both dataframes have the same number of rows. Without this step, the two dataframes cannot be compared and an error is raised.

In [16]:
df15=dfn['2020-12-01 00:00:00' : '2021-02-28 00:00:00']
df16=dfn['2021-03-01 00:00:00' : '2021-05-28 04:00:00']
print(df15)
print(df16)


df_1516_corr =df15.corrwith(df16.set_axis(df15.index, axis='index', copy=False)) # Correlation of df15 with df16, after giving df16 the index of df15 so that the rows align
print(df_1516_corr)

                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2020-12-01 12:10:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:20:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:30:00 -0.224086 -0.236966 -0.323985 -0.430591
2020-12-01 12:40:00  0.099491 -0.236966 -0.323985 -0.238126
2020-12-01 12:50:00 -0.224086 -0.236966  0.023660 -0.243958
...                       ...       ...       ...       ...
2021-02-27 23:20:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-02-27 23:30:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-02-27 23:40:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-02-27 23:50:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-02-28 00:00:00 -0.224086 -0.236966 -0.323985 -0.430591

[12187 rows x 4 columns]
                      34-IECS   36-IECS   63-IECS      mean
0                                                          
2021-03-01 00:00:00 -0.224086 -0.236966 -0.323985 -0.430591
2021-03-01 00:

As the correlation values are insufficient, we increase the sampling timestep to one hour. If the correlation holds at this timestep, the apartment would need a tank able to cover one hour of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [17]:
df17=df15.resample('H').mean()
df18=df16.resample('H').mean()
print(df17.shape)
print(df18.shape)

df17=df17.drop('2020-12-01 12:00:00')
df17=df17.drop('2020-12-01 13:00:00')
df17=df17.drop('2020-12-01 14:00:00')
df17=df17.drop('2020-12-01 15:00:00')
df17=df17.drop('2020-12-01 16:00:00')
df17=df17.drop('2020-12-01 17:00:00')
df17=df17.drop('2020-12-01 18:00:00')
df17=df17.drop('2020-12-01 19:00:00')

print(df17.shape)

df_1718_corr =df17.corrwith(df18.set_axis(df17.index, axis='index', copy=False)) # Correlation of df17 with df18, after giving df18 the index of df17 so that the rows align
print(df_1718_corr)

(2125, 4)
(2117, 4)
(2117, 4)
34-IECS    0.007814
36-IECS    0.007959
63-IECS    0.047884
mean       0.055771
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to one day. If the correlation holds at this timestep, the apartment would need a tank able to cover one day of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [18]:
df19=df15.resample('D').mean()
df20=df16.resample('D').mean()
print(df19.shape)
print(df20.shape)

df19=df19.drop('2020-12-01')

print(df19.shape)

df_1920_corr =df19.corrwith(df20.set_axis(df19.index, axis='index', copy=False)) # Correlation of df19 with df20, after giving df20 the index of df19 so that the rows align
print(df_1920_corr)

(90, 4)
(89, 4)
(89, 4)
34-IECS   -0.082955
36-IECS    0.092144
63-IECS   -0.153276
mean       0.084778
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to ten days. If the correlation holds at this timestep, the apartment would need a tank able to cover ten days of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [19]:
df21=df15.resample('10D').mean()
df22=df16.resample('10D').mean()
print(df21.shape)
print(df22.shape)

df_2122_corr =df21.corrwith(df22.set_axis(df21.index, axis='index', copy=False)) # Correlation of df21 with df22, after giving df22 the index of df21 so that the rows align
print(df_2122_corr)

(9, 4)
(9, 4)
34-IECS    0.160359
36-IECS   -0.168538
63-IECS   -0.584197
mean      -0.530821
dtype: float64


As the correlation values are insufficient, we increase the sampling timestep to one month. If the correlation holds at this timestep, the apartment would need a tank able to cover one month of hot water supply.  
We repeat the preprocessing as before.  
We repeat the analysis with the `corrwith` and `set_axis` methods, for the same purpose as before.

In [20]:
df23=df15.resample('M').mean()
df24=df16.resample('M').mean()
print(df23.shape)
print(df24.shape)



df_2324_corr =df23.corrwith(df24.set_axis(df23.index, axis='index', copy=False)) # Correlation of df23 with df24, after giving df24 the index of df23 so that the rows align
print(df_2324_corr)

(3, 4)
(3, 4)
34-IECS    0.557498
36-IECS    0.036186
63-IECS   -0.874419
mean      -0.992668
dtype: float64
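
The procedure repeated throughout this notebook (resample, equalize the row counts, correlate, and move to a coarser timestep if the threshold is not reached) can be sketched as a loop. The function name `first_rule_above_threshold`, the truncation to a common length and the synthetic data are assumptions made for illustration; the notebook itself drops rows manually for each timestep.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.5  # correlation threshold used in this notebook

def first_rule_above_threshold(first, second, rules, threshold=THRESHOLD):
    """Return the first timestep whose column-wise correlations all exceed the threshold."""
    corr = None
    for rule in rules:
        a = first.resample(rule).mean()
        b = second.resample(rule).mean()
        n = min(len(a), len(b))        # assumption: trim the end of the longer frame
        a, b = a.iloc[:n], b.iloc[:n]
        corr = a.corrwith(b.set_axis(a.index, axis="index"))
        if (corr > threshold).all():
            return rule, corr
    return None, corr                  # no timestep reached the threshold

# Synthetic data: a daily pattern plus a fast alternating term whose sign
# is flipped in the second half, so the two halves disagree at 10 minutes
idx = pd.date_range("2020-12-01", periods=4 * 144, freq="10min")
minutes = np.asarray(idx.hour * 60 + idx.minute)
daily = np.sin(2 * np.pi * minutes / 1440)
flicker = 5.0 * (-1.0) ** np.arange(len(idx))
flicker[len(idx) // 2:] *= -1.0
toy = pd.DataFrame({"a": daily + flicker}, index=idx)

first, second = toy.iloc[:288], toy.iloc[288:]
rule, corr = first_rule_above_threshold(first, second, ["10min", "1h", "1D"])
print(rule)
print(corr)
```

On this synthetic series the alternating term cancels out within each hour, so the loop is expected to reject the 10-minute timestep and stop at the hourly one. The real data in this notebook behave very differently, as the outputs above show.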


The correlation is sufficient for one apartment, apartment 34, but not for the others.  
The mean value does no better than the individual apartments.