In [None]:
%reload_ext autoreload
%autoreload 2

import missingno as msno
import numpy as np
import os
import pandas as pd
import plotly.express as px
import seaborn as sns
import sys

from loguru import logger
from matplotlib import pyplot as plt
from pathlib import Path

sys.path.append(str(Path.cwd().parent))

from settings.params import *
from src.utils import configure_logger

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

In [None]:
data = pd.read_csv(RAW_DATA)
TARGET_NAME = MODEL_PARAMS['TARGET_NAME']

# Exploratory Data Analysis


## Missing values


In [None]:
msno.bar(data)

In [None]:
data.isna().sum()

In [None]:
data['ComplianceStatus'].value_counts()

After reading the descriptions of each of the columns present in the dataset, we can make the following observations about some notable missing data:

- SecondLargestPropertyUseType & ThirdLargestPropertyUseType contain lots of missing values. When they are absent we can interpret that as the corresponding building not having a second or third use type.
- Outlier is a column that indicates whether the building's measurements correspond to a high or low outlier (in short, true outliers). We interpret a missing value as meaning that the building is considered normal.
- YearsEnergyStarCertified is a list of years for which the building has been certified EnergyStar. NaN values mean that the building has never had the certification.
- Comments is a column that should contain comments by a building owner or an agent to provide context about the building's energy use. No comments were made in the dataset, so we can already drop that column.
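
The interpretations above could be turned into a cleaning step along the following lines. This is only a minimal sketch on a toy frame: the helper name `fill_structural_missing` and the fill values `"None"` and `"Normal"` are assumptions made for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd


def fill_structural_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Encode 'missing by design' values explicitly (hypothetical helper)."""
    out = df.copy()
    # Absent second/third use type: the building has no such use.
    for col in ["SecondLargestPropertyUseType", "ThirdLargestPropertyUseType"]:
        if col in out.columns:
            out[col] = out[col].fillna("None")  # assumed placeholder label
    # Absent outlier flag: the building is assumed to be normal.
    if "Outlier" in out.columns:
        out["Outlier"] = out["Outlier"].fillna("Normal")  # assumed label
    # Comments is entirely empty, so it carries no information.
    return out.drop(columns=["Comments"], errors="ignore")


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "SecondLargestPropertyUseType": ["Parking", np.nan],
    "ThirdLargestPropertyUseType": [np.nan, "Office"],
    "Outlier": [np.nan, "High outlier"],
    "Comments": [np.nan, np.nan],
})
cleaned = fill_structural_missing(toy)
print(cleaned)
```

Using an explicit label rather than leaving NaN keeps "no second use" distinguishable from "value unknown" once the columns are encoded.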


## Target Analysis


In [None]:
data[TARGET_NAME].describe()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))

sns.histplot(data[TARGET_NAME], color='r', kde=True, ax=axes[0])
axes[0].set_title('Distribution of energy consumption')

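# Zero readings are excluded here because log(0) is undefined.
# The values are already log-transformed, so the axis itself stays linear.
positive_target = data.loc[data[TARGET_NAME] > 0, TARGET_NAME]
sns.histplot(np.log(positive_target), color='b', kde=True, ax=axes[1])
axes[1].set_title('Distribution of energy consumption in $log$ scale')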

- Some of the buildings have zero energy use, which does not make sense in this context. Checking the compliance status of those rows shows that almost all of them are marked as non-compliant, missing or Default Data. That means we cannot take that data into account.
- The distribution of the target variable is highly skewed. Taking its logarithm yields a distribution that is much closer to normal, which the model to be built can exploit more easily.
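
To make the skewness claim checkable, one could compare the sample skewness before and after the transform. The sketch below uses synthetic log-normal data as a stand-in for the target (an assumption, since the real values are not reproduced here) and `np.log1p`, which is one possible choice of transform that tolerates zeros.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed stand-in for the target column.
energy = pd.Series(rng.lognormal(mean=14.0, sigma=1.2, size=5_000))

# log1p(x) = log(1 + x) is defined at x = 0, unlike log(x).
log_energy = np.log1p(energy)

skew_raw = energy.skew()
skew_log = log_energy.skew()
print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transform is consistent with, but does not prove, approximate normality.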


In [None]:
# Compliance of the rows for which SiteEnergyUse is zero
data[data[TARGET_NAME] == 0]['ComplianceStatus']

In [None]:
fig = px.box(data, y="SiteEnergyUse(kBtu)")
fig.show()

The box plot illustrates the distribution of SiteEnergyUse(kBtu). The majority of the data points cluster near the lower end of the energy use spectrum, indicating relatively low energy consumption for most entries. However, there are several prominent outliers that deviate significantly from this cluster:

- There are a few data points that exhibit exceptionally high energy use, with values reaching up to 800 million kBtu.
- These high outliers are considerably distant from the main cluster, indicating that certain sites have much higher energy consumption compared to the rest.
- The presence of these outliers suggests variability in the dataset, which could be due to differences in site size, operational hours, or inefficiencies.
- These outliers need to be further investigated to understand the underlying causes and to determine if they should be included in the analysis or addressed separately. That analysis will be done when cleaning the data and we will use the Outlier column available in the dataset to determine which ones are true outliers and which ones are incoherent values.
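
One way to carry out that check is to flag extreme values with a simple rule and then cross-reference them with the Outlier column. The sketch below applies the 1.5 × IQR rule to a toy frame; the rule, the threshold and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: ten ordinary buildings plus two extreme ones,
# only one of which is flagged in the Outlier column.
values = [0.8e6 + 0.1e6 * i for i in range(10)] + [4.0e8, 8.0e8]
flags = [np.nan] * 11 + ["High outlier"]
toy = pd.DataFrame({"SiteEnergyUse(kBtu)": values, "Outlier": flags})

# Upper fence of the 1.5 * IQR rule.
q1, q3 = toy["SiteEnergyUse(kBtu)"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

extreme = toy[toy["SiteEnergyUse(kBtu)"] > upper_fence]
flagged = extreme[extreme["Outlier"].notna()]    # acknowledged outliers
unflagged = extreme[extreme["Outlier"].isna()]   # candidates for a closer look

print(f"{len(flagged)} flagged, {len(unflagged)} unflagged extreme rows")
```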


In [None]:
# df = data.copy()
# z_scores = np.abs((df["SiteEnergyUse(kBtu)"] - df["SiteEnergyUse(kBtu)"].mean()) / df["SiteEnergyUse(kBtu)"].std())
# outliers = df[z_scores >= 3]
# len(outliers)
# false_outliers = outliers[outliers['Outlier'].isna()]
# len(false_outliers)

## Data Analysis


### Numerical Features


In [None]:
numerical_features = data.select_dtypes(include="number").columns
numerical_data = data[numerical_features]
logger.info(f"Numerical features:\n {sorted(numerical_features)}\n")

In [None]:
numerical_data.head()

#### Data distributions


In [None]:
num_cols = 3
num_rows = (len(numerical_features) + num_cols - 1) // num_cols  # calculate the number of rows needed
fig, axes = plt.subplots(num_rows, num_cols, figsize=(num_cols*6, num_rows*5))  # adjust figsize as needed

for i, col in enumerate(numerical_features):
    row = i // num_cols
    col_pos = i % num_cols
    sns.histplot(numerical_data[col], kde=False, ax=axes[row, col_pos])
    axes[row, col_pos].set_title(col)

plt.tight_layout()
plt.show()

#### Scatter Plots


In order to have meaningful visualizations, we need to limit the data to buildings whose energy use is between the 5th and 95th percentiles.


In [None]:
lower_percentile = data[TARGET_NAME].quantile(0.05)
upper_percentile = data[TARGET_NAME].quantile(0.95)

In [None]:
limited_numerical_data = data[(data[TARGET_NAME] >= lower_percentile) & (data[TARGET_NAME] <= upper_percentile)]

In [None]:
# Number of plots per row
plots_per_row = 3

# Calculate number of rows needed
num_rows = len(numerical_features) // plots_per_row + (len(numerical_features) % plots_per_row > 0)

# Create subplots
fig, axs = plt.subplots(num_rows, plots_per_row, figsize=(6 * plots_per_row, 5 * num_rows))

# Flatten axs for easy iteration
axs = axs.flatten()

# Plot each numerical column against the target variable
for i, column in enumerate(numerical_features):
    sns.scatterplot(x=limited_numerical_data[column], y=limited_numerical_data[TARGET_NAME], ax=axs[i])
    axs[i].set_title(f'{column} vs {TARGET_NAME}')

fig.subplots_adjust(hspace=0.3, wspace=0.2)

#### Correlation Matrix


In [None]:
correlation_matrix = numerical_data.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Features')
plt.show()

#### Observations

- All values related to energy are highly correlated with the target column SiteEnergyUse(kBtu). All those columns correspond to data obtained with the energy consumption statement, so they can cause data leakage when used as features for the model.

- The column NumberOfBuildings has some zeros. For this column the value should be at least 1.

- The column NumberOfFloors also has zeros, whereas its minimum value should be 1.

- Some properties can already be classified as being unhelpful to the future model since they do not provide any relevant or quantifiable information about the building in itself: OSEBuildingID, DataYear, ZipCode, CouncilDistrictCode

- Instead of using YearBuilt directly, we are going to transform it into an Age column to simplify its relationship to the target variable and avoid data leakage.

- Since the site energy use is known, there is no need for absolute values of the Electricity, Steam or Gas consumption. We are going to replace the corresponding columns with the ratio of each energy type to the total energy use.

- In the same way, instead of using absolute values for the GFA of the largest use types, we can use their ratios to the total GFA.
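
The transformations listed above could be sketched as follows. The helper name `add_ratio_features`, the derived column names, and the source column names `Electricity(kBtu)`, `SteamUse(kBtu)`, `NaturalGas(kBtu)`, `PropertyGFATotal` and `LargestPropertyUseTypeGFA` are assumptions here, since they do not appear elsewhere in this notebook. Computing Age as `DataYear - YearBuilt` is also just one plausible definition.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame, target: str = "SiteEnergyUse(kBtu)") -> pd.DataFrame:
    """Derive Age and ratio features (hypothetical helper, assumed column names)."""
    out = df.copy()
    # Age of the building at the time of the survey.
    out["Age"] = out["DataYear"] - out["YearBuilt"]
    # Share of each energy source; a zero total becomes NaN instead of dividing by zero.
    total_energy = out[target].replace(0, np.nan)
    for col in ["Electricity(kBtu)", "SteamUse(kBtu)", "NaturalGas(kBtu)"]:
        out[col.replace("(kBtu)", "") + "Ratio"] = out[col] / total_energy
    # Share of the largest use type in the total floor area.
    total_gfa = out["PropertyGFATotal"].replace(0, np.nan)
    out["LargestUseGFARatio"] = out["LargestPropertyUseTypeGFA"] / total_gfa
    return out


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "DataYear": [2016, 2016],
    "YearBuilt": [1990, 2010],
    "SiteEnergyUse(kBtu)": [1000.0, 0.0],
    "Electricity(kBtu)": [600.0, 0.0],
    "SteamUse(kBtu)": [100.0, 0.0],
    "NaturalGas(kBtu)": [300.0, 0.0],
    "PropertyGFATotal": [50000, 20000],
    "LargestPropertyUseTypeGFA": [40000, 20000],
})
features = add_ratio_features(toy)
print(features[["Age", "ElectricityRatio", "LargestUseGFARatio"]])
```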


### Categorical Features


In [None]:
categorical_features = data.select_dtypes(include=["object", "bool"]).columns
categorical_data = data[categorical_features].copy()  # copy so that adding a column does not write to a view of data
categorical_data[TARGET_NAME] = data[TARGET_NAME]
logger.info(f"Categorical features:\n {sorted(categorical_features)}\n")

In [None]:
categorical_data.loc[:, categorical_features].describe()

In [None]:
categorical_data.loc[:, categorical_features].info()

#### Box plots


Before creating the box plots, some features can already be filtered out, since they obviously provide no information relevant to energy use and would make the visualizations harder to analyze.


In [None]:
categorical_features = list(filter(lambda x: x not in ["TaxParcelIdentificationNumber", "PropertyName", "Address", "City", "State", "ListOfAllPropertyUseTypes"], categorical_features))

In [None]:
categorical_features

In the same manner as before, we are going to limit the data in order to have meaningful visualizations.


In [None]:
len(set(data['LargestPropertyUseType'].unique()) - set(data['PrimaryPropertyType'].unique()))

In [None]:
data['PrimaryPropertyType'].unique()

In [None]:
data['LargestPropertyUseType'].unique()

In [None]:
lower_percentile = data[TARGET_NAME].quantile(0.05)
upper_percentile = data[TARGET_NAME].quantile(0.95)

limited_categorical_data = categorical_data[(categorical_data[TARGET_NAME] >= lower_percentile) & (categorical_data[TARGET_NAME] <= upper_percentile)]

# Plot the target variable against each categorical column
for column in categorical_features:
    fig = px.box(data_frame=limited_categorical_data, x=column, y=TARGET_NAME, title=f'{TARGET_NAME} by {column}')
    fig.show()

#### Observations

- PrimaryPropertyType column contains data redundant to what's in LargestPropertyUseType. We will work only with the second one which contains a lot more classes. That could help the model make better choices.
- The columns BuildingType, PrimaryPropertyType and the ones concerning the property use types seem to have an effect on the target column.
- Some columns can already be classified as not providing valuable information for our problem: TaxParcelIdentificationNumber, PropertyName, Address, City and State
- DefaultData, Outlier and ComplianceStatus do not really provide information relevant to predicting the energy usage of a site. They will mostly help us to better understand the data at hand and clean it accordingly.
- ListOfAllPropertyUseTypes is mostly redundant with LargestPropertyUseType, SecondLargestPropertyUseType and ThirdLargestPropertyUseType. Very few buildings have a third use, so we assume that buildings with four or more uses are even rarer. Since this column adds no information beyond those columns, we will not use it for our model.
- YearsEnergyStarCertified cannot be used since it will cause data leakage. That information may not be known at the time of prediction.
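
These observations could be collected into a single selection step, sketched below on a toy frame. The helper name `select_model_columns` and the label `"Compliant"` for valid rows are assumptions. Whether non-compliant rows should be dropped at this stage is a design choice that the cleaning step would have to confirm.

```python
import pandas as pd

# Column groups taken from the observations above.
NON_INFORMATIVE = ["TaxParcelIdentificationNumber", "PropertyName", "Address", "City", "State"]
REDUNDANT_OR_LEAKY = ["PrimaryPropertyType", "ListOfAllPropertyUseTypes", "YearsEnergyStarCertified"]
CLEANING_ONLY = ["DefaultData", "Outlier", "ComplianceStatus"]


def select_model_columns(df: pd.DataFrame, compliant_label: str = "Compliant") -> pd.DataFrame:
    """Keep compliant rows, then drop columns not meant for the model (hypothetical helper)."""
    out = df
    # Use the cleaning-only column before discarding it.
    if "ComplianceStatus" in out.columns:
        out = out[out["ComplianceStatus"] == compliant_label]  # assumed label
    to_drop = NON_INFORMATIVE + REDUNDANT_OR_LEAKY + CLEANING_ONLY
    return out.drop(columns=to_drop, errors="ignore").reset_index(drop=True)


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "PropertyName": ["A", "B", "C"],
    "City": ["Seattle", "Seattle", "Seattle"],
    "LargestPropertyUseType": ["Office", "Hotel", "Office"],
    "PrimaryPropertyType": ["Office", "Hotel", "Office"],
    "ComplianceStatus": ["Compliant", "Non-Compliant", "Compliant"],
})
model_data = select_model_columns(toy)
print(model_data)
```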
