
In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/energy-consumption-prediction/Energy_consumption.csv


## Introduction

Understanding and predicting energy consumption is crucial for optimizing resource utilization, reducing costs, and minimizing environmental impact. In this analysis, we examine a dataset of hourly energy consumption records together with environmental and operational factors such as temperature, humidity, occupancy, and usage of HVAC and lighting systems. Our objective is to perform an Exploratory Data Analysis (EDA) to uncover insights into energy consumption patterns and subsequently develop predictive models to forecast future energy usage.

The dataset provides a comprehensive view of energy consumption dynamics over time, allowing us to explore how different factors influence energy demand. By examining the relationships between energy consumption and environmental variables like temperature and humidity, as well as operational factors such as occupancy and usage of HVAC and lighting systems, we aim to identify key drivers of energy consumption and understand their impact.

Through EDA, we will visualize the distribution of variables, investigate correlations between features and energy consumption, analyze temporal trends, and explore patterns across different categorical variables such as day of the week and holidays. These insights will guide the development of predictive models capable of forecasting energy consumption accurately.

By leveraging regression models, time series forecasting techniques, and ensemble methods, we aim to build robust models that can effectively predict future energy usage based on historical data and contextual factors. Evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) will be used to assess the performance of the models and ensure their reliability.

Ultimately, the findings from this analysis can inform energy management decisions, helping stakeholders optimize resource allocation, improve energy efficiency, and make progress toward sustainability goals.


In [2]:
import pandas as pd
df = pd.read_csv('/kaggle/input/energy-consumption-prediction/Energy_consumption.csv')

df.head()

|   | Timestamp | Temperature | Humidity | SquareFootage | Occupancy | HVACUsage | LightingUsage | RenewableEnergy | DayOfWeek | Holiday | EnergyConsumption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 25.139433 | 43.431581 | 1565.693999 | 5 | On | Off | 2.774699 | Monday | No | 75.364373 |
| 1 | 2022-01-01 01:00:00 | 27.731651 | 54.225919 | 1411.064918 | 1 | On | On | 21.831384 | Saturday | No | 83.401855 |
| 2 | 2022-01-01 02:00:00 | 28.704277 | 58.907658 | 1755.715009 | 2 | Off | Off | 6.764672 | Sunday | No | 78.270888 |
| 3 | 2022-01-01 03:00:00 | 20.080469 | 50.371637 | 1452.316318 | 1 | Off | On | 8.623447 | Wednesday | No | 56.51985 |
| 4 | 2022-01-01 04:00:00 | 23.097359 | 51.401421 | 1094.130359 | 9 | On | Off | 3.071969 | Friday | No | 70.811732 |


In [3]:
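# Note: without parentheses, this evaluates to the bound method itself, so the
# output is the method's repr (the DataFrame contents), not the dtype summary
# that calling df.info() would print.
df.info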

<bound method DataFrame.info of                Timestamp  Temperature   Humidity  SquareFootage  Occupancy  \
0    2022-01-01 00:00:00    25.139433  43.431581    1565.693999          5   
1    2022-01-01 01:00:00    27.731651  54.225919    1411.064918          1   
2    2022-01-01 02:00:00    28.704277  58.907658    1755.715009          2   
3    2022-01-01 03:00:00    20.080469  50.371637    1452.316318          1   
4    2022-01-01 04:00:00    23.097359  51.401421    1094.130359          9   
..                   ...          ...        ...            ...        ...   
995  2022-02-11 11:00:00    28.619382  48.850160    1080.087000          5   
996  2022-02-11 12:00:00    23.836647  47.256435    1705.235156          4   
997  2022-02-11 13:00:00    23.005340  48.720501    1320.285281          6   
998  2022-02-11 14:00:00    25.138365  31.306459    1309.079719          3   
999  2022-02-11 15:00:00    23.051165  42.615421    1018.140606          6   

    HVACUsage LightingUsage  Re

In [4]:
df.describe()

| Statistic | Temperature | Humidity | SquareFootage | Occupancy | RenewableEnergy | EnergyConsumption |
|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 24.982026 | 45.395412 | 1500.052488 | 4.581 | 15.132813 | 77.055873 |
| std | 2.83685 | 8.518905 | 288.418873 | 2.865598 | 8.745917 | 8.144112 |
| min | 20.007565 | 30.015975 | 1000.512661 | 0.0 | 0.006642 | 53.263278 |
| 25% | 22.64507 | 38.297722 | 1247.108548 | 2.0 | 7.628385 | 71.54469 |
| 50% | 24.751637 | 45.972116 | 1507.967426 | 5.0 | 15.072296 | 76.943696 |
| 75% | 27.418174 | 52.420066 | 1740.340165 | 7.0 | 22.884064 | 82.921742 |
| max | 29.998671 | 59.969085 | 1999.982252 | 9.0 | 29.965327 | 99.20112 |


In [5]:
df.isnull().sum()

Timestamp            0
Temperature          0
Humidity             0
SquareFootage        0
Occupancy            0
HVACUsage            0
LightingUsage        0
RenewableEnergy      0
DayOfWeek            0
Holiday              0
EnergyConsumption    0
dtype: int64
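
The introduction mentions investigating correlations between features and energy consumption, and exploring patterns across categorical variables such as day of the week. A minimal sketch of both steps might look like the following. The `sample` frame and its values are hypothetical and only reuse the dataset's column names; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's column names (not real data)
sample = pd.DataFrame({
    "Temperature": [20.0, 22.0, 24.0, 26.0],
    "DayOfWeek": ["Monday", "Monday", "Sunday", "Sunday"],
    "EnergyConsumption": [60.0, 64.0, 68.0, 72.0],
})

# Pearson correlation of the numeric feature with the target
numeric_cols = ["Temperature", "EnergyConsumption"]
corr_with_target = sample[numeric_cols].corr()["EnergyConsumption"]

# Mean consumption for each day of the week
mean_by_day = sample.groupby("DayOfWeek")["EnergyConsumption"].mean()

print(corr_with_target)
print(mean_by_day)
```

Because the toy values are perfectly linear, the correlation here is trivially high; the real columns would be expected to show weaker relationships.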

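We first have to convert the categorical columns (HVACUsage, LightingUsage, DayOfWeek, Holiday) into numerical ones. Note that `pd.get_dummies(df)` encodes every string-valued column, so the Timestamp column, which was read as plain text, is also expanded into one indicator column per distinct timestamp.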

In [6]:
df = pd.get_dummies(df)
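
As a hedged alternative to encoding the whole frame, one could parse the Timestamp first and pass only the genuinely categorical columns to `pd.get_dummies`. The sketch below uses a hypothetical three-row `sample` frame with a subset of the dataset's column names; the derived `Hour` feature is an assumption for illustration and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row frame reusing some of the dataset's column names
sample = pd.DataFrame({
    "Timestamp": ["2022-01-01 00:00:00", "2022-01-01 01:00:00", "2022-01-01 02:00:00"],
    "Temperature": [25.1, 27.7, 28.7],
    "HVACUsage": ["On", "On", "Off"],
    "Holiday": ["No", "No", "No"],
    "EnergyConsumption": [75.4, 83.4, 78.3],
})

# Encoding everything turns each distinct timestamp string into its own column
encoded_all = pd.get_dummies(sample)

# Alternative: parse the timestamp, derive a numeric feature from it,
# and one-hot encode only the explicitly listed categorical columns
prepared = sample.copy()
prepared["Timestamp"] = pd.to_datetime(prepared["Timestamp"])
prepared["Hour"] = prepared["Timestamp"].dt.hour  # illustrative feature
prepared = prepared.drop(columns=["Timestamp"])
encoded_selected = pd.get_dummies(prepared, columns=["HVACUsage", "Holiday"])

print(encoded_all.shape, encoded_selected.shape)
```

Applying this to the real data would change the feature matrix, so the error reported later in the notebook would not carry over unchanged.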

Now we can fit a LinearRegression model to predict EnergyConsumption, holding out 20% of the rows as a test set and reporting the mean squared error on it.

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Separate the features from the target column
X, y = df.drop(['EnergyConsumption'], axis=1), df['EnergyConsumption']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Mean squared error on the held-out test set
print(mean_squared_error(y_test, y_pred))

26.547112625876924
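
The MSE above is expressed in squared units of the target. Taking its square root gives an RMSE of roughly 5.15 in the same units as EnergyConsumption, whose standard deviation over the full dataset is about 8.14 according to the summary statistics. The introduction also names MAE and RMSE; a minimal sketch of how the three metrics relate is shown below. The helper `regression_errors` and the toy values are hypothetical, and on the real data one would pass `y_test` and `y_pred` instead.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = float(np.mean(np.abs(residuals)))   # mean absolute error
    mse = float(np.mean(residuals ** 2))      # mean squared error
    rmse = float(np.sqrt(mse))                # root of the MSE, in target units
    return mae, mse, rmse

# Hypothetical values, not taken from the dataset
toy_true = [70.0, 80.0, 90.0]
toy_pred = [72.0, 77.0, 90.0]
mae, mse, rmse = regression_errors(toy_true, toy_pred)
print(mae, mse, rmse)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a few large misses dominate the error.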
