
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Demand Forecasting on Power Consumption**
This notebook focuses on analyzing and preprocessing household electricity consumption data recorded at regular time intervals. It covers data understanding, cleaning, and preparation steps required for time series analysis, including datatype conversion, datetime construction, and basic exploratory insights.
The goal is to transform raw power consumption data into a clean, analysis-ready format for further modeling and forecasting.

## **Overview**

The dataset consists of **2,075,259 observations with 9 features**, representing time-indexed household electrical power consumption data. Each row corresponds to a single timestamped measurement, making the dataset suitable for time series analysis and forecasting tasks.

**Feature Description**

| Column Name            | Data Type | Description |
|------------------------|----------|-------------|
| Date                   | object   | Date of measurement (day/month/year format) |
| Time                   | object   | Time of measurement (hour:minute:second) |
| Global_active_power    | object   | Total active power consumed by the household (kilowatts) |
| Global_reactive_power  | object   | Total reactive power consumed (kilowatts) |
| Voltage                | object   | Voltage level of the household electrical system (volts) |
| Global_intensity       | object   | Household current intensity (amperes) |
| Sub_metering_1         | object   | Energy consumed by kitchen appliances (watt-hours) |
| Sub_metering_2         | object   | Energy consumed by laundry appliances (watt-hours) |
| Sub_metering_3         | float64  | Energy consumed by climate control systems (watt-hours) |


## **Importing Required Libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## **Basic Analysis on the Data**

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Google Colab/Mini Project/household_power_consumption.csv")



In [None]:
df.head()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


In [None]:
df.shape

(2075259, 9)

In [None]:
df.describe()

Unnamed: 0,Sub_metering_3
count,2049280.0
mean,6.458447
std,8.437154
min,0.0
25%,0.0
50%,1.0
75%,17.0
max,31.0


## **Ensuring Correct Data Types for Analysis**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


In [None]:
numeric_cols = [
    "Global_active_power",
    "Global_reactive_power",
    "Voltage",
    "Global_intensity",
    "Sub_metering_1",
    "Sub_metering_2"
]

# Convert each measurement column to float; entries that cannot be parsed become NaN
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
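
The effect of `errors="coerce"` can be seen on a small toy column. This is a minimal sketch, assuming the raw file marks missing readings with a non-numeric placeholder such as `"?"`; the frame `raw` and its values are made up for illustration.

```python
import pandas as pd

# Toy column stored as strings, with one non-numeric placeholder (assumed "?")
raw = pd.DataFrame({"Voltage": ["234.84", "?", "233.29"]})

# errors="coerce" replaces unparseable entries with NaN instead of raising
raw["Voltage"] = pd.to_numeric(raw["Voltage"], errors="coerce")

n_missing = int(raw["Voltage"].isna().sum())
print(raw["Voltage"].dtype, n_missing)
```

Coerced entries show up later as missing values, so the null counts should be checked after this conversion rather than before it.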


In [None]:
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

In [None]:
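# Date is already datetime64, so astype(str) yields "YYYY-MM-DD";
# an explicit format avoids per-row format inference on ~2 million rows
df["Datetime"] = pd.to_datetime(
    df["Date"].astype(str) + " " + df["Time"],
    format="%Y-%m-%d %H:%M:%S"
)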
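
The two datetime cells above can also be expressed as a single parse of the raw strings. The sketch below uses a made-up two-row frame `toy` to show that the day-first format is honoured; it is an illustration under that assumption, not the notebook's own code.

```python
import pandas as pd

# Toy rows in the same day/month/year and hour:minute:second layout
toy = pd.DataFrame({
    "Date": ["16/12/2006", "01/02/2007"],
    "Time": ["17:24:00", "00:05:00"],
})

# Concatenate the raw strings and parse them once with an explicit format
toy["Datetime"] = pd.to_datetime(
    toy["Date"] + " " + toy["Time"],
    format="%d/%m/%Y %H:%M:%S",
)

print(toy["Datetime"].iloc[1])  # 1 February 2007, not 2 January
```

Stating the format explicitly matters for dates such as `01/02/2007`, which would be ambiguous if the parser had to guess between day-first and month-first.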

In [None]:
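# Use the combined timestamp as the index; the later steps rely on a DatetimeIndex
df = df.drop(columns=['Date', 'Time']).set_index('Datetime')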

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2049280 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
dtypes: float64(7)
memory usage: 125.1 MB


In [None]:
df.dtypes

Unnamed: 0,0
Global_active_power,float64
Global_reactive_power,float64
Voltage,float64
Global_intensity,float64
Sub_metering_1,float64
Sub_metering_2,float64
Sub_metering_3,float64


In [None]:
df.head()

Unnamed: 0_level_0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


In [None]:
df.isnull().sum()

Unnamed: 0,0
Global_active_power,0
Global_reactive_power,0
Voltage,0
Global_intensity,0
Sub_metering_1,0
Sub_metering_2,0
Sub_metering_3,0


In [None]:
df.duplicated().sum()

np.int64(142582)

In [None]:
df[df.isnull().any(axis=1)]

Unnamed: 0_level_0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1


In [None]:
# Drop rows with missing measurements; keep the Datetime index for the time-based steps below
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

Unnamed: 0,0
Global_active_power,0
Global_reactive_power,0
Voltage,0
Global_intensity,0
Sub_metering_1,0
Sub_metering_2,0
Sub_metering_3,0


In [None]:
df.shape

(2049280, 7)

## **Time Gap Analysis in the Time Series Data**

As an initial step, we check the time range of the dataset using the **min() and max()** values of the Datetime index to understand the total duration covered by the data.



In [None]:
df.index.min(), df.index.max()

(Timestamp('2006-12-16 17:24:00'), Timestamp('2010-11-26 21:02:00'))
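
As a quick sanity check on this range, the number of rows a complete 1-minute grid would contain can be computed directly from the two timestamps. The sketch below assumes the 2,049,280 observed rows reported by `df.shape`; the variable names are illustrative.

```python
import pandas as pd

start = pd.Timestamp("2006-12-16 17:24:00")
end = pd.Timestamp("2010-11-26 21:02:00")

# Number of 1-minute steps between the endpoints, plus one for the first row
expected_rows = int((end - start) / pd.Timedelta("1min")) + 1

observed_rows = 2_049_280  # rows left after dropping missing measurements
missing_rows = expected_rows - observed_rows

print(expected_rows, missing_rows)
```

The difference is the number of timestamps that a regular 1-minute grid would have to add back.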

Since this is time series data, analyzing temporal consistency is essential because most time series models assume observations occur at regular intervals. To identify irregularities, we apply **diff()** to the Datetime index, which computes the time difference between consecutive timestamps.



In [None]:
df.index.diff().value_counts().head()

Unnamed: 0_level_0,count
Datetime,Unnamed: 1_level_1
0 days 00:01:00,2049208
0 days 00:02:00,38
0 days 00:03:00,14
0 days 00:04:00,2
2 days 14:04:00,1


The analysis shows that the dataset is largely consistent: 2,049,208 consecutive observations are exactly 1 minute apart, making it well-suited for time series analysis and forecasting. A small number of short gaps are present: **2-minute gaps (38 occurrences)**, **3-minute gaps (14 occurrences)**, and **4-minute gaps (2 occurrences)**. A difference of k minutes means that k - 1 timestamps are missing between the two rows. These minor gaps are typical in real-world sensor data and can be handled using resampling and interpolation.

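Additionally, **a large gap of 2 days, 14 hours and 4 minutes is observed**, which is critical and likely indicates a data collection interruption. Because `head()` lists only the five most frequent differences, gap lengths that occur only once may not be displayed, so this is not necessarily the only long gap. Long gaps require special handling, as they may impact trend analysis and forecasting performance.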

Overall, the dataset exhibits strong temporal consistency, and appropriate gap-handling techniques will be applied to prepare it for reliable time series modeling.
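
The gap analysis and the resampling idea described above can be sketched on a small toy series. The timestamps, the column name `power`, and the values are invented for illustration; only the pandas calls mirror what the notebook uses.

```python
import pandas as pd

# Toy index with one 2-minute gap and one 15-minute gap
stamps = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:02",
    "2021-01-01 00:04", "2021-01-01 00:05", "2021-01-01 00:20",
])
toy = pd.DataFrame({"power": [1.0, 2.0, 3.0, 5.0, 6.0, 21.0]}, index=stamps)

# Count how often each difference between consecutive timestamps occurs
gaps = toy.index.to_series().diff().value_counts()

# Resample to a fixed 1-minute grid; timestamps without data become NaN rows
regular = toy.resample("1min").mean()
n_missing = int(regular["power"].isna().sum())

# Fill the NaN rows linearly with respect to time
filled = regular.interpolate(method="time")

print(gaps)
print(len(regular), n_missing)
```

On this toy data the 2-minute gap contributes one missing row and the 15-minute gap contributes fourteen, which is the same k - 1 relationship that applies to the real dataset.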

**Handling Gaps in the Timestamps**

In [None]:
df.dtypes

Unnamed: 0,0
Date,datetime64[ns]
Time,object
Global_active_power,float64
Global_reactive_power,float64
Voltage,float64
Global_intensity,float64
Sub_metering_1,float64
Sub_metering_2,float64
Sub_metering_3,float64


In [None]:
df = df.resample('1min').mean()

In [None]:
df.shape

(2075259, 7)

In [None]:
df.index.diff().value_counts().head()

Unnamed: 0_level_0,count
Datetime,Unnamed: 1_level_1
0 days 00:01:00,2075258


In [None]:
df.isnull().sum()

Unnamed: 0,0
Global_active_power,25979
Global_reactive_power,25979
Voltage,25979
Global_intensity,25979
Sub_metering_1,25979
Sub_metering_2,25979
Sub_metering_3,25979


In [None]:
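# After resampling, consecutive index values are always 1 minute apart, so
# measure the time between consecutive rows that hold actual observations.
observed = df.index.to_series()[df["Global_active_power"].notna()]
time_diff = observed.diff().reindex(df.index)
large_gaps = time_diff[time_diff > pd.Timedelta('10min')]
large_gaps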


In [None]:
df = df.interpolate(method='time')
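
Time-based interpolation fills every missing row, including those inside multi-day gaps, where a straight line between the two endpoints may not resemble real consumption. One possible alternative, sketched below on toy data, fills only short runs of missing values and leaves long runs as NaN. The threshold `max_gap` and all other names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01 00:00", periods=12, freq="1min")
s = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, np.nan,
     9.0, 10.0, np.nan, 12.0],
    index=idx,
)

max_gap = 2  # hypothetical threshold: longest run (in rows) that may be filled

# Label each run of consecutive equal values in the missing-mask
missing = s.isna()
run_id = (missing != missing.shift()).cumsum()

# Length of the missing run that each row belongs to (0 for observed rows)
run_len = missing.groupby(run_id).transform("sum")
long_gap = missing & (run_len > max_gap)

# Interpolate everything, then undo the fill inside long gaps
filled = s.interpolate(method="time")
filled[long_gap] = np.nan

print(filled)
```

Rows that stay NaN can then be handled separately, for example by excluding those periods from training or by a seasonal fill, depending on what the forecasting model tolerates.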

In [None]:
df.isnull().sum()


Unnamed: 0,0
Global_active_power,0
Global_reactive_power,0
Voltage,0
Global_intensity,0
Sub_metering_1,0
Sub_metering_2,0
Sub_metering_3,0


In [None]:
df['large_gap_flag'] = (time_diff > pd.Timedelta('10min')).astype(int)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Freq: min
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Global_active_power    float64
 1   Global_reactive_power  float64
 2   Voltage                float64
 3   Global_intensity       float64
 4   Sub_metering_1         float64
 5   Sub_metering_2         float64
 6   Sub_metering_3         float64
 7   large_gap_flag         int64  
dtypes: float64(7), int64(1)
memory usage: 207.0 MB
