# Household Electricity Consumption Forecasting

## Part 1: Basic Data Understanding of Dataset
Title: Basic Data Understanding and Data Cleaning

Purpose: This phase focuses on initial data inspection: identifying data types, handling missing values, and obtaining descriptive insights. It also covers derived features, built from the existing columns, that can capture deeper relationships in energy consumption.

### Load Dataset

In [1]:
# Import Libraries 
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load dataset
df = pd.read_csv('household_power_consumption.csv')

### Q1. Display Initial Records

In [3]:
df.head(5)

| | Datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | Date | Total_sub_metering | Unmetered_power | Power_to_Voltage_ratio | Reactive_to_Active_ratio | Energy_efficiency_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 | 2006-12-16 | 18.0 | 52.266667 | 0.017953 | 0.099146 | 0.900854 |
| 1 | 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 | 2006-12-16 | 17.0 | 72.333333 | 0.022942 | 0.081343 | 0.918657 |
| 2 | 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 | 2006-12-16 | 19.0 | 70.566667 | 0.023036 | 0.092668 | 0.907332 |
| 3 | 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 | 2006-12-16 | 18.0 | 71.8 | 0.023051 | 0.09317 | 0.90683 |
| 4 | 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 | 2006-12-16 | 18.0 | 43.1 | 0.015555 | 0.144026 | 0.855974 |


### Observation:
- Datetime/Date: The exact timestamp and the calendar day on which the power measurement was recorded.
- Global_active_power: The total useful electricity drawn by the household at that moment, in kilowatts. This is the power you pay for.
- Global_reactive_power: The non-useful power needed to run loads such as motors. It is not directly consumed but affects power quality.
- Voltage: The electrical potential (pressure) supplied to the house, typically around 230-240 volts.
- Global_intensity: The total electrical current (amperes) drawn by all active appliances simultaneously.
- Sub_metering_1, 2, 3: The energy used (in watt-hours) by three specific circuits or groups of appliances within the home.
- Total_sub_metering: The combined energy used by the three monitored zones (Sub-metering 1, 2, and 3).
- Unmetered_power: The estimated electricity use of everything else in the house that isn't sub-metered (e.g., lights, TVs, computers). In the sample rows it equals Global_active_power × 1000 / 60 minus Total_sub_metering (for example, 4.216 × 1000 / 60 - 18 ≈ 52.27), so despite its name it is an energy in watt-hours.
- Power_to_Voltage_ratio: Global_active_power divided by Voltage (for example, 4.216 / 234.84 ≈ 0.017953). It is a proxy for the current drawn and therefore for the overall load on the system.
- Reactive_to_Active_ratio: Global_reactive_power divided by Global_active_power, showing how much non-useful power there is relative to useful power. Lower values indicate better electrical efficiency.
- Energy_efficiency_score: A single number summarizing the efficiency of the household's power usage. In the sample rows it equals 1 minus Reactive_to_Active_ratio (for example, 1 - 0.099146 = 0.900854), so higher is better.
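
The derived columns described above can be reproduced from the base measurements. The following is a minimal sketch, not the original feature-engineering code: the formulas are inferred from the first two rows of the `df.head()` output, and flooring the efficiency score at 0 is an assumption based on the column's reported minimum of 0.

```python
import pandas as pd

# First two rows of the base columns, copied from the df.head() output.
sample = pd.DataFrame({
    "Global_active_power": [4.216, 5.360],    # kW
    "Global_reactive_power": [0.418, 0.436],
    "Voltage": [234.84, 233.63],              # V
    "Sub_metering_1": [0.0, 0.0],             # Wh
    "Sub_metering_2": [1.0, 1.0],             # Wh
    "Sub_metering_3": [17.0, 16.0],           # Wh
})

sub_cols = ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]
sample["Total_sub_metering"] = sample[sub_cols].sum(axis=1)

# One minute at P kW corresponds to P * 1000 / 60 Wh of energy.
energy_wh = sample["Global_active_power"] * 1000 / 60
sample["Unmetered_power"] = energy_wh - sample["Total_sub_metering"]

sample["Power_to_Voltage_ratio"] = sample["Global_active_power"] / sample["Voltage"]
sample["Reactive_to_Active_ratio"] = (
    sample["Global_reactive_power"] / sample["Global_active_power"]
)

# Assumption: the score is floored at 0 when the ratio exceeds 1.
sample["Energy_efficiency_score"] = (1 - sample["Reactive_to_Active_ratio"]).clip(lower=0)

print(sample.round(6).T)
```

The printed values can be compared against the corresponding columns in the first two rows of the table.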

### Q2. Check Basic Info

In [4]:
df.info()

print("Rows: ",df.shape[0])
print("Columns: ",df.shape[1])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2049280 entries, 0 to 2049279
Data columns (total 14 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   Datetime                  object 
 1   Global_active_power       float64
 2   Global_reactive_power     float64
 3   Voltage                   float64
 4   Global_intensity          float64
 5   Sub_metering_1            float64
 6   Sub_metering_2            float64
 7   Sub_metering_3            float64
 8   Date                      object 
 9   Total_sub_metering        float64
 10  Unmetered_power           float64
 11  Power_to_Voltage_ratio    float64
 12  Reactive_to_Active_ratio  float64
 13  Energy_efficiency_score   float64
dtypes: float64(12), object(2)
memory usage: 218.9+ MB
Rows:  2049280
Columns:  14


### Observation:
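- The dataset has 2,049,280 rows and 14 columns: 12 `float64` columns and 2 `object` columns (Datetime and Date), using about 219 MB of memory.
- Datetime and Date are stored as strings, so they need to be converted to a datetime type before any time-based analysis.
- `df.info()` does not show relationships between columns. The expectation that Active Power closely follows Global Intensity (nearly linear) and is strongly related to Sub-metering 3, while Voltage is largely uncorrelated and likely contributes little to forecasting, needs to be confirmed with a correlation analysis.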
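
The data types and memory footprint reported by `df.info()` can be improved by parsing the timestamp column and, optionally, downcasting floats. The sketch below uses a two-row stand-in frame (`sample`, a hypothetical name) rather than the full dataset. Downcasting to `float32` is an illustrative choice that trades some precision for memory, not something the notebook does.

```python
import pandas as pd

# Two-row stand-in for the full frame; values copied from df.head().
sample = pd.DataFrame({
    "Datetime": ["2006-12-16 17:24:00", "2006-12-16 17:25:00"],
    "Global_active_power": [4.216, 5.360],
    "Voltage": [234.84, 233.63],
})

before = sample.memory_usage(deep=True).sum()

# Parse the string timestamps into a proper datetime64 column.
sample["Datetime"] = pd.to_datetime(sample["Datetime"])

# Downcast every float64 column to float32 (halves their memory use).
float_cols = sample.select_dtypes(include="float64").columns
sample = sample.astype({col: "float32" for col in float_cols})

after = sample.memory_usage(deep=True).sum()

print(sample.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

On the full dataset, whether `float32` precision is acceptable depends on the downstream model, so it is worth checking before applying it to all twelve numeric columns.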

### Q3. Summary Statistics for Numerical Columns

In [5]:
df.describe()

| | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 | Total_sub_metering | Unmetered_power | Power_to_Voltage_ratio | Reactive_to_Active_ratio | Energy_efficiency_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 |
| mean | 1.091615 | 0.1237145 | 240.8399 | 4.627759 | 1.121923 | 1.29852 | 6.458447 | 8.878891 | 9.314693 | 0.004557198 | 0.2016994 | 0.7984415 |
| std | 1.057294 | 0.112722 | 3.239987 | 4.444396 | 6.153031 | 5.822026 | 8.437154 | 12.863 | 9.585916 | 0.004458831 | 0.2246231 | 0.2240961 |
| min | 0.076 | 0.0 | 223.2 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | -2.4 | 0.0003213531 | 0.0 | 0.0 |
| 25% | 0.308 | 0.048 | 238.99 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 | 3.8 | 0.001275531 | 0.02336449 | 0.6783217 |
| 50% | 0.602 | 0.1 | 241.01 | 2.6 | 0.0 | 0.0 | 1.0 | 1.0 | 5.5 | 0.002493132 | 0.1152416 | 0.8847584 |
| 75% | 1.528 | 0.194 | 242.89 | 6.4 | 0.0 | 1.0 | 17.0 | 18.0 | 10.36667 | 0.00635324 | 0.3216783 | 0.9766355 |
| max | 11.122 | 1.39 | 254.15 | 48.4 | 88.0 | 80.0 | 31.0 | 134.0 | 124.8333 | 0.04840282 | 1.495495 | 1.0 |


### Observation:
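- The table shows basic statistics (mean, standard deviation, quartiles, min, max) for every numerical column in the dataset.
- Average power consumption is low ($\approx 1.09 \text{ kW}$), and the median is lower still ($\approx 0.60 \text{ kW}$), so typical consumption sits well below the mean.
- The voltage is very stable ($\text{mean} \approx 240.8 \text{ V}$, $\text{std} \approx 3.2 \text{ V}$).
- Sub-metering 3 has the highest average usage among the sub-zones ($\approx 6.46 \text{ Wh}$ per minute). If this is the UCI Individual Household Electric Power Consumption dataset, as the columns and date range suggest, this meter covers an electric water heater and an air conditioner, while the kitchen is on Sub-metering 1.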
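
To make the stability comparison concrete, the coefficient of variation (standard deviation divided by mean) can be computed from the summary statistics. This is a small illustrative sketch using the mean and std values copied from the `df.describe()` output. The `stats` frame and the `cv` column are names introduced here for illustration only.

```python
import pandas as pd

# Mean and std copied from the df.describe() output for three columns.
stats = pd.DataFrame(
    {
        "mean": [1.091615, 240.8399, 6.458447],
        "std": [1.057294, 3.239987, 8.437154],
    },
    index=["Global_active_power", "Voltage", "Sub_metering_3"],
)

# Coefficient of variation: spread relative to the typical level.
stats["cv"] = stats["std"] / stats["mean"]

print(stats.round(3))
```

Voltage varies by only about 1% of its mean, whereas the spread of active power and Sub-metering 3 is comparable to or larger than their means.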

### Q4. Identify Categorical and Numerical Columns

In [6]:
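# Collect column names by dtype (Datetime and Date are strings at this point)
cat_col = df.select_dtypes(include=['object']).columns.tolist()
num_col = df.select_dtypes(exclude=['object']).columns.tolist()

print('Categorical Columns: ', cat_col)
print('Numerical Columns: ', num_col)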

Categorical Columns:  ['Datetime', 'Date']
Numerical Columns:  ['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3', 'Total_sub_metering', 'Unmetered_power', 'Power_to_Voltage_ratio', 'Reactive_to_Active_ratio', 'Energy_efficiency_score']


### Q5. Check for Missing Values

In [7]:
df.isnull().sum()

Datetime                    0
Global_active_power         0
Global_reactive_power       0
Voltage                     0
Global_intensity            0
Sub_metering_1              0
Sub_metering_2              0
Sub_metering_3              0
Date                        0
Total_sub_metering          0
Unmetered_power             0
Power_to_Voltage_ratio      0
Reactive_to_Active_ratio    0
Energy_efficiency_score     0
dtype: int64

### Q6. Check Unique Values in Each Column

In [8]:
df.nunique().sort_values()

Sub_metering_3                   32
Sub_metering_2                   81
Sub_metering_1                   88
Total_sub_metering              135
Global_intensity                221
Global_reactive_power           532
Date                           1433
Voltage                        2837
Global_active_power            4186
Unmetered_power                5386
Energy_efficiency_score      205471
Reactive_to_Active_ratio     206952
Power_to_Voltage_ratio       939303
Datetime                    2049280
dtype: int64

### Q7. Check for Duplicates

In [9]:
df.duplicated().sum()

0

### Q8. Analyze Date/Time Columns

In [10]:
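# Parse the timestamps so min/max compare chronologically rather than as strings
df['Datetime'] = pd.to_datetime(df['Datetime'])
df.set_index('Datetime', inplace=True)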

print("Start Date:", df.index.min())
print("End Date:", df.index.max())

Start Date: 2006-12-16 17:24:00
End Date: 2010-11-26 21:02:00
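
Although `isnull()` reports no missing values, a time series can still have missing timestamps. Assuming one-minute sampling, as the first rows suggest, the row count can be compared with the number of minutes between the start and end dates. The sketch below does this with the figures reported in this notebook, then shows on a three-row toy index (hypothetical values) how the absent timestamps themselves can be listed.

```python
import pandas as pd

# Bounds and row count as reported earlier in this notebook.
start = pd.Timestamp("2006-12-16 17:24:00")
end = pd.Timestamp("2010-11-26 21:02:00")
n_rows = 2_049_280

# Every minute a gap-free series would contain, end points included.
expected = pd.date_range(start, end, freq="min")
n_absent = len(expected) - n_rows
print("Expected minutes:", len(expected))
print("Timestamps absent from the data:", n_absent)

# Toy index with a two-minute gap, to show how to list absent timestamps.
toy_index = pd.to_datetime(
    ["2006-12-16 17:24:00", "2006-12-16 17:25:00", "2006-12-16 17:28:00"]
)
toy_expected = pd.date_range(toy_index.min(), toy_index.max(), freq="min")
missing = toy_expected.difference(toy_index)
print(list(missing))
```

With the reported bounds, the gap-free range has 2,075,259 minutes, so about 25,979 timestamps are absent. This is consistent with rows having been dropped before this file was saved, and it matters for any forecasting step that assumes a regular frequency.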


### Q9. Skewness of Numerical Columns

In [15]:
df.select_dtypes(include=np.number).skew()

Global_active_power         1.786233
Global_reactive_power       1.261914
Voltage                    -0.326665
Global_intensity            1.849100
Sub_metering_1              5.944541
Sub_metering_2              7.090553
Sub_metering_3              0.724688
Total_sub_metering          2.228822
Unmetered_power             2.486911
Power_to_Voltage_ratio      1.833993
Reactive_to_Active_ratio    1.257536
Energy_efficiency_score    -1.241802
dtype: float64

### Observation:
- Active Power and Intensity have a high positive skew ($\approx 1.8$), meaning consumption is usually low but has a long tail of very high usage events.
- Sub-metering 1 and 2 show extreme positive skew ($\approx 5.9$ and $7.1$), indicating that these zones are mostly idle but experience very large spikes when used. Sub-metering 3 is only mildly skewed ($\approx 0.72$).
- Voltage is the closest to symmetrical, with a slight negative skew ($\approx -0.33$), which is consistent with a stable, roughly bell-shaped distribution. Energy_efficiency_score is the only other negatively skewed feature ($\approx -1.24$).
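
A common way to handle strongly right-skewed features like these is a `log1p` transform before modeling. The sketch below illustrates the effect on synthetic, exponentially distributed data (not the household dataset). Whether such a transform helps the actual forecasting model is an assumption that would need to be tested.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a consumption column.
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.exponential(scale=1.0, size=100_000))

# log1p(x) = log(1 + x) is defined at 0, which suits mostly-idle meters.
transformed = np.log1p(synthetic)

skew_before = synthetic.skew()
skew_after = transformed.skew()

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

The transform compresses the long right tail, so the skewness drops while the ordering of the values is preserved.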

### Q10. Kurtosis of Numerical Columns

In [16]:
print(df.select_dtypes(include=np.number).kurtosis())

Global_active_power          4.218685
Global_reactive_power        2.605633
Voltage                      0.724707
Global_intensity             4.601243
Sub_metering_1              35.642993
Sub_metering_2              57.907344
Sub_metering_3              -1.282198
Total_sub_metering           7.341440
Unmetered_power              7.911011
Power_to_Voltage_ratio       4.527916
Reactive_to_Active_ratio     0.918256
Energy_efficiency_score      0.825323
dtype: float64


### Observation:
- pandas reports excess kurtosis, so a normal distribution scores 0. The high values ($\approx 4.2$ and $4.6$) for Active Power and Intensity suggest the data is "spiky," meaning most values cluster tightly near the average, but the rare high-usage events are very far out (heavy tails).
- Sub-metering 1 and 2 have exceptionally high kurtosis ($\approx 35.6$ and $57.9$), which is consistent with energy use in these zones being dominated by extreme, short-lived spikes rather than sustained use.
- Voltage has the kurtosis closest to zero ($\approx 0.72$), so its tails are only slightly heavier than those of a normal distribution and it has few extreme outliers, unlike the power consumption metrics. The lowest value belongs to Sub-metering 3 ($\approx -1.28$), which indicates a flatter distribution with light tails.
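
To help interpret these numbers, the sketch below shows how `Series.kurtosis()` behaves on two synthetic series (not the household data): a normal sample, which should score near 0 because pandas uses the excess (Fisher) definition, and a mostly-idle series with rare spikes, which mimics the sub-metering pattern. The spike probability of 2% and the spike height of 40 are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

# Normal sample: excess kurtosis should be close to 0.
normal = pd.Series(rng.normal(size=n))

# Mostly zero with rare large spikes, similar in spirit to an idle meter.
spiky = pd.Series(np.where(rng.random(n) < 0.02, 40.0, 0.0))

normal_kurt = normal.kurtosis()
spiky_kurt = spiky.kurtosis()

print(f"normal sample kurtosis: {normal_kurt:.2f}")
print(f"spiky sample kurtosis:  {spiky_kurt:.2f}")
```

A series that is idle most of the time and spikes rarely produces kurtosis in the tens, which is the same order of magnitude as Sub-metering 1 and 2.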