# Energy Consumption Prediction

## Dataset Description:
#### The "Household Power Consumption" dataset contains measurements of electric power consumption in a single household over a period of time. The data includes details recorded every minute from December 2006 to November 2010. Here's a detailed description of each column:

## Dataset Overview:
##### Rows: 2,075,259 (entries/records)
##### Columns: 9 (attributes/features)
##### Time Period: December 2006 - November 2010
##### Granularity: Data is recorded every minute.
### Column Descriptions:

##### Date (object):

The date of the observation in day/month/year order (DD/MM/YYYY). Single-digit days and months can appear without a leading zero, as in 6/12/2008.

##### Time (object):

The time of the observation in the format HH:MM:SS.


##### Global_active_power (object initially; should be numeric):

Household global active power in kilowatts (kW).
This is the real (useful) power drawn by all appliances in the household.

##### Global_reactive_power (object initially; should be numeric):

Household global reactive power in kilovolt-amperes reactive (kVAR).
Reactive power is the portion of electricity that establishes and sustains the electric and magnetic fields of alternating current equipment; it does no useful work.

##### Voltage (object initially; should be numeric):

Voltage (in volts) supplied to the house during the given minute.

##### Global_intensity (object initially; should be numeric):

Current intensity in amperes (A).
The total electrical current being drawn at that moment.

##### Sub_metering_1 (object initially; should be numeric):

Energy sub-metering for the kitchen (in watt-hours of active energy).
Example appliances: Dishwasher, oven, and microwave.

##### Sub_metering_2 (object initially; should be numeric):

Energy sub-metering for the laundry room (in watt-hours of active energy).
Example appliances: Washing machine, tumble dryer, refrigerator.

##### Sub_metering_3 (float64):

Energy sub-metering for the electric water heater and air-conditioning systems (in watt-hours of active energy).


#### The dataset contains missing values: about 1.25% of the rows (25,979) have no measurement in any of the seven numeric columns once those columns are converted to numeric types.

## Defining Problem Statement
The goal is to analyze the household electric power consumption dataset and extract meaningful insights: understanding consumption patterns, identifying peak usage times, and predicting future power consumption.

### Import libraries and Load the dataset

In [36]:
import pandas as pd
import numpy as np

#### 2. Loading the Data

Use pandas to load the dataset from the .txt file. Because the fields are separated by semicolons, pass sep=";" to read_csv.

In [37]:
df=pd.read_csv(r"D:\infosys\household_power_consumption.txt",sep=";")
df


Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.840,18.400,0.000,1.000,17.0
1,16/12/2006,17:25:00,5.360,0.436,233.630,23.000,0.000,1.000,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.290,23.000,0.000,2.000,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.740,23.000,0.000,1.000,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.680,15.800,0.000,1.000,17.0
...,...,...,...,...,...,...,...,...,...
2075254,26/11/2010,20:58:00,0.946,0.0,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.0,240.0,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.0,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.0,239.7,3.8,0.0,0.0,0.0
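
One way to get numeric columns straight from the load step is to tell pandas which token marks a missing reading. The sketch below is only an illustration: it assumes the placeholder in the raw file is `?` (not shown in the output above), and it reads a small hypothetical in-memory sample rather than the real file path.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same ';'-delimited layout as the file.
# The '?' placeholder for missing readings is an assumption about the raw data.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# na_values maps the placeholder to NaN, so the columns are parsed as float64.
sample_df = pd.read_csv(sample, sep=";", na_values=["?"])
print(sample_df.dtypes)
```

With this approach the measurement columns would not need a separate conversion step, provided the placeholder assumption holds.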


### 3.  Data Exploration
Check the shape and column information of the dataset to understand the data types and structure.

#### df.info() 
It provides a concise summary of the DataFrame, including the number of non-null values, the data type of each column, and memory usage.

Total entries: Total number of rows.

Data types: Shows the data type of each column (e.g., int64, float64, object).

Null values: Non-null counts per column reveal missing values. For a DataFrame this large, pandas omits the counts by default, as in the output below.

Memory usage: Gives an idea of how much memory the DataFrame is using.

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    object 
 3   Global_reactive_power  object 
 4   Voltage                object 
 5   Global_intensity       object 
 6   Sub_metering_1         object 
 7   Sub_metering_2         object 
 8   Sub_metering_3         float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


### Converting the measurement columns to a numeric dtype

In [39]:
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'],errors = 'coerce')
df['Global_reactive_power'] = pd.to_numeric(df['Global_reactive_power'],errors = 'coerce')
df['Voltage'] = pd.to_numeric(df['Voltage'],errors = 'coerce')
df['Global_intensity'] = pd.to_numeric(df['Global_intensity'],errors = 'coerce')
df['Sub_metering_1'] = pd.to_numeric(df['Sub_metering_1'],errors = 'coerce')
df['Sub_metering_2'] = pd.to_numeric(df['Sub_metering_2'],errors = 'coerce')
df['Sub_metering_3'] = pd.to_numeric(df['Sub_metering_3'],errors = 'coerce')

df.info()
 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Date                   object 
 1   Time                   object 
 2   Global_active_power    float64
 3   Global_reactive_power  float64
 4   Voltage                float64
 5   Global_intensity       float64
 6   Sub_metering_1         float64
 7   Sub_metering_2         float64
 8   Sub_metering_3         float64
dtypes: float64(7), object(2)
memory usage: 142.5+ MB
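
The effect of `errors='coerce'` is easier to see on a tiny example. This is a minimal sketch on a hypothetical Series, not on the real data: any entry that cannot be parsed as a number becomes NaN instead of raising an error, which is where the null values counted later come from.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings plus one unparseable entry.
raw = pd.Series(["4.216", "5.360", "not a number", "3.666"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```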


### df.head()
It displays the first 5 rows of the DataFrame by default; pass a number, as in df.head(10), to show more or fewer.

Usage:

Gives a quick look at the first few rows of the dataset.

It also helps to verify that the data was loaded correctly and that the columns are in the expected format.

In [40]:
df.head()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


### df.tail()
It displays the last 5 rows of the DataFrame by default; pass a number, as in df.tail(10), to show more or fewer.

Usage:

It helps to inspect the last entries in the dataset, which is useful for time-series data to confirm where the recording ends and that the data is continuous.

In [41]:
df.tail()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
2075254,26/11/2010,20:58:00,0.946,0.0,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.0,240.0,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.0,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.0,239.7,3.8,0.0,0.0,0.0
2075258,26/11/2010,21:02:00,0.932,0.0,239.55,3.8,0.0,0.0,0.0


### df.shape
It provides the dimensions of the DataFrame as a tuple (rows, columns).

It is a quick check of the number of rows and columns in the dataset.
Here the dataset has 2,075,259 rows and 9 columns.

In [42]:
df.shape

(2075259, 9)
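
Because the tuple is ordered (rows, columns), unpacking it into named variables avoids mixing the two up. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical small frame; the real df would unpack the same way.
sample_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is (number of rows, number of columns), in that order.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 2 columns
```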

### df.columns
Lists all column names in the DataFrame.

In [43]:
df.columns

Index(['Date', 'Time', 'Global_active_power', 'Global_reactive_power',
       'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
       'Sub_metering_3'],
      dtype='object')

### df.describe()
This function provides summary statistics for the numerical columns in the dataset.

Count: The number of non-null values.

Mean: The average of the column.

Standard Deviation (std): Shows how much the values deviate from the mean.

Min, 25th percentile, 50th percentile (median), 75th percentile, Max: These values help in understanding the spread and distribution of the data.

In [44]:
df.describe()

Unnamed: 0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
count,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0,2049280.0
mean,1.091615,0.1237145,240.8399,4.627759,1.121923,1.29852,6.458447
std,1.057294,0.112722,3.239987,4.444396,6.153031,5.822026,8.437154
min,0.076,0.0,223.2,0.2,0.0,0.0,0.0
25%,0.308,0.048,238.99,1.4,0.0,0.0,0.0
50%,0.602,0.1,241.01,2.6,0.0,0.0,1.0
75%,1.528,0.194,242.89,6.4,0.0,1.0,17.0
max,11.122,1.39,254.15,48.4,88.0,80.0,31.0


### df.describe(include="object")

This function helps you analyze categorical or non-numeric data. When applied to object-type columns, it summarizes these columns by providing:

Count: The number of non-null entries in each column.

Unique: The number of unique values in the column.

Top: The most frequent (or "top") value in the column.

Freq: The frequency of the most common value (i.e., how often the "top" value appears).

In [45]:
df.describe(include="object")

Unnamed: 0,Date,Time
count,2075259,2075259
unique,1442,1440
top,6/12/2008,17:24:00
freq,1440,1442
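
Date and Time are still plain strings at this point. For time-based analysis such as finding peak usage times, one possible next step is to combine them into a single datetime index. The sketch below assumes the day/month/year format described earlier and uses a hypothetical three-row sample; the `Datetime` column name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample with the same Date and Time string layout as the dataset.
sample_df = pd.DataFrame({
    "Date": ["16/12/2006", "16/12/2006", "17/12/2006"],
    "Time": ["17:24:00", "17:25:00", "00:00:00"],
    "Global_active_power": [4.216, 5.360, 1.044],
})

# The day comes first in this dataset, so the format is spelled out explicitly
# instead of letting pandas guess between day-first and month-first.
sample_df["Datetime"] = pd.to_datetime(
    sample_df["Date"] + " " + sample_df["Time"],
    format="%d/%m/%Y %H:%M:%S",
)
sample_df = sample_df.set_index("Datetime")
print(sample_df.index.min(), sample_df.index.max())
```

Spelling out the format also tends to be faster on millions of rows than letting pandas infer it.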


### Checking for Null Values (df.isnull().any() and df.isnull().sum())
df.isnull() marks each cell as missing or not. Chaining .any() reports whether a column contains at least one missing value, and chaining .sum() returns the number of missing values per column.

Usage:

To identify missing or null values that may need handling.

Useful for deciding whether certain columns need imputation or whether rows with missing values should be dropped.

In [46]:
df.isnull().any()

Date                     False
Time                     False
Global_active_power       True
Global_reactive_power     True
Voltage                   True
Global_intensity          True
Sub_metering_1            True
Sub_metering_2            True
Sub_metering_3            True
dtype: bool

In [47]:
df.isnull().sum()

Date                         0
Time                         0
Global_active_power      25979
Global_reactive_power    25979
Voltage                  25979
Global_intensity         25979
Sub_metering_1           25979
Sub_metering_2           25979
Sub_metering_3           25979
dtype: int64
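
Identical counts in every measurement column suggest that the gaps fall in the same rows. A minimal sketch of how that could be checked, using a hypothetical two-column sample: if the number of rows with any null equals the number of rows where all columns are null, the gaps line up. On the real df the check would need to be restricted to the numeric columns, since Date and Time are never null.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which both columns are missing in the same rows.
sample_df = pd.DataFrame({
    "Voltage": [234.84, np.nan, 233.29, np.nan],
    "Global_intensity": [18.4, np.nan, 23.0, np.nan],
})

null_mask = sample_df.isnull()
rows_any_null = int(null_mask.any(axis=1).sum())  # rows with at least one gap
rows_all_null = int(null_mask.all(axis=1).sum())  # rows where every column is missing

# Equal numbers mean the missing values always occur together.
print(rows_any_null, rows_all_null)  # 2 2
```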

### null_percentage
This gives the percentage of missing values in each column of the DataFrame.

In [48]:
null_percentage = (df.isnull().sum() / len(df)) * 100
null_percentage

Date                     0.000000
Time                     0.000000
Global_active_power      1.251844
Global_reactive_power    1.251844
Voltage                  1.251844
Global_intensity         1.251844
Sub_metering_1           1.251844
Sub_metering_2           1.251844
Sub_metering_3           1.251844
dtype: float64
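
The same percentages can be obtained more compactly, because the mean of a boolean mask is the fraction of True values. A small sketch on a hypothetical sample showing that both forms agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one of four Voltage readings is missing.
sample_df = pd.DataFrame({
    "Voltage": [234.84, np.nan, 233.29, 233.74],
    "Sub_metering_3": [17.0, 16.0, 17.0, 17.0],
})

# Form used above: count of nulls divided by the number of rows.
pct_from_sum = sample_df.isnull().sum() / len(sample_df) * 100

# Shorter equivalent: the mean of the boolean null mask.
pct_from_mean = sample_df.isnull().mean() * 100

print(pct_from_mean)
```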

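### **Handling null values**

The cells below try out several options. None of them assigns its result back to df, so df itself stays unchanged until one method is chosen at the end.

### a. Fill with a fixed value (0)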

In [49]:
df.fillna(0)


Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.360,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
...,...,...,...,...,...,...,...,...,...
2075254,26/11/2010,20:58:00,0.946,0.000,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.000,240.00,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.000,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.000,239.70,3.8,0.0,0.0,0.0
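
Note that `fillna` returns a new object and leaves the original untouched unless the result is assigned. A minimal sketch on a hypothetical one-column frame makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing reading.
sample_df = pd.DataFrame({"Voltage": [234.84, np.nan, 233.29]})

# fillna returns a filled copy; sample_df itself is not changed.
filled = sample_df.fillna(0)

print(sample_df["Voltage"].isna().sum())  # 1
print(filled["Voltage"].isna().sum())     # 0
```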


### Fill with the column's mean 

In [61]:
# Each call returns a new Series filled with that column's own mean; df is not modified here.
df['Sub_metering_3'].fillna(df['Sub_metering_3'].mean())
df['Sub_metering_2'].fillna(df['Sub_metering_2'].mean())
df['Sub_metering_1'].fillna(df['Sub_metering_1'].mean())
df['Global_active_power'].fillna(df['Global_active_power'].mean())
df['Global_reactive_power'].fillna(df['Global_reactive_power'].mean())
df['Voltage'].fillna(df['Voltage'].mean())
df['Global_intensity'].fillna(df['Global_intensity'].mean())


0          18.4
1          23.0
2          23.0
3          23.0
4          15.8
           ... 
2075254     4.0
2075255     4.0
2075256     3.8
2075257     3.8
2075258     3.8
Name: Global_intensity, Length: 2075259, dtype: float64

### Fill with the column's median

In [62]:
# Each call returns a new Series filled with that column's own median; df is not modified here.
df['Sub_metering_3'].fillna(df['Sub_metering_3'].median())
df['Sub_metering_2'].fillna(df['Sub_metering_2'].median())
df['Sub_metering_1'].fillna(df['Sub_metering_1'].median())
df['Global_active_power'].fillna(df['Global_active_power'].median())
df['Global_reactive_power'].fillna(df['Global_reactive_power'].median())
df['Voltage'].fillna(df['Voltage'].median())
df['Global_intensity'].fillna(df['Global_intensity'].median())



0          18.4
1          23.0
2          23.0
3          23.0
4          15.8
           ... 
2075254     4.0
2075255     4.0
2075256     3.8
2075257     3.8
2075258     3.8
Name: Global_intensity, Length: 2075259, dtype: float64

### Fill with the most frequent value (mode)

In [52]:
# Each call returns a new Series filled with that column's own mode; df is not modified here.
# mode() returns a Series because ties are possible, so [0] takes the first mode.
df['Global_active_power'].fillna(df['Global_active_power'].mode()[0])
df['Global_reactive_power'].fillna(df['Global_reactive_power'].mode()[0])
df['Voltage'].fillna(df['Voltage'].mode()[0])
df['Global_intensity'].fillna(df['Global_intensity'].mode()[0])
df['Sub_metering_1'].fillna(df['Sub_metering_1'].mode()[0])
df['Sub_metering_2'].fillna(df['Sub_metering_2'].mode()[0])
df['Sub_metering_3'].fillna(df['Sub_metering_3'].mode()[0])


0          17.0
1          16.0
2          17.0
3          17.0
4          17.0
           ... 
2075254     0.0
2075255     0.0
2075256     0.0
2075257     0.0
2075258     0.0
Name: Sub_metering_3, Length: 2075259, dtype: float64
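
The three fill values can differ considerably on skewed data such as the sub-metering columns, where most readings are zero. The sketch below compares them on a small hypothetical Series; the numbers are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed readings with one gap: mostly zeros plus one large value.
s = pd.Series([0.0, 0.0, 1.0, 17.0, np.nan])

fill_values = {
    "mean": s.mean(),      # pulled upward by the large value
    "median": s.median(),  # middle of the non-null values
    "mode": s.mode()[0],   # most frequent value
}

for name, value in fill_values.items():
    print(name, value, s.fillna(value).tolist())
```

On data like this the mean sits well above the typical reading, which is worth keeping in mind when choosing a fill strategy.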

### Dropping columns with null values

Because every measurement column contains nulls, this leaves only Date and Time.

In [53]:
df.dropna(axis=1)


Unnamed: 0,Date,Time
0,16/12/2006,17:24:00
1,16/12/2006,17:25:00
2,16/12/2006,17:26:00
3,16/12/2006,17:27:00
4,16/12/2006,17:28:00
...,...,...
2075254,26/11/2010,20:58:00
2075255,26/11/2010,20:59:00
2075256,26/11/2010,21:00:00
2075257,26/11/2010,21:01:00


### Drop rows with null values in specific columns

In [59]:
# Drop rows that have a null in any of the listed measurement columns.
# The result is a new DataFrame; df is not modified here.
df.dropna(subset=['Global_active_power', 'Global_reactive_power', 'Voltage',
                  'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
                  'Sub_metering_3'])




Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.360,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
...,...,...,...,...,...,...,...,...,...
2075254,26/11/2010,20:58:00,0.946,0.000,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.000,240.00,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.000,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.000,239.70,3.8,0.0,0.0,0.0
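
The `subset` argument controls which columns are inspected when deciding whether to drop a row. A minimal sketch on a hypothetical sample where the gaps fall in different rows, so the choice of subset changes the result:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: each column is missing in a different row.
sample_df = pd.DataFrame({
    "Voltage": [234.84, np.nan, 233.29],
    "Global_intensity": [18.4, 23.0, np.nan],
})

# Only rows with a missing Voltage are dropped.
only_voltage = sample_df.dropna(subset=["Voltage"])

# Rows with a gap in either listed column are dropped.
either = sample_df.dropna(subset=["Voltage", "Global_intensity"])

print(len(only_voltage), len(either))  # 2 1
```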


Mean imputation is the method chosen to fill the NaN values. The result is assigned back to df so that the change persists.

In [57]:
numeric_cols = ['Global_active_power', 'Global_reactive_power', 'Voltage',
                'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
                'Sub_metering_3']
# Fill each column with its own mean and assign back, rather than calling
# fillna(..., inplace=True) on a column selection, which recent pandas
# versions warn about and may not apply to df.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

df

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.360,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
...,...,...,...,...,...,...,...,...,...
2075254,26/11/2010,20:58:00,0.946,0.000,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.000,240.00,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.000,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.000,239.70,3.8,0.0,0.0,0.0


Checking whether any null values remain:

In [58]:
df.isnull().sum()

Date                     0
Time                     0
Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64
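
In a script, the same check can be turned into an assertion so that a later step fails loudly if nulls slip through. A minimal sketch on a hypothetical one-column frame, assuming mean imputation as above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing reading, filled with the column mean.
sample_df = pd.DataFrame({"Voltage": [234.84, np.nan, 233.29]})
sample_df["Voltage"] = sample_df["Voltage"].fillna(sample_df["Voltage"].mean())

# Total number of nulls across all columns; zero means the fill was complete.
remaining = int(sample_df.isnull().sum().sum())
assert remaining == 0, f"{remaining} null values remain"
print(remaining)  # 0
```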