# Dataset Description

**Source:**

https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption



**Data Set Information:**

This archive contains 2075259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months).

Notes:
1. (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt hour) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.
2. The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.
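Note 1 above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `unmetered_energy_wh` and its parameter names are hypothetical, and the sample values are those of the first record of the dataset (16/12/2006 17:24:00).

```python
def unmetered_energy_wh(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    """Active energy (Wh) used in one minute by equipment outside the three sub-meters."""
    # A minute-averaged power in kW becomes Wh per minute:
    # multiply by 1000 (kW -> W) and divide by 60 (minutes per hour).
    total_wh = global_active_power_kw * 1000 / 60
    return total_wh - sub_1_wh - sub_2_wh - sub_3_wh


# First record: 4.216 kW, sub-meterings of 0, 1 and 17 Wh
first_row = unmetered_energy_wh(4.216, 0.0, 1.0, 17.0)
print(round(first_row, 2))
```

For that record, about 70.27 Wh were consumed in total during the minute, of which 18 Wh were captured by the sub-meters.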



**Attribute Information:**

1. date: Date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatt)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
5. voltage: minute-averaged voltage (in volt)
6. global_intensity: household global minute-averaged current intensity (in ampere)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

# -> Task: Our task is to forecast the voltage and compare the forecast with the recorded voltage.

# Importing some common Packages and Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import itertools
import warnings
from sklearn.metrics import mean_squared_error

### Loading Dataset using pandas Package

In [2]:
Data = pd.read_csv("C:\\Users\\SS\\Downloads\\MDS Course files\\Data set\\household_power_consumption\\household_power_consumption.txt",sep=';')



The dataset is a .txt file whose fields are separated by ';'. So, we use the sep argument to read it into a correct dataframe / table.
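Most columns load as `object` when a non-numeric placeholder sits in an otherwise numeric column. The sketch below assumes the raw file marks missing measurements with `?`; that is an assumption about the file, not something stated in the description above. Empty fields between two separators are read as NaN by default. The in-memory `sample` stands in for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real file; the '?' placeholder is an assumption
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;;\n"
)

# na_values turns the placeholder into NaN so the columns parse as floats;
# low_memory=False reads the file in one pass, which avoids mixed-type columns
df = pd.read_csv(sample, sep=";", na_values=["?"], low_memory=False)
print(df.dtypes)
```

With the placeholder declared up front, the numeric columns come out as `float64` and need no conversion step later.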

In [3]:
Data.head()

Unnamed: 0,Date,Time,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0


## Checking Data Structure
-> Number of rows / instances / entries

-> Number of columns / features / attributes

In [4]:
print(f"Number of rows are {Data.shape[0]}. \nNumber of Columns are {Data.shape[1]}.")

Number of rows are 2075259. 
Number of Columns are 9.


## Information about columns

In [5]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
Date                     object
Time                     object
Global_active_power      object
Global_reactive_power    object
Voltage                  object
Global_intensity         object
Sub_metering_1           object
Sub_metering_2           object
Sub_metering_3           float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB


Clearly, many features that are actually numeric are stored as object.

So, we will change their types.

In [6]:
Data.columns

Index(['Date', 'Time', 'Global_active_power', 'Global_reactive_power',
       'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
       'Sub_metering_3'],
      dtype='object')

To change the types of the columns, we first separate Date and Time from the other features, then **use the apply function and the pd.to_numeric function**

In [7]:
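# .copy() gives an independent frame, so the later assignment does not trigger SettingWithCopyWarning
temp1 = Data[['Date', 'Time']].copy()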
temp1.head(2)

Unnamed: 0,Date,Time
0,16/12/2006,17:24:00
1,16/12/2006,17:25:00


In [None]:
# Dates are dd/mm/yyyy; an explicit format stops days and months from being swapped
temp1['Date'] = pd.to_datetime(temp1['Date'], format='%d/%m/%Y')
temp1.info()

In [None]:
temp1.head()

In [None]:
temp2 = Data[['Global_active_power', 'Global_reactive_power','Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
              'Sub_metering_3']]
temp2.head(2)

In [None]:
temp2 = temp2.apply(pd.to_numeric, errors='coerce')
temp2.info()

In [None]:
Data = pd.concat([temp1,temp2],axis=1)
Data.head()

Now Data is ready with correct feature types
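The two conversions can be seen in isolation on a toy frame. This is a minimal sketch with made-up values; the `?` entry is a hypothetical non-numeric placeholder.

```python
import pandas as pd

raw = pd.DataFrame({
    "Date": ["01/12/2007", "13/12/2007"],
    "Voltage": ["234.84", "?"],
})

# An explicit day-first format keeps 01/12/2007 as 1 December, not 12 January
dates = pd.to_datetime(raw["Date"], format="%d/%m/%Y")

# errors='coerce' replaces anything that cannot be parsed as a number with NaN
voltage = pd.to_numeric(raw["Voltage"], errors="coerce")

print(dates.dt.month.tolist())
print(voltage.isna().tolist())
```

Coercion is what turns unparseable entries into the NaN values counted in the next section.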

# Checking Missing Values

In [None]:
Data.isnull().sum()

There are 25979 instances which are empty in our Data.

In [None]:
# creating bool series True for NaN values 
bool_series = pd.isnull(Data["Voltage"])  
    
# filtering data  
# displaying data only with Voltage = NaN  
Data[bool_series]

Clearly, the measurement columns of these 25979 rows are completely empty.
Now we fill these rows with the **interpolate function**, using the linear method

In [None]:
Data = Data.interpolate(method ='linear') 
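As a small illustration of what linear interpolation does to a gap, the sketch below uses a made-up four-point series. Note that `method='linear'` ignores the index and treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Two missing readings between 240 V and 243 V
voltage = pd.Series([240.0, np.nan, np.nan, 243.0])

# Each gap is filled along the straight line joining its two known neighbours
filled = voltage.interpolate(method="linear")
print(filled.tolist())
```

The two missing readings are replaced by evenly spaced values between 240 and 243. This is reasonable for short gaps, but it flattens any real variation across long ones.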

Checking again whether any null values are left.

In [None]:
Data.isnull().sum()

### Missing Values are handled

# Maximum and Minimum Datestamp

In [None]:
Data['Date'].min()

In [None]:
Data['Date'].max()

# Resetting Index
We set 'Date' as the index

In [None]:
Data.set_index('Date',inplace=True)
Data.head(4)

# Visualization

In [None]:
plt.figure(figsize=(30,10))
plt.scatter(Data.index,Data['Voltage'])
plt.tick_params(labelsize=20)
plt.xlabel('Date',fontsize=30)
plt.show()

In [None]:
Data[['Voltage']].plot(figsize=[30,10],fontsize=20)
plt.legend(fontsize=30)
plt.xlabel('Date',fontsize=30)
plt.show()

There appears to be some trend and seasonality in voltage, but it is not clear because the Data is too large.

So, we group the data by date and average it, since the raw data is recorded per minute.

In [None]:
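# numeric_only=True skips the text 'Time' column, which recent pandas versions refuse to average
Data = Data.groupby('Date').mean(numeric_only=True)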
Data.head()

In [None]:
Data.tail()

In [None]:
Data.info()

The roughly 20 lakh (2 million) entries of the original Data are reduced to 1442 daily entries
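The aggregation step can be sketched on a toy frame with made-up readings. It shows that each date collapses to one row holding the mean of that day's values, and that the text `Time` column drops out.

```python
import pandas as pd

minute = pd.DataFrame({
    "Date": pd.to_datetime(["2006-12-16", "2006-12-16", "2006-12-17", "2006-12-17"]),
    "Time": ["17:24:00", "17:25:00", "00:00:00", "00:01:00"],
    "Voltage": [234.0, 236.0, 240.0, 242.0],
}).set_index("Date")

# One row per calendar day; only numeric columns are averaged
daily = minute.groupby("Date").mean(numeric_only=True)
print(daily)
```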

In [None]:
Data[['Voltage']].plot(figsize=[30,10],fontsize=20,legend=False)
plt.legend(fontsize=30)
plt.xlabel('Date',fontsize=30)
plt.show()

## Since this project forecasts voltage, we split our data into Train Data and Test Data for both forecasting and EDA.
### From the minimum and maximum dates we have almost 4 years of data. So, we use about 3 years for train data and about 1 year for test data.

In [None]:
train_data = Data[0:1072]
train_data.info()

In [None]:
test_data = Data[1072:]
test_data.info()
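The split above uses integer positions. An alternative is to split on a calendar date, which does not depend on row counts. The sketch below assumes a complete daily index from 16 December 2006 to 26 November 2010 and uses constant placeholder voltages; the cutoff date is chosen for illustration so that the sizes match the positional split.

```python
import numpy as np
import pandas as pd

# Placeholder daily frame; real voltages are not needed to show the split
idx = pd.date_range("2006-12-16", "2010-11-26", freq="D")
daily = pd.DataFrame({"Voltage": np.full(len(idx), 240.0)}, index=idx)

# Hypothetical cutoff: everything strictly before it is training data
cutoff = pd.Timestamp("2009-11-22")
train = daily[daily.index < cutoff]
test = daily[daily.index >= cutoff]

print(len(train), len(test))
```

Boolean masks on the index keep the two sets disjoint, whereas label slicing with `.loc` includes both endpoints.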

In [None]:
train_data[['Voltage']].plot(figsize=[30,10],fontsize=20)
plt.legend(fontsize=30)
plt.xlabel('Date',fontsize=30)
plt.show()

### Since it has trend and seasonality, we decompose it into different parts.

## Importing module for decomposition

In [None]:
import statsmodels.api as sm

In [None]:
timeseries = train_data['Voltage']

# Decomposition of multiplicative time series
decomposition = sm.tsa.seasonal_decompose(timeseries, model='multiplicative')

# Visualisation
fig = decomposition.plot()
fig.set_figwidth(20)
fig.set_figheight(12)
fig.suptitle('Decomposition of multiplicative time series',fontsize=20)
plt.show()

Seasonality is still not clear from this.
So, we try to decompose a smaller sample of the data.

In [None]:
timeseries2 = train_data['Voltage'][0:100]

# Decomposition of multiplicative time series
decomposition2 = sm.tsa.seasonal_decompose(timeseries2, model='multiplicative')

# Visualisation
fig2 = decomposition2.plot()
fig2.set_figwidth(12)
fig2.set_figheight(8)
fig2.suptitle('Decomposition of multiplicative time series',fontsize=12)
plt.show()

**The following are some of our key observations from this analysis:**

1) Trend: clearly it is not a straight line, but it appears to rise and fall periodically.

2) Seasonality: the seasonal plot displays a fairly consistent repeating pattern. Because the data here is daily, the seasonal components are averages for each position in the seasonal cycle after removal of trend, not monthly values.

3) Irregular Remainder (random): the residual left in the series after removal of trend and seasonal components.

The remainder component is expected to look like white noise, i.e. to display no pattern at all.
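To make the three components concrete, here is a hand-rolled sketch of a classical multiplicative decomposition in plain pandas. It is not the statsmodels implementation, only an approximation of the same idea. The synthetic series, the weekly `pattern` and the `period` of 7 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

period = 7  # assumed weekly cycle for daily data
idx = pd.date_range("2007-01-01", periods=56, freq="D")
pattern = np.array([1.02, 1.01, 1.00, 0.99, 0.98, 0.99, 1.01])  # averages to 1
series = pd.Series(240.0 * np.tile(pattern, 8), index=idx)

# Trend: centred moving average over one full cycle (NaN at both edges)
trend = series.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position in the cycle, rescaled to mean 1
detrended = series / trend
position = np.arange(len(series)) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means / seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[position], index=idx)

# Remainder: what is left once trend and seasonality are divided out
residual = series / (trend * seasonal)
print(seasonal_means.round(2).tolist())
```

The synthetic series contains no noise, so the remainder stays at essentially 1 everywhere; on real data it should scatter around 1 without visible structure.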

## Data vs Rolling mean Plot

In [None]:
plt.figure(figsize=(30,10))
# The series is daily, so a window of 12 covers 12 days
timeseries.rolling(12).mean().plot(label='12-Day Rolling Mean')
timeseries.plot(fontsize=20)
plt.legend(fontsize=30)
plt.xlabel('Date',fontsize=30)
plt.show()

In [None]:
##### Time series plot for a smaller sample for better understanding
timeseries2.rolling(12).mean().plot(label='12-Day Rolling Mean')
timeseries2.plot()
plt.legend()
plt.show()
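For reference, a rolling mean of window `n` leaves the first `n - 1` positions empty because a full window is not yet available. The sketch below uses a made-up five-point series and a window of 3.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output is the mean of the current value and the two before it
smoothed = values.rolling(3).mean()
print(smoothed.tolist())
```

This is why the rolling-mean line in the plots starts slightly later than the raw series.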

# Dickey-Fuller Test
This test checks whether the Data is stationary or not.

### Importing adfuller function from the stattools module

In [None]:
from statsmodels.tsa.stattools import adfuller

**Defining DF_Test function** to check state of Data

In [None]:
def DF_Test(time_series):
    # adfuller returns (statistic, p-value, lags used, observations used, critical values, icbest)
    result=adfuller(time_series)
    
    print(f"Result of Dickey-Fuller test: \n{result}")
    print()
    print('Augmented Dickey-Fuller test:--')
    
    labels=['ADF Test Statistics','p-value','#Lags used','Number of Observation used']
    
    # zip stops after the four labelled values
    for value,label in zip(result,labels):
        print(f'{label} : {value}')
        
    # The null hypothesis of the test is that the series is non-stationary
    if result[1]<=0.05:
        print("\nConclusion:\nStrong evidence against the null hypothesis, reject the null hypothesis. Data is stationary")
    else:
        print("\nConclusion:\nWeak evidence against the null hypothesis, fail to reject the null hypothesis. Data is not stationary")


In [None]:
Data['Voltage Difference'] = Data['Voltage']- Data['Voltage'].shift(1)

# Dropping NA values and Calling function
DF_Test(Data['Voltage Difference'].dropna())
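First-order differencing as written above is the same operation as pandas' built-in `diff()`. A minimal sketch on made-up values:

```python
import pandas as pd

voltage = pd.Series([240.0, 241.5, 241.0, 242.0])

manual = voltage - voltage.shift(1)  # same construction as in the cell above
builtin = voltage.diff()             # equivalent shortcut

# The first entry is NaN in both, because the first value has no predecessor
print(builtin.tolist())
print(manual.equals(builtin))
```

The leading NaN is why `dropna()` is called before the series is passed to the test.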

## Finally Data is Stationary. Now we can apply ARIMA Model on Data

### ARIMA is a combination of 3 parts and it has 3 parameters i.e. **p, d, q**
1. AutoRegressive (AR) – extracts the influence of the previous periods' values on the current period.
   **p** is the parameter associated with the auto-regressive aspect of the model, which incorporates past values.
   
2. Integrated (I) – subtracts the time series from its lagged series to remove trends from the data. 
    **d** is the parameter associated with the integrated part of the model, which sets how many times differencing is applied to the time series.
    
3. Moving Average (MA) – extracts the influence of the previous periods' error terms on the current period's error.
    **q** is the parameter associated with the moving average part of the model.
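The three parts can be combined into one equation. The notation below is the common textbook form and is introduced here for illustration only: $y_t$ is the series, $B$ the backshift operator ($B y_t = y_{t-1}$), $\phi_i$ the AR coefficients, $\theta_j$ the MA coefficients, $c$ a constant and $\varepsilon_t$ the error term.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

With $d = 1$, the factor $(1 - B)\,y_t = y_t - y_{t-1}$ is exactly the first difference computed for the Dickey-Fuller test.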

# 1. AutoRegressive 

In [None]:
# Import acf and pacf

In [None]:
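from statsmodels.tsa.stattools import acf, pacf

# ts_log_diff is not defined earlier; we assume it is the differenced voltage series
# (the name is kept as-is, although no log transform has been applied)
ts_log_diff = Data['Voltage Difference'].dropna()

lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20)
plt.figure(figsize=(16,8))
lags = np.arange(21)  # lags 0..20, matching nlags=20
#Plot ACF: 
plt.subplot(121) 
plt.bar(lags, lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
# Approximate 95% confidence bounds under a white-noise assumption
plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')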