In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 🏠 Time Series Forecasting of Household Electricity Consumption 📊⚡️

## Problem Statement

This internship project aims to leverage Python for **time series prediction** of household electricity consumption. The dataset includes essential features such as date, time, global active power, reactive power, voltage, intensity, and sub-metering values. The objective is to build robust forecasting models that accurately predict future electricity consumption trends based on historical data. Insights from this analysis can empower households to optimize energy usage, plan efficiently, and contribute to sustainable practices. 💡🌱

## Dataset Description

1. **Date & Time**: Timestamps of electricity consumption recordings.
2. **Global Active Power**: Minute-averaged active power drawn by the whole household (kilowatts).
3. **Global Reactive Power**: Minute-averaged reactive power (kilowatts).
4. **Voltage**: Minute-averaged voltage (volts).
5. **Global Intensity**: Minute-averaged current intensity (amperes).
6. **Sub-metering 1-3**: Active energy metered in specific areas (e.g., kitchen, laundry, water heater), in watt-hours.

## Project Objectives

1. **Data Preprocessing**:
   - Clean and preprocess the dataset 🧹, handling missing values and outliers.
   - Combine date and time into datetime format for analysis 📅⏰.

2. **Exploratory Data Analysis (EDA)**:
   - Uncover consumption patterns, trends, and seasonality 📈.
   - Visualize relationships between features to gain insights 📊.

3. **Time Series Forecasting Models**:
   - Implement models such as ARIMA, SARIMA, LSTM 🛠️.
   - Evaluate performance metrics to select the best model 📉.

4. **Feature Engineering**:
   - Analyze feature impacts on electricity consumption ⚙️.
   - Create new features to improve prediction accuracy 🔍.

5. **Model Evaluation and Tuning**:
   - Fine-tune model hyperparameters for optimal performance 🎯.
   - Validate and optimize models using separate test datasets 📝.

6. **Future Consumption Prediction**:
   - Generate forecasts for future electricity usage 🚀.
   - Visualize and interpret predictions to identify consumption patterns 📈.

## Deliverables

- Python scripts for data preprocessing, EDA, and time series models 🐍.
- Visualizations illustrating consumption patterns and model evaluations 📊.
- Comprehensive report summarizing findings, challenges, and recommendations 📄.

This project equips interns with hands-on experience in time series analysis, forecasting, and feature engineering, promoting energy-efficient practices in households. 🌍⚡️


# ⚡️ Time Series Analysis and Data Preparation 📊

## Libraries and Dependencies 📚

- **Pandas**, **NumPy**: Data manipulation and numerical operations.
- **Plotly**, **Matplotlib**: Interactive and static plotting.
- **Statsmodels**, **Prophet**: Time series analysis and forecasting.
- **Seaborn**, **Pandas.plotting**: Statistical visualization.
- **Scikit-learn**, **XGBoost**, **Keras**: Data preprocessing and modeling.

## Data Loading and Inspection 🕵️‍♂️

- **Load Data**: Read 'household_power_consumption.txt', parse dates, handle missing values.
- **Data Overview**: Check data info, inspect missing values in each column.
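
As a minimal sketch of the loading step, the snippet below parses three inline rows laid out like the semicolon-separated file, treats `?` as missing, and combines `Date` and `Time` into one datetime index. The sample values and the explicit `%d/%m/%Y %H:%M:%S` format are assumptions made for illustration; check them against the real file before relying on them.

```python
import io
import pandas as pd

# Made-up rows in the same layout as the real file (values are illustrative only).
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.84\n"
    "16/12/2006;17:25:00;?;233.63\n"
    "16/12/2006;17:26:00;5.374;233.29\n"
)

# '?' marks a missing reading, so map it to NaN while reading.
sample = pd.read_csv(raw, sep=";", na_values=["?"])

# Build one timestamp from the two text columns; the day comes first in this layout.
sample["datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M:%S"
)
sample = sample.drop(columns=["Date", "Time"]).set_index("datetime")

# Count missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Spelling out the format avoids any ambiguity between day-first and month-first dates.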

## Data Types and Warnings ⚠️

- **Column Types**: Ensure correct data types for analysis.
- **Suppress Warnings**: Ignore warnings during data processing.

## Additional Functions and Tools 🛠️

- **Statistical Plots**: Explore relationships using lag plots, autocorrelation, and scatter matrices.
- **Data Scaling**: Normalize data using StandardScaler, MinMaxScaler, and QuantileTransformer.
- **Model Evaluation**: Assess models with mean squared error and R-squared scores.

## Models Implemented 🧠

- **ARIMA**, **SARIMA**: Time series models from Statsmodels.
- **XGBoost**: Gradient boosting for regression and time series forecasting.
- **LSTM**: Deep learning model for sequence prediction with Keras.


In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA  # the older statsmodels.tsa.arima_model.ARIMA is deprecated in favor of this class
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy.fftpack import fft
import xgboost as xgb
from statsmodels.tsa.seasonal import seasonal_decompose
from prophet import Prophet
import warnings as w
import os
import seaborn as sn
import matplotlib.pyplot as mp
from pandas.plotting import lag_plot,autocorrelation_plot,scatter_matrix
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.metrics import mean_squared_error,r2_score
import itertools
from keras.models import Sequential
from keras.layers import LSTM, Dense

import joblib
from keras.models import load_model
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
w.filterwarnings('ignore')

In [None]:
df = pd.read_csv('/kaggle/input/household-power-consumption/household_power_consumption.txt', sep=';', parse_dates={'datetime': ['Date', 'Time']}, infer_datetime_format=True, low_memory=False, na_values=['nan','?'])

In [None]:
df.set_index('datetime',inplace=True)

In [None]:
df.head(3)

In [None]:
df.info()

In [None]:
for i in df:
    print(df[i].isna().value_counts())

In [None]:
for i in df:
    print("***"*30)
    print(i,"------------------>",df[i].dtype)
    print("***"*30)

In [None]:
df.shape

🔌📊 **Exploring Household Electricity Consumption Trends**

### Data Preparation

1. **Data Cleaning and Sampling**:
   - Remove missing values with `dropna()`.
   - Sample 200,000 rows for analysis 🧹.

2. **Statistical Summary**:
   - Generate descriptive statistics using `describe()` to understand the dataset at a glance 📈.
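
One detail of the sampling step is easy to miss: `DataFrame.sample` returns rows in random order, and every later step that depends on row order (lags, rolling windows, a chronological split) needs them sorted by time. The sketch below uses a synthetic minute-level frame (an assumption, not the real data) to show the effect of `sort_index()`.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level series standing in for the real data.
idx = pd.date_range("2007-01-01", periods=1000, freq="min")
demo = pd.DataFrame({"power": np.arange(1000, dtype=float)}, index=idx)

shuffled = demo.sample(n=200, random_state=42)  # rows come back in random order
ordered = shuffled.sort_index()                 # restore chronological order

print(shuffled.index.is_monotonic_increasing)
print(ordered.index.is_monotonic_increasing)
```

Note that a sorted sample still has irregular gaps between timestamps, so row-based windows no longer correspond to fixed durations.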

### Visualizing Consumption Patterns

- **Global Active Power** ⚡️:
  - Plot a line chart to visualize the consumption of global active power over time.
  - Customize plot aesthetics for better clarity and appeal 🎨.

- **Global Reactive Power** 🔋:
  - Display how global reactive power consumption varies across time periods.
  - Use dark-themed backgrounds and contrasting fonts to enhance visualization 🖥️.

- **Voltage** 📏:
  - Explore voltage trends and fluctuations using interactive line charts.
  - Ensure visual elements such as colors and fonts are optimized for readability 🌈.

- **Global Intensity** 🔆:
  - Analyze global intensity trends to understand peak consumption periods.
  - Utilize Plotly's expressive capabilities to highlight key insights 📊.

- **Sub-metering 1** 🍽️:
  - Visualize sub-metering 1 data to identify specific household consumption patterns.
  - Incorporate themed layouts to maintain consistency and engagement 🏡.

- **Sub-metering 2 and 3** 🚿:
  - Investigate electricity consumption in sub-metering 2 (e.g., laundry) and sub-metering 3 (e.g., water heater).
  - Ensure each visualization is clear and informative, enhancing overall understanding 📉.

### Conclusion

This section utilizes interactive visualizations with emojis and themed layouts to provide an engaging exploration of household electricity consumption trends. It aims to convey insights effectively while maintaining aesthetic appeal and readability for the reader.


In [None]:
df.dropna(inplace=True)

In [None]:
df.shape

In [None]:
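# sample() returns rows in random order; sort by timestamp so that shift/rolling/diff
# features and the chronological train/test split further down operate on ordered data
df = df.sample(n=200000, random_state=42).sort_index()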


In [None]:
df.head(5)

In [None]:
df.describe()

In [None]:
fig=px.line(df['Global_active_power'],title="Global_active_power consumption over Time")
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="yellow"),title_font=dict(color="Blue"),xaxis=dict(range=['2008-12-16 17:24:00','2009-12-16 17:24:00']))
fig.show()

In [None]:
fig=px.line(df['Global_reactive_power'],title="Global_reactive_power consumption over Time")
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="yellow"),title_font=dict(color="Blue"))
fig.show()

In [None]:
fig=px.line(df['Voltage'],title="Voltage over Time")
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="yellow"),title_font=dict(color="Blue"))
fig.show()

In [None]:
df.columns

📊 **Visualizing Household Electricity Consumption Patterns**

### Time Series Analysis

- **Daily, Weekly, Monthly, Quarterly, and Yearly Averages**: 
  - Resampling the data to analyze average global active power consumption over different time intervals helps identify long-term trends and seasonal patterns 📅⏰.

- **Histogram of Global Active Power**:
  - Displaying a histogram provides a distribution overview, aiding in understanding the range and frequency of global active power consumption across the dataset 📈.
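
To make the resampling idea concrete, here is a small sketch on synthetic data (two days of constant readings, chosen so the averages are easy to verify by eye). The series name mirrors the notebook's column, but the values are invented.

```python
import numpy as np
import pandas as pd

# Two days of synthetic minute-level readings: 1.0 on day one, 3.0 on day two.
idx = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
power = pd.Series(np.repeat([1.0, 3.0], 24 * 60), index=idx, name="Global_active_power")

daily_mean = power.resample("D").mean()   # one value per calendar day
weekly_mean = power.resample("W").mean()  # one value per week (weeks end on Sunday)

print(daily_mean)
print(weekly_mean)
```

Each coarser frequency averages away variation inside its bin, which is why the daily view shows the step and the weekly view only a single blended value.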

### Statistical Visualization

- **Box Plot of Global Active Power**:
  - The box plot summarizes the distribution of global active power, highlighting outliers and central tendency measures such as median and quartiles 📦.

- **Scatter Plots**:
  - Scatter plots visualize relationships between global active power and other variables like voltage, global reactive power, and sub-metering values, helping to identify correlations and anomalies 🔍.

### Correlation Matrix

- **Feature Correlation Matrix**:
  - Displaying a heatmap of correlations between features provides insights into how different variables interact with each other, guiding feature selection and modeling decisions 🧩.

### Lag Plot

- **Lag Plot of Global Active Power**:
  - The lag plot checks for autocorrelation in global active power consumption over time, essential for identifying seasonality and temporal dependencies ⏱️.
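
A lag plot gives a visual impression; the same dependence can be summarized numerically with `Series.autocorr`. The sketch below compares a synthetic 24-step cycle with pure noise (both are assumptions for illustration, not the household data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic examples: a smooth 24-step cycle and pure noise.
t = np.arange(24 * 30)
cycle = pd.Series(np.sin(2 * np.pi * t / 24))
noise = pd.Series(rng.normal(size=t.size))

# Correlation between each series and a shifted copy of itself.
print("cycle lag-1 :", round(cycle.autocorr(lag=1), 3))
print("cycle lag-12:", round(cycle.autocorr(lag=12), 3))
print("noise lag-1 :", round(noise.autocorr(lag=1), 3))
```

A tight diagonal in the lag plot corresponds to a lag-1 autocorrelation near 1, while a shapeless cloud corresponds to a value near 0.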

These visualizations collectively enhance understanding of household electricity consumption patterns, aiding in informed decision-making and forecasting strategies. 🌐⚡️


In [None]:
df_daily=df['Global_active_power'].resample('D').mean()
fig=px.line(df_daily, y='Global_active_power', title='Daily Average Global Active Power')
fig.update_traces(fill='tozeroy')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
df_Week=df['Global_active_power'].resample('W').mean()
fig=px.line(df_Week, y='Global_active_power', title='Weekly Average Global Active Power')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
df_monthly=df['Global_active_power'].resample('M').mean()
fig=px.line(df_monthly, y='Global_active_power', title='Monthly Average Global Active Power')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
df_yearend=df['Global_active_power'].resample('A').mean()
fig=px.line(df_yearend, y='Global_active_power', title='Yearly Average Global Active Power')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
df_quarterly_start=df['Global_active_power'].resample('QS').mean()
fig=px.line(df_quarterly_start, y='Global_active_power', title='Quarterly Average Global Active Power')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
fig=px.histogram(df['Global_active_power'],title="Global Active Power Histogram")
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.update_xaxes(title='Global Active Power (kilowatts)')
fig.update_yaxes(title='Count')
fig.show()

In [None]:
fig = px.box(df, y='Global_active_power', title='Box Plot of Global Active Power')
fig.update_yaxes(title='Global Active Power (kilowatts)')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
fig = px.scatter(df, x='Voltage', y='Global_active_power', title='Scatter Plot of Global Active Power vs. Voltage')
fig.update_xaxes(title='Voltage')
fig.update_yaxes(title='Global Active Power (kilowatts)')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.update_traces(marker=dict(color='red'))
fig.show()


In [None]:
fig = px.scatter(df, x='Global_reactive_power', y='Global_active_power', title='Scatter Plot of Global Active Power vs. Global Reactive Power')
fig.update_xaxes(title='Global Reactive Power')
fig.update_yaxes(title='Global Active Power (kilowatts)')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.update_traces(marker=dict(color='yellow'))
fig.show()

In [None]:
fig = px.scatter(df, x='Global_intensity', y='Global_active_power', title='Scatter Plot of Global Active Power vs. Global Intensity')
fig.update_xaxes(title='Global Intensity')
fig.update_yaxes(title='Global Active Power (kilowatts)')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Blue"))
fig.update_traces(marker=dict(color='red'))
fig.show()

In [None]:
df_corr=df.corr()
fig=px.imshow(df_corr,text_auto=True,aspect="auto", title='Feature Correlation Matrix')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
mp.figure(figsize=(10,6))
fig=lag_plot(df['Global_active_power'])
mp.title("Lag Plot of Global active power")
mp.show()

📊 **Advanced Visualization of Household Electricity Consumption**

### Density Contour Plots

- **Density Contour Plots**: 
  - Visualize the distribution of global active power, reactive power, voltage, global intensity, and sub-metering values using density contour plots. Useful for understanding density patterns across different variables 🌐.

### Violin Plots

- **Violin Plots**:
  - Display the distribution of global active power, reactive power, voltage, global intensity, and sub-metering values with violin plots. Helpful for observing data spread and concentration 🎻.

### Seasonal Decomposition and Heatmaps

- **Seasonal Decomposition**:
  - Decompose global active power into seasonal, trend, and residual components using additive and multiplicative models, aiding in understanding seasonal variations and trends over time ⏰.

- **Heatmaps of Correlations**:
  - Visualize correlation matrices of hourly, daily, monthly, and yearly averages of household electricity consumption variables. Useful for identifying relationships and dependencies between features 🔍.
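
To show what an additive decomposition computes, the sketch below rebuilds the classical procedure by hand with pandas on a synthetic series (linear trend plus a 24-step cycle). It is a simplified illustration of the idea, not the `statsmodels` implementation, and the series is an assumption.

```python
import numpy as np
import pandas as pd

period = 24
t = np.arange(period * 20)
# Synthetic series: linear trend plus a repeating 24-step pattern.
observed = pd.Series(0.01 * t + np.sin(2 * np.pi * t / period))

# Trend: centered 2x24 moving average (the usual choice for an even period).
trend = observed.rolling(period).mean().rolling(2).mean().shift(-period // 2)

# Seasonal: average detrended value at each position within the period, centered on zero.
detrended = observed - trend
seasonal = detrended.groupby(t % period).transform("mean")
seasonal = seasonal - seasonal.iloc[:period].mean()

# Residual: whatever trend and seasonal do not explain.
resid = observed - trend - seasonal
print(resid.dropna().abs().max())
```

In the additive model the three components sum back to the observed series; the multiplicative variant divides instead of subtracting and requires strictly positive values.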

### Scatter Matrix

- **Scatter Matrix**:
  - Plot pairwise relationships between global active power, reactive power, voltage, global intensity, and sub-metering values. Facilitates understanding of variable interactions and distributions in a single view 📈.


In [None]:
fig=px.density_contour(df['Global_active_power'],title='Density Contour plot of Global Active Power')
fig.update_layout(paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(contours_coloring='heatmap', colorscale='Reds')
fig.show()


In [None]:
fig=px.density_contour(df['Global_reactive_power'],title='Density Contour plot of Global Reactive Power')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(contours_coloring='heatmap', colorscale='Reds')
fig.show()

In [None]:
fig=px.density_contour(df['Voltage'],title='Density Contour plot of Voltage')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(contours_coloring='heatmap', colorscale='Reds')
fig.show()

In [None]:
fig=px.density_contour(df['Global_intensity'],title='Density Contour plot of Global Intensity')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(contours_coloring='heatmap', colorscale='Reds')
fig.show()

In [None]:
fig=px.density_contour(df['Sub_metering_1'],title='Density Contour plot of Sub_metering_1')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(contours_coloring='heatmap', colorscale='Reds')
fig.show()

In [None]:
fig=px.violin(df['Global_active_power'],title='Violin plot of Global_active_power')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(marker_color='red')
fig.show()

In [None]:
fig=px.violin(df['Global_reactive_power'],title='Violin plot of Global Reactive Power')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(marker_color='green')
fig.show()

In [None]:
fig=px.violin(df['Voltage'],title='Violin plot of Voltage')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.show()

In [None]:
fig=px.violin(df['Global_intensity'],title='Violin plot of Global_Intensity')
fig.update_layout(plot_bgcolor='Black',paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="Red"))
fig.update_traces(marker_color='pink')
fig.show()

In [None]:
result=seasonal_decompose(df['Global_active_power'],model='additive',period=24)
result.plot()
mp.show()

In [None]:
df_hourly = df.groupby(df.index.hour).mean()
fig = px.imshow(df_hourly.corr(),text_auto=True, aspect='auto', title='Heatmap of Hourly Averages')
fig.update_layout(plot_bgcolor="Black",paper_bgcolor="Black",font=dict(color="Yellow"),title_font=dict(color="RED"))
fig.show()

In [None]:
decomp=seasonal_decompose(df['Global_active_power'],model='multiplicative',period=24)
decomp.plot()
mp.show()

In [None]:
scatter_matrix(df[['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']], figsize=(15, 15), diagonal='kde')
mp.show()

🔧 **Feature Engineering for Electricity Consumption Analysis** 🔧

### Temporal Features 📅⏰
- **Temporal Breakdown**: 
  - Extracted year, month, day, hour, day of week, and week of year from timestamp for time-based analysis.

### Lagged Features 🕒
- **Lagged Values**: 
  - Created lagged features (lag_1, lag_2, lag_3) to capture previous values of global active power, aiding in time series forecasting.
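
A tiny sketch of the lag construction, on five invented readings, shows what `shift` produces and why the first rows end up without history.

```python
import pandas as pd

# Five invented readings, one per minute.
idx = pd.date_range("2007-01-01", periods=5, freq="min")
power = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0], index=idx, name="Global_active_power")

features = pd.DataFrame({"y": power})
for k in (1, 2, 3):
    features[f"lag_{k}"] = power.shift(k)  # value observed k rows earlier

# The first k rows have no history, so drop them before fitting a model.
features = features.dropna()
print(features)
```

`shift` moves by rows, not by time, so the lags only mean "k minutes ago" when the index is sorted and evenly spaced.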

### Rolling Statistics 📉
- **Rolling Statistics**: 
  - Computed rolling mean and standard deviation over windows of 30 and 10 observations (rows, not calendar days) to smooth out noise and detect trends.
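
Whether a rolling window counts rows or time matters once the index has gaps. The sketch below, on five invented daily readings with a two-day gap, contrasts `rolling(3)` with `rolling("3D")`.

```python
import pandas as pd

# Invented readings with a gap between 3 and 6 January.
idx = pd.to_datetime(
    ["2007-01-01", "2007-01-02", "2007-01-03", "2007-01-06", "2007-01-07"]
)
power = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

by_rows = power.rolling(3).mean()     # last 3 rows, however far apart in time
by_time = power.rolling("3D").mean()  # rows inside the trailing 3-day window

print(pd.DataFrame({"by_rows": by_rows, "by_time": by_time}))
```

On 6 January the row-based window still averages readings from 2 and 3 January, while the time-based window sees only the reading from that day.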

### Expanding Statistics 📊
- **Expanding Statistics**: 
  - Calculated expanding mean and standard deviation of global active power to capture cumulative trends over time.

### Exponential Weighted Moving Average (EWMA) 🔍
- **EWMA Features**: 
  - Used EWMA with spans of 3 and 5 and alpha values of 0.1 and 0.3 to smooth the data and emphasize recent values of electricity consumption.
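
`span` and `alpha` are two ways of setting the same smoothing weight: pandas documents the relation as alpha = 2 / (span + 1). The sketch below checks this on five invented values and writes out the recursion used when `adjust=False`.

```python
import numpy as np
import pandas as pd

power = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])  # invented values

span = 3
alpha = 2 / (span + 1)  # pandas' documented relation between span and alpha

by_span = power.ewm(span=span, adjust=False).mean()
by_alpha = power.ewm(alpha=alpha, adjust=False).mean()

# Same recursion written out: y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [power.iloc[0]]
for x in power.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(np.allclose(by_span, by_alpha), np.allclose(by_span, manual))
```

A larger alpha (or smaller span) makes the average react faster to new readings; a smaller alpha smooths more heavily.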

### Differencing 🔄
- **Differencing**: 
  - Applied differencing at lags 1 and 2 (`diff(1)` and `diff(2)`) to remove trend from global active power and expose short-term changes.
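
`diff(2)` is sometimes mistaken for second-order differencing. The sketch below, on an invented quadratic sequence, shows the three variants side by side.

```python
import pandas as pd

power = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # invented values, t**2

lag1 = power.diff(1)                # x[t] - x[t-1]
lag2 = power.diff(2)                # x[t] - x[t-2]
second_order = power.diff().diff()  # difference of the first differences

print(pd.DataFrame({"lag1": lag1, "lag2": lag2, "second_order": second_order}))
```

Only the second-order difference turns the quadratic into a constant; `diff(2)` compares values two rows apart and still carries the trend.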

### Time-based Features 🌞🌜
- **Time-based Transformations**: 
  - Transformed hour into sine and cosine components to capture periodicity within a day.
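
The reason for the sine/cosine pair is that raw hour values treat 23:00 and 00:00 as far apart. The sketch below measures distances in the encoded space; `encode_hour` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

gap_raw = abs(23 - 0)                                            # 23 apart as plain numbers
gap_encoded = np.linalg.norm(encode_hour(23) - encode_hour(0))   # adjacent on the circle
gap_step = np.linalg.norm(encode_hour(23) - encode_hour(22))     # any other one-hour step
gap_noon = np.linalg.norm(encode_hour(12) - encode_hour(0))      # opposite side of the circle

print(gap_raw, gap_encoded, gap_step, gap_noon)
```

Every one-hour step has the same length in the encoded space, including the step across midnight.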

### Ratio Features ⚡️
- **Ratio Features**: 
  - Calculated ratios of sub-metering values to global active power to understand relative contributions to overall consumption.

### Resampling Features 📊
- **Resampling Aggregations**: 
  - Aggregated daily, weekly, monthly, quarterly, and yearly sums of active and reactive power consumption to study consumption patterns over different time intervals.
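
There are two shapes an aggregate can take: one row per period, or the period total repeated on every original row. The sketch below shows both on four invented readings; grouping on the normalized date is one possible way to broadcast the total and is an assumption of this example.

```python
import pandas as pd

# Four invented readings spread over two days.
idx = pd.to_datetime(
    ["2007-01-01 06:00", "2007-01-01 18:00", "2007-01-02 06:00", "2007-01-02 18:00"]
)
power = pd.Series([1.0, 2.0, 3.0, 5.0], index=idx, name="Global_active_power")

# One row per day.
daily_totals = power.resample("D").sum()

# Daily total repeated on each reading of that day.
per_row_total = power.groupby(power.index.normalize()).transform("sum")

print(daily_totals)
print(per_row_total)
```

A per-row total includes the current reading and later readings from the same period, so using it to predict that reading leaks information from the target.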

### Cumulative Sum 📈
- **Cumulative Sum**: 
  - Calculated cumulative sums of active and reactive power consumption to observe overall trends and totals over time.

### Quantile Transformation 📊
- **Quantile Transformation**: 
  - Transformed global active power using QuantileTransformer to achieve a normal distribution of data for improved modeling accuracy.

### Seasonal Features 🌸🌞🍂❄️
- **Seasonal Indicators**: 
  - Identified seasonal patterns by categorizing months into spring, summer, fall, and winter based on their respective calendar months.

### Decomposition Features 📉
- **Seasonal Decomposition**: 
  - Applied seasonal decomposition (additive model) to decompose global active power into trend, seasonal, and residual components for seasonal analysis and trend detection.

### Windowed Statistics 🕒📊
- **Windowed Statistics**: 
  - Created lagged features at 3, 7, and 14 rows, plus rolling mean and standard deviation over windows of the same length, to capture short-term trends and patterns.


In [None]:
df['year']=df.index.year
df['month']=df.index.month
df['day']=df.index.day
df['hour']=df.index.hour
df['day_of_week'] = df.index.dayofweek
df['week_of_year'] = df.index.isocalendar().week

In [None]:
df['lag_1']=df['Global_active_power'].shift(1)
df['lag_2']=df['Global_active_power'].shift(2)
df['lag_3']=df['Global_active_power'].shift(3)

In [None]:
# Window length now matches the column names (30 rows)
df['mean_rolling_30']=df['Global_active_power'].rolling(30).mean()
df['std_rolling_30']=df['Global_active_power'].rolling(30).std()
df['mean_rolling_10']=df['Global_active_power'].rolling(10).mean()
df['std_rolling_10']=df['Global_active_power'].rolling(10).std()


In [None]:
df['mean_expanding']=df['Global_active_power'].expanding().mean()
df['std_expanding']=df['Global_active_power'].expanding().std()

In [None]:
df['span_mean_ewm_03']=df['Global_active_power'].ewm(span=3,adjust=False).mean()
df['span_std_ewm_03']=df['Global_active_power'].ewm(span=3,adjust=False).std()
df['span_mean_ewm_05']=df['Global_active_power'].ewm(span=5,adjust=False).mean()
df['span_std_ewm_05']=df['Global_active_power'].ewm(span=5,adjust=False).std()
df['alpha_mean_ewm_0.1']=df['Global_active_power'].ewm(alpha=0.1,adjust=True).mean()
df['alpha_std_ewm_0.1']=df['Global_active_power'].ewm(alpha=0.1,adjust=True).std()
df['alpha_mean_ewm_0.3']=df['Global_active_power'].ewm(alpha=0.3,adjust=True).mean()
df['alpha_std_ewm_0.3']=df['Global_active_power'].ewm(alpha=0.3,adjust=True).std()

In [None]:
df['diff_1']=df['Global_active_power'].diff(1)
df['diff_2']=df['Global_active_power'].diff(2)

In [None]:
df['day_night'] = df['hour'].apply(lambda x: 1 if 6 <= x < 18 else 0)

In [None]:
df['is_weekend']=df['day_of_week'].apply(lambda x:1 if x>=5 else 0)

In [None]:
df['time_since_start']=(df.index-df.index[0]).total_seconds()/3600.0

In [None]:
df['ratio_sub_metering_1']=df['Sub_metering_1']/df['Global_active_power']
df['ratio_sub_metering_2']=df['Sub_metering_2']/df['Global_active_power']
df['ratio_sub_metering_3']=df['Sub_metering_3']/df['Global_active_power']

In [None]:
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

In [None]:
df['power_intensity']=df['Global_active_power']/df['Global_intensity']
df['power_voltage']=df['Global_active_power']/df['Voltage']
df['intensity_voltage']=df['Global_intensity']/df['Voltage']

In [None]:
df['daily_active_power']=df['Global_active_power'].resample('D').transform('sum')
df['daily_reactive_power']=df['Global_reactive_power'].resample('D').transform('sum')

df['weekly_active_power']=df['Global_active_power'].resample('W').transform('sum')
df['weekly_reactive_power']=df['Global_reactive_power'].resample('W').transform('sum')

df['monthly_active_power']=df['Global_active_power'].resample('M').transform('sum')
df['monthly_reactive_power']=df['Global_reactive_power'].resample('M').transform('sum')

df['quaterly_active_power']=df['Global_active_power'].resample('QS').transform('sum')
df['quaterly_reactive_power']=df['Global_reactive_power'].resample('QS').transform('sum')

df['yearly_active_power']=df['Global_active_power'].resample('Y').transform('sum')
df['yearly_reactive_power']=df['Global_reactive_power'].resample('Y').transform('sum')

In [None]:
df['cumsum_active_power']=df['Global_active_power'].cumsum()
df['cumsum_reactive_power']=df['Global_reactive_power'].cumsum()

In [None]:
qt=QuantileTransformer(output_distribution='normal')
df[['Global_active_power_qt']]=qt.fit_transform(df[['Global_active_power']])

In [None]:
df['high_voltage']=df['Voltage'].apply(lambda x: 1 if x>240 else 0)

In [None]:
df['spring']=df['month'].apply(lambda x: 1 if x in [3,4,5] else 0)
df['summer']=df['month'].apply(lambda x: 1 if x in [6,7,8] else 0)
df['fall']=df['month'].apply(lambda x: 1 if x in [9,10,11] else 0)
df['winter']=df['month'].apply(lambda x : 1 if x in [12,1,2] else 0)

In [None]:
df['relative_used_power']=df['Global_active_power']/df['Global_intensity']

In [None]:
decomp=seasonal_decompose(df['Global_active_power'],model='additive',period=30)
df['resid']=decomp.resid
df['seasonality']=decomp.seasonal
df['trends']=decomp.trend

In [None]:

for window in [3, 7, 14]:
    df[f'lag_{window}'] = df['Global_active_power'].shift(window)
    df[f'lag_{window}_mean'] = df['Global_active_power'].shift(window).rolling(window=window).mean()
    df[f'lag_{window}_std'] = df['Global_active_power'].shift(window).rolling(window=window).std()


💡 **Model Training and Prediction for Electricity Consumption Forecasting** 💡

### Time Series Split and Preparation 🕒📊
- **Train-Test Split**: 
  - Split the dataset into training and testing sets for model evaluation.
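
For time series the split is positional rather than random: the earliest rows train the model and the latest rows test it. A minimal sketch on a synthetic daily frame, using the same 80/20 cut as the notebook:

```python
import numpy as np
import pandas as pd

# Synthetic daily frame, already sorted by time.
idx = pd.date_range("2007-01-01", periods=100, freq="D")
data = pd.DataFrame({"y": np.arange(100.0)}, index=idx)

split_index = int(len(data) * 0.8)
train, test = data.iloc[:split_index], data.iloc[split_index:]

# Every training timestamp precedes every test timestamp.
print(train.index.max(), test.index.min())
```

This only holds when the index is sorted; on shuffled rows the same slicing mixes past and future.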

### Prophet Model 📈🔮
- **Prophet Model Training**: 
  - Trained a Prophet model on the training data to forecast global active power consumption.
- **Prophet Prediction**: 
  - Made predictions using Prophet and evaluated performance metrics like MSE and R².

### SARIMAX Model 🌀📉
- **SARIMAX Model Training**: 
  - Set up SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous factors) with order (1, 1, 1) and seasonal order (1, 1, 1, 24); the cell is left commented out and is not run in this notebook.
- **SARIMAX Prediction**: 
  - The commented-out code predicts over the test window and scores the result with MSE and R².

### LSTM Model 🌐🧠
- **LSTM Model Training**: 
  - Trained a Long Short-Term Memory (LSTM) neural network to capture temporal dependencies in global active power consumption.
- **LSTM Prediction**: 
  - Made LSTM predictions and assessed model performance using MSE and R².
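
The LSTM expects input shaped `(samples, timesteps, features)`, built by sliding a fixed-length window over the series. The sketch below shows the resulting shapes on ten invented values; `make_windows` is a hypothetical helper written for this illustration.

```python
import numpy as np

def make_windows(values, seq_length):
    """Return (X, y): each X row holds seq_length past values, y is the next value."""
    X = np.array([values[i:i + seq_length] for i in range(len(values) - seq_length)])
    y = np.array(values[seq_length:])
    # Add a trailing feature axis: (samples, timesteps, 1).
    return X.reshape(-1, seq_length, 1), y

series = np.arange(10, dtype=float)  # invented values 0..9
X, y = make_windows(series, seq_length=3)

print(X.shape, y.shape)
print(X[0, :, 0], "->", y[0])
```

A series of length n yields n - seq_length windows, so the predictions cover the series from position seq_length onward.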

### XGBoost Model 🌳🚀
- **XGBoost Model Training**: 
  - Trained an XGBoost regressor to predict global active power consumption based on various features.
- **XGBoost Prediction**: 
  - Generated XGBoost predictions and compared with actual data.

### Model Comparison and Visualization 📊📈
- **Visualization**: 
  - Plotted actual and predicted values for the Prophet, LSTM, and XGBoost models.

### Model Saving 📁💾
- **Model Saving**: 
  - Saved trained LSTM and XGBoost models for future use.

### Conclusion 🎯📝
- **Summary**: 
  - Evaluated and compared performance metrics (MSE, R²) across different forecasting models to determine the best approach for predicting electricity consumption.
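- **XGBoost Performance**: 
  - By MSE and R², the XGBoost model significantly outperformed the other models. Read this with caution: its inputs include columns computed from the target at the same timestamp (for example `Global_active_power_qt`, `diff_1`, and the `trends`, `seasonality`, and `resid` components), so part of that score reflects target leakage rather than forecasting skill.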
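
For reference, both metrics are simple to compute by hand. The sketch below implements them with NumPy on four invented values; the function names `mse` and `r2` are local to this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = [1.0, 2.0, 3.0, 4.0]  # invented values
y_pred = [1.0, 2.0, 3.0, 2.0]

print(mse(y_true, y_pred), r2(y_true, y_pred))
print(r2(y_true, [2.5] * 4))  # always predicting the mean
```

R² compares the model against always predicting the mean of the test targets: 0 means no better than that baseline, and negative values mean worse.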

🔗 **Note**: Each model's performance and predictions were visualized to provide insights into electricity consumption patterns and forecasting accuracy.


In [None]:
split_index = int(len(df) * 0.8)
train, test = df.iloc[:split_index], df.iloc[split_index:]


df_prophet = df.reset_index()
df_prophet.rename(columns={'datetime': 'ds', 'Global_active_power': 'y'}, inplace=True)
train_prophet = df_prophet.iloc[:split_index]
test_prophet = df_prophet.iloc[split_index:]


prophet_model = Prophet()
prophet_model.fit(train_prophet)


future_dates = pd.DataFrame(test_prophet['ds'])
forecast = prophet_model.predict(future_dates)


forecast.set_index('ds', inplace=True)
test_prophet.set_index('ds', inplace=True)
prophet_pred = forecast.loc[test_prophet.index, 'yhat']

prophet_pred = prophet_pred.reindex(test_prophet.index)


prophet_mse = mean_squared_error(test_prophet['y'], prophet_pred)
prophet_r2 = r2_score(test_prophet['y'], prophet_pred)

print(f'Prophet MSE: {prophet_mse}, R²: {prophet_r2}')



In [None]:
# sarimax_model = SARIMAX(train['Global_active_power'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
# sarimax_result = sarimax_model.fit()


# sarimax_pred = sarimax_result.predict(start=test.index[0], end=test.index[-1])
# sarimax_result = sarimax_model.fit(maxiter=10, disp=True)


# sarimax_mse = mean_squared_error(test['Global_active_power'], sarimax_pred)
# sarimax_r2 = r2_score(test['Global_active_power'], sarimax_pred)

# print(f'SARIMAX MSE: {sarimax_mse}, R²: {sarimax_r2}')

In [None]:
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(train[['Global_active_power']])
scaled_test = scaler.transform(test[['Global_active_power']])

def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:i + seq_length]
        y = data[i + seq_length]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

seq_length = 24
X_train, y_train = create_sequences(scaled_train, seq_length)
X_test, y_test = create_sequences(scaled_test, seq_length)


X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))


lstm_model = Sequential()
lstm_model.add(LSTM(50, return_sequences=True, input_shape=(seq_length, 1)))
lstm_model.add(LSTM(50, return_sequences=False))
lstm_model.add(Dense(1))
lstm_model.compile(optimizer='adam', loss='mse')


lstm_model.fit(X_train, y_train, epochs=10, batch_size=32)


lstm_pred = lstm_model.predict(X_test)
lstm_pred = scaler.inverse_transform(lstm_pred)
y_test = scaler.inverse_transform(y_test.reshape(-1, 1))


lstm_mse = mean_squared_error(y_test, lstm_pred)
lstm_r2 = r2_score(y_test, lstm_pred)

print(f'LSTM MSE: {lstm_mse}, R²: {lstm_r2}')

In [None]:

X_train_xgb = train.drop(columns=['Global_active_power'])
y_train_xgb = train['Global_active_power']
X_test_xgb = test.drop(columns=['Global_active_power'])
y_test_xgb = test['Global_active_power']


xgb_model = xgb.XGBRegressor(objective='reg:squarederror')
xgb_model.fit(X_train_xgb, y_train_xgb)


xgb_pred = xgb_model.predict(X_test_xgb)


xgb_mse = mean_squared_error(y_test_xgb, xgb_pred)
xgb_r2 = r2_score(y_test_xgb, xgb_pred)

print(f'XGBoost MSE: {xgb_mse}, R²: {xgb_r2}')

In [None]:
future_prophet = prophet_model.make_future_dataframe(periods=24*365*2, freq='H')
forecast_prophet = prophet_model.predict(future_prophet)

mp.figure(figsize=(14, 7))
sn.lineplot(x='ds', y='y', data=df_prophet, label='Actual', color='blue')

sn.lineplot(x='ds', y='yhat', data=forecast_prophet, label='Prophet Prediction', color='orange')

mp.title('Prophet: Actual vs Predicted Global Active Power')
mp.xlabel('Date')
mp.ylabel('Global Active Power')
mp.legend()

mp.show()

In [None]:
# One-step-ahead LSTM predictions over the test window (each uses the previous seq_length actual values)
future_lstm_scaled = scaler.transform(test[['Global_active_power']])
X_future, _ = create_sequences(future_lstm_scaled, seq_length)
future_lstm_pred = lstm_model.predict(X_future)
future_lstm_pred = scaler.inverse_transform(future_lstm_pred)

# Prediction i targets test row i + seq_length, so plot it at that timestamp
# instead of relabeling it as an hourly forecast past the end of the data
prediction_dates = test.index[seq_length:]

mp.figure(figsize=(14, 7))

mp.plot(df.index, df['Global_active_power'], label='Actual', color='blue')

mp.plot(prediction_dates, future_lstm_pred[:,0], label='LSTM Prediction', color='orange')

mp.title('LSTM: Actual vs Predicted Global Active Power')
mp.xlabel('Date')
mp.ylabel('Global Active Power')
mp.legend()

mp.show()

In [None]:
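future_dates = pd.date_range(start=test.index[-1], periods=24*365*2, freq='H')
# Placeholder only: every feature is set to 0.0, so these predictions are not a meaningful forecast.
# Replace the zeros with real future feature values (calendar fields, lags, ...) before interpreting the plot.
future_xgb_features = pd.DataFrame(0.0, index=future_dates, columns=X_test_xgb.columns)
future_xgb_pred = xgb_model.predict(future_xgb_features)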


xgb_combined = np.concatenate([test['Global_active_power'].values, future_xgb_pred])
prediction_dates = pd.date_range(start=df.index[-1], periods=len(future_xgb_pred), freq='H')
mp.figure(figsize=(14, 7))

sn.lineplot(x=df.index, y=df['Global_active_power'], label='Actual', color='blue')

sn.lineplot(x=prediction_dates, y=future_xgb_pred, label='XGBoost Prediction', color='green')

mp.title('XGBoost: Actual vs Predicted Global Active Power')
mp.xlabel('Date')
mp.ylabel('Global Active Power')
mp.legend()

mp.show()

# 🚫 Not Using SARIMAX

## Decision and Reasoning

### #NoSARIMAX #SystemPerformance #AlternativeApproaches

---

🕒 SARIMAX is not included in this analysis.

### Reasons:

- 📉 SARIMAX did not reach satisfactory predictive accuracy.
- 💻 Fitting SARIMAX significantly slowed down the system during execution.

### Alternative Approaches:

- 🚀 Instead, models like XGBoost and Prophet were chosen for their superior performance and efficiency.
- 📊 XGBoost excels in capturing complex dependencies, while Prophet handles seasonality and anomalies effectively.

---

Thank you for understanding! If you have any questions or suggestions, feel free to reach out.


## 📊 Model Performance Evaluation

### XGBoost Model:
🚀 The XGBoost model demonstrates strong predictive accuracy based on quantitative metrics. It efficiently captures patterns and dependencies in the data, yielding promising results.

### Prophet Model:
🔮 The Prophet model, on the other hand, excels in capturing the overall trends and seasonal patterns in the data. Its forecast aligns closely with observed values, especially visible in graphical representations.

### Factors Affecting Prophet's Performance:
🌍 In the plots, Prophet's forecasts look closer to the observed series than those of the other models. This can be attributed to its robust handling of anomalies, dependencies, and external fluctuations inherent in real-world data. These factors strongly influence global active power values and make them hard to predict with conventional methods.

### Memory Constraints:
💻 Despite its effectiveness, Prophet's potential to utilize additional features is constrained by memory limitations, leading to system lags and performance issues. This restricts the model's ability to leverage all available data features effectively.


# 😊 Thank you for reading! 

## If you found this notebook helpful, please consider giving it an upvote. 🌟

### #TimeSeries #Analysis #PredictiveModels #DataScience