# Regression Example - Household Electric Power Consumption

This is the regression example with _household electric power consumption_ data for working with _SeqRep_ package.

You can [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MIR-MU/seqrep/blob/main/examples/RegressionExample-Electric_Power_Constumption.ipynb)
or
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/MIR-MU/seqrep/main?labpath=examples%2FRegressionExample-Electric_Power_Constumption.ipynb).


## Install _SeqRep_ Package

In [1]:
!pip install seqrep

Collecting seqrep
  Downloading seqrep-0.0.2-py3-none-any.whl (19 kB)
Collecting ta>=0.8.0
  Downloading ta-0.9.0.tar.gz (25 kB)
Collecting hrv-analysis>=1.0.4
  Downloading hrv_analysis-1.0.4-py3-none-any.whl (28 kB)
Collecting pandas-ta>=0.3.14b0
  Downloading pandas_ta-0.3.14b.tar.gz (115 kB)
[K     |████████████████████████████████| 115 kB 7.1 MB/s 
[?25hCollecting numpy-ext>=0.9.6
  Downloading numpy_ext-0.9.6-py3-none-any.whl (6.9 kB)
Collecting nolds>=0.4.1
  Downloading nolds-0.5.2-py2.py3-none-any.whl (39 kB)
Collecting numpy>=1.15.1
  Downloading numpy-1.20.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3 MB)
[K     |████████████████████████████████| 15.3 MB 39.6 MB/s 
[?25hCollecting joblib<1.1.0,>=1.0.1
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
[K     |████████████████████████████████| 303 kB 39.1 MB/s 
Building wheels for collected packages: pandas-ta, ta
  Building wheel for pandas-ta (setup.py) ... [?25l[?25hdone
  Created wheel for pan

## Import Needed Packages

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date as dt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression

from seqrep import *
from seqrep.feature_engineering import *
from seqrep.labeling import *
from seqrep.splitting import *
from seqrep.scaling import *
from seqrep.feature_reduction import *
from seqrep.evaluation import *
from seqrep.pipeline_evaluation import *

In [6]:
def highlight(x, value=min):
    """
    Helper function for higligting particular cells in dataframe.
    """
    return ["font-weight: bold" if v == value(x) else "" for v in x]

## Download Dataset

In [13]:
# kaggle.json has to be in the working directory!

! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d uciml/electric-power-consumption-data-set

mkdir: cannot create directory ‘/root/.kaggle’: File exists
Downloading electric-power-consumption-data-set.zip to /content
 26% 5.00M/19.4M [00:01<00:04, 3.55MB/s]
100% 19.4M/19.4M [00:01<00:00, 13.0MB/s]


In [14]:
!unzip *.zip

Archive:  electric-power-consumption-data-set.zip
  inflating: household_power_consumption.txt  


In [16]:
df = pd.read_csv(
    "household_power_consumption.txt",
    parse_dates=True,
    sep=";",
    nrows=1e6,
)
df = df[df != "?"]
df["date_time"] = df["Date"] + " " + df["Time"]
df = df.drop(columns=["Date", "Time"])
columns_types = {
    "date_time": "datetime64",
    "Global_active_power": float,
    "Global_reactive_power": float,
    "Voltage": float,
    "Global_intensity": float,
    "Sub_metering_1": float,
    "Sub_metering_2": float,
    "Sub_metering_3": float,
}
df = df.astype(columns_types)
df = df.dropna()
df = df.set_index("date_time")
df


Columns (2,3,4,5,6,7) have mixed types.Specify dtype option on import or set low_memory=False.



Unnamed: 0_level_0,Global_active_power,Global_reactive_power,Voltage,Global_intensity,Sub_metering_1,Sub_metering_2,Sub_metering_3
date_time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2006-12-16 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
2006-12-16 17:25:00,5.360,0.436,233.63,23.0,0.0,1.0,16.0
2006-12-16 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
2006-12-16 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
2006-12-16 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
...,...,...,...,...,...,...,...
2010-11-26 20:58:00,0.946,0.000,240.43,4.0,0.0,0.0,0.0
2010-11-26 20:59:00,0.944,0.000,240.00,4.0,0.0,0.0,0.0
2010-11-26 21:00:00,0.938,0.000,239.82,3.8,0.0,0.0,0.0
2010-11-26 21:01:00,0.934,0.000,239.70,3.8,0.0,0.0,0.0


In [21]:
# Dataset is too big, so we take only ca. first half.
df = df[: int(1e6)]
df.shape

(1000000, 7)

In [23]:
df.dtypes

Global_active_power      float64
Global_reactive_power    float64
Voltage                  float64
Global_intensity         float64
Sub_metering_1           float64
Sub_metering_2           float64
Sub_metering_3           float64
dtype: object

In [24]:
df.isna().sum()

Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

## Simple Approach
In this section, only the original temperature values (one feature) is used.

In [26]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment following line if you want to clear the DataFrame with the results.
# del results_for_comparison

try:
    results_for_comparison
except NameError:
    print("Create new empty DataFrame.")
    results_for_comparison = pd.DataFrame()
else:
    print("DataFrame already exist!")

Create new empty DataFrame.


In [28]:
%%capture --no-stdout --no-display


pipe = Pipeline(
    [
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

pipe_eval = PipelineEvaluator(
    labeler=RegressionLabeler(
        positive="Global_active_power",
        negative="Global_active_power",
        base="Global_active_power",
    ),
    splitter=TrainTestSplitter(),
    pipeline=pipe,
    evaluator=RegressionEvaluator(),
)

models = [
    LinearRegression(),
    MLPRegressor(shuffle=False),
]

for model in models:
    print()
    print(model)
    pipe_eval.model = model
    result = pipe_eval.run(data=df)

    results_for_comparison = results_for_comparison.append(
        pd.Series(result, name=f"Simple pipeline with {str(pipe_eval.model)} model"),
    )


LinearRegression()
07:57:37.811 Labeling data
07:57:46.774 Splitting data
07:57:46.816 Fitting pipeline
07:57:46.867 Applying pipeline transformations
07:57:47.023 	Original shape:		(750000, 7); 
		shape after removing NaNs: (750000, 7).
07:57:47.083 	Original shape:		(250000, 7); 
		shape after removing NaNs: (250000, 7).
07:57:47.084 Fitting model
07:57:47.285 Predicting
07:57:47.300 Evaluating predictions
MAE:  0.0966 
MSE:  0.0638
RMSE: 0.2526
R2:   0.0209


MLPRegressor(shuffle=False)
07:57:47.405 Labeling data
07:57:55.463 Splitting data
07:57:55.611 Fitting pipeline
07:57:55.667 Applying pipeline transformations
07:57:55.858 	Original shape:		(750000, 7); 
		shape after removing NaNs: (750000, 7).
07:57:55.907 	Original shape:		(250000, 7); 
		shape after removing NaNs: (250000, 7).
07:57:55.908 Fitting model
08:00:05.548 Predicting
08:00:06.514 Evaluating predictions
MAE:  0.0921 
MSE:  0.0589
RMSE: 0.2427
R2:   0.0965



In [29]:
results_for_comparison.reset_index().style.apply(
    highlight, subset=["MAE", "MSE", "RMSE"]
).apply(highlight, subset=["R2"], value=max)

Unnamed: 0,index,MAE,MSE,R2,RMSE
0,Simple pipeline with LinearRegression() model,0.096587,0.063826,0.02088,0.252638
1,Simple pipeline with MLPRegressor(shuffle=False) model,0.092139,0.058897,0.096487,0.242688


## Approach with Feature Engineering (FE)

Derived features (calculated from the original temperature) are added for this approach.

In [30]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment following line if you want to clear the DataFrame with the results.
# del results_approach_FS

try:
    results_approach_FS
except NameError:
    print("Create new empty DataFrame.")
    results_approach_FS = pd.DataFrame()
else:
    print("DataFrame already exist!")

Create new empty DataFrame.


In [31]:
%%capture --no-stdout --no-display

pipe = Pipeline(
    [
        ("fext_prev", PreviousValuesExtractor()),
        ("fext_prev2", PreviousValuesExtractor(shift=2)),
        ("fext_prev3", PreviousValuesExtractor(shift=3)),
        (
            "fext_time",
            TimeFeaturesExtractor(
                intervals=["weekday", "day", "weekofyear", "month", "year"]
            ),
        ),
        (
            "fext_func0",
            FuncApplyFeatureExtractor(
                func=lambda x: x.rolling(10).mean(),
                columns_to_apply="Global_active_power",
            ),
        ),
        (
            "fext_func1",
            FuncApplyFeatureExtractor(
                func=lambda x: x.rolling(20).mean(),
                columns_to_apply="Global_active_power",
                rsuffix="-",
            ),
        ),
        ("fext0", HRVExtractor()),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

pipe_prep = PipelineEvaluator(
    labeler=RegressionLabeler(
        positive="Global_active_power",
        negative="Global_active_power",
        base="Global_active_power",
    ),
    splitter=TrainTestSplitter(),
    pipeline=pipe,
    feature_reductor=UnivariateFeatureSelector(number=0.8),
)

pipe_prep.run(data=df)
print()

for model in models:
    print()
    print(model)
    pipe_eval = PipelineEvaluator(
        model=model,
        evaluator=RegressionEvaluator(),
    )
    for attribute in ["X_train", "X_test", "y_train", "y_test"]:
        value = getattr(pipe_prep, attribute)
        setattr(pipe_eval, attribute, value)

    result = pipe_eval.run()

    results_approach_FS = results_approach_FS.append(
        pd.Series(
            result,
            name=f"Approach with FE with {str(pipe_eval.model)} model_{models.index(model)}",
        ),
    )

08:00:06.830 Labeling data
08:00:15.096 Splitting data
08:00:15.218 Fitting pipeline


Calculating columns:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating methods:   0%|          | 0/6 [00:00<?, ?it/s]

08:44:11.413 Applying pipeline transformations


Calculating columns:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating methods:   0%|          | 0/6 [00:00<?, ?it/s]

08:58:49.059 	Original shape:		(750000, 95); 
		shape after removing NaNs: (749948, 89).
08:58:49.440 	Original shape:		(250000, 95); 
		shape after removing NaNs: (249220, 89).
08:58:49.442 Applying feature reduction


LinearRegression()
08:59:24.718 	Original shape:		(749948, 71); 
		shape after removing NaNs: (749948, 71).
08:59:24.963 	Original shape:		(249220, 71); 
		shape after removing NaNs: (249220, 71).
08:59:24.964 Fitting model
08:59:29.902 Predicting
08:59:29.937 Evaluating predictions
MAE:  0.1002 
MSE:  0.0582
RMSE: 0.2413
R2:   0.1083


MLPRegressor(shuffle=False)
08:59:30.620 	Original shape:		(749948, 71); 
		shape after removing NaNs: (749948, 71).
08:59:30.814 	Original shape:		(249220, 71); 
		shape after removing NaNs: (249220, 71).
08:59:30.815 Fitting model
09:12:53.055 Predicting
09:13:00.623 Evaluating predictions
MAE:  0.0894 
MSE:  0.0505
RMSE: 0.2247
R2:   0.2268



In [32]:
results_approach_FS.reset_index().style.apply(
    highlight, subset=["MAE", "MSE", "RMSE"]
).apply(highlight, subset=["R2"], value=max)

Unnamed: 0,index,MAE,MSE,R2,RMSE
0,Approach with FE with LinearRegression() model_0,0.100224,0.058212,0.108268,0.241271
1,Approach with FE with MLPRegressor(shuffle=False) model_1,0.089352,0.050477,0.22675,0.224672


## Approach with (FE) and Feature Reduction (FR)

In this case, not only derived features are added but they are reduced by feature selection.

In [33]:
# This DataFrame collects the results of various runs for comparison.

# Uncomment following line if you want to clear the DataFrame with the results.
# del results_approach_FS_FR

try:
    results_approach_FS_FR
except NameError:
    print("Create new empty DataFrame.")
    results_approach_FS_FR = pd.DataFrame()
else:
    print("DataFrame already exist!")

Create new empty DataFrame.


In [34]:
%%capture --no-stdout --no-display

pipe = Pipeline(
    [
        ("fext_prev", PreviousValuesExtractor()),
        ("fext_prev2", PreviousValuesExtractor(shift=2)),
        ("fext_prev3", PreviousValuesExtractor(shift=3)),
        (
            "fext_time",
            TimeFeaturesExtractor(
                intervals=["weekday", "day", "weekofyear", "month", "year"]
            ),
        ),
        (
            "fext_func0",
            FuncApplyFeatureExtractor(
                func=lambda x: x.rolling(10).mean(),
                columns_to_apply="Global_active_power",
            ),
        ),
        (
            "fext_func1",
            FuncApplyFeatureExtractor(
                func=lambda x: x.rolling(20).mean(),
                columns_to_apply="Global_active_power",
                rsuffix="-",
            ),
        ),
        ("fext0", HRVExtractor()),
        ("scale_u", UniversalScaler(scaler=MinMaxScaler())),
    ]
)

pipe_prep = PipelineEvaluator(
    labeler=RegressionLabeler(
        positive="Global_active_power",
        negative="Global_active_power",
        base="Global_active_power",
    ),
    splitter=TrainTestSplitter(),
    pipeline=pipe,
    feature_reductor=UnivariateFeatureSelector(number=0.6),
)

pipe_prep.run(data=df)
print()

for model in models:
    print()
    print(model)
    pipe_eval = PipelineEvaluator(
        model=model,
        evaluator=RegressionEvaluator(),
    )
    for attribute in ["X_train", "X_test", "y_train", "y_test"]:
        value = getattr(pipe_prep, attribute)
        setattr(pipe_eval, attribute, value)

    result = pipe_eval.run()

    results_approach_FS_FR = results_approach_FS_FR.append(
        pd.Series(
            result,
            name=f"Approach with FE and FR with {str(pipe_eval.model)} model_{models.index(model)}",
        ),
    )

09:13:00.829 Labeling data
09:13:09.820 Splitting data
09:13:10.163 Fitting pipeline


Calculating columns:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating methods:   0%|          | 0/6 [00:00<?, ?it/s]

09:56:58.411 Applying pipeline transformations


Calculating columns:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating methods:   0%|          | 0/6 [00:00<?, ?it/s]

10:11:19.826 	Original shape:		(750000, 95); 
		shape after removing NaNs: (749948, 89).
10:11:20.250 	Original shape:		(250000, 95); 
		shape after removing NaNs: (249220, 89).
10:11:20.251 Applying feature reduction


LinearRegression()
10:11:55.698 	Original shape:		(749948, 53); 
		shape after removing NaNs: (749948, 53).
10:11:55.946 	Original shape:		(249220, 53); 
		shape after removing NaNs: (249220, 53).
10:11:55.947 Fitting model
10:11:59.177 Predicting
10:11:59.219 Evaluating predictions
MAE:  0.0966 
MSE:  0.0581
RMSE: 0.2410
R2:   0.1099


MLPRegressor(shuffle=False)
10:11:59.806 	Original shape:		(749948, 53); 
		shape after removing NaNs: (749948, 53).
10:11:59.948 	Original shape:		(249220, 53); 
		shape after removing NaNs: (249220, 53).
10:11:59.948 Fitting model
10:22:29.518 Predicting
10:22:34.541 Evaluating predictions
MAE:  0.0874 
MSE:  0.0483
RMSE: 0.2198
R2:   0.2602



In [35]:
results_approach_FS_FR.reset_index().style.apply(
    highlight, subset=["MAE", "MSE", "RMSE"]
).apply(highlight, subset=["R2"], value=max)

Unnamed: 0,index,MAE,MSE,R2,RMSE
0,Approach with FE and FR with LinearRegression() model_0,0.096632,0.058103,0.109941,0.241045
1,Approach with FE and FR with MLPRegressor(shuffle=False) model_1,0.08735,0.048293,0.260212,0.219757


## Result Comparison

In [36]:
results_for_comparison.append(results_approach_FS).append(
    results_approach_FS_FR
).reset_index().style.apply(highlight, subset=["MAE", "MSE", "RMSE"]).apply(
    highlight, subset=["R2"], value=max
)

Unnamed: 0,index,MAE,MSE,R2,RMSE
0,Simple pipeline with LinearRegression() model,0.096587,0.063826,0.02088,0.252638
1,Simple pipeline with MLPRegressor(shuffle=False) model,0.092139,0.058897,0.096487,0.242688
2,Approach with FE with LinearRegression() model_0,0.100224,0.058212,0.108268,0.241271
3,Approach with FE with MLPRegressor(shuffle=False) model_1,0.089352,0.050477,0.22675,0.224672
4,Approach with FE and FR with LinearRegression() model_0,0.096632,0.058103,0.109941,0.241045
5,Approach with FE and FR with MLPRegressor(shuffle=False) model_1,0.08735,0.048293,0.260212,0.219757


We can see that **adding features** has improved the results.

Further improvement has been achieved through the **selection of suitable features**.