# Homework-2: Data Visualization for ECO Dataset



### 1. Data Description

The data used in this project comes from the **ECO (Electricity Consumption and Occupancy) dataset**. This analysis focuses on plug data for all appliances in the dataset and smart meter data from Households 4, 5, and 6. The plug data records individual appliance consumption, while the smart meter data records total household consumption.

The original data was sampled at a frequency of 1 Hz, meaning power consumption was recorded once per second in **watts (W)**. This results in **86,400 measurements per day**. The dataset covers the period from **June 27 to January 31**, although some days and individual values are missing. Missing values are stored as `-1` in the original dataset.<br>


### 2. Research Questions

For the given households and their energy usage, this analysis addresses the following questions:

1. Which household has the highest total electricity consumption, and how does each household's daily consumption change over time?

2. How does appliance energy consumption vary throughout the day for each household?

3. How does entertainment appliance usage differ between households over time?<br>


### 3. Data Preparation

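Missing values in the plug data were replaced using **linear interpolation**, since electricity consumption typically changes gradually over time. In the smart meter data, missing values were converted to `NaN` and are skipped when the hourly sums are computed.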
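As a small illustration of the interpolation step, the sketch below applies the same replace-then-interpolate pattern to a made-up series of six readings. The values are hypothetical and not taken from the ECO data.

```python
import numpy as np
import pandas as pd

# Hypothetical 1 Hz plug readings in watts; -1 marks a missing value
readings = pd.Series([100.0, -1.0, -1.0, 130.0, -1.0, 150.0])

# Turn the -1 markers into NaN so pandas treats them as gaps
with_gaps = readings.replace(-1, np.nan)

# Fill each gap on a straight line between the neighbouring valid readings
filled = with_gaps.interpolate(method="linear")

print(filled.tolist())
```

Here the two gaps between 100 W and 130 W are filled with evenly spaced values. Note that with the default settings, a gap at the very start of a series has no earlier valid reading to interpolate from and stays `NaN`.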

The data was then **resampled from 1-second intervals to hourly intervals**. During this process, power measurements in watts were converted to energy consumption in **kilowatt-hours (kWh)**, which is more suitable for comparing energy usage.
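The unit conversion can be checked with a small worked example. The sketch below assumes a made-up appliance that draws a constant 1,000 W for one hour and 500 W for the next; the start date is arbitrary. Summing the 1-second watt readings gives watt-seconds, and dividing by 3,600,000 gives kWh.

```python
import pandas as pd

# Hypothetical appliance: 1,000 W for the first hour, 500 W for the second
index = pd.date_range(start="2012-06-27", periods=2 * 3600, freq="s")
power_w = pd.Series([1000.0] * 3600 + [500.0] * 3600, index=index)

# Each 1-second reading in watts contributes that many watt-seconds (joules)
watt_seconds = power_w.resample("1h").sum()

# 1 kWh = 1,000 W x 3,600 s = 3,600,000 watt-seconds
energy_kwh = watt_seconds / 3_600_000

print(energy_kwh.tolist())
```

The two hourly values come out as 1.0 kWh and 0.5 kWh, which matches the expectation that 1,000 W sustained for one hour is 1 kWh.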

All plug data was combined into a long-format table with the columns `house`, `appliance`, `datetime`, and `consumption_kWh`. Smart meter data was processed similarly to create a table with the columns `house`, `datetime`, and `power_all_phases_kWh`.


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
#| eval: false
def process_plug_file(filepath, house, appliance):
    """Convert one day of 1 Hz plug readings (W) into hourly energy (kWh)."""
    df = pd.read_csv(filepath)
    df = df.replace(-1, np.nan)  # -1 marks a missing reading
    df = df.interpolate(method='linear')  # fill gaps between valid readings
    date = pd.to_datetime(filepath.stem[:10])  # date is the first 10 characters of the filename
    df['datetime'] = pd.date_range(  # one timestamp per second, starting at midnight
        start=date,
        periods=len(df),
        freq="s"
    )
    df = df.set_index("datetime")
    csp = df.columns[-1]  # consumption column
    df[csp] = pd.to_numeric(df[csp], errors="coerce")
    # Summing 1-second watt readings gives watt-seconds per hour
    hourly = df[csp].resample("1h").sum()
    hourly = hourly / 3600000  # 1 kWh = 3,600,000 watt-seconds
    result = pd.DataFrame({
        "house": house,
        "appliance": appliance,
        "datetime": hourly.index,
        "consumption_kWh": hourly.values
    })
    return result

# Process smart meter data
def process_sm_file(filepath, house):
    """Convert one day of 1 Hz smart meter readings (W) into hourly energy (kWh)."""
    df = pd.read_csv(filepath)
    df = df.replace(-1, np.nan)  # -1 marks a missing reading
    date = pd.to_datetime(filepath.stem[:10])  # date is taken from the filename
    df['datetime'] = pd.date_range(
        start=date,
        periods=len(df),
        freq="s"
    )
    df = df.set_index("datetime")
    total_csp = df.columns[0]  # first column holds the total consumption
    df[total_csp] = pd.to_numeric(df[total_csp], errors="coerce")
    # NaN readings are skipped by sum(); watt-seconds / 3,600,000 = kWh
    hourly = df[total_csp].resample("1h").sum() / 3600000
    result = pd.DataFrame({
        "house": house,
        "datetime": hourly.index,
        "power_all_phases_kWh": hourly.values
    })
    return result

In [3]:
#| eval: false
data_path = Path('data')
output_path = Path('processed_data')
plug_df = []
sm_df = []

id_to_app_map_04 = {'01': 'Fridge', '02': 'Kitchen appliances','03': 'Lamp','04': 'Stereo and laptop',
'05': 'Freezer','06': 'Tablet','07': 'Entertainment','08': 'Microwave' }

id_to_app_map_05 = {'01': 'Tablet', '02': 'Coffee machine','03': 'Fountain','04': 'Microwave',
'05': 'Fridge','06': 'Entertainment','07': 'PC','08': 'Kettle' }

id_to_app_map_06 = {'01': 'Lamp', '02': 'Laptop','03': 'Router','04': 'Coffee machine',
'05': 'Entertainment','06': 'Fridge','07': 'Kettle' }

id_to_house_map = {
    '04':id_to_app_map_04,
    '05':id_to_app_map_05,
    '06':id_to_app_map_06
}

for folder in data_path.iterdir():
    if not folder.is_dir():
        continue
    house_name = folder.name.split('_')[0]
    map_id = id_to_house_map.get(house_name)
    if not folder.name.endswith("sm"):
        # plug folders: one subfolder per plug ID
        if map_id is None:  # no appliance map for this household
            continue
        for f in folder.iterdir():
            if not f.is_dir():
                continue
            appliance_name = map_id.get(f.name)
            for csvfile in sorted(f.glob("*.csv")):
                result = process_plug_file(
                    csvfile,
                    house_name,
                    appliance_name
                )
                plug_df.append(result)
    else:
        # smart meter folders
        for csvfile in sorted(folder.glob("*.csv")):
            result = process_sm_file(
                csvfile,
                house_name
            )
            sm_df.append(result)

plug_df = pd.concat(plug_df)
sm_df = pd.concat(sm_df)

output_path.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist
plug_df.to_csv(output_path/'plug_data.csv',index=False)
sm_df.to_csv(output_path/'sm_data.csv',index=False)

### 4. EDA for Energy Consumption

In [4]:
plug_data = pd.read_csv('processed_data/plug_data.csv')
sm_data = pd.read_csv('processed_data/sm_data.csv')

print(plug_data.groupby(['house','appliance']).describe())
print(sm_data.groupby(['house']).describe())

                         consumption_kWh                                     \
                                   count      mean       std  min       25%   
house appliance                                                               
4     Entertainment               4464.0  0.032219  0.037270  0.0  0.010795   
      Freezer                     4608.0  0.174046  0.077144  0.0  0.164894   
      Fridge                      4656.0  0.027403  0.012580  0.0  0.020224   
      Kitchen appliances          4656.0  0.009502  0.031118  0.0  0.000140   
      Lamp                        4080.0  0.012023  0.020475  0.0  0.000601   
      Microwave                   4680.0  0.015906  0.050570  0.0  0.003677   
      Stereo and laptop           4056.0  0.013095  0.013788  0.0  0.000000   
      Tablet                      4536.0  0.001350  0.001745  0.0  0.000976   
5     Coffee machine              5232.0  0.005576  0.014309  0.0  0.000000   
      Entertainment               4608.0  0.026767  

### 5. Data Visualization Design and Storytelling

All visualizations were created using *Altair* because it provides strong support for interactive and linked visualizations.

In [5]:
import altair as alt
import pandas as pd

# Allow datasets larger than Altair's default 5,000-row limit
alt.data_transformers.disable_max_rows()

plug_df = pd.read_csv('processed_data/plug_data.csv')
sm_df = pd.read_csv('processed_data/sm_data.csv')

#### **5.1 Total and Daily Energy Consumption by Household**

---

The first visualization combines a `bar chart` and a `time series line chart`. 

* The bar chart shows total energy consumption for each household. 

* The line chart shows daily energy consumption over time. 

* The two charts are **linked** using `selection` interaction: when a household is selected, its data is highlighted while the others are faded. 

In [6]:
# Click a bar to select a household; with no selection, every household is highlighted
house_select = alt.selection_point(fields=["house"], empty=True)
zoom1 = alt.selection_interval(bind='scales')

bar = (
    alt.Chart(sm_df)
    .mark_bar()
    .encode(
        x=alt.X("house:N", title="Household",axis=alt.Axis(titleFontSize=14)),
        y=alt.Y("sum(power_all_phases_kWh):Q", title="Total Energy Consumption (kWh)"),
        color=alt.condition(
            house_select,
            alt.Color("house:N", legend=None),
            alt.value("lightgray")
        ),
        tooltip=[
            alt.Tooltip("house:N"),
            alt.Tooltip("sum(power_all_phases_kWh):Q", title="Total kWh")
        ]
    )
    .add_params(house_select)
    .properties(width=275, height=300)
)

line = (
    alt.Chart(sm_df)
    .mark_line()
    .encode(
        x=alt.X(
            "yearmonthdate(datetime):T",
            title="Date"
        ),
        y=alt.Y(
            "sum(power_all_phases_kWh):Q",
            title="Daily Total Energy Consumption (kWh)"
        ),
        color=alt.condition(
            house_select,
            alt.Color("house:N", legend=None),
            alt.value("lightgray")
        ),
        tooltip=[
            "house:N",
            alt.Tooltip("yearmonthdate(datetime):T", title="Date"),
            alt.Tooltip("sum(power_all_phases_kWh):Q", 
            title='Daily Total kWh', format=".2f")
        ]
    )
    .add_params(zoom1)
    .properties(width=275, height=300)
)

chart = (bar | line).configure_axis(
    titleFontSize=14,
    labelFontSize=13,
    titlePadding=12
).configure_title(
    fontSize=16
).properties(
    spacing=60,
    padding=20,
    title={
        'text':"Total and Daily Energy Consumption by Household",
        'offset':20
    })


chart

The first visualization shows the overall energy usage across the households:

* Household 4 has the highest total electricity consumption, while Household 6 has the lowest.

* Household 4 and Household 5 have similar consumption during the summer, but Household 4's consumption increases significantly in late fall and early winter.

* Household 6 shows consistently lower consumption, partly due to missing data.<br>
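To judge how much of a lower total comes from missing data rather than lower usage, one option is to compare the number of recorded hours per household with the consumption per recorded hour. The sketch below uses a small made-up table with the same columns as `sm_data.csv`; the numbers are hypothetical, and `sm_example` and `coverage` are illustrative names.

```python
import pandas as pd

# Hypothetical hourly smart meter table with the same columns as sm_data.csv
sm_example = pd.DataFrame({
    "house": [4, 4, 4, 4, 6, 6],
    "datetime": pd.to_datetime([
        "2012-06-27 00:00", "2012-06-27 01:00",
        "2012-06-28 00:00", "2012-06-28 01:00",
        "2012-06-27 00:00", "2012-06-27 01:00",
    ]),
    "power_all_phases_kWh": [0.5, 0.25, 0.5, 0.75, 0.5, 0.5],
})

# Count the recorded hours and sum the energy for each household
coverage = sm_example.groupby("house").agg(
    recorded_hours=("datetime", "nunique"),
    total_kwh=("power_all_phases_kWh", "sum"),
)

# Normalising by recorded hours makes households with gaps comparable
coverage["kwh_per_recorded_hour"] = coverage["total_kwh"] / coverage["recorded_hours"]

print(coverage)
```

In this toy table, household 6 has half the total of household 4 only because it has half the recorded hours; the consumption per recorded hour is the same. Running the same aggregation on the real table would show whether that is also the case for the ECO data.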


#### **5.2 Average Appliance Energy Consumption by Household and Hour**

---

The second visualization uses a `bar chart` to show average hourly appliance consumption. 

* The `slider` allows the user to select a specific hour of the day (0–23 h).

* The bar chart updates to show appliance consumption during that hour. 

* In addition, household selection is controlled using the `legend`. This allows users to select one household for analysis. 

In [7]:
zoom2 = alt.selection_interval(bind='scales')

plug_df['datetime'] = pd.to_datetime(plug_df['datetime'])

plug_df['hour'] = plug_df['datetime'].dt.hour

avg_df = (
    plug_df
    .groupby(['house', 'appliance', 'hour'])['consumption_kWh']
    .mean()
    .reset_index()
)

hour_slider = alt.binding_range(
    min=0,
    max=23,
    step=1,
    name='Hour of the Day (h): '
)

hour_select = alt.selection_point(
    fields=['hour'],
    bind=hour_slider,
    value=12   
)

house_select = alt.selection_point(
    fields=['house'],
    bind='legend',
    value=[{'house': avg_df['house'].unique()[0]}] 
)

chart = (
    alt.Chart(avg_df)
    .mark_bar()
    .encode(
        x=alt.X('appliance:N', title='Appliance'),
        y=alt.Y('consumption_kWh:Q', title='Hourly Average Consumption (kWh)', stack=None),
        color=alt.Color(
            'house:N',
            title='Household',
            legend=alt.Legend(
                orient='top',
                direction='horizontal',
                titleFontSize=14,
                labelFontSize=14,
                symbolSize=200
            )
        ),
        # Show only the household selected in the legend
        opacity=alt.condition(
            house_select,
            alt.value(1),
            alt.value(0)
        ),
        tooltip=[
            alt.Tooltip('house:N'),
            alt.Tooltip('appliance:N'),
            alt.Tooltip('hour:Q'),
            alt.Tooltip('consumption_kWh:Q', title='Hourly Average kWh', format='.3f')
        ]
    )
    .add_params(hour_select, house_select, zoom2)
    .transform_filter(hour_select)  # keep only the hour chosen with the slider
    .configure_axis(
        titleFontSize=14,
        labelFontSize=13,
        titlePadding=12
    )
    .configure_title(fontSize=16)
    .properties(
        width=600,
        height=300,
        padding=20,
        title={
            'text': 'Average Appliance Energy Consumption by Household and Hour',
            'offset': 20
        }
    )
)

chart

<br>The second visualization reveals **appliance usage patterns** throughout the day:

* Household 4 shows consistently high energy usage from appliances such as the freezer, which runs continuously. Kitchen appliances and microwaves show peaks around typical meal times. Entertainment devices show increased usage in the afternoon and evening after lunch and dinner.

* Household 5 shows more balanced appliance usage, with peaks from the coffee machine and microwave at breakfast time, and entertainment devices at expected times.

* Household 6 has lower overall consumption, with most usage coming from essential appliances and some evening entertainment usage.<br>


#### **5.3 Daily Entertainment Energy Consumption by Household (Jun-Nov)**

---

The third visualization combines a `heatmap` and a `time series chart` to show entertainment energy consumption from June to November. 

* The heatmap uses color intensity to represent daily consumption, making it easy to identify patterns and variations over time. 

* The time series chart below shows the same data as lines, making trends easier to interpret. 

* These charts are **linked** using `hover` interaction: when the user hovers over a household in the heatmap, the corresponding line in the time series is highlighted while the others are faded. 
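Both charts in this section rely on Altair to aggregate hourly rows into daily totals through `yearmonthdate(datetime)` and `sum(consumption_kWh)`. The same daily totals can be computed in pandas, which may help when checking the plotted values. The sketch below uses a few made-up hourly rows; `hourly` and `daily` are illustrative names and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical hourly entertainment readings for two households
hourly = pd.DataFrame({
    "house": [4, 4, 4, 5, 5],
    "datetime": pd.to_datetime([
        "2012-07-01 19:00", "2012-07-01 20:00", "2012-07-02 20:00",
        "2012-07-01 21:00", "2012-07-02 21:00",
    ]),
    "consumption_kWh": [0.25, 0.5, 0.125, 0.5, 0.25],
})

# Sum the hourly values within each calendar day, per household
daily = (
    hourly
    .groupby(["house", hourly["datetime"].dt.floor("D")])["consumption_kWh"]
    .sum()
    .reset_index()
)

print(daily)
```

Each row of `daily` then corresponds to one cell of the heatmap and one point on the matching line.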

In [8]:
hover = alt.selection_point(
    fields=['house'],
    on='mouseover',
    clear='mouseout'
)

zoom3 = alt.selection_interval(bind='scales')

entertainment_df = plug_df[
    (plug_df['appliance'] == 'Entertainment') &
    (plug_df['datetime'].dt.month >= 6) &
    (plug_df['datetime'].dt.month <= 11)
].copy()

heatmap = (
    alt.Chart(entertainment_df)
    .mark_rect()
    .encode(
        x=alt.X('yearmonthdate(datetime):T', title='Date'),
        y=alt.Y(
            'house:N',
            title='Household'
        ),

        color=alt.Color(
            'sum(consumption_kWh):Q',
            title=['Energy','Consumption (kWh)'],
            scale=alt.Scale(scheme='inferno'),
            legend=alt.Legend(orient='right')
        ),
        tooltip=[
            alt.Tooltip('house:N', title='House'),
            alt.Tooltip('yearmonthdate(datetime):T', title='Date'),
            alt.Tooltip('sum(consumption_kWh):Q', title='Daily Total kWh', format='.3f')
        ]
    )
    .properties(
        width=525,
        height=250,
        title='Heatmap for Daily Entertainment Energy Consumption by Household (Jun-Nov)'
    ).add_params(hover)
)

line_chart = (
    alt.Chart(entertainment_df)
    .mark_line(size=2)
    .encode(
        x=alt.X('yearmonthdate(datetime):T', title='Date'),

        y=alt.Y(
            'sum(consumption_kWh):Q',
            title='Daily Consumption (kWh)'
        ),

        color=alt.Color('house:N', legend=None),

        tooltip=[
            alt.Tooltip('house:N', title='House'),
            alt.Tooltip('yearmonthdate(datetime):T', title='Date'),
            alt.Tooltip('sum(consumption_kWh):Q', title='Daily Total kWh', format='.3f')
        ],
     
        opacity=alt.condition(
            hover,
            alt.value(1),
            alt.value(0.2)
        )
    )
    .properties(
        width=600,
        height=250,
        title='Time Series Plot for Daily Entertainment Energy Consumption by Household (Jun-Nov)'
    )
    .add_params(zoom3)
)

chart = (heatmap & line_chart).configure_axis(
        titleFontSize=14,
        labelFontSize=13,
        titlePadding=12
    ).configure_title(
        fontSize=16
    ).properties(
        padding=20
    ).resolve_scale(x='shared')


chart

The third visualization highlights **differences in entertainment usage** between households. 

* Household 4 shows more consistent daily usage, indicating regular entertainment activity. 

* Household 5 shows more irregular usage, with alternating periods of higher and lower consumption. 

* Household 6 has lower overall entertainment consumption. 

Across all households, entertainment usage is generally concentrated in the evening, suggesting similar daily routines.