# **Data Loading and Inspection** #

In this step, we load the raw electricity and weather datasets and inspect their structure, size, data types, and completeness. This helps us understand the data distribution, identify missing values or inconsistencies, and confirm that the data is suitable for time-series analysis. Inspecting the data early prevents downstream modeling errors and supports reliable anomaly detection.

### **Import Libraries** ###

In [None]:
import pandas as pd
import numpy as np
import os



### **Data Loading** ###

In [27]:
print("Files in electricity raw folder:")
print(os.listdir())


Files in electricity raw folder:
['chilledwater.csv', 'electricity.csv', 'gas.csv', 'hotwater.csv', 'irrigation.csv', 'solar.csv', 'steam.csv', 'water.csv']


In [3]:
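# ELECTRICITY_PATH is not defined anywhere in this notebook. It is assumed
# here to be the folder listed above (the current working directory).
ELECTRICITY_PATH = "."

electricity_df = pd.read_csv(
    os.path.join(ELECTRICITY_PATH, "electricity.csv")
)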

print("Electricity dataset loaded")
print("Shape:", electricity_df.shape)


Electricity dataset loaded
Shape: (17544, 1579)


In [28]:
electricity_df = pd.read_csv("electricity.csv")

In [None]:

weather_df = pd.read_csv("weather.csv")

### **Inspection** ###

#### **Inspection of the electricity dataset** ####

In [7]:
electricity_df.head()

Unnamed: 0,timestamp,Panther_parking_Lorriane,Panther_lodging_Cora,Panther_office_Hannah,Panther_lodging_Hattie,Panther_education_Teofila,Panther_education_Jerome,Panther_retail_Felix,Panther_parking_Asia,Panther_education_Misty,...,Cockatoo_public_Caleb,Cockatoo_education_Tyler,Cockatoo_public_Shad,Mouse_health_Buddy,Mouse_health_Modesto,Mouse_lodging_Vicente,Mouse_health_Justin,Mouse_health_Ileana,Mouse_health_Estela,Mouse_science_Micheal
0,2016-01-01 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,123.2,727.575,69.2,8.8224,370.087,10.0,282.9965,26.0,135.0,168.2243
1,2016-01-01 01:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,126.475,731.2,66.275,17.6449,737.826,30.0,574.9265,51.0,265.0,336.4486
2,2016-01-01 02:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,127.825,724.675,64.675,17.6449,729.9255,30.0,570.278,50.0,272.0,336.4486
3,2016-01-01 03:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,130.475,737.375,65.875,17.6449,722.262,20.0,561.147,52.0,276.0,336.4486
4,2016-01-01 04:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,129.675,721.15,66.275,17.6449,719.1665,30.0,564.3695,50.0,280.0,336.4486


In [8]:
electricity_df.tail()

Unnamed: 0,timestamp,Panther_parking_Lorriane,Panther_lodging_Cora,Panther_office_Hannah,Panther_lodging_Hattie,Panther_education_Teofila,Panther_education_Jerome,Panther_retail_Felix,Panther_parking_Asia,Panther_education_Misty,...,Cockatoo_public_Caleb,Cockatoo_education_Tyler,Cockatoo_public_Shad,Mouse_health_Buddy,Mouse_health_Modesto,Mouse_lodging_Vicente,Mouse_health_Justin,Mouse_health_Ileana,Mouse_health_Estela,Mouse_science_Micheal
17539,2017-12-31 19:00:00,15.483,135.2261,3.4357,79.1353,105.6374,465.0898,67.0199,35.7069,16.3231,...,96.925,704.95,111.35,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17540,2017-12-31 20:00:00,12.7224,135.9262,3.4087,81.6958,107.7348,463.6895,56.6869,35.7069,16.0831,...,97.55,695.7,115.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17541,2017-12-31 21:00:00,11.2822,135.1761,3.3546,82.816,106.1295,461.289,55.0576,35.5068,16.1631,...,93.825,687.325,111.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17542,2017-12-31 22:00:00,16.9233,137.6266,3.2876,82.3359,109.6282,460.5889,49.6776,35.7069,14.8829,...,94.15,674.275,111.95,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17543,2017-12-31 23:00:00,11.8223,136.1263,3.3686,82.4959,103.754,462.0892,42.6952,35.6069,16.0031,...,96.325,677.4,113.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### **Dataset Shape** #####

In [9]:
print("Electricity data shape:", electricity_df.shape)

Electricity data shape: (17544, 1579)


##### **Columns** #####

In [10]:
electricity_df.columns

Index(['timestamp', 'Panther_parking_Lorriane', 'Panther_lodging_Cora',
       'Panther_office_Hannah', 'Panther_lodging_Hattie',
       'Panther_education_Teofila', 'Panther_education_Jerome',
       'Panther_retail_Felix', 'Panther_parking_Asia',
       'Panther_education_Misty',
       ...
       'Cockatoo_public_Caleb', 'Cockatoo_education_Tyler',
       'Cockatoo_public_Shad', 'Mouse_health_Buddy', 'Mouse_health_Modesto',
       'Mouse_lodging_Vicente', 'Mouse_health_Justin', 'Mouse_health_Ileana',
       'Mouse_health_Estela', 'Mouse_science_Micheal'],
      dtype='object', length=1579)

##### **Missing Value** #####

In [11]:
electricity_df.isna().sum()

timestamp                     0
Panther_parking_Lorriane     11
Panther_lodging_Cora         11
Panther_office_Hannah        12
Panther_lodging_Hattie       12
                           ... 
Mouse_lodging_Vicente         0
Mouse_health_Justin           0
Mouse_health_Ileana         165
Mouse_health_Estela           0
Mouse_science_Micheal         0
Length: 1579, dtype: int64

##### **Missing Value (%)** #####

In [12]:
(electricity_df.isna().mean() * 100).sort_values(ascending=False)

Eagle_lodging_Garland      100.000000
Rat_public_Ulysses         100.000000
Bobcat_education_Barbra     99.373005
Bobcat_education_Seth       96.243730
Rat_education_Mac           94.795942
                              ...    
Lamb_public_Angeline         0.000000
Lamb_education_Harold        0.000000
Lamb_office_Peggy            0.000000
Lamb_office_William          0.000000
Eagle_assembly_Ian           0.000000
Length: 1579, dtype: float64
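
The per-column percentages above can be turned into a simple screening step. The following is a minimal sketch on a synthetic toy frame, not part of the original notebook: the helper name `columns_above_missing_threshold` and the 50% cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df: pd.DataFrame, threshold_pct: float) -> list:
    """Return names of columns whose share of NaN values exceeds threshold_pct."""
    missing_pct = df.isna().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Toy stand-in for electricity_df (synthetic values, not real readings).
toy_df = pd.DataFrame(
    {
        "timestamp": [
            "2016-01-01 00:00:00",
            "2016-01-01 01:00:00",
            "2016-01-01 02:00:00",
            "2016-01-01 03:00:00",
        ],
        "meter_a": [1.0, 2.0, 3.0, 4.0],
        "meter_b": [np.nan, np.nan, np.nan, 1.0],
        "meter_c": [np.nan, np.nan, np.nan, np.nan],
    }
)

# The 50% cutoff is an arbitrary choice for illustration.
mostly_missing = columns_above_missing_threshold(toy_df, threshold_pct=50.0)
print(mostly_missing)
```

On the real data, the same call with `electricity_df` would list meters such as those at the top of the sorted output, which are candidates for exclusion in a later cleaning step.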

##### **Statistical Summary** #####

In [13]:
electricity_df.describe()

Unnamed: 0,Panther_parking_Lorriane,Panther_lodging_Cora,Panther_office_Hannah,Panther_lodging_Hattie,Panther_education_Teofila,Panther_education_Jerome,Panther_retail_Felix,Panther_parking_Asia,Panther_education_Misty,Panther_retail_Gilbert,...,Cockatoo_public_Caleb,Cockatoo_education_Tyler,Cockatoo_public_Shad,Mouse_health_Buddy,Mouse_health_Modesto,Mouse_lodging_Vicente,Mouse_health_Justin,Mouse_health_Ileana,Mouse_health_Estela,Mouse_science_Micheal
count,17533.0,17533.0,17532.0,17532.0,17533.0,17534.0,17533.0,17533.0,17524.0,17532.0,...,16110.0,16116.0,16113.0,17483.0,17540.0,17544.0,17544.0,17379.0,17544.0,17544.0
mean,8.661108,109.326952,5.396102,113.520501,126.329541,392.126185,96.513085,21.7257,27.257845,0.594539,...,181.787028,770.348807,129.774161,6.134131,482.296901,41.447504,708.649868,39.434,348.329542,180.161176
std,4.900062,56.498653,4.223728,61.636034,71.783004,195.116299,62.818199,13.532895,15.092654,0.408065,...,54.150864,47.807854,37.264065,8.463139,266.172973,18.780824,284.883572,23.680096,144.408599,263.682848
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,83.35,548.8,60.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.6613,108.2709,2.6765,89.9374,100.8495,444.3858,56.0978,16.0031,17.9235,0.3301,...,133.425,734.56875,97.525,0.0,310.0305,30.0,616.11125,25.0,299.5,0.0
50%,9.6019,126.6244,4.7619,133.7058,141.8664,469.3906,96.1006,18.0035,34.6467,0.5501,...,172.4875,763.925,126.55,0.0,433.645,40.0,727.7025,32.0,337.89795,0.0
75%,12.6625,145.278,7.952,158.2705,179.4046,498.3712,148.1366,35.8069,39.0475,0.8902,...,228.875,803.29375,158.825,17.6449,671.24325,60.0,870.693,56.0,438.0,336.4486
max,25.4972,285.4512,27.7704,298.0532,375.0384,1052.711,279.1259,54.8043,74.7331,1.5203,...,316.35,925.55,278.575,40.8163,1267.846,100.0,1487.017,126.0,665.0,2000.0


##### **Structure** #####

In [14]:
electricity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17544 entries, 0 to 17543
Columns: 1579 entries, timestamp to Mouse_science_Micheal
dtypes: float64(1578), object(1)
memory usage: 211.3+ MB
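
The reported memory usage follows from the layout: 1,578 float64 columns × 17,544 rows × 8 bytes is roughly 221 million bytes, or about 211 MiB. If memory becomes a constraint, one option is to store the readings as float32, which halves the footprint at the cost of precision (about 7 significant digits). The sketch below uses synthetic data; the helper name `downcast_float_columns` is hypothetical, and downcasting is not something this notebook applies.

```python
import numpy as np
import pandas as pd


def downcast_float_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every float64 column stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})


# Synthetic stand-in: 1,000 rows x 20 float64 "meter" columns.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    rng.random((1000, 20)) * 500,
    columns=[f"meter_{i}" for i in range(20)],
)

before_bytes = toy_df.memory_usage(deep=True).sum()
small_df = downcast_float_columns(toy_df)
after_bytes = small_df.memory_usage(deep=True).sum()
print(before_bytes, after_bytes)
```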


##### **Data Types** #####

In [15]:
electricity_df.dtypes

timestamp                    object
Panther_parking_Lorriane    float64
Panther_lodging_Cora        float64
Panther_office_Hannah       float64
Panther_lodging_Hattie      float64
                             ...   
Mouse_lodging_Vicente       float64
Mouse_health_Justin         float64
Mouse_health_Ileana         float64
Mouse_health_Estela         float64
Mouse_science_Micheal       float64
Length: 1579, dtype: object

### **Key Findings** ###
- Rows: 17,544 hourly timestamps, from 2016-01-01 00:00:00 to 2017-12-31 23:00:00 (731 days × 24 hours)
- Columns: 1,579
  - First column: timestamp (object)
  - Remaining 1,578 columns: building/meter energy readings (float64)
- Memory usage: ~211 MB, large for a single in-memory DataFrame
- Missing values:
  - None in the timestamp column
  - Present in many meter columns, ranging from 0% to 100% per column
- The timestamp column is stored as object and needs to be converted to datetime
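
The timestamp conversion noted above, together with a check that no hour is missing from the index, could look like the sketch below. It runs on a three-row synthetic frame; the helper names `to_hourly_index` and `missing_hours` are illustrative, and the format string assumes timestamps shaped like those shown in `head()`.

```python
import pandas as pd


def to_hourly_index(df: pd.DataFrame, column: str = "timestamp") -> pd.DataFrame:
    """Parse `column` as datetime and use it as a sorted index."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], format="%Y-%m-%d %H:%M:%S")
    return out.set_index(column).sort_index()


def missing_hours(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return the hourly timestamps absent between the first and last entry."""
    full_range = pd.date_range(index.min(), index.max(), freq=pd.Timedelta(hours=1))
    return full_range.difference(index)


# Synthetic stand-in for electricity_df.
toy_df = pd.DataFrame(
    {
        "timestamp": [
            "2016-01-01 00:00:00",
            "2016-01-01 01:00:00",
            "2016-01-01 03:00:00",  # 02:00 is deliberately left out
        ],
        "meter_a": [0.0, 1.5, 2.5],
    }
)

indexed = to_hourly_index(toy_df)
gaps = missing_hours(indexed.index)
print(indexed.index.dtype, list(gaps))
```

An empty result from `missing_hours` on the real data would confirm that the 17,544 rows form a gap-free hourly series.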

#### **Inspection of the weather dataset** ####

In [16]:
weather_df.head()

Unnamed: 0,timestamp,site_id,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed
0,2016-01-01 00:00:00,Panther,19.4,,19.4,0.0,,,0.0,0.0
1,2016-01-01 01:00:00,Panther,21.1,6.0,21.1,-1.0,,1019.4,0.0,0.0
2,2016-01-01 02:00:00,Panther,21.1,,21.1,0.0,,1018.8,210.0,1.5
3,2016-01-01 03:00:00,Panther,20.6,,20.0,0.0,,1018.1,0.0,0.0
4,2016-01-01 04:00:00,Panther,21.1,,20.6,0.0,,1019.0,290.0,1.5


In [17]:
weather_df.tail()

Unnamed: 0,timestamp,site_id,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed
331161,2017-12-31 19:00:00,Mouse,8.5,,4.8,,,992.3,210.0,8.2
331162,2017-12-31 20:00:00,Mouse,8.5,,4.5,,,992.1,210.0,7.2
331163,2017-12-31 21:00:00,Mouse,8.2,,4.0,,,992.1,230.0,10.3
331164,2017-12-31 22:00:00,Mouse,7.5,,4.3,,,993.7,260.0,12.9
331165,2017-12-31 23:00:00,Mouse,7.2,,3.7,,,995.7,260.0,10.3


##### **Dataset Shape** #####

In [18]:
print("Weather data shape:", weather_df.shape)

Weather data shape: (331166, 10)


##### **Columns** #####

In [19]:
weather_df.columns

Index(['timestamp', 'site_id', 'airTemperature', 'cloudCoverage',
       'dewTemperature', 'precipDepth1HR', 'precipDepth6HR', 'seaLvlPressure',
       'windDirection', 'windSpeed'],
      dtype='object')

##### **Missing Value** #####

In [20]:
weather_df.isna().sum()

timestamp              0
site_id                0
airTemperature       128
cloudCoverage     170987
dewTemperature       328
precipDepth1HR    133186
precipDepth6HR    313004
seaLvlPressure     21624
windDirection      13005
windSpeed            574
dtype: int64

##### **Missing Value (%)** #####

In [21]:
(weather_df.isna().mean() * 100).sort_values(ascending=False)

precipDepth6HR    94.515741
cloudCoverage     51.631810
precipDepth1HR    40.217293
seaLvlPressure     6.529656
windDirection      3.927034
windSpeed          0.173327
dewTemperature     0.099044
airTemperature     0.038651
site_id            0.000000
timestamp          0.000000
dtype: float64
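
Because the weather table stacks several sites in one frame, the overall percentages can hide sites where a variable is entirely absent. A minimal sketch of a per-site breakdown follows, using a synthetic four-row frame; the helper name `missing_pct_by_site` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_pct_by_site(df: pd.DataFrame, site_column: str = "site_id") -> pd.DataFrame:
    """Percentage of NaN values per column, computed separately for each site."""
    value_cols = df.columns.drop(site_column)
    return df[value_cols].isna().groupby(df[site_column]).mean() * 100


# Synthetic stand-in for weather_df (values are illustrative only).
toy_weather = pd.DataFrame(
    {
        "site_id": ["Panther", "Panther", "Mouse", "Mouse"],
        "airTemperature": [19.4, 21.1, 8.5, np.nan],
        "cloudCoverage": [np.nan, 6.0, np.nan, np.nan],
    }
)

per_site = missing_pct_by_site(toy_weather)
print(per_site)
```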

##### **Statistical Summary** #####

In [22]:
weather_df.describe()

Unnamed: 0,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed
count,331038.0,160179.0,330838.0,197980.0,18162.0,309542.0,318161.0,330592.0
mean,14.235343,1.920907,7.64937,0.955738,13.53656,1016.063498,184.391299,3.569554
std,9.990392,2.550744,9.201438,8.273852,43.801017,8.052463,111.571354,2.335197
min,-28.9,0.0,-35.0,-1.0,-1.0,968.2,0.0,0.0
25%,7.8,0.0,1.8,0.0,0.0,1011.6,90.0,2.1
50%,14.4,0.0,8.5,0.0,0.0,1016.2,200.0,3.1
75%,21.1,4.0,13.9,0.0,5.0,1020.9,280.0,5.0
max,48.3,9.0,26.7,597.0,770.0,1050.1,360.0,24.2


##### **Structure** #####

In [23]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331166 entries, 0 to 331165
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   timestamp       331166 non-null  object 
 1   site_id         331166 non-null  object 
 2   airTemperature  331038 non-null  float64
 3   cloudCoverage   160179 non-null  float64
 4   dewTemperature  330838 non-null  float64
 5   precipDepth1HR  197980 non-null  float64
 6   precipDepth6HR  18162 non-null   float64
 7   seaLvlPressure  309542 non-null  float64
 8   windDirection   318161 non-null  float64
 9   windSpeed       330592 non-null  float64
dtypes: float64(8), object(2)
memory usage: 25.3+ MB


##### **Data Types** #####

In [24]:
weather_df.dtypes

timestamp          object
site_id            object
airTemperature    float64
cloudCoverage     float64
dewTemperature    float64
precipDepth1HR    float64
precipDepth6HR    float64
seaLvlPressure    float64
windDirection     float64
windSpeed         float64
dtype: object

### **Key Findings** ###

- Rows: 331,166 hourly records across multiple sites, from 2016-01-01 00:00:00 to 2017-12-31 23:00:00
- Columns: 10
  - timestamp (object)
  - site_id (object)
  - 8 weather variables (float64)
- Key Features:
  - airTemperature
  - dewTemperature
  - windSpeed
  - seaLvlPressure
  - cloudCoverage
  - precipDepth1HR
  - precipDepth6HR
- Timestamp: object (needs conversion to datetime later)
- Missing Values:
  - High share of missing values in:
    - precipDepth6HR (94.5%)
    - cloudCoverage (51.6%)
    - precipDepth1HR (40.2%)
- Memory Usage: ~25 MB (manageable)
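
The row count is worth a closer look: 331,166 is not a multiple of 17,544 (the number of hours in the two-year range), so at least one site cannot have exactly one row per hour. A rough sketch of a per-site coverage check is shown below on synthetic data; the helper name `site_coverage` is an assumption for illustration.

```python
import pandas as pd

# 731 days x 24 hours between 2016-01-01 00:00 and 2017-12-31 23:00.
EXPECTED_HOURS = 731 * 24


def site_coverage(df: pd.DataFrame, expected_rows: int) -> pd.DataFrame:
    """Rows per site and how many rows each site is short of expected_rows."""
    coverage = df.groupby("site_id").size().rename("rows").to_frame()
    coverage["missing_rows"] = expected_rows - coverage["rows"]
    return coverage


# Synthetic stand-in for weather_df with a 3-hour window.
toy_weather = pd.DataFrame(
    {
        "timestamp": [
            "2016-01-01 00:00:00",
            "2016-01-01 01:00:00",
            "2016-01-01 02:00:00",
            "2016-01-01 00:00:00",
            "2016-01-01 02:00:00",
        ],
        "site_id": ["Panther", "Panther", "Panther", "Mouse", "Mouse"],
    }
)

coverage = site_coverage(toy_weather, expected_rows=3)
print(coverage)
```

On the real data, `site_coverage(weather_df, EXPECTED_HOURS)` would show which sites are short of rows; a negative `missing_rows` would point to duplicate timestamps instead.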

## **Observation** ##

#### Electricity Dataset
- The electricity dataset contains hourly energy consumption readings for 1,578 buildings/meters, spanning two years (2016-01-01 to 2017-12-31).
- The dataset is wide, with each column representing a separate building or meter.
- The timestamp column is currently of type object and will need conversion to datetime.
- Missing values are present in many meter columns, with per-column rates ranging from 0% to 100%, likely due to sensor downtime.
- The large size and multivariate nature of the dataset make it suitable for time-series anomaly detection across multiple entities.

#### Weather Dataset
- The weather dataset provides contextual environmental information such as temperature, wind speed, pressure, and precipitation.
- It contains multiple weather variables recorded hourly for each site (site_id).
- Some features contain a high percentage of missing values (e.g., precipDepth6HR at 94.5% and cloudCoverage at 51.6%).
- The dataset is relatively structured and consistent, making it suitable for use as external contextual features during anomaly detection.

#### Overall Assessment
- Both datasets cover the same period (2016-01-01 00:00:00 to 2017-12-31 23:00:00) at hourly resolution and are appropriate for time-series energy anomaly detection.
- Electricity data serves as the core signal, while weather data provides valuable contextual signals.
- No data cleaning or transformations have been applied at this stage, as this notebook is strictly focused on data understanding.
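
For a later step, the two tables could be aligned per meter. The sketch below is only an illustration under a stated assumption: that the first token of each meter column name (e.g. `Panther` in `Panther_office_Hannah`) corresponds to `site_id` in the weather table, as the previews above suggest. The helper names `meter_site` and `join_meter_with_weather` are hypothetical, and the data is synthetic.

```python
import pandas as pd


def meter_site(column_name: str) -> str:
    """Site prefix of a meter column, e.g. 'Panther_office_Hannah' -> 'Panther'."""
    return column_name.split("_", 1)[0]


def join_meter_with_weather(
    electricity: pd.DataFrame, weather: pd.DataFrame, meter: str
) -> pd.DataFrame:
    """Left-join one meter's readings with the weather rows of its site."""
    site_weather = weather[weather["site_id"] == meter_site(meter)]
    readings = electricity[["timestamp", meter]]
    return readings.merge(site_weather, on="timestamp", how="left")


# Synthetic stand-ins for electricity_df and weather_df.
toy_electricity = pd.DataFrame(
    {
        "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
        "Panther_office_Hannah": [0.0, 1.0],
        "Mouse_health_Buddy": [8.8, 17.6],
    }
)
toy_weather = pd.DataFrame(
    {
        "timestamp": [
            "2016-01-01 00:00:00",
            "2016-01-01 01:00:00",
            "2016-01-01 00:00:00",  # Mouse has no row for 01:00
        ],
        "site_id": ["Panther", "Panther", "Mouse"],
        "airTemperature": [19.4, 21.1, 8.5],
    }
)

merged = join_meter_with_weather(toy_electricity, toy_weather, "Mouse_health_Buddy")
print(merged)
```

A left join keeps every electricity timestamp, so hours without a matching weather row surface as NaN rather than silently disappearing.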