## Links
- GitHub repository: https://github.com/Gianna-liu/Weathering_with_you_GegeLiu

- Streamlit app: https://weatheringwithyou-gegeliu.streamlit.app/

For CA3, I added two pages to the app: ```production-data quality``` and ```weather-data quality```.

The structure of the GitHub repository:
- scripts/: This folder contains the Jupyter Notebook, PDF, and HTML files used for data analysis and documentation.

- streamlitApp/: This folder contains all the files required to develop and run the Streamlit application.

- requirements.txt: This file lists the Python package dependencies required to run the project; they can be installed with ```pip install -r requirements.txt```.



In [46]:
import datetime

## Development Log
For this assignment, I implemented methods to detect noise, outliers, and anomalies, which are extremely useful when working with raw or unprocessed data.

Jupyter Notebook

- In the notebook, I first created a dataframe containing geographical information for five cities and retrieved the meteorological data from the Open-Meteo API using the provided Python code. For the main analytical tasks, I applied high-pass filtering with the DCT to remove long-term seasonal effects and detected outliers using robust MAD-based SPC boundaries on this time series. I spent some time tuning the cutoff frequency of the high-pass filter for the temperature variable, since the goal was to remove long-term trends while retaining short-term variations. I selected W_filter = 1/(10*24) cycles per hour, roughly corresponding to a 10-day cycle, which aligns with common weather forecast horizons. I also developed a function that detects anomalies in a time series using the Local Outlier Factor algorithm, where parameters such as contamination and n_neighbors need to be adjusted. When switching to the electricity production data across the five price areas, I used the LOESS-based STL decomposition method to separate the data into trend, seasonal, and residual components. Given that this data is hourly and covers a full year, I set the period parameter to 24 (one day) and experimented with various smoothing parameters, while setting robust=True to make the fit less sensitive to outliers.

Streamlit Application

- For the Streamlit part, the core functions could be reused directly from the notebook, so most of the work involved structuring the interface and organizing multiple visualizations. However, I encountered one challenge when I tried to pass variables between pages using st.session_state. Although the session ID remained consistent, the state values were not preserved across pages. As a result, I had to re-create the selection widgets (e.g., city or variable dropdowns) on each page, which is not user-friendly.
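
A possible explanation (not verified for this app) is that Streamlit removes a widget's key from `st.session_state` when that widget is not rendered in a run, which is what happens after switching pages. A common workaround is to copy the widget value into a separate key that no widget owns. The sketch below shows the pattern on a plain dictionary; `persist_value`, `restore_value`, and the key names are hypothetical, and in the app the mapping would be `st.session_state`.

```python
from collections.abc import MutableMapping


def persist_value(state: MutableMapping, widget_key: str, store_key: str) -> None:
    """Copy a widget's current value into a key that no widget owns."""
    if widget_key in state:
        state[store_key] = state[widget_key]


def restore_value(state: MutableMapping, store_key: str, default):
    """Return the stored value, or the default if nothing was stored yet."""
    return state.get(store_key, default)


# Stand-in for st.session_state (assumption: it behaves like a mutable mapping)
state = {"city_widget": "Bergen"}
persist_value(state, "city_widget", "selected_city")

# Simulate a page switch: the widget key disappears, the stored copy remains
del state["city_widget"]
selected = restore_value(state, "selected_city", default="Oslo")
print(selected)
```

On each page, the restored value could then be used to set the default of the selection widget, for example through the `index` argument of `st.selectbox`.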

## AI usage
During this assignment, I used GitHub Copilot in my VS Code environment to assist with code review, debugging, and improving overall code structure.

I also consulted ChatGPT to learn how to visualize data using Plotly, such as creating spectrograms and STL decomposition plots. Additionally, I explored ChatGPT-generated solutions for passing variables between different Streamlit pages, although these attempts were not fully successful on my end.

Furthermore, I used ChatGPT to gain a deeper understanding of the underlying concepts and techniques, particularly when tuning parameters and interpreting results for anomaly detection and time-series decomposition.

## Tasks

### Library imports

In [1]:
# Basic library
import requests
import pandas as pd
import numpy as np
import json

# Loading data from API
import openmeteo_requests
import requests_cache
from retry_requests import retry

# Loading data from MongoDB
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

# Plotting
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"

# Detecting outliers or anomalies
from scipy.fft import dct, idct
import scipy.stats as stats
from scipy.signal import stft
from sklearn.neighbors import LocalOutlierFactor
from statsmodels.tsa.seasonal import STL

### 1. Create the dataframe for the five cities in Norway

In [2]:
basic_data = {
    "city":["Oslo", "Kristiansand", "Trondheim", "Tromsø", "Bergen"],
    "price_area_code":["NO1", "NO2", "NO3", "NO4", "NO5"],
    "latitude": [59.9127, 58.1467, 63.4305, 69.6489, 60.393],
    "longitude": [10.7461, 7.9956, 10.3951, 18.9551, 5.3242]
}

basic_info = pd.DataFrame(basic_data)

### 2. Create a function to load data using API

In [3]:
def load_data_fromAPI(longitude, latitude, selected_year):
	# Setup the Open-Meteo API client with cache and retry on error
	cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
	retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
	openmeteo = openmeteo_requests.Client(session = retry_session)

	# Make sure all required weather variables are listed here
	# The order of variables in hourly or daily is important to assign them correctly below
	url = "https://archive-api.open-meteo.com/v1/archive"
	params = {
		"latitude": latitude,
		"longitude": longitude,
        "start_date": f"{selected_year}-01-01",
        "end_date": f"{selected_year}-12-31",
		"hourly": ["temperature_2m", "wind_speed_10m", "wind_gusts_10m", "wind_direction_10m", "precipitation"],
		"models": "era5",
		"timezone": "auto",
	}
	responses = openmeteo.weather_api(url, params=params)

	# Process first location. Add a for-loop for multiple locations or weather models
	response = responses[0]
	print(f"Coordinates: {response.Latitude()}°N {response.Longitude()}°E")
	print(f"Date_range: {params['start_date']} - {params['end_date']}")
	print(f"Variables: {params['hourly']}")
	#print(f"Elevation: {response.Elevation()} m asl")
	#print(f"Timezone difference to GMT+0: {response.UtcOffsetSeconds()}s")

	# Process hourly data. The order of variables needs to be the same as requested.
	hourly = response.Hourly()
	hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
	hourly_wind_speed_10m = hourly.Variables(1).ValuesAsNumpy()
	hourly_wind_gusts_10m = hourly.Variables(2).ValuesAsNumpy()
	hourly_wind_direction_10m = hourly.Variables(3).ValuesAsNumpy()
	hourly_precipitation = hourly.Variables(4).ValuesAsNumpy()

	hourly_data = {"date": pd.date_range(
		start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
		end =  pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
		freq = pd.Timedelta(seconds = hourly.Interval()),
		inclusive = "left"
	)}

	hourly_data["temperature_2m"] = hourly_temperature_2m
	hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
	hourly_data["wind_gusts_10m"] = hourly_wind_gusts_10m
	hourly_data["wind_direction_10m"] = hourly_wind_direction_10m
	hourly_data["precipitation"] = hourly_precipitation

	hourly_dataframe = pd.DataFrame(data = hourly_data)
	# Change the time zone to Europe/Oslo
	hourly_dataframe["date"] = hourly_dataframe["date"].dt.tz_convert("Europe/Oslo")
	print("Successfully loaded the data")

	return hourly_dataframe

In [4]:
# Here, we use Bergen as an example to test our function
selected_year = 2019
selected_city = 'Bergen'
# Select the row for the city and extract the coordinates as scalar floats
city_row = basic_info.loc[basic_info["city"] == selected_city].iloc[0]
latitude = float(city_row["latitude"])
longitude = float(city_row["longitude"])
hourly_dataframe = load_data_fromAPI(longitude, latitude, selected_year)

Coordinates: 60.5°N 5.25°E
Date_range: 2019-01-01 - 2019-12-31
Variables: ['temperature_2m', 'wind_speed_10m', 'wind_gusts_10m', 'wind_direction_10m', 'precipitation']
Successfully loaded the data


Briefly explore the dataset: no missing data

In [5]:
hourly_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype                      
---  ------              --------------  -----                      
 0   date                8760 non-null   datetime64[ns, Europe/Oslo]
 1   temperature_2m      8760 non-null   float32                    
 2   wind_speed_10m      8760 non-null   float32                    
 3   wind_gusts_10m      8760 non-null   float32                    
 4   wind_direction_10m  8760 non-null   float32                    
 5   precipitation       8760 non-null   float32                    
dtypes: datetime64[ns, Europe/Oslo](1), float32(5)
memory usage: 239.7 KB


In [6]:
hourly_dataframe.describe()

| | temperature_2m | wind_speed_10m | wind_gusts_10m | wind_direction_10m | precipitation |
|---|---|---|---|---|---|
| count | 8760.0 | 8760.0 | 8760.0 | 8760.0 | 8760.0 |
| mean | 8.846621 | 14.565521 | 29.391081 | 186.588104 | 0.254053 |
| std | 5.089198 | 8.156624 | 15.241786 | 101.865685 | 0.541051 |
| min | -5.45 | 0.0 | 4.68 | 0.674022 | 0.0 |
| 25% | 4.75 | 8.373386 | 17.639999 | 107.915705 | 0.0 |
| 50% | 8.05 | 12.849528 | 26.280001 | 164.664124 | 0.0 |
| 75% | 12.7 | 19.657119 | 38.519997 | 286.471169 | 0.3 |
| max | 29.35 | 55.753529 | 119.879997 | 360.0 | 7.3 |


In [7]:
hourly_dataframe.head()

| | date | temperature_2m | wind_speed_10m | wind_gusts_10m | wind_direction_10m | precipitation |
|---|---|---|---|---|---|---|
| 0 | 2019-01-01 00:00:00+01:00 | 6.7 | 42.671768 | 81.720001 | 260.776154 | 0.4 |
| 1 | 2019-01-01 01:00:00+01:00 | 6.55 | 47.959782 | 87.839996 | 277.765076 | 0.5 |
| 2 | 2019-01-01 02:00:00+01:00 | 6.8 | 48.62133 | 80.279999 | 296.375275 | 0.9 |
| 3 | 2019-01-01 03:00:00+01:00 | 6.85 | 52.63884 | 85.32 | 310.006195 | 0.7 |
| 4 | 2019-01-01 04:00:00+01:00 | 6.55 | 55.753529 | 98.639999 | 314.215271 | 0.6 |


### 3. Outliers and anomalies:

#### Plot the temperature as a function of time.
The ```plot_outlier_detection_dct()``` function performs high-pass filtering with the DCT to remove seasonal effects and detects outliers using robust MAD-based SPC boundaries. It returns an interactive Plotly figure together with a summary dictionary.

In [8]:
def plot_outlier_detection_dct(hourly_dataframe,selected_variable: str, W_filter: float = 1/(10*24), coef_k: float = 3):
    signal = hourly_dataframe[selected_variable].to_numpy(float)
    N = hourly_dataframe.shape[0]
    dt = 1
    W = np.linspace(0, 1/(2*dt), N) # cycles/hour

    # check and fill NaN values with mean
    if np.isnan(signal).any():
        signal = np.nan_to_num(signal, nan=np.nanmean(signal))

    # Discrete Cosine transform
    fourier_signal = dct(signal, norm="ortho")

    # high-pass filter to keep the high-frequency components
    filtered_hp_signal = fourier_signal.copy()
    filtered_hp_signal[(W < W_filter)] = 0 
    satv = idct(filtered_hp_signal, norm="ortho") 

    # low-pass filter to keep the trend
    filtered_lp_signal = fourier_signal.copy()
    filtered_lp_signal[(W > W_filter)] = 0 
    trend = idct(filtered_lp_signal, norm="ortho") 

    # Robust scale estimate: the median absolute deviation (MAD) of the
    # high-pass signal, rescaled by 1.4826 to approximate a standard deviation
    mad_raw = stats.median_abs_deviation(satv, scale=1.0)
    sd = mad_raw * 1.4826
    print(sd)
    # Find the boundaries
    upper_boundary = trend + coef_k * sd
    lower_boundary = trend - coef_k * sd

    # Detect the outliers
    outlier_mask = (signal > upper_boundary) | (signal < lower_boundary)

    fig = go.Figure()

    # Plot trend
    fig.add_trace(go.Scatter(
        x=hourly_dataframe['date'], y=trend,
        mode="lines", line=dict(color="green"),
        name="Trend"
    ))

    # Upper and lower dynamic boundaries
    fig.add_trace(go.Scatter(
        x=hourly_dataframe['date'], y=upper_boundary,
        mode="lines", line=dict(color='red', dash='dash'),
        name=f"{coef_k}*SD (Upper)"
    ))

    fig.add_trace(go.Scatter(
        x=hourly_dataframe['date'], y=lower_boundary,
        mode="lines", line=dict(color='red', dash='dash'),
        name=f"{coef_k}*SD (Lower)"
    ))

    # Normal points
    fig.add_trace(go.Scatter(
        x=hourly_dataframe['date'], y=signal,
        mode="markers", marker=dict(color="blue", size=5),
        name="Normal"
    ))

    # Outliers
    fig.add_trace(go.Scatter(
        x=hourly_dataframe.loc[outlier_mask, 'date'],
        y=signal[outlier_mask],
        mode="markers", marker=dict(color="orange", size=8),
        name="Outliers"
    ))

    fig.update_layout(
        title=f"DCT-Based Outlier Detection (High-pass SATV) for {selected_variable}",
        xaxis_title="Time",
        yaxis_title="Value"
    )
    # add some basic info about the outliers
    summary = {
        "variable": selected_variable,
        "num_sample": N,
        "Sigma Multiplier (k)":coef_k,
        "High-pass Filter W_cutoff":W_filter,
        "num_outliers": int(outlier_mask.sum()),
        "ratio_outlier": round(outlier_mask.sum() / N * 100, 2),
    }
    return fig,summary

In [9]:
# Test the plot_outlier_detection_dct
fig,dct_summary = plot_outlier_detection_dct(hourly_dataframe,"temperature_2m")
for k, v in dct_summary.items():
    print(f"{k}: {v}")

1.7669691328709922
variable: temperature_2m
num_sample: 8760
Sigma Multiplier (k): 3
High-pass Filter W_cutoff: 0.004166666666666667
num_outliers: 28
ratio_outlier: 0.32
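
To make the cutoff more concrete, the sketch below converts a cutoff period in days into cycles per hour and counts how many DCT coefficients fall below it. It assumes the same frequency axis as `plot_outlier_detection_dct()` (`N` evenly spaced values from 0 to 0.5 cycles/hour); the helper names are hypothetical and not part of the notebook.

```python
def cutoff_from_days(days: float, samples_per_day: int = 24) -> float:
    """Cutoff frequency in cycles per sample for a period given in days."""
    return 1.0 / (days * samples_per_day)


def count_low_freq_coeffs(n_samples: int, cutoff: float, dt: float = 1.0) -> int:
    """Count coefficients whose frequency lies strictly below the cutoff.

    Mirrors W = np.linspace(0, 1/(2*dt), n_samples) from the function above.
    """
    nyquist = 1.0 / (2.0 * dt)
    step = nyquist / (n_samples - 1)
    return sum(1 for k in range(n_samples) if k * step < cutoff)


cutoff = cutoff_from_days(10)                    # 1/240 cycles per hour
n_removed = count_low_freq_coeffs(8760, cutoff)  # one year of hourly samples
print(f"cutoff = {cutoff:.6f} cycles/hour, coefficients zeroed = {n_removed}")
```

With 8760 hourly samples, the high-pass step therefore zeroes only the 73 lowest-frequency coefficients; everything above the 10-day cutoff is kept as short-term variation.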


In [10]:
# Show the plot with the detected outliers
fig.show()
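
The factor 1.4826 used in the function rescales the MAD so that it estimates the standard deviation when the data are normally distributed. A minimal sketch with made-up residuals (not taken from the weather data) illustrates why this estimate is less affected by a single spike than the ordinary standard deviation; `robust_sd` is a hypothetical helper.

```python
import statistics


def robust_sd(values):
    """Estimate the standard deviation from the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return 1.4826 * mad


# Illustrative residuals: nine ordinary values and one extreme spike
residuals = [-0.4, 0.1, 0.3, -0.2, 0.0, 0.2, -0.1, 0.4, -0.3, 12.0]

classic = statistics.stdev(residuals)
robust = robust_sd(residuals)
k = 3
center = statistics.median(residuals)
flagged = [v for v in residuals if abs(v - center) > k * robust]

print(f"classic SD = {classic:.2f}, MAD-based SD = {robust:.2f}")
print(f"flagged with k={k}: {flagged}")
```

The spike inflates the classic standard deviation, while the MAD-based estimate stays close to the spread of the ordinary values, so the `k * SD` band remains tight enough to flag the spike.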

#### Plot the precipitation as a function of time.

The ```plot_outlier_detection_lof()``` function detects anomalies in a time series using the Local Outlier Factor (LOF) algorithm and visualizes normal and outlier points in a scatter plot. It returns the Plotly figure together with a summary dictionary.

In [11]:
def plot_outlier_detection_lof(hourly_dataframe, selected_variable: str, contamination: float = 0.01, n_neighbors: int = 50):
    # Prepare the input data for LOF
    selected_data = hourly_dataframe[[selected_variable]].copy()

    # Fit LOF model
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    pred_labels = lof.fit_predict(selected_data)

    # Separate normal and outliers
    outlier_mask = pred_labels == -1
    normal_mask = pred_labels == 1

    # Visualization
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=hourly_dataframe.loc[normal_mask, 'date'],y=selected_data.loc[normal_mask, selected_variable], mode='markers', marker=dict(color='blue'), name='normal'))
    fig.add_trace(go.Scatter(x=hourly_dataframe.loc[outlier_mask, 'date'],y=selected_data.loc[outlier_mask, selected_variable], mode='markers', marker=dict(color='orange'), name='outlier'))
    fig.update_layout(title=f'The distribution of {selected_variable} with outliers', xaxis_title='Time (hourly)', yaxis_title='Values')

    # Add the brief summary
    summary = {
        "variable": selected_variable,
        "num_sample": len(selected_data),
        "contamination_param":contamination,
        "n_neighbors":n_neighbors,
        "num_outliers": outlier_mask.sum(),
        "ratio_outlier (%)": round(outlier_mask.mean() * 100, 2),
        "mean_value": round(selected_data[selected_variable].mean(), 2)
    }

    return fig, summary

In [12]:
# Test the data
fig,lof_summary= plot_outlier_detection_lof(hourly_dataframe, "precipitation")
for k, v in lof_summary.items():
    print(f"{k}: {v}")

variable: precipitation
num_sample: 8760
contamination_param: 0.01
n_neighbors: 50
num_outliers: 80
ratio_outlier (%): 0.91
mean_value: 0.25



Duplicate values are leading to incorrect results. Increase the number of neighbors for more accurate results.
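
The warning above comes from scikit-learn and is probably triggered by the many identical precipitation values: the median in the summary statistics is 0.0, so at least half of the 8760 hours are dry. When more samples share one value than `n_neighbors`, all neighbors of such a point lie at distance zero. The sketch below is one way to check for this before fitting; `largest_duplicate_group` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def largest_duplicate_group(values):
    """Return the most frequent value and how many times it occurs."""
    value, count = Counter(values).most_common(1)[0]
    return value, count


# Made-up hourly precipitation: mostly dry hours with a few wet ones
precip = [0.0] * 60 + [0.1, 0.3, 0.3, 1.2, 2.5, 7.3]
n_neighbors = 50

value, count = largest_duplicate_group(precip)
needs_attention = count > n_neighbors
print(f"most frequent value: {value} ({count} times), exceeds n_neighbors: {needs_attention}")
```

If the check is positive, one option is to raise `n_neighbors`, as the warning suggests; for a variable dominated by zeros, restricting LOF to the hours with non-zero precipitation could be a more practical alternative.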



In [13]:
# show the figure
fig.show()

### 4. Seasonal-Trend decomposition using LOESS (STL)

In [14]:
# Load the data from MongoDB
with open("config_local.json") as f:
    config = json.load(f)

mongo_uri = config["mongo_uri"]

# Create a new client and connect to the server
client = MongoClient(mongo_uri, server_api=ServerApi('1'))
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

# Select the database and the collection with the production data
db = client["elhub_db"] 
collection = db["production_data"] 

Pinged your deployment. You successfully connected to MongoDB!


In [48]:
db["production_data"].find_one({"starttime": {"$gte": datetime.datetime(2021,1,1), "$lt": datetime.datetime(2021,1,2)}})


{'_id': ObjectId('68fab66d9fd3626b63c4bde7'),
 'pricearea': 'NO5',
 'productiongroup': 'thermal',
 'starttime': datetime.datetime(2021, 1, 1, 6, 0),
 'quantitykwh': 77913.0}

In [50]:
df = pd.DataFrame(list(db["production_data"].find({"starttime": {"$lt": datetime.datetime(2021,1,1,6)}})))
print(df.shape)

(144, 5)


In [53]:
df = df[df['pricearea'] == 'NO1']

In [54]:
df

| | _id | pricearea | productiongroup | starttime | quantitykwh |
|---|---|---|---|---|---|
| 12 | 68fab66d9fd3626b63c50237 | NO1 | solar | 2021-01-01 00:00:00 | 6.106 |
| 13 | 68fab66d9fd3626b63c5023b | NO1 | solar | 2021-01-01 04:00:00 | 8.616 |
| 14 | 68fab66d9fd3626b63c50238 | NO1 | solar | 2021-01-01 01:00:00 | 4.03 |
| 15 | 68fab66d9fd3626b63c50239 | NO1 | solar | 2021-01-01 02:00:00 | 3.982 |
| 16 | 68fab66d9fd3626b63c5023a | NO1 | solar | 2021-01-01 03:00:00 | 8.146 |
| 17 | 68fab66d9fd3626b63c5023c | NO1 | solar | 2021-01-01 05:00:00 | 10.207 |
| 36 | 68fab66e9fd3626b63c58ae7 | NO1 | wind | 2021-01-01 04:00:00 | 505.071 |
| 37 | 68fab66e9fd3626b63c58ae4 | NO1 | wind | 2021-01-01 01:00:00 | 649.068 |
| 38 | 68fab66e9fd3626b63c58ae8 | NO1 | wind | 2021-01-01 05:00:00 | 793.071 |
| 39 | 68fab66e9fd3626b63c58ae5 | NO1 | wind | 2021-01-01 02:00:00 | 144.0 |


In [57]:
types = db["production_data"].aggregate([
    {"$match": {"starttime": {"$gte": datetime.datetime(2021,1,1), "$lt": datetime.datetime(2021,1,2)}}},
    {"$project": {"ts_type": {"$type": "$starttime"}}},
    {"$group": {"_id": "$ts_type", "count": {"$sum": 1}}}
])

print(list(types))

[{'_id': 'date', 'count': 576}]


In [15]:
# check the number of data
count = collection.count_documents({})
print(f"Document count: {count:,}")

Document count: 871,658


In [16]:
# Load all documents into a DataFrame, excluding the MongoDB _id field
cursor = collection.find({}, {"_id": 0})
df_production = pd.DataFrame(list(cursor))
df_production.head()

| | pricearea | productiongroup | starttime | quantitykwh |
|---|---|---|---|---|
| 0 | NO5 | thermal | 2021-01-01 06:00:00 | 77913.0 |
| 1 | NO5 | thermal | 2021-01-01 08:00:00 | 78222.0 |
| 2 | NO5 | thermal | 2021-01-01 10:00:00 | 78141.0 |
| 3 | NO5 | thermal | 2021-01-01 11:00:00 | 78399.0 |
| 4 | NO5 | thermal | 2021-01-01 15:00:00 | 78157.0 |


In [41]:
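# Assumed reconstruction: 'datetime' (calendar date) and 'year' are used by the cells below
df_production['datetime'] = pd.to_datetime(df_production['starttime']).dt.normalize()
df_production['year'] = pd.to_datetime(df_production['starttime']).dt.year
df_production['month'] = pd.to_datetime(df_production['starttime']).dt.month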


In [42]:
df_production[(df_production['datetime'] == pd.Timestamp("2021-01-01")) &(df_production['pricearea'] == 'NO1') & (df_production['productiongroup']=='hydro')]

| | pricearea | productiongroup | starttime | quantitykwh | datetime | year | month |
|---|---|---|---|---|---|---|---|
| 153953 | NO1 | hydro | 2021-01-01 00:00:00 | 2507716.8 | 2021-01-01 | 2021 | 1 |
| 153827 | NO1 | hydro | 2021-01-01 01:00:00 | 2494728.0 | 2021-01-01 | 2021 | 1 |
| 153580 | NO1 | hydro | 2021-01-01 02:00:00 | 2486777.5 | 2021-01-01 | 2021 | 1 |
| 153828 | NO1 | hydro | 2021-01-01 03:00:00 | 2461176.0 | 2021-01-01 | 2021 | 1 |
| 153829 | NO1 | hydro | 2021-01-01 04:00:00 | 2466969.2 | 2021-01-01 | 2021 | 1 |
| 153954 | NO1 | hydro | 2021-01-01 05:00:00 | 2467460.0 | 2021-01-01 | 2021 | 1 |
| 153955 | NO1 | hydro | 2021-01-01 06:00:00 | 2482320.8 | 2021-01-01 | 2021 | 1 |
| 153581 | NO1 | hydro | 2021-01-01 07:00:00 | 2509533.0 | 2021-01-01 | 2021 | 1 |
| 153956 | NO1 | hydro | 2021-01-01 08:00:00 | 2550758.2 | 2021-01-01 | 2021 | 1 |
| 153693 | NO1 | hydro | 2021-01-01 09:00:00 | 2693111.0 | 2021-01-01 | 2021 | 1 |


In [43]:
df_production[(df_production['pricearea'] == 'NO1') & (df_production['productiongroup']=='hydro')].groupby(['year','month']).count().sort_index()

| year | month | pricearea | productiongroup | starttime | quantitykwh | datetime |
|---|---|---|---|---|---|---|
| 2021 | 1 | 743 | 743 | 743 | 743 | 743 |
| 2021 | 2 | 671 | 671 | 671 | 671 | 671 |
| 2021 | 3 | 742 | 742 | 742 | 742 | 742 |
| 2021 | 4 | 719 | 719 | 719 | 719 | 719 |
| 2021 | 5 | 743 | 743 | 743 | 743 | 743 |
| 2021 | 6 | 719 | 719 | 719 | 719 | 719 |
| 2021 | 7 | 743 | 743 | 743 | 743 | 743 |
| 2021 | 8 | 743 | 743 | 743 | 743 | 743 |
| 2021 | 9 | 719 | 719 | 719 | 719 | 719 |
| 2021 | 10 | 743 | 743 | 743 | 743 | 743 |


In [33]:
df_production[(df_production['pricearea'] == 'NO1') & (df_production['productiongroup']=='hydro') & (df_production['datetime'] < pd.Timestamp("2022-01-04")) & (df_production['datetime'] > pd.Timestamp("2022-01-01"))].groupby('datetime').count().sort_index()

| datetime | pricearea | productiongroup | starttime | quantitykwh |
|---|---|---|---|---|
| 2022-01-02 | 24 | 24 | 24 | 24 |
| 2022-01-03 | 24 | 24 | 24 | 24 |


In [17]:
df_production['starttime'] = pd.to_datetime(df_production['starttime']) 
df_production = df_production.sort_values("starttime")

The ```plot_stl_decompostion()``` function performs a Seasonal-Trend decomposition using LOESS (STL) on production data for a selected area and production group, and visualizes the observed, trend, seasonal, and residual components in an interactive Plotly subplot.

In [18]:
def plot_stl_decompostion(df_production, area:str = 'NO1',group:str = 'hydro',period:int = 24,seasonal:int = 24*10+1,trend:int =24*30+1 ,robust:bool = True):
    # Prepare the data: select the area and group, and make sure the rows are in time order
    df_subset = df_production[(df_production['pricearea']==area) & (df_production['productiongroup']==group)]
    df_subset = df_subset.sort_values('starttime').reset_index(drop=True)
    # Apply the STL decomposition
    stl = STL(df_subset["quantitykwh"], period=period,seasonal=seasonal,trend=trend,robust=robust)
    res = stl.fit() # Contains the trend, seasonal and residual components
    ## Plot the results
    fig = make_subplots(rows=4, cols=1, shared_xaxes=True,subplot_titles=["Observed", "Trend", "Seasonal", "Residual"],vertical_spacing=0.05)
    # Add original data
    fig.add_trace(go.Scatter(x=df_subset['starttime'],y=df_subset["quantitykwh"],mode='lines',name='Observed',line=dict(color='blue')),row=1, col=1)
    # Add Trend 
    fig.add_trace(go.Scatter(x=df_subset['starttime'],y=res.trend,mode='lines',name='Trend',line=dict(color='orange')),row=2, col=1)
    # Add Seasonal
    fig.add_trace(go.Scatter(x=df_subset['starttime'],y=res.seasonal,mode='lines',name='Seasonal',line=dict(color='green')),row=3, col=1)
    # Add Residual
    fig.add_trace(go.Scatter(x=df_subset['starttime'],y=res.resid,mode='lines',name='Residual',line=dict(color='gray', dash='dot')),row=4, col=1)
    fig.update_layout(title=f'The STL decomposition of area:{area} and productiongroup:{group}', xaxis4_title='Time (hourly)', yaxis_title='Values',height=900)

    return fig

In [23]:
df_production[(df_production['pricearea']=='NO1')& (df_production['productiongroup']=='hydro')].sort_index().head()

| | pricearea | productiongroup | starttime | quantitykwh |
|---|---|---|---|---|
| 153580 | NO1 | hydro | 2021-01-01 02:00:00 | 2486777.5 |
| 153581 | NO1 | hydro | 2021-01-01 07:00:00 | 2509533.0 |
| 153582 | NO1 | hydro | 2021-01-01 10:00:00 | 2762854.8 |
| 153583 | NO1 | hydro | 2021-01-01 13:00:00 | 2779136.0 |
| 153584 | NO1 | hydro | 2021-01-01 15:00:00 | 2846295.5 |


In [19]:
# Test the function
fig = plot_stl_decompostion(df_production)
fig.show()
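
The `seasonal` and `trend` arguments are smoother lengths measured in samples, and statsmodels expects both to be odd integers (with `trend` larger than `period`), which is why the defaults are written as `24*10+1` and `24*30+1`. A small helper can derive valid values from a window given in days; `odd_window` is hypothetical and not part of the notebook.

```python
def odd_window(days: float, samples_per_day: int = 24) -> int:
    """Smallest odd number of samples that covers the given number of days."""
    n = int(round(days * samples_per_day))
    return n if n % 2 == 1 else n + 1


seasonal = odd_window(10)   # 10-day seasonal smoother
trend = odd_window(30)      # 30-day trend smoother
print(seasonal, trend)
```

These values could then be passed on, for example as `plot_stl_decompostion(df_production, seasonal=odd_window(10), trend=odd_window(30))`, which reproduces the defaults used above.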

### 5. Spectrogram

The ```plot_spectrogram_elhub()``` function computes a Short-Time Fourier Transform of production data to visualize how signal energy varies across time and frequency, and returns the spectrogram as a Plotly heatmap.

In [20]:
def plot_spectrogram_elhub(df_production,area: str = "NO1",group: str = "hydro",nperseg: int = 40,noverlap: int = 20):
    # Prepare the dataset
    df_subset = df_production[(df_production["pricearea"] == area)& (df_production["productiongroup"] == group)].sort_values("starttime")
    y = df_subset["quantitykwh"].values # Signal 
    fs = 1 # Sampling frequency (samples per hour)
    # Apply the short-time Fourier transform
    f, t, Zxx = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Visualization
    fig = go.Figure()
    magnitude = np.abs(Zxx)
    start_time = df_subset["starttime"].iloc[0]
    t_datetime = [start_time + pd.Timedelta(hours=float(h)) for h in t]

    fig.add_trace(go.Heatmap(
        x=t_datetime,
        y=f,
        z=magnitude,
        colorbar=dict(title="Amplitude"),
        zmin=0,
        zmax=magnitude.max() * 0.8,
        hovertemplate="Date: %{x|%Y-%m-%d %H:%M}<br>Freq: %{y:.4f}/h<br>Amp: %{z:.2f}<extra></extra>"
    ))

    fig.update_layout(
        title=f"Spectrogram of {group} production — {area}",
        xaxis_title='Time (hourly)',
        yaxis_title="Frequency [1/hour]",
        template="plotly_white",
        height=600,
    )
    fig.update_yaxes(range=[0, 0.05]) # focus on the low-frequency
    return fig

In [21]:
# Test the function
fig = plot_spectrogram_elhub(df_production)
fig.show()
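
With `fs = 1` sample per hour, the frequency bins of the STFT are spaced `fs / nperseg` apart (assuming the FFT length equals `nperseg`, which is the SciPy default). The sketch below lists the bins that fall inside the plotted range for the default `nperseg = 40`; the variable names are illustrative.

```python
fs = 1.0           # samples per hour
nperseg = 40       # default window length of plot_spectrogram_elhub()
f_max_plot = 0.05  # upper limit of the y-axis in the plot

# One-sided frequency bins of a length-nperseg FFT
freqs = [k * fs / nperseg for k in range(nperseg // 2 + 1)]
visible = [f for f in freqs if f <= f_max_plot + 1e-12]
daily = 1 / 24     # frequency of a 24-hour cycle in cycles/hour

print(f"bin spacing: {fs / nperseg} cycles/hour")
print(f"bins shown: {visible}")
print(f"daily cycle at {daily:.4f} cycles/hour")
```

Only three bins are visible in the plotted range, and the daily cycle (about 0.0417 cycles/hour) falls between two of them. A window length that is a multiple of 24, such as `nperseg=240`, would place a bin exactly on the daily frequency, at the cost of coarser time resolution.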