# Project Work, Part 4 - Machine Learning
## 1. Introduction
This project involves analysing data and implementing a machine learning model in a Jupyter Notebook, then building a multi-page online app with Streamlit, with all work and code shared on GitHub. AI tools (e.g., ChatGPT) were used during the project to clarify requirements and to gain a deeper understanding of the technologies involved.

- Task: Analysis of Norwegian electricity production (Elhub) and meteorological data (Open‑Meteo API).
- Goal: automate data collection, perform time‑series decomposition, periodic analysis, and anomaly detection; then visualize results in a Jupyter Notebook and Streamlit dashboard.

## 2. Repository and App Links
- GitHub: https://github.com/Indraadhikari/IND320_Indra
- Streamlit app: https://ind320-k2r8aymxk9takanegm8e3y.streamlit.app

## 3. Project Overview
### 3.1 AI Usage Description
In this project, I used AI (ChatGPT) as a helpful assistant during development. It supported me in solving coding errors, generating code ideas, and improving my understanding of concepts. The AI explained topics such as STL decomposition, Discrete Cosine Transform (DCT) filtering, and Local Outlier Factor (LOF) anomaly detection, giving both theory and example code.

I also used it to debug Python and Streamlit issues, like fixing empty DataFrames, using st.session_state, avoiding runtime errors, and organizing the multi-page layout. During implementation, I followed AI suggestions to clean up functions, set better parameter defaults, and make the visualizations easier to read.

All AI outputs were carefully checked, tested, and modified to fit the project’s goals and my own coding style. Overall, the AI acted as a learning and support tool, helping me work faster and understand data analysis and software design more deeply.

### 3.2 Project Log
For the compulsory work, I began by defining representative cities for Norway’s five electricity price areas (NO1–NO5) and storing their latitude and longitude in a Pandas DataFrame. This mapping created the geographic foundation for the rest of the analyses. I then downloaded hourly electricity production data from the Elhub API for 2021, focusing on the *PRODUCTION_PER_GROUP_MBA_HOUR* dataset. The raw *JSON* responses were normalized into a clean DataFrame.
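The area-to-city mapping described above can be sketched as a small DataFrame. This is an illustration only: the city choices, the rounded coordinates, and the helper name `coordinates_for` are assumptions, not values taken from the project.

```python
import pandas as pd

# Assumed mapping: cities and rounded coordinates are illustrative placeholders.
price_areas = pd.DataFrame(
    {
        "price_area": ["NO1", "NO2", "NO3", "NO4", "NO5"],
        "city": ["Oslo", "Kristiansand", "Trondheim", "Tromsø", "Bergen"],
        "latitude": [59.91, 58.15, 63.43, 69.65, 60.39],
        "longitude": [10.75, 8.00, 10.40, 18.96, 5.32],
    }
).set_index("price_area")


def coordinates_for(area: str) -> tuple[float, float]:
    """Return (latitude, longitude) for a price area code such as 'NO1'."""
    row = price_areas.loc[area]
    return float(row["latitude"]), float(row["longitude"])
```

Indexing by price area keeps the later weather lookups to a single `.loc` call per selected region.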

Next, I replaced my earlier CSV‑based meteorological import with live calls to the Open‑Meteo API. For each selected price area, the application automatically queries the API using the corresponding city’s coordinates and returns hourly temperature, precipitation, and wind observations: for 2019 in the Jupyter Notebook and for 2021 in the Streamlit app. The fetched data are transformed into a tidy Pandas DataFrame and cached for efficient reuse.
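A minimal sketch of the weather step, assuming the Open‑Meteo historical archive endpoint and the hourly variable names shown here; the project may request different variables. The helper names `build_params` and `hourly_to_frame` are hypothetical. The network call is left as a comment so the parsing part can be tried on a saved response.

```python
import pandas as pd

# Assumed endpoint for historical hourly data.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_params(lat, lon, year,
                 variables=("temperature_2m", "precipitation", "wind_speed_10m")):
    """Build query parameters for one full calendar year at one location."""
    return {
        "latitude": lat,
        "longitude": lon,
        "start_date": f"{year}-01-01",
        "end_date": f"{year}-12-31",
        "hourly": ",".join(variables),
    }


def hourly_to_frame(payload):
    """Turn the 'hourly' block of a JSON response into a time-indexed DataFrame."""
    frame = pd.DataFrame(payload["hourly"])
    frame["time"] = pd.to_datetime(frame["time"])
    return frame.set_index("time")


# Usage (requires network access):
# import requests
# response = requests.get(ARCHIVE_URL, params=build_params(59.91, 10.75, 2021))
# weather = hourly_to_frame(response.json())
```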

Analytical development was divided into three main components, implemented and tested first in a Jupyter Notebook.
- Seasonal‑Trend decomposition using LOESS (STL): using the *statsmodels.tsa.seasonal.STL* class, I decomposed the production time series into trend, seasonal, and residual components.
- Spectrogram analysis: applying *scipy.signal.spectrogram*, I generated time–frequency plots to reveal changes in periodic behavior across the year.
- Outlier and anomaly detection: I implemented a robust Statistical Process Control (SPC) method using *Median ± k × MAD* boundaries on filtered temperature data and applied the Local Outlier Factor (LOF) algorithm from *scikit‑learn* to identify precipitation anomalies.

Each analytical block was wrapped in a modular Python function with configurable parameters (area, group, window length, etc.) and tested interactively in the notebook before integration into the Streamlit app.
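The robust SPC idea from the list above can be sketched as follows. It assumes the filtering step is a DCT high-pass (the lowest-frequency coefficients are zeroed so that slow seasonal variation does not dominate); the function name `spc_outliers` and the defaults for `cutoff` and `k` are hypothetical, not the project's settings.

```python
import numpy as np
from scipy.fft import dct, idct


def spc_outliers(values, cutoff=10, k=3.0):
    """Flag points outside median ± k × MAD after a DCT high-pass filter.

    values : 1-D sequence of observations (e.g. hourly temperature)
    cutoff : number of low-frequency DCT coefficients to zero (assumed default)
    k      : width of the control band in MAD units (assumed default)
    """
    x = np.asarray(values, dtype=float)

    # High-pass filter: drop the slowest components, keep short-term variation.
    coeffs = dct(x, norm="ortho")
    coeffs[:cutoff] = 0.0
    filtered = idct(coeffs, norm="ortho")

    # Robust centre and spread of the filtered series.
    median = np.median(filtered)
    mad = np.median(np.abs(filtered - median))

    lower, upper = median - k * mad, median + k * mad
    mask = (filtered < lower) | (filtered > upper)
    return mask, (lower, upper)
```

The returned boolean mask has the same length as the input, so it can be used to index the original series or to colour points in a plot.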

I then updated the Streamlit dashboard to follow the new required page order. The global area selector was moved to the second page (named *Energy Production(4)* in the app), so that all subsequent analyses depend on the user’s chosen region. Between existing pages, I added two new pages, new A (*STL and Spectrogram(A)*) and new B (*Outliers and Anomalies (B)*), each built with *st.tabs()* for navigation. Both pages render and display Matplotlib plots directly. Communication between pages is managed through *st.session_state*, which allows the selected price area, the imported meteorological data, and the production data to persist throughout the session.
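One way to share data between pages is a small "load once, reuse afterwards" helper. This is a sketch under assumptions: `get_or_load` and `fetch_weather` are hypothetical names, and the helper accepts any dictionary-like object so it can be tried with a plain `dict` as well as with `st.session_state`.

```python
from typing import Any, Callable, MutableMapping


def get_or_load(state: MutableMapping[str, Any], key: str,
                loader: Callable[[], Any]) -> Any:
    """Return state[key], calling loader() only the first time the key is needed."""
    if key not in state:
        state[key] = loader()
    return state[key]


# Usage inside a Streamlit page (fetch_weather is a hypothetical loader):
# import streamlit as st
# area = st.session_state.get("price_area", "NO1")
# weather = get_or_load(st.session_state, f"weather_{area}",
#                       lambda: fetch_weather(area))
```

Keying the cached entry by price area means a change in the selector triggers a new fetch, while a page switch reuses the stored data.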

The completed workflow demonstrates a full data pipeline: acquiring data dynamically via APIs, performing time‑series analysis, detecting anomalies, and presenting interactive results through a structured Streamlit interface and Jupyter Notebook.

## 4. Importing Libraries

In [1]:
import requests 
import pandas as pd
import calendar
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import dct, idct
from sklearn.neighbors import LocalOutlierFactor
from statsmodels.tsa.seasonal import STL
from scipy.signal import spectrogram

## 5. Working with Data Sources

### 5.1 Connection Check for Cassandra

In [2]:
from cassandra.cluster import Cluster

try:
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect()
    print("✅ Connected to Cassandra!")
    print("Cluster name:", cluster.metadata.cluster_name)
    print("Hosts:", cluster.metadata.all_hosts())
    cluster.shutdown()
except Exception as e:
    print("❌ Connection failed:", e)

✅ Connected to Cassandra!
Cluster name: Test Cluster
Hosts: [<Host: ::1:9042 datacenter1>]


### 5.2 Connection Check for MongoDB

In [3]:
from pymongo.mongo_client import MongoClient

c_file = '/Users/indra/Documents/Masters in Data Science/Data to Decision/IND320_Indra/No_sync/MongoDB.txt'  # credential file
# Read the username and password (one per line) and close the file afterwards
with open(c_file) as f:
    USR, PWD = f.read().splitlines()

uri = "mongodb+srv://"+USR+":"+PWD+"@cluster0.wmoqhtp.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"

# Create a new client and connect to the server
client = MongoClient(uri)

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


### 5.3 Reading Data from Elhub API

In [6]:
import requests

headers = {}  # no custom request headers are sent

endpoint = "https://api.elhub.no/energy-data/v0/"
entity = 'price-areas'
dataset = "PRODUCTION_PER_GROUP_MBA_HOUR"
#startdate = '2022-01-01T00:20:00%2B02:00'
#enddate = '2024-12-31T23:59:59%2B02:00'
year = [2022, 2023, 2024]

In [7]:
import calendar
import pandas as pd

dates = []
for yr in year:  # 'year' is the list of years defined above; it is no longer overwritten
    # Request one month at a time, since the endpoint does not return a whole year in one call.
    for month in range(1, 13):
        # Get the number of days in the month
        _, last_day = calendar.monthrange(yr, month)

        # Zero-pad month and day (e.g. 01, 02, …)
        startdate = f"{yr}-{month:02d}-01T00:20:00%2B02:00"
        enddate = f"{yr}-{month:02d}-{last_day:02d}T23:59:59%2B02:00"

        dates.append((startdate, enddate))

all_data = []

for startdate, enddate in dates:
    # The dates are already URL-encoded (%2B is "+"), so they go into the URL as-is.
    url = f"{endpoint}{entity}?dataset={dataset}&startDate={startdate}&endDate={enddate}"
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # stop on an HTTP error instead of failing while parsing
    data = response.json()
    # Each entry in data['data'] carries a list of hourly production records.
    for item in data['data']:
        all_data.extend(item['attributes']['productionPerGroupMbaHour'])

# Combine all monthly records into one DataFrame
df = pd.DataFrame(all_data)
print(df.shape)

(656700, 6)


In [8]:
df.head()

|   | endTime | lastUpdatedTime | priceArea | productionGroup | quantityKwh | startTime |
|---|---|---|---|---|---|---|
| 0 | 2022-01-01T02:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1246209.4 | 2022-01-01T01:00:00+01:00 |
| 1 | 2022-01-01T03:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1271757.0 | 2022-01-01T02:00:00+01:00 |
| 2 | 2022-01-01T04:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1204251.8 | 2022-01-01T03:00:00+01:00 |
| 3 | 2022-01-01T05:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1202086.9 | 2022-01-01T04:00:00+01:00 |
| 4 | 2022-01-01T06:00:00+01:00 | 2025-02-01T18:02:57+01:00 | NO1 | hydro | 1235809.9 | 2022-01-01T05:00:00+01:00 |
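The time columns above are still plain strings with a UTC offset. A possible next step, sketched here as an assumption rather than the notebook's actual code, is to parse them into timezone-aware timestamps. Converting to UTC avoids mixed offsets across the year (+01:00 in winter, +02:00 in summer). The helper name `parse_elhub_times` is hypothetical.

```python
import pandas as pd


def parse_elhub_times(frame):
    """Return a copy with the Elhub time columns parsed as UTC timestamps."""
    out = frame.copy()
    for col in ("startTime", "endTime", "lastUpdatedTime"):
        if col in out.columns:
            # utc=True converts every offset to UTC so the column has one dtype.
            out[col] = pd.to_datetime(out[col], utc=True)
    return out


# Usage with the DataFrame built above:
# df = parse_elhub_times(df)
# df = df.sort_values("startTime")
```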
