**https://github.com/MajedAlmnety/IND320-Project-Part1**

**https://ind320-project-part1-mvjrbnbyefycsijdcn3im4.streamlit.app/**

In [19]:


# Core setup
import os
import time
import json
from datetime import datetime, timedelta
from typing import List, Dict
from zoneinfo import ZoneInfo

import requests
import pandas as pd
import plotly.express as px
from dotenv import load_dotenv
from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum, month

# Configuration
ELHUB_BASE_URL = "https://api.elhub.no/energy-data/v0/price-areas"
DATASET = "PRODUCTION_PER_GROUP_MBA_HOUR"
#OSLO = ZoneInfo("Europe/Oslo")
UTC = ZoneInfo("UTC")

YEAR = 2021        
PAUSE_S = 0.4      # gentle pause between API calls
TIMEOUT_S = 30     # request timeout


In [21]:
# Construct (start_of_month, start_of_next_month) for the year
month_windows = []
for m in range(1, 12 + 1):  # months 1..12
    start = datetime(YEAR, m, 1, 0, 0, 0, tzinfo=UTC)
    next_year = YEAR + (m // 12)
    next_month = (m % 12) + 1
    end = datetime(next_year, next_month, 1, 0, 0, 0, tzinfo=UTC)
    month_windows.append((start, end))

# quick peek of the first three windows (non-verbose)
month_windows[:3]


[(datetime.datetime(2021, 1, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC')),
  datetime.datetime(2021, 2, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC'))),
 (datetime.datetime(2021, 2, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC')),
  datetime.datetime(2021, 3, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC'))),
 (datetime.datetime(2021, 3, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC')),
  datetime.datetime(2021, 4, 1, 0, 0, tzinfo=zoneinfo.ZoneInfo(key='UTC')))]
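The window construction can also be wrapped in a small function and checked for gaps. This is a minimal sketch using only the standard library; `build_month_windows` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timezone


def build_month_windows(year):
    """Return 12 (start, end) UTC pairs; each end is the next month's start."""
    windows = []
    for m in range(1, 13):
        start = datetime(year, m, 1, tzinfo=timezone.utc)
        # December rolls over to January of the following year
        end = datetime(year + m // 12, m % 12 + 1, 1, tzinfo=timezone.utc)
        windows.append((start, end))
    return windows


windows = build_month_windows(2021)

# Half-open windows should tile the year with no gaps or overlaps
contiguous = all(a[1] == b[0] for a, b in zip(windows, windows[1:]))
print(len(windows), contiguous)
```

Because each window ends exactly where the next one starts, the twelve requests together cover the whole year, provided the API treats `endDate` as exclusive (an assumption worth verifying against the Elhub documentation).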

Fetch loop (one request per month; only the grand total is printed)

In [22]:
all_records = []

session = requests.Session()
print("Fetching 2021 Elhub production records, please wait...")
for start_dt, end_dt in month_windows:  # one request per monthly window
    start_iso = start_dt.isoformat()  # ISO 8601 with an explicit +00:00 (UTC) offset
    end_iso   = end_dt.isoformat()
    params = {"dataset": DATASET, "startDate": start_iso, "endDate": end_iso}

    try:
        r = session.get(ELHUB_BASE_URL, params=params, timeout=TIMEOUT_S)
        r.raise_for_status()
        payload = r.json()
    except requests.RequestException as e:
        # understated academic note; proceed
        print(f"{start_dt:%Y-%m} — omitted (transient I/O): {e}")
        time.sleep(PAUSE_S)
        continue

    # count for this month only
    n_month = 0

    # extract attributes.productionPerGroupMbaHour, if present
    for area in payload.get("data", []):
        attrs = area.get("attributes") or {}
        chunk = attrs.get("productionPerGroupMbaHour") or []
        if isinstance(chunk, list):
            all_records.extend(chunk)
            n_month += len(chunk)

    time.sleep(PAUSE_S)

# avoid auto-displaying the grand total in notebooks
grand_total = len(all_records)

if grand_total == 0:
    print("No data retrieved.")
else:
    print(f"\nData retrieved successfully — total records: {grand_total}")



Fetching 2021 Elhub production records, please wait...

Data retrieved successfully — total records: 215353
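The extraction step inside the loop can be isolated and exercised without calling the API. The sketch below assumes the same payload nesting the loop relies on (`data` → `attributes` → `productionPerGroupMbaHour`); `extract_records` and the sample payload are hypothetical and only illustrate the shape.

```python
def extract_records(payload, key="productionPerGroupMbaHour"):
    """Collect the per-hour records from every price-area entry in a payload."""
    records = []
    for area in payload.get("data", []):
        attrs = area.get("attributes") or {}
        chunk = attrs.get(key) or []
        if isinstance(chunk, list):
            records.extend(chunk)
    return records


# Hypothetical payload with the nesting the fetch loop expects
sample = {
    "data": [
        {"attributes": {"productionPerGroupMbaHour": [
            {"priceArea": "NO1", "productionGroup": "hydro", "quantityKwh": 1.0},
        ]}},
        {"attributes": None},                                # missing attributes are skipped
        {"attributes": {"productionPerGroupMbaHour": []}},   # empty chunks add nothing
    ]
}

print(len(extract_records(sample)))
```

Keeping the parsing in a function makes it easy to test the defensive branches (missing `data`, `None` attributes) separately from network errors.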


In [23]:
# To pandas; parse time columns as timezone-aware UTC
df = pd.DataFrame(all_records)

# Parse temporal columns (if present) and keep them in UTC
for col in ("startTime", "endTime", "lastUpdatedTime"):
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], utc=True, errors="coerce").dt.tz_convert(UTC)

# Optional tidy ordering
preferred = ["startTime", "endTime", "lastUpdatedTime", "priceArea", "productionGroup", "quantityKwh"]
ordered = [c for c in preferred if c in df.columns] + [c for c in df.columns if c not in preferred]
df = df[ordered]

# compact sanity check (row and column counts)
df.shape

(215353, 6)

In [24]:
df.head()


Unnamed: 0,startTime,endTime,lastUpdatedTime,priceArea,productionGroup,quantityKwh
0,2020-12-31 23:00:00+00:00,2021-01-01 00:00:00+00:00,2024-12-20 09:35:40+00:00,NO1,hydro,2507716.8
1,2021-01-01 00:00:00+00:00,2021-01-01 01:00:00+00:00,2024-12-20 09:35:40+00:00,NO1,hydro,2494728.0
2,2021-01-01 01:00:00+00:00,2021-01-01 02:00:00+00:00,2024-12-20 09:35:40+00:00,NO1,hydro,2486777.5
3,2021-01-01 02:00:00+00:00,2021-01-01 03:00:00+00:00,2024-12-20 09:35:40+00:00,NO1,hydro,2461176.0
4,2021-01-01 03:00:00+00:00,2021-01-01 04:00:00+00:00,2024-12-20 09:35:40+00:00,NO1,hydro,2466969.2


## Configure Spark to Connect to Cassandra
- We set up a Spark session that connects to Cassandra inside Docker.  
- Spark runs outside Docker, so we use `127.0.0.1` as the host (mapped to port `9042`).

In [25]:

import sys, os  # Import required libraries

# Python interpreter for driver and workers
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = (
    SparkSession.builder
    .appName("Elhub-Cassandra")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .config("spark.cassandra.connection.port", "9042")
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.5.1")
    .getOrCreate()
)

print("Spark connected to Cassandra successfully.")


Spark connected to Cassandra successfully.


In [26]:

print("Converting pandas DataFrame to Spark DataFrame...")

# Convert
spark_df = spark.createDataFrame(df)
# Confirm schema and preview data
print("Conversion successful — Spark DataFrame schema:")
spark_df.printSchema()

print("\nFirst 5 rows:")
spark_df.show(5, truncate=False)


Converting pandas DataFrame to Spark DataFrame...
Conversion successful — Spark DataFrame schema:
root
 |-- startTime: timestamp (nullable = true)
 |-- endTime: timestamp (nullable = true)
 |-- lastUpdatedTime: timestamp (nullable = true)
 |-- priceArea: string (nullable = true)
 |-- productionGroup: string (nullable = true)
 |-- quantityKwh: double (nullable = true)


First 5 rows:
+-------------------+-------------------+-------------------+---------+---------------+-----------+
|startTime          |endTime            |lastUpdatedTime    |priceArea|productionGroup|quantityKwh|
+-------------------+-------------------+-------------------+---------+---------------+-----------+
|2021-01-01 00:00:00|2021-01-01 01:00:00|2024-12-20 10:35:40|NO1      |hydro          |2507716.8  |
|2021-01-01 01:00:00|2021-01-01 02:00:00|2024-12-20 10:35:40|NO1      |hydro          |2494728.0  |
|2021-01-01 02:00:00|2021-01-01 03:00:00|2024-12-20 10:35:40|NO1      |hydro          |2486777.5  |
|2021-01-01 03
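The Spark preview shows `2021-01-01 00:00:00` where the pandas preview showed `2020-12-31 23:00:00+00:00`. A likely explanation, assuming the Spark session renders timestamps in the `Europe/Oslo` time zone (the notebook does not state this), is that both are the same instant displayed with different offsets. The conversion can be checked with the standard library:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# First startTime from the pandas preview (UTC)
utc_ts = datetime(2020, 12, 31, 23, 0, tzinfo=ZoneInfo("UTC"))

# Same instant rendered in Oslo local time (UTC+1 in winter)
oslo_ts = utc_ts.astimezone(ZoneInfo("Europe/Oslo"))

print(utc_ts.isoformat())
print(oslo_ts.isoformat())
```

If this assumption holds, no data has shifted; only the display changed. Setting `spark.sql.session.timeZone` to `UTC` is one way to make the two previews agree.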

In [27]:
spark_df.describe().show()

+-------+---------+---------------+------------------+
|summary|priceArea|productionGroup|       quantityKwh|
+-------+---------+---------------+------------------+
|  count|   215353|         215353|            215353|
|   mean|     NULL|           NULL| 729647.3826658489|
| stddev|     NULL|           NULL|1549543.7955768807|
|    min|      NO1|          hydro|               0.0|
|    max|      NO5|           wind|         9715193.0|
+-------+---------+---------------+------------------+



In [28]:
spark_df.columns


['startTime',
 'endTime',
 'lastUpdatedTime',
 'priceArea',
 'productionGroup',
 'quantityKwh']

In [29]:
from dotenv import load_dotenv, find_dotenv
import os
from urllib.parse import quote_plus

load_dotenv(find_dotenv())

user = os.getenv("MONGO_USER")
password = quote_plus(os.getenv("MONGO_PASS") or "")
cluster = os.getenv("MONGO_CLUSTER")

if not all([user, password, cluster]):  # Conditional check
    raise SystemExit("Missing MongoDB credentials in .env")

# Don't print the full URI (it embeds the password). Printing the cluster host is fine.
print(f"Connecting to MongoDB cluster: {cluster}")

uri = f"mongodb+srv://{user}:{password}@{cluster}/?retryWrites=true&w=majority"

Connecting to MongoDB cluster: cluster0.2ahqa17.mongodb.net
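If a connection string ever needs to appear in logs or notebook output, the password can be masked first. This is a minimal sketch using `urllib.parse`; `redact_uri` is a hypothetical helper and the example credentials are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_uri(uri, mask="***"):
    """Return the URI with the password part of the userinfo replaced by a mask."""
    parts = urlsplit(uri)
    userinfo, at, host = parts.netloc.rpartition("@")
    if not at:
        return uri  # no credentials present, nothing to hide
    user = userinfo.split(":", 1)[0]
    return urlunsplit(parts._replace(netloc=f"{user}:{mask}@{host}"))


# Made-up credentials for illustration only
example = "mongodb+srv://alice:s3cret@example.mongodb.net/?retryWrites=true&w=majority"
print(redact_uri(example))
```

Any password that has already been printed or committed should be treated as compromised and rotated.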


## Setting Up Cassandra Before Writing Data from Spark

Before writing data to Cassandra from Spark, make sure you have the following installed:

1. **Docker** – we will run Cassandra inside a Docker container.  
2. **Apache Cassandra** – runs inside Docker and stores our data.  
  
We will run Cassandra, create a keyspace and table, connect Spark to Cassandra, and finally write data safely.

## Run Cassandra in Docker
Start a Cassandra container on port 9042 (the default CQL port):

    docker run -d \
      --name cassandra \
      -p 9042:9042 \
      cassandra:4.1

This makes Cassandra reachable from Spark running outside Docker.


## Prepare the Column Names
Before writing, rename columns from `camelCase` to `snake_case` to match Cassandra's schema.  
Cassandra automatically converts unquoted names to lowercase, so `snake_case` is safer.
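The renaming rule can be expressed as a small helper. This is a sketch, not the notebook's method: `camel_to_snake` is a hypothetical function, and the `overrides` entry reflects the notebook's choice to store `quantityKwh` under the shorter name `value`.

```python
import re


def camel_to_snake(name):
    """Insert an underscore before each inner capital letter, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


columns = ["startTime", "endTime", "lastUpdatedTime",
           "priceArea", "productionGroup", "quantityKwh"]

# The notebook stores quantityKwh as `value`, so that one name is overridden
overrides = {"quantityKwh": "value"}
mapping = {c: overrides.get(c, camel_to_snake(c)) for c in columns}

for old, new in mapping.items():
    print(f"{old} -> {new}")
```

With PySpark, a mapping like this could be applied with `select(*[F.col(o).alias(n) for o, n in mapping.items()])`, which is equivalent to writing each alias by hand.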



In [30]:
# create a new DataFrame with renamed columns
from pyspark.sql import functions as F  
spark_df = spark_df.select(
    F.col("startTime").alias("start_time"),         
    F.col("endTime").alias("end_time"),              
    F.col("lastUpdatedTime").alias("last_updated_time"),  
    F.col("priceArea").alias("price_area"),          
    F.col("productionGroup").alias("production_group"),  
    F.col("quantityKwh").alias("value")             
)


In [31]:
spark_df.columns

['start_time',
 'end_time',
 'last_updated_time',
 'price_area',
 'production_group',
 'value']

In [32]:
spark_df.describe().show()


+-------+----------+----------------+------------------+
|summary|price_area|production_group|             value|
+-------+----------+----------------+------------------+
|  count|    215353|          215353|            215353|
|   mean|      NULL|            NULL| 729647.3826658489|
| stddev|      NULL|            NULL|1549543.7955768807|
|    min|       NO1|           hydro|               0.0|
|    max|       NO5|            wind|         9715193.0|
+-------+----------+----------------+------------------+



## Create the Keyspace and Table

Now we create:
- A keyspace (similar to a database) named **energy_data**
- A table named **production_2021**

This schema will store electricity production data.

    CREATE TABLE IF NOT EXISTS production_2021 (
        price_area text,
        production_group text,
        start_time timestamp,
        end_time timestamp,
        last_updated_time timestamp,
        value double,
        PRIMARY KEY ((price_area, production_group), start_time)
    ) WITH CLUSTERING ORDER BY (start_time ASC);
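The text mentions a keyspace, but only the table statement is shown. Below is a minimal sketch of both statements, assuming a single-node development cluster (hence `SimpleStrategy` with a replication factor of 1, which is an assumption and not taken from the notebook). `RecordingSession` is a hypothetical stand-in so the sketch runs without a cluster; with the DataStax `cassandra-driver` package, a real session would come from `Cluster(["127.0.0.1"], port=9042).connect()`.

```python
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS energy_data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS energy_data.production_2021 (
    price_area text,
    production_group text,
    start_time timestamp,
    end_time timestamp,
    last_updated_time timestamp,
    value double,
    PRIMARY KEY ((price_area, production_group), start_time)
) WITH CLUSTERING ORDER BY (start_time ASC)
"""


def create_schema(session):
    """Run the keyspace statement first, then the table statement."""
    for statement in (CREATE_KEYSPACE, CREATE_TABLE):
        session.execute(statement)


class RecordingSession:
    """Hypothetical stand-in that records statements instead of executing them."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(" ".join(statement.split()))


session = RecordingSession()
create_schema(session)
print(len(session.statements))
```

Because `(price_area, production_group)` is the partition key and `start_time` the clustering column, two rows with the same area, group, and start time occupy the same primary key, so the later write replaces the earlier one.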



##  Test Writing the DataFrame to Cassandra

We now append data from `spark_df` into the `energy_data.production_2021` table.


In [33]:
(
    spark_df.write
    .format("org.apache.spark.sql.cassandra")  # Use Cassandra Spark connector
    .mode("append")                            # Append new rows (do not overwrite existing data)
    .option("keyspace", "energy_data")         # Cassandra keyspace name
    .option("table", "production_2021")        # Target table for production data
    .save()                                    # Trigger the write operation
)

print("Data successfully inserted into Cassandra (energy_data.production_2021)")


Data successfully inserted into Cassandra (energy_data.production_2021)


## Verify the Data Inside Cassandra

In this step, we verify that the data was successfully inserted into the Cassandra table by reading it back with Spark and previewing the first rows.




In [34]:

# Read the table from Cassandra
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="energy_data", table="production_2021")
      .load())

# Print the schema 
df.printSchema()

# Show the first 5 rows
df.select("price_area", "production_group", "start_time", "value").show(5, truncate=False)


root
 |-- price_area: string (nullable = false)
 |-- production_group: string (nullable = false)
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- last_updated_time: timestamp (nullable = true)
 |-- value: double (nullable = true)

+----------+----------------+-------------------+-----+
|price_area|production_group|start_time         |value|
+----------+----------------+-------------------+-----+
|NO2       |other           |2021-01-01 00:00:00|4.346|
|NO2       |other           |2021-01-01 01:00:00|3.642|
|NO2       |other           |2021-01-01 02:00:00|3.562|
|NO2       |other           |2021-01-01 03:00:00|4.864|
|NO2       |other           |2021-01-01 04:00:00|5.168|
+----------+----------------+-------------------+-----+
only showing top 5 rows



## Visualizing Data from Cassandra using Spark

In [36]:
# Read from Cassandra (narrow columns early)

from pyspark.sql import functions as F  
# Choose which price area and year to visualize
#price_area = "NO1"
year_label = "2021"

# Read data from Cassandra
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "energy_data")          # Cassandra keyspace (database)
    .option("table", "production_2021")         # Table to read
    .load()
    .select("price_area", "production_group", "start_time", "value")  # Only needed columns
    
)

df.cache()  # Cache in memory for faster re-use


DataFrame[price_area: string, production_group: string, start_time: timestamp, value: double]

In [None]:
df.show(5)

+----------+----------------+-------------------+-----+
|price_area|production_group|         start_time|value|
+----------+----------------+-------------------+-----+
|       NO1|           solar|2021-01-01 00:00:00|6.106|
|       NO1|           solar|2021-01-01 01:00:00| 4.03|
|       NO1|           solar|2021-01-01 02:00:00|3.982|
|       NO1|           solar|2021-01-01 03:00:00|8.146|
|       NO1|           solar|2021-01-01 04:00:00|8.616|
+----------+----------------+-------------------+-----+
only showing top 5 rows



In [57]:
df.filter(df["price_area"] == "NO1").show(2)
#df.filter(df["price_area"] == "NO2").show(2)
#df.filter(df["price_area"] == "NO3").show(2)
#df.filter(df["price_area"] == "NO4").show(2)
#df.filter(df["price_area"] == "NO5").show(2)

+----------+----------------+-------------------+-----+
|price_area|production_group|         start_time|value|
+----------+----------------+-------------------+-----+
|       NO1|           solar|2021-01-01 00:00:00|6.106|
|       NO1|           solar|2021-01-01 01:00:00| 4.03|
+----------+----------------+-------------------+-----+
only showing top 2 rows



## Pie chart: total production over the whole chosen year

Displays each production group's share of **total annual production**. The aggregation below sums over all price areas, because the `price_area` selector in the read cell is commented out.
Each slice represents a **production group** (e.g., hydro, wind, thermal).
The chart is interactive: hover to see exact values and percentages.


In [58]:
# Group and sum total production per production group
df_area_agg = (
    df.groupBy("production_group")              # Group by production group (e.g. Wind, Hydro)
      .agg(F.sum("value").alias("total_quantity"))  # Sum total production value
      .orderBy(F.desc("total_quantity"))            
)

# Convert Spark DataFrame to pandas for Plotly
df_area_agg = df_area_agg.toPandas()

In [59]:
df_area_agg

Unnamed: 0,production_group,total_quantity
0,hydro,145003700000.0
1,wind,10730810000.0
2,thermal,1332156000.0
3,solar,34732650.0
4,other,16660230.0


In [61]:
# Calculate share (%) and sort for nicer plotting
df_area_agg["share"] = 100 * df_area_agg["total_quantity"] / df_area_agg["total_quantity"].sum()
df_area_agg = df_area_agg.sort_values("total_quantity", ascending=False).reset_index(drop=True)

# Decide where to show labels (inside if large enough)
inside_thresh = 5.0  # percent threshold for label placement
textpos = ["inside" if s >= inside_thresh else "outside" for s in df_area_agg["share"]]

# Create the pastel pie chart
fig = px.pie(
    df_area_agg,
    values="total_quantity",
    names="production_group",
    title=f"Total Production ({year_label})",
    color_discrete_sequence=px.colors.qualitative.Pastel  # Soft colors
)

# Adjust the look of the chart
fig.update_traces(
    textinfo="percent+label",                      
    textposition=textpos,                          # Label placement (inside/outside)
    pull=[0.04] * len(df_area_agg),                # Slightly pull out all slices
    hovertemplate="%{label}<br>%{percent:.1%} (%{value:,.0f})<extra></extra>", 
    sort=False,
    direction="clockwise",
    insidetextorientation="horizontal"
)

# Layout tweaks for better readability
fig.update_layout(
    width=900, height=520,
    title=dict(x=0.5, y=0.98, xanchor="center", yanchor="top"),   # Center title
    legend=dict(
        title=None,
        orientation="v",
        y=0.5, yanchor="middle",
        x=1.05, xanchor="left"   # Move legend to right side
    ),
    margin=dict(l=40, r=160, t=60, b=40)  # Extra space for legend
)

# Show the interactive pie chart
fig.show()
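As a worked check of the share calculation and the label placement rule, the same arithmetic can be done in plain Python using the rounded totals printed earlier. The 5 percent threshold is the one used in the plotting cell.

```python
# Rounded annual totals per production group, copied from the aggregated output
totals = {
    "hydro": 145003700000.0,
    "wind": 10730810000.0,
    "thermal": 1332156000.0,
    "solar": 34732650.0,
    "other": 16660230.0,
}

grand_total = sum(totals.values())
shares = {group: 100 * value / grand_total for group, value in totals.items()}

# Labels go inside the slice only when the slice is large enough to hold them
inside_thresh = 5.0
textpos = {g: ("inside" if s >= inside_thresh else "outside") for g, s in shares.items()}

for group in totals:
    print(f"{group:8s} {shares[group]:6.2f}%  {textpos[group]}")
```

Hydro dominates the total, so only the hydro and wind labels pass the threshold; the three smallest groups are labelled outside the pie.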


## Line plot: January by production group (per hour)
Shows **hourly production in January** for the selected price area.
Each line represents a **different production group** (e.g., hydro, wind, thermal).
Interactive — you can hover, zoom, and toggle groups.


In [None]:
YEAR = 2021
price_area = "NO4"  # change to NO2/NO1/NO3/NO5 for testing

# Ensure numeric type and handle nulls
df = df.withColumn("value", F.col("value").cast("double")).na.fill({"value": 0.0})

# January only + chosen price area
df_jan = df.where(
    (F.year(F.col("start_time")) == YEAR) &
    (F.month(F.col("start_time")) == 1) &
    (F.col("price_area") == price_area)
)

# Aggregate hourly per production group
df_hourly = (
    df_jan.groupBy(
        "production_group",
        F.date_trunc("hour", F.col("start_time")).alias("ts_hour")
    )
    .agg(F.sum("value").alias("total_production"))
)

# Guard
if df_hourly.rdd.isEmpty():
    raise ValueError(f"No January data found for price area '{price_area}' in {YEAR}.")

# To pandas and plot
pdf = df_hourly.toPandas().sort_values(["ts_hour", "production_group"])
pdf["ts_hour"] = pd.to_datetime(pdf["ts_hour"])

fig = px.line(
    pdf,
    x="ts_hour",
    y="total_production",
    color="production_group",
    title=f"Hourly Production by Group — {price_area} — January {YEAR}",
    labels={"ts_hour": "Time", "total_production": "Production (kWh)", "production_group": "Group"}
)
fig.update_traces(mode="lines", hovertemplate="<b>%{fullData.name}</b><br>%{x}<br>%{y:,.0f} kWh<extra></extra>")
fig.update_layout(width=1100, height=520, legend_title_text="Production group", margin=dict(l=40, r=20, t=70, b=40))
fig.show()
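The hourly aggregation above can be mirrored in plain Python to make the grouping logic explicit. This is an illustrative sketch on made-up rows, not a replacement for the Spark job: truncating to the hour plays the role of `date_trunc("hour", ...)`, and treating `None` as zero mirrors `na.fill`.

```python
from collections import defaultdict
from datetime import datetime


def hourly_totals(rows):
    """Sum values per (production_group, hour) pair."""
    totals = defaultdict(float)
    for group, ts, value in rows:
        ts_hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        totals[(group, ts_hour)] += value or 0.0                 # nulls count as zero
    return dict(totals)


# Made-up rows: (production_group, start_time, value)
rows = [
    ("hydro", datetime(2021, 1, 1, 0, 0), 10.0),
    ("hydro", datetime(2021, 1, 1, 0, 30), 5.0),   # same hour, added to the row above
    ("wind", datetime(2021, 1, 1, 0, 0), None),    # missing value treated as 0.0
    ("hydro", datetime(2021, 1, 1, 1, 0), 7.0),
]

result = hourly_totals(rows)
print(result[("hydro", datetime(2021, 1, 1, 0, 0))])
```

Since the source data is already hourly, the truncation should leave timestamps unchanged here; it mainly guards against sub-hour records.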


### Data Transfer to MongoDB
In the following code cells, we establish a connection to a MongoDB database and transfer our processed data. This typically involves:
1. Connecting to the MongoDB server.
2. Selecting the appropriate database and collection.
3. Inserting or updating records.
4. Verifying the successful data upload.

This step ensures our project data is stored persistently and can be queried later for analysis.

In [70]:
# pip install python-dotenv pymongo dnspython
from dotenv import load_dotenv, find_dotenv  # Import required libraries
from urllib.parse import quote_plus
import os
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, ConfigurationError

In [71]:
load_dotenv(find_dotenv())

user = os.getenv("MONGO_USER")
password = quote_plus(os.getenv("MONGO_PASS") or "")
cluster = os.getenv("MONGO_CLUSTER")

if not all([user, password, cluster]):  # Conditional check
    raise SystemExit("Missing MongoDB credentials in .env")

#print(f"Connecting to MongoDB cluster: {cluster}")

uri = f"mongodb+srv://{user}:{password}@{cluster}/?retryWrites=true&w=majority"



In [72]:
try:
    client = MongoClient(uri)
    # test connection
    client.admin.command('ping')
    print("Successfully connected to MongoDB!")
except (ConnectionFailure, ConfigurationError) as e:
    print("MongoDB connection failed:", e)


Successfully connected to MongoDB!


In [73]:
# Select only the important columns as required in the project
df_selected = df.select("price_area", "production_group", "start_time", "value")

# Convert Spark DataFrame to a pandas DataFrame
pdf = df_selected.toPandas()

# Display the first 5 rows to verify the data
pdf.head()


Unnamed: 0,price_area,production_group,start_time,value
0,NO1,solar,2021-01-01 00:00:00,6.106
1,NO1,solar,2021-01-01 01:00:00,4.03
2,NO1,solar,2021-01-01 02:00:00,3.982
3,NO1,solar,2021-01-01 03:00:00,8.146
4,NO1,solar,2021-01-01 04:00:00,8.616


In [86]:
pdf_no1 = pdf[pdf["price_area"] == "NO1"]

In [93]:
pdf["price_area"].unique()


array(['NO1', 'NO5', 'NO4', 'NO3', 'NO2'], dtype=object)

In [74]:
# --- Insert Spark-extracted data into MongoDB safely ---
from pymongo import MongoClient, UpdateOne  # Import required libraries
import pandas as pd

# Ensure 'start_time' column is in proper datetime format
pdf["start_time"] = pd.to_datetime(pdf["start_time"])

#  Connect to MongoDB
client = MongoClient(uri)  
db = client["elhub_data"]      
collection = db["energy_data_2021"]

# Create a unique index to prevent duplicates  (price_area + production_group + start_time)
collection.create_index(
    [("price_area", 1), ("production_group", 1), ("start_time", 1)],
    unique=True,
    name="uniq_price_group_time"
)

# Convert pandas DataFrame to a list of dictionaries (MongoDB documents)
records = pdf.to_dict("records")

# Prepare bulk upsert operations: 
# if a document exists -> update it, otherwise insert it
ops = [
    UpdateOne(
        {
            "price_area": r["price_area"],
            "production_group": r["production_group"],
            "start_time": r["start_time"],
        },
        {"$set": r},
        upsert=True
    )
    for r in records
]

# Execute the bulk operation
if ops:
    result = collection.bulk_write(ops, ordered=False)
    print("MongoDB upsert done")
    print("Matched:", result.matched_count)
    print("Modified:", result.modified_count)
    print("Upserted:", len(result.upserted_ids))
    print("Total attempted:", len(ops))
else:
    print("No records to insert.")


MongoDB upsert done
Matched: 40075
Modified: 0
Upserted: 175253
Total attempted: 215328


In [95]:
# --- Insert Spark-extracted data into MongoDB safely ---

from pymongo import MongoClient, UpdateOne
import pandas as pd

# Ensure 'start_time' is datetime
pdf["start_time"] = pd.to_datetime(pdf["start_time"], errors="coerce")

# Connect to MongoDB (make sure `uri` is defined above)
client = MongoClient(uri)

# Use the same DB/collection that Streamlit reads
db = client["elhub"]
collection = db["production_2021"]

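# Clear the collection first so stale partial data (e.g. an earlier NO1-only write) is removed.
# Note: this deletes every document in the collection; skip it to keep existing data.
collection.delete_many({})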

# Unique index to prevent duplicates
collection.create_index(
    [("price_area", 1), ("production_group", 1), ("start_time", 1)],
    unique=True,
    name="price_area_1_production_group_1_start_time_1",
)


# Prepare bulk upsert
records = pdf.to_dict("records")
ops = [
    UpdateOne(
        {
            "price_area": r["price_area"],
            "production_group": r["production_group"],
            "start_time": r["start_time"],
        },
        {"$set": r},
        upsert=True,
    )
    for r in records
]

if ops:
    result = collection.bulk_write(ops, ordered=False)
    print("MongoDB upsert done")
    print("Matched:", result.matched_count)
    print("Modified:", result.modified_count)
    print("Upserted:", len(result.upserted_ids))
    print("Total attempted:", len(ops))
else:
    print("No records to insert.")


MongoDB upsert done
Matched: 0
Modified: 0
Upserted: 215328
Total attempted: 215328


## Quick Verification


In [96]:
# Show one sample document (without the _id field)
collection.find_one({}, {"_id": 0})


{'start_time': datetime.datetime(2021, 1, 1, 0, 0),
 'production_group': 'solar',
 'price_area': 'NO1',
 'value': 6.106}

In [97]:
# Check distinct values for price_area and production_group
areas = collection.distinct("price_area")
groups = collection.distinct("production_group")

print("Distinct price areas:", areas)
print("Distinct production groups:", groups)

Distinct price areas: ['NO1', 'NO2', 'NO3', 'NO4', 'NO5']
Distinct production groups: ['hydro', 'other', 'solar', 'thermal', 'wind']


In [98]:
collection.count_documents({})


215328

### **Project Log: Jupyter Notebook, Streamlit, and AI Use**

In this project, I worked with **Jupyter Notebook** and **Streamlit** to complete data analysis and visualization tasks. The main goal was to prepare and process a dataset, store it in a database, and then build an interactive web application to present the results. Through this work, I learned how to manage the full data process, from preparation to presentation.

I started my work in **Jupyter Notebook**, where I focused on cleaning, transforming, and organizing the dataset. I used Python libraries such as `pandas`, `numpy`, and `pyspark` to handle the data efficiently. After preparing the dataset, I connected to a **MongoDB** database using the `pymongo` library. This allowed me to save and retrieve the processed data when needed. The notebook made it easier to test code step by step and view immediate outputs, which helped me understand the data flow and fix issues as they appeared. I also used Markdown cells to explain the steps clearly, which made the workflow easier to follow and review.

After the analysis part was complete, I used **Streamlit** to build an interactive dashboard for presenting the results. Streamlit allowed me to convert my Python code into a simple and attractive web interface. I included filters, selection options, and graphs to make the results easy to explore. I used visualization libraries such as `matplotlib` and `plotly` to create charts that helped explain the trends in the data. This step showed me how to communicate results effectively through interactive tools.

During the project, I also used **artificial intelligence (AI)** tools to support my work. I used AI mainly to **get help in understanding and correcting complex errors**, especially those related to connecting or reading data from the MongoDB database. In some cases, AI also helped me **write or adjust parts of the code** to make it more efficient or readable. All AI assistance was used responsibly and under my supervision, to improve the quality of the code and my understanding of each step. The AI served as a helpful technical assistant rather than replacing my own work or decision-making.

Overall, this project helped me gain valuable experience in data analysis, database management, and interactive visualization. It also showed how AI can be a useful tool for learning, problem-solving, and improving code quality when used carefully and ethically.
