# Assignment 2

## AI usage 

I generally prefer coding by myself over using AI for help. For this assignment I did, however, find it necessary to use AI to increase my efficiency. 

I used DeepSeek for help with connecting Cassandra and Spark.

## Log 

For this assignment I will try to work more systematically than in the last one. I will focus on finishing all the elements of the notebook first, and work on the Streamlit app afterwards.

I started by connecting Cassandra and Spark, but ran into problems with the SparkSession. After troubleshooting for a while and getting some help, we ended up asking DeepSeek how to fix it. DeepSeek suggested adding the last three lines of code to force the session builder to use localhost.

I have struggled a bit with connecting to Spark and MongoDB, especially since I was a couple of weeks behind on the lectures.

## Links 

- Github: https://github.com/Satheris/IND320_SMAA
- Streamlit app: https://ind320smaa-2eg32uba6uhmrknkwtxzar.streamlit.app/

## Coding

### Imports and system variables

In [52]:
import numpy as np
import pandas as pd 
import streamlit as st
import pymongo
from cassandra.cluster import Cluster
from pyspark.sql import SparkSession
from pyjstat import pyjstat
import requests
import json
import plotly.express as px


In [19]:
# Set environment variables for PySpark (system and version dependent!) 
# if not already set persistently (e.g., in .bashrc or .bash_profile or Windows environment variables)
import os
# Set the Java home path to the one you are using ((un)comment and edit as needed):
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_471"

# If you are using environments in Python, you can set the environment variables like the alternative below.
# The default Python environment is used if the variables are set to "python" (edit if needed):
os.environ["PYSPARK_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"

# On Windows you need to specify where the Hadoop drivers are located (uncomment and edit if needed):
os.environ["HADOOP_HOME"] = r"C:\Users\saraa\Documents\winutils\hadoop-3.3.1"

# Set the Hadoop version to the one you are using, e.g., none:
os.environ["PYSPARK_HADOOP_VERSION"] = "without"

### Cassandra and Spark

In [20]:
# Connecting to Cassandra
cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()

In [33]:
# Set up new keyspace
#                                              name of keyspace                        replication strategy           replication factor
session.execute("CREATE KEYSPACE IF NOT EXISTS ind320_keyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")

# Create a new table
session.set_keyspace('ind320_keyspace')
session.execute("DROP TABLE IF EXISTS ind320_keyspace.elhub_api;") # Starting from scratch every time
session.execute("CREATE TABLE IF NOT EXISTS elhub_api (ind int PRIMARY KEY, endTime text, lastUpdatedTime text, priceArea text, productionGroup text, quantityKwh float, startTime text);")

<cassandra.cluster.ResultSet at 0x2834184b490>

In [22]:
# Build a Spark session with the Cassandra connector.
# The driver host and bind address are pinned to localhost, which was
# needed to get the session to start on this machine (see the log).
spark = SparkSession.builder.appName('SparkCassandraApp').\
    config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.5.1').\
    config('spark.cassandra.connection.host', 'localhost').\
    config('spark.sql.extensions', 'com.datastax.spark.connector.CassandraSparkExtensions').\
    config('spark.sql.catalog.mycatalog', 'com.datastax.spark.connector.datasource.CassandraCatalog').\
    config('spark.cassandra.connection.port', '9042').\
    config('spark.driver.host', 'localhost').\
    config('spark.driver.bindAddress', '127.0.0.1').\
    config('spark.sql.adaptive.enabled', 'true').\
    getOrCreate()

#### Testing that the connection works

In [23]:
# .load() is used to load data from Cassandra table as a Spark DataFrame
spark.read.format("org.apache.spark.sql.cassandra").options(table="my_first_table", keyspace="my_first_keyspace").load().show()

+---+--------+-------+
|ind| company|  model|
+---+--------+-------+
|460|    Ford|Transit|
|459|    Ford| Escort|
|  1|   Tesla|Model S|
|  2|   Tesla|Model 3|
|  3|Polestar|      3|
+---+--------+-------+



In [24]:
# Read CSV file into Spark DataFrame
# planets = spark.read.csv("../data/planets.csv", header=True, inferSchema=True)
# planets.show()

### MongoDB

In [25]:
def init_connection():
    return pymongo.MongoClient(st.secrets["mongo"]["uri"])

client = init_connection()

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


### Elhub API

In [27]:
URL = 'https://api.elhub.no/energy-data/v0/price-areas?dataset=PRODUCTION_PER_GROUP_MBA_HOUR' \
        '&startDate=2021-01-01T00:00:00%2B02:00&endDate=2021-02-01T00:00:00%2B02:00'

payload = { 
    "query": [], 
    "response": { "format": "json-stat2" } }

response = requests.get(URL, json=payload)
data = response.json()

# Writing the data into a file
with open(r'data\api_response.json', 'w', encoding='utf-8') as f:
    json.dump(response.json(), f, indent=2, ensure_ascii=False)
print("Response saved to 'api_response.json'")


# Prints for status
print("\nStatus Code:", response.status_code)
print("Headers:", response.headers.get('content-type'))


# Extract all production records
all_records = []
for area in data['data']:
    records = area['attributes']['productionPerGroupMbaHour']
    for record in records:
        record['priceArea'] = area['attributes']['name']  # Add area name
        all_records.append(record)

df = pd.DataFrame(all_records)
print(f"\nCreated DataFrame with {len(df)} rows")
df.head()

Response saved to 'api_response.json'

Status Code: 200
Headers: application/json; charset=utf-8

Created DataFrame with 17856 rows


| | endTime | lastUpdatedTime | priceArea | productionGroup | quantityKwh | startTime |
|---|---|---|---|---|---|---|
| 0 | 2021-01-01T01:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2507716.8 | 2021-01-01T00:00:00+01:00 |
| 1 | 2021-01-01T02:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2494728.0 | 2021-01-01T01:00:00+01:00 |
| 2 | 2021-01-01T03:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2486777.5 | 2021-01-01T02:00:00+01:00 |
| 3 | 2021-01-01T04:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2461176.0 | 2021-01-01T03:00:00+01:00 |
| 4 | 2021-01-01T05:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2466969.2 | 2021-01-01T04:00:00+01:00 |


Now that I have successfully imported one month, I need to import all twelve months. The maximum allowed date range per request is one month, so I fetch the data in a for-loop, one month at a time.

In [None]:
monthStart = ['2021-01-01', '2021-02-01', '2021-03-01',
              '2021-04-01', '2021-05-01', '2021-06-01',
              '2021-07-01', '2021-08-01', '2021-09-01',
              '2021-10-01', '2021-11-01', '2021-12-01',
              '2022-01-01']


all_records = []

for i, month in enumerate(monthStart[:12]):
    URL = 'https://api.elhub.no/energy-data/v0/price-areas?dataset=PRODUCTION_PER_GROUP_MBA_HOUR&'\
        f'startDate={month}T00:00:00%2B02:00&endDate={monthStart[i+1]}T00:00:00%2B02:00'

    payload = { 
        "query": [], 
        "response": { "format": "json-stat2" } }

    response = requests.get(URL, json=payload)
    
    # print("\nStatus Code:", response.status_code)

    data = response.json()

    for area in data['data']:
        records = area['attributes']['productionPerGroupMbaHour']
        for record in records:
            record['priceArea'] = area['attributes']['name']
            all_records.append(record)

df = pd.DataFrame(all_records)
df.index.name = 'ind'
df = df.reset_index()

print(f"\nCreated DataFrame with {len(df)} rows")
df.head()


Created DataFrame with 215353 rows


| | ind | endTime | lastUpdatedTime | priceArea | productionGroup | quantityKwh | startTime |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2021-01-01T01:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2507716.8 | 2021-01-01T00:00:00+01:00 |
| 1 | 1 | 2021-01-01T02:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2494728.0 | 2021-01-01T01:00:00+01:00 |
| 2 | 2 | 2021-01-01T03:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2486777.5 | 2021-01-01T02:00:00+01:00 |
| 3 | 3 | 2021-01-01T04:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2461176.0 | 2021-01-01T03:00:00+01:00 |
| 4 | 4 | 2021-01-01T05:00:00+01:00 | 2024-12-20T10:35:40+01:00 | NO1 | hydro | 2466969.2 | 2021-01-01T04:00:00+01:00 |
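
The hard-coded `monthStart` list can also be generated. The following is a minimal sketch using only the standard library. The helper names `month_ranges` and `build_url` are hypothetical (they are not part of the Elhub API), and the `+02:00` offset simply mirrors the requests used in this notebook rather than being a recommendation.

```python
from datetime import date
from urllib.parse import quote

BASE_URL = "https://api.elhub.no/energy-data/v0/price-areas"
DATASET = "PRODUCTION_PER_GROUP_MBA_HOUR"


def month_ranges(year):
    """Return 12 (start, end) ISO date pairs, one per calendar month."""
    starts = [date(year, m, 1) for m in range(1, 13)] + [date(year + 1, 1, 1)]
    return [(a.isoformat(), b.isoformat()) for a, b in zip(starts, starts[1:])]


def build_url(start, end, offset="+02:00"):
    """Build one request URL; the '+' in the offset is sent as %2B."""
    off = quote(offset, safe=":")
    return (f"{BASE_URL}?dataset={DATASET}"
            f"&startDate={start}T00:00:00{off}&endDate={end}T00:00:00{off}")


urls = [build_url(start, end) for start, end in month_ranges(2021)]
print(len(urls))
print(urls[0])
```

Each URL could then be passed to `requests.get` inside the same loop as before, so the record extraction would stay unchanged.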


In [42]:
# Lowercase all column names. Unquoted CQL identifiers are stored in lowercase,
# so the Cassandra table columns are e.g. 'endtime', not 'endTime'.
df = df.rename(columns=str.lower)

In [43]:
# Convert the Pandas DataFrame to Spark DataFrame and save it to Cassandra (append mode)
spark.createDataFrame(df).write.format("org.apache.spark.sql.cassandra")\
.options(table="elhub_api", keyspace="ind320_keyspace").mode("append").save()

In [50]:
spark.read.format("org.apache.spark.sql.cassandra")\
.options(table="elhub_api", keyspace="ind320_keyspace").load()\
.createOrReplaceTempView("elhub_api_view")

df_spark = spark.sql("SELECT priceArea, productionGroup, startTime, quantityKwh FROM elhub_api_view")

In [66]:
area = 'NO1'
df_kwh_byArea = df_spark[df_spark['priceArea'] == area].groupBy('productionGroup').agg({'quantityKwh': 'sum'}).toPandas()
df_kwh_byArea

| | productionGroup | sum(quantityKwh) |
|---|---|---|
| 0 | solar | 14381940.0 |
| 1 | other | 52561.23 |
| 2 | thermal | 236118000.0 |
| 3 | hydro | 18356780000.0 |
| 4 | wind | 547360300.0 |


In [70]:
fig = px.pie(df_kwh_byArea, values='sum(quantityKwh)', names='productionGroup', 
             title=f'Total energy production in area {area} by production group', 
             color='productionGroup')
fig.show()

In [None]:
# Stop Spark session
try:
    spark.stop()
    print('Spark session terminated successfully')
except ConnectionRefusedError:
    print("Spark session already stopped.")
except NameError:
    print('Spark session is not defined')

# General


A Streamlit app running from https://[yourproject].streamlit.app/.
This is an online version of the project, accessing data that has been exported to CSV format and accessing your MongoDB database for additional data.
The code, hosted at GitHub, must include relevant comments from the Jupyter Notebook and further comments regarding Streamlit usage.


## Tasks

### Local database: Cassandra
If not already done, set up Cassandra and Spark as described in the book.
Test that your Spark-Cassandra connection works.
The Cassandra database will be accessed from the Jupyter Notebook and used to store data from the API mentioned later. 

### Remote database: MongoDB
If not already done, prepare a MongoDB account at mongodb.com.
Test that you can manipulate data from Python.
The MongoDB database will store data that has been trimmed/curated/prepared through the Jupyter Notebook and Spark filtering.
These data will be accessed directly from the Streamlit app.

### API
Familiarise yourself with the API connection at https://api.elhub.no.

Observe how time is encoded and how transitions between summer and winter time are handled.
Be aware of the time period limitations for each API request and how this differs between datasets.
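
As a small illustration of the time encoding, the sketch below parses timestamps in the format seen in the notebook output and compares them in UTC. The January timestamp is taken from that output. The two timestamps around 28 March 2021 (the start of summer time in Norway) are constructed for illustration and assume that the API switches from `+01:00` to `+02:00` at that point. `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc(stamp):
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


winter = to_utc("2021-01-01T00:00:00+01:00")
before = to_utc("2021-03-28T01:00:00+01:00")  # last hour of winter time
after = to_utc("2021-03-28T03:00:00+02:00")   # first hour of summer time

print(winter.isoformat())
print(after - before)
```

The local clock jumps from 01:00 to 03:00, but only one hour elapses in UTC, so comparing or sorting on UTC values avoids an apparent gap in the series.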


### Jupyter Notebook

#### Standard requirements

Must include a brief description of AI usage.

Must include a 300-500-word log describing the compulsory work (including both Jupyter Notebook and Streamlit experience).

Must include links to your public GitHub repository and Streamlit app (see below) for the compulsory work.

Document headings should be clear and usable for navigation during development.

All code blocks must include enough comments to be understandable and reproducible if someone inherits your project.

All code blocks must be run before an export to PDF so the messages and plots are shown. In addition, add the .ipynb file to the GitHub repository where you have your Streamlit project.


#### Tasks for assignment 2
Use the Elhub API to retrieve hourly production data for all price areas using PRODUCTION_PER_GROUP_MBA_HOUR for all days and hours of the year 2021.

Extract only the list in productionPerGroupMbaHour, convert to a DataFrame, and insert the data into Cassandra using Spark.

Use Spark to extract the columns priceArea, productionGroup, startTime, and quantityKwh from Cassandra.

Create the following plots:
- A pie chart for the total production of the year from a chosen price area, where each piece of the pie is one of the production groups.
- A line plot for the first month of the year for a chosen price area. Make separate lines for each production group.

Insert the Spark-extracted data into your MongoDB.
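
One possible way to do this insert is sketched below. It assumes the Spark result has already been converted to a list of dictionaries (for example via `toPandas()` followed by `to_dict('records')`) and that `collection` is a `pymongo` collection. The helper name `insert_in_batches`, the batch size, and the `FakeCollection` stand-in are hypothetical and only there so the sketch runs without a database.

```python
def insert_in_batches(collection, records, batch_size=10_000):
    """Insert records into a MongoDB collection in fixed-size batches.

    `collection` only needs an `insert_many(list_of_dicts)` method,
    which is what a pymongo collection provides.
    """
    inserted = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


class FakeCollection:
    """Stand-in for a pymongo collection so the sketch runs without a server."""

    def __init__(self):
        self.docs = []
        self.calls = 0

    def insert_many(self, docs):
        self.docs.extend(docs)
        self.calls += 1


# Made-up records with the four columns extracted by Spark
records = [
    {"priceArea": "NO1", "productionGroup": "hydro",
     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": float(i)}
    for i in range(25)
]

fake = FakeCollection()
total = insert_in_batches(fake, records, batch_size=10)
print(total, fake.calls)
```

With a real database, `fake` would be replaced by a collection obtained from the `client` created earlier in the notebook.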

Remember to fill in the log and AI mentioned in the General section above.

### Streamlit app
Establish a connection with your MongoDB database. When running this at streamlit.io, remember to copy your secrets to the webpage instead of exposing them on GitHub.

On page four, split the view into two columns using st.columns.

- On the left side, use radio buttons (st.radio) to select a price area and display a pie chart like in the Jupyter Notebook

- On the right side, use pills (st.pills) to select which production groups to include and a selection element of your choice to select a month. Combine the price area, production group(s) and month, and display a line plot like in the Jupyter Notebook (but for any month).

- Below the columns, insert an expander (st.expander) where you briefly document the source of the data shown on the page.
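
The page layout described above could be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: `filter_rows` and `render_page` are hypothetical names, `rows` is assumed to be a list of dictionaries read from MongoDB, `st.pills` requires a recent Streamlit version, and the sample rows other than the first are made up.

```python
def filter_rows(rows, price_area, groups, month):
    """Keep rows for one price area, the chosen groups, and one month (1-12)."""
    return [
        r for r in rows
        if r["priceArea"] == price_area
        and r["productionGroup"] in groups
        and int(r["startTime"][5:7]) == month
    ]


def render_page(rows):
    """Draw the page; meant to be called from the Streamlit page script."""
    import pandas as pd
    import plotly.express as px
    import streamlit as st

    areas = sorted({r["priceArea"] for r in rows})
    all_groups = sorted({r["productionGroup"] for r in rows})
    left, right = st.columns(2)

    with left:
        area = st.radio("Price area", areas)
        df_area = pd.DataFrame([r for r in rows if r["priceArea"] == area])
        totals = df_area.groupby("productionGroup", as_index=False)["quantityKwh"].sum()
        st.plotly_chart(px.pie(totals, values="quantityKwh", names="productionGroup"))

    with right:
        groups = st.pills("Production groups", all_groups,
                          selection_mode="multi", default=all_groups)
        month = st.selectbox("Month", list(range(1, 13)))
        selected = filter_rows(rows, area, groups, month)
        if selected:
            st.plotly_chart(px.line(pd.DataFrame(selected), x="startTime",
                                    y="quantityKwh", color="productionGroup"))
        else:
            st.info("No data for this selection.")

    with st.expander("Data source"):
        st.write("Hourly production per group from the Elhub API, year 2021.")


# Small check of the filtering logic with sample rows
sample = [
    {"priceArea": "NO1", "productionGroup": "hydro",
     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 2507716.8},
    {"priceArea": "NO1", "productionGroup": "wind",
     "startTime": "2021-02-01T00:00:00+01:00", "quantityKwh": 1.0},
    {"priceArea": "NO2", "productionGroup": "hydro",
     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 2.0},
]
january_no1 = filter_rows(sample, "NO1", ["hydro", "wind"], 1)
print(len(january_no1))
```

Keeping the filtering in a plain function separates the data logic from the widgets, so it can be checked without starting the app.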