# Log

Appending data to the Cassandra database was quite easy and went quickly. Starting Spark and uploading to MongoDB was also straightforward; it was mostly copy-pasting from the previous exercise.

My big challenge arose when trying to make the map in the first task using Plotly. Even after spending 10+ hours on it, I could not get the highlighting to work properly. I then switched to folium and was able to solve the task, although it still was not easy. Since other pages depend on the clicks you make on this page, it was essential to solve this problem first.

I found the other tasks easier than the map, or at least easier than trying to build the map in Plotly. They were still challenging, and it took quite a lot of time and AI use to solve all the problems I faced.

Comments on the correlation between hydro and temperature_2m: a lag of n*24 hours does not seem to give a pattern that repeats more regularly, while a lag closer to 12 hours seems to remove a lot of the regularity. Decreasing the window length makes the extreme values much larger, so the data appear less correlated. Increasing the window length gives a smoother pattern with fewer swings, so the data appear more highly correlated.

For the bonus task I have implemented spinners to indicate that data is loading, and I have tried to cache most of the app.
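
The lag and window-length observations above can be illustrated with a small sketch. It uses synthetic hourly data, not the real hydro and temperature_2m series, and the helper `lagged_rolling_corr` is a hypothetical name introduced here; the app may compute its sliding-window correlation differently.

```python
import numpy as np
import pandas as pd


def lagged_rolling_corr(x, y, lag, window):
    """Rolling Pearson correlation between x and y shifted by `lag` steps."""
    return x.rolling(window, center=True).corr(y.shift(lag))


# Synthetic hourly series (assumption: NOT the real hydro / temperature_2m data)
rng = np.random.default_rng(0)
n = 24 * 60  # 60 days of hourly values
daily_cycle = np.sin(2 * np.pi * np.arange(n) / 24)
temperature = pd.Series(daily_cycle + 0.3 * rng.standard_normal(n))
production = pd.Series(-temperature.to_numpy() + 0.3 * rng.standard_normal(n))

# Same lag, two window lengths: one day versus one week
short_window = lagged_rolling_corr(temperature, production, lag=0, window=24)
long_window = lagged_rolling_corr(temperature, production, lag=0, window=24 * 7)

print("spread with 24 h window: ", short_window.std())
print("spread with 168 h window:", long_window.std())
```

On this synthetic data the short window gives a correlation curve that swings more than the long window does, which is the same qualitative effect described above.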

The Jupyter notebook looks a bit messy because I had overlooked the energy consumption data in the beginning, so I just added it quickly at the end.

# AI usage

AI was used heavily in this exercise. I was stuck a lot, especially on making the map. I mainly used Claude, but also ChatGPT when I had spent all my free usage on Claude. I had built the energy_plots page in matplotlib earlier, and it was very easy to convert it to Plotly with AI.

# GitHub and Streamlit

https://github.com/KristofferHemm/ind320/tree/part4
https://ind320-kristoffer.streamlit.app/

In [62]:
import pandas as pd
import requests
import numpy as np
import plotly.express as px
import plotly.io as pio
import os
import datetime
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement
from dotenv import load_dotenv
from pyspark.sql import SparkSession, functions
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

In [2]:
# Setting the environment for PySpark
load_dotenv()

HADOOP_PATH = os.getenv("HADOOP_PATH")

os.environ["JAVA_HOME"] = r"C:\Program Files\Microsoft\jdk-11.0.28.6-hotspot" 
os.environ["PYSPARK_HADOOP_VERSION"] = "without"
os.environ["HADOOP_HOME"] = HADOOP_PATH
os.environ["PYSPARK_PYTHON"] = "python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python"

In [3]:
# Setting the environment for MongoDB
load_dotenv()

USR,PWD = os.getenv("DB_USER"), os.getenv("DB_PWD")

uri = f"mongodb+srv://{USR}:{PWD}@ind320.nxw58bh.mongodb.net/?retryWrites=true&w=majority&appName=IND320"

# Create a client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!
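
One thing the connection cell above does not cover: if the user name or password contains characters such as `@`, `/` or `:`, they need to be percent-escaped before they go into a MongoDB URI. The sketch below shows one way to do that with `urllib.parse.quote_plus`. The helper `build_mongo_uri`, the host and the credentials are placeholders introduced for illustration, not the values used in this notebook.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, app_name):
    """Build a mongodb+srv URI with percent-escaped credentials."""
    if not user or not password:
        # os.getenv returns None when a variable is missing from the .env file
        raise ValueError("DB_USER and DB_PWD must both be set")
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?retryWrites=true&w=majority&appName={app_name}"
    )


# Placeholder credentials and host, for illustration only
example_uri = build_mongo_uri("demo_user", "p@ss/word", "cluster.example.net", "IND320")
print(example_uri)
```

Failing early on missing credentials also gives a clearer error than a connection string that contains the text `None`.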


# Collecting data

## Production

In [4]:
# Setting the environment for collecting the elhub-data for 2021-01-01
entity = 'price-areas'
dataset = 'PRODUCTION_PER_GROUP_MBA_HOUR'
start = '2021-01-01T00:00:00%2B02:00'
end = '2021-01-01T23:59:59%2B02:00' 
res = requests.get(f'https://api.elhub.no/energy-data/v0/{entity}?dataset={dataset}&startDate={start}&endDate={end}')

In [7]:
# Check that we got a connection to the API
assert res.status_code == 200

In [6]:
for header_name, header_value in res.headers.items():
    print(f'{header_name:16s}: {header_value}')

Date            : Fri, 28 Nov 2025 07:57:19 GMT
Content-Type    : application/json; charset=utf-8
Transfer-Encoding: chunked
Connection      : keep-alive
Cache-Control   : public, max-age=3600
strict-transport-security: max-age=63072000; includeSubDomains


In [6]:
# Creating a list which we will extend with dataframes each containing data for one month
data = []

# Creating start and stop dates for the api call and collecting data
years = [2022, 2023, 2024]
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
for year in years:
    for month in months:
        if month < 9:
            start = f'{year}-0{month}-01T00:00:00%2B02:00'
            end = f'{year}-0{month+1}-01T00:00:00%2B02:00'
        elif month == 9:
            start = f'{year}-0{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month+1}-01T00:00:00%2B02:00'
        elif month == 12:
            start = f'{year}-{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month}-31T23:59:59%2B02:00'
        else: 
            start = f'{year}-{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month+1}-01T00:00:00%2B02:00'
        print(start)
        res = requests.get(f'https://api.elhub.no/energy-data/v0/{entity}?dataset={dataset}&startDate={start}&endDate={end}')
        assert res.status_code == 200
        payload = res.json()
        temp_data = [pd.DataFrame(entry['attributes']['productionPerGroupMbaHour'])
                 for entry in payload['data']]
        data.extend(temp_data)

2022-01-01T00:00:00%2B02:00
2022-02-01T00:00:00%2B02:00
2022-03-01T00:00:00%2B02:00
2022-04-01T00:00:00%2B02:00
2022-05-01T00:00:00%2B02:00
2022-06-01T00:00:00%2B02:00
2022-07-01T00:00:00%2B02:00
2022-08-01T00:00:00%2B02:00
2022-09-01T00:00:00%2B02:00
2022-10-01T00:00:00%2B02:00
2022-11-01T00:00:00%2B02:00
2022-12-01T00:00:00%2B02:00
2023-01-01T00:00:00%2B02:00
2023-02-01T00:00:00%2B02:00
2023-03-01T00:00:00%2B02:00
2023-04-01T00:00:00%2B02:00
2023-05-01T00:00:00%2B02:00
2023-06-01T00:00:00%2B02:00
2023-07-01T00:00:00%2B02:00
2023-08-01T00:00:00%2B02:00
2023-09-01T00:00:00%2B02:00
2023-10-01T00:00:00%2B02:00
2023-11-01T00:00:00%2B02:00
2023-12-01T00:00:00%2B02:00
2024-01-01T00:00:00%2B02:00
2024-02-01T00:00:00%2B02:00
2024-03-01T00:00:00%2B02:00
2024-04-01T00:00:00%2B02:00
2024-05-01T00:00:00%2B02:00
2024-06-01T00:00:00%2B02:00
2024-07-01T00:00:00%2B02:00
2024-08-01T00:00:00%2B02:00
2024-09-01T00:00:00%2B02:00
2024-10-01T00:00:00%2B02:00
2024-11-01T00:00:00%2B02:00
2024-12-01T00:00:00%
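
The month boundaries built with the `if`/`elif` chain above could also be generated with `datetime.date`, which takes care of zero-padding and the year rollover. This is only a sketch: `month_ranges` is a hypothetical helper, and unlike the loop above it ends December at 1 January of the following year instead of 31 December 23:59:59.

```python
from datetime import date


def month_ranges(years, offset="%2B02:00"):
    """Yield (start, end) strings covering each calendar month of `years`."""
    for year in years:
        for month in range(1, 13):
            first_day = date(year, month, 1)
            # month // 12 is 1 only in December, which rolls over to January
            next_first = date(year + month // 12, month % 12 + 1, 1)
            yield (
                f"{first_day.isoformat()}T00:00:00{offset}",
                f"{next_first.isoformat()}T00:00:00{offset}",
            )


for range_start, range_end in month_ranges([2022]):
    print(range_start, range_end)
```

Each pair can then be passed as `startDate` and `endDate` in the same request as above.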

In [7]:
# Make one dataframe from the list of dataframes collected
df = pd.concat(data, ignore_index=True)

In [8]:
df.head()

Unnamed: 0,endTime,lastUpdatedTime,priceArea,productionGroup,quantityKwh,startTime
0,2022-01-01T01:00:00+01:00,2025-02-01T18:02:57+01:00,NO1,hydro,1291422.4,2022-01-01T00:00:00+01:00
1,2022-01-01T02:00:00+01:00,2025-02-01T18:02:57+01:00,NO1,hydro,1246209.4,2022-01-01T01:00:00+01:00
2,2022-01-01T03:00:00+01:00,2025-02-01T18:02:57+01:00,NO1,hydro,1271757.0,2022-01-01T02:00:00+01:00
3,2022-01-01T04:00:00+01:00,2025-02-01T18:02:57+01:00,NO1,hydro,1204251.8,2022-01-01T03:00:00+01:00
4,2022-01-01T05:00:00+01:00,2025-02-01T18:02:57+01:00,NO1,hydro,1202086.9,2022-01-01T04:00:00+01:00


In [9]:
df.shape

(657600, 6)

In [10]:
# Drop duplicates if there were any duplicates created at the start/end of months 
df = df.drop_duplicates()
df.shape

(657600, 6)

In [11]:
# Checking datatypes and converting datetime columns to type datetime
df.dtypes

endTime             object
lastUpdatedTime     object
priceArea           object
productionGroup     object
quantityKwh        float64
startTime           object
dtype: object

In [12]:
df['endTime'] = pd.to_datetime(df['endTime'], utc=True).dt.tz_localize(None)
df['lastUpdatedTime'] = pd.to_datetime(df['lastUpdatedTime'], utc=True).dt.tz_localize(None)
df['startTime'] = pd.to_datetime(df['startTime'], utc=True).dt.tz_localize(None)

In [13]:
df.dtypes

endTime            datetime64[ns]
lastUpdatedTime    datetime64[ns]
priceArea                  object
productionGroup            object
quantityKwh               float64
startTime          datetime64[ns]
dtype: object
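
To make the effect of the conversion above concrete, here is a minimal sketch on two made-up timestamps. Parsing with `utc=True` converts every value to UTC, and `tz_localize(None)` then drops the timezone information, so the stored values are naive UTC times. This is why a local start time of midnight on 1 January shows up as 23:00 on 31 December in the converted data.

```python
import pandas as pd

# Two example strings with different local offsets (winter and summer time)
raw = pd.Series(["2022-01-01T00:00:00+01:00", "2022-07-01T00:00:00+02:00"])

as_utc = pd.to_datetime(raw, utc=True)   # timezone-aware, converted to UTC
naive_utc = as_utc.dt.tz_localize(None)  # same instants, timezone info removed

print(naive_utc)
```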

In [13]:
# Setting Cassandra environment
keyspace = 'my_first_keyspace'
table_name = 'elhub'


In [14]:
# Function for converting pandas datatypes to Cassandra compatible datatypes
def pandas_to_cassandra_type(dtype):
    if pd.api.types.is_integer_dtype(dtype):
        return 'int'
    elif pd.api.types.is_float_dtype(dtype):
        return 'double'
    elif np.issubdtype(dtype, np.datetime64):
        return 'timestamp'
    else:
        return 'text'

In [12]:
# Connecting to Cassandra
cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()
session.set_keyspace(keyspace)

In [17]:
# Setting column definitions for Cassandra
columns_cql = ', '.join([
    f'{col} {pandas_to_cassandra_type(df[col].dtype)}'
    for col in df.columns
])
columns_cql

'endTime timestamp, lastUpdatedTime timestamp, priceArea text, productionGroup text, quantityKwh double, startTime timestamp, row_id int'

In [20]:
# Read max row_id from Cassandra db to start inserting data with row ids starting after max row_id
result = session.execute("SELECT max(row_id) FROM elhub")
max_id = result.one()[0]
max_id

215353

In [27]:
df = df.drop('row_id', axis=1)

In [28]:
df.head()

Unnamed: 0,endTime,lastUpdatedTime,priceArea,productionGroup,quantityKwh,startTime
0,2022-01-01 00:00:00,2025-02-01 17:02:57,NO1,hydro,1291422.4,2021-12-31 23:00:00
1,2022-01-01 01:00:00,2025-02-01 17:02:57,NO1,hydro,1246209.4,2022-01-01 00:00:00
2,2022-01-01 02:00:00,2025-02-01 17:02:57,NO1,hydro,1271757.0,2022-01-01 01:00:00
3,2022-01-01 03:00:00,2025-02-01 17:02:57,NO1,hydro,1204251.8,2022-01-01 02:00:00
4,2022-01-01 04:00:00,2025-02-01 17:02:57,NO1,hydro,1202086.9,2022-01-01 03:00:00


In [29]:
# Continue the row ids after the current maximum in Cassandra
df['row_id'] = range(max_id + 1, max_id + len(df) + 1)
primary_key = 'row_id'
df.head()

Unnamed: 0,endTime,lastUpdatedTime,priceArea,productionGroup,quantityKwh,startTime,row_id
0,2022-01-01 00:00:00,2025-02-01 17:02:57,NO1,hydro,1291422.4,2021-12-31 23:00:00,215354
1,2022-01-01 01:00:00,2025-02-01 17:02:57,NO1,hydro,1246209.4,2022-01-01 00:00:00,215355
2,2022-01-01 02:00:00,2025-02-01 17:02:57,NO1,hydro,1271757.0,2022-01-01 01:00:00,215356
3,2022-01-01 03:00:00,2025-02-01 17:02:57,NO1,hydro,1204251.8,2022-01-01 02:00:00,215357
4,2022-01-01 04:00:00,2025-02-01 17:02:57,NO1,hydro,1202086.9,2022-01-01 03:00:00,215358
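
The row-id step above can be wrapped in a small function. The sketch below is an assumption about how one might do it, not part of the original notebook: `assign_row_ids` is a hypothetical helper, and it treats a `max_id` of `None` (what I would expect from `max(row_id)` on an empty table) as zero.

```python
import pandas as pd


def assign_row_ids(frame, max_id):
    """Return a copy of `frame` with row_id continuing after `max_id`."""
    start = (max_id or 0) + 1  # None or 0 both mean "start at 1"
    out = frame.copy()
    out["row_id"] = range(start, start + len(out))
    return out


example = pd.DataFrame({"quantityKwh": [1291422.4, 1246209.4, 1271757.0]})
print(assign_row_ids(example, 215353))
```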


In [31]:
# Inserting data into Cassandra using batch-insert 
columns = list(df.columns)
placeholders = ", ".join(["?"] * len(columns))
columns_str = ", ".join(columns)

insert_cql = f"INSERT INTO elhub ({columns_str}) VALUES ({placeholders})"

BATCH_SIZE = 100

prepared = session.prepare(insert_cql)
batch = BatchStatement()

for i, (_, row) in enumerate(df.iterrows(), 1):
    values = [v.to_pydatetime() if isinstance(v, pd.Timestamp) else v for v in row]
    batch.add(prepared, tuple(values))

    if i % BATCH_SIZE == 0:
        session.execute(batch)
        batch = BatchStatement()  # reset batch

# execute remaining
if len(batch) > 0:
    session.execute(batch)

print("Bulk insert completed")

Bulk insert completed
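
The batching pattern in the insert cell above (fill a batch, execute it every 100 rows, then flush the remainder) can be separated from the Cassandra calls. The sketch below shows only the chunking part in plain Python; `chunked` is a hypothetical helper, and in the notebook each chunk would be added to a fresh `BatchStatement` and executed.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items until `iterable` is exhausted."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Stand-in rows; with a dataframe these could come from df.itertuples(index=False)
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
for batch_rows in chunked(rows, 2):
    print(len(batch_rows), batch_rows)
```

Because the final, shorter chunk is yielded like any other, no separate "execute remaining" step is needed.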


## Consumption

In [8]:
# Setting the environment for collecting the elhub consumption data for 2021-01-01
entity = 'price-areas'
dataset = 'CONSUMPTION_PER_GROUP_MBA_HOUR'
start = '2021-01-01T00:00:00%2B02:00'
end = '2021-01-01T23:59:59%2B02:00' 
res = requests.get(f'https://api.elhub.no/energy-data/v0/{entity}?dataset={dataset}&startDate={start}&endDate={end}')

In [9]:
# Check that we got a connection to the API
assert res.status_code == 200

In [10]:
for header_name, header_value in res.headers.items():
    print(f'{header_name:16s}: {header_value}')

Date            : Fri, 28 Nov 2025 07:59:48 GMT
Content-Type    : application/json; charset=utf-8
Transfer-Encoding: chunked
Connection      : keep-alive
Cache-Control   : public, max-age=3600
strict-transport-security: max-age=63072000; includeSubDomains


In [11]:
# Creating a list which we will extend with dataframes each containing data for one month
data = []

# Creating start and stop dates for the api call and collecting data
years = [2021, 2022, 2023, 2024]
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
for year in years:
    for month in months:
        if month < 9:
            start = f'{year}-0{month}-01T00:00:00%2B02:00'
            end = f'{year}-0{month+1}-01T00:00:00%2B02:00'
        elif month == 9:
            start = f'{year}-0{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month+1}-01T00:00:00%2B02:00'
        elif month == 12:
            start = f'{year}-{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month}-31T23:59:59%2B02:00'
        else: 
            start = f'{year}-{month}-01T00:00:00%2B02:00'
            end = f'{year}-{month+1}-01T00:00:00%2B02:00'
        print(start)
        res = requests.get(f'https://api.elhub.no/energy-data/v0/{entity}?dataset={dataset}&startDate={start}&endDate={end}')
        assert res.status_code == 200
        payload = res.json()
        temp_data = [pd.DataFrame(entry['attributes']['consumptionPerGroupMbaHour'])
                 for entry in payload['data']]
        data.extend(temp_data)

2021-01-01T00:00:00%2B02:00
2021-02-01T00:00:00%2B02:00
2021-03-01T00:00:00%2B02:00
2021-04-01T00:00:00%2B02:00
2021-05-01T00:00:00%2B02:00
2021-06-01T00:00:00%2B02:00
2021-07-01T00:00:00%2B02:00
2021-08-01T00:00:00%2B02:00
2021-09-01T00:00:00%2B02:00
2021-10-01T00:00:00%2B02:00
2021-11-01T00:00:00%2B02:00
2021-12-01T00:00:00%2B02:00
2022-01-01T00:00:00%2B02:00
2022-02-01T00:00:00%2B02:00
2022-03-01T00:00:00%2B02:00
2022-04-01T00:00:00%2B02:00
2022-05-01T00:00:00%2B02:00
2022-06-01T00:00:00%2B02:00
2022-07-01T00:00:00%2B02:00
2022-08-01T00:00:00%2B02:00
2022-09-01T00:00:00%2B02:00
2022-10-01T00:00:00%2B02:00
2022-11-01T00:00:00%2B02:00
2022-12-01T00:00:00%2B02:00
2023-01-01T00:00:00%2B02:00
2023-02-01T00:00:00%2B02:00
2023-03-01T00:00:00%2B02:00
2023-04-01T00:00:00%2B02:00
2023-05-01T00:00:00%2B02:00
2023-06-01T00:00:00%2B02:00
2023-07-01T00:00:00%2B02:00
2023-08-01T00:00:00%2B02:00
2023-09-01T00:00:00%2B02:00
2023-10-01T00:00:00%2B02:00
2023-11-01T00:00:00%2B02:00
2023-12-01T00:00:00%

In [37]:
# Make one dataframe from the list of dataframes collected
df = pd.concat(data, ignore_index=True)

In [38]:
df.columns

Index(['consumptionGroup', 'endTime', 'lastUpdatedTime', 'meteringPointCount',
       'priceArea', 'quantityKwh', 'startTime'],
      dtype='object')

In [39]:
df = df.drop('meteringPointCount', axis=1)

In [40]:
df['endTime'] = pd.to_datetime(df['endTime'], utc=True).dt.tz_localize(None)
df['lastUpdatedTime'] = pd.to_datetime(df['lastUpdatedTime'], utc=True).dt.tz_localize(None)
df['startTime'] = pd.to_datetime(df['startTime'], utc=True).dt.tz_localize(None)

In [41]:
df.head()

Unnamed: 0,consumptionGroup,endTime,lastUpdatedTime,priceArea,quantityKwh,startTime
0,cabin,2021-01-01 00:00:00,2024-12-20 09:35:40,NO1,177071.56,2020-12-31 23:00:00
1,cabin,2021-01-01 01:00:00,2024-12-20 09:35:40,NO1,171335.12,2021-01-01 00:00:00
2,cabin,2021-01-01 02:00:00,2024-12-20 09:35:40,NO1,164912.02,2021-01-01 01:00:00
3,cabin,2021-01-01 03:00:00,2024-12-20 09:35:40,NO1,160265.77,2021-01-01 02:00:00
4,cabin,2021-01-01 04:00:00,2024-12-20 09:35:40,NO1,159828.69,2021-01-01 03:00:00


In [42]:
# Drop duplicates if there were any duplicates created at the start/end of months 
df = df.drop_duplicates()
df.shape

(876600, 6)

In [43]:
df.dtypes

consumptionGroup            object
endTime             datetime64[ns]
lastUpdatedTime     datetime64[ns]
priceArea                   object
quantityKwh                float64
startTime           datetime64[ns]
dtype: object

In [50]:
# Setting Cassandra environment
keyspace = 'my_first_keyspace'
table_name = 'consumption'

In [51]:
# Function for converting pandas datatypes to Cassandra compatible datatypes
def pandas_to_cassandra_type(dtype):
    if pd.api.types.is_integer_dtype(dtype):
        return 'int'
    elif pd.api.types.is_float_dtype(dtype):
        return 'double'
    elif np.issubdtype(dtype, np.datetime64):
        return 'timestamp'
    else:
        return 'text'

In [52]:
# Connecting to Cassandra
cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()
session.set_keyspace(keyspace)

In [53]:
df['row_id'] = range(1, len(df)+1)
primary_key = df.columns[-1]
df.head()

Unnamed: 0,consumptionGroup,endTime,lastUpdatedTime,priceArea,quantityKwh,startTime,row_id
0,cabin,2021-01-01 00:00:00,2024-12-20 09:35:40,NO1,177071.56,2020-12-31 23:00:00,1
1,cabin,2021-01-01 01:00:00,2024-12-20 09:35:40,NO1,171335.12,2021-01-01 00:00:00,2
2,cabin,2021-01-01 02:00:00,2024-12-20 09:35:40,NO1,164912.02,2021-01-01 01:00:00,3
3,cabin,2021-01-01 03:00:00,2024-12-20 09:35:40,NO1,160265.77,2021-01-01 02:00:00,4
4,cabin,2021-01-01 04:00:00,2024-12-20 09:35:40,NO1,159828.69,2021-01-01 03:00:00,5


In [74]:
# Setting column definitions for Cassandra
columns_cql = ', '.join([
    f'{col} {pandas_to_cassandra_type(df[col].dtype)}'
    for col in df.columns
])
columns_cql

'consumptionGroup text, endTime timestamp, lastUpdatedTime timestamp, priceArea text, quantityKwh double, startTime timestamp, row_id int'

In [75]:
# Creating the consumption table
create_table_cql = f"""
CREATE TABLE IF NOT EXISTS {table_name} (
    {columns_cql},
    PRIMARY KEY ({primary_key})
)
"""
session.execute(create_table_cql)
print(f'Table {table_name} created')

Table consumption created


In [77]:
# Inserting data into Cassandra using batch-insert 
columns = list(df.columns)
placeholders = ", ".join(["?"] * len(columns))
columns_str = ", ".join(columns)

insert_cql = f"INSERT INTO consumption ({columns_str}) VALUES ({placeholders})"

BATCH_SIZE = 100

prepared = session.prepare(insert_cql)
batch = BatchStatement()

for i, (_, row) in enumerate(df.iterrows(), 1):
    values = [v.to_pydatetime() if isinstance(v, pd.Timestamp) else v for v in row]
    batch.add(prepared, tuple(values))

    if i % BATCH_SIZE == 0:
        session.execute(batch)
        batch = BatchStatement()  # reset batch

# execute remaining
if len(batch) > 0:
    session.execute(batch)

print("Bulk insert completed")

Bulk insert completed


# Reading production data from Cassandra using Spark

In [78]:
# Start a Spark session
spark = (
    SparkSession.builder
    .appName('CassandraReader')
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.4.1')
    .config('spark.cassandra.connection.host', 'localhost')  
    .config('spark.cassandra.connection.port', '9042')
    .getOrCreate()
)

In [97]:
# Collect the data from the Cassandra database
df = (
    spark.read
    .format('org.apache.spark.sql.cassandra')
    .options(table='elhub', keyspace='my_first_keyspace')
    .load()
    .select('pricearea', 'productiongroup', 'starttime', 'quantitykwh')
)

In [98]:
# Check the dimensions of the data from Cassandra. The 872953 rows are consistent with the 215353 existing rows plus the 657600 new ones, good.
print((df.count(), len(df.columns)))

(872953, 4)


In [99]:
# Checking that the data looks ok
df.show()

+---------+---------------+-------------------+-----------+
|pricearea|productiongroup|          starttime|quantitykwh|
+---------+---------------+-------------------+-----------+
|      NO4|           wind|2023-03-01 23:00:00|   952622.6|
|      NO1|          other|2023-04-09 01:00:00|      21.58|
|      NO1|          hydro|2023-01-15 20:00:00|  1783630.6|
|      NO3|          hydro|2022-11-13 03:00:00|  2379384.2|
|      NO1|           wind|2024-03-25 10:00:00|   7346.178|
|      NO3|          other|2023-07-12 04:00:00|    180.415|
|      NO2|          solar|2024-06-15 23:00:00|     22.702|
|      NO4|          other|2024-09-01 12:00:00|       12.4|
|      NO5|          other|2024-07-25 08:00:00|     55.101|
|      NO4|          hydro|2024-05-01 08:00:00|  1390578.4|
|      NO5|          other|2024-02-22 03:00:00|        0.0|
|      NO3|          other|2023-11-14 21:00:00|     11.579|
|      NO4|          hydro|2021-08-23 14:00:00|  2385593.0|
|      NO1|        thermal|2023-06-10 14

In [100]:
# Let's also look at the tail of the data. Both head and tail look good.
df.orderBy("row_id", ascending=False).limit(5).collect()

[Row(pricearea='NO5', productiongroup='wind', starttime=datetime.datetime(2024, 12, 31, 23, 0), quantitykwh=0.0),
 Row(pricearea='NO5', productiongroup='wind', starttime=datetime.datetime(2024, 12, 31, 22, 0), quantitykwh=0.0),
 Row(pricearea='NO5', productiongroup='wind', starttime=datetime.datetime(2024, 12, 31, 21, 0), quantitykwh=0.0),
 Row(pricearea='NO5', productiongroup='wind', starttime=datetime.datetime(2024, 12, 31, 20, 0), quantitykwh=0.0),
 Row(pricearea='NO5', productiongroup='wind', starttime=datetime.datetime(2024, 12, 31, 19, 0), quantitykwh=0.0)]

# Inserting production data into MongoDB

In [18]:
# Converting the Spark dataframe to a pandas dataframe
pdf = df.toPandas()

In [19]:
pdf.head()

Unnamed: 0,pricearea,productiongroup,starttime,quantitykwh
0,NO4,thermal,2021-07-09 09:00:00,20562.0
1,NO1,other,2021-08-21 01:00:00,0.0
2,NO2,solar,2022-12-30 19:00:00,36.8
3,NO2,solar,2023-09-21 12:00:00,6647.843
4,NO5,thermal,2022-06-18 15:00:00,25514.0


In [20]:
# Inserting the data to MongoDB
collection = client.IND320.production_NO1
x = collection.insert_many(pdf.to_dict('records'))

# Reading consumption data from Cassandra using Spark and uploading to MongoDB

In [89]:
# Collect the data from the Cassandra database
df = (
    spark.read
    .format('org.apache.spark.sql.cassandra')
    .options(table='consumption', keyspace='my_first_keyspace')
    .load()
    .select('pricearea', 'consumptiongroup', 'starttime', 'quantitykwh')
)

In [90]:
# Check the dimensions of the data from Cassandra. The row count matches the 876600 rows inserted, good.
print((df.count(), len(df.columns)))

(876600, 4)


In [92]:
df.show()

+---------+----------------+-------------------+-----------+
|pricearea|consumptiongroup|          starttime|quantitykwh|
+---------+----------------+-------------------+-----------+
|      NO4|         primary|2023-08-27 04:00:00|  43776.594|
|      NO4|       household|2022-09-13 03:00:00|  325525.75|
|      NO1|       secondary|2024-06-20 11:00:00|   738869.5|
|      NO2|       household|2021-05-28 18:00:00|   809364.0|
|      NO1|        tertiary|2021-08-21 20:00:00|   780594.0|
|      NO3|        tertiary|2023-01-01 08:00:00|  513778.56|
|      NO1|         primary|2021-12-24 07:00:00|   67900.29|
|      NO3|       secondary|2022-05-18 00:00:00|  1734967.8|
|      NO5|           cabin|2022-08-23 14:00:00|  13076.411|
|      NO1|           cabin|2022-10-02 04:00:00|  56975.133|
|      NO4|           cabin|2023-10-09 01:00:00|  39618.633|
|      NO5|        tertiary|2024-08-17 03:00:00|  168533.05|
|      NO1|        tertiary|2021-07-30 05:00:00|  692040.94|
|      NO4|        terti

In [93]:
# Let's also look at the tail of the data. Both head and tail look good.
df.orderBy("row_id", ascending=False).limit(5).collect()

[Row(pricearea='NO5', consumptiongroup='tertiary', starttime=datetime.datetime(2024, 12, 31, 23, 0), quantitykwh=300571.7),
 Row(pricearea='NO5', consumptiongroup='tertiary', starttime=datetime.datetime(2024, 12, 31, 22, 0), quantitykwh=311207.06),
 Row(pricearea='NO5', consumptiongroup='tertiary', starttime=datetime.datetime(2024, 12, 31, 21, 0), quantitykwh=325010.97),
 Row(pricearea='NO5', consumptiongroup='tertiary', starttime=datetime.datetime(2024, 12, 31, 20, 0), quantitykwh=335697.84),
 Row(pricearea='NO5', consumptiongroup='tertiary', starttime=datetime.datetime(2024, 12, 31, 19, 0), quantitykwh=346044.1)]

In [94]:
# Converting the Spark dataframe to a pandas dataframe
pdf = df.toPandas()

In [95]:
pdf.head()

Unnamed: 0,pricearea,consumptiongroup,starttime,quantitykwh
0,NO1,secondary,2021-05-22 05:00:00,563663.6
1,NO5,cabin,2022-01-12 09:00:00,52592.51
2,NO4,household,2021-10-27 06:00:00,589983.75
3,NO5,secondary,2024-09-08 03:00:00,952329.4
4,NO4,secondary,2023-12-31 17:00:00,980870.0


In [96]:
# Inserting the data to MongoDB
collection = client.IND320.consumption_NO1
x = collection.insert_many(pdf.to_dict('records'))