# Assignment 2

## AI usage 

Used DeepSeek for help on how to connect Cassandra and Spark. 

## Log 

For this assignment I will try to work more systematically than in the last one. I will focus on finishing all the elements of the notebook first, and work on the Streamlit app afterwards.

I started by connecting Cassandra and Spark, but ran into problems with the SparkSession. After troubleshooting for a while and getting some help, we asked DeepSeek how to fix it. DeepSeek suggested adding the last three lines of code to force the session builder to use localhost.


## Links 

- Github: https://github.com/Satheris/IND320_SMAA
- Streamlit app: https://ind320smaa-2eg32uba6uhmrknkwtxzar.streamlit.app/

## Coding

### Imports and system variables

In [2]:
import numpy as np
import pandas as pd 

In [3]:
# Set environment variables for PySpark (system and version dependent!) 
# if not already set persistently (e.g., in .bashrc or .bash_profile or Windows environment variables)
import os
# Set the Java home path to the one you are using ((un)comment and edit as needed):
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_471"

# If you are using environments in Python, you can set the environment variables like the alternative below.
# The default Python environment is used if the variables are set to "python" (edit if needed):
os.environ["PYSPARK_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"

# On Windows you need to specify where the Hadoop drivers are located (uncomment and edit if needed):
os.environ["HADOOP_HOME"] = r"C:\Users\saraa\Documents\winutils\hadoop-3.3.1"

# Set the Hadoop version to the one you are using, e.g., none:
os.environ["PYSPARK_HADOOP_VERSION"] = "without"

### Cassandra and Spark

In [4]:
# Connecting to Cassandra
from cassandra.cluster import Cluster
cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()

In [5]:
# Set up new keyspace (first time only)
#                                              name of keyspace                        replication strategy           replication factor
session.execute("CREATE KEYSPACE IF NOT EXISTS my_first_keyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")

<cassandra.cluster.ResultSet at 0x21f7ab32850>

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkCassandraApp').\
    config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.5.1').\
    config('spark.cassandra.connection.host', 'localhost').\
    config('spark.sql.extensions', 'com.datastax.spark.connector.CassandraSparkExtensions').\
    config('spark.sql.catalog.mycatalog', 'com.datastax.spark.connector.datasource.CassandraCatalog').\
    config('spark.cassandra.connection.port', '9042').\
    config('spark.driver.host', 'localhost').\
    config('spark.driver.bindAddress', '127.0.0.1').\
    config('spark.sql.adaptive.enabled', 'true').\
    getOrCreate()

In [7]:
# .load() is used to load data from Cassandra table as a Spark DataFrame
spark.read.format("org.apache.spark.sql.cassandra").options(table="my_first_table", keyspace="my_first_keyspace").load().show()

+---+--------+-------+
|ind| company|  model|
+---+--------+-------+
|  1|   Tesla|Model S|
|  3|Polestar|      3|
|  2|   Tesla|Model 3|
+---+--------+-------+



In [8]:
# Create view for simpler SQL queries
spark.read.format("org.apache.spark.sql.cassandra").options(table="table_with_uuid", keyspace="my_first_keyspace").load().createOrReplaceTempView("my_first_table_view")

In [10]:
# Read CSV file into Spark DataFrame
planets = spark.read.csv("../data/planets.csv", header=True, inferSchema=True)
planets.show()

+-------+---------+---------+
| planet| distance| diameter|
+-------+---------+---------+
|Mercury| 0.387 AU|  4878 km|
|  Venus| 0.723 AU| 12104 km|
|  Earth| 1.000 AU| 12756 km|
|   Mars| 1.524 AU|  6787 km|
|Jupiter| 5.203 AU|142796 km|
| Saturn| 9.546 AU|120660 km|
| Uranus|19.218 AU| 51118 km|
|Neptune|30.069 AU| 48600 km|
+-------+---------+---------+



### MongoDB

In [11]:
# Stop Spark session
try:
    spark.stop()
    print('Spark session terminated successfully')
except ConnectionRefusedError:
    print("Spark session already stopped.")
except NameError:
    print('Spark session is not defined')

Spark session terminated successfully


# General


A Streamlit app running from https://[yourproject].streamlit.app/.
This is an online version of the project, accessing data that has been exported to CSV format and accessing your MongoDB database for additional data.
The code, hosted at GitHub, must include relevant comments from the Jupyter Notebook and further comments regarding Streamlit usage.


## Tasks

### Local database: Cassandra
If not already done, set up Cassandra and Spark as described in the book.
Test that your Spark-Cassandra connection works.
The Cassandra database will be accessed from the Jupyter Notebook and used to store data from the API mentioned later. 

### Remote database: MongoDB
If not already done, prepare a MongoDB account at mongodb.com.
Test that you can manipulate data from Python.
The MongoDB database will store data that has been trimmed/curated/prepared through the Jupyter Notebook and Spark filtering.
These data will be accessed directly from the Streamlit app.

### API
Familiarise yourself with the API connection at https://api.elhub.no.

Observe how time is encoded and how transitions between summer and winter time are handled.
Be aware of the time period limitations for each API request and how this differs between datasets.
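
One way to inspect the summer/winter time transitions is to convert the timestamps to UTC and look at the spacing between them. The sketch below assumes the API returns ISO 8601 strings that carry a UTC offset; the sample values are made up for illustration and are not copied from an Elhub response.

```python
from datetime import datetime, timezone

# Hypothetical hourly start times around the spring transition (28 March 2021).
# The local clock jumps from 01:00 (+01:00) straight to 03:00 (+02:00).
sample_start_times = [
    "2021-03-28T00:00:00+01:00",
    "2021-03-28T01:00:00+01:00",
    "2021-03-28T03:00:00+02:00",
    "2021-03-28T04:00:00+02:00",
]


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


utc_times = [to_utc(ts) for ts in sample_start_times]

# In UTC the hours are evenly spaced, even though local 02:00 never occurs.
gaps_in_hours = [
    (later - earlier).total_seconds() / 3600
    for earlier, later in zip(utc_times, utc_times[1:])
]
print(gaps_in_hours)
```

Working in UTC (or keeping the offset with every timestamp) avoids ending up with a missing hour in spring and a duplicated hour in autumn.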


### Jupyter Notebook

#### Standard requirements

Must include a brief description of AI usage.

Must include a 300-500-word log describing the compulsory work (including both Jupyter Notebook and Streamlit experience).

Must include links to your public GitHub repository and Streamlit app (see below) for the compulsory work.

Document headings should be clear and usable for navigation during development.

All code blocks must include enough comments to be understandable and reproducible if someone inherits your project.

All code blocks must be run before an export to PDF so the messages and plots are shown. In addition, add the .ipynb file to the GitHub repository where you have your Streamlit project.


#### Tasks for assignment 2
Use the Elhub API to retrieve hourly production data for all price areas using PRODUCTION_PER_GROUP_MBA_HOUR for all days and hours of the year 2021.
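
Because each API request only covers a limited time period, a full year has to be split into several requests. A minimal sketch of that splitting is shown below. It assumes that one calendar month per request is within the limit for this dataset, which should be checked against the API documentation; the function name `month_windows` is hypothetical.

```python
from datetime import date


def month_windows(year: int) -> list[tuple[date, date]]:
    """Return (start, end) pairs covering each calendar month; end is exclusive."""
    windows = []
    for month in range(1, 13):
        start = date(year, month, 1)
        # The first day of the next month is the exclusive end of this window.
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        windows.append((start, end))
    return windows


windows_2021 = month_windows(2021)
for start, end in windows_2021[:2]:
    print(start.isoformat(), "->", end.isoformat())
```

Each pair can then be formatted into the start and end parameters of one request, and the results concatenated afterwards.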

Extract only the list in productionPerGroupMbaHour, convert to a DataFrame, and insert the data into Cassandra using Spark.
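
The extraction step can be sketched as follows. The surrounding response structure (`data` and `attributes`) and all sample values are assumptions made for illustration; only the key `productionPerGroupMbaHour` and the column names come from the task description.

```python
# Hypothetical response shape; adjust the outer keys to match the real JSON.
sample_response = {
    "data": [
        {
            "attributes": {
                "productionPerGroupMbaHour": [
                    {"priceArea": "NO1", "productionGroup": "hydro",
                     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 100.0},
                    {"priceArea": "NO1", "productionGroup": "wind",
                     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 25.0},
                ]
            }
        },
        {
            "attributes": {
                "productionPerGroupMbaHour": [
                    {"priceArea": "NO2", "productionGroup": "hydro",
                     "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 80.0},
                ]
            }
        },
    ]
}


def extract_production_rows(response: dict) -> list[dict]:
    """Collect every productionPerGroupMbaHour entry into one flat list of rows."""
    rows = []
    for item in response.get("data", []):
        rows.extend(item.get("attributes", {}).get("productionPerGroupMbaHour", []))
    return rows


rows = extract_production_rows(sample_response)
print(len(rows), "rows extracted")
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` before the data is handed to Spark.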

Use Spark to extract the columns priceArea, productionGroup, startTime, and quantityKwh from Cassandra.

Create the following plots:
- A pie chart for the total production of the year from a chosen price area, where each piece of the pie is one of the production groups.
- A line plot for the first month of the year for a chosen price area. Make separate lines for each production group.
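
The aggregation behind the pie chart can be sketched in plain Python as below. The sample rows and the helper name `total_per_group` are hypothetical; in the notebook the same grouping would more likely be done on the Spark or pandas DataFrame.

```python
from collections import defaultdict

# Made-up rows using the column names from the task description.
sample_rows = [
    {"priceArea": "NO1", "productionGroup": "hydro", "quantityKwh": 300.0},
    {"priceArea": "NO1", "productionGroup": "wind", "quantityKwh": 100.0},
    {"priceArea": "NO1", "productionGroup": "hydro", "quantityKwh": 100.0},
    {"priceArea": "NO2", "productionGroup": "hydro", "quantityKwh": 999.0},
]


def total_per_group(rows: list[dict], price_area: str) -> dict[str, float]:
    """Sum quantityKwh per production group for one price area."""
    totals = defaultdict(float)
    for row in rows:
        if row["priceArea"] == price_area:
            totals[row["productionGroup"]] += row["quantityKwh"]
    return dict(totals)


totals_no1 = total_per_group(sample_rows, "NO1")
print(totals_no1)
```

The values and keys of the result map onto the slices and labels of the pie chart, for example with matplotlib's `plt.pie(values, labels=labels)`.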

Insert the Spark-extracted data into your MongoDB.

Remember to fill in the log and the AI usage description mentioned in the General section above.

### Streamlit app
Establish a connection with your MongoDB database. When running this at streamlit.io, remember to copy your secrets to the webpage instead of exposing them on GitHub.
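
A minimal sketch of building the connection string from stored secrets is shown below. The key names (`username`, `password`, `host`) and the example values are hypothetical; use whatever names are stored in the app's secrets.

```python
from urllib.parse import quote_plus


def build_mongo_uri(secrets) -> str:
    """Build a MongoDB connection string from a mapping of secrets."""
    # Special characters in the credentials must be percent-encoded.
    user = quote_plus(secrets["username"])
    password = quote_plus(secrets["password"])
    return f"mongodb+srv://{user}:{password}@{secrets['host']}/"


# Stand-in for the real secrets; actual values must never be committed to GitHub.
example_secrets = {
    "username": "app_user",
    "password": "p@ss/word",
    "host": "cluster0.example.mongodb.net",
}
uri = build_mongo_uri(example_secrets)
print(uri)
```

In the app, the mapping would come from `st.secrets` (for example a `mongo` section, also a hypothetical name), and the resulting URI would be passed to `pymongo.MongoClient(uri)`.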

On page four, split the view into two columns using st.columns.

- On the left side, use radio buttons (st.radio) to select a price area and display a pie chart like the one in the Jupyter Notebook.

- On the right side, use pills (st.pills) to select which production groups to include and a selection element of your choice to select a month. Combine the price area, production group(s) and month, and display a line plot like in the Jupyter Notebook (but for any month).

- Below the columns, insert an expander (st.expander) where you briefly document the source of the data shown on the page.
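
The page layout described above can be sketched as follows. This is an outline under assumptions, not a finished page: the helper names are hypothetical, `startTime` is assumed to be stored as an ISO 8601 string, the plots are replaced by placeholders, and `st.pills` requires a recent Streamlit version.

```python
def filter_rows(rows, price_area, groups, month):
    """Keep rows matching the chosen price area, production groups and month (1-12)."""
    return [
        row for row in rows
        if row["priceArea"] == price_area
        and row["productionGroup"] in groups
        # Characters 5-6 of an ISO 8601 string hold the month, e.g. "2021-01-...".
        and int(row["startTime"][5:7]) == month
    ]


def render_page(rows):
    """Draw the two-column layout with an expander below it."""
    import streamlit as st  # imported lazily so filter_rows works without Streamlit

    areas = sorted({row["priceArea"] for row in rows})
    all_groups = sorted({row["productionGroup"] for row in rows})

    left, right = st.columns(2)
    with left:
        area = st.radio("Price area", areas)
        st.write(f"Pie chart for {area} goes here")  # placeholder for the pie chart
    with right:
        groups = st.pills("Production groups", all_groups,
                          selection_mode="multi", default=all_groups)
        month = st.selectbox("Month", list(range(1, 13)))
        selected = filter_rows(rows, area, groups or [], month)
        st.write(f"{len(selected)} rows selected")  # placeholder for the line plot

    with st.expander("Data source"):
        st.write("Hourly production data retrieved from the Elhub API.")
```

Keeping the filtering in a separate function makes it possible to test the selection logic without starting the app.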