# Part 1 - Data Ingestion

### Description

In this phase, we focus on the ingestion of raw weather data into the data pipeline. The raw data consists of historical hourly temperature readings from multiple weather stations, which will be used later for building a machine learning model to predict future temperatures. 

We use Spark's distributed computing capabilities to handle the ingestion of large datasets, ensuring scalability and efficiency, especially as the dataset grows over time. 

To make the task more challenging and to simulate real-world data processing, we deliberately flatten the structured JSON data into a raw, log-like text format. This requires additional preprocessing to transform it back into a usable format.

#### Steps in the Ingestion Process

1. **API Data Source**:
   - We retrieve temperature data from the Frost API, which provides detailed weather observations. The raw dataset includes hourly temperature readings, metadata about the measurement conditions, and timestamps.
   - The raw data is ingested into our Spark cluster, ensuring it is distributed across nodes for efficient processing.

2. **Simulating Raw Data**:
   - Although the API provides semi-structured JSON data, we convert it into a less structured log format. This simulates the ingestion of raw data that requires significant preprocessing. An example of the raw data format looks as follows:
     ```
     SN44640:0  2024-01-01T00:00:00Z  air_temperature:3.5degC  height:2m
     SN44640:0  2024/01/01 01:00:00  3.5 deg C height: 2 m
     ```
   - The structure has been intentionally removed to introduce complexity, making it harder to directly extract relevant fields like `temperature` or `timestamp`.

3. **Cloud Environment Setup**:
   - A cloud infrastructure is established on [Cloud Provider], with a Spark cluster set up to handle distributed data processing. We use the following services:
     - **Storage**: For storing raw, intermediate, and processed datasets.
     - **Compute**: A Spark cluster for parallel data processing.
   - The cloud setup ensures that the system can scale horizontally as more data is ingested.

4. **Data Ingestion into Spark**:
   - The raw, unstructured log data is ingested into the Spark environment using the `spark.read.text()` function, which loads the raw logs into a DataFrame for further processing.
   - The ingestion pipeline is capable of handling both real-time streaming data (if the API provides live updates) and batch loading for historical data.

#### Challenges Encountered
- **Handling Unstructured Data**: Due to the simulated raw format, it was necessary to design robust parsing routines that could handle variations in formatting (e.g., inconsistent delimiters or missing fields).
- **Scaling for Large Datasets**: Given the potential size of the dataset, we optimized the ingestion process to handle large volumes of data by leveraging Spark's distributed capabilities.
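
A parsing routine that tolerates such variations could look like the following minimal sketch. It is not the parser used in the later phases; the function name `parse_log_line` and the three regular expressions are assumptions written against the two sample lines shown in step 2. Each field is searched for independently, so a missing field yields `None` instead of an exception.

```python
import re
from datetime import datetime

# Accepts both "2024-01-01T00:00:00Z" and "2024/01/01 01:00:00".
TIMESTAMP_RE = re.compile(r"(\d{4})[-/](\d{2})[-/](\d{2})[T ](\d{2}):(\d{2}):(\d{2})")
# Accepts "3.5degC" and "3.5 deg C", including negative values.
TEMPERATURE_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*deg\s*C")
# Accepts "height:2m", "height: 2 m" and "height_above_ground:2m".
HEIGHT_RE = re.compile(r"height\w*:\s*(\d+(?:\.\d+)?)\s*m")


def parse_log_line(line):
    """Parse one raw log line into a dict; fields that cannot be found are None."""
    tokens = line.split()
    ts = TIMESTAMP_RE.search(line)
    temp = TEMPERATURE_RE.search(line)
    height = HEIGHT_RE.search(line)
    return {
        # The station id is assumed to be the first whitespace-separated token.
        "station": tokens[0] if tokens else None,
        # Any timezone suffix is ignored; the result is a naive datetime.
        "timestamp": datetime(*map(int, ts.groups())) if ts else None,
        "temperature_c": float(temp.group(1)) if temp else None,
        "height_m": float(height.group(1)) if height else None,
    }


samples = [
    "SN44640:0  2024-01-01T00:00:00Z  air_temperature:3.5degC  height:2m",
    "SN44640:0  2024/01/01 01:00:00  3.5 deg C height: 2 m",
]
for sample in samples:
    print(parse_log_line(sample))
```

Searching for each field separately, rather than matching the whole line with one pattern, keeps the routine independent of field order and delimiter width.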

#### Ingestion Outcome
- By the end of this step, the raw data is successfully ingested into the Spark cluster, ready for the next phase of **Data Cleaning**. Each record includes essential fields such as temperature, timestamp, and weather station metadata.


### Create the url for the API call
SN19710 is the station ID for the weather station at Asker. It was chosen because it has a long observation history. The station information was obtained from https://seklima.met.no/.

The data is fetched from the frost.met.no API using the following parameters, which match the fetch code below:
- Sources: SN19710
- From date: 2000-01-01
- To date: 2024-10-18
- Elements: air_temperature
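
As an illustration, the request URL can be assembled from these parameters with the standard library instead of string concatenation. This is a minimal sketch: the helper name `build_frost_url` is hypothetical, while the endpoint and the query parameter names (`sources`, `referencetime`, `elements`) are the ones used by the fetch code in this notebook. The example dates are the first window of that fetch code.

```python
from urllib.parse import urlencode

FROST_OBSERVATIONS_ENDPOINT = "https://frost.met.no/observations/v0.jsonld"


def build_frost_url(source, from_date, to_date, elements):
    """Return an observations URL for one station and one date range."""
    query = {
        "sources": source,
        "referencetime": f"{from_date}/{to_date}",
        "elements": ",".join(elements),
    }
    # safe="/," leaves the date-range slash and the element commas unescaped.
    return f"{FROST_OBSERVATIONS_ENDPOINT}?{urlencode(query, safe='/,')}"


url = build_frost_url("SN19710", "2000-01-01", "2010-12-31", ["air_temperature"])
print(url)
```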


### Fetch the data from the API. The request has to be split up because the API only allows 100000 records per request.

In [1]:
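import os
import requests

# Credentials are read from the environment rather than hard-coded in the notebook.
# FROST_USERNAME and FROST_PASSWORD are assumed variable names; set them before running.
username = os.environ["FROST_USERNAME"]
password = os.environ["FROST_PASSWORD"]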

stationNr = "SN19710"

elements = "air_temperature"

data_entries = []

# Fetch the data in three date windows to stay under the per-request limit
date_ranges = [
    ("2000-01-01", "2010-12-31"),
    ("2011-01-01", "2020-12-31"),
    ("2021-01-01", "2024-10-18"),
]

for fromDate, toDate in date_ranges:
    url = "https://frost.met.no/observations/v0.jsonld?sources=" + stationNr + "&referencetime=" + fromDate + "/" + toDate + "&elements=" + elements
    result = requests.get(url, auth=(username, password))
    # Stop on HTTP errors instead of failing later on a missing "data" key
    result.raise_for_status()
    data_entries.extend(result.json()["data"])

print("Total number of data entries: ", len(data_entries))

Total number of data entries:  121414
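
The date windows in the fetch code are written out by hand. As an illustration only, they could also be generated from a start and end date. The helper `split_date_range` and its `years_per_window` parameter are hypothetical, and the windows it produces are aligned to calendar decades, so they differ slightly from the hand-written ones. Whether the API treats the end of a `referencetime` interval as inclusive is not checked here.

```python
from datetime import date, timedelta


def split_date_range(start, end, years_per_window=10):
    """Yield (from_date, to_date) ISO strings that cover start..end without overlap."""
    window_start = start
    while window_start <= end:
        # Each window ends on 31 December, or on `end` if that comes first.
        last_year = window_start.year + years_per_window - 1
        window_end = min(date(last_year, 12, 31), end)
        yield window_start.isoformat(), window_end.isoformat()
        window_start = window_end + timedelta(days=1)


windows = list(split_date_range(date(2000, 1, 1), date(2024, 10, 18)))
for from_date, to_date in windows:
    print(from_date, to_date)
```

A smaller `years_per_window` gives smaller responses at the cost of more requests.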


### Load it into an RDD in Spark

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_ingestion").getOrCreate()

rdd = spark.sparkContext.parallelize(data_entries)

### Since the data is semi-structured, we convert it to a raw log format and save it as a text file.

In [3]:
def to_raw_format(entry):
    source_id = entry["sourceId"]
    ref_time = entry["referenceTime"]
    obs = entry["observations"][0]
    temp_value = obs["value"]
    temp_unit = obs["unit"]
    height = obs["level"]["value"]
    height_unit = obs["level"]["unit"]
    time_res = obs["timeResolution"]
    
    log_entry = f"{source_id}  {ref_time}  air_temperature:{temp_value}{temp_unit}  height_above_ground:{height}{height_unit}  {time_res}"
    return log_entry
    
raw_rdd = rdd.map(to_raw_format)

try:
    # saveAsTextFile fails if the output directory already exists
    raw_rdd.saveAsTextFile("hdfs:///project/raw_temperature_data")
except Exception:
    print("File already exists")
    # Delete the existing output directory and save again
    !hdfs dfs -rm -r /project/raw_temperature_data
    raw_rdd.saveAsTextFile("hdfs:///project/raw_temperature_data")
    



File already exists
Deleted /project/raw_temperature_data
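
The `to_raw_format` function indexes `obs["level"]` and `obs["timeResolution"]` directly, so an entry without those keys would raise a `KeyError` and fail the Spark job. Whether the API ever omits them is not established in this notebook. Under that assumption, a more tolerant variant might look like the sketch below. The name `to_raw_format_safe` is hypothetical, and the sample entry is hand-built for illustration, not an actual API response.

```python
def to_raw_format_safe(entry):
    """Format one observation entry as a log line, skipping optional fields that are absent."""
    obs = entry["observations"][0]
    level = obs.get("level") or {}
    parts = [
        entry["sourceId"],
        entry["referenceTime"],
        f"air_temperature:{obs['value']}{obs['unit']}",
    ]
    # Only append the height and time resolution when they are present.
    if "value" in level:
        parts.append(f"height_above_ground:{level['value']}{level.get('unit', '')}")
    if "timeResolution" in obs:
        parts.append(obs["timeResolution"])
    # Fields are separated by two spaces, as in to_raw_format.
    return "  ".join(parts)


# Hand-built sample entry (assumed shape, inferred from the fields to_raw_format reads).
sample_entry = {
    "sourceId": "SN19710:0",
    "referenceTime": "2000-01-01T06:00:00.000Z",
    "observations": [
        {
            "value": -6.3,
            "unit": "degC",
            "level": {"value": 2, "unit": "m"},
            "timeResolution": "PT6H",
        }
    ],
}
print(to_raw_format_safe(sample_entry))
```

It could be applied in the same way as the original, for example `rdd.map(to_raw_format_safe)`.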


                                                                                

In [4]:
##Print part of the data
for entry in raw_rdd.take(5):
    print(entry)
spark.stop()


SN19710:0  2000-01-01T06:00:00.000Z  air_temperature:-6.3degC  height_above_ground:2m  PT6H
SN19710:0  2000-01-01T12:00:00.000Z  air_temperature:-4degC  height_above_ground:2m  PT6H
SN19710:0  2000-01-01T18:00:00.000Z  air_temperature:-7degC  height_above_ground:2m  PT6H
SN19710:0  2000-01-02T06:00:00.000Z  air_temperature:-4.4degC  height_above_ground:2m  PT6H
SN19710:0  2000-01-02T12:00:00.000Z  air_temperature:2.1degC  height_above_ground:2m  PT6H
