# Project
## Problem Statement


# Part 1 - Data Ingestion

### Description

In this phase, we focus on the ingestion of raw weather data into the data pipeline. The raw data consists of historical hourly temperature readings from multiple weather stations, which will be used later for building a machine learning model to predict future temperatures. 

We use Spark's distributed computing capabilities to handle the ingestion of large datasets, ensuring scalability and efficiency, especially as the dataset grows over time. 

To make the dataset more challenging and to simulate real-world data processing, we deliberately strip the structure from the JSON data and write it out in a raw, log-like format. This requires additional preprocessing to transform it back into a usable format.

#### Steps in the Ingestion Process

1. **API Data Source**:
   - We retrieve temperature data from the Frost API, which provides detailed weather observations. The raw dataset includes hourly temperature readings, metadata about the measurement conditions, and timestamps.
   - The raw data is ingested into our Spark cluster, ensuring it is distributed across nodes for efficient processing.

2. **Simulating Raw Data**:
   - Although the API provides semi-structured JSON data, we convert it into a less structured log format. This simulates the ingestion of raw data that requires significant preprocessing. An example of the raw data format looks as follows:
     ```
     SN44640:0 | 2024-01-01T00:00:00Z | air_temperature:3.5degC | height:2m
     SN44640:0 | 2024/01/01 01:00:00 | 3.5 deg C, height: 2 m
     ```
   - The structure has been intentionally removed to introduce complexity, making it harder to directly extract relevant fields like `temperature` or `timestamp`.

3. **Cloud Environment Setup**:
   - A cloud infrastructure is established on [Cloud Provider], with a Spark cluster set up to handle distributed data processing. We use the following services:
     - **Storage**: For storing raw, intermediate, and processed datasets.
     - **Compute**: A Spark cluster for parallel data processing.
   - The cloud setup ensures that the system can scale horizontally as more data is ingested.

4. **Data Ingestion into Spark**:
   - The raw, unstructured log data is loaded into the Spark environment as plain text, either with `spark.read.text()` (which yields a DataFrame) or with `spark.sparkContext.textFile()` (which yields an RDD, as in the code cells further down), for further processing.
   - The ingestion pipeline is capable of handling both real-time streaming data (if the API provides live updates) and batch loading for historical data.

#### Challenges Encountered
- **Handling Unstructured Data**: Due to the simulated raw format, it was necessary to design robust parsing routines that could handle variations in formatting (e.g., inconsistent delimiters or missing fields).
- **Scaling for Large Datasets**: Given the potential size of the dataset, we optimized the ingestion process to handle large volumes of data by leveraging Spark's distributed capabilities.
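
As an illustration of the kind of parsing routine this calls for, here is a minimal sketch that handles the two example log lines shown earlier. The function names, the regular expressions, and the keys of the returned dictionary are assumptions made for this sketch, not the project's actual parser, and the two timestamp layouts are only the ones visible in the example.

```python
import re
from datetime import datetime

# Timestamp layouts seen in the two example lines (assumed, not exhaustive).
TIMESTAMP_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y/%m/%d %H:%M:%S")

# A number followed by "degC" or "deg C".
TEMPERATURE_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*deg\s*C", re.IGNORECASE)
# "height:2m" or "height: 2 m".
HEIGHT_RE = re.compile(r"height:\s*(\d+(?:\.\d+)?)\s*m", re.IGNORECASE)


def parse_timestamp(text):
    """Try each known layout; return None if none matches."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def parse_log_line(line):
    """Return a dict of parsed fields, or None if a required field is missing."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 3:
        return None
    timestamp = parse_timestamp(parts[1])
    rest = " ".join(parts[2:])
    temperature = TEMPERATURE_RE.search(rest)
    if timestamp is None or temperature is None:
        return None
    height = HEIGHT_RE.search(rest)
    return {
        "station": parts[0],
        "timestamp": timestamp,
        "temperature_c": float(temperature.group(1)),
        "height_m": float(height.group(1)) if height else None,
    }


print(parse_log_line("SN44640:0 | 2024/01/01 01:00:00 | 3.5 deg C, height: 2 m"))
```

Returning `None` for unparseable lines lets the caller drop them with a simple filter, which is the same pattern the cleaning step uses later.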

#### Ingestion Outcome
- By the end of this step, the raw data is successfully ingested into the Spark cluster, ready for the next phase of **Data Cleaning**. Each record includes essential fields such as temperature, timestamp, and weather station metadata.


### Create the url for the API call
SN44640 and SN44630 are the station IDs for the weather stations at Stavanger - Våland and Stavanger - Kiellandsmyra, respectively. They were obtained from https://seklima.met.no/.

The data is fetched from the frost.met.no API using the following parameters:
- Sources: SN44640 and SN44630
- From date: 2010-01-01
- To date: 2024-10-04
- Elements: air_temperature, wind_speed, precipitation_amount, and relative_humidity
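
A request with these parameters could be assembled with `urllib.parse.urlencode` instead of string concatenation, so that escaping is handled in one place. The sketch below is an assumption about how the full parameter list would be encoded: the helper name `build_frost_url` is hypothetical, while the endpoint and the query keys are the ones used in the code cell that follows.

```python
from urllib.parse import urlencode

FROST_ENDPOINT = "https://frost.met.no/observations/v0.jsonld"


def build_frost_url(sources, from_date, to_date, elements):
    """Build an observations URL from lists of station IDs and element names."""
    params = {
        "sources": ",".join(sources),
        "referencetime": f"{from_date}/{to_date}",
        "elements": ",".join(elements),
    }
    # safe=",/" keeps the list and interval separators readable
    # instead of percent-encoding them.
    return f"{FROST_ENDPOINT}?{urlencode(params, safe=',/')}"


example_url = build_frost_url(
    sources=["SN44640", "SN44630"],
    from_date="2010-01-01",
    to_date="2024-10-04",
    elements=["air_temperature", "wind_speed", "precipitation_amount", "relative_humidity"],
)
print(example_url)
```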


In [1]:
# stationNr = "SN44630, SN44640"  # Unsure whether two stations are really necessary. We could take the average of the two. With only one station we can fetch data for a longer period.
stationNr = "SN44640"

#fromDate = "2015-01-01"
fromDate = "2024-10-05"
toDate = "2024-10-09"

elements = "air_temperature"

url = "https://frost.met.no/observations/v0.jsonld?sources=" + stationNr + "&referencetime=" + fromDate + "/" + toDate + "&elements="+ elements
username = "09e81c81-f133-474d-a678-7929cb28b4ef:5291a085-b859-4490-9a99-6539c879a165"
password = "5291a085-b859-4490-9a99-6539c879a165"

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TemperatureDataIngestion").getOrCreate()

### Fetch the data from the API

In [3]:
import requests

result = requests.get(url, auth=(username, password))
data = result.json()
data_entries = data['data']
rdd = spark.sparkContext.parallelize(data_entries)

### Since the data is semi-structured, we need to convert it to a raw format and save it in a txt file.

In [8]:
def to_raw_format(entry):
    source_id = entry['sourceId']
    ref_time = entry['referenceTime']
    obs = entry['observations'][0]
    temp_value = obs['value']
    temp_unit = obs['unit']
    height = obs['level']['value']
    height_unit = obs['level']['unit']
    time_res = obs['timeResolution']
    
    log_entry = f"{source_id}  {ref_time}  air_temperature:{temp_value}{temp_unit}  height_above_ground:{height}{height_unit}  {time_res}"
    return log_entry
    
raw_rdd = rdd.map(to_raw_format)

raw_rdd.saveAsTextFile("hdfs:///project/raw_temperature_data")



### Load the saved text file

In [None]:
# Load the saved data from the directory
loaded_rdd = spark.sparkContext.textFile("hdfs:///project/raw_temperature_data")

# Print the first 5 entries to verify it was saved and loaded correctly
print(loaded_rdd.take(5))



# Part 2. Data Cleaning


### Process the raw data line by line and convert it into a structured format with the date as the key and the temperature and day-of-year features as values. Since the data is hourly, we convert the datetime to a date.

In [None]:
from datetime import datetime
import math

# Define a function to process each line of the log file and extract date as a key and temperature and day of the year as values
def process_line(line):
    fields = line.split('  ')
    if len(fields) != 5:
        return None  # Invalid format, skip this line

    source_id, ref_time, temperature, height, time_res = [f.strip() for f in fields]

    # Extract date and temperature
    date = ref_time.split('T')[0]  # Assumes ISO format like '2024-10-01T00:00:00.000Z'
    temp_value = float(temperature.split(':')[1].rstrip('degC'))

    # Convert date string to datetime object
    date_obj = datetime.strptime(date, '%Y-%m-%d')

    # Calculate day of the year
    day_of_year = date_obj.timetuple().tm_yday

    # Apply sine/cosine transformation to day of the year to capture seasonality
    day_of_year_sin = math.sin(2 * math.pi * day_of_year / 365)
    day_of_year_cos = math.cos(2 * math.pi * day_of_year / 365)
    
    return (date, (temp_value, day_of_year_sin, day_of_year_cos))


# Process the loaded RDD to extract the required information
processed_rdd = loaded_rdd.map(process_line).filter(lambda x: x is not None)
print(processed_rdd.take(5))
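
To see why a sine/cosine pair is used instead of the raw day number, the sketch below compares 31 December with 1 January. As raw day numbers they are 364 apart, but on the unit circle they are neighbours. The helper names are assumptions for this sketch; the period of 365 matches the cell above, which also ignores leap years.

```python
import math


def encode_day(day_of_year, period=365):
    """Map a day number onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * day_of_year / period
    return (math.sin(angle), math.cos(angle))


def circle_distance(a, b):
    """Euclidean distance between two encoded days."""
    return math.dist(a, b)


dec_31 = encode_day(365)
jan_1 = encode_day(1)
jul_2 = encode_day(183)

# Adjacent days across the year boundary end up close together.
print(circle_distance(dec_31, jan_1))
# Days half a year apart end up on opposite sides of the circle.
print(circle_distance(jan_1, jul_2))
```

The first distance is small and the second is close to 2, the diameter of the unit circle, so a model sees late December and early January as similar.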

### Since the hourly readings are now keyed by date, we aggregate them to get the average temperature for each day.

### Profile the performance of the MapReduce implementation (e.g., Spark job execution time, memory usage).

In [None]:
import time
import statistics

# Measure the execution time of the MapReduce job
start_time = time.time()

# Perform the MapReduce operation with rounding to whole numbers
daily_avg_rdd = processed_rdd.groupByKey().mapValues(lambda temps: list(temps)) \
    .mapValues(lambda temps: (round(statistics.mean([t[0] for t in temps])), temps[0][1], temps[0][2]))

# Transformations are lazy, so trigger an action to force execution before stopping the timer
daily_avg_rdd.count()

end_time = time.time()
execution_time = end_time - start_time

print(f"MapReduce job execution time: {execution_time} seconds")

# To monitor memory usage, you can use the Spark UI. 
# The Spark UI is usually available at http://<driver-node>:4040
# You can access it by navigating to the URL in your web browser.

print(daily_avg_rdd.take(5))
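
`groupByKey` collects every hourly value for a date in one place before the mean is taken. A common alternative is to carry a running `(sum, count)` aggregate per key and merge aggregates pairwise, which is the shape `reduceByKey` expects. The sketch below shows only the combining logic in plain Python so it can be checked without a cluster. `to_sum_count`, `merge`, and `finish` are hypothetical names, the sample readings are made up, and wiring them in as `processed_rdd.mapValues(to_sum_count).reduceByKey(merge).mapValues(finish)` is an assumption rather than the project's code.

```python
from functools import reduce


def to_sum_count(value):
    """(temp, sin, cos) -> (temp_sum, count, sin, cos)."""
    temp, day_sin, day_cos = value
    return (temp, 1, day_sin, day_cos)


def merge(a, b):
    """Combine two partial aggregates that belong to the same date."""
    return (a[0] + b[0], a[1] + b[1], a[2], a[3])


def finish(acc):
    """Turn the aggregate into (mean_temp, sin, cos)."""
    temp_sum, count, day_sin, day_cos = acc
    return (temp_sum / count, day_sin, day_cos)


# Three hourly readings for one date (made-up values for illustration).
hourly = [(10.0, 0.5, 0.8), (12.0, 0.5, 0.8), (14.0, 0.5, 0.8)]
daily = finish(reduce(merge, map(to_sum_count, hourly)))
print(daily)
```

The mean is left unrounded here, whereas the cell above rounds it to a whole number; `round` could be applied inside `finish` to keep the same behaviour.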

### If there are gaps in the data, fill the missing values with the average of the previous and next values.

In [None]:
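from datetime import date, timedelta
# The key of each record is an ISO date string, so min() and max() are chronological
min_max_dates = processed_rdd.keys().min(), processed_rdd.keys().max()
start_day, end_day = (date.fromisoformat(d) for d in min_max_dates)
complete_days = spark.sparkContext.parallelize(
    [(start_day + timedelta(days=i)).isoformat() for i in range((end_day - start_day).days + 1)])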
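
The gap-filling rule in the heading can be sketched on a plain dictionary before it is expressed in Spark. This is an assumption about the intended behaviour: each missing date between the first and last known date takes the mean of the nearest known value before it and the nearest known value after it. `fill_gaps` and the sample values are hypothetical, and the input is assumed to be non-empty.

```python
from datetime import date, timedelta


def fill_gaps(daily_temps):
    """Return a copy of {iso_date: temp} with missing interior dates filled."""
    known = sorted(daily_temps)
    start, end = date.fromisoformat(known[0]), date.fromisoformat(known[-1])
    filled = dict(daily_temps)
    for offset in range((end - start).days + 1):
        day = (start + timedelta(days=offset)).isoformat()
        if day in daily_temps:
            continue
        # Nearest known dates on either side; ISO strings compare chronologically.
        previous = max(d for d in known if d < day)
        following = min(d for d in known if d > day)
        filled[day] = (daily_temps[previous] + daily_temps[following]) / 2
    return filled


sample = {"2024-10-05": 10.0, "2024-10-07": 14.0, "2024-10-08": 13.0}
print(fill_gaps(sample)["2024-10-06"])
```

Because only dates strictly between the first and last known date are visited, a neighbour always exists on both sides of a missing day.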