#### Raw_to_Bronze
This notebook loads the raw NOAA GHCN-Daily JSON dataset and writes it, unchanged, to the Bronze layer as Parquet.<br>
No cleaning is performed here; Bronze must stay raw.


#### EXPLANATION OF FOLDER CONTENTS

##### RAW
- Original dataset, exactly as downloaded from the external source (NOAA GHCN-Daily).
- Format: JSON
- File: `gcc_raw_2015_2025.json`

##### BRONZE
- Raw JSON converted into Parquet
- No cleaning or imputation performed  

##### SILVER
- Cleaned climate dataset, ready for feature engineering.

##### GOLD
- Machine-learning-ready dataset, built from the Silver data.

#### EXPLANATION OF RAW GHCN DAILY COLUMNS

**ID:**  
- The weather station identifier.  
- Example: `"AE000041196"` (UAE station).

**DATE:**  
- The observation date in `YYYYMMDD` format.  
- Example: `20150101` → January 1st, 2015.
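
As a small illustration of the `YYYYMMDD` convention, the sketch below parses a raw `DATE` value into a calendar date using plain Python. The helper name `parse_ghcn_date` is hypothetical and not part of this notebook; in the pipeline itself the equivalent conversion would presumably be done on the Spark column.

```python
from datetime import date, datetime

def parse_ghcn_date(raw_date):
    """Convert a GHCN-Daily DATE value (int or str, YYYYMMDD) to a date."""
    return datetime.strptime(str(raw_date), "%Y%m%d").date()

print(parse_ghcn_date(20150101))  # 2015-01-01
```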

**ELEMENT:**  
- The type of climate measurement.  
- Common values:  
  - **TMAX** = Maximum temperature of the day (stored in tenths of °C)  
  - **TMIN** = Minimum temperature of the day (stored in tenths of °C)  
  - **TAVG** = Average temperature of the day (stored in tenths of °C)  
  - **PRCP** = Daily precipitation (stored in tenths of mm)  
    - If `PRCP` is null, the station did not report rain → we convert it to 0 (no rain)  
  - **SNWD** = Snow depth (mm)  
    - Always null in the GCC data → we drop SNWD completely  
- We will **pivot ELEMENT into columns** later.
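
The pivot described above can be sketched in plain Python on the five rows shown in the raw preview of this notebook. This is only an illustration of the reshaping logic, not the Silver-layer implementation: the function name `pivot_elements` and the constant `KEPT_ELEMENTS` are hypothetical, and the values are left in their raw (tenths) scale.

```python
from collections import defaultdict

# Long-format sample rows, copied from the raw preview in this notebook.
rows = [
    {"ID": "AE000041196", "DATE": 20150101, "ELEMENT": "TMIN", "DATA_VALUE": 125},
    {"ID": "AE000041196", "DATE": 20150101, "ELEMENT": "PRCP", "DATA_VALUE": 0},
    {"ID": "AE000041196", "DATE": 20150101, "ELEMENT": "TAVG", "DATA_VALUE": 206},
    {"ID": "AEM00041194", "DATE": 20150101, "ELEMENT": "TMAX", "DATA_VALUE": 286},
    {"ID": "AEM00041194", "DATE": 20150101, "ELEMENT": "TMIN", "DATA_VALUE": 180},
]

KEPT_ELEMENTS = ("TMAX", "TMIN", "TAVG", "PRCP")  # SNWD is dropped

def pivot_elements(long_rows):
    """Pivot ELEMENT into columns, giving one row per (ID, DATE)."""
    wide = defaultdict(lambda: dict.fromkeys(KEPT_ELEMENTS))
    for row in long_rows:
        if row["ELEMENT"] in KEPT_ELEMENTS:
            wide[(row["ID"], row["DATE"])][row["ELEMENT"]] = row["DATA_VALUE"]

    result = []
    for (station_id, obs_date), values in sorted(wide.items()):
        if values["PRCP"] is None:
            values["PRCP"] = 0  # missing precipitation is treated as no rain
        result.append({"ID": station_id, "DATE": obs_date, **values})
    return result

for record in pivot_elements(rows):
    print(record)
```

Each station-day ends up as a single record with one key per element; elements a station did not report stay `None`, except `PRCP`, which is filled with 0.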

**DATA_VALUE:**  
- The numeric measurement for the ELEMENT.  
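- Temperature and precipitation values are **scaled by 10** (stored in tenths of °C or mm); snow depth, which we drop, is in whole mm.  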
- Example:  
  - `TMAX = 286` → real value = **28.6°C**  
- We divide `DATA_VALUE` by 10 in the Silver layer.
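
The scaling rule amounts to a single division. A minimal sketch, assuming null values should pass through unchanged (the helper name `to_physical_units` is hypothetical):

```python
def to_physical_units(data_value):
    """Convert a raw value stored in tenths (of °C or mm) to whole units."""
    if data_value is None:
        return None  # keep missing measurements missing
    return data_value / 10

print(to_physical_units(286))  # 28.6
```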

**MFLAG (Measurement Flag):**  
- Extra information about the measurement.  
- Usually null.  
- We **drop this column**.

**QFLAG (Quality Flag):**  
- Indicates if the measurement failed a quality check.  
- If QFLAG is not null → the value failed a check and should be treated as invalid.  
- In our GCC dataset, QFLAG is mostly null → most values passed the quality checks.
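
One possible way to act on this flag is to keep only rows whose QFLAG is null or blank. The sketch below is an assumption about how such a filter could look, not a step performed in this notebook; the helper `passed_quality_check` and the third, flagged row are invented for illustration.

```python
# Two rows from the raw preview plus one hypothetical flagged row
# (the 999 value and its "I" flag are made up purely for illustration).
rows = [
    {"ID": "AE000041196", "ELEMENT": "TMIN", "DATA_VALUE": 125, "QFLAG": None},
    {"ID": "AEM00041194", "ELEMENT": "TMAX", "DATA_VALUE": 286, "QFLAG": None},
    {"ID": "AEM00041194", "ELEMENT": "TMAX", "DATA_VALUE": 999, "QFLAG": "I"},
]

def passed_quality_check(row):
    """Return True when QFLAG is null or blank, i.e. no check failed."""
    return row.get("QFLAG") in (None, "")

valid_rows = [row for row in rows if passed_quality_check(row)]
print(len(valid_rows))  # 2
```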

**SFLAG (Source Flag):**  
- Indicates the source of the observation (e.g., 'S', 'H').  
- Not useful for ML → we drop it.

**OBS_TIME:**  
- Observation time in HHMM format.  
- Usually null in daily datasets.  
- Not required for ML → we drop it.


In [0]:
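# Authenticate to the ADLS Gen2 storage account with its access key.
# "<account key>" is a placeholder; do not hard-code the real key in the notebook.
spark.conf.set(
    "fs.azure.account.key.qatarclimateanalysis.dfs.core.windows.net",
    "<account key>"
)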


In [0]:
raw_path = "abfss://lakehouse@qatarclimateanalysis.dfs.core.windows.net/raw/gcc_raw_2015_2025.json"

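# Read the raw JSON as-is; multiLine=True lets Spark parse records that span several lines
df_raw = spark.read.json(raw_path, multiLine=True)

# Preview the first rows without altering the data
display(df_raw.limit(5))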

DATA_VALUE,DATE,ELEMENT,ID,MFLAG,OBS_TIME,QFLAG,SFLAG
125,20150101,TMIN,AE000041196,,,,S
0,20150101,PRCP,AE000041196,,,,S
206,20150101,TAVG,AE000041196,H,,,S
286,20150101,TMAX,AEM00041194,,,,S
180,20150101,TMIN,AEM00041194,,,,S


In [0]:
bronze_path = "abfss://lakehouse@qatarclimateanalysis.dfs.core.windows.net/bronze/gcc_bronze.parquet"

# Write the raw DataFrame unchanged to Bronze as Parquet, replacing any previous output
df_raw.write.mode("overwrite").parquet(bronze_path)
