# BayScen Data Collection and Processing Tutorial

This notebook demonstrates how to collect and process real-world weather data for BayScen scenario generation.

## Overview

**Data Sources:**

**Frost API** (Norwegian Meteorological Institute)
- Official documentation: https://frost.met.no/
- Credentials: https://frost.met.no/auth/requestCredentials.html
- **Main Station**: Road conditions (friction, wetness, surface type), wind speed, precipitation, visibility
- **Secondary Station**: Cloudiness data (since main station may lack it)

**Output:** Hourly weather data with 9 parameters ready for Bayesian Network training.

## Step 1: Configuration

First, set up your API credentials and data collection parameters.

In [8]:
# API Credentials
# Register at: https://frost.met.no/auth/requestCredentials.html
FROST_CLIENT_ID = "YOUR_FROST_CLIENT_ID"
FROST_CLIENT_SECRET = "YOUR_FROST_CLIENT_SECRET"

# Location and stations
STATION_MAIN = "SN84770"  # Main station with road conditions
STATION_CLOUD = "SN84970"  # Station with cloudiness data

# Time period
START_TIME = "2020-12-01T00:00:00Z"
END_TIME = "2025-11-05T23:00:00Z"
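
To avoid committing real credentials with the notebook, the two constants above can be read from the environment instead of being typed inline. The following is a minimal sketch: the helper `load_frost_credentials` and the environment variable names `FROST_CLIENT_ID` and `FROST_CLIENT_SECRET` are assumptions chosen to mirror the notebook's constants, not part of the BayScen code.

```python
import os


def load_frost_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper: the variable names are an assumption that
    mirrors the constants used in this notebook.
    """
    env = os.environ if env is None else env
    client_id = env.get("FROST_CLIENT_ID", "")
    client_secret = env.get("FROST_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        # Fail early rather than sending an unauthenticated request later.
        raise RuntimeError(
            "Set FROST_CLIENT_ID and FROST_CLIENT_SECRET before collecting data."
        )
    return client_id, client_secret
```

With both variables exported in the shell, `FROST_CLIENT_ID, FROST_CLIENT_SECRET = load_frost_credentials()` would replace the two placeholder assignments.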

## Step 2: Collect Raw Data

Fetch observations from both Frost stations and save them to the `raw/` directory.

In [2]:
from pathlib import Path
from collect import collect_full_dataset

# Collect data
output_files = collect_full_dataset(
    frost_client_id=FROST_CLIENT_ID,
    frost_client_secret=FROST_CLIENT_SECRET,
    station_id_main=STATION_MAIN,
    station_id_cloud=STATION_CLOUD,
    start_time=START_TIME,
    end_time=END_TIME,
    output_dir=Path("raw")
)

print("\n✓ Raw data collection complete!")
for name, path in output_files.items():
    print(f"  {name}: {path}")


STEP 1: Fetching main station data (road conditions)

Fetching data from station SN84770...
Elements: ['wind_speed', 'road_friction', 'road_surface_condition', 'visibility_in_air_mor', 'road_water_film_thickness', 'sum(precipitation_amount PT10M)']
Period: 2020-12-01T00:00:00Z to 2025-11-05T23:00:00Z
Splitting into 31 chunks to avoid API limits...
  Chunk 1/31: 2020-12-01 to 2021-01-30
  ✓ Fetched 8001 observations
  Chunk 2/31: 2021-01-30 to 2021-03-31
  ✓ Fetched 8482 observations
  Chunk 3/31: 2021-03-31 to 2021-05-30
  ✓ Fetched 8419 observations
  Chunk 4/31: 2021-05-30 to 2021-07-29
  ✓ Fetched 8579 observations
  Chunk 5/31: 2021-07-29 to 2021-09-27
  ✓ Fetched 7821 observations
  Chunk 6/31: 2021-09-27 to 2021-11-26
  ✓ Fetched 8333 observations
  Chunk 7/31: 2021-11-26 to 2022-01-25
  ✓ Fetched 8479 observations
  Chunk 8/31: 2022-01-25 to 2022-03-26
  ✓ Fetched 8487 observations
  Chunk 9/31: 2022-03-26 to 2022-05-25
  ✓ Fetched 8357 observations
  Chunk 10/31: 2022-05-25 to
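
The chunk boundaries in the log are consistent with fixed 60-day windows over the configured period. The sketch below reproduces that splitting. The function name `split_into_chunks` and the 60-day window are inferred from the log output and are assumptions, not the actual code in `collect.py`.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def split_into_chunks(start_time, end_time, chunk_days=60):
    """Split [start_time, end_time] into consecutive windows of chunk_days.

    Hypothetical helper: the 60-day default is inferred from the log
    above, not taken from collect.py.
    """
    current = datetime.strptime(start_time, TIME_FORMAT).replace(tzinfo=timezone.utc)
    stop = datetime.strptime(end_time, TIME_FORMAT).replace(tzinfo=timezone.utc)
    step = timedelta(days=chunk_days)
    chunks = []
    while current < stop:
        # The last window is shortened so it never runs past the end time.
        chunk_end = min(current + step, stop)
        chunks.append((current, chunk_end))
        current = chunk_end
    return chunks


chunks = split_into_chunks("2020-12-01T00:00:00Z", "2025-11-05T23:00:00Z")
print(f"Splitting into {len(chunks)} chunks")
for number, (chunk_start, chunk_end) in enumerate(chunks[:2], start=1):
    print(f"  Chunk {number}/{len(chunks)}: {chunk_start:%Y-%m-%d} to {chunk_end:%Y-%m-%d}")
```

Requesting one window at a time keeps each response small. With 10-minute observations, the main station returns roughly 8,000 rows per window in the log above.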

## Step 3: Load Raw Data

Load the collected raw data for processing.

In [9]:
import pandas as pd

# Load main station data
df_main = pd.read_csv("raw/frost_main_station.csv")
df_main['timestamp'] = pd.to_datetime(df_main['timestamp'], utc=True)

print(f"Main station data: {df_main.shape}")
print(f"Columns: {list(df_main.columns)}")
df_main.head()

Main station data: (254101, 7)
Columns: ['timestamp', 'wind_speed', 'road_friction', 'road_surface_condition', 'visibility_in_air_mor', 'road_water_film_thickness', 'sum(precipitation_amount PT10M)']


| | timestamp | wind_speed | road_friction | road_surface_condition | visibility_in_air_mor | road_water_film_thickness | sum(precipitation_amount PT10M) |
|---|---|---|---|---|---|---|---|
| 0 | 2020-12-01 00:00:00+00:00 | 3.2 | 0.75 | 1024.0 | 20000.0 | 0.2 | 0.0 |
| 1 | 2020-12-01 00:10:00+00:00 | 3.5 | 0.75 | 1024.0 | 20000.0 | 0.2 | 0.0 |
| 2 | 2020-12-01 00:20:00+00:00 | 3.5 | 0.75 | 1024.0 | 19894.0 | 0.2 | 0.0 |
| 3 | 2020-12-01 00:30:00+00:00 | 2.8 | 0.75 | 1024.0 | 20000.0 | 0.2 | 0.0 |
| 4 | 2020-12-01 00:40:00+00:00 | 3.1 | 0.76 | 1024.0 | 20000.0 | 0.2 | 0.0 |


In [2]:
# Load cloudiness data (if available)
df_cloud = None
cloud_file = Path("raw/frost_cloud_station.csv")

if cloud_file.exists():
    df_cloud = pd.read_csv(cloud_file)
    df_cloud['timestamp'] = pd.to_datetime(df_cloud['timestamp'], utc=True)
    print(f"Cloud station data: {df_cloud.shape}")
    display(df_cloud.head())
else:
    print("No cloudiness data from secondary station")

Cloud station data: (49566, 2)


| | timestamp | cloud_area_fraction1 |
|---|---|---|
| 0 | 2020-12-01 06:50:00+00:00 | 6 |
| 1 | 2020-12-01 07:50:00+00:00 | 4 |
| 2 | 2020-12-01 08:20:00+00:00 | 2 |
| 3 | 2020-12-01 08:50:00+00:00 | 2 |
| 4 | 2020-12-01 09:20:00+00:00 | 2 |


## Step 4: Process Data

Transform raw data into BayScen format with:
- Hourly aggregation
- Missing value handling
- Discretization to CARLA-compatible ranges
- Final column naming

In [3]:
from process import WeatherDataProcessor

# Initialize processor
processor = WeatherDataProcessor()

# Run full processing pipeline
df_processed = processor.process_full_pipeline(df_main, df_cloud)

print("\n✓ Processing complete!")
print(f"Final shape: {df_processed.shape}")


PROCESSING PIPELINE

Merging data sources...
✓ Merged cloudiness data from secondary station

Resampling to hourly frequency...
Starting from first hourly row: 2020-12-01 00:00:00+00:00

Using aggregation rules:
{'sum(precipitation_amount PT10M)': 'sum', 'wind_speed': 'first', 'road_friction': 'first', 'road_surface_condition': 'first', 'visibility_in_air_mor': 'first', 'road_water_film_thickness': 'first', 'cloud_area_fraction1': 'first'}

Hourly data shape: (43223, 8)
Shape before dropping NaNs: (43223, 8)
Shape after dropping NaNs: (43207, 8)

First few rows of hourly data:
                   timestamp  sum(precipitation_amount PT10M)  wind_speed  \
6  2020-12-01 06:00:00+00:00                              0.0         1.8   
7  2020-12-01 07:00:00+00:00                              0.0         2.2   
8  2020-12-01 08:00:00+00:00                              0.0         3.0   
9  2020-12-01 09:00:00+00:00                              0.0         3.7   
10 2020-12-01 10:00:00+00:00  

  df_processed[col] = df_processed[col].fillna(method='ffill')
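
The aggregation rules printed above (sum for precipitation, first value for everything else) can be sketched with pandas `resample`. This is a minimal illustration, not the implementation in `process.py`: the function `resample_hourly` and the order of the fill and drop steps are assumptions. The stray `fillna(method='ffill')` line in the output looks like the tail of a pandas deprecation warning. Since pandas 2.1 the `method` argument of `fillna` is deprecated, and `ffill()` is the replacement used here.

```python
import pandas as pd

PRECIP_COL = "sum(precipitation_amount PT10M)"


def resample_hourly(df, sum_cols=(PRECIP_COL,)):
    """Aggregate sub-hourly rows to one row per hour.

    Hypothetical helper: columns in sum_cols are summed over the hour,
    all other columns keep the first non-null value in the hour.
    """
    indexed = df.set_index("timestamp").sort_index()
    rules = {col: ("sum" if col in sum_cols else "first") for col in indexed.columns}
    hourly = indexed.resample(pd.Timedelta(hours=1)).agg(rules)
    # Forward-fill short gaps, then drop rows that are still incomplete.
    hourly = hourly.ffill()
    return hourly.dropna().reset_index()


# Two hours of synthetic 10-minute observations (illustrative values only).
timestamps = pd.date_range("2020-12-01 00:00", periods=12, freq="10min", tz="UTC")
raw = pd.DataFrame({
    "timestamp": timestamps,
    "wind_speed": [3.2, 3.5, 3.5, 2.8, 3.1, 3.0, 1.8, 2.0, 2.1, 2.2, 2.4, 2.5],
    PRECIP_COL: [0.1] * 12,
})
hourly = resample_hourly(raw)
print(hourly)
```

Summing precipitation preserves the hourly total, while taking the first value treats the other measurements as a snapshot at the start of each hour.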


## Step 5: Inspect Processed Data

View the final dataset structure and distributions.

In [4]:
# Display first few rows
df_processed.head(10)

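| | Time of Day | Cloudiness | Precipitation | Wind Intensity | Fog Density | Fog Distance | Wetness | Precipitation Deposits | Road Friction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -60 | 60 | 0 | 40 | 0 | 100 | 60 | 0 | 0.1 |
| 1 | -60 | 60 | 0 | 60 | 0 | 100 | 100 | 0 | 0.2 |
| 2 | -60 | 40 | 0 | 80 | 0 | 100 | 40 | 20 | 0.8 |
| 3 | -30 | 20 | 0 | 80 | 0 | 100 | 60 | 20 | 0.8 |
| 4 | -30 | 20 | 0 | 100 | 0 | 100 | 100 | 0 | 0.2 |
| 5 | -30 | 20 | 0 | 80 | 0 | 100 | 100 | 0 | 0.2 |
| 6 | 0 | 20 | 0 | 80 | 0 | 100 | 60 | 20 | 0.8 |
| 7 | 0 | 60 | 0 | 60 | 0 | 100 | 40 | 20 | 0.8 |
| 8 | 0 | 60 | 0 | 80 | 0 | 100 | 40 | 20 | 0.8 |
| 9 | 30 | 60 | 0 | 80 | 0 | 100 | 40 | 20 | 0.8 |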


In [5]:
# Display data types and null counts
print("Data Types:")
print(df_processed.dtypes)
print("\nNull Values:")
print(df_processed.isnull().sum())

Data Types:
Time of Day                 int32
Cloudiness                  int32
Precipitation               int32
Wind Intensity              int32
Fog Density                 int32
Fog Distance                int32
Wetness                     int32
Precipitation Deposits      int32
Road Friction             float64
dtype: object

Null Values:
Time of Day               0
Cloudiness                0
Precipitation             0
Wind Intensity            0
Fog Density               0
Fog Distance              0
Wetness                   0
Precipitation Deposits    0
Road Friction             0
dtype: int64


In [6]:
# Display value distributions
for col in df_processed.columns:
    print(f"\n{col}:")
    print(df_processed[col].value_counts().sort_index().head(10))


Time of Day:
Time of Day
-90    10798
-60     5402
-30     5403
 0      5403
 30     5403
 60     5399
 90     5399
Name: count, dtype: int64

Cloudiness:
Cloudiness
0       288
20    24291
40     8975
60     8983
80      670
Name: count, dtype: int64

Precipitation:
Precipitation
0      37508
20      4609
40       887
60       159
80        31
100       13
Name: count, dtype: int64

Wind Intensity:
Wind Intensity
20     8881
40     8510
60     9136
80     8071
100    8609
Name: count, dtype: int64

Fog Density:
Fog Density
0      36681
20      1306
40       403
60      4303
100      514
Name: count, dtype: int64

Fog Distance:
Fog Distance
0        514
40      4303
60       403
80      1306
100    36681
Name: count, dtype: int64

Wetness:
Wetness
0      24659
20       834
40     11070
60       734
80      3760
100     2150
Name: count, dtype: int64

Precipitation Deposits:
Precipitation Deposits
0      31304
20      4568
40      2214
60      1988
80      1580
100     1553
Name: count
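
The distributions above show each 0-100 variable taking values in steps of 20. One way to produce such levels from a continuous measurement is fixed-edge binning with `pandas.cut`, sketched below. The helper `discretize` and the wind-speed bin edges are hypothetical and chosen only for illustration. The thresholds actually used by `WeatherDataProcessor` are not shown in this notebook.

```python
import math

import pandas as pd

# Hypothetical bin edges in m/s; NOT the thresholds used by WeatherDataProcessor.
WIND_EDGES = [0.0, 2.0, 4.0, 6.0, 8.0, math.inf]
LEVELS = [20, 40, 60, 80, 100]


def discretize(series, edges, levels):
    """Map continuous values to discrete levels using fixed bin edges.

    Bins are right-closed, and the lowest edge is included in the first bin.
    """
    binned = pd.cut(series, bins=edges, labels=levels, include_lowest=True)
    return binned.astype(int)


wind_speed = pd.Series([3.2, 1.8, 9.0])
wind_intensity = discretize(wind_speed, WIND_EDGES, LEVELS)
print(wind_intensity.tolist())  # [40, 20, 100]
```

Values outside the outermost edges become missing after `pd.cut`, so the integer cast would fail on them. In this sketch that acts as a guard against silently binning out-of-range data.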

## Step 6: Save Processed Data

Save the final dataset for use in BayScen.

In [7]:
# Create output directory
output_dir = Path("processed")
output_dir.mkdir(exist_ok=True)

# Save processed data
output_file = output_dir / "bayscen_final_data.csv"
df_processed.to_csv(output_file, index=False)

print(f"✓ Saved: {output_file}")
print(f"  {len(df_processed)} rows × {len(df_processed.columns)} columns")

✓ Saved: processed\bayscen_final_data.csv
  43207 rows × 9 columns


## Summary

You now have:

1. **Raw data** in `raw/` directory
   - `frost_main_station.csv` - Road conditions, wind, precipitation, visibility
   - `frost_cloud_station.csv` - Cloudiness data

2. **Processed data** in `processed/` directory
   - `bayscen_final_data.csv` - Ready for Bayesian Network training

**Next steps:**
- Use the processed data to train Bayesian Networks
- Generate test scenarios with BayScen
- Validate scenarios in CARLA simulation

**Final column format:**
- `Time of Day` - Sun altitude angle in degrees (-90 to 90)
- `Cloudiness` - Cloud coverage (0-100)
- `Precipitation` - Precipitation intensity (0-100)
- `Wind Intensity` - Wind strength (0-100)
- `Fog Density` - Fog thickness (0-100)
- `Fog Distance` - Visibility distance (0-100)
- `Wetness` - Road wetness (0-100)
- `Precipitation Deposits` - Water/snow on road (0-100)
- `Road Friction` - Tire grip (0.0-1.0)
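
Before training, it can be worth checking that the saved file matches the column format listed above. The following is a minimal sketch of such a check. The function `validate_bayscen_frame` and the `EXPECTED_RANGES` table are hypothetical helpers built from that list, not part of BayScen.

```python
import pandas as pd

# Column names and ranges copied from the "Final column format" list.
EXPECTED_RANGES = {
    "Time of Day": (-90, 90),
    "Cloudiness": (0, 100),
    "Precipitation": (0, 100),
    "Wind Intensity": (0, 100),
    "Fog Density": (0, 100),
    "Fog Distance": (0, 100),
    "Wetness": (0, 100),
    "Precipitation Deposits": (0, 100),
    "Road Friction": (0.0, 1.0),
}


def validate_bayscen_frame(df):
    """Return a list of problems found; an empty list means the frame passes."""
    problems = []
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if df[column].isnull().any():
            problems.append(f"null values in: {column}")
        # between() is inclusive on both ends by default.
        if not df[column].dropna().between(low, high).all():
            problems.append(f"values outside [{low}, {high}] in: {column}")
    return problems


# First row of the processed preview shown in Step 5.
sample = pd.DataFrame([{
    "Time of Day": -60, "Cloudiness": 60, "Precipitation": 0,
    "Wind Intensity": 40, "Fog Density": 0, "Fog Distance": 100,
    "Wetness": 60, "Precipitation Deposits": 0, "Road Friction": 0.1,
}])
print(validate_bayscen_frame(sample))  # []
```

To check the saved file, pass `pd.read_csv("processed/bayscen_final_data.csv")` to the same function.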