In [4]:
import polars as pl

input_file = "merged_adsb_dataSMALL.csv"
output_file = "adsb_cleanSMALL.csv"

required_cols = [
    "icao", "flight", "aircraft_type", "lat", "lon", "alt",
    "ground_speed", "track", "adsb_type", "true_heading",
    "squawk", "category"
]

# Load CSV with overrides for problematic values
df = pl.scan_csv(
    input_file,
    infer_schema_length=10000,
    null_values=["", "null", "None"],  # but NOT "ground", we’ll handle that manually
    schema_overrides={"alt": pl.Utf8}  # altitude as string first
)

# Clean and convert
df = (
    df.drop_nulls(subset=required_cols)
      .filter(pl.col("lat").cast(pl.Float64).is_between(-90, 90))
      .filter(pl.col("lon").cast(pl.Float64).is_between(-180, 180))
      .with_columns(
          pl.when(pl.col("alt") == "ground")
            .then(pl.lit(0))
            .otherwise(pl.col("alt").cast(pl.Int64, strict=False))
            .alias("alt")
      )
      # strict=False turns unparseable altitudes into nulls, so drop those rows as well
      .drop_nulls(subset=["alt"])
)

# Save cleaned dataset
df.collect().write_csv(output_file)
print(f"✅ Cleaned dataset saved to {output_file} (altitude 'ground' → 0)")


✅ Cleaned dataset saved to adsb_cleanSMALL.csv (altitude 'ground' → 0)


### ✈️ ADS-B Data Parser & Cleaner

This project extracts **ADS-B aircraft surveillance data** from JSON trace logs and converts it into a structured tabular dataset (CSV/Parquet).

Each log file contains an aircraft’s flight trace, including positional and metadata records.
The parser converts these traces into flat records with the following key fields:

* **Aircraft Information**:

  * `icao`: ICAO hex identifier
  * `flight`: Flight callsign
  * `aircraft_type`: ICAO aircraft type code (e.g., A320, B429)
  * `desc`: Human-readable aircraft description

* **Position & Movement**:

  * `timestamp_offset`: Relative timestamp (seconds from dataset start)
  * `lat`, `lon`: Geographic position
  * `alt`: Altitude (ft, `"ground"` converted to `0`)
  * `ground_speed`: Speed over ground (knots)
  * `track`: Track/course (degrees)
  * `geom_rate`: Climb/descent rate (ft/min)

* **ADS-B Metadata**:

  * `adsb_type`: Source message type (`adsb_icao`, etc.)
  * `true_heading`, `mag_heading`: Heading information
  * `squawk`: Transponder squawk code
  * `emergency`: Emergency status
  * `category`: Aircraft category

* **Quality & Integrity Flags**:

  * `nav_qnh`: Barometric pressure setting (hPa)
  * `alert`, `spi`: Flight status alert bit and special position identification (SPI) flag
  * `nic`, `rc`, `nac_p`, `nac_v`, `sil`, `sil_type`: Integrity/confidence metrics
  * `gva`, `sda`: Geometric vertical accuracy (GVA) and system design assurance (SDA)

The resulting dataset is suitable for:

* **Flight data analysis** (traffic patterns, altitude profiles, squawk usage)
* **Machine learning** (anomaly detection, spoofing research, classification)
* **Visualization** (heatmaps, live tracking dashboards)
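
The flattening step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the positional layout of a trace entry (offset, lat, lon, alt, ground speed, track, with an optional metadata dict at index 8) and the top-level keys `icao`, `t`, and `desc` are assumed here and should be checked against the actual log files. The function name `flatten_trace` and the sample log are hypothetical.

```python
import json

# Assumed positional layout of one trace entry (hypothetical; verify against your logs).
IDX_OFFSET, IDX_LAT, IDX_LON, IDX_ALT, IDX_GS, IDX_TRACK = 0, 1, 2, 3, 4, 5
IDX_DETAILS = 8  # optional dict carrying per-message metadata (flight, squawk, nic, ...)


def flatten_trace(log: dict) -> list[dict]:
    """Turn one aircraft's trace log into a list of flat records."""
    rows = []
    for entry in log.get("trace", []):
        details = entry[IDX_DETAILS] if len(entry) > IDX_DETAILS else None
        if not isinstance(details, dict):
            details = {}  # metadata is missing for this entry
        flight = (details.get("flight") or "").strip() or None
        rows.append({
            "icao": log.get("icao"),
            "aircraft_type": log.get("t"),  # assumed key for the type code
            "desc": log.get("desc"),
            "flight": flight,
            "timestamp_offset": entry[IDX_OFFSET],
            "lat": entry[IDX_LAT],
            "lon": entry[IDX_LON],
            "alt": entry[IDX_ALT],  # may still be "ground"; the cleaning cell maps it to 0
            "ground_speed": entry[IDX_GS],
            "track": entry[IDX_TRACK],
            "squawk": details.get("squawk"),
            "nic": details.get("nic"),
        })
    return rows


# Made-up sample log, not real data.
sample = json.loads("""
{"icao": "abc123", "t": "A320", "desc": "AIRBUS A-320",
 "trace": [
   [0.0, 40.10, -74.20, "ground", 12.0, 90.0, 0, 0,
    {"flight": "TEST01 ", "squawk": "1200", "nic": 8}],
   [5.0, 40.11, -74.19, 1500, 140.0, 92.0, 0, 1200, null]
 ]}
""")
rows = flatten_trace(sample)
print(len(rows), rows[0]["flight"], rows[1]["alt"])  # 2 TEST01 1500
```

Only a subset of the fields listed above is mapped; the remaining metadata fields would follow the same `details.get(...)` pattern.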

In [5]:
import polars as pl

# load cleaned CSV
df = pl.read_csv("adsb_cleanSMALL.csv")

# show first rows
print(df.head())

# show schema
print(df.schema)


shape: (5, 28)
┌────────┬────────┬───────────────┬──────────────┬───┬─────┬──────────┬─────┬─────┐
│ icao   ┆ flight ┆ aircraft_type ┆ desc         ┆ … ┆ sil ┆ sil_type ┆ gva ┆ sda │
│ ---    ┆ ---    ┆ ---           ┆ ---          ┆   ┆ --- ┆ ---      ┆ --- ┆ --- │
│ str    ┆ str    ┆ str           ┆ str          ┆   ┆ f64 ┆ str      ┆ f64 ┆ f64 │
╞════════╪════════╪═══════════════╪══════════════╪═══╪═════╪══════════╪═════╪═════╡
│ ad5ff1 ┆ N961AV ┆ A320          ┆ AIRBUS A-320 ┆ … ┆ 3.0 ┆ perhour  ┆ 1.0 ┆ 2.0 │
│ ad5ff1 ┆ N961AV ┆ A320          ┆ AIRBUS A-320 ┆ … ┆ 3.0 ┆ perhour  ┆ 1.0 ┆ 2.0 │
│ ad5ff1 ┆ N961AV ┆ A320          ┆ AIRBUS A-320 ┆ … ┆ 3.0 ┆ perhour  ┆ 1.0 ┆ 2.0 │
│ ad5ff1 ┆ N961AV ┆ A320          ┆ AIRBUS A-320 ┆ … ┆ 3.0 ┆ perhour  ┆ 1.0 ┆ 2.0 │
│ ad5ff1 ┆ N961AV ┆ A320          ┆ AIRBUS A-320 ┆ … ┆ 3.0 ┆ perhour  ┆ 2.0 ┆ 2.0 │
└────────┴────────┴───────────────┴──────────────┴───┴─────┴──────────┴─────┴─────┘
Schema({'icao': String, 'flight': String, 'aircraft_type': St

For **ADS-B spoofing detection**, strip the data down to the **minimum features that carry security-relevant signal**, and avoid redundant or easy-to-forge fields.

Here’s a breakdown:

---

### ✅ Must-keep (core minimal features for spoofing detection)

These are critical to detect inconsistencies, anomalies, or spoofed injections:

* **icao** → spoofers often re-use or clone ICAO hex codes.
* **lat, lon, alt** → positional anomalies are the heart of spoof detection.
* **ground\_speed, track** → kinematics; spoofed tracks often break physical motion laws.
* **timestamp\_offset** → temporal context, allows sequence/velocity validation.
* **nic, nac\_p, nac\_v, sil, sil\_type, rc** → integrity/confidence fields; spoofers may skip or mis-set these.

---

### ⚠️ Useful, but secondary

Can improve robustness of your model, but not strictly minimal:

* **squawk** → spoofers may use default or unusual codes (7700/0000/1200).
* **category** → spoofers might mismatch aircraft type vs. observed kinematics.
* **gva, sda** → geometric vertical accuracy and system design assurance; may show inconsistencies.

---

### ❌ Usually less useful for spoofing detection

Because they can be easily forged or add noise:

* **flight** (callsign) → trivial to spoof, not reliable.
* **aircraft\_type, desc** → metadata, not critical for spoof detection.
* **true\_heading, mag\_heading** → largely redundant with track for this purpose (heading and track differ mainly by wind drift).
* **nav\_qnh** → useful for altitude cross-checks, but spoofers may leave nulls.
* **alert, spi, emergency** → rare, but could be spoofed for DoS-style attacks.

---

### 🚀 Minimal ML-ready dataset suggestion:

Keep only these columns:

```text
icao, timestamp_offset, lat, lon, alt, ground_speed, track,
nic, nac_p, nac_v, sil, rc
```

That gives you:

* **Identity** (icao)
* **Spatiotemporal state** (lat, lon, alt, speed, track, time)
* **Integrity/confidence metadata** (nic/nac/sil/rc)

With just this, you can already train models to detect spoofing via:

* Impossible motion (e.g., jumps, speed > Mach 2 for a Cessna)
* Duplicate icao in two places at once
* Integrity flags inconsistent with observed precision
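
The first two checks can be illustrated with a small rule-based sketch before any model is trained. It is written in plain Python to stay dependency-free; the function names, the 700 kt speed ceiling, and the 1 NM "same instant" tolerance are illustrative assumptions, not values from the dataset.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in nautical miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def flag_impossible_motion(points, max_speed_kt=700.0, same_time_tol_nm=1.0):
    """points: (timestamp_offset, lat, lon) tuples for ONE icao, sorted by time.

    Returns the indices i where the hop from point i-1 to point i is implausible:
    either the implied ground speed exceeds max_speed_kt, or the timestamp does
    not advance while the position moves more than same_time_tol_nm
    (the same icao in two places at once).
    """
    flagged = []
    for i in range(1, len(points)):
        t0, lat0, lon0 = points[i - 1]
        t1, lat1, lon1 = points[i]
        dist_nm = haversine_nm(lat0, lon0, lat1, lon1)
        dt_hours = (t1 - t0) / 3600.0
        if dt_hours <= 0:
            if dist_nm > same_time_tol_nm:
                flagged.append(i)
        elif dist_nm / dt_hours > max_speed_kt:
            flagged.append(i)
    return flagged


# Made-up track: the last hop jumps about 300 NM in 10 seconds.
track = [(0, 40.00, -74.0), (10, 40.01, -74.0), (20, 45.00, -74.0)]
print(flag_impossible_motion(track))  # [2]
```

A fixed speed ceiling is deliberately crude; a per-category limit would be a natural refinement if the `category` column is kept.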

---



In [None]:
try: