Robust Severity Assessment and Information Extraction from Noisy Maritime Distress Communications Using Large Language Models
- Project Goal
- Quick Example
- Classification Task
- Key Results
- Pipeline
- Project Structure
- Notebooks
- Dataset
- Experiments
- Installation
- Quick Start
SeaAlert is an NLP system designed to:
- Classify maritime radio calls into 4 severity levels (Distress, Urgency, Safety, Routine)
- Extract actionable information (Location, Vessel Name, Persons on Board, Nature of Incident)
Maritime radio calls are made under extreme conditions:
- High Noise Environment — Engine noise, storms, VHF static interference
- Human Stress — Panic causes operators to omit keywords or speak informally
- Protocol Violations — Not all distress calls follow GMDSS standards ("MAYDAY", "PAN PAN")
Therefore, my classification model must handle very noisy ASR (Automatic Speech Recognition) transcriptions, not clean text. This is the core challenge of my project.
I tackle this challenge using two augmentation techniques:
- LLM-based Text Generation — GPT-4o-mini generates diverse maritime messages (formal, informal, protocol violations)
- ASR-based Augmentation — Text → TTS → Noisy Audio → Whisper ASR → Corrupted Text
This creates realistic training data that mimics real-world maritime communication failures.
Here's a real example from my dataset — demonstrating how severe ASR errors can be under high noise conditions:
| Stage | Content |
|---|---|
| Original Message | "MAYDAY, MAYDAY, MAYDAY. This is the fishing vessel 'Ocean Explorer', call sign WXYZ123, MMSI 123456789. We are adrift, approximately 15 nautical miles east of Cape Point, at position 34 degrees 12 minutes South, 18 degrees 29 minutes East. The vessel's engine has failed, and we are currently taking on water. Weather conditions are worsening with 4-meter swells and visibility reduced to 2 nautical miles. There are 6 persons on board. We require immediate assistance for towing. Repeat, we are requesting a tow. Over." |
| ASR Output (High Noise) | "maybe, maybe, maybe. This is the Fishing Vessel Oceanate Spoiler. Paul Signed to be its Ryzen 123 MMSI 120 3 million 456000 7809. The Area Drift approximately 15 nautical miles east of Cape Point, a position 34 degrees 12 minutes south, 18 degrees 29 minutes east. The Vessel's engine has failed and we are currently taking on water. Whether conditions are a worse name before need as well as invisibility we choose to T-Nautical miles. There are six persons on board. You require immediate assistance for training. You please, you are wrecked." |
| Classification | 🔴 DISTRESS |
| Extracted Information | Vessel: Oceanate Spoiler · Location: NONE · POB: NONE · Nature: taking on water |
Critical ASR Errors Shown:
- "MAYDAY, MAYDAY, MAYDAY" → "maybe, maybe, maybe" 🔴 (codeword completely lost!)
- "Ocean Explorer" → "Oceanate Spoiler" (vessel name corrupted)
- "call sign WXYZ123, MMSI 123456789" → "Paul Signed to be its Ryzen 123..." (identifiers destroyed)
- "visibility reduced to 2 nautical miles" → "invisibility we choose to T-Nautical miles" (nonsensical)
- "requesting a tow. Over." → "training. You please, you are wrecked." (meaning completely altered)
Despite these catastrophic ASR errors — where the critical MAYDAY codeword became "maybe" and the message ended with "you are wrecked" — my Transformer model correctly classifies the message as DISTRESS based on contextual understanding of phrases like "engine has failed", "taking on water", and "require immediate assistance".
SeaAlert classifies messages into 4 severity labels based on GMDSS protocol:
| Label | Codeword | Description |
|---|---|---|
| Distress | MAYDAY | Life-threatening emergencies requiring immediate assistance |
| Urgency | PAN PAN | Urgent situations not immediately life-threatening |
| Safety | SECURITE | Navigation hazards, weather warnings |
| Routine | NONE | Regular communications, radio checks |
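To make the codeword-to-label mapping concrete, here is a trivial keyword-spotting classifier. This is an illustration only, not one of the project's models, and it also shows why pure keyword matching fails once ASR corrupts a codeword:

```python
import re

# GMDSS codeword → severity label, following the table above
CODEWORD_LABELS = [
    (re.compile(r"\bmayday\b", re.IGNORECASE), "Distress"),
    (re.compile(r"\bpan[- ]?pan\b", re.IGNORECASE), "Urgency"),
    (re.compile(r"\bs[eé]curit[eé]\b", re.IGNORECASE), "Safety"),
]

def keyword_classify(message: str) -> str:
    """Return the severity implied by the first GMDSS codeword found,
    falling back to Routine when no codeword is present."""
    for pattern, label in CODEWORD_LABELS:
        if pattern.search(message):
            return label
    return "Routine"

print(keyword_classify("MAYDAY, MAYDAY, MAYDAY. Taking on water."))  # Distress
print(keyword_classify("maybe, maybe, maybe. Taking on water."))     # Routine (ASR broke the codeword)
```

The second call is exactly the failure mode from the Quick Example: once "MAYDAY" becomes "maybe", a keyword matcher has nothing to key on, which is why contextual models are needed.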
Beyond classification, SeaAlert extracts structured, actionable data from unstructured messages:
| Field | Description | Example |
|---|---|---|
| Vessel Name | Name of the ship in distress | Ocean Explorer |
| Call Sign / MMSI | Unique radio identifiers | WXYZ123 / 123456789 |
| Location | Coordinates or relative position | 34°15'N, 120°45'W |
| POB | Persons On Board (Count) | 15 |
| Nature | Type of incident | Sinking, Fire, Medical |
This structured output is critical for rescue coordination centers to dispatch appropriate resources.
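As a minimal sketch of the extraction task, simple regexes can already pull several fields from a well-formed message. The field names and patterns below are illustrative assumptions, not the project's actual extractor, and real ASR transcripts are far messier:

```python
import re

def extract_fields(message: str) -> dict:
    """Pull a few structured fields out of a distress message with
    simple regexes. A sketch only; real transcripts need robust NLP."""
    fields = {"vessel": None, "mmsi": None, "pob": None}

    # Vessel name: quoted name after the word "vessel"
    m = re.search(r"vessel\s+'([^']+)'", message, re.IGNORECASE)
    if m:
        fields["vessel"] = m.group(1)

    # MMSI: a 9-digit identifier following "MMSI"
    m = re.search(r"\bMMSI\s+(\d{9})\b", message, re.IGNORECASE)
    if m:
        fields["mmsi"] = m.group(1)

    # Persons on board: "<n> persons on board"
    m = re.search(r"\b(\d+)\s+persons?\s+on\s+board\b", message, re.IGNORECASE)
    if m:
        fields["pob"] = int(m.group(1))
    return fields

msg = ("MAYDAY. This is the fishing vessel 'Ocean Explorer', MMSI 123456789. "
       "There are 6 persons on board.")
print(extract_fields(msg))
```

On the ASR-corrupted version of the same message, most of these patterns find nothing, which matches the Quick Example above where Location and POB came back as NONE.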
Two transformer models were evaluated on the validation set:
| Model | Parameters | Validation F1 | Selected |
|---|---|---|---|
| DistilBERT | 66M | 0.679 | ❌ |
| RoBERTa | 125M | 0.734 | ✅ |
RoBERTa was selected for all experiments due to its superior validation performance (+5.5 F1 points).
| Model | Type | Clean F1 | ASR-High F1 | Trap F1 | ASR Robustness |
|---|---|---|---|---|---|
| Logistic Regression | Baseline | 0.674 | 0.423 | 0.139 | -37% drop |
| Linear SVM | Baseline | 0.686 | - | - | - |
| Naive Bayes | Baseline | 0.592 | - | - | - |
| RoBERTa | Transformer | 0.664 | 0.569 | 0.236 | -14% drop |
1. ASR Robustness — RoBERTa maintains better performance on noisy ASR transcripts:
   - BoW: 67.4% → 42.3% F1 (37% degradation)
   - RoBERTa: 66.4% → 56.9% F1 (only 14% degradation)
2. Codeword Reliance — Both models rely heavily on GMDSS keywords:
   - With codeword: 100% accuracy (both models)
   - Without codeword: ~51% accuracy (both models)
3. Adversarial Robustness — RoBERTa handles tricky cases better:
   - Negations: "This is NOT a distress"
   - Drills: "MAYDAY - this is a drill"
   - RoBERTa: 23.6% F1 vs BoW: 13.9% F1 (~70% relative improvement)
4. Data Augmentation — Training with ASR-corrupted text improves robustness:
   - BoW with ASR augmentation: 58.9% F1 on ASR-high (vs 42.3% without)
My end-to-end pipeline simulates real maritime communication:
┌─────────────────────────────────────────────────────────────────────────┐
│ SeaAlert Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ GPT-4o-mini│ │ Coqui TTS │ │ Noise Layer │ │
│ │ Generation │───▶│ Synthesis │───▶│ (VHF Radio) │ │
│ │ 1,872 msgs │ │ 16kHz │ │ 6/12/18 dB │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Clean Text │ │ Whisper ASR │ │
│ │ Dataset │ │ Transcription│ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ └───────────────┬───────────────────────┘ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Model Training │ │
│ │ BoW vs Transformer │ │
│ └─────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Exp 1: │ │ Exp 2: │ │ Exp 3: │ │
│ │ Codeword │ │ Adversarial │ │ ASR │ │
│ │ Masking │ │ Traps │ │ Robustness │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Classification + │ │
│ │ Info Extraction │ │
│ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Stage | Description | Output |
|---|---|---|
| 1. Data Generation | GPT-4o-mini synthetic maritime messages | 1,872 balanced samples |
| 2. Text-to-Speech | Coqui TTS audio synthesis | WAV files (16kHz) |
| 3. Noise Simulation | VHF radio noise at 3 SNR levels | Noisy audio files |
| 4. ASR Transcription | Faster-Whisper speech-to-text | Corrupted text transcripts |
| 5. Model Training | BoW baselines + RoBERTa transformer | Trained classifiers |
| 6. Evaluation | 3 experiments + information extraction | Results & analysis |
SeaAlert/
├── notebooks/ # Jupyter notebooks (run in order)
│ ├── 00_eda_dataset.ipynb # EDA for synthetic dataset
│ ├── 00_eda_audio_asr.ipynb # EDA for audio & ASR quality
│ ├── 01_generate_synthetic_dataset.ipynb # GPT-4o-mini data generation
│ ├── 02_text_to_speech.ipynb # Coqui TTS synthesis
│ ├── 03_noise_and_asr.ipynb # Noise injection + Whisper ASR
│ ├── 04_train_and_evaluate.ipynb # Model training & experiments
│ └── 05_demo_inference_and_extraction.ipynb # Demo & extraction
│
├── data/ # Datasets
│ ├── processed/
│ │ ├── 02seaalert.csv # Main dataset (clean text)
│ │ └── 03seaalert_with_asr.csv # Dataset with ASR transcripts
│ ├── asr/
│ │ └── asr_transcripts.csv # Whisper raw transcripts
│ └── audio_*/ # Audio index files
│ └── *_index.csv
│
├── results/ # Results & visualizations
│ ├── csv/ # CSV data (metrics, splits, error reports)
│ └── visuals/ # Figures, plots, and text reports
│
├── presentation/ # Project presentations
│ ├── Proposal.pdf
│ ├── Interim.pdf
│ └── Final.pdf
│
├── archive/ # Previous project versions
│
├── assets/ # Project images and diagrams
│ └── pipeline_diagram.png
│
├── .gitignore # Git ignore rules
└── README.md # This file
- Proposal – Initial project proposal
- Interim – Mid-project progress update
- Final – Final project presentation
(PPTX files are also included in the presentation/ folder)
- Label/style/scenario distributions
- Text length analysis
- Codeword presence analysis
- Word clouds by severity label
- Audio duration distributions
- Spectrogram visualizations
- WER (Word Error Rate) by noise level
- Codeword preservation in ASR
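The WER figures reported in the audio EDA follow the standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal dynamic-programming implementation, shown for illustration (the notebooks may use a library instead):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / len(ref)

print(word_error_rate("mayday mayday mayday we are sinking",
                      "maybe maybe maybe we are sinking"))  # 0.5
```

Three substitutions out of six reference words give a WER of 0.5, so even a single corrupted codeword repeated three times dominates the error rate of a short call.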
Generates 1,872 synthetic maritime messages using GPT-4o-mini.
Features:
- 4 balanced classes: 468 samples each
- 3 communication styles: formal, informal, third_party
- 12 scenario types: water_ingress, fire_smoke, medical_issue, etc.
- Codeword masking for experiments
- Stratified train/val/test splits (70/15/15)
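The 70/15/15 stratified split can be produced with two chained calls to scikit-learn's `train_test_split`. A sketch with dummy labels standing in for the real dataset (the notebook's exact seed and arguments may differ):

```python
from sklearn.model_selection import train_test_split

# Dummy balanced labels standing in for the 1,872-sample dataset
labels = ["Distress", "Urgency", "Safety", "Routine"] * 100
indices = list(range(len(labels)))

# First split off 70% for training, then split the remaining 30% in half
# (15/15), stratifying on the label at each step to keep classes balanced
train_idx, rest_idx, train_y, rest_y = train_test_split(
    indices, labels, test_size=0.30, stratify=labels, random_state=42)
val_idx, test_idx, val_y, test_y = train_test_split(
    rest_idx, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_idx), len(val_idx), len(test_idx))  # 280 60 60
```

Stratifying both splits guarantees each of the 4 labels appears in the same proportion in train, validation, and test.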
Converts text to speech using Coqui TTS.
Model: tts_models/en/ljspeech/tacotron2-DDC
Output: 1,872 WAV files (16kHz mono)
Adds realistic VHF radio noise and transcribes with Whisper.
| Noise Level | SNR | WER | Characteristics |
|---|---|---|---|
| Low | 18dB | ~15% | Light static |
| Med | 12dB | ~20% | Moderate static, some dropouts |
| High | 6dB | ~25% | Heavy static, frequent dropouts |
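Mixing noise at a target SNR amounts to scaling the noise so that the signal-to-noise power ratio hits the requested value before adding it to the speech. A NumPy sketch of that scaling (the notebook's actual noise source is VHF-style static rather than the white noise used here):

```python
import numpy as np

def add_noise_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Required noise power for the target SNR: P_s / P_n = 10^(SNR/10)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_noise_power / noise_power)
    return signal + scaled

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz "speech"
static = rng.standard_normal(16000)                           # white "VHF static"
noisy = add_noise_at_snr(speech, static, snr_db=6.0)          # high-noise setting
```

The same function covers all three settings by passing 18, 12, or 6 dB; at 6 dB the noise power is already a quarter of the signal power, which is why WER climbs to ~25%.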
Main training notebook with comprehensive experiments.
Models:
| Model | Library | Notes |
|---|---|---|
| TF-IDF + LogReg | scikit-learn | Baseline |
| TF-IDF + SVM | scikit-learn | Baseline |
| TF-IDF + NaiveBayes | scikit-learn | Baseline |
| DistilBERT | HuggingFace | Evaluated (66M params) |
| RoBERTa-base | HuggingFace | Selected (125M params) |
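The TF-IDF + Logistic Regression baseline follows the standard scikit-learn pipeline pattern. A sketch with toy data, not the project's exact hyperparameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training messages standing in for the real dataset
texts = [
    "MAYDAY we are sinking and taking on water",
    "PAN PAN engine failure drifting near shipping lane",
    "SECURITE navigation hazard floating container reported",
    "Radio check this is coastal station over",
] * 10
labels = ["Distress", "Urgency", "Safety", "Routine"] * 10

# Word uni/bigram TF-IDF features feeding a linear classifier:
# the bag-of-words baseline compared against RoBERTa
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["mayday mayday taking on water six persons aboard"])[0])
```

Swapping `LogisticRegression` for `LinearSVC` or `MultinomialNB` yields the other two baselines in the table with no other changes to the pipeline.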
Experiments:
- Codeword Masking — Tests reliance on GMDSS keywords
- Adversarial Traps — Negations, drills, resolved incidents
- ASR Robustness — Performance on noisy transcripts
End-to-end demonstration:
- Classify messages with trained RoBERTa model
- Compare original vs ASR-corrupted text
- Extract structured information (vessel, location, POB)
- Generate visual rescue reports
Download/View Full Dataset (Audio & Metadata) on Google Drive
| Column | Type | Description |
|---|---|---|
| `idx` | int | Unique sample index (0-1871) |
| `text` | str | Original message text |
| `label` | str | Routine / Safety / Urgency / Distress |
| `style` | str | formal / informal / third_party |
| `scenario_type` | str | water_ingress, fire_smoke, etc. |
| `has_codeword` | bool | Contains MAYDAY/PAN PAN/SECURITE |
| `codeword` | str | MAYDAY / PAN PAN / SECURITE / NONE |
| `text_masked` | str | Codewords replaced by [SIGNAL] |
| `vessel` | str | Vessel name |
| `call_sign` | str | Radio call sign |
| `mmsi` | str | MMSI number (9 digits) |
| `location` | str | Position/coordinates |
| `pob` | int | Persons on board |
| `nature` | str | Nature of incident |
- Total samples: 1,872
- Labels: 468 per class (perfectly balanced)
- With codeword: ~35%
- Text length: 35-129 words (avg: 79)
Tests if models rely on GMDSS codewords or understand context.
| Setting | Train Data | Test Data | BoW F1 | RoBERTa F1 |
|---|---|---|---|---|
| A (Clean) | text | text | 0.674 | 0.664 |
| B (Masked) | masked | masked | 0.565 | 0.520 |
| C (Transfer) | text | masked | 0.444 | 0.520 |
Finding: Both models rely heavily on codewords. RoBERTa shows better transfer to masked text.
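The masked settings replace every GMDSS codeword with a neutral `[SIGNAL]` token so the classifier cannot lean on the keyword. A regex sketch of that masking step (assumed codeword list, matching the dataset's `text_masked` column description):

```python
import re

# GMDSS codewords to hide from the classifier
CODEWORD_RE = re.compile(r"\b(?:mayday|pan[- ]?pan|s[eé]curit[eé])\b", re.IGNORECASE)

def mask_codewords(text: str) -> str:
    """Replace every GMDSS codeword occurrence with the [SIGNAL] placeholder."""
    return CODEWORD_RE.sub("[SIGNAL]", text)

print(mask_codewords("MAYDAY, MAYDAY. Pan pan ignored."))
# [SIGNAL], [SIGNAL]. [SIGNAL] ignored.
```

Setting B applies this to both train and test text; Setting C trains on unmasked text and tests on masked text, measuring how much of the model's decision survives without the keyword.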
Tests with samples designed to fool keyword-based models:
- Negation: "This is NOT a distress"
- Drills: "MAYDAY - this is a drill"
- Past incidents: "Distress was resolved yesterday"
| Model | Trap Accuracy | Trap F1 |
|---|---|---|
| BoW | 26.7% | 0.139 |
| RoBERTa | 33.3% | 0.236 |
Finding: Both struggle, but RoBERTa performs ~70% better.
Tests performance on Whisper-transcribed noisy audio.
| Model | Clean F1 | ASR-Med F1 | ASR-High F1 | Degradation |
|---|---|---|---|---|
| BoW | 0.674 | 0.427 | 0.423 | -37% |
| RoBERTa | 0.664 | 0.605 | 0.569 | -14% |
| BoW (augmented) | - | - | 0.589 | - |
Finding: RoBERTa is significantly more robust to ASR noise. Data augmentation helps BoW.
Each notebook auto-installs dependencies. Just run the first cell.
# Core
pip install pandas numpy tqdm scikit-learn matplotlib joblib
# Text generation
pip install openai jsonschema
# TTS & Audio
pip install TTS soundfile librosa scipy
# ASR
pip install faster-whisper
# Transformers
pip install transformers datasets evaluate accelerate torch

git clone https://github.com/your-repo/SeaAlert.git
cd SeaAlert

# Create src/API_KEY.py containing your OpenAI key:
OPENAI_API_KEY = "sk-your-key-here"

01_generate_synthetic_dataset.ipynb → Generate data
02_text_to_speech.ipynb → Create audio
03_noise_and_asr.ipynb → Add noise & transcribe
04_train_and_evaluate.ipynb → Train & evaluate
05_demo_inference_and_extraction.ipynb → Demo
Set QUICK_RUN = True in Notebook 01 for template-based data.
An educational project built for an NLP course.
- Coqui TTS - Text-to-Speech synthesis
- Faster Whisper - ASR transcription
- HuggingFace Transformers - RoBERTa model
- OpenAI GPT-4o-mini - Synthetic data generation
