# Data preparation of multiple NOTAMs files to create a timeline

What is a NOTAM?

NOTAM stands for Notice To Airmen. It is the primary means of disseminating safety-relevant and operational information to pilots.

As NOTAMs contain critical information about aircraft and passenger entry requirements, we use NLP to extract relevant information from these messages. NOTAMs are similar to telegram messages and contain many abbreviations to reduce the length of the message. So before extracting information from these messages, the text has to be preprocessed.

This notebook cleans the NOTAM messages so that they can be used later for further analysis.

**NOTE:**

The results in this notebook cannot be directly reproduced. The input data has to be downloaded by the user!

We read in the NOTAM data downloaded from https://www.icao.int/safety/iStars/Pages/API-Data-Service.aspx. The data will not be published on this site; each user has to download it to carry out the following analysis. To download the data, the user must register on https://www.icao.int/safety/iStars/Pages/API-Data-Service.aspx to get a free API key.



## Data Sources

**Data collection from ICAO website**

The following data are collected:

1. COVID-related NOTAMs from airspaces (Airspace COVID-19 NOTAMs)

Airport NOTAMs also provide information about the closure of airports.


Example of a COVID-19 NOTAM message:

    COVID-19: ORDERS OF THE STATE GOVERNMENT OF BRANDENBURG WITH THE AIM OF PREVENTING THE INTRODUCTION OR SPREAD OF INFECTIONS BY SARS-COV-2. ALL PAX ENTERING THE FEDERAL REPUBLIC OF GERMANY AS THEIR FINAL DESTINATION FROM RISK AREAS DIRECT OR VIA TRANSFER(1) MUST STAY IN QUARANTINE FOR 14 DAYS AFTER ARRIVAL AND (2) MUST CONTACT LOCAL HEALTH AUTHORITY OF THEIR FINAL DESTINATION IMMEDIATLY. EXCEPTIONS ARE POSSIBLE IN THE CASE OF A NEGARIVE PCR TEST FOR SARS-COV-2 IN GERMAN AND ENGLISH FOR A MAXIMUM OF 48 HOURS BEFORE ENTRY. THESE REGULATIONS DO NOT APPLY FOR CREW MEMBERS. THE CREW MUST PROVIDE INFORMATION ABOUT THESE REGULATIONS TO ALL PAX INFLIGHT. CREATED: 15 Jun 2020 15:58:00 SOURCE: EUECYIYN


**Input**

   Downloaded datasets 

    - all_airspaces_covid_notams_xx.csv (multiple files)

**Output**
  
  Preprocessed datasets: multiple airspace NOTAM files (one per week, beginning with the last week of May 2020)

    - valid_airspace_notams_xx.csv

where 'xx' is the download date (YYYYMMDD) for the input files and the ISO week number for the output files

The following steps are carried out in preprocessing the data:

1. Extracting NOTAMs from JSON: As the NOTAMs are stored in JSON format, the initial step is to extract these messages from the JSON string


2. Cleaning the NOTAMS

    * Remove white spaces
    * Remove hyperlinks
    * Map abbreviations to the full words
    * Remove foreign text if the phrase "english text/english version" is present
    * Remove words starting with symbols
    * Remove punctuation


3. Generating tokens

    * Remove numbers
    * Lemmatization
    * Remove stop words

4. The latest file for each week is selected and then the above preprocessing steps are carried out
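
The cleaning steps in item 2 that need no external model (lower-casing, hyperlink removal, symbol and punctuation removal, white space normalisation) can be sketched with the standard library alone. The function name `basic_clean` and the sample string are illustrative assumptions; abbreviation mapping and language filtering are left out of this sketch.

```python
import re

def basic_clean(text):
    """Apply the model-free cleaning steps to a single NOTAM string."""
    text = text.lower().strip()
    # Drop hyperlinks (anything starting with http, https or www)
    text = re.sub(r"(http|https|www)\S+", "", text)
    # Replace every character that is not a space, digit or lower-case letter
    text = re.sub(r"[^ 0-9a-z]", " ", text)
    # Collapse runs of white space into single spaces
    return " ".join(text.split())

cleaned = basic_clean("  COVID-19:  see https://example.org/info FOR DETAILS.  ")
print(cleaned)  # covid 19 see for details
```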

In [None]:
# Install each package only if it cannot be imported
try:
    import spacy
except ImportError:
    !pip install spacy
try:
    import spacy_langdetect
except ImportError:
    !pip install spacy-langdetect
try:
    import flair
except ImportError:
    !pip install flair
try:
    import geonamescache
except ImportError:
    !pip install geonamescache
try:
    import spacy_fastlang
except ImportError:
    !pip install spacy_fastlang
    #!pip install sense2vec==1.0.0a1
try:
    import gensim
except ImportError:
    !pip install gensim
try:
    import wordcloud
except ImportError:
    !pip install wordcloud
try:
    import nltk
except ImportError:
    !pip install nltk
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md

In [None]:
import spacy

from collections import Counter, defaultdict,OrderedDict

import pandas as pd
import os
import csv
import itertools
import re
import json
import numpy as np
import matplotlib.pyplot as plt
import datetime
import string

from spacy_langdetect import LanguageDetector
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

from langdetect import detect, detect_langs
from nltk.tokenize import sent_tokenize

import nltk
nltk.download('punkt')

**1. Extract NOTAM from json**

In [None]:
def decode_airport_message(df):
    apt_dt = dict()
    apt_dt['message'] = []
    apt_dt['Qcode'] = []
    apt_dt['createdDate'] = []
    apt_dt['Closed'] = []
    apt_dt['airportName'] = []
    apt_dt['airportCode'] = []
    apt_dt['cityName'] = []
    apt_dt['countryCode'] = []
    apt_dt['countryName'] = []
    apt_dt['latitude'] = []
    apt_dt['longitude'] = []
    
    for idx,row in df.iterrows():
        if type(row['notams'])==str:
            jsd = json.loads(row['notams'])
            for m in jsd['message'].values():
                m_ = m.split("\nCREATED: ")[0]
                m_ = m_.replace('\n',' ')
                #print(m_)
                apt_dt['message'].append(m_)
                apt_dt['Closed'].append(row['Closed'])
                apt_dt['airportName'].append(row['airportName'])
                apt_dt['airportCode'].append(row['airportCode'])
                apt_dt['cityName'].append(row['cityName'])
                apt_dt['countryCode'].append(row['countryCode'])
                apt_dt['countryName'].append(row['countryName'])
                apt_dt['latitude'].append(row['latitude'])
                apt_dt['longitude'].append(row['longitude'])

            for qc in jsd['Qcode'].values():
                apt_dt['Qcode'].append(qc)
            for cd in jsd['Created'].values():
                apt_dt['createdDate'].append(cd)
        else:
            apt_dt['message'].append(None)
            apt_dt['Qcode'].append(None)
            apt_dt['createdDate'].append(None)
            apt_dt['Closed'].append(row['Closed'])
            apt_dt['airportName'].append(row['airportName'])
            apt_dt['airportCode'].append(row['airportCode'])
            apt_dt['cityName'].append(row['cityName'])
            apt_dt['countryCode'].append(row['countryCode'])
            apt_dt['countryName'].append(row['countryName'])
            apt_dt['latitude'].append(row['latitude'])
            apt_dt['longitude'].append(row['longitude'])

    apt_df = pd.DataFrame(apt_dt)
    apt_df['createdDate'] = pd.to_datetime(apt_df['createdDate'])
    return apt_df

In [None]:
def decode_airspace_message(df):
    apt_dt = dict()
    apt_dt['message'] = []
    apt_dt['Qcode'] = []
    apt_dt['createdDate'] = []
    apt_dt['Closed'] = []
    apt_dt['FIRcode'] = []
    apt_dt['FIRname'] = []
    apt_dt['countryCode'] = []
    apt_dt['countryName'] = []
    
    for idx,row in df.iterrows():
        if type(row['notams'])==str:
            jsd = json.loads(row['notams'])
            for m in jsd['message'].values():
                m_ = m.split("\nCREATED: ")[0]
                m_ = m_.replace('\n',' ')
                #print(m_)
                apt_dt['message'].append(m_)
                apt_dt['FIRname'].append(row['FIRname'])
                apt_dt['FIRcode'].append(row['FIRcode'])
                apt_dt['countryCode'].append(row['countryCode'])
                apt_dt['countryName'].append(row['countryName'])

            for qc in jsd['Qcode'].values():
                apt_dt['Qcode'].append(qc)
            for cd in jsd['Created'].values():
                apt_dt['createdDate'].append(cd)
            for c_ in jsd['Closed'].values():
                apt_dt['Closed'].append(c_)
        else:
            apt_dt['message'].append(None)
            apt_dt['Qcode'].append(None)
            apt_dt['createdDate'].append(None)
            apt_dt['Closed'].append(None)
            apt_dt['FIRname'].append(row['FIRname'])
            apt_dt['FIRcode'].append(row['FIRcode'])
            apt_dt['countryCode'].append(row['countryCode'])
            apt_dt['countryName'].append(row['countryName'])

    apt_df = pd.DataFrame(apt_dt)
    apt_df['createdDate'] = pd.to_datetime(apt_df['createdDate'])
    return apt_df
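
The decoders above assume that each `notams` cell holds a JSON string in a column-oriented layout, where the keys `message`, `Qcode`, `Created` and `Closed` each map a record index to a value. The following is a minimal sketch of that expansion on a made-up payload; the sample values and the helper name `expand_notams` are illustrative assumptions and not part of the ICAO data.

```python
import json

# Hypothetical payload in the column-oriented layout the decoders expect
sample = json.dumps({
    "message": {
        "0": "COVID-19: FLT RESTRICTIONS.\nCREATED: 15 Jun 2020 15:58:00 SOURCE: EUECYIYN",
        "1": "COVID-19: PAX MUST\nQUARANTINE.",
    },
    "Qcode": {"0": "OEXX", "1": "FAXX"},
    "Created": {"0": "2020-06-15T15:58:00", "1": "2020-06-16T08:00:00"},
    "Closed": {"0": False, "1": False},
})

def expand_notams(notams_json):
    """Return one dict per NOTAM, pairing the columns by their shared index key."""
    jsd = json.loads(notams_json)
    records = []
    for key, raw in jsd["message"].items():
        # Drop the "CREATED: ... SOURCE: ..." trailer and flatten line breaks
        text = raw.split("\nCREATED: ")[0].replace("\n", " ")
        records.append({
            "message": text,
            "Qcode": jsd["Qcode"][key],
            "createdDate": jsd["Created"][key],
            "Closed": jsd["Closed"][key],
        })
    return records

records = expand_notams(sample)
print(records[0]["message"])  # COVID-19: FLT RESTRICTIONS.
```

Pairing by index key instead of relying on the order of `.values()` makes a misalignment between the columns surface as a `KeyError` rather than silently shifting rows.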

**The following API provides a consolidated list of NOTAMs on airport and airspace restrictions related to COVID-19**

List of NOTAMs for airspaces and airports referring to COVID-19 restrictions

Data dictionary of Airport COVID-19 NOTAMS

|Field|	Type|	Description|
|-----|-----|-----|
|countryName|	string|	Name of the Country|
|countryCode|	string|	ISO 3-Letter Code of the Country|
|airportName|	string|	Name of the airport, searchable|
|cityName|	string|	Name of the city, searchable|
|airportCode|	string|	ICAO 4-letter code of the airport|
|latitude|	number|	Latitude in Decimal degrees|
|longitude|	number|	Longitude in Decimal degrees|
|NoTraffic|	string|	Whether the airport has had less than one flight per day in the last 7 days (TRUE or FALSE)|
|Closed|	string|	Whether the airport has a NOTAM with Q-code FALC, which means the airport is closed (TRUE or FALSE)|
|traffic|	string|	Traffic data of the reference week, previous week and current week (json stringified format)|
|notams|	string|	NOTAMS containing COVID or CORONAVIRUS key words for the airport (json stringified format)|
|message| string| NOTAM message as a string|
|Qcode| string| Qcode of the NOTAM|
|createdDate| datetime| NOTAM created date|

Qcode reference: https://www.notams.faa.gov/common/qcode/qcode.html

**Preprocessing**

In [None]:
all_airport_codes = pd.read_csv("/project_data/data_asset/all_airports_covid_notams_20200525.csv")['airportCode'].values

In [None]:
############
#Stop words#
############

nlp_ = spacy.load('en_core_web_md')

# Adding stop words
new_stop_words = ["create","source","euecyiyn",'etczyoyx','tel']

# Add airport codes to stop words
new_stop_words.extend([ac.lower() for ac in list(all_airport_codes)])


for new_word in new_stop_words:
    nlp_.vocab[new_word].is_stop = True

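# Add language detector to pipeline
# Note: passing a component object is the spaCy 2.x form of add_pipe;
# spaCy 3.x expects the name of a registered component factory instead.
nlp_.add_pipe(LanguageDetector(), name='language_detector', last=True)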
    

#https://proairpilot.com/faa-notam.html
#https://www.icao.int/NACC/Documents/Meetings/2014/ECARAIM/REF03-ICAOCodes.pdf
mapping = {"acc": "area control", "acft": "aircraft", "ad": "aerodrome", "aic": "aeronautical information circular",
           "aip": "aeronautical information publication", "ais": "aeronautical information services",
           "alt": "altitude", "altn": "alternate", "ap": "airport", "aro": "air traffic services reporting office",
           "arr": "arrival", "atc": "air traffic control", "ats": "air traffic services", "attn": "attention",
           "auth": "authorized", "avbl": "available", "bfr": "before", "cat": "category", "chg": "change","civ":"civil",
           "clsd": "closed", "cov": "cover", "cta": "control area", "ctc": "contact", "ctr": "control zone",
           "dem.": "democratic", "dep": "depart", "emerg": "emergency", "enr": "en route", "exc": "except",
           "fed.": "federation", "fir": "flight information region", "fis": "flight information service",
           "flt": "flight", "flts": "flights", "flw": "follows", "fm": "from", "fpl": "filed flight plan",
           "fri": "friday", "gen": "general", "hr": "hour", "intl": "international", "isl.": "islands",
           "ldg": "landing", "mil": "military", "mon": "monday", "op": "operation","ops": "operations", 
           "opr": "operating","pax": "passenger",
           "ppr": "prior permission required", "ref": "reference to", "rep.": "republic", "req": "request",
           "rffs": "rescue and fire fighting services", "rmk": "remark", "rte": "route", "rwy": "runway",
           "sat": "saturday", "ser": "service", "svc": "service message", "taf": "terminal aerodrome forecast",
           "tfc": "traffic", "thu": "thursday", "tma": "terminal control area", "tue": "tuesday",
           "twr": "aerodrome control tower", "vfr": "visual flight rules"}


**2. Cleaning the NOTAMs**

In [None]:
def clean_message(message):
    #############################
    ## mapping, lower, language #
    #############################
    # Make everything lower case and strip surrounding white space
    message = message.lower().strip()
    message = re.sub(r'(http|https|www)\S+', '', message)
    
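    # Mapping short terms to actual word.
    # re.escape keeps the "." in keys such as "dem." literal; the lookahead
    # replaces the trailing \b, which does not match between a period and a space.
    for s_, w in mapping.items():
        message = re.sub(r'\b{}(?!\w)'.format(re.escape(s_)), w, message)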
    
    # If a country has NOTAMs in two languages, the english text starts with the phrase "english text/english version"
    if "english text" in message:
        #print("TEXT")
        e_t = []
        for m_ in message.split("english text"):
            for sent_ in sent_tokenize(m_):
                if detect(sent_) == "en":
                    e_t.append(sent_)
        message = " ".join(e_t)  # keep a space so sentences do not run together

    elif "english version" in message:
        #print("VERSION")
        e_t = []
        for m_ in message.split("english version"):
            for sent_ in sent_tokenize(m_):
                if detect(sent_) == "en":
                    e_t.append(sent_)
        
        message = " ".join(e_t)  # keep a space so sentences do not run together
        #message = message.split("english version")[1]
    elif detect(message) != "en":
        message = ""

    # Replace every character other than a space, digit or lower-case letter with a space
    message = re.sub(r'[^ 0-9a-z]', ' ', message)
    message = message.translate(message.maketrans('', '', string.punctuation)) #extra punctuations removal

    # Remove unnecessary white space
    message = " ".join(message.split())
    
    return message
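
The abbreviation-expansion step can be illustrated in isolation. This is a minimal sketch using a three-entry subset of the mapping; the helper name `expand_abbreviations` is hypothetical, and `re.escape` combined with a negative lookahead is one possible way to handle keys that end in a period, such as `rep.`.

```python
import re

# Small subset of the abbreviation mapping, for illustration only
sample_mapping = {"pax": "passenger", "acft": "aircraft", "rep.": "republic"}

def expand_abbreviations(text, mapping):
    """Replace whole-word abbreviations in lower-cased text with their full form."""
    for abbr, word in mapping.items():
        # \b anchors the start of the abbreviation; (?!\w) ensures that no
        # word character follows it, so "paxos" is left untouched.
        pattern = r"\b{}(?!\w)".format(re.escape(abbr))
        text = re.sub(pattern, word, text)
    return text

expanded = expand_abbreviations(
    "all pax on acft from czech rep. must quarantine", sample_mapping
)
print(expanded)  # all passenger on aircraft from czech republic must quarantine
```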

**3. Generating tokens**

In [None]:
def generate_tokens(message):
    sent_ = nlp_(message)
    # Cleaning text
    tokens = []
    for token in sent_:
        # Keep alphabetic tokens only, which drops punctuation and numbers
        if token.is_alpha:
            # Lemma
            lemma_text = token.lemma_
            #Remove stop words
            if not nlp_.vocab[lemma_text].is_stop:
                if len(lemma_text) > 2:
                    tokens.append(lemma_text)
    return tokens

**Airspace COVID-19 NOTAMS - one file per week**

In [None]:
files_dt = dict()
files_dt['download_date'] = []
files_dt['download_week'] = []
files_dt['file_location'] = []
for file_ in os.listdir("/project_data/data_asset/"):
    # Only airspace files take part in the weekly selection
    if file_.startswith("all_airspaces"):
        #print(file_)
        files_dt['download_date'].append(datetime.datetime.strptime(file_.split("_")[-1].split(".csv")[0],"%Y%m%d"))
        files_dt['download_week'].append(datetime.datetime.strptime(file_.split("_")[-1].split(".csv")[0],"%Y%m%d").isocalendar()[1])
        files_dt['file_location'].append(os.path.join("/project_data/data_asset/",file_))

# The file with the latest date for the selected week is considered
files_df = pd.DataFrame(files_dt)
weekly_files_df = files_df[files_df['download_date'] == files_df.groupby("download_week", sort=True)["download_date"].transform("max")]

# Iterate one file per week!
for idx, r in weekly_files_df.iterrows():
    if "all_airspaces" in r['file_location']:
        print(r['file_location'])
        
        asp_week_df = pd.read_csv(r['file_location'])
        
        # Expand the jsons in the dataframe
        asp_df = decode_airspace_message(asp_week_df)
        
        # Preprocess the data
        asp_df['tokens'] = None
        asp_df['cleaned_message'] = None

        for idx,row in asp_df.iterrows():
            if row['message'] is not None:
                message = row['message']

                message_ = clean_message(message)
                tokens = generate_tokens(message_)

                asp_df.at[idx,"cleaned_message"] = message_
                asp_df.at[idx,"tokens"] = tokens
        
        # Remove na rows
        valid_asp_df = asp_df.dropna(subset=['message'])
        valid_asp_df = valid_asp_df[valid_asp_df.cleaned_message != '']
        
        # Add the date components as columns
        valid_asp_df['active_week']=r['download_week']
        valid_asp_df['download_date'] = r['download_date']
        print(len(valid_asp_df))
        valid_asp_df.to_csv("/project_data/data_asset/notams/valid_airspace_notams_"+str(r['download_week'])+".csv", index=False,quoting=csv.QUOTE_NONNUMERIC)
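
The weekly selection used above (keep only the most recent download within each ISO week) can be sketched without pandas. The file names below are made up to follow the `..._YYYYMMDD.csv` pattern, and the helper name `latest_file_per_week` is an assumption for illustration.

```python
import datetime

# Hypothetical file names following the <prefix>_YYYYMMDD.csv pattern
file_names = [
    "all_airspaces_covid_notams_20200525.csv",
    "all_airspaces_covid_notams_20200527.csv",
    "all_airspaces_covid_notams_20200601.csv",
]

def latest_file_per_week(names):
    """Map each ISO week number to the most recently downloaded file name."""
    latest = {}
    for name in names:
        stamp = name.rsplit("_", 1)[-1].split(".csv")[0]
        date = datetime.datetime.strptime(stamp, "%Y%m%d")
        week = date.isocalendar()[1]
        # Keep the file with the latest date seen so far for this week
        if week not in latest or date > latest[week][0]:
            latest[week] = (date, name)
    return {week: name for week, (date, name) in latest.items()}

weekly = latest_file_per_week(file_names)
print(sorted(weekly.values()))
```

The ISO week number alone repeats every year, so keying on `isocalendar()[:2]` (year and week) would be needed if the downloads span more than one year.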

**Unresolved abbreviations**

de, del, atr, sar (this refers to special administrative zone as well as search and rescue),
dgac (some authority center), caa (some authority center), enac (some authority center), hum, act (multiple meanings)

**Qcode**

A Qcode is a brevity code that indicates the type of message being sent. For our analysis we can ignore some Qcodes, such as aerodrome service hours, as we are mainly looking for quarantine duration and country restrictions.


|Qcode| Meaning|
|----|-----|
|FAXX | Aerodrome - other|
|FAAH | Aerodrome - HOURS OF SERVICE ARE|
|FALT | Aerodrome - LIMITED TO|
|FFCG | FIRE FIGHTING AND RESCUE - DOWNGRADED TO|
|XXXX | Other - Other|
|FALC | Aerodrome - CLOSED|
|ACXX | CLASS B, C, D OR E SURFACE AREA (ICAO-CONTROL ZONE) - OTHER|
|FAAP | Aerodrome - PRIOR PERMISSION REQUIRED|
|SPAH | APPROACH CONTROL - HOURS OF SERVICE ARE|
|AFXX | FLIGHT INFORMATION REGION (FIR) - OTHER|
|OEXX | AIRCRAFT ENTRY REQUIREMENTS - OTHER|
|OECA | AIRCRAFT ENTRY REQUIREMENTS - |
|OAXX | AERONAUTICAL INFORMATION SERVICE - OTHER|
|SEAH | FLIGHT INFORMATION SERVICE - HOURS OF SERVICE ARE|

Qcode to be excluded: FAAH, FFCG, SPAH, and other codes that end with AH
    
Qcode of interest: FAXX, FALC, FALT, OEXX, OECA
    
Qcode not sure: ACXX, FAAP
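
The selection rules above can be expressed as a small filter. This is a sketch under stated assumptions: the helper name `keep_qcode` is hypothetical, and the codes marked as "not sure" (ACXX, FAAP) are dropped here, which is a choice made for this example and not a decision taken in the notebook.

```python
# Q-codes of interest, as listed above
QCODES_OF_INTEREST = {"FAXX", "FALC", "FALT", "OEXX", "OECA"}

def keep_qcode(qcode):
    """Return True for Q-codes of interest, False for everything else."""
    if qcode is None:
        return False
    code = qcode.strip().upper()
    # Exclude service-hours codes (ending in AH) and FFCG explicitly
    if code.endswith("AH") or code == "FFCG":
        return False
    return code in QCODES_OF_INTEREST

sample_codes = ["FAXX", "FAAH", "OEXX", "SPAH", "FFCG", "ACXX", None]
kept = [c for c in sample_codes if keep_qcode(c)]
print(kept)  # ['FAXX', 'OEXX']
```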

**Author**

* Shri Nishanth Rajendran - AI Development Specialist, R² Data Labs, Rolls Royce