# Extraction of quarantine information from NOTAMs


In this notebook, we search the preprocessed NOTAMs dataset for messages containing the terms "quarantine" and "isolation". On the matching messages we run Named Entity Recognition (NER) and keep the DATE entities that mention days or weeks, on the assumption that such an entity corresponds to the quarantine duration. Messages in which such an entity is found are copied to a separate column of the dataframe, together with the entity text.
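
As a minimal sketch of the keyword search described above, the snippet below checks a token list for the two terms. It assumes that the `tokens` column used later in this notebook holds a stringified Python list after being read back from CSV; `KEYWORDS` and `mentions_quarantine` are hypothetical names, not part of the notebook.

```python
import ast

# Search terms used by the notebook's filter
KEYWORDS = ("quarantine", "isolation")

def mentions_quarantine(tokens_cell):
    """Return True if any keyword is an exact token in the cell."""
    # Assumption: a CSV round trip turns a token list into its string form,
    # e.g. "['flight', 'quarantine']", so parse it back into a list first.
    if isinstance(tokens_cell, str):
        tokens = ast.literal_eval(tokens_cell)
    else:
        tokens = tokens_cell
    return any(keyword in tokens for keyword in KEYWORDS)

print(mentions_quarantine("['passengers', 'quarantine', '14', 'days']"))
print(mentions_quarantine("['quarantined', 'area']"))
```

On the raw string, `'quarantine' in cell` is a substring test and would also match `quarantined`. Which behaviour is wanted depends on how the tokens were normalized during preprocessing.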


**Input**

To generate the input dataset, refer to this notebook: ws2_snr_NOTAMs_1_data_preparation

Preprocessed datasets

    - valid_airport_notams_xx.csv
    - valid_airspace_notams_xx.csv

**Output**

    - valid_airport_notams_with_quarantine_xx.csv
    - valid_airspace_notams_with_quarantine_xx.csv

Datasets with additional columns (quarantine_text and quarantine_days) holding the quarantine-related text


The following steps are carried out:

1. Read the preprocessed dataset

2. Extract quarantine-related text

3. Save the file

4. Review the observations

In [None]:
import requests


import spacy

from collections import Counter, defaultdict

import pandas as pd
import os
import csv
import itertools
import re
import json
import numpy as np
import matplotlib.pyplot as plt
import datetime
import ast

from spacy_langdetect import LanguageDetector
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

from wordcloud import WordCloud
from spacy import displacy
import seaborn as sbs
import geonamescache


plt.style.use('fivethirtyeight')
%matplotlib inline

**1. Read the preprocessed dataset**

In [None]:
apt_df = pd.read_csv("/project_data/data_asset/ws2/notams/valid_airport_notams_20200717.csv")
asp_df = pd.read_csv("/project_data/data_asset/ws2/notams/valid_airspace_notams_20200717.csv")

**2. Extract quarantine-related text**

- Load the spaCy model

- Filter for text containing the terms "quarantine" and "isolation"

- Extract DATE entities that mention a day or a week from the filtered text
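
The DATE filter in the steps above can be sketched without loading a model by working on `(text, label)` pairs, which is the information the notebook reads from each spaCy entity via `ent.text` and `ent.label_`. The function name `duration_entities` and the sample pairs are illustrative assumptions, not taken from the dataset.

```python
def duration_entities(entities):
    """Keep DATE entities whose text mentions a day or a week."""
    kept = []
    for text, label in entities:
        upper = text.upper()
        # Same substring test as the notebook: label must be DATE and
        # the text must contain "DAY" or "WEEK" (case-insensitive)
        if label == "DATE" and ("DAY" in upper or "WEEK" in upper):
            kept.append(text)
    return kept

# Illustrative (text, label) pairs standing in for doc.ents
sample = [
    ("14 days", "DATE"),
    ("17 July 2020", "DATE"),
    ("Norway", "GPE"),
    ("two weeks", "DATE"),
]
print(duration_entities(sample))
```

Because this is a substring test, weekday names such as "Monday" also pass the filter, which is one way a DATE entity unrelated to the quarantine duration can end up in the result.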

In [None]:
############
#Stop words#
############

# Load the spaCy model used for NER
nlp_ = spacy.load('en_core_web_md')

# Custom stop words
new_stop_words = ["create", "source", "euecyiyn", "etczyoyx", "tel"]

# Add airport codes to the stop words (missing codes are skipped)
new_stop_words.extend(ac.lower() for ac in apt_df.airportCode.dropna().unique())

# Flag each word as a stop word in the model vocabulary
for new_word in new_stop_words:
    nlp_.vocab[new_word].is_stop = True

In [None]:
def extract_quarantine_info(df):
    """Return a copy of df with quarantine_text and quarantine_days columns."""
    quarantine_duration_df = df.copy()
    quarantine_duration_df['quarantine_text'] = ""
    quarantine_duration_df['quarantine_days'] = ""
    for idx, row in quarantine_duration_df.iterrows():
        # Only run NER on messages that mention one of the keywords
        if not (('quarantine' in row['tokens']) or ('isolation' in row['tokens'])):
            continue
        message = row['cleaned_message']
        doc_ = nlp_(message)
        # Keep DATE entities whose text mentions days or weeks
        quarantine_days = [
            ent.text for ent in doc_.ents
            if ent.label_ == "DATE"
            and ("DAY" in ent.text.upper() or "WEEK" in ent.text.upper())
        ]
        #spacy.displacy.render(doc_, jupyter=True, style='ent',options={'ents':['DATE']})
        if quarantine_days:
            quarantine_duration_df.loc[idx, 'quarantine_days'] = ",".join(quarantine_days)
            # Store the message once, not once per matching entity
            quarantine_duration_df.loc[idx, 'quarantine_text'] = message

    return quarantine_duration_df

In [None]:
apt_df = extract_quarantine_info(apt_df)
asp_df = extract_quarantine_info(asp_df)

**3. Save the file**

In [None]:
apt_df.to_csv("/project_data/data_asset/ws2/notams/valid_airport_notams_with_quarantine_20200717.csv",index=False,quoting=csv.QUOTE_NONNUMERIC)
asp_df.to_csv("/project_data/data_asset/ws2/notams/valid_airspace_notams_with_quarantine_20200717.csv",index=False,quoting=csv.QUOTE_NONNUMERIC)

**4. Observations**

In [None]:
apt_df[apt_df.quarantine_days != '']

Airport NOTAMs

* NOTE: for some rows, the quarantine_days column might not correspond to the quarantine duration but to another DATE entity in the text!

* From the above dataframe, we see that in most cases the quarantine duration is 14 days. The text has to be read to get the exact quarantine regulations.

* Further work is needed to identify the different quarantine restrictions!
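
One possible way to check how often a given duration occurs is to convert the entity text to a number of days and count the results. The sketch below is an assumption about how that could be done, not part of this notebook: `DURATION_RE` and `to_days` are hypothetical, the pattern only handles digits followed by "day" or "week", and the sample strings are illustrative rather than taken from the dataset.

```python
import re
from collections import Counter

# Assumption: durations look like "<number> day(s)" or "<number> week(s)",
# optionally hyphenated as in "14-day"
DURATION_RE = re.compile(r"(\d+)[\s-]*(day|week)", re.IGNORECASE)

def to_days(entity_text):
    """Convert text such as '14 days' or '2 weeks' to a day count, else None."""
    match = DURATION_RE.search(entity_text)
    if match is None:
        # No digits before the unit, e.g. 'Monday' or a spelled-out number
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    return value * 7 if unit == "week" else value

# Illustrative values only
samples = ["14 days", "2 weeks", "10 DAYS", "Monday"]
counts = Counter(to_days(s) for s in samples)
print(counts)
```

Entries that map to `None` are the ones where the DATE entity is probably not a duration and the message still has to be read.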

In [None]:
asp_df[asp_df.quarantine_days != '']

Airspace NOTAMs


- Norway has a quarantine duration of 10 days

- South Africa has a quarantine duration of up to 21 days

- In some cases both 7 and 14 days are mentioned in the message. The message has to be read to understand the exact quarantine regulation!
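
Rows that mention more than one duration could be flagged for manual reading. The sketch below assumes the comma-joined format that `extract_quarantine_info` writes to quarantine_days; `distinct_numbers` and `needs_manual_review` are hypothetical helpers, and they treat every number in the string alike regardless of whether it refers to days or weeks.

```python
import re

def distinct_numbers(quarantine_days):
    """Return the sorted distinct integers in a comma-joined entity string."""
    numbers = set()
    for part in quarantine_days.split(","):
        numbers.update(int(n) for n in re.findall(r"\d+", part))
    return sorted(numbers)

def needs_manual_review(quarantine_days):
    """True when the string mentions more than one distinct number."""
    return len(distinct_numbers(quarantine_days)) > 1

# Illustrative strings in the comma-joined format
print(distinct_numbers("7 days,14 days"))
print(needs_manual_review("7 days,14 days"))
print(needs_manual_review("10 days"))
```

With a dataframe, the same check could be applied to the column with `asp_df.quarantine_days.apply(needs_manual_review)`.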

**Author**

* Shri Nishanth Rajendran - AI Development Specialist, R² Data Labs, Rolls Royce