# Topic Modeling of Airport NOTAMs


In this notebook, we read the preprocessed NOTAM data and identify the different topics present in the messages. This exercise is based on the topic modeling analysis carried out on news data: https://github.com/emergent-analytics/workstreams/blob/master/ws2/news-analysis/%20ws2_2_topic_modelling.ipynb

We assume that the Qcodes present in the data reflect the nature of each NOTAM, and we use topic modeling to test this assumption.

**Input**

  To generate the input dataset, refer to this notebook: ws2_snr_NOTAMs_1_data_preparation
  
  Preprocessed airport dataset
  
  - valid_airport_notams_xx.csv

**Output**

Dataset with identified topics
  
  - valid_airport_notams_with_topics_xx.csv
  
Visualisation of topic modeling results
  
  
  - covid_airport_notams_lda.html

where 'xx' corresponds to the date (for example, 20200703 in the file paths used below)


  
The following steps are carried out:

1. Import the preprocessed data
2. Filter out NOTAMs related to service hours
3. Train the LDA model and compute the coherence metric
4. Visualize the topics
5. LDA as feature
6. Map manual labels to topics
7. Analyse the results


In [None]:
# install each package only if it cannot be imported
try:
    import spacy
except ImportError:
    !pip install spacy
try:
    import spacy_langdetect
except ImportError:
    !pip install spacy-langdetect
try:
    import flair
except ImportError:
    !pip install flair
try:
    import geonamescache
except ImportError:
    !pip install geonamescache
try:
    import spacy_fastlang
except ImportError:
    !pip install spacy_fastlang
    #!pip install sense2vec==1.0.0a1
try:
    import gensim
except ImportError:
    !pip install gensim
try:
    import wordcloud
except ImportError:
    !pip install wordcloud
try:
    import nltk
except ImportError:
    !pip install nltk

try:
    import pyLDAvis
except ImportError:
    !pip install pyLDAvis

In [None]:
import spacy

from collections import Counter, defaultdict

import pandas as pd
import os
import csv
import itertools
import re
import json
import numpy as np
import matplotlib.pyplot as plt
import datetime
import string

from spacy_langdetect import LanguageDetector
import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('words')

from wordcloud import WordCloud
from spacy import displacy
import seaborn as sbs
import geonamescache
import ast

#!pip install gensim
import gensim
from gensim import corpora
from gensim.models import CoherenceModel

#!pip install pyLDAvis
# note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

plt.style.use('fivethirtyeight')
%matplotlib inline

**1. Import the preprocessed data**

In [None]:
apt_df_ = pd.read_csv("/project_data/data_asset/ws2/notams/valid_airport_notams_20200703.csv")

In [None]:
apt_df_.head()

**2. Filter out NOTAMs related to service hours**

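We remove NOTAMs whose Qcode ends with "AH" (service hours) and NOTAMs with the Qcode "FFCG" (fire fighting and rescue information), so that they do not take part in the topic model.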

In [None]:
apt_lda_df = apt_df_[~((apt_df_.Qcode.str.endswith("AH")) | (apt_df_.Qcode == "FFCG"))]
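
The filter above can be tried out on a small frame first. The sketch below is only an illustration: the Qcode values are hypothetical and not taken from the real dataset, and the use of `na=False` is an assumption about how missing Qcodes should be treated (here they are kept).

```python
import pandas as pd

# hypothetical Qcode values, for illustration only
demo = pd.DataFrame({"Qcode": ["FAAH", "FFCG", "FALC", None, "MRLC"]})

# na=False makes a missing Qcode evaluate to False instead of NaN
is_service_hours = demo["Qcode"].str.endswith("AH", na=False)
is_fire_rescue = demo["Qcode"].eq("FFCG")

# keep everything that is neither service hours nor fire fighting and rescue
kept = demo[~(is_service_hours | is_fire_rescue)].reset_index(drop=True)
print(kept)
```

Without `na=False`, a missing Qcode produces a missing value in the mask, which can make the boolean negation fail or behave unexpectedly.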

In [None]:
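# reassign instead of using inplace=True, which can trigger a SettingWithCopyWarning on a filtered frame
apt_lda_df = apt_lda_df.reset_index(drop=True)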

**3. Train the LDA model and compute the coherence metric**

In [None]:
words = []
for text in apt_lda_df['tokens']:
    words.append(ast.literal_eval(text))
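
The `tokens` column is read from CSV as the string form of a Python list, which is why every value goes through `ast.literal_eval`. A minimal sketch of a more defensive parser follows; the helper name `parse_tokens` and the sample cell values are hypothetical. It returns an empty list for malformed cells rather than dropping them, so the number of documents stays equal to the number of rows.

```python
import ast

def parse_tokens(cell):
    """Parse a stringified token list; return [] for malformed or missing cells."""
    try:
        tokens = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    # anything that is not a list (e.g. a bare number) is treated as empty
    return [str(t) for t in tokens] if isinstance(tokens, list) else []

# hypothetical cell values, for illustration only
raw_cells = ["['runway', 'closed']", "['covid', 'quarantine', 'passenger']", "not a list"]
parsed = [parse_tokens(c) for c in raw_cells]
print(parsed)
```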

In [None]:
len(words)

In [None]:
# create the term dictionary of the corpus
dictionary = corpora.Dictionary(words)

# drop rare and very common tokens: those in fewer than no_below documents or in more than a no_above fraction of all documents
dictionary.filter_extremes(no_below=10, no_above=0.9) 
dictionary.compactify()

# convert list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(word) for word in words]
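
To make the two steps above concrete, here is a pure-Python sketch of document-frequency filtering followed by a bag-of-words count. It approximates what `filter_extremes` and `doc2bow` do and is not gensim's implementation: the function names and the toy documents are hypothetical, and it keeps token strings where gensim would use integer token ids.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # set(doc) so that each token counts once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

def to_bag_of_words(doc, vocabulary):
    """Count the in-vocabulary tokens of one document, sorted by token."""
    counts = Counter(tok for tok in doc if tok in vocabulary)
    return sorted(counts.items())

# toy documents, for illustration only
docs = [
    ["runway", "closed", "covid"],
    ["runway", "closed", "maintenance"],
    ["covid", "quarantine", "runway"],
    ["tower", "frequency", "runway"],
]
vocab = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
bows = [to_bag_of_words(d, vocab) for d in docs]
print(sorted(vocab))
print(bows)
```

In this toy corpus "runway" appears in every document and is dropped as too common, while tokens that appear only once are dropped as too rare. The last document ends up with an empty bag of words, which is worth keeping in mind when a document later needs a topic assigned.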

In [None]:
# train LDA, computing the coherence score for a range of topics
coherence_scores = []

for num_topics in range(2, 12, 1):
    
    print(f"Number of topics: {num_topics}")
    
    # create the object for LDA model using gensim library
    Lda = gensim.models.ldamulticore.LdaMulticore

    # run and train LDA model on the document term matrix.
    ldamodel = Lda(doc_term_matrix, 
                   num_topics=num_topics, 
                   id2word = dictionary, 
                   passes=20, 
                   chunksize = 2000, 
                   random_state=42,
                   workers=6)
    
    # compute the coherence score
    coherence_model = CoherenceModel(model=ldamodel, 
                                     texts=words, 
                                     dictionary=dictionary, 
                                     coherence='c_v')

    coherence_lda = coherence_model.get_coherence()
    
    coherence_scores.append((num_topics, coherence_lda))

coherence_scores = [*zip(*coherence_scores)]

In [None]:
# plot the coherence score for topics
plt.plot(coherence_scores[0], coherence_scores[1], marker='o')
plt.title('Coherence Score for Topics')
plt.show()

The plot above shows the coherence score peaking at 4 topics, so we set the number of topics for airport NOTAMs to 4.
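
The same choice can be read off programmatically instead of from the plot. The sketch below uses hypothetical score values (they are not the results of this notebook) arranged in the same two-tuple layout that `coherence_scores` has after the `zip` step.

```python
# hypothetical (num_topics, coherence) pairs, for illustration only
scores = [(2, 0.41), (3, 0.47), (4, 0.55), (5, 0.52), (6, 0.50)]

# same layout as coherence_scores after [*zip(*coherence_scores)]
topic_counts, coherences = zip(*scores)

# index of the highest coherence value, then the matching number of topics
best_index = max(range(len(coherences)), key=coherences.__getitem__)
best_num_topics = topic_counts[best_index]
print(best_num_topics)
```

A highest score is only a starting point: when neighbouring values are close, it can still be worth inspecting the topics themselves before fixing the number.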

In [None]:
# set the number of topics where coherence score is the highest
num_topics = 4

# run and train LDA model on the document term matrix.
Lda = gensim.models.ldamulticore.LdaMulticore

ldamodel = Lda(doc_term_matrix, 
               num_topics=num_topics, 
               id2word=dictionary, 
               passes=20, 
               chunksize=10000, 
               random_state=42,
               workers=6)

In [None]:
# view the topics with their most important words and their proportions
ldamodel.print_topics(num_topics=num_topics, num_words=10)

**4. Visualize the topics**

In [None]:
# visualize the interactive LDA plot
lda_display = pyLDAvis.gensim.prepare(ldamodel, 
                                      doc_term_matrix, 
                                      dictionary, 
                                      sort_topics=False)
pyLDAvis.display(lda_display)

In [None]:
# save the plot in html format
pyLDAvis.save_html(lda_display, "/project_data/data_asset/ws2/notams/covid_airport_notams_lda.html")

**5. LDA as feature**

In [None]:
# user inputs
corpus = doc_term_matrix
texts = apt_lda_df
df = apt_lda_df

In [None]:
len(apt_lda_df),len(words)

In [None]:
# function to get dominant topic, percentage of contribution, and keywords for each document
def format_topics_sentences(ldamodel, corpus):
    """Return the dominant topic, its weight and its keywords for each document."""

    results = []
    
    # ldamodel[corpus] yields one list of (topic_id, probability) pairs per document
    for row in ldamodel[corpus]:
        
        if len(row) == 0:
            # keep a placeholder row so the output stays aligned with the input documents
            results.append((None, None, None))
            continue
            
        # the dominant topic is the pair with the highest probability
        topic_num, prop_topic = max(row, key=lambda elem: elem[1])
        
        # get the keywords of the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        results.append((topic_num, round(prop_topic, 4), [topic_keywords]))
    
    df = pd.DataFrame.from_records(results, columns=['dominant_topic', 'weight', 'keywords'])
    
    return df

In [None]:
df_topics = format_topics_sentences(ldamodel, corpus)
df_topics.head()
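
The core of this step is picking, for every document, the topic with the highest probability. A minimal stand-alone sketch is shown below; the helper name `dominant_topic` and the per-document distributions are hypothetical, written in the `(topic_id, probability)` form that the model returns.

```python
def dominant_topic(topic_distribution):
    """Return (topic_id, weight) of the heaviest topic, or (None, None) if empty."""
    if not topic_distribution:
        return (None, None)
    topic_id, weight = max(topic_distribution, key=lambda pair: pair[1])
    return (topic_id, round(weight, 4))

# hypothetical per-document topic distributions, for illustration only
doc_topics = [
    [(0, 0.10), (1, 0.72), (3, 0.18)],
    [(2, 0.55), (3, 0.45)],
    [],
]
dominant = [dominant_topic(d) for d in doc_topics]
print(dominant)
```

Returning a placeholder for an empty distribution keeps one result per document, which matters when the results are joined back to the main dataset by position.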

In [None]:
len(df_topics)

In [None]:
# concatenate with the main dataset
apt_lda_df_ = pd.concat([apt_lda_df, df_topics.reindex(apt_lda_df.index)], axis=1)

**6. Map manual labels to topics**

In [None]:
# Define the topic labels for all the topics identified.
 
topics_dict = [[0, 'label_1'],
               [1, 'label_2'], 
               [2, 'label_3'], 
               [3, 'label_4']]

labels = pd.DataFrame(topics_dict, columns =['topic_num', 'topic_label'])

# merge with the main dataset
apt_lda_df_ = pd.merge(apt_lda_df_, labels, how='left', left_on = 'dominant_topic', right_on='topic_num')
apt_lda_df_.drop("topic_num", axis=1, inplace=True)
apt_lda_df_.head()
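
For a small, fixed set of labels, a dictionary lookup is an alternative to the merge above. The sketch below assumes the same placeholder labels and uses a hypothetical `dominant_topic` column; `Series.map` leaves rows with an unknown topic as missing values, much like a left merge.

```python
import pandas as pd

# same placeholder labels as above, keyed by topic number
topic_labels = {0: "label_1", 1: "label_2", 2: "label_3", 3: "label_4"}

# hypothetical dominant topics, for illustration only
demo = pd.DataFrame({"dominant_topic": [2, 0, 3, 1, 2]})

# look up the label of each row's dominant topic
demo["topic_label"] = demo["dominant_topic"].map(topic_labels)
print(demo)
```

This avoids creating and then dropping the helper column `topic_num`, and it preserves the row order and index of the frame.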

In [None]:
apt_lda_df_.to_csv("/project_data/data_asset/ws2/notams/valid_airport_notams_with_topics_20200703.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

**7. Analyse the results**

In [None]:
for l in ["label_1","label_2","label_3","label_4"]:
    d_ = apt_lda_df_[apt_lda_df_.topic_label==l]
    print(l)
    print(len(d_))
    print(d_.Qcode.value_counts()[:5])
    #print(d_.countryName.unique())
    print(d_.keywords.values[0])
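
One way to look at how far Qcodes and topics agree is a cross-tabulation of the two columns. The sketch below is an illustration under assumptions: the Qcode values and label assignments are hypothetical, and only the column names `Qcode` and `topic_label` follow the dataset used here.

```python
import pandas as pd

# hypothetical Qcodes and topic labels, for illustration only
demo = pd.DataFrame({
    "Qcode": ["FALC", "FALC", "MRLC", "FALT", "MRLC", "FALC"],
    "topic_label": ["label_1", "label_1", "label_2", "label_3", "label_2", "label_3"],
})

# share of each Qcode's NOTAMs that falls into each topic (every row sums to 1)
share = pd.crosstab(demo["Qcode"], demo["topic_label"], normalize="index")
print(share.round(2))
```

A Qcode whose row is concentrated in a single topic supports the assumption that Qcodes and topics describe the same thing, while a row spread evenly across topics does not.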

In [None]:
for label_ in ['label_1','label_2','label_3','label_4']:
    all_t = []
    for t in apt_lda_df_[apt_lda_df_.topic_label==label_]['tokens']:
        for t_ in ast.literal_eval(t):
            all_t.append(t_)

    # build the word cloud from all tokens of the current label
    wc = WordCloud(background_color="white", max_words=200, random_state=1, collocations=False).generate(' '.join(all_t))
    plt.figure(figsize=(15,5))
    plt.title("word cloud for - {}".format(label_))
    plt.grid(False)  # hide the grid lines added by the plot style
    plt.imshow(wc)
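
A word cloud is easy to scan but hard to quote, so a plain frequency count per label can serve as a textual complement. The sketch below is an illustration with hypothetical tokens; in the notebook the token lists would come from the parsed `tokens` column of each label.

```python
from collections import Counter

# hypothetical token lists per label, for illustration only
tokens_by_label = {
    "label_1": [["covid", "quarantine"], ["covid", "passenger"]],
    "label_4": [["tower", "frequency", "mhz"], ["tower", "control"]],
}

# the two most frequent tokens of every label
top_tokens = {
    label: Counter(tok for doc in docs for tok in doc).most_common(2)
    for label, docs in tokens_by_label.items()
}
print(top_tokens)
```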

**Observations:**

Based on the pyLDAvis plot and the word clouds of the individual topics, we note the following:

1. Label 3 contains information related to cargo

2. Labels 1 and 3 contain all the information related to quarantine

3. Label 2 contains information related to visual flight rules

4. Label 4 relates mostly to the control tower: its terms include tower, control, frequency and mhz (a unit of frequency)


For commercial passenger flights, the topics with labels 1 and 3 are the ones to consider for further analysis

**Author**

* Shri Nishanth Rajendran - AI Development Specialist, R² Data Labs, Rolls Royce

The topic modelling work is based on the analysis done below:
https://github.com/emergent-analytics/workstreams/blob/master/ws2/news-analysis/%20ws2_2_topic_modelling.ipynb 