# Topic Modelling

#### Evan Yathon

This notebook is intended to be run with papermill from the project root.

The purpose of this notebook is to process each review's title and content to extract its key phrases. These key phrases will then be used in a regression analysis to find out what matters most to a reviewer when recommending, or not recommending, the airline.

Topic modelling with latent Dirichlet allocation (LDA) is the tool of choice.

Usage:

`papermill src/ipynbs/topic_modelling.ipynb -p load_path data/cleaned_gw_reviews.csv`

In [1]:
#parameters section for Papermill

load_path = "../../data/cleaned_gw_reviews.csv"

In [2]:
# loading packages

# utils
import pandas as pd
import re
import numpy as np

# NLP
import spacy
import gensim 
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim import models
import pyLDAvis.gensim

# display preference for ipynbs
%matplotlib inline

In [3]:
# ensure the dates are parsed correctly with parse_dates argument
reviews = pd.read_csv(load_path, parse_dates = ["date_of_review", "date_flown"])

In [4]:
reviews.head()

| | title | review_value | n_user_reviews | reviewer_name | reviewer_country | date_of_review | review_text | aircraft | traveller_type | seat_type | route | date_flown | seat_comfort_rating | cabin_staff_service_rating | food_and_beverages_rating | inflight_entertainment_rating | ground_service_rating | value_for_money_rating | recommendation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "Seat was fine with enough legroom" | 7.0 | 8 reviews | Sander van Kan | Netherlands | 2019-07-01 | ✅ Trip Verified \| Dusseldorf to Berlin. Eurowi... | A319 | Couple Leisure | Economy Class | Dusseldorf to Berlin | 2019-06-01 | 4.0 | 3.0 | | 1.0 | 3.0 | 5.0 | yes |
| 1 | "crew were smiling and good" | 6.0 | 8 reviews | Sander van Kan | Netherlands | 2019-07-01 | ✅ Trip Verified \| Berlin to Dusseldorf. Eurowi... | A319 | Couple Leisure | Economy Class | Berlin to Dusseldorf | 2019-06-01 | 3.0 | 3.0 | | 1.0 | 3.0 | 5.0 | yes |
| 2 | "only two agents available" | 1.0 | 6 reviews | Andrew Maynard | United Kingdom | 2017-01-04 | Check in process at Cologne very poor. Flight ... | | Couple Leisure | Economy Class | CGN to MAN | 2017-01-01 | 2.0 | 2.0 | | | 1.0 | 2.0 | no |
| 3 | "good flight and friendly staff" | 7.0 | 1 reviews | T Steen | Netherlands | 2016-09-13 | ✅ Verified Review \| Amsterdam to Stuttgart. G... | | Business | Economy Class | AMS to STR | 2016-09-01 | 5.0 | 5.0 | 1.0 | | 5.0 | 5.0 | yes |
| 4 | "never been treated as badly" | 1.0 | | Karen Kirner | Austria | 2016-08-16 | ✅ Verified Review \| I have been a frequent tr... | | Business | Economy Class | DUS to VIE | 2016-08-01 | 1.0 | 1.0 | | | 3.0 | 1.0 | no |


### Topic Modelling - Preprocessing

Prior to LDA, preprocessing of the text is necessary.  We need to:
- Tokenize each document (review)
- Enforce a minimum token length
- Remove stopwords
- Lemmatize
- Consider only specific parts of speech (nouns, verbs etc.)

Use [`spaCy`](https://spacy.io/) for this.
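
Before bringing in spaCy, the first three steps in the list can be sketched with the standard library alone. This is only an illustration: `simple_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for spaCy's much larger list, and lemmatization and part-of-speech filtering are left out because they need a language model.

```python
import re

# Tiny hypothetical stopword set for illustration only;
# spaCy ships a far larger list.
TOY_STOPWORDS = {"was", "were", "with", "enough", "and", "the", "a", "to"}


def simple_preprocess(text, min_token_len=3, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize on runs of letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_token_len]


title_tokens = simple_preprocess('"Seat was fine with enough legroom"')
print(title_tokens)  # ['seat', 'fine', 'legroom']
```

Even this rough version shows why each step matters: without the stopword filter, words such as "was" and "with" would crowd out the informative tokens.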

In [5]:
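# load the English spaCy model
# note: 'en' is a shortcut link that only works in spaCy 2.x; spaCy 3+
# needs the full package name, e.g. spacy.load('en_core_web_sm')
eng_nlp = spacy.load('en')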

In [6]:
# create a function to preprocess the reviews and review titles

def preprocess_text(text, min_token_len = 2, relevant_pos = ["NOUN", "VERB", "ADJ"]):
    """
    Given text, min_token_len, and relevant_pos carry out preprocessing of the text 
    and return a preprocessed string. 
    
    Keyword arguments:
    text (str): the text to be preprocessed
    min_token_len (int): minimum number of characters a token needs to be kept
    relevant_pos (list): a list of relevant pos tags
    
    Returns: (str) the preprocessed text
    """
    # use several regex expressions to preprocess strange characters
    # or unwanted text
    
    # remove verified emoji
    text = re.sub(r"✅", "", text)
    
    # remove verified review and trip verified
    text = re.sub(r"Verified Review|Trip Verified", "", text)
    
    # replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r"[^\w]", " ", text)
    
    # replace multiple spaces with a single space
    text = re.sub(r"\s+", " ", text)
    
    # remove numbers
    text = re.sub(r"[0-9]+", "", text)
    
    # change all text to lowercase
    text = text.lower()
    
    # remove terms that dominate the topic modelling output,
    # e.g. germanwings, flight
    # these words appear in most reviews, so they are not indicative of
    # what influenced the recommendation
    # note: the pattern matches substrings and runs before lemmatization,
    # so "germanwings" leaves a stray "s" (dropped later by the length
    # filter) and inflected forms such as "flew" are not removed
    terms = ["flight", "germanwing", "fly", "review"]
    text = re.sub("|".join(terms)," ", text)
    
    # utilize spaCy to tokenize and break text into lemmas/POS:
    # - remove stopwords
    # - reduce words to their lemmas
    # - only keep relevant parts of speech
    # - remove short words less than min_token_len
    
    # run the spaCy pipeline to tokenize, tag, and lemmatize the text
    doc = eng_nlp(text)
    
    processed_text = ""
    
    for token in doc:
        if not token.is_stop and len(token.text) >= min_token_len and token.pos_ in relevant_pos:
            processed_text += " " + token.lemma_
    
    return processed_text
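
The term-removal step in `preprocess_text` joins the terms into a plain alternation, so it also matches inside longer words. The sketch below compares that behaviour with a whole-word pattern on a made-up sentence. The names `whole_word_pattern` and `sample` are hypothetical, and the whole-word variant is only one possible alternative (filtering on `token.lemma_` inside the spaCy loop would be another), not what the notebook actually runs.

```python
import re

terms = ["flight", "germanwing", "fly", "review"]

# Substring pattern, built the same way as in preprocess_text.
substring_pattern = re.compile("|".join(terms))

# Hypothetical alternative: match whole words only, allowing a plural "s".
whole_word_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")s?\b"
)

# Made-up sentence, already lowercased like the text at that step.
sample = "germanwings flights were fine but the butterfly logo is odd"

substring_result = substring_pattern.sub(" ", sample)
whole_word_result = whole_word_pattern.sub(" ", sample)

print(substring_result.split())
print(whole_word_result.split())
```

With the substring pattern, "butterfly" is cut down to "butter" and the plurals leave stray "s" tokens behind. The whole-word pattern removes only the listed terms and their plurals.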
    

Use the preprocessing function to process all review texts and review titles.

A good sanity check is whether the topics found in the review texts are broadly consistent with the topics found in the titles.

In [7]:
reviews["clean_review_text"] = reviews["review_text"].apply(preprocess_text, min_token_len = 3)
reviews["clean_title"] = reviews["title"].apply(preprocess_text, min_token_len = 3)

In [8]:
reviews[["clean_review_text", "review_text", "clean_title", "title"]].head(10)

| | clean_review_text | review_text | clean_title | title |
|---|---|---|---|---|
| 0 | dusseldorf berlin eurowing operate slight del... | ✅ Trip Verified \| Dusseldorf to Berlin. Eurowi... | seat fine legroom | "Seat was fine with enough legroom" |
| 1 | berlin dusseldorf eurowing operate different ... | ✅ Trip Verified \| Berlin to Dusseldorf. Eurowi... | crew smile good | "crew were smiling and good" |
| 2 | check process cologne poor display delay minu... | Check in process at Cologne very poor. Flight ... | agent available | "only two agents available" |
| 3 | amsterdam stuttgart good friendly staff airpl... | ✅ Verified Review \| Amsterdam to Stuttgart. G... | good friendly staff | "good flight and friendly staff" |
| 4 | frequent traveler year fly distance cabin cla... | ✅ Verified Review \| I have been a frequent tr... | treat | "never been treated as badly" |
| 5 | fly london stanste cologne bonn delay hour lo... | ✅ Verified Review \| Flew Germanwings from Lon... | cramp | "very cramped" |
| 6 | boyfriend book basic fare hamburg rome fiumic... | My boyfriend and I booked the basic fare from ... | staff friendly | "staff were friendly" |
| 7 | operation transfer take eurowing read eurowin... | Germanwings operations have now been transferr... | refer eurowing | PLEASE REFER TO EUROWINGS |
| 8 | operate barcelona berlin tegel fight time qui... | Lufthansa operated by Germanwings from Barcelo... | great value money | "great value for money" |
| 9 | bangkok london cologne imagine ryanair hour f... | Germanwings from Bangkok to London via Cologne... | pay little money | "pay very little money" |


The output looks mostly good, but in the first review's title the word 'enough' has been dropped, because 'enough' is a stopword. The word could be valuable in context, so this is worth revisiting, but I'll leave it as is for now.

### Topic Modelling: Dictionary and Document-Term Co-occurrence Matrix

In [9]:
# split each cleaned review and cleaned title into a list of tokens;
# corpora.Dictionary expects a list of token lists
review_corpus = [doc.split() for doc in reviews["clean_review_text"].tolist()]
title_corpus = [doc.split() for doc in reviews["clean_title"].tolist()]

In [10]:
# use gensim dictionary to map between words and integer ids
dct_review = corpora.Dictionary(review_corpus)
dct_title = corpora.Dictionary(title_corpus)

In [11]:
# create the document-term matrices: each document becomes a bag of words,
# i.e. a list of (token_id, count) pairs
doc_term_mat_review = [dct_review.doc2bow(doc) for doc in review_corpus]
doc_term_mat_title = [dct_title.doc2bow(doc) for doc in title_corpus]
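
To make the two steps above concrete, here is a plain-Python sketch of what a token-to-id dictionary and a bag-of-words conversion produce. It only mimics the idea: `build_vocab` and `to_bow` are hypothetical helpers, not gensim's implementation, and gensim may assign different integer ids to the same tokens.

```python
from collections import Counter


def build_vocab(tokenised_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenised_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Convert one tokenised document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy corpus; the second document repeats "crew" to show a count above one.
toy_corpus = [["seat", "fine", "legroom"], ["crew", "smile", "good", "crew"]]
toy_vocab = build_vocab(toy_corpus)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_corpus]

print(toy_vocab)
print(toy_bow)  # [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1), (5, 1)]]
```

Word order is discarded in this representation; only which tokens occur, and how often, is passed on to the LDA model.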

### Topic Modelling:  LDA Topic Model

The number of topics was tuned with the visualizations below, by examining the intertopic distance map. A value was settled on once the topics no longer overlapped and were separated by a fair amount.

In [12]:
# create the LDA models
# use alpha < 1, a sparse Dirichlet prior under which each document is
# dominated by only a few topics
# do this so that reviews differ in their topic mix, giving variation
# that we can use to figure out why the recommendation was a yes or no

lda_review = models.LdaModel(corpus = doc_term_mat_review, id2word = dct_review, num_topics = 4, alpha = 0.5, passes = 5)
lda_title = models.LdaModel(corpus = doc_term_mat_title, id2word = dct_title, num_topics = 3, alpha = 0.01, passes = 5)
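
The effect of `alpha` can be illustrated by sampling topic mixtures directly from a symmetric Dirichlet prior. The sketch below uses only the standard library and looks at the prior alone, not at gensim's inference. The helper names and the values 0.1 and 10 are assumptions chosen for contrast rather than the settings used above.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
        total = sum(draws)
        if total > 0:  # guard against floating-point underflow for small alpha
            return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=4, n_samples=2000, seed=0):
    """Average weight of the dominant topic across sampled mixtures."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(n_samples))
    return sum(largest) / n_samples


sparse_share = mean_largest_share(alpha=0.1)
dense_share = mean_largest_share(alpha=10.0)
print(f"alpha=0.1: dominant topic share ~ {sparse_share:.2f}")
print(f"alpha=10:  dominant topic share ~ {dense_share:.2f}")
```

Under the smaller `alpha`, the dominant topic takes most of each mixture. Under the larger one, the weight is spread far more evenly across the four topics.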

In [13]:
# view the topics and the words that describe them
lda_review.print_topics()

[(0,
  '0.019*"good" + 0.015*"crew" + 0.015*"seat" + 0.014*"time" + 0.012*"delay" + 0.011*"staff" + 0.011*"cabin" + 0.010*"cologne" + 0.010*"passenger" + 0.009*"minute"'),
 (1,
  '0.014*"time" + 0.013*"service" + 0.012*"seat" + 0.011*"cologne" + 0.011*"delay" + 0.010*"check" + 0.009*"return" + 0.009*"offer" + 0.009*"price" + 0.009*"hour"'),
 (2,
  '0.014*"time" + 0.013*"check" + 0.013*"good" + 0.013*"seat" + 0.012*"airline" + 0.012*"luggage" + 0.009*"pay" + 0.008*"passenger" + 0.008*"charge" + 0.008*"return"'),
 (3,
  '0.016*"seat" + 0.010*"check" + 0.010*"board" + 0.009*"ticket" + 0.009*"good" + 0.008*"return" + 0.008*"hour" + 0.008*"hamburg" + 0.007*"take" + 0.007*"minute"')]

In [14]:
# view the topics and the words that describe them
lda_title.print_topics()

[(0,
  '0.149*"customer" + 0.131*"staff" + 0.099*"friendly" + 0.048*"delay" + 0.028*"problem" + 0.028*"heathrow" + 0.028*"terrible" + 0.028*"food" + 0.028*"airline" + 0.028*"info"'),
 (1,
  '0.775*"customer" + 0.009*"eurowing" + 0.009*"vow" + 0.009*"book" + 0.009*"water" + 0.009*"refuse" + 0.009*"legroom" + 0.009*"refer" + 0.009*"fine" + 0.009*"agent"'),
 (2,
  '0.111*"crew" + 0.111*"friendly" + 0.091*"cabin" + 0.066*"money" + 0.066*"professional" + 0.038*"great" + 0.038*"pay" + 0.038*"value" + 0.038*"little" + 0.038*"service"')]

### Topic Modelling: Visualization

In [15]:
# view the reviews 
pyLDAvis.enable_notebook()
vis_reviews = pyLDAvis.gensim.prepare(lda_review, doc_term_mat_review, dct_review, sort_topics=False)
vis_reviews


In [16]:
# view the titles
vis_titles = pyLDAvis.gensim.prepare(lda_title, doc_term_mat_title, dct_title, sort_topics=False)
vis_titles


### Topic Modelling: Assigning each Review a Topic

Create a helper function to assign topic probabilities to the `reviews` dataframe.

In [20]:

def get_correct_topics(lda_doc_topics, num_topics):
    """
    get_correct_topics places the sparse per-document topic probabilities
    from an LDA model into their correct topic columns for use in a
    pandas dataframe
    
    Args:
        lda_doc_topics (TransformedCorpus): Document topic probabilities obtained through
                                            the .get_document_topics method of an LDA obj
        num_topics (int) : Number of topics in lda_doc_topics
    Returns:
        pandas dataframe with one row per document and one column per topic
    """
    # placeholder array with one row per document in the input
    # corpus and one column per topic
    # initialize with zeros so that topics missing from a document's
    # output keep a probability of zero
    
    topic_prob_array = np.zeros((len(lda_doc_topics), num_topics))

    for row_num, row in enumerate(lda_doc_topics):

        # assign each probability to the column of its topic
        # topic[0] is the topic number
        # topic[1] is the corresponding probability
        for topic in row:
            topic_prob_array[row_num, topic[0]] = topic[1]

    topic_prob_df = pd.DataFrame(topic_prob_array)
    
    return topic_prob_df
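
The reason the function is needed is that `.get_document_topics` returns, for each document, only the `(topic_id, probability)` pairs above its `minimum_probability` threshold, so rows can differ in length. The dependency-free sketch below shows the same sparse-to-dense step on made-up input. `to_dense_rows` and `toy_doc_topics` are hypothetical names, and the probabilities are invented for illustration.

```python
def to_dense_rows(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into fixed-width rows."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics  # topics missing from the output stay at zero
        for topic_id, prob in pairs:
            row[topic_id] = prob
        rows.append(row)
    return rows


# Invented example: the first document omits topic 1, the second reports only topic 1.
toy_doc_topics = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
]
dense = to_dense_rows(toy_doc_topics, num_topics=3)
print(dense)  # [[0.9, 0.0, 0.1], [0.0, 1.0, 0.0]]
```

Indexing by `topic_id` rather than by position is what keeps each probability in the right column when a topic is missing from a document's output.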

Build the topic-probability dataframes for both the reviews and the titles, then concatenate them with `reviews`.

In [21]:
review_prob_df = get_correct_topics(lda_review.get_document_topics(doc_term_mat_review), 4)
title_prob_df = get_correct_topics(lda_title.get_document_topics(doc_term_mat_title), 3)