# Train travel comments dataset: initial exploration

The objective of this task is to extract the topics that reviewers discuss and to provide insights that help the client calibrate its internal strategy for improving customer experience. Use whichever tools you feel are appropriate for the task, given the time available.

In [49]:
import json
import re
import string
from typing import Any, Dict, List, Tuple

import pandas as pd
import numpy as np

import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

DATA_PATH = '/Users/kremerr/Documents/GitHub/OkraAI-interview/data/train_reviews.json'

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kremerr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kremerr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kremerr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Reviewing the dataset with Pandas

In [4]:
df = pd.read_json(DATA_PATH)
df.head()

| | date | title | text | url | stars |
|---|---|---|---|---|---|
| 0 | 2015-10-10 14:32:51+00:00 | Bad customer service. Staff are very impolite ... | Used national rail twice and on both occasions... | https://uk.trustpilot.com/review/www.nationalr... | star-rating star-rating-1 star-rating--medium |
| 1 | 2015-09-22 17:04:56+00:00 | Pretty awful service | I phoned National rail to find out why several... | https://uk.trustpilot.com/review/www.nationalr... | star-rating star-rating-1 star-rating--medium |
| 2 | 2015-03-13 23:37:05+00:00 | Awful staff | I travel from Brokenhast to southampton centra... | https://uk.trustpilot.com/review/www.nationalr... | star-rating star-rating-1 star-rating--medium |
| 3 | 2015-01-13 12:26:52+00:00 | Very good | Saved about £50 on a single trip using Nationa... | https://uk.trustpilot.com/review/www.nationalr... | star-rating star-rating-4 star-rating--medium |
| 4 | 2011-05-28 15:00:36+00:00 | Check it out. | I always use the national rail enquiry site wh... | https://uk.trustpilot.com/review/www.nationalr... | star-rating star-rating-4 star-rating--medium |


In [41]:
print(df['text'][0])

Used national rail twice and on both occasions I found the staff unpleasant, unfriendly and incompetent. Wasn't helpful in regards to platform information and general customer service skills.


In [5]:
print(df['date'].min())
print(df['date'].max())

2011-05-28 15:00:36+00:00
2018-07-17 13:59:52+00:00


In [32]:
df['stars'].unique()

array(['star-rating star-rating-1 star-rating--medium',
       'star-rating star-rating-4 star-rating--medium',
       'star-rating star-rating-5 star-rating--medium',
       'star-rating star-rating-2 star-rating--medium',
       'star-rating star-rating-3 star-rating--medium'], dtype=object)
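
The `stars` column holds the raw CSS class string from the scraped page rather than a number. As a minimal sketch, assuming every value follows the `star-rating-N` pattern seen above, the numeric rating could be pulled out with a regular expression. The helper name `parse_star_rating` is hypothetical and not part of the original notebook.

```python
import re
from typing import Optional

# Matches the "star-rating-N" token (single hyphen before the digit);
# "star-rating--medium" is skipped because of its double hyphen.
_STAR_PATTERN = re.compile(r'star-rating-(\d)\b')

def parse_star_rating(css_class: str) -> Optional[int]:
    """Return the rating embedded in a class string, or None if absent."""
    match = _STAR_PATTERN.search(css_class)
    return int(match.group(1)) if match else None

example = 'star-rating star-rating-1 star-rating--medium'
rating = parse_star_rating(example)
print(rating)  # 1
```

With the DataFrame loaded above, something like `df['stars'].map(parse_star_rating)` would give a numeric column, which could later be used to compare topics between low and high ratings.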

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2021 entries, 0 to 2020
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   date    2021 non-null   datetime64[ns, UTC]
 1   title   2021 non-null   object             
 2   text    2021 non-null   object             
 3   url     2021 non-null   object             
 4   stars   2021 non-null   object             
dtypes: datetime64[ns, UTC](1), object(4)
memory usage: 79.1+ KB


## Latent Dirichlet Allocation (LDA)

For topic extraction, we use a topic-modelling technique called Latent Dirichlet Allocation (LDA).
The gensim implementation of LDA takes bag-of-words vectors as input, so we first need to perform some preprocessing steps, namely:
- Text Cleaning
  - lowercasing
  - removing punctuation
  - removing stopwords (optional)
  - lemmatization (optional)
- Tokenization
- Vectorization
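
The steps above can be sketched on a single review using only the standard library. This is an illustration under stated assumptions, not the notebook's pipeline: the tiny `STOP_WORDS` set and the helper names `clean_and_tokenize` and `to_bag_of_words` are hypothetical stand-ins for NLTK's stopword list and gensim's `Dictionary.doc2bow`, and lemmatization is left out.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny stopword list for illustration only.
STOP_WORDS = {'the', 'and', 'on', 'i', 'a', 'was'}

def clean_and_tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()                      # lowercasing
    text = re.sub(r'[^\w\s]', '', text)      # removing punctuation
    tokens = text.split()                    # tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]

def to_bag_of_words(tokens: List[str]) -> Dict[str, int]:
    """Vectorize a token list as a word -> count mapping."""
    return dict(Counter(tokens))

tokens = clean_and_tokenize('The staff was unfriendly, and I found the staff unhelpful.')
bow = to_bag_of_words(tokens)
print(tokens)  # ['staff', 'unfriendly', 'found', 'staff', 'unhelpful']
print(bow)     # {'staff': 2, 'unfriendly': 1, 'found': 1, 'unhelpful': 1}
```

Lowercasing comes first in this sketch so that capitalized stopwords such as "I" and "The" match the lowercase stopword list.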

In [65]:
def load_data(
        file_path: str
    ) -> List[Dict[str, Any]]:
    """
    Load the JSON data from the given file path.
    
    Args:
        file_path (str): The path to the JSON file.
        
    Returns:
        list: A list of dictionaries, where each dictionary represents a customer review.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data

def preprocess_text(
        text: str, 
        lemmatize: bool = True,
        remove_stopwords: bool = True
    ) -> str:
    """
    Preprocess the given text by removing URLs, digits, and punctuation,
    optionally removing stopwords and lemmatizing, and converting to lowercase.
    
    Args:
        text (str): The input text to preprocess.
        lemmatize (bool): Whether to lemmatize each word.
        remove_stopwords (bool): Whether to remove English stopwords.
        
    Returns:
        str: The preprocessed text.
    """
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    if remove_stopwords:
        # Remove stopwords. The match is case-sensitive and runs before
        # lowercasing, so capitalized stopwords (e.g. "I", "If") are kept.
        text = ' '.join([word for word in text.split() if word not in stop_words])
    if lemmatize:
        # Lemmatize (WordNetLemmatizer treats each word as a noun by default)
        text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    # Convert to lowercase
    text = text.lower()
    return text

def create_corpus(
        data: List[Dict[str, Any]]
    ) -> Tuple[corpora.Dictionary, List[List[Tuple[int,int]]]]:
    """
    Create a corpus from the preprocessed data.
    
    Args:
        data (list): A list of dictionaries representing the customer reviews.
        
    Returns:
        gensim.corpora.Dictionary: A dictionary mapping words to word IDs.
        list: A list of bag-of-words vectors representing the corpus.
    """
    texts = [word_tokenize(preprocess_text(review['text'])) for review in data]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    return dictionary, corpus

def train_lda_model(
        corpus: List[List[Tuple[int,int]]], 
        dictionary: corpora.Dictionary, 
        num_topics: int = 5
    ) -> gensim.models.LdaMulticore:
    """
    Train an LDA model on the given corpus.
    
    Args:
        corpus (list): A list of bag-of-words vectors representing the corpus.
        dictionary (gensim.corpora.Dictionary): A dictionary mapping words to word IDs.
        num_topics (int): The number of topics to extract.
        
    Returns:
        gensim.models.LdaMulticore: The trained LDA model.
    """
    lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    return lda_model

def visualize_topics(
        lda_model: gensim.models.LdaMulticore, 
        corpus: List[List[Tuple[int,int]]], 
        dictionary: corpora.Dictionary
    ) -> None:
    """
    Visualize the topics extracted by the LDA model using PyLDAvis.
    
    Args:
        lda_model (gensim.models.LdaMulticore): The trained LDA model.
        corpus (list): A list of bag-of-words vectors representing the corpus.
        dictionary (gensim.corpora.Dictionary): A dictionary mapping words to word IDs.
    """
    pyLDAvis.enable_notebook()
    vis_data = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
    pyLDAvis.show(vis_data, local=False)

In [66]:
data = load_data(DATA_PATH)
dictionary, corpus = create_corpus(data)
lda_model = train_lda_model(corpus, dictionary, num_topics=5)

In [67]:
visualize_topics(lda_model, corpus, dictionary)

Serving to http://127.0.0.1:8888/    [Ctrl-C to exit]


127.0.0.1 - - [12/Mar/2024 16:24:58] "GET / HTTP/1.1" 200 -



stopping Server...


In [75]:
for key in dictionary.keys():
    print(dictionary[key])

customer
found
general
helpful
i
incompetent
information
national
occasion
platform
rail
regard
service
skill
staff
twice
unfriendly
unpleasant
used
wasnt
absolutely
adviser
also
another
ask
attempting
brighton
cancel
cancelled
catch
central
check
desperately
didnt
early
easily
english
ever
find
flight
foot
get
give
go
going
guard
harassed
home
including
journey
load
many
meaning
member
merely
morning
next
none
one
people
phoned
plane
public
reason
run
seem
seems
several
shoulder
show
shrug
southampton
spoke
station
stick
stranded
tell
time
told
tomorrow
train
try
trying
turns
verbally
want
way
website
witnessed
worse
apologise
arrived
asked
awful
birth
brokenhast
buy
cant
communication
criminal
day
detail
even
every
explained
fine
happened
however
human
humans
if
job
like
man
pay
paying
place
problem
really
requires
running
sarcastic
shouldt
take
talking
ticket
travel
treated
unfair
when
without
cheaper
easy
good
inexplicably
its
look
nationalrail
pick
saved
single
specific
surroundin