## TrustPilot reviews

We'll be doing some topic modelling with BERTopic on TrustPilot reviews of three delivery companies (PostNord, FedEx and UPS).

In [2]:
from functools import partial
from string import punctuation

import dacy
import nltk
import pandas as pd
import spacy
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

import fetch
from fetch.utils import json_or_fetch

In [3]:
many_pages = partial(fetch.trustpilot, page_limit=100)

In [1]:
urls = ['https://dk.trustpilot.com/review/www.postnord.dk', 'https://dk.trustpilot.com/review/www.fedex.com', 'https://dk.trustpilot.com/review/www.ups.com']
args = tuple(zip(urls))

In [6]:
reviews_dict = json_or_fetch(many_pages, urls, args, path='Data/trustpilot.json')

https://dk.trustpilot.com/review/www.postnord.dk: 2000 reviews from 100 pages. There are more pages left.
https://dk.trustpilot.com/review/www.fedex.com: 245 reviews from 13 pages.
https://dk.trustpilot.com/review/www.ups.com: 1100 reviews from 55 pages.


In [7]:
reviews = []
for company_reviews in reviews_dict.values():
    reviews.extend(company_reviews)

In [8]:
df = pd.DataFrame(reviews)
df.head()

| | title | body | rating |
|---|---|---|---|
| 0 | Må udtrykke min store skuffelse | Må udtrykke min store skuffelse. Alt i pakken ... | 1 |
| 1 | I behøver altså ikke bede om en… | I behøver altså ikke bede om en bedømmelse næs... | 2 |
| 2 | Jeg bestilte en stor vare og betalte… | Jeg bestilte en stor vare og betalte for hjemm... | 1 |
| 3 | Elendig service | Havde betalt 49kr. for at få min pakke leveret... | 1 |
| 4 | OK.... | Det er efterhånden skide irriterende med al de... | 5 |


In [17]:
df.count()

title     3345
body      2438
rating    3345
dtype: int64

In [16]:
df.isna().sum()

title       0
body      907
rating      0
dtype: int64

In [18]:
df.dropna(inplace=True)

In [19]:
df.count()

title     2438
body      2438
rating    2438
dtype: int64

In [20]:
df.groupby('rating').count()

| rating | title | body |
|---|---|---|
| 1 | 1294 | 1294 |
| 2 | 76 | 76 |
| 3 | 51 | 51 |
| 4 | 105 | 105 |
| 5 | 912 | 912 |


In [21]:
docs = df['body'].tolist()

I'm having trouble with stopwords, even though I'm using the techniques described in the BERTopic documentation for removing them. Removing stopwords during preprocessing isn't recommended, but I thought I'd try it and see if it helped.

In [11]:
# da_stopwords = nltk.corpus.stopwords.words('danish')
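# Explicit import, so this does not rely on spacy.lang.da having been loaded as a side effect of another import
from spacy.lang.da.stop_words import STOP_WORDS as da_stopwords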

In [18]:
type(da_stopwords)

set

In [12]:
count_vectorizer = CountVectorizer(stop_words=list(da_stopwords))

In [8]:
def clean_sentence(sentence):
    sentence = sentence.strip()
    words = nltk.word_tokenize(sentence, language='danish')
    words = [word for word in words if word.lower() not in da_stopwords and word not in punctuation]
    return ' '.join(words)

In [9]:
# docs = [clean_sentence(doc) for doc in docs]

In [13]:
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [22]:
embeddings = embedder.encode(docs, show_progress_bar=True)


In [23]:
embeddings.shape

(2438, 384)

## Dacy embeddings

In [51]:
# nlp = dacy.load('medium', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [53]:
# dacy_docs = list(nlp.pipe(docs))

## Bertopic model

In [25]:
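# Note: this transformer is never passed to BERTopic in the next cell (that would need ctfidf_model=ctfidf_model),
# so it has no effect on the results shown.
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)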

In [26]:
topic_model = BERTopic(language="multilingual", nr_topics=None, min_topic_size=5, vectorizer_model=count_vectorizer,
                       seed_topic_list=[['god', 'godt', 'hurtig', 'hurtigt'],
                                        ['dårlig', 'dårligt', 'dårlige', 'langsom']])

In [27]:
topics, probs = topic_model.fit_transform(docs, embeddings)

In [28]:
topic_model.get_topic_info()[:20]

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 896 | -1_pakken_pakke_ups_hjemme |
| 1 | 0 | 160 | 0_dårlig_firma_service_kundeservice |
| 2 | 1 | 96 | 1_kl_dag_leveret_pakke |
| 3 | 2 | 67 | 2_fedex_pakke_pakken_kundenummer |
| 4 | 3 | 63 | 3_tyskland_danmark_oktober_kolding |
| 5 | 4 | 59 | 4_post_nord_hurtig_postbud |
| 6 | 5 | 58 | 5_adresse_pakken_pakke_leveret |
| 7 | 6 | 57 | 6_kr_betale_told_moms |
| 8 | 7 | 46 | 7_service_fin_hurtig_venlig |
| 9 | 8 | 43 | 8_chaufføren_ups_bilen_15 |


~~At least, now the name isn't all stopwords, but it still only puts it in one outlier topic.~~

I think the actual problem is just having way too few documents for it to extract topics.

Decreasing `min_topic_size` helped.
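To put a number on how much still ends up in the outlier topic, the share of documents labelled `-1` can be computed from the topic assignments. The sketch below is only a rough check: `outlier_share` is a hypothetical helper, and the demo list is rebuilt from the counts shown above (896 outliers out of 2438 documents) with every other label collapsed into a placeholder `0`.

```python
from collections import Counter


def outlier_share(topic_labels):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_labels)
    return counts[-1] / len(topic_labels)


# Rebuilt from the counts above: 896 outliers among 2438 documents.
# The non-outlier labels are collapsed into a placeholder topic 0,
# since only the outlier count matters here.
demo_labels = [-1] * 896 + [0] * (2438 - 896)

share = outlier_share(demo_labels)
print(f"{share:.1%} of documents are outliers")
```

In the notebook itself, `outlier_share(topics)` would give the same figure directly from the list returned by `fit_transform`.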

In [29]:
topic_model.visualize_barchart()

Judging from the keywords alone, the topics don't seem to align well with sentiment.
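One way to check this beyond eyeballing keywords would be to compare the star ratings within each topic. The sketch below uses a small made-up frame; the values are hypothetical and only show the shape of the comparison. In the notebook, the `topic` column would come from the `topics` list and `rating` from `df`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "topic":  [0, 0, 0, 7, 7, 7, -1, -1],
    "rating": [1, 1, 2, 5, 5, 4, 1, 5],
})

# Mean rating and number of reviews per topic.
per_topic = toy.groupby("topic")["rating"].agg(["mean", "count"])

# Full distribution of ratings within each topic.
rating_by_topic = pd.crosstab(toy["topic"], toy["rating"])

print(per_topic)
print(rating_by_topic)
```

On the real data, something like `df.assign(topic=topics)` should work as the starting frame, since `topics` has one entry per remaining row. A topic whose ratings cluster at 1 or at 5 would suggest it does track sentiment, even if its keywords look neutral.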

In [30]:
topic_model.visualize_topics(top_n_topics=20)

In [31]:
topic_model.visualize_documents(docs, embeddings=embeddings, hide_annotations=True)