## TrustPilot reviews

We'll perform topic modelling on TrustPilot reviews.

In [24]:
from string import punctuation

import dacy
import nltk
import pandas as pd
import spacy
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

import fetch
from fetch.utils import json_or_fetch

In [13]:
keys = ['Post Nord', 'Net Company']
args = (('https://dk.trustpilot.com/review/www.postnord.dk',), ('https://dk.trustpilot.com/review/www.netcompany.com',))
kwargs = ({'page_limit': 1},)*len(keys)
_ = json_or_fetch(fetch.trustpilot, keys, args, kwargs)

https://dk.trustpilot.com/review/www.postnord.dk: 20 reviews from 1 page. There are more pages left.
https://dk.trustpilot.com/review/www.netcompany.com: 20 reviews from 1 page. There are more pages left.


In [12]:
from functools import partial

# Using partial to set page_limit to 1 for all keys.
one_page = partial(fetch.trustpilot, page_limit=1)
_ = json_or_fetch(one_page, keys, args)

https://dk.trustpilot.com/review/www.postnord.dk: 20 reviews from 1 page. There are more pages left.
https://dk.trustpilot.com/review/www.netcompany.com: 20 reviews from 1 page. There are more pages left.
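Both cells request the same thing: `functools.partial` binds `page_limit=1` up front, so no per-key `kwargs` tuple is needed. A minimal sketch of that equivalence, using a hypothetical stand-in `fake_fetch` (the real `fetch.trustpilot` is not shown here):

```python
from functools import partial


def fake_fetch(url, page_limit=None):
    """Hypothetical stand-in for fetch.trustpilot; returns its arguments."""
    return {"url": url, "page_limit": page_limit}


url = "https://dk.trustpilot.com/review/www.postnord.dk"

# Option 1: pass the keyword argument explicitly on every call.
explicit = fake_fetch(url, page_limit=1)

# Option 2: bind the keyword argument once and reuse the new callable.
one_page_demo = partial(fake_fetch, page_limit=1)
bound = one_page_demo(url)

print(explicit == bound)
```

The bound keyword can still be overridden per call, for example `one_page_demo(url, page_limit=3)`.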


In [20]:
many_pages = partial(fetch.trustpilot, page_limit=100)

In [18]:
urls = ['https://dk.trustpilot.com/review/www.postnord.dk', 'https://dk.trustpilot.com/review/www.fedex.com', 'https://dk.trustpilot.com/review/www.ups.com']
args = tuple(zip(urls))
args

(('https://dk.trustpilot.com/review/www.postnord.dk',),
 ('https://dk.trustpilot.com/review/www.fedex.com',),
 ('https://dk.trustpilot.com/review/www.ups.com',))
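`zip` over a single iterable yields one-element tuples, which is why every URL ends up wrapped as the positional arguments for one call. A small sketch, assuming the helper unpacks each entry as `func(*entry)` (an assumption, since the internals of `json_or_fetch` are not shown):

```python
demo_urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs

# zip with one iterable wraps every item in a 1-tuple.
demo_args = tuple(zip(demo_urls))

# Equivalent, more explicit spelling.
explicit_args = tuple((u,) for u in demo_urls)


def show(url):
    """Toy callable standing in for the fetch function."""
    return url.upper()


# Unpacking a 1-tuple passes the URL as the single positional argument.
results = [show(*entry) for entry in demo_args]
```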

In [21]:
reviews_dict = json_or_fetch(many_pages, urls, args, path='data/trustpilot.json')

https://dk.trustpilot.com/review/www.postnord.dk: 2000 reviews from 100 pages. There are more pages left.
https://dk.trustpilot.com/review/www.fedex.com: 245 reviews from 13 pages.
https://dk.trustpilot.com/review/www.ups.com: 1100 reviews from 55 pages.
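The `path` argument suggests that `json_or_fetch` caches results on disk and only fetches when no cache exists. Its implementation is not shown, so the following is only a hypothetical sketch of that pattern. `load_or_fetch` and its signature are made up for illustration and are not the actual `fetch.utils` API.

```python
import json
from pathlib import Path


def load_or_fetch(fetch_fn, keys, args, path):
    """Hypothetical cache-or-fetch helper (not the real json_or_fetch).

    Returns the cached JSON at `path` if it exists; otherwise calls
    fetch_fn(*a) for each entry of `args`, stores the results under the
    matching key, and writes them to `path`.
    """
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))

    data = {key: fetch_fn(*a) for key, a in zip(keys, args)}
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Danish characters readable in the file.
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data
```

Under this assumed design, changing `page_limit` has no effect until the cache file is deleted, which is worth keeping in mind when rerunning with different settings.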


In [22]:
reviews = []
for company_reviews in reviews_dict.values():
    reviews.extend(company_reviews)
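Flattening like this drops the information about which company a review belongs to. If that is useful later (for example to compare topics per carrier), one option is to tag each review while flattening. A minimal sketch, assuming `reviews_dict` maps each key to a list of dicts with `title`, `body` and `rating` fields (the toy data below is made up):

```python
# Toy stand-in for reviews_dict; the real one comes from json_or_fetch.
toy_reviews_dict = {
    "postnord": [{"title": "t1", "body": "b1", "rating": 1}],
    "ups": [
        {"title": "t2", "body": "b2", "rating": 5},
        {"title": "t3", "body": "b3", "rating": 4},
    ],
}

# Copy every review and add the dictionary key as a 'company' field.
tagged_reviews = [
    {**review, "company": company}
    for company, company_reviews in toy_reviews_dict.items()
    for review in company_reviews
]
```

Passing `tagged_reviews` to `pd.DataFrame` would then add a `company` column next to `title`, `body` and `rating`.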

In [25]:
df = pd.DataFrame(reviews)
df.head()

| | title | body | rating |
|---|---|---|---|
| 0 | Må udtrykke min store skuffelse | Må udtrykke min store skuffelse. Alt i pakken ... | 1 |
| 1 | Jeg bestilte en stor vare og betalte… | Jeg bestilte en stor vare og betalte for hjemm... | 1 |
| 2 | Elendig service | Havde betalt 49kr. for at få min pakke leveret... | 1 |
| 3 | OK.... | Det er efterhånden skide irriterende med al de... | 5 |
| 4 | Jeg synes det er så smart | Jeg synes det er så smart, at i tager et bille... | 5 |


In [26]:
df.groupby('rating').count()

| rating | title | body |
|---|---|---|
| 1 | 1320 | 1320 |
| 2 | 79 | 79 |
| 3 | 57 | 57 |
| 4 | 153 | 153 |
| 5 | 1736 | 1736 |
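The ratings are heavily polarised. A quick calculation on the counts from the table above makes this concrete (with the DataFrame at hand, `df['rating'].value_counts(normalize=True)` should give the same shares):

```python
# Review counts per rating, copied from the groupby output above.
rating_counts = {1: 1320, 2: 79, 3: 57, 4: 153, 5: 1736}

total = sum(rating_counts.values())
shares = {rating: count / total for rating, count in rating_counts.items()}

# Share of reviews at either extreme of the scale.
extreme_share = shares[1] + shares[5]

for rating, share in shares.items():
    print(f"{rating} stars: {share:.1%}")
print(f"1 or 5 stars: {extreme_share:.1%}")
```

With so few 2 to 4 star reviews, any sentiment signal in the topics will mostly separate very unhappy from very happy customers.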


In [27]:
docs = df['body'].tolist()

I'm having trouble with stopwords even though I'm using the techniques described in the BERTopic documentation for removing them. Removing them during preprocessing is not recommended, but I thought I'd try it to see if it helped.

In [28]:
# da_stopwords = nltk.corpus.stopwords.words('danish')
da_stopwords = spacy.lang.da.stop_words.STOP_WORDS

In [18]:
type(da_stopwords)

set

In [29]:
count_vectorizer = CountVectorizer(stop_words=list(da_stopwords))

In [8]:
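def clean_sentence(sentence):
    """Tokenize a Danish sentence and drop stop words and punctuation tokens."""
    sentence = sentence.strip()
    words = nltk.word_tokenize(sentence, language='danish')
    # Compare lowercased so capitalised stop words are removed as well.
    words = [word for word in words if word.lower() not in da_stopwords and word not in punctuation]
    return ' '.join(words)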

In [9]:
docs = [clean_sentence(doc) for doc in docs]
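One caveat with `word not in punctuation`: because `punctuation` is a string, this is a substring test, so multi-character tokens such as `...` pass through. A stricter check is sketched below. To keep it runnable on its own, it uses a plain `str.split` instead of the NLTK tokenizer and a tiny made-up stop word set instead of the spaCy list.

```python
from string import punctuation

toy_stopwords = {"og", "det", "er"}  # tiny made-up set, not the spaCy list


def is_punctuation(token):
    """True if the token consists only of punctuation characters."""
    return bool(token) and all(ch in punctuation for ch in token)


def clean_tokens(tokens):
    """Drop stop words and punctuation-only tokens, keeping token order."""
    return [
        tok for tok in tokens
        if tok.lower() not in toy_stopwords and not is_punctuation(tok)
    ]


tokens = "Det er elendig service ... og langsomt !".split()
cleaned = " ".join(clean_tokens(tokens))
```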

In [30]:
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [31]:
embeddings = embedder.encode(docs, show_progress_bar=True)

Batches:   0%|          | 0/105 [00:00<?, ?it/s]

In [32]:
embeddings.shape

(3345, 384)
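The shape is consistent with the fetch output: 2000 + 245 + 1100 reviews give 3345 rows, one embedding of width 384 per review. The 105 batches in the progress bar also match, assuming `encode` was left at its default `batch_size` of 32:

```python
import math

reviews_per_company = [2000, 245, 1100]  # from the fetch output above
n_docs = sum(reviews_per_company)

batch_size = 32  # assumed default of SentenceTransformer.encode
n_batches = math.ceil(n_docs / batch_size)

print(n_docs, n_batches)
```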

## DaCy embeddings

In [51]:
# nlp = dacy.load('medium', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [53]:
# dacy_docs = list(nlp.pipe(docs))

## BERTopic model

In [33]:
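# Note: this transformer only takes effect if it is passed to BERTopic via
# ctfidf_model=...; the BERTopic call below does not pass it.
ctfid_model = ClassTfidfTransformer(reduce_frequent_words=True)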

In [34]:
topic_model = BERTopic(language="multilingual", nr_topics=None, min_topic_size=5, vectorizer_model=count_vectorizer,
                       seed_topic_list=[['god', 'godt', 'hurtig', 'hurtigt'],
                                        ['dårlig', 'dårligt', 'dårlige', 'langsom']])

In [35]:
topics, probs = topic_model.fit_transform(docs, embeddings)

In [36]:
topic_model.get_topic_info()[:20]

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 731 | -1_pakken_pakke_levering_hjemme |
| 1 | 0 | 354 | 0_ups_hjemme_pakken_pakke |
| 2 | 1 | 295 | 1_25_2023_april_dato |
| 3 | 2 | 234 | 2_26_2023_april_dato |
| 4 | 3 | 198 | 3_27_2023_april_dato |
| 5 | 4 | 100 | 4_firma_dårlig_service_kundeservice |
| 6 | 5 | 98 | 5_danmark_tyskland_pakken_pakke |
| 7 | 6 | 95 | 6_kl_dag_pakke_leveret |
| 8 | 7 | 87 | 7_hjemme_hele_dagen_døren |
| 9 | 8 | 67 | 8_fedex_pakke_pakken_kundenummer |
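Topics 1 to 3 are named almost entirely after date tokens (`25`, `2023`, `april`, `dato`), which suggests that dates in the review bodies are leaking into the topic representation. One way to keep such tokens out of the topic names is a token pattern that accepts letters but not digits, plus a few extra stop words. The sketch below uses the standard library only. The example sentence and the extra stop words are assumptions for illustration, and the same pattern could be tried as `token_pattern` in `CountVectorizer`.

```python
import re

# Letters only (Unicode-aware), at least two characters; digits are excluded.
LETTERS_ONLY = r"(?u)\b[^\W\d_]{2,}\b"

extra_stopwords = {"dato", "april"}  # assumed additions, taken from the topic names


def tokenize(text):
    """Lowercase, keep letter-only tokens, and drop the extra stop words."""
    tokens = re.findall(LETTERS_ONLY, text.lower())
    return [tok for tok in tokens if tok not in extra_stopwords]


example = "Pakken kom ikke. Dato: 25. april 2023"  # made-up review text
tokens = tokenize(example)
```

Hard-coding `april` only fits this particular scrape. If the dates come from a fixed line in the scraped body, stripping that line before embedding may be the more robust route.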


~~At least, now the name isn't all stopwords, but it still only puts it in one outlier topic.~~

I think the actual problem is just having way too few documents for it to extract topics.

Decreasing `min_topic_size` helped.
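To judge how much a parameter change such as a lower `min_topic_size` helps, the share of documents left in the outlier topic `-1` is a handy number to track. A minimal sketch using the counts reported above (731 outliers out of 3345 documents); `outlier_share` is a helper introduced here, not a BERTopic function:

```python
def outlier_share(topic_assignments):
    """Fraction of documents assigned to BERTopic's outlier topic (-1)."""
    if not topic_assignments:
        return 0.0
    return sum(1 for t in topic_assignments if t == -1) / len(topic_assignments)


# Synthetic assignments reproducing the counts above: 731 outliers of 3345.
demo_topics = [-1] * 731 + [0] * (3345 - 731)
share = outlier_share(demo_topics)
print(f"{share:.1%} of documents are outliers")
```

In the notebook, `outlier_share(topics)` can be called directly on the list returned by `fit_transform` and compared across runs.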

In [37]:
topic_model.visualize_barchart()

The topics don't seem to align well with sentiment, judging by the keywords alone.
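A more direct check than reading keywords is to compare ratings across topics. The sketch below computes the mean rating per topic with the standard library on made-up data. In the notebook, something like `df.assign(topic=topics).groupby('topic')['rating'].mean()` should give the same kind of summary.

```python
from collections import defaultdict
from statistics import mean

# Made-up (topic, rating) pairs; in the notebook these would come from
# zip(topics, df['rating']).
pairs = [(0, 1), (0, 2), (0, 1), (1, 5), (1, 4), (-1, 3), (-1, 5)]

ratings_by_topic = defaultdict(list)
for topic, rating in pairs:
    ratings_by_topic[topic].append(rating)

# Mean rating per topic; values near 1 or 5 indicate a sentiment-heavy topic.
mean_rating = {topic: mean(r) for topic, r in sorted(ratings_by_topic.items())}
```

Given how polarised the ratings are, a topic whose mean sits near the middle of the scale is more likely a mix of 1 and 5 star reviews than a set of lukewarm ones, so the per-topic rating distribution is worth a look as well.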

In [38]:
topic_model.visualize_topics(top_n_topics=20)

In [39]:
topic_model.visualize_documents(docs, embeddings=embeddings, hide_annotations=True)