## TrustPilot reviews

We'll be performing some topic modeling on TrustPilot reviews with BERTopic.

In [1]:
from string import punctuation

import dacy
import nltk
import pandas as pd
import spacy
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer

import fetch
from fetch.utils import json_or_fetch

In [2]:
url = 'https://dk.trustpilot.com/review/www.netcompany.com'

In [3]:
reviews_dict = json_or_fetch(url, fetch.trustpilot, 'data/trustpilot.json')

In [4]:
reviews = []
for company_reviews in reviews_dict.values():
    reviews.extend(company_reviews)

In [5]:
df = pd.DataFrame(reviews)
df.head()

| | title | body | rating |
|---|---|---|---|
| 0 | Galleri på Aula | Jeg ved snart ikke hvor længe jeg har haft pro... | 1 |
| 1 | Netcompany burde ikke får eneste… | Netcompany burde ikke får eneste stjerne, hvis... | 1 |
| 2 | Min rating er nok forket (og det er… | Min rating er nok forket (og det er den).Men j... | 3 |
| 3 | Som slutbruger som er ramt af deres… | Som slutbruger som er ramt af deres ufærdige i... | 1 |
| 4 | Firmaet bag PAC våbentilladelsessystem | Firmaet bag PAC våbentilladelsessystem. Mage t... | 1 |


In [6]:
docs = df['body'].tolist()

I'm having trouble with stopwords showing up in the topics, even though I'm using the techniques described in the BERTopic documentation for removing them. Removing stopwords during preprocessing isn't recommended, but I thought I'd try it and see if it helped.

In [7]:
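# da_stopwords = nltk.corpus.stopwords.words('danish')
# Import the submodule explicitly; `import spacy` alone does not guarantee `spacy.lang.da` is loaded.
from spacy.lang.da.stop_words import STOP_WORDS as da_stopwords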

In [8]:
def clean_sentence(sentence):
    # Tokenize, then drop Danish stopwords and tokens made up only of punctuation.
    words = nltk.word_tokenize(sentence.strip(), language='danish')
    words = [
        word for word in words
        if word.lower() not in da_stopwords
        and not all(char in punctuation for char in word)
    ]
    return ' '.join(words)

In [9]:
docs = [clean_sentence(doc) for doc in docs]

In [10]:
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = embedder.encode(docs, show_progress_bar=True)


In [11]:
# nlp = dacy.load('medium', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [12]:
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

In [13]:
topic_model = BERTopic(
    language="multilingual",
    min_topic_size=3,
    embedding_model=embedder,
    ctfidf_model=ctfidf_model,
)

In [14]:
# Reuse the embeddings computed above instead of encoding the documents a second time.
topics, probs = topic_model.fit_transform(docs, embeddings)

In [15]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4 | -1_elendigt_samtlige_it_offentlige |
| 1 | 0 | 6 | 0_virker_pac_januar_politiets |
| 2 | 1 | 5 | 1_se_netcompany_lov_dårlige |
| 3 | 2 | 5 | 2_penge_danske_spild_fejl |


~~At least, now the name isn't all stopwords, but it still only puts it in one outlier topic.~~

I think the actual problem is just having way too few documents for it to extract topics.

Decreasing `min_topic_size` helped.
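A rough back-of-the-envelope check of why this matters, using the counts from the table above: every topic needs at least `min_topic_size` documents, so the document count divided by `min_topic_size` is an upper bound on how many topics can form. The helper `max_topics` is a name made up for this sketch, and the 10 assumes BERTopic's default `min_topic_size`.

```python
# Document counts per topic, copied from the get_topic_info() output above.
topic_counts = {-1: 4, 0: 6, 1: 5, 2: 5}
n_docs = sum(topic_counts.values())


def max_topics(n_docs, min_topic_size):
    """Upper bound on the number of topics: each one needs min_topic_size docs."""
    return n_docs // min_topic_size


# 10 is assumed to be BERTopic's default min_topic_size; 3 is the value used here.
for size in (10, 3):
    print(f"min_topic_size={size}: at most {max_topics(n_docs, size)} topics")
```

This is only a ceiling; the clustering step can still produce fewer topics or send documents to the outlier topic.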

In [16]:
topic_model.visualize_barchart()

Judging by the keywords alone, the topics don't seem to align well with sentiment.
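One way to check this beyond eyeballing keywords would be to cross-tabulate the assigned topic against the star rating. The sketch below uses made-up toy values to show the shape of the check; in the notebook the columns would come from `topics` and `df['rating']`.

```python
import pandas as pd

# Hypothetical toy values for illustration; not the real topic assignments.
toy = pd.DataFrame({
    "topic": [0, 0, 1, 1, 2, -1],
    "rating": [1, 1, 1, 3, 1, 1],
})

# Rows are topics, columns are ratings, cells are review counts.
rating_by_topic = pd.crosstab(toy["topic"], toy["rating"])
print(rating_by_topic)

# Mean rating per topic as a single-number summary.
mean_rating = toy.groupby("topic")["rating"].mean()
print(mean_rating)
```

If topics tracked sentiment, each topic's row would concentrate in a narrow band of ratings; with this few reviews, any such pattern should be read with caution.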