## TrustPilot reviews

We'll be doing some topic modelling on TrustPilot reviews.

In [1]:
import requests
from bs4 import BeautifulSoup

def get_reviews(url):
    """Scrape up to 100 pages of reviews and return a {review_text: rating} dict."""
    texts = []
    ratings = []

    for page in range(1, 101):
        response = requests.get(f"{url}?page={page}")
        soup = BeautifulSoup(response.content, "html.parser")

        # Class names come from Trustpilot's generated markup and may change.
        review_divs = soup.find_all("div", {"class": "styles_cardWrapper__LcCPA styles_show__HUXRb styles_reviewCard__9HxJJ"})

        for div in review_divs:
            review_stars_div = div.find("div", {"class": "styles_reviewHeader__iU9Px"})
            review_paragraph = div.find("p")

            # Skip cards that lack a rating header or body text instead of raising.
            if review_stars_div is None or review_paragraph is None:
                continue

            texts.append(review_paragraph.text)
            ratings.append(review_stars_div['data-service-review-rating'])

    # Keyed by review text, so identical texts collapse into a single entry.
    reviews = dict(zip(texts, ratings))
    return reviews
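
One caveat about the return value above: because the dict is keyed by review text, two reviews with identical text collapse into one entry and only the last rating survives. A minimal sketch of the effect with made-up data (the review strings and the `records` name are hypothetical, not taken from the scrape):

```python
# Toy data: two different reviewers wrote the same short text.
texts = ["Fin service", "Pakken kom aldrig", "Fin service"]
ratings = ["5", "1", "4"]

# Same construction as in get_reviews: later duplicates overwrite earlier ones.
as_dict = dict(zip(texts, ratings))

# Keeping (text, rating) pairs in a list preserves every review.
records = list(zip(texts, ratings))

print(len(as_dict), len(records))
```

If that matters for very short reviews, the list form could be passed straight to `pd.DataFrame(records, columns=['review', 'rating'])`.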

In [19]:
from string import punctuation

import dacy
import nltk
import pandas as pd
import spacy
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
url = 'https://dk.trustpilot.com/review/www.postnord.dk'

In [3]:
reviews_dict = json_or_fetch(url, fetch.trustpilot, 'data/trustpilot.json')

In [5]:
reviews_dict = get_reviews(url)

In [4]:
reviews = []
for company_reviews in reviews_dict.values():
    reviews.extend(company_reviews)

In [59]:
docs2 = get_reviews('https://dk.trustpilot.com/review/www.fedex.com')

In [60]:
docs3 = get_reviews('https://dk.trustpilot.com/review/www.ups.com')

In [61]:
reviews_dict.update(docs2)
reviews_dict.update(docs3)

In [9]:
df = pd.DataFrame(reviews)
df.head()

Unnamed: 0,review
0,Nær box kunne ikke åbne. fik at vide de ville ...
1,Jeg bliver nødt til at tage omdelt post med i ...
2,Altid godt at få leveret fra post nord. Men I ...
3,"Kan det være rigtigt, at man i 2023 ikke kan g..."
4,"Jeg forstår ikke helt, at når der ikke er plad..."


In [None]:
# number of columns and rows
print("Number of columns:", len(df.columns))
print("Number of rows:", len(df))

In [None]:
# type of data
print("Data types:")
print(df.dtypes)

In [None]:
# see the size
df.shape

In [None]:
df.columns

In [62]:
df = pd.DataFrame(reviews_dict.items(), columns=['review', 'rating'])

In [63]:
df['rating'] = df['rating'].astype(int)

In [66]:
df.groupby('rating').count()

rating,review
1,1322
2,96
3,61
4,115
5,844


In [10]:
docs = list(reviews_dict.keys())

In [42]:
docs2 = list(get_reviews('https://dk.trustpilot.com/review/www.fedex.com').keys())

In [43]:
docs.extend(docs2)

In [47]:
docs3 = list(get_reviews('https://dk.trustpilot.com/review/www.ups.com').keys())
docs.extend(docs3)

In [54]:
docs[-1:-10:-1]

['Jeg har flere gange fået leveret varer med UPS, og det er alle gange gået uden problemer ;) super service og hurtig levering.',
 'Har fået leveret pakker hos dem flere gange... jeg kan ikke huske noget negativt - så det må være godt',
 'Leveringen fejlede ingenting, men håber aldrig nogen får brug for at komme i kontakt med UPS kundeservice.\rFor at gøre en lang historie kort: En medarbejder hos UPS udfylder en fragtseddel for mig forkert selvom jeg 3 gange understregede, at det var en modtager-betaler-pakke. \rJeg får derfor tilsendt en regning samt en masse rykkere selvom jeg er i løbende kontakt med deres kundeservice, hvor de lover at de vil krediterer mig, og at de vil lukke sagen med det samme. \r- Imens de "undersøger" sagen sætter de den ikke i bero, hvilket betyder, at jeg får flere rykkere fra inkasso.\r- Det er ligeledes ikke muligt at kontakte de personer, der undersøger sagen eller få respons på, hvad der sker og evt, hvor lang tid det vil tage. \r- De tager MEGET lang t

I'm having trouble with stopwords even though I'm using the techniques described in the BERTopic documentation for removing them. Removing them in preprocessing isn't recommended, but I thought I'd try it and see if it helped.
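
The reason for leaving stop words in the documents is that the embedding model should see the full sentence, while stop words only need to be dropped when terms are counted for the topic representation. A rough stdlib-only sketch of that separation follows; `count_terms` and the tiny stop word set are illustrative assumptions, not BERTopic or scikit-learn code:

```python
import re
from collections import Counter

# Tiny illustrative stop word set; the notebook uses spaCy's Danish list instead.
STOP_WORDS = {"jeg", "har", "en", "og", "det", "er"}

def count_terms(doc, stop_words):
    """Count lowercase word tokens, skipping stop words, without altering doc."""
    tokens = re.findall(r"\w+", doc.lower())
    return Counter(t for t in tokens if t not in stop_words)

doc = "Jeg har en pakke og det er en god pakke"
counts = count_terms(doc, STOP_WORDS)

# The original string is untouched and could still be embedded as-is.
print(doc)
print(counts.most_common(2))
```

Passing `stop_words` to `CountVectorizer` plays the same role in the real pipeline: the filtering happens at counting time, not on the text that gets embedded.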

In [16]:
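# da_stopwords = nltk.corpus.stopwords.words('danish')
# Import the submodule explicitly; `import spacy` alone does not guarantee spacy.lang.da is loaded.
from spacy.lang.da.stop_words import STOP_WORDS as da_stopwords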

In [18]:
type(da_stopwords)

set

In [97]:
count_vectorizer = CountVectorizer(stop_words=list(da_stopwords))

In [8]:
def clean_sentence(sentence):
    sentence = sentence.strip()
    words = nltk.word_tokenize(sentence, language='danish')
    words = [word for word in words if word.lower() not in da_stopwords and word not in punctuation]
    return ' '.join(words)

In [9]:
docs = [clean_sentence(doc) for doc in docs]

In [None]:
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

In [48]:
embeddings = embedder.encode(docs, show_progress_bar=True)


In [None]:
embeddings.shape

## Dacy embeddings

In [51]:
# nlp = dacy.load('medium', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

In [53]:
# dacy_docs = list(nlp.pipe(docs))

## Bertopic model

In [12]:
# Note: this transformer is not passed to BERTopic below, so it has no effect as written.
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

In [132]:
topic_model = BERTopic(language="multilingual", nr_topics=None, min_topic_size=5, vectorizer_model=count_vectorizer,
                       seed_topic_list=[['god', 'godt', 'hurtig', 'hurtigt'],
                                        ['dårlig', 'dårligt', 'dårlige', 'langsom']])

In [133]:
topics, probs = topic_model.fit_transform(docs, embeddings)

In [134]:
topic_model.get_topic_info()[:20]

Unnamed: 0,Topic,Count,Name
0,-1,933,-1_pakken_ups_pakke_hjemme
1,0,125,0_hurtig_levering_tilfreds_præcis
2,1,88,1_kl_fredag_april_21
3,2,80,2_hjemme_hele_dagen_fkn
4,3,76,3_dårlig_kundeservice_firma_service
5,4,71,4_fedex_pakke_pakken_usa
6,5,67,5_post_nord_postnord_hurtig
7,6,55,6_tyskland_pakken_oktober_kolding
8,7,47,7_bestilte_lørdag_hurtig_bestilt
9,8,43,8_postnord_pakke_fik_retur


~~At least, now the name isn't all stopwords, but it still only puts it in one outlier topic.~~

I think the actual problem is simply that there are far too few documents for it to extract topics.

Decreasing `min_topic_size` helped.
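
To judge whether a smaller `min_topic_size` really helps, one option is to compare the share of documents that land in the outlier topic `-1` between runs. A minimal sketch, assuming `topics` is the list of per-document topic ids returned by `fit_transform`; `outlier_share` is a hypothetical helper and the lists below are toy values, not results from this notebook:

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)

# Toy topic assignments standing in for two runs with different settings.
run_large_min_size = [-1, -1, -1, 0, 0, -1, 1, -1]
run_small_min_size = [-1, 0, 2, 0, 1, -1, 1, 3]

print(outlier_share(run_large_min_size))
print(outlier_share(run_small_min_size))
```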

In [135]:
topic_model.visualize_barchart()

The topics don't seem to align well with sentiment, judging from the keywords alone.
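
A more direct check than reading keywords would be to average the star rating per topic. A minimal sketch, assuming `topics` and `ratings` are parallel lists in the same document order (which would need verifying here, since `docs` and `reviews_dict` are built in separate cells); `mean_rating_by_topic` is a hypothetical helper and the data is made up:

```python
from collections import defaultdict

def mean_rating_by_topic(topics, ratings):
    """Average the ratings of the documents assigned to each topic id."""
    if len(topics) != len(ratings):
        raise ValueError("topics and ratings must have the same length")
    grouped = defaultdict(list)
    for topic, rating in zip(topics, ratings):
        grouped[topic].append(rating)
    return {topic: sum(vals) / len(vals) for topic, vals in grouped.items()}

# Toy example: topic 0 is mostly praise, topic 3 mostly complaints.
topics = [0, 0, 3, 3, -1, 0]
ratings = [5, 4, 1, 2, 3, 5]

print(mean_rating_by_topic(topics, ratings))
```

A topic whose mean sits near 1 or 5 would suggest it tracks sentiment; means clustered around the overall average would support the impression from the keywords.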

In [136]:
topic_model.visualize_topics(top_n_topics=20)

In [137]:
topic_model.visualize_documents(docs, embeddings=embeddings, hide_annotations=True)