In [21]:
import requests
from bs4 import BeautifulSoup
import re

Let's request the webpage and display the first 500 characters of the response text

In [10]:
r = requests.get("https://www.trustpilot.com/review/www.corsair.com")
r.text[:500]

'<!DOCTYPE html><html lang="en-US"><head><meta charSet="UTF-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><link rel="shortcut icon" type="image/x-icon" href="https://cdn.trustpilot.net/brand-assets/4.3.0/favicons/favicon.ico"/><link rel="manifest" href="/manifest.json"/><meta name="application-name" content="Trustpilot"/><meta name="theme-color" content="#1c1c1c"/><link rel="apple-touch-icon" sizes="180x180" href="https://cdn.trustpilot.net/brand-assets/4.3.0/favicons/a'

The text is a giant HTML soup, and we're only interested in the reviews on the page
- Need to find the div + class structure for reviews

In [31]:
soup = BeautifulSoup(r.text,"html.parser")
reviews = soup.find_all("div", {"class": "styles_reviewCardInner__EwDq2"})

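Above, I used Chrome's element inspector (DevTools) to discover the div container and its class name for reviews. The suffix in the class name (`__EwDq2`) looks auto-generated, so it may change if the site is rebuilt. Here's an example of how to get the review text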

In [20]:
reviews[1].get_text()

"ETetka7 reviewsUS18 hours agoSolid hardwareI have been using their mousepads for years and been satisfied with their headphones. I am not a fan of RGB, so I haven't experienced any problems related to software. I recently got a Corsair PSU since my old one was very loud and inefficient, and it has been pretty good so far.Date of experience: April 10, 2024"

In [156]:
for i in reviews:
    print(i.get_text())

ROrobert30 reviewsGB5 hours agoGoodVery strange seeing it getting 2star rating any time I've had issues they helped me super quick and always sent me out parts that needed if broken.tje core nodes can be a bit better but in general there are very good.Date of experience: May 03, 2024
ETetka7 reviewsUS18 hours agoSolid hardwareI have been using their mousepads for years and been satisfied with their headphones. I am not a fan of RGB, so I haven't experienced any problems related to software. I recently got a Corsair PSU since my old one was very loud and inefficient, and it has been pretty good so far.Date of experience: April 10, 2024
THThefrench33 reviewsAT6 days agoPerfect serviceI encountered an issue with my iCUE Hub, but the support team helped me in troubleshooting the problem. They promptly sent me a replacement hub, and after installing it, everything worked flawlessly. I'm highly satisfied with their service, it was not only nice but also incredibly helpful. This positive expe

We need functions to gather the following from each review:
- The date of the review (for timeline chart purposes)
- Body of the review (text not including the name or location)
> Subsequently use NLP to filter the text for a word cloud graphic
- Location of the reviewer (e.g. US vs AT vs GB)
> Subsequently need to decipher these 2-letter codes, because "AT" is Austria and "GB" is Great Britain
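
As a rough sketch of the deciphering step, a small lookup table is enough to get started. The table and the `country_name` helper below are hypothetical names for illustration and cover only a handful of codes; a package such as `pycountry` could replace the hand-written table if full coverage is needed.

```python
# Hypothetical lookup table covering only a few two-letter codes; extend as needed.
COUNTRY_NAMES = {
    "AT": "Austria",
    "BE": "Belgium",
    "DK": "Denmark",
    "FI": "Finland",
    "GB": "United Kingdom",
    "NO": "Norway",
    "US": "United States",
}

def country_name(code):
    """Return the country name for a two-letter code, or the code itself if unknown."""
    return COUNTRY_NAMES.get(code.upper(), code)

for code in ["AT", "GB", "ZZ"]:
    print(code, "->", country_name(code))
```

Returning the code unchanged for unknown entries keeps badly parsed values visible in later charts instead of silently dropping them.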

In [87]:
def find_date(txt):
    '''
    Function to find the date -- Use Date of Experience
    '''
    start = re.search("Date of experience:",txt).span()[1]
    return txt[start+1:]

for i in reviews: #Test
    print(find_date(i.get_text()))

May 03, 2024
April 10, 2024
May 07, 2024
May 04, 2024
May 01, 2024
April 29, 2024
April 24, 2024
April 12, 2024
April 28, 2024
February 08, 2024Reply from CorsairFeb 21, 2024Hello,We are sorry to hear about any inconvenience that may have occurred with your recent online web order and the experience you had with our Support Team.Please provide us with more details about your order or any support ticket and we can review this matter further.
January 22, 2024
February 22, 2024
January 15, 2024
January 30, 2024
March 08, 2024
March 24, 2024
October 02, 2023
January 23, 2024
March 07, 2024Reply from CorsairMar 11, 2024Hello,We apologize if there was any confusion with our warranty replacement processes. Whether the replacement unit is new or refurbished, the warranty period does not restart. The replacement unit will continue with the remaining period of the original warranty. Additional details about our warranty can be found here: https://help.corsair.com/hc/en-us/articles/360033067832-C
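
For the timeline chart, only the date itself is wanted, without any reply text that follows it. A minimal sketch, assuming the date is always written like "May 03, 2024" (full English month name) directly after the "Date of experience:" label; `find_experience_date` is a hypothetical helper name, not part of the notebook above.

```python
import re
from datetime import datetime

# Assumption: the date right after the label looks like "May 03, 2024".
DATE_RE = re.compile(r"Date of experience:\s*([A-Z][a-z]+ \d{1,2}, \d{4})")

def find_experience_date(txt):
    """Return the date of experience as a datetime.date, or None if not found."""
    match = DATE_RE.search(txt)
    if match is None:
        return None
    # %B parses full month names; this relies on an English locale.
    return datetime.strptime(match.group(1), "%B %d, %Y").date()

sample = "Some review textDate of experience: February 08, 2024Reply from CorsairFeb 21, 2024Hello"
print(find_experience_date(sample))
```

Parsing into a `date` object makes sorting and grouping by month straightforward later on.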

This is kind of convenient: replies from Corsair appear after the "Date of experience" text. Let's extract them with a function as well

In [149]:
def find_reply(txt):
    '''
    Function to find replies if there are any from a representative
    '''
    #Match the tail of each date, e.g. the "3, 2024" in "May 03, 2024"
    idxs = [m.span() for m in re.finditer(r"\d, \d{4}",txt)]
    #Dates, when present, appear in this order:
    #date of review, date of experience, date of reply
    if len(idxs)<=1:
        return "No Reply"
    reply = txt[idxs[-1][1]:] #Text after the last date
    if reply=="": #Blank when the last date is the date of experience
        return "No Reply"
    return reply

for i in reviews: #Test
    print(find_reply(i.get_text()))

No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
Hello,We are sorry to hear about any inconvenience that may have occurred with your recent online web order and the experience you had with our Support Team.Please provide us with more details about your order or any support ticket and we can review this matter further.
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
No Reply
Hello,We apologize if there was any confusion with our warranty replacement processes. Whether the replacement unit is new or refurbished, the warranty period does not restart. The replacement unit will continue with the remaining period of the original warranty. Additional details about our warranty can be found here: https://help.corsair.com/hc/en-us/articles/360033067832-Customer-Service-Corsair-Limited-Warranty 
No Reply
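
Taking the text after the last date-like match works on this page, but a reply that itself mentions a date would presumably be cut short. One alternative sketch anchors on the "Reply from Corsair" marker visible in the earlier output instead. The helper name `find_reply_by_marker`, the `company` parameter, and the sample string are assumptions made for illustration.

```python
import re

def find_reply_by_marker(txt, company="Corsair"):
    """Return (reply_date, reply_text), or None when there is no reply.

    Assumes the layout "Reply from <company><Mon DD, YYYY><reply text>".
    """
    pattern = rf"Reply from {re.escape(company)}([A-Z][a-z]+ \d{{1,2}}, \d{{4}})(.*)"
    match = re.search(pattern, txt, flags=re.DOTALL)
    if match is None:
        return None
    reply_date, reply_text = match.groups()
    return reply_date, reply_text.strip()

# Made-up sample in the same shape as the scraped text above.
sample = ("Date of experience: February 08, 2024Reply from CorsairFeb 21, 2024"
          "Hello,We are sorry to hear about it. Call us before March 1, 2025 please.")
print(find_reply_by_marker(sample))
```

Returning `None` instead of the string "No Reply" is a design choice here: it keeps missing replies easy to filter without comparing against a sentinel string.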


In [151]:
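def find_geo(txt):
    '''
    Find geography tag for the review
    '''
    #Search for "review" (singular) so it matches both "1 review" and "10 reviews"
    start = re.search("review",txt).span()[1]
    #The +1 skips the trailing "s", so this is off by one when the text says "1 review"
    return txt[start+1:start+3]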

for i in reviews: #Test
    print(find_geo(i.get_text()))

GB
US
AT
DK
GB
BA
FI
UU
NO
GB
US
GB
BE
AJ
GB
GB
GB
SJ
GB
NO
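
A few of the values above, such as "UU" and "AJ", are not assigned ISO 3166-1 alpha-2 codes. That is consistent with the slice being shifted by one character for reviewers with exactly "1 review", where there is no trailing "s" to skip. A possible fix, sketched under the assumption that the country code always directly follows "N review" or "N reviews"; `find_geo_fixed` is a hypothetical name and the second sample string is made up.

```python
import re

# Optional "s" handles both "1 review" and "10 reviews"; the group captures the code.
GEO_RE = re.compile(r"\d+ reviews?([A-Z]{2})")

def find_geo_fixed(txt):
    """Return the two-letter country code after the review count, or None."""
    match = GEO_RE.search(txt)
    return match.group(1) if match else None

print(find_geo_fixed("ROrobert30 reviewsGB5 hours agoGood"))    # plural form
print(find_geo_fixed("JDJane Doe1 reviewUSJan 23, 2024Title"))  # singular form (made-up sample)
```

Requiring a digit before "review" also reduces the chance of matching the word "review" inside a reviewer's name.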


In [177]:
def find_body(txt):
    '''
    Function to find body text
    '''
    #The body ends where the "Date of experience:" label starts
    end = re.search("Date of experience:",txt).span()[0]
    try: #Recent reviews say e.g. "3 hours ago" instead of giving a date
        start = re.search(" ago",txt).span()[1]
    except AttributeError: #No " ago" found, so start after the first date (the review date)
        idxs = [i.span() for i in re.finditer(r"\d, \d{4}",txt)]
        start = idxs[0][1]
    return txt[start:end]

for i in reviews: #Test
    print(find_body(i.get_text()))
    break #Don't show all of them in this case ... too much text

GoodVery strange seeing it getting 2star rating any time I've had issues they helped me super quick and always sent me out parts that needed if broken.tje core nodes can be a bit better but in general there are very good.
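
With the body text isolated, the word cloud step mentioned earlier could start from plain token counts. This is only a sketch: the `word_counts` helper and the tiny stop-word set are hypothetical, and a real pipeline would likely use a fuller stop-word list from an NLP library.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "i", "it", "is", "to", "of", "in",
              "they", "me", "my", "but", "be", "can", "are", "there", "that"}

def word_counts(bodies):
    """Count lowercase word tokens across review bodies, skipping stop words."""
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z']+", body.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

counts = word_counts(["Good service, good support", "The support was quick"])
print(counts.most_common(3))
```

Because `get_text()` fuses the review title with the first word of the body (as in "GoodVery" above), tokens like "goodvery" will show up in the counts unless the title is split off first.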
