# **Scraping reviews from Trustpilot**
This code scrapes the Trustpilot reviews of a given company and saves the extracted fields to a CSV file.

The associated tutorial is available at [this link](https://medium.com/@isetitra/b79ffde43232).


### **Step 1: import the libraries**

In [None]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np

### **Step 2: Define the necessary functions**

Define some helper functions

In [None]:
def replace_br(elements):
  element_content =''
  for line in elements:
      element_content += str(line) if str(line) != '<br/>' else '. '
  return element_content
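
As a quick check of `replace_br`, the sketch below calls it on a plain Python list standing in for a tag's `.contents`. This is an assumption made for illustration: in the notebook the items are BeautifulSoup nodes, but the function only relies on `str(item)`, so plain strings behave the same way. The function is repeated so the snippet runs on its own.

```python
def replace_br(elements):
    # Same logic as the helper above: join the items, turning <br/> into '. '
    element_content = ''
    for line in elements:
        element_content += str(line) if str(line) != '<br/>' else '. '
    return element_content

# Hypothetical review body split by one line break
sample_contents = ['Fast delivery', '<br/>', 'Poor packaging']
result = replace_br(sample_contents)
print(result)
```

An empty list yields an empty string, which is what `get_content` relies on when a tag is missing.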

In [None]:
def get_content(review, key, elt_tag, elt_class):
  # Return the text (or attribute value) of one field of a review card
  if key == 'rating':
    # The rating is stored in an attribute of the review header; '-1' if missing
    element = review.find_all(elt_tag, class_=elt_class)
    element_content = element[0].get('data-service-review-rating') if element else '-1'
  elif key == 'review_date':
    # The first <time> tag of the card holds the review date
    element = review.find_all('time')
    element_content = element[0].get('datetime') if 0 < len(element) else ''
  elif key == 'reply':
    # The reply is read from the third matching tag, so at least three are needed
    element = review.find_all(elt_tag, class_=elt_class)
    element_content = ''
    if 2 < len(element):
        elements = element[2].contents
        if 0 < len(elements):
            element_content = elements[0]
  elif key == 'reply_date':
    # The second <time> tag, when present, holds the reply date
    element = review.find_all('time')
    element_content = element[1].get('datetime') if 1 < len(element) else ''
  else:
    # Title and text: join the tag contents, turning <br/> into '. '
    element = review.find_all(elt_tag, class_=elt_class)
    elements = element[0].contents if element else []
    element_content = replace_br(elements)

  return element_content


### **Step 3: Scrape one page**

Define the URL to scrape

In [None]:
url = 'https://fr.trustpilot.com/review/www.carrefour.fr'
index = 0
url = url + "?page=" + str(index + 1)
print('Scraping page ', index, ' from url ', url)

Scraping page  0  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=1
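
The page URL above is built by string concatenation. As a small sketch of the same idea, the hypothetical helper `build_page_url` (not part of the notebook) builds the query string with `urllib.parse.urlencode` and keeps the 1-based page numbering used here.

```python
from urllib.parse import urlencode

def build_page_url(base_url, page_number):
    # Pages are 1-based in the notebook: ?page=1 is the first page
    return base_url + '?' + urlencode({'page': page_number})

base_url = 'https://fr.trustpilot.com/review/www.carrefour.fr'
first_pages = [build_page_url(base_url, n) for n in range(1, 4)]
for page_url in first_pages:
    print(page_url)
```

Keeping the base URL unchanged and deriving each page URL from it avoids accidentally appending `?page=` twice to the same variable.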


Scraping the content of the web page

In [None]:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
content_page = soup.find_all("script", id="__NEXT_DATA__")[0].contents[0]

Parse the JSON content and save it to a file

In [None]:
content_page_json = json.loads(content_page)

In [None]:
with open('json_file.json', 'w') as outfile:
  outfile.write(json.dumps(content_page_json))


In [None]:
# Read the total number of review pages from the pagination data embedded in the page
num_of_pages: int = content_page_json.get('props').get('pageProps')\
                .get('filters').get('pagination').get('totalPages')
num_of_pages

103
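
The chained `.get()` calls work as long as every key exists. If one level is missing, `.get()` returns `None` and the next call raises an `AttributeError`. The sketch below shows one way to guard against that with a hypothetical helper, `get_nested`, applied to a hand-written dictionary that mimics the key path used above. The sample data is an assumption, not the real page content.

```python
def get_nested(data, path, default=None):
    # Walk the keys one by one and stop as soon as a level is missing
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Hypothetical stand-in for content_page_json, using the key path from the notebook
sample_json = {'props': {'pageProps': {'filters': {'pagination': {'totalPages': 103}}}}}
path = ['props', 'pageProps', 'filters', 'pagination', 'totalPages']

num_of_pages = get_nested(sample_json, path, default=1)
missing = get_nested({'props': {}}, path, default=1)
print(num_of_pages, missing)
```

Falling back to a single page is a design choice for this sketch; raising an explicit error may be preferable when a silent fallback could hide a change in the page layout.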

Getting the content of the web page.

First, define the tags and CSS classes we are interested in.

In [None]:
elt_tag = {'title': 'h2', 'rating': 'div', 'text': 'p',
           'review_date': 'time', 'reply': 'p', 'reply_date': 'time'}
elt_class = {'title': 'typography_heading-s__f7029',
             'rating': 'styles_reviewHeader__iU9Px',
             'text': 'typography_body-l__KUYFJ',
             'review_date': 'time',
             'reply': 'typography_body-m__xgxZ_',
             'reply_date': 'typography_body-m__xgxZ_'}

Get the main tag which contains all the information about a review

In [None]:
reviews = soup.find_all('div', 'styles_cardWrapper__LcCPA')
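
Class names such as `styles_cardWrapper__LcCPA` appear to end with a generated suffix, so an exact match may stop working if the site is rebuilt. Assuming the part before the suffix stays stable (an assumption, not something the page guarantees), the sketch below matches on the prefix with a regular expression. The HTML fragment and its suffixes are made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment; the suffixes after '__' are invented
sample_html = """
<div class="styles_cardWrapper__AAAAA">review one</div>
<div class="styles_cardWrapper__AAAAA other">review two</div>
<div class="styles_somethingElse__BBBBB">not a review</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match any div that has a class starting with the prefix
cards = sample_soup.find_all('div', class_=re.compile(r'^styles_cardWrapper__'))
card_texts = [card.get_text() for card in cards]
print(card_texts)
```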

In [None]:
type(reviews)

bs4.element.ResultSet

In [None]:
print('the number of reviews for this page is : ', len(reviews))

the number of reviews for this page is :  20


Get the reviews in a data frame

In [None]:
# Build one dict per review, then create the data frame in a single step.
# This avoids chained assignment (all_reviews[k].iloc[i] = ...), which pandas warns about.
rows = []
for review in reviews:
  rows.append({key: get_content(review, key, elt_tag[key], elt_class[key])
               for key in elt_tag})
all_reviews = pd.DataFrame(rows, columns=list(elt_tag))

In [None]:
type(review)

bs4.element.Tag

Inspect a few elements to understand the content

In [None]:
elt_tag

{'title': 'h2',
 'rating': 'div',
 'text': 'p',
 'review_date': 'time',
 'reply': 'p',
 'reply_date': 'time'}

In [None]:
elt_class

{'title': 'typography_heading-s__f7029',
 'rating': 'styles_reviewHeader__iU9Px',
 'text': 'typography_body-l__KUYFJ',
 'review_date': 'time',
 'reply': 'typography_body-m__xgxZ_',
 'reply_date': 'typography_body-m__xgxZ_'}

In [None]:
element = review.find_all(elt_tag['rating'], elt_class['rating'])
element[0]

<div class="styles_reviewHeader__iU9Px" data-service-review-rating="1"><div class="star-rating_starRating__4rrcf star-rating_medium__iN6Ty"><img alt="Noté 1 sur 5 étoiles" src="https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-1.svg"/></div><div class="typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l styles_datesWrapper__RCEKH"><time class="" data-service-review-date-time-ago="true" datetime="2023-02-14T13:34:12.000Z">14 févr. 2023</time></div></div>

In [None]:
element[0].contents

[<div class="star-rating_starRating__4rrcf star-rating_medium__iN6Ty"><img alt="Noté 1 sur 5 étoiles" src="https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-1.svg"/></div>,
 <div class="typography_body-m__xgxZ_ typography_appearance-subtle__8_H2l styles_datesWrapper__RCEKH"><time class="" data-service-review-date-time-ago="true" datetime="2023-02-14T13:34:12.000Z">14 févr. 2023</time></div>]

In [None]:
element = review.find_all(elt_tag['title'], elt_class['title'])
element[0]

<h2 class="typography_heading-s__f7029 typography_appearance-default__AAY17" data-service-review-title-typography="true">Promotion carrefour 30 pour cent  sur…</h2>

In [None]:
element[0].contents

['Promotion carrefour 30 pour cent  sur…']

### **Display some reviews**

In [None]:
all_reviews.head()

|  | title | rating | text | review_date | reply | reply_date |
|---|---|---|---|---|---|---|
| 0 | bravo, c'est plus simple. | 1 | : rien à redire pour le produit. par contre co... | 2023-03-09T12:39:39.000Z |  |  |
| 1 | Bonjour pour une fois je commande sur… | 1 | Bonjour pour une fois je commande sur internet... | 2023-03-09T20:34:24.000Z |  |  |
| 2 | Bravo le gaspillage énergétique! | 1 | Je trouve ça honteux de trouver encore dans ce... | 2023-03-08T14:26:46.000Z | Bonjour, Carrefour a pris l'engagement de rédu... | 2023-03-08T19:14:29.000Z |
| 3 | SAV incompétent | 1 | J’ai commandé une PlayStation 5. La livraison ... | 2023-03-06T21:56:51.000Z | Bonjour, Pourriez-vous nous communiquer votre ... | 2023-03-07T11:26:54.000Z |
| 4 | De passage dans la région je vais à… | 1 | De passage dans la région je vais à Carrefour ... | 2023-03-04T21:10:55.000Z | Bonjour, information pris près du magasin, c'e... | 2023-03-06T14:48:54.000Z |
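
All scraped values are strings, including the rating and the dates. As a minimal sketch of a possible follow-up step (not part of the original notebook), the snippet below converts `rating` to a number and `review_date` to a timezone-aware timestamp, using two rows copied from the output above.

```python
import pandas as pd

# Two rows copied from the output above (only the columns being converted)
sample = pd.DataFrame({
    'rating': ['1', '1'],
    'review_date': ['2023-03-09T12:39:39.000Z', '2023-03-09T20:34:24.000Z'],
})

typed = sample.copy()
typed['rating'] = pd.to_numeric(typed['rating'])
# The trailing 'Z' marks UTC; utc=True keeps the result timezone-aware
typed['review_date'] = pd.to_datetime(typed['review_date'], utc=True)
print(typed.dtypes)
```

The `reply_date` column contains empty strings for reviews without a reply, so it would need its own handling before conversion.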


### **Step 4: Scrape all the web pages**

In [None]:
# First get the number of pages from the JSON embedded in the first page
url = 'https://fr.trustpilot.com/review/www.carrefour.fr'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
content_page = soup.find_all("script", id="__NEXT_DATA__")[0].contents[0]
content_page_json = json.loads(content_page)
num_of_pages: int = content_page_json.get('props').get('pageProps')\
                .get('filters').get('pagination').get('totalPages')

# Collect one dict per review, then build the data frame once at the end
rows = []
for index in range(num_of_pages):
  url_page = url + "?page=" + str(index + 1)
  print('Scraping page ', index, ' from url ', url_page)
  page = requests.get(url_page)
  soup = BeautifulSoup(page.content, 'html.parser')
  reviews = soup.find_all('div', 'styles_cardWrapper__LcCPA')
  for review in reviews:
    rows.append({key: get_content(review, key, elt_tag[key], elt_class[key])
                 for key in elt_tag})

all_reviews = pd.DataFrame(rows, columns=list(elt_tag))


Scraping page  0  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=1
Scraping page  1  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=2
Scraping page  2  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=3
Scraping page  3  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=4
Scraping page  4  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=5
Scraping page  5  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=6
Scraping page  6  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=7
Scraping page  7  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=8
Scraping page  8  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=9
Scraping page  9  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=10
Scraping page  10  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=11
Scraping page  11  from url  https://fr.trustpilot.com/review/www.carrefour.fr?page=12
Scraping page  12  from url  https://fr.trustpilo
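
The loop above sends one request per page with no pause and no error handling, so a single failed request stops the whole run. The sketch below shows one possible way to add retries with a delay. `fetch_with_retry` and `flaky_fetch` are hypothetical names introduced for the example, and the fake fetcher replaces real network calls so the snippet can run offline.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay_seconds=1.0):
    # `fetch` is any callable taking a URL, for example requests.get in the notebook
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            time.sleep(delay_seconds)
    raise last_error

# Demonstration with a fake fetcher that fails once, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 2:
        raise ConnectionError('temporary failure')
    return 'ok: ' + url

result = fetch_with_retry(flaky_fetch, 'https://example.com/?page=1', delay_seconds=0)
print(result, len(calls))
```

In the scraping loop, the call `requests.get(url_page)` could be replaced by `fetch_with_retry(requests.get, url_page)`. A short `time.sleep` between pages would also reduce the load on the site.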

In [None]:
len(all_reviews)

2059

In [None]:
all_reviews_to = all_reviews.iloc[0:2000]

In [None]:
all_reviews_to

|  | title | rating | text | review_date | reply | reply_date |
|---|---|---|---|---|---|---|
| 0 | bravo, c'est plus simple. | 1 | : rien à redire pour le produit. par contre co... | 2023-03-09T12:39:39.000Z |  |  |
| 1 | Bonjour pour une fois je commande sur… | 1 | Bonjour pour une fois je commande sur internet... | 2023-03-09T20:34:24.000Z |  |  |
| 2 | Bravo le gaspillage énergétique! | 1 | Je trouve ça honteux de trouver encore dans ce... | 2023-03-08T14:26:46.000Z | Bonjour, Carrefour a pris l'engagement de rédu... | 2023-03-08T19:14:29.000Z |
| 3 | SAV incompétent | 1 | J’ai commandé une PlayStation 5. La livraison ... | 2023-03-06T21:56:51.000Z | Bonjour, Pourriez-vous nous communiquer votre ... | 2023-03-07T11:26:54.000Z |
| 4 | De passage dans la région je vais à… | 1 | De passage dans la région je vais à Carrefour ... | 2023-03-04T21:10:55.000Z | Bonjour, information pris près du magasin, c'e... | 2023-03-06T14:48:54.000Z |
| ... | ... | ... | ... | ... | ... | ... |
| 1995 | Aucun contact possible avec le SAV | 1 | À l'heure ou j'écris j'en suis au 5eme appel c... | 2014-09-12T14:27:00.000Z |  |  |
| 1996 | A EVITER SAV CATASTROPHIQUE | 1 | J'ai commandé une TV Philips 140 Cm au bout de... | 2014-09-10T15:19:43.000Z |  |  |
| 1997 | Bel appât ! | 1 | Mon ordinateur portable commençant à me lâcher... | 2014-09-10T12:31:16.000Z |  |  |
| 1998 | Publicité mensongère, site online inacceptable... | 1 | des expériences similaires à la mienne ont déj... | 2014-09-05T19:59:34.000Z |  |  |


### **Save the data frame to a CSV file**

In [None]:
all_reviews.to_csv('reviews_carrefour.csv')
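
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. The sketch below illustrates the difference with `index=False` on a one-row sample frame. It writes to in-memory buffers instead of a real file, which is a choice made only to keep the example self-contained.

```python
import io
import pandas as pd

sample = pd.DataFrame({'title': ['SAV incompétent'], 'rating': ['1']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
reloaded_with_index = pd.read_csv(io.StringIO(with_index.getvalue()))

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index.getvalue()))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```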