### **About the project**



Hello, my name is Jacques Benhur ESSISSONGO. I am a BSc student in Data Science, certified through the Google Data Analytics program.

The sole purpose of this project is to give the community of analysts and data scientists a dataset of customer reviews of the largest companies on the Trustpilot website, so that they can perform analyses, build models, and make recommendations.

This data is useful because each review carries a rating, a date, the reviewer's country, and a free-text comment, each of which can be analyzed on its own or in combination.

Customer reviews are comments given to a company based on a customer's experience with the organization. By obtaining and analyzing customer reviews, companies can measure customer satisfaction, identify recurring customer issues, determine areas for improvement in existing strategies, and even discover new trends that can be exploited.

### **Default setup to run Selenium on Colab**

In [None]:
!pip install selenium
!apt-get update # refresh the Ubuntu package index so that apt install works correctly
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


### **Importing the required libraries**


In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

### **Defining the main classes that will perform the extraction**


In [None]:
class Trustpilot:

  def __init__(self,n_run=0):
    self.n_runs=n_run
    self.source_code=None

  def extract(self,ps):
    self.source_code = ps
    self.soup = BeautifulSoup(self.source_code,"html.parser")

    # The extraction is done in two steps. First, find all the tags that meet the criteria and return them as a list.
    # Second, loop over those tags and extract the text they contain.
    self.n_runs+=1 # increment n_runs each time the method is executed

    if self.n_runs==1:
      # if it is the first time that the method is executed

      #get the ratings
      all_ratings = self.soup.find_all("div",class_="star-rating_starRating__4rrcf star-rating_medium__iN6Ty")
      all_ratings.pop(0)
      self.ratings = [rate.find("img")["alt"] for rate in all_ratings]

      #get the usernames, number of reviews and locations
      all_profiles = self.soup.find_all("a", attrs={"name":"consumer-profile"})
      self.profiles=[profile.text for profile in all_profiles ]

      # get the dates
      all_dates = self.soup.find_all("p",class_="typography_body-m__xgxZ_ typography_appearance-default__AAY17 typography_color-black__5LYEn")
      self.dates = [ date.text for date in all_dates]

      # get the comments
      all_comments = self.soup.find_all("p",class_="typography_body-l__KUYFJ typography_appearance-default__AAY17 typography_color-black__5LYEn")
      self.comments = [comment.text for comment in all_comments]
    else:
  
      all_ratings = self.soup.find_all("div",class_="star-rating_starRating__4rrcf star-rating_medium__iN6Ty")
      all_ratings.pop(0)
      all_profiles = self.soup.find_all("a", attrs={"name":"consumer-profile"})
      all_dates = self.soup.find_all("p",class_="typography_body-m__xgxZ_ typography_appearance-default__AAY17 typography_color-black__5LYEn")
      all_comments = self.soup.find_all("p",class_="typography_body-l__KUYFJ typography_appearance-default__AAY17 typography_color-black__5LYEn")

      for rate,profile,date,comment in zip(all_ratings,all_profiles,all_dates,all_comments):
        
        self.ratings.append(rate.find("img")["alt"])
        self.profiles.append(profile.text)
        self.dates.append(date.text)
        self.comments.append(comment.text)

  def is_same_length(self):
    # check the length of all the arrays. If the arrays are not the same size return False
 
    to_check=[self.ratings,self.profiles,self.dates,self.comments]
    lengths = [len(self.ratings),len(self.profiles),len(self.dates),len(self.comments)]

    if max(lengths)==min(lengths):
      return True
    else:
      count1= lengths.count(max(lengths))
      count2= lengths.count(min(lengths))
      if count1<2:
        n = lengths.index(max(lengths))
        self.to_balance = to_check[n]
        self.difference = abs(max(lengths)-min(lengths))

      elif count2<2:
        n =  lengths.index(min(lengths))  
        self.to_balance=to_check[n]
        self.difference = abs(max(lengths)-min(lengths))
      return False

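  def harmony(self):
    # make sure that all the lists have the same length: pad every list that is
    # shorter than the longest one with NaN (padding the longest list would only widen the gap)

    columns = [self.ratings,self.profiles,self.dates,self.comments]
    longest = max(len(column) for column in columns)
    for column in columns:
      column.extend([np.nan]*(longest-len(column)))
    return True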

  def save(self):
    self.harmony()
    collected_data = {"profile":self.profiles,"date":self.dates,"rating out of 5":self.ratings,"comment":self.comments}
    self.df = pd.DataFrame(collected_data)
  
  def clean_df(self):
    df=self.df
    df["date"] = [date.split(":")[1] for index,date in df["date"].items()] # split at ":" and keep only the date
    df["rating out of 5"] = [rate.split(" ")[1] for index,rate in df["rating out of 5"].items()] # keep only the numeric rating
    # the country code is whatever follows "review(s)" in the profile text
    countries= df["profile"].str.split("review",expand=True)[1].str.replace("s","")
    df.insert(1,"country",countries)

    df["date"] = pd.to_datetime(df["date"])
    df["rating out of 5"] = df["rating out of 5"].astype("int") # astype returns a new Series, so assign it back
    df.drop("profile",axis=1,inplace=True)
    self.cleaned_df=df
    self.cleaned_df.to_csv(f"trustpilot{self.n_runs}.csv")
    return self.cleaned_df
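
The padding idea behind `harmony` can be tried in isolation. The sketch below is only an illustration: `pad_to_same_length` is a hypothetical helper that is not part of the class, the sample values are made up, and it assumes the goal is for every list to end up as long as the longest one. Padding at the end does not realign reviews that were missed mid-page; it only lets `pd.DataFrame` accept the lists.

```python
import numpy as np

def pad_to_same_length(*columns):
    """Pad each list in place with NaN until all match the longest one."""
    longest = max(len(column) for column in columns)
    for column in columns:
        column.extend([np.nan] * (longest - len(column)))
    return longest

# Made-up values for illustration only
ratings = ["Rated 5 out of 5 stars", "Rated 1 out of 5 stars"]
comments = ["Great service"]

pad_to_same_length(ratings, comments)
print(len(ratings), len(comments))
```

Rows padded this way still contain NaN, so any later string operation on them has to tolerate missing values.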

In [None]:

class RetailIndex:
  """This class extracts data from the Retail Index website"""
  def __init__(self,ps):
    self.ps = ps
    self.soup = BeautifulSoup(ps,"html.parser")
    self.extracted_data=[]
    self.n = 0
  def extract(self):
    
    # get all the tr tags with the specified attributes
    tr_tag = self.soup.find_all("tr",attrs={"height":"23","style":"height:17.0pt"})
    # get the td tags of each row as a list (a Tag can be searched directly, no need to re-parse it)
    td_tag=[tag.find_all("td") for tag in tr_tag]
    # extract the text from the td tags into one flat list
    self.data=[tag.text for taglist in td_tag for tag in taglist]

  def split_list(self,level):
    # split the flat list into smaller lists of `level` items each (one list per table row)
  
    self.extracted_data.append(self.data[self.n:self.n+level])
    self.n+=level
    if self.n<len(self.data):
      self.split_list(level)
  
  def save(self):
    import csv

    # save the results to a CSV file
    colname=["Rank","Company","Country","Main Sector(s)","2014","2015","2016","2017","2018","2019","2020","Online sales %"]
    # newline="" keeps the csv module from writing blank rows on Windows
    with open("retail_index.csv","w",newline="",encoding="utf-8") as f:
      file_csv = csv.writer(f)
      file_csv.writerow(colname)
      file_csv.writerows(self.extracted_data)

    # reload the file as a DataFrame indexed by rank (pandas is already imported as pd)
    self.df = pd.read_csv("retail_index.csv")
    self.df.set_index("Rank",inplace=True)
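
The row-splitting step in `split_list` can also be written without recursion. The sketch below is a hypothetical alternative, not the method used by the class: `chunk` is an illustrative name, and the sample cells use 3 columns instead of 12 to keep the example short. An iterative version avoids Python's default recursion limit of 1000 frames, which could matter for a table with many rows.

```python
def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items."""
    return [values[start:start + size] for start in range(0, len(values), size)]

# Flat list of cell texts, as produced by reading every <td> in order
cells = ["1", "Amazon", "USA", "2", "Otto Group", "Germany"]
rows = chunk(cells, 3)
print(rows)
```

If the number of cells is not a multiple of `size`, the last row is simply shorter, which mirrors what the slicing in `split_list` does.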

### **Extracting data from Retail Index**

In [None]:
link1 = "https://www.retail-index.com/E-CommerceRetail.aspx"

driver = webdriver.Chrome(options=options)
driver.get(link1)
ps = driver.page_source # ps is the page source

extractor1 = RetailIndex(ps)
extractor1.extract()
extractor1.split_list(12)
extractor1.save()

driver.quit()

In [None]:
extractor1.df

| Rank | Company | Country | Main Sector(s) | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Online sales % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Amazon | USA | All Sectors | 24.23 | 25.6 | 28.65 | 30.4 | 33.744 | 32.185 | 32.185 | 100% |
| 2 | Otto Group | Germany | Fashion | 5.175 | 5.49 | 5.86 | 6.49 | 6.48 | 6.91 | 12.149 | 57% |
| 3 | Zalando | Germany | Fashion, Footwear and Leather | 2.214 | 2.958 | 3.64 | 4.119 | 5.388 | 6.483 | 6.483 | 100% |
| 4 | Apple | USA | Consumer Electronics | 3.75 | 4.0 | 4.2 | 4.77 | 5.65 | 5.75 | 55.604 | 10% |
| 5 | Tesco | United Kingdom | Food/ All Sectors | 3.533 | 4.35 | 4.1 | 4.25 | 4.5 | 4.8 | 53.83 | 9% |
| 6 | Veepee (Vente-Privée) | France | Fashion | 1.7 | 2.0 | 3.0 | 3.3 | 3.7 | 4.0 | 4.0 | 100% |
| 7 | Carrefour | France | Food/ All Sectors | 1.8 | 1.85 | 1.86 | 2.278 | 2.8 | 3.1 | 55.764 | 6% |
| 8 | Ceconomy (Mediamarkt/Saturn) | Germany | Consumer Electronics | 1.5 | 1.766 | 1.952 | 2.407 | 2.592 | 2.935 | 21.455 | 14% |
| 9 | Bol.com (Ahold) | Netherlands | Consumer Electronics/ All Sectors | 680.0 | 900.0 | 1.3 | 1.6 | 2.1 | 2.8 | 2.8 | 100% |
| 10 | E. Leclerc | France | Food/ All Sectors | 1.9 | 2.274 | 2.562 | 2.76 | 2.642 | 2.72 | 38.85 | 7% |


### **Extracting data from Trustpilot**

In [None]:
#connect to the website with selenium
driver2 = webdriver.Chrome(options=options)
driver2.get("https://www.trustpilot.com/review/www.apple.fr?languages=all")
ps2 = driver2.page_source # get the page source

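n=0
# instantiate the Trustpilot class and extract the first page
extractor2 = Trustpilot()
extractor2.extract(ps2)

# mechanism to switch to the next page
try:
  while True:
    n+=1
    element = WebDriverWait(driver2, 10).until(EC.presence_of_element_located((By.LINK_TEXT,"Next page")))
    next_page = element.get_attribute("href")
    if not next_page:
      # assumption: on the last page the "Next page" anchor has no href
      break
    driver2.get(next_page)
    ps2 = driver2.page_source
    extractor2.extract(ps2)
except Exception as e:
  # no "Next page" link was found, or the page failed to load: stop paginating
  print(f"stopped paginating: {type(e).__name__}")
finally:
  print(n) # number of loop iterations
  driver2.quit() # close the Trustpilot browser session once, after the loop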



44


In [None]:
extractor2.save()

In [None]:
extractor2.clean_df()

| | country | date | rating out of 5 | comment |
|---|---|---|---|---|
| 0 | FR | 2022-11-01 | 5 | Top. J’ai commandé un iPad Pro et tout s’est b... |
| 1 | FR | 2022-10-10 | 1 | Service client inefficace, ils nous demandent ... |
| 2 | FR | 2022-10-09 | 5 | J'ai contacté un assistant de chez vous un dim... |
| 3 | SN | 2022-10-05 | 1 | Mon téléphone a été désactivé car…Mon téléphon... |
| 4 | FR | 2022-09-29 | 1 | J’ai déjà publié un avis mais je voulais parta... |
| ... | ... | ... | ... | ... |
| 844 | FR | 2011-12-31 | 5 | correspond à la description, et envoi rapide |
| 845 | FR | 2011-12-29 | 5 | difficile de m'en passer à présent, outil fant... |
| 846 | FR | 2011-12-14 | 5 | Utiliser un appareil Apple devient de plus en ... |
| 847 | FR | 2011-12-07 | 1 | pour les amateurs de la pomme !!! |


### **Reviews Data Cleaning**

In [None]:
def clean_df(dfs):
  # standalone version of Trustpilot.clean_df, for raw data that was saved to disk earlier
  df=dfs
  df["date"] = [date.split(":")[1] for index,date in df["date"].items()] # split at ":" and keep only the date
  df["rating out of 5"] = [rate.split(" ")[1] for index,rate in df["rating out of 5"].items()] # keep only the numeric rating
  countries= df["profile"].str.split("review",expand=True)[1].str.replace("s","")
  df.insert(1,"country",countries)

  df["date"] = pd.to_datetime(df["date"])
  df["rating out of 5"] = df["rating out of 5"].astype("int") # astype returns a new Series, so assign it back
  df.drop("profile",axis=1,inplace=True)
  return df

raw = pd.read_csv("amazon490.csv")
raw.drop("Unnamed: 0",axis=1,inplace=True)
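
To make the cleaning steps concrete, here is a small sketch on made-up rows. The raw string formats (`"Jean3 reviewsFR"`, `"Date of experience: ..."`, `"Rated 5 out of 5 stars"`) are assumptions inferred from the split logic in `clean_df`, not verified samples of the scraped data, and `raw_demo` is a hypothetical name. The sketch swaps the `split`/`replace` country step for a regular expression, which is a different technique from the one used in `clean_df`.

```python
import pandas as pd

# Hypothetical raw rows; the string formats are assumed, not taken from the real scrape
raw_demo = pd.DataFrame({
    "profile": ["Jean3 reviewsFR", "Awa1 reviewSN"],
    "date": ["Date of experience: November 01, 2022",
             "Date of experience: October 05, 2022"],
    "rating out of 5": ["Rated 5 out of 5 stars", "Rated 1 out of 5 stars"],
    "comment": ["Top", "Service client inefficace"],
})

cleaned = pd.DataFrame({
    # two capital letters right after "review" or "reviews" at the end of the profile text
    "country": raw_demo["profile"].str.extract(r"reviews?([A-Z]{2})$")[0],
    # keep the text after the colon and parse it as a date
    "date": pd.to_datetime(raw_demo["date"].str.split(":").str[1].str.strip()),
    # second word of the rating label is the numeric rating
    "rating out of 5": raw_demo["rating out of 5"].str.split(" ").str[1].astype(int),
    "comment": raw_demo["comment"],
})
print(cleaned)
```

Anchoring the pattern on the two-letter code avoids stripping every letter "s" from the remainder of the profile text.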

In [None]:
clean_df(raw)

| | country | date | rating out of 5 | comment |
|---|---|---|---|---|
| 0 | FR | 2022-11-07 | 1 | J'ai toujours le même problème. On me raccroch... |
| 1 | FR | 2022-11-07 | 5 | Je commande 1 à 3 fois par an. Tout s'est touj... |
| 2 | FR | 2022-11-07 | 1 | si je pouvias mettre 0 je le ferais incompéten... |
| 3 | FR | 2022-11-07 | 1 | Terminé, je ne commanderais plus JAMAIS chez A... |
| 4 | FR | 2022-09-22 | 1 | Cela fait des années que je suis cliente chez ... |
| ... | ... | ... | ... | ... |
| 9610 | FR | 2011-02-04 | 5 | plusieurs commande également sur ce web marcha... |
| 9611 | FR | 2010-11-29 | 5 | Site qui centralise des vendeurs en ligne de n... |
| 9612 | FR | 2010-04-28 | 5 | J'adore Amazon, fiable, rapide, sérieux, on tr... |
| 9613 | NL | 2010-01-07 | 4 | Snel en goede service. |
