# NLP HughesNet Project Webscraping Part

This notebook shows how I scraped Trustpilot reviews of the internet company HughesNet. I chose this company because it has received roughly equal amounts of positive and negative feedback, which makes its reviews good training data for an ML model.

To start, let's import the required libraries:

In [7]:
from bs4 import BeautifulSoup as bs
import requests
import random
import pandas as pd
import time 

The first page we're going to scrape is https://www.trustpilot.com/review/hughesnet.com?page=1, which contains 20 reviews. There are 3847 pages in total, so we can expect to scrape around 77k reviews. Along with the text, we'll also scrape the star rating (from 1 to 5) of each review.
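To make the parsing step concrete before running it on the live site, here is a minimal sketch that extracts the rating and the text from each review. The HTML snippet is a made-up stand-in for a Trustpilot page (an assumption for illustration, not real site markup), and `parse_reviews` is a hypothetical helper name. Only the `article` tag, the CSS class and the `data-service-review-rating` attribute mirror the selectors used in the scraping loop further down.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one Trustpilot results page (not real site markup).
SAMPLE_HTML = """
<html><body>
  <article>
    <div class="styles_reviewHeader__iU9Px" data-service-review-rating="4"></div>
    <p>Installation was quick and the technician was helpful.</p>
  </article>
  <article>
    <div class="styles_reviewHeader__iU9Px" data-service-review-rating="1"></div>
    <p>Constant outages during the first month.</p>
  </article>
</body></html>
"""


def parse_reviews(html):
    """Return a list of (stars, text) tuples, one per <article> element."""
    page = BeautifulSoup(html, "html.parser")
    results = []
    for article in page.find_all("article"):
        header = article.find("div", class_="styles_reviewHeader__iU9Px")
        paragraph = article.find("p")
        if header is None or paragraph is None:
            continue  # skip articles that lack a rating or a text body
        results.append((header["data-service-review-rating"], paragraph.text))
    return results


reviews = parse_reviews(SAMPLE_HTML)
for stars, review_text in reviews:
    print(stars, review_text)
```

Note that the rating comes back as a string because it is read from an HTML attribute. The `None` checks are an extra safeguard in this sketch, so that an article without a rating or a paragraph is skipped rather than raising an error.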

Unfortunately, Trustpilot has a bot detection system that temporarily denies us access after a certain number of requests. We can work around it by waiting 10 minutes every time a request gets denied. After that, a boolean flag makes the loop automatically resume from the page where it left off.
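The wait-and-retry idea can also be sketched in isolation. The helper below is a hypothetical illustration, not part of `requests` or of any Trustpilot API: `fetch_with_retry`, `fetch`, `is_blocked` and `max_attempts` are names assumed for this example. Unlike the loop in this notebook, which retries indefinitely, the sketch gives up after a fixed number of attempts, and the demo uses a fake fetcher with a no-op sleep so it runs instantly.

```python
import time


def fetch_with_retry(fetch, is_blocked, wait_seconds=600, max_attempts=5,
                     sleep=time.sleep):
    """Call fetch() until is_blocked(result) is False, sleeping between attempts.

    All names here are hypothetical helpers introduced for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if not is_blocked(result):
            return result
        if attempt < max_attempts:
            print(f"Blocked on attempt {attempt}, retrying in {wait_seconds} s")
            sleep(wait_seconds)
    raise RuntimeError(f"Still blocked after {max_attempts} attempts")


# Demo with a fake fetcher: the first two calls look blocked (empty list),
# the third returns data. A no-op sleep keeps the demo instantaneous.
responses = iter([[], [], ["review 1", "review 2"]])
result = fetch_with_retry(
    fetch=lambda: next(responses),
    is_blocked=lambda revs: len(revs) == 0,
    sleep=lambda seconds: None,
)
print(result)
```

In a real run, `fetch` would wrap the request and parsing of a single page, and `is_blocked` would be the same "no `article` elements found" check used in the loop.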

In [2]:
sentiment = []  # star ratings, one per review
text = []  # review texts

for i in range(1, 3848):
    restart = True  # flag that keeps retrying the current page until it loads
    while restart:
        restart = False
        url = f"https://www.trustpilot.com/review/hughesnet.com?page={i}"
        req = requests.get(url).text
        page = bs(req, "html.parser")

        revs = page.find_all("article")  # all the reviews on the page (20)

        if len(revs) == 0:  # an empty list means we have been detected
            print(f"DETECTED AT {i}")
            print("RESTARTING IN 10 MINUTES")
            time.sleep(600)  # 10-minute timer
            # the page number i stays the same, so the same page is retried
            restart = True  # once the timer runs out, the while loop runs again

        for rev in revs:
            stars = rev.find("div", class_="styles_reviewHeader__iU9Px")  # scrape the stars
            stars = stars["data-service-review-rating"]
            sentiment.append(stars)

            txt = rev.find("p").text  # scrape the text
            text.append(txt)

    if i % 100 == 0:  # progress message, printed once per page
        print(f"Page {i} completed")


Page 100 completed
Page 200 completed
DETECTED AT 244
RESTARTING IN 10 MINUTES
Page 300 completed
Page 400 completed
DETECTED AT 421
RESTARTING IN 10 MINUTES
Page 500 completed
DETECTED AT 581
RESTARTING IN 10 MINUTES
Page 600 completed
Page 700 completed
DETECTED AT 765
RESTARTING IN 10 MINUTES
Page 800 completed
Page 900 completed
DETECTED AT 944
RESTARTING IN 10 MINUTES
Page 1000 completed
Page 1100 completed
DETECTED AT 1179
RESTARTING IN 10 MINUTES
Page 1200 completed
Page 1300 completed
DETECTED AT 1373
RESTARTING IN 10 MINUTES
Page 1400 completed
Page 1500 completed
DETECTED AT 1558
RESTARTING IN 10 MINUTES
Page 1600 completed
Page 1700 completed
DETECTED AT 1732
RESTARTING IN 10 MINUTES
Page 1800 completed
Page 1900 completed
DETECTED AT 1911
RESTARTING IN 10 MINUTES
Page 2000 completed
DETECTED AT 2088
RESTARTING IN 10 MINUTES
Page 2100 completed
Page 2200 completed
DETECTED AT 2259
RESTARTING IN 10 MINUTES
Page 2300 completed
Page 2400 completed
DETECTED AT 2436
RESTARTING IN

Done! Let's put all this data into a dataframe:

In [4]:
df = pd.DataFrame({"Text":text, "Sentiment":sentiment})
df.head()

| | Text | Sentiment |
|---|---|---|
| 0 | The efforts of the Hughes net team and field t... | 3 |
| 1 | Our "Local" bank merged with an out of town gr... | 5 |
| 2 | I requsted my service to be suspended while I ... | 5 |
| 3 | This rep had the knowledge and power to resolv... | 5 |
| 4 | Started out good ,got disconnected next day ,W... | 5 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76940 entries, 0 to 76939
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Text       76940 non-null  object
 1   Sentiment  76940 non-null  object
dtypes: object(2)
memory usage: 1.2+ MB


Looks like we have a reasonably large amount of data and zero null values! Let's export it as a CSV file for future use:

In [6]:
df.to_csv("D:/Utente/Documenti/datasets/trustpilot_nlp.csv")
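Two details are worth keeping in mind when the file is loaded again: `to_csv` writes the index as an unnamed first column by default, and the star ratings were scraped as strings. The sketch below illustrates both on a tiny made-up frame, using an in-memory buffer instead of the real file. The frame contents and the names `df_demo`, `buffer` and `reloaded` are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the scraped data (ratings are strings).
df_demo = pd.DataFrame(
    {
        "Text": ["Great service", "Slow connection", "Great service"],
        "Sentiment": ["5", "1", "5"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer)  # default index=True writes the index as the first column
buffer.seek(0)

# index_col=0 reads that first column back as the index
# instead of as a column named "Unnamed: 0".
reloaded = pd.read_csv(buffer, index_col=0)

# pandas infers a numeric dtype for the ratings when parsing the CSV.
print(reloaded["Sentiment"].dtype)
print(reloaded["Sentiment"].value_counts().sort_index())
```

Passing `index=False` to `to_csv` is an alternative that avoids writing the index column in the first place.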

That's it for the web scraping part. Check out the other notebook for the sentiment model!