# **Web scraping script**

**How is the data used in a sentiment analysis usually obtained?**

This notebook covers the web scraping step: a technique for extracting data from websites, here storing it in a JSON file. I used OOP to create a class that scrapes Amazon reviews based on the ASIN (Amazon's product ID) given as input.

---

In this case I will scrape the Amazon reviews of a single product: a pair of speakers.

Reviews URL: 'https://www.amazon.co.uk/product-reviews/B075QVMBT9/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=1'
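
A URL of this shape can be assembled from an ASIN and a page number. The following is a minimal sketch, not the code used in this notebook: `build_reviews_url` and its `domain` parameter are hypothetical names introduced only for illustration, and the query parameters are copied from the URL above.

```python
from urllib.parse import urlencode


# Hypothetical helper, introduced for illustration only.
def build_reviews_url(asin: str, page: int, domain: str = "www.amazon.co.uk") -> str:
    """Return the reviews URL for one product (ASIN) and one page number."""
    query = urlencode({
        "ie": "UTF8",
        "reviewerType": "all_reviews",
        "sortBy": "recent",
        "pageNumber": page,
    })
    return f"https://{domain}/product-reviews/{asin}/ref=cm_cr_dp_d_show_all_btm?{query}"


first_page_url = build_reviews_url("B075QVMBT9", 1)
print(first_page_url)
```

Only the ASIN and the page number change between requests, which is what makes iterating over the review pages straightforward.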

---
#### Steps:
- Get the URL of the page to be scraped
- Inspect the elements of the page and identify the tags required
- Access the URL
- Get the element from the required tags

## Importing dependencies

In [9]:
from requests_html import HTMLSession
import json
import time

Since the user agent is private, I store my personal user agent in an environment variable and load it through the .env file.

In [10]:
from dotenv import load_dotenv
import os
load_dotenv()
USER_AGENT = os.getenv("USER-AGENT")
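
`os.getenv` returns `None` when the variable is missing, in which case the scraper would silently run without the intended user agent. One possible safeguard is sketched below; `get_user_agent` is a hypothetical helper that is not part of the original notebook.

```python
import os


# Hypothetical helper: fail early instead of scraping with a missing user agent.
def get_user_agent(var_name: str = "USER-AGENT") -> str:
    """Read the user agent from the environment and raise if it is not set."""
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; check the .env file."
        )
    return value
```

With this in place, `USER_AGENT = get_user_agent()` could replace the direct `os.getenv` call, assuming `load_dotenv()` has already been run.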

## Reviews class

#### Class explanation:
- **What is the ASIN?** For Amazon, the ASIN (Amazon Standard Identification Number) is basically the product identifier.
- **User agents:** When scraping a website, you also need to set a user agent on every request, as otherwise the website may block your requests because it can tell you aren't a real user. Without one, a default library user agent may clearly identify your requests as coming from a web scraper, so the website can easily block you from scraping the site. I found my user agent by googling 'my user agent'.
- **URL:** For the self.url variable I replaced the ASIN with the self.asin variable to make this script usable for any product (given its ASIN). Note also that I left 'pageNumber=' at the end of the URL to make it easier to iterate over all the review pages, which have different URLs.
---

The **pagination** method uses the session object created in the constructor to request the URL of the given page (obtained by concatenating self.url and the page number). The if statement means the following: if no reviews are found at that URL, return False (so when this method is used in a loop, a False result can be used to break out of the loop); otherwise return the review elements found on the page. This is really useful when looping through many pages without knowing how many pages there are.
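
The break-on-False pattern described here can be sketched without touching the network. In the sketch below, `scrape_until_empty` and `fake_pagination` are hypothetical names, with `fake_pagination` standing in for the real method. Note that the main loop further down takes a slightly different route and skips empty pages instead of breaking.

```python
from typing import Callable, List, Union


def scrape_until_empty(get_page: Callable[[int], Union[list, bool]],
                       max_pages: int = 100) -> List[list]:
    """Call get_page(1), get_page(2), ... and stop at the first empty page."""
    pages = []
    for page in range(1, max_pages + 1):
        reviews = get_page(page)
        if not reviews:          # False (or an empty list): nothing left to scrape
            break
        pages.append(reviews)
    return pages


# Stand-in for Reviews.pagination: three pages of fake reviews, then nothing.
def fake_pagination(page: int) -> Union[list, bool]:
    if page <= 3:
        return [f"review {page}-{n}" for n in range(1, 3)]
    return False


collected = scrape_until_empty(fake_pagination)
print(len(collected), "pages collected")
```

Breaking at the first empty page saves requests, but it also stops early if a single page in the middle is blocked, which may be one reason to prefer skipping.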

---

The **parse** method pulls the needed information out of each review, puts it into a dictionary and appends that dictionary to a list. The result is a list of dictionaries, one per review, holding the title, star rating, place and date, and content. When running the code I ran into many different errors, so I decided to add a try/except to handle these problems.
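
The sketch below illustrates why a try/except is needed. It assumes that `.find(..., first=True)` returns `None` when a selector matches nothing, so the following `.text` access raises an `AttributeError`. `FakeElement` is a hypothetical stand-in for a scraped review element, used so that the example runs without any network access.

```python
from types import SimpleNamespace


class FakeElement:
    """Hypothetical stand-in for a review element: find() returns an object with .text, or None."""

    def __init__(self, fields):
        self.fields = fields

    def find(self, selector, first=False):
        text = self.fields.get(selector)
        return SimpleNamespace(text=text) if text is not None else None


def parse_reviews(reviews):
    """Keep only the reviews for which every selector matched."""
    parsed = []
    for review in reviews:
        try:
            title = review.find('a[data-hook=review-title] span', first=True).text
            body = review.find('span[data-hook=review-body] span', first=True).text
        except AttributeError:      # find() returned None, so .text failed
            continue
        parsed.append({'title': title, 'body': body.replace('\n', '.').strip()})
    return parsed


complete = FakeElement({
    'a[data-hook=review-title] span': 'Good sound',
    'span[data-hook=review-body] span': 'Line one\nLine two',
})
incomplete = FakeElement({'a[data-hook=review-title] span': 'No body here'})
parsed = parse_reviews([complete, incomplete])
print(parsed)
```

Reviews with a missing field are dropped rather than stored with gaps, which matches the behavior of the class below.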

---

Lastly, the **save** method allows me to save the results in a json file.
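
As a small illustration of the save step, the sketch below writes a list of reviews to a JSON file and reads it back. `save_reviews` and `load_reviews` are hypothetical helpers, and the `encoding='utf-8'` and `ensure_ascii=False` arguments are my own additions (not used in the class below) so that characters such as flag emoji stay readable in the file.

```python
import json
import tempfile
from pathlib import Path


def save_reviews(results, path):
    """Write the scraped results to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


def load_reviews(path):
    """Read the results back from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


sample = [[{'title': 'Good sound',
            'rating': '5.0 out of 5 stars',
            'place and date': 'Reviewed in the United Kingdom 🇬🇧 on 28 May 2023',
            'body': 'Very happy with purchase.'}]]

# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'B075QVMBT9_reviews.json'
    save_reviews(sample, path)
    restored = load_reviews(path)

print(restored == sample)
```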

---

**Note:** To select the needed pieces of the 'div' tag I used the CSS selector syntax for simplicity (https://requests.readthedocs.io/projects/requests-html/en/latest/), which is supported by the requests_html library. This allows me to use the .find() method in a much cleaner way than by passing the class of the 'div' element, which is usually long and complex. For example:
- a[target=value], where 'a' is the tag I am 'navigating' and the brackets contain an attribute and the value it must have for the element to be selected.

In [11]:
class Reviews:
    def __init__(self, asin) -> None:
        self.session = HTMLSession()        # session object that is used continuously whenever this class instance is called
        self.headers = {'User-Agent': USER_AGENT}
        self.asin = asin
        self.url = f'https://www.amazon.co.uk/product-reviews/{self.asin}/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='


    def pagination(self, page):
        response = self.session.get(self.url + str(page), headers=self.headers)     # request the given reviews page
        # check whether the page contains any reviews, using a CSS selector
        reviews = response.html.find('div[data-hook=review]')
        if not reviews:
            return False
        return reviews
    

    def parse(self, reviews):
        total = []
        # loop over the reviews and extract the data that will later be written to a .json file
        for review in reviews:
            # .find() returns a list; with first=True it returns only the first match
            # .text converts the element into text (string)
            try:
                title = review.find('a[data-hook=review-title] span', first=True).text
                rating = review.find('i[data-hook=review-star-rating] span', first=True).text   
                # the trailing 'span' selects the span inside i[data-hook=review-star-rating], which holds the star rating of the review
                place_date = review.find('span[data-hook="review-date"]', first=True).text
                content = review.find('span[data-hook=review-body] span', first=True).text.replace('\n', '.').strip()
                # .replace('\n', '.') turns newlines into full stops and .strip() removes surrounding whitespace
            except Exception:
                continue        # skip reviews with missing fields (typically a selector matched nothing)

            data = {'title': title,
                    'rating': rating,
                    'place and date': place_date,
                    'body': content[:2000]}      # crop excessively long reviews
            total.append(data)
        return total
    

    def save(self, results):
        # open (or create) the file '<ASIN>_reviews.json' in write mode ('w') and dump all the results into it
        # f is the handle of the opened file
        with open(self.asin + '_reviews.json', 'w') as f:
            json.dump(results, f)


## Scraping time!

In [12]:
if __name__ == '__main__':              # run only when executed directly, not when imported as a module
    amz = Reviews('B075QVMBT9')
    results = []
    for i in range(1,101):              # pages 1 to 100 with 10 reviews on each page, so at most the first 1000 reviews
        print('Getting page', i)
        time.sleep(0.3)                 # short pause between requests
        reviews = amz.pagination(i)
        if reviews:                     # only parse pages that actually contain reviews
            results.append(amz.parse(reviews))

Getting page 1
Getting page 2
Getting page 3
Getting page 4
Getting page 5
Getting page 6
Getting page 7
Getting page 8
Getting page 9
Getting page 10
Getting page 11
Getting page 12
Getting page 13
Getting page 14
Getting page 15
Getting page 16
Getting page 17
Getting page 18
Getting page 19
Getting page 20
Getting page 21
Getting page 22
Getting page 23
Getting page 24
Getting page 25
Getting page 26
Getting page 27
Getting page 28
Getting page 29
Getting page 30
Getting page 31
Getting page 32
Getting page 33
Getting page 34
Getting page 35
Getting page 36
Getting page 37
Getting page 38
Getting page 39
Getting page 40
Getting page 41
Getting page 42
Getting page 43
Getting page 44
Getting page 45
Getting page 46
Getting page 47
Getting page 48
Getting page 49
Getting page 50
Getting page 51
Getting page 52
Getting page 53
Getting page 54
Getting page 55
Getting page 56
Getting page 57
Getting page 58
Getting page 59
Getting page 60
Getting page 61
Getting page 62
Getting page 63
...

In [13]:
results

[[{'title': 'Good sound',
   'rating': '5.0 out of 5 stars',
   'place and date': 'Reviewed in the United Kingdom 🇬🇧 on 28 May 2023',
   'body': 'Seem decent build quality and good sound. Very happy with purchase.'},
  {'title': "I didn't realise how bad my audio setup was",
   'rating': '5.0 out of 5 stars',
   'place and date': 'Reviewed in the United Kingdom 🇬🇧 on 27 May 2023',
   'body': "Considering I used to have quite a respectable setup many years ago I've feel into a trap of using Bluetooth speakers which leave a lot to be desired. For the price these provide a perfect audio experience. Now I'm only filling a 3x3m room so hardly an auditorium but the quality is amazing. I do feel like I could do with a subwoofer for some genres but for the average listening experience, without the neighbours complaining, it's beautiful. Almost brings a tear to my eye."},
  {'title': 'Its a beuaty',
   'rating': '5.0 out of 5 stars',
   'place and date': 'Reviewed in the United Kingdom 🇬🇧 on 26

All the requested pages have now been processed, but some of them almost certainly returned nothing (for example because Amazon blocked the request): let's check which pages were not scraped.

In [16]:
ind=[]
for i, page_reviews in enumerate(results):     # i is zero-based, so index 74 corresponds to page 75
    if len(page_reviews) == 0:
        ind.append(i)
print("The non-scraped pages are: ", ind)

The non-scraped pages are:  [74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


That is a solid result: more than 730 reviews! The next step is to save the reviews in a JSON file, which will later be converted to a pandas DataFrame for the sentiment analysis.

In [15]:
amz.save(results)
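
Because `results` holds one list per page, the saved JSON is a list of lists. Before the later analysis it will probably be convenient to flatten it into a single list of review dictionaries. The sketch below is one way to do that with the standard library; `flatten_pages` is a hypothetical helper, and the sample data is invented purely for illustration.

```python
from itertools import chain


def flatten_pages(pages):
    """Turn a list of per-page review lists into one flat list of review dicts."""
    return list(chain.from_iterable(pages))


# Invented sample: two scraped pages and one empty page in between.
pages = [
    [{'title': 'Good sound'}, {'title': 'Its a beuaty'}],
    [],
    [{'title': 'Great value'}],
]
flat = flatten_pages(pages)
print(len(flat), "reviews in total")
```

Empty pages simply contribute nothing, and a flat list of dictionaries like this can be passed directly to `pandas.DataFrame(...)`.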