# Scraping Amazon Reviews
This notebook can be used to scrape reviews of a product on Amazon.

> [!CAUTION]
> This code is intended for **educational and research purposes only**. The use of this code to scrape Amazon may violate their [Terms of Service](https://www.amazon.com/gp/help/customer/display.html?nodeId=508088) and could lead to legal consequences.
> By using this script, you agree that you understand these warnings and disclaimers.

## Getting the URL
The `get_url` function returns the URL of a page containing up to $10$ reviews of the given product, filtered by the given number of stars.

> [!NOTE]
> This script was tested on January 16, 2025. Web scraping depends on the structure of the target website, which changes over time, so `get_url` may stop returning the correct URL in the future.

For a given filter configuration, Amazon limits the number of pages to $10$, which means at most $100$ reviews. To collect more, I filter by the number of stars: if available, this script collects up to 500 reviews in total, 100 for each rating from 1 to 5 stars.
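
The arithmetic behind this plan can be sketched as follows. The constant names `REVIEWS_PER_PAGE`, `MAX_PAGES`, and `STAR_RATINGS` are illustrative and simply encode the limits described above; they are not values read from Amazon.

```python
from itertools import product

REVIEWS_PER_PAGE = 10        # assumption: reviews per page, as described above
MAX_PAGES = 10               # assumption: page cap per filter configuration
STAR_RATINGS = range(1, 6)   # 1-star to 5-star filters

# Every (stars, page) combination the scraper would request
plan = list(product(STAR_RATINGS, range(1, MAX_PAGES + 1)))

print(len(plan))                      # number of requests
print(len(plan) * REVIEWS_PER_PAGE)   # upper bound on collected reviews
```

This gives 50 requests and an upper bound of 500 reviews. A product with few reviews for a given rating will yield fewer.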

> [!IMPORTANT]
> Amazon requires the user to be logged in to view the dedicated reviews page. Therefore, you need to log in with your browser and export your cookies as a JSON file named `cookies.json` in the `/scraper` directory. I used [*Cookie-Editor*](https://cookie-editor.com/) to do so.
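
As a rough illustration of what the export needs to contain, the sketch below converts a cookie export into the `{name: value}` mapping that `requests` accepts. It assumes the export is a JSON array of objects that each carry a `name` and a `value` key, which is what the cell that loads the cookies expects. The helper name `cookies_from_export` and the example cookies are made up for illustration.

```python
import json


def cookies_from_export(raw_json: str) -> dict[str, str]:
    """Turn an exported cookie list into a {name: value} mapping."""
    entries = json.loads(raw_json)
    # Other fields (domain, path, expiry, ...) are ignored in this sketch
    return {entry["name"]: entry["value"] for entry in entries}


# Made-up export with placeholder values (not real cookies)
example_export = json.dumps([
    {"name": "example-session", "value": "abc123", "domain": ".amazon.com"},
    {"name": "example-token", "value": "placeholder", "domain": ".amazon.com"},
])

print(cookies_from_export(example_export))
```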


In [1]:
import json
import requests

In [53]:
PRODUCT_NAME = 'HP-OfficeJet-Printer-Thermal-Inkjet'
PRODUCT_ID = 'B08XW7LT9P'


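def get_url(product_name: str = PRODUCT_NAME, product_id: str = PRODUCT_ID, pageNumber: int = 1, stars: int = 1) -> str:
    """Return the URL of review page `pageNumber`, filtered to `stars`-star reviews."""
    # The filterByStar parameter spells the rating out ('one_star', 'two_star', ...)
    star_conversion = {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}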
    return f"https://www.amazon.com/{product_name}/product-reviews/{product_id}/ref=cm_cr_getr_d_paging_btm_next_{pageNumber}?ie=UTF8&reviewerType=all_reviews&pageNumber={pageNumber}&sortBy=helpful&filterByStar={star_conversion[stars]}_star"
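
To make the query parameters easier to read, here is a minimal sketch of the same URL assembled with `urllib.parse.urlencode`. The function name `build_review_url`, the `STAR_WORDS` table, and the range check are assumptions added for illustration. The URL layout itself is copied from `get_url` and may stop working if Amazon changes it.

```python
from urllib.parse import urlencode

STAR_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}


def build_review_url(product_name: str, product_id: str, page: int = 1, stars: int = 1) -> str:
    """Hypothetical variant of get_url that builds the query string with urlencode."""
    if stars not in STAR_WORDS:
        raise ValueError(f"stars must be between 1 and 5, got {stars}")
    query = urlencode({
        "ie": "UTF8",
        "reviewerType": "all_reviews",
        "pageNumber": page,
        "sortBy": "helpful",
        "filterByStar": f"{STAR_WORDS[stars]}_star",
    })
    path = f"{product_name}/product-reviews/{product_id}/ref=cm_cr_getr_d_paging_btm_next_{page}"
    return f"https://www.amazon.com/{path}?{query}"


print(build_review_url("HP-OfficeJet-Printer-Thermal-Inkjet", "B08XW7LT9P", page=2, stars=3))
```

Unlike a bare dictionary lookup, this variant fails with an explicit error message when `stars` is outside the 1 to 5 range.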

In [23]:
# Load the exported login cookies into the {name: value} dict that requests expects
with open('cookies.json') as f:
    cookies = {cookie['name']: cookie['value'] for cookie in json.load(f)}

In [56]:
reviews_raw = {}

In [58]:
# Set headers to mimic a Firefox client
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
}

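# Make the requests: 5 star ratings x 10 pages each
stars = list(range(1, 6))
pages = list(range(1, 11))
# random.shuffle(pages)  # optional: randomize the page order (requires `import random`)
for star in stars:
    for page in pages:
        # Store the raw HTML of each page under the key '<stars>:<page>'
        reviews_raw[f'{star}:{page}'] = requests.get(get_url(pageNumber=page, stars=star),
                                                     cookies=cookies, headers=headers).text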

# Save the scraping results
with open('reviews_raw.json', 'w') as f:
    json.dump(reviews_raw, f, indent=4)
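
Because `reviews_raw` is created in its own cell, the request loop can be re-run without losing earlier results. The sketch below shows one possible way to use that: it skips keys that are already present and pauses between requests. The names `collect_pages`, `fake_fetch`, and `delay_seconds` are hypothetical, the one-second default is an arbitrary choice, and the fetcher is passed in as a function so the example runs without network access.

```python
import time
from typing import Callable


def collect_pages(fetch: Callable[[int, int], str],
                  results: dict[str, str],
                  stars=range(1, 6),
                  pages=range(1, 11),
                  delay_seconds: float = 1.0) -> dict[str, str]:
    """Fill `results` with one entry per (star, page), skipping keys already present."""
    for star in stars:
        for page in pages:
            key = f"{star}:{page}"
            if key in results:
                continue  # already fetched in an earlier run
            results[key] = fetch(star, page)
            time.sleep(delay_seconds)  # pause between requests
    return results


# Stand-in fetcher so the sketch runs offline
def fake_fetch(star: int, page: int) -> str:
    return f"<html>stars={star} page={page}</html>"


partial = {"1:1": "<html>cached</html>"}
collected = collect_pages(fake_fetch, partial, delay_seconds=0)
print(len(collected))
```

In this notebook, `fetch` could be `lambda star, page: requests.get(get_url(pageNumber=page, stars=star), cookies=cookies, headers=headers).text`, with `reviews_raw` passed as `results`.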