# Scraping Rate Your Music Reviews

> *Advanced Customer Analytics*  
> *MSc in Data Science, Department of Informatics*  
> *Athens University of Economics and Business*

---

Select an English-language website that hosts customer reviews of products (or services, businesses, movies, events, etc.). Make sure that the website includes a free-text search box for finding products. Create a first Python notebook with a function called <code>scrape()</code>. The function should accept a query (a word or short phrase) as a parameter. It should then use ***selenium*** to (1) submit the query to the website's search box and retrieve the list of matching products, and (2) access the first product on the list and download all its reviews into a CSV file. For each review, the function should get the text, the rating, and the date: one line per review, three fields per line.

---

##### *Libraries*

In [1]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import csv
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from langdetect import detect

##### *Function to clear a possible cookie overlay*

In [2]:
def clear_consent_overlay():
    
    try:  # click the consent button if a cookie overlay is present
        consent_button = driver.find_element(by=By.CSS_SELECTOR,
                                             value='button[class="fc-button fc-cta-consent fc-primary-button"]')
        consent_button.click()
    except NoSuchElementException:  # continue if there is no overlay
        pass

##### *Function to get the reviews from each page*

In [3]:
def get_page_reviews(writer: csv.writer):
    """
    Extracts the reviews on the current page and writes the English-language ones to a CSV file.

    Parameters:
    - writer: CSV writer object to handle writing rows to a CSV file.

    Returns:
    None
    """

    # find_elements returns an empty list (it does not raise) when nothing matches
    reviews = driver.find_elements(by=By.CSS_SELECTOR, value='div[class="review"]')
    if not reviews:
        print('No Review Found In This Page')

    for review in reviews:

        valid = False

        content, rating, date = 'NA', 'NA', 'NA'

        try:
            content_box = review.find_element(by=By.CSS_SELECTOR, value='span[itemprop="description"]')
            content = content_box.text
            # keep only non-blank reviews that are detected as English
            if content.strip() and detect(content) == 'en':
                valid = True
        except NoSuchElementException as e:  # review content could not be found
            print('Could not Extract Review Content')

        try:
            rating_box = review.find_element(by=By.CSS_SELECTOR, value='span[class="review_rating"]')
            rating = rating_box.get_attribute('content')
        except NoSuchElementException as e:  # review rating could not be found
            print('Could not Extract Review Rating')

        try:
            date_box = review.find_element(by=By.CSS_SELECTOR, value='span[class="review_date"]')
            date = date_box.get_attribute('content')
        except NoSuchElementException as e:  # review date could not be found
            print('Could not Extract Review Date')

        # write a new row only if the review text is non-empty and in English
        if valid:
            writer.writerow([rating, date, content])
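
The validity check above can be isolated into a small pure function, which makes it testable without a browser. The following is a minimal sketch under stated assumptions: `is_valid_review`, its `detect_language` parameter, and `fake_detect` are hypothetical names that do not appear in the notebook. In the notebook, `detect` from `langdetect` would be passed in as the detector. The sketch also assumes that any error raised by the detector should count as "not English", since a detector may fail on text that contains no letters.

```python
from typing import Callable


def is_valid_review(content: str, detect_language: Callable[[str], str]) -> bool:
    """Return True when the review text is non-blank and detected as English."""
    if not content or not content.strip():
        return False
    try:
        return detect_language(content) == 'en'
    except Exception:
        # assumption: a detector failure means the text is unusable
        return False


def fake_detect(text: str) -> str:
    """Stand-in detector for illustration only, not a real language model."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError('no features in text')
    return 'en' if 'the' in text.lower().split() else 'xx'


print(is_valid_review('I love the album', fake_detect))
print(is_valid_review('12345', fake_detect))
```

Passing the detector as an argument keeps the filtering rule separate from the library that implements it, so the rule can be checked with a stub like the one shown.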

##### *Function to scrape the reviews*

In [4]:
def scrape(query: str, delay: int = 2):
    """
    Scrapes reviews for a given query from the RateYourMusic website.

    Parameters:
    - query: The search term for reviews.
    - delay: Time delay in seconds between page navigation.

    Returns:
    None
    """

    # open the output CSV; the with-block closes the file even if scraping fails
    # (the file name is hard-coded to the example album)
    with open('The Car Reviews.csv', 'w', encoding='utf-8', newline='') as fw:
        writer = csv.writer(fw, lineterminator='\n')
        writer.writerow(['Rating', 'Date', 'Content'])

        url = 'https://rateyourmusic.com/search?searchterm=' + query
        driver.get(url)

        clear_consent_overlay()

        # get the links on the results page whose text matches the album title
        # (hard-coded to the example album, 'The Car')
        album_links = driver.find_elements(By.LINK_TEXT, 'The Car')

        if not album_links:  # nothing to scrape, so stop here
            print('Could Not Get Any Links')
            return

        driver.get(album_links[0].get_attribute('href'))

        clear_consent_overlay()

        page_cnt = 1  # keep track of page count

        while True:

            print('page', page_cnt)  # print the current page count

            page_cnt += 1  # increment

            get_page_reviews(writer)  # get reviews from the current page

            time.sleep(delay)

            clear_consent_overlay()

            # three navigation bars with the same functionality, we always use the first
            next_buttons = driver.find_elements(by=By.CSS_SELECTOR, value='a[class="navlinknext"]')

            if not next_buttons:  # no next button means this was the last page
                print('No more review pages')
                break

            driver.get(next_buttons[0].get_attribute('href'))  # go to the next review page
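
Because `scrape()` builds the search URL by string concatenation, the caller has to percent-encode the query by hand. One possible alternative is to let the standard library do the encoding. This is only a sketch: `build_search_url` and `SEARCH_BASE` are hypothetical names, the `searchterm` parameter is taken from the notebook, and whether the site needs further parameters (such as `searchtype`) has not been verified here.

```python
from urllib.parse import quote, urlencode

# base URL taken from the notebook
SEARCH_BASE = 'https://rateyourmusic.com/search'


def build_search_url(query: str) -> str:
    """Percent-encode a plain-text query and append it to the search URL."""
    # quote_via=quote encodes spaces as %20, matching the notebook's style
    return SEARCH_BASE + '?' + urlencode({'searchterm': query}, quote_via=quote)


print(build_search_url('arctic monkeys the car'))
# → https://rateyourmusic.com/search?searchterm=arctic%20monkeys%20the%20car
```

With a helper like this, the function could be called with a readable query such as `'arctic monkeys the car'`, and characters like `&` or `/` in a query would not be misread as part of the URL structure.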

##### *Execution*

In [8]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()

try:
    scrape('arctic%20monkeys%20the%20car&searchtype=')
finally:
    driver.quit()  # close the browser even if scraping fails

page 1
page 2
page 3
page 4
page 5
page 6
page 7
page 8
page 9
page 10
page 11
page 12
page 13
page 14
No more review pages
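
The assignment asks for three fields per review, so it can be useful to check the output after a run. The sketch below counts records that do not have exactly three fields. It works on an in-memory sample instead of the real output file, and the sample review row is invented for illustration; `count_bad_rows` is a hypothetical helper name. Note that a review containing a line break still forms a single CSV record, because the `csv` module quotes such fields.

```python
import csv
import io


def count_bad_rows(csv_text: str, expected_fields: int = 3) -> int:
    """Count CSV records whose number of fields differs from expected_fields."""
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    return sum(1 for row in reader if len(row) != expected_fields)


# build a small sample the same way the notebook writes its rows
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Rating', 'Date', 'Content'])
# invented sample row; the review text contains a comma and a line break
writer.writerow(['4.0', '2022-10-21', 'Lush, slow, and\nworth the patience.'])

print(count_bad_rows(buffer.getvalue()))
# → 0
```

To check the real file, its text could be read with `open(..., encoding='utf-8', newline='')` and passed to the same function.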
