# 2. Mining Diamond Data - Brilliant Earth®

## (a) Introduction
In this series of notebooks we are mining diamond data from merchants on the web, then using machine learning to predict the price of a diamond. Diamond merchants often display data on the diamonds they are selling so people can peruse them and make a purchase online. They'll usually also have a comparison element with lots of features (the 5 C's etc.). Really though, if you're anything like me (a noob jeweller), how can you tell how a diamond is priced based on these features? I guess you'd have to take the merchants' word for it...

What we need is data, and a regression algorithm looking at price. This notebook is the second in the series, and in it we'll tackle scraping Brilliant Earth® data from [their website](https://www.brilliantearth.com/loose-diamonds/search/) (soz Brilliant Earth... but thx for the data). 

In all seriousness, this data is the property of Brilliant Earth®, so please be respectful. I try to stick to web scraping best practices in these scripts, so if you are going to use them, please keep these in. They mostly revolve around slowing the functions down, which I realise may be frustrating, but let's keep to the code, people.

We'll start by importing our packages and defining a couple of functions.

## (b) Import packages / define functions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['figure.figsize'] = [12, 7]

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

import itertools
import time # To help slow our functions down and time them
import random # To assign random floats to breaks, hiding predictable patterns

The three functions below are essential to working with HTML data, and we'll use them a lot throughout the notebook.

In [2]:
def pause_random(start=0.3, stop=2):
    """
    Pause for a random number of seconds between `start` and `stop`.
    """
    time.sleep(random.uniform(start, stop))

In [3]:
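def get_page_content(page_link):
    """
    Scrape the targeted HTML and store it as a BeautifulSoup object.
    """
    page_response = requests.get(page_link, timeout=5)
    # Name the parser explicitly so the result does not depend on what is installed
    page_content = BeautifulSoup(page_response.content, 'html.parser')
    return(page_content)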

In [4]:
def cleanhtml(raw_html):
    """
    Remove HTML tags from string.
    """
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return(cleantext)
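
As a quick check of what `cleanhtml` does and does not do, the snippet below runs the same regex on two made-up table cells (the HTML strings are illustrative only, not taken from the Brilliant Earth® page). The pattern strips tags but leaves HTML entities such as `&amp;` untouched.

```python
import re

def cleanhtml(raw_html):
    """Remove HTML tags from a string (same regex as the helper above)."""
    return re.sub(r'<.*?>', '', raw_html)

# Hypothetical cell markup, made up for illustration
sample_cell = '<td class="price"><span>£1,250</span></td>'
print(cleanhtml(sample_cell))

# Entities are not decoded; only the tags are removed
sample_entity = '<td>Very Good &amp; Ideal</td>'
print(cleanhtml(sample_entity))
```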

## (c) Prepare Brilliant Earth® dataset

In this section we create a function to prepare the Brilliant Earth® table, and do some initial exploration to help our final crawler save time.
For reference, I denote Brilliant Earth® as `be` for short, although the helper function below is still named `prep_bn_table`.

In [5]:
# For conciseness, we denote Brilliant Earth as 'be'
be_link = 'https://www.brilliantearth.com/loose-diamonds/search/'

In [10]:
# Start web driver
browser = webdriver.Chrome('C:/Users/Edward Sims/Downloads/chromedriver.exe')
browser.get(be_link)
pause_random()

In [11]:
def prep_bn_table(link):
    """
    Prepares the search table in the already-open `browser` for scraping.
    The `link` argument is currently unused.
    """
    
    # Continue past the cookie notice if it exists
    try:
        cookie_continue = browser.find_element_by_xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div')
        cookie_continue.click()
    except Exception:
        # No cookie notice found (or it was not clickable), so carry on
        pass
        
    # If more filters is unselected, select it    
    filter_status = browser.find_element_by_xpath('//*[@id="collapseExample"]')
    filter_status = str(filter_status.get_attribute('outerHTML'))
    if 'row ir246-advanced-filters__content collapse' in filter_status:
        more_filters = browser.find_element_by_xpath('/html/body/div[9]/div[4]/span')
        more_filters.click()
    
    # Add in all types of shape
    round_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[1]/a')
    oval_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[2]/a')
    cushion_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[3]/a')
    princess_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[4]/a')
    pear_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[5]/a')
    emerald_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[6]/a')
    marquise_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[7]/a')
    asscher_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[8]/a')
    radiant_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[9]/a')
    heart_details = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[1]/div/div[2]/div/ul/li[10]/a')
    
    shape_details_all = [round_details, oval_details, cushion_details, princess_details, pear_details,
                        emerald_details, marquise_details, asscher_details, radiant_details, heart_details]
    
    for shape_details in shape_details_all:
        if 'active' not in str(shape_details.get_attribute('outerHTML')):
            shape_details.click()
            pause_random()
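
One thing to watch with the `'active' not in ...` check above: it is a plain substring test on the whole `outerHTML`, so it would also match a class such as `inactive`. A minimal sketch of a stricter check is below. The class strings are made up for illustration, and `has_class` is a hypothetical helper, not part of the notebook. In the function above it could be fed `shape_details.get_attribute('class')`.

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the whitespace-separated classes."""
    # get_attribute('class') can return None when the attribute is missing
    return name in (class_attr or '').split()

# Hypothetical class strings, for illustration only
print(has_class('shape-filter active', 'active'))    # True
print(has_class('shape-filter inactive', 'active'))  # False
print('active' in 'shape-filter inactive')           # True: the substring test misfires
```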

In [12]:
# Open and prepare the be table for scraping
prep_bn_table(be_link)

## (d) Scrape the data
The difficulty with scraping the table is that a maximum of 1,000 results are displayed at a time, and diamond prices are heavily concentrated in the £600-£2,000 range.

So we've arrived at our first major problem: 
 - If we increment our price by static small amounts, it'll take weeks (umm no thanks).
 - If we increment them by static large amounts, we'll miss out loads of data from the price ranges with high frequencies.
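
To put rough numbers on the first point, here is a back-of-the-envelope estimate. It uses the price bounds printed later in this notebook (454 and 1,580,375) and assumes about 10 seconds per interval, which is my guess at the table load wait plus the random pauses, not a measured figure.

```python
# Price bounds printed later in the notebook (in £)
min_price, max_price = 454, 1580375

# Assumption: roughly 10 seconds per interval (table load wait plus random pauses)
seconds_per_interval = 10

def estimated_hours(interval):
    """Hours needed to step through the full price range at a fixed interval."""
    n_intervals = -(-(max_price - min_price) // interval)  # ceiling division
    return n_intervals * seconds_per_interval / 3600

for interval in (10, 100, 1000):
    print(f'£{interval} steps: {estimated_hours(interval):,.1f} hours')
```

Under that assumption, £10 steps come out at roughly 18 days of continuous scraping, while £1,000 steps take a few hours but run straight into the 1,000-result cap in the busy price ranges.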

My solution below follows this method:
 1. Retrieve the headers for our table
 2. Create an initial price interval
 3. Check the number of results displayed.
 4. If more than 999, estimate sub-intervals that will each yield approximately 999 or fewer results. Scrape the table in each sub-interval.
 5. If 999 or fewer, scrape the table.
 6. Tracking the cumulative number of results, once there are 999 or fewer results left, skip to the maximum price and scrape the table.
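
The steps above can be sketched without a browser by planning the price windows over a plain sorted list of prices. Everything here is illustrative: `plan_intervals` and `count_between` are hypothetical helpers, and the prices are simulated to be bunched in the £600-£2,000 range. As in the method above, the sub-intervals are only estimates, so a window can still hold slightly more than 999 results if prices are unevenly spread within it.

```python
from bisect import bisect_left, bisect_right

MAX_RESULTS = 999  # assumed cap on results we can read per query

def count_between(sorted_prices, low, high):
    """Number of prices p with low <= p <= high."""
    return bisect_right(sorted_prices, high) - bisect_left(sorted_prices, low)

def plan_intervals(sorted_prices, low, high, step):
    """Return contiguous (low, high) price windows to scrape, following steps 2-6."""
    total = count_between(sorted_prices, low, high)
    seen = 0
    windows = []
    for start in range(low, high + 1, step):
        end = min(start + step - 1, high)
        freq = count_between(sorted_prices, start, end)
        seen += freq
        if freq > MAX_RESULTS and end > start:
            # Step 4: estimate how many sub-intervals are needed, then split evenly
            n_sub = -(-freq // MAX_RESULTS)  # ceiling division
            sub_step = max(1, (end - start + 1) // n_sub)
            for sub_start in range(start, end + 1, sub_step):
                windows.append((sub_start, min(sub_start + sub_step - 1, end)))
        else:
            # Step 5: the whole window fits under the cap
            windows.append((start, end))
        # Step 6: few enough results left, so take the rest in one window
        if total - seen <= MAX_RESULTS and end < high:
            windows.append((end + 1, high))
            break
    return windows

# Simulated prices: 5,000 bunched between £600 and £1,999, plus three expensive stones
prices = sorted([600 + i % 1400 for i in range(5000)] + [5000, 50000, 1000000])
windows = plan_intervals(prices, 454, 1580375, 1000)
for lo, hi in windows:
    print(lo, hi, count_between(prices, lo, hi))
```

On this simulated data the plan needs only a handful of windows, compared with well over a thousand for static £1,000 steps across the full range.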

In [71]:
# Min and max price locations
min_price_box = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[1]')
max_price_box = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[2]')

# Assign default interval value
price_interval = 1000

# Get min and max values (without £ and comma values)
min_price_value = browser.find_element_by_xpath('//*[@id="min_price_display"]')
min_price_value = int(min_price_value.get_attribute('value')[1:].replace(',', ''))
min_price_value = min_price_value - 1 # Minus 1, so we can add 1 in the loop

max_price_value = browser.find_element_by_xpath('//*[@id="max_price_display"]')
max_price_value = int(max_price_value.get_attribute('value')[1:].replace(',', ''))
# Find a neutral zone to click on
neutral = browser.find_element_by_xpath('/html/body/div[7]/div/div/h1')

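# Note: get_num_results() is defined inside get_be_data() further down,
# so it needs to be available at the top level for this cell to run
total_freq = get_num_results()
cumul_freq = 0 # To cumulatively add the freqs as we go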
print(min_price_value)
print(max_price_value)

454
1580375
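
The `[1:].replace(',', '')` approach above relies on the value starting with a single-character currency symbol. A slightly more forgiving sketch is below, assuming the displayed values look like `£455` and `£1,580,375` (inferred from the bounds printed above) and that prices are whole pounds. `parse_price` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_price(display_value):
    """Convert a displayed whole-pound price such as '£1,580,375' to an int."""
    # Keep digits only; this would mangle decimals, so whole pounds are assumed
    digits = re.sub(r'[^\d]', '', display_value)
    return int(digits)

print(parse_price('£455'))        # 455
print(parse_price('£1,580,375'))  # 1580375
```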


In [75]:
# Scratch cell for testing the price boxes by hand; the loop and most
# steps are commented out, so only the two clicks below actually run.
#for min_val in range(min_price_value, max_price_value, price_interval):
    
    #lower_price = min_val + 1 # Add 1 so there are no overlapping intervals
    #higher_price = min_val + price_interval
    
# Edit max price
max_price_box.click()
#max_price_box.send_keys(Keys.BACKSPACE)
#max_price_box.send_keys(str(999)) 
neutral.click()
#time.sleep(random.uniform(0,1))
## Edit min price            
#min_price_box.click()
#min_price_box.send_keys(Keys.BACKSPACE)
#min_price_box.send_keys(str(lower_price))
#neutral.click()
#time.sleep(random.uniform(1,3)) 
        

KeyboardInterrupt: 

In [None]:
def get_be_data():
    """
    Loops through all the price values, scrapes the results and stores
    them in a dataframe.
    """
    
    def get_num_results():
        """
        Scrapes the number of results shown in the price range.    
        """
        results_path = browser.find_element_by_xpath('//*[@id="totalResult"]')
        # Scrape HTML and clean
        results_val = cleanhtml(str(BeautifulSoup(results_path.get_attribute('innerHTML'), 'html.parser')))
        # Strip punctuation, and convert to integer
        results_val = int(re.sub(r'[^\w\s]','', results_val))
        return(results_val)
    
    start = time.time()
    be_headers = []
    # Isolate the table headers HTML
    headers_data = browser.find_element_by_xpath('//*[@id="search_result_header_table"]/thead/tr')
    headers_html = BeautifulSoup(headers_data.get_attribute('innerHTML'), 'html.parser')
    
    # Get the header values
    for th in headers_html.find_all('th'):
        for header in th.find_all('span'):
            be_headers.append(cleanhtml(str(header)))
    be_headers = list(filter(('').__ne__, be_headers))
    be_headers.remove('compare')
    be_headers.remove('Collection') 
    
    # Min and max price locations
    min_price_box = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[1]')
    max_price_box = browser.find_element_by_xpath('/html/body/div[9]/div[2]/div[1]/div[2]/div/div[2]/div[1]/div[2]/div[2]')
    
    # Assign default interval value
    price_interval = 1000
    
    # Get min and max values (without £ and comma values)
    min_price_value = browser.find_element_by_xpath('//*[@id="min_price_display"]')
    min_price_value = int(min_price_value.get_attribute('value')[1:].replace(',', ''))
    min_price_value = min_price_value - 1 # Minus 1, so we can add 1 in the loop
    
    max_price_value = browser.find_element_by_xpath('//*[@id="max_price_display"]')
    max_price_value = int(max_price_value.get_attribute('value')[1:].replace(',', ''))
    # Find a neutral zone to click on
    neutral = browser.find_element_by_xpath('/html/body/div[7]/div/div/h1')
    
    total_freq = get_num_results()
    cumul_freq = 0 # To cumulatively add the freqs as we go
    be_df = pd.DataFrame(columns=be_headers) # Accumulates the scraped rows
    
    def be_scrape_table():
        """
        Scrapes the currently displayed results table into a dataframe.
        """
        table_web_source = browser.find_element_by_xpath('//*[@id="diamonds_search_table"]')
        table_html = BeautifulSoup(table_web_source.get_attribute('innerHTML'), 'html.parser')
        # Scrape the table! First get the raw table html
        table_rows_html = table_html.find_all('tr',{'class':'search-item'})
        be_rows = []
        # Then loop through each row 
        for row in table_rows_html:
            be_data = []
            # And loop through each value
            for value in row.find_all('td'):
                be_data.append(cleanhtml(str(value)))
            be_data = list(filter(('').__ne__, be_data)) # Remove all empty values
            #del be_data[4] # Delete index 4 in list as it returns two dupe vals - unique to their HTML
            
            be_rows.append(dict(zip(be_headers, be_data)))
        # Build the dataframe once, rather than appending to it row by row
        be_table = pd.DataFrame(be_rows, columns=be_headers)
        return(be_table)
    
    # Loop through prices
    for min_val in range(min_price_value, max_price_value, price_interval):
        
        lower_price = min_val + 1 # Add 1 so there are no overlapping intervals
        higher_price = min_val + price_interval
        
        # Edit max price
        max_price_box.click()
        max_price_box.send_keys(Keys.BACKSPACE)
        max_price_box.send_keys(str(higher_price)) 
        neutral.click()
        time.sleep(random.uniform(0,1))
        # Edit min price            
        min_price_box.click()
        min_price_box.send_keys(Keys.BACKSPACE)
        min_price_box.send_keys(str(lower_price))
        neutral.click()
        time.sleep(random.uniform(1,3)) 
        
        freq = get_num_results()
        cumul_freq = cumul_freq + freq 
        
        if (total_freq - cumul_freq) <= 999:
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            pause_random(7,8) # Wait for the table to load
            
            # DataFrame.append was removed in pandas 2.0, so concatenate instead
            be_df = pd.concat([be_df, be_scrape_table()], ignore_index=True)
            
            # Edit max price
            max_price_box.click()
            max_price_box.send_keys(Keys.BACKSPACE)
            max_price_box.send_keys(str(max_price_value)) 
            neutral.click()
            time.sleep(random.uniform(0,1))
            # Edit min price            
            min_price_box.click()
            min_price_box.send_keys(Keys.BACKSPACE)
            min_price_box.send_keys(str(lower_price + price_interval))
            neutral.click()
            time.sleep(random.uniform(0,1))
            
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            pause_random(7,8) # Wait for the table to load
            
            be_df = pd.concat([be_df, be_scrape_table()], ignore_index=True)
            break 
        else:           
            # If there are over 999 results, divide the interval into smaller
            # intervals that approximately return 999 or less results for each.
            if freq > 999:
                sub_interval_number = freq / 999
                # Keep the step at 1 or more, since range() rejects a step of 0
                sub_interval_price = max(1, round(price_interval / sub_interval_number))
                
                sub_min_price_value = lower_price - 1
                sub_max_price_value = higher_price
                
                for sub_min_val in range(sub_min_price_value, sub_max_price_value, sub_interval_price):
                    
                    sub_lower_price = sub_min_val + 1 # Add 1 so there are no overlapping intervals
                    # Cap at the parent interval so we don't spill into the next one
                    sub_higher_price = min(sub_min_val + sub_interval_price, sub_max_price_value)
                    
                    # Edit max price
                    max_price_box.click()
                    max_price_box.send_keys(Keys.BACKSPACE)
                    max_price_box.send_keys(str(sub_higher_price)) 
                    neutral.click()
                    time.sleep(random.uniform(0,1))
                    # Edit min price            
                    min_price_box.click()
                    min_price_box.send_keys(Keys.BACKSPACE)
                    min_price_box.send_keys(str(sub_lower_price))
                    neutral.click()
                    time.sleep(random.uniform(0,1))
                    
                    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                    pause_random(7,8) # Wait for the table to load
                    
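                    be_df = pd.concat([be_df, be_scrape_table()], ignore_index=True)
            elif freq > 0:
                # 999 or fewer results (step 5): scrape the interval as it is.
                # Empty intervals are skipped to save time.
                browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                pause_random(7,8) # Wait for the table to load
                
                be_df = pd.concat([be_df, be_scrape_table()], ignore_index=True)
            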
    end = time.time()
    print((end - start) / 60, 'mins to complete')
    return(be_df)
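
One caveat with `dict(zip(be_headers, be_data))`: `zip` stops at the shorter input, so a row with an extra or missing cell is silently misaligned rather than flagged (the commented-out `del be_data[4]` line hints that this has come up). A minimal sketch of a stricter pairing is below. `row_to_dict` is a hypothetical helper, and the headers and rows are made up for illustration.

```python
def row_to_dict(headers, values):
    """Pair headers with cell values, raising if the counts differ."""
    if len(headers) != len(values):
        raise ValueError(
            f'expected {len(headers)} values, got {len(values)}: {values!r}'
        )
    return dict(zip(headers, values))

# Hypothetical headers and rows, for illustration only
headers = ['Shape', 'Price', 'Carat']
print(row_to_dict(headers, ['Round', '£1,250', '0.50']))

try:
    # A duplicated first cell shifts every value one column to the right
    row_to_dict(headers, ['Round', 'Round', '£1,250', '0.50'])
except ValueError as err:
    print('skipped row:', err)
```

Raising here makes a layout change on the site show up immediately, instead of surfacing later as odd values in the price column.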