# 1. Mining Diamond Data - Blue Nile®

## (a) Introduction
In this series of notebooks we mine diamond data from merchants on the web, then use machine learning to predict the price of a diamond. Diamond merchants often display data on the diamonds they are selling so people can peruse them and make a purchase online. They'll usually also have a comparison view with lots of features (the 4 C's etc.). Really though, if you're anything like me (a noob jeweller), how can you tell how a diamond is priced based on these features? I guess you'd have to take the merchant's word for it...

What we need is data, and a regression algorithm looking at price. This notebook is the first in the series, and in it we'll tackle scraping Blue Nile® data from [their website](https://www.bluenile.com/uk/diamond-search) (soz Blue Nile... but thx for the data). 

In all seriousness, this data is the property of Blue Nile®, so please be respectful. I try to stick to web scraping best practices in these scripts, so if you are going to use them, please keep those safeguards in. They mostly revolve around slowing the functions down, which I realise may be frustrating, but let's keep to the code, people.
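
Another common courtesy, besides slowing down, is checking a site's `robots.txt` before crawling. The sketch below is not part of the original scripts: the rules in `EXAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration (they are not Blue Nile's actual rules), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not taken from any real site
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/diamond-search"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/private/page"))
```

For a live site, the same parser can load the real file with `set_url()` and `read()` instead of parsing a string.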

We'll start by importing our packages and defining a couple of functions.

## (b) Import packages / define functions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams['figure.figsize'] = [12, 7] # Change default fig size

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

import itertools
import time # To help slow our functions down and time them
import random # To assign random floats to breaks, hiding predictable patterns

We'll define some useful functions that we'll be utilising a lot in the notebook. The last two functions are essential to working with HTML data.

In [3]:
def pause_random(start=0.3, stop=2):
    """
    Pause for a random amount of time (in seconds) between the two numbers entered.
    """
    time.sleep(random.uniform(start, stop))

In [4]:
def get_page_content(page_link):
    """
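    Fetch the targeted page and parse its HTML into a BeautifulSoup object.
    """
    page_response = requests.get(page_link, timeout=5)
    # Name the parser explicitly so results do not depend on which parsers are installed
    page_content = BeautifulSoup(page_response.content, 'html.parser')
    return page_content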

In [5]:
def cleanhtml(raw_html):
    """
    Remove HTML tags from string.
    """
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return(cleantext)
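
As a quick sanity check, here is how the tag stripping behaves on a made-up snippet (the HTML string is an assumption for illustration, not copied from the site). The function is repeated so the example runs on its own, and the second step mirrors the punctuation stripping used later to turn a results count into an integer.

```python
import re

def cleanhtml(raw_html):
    """Remove HTML tags from string."""
    return re.sub(r'<.*?>', '', raw_html)

sample_html = '<span class="count"><b>1,234</b></span>'  # hypothetical snippet
text = cleanhtml(sample_html)
print(text)  # 1,234

# Strip punctuation (here, the thousands separator) and convert to integer
results_val = int(re.sub(r'[^\w\s]', '', text))
print(results_val)  # 1234
```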

## (c) Prepare Blue Nile® dataset

In this section we create a function to prepare the Blue Nile table, and do some initial exploration to help our final crawler save time.
For reference, I denote Blue Nile® as `bn` for short.

In [6]:
# For conciseness, Blue Nile is abbreviated to 'bn'
bn_link = 'https://www.bluenile.com/uk/diamond-search'

In [18]:
# Start web driver
browser = webdriver.Chrome('C:/Users/Edward Sims/Downloads/chromedriver.exe')
browser.get(bn_link)
pause_random()

In [22]:
browser.find_element_by_id('three-sixty-filter-5455')

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="three-sixty-filter-5455"]"}
  (Session info: chrome=79.0.3945.88)


In [19]:
def prep_bn_table(link):
    """
    Prepares the table in the already-open browser for scraping.
    """
    
    # Continue past the cookie notice if it exists
    try:
        cookie_continue = browser.find_element_by_xpath('/html/body/div[1]/button[3]')
        cookie_continue.click()
    except Exception:
        pass  # No cookie notice found, so there is nothing to dismiss

    # If the 360 view option is checked, uncheck it
    view_checkbox_status = browser.find_element_by_xpath('/html/body/div[1]/main/div/div/div/div/section[1]/div[1]/div[2]/div[3]/div[4]/div[2]/div/div/input')
    view_checkbox_status = str(view_checkbox_status.get_attribute('innerHTML'))
    if 'checked' in view_checkbox_status:
        view_checkbox = browser.find_element_by_class_name('bn-checkbox')
        view_checkbox.click()
        pause_random()

    # If astor option is checked, uncheck it
    astor_checkbox_status = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[5]/div[2]/div/div/div[1]/div/div')
    astor_checkbox_status = str(astor_checkbox_status.get_attribute('innerHTML'))
    if 'checked' in astor_checkbox_status:
        astor_checkbox = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[5]/div[2]/div/div')
        astor_checkbox.click()
        pause_random()
        
    # If more filters is unselected, select it    
    filter_status = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[13]/span')
    filter_status = str(filter_status.get_attribute('innerHTML'))
    if 'More' in filter_status:
        more_filters = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[13]')
        more_filters.click()
    
    # Add extra options if they are not already added
    polish_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[1]/div[1]/div/div')
    symmetry_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[2]/div[1]/div/div')
    fluorescence_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[3]/div[1]/div/div')
    depth_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[4]/div[1]/div/div')
    table_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[5]/div[1]/div/div')
    lw_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[6]/div[1]/div/div')
    
    feature_add_all = [polish_add, symmetry_add, fluorescence_add, depth_add, table_add, lw_add]
    for feature_add in feature_add_all:
        if 'toggled' not in str(feature_add.get_attribute('outerHTML')):
            feature_add.click()
            pause_random()
            
    culet_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[8]/div[2]/button')
    if 'active' not in str(culet_add.get_attribute('outerHTML')):
        culet_add.click()
        pause_random()
    
    # Add in all types of shape
    round_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[1]/div[3]')
    princess_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[2]/div[3]')
    emerald_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[3]/div[3]')
    asscher_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[4]/div[3]')
    cushion_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[5]/div[3]')
    marquise_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[6]/div[3]')
    radiant_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[7]/div[3]')
    oval_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[8]/div[3]')
    pear_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[9]/div[3]')
    heart_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[10]/div[3]')
    
    shape_details_all = [round_details, princess_details, emerald_details, asscher_details, cushion_details, 
                         marquise_details, radiant_details, oval_details, pear_details, heart_details]
    
    for shape_details in shape_details_all:
        if 'selected' not in str(shape_details.get_attribute('outerHTML')):
            shape_details.click()
            pause_random()

In [20]:
# Open and prepare the bn table for scraping
prep_bn_table(bn_link)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div[1]/main/div/div/div/div/section[1]/div[1]/div[2]/div[3]/div[4]/div[2]/div/div/input"}
  (Session info: chrome=79.0.3945.88)


## (d) Scrape the data
The difficulty with scraping the table is that a maximum of 1,000 results are displayed at a time, and diamond prices are heavily concentrated in the £600-£2,000 range.

So we've arrived at our first major problem: 
 - If we increment our price by static small amounts, it'll take weeks (umm no thanks).
 - If we increment them by static large amounts, we'll miss out loads of data from the price ranges with high frequencies.

My solution below follows this method:
 1. Retrieve the headers for our table.
 2. Create an initial price interval.
 3. Check the number of results displayed.
 4. If there are more than 999, estimate sub-intervals that should each yield roughly 999 results or fewer, and scrape the table in each sub-interval.
 5. If there are 999 or fewer, scrape the table.
 6. Track the cumulative number of results; once fewer than 999 remain, skip to the maximum price and scrape the table.
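
The sub-interval estimate in step 4 can be sketched in isolation. The helper `split_interval` is a hypothetical name, not the function used in this notebook, and it assumes results are spread evenly across the interval, which is only a rough guess given how unevenly prices are distributed.

```python
def split_interval(low, high, freq, cap=999):
    """
    Split the inclusive price range [low, high] into equal-width,
    non-overlapping sub-intervals, sized so that each should hold
    roughly `cap` results if the `freq` results were evenly spread.
    """
    n_parts = freq / cap
    # Width of each sub-interval; at least 1 so the loop always advances
    width = max(1, round((high - low + 1) / n_parts))
    bounds = []
    start = low
    while start <= high:
        end = min(start + width - 1, high)  # Never run past the upper bound
        bounds.append((start, end))
        start = end + 1  # Next interval starts just above, so nothing overlaps
    return bounds

print(split_interval(1, 1000, 2500))  # [(1, 400), (401, 800), (801, 1000)]
```

Because the even-spread assumption is crude, a sub-interval can still return more than the cap; re-checking the result count after each split would be one way to catch that.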

In [9]:
def get_bn_data():
    """
    Loops through all the price values, scrapes the results and stores
    it into a dataframe.
    """
    
    def get_num_results():
        """
        Scrapes the number of results shown in the price range.    
        """
        results_path = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[2]/div[4]/button[1]/span[2]')
        # Scrape HTML and clean
        results_val = cleanhtml(str(BeautifulSoup(results_path.get_attribute('innerHTML'))))
        # Strip  punctuation, and convert to integer
        results_val = int(re.sub(r'[^\w\s]','', results_val))
        return(results_val)
    
    start = time.time()
    bn_headers = []
    # Isolate the table headers HTML
    headers_data = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/section/div/div/div[1]/div')
    headers_html = BeautifulSoup(headers_data.get_attribute('innerHTML'))
    
    # Get the header values
    for div in headers_html.find_all('div'):
        for header in div.find_all('span'):
            bn_headers.append(cleanhtml(str(header)))
    bn_headers = list(filter(('').__ne__, bn_headers))
    bn_headers.remove('Compare')
    # Create a dataframe with our new headers
    bn_df = pd.DataFrame(columns = bn_headers)  
    
    # Min and max price locations
    min_price_box = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[2]/div/div[1]/input[1]')
    max_price_box = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[2]/div/div[1]/input[2]')
    
    # Assign default interval value
    price_interval = 1000
    
    # Get min and max values (without £ and comma values)
    min_price_value = int(min_price_box.get_attribute('value')[1:].replace(',', ''))
    min_price_value = min_price_value - 1 # Minus 1, so we can add 1 in the loop
    max_price_value = int(max_price_box.get_attribute('value')[1:].replace(',', ''))
    # Find a neutral zone to click on
    neutral = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[1]/h3')
    
    total_freq = get_num_results()
    cumul_freq = 0 # To cumulatively add the freqs as we go
    
    def bn_scrape_table():
        """
        Scrapes the rows currently displayed in the table into a dataframe.
        """
        table_web_source = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/section/div/div/div[2]')
        table_html = BeautifulSoup(table_web_source.get_attribute('innerHTML'))
        # Scrape the table! First get the raw table html
        table_rows_html = table_html.find_all('a',{'class':'grid-row row '})
        bn_rows = []
        # Then loop through each row
        for row in table_rows_html:
            bn_data = []
            # And loop through each value
            for value in row.find_all('span'):
                bn_data.append(cleanhtml(str(value)))
            bn_data = list(filter(('').__ne__, bn_data)) # Remove all empty values
            del bn_data[4] # Delete index 4 in list as it returns two dupe vals - unique to their HTML
            bn_rows.append(dict(zip(bn_headers, bn_data)))
        # Build the dataframe once, rather than appending row by row
        return pd.DataFrame(bn_rows, columns=bn_headers)
    
    # Loop through prices
    for min_val in range(min_price_value, max_price_value, price_interval):
        
        lower_price = min_val + 1 # Add 1 so there are no overlapping intervals
        higher_price = min_val + price_interval
        
        # Edit max price
        max_price_box.click()
        max_price_box.send_keys(Keys.BACKSPACE)
        max_price_box.send_keys(str(higher_price)) 
        neutral.click()
        time.sleep(random.uniform(0,1))
        # Edit min price            
        min_price_box.click()
        min_price_box.send_keys(Keys.BACKSPACE)
        min_price_box.send_keys(str(lower_price))
        neutral.click()
        time.sleep(random.uniform(1,3)) 
        
        freq = get_num_results()
        cumul_freq = cumul_freq + freq 
        
        if (total_freq - cumul_freq) <= 999:
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            pause_random(7,8) # Wait for the table to load
            
            bn_df = bn_df.append(bn_scrape_table(), ignore_index=True)
            
            # Edit max price
            max_price_box.click()
            max_price_box.send_keys(Keys.BACKSPACE)
            max_price_box.send_keys(str(max_price_value)) 
            neutral.click()
            time.sleep(random.uniform(0,1))
            # Edit min price            
            min_price_box.click()
            min_price_box.send_keys(Keys.BACKSPACE)
            min_price_box.send_keys(str(lower_price + price_interval))
            neutral.click()
            time.sleep(random.uniform(0,1))
            
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            pause_random(7,8) # Wait for the table to load
            
            bn_df = bn_df.append(bn_scrape_table(), ignore_index=True)
            break 
        else:           
            # If there are over 999 results, divide the interval into smaller
            # intervals that approximately return 999 or less results for each.
            if freq > 999:
                sub_interval_number = freq / 999
                sub_interval_price = round(price_interval / sub_interval_number)
                
                sub_min_price_value = lower_price - 1
                sub_max_price_value = higher_price
                
                for min_val in range(sub_min_price_value, sub_max_price_value, sub_interval_price):
                    
                    sub_lower_price = min_val + 1 # Add 1 so there are no overlapping intervals
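                    # Cap at the interval's upper bound so the last sub interval does not overlap the next interval
                    sub_higher_price = min(min_val + sub_interval_price, sub_max_price_value)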
                    
                    # Edit max price
                    max_price_box.click()
                    max_price_box.send_keys(Keys.BACKSPACE)
                    max_price_box.send_keys(str(sub_higher_price)) 
                    neutral.click()
                    time.sleep(random.uniform(0,1))
                    # Edit min price            
                    min_price_box.click()
                    min_price_box.send_keys(Keys.BACKSPACE)
                    min_price_box.send_keys(str(sub_lower_price))
                    neutral.click()
                    time.sleep(random.uniform(0,1))
                    
                    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                    pause_random(7,8) # Wait for the table to load
                    
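                    bn_df = bn_df.append(bn_scrape_table(), ignore_index=True)
            else:
                # 999 or fewer results fit in a single view, so scrape the interval directly
                browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                pause_random(7,8) # Wait for the table to load
                bn_df = bn_df.append(bn_scrape_table(), ignore_index=True)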
            
    end = time.time()
    print((end - start) / 60, 'mins to complete')
    return(bn_df)
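
The price boxes hold strings with a leading £ and comma separators, which the function above converts by dropping the first character and the commas. A small standalone version of that step, with `parse_price` as a hypothetical helper name and a made-up input value, might look like this:

```python
def parse_price(price_text):
    """Convert a displayed price such as '£1,234' into an integer."""
    # Drop the leading currency symbol, then the thousands separators
    return int(price_text[1:].replace(',', ''))

print(parse_price('£1,234'))  # 1234
```

Note that this relies on the currency symbol being exactly one character; a different display format would need a different rule.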

In [10]:
bn_df = get_bn_data()

47.12620820204417 mins to complete


In [11]:
bn_df.to_csv('blue_nile_data.csv', index=False)

In [49]:
print(bn_df.shape)
bn_df.head()

(173725, 14)


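| Unnamed: 0 | shape | price | carat | cut | color | clarity | polish | symmetry | fluorescence | depth | table | l/w | price/ct | culet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Emerald | 214.8 | 0.3 | Good | I | VS1 | Very Good | Very Good | | 68.6 | 74.0 | 1.24 | 716.0 | |
| 1 | Pear | 219.6 | 0.31 | Very Good | J | SI2 | Very Good | Good | Faint | 61.8 | 56.0 | 1.51 | 708.0 | |
| 2 | Princess | 238.8 | 0.25 | Very Good | D | SI1 | Excellent | Very Good | | 73.4 | 73.0 | 1.05 | 955.0 | |
| 3 | Pear | 248.4 | 0.31 | Very Good | G | SI2 | Good | Good | | 62.2 | 64.0 | 1.62 | 801.0 | |
| 4 | Pear | 249.6 | 0.29 | Very Good | E | SI2 | Very Good | Good | | 65.7 | 59.0 | 1.5 | 861.0 | |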
