# 1. Mining Diamond Data - Blue Nile®

## Introduction
In this series of notebooks we are mining diamond data from as many sources as possible to prepare our dataset. Ultimately, we want to do some machine learning on these data, but the scraping is just as fun! We are targeting diamond merchants on the web, starting with Blue Nile® (soz Blue Nile®... but thx for the data).

In all seriousness, this data is the property of Blue Nile®, so please be respectful. I try to stick to web scraping best practices in these scripts, so if you are going to use them, please keep these in. They mostly revolve around slowing the functions down, which I realise may be frustrating, but let's keep up the standards.
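
As a rough illustration of what "slowing the functions down" means in practice, here is a minimal sketch of a pause helper. The name `polite_pause` and its `sleep` parameter are made up for this example and are not used elsewhere in the notebook; the 0.3 to 3 second range mirrors the pauses used in the cells further down.

```python
import random
import time

def polite_pause(min_seconds=0.3, max_seconds=3, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Pass a do-nothing sleep to try the helper out without actually waiting
delay = polite_pause(sleep=lambda seconds: None)
print(round(delay, 2))
```

Randomising the delay, rather than sleeping for a fixed time, avoids hitting the site in a perfectly regular rhythm.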

We'll start by importing our usual packages, packages for scraping and regular expressions, as well as some ad hoc ones.

https://www.google.com/search?q=binning+data+equal+frequency&rlz=1C1GCEA_enGB838GB838&oq=binning+data+equal+frequency&aqs=chrome..69i57j33.6746j0j7&sourceid=chrome&ie=UTF-8

## Import packages / define functions

In [14]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests
import re

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

import time # To help with slowing our functions down
import random # To assign random floats to breaks, hiding predictable patterns

In [2]:
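def get_page_content(page_link):
    """
    Download a page and parse its HTML with BeautifulSoup.
    """
    page_response = requests.get(page_link, timeout=5)
    # Name the parser explicitly so results do not depend on what is installed
    page_content = BeautifulSoup(page_response.content, 'html.parser')
    return(page_content)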

In [3]:
def cleanhtml(raw_html):
    """
    Remove HTML tags from string.
    """
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return(cleantext)
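
To see what the tag-stripping pattern does, here is a small self-contained check. It repeats `cleanhtml` so that it runs on its own, and the HTML snippets are invented for illustration rather than taken from the Blue Nile page.

```python
import re

def cleanhtml(raw_html):
    """Remove HTML tags from string (same pattern as the cell above)."""
    return re.sub(re.compile('<.*?>'), '', raw_html)

print(cleanhtml('<span class="header">Price</span>'))   # Price
print(cleanhtml('<div><b>0.29</b> ct</div>'))           # 0.29 ct

# A greedy pattern would swallow everything between the first '<' and the last '>'
print(repr(re.sub('<.*>', '', '<span>Price</span>')))   # ''
```

The `?` in `<.*?>` makes the match non-greedy, so each tag is removed on its own and the text between tags survives.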

## Blue Nile

In [42]:
# For conciseness, we will denote Blue Nile as 'bn'
bn_link = 'https://www.bluenile.com/uk/diamond-search'

In [43]:
def get_bn_headers(page_content):
    """
    Retrieves the headers for our Blue Nile dataframe
    """
    headers_grid = page_content.find('div',{'class':'grid-header normal-header'})
    headers_row = headers_grid.find('div', {'class':'row'})
    
    # Find all headers, and remove the tags from the string
    headers_containers = []
    for div in headers_row.find_all('div'):
        headers_containers.append(cleanhtml(str(div.find('span'))))
    
    # Remove all 'None' string values from list
    headers = list(filter(('None').__ne__, headers_containers))
    headers.remove('Compare')
    
    return(headers)
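
The `'None'` filter is there because `div.find('span')` returns `None` for a div without a span, and `str(None)` is the string `'None'`. The sketch below imitates that with a plain list instead of parsed HTML, so the `found_spans` values are stand-ins rather than real page content.

```python
import re

def cleanhtml(raw_html):
    return re.sub('<.*?>', '', raw_html)

# Stand-ins for what div.find('span') might return: a tag's HTML, or None
found_spans = ['<span>Shape</span>', None, '<span>Price</span>', None, '<span>Compare</span>']

headers_containers = [cleanhtml(str(span)) for span in found_spans]
print(headers_containers)  # ['Shape', 'None', 'Price', 'None', 'Compare']

# Drop the 'None' strings and the 'Compare' column in one pass
headers = [h for h in headers_containers if h not in ('None', 'Compare')]
print(headers)  # ['Shape', 'Price']
```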

In [44]:
bn_page_content = get_page_content(bn_link)
bn_headers = get_bn_headers(bn_page_content)
print(bn_headers)

['Shape', 'Price', 'Carat', 'Cut', 'Colour', 'Clarity', 'Polish', 'Symmetry', 'Fluorescence', 'Depth', 'Table', 'L/W', 'Price/Ct', 'Culet', 'Stock No.', 'Dispatch Date']


In [134]:
browser = webdriver.Chrome('C:/Users/Edward Sims/Downloads/chromedriver.exe')
browser.get('https://www.bluenile.com/uk/diamond-search?track=NavDiaSea')

In [135]:
# Uncheck the 360 view option for more data
view_checkbox = browser.find_element_by_class_name('bn-checkbox')
view_checkbox.click()
time.sleep(random.uniform(0.3,3))

# Open more filters
more_filters = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[13]')
more_filters.click()
time.sleep(random.uniform(0.3,3))

# Open polish option
polish_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[1]/div[1]/div/div/div')
polish_add.click()
time.sleep(random.uniform(0.3,3))

# Open symmetry option
symmetry_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[2]/div[1]/div/div/div')
symmetry_add.click()
time.sleep(random.uniform(0.3,3))

# Open fluorescence option
fluorescence_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[3]/div[1]/div/div')
fluorescence_add.click()
time.sleep(random.uniform(0.3,3))

# Open depth % option
depth_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[4]/div[1]/div/div')
depth_add.click()
time.sleep(random.uniform(0.3,3))

# Open table % option
table_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[5]/div[1]/div/div')
table_add.click()
time.sleep(random.uniform(0.3,3))

# Open L/W Ratio option
lw_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[6]/div[1]/div/div')
lw_add.click()
time.sleep(random.uniform(0.3,3))

# Add culet column
culet_add = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[12]/div[8]/div[2]/button')
culet_add.click()
time.sleep(random.uniform(0.3,3))

In [136]:
princess_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[2]/div[3]')
emerald_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[3]/div[3]')
asscher_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[4]/div[3]')
cushion_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[5]/div[3]')
marquise_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[6]/div[3]')
radiant_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[7]/div[3]')
oval_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[8]/div[3]')
pear_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[9]/div[3]')
heart_details = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[6]/div[2]/div/div[10]/div[3]')

princess_details.click()
time.sleep(random.uniform(0.3,3))
emerald_details.click()
time.sleep(random.uniform(0.3,3))
asscher_details.click()
time.sleep(random.uniform(0.3,3))
cushion_details.click()
time.sleep(random.uniform(0.3,3))
marquise_details.click()
time.sleep(random.uniform(0.3,3))
radiant_details.click()
time.sleep(random.uniform(0.3,3))
oval_details.click()
time.sleep(random.uniform(0.3,3))
pear_details.click()
time.sleep(random.uniform(0.3,3))
heart_details.click()
time.sleep(random.uniform(0.3,3))
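
The nine shape XPaths above differ only in one index (`div[2]` for princess up to `div[10]` for heart), so the same clicks could be driven by a loop. This is only a sketch: `shape_xpaths` and `click_shapes` are hypothetical names, and the finder is passed in as an argument so the code does not depend on a particular Selenium version.

```python
import random
import time

SHAPE_XPATH = ('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]'
               '/div[3]/div[6]/div[2]/div/div[{}]/div[3]')
SHAPES = ['princess', 'emerald', 'asscher', 'cushion', 'marquise',
          'radiant', 'oval', 'pear', 'heart']

def shape_xpaths():
    """Map each shape name to its XPath; princess is div[2], heart is div[10]."""
    return {shape: SHAPE_XPATH.format(i) for i, shape in enumerate(SHAPES, start=2)}

def click_shapes(find_by_xpath, min_pause=0.3, max_pause=3):
    """Click every shape filter, pausing a random interval after each click."""
    for shape, xpath in shape_xpaths().items():
        find_by_xpath(xpath).click()
        time.sleep(random.uniform(min_pause, max_pause))
```

With the browser from the cells above, this would be called as `click_shapes(browser.find_element_by_xpath)`.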

The difficulty with scraping the table is that a maximum of 1,000 results are displayed, and the prices of diamonds are hugely skewed towards the £800-£1,000 price range. We could just step through the prices a small amount at a time, say £10, but given that the maximum price is over a million, this would end up taking FOREVER. What's more, even within such a small range there are still sometimes too many records for the table to display. Worse still, because the data are so skewed, at the higher end we'd be looping through price ranges with nothing in them to display.

So instead we'll bin our prices into bins of unequal width, chosen so that each bin holds 999 results.
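
To make the idea of equal-frequency bins concrete, here is a minimal sketch on toy data. It assumes a list of prices is already in hand, which the scraper does not have at this point, and `equal_frequency_bins` is a hypothetical helper rather than part of the scraping code.

```python
def equal_frequency_bins(prices, max_per_bin=999):
    """Split prices into consecutive (low, high) ranges of at most max_per_bin items."""
    ordered = sorted(prices)
    bins = []
    for start in range(0, len(ordered), max_per_bin):
        chunk = ordered[start:start + max_per_bin]
        bins.append((chunk[0], chunk[-1]))
    return bins

# Toy skewed data: many cheap stones, few expensive ones
prices = [200] * 3 + [250, 300, 400, 5000, 90000]
print(equal_frequency_bins(prices, max_per_bin=3))
# [(200, 200), (250, 400), (5000, 90000)]
```

The bins are narrow where prices are dense and wide where they are sparse, which is exactly what a fixed £10 step cannot do. Note that identical prices straddling a bin boundary would still need handling.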

In [105]:
def get_bn_data():
    """
    Loops through all the price values, scrapes the results and stores
    it into a dataframe.
    """
    
    def get_num_results():
        """
        Scrapes the number of results shown in the price range.
        """
        results_path = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[2]/div[4]/button[1]/span[2]')
        # Scrape the HTML and clean
        results_val = cleanhtml(str(BeautifulSoup(results_path.get_attribute('innerHTML'))))
        # Strip the punctuation, and convert to integer
        results_val = int(re.sub(r'[^\w\s]','', results_val))

        return(results_val)
    
    bn_headers = []
    
    # Isolate the table headers HTML
    headers_data = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/section/div/div/div[1]/div')
    headers_html = BeautifulSoup(headers_data.get_attribute('innerHTML'))
    
    # Get the header values
    for div in headers_html.find_all('div'):
        for header in div.find_all('span'):
            bn_headers.append(cleanhtml(str(header)))
    bn_headers = list(filter(('').__ne__, bn_headers))
    bn_headers.remove('Compare')
    
    # Create a dataframe with our new headers
    bn_df = pd.DataFrame(columns=bn_headers)
    
    # Min and max price locations
    min_price = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[2]/div/div[1]/input[1]')
    max_price = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[2]/div/div[1]/input[2]')
    
    # Assign a default range value
    price_range = 10
    
    # Get the min and max values (without £ and comma values)
    min_price_value = int(min_price.get_attribute('value')[1:].replace(',', ''))
    max_price_value = int(max_price.get_attribute('value')[1:].replace(',', ''))
    
    # Find a neutral zone to click on
    neutral = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/div[1]/div[2]/div[3]/div[7]/div[1]/h3')
    
    # Loop through prices to limit numbers displayed
    for min_val in range(min_price_value, 237, price_range):
        
        # Edit min price
        min_price.click()
        min_price.send_keys(Keys.BACKSPACE)
        time.sleep(1)
        min_price.send_keys(str(min_val))
        neutral.click()
        time.sleep(random.uniform(0.3,3))
        
        # Edit max price
        max_price.click()
        max_price.send_keys(Keys.BACKSPACE)
        time.sleep(2)
        max_price.send_keys(str(min_val+price_range)) 
        neutral.click()

        get_num_results()
        time.sleep(random.uniform(0.3,4))
        
        table_web_source = browser.find_element_by_xpath('//*[@id="react-app"]/div/div/div/section[1]/section/div/div/div[2]')
        table_html = BeautifulSoup(table_web_source.get_attribute('innerHTML'))
        
        # Scrape the table! First get the raw table html
        table_rows_html = table_html.find_all('a',{'class':'grid-row row '})
        time.sleep(random.uniform(0.3,10))
            
        # Then loop through each row 
        for row in table_rows_html:
            bn_data = []
            # And loop through each value
            for value in row.find_all('span'):
                bn_data.append(cleanhtml(str(value)))
            
            bn_data = list(filter(('').__ne__, bn_data)) # Remove all empty values
            del bn_data[4] # Delete index 4 in list as it returns two dupe vals - unique to their HTML
            #print(bn_data)
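            # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
            bn_df = pd.concat([bn_df, pd.DataFrame([dict(zip(bn_headers, bn_data))])], ignore_index=True)

    # Return only after every price range has been scraped
    return(bn_df)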

In [106]:
get_bn_data()

Unnamed: 0,Shape,Price,Carat,Cut,Color,Clarity,Polish,Symmetry,Fluorescence,Depth,Table,L/W,Price/Ct,Culet,Stock No.,Dispatch Date
0,Marquise,£204.00,0.29,Good,J,VS1,Very Good,Very Good,Strong,55.8,62.0,2.09,£703,,LD12094866,Jun 18
1,Emerald,£217.20,0.3,Good,I,VS1,Very Good,Very Good,,68.6,74.0,1.24,£724,,LD10955945,Jun 13
2,Princess,£217.20,0.26,Very Good,D,SI2,Excellent,Very Good,,68.2,65.0,1.05,£835,,LD12053315,Jun 11
3,Princess,£220.80,0.23,Very Good,D,SI1,Excellent,Very Good,,75.0,73.0,1.05,£960,,LD12437279,Jun 14
4,Round,£236.40,0.24,Good,F,SI2,Very Good,Good,,63.3,60.0,1.01,£985,,LD12438597,Jun 11
5,Pear,£238.80,0.23,Good,I,VVS2,Very Good,Very Good,,56.6,62.0,1.64,"£1,038",,LD11691361,Jun 11
6,Round,£240.00,0.24,Very Good,F,SI2,Very Good,Good,Faint,61.3,60.0,1.01,"£1,000",,LD12438596,Jun 11
7,Marquise,£241.20,0.33,Very Good,J,VS1,Very Good,Very Good,Faint,61.2,59.0,1.77,£731,,LD12022928,Jun 18
8,Oval,£241.20,0.3,Very Good,E,SI2,Very Good,Good,Strong,59.2,63.0,1.43,£804,,LD12094834,Jun 18
9,Oval,£242.40,0.3,Very Good,I,SI1,Very Good,Very Good,,61.6,56.0,1.35,£808,,LD12046989,Jun 18


In [None]:
if results_val > 999:
    # More results than the table can display, so this price range needs narrowing
    pass

In [None]:
min_price_value = int(min_price.get_attribute('value')[1:].replace(',', ''))
max_price_value = int(max_price.get_attribute('value')[1:].replace(',', ''))
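
One way the `results_val > 999` check could be put to work is to shrink the price window until the result count fits in the table. This is a sketch of that idea, not the finished approach: `find_price_window` is a hypothetical name, and `count_results` stands in for whatever sets the two price inputs and reads the result count from the page.

```python
def find_price_window(min_val, count_results, step=10, limit=999):
    """Halve the window [min_val, min_val + step] until it holds <= limit results."""
    while step > 1 and count_results(min_val, min_val + step) > limit:
        step = max(1, step // 2)
    return min_val, min_val + step

# Fake counter for illustration: pretend every £1 of range holds 150 diamonds
def fake_count(low, high):
    return (high - low) * 150

print(find_price_window(200, fake_count))  # (200, 205)
```

A window of £1 is returned as-is even if it is still over the limit, so the caller would need to decide what to do in that case.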