## <center>Web Scraping with Selenium and BeautifulSoup</center>

<center>Scraping the GoBear aggregator website to collect quote information from different insurers</center>

<center>The information on GoBear depends on the stability of its partner sites, so the number of insurers quoting can change from quote to quote depending on site availability, T&Cs, and updates to the website's HTML structure.</center>

## Importing packages

Selenium is a browser automation tool that can crawl not only static web pages but also dynamically generated content (for example, content rendered by JavaScript). It takes control of your browser as a human user would, navigates to the websites you want to crawl, and retrieves the data either directly or with the help of BeautifulSoup, another package for pulling data out of HTML or XML.
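To make the BeautifulSoup half of that division of labour concrete, here is a minimal sketch that parses a static HTML string. The markup in `SAMPLE_HTML` and the helper name `extract_quotes` are made up for illustration; only the `h4.name` and `span.value` selectors mirror the ones used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a quotes page; not GoBear's real HTML.
SAMPLE_HTML = """
<div class="card">
  <h4 class="name"> Insurer A </h4>
  <span class="value"> 812.50 </span>
</div>
<div class="card">
  <h4 class="name"> Insurer B </h4>
  <span class="value"> 790.00 </span>
</div>
"""

def extract_quotes(html):
    """Return (insurer, price) pairs found in an HTML string."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    names = [tag.text.strip() for tag in soup.find_all("h4", attrs={"class": "name"})]
    prices = [tag.text.strip() for tag in soup.find_all("span", attrs={"class": "value"})]
    return list(zip(names, prices))

quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

With Selenium in the picture, the string passed to such a helper would be `browser.page_source`, taken after the browser has rendered the dynamic content.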

In [1]:
from __future__ import division  # must precede all other statements when run as a script

import selenium
from bs4 import BeautifulSoup
import urllib2
import json

import re
from selenium import webdriver
from selenium.webdriver import ActionChains # useful for simulating actions, e.g. moving a slide bar; not mandatory here
from selenium.webdriver.support.ui import Select
import time

import pandas as pd
import numpy as np

# Dicts mapping car makes and models to the matching elements (XPath) on the website. These are specific to each website.
from scrapLib.GOBEAR_DICT import CAR_MODEL_DICT, CAR_MAKE_DICT

In [2]:
# initialize input and output location as well as scraping chunk size
PROFILE_PATH = 'C:/Users/liuleo/Documents/KT/WEB_SCRAP/RISK_PROFILES_gobear_clean.csv'
CHUNK_SIZE = 10
OUTPUT_FOLDER = 'C:/Users/liuleo/Documents/KT/WEB_SCRAP/test_results/'
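As a rough illustration of what the chunk size controls, the sketch below replays the checkpoint condition used by the scraping loop later in this notebook (`(i+1) % CHUNK_SIZE == 0`) without touching the browser. The function name `checkpoint_ranges` is hypothetical.

```python
def checkpoint_ranges(n_profiles, chunk_size):
    """Return the (start, end) row ranges that get written as chunk files."""
    ranges = []
    for i in range(n_profiles):
        # Same trigger as the scraping loop: fire after every full chunk.
        if (i + 1) % chunk_size == 0:
            ranges.append((i + 1 - chunk_size, i + 1))
    return ranges

saved = checkpoint_ranges(22, 10)
print(saved)  # [(0, 10), (10, 20)]

# Rows beyond the last full chunk are not covered by any chunk file.
leftover = 22 - saved[-1][1] if saved else 22
print(leftover)  # 2
```

With 22 sampled profiles and a chunk size of 10, the last two rows never land in a chunk file and only reach disk through the final CSV written after the loop, so a crash during those rows would lose them.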

In [3]:
gb_df = pd.read_csv(PROFILE_PATH)

### Data preprocessing

Preprocess the profile data so that it matches the options available on the website being crawled.

In [4]:
# change data type
gb_df['BIRTH_YEAR'] = gb_df['DOB'].apply(lambda x : x[-4:])
gb_df['BIRTH_YEAR'] = gb_df['BIRTH_YEAR'].astype(np.int32)
gb_df['LICENCE YEARS'] = gb_df['LICENCE YEARS'].astype(np.int8)
gb_df['NCD'] = gb_df['NCD'].astype(np.int8)

# modify the manufacture year to reflect current time (2015 to 2017)
gb_df['YEAR OF MANUFACTURE'] = gb_df['YEAR OF MANUFACTURE'] + 2
gb_df['YEAR OF MANUFACTURE'] = gb_df['YEAR OF MANUFACTURE'].astype(np.int32)

# For occupation, GoBear only has two types : indoor and outdoor
# For marital status, GoBear only has two types : Single and Married
gb_df['OCCUPATION_FLAG'] = gb_df['OCCUPATION'].apply(lambda x : 1 if x in ['INDOOR MIDDLE LVL MGMT', 'OUTDOOR SALES/STAFF'] else 0)
gb_df['MARITAL_STATUS_FLAG'] = gb_df['MARITAL STATUS'].apply(lambda x : 1 if x in ['SINGLE', 'MARRIED'] else 0)

# Combine Car model and Engine CC
def concat_str(x):
    return x['MODEL'] + ' ' + str(x['CC'])
gb_df['MODEL_CC'] = gb_df.apply(concat_str, axis=1)

# Driver age must be between 18 and 65 on the GoBear website
gb_df = gb_df[(gb_df['AGE OF DRIVER']<=65) & (gb_df['AGE OF DRIVER']>=18)].reset_index(drop=True)

gb_df_dedup = gb_df.drop_duplicates([col for col in gb_df.columns if col != 'RISK']).reset_index(drop=True)

print "Number of initial profiles {}".format(gb_df.shape)
print "Number of profiles after deduplication {}".format(gb_df_dedup.shape)

Number of initial profiles (461, 19)
Number of profiles after deduplication (383, 19)
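The two less obvious preprocessing steps, slicing the birth year out of the DOB string and deduplicating on every column except `RISK`, can be tried on a toy frame. The rows below are made up; only the column names mirror the notebook.

```python
import pandas as pd

# Made-up toy profiles; only the column names mirror the notebook.
toy = pd.DataFrame({
    "RISK": [1, 2, 3],
    "MODEL": ["YARIS", "YARIS", "CAMRY"],
    "DOB": ["1/21/1978", "1/21/1978", "1/21/1953"],
})

# Birth year is the last four characters of the DOB string.
toy["BIRTH_YEAR"] = toy["DOB"].str[-4:].astype("int32")

# Compare rows on every column except the RISK identifier.
key_cols = [col for col in toy.columns if col != "RISK"]
toy_dedup = toy.drop_duplicates(subset=key_cols).reset_index(drop=True)

print(toy_dedup["RISK"].tolist())        # [1, 3]
print(toy_dedup["BIRTH_YEAR"].tolist())  # [1978, 1953]
```

Because `drop_duplicates` keeps the first occurrence by default, the surviving row carries the `RISK` value of the first profile in each duplicate group.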


Initialize the columns that will hold the quote information crawled from the target website.

In [5]:
gb_df_dedup['insurer_names'] = 'NA'
gb_df_dedup['plans'] = 'NA'
gb_df_dedup['prices'] = 'NA'
gb_df_dedup['excesses'] = 'NA'
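Each of these four columns ends up holding one pipe-delimited string per profile (the scraping loop joins the scraped lists with `'|'`). A possible way to unpack them into one record per quote afterwards is sketched below; the helper name `unpack_quotes` and the sample values are assumptions for illustration.

```python
def unpack_quotes(insurer_names, plans, prices, excesses):
    """Split four pipe-delimited strings into one dict per quote."""
    columns = [insurer_names, plans, prices, excesses]
    # An unscraped profile still holds the 'NA' placeholder (or an empty string).
    if any(value in ("NA", "") for value in columns):
        return []
    parts = [value.split("|") for value in columns]
    keys = ("insurer", "plan", "price", "excess")
    # zip stops at the shortest list, so ragged scrapes are truncated, not padded.
    return [dict(zip(keys, row)) for row in zip(*parts)]

# Made-up values for illustration only.
rows = unpack_quotes("Insurer A|Insurer B", "Basic|Plus", "812.50|790.00", "500|600")
print(rows[1]["price"])                       # 790.00
print(unpack_quotes("NA", "NA", "NA", "NA"))  # []
```

Keeping the four lists aligned by position only works if the page yields the same number of names, plans, prices, and excesses, which is worth checking before trusting the unpacked rows.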

For testing, start with a sample. Once the scraping bot works well, scrape all the profiles.

In [6]:
gb_df_sample = gb_df_dedup.sample(22).reset_index(drop=True)
# to crawl all the profiles, copy the full deduplicated dataframe into gb_df_sample instead
#gb_df_sample = gb_df_dedup.copy()

## Initialize the webdriver and website starting url to scrap

The webdriver is browser-specific; for Chrome you can find it at http://chromedriver.storage.googleapis.com/index.html

In [7]:
# initialize the Selenium Chrome driver (raw string so the backslashes are not read as escape sequences)
path_to_chromedriver = r'C:\Users\liuleo\Documents\Python\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

In [15]:
# navigate to the webpage containing information to be crawled
url = 'https://www.gobear.com/sg'
browser.get(url)

In [16]:
# go to the car insurance quotes panel
browser.find_element_by_xpath('//*[@id="Insurance"]/div/ul/li[1]/a').click()

In [13]:
gb_df_sample

| | RISK | MAKE | MODEL | CC | MMCC | OFF PEAK | DOB | AGE OF DRIVER | GENDER | OCCUPATION | ... | POLICY EXCESS | MARITAL STATUS | BIRTH_YEAR | OCCUPATION_FLAG | MARITAL_STATUS_FLAG | MODEL_CC | insurer_names | plans | prices | excesses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 260 | MERCEDES | CLA220 | 2143 | MERCEDES CLA220 2143 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | CLA220 2143 | | | | |
| 1 | 259 | MERCEDES | CLA200 | 1595 | MERCEDES CLA200 1595 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | CLA200 1595 | | | | |
| 2 | 249 | TOYOTA | YARIS | 1496 | TOYOTA YARIS 1496 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | YARIS 1496 | | | | |
| 3 | 188 | TOYOTA | CAMRY | 2494 | TOYOTA CAMRY 2494 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | CAMRY 2494 | | | | |
| 4 | 9 | TOYOTA | COROLLA ALTIS | 1598 | TOYOTA COROLLA ALTIS 1598 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 250 | SINGLE | 1978 | 1 | 1 | COROLLA ALTIS 1598 | | | | |
| 5 | 463 | TOYOTA | COROLLA ALTIS | 1598 | TOYOTA COROLLA ALTIS 1598 | NO | 1/21/1978 | 40 | FEMALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | COROLLA ALTIS 1598 | | | | |
| 6 | 77 | TOYOTA | COROLLA ALTIS | 1598 | TOYOTA COROLLA ALTIS 1598 | NO | 1/21/1953 | 65 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1953 | 1 | 1 | COROLLA ALTIS 1598 | | | | |
| 7 | 319 | BMW | 335i | 2979 | BMW 335i 2979 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | 335i 2979 | | | | |
| 8 | 180 | TOYOTA | AVANZA | 1495 | TOYOTA AVANZA 1495 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | AVANZA 1495 | | | | |
| 9 | 367 | HYUNDAI | SANTA FE | 2351 | HYUNDAI SANTA FE 2351 | NO | 1/21/1978 | 40 | MALE | INDOOR MIDDLE LVL MGMT | ... | 500 | SINGLE | 1978 | 1 | 1 | SANTA FE 2351 | | | | |


Scrape the quote information for each profile from the GoBear website, saving the results chunk by chunk.
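The loop in the next cell selects each dropdown option by its 1-based position in the option list, so profile values have to be translated into list positions first. The sketch below isolates two of those translations as plain functions. The helper names are hypothetical, and the assumption that NCD comes in steps of 10 is inferred from the `NCD / 10 + 1` expression in the loop rather than stated anywhere.

```python
def licence_years_index(years):
    """1-based <li> position for driving experience; 15+ years share item 16."""
    return 16 if years >= 15 else years + 1

def ncd_index(ncd_percent):
    """1-based <li> position for an NCD value, assumed to come in steps of 10."""
    # Floor division keeps the result an int even under true division.
    return ncd_percent // 10 + 1

def li_xpath(list_xpath, index):
    """Build the XPath of the clickable element inside the index-th list item."""
    return "{}/li[{}]/a/link".format(list_xpath, index)

print(licence_years_index(3))   # 4
print(licence_years_index(20))  # 16
print(ncd_index(50))            # 6
print(li_xpath('//*[@id="car-form"]/div[1]/div[2]/div/div[1]/div/ul', ncd_index(50)))
```

Pulling the mappings out like this makes them testable without a browser, which helps when the site changes the order or length of a dropdown.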

In [17]:

for i in range(0,len(gb_df_sample)):
    
    print 'Scraping Model:{} Customer BIR YEAR: {} Customer Gender: {} Customer Marital: {} NCD: {}'.format(
        gb_df_sample.loc[i,'MODEL_CC'],
        gb_df_sample.loc[i,'BIRTH_YEAR'],
        gb_df_sample.loc[i,'GENDER'],
        gb_df_sample.loc[i,'MARITAL STATUS'],
        gb_df_sample.loc[i,'NCD'])

    # select age
    #time.sleep(1)
    browser.find_element_by_class_name('age-holder').click()
    browser.find_element_by_name('year').click()
    # send year information to website
    time.sleep(1.5)
    browser.find_element_by_name('year').send_keys(str(gb_df_sample.loc[i,'BIRTH_YEAR']))
    time.sleep(1)
    ###

    # select marital status
    browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[2]/button').click()
    time.sleep(0.5)
    if gb_df_sample.loc[i,'MARITAL STATUS'] == 'SINGLE':
        #single
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[2]/div/ul/li[1]/a/link').click()
    elif gb_df_sample.loc[i,'MARITAL STATUS'] == 'MARRIED':
        #married
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[2]/div/ul/li[2]/a/link').click()
    else:
        continue
    time.sleep(1)
    ###

    # select gender
    browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[3]/div/button').click()
    time.sleep(0.5)
    if gb_df_sample.loc[i,'GENDER'] == 'MALE':
        # Male
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[3]/div/div/ul/li[1]/a/link').click()
    elif gb_df_sample.loc[i,'GENDER'] == 'FEMALE':
        # Female
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[3]/div/div/ul/li[2]/a/link').click()
    else:
        continue
    time.sleep(1)
    ###

    # select driving exp
    browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[4]/button').click()
    time.sleep(0.5)
    if gb_df_sample.loc[i,'LICENCE YEARS'] >=15:
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[4]/div/ul/li[16]/a/link').click()
    else:
        tmp_exp = str(gb_df_sample.loc[i,'LICENCE YEARS']+1)
        browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[1]/div[4]/div/ul/li[' + tmp_exp + ']/a/link').click()
    time.sleep(0.5)
    ###

    # select NCD (no claim discount)
    browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[2]/div/div[1]/button').click()
    time.sleep(1)
    # map the profile's NCD value to the position of the matching list item on the website
    # (floor division: with true division enabled, '/' would yield e.g. '6.0' instead of '6')
    tmp_ncd = str((gb_df_sample.loc[i,'NCD'] // 10) + 1)
    browser.find_element_by_xpath('//*[@id="car-form"]/div[1]/div[2]/div/div[1]/div/ul/li[' + tmp_ncd + ']/a/link').click()
    time.sleep(0.5)
    ###
    
    
    # select car manufacture year 
    browser.find_element_by_xpath('//*[@id="carDetails"]/div[1]/div/button').click()
    time.sleep(1)
    # map the profile's year of manufacture to the position of the matching list item on the website
    tmp_car_year = 2020-gb_df_sample.loc[i,'YEAR OF MANUFACTURE'] #2018+2
    browser.find_element_by_xpath('//*[@id="carDetails"]/div[1]/div/div/ul/li[' + str(tmp_car_year) + ']/a/link').click()
    time.sleep(0.5)
    ###
    
    # select brand
    browser.find_element_by_xpath('//*[@id="carDetails"]/div[2]/div/button').click()
    time.sleep(1)
    # look up the position of the list item on the website that matches the profile's car make
    tmp_car_make = CAR_MAKE_DICT[gb_df_sample.loc[i,'MAKE']]
    browser.find_element_by_xpath('//*[@id="carDetails"]/div[2]/div/div/ul/li[' + str(tmp_car_make) + ']/a/link').click()
    time.sleep(0.5)
    ###
    
    # select model
    browser.find_element_by_xpath('//*[@id="carDetails"]/div[3]/div/button').click()
    time.sleep(1)
    # look up the XPath on the website that matches the profile's car model + CC
    if CAR_MODEL_DICT[gb_df_sample.loc[i,'MODEL_CC']] != 'NA':
        tmp_car_model = CAR_MODEL_DICT[gb_df_sample.loc[i,'MODEL_CC']]
        browser.find_element_by_xpath(tmp_car_model).click()
    else:
        continue

    # click show my results
    browser.find_element_by_xpath('//*[@id="car-form"]/div[2]/div[2]/button[1]/link').click()
    # wait for the results to be generated
    time.sleep(4.5)
    
    ### the website has now navigated to the quotes page

    # click radio button for choosing off-peak scheme
    if gb_df_sample.loc[i,'OFF PEAK'] == 'NO':
        browser.find_element_by_xpath('//*[@id="detailCollapse"]/div[2]/div[1]/label').click()
        time.sleep(5.5)
    elif gb_df_sample.loc[i,'OFF PEAK'] == 'YES':
        # for off-peak quotes
        browser.find_element_by_xpath('//*[@id="detailCollapse"]/div[2]/div[2]/label').click()
        time.sleep(5.5)
    else:
        continue
    
    ## click for occupation outdoor if necessary (indoor by default)
    if gb_df_sample.loc[i,'OCCUPATION'] == 'OUTDOOR SALES/STAFF':
        browser.find_element_by_xpath('//*[@id="detailCollapse"]/div[1]/div[2]/label').click()
        time.sleep(4)
    
    # start to get quotes information with beautifulsoup
    html = browser.page_source
    soup = BeautifulSoup(html, "lxml")
    
    # get corresponding html blocks for quotes information required
    price_table = soup.find_all('span', attrs={'class': 'value'})
    insurer_table = soup.find_all('h4', attrs={'class': 'name'})
    plan_table = soup.find_all('div',attrs={'class':'card-title text-center'})
    excess_table = soup.find_all('p', attrs={'class': 'col-xs-6 text-right detail-value'})
    
    # get policy excess info (the code keeps the first of every five detail values)
    excess_list = []
    for j, x in enumerate(excess_table):
        if j % 5 == 0:
            excess_list.append(str(x.find('span').text.strip()))
    
    # get quotes rates info
    price_list = []
    for price in price_table:
        price_list.append(str(price.text.strip()))
    
    # get info of insurer name
    insurer_list = []
    for insurer in insurer_table:
        insurer_list.append(str(insurer.text.strip()))
    
    # get info of insurer plan name
    plan_list = []
    for x in plan_table:
        plan_list.append(str(x.find('p').text.strip()))
    
    # for each profile, add the quote information to the main dataframe
    gb_df_sample.loc[i, 'insurer_names'] = '|'.join(insurer_list)
    gb_df_sample.loc[i, 'plans'] = '|'.join(plan_list)
    gb_df_sample.loc[i, 'prices'] = '|'.join(price_list)
    gb_df_sample.loc[i, 'excesses'] = '|'.join(excess_list)
    
    # after every CHUNK_SIZE profiles, save that chunk of results to the output folder
    if (i+1) % CHUNK_SIZE == 0:
        print 'Scraping first {} records'.format(i+1)
        gb_df_sample[(i+1-CHUNK_SIZE):(i+1)].to_csv(OUTPUT_FOLDER + 'GOBEAR_CHUNK_{}_{}.csv'.format(i+1-CHUNK_SIZE,i+1), index=False, sep=';')

    time.sleep(5)
    
    # navigate back to the webpage to crawl a new profile
    url = 'https://www.gobear.com/sg'
    browser.get(url)

    time.sleep(1)
    
    # click car panel 
    browser.find_element_by_xpath('//*[@id="Insurance"]/div/ul/li[1]/a').click()
    time.sleep(0.5)

# save all crawling results
gb_df_sample.to_csv(OUTPUT_FOLDER + 'GOBEAR_FINAL_RESULTS.csv',index=False, sep=';')

Scraping Model:CLA220 2143 Customer BIR YEAR: 1978 Customer Gender: MALE Customer Marital: SINGLE NCD: 50


NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="car-form"]/div[1]/div[1]/div[2]/button"}
  (Session info: chrome=63.0.3239.132)
  (Driver info: chromedriver=2.32.498550 (9dec58e66c31bcc53a9ce3c7226f0c1c5810906a),platform=Windows NT 6.1.7601 SP1 x86_64)
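The traceback above reports that the marital-status button could not be located. The output does not say why; two plausible causes are that the form had not finished rendering when the fixed `time.sleep` expired, or that the page layout changed so the XPath no longer matches. For the first case, a polling wait is usually more robust than a fixed sleep. The sketch below shows the idea in plain Python with a stand-in lookup; `wait_for` and `flaky_lookup` are hypothetical names, not part of Selenium.

```python
import time

def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() until it stops raising, or re-raise after timeout seconds."""
    deadline = time.time() + timeout
    while True:
        try:
            return find()
        except Exception:
            # Give up once the deadline has passed; otherwise retry shortly.
            if time.time() >= deadline:
                raise
            time.sleep(poll)

# Stand-in for a slow page: the "element" only appears on the third lookup.
attempts = {"count": 0}

def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not rendered yet")
    return "button"

result = wait_for(flaky_lookup, timeout=2.0, poll=0.01)
print(result)             # button
print(attempts["count"])  # 3
```

In the notebook this could wrap a lookup such as `wait_for(lambda: browser.find_element_by_xpath(xpath))`. Selenium also ships its own `WebDriverWait` together with `expected_conditions` for the same purpose. Neither approach helps if the XPath itself is stale, in which case the locator has to be updated against the current page.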
