<center><h1>Selenium Web Scraping Project</h1></center>
<center><h2><i>This project extracts data from the Nova Scotia Barristers' Society website</i></h2></center>
<center><h4><b><i><a href="https://members.nsbs.org/NSBSWEB/Lawyer_Search/Search_Page.aspx" target="_blank">https://members.nsbs.org/NSBSWEB/Lawyer_Search/Search_Page.aspx</a></i></b></h4></center>

In [1]:
# Import Selenium 'webdriver' for navigation
from selenium import webdriver
# Import 'By' to find elements by tag name, id, etc. on the webpage
from selenium.webdriver.common.by import By
# Import 'Keys' to send keyboard input (such as Enter) programmatically
from selenium.webdriver.common.keys import Keys
# Import 'time' to allow code to 'sleep' in order to allow page to load
import time
# Import 're' (regular expressions) for data cleaning
import re
# Import 'pandas' for data analysis
import pandas as pd

In [2]:
# Set webdriver, open browser, and navigate to website

url = "https://members.nsbs.org/NSBSWEB/Lawyer_Search/Search_Page.aspx"
driver = webdriver.Chrome()
driver.get(url)

# Set an implicit wait: element lookups keep retrying for up to 10 seconds before failing
driver.implicitly_wait(10)

In [3]:
# Click search to show search results

search = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon_ciNewQueryMenuCommon_ResultsGrid_Sheet0_SubmitButton")

search.click()

In [4]:
# Enter the number of rows to show per page in the page-size input box
# At the time of running this code, there are 3956 lawyers in this database

views = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon_ciNewQueryMenuCommon_ResultsGrid_Grid1_ctl00_ctl02_ctl00_ChangePageSizeTextBox")

views.send_keys('4000')

In [5]:
# Apply the new page size so that all 3956 rows are shown: click the change button and wait for the results to load

change = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon_ciNewQueryMenuCommon_ResultsGrid_Grid1_ctl00_ctl02_ctl00_ChangePageSizeLinkButton")

change.click()

time.sleep(5)
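
The fixed five-second sleep above either wastes time or is too short on a slow connection. A minimal sketch of the alternative, polling until a condition holds, is shown below. The helper name `wait_until` and its parameters are hypothetical and not part of the original notebook; Selenium also ships its own `WebDriverWait` helper for the same purpose.

```python
import time

def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call `condition` repeatedly until it is truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Possible use in this notebook (assumes `driver` and `By` from the cells above;
# the threshold is an assumption to be set from the default page size):
# wait_until(lambda: len(driver.find_elements(By.CSS_SELECTOR, "a[title='View Details']")) > threshold)
```

The helper returns as soon as the condition is met, so the wait is only as long as the page actually needs.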

In [6]:
# Collect the last name (second column) from every row of the results table

# Find the table element
table = driver.find_element(By.TAG_NAME, 'table')

# Find all the rows in the table
rows = table.find_elements(By.TAG_NAME, 'tr')

last_name = []
# Iterate through each row
for row in rows:
    # Find all the cells in the row
    cells = row.find_elements(By.TAG_NAME,'td')

    # Check if the row has at least two cells (columns)
    if len(cells) >= 2:
        # Get the second cell (index 1) and store its text
        second_column_text = cells[1].text
#         print(second_column_text)
        last_name.append(second_column_text)

In [7]:
# Check the number of last names found
len(last_name)

3952

In [8]:
# Collect the first name (third column) from every row of the results table

first_name = []
# Iterate through each row
for row in rows:
    # Find all the cells in the row
    cells = row.find_elements(By.TAG_NAME,'td')

    # Check if the row has at least three cells (columns) 
    if len(cells) >= 3:
        third_column_text = cells[2].text
#         print(third_column_text)
        first_name.append(third_column_text)

In [9]:
# Check the number of first names found
len(first_name)

3952

In [10]:
# Find the 'View Details' link element for each lawyer's page

links = driver.find_elements(By.CSS_SELECTOR, "a[title='View Details']")

In [11]:
# Make sure we have loaded all rows
len(links)

3952

In [12]:
# Extract the 'href' from each element found 

urls = []
for x in links:
    url_list = x.get_attribute('href')
    urls.append(url_list)

In [13]:
# Extract data from each lawyer's url and add it to lists

# Initialize empty lists to store data
full_names = []
calls_to_bar = []
contacts = []

# Iterate over each URL in the 'urls' list
for link in urls:
    # Open the lawyer's page with the driver created earlier
    driver.get(link)
    
    # Find the elements that hold the name, call-to-Bar, and contact data by their IDs
    data_name = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon5_ciNewQueryMenuCommon5_ResultsGrid_Grid1_ctl00")
    data_call = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon4_ciNewQueryMenuCommon4_ResultsGrid_Grid1_ctl00")
    data_contact = driver.find_element(By.ID, "ctl01_TemplateBody_WebPartManager1_gwpciNewQueryMenuCommon3_ciNewQueryMenuCommon3_ResultsGrid_Grid1_ctl00")
    
    # Extract the text from the name, call, and contact elements and append to respective lists
    full_names.append(data_name.text)
    calls_to_bar.append(data_call.text)
    contacts.append(data_contact.text)  

In [14]:
# Clean the data that contains the names of each lawyer

# Create empty list to store new values
full_name = []

# Loop through list and clear unnecessary characters
for names in full_names:
    name = names.replace('Name\n', '')
    full_name.append(name)

In [15]:
# Find the call to the Bar date for each lawyer

# Initialize an empty list to store call to the Bar dates
call_to_bar = []

# Define a regular expression pattern to match the date format
date_pattern = r'(\w{3} \d{2}, \d{4})'

# Define a value to use when no date is found
none_value = 'None'

# Iterate over each text in the 'calls_to_bar' list
for text in calls_to_bar:
    # Search for a match of the date pattern in the text
    match = re.search(date_pattern, text)
    
    # If a match is found, append the matched date to the 'call_to_bar' list
    if match:
        call_to_bar.append(match.group(1))
    # If no match is found, append the 'none_value' to the 'call_to_bar' list
    else:
        call_to_bar.append(none_value)
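
The matched dates stay as strings such as `Nov 28, 1998`, which sort alphabetically rather than chronologically. A minimal sketch of converting them to real dates is shown below, reusing the same pattern. The function name `parse_call_date` and the layout of the sample text are assumptions, and `%b` relies on English month abbreviations in the active locale.

```python
import re
from datetime import datetime

date_pattern = r'(\w{3} \d{2}, \d{4})'

def parse_call_date(text):
    """Return the first 'Mon DD, YYYY' date in `text` as a date, or None."""
    match = re.search(date_pattern, text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), '%b %d, %Y').date()
    except ValueError:
        # The pattern matched something that is not a valid date
        return None

sample = 'Call to the Bar Date\nNov 28, 1998'  # hypothetical raw text
print(parse_call_date(sample))
print(parse_call_date('no date here'))
```

Returning `None` instead of the string `'None'` also lets pandas treat the missing values as missing.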

In [16]:
# Extracts the contact information by splitting each contact at the '\nPhone #:' substring,
# taking the first part, removing 'Primary Address', and stripping any leading/trailing whitespace.
# This is done for each contact in the 'contacts' list using a list comprehension.

extracted_contact = [contact.split('\nPhone #:')[0].replace('Primary Address', '').strip() for contact in contacts]

In [17]:
# Extract the name and address of the firm for each lawyer

# Initialize empty lists to store firm names and addresses
firm_name = []
address = []

# Iterate over each element in the 'extracted_contact' list
for element in extracted_contact:
    # Split the element at the first occurrence of '\n'
    split_data = element.split('\n', 1)
    
    # Append the first part of the split data (firm name) to the 'firm_name' list
    firm_name.append(split_data[0])
    
    # Check if there is a second part of the split data (address)
    if len(split_data) > 1:
        # If there is, replace '\n' with ', ' in the address and append it to the 'address' list
        address.append(split_data[1].replace('\n', ', '))
    else:
        # If there is no second part, append an empty string to the 'address' list
        address.append('')

        
# NEXT TIME USE SOMETHING LIKE THIS TO EXTRACT DATA FROM A RAW STRING -- THIS EXTRACTS ADDRESS 

# new_thing = []
# for thing in contacts:
#     first = thing.find('\n')
#     second = thing.find("\n", first + 1)
# #     third = thing.find('\nPhone #:')
#     if first != -1 and second != -1:
#         start = second 
#         end = thing.find('\nPhone #:')
#         goof = thing[start:end].strip()  # Extract the address and remove leading/trailing spaces
#         new_thing.append(goof)
#     else:
#         new_thing.append("")
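
As a worked example of the two cells above, the sketch below runs the same steps on one contact string. The line layout of `sample_contact` is an assumption about what the site returns, and the phone number is a placeholder; the firm, address, email, and website are taken from row 3 of the final table. `str.partition` is used in place of `split('\n', 1)` because it always returns three parts, so no length check is needed.

```python
sample_contact = (
    'Primary Address\n'
    'Lorway MacEachern McLeod Burke\n'
    '112 Charlotte Street\n'
    'Sydney, NS B1P 1B9\n'
    'Phone #: 555-0100\n'                     # placeholder number
    'Email: northlaw@eastlink.ca\n'
    'Website: http://www.northlawcan.com/'
)

# Keep everything before the phone line and drop the 'Primary Address' label
head = sample_contact.split('\nPhone #:')[0].replace('Primary Address', '').strip()

# The first line is the firm name; the remaining lines form the address
firm, _, rest = head.partition('\n')
address_line = rest.replace('\n', ', ')

print(firm)
print(address_line)
```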

In [18]:
# Extract the email and website text for each lawyer

# Initialize an empty list to store email and website information
email_website = []

# Iterate over each element in the 'contacts' list
for element in contacts:
    # Split the element at the first occurrence of 'Email'
    split_data = element.split('Email', 1)
    
    # Check if there is a second part of the split data (email and website)
    if len(split_data) > 1:
        # If there is, append "Email" concatenated with the second part to the 'email_website' list
        email_website.append("Email" + split_data[1])
    else:
        # If there is no second part, append an empty string to the 'email_website' list
        email_website.append('')

In [19]:
# Clean up some values in the email and website data

# Define the replacements as a dictionary
replacements = {
    "\n": ", ",     # Replace newline characters with comma and space
    "Email:": "",   # Remove the string 'Email:'
    "Website:": "", # Remove the string 'Website:'
    "\\\\": "//",   # Replace a double backslash with '//'
    # " ": ""       # (Commented out) Remove all occurrences of space
}

# Iterate over each element in the 'email_website' list
for i in range(len(email_website)):
    # Iterate over each key-value pair in the 'replacements' dictionary
    for old_value, new_value in replacements.items():
        # Replace each occurrence of the 'old_value' with the 'new_value'
        email_website[i] = email_website[i].replace(old_value, new_value)

In [20]:
# Extract all lawyers' email and the website for their firm

# Initialize empty lists to store email and website information
email = []
website = []

# Iterate over each item in the 'email_website' list
for item in email_website:
    # Check if the item contains a comma
    if ',' in item:
        # Split the item at the comma and strip any leading/trailing whitespace from each part
        parts = item.split(',')
        email_part = parts[0].strip()
        website_part = parts[1].strip()
        
        # Append the email part and website part to the respective lists
        email.append(email_part)
        website.append(website_part)
    else:
        # If there is no comma in the item, append an empty string to the email list
        email.append('')
        
        # With no comma, the whole stripped item (possibly empty) goes to the website list
        website.append(item.strip())
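
The comma-based split above puts any item without a comma into the website list, so an entry that held only an email address would land in the wrong column. A minimal sketch of a label-based alternative follows. It assumes the raw contact text keeps `Email:` and `Website:` on separate lines, which is not confirmed by the notebook, and the function name `split_email_website` is hypothetical.

```python
def split_email_website(contact_text):
    """Return (email, website) read from labelled lines; '' when a label is absent."""
    email, website = '', ''
    for line in contact_text.split('\n'):
        # Split at the first colon only, so 'http://...' stays intact in the value
        label, sep, value = line.partition(':')
        if not sep:
            continue
        label = label.strip()
        if label == 'Email':
            email = value.strip()
        elif label == 'Website':
            website = value.strip()
    return email, website

print(split_email_website('Email: northlaw@eastlink.ca\nWebsite: http://www.northlawcan.com/'))
print(split_email_website('Email: isinclair@fasken.com'))
```

Because each value is matched to its label, a missing website can no longer shift the email into the website column.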

In [22]:
# Check for consistency of all cleaned data

print("Last name:")
print(len(last_name))
print("\nFirst name:")
print(len(first_name))
print("\nFull name:")
print(len(full_name))
print('\nCall to Bar:')
print(len(call_to_bar))
print("\nFirm:")
print(len(firm_name))
print("\nAddress:")
print(len(address))
print("\nEmail:")
print(len(email))
print("\nWebsite:")
print(len(website))

Last name:
3952

First name:
3952

Full name:
3952

Call to Bar:
3952

Firm:
3952

Address:
3952

Email:
3952

Website:
3952
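
Comparing eight printed numbers by eye is easy to get wrong. A minimal sketch of an automated version of the same check is shown below; the function name `check_equal_lengths` is hypothetical and the sample lists are shortened stand-ins for the real ones.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the named lists differ in length; return the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

sample_columns = {
    'last_name': ['Samson', 'Sinclair'],
    'first_name': ['Michel', 'Ian'],
    'call_to_bar': ['Nov 28, 1998', 'Jun 10, 2011'],
}
print(check_equal_lengths(sample_columns))
```

Running it on the real lists before building the dataframe would stop the notebook early if one of the cleaning steps dropped or duplicated a row.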


In [23]:
# Put all data into a dataframe for visualization and analysis

# Put all data into a dictionary
data = {
    'last_name': last_name,
    'first_name': first_name,
    'full_name': full_name,
    'call_to_bar': call_to_bar,
    'firm_name': firm_name,
    'address': address,
    'email': email,
    'website': website
}

# Put dictionary into a dataframe
df = pd.DataFrame(data)

In [24]:
# See dataframe
df

Unnamed: 0,last_name,first_name,full_name,call_to_bar,firm_name,address,email,website
0,Samson,Michel,Michel Samson,"Nov 28, 1998",Cox & Palmer,"Nova Centre, South Tower, 1500-1625 Grafton St...",msamson@coxandpalmer.com,https://www.coxandpalmerlaw.com
1,Sinclair,Ian,Ian Sinclair,"Jun 10, 2011",Fasken Martineau DuMoulin LLP,"550 Burrard Street, Suite 2900, Vancouver, BC ...",isinclair@fasken.com,
2,Saunders,Paul,Paul Saunders,"Jun 13, 2008",Stewart McKelvey,"600-1741 Lower Water Street, PO Box 997, Halif...",psaunders@stewartmckelvey.com,http://www.stewartmckelvey.com
3,MacEachern,Duncan,Duncan MacEachern,"Aug 09, 1985",Lorway MacEachern McLeod Burke,"112 Charlotte Street, Sydney, NS B1P 1B9",northlaw@eastlink.ca,http://www.northlawcan.com/
4,Wedlake,David,David Wedlake,"Jun 10, 2016",Stewart McKelvey,"600-1741 Lower Water Street, PO Box 997, Halif...",dwedlake@stewartmckelvey.com,http://www.stewartmckelvey.com
...,...,...,...,...,...,...,...,...
3947,Fraser,Caitlin,Caitlin Fraser,"Nov 19, 2021",Buildscale Inc. dba Vidyard,"1 Queen Street, Unit 301, Kitchener, ON N2H 2G7",caitlin.fraser@vidyard.com,https://www.vidyard.com
3948,Payne,Cheryl,Cheryl Payne,"Mar 16, 1982",Department of Justice (NS),"1690 Hollis Street, PO Box 7 STN Central, Hali...",cheryl.payne@novascotia.ca,http://novascotia.ca/just/
3949,Clark,Sheila,Sheila Clark,"Feb 18, 1991",,,,
3950,Tam,Tony,Tony Tam KC,"Aug 08, 1986",McInnes Cooper,"1969 Upper Water Street, Suite 1300, McInnes C...",tony.tam@mcinnescooper.com,http://www.mcinnescooper.com/


In [25]:
# Check dimensions of dataframe
df.shape

(3952, 8)

In [26]:
# Save dataframe to .CSV file

filename = 'Nova Scotia Lawyer Scrape.csv'
df.to_csv(filename, index=False)

print(f"DataFrame saved as {filename} successfully.")

DataFrame saved as Nova Scotia Lawyer Scrape.csv successfully.
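
The `call_to_bar` and `address` values contain commas, so the saved file is only usable if those fields are quoted. pandas quotes such fields by default when writing; the sketch below illustrates the same minimal-quoting behaviour with the standard library `csv` module on one row from the table above, and is not a check of the saved file itself.

```python
import csv
import io

row = ['MacEachern', 'Duncan', 'Aug 09, 1985',
       '112 Charlotte Street, Sydney, NS B1P 1B9']

# Write the row to an in-memory buffer; fields containing commas get quoted
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back recovers the original four fields
read_back = next(csv.reader(io.StringIO(line)))
print(read_back == row)
```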
