# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before it lets you in. Done by hand, a search goes like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp).
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" at the bottom of the page.
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide".
5. In the "Trade Name" field, type "Vap%" to try to find vape shops, then submit the form.
6. Click "Next" in the bottom right-hand corner to go to the next page of results.
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

In [1]:
# TO PRESS ENTER IN A FIELD:

# from selenium.webdriver.common.keys import Keys
# element = driver.find_element_by_class_name("q")
# element.send_keys(Keys.RETURN)

# TO USE A SELECT:

# from selenium.webdriver.support.ui import Select
# select_tag = driver.find_element_by_name('phy_city')
# select = Select(select_tag)
# select.select_by_visible_text('Houston')

# TO KEEP GOING UNTIL YOU GET AN ERROR:

# count = 0
# while True:
#     try:
#         # Try to find the "Next" link and click it
#         driver.find_element_by_link_text("Next").click()
#     except:
#         # Did you get an error? Exit the while loop
#         break
#     count = count + 1

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [None]:
###

In [None]:
###

**It isn't going to work, though! It's going to redirect to that intro page.** To see the "check the box, click Enter" series of pages again in your own browser, open the site in an *Incognito* window.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
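
One way to answer "what's unique about this element?" is to write down a few candidate locators and count how many elements each one matches. The sketch below assumes a Selenium-style `driver` with a `find_elements(by, value)` method; the strings `"id"`, `"css selector"` and `"xpath"` are the values behind `By.ID`, `By.CSS_SELECTOR` and `By.XPATH`. The helper names and the candidate list are made up for illustration, and the id `checkbox` is borrowed from the XPath used later in this notebook.

```python
def count_matches(driver, candidates):
    """Return {(by, value): number of matching elements} for each candidate."""
    return {(by, value): len(driver.find_elements(by, value))
            for by, value in candidates}


def unique_locators(driver, candidates):
    """Keep only the locators that match exactly one element on the page."""
    counts = count_matches(driver, candidates)
    return [locator for locator, n in counts.items() if n == 1]


# Hypothetical candidates for the disclaimer checkbox:
# the same element described three different ways
CHECKBOX_CANDIDATES = [
    ("id", "checkbox"),
    ("css selector", "#checkbox"),
    ("xpath", '//*[@id="checkbox"]'),
]
```

With a live browser you would call `unique_locators(driver, CHECKBOX_CANDIDATES)` and pick any locator that comes back; a locator that matches zero or several elements is not a safe choice.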

In [None]:
###

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**
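
The two options can be sketched as one small helper. This is only an illustration: `send_form` is a hypothetical name, and it assumes a Selenium-style `driver` whose `find_element(by, value)` returns an element with `.submit()` and `.click()` methods. Locators are passed as `(by, value)` pairs such as `("xpath", "...")`.

```python
def send_form(driver, form_locator=None, button_locator=None):
    """Submit via the form if one is given, otherwise click the button.

    Returns which route was used ("submit" or "click").
    """
    if form_locator is not None:
        by, value = form_locator
        driver.find_element(by, value).submit()
        return "submit"
    if button_locator is not None:
        by, value = button_locator
        driver.find_element(by, value).click()
        return "click"
    raise ValueError("pass a form locator or a button locator")
```

For example, `send_form(driver, button_locator=("xpath", "/html/body/table/tbody/tr[7]/td/form/div/input[2]"))` would take the click route with the button XPath used in the cells that follow.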

In [1]:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://jportal.mdcourts.gov/license/index_disclaimer.jsp")


In [2]:
check_box = driver.find_element_by_xpath('//*[@id="checkbox"]')


In [3]:
check_box.click()


In [4]:
enter_box = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')


In [5]:
enter_box.click()


### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [6]:
from selenium.webdriver.support.ui import Select


In [7]:
# This is the "SEARCH LICENSE RECORDS" link, not a dropdown
search_link = driver.find_element_by_xpath('/html/body/table[1]/tbody/tr[2]/td[2]/table/tbody/tr/td[3]/a')
search_link.click()


### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [8]:
select_dd = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]')
# select.click()


In [9]:
# Wrap the <select> element so the option can be picked by its visible text
search_input_dd = Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]'))
search_input_dd.select_by_visible_text("Statewide")


### How do you type "vap%" into the Trade Name field?

In [10]:
search_input_vap = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
search_input_vap.send_keys("vap%")
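
The notebook doesn't say how the portal interprets `%`. Assuming it works like the SQL `LIKE` wildcard ("any run of characters") and that matching ignores case, the search can be imitated locally to predict which trade names should come back. `like_to_regex` and `matches` are hypothetical helpers, not part of the portal or of Selenium.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern ('%' = any run of characters) to a regex."""
    parts = (".*" if ch == "%" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, name):
    """True if the whole name fits the pattern."""
    return like_to_regex(pattern).fullmatch(name) is not None


print(matches("vap%", "VAPE DOJO"))      # name starts with "vap"
print(matches("vap%", "THE VAPE LOFT"))  # "vap" is not at the start
```

Under these assumptions, `vap%` only finds names that begin with "vap", which fits results such as "VAPE DOJO" and "VAPEPAD THE".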


### How do you click the submit button or submit the form?

In [11]:
search_button_submit = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
search_button_submit.click()


### How can you find and click the 'Next' button on the search results page?

In [12]:
search_button_next = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')
search_button_next.click()


# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [13]:
# Next button
search_button_next_2 = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')
search_button_next_2.click()
# Re-ran this cell (Shift+Enter) 6 times to reach the last page
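
Re-running the cell by hand can be replaced with a loop. The sketch below is one possible shape, not the only one: it assumes the "Next" link is simply missing on the last page, that `driver.find_elements("xpath", ...)` returns an empty list in that case, and that each page has finished loading by the time it is read. `click_through_pages`, `on_page` and `max_pages` are names invented here; `max_pages` is a safety cap so a wrong assumption can't loop forever.

```python
def click_through_pages(driver, next_xpath, on_page=None, max_pages=100):
    """Click "Next" until the link is gone; return how many pages were visited."""
    pages_seen = 0
    while pages_seen < max_pages:
        pages_seen += 1
        if on_page is not None:
            # Hand the current page's HTML to the caller (e.g. to scrape it later)
            on_page(driver.page_source)
        next_links = driver.find_elements("xpath", next_xpath)
        if not next_links:
            # No "Next" link left: this was the last page
            break
        next_links[0].click()
    return pages_seen
```

Calling it with the XPath from the cell above and no `on_page` callback just clicks through to the end, which matches the "don't scrape yet" step.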


In [14]:
# Back button
search_button_back = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[1]/a')
search_button_back.click()


### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [15]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')


In [16]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)


5

In [17]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    print("----")
    

HEADER is 6.
VAPE LOFT (THE)
ROW 0 IS DISBROW II EMERSON HARRINGTON
Lic. Status: Issued
ROW 1 IS 185 MITCHELLS CHANCE RD
License: 02102408
ROW 2 IS EDGEWATER, MD 21037
Issued Date: 4/13/2017
ROW 3 IS Anne Arundel County
----
HEADER is 7.
VAPE N CIGAR
ROW 0 IS DISCOUNT TOBACCO ESSEX LLC
Lic. Status: Issued
ROW 1 IS 7104 MINSTREL UNIT #7
License: 13141786
ROW 2 IS COLUMBIA, MD 21045
Issued Date: 5/19/2017
ROW 3 IS Howard County
----
HEADER is 8.
VAPE DOJO
ROW 0 IS FAIRGROUND VILLAGE LLC
Lic. Status: Issued
ROW 1 IS 330 ONE FORTY VILLAGE ROAD
License: 06126253
ROW 2 IS WESTMINSTER, MD 21157
Issued Date: 4/21/2017
ROW 3 IS Carroll County
----
HEADER is 9.
VAPE HAVEN
Pending *
ROW 0 IS GRIMM JENNIFER
Lic. Status: Pending
ROW 1 IS 29890 THREE NOTCH ROAD
ROW 2 IS CHARLOTTE HALL, MD 20622
ROW 3 IS St. Mary's County
----
HEADER is 10.
VAPE BIRD
ROW 0 IS HUTCH VAPES LLC
Lic. Status: Issued
ROW 1 IS 356 ROMANCOKE ROAD
License: 17166688
ROW 2 IS STEVENSVILLE, MD 21666
Issued Date: 4/13/2017
ROW 3 
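
Each header and row above packs several facts into one string. A possible next step is to split them into separate fields. The sketch below works on the plain strings printed above, so it needs neither Selenium nor BeautifulSoup. The column names `Owner`, `Address`, `City/State/ZIP` and `County` are guesses about what each row holds, and `split_labeled` / `parse_record` are hypothetical helpers.

```python
def split_labeled(text):
    """Split a cell into its main value and any 'Label: detail' lines."""
    value, labeled = "", {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, detail = line.partition(": ")
        if sep:
            labeled[label] = detail
        elif not value:
            value = line
    return value, labeled


def parse_record(header_text, row_texts):
    """Turn one header plus its four rows into a flat dictionary."""
    header_lines = [l.strip() for l in header_text.splitlines() if l.strip()]
    record = {
        "Number": header_lines[0].rstrip("."),
        "Trade Name": header_lines[1],
        "Pending": any(l.startswith("Pending") for l in header_lines[2:]),
    }
    columns = ["Owner", "Address", "City/State/ZIP", "County"]  # assumed meanings
    for column, text in zip(columns, row_texts):
        value, labeled = split_labeled(text)
        record[column] = value
        record.update(labeled)
    return record


example = parse_record(
    "6.\nVAPE LOFT (THE)",
    ["DISBROW II EMERSON HARRINGTON\nLic. Status: Issued",
     "185 MITCHELLS CHANCE RD\nLicense: 02102408",
     "EDGEWATER, MD 21037\nIssued Date: 4/13/2017",
     "Anne Arundel County"],
)
print(example["Trade Name"], "|", example["License"], "|", example["County"])
```

Pending records such as "VAPE HAVEN" have no `License` or `Issued Date` line, so those keys are simply absent from their dictionaries.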

### Save these into `vape-results.csv`

In [18]:
vape_results_1 = []

for element in business_headers:
    vape_results_dictionary = {}

    headers = element.text.strip()
    if headers:
        vape_results_dictionary['Header'] = headers

    # The detail rows are the <tr> siblings that follow this header row
    rows = element.find_next_siblings('tr')
    for i, row in enumerate(rows[:4]):
        row_text = row.text.strip()
        if row_text:
            vape_results_dictionary['Row ' + str(i)] = row_text

    vape_results_1.append(vape_results_dictionary)

vape_results_1


[{'Header': '6.\nVAPE LOFT (THE)',
  'Row 0': 'DISBROW II EMERSON HARRINGTON\nLic. Status: Issued',
  'Row 1': '185 MITCHELLS CHANCE RD\nLicense: 02102408',
  'Row 2': 'EDGEWATER, MD 21037\nIssued Date: 4/13/2017',
  'Row 3': 'Anne Arundel County'},
 {'Header': '7.\nVAPE N CIGAR',
  'Row 0': 'DISCOUNT TOBACCO ESSEX LLC\nLic. Status: Issued',
  'Row 1': '7104 MINSTREL UNIT #7\nLicense: 13141786',
  'Row 2': 'COLUMBIA, MD 21045\nIssued Date: 5/19/2017',
  'Row 3': 'Howard County'},
 {'Header': '8.\nVAPE DOJO',
  'Row 0': 'FAIRGROUND VILLAGE LLC\nLic. Status: Issued',
  'Row 1': '330 ONE FORTY VILLAGE ROAD\nLicense: 06126253',
  'Row 2': 'WESTMINSTER, MD 21157\nIssued Date: 4/21/2017',
  'Row 3': 'Carroll County'},
 {'Header': '9.\nVAPE HAVEN\nPending *',
  'Row 0': 'GRIMM JENNIFER\nLic. Status: Pending',
  'Row 1': '29890 THREE NOTCH ROAD',
  'Row 2': 'CHARLOTTE HALL, MD 20622',
  'Row 3': "St. Mary's County"},
 {'Header': '10.\nVAPE BIRD',
  'Row 0': 'HUTCH VAPES LLC\nLic. Sta

In [19]:
import pandas as pd
df_1 = pd.DataFrame(vape_results_1)
df_1.to_csv("vape-results.csv", index=False)


### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [20]:
df_1 = pd.read_csv("vape-results.csv")
df_1.head()


,Header,Row 0,Row 1,Row 2,Row 3
0,6.\nVAPE LOFT (THE),DISBROW II EMERSON HARRINGTON\nLic. Status: Issued,185 MITCHELLS CHANCE RD\nLicense: 02102408,"EDGEWATER, MD 21037\nIssued Date: 4/13/2017",Anne Arundel County
1,7.\nVAPE N CIGAR,DISCOUNT TOBACCO ESSEX LLC\nLic. Status: Issued,7104 MINSTREL UNIT #7\nLicense: 13141786,"COLUMBIA, MD 21045\nIssued Date: 5/19/2017",Howard County
2,8.\nVAPE DOJO,FAIRGROUND VILLAGE LLC\nLic. Status: Issued,330 ONE FORTY VILLAGE ROAD\nLicense: 06126253,"WESTMINSTER, MD 21157\nIssued Date: 4/21/2017",Carroll County
3,9.\nVAPE HAVEN\nPending *,GRIMM JENNIFER\nLic. Status: Pending,29890 THREE NOTCH ROAD,"CHARLOTTE HALL, MD 20622",St. Mary's County
4,10.\nVAPE BIRD,HUTCH VAPES LLC\nLic. Status: Issued,356 ROMANCOKE ROAD\nLicense: 17166688,"STEVENSVILLE, MD 21666\nIssued Date: 4/13/2017",Queen Anne's County
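
A quick way to check for stray columns without pandas is to read only the header row of the file. The usual culprit is a saved DataFrame index: pandas writes it with an empty header cell, and it typically shows up as `Unnamed: 0` when the file is read back. `csv_columns` and `unexpected_columns` are hypothetical helper names; the sketch uses only the standard library.

```python
import csv


def csv_columns(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def unexpected_columns(path, expected):
    """List the columns in the file that are not in the expected set."""
    return [name for name in csv_columns(path) if name not in expected]
```

For this notebook, `unexpected_columns("vape-results.csv", ["Header", "Row 0", "Row 1", "Row 2", "Row 3"])` should come back empty if the file was saved with `index=False`.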


In [21]:
# search_button_next_2 = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')
# search_button_next_2.click()

# search_the_first_page = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[2]/a[1]')
# search_the_first_page.click()

### Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [22]:
from selenium.common.exceptions import NoSuchElementException

vape_results_all = []

# Jump back to the first page of results
search_the_first_page = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[2]/a[1]')
search_the_first_page.click()

while True:
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr', class_='searchfieldtitle')

    for element in business_headers:
        vape_results_all_dictionary = {}

        headers = element.text.strip()
        if headers:
            vape_results_all_dictionary['Header'] = headers

        # The detail rows are the <tr> siblings that follow this header row
        rows = element.find_next_siblings('tr')
        for i, row in enumerate(rows[:4]):
            row_text = row.text.strip()
            if row_text:
                vape_results_all_dictionary['Row ' + str(i)] = row_text

        vape_results_all.append(vape_results_all_dictionary)

    # Go to the next page; stop once there is no "Next" link left
    try:
        search_button_next_2 = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a')
        search_button_next_2.click()
    except NoSuchElementException:
        break

# vape_results_all


In [23]:
df_all = pd.DataFrame(vape_results_all)
df_all.to_csv("vape-results-all.csv", index=False)
df_all = pd.read_csv("vape-results-all.csv")
df_all


,Header,Row 0,Row 1,Row 2,Row 3
0,1.\nVAPE IT STORE I,...,...,...,...
1,2.\nVAPE IT STORE II,...,...,...,...
2,3.\nVAPEPAD THE,...,...,...,...
3,4.\nVAPE FROG,...,...,...,...
4,5.\nVAPE FROG\nPending *,...,...,...,...
5,6.\nVAPE LOFT (THE),DISBROW II EMERSON HARRINGTON\nLic. Status: Issued,185 MITCHELLS CHANCE RD\nLicense: 02102408,"EDGEWATER, MD 21037\nIssued Date: 4/13/2017",Anne Arundel County
6,7.\nVAPE N CIGAR,DISCOUNT TOBACCO ESSEX LLC\nLic. Status: Issued,7104 MINSTREL UNIT #7\nLicense: 13141786,"COLUMBIA, MD 21045\nIssued Date: 5/19/2017",Howard County
7,8.\nVAPE DOJO,FAIRGROUND VILLAGE LLC\nLic. Status: Issued,330 ONE FORTY VILLAGE ROAD\nLicense: 06126253,"WESTMINSTER, MD 21157\nIssued Date: 4/21/2017",Carroll County
8,9.\nVAPE HAVEN\nPending *,GRIMM JENNIFER\nLic. Status: Pending,29890 THREE NOTCH ROAD,"CHARLOTTE HALL, MD 20622",St. Mary's County
9,10.\nVAPE BIRD,HUTCH VAPES LLC\nLic. Status: Issued,356 ROMANCOKE ROAD\nLicense: 17166688,"STEVENSVILLE, MD 21666\nIssued Date: 4/13/2017",Queen Anne's County
