# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem: you have to check a box agreeing to a disclaimer before you can get in. Done by hand, the process looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [205]:
# https://jportal.mdcourts.gov/license/pbPublicSearch.jsp

**It isn't going to work, though! It's going to redirect to that intro page.** Your regular browser may skip the intro once you've agreed to it, so use *Incognito mode* whenever you want to step through the "check the box, etc." series of pages again.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

In [206]:
#XPath <3
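
To make "what's unique about this element?" concrete, here is a minimal sketch that runs without a browser. It uses Python's built-in `xml.etree.ElementTree`, which understands a small subset of XPath, on a made-up snippet of markup. The `id`, `name`, and layout below are assumptions for illustration, not the real disclaimer page. It compares an attribute-based path with a positional one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: NOT copied from the real disclaimer page.
page = """
<html>
  <body>
    <form>
      <div>
        <input type="checkbox" id="checkbox" name="disclaimer"/>
        <input type="submit" value="Enter the Site"/>
      </div>
    </form>
  </body>
</html>
"""

root = ET.fromstring(page)

# Attribute-based: "any <input> whose id is 'checkbox'".
by_attribute = root.find(".//input[@id='checkbox']")

# Position-based: "the first <input> inside body/form/div".
by_position = root.find("./body/form/div/input[1]")

print(by_attribute.get("name"), by_position.get("name"))
```

Both paths land on the same element here. The attribute-based one keeps working if the page gains an extra wrapper or row, while the positional one would then point somewhere else.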

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [207]:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://jportal.mdcourts.gov/license/pbPublicSearch.jsp')

In [208]:
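# find_element(by, value) is used here because newer Selenium releases
# no longer have the find_element_by_id helper
check_box = driver.find_element("id", "checkbox")
check_box.click()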

In [209]:
enter_site = driver.find_element("xpath", '/html/body/table/tbody/tr[7]/td/form/div/input[2]')
enter_site.click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [210]:
license_records = driver.find_element("xpath", '/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
license_records.click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [211]:
# Open the dropdown, then click its second option ("Statewide")
jurisdiction_dropdown = driver.find_element("xpath", '//*[@id="slcJurisdiction"]')
jurisdiction_dropdown.click()

statewide = driver.find_element("xpath", '//*[@id="slcJurisdiction"]/option[2]')
statewide.click()

### How do you type "vap%" into the Trade Name field?

In [212]:
trade_name = driver.find_element("xpath", '//*[@id="txtTradeName"]')
trade_name.send_keys('vap%')

### How do you click the submit button or submit the form?

In [213]:
search_form = driver.find_element("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/form')
search_form.submit()

### How can you find and click the 'Next' button on the search results page?

In [214]:
next_button = driver.find_element("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
next_button.click()

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [215]:
# find_elements (plural) returns an empty list instead of raising an error
# when nothing matches, so we can stop once there is no "Next" link left
for page in range(1,8):
    next_buttons = driver.find_elements("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
    if len(next_buttons) > 0:
        next_buttons[0].click()
    else:
        break


# Alternative: look for a single element and ignore the error if it is missing
# for page in range(1,8):
#     try:
#         next_button = driver.find_element("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
#         next_button.click()
#     except Exception:
#         pass
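
The loop above runs a fixed number of times, which only fits if you already know how many result pages there are. As a rough sketch of the "click until there is no Next link" idea, the function below takes any lookup that returns a list of clickable objects. `click_through`, `FakeSite`, and the `max_pages` cap are all hypothetical names invented for this example, and the fake site stands in for the browser so the sketch runs without Selenium.

```python
def click_through(find_next_buttons, max_pages=50):
    """Click "Next" until it disappears; return how many pages were seen.

    find_next_buttons is any zero-argument callable returning a list of
    objects with a .click() method, for example a lambda wrapping the
    Next-button lookup from the cell above.
    max_pages is a safety cap (an assumption, not a limit of the site).
    """
    pages_seen = 1
    while pages_seen < max_pages:
        buttons = find_next_buttons()
        if not buttons:
            break  # no "Next" link: this is the last page
        buttons[0].click()
        pages_seen += 1
    return pages_seen


class FakeSite:
    """Stand-in for the browser so the sketch runs without Selenium."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.current = 1

    def click(self):
        self.current += 1

    def find_next_buttons(self):
        # The fake "Next" button is the site object itself.
        return [self] if self.current < self.total_pages else []


site = FakeSite(total_pages=7)
print(click_through(site.find_next_buttons))
```

The cap is there so a page that always shows a "Next" link cannot keep the loop running forever.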

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [216]:
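# td[1] appears to hold the "Previous" link (td[3] is "Next"):
# click it until we are back on the first page of results
for page in range(1,8):
    prev_buttons = driver.find_elements("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[1]/a')
    if len(prev_buttons) > 0:
        prev_buttons[0].click()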

In [217]:
from bs4 import BeautifulSoup

doc = BeautifulSoup(driver.page_source, 'html.parser')
doc

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="EN" xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Maryland Judiciary Business Licenses Online</title>
<link href="theme/styles.css" rel="STYLESHEET" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head>
<body>
<!-- HEADER AREA - Header graphic, company logo and link images go here-->
<table border="0" cellpadding="0" cellspacing="0" summary="Page Layout Table" width="100%">
<tbody><tr>
<td colspan="3">
<img alt="MARYLAND BUSINESS LICENSES ONLINE" height="66" src="images/header_new.gif" width="596"/><img alt="" src="images/spacer.gif" width="35"/>
</td>
</tr>
<tr>
<td class="headerline">
<img alt="blank spacer" height="18" src="images/spacer.gif" width="35"/>
</td>
<td align="LEFT" class="headerline">
<table border="0" cellpadding="0" cellspacing="0" summary="Navigation Menu">
<tbody><tr valign="top">
<td>
<img alt="" height="18

In [218]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5
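
Each of the five header rows found above marks the start of one business, and the rest of that business's details sit in the table rows that follow it. The sketch below shows one way to walk from a header row to its detail rows with `find_next_siblings`. The markup is a hypothetical, heavily simplified stand-in modeled on the values this notebook prints; only the `searchfieldtitle` class name comes from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the real results page has more cells.
html = """
<table>
  <tr class="searchfieldtitle"><td>1.</td><td>VAPE IT STORE II</td></tr>
  <tr><td></td><td>AMIN NARGIS</td><td>Lic. Status: Issued</td></tr>
  <tr><td></td><td>1015 S SALISBURY BLVD</td><td>License: 22173808</td></tr>
  <tr class="searchfieldtitle"><td>2.</td><td>VAPE IT STORE I</td></tr>
  <tr><td></td><td>AMIN NARGIS</td><td>Lic. Status: Issued</td></tr>
</table>
"""

doc = BeautifulSoup(html, "html.parser")
headers = doc.find_all("tr", class_="searchfieldtitle")

# find_next_siblings returns EVERY later <tr>, including the rows that
# belong to the next business, so pick rows by index right after the header.
rows = headers[0].find_next_siblings("tr")
company = rows[0].find_all("td")[1].text.strip()

print(len(headers), len(rows), company)
```

Because the sibling list runs past the current business, indexing only the first few rows after each header is what keeps one record's details from leaking into the next.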

In [219]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.

base_url = "https://jportal.mdcourts.gov/license/"
stores = []

for header in business_headers:
    store = {}
    # Every <tr> after this header, including rows of later businesses;
    # only the first four are used for this business
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    # Pending licenses have no "Click for detail" link
    link = header.find('a')
    if link:
        print("Click for detail", base_url + link['href'])
        store['Details'] = base_url + link['href']
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    store['Store'] = header.find_all('td')[1].text.strip()
    store['Company'] = rows[0].find_all('td')[1].text.strip()
    # Street and city rows are concatenated directly (no separator)
    store['Address'] = rows[1].find_all('td')[1].text.strip() + rows[2].find_all('td')[1].text.strip()
    store['County'] = rows[3].find_all('td')[1].text.strip()
    store['Lic. Status'] = rows[0].find_all('td')[2].text.strip().replace('Lic. Status: ', '')
    store['License'] = rows[1].find_all('td')[2].text.strip().replace('License: ', '')
    store['Issued Date'] = rows[2].find_all('td')[2].text.strip().replace('Issued Date: ', '')
    stores.append(store)
    print(store)
    print("----")

HEADER is 1.
VAPE IT STORE II
Click for detail https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=lSlY%2FYw0Bwk%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
{'Details': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=lSlY%2FYw0Bwk%3D', 'Store': 'VAPE IT STORE II', 'Company': 'AMIN NARGIS', 'Address': '1015 S SALISBURY BLVDSALISBURY, MD 21801', 'County': 'Wicomico County', 'Lic. Status': 'Issued', 'License': '22173808', 'Issued Date': '4/27/2017'}
----
HEADER is 2.
VAPE IT STORE I
Click for detail https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=03DkF%2FArLqY%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
{'Details': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=03DkF%2FArLqY%3D', 'Sto
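
In the output above, the `Address` value reads `1015 S SALISBURY BLVDSALISBURY, MD 21801` because the street row and the city row are glued together with nothing in between. A small sketch of two possible tidy-ups follows, using values copied from that output: joining the address parts with a separator, and building the detail link with `urljoin` instead of string concatenation. Using the search page URL as the base is an assumption; it presumes the results page lives in the same `/license/` directory.

```python
from urllib.parse import urljoin

# Values copied from the printed output above.
street = "1015 S SALISBURY BLVD"
city_state_zip = "SALISBURY, MD 21801"
href = "pbLicenseDetail.jsp?owi=lSlY%2FYw0Bwk%3D"

# An explicit separator keeps the street and the city apart.
address = ", ".join([street, city_state_zip])

# urljoin resolves the relative href against a base page URL.
# Assumption: the results page sits next to pbPublicSearch.jsp.
base_url = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"
details = urljoin(base_url, href)

print(address)
print(details)
```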

### Save these into `vape_results.csv`

In [220]:
import pandas as pd

In [221]:
df = pd.DataFrame(stores)
df.head()

| | Address | Company | County | Details | Issued Date | Lic. Status | License | Store |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVDSALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE |
| 3 | 110 S. PINEY RDCHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | Pending | | VAPE FROG |


In [222]:
df.to_csv("vape_results.csv", index=False)

### Open `vape_results.csv` to make sure there aren't any extra weird columns

In [223]:
vape_results_df = pd.read_csv("vape_results.csv")
vape_results_df

| | Address | Company | County | Details | Issued Date | Lic. Status | License | Store |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVDSALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE |
| 3 | 110 S. PINEY RDCHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | Pending | | VAPE FROG |
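
One thing worth checking when reading the file back: the `License` column shows up as a float (`22173808.0`), and a license number with a leading zero, such as the `08131285` printed elsewhere in this notebook, would lose that zero. A minimal sketch of one way around it is to ask `pd.read_csv` to keep the column as text with `dtype`. The two-row CSV below is a stand-in built from values in this notebook's output, not the real saved file.

```python
import io

import pandas as pd

# Stand-in for the saved CSV; the second row mimics a pending license
# that has no license number yet.
csv_text = (
    "Store,License\n"
    "VAPE JUNGLE,08131285\n"
    "VAPE TIME,\n"
)

# Default behavior: pandas guesses a numeric type for the column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing text keeps the digits exactly as they were written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"License": str})

print(inferred["License"].iloc[0])
print(as_text["License"].iloc[0])
```

Missing license numbers still come back as missing values either way, so pending rows stay distinguishable.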


## Use Selenium to scrape ALL pages of results, save the results into `vape_results_all.csv`.

In [224]:
base_url = "https://jportal.mdcourts.gov/license/"
stores_all = []

for page in range(1,8):
    # Parse whatever page the browser is currently showing
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    for header in business_headers:
        store = {}
        rows = header.find_next_siblings('tr')
        print("HEADER is", header.text.strip())
        # Pending licenses have no "Click for detail" link
        link = header.find('a')
        if link:
            print("Click for detail", base_url + link['href'])
            store['Details'] = base_url + link['href']
        print("ROW 0 IS", rows[0].text.strip())
        print("ROW 1 IS", rows[1].text.strip())
        print("ROW 2 IS", rows[2].text.strip())
        print("ROW 3 IS", rows[3].text.strip())
        store['Store'] = header.find_all('td')[1].text.strip()
        store['Company'] = rows[0].find_all('td')[1].text.strip()
        # Street and city rows are concatenated directly (no separator)
        store['Address'] = rows[1].find_all('td')[1].text.strip() + rows[2].find_all('td')[1].text.strip()
        store['County'] = rows[3].find_all('td')[1].text.strip()
        store['Lic. Status'] = rows[0].find_all('td')[2].text.strip().replace('Lic. Status: ', '')
        store['License'] = rows[1].find_all('td')[2].text.strip().replace('License: ', '')
        store['Issued Date'] = rows[2].find_all('td')[2].text.strip().replace('Issued Date: ', '')
        stores_all.append(store)
        print(store)
        print("----")
    # Move on to the next page, or stop when there is no "Next" link left
    # so the last page is not scraped twice
    next_buttons = driver.find_elements("xpath", '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
    if len(next_buttons) > 0:
        next_buttons[0].click()
    else:
        break

HEADER is 1.
VAPE IT STORE II
Click for detail https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=lSlY%2FYw0Bwk%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
{'Details': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=lSlY%2FYw0Bwk%3D', 'Store': 'VAPE IT STORE II', 'Company': 'AMIN NARGIS', 'Address': '1015 S SALISBURY BLVDSALISBURY, MD 21801', 'County': 'Wicomico County', 'Lic. Status': 'Issued', 'License': '22173808', 'Issued Date': '4/27/2017'}
----
HEADER is 2.
VAPE IT STORE I
Click for detail https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=03DkF%2FArLqY%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
{'Details': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=03DkF%2FArLqY%3D', 'Sto

HEADER is 21.
VAPE JUNGLE
Click for detail https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=7hlMpOr1HKw%3D
ROW 0 IS VAPE JUNGLE LLC
Lic. Status: Issued
ROW 1 IS 2070 CRAIN HIGHWAY  UNIT F
License: 08131285
ROW 2 IS WALDORF, MD 20601
Issued Date: 3/31/2017
ROW 3 IS Charles County
{'Details': 'https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=7hlMpOr1HKw%3D', 'Store': 'VAPE JUNGLE', 'Company': 'VAPE JUNGLE LLC', 'Address': '2070 CRAIN HIGHWAY  UNIT FWALDORF, MD 20601', 'County': 'Charles County', 'Lic. Status': 'Issued', 'License': '08131285', 'Issued Date': '3/31/2017'}
----
HEADER is 22.
VAPE TIME
Pending *
ROW 0 IS VAPE TIME LLC
Lic. Status: Pending
ROW 1 IS 4130 E JOPPA RD UNIT 109
ROW 2 IS NOTTINGHAM, MD 21236
ROW 3 IS Baltimore County
{'Store': 'VAPE TIME', 'Company': 'VAPE TIME LLC', 'Address': '4130 E JOPPA RD UNIT 109NOTTINGHAM, MD 21236', 'County': 'Baltimore County', 'Lic. Status': 'Pending', 'License': '', 'Issued Date': ''}
----
HEADER is 23.
VAPEBAR E

In [225]:
all_df = pd.DataFrame(stores_all)
all_df

| | Address | Company | County | Details | Issued Date | Lic. Status | License | Store |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVDSALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE |
| 3 | 110 S. PINEY RDCHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | Pending | | VAPE FROG |
| 5 | 185 MITCHELLS CHANCE RDEDGEWATER, MD 21037 | DISBROW II EMERSON HARRINGTON | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/13/2017 | Issued | 2102408.0 | VAPE LOFT (THE) |
| 6 | 7104 MINSTREL UNIT #7COLUMBIA, MD 21045 | DISCOUNT TOBACCO ESSEX LLC | Howard County | https://jportal.mdcourts.gov/license/pbLicense... | 5/19/2017 | Issued | 13141786.0 | VAPE N CIGAR |
| 7 | 330 ONE FORTY VILLAGE ROADWESTMINSTER, MD 21157 | FAIRGROUND VILLAGE LLC | Carroll County | https://jportal.mdcourts.gov/license/pbLicense... | 4/21/2017 | Issued | 6126253.0 | VAPE DOJO |
| 8 | 29890 THREE NOTCH ROADCHARLOTTE HALL, MD 20622 | GRIMM JENNIFER | St. Mary's County | | | Pending | | VAPE HAVEN |
| 9 | 356 ROMANCOKE ROADSTEVENSVILLE, MD 21666 | HUTCH VAPES LLC | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 4/13/2017 | Issued | 17166688.0 | VAPE BIRD |


In [226]:
all_df.to_csv("vape_results_all.csv", index=False)

In [227]:
vape_results_all_df = pd.read_csv("vape_results_all.csv")
vape_results_all_df

| | Address | Company | County | Details | Issued Date | Lic. Status | License | Store |
|---|---|---|---|---|---|---|---|---|
| 0 | 1015 S SALISBURY BLVDSALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173808.0 | VAPE IT STORE II |
| 1 | 1724 N SALISBURY BLVD UNIT 2SALISBURY, MD 21801 | AMIN NARGIS | Wicomico County | https://jportal.mdcourts.gov/license/pbLicense... | 4/27/2017 | Issued | 22173807.0 | VAPE IT STORE I |
| 2 | 2299 JOHNS HOPKINS ROADGAMBRILLS, MD 21054 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/05/2017 | Issued | 2104436.0 | VAPEPAD THE |
| 3 | 110 S. PINEY RDCHESTER, MD 21619 | COX TRADING COMPANY L L C | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 5/31/2017 | Issued | 17165957.0 | VAPE FROG |
| 4 | 346 RITCHIE HIGHWAYSEVERNA PARK, MD 21146 | COX TRADING LLC | Anne Arundel County | | | Pending | | VAPE FROG |
| 5 | 185 MITCHELLS CHANCE RDEDGEWATER, MD 21037 | DISBROW II EMERSON HARRINGTON | Anne Arundel County | https://jportal.mdcourts.gov/license/pbLicense... | 4/13/2017 | Issued | 2102408.0 | VAPE LOFT (THE) |
| 6 | 7104 MINSTREL UNIT #7COLUMBIA, MD 21045 | DISCOUNT TOBACCO ESSEX LLC | Howard County | https://jportal.mdcourts.gov/license/pbLicense... | 5/19/2017 | Issued | 13141786.0 | VAPE N CIGAR |
| 7 | 330 ONE FORTY VILLAGE ROADWESTMINSTER, MD 21157 | FAIRGROUND VILLAGE LLC | Carroll County | https://jportal.mdcourts.gov/license/pbLicense... | 4/21/2017 | Issued | 6126253.0 | VAPE DOJO |
| 8 | 29890 THREE NOTCH ROADCHARLOTTE HALL, MD 20622 | GRIMM JENNIFER | St. Mary's County | | | Pending | | VAPE HAVEN |
| 9 | 356 ROMANCOKE ROADSTEVENSVILLE, MD 21666 | HUTCH VAPES LLC | Queen Anne's County | https://jportal.mdcourts.gov/license/pbLicense... | 4/13/2017 | Issued | 17166688.0 | VAPE BIRD |
