# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only catch is that you have to check a box before you can get in. By hand, the process looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [40]:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://jportal.mdcourts.gov/license/pbLicenseDetail.jsp?owi=uYnr%2BUU26hU%3D")

**It isn't going to work, though! It's going to redirect to that intro page.** If you want to step through the "check the box, etc." series of pages again in your own browser, open the site in *Incognito mode*.
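One way to confirm the redirect from code is to compare the URL you asked for with the URL the browser ended up on. The sketch below is a minimal illustration using only the standard library; `landed_on` is a hypothetical helper name, and the `redirected` value is a placeholder, not the site's real disclaimer URL.

```python
from urllib.parse import urlsplit


def landed_on(requested_url, current_url):
    """Return True if current_url has the same host and path as requested_url."""
    wanted = urlsplit(requested_url)
    got = urlsplit(current_url)
    return (wanted.netloc, wanted.path) == (got.netloc, got.path)


requested = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"
# Placeholder standing in for wherever the site redirects to (not the real URL).
redirected = "https://example.com/disclaimer"

print(landed_on(requested, requested))   # True
print(landed_on(requested, redirected))  # False
```

With a live browser, you would pass `driver.current_url` as the second argument right after calling `driver.get(...)`.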

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

In [41]:
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
checkbox

<selenium.webdriver.remote.webelement.WebElement (session="390c121fcb9e99aaad5da9c80bf77daa", element="0.21200405325449223-1")>
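The cell above uses the `find_element_by_xpath` helper from the Selenium version this notebook was written with. Selenium 4.3 and later dropped the `find_element_by_*` helpers in favor of a single `find_element(by, value)` call. Here is a minimal sketch of the same lookup in that style. `find_checkbox` is a hypothetical helper name, and the strategy strings are written out by hand so the snippet needs no imports; they are the values behind `By.ID` and `By.XPATH` in `selenium.webdriver.common.by`.

```python
# Locator strategy names. These are the string values of By.ID and By.XPATH
# in selenium.webdriver.common.by (normally you would import By instead).
BY_ID = "id"
BY_XPATH = "xpath"


def find_checkbox(driver):
    """Return the disclaimer checkbox using the find_element(by, value) form."""
    # Same element as the XPath '//*[@id="checkbox"]', located by its id directly.
    return driver.find_element(BY_ID, "checkbox")
```

Calling `find_checkbox(driver)` should return the same kind of `WebElement` shown in the output above, assuming the page still has an element with `id="checkbox"`.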

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [43]:
# Tick the disclaimer checkbox found above before submitting the form.
checkbox.click()
enter_button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
enter_button.click()

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [44]:
license_records = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
license_records.click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [45]:
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_xpath('//*[@id="slcJurisdiction"]'))
select.select_by_visible_text("Statewide")

### How do you type "vap%" into the Trade Name field?

In [46]:
trade_name = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
trade_name.send_keys("vap%")

### How do you click the submit button or submit the form?

In [47]:
submit = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
submit.click()

### How can you find and click the 'Next' button on the search results page?

In [13]:
from selenium.common.exceptions import NoSuchElementException

while True:
    # Keep clicking 'Next' until the link can't be found, which is expected
    # on the last page. Without the break, this loop would run forever.
    try:
        next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
        next_button.click()
    except NoSuchElementException:
        break
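Since a missing `break` means an endless loop, it can help to give the loop a hard upper limit as well. The sketch below wraps the same idea in a function with a page cap. `click_through` and `max_pages` are hypothetical names, and the broad `except Exception` is an assumption made to keep the snippet free of imports; with Selenium imported you would catch a more specific exception.

```python
def click_through(driver, next_xpath, max_pages=100):
    """Click the 'Next' link until it is missing or max_pages is reached.

    Returns the number of result pages visited, counting the starting page.
    """
    pages_seen = 1
    while pages_seen < max_pages:
        try:
            # find_element(by, value) with the "xpath" strategy string
            driver.find_element("xpath", next_xpath).click()
        except Exception:
            # Assumption: a failed lookup or click means there is no next page.
            break
        pages_seen += 1
    return pages_seen
```

The return value doubles as a quick sanity check: if it equals `max_pages`, the loop was probably cut off before reaching the last page.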
        
        
        
        
        

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [None]:
## I did this above

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [62]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, "html.parser")
business_headers = doc.find_all('tr',class_='searchfieldtitle')
len(business_headers)

5

In [76]:
# You'll probably need to find specific tds inside
# of the rows instead of looking at the whole thing.
for header in business_headers:
    rows = header.find_next_siblings('tr')
    print("HEADER is", header.text.strip())
    if header.find("a"):
        links = header.find("a")["href"]
        print("link is", links.strip())
    else:
        links = "none"
        print(links)
    print("ROW 0 IS", rows[0].text.strip())
    print("ROW 1 IS", rows[1].text.strip())
    print("ROW 2 IS", rows[2].text.strip())
    print("ROW 3 IS", rows[3].text.strip())
    print("----")
    



HEADER is 1.
VAPE IT STORE II
link is pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1015 S SALISBURY BLVD
License: 22173808
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 2.
VAPE IT STORE I
link is pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D
ROW 0 IS AMIN NARGIS
Lic. Status: Issued
ROW 1 IS 1724 N SALISBURY BLVD UNIT 2
License: 22173807
ROW 2 IS SALISBURY, MD 21801
Issued Date: 4/27/2017
ROW 3 IS Wicomico County
----
HEADER is 3.
VAPEPAD THE
link is pbLicenseDetail.jsp?owi=vbuPTC20I14%3D
ROW 0 IS ANJ DISTRIBUTIONS LLC
Lic. Status: Issued
ROW 1 IS 2299 JOHNS HOPKINS ROAD
License: 02104436
ROW 2 IS GAMBRILLS, MD 21054
Issued Date: 4/05/2017
ROW 3 IS Anne Arundel County
----
HEADER is 4.
VAPE FROG
link is pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D
ROW 0 IS COX TRADING COMPANY L L C
Lic. Status: Issued
ROW 1 IS 110 S. PINEY RD
License: 17165957
ROW 2 IS CHESTER, MD 21619
Issued Date: 5/31/2017
ROW 3 IS Queen Ann
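The links printed above are relative (`pbLicenseDetail.jsp?owi=...`), so they need a base URL before they can be visited. Here is a minimal sketch using the standard library's `urljoin`. `absolute_link` is a hypothetical helper, and using the public search page as the base is an assumption; with a live browser, `driver.current_url` would be the safer base.

```python
from urllib.parse import urljoin

# Assumed base: the public search page from the walkthrough at the top.
SEARCH_PAGE = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"


def absolute_link(href, base=SEARCH_PAGE):
    """Resolve a relative href from the results page against a base URL."""
    return urljoin(base, href)


print(absolute_link("pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D"))
```

The result has the same shape as the detail URL used in the Preparation section.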

### Save these into `vape-results.csv`

In [77]:
list_1 = []
for header in business_headers:
    dictionary = {}
    # The four rows after each header hold owner/status, street/license,
    # city/issue date, and county. Each row's cells come back as one string,
    # so "Status" below also contains the owner's name, and so on.
    rows = header.find_next_siblings('tr')
    header_1 = header.text.strip().replace("\n", "")
    if header_1:
        print(header_1)
        dictionary["Header"] = header_1
    # Some entries (such as pending licenses) have no detail link.
    if header.find("a"):
        links = header.find("a")["href"]
        print("link is", links.strip())
        dictionary["link"] = links
    else:
        links = "none"
        print(links)
        dictionary["link"] = links
    status = rows[0].text.strip().replace("\n", "")
    if status:
        print(status)
        dictionary["Status"] = status
    license_text = rows[1].text.strip().replace("\n", "")
    if license_text:
        print(license_text)
        dictionary["License"] = license_text
    date = rows[2].text.strip().replace("\n", "")
    if date:
        print(date)
        dictionary["Date"] = date
    county = rows[3].text.strip().replace("\n", "")
    if county:
        print(county)
        dictionary["county"] = county
    list_1.append(dictionary)

1.VAPE IT STORE II
link is pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D
AMIN NARGISLic. Status: Issued
1015 S SALISBURY BLVDLicense: 22173808
SALISBURY, MD 21801Issued Date: 4/27/2017
Wicomico County
2.VAPE IT STORE I
link is pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D
AMIN NARGISLic. Status: Issued
1724 N SALISBURY BLVD UNIT 2License: 22173807
SALISBURY, MD 21801Issued Date: 4/27/2017
Wicomico County
3.VAPEPAD THE
link is pbLicenseDetail.jsp?owi=vbuPTC20I14%3D
ANJ DISTRIBUTIONS LLCLic. Status: Issued
2299 JOHNS HOPKINS ROADLicense: 02104436
GAMBRILLS, MD 21054Issued Date: 4/05/2017
Anne Arundel County
4.VAPE FROG
link is pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D
COX TRADING COMPANY L L CLic. Status: Issued
110 S. PINEY RDLicense: 17165957
CHESTER, MD 21619Issued Date: 5/31/2017
Queen Anne's County
5.VAPE FROGPending *
none
COX TRADING LLCLic. Status: Pending
346 RITCHIE HIGHWAY
SEVERNA PARK, MD 21146
Anne Arundel County
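As the printed output shows, each row fuses two values into one string, for example `AMIN NARGISLic. Status: Issued`. The sketch below is one possible way to pull them apart by splitting on the label text. `split_field` and `tidy_record` are hypothetical helpers, the new key names are my own choice, and the labels are taken from the output above.

```python
def split_field(text, label):
    """Split 'VALUE<label> REST' into (VALUE, REST); REST is None if label is absent."""
    value, found, rest = text.partition(label)
    return value.strip(), (rest.strip() if found else None)


def tidy_record(record):
    """Turn one scraped dictionary into separate, clearly named fields."""
    owner, status = split_field(record.get("Status", ""), "Lic. Status:")
    street, license_number = split_field(record.get("License", ""), "License:")
    city_state_zip, issued_date = split_field(record.get("Date", ""), "Issued Date:")
    return {
        "header": record.get("Header"),
        "owner": owner,
        "status": status,
        "street": street,
        "license_number": license_number,
        "city_state_zip": city_state_zip,
        "issued_date": issued_date,
        "county": record.get("county"),
        "link": record.get("link"),
    }


sample = {  # first record from the output above
    "Header": "1.VAPE IT STORE II",
    "Status": "AMIN NARGISLic. Status: Issued",
    "License": "1015 S SALISBURY BLVDLicense: 22173808",
    "Date": "SALISBURY, MD 21801Issued Date: 4/27/2017",
    "county": "Wicomico County",
    "link": "pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D",
}
print(tidy_record(sample))
```

Rows without a label, such as the pending entry's `346 RITCHIE HIGHWAY`, keep their text and get `None` for the missing half.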


In [78]:
list_1

[{'Date': 'SALISBURY, MD 21801Issued Date: 4/27/2017',
  'Header': '1.VAPE IT STORE II',
  'License': '1015 S SALISBURY BLVDLicense: 22173808',
  'Status': 'AMIN NARGISLic. Status: Issued',
  'county': 'Wicomico County',
  'link': 'pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D'},
 {'Date': 'SALISBURY, MD 21801Issued Date: 4/27/2017',
  'Header': '2.VAPE IT STORE I',
  'License': '1724 N SALISBURY BLVD UNIT 2License: 22173807',
  'Status': 'AMIN NARGISLic. Status: Issued',
  'county': 'Wicomico County',
  'link': 'pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D'},
 {'Date': 'GAMBRILLS, MD 21054Issued Date: 4/05/2017',
  'Header': '3.VAPEPAD THE',
  'License': '2299 JOHNS HOPKINS ROADLicense: 02104436',
  'Status': 'ANJ DISTRIBUTIONS LLCLic. Status: Issued',
  'county': 'Anne Arundel County',
  'link': 'pbLicenseDetail.jsp?owi=vbuPTC20I14%3D'},
 {'Date': 'CHESTER, MD 21619Issued Date: 5/31/2017',
  'Header': '4.VAPE FROG',
  'License': '110 S. PINEY RDLicense: 17165957',
  'Status': 'COX TRADING COMPANY

In [79]:
import pandas as pd

df = pd.DataFrame(list_1)

df.to_csv("vape-results.csv", index=False)


In [80]:
df.head()

| | Date | Header | License | Status | county | link |
|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 1.VAPE IT STORE II | 1015 S SALISBURY BLVDLicense: 22173808 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D |
| 1 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 2.VAPE IT STORE I | 1724 N SALISBURY BLVD UNIT 2License: 22173807 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D |
| 2 | GAMBRILLS, MD 21054Issued Date: 4/05/2017 | 3.VAPEPAD THE | 2299 JOHNS HOPKINS ROADLicense: 02104436 | ANJ DISTRIBUTIONS LLCLic. Status: Issued | Anne Arundel County | pbLicenseDetail.jsp?owi=vbuPTC20I14%3D |
| 3 | CHESTER, MD 21619Issued Date: 5/31/2017 | 4.VAPE FROG | 110 S. PINEY RDLicense: 17165957 | COX TRADING COMPANY L L CLic. Status: Issued | Queen Anne's County | pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D |
| 4 | SEVERNA PARK, MD 21146 | 5.VAPE FROGPending * | 346 RITCHIE HIGHWAY | COX TRADING LLCLic. Status: Pending | Anne Arundel County | none |


### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [81]:
df_vap = pd.read_csv("vape-results.csv")
df_vap

| | Date | Header | License | Status | county | link |
|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 1.VAPE IT STORE II | 1015 S SALISBURY BLVDLicense: 22173808 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D |
| 1 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 2.VAPE IT STORE I | 1724 N SALISBURY BLVD UNIT 2License: 22173807 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D |
| 2 | GAMBRILLS, MD 21054Issued Date: 4/05/2017 | 3.VAPEPAD THE | 2299 JOHNS HOPKINS ROADLicense: 02104436 | ANJ DISTRIBUTIONS LLCLic. Status: Issued | Anne Arundel County | pbLicenseDetail.jsp?owi=vbuPTC20I14%3D |
| 3 | CHESTER, MD 21619Issued Date: 5/31/2017 | 4.VAPE FROG | 110 S. PINEY RDLicense: 17165957 | COX TRADING COMPANY L L CLic. Status: Issued | Queen Anne's County | pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D |
| 4 | SEVERNA PARK, MD 21146 | 5.VAPE FROGPending * | 346 RITCHIE HIGHWAY | COX TRADING LLCLic. Status: Pending | Anne Arundel County | none |
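The usual source of an extra column is the DataFrame index: if `to_csv` is called without `index=False`, pandas writes the index as a first column with a blank header, and `read_csv` then shows it as `Unnamed: 0`. Here is a small standard-library sketch for checking a CSV header against the columns you expect. `unexpected_columns` is a hypothetical helper, and the two sample strings are made up for illustration.

```python
import csv
import io


def unexpected_columns(csv_text, expected):
    """Return header names found in csv_text that are not in expected."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in expected]


expected = ["Date", "Header", "License", "Status", "county", "link"]

clean = "Date,Header,License,Status,county,link\n"
with_index = ",Date,Header,License,Status,county,link\n"  # leading blank column name

print(unexpected_columns(clean, expected))       # []
print(unexpected_columns(with_index, expected))  # ['']
```

To run it on the saved file, read the file's text first (for example with `open(...).read()`) and pass that in as `csv_text`.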


### Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [82]:
from selenium.common.exceptions import NoSuchElementException

list_1 = []

while True:
    # Scrape the current page, then try to click 'Next'.
    # Without the break at the bottom, this loop would run forever.
    doc = BeautifulSoup(driver.page_source, "html.parser")
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    for header in business_headers:
        dictionary = {}
        # Each row's cells come back as one string, so "Status" below also
        # contains the owner's name, "License" the street address, and so on.
        rows = header.find_next_siblings('tr')
        header_1 = header.text.strip().replace("\n", "")
        if header_1:
            print(header_1)
            dictionary["Header"] = header_1
        # Some entries (such as pending licenses) have no detail link.
        if header.find("a"):
            links = header.find("a")["href"]
            print("link is", links.strip())
            dictionary["link"] = links
        else:
            links = "none"
            print(links)
            dictionary["link"] = links
        status = rows[0].text.strip().replace("\n", "")
        if status:
            print(status)
            dictionary["Status"] = status
        license_text = rows[1].text.strip().replace("\n", "")
        if license_text:
            print(license_text)
            dictionary["License"] = license_text
        date = rows[2].text.strip().replace("\n", "")
        if date:
            print(date)
            dictionary["Date"] = date
        county = rows[3].text.strip().replace("\n", "")
        if county:
            print(county)
            dictionary["county"] = county
        list_1.append(dictionary)
    try:
        next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
        next_button.click()
    except NoSuchElementException:
        # The 'Next' link can't be found, which is expected on the last page.
        break

1.VAPE IT STORE II
link is pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D
AMIN NARGISLic. Status: Issued
1015 S SALISBURY BLVDLicense: 22173808
SALISBURY, MD 21801Issued Date: 4/27/2017
Wicomico County
2.VAPE IT STORE I
link is pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D
AMIN NARGISLic. Status: Issued
1724 N SALISBURY BLVD UNIT 2License: 22173807
SALISBURY, MD 21801Issued Date: 4/27/2017
Wicomico County
3.VAPEPAD THE
link is pbLicenseDetail.jsp?owi=vbuPTC20I14%3D
ANJ DISTRIBUTIONS LLCLic. Status: Issued
2299 JOHNS HOPKINS ROADLicense: 02104436
GAMBRILLS, MD 21054Issued Date: 4/05/2017
Anne Arundel County
4.VAPE FROG
link is pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D
COX TRADING COMPANY L L CLic. Status: Issued
110 S. PINEY RDLicense: 17165957
CHESTER, MD 21619Issued Date: 5/31/2017
Queen Anne's County
5.VAPE FROGPending *
none
COX TRADING LLCLic. Status: Pending
346 RITCHIE HIGHWAY
SEVERNA PARK, MD 21146
Anne Arundel County
6.VAPE LOFT (THE)
link is pbLicenseDetail.jsp?owi=gzJn%2BEn85fc%3D
DISBROW II E

In [83]:
list_1

[{'Date': 'SALISBURY, MD 21801Issued Date: 4/27/2017',
  'Header': '1.VAPE IT STORE II',
  'License': '1015 S SALISBURY BLVDLicense: 22173808',
  'Status': 'AMIN NARGISLic. Status: Issued',
  'county': 'Wicomico County',
  'link': 'pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D'},
 {'Date': 'SALISBURY, MD 21801Issued Date: 4/27/2017',
  'Header': '2.VAPE IT STORE I',
  'License': '1724 N SALISBURY BLVD UNIT 2License: 22173807',
  'Status': 'AMIN NARGISLic. Status: Issued',
  'county': 'Wicomico County',
  'link': 'pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D'},
 {'Date': 'GAMBRILLS, MD 21054Issued Date: 4/05/2017',
  'Header': '3.VAPEPAD THE',
  'License': '2299 JOHNS HOPKINS ROADLicense: 02104436',
  'Status': 'ANJ DISTRIBUTIONS LLCLic. Status: Issued',
  'county': 'Anne Arundel County',
  'link': 'pbLicenseDetail.jsp?owi=vbuPTC20I14%3D'},
 {'Date': 'CHESTER, MD 21619Issued Date: 5/31/2017',
  'Header': '4.VAPE FROG',
  'License': '110 S. PINEY RDLicense: 17165957',
  'Status': 'COX TRADING COMPANY

In [84]:
df = pd.DataFrame(list_1)

df.to_csv("vape-results-all.csv", index=False)


In [86]:
df.head()


| | Date | Header | License | Status | county | link |
|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 1.VAPE IT STORE II | 1015 S SALISBURY BLVDLicense: 22173808 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D |
| 1 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 2.VAPE IT STORE I | 1724 N SALISBURY BLVD UNIT 2License: 22173807 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D |
| 2 | GAMBRILLS, MD 21054Issued Date: 4/05/2017 | 3.VAPEPAD THE | 2299 JOHNS HOPKINS ROADLicense: 02104436 | ANJ DISTRIBUTIONS LLCLic. Status: Issued | Anne Arundel County | pbLicenseDetail.jsp?owi=vbuPTC20I14%3D |
| 3 | CHESTER, MD 21619Issued Date: 5/31/2017 | 4.VAPE FROG | 110 S. PINEY RDLicense: 17165957 | COX TRADING COMPANY L L CLic. Status: Issued | Queen Anne's County | pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D |
| 4 | SEVERNA PARK, MD 21146 | 5.VAPE FROGPending * | 346 RITCHIE HIGHWAY | COX TRADING LLCLic. Status: Pending | Anne Arundel County | none |


In [87]:
df_vap_all = pd.read_csv("vape-results-all.csv")
df_vap_all.head()

| | Date | Header | License | Status | county | link |
|---|---|---|---|---|---|---|
| 0 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 1.VAPE IT STORE II | 1015 S SALISBURY BLVDLicense: 22173808 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=QBwCBqE3UH0%3D |
| 1 | SALISBURY, MD 21801Issued Date: 4/27/2017 | 2.VAPE IT STORE I | 1724 N SALISBURY BLVD UNIT 2License: 22173807 | AMIN NARGISLic. Status: Issued | Wicomico County | pbLicenseDetail.jsp?owi=gXYTVqpDCfA%3D |
| 2 | GAMBRILLS, MD 21054Issued Date: 4/05/2017 | 3.VAPEPAD THE | 2299 JOHNS HOPKINS ROADLicense: 02104436 | ANJ DISTRIBUTIONS LLCLic. Status: Issued | Anne Arundel County | pbLicenseDetail.jsp?owi=vbuPTC20I14%3D |
| 3 | CHESTER, MD 21619Issued Date: 5/31/2017 | 4.VAPE FROG | 110 S. PINEY RDLicense: 17165957 | COX TRADING COMPANY L L CLic. Status: Issued | Queen Anne's County | pbLicenseDetail.jsp?owi=xs9NlmRQHEs%3D |
| 4 | SEVERNA PARK, MD 21146 | 5.VAPE FROGPending * | 346 RITCHIE HIGHWAY | COX TRADING LLCLic. Status: Pending | Anne Arundel County | none |
