# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem is that you have to check a box agreeing to a disclaimer before you can get in. Done by hand, the process looks like this:

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [1]:
# first page for business licenses: https://jportal.mdcourts.gov/license/index_disclaimer.jsp

In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup

# Start at the disclaimer page, since that is where the site sends us first
url = 'https://jportal.mdcourts.gov/license/index_disclaimer.jsp'
driver = webdriver.Chrome()
driver.get(url)

**Visiting the search page directly isn't going to work, though! It's going to redirect to that intro page.** In your own browser, you can use *Incognito mode* to go back through the "check the box, etc." series of pages.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class name
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*
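
One way to think about "what's unique about this element" is to compare an attribute-based XPath with a position-based one. The sketch below uses Python's built-in `xml.etree.ElementTree`, which understands a small subset of XPath, on a made-up, simplified stand-in for the disclaimer form. The real page's markup is not reproduced here, and the `name` and `value` attributes are assumptions. Selenium uses the browser's much fuller XPath engine, but the idea carries over.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the disclaimer page (NOT the real markup).
SAMPLE = """
<html>
  <body>
    <form name="disclaimer">
      <input type="checkbox" id="checkbox" name="agree" />
      <input type="submit" value="Enter the Site" />
    </form>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Attribute-based lookup: keeps working if the layout changes, as long as the id stays.
by_id = root.find(".//*[@id='checkbox']")

# Position-based lookup: breaks as soon as another <input> is added before it.
by_position = root.find("./body/form/input[1]")

print(by_id is by_position)  # both paths reach the same element in this sample
print(by_id.get("type"))
```

An `id`-based path like this is what the checkbox cell below relies on. The long `/html/body/table/...` paths used for other elements work too, but they depend on the page layout staying exactly the same.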

In [3]:
# .find_element_by_xpath('//*[@id="checkbox"]').click()

In [4]:
checkbox = driver.find_element_by_xpath('//*[@id="checkbox"]')
print(checkbox)

<selenium.webdriver.remote.webelement.WebElement (session="fbf3c2699107ac0b668d6e3039c60606", element="0.9235239769307022-1")>


In [5]:
checkbox.click()

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [6]:
enter_button = driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]')
print(enter_button)
enter_button.click()

<selenium.webdriver.remote.webelement.WebElement (session="fbf3c2699107ac0b668d6e3039c60606", element="0.9235239769307022-2")>


### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [7]:
search_licence_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[6]/td[2]/a[2]')
print(search_licence_button)
search_licence_button.click()

<selenium.webdriver.remote.webelement.WebElement (session="fbf3c2699107ac0b668d6e3039c60606", element="0.9875494599627848-1")>


### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [8]:
# Click the dropdown's second <option>, which is the "Statewide" entry
statewide_option = driver.find_element_by_xpath('//*[@id="slcJurisdiction"]/option[2]')
print(statewide_option)
statewide_option.click()

<selenium.webdriver.remote.webelement.WebElement (session="fbf3c2699107ac0b668d6e3039c60606", element="0.1180056935980145-1")>


### How do you type "Vap%" into the Trade Name field?

In [9]:
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

In [10]:
textbox = driver.find_element_by_xpath('//*[@id="txtTradeName"]')
textbox.send_keys('Vap%')

### How do you click the submit button or submit the form?

In [11]:
submit_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/form/table/tbody/tr[14]/td/input[1]')
submit_button.click()

### How can you find and click the 'Next' button on the search results page?

In [12]:
next_button = driver.find_element_by_xpath('/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr')
next_button.click()

## Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [13]:
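from selenium.common.exceptions import NoSuchElementException

next_xpath = '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr'

while True:
    try:
        # Keep clicking "Next" until the link is no longer on the page
        next_button = driver.find_element_by_xpath(next_xpath)
        next_button.click()
    except NoSuchElementException:
        # No "Next" link means we've reached the last page
        break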

In [14]:
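from selenium.common.exceptions import NoSuchElementException

back_xpath = '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[1]/a/nobr'

while True:
    try:
        # Click the link in the first cell of the pager to go back one page at a time
        back_button = driver.find_element_by_xpath(back_xpath)
        back_button.click()
    except NoSuchElementException:
        # No such link means we're back on the first page
        break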
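
The two loops above share one pattern: look for a link, click it, and stop when the lookup fails. Below is a minimal sketch of that pattern as a reusable helper with a safety cap, so a page that always shows the link cannot loop forever. `click_through`, `FakePager`, and `max_clicks` are hypothetical names made up for this illustration. The fake pager stands in for the browser so the example runs without Selenium.

```python
def click_through(find_button, stop_on=Exception, max_clicks=500):
    """Click the element returned by find_button() until the lookup fails.

    find_button: zero-argument callable returning an object with .click()
    stop_on:     exception type(s) that mean "there is no such button any more"
    max_clicks:  safety cap against endless loops
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except stop_on:
            break
        button.click()
        clicks += 1
    return clicks


class FakePager:
    """Stand-in for a browser that has a few more result pages to show."""

    def __init__(self, pages_left):
        self.pages_left = pages_left

    def find_next(self):
        if self.pages_left == 0:
            raise LookupError("no Next link on the last page")
        return self  # the pager doubles as its own clickable "button"

    def click(self):
        self.pages_left -= 1


pager = FakePager(pages_left=3)
print(click_through(pager.find_next, stop_on=LookupError))  # 3
```

With Selenium, `find_button` could be something like `lambda: driver.find_element_by_xpath('...')` using the Next link's XPath, and `stop_on` could be `NoSuchElementException` from `selenium.common.exceptions`.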

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, 'html.parser')`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [15]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
# IF YOU USE THIS CODE, ASK ME HOW I MADE IT. IT'S IMPORTANT.
#len(business_headers)

In [16]:
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [17]:
business_headers = doc.find_all('tr', class_='searchfieldtitle')

Info = []

for header in business_headers:
    # Each business starts with a title row; its details are in the rows that follow
    rows = header.find_next_siblings()
    current = {}

    current['Company'] = header.find('span').text
    # Some records (e.g. a pending license) have no detail link, so 'url' may be missing
    link = header.find('a')
    if link:
        current['url'] = link['href']
    current['Name'] = rows[0].find_all('td')[1].text
    current['Licence Status'] = rows[0].find_all('td')[2].text
    current['Address'] = rows[1].find_all('td')[1].text
    current['License Number'] = rows[1].find_all('td')[2].text
    current['City'] = rows[2].find_all('td')[1].text
    current['Issued Date'] = rows[2].find_all('td')[2].text
    # Despite the key name, this cell holds the county (e.g. 'Wicomico County')
    current['State'] = rows[3].find_all('td')[1].text
    Info.append(current)

Info



[{'Address': '1724 N SALISBURY BLVD UNIT 2',
  'City': 'SALISBURY, MD 21801',
  'Company': 'VAPE IT STORE I',
  'Issued Date': 'Issued Date: 4/27/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 22173807',
  'Name': 'AMIN NARGIS',
  'State': 'Wicomico County',
  'url': 'pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D'},
 {'Address': '1015 S SALISBURY BLVD',
  'City': 'SALISBURY, MD 21801',
  'Company': 'VAPE IT STORE II',
  'Issued Date': 'Issued Date: 4/27/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 22173808',
  'Name': 'AMIN NARGIS',
  'State': 'Wicomico County',
  'url': 'pbLicenseDetail.jsp?owi=hb0Q%2B144PQw%3D'},
 {'Address': '2299 JOHNS HOPKINS ROAD',
  'City': 'GAMBRILLS, MD 21054',
  'Company': 'VAPEPAD THE',
  'Issued Date': 'Issued Date: 4/05/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 02104436',
  'Name': 'ANJ DISTRIBUTIONS LLC',
  'State': 'Anne Arundel County',
  'url': 'pbLicense
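
The scraped values still carry their on-page labels, such as `'Issued Date: 4/27/2017'` or `'License: 22173807'`. A minimal sketch of stripping those prefixes before saving follows. `strip_label`, `clean_record`, and `LABELLED_FIELDS` are hypothetical helpers introduced for illustration, and the sample record copies values from the output above.

```python
# Fields whose text starts with a "Label: " prefix in the scraped output.
LABELLED_FIELDS = ('Issued Date', 'Licence Status', 'License Number')


def strip_label(value):
    """Drop a leading 'Label:' prefix, e.g. 'License: 22173807' -> '22173807'."""
    label, sep, rest = value.partition(':')
    # Without a colon (e.g. an empty cell on a pending license), keep the text as-is.
    return rest.strip() if sep else value.strip()


def clean_record(record):
    """Return a copy of the record with the label prefixes removed."""
    cleaned = dict(record)
    for field in LABELLED_FIELDS:
        if field in cleaned:
            cleaned[field] = strip_label(cleaned[field])
    return cleaned


sample = {
    'Company': 'VAPE IT STORE I',
    'Issued Date': 'Issued Date: 4/27/2017',
    'Licence Status': 'Lic. Status: Issued',
    'License Number': 'License: 22173807',
}
print(clean_record(sample))
```

Only the three labelled fields are touched, so a colon inside an address or company name is left alone.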

### Save these into `vape-results.csv`

In [18]:
import pandas as pd
df = pd.DataFrame(Info)
df.head()

                        Address                    City           Company  \
0  1724 N SALISBURY BLVD UNIT 2     SALISBURY, MD 21801   VAPE IT STORE I   
1         1015 S SALISBURY BLVD     SALISBURY, MD 21801  VAPE IT STORE II   
2       2299 JOHNS HOPKINS ROAD     GAMBRILLS, MD 21054       VAPEPAD THE   
3               110 S. PINEY RD       CHESTER, MD 21619         VAPE FROG   
4           346 RITCHIE HIGHWAY  SEVERNA PARK, MD 21146         VAPE FROG   

              Issued Date        Licence Status     License Number  \
0  Issued Date: 4/27/2017   Lic. Status: Issued  License: 22173807   
1  Issued Date: 4/27/2017   Lic. Status: Issued  License: 22173808   
2  Issued Date: 4/05/2017   Lic. Status: Issued  License: 02104436   
3  Issued Date: 5/31/2017   Lic. Status: Issued  License: 17165957   
4                          Lic. Status: Pending                      

                        Name                State  \
0                AMIN NARGIS     

In [19]:
df.to_csv('vape-results.csv', index=False)
vape_df = pd.read_csv('vape-results.csv')
vape_df.head()

| | Address | City | Company | Issued Date | Licence Status | License Number | Name | State | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | VAPE IT STORE I | Issued Date: 4/27/2017 | Lic. Status: Issued | License: 22173807 | AMIN NARGIS | Wicomico County | pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D |
| 1 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | VAPE IT STORE II | Issued Date: 4/27/2017 | Lic. Status: Issued | License: 22173808 | AMIN NARGIS | Wicomico County | pbLicenseDetail.jsp?owi=hb0Q%2B144PQw%3D |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | VAPEPAD THE | Issued Date: 4/05/2017 | Lic. Status: Issued | License: 02104436 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | pbLicenseDetail.jsp?owi=4O6WsryDTV4%3D |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | VAPE FROG | Issued Date: 5/31/2017 | Lic. Status: Issued | License: 17165957 | COX TRADING COMPANY L L C | Queen Anne's County | pbLicenseDetail.jsp?owi=%2FOSo0KMLj8I%3D |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | VAPE FROG | | Lic. Status: Pending | | COX TRADING LLC | Anne Arundel County | |
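
The `url` values are relative links, so they need a base URL before they can be visited on their own. Below is a minimal sketch using the standard library's `urljoin`. It assumes the results page lives under `/license/` like the search page linked at the top. In a live session, `driver.current_url` would be a more reliable base. `absolute_url` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urljoin

# Assumption: results are served from the same /license/ directory as the search page.
BASE_URL = 'https://jportal.mdcourts.gov/license/pbPublicSearch.jsp'


def absolute_url(href, base=BASE_URL):
    """Resolve a relative detail link against the page it was scraped from."""
    if not href:
        return None  # some records (e.g. a pending license) have no detail link
    return urljoin(base, href)


detail = absolute_url('pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D')
print(detail)
```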


### Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [20]:
from selenium.common.exceptions import NoSuchElementException

next_xpath = '/html/body/table[2]/tbody/tr[4]/td[2]/table[2]/tbody/tr/td[3]/a/nobr'
Info_all = []

# Assumes the browser is on the first results page (the cells above clicked back to it)
while True:
    # Parse whichever results page the browser is currently showing
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    for header in doc.find_all('tr', class_='searchfieldtitle'):
        rows = header.find_next_siblings()
        current = {}
        current['Company'] = header.find('span').text
        link = header.find('a')
        if link:
            current['Url'] = link['href']
        current['Name'] = rows[0].find_all('td')[1].text
        current['Licence Status'] = rows[0].find_all('td')[2].text
        current['Address'] = rows[1].find_all('td')[1].text
        current['License Number'] = rows[1].find_all('td')[2].text
        current['City'] = rows[2].find_all('td')[1].text
        current['Issued Date'] = rows[2].find_all('td')[2].text
        current['State'] = rows[3].find_all('td')[1].text
        Info_all.append(current)

    # Only the "Next" lookup is wrapped in try, so parsing errors are not silently swallowed
    try:
        driver.find_element_by_xpath(next_xpath).click()
    except NoSuchElementException:
        break  # no "Next" link: this was the last page

Info_all

[{'Address': '1724 N SALISBURY BLVD UNIT 2',
  'City': 'SALISBURY, MD 21801',
  'Company': 'VAPE IT STORE I',
  'Issued Date': 'Issued Date: 4/27/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 22173807',
  'Name': 'AMIN NARGIS',
  'State': 'Wicomico County',
  'Url': 'pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D'},
 {'Address': '1015 S SALISBURY BLVD',
  'City': 'SALISBURY, MD 21801',
  'Company': 'VAPE IT STORE II',
  'Issued Date': 'Issued Date: 4/27/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 22173808',
  'Name': 'AMIN NARGIS',
  'State': 'Wicomico County',
  'Url': 'pbLicenseDetail.jsp?owi=hb0Q%2B144PQw%3D'},
 {'Address': '2299 JOHNS HOPKINS ROAD',
  'City': 'GAMBRILLS, MD 21054',
  'Company': 'VAPEPAD THE',
  'Issued Date': 'Issued Date: 4/05/2017',
  'Licence Status': 'Lic. Status: Issued',
  'License Number': 'License: 02104436',
  'Name': 'ANJ DISTRIBUTIONS LLC',
  'State': 'Anne Arundel County',
  'Url': 'pbLicense

In [21]:
# Build the DataFrame from Info_all (every page), not df (first page only)
df_all = pd.DataFrame(Info_all)
df_all.to_csv('vape-results-all.csv', index=False)
vape_df = pd.read_csv('vape-results-all.csv')
vape_df.head()

| | Address | City | Company | Issued Date | Licence Status | License Number | Name | State | Url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | VAPE IT STORE I | Issued Date: 4/27/2017 | Lic. Status: Issued | License: 22173807 | AMIN NARGIS | Wicomico County | pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D |
| 1 | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | VAPE IT STORE II | Issued Date: 4/27/2017 | Lic. Status: Issued | License: 22173808 | AMIN NARGIS | Wicomico County | pbLicenseDetail.jsp?owi=hb0Q%2B144PQw%3D |
| 2 | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | VAPEPAD THE | Issued Date: 4/05/2017 | Lic. Status: Issued | License: 02104436 | ANJ DISTRIBUTIONS LLC | Anne Arundel County | pbLicenseDetail.jsp?owi=4O6WsryDTV4%3D |
| 3 | 110 S. PINEY RD | CHESTER, MD 21619 | VAPE FROG | Issued Date: 5/31/2017 | Lic. Status: Issued | License: 17165957 | COX TRADING COMPANY L L C | Queen Anne's County | pbLicenseDetail.jsp?owi=%2FOSo0KMLj8I%3D |
| 4 | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | VAPE FROG | | Lic. Status: Pending | | COX TRADING LLC | Anne Arundel County | |
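
Some scraped records have no detail link at all, which is why the last row above ends in an empty cell. pandas fills that gap automatically. As a point of comparison, here is a minimal sketch of the same behavior with the standard library's `csv.DictWriter`, a swapped-in technique rather than what this notebook uses. The two records are shortened copies of rows above, and an in-memory buffer stands in for the real file.

```python
import csv
import io

records = [
    {'Company': 'VAPE IT STORE I', 'Name': 'AMIN NARGIS',
     'Url': 'pbLicenseDetail.jsp?owi=ZNl7eAo8j58%3D'},
    # A pending license: no detail link was found, so there is no 'Url' key.
    {'Company': 'VAPE FROG', 'Name': 'COX TRADING LLC'},
]

# Collect every key that appears in any record, keeping first-seen order.
fieldnames = []
for record in records:
    for key in record:
        if key not in fieldnames:
            fieldnames.append(key)

# The buffer stands in for open('vape-results-all.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(records)  # missing keys are written as restval (an empty cell)

csv_text = buffer.getvalue()
print(csv_text)
```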
