# Scraping many pages + Using Selenium

## The pages we'll be looking at

If I want to read information about a specific mine, it takes a few steps. **Do these steps with your browser before you try any programming.**

1. Visit the [Mine Data Retrieval System](https://arlweb.msha.gov/drs/drshome.htm)
2. Scroll down to **Mine Identification Number (ID) Search**
3. Type in a mine ID number, such as `3503598`, and click **Search**
4. I'm on a page! It lists the MINE NAME and MINE OWNER.

After searching for and finding a mine, I can use this page to **find reports about this mine**. Some of the reports are on accidents, violations, inspections, health samples and more. To get those reports:

1. Search for a mine (if you haven't already)
2. Scroll down and change **Beginning Date** to `1/1/1995` (violation reports begin in 1995, accidents begin in 1983)
3. Select the report type of `Violations`
4. Click **Get Report**
5. I'm on a page! It lists ALL OF THE MINE'S VIOLATIONS.

By changing the report type, you can find all sorts of different data.

# Researching mine information

## Preparation 

### When you search for information on a specific mine, what URL should Selenium visit first?

- *TIP: the answer is NOT `https://arlweb.msha.gov/drs/ASP/BasicMineInfonew.asp`*

In [7]:
# https://arlweb.msha.gov/drs/drshome.htm

### How can you identify the text field we're going to type the Mine ID into?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

In [8]:
# It can be found by XPath. More than one input has id="inputdrs",
# so use find_elements and take the second match
#driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
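
A minimal sketch of the difference between `find_element` and `find_elements`, assuming (as the later cells in this notebook do) that more than one input on the page has `id="inputdrs"` and that the Mine ID box is the second match. The helper name `find_mine_id_box` is made up for illustration. It passes the locator strategy as the plain string `"xpath"`, which is the value of `By.XPATH`; the `find_elements_by_xpath` shortcut used elsewhere in this notebook was removed in Selenium 4.3.

```python
MINE_ID_XPATH = '//*[@id="inputdrs"]'


def find_mine_id_box(driver, position=1):
    """Return one input from the list of matches for the shared id.

    find_elements (plural) returns a list, so calling .click() or
    .send_keys() on the result directly fails; pick one element first.
    """
    matches = driver.find_elements("xpath", MINE_ID_XPATH)
    if len(matches) <= position:
        raise LookupError(
            f"wanted match number {position + 1}, but only found {len(matches)}"
        )
    return matches[position]
```

With a real browser open on the search page, `find_mine_id_box(driver).send_keys('3901432')` would type the ID into the box.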

### How can you identify the search button we're going to click, or the form we're going to submit?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [9]:
#search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
#search_button.click()
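
The two ways of sending the search can be sketched side by side. This assumes the Mine ID box sits inside the search `<form>` (so `.submit()` on the box submits that form) and reuses the button XPath from this notebook; `run_search` and its `use_submit` flag are made-up names.

```python
SEARCH_BUTTON_XPATH = '//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input'


def run_search(driver, mine_id_box, mine_id, use_submit=False):
    """Type a mine ID, then send the search one of two ways."""
    mine_id_box.send_keys(mine_id)
    if use_submit:
        # Option 1: .submit() on an element inside a form submits that form
        mine_id_box.submit()
    else:
        # Option 2: find the Search button and click it
        driver.find_element("xpath", SEARCH_BUTTON_XPATH).click()
```

Either branch should land on the same results page, which is why you only need one of them.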

### Use Selenium to search using the mine ID `3901432`. Get me the operator's name by scraping.

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source)`*

In [14]:
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

In [15]:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [19]:
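# More than one input has id="inputdrs"; the Mine ID box is the second match
textbox = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
textbox.send_keys('3901432')
driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input').click()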

In [20]:
from bs4 import BeautifulSoup
doc = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
#doc.find_all('tr')[3]

# Using .apply to find data about SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the operator's name for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [21]:
import pandas as pd 
df = pd.read_csv('mines-subset.csv', dtype={'id': 'str'})
df

| | id |
|---|---|
| 0 | 4104757 |
| 1 | 801306 |
| 2 | 3609931 |


### Open up `mines-subset.csv` in a text editor, then look at your dataframe. Is something different about them?

In [None]:
# The leading zero is missing in the dataframe unless the id column is read as text (dtype={'id': 'str'})
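
The missing zero comes from type guessing: a column that contains only digits is read as integers unless you ask for text, and an integer cannot keep a leading zero. If the IDs have already been read as numbers, a sketch like this can pad them back. It assumes every mine ID is seven digits long, which is true of the IDs in this notebook but is an assumption in general; `pad_mine_id` is a made-up helper name.

```python
def pad_mine_id(value, width=7):
    """Turn 801306 (or '801306') back into the text '0801306'."""
    return str(value).strip().zfill(width)


print(pad_mine_id(801306))     # 0801306
print(pad_mine_id('4104757'))  # 4104757
```

On a dataframe this could be applied with `df['id'] = df['id'].apply(pad_mine_id)`.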

In [22]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
#from selenium.webdriver.common.keys import Keys

In [23]:
driver = webdriver.Chrome()
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [24]:
input_1 = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
input_1.send_keys('3901432')

In [None]:
#search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
#search_button.click()

In [25]:
from bs4 import BeautifulSoup
#doc = BeautifulSoup(driver.page_source, 'html.parser')

### Scrape the operator's name for each of those mines and print it

- *TIP: use .apply and a function*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*

In [None]:
#df.apply

In [None]:
#doc.find_all('td')[7].find_all('td')[0].text

In [26]:
df

| | id |
|---|---|
| 0 | 4104757 |
| 1 | 801306 |
| 2 | 3609931 |


In [42]:
#driver = webdriver.Chrome()
#driver.get('https://arlweb.msha.gov/drs/drshome.htm')
#textbox = driver.find_elements_by_xpath('//*[@id="inputdrs"]')

In [41]:
def Names_IDs(row):
    driver = webdriver.Chrome()
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    textbox = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
    textbox.send_keys(row['id'])
    search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_button.click()
    #print(type(doc))
    Operator = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b').text.strip()
    print(Operator)
    
    return Operator

df.apply(Names_IDs,axis=1)


Dirt Works
Holley Dirt Company, Inc
M.R. Dirt Inc.


0                  Dirt Works
1    Holley Dirt Company, Inc
2              M.R. Dirt Inc.
dtype: object
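
The function above starts a new Chrome window for every row and never closes it. A sketch of one way to reuse a single browser: build the row function around a driver that is already open. The XPaths and URL are the ones used in this notebook; `make_operator_lookup` is a made-up name, and the locator strategy is passed as the string `"xpath"` (the value of `By.XPATH`).

```python
HOME_URL = 'https://arlweb.msha.gov/drs/drshome.htm'
MINE_ID_XPATH = '//*[@id="inputdrs"]'
SEARCH_BUTTON_XPATH = '//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input'
OPERATOR_XPATH = '//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b'


def make_operator_lookup(driver):
    """Build a function for df.apply that reuses one open browser."""
    def lookup(row):
        # Go back to the search page for every mine
        driver.get(HOME_URL)
        driver.find_elements("xpath", MINE_ID_XPATH)[1].send_keys(row['id'])
        driver.find_element("xpath", SEARCH_BUTTON_XPATH).click()
        return driver.find_element("xpath", OPERATOR_XPATH).text.strip()
    return lookup


# Usage (needs Chrome and the live site):
# driver = webdriver.Chrome()
# try:
#     df['Operator'] = df.apply(make_operator_lookup(driver), axis=1)
# finally:
#     driver.quit()  # close the browser even if a lookup fails
```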

### Scrape the operator's name and save it into a new column

- *TIP: Use .apply and a function*
- *TIP: Remember to use `return`*

In [54]:
def Operators_name(row): 
    driver = webdriver.Chrome()
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    textbox = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
    textbox.send_keys(row['id'])
    search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_button.click()
    Operator = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[1]/tbody/tr[3]/td[2]/font/b').text.strip()
  
    return Operator

df['Operator'] = df.apply(Operators_name, axis=1)
df

| | id | Operator |
|---|---|---|
| 0 | 4104757 | Dirt Works |
| 1 | 801306 | Holley Dirt Company, Inc |
| 2 | 3609931 | M.R. Dirt Inc. |


# Researching mine violations

Read the very top again to remember how to find mine violations.

### When you search for a mine's violations, what URL is Selenium going to start on?

- *TIP: `requests` can send form data to load in the middle of a bunch of steps, but Selenium has to start at the beginning.*

In [None]:
#https://arlweb.msha.gov/drs/drshome.htm

### When you're searching for violations from the Mine Information page, how are you going to identify the "Beginning Date" field?

In [None]:
#date_field = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')

### When you're searching for violations from the Mine Information page, how are you going to identify the "Violations" button?

In [None]:
#violation_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')

### When you're searching for violations from the Mine Information page, how are you going to identify the form or the button to click to get a list of the violations?

In [None]:
#get_report = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')

### Using the mine ID `3901432`, scrape all of their violations since 1/1/1995

**Save this into a CSV called `3901432-violations.csv`.** This CSV must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

**Tips:**

- *TIP: It's probably worth it to print them all first, then save them to a CSV once you know it's all working.*
- *TIP: You'll use the parent pattern - get the ROWS first (tr), then loop through and get the TABLE CELLS (td)*
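
A small sketch of the parent pattern on invented HTML. The table below is made up and has only three columns, so the cell positions are not the ones on the real report page; only the `drsviols` row class is taken from this notebook.

```python
from bs4 import BeautifulSoup

# Invented HTML for illustration; the real report has many more columns
SAMPLE_HTML = """
<table>
  <tr><th>Citation</th><th>Standard</th><th>Penalty</th></tr>
  <tr class="drsviols">
    <td>1111111</td>
    <td><a href="https://example.com/a.pdf">56.0001</a></td>
    <td>100.00</td>
  </tr>
  <tr class="drsviols">
    <td>2222222</td>
    <td><a href="https://example.com/b.pdf">56.0002</a></td>
    <td>250.00</td>
  </tr>
</table>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

violations = []
# Parent first: one <tr> per violation (the header row has no class, so it is skipped)
for tr in doc.find_all('tr', class_='drsviols'):
    # Then the children: look up the cells once and index into them
    cells = tr.find_all('td')
    violations.append({
        'Citation number': cells[0].get_text(strip=True),
        'Standard violated': cells[1].get_text(strip=True),
        'Link to standard': cells[1].find('a')['href'],
        'Proposed penalty': cells[2].get_text(strip=True),
    })

for violation in violations:
    print(violation)
```

Each dictionary is one row of the eventual CSV, so the list can go straight into `pd.DataFrame(violations)`.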

In [59]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

In [121]:
driver = webdriver.Chrome()
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [122]:
number_input = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
number_input.send_keys('3901432')

In [123]:
search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_button.click()

In [124]:
violation_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
violation_button.click()

In [125]:
date_input = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
date_input.send_keys('1/1/1995')

In [126]:
get_report = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
get_report.click()

In [127]:
rows = driver.find_elements_by_xpath('//tr[@class="drsviols"]')
len(rows)

18

In [139]:
All_Info = []
driver = webdriver.Chrome()
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

number_input = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
number_input.send_keys('3901432')

search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
search_button.click()

date_input = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
date_input.click()
date_input.send_keys('1/1/1995')

violation_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
violation_button.click()

get_report = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
get_report.click()

doc = BeautifulSoup(driver.page_source, 'html.parser')

rows = doc.find_all('tr', class_='drsviols')

for row in rows:
    # Look up the cells once per row, then index into them
    cells = row.find_all('td')
    current = {}

    current['Citation number'] = cells[2].text.strip()
    current['Case number'] = cells[3].text.strip()
    current['Standard violated'] = cells[10].find('font', attrs = {'color': '#0000FF '})
    current['Link to standard'] = cells[10].find('a')['href']
    current['Proposed penalty'] = cells[11].text.strip()
    current['Amount paid to date'] = cells[14].text.strip()
    All_Info.append(current)
    
All_Info

[{'Amount paid to date': '100.00',
  'Case number': '000361866',
  'Citation number': '8750964',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-vol1/pdf/CFR-2014-title30-vol1-sec56-18010.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': None},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6426438',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4101.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': None},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6426439',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/pdf/CFR-2011-title30-vol1-sec56-4201.pdf',
  'Proposed penalty': '100.00',
  'Standard violated': None},
 {'Amount paid to date': '100.00',
  'Case number': '000260865',
  'Citation number': '6588189',
  'Link to standard': 'http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-vol1/
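
Every `'Standard violated'` value above is `None`. That is what `.find()` returns when nothing matches, and a string attribute filter has to match the attribute value exactly, so the trailing space in `'#0000FF '` is one likely cause (the page's actual markup is not shown here, so treat that as a guess). Even when the filter does match, `.find()` returns a tag rather than its text. A sketch on an invented cell:

```python
from bs4 import BeautifulSoup

# Invented cell, loosely modelled on a linked standard number
cell = BeautifulSoup(
    '<td><a href="https://example.com/a.pdf">'
    '<font color="#0000FF">56.0001</font></a></td>',
    'html.parser',
).td

with_space = cell.find('font', attrs={'color': '#0000FF '})  # trailing space
exact = cell.find('font', attrs={'color': '#0000FF'})

print(with_space)                  # None
print(exact.get_text(strip=True))  # 56.0001
print(cell.get_text(strip=True))   # 56.0001
```

Taking the text of the whole cell, as in the last line, avoids depending on the font colour at all.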

In [140]:
import pandas as pd
violations_df = pd.DataFrame(All_Info)

In [141]:
violations_df.to_csv('3901432-violations.csv', index = False)
violations_df = pd.read_csv('3901432-violations.csv')
violations_df.head()

| | Amount paid to date | Case number | Citation number | Link to standard | Proposed penalty | Standard violated |
|---|---|---|---|---|---|---|
| 0 | 100.0 | 361866 | 8750964 | http://www.gpo.gov/fdsys/pkg/CFR-2014-title30-... | 100.0 | |
| 1 | 100.0 | 260865 | 6426438 | http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-... | 100.0 | |
| 2 | 100.0 | 260865 | 6426439 | http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-... | 100.0 | |
| 3 | 100.0 | 260865 | 6588189 | http://www.gpo.gov/fdsys/pkg/CFR-2011-title30-... | 100.0 | |
| 4 | 100.0 | 238554 | 6588210 | http://www.gpo.gov/fdsys/pkg/CFR-2010-title30-... | 100.0 | |
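
Reading the CSV back turned the case number `000361866` into `361866` and the penalty `100.00` into `100.0`, because `read_csv` guesses numeric types for columns that contain only digits. The file on disk still holds the original text. A sketch of keeping it as text, using an in-memory CSV with two values from the first scraped row instead of the real file:

```python
import io

import pandas as pd

# Two columns copied from the first scraped row above
csv_text = "Case number,Proposed penalty\n000361866,100.00\n"

guessed = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(guessed['Case number'].tolist())       # [361866]
print(as_text['Case number'].tolist())       # ['000361866']
print(as_text['Proposed penalty'].tolist())  # ['100.00']
```

Passing `dtype=str` keeps every column as text; a dictionary such as `dtype={'Case number': str}` would do the same for one column only.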


# Using .apply to save mine data for SEVERAL mines

The file `mines-subset.csv` has a list of mine IDs. We're going to scrape the violations for each of those mines.

### Open up `mines-subset.csv` and save it into a dataframe

In [173]:
import pandas as pd
df

| | id | Operator |
|---|---|---|
| 0 | 4104757 | Dirt Works |
| 1 | 801306 | Holley Dirt Company, Inc |
| 2 | 3609931 | M.R. Dirt Inc. |


### Scrape the violations for each mine

**Save each mine's violations into separate CSV files.** Each CSV file must include the following fields:

- Citation number
- Case number
- Standard violated
- Link to standard
- Proposed penalty
- Amount paid to date

Make sure you are saving them into **separate files.** It might be nice to name each file after the mine ID.

- *TIP: Use .apply for this*
- *TIP: Print out the ID before you start scraping. That way you can take that ID and search manually to see if there is anything weird about the results.*
- *TIP: If you need help with .apply, look at the "Using apply in pandas" notebook*
- *TIP: It's probably worth it to print the fields first, then save them to a CSV once you know it's all working.*

In [171]:
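def scrape_violations(row):
    violations = []

    driver = webdriver.Chrome()
    driver.get('https://arlweb.msha.gov/drs/drshome.htm')
    textbox = driver.find_elements_by_xpath('//*[@id="inputdrs"]')[1]
    textbox.send_keys(row['id'])
    search_button = driver.find_element_by_xpath('//*[@id="content"]/table[3]/tbody/tr[3]/td[2]/input')
    search_button.click()

    # Request the violations report, using the same steps as the single-mine scrape above
    date_input = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[2]/tbody/tr[2]/td/font/input[1]')
    date_input.click()
    date_input.send_keys('1/1/1995')
    violation_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[2]/td[2]/table/tbody/tr[1]/td/input')
    violation_button.click()
    get_report = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table[3]/tbody/tr[3]/td[2]/input')
    get_report.click()

    doc = BeautifulSoup(driver.page_source, 'html.parser')
    # Close this mine's browser window once the page has been read
    driver.quit()

    # Loop with `tr`, not `row`, so the mine's row (and row['id']) is not overwritten
    for tr in doc.find_all('tr', class_='drsviols'):
        cells = tr.find_all('td')
        current = {}
        try:
            current['Citation number'] = cells[2].text.strip()
        except IndexError:
            current['Citation number'] = 'No Information'
        try:
            current['Case number'] = cells[3].text.strip()
        except IndexError:
            current['Case number'] = 'No Information'
        try:
            # Take the cell's text; the font-colour lookup gave None in the single-mine output above
            current['Standard violated'] = cells[10].text.strip()
        except IndexError:
            current['Standard violated'] = 'No Information'
        try:
            current['Link to standard'] = cells[10].find('a')['href']
        except (IndexError, TypeError, KeyError):
            current['Link to standard'] = 'No Information'
        try:
            current['Proposed penalty'] = cells[11].text.strip()
        except IndexError:
            current['Proposed penalty'] = 'No Information'
        try:
            current['Amount paid to date'] = cells[14].text.strip()
        except IndexError:
            current['Amount paid to date'] = 'No Information'

        violations.append(current)

    violations_df = pd.DataFrame(violations)
    path = row['id'] + '.csv'
    violations_df.to_csv(path, index=False)
    print(path, 'saved')

In [172]:
df.apply(scrape_violations, axis=1)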

4104757.csv saved
0801306.csv saved
3609931.csv saved


0    None
1    None
2    None
dtype: object
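
The repeated `try`/`except` blocks in the scraping function all do the same job: read one value and fall back to `'No Information'`. A sketch of a helper that does this in one place, assuming the caller passes a small function that pulls the value out of the cells. `safe_value` is a made-up name, and the list of strings below only stands in for real table cells.

```python
def safe_value(getter, default='No Information'):
    """Call getter(); return default if the cell or tag is missing."""
    try:
        value = getter()
    except (IndexError, KeyError, TypeError, AttributeError):
        # IndexError: too few cells; TypeError/AttributeError: a lookup gave None;
        # KeyError: the tag has no such attribute
        return default
    return default if value is None else value


# Stand-in data: a row with only three cells
cells = ['8750964', '000361866', '100.00']
print(safe_value(lambda: cells[0].strip()))   # 8750964
print(safe_value(lambda: cells[10].strip()))  # No Information
```

With BeautifulSoup cells, the link lookup could then be written as `safe_value(lambda: cells[10].find('a')['href'])`.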