# Mine Safety

We're interested in [US mine safety](https://arlweb.msha.gov/drs/drshome.htm). Thank goodness we can search for these things.

## Setup: Import what you'll need to search and scrape with Selenium

In [18]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

## Starting from `https://arlweb.msha.gov/drs/drshome.htm`, search for every operator with 'dirt' in their name, including abandoned mines.

> - *Tip: If you can't make an element work using name, class or ID, try to use the XPath*

In [19]:
driver = webdriver.Chrome()

In [20]:
driver.get('https://arlweb.msha.gov/drs/drshome.htm')

In [21]:
text_input = driver.find_element_by_name('OperSearch')
text_input.send_keys('dirt')

In [22]:
search_button = driver.find_element_by_xpath('//*[@id="content"]/form[1]/table/tbody/tr[7]/td[3]/input[1]')
search_button.click()
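
The tip about falling back to XPath can be tried out without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a small subset of XPath, on a made-up and heavily simplified form. The markup is an assumption for illustration only and is not the real MSHA page; only the `OperSearch` name comes from the cell above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search form (NOT the real page markup).
snippet = """
<div id="content">
  <form>
    <table>
      <tbody>
        <tr><td>Operator Name</td><td><input name="OperSearch" type="text"/></td></tr>
        <tr><td></td><td><input type="submit" value="Search"/><input type="reset" value="Clear"/></td></tr>
      </tbody>
    </table>
  </form>
</div>
"""
root = ET.fromstring(snippet)

# Locating by attribute: works when the element has a usable name.
by_name = root.find(".//input[@name='OperSearch']")

# Locating by position: the fallback when there is no name, class, or id.
# XPath positions are 1-based, so tr[2] is the second row.
by_position = root.find("./form[1]/table/tbody/tr[2]/td[2]/input[1]")

print(by_name.get("type"))
print(by_position.get("value"))
```

Positional paths like this are brittle: if the page gains or loses a row, `tr[2]` points at something else, which is why a name or ID is preferable when one exists.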

## Scrape the results page, saving it as `dirt-operators.csv`

> - *Tip: Think about what each row in your dataset will be, and start by looping through that*
> - *Tip: Printing is cool and good! Print everything! Move it into a dictionary later.*
> - *Tip: If you don't want a row, think about what's in the row that makes it different. You can use an `if` statement or list slicing to skip the ones you aren't interested in.*
> - *Tip: Make sure your dictionary and your loop variable have DIFFERENT NAMES*
> - *Tip: After you've made your dictionary (and printed it, of course), you'll want to add it to your list of rows*
> - *Tip: Be sure to import pandas to convert it to a dataframe*
> - *Tip: Make sure you don't include the index when saving your dataframe*

### Hopefully you know that each `tr` is supposed to be a row of your data. What is the index of the first row element that is actually a result?

> - *Tip: `.text` will help you here.*
> - *Tip: You aren't interested in annotations or anything, just mines and where they are from*
> - *Tip: Using `print("-----")` will help you keep track of different rows*
> - *Tip: If you have a list called `animals`, `animals[2:]` will skip the first two and start with the third. You can use this to skip ahead to the 'good' data if you want*

In [23]:
rows = driver.find_elements_by_tag_name('tr')

In [14]:
# Keeping this failed attempt in the notebook (for now) as a reminder of the wrong paths I followed
for row in rows:
    cells = row.find_elements_by_tag_name('td')
    print("ID is", cells[0].text)
    print("The State is", cells[1].text)
    print("The Operator is", cells[2].text)
    print("Mine name is", cells[3].text)
    print("Type is", cells[4].text)
    print("CM* is", cells[5].text)
    print("Status is", cells[6].text)
    print("Commodity is", cells[7].text)
    print("More info is", cells[8].text)

ID is 
The State is  


IndexError: list index out of range
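
The `IndexError` above most likely comes from rows that have fewer `td` cells than a result row, so `cells[2]` and beyond do not exist. One way to guard against that, sketched here with plain lists standing in for the `.text` of each row's cells, is to check the cell count before indexing. The two short rows are illustrative, and the empty "More Info" text in the result row is an assumption.

```python
# Each inner list stands in for the text of the <td> cells in one <tr>.
rows_as_cells = [
    ["", " "],  # a layout row with only two cells
    ["Abandoned*", "Indicates Mine is Abandoned and Sealed"],  # annotation row
    ["3503598", "OR ", "Newberg Rock & Dirt  ", "Newberg Rock & Dirt",
     "Surface", "M ", "Active ", "Crushed, Broken Stone NEC ", ""],  # result row
]

EXPECTED_CELLS = 9  # ID, State, Operator, Mine Name, Type, CM*, Status, Commodity, More Info

kept_ids = []
for cells in rows_as_cells:
    if len(cells) < EXPECTED_CELLS:
        continue  # not a result row, so skip it instead of crashing
    kept_ids.append(cells[0])
    print("ID is", cells[0])
```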

In [15]:
for row in rows:
    cells = row.find_elements_by_tag_name('td')
    for cell in cells:
        print(cell.text)


 
Abandoned*
Indicates Mine is Abandoned and Sealed
Abandoned*
Indicates Mine is Abandoned and Sealed
*CM (Coal or Metal Mine/Nonmetal Mine)
C
M ...... Coal
...... Metal/Nonmetal
*CM (Coal or Metal Mine/Nonmetal Mine)
C
M
...... Coal
...... Metal/Nonmetal
Abandoned*
Indicates Mine is Abandoned and Sealed
*CM (Coal or Metal Mine/Nonmetal Mine)
C
M
...... Coal
...... Metal/Nonmetal
3503598
OR 
Newberg Rock & Dirt  
Newberg Rock & Dirt
Surface
M 
Active 
Crushed, Broken Stone NEC 
1401575
KS 
Bender Sand & Dirt  
BENDER SAND & DIRT
Surface
M 
Intermittent 
Construction Sand and Gravel 
5001797
AK 
Dirt Company  
Bush Pilot
Surface
M 
Intermittent 
Construction Sand and Gravel 
2103723
MN 
Dirt Doctor Inc  
Rock Lake Plant
Surface
M 
Intermittent 
Construction Sand and Gravel 
2103914
MN 
Dirt Work Specialists LLC  
Astec Plant
Surface
M 
Intermittent 
Construction Sand and Gravel 
4104757
TX 
Dirt Works  
Portable #1
Surface
M 
Intermittent 
Construction Sand and Gravel 
0801306
FL 
Holl

In [26]:
for row in rows:
    print(row.text)
    print("=====")

Operator Name or Mine Name
Search  
=====
Abandoned*
Indicates Mine is Abandoned and Sealed
*CM (Coal or Metal Mine/Nonmetal Mine)
C
M ...... Coal
...... Metal/Nonmetal
=====
Abandoned*
=====
Indicates Mine is Abandoned and Sealed
=====
*CM (Coal or Metal Mine/Nonmetal Mine)
=====
C
M ...... Coal
...... Metal/Nonmetal
=====
ID State Operator Mine Name Type CM* Status Commodity More Info
=====
3503598
OR  Newberg Rock & Dirt   Newberg Rock & Dirt Surface M  Active  Crushed, Broken Stone NEC 
=====
1401575
KS  Bender Sand & Dirt   BENDER SAND & DIRT Surface M  Intermittent  Construction Sand and Gravel 
=====
5001797
AK  Dirt Company   Bush Pilot Surface M  Intermittent  Construction Sand and Gravel 
=====
2103723
MN  Dirt Doctor Inc   Rock Lake Plant Surface M  Intermittent  Construction Sand and Gravel 
=====
2103914
MN  Dirt Work Specialists LLC   Astec Plant Surface M  Intermittent  Construction Sand and Gravel 
=====
4104757
TX  Dirt Works   Portable #1 Surface M  Intermittent  Cons
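
Counting the `=====` separators in the output above suggests that the first real result is the eighth row, index 7. As a cross-check, the sketch below copies the row texts from that output into a plain list and looks for the first one that starts with a seven-digit ID. The seven-digit rule is an observation from the IDs printed here, not something the site documents.

```python
# Row texts copied from the printed output; the last entry is shortened.
row_texts = [
    "Operator Name or Mine Name\nSearch  ",
    "Abandoned*\nIndicates Mine is Abandoned and Sealed\n"
    "*CM (Coal or Metal Mine/Nonmetal Mine)\nC\nM ...... Coal\n...... Metal/Nonmetal",
    "Abandoned*",
    "Indicates Mine is Abandoned and Sealed",
    "*CM (Coal or Metal Mine/Nonmetal Mine)",
    "C\nM ...... Coal\n...... Metal/Nonmetal",
    "ID State Operator Mine Name Type CM* Status Commodity More Info",
    "3503598\nOR  Newberg Rock & Dirt   Newberg Rock & Dirt Surface M  Active ",
    "1401575\nKS  Bender Sand & Dirt   BENDER SAND & DIRT Surface M  Intermittent ",
]

# The first row whose first seven characters are all digits is the first result.
first_result = next(
    index for index, text in enumerate(row_texts) if text[:7].isdigit()
)
print(first_result)

# Slicing from that index skips every annotation and header row.
results = row_texts[first_result:]
print(len(results))
```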

### Loop through each operator result, printing its name

> - *Tip: If you have a list called `animals`, `animals[2:]` will skip the first two and start with the third.*
> - *Tip: You can use list slicing or an `if` statement to skip the non-data row(s). List slicing is probably easier, even if you aren't comfortable with it.*
> - *Tip: Or, honestly, you can use `try` and `except` if you know how they work.*
> - *Tip: Once you have the "right" rows of data, you're going to be looking for a certain tag inside*
> - *Tip: Sometimes you can't say "give me this class," and instead you have to say "give me all of the `div` elements, and then give me the third one."*
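
The `try` / `except` option from the tips can be sketched as follows. Plain lists stand in for each row's cells, and the short first row is an illustrative non-data row. Catching only `IndexError` keeps other mistakes visible.

```python
# Each inner list stands in for the text of the <td> cells in one <tr>.
rows_as_cells = [
    ["Abandoned*"],  # non-data row: there is no third cell
    ["3503598", "OR ", "Newberg Rock & Dirt  ", "Newberg Rock & Dirt"],
    ["1401575", "KS ", "Bender Sand & Dirt  ", "BENDER SAND & DIRT"],
]

operators = []
for cells in rows_as_cells:
    try:
        # The operator is the third cell; .strip() drops the trailing spaces.
        operators.append(cells[2].strip())
    except IndexError:
        # Row is too short to be a result, so move on to the next one.
        continue

for operator in operators:
    print("The Operator is", operator)
```

This only filters rows that are too short. A non-data row that happens to have enough cells would still get through, so slicing remains the more predictable choice when the layout is known.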

In [27]:
for row in rows[7:27]:
    cells = row.find_elements_by_tag_name('td')
    print("The Operator is", cells[2].text)

The Operator is Newberg Rock & Dirt  
The Operator is Bender Sand & Dirt  
The Operator is Dirt Company  
The Operator is Dirt Doctor Inc  
The Operator is Dirt Work Specialists LLC  
The Operator is Dirt Works  
The Operator is Holley Dirt Company, Inc  
The Operator is Krueger Brothers Gravel & Dirt  
The Operator is M R Dirt  
The Operator is M.R. Dirt Inc.  
The Operator is P B Dirt Movers, Inc  
The Operator is P B Dirt Movers, Inc.  
The Operator is PB Dirt Movers  
The Operator is Prescott Dirt, LLC  
The Operator is R D Blankenship Dirt Work LLC  
The Operator is Sand & Dirt, Inc  
The Operator is SIMPSON DIRTWORX LLC  
The Operator is SIMPSON DIRTWORX LLC  
The Operator is Spry's Dirt & Gravel, Inc.  
The Operator is Vogt Dirt Service  


### Loop through each operator result, printing its ID

There should be ONE ID per row, and NO empty rows between them.

In [28]:
for row in rows[7:27]:
    cells = row.find_elements_by_tag_name('td')
    print("ID is", cells[0].text)

ID is 3503598
ID is 1401575
ID is 5001797
ID is 2103723
ID is 2103914
ID is 4104757
ID is 0801306
ID is 3901432
ID is 3609624
ID is 3609931
ID is 1519799
ID is 4407379
ID is 4407296
ID is 0203332
ID is 2901986
ID is 0801417
ID is 4300768
ID is 4300776
ID is 2302283
ID is 2103518


## Saving the results

### Loop through each `tr` to create a list of dictionaries

Each dictionary must contain

- Operator ID
- Operator name
- Mine name
- State
- Mine type
- Coal or metal
- Status
- Commodity

Create a new dictionary for each row.

> - *Tip: Start with an empty dictionary, then add the keys one at a time like we did during class*
> - *Tip: You might want to save all of the cells in a variable, then use indexes to get the second, third, fourth, etc.*
> - *Tip: I know you already skipped a bunch of rows, but one of them still might be bad! Which one is it? How can you skip it? You might need to slice out some of the end of your list, too. Use `print` to help you debug, or just look at the page closely.*
> - *Tip: Or, if you did the other homework already, `try` / `except` is also an option*
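
As an alternative to adding keys one at a time, a dictionary can be built by pairing a list of column names with the cell texts. This is only a sketch: the cell texts are copied from the results printed earlier, the empty ninth cell ("More Info") is an assumption, and the key names are one possible choice.

```python
COLUMNS = ["ID", "State", "Operator", "Mine_Name", "Type", "CM*", "Status", "Commodity"]

# Stand-ins for [cell.text for cell in cells], one list per result row.
cell_texts_per_row = [
    ["3503598", "OR ", "Newberg Rock & Dirt  ", "Newberg Rock & Dirt",
     "Surface", "M ", "Active ", "Crushed, Broken Stone NEC ", ""],
    ["1401575", "KS ", "Bender Sand & Dirt  ", "BENDER SAND & DIRT",
     "Surface", "M ", "Intermittent ", "Construction Sand and Gravel ", ""],
]

mines = []
for cell_texts in cell_texts_per_row:
    # zip stops at the shorter sequence, so the ninth cell is ignored.
    # .strip() removes the trailing spaces the page puts in some cells.
    mine = {column: text.strip() for column, text in zip(COLUMNS, cell_texts)}
    mines.append(mine)

print(mines[0])
```

Stripping at this stage means values such as `'OR '` are stored as `'OR'`, which makes later filtering and grouping less error-prone.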

In [41]:
rows = driver.find_elements_by_tag_name('tr')
mines_list = []

In [52]:
for row in rows[7:-1]:
    cells = row.find_elements_by_tag_name('td')
    print(row.text)
    
    mines_dict = {}
    mines_dict['ID'] = cells[0].text
    print(cells[0].text)
    
    mines_dict['State'] = cells[1].text
    mines_dict['Operator'] = cells[2].text
    mines_dict['Mine_Name'] = cells[3].text
    mines_dict['Type'] = cells[4].text
    mines_dict['CM*'] = cells[5].text
    mines_dict['Status'] = cells[6].text
    mines_dict['Commodity'] = cells[7].text
    print("Our mines dictionary is", mines_dict)
    mines_list.append(mines_dict)

print("This is our dictionary!")

3503598
OR  Newberg Rock & Dirt   Newberg Rock & Dirt Surface M  Active  Crushed, Broken Stone NEC 
3503598
Our mines dictionary is {'ID': '3503598', 'State': 'OR ', 'Operator': 'Newberg Rock & Dirt  ', 'Mine_Name': 'Newberg Rock & Dirt', 'Type': 'Surface', 'CM*': 'M ', 'Status': 'Active ', 'Commodity': 'Crushed, Broken Stone NEC '}
1401575
KS  Bender Sand & Dirt   BENDER SAND & DIRT Surface M  Intermittent  Construction Sand and Gravel 
1401575
Our mines dictionary is {'ID': '1401575', 'State': 'KS ', 'Operator': 'Bender Sand & Dirt  ', 'Mine_Name': 'BENDER SAND & DIRT', 'Type': 'Surface', 'CM*': 'M ', 'Status': 'Intermittent ', 'Commodity': 'Construction Sand and Gravel '}
5001797
AK  Dirt Company   Bush Pilot Surface M  Intermittent  Construction Sand and Gravel 
5001797
Our mines dictionary is {'ID': '5001797', 'State': 'AK ', 'Operator': 'Dirt Company  ', 'Mine_Name': 'Bush Pilot', 'Type': 'Surface', 'CM*': 'M ', 'Status': 'Intermittent ', 'Commodity': 'Construction Sand and Grave

In [53]:
import pandas as pd

# Build the dataframe from the list of dictionaries, not from `rows`:
# `rows` holds Selenium WebElement objects, which would end up as a single
# column of unreadable object references.
df = pd.DataFrame(mines_list)
df.head(10)


### Save that to a CSV named `dirt-operators.csv`

In [54]:
# Save it as a CSV
df.to_csv("dirt-operators.csv", index=False)

### Open the CSV file and examine the first few rows.

Make sure you didn't save that extra weird unnamed index column.

In [None]:
df = pd.read_csv("dirt-operators.csv")
df.head()
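
To see what the "extra weird unnamed index column" looks like and how `index=False` avoids it, here is a small round trip through in-memory buffers instead of a file. The three rows are a hand-made subset of the results above, and `dtype=str` is a suggestion rather than part of the assignment: without it, pandas reads the ID column as integers and IDs such as `0801306` lose their leading zero.

```python
import io

import pandas as pd

# Hand-made subset of the scraped results, already stripped of whitespace.
sample_mines = [
    {"ID": "3503598", "State": "OR", "Operator": "Newberg Rock & Dirt"},
    {"ID": "1401575", "State": "KS", "Operator": "Bender Sand & Dirt"},
    {"ID": "0801306", "State": "FL", "Operator": "Holley Dirt Company, Inc"},
]
sample_df = pd.DataFrame(sample_mines)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index, dtype=str).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index, dtype=str)
columns_without_index = reloaded.columns.tolist()

print(columns_with_index)
print(columns_without_index)
print(reloaded["ID"].tolist())
```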