# Mines, Part 2

We're interested in [US mine safety](https://www.msha.gov/mine-data-retrieval-system), as always. You can get information about a specific mine by using its Mine ID.

## Starting up

Start up Playwright and open the mine search page.

## Searching

It would be nice to search for the Mine ID `3503590`... but it isn't easy!

### First issue

When you want to put content on a page, sometimes the content lives on a *different* page and you say "hey web page, include this *other* web page inside of your content." It's like when you embed YouTube videos or whatever. This is called an **iframe**.

The form on this page contains an **iframe**, which makes everything more complicated.

Normally you could get the `input` for the mine entry text like this:

```python
page.locator('#mstr92 input')
```

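But because it's in an *iframe* you have to first grab the iframe, *then* grab the element. The Playwright documentation gives this as an example (it's written for the sync API, so in our async notebook the `.fill` line needs an `await` in front of it):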

```python
# Locate element inside frame
# Get frame using any other selector
username = page.frame_locator('.frame-class').get_by_label('User Name')
username.fill('John')
```

You find the class or ID of the iframe by using the web inspector, finding your element, and going up up up up up the indented sections until you find `<iframe>`!

### Second issue

*You can't even use the `input` in this case*. It's a dumb system where you have to click a div that pretends to be a text input, [type on the keyboard](https://playwright.dev/docs/api/class-keyboard), then click some text that shows up and then click submit. It's *awful*.

### Okay, now do it

Your steps will look something like this:

1. Find the iframe
2. Find the div inside the iframe you need to click (the one that pretends to be the text box). I used xpath for this one.
3. [Type on the keyboard](https://playwright.dev/docs/api/class-keyboard) 
4. Click the popup that suggests the mine you're interested in
5. Click the submit button

Using `.get_by_text` is a lot easier than `.locator` for a few of these.
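
As a rough sketch of those five steps, assuming an async Playwright `page` that is already sitting on the search page. The function name `search_for_mine` and every selector below (`'iframe'`, the XPath, the `'Submit'` text) are placeholders I made up, not the real page's markup, so swap in whatever you find in the web inspector.

```python
async def search_for_mine(page, mine_id):
    """Type a mine ID into the pretend text box inside the iframe and submit."""
    mine_id = str(mine_id)  # keyboard.type and get_by_text both want a string

    # 1. Find the iframe (placeholder selector: an <iframe> tag on the page)
    frame = page.frame_locator('iframe')

    # 2. Click the div that pretends to be a text box (placeholder XPath)
    await frame.locator('xpath=//div[contains(@class, "fake-input")]').click()

    # 3. Type on the keyboard; keystrokes go to whatever currently has focus
    await page.keyboard.type(mine_id)

    # 4. Click the suggestion that shows the mine ID
    await frame.get_by_text(mine_id).click()

    # 5. Click the submit button (assumes its visible text is "Submit")
    await frame.get_by_text('Submit').click()
```

In a notebook cell you would run it with `await search_for_mine(page, 3503590)`.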

## Scraping

Originally I had people scrape the page, displaying the following **in a dictionary.**

- The mine name
- The mine status
- The mined material
- The operator
- The current address

But this page has changed, oh how it has changed. It's now the most insane page I have ever seen in my life, and the only reason I'm continuing on with writing this assignment is to make you appreciate data cleaning.

Here's how we're going to do this: *we're going to get every single bit of information off of the page, all at once, and clean it up later.*

1. Find the class that the Mine ID, name, status, etc all share.
2. Select them
3. Use `.all_text_contents()` to get all of their text
4. Pull out the pieces by index and place them into a dictionary

After cleaning it up, your data should look like this:

```python
{'name': ': Mobile Crusher #4',
 'status': ': Active',
 'material': ': Construction Sand and Gravel',
 'operator': ': Knife River Materials',
 'address': ': 3959 Hamrick Road Central Point OR 97502 '}
```

**Save it as a variable called `data`**
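
The "pull out the pieces by index" step might look something like the sketch below. The `texts` list is a made-up stand-in for what `.all_text_contents()` returns, and the labels and index positions are assumptions; the real list will be longer and ordered differently, so print yours and count before you trust any number here.

```python
# Hypothetical stand-in for the list that .all_text_contents() gives back
texts = [
    'Mine ID', ': 3503590',
    'Mine Name', ': Mobile Crusher #4',
    'Mine Status', ': Active',
    'Primary Mined Material', ': Construction Sand and Gravel',
    'Operator', ': Knife River Materials',
    'Current Address', ': 3959 Hamrick Road Central Point OR 97502 ',
]

# Which position holds which piece (these positions only fit the fake list above)
positions = {
    'name': 3,
    'status': 5,
    'material': 7,
    'operator': 9,
    'address': 11,
}

# Build the dictionary by looking up each position in the list
data = {key: texts[index] for key, index in positions.items()}
data
```

Keeping the positions in one small dictionary means that when the page shifts around (and it will), you only have to fix the numbers in one place.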

## Move all this into one cell and search for mine 3503594

The first line should be `mine_id = 3503594`.

The cell should visit the mine data system, type in the mine ID, click submit, and pull the data on the results page into a dictionary called `data`.

* **Tip:** Use `str(mine_id)` both when you type and when you match with `.get_by_text`. Playwright won't automatically convert an integer to a string for you! If either your typing or your waiting lines break, that might be the issue.
* **Tip:** After you click the 'Submit' button, you need to do something to wait for the updated information to load. I would recommend [waiting for](https://playwright.dev/python/docs/navigations#wait-for-element) something specific to the results page to show up before trying to get the data about the mine. You might need to try a few options before you get something that works.
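
One way the waiting step could go, as a sketch only. It assumes the results show up inside the same iframe and that you've spotted some text that only appears on the results page. The helper name `wait_for_results`, the `'iframe'` selector, and the marker text are all placeholders of mine.

```python
async def wait_for_results(page, marker_text, timeout_ms=30_000):
    """Pause until text that only the results page shows becomes visible."""
    # Placeholder selector; use the real class or ID of the iframe
    frame = page.frame_locator('iframe')

    # .first avoids an error if the text matches more than one element
    marker = frame.get_by_text(marker_text).first
    await marker.wait_for(state='visible', timeout=timeout_ms)
```

You'd call it right after clicking submit, for example `await wait_for_results(page, 'Mine Status')`, where `'Mine Status'` is just a guess at something specific to the results page.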

The result should look like this:

```python
{'name': ': WILLAMETTE PORTABLE #2',
 'status': ': Intermittent',
 'material': ': Crushed, Broken Stone NEC',
 'operator': ': Knife River Corporation-Northwest',
 'address': ': 1812 Willow Lake North Keizer OR 97307 '}
```

## Building your function

Convert this into a function called `get_mine_data`. It should take a `mine_id` and return the dictionary. Save the mine ID under the key `mine_id` in the dictionary.

* **Tip:** You'll need to make it `async def` instead of just `def` since it has `await` statements in it.
* **Tip:** Make sure you're typing `str(mine_id)`, otherwise `get_mine_data('3503595')` will work but `get_mine_data(3503595)` will break!

Test it by calling:

```python
await get_mine_data('3503595')
```

Your result should be a dictionary about a mine named **ROGUE PORTABLE #2**.

## Getting information on many mines

### Reading in our source

Using pandas, read in `mines-subset.csv`.

### Scrape mine data for every single row, saving it in `additional_df`

This is how you loop through the rows. We can't use `.apply` since it's an async function (sad, sad, sad).

```python
for index, row in df.iterrows():
    mine_id = str(row['id'])
    print(mine_id)
```

Adjust this code to save the result of `get_mine_data` into a list called `all_data`. You'll be able to convert it into a dataframe with 

```python
# A new name keeps the original df (and its id column) around for later
additional_df = pd.DataFrame(all_data)
additional_df.head()
```

Save the results as `mine-data.csv`.
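
If you get stuck on the loop, here's one possible shape for it. The helper name `scrape_rows` is mine, and it takes the scraping function as an argument so you can try it with a fake scraper before pointing it at the real `get_mine_data`.

```python
async def scrape_rows(rows, scrape):
    """Await `scrape` once per row and collect the returned dictionaries."""
    all_data = []
    for index, row in rows:
        mine_id = str(row['id'])
        print(mine_id)
        # One page at a time: wait for each mine before starting the next
        result = await scrape(mine_id)
        all_data.append(result)
    return all_data
```

In the notebook that would be `all_data = await scrape_rows(df.iterrows(), get_mine_data)`.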

## Let's... make it a little crazier? While we're at it?

Each one of those pages takes SO LONG to load. Why can't we search like *four mines at a time?*

Because our code is **asynchronous**, we can!

1. Copy and paste your code from above, renaming it `get_mine_data_browser`.
2. Instead of just using `page.goto` to go to the mine search page, also have it open a new window using `page = await browser.new_page()`
3. Close the page just before you `return` at the end of the function (be sure to `await` it!)
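
The open-a-page, scrape, close-the-page pattern can be sketched like this. It's a variation on the steps above rather than the required answer: the name `scrape_in_new_page` is made up, `browser` is passed in instead of read from the notebook, and the close happens in a `finally` so the window still goes away when the scrape throws an error.

```python
async def scrape_in_new_page(browser, scrape_page):
    """Open a fresh page, run `scrape_page(page)` on it, and always close it."""
    page = await browser.new_page()
    try:
        return await scrape_page(page)
    finally:
        # Runs whether the scrape succeeded or raised an error
        await page.close()
```

Without the `finally`, a single failed mine would leave a stray window open for the rest of the run.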

Test it with the following lines of code:

```python
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)

await get_mine_data_browser(2200717)
```

Confirm that it opens a new browser window. I purposefully did not put quotation marks around `2200717`; it should still work! If not, you need to type `str(mine_id)`.

It should return the expected data:

```python
{'name': ': Scribner Pit',
 'status': ': Active',
 'material': ': Construction Sand and Gravel',
 'operator': ': APAC, MS Inc.',
 'address': ': 40102 Flower Farm Road Hamilton MS 39746 '}
```

### Scrape five pages at a time!

Using an edited form of [this code here](https://stackoverflow.com/questions/48483348/how-to-limit-concurrency-with-python-asyncio/61478547#61478547), we can bundle up all of our coroutines from `.apply` and run them 5 at a time. As long as we don't think it'll get us blocked, it's a great way to make our code five times faster!

First we adjust the code from that link to enable running multiple functions at the same time (you'll always be able to cut and paste this):

```python
import asyncio

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)
 
    async def sem_task(task):
        async with semaphore:
            return await task
    return await asyncio.gather(*(sem_task(task) for task in tasks))
```

Then we build a list of things we want done. In this case, it's sending all of our `df['id']` to `get_mine_data_browser`. Usually we need `await` in front of `get_mine_data_browser` to mean "actually scrape this page," but for now we're just saying ***get ready** to scrape this page, but don't do it yet*.

```python
tasks = df.id.apply(get_mine_data_browser)
tasks
```
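
That "get ready, but don't do it yet" behavior is easy to check with a toy async function. Everything here (`fake_scrape`, `demo`, the `ran` list) is made up for the demonstration and never touches a browser.

```python
import asyncio

ran = []  # records which mine IDs have actually been "scraped"

async def fake_scrape(mine_id):
    ran.append(mine_id)  # only happens once the coroutine is awaited
    return {'mine_id': str(mine_id)}

async def demo():
    pending = fake_scrape(2200717)  # no await: just a coroutine object
    before = list(ran)              # snapshot taken before anything runs
    result = await pending          # now the function body actually runs
    return before, result
```

With `before, result = await demo()` in a notebook, `before` is an empty list because calling `fake_scrape` without `await` didn't run any of its code, while `result` holds the dictionary produced once it was awaited.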

Then we say hey, we have a list of things we want done! Run these, 5 at a time!

```python
# Prepare a new browser
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)

# 5 at a time
results = await gather_with_concurrency(5, *tasks)
```

You can look at the results if you want, then we'll take the list of dicts and convert them into a dataframe!

```python
df = pd.DataFrame(results)
df
```

Because we moved `page = await browser.new_page()` into the function, each time we run the function it opens a new browser window. As a result, typing and searching in one window won't conflict with typing and searching in another window!