# Chicago Building Permits, Part 2

You'll find the [Selenium-Playwright conversion reference](https://jonathansoma.com/everything/scraping/selenium-playwright-conversion/) helpful for clicking, entering text, and selecting from dropdowns.

**Use Playwright or Selenium to visit https://webapps1.chicago.gov/buildingrecords/ and accept the agreement.**

# Clicking links for inspection details

This one is real... fun? It's also what we spent some time on Wednesday. Keep a copy of the classwork handy so you can see where we use `await` and where we don't.

## Getting our results page

As always, click the search link and search for **400 E 41ST ST**.

## Finding all of the inspection links

Use `.locator` to select all of the links in the inspections table. Save them as a variable called `links`.

Count them to confirm there are around 160-170.

```python
count = await links.count()
count
```

## Clicking the first one

Use `links.nth(0)` to grab the first link. Click the first link.

## OH NO IT OPENED UP A NEW WINDOW!!!

Close it. *Close it!!!* It's hard to talk to the new window after clicking the way we just did. Instead, tell Playwright to expect the popup *before* you click:

```python
# Click the link while expecting a new tab or window to open
async with page.expect_popup() as popup_info:
    await links.nth(0).click()  # your link-clicking code from above goes here

# Grab the new page and wait for it to load
new_page = await popup_info.value
await new_page.wait_for_load_state()
```

Now you can use `await new_page.content()` for stuff on the new page.

## Checking for violations

I clicked around a *lot* and couldn't find any listings with violations. How are we supposed to scrape it???

### See if the page says `no Alleged Code Violation records` somewhere on it

You can look at the page source, you can use `.get_by_text` and `count()`, you can do whatever you want.

### Write a conditional

If "no Alleged Code Violation records" shows up on the page, print `no violations`. Otherwise, print `has violations`.
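One way to write that check is a plain substring test on the page source. This is only a minimal sketch: `describe_violations` is a made-up helper name, the two HTML snippets are invented stand-ins, and it assumes the message shows up in the HTML with exactly this wording and capitalization. In the notebook, `html` would come from `await new_page.content()`.

```python
# The exact phrase we expect on pages without violations (assumed wording)
NO_VIOLATIONS_TEXT = "no Alleged Code Violation records"


def describe_violations(html):
    """Return 'no violations' if the marker text is in the page source."""
    if NO_VIOLATIONS_TEXT in html:
        return "no violations"
    return "has violations"


# Made-up snippets standing in for real page source
print(describe_violations("<p>There are no Alleged Code Violation records here.</p>"))
print(describe_violations("<table><tr><td>EV1110</td></tr></table>"))
```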

### Close the new page

You can use this code:

```python
await new_page.close()
```

### Finding a page with violations

Loop through the first... 30? pages to see if we can find one that has violations on it.

```python
for i in range(30):
    # 1. Click the ith link to open up the new page
    # 2. Save the text of the inspection link as insp_no, and print it out
    #       just so we can keep track
    # 3. Check if there are violations on that page
    # 4. If there aren't, close the new page (make sure you use await)
    # 5. If there are, use 'break' to exit the loop so we can look at it
```
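If you get stuck, here is one possible shape for that loop, wrapped in a function so it also runs outside a notebook. Treat it as a sketch, not the answer key: `find_first_with_violations` and its `limit` parameter are names invented here, and it assumes `page` and `links` are the Playwright page and locator from earlier and that the "no violations" message appears verbatim in the page source.

```python
async def find_first_with_violations(page, links, limit=30):
    """Open inspection links one by one until a page with violations turns up.

    Returns (insp_no, new_page) with the popup left open for inspection,
    or (None, None) if none of the checked pages had violations.
    """
    found = (None, None)
    # Don't run past the end of the table if it has fewer links than `limit`
    total = min(limit, await links.count())

    for i in range(total):
        link = links.nth(i)

        # Save the link text so we can keep track of where we are
        insp_no = (await link.inner_text()).strip()
        print(insp_no)

        # Click the link while expecting a new tab or window to open
        async with page.expect_popup() as popup_info:
            await link.click()
        new_page = await popup_info.value
        await new_page.wait_for_load_state()

        html = await new_page.content()
        if "no Alleged Code Violation records" in html:
            # Nothing to see here, so close the popup and keep going
            await new_page.close()
        else:
            found = (insp_no, new_page)
            break

    return found
```

In the notebook you would call it with `insp_no, new_page = await find_first_with_violations(page, links)` and then poke around `new_page`.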

### Hooray! It should have stopped on a page. Convert the inspections into a dataframe named `df`

You can use `pd.read_html`; just remember you're getting this content from the new page, not the original page. Recent versions of pandas warn unless you wrap the HTML string in `io.StringIO(...)` first. Use `header=1` to skip the first **Inspection #** header that kind of gets in the way.

### Saving inspection records

Convert your `for i in range(30)` code above. Instead of exiting the loop when it finds violations:

1. Create a dataframe from the violations
2. Add a new column called `insp_no` to it with the inspection number
3. Save the dataframe to a list of dataframes
4. Close the new page (previously we only did it if we found no violations)
5. After the loop finishes, use `pd.concat` to combine the dataframes into one big dataframe

The completed dataframe **should have around 60 rows**, and the first few rows should look something like

| |VIOLATIONS|BUILDING CODE CITATION|VIOLATION DETAILS|insp_no|
|---|---|---|---|---|
|0|EV1110|Failed to maintain electric elevator equipment...|Reinspection required. No Show|13808208|
|1|EV1110|Failed to maintain electric elevator equipment...|Upgrade fire Service phase 2, all Elevators. N...|13748085|
|2|EV1110|Failed to maintain electric elevator equipment...|Provide SAFE access to all Elevators|13748085|
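The tag-then-combine part of those steps can be tried out without a browser. In this sketch the two small dataframes and their inspection numbers are invented stand-ins for what `pd.read_html` would give you on each popup page; only the `insp_no` column name and `pd.concat` come from the steps above.

```python
import pandas as pd

# Made-up stand-ins for the tables scraped from two inspection pages
scraped = {
    "11111111": pd.DataFrame({
        "VIOLATIONS": ["EV1110"],
        "VIOLATION DETAILS": ["Example detail A"],
    }),
    "22222222": pd.DataFrame({
        "VIOLATIONS": ["EV1110", "EV1110"],
        "VIOLATION DETAILS": ["Example detail B", "Example detail C"],
    }),
}

dfs = []
for insp_no, df in scraped.items():
    # Every row from this page gets the same inspection number
    df["insp_no"] = insp_no
    dfs.append(df)

# Combine once, after the loop; ignore_index renumbers the rows from 0
all_violations = pd.concat(dfs, ignore_index=True)
print(all_violations)
```

Collecting dataframes in a list and calling `pd.concat` once at the end is usually simpler than growing one dataframe inside the loop.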

## Put it all into one cell

And scrape all of the inspection violations for `400 E 41ST ST`, saving them into a file called `violations-400 E 41ST ST.csv`.

You're pretty much just taking the above and

1. Adding the search + link locator
2. Adding a `time.sleep`, since the page sometimes takes a moment to load
3. Doing all of them instead of just 30
4. Saving at the very very end

## Convert it into a function called `get_violations`

Test with

```python
await get_violations('25 W Randolph St')
```

Confirm it creates the file and saves it.
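The only part of the filename that changes between addresses is the address itself, so an f-string covers it. The helpers below are hypothetical (neither `violations_filename` nor `file_was_saved` is part of the assignment); they just sketch how you might build the name and confirm the file exists, assuming the CSV lands in the current folder.

```python
from pathlib import Path


def violations_filename(address):
    """Build the CSV name for an address, e.g. violations-400 E 41ST ST.csv."""
    return f"violations-{address}.csv"


def file_was_saved(address, folder="."):
    """Check whether the CSV for this address exists in the given folder."""
    return (Path(folder) / violations_filename(address)).exists()


print(violations_filename("25 W Randolph St"))
```

After running `await get_violations('25 W Randolph St')`, calling `file_was_saved('25 W Randolph St')` should come back `True` if the save worked.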

# Wrapping it all up

Now that we know this all works, we can run it in **headless mode**. Headless mode means you can't see the browser doing anything.

## Going headless

Typically we use `headless=False` because we need to see the browser to debug, and it's fun to watch the browser click around and navigate. `headless=True` lets it run in the background without disturbing the work you're doing.

**Launch a new browser window** that has `headless=True`, and then use that to scrape any of the above content again. Feel free to pick some other addresses from Chicago, etc etc etc (I was using [rent.com listings](https://www.rent.com/illinois/chicago-apartments)).

```python
import asyncio
from playwright.async_api import async_playwright

# Visit a page using chromium (could also do .firefox or .webkit)
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=True)
page = await browser.new_page()

await page.goto("https://webapps1.chicago.gov/buildingrecords/")
await page.locator("#rbnAgreement1").click()
await page.locator("#submit").click()
```

In "real life" we'd definitely put this in a brand-new notebook, but...

## A scraping hydra

If you wanted to get exceptionally fancy, you could also have a `scrape_all` function that takes in an address, and then scrapes permits, inspections, *and* violations, one after the other... I won't make you do that, though.
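If you're curious what that might look like, here is a rough sketch. Everything in it is an assumption made for illustration: the `scrapers` dictionary, the `fake_scraper` stand-in, and the idea that each real scraper is an async function that takes an address. The point is only that awaiting each one in turn runs them one after the other.

```python
import asyncio


async def scrape_all(address, scrapers):
    """Run each scraper on the address in order and collect the results."""
    results = {}
    for name, scraper in scrapers.items():
        # await finishes this scraper before the next one starts
        results[name] = await scraper(address)
    return results


async def fake_scraper(address):
    """Stand-in scraper so the sketch runs without a browser."""
    return f"scraped {address}"


async def main():
    scrapers = {
        "permits": fake_scraper,
        "inspections": fake_scraper,
        "violations": fake_scraper,
    }
    return await scrape_all("400 E 41ST ST", scrapers)


# In a notebook, use `await main()` instead of asyncio.run(...)
print(asyncio.run(main()))
```

In a real version you would swap `fake_scraper` for your own functions, such as `get_violations`.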