# Scraping basics for Playwright or Selenium

If you feel comfortable with scraping in general, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

**Possibly useful links:**

* Scraping section of my [everything page](https://jonathansoma.com/everything/)
* Some [old Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) (if you decide to use Selenium)
* [Loops in Playwright](https://jonathansoma.com/everything/scraping/loops-in-playwright/), which covers what tripped us up in class when we leaned so heavily on `.locator`.

## Part 0: Imports

Import what you need to use Playwright or Selenium, and start up a new browser to use for scraping. 
> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [1]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()

In [2]:
browser = await playwright.chromium.launch(headless = False)

In [3]:
# new_page() is a coroutine, so it needs `await` to actually open the page
page = await browser.new_page()

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [4]:
# Reuse the browser from Part 0 instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [5]:
await page.content()


'<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n<h3 class="subhead">Some Supplemental Materials</h3>\n<p class="byline">By Jonathan Soma</p></body></html>'

In [23]:
from bs4 import BeautifulSoup
# Naming the parser avoids BeautifulSoup's "no parser specified" warning
doc = BeautifulSoup(await page.content(), "html.parser")
title = doc.select("h1")[0].string
print(title)

How to Scrape Things


In [26]:
doc = BeautifulSoup(await page.content(), "html.parser")
subhead = doc.select("h3")[0].string
print(subhead)

Some Supplemental Materials


In [27]:
doc = BeautifulSoup(await page.content(), "html.parser")
byline = doc.select("p")[0].string
print(byline)

By Jonathan Soma
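
Since this part is about classes, here is a minimal sketch that selects by class name rather than tag name. It parses a hardcoded copy of the page source printed above so that it runs without a browser; in the notebook you would pass `await page.content()` instead. `select_one` returns the first match, which saves the `[0]`.

```python
from bs4 import BeautifulSoup

# Hardcoded copy of the page source (stand-in for `await page.content()`)
html = (
    '<html><head></head><body><h1 class="title">How to Scrape Things</h1>\n'
    '<h3 class="subhead">Some Supplemental Materials</h3>\n'
    '<p class="byline">By Jonathan Soma</p></body></html>'
)

doc = BeautifulSoup(html, "html.parser")

# A leading dot in a CSS selector means "class", so ".title" matches class="title"
title = doc.select_one(".title").text
subhead = doc.select_one(".subhead").text
byline = doc.select_one(".byline").text

print(title)
print(subhead)
print(byline)
```

Class selectors keep working even if the page later swaps the `h3` for some other tag, as long as the class names stay the same.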


## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

In [31]:
# Reuse the existing browser instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [32]:
await page.content()

'<html><head></head><body><h1>How to Scrape Things</h1>\n<h3>Some Supplemental Materials</h3>\n<p>By Jonathan Soma</p></body></html>'

In [33]:
doc = BeautifulSoup(await page.content(), "html.parser")
title = doc.select("h1")[0].string
print(title)

How to Scrape Things


In [34]:
doc = BeautifulSoup(await page.content(), "html.parser")
subhead = doc.select("h3")[0].string
print(subhead)

Some Supplemental Materials


In [35]:
doc = BeautifulSoup(await page.content(), "html.parser")
byline = doc.select("p")[0].string
print(byline)

By Jonathan Soma


## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline. You're welcome to use BeautifulSoup as long as the information comes from Playwright/Selenium.

> **This will be important for the next few:** if you scrape multiple items, you have a list. In Selenium you can use `[0]`, `[1]`, `[-1]`, etc. just like you would for a normal list (and in Playwright, too, as long as you're using `query_selector_all`). If you're using locators you'll need to use `.nth(0)`, `.nth(1)`, `.nth(2)`.
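
The difference between the two Playwright styles can be sketched as two small helpers. Both take an already-open Playwright `page` (such as one made with `await browser.new_page()`); the function names are made up for illustration and are not part of Playwright.

```python
async def texts_with_locator(page, selector):
    """Collect the text of every match using a locator and .nth()."""
    matches = page.locator(selector)
    # Locators are lazy, so ask how many elements match before indexing
    count = await matches.count()
    return [await matches.nth(i).text_content() for i in range(count)]


async def texts_with_query_selector_all(page, selector):
    """Collect the text of every match using query_selector_all."""
    elements = await page.query_selector_all(selector)
    # This is a plain Python list, so [0], [-1], and for loops all work
    return [await element.text_content() for element in elements]
```

In a notebook cell you would call either one with `await`, for example `await texts_with_locator(page, "p")`, and then index the returned list like any other.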

In [36]:
# Reuse the existing browser instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/by-list.html")

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [37]:
await page.content()

'<html><head></head><body><p>How to Scrape Things</p>\n<p>Some Supplemental Materials</p>\n<p>By Jonathan Soma</p></body></html>'

In [38]:
doc = BeautifulSoup(await page.content())
title = doc.select("p")[0].string
print(title)

How to Scrape Things


In [42]:
doc = BeautifulSoup(await page.content())
subhead = doc.select("p")[1].string
print(subhead)

Some Supplemental Materials


In [43]:
doc = BeautifulSoup(await page.content())
byline = doc.select("p")[2].string
print(byline)

By Jonathan Soma


In [None]:
# This worked because doc.select("p") returns a list of every <p> tag in page order,
# so [0], [1], and [2] pick out the title, subhead, and byline.

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [46]:
# Reuse the existing browser instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [47]:
await page.content()

'<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n</tbody></table></body></html>'

In [48]:
doc = BeautifulSoup(await page.content())
title = doc.select("td")[0].string
print(title)

How to Scrape Things


In [49]:
doc = BeautifulSoup(await page.content())
subhead = doc.select("td")[1].string
print(subhead)

Some Supplemental Materials


In [50]:
doc = BeautifulSoup(await page.content())
byline = doc.select("td")[2].string
print(byline)

By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [55]:
cells = doc.select("td")
book = {}
book['title'] = cells[0].string
book['subhead'] = cells[1].string
book['byline'] = cells[2].string
book

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
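
When the keys and the cells line up one-to-one, the same dictionary can also be built with `zip`. This is a minimal sketch that uses a hardcoded copy of the table HTML printed in Part 4 in place of `await page.content()`; the `keys` list is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hardcoded copy of the single-table-row page (stand-in for `await page.content()`)
html = (
    "<html><head></head><body><table>\n  <tbody><tr>\n"
    "    <td>How to Scrape Things</td>\n"
    "    <td>Some Supplemental Materials</td>\n"
    "    <td>By Jonathan Soma</td>\n"
    "  </tr>\n</tbody></table></body></html>"
)

doc = BeautifulSoup(html, "html.parser")
cells = doc.select("td")

keys = ["title", "subhead", "byline"]
# zip pairs each key with the cell in the same position
book = dict(zip(keys, (cell.text for cell in cells)))
print(book)
```

Assigning the keys one at a time is just as valid; `zip` mostly pays off when there are many columns.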

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [56]:
# Reuse the existing browser instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/multiple-table-rows.html")

<Response url='https://jonathansoma.com/lede/static/multiple-table-rows.html' request=<Request url='https://jonathansoma.com/lede/static/multiple-table-rows.html' method='GET'>>

In [57]:
await page.content()

"<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>"

In [58]:
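# Re-parse the page: the old `doc` still holds the single-row table from Part 4
doc = BeautifulSoup(await page.content(), "html.parser")
rows = doc.select("tr")
for row in rows:
    # Look for cells inside this row only, not across the whole page
    cells = row.select("td")
    print(cells[0].text)
    print(cells[1].text)
    print(cells[2].text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
How to Scrape Many Things
But, Is It Even Possible?
By Sonathan Joma
The End of Scraping
Let's All Use CSV Files
By Amos Nathanos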


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [61]:
# Reuse the existing browser instead of launching another one
page = await browser.new_page()
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")

<Response url='https://jonathansoma.com/lede/static/the-actual-table.html' request=<Request url='https://jonathansoma.com/lede/static/the-actual-table.html' method='GET'>>

In [62]:
await page.content()

'<html><head></head><body><table id="booklist">\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let\'s All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>'
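
One possible way to get a list of dictionaries is sketched below. It parses a hardcoded copy of the page source printed above so that it runs without a browser; in the notebook you would pass `await page.content()` instead. The `books` and `book` names are my own choice.

```python
from bs4 import BeautifulSoup

# Hardcoded copy of the page source (stand-in for `await page.content()`)
html = """<html><head></head><body><table id="booklist">
  <tbody><tr>
    <td>How to Scrape Things</td>
    <td>Some Supplemental Materials</td>
    <td>By Jonathan Soma</td>
  </tr>
  <tr>
    <td>How to Scrape Many Things</td>
    <td>But, Is It Even Possible?</td>
    <td>By Sonathan Joma</td>
  </tr>
  <tr>
    <td>The End of Scraping</td>
    <td>Let's All Use CSV Files</td>
    <td>By Amos Nathanos</td>
  </tr>
</tbody></table></body></html>"""

doc = BeautifulSoup(html, "html.parser")

books = []
# "#booklist tr" means: every <tr> inside the element with id="booklist"
for row in doc.select("#booklist tr"):
    cells = row.select("td")
    book = {
        "title": cells[0].text,
        "subhead": cells[1].text,
        "byline": cells[2].text,
    }
    books.append(book)

print(books)
```

The key step is calling `row.select("td")` rather than `doc.select("td")`, so each dictionary only sees the cells of its own row.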

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.
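
Here is a minimal sketch of the second route, turning a list of dictionaries into a DataFrame. The `books` list is typed out by hand as a stand-in for whatever Part 7 produced. The pandas-only route would be `pd.read_html`, which needs an extra HTML parsing library installed, so it is not shown here.

```python
import pandas as pd

# Stand-in for the list of dictionaries built in Part 7
books = [
    {
        "title": "How to Scrape Things",
        "subhead": "Some Supplemental Materials",
        "byline": "By Jonathan Soma",
    },
    {
        "title": "How to Scrape Many Things",
        "subhead": "But, Is It Even Possible?",
        "byline": "By Sonathan Joma",
    },
    {
        "title": "The End of Scraping",
        "subhead": "Let's All Use CSV Files",
        "byline": "By Amos Nathanos",
    },
]

# Each dictionary becomes a row, and each key becomes a column
df = pd.DataFrame(books)
print(df)
```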

## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`
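
A minimal sketch of the saving step, using Python's built-in `csv` module so that it works without pandas. The `books` list is typed out by hand as a stand-in for the scraped rows.

```python
import csv

# Stand-in for the list of dictionaries scraped from the table
books = [
    {
        "title": "How to Scrape Things",
        "subhead": "Some Supplemental Materials",
        "byline": "By Jonathan Soma",
    },
    {
        "title": "How to Scrape Many Things",
        "subhead": "But, Is It Even Possible?",
        "byline": "By Sonathan Joma",
    },
    {
        "title": "The End of Scraping",
        "subhead": "Let's All Use CSV Files",
        "byline": "By Amos Nathanos",
    },
]

# newline="" lets the csv module control line endings itself
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "subhead", "byline"])
    writer.writeheader()      # header row with the column names
    writer.writerows(books)   # one line per dictionary
```

If you already have a DataFrame, `df.to_csv("output.csv", index=False)` does the same job in one line.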