# Scraping basics for Playwright

If you already feel comfortable with scraping, feel free to skip this notebook and go straight to the next one. The same goes if you get bored partway through.

> The [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.
>
> I know I love them, but **you don't have to use CSS selectors!**
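
As a small sketch of what "not using CSS selectors" can look like, here is the same lookup done both ways in BeautifulSoup. The HTML string is made up for illustration and is not the markup of any page in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only, not taken from the practice pages
html = '<h1 class="title">Example Title</h1><p class="byline">By Someone</p>'
doc = BeautifulSoup(html, "html.parser")

# With a CSS selector
title_css = doc.select_one(".title").text

# Without a CSS selector: find by tag name and class
title_find = doc.find("h1", class_="title").text

print(title_css)
print(title_find)
```

Both lines print the same text, so pick whichever style reads better to you.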

## Part 0: Imports

Import what you need to use Playwright, and start up a new browser to use for scraping.

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.
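
If you'd rather close things down without restarting the kernel, something like the sketch below might work. `shut_down` is a made-up helper name; it assumes `browser` and `playwright` are the objects created in the cells below, and it relies on Playwright's `browser.close()` and `playwright.stop()` methods.

```python
async def shut_down(browser, playwright):
    """Close the browser window, then stop the Playwright driver."""
    # Closing the browser also closes any pages opened from it
    await browser.close()
    # Stopping Playwright shuts down the background driver process
    await playwright.stop()
```

In a notebook cell you would call it as `await shut_down(browser, playwright)`. After that you need to start Playwright and launch a browser again before scraping anything else.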

In [10]:
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

In [5]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

In [34]:
page = await browser.new_page()

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [9]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [11]:
html = await page.content()

# Name the parser explicitly so BeautifulSoup doesn't have to guess
doc = BeautifulSoup(html, "html.parser")

In [44]:
doc.select_one(".title").text

'How to Scrape Things'

In [45]:
doc.select_one(".subhead").text

'Some Supplemental Materials'

In [46]:
doc.select_one(".byline").text

'By Jonathan Soma'

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [57]:
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")

<Response url='https://jonathansoma.com/lede/static/by-tag.html' request=<Request url='https://jonathansoma.com/lede/static/by-tag.html' method='GET'>>

In [58]:
html = await page.content()
doc2 = BeautifulSoup(html, "html.parser")

In [59]:
doc2.select_one('h1').string

'How to Scrape Things'

In [60]:
doc2.select_one('h3').string

'Some Supplemental Materials'

In [61]:
doc2.select_one('p').string

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

> **This will be important for the next few:** you can use `.get_by_text` but it seems kind of silly since maybe the text would change. I think getting them all, then using list indexes like `[0]`, etc, would be better! If I sold you on CSS selectors, you can also look up `nth-of-type` and use it with `.select_one`.
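
To make the two options concrete, here is a rough sketch comparing list indexes with `nth-of-type`. The HTML is made up for illustration, so the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML: three sibling <p> tags with no classes to tell them apart
html = "<div><p>First</p><p>Second</p><p>Third</p></div>"
doc = BeautifulSoup(html, "html.parser")

# Option 1: grab every <p>, then pick by list index (counting from 0)
paragraphs = doc.select("p")
second_by_index = paragraphs[1].text

# Option 2: nth-of-type counts from 1, so the second <p> is nth-of-type(2)
second_by_css = doc.select_one("p:nth-of-type(2)").text

print(second_by_index, second_by_css)
```

Note the off-by-one difference: Python lists start at 0, while `nth-of-type` starts at 1 and counts among siblings inside the same parent.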

In [62]:
await page.goto("http://jonathansoma.com/lede/static/by-list.html")

<Response url='https://jonathansoma.com/lede/static/by-list.html' request=<Request url='https://jonathansoma.com/lede/static/by-list.html' method='GET'>>

In [63]:
html = await page.content()
doc3 = BeautifulSoup(html, "html.parser")


In [75]:
# Select all the <p> tags once, then pick each one by position
paragraphs = doc3.select('p')
print("The title is", paragraphs[0].text)
print("The subheader is", paragraphs[1].text)
print("The byline is", paragraphs[2].text)

The title is How to Scrape Things
The subheader is Some Supplemental Materials
The byline is By Jonathan Soma


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

In [76]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")

<Response url='https://jonathansoma.com/lede/static/single-table-row.html' request=<Request url='https://jonathansoma.com/lede/static/single-table-row.html' method='GET'>>

In [78]:
html = await page.content()
doc4 = BeautifulSoup(html, "html.parser")

In [79]:
# Select all the <td> cells once, then pick each one by position
cells = doc4.select('td')
print("The title is", cells[0].text)
print("The subheader is", cells[1].text)
print("The byline is", cells[2].text)

The title is How to Scrape Things
The subheader is Some Supplemental Materials
The byline is By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [81]:
# Reuse doc4 from Part 4, since it's the same page
cells = doc4.select('td')

book = {
    'title': cells[0].text,
    'subheader': cells[1].text,
    'byline': cells[2].text
}
book

{'title': 'How to Scrape Things',
 'subheader': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!
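
One possible approach, sketched here on stand-in HTML rather than the real page: loop over every `tr`, pull out its `td` cells, and print them. The table markup below is an assumption, so the actual page may need different selectors.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption): the real page's markup may differ
html = """
<table>
  <tr><td>Title A</td><td>Subhead A</td><td>By Author A</td></tr>
  <tr><td>Title B</td><td>Subhead B</td><td>By Author B</td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

lines = []
for row in doc.select("tr"):
    cells = row.select("td")
    # Skip rows without exactly three cells (e.g. a header row of <th> tags)
    if len(cells) != 3:
        continue
    title, subhead, byline = [cell.get_text(strip=True) for cell in cells]
    lines.append(f"{title} / {subhead} / {byline}")

print("\n".join(lines))
```

In the notebook, `html` would come from `await page.content()` after a `page.goto(...)` to the URL above, instead of the hard-coded string.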

## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either, even though that's exactly what we did in class.

In [95]:
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")

<Response url='https://jonathansoma.com/lede/static/the-actual-table.html' request=<Request url='https://jonathansoma.com/lede/static/the-actual-table.html' method='GET'>>

In [96]:
html = await page.content()
doc5 = BeautifulSoup(html, "html.parser")

In [102]:
table = doc5.find("table")
rows = table.find_all("tr")
all_data = []
for row in rows:
    cells = row.find_all("td")
    if cells:
        title = cells[0].get_text(strip=True)
        subheader = cells[1].get_text(strip=True)
        byline = cells[2].get_text(strip=True)

        book = {
            'title': title,
            'subheader': subheader,
            'byline': byline
        }
        all_data.append(book)

all_data

[{'title': 'How to Scrape Things',
  'subheader': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subheader': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subheader': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

## Part 8: Scraping an actual table into a pandas DataFrame

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.
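
The pandas-only route is probably `pd.read_html`, which returns one DataFrame per `<table>` it finds. This sketch runs on stand-in HTML (an assumption about the page's structure, including the header row) and needs an HTML parser such as `lxml` installed for pandas to use.

```python
from io import StringIO

import pandas as pd

# Stand-in HTML (an assumption): the real page's markup may differ
html = """
<table>
  <tr><th>title</th><th>subheader</th><th>byline</th></tr>
  <tr><td>Title A</td><td>Subhead A</td><td>By Author A</td></tr>
  <tr><td>Title B</td><td>Subhead B</td><td>By Author B</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds;
# wrapping the string in StringIO avoids a deprecation warning in newer pandas
tables = pd.read_html(StringIO(html))
df_from_html = tables[0]
print(df_from_html)
```

In the notebook you could pass the result of `await page.content()` in place of the hard-coded string. If the real table has no header row, the columns come back numbered and you would need to rename them yourself.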

In [103]:
import pandas as pd
df = pd.DataFrame(all_data)
df.head()

| | title | subheader | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [104]:
df.to_csv("output.csv", index=False)
df.head()

| | title | subheader | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
