# Scraping basics for Playwright

If you feel comfortable with scraping in general, you're free to skip this notebook and try to go right to the next one. Same thing if you get bored partway down.

> The [scraping section](https://jonathansoma.com/everything/scraping/) on my Everything I Know site might be helpful.
>
> I know I love them, but **you don't have to use CSS selectors!**
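
As a rough sketch of the non-CSS route, BeautifulSoup's `find` can look elements up by tag name or by class. The HTML string and the `sample_` names below are made-up stand-ins for illustration, not one of the assignment pages.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration; the real pages' markup may differ
sample_html = '<h1 class="title">How to Scrape Things</h1><p class="byline">By Jonathan Soma</p>'
sample_doc = BeautifulSoup(sample_html, "html.parser")

# CSS selector version: first element with class "title"
title_css = sample_doc.select_one(".title").text

# Same element without a CSS selector, looked up by class name
title_find = sample_doc.find(class_="title").text

# Looked up by tag name instead
byline_find = sample_doc.find("p").text

print(title_css, "|", title_find, "|", byline_find)
```

Both lookups return the same element here, so which one to use is mostly a matter of taste.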

## Part 0: Imports

Import what you need to use Playwright, and start up a new browser to use for scraping. 

> If you end up opening a lot of Chromes/Chromiums, shutting down the Python kernel with the stop button is an easy way to make them go away! You'll have to re-run your notebook, but at least you won't have sixty icons in your dock.

In [1]:
from playwright.async_api import async_playwright

# Start Playwright and open a visible (non-headless) Chromium window
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

In [7]:
await page.goto("http://jonathansoma.com/lede/static/by-class.html")

<Response url='https://jonathansoma.com/lede/static/by-class.html' request=<Request url='https://jonathansoma.com/lede/static/by-class.html' method='GET'>>

In [8]:
from bs4 import BeautifulSoup

# Grab the rendered HTML from the browser and parse it with an explicit parser
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [22]:
title = doc.select(".title")[0].text
title

'How to Scrape Things'

In [23]:
subhead = doc.select(".subhead")[0].text
subhead

'Some Supplemental Materials'

In [24]:
byline = doc.select(".byline")[0].text
byline

'By Jonathan Soma'

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [38]:
await page.goto("http://jonathansoma.com/lede/static/by-tag.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [39]:
title1 = doc.select("h1")[0].text
title1

'How to Scrape Things'

In [40]:
subhead2 = doc.select("h3")[0].text
subhead2

'Some Supplemental Materials'

In [41]:
byline2 = doc.select("p")[0].text
byline2

'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, creating a dictionary out of the title, subhead, and byline in sentences, e.g. "the title is `______`"

> **This will be important for the next few:** you can use `.get_by_text` but it seems kind of silly since maybe the text would change. I think getting them all, then using list indexes like `[0]`, etc, would be better! If I sold you on CSS selectors, you can also look up `nth-of-type` and use it with `.select_one`.
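
To make the two options in the note concrete, here is a small sketch comparing a list index with `nth-of-type`. The HTML is an invented stand-in (I'm assuming three sibling `<p>` tags), so the real page may need a different selector.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration; assumes the three items are sibling <p> tags
sample_html = """
<div>
  <p>How to Scrape Things</p>
  <p>Some Supplemental Materials</p>
  <p>By Jonathan Soma</p>
</div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Option 1: grab every <p>, then pick one by list index (counting from 0)
paragraphs = sample_doc.select("p")
by_index = paragraphs[1].text

# Option 2: let the selector pick the second <p> (nth-of-type counts from 1)
by_selector = sample_doc.select_one("p:nth-of-type(2)").text

print(by_index)
print(by_selector)
```

Note the off-by-one difference: list indexes start at 0, while `nth-of-type` starts at 1.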

In [42]:
await page.goto("https://jonathansoma.com/lede/static/by-list.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [58]:
dic = {}
dic['title'] = doc.select("p")[0].text
dic['subhead']= doc.select("p")[1].text
dic['byline']= doc.select("p")[2].text
dic

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}

## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline in sentences, e.g. "the title is `______`."

In [61]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [77]:
title3 = doc.select("td")[0].text
print("The title is", title3)
subhead3 = doc.select("td")[1].text
print("The subhead is", subhead3)
byline3 = doc.select("td")[2].text
print("The byline is", byline3)

The title is How to Scrape Things
The subhead is Some Supplemental Materials
The byline is By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [73]:
await page.goto("http://jonathansoma.com/lede/static/single-table-row.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [78]:
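# Read the cells from the page loaded above rather than reusing Part 4's variables
cells = doc.select("td")
book = {
    'title': cells[0].text,
    'subhead': cells[1].text,
    'byline': cells[2].text,
}
book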

{'title': 'How to Scrape Things',
 'subhead': 'Some Supplemental Materials',
 'byline': 'By Jonathan Soma'}
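
Another way to build the same dictionary, sketched under the assumption that the cells always arrive in title, subhead, byline order, is to pair a list of keys with the cell text using `zip`. The HTML and the `sample_` names are stand-ins for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration; the real page's markup may differ
sample_html = """
<table>
  <tr>
    <td>How to Scrape Things</td>
    <td>Some Supplemental Materials</td>
    <td>By Jonathan Soma</td>
  </tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Assumed column order: title, then subhead, then byline
keys = ["title", "subhead", "byline"]
values = [cell.text for cell in sample_doc.select("td")]

# zip pairs each key with the cell in the same position
sample_book = dict(zip(keys, values))
print(sample_book)
```

This avoids repeating `[0]`, `[1]`, `[2]`, at the cost of silently dropping extra cells if the row ever has more than three.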

## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [81]:
await page.goto("https://jonathansoma.com/lede/static/multiple-table-rows.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [84]:
first_title = doc.select("td")[0].text
first_title

'How to Scrape Things'

In [86]:
first_subhead = doc.select("td")[1].text
first_subhead

'Some Supplemental Materials'

In [87]:
first_byline = doc.select("td")[2].text
first_byline

'By Jonathan Soma'

In [88]:
second_title = doc.select("td")[3].text
second_title

'How to Scrape Many Things'

In [89]:
second_subhead = doc.select("td")[4].text
second_subhead

'But, Is It Even Possible?'

In [90]:
second_byline = doc.select("td")[5].text
second_byline

'By Sonathan Joma'

In [91]:
third_title = doc.select("td")[6].text
third_title

'The End of Scraping'

In [92]:
third_subhead = doc.select("td")[7].text
third_subhead

"Let's All Use CSV Files"

In [93]:
third_byline = doc.select("td")[8].text
third_byline

'By Amos Nathanos'

In [None]:
# okay, I know this might be a very stupid method...

In [96]:
# The cell-by-cell approach above felt too repetitive, so I watched your walk-through
# video for this part and learned to loop over each <tr> instead
rows = doc.select("tr")
rows

[<tr>
 <td>How to Scrape Things</td>
 <td>Some Supplemental Materials</td>
 <td>By Jonathan Soma</td>
 </tr>,
 <tr>
 <td>How to Scrape Many Things</td>
 <td>But, Is It Even Possible?</td>
 <td>By Sonathan Joma</td>
 </tr>,
 <tr>
 <td>The End of Scraping</td>
 <td>Let's All Use CSV Files</td>
 <td>By Amos Nathanos</td>
 </tr>]

In [100]:
for row in rows:
    print("------")
    cells = row.select("td")
    print('Title is', cells[0].text)
    print('Subhead is', cells[1].text)
    print('Byline is', cells[2].text)

------
Title is How to Scrape Things
Subhead is Some Supplemental Materials
Byline is By Jonathan Soma
------
Title is How to Scrape Many Things
Subhead is But, Is It Even Possible?
Byline is By Sonathan Joma
------
Title is The End of Scraping
Subhead is Let's All Use CSV Files
Byline is By Amos Nathanos


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either, even though that's exactly what we did in class.

In [None]:
await page.goto("http://jonathansoma.com/lede/static/the-actual-table.html")
html = await page.content()
doc = BeautifulSoup(html, "html.parser")

In [103]:
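all_elements = []

# Re-select the rows from the page loaded above; `rows` from Part 6 still points at the old page
rows = doc.select("tr")
for row in rows:
    cells = row.select("td")
    # Skip any row without three <td> cells (for example, a header row made of <th>)
    if len(cells) < 3:
        continue
    elements = {
        'title': cells[0].text,
        'subhead': cells[1].text,
        'byline': cells[2].text
    }
    all_elements.append(elements)
all_elements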

[{'title': 'How to Scrape Things',
  'subhead': 'Some Supplemental Materials',
  'byline': 'By Jonathan Soma'},
 {'title': 'How to Scrape Many Things',
  'subhead': 'But, Is It Even Possible?',
  'byline': 'By Sonathan Joma'},
 {'title': 'The End of Scraping',
  'subhead': "Let's All Use CSV Files",
  'byline': 'By Amos Nathanos'}]

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.
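
For the pandas-only route, a minimal sketch might look like the following. It assumes the table has a header row of `<th>` cells and that an HTML parser such as `lxml` is installed for `pd.read_html`; the HTML string is a stand-in, and in the notebook you would pass the `html` from `page.content()` instead.

```python
from io import StringIO

import pandas as pd

# Stand-in HTML for illustration; assumes a header row of <th> cells
sample_html = """
<table>
  <tr><th>title</th><th>subhead</th><th>byline</th></tr>
  <tr><td>How to Scrape Things</td><td>Some Supplemental Materials</td><td>By Jonathan Soma</td></tr>
  <tr><td>How to Scrape Many Things</td><td>But, Is It Even Possible?</td><td>By Sonathan Joma</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(sample_html))
df_direct = tables[0]
print(df_direct)
```

If the real table has no header cells, the columns will probably come back numbered rather than named, and you would need to rename them yourself.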

In [104]:
import pandas as pd
df = pd.DataFrame(all_elements)
df

|   | title | subhead | byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`

In [105]:
df.to_csv('output.csv',index=False)
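
If you would rather skip pandas for the saving step, the standard library's `csv.DictWriter` can write a list of dictionaries directly. This is only a sketch: `sample_rows` is a hand-typed stand-in for the Part 7 list, and it writes into a temporary folder so it won't overwrite your real `output.csv`.

```python
import csv
import tempfile
from pathlib import Path

# Hand-typed stand-in for the list of dictionaries built in Part 7
sample_rows = [
    {"title": "How to Scrape Things", "subhead": "Some Supplemental Materials", "byline": "By Jonathan Soma"},
    {"title": "How to Scrape Many Things", "subhead": "But, Is It Even Possible?", "byline": "By Sonathan Joma"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "output.csv"

    # newline="" keeps the csv module from writing blank lines on Windows
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "subhead", "byline"])
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read the file back to check that what was saved matches what was scraped
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows_read_back = list(csv.DictReader(f))

print(rows_read_back)
```

Reading the file back is a cheap sanity check that commas and quotes inside titles (like "But, Is It Even Possible?") survived the round trip.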