# Scraping basics for Selenium

If you feel comfortable with scraping, you're free to skip this notebook.

## Part 0: Imports

Import what you need to use Selenium, and start up a new Chrome to use for scraping. You might want to copy from the [Selenium snippets](http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-snippets/) page.

**You only need to run `driver = webdriver.Chrome(...)` once.** Every time you run it, you'll open a new Chrome instance. You'll only need to run it again if you close the window (or want another Chrome, for some reason).
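
One way to avoid opening extra Chrome windows by accident is to wrap the launch in a small helper that creates the driver only on first use. This is a minimal sketch rather than part of the original exercise: `get_driver`, `make_driver`, and `fake_factory` are hypothetical names, and the factory is passed in so the pattern can be tried without a browser.

```python
_driver = None  # module-level cache for the one shared driver


def get_driver(make_driver):
    """Return the cached driver, calling make_driver() only the first time."""
    global _driver
    if _driver is None:
        _driver = make_driver()
    return _driver


# Stand-in factory for demonstration (an assumption); in the notebook the
# factory would be whatever call launches Chrome.
launches = []


def fake_factory():
    launches.append("launched")
    return object()


first = get_driver(fake_factory)
second = get_driver(fake_factory)
print(first is second, len(launches))
```

Both calls return the same object, and the factory runs only once.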

In [182]:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; hu-HU; rv:1.7.8) Gecko/20050511 Firefox/1.0.4'}

In [183]:
import pandas as pd

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager

In [184]:
# Launch a new Chrome, installing a matching driver if one is not already cached.
# Selenium 4 expects the driver path to be wrapped in a Service object.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [/Users/richardabbey/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


## Part 1: Scraping by class

Scrape the content at http://jonathansoma.com/lede/static/by-class.html, printing out the title, subhead, and byline.

In [185]:
driver.get("http://jonathansoma.com/lede/static/by-class.html")

In [186]:
# Use the class name to find the title
driver.find_element(By.CLASS_NAME, "title").text

'How to Scrape Things'

In [187]:
# Use the class name to find the subhead
driver.find_element(By.CLASS_NAME, "subhead").text

'Some Supplemental Materials'

In [188]:
# Use the class name to find the byline
driver.find_element(By.CLASS_NAME, "byline").text

'By Jonathan Soma'

## Part 2: Scraping using tags

Scrape the content at http://jonathansoma.com/lede/static/by-tag.html, printing out the title, subhead, and byline.

In [189]:
# Load the page for this part, then find the title by its tag name
driver.get("http://jonathansoma.com/lede/static/by-tag.html")
driver.find_element(By.TAG_NAME, "h1").text


'How to Scrape Things'

In [190]:
driver.find_element(By.TAG_NAME, "h3").text


'Some Supplemental Materials'

In [191]:
driver.find_element(By.TAG_NAME, "p").text


'By Jonathan Soma'

## Part 3: Scraping using a single tag

Scrape the content at http://jonathansoma.com/lede/static/by-list.html, printing out the title, subhead, and byline.

> **This will be important for the next few:** if you scrape multiples, you have a list. Even though it's Selenium, you can use things like `[0]`, `[1]`, `[-1]`, etc., just like you would for a normal list.

In [192]:
# Load the page for this part, then grab the single <body> tag and print its text
driver.get("http://jonathansoma.com/lede/static/by-list.html")
table = driver.find_elements(By.TAG_NAME, "body")
for row in table:
    print(row.text)

How to Scrape Things
Some Supplemental Materials
By Jonathan Soma
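
The note about list indexing can be tried without a browser. The sketch below uses plain strings as stand-ins for the `.text` values of scraped elements; the variable names are illustrative assumptions. Since `find_elements` returns an ordinary Python list, the same indexing applies to its results.

```python
# Stand-in for the .text values of elements returned by find_elements();
# real results would be WebElement objects, but the list indexes the same way.
texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

title = texts[0]     # first item
subhead = texts[1]   # second item
byline = texts[-1]   # last item, counted from the end

print(title)
print(subhead)
print(byline)
```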


## Part 4: Scraping a single table row

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, printing out the title, subhead, and byline.

In [193]:
# Reuse the Chrome launched in Part 0; relaunching would only open another window
driver.get("http://jonathansoma.com/lede/static/single-table-row.html")


In [194]:
table = driver.find_elements(By.TAG_NAME, "tr")


In [195]:
for row in table:
    print(row.text)

How to Scrape Things Some Supplemental Materials By Jonathan Soma


## Part 5: Saving into a dictionary

Scrape the content at http://jonathansoma.com/lede/static/single-table-row.html, saving the title, subhead, and byline into a single dictionary called `book`.

> Don't use pandas for this one!

In [196]:
# Name the parser explicitly so BeautifulSoup does not have to guess one
doc = BeautifulSoup(driver.page_source, "html.parser")

In [197]:
# Select the three cells of the single row
# and store their text in one dictionary called book.
# No loop is needed: there is only one row, so there is only one dictionary.

cells = doc.select("tbody tr td")

book = {}
book['Title'] = cells[0].text
book['Subhead'] = cells[1].text
book['Byline'] = cells[2].text

book

{'Title': 'How to Scrape Things',
 'Subhead': 'Some Supplemental Materials',
 'Byline': 'By Jonathan Soma'}
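
As a possible shortcut, the same dictionary can be built by pairing a list of keys with the cell text using `zip`. This is a sketch under assumptions: `cell_texts` is a hand-written stand-in for the text of the three scraped cells, so the snippet runs without a browser.

```python
# Stand-in for the text of the three scraped cells in the single row
cell_texts = ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"]

keys = ["Title", "Subhead", "Byline"]
book = dict(zip(keys, cell_texts))  # pair each key with the cell in the same position

print(book)
```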



## Part 6: Scraping multiple table rows

Scrape the content at http://jonathansoma.com/lede/static/multiple-table-rows.html, printing out each title, subhead, and byline.

> You won't use pandas for this one, either!

In [200]:
# Reuse the Chrome launched in Part 0; relaunching would only open another window
driver.get("http://jonathansoma.com/lede/static/multiple-table-rows.html")


In [201]:
driver.page_source

"<html><head></head><body><table>\n  <tbody><tr>\n    <td>How to Scrape Things</td>\n    <td>Some Supplemental Materials</td>\n    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n    <td>How to Scrape Many Things</td>\n    <td>But, Is It Even Possible?</td>\n    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n    <td>The End of Scraping</td>\n    <td>Let's All Use CSV Files</td>\n    <td>By Amos Nathanos</td>\n  </tr>\n</tbody></table></body></html>"

In [202]:
table1 = driver.find_elements(By.TAG_NAME, "tr")

In [203]:
# Loop through every row (tr), collect its cells (td),
# and print the text of each cell with a label
for row in table1:
    book2 = row.find_elements(By.TAG_NAME, "td")
    print("_____")
    print("Title:", book2[0].text)
    print("Subhead:", book2[1].text)
    print("Byline:", book2[2].text)
 


_____
Title: How to Scrape Things
Subhead: Some Supplemental Materials
Byline: By Jonathan Soma
_____
Title: How to Scrape Many Things
Subhead: But, Is It Even Possible?
Byline: By Sonathan Joma
_____
Title: The End of Scraping
Subhead: Let's All Use CSV Files
Byline: By Amos Nathanos
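
The same row-then-cell grouping can be explored without a browser by feeding a copy of the page source to Python's built-in `html.parser`. This is only a sketch: `RowCollector` is a hypothetical helper, not something Selenium provides, and the markup is pasted in by hand from the `driver.page_source` output above.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> it belongs to."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the <tr> currently being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data  # text may arrive in more than one chunk


# Hand-pasted copy of the page source shown earlier in this part
page_source = (
    "<html><head></head><body><table>\n  <tbody><tr>\n"
    "    <td>How to Scrape Things</td>\n"
    "    <td>Some Supplemental Materials</td>\n"
    "    <td>By Jonathan Soma</td>\n  </tr>\n  <tr>\n"
    "    <td>How to Scrape Many Things</td>\n"
    "    <td>But, Is It Even Possible?</td>\n"
    "    <td>By Sonathan Joma</td>\n  </tr>\n  <tr>\n"
    "    <td>The End of Scraping</td>\n"
    "    <td>Let's All Use CSV Files</td>\n"
    "    <td>By Amos Nathanos</td>\n  </tr>\n"
    "</tbody></table></body></html>"
)

parser = RowCollector()
parser.feed(page_source)
parser.close()

for row in parser.rows:
    print(row)
```

Each printed row is a list of three strings, mirroring what `row.find_elements(By.TAG_NAME, "td")` yields per `tr` in the Selenium loop.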


## Part 7: Scraping an actual table

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a list of dictionaries.

> Don't use pandas here, either!

In [204]:
# Reuse the Chrome launched in Part 0; relaunching would only open another window
driver.get("http://jonathansoma.com/lede/static/the-actual-table.html")


In [205]:
booklist = driver.find_element(By.ID, 'booklist')
new_book = booklist.find_elements(By.TAG_NAME, "tr")

In [206]:
# Start with an empty list, book_data.
# For each row, create an empty dictionary and fill it with the text of the row's cells.
# Then append the dictionary to book_data.

book_data = []

for row in new_book:
    book_dict = {}
    cells = row.find_elements(By.TAG_NAME, 'td')
    book_dict['Title'] = cells[0].text
    book_dict['Subhead'] = cells[1].text
    book_dict['Byline'] = cells[2].text
    book_data.append(book_dict)

book_data
    

[{'Title': 'How to Scrape Things',
  'Subhead': 'Some Supplemental Materials',
  'Byline': 'By Jonathan Soma'},
 {'Title': 'How to Scrape Many Things',
  'Subhead': 'But, Is It Even Possible?',
  'Byline': 'By Sonathan Joma'},
 {'Title': 'The End of Scraping',
  'Subhead': "Let's All Use CSV Files",
  'Byline': 'By Amos Nathanos'}]
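
Indexing `cells[0]`, `cells[1]`, and `cells[2]` raises an `IndexError` for any row that has fewer than three `td` cells, for example a header row built from `th` tags. The page here evidently has no such row, since the loop ran cleanly, but other tables might. The sketch below shows one way to guard against it; `rows_to_records` and the sample `rows` are hypothetical, with plain lists of strings standing in for each row's cell text.

```python
KEYS = ["Title", "Subhead", "Byline"]


def rows_to_records(rows):
    """Turn lists of cell text into dictionaries, skipping rows of the wrong width."""
    records = []
    for cells in rows:
        if len(cells) != len(KEYS):
            continue  # e.g. a header row that contains no <td> cells
        records.append(dict(zip(KEYS, cells)))
    return records


# Stand-in data: one list of cell text per table row
rows = [
    [],  # hypothetical header row: it has <th> cells, so no <td> text
    ["How to Scrape Things", "Some Supplemental Materials", "By Jonathan Soma"],
    ["The End of Scraping", "Let's All Use CSV Files", "By Amos Nathanos"],
]

records = rows_to_records(rows)
print(len(records))
print(records[0]["Title"])
```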

## Part 8: Scraping multiple table rows into a list of dictionaries

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html, creating a pandas DataFrame.

> There are two ways to do this one! One uses just pandas, the other one uses the result from Part 7.

In [207]:
# Start with an empty list, book_data.
# For each row, create an empty dictionary and fill it with the text of the row's cells.
# Then append the dictionary to book_data.
# The resulting list is converted into a DataFrame with df = pd.DataFrame(book_data)

book_data = []

for row in new_book:
    book_dict = {}
    cells = row.find_elements(By.TAG_NAME, 'td')
    book_dict['Title'] = cells[0].text
    book_dict['Subhead'] = cells[1].text
    book_dict['Byline'] = cells[2].text
    book_data.append(book_dict)

book_data
    

[{'Title': 'How to Scrape Things',
  'Subhead': 'Some Supplemental Materials',
  'Byline': 'By Jonathan Soma'},
 {'Title': 'How to Scrape Many Things',
  'Subhead': 'But, Is It Even Possible?',
  'Byline': 'By Sonathan Joma'},
 {'Title': 'The End of Scraping',
  'Subhead': "Let's All Use CSV Files",
  'Byline': 'By Amos Nathanos'}]

In [208]:
df = pd.DataFrame(book_data)

In [209]:
df

| | Title | Subhead | Byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |


## Part 9: Scraping into a file

Scrape the content at http://jonathansoma.com/lede/static/the-actual-table.html and save it as `output.csv`
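
pandas is not strictly required for this step. As an alternative sketch, the standard library's `csv.DictWriter` can write a list of dictionaries directly. The sample `book_data` below is typed in by hand, and an in-memory buffer stands in for the real file, so the snippet runs anywhere.

```python
import csv
import io

# Hand-written sample in the same shape as the scraped list of dictionaries
book_data = [
    {"Title": "How to Scrape Things", "Subhead": "Some Supplemental Materials", "Byline": "By Jonathan Soma"},
    {"Title": "How to Scrape Many Things", "Subhead": "But, Is It Even Possible?", "Byline": "By Sonathan Joma"},
]

buffer = io.StringIO()  # in-memory stand-in for open("output.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["Title", "Subhead", "Byline"])
writer.writeheader()          # first line: the column names
writer.writerows(book_data)   # one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes the subhead that contains a comma, so the file still reads back as three columns, and no index column is written.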

In [210]:
# Rebuild the list of dictionaries, convert it to a DataFrame, and write it to output.csv.
# index=False leaves out the DataFrame's row numbers, which would otherwise
# come back as an extra "Unnamed: 0" column when the file is read again.

book_data = []

for row in new_book:
    book_dict = {}
    cells = row.find_elements(By.TAG_NAME, 'td')
    book_dict['Title'] = cells[0].text
    book_dict['Subhead'] = cells[1].text
    book_dict['Byline'] = cells[2].text
    book_data.append(book_dict)

df = pd.DataFrame(book_data)
df.to_csv("output.csv", index=False)

In [211]:
pd.read_csv("output.csv")

| | Title | Subhead | Byline |
|---|---|---|---|
| 0 | How to Scrape Things | Some Supplemental Materials | By Jonathan Soma |
| 1 | How to Scrape Many Things | But, Is It Even Possible? | By Sonathan Joma |
| 2 | The End of Scraping | Let's All Use CSV Files | By Amos Nathanos |
