Scraping thoughts - updates, things to know #918

Open
JoshCheek opened this issue Jul 29, 2014 · 0 comments

Resources

  • Repo where I made the exercise
  • Initial tutorial, with Mechanize, the Markdown and the HTML
  • Updated tutorial, with Capybara, the Markdown, and the HTML

Notes from second time teaching (Capybara)

  • Maybe rename the topic to just "Automating the web"; Capybara is useful for far more than just scraping.
  • Pry fucked up for 3 or 4 students, IDK why, but their console input and output just became invisible. Spent an hour or two trying to debug it after the class, documented my failure here.
  • Try the Denver Post example again. If it works fine, we can keep it, but I hadn't found the config option to turn off the js errors at that point, so that example wound up crashing Capybara for a lot of people.
  • I had to change up the lesson, switch it over to use Amazon instead of isbnsearch.org, whose database suddenly seems empty -.-
  • On Amazon, there's an intermediate page of results, and they have to figure out how to click the link. This tripped a lot of them up, but I think it's good to have that in there. Maybe add something like this to the "let's do it together" portion. I got around it with browser.click_on browser.find('#resultsCol h3 a').text; IDK if there's a better way.
  • If Pry and Capybara hadn't kept fucking up, doing everything in pry would have been a good exercise, but with those failures, it was probably much harder. I'm hoping that turning off the js errors is good enough; everyone's problems dropped dramatically after that.
  • Now that we have Phantom, we can expand the exercise. This was maybe the fourth one I tried, and it was chosen b/c it didn't require js. But other things might be more interesting, IDK.
  • I spent the last 15 minutes doing it for them in class. Didn't finish, but this is what I came up with:
# Setup poltergeist
require 'json' # JSON.dump is used below when saving results
require 'capybara/poltergeist'
Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end
Capybara.default_driver = :poltergeist
browser = Capybara.current_session

# ISBNs to look up on Amazon (some appear more than once)
isbns = %w[
  1405232501 082172388X 0764222228
  0590474235 0672320835 0439539439
  0375434461 0752859978 0752860224
  2745945475 0425032337 074459040X
  1860393225 1405232501 082172388X
  0764290762 0590474235 0672320835
  3826672429 0375434461 0752859978
  038079392X 2745945475 0425032337
  0701184361 1860393225 0758238614
  0152049215 3826672429 1921656573
  0747203873 0701184361 0764222228
  0439539439 0152049215 0752860224
  074459040X 0747203873 0764290762
  038079392X 1921656573 0758238614
]

isbns.each do |isbn|
  # TODO: Skip if I already pulled this
  browser.visit 'http://amazon.com'
  browser.fill_in 'field-keywords', with: isbn
  browser.click_button 'Go'
  browser.click_on browser.find('#resultsCol h3 a').text

  lis   = browser.all('#detail-bullets h3 li')
  texts = lis.map { |li| li.text }
  # Split on the first colon only, and skip bullets that aren't "key: value"
  data  = texts.map { |text| text.split(":", 2) }
               .select { |pair| pair.size == 2 }
               .each_with_object({}) { |(key, value), attributes| attributes[key.strip.downcase] = value.strip }
  File.open('isbn-results', 'a') { |f| f.puts JSON.dump(data) }
end
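
One way the "TODO: Skip if I already pulled this" could be handled, as a rough sketch rather than what was done in class: save the ISBN alongside the scraped attributes, then read the results file back before each run. The helper names (parse_bullets, seen_isbns, save_result) and the "isbn" key are made up for illustration; only the 'isbn-results' filename and the one-JSON-object-per-line layout come from the script above.

```ruby
require 'json'
require 'set'

# Turn strings like "Publisher: Penguin" into { "publisher" => "Penguin" }.
# Only the first colon separates key from value; lines without a colon are skipped.
def parse_bullets(texts)
  texts.each_with_object({}) do |text, attributes|
    key, value = text.split(':', 2)
    next if value.nil?
    attributes[key.strip.downcase] = value.strip
  end
end

# ISBNs already present in the results file (one JSON object per line).
# Assumes each record was saved with the hypothetical "isbn" key below.
def seen_isbns(path)
  return Set.new unless File.exist?(path)
  File.foreach(path)
      .reject { |line| line.strip.empty? }
      .map    { |line| JSON.parse(line)['isbn'] }
      .compact
      .to_set
end

# Append one record, tagging it with the ISBN that was searched for.
def save_result(path, isbn, attributes)
  File.open(path, 'a') { |f| f.puts JSON.dump(attributes.merge('isbn' => isbn)) }
end

# Inside the scraping loop this might look like:
#
#   done = seen_isbns('isbn-results')
#   isbns.uniq.each do |isbn|
#     next if done.include?(isbn)
#     # ... visit Amazon, collect `texts` as above ...
#     save_result('isbn-results', isbn, parse_bullets(texts))
#   end
```

Reading the file back on startup means a crashed run (which happened a lot with the js errors on) can be restarted without re-fetching books it already has.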

Notes from first time teaching (Mechanize)

What I would change next time:

  • Switch to Capybara / Poltergeist / Phantom.js
    Which will open up many more possibilities.
    I approximately figured out how to do it, here: https://gist.github.com/JoshCheek/1ef1c6fbe7ff7ee28de4#file-using_capybara_with_poltergeist_to_get_the_data-rb
    but haven't updated the material yet.
  • Remove section on Scripts (doesn't add anything)
  • Switch from open-uri to RestClient
    not common to use open-uri in prod, plus it monkey-patches Kernel#open...
    though, to be fair, Kernel#open isn't something you'd use in maintainable code,
    it really is intended for scripts

My plan going in this last time

1st Hour

  Learning Objectives
    Understand the internet
      if we do, we could create such a tool
    Scraping with Nokogiri
      for the clone wars project
    Increase familiarity with pry and CSS selectors

  Imagine
    What could you do with such a tool?

  Goals (we'll do all of these together except the last one)
    GET a webpage with Nokogiri
    Find all the page's methods that deal with links
    Look at its links
    Select the links we want, click them
    Look at its forms
    Select the forms we want, click them
    Use this information to take the list of ISBNs and extract the book data

  Given that you know this

2nd Hour

  Work on the project
  I'll be around if anyone has questions

I ran out of time and did not go through the Mechanize portion in depth; if I had, they wouldn't have gotten to play with it themselves.

I also had them think about what they could use such a tool to do, hoping to spike their imaginations so they would have some context or hypothetical goals in mind when we went through it. IDK if this was valuable or not.

Rachel's feedback (mostly applies to my teaching style)

  Good:
    * giving students time to catch up
    * "cold-calling" student to explain what's happening
    * having students think about the thing, then doing it
  Bad:
    * cohesiveness between example and result (pressing return too quickly after typing)
      - throw a semicolon on the end so I can try it out
      - if I'm going to go off exploring, tell them we're exploring, not following
    * confusion between why we have a text file and a pry session
      - start in pry, then as we show that something does what we think,
        copy/paste it into the editor, so it feels like we're exploring and learning
        and then aggregating our findings into a file that we can then reference
        and use later on.