
Web Scraping #9


Closed
TyJK opened this issue May 31, 2017 · 0 comments

TyJK commented May 31, 2017

Web Scraping

This issue exists to keep our web scraping efforts organized. If you are going to scrape a URL, post it here first so others don't duplicate the work.

Instructions

Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (i.e., the web crawler) navigates a site, and then lets you visually mark the content you want scraped. That pattern is then applied to the site's other pages, and you can define multiple patterns to handle multiple page formats. There ARE likely more efficient and clever ways to scrape, but this is the most accessible method I've found for people without specialized knowledge. If you do have that specialized knowledge, please speak up and make suggestions.
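
For anyone with that background, here is a minimal sketch of the kind of thing Portia automates, using Python with requests and BeautifulSoup. The URL, CSS selector, and output filename are placeholders for illustration, not part of this project:

```python
import json

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com/articles"  # placeholder URL

# Fetch one page; Portia's spider would crawl many pages by following
# the navigation rules you define in its UI.
response = requests.get(START_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for element in soup.select("div.article-body"):  # placeholder selector
    # get_text() keeps text and only text, per the note below.
    records.append({"field1": element.get_text(strip=True)})

# Write one JSON object per line, matching the JSONL export format
# described later in this issue.
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```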

Tutorial

Tutorial Video
Portia Documentation

Important Note
Make SURE that when you have the text highlighted, Portia is scraping text and only text. That way you won't have to worry about it picking up images or other unwanted content.

Also, if you can capture all of your data with a single sample (you can extend a sample by clicking the little four-square icon near the minus sign), do that and name the field field1. Using one standard field name makes cleaning much easier. If that isn't possible, no worries.
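
If a site does force you to use several fields, a small Python sketch like this can fold an export back into the field1 convention afterwards. The filenames, and the choice to simply join all string values, are assumptions for illustration:

```python
import json

# Each line of a JSONL export is one JSON object. This folds a
# multi-field export back into the single-field convention by joining
# every string value into "field1". Filenames are assumptions.
with open("raw_export.jsonl", encoding="utf-8") as src, \
        open("normalized.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        text = " ".join(v for v in record.values() if isinstance(v, str))
        dst.write(json.dumps({"field1": text}) + "\n")
```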

Running the Scraper

It's hard to predict how long a job will run. Scraping a single site can take several hours, depending on its size, so keep that in mind when deciding how many sites to take on. Once the scraper is running, check the log as soon as you can to make sure that, in general, the scraper is doing what you want it to.

Uploading data

One thing that wasn't mentioned in the tutorial (whoops) was how to upload. Once the job has completed, click on the items number, then go to the Export button in the top right. Select "JSONL" and download the file, then upload it to the Data folder.
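
Before uploading, it's worth a quick sanity check that every line of the export parses as JSON. A minimal sketch, assuming the download is named export.jsonl (adjust the filename to whatever you actually downloaded):

```python
import json

path = "export.jsonl"  # assumed filename

valid = 0
with open(path, encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"Bad JSON on line {line_number}: {err}")
        else:
            valid += 1

print(f"{valid} valid records in {path}")
```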

Thank you so much for your contribution!
