Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
This issue is primarily to ensure organization of any web scraping efforts. If you are going to try to scrape a URL, mention which one it is so others don't do the same.
Sign up for Portia, a free, visual web scraping tool. Portia lets you set up simple rules for how the spider (aka web crawler) will navigate the site, and then lets you visually mark what content you want to scrape. This pattern will then be utilized on other pages. Multiple patterns can be given to ensure proper scraping across multiple page formats. There ARE likely more efficient and clever methods of scraping, but this is the most feasible I've found that people who don't have any specialized knowledge will be able to use. If you have any of that specialized knowledge, please feel free to speak up and make suggestions.
Also, if you are able to get all your data with only one sample (you can add to the sample by clicking the little four square icon near the minus sign), do that and name it field1. This provides a standard and makes cleaning easier. If this isn't possible though, no worries.
Running the Scraper
It's hard to tell how long the process will run for. It can take several hours to scrape one site, depending on its size, so keep that in mind when deciding how many sites you'll scrape. Once the scraper is running, it's a good idea to check the log as soon as you can to make sure that, in general, the scraper is doing what you want it to.
One thing that wasn't mentioned in the tutorial (woops) was how to upload. Click on the items number once it's completed, and then go to the Export button in the top right. Select "JSONL" and download the file. Then upload it to the Data folder when finished.
Thank you so much for your contribution!