A Guide for Crawling the Web with Python

by TeeJ

Introduction

Sometimes the easiest way to gather data is scraping the web!

Scraping Amazon

Lucky for you, I have built some semi-professional scripts (there is plenty of room for improvement) that scrape the data returned by Amazon searches. To scrape Amazon for all the results of a given search, such as "Data Science", follow the shell commands below:

# set up the environment (only needs to be run once)
sudo git clone "https://github.com/teejl/webscrape_guide.git"
cd webscrape_guide
sudo chmod u+x webscrape.init
./webscrape.init

# run a search (can be re-run after init with different queries)
sudo python3 amznscrape.py "Data Science" 10

The Raspberry Pi will scrape the results of a "Data Science" search on Amazon, retrying up to 10 times if a connection to the page cannot be established, and save the results in the data folder.
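The retry behaviour described above can be sketched like this. This is a minimal illustration, not the actual code from amznscrape.py; the function name and the injectable `opener` parameter are my assumptions:

```python
import time
import urllib.request
import urllib.error


def fetch_with_retries(url, max_retries, delay=2.0, opener=urllib.request.urlopen):
    """Try to download `url`, retrying up to `max_retries` times.

    `opener` is injectable so the retry logic can be exercised without
    a network. Returns the page bytes, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt < max_retries:
                time.sleep(delay)  # back off before the next attempt
    return None
```

Waiting between attempts keeps a flaky connection from hammering the site, which matters when scraping politely.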

Automating many searches:

sudo chmod u+x data.load
sudo nano data.load # add the searches you want to find
./data.load
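What goes inside data.load is up to you; a plausible sketch (the loop and the query list are my assumptions, but the "query, retry-count" interface matches the single-search example above) runs the scraper once per query. Shown here with `echo` as a dry run; drop the `echo` to actually launch the scraper:

```shell
# Hypothetical data.load contents: one scraper invocation per query.
for query in "Data Science" "Playstation 3"; do
    echo sudo python3 amznscrape.py "$query" 10
done
```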

Potential Improvements

The skip_scrape function within the amznscrape.py file can be improved: it is currently tailored mainly to books. I also believe the algorithm could be made more efficient by attempting connections to all of the result pages at once instead of retrying them one by one. I deliberately did not implement that, to keep the request rate friendly to Amazon.
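The "fetch everything at once" improvement could look roughly like the sketch below. This is hypothetical, not code from the repo; `fetch` is a caller-supplied function, and the small stagger between submissions is one way to keep the request rate modest, per the friendliness concern above:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4, polite_delay=0.5):
    """Fetch every URL concurrently instead of retrying one by one.

    `fetch` maps a URL to its page content. Submissions are staggered by
    `polite_delay` seconds to avoid bursting requests at the server.
    Failed fetches are recorded as None so a later pass can retry them.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            time.sleep(polite_delay)  # stagger submissions
        for url, fut in futures.items():
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # mark failures for a retry pass
    return results
```

Collecting failures instead of raising lets one slow or dead page not block the rest of the batch.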
