A Guide for Crawling the Web with Python

by TeeJ

Introduction

Sometimes the easiest way to gather data is scraping the web!

Scraping Amazon

Lucky for you, I have built some semi-professional scripts (there is plenty of room for improvement) that scrape the data returned by Amazon searches. To scrape Amazon for all the results of a given search, such as "Data Science", follow the shell commands below:

# set up the environment (only needs to be run once)
sudo git clone "https://github.com/teejl/webscrape_guide.git"
cd webscrape_guide
sudo chmod u+x webscrape.init
./webscrape.init

# run a search (can be re-run after init with different queries)
sudo python3 amznscrape.py "Data Science" 10

The Raspberry Pi will scrape the results of a "Data Science" search on Amazon, retrying up to 10 times if a connection to the page cannot be established, and save the results in the data folder.
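The retry behaviour described above can be sketched like this. This is a minimal illustration, not the actual code from amznscrape.py; the function name and the injectable `opener` parameter are my assumptions:

```python
import time
import urllib.request
import urllib.error


def fetch_with_retries(url, max_retries, delay=2.0, opener=urllib.request.urlopen):
    """Try to download `url`, retrying up to `max_retries` times.

    `opener` is injectable so the retry logic can be exercised without
    a network. Returns the page bytes, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt < max_retries:
                time.sleep(delay)  # back off before the next attempt
    return None
```

Waiting between attempts keeps a flaky connection from hammering the site, which matters when scraping politely.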

Automating many searches:

sudo chmod u+x data.load
sudo nano data.load # add the searches you want to find
./data.load
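What goes inside data.load is up to you; a plausible sketch (the loop and the query list are my assumptions, but the "query, retry-count" interface matches the single-search example above) runs the scraper once per query. Shown here with `echo` as a dry run; drop the `echo` to actually launch the scraper:

```shell
# Hypothetical data.load contents: one scraper invocation per query.
for query in "Data Science" "Playstation 3"; do
    echo sudo python3 amznscrape.py "$query" 10
done
```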

Potential Improvements

The skip_scrape function within the amznscrape.py file can be improved: it is currently tailored mainly to books. I also believe the algorithm could be made more efficient by attempting connections to all of the result pages at once instead of retrying them one by one. I deliberately did not implement that, to keep the request rate friendly to Amazon.
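The "fetch everything at once" improvement could look roughly like the sketch below. This is hypothetical, not code from the repo; `fetch` is a caller-supplied function, and the small stagger between submissions is one way to keep the request rate modest, per the friendliness concern above:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4, polite_delay=0.5):
    """Fetch every URL concurrently instead of retrying one by one.

    `fetch` maps a URL to its page content. Submissions are staggered by
    `polite_delay` seconds to avoid bursting requests at the server.
    Failed fetches are recorded as None so a later pass can retry them.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            time.sleep(polite_delay)  # stagger submissions
        for url, fut in futures.items():
            try:
                results[url] = fut.result()
            except Exception:
                results[url] = None  # mark failures for a retry pass
    return results
```

Collecting failures instead of raising lets one slow or dead page not block the rest of the batch.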
