GitHub - Decodo/Python-scraper-tutorial at bddf005542b46406ed440c0ac7257edc69f33414

Disclaimer

The following tutorial is meant for educational purposes and introduces the basics of web scraping and of using Smartproxy for it. We suggest researching the Requests and BeautifulSoup documentation in order to build upon the given example.

Prerequisites

To run our example scraper, you are going to need these libraries: Requests and BeautifulSoup 4.

Introduction

If you’re here, that means you are interested in finding out more about how to scrape and how to make use of the data you gather. However, before we dive into it, we first need to understand what web scraping is. In general terms, scraping is the process of acquiring a web page with all of its information and then extracting selected fields for further processing. Usually, the purpose of gathering that information is to make it easy to monitor. Some examples are reviews, prices, weather reports, billboard hits, and so on.

Be polite

Just as you are polite and caring in the real world, you should be the same online. Before you start scraping, make sure that the website you’re targeting allows it. You can do that by checking its robots.txt file. If the site doesn’t condone crawling or scraping of its content, be kind and respect the owner’s wishes. Failing to do so might get your IP blocked or even lead to legal action against you, so be wary. Moreover, check if the site you’re targeting has an API. If it does, just use that – it will be easier to get the needed data, and you won’t put unnecessary load on the site’s infrastructure.
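The robots.txt check described above can be automated. The following is a minimal sketch using only Python's standard library; the helper name `is_allowed` and the sample rules are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "https://example.com/catalogue/"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/data.html"))
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which download the file for you before you call `can_fetch()`.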

Let’s get to it

In the following tutorial, you will not only see how a basic scraper is written but will also learn how to adjust it to your own needs. Moreover, you will learn how to do it via a proxy!
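As a rough preview of the proxy part, here is a sketch of how a request can be routed through a proxy with Requests, which accepts a `proxies` mapping on its request functions. The helper names `build_proxies` and `fetch_via_proxy`, the host `proxy.example.com`, the port, and the credentials are all placeholders assumed for illustration; use the endpoint and credentials from your own proxy provider.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Build the mapping that Requests accepts as its `proxies` argument."""
    # Percent-encode the credentials so characters like '@' or ':' do not break the URL
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    # Use the same proxy for both plain HTTP and HTTPS targets
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url: str, proxies: dict, timeout: float = 10.0) -> str:
    """Download url through the given proxies and return the response body as text."""
    # Third-party import kept local so build_proxies() works without Requests installed
    import requests

    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

A call would then look like `fetch_via_proxy("http://books.toscrape.com/", build_proxies("user", "pass", "proxy.example.com", 8080))`, with the placeholder values swapped for real ones.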

As mentioned, we will be using these libraries: Requests and BeautifulSoup 4.

The page we’re going to scrape is http://books.toscrape.com/. It doesn’t have robots.txt, but I think we can agree that the name of the site is asking you to scrape it. But before we carry on with the coding part, let’s inspect the website first.

Inspecting the site

The main page of the website contains books with their titles, prices, ratings, and availability information, plus a list of genres in the sidebar.
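To make the fields listed above concrete, here is a hedged sketch of pulling them out of a page with BeautifulSoup. The tag and class names (`article.product_pod`, `p.price_color`, `p.availability`, `p.star-rating`) are assumptions about the site's markup, and the sample HTML is invented; confirm the real structure in your browser's inspector before relying on it.

```python
from bs4 import BeautifulSoup


def parse_books(html: str) -> list:
    """Extract title, price, availability, and rating from each book card in html."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # Assumed markup: one <article class="product_pod"> per book,
    # each containing all four elements selected below
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        rating_classes = card.select_one("p.star-rating")["class"]
        books.append({
            "title": link["title"],
            "price": card.select_one("p.price_color").get_text(strip=True),
            "availability": card.select_one("p.availability").get_text(strip=True),
            # The rating is assumed to be the second class name, e.g. "Three"
            "rating": [c for c in rating_classes if c != "star-rating"][0],
        })
    return books


# Invented sample that mimics the assumed structure
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="book_1/index.html" title="Example Book">Example ...</a></h3>
  <p class="price_color">£10.00</p>
  <p class="instock availability">In stock</p>
</article>
"""

print(parse_books(SAMPLE_HTML))
```

In a real run, the `html` argument would be the text of a response fetched with Requests rather than a hard-coded string.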
