## Getting Data from the Web

In this section, we will cover the basics of web scraping, the tools we'll use, and the ethical considerations involved. We will also discuss using APIs when available.

### Web Scraping

Web scraping is the process of extracting data from websites. It involves parsing the HTML structure of a webpage and extracting the desired information. Python provides several libraries for web scraping, such as BeautifulSoup and Scrapy.
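
As a minimal sketch of what "parsing the HTML structure" means in practice, the snippet below feeds a small, made-up HTML document to BeautifulSoup and pulls out a heading and a list of links. The HTML string and the variable names are illustrative assumptions, not content from any real site.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page.
html = """
<html>
  <body>
    <h1>Reading List</h1>
    <ul>
      <li><a href="/books/1">First Book</a></li>
      <li><a href="/books/2">Second Book</a></li>
    </ul>
  </body>
</html>
"""

# Build a parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Extract the text of the first <h1> element.
heading = soup.find("h1").get_text(strip=True)

# Collect (link text, href) pairs for every <a> element.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```

The same two steps, building the tree and then searching it, apply to real pages; only the selectors change.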

### Tools for Web Scraping

1. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): A Python library for parsing HTML and XML documents. It provides a simple and intuitive API for navigating and searching the parsed data.
2. [Scrapy](https://scrapy.org/): A powerful and flexible web scraping framework written in Python. You define spiders that describe how to crawl a site and which data to extract, and Scrapy handles request scheduling and data export for you.

### Ethical Considerations

When web scraping, it is important to consider the ethical implications and legal restrictions. Here are some key points to keep in mind:

- Respect website terms of service: Make sure to review the terms of service of the website you are scraping and comply with any restrictions or guidelines.

- Don't overload the server: Avoid sending too many requests to a website in a short period of time, as it can put a strain on the server and disrupt the website's normal operation.

- Respect privacy: Be mindful of the data you are scraping and ensure that you are not violating any privacy laws or collecting sensitive information without consent.
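
The points above about site guidelines and server load can be sketched in code. The example below is one possible approach, not a complete compliance solution: it checks a URL against `robots.txt` rules (a common way for sites to publish crawling guidelines) using the standard library, and pauses between requests. The helper names `is_allowed` and `polite_fetch_all`, the sample rules, and the one-second default delay are all assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls.

    `fetch` is any callable that downloads one URL; it is passed in so the
    pacing logic can be tried out without touching the network.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Hypothetical robots.txt content, for illustration only.
sample_rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-scraper", "https://example.com/private/data.html"))
print(is_allowed(sample_rules, "my-scraper", "https://example.com/catalogue/index.html"))
```

A suitable delay depends on the site; when in doubt, slower is safer. Note that `robots.txt` does not replace reading the terms of service.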

### Using APIs

Some websites provide APIs (Application Programming Interfaces) that let developers retrieve data in a structured and controlled manner. An API is usually more reliable and efficient than web scraping, because the data arrives in a documented format instead of being picked out of HTML that can change at any time. When an API is available, prefer it over scraping: it is the access route the site's owner has explicitly authorized. In the next section, we will explore how to use APIs to retrieve data from the web.
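
To give a rough idea of the difference, here is a hedged sketch of a JSON API call with the requests library. The helper name `get_json` is an assumption for illustration, and no particular API is implied; real APIs differ in their URLs, parameters, and authentication.

```python
import requests


def get_json(url, params=None, session=None, timeout=10):
    """Send a GET request and return the decoded JSON body.

    `session` may be a requests.Session (or any object with a compatible
    .get method); if omitted, the top-level requests.get is used.
    """
    http = session or requests
    response = http.get(url, params=params, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.json()
```

A call might look like `get_json("https://api.example.com/books", params={"page": 1})`, where the URL and parameter are placeholders. The result is already a Python dictionary or list, so no HTML parsing is needed.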

### Other Tools

Just grabbing data from the web is not enough. We need to store it, process it, and analyze it. For this, we will use the following tools:
* [pandas](https://pandas.pydata.org/): A powerful data manipulation and analysis library for Python. It provides data structures and functions for working with structured data, making it easy to clean, transform, and analyze data.
* [numpy](https://numpy.org/doc/): A fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
* [matplotlib](https://matplotlib.org/): A plotting library for the Python programming language and its numerical mathematics extension, NumPy. It provides a MATLAB-like interface for creating visualizations and plots. 
* [seaborn](https://seaborn.pydata.org/): A data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
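
As a small illustration of the "store and process" step, the sketch below loads a few hand-written records into a pandas DataFrame and cleans the price column. The records are invented for the example and only mimic the shape that scraped data might take.

```python
import pandas as pd

# Hypothetical records, shaped like what a scraper might collect.
records = [
    {"title": "Book A", "price": "£51.77", "availability": "In stock"},
    {"title": "Book B", "price": "£13.99", "availability": "In stock"},
    {"title": "Book C", "price": "£20.00", "availability": "Out of stock"},
]

df = pd.DataFrame(records)

# Strip the currency symbol and convert the price strings to floats.
df["price"] = df["price"].str.replace("£", "", regex=False).astype(float)

# Turn the availability text into a boolean column.
df["in_stock"] = df["availability"].eq("In stock")

print(df[["title", "price", "in_stock"]])
print("Mean price:", round(df["price"].mean(), 2))
```

Once the data is in a DataFrame with numeric columns, plotting it with matplotlib or seaborn is straightforward.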


## Inspecting the HTML Structure of a Webpage
Before we scrape a webpage, we need to understand its structure. We can do this by opening the page in a web browser and using the browser's developer tools to inspect the HTML.

Here are the steps to inspect the HTML structure of a webpage using Google Chrome:

1. Open the webpage in Google Chrome.
2. Right-click on the element you want to inspect and select "Inspect" from the context menu.
3. The developer tools panel will open, showing the HTML structure of the webpage and allowing you to inspect the elements and their attributes.

For the site we will scrape, Books to Scrape (`https://books.toscrape.com/`), we are primarily interested in each book's title, availability, and price.
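
Once the relevant elements are identified in the developer tools, they translate into selectors. The sketch below assumes markup in which each book sits in an `<article class="product_pod">` element with the title in a link's `title` attribute, the price in `p.price_color`, and the availability in a `p` element with the class `availability`. That structure, the sample HTML, and the helper name `parse_books` are assumptions for illustration; confirm the actual tags and class names in your own inspection before relying on them.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the tag and class names are ASSUMED and should be
# checked against what the browser's developer tools actually show.
sample_html = """
<article class="product_pod">
  <h3><a href="example-book_1/index.html" title="An Example Book Title">An Example ...</a></h3>
  <div class="product_price">
    <p class="price_color">£12.34</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""


def parse_books(html):
    """Return a list of dicts with title, price, and availability per book."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        price = article.select_one("p.price_color")
        availability = article.select_one("p.availability")
        books.append({
            # The full title is in the attribute; the link text may be truncated.
            "title": link["title"],
            "price": price.get_text(strip=True),
            "availability": availability.get_text(strip=True),
        })
    return books


print(parse_books(sample_html))
```

Reading the `title` attribute instead of the link text is a deliberate choice here, since visible link text is often shortened with an ellipsis.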

## Installing the prerequisites
We will use the following libraries for web scraping:
* requests
* BeautifulSoup
* pandas

We can install these libraries using pip, the Python package manager. Run the following commands in your terminal or command prompt to install the required libraries:

```bash
pip install requests
pip install beautifulsoup4
pip install pandas
```

### Scraping the data
Let's start by getting the data from the website. We will use the requests library to send an HTTP request to the website and retrieve the HTML content of the page. We will then use BeautifulSoup to parse the HTML and extract the desired information.

In effect, our code pretends to be a web browser. It sends an HTTP request to the website, and the website responds with a status code and, if all goes well, the HTML content of the page. Servers can return many different status codes, but the one we want is 200 (OK), which means the request was successful. Any other status code, or a failure to connect at all, is an error that our code needs to handle.
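
The request-and-check cycle described above could look roughly like the following. This is a minimal sketch: the helper name `fetch_html`, the ten-second timeout, and the choice to return `None` on failure are assumptions, and a real scraper might prefer to retry or raise an exception instead.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page's HTML as text, or None if the request did not succeed.

    `session` may be a requests.Session (or any object with a compatible
    .get method); if omitted, the top-level requests.get is used.
    """
    http = session or requests
    try:
        response = http.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs, and similar failures.
        print(f"Request failed: {exc}")
        return None

    if response.status_code != 200:
        print(f"Unexpected status code {response.status_code} for {url}")
        return None

    return response.text
```

Calling `fetch_html("https://books.toscrape.com/")` would then give either the page's HTML or `None`, so the parsing step can skip pages that failed to download.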

```python
import requests
from bs4 import BeautifulSoup

base_url = 'https://books.toscrape.com/'
path = 'catalogue/category/books_1/'
page = 'index.html'

# We separate out the components of the full URL to allow us to adjust the page number
# if we decide to scroll through the follow-on pages or categories
full_url = base_url + path + page
```
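
Keeping the URL components separate pays off when moving to later pages. The sketch below assumes that the first page is `index.html` and that follow-on pages are named `page-2.html`, `page-3.html`, and so on; that naming pattern and the helper name `page_url` are assumptions to verify against the site's own "next" links.

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"
path = "catalogue/category/books_1/"


def page_url(page_number):
    """Build the URL for a given page of the listing.

    ASSUMPTION: page 1 is index.html and later pages are page-N.html.
    """
    page = "index.html" if page_number == 1 else f"page-{page_number}.html"
    # urljoin resolves each piece relative to the one before it.
    return urljoin(urljoin(base_url, path), page)


for number in range(1, 4):
    print(page_url(number))
```

Using `urljoin` instead of plain string concatenation avoids doubled or missing slashes if the components are edited later.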


