
# Web Scraping Tutorial

## Table of Contents

1. [Introduction to Web Scraping](#introduction)
2. [Web Scraping Ethics and Legal Considerations](#ethics)
3. [Python Libraries for Web Scraping](#libraries)
4. [Setting up the Environment](#setup)
5. [Basics of HTML](#html)
6. [Using Beautiful Soup to Parse HTML](#beautifulsoup)
7. [Example Project: Wikipedia](#example)

---
        


## 1. Introduction to Web Scraping <a name="introduction"></a>

Web scraping is a method used to extract data from websites. This is done by making a request to the web server for the page's HTML, which is then parsed to extract the desired data. Web scraping is often used for data mining, data analysis, testing, and many other applications.


## 2. Web Scraping Ethics and Legal Considerations <a name="ethics"></a>

Before starting a web scraping project, it's important to understand the ethical
and legal considerations. Not all websites allow web scraping, and those that do
often limit the rate at which you may send requests. The data you scrape may
also be protected by copyright. Always check a website's `robots.txt` file and
terms of service before scraping.

The guidelines below expand on each of these points.

#### Respect Copyright Laws

Web content is often protected by copyright laws. Using this content without permission can lead to legal consequences. Always ensure that the data you are scraping is not copyrighted, or that you have permission to use it.

#### Read the Terms of Service

Before scraping a website, make sure to read its Terms of Service (ToS). Some websites explicitly forbid web scraping in their ToS. Disregarding this can lead to being banned from the site or even legal action.

#### Respect Robots.txt

`robots.txt` is a file that websites publish to tell crawlers, such as search engines, which pages they may visit. It can also indicate which parts of the site the owners prefer not to be scraped. While not legally binding, it's considered good web etiquette to respect the instructions in `robots.txt`.
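
As a minimal sketch of how such a check could look, Python's standard library includes `urllib.robotparser`. The `robots.txt` content, the `my-scraper` user agent name, and the `example.com` URLs below are made up for illustration, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Crawl-delay is optional; crawl_delay() returns None when it is absent.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

For a live site, you would typically call `parser.set_url(...)` with the address of the site's `robots.txt` and then `parser.read()` instead of parsing a string.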

#### Don't Overload the Server

Sending too many requests to a website in a short amount of time can overload the server, which can slow down or disrupt service for other users. Be considerate of how many requests you send and try to limit your scraping to off-peak times.
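
One simple way to limit the request rate is to pause between consecutive requests. The sketch below assumes a fixed delay is acceptable for the target site; `fetch_politely` and `fake_fetch` are hypothetical helper names, and the fetch function is a stand-in so the example runs without network access.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

With the `requests` library, the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`. If the site's `robots.txt` specifies a crawl delay, that value is a reasonable choice for `delay_seconds`.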

#### Be Aware of Privacy Issues

Be mindful of privacy issues when scraping personal data. Laws such as the European General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA) have strict rules about how personal data can be collected and used.

### Conclusion

Web scraping is a powerful tool, but it's important to use it responsibly. Always respect the legal rights and privacy of others, and strive to minimize your impact on the services you scrape.

Remember, this guide is not legal advice, and laws may vary by location and over
time. If you're unsure about the legality of your web scraping project, it's
always a good idea to consult with a legal expert.

## 3. Python Libraries for Web Scraping <a name="libraries"></a>

There are many libraries available in Python to perform web scraping. Some of the most commonly used ones are:

- **Requests**: This library allows you to send HTTP requests and handle the response in Python.

- **Beautiful Soup**: This library is used for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data in a hierarchical and readable manner.

- **Selenium**: Selenium is mainly used for testing in the industry, but it can also be used for web scraping. It provides a way to control a web browser with code, which is useful for interacting with JavaScript-based websites.

- **Scrapy**: Scrapy is a powerful and versatile web scraping library, but it can be a bit complex for beginners. It's used for more extensive web scraping projects.

In this tutorial, we'll be focusing on using the `requests` and `Beautiful Soup` libraries.



## 4. Setting up the Environment <a name="setup"></a>

Before we start coding, we need to install the necessary libraries. This can be done by running the following commands in your Jupyter notebook:
        

In [1]:

%pip install requests
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.



You can import the necessary libraries at the start of your notebook and print
the Python and Beautiful Soup versions. Knowing the exact versions helps when
debugging a problem.
        

In [6]:
import sys
print(f'Python version: {sys.version}')
import requests
import bs4
print(f'BeautifulSoup version: {bs4.__version__}')
        

Python version: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)]
BeautifulSoup version: 4.12.2



## 5. Basics of HTML <a name="html"></a>

HTML (HyperText Markup Language) is the standard markup language for creating web pages. It uses tags to define elements, which are the building blocks of web pages. Understanding HTML is essential for web scraping because it allows us to understand the structure of the web page and identify the data we want to extract.
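
To make the idea of tags and nested elements concrete, here is a small sketch that prints the tag structure of a made-up page using only Python's standard `html.parser` module. The sample HTML and the `OutlinePrinter` class are illustrative assumptions, not part of any real site or library.

```python
from html.parser import HTMLParser

# A made-up page: tags such as <h1>, <p>, and <a> define its elements.
SAMPLE_HTML = """\
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-title">Hello</h1>
    <p class="intro">A paragraph with a <a href="/about">link</a>.</p>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Collect an indented outline of the tags in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        attr_text = " ".join(f'{name}="{value}"' for name, value in attrs)
        label = f"{tag} [{attr_text}]" if attr_text else tag
        self.lines.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(SAMPLE_HTML)
outline = "\n".join(printer.lines)
print(outline)
```

The indentation in the output mirrors how elements nest inside each other, and the attributes (`id`, `class`, `href`) are what a scraper usually relies on to locate data. This sketch assumes every opening tag has a closing tag; void elements such as `<br>` or `<img>` would need extra handling.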
        


## 6. Using Beautiful Soup to Parse HTML <a name="beautifulsoup"></a>

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which you can navigate and search by tag name, attribute, or CSS selector to extract data in a hierarchical and readable manner.
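
The following sketch shows a few common lookups on the parse tree. The HTML snippet is made up for illustration, and the example assumes `beautifulsoup4` is installed as described in the setup section.

```python
from bs4 import BeautifulSoup

# Made-up HTML so the example runs without a network request.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-title">Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph with a <a href="/about">link</a>.</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text inside the <title> tag
page_title = soup.title.string

# First <h1> whose id attribute is "main-title"
heading = soup.find("h1", id="main-title").get_text()

# Text of every <p> element, in document order
paragraphs = [p.get_text() for p in soup.find_all("p")]

# CSS selector lookup: the first <p> with class "intro"
intro = soup.select_one("p.intro").get_text()

# href of every <a> element that actually has one
link_targets = [a["href"] for a in soup.find_all("a", href=True)]

print(page_title, heading, paragraphs, intro, link_targets)
```

`find` returns the first match (or `None`), `find_all` returns a list of all matches, and `select_one` accepts a CSS selector. On real pages it is safer to check for `None` before calling `get_text()`.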



## 7. Example Project: Wikipedia <a name="example"></a>

Let's try scraping the Wikipedia page describing high-performance computing.

In [9]:

import requests
from bs4 import BeautifulSoup

# Make a request to the website and fail early on an HTTP error status
r = requests.get("https://en.wikipedia.org/wiki/High-performance_computing", timeout=10)
r.raise_for_status()

# Use the 'html.parser' to parse the page
soup = BeautifulSoup(r.content, 'html.parser')

# Extract the text from the page
text = soup.get_text()

# Extract the URLs from the page
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))

# Extract the images from the page
images = []
for img in soup.find_all('img'):
    images.append(img.get('src'))
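
The `urls` and `images` lists collected above usually need some cleanup: `get('href')` returns `None` for links without an `href`, and many values are relative paths or in-page anchors. A minimal sketch of one way to normalize them with the standard library follows. The helper name `absolute_urls` and the sample values are hypothetical and not taken from the actual page.

```python
from urllib.parse import urljoin


def absolute_urls(base_url, hrefs):
    """Resolve hrefs against base_url, skipping missing values and in-page anchors."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # no target, or a jump within the same page
        resolved.append(urljoin(base_url, href))
    return resolved


base = "https://en.wikipedia.org/wiki/High-performance_computing"

# Hypothetical values of the kind a page might contain.
sample_hrefs = [
    None,
    "#History",
    "/wiki/Supercomputer",
    "//upload.wikimedia.org/logo.png",
    "https://example.org/",
]

print(absolute_urls(base, sample_hrefs))
```

In the notebook, the same function could be applied as `absolute_urls(r.url, urls)` and `absolute_urls(r.url, images)`.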

        

### Note:
While you can access the content by web scraping, it can be hard to extract the required information because the raw text loses the structure embedded in the HTML. A better way to access online content is often a dedicated API that simplifies parsing, e.g. https://github.com/martin-majlis/Wikipedia-API. Many Python packages provide such APIs for specific websites, such as https://www.tweepy.org/ and https://github.com/Nv7-GitHub/googlesearch. Please note that all of these tools must be used responsibly, as described in [Web Scraping Ethics and Legal Considerations](#ethics).