# Web Scraping Tutorial

This tutorial covers the basics of web scraping using Python.

## What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It can be done using various tools and libraries in Python, such as BeautifulSoup and Scrapy.

## Installing Required Libraries

For this tutorial, we will use `requests` to fetch web pages and `BeautifulSoup` (installed as the `beautifulsoup4` package) to parse HTML content. Install both libraries using pip:

```bash
pip install requests
pip install beautifulsoup4
```

## Fetching a Web Page

Use the `requests` library to fetch the content of a web page.

In [3]:
import requests

# URL of the web page you want to scrape
url = 'https://en.wikipedia.org/wiki/Artificial_intelligence'

# Fetch the content of the web page; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)

# Print the status code to check if the request was successful
print(response.status_code)

# Print the first 500 characters of the content
print(response.text[:500])

200
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clien


## Parsing HTML Content

Use `BeautifulSoup` to parse the HTML content of the web page.

In [4]:
from bs4 import BeautifulSoup

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(response.text, 'html.parser')

# Print the title of the web page
print(soup.title)

# Print all paragraph tags
for p in soup.find_all('p'):
    print(p.text)

<title>Artificial intelligence - Wikipedia</title>


Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] Such machines may be called AIs.

Some high-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); interacting via human speech (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT, Apple Intelligence, and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go).[2] However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applicati

## Extracting Specific Data

Extract specific data from the web page using BeautifulSoup.

In [5]:
# Extract all links from the web page
links = soup.find_all('a')

# Print the URLs of the links
for link in links:
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Artificial+intelligence
/w/index.php?title=Special:UserLogin&returnto=Artificial+intelligence
/w/index.php?title=Special:CreateAccount&returnto=Artificial+intelligence
/w/index.php?title=Special:UserLogin&returnto=Artificial+intelligence
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Goals
#Reasoning_and_problem-solving
#Knowledge_representation
#Planning_and_decision-making
#Learning
#Natural_language_processing
#Perceptio
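
Many of the `href` values above are relative (`/wiki/Main_Page`), protocol-relative (`//en.wikipedia.org/...`), or in-page fragments (`#Goals`). A minimal sketch of turning them into absolute URLs with the standard library's `urllib.parse.urljoin`, assuming the page URL from earlier as the base. The helper name `absolutize` and the choice to skip fragment-only links are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urljoin

# Base URL of the page the links were scraped from
BASE_URL = 'https://en.wikipedia.org/wiki/Artificial_intelligence'

def absolutize(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, skipping missing and fragment-only links."""
    absolute = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href
        if not href or href.startswith('#'):
            continue
        absolute.append(urljoin(base_url, href))
    return absolute

# A few values taken from the output above, plus a missing href
sample = [
    '#bodyContent',
    '/wiki/Main_Page',
    '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    None,
]
for absolute_url in absolutize(sample):
    print(absolute_url)
```

In the notebook, the same helper could be called as `absolutize(link.get('href') for link in links)`.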

## Saving Data to a File

Save the extracted data to a file for later use.

In [6]:
# Pass an explicit encoding: without it, Python uses the platform default
# (for example 'charmap' on Windows), which raises UnicodeEncodeError on
# hrefs that contain characters such as 'ö' ('\xf6')
with open('links.txt', 'w', encoding='utf-8') as file:
    for link in links:
        href = link.get('href')
        if href is not None:
            file.write(href + '\n')
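
When each link should keep its text alongside its URL, a CSV file is often easier to reuse than plain text. A minimal sketch with the standard library's `csv` module, opening the file with an explicit UTF-8 encoding so non-ASCII characters are written safely. The `rows` data, the `save_links` helper, and the file name `links.csv` are hypothetical placeholders for values pulled from `links`.

```python
import csv

# Hypothetical (text, href) pairs; in the notebook these would come from `links`
rows = [
    ('Main page', '/wiki/Main_Page'),
    ('Kurt Gödel', '/wiki/Kurt_G%C3%B6del'),
]

def save_links(rows, path='links.csv'):
    """Write (text, href) pairs to a CSV file with a header row."""
    # newline='' lets the csv module control line endings;
    # encoding='utf-8' avoids platform-dependent default codecs
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['text', 'href'])
        writer.writerows(rows)

save_links(rows)
```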

## Handling Pagination

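Some websites split their data across multiple pages. Handle pagination by requesting each page in turn. The loop below uses Wikipedia's articles on the numbers 1 to 5 as stand-ins for numbered pages; replace the URL pattern with the one your target site uses.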

In [7]:
# Loop through the first 5 pages
for page in range(1, 6):
    url = f'https://en.wikipedia.org/wiki/{page}'  # Update this URL to the actual website you are scraping
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all titles and paragraphs
    titles = soup.find_all('h1')  # Assuming titles are in <h1> tags
    paragraphs = soup.find_all('p')
    
    # Print titles
    print(f"Page {page} Titles:")
    for title in titles:
        print(title.get_text())
        
    print(f"\nPage {page} Paragraphs:")
    for paragraph in paragraphs:
        print(paragraph.get_text())
    print("\n" + "="*50 + "\n")

Page 1 Titles:
1

Page 1 Paragraphs:


1 (one, unit, unity) is a number representing a single or the only entity. 1 is also a numerical digit and represents a single unit of counting or measurement. For example, a line segment of unit length is a line segment of length 1. In conventions of sign where zero is considered neither positive nor negative, 1 is the first and smallest positive integer. It is also sometimes considered the first of the infinite sequence of natural numbers, followed by 2, although by other definitions 1 is the second natural number, following 0.

The fundamental mathematical property of 1 is to be a multiplicative identity, meaning that any number multiplied by 1 equals the same number. Most if not all properties of 1 can be deduced from this. In advanced mathematics, a multiplicative identity is often denoted 1, even if it is not a number. 1 is by convention not considered a prime number; this was not universally accepted until the mid-20th century. Additionally
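
Real paginated sites usually put the page number in the URL, and the number of pages is often unknown in advance. A sketch of one common pattern, assuming a hypothetical `?page=N` query parameter and treating the first empty page as the end of the listing. Here `fetch_items` stands in for the `requests` plus BeautifulSoup code above, and `https://example.com/articles` is a placeholder URL.

```python
from urllib.parse import urlencode

def page_url(base_url, page):
    """Build the URL for one page, assuming a ?page=N query parameter."""
    query = urlencode({'page': page})
    return f'{base_url}?{query}'

def scrape_all_pages(base_url, fetch_items, max_pages=100):
    """Collect items page by page until a page comes back empty."""
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_items(page_url(base_url, page))
        if not items:
            break  # an empty page is treated as the end of the listing
        results.extend(items)
    return results

# Stand-in for a real fetch-and-parse step: two fake pages of data
FAKE_SITE = {
    'https://example.com/articles?page=1': ['a', 'b'],
    'https://example.com/articles?page=2': ['c'],
}

def fake_fetch(url):
    return FAKE_SITE.get(url, [])

print(scrape_all_pages('https://example.com/articles', fake_fetch))
```

The `max_pages` cap is a safety net in case a site never returns an empty page.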

## Extracting Data from Tables

Web pages often contain data in HTML tables. You can extract this data using BeautifulSoup.

In [8]:
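# Extract data from the first table on the page
# (`soup` still holds the last page fetched in the loop above, the article for 5)
table = soup.find('table')
rows = table.find_all('tr')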

for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col.text)

← 4 
5
 6 →
← 4 
5
 6 →
 −1 0 1 2 3 4 5 6 7 8 9 → List of numbersIntegers← 0 10 20 30 40 50 60 70 80 90 →
five
5th
(fifth)
quinary
prime
3rd
1, 5
Ε´
V, v
penta-/pent-
quinque-/quinqu-/quint-
1012
123
56
58
512
516
ε (or Ε)
٥
۵
፭
৫
೫
੫
五
Ե
५
ה
៥
౫
൫
௫
๕
𒐙
|||||
𝋥
.....
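
The loop above prints every cell on its own line, which loses the row structure. A sketch of keeping cells grouped by row, using the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The `TableRows` class and the HTML fragment are illustrative assumptions, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up table fragment for illustration
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>Cardinal</td><td>five</td></tr>
  <tr><td>Ordinal</td><td>5th</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
for row in parser.rows:
    print(row)
```

With BeautifulSoup, the equivalent idea is to build one list per `tr` from its `td` and `th` children instead of printing each cell separately.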


## Handling JavaScript-Rendered Content

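Some web pages render their content with JavaScript after the initial HTML has loaded, so the response fetched by `requests` may not contain the data you see in a browser. Scraping these pages requires a tool that executes the scripts first, such as Selenium or Scrapy with Splash.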

In [10]:
!pip install selenium

In [11]:
# Example using Selenium to handle JavaScript-rendered content
# Install Selenium using pip: pip install selenium
from bs4 import BeautifulSoup
from selenium import webdriver

# Set up the WebDriver (e.g., using Chrome)
driver = webdriver.Chrome()

try:
    # Navigate to the web page
    driver.get('http://example.com')

    # Get the page source and parse it with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Extract data as usual
    print(soup.title.text)
finally:
    # Close the WebDriver even if an error occurred above
    driver.quit()

Example Domain


## Using Scrapy for Advanced Web Scraping

Scrapy is a powerful web scraping framework for Python. It allows you to define spiders to crawl and scrape websites.

In [14]:
# Example of a simple Scrapy spider
# Save this code in a file named my_spider.py

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

# Run the spider using the command: scrapy runspider my_spider.py

## Avoiding Getting Blocked

When scraping websites, it’s important to follow best practices to avoid getting blocked. Here are some tips:

1. **Respect `robots.txt`:** Check the website's `robots.txt` file to see which parts of the site are allowed to be scraped.
2. **Throttle your requests:** Do not overload the server with too many requests in a short period. Use delays and random intervals between requests.
3. **Use User-Agent headers:** Some websites block requests that do not come from a browser. Set the User-Agent header to mimic a real browser.

In [15]:
# Example of setting headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)

200
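
The first two tips can be sketched with the standard library alone. This example parses a hypothetical `robots.txt` from a string so that it runs offline; a real scraper would instead point `RobotFileParser` at the site's own file with `set_url(...)` and `read()`. The helper names `allowed` and `polite_delay`, the user agent `my-scraper`, and the delay bounds are assumptions for illustration.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network request
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent='my-scraper'):
    """Return True if the parsed robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval between requests and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

for page in ['https://example.com/public/page', 'https://example.com/private/data']:
    if allowed(page):
        print('fetching', page)
        polite_delay(0.1, 0.2)  # short delay for the demo only
    else:
        print('skipping', page)
```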


## Error Handling

Implement error handling to make your scraper more robust and handle exceptions gracefully.

In [None]:
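# Example of error handling
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises HTTPError for 4xx and 5xx responses
except requests.exceptions.HTTPError as err:
    print(f'HTTP error occurred: {err}')
except requests.exceptions.RequestException as err:
    # Covers connection errors, timeouts, and other request failures
    print(f'Other error occurred: {err}')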
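
Transient failures such as dropped connections often succeed on a second try. A sketch of retrying with exponential backoff, written against a generic `fetch` callable so it runs without a network. The function name `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in are assumptions for illustration. The default `retry_on=(OSError,)` is chosen on the understanding that `requests.exceptions.RequestException` derives from `IOError`; pass the exception types explicitly if in doubt.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on the given errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on as err:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
            print(f'Attempt {attempt} failed ({err}); retrying in {delay}s')
            sleep(delay)

# Stand-in for a real request: fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return f'content of {url}'

print(fetch_with_retries(flaky_fetch, 'https://example.com', base_delay=0.01))
```

With `requests`, `fetch` could be a small function that calls `requests.get(url, timeout=10)` followed by `raise_for_status()` and returns the response.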

## Summary

In this tutorial, we covered the basics of web scraping using Python: fetching web pages, parsing HTML content, extracting specific data, saving data to a file, handling pagination, extracting data from tables, and handling JavaScript-rendered content. We also looked at using Scrapy for advanced scraping, avoiding getting blocked, and implementing error handling.