# Web Scraping Python Tutorial – How to Scrape Data From A Website

Python is a beautiful language to code in. It has a great package ecosystem, there's much less noise than you'll find in other languages, and it is super easy to use.

Python is used for many things, from data analysis to server programming. One exciting use case of Python is web scraping.

In this article, we will cover how to use Python for web scraping. We'll also work through a complete hands-on classroom guide as we proceed.

*Note: We will be scraping a webpage that I host, so we can safely learn scraping on it. Many companies do not allow scraping on their websites, so this is a good way to learn. Just make sure to check before you scrape.*

## 1. Load webpages in Python with `requests`

Welcome to a new classroom! Let’s start our first lab in the classroom by learning about the `requests` module in Python.

The `requests` module allows you to send HTTP requests using Python.

Each request returns a `Response` object holding the response data (content, encoding, status code, and so on). Here is an example that fetches the HTML of a page:

----
```python
import requests

res = requests.get('URL')

print(res.text)
print(res.status_code)
```
----
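
Real requests can fail or hang, so it can help to wrap the call above in a small helper. The following is only a sketch under a few assumptions: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice. It uses `raise_for_status()` to turn 4xx/5xx responses into exceptions and returns `None` when anything goes wrong.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails.

    The helper name and the default timeout are illustrative assumptions.
    """
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors
        print(f"Request failed: {exc}")
        return None
    return res.text
```

Calling `fetch_html(...)` with the classroom URL should give you the same string as `res.text` in the example above, while a malformed URL yields `None` instead of an unhandled exception.
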
To pass this lab, take care of the following things:

1. Get the contents of the following URL using the `requests` module: **https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/**
2. Store the text response (as shown above) in a variable called `txt`
3. Store the status code (as shown above) in a variable called `status`
4. Print `txt` and `status` using the `print` function

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
response = requests.get('https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')

txt = response.text
status = response.status_code

print(response)

<Response [200]>


In [3]:
print(txt)

<!DOCTYPE html>
<html lang="en">
	<head>
		<!-- Anti-flicker snippet (recommended)  -->
		<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
		<title>codedamn Web Scraper demo</title>
		<meta charset="utf-8" />
		<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

		<meta
			name="keywords"
			content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, "
		/>
		<meta name="description" content="The most popular web scraping website." />
		<link
			rel="icon"
			sizes="128x128"
			href="/webscraper-python-codedamn-classroom-website/favicon.png"
		/>

		<meta name="viewport" content="width=device-width, initial-scale=1.0" />

		<link rel="stylesheet" href="/webscraper-python-codedamn-classroom-website/app.css" />

		<link
			rel="apple-touch-icon"
			href="/webscraper-python-codedamn-classroom-website/logo-icon.png"
		/>

		<script defer src="/webscraper-python-codedamn-classroom-website/app.js"></script>
	</head>
	<body>
		<header r

In [4]:
print(status)

200


## 2. Extracting title with BeautifulSoup

In this whole classroom, we’ll be using a library called `BeautifulSoup` in Python to do web scraping. Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

Beautiful Soup parses anything you give it and does the tree traversal for you. You can tell it "Find all the links", "Find all the links of class `externalLink`", "Find all the links whose URLs match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
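
As a rough illustration, those four requests could be written with `find_all` along the following lines. The HTML snippet is made up for this sketch (it is not taken from the classroom page), so treat the tag contents and URLs as assumptions.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical HTML, invented purely for this illustration
html = """
<a href="https://foo.com/docs" class="externalLink">Docs</a>
<a href="/about">About</a>
<a href="https://bar.org" class="externalLink">Bar</a>
<table><tr><th><b>Price</b></th><th>Name</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_headings = [th.get_text() for th in soup.find_all("th") if th.find("b")]

print(len(all_links), len(external_links), len(foo_links), bold_headings)
# prints: 3 2 1 ['Price']
```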

Here’s a simple example of BeautifulSoup:

****
```python
import requests
from bs4 import BeautifulSoup

page = requests.get("URL")
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.title.text # gets you the text of the <title>(...)</title>
```
****
Remember how we loaded the page content using the `requests` module in the last lab? Here we again download the page first, and then load its content into BeautifulSoup. Finally, we extract the page title with `soup.title.text`, which is very convenient.
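
One detail worth knowing: if a page has no `<title>` tag, `soup.title` is `None`, and calling `.text` on it raises an `AttributeError`. A minimal sketch of a guard is shown below. `get_title` is a hypothetical helper name, and the two HTML strings are made-up inputs so the example runs without a network request.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the stripped <title> text, or None if the page has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:  # attribute access returns None for a missing tag
        return None
    return soup.title.text.strip()


# Hypothetical inline documents used only for illustration
with_title = get_title("<html><head><title> Demo page </title></head></html>")
without_title = get_title("<html><body><p>No title here</p></body></html>")

print(with_title)     # Demo page
print(without_title)  # None
```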

To pass this lab:

1. Use the `requests` package to fetch the page at this URL:

    https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/

2. Use BeautifulSoup to store the title of this page into a variable called `page_title`



In [5]:
response = requests.get("https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")

soup = BeautifulSoup(response.content, 'html.parser')


# Extract title of page
page_title = soup.title.text

# print the result
print(page_title)

codedamn Web Scraper demo


## 3. Soup-ed body and head

In the last lab, we saw how to extract the `title` from a page. Extracting other sections is just as easy. We also saw that you have to call `.text` on an element to get its string, but you can print the element without `.text` too, which gives you the full markup.

Let us take a look at how you can extract the `body` and `head` sections from your pages. Try running the example below:

****
```python
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get("URL")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title.text

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_body, page_head)
```
****
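
To make the difference between an element and its text concrete, here is a small offline sketch. The HTML string is a made-up stand-in for a real page, so nothing is downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

body = soup.body

print(type(body).__name__)  # Tag
print(str(body))            # <body><p>Hello <b>world</b></p></body>
print(body.text)            # Hello world
```

Printing the element gives the markup, while `.text` concatenates the text of everything nested inside it.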

In [6]:
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract info of page
page_title = soup.title
page_body = soup.body
page_head = soup.head

# print the result
print(page_title)

<title>codedamn Web Scraper demo</title>


In [7]:
print(page_body)

<body>
<header class="navbar navbar-fixed-top navbar-static" role="banner">
<div class="container">
<div class="navbar-header">
<a data-target=".side-collapse" data-target-2=".side-collapse-container" data-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggle pull-right collapsed" data-target="#navbar" data-target-2=".side-collapse-container" data-target-3=".side-collapse" data-toggle="collapse" type="button">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
</button>
</a>
<div class="navbar-brand">
<a href="/webscraper-python-codedamn-classroom-website/"><img alt="Web Scraper" src="/webscraper-python-codedamn-classroom-website/logo_white.svg"/></a>
</div>
</div>
<div class="side-collapse in">
<nav class="navbar-collapse collapse" id="navbar" role="navigation">
<ul class="nav navbar-nav navbar-right">
<li class="hidden">
<a

In [8]:
print(page_head)

<head>
<!-- Anti-flicker snippet (recommended)  -->
<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
<title>codedamn Web Scraper demo</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, " name="keywords"/>
<meta content="The most popular web scraping website." name="description"/>
<link href="/webscraper-python-codedamn-classroom-website/favicon.png" rel="icon" sizes="128x128"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="/webscraper-python-codedamn-classroom-website/app.css" rel="stylesheet"/>
<link href="/webscraper-python-codedamn-classroom-website/logo-icon.png" rel="apple-touch-icon"/>
<script defer="" src="/webscraper-python-codedamn-classroom-website/app.js"></script>
</head>


## 4. select with BeautifulSoup

Now that we have explored some parts of BeautifulSoup, let us look at how we can select DOM elements with BeautifulSoup methods.

Once we have the `soup` variable (as in the previous labs), we can call `.select` on it. `.select` takes a CSS selector, so you can reach down the DOM tree just as you would select elements with CSS. Let us take an example:

****

```python
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get("URL")

soup = BeautifulSoup(page.content, 'html.parser')

# Extract first <h1>(...)</h1> text
first_h1 = soup.select('h1')[0].text
```

****

`.select` returns a list of all the matching elements, which is why we picked only the first element here with the `[0]` index.
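
Indexing with `[0]` raises an `IndexError` when nothing matches. As a hedged alternative, Beautiful Soup also offers `select_one`, which returns the first match or `None`. The sketch below runs on a made-up HTML snippet rather than the classroom page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration
html = """
<h1>Test Sites</h1>
<p>first</p>
<h1>E-commerce training site</h1>
<p>second</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Text of every <h1>, built with a list comprehension
all_h1_text = [h1.text for h1 in soup.select("h1")]

# First match, or None when the selector matches nothing
first_h1 = soup.select_one("h1")
missing = soup.select_one("h2")

# Guard against a short list before indexing into it
paragraphs = soup.select("p")
third_p_text = paragraphs[2].text if len(paragraphs) > 2 else None

print(all_h1_text, first_h1.text, missing, third_p_text)
# prints: ['Test Sites', 'E-commerce training site'] Test Sites None None
```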

Passing requirements:

- Create a variable `all_h1_tags`. Set it to an empty list.
- Use `.select` to select all the `<h1>` tags and store the text of those h1 inside `all_h1_tags` list.
- Create a variable `seventh_p_text` and store the text of 7th `p` element (index 6) inside.

In [9]:
page = requests.get("https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_h1_tags as empty list
all_h1_tags = []
# Set all_h1_tags to all h1 tags of the soup
for h1 in soup.select('h1'):
    all_h1_tags.append(h1.text)
# Create seventh_p_text and set it to 7th p element text of the page
seventh_p_text = soup.select('p')[6].text

print(all_h1_tags, seventh_p_text)

['Test Sites', 'E-commerce training site'] 7 reviews


## 5. Top items being scraped right now

Let us go ahead and extract the top items scraped from our URL: https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/

If you open this page in a new tab, you’ll see some top items. In this lab, your task is to scrape their names and review labels and store them in a list called `top_items`.

To pass this challenge, take care of the following things:

- Use `.select` to extract the titles. (Hint: one selector for product titles could be `a.title`)
- Use `.select` to extract the review count label for those product titles. (Hint: one selector for reviews could be `div.ratings`) Note: this is a complete label (e.g. **2 reviews**) and not just a number.
- Create a new dictionary in the format:

****

```python
info = {
   "title": 'Asus AsusPro Adv...   '.strip(),
   "review": '2 reviews\n\n\n'.strip()
}
```

****

- Note we are using the `strip` method to remove any extra newlines or whitespace we might have in the output. This is **important** to pass this lab.
- Append this dictionary to a list called `top_items`
- Print this list at the end
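
One way to build these dictionaries is to loop over each product container and read the title and review label from inside it, so the two values always belong to the same product. The sketch below is an assumption-laden illustration: the inline HTML is made up, and it assumes each product sits in a `div.thumbnail` wrapper with the review label in a `p.pull-right` inside `div.ratings`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second product deliberately has no ratings block
html = """
<div class="thumbnail">
  <a class="title" href="/p/1"> Asus AsusPro Adv... </a>
  <div class="ratings"><p class="pull-right">7 reviews

  </p></div>
</div>
<div class="thumbnail">
  <a class="title" href="/p/2">Acer Aspire 3 A3...</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

top_items = []
for card in soup.select("div.thumbnail"):
    title = card.select_one("a.title")
    review = card.select_one("div.ratings p.pull-right")
    top_items.append({
        # strip() removes the surrounding spaces and newlines
        "title": title.text.strip() if title is not None else "",
        "review": review.text.strip() if review is not None else "",
    })

print(top_items)
```

With this input, the first item gets `'7 reviews'` and the second gets an empty review string instead of silently borrowing a label from another product.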

In [10]:
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create top_items as empty list
top_items = []

# Pair each product title with its ratings block and store the stripped text
for title, ratings in zip(soup.select('a.title'), soup.select('div.ratings')):
    t = title.text.strip()
    rating = ratings.select('p.pull-right')[0].text.strip()
    top_items.append({"title": t, "review": rating})
print(top_items)

[{'title': 'Asus AsusPro Adv...', 'review': '7 reviews'}, {'title': 'Asus ROG Strix G...', 'review': '4 reviews'}, {'title': 'Acer Aspire 3 A3...', 'review': '2 reviews'}]


## 6. Extracting Links

So far you have seen how you can extract the text, or rather innerText of elements. Let's now see how you can extract attributes by extracting links from the page.

Here’s an example of how to extract out all the image information from the page:
****
```python
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create image_data as an empty list
image_data = []

# Collect the src and alt attributes of every <img> on the page
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    image_data.append({"src": src, "alt": alt})

print(image_data)
```
****
In this lab, your task is to extract the `href` attribute of each link along with its `text`. Make sure of the following things:

- You have to create a list called `all_links`
- In this list, store all link dict information. It should be in the following format:
****
```python
info = {
   "href": "<link here>",
   "text": "<link text here>"
}
```
****
- Make sure your `text` is stripped of any whitespace
- Make sure you check if your `.text` is None before you call `.strip()` on it.
- Store all these dicts in `all_links`
- Print this list at the end
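
Many `href` values on a page are relative (for example `/pricing` or `#section3`), and some `<a>` tags have no `href` at all. If you want absolute URLs, one option is `urllib.parse.urljoin` from the standard library. This goes slightly beyond what the lab asks for, and the inline HTML below is a made-up sample, so read it as a sketch rather than the expected solution.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/"

# Hypothetical markup for illustration only
html = """
<a>Toggle navigation</a>
<a href="/webscraper-python-codedamn-classroom-website/pricing">  Pricing </a>
<a href="#section3">Learn</a>
<a href="https://forum.webscraper.io/">Forum</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = []
for link in soup.select("a"):
    href = link.get("href")  # None when the attribute is absent
    all_links.append({
        # Resolve relative links against the page URL; keep None as None
        "href": urljoin(BASE_URL, href) if href is not None else None,
        "text": link.text.strip(),
    })

for item in all_links:
    print(item)
```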

In [11]:
all_links = []

page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.select('a'):
    href = link.get('href')
    text = link.text.strip()
    all_links.append({"href": href, "text": text})

print(all_links)

[{'href': None, 'text': 'Toggle navigation'}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': ''}, {'href': '#page-top', 'text': ''}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': 'Web Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/cloud-scraper', 'text': 'Cloud Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/pricing', 'text': 'Pricing'}, {'href': '#section3', 'text': 'Learn'}, {'href': '/webscraper-python-codedamn-classroom-website/documentation', 'text': 'Documentation'}, {'href': '/webscraper-python-codedamn-classroom-website/tutorials', 'text': 'Video Tutorials'}, {'href': '/webscraper-python-codedamn-classroom-website/how-to-videos', 'text': 'How to'}, {'href': '/webscraper-python-codedamn-classroom-website/test-sites', 'text': 'Test Sites'}, {'href': 'https://forum.webscraper.io/', 'text': 'Forum'}, {'href': 'https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en',

## 7. Generating CSV from data

Finally, let's understand how you can generate CSV from a set of data. You will create a CSV with the following headings:

1. Product Name
2. Price
3. Description
4. Reviews
5. Product Image

Each product is located in a `div.thumbnail` element. The CSV boilerplate is given below:

****

```python
import requests
from bs4 import BeautifulSoup
import csv
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('div.thumbnail')
for product in products:
    # TODO: Work
    print("Work on product here")


keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)
```

****

You have to extract data from the website and generate this CSV for the three products.

### Passing Requirements:

- Product Name is the whitespace trimmed version of the name of the item (example - Asus AsusPro Adv..)
- Price is the whitespace trimmed but full price label of the product (example - $1101.83)
- The description is the whitespace trimmed version of the product description (example - Asus AsusPro Advanced BU401LA-FA271G Dark Grey, 14", Core i5-4210U, 4GB, 128GB SSD, Win7 Pro)
- Reviews is the whitespace trimmed version of the product's review label (example - 7 reviews)
- Product image is the URL (src attribute) of the image for a product (example - /webscraper-python-codedamn-classroom-website/cart2.png)
- The name of the CSV file should be **products.csv** and should be stored in the same directory as your **script.py** file
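
To see what `csv.DictWriter` produces without touching the file system, you can write to an in-memory buffer first. The rows below are invented placeholder data, not values from the classroom page. Passing an explicit `fieldnames` list is a design choice of this sketch: it fixes the column order and avoids reading the keys from `all_products[0]`, which fails on an empty list.

```python
import csv
import io

# Hypothetical rows shaped like the dictionaries this lab asks for
all_products = [
    {"Product Name": "Example Laptop A", "Price": "$100.00",
     "Description": 'Example A, 14", 4GB', "Reviews": "7 reviews",
     "Product Image": "/img/a.png"},
    {"Product Name": "Example Laptop B", "Price": "$200.00",
     "Description": "Example B, 8GB", "Reviews": "2 reviews",
     "Product Image": "/img/b.png"},
]

fieldnames = ["Product Name", "Price", "Description", "Reviews", "Product Image"]

# Write to an in-memory buffer instead of products.csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(all_products)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm the round trip preserves every field
rows_read_back = list(csv.DictReader(io.StringIO(csv_text)))
```

Note how the description containing a comma and a double quote is quoted and escaped automatically by the `csv` module.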

In [12]:
import requests
from bs4 import BeautifulSoup
import csv
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('div.thumbnail')
for product in products:
    # Extract each field from the product card
    name = product.select('a')[0].get('title').strip()
    price = product.select('h4')[0].text.strip()
    description = "".join([word.strip('\n') for word in product.select('p.description')[0].text.strip().split('\t')])
    review = product.select('div.ratings')[0].text.strip()
    prod_img = product.select('img')[0].get('src')
    all_products.append({'Product Name': name,
                         'Price': price,
                         'Description': description,
                         'Reviews': review,
                         'Product Image': prod_img})
    
keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

In [13]:
import pandas as pd
pd.read_csv('products.csv')

| | Product Name | Price | Description | Reviews | Product Image |
|---|---|---|---|---|---|
| 0 | Asus AsusPro Advanced BU401LA-FA271G Dark Grey | $1139.54 | Asus AsusPro Advanced BU401LA-FA271G Dark Grey... | 7 reviews | /webscraper-python-codedamn-classroom-website/... |
| 1 | Asus ROG Strix GL553VD-DM535T | $1101.83 | Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, ... | 4 reviews | /webscraper-python-codedamn-classroom-website/... |
| 2 | Acer Aspire 3 A315-51 Black | $494.71 | Acer Aspire 3 A315-51 Black, 15.6" FHD, Corei3... | 2 reviews | /webscraper-python-codedamn-classroom-website/... |


In [14]:
pd.DataFrame(all_products)

| | Product Name | Price | Description | Reviews | Product Image |
|---|---|---|---|---|---|
| 0 | Asus AsusPro Advanced BU401LA-FA271G Dark Grey | $1139.54 | Asus AsusPro Advanced BU401LA-FA271G Dark Grey... | 7 reviews | /webscraper-python-codedamn-classroom-website/... |
| 1 | Asus ROG Strix GL553VD-DM535T | $1101.83 | Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, ... | 4 reviews | /webscraper-python-codedamn-classroom-website/... |
| 2 | Acer Aspire 3 A315-51 Black | $494.71 | Acer Aspire 3 A315-51 Black, 15.6" FHD, Corei3... | 2 reviews | /webscraper-python-codedamn-classroom-website/... |
