# Web Scraping Tutorial

This tutorial shows how to use Python to scrape a web page and extract data from it. We will use two packages: `requests` to download the web page and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I recommend the following resources:
1. Automate the Boring Stuff with Python by Al Sweigart (2020) has a chapter on web scraping, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping with Python by Ryan Mitchell (2018) is a somewhat older book, but it provides a comprehensive guide to the topic.

**Goal:** We will extract cryptocurrency market prices from the Etherscan website: https://etherscan.io/tokens

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrape a web page

Now we are ready to download the web page we want to get data from, using the `requests` package. We will use the following function and attributes:

* `requests.get('URL')` - make a request to the specified URL
* `r.status_code` - get the status code of the response
* `r.content` - get the binary content of the page

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).
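
The calls listed above can be combined into one small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html`, its `get` parameter (included only so the function can be exercised without network access), and the 10-second timeout are assumptions chosen for illustration. The `timeout` argument and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, headers=None, timeout=10, get=requests.get):
    """Download `url` and return its decoded text.

    `fetch_html` and the `get` parameter are hypothetical names for this
    sketch; `get` defaults to requests.get and can be swapped for a stub.
    """
    # A timeout stops the call from hanging forever on a slow server
    response = get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # comparing status codes by hand
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://etherscan.io/tokens?ps=100&p=1', headers=headers)` with a `headers` dictionary like the one used in this tutorial should return the same kind of string as `r.text`.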

In [1]:
# First, we will import the requests package
import requests

In [2]:
# Request the webpage
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://etherscan.io/tokens?ps=100&p=1', headers=headers)

In [3]:
# Type of the response object we got back
type(r)

requests.models.Response

In [4]:
# Check whether the request succeeded
r.status_code == requests.codes.ok

True

In [5]:
# Get the content type from the response headers
r.headers['content-type']

'text/html; charset=utf-8'

In [6]:
# The raw content of the web page is a bytes object
type(r.content)

bytes

In [7]:
# The decoded text of the web page is a str object
type(r.text)

str
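
The two outputs above differ only in decoding: `r.content` holds the raw bytes, while `r.text` is those bytes decoded to a string using the character set the server declares. The sketch below reproduces that step with the standard library only. The sample page body is made up for illustration; the content-type value is the one shown in cell 5.

```python
from email.message import Message

# Content-type header as returned by the server in the example above
content_type = 'text/html; charset=utf-8'

# Parse the charset parameter out of the header value
msg = Message()
msg['Content-Type'] = content_type
charset = msg.get_content_charset()

# Illustrative page body (an assumption, not the real Etherscan response)
raw_bytes = '<title>Token Tracker | Etherscan</title>'.encode(charset)

# Decoding the bytes with the declared charset yields the text form
decoded_text = raw_bytes.decode(charset)
print(type(raw_bytes).__name__, type(decoded_text).__name__)
```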

In [8]:
# Save the content of web page
with open('Etherscan.html', 'wb') as fp:
    fp.write(r.content)

## Step 2: Load the web page as BeautifulSoup object

After downloading the web page and saving it to the local disk, we will use the `BeautifulSoup` package to parse the HTML file and access its content. We will use the following functions:

**1. Load the web page to BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content into a BeautifulSoup object

In [9]:
# First, we will import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

In [10]:
# Load the saved web page and parse it with BeautifulSoup
with open('Etherscan.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

In [11]:
# Check the type of our soup object
type(soup)

bs4.BeautifulSoup
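
Saving the page to disk first is optional: `BeautifulSoup` also accepts a string or bytes object directly. A minimal sketch, using a small hand-written HTML snippet (an assumption standing in for `r.content`):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for r.content; the real page is much larger
html_bytes = b"""<!DOCTYPE html>
<html lang="en">
 <head><title>Token Tracker | Etherscan</title></head>
 <body><h1 class="h5 mb-0">Token Tracker</h1></body>
</html>"""

# Parse the bytes in memory, without writing a file first
snippet_soup = BeautifulSoup(html_bytes, 'html.parser')
print(type(snippet_soup))
print(snippet_soup.title.string)
```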

In [13]:
# Print the content of the web page
print(soup.prettify())

<!DOCTYPE html>
<html id="html" lang="en">
 <head>
  <title>
   Token Tracker | Etherscan
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <meta content="etherscan.io" name="author"/>
  <meta content="The list of ERC-20 Tokens and their Prices, Market Capitalizations and the Number of Holders in the Ethereum Blockchain on Etherscan." name="Description"/>
  <meta content="ethereum, explorer, ether, search, blockchain, crypto, currency" name="keywords"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="Token Tracker | Etherscan" property="og:title"/>
  <meta content="The list of ERC-20 Tokens and their Prices, Market Capitalizations and the Number of Holders in the Ethereum Blockchain on Etherscan." property="og:description"/>
  <meta content="website" property="og:type"/>
  <meta content="Ethereum (ETH) Blockchain Explorer" property="og:site_name"/>
  <meta content="http://etherscan.

In [21]:
# Print all the text in the webpage
print(soup.get_text())





	Token Tracker | Etherscan

            























ETH Price: $1,637.95 (-0.52%) Gas: 6 Gwei


















/










 Light




 Dim




 Dark







 Site Settings 












Ethereum Mainnet




Ethereum Mainnet CN




Beaconscan ETH2







Goerli Testnet




Sepolia Testnet




Holesky Testnet
















 Sign In









Home



Blockchain



Transactions




Pending Transactions




Contract Internal Transactions




Beacon Deposits




Beacon Withdrawals







View Blocks




Forked Blocks (Reorgs)




Uncles







Top Accounts




Verified Contracts





Tokens



Top Tokens (ERC-20)




Token Transfers (ERC-20)





NFTs



Top NFTs




Top Mints




Latest Trades




Latest Transfers




Latest Mints





Resources



Charts And Stats




Top Statistics







Directory




Newsletter




Knowledge Base





Developers



API Plans




API Documentation







Code Reader Beta




Verify Contract




Similar Contract Search




Smart Contract S

In [22]:
import re

In [30]:
# Remove tabs, then collapse runs of blank or whitespace-only lines into a single newline
print(re.sub(r'(\n\s*)+\n', '\n', soup.get_text().replace('\t', '')))


Token Tracker | Etherscan
ETH Price: $1,637.95 (-0.52%) Gas: 6 Gwei
/
 Light
 Dim
 Dark
 Site Settings 
Ethereum Mainnet
Ethereum Mainnet CN
Beaconscan ETH2
Goerli Testnet
Sepolia Testnet
Holesky Testnet
 Sign In
Home
Blockchain
Transactions
Pending Transactions
Contract Internal Transactions
Beacon Deposits
Beacon Withdrawals
View Blocks
Forked Blocks (Reorgs)
Uncles
Top Accounts
Verified Contracts
Tokens
Top Tokens (ERC-20)
Token Transfers (ERC-20)
NFTs
Top NFTs
Top Mints
Latest Trades
Latest Transfers
Latest Mints
Resources
Charts And Stats
Top Statistics
Directory
Newsletter
Knowledge Base
Developers
API Plans
API Documentation
Code Reader Beta
Verify Contract
Similar Contract Search
Smart Contract Search
Contract Diff Checker
Vyper Online Compiler
Bytecode to Opcode
Broadcast Transaction
More
Tools & Services
Discover more of Etherscan's tools and services in one place.
Sponsored
Tools
Unit Converter
CSV Export
Account Balance Checker
Explore
Gas Tracker
DEX Tracker
Node Tracker
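
The substitution above works in two steps: `.replace('\t', '')` drops tab characters, and the pattern `(\n\s*)+\n` matches any run of blank or whitespace-only lines and replaces it with a single newline. A small sketch on a made-up string (the sample text is an assumption modeled on the output above):

```python
import re

# Made-up sample imitating the raw get_text() output
raw_text = '\n\n\tToken Tracker | Etherscan\n\n   \n\nETH Price\n\n\n/\n'

# Step 1: remove tab characters
no_tabs = raw_text.replace('\t', '')

# Step 2: collapse each run of empty or whitespace-only lines into one newline
cleaned = re.sub(r'(\n\s*)+\n', '\n', no_tabs)
print(repr(cleaned))
```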


**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title element
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes in the H1 element
* `soup.h1['class']` - get the class attribute in the H1 element

In [13]:
# Get the title of the page
soup.title.text.strip()

'Token Tracker | Etherscan'

In [40]:
# Other HTML elements work the same way
soup.h1.text.strip()

'Token Tracker\n(ERC-20)'

In [41]:
# Get the class attribute of an element
soup.h1['class']

['h5', 'mb-0']
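
The attribute accessors listed above can be tried on a small snippet. This is a sketch on hand-written HTML (an assumption that mirrors the page's `<h1>`). It also shows `Tag.get()`, which returns `None` for a missing attribute where square brackets would raise `KeyError`.

```python
from bs4 import BeautifulSoup

# Hand-written snippet modeled on the heading of the token page
html_doc = (
    '<html><head><title> Token Tracker | Etherscan </title></head>'
    '<body><h1 class="h5 mb-0">Token Tracker</h1></body></html>'
)
demo = BeautifulSoup(html_doc, 'html.parser')

print(demo.title.string.strip())   # text inside <title>
print(demo.h1.attrs)               # every attribute of <h1> as a dict
print(demo.h1['class'])            # multi-valued attributes come back as a list
print(demo.h1.get('id'))           # missing attribute: None instead of KeyError
```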

**3. Look for the element in the web page**
* `soup.find('HTML_tag')` - get the first element with the specified HTML tag
* `soup.find_all('HTML_tag')` - get the list of elements that have the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements that match the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)

In [15]:
# We can also get the page title using soup.find() function
soup.find('title').text.strip()

'Token Tracker | Etherscan'

In [16]:
# Print the source of every image element with the "rounded-circle" class
for i in soup.find_all('img', {"class": "rounded-circle"}):
    print(i['src'])

/token/images/tethernew_32.png
/token/images/bnb_28_2.png
/token/images/centre-usdc_28.png
/token/images/lido-steth_32.png
/token/images/trontrx_32.png
/token/images/theopennetwork_32.png
/token/images/theta_28.png
/token/images/polygonmatic_new_32.png
/token/images/wbtc_28.png?v=1
/token/images/chainlinktoken_32.png?v=6
/token/images/shibatoken_32.png
/token/images/MCDDai_32.png
/token/images/leo_28_2.png
/token/images/trueusd_32.png?v=2
/token/images/uniswap_32.png
/token/images/okex_28.png
/images/main/empty-token.png
/token/images/lido-dao_32.png
/token/images/cro_32.png
/token/images/mkr-etherscan-35.png
/token/images/mantle_32.png
/token/images/quantnetwork_28_2.png?v=6
/token/images/vechain_28.png
/token/images/arbitrumone2_32.png
/token/images/near_32.png?v=3
/token/images/stakedaave_32.png
/token/images/rocketpooleth_32.png?v=2
/token/images/TheGraph_32.png
/token/images/usdd-tron_32.png
/token/images/fraxfinanceeth2_32.png
/token/images/immutable_32.png
/token/images/Syntheti

In [38]:
# Get all the token names on the web page
for i in soup.select('.hash-tag'):
    print(i.text)

Tether USD
BNB
USDC
stETH
TRON
Wrapped TON Coin
Theta Token
Matic Token
Wrapped BTC
ChainLink Token
SHIBA INU
Dai Stablecoin
Bitfinex LEO Token
TrueUSD
Uniswap
OKB
BUSD
Lido DAO Token
Cronos Coin
Maker
Mantle
Quant
VeChain
Arbitrum
NEAR
Staked Aave
Rocket Pool ETH
Graph Token
Decentralized USD
Frax
Immutable X
Synthetix Network Token
Render Token
Injective Token
SAND
HEX
Fantom Token
Decentraland
Wrapped Decentraland MANA
Rollbit Coin
Pax Dollar
Paxos Gold
Compound Ether
Tether Gold
Frax Ether
KuCoin Token
chiliZ
ApeCoin
Frax Share
Rocket Pool
HuobiToken
BitTorrent
dYdX
UnlimitedIP Token
APENFT
Coinbase Wrapped Staked ETH
Staked Frax Ether
Nexo
Wootrade Network
NXM
Compound
Pepe
Zilliqa
1INCH Token
BAT
PancakeSwap Token
Gnosis
MCO
SafePalToken
Huobi BTC
WQtum
Convex Token
Illuvium
Fetch
MX Token
Compound USDT
EnjinCoin
Celo native asset (Wormhole)
Wrapped Celo
Gemini dollar
SingularityNET Token
Mask Network
LoopringCoin V2
Euro Tether
DeFiChain Token
Compound USD Coin
Worldcoin
Ethereu
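
To compare the three lookup methods side by side, here is a sketch on a hand-written fragment. The class names `hash-tag` and `rounded-circle` are taken from the cells above; the markup around them is an assumption and is much simpler than the real page.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
 <tr><td><img class="rounded-circle" src="/token/images/tethernew_32.png"/>
         <a href="#"><div class="hash-tag">Tether USD</div></a></td></tr>
 <tr><td><img class="rounded-circle" src="/token/images/bnb_28_2.png"/>
         <a href="#"><div class="hash-tag">BNB</div></a></td></tr>
</table>
"""
demo = BeautifulSoup(html_doc, 'html.parser')

# find() returns only the first match (or None when nothing matches)
first_img = demo.find('img', class_='rounded-circle')

# find_all() returns every match as a list-like ResultSet
image_urls = [img['src'] for img in demo.find_all('img', class_='rounded-circle')]

# select() takes a CSS selector; '.hash-tag' means class="hash-tag"
token_names = [tag.text for tag in demo.select('.hash-tag')]

print(first_img['src'])
print(image_urls)
print(token_names)
```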

## Step 3: Extract the data from the table

Now, we will extract the cryptocurrency market prices from the table.

In [17]:
# Get the table element in the web page
table = soup.table

In [18]:
# Get the table headers
col_names = list()
for i in table.tr.find_all('th'):
    print(i.text.strip())
    col_names.append(i.text.strip())

#
Token
Price
Change (%)
Volume (24H)
Circulating Market Cap
On-Chain Market Cap
Holders


Loop over each row in the table and extract the data from each column in the row.

In [19]:
data = list()
# Loop over each row in the table
for row in table.tbody.find_all('tr'):
    # Get all the columns in the row
    cols = row.find_all('td')

    d = list()
    # Loop over each column and extract the string
    for idx, col in enumerate(cols):
        print(idx, ':', col.text.strip())
        d.append(col.text.strip())

    print('--------------------------')
    data.append(d)

0 : 1
1 : Tether USD
(USDT)
2 : $1.00


0.000610 ETH
3 : --
4 : $16,289,778,116.00
5 : $83,369,918,594.00
6 : $39,823,119,845.00
7 : 4,580,187
-0.094%
--------------------------
0 : 2
1 : BNB
(BNB)
2 : $212.7004


0.129764 ETH
3 : 0.05%
4 : $331,713,816.00
5 : $32,723,133,157.00
6 : $3,526,470,297.89
7 : 285,796
0.009%
--------------------------
0 : 3
1 : USDC
(USDC)
2 : $1.00


0.000610 ETH
3 : 0.02%
4 : $4,270,444,800.00
5 : $25,255,348,953.00
6 : $46,602,430,840.00
7 : 1,790,688
-0.046%
--------------------------
0 : 4
1 : stETH
(stETH)
2 : $1,639.23


1.000062 ETH
3 : -0.47%
4 : $5,520,449.00
5 : $14,459,518,712.00
6 : $3,038,491,481.07
7 : 270,064
0.064%
--------------------------
0 : 5
1 : TRON
(TRX)
2 : $0.0888


0.000054 ETH
3 : -0.01%
4 : $219,852,244.00
5 : $7,904,153,443.00
6 : $230.91
7 : 2,208
0.317%
--------------------------
0 : 6
1 : Wrapped TON Coin
(TONCOIN)
2 : $2.08


0.001269 ETH
3 : 3.61%
4 : $16,339,632.00
5 : $7,200,646,823.00
6 : $15,281,150.56
7 : 7,677
0.000%
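
The header and row loops above can be folded into one reusable function. A minimal sketch, assuming a table with a `<thead>`/`<tbody>` layout like the one on the page; the function name `table_to_rows` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return (column_names, rows) for a bs4 <table> tag."""
    # Header cells live in the first <tr> of the table
    column_names = [th.text.strip() for th in table.tr.find_all('th')]
    # Each body row becomes a list of stripped cell strings
    rows = [
        [td.text.strip() for td in tr.find_all('td')]
        for tr in table.tbody.find_all('tr')
    ]
    return column_names, rows


# Made-up table; "\n" puts a real newline inside the Token cells
html_doc = """
<table>
 <thead><tr><th>#</th><th>Token</th><th>Price</th></tr></thead>
 <tbody>
  <tr><td>1</td><td>Tether USD\n(USDT)</td><td> $1.00 </td></tr>
  <tr><td>2</td><td>BNB\n(BNB)</td><td> $212.7004 </td></tr>
 </tbody>
</table>
"""
column_names, rows = table_to_rows(BeautifulSoup(html_doc, 'html.parser').table)
print(column_names)
print(rows)
```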

## Step 4: Create a DataFrame table and write to a CSV file

In [20]:
import pandas as pd

In [21]:
# Number of rows in the extracted data
len(data)

100

In [22]:
data[0]

['1',
 'Tether USD\n(USDT)',
 '$1.00\n\n\n0.000610 ETH',
 '--',
 '$16,289,778,116.00',
 '$83,369,918,594.00',
 '$39,823,119,845.00',
 '4,580,187\n-0.094%']

In [23]:
# Convert the data list to DataFrame object
df = pd.DataFrame(data, columns=col_names)

In [24]:
df

| | # | Token | Price | Change (%) | Volume (24H) | Circulating Market Cap | On-Chain Market Cap | Holders |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tether USD\n(USDT) | $1.00\n\n\n0.000610 ETH | -- | $16,289,778,116.00 | $83,369,918,594.00 | $39,823,119,845.00 | 4,580,187\n-0.094% |
| 1 | 2 | BNB\n(BNB) | $212.7004\n\n\n0.129764 ETH | 0.05% | $331,713,816.00 | $32,723,133,157.00 | $3,526,470,297.89 | 285,796\n0.009% |
| 2 | 3 | USDC\n(USDC) | $1.00\n\n\n0.000610 ETH | 0.02% | $4,270,444,800.00 | $25,255,348,953.00 | $46,602,430,840.00 | 1,790,688\n-0.046% |
| 3 | 4 | stETH\n(stETH) | $1,639.23\n\n\n1.000062 ETH | -0.47% | $5,520,449.00 | $14,459,518,712.00 | $3,038,491,481.07 | 270,064\n0.064% |
| 4 | 5 | TRON\n(TRX) | $0.0888\n\n\n0.000054 ETH | -0.01% | $219,852,244.00 | $7,904,153,443.00 | $230.91 | 2,208\n0.317% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Livepeer Token\n(LPT) | $5.84\n\n\n0.003563 ETH | -2.30% | $9,512,317.00 | $171,385,461.00 | $145,399,764.80 | 2,214,783\n-0.001% |
| 96 | 97 | SwissBorg\n(CHSB) | $0.1776\n\n\n0.000108 ETH | 3.16% | $779,820.00 | $169,312,466.00 | $177,591,000.00 | 18,478\n0.000% |
| 97 | 98 | Blur\n(BLUR) | $0.1742\n\n\n0.000106 ETH | 1.20% | $22,114,090.00 | $169,052,418.00 | $522,690,000.00 | 41,492\n0.000% |
| 98 | 99 | Compound Dai\n(cDAI) | $0.0224\n\n\n0.000014 ETH | 0.02% | $154.00 | $166,725,648.00 | $2,997,776,968.07 | 17,997\n0.006% |


Split the columns that contain "\n" into separate columns.

In [25]:
# Split between token name and token symbol
df[['Token', 'Symbol']] = df['Token'].str.split('\n', expand=True)

In [26]:
# Split between the USD and ETH prices
df[['Price', 'Price (ETH)']] = df['Price'].str.split('\n\n\n', expand=True)

In [27]:
# Split the number of holders and percent changes
df[['Holders', 'Holders Change (%)']] = df['Holders'].str.split('\n', expand=True)

In [28]:
df

| | # | Token | Price | Change (%) | Volume (24H) | Circulating Market Cap | On-Chain Market Cap | Holders | Symbol | Price (ETH) | Holders Change (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Tether USD | $1.00 | -- | $16,289,778,116.00 | $83,369,918,594.00 | $39,823,119,845.00 | 4580187 | (USDT) | 0.000610 ETH | -0.094% |
| 1 | 2 | BNB | $212.7004 | 0.05% | $331,713,816.00 | $32,723,133,157.00 | $3,526,470,297.89 | 285796 | (BNB) | 0.129764 ETH | 0.009% |
| 2 | 3 | USDC | $1.00 | 0.02% | $4,270,444,800.00 | $25,255,348,953.00 | $46,602,430,840.00 | 1790688 | (USDC) | 0.000610 ETH | -0.046% |
| 3 | 4 | stETH | $1,639.23 | -0.47% | $5,520,449.00 | $14,459,518,712.00 | $3,038,491,481.07 | 270064 | (stETH) | 1.000062 ETH | 0.064% |
| 4 | 5 | TRON | $0.0888 | -0.01% | $219,852,244.00 | $7,904,153,443.00 | $230.91 | 2208 | (TRX) | 0.000054 ETH | 0.317% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Livepeer Token | $5.84 | -2.30% | $9,512,317.00 | $171,385,461.00 | $145,399,764.80 | 2214783 | (LPT) | 0.003563 ETH | -0.001% |
| 96 | 97 | SwissBorg | $0.1776 | 3.16% | $779,820.00 | $169,312,466.00 | $177,591,000.00 | 18478 | (CHSB) | 0.000108 ETH | 0.000% |
| 97 | 98 | Blur | $0.1742 | 1.20% | $22,114,090.00 | $169,052,418.00 | $522,690,000.00 | 41492 | (BLUR) | 0.000106 ETH | 0.000% |
| 98 | 99 | Compound Dai | $0.0224 | 0.02% | $154.00 | $166,725,648.00 | $2,997,776,968.07 | 17997 | (cDAI) | 0.000014 ETH | 0.006% |
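
The three splits above rely on `str.split(..., expand=True)`, which returns one column per piece. The sketch below repeats the first split on two made-up rows and adds `n=1`, an assumption on my part that caps the result at two columns so the assignment does not fail if a cell ever contains an extra newline.

```python
import pandas as pd

# Two made-up rows in the same shape as the scraped 'Token' column
demo_df = pd.DataFrame({'Token': ['Tether USD\n(USDT)', 'BNB\n(BNB)']})

# Split on the first newline only; expand=True spreads the pieces into columns
demo_df[['Token', 'Symbol']] = demo_df['Token'].str.split('\n', n=1, expand=True)
print(demo_df)
```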


Convert the string columns into numerical columns

In [29]:
# Regular expression pattern to match numbers
pattern = r'([-+]?\d[\d,]*(?:\.\d+)?)'

In [30]:
# For each numerical column, convert the string to float numbers
for col_name in df:
    if col_name in ['Token', 'Symbol']:
        continue
    print(col_name)

    # Use df[col_name].str.extract() to extract the numbers and
    # .astype(float) to convert the string to float numbers
    df[col_name] = df[col_name].str.extract(pattern)
    df[col_name] = df[col_name].str.replace(',', '').astype(float)

#
Price
Change (%)
Volume (24H)
Circulating Market Cap
On-Chain Market Cap
Holders
Price (ETH)
Holders Change (%)
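
To see what the pattern captures, the sketch below applies it to a few strings copied from the table, using only the standard library. The helper name `to_number` is hypothetical. Strings without digits, such as `--`, have no match; in the pandas version above they become `NaN`.

```python
import re

# Same pattern as above: optional sign, digits with thousands
# separators, optional decimal part
pattern = r'([-+]?\d[\d,]*(?:\.\d+)?)'


def to_number(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return float(match.group(1).replace(',', ''))


for sample in ['$16,289,778,116.00', '0.000610 ETH', '-0.094%', '--']:
    print(sample, '->', to_number(sample))
```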


In [31]:
df

| | # | Token | Price | Change (%) | Volume (24H) | Circulating Market Cap | On-Chain Market Cap | Holders | Symbol | Price (ETH) | Holders Change (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Tether USD | 1.0000 | NaN | 1.628978e+10 | 8.336992e+10 | 3.982312e+10 | 4580187.0 | (USDT) | 0.000610 | -0.094 |
| 1 | 2.0 | BNB | 212.7004 | 0.05 | 3.317138e+08 | 3.272313e+10 | 3.526470e+09 | 285796.0 | (BNB) | 0.129764 | 0.009 |
| 2 | 3.0 | USDC | 1.0000 | 0.02 | 4.270445e+09 | 2.525535e+10 | 4.660243e+10 | 1790688.0 | (USDC) | 0.000610 | -0.046 |
| 3 | 4.0 | stETH | 1639.2300 | -0.47 | 5.520449e+06 | 1.445952e+10 | 3.038491e+09 | 270064.0 | (stETH) | 1.000062 | 0.064 |
| 4 | 5.0 | TRON | 0.0888 | -0.01 | 2.198522e+08 | 7.904153e+09 | 2.309100e+02 | 2208.0 | (TRX) | 0.000054 | 0.317 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96.0 | Livepeer Token | 5.8400 | -2.30 | 9.512317e+06 | 1.713855e+08 | 1.453998e+08 | 2214783.0 | (LPT) | 0.003563 | -0.001 |
| 96 | 97.0 | SwissBorg | 0.1776 | 3.16 | 7.798200e+05 | 1.693125e+08 | 1.775910e+08 | 18478.0 | (CHSB) | 0.000108 | 0.000 |
| 97 | 98.0 | Blur | 0.1742 | 1.20 | 2.211409e+07 | 1.690524e+08 | 5.226900e+08 | 41492.0 | (BLUR) | 0.000106 | 0.000 |
| 98 | 99.0 | Compound Dai | 0.0224 | 0.02 | 1.540000e+02 | 1.667256e+08 | 2.997777e+09 | 17997.0 | (cDAI) | 0.000014 | 0.006 |


Last but not least, remove the parentheses from the token symbol column

In [32]:
# A raw string avoids invalid-escape warnings; expand=False returns a Series
df['Symbol'] = df['Symbol'].str.extract(r'\((.*?)\)', expand=False)

In [33]:
df

| | # | Token | Price | Change (%) | Volume (24H) | Circulating Market Cap | On-Chain Market Cap | Holders | Symbol | Price (ETH) | Holders Change (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Tether USD | 1.0000 | NaN | 1.628978e+10 | 8.336992e+10 | 3.982312e+10 | 4580187.0 | USDT | 0.000610 | -0.094 |
| 1 | 2.0 | BNB | 212.7004 | 0.05 | 3.317138e+08 | 3.272313e+10 | 3.526470e+09 | 285796.0 | BNB | 0.129764 | 0.009 |
| 2 | 3.0 | USDC | 1.0000 | 0.02 | 4.270445e+09 | 2.525535e+10 | 4.660243e+10 | 1790688.0 | USDC | 0.000610 | -0.046 |
| 3 | 4.0 | stETH | 1639.2300 | -0.47 | 5.520449e+06 | 1.445952e+10 | 3.038491e+09 | 270064.0 | stETH | 1.000062 | 0.064 |
| 4 | 5.0 | TRON | 0.0888 | -0.01 | 2.198522e+08 | 7.904153e+09 | 2.309100e+02 | 2208.0 | TRX | 0.000054 | 0.317 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96.0 | Livepeer Token | 5.8400 | -2.30 | 9.512317e+06 | 1.713855e+08 | 1.453998e+08 | 2214783.0 | LPT | 0.003563 | -0.001 |
| 96 | 97.0 | SwissBorg | 0.1776 | 3.16 | 7.798200e+05 | 1.693125e+08 | 1.775910e+08 | 18478.0 | CHSB | 0.000108 | 0.000 |
| 97 | 98.0 | Blur | 0.1742 | 1.20 | 2.211409e+07 | 1.690524e+08 | 5.226900e+08 | 41492.0 | BLUR | 0.000106 | 0.000 |
| 98 | 99.0 | Compound Dai | 0.0224 | 0.02 | 1.540000e+02 | 1.667256e+08 | 2.997777e+09 | 17997.0 | cDAI | 0.000014 | 0.006 |
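
For values that are always wrapped in a single pair of parentheses, `str.strip('()')` is a simpler alternative to the regular expression. A sketch on made-up data; unlike `str.extract`, it leaves a value unchanged when it has no parentheses rather than returning `NaN`.

```python
import pandas as pd

# Made-up symbols in the same shape as the 'Symbol' column
symbols = pd.Series(['(USDT)', '(BNB)', '(stETH)'])

# Remove any leading or trailing '(' and ')' characters
clean_symbols = symbols.str.strip('()')
print(clean_symbols.tolist())
```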


Write the DataFrame to a CSV file

In [34]:
df.to_csv('token_prices.csv', index=False)
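
A quick way to confirm the export is to read the file back. The sketch below does the round trip through an in-memory buffer instead of `token_prices.csv`, on a made-up two-row frame, so it can be run without the scraped data.

```python
import io

import pandas as pd

# Made-up frame with the same kinds of columns as the cleaned table
demo_df = pd.DataFrame({
    'Token': ['Tether USD', 'BNB'],
    'Symbol': ['USDT', 'BNB'],
    'Holders': [4580187, 285796],
})

# Write without the index column, as in the cell above
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```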