# Scraping Tabular Data for Programmatic SEO


This script is a short walkthrough of:

- scraping a webpage using Beautiful Soup
- extracting tabular data

### About Alton
Follow me for more data and tutorials:
- Twitter: https://twitter.com/alton_lex (@alton_lex)
- LinkedIn: https://www.linkedin.com/in/altonalexander/

In [2]:
# url we will be scraping

url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"

In [13]:
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [6]:
# download a copy of the html
website_html = requests.get(url).text

# parse the html into a BeautifulSoup object (the 'soup')
soup = BeautifulSoup(website_html, 'lxml')

# preview the first 1000 characters of the formatted html
print(soup.prettify()[0:1000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(
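
The one-line download above works, but `requests.get` has no default timeout and does not raise an error when the server answers with a 4xx or 5xx status. A minimal sketch of a more defensive download follows. The helper name `fetch_html` and the User-Agent string are assumptions made up for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent string; replace it with your own details.
    headers = {"User-Agent": "table-scraper-example/0.1 (you@example.com)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly returning the error page's HTML.
    response.raise_for_status()
    return response.text
```

With this helper, the download step would become `website_html = fetch_html(url)`.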

In [45]:
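data = []

# find every table on the page and keep the second one (index 1)
tables = soup.find_all('table')
table = tables[1]
table_body = table.find('tbody')

# collect the text of each <td> cell, row by row
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    # strip surrounding whitespace and remove asterisks
    cols = [ele.text.strip().replace("*", "") for ele in cols]
    data.append([ele for ele in cols if ele])  # get rid of empty values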
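
The loop above drops empty cells, which can shift later values one column to the left whenever a row has a blank cell in the middle. Here is a minimal sketch that keeps blanks as empty strings so every row stays aligned. It runs on a small made-up table rather than the Wikipedia page, and selecting the table by the `wikitable` class is an assumption about the markup, not something checked against the live page.

```python
from bs4 import BeautifulSoup

# Made-up sample table with blank cells in different positions.
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Notes</th><th>Facts</th></tr>
  <tr><td>1</td><td>Aland*</td><td></td><td>Largest in the sample.</td></tr>
  <tr><td>2</td><td>Borduria</td><td>Excludes islands.</td><td></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Select the table by class instead of by position on the page.
sample_table = sample_soup.find("table", class_="wikitable")

sample_rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # header rows contain only <th> cells
    # Keep empty cells as "" so each value stays in its own column.
    sample_rows.append(
        [cell.get_text(strip=True).replace("*", "") for cell in cells]
    )

print(sample_rows)
```

Every row now has the same number of values, so the column labels assigned later line up with the data.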

In [46]:
# get the header text from every <th> cell in the table

headers = table_body.find_all('th')
headers_list = [header.text.strip() for header in headers]

headers_list

['Rank', 'Country', 'Area', 'Notes', 'Facts', 'km²', 'sq mi']
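
The list above has seven names, and 'km²' and 'sq mi' come last. That suggests 'Area' is a group header spanning those two sub-columns in a second header row. One way to get one name per data column is to expand each header cell's `rowspan` and `colspan` onto a grid. The sketch below does that on made-up markup assumed to resemble the page's header. `flatten_headers` is a hypothetical helper, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup

# Made-up header markup, assumed to resemble the two-row header on the page.
sample_html = """
<table>
  <tr>
    <th rowspan="2">Rank</th><th rowspan="2">Country</th>
    <th colspan="2">Area</th>
    <th rowspan="2">Notes</th><th rowspan="2">Facts</th>
  </tr>
  <tr><th>km²</th><th>sq mi</th></tr>
  <tr><td>1</td><td>Aland</td><td>100</td><td>39</td><td></td><td></td></tr>
</table>
"""


def flatten_headers(header_rows):
    """Return one label per column, expanding rowspan and colspan."""
    grid = {}  # (row, col) -> label
    for r, tr in enumerate(header_rows):
        c = 0
        for th in tr.find_all("th"):
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            rowspan = int(th.get("rowspan", 1))
            colspan = int(th.get("colspan", 1))
            label = th.get_text(strip=True)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = label
            c += colspan
    n_cols = max(col for _, col in grid) + 1
    names = []
    for c in range(n_cols):
        parts = []
        for r in range(len(header_rows)):
            label = grid.get((r, c))
            if label and label not in parts:
                parts.append(label)  # join group and sub-header once each
        names.append(" ".join(parts))
    return names


sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
# Header rows are the rows that contain no <td> cells.
header_rows = [tr for tr in sample_table.find_all("tr") if not tr.find("td")]
column_names = flatten_headers(header_rows)
print(column_names)
```

On this sample the result has six names, with the two area columns labeled 'Area km²' and 'Area sq mi'.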

In [47]:
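df = pd.DataFrame(data)

# assign the headers to the table columns (truncated to the number of columns)
# note: 'Area' is a group header over 'km²' and 'sq mi', so the labels from
# 'Area' onward are offset from the data they sit above
df.columns = headers_list[0:len(df.columns)]

df.head()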

Unnamed: 0,Rank,Country,Area,Notes,Facts,km²
0,,,,,,
1,,,,,,
2,1.0,Russia,13129142.0,5069190.0,"17,098,242 km2 (6,601,668 sq mi) including Eur...",Largest country in the world by area.
3,2.0,China,9615222.0,3712458.0,"Excluding Taiwan, Hong Kong and Macau.",Largest country by population in Asia as well ...
4,3.0,India,3287263.0,1269219.0,Largest democratic country of the world.,


In [48]:
# cleanup

# drop the empty rows (the header rows, which contain no <td> cells)
df = df[df['Notes'].notnull()]

# print the table
df

Unnamed: 0,Rank,Country,Area,Notes,Facts,km²
2,1,Russia,13129142,5069190,"17,098,242 km2 (6,601,668 sq mi) including Eur...",Largest country in the world by area.
3,2,China,9615222,3712458,"Excluding Taiwan, Hong Kong and Macau.",Largest country by population in Asia as well ...
4,3,India,3287263,1269219,Largest democratic country of the world.,
5,4,Kazakhstan,2544900,982600,"2,724,900 km2 (1,052,100 sq mi) including Euro...",Largest landlocked country in Asia and in the ...
6,5,Saudi Arabia,2149690,830000,,
7,6,Iran,1648195,636372,,
8,7,Mongolia,1564110,603910,,
9,8,Indonesia,1502029,579937,"1,904,569 km2 (735,358 sq mi) including the In...",Largest island country in Asia as well as in t...
10,9,Pakistan,881913,340509,,
11,10,Turkey,759592,293280,"783,562 km2 (302,535 sq mi) including European...",
