<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#URLs" data-toc-modified-id="URLs-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>URLs</a></span></li><li><span><a href="#Request-&amp;-Response" data-toc-modified-id="Request-&amp;-Response-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Request &amp; Response</a></span></li><li><span><a href="#Parse-HTML-content" data-toc-modified-id="Parse-HTML-content-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Parse HTML content</a></span></li><li><span><a href="#Title-of-HTML-content" data-toc-modified-id="Title-of-HTML-content-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Title of HTML content</a></span></li><li><span><a href="#Find-All-Tables" data-toc-modified-id="Find-All-Tables-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Find All Tables</a></span></li><li><span><a href="#Find-Right-Table-to-scrap" data-toc-modified-id="Find-Right-Table-to-scrap-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Find Right Table to scrap</a></span></li><li><span><a href="#Number-of-Columns" data-toc-modified-id="Number-of-Columns-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Number of Columns</a></span></li><li><span><a href="#Get-the-Rows" data-toc-modified-id="Get-the-Rows-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Get the Rows</a></span></li><li><span><a href="#Get-Table-Header-Attributes" data-toc-modified-id="Get-Table-Header-Attributes-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Get Table Header Attributes</a></span></li><li><span><a href="#Get-Table-Data" data-toc-modified-id="Get-Table-Data-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Get Table Data</a></span><ul class="toc-item"><li><span><a href="#Data-Analysis" data-toc-modified-id="Data-Analysis-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Data Analysis</a></span></li></ul></li></ul></div>

In [2]:
# HTTP requests
import urllib.request
import requests
from requests import get
from requests.exceptions import RequestException
from contextlib import closing

# XML & HTML scraping
import lxml.html as lh
from bs4 import BeautifulSoup

# for table analysis
import pandas as pd

# write to csv
import csv

# Time
import time

## URLs

In [3]:
url1 = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

## Request & Response

In [4]:
# Fetch the page. If the request succeeds, the cell displays <Response [200]>.
response = requests.get(url1, timeout=10)
response

<Response [200]>
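
The cell above only displays the status. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` (not part of the original notebook), could turn HTTP errors and connection problems into a `None` result using the `RequestException` class imported earlier:

```python
import requests
from requests.exceptions import RequestException


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch_html` is a hypothetical helper introduced for illustration.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Calling `fetch_html(url1)` would then return the raw HTML that the parsing step expects, or `None` when something went wrong.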

## Parse HTML content

In [5]:
# Parse the response body into a BeautifulSoup tree using Python's built-in HTML parser
soup = BeautifulSoup(response.content, 'html.parser')

In [6]:
# Build an indented string version of the parsed HTML: pretty_soup
pretty_soup = soup.prettify()

## Title of HTML content

In [7]:
# title
soup.title.string

'List of countries and dependencies by population - Wikipedia'

## Find All Tables

In [8]:
# Find all <table> elements in the HTML
all_tables = soup.find_all('table')

## Find Right Table to scrap

In [9]:
right_table = soup.find('table', {"class": 'wikitable sortable'})
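
To decide which table is the right one, it can help to look at the `class` attribute of every table first. The sketch below uses a small made-up HTML snippet (`sample_html` is an assumption, not the Wikipedia page) and also shows a CSS selector lookup via `select_one`, which matches the two classes individually rather than the full attribute string:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page
sample_html = """
<table class="infobox"><tr><td>x</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th></tr>
  <tr><td>1</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# List the classes of every table to see which one to pick
table_classes = [t.get("class", []) for t in sample_soup.find_all("table")]
print(table_classes)

# CSS selector: a table that carries both classes
target = sample_soup.select_one("table.wikitable.sortable")
print(target.find("th").text)
```

The selector form is probably the more forgiving choice if the page later adds further classes to the table.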

## Number of Columns

In [10]:
# Count the <td> cells in the first data row (row 0 is the header row)
first_data_row = right_table.find_all("tr")[1]
cells = first_data_row.find_all('td')

len(cells)

6
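
Checking a single row says nothing about the others. A rough sketch for verifying that every row has the same width, using a made-up table (`sample_html` is an assumption), is to tally the `<td>` counts with `collections.Counter`:

```python
from collections import Counter

from bs4 import BeautifulSoup

# Made-up table: one header row and two data rows
sample_html = (
    "<table>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><td>1</td><td>2</td></tr>"
    "<tr><td>3</td><td>4</td></tr>"
    "</table>"
)
table = BeautifulSoup(sample_html, "html.parser").find("table")

# Map "number of <td> cells" -> "how many rows have that many"
width_counts = Counter(len(tr.find_all("td")) for tr in table.find_all("tr"))
print(width_counts)
```

The header row shows up with a count of 0 because its cells are `<th>` elements, not `<td>`.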

## Get the Rows

In [11]:
# Number of rows, including the header row
rows = right_table.find_all("tr")
len(rows)

241

## Get Table Header Attributes

In [12]:
# header attributes of the table
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)
print('------------')
print(len(header))

['Rank', 'Country(or dependent territory)', 'Population', 'Date', '% of worldpopulation', 'Source']
------------
6
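
Names such as `'Country(or dependent territory)'` look run together. One plausible cause, which is an assumption here since the raw header HTML is not shown, is a `<br/>` inside the `<th>` cell that `.text` silently drops. A small sketch on made-up markup shows how `get_text` with a separator keeps the words apart:

```python
from bs4 import BeautifulSoup

# Made-up header row; the <br/> tags are an assumption about the real page
sample_html = (
    "<table><tr>"
    "<th>Rank</th>"
    "<th>Country<br/>(or dependent territory)</th>"
    "<th>% of world<br/>population</th>"
    "</tr></table>"
)
header_row = BeautifulSoup(sample_html, "html.parser").find("tr")

# Same approach as the notebook: text fragments are concatenated directly
header_joined = [th.text.rstrip() for th in header_row.find_all("th")]

# Alternative: join the fragments with a space and strip each one
header_spaced = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]

print(header_joined)
print(header_spaced)
```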


## Get Table Data

### Data Analysis

In [14]:
lst_data = []
for row in rows[1:]:
    data = [d.text.rstrip() for d in row.find_all('td')]
    lst_data.append(data)

In [15]:
# sample records
lst_data[0:3]

[['1',
  ' China[Note 2]',
  '1,397,200,000',
  'May 5, 2019',
  '18.1%',
  'Official population clock'],
 ['2',
  ' India[Note 3]',
  '1,346,800,000',
  'May 5, 2019',
  '17.5%',
  'Official population clock'],
 ['3',
  ' United States[Note 4]',
  '329,147,000',
  'May 5, 2019',
  '4.27%',
  'Official population clock']]

In [43]:
# length of each record
len(lst_data[0])

6
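
The sample records still carry a leading space, footnote markers such as `[Note 2]`, thousands separators, and percent signs. A minimal cleaning sketch, assuming a hypothetical helper `clean_record` and the six-field layout shown above, might look like this:

```python
import re


def clean_record(record):
    """Return a cleaned copy of one six-field record (hypothetical helper)."""
    rank, country, population, date, share, source = record
    # Drop bracketed footnote markers like "[Note 2]" and surrounding spaces
    country = re.sub(r"\[[^\]]*\]", "", country).strip()
    # "1,397,200,000" -> 1397200000
    population = int(population.replace(",", ""))
    # "18.1%" -> 18.1
    share = float(share.rstrip("%"))
    return [rank, country, population, date, share, source]


sample = ['1', ' China[Note 2]', '1,397,200,000', 'May 5, 2019',
          '18.1%', 'Official population clock']
cleaned = clean_record(sample)
print(cleaned)
```

Rank is left as a string on purpose, since some rows use a dash in place of a number.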

In [20]:
# HTML of each table record

list_row = []
for row in right_table.find_all("tr"):
    list_row.append(row)

print('Number of rows:', len(list_row))
print('----------------')
print(list_row[1])
print('----------------')
print('The second cell contains a link reference')
print('----------------')
print(list_row[1].find_all('th'))
print('----------------')
print(list_row[1].find('a').text)

Number of rows: 241
----------------
<tr>
<td>1
</td>
<td align="left"><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/China" title="China">China</a><sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[Note 2]</a></sup>
</td>
<td>1,397,200,000</td>
<td>May 5, 2019</td>
<td>18.1%
</td>
<td align="left"><a class="external text" href="http://data.stats.gov.

In [40]:
# Generate one list per column
c1 = []
c2 = []
c3 = []
c4 = []
c5 = []
c6 = []

for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    if len(cells) == 6:  # only body rows have six <td> cells; the header row uses <th>
        c1.append(cells[0].find(string=True))
        c2.append(cells[1].find('a').text)  # country name: text of the first link in the cell
        c3.append(cells[2].find(string=True))
        c4.append(cells[3].find(string=True))
        c5.append(cells[4].find(string=True))
        c6.append(cells[5].find(string=True))
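
Taking the text of the first link works as long as every country cell contains an `<a>` tag; if one does not, `find('a')` returns `None` and `.text` raises an `AttributeError`. A guarded variant, sketched with a hypothetical helper `link_text_or_cell_text` on made-up markup, falls back to the plain cell text:

```python
from bs4 import BeautifulSoup


def link_text_or_cell_text(cell):
    """Return the first link's text, or the cell's own text if it has no link.

    Hypothetical helper introduced for illustration.
    """
    link = cell.find("a")
    if link is not None:
        return link.get_text(strip=True)
    return cell.get_text(strip=True)


# Made-up row: one cell with a link plus a footnote, one cell without a link
sample_row = BeautifulSoup(
    '<table><tr>'
    '<td><a href="/wiki/China" title="China">China</a>'
    '<sup><a href="#cite_note-4">[Note 2]</a></sup></td>'
    '<td>World</td>'
    '</tr></table>',
    "html.parser",
).find("tr")

names = [link_text_or_cell_text(td) for td in sample_row.find_all("td")]
print(names)
```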

In [37]:
# Create a dictionary keyed by the column names, with 0 as a placeholder value
d = dict.fromkeys(header, 0)
d

{'Rank': 0,
 'Country(or dependent territory)': 0,
 'Population': 0,
 'Date': 0,
 '% of worldpopulation': 0,
 'Source': 0}

In [38]:
# Replace each placeholder with the corresponding column list
d['Rank'] = c1
d['Country(or dependent territory)'] = c2
d['Population'] = c3
d['Date'] = c4
d['% of worldpopulation'] = c5
d['Source'] = c6
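
The six lists and the placeholder dictionary can also be built in one step by transposing row-oriented records with `zip`. This is only a sketch on a shortened, made-up header and two sample records, not the notebook's actual data:

```python
# Shortened header and two sample records, for illustration only
header = ["Rank", "Country", "Population"]
records = [
    ["1", "China", "1,397,200,000"],
    ["2", "India", "1,346,800,000"],
]

# zip(*records) turns rows into columns; pair each column with its name
columns = {name: list(values) for name, values in zip(header, zip(*records))}
print(columns)
```

Note that `zip` stops at the shortest row, so rows with missing cells would need to be filtered or padded first.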

In [39]:
# convert dict to DataFrame
df_table = pd.DataFrame(d)

# Top 5 records
df_table.head(5)

|   | Rank | Country(or dependent territory) | Population | Date | % of worldpopulation | Source |
|---|------|---------------------------------|------------|------|----------------------|--------|
| 0 | 1 | China | 1397200000 | May 5, 2019 | 18.1% | Official population clock |
| 1 | 2 | India | 1346800000 | May 5, 2019 | 17.5% | Official population clock |
| 2 | 3 | United States | 329147000 | May 5, 2019 | 4.27% | Official population clock |
| 3 | 4 | Indonesia | 268074600 | July 1, 2019 | 3.48% | Official annual projection |
| 4 | 5 | Brazil | 209863000 | May 5, 2019 | 2.72% | Official population clock |
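
If the row-oriented records are already available, the DataFrame can also be built from them directly, and the population strings converted to integers so they sort and sum correctly. The following is a sketch on two made-up sample rows; `df_sketch` is a hypothetical name:

```python
import pandas as pd

# Shortened header and sample rows, for illustration only
header = ["Rank", "Country", "Population"]
records = [
    ["1", "China", "1,397,200,000"],
    ["2", "India", "1,346,800,000"],
]

# One row per record, one column per header name
df_sketch = pd.DataFrame(records, columns=header)

# Remove thousands separators, then convert the column to integers
df_sketch["Population"] = (
    df_sketch["Population"].str.replace(",", "", regex=False).astype(int)
)
print(df_sketch)
```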


In [41]:
# Last 5 records
df_table.tail(5)

|   | Rank | Country(or dependent territory) | Population | Date | % of worldpopulation | Source |
|---|------|---------------------------------|------------|------|----------------------|--------|
| 235 | – | Niue | 1520 | July 1, 2018 | 0.000020% | Official annual estimate |
| 236 | – | Tokelau | 1400 | July 1, 2018 | 0.000018% | Official annual estimate |
| 237 | 195 | Vatican City | 800 | January 1, 2014 | 0.000010% | Official estimate |
| 238 | – | Cocos (Keeling) Islands | 538 | June 30, 2018 | 0.0000070% | Official estimate |
| 239 | – | Pitcairn Islands | 50 | January 1, 2019 | 0.00000065% | Official estimate |
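
The `csv` module is imported at the top of the notebook for writing the result out. A minimal sketch of that step, assuming a hypothetical helper `write_rows` and writing to an in-memory buffer instead of a real file, could be:

```python
import csv
import io


def write_rows(file_obj, header, records):
    """Write a header row followed by the records (hypothetical helper)."""
    writer = csv.writer(file_obj)
    writer.writerow(header)
    writer.writerows(records)


# In-memory buffer so the sketch runs without touching the file system
buffer = io.StringIO()
write_rows(
    buffer,
    ["Rank", "Country", "Population"],
    [["1", "China", "1,397,200,000"]],
)
csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain commas are quoted automatically. For a real file, the buffer would be replaced by something like `open(path, "w", newline="", encoding="utf-8")`, where `path` is whatever output location is chosen.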
