<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#-to-" data-toc-modified-id="-to--1"><span class="toc-item-num">1&nbsp;&nbsp;</span> to </a></span><ul class="toc-item"><li><span><a href="#Import" data-toc-modified-id="Import-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import</a></span></li><li><span><a href="#URLs" data-toc-modified-id="URLs-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>URLs</a></span></li><li><span><a href="#Request-&amp;-Response" data-toc-modified-id="Request-&amp;-Response-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Request &amp; Response</a></span></li><li><span><a href="#Parse-HTML-content" data-toc-modified-id="Parse-HTML-content-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Parse HTML content</a></span></li><li><span><a href="#Title-of-HTML-content" data-toc-modified-id="Title-of-HTML-content-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Title of HTML content</a></span></li><li><span><a href="#Find-All-Tables" data-toc-modified-id="Find-All-Tables-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Find All Tables</a></span></li><li><span><a href="#Find-Right-Table-to-scrap" data-toc-modified-id="Find-Right-Table-to-scrap-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Find Right Table to scrap</a></span></li><li><span><a href="#Get-the-Rows" data-toc-modified-id="Get-the-Rows-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Get the Rows</a></span></li><li><span><a href="#Get-Table-Header-Attributes" data-toc-modified-id="Get-Table-Header-Attributes-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Get Table Header Attributes</a></span></li><li><span><a href="#Get-Table-Data" data-toc-modified-id="Get-Table-Data-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Get Table Data</a></span></li><li><span><a href="#Case-1:-Dataset-mismatch-with-Header-attributes" data-toc-modified-id="Case-1:-Dataset-mismatch-with-Header-attributes-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Case 1: Dataset mismatch with Header attributes</a></span></li><li><span><a href="#Case-2:-No-Dataset-mismatch-with-Header-attributes" data-toc-modified-id="Case-2:-No-Dataset-mismatch-with-Header-attributes-1.12"><span class="toc-item-num">1.12&nbsp;&nbsp;</span>Case 2: No Dataset mismatch with Header attributes</a></span></li><li><span><a href="#Appendix" data-toc-modified-id="Appendix-1.13"><span class="toc-item-num">1.13&nbsp;&nbsp;</span>Appendix</a></span></li><li><span><a href="#Request-using-urllib" data-toc-modified-id="Request-using-urllib-1.14"><span class="toc-item-num">1.14&nbsp;&nbsp;</span>Request using urllib</a></span></li><li><span><a href="#Find-All-Links" data-toc-modified-id="Find-All-Links-1.15"><span class="toc-item-num">1.15&nbsp;&nbsp;</span>Find All Links</a></span></li></ul></li></ul></div>

Notes: 

Web scraping is a computer software technique of extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

BeautifulSoup: It is an incredible tool for pulling out information from a webpage.BeautifulSoup does not fetch the web page for us. 

This syntax has various tags as elaborated below:

<!DOCTYPE html> : HTML documents must start with a type declaration
HTML document is contained between <html> and </html>
The visible part of the HTML document is between <body> and </body>
HTML headings are defined with the <h1> to <h6> tags
HTML paragraphs are defined with the <p> tag
An unordered list starts with the <ul> tag. Each list item starts with the <li> tag.
    
HTML links are defined with the <a> tag, “<a href=“http://www.test.com”>This is a link for test.com</a>”

An HTML table is defined with the <table> tag.
Each table row is defined with the <tr> tag. A table header is defined with the <th> tag. By default, table headings are bold and centered. A table data/cell is defined with the <td> tag.
    
    https://www.w3schools.com/html/html_tables.asp
    

## Import

In [79]:
# Requests
import urllib.request #
import requests

# for xml & html scrapping 
import lxml.html as lh
from bs4 import BeautifulSoup

# for table analysis
import pandas as pd

# write to csv
import csv

## URLs

In [121]:
url1 = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
url2 = "https://en.wikipedia.org/wiki/List_of_national_independence_days"
url3 = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

## Request & Response

In [181]:
# get the web content
response = requests.get(url2, timeout=10)
response

<Response [200]>

## Parse HTML content

In [182]:
# parse response content to html
soup = BeautifulSoup(response.content, 'html.parser')

In [183]:
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

## Title of HTML content

In [184]:
# title
soup.title.string

'List of national independence days - Wikipedia'

## Find All Tables

In [136]:
# find all the tables in the html
all_tables=soup.find_all('table')

## Find Right Table to scrap

In [185]:
#right_table=soup.find('table', {"class":'wikitable sortable plainrowheaders'})
#right_table=soup.find('table', {"class":'wikitable sortable'})
right_table=soup.find('table', {"class":'wikitable sortable'})
#print(right_table)

In [186]:
for row in right_table.findAll("tr"):
    cells = row.findAll('td')

len(cells)

5

## Get the Rows

In [187]:
rows = right_table.findAll("tr")
len(rows)

192

## Get Table Header Attributes

In [193]:
# header attributes of the table
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)
print('------------')
print(len(header))

['Country', 'Date of holiday', 'Year celebrated', 'Event celebrated', 'Name of holiday']
------------
5


## Get Table Data

In [188]:
lst_data = []
for row in rows[1:]:
            data = [d.text.rstrip() for d in row.find_all('td')]
            lst_data.append(data)

In [189]:
# sample records
lst_data[0:3]

[['\xa0Afghanistan',
  'August\xa019',
  '1919',
  'Independence from the United Kingdom in 1919.',
  'Afghan Independence Day'],
 ['\xa0Albania',
  'November\xa028',
  '1912',
  'Declared by Ismail Qemal Vlora in 1912 and signaled the end of five centuries of Ottoman rule.',
  'Independence Day/Dita e Pavarësisë'],
 ['\xa0Algeria',
  'July\xa05',
  '1962',
  'Independence from France in 1962.',
  'Independence Day (Algeria)']]

In [190]:
len(lst_data[0])

5

## Case 1: Dataset mismatch with Header attributes 

Each data records has 6 attributes, whereas header has 7 attributes. "State" is missing in the data attributes. Reason is that it has captured as 'th' tag 

In [177]:
list_row = []
for row in right_table.findAll("tr"):
    list_row.append(row)

    
print('Number of row :',len(list_row))
print('----------------')
print(list_row[1])
print('----------------')
print('Second Attribute is has link reference')
print('----------------')
print(list_row[1].findAll('th'))
print('----------------')
print(list_row[1].find('a').text)

Number of row : 37
----------------
<tr>
<td>1
</td>
<th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands" title="Andaman and Nicobar Islands">Andaman and Nicobar Islands</a> <img alt="union territory" data-file-height="14" data-file-width="9" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" width="9"/>
</th>
<td><a href="/wiki/Port_Blair" title="Port Blair">Port Blair</a>
</td>
<td> —
</td>
<td>Kolkata
</td>
<td>1955
</td>
<td>Calcutta (1945–1955)
</td></tr>
----------------
Second Attribute is has link reference
----------------
[<th scope="row"><a href="/wiki/Andaman_and_Nicobar_Islands" title="Andaman and Nicobar Islands">Andaman and Nicobar Islands</a> <img alt="union territory" data-file-height="14" data-file-width="9" decoding="async" height="14" src="//upload.wikimedia.org/wikipedia/commons/3/37/Dagger-14-plain.png" width="9"/>
</th>]
----------------
Andaman and Nicobar Islands


In [178]:
#Generate lists
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.findAll("tr"):
    cells  =  row.findAll('td')
    states = row.findAll('th') # second attribute has th(header) tag 
    if len(cells)==6: # length of the table record
        A.append(cells[0].find(text=True))
        B.append(states[0].find(text=True))
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
        G.append(cells[5].find(text=True))

In [180]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df.head(5)

Unnamed: 0,Number,State/UT,Admin_Capital,Legislative_Capital,Judiciary_Capital,Year_Capital,Former_Capital
0,1,Andaman and Nicobar Islands,Port Blair,—,Kolkata,1955,Calcutta (1945–1955)
1,2,Andhra Pradesh,Hyderabad,Amaravati,Amaravati,1956,Kurnool
2,3,Arunachal Pradesh,Itanagar,Itanagar,Guwahati,1986,—
3,4,Assam,Dispur,Guwahati,Guwahati,1975,Shillong
4,5,Bihar,Patna,Patna,Patna,1912,—


## Case 2: No Dataset mismatch with Header attributes 

In [191]:
list_row = []
for row in right_table.findAll("tr"):
    list_row.append(row)

    
print('Number of row :',len(list_row))
print('----------------')
print(list_row[1])
print('----------------')
print('Second Attribute is has link reference')
print('----------------')
print(list_row[1].findAll('th'))
print('----------------')
print(list_row[1].find('a').text)

Number of row : 192
----------------
<tr style="vertical-align: top;">
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/23px-Flag_of_Afghanistan.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/35px-Flag_of_Afghanistan.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/45px-Flag_of_Afghanistan.svg.png 2x" width="23"/> </span><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>
</td>
<td><span data-sort-value="08-19 !">August 19</span>
</td>
<td>1919
</td>
<td>Independence from the <a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a> in 1919.
</td>
<td><a href="/wiki/Afghan_Independence_Day" title="Afghan Independence Day">Afghan Independence Day</a>
</td></tr>
----------------
Second Attribute is has link refere

In [206]:
#Generate lists
#A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
#G=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==5: #Only extract table body not heading
        #A.append(cells[0].find(text=True))
        B.append(cells[0].find('a').text)  # fetch the text of the url in td tag. 
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[2].find(text=True))
        F.append(cells[4].find(text=True))
        #G.append(cells[5].find(text=True))

In [208]:
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    
cells[4]

<td>
</td>

In [209]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(B,columns=['Country'])
df['Date of holiday']=C
df['Year celebrated']=D
df['Event celebrated']=E
df['Name of holiday']=F
#df['Former_Capital']=G
df


Unnamed: 0,Country,Date of holiday,Year celebrated,Event celebrated,Name of holiday
0,Afghanistan,August 19,1919,1919,Afghan Independence Day
1,Albania,November 28,1912,1912,Independence Day/Dita e Pavarësisë
2,Algeria,July 5,1962,1962,Independence Day (Algeria)
3,Angola,November 11,1975,1975,
4,Anguilla,May 30,1967,1967,Anguilla Day
5,Antigua and Barbuda,November 1,1981,1981,Independence Day
6,Argentina,July 9,1816,1816,Independence Day
7,Armenia,May 28,1918,1918,
8,Australia,January 1,1901,1901,Independence Day (not to be confused with
9,Austria,October 26,1955,1955,National Day


In [85]:
 with open('output.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in rows[1:]:
            data = [th.text.rstrip() for th in row.find_all('td')]
            writer.writerow(data)

In [68]:
for row in rows[1:]:
            data = [th.text.rstrip() for th in row.find_all('td')]


In [69]:
data

['–',
 ' Pitcairn Islands (UK)',
 '50',
 'January 1, 2019',
 '0.00000065%',
 'Official estimate']

In [62]:
#Generate lists
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]
G=[]
for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    states=row.findAll('th') #To store second column data
    if len(cells)==6: #Only extract table body not heading
        A.append(cells[0].find(text=True))
        #B.append(states[0].find(text=True))
        C.append(cells[1].find(text=True))
        D.append(cells[2].find(text=True))
        E.append(cells[3].find(text=True))
        F.append(cells[4].find(text=True))
        G.append(cells[5].find(text=True))

In [64]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(A,columns=['Number'])
#df['State/UT']=B
df['Admin_Capital']=C
df['Legislative_Capital']=D
df['Judiciary_Capital']=E
df['Year_Capital']=F
df['Former_Capital']=G
df

Unnamed: 0,Number,Admin_Capital,Legislative_Capital,Judiciary_Capital,Year_Capital,Former_Capital
0,1,,1397180000,"May 4, 2019",18.1%,Official population clock
1,2,,1346760000,"May 4, 2019",17.5%,Official population clock
2,3,,329142000,"May 4, 2019",4.27%,Official population clock
3,4,,268074600,"July 1, 2019",3.48%,Official annual projection
4,5,,209859000,"May 4, 2019",2.72%,Official population clock
5,6,,204545000,"May 4, 2019",2.66%,Official population clock
6,7,,193392517,"July 1, 2016",2.51%,Official estimate
7,8,,166493000,"May 4, 2019",2.16%,Official population clock
8,9,,146793744,"January 1, 2019",1.91%,Official estimate
9,10,,126577691,"July 1, 2019",1.64%,Official annual projection


## Appendix

## Request using urllib

In [None]:
# page = urllib.request.urlopen(url1)
# soup = BeautifulSoup(page)

## Find All Links

In [None]:
# find all links in the html
#all_links = soup.find_all("a")
#for link in all_links:
#    print(link.get("href"))