<b>Import the libraries.</b>
<br>
<b>BeautifulSoup</b> is used to parse HTML pages and retrieve the necessary information from them.
<br>
<b>Request, urlopen</b> are used to fetch the pages. Request builds the HTTP request (here with a custom header) and urlopen opens the URL and returns the response.
<br>
<b>Pandas</b> is used to clean and manage the data and to store it in CSV format.
<br>
<b>re</b> provides regular expressions for manipulating strings, finding keywords and so on.
<br>
<b>pickle</b> is used to store Python objects in files.
<br>
<b>os</b> provides operating-system-dependent functions. Here we use it to create a folder to store files.
<br>
<b>glob</b> is used to find pathnames that match a pattern.

In [17]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError
import pandas as pd
import re
import pickle
import os
import glob

Get the URLs to scrape. Most websites list products or information page by page, so you can increment the page number and store each URL in a list.

In [6]:
baseUrl = 'http://example.webscraping.com/places/default/index/'
urls = []
for i in range(1, 5):
    urls.append(baseUrl+str(i))

Define a function that returns the raw HTML for a URL, or prints an error message and returns False if the server responds with an HTTP error.

In [4]:
def getConnection(my_url):
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = Request(my_url, headers=hdr)
    try:
        uClient = urlopen(req)
        rawData = uClient.read()
        # close the content
        uClient.close()
    except HTTPError:
        print("Http error")
        rawData = False
    return rawData
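getConnection only catches HTTPError, so a DNS failure or a refused connection (which raise URLError) would still stop the notebook. A minimal sketch of a more defensive variant is shown here. The name get_connection_safe, the timeout value and the injectable opener argument are assumptions added for illustration and are not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def get_connection_safe(my_url, timeout=10, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical variant of getConnection. `opener` defaults to urlopen and
    can be swapped out so the function can be tried without network access.
    """
    req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        # The context manager closes the response even if read() fails.
        with opener(req, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error', err.code, 'for', my_url)
    except URLError as err:
        # No usable answer at all (DNS failure, refused connection, ...).
        print('URL error', err.reason, 'for', my_url)
    return None
```

Returning None instead of False keeps the "no data" case falsy, so a check such as `if not rawData` behaves the same for both versions.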

Store the unprocessed HTML data of every page in a list.

In [8]:
rawList = []
for url in urls:
    print(url)
    rawList.append(getConnection(url))

http://example.webscraping.com/places/default/index/1
http://example.webscraping.com/places/default/index/2
http://example.webscraping.com/places/default/index/3
http://example.webscraping.com/places/default/index/4


Extract the links with relevant information from each page. Check whether each link is relative or absolute; if it is relative, add the base URL in front.

In [11]:
linklist = []
base_link = 'http://example.webscraping.com'
for rawdata in rawList:
    bs = BeautifulSoup(rawdata, 'html.parser')
    container = bs.find('div', {'id': 'results'})
    link_items = container.select('td')
    for link_item in link_items:
        if link_item.find('a') is not None:
            link = link_item.find('a')['href']
            if link.startswith('/'):
                link = base_link + link
            linklist.append(link)
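The check with startswith('/') covers links that are relative to the site root. As a sketch of a more general approach, the standard library's urljoin resolves root-relative, page-relative and absolute links against the URL of the page they came from. The href values below are hypothetical examples, not links taken from the scraped pages.

```python
from urllib.parse import urljoin

# URL of the page the links were found on (first index page from above).
page_url = 'http://example.webscraping.com/places/default/index/1'

# Hypothetical href values of the three kinds a page might contain.
hrefs = [
    '/places/default/view/Argentina-11',   # relative to the site root
    'view/Armenia-12',                     # relative to the current page
    'http://example.webscraping.com/places/default/view/Aruba-13',  # absolute
]

# urljoin leaves absolute URLs untouched and resolves the relative ones.
resolved = [urljoin(page_url, href) for href in hrefs]
for link in resolved:
    print(link)
```

Note that a page-relative link is resolved against the directory of the current page, so the second example becomes http://example.webscraping.com/places/default/index/view/Armenia-12.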

Now collect the HTML data from the extracted links and store it in a variable. In this sample we only fetch the first five links. Depending on your needs, you can write the data to files or just keep it in a variable. If there are many URLs and you might need to scrape other information from the same pages in the future, store the data in files. You can store the whole list in one file or, as shown here, one file per page.

In [21]:
datacollection = []
for url in linklist[:5]:
    foldername = 'rawdata'
    filename = url.split('/')[-1]+'.txt'
    if not os.path.isdir(foldername):
        os.mkdir(foldername)
    data = getConnection(url)
    with open(foldername+'/'+filename, 'wb') as f:
        pickle.dump(data, f)
    datacollection.append(data)

After you have stored the files, you can run this code to load the data back into a variable. This is useful when you have to turn off the PC and resume scraping later.

In [23]:
# define the folder again so this cell also works in a fresh session
foldername = 'rawdata'
filenames = glob.glob(foldername+'/*.txt')
datacollection = []
for filename in filenames:
    with open(filename, 'rb') as f:
        datacollection.append(pickle.load(f))
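The cells above pickle each page, even though the downloaded data is already a bytes object. As an alternative sketch, the bytes can be written to disk directly, which keeps the saved files readable in a browser or text editor. The helper names save_raw and load_all_raw and the .html extension are assumptions made for this example.

```python
from pathlib import Path


def save_raw(folder, name, data):
    """Write raw HTML bytes to folder/name.html and return the path."""
    folder = Path(folder)
    # Create the folder (and any missing parents) if it does not exist yet.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / (name + '.html')
    path.write_bytes(data)
    return path


def load_all_raw(folder):
    """Read every .html file in folder back into a list of bytes objects."""
    # sorted() makes the order independent of the file system.
    return [p.read_bytes() for p in sorted(Path(folder).glob('*.html'))]
```

With these helpers, the download loop would call something like save_raw('rawdata', url.split('/')[-1], data), and a later session would restore everything with load_all_raw('rawdata').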

Now extract the information you need from the stored pages and collect it in the structure you require. Here each country becomes a dictionary of row titles and values.

In [32]:
country_list = []
for data in datacollection:
    country_item = {}
    bs = BeautifulSoup(data, 'html.parser')
    table = bs.select_one('table')
    rows = table.select('tr')
    for row in rows:
        tds = row.select('td')
        row_title = tds[0].get_text().strip()
        if row_title.endswith(':'):
            row_title = row_title[:-1]
        row_value = tds[1].get_text().strip()
        country_data = {row_title:row_value}
        country_item.update(country_data)
    country_list.append(country_item)

Now load the data into a pandas DataFrame so that it can be cleaned and later saved in CSV format.

In [33]:
df = pd.DataFrame(country_list)

print(df.head())

                          Area       Capital Continent    Country  \
0  2,766,890 square kilometres  Buenos Aires        SA  Argentina   
1     29,800 square kilometres       Yerevan        AS    Armenia   
2        193 square kilometres    Oranjestad        NA      Aruba   
3  7,686,850 square kilometres      Canberra        OC  Australia   
4     83,858 square kilometres        Vienna        EU    Austria   

  Currency Code Currency Name Iso             Languages National Flag  \
0           ARS          Peso  AR  es-AR,en,it,de,fr,gn                 
1           AMD          Dram  AM                    hy                 
2           AWG       Guilder  AW           nl-AW,es,en                 
3           AUD        Dollar  AU                 en-AU                 
4           EUR          Euro  AT        de-AT,hr,hu,sl                 

                Neighbours Phone  Population Postal Code Format  \
0           CL BO UY PY BR    54  41,343,201     

Use additional functions to transform the data as required, for example to separate first and last names, to convert text to camel case, or to convert units.

In [49]:
def convert_sqkm_acre(area):
    area = area.replace(',', '')
    sqkm = re.search(r'[0-9]+', area).group()
    return float(sqkm)*247.10538147
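convert_sqkm_acre assumes that every value contains a whole number. The sketch below shows one way to also handle decimal values and text without any number. The name sqkm_text_to_acres and the choice to return None for unparseable text are assumptions for this example.

```python
import re

SQKM_TO_ACRES = 247.10538147  # same factor as in convert_sqkm_acre


def sqkm_text_to_acres(area):
    """Convert text such as '29,800 square kilometres' to acres.

    Returns None when the text contains no number.
    """
    # Drop thousands separators, then take the first integer or decimal.
    match = re.search(r'[0-9]+(?:\.[0-9]+)?', area.replace(',', ''))
    if match is None:
        return None
    return float(match.group()) * SQKM_TO_ACRES


print(sqkm_text_to_acres('193 square kilometres'))
print(sqkm_text_to_acres('unknown'))
```

Returning None for missing values lets pandas treat them as missing data instead of raising an AttributeError in the middle of apply.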

Apply the conversion to the Area column and store the DataFrame as a CSV file.

In [50]:
df['Area in acre'] = df['Area'].apply(convert_sqkm_acre)
df.to_csv('countries.csv', index=False, encoding='UTF-8')