
# AfterWork Data Science: Web Scraping with Python

## Prerequisites

In [1]:
# We first import the required libraries
# ---
#
import pandas as pd             # library for data manipulation
import requests                 # library for fetching a web page
from bs4 import BeautifulSoup   # library for extracting contents from a webpage

## Examples

As we go through the following examples, keep in mind that web pages differ, but the process for scraping data is largely the same. This means that when scraping data from other web pages, we can reuse the given code with some modifications.

### Example 1: Performing Basic Web Scraping

In [None]:
# Example 1
# ---
# Scrape data found within the <a class="tag"> HTML tags in the given quotes website.
# ---
# Website: http://quotes.toscrape.com/
# ---
# YOUR CODE GOES BELOW
#

**Before we begin, we need to first understand the following concepts:**

1. A website is made up of web pages. These web pages can be either HTML or XML web pages.
2. HTML web pages contain content made up of predefined tags, e.g. `<head>, <body>, <div>, <header>, <section>, <p>, <span>, <a>, <li>`, etc.
3. XML web pages contain content made up of user-defined tags such as `<root>, <name>, <address>, <sector>, <location>`, etc.
4. The desired text data is usually contained within tags in an HTML or an XML web page. For example, the text "hello" can be contained in the `<p>` tag as shown: `<p>hello</p>`. In such a case the `<p>` tag, or *paragraph tag*, has a closing tag `</p>`.
5. HTML tags can have attributes such as class, id, href, etc. A good example would be a paragraph tag with a class attribute that has the value "home": `<p class="home">hello</p>`. These attributes help us specify the elements that we want to work with.
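
To make the vocabulary above concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to report each tag, its attributes, and the text it contains. The `TagLogger` class and the sample snippet are illustrative assumptions and are not part of the scraping code used in the examples.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records tags, attributes and text as they are parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, e.g. [('class', 'home')]
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # Text found between an opening and a closing tag
        if data.strip():
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed('<p class="home">hello</p>')
for event in logger.events:
    print(event)
```

The three recorded events correspond to the opening `<p>` tag with its `class` attribute, the text "hello", and the closing `</p>` tag.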



#### Step 1: Obtaining our Data

In [2]:
# We first download our web page from its server
# using the get() method from the requests library.
# We then check the status_code of the download.
# - A status_code starting with 2 indicates success.
# - A status_code starting with 4 or 5 indicates an error.
# ---
#
page = requests.get('http://quotes.toscrape.com/')
page

<Response [200]>

In [3]:
# Run the following code to see an error when downloading a non-existent webpage.
# There is no page with the URL: http://quotes.toscrape.com/mypage
# ---
#
page2 = requests.get('http://quotes.toscrape.com/mypage')
page2

<Response [404]>
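
The rule of thumb from the comments above (codes starting with 2 indicate success, codes starting with 4 or 5 indicate an error) can be sketched as a small function. `status_family` is a hypothetical helper name used only for illustration.

```python
def status_family(status_code):
    """Classify an HTTP status code by its first digit (hypothetical helper)."""
    first_digit = status_code // 100
    if first_digit == 2:
        return "success"
    if first_digit in (4, 5):
        return "error"
    return "other"  # e.g. 1xx informational responses and 3xx redirects


print(status_family(200))
print(status_family(404))
```

In practice the `requests` library already covers this check: `page.ok` is `True` when the status code is below 400, and `page.raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses.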

In [4]:
# Once we have successfully retrieved our page we can preview our document
# by printing the first 1600 characters of the HTML document as shown below.
# ---
#
print(page.text[0:1600])

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

From the above output, we can see some of the HTML `<a class="tag" href="">` tags that contain the data that we're interested in. We can also locate the HTML tag that contains our data by using the inspect tool within our browser.
This is demonstrated in this [short video](https://www.youtube.com/watch?v=CwiRPmXhcLY).

#### Step 2: Parsing

In [5]:
# Once we have successfully downloaded our html document,
# we can parse it and extract our desired text from the <a class="tag"> tags
# as shown below.
# ---
#

# We use BeautifulSoup, a popular Python library for web scraping, to parse our HTML document.
# Parsing here means that BeautifulSoup converts the HTML (stored in page.text)
# into a special object called soup that the library can navigate and search.
# In layman's terms, BeautifulSoup is reading the HTML and making sense of its structure.
# If we were working with an XML document, we would use the XML parser 'xml'.
# ---
#
#
soup = BeautifulSoup(page.text, "html.parser")

In [6]:
# We can then print out the HTML content of the page formatted nicely,
# using the prettify() method as shown below:
# ---
#
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

#### Step 3: Extracting Required Elements

In [7]:
# Let's now extract data found in a specific tag i.e. data found in <a> tags.
# The result is a list of instances of <a> tags found within our document.
# NB: In this case, we will get all <a> tags using the find_all() method
# which will get all the instances of the specified tag in the web page.
# ---
#
results = soup.find_all('a')
results

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href

In [8]:
# Determining the number of <a> tags found
# ---
#
len(results)

55

In [9]:
# To get the text within our first tag we can perform our extraction as follows:
# ---
#

# Method 1
# ---
#
soup.find_all('a')[0].get_text()

# or
# soup.find_all('a')[0].text

'Quotes to Scrape'

In [10]:
# Method 2
# ---
# find() returns only the first matching tag, so no indexing is needed
# ---
#
soup.find('a').get_text()

# or
# soup.find('a').text

'Quotes to Scrape'

In [11]:
# Checking the last item
# ---
#
results[-1]

<a class="zyte" href="https://www.zyte.com">Zyte</a>

In [12]:
# We can also specify the tag that we would like to retrieve our text data from,
# in this case getting the text from the tag with the following attributes:
# ---
# - class="tag"
# - href="/tag/change/page/1/"
# This tag would be:
# - <a class="tag" href="/tag/change/page/1/">change</a>
# ---
#

# Method 1: pass the attributes through the attrs keyword
# ---
#
results = soup.find_all('a', attrs={'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
results

# Method 2: pass the same dictionary as the second positional argument
# ---
# Uncomment the following lines
# ---
#
# results = soup.find_all('a', {'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
# results

'change'
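
Example 1 asks for the data inside every `<a class="tag">` element, not just one of them. A minimal sketch of one way to collect them all, using the `class_` keyword of `find_all()`. The `sample_html` string is an assumed stand-in modelled on the output above, so the block runs without downloading the page.

```python
from bs4 import BeautifulSoup

# A small stand-in for the quotes page (assumed markup, modelled on the output above)
sample_html = """
<a href="/login">Login</a>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a href="/author/Albert-Einstein">(about)</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute,
# because "class" is a reserved word in Python
tag_links = sample_soup.find_all("a", class_="tag")
tag_texts = [link.get_text() for link in tag_links]
print(tag_texts)
```

Running the same `find_all("a", class_="tag")` call on the real `soup` object should return only the tag links rather than every `<a>` element on the page.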

In [13]:
# We then create empty lists that we will use to store
# content fetched from the <a> tags
# ---
#
from urllib.parse import urljoin   # standard-library helper for building full URLs

link_content = []
link_url = []

# Getting all our <a> tags
# ---
#
results = soup.find_all('a')

# We then loop through these tags
for result in results:

    # Getting our text from each tag
    text = result.get_text()

    # We join our domain with the href that we scrape in order to form a full link.
    # urljoin leaves absolute links (such as the external Zyte link) unchanged,
    # whereas plain string concatenation would prepend the domain to them.
    link = urljoin('http://quotes.toscrape.com', result.get('href'))

    # Then appending the text to our link_content list
    link_content.append(text)

    # Then appending the link to our link_url list
    link_url.append(link)

In [14]:
# Previewing our link_content list by checking first 10 items
# ---
#
link_content[0:10]

['Quotes to Scrape',
 'Login',
 '(about)',
 'change',
 'deep-thoughts',
 'thinking',
 'world',
 '(about)',
 'abilities',
 'choices']

In [15]:
# Previewing our link_url list by checking first 10 items
# ---
#
link_url[0:10]

['http://quotes.toscrape.com/',
 'http://quotes.toscrape.com/login',
 'http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/tag/change/page/1/',
 'http://quotes.toscrape.com/tag/deep-thoughts/page/1/',
 'http://quotes.toscrape.com/tag/thinking/page/1/',
 'http://quotes.toscrape.com/tag/world/page/1/',
 'http://quotes.toscrape.com/author/J-K-Rowling',
 'http://quotes.toscrape.com/tag/abilities/page/1/',
 'http://quotes.toscrape.com/tag/choices/page/1/']
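
Most `href` values on this page are relative paths, but the last `<a>` tag points to an external site with a full URL. A short sketch of how the standard-library function `urllib.parse.urljoin` treats both cases; `base_url` is just an illustrative variable name.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com"

# Relative href: resolved against the base URL
relative_link = urljoin(base_url, "/tag/change/page/1/")
print(relative_link)

# Absolute href: returned unchanged
absolute_link = urljoin(base_url, "https://www.zyte.com")
print(absolute_link)
```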

#### Step 4: Saving our Data

In [16]:
# Finally, we save the scraped contents in a dataframe and preview our data as shown
# ---
#
df = pd.DataFrame({"link_content": link_content, "link_url": link_url})
df.head()

Unnamed: 0,link_content,link_url
0,Quotes to Scrape,http://quotes.toscrape.com/
1,Login,http://quotes.toscrape.com/login
2,(about),http://quotes.toscrape.com/author/Albert-Einstein
3,change,http://quotes.toscrape.com/tag/change/page/1/
4,deep-thoughts,http://quotes.toscrape.com/tag/deep-thoughts/p...
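
A dataframe lives only in memory, so a common next step is to write it out as CSV. The sketch below uses `DataFrame.to_csv` on a small stand-in dataframe and writes to an in-memory buffer so that it leaves no file behind. `sample_df` and its two rows are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in data with the same columns as the dataframe above
sample_df = pd.DataFrame({
    "link_content": ["Login", "change"],
    "link_url": [
        "http://quotes.toscrape.com/login",
        "http://quotes.toscrape.com/tag/change/page/1/",
    ],
})

# index=False leaves out the 0, 1, 2... row labels
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name instead of the buffer, for instance a hypothetical `"quote_links.csv"`, would write the same content to disk.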


### Example 2: Scraping for Tables

In [17]:
# Example 2
# ---
# Get a list of African cities from the given URL and store the city and respective population.
# ---
# Website URL = https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population
# ---
# YOUR CODE GOES BELOW
#

#### Step 1: Obtaining our Data

In [18]:
# Fetching our data from Wikipedia. A status of 200 means success.
#
# ---
#
page = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population')
page

<Response [200]>

#### Step 2: Parsing

In [19]:
# Parsing our data using BeautifulSoup
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

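Using the browser inspect feature, we identify our source table to have the following tag:

`<table class="sortable wikitable jquery-tablesorter">`

The `jquery-tablesorter` class is added by JavaScript running in the browser, so it is absent from the HTML that `requests` downloads. In the downloaded page the table carries only `class="sortable wikitable"`.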

#### Step 3: Extracting Required Elements

In [20]:
# In the downloaded HTML, our source table has the following tag:
# ---
# <table class="sortable wikitable">
# ---
#
right_table = soup.find('table', {'class': 'sortable wikitable'})
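
Passing a multi-class string such as `'sortable wikitable'` matches the class attribute as an exact string, so the lookup depends on the order in which the classes appear in the page. The sketch below illustrates this on an assumed one-row table whose classes are written in the opposite order, and shows a CSS selector via `select_one()` as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Assumed markup: a table with the same two classes in a different order
sample_html = '<table class="wikitable sortable"><tr><td>Lagos</td></tr></table>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string match on the class attribute: order matters, so this finds nothing
exact_match = sample_soup.find("table", {"class": "sortable wikitable"})
print(exact_match)

# CSS selector: matches any table carrying both classes, in any order
css_match = sample_soup.select_one("table.sortable.wikitable")
print(css_match.td.get_text())
```

If the page markup changes the class order in the future, the selector form would keep working while the exact-string form would start returning `None`.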

In [21]:
# And then preview it for the purpose of confirmation
#
print(right_table)

<table class="sortable wikitable">
<tbody><tr>
<th style="width:60">City
</th>
<th style="width:4">Rank
</th>
<th style="width:60">Country
</th>
<th style="width:40">Population
</th>
<th style="width:6">Year of estimate
</th></tr>
<tr>
<td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
</td>
<td>1
</td>
<td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="800" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/20px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/31px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/40px-Flag_of_the_Democratic_Republic_of_the_

In [22]:
# Getting the table body rows
# ---
#
rows = right_table.find_all('tr')
rows

[<tr>
 <th style="width:60">City
 </th>
 <th style="width:4">Rank
 </th>
 <th style="width:60">Country
 </th>
 <th style="width:40">Population
 </th>
 <th style="width:6">Year of estimate
 </th></tr>,
 <tr>
 <td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
 </td>
 <td>1
 </td>
 <td><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="800" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/20px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/31px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/40px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 2x" width=

In [23]:
# Getting the required text data from our table body rows
# ---
#
rank = []
cities = []
countries = []
population = []

for row in rows:
    cells = row.find_all('td')

    # The header row holds <th> cells only, so it has no <td> cells and is skipped.
    # We also require at least four cells before indexing into the row.
    # Column order on the page: City, Rank, Country, Population
    if len(cells) >= 4:
        cities.append(cells[0].text.strip())
        rank.append(cells[1].text.strip())
        countries.append(cells[2].text.strip())
        population.append(cells[3].text.strip())

In [None]:
# Previewing our lists
# ---
#
print(rank[0:5])
print(cities[0:5])
print(countries[0:5])
print(population[0:5])

['1', '2', '3', '4', '5']
['Kinshasa', 'Lagos', 'Cairo', 'Giza', 'Dar es Salaam']
['Democratic Republic of the Congo', 'Nigeria', 'Egypt', 'Egypt', 'Tanzania']
['15,628,000', '15,388,000', '10,025,657', '9,200,000', '7,100,000']


#### Step 4: Saving our Data

In [24]:
# Saving the scraped data to a dataframe
# ---
#
countries_df = pd.DataFrame({"city": cities, "rank": rank, "country": countries, "population": population})
countries_df.sample(10)

Unnamed: 0,city,rank,country,population
90,Umuahia,91,Nigeria,860555
58,Mwanza,59,Tanzania,1245444
53,Gqeberha (Port Elizabeth),54,South Africa,1280550
91,Libreville,92,Gabon,856854
11,Nairobi,12,Kenya,5118844
69,Hargeisa,70,Somaliland,1079377
47,N'Djamena,48,Chad,1532588
84,Bangui,85,Central African Republic,933176
95,Lokoja,96,Nigeria,790742
49,Mombasa,50,Kenya,1388979
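
The scraped population values are stored as text, and the earlier preview shows them with thousands separators such as `'15,628,000'`. A minimal sketch of converting such strings to integers, assuming each cell holds only digits and commas (a cell with a footnote marker would need extra cleaning). `parse_population` is a hypothetical helper name.

```python
def parse_population(text):
    """Turn a string such as '15,628,000' into an integer (hypothetical helper)."""
    return int(text.replace(",", "").strip())


sample_population = ['15,628,000', '15,388,000', '10,025,657']
population_numbers = [parse_population(value) for value in sample_population]
print(population_numbers)
```

With numeric values, operations such as sorting or summing the populations behave as expected instead of comparing strings character by character.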


### Example 3: Scraping for Articles

In [None]:
# Example 3
# ---
# Scrape the given article from the DailyPost Nigeria
# ---
# Website URL = https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/
# ---
# YOUR CODE GOES BELOW
#

#### Step 1: Obtaining our Data

In [25]:
page = requests.get('https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/')
page

<Response [200]>

#### Step 2: Parsing

In [26]:
soup = BeautifulSoup(page.text, "html.parser")

#### Step 3: Extracting Required Elements

In [27]:
# Getting our heading content
# ---
# This is our tag:
# <h1 class="mvp-post-title left entry-title" itemprop="headline">
# ---
#
article_heading = soup.find('h1', {'class': 'mvp-post-title left entry-title'}).get_text()
article_heading

'Danbatta inaugurates Evaluation Committee for 2020 research proposals'

In [28]:
# Getting our article content
# ---
# Target tags:
# All <p> tags contained in <div id="mvp-content-main">
# ---
#
article = soup.find('div', {'id': 'mvp-content-main'})
article

<div class="left relative" id="mvp-content-main">
<div class="ai-viewports ai-viewport-2 ai-viewport-3 ai-insert-7-75113661" data-block="7" data-code="PGRpdiBjbGFzcz0nY29kZS1ibG9jayBjb2RlLWJsb2NrLTcnIHN0eWxlPSdtYXJnaW46IDhweCAwOyBjbGVhcjogYm90aDsnPgo8ZGl2IGFsaWduPSJjZW50ZXIiPgo8IS0tIC8xNDAwMTYzNi9EUF9MZWFkZXJib2FyZF8xIC0tPgo8ZGl2IGlkPSdkaXYtZ3B0LWFkLTE1MDAzODY5NTMyODEtOCc+CjxzY3JpcHQ+Cmdvb2dsZXRhZy5jbWQucHVzaChmdW5jdGlvbigpIHsgZ29vZ2xldGFnLmRpc3BsYXkoJ2Rpdi1ncHQtYWQtMTUwMDM4Njk1MzI4MS04Jyk7IH0pOwo8L3NjcmlwdD4KPC9kaXY+CjwvZGl2PgoKPC9kaXY+Cg==" data-insertion-no-dbg="" data-insertion-position="prepend" data-selector=".ai-insert-7-75113661" style="margin: 8px 0; clear: both;"></div>
<p>The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions.</p><div class="ai-view

In [29]:
# Let's find all the p tags that contain the article text
# ---
#
p_tags = article.find_all('p')

# We then strip all the surrounding whitespace.
# ---
#
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions.',
 'The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020.',
 'Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole.',
 '“We want to continuously support research projects that can lead to the development of new products and services in the industry as
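
Scraped text sometimes shows sequences such as `â\x80\x99` where a curly apostrophe should be. This usually means that UTF-8 bytes were decoded with a single-byte encoding such as Latin-1. The sketch below reproduces and reverses that mistake on a short assumed string.

```python
# UTF-8 bytes for a right curly quote, wrongly decoded as Latin-1
garbled = "Commissionâ\x80\x99s"

# Reverse the wrong decoding, then decode with the intended encoding
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

On the `requests` side, two common ways to avoid the problem are setting `page.encoding = 'utf-8'` before reading `page.text`, or passing the raw bytes in `page.content` to BeautifulSoup so that it detects the encoding itself. Whether either is needed depends on the headers the server sends.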

#### Step 4: Saving our Data

In [None]:
# Combine list items into string.
article = ' '.join(p_tags_text)
article

'The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions. The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020. Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole. “We want to continuously support research projects that can lead to the development of new products and services in the industry as the key enabler of the nation’s digital econ
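
Joining the paragraphs gives a single string, which can then be written to a text file. A minimal sketch using `pathlib`, with a short stand-in string and a temporary directory so that nothing is left on disk. The file name `article.txt` and the `sample_article` text are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Stand-in for the joined article string built above
sample_article = "First paragraph of the article. Second paragraph of the article."

# A temporary directory keeps this sketch from leaving files behind;
# in practice any writable path would do
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "article.txt"  # hypothetical file name
    output_path.write_text(sample_article, encoding="utf-8")
    saved_text = output_path.read_text(encoding="utf-8")

print(saved_text == sample_article)
```

Writing with an explicit `encoding="utf-8"` keeps characters such as curly quotes intact regardless of the platform default.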