### What You're Aiming For

- Scraping text from Wikipedia website using Beautiful Soup


##### Instructions

- After watching this video below, you will be able to:

➡️ https://www.youtube.com/watch?v=YY5skv756pc

- 1.1) Write a function to Get and parse html content from a Wikipedia page

- 1.2) Write a function to Extract article title

- 1.3) Write a function to Extract article text for each paragraph with their respective headings. Map those headings to their respective paragraphs in the dictionary.

- 1.4) Write a function to collect every link that redirects to another Wikipedia page

- 1.5) Wrap all the previous functions into a single function that takes as parameters a Wikipedia link

- 1.6) Test the last function on a Wikipedia page of your choice

In [1]:
import numpy as np
import pandas as pd 
import requests
from bs4 import BeautifulSoup as bs

In [2]:
web = 'https://en.wikipedia.org/wiki/Demographics_of_Africa'

res = requests.get(web)

if res.status_code == 200:
    print('Successfully Fetched')
else:
    print('Unsuccessful')

Successfully Fetched


In [3]:
soup = bs(res.content, 'html.parser')

In [4]:
soup.prettify()

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Demographics of Africa - Wikipedia\n  </title>\n  <script>\n   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-me

In [5]:
tables = soup.find_all('table')

table_2 = tables[5]
table_2.prettify()

'<table class="wikitable sortable" style="text-align:right">\n <tbody>\n  <tr>\n   <th>\n   </th>\n   <th style="width:80pt;">\n    Mid-year population (thousands)\n   </th>\n   <th style="width:80pt;">\n    Live births (thousands)\n   </th>\n   <th style="width:80pt;">\n    Deaths (thousands)\n   </th>\n   <th style="width:80pt;">\n    Natural change (thousands)\n   </th>\n   <th style="width:80pt;">\n    Crude birth rate (per 1000)\n   </th>\n   <th style="width:80pt;">\n    Crude death rate (per 1000)\n   </th>\n   <th style="width:80pt;">\n    Natural change (per 1000)\n   </th>\n   <th style="width:80pt;">\n    Crude migration change (per 1000)\n   </th>\n   <th style="width:80pt;">\n    <a href="/wiki/Total_fertility_rate" title="Total fertility rate">\n     Total fertility rate\n    </a>\n    (TFR)\n   </th>\n   <th style="width:80pt;">\n    <a href="/wiki/Infant_mortality" title="Infant mortality">\n     Infant mortality\n    </a>\n    (per 1000 live births)\n   </th>\n   <th s

In [7]:
demographic_of_africa = []

In [8]:
africa = table_2.find_all('tr')
for row in africa[1:]:
    column = row.find_all("td")
    if len(column) >= 11:
        Index = column[0].get_text(strip = True)
        Mid_year_population = column[1].get_text(strip = True)
        Live_births = column[2].get_text(strip = True)
        Deaths = column[3].get_text(strip = True)
        Natural_change = column[4].get_text(strip = True)
        Crude_birth_rate = column[5].get_text(strip = True)
        Crude_death_rate = column[6].get_text(strip = True)
        Natural_change_ = column[7].get_text(strip = True)
        Crude_migration_change = column[8].get_text(strip = True)
        Total_fertility_rate = column[9].get_text(strip = True)
        infant_mortality = column[10].get_text(strip = True)
        life_expentancy = column[11].get_text(strip = True)
        
        demographic_of_africa.append({
            "Index": Index,
            "Mid-year population": Mid_year_population,
            "Live_births": Live_births,
            "Deaths": Deaths,
            "Natural_change": Natural_change,
            "Crude_birth_rate": Crude_birth_rate,
            "Crude_death_rate": Crude_death_rate,
            "Natural_change_(per1000)": Natural_change,
            "Crude_migration_change":Crude_migration_change,
            "Total_fertility_rate":Total_fertility_rate,
            "infant_mortality":infant_mortality,
            "life_expentancy_(inyears)":life_expentancy
        })
    else:
        print("Failed to fetch")

In [9]:
demographic_of_africa

[{'Index': '1950',
  'Mid-year population': '227 549',
  'Live_births': '10 949',
  'Deaths': '6 063',
  'Natural_change': '4 886',
  'Crude_birth_rate': '48.1',
  'Crude_death_rate': '26.6',
  'Natural_change_(per1000)': '4 886',
  'Crude_migration_change': '0.2',
  'Total_fertility_rate': '6.59',
  'infant_mortality': '186.6',
  'life_expentancy_(inyears)': '37.62'},
 {'Index': '1951',
  'Mid-year population': '232 484',
  'Live_births': '11 200',
  'Deaths': '6 132',
  'Natural_change': '5 068',
  'Crude_birth_rate': '48.2',
  'Crude_death_rate': '26.4',
  'Natural_change_(per1000)': '5 068',
  'Crude_migration_change': '0.1',
  'Total_fertility_rate': '6.59',
  'infant_mortality': '184.5',
  'life_expentancy_(inyears)': '37.93'},
 {'Index': '1952',
  'Mid-year population': '237 586',
  'Live_births': '11 448',
  'Deaths': '6 155',
  'Natural_change': '5 293',
  'Crude_birth_rate': '48.2',
  'Crude_death_rate': '25.9',
  'Natural_change_(per1000)': '5 293',
  'Crude_migration_change

In [10]:
demographic_of_africa_df = pd.DataFrame(demographic_of_africa)

In [11]:
demographic_of_africa_df

Unnamed: 0,Index,Mid-year population,Live_births,Deaths,Natural_change,Crude_birth_rate,Crude_death_rate,Natural_change_(per1000),Crude_migration_change,Total_fertility_rate,infant_mortality,life_expentancy_(inyears)
0,1950,227 549,10 949,6 063,4 886,48.1,26.6,4 886,0.2,6.59,186.6,37.62
1,1951,232 484,11 200,6 132,5 068,48.2,26.4,5 068,0.1,6.59,184.5,37.93
2,1952,237 586,11 448,6 155,5 293,48.2,25.9,5 293,-0.2,6.60,181.3,38.44
3,1953,242 837,11 708,6 188,5 520,48.2,25.5,5 520,-0.4,6.61,178.0,38.92
4,1954,248 245,11 941,6 234,5 708,48.1,25.1,5 708,-0.4,6.61,174.7,39.30
...,...,...,...,...,...,...,...,...,...,...,...,...
67,2017,1 263 334,43 102,10 695,32 408,34.1,8.5,32 408,-0.3,4.52,50.0,61.99
68,2018,1 295 265,43 713,10 763,32 950,33.7,8.3,32 950,-0.4,4.47,48.8,62.34
69,2019,1 327 701,44 295,10 841,33 454,33.3,8.2,33 454,-0.4,4.42,47.7,62.69
70,2020,1 360 677,44 807,11 390,33 417,32.9,8.4,33 417,-0.3,4.36,46.4,62.23


In [12]:
demographic_of_africa_df.rename(columns ={"Index": "Year"}, inplace = True )
demographic_of_africa_df

Unnamed: 0,Year,Mid-year population,Live_births,Deaths,Natural_change,Crude_birth_rate,Crude_death_rate,Natural_change_(per1000),Crude_migration_change,Total_fertility_rate,infant_mortality,life_expentancy_(inyears)
0,1950,227 549,10 949,6 063,4 886,48.1,26.6,4 886,0.2,6.59,186.6,37.62
1,1951,232 484,11 200,6 132,5 068,48.2,26.4,5 068,0.1,6.59,184.5,37.93
2,1952,237 586,11 448,6 155,5 293,48.2,25.9,5 293,-0.2,6.60,181.3,38.44
3,1953,242 837,11 708,6 188,5 520,48.2,25.5,5 520,-0.4,6.61,178.0,38.92
4,1954,248 245,11 941,6 234,5 708,48.1,25.1,5 708,-0.4,6.61,174.7,39.30
...,...,...,...,...,...,...,...,...,...,...,...,...
67,2017,1 263 334,43 102,10 695,32 408,34.1,8.5,32 408,-0.3,4.52,50.0,61.99
68,2018,1 295 265,43 713,10 763,32 950,33.7,8.3,32 950,-0.4,4.47,48.8,62.34
69,2019,1 327 701,44 295,10 841,33 454,33.3,8.2,33 454,-0.4,4.42,47.7,62.69
70,2020,1 360 677,44 807,11 390,33 417,32.9,8.4,33 417,-0.3,4.36,46.4,62.23


In [13]:
# Converting the file to CSV file
demographic_of_africa_df.to_csv('demographic_of_africa.csv', index = False)