# Web Scraping Project Hamoye DSC SUMMER '22

### Project: Scraping a Wikipedia page to access and analyse useful data

> **Tip**: In this project, we will scrape the 'List of countries by number of births' table from [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_number_of_births) and perform a basic analysis of the resulting dataset to tell a story.
>
> **Let's dive right in!**

Importing the packages required for the analysis

In [1]:
import numpy as np
import pandas as pd
import io
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline

Importing the libraries required to scrape the dataset.

In [2]:
from bs4 import BeautifulSoup
import requests
import re

Requesting the page that will be scraped. Note that `url` holds the response object returned by `requests.get`, not the link itself.

In [3]:
url = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_number_of_births')

Confirming that the request succeeded.

In [4]:
url.raise_for_status()

Since no error was raised, the request returned a successful status code and the page can be parsed.
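
`raise_for_status()` raises an `HTTPError` only for 4xx and 5xx status codes; for a successful response it returns `None`. The sketch below illustrates this without network access by building `requests.Response` objects by hand. That is done purely for illustration (a real response comes from `requests.get`), and the helper name `status_is_ok` is hypothetical.

```python
import requests


def status_is_ok(response):
    """Return True if raise_for_status() does not raise, else False."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return False
    return True


# Hand-built responses, for illustration only (no request is sent).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(status_is_ok(ok_response))
print(status_is_ok(missing_response))
```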

In [5]:
soup = BeautifulSoup(url.text, 'html.parser')

In [6]:
# Inspect the HTML source of the first table on the Wikipedia page
table = soup.find('table')

print(table)

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Country
</th>
<th>Number of births (2021)
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="900" data-file-width="1350" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/23px-Flag_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/35px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/45px-Flag_of_India.svg.png 2x" width="23"/> </span><a href="/wiki/India" title="India">India</a>
</td>
<td>          23,113,533
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//uploa

In [7]:
# Get the column names from the table's header cells
column = table.find_all('th')

column_names = [col.text.strip() for col in column]
print(column_names)

['Rank', 'Country', 'Number of births (2021)']


In [8]:
# Get the text of every data cell in the table (as one flat list)
values = table.find_all('td')
value_text = [value.get_text(strip=True) for value in values]

print(value_text)

['1', 'India', '23,113,533', '2', 'China', '10,881,567', '3', 'Nigeria', '7,923,294', '4', 'Pakistan', '6,374,741', '5', 'Indonesia', '4,496,383', '6', 'Democratic Republic of the Congo', '4,034,953', '7', 'Ethiopia', '3,895,734', '8', 'United States', '3,722,822', '9', 'Bangladesh', '3,019,672', '10', 'Brazil', '2,760,958', '11', 'Philippines', '2,485,008', '12', 'Egypt', '2,465,005', '13', 'Tanzania', '2,303,114', '14', 'Mexico', '1,882,362', '15', 'Uganda', '1,686,795', '16', 'Sudan', '1,534,332', '17', 'Kenya', '1,468,358', '18', 'Vietnam', '1,462,623', '19', 'Afghanistan', '1,440,941', '20', 'Russia', '1,397,456', '21', 'Angola', '1,338,792', '22', 'Turkey', '1,244,782', '23', 'Iran', '1,204,105', '24', 'Iraq', '1,192,345', '25', 'South Africa', '1,176,955', '26', 'Mozambique', '1,174,346', '27', 'Niger', '1,144,371', '28', 'Yemen', '1,008,936', '29', 'Algeria', '950,888', '30', 'Cameroon', '950,546', '31', 'Ivory Coast', '932,943', '32', 'Myanmar', '920,395', '33', 'Mali', '912,9

The rows are merged into one flat list, so we have to split it back into individual observations, i.e. the rank, the country name and the number of births in 2021.
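
An alternative that avoids the flat list altogether is to walk the table one `<tr>` at a time, so each observation stays together. The sketch below runs on a small inline HTML snippet that stands in for the Wikipedia table (its structure is assumed to match the printed source above); the names `html`, `sample_table` and `rows` are illustrative.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (assumed: one <tr> per country).
html = """
<table class="wikitable sortable">
<tbody><tr><th>Rank</th><th>Country</th><th>Number of births (2021)</th></tr>
<tr><td>1</td><td><a href="/wiki/India">India</a></td><td> 23,113,533</td></tr>
<tr><td>2</td><td><a href="/wiki/China">China</a></td><td> 10,881,567</td></tr>
</tbody></table>
"""

sample_table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Each element of `rows` is already one observation, so no later splitting step is needed.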

In [9]:
print(len(value_text))

708


The table on the website has 236 rows, which can be confirmed by dividing 708 by 3, the number of values in each observation.
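
The same arithmetic can be done in code, so the row count does not have to be typed in by hand. A minimal sketch on the first three observations from the output above, assuming every row has exactly three cells (`flat`, `n_columns` and `n_rows` are illustrative names):

```python
import numpy as np

# First three observations from the scraped list, flattened as in the output above.
flat = ['1', 'India', '23,113,533',
        '2', 'China', '10,881,567',
        '3', 'Nigeria', '7,923,294']

n_columns = 3
# Guard against a partial row, which would silently misalign the data.
assert len(flat) % n_columns == 0

n_rows = len(flat) // n_columns
rows = np.array(flat).reshape(n_rows, n_columns)

print(n_rows)
print(rows[1].tolist())
```

For the full list of 708 values, the same calculation gives 708 // 3 = 236 rows.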

>See the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_countries_by_number_of_births) to compare against the original table.
>
>**So, we will be splitting the data obtained above into 236 rows with three variables each.**

In [10]:
value_text = np.array_split(value_text, 236)
print(value_text)

[array(['1', 'India', '23,113,533'], dtype='<U44'), array(['2', 'China', '10,881,567'], dtype='<U44'), array(['3', 'Nigeria', '7,923,294'], dtype='<U44'), array(['4', 'Pakistan', '6,374,741'], dtype='<U44'), array(['5', 'Indonesia', '4,496,383'], dtype='<U44'), array(['6', 'Democratic Republic of the Congo', '4,034,953'], dtype='<U44'), array(['7', 'Ethiopia', '3,895,734'], dtype='<U44'), array(['8', 'United States', '3,722,822'], dtype='<U44'), array(['9', 'Bangladesh', '3,019,672'], dtype='<U44'), array(['10', 'Brazil', '2,760,958'], dtype='<U44'), array(['11', 'Philippines', '2,485,008'], dtype='<U44'), array(['12', 'Egypt', '2,465,005'], dtype='<U44'), array(['13', 'Tanzania', '2,303,114'], dtype='<U44'), array(['14', 'Mexico', '1,882,362'], dtype='<U44'), array(['15', 'Uganda', '1,686,795'], dtype='<U44'), array(['16', 'Sudan', '1,534,332'], dtype='<U44'), array(['17', 'Kenya', '1,468,358'], dtype='<U44'), array(['18', 'Vietnam', '1,462,623'], dtype='<U44'), array(['19', 'Afghanis

In [11]:
print(len(value_text))

236


In [12]:
data = value_text
print(data)

[array(['1', 'India', '23,113,533'], dtype='<U44'), array(['2', 'China', '10,881,567'], dtype='<U44'), array(['3', 'Nigeria', '7,923,294'], dtype='<U44'), array(['4', 'Pakistan', '6,374,741'], dtype='<U44'), array(['5', 'Indonesia', '4,496,383'], dtype='<U44'), array(['6', 'Democratic Republic of the Congo', '4,034,953'], dtype='<U44'), array(['7', 'Ethiopia', '3,895,734'], dtype='<U44'), array(['8', 'United States', '3,722,822'], dtype='<U44'), array(['9', 'Bangladesh', '3,019,672'], dtype='<U44'), array(['10', 'Brazil', '2,760,958'], dtype='<U44'), array(['11', 'Philippines', '2,485,008'], dtype='<U44'), array(['12', 'Egypt', '2,465,005'], dtype='<U44'), array(['13', 'Tanzania', '2,303,114'], dtype='<U44'), array(['14', 'Mexico', '1,882,362'], dtype='<U44'), array(['15', 'Uganda', '1,686,795'], dtype='<U44'), array(['16', 'Sudan', '1,534,332'], dtype='<U44'), array(['17', 'Kenya', '1,468,358'], dtype='<U44'), array(['18', 'Vietnam', '1,462,623'], dtype='<U44'), array(['19', 'Afghanis

In [13]:
df = pd.DataFrame(data)

In [14]:
df.rename(columns={0: 'rank', 1: 'country', 2: 'number_of_births'}, inplace=True)
# set_index returns a new DataFrame for display; df itself keeps 'rank' as a regular column
df.set_index(df.columns[0])

rank,country,number_of_births
1,India,23113533
2,China,10881567
3,Nigeria,7923294
4,Pakistan,6374741
5,Indonesia,4496383
...,...,...
232,Falkland Islands,42
233,"Saint Helena, Ascension and Tristan da Cunha",41
234,Montserrat,40
235,Tokelau,35
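
Note that `set_index` returns a new DataFrame rather than modifying `df`, which is why `rank` still appears as a regular column when `df` is printed later. A minimal sketch on a three-row sample (the names `df_demo` and `indexed` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'rank': ['1', '2', '3'],
    'country': ['India', 'China', 'Nigeria'],
    'number_of_births': ['23,113,533', '10,881,567', '7,923,294'],
})

# set_index returns a new DataFrame; df_demo is left unchanged.
indexed = df_demo.set_index('rank')

print(list(df_demo.columns))
print(indexed.index.name)
print(list(indexed.columns))
```

To keep the new index, the result has to be assigned, for example `df = df.set_index('rank')`.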


In [15]:
# to_csv returns None when given a path, so there is nothing useful to assign
df.to_csv('dataset.csv')
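
When `to_csv` is given a file path it writes the file and returns `None`, so assigning its result does not store the data. The sketch below shows this and reads the file back to confirm the round trip. It writes to a temporary directory so nothing is left on disk, and `df_demo`, `result` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({
    'rank': ['1', '2'],
    'country': ['India', 'China'],
    'number_of_births': ['23,113,533', '10,881,567'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dataset.csv')
    # index=False leaves the 0..n-1 row labels out of the file.
    result = df_demo.to_csv(path, index=False)
    # dtype=str keeps every value as text, exactly as it was scraped.
    reloaded = pd.read_csv(path, dtype=str)

print(result)
print(reloaded.values.tolist())
```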

In [16]:
print(df)

    rank                                       country number_of_births
0      1                                         India       23,113,533
1      2                                         China       10,881,567
2      3                                       Nigeria        7,923,294
3      4                                      Pakistan        6,374,741
4      5                                     Indonesia        4,496,383
..   ...                                           ...              ...
231  232                              Falkland Islands               42
232  233  Saint Helena, Ascension and Tristan da Cunha               41
233  234                                    Montserrat               40
234  235                                       Tokelau               35
235  236                                          Niue               27

[236 rows x 3 columns]


>**The dataset is already arranged in descending order of number of births, so the first ten rows are the ten countries with the highest number of births and the last ten rows are the ten countries with the lowest.**

In [17]:
df.head(10)

,rank,country,number_of_births
0,1,India,23113533
1,2,China,10881567
2,3,Nigeria,7923294
3,4,Pakistan,6374741
4,5,Indonesia,4496383
5,6,Democratic Republic of the Congo,4034953
6,7,Ethiopia,3895734
7,8,United States,3722822
8,9,Bangladesh,3019672
9,10,Brazil,2760958


In [18]:
df.tail(10)

,rank,country,number_of_births
226,227,San Marino,202
227,228,Anguilla,150
228,229,Wallis and Futuna,138
229,230,Saint Barthélemy,88
230,231,Saint Pierre and Miquelon,45
231,232,Falkland Islands,42
232,233,"Saint Helena, Ascension and Tristan da Cunha",41
233,234,Montserrat,40
234,235,Tokelau,35
235,236,Niue,27


In [19]:
df.shape

(236, 3)

>**PURRFECT!!!**
>
>Let's get some visualisations.
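
The notebook imports matplotlib but does not go on to draw a plot, so here is a minimal sketch of what a bar chart of the top ten could look like. The values are copied from the `df.head(10)` output above; the figure size, labels and title are illustrative choices.

```python
import matplotlib.pyplot as plt

# Top ten countries and their 2021 births, copied from the df.head(10) output.
countries = ['India', 'China', 'Nigeria', 'Pakistan', 'Indonesia',
             'Democratic Republic of the Congo', 'Ethiopia', 'United States',
             'Bangladesh', 'Brazil']
births = [23113533, 10881567, 7923294, 6374741, 4496383,
          4034953, 3895734, 3722822, 3019672, 2760958]

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(countries, births)
ax.invert_yaxis()  # put the largest value at the top
ax.set_xlabel('Number of births (2021)')
ax.set_title('Top ten countries by number of births')
fig.tight_layout()
```

With `%matplotlib inline` active, the figure is rendered at the end of the cell.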

In [20]:
df.describe()

,rank,country,number_of_births
count,236,236,236
unique,236,236,236
top,1,India,23113533
freq,1,1,1


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236 entries, 0 to 235
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rank              236 non-null    object
 1   country           236 non-null    object
 2   number_of_births  236 non-null    object
dtypes: object(3)
memory usage: 5.7+ KB
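
All three columns have dtype `object` because the scraped values are strings. That is also why `describe()` reports `count`, `unique`, `top` and `freq` instead of numeric statistics. A minimal sketch of one way to convert the numeric columns, using four rows from the outputs above (the names `df_demo` and `cleaned` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'rank': ['1', '2', '235', '236'],
    'country': ['India', 'China', 'Tokelau', 'Niue'],
    'number_of_births': ['23,113,533', '10,881,567', '35', '27'],
})

cleaned = df_demo.copy()
cleaned['rank'] = cleaned['rank'].astype(int)
# Strip the thousands separators before converting to integers.
cleaned['number_of_births'] = (
    cleaned['number_of_births'].str.replace(',', '', regex=False).astype(int)
)

print(cleaned.dtypes)
# With numeric values, the descending order can be checked directly.
print(cleaned['number_of_births'].is_monotonic_decreasing)
```

Once the column is numeric, `describe()` returns the mean, quartiles and other statistics, and sorting compares numbers rather than text.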


In [23]:
df.head()

,rank,country,number_of_births
0,1,India,23113533
1,2,China,10881567
2,3,Nigeria,7923294
3,4,Pakistan,6374741
4,5,Indonesia,4496383


In [24]:
df.tail()

,rank,country,number_of_births
231,232,Falkland Islands,42
232,233,"Saint Helena, Ascension and Tristan da Cunha",41
233,234,Montserrat,40
234,235,Tokelau,35
235,236,Niue,27


In [25]:
df

,rank,country,number_of_births
0,1,India,23113533
1,2,China,10881567
2,3,Nigeria,7923294
3,4,Pakistan,6374741
4,5,Indonesia,4496383
...,...,...,...
231,232,Falkland Islands,42
232,233,"Saint Helena, Ascension and Tristan da Cunha",41
233,234,Montserrat,40
234,235,Tokelau,35


>**The brief analyses above give basic information about the dataset scraped from Wikipedia. Please note that the dataset can be further explored and cleaned, but the scope of this project is mainly web scraping, and that has been achieved. Thank you.**