# Web Scraping "World Population" page from Wikipedia 

In this notebook I scrape the tables on the Wikipedia page "World population".

I use two separate methods to extract the tables, both of which rely on Beautiful Soup:
1) I locate each table by its caption text and store it in a pandas DataFrame by appending each cell manually.

2) I use "read_html" to parse every table on the page into a list of DataFrames, then access each table by index.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
url = "https://en.wikipedia.org/wiki/World_population"

In [3]:
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
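
The request above has no timeout and no status check, so a failed fetch would hand an error page to the parser. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup`, the User-Agent string, and the 30-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent string; replace it with one that identifies your own project.
HEADERS = {"User-Agent": "world-population-notebook/0.1 (educational example)"}


def fetch_soup(url, get=requests.get, timeout=30):
    """Download `url` and parse it, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(url)` would then return the same kind of parsed object as the cell above, or raise instead of silently parsing an error page.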

In [4]:
tables = soup.find_all('table')
len(tables)

26

### Searching for the "Population by region" table

In [5]:
for index,table in enumerate(tables):
    if ("Population by region" in table.text):
        region_index = index
        
print(region_index)


1
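
The loop above matches on the full text of each table, so any table that merely mentions the phrase would also match, and the last match wins. A small sketch of a stricter variant, which compares only the `<caption>` element and stops at the first hit, is shown here on a made-up HTML fragment. The helper name `find_table_by_caption` and the sample markup are assumptions, and the approach only works for tables that actually carry a `<caption>`.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page is far larger.
SAMPLE_HTML = """
<table><caption>Some other table</caption><tr><td>A</td></tr></table>
<table><caption>Population by region (sample)</caption>
  <tr><th>Region</th></tr><tr><td>Asia</td></tr></table>
<table><tr><td>See also: Population by region</td></tr></table>
"""


def find_table_by_caption(tables, phrase):
    """Return the index of the first table whose caption contains `phrase`."""
    for index, table in enumerate(tables):
        caption = table.find("caption")
        if caption is not None and phrase in caption.get_text():
            return index
    return None  # no caption matched


sample_tables = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("table")
print(find_table_by_caption(sample_tables, "Population by region"))  # prints 1
```

The third sample table mentions the phrase in a cell but has no caption, so it is ignored.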


In [6]:
# Collect one dict per data row, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0.)
columns = ["Region", "Density", "Population",
           "Most populous country", "Most populous city"]
rows = []
for row in tables[region_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip rows without <td> cells, such as the header row
        rows.append({name: cell.text.strip() for name, cell in zip(columns, col)})
region_data = pd.DataFrame(rows, columns=columns)
region_data

Unnamed: 0,Region,Density,Population,Most populous country,Most populous city
0,Asia,104.1,4641,"1,411,778,000[note 1] – China","37,400,000/13,515,000 – Greater Tokyo Area/To..."
1,Africa,44.4,1340,"0211,401,000 – Nigeria","20,076,000/9,500,000 – Greater Cairo/Cairo"
2,Europe,73.4,747,"0146,171,000 – Russia;approx. 110 million in ...","20,004,000/13,200,000 – Moscow metropolitan a..."
3,Latin America,24.1,653,"0214,103,000 – Brazil","21,650,000/12,252,000 – São Paulo Metro Area/..."
4,Northern America[note 2],14.9,368,"0332,909,000 – United States","18,819,000/8,804,000 – New York metropolitan ..."
5,Oceania,5,42,"0025,917,000 – Australia","5,367,000 – Sydney"
6,Antarctica,~0,0.004[17],N/A[note 3],"1,258 – McMurdo Station"
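
The scraped cells still carry footnote markers such as `[note 1]` and `[17]`, and the figures are strings rather than numbers. A possible cleanup step is sketched below using only the standard library; the helper names `strip_footnotes` and `to_number` are hypothetical, and the regular expression assumes footnotes always appear as bracketed text.

```python
import re

# Matches bracketed footnote markers such as "[note 1]" or "[17]".
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text):
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


def to_number(text):
    """Convert a cell such as '0.004[17]' or '1,340' to float, or None if not numeric."""
    cleaned = strip_footnotes(text).replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "~0" or "N/A"


print(strip_footnotes("Northern America[note 2]"))  # Northern America
print(to_number("0.004[17]"))                        # 0.004
print(to_number("~0"))                               # None
```

Returning `None` for cells like `~0` keeps the decision about approximate values explicit instead of guessing a number.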


### Searching for the "10 most densely populated countries" table

In [7]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5


In [8]:
# Build the rows in a list first; DataFrame.append was removed in pandas 2.0.
columns = ["Rank", "Country", "Population", "Area", "Density"]
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip rows without <td> cells, such as the header row
        # strip() every cell, including Rank and Country, to drop stray whitespace
        rows.append({name: cell.text.strip() for name, cell in zip(columns, col)})

population_data = pd.DataFrame(rows, columns=columns)
population_data

Unnamed: 0,Rank,Country,Population,Area,Density
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,172060000,143998,1195
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17680000,41526,426
9,10,Israel,9460000,22072,429
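
As a quick sanity check on the scraped values, density should be roughly population divided by area. The sketch below recomputes it for three rows copied from the output above; the column names `Computed` and `Matches` and the tolerance of 1 person per km² are assumptions.

```python
import pandas as pd

# Three rows copied from the table above, as the strings the scraper returns.
sample = pd.DataFrame(
    {
        "Country": ["Singapore", "Bangladesh", "Israel"],
        "Population": ["5704000", "172060000", "9460000"],
        "Area": ["710", "143998", "22072"],
        "Density": ["8033", "1195", "429"],
    }
)

# The scraper stores text, so convert the numeric columns first.
numeric_cols = ["Population", "Area", "Density"]
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric)

# Recompute density and compare it with the listed value.
sample["Computed"] = sample["Population"] / sample["Area"]
sample["Matches"] = (sample["Computed"] - sample["Density"]).abs() < 1
print(sample[["Country", "Density", "Computed", "Matches"]])
```

For Singapore, for example, 5704000 / 710 is about 8033.8, which agrees with the listed 8033 within the chosen tolerance.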


# Using read_html to replicate the above more concisely

In [9]:
# Named dfs rather than list, so the built-in list type is not shadowed.
dfs = pd.read_html(url, flavor="bs4")

In [10]:
dfs[1]

Unnamed: 0,Region,Density(inhabitants/km2),Population(millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,411,778,000[note 1] – China","37,400,000/13,515,000 – Greater Tokyo Area/Tok..."
1,Africa,44.4,1340,"0211,401,000 – Nigeria","20,076,000/9,500,000 – Greater Cairo/Cairo"
2,Europe,73.4,747,"0146,171,000 – Russia;approx. 110 million in E...","20,004,000/13,200,000 – Moscow metropolitan ar..."
3,Latin America,24.1,653,"0214,103,000 – Brazil","21,650,000/12,252,000 – São Paulo Metro Area/S..."
4,Northern America[note 2],14.9,368,"0332,909,000 – United States","18,819,000/8,804,000 – New York metropolitan a..."
5,Oceania,5,42,"0025,917,000 – Australia","5,367,000 – Sydney"
6,Antarctica,~0,0.004[17],N/A[note 3],"1,258 – McMurdo Station"


In [11]:
dfs[5]

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,172060000,143998,1195
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17680000,41526,426
9,10,Israel,9460000,22072,429


In [12]:
len(dfs)

26
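
Indexing the result by position (`[1]`, `[5]`) breaks whenever the page gains or loses a table. `read_html` also accepts a `match` argument that keeps only the tables containing the given text; a minimal sketch on a made-up HTML fragment follows. The sample markup is an assumption, and the call relies on whichever HTML parser pandas finds installed.

```python
from io import StringIO

import pandas as pd

# Made-up markup for illustration only.
SAMPLE_HTML = """
<table><caption>Table A</caption>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>x</td><td>1</td></tr></table>
<table><caption>Table B</caption>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>y</td><td>2</td></tr></table>
"""

# match keeps only the tables whose text matches the string (or regex).
matched = pd.read_html(StringIO(SAMPLE_HTML), match="Table B")
print(len(matched))
print(matched[0])
```

On the real page, the same idea would be to pass a distinctive phrase from the wanted table as `match` and then check how many tables come back before picking one.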

## Author
   #### Debarshi Biswas