In this notebook, I scrape data from Wikipedia using Python with the BeautifulSoup and Requests libraries to build a DataFrame of countries and their populations.

**1. Importing the libraries**

In [1]:
# install bs4
!pip install bs4

from bs4 import BeautifulSoup
import requests
import pandas as pd

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.2.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1273 sha256=68d2eae819bed91bcd106bbb98ceff3f118b318fe07e950d20c968e2abd68d96
  Stored in directory: /root/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.3 bs4-0.0.1 soupsieve-2.2.1


**2. Making a request to the website**

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
page = requests.get(url)
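
The request above does not check whether it succeeded. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and raises on HTTP error codes. The names `fetch_html`, `timeout` and `session` are my own and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the page body as text, raising on HTTP errors.

    `fetch_html` and its parameters are illustrative names (assumptions).
    """
    # A requests.Session exposes the same .get() as the requests module
    http = session or requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `page.text` in the next step could be replaced by the string returned from `fetch_html(url)`.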

**3. Getting the desired data**

In [3]:
soup = BeautifulSoup(page.text, "html.parser")
tables = soup.find_all('table') # finding all tables
table = tables[0].get_text() # selecting the first table
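
Calling `get_text()` on the whole table flattens it into one string, which then has to be split apart again. One possible alternative, sketched here under the assumption that the table uses ordinary `tr`, `th` and `td` tags, is to walk the rows and cells directly. The helper name `table_to_rows` and the sample HTML are hypothetical and hand-written; they are not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup


def table_to_rows(table_tag):
    """Return a list of rows, each a list of stripped cell texts."""
    rows = []
    for tr in table_tag.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


# Small hand-written snippet for illustration only (assumption)
sample_html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>China[a]</td><td>1,433,783,686</td></tr>
  <tr><td>India</td><td>1,366,417,754</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = table_to_rows(sample_soup.find_all("table")[0])
print(rows)
```

Each inner list keeps one cell per column, so the header row and data rows can be passed to `pd.DataFrame` without string splitting.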

**4. Creating the dataframe**

In [4]:
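# some cleaning
table = table[1:]  # drop the leading newline
table = table.replace("\n\n", "\n")  # collapse doubled newlines so rows end up separated by one blank line
# one row per blank-line-separated block, one cell per line
population_df = pd.DataFrame([x.split('\n') for x in table.split('\n\n')])
population_df = population_df[1:-2]  # drop the first row and the last two rows
population_df = population_df.drop(columns = [1, 2, 3, 5, 6])  # keep only columns 0 and 4
population_df = population_df.rename(columns = {0: "Country", 4: "Population"})
population_df["Population"] = population_df["Population"].replace(',', '', regex = True)  # remove thousands separators
population_df["Population"] = pd.to_numeric(population_df["Population"])
# strip footnote markers such as [a]; non-greedy so multi-character markers are removed too
population_df["Country"] = population_df["Country"].replace(r'\[.*?\]', '', regex = True)
# style a copy for display so population_df itself stays a plain DataFrame
styled_df = population_df.style.format({'Population': '{:,}'})

# visualizing the dataframe
styled_df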

| | Country | Population |
|---|---|---|
| 1 | China | 1433783686 |
| 2 | India | 1366417754 |
| 3 | United States | 329064917 |
| 4 | Indonesia | 270625568 |
| 5 | Pakistan | 216565318 |
| 6 | Brazil | 211049527 |
| 7 | Nigeria | 200963599 |
| 8 | Bangladesh | 163046161 |
| 9 | Russia | 145872256 |
| 10 | Mexico | 127575529 |
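
The two text-cleaning steps in cell 4 (removing thousands separators and footnote markers) can also be checked in isolation without pandas. The sketch below uses only the standard library. The helper names `clean_country` and `parse_population` are my own, and the `[a]` marker on the first sample row is an illustrative assumption; the population figures are the ones shown in the output above.

```python
import re


def clean_country(name):
    """Remove bracketed footnote markers such as [a] or [12] and trim whitespace."""
    return re.sub(r"\[.*?\]", "", name).strip()


def parse_population(text):
    """Convert a string like '1,433,783,686' to an int."""
    return int(text.replace(",", ""))


# Sample rows for illustration; the "[a]" marker is an assumption
raw_rows = [("China[a]", "1,433,783,686"), ("India", "1,366,417,754")]
cleaned = [(clean_country(c), parse_population(p)) for c, p in raw_rows]
print(cleaned)  # [('China', 1433783686), ('India', 1366417754)]
```

Keeping the cleaning in small functions like these makes it easy to test each rule on a handful of strings before applying it to the whole column.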
