<a href="https://colab.research.google.com/github/Kachanta/Webscrapping/blob/main/Updated_webscraping_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## WEB SCRAPING A LIST OF COUNTRIES AND THEIR POPULATIONS
Web scraping is a valuable skill for data scientists, as it allows us to extract data from websites and analyze it for insights. 

In this project, we use web scraping to extract population figures for the world's countries from Wikipedia. Wikipedia is a rich source of information on a wide range of topics, including demographics and population statistics for countries around the world.

We begin by importing the relevant libraries.


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting interface lives in matplotlib.pyplot

## Reading the Data
We first identify the Wikipedia page that lists population figures by country, store its address in a variable named `url`, and then read the page's tables into pandas.

In [11]:
# Read every table on the page from the URL and check the type of the object returned
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
raw = pd.read_html(url, skiprows = [0])
type(raw)

list
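As the output shows, `pd.read_html` returns a list, with one DataFrame per HTML table it finds. The following is a minimal offline sketch of that behaviour. The HTML snippet, country names, and figures are invented for illustration and are not taken from Wikipedia. It assumes an HTML parser that pandas can use (lxml, or beautifulsoup4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two tables, standing in for a real web page.
html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Aland</td><td>1000</td></tr>
  <tr><td>Borduria</td><td>2500</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Note</th></tr>
  <tr><td>2023</td><td>illustrative</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# The match argument keeps only tables whose text matches the given pattern.
populations = pd.read_html(StringIO(html), match="Population")[0]
print(populations)
```

Selecting a table with `match` can be more robust than relying on a fixed position such as `raw[0]`, since the order of tables on a live page may change.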

In [3]:
# Select the table containing our desired data (the first one in the list) and assign it a name
df = raw[0]
df.columns

## Formatting the Data
We make the columns easier to identify by dropping the top level of the multi-level column header and renaming the remaining columns.

In [4]:
# Drop the top level of the multi-level columns and reassign the appropriate column names
df.columns = df.columns.droplevel(0)
df.columns = ["Rank","Countries","Population","percentage","Date","Source","Notes"]
df.head()

| | Rank | Countries | Population | percentage | Date | Source | Notes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | China | 1411750000 | NaN | 31 Dec 2022 | Official estimate[4] | The population figure refers to mainland China... |
| 1 | 2 | India | 1375586000 | NaN | 1 Mar 2022 | Official projection[5] | The figure includes the population of the Indi... |
| 2 | 3 | United States | 334414390 | NaN | 22 Feb 2023 | National population clock[6] | The figure includes the 50 states and the Dist... |
| 3 | 4 | Indonesia | 275773800 | NaN | 1 Jul 2022 | Official estimate[7] | NaN |
| 4 | 5 | Pakistan | 235825000 | NaN | 1 Jul 2022 | UN projection[3] | The figure includes the population of Pakistan... |
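The column clean-up in the cell above can be tried in isolation on a small frame. This is a sketch only: the two-level header and the rows are hypothetical stand-ins for what a scraped table may produce. Comparing the number of new names with the number of columns before assigning is an optional safeguard, not something the original notebook does.

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to a scraped table's columns.
columns = pd.MultiIndex.from_tuples(
    [("Table", "Rank"), ("Table", "Country"), ("Table", "Population")]
)
demo = pd.DataFrame([[1, "Aland", 1000], [2, "Borduria", 2500]], columns=columns)
print(demo.columns.nlevels)  # two header levels before flattening

# Drop the outer level so that only the inner names remain.
flat = demo.copy()
flat.columns = flat.columns.droplevel(0)

# Assign new names only if their count matches the number of columns.
new_names = ["Rank", "Countries", "Population"]
if len(new_names) == flat.shape[1]:
    flat.columns = new_names
print(list(flat.columns))
```

If the source page gains or loses a column, the length check leaves the names untouched instead of raising an error partway through the notebook.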


## Exploratory Data Analysis and Wrangling  

In [5]:
df.describe(include="all")

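| | Rank | Countries | Population | percentage | Date | Source | Notes |
|---|---|---|---|---|---|---|---|
| count | 241 | 241 | 241.0 | 0.0 | 241 | 241 | 33 |
| unique | 196 | 241 | NaN | NaN | 57 | 202 | 33 |
| top | – | China | NaN | NaN | 1 Jul 2021 | UN projection[3] | The population figure refers to mainland China... |
| freq | 46 | 1 | NaN | NaN | 69 | 25 | 1 |
| mean | NaN | NaN | 32482410.0 | NaN | NaN | NaN | NaN |
| std | NaN | NaN | 132396700.0 | NaN | NaN | NaN | NaN |
| min | NaN | NaN | 47.0 | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | 301295.0 | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | 5431344.0 | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | 22125000.0 | NaN | NaN | NaN | NaN |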


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Rank        241 non-null    object 
 1   Countries   241 non-null    object 
 2   Population  241 non-null    int64  
 3   percentage  0 non-null      float64
 4   Date        241 non-null    object 
 5   Source      241 non-null    object 
 6   Notes       33 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 13.3+ KB
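The summaries above show that `Rank` is stored as `object` and that its most frequent value is "–" (46 rows), so the column is not purely numeric. If a numeric rank were needed, one possible approach is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a small hypothetical frame, not the scraped data.

```python
import pandas as pd

# Hypothetical frame: ranks stored as text, with a dash for unranked entries.
demo = pd.DataFrame(
    {"Rank": ["1", "2", "–", "3"], "Population": [1000, 900, 50, 40]}
)

# Values that cannot be parsed as numbers become NaN instead of raising an error.
rank_numeric = pd.to_numeric(demo["Rank"], errors="coerce")
print(rank_numeric.isna().sum())  # number of unranked rows

# Keep only the ranked rows and store the rank as an integer.
ranked = demo[rank_numeric.notna()].copy()
ranked["Rank"] = rank_numeric.dropna().astype(int)
print(ranked)
```

Whether the unranked rows should be dropped or kept with a missing rank depends on the analysis; this sketch drops them only to show the conversion.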


In [7]:
# Rank already orders the countries by population, so we use it as the index
df.set_index("Rank", inplace=True)
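One point to keep in mind: `describe` reported 196 unique `Rank` values across 241 rows, so the resulting index contains repeated labels. The sketch below, built on a hypothetical four-row frame, illustrates how label lookups behave on such an index.

```python
import pandas as pd

# Hypothetical frame in which two entries share the same dash "rank".
demo = pd.DataFrame(
    {
        "Rank": ["1", "2", "–", "–"],
        "Countries": ["Aland", "Borduria", "Cascara", "Drusselstein"],
    }
)
indexed = demo.set_index("Rank")

print(indexed.index.is_unique)  # False

# A repeated label returns every matching row as a DataFrame,
# while a unique label returns a single row as a Series.
unranked = indexed.loc["–"]
first = indexed.loc["1"]
print(len(unranked), type(first).__name__)
```

pandas allows a non-unique index, but code that expects exactly one row per label may need to account for the repeated entries.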

In [8]:
# df.info() above shows that percentage has 0 non-null values, so we drop it
df = df.drop(columns = "percentage")
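Dropping `percentage` by name works because `df.info()` already identified it as empty. As an alternative sketch, entirely empty columns can be found and removed without naming them. The frame below is hypothetical and only mirrors the shape of the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column that contains no values at all.
demo = pd.DataFrame(
    {
        "Countries": ["Aland", "Borduria"],
        "Population": [1000, 2500],
        "percentage": [np.nan, np.nan],
    }
)

# List the columns in which every value is missing.
empty_cols = demo.columns[demo.isna().all()].tolist()
print(empty_cols)

# Drop every column that is entirely missing in one step.
cleaned = demo.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Naming the column explicitly, as the notebook does, is clearer when only one column is affected; the `dropna` form may be convenient when several scraped columns come back empty.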

In [9]:
df.head()

| Rank | Countries | Population | Date | Source | Notes |
|---|---|---|---|---|---|
| 1 | China | 1411750000 | 31 Dec 2022 | Official estimate[4] | The population figure refers to mainland China... |
| 2 | India | 1375586000 | 1 Mar 2022 | Official projection[5] | The figure includes the population of the Indi... |
| 3 | United States | 334414390 | 22 Feb 2023 | National population clock[6] | The figure includes the 50 states and the Dist... |
| 4 | Indonesia | 275773800 | 1 Jul 2022 | Official estimate[7] | NaN |
| 5 | Pakistan | 235825000 | 1 Jul 2022 | UN projection[3] | The figure includes the population of Pakistan... |
