# Projecting 2050 World Population Based On Wikipedia's Current World Population & Population Growth Data

This project aims to project the world's population in 2050 using existing Wikipedia data on population by country, in conjunction with data on total population growth (including migration).

The growth data available covers 2005 to 2020. My method is to calculate the average annual growth rate over that period and apply it to the latest available population figures to get an estimate for 2050.

# EDA

## Scraping Population Data by Country from Wikipedia

In this section I scrape Wikipedia for the current population of each country, based on the latest figure available for each.

In [1]:
import pandas as pd

In [3]:
import ssl

In [4]:
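# Workaround for certificate errors: disables HTTPS certificate verification for urllib requests in this session
ssl._create_default_https_context = ssl._create_unverified_context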

### Scraping and reading the table data

Here I use pandas' `read_html` function, which returns every table on the page as a list of DataFrames, and then check which index holds the table I need.

In [5]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#See_also')

In [6]:
len(tables)

3

In [7]:
tables[0]

Unnamed: 0.1,Unnamed: 0,Location,Population,% of world,Date,Source (official or from the United Nations),Notes
0,–,World,8119000000,100%,1 Jul 2024,UN projection[1][3],
1,1/2 [b],China,1409670000,17.3%,31 Dec 2023,Official estimate[5],[c]
2,1/2 [b],India,1402737000,17.2%,1 Jul 2024,Official projection[6],[d]
3,3,United States,340110988,4.2%,1 Jul 2024,Official estimate[7],[e]
4,4,Indonesia,282477584,3.5%,30 Jun 2024,National annual projection[8],
...,...,...,...,...,...,...,...
235,–,Niue (New Zealand),1681,0%,11 Nov 2022,2022 Census [239],
236,–,Tokelau (New Zealand),1647,0%,1 Jan 2019,2019 Census [240],
237,195,Vatican City,764,0%,26 Jun 2023,Official figure[241],[ah]
238,–,Cocos (Keeling) Islands (Australia),593,0%,30 Jun 2020,2021 Census[242],


In [8]:
pop_table = tables[0]
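
The manual index check above could also be done by column name. The following is a minimal sketch, not part of the original notebook: `find_table` is a hypothetical helper, and `demo_tables` stands in for the list returned by `pd.read_html` so the example runs without network access.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return (index, table) for the first DataFrame containing all required columns.

    Hypothetical helper for illustration; not a pandas function.
    """
    required = set(required_columns)
    for index, table in enumerate(tables):
        # For MultiIndex headers each column is a tuple; compare on its last level
        names = {
            str(col[-1]) if isinstance(col, tuple) else str(col)
            for col in table.columns
        }
        if required <= names:
            return index, table
    raise ValueError(f"no table contains columns {sorted(required)}")


# Stand-in for the list returned by pd.read_html(...)
demo_tables = [
    pd.DataFrame({"Note": ["x"]}),
    pd.DataFrame({"Location": ["World"], "Population": [8119000000]}),
]

index, table = find_table(demo_tables, ["Location", "Population"])
print(index, list(table.columns))
```

Selecting by column name keeps the notebook working if Wikipedia adds or reorders tables on the page, as long as the headers themselves stay the same.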

### Keeping relevant columns

In [9]:
pop_table.columns

Index(['Unnamed: 0', 'Location', 'Population', '% of world', 'Date',
       'Source (official or from the United Nations)', 'Notes'],
      dtype='object')

I am removing all columns that are not relevant to the project. I assume that all figures are current as of 2024, even though the Date column shows that some entries are older.

In [10]:
pop_table = pop_table.drop(columns=['Unnamed: 0', '% of world', 'Date',
                                    'Source (official or from the United Nations)', 'Notes'])

In [11]:
pop_table

Unnamed: 0,Location,Population
0,World,8119000000
1,China,1409670000
2,India,1402737000
3,United States,340110988
4,Indonesia,282477584
...,...,...
235,Niue (New Zealand),1681
236,Tokelau (New Zealand),1647
237,Vatican City,764
238,Cocos (Keeling) Islands (Australia),593


In [12]:
pop_table = pop_table.dropna()

In [13]:
pop_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Location    240 non-null    object
 1   Population  240 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 3.9+ KB


### Saving the result into CSV

In [14]:
pop_table.to_csv('pop_table.csv', index=False)

## Scraping List of Countries by Population Growth Rate

In this section, I scrape a different Wikipedia article for its table of population growth rates by country.

In [5]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_growth_rate')

In [6]:
tables[0]

Unnamed: 0_level_0,Country (or territory),CIA[2][3] 2023–2024,CIA[2][3] 2023–2024,WB[4] 2023,UN[5] 2005–10,UN[5] 2010–15,UN[5] 2015–20
Unnamed: 0_level_1,Country (or territory),(%),Year,WB[4] 2023,UN[5] 2005–10,UN[5] 2010–15,UN[5] 2015–20
0,World,1.17,,0.9,1.23,1.19,1.09
1,Afghanistan,2.22,2024.0,2.7,2.78,3.16,2.41
2,Albania *,0.16,2024.0,-1.1,-0.92,-0.12,0.13
3,Algeria *,1.54,2024.0,1.6,1.63,1.98,1.67
4,Andorra *,-0.12,2024.0,0.3,1.37,-1.59,-0.21
...,...,...,...,...,...,...,...
236,U.S. Virgin Islands *,-0.49,2023.0,-0.5,-0.26,-0.02,
237,Wallis and Futuna * (France),0.23,2023.0,,-0.98,-0.62,
238,West Bank *,1.66,2023.0,,,,
239,Gaza Strip *,1.99,2023.0,,,,


In [7]:
growth_table = tables[0]

### Keeping relevant columns

I will mainly use the UN data for this project, but I will keep the columns from the other sources to fill in where UN data is missing.

In [8]:
growth_table.columns

MultiIndex([('Country (or territory)', 'Country (or territory)'),
            (   'CIA[2][3] 2023–2024',                    '(%)'),
            (   'CIA[2][3] 2023–2024',                   'Year'),
            (            'WB[4] 2023',             'WB[4] 2023'),
            (         'UN[5] 2005–10',          'UN[5] 2005–10'),
            (         'UN[5] 2010–15',          'UN[5] 2010–15'),
            (         'UN[5] 2015–20',          'UN[5] 2015–20')],
           )

In [9]:
growth_table = growth_table.drop([
    (   'CIA[2][3] 2023–2024',                   'Year')
], axis = "columns")

Next, I flatten the MultiIndex in the column headers so that each column has a single name.

In [10]:
growth_table.sample(10)

Unnamed: 0_level_0,Country (or territory),CIA[2][3] 2023–2024,WB[4] 2023,UN[5] 2005–10,UN[5] 2010–15,UN[5] 2015–20
Unnamed: 0_level_1,Country (or territory),(%),WB[4] 2023,UN[5] 2005–10,UN[5] 2010–15,UN[5] 2015–20
61,Finland *,0.22,0.5,0.4,0.43,0.36
178,Tunisia *,0.63,0.8,1.04,1.16,1.09
103,Luxembourg *,1.58,2.3,2.08,2.19,1.27
31,Cambodia *,0.99,1.0,1.51,1.62,1.49
235,British Virgin Islands *,1.87,0.7,3.23,2.02,
189,Vanuatu *,1.59,2.3,2.42,2.26,2.1
172,Tanzania *,2.75,2.9,3.14,3.12,3.06
168,Switzerland *,0.64,0.8,1.11,1.21,0.83
206,Faroe Islands * (Denmark),0.63,0.3,-0.59,0.33,
223,Northern Mariana Islands * (US),-0.35,0.5,-3.57,0.44,


In [11]:
growth_table.columns = growth_table.columns.droplevel(0)
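
`droplevel(0)` keeps only the lower header level, which is why the CIA column ends up named `(%)`. As an alternative, the two levels can be joined so the source name survives. This is a sketch on a small stand-in frame (`demo`) that reuses three of the headers shown above; it is not what the notebook does.

```python
import pandas as pd

# Stand-in frame with the same two-level header shape as growth_table
columns = pd.MultiIndex.from_tuples([
    ("Country (or territory)", "Country (or territory)"),
    ("CIA[2][3] 2023–2024", "(%)"),
    ("UN[5] 2015–20", "UN[5] 2015–20"),
])
demo = pd.DataFrame([["World", 1.17, 1.09]], columns=columns)

flat = demo.copy()
# Keep one copy when both levels repeat the same text, otherwise join them
flat.columns = [
    top if top == bottom else f"{top} {bottom}"
    for top, bottom in demo.columns
]
print(list(flat.columns))
```

Either approach works here because the columns are renamed right afterwards; joining the levels mainly helps when the lower level alone would be ambiguous.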

In [12]:
growth_table.sample(5)

Unnamed: 0,Country (or territory),(%),WB[4] 2023,UN[5] 2005–10,UN[5] 2010–15,UN[5] 2015–20
27,Burkina Faso *,2.4,2.5,3.01,2.98,2.87
149,San Marino *,0.59,-0.1,1.24,1.16,0.51
125,New Zealand *[d],1.06,2.0,1.1,1.09,0.93
106,Malaysia *,1.01,1.1,1.83,1.78,1.35
225,French Polynesia *,0.7,0.8,1.01,1.07,


In [17]:
growth_table = growth_table.rename(columns={"Country (or territory)": "Country",
                             "(%)" : "CIA",
                             "WB[4] 2023" : "WorldBank",
                             "UN[5] 2005–10" : "UN2010",
                             "UN[5] 2010–15": "UN2015",
                             "UN[5] 2015–20" : "UN2020"} )

In [18]:
growth_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    241 non-null    object 
 1   CIA        231 non-null    float64
 2   WorldBank  217 non-null    float64
 3   UN2010     233 non-null    float64
 4   UN2015     232 non-null    float64
 5   UN2020     197 non-null    float64
dtypes: float64(5), object(1)
memory usage: 11.4+ KB


### Saving the result into CSV

In [19]:
growth_table.to_csv('growth_table.csv', index=False)

## Joining the tables

### Preparing the tables for join

In [16]:
pop_table = pd.read_csv("pop_table.csv")
growth_table = pd.read_csv("growth_table.csv")

To keep the data relevant, I will remove all countries and territories with a population of 50,000 or fewer.

In [61]:
pop_table.tail(70)

Unnamed: 0,Country,Population
141,Jamaica,2825544
142,Moldova,2423300
143,Gambia,2417471
144,Botswana,2410338
145,Gabon,2408586
...,...,...
206,Bermuda (UK),64055
207,Greenland (Denmark),56865
208,South Ossetia,56520
209,Faroe Islands (Denmark),54648


In [58]:
pop_table = pop_table.loc[pop_table['Population'] > 50000]

In [60]:
pop_table.tail()

Unnamed: 0,Country,Population
206,Bermuda (UK),64055
207,Greenland (Denmark),56865
208,South Ossetia,56520
209,Faroe Islands (Denmark),54648
210,Saint Kitts and Nevis,51320


In [None]:
# making sure that the columns to join on have the same name

In [29]:
pop_table = pop_table.rename(columns = {"Location" :"Country"})

pop_table

Unnamed: 0,Country,Population
0,World,8119000000
1,China,1409670000
2,India,1402737000
3,United States,340110988
4,Indonesia,282477584
...,...,...
206,Bermuda (UK),64055
207,Greenland (Denmark),56865
208,South Ossetia,56520
209,Faroe Islands (Denmark),54648


In [23]:
# looking at growth table columns

growth_table

Unnamed: 0,Country,CIA,WorldBank,UN2010,UN2015,UN2020
0,World,1.17,0.9,1.23,1.19,1.09
1,Afghanistan,2.22,2.7,2.78,3.16,2.41
2,Albania *,0.16,-1.1,-0.92,-0.12,0.13
3,Algeria *,1.54,1.6,1.63,1.98,1.67
4,Andorra *,-0.12,0.3,1.37,-1.59,-0.21
...,...,...,...,...,...,...
236,U.S. Virgin Islands *,-0.49,-0.5,-0.26,-0.02,
237,Wallis and Futuna * (France),0.23,,-0.98,-0.62,
238,West Bank *,1.66,,,,
239,Gaza Strip *,1.99,,,,


In [25]:
growth_table['Country'] = growth_table['Country'].replace(r'\*', '', regex=True)

growth_table['Country']

0                            World
1                      Afghanistan
2                         Albania 
3                         Algeria 
4                         Andorra 
                  ...             
236           U.S. Virgin Islands 
237    Wallis and Futuna  (France)
238                     West Bank 
239                    Gaza Strip 
240                     Palestine 
Name: Country, Length: 241, dtype: object

In [None]:
# one more step before joining the tables: strip the whitespace left behind by the removed asterisks

growth_table['Country'] = growth_table['Country'].str.strip()

In [56]:
world_pop = pd.merge(pop_table, growth_table, on = "Country",how = "inner")

world_pop

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020
0,World,8119000000,1.17,0.9,1.23,1.19,1.09
1,India,1402737000,0.70,0.8,1.46,1.23,1.10
2,United States,340110988,0.68,0.5,0.90,0.72,0.71
3,Indonesia,282477584,0.76,0.7,1.35,1.25,1.05
4,Pakistan,241499431,1.91,2.0,2.05,2.09,1.91
...,...,...,...,...,...,...,...
177,Antigua and Barbuda,103603,1.11,0.6,1.18,1.08,1.01
178,Tonga,100179,-0.30,0.9,0.60,0.42,0.86
179,Andorra,86801,-0.12,0.3,1.37,-1.59,-0.21
180,Dominica,67408,0.02,0.4,0.23,0.48,0.51
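
An inner join silently drops any row whose key has no exact match in the other table. One way to surface those rows is a left join with `indicator=True`, sketched below on two small stand-in frames (`pop_demo` and `growth_demo` are illustrative, using values from the tables above).

```python
import pandas as pd

pop_demo = pd.DataFrame({
    "Country": ["World", "China", "India"],
    "Population": [8119000000, 1409670000, 1402737000],
})
growth_demo = pd.DataFrame({
    "Country": ["World", "China [a]", "India"],  # footnote marker breaks the match
    "UN2020": [1.09, 0.39, 1.10],
})

# indicator=True adds a "_merge" column saying where each row was found
check = pd.merge(pop_demo, growth_demo, on="Country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Running the same check on the full tables would list every country in `pop_table` that has no growth data under the same spelling.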


#### Troubleshooting strings in country column

In [83]:
# China is missing...

print(growth_table[growth_table['Country'].str.contains(r'\[', na=False)])


                     Country   CIA  WorldBank  UN2010  UN2015  UN2020
37                 China [a]  0.18       -0.1    0.57    0.54    0.39
39               Comoros [b]  1.34        1.8    2.40    2.40    2.24
62                France [c]  0.31        0.3    0.58    0.45    0.39
125          New Zealand [d]  1.06        2.0    1.10    1.09    0.93
202  Channel Islands (UK)[e]   NaN        0.7    0.67    0.51     NaN


In [82]:
# First attempt: the pattern expects a space after the closing bracket, so it matches nothing
growth_table['Country'] = growth_table['Country'].replace(r'\s* \[[a-z]\] ', '', regex=True)

In [80]:
filtered_table = growth_table[growth_table['Country'].str.contains('China', case=False, na=False)]
print(filtered_table)


      Country   CIA  WorldBank  UN2010  UN2015  UN2020
37  China [a]  0.18       -0.1    0.57    0.54    0.39


In [86]:
# Trying another solution (this pattern still expects a trailing space, so the markers remain)

growth_table['Country'] = growth_table['Country'].str.replace(r'\[[a-z]\] ', '', regex=True).str.strip()

In [93]:
growth_table['Country'] = growth_table['Country'].str.replace(r'\[', '', regex=True)

In [95]:
print(growth_table[growth_table['Country'].str.contains(r'\]', na=False)])

                    Country   CIA  WorldBank  UN2010  UN2015  UN2020
37                 China a]  0.18       -0.1    0.57    0.54    0.39
39               Comoros b]  1.34        1.8    2.40    2.40    2.24
62                France c]  0.31        0.3    0.58    0.45    0.39
125          New Zealand d]  1.06        2.0    1.10    1.09    0.93
202  Channel Islands (UK)e]   NaN        0.7    0.67    0.51     NaN


In [96]:
growth_table['Country'] = growth_table['Country'].str.replace(r'\s*[a-z]\]', '', regex=True)

In [98]:
print(growth_table[growth_table['Country'].str.contains('China', na=False)])

   Country   CIA  WorldBank  UN2010  UN2015  UN2020
37   China  0.18       -0.1    0.57    0.54    0.39
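
The cleanup steps above (asterisks, footnote markers, stray whitespace) can be collapsed into one pass. This is a sketch under the assumption that footnote markers are always lowercase letters in square brackets; `clean_country` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def clean_country(names: pd.Series) -> pd.Series:
    """Strip footnote markers, asterisks and extra whitespace from country names."""
    return (
        names.str.replace(r"\[[a-z]+\]", "", regex=True)  # footnote markers such as [a]
             .str.replace("*", "", regex=False)           # literal asterisks
             .str.replace(r"\s+", " ", regex=True)        # collapse repeated spaces
             .str.strip()
    )


raw = pd.Series([
    "China [a]",
    "Albania *",
    "New Zealand *[d]",
    "Channel Islands (UK)[e]",
    "Wallis and Futuna * (France)",
])
cleaned = clean_country(raw).tolist()
print(cleaned)
```

Collapsing repeated spaces also turns `Wallis and Futuna  (France)` into a single-spaced name, which the step-by-step version leaves with a double space.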


### The final join

In [100]:
world_pop = pd.merge(pop_table, growth_table, on = "Country",how = "inner")

world_pop.head(15)

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020
0,World,8119000000,1.17,0.9,1.23,1.19,1.09
1,China,1409670000,0.18,-0.1,0.57,0.54,0.39
2,India,1402737000,0.7,0.8,1.46,1.23,1.1
3,United States,340110988,0.68,0.5,0.9,0.72,0.71
4,Indonesia,282477584,0.76,0.7,1.35,1.25,1.05
5,Pakistan,241499431,1.91,2.0,2.05,2.09,1.91
6,Nigeria,223800000,2.53,2.4,2.64,2.67,2.58
7,Brazil,212583750,0.61,0.5,1.03,0.91,0.63
8,Bangladesh,169828911,0.89,1.0,1.18,1.16,1.04
9,Russia,146150789,-0.24,-0.3,-0.07,0.04,0.01


## Calculating the growth rate

In [101]:
world_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186 entries, 0 to 185
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     186 non-null    object 
 1   Population  186 non-null    int64  
 2   CIA         185 non-null    float64
 3   WorldBank   185 non-null    float64
 4   UN2010      183 non-null    float64
 5   UN2015      183 non-null    float64
 6   UN2020      184 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 10.3+ KB


I will start by taking the average of the three UN figures (2005–10, 2010–15 and 2015–20). Where all of them are missing, I will fall back on the World Bank figure.

In [102]:
world_pop['growth_rate'] = world_pop[['UN2010', 'UN2015', 'UN2020']].mean(axis = 1, skipna=True)

In [103]:
world_pop

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate
0,World,8119000000,1.17,0.9,1.23,1.19,1.09,1.170000
1,China,1409670000,0.18,-0.1,0.57,0.54,0.39,0.500000
2,India,1402737000,0.70,0.8,1.46,1.23,1.10,1.263333
3,United States,340110988,0.68,0.5,0.90,0.72,0.71,0.776667
4,Indonesia,282477584,0.76,0.7,1.35,1.25,1.05,1.216667
...,...,...,...,...,...,...,...,...
181,Antigua and Barbuda,103603,1.11,0.6,1.18,1.08,1.01,1.090000
182,Tonga,100179,-0.30,0.9,0.60,0.42,0.86,0.626667
183,Andorra,86801,-0.12,0.3,1.37,-1.59,-0.21,-0.143333
184,Dominica,67408,0.02,0.4,0.23,0.48,0.51,0.406667


Looking at which countries have no UN data:

In [104]:
world_pop[world_pop['growth_rate'].isnull()]

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate
117,Palestine,5483450,,2.4,,,,
149,Kosovo,1585566,0.62,-0.7,,,,


In [105]:
world_pop['growth_rate'] = world_pop['growth_rate'].fillna(world_pop['WorldBank'])

The missing growth rates for these rows have now been filled in from the World Bank data.

In [107]:
world_pop.iloc[117]

Country        Palestine
Population       5483450
CIA                  NaN
WorldBank            2.4
UN2010               NaN
UN2015               NaN
UN2020               NaN
growth_rate          2.4
Name: 117, dtype: object
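
The fallback can be extended to a full chain (UN average, then World Bank, then CIA) while recording which source each rate came from. The sketch below uses a three-row stand-in frame built from values shown above; the `rate_source` column is an illustrative addition, not something the notebook computes.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Country": ["China", "Palestine", "Kosovo"],
    "CIA": [0.18, np.nan, 0.62],
    "WorldBank": [-0.1, 2.4, -0.7],
    "UN2010": [0.57, np.nan, np.nan],
    "UN2015": [0.54, np.nan, np.nan],
    "UN2020": [0.39, np.nan, np.nan],
})

# Row-wise mean of the UN columns; NaN only when all three are missing
un_mean = demo[["UN2010", "UN2015", "UN2020"]].mean(axis=1)

# Fall back to World Bank first, then CIA
demo["growth_rate"] = un_mean.fillna(demo["WorldBank"]).fillna(demo["CIA"])

# Record the first source that had a value for each row
demo["rate_source"] = np.select(
    [un_mean.notna(), demo["WorldBank"].notna(), demo["CIA"].notna()],
    ["UN", "WorldBank", "CIA"],
    default="none",
)
print(demo[["Country", "growth_rate", "rate_source"]])
```

Keeping the source alongside the rate makes it easier to spot cases such as Kosovo, where the World Bank and CIA figures disagree in sign.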

## Calculating the population at 2050

In [113]:
world_pop['pop_2050'] = world_pop['Population'] * (1 + (world_pop['growth_rate'])/100) ** 50
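
The cell above applies compound growth, `population * (1 + rate/100) ** years`, with the exponent fixed at 50. As a sketch, the number of years can be made explicit; `project_population` is a hypothetical helper, and the 2024 base year is the assumption stated earlier in this notebook. With that base year the span to 2050 is 26 years, so an exponent of 50 compounds the rate over a longer horizon.

```python
def project_population(population, annual_rate_pct, base_year, target_year):
    """Compound an annual growth rate (in percent) from base_year to target_year."""
    years = target_year - base_year
    return population * (1 + annual_rate_pct / 100) ** years


world_2024 = 8_119_000_000  # World row from the population table
world_rate = 1.17           # average of the three UN figures for the World row

# 26 compounding years: 2024 -> 2050
print(round(project_population(world_2024, world_rate, 2024, 2050)))

# 50 compounding years, equivalent to the fixed exponent used in the cell above
print(round(project_population(world_2024, world_rate, 2000, 2050)))
```

The function also works unchanged on pandas Series, so it could be applied to the `Population` and `growth_rate` columns directly.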

In [114]:
world_pop

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate,pop_2050
0,World,8119000000,1.17,0.9,1.23,1.19,1.09,1.170000,1.452413e+10
1,China,1409670000,0.18,-0.1,0.57,0.54,0.39,0.500000,1.808925e+09
2,India,1402737000,0.70,0.8,1.46,1.23,1.10,1.263333,2.627769e+09
3,United States,340110988,0.68,0.5,0.90,0.72,0.71,0.776667,5.007490e+08
4,Indonesia,282477584,0.76,0.7,1.35,1.25,1.05,1.216667,5.171131e+08
...,...,...,...,...,...,...,...,...,...
181,Antigua and Barbuda,103603,1.11,0.6,1.18,1.08,1.01,1.090000,1.781485e+05
182,Tonga,100179,-0.30,0.9,0.60,0.42,0.86,0.626667,1.369087e+05
183,Andorra,86801,-0.12,0.3,1.37,-1.59,-0.21,-0.143333,8.079379e+04
184,Dominica,67408,0.02,0.4,0.23,0.48,0.51,0.406667,8.257316e+04


## Final cleanup & saving the table

In [115]:
world_pop['pop_2050'] = world_pop['pop_2050'].astype('int64')

In [116]:
world_pop.head()

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate,pop_2050
0,World,8119000000,1.17,0.9,1.23,1.19,1.09,1.17,14524127111
1,China,1409670000,0.18,-0.1,0.57,0.54,0.39,0.5,1808924934
2,India,1402737000,0.7,0.8,1.46,1.23,1.1,1.263333,2627769156
3,United States,340110988,0.68,0.5,0.9,0.72,0.71,0.776667,500749045
4,Indonesia,282477584,0.76,0.7,1.35,1.25,1.05,1.216667,517113084


In [117]:
world_pop.to_csv('world_pop_2050.csv')

In [5]:
world_pop = pd.read_csv('world_pop_2050.csv')

In [6]:
world_pop['growth_rate'] = world_pop['growth_rate'].round(2)

world_pop['growth_rate']

0      1.17
1      0.50
2      1.26
3      0.78
4      1.22
       ... 
181    1.09
182    0.63
183   -0.14
184    0.41
185    1.04
Name: growth_rate, Length: 186, dtype: float64

In [8]:
world_pop.sample(3)

Unnamed: 0.1,Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate,pop_2050
85,85,Cuba,11089511,-0.19,-0.2,0.09,0.23,0.06,0.13,11814090
32,32,Algeria,46700000,1.54,1.6,1.63,1.98,1.67,1.76,111730513
128,128,Georgia,3694600,0.01,1.3,-1.17,-1.37,-0.27,-0.94,2307885


In [9]:
world_pop = world_pop.drop(columns=["Unnamed: 0"])
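
The `Unnamed: 0` column appears because the earlier `to_csv('world_pop_2050.csv')` call wrote the DataFrame index as an extra, unnamed column. The sketch below reproduces this on a one-row stand-in frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

demo = pd.DataFrame({"Country": ["Cuba"], "pop_2050": [11814090]})

# Default to_csv writes the index as a first, unnamed column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the round trip preserves the columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

Passing `index=False` when saving, as the other `to_csv` calls in this notebook do, avoids the need to drop the column after reloading.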

In [10]:
world_pop.sample(3)

Unnamed: 0,Country,Population,CIA,WorldBank,UN2010,UN2015,UN2020,growth_rate,pop_2050
165,Maldives,515132,-0.17,-0.5,2.68,2.76,1.85,2.43,1711106
108,Central African Republic,6470307,1.76,2.9,1.5,0.43,1.58,1.17,11574770
55,North Korea,25950000,0.44,0.4,0.34,0.42,0.47,0.41,31840927


In [12]:
world_pop.to_csv('raw_2050_pop.csv', index=False)

In [13]:
world_pop = world_pop.drop(columns=["CIA", "WorldBank", "UN2010", "UN2015", "UN2020"])

In [14]:
world_pop.sample(3)

Unnamed: 0,Country,Population,growth_rate,pop_2050
5,Pakistan,241499431,2.02,655348889
163,Suriname,616500,0.98,1003925
140,Botswana,2410338,1.76,5757335


In [15]:
world_pop.to_csv('world_pop_2050.csv', index=False)