# Outline

In this notebook I'll fetch three quantities for each country:
- **Country Population:** population per country -> source: [Worldometer - Population by Country](https://www.worldometers.info/world-population/population-by-country)
- **Internet Penetration Factor:** the fraction of a country's population that uses the internet -> source: [World Bank Data - Individuals using the Internet (% of population)](https://data.worldbank.org/indicator/IT.NET.USER.ZS)
- **Timezone:** the official time zone of a country, taken here as the time zone of its capital -> source: [The World Clock — Capitals Worldwide](https://www.timeanddate.com/worldclock/?low=c&sort=1)

The steps are:
1. Fetching each source and locating the data table in it (a saved HTML page or a downloaded CSV)
2. Normalizing country names across the three sources
3. Merging and exporting to CSV

# Modules

In [1]:
import numpy as np  # for multi-dimensional array manipulations
import pandas as pd  # dataframe/table manipulations
import matplotlib.pyplot as plt  # visualizations for sanity checks

import country_converter as coco  # for country name and code conversions
cc = coco.CountryConverter()  # converts country names in various formats, to a specific format

import datetime
import pytz

# Parameters

In [2]:
COUNTRY_NAME_FORMAT = 'ISO3'

# Load & normalize

## Population

In [3]:
# The site does not like non-human requests, so visit the page
# `https://www.worldometers.info/world-population/population-by-country/`
# in your browser, save it as a single HTML file, and use the path to that file below.
tmp_dfs = pd.read_html(
    "data/Population by Country (2025) - Worldometer.html"
)
df_population_by_country = tmp_dfs[0].copy()
# Select and Re-name
df_population_by_country = df_population_by_country[[
    "Country (or dependency)", "Population 2025"
]].copy()
df_population_by_country.columns = [ "country", "population" ]
df_population_by_country['country'] = cc.pandas_convert(
    df_population_by_country['country'],
    to=COUNTRY_NAME_FORMAT
)

# Glimpse
display( df_population_by_country.head(5) )

# Just for fun
print(f"World population 2025: {df_population_by_country['population'].sum():>g}")

Unnamed: 0,country,population
0,IND,1463865525
1,CHN,1416096094
2,USA,347275807
3,IDN,285721236
4,PAK,255219554


World population 2025: 8.22975e+09
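
A quick check after the conversion is to list the names that could not be mapped to a code. The sketch below assumes unmatched names come back as the string `'not found'` (the same sentinel the internet-penetration cell filters on). The `raw` and `converted` series are made-up stand-ins for the real columns, and `country_converter` itself is not called here.

```python
import pandas as pd

# Hypothetical stand-ins: raw names from the page and their converted codes.
raw = pd.Series(["India", "China", "Atlantis"])
converted = pd.Series(["IND", "CHN", "not found"])

# Keep the raw names whose conversion failed, so they can be fixed by hand.
unmatched = raw[converted == "not found"].tolist()
print(unmatched)
```

Any name that shows up in `unmatched` would otherwise be silently lost in the later merge.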


## Internet penetration

In [None]:
# The World Bank provides a CSV file with the relevant information. Visit
# `https://data.worldbank.org/indicator/IT.NET.USER.ZS`
# and download the CSV version from the panel on the right.
tmp_df = pd.read_csv(
    "data/API_IT.NET.USER.ZS_DS2_en_csv_v2_512034.csv",
    skiprows=4
)
df_internet_penetration = tmp_df[[
    'Country Name'
]].rename(columns={ 'Country Name': 'country' })
df_internet_penetration['country'] = cc.pandas_convert(
    df_internet_penetration['country'],
    to=COUNTRY_NAME_FORMAT
)

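# Assume penetration only increases over time, so the row-wise maximum over
# the yearly columns stands in for the most recent value. The first four
# columns are metadata and the last column is skipped.
# Rows with no data at all yield NaN (NumPy emits a RuntimeWarning for them);
# they are removed by the filter below.
df_internet_penetration['internet_penetration_fraction'] = np.nanmax( tmp_df.iloc[:,4:-1].to_numpy(), axis=1 ) / 100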

# Filter
df_internet_penetration = df_internet_penetration[
    (df_internet_penetration['country'] != 'not found' ) & ( ~df_internet_penetration['internet_penetration_fraction'].isna() )
]

# Display
display(df_internet_penetration.head(5))
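
If the "always increasing" assumption is a concern, one alternative (not what the cell above does) is to take the most recent reported value per row instead of the maximum. The sketch below compares the two on a made-up table; the country codes, years and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly values (percent) for three made-up rows.
years = pd.DataFrame(
    {
        "2021": [50.0, 80.0, np.nan],
        "2022": [55.0, np.nan, np.nan],
        "2023": [np.nan, 78.0, np.nan],
    },
    index=["AAA", "BBB", "CCC"],
)

# Forward-fill along each row, then read the last column: this yields the
# most recent reported value, or NaN when a row has no data at all.
latest = years.ffill(axis=1).iloc[:, -1] / 100

# Row-wise maximum, skipping NaN; rows with no data stay NaN.
maximum = years.max(axis=1) / 100

print(pd.DataFrame({"latest": latest, "maximum": maximum}))
```

For row `BBB` the two disagree (0.78 versus 0.80), which is exactly the case where the maximum overstates the latest figure.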

## Timezone

In [5]:
# The site does not like non-human requests, so visit the page
# `https://www.timeanddate.com/worldclock/?low=c&sort=1`
# in your browser, save it as a single HTML file, and use the path to that file below.
tmp_dfs = pd.read_html(
    "data/The World Clock — Capitals Worldwide.html"
)
df_country_timezone = tmp_dfs[0]
df_country_timezone.columns = [ 0, 1, 2, 3 ]
df_country_timezone = pd.concat((
    df_country_timezone.iloc[:,:2].rename(columns={0: 'country', 1: 'time'}),
    df_country_timezone.iloc[:,2:4].rename(columns={2: 'country', 3: 'time'}),
))

df_country_timezone = df_country_timezone[
    ~df_country_timezone['time'].isna()
].copy()

df_country_timezone['country_raw'] = df_country_timezone['country'].map( lambda x: x.split(',',1)[0] if isinstance(x, str) else '' )

# ISO3 country codes
df_country_timezone['country'] = cc.pandas_convert(
    df_country_timezone['country_raw'],
    to=COUNTRY_NAME_FORMAT
)

# Convert a "Day HH:MM" clock string to hours counted from Friday 00:00.
# Friday is the hard-coded reference day; Thursday and Saturday are shifted
# by -24 h and +24 h. Any other day raises an error instead of failing later.
def convert_time_to_24h(time_str):
    dayofweek, time = time_str.split()
    hour, minutes = map(int, time.split(':'))
    hour_offsets = {'Thu': -24, 'Fri': 0, 'Sat': 24}
    if dayofweek not in hour_offsets:
        raise ValueError(f"Unexpected day of week in {time_str!r}")
    return (hour + hour_offsets[dayofweek]) + minutes/60
df_country_timezone['timezone_hour'] = df_country_timezone['time'].map( convert_time_to_24h )
# Express each country's time as an offset (in hours) from Iran's local time
timezone_hour_iran = df_country_timezone[ df_country_timezone['country'] == 'IRN' ]['timezone_hour'].to_numpy()[0]
df_country_timezone['timezone'] = (df_country_timezone['timezone_hour'] - timezone_hour_iran).round(1)

df_country_timezone = df_country_timezone[['country', 'timezone']]
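
The conversion above depends on the weekday on which the page was saved. A minimal sketch of a day-independent variant follows, assuming the clock strings keep the `"Day HH:MM"` shape. The function name `clock_to_hours` and its `reference_day` parameter are hypothetical and not part of the notebook.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def clock_to_hours(time_str, reference_day):
    """Return hours since 00:00 on reference_day for a 'Day HH:MM' string."""
    day, clock = time_str.split()
    hour, minutes = map(int, clock.split(":"))
    # Signed day difference wrapped into -3..3, so Sun vs Mon is -1, not +6.
    # An unknown day name makes DAYS.index raise ValueError.
    day_diff = (DAYS.index(day) - DAYS.index(reference_day) + 3) % 7 - 3
    return 24 * day_diff + hour + minutes / 60

print(clock_to_hours("Sat 01:30", "Fri"))  # 25.5
print(clock_to_hours("Thu 23:00", "Fri"))  # -1.0
print(clock_to_hours("Sun 22:15", "Mon"))  # -1.75
```

Passing the weekday shown for the reference country as `reference_day` would let the same page be re-saved on any day without editing the function.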

# Combine all Data

In [6]:
df_combined = pd.merge(
    df_population_by_country,
    df_internet_penetration,
    on=['country'],
    how='left',
)
df_combined = pd.merge(
    df_combined,
    df_country_timezone,
    on=['country'],
    how='left'
)
df_combined.dropna(inplace=True)
df_combined.reset_index(drop=True, inplace=True)
# Display
df_combined.head(5)

Unnamed: 0,country,population,internet_penetration_fraction,timezone
0,IND,1463865525,0.559,2.0
1,CHN,1416096094,0.775,4.5
2,USA,347275807,0.931,-7.5
3,IDN,285721236,0.692,3.5
4,PAK,255219554,0.274,1.5
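
Because both merges are left joins followed by `dropna`, any country missing from either of the other two tables disappears from the result. The sketch below shows one way to list those countries before dropping them; the three miniature tables and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
population = pd.DataFrame(
    {"country": ["AAA", "BBB", "CCC"], "population": [10, 20, 30]}
)
internet = pd.DataFrame(
    {"country": ["AAA", "BBB"], "internet_penetration_fraction": [0.5, 0.9]}
)
timezone = pd.DataFrame({"country": ["AAA", "CCC"], "timezone": [1.5, -2.0]})

merged = population.merge(internet, on="country", how="left").merge(
    timezone, on="country", how="left"
)

# Rows with any missing value are the ones dropna() would discard.
dropped = merged.loc[merged.isna().any(axis=1), "country"].tolist()
kept = merged.dropna().reset_index(drop=True)

print("dropped:", dropped)
print(kept)
```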


## Write to CSV

In [7]:
df_combined.to_csv(
    "population_internet_timezone.csv",
    index=False,
    float_format="%.2f"
)
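
Note that `float_format="%.2f"` applies to every float column, so the penetration fraction is rounded to two decimals while integer columns are written unchanged. The sketch below illustrates this on a single made-up row written to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# One made-up row in the same shape as the combined table.
df = pd.DataFrame(
    {
        "country": ["AAA"],
        "population": [1000],
        "internet_penetration_fraction": [0.559],
        "timezone": [2.0],
    }
)

buffer = io.StringIO()
df.to_csv(buffer, index=False, float_format="%.2f")
lines = buffer.getvalue().splitlines()
print(lines[1])  # AAA,1000,0.56,2.00
```

If three decimals of the fraction matter downstream, a wider format such as `"%.3f"` would preserve them.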