# Project 2

In Project 2, I examine the relationship between two datasets: the Global Findex Database and the ITU data on population coverage by mobile network technology.

1. Plotting information from both as separate lines/points

a. Findex Data

"The Global Findex Database" from https://www.worldbank.org/en/publication/globalfindex/download-data  

In [2]:
# import library
import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

import pandas as pd
import plotly.express as px

# read and clean data
# low_memory=False reads the file in one pass, so mixed-type columns get a single dtype
findex_df = pd.read_csv("GlobalFindexDatabase2025.csv", low_memory=False)
years = [2014, 2017, 2021, 2024]

# keep rows for the "all" group in the selected years, dropping missing values
findex_df_clean = findex_df[
    (findex_df["mobileaccount_t_d"].notna())
    & (findex_df["mobileaccount_t_d"] != "NA")
    # & (findex_df["incomegroupwb24"] == "Lower middle income")
    & (findex_df["group"] == "all")
    & (findex_df["year"].isin(years))
].copy()

# convert to numeric; anything unparseable becomes NaN and is dropped
findex_df_clean["mobileaccount_t_d"] = pd.to_numeric(
    findex_df_clean["mobileaccount_t_d"], errors="coerce"
)
findex_df_clean = findex_df_clean[findex_df_clean["mobileaccount_t_d"].notna()]



In [3]:
# plot
fig = px.choropleth(
    findex_df_clean,
    locations="codewb",
    color="mobileaccount_t_d",
    hover_name="countrynewwb",
    animation_frame="year",
    title="Mobile Account Ownership Across Countries (2014-2024)",
    color_continuous_scale="Viridis",
    labels={"mobileaccount_t_d": "Mobile Account Ownership Rate"},
    projection="natural earth",
)

fig.show()

The map above shows mobile account ownership across countries in 2014, 2017, 2021, and 2024.

The mobile account ownership rate has clearly increased over the past 10 years. In Brazil, for example, the rate rose from 0.0086 in 2014 to 0.5817 in 2024.

However, many countries have no data in this dataset.
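
One way to quantify the gaps mentioned above is to count how many countries report a value in each survey year. The following is a minimal sketch on a made-up frame that only borrows the column names `codewb`, `year`, and `mobileaccount_t_d` from the Findex data. The country codes and values are hypothetical placeholders, not real observations.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the Findex frame (values are made up).
sample = pd.DataFrame(
    {
        "codewb": ["AAA", "AAA", "BBB", "BBB", "BBB", "CCC"],
        "year": [2014, 2024, 2014, 2021, 2024, 2017],
        "mobileaccount_t_d": [0.10, 0.50, 0.20, 0.30, 0.40, None],
    }
)

# Keep only rows that carry a value, then count distinct countries per year.
with_data = sample.dropna(subset=["mobileaccount_t_d"])
countries_per_year = with_data.groupby("year")["codewb"].nunique()

# Countries that appear in the frame but never report a value.
never_reported = sorted(set(sample["codewb"]) - set(with_data["codewb"]))

print(countries_per_year.to_dict())
print(never_reported)
```

Running the same two steps on `findex_df` and `findex_df_clean` would show how many countries each frame of the animated map actually covers.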

b. Mobile Network Data

"Population coverage, by mobile network technology" from https://datahub.itu.int/data/?i=100095&s=430

In [4]:
# import data

network_df = pd.read_csv("population_coverage_by_mobile_network_technology.csv")
network_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9609 entries, 0 to 9608
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seriesID           9609 non-null   int64  
 1   seriesCode         9609 non-null   object 
 2   seriesName         9609 non-null   object 
 3   seriesParent       0 non-null      float64
 4   seriesUnits        9609 non-null   object 
 5   entityID           9609 non-null   int64  
 6   entityIso          9609 non-null   object 
 7   entityName         9609 non-null   object 
 8   dataValue          9609 non-null   float64
 9   dataYear           9609 non-null   int64  
 10  dataNote           1978 non-null   object 
 11  dataSource         8745 non-null   object 
 12  seriesDescription  9609 non-null   object 
dtypes: float64(2), int64(3), object(8)
memory usage: 976.0+ KB


In [8]:
# data clean
# .copy() makes an independent frame, so the assignment below does not
# trigger SettingWithCopyWarning
network_df_4g = network_df[network_df["seriesName"] == "At least LTE/WiMAX"].copy()
# convert percentages to fractions to match the Findex scale
network_df_4g["dataValue"] = network_df_4g["dataValue"] / 100

# sort year
network_df_4g = network_df_4g.sort_values("dataYear")

network_df_4g.head()



| | seriesID | seriesCode | seriesName | seriesParent | seriesUnits | entityID | entityIso | entityName | dataValue | dataYear | dataNote | dataSource | seriesDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6762 | 19306 | i271GA | At least LTE/WiMAX | | % | 11 | DZA | Algeria | 0.0 | 2012 | | Ministère de la Poste et des Technologies de ... | Refers to the percentage of inhabitants that a... |
| 7071 | 19306 | i271GA | At least LTE/WiMAX | | % | 47 | CAN | Canada | 0.67 | 2012 | | ITU estimate. | Refers to the percentage of inhabitants that a... |
| 7733 | 19306 | i271GA | At least LTE/WiMAX | | % | 130 | KWT | Kuwait | 0.95 | 2012 | | ITU estimate. | Refers to the percentage of inhabitants that a... |
| 7568 | 19306 | i271GA | At least LTE/WiMAX | | % | 111 | ITA | Italy | 0.0709 | 2012 | | ITU estimate. | Refers to the percentage of inhabitants that a... |
| 8535 | 19306 | i271GA | At least LTE/WiMAX | | % | 238 | TZA | Tanzania | 0.0534 | 2012 | | ITU estimate. | Refers to the percentage of inhabitants that a... |


In [9]:
# plot
fig = px.choropleth(
    network_df_4g,
    locations="entityIso",
    color="dataValue",
    hover_name="entityName",
    animation_frame="dataYear",
    title="Population Coverage by LTE/WiMAX Network Across Countries",
    color_continuous_scale="Viridis",
    labels={"dataValue": "At Least LTE/WiMAX Network"},
    projection="natural earth",
)

fig.show()

The map above shows the share of the population covered by at least an LTE/WiMAX (i.e. 4G) network in each country over the past 12 years.

As expected, the average rate is high in the most recent 3 years. The U.S. and Canada have maintained a high rate since 2014, with more than 90% of the population covered by at least LTE/WiMAX.

Also, the dataset contains relatively few null values.
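
The two observations above (a high recent average and few missing values) can be checked numerically rather than read off the map. The sketch below assumes a frame with the same `entityIso`, `dataYear`, and `dataValue` columns as `network_df_4g`. The rows themselves are invented for illustration and are not ITU figures.

```python
import pandas as pd

# Hypothetical coverage rows, expressed as fractions (not real ITU figures).
sample = pd.DataFrame(
    {
        "entityIso": ["AAA", "BBB", "AAA", "BBB", "AAA"],
        "dataYear": [2012, 2012, 2023, 2023, 2024],
        "dataValue": [0.10, 0.30, 0.90, 0.80, 0.95],
    }
)

# Mean coverage and number of reporting countries for each year.
yearly = sample.groupby("dataYear")["dataValue"].agg(["mean", "count"])

# Share of rows whose coverage value is missing.
missing_share = sample["dataValue"].isna().mean()

print(yearly)
print(missing_share)
```

The `count` column matters as much as the mean: a yearly average built from only a few reporting countries is less comparable with the other years.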

2. Merging the datasets and plotting a derived measure

In [15]:
# filter

findex_df_filter = findex_df_clean[
    ["year", "codewb", "countrynewwb", "mobileaccount_t_d", "incomegroupwb24"]
]

network_df_filter = network_df_4g[["dataYear", "entityIso", "dataValue"]]
network_df_filter = network_df_filter.rename(columns={"dataValue": "4g_cover"})

# merge on year and ISO country code (inner join keeps only matching rows)
merged_df = findex_df_filter.merge(
    network_df_filter,
    left_on=["year", "codewb"],
    right_on=["dataYear", "entityIso"],
    how="inner",
)
merged_df = merged_df.drop(columns=["dataYear", "entityIso"])
merged_df

| | year | codewb | countrynewwb | mobileaccount_t_d | incomegroupwb24 | 4g_cover |
|---|---|---|---|---|---|---|
| 0 | 2014 | AFG | Afghanistan | 0.003044 | Low income | 0.0000 |
| 1 | 2014 | ARG | Argentina | 0.004323 | Upper middle income | 0.0000 |
| 2 | 2014 | ARM | Armenia | 0.006578 | Upper middle income | 0.4600 |
| 3 | 2014 | BGD | Bangladesh | 0.026917 | Lower middle income | 0.5900 |
| 4 | 2014 | BOL | Bolivia | 0.027771 | Lower middle income | 0.0160 |
| ... | ... | ... | ... | ... | ... | ... |
| 253 | 2024 | UGA | Uganda | 0.677383 | Low income | 0.8200 |
| 254 | 2024 | VEN | Venezuela, RB | 0.337397 | Upper middle income | 0.7000 |
| 255 | 2024 | VNM | Viet Nam | 0.387233 | Lower middle income | 0.9985 |
| 256 | 2024 | ZMB | Zambia | 0.693108 | Low income | 0.9120 |
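
Because the merge above uses `how="inner"`, any country-year present in only one dataset is dropped without notice. A possible way to audit this, sketched here on two tiny invented frames that reuse the notebook's column names, is an outer merge with `indicator=True`, which labels each row by where it came from. The country codes and values are placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for findex_df_filter and network_df_filter.
findex = pd.DataFrame(
    {
        "year": [2014, 2014, 2017],
        "codewb": ["AAA", "BBB", "AAA"],
        "mobileaccount_t_d": [0.1, 0.2, 0.3],
    }
)
network = pd.DataFrame(
    {
        "dataYear": [2014, 2017, 2017],
        "entityIso": ["AAA", "AAA", "CCC"],
        "4g_cover": [0.5, 0.6, 0.7],
    }
)

# Outer merge keeps every row; the _merge column records its origin.
audit = findex.merge(
    network,
    left_on=["year", "codewb"],
    right_on=["dataYear", "entityIso"],
    how="outer",
    indicator=True,
)

match_counts = audit["_merge"].value_counts().to_dict()
# Findex rows that an inner join would drop.
findex_only = audit.loc[audit["_merge"] == "left_only", "codewb"].tolist()

print(match_counts)
print(findex_only)
```

Rows labelled `both` are exactly the ones an inner join keeps, so the other two counts show how much each dataset loses in the merge.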


In [30]:
# average both measures by year
df_grouped = (
    merged_df.groupby("year")[["mobileaccount_t_d", "4g_cover"]].mean().reset_index()
)

df_grouped

| | year | mobileaccount_t_d | 4g_cover |
|---|---|---|---|
| 0 | 2014 | 0.066403 | 0.202068 |
| 1 | 2017 | 0.150095 | 0.539639 |
| 2 | 2021 | 0.241184 | 0.761094 |
| 3 | 2024 | 0.340629 | 0.90991 |


In [32]:
fig = px.line(
    df_grouped,
    x="year",
    y=["mobileaccount_t_d", "4g_cover"],
    markers=True,
    title="Average Mobile Account Rate and Population Covered by 4G",
)

fig.show()

The line chart shows both variables growing steadily over the past 10 years. The average mobile account ownership rate increased from 0.0664 in 2014 to 0.3406 in 2024, and average 4G coverage rose by approximately 70 percentage points, from 0.2021 to 0.9099.

The simultaneous increase is plausible, since 4G coverage, which surpassed 90% on average in 2024, provides the infrastructure on which mobile accounts can spread.
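
The changes quoted above can be re-derived from the yearly averages in the `df_grouped` output. The sketch below copies the 2014 and 2024 values by hand and reports both the change in percentage points and the ratio between the two years, since the two are easy to confuse when the measures are already fractions. The names `start`, `end`, and `changes` are introduced here for illustration only.

```python
# Yearly averages copied from the df_grouped output.
start = {"mobileaccount_t_d": 0.066403, "4g_cover": 0.202068}  # 2014
end = {"mobileaccount_t_d": 0.340629, "4g_cover": 0.909910}  # 2024

changes = {}
for name in start:
    changes[name] = {
        # Absolute change, expressed in percentage points.
        "points": (end[name] - start[name]) * 100,
        # How many times larger the 2024 average is than the 2014 one.
        "ratio": end[name] / start[name],
    }

for name, change in changes.items():
    print(f"{name}: +{change['points']:.1f} points, x{change['ratio']:.1f}")
```

On these figures, 4G coverage gains about 70.8 percentage points and the mobile account rate about 27.4, while in relative terms the mobile account rate grows slightly more (roughly 5.1 times versus 4.5 times).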

Next, I explore whether the two variables are related across countries grouped by income level.

In [None]:
# average both measures by income group for 2024
df_grouped_2024 = merged_df[merged_df["year"] == 2024]

df_grouped_2024 = (
    df_grouped_2024.groupby("incomegroupwb24")[["mobileaccount_t_d", "4g_cover"]]
    .mean()
    .reset_index()
)

# order income groups from low to high (groupby sorts them alphabetically)
df_grouped_2024["incomegroupwb24"] = pd.Categorical(
    df_grouped_2024["incomegroupwb24"],
    categories=[
        "Low income",
        "Lower middle income",
        "Upper middle income",
        "High income",
    ],
    ordered=True,
)
df_grouped_2024 = df_grouped_2024.sort_values("incomegroupwb24")

df_grouped_2024

| | incomegroupwb24 | mobileaccount_t_d | 4g_cover |
|---|---|---|---|
| 1 | Low income | 0.40005 | 0.774613 |
| 2 | Lower middle income | 0.314943 | 0.906726 |
| 3 | Upper middle income | 0.326788 | 0.952724 |
| 0 | High income | 0.510241 | 0.999667 |


In [None]:
# plot
fig = px.bar(
    df_grouped_2024,
    x="incomegroupwb24",
    y=["mobileaccount_t_d", "4g_cover"],
    barmode="group",
    title="Average Mobile Account Rate and 4G Coverage by Income Group in 2024",
)

fig.show()

The bar chart shows the average mobile account rate and 4G coverage by income group in 2024.

It is not surprising that 4G coverage is highest in high-income countries. More surprisingly, the mobile account rate is relatively high (approximately 40%) in low-income countries, above both middle-income groups.
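
Group averages describe levels, not how closely the two variables move together within each group. A possible follow-up, sketched here on invented rows with the same column names as `merged_df`, is to compute the Pearson correlation between `mobileaccount_t_d` and `4g_cover` separately for each income group. The values are placeholders chosen only to make the two groups behave differently.

```python
import pandas as pd

# Hypothetical country rows for a single year (values are made up).
sample = pd.DataFrame(
    {
        "incomegroupwb24": ["Low income"] * 3 + ["High income"] * 3,
        "mobileaccount_t_d": [0.2, 0.4, 0.6, 0.5, 0.5, 0.4],
        "4g_cover": [0.5, 0.7, 0.9, 0.98, 0.99, 1.0],
    }
)

# Pearson correlation between the two measures within each income group.
corr_by_group = {
    group: rows["mobileaccount_t_d"].corr(rows["4g_cover"])
    for group, rows in sample.groupby("incomegroupwb24")
}

for group, value in corr_by_group.items():
    print(f"{group}: {value:.2f}")
```

With only a handful of countries per group, such coefficients are unstable, so they are best read alongside a scatter plot rather than on their own.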

Takeaways:

1. Datasets need to be fully inspected and understood before analysis.
2. Plotly does not automatically order the x-axis categories as desired. Convert the column to an ordered categorical first, then sort by it, so that Plotly draws the groups in the intended order.
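
The second takeaway can be seen in isolation with a small sketch. It compares the default alphabetical sort of the income group labels with the sort obtained after converting the column to an ordered categorical. The `value` column and the variable names are hypothetical.

```python
import pandas as pd

# Intended order, from lowest to highest income.
order = [
    "Low income",
    "Lower middle income",
    "Upper middle income",
    "High income",
]

sample = pd.DataFrame(
    {
        "incomegroupwb24": [
            "High income",
            "Low income",
            "Lower middle income",
            "Upper middle income",
        ],
        "value": [4, 1, 2, 3],  # placeholder numbers
    }
)

# Plain strings sort alphabetically, which puts "High income" first.
alphabetical = sample.sort_values("incomegroupwb24")["incomegroupwb24"].tolist()

# An ordered categorical sorts by the category order instead.
sample["incomegroupwb24"] = pd.Categorical(
    sample["incomegroupwb24"], categories=order, ordered=True
)
by_income = sample.sort_values("incomegroupwb24")["incomegroupwb24"].tolist()

print(alphabetical)
print(by_income)
```

Plotly Express functions also accept a `category_orders` argument, which may be a simpler alternative when only the order on the plot matters.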