# Global Trends in Financial Inclusion and Digital Payment Adoption  
### *A Data-Driven Analysis Using the World Bank Global Findex Database (2021)*

## Introduction

Financial inclusion plays a pivotal role in driving economic growth and reducing poverty. By expanding access to essential financial services—such as savings accounts, credit, insurance, and payment systems—countries can empower individuals, support small businesses, and enhance economic resilience.

With the rise of **digital financial services**—including mobile money, online banking, and digital payments—millions, particularly in **low- and middle-income countries**, are gaining unprecedented access to formal financial systems. These innovations have the potential to significantly narrow financial access gaps across regions and demographics.

This project utilizes the **World Bank Global Findex Database 2021**, which provides comprehensive data on how adults worldwide save, borrow, make payments, and manage financial risk. The dataset includes information on:

- Account ownership (bank and mobile money)
- Digital payment usage (e.g., utility bills, wages, government transfers)
- Savings and borrowing behavior
- Barriers to account ownership
- Demographic and socioeconomic characteristics

Before conducting meaningful analysis, **data cleaning and preprocessing** are essential. This notebook focuses on preparing the raw dataset for analysis by:

- Handling missing values
- Renaming and restructuring columns
- Creating derived variables where needed
- Ensuring consistency in data types and formats

These foundational steps will ensure the quality, accuracy, and usability of the data in subsequent phases of the project, including visualization, modeling, and insight generation.

## Tools Used

This notebook leverages:
- `pandas` and `numpy` for data manipulation and cleaning
- `plotly.express` for interactive visual exploration
- Supplementary datasets (as needed) from ITU, GSMA, and World Bank Open Data

---


In [87]:
import pandas as pd
import os
import numpy as np
import plotly.express as px
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

In [4]:
# Get current working directory (where this script is run from)
current_dir = os.getcwd()
# Go one directory up to reach the project root (assuming script is in a subfolder)
project_root_dir = os.path.dirname(current_dir)
notebook_dir = os.path.join(project_root_dir, 'Notebook')
docs_dir = os.path.join(project_root_dir, 'Docs')
data_dir = os.path.join(project_root_dir, 'Data')
raw_dir = os.path.join(data_dir, 'raw')
processed_dir = os.path.join(data_dir, 'processed')

## Load Dataset

To begin the data cleaning process, we build the path to the dataset with `os.path.join`. The Excel file, `DatabankWide.xlsx`, is located in the `Data/raw` folder under the project root (`raw_dir`).





In [6]:
world_data_filename=os.path.join(raw_dir,"DatabankWide.xlsx")


##  Reading the Excel File

We use `pandas.read_excel()` to load the dataset into a DataFrame named `df`. This function reads the Excel file located at the path built earlier (`world_data_filename`).




In [8]:
df = pd.read_excel(world_data_filename)
df

Unnamed: 0,Country name,Country code,Year,Adult populaiton,Region,Income group,Account (% age 15+),Financial institution account (% age 15+),First financial institution account ever was opened to receive a wage payment or money from the government (% age 15+),First financial institution account ever was opened to receive a wage payment (% age 15+),...,"Used a mobile phone or the internet to access an account, young (% ages 15-24)","Used a mobile phone or the internet to access an account, older (% age 25+)","Used a mobile phone or the internet to access an account, primary education or less (% ages 15+)","Used a mobile phone or the internet to access an account, secondary education or more (% ages 15+)","Used a mobile phone or the internet to access an account, income, poorest 40% (% ages 15+)","Used a mobile phone or the internet to access an account, income, richest 60% (% ages 15+)","Used a mobile phone or the internet to access an account, rural (% age 15+)","Used a mobile phone or the internet to access an account, urban (% age 15+)","Used a mobile phone or the internet to access an account, out of labor force (% age 15+)","Used a mobile phone or the internet to access an account, in labor force (% age 15+)"
0,Afghanistan,AFG,2011,1.512447e+07,South Asia,Low income,0.090050,0.090050,,,...,,,,,,,,,,
1,Afghanistan,AFG,2014,1.730080e+07,South Asia,Low income,0.099610,0.099610,,,...,,,,,,,,,,
2,Afghanistan,AFG,2017,1.971821e+07,South Asia,Low income,0.148933,0.145471,,,...,0.00000,0.014690,0.003958,0.022650,0.003607,0.012820,,,0.000548,0.017173
3,Afghanistan,AFG,2021,2.264750e+07,South Asia,Low income,0.096538,0.096538,,,...,,,,,,,,,,
4,Albania,ALB,2011,2.258900e+06,Europe & Central Asia (excluding high income),Upper middle income,0.282681,0.282681,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
653,Sub-Saharan Africa (excluding high income),SSA,2021,6.587700e+08,Sub-Saharan Africa (excluding high income),,0.550697,0.396974,0.183956,0.159863,...,,,,,,,,,,
654,World,WLD,2011,5.051902e+09,,,0.506275,0.506275,,,...,,,,,,,,,,
655,World,WLD,2014,5.267620e+09,,,0.619237,0.611153,,,...,,,,,,,,,,
656,World,WLD,2017,5.488624e+09,,,0.684983,0.671033,,,...,0.22863,0.252606,0.116583,0.355014,0.166880,0.302514,,,0.144359,0.306314


## Dataset Dimensions

To understand the size of the dataset, we use the `.shape` attribute. This returns a tuple indicating the number of rows and columns in the DataFrame.




In [10]:
df.shape

(658, 1232)

##  Cleaning Country Names

We want to ensure the dataset only includes valid countries. The following cleaning steps are performed on the `Country name` column:

1. **Remove rows with missing country names**  
   We first drop any rows where the `Country name` is `NaN`.

2. **Remove entries that represent regions, income groups, or aggregates**  
   Sometimes the dataset includes rows labeled as "World", "High income", "Europe & Central Asia", etc., which are not individual countries. We define a list of keywords and remove any rows where the `Country name` contains any of those terms (case-insensitive).




In [12]:
df = df[df['Country name'].notna()]

# Remove rows where "Country name" is really a region or income group
bad_keywords = ['income', 'world', 'region', 'OECD', 'Africa', 'Asia', 'America', 'Europe', 'Pacific', 'Arab']
df = df[~df['Country name'].str.contains('|'.join(bad_keywords), case=False, na=False)]
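Because the filter matches keywords anywhere in the name, it can also catch genuine countries (for example, "South Africa" contains "Africa") while leaving some aggregates untouched. The following is a minimal sketch of how the filter could be audited before it is applied; the `names` sample is hypothetical and stands in for `df['Country name']`.

```python
import pandas as pd

# Hypothetical sample of names; in the notebook this would be df['Country name']
names = pd.Series(['Kenya', 'South Africa', 'United Arab Emirates',
                   'World', 'High income', 'Euro area'])

bad_keywords = ['income', 'world', 'region', 'OECD', 'Africa', 'Asia',
                'America', 'Europe', 'Pacific', 'Arab']
pattern = '|'.join(bad_keywords)

is_flagged = names.str.contains(pattern, case=False, na=False)

flagged = names[is_flagged]   # names the substring filter would drop
kept = names[~is_flagged]     # names the substring filter would keep

print(flagged.tolist())
print(kept.tolist())
```

Reviewing `flagged` and `kept` by eye shows whether real countries would be lost or aggregate rows would slip through.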

##  Reshaping the Dataset to Long Format

The original dataset is in a **wide format**, where each indicator (e.g., "% with account", "% receiving wage digitally") is a separate column. To make the data easier to analyze and visualize, we reshape it into a **long format** using `pd.melt()`.

###  Steps:

1. **Define identifier columns (`id_cols`)**  
   These are columns we want to keep as is (e.g., country info, year, demographics).

2. **Select indicator columns (`value_cols`)**  
   We collect all columns whose names contain a percent sign (`%`) and that are not part of the identifier list.

3. **Use `pd.melt()`**  
   This converts the DataFrame from wide to long format, where each row represents a country-indicator-value-year combination.

4. **Clean the 'Value' column**  
   Convert the values to numeric, safely handling any non-numeric or missing entries.



In [14]:
id_cols = ['Country name', 'Country code', 'Year', 'Adult populaiton', 'Region', 'Income group']
value_cols = [col for col in df.columns if '%' in str(col) and col not in id_cols]

df_long = pd.melt(df, id_vars=id_cols, value_vars=value_cols,
                  var_name='Indicator', value_name='Value')

# Convert to numeric; entries that cannot be parsed become NaN
df_long['Value'] = pd.to_numeric(df_long['Value'], errors='coerce')
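To illustrate what the reshape does to the dimensions, here is a small sketch on a toy frame. The country names and the second indicator column are hypothetical; only the `pd.melt()` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical two-row, two-indicator frame in wide format
wide = pd.DataFrame({
    'Country name': ['A', 'B'],
    'Year': [2021, 2021],
    'Account (% age 15+)': [0.5, 0.8],
    'Hypothetical indicator (% age 15+)': [0.2, None],
})

id_cols = ['Country name', 'Year']
value_cols = [c for c in wide.columns if '%' in str(c) and c not in id_cols]

long = pd.melt(wide, id_vars=id_cols, value_vars=value_cols,
               var_name='Indicator', value_name='Value')

# Each original row appears once per indicator column
print(long.shape)  # (4, 4)
```

The long frame has one row per original row and indicator column, so its row count is the product of the two.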


In [15]:
df_long


Unnamed: 0,Country name,Country code,Year,Adult populaiton,Region,Income group,Indicator,Value
0,Afghanistan,AFG,2011,1.512447e+07,South Asia,Low income,Account (% age 15+),0.090050
1,Afghanistan,AFG,2014,1.730080e+07,South Asia,Low income,Account (% age 15+),0.099610
2,Afghanistan,AFG,2017,1.971821e+07,South Asia,Low income,Account (% age 15+),0.148933
3,Afghanistan,AFG,2021,2.264750e+07,South Asia,Low income,Account (% age 15+),0.096538
4,Albania,ALB,2011,2.258900e+06,Europe & Central Asia (excluding high income),Upper middle income,Account (% age 15+),0.282681
...,...,...,...,...,...,...,...,...
687220,Euro area,EMU,2021,2.916589e+08,,,Used a mobile phone or the internet to access ...,
687221,Developing,LMY,2011,4.077908e+09,,,Used a mobile phone or the internet to access ...,
687222,Developing,LMY,2014,4.272187e+09,,,Used a mobile phone or the internet to access ...,
687223,Developing,LMY,2017,4.471571e+09,,,Used a mobile phone or the internet to access ...,0.236661


## Long Format Dataset Dimensions

After reshaping the dataset from wide to long format using `pd.melt()`, we use the `.shape` attribute to check the number of rows and columns in the transformed DataFrame:



In [17]:
df_long.shape

(687225, 8)

##  Formatting Indicator Values as Percentages

To make the indicator values easier to read and visualize, we format them as percentage strings:

###  Steps:
1. **Convert the 'Value' column to float and scale**  
   Since the raw values are in decimal form (e.g., 0.75), we multiply them by 100 to convert them into percentage values.

2. **Create a new column `Indicator value`**  
   We round the percentage values to two decimal places and append the `%` sign to make them human-readable.

3. **Drop the original 'Value' column**  
   After formatting, the original numeric column is no longer needed.



In [19]:
df_long['Value'] = df_long['Value'].astype(float) * 100
df_long['Indicator value'] = df_long['Value'].round(2).astype(str) + '%'
df_long.drop('Value', axis=1, inplace=True)
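If later steps need numeric values (for statistics or plotting), one option is to keep a numeric percentage column next to the formatted one. This is a sketch of that alternative, not the approach taken in this notebook; the `demo` frame and the `Value pct` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format values, including one missing entry
demo = pd.DataFrame({'Value': [0.75, 0.5, np.nan]})

# Keep a numeric percentage column for statistics and plotting
demo['Value pct'] = demo['Value'] * 100

# Build the display string, then blank it out where no value exists
formatted = demo['Value pct'].round(2).astype(str) + '%'
demo['Indicator value'] = formatted.where(demo['Value pct'].notna())

print(demo)
```

With this layout, `.describe()` and `.isnull().sum()` still reflect the indicator values, and the string column is used only for display.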

##  Dataset Overview with `.info()`

To understand the structure of the cleaned and reshaped dataset, we use the `.info()` method. This provides:

- Number of entries (rows)
- Number of columns
- Column names and their data types
- Non-null (non-missing) counts for each column
- Memory usage of the DataFrame




In [21]:
df_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687225 entries, 0 to 687224
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Country name      687225 non-null  object 
 1   Country code      687225 non-null  object 
 2   Year              687225 non-null  int64  
 3   Adult populaiton  687225 non-null  float64
 4   Region            677425 non-null  object 
 5   Income group      672525 non-null  object 
 6   Indicator         687225 non-null  object 
 7   Indicator value   687225 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 41.9+ MB


## Summary Statistics 

We use the `.describe()` method to generate descriptive statistics for the **numeric columns** in the dataset. Because `Indicator value` is now stored as text, only `Year` and `Adult populaiton` are summarized.




In [23]:
df_long.describe()

Unnamed: 0,Year,Adult populaiton
count,687225.0,687225.0
mean,2015.759358,69684960.0
std,3.722199,388114300.0
min,2011.0,229858.7
25%,2014.0,3655052.0
50%,2017.0,8197626.0
75%,2017.0,24776780.0
max,2022.0,4739308000.0


##  Checking for Missing Values

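To identify missing data in the cleaned dataset, we use `.isnull().sum()`. This method returns the total number of missing (`NaN`) values in each column. Note that `Indicator value` reports zero missing values only because missing entries were turned into the string `nan%` during formatting.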


In [25]:
df_long.isnull().sum()

Country name            0
Country code            0
Year                    0
Adult populaiton        0
Region               9800
Income group        14700
Indicator               0
Indicator value         0
dtype: int64

##  Exploring Unique Regions

To understand the geographic coverage of the dataset, we examine the unique values in the `Region` column using `.unique()`. This reveals all distinct regions represented in the data.




In [27]:
df_long['Region'].unique()


array(['South Asia', 'Europe & Central Asia (excluding high income)',
       'Middle East & North Africa (excluding high income)',
       'Sub-Saharan Africa (excluding high income)',
       'Latin America & Caribbean (excluding high income)', 'High income',
       'East Asia & Pacific (excluding high income)', nan], dtype=object)

## Cleaning the 'Region' Column

We perform two key cleaning steps on the `Region` column to ensure consistent and meaningful region values:

###  Steps:

1. **Remove rows with missing region values**  
   These entries cannot be used in regional analysis, so we filter them out.

2. **Standardize region names**  
   Some regions in the dataset include extra text like `"(excluding high income)"`, which is not needed for our analysis.  
   We use a regular expression to remove this suffix and keep only the clean region name.



In [29]:
df_long = df_long.loc[df_long['Region'].notna()]
df_long['Region'] = df_long['Region'].str.replace(r"\s*\(excluding high income\)", "", regex=True)

In [30]:
df_long['Region'].unique()

array(['South Asia', 'Europe & Central Asia',
       'Middle East & North Africa', 'Sub-Saharan Africa',
       'Latin America & Caribbean', 'High income', 'East Asia & Pacific'],
      dtype=object)

##  Exploring Unique Country Names

To review the countries represented in the dataset, we use `.unique()` on the `Country name` column. This returns a list of all distinct country names remaining after cleaning.



In [32]:
df_long['Country name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
       'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti',
       'Dominican Republic', 'Ecuador', 'El Salvador', 'Estonia',
       'Eswatini', 'Ethiopia', 'Finland', 'France', 'Gabon',
       'Gambia, The', 'Georgia', 'Germany', 'Ghana', 'Greece',
       'Guatemala', 'Guinea', 'Haiti', 'Honduras', 'Hong Kong SAR, China',
       'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Rep.',
       'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan',
       'Kazakhstan', 'Kenya', 'Korea,

##  Replacing 'High income' Regions with Actual Geographic Regions

Some countries are labeled under the `Region` column as **"High income"**, which is not a geographic region but an income classification. To enhance regional analysis, we replace `"High income"` with the actual geographic region of the country using the steps below:

###  Steps:

1. **Create a mapping from countries to regions (excluding 'High income')**  
   - We filter out entries labeled `"High income"`.
   - Group by `Country name` and extract the most common region per country using `.mode()`.
   - Convert this into a dictionary (`country_region_map`) for easy lookup.

2. **Replace 'High income' with actual region**  
   - We use `.apply()` to update each row:
     - If the `Region` is `"High income"`, we look up the country’s region in the map; countries with no entry (those labeled `"High income"` in every row) are set to `NaN`.
     - If the region is already valid, we keep it unchanged.



In [34]:
# 1. Map each country to its most common region, ignoring "High income"
country_region_map = (
    df_long[df_long['Region'].notna() & (df_long['Region'] != 'High income')]
    .groupby('Country name')['Region']
    .agg(lambda x: x.mode()[0])  # pick the most common region per country
    .to_dict()
)

# 2. Replace "High income" in Region using this mapping
df_long['Region'] = df_long.apply(
    lambda row: country_region_map.get(row['Country name'], np.nan) if row['Region'] == 'High income' else row['Region'],
    axis=1
)
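The sketch below reproduces the same logic on a toy frame to show its main limitation: a country labeled "High income" in every row never enters the map, so its region becomes `NaN`. The `demo` data is hypothetical, and the `map`/`mask` pair is a vectorized stand-in for the row-wise `.apply()`.

```python
import pandas as pd

# Hypothetical rows: country 'H' is labeled "High income" in every row
demo = pd.DataFrame({
    'Country name': ['A', 'A', 'H', 'H'],
    'Region': ['South Asia', 'South Asia', 'High income', 'High income'],
})

region_map = (
    demo[demo['Region'] != 'High income']
    .groupby('Country name')['Region']
    .agg(lambda x: x.mode()[0])
    .to_dict()
)

# Look up every country, then use the lookup only where the label is "High income"
looked_up = demo['Country name'].map(region_map)
demo['Region'] = demo['Region'].mask(demo['Region'] == 'High income', looked_up)

print(region_map)
print(demo['Region'].tolist())
```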

In [35]:
country_to_region = {
    # South Asia
    'Afghanistan': 'South Asia', 'Bangladesh': 'South Asia', 'Bhutan': 'South Asia', 'India': 'South Asia',
    'Maldives': 'South Asia', 'Nepal': 'South Asia', 'Pakistan': 'South Asia', 'Sri Lanka': 'South Asia',

    # Europe & Central Asia
    'Albania': 'Europe & Central Asia', 'Armenia': 'Europe & Central Asia', 'Austria': 'Europe & Central Asia',
    'Azerbaijan': 'Europe & Central Asia', 'Belarus': 'Europe & Central Asia', 'Belgium': 'Europe & Central Asia',
    'Bosnia and Herzegovina': 'Europe & Central Asia', 'Bulgaria': 'Europe & Central Asia',
    'Croatia': 'Europe & Central Asia', 'Czech Republic': 'Europe & Central Asia', 'Denmark': 'Europe & Central Asia',
    'Estonia': 'Europe & Central Asia',
    'Finland': 'Europe & Central Asia', 'France': 'Europe & Central Asia', 'Georgia': 'Europe & Central Asia',
    'Germany': 'Europe & Central Asia', 'Greece': 'Europe & Central Asia', 'Hungary': 'Europe & Central Asia',
    'Iceland': 'Europe & Central Asia', 'Ireland': 'Europe & Central Asia', 'Italy': 'Europe & Central Asia',
    'Kazakhstan': 'Europe & Central Asia', 'Kosovo': 'Europe & Central Asia', 'Latvia': 'Europe & Central Asia',
    'Lithuania': 'Europe & Central Asia', 'Moldova': 'Europe & Central Asia', 'Montenegro': 'Europe & Central Asia',
    'Netherlands': 'Europe & Central Asia', 'North Macedonia': 'Europe & Central Asia', 'Norway': 'Europe & Central Asia',
    'Poland': 'Europe & Central Asia', 'Portugal': 'Europe & Central Asia', 'Romania': 'Europe & Central Asia',
    'Russian Federation': 'Europe & Central Asia', 'Serbia': 'Europe & Central Asia', 'Slovak Republic': 'Europe & Central Asia',
    'Slovenia': 'Europe & Central Asia', 'Spain': 'Europe & Central Asia', 'Sweden': 'Europe & Central Asia',
    'Switzerland': 'Europe & Central Asia', 'Turkey': 'Europe & Central Asia', 'Ukraine': 'Europe & Central Asia',
    'United Kingdom': 'Europe & Central Asia',

    # Sub-Saharan Africa
    'Angola': 'Sub-Saharan Africa', 'Benin': 'Sub-Saharan Africa', 'Botswana': 'Sub-Saharan Africa',
    'Burkina Faso': 'Sub-Saharan Africa', 'Burundi': 'Sub-Saharan Africa', 'Cabo Verde': 'Sub-Saharan Africa',
    'Cameroon': 'Sub-Saharan Africa', 'Central African Republic': 'Sub-Saharan Africa', 'Chad': 'Sub-Saharan Africa',
    'Comoros': 'Sub-Saharan Africa', 'Congo, Dem. Rep.': 'Sub-Saharan Africa', 'Congo, Rep.': 'Sub-Saharan Africa',
    "Cote d'Ivoire": 'Sub-Saharan Africa', 'Djibouti': 'Sub-Saharan Africa', 'Equatorial Guinea': 'Sub-Saharan Africa',
    'Eswatini': 'Sub-Saharan Africa', 'Ethiopia': 'Sub-Saharan Africa', 'Gabon': 'Sub-Saharan Africa',
    'Gambia, The': 'Sub-Saharan Africa', 'Ghana': 'Sub-Saharan Africa', 'Guinea': 'Sub-Saharan Africa',
    'Kenya': 'Sub-Saharan Africa', 'Lesotho': 'Sub-Saharan Africa', 'Liberia': 'Sub-Saharan Africa',
    'Madagascar': 'Sub-Saharan Africa', 'Malawi': 'Sub-Saharan Africa', 'Mali': 'Sub-Saharan Africa',
    'Mauritania': 'Sub-Saharan Africa', 'Mauritius': 'Sub-Saharan Africa', 'Mozambique': 'Sub-Saharan Africa',
    'Namibia': 'Sub-Saharan Africa', 'Niger': 'Sub-Saharan Africa', 'Nigeria': 'Sub-Saharan Africa',
    'Rwanda': 'Sub-Saharan Africa', 'Senegal': 'Sub-Saharan Africa', 'Sierra Leone': 'Sub-Saharan Africa',
    'Somalia': 'Sub-Saharan Africa', 'South Africa': 'Sub-Saharan Africa', 'South Sudan': 'Sub-Saharan Africa',
    'Sudan': 'Sub-Saharan Africa', 'Tanzania': 'Sub-Saharan Africa', 'Togo': 'Sub-Saharan Africa',
    'Uganda': 'Sub-Saharan Africa', 'Zambia': 'Sub-Saharan Africa', 'Zimbabwe': 'Sub-Saharan Africa',

    # East Asia & Pacific
    'Australia': 'East Asia & Pacific', 'Cambodia': 'East Asia & Pacific', 'China': 'East Asia & Pacific',
    'Indonesia': 'East Asia & Pacific', 'Japan': 'East Asia & Pacific', 'Korea, Rep.': 'East Asia & Pacific',
    'Lao PDR': 'East Asia & Pacific', 'Malaysia': 'East Asia & Pacific', 'Mongolia': 'East Asia & Pacific',
    'Myanmar': 'East Asia & Pacific', 'New Zealand': 'East Asia & Pacific', 'Philippines': 'East Asia & Pacific',
    'Singapore': 'East Asia & Pacific', 'Thailand': 'East Asia & Pacific', 'Vietnam': 'East Asia & Pacific',
    'Taiwan, China': 'East Asia & Pacific', 'Hong Kong SAR, China': 'East Asia & Pacific',

    # Latin America & Caribbean
    'Argentina': 'Latin America & Caribbean', 'Belize': 'Latin America & Caribbean', 'Bolivia': 'Latin America & Caribbean',
    'Brazil': 'Latin America & Caribbean', 'Chile': 'Latin America & Caribbean', 'Colombia': 'Latin America & Caribbean',
    'Costa Rica': 'Latin America & Caribbean', 'Cuba': 'Latin America & Caribbean', 'Dominican Republic': 'Latin America & Caribbean',
    'Ecuador': 'Latin America & Caribbean', 'El Salvador': 'Latin America & Caribbean', 'Guatemala': 'Latin America & Caribbean',
    'Haiti': 'Latin America & Caribbean', 'Honduras': 'Latin America & Caribbean', 'Jamaica': 'Latin America & Caribbean',
    'Mexico': 'Latin America & Caribbean', 'Nicaragua': 'Latin America & Caribbean', 'Panama': 'Latin America & Caribbean',
    'Paraguay': 'Latin America & Caribbean', 'Peru': 'Latin America & Caribbean', 'Puerto Rico': 'Latin America & Caribbean',
    'Trinidad and Tobago': 'Latin America & Caribbean', 'Uruguay': 'Latin America & Caribbean', 'Venezuela, RB': 'Latin America & Caribbean',

    # Middle East & North Africa
    'Algeria': 'Middle East & North Africa', 'Bahrain': 'Middle East & North Africa', 'Egypt': 'Middle East & North Africa',
    'Iran, Islamic Rep.': 'Middle East & North Africa', 'Iraq': 'Middle East & North Africa', 'Israel': 'Middle East & North Africa',
    'Jordan': 'Middle East & North Africa', 'Kuwait': 'Middle East & North Africa', 'Lebanon': 'Middle East & North Africa',
    'Libya': 'Middle East & North Africa', 'Morocco': 'Middle East & North Africa', 'Oman': 'Middle East & North Africa',
    'Qatar': 'Middle East & North Africa', 'Saudi Arabia': 'Middle East & North Africa', 'Syrian Arab Republic': 'Middle East & North Africa',
    'Tunisia': 'Middle East & North Africa', 'United Arab Emirates': 'Middle East & North Africa', 'West Bank and Gaza': 'Middle East & North Africa',
    'Yemen, Rep.': 'Middle East & North Africa',

    # North America
    'United States': 'North America', 'Canada': 'North America',

    # Additional Europe & Central Asia entries
    'Luxembourg': 'Europe & Central Asia', 'Malta': 'Europe & Central Asia', 'Cyprus': 'Europe & Central Asia',
}


## Verifying Cleaned Region Names

After the replacement step, we use `.unique()` to inspect the distinct region names in the dataset. `"High income"` is gone, but `nan` now appears for countries that had no geographic region to fall back on:



In [37]:
df_long['Region'].unique()

array(['South Asia', 'Europe & Central Asia',
       'Middle East & North Africa', 'Sub-Saharan Africa',
       'Latin America & Caribbean', nan, 'East Asia & Pacific'],
      dtype=object)

##  Filling Missing or 'High income' Regions Using Country Mapping

Despite the previous cleaning steps, some rows still have a missing region. To fix this, we use a dictionary (`country_to_region`) that maps each country to its correct geographic region.

###  Logic:

- If a row’s `Region` is `NaN` **or** `"High income"`, we replace it using the mapped region from `country_to_region`.
- If the `Region` is already valid, we leave it unchanged.



In [39]:
df_long['Region'] = df_long.apply(
    lambda row: country_to_region.get(row['Country name'], row['Region']) 
    if pd.isna(row['Region']) or row['Region'] == 'High income' 
    else row['Region'], 
    axis=1
)
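A row-wise `.apply()` can be slow on a frame of this size. As a hedged alternative, the same fill can be sketched with vectorized operations, which also makes it easy to list countries the dictionary does not cover. The small mapping and `demo` rows below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping and rows; 'Z' has no entry in the mapping
country_to_region = {'Canada': 'North America', 'Japan': 'East Asia & Pacific'}
demo = pd.DataFrame({
    'Country name': ['Canada', 'Japan', 'Kenya', 'Z'],
    'Region': [np.nan, 'High income', 'Sub-Saharan Africa', np.nan],
})

needs_fix = demo['Region'].isna() | (demo['Region'] == 'High income')
mapped = demo['Country name'].map(country_to_region)

# Fill from the dictionary where a fix is needed and a mapping exists
demo.loc[needs_fix & mapped.notna(), 'Region'] = mapped

# Countries that still lack a usable region after the fill
unmapped = demo.loc[needs_fix & mapped.isna(), 'Country name'].unique().tolist()
print(unmapped)
```

As in the cell above, rows without a dictionary entry keep their original value, so checking `unmapped` is a quick way to find gaps in the mapping.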

##  Final Check for Missing Values

After completing the region corrections and other data cleaning steps, we perform a final check for missing values in the dataset using `.isnull().sum()`.


In [41]:
df_long.isnull().sum()

Country name           0
Country code           0
Year                   0
Adult populaiton       0
Region                 0
Income group        4900
Indicator              0
Indicator value        0
dtype: int64

##  Correcting Income Group for Specific Countries

Rows for Venezuela, RB were missing `Income group` labels in the dataset. To fix this, we manually update those entries using a mapping dictionary.

- We explicitly map `'Venezuela, RB'` to `'Upper middle income'`, matching the label format used by the rest of the column.




In [43]:
income_group_mapping = {
    'Venezuela, RB': 'Upper middle income'
}
# Use the mapped label where one exists; otherwise keep the existing value
df_long['Income group'] = df_long['Country name'].map(income_group_mapping).fillna(df_long['Income group'])


Check again for null values

In [45]:
df_long.isnull().sum()

Country name        0
Country code        0
Year                0
Adult populaiton    0
Region              0
Income group        0
Indicator           0
Indicator value     0
dtype: int64

##  Checking for Duplicate Rows

To ensure data quality and avoid biased analysis, we check for duplicate rows in the dataset using `.duplicated().sum()`.



In [47]:
df_long.duplicated().sum()

0
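`.duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below on hypothetical rows, looks for repeated country-year-indicator keys even when the recorded values differ; the choice of `key_cols` is an assumption about what should be unique in the long format.

```python
import pandas as pd

# Hypothetical rows: the first two share a key but carry different values
demo = pd.DataFrame({
    'Country code': ['AFG', 'AFG', 'AFG'],
    'Year': [2021, 2021, 2017],
    'Indicator': ['Account (% age 15+)'] * 3,
    'Indicator value': ['9.65%', '9.70%', '14.89%'],
})

key_cols = ['Country code', 'Year', 'Indicator']

full_dupes = demo.duplicated().sum()                 # identical in every column
key_dupes = demo.duplicated(subset=key_cols).sum()   # same key, any value

print(full_dupes, key_dupes)
```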

## Exploring Unique Financial Indicators

To understand the scope of financial indicators available in the dataset, we examine the unique values in the `Indicator` column using `.unique()`.



In [49]:
df_long['Indicator'].unique()

array(['Account (% age 15+)', 'Financial institution account (% age 15+)',
       'First financial institution account ever was opened to receive a wage payment or money from the government (% age 15+)',
       ...,
       'Used a mobile phone or the internet to access an account, urban (% age 15+)',
       'Used a mobile phone or the internet to access an account, out of labor force (% age 15+)',
       'Used a mobile phone or the internet to access an account, in labor force (% age 15+)'],
      dtype=object)

##  Renaming Columns for Clarity and Consistency

To correct a typo and improve readability, we rename the column `'Adult populaiton'` to `'Adult Population'`.


In [51]:
df_long.columns

Index(['Country name', 'Country code', 'Year', 'Adult populaiton', 'Region',
       'Income group', 'Indicator', 'Indicator value'],
      dtype='object')

In [52]:
df_long.rename(columns={'Adult populaiton': 'Adult Population'}, inplace=True)

##  Exporting Cleaned Data

After completing data cleaning and transformation, we save the cleaned dataset to a CSV file for future use or sharing.



In [54]:
 # df_long.to_csv("world data.csv",index=False)

In [55]:
# save the file in the processed data folder
final_file = os.path.join(processed_dir,"world data.csv")
df_long.to_csv(final_file,index=False)
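`to_csv` does not create missing folders, so the export fails if `Data/processed` does not exist yet. A minimal sketch of a safer export follows; it uses a temporary folder and a one-row `demo` frame as stand-ins for the project directory and `df_long`.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'Country name': ['Afghanistan'], 'Year': [2021]})

# A temporary folder stands in for the project's Data/processed directory
with tempfile.TemporaryDirectory() as tmp:
    processed_dir = os.path.join(tmp, 'Data', 'processed')
    os.makedirs(processed_dir, exist_ok=True)  # no error if it already exists

    final_file = os.path.join(processed_dir, 'world data.csv')
    demo.to_csv(final_file, index=False)

    # Read the file back to confirm the export round-trips
    reloaded = pd.read_csv(final_file)

print(reloaded.shape)
```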