# Global GDP Statistics Analysis (2025)

This notebook performs an exploratory data analysis (EDA) of global Gross Domestic Product (GDP) data using the **Global GDP Explorer 2025** dataset, which compiles **World Bank** and **UN** data; the GDP and population columns hold 2023 figures.
The purpose is to clean, process, and summarize GDP per capita values to understand economic variations across countries and identify global income ranges.



## Importing Libraries and Setting Up the Environment

We start by importing essential Python libraries for data manipulation, computation, and visualization:

- **NumPy** – for numerical operations
- **Pandas** – for loading, cleaning, and summarizing data
- **Matplotlib** – for visualizations (imported here, but not yet used in this notebook)

The notebook is configured to run in a local environment, and the dataset is loaded from a KaggleHub cache directory.



In [1]:
# Connect to Kaggle
import kagglehub

In [2]:
# Import dataset from Kaggle
path = kagglehub.dataset_download("asadullahcreative/global-gdp-explorer-2024-world-bank-un-data")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/asadullahcreative/global-gdp-explorer-2024-world-bank-un-data?dataset_version_number=1...


100%|██████████| 6.52k/6.52k [00:00<00:00, 9.04MB/s]

Extracting files...
Path to dataset files: C:\Users\bdall\.cache\kagglehub\datasets\asadullahcreative\global-gdp-explorer-2024-world-bank-un-data\versions\1





In [3]:
# Import tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
# Switch the working directory to the downloaded dataset folder
import os
print(os.getcwd())

# `path` was returned by kagglehub.dataset_download() in cell [2],
# so the cache location does not need to be typed out by hand
os.chdir(path)

C:\Users\bdall\PycharmProjects\PythonProject
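
Hard-coded cache paths only work on the machine that produced them. As a minimal sketch, and assuming the dataset folder contains a single CSV file, the file can be located from the directory that `kagglehub.dataset_download()` returns. The helper name `load_first_csv` is hypothetical and not part of KaggleHub or pandas.

```python
from pathlib import Path

import pandas as pd


def load_first_csv(dataset_dir):
    """Load the first CSV file found in dataset_dir (hypothetical helper)."""
    csv_files = sorted(Path(dataset_dir).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {dataset_dir}")
    return pd.read_csv(csv_files[0])


# Usage, assuming `path` is the directory returned by kagglehub.dataset_download():
# df = load_first_csv(path)
```

This avoids both `os.chdir()` and a user-specific absolute path, at the cost of assuming which file in the folder is the one wanted.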


## Loading and Inspecting the Dataset

The dataset, `Global GDP Explorer 2025 (World Bank UN Data).csv`, is loaded into a pandas DataFrame.
We begin by:
- Viewing the first few rows with `head()`
- Checking the dataset structure and data types using `info()`
- Reviewing summary statistics with `describe()`

This provides an initial overview of the GDP data distribution and helps identify data type inconsistencies.


In [12]:
# Create the DataFrame from the CSV in the downloaded dataset folder
# (pandas is already imported as pd in cell [3]; os in cell [5])
csv_file = os.path.join(path, 'Global GDP Explorer 2025 (World Bank  UN Data).csv')
df = pd.read_csv(csv_file)
print(df.head())
df.info()  # info() prints its own report and returns None, so it is not wrapped in print()
print(df.describe())

   Unnamed: 0        Country  GDP (nominal, 2023)    GDP (abbrev.) GDP Growth  \
0           0  United States  $27,720,700,000,000  27.721 trillion      2.89%   
1           1          China  $17,794,800,000,000  17.795 trillion      5.25%   
2           2        Germany   $4,525,700,000,000   4.526 trillion     −0.27%   
3           3          Japan   $4,204,490,000,000   4.204 trillion      1.68%   
4           4          India   $3,567,550,000,000   3.568 trillion      8.15%   

   Population 2023 GDP per capita Share of World GDP  
0        343477335        $80,706             26.11%  
1       1422584933        $12,509             16.76%  
2         84548231        $53,528              4.26%  
3        124370947        $33,806              3.96%  
4       1438069596         $2,481              3.36%  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               ----

## Data Quality Checks

Before analysis, we inspect the dataset for:
- **Duplicate rows** using `df.duplicated().sum()`
- **Null or missing values** using `df.isnull().sum()`
- **Column names** using `df.columns.tolist()`

These checks show whether the dataset contains duplicated rows or missing values before any statistics are computed, and they list the exact column names used in later cells.
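
The three checks can also be gathered into one report so that none of the results is lost when a cell displays only its last expression. This is a minimal sketch: the helper name `quality_report` is hypothetical, and the demo frame uses made-up values rather than rows from the dataset.

```python
import pandas as pd


def quality_report(frame):
    """Return duplicate-row and missing-value counts for a DataFrame."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_by_column": {col: int(n) for col, n in frame.isnull().sum().items()},
        "columns": frame.columns.tolist(),
    }


# Tiny illustrative frame (made-up values, not taken from the dataset)
demo = pd.DataFrame(
    {"Country": ["A", "B", "B", "C"], "Value": [1.0, 2.0, 2.0, None]}
)
report = quality_report(demo)
print(report)
```

On the real data the call would be `quality_report(df)`.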


In [14]:
# Check for duplicates and nulls
df.duplicated().sum()   # not displayed: a notebook cell only shows its last expression
df.isnull().sum()

Unnamed: 0             0
Country                0
GDP (nominal, 2023)    0
GDP (abbrev.)          0
GDP Growth             0
Population 2023        0
GDP per capita         0
Share of World GDP     0
dtype: int64

In [17]:
# List the column names
df.columns.tolist()


['Unnamed: 0',
 'Country',
 'GDP (nominal, 2023)',
 'GDP (abbrev.)',
 'GDP Growth',
 'Population 2023',
 'GDP per capita',
 'Share of World GDP']

In [19]:
# Summary statistics for GDP per capita (still stored as strings at this point)
df['GDP per capita'].describe()


count        181
unique       179
top       $1,706
freq           2
Name: GDP per capita, dtype: object

In [22]:
# Discover data type
df['GDP per capita'].dtype


dtype('O')

## Data Cleaning and Conversion

The `GDP per capita` column is read as strings (dtype `object`) because its values carry dollar signs and thousands separators, for example `$80,706`. That is why `describe()` above reports `count`, `unique`, `top`, and `freq` instead of numeric statistics.
To prepare the column for analysis, the cell below:
1. Converts values to string type
2. Removes unwanted symbols (commas, `$`, spaces, and em-dashes used as null markers)
3. Replaces invalid entries (`nan`, `None`, empty strings) with NaN
4. Converts the column to float

After this step, the GDP per capita values can be analyzed numerically.
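
Other columns in this dataset carry similar formatting, such as `GDP Growth`, which uses a percent sign and a Unicode minus sign (`−0.27%`). The sketch below is one possible reusable variant of the cleaning steps, not the method used in this notebook: it swaps the chained `replace` calls for a single regular expression and `pd.to_numeric(errors="coerce")`. The helper name `clean_numeric` is hypothetical.

```python
import pandas as pd


def clean_numeric(series):
    """Strip currency/percent formatting and convert to float (hypothetical helper)."""
    cleaned = (
        series.astype(str)
        .str.replace("\u2212", "-", regex=False)   # Unicode minus sign -> ASCII hyphen
        .str.replace(r"[$,%\s]", "", regex=True)   # drop $, commas, % and whitespace
    )
    # Anything that still fails to parse (e.g. a lone dash) becomes NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Sample strings in the formats seen in the head() output, plus an
# em-dash (\u2014) standing in for a missing value
sample = pd.Series(["$80,706", "$2,481", "\u22120.27%", "\u2014"])
print(clean_numeric(sample).tolist())
```

Because `errors="coerce"` silently turns every unparseable value into NaN, it is worth comparing `isnull().sum()` before and after cleaning to see how many entries were lost.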


In [26]:
# Clean the GDP per capita column and convert it to float
df['GDP per capita'] = (
    df['GDP per capita']
    .astype(str)                                # convert to string (in case of mixed types)
    .str.replace(',', '', regex=False)          # remove commas
    .str.replace('$', '', regex=False)          # remove dollar signs
    .str.replace(' ', '', regex=False)          # remove extra spaces
    .str.replace('—', '', regex=False)          # handle em-dash/null symbols
    .replace(['nan', '', 'None'], np.nan)       # replace empty with NaN
    .astype(float)                              # convert to float
)
pd.options.display.float_format = '{:,.2f}'.format
df['GDP per capita'].describe()

count       181.00
mean     17,711.29
std      23,301.49
min         193.00
25%       2,478.00
50%       7,182.00
75%      22,798.00
max     128,936.00
Name: GDP per capita, dtype: float64
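
As a sanity check on the cleaned column, GDP per capita should equal nominal GDP divided by population. The sketch below recomputes it for the five rows visible in the `head()` output earlier in the notebook, typed in as plain numbers. The column names `gdp_nominal`, `population`, and `gdp_per_capita` are chosen for this example only.

```python
import pandas as pd

# The five rows shown in the head() output, entered as plain numbers
check = pd.DataFrame(
    {
        "Country": ["United States", "China", "Germany", "Japan", "India"],
        "gdp_nominal": [
            27_720_700_000_000,
            17_794_800_000_000,
            4_525_700_000_000,
            4_204_490_000_000,
            3_567_550_000_000,
        ],
        "population": [
            343_477_335,
            1_422_584_933,
            84_548_231,
            124_370_947,
            1_438_069_596,
        ],
        "gdp_per_capita": [80_706, 12_509, 53_528, 33_806, 2_481],
    }
)

# Recompute GDP per capita and compare it with the reported column
check["recomputed"] = check["gdp_nominal"] / check["population"]
check["abs_diff"] = (check["recomputed"] - check["gdp_per_capita"]).abs()
print(check[["Country", "gdp_per_capita", "recomputed", "abs_diff"]])
```

For these five rows the recomputed values agree with the reported ones to within one dollar, which is consistent with the reported column being rounded to whole dollars.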

## GDP per Capita Range Summary

We calculate:
- Minimum GDP per capita (USD)
- Maximum GDP per capita (USD)
- Range (difference between max and min)

The cell below stores these metrics in a summary dictionary and prints them. The resulting values are:

| Metric | Value (USD) |
|---------|--------------|
| Minimum GDP per Capita | 193.00 |
| Maximum GDP per Capita | 128,936.00 |
| Range (Max - Min) | 128,743.00 |

This quantifies the global disparity between the wealthiest and poorest economies.


In [28]:
min_gdp = df['GDP per capita'].min()
max_gdp = df['GDP per capita'].max()
range_gdp = max_gdp - min_gdp

# Create a clean summary table
summary = {
    'Minimum GDP per Capita (USD)': f"{min_gdp:,.2f}",
    'Maximum GDP per Capita (USD)': f"{max_gdp:,.2f}",
    'Range (Max - Min)': f"{range_gdp:,.2f}"
}

# Display the results
for key, value in summary.items():
    print(f"{key}: {value}")

Minimum GDP per Capita (USD): 193.00
Maximum GDP per Capita (USD): 128,936.00
Range (Max - Min): 128,743.00
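
The range depends only on the two most extreme countries. As a small worked example, the figures from the `describe()` output can be combined into a few additional spread measures. The values are copied from that output, and the variable names are illustrative.

```python
# Values copied from the describe() output of the cleaned column
min_gdp, q1, median, q3, max_gdp = 193.0, 2478.0, 7182.0, 22798.0, 128936.0
mean = 17711.29

value_range = max_gdp - min_gdp   # absolute spread between the extremes
iqr = q3 - q1                     # spread of the middle 50% of countries
max_to_min = max_gdp / min_gdp    # how many times larger the maximum is
mean_to_median = mean / median    # a ratio above 1 suggests a right-skewed distribution

print(f"Range: {value_range:,.2f}")
print(f"IQR: {iqr:,.2f}")
print(f"Max/min ratio: {max_to_min:,.1f}")
print(f"Mean/median ratio: {mean_to_median:.2f}")
```

The interquartile range (20,320) is far smaller than the full range (128,743), and the mean is roughly 2.5 times the median, so a small number of very high-income countries accounts for much of the spread.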


## Next Steps

Possible extensions for this analysis include:
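- Visualizing GDP per capita by continent or region (this requires adding a region column, which the dataset does not include)
- Analyzing GDP growth trends over time (this requires multi-year data; the dataset is a single 2023 snapshot)
- Comparing GDP per capita against other metrics such as population, inflation, or HDI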
- Creating a dashboard to display interactive GDP insights

These next steps would provide a more comprehensive understanding of global economic patterns.
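
As a starting point for the visualization step, here is a minimal Matplotlib sketch. It plots only the five GDP per capita values visible in the `head()` output; for the full dataset, the cleaned `df['GDP per capita']` column would be used instead. The log-scaled axis is a design choice for this example, not something the notebook prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# GDP per capita values from the head() output (first five rows of the dataset)
first_five = pd.Series(
    [80706, 12509, 53528, 33806, 2481],
    index=["United States", "China", "Germany", "Japan", "India"],
    name="GDP per capita (USD)",
).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(first_five.index, first_five.values)
ax.set_xscale("log")  # a log axis keeps the lowest values readable next to the highest
ax.set_xlabel("GDP per capita (USD, log scale)")
ax.set_title("GDP per capita, first five rows of the dataset (2023)")
fig.tight_layout()
# In a notebook the figure renders inline; use fig.savefig("<filename>.png") to write a file.
```

With values spanning 193 to 128,936 USD across the full dataset, a linear axis would compress most countries into a narrow band near zero, which is why the log scale is used here.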
