<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
# World Bank Data Collection - Gathering Economic Indicators

</div>

This notebook collects economic and social indicators from the World Bank API to analyze how they correlate with salary trends in data science. The collected macroeconomic data will be used to study how country-level factors such as GDP per capita, population size, and education level affect compensation patterns in the tech industry.

**Data sources**: 
[World Bank Open Data API](https://datahelpdesk.worldbank.org/knowledgebase/articles/889392)

<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
## Collection Strategy

</div>

For every unique country in the salary dataset, I collect four annual indicators that represent key factors influencing tech salaries:

| **Indicator** | **World Bank Code** | **Category** |
|-------------|----------------|-------------|
| `Population` | SP.POP.TOTL | Market Size |
| `GDP per capita` | NY.GDP.PCAP.CD | Economic Development |
| `Education rate (%)` | SE.TER.CUAT.BA.ZS | Human Capital |
| `Internet penetration (%)` | IT.NET.USER.ZS | Digital Infrastructure |
    
<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
## Dataset Structure

</div>

After collection, the combined dataset has the following structure. Each per-indicator file contains the key columns plus its own `value_*` column:

| **Column** | **Description** | **Example** |
|------------|-----------------|-------------|
| `country_code` | ISO 3166-1 alpha-2 country code | US, GB, DE |
| `country_name` | Country name as returned by the API | Andorra |
| `year` | Data year | 2020, 2021, 2022 |
| `value_population` | Population total | 331900000, 67800000 |
| `value_gdp_per_capita` | GDP per capita (USD) | 70249.30, 46344.05 |
| `value_education` | Education rate (%) | 35.05, 39.59 |
| `value_internet` | Internet penetration (%) | 91.30, 96.20 |

**Time period:** 2020-2025

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">
    
### Importing required libraries

</div>

In [1]:
import asyncio
import aiohttp
import pandas as pd
import numpy as np
import requests
import time
import warnings

warnings.filterwarnings('ignore')

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Data loading

</div>

In [2]:
df = pd.read_csv('salaries.csv')

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Countries identification

</div>

To work with the World Bank API efficiently, I first identify all unique countries in the salary dataset. This lets me request data only for the relevant countries and avoid unnecessary traffic.

In [3]:
unique_countries = sorted(df['company_location'].unique())
print(f'\nUnique countries in dataset: {len(unique_countries)}')
print('\nCountries list:\n')
print(unique_countries)

Unique countries in dataset: 95
Countries list:
['AD', 'AE', 'AM', 'AR', 'AS', 'AT', 'AU', 'BA', 'BE', 'BG', 'BR', 'BS', 'CA', 'CD', 'CF', 'CH', 'CL', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DK', 'DO', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FR', 'GB', 'GH', 'GI', 'GR', 'HK', 'HN', 'HR', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IT', 'JM', 'JO', 'JP', 'KE', 'KR', 'LB', 'LS', 'LT', 'LU', 'LV', 'MD', 'MK', 'MT', 'MU', 'MX', 'MY', 'NG', 'NL', 'NO', 'NZ', 'OM', 'PA', 'PE', 'PH', 'PK', 'PL', 'PR', 'PT', 'QA', 'RO', 'RS', 'RU', 'SA', 'SE', 'SG', 'SI', 'SK', 'SV', 'TH', 'TR', 'TW', 'UA', 'US', 'VE', 'VN', 'XK', 'ZA', 'ZM']
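The goal of keeping traffic low can be taken one step further. The World Bank API accepts several country codes separated by `;` in the URL path and a year range written as `start:end` in the `date` parameter, so one request can cover many country-year pairs. The sketch below only builds such URLs and sends nothing. The helper names `chunked` and `build_batch_url` and the batch size of 2 are hypothetical, chosen for illustration; how the API treats a batch containing a code it does not recognize should be verified before relying on this.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.worldbank.org/v2'


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_url(countries, indicator_code, start_year, end_year, per_page=1000):
    """Build one request URL covering several countries and a range of years."""
    country_part = ';'.join(countries)
    # safe=':' keeps the year range readable instead of percent-encoding the colon
    query = urlencode(
        {'date': f'{start_year}:{end_year}', 'format': 'json', 'per_page': per_page},
        safe=':',
    )
    return f'{BASE_URL}/country/{country_part}/indicator/{indicator_code}?{query}'


# Illustrative subset of the country list; the batch size is an arbitrary choice
sample_countries = ['AD', 'AE', 'AM', 'AR', 'AT']
batch_urls = [
    build_batch_url(batch, 'SP.POP.TOTL', 2020, 2025)
    for batch in chunked(sample_countries, 2)
]
for url in batch_urls:
    print(url)
```

A batched response carries many records, so every entry in the second element of the returned JSON would need to be read, not only the first one.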


<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### World Bank API function

</div>

This function requests a single World Bank indicator for one country and one year. It returns `None` when the request fails or the API has no record, so that the collection step can skip missing values.

In [4]:
# Fetch one World Bank indicator value for a single country and year
async def get_worldbank_data_async(session, country, indicator_code, year):
    url = f'https://api.worldbank.org/v2/country/{country}/indicator/{indicator_code}'
    params = {
        'date': str(year),
        'format': 'json',
        'per_page': 1000
    }
    
    try:
        async with session.get(url, params=params) as response:
            if response.status == 200:
                data = await response.json()
                if len(data) > 1 and data[1]:
                    item = data[1][0]
                    return {
                        'country_code': item['country']['id'],
                        'country_name': item['country']['value'],
                        'year': item['date'],
                        'value': item['value'],
                        'indicator': indicator_code
                    }
            return None
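    except (aiohttp.ClientError, asyncio.TimeoutError, KeyError, ValueError):
        # Network failures, timeouts, and malformed payloads are treated as missing data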
        return None
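The check `len(data) > 1 and data[1]` in the function above relies on the shape of the decoded response: a successful call yields a two-element list with paging metadata first and the records second. The sketch below isolates that parsing step so it can be tried without any network access. The function name `parse_worldbank_payload` is hypothetical, and both sample payloads are hand-written illustrations of the shape, not captured API output.

```python
def parse_worldbank_payload(payload, indicator_code):
    """Return a list of flat records from a decoded World Bank JSON response."""
    # A usable response is a list of [metadata, records]; anything else yields no rows
    if not isinstance(payload, list) or len(payload) < 2 or not payload[1]:
        return []
    records = []
    for item in payload[1]:
        records.append({
            'country_code': item['country']['id'],
            'country_name': item['country']['value'],
            'year': item['date'],
            'value': item['value'],
            'indicator': indicator_code,
        })
    return records


# Hand-written payload mirroring the fields the notebook reads (illustrative only)
sample_payload = [
    {'page': 1, 'pages': 1, 'per_page': 1000, 'total': 1},
    [{
        'country': {'id': 'AD', 'value': 'Andorra'},
        'date': '2020',
        'value': 77380,
    }],
]
# Assumed shape of a response without a records element (illustrative only)
error_payload = [{'message': 'illustrative error body'}]

parsed = parse_worldbank_payload(sample_payload, 'SP.POP.TOTL')
print(parsed)
print(parse_worldbank_payload(error_payload, 'SP.POP.TOTL'))
```

Unlike the cell above, this variant returns every record in the payload, which matters only when a request covers more than one country or year.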

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Collecting all indicators

</div>

The next function runs the collection for all four indicators, saves each indicator to its own CSV file, and merges the results into a single dataset.

In [5]:
# Async collection of all World Bank indicators
async def collect_all_worldbank_data_async(countries_list, start_year, end_year):
    
    indicators = {
        'population': 'SP.POP.TOTL',
        'gdp_per_capita': 'NY.GDP.PCAP.CD',
        'education': 'SE.TER.CUAT.BA.ZS',
        'internet': 'IT.NET.USER.ZS'
    }
    
    all_data = []
    
    async with aiohttp.ClientSession() as session:
        for indicator_name, indicator_code in indicators.items():
            tasks = []
            
            for country in countries_list:
                for year in range(start_year, end_year + 1):
                    task = get_worldbank_data_async(session, country, indicator_code, year)
                    tasks.append(task)
            
            results = await asyncio.gather(*tasks)
            
            indicator_data = []
            for result in results:
                if result and result['value'] is not None:
                    indicator_data.append({
                        'country_code': result['country_code'],
                        'country_name': result['country_name'],
                        'year': result['year'],
                        f'value_{indicator_name}': round(float(result['value']), 2)
                    })
            
            if indicator_data:
                df = pd.DataFrame(indicator_data)
                df.to_csv(f'worldbank_{indicator_name}.csv', index=False)
                all_data.append(df)
    
    if all_data:
        final_df = all_data[0]
        for df in all_data[1:]:
            final_df = final_df.merge(df, on=['country_code', 'country_name', 'year'], how='outer')
        
        final_df.to_csv('worldbank_complete.csv', index=False)
        return final_df
    else:
        return pd.DataFrame()
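The function above creates one task per country and year and hands them all to `asyncio.gather` at once. For 95 countries and six years that is 570 requests per indicator started together. A common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below shows the pattern with a stand-in coroutine instead of a real HTTP call; `bounded_gather`, `fake_request`, and the limit of 2 are hypothetical names and values, not part of the notebook.

```python
import asyncio


async def bounded_gather(coroutines, limit=10):
    """Await all coroutines, running at most `limit` of them at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coroutine):
        # Each coroutine waits here until one of the `limit` slots is free
        async with semaphore:
            return await coroutine

    # gather returns results in the order the coroutines were passed in
    return await asyncio.gather(*(run_one(c) for c in coroutines))


async def fake_request(n):
    """Stand-in for an API call: sleep briefly, then return a value."""
    await asyncio.sleep(0.01)
    return n * 2


async def demo():
    return await bounded_gather([fake_request(n) for n in range(5)], limit=2)


# In a script; inside a notebook cell use `demo_results = await demo()` instead
demo_results = asyncio.run(demo())
print(demo_results)
```

In the collection function, the same wrapper could replace the direct `asyncio.gather(*tasks)` call, with the limit tuned to whatever request rate the API tolerates.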

In [6]:
# Full data collection for all countries 2020-2025
worldbank_data = await collect_all_worldbank_data_async(unique_countries, 2020, 2025)
print(f'\nCollection completed! Dataset shape: {worldbank_data.shape}\n')
worldbank_data.head()

Collection completed! Dataset shape: (470, 7)


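| | country_code | country_name | year | value_population | value_gdp_per_capita | value_education | value_internet |
|---|--------------|--------------|------|------------------|----------------------|-----------------|----------------|
| 0 | AD | Andorra | 2020 | 77380.0 | 37361.09 | NaN | 93.2 |
| 1 | AD | Andorra | 2021 | 78364.0 | 42425.70 | NaN | 93.9 |
| 2 | AD | Andorra | 2022 | 79705.0 | 42414.06 | 25.04 | 94.5 |
| 3 | AD | Andorra | 2023 | 80856.0 | 46812.45 | NaN | 95.4 |
| 4 | AD | Andorra | 2024 | 81938.0 | 49303.67 | NaN | NaN |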
