# Gender Data Indicators
Generic Framework for extracting Gender Data

# World Bank API Usage Documentation
  
**API Documentation:** https://datahelpdesk.worldbank.org/knowledgebase/articles/898581<br>
**Base URL:** <a href="https://api.worldbank.org/v2/">api.worldbank.org/v2/</a><br>
**Parameters:**

| Parameter | Description | Example |
| --- | --- | --- |
| `date` | Date or date range by year, month or quarter | `date=2000:2010` |
| `format` | Output format: xml, json, jsonP | `format=json` |
| `downloadformat` | Download format: csv, xml, excel | `downloadformat=csv` |
| `page` | Page number of the result set | `page=2` |
| `per_page` | Number of results per page (default 50) | `per_page=25` |
| `mrv` | Most recent values based on the number specified | `mrv=5` |
| `mrnev` | Most recent non-empty values based on the number specified | `mrnev=5` |
| `gapfill` | Fills missing values by backtracking to the next available period (works with `mrv`) | `gapfill=Y` |
| `frequency` | Frequency of values: Q (quarterly), M (monthly), Y (yearly) (works with `mrv`) | `frequency=M` |
| `footnote` | Fetches footnote detail in data calls | `footnote=y` |
| `language` | Local language translations for some countries | `language=vi` |
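
As a minimal sketch of how these parameters combine into a request, the snippet below assembles a query URL using only the standard library. The helper name `build_indicator_url` and the chosen countries, indicator, and parameter values are illustrative assumptions; the path layout follows the example URL that appears in the ingestion cell of this notebook. No network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.worldbank.org/v2"

def build_indicator_url(countries, indicators, **params):
    """Assemble a World Bank API v2 data URL (hypothetical helper, no network call)."""
    # Multiple countries or indicators are joined with ";"
    path = f"{BASE_URL}/country/{';'.join(countries)}/indicator/{';'.join(indicators)}"
    # Keep ":" unescaped so date ranges such as 2000:2010 stay readable
    query = urlencode(params, safe=":")
    return f"{path}?{query}" if query else path

url = build_indicator_url(
    ["LSO", "ZAF"],
    ["SG.GEN.PARL.ZS"],
    format="json",
    date="2000:2010",
    per_page=25,
)
print(url)
```

Keeping URL construction in one small function makes it easier to vary `date`, `mrv`, or `per_page` without editing string literals in several places.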


# 1. Extraction

### 1.1 Data Ingestion

In [1]:
# Imports
import pandas as pd # Dataframe
from pyjstat import pyjstat
import json # Parsing json to object
import requests # Making HTTP get requests
import pycountry
from ydata_profiling import ProfileReport

In [2]:
# Create Lists
# Commonwealth Countries
commonwealth_countries = [
    "Botswana", "Cameroon", "Gabon", "Gambia", "Ghana", "Kenya",
    "Eswatini", "Lesotho", "Malawi", "Mauritius", "Mozambique",
    "Namibia", "Nigeria", "Rwanda", "Seychelles", "Sierra Leone",
    "South Africa", "Togo", "Uganda", "Tanzania, United Republic of", "Zambia",
    "Bangladesh", "Brunei Darussalam", "India", "Malaysia", "Maldives",
    "Pakistan", "Singapore", "Sri Lanka", "Antigua and Barbuda", "Bahamas",
    "Barbados", "Belize", "Canada", "Dominica", "Grenada", "Guyana",
    "Jamaica", "Saint Lucia", "Saint Kitts and Nevis", "Saint Vincent and The Grenadines",
    "Trinidad and Tobago", "Cyprus", "Malta", "United Kingdom", "Australia",
    "Fiji", "Kiribati", "Nauru", "New Zealand", "Papua New Guinea", "Samoa",
    "Solomon Islands", "Tonga", "Tuvalu", "Vanuatu"
]

# Gender Indicators
gender_indicators = [
    "FIN21.T.D.2017.1","FIN21.T.D.2017.2","FIN21.T.D.2017","SG.GEN.PARL.ZS",
    "SG.GEN.MNST.ZS","SE.SEC.ENRR.FE","UIS.FGP.5T8.F600","SL.TLF.CACT.FE.ZS",
    "SG.LAW.NODC.HR","SG.OWN.LDAL.FE.ZS","SG.OPN.BANK.EQ","SG.CNT.SIGN.EQ",
    "SP.DYN.SMAM.FE","SP.DYN.SMAM.MA","SP.M15.2024.FE.ZS","SP.M18.2024.FE.ZS",
    "SG.VAW.1549.ME.ZS","SG.VAW.15PL.ME.ZS","SG.VAW.1549.LT.ME.ZS","SG.VAW.15PL.LT.ME.ZS",
    "SG.LEG.DVAW","SH.STA.MMRT","SH.STA.MMRT.NE","SP.DYN.LE00.FE.IN","SP.DYN.LE00.MA.IN","SP.DYN.LE00.IN"
]

# Get 3-letter ISO codes (ISO 3166-1 alpha-3), e.g. "Trinidad and Tobago" -> TTO
# Names must match pycountry's records; see https://www.iban.com/country-codes
country_iso_codes = {}
for country in commonwealth_countries:
    try:
        # get() returns None for an unknown name, so .alpha_3 raises AttributeError
        iso_code = pycountry.countries.get(name=country).alpha_3
        country_iso_codes[country] = iso_code
    except AttributeError:
        print(f"ISO code not found for {country}")
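
The lookup above depends on each name matching pycountry's record, so a spelling such as "Trinidad & Tobago" may be reported as not found. One way to handle this, sketched below under assumptions, is to pass names through a small alias table first and to collect the misses instead of printing them. `resolve_iso_codes`, the `ALIASES` entries, and the stand-in `fake_lookup` are all hypothetical; in the notebook the lookup would wrap `pycountry.countries.get`.

```python
# Hypothetical alias table: maps alternative spellings to the name the lookup expects
ALIASES = {
    "Trinidad & Tobago": "Trinidad and Tobago",
    "Tanzania": "Tanzania, United Republic of",
}

def resolve_iso_codes(names, lookup, aliases=ALIASES):
    """Return (codes, missing): a name -> alpha-3 dict and a list of unresolved names."""
    codes, missing = {}, []
    for name in names:
        canonical = aliases.get(name, name)
        code = lookup(canonical)  # expected to return None when nothing matches
        if code is None:
            missing.append(name)
        else:
            codes[name] = code
    return codes, missing

# Stand-in for pycountry so the sketch runs without the package installed
_KNOWN = {
    "Trinidad and Tobago": "TTO",
    "Tanzania, United Republic of": "TZA",
    "Kenya": "KEN",
}

def fake_lookup(name):
    return _KNOWN.get(name)

codes, missing = resolve_iso_codes(
    ["Kenya", "Trinidad & Tobago", "Tanzania", "Atlantis"], fake_lookup
)
print(codes)
print(missing)
```

With pycountry installed, the lookup could be written as `lambda n: getattr(pycountry.countries.get(name=n), "alpha_3", None)`, assuming `get` returns `None` for unknown names as the cell above expects.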

In [3]:
# Use JSON-stat to download the indicator data, filtered by country and indicator codes
# Example: http://api.worldbank.org/v2/country/LSO;ZAF/indicator/SP.POP.TOTL;SG.GEN.PARL.ZS?format=jsonstat&source=14
def download_indicators(country_list, indicator_list):
    # Skip countries whose ISO code lookup failed in the previous cell
    isocode_filter = [country_iso_codes[c] for c in country_list if c in country_iso_codes]
    # The API accepts several countries and indicators separated by ";"
    countries = ";".join(isocode_filter)
    indicators = ";".join(indicator_list)
    # source=14 selects the Gender Statistics database
    api_url = (
        f"http://api.worldbank.org/v2/country/{countries}/indicator/{indicators}"
        "?format=jsonstat&gapfill=N&source=14"
    )
    dataset = pyjstat.Dataset.read(api_url)
    df = dataset.write('dataframe')
    return df
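
`download_indicators` relies on pyjstat and the `jsonstat` format. For comparison, here is a rough sketch of reading the plain `format=json` output, which returns a two-element array: paging metadata followed by the records. The `collect_records` function, the `fetch_page` callable, and the sample payload values are made-up stand-ins so the sketch runs offline; a real version would fetch each page with `requests.get(...)` and pass the `page` parameter from the table above. It also assumes yearly data, since `date` is converted with `int()`.

```python
def collect_records(fetch_page):
    """Walk every page of a format=json response and return flat row dicts."""
    rows, page = [], 1
    while True:
        meta, records = fetch_page(page)
        for rec in records or []:  # tolerate a null records element
            rows.append({
                "Country": rec["country"]["value"],
                "Series": rec["indicator"]["value"],
                "Year": int(rec["date"]),  # assumes yearly observations
                "value": rec["value"],
            })
        if page >= int(meta["pages"]):
            return rows
        page += 1

# Made-up sample pages in the shape of the API response (values are illustrative only)
SAMPLE_PAGES = {
    1: [{"page": 1, "pages": 2, "per_page": 1, "total": 2},
        [{"indicator": {"id": "SG.GEN.PARL.ZS", "value": "Example series"},
          "country": {"id": "KE", "value": "Kenya"},
          "date": "2020", "value": 1.0}]],
    2: [{"page": 2, "pages": 2, "per_page": 1, "total": 2},
        [{"indicator": {"id": "SG.GEN.PARL.ZS", "value": "Example series"},
          "country": {"id": "KE", "value": "Kenya"},
          "date": "2019", "value": None}]],
}

rows = collect_records(lambda page: SAMPLE_PAGES[page])
print(rows)
```

The resulting rows use the same `Country`, `Series`, `Year`, and `value` keys as the dataframe the profiling cells expect, so `pd.DataFrame(rows)` would be a drop-in starting point.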

### 1.2 Data Profiling (Data Quality)

In [4]:
# Pivot the data on country codes and series
# This format is more suitable for YData reporting
def get_transformed_df(df):
    columns = df['Series'].unique().tolist()
    columns.append('Year')
    columns.append('Country')
    dfp = df.pivot(index=['Year'], columns=['Country','Series'], values = ['value'])
    tdf = pd.DataFrame(columns=columns)
    counter = 0
    # Flatten the df indices
    for index, row in dfp.iterrows():
        for country, series in list(row["value"].keys()):
            trow = { 'Year': int(row.name) }
            trow['Country'] = country
            trow[series] = row["value"][country][series]
            tdf.loc[counter] = trow
            counter = counter + 1
    # reset_index returns a new frame, so the result must be assigned
    tdf = tdf.reset_index(drop=True)
    return tdf
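
The row-by-row loop above can become slow on larger frames, because each `tdf.loc[...]` assignment enlarges the frame. As an alternative sketch, assuming the same `Country`, `Series`, `Year`, and `value` columns, `unstack` can produce one column per series in a single step. The function name `get_transformed_df_wide` and the toy data are hypothetical. Note that the output differs from the loop: all series for a country and year share one row, and duplicate (Year, Country, Series) combinations would raise an error.

```python
import pandas as pd

def get_transformed_df_wide(df):
    """Reshape long data to one row per (Year, Country), one column per Series."""
    wide = (
        df.set_index(["Year", "Country", "Series"])["value"]
          .unstack("Series")   # one column per series; gaps stay as NaN
          .reset_index()
    )
    wide["Year"] = wide["Year"].astype(int)
    wide.columns.name = None   # drop the leftover "Series" axis label
    return wide

# Toy data in the long layout returned by the download step (values are made up)
toy = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Ghana", "Ghana"],
    "Series": ["A", "B", "A", "B"],
    "Year": ["2020", "2020", "2020", "2020"],
    "value": [1.0, None, 3.0, 4.0],
})
wide = get_transformed_df_wide(toy)
print(wide)
```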

In [5]:
# Calculates summary statistics for each indicator and formats the results as markdown
def generate_report(indicator_list):
    counter = 0
    markdown = ''
    # Loop over indicators
    for indicator in indicator_list:
        counter = counter + 1
        print(f"Processing: {indicator} {counter}/{len(indicator_list)}")
        df = download_indicators(commonwealth_countries, [indicator])
        dt = get_transformed_df(df)
        label = df['Series'].unique()[0]
        # Build table header
        markdown += f'### { label }\n'
        markdown += f'| Attribute | Value | Attribute | Value |\n'
        markdown += f'| :--- | ---: | :--- | ---: |\n'
        # Calculate summary statistics with the built-in pandas Series methods
        mean = dt[label].mean()
        median = dt[label].median()
        maximum = dt[label].max()
        minimum = dt[label].min()
        kurtosis = dt[label].kurt()
        std = dt[label].std()
        null_count = dt[label].isnull().sum()
        dt_nn = dt[dt[label].notnull()]
        countries_reporting = dt_nn['Country'].nunique() * 100 / len(commonwealth_countries)
        from_year = dt_nn['Year'].min()
        to_year = dt_nn['Year'].max()
        unique_values = dt_nn[label].nunique()
        total_values = len(dt_nn[label])
        # Identify the indicator type: fewer than 10 distinct values, or a
        # distinct-to-total ratio below 1%, is treated as Categorical
        ind_type = 'Numeric'
        if unique_values < 10 or (unique_values / total_values) < 0.01:
            ind_type = 'Categorical'
        markdown += f'|Mean|{mean:.2f}|Missing Values|{(null_count*100/len(dt)):.2f}%|\n'
        markdown += f'|Median|{median:.2f}|Countries Reporting|{countries_reporting:.2f}%|\n'
        markdown += f'|Maximum|{maximum:.2f}|From Year (Oldest)|{from_year}|\n'
        markdown += f'|Minimum|{minimum:.2f}|To   Year (Latest)|{to_year}|\n'
        markdown += f'|Std|{std:.2f}|Indicator Code|`{ indicator }`|\n'
        markdown += f'|Kurtosis|{kurtosis:.2f}|Type|{ ind_type }|\n'
        reporting_by_label = 'Reported&nbsp;By'
        reporting_countries = 'All'
        # Above 50% coverage, list the countries that are missing; otherwise list those reporting.
        # Names are compared as strings, so a country whose API label differs from its
        # spelling in commonwealth_countries is listed as not reporting.
        if 50 < countries_reporting < 100:
            reporting_by_label = 'Not&nbsp;Reported&nbsp;By'
            reporting_countries = ','.join(list(set(commonwealth_countries).difference(dt_nn['Country'].unique())))
        elif countries_reporting <= 50:
            reporting_countries = ','.join(dt_nn['Country'].unique())
        markdown += f'|{reporting_by_label}<td colspan="3">{reporting_countries}</td>|\n'
        markdown += f'\n\n'
    return markdown
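
To make the two heuristics inside `generate_report` easier to check in isolation, here is a small sketch that pulls them into pure functions. The names `classify_indicator` and `coverage_summary` are hypothetical. The thresholds (10 distinct values, a 1% ratio, 50% coverage) are copied from the code above, while assigning exactly 50% to the "Reported By" branch and sorting the country lists are assumptions of this sketch.

```python
def classify_indicator(values):
    """Label a list of non-null observations as 'Numeric' or 'Categorical'."""
    unique_values = len(set(values))
    total_values = len(values)
    # An empty list has zero distinct values, so the first test short-circuits
    if unique_values < 10 or unique_values / total_values < 0.01:
        return "Categorical"
    return "Numeric"

def coverage_summary(expected, reporting):
    """Return (percent reporting, row label, countries to list)."""
    expected, reporting = set(expected), set(reporting)
    # Unlike the notebook code, only names present in `expected` are counted
    percent = len(reporting & expected) * 100 / len(expected)
    if percent == 100:
        return percent, "Reported By", ["All"]
    if percent > 50:
        # Mostly covered: it is shorter to list who is missing
        return percent, "Not Reported By", sorted(expected - reporting)
    return percent, "Reported By", sorted(reporting & expected)

print(classify_indicator([0, 1, 1, 0, 1]))
print(classify_indicator([x / 10 for x in range(50)]))
print(coverage_summary(["Kenya", "Ghana", "Fiji", "Malta"], ["Kenya", "Ghana", "Fiji"]))
```

Separating the rules from the markdown formatting also makes it straightforward to adjust a threshold and rerun only the classification.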

In [6]:
report = generate_report(gender_indicators)

Processing: FIN21.T.D.2017.1 1/26
Processing: FIN21.T.D.2017.2 2/26
Processing: FIN21.T.D.2017 3/26
Processing: SG.GEN.PARL.ZS 4/26
Processing: SG.GEN.MNST.ZS 5/26
Processing: SE.SEC.ENRR.FE 6/26
Processing: UIS.FGP.5T8.F600 7/26
Processing: SL.TLF.CACT.FE.ZS 8/26
Processing: SG.LAW.NODC.HR 9/26
Processing: SG.OWN.LDAL.FE.ZS 10/26
Processing: SG.OPN.BANK.EQ 11/26
Processing: SG.CNT.SIGN.EQ 12/26
Processing: SP.DYN.SMAM.FE 13/26
Processing: SP.DYN.SMAM.MA 14/26
Processing: SP.M15.2024.FE.ZS 15/26
Processing: SP.M18.2024.FE.ZS 16/26
Processing: SG.VAW.1549.ME.ZS 17/26
Processing: SG.VAW.15PL.ME.ZS 18/26
Processing: SG.VAW.1549.LT.ME.ZS 19/26
Processing: SG.VAW.15PL.LT.ME.ZS 20/26
Processing: SG.LEG.DVAW 21/26
Processing: SH.STA.MMRT 22/26
Processing: SH.STA.MMRT.NE 23/26
Processing: SP.DYN.LE00.FE.IN 24/26
Processing: SP.DYN.LE00.MA.IN 25/26
Processing: SP.DYN.LE00.IN 26/26


In [7]:
print(report)

### Borrowed to start, operate, or expand a farm or business, female (% age 15+)
| Attribute | Value | Attribute | Value |
| :--- | ---: | :--- | ---: |
|Mean|7.31|Missing Values|98.21%|
|Median|5.56|Countries Reporting|62.50%|
|Maximum|21.21|From Year (Oldest)|2014|
|Minimum|0.05|To   Year (Latest)|2017|
|Std|5.96|Indicator Code|`FIN21.T.D.2017.1`|
|Kurtosis|-0.11|Type|Numeric|
|Not&nbsp;Reported&nbsp;By<td colspan="3">Dominica,Vanuatu,Kiribati,Papua New Guinea,Fiji,Seychelles,Nauru,Guyana,Eswatini,Tuvalu,Antigua and Barbuda,Saint Vincent and The Grenadines,Saint Lucia,Tanzania, United Republic of,Tonga,Barbados,Grenada,Solomon Islands,Brunei Darussalam,Gambia,Samoa,Bahamas,Saint Kitts and Nevis</td>|


### Borrowed to start, operate, or expand a farm or business, male (% age 15+)
| Attribute | Value | Attribute | Value |
| :--- | ---: | :--- | ---: |
|Mean|9.57|Missing Values|98.21%|
|Median|8.04|Countries Reporting|62.50%|
|Maximum|27.82|From Year (Oldest)|2014|
|Minimum|0.72|To   Ye

# 2. Load (MySQL, Parquet)