# GLOBAL IT SALARY ANALYSIS (Git-Girls-Collective-7)  
# - Notebook Part 1 (Python API code)

## Purpose

* The overall purpose of this program is to make a series of API calls to Teleport.org to retrieve the salary information it offers on 52 job roles across 198 countries.  
* The code makes a separate API call to exchangerate-api.com to retrieve a list of conversion rates for all the relevant countries' currencies into GBP.  
* Finally, our code integrates the information from these API requests into a DataFrame containing country info, currency info, currency conversion rates, local currency salaries and GBP converted salaries.  
* This DataFrame is exported to a timestamped .csv file (i.e. output_gbp_salaries_{timestamp}.csv) which can be imported into the Jupyter Notebook Part 2, for straightforward combination with other datasets for analysis.

### **_Fair Warning!_** _If you run this notebook in full, API requests will be made and output files will be generated!_

## Structure

This notebook catalogues a Python program, which would traditionally be structured as follows:
* **main.py** - execution file, which imports functionality from utils_api_to_df.py. Tidy entry point for the program.
* **utils_api_to_df.py** - module contains the core functionality of the program (retrieving data via API calls, and reshaping it into a usable format for data analysis elsewhere)
* **utils_helper.py** - module contains small utility functions, called by utils_api_to_df.py

**Notes about restructure**  
Now that this code is being run in a single notebook, the original code has been changed as follows:
* comments and some docstrings have become markdown cell commentary
* per-file import statements have been removed
* non-applicable commentary has been removed
* all import statements appear bundled together at the top of the notebook, rather than per file

To mimic the full program file structure, the code content appears in dependency order.

In [None]:
# Import module dependencies
import os
import datetime # fetch, manipulate and format dates
import requests
import pandas as pd
from pandas import json_normalize
import json
import time

# create a directory to gather all the API output files
os.makedirs("api_output", exist_ok=True)

<div style="background-color: #ADD8E6; padding: 5px; border: 1px solid black; border-radius: 1px;">
    <h3 style="margin-bottom: 5px; font-weight: bold;">Section 1: Content of utils_helper.py</h3>
    <p style="margin-top: 0;">Helper functions for utils_api_to_df.py section</p>
</div>

**calc_file_size** function estimates the file size of an API call. Only really needed for large API data requests.

In [None]:
def calc_file_size(num_API_requests):
    """Estimates the file size of an API call. Only really needed for large API data requests. Param: number of JSON objects (int) expected from API request. """
    json_file_size_kb = 154/10 # got from test of 100 sample results, stored as a JSON
    pause = 0.5
    # batch_size = 250
    mb_filesize = (num_API_requests*json_file_size_kb)/1000
    
    if mb_filesize < 100:
        judgement = "manageable."
    else:
        judgement = "getting chunky."

    print(f"For {num_API_requests} JSON objects, it will take {round(num_API_requests*pause/60,2)} mins to fetch and at {round(json_file_size_kb)} KB per object, total filesize will be {round(mb_filesize)} MB; that's {judgement}")

# Test:
# calc_file_size(400)
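
As a worked example of the arithmetic above, the sketch below returns the two estimates instead of printing them. The function name `estimate_request_cost` and its keyword parameters are hypothetical; the 15.4 KB per object and 0.5 second pause are the values used in `calc_file_size`.

```python
def estimate_request_cost(num_requests, kb_per_object=15.4, pause_seconds=0.5):
    """Return (minutes to fetch, total size in MB) for a batch of API requests."""
    # Time is dominated by the deliberate pause between requests.
    minutes = num_requests * pause_seconds / 60
    # 1000 KB per MB, matching the convention in calc_file_size.
    megabytes = num_requests * kb_per_object / 1000
    return round(minutes, 2), round(megabytes, 2)


minutes, megabytes = estimate_request_cost(252)
print(f"{minutes} mins, {megabytes} MB")  # 2.1 mins, 3.88 MB
```

For the full run of 252 countries this comes to roughly two minutes and under 4 MB, comfortably below the 100 MB threshold.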

**count_json_items** function counts the number of JSON/dictionary keys (at a specified level) in a JSON file and returns a sentence including the number of items. This verifies how many countries information has been retrieved for. 

In [None]:
def count_json_items(filepath, json_dict_structure, item_key):
    """ Counts the number of JSON/dictionary keys (at a specified level) in a JSON file. Params: filepath (str), the dictionary structure (list of keys, eg. ['_links','country:items'])  for the target key, and the target key (str, e.g. 'name') to be counted. Returns sentence including number of items. """
    try: 
        with open(filepath, "r") as json_file:
            json_data = json.load(json_file)

        for key in json_dict_structure:
            json_data = json_data[key]

        num_entries = sum(item_key in item for item in json_data)

        return f"The number of entries under '{item_key}' key in this JSON file is {num_entries}"
        
    except FileNotFoundError:
        print(f"File not found: {filepath}")
    except json.JSONDecodeError:
        print("Invalid JSON format.")
    except KeyError as e:
        print(f"Key error: {e}")

# Test params
# filepath = "api_output/list_countries.json"
# json_dict_structure = ['_links','country:items']
# item_key = "name"
# # 252
# print(count_json_items(filepath, json_dict_structure, item_key))
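
The key walk inside `count_json_items` can be tried without a file on disk. The sketch below applies the same logic to an in-memory dictionary; the helper name `count_items_with_key` and the sample data are made up for illustration and only mimic the `_links` / `country:items` shape used above.

```python
def count_items_with_key(data, key_path, item_key):
    """Walk down key_path, then count the items that contain item_key."""
    node = data
    for key in key_path:
        node = node[key]
    # True counts as 1, so summing the membership tests gives the count.
    return sum(item_key in item for item in node)


# Made-up sample in the same shape as list_countries.json
sample = {
    "_links": {
        "country:items": [
            {"href": "https://api.teleport.org/api/countries/iso_alpha2:AD/", "name": "Andorra"},
            {"href": "https://api.teleport.org/api/countries/iso_alpha2:AE/", "name": "United Arab Emirates"},
            {"href": "https://api.teleport.org/api/countries/iso_alpha2:AF/"},  # no "name" key
        ]
    }
}

print(count_items_with_key(sample, ["_links", "country:items"], "name"))  # 2
```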

**count_keys** function returns the count of top-level keys in a JSON file dictionary. It is used to check that the size of the API request made to Teleport matches the number of countries information is available for.

In [None]:
def count_keys(filepath):
    """Returns count of keys in a JSON file dictionary"""
    try:
        with open(filepath, "r") as json_file:
            json_data = json.load(json_file)

        num_keys = len(json_data.keys())
        print("Keys found:", list(json_data.keys()))  # For debugging

        return f"The number of top-level keys in this JSON file is {num_keys}"
    except FileNotFoundError:
        print(f"File not found: {filepath}")
    except json.JSONDecodeError:
        print("Invalid JSON format.")
    except KeyError as e:
        print(f"Key error: {e}")

<div style="background-color: #ADD8E6; padding: 5px; border: 1px solid black; border-radius: 1px;">
    <h3 style="margin-bottom: 5px; font-weight: bold;">Section 2: Content of utils_api_to_df.py</h3>
    <p style="margin-top: 0;">This file contains 6 functions:</p>
    <p style="margin-top: 0;"><b>1) simple_get_list_countries()</b> - Makes API call to Teleport.org to retrieve a simple list of countries for which it offers data</p>
    <p style="margin-top: 0;"><b>2) count_json_items()</b> - Tests the content retrieved by 1), outputs a total number of countries</p>
    <p style="margin-top: 0;"><b>3) get_overviews_countries()</b> - Uses JSON formatted information returned by 1), to make a series of API calls to Teleport to retrieve country-level overview data (most importantly, currency code and iso_alpha2 code)</p>
    <p style="margin-top: 0;"><b>4) count_keys()</b> - Counts the number of countries for which information was retrieved by 3)</p>
    <p style="margin-top: 0;"><b>5) api_to_dataframe()</b> - Function makes API request to retrieve country-level salary data from Teleport API. It utilises JSON data retrieved from the previous two API requests. It outputs this as a DataFrame object and a sibling .csv file. </p>
    <p style="margin-top: 0;"><b>6) get_conversion_rates_insert_df()</b> - A wrapper around two functions, grouped under one call because the output of the first feeds directly into the second. <b>get_gbp_conversion_rates</b> makes an API call to exchangerate-api.com and saves the returned rates to a file called currency_rates_{timestamp}.json. This file is used by <b>convert_salary_to_GBP</b> to create additional columns where local salaries are converted to GBP. The wrapper <b>get_conversion_rates_insert_df</b> ensures the two functions are called together.</p>
    <p style="margin-top: 0;">See the docstrings and comments for detailed individual explanations of functionality.</p>
</div> 

**Note**  
If you just want to run all functions in one hit to create a new dataset, skip to **Section 3: (main.py)**.  
If you only want to run/test individual functions, uncomment and run the API functions / print functions in this section individually.  
Individual functions calls in this section are commented-out to prevent accidental large API requests, and/or uncontrolled update/overwrite of files.

1) Function makes an API request to retrieve list of countries from Teleport API (country name and url)

In [None]:
def simple_get_list_countries(): # NB 252
    """Grabs list of countries with info available from Teleport API. Log file and JSON file are output."""
    api_base_url = "https://api.teleport.org/api/countries/"
    headers = {"Accept" : "application/vnd.teleport.v1+json"}
    try: 
        request_country_list = requests.get(api_base_url, headers=headers)
        print(f"Status code: {request_country_list.status_code}") 
        if request_country_list.status_code == 200:
            response_country_list = request_country_list.json()

            if 'count' in response_country_list: 
                num_results = response_country_list['count']
                first_country = response_country_list['_links']['country:items'][0]["name"]
                last_country = response_country_list['_links']['country:items'][-1]["name"]
                request_time = datetime.datetime.now()

                grab_country = json.dumps(response_country_list, indent=4)
                with open("api_output/list_countries.json", "w") as list_countries:
                    list_countries.write(grab_country)

                with open("api_output/log_countries_list.txt", "a") as log_countries_list:
                    log_countries_list.write(
                    f"{request_time} - Processed {num_results} results from {first_country} to {last_country}.\n")
            else:
                print(f"Error: 'count' not in response")       
        else:
            print(f"Error fetching data, {request_country_list.status_code}")

    except requests.RequestException as e:
        print(f"Request failed: {e}")

# simple_get_list_countries()

2) Check the count of countries retrieved from API. Use this to ensure counter number in next function gets all results, and to pass into the calc_file_size function below, if required.

In [None]:
filepath = "api_output/list_countries.json"
json_dict_structure = ['_links','country:items']
item_key = "name"

# print(count_json_items(filepath, json_dict_structure, item_key)) # 252

3) Function makes an API call to Teleport to grab overview country-level data, including the currency_code which is used in the currency conversion function

In [None]:
api_calls_to_make = 252 # this should be the number printed above. Set low (e.g. 10) if you only want to do a test run

def get_overviews_countries(api_calls_to_make): 
    """Grabs currency code, geoname, country code, name & population via Teleport API for each country. Log file entry made for every API request. Appends each JSON object under country-code key to a dictionary, then writes this to a single JSON file at the end."""
    calc_file_size(api_calls_to_make) # num countries 252
    with open("api_output/list_countries.json", "r") as countries_list:
        country_list_data = json.load(countries_list)
    
    all_countries_overview_data = {} # empty dict is appended to by each API request

    counter = 0 # counts API requests

    for country in country_list_data['_links']['country:items']:
        if counter < api_calls_to_make: # set api_calls_to_make low for a test run
            api_country_url = f"{country['href']}"
            country_code = api_country_url[-3:-1]  # Extract the 2-letter country code (href ends with "XX/")
            print(country_code)
            headers = {"Accept" : "application/vnd.teleport.v1+json"}

            try: 
                request_country_overview = requests.get(api_country_url, headers=headers)
                print(f"Status code: {request_country_overview.status_code}") 
                if request_country_overview.status_code == 200:
                    response_country_overview = request_country_overview.json()

                    country_overview = {
                        "currency_code" : response_country_overview.get("currency_code", "N/A"),
                        "geoname_id" : response_country_overview.get("geoname_id", "N/A"),
                        "iso_alpha2" : response_country_overview.get("iso_alpha2", "N/A"),
                        "iso_alpha3" : response_country_overview.get("iso_alpha3", "N/A"),
                        "name": response_country_overview.get("name", "N/A"),
                        "population" : response_country_overview.get('population', "N/A"),
                    }

                    all_countries_overview_data[country_code] = country_overview
                    country_name = response_country_overview['name']
                    request_time = datetime.datetime.now()

                    with open("api_output/log_country_overviews.txt", "a") as log_country_overviews:
                        log_country_overviews.write(
                        f"API Request: {counter} - Made at {request_time} - For {country_code} - {country_name}.\n")
                    time.sleep(0.5)


                else:
                    all_countries_overview_data[country_code] = "N/A processing error API"  # Placeholder for omitted overview data due to API process error

                    print(f"Error fetching data for {country_code}, status code: {request_country_overview.status_code}")
                
            except requests.RequestException as e:
                all_countries_overview_data[country_code] = "N/A processing error" # Placeholder for request exceptions
                print(f"Request failed for {country_code}: {e}")

            counter +=1 # increment API request counter 

    # after loop completion, write combined country overview data to file
    with open("api_output/overviews_all_countries.json", "w") as file:
        json.dump(all_countries_overview_data, file, indent = 4)
    print("Teleport countries-overview info API call finished!")

# get_overviews_countries(api_calls_to_make)
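
The slices `[-3:-1]` and `[-12:-10]` used in this section depend on the exact length of the URL suffix. A minimal alternative sketch, assuming every Teleport country URL contains an `iso_alpha2:XX` segment as in the URL template used later in this notebook; the helper name `country_code_from_href` is hypothetical.

```python
import re

# Matches the two capital letters that follow "iso_alpha2:" anywhere in the URL.
ISO_ALPHA2_PATTERN = re.compile(r"iso_alpha2:([A-Z]{2})")


def country_code_from_href(href):
    """Return the 2-letter country code from a Teleport country URL, or None."""
    match = ISO_ALPHA2_PATTERN.search(href)
    return match.group(1) if match else None


print(country_code_from_href("https://api.teleport.org/api/countries/iso_alpha2:AD/"))  # AD
print(country_code_from_href("https://api.teleport.org/api/countries/iso_alpha2:AD/salaries/"))  # AD
```

The same helper works for both the overview URL and the salaries URL, so the two slice offsets do not need to be kept in sync.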

4) Counts the number of keys in the JSON dictionary which was just created, holding the overview data for each country from Teleport. It should match the number of countries you passed into the previous function.

In [None]:
filepath = "api_output/overviews_all_countries.json"
# print(count_keys(filepath)) # returns 252. Matches countries_list

5) Function makes API request to retrieve country-level salary data from Teleport API. It utilises JSON data retrieved from the previous two API requests. It outputs this as a DataFrame object and a sibling .csv file. Please be patient, it takes c. 30 seconds to run.

In [None]:
def api_to_dataframe():
    # Load country list data for API call from existing JSON file
    with open("api_output/list_countries.json", "r") as countries_list:
        country_list_data = json.load(countries_list)

    # Load currency_code from existing JSON file
    with open("api_output/overviews_all_countries.json", "r") as overviews_countries:
        currency_data = json.load(overviews_countries)

    # Map country_code to currency_code (country_code key is common to both json files)
    # Skip string placeholders written when an overview request failed
    currency_mapping = {country['iso_alpha2']: country['currency_code'] for country in currency_data.values() if isinstance(country, dict)}

    cc = []
    # Iterate over countries to extract country codes
    for country in country_list_data['_links']['country:items']:
        api_country_url = f"{country['href']}salaries/"
        country_code = api_country_url[-12:-10] # Extract the 2-letter country code, excluding the trailing "/salaries/"
        cc.append(country_code)

    # print(cc) # displays the list of country codes to be retrieved. Not required if previous function has been run.
    url = "https://api.teleport.org/api/countries/iso_alpha2:{}/salaries"
    dfs = []

    # Iterate over the list of country codes
    for country_code in cc:
        # Construct the API endpoint URL for the current country_code
        api_endpoint = url.format(country_code)
        # Make a GET request to the API
        response = requests.get(api_endpoint)

        if response.status_code == 200:
            api_data = response.json()
            salaries_data = api_data.get('salaries', [])
            currency_code = currency_mapping.get(country_code, "Unknown")
            
            df = json_normalize(salaries_data, sep='_')
            if not df.empty:
                # Add two extra columns with country_code and currency_code
                df['iso_alpha2'] = country_code
                df['currency_code'] = currency_code
                dfs.append(df)

        else:
            print(f"Error: {response.status_code}")

    # Concatenate the data retrieved from the API calls into result DataFrame
    result_df = pd.concat(dfs, ignore_index=True)

    # Solution for invalid currency codes
    # Teleport API has outdated currency codes that do not match with those in the currency converter API
    result_df.loc[884:935, 'currency_code'] = 'BYN'
    result_df.loc[6032:6083, 'currency_code'] = 'MRU'
    result_df.loc[10036:10087, 'currency_code'] = 'VES'

    # For checking it works
    # print(result_df.iloc[885])
    # print(result_df.iloc[6035])
    # print(result_df.iloc[10040])

    # Print & convert the final DataFrame to CSV
    # print(result_df)
    result_df.to_csv('api_output/output_inc_codes.csv', index=False)
    print("Teleport country-salary info API call finished!")

# api_to_dataframe() 
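
The currency code corrections in `api_to_dataframe` rely on fixed row positions, which shift if Teleport adds or removes a country or job role. The sketch below keys the same three replacements on `iso_alpha2` instead. It assumes the affected rows belong to Belarus (BY), Mauritania (MR) and Venezuela (VE), which is inferred from the replacement codes BYN, MRU and VES rather than stated in the notebook; the function name and the sample rows are made up.

```python
import pandas as pd

# Replacement currency codes from the notebook, keyed by country code (assumed mapping).
CURRENCY_OVERRIDES = {"BY": "BYN", "MR": "MRU", "VE": "VES"}


def apply_currency_overrides(df, overrides=CURRENCY_OVERRIDES):
    """Return a copy of df with currency_code replaced for the listed countries."""
    fixed = df.copy()
    # Countries without an override map to NaN and keep their original code.
    replacement = fixed["iso_alpha2"].map(overrides)
    fixed["currency_code"] = replacement.fillna(fixed["currency_code"])
    return fixed


# Made-up sample rows; the "old" codes here are placeholders for illustration.
sample = pd.DataFrame({
    "iso_alpha2": ["BY", "GB", "VE"],
    "currency_code": ["BYR", "GBP", "VEF"],
})
print(apply_currency_overrides(sample)["currency_code"].tolist())  # ['BYN', 'GBP', 'VES']
```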

6) Two functions are defined, then run together by calling a third function **get_conversion_rates_insert_df(currency)**. The two have to be bundled to ensure consistency of the final DataFrame. **get_gbp_conversion_rates** makes an API call to exchangerate-api.com and saves the returned rates to a file called currency_rates_{timestamp}.json. This file is used by **convert_salary_to_GBP** to create additional columns where local salaries are converted to GBP. The wrapper **get_conversion_rates_insert_df** ensures the two functions are called together.
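
The conversion itself is a single division. The sketch below assumes the rates are quoted as units of local currency per 1 GBP, which is what dividing by the rate implies when "GBP" is the requested base. The rates shown are made-up round numbers, not real exchange rates, and this standalone version returns `None` rather than "N/A" when a rate is missing.

```python
def convert_to_gbp(amount, currency_code, rates):
    """Convert a local-currency amount to GBP, or return None if no rate is known."""
    rate = rates.get(currency_code)
    if not rate:
        return None
    # rates are "local units per 1 GBP", so GBP = local amount / rate
    return amount / rate


# Made-up rates for illustration only
sample_rates = {"GBP": 1.0, "USD": 1.25}

print(convert_to_gbp(50_000, "USD", sample_rates))  # 40000.0
print(convert_to_gbp(50_000, "XXX", sample_rates))  # None
```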

In [None]:
def get_gbp_conversion_rates(currency):
    """param: 3 letter currency code, str, required
    Function makes API request to open access exchangerate_api and retrieves JSON object of current exchange rates for passed in currency i.e. "GBP". Outputs a file called currency_rate_{timestamp}.json and a log file."""
    currency_rates_filename = None  # stays None if the request fails, so the return below cannot raise
    try:
        currency_request = requests.get(f"https://open.er-api.com/v6/latest/{currency}")
        print(f"Status code: {currency_request.status_code}") 
        if currency_request.status_code == 200:
            currency_response = currency_request.json()
            
            if 'rates' in currency_response:               
                request_time = datetime.datetime.now()
                currency_timestamp = datetime.datetime.now().strftime("%y-%m-%d_%H-%M")
                currency_rates_filename = f"currency_rates_{currency_timestamp}.json"
                rates_updated = currency_response['time_last_update_unix']
                
                with open(f"api_output/{currency_rates_filename}", "w") as currency_conversions:
                    json.dump(currency_response, currency_conversions, indent=4)

                with open("api_output/log_currency_rates.txt", "a") as log_currency_rates:
                            log_currency_rates.write(
                            f"{request_time} - Processed {currency} rates last updated at UNIX time {rates_updated}.\n")
            else:
                print(f"Error: 'rates' not in response")       
        else:
            print(f"Error fetching data, {currency_request.status_code}")

    except requests.RequestException as e:
        print(f"Request failed: {e}")

    return currency_rates_filename

def convert_salary_to_GBP(currency_rates_filename):
    """Function works directly on output_inc_codes.csv generated above, in conjunction with currency_rate{timestamp}.json to convert the local currency salaries given by Teleport API to GBP. Outputs a new CSV file with extra GBP salary columns and a column showing the conversion rates used. Note, this file is uniquely named with a timestamp to assist with version control."""
    # Load the JSON file with exchange rates
    with open(f"api_output/{currency_rates_filename}", "r") as file:
        data = json.load(file)
    exchange_rates = data['rates']

    # Load CSV with the salary data to be converted
    df = pd.read_csv("api_output/output_inc_codes.csv") # use DataFrame from CSV

    # Function to convert currency to GBP
    def convert_to_gbp(amount, currency_code):
        rate = exchange_rates.get(currency_code, None)
        if rate:
            return amount / rate
        else:
            return "N/A" # Return "N/A" for missing exchange rate

    # The following 4 lambda functions create & populate columns using the API call data

    # Add a timestamped column stating the local to gbp conversion rates used
    df['local_to_gbp_rates'] = df['currency_code'].apply(lambda x: data['rates'].get(x, "N/A"))

    # Apply the conversion to each row (country) for all three salary %iles
    df['gbp_converted_25th'] = df.apply(lambda row: convert_to_gbp(row['salary_percentiles_percentile_25'], row['currency_code']), axis=1)

    df['gbp_converted_50th'] = df.apply(lambda row: convert_to_gbp(row['salary_percentiles_percentile_50'], row['currency_code']), axis=1)

    df['gbp_converted_75th'] = df.apply(lambda row: convert_to_gbp(row['salary_percentiles_percentile_75'], row['currency_code']), axis=1)

    # The iso_alpha2 code for Namibia is "NA"; pandas reads this as a null value when the CSV is loaded.
    # Therefore the Namibia rows are explicitly relabelled "NA".
    df.loc[6552:6603, 'iso_alpha2'] = 'NA'
    #print(df.iloc[6555])

    output_file_timestamp = datetime.datetime.now().strftime("%y-%m-%d_%H-%M")
    # Save the updated DataFrame into a new CSV
    df.to_csv(f'api_output/output_gbp_salaries_{output_file_timestamp}.csv', index=False)

def get_conversion_rates_insert_df(currency):
    # Call the function to get conversion rates and save them
    currency_rates_filename = get_gbp_conversion_rates(currency)
    if currency_rates_filename is None:
        print("No conversion rates file was produced, so salaries were not converted.")
        return
    # Use the generated file to convert salaries
    convert_salary_to_GBP(currency_rates_filename)
    print("Function complete. Please see output_gbp_salaries_{timestamp}.csv for your resulting data.")

    # NB: once this function has been run and we have output_gbp_salaries_{timestamp}.csv, technically the intermediate file output_inc_codes.csv can be deleted. During development, may be helpful for troubleshooting. If we find no use for the csv at the point of final-code review, we can add a line of code here to delete output_inc_codes.csv from the directory.

##################### Run the following to call the preceding collection of functions ###################

currency = "GBP"
# get_conversion_rates_insert_df(currency)
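
The Namibia fix in `convert_salary_to_GBP` patches rows by position after loading. An alternative sketch is to stop pandas from treating the string "NA" as missing when the CSV is read, using the `keep_default_na` and `na_values` parameters of `pd.read_csv`. The CSV text below is made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV: Namibia's country code is the literal string "NA"
csv_text = "iso_alpha2,currency_code,salary\nNA,NAD,1000\nGB,GBP,\n"

# Default behaviour: "NA" is parsed as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["iso_alpha2"].isna().tolist())  # [True, False]

# Only treat empty fields as missing, so "NA" survives as text
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
print(kept_df["iso_alpha2"].tolist())  # ['NA', 'GB']
```

Empty salary fields are still read as missing values, so numeric columns keep their numeric type.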

<div style="background-color: #ADD8E6; padding: 5px; border: 1px solid black; border-radius: 1px;">
    <h3 style="margin-bottom: 5px; font-weight: bold;">Section 3: Content of main.py</h3>
    <p style="margin-top: 0;">Entry point: runs the functions from the utils_api_to_df.py section in order</p>
</div>

**Note**  
Only run the following cells if you want to create a full fresh dataset. Otherwise, run selected functions individually from Section 2 (utils_api_to_df.py.)

1) Function makes an API request to retrieve list of countries from Teleport API (country name and url)

In [None]:
simple_get_list_countries()

2) Check the count of countries retrieved from API. Use this to ensure counter number in next function gets all results, and to pass into the calc_file_size function below, if required.

In [None]:
filepath = "api_output/list_countries.json"
json_dict_structure = ['_links','country:items']
item_key = "name"

print(count_json_items(filepath, json_dict_structure, item_key))

3) Function makes an API call to Teleport to grab overview country-level data, including the currency_code which is used in the currency conversion function. Please be patient; we have deliberately delayed each API call to be polite to Teleport's servers, so it takes a couple of minutes to run. A Finished print statement will display when the call is complete.

In [None]:
api_calls_to_make = 252 # this should be the number printed above. Set low (e.g. 10) if you only want to do a test run
get_overviews_countries(api_calls_to_make)

4) Counts the number of keys in the JSON dictionary which was just created, holding the overview data for each country from Teleport. It should match the number of countries you passed into the previous function.

In [None]:
filepath = "api_output/overviews_all_countries.json"
print(count_keys(filepath)) # should return 252, matching countries_list

5) Function makes API request to retrieve country-level salary data from Teleport API. It utilises JSON data retrieved from the previous two API requests. It outputs this as a DataFrame object and a sibling .csv file. Please be patient, it takes c. 30 seconds to run. A Finished print statement will display once complete.

In [None]:
api_to_dataframe()

6) The following calls two functions one after the other to produce our final DataFrame which has country info, currency info, conversion rates, local currency salaries and GBP converted salaries. Finished print statement will display once complete.

In [None]:
currency = "GBP"
get_conversion_rates_insert_df(currency)

**Final Note**  
After running the script in the above cell, you should end up with the following output files, gathered within a new folder directory "/api_output":  
From 1), list_countries.json & log_countries_list.txt  
From 3), overviews_all_countries.json & log_country_overviews.txt  
From 5), output_inc_codes.csv. This is an intermediate file and can be deleted. It was not programmatically deleted in case it is needed for testing.  
From 6), currency_rates_{timestamp}.json, output_gbp_salaries_{timestamp}.csv and log_currency_rates.txt  

**output_gbp_salaries_{timestamp}.csv** is the file to use in Jupyter Notebook - Part 2 for data analysis. Have fun!
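
Because each run writes a new timestamped file, it can help to pick up the most recent one automatically. A minimal sketch, assuming the `%y-%m-%d_%H-%M` timestamp format used above, which sorts chronologically when sorted as text; the helper name `latest_salary_csv` is hypothetical.

```python
import glob
import os


def latest_salary_csv(directory="api_output"):
    """Return the path of the newest output_gbp_salaries_*.csv, or None if none exist."""
    pattern = os.path.join(directory, "output_gbp_salaries_*.csv")
    # Year-month-day_hour-minute timestamps sort chronologically as plain strings.
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


print(latest_salary_csv())
```

The returned path can then be passed to `pd.read_csv` in Part 2.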