# NOTES

## TO DO:
* For Latin America, put in pipeline capacities in original units for oil pipelines; use converted units for gas pipelines if all of them have converted values
* for lng_term, for Latin America map: after separating train/phase from terminal name, create columns for 'unit' & 'unit_en'
* Instead of manually inserting fixes (using function fix_one_offs), put the missing info into the working files. Then read those, and insert values.
  * That way, the values are stored where they need to be for the long term.
  * Use pygsheets to read the working files (for all trackers except coal plants, because those are not in the GEM shared drive).
  * The problem with this is that some of the missing data is for projects that have been put in the "removed" tabs; so they're in a past official release, but no longer in the main data set in the working file. Those would still need hard-coded one-off fixes in this code base.
* Assign approximate coordinates using province data when available
* Put GOGET production data into map file
* Work more with xlsxwriter to format Excel files, using header_format & setting column widths
* Assign approximate coordinates based on country for LNG terminals
* Region: Add test that it's filled in for every row
* Capacity: Add test that at least some rows have data
* When reading pipelines using pygsheets, force capacity to be float; this should resolve issue with download file
* get rid of 'nan' strings in map file; for Africa gas map, had them for some rows in capacity, lat, lon, route
* in GOGET, exclude these statuses: 'UGS', 'decommissioned', 'abandoned'
* for Latin America:
    * Coal plants download sheet: Instead of column "Nombre local", in the Spanish sheet, put this as "Planta"
    * Coal mines, etc.: similar change?
* keep oil & gas pipelines separate, except for map file (where all types of projects are together)
* for Excel file writing, left align column headings & set column width = 12
* for GOGET, for download file, merge in Spanish wiki pages; then for Spanish download file, use Spanish wiki pages
* in Spanish download files, change order of columns to prioritize Spanish entries
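
Several of the items above (the Region test, the Capacity test, and removing 'nan' strings) could be sketched as small pandas checks. This is only a minimal illustration: the column names `region` and `capacity` and all three helper names are assumptions, not names confirmed by the trackers.

```python
import numpy as np
import pandas as pd


def check_region_filled(df, region_col="region"):
    """Return the rows where the region is missing or blank."""
    region = df[region_col]
    is_blank = region.isna() | (region.astype(str).str.strip() == "")
    return df[is_blank]


def check_capacity_has_data(df, capacity_col="capacity"):
    """Return True if at least one row has a numeric capacity."""
    capacity = pd.to_numeric(df[capacity_col], errors="coerce")
    return bool(capacity.notna().any())


def replace_nan_strings(df, cols):
    """Replace literal 'nan' strings with real missing values."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].replace("nan", np.nan)
    return out


# Hypothetical example data (column names are assumptions)
example = pd.DataFrame({
    "region": ["Africa", "", None],
    "capacity": ["12.5", "nan", ""],
})
missing_region = check_region_filled(example)
has_capacity = check_capacity_has_data(example)
cleaned = replace_nan_strings(example, ["capacity"])
```

If `missing_region` is non-empty or `has_capacity` is False, the map file would need attention before export.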

#### Options now include:
* Latin America Portal: Maps for coal/steel and oil/gas, with project names and wiki pages in Spanish/Portuguese. Draws from working files for local language info as needed; otherwise data is from official releases.
* Global Coal Tracker: Incorporates all coal infrastructure globally (coal plants, coal mines, coal terminals, and most steel plants). Excludes steel plants that don't draw on coal.
* Europe gas tracker: Incorporates all gas infrastructure in EU countries and some in neighboring countries (see specs for map).
* Asia gas tracker: Covers all gas infrastructure (extraction areas, pipelines, LNG terminals, gas plants) in specified countries (see specs for map).

#### Update notes:
* 2022-06-14: Added steps to compile Global Oil Infrastructure Tracker
  * Compile data from working file
  * Assemble parent strings from the various sheets, then connect with each pipeline

In [1128]:
# Notes on unit/phase names:
# coal plants: column 'Unit' usually has a long name, with the power station name plus unit name
# coal mines: column 'Project Phase' has short name, e.g., "Stage 1"
# steel plants: expansions are named in the "Plant name" column
# coal terminals: there are expansions included in the terminal name; later can split them into separate column

In [1129]:
# there are 40+ units in the coal plant tracker that use location IDs starting with L4
# are these plants that were added first to the gas plant tracker, and assigned L4 location IDs,
# but turned out to also have coal units, and so were then added to the coal plant tracker?

In [1130]:
# in gas plants official, there is an error for IDs for one plant/unit
# which has location ID G408643 & unit ID L407152
# has been fixed in main sheet (working version)
# (should we issue a new version of the official data set?)

# Main parameters

In [1131]:
map_choice = 'Africa Gas Tracker'
# see accepted options in test below

export_files = True

error_verbose = False

exclude_no_wiki = False # currently only used for steel plants & oil pipelines

# run tests to find multiple values?
find_multi_values = False

# gem_path = '~/Desktop/GEM_INFO/GEM_WORK/maps/output/'  # overridden by the absolute path below
gem_path = '/Users/gem-tah/Desktop/GEM_INFO/GEM_WORK/maps/output/'


In [1132]:
# TEST:
map_choice_accepted_list = [
    # single-tracker maps (global):
    'Oil Infrastructure', 'Oil & Gas Plant', 
    'Solar Power', 'Wind Power', 'Geothermal Power', 'Bioenergy Power', 'Nuclear Power',
    'GOGET',
    # multi-tracker maps - gas-oil:
    'Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker', 'Latin America Portal - oil-gas',
    'Gas Infrastructure', 
    # multi-tracker maps - coal-steel:
    'Latin America Portal - coal-steel', 
    'Coal Terminals',
    # multi-tracker maps - renewables:
    'Latin America Portal - renewables',
]

if map_choice not in map_choice_accepted_list:
    print(f"Error! Map choice {map_choice!r} was not in accepted list")

In [1133]:
# The future map set-up groups statuses for the legend by using two status columns:
# 'status' holds the original values & 'status_legend' holds the modified (grouped) values
if map_choice in [
    'Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker',
    'Latin America Portal - coal-steel', 'Latin America Portal - oil-gas', 
]:
    two_column_status = True
elif map_choice in [
    'Solar Power', 'Wind Power', 
    'Oil Infrastructure', 'Gas Infrastructure',
    'Oil & Gas Plant', 
    'Coal Terminals',
    'Geothermal Power',
    'Bioenergy Power',
    'Nuclear Power',
    'Latin America Portal - renewables',
    'GOGET',
]:
    two_column_status = False
else:
    print("Error!" + f" two_column_status not set for map_choice: {map_choice}")
    
# TO DO: may want to modify the coal terminals map handling to use two status columns, to avoid changing the statuses to fit the legend

# Imports

In [1134]:
import pandas as pd
import geopandas as gpd
import numpy as np

# import pygsheets
import gspread
# import xlwings

import time
from itertools import permutations
import copy

import os
from datetime import date
import openpyxl
import xlsxwriter


# Data Pull Set up

In [1135]:
# Get today's date
today_date = date.today()
# Format the date in ISO format
iso_today_date = today_date.isoformat()

client_secret = "Desktop/GEM_INFO/client_secret.json"
client_secret_full_path = os.path.expanduser("~/") + client_secret



In [1136]:
def gspread_access_file_read_only(key, title):
    """
    key = unique key of the Google Sheets file (taken from its URL)
    title = name of the tab (worksheet) you want to read
    Returns the tab's contents as a pandas DataFrame.
    """
    gspread_creds = gspread.oauth(
        scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
        credentials_filename=client_secret_full_path,
        # authorized_user_filename=json_token_name,
    )
    gsheets = gspread_creds.open_by_key(key)
    # Access a specific tab
    spreadsheet = gsheets.worksheet(title)
    # expected_header option provided following: https://github.com/burnash/gspread/issues/1007
    # Getting All Values From a Worksheet as a List of Dictionaries
    # if key in [list of pipelines sheets]
    # df = pd.DataFrame(spreadsheet.get_all_records[2:](expected_headers=[]))

    df = pd.DataFrame(spreadsheet.get_all_records(expected_headers=[]))
    

    return df

In [1137]:
data_keys_titles = {    
    # COAL TRACKERS
    # Coal plants - official:
    # 'coal_plants_official_file': 'Global-Coal-Plant-Tracker-January-2023.xlsx',
    'coal_plants_official_key': '1rxONoHIxW1Rv8jPKzHsafPEMj3z_XtfJ',
    'coal_plants_official_title': 'Units',
    'coal_plants_official': ['1rxONoHIxW1Rv8jPKzHsafPEMj3z_XtfJ', 'Units'],

    
    # # Coal plants - working - Latin America
    # # (only needed for foreign language names & wiki pages; these are only needed for Latin America)
    # 'coal_plants_working_latam_file': 'Latin America coal plants (dl 2022-08-02_0942).xlsx',
    # 'coal_plants_working_latam_path': gem_path + 'Global Coal Plant Tracker/GCPT individual files/',
    
    # Global Coal Mine Tracker (Main): 
    # https://docs.google.com/spreadsheets/d/1VV4miL8uc6HmsBxV-yv5bgUsPOwGf6t26YeI6Q0Vv3Q/edit
    'coal_mines_official_key': '1QsJIhOqxeMgS_osB4Vo0eJjg84CQcND0',
    'coal_mines_official_title': 'Global Coal Mine Tracker',
    'coal_mines_official': ['1QsJIhOqxeMgS_osB4Vo0eJjg84CQcND0','Global Coal Mine Tracker'],

    # 'coal_mines_working_file': 'Global Coal Mine Tracker (Main) (dl 2022-03-03_1459).xlsx',
    # 'coal_mines_working_path': gem_path + 'coal mines (GCMT)/Global Coal Mine Tracker - versions downloaded/',
    
    # Coal Terminals Tracker: 
    # https://docs.google.com/spreadsheets/d/181HI0tI4aiAme5GZUABQc13HNW9RMc3V6vzEzZiHtSQ/edit#gid=0
    'coal_terminals_official_key': '1c69jHBXVpbGBL71JzTVZmd0h7Nhjy9rU',
    'coal_terminals_official_title': 'Coal Terminals',
    'coal_terminals_official': ['1c69jHBXVpbGBL71JzTVZmd0h7Nhjy9rU','Coal Terminals'],

    # 'coal_terminals_working_file': 'Global Coal Terminals Tracker - working (dl 2023-01-25_1130).xlsx',
    # 'coal_terminals_working_path': gem_path + 'Global Coal Terminals Tracker/versions downloaded/',
    
    # Global Steel Plant Tracker: 
    # official file
    'steel_plants_official_key': '1RajN7ErWDpLf58FmP0KSZ8M_pOA4aETd',
    'steel_plants_official_title': 'Steel Plants',
    'steel_plants_official': ['1RajN7ErWDpLf58FmP0KSZ8M_pOA4aETd','Steel Plants'],
    # working file
    'steel_plants_working_key': '1Yn1mNypUQvLgMwh-uSEtmuXkGhJ1gWy_6oHhJMCabNs',
    
    # OIL & GAS TRACKERS
    # Global Gas Plant Tracker:
    'gas_plants_official_key': '1dosICr3DU05hIRawCLB0EK4rv3cn44fwBAKjTTqmLDo',
    'gas_plants_official_title': 'Gas & Oil Units',
    'gas_plants_official': ['1dosICr3DU05hIRawCLB0EK4rv3cn44fwBAKjTTqmLDo','Gas & Oil Units'],

    # 'gas_plants_interim_file': 'Global Gas Plant Tracker (GGPT) 2022-09-01 interim version (working format).xlsx',
    # 'gas_plants_working_path': gem_path + 'Global Gas Plant Tracker/versions downloaded/',
    # 'gas_plants_working_file': 'Global Oil and Gas Plant Tracker (GOGPT) - main (dl 2023-03-15_1426).xlsx',
    # 'gas_plants_working_key': '1n7FRTBR404DUeO1lDrw7808jSs-azrXpx23VQ-GFqdY',
    
    # Pipelines (Gas/Oil/NGL):
    # GGIT pipelines official
    # 'ggit_pipes_official_file': 'GEM-GGIT-Gas-Pipelines-December-2022.xlsx',
    'ggit_pipes_official_key': '1rcFIqHjVpZ7UFNdP1TE7BeDKmraOjXof8gLtZ49G77U',
    'ggit_pipes_official_title': 'Gas Pipelines 2023-12-06',
    'ggit_pipes_official': ['1rcFIqHjVpZ7UFNdP1TE7BeDKmraOjXof8gLtZ49G77U', 'Gas Pipelines 2023-12-06'],

    
    # 'ggit_pipes_interim_file': 'Pipelines (Gas_Oil_NGL) - main - 2022_03_22 - Europe Gas Report.xlsx',
    # 'ggit_pipes_interim_path': gem_path + 'Global Gas Infrastructure Tracker/GGIT Pipelines - official releases/',
    # 'ggit_pipes_official_path_europe': gem_path + 'Global Gas Infrastructure Tracker/GGIT Pipelines - official releases/',
    # 'ggit_pipes_official_file_europe': 'GEM-Europe-Gas-Tracker-Gas-and-Hydrogen-Pipelines-2023-03-15.xlsx',
    # 'ggit_pipes_official_file_europe': 'GEM-Europe-Gas-Tracker-Gas-and-Hydrogen-Pipelines-March-2023-2023-07-10.xlsx',
    
    'ggit_pipes_official_europe_key': '1F0NlPH9ntS6AuKx-ZwojjEgadw7quYuTiBRfBK_V17I',
    'ggit_pipes_official_europe_title': 'Gas pipelines',
    'ggit_pipes_official_europe': ['1F0NlPH9ntS6AuKx-ZwojjEgadw7quYuTiBRfBK_V17I', 'Gas pipelines'],


    # GOIT pipelines official
    'goit_pipes_official_key': '1lMoU0Y3Z-NUBiKsnPxKsthmVJQfU-jw4',
    # 'goit_pipes_official_file': 'GOIT-Oil-NGL-Pipelines-June-2022-v2.xlsx',
    'goit_pipes_official_title': 'Pipelines',
    'goit_pipes_official': ['1lMoU0Y3Z-NUBiKsnPxKsthmVJQfU-jw4','Pipelines'],
    
    # all pipelines working (for Latin America wiki pages & names)
    # 'all_pipes_working_file': 'Pipelines (Gas_Oil_NGL) - main dl 2022-09-07.xlsx',
    # 'all_pipes_working_path': gem_path + 'Global Gas Infrastructure Tracker/GGIT Gas Pipelines - versions saved/',

    # LNG terminals:
    # 'ggit_lng_working_key': '1tcS6Wd-Wp-LTDpLzFgJY_RSNDnbyubW3J_9HKIAys4A',
    'ggit_lng_official_key': '1GVjpu4U1y6JNYgRXMruLV-06zFs6qLlIlL9bo_JrUUA',
    # 'ggit_lng_official_file': 'GEM-GGIT-LNG-Terminals-July2022.xlsx',
    'ggit_lng_official_title': 'LNG Terminals 2023-12-18',
    'ggit_lng_official': ['1GVjpu4U1y6JNYgRXMruLV-06zFs6qLlIlL9bo_JrUUA', 'LNG Terminals 2023-12-18'],
    # 'ggit_lng_europe_update_file': 'LNG Terminals - main - 2022_03_22 - Europe Gas Report.xlsx',
    # 'ggit_lng_europe_update_path': gem_path + 'GFIT & GGIT & GOIT (pipelines & LNG)/GGIT LNG Terminals - official releases/',
    # 'ggit_lng_official_path_europe': gem_path + 'Global Gas Infrastructure Tracker/GGIT LNG Terminals - official releases/',
    # 'ggit_lng_official_file_europe': 'GEM-Europe-Gas-Tracker-LNG-Terminals-2023-03-15.xlsx',
    # 'ggit_lng_official_file_europe': 'GEM-Europe-Gas-Tracker-LNG-Terminals-March-2023-updated-2023-07-10.xlsx', 

    'ggit_lng_official_europe_key': '1Rw9Xj0VIOLq94OT0zth5ub_IqCeAKYzlufdjxuezzgs',
    'ggit_lng_official_europe_title': 'Terminals',
    'ggit_lng_official_europe': ['1Rw9Xj0VIOLq94OT0zth5ub_IqCeAKYzlufdjxuezzgs', 'Terminals'],
    
    # GOGET (oil & gas extraction):
    # 'goget_official_file': 'Global-Oil-and-Gas-Extraction-Tracker-Feb-2023-v2.xlsx',
    'goget_official_key': '1vhZXbdLzYWQujait5DxBlRBo28XMUQnpnYsdHxIhGME',
    'goget_official_title': 'Main data',
    'goget_official': ['1vhZXbdLzYWQujait5DxBlRBo28XMUQnpnYsdHxIhGME', ['Main data','Production & reserves']],

    # 'goget_interim_file': 'GOGET oil & gas extraction sites (main data set) (dl 2022-09-10_0935) fixed.xlsx',
    # 'goget_interim_path': gem_path + 'GOGET/versions downloaded/',
    # goget interim fixes: one row was missing the country (Canada)
    # 'goget_working_key': '1nVkKNXEFuYyun4cyr47Fsa7ZdIfBNDdb3wQurBmzA5M', # TO DO: change code below to use working version, with summer 2022 updates for Europe
    
    # Solar tracker
    # 'solar_official_file': 'Global Solar Power Tracker (GSPT) - May 2022.xlsx',
    # 'solar_official_file': 'Global-Solar-Power-Tracker-January-2023.xlsx', 
    'solar_official_key': '1cT7tzAOigJ3f3ame5VvR-nCDW8NK2rOf',
    'solar_official_title': ['Large Utility-Scale', 'Medium Utility-Scale'],
    'solar_official': ['1cT7tzAOigJ3f3ame5VvR-nCDW8NK2rOf', ['Large Utility-Scale', 'Medium Utility-Scale']],

    # 'solar_working_key': '1ACAzYGblerFPt0gx_QevOnwM99n-mBv1QiGQIKx_B8w',
    
    # Wind tracker
    # 'wind_official_file': 'Global Wind Power Tracker (GWPT) - May 2022.xlsx',
    # 'wind_official_file': 'Global Wind Power Tracker (GWPT) - January 2023.xlsx',
    'wind_official_key': '1NnkqWCa9K4NoNoXkw8L4IOZr3LyX7joG',
    'wind_official_title': 'Data',
    'wind_official': ['1NnkqWCa9K4NoNoXkw8L4IOZr3LyX7joG','Data'], 

    # 'wind_working_key': '1HY6cl7kQ-NHhiKTP6IzZ-HLwx7ZPTJvnov8V00suUGA',
    
    # Geothermal Power tracker
    # 'geothermal_working_key': '1iRHlL1ZBd5D2GE7GhIQSIqEuskWIzbwoK5R89lLHZP0',
    'geothermal_official_key': '1dd-3--hnAJiqxeJTrrmmuElbxeLqrsm4',
    'geothermal_official_title': 'Data',
    'geothermal_official': ['1dd-3--hnAJiqxeJTrrmmuElbxeLqrsm4','Data'],

    
    # Bioenergy Power tracker
    'bioenergy_official_key': '127M-WOqhZrB_ea5rg0Z6R26dANTpFi7j',
    'bioenergy_official_title': 'Data',
    'bioenergy_official': ['127M-WOqhZrB_ea5rg0Z6R26dANTpFi7j','Data'],

#     'bioenergy_working_file': 'Global Bioenergy Power Tracker (GBPT) - main (dl 2022-12-07).xlsx',
#     'bioenergy_working_path': gem_path + 'Global Bioenergy Power Tracker/versions downloaded/',
    
    # Nuclear Power tracker
    'nuclear_official_key': '1jk4-0yVxiUQQfeoCjy_ueT0TgbPelqGd',
    'nuclear_official_title': 'Data',
    'nuclear_official': ['1jk4-0yVxiUQQfeoCjy_ueT0TgbPelqGd','Data'],

    }
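
The combined entries above come in two shapes: `[key, title]` and `[key, [title, title, ...]]`. A minimal sketch of a reader that accepts both shapes follows. The helper `read_tracker_entry`, the `source_tab` column, and the stand-in `fake_reader` are hypothetical; in the notebook the reader would presumably be `gspread_access_file_read_only`.

```python
import pandas as pd


def read_tracker_entry(entry, reader):
    """Read one [key, title] or [key, [title, ...]] entry into a DataFrame.

    `reader` is any callable taking (key, title) and returning a DataFrame.
    """
    key, titles = entry
    if isinstance(titles, str):
        titles = [titles]
    frames = []
    for title in titles:
        df = reader(key, title)
        df["source_tab"] = title  # hypothetical column recording the tab name
        frames.append(df)
    # Stack the tabs; this assumes the tabs share the same columns
    return pd.concat(frames, ignore_index=True)


# Stand-in reader so the sketch runs without Google credentials
def fake_reader(key, title):
    return pd.DataFrame({"name": [f"{title} project"]})


solar = read_tracker_entry(
    ["example-key", ["Large Utility-Scale", "Medium Utility-Scale"]],
    fake_reader,
)
```

Stacking suits tabs with matching columns. Tabs with different columns may need a join instead, which this sketch does not attempt.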

# Helper Functions

In [1138]:
def create_folder_if_no(folder_path):
    if not os.path.exists(folder_path):
        try:
            # Create the folder if it doesn't exist
            os.makedirs(folder_path)
            print(f"Folder '{folder_path}' created successfully.")
        except OSError as e:
            print(f"Error creating folder '{folder_path}': {e}")
    else:
        print(f"Folder '{folder_path}' already exists.")

# Data versions

In [1139]:
# specify which data versions to use:
data_versions_dict = {
    # regional maps:
    'Africa Gas Tracker': {
        'gas plants': 'official',
        'gas pipelines': 'official',
        'ggit lng': 'official',
        'goget': 'official',
    },
    'Asia Gas Tracker': {
        'gas plants': 'official',
        'gas pipelines': 'official',
        'ggit lng': 'official',
        'goget': 'official',
    },
    'Europe Gas Tracker': {
        'gas plants': 'official',
        'gas pipelines': 'official',
        'ggit lng': 'official',
        'goget': 'official',
    },
    'Latin America Portal - coal-steel': {
        'coal plants': 'official',
        'coal mines': 'official',
        'coal terminals': 'official',
        'steel plants': 'official',
    },
    'Latin America Portal - oil-gas': {
        'gas plants': 'official',
        'gas pipelines': 'official',
        'oil pipelines': 'official',
        'ggit lng': 'official',
        'goget': 'official',
    },
    'Latin America Portal - renewables': {
        'solar power': 'official',
        'wind power': 'official',
        # 'geothermal': 'official', # TO DO: add this later
    },
    # global maps:
    'Coal Terminals': {
        'coal plants': 'official',
        'coal mines': 'official',
        'coal terminals': 'official',
        'steel plants': 'official',
    },
    'Gas Infrastructure': {
        'ggit lng': 'official',
        'gas pipelines': 'official',
    },
    'Solar Power': {'solar power': 'official'},
    'Wind Power': {'wind power': 'official'},
    'Oil Infrastructure': {'oil and NGL pipelines': 'official'},
    'Oil & Gas Plant': {'gas plants': 'official'},
    'Geothermal Power': {'geothermal power': 'official'},
    'Bioenergy Power': {'bioenergy power': 'official'},
    'Nuclear Power': {'nuclear power': 'working local'},
    'GOGET': {'goget': 'official'}
}

In [1140]:
map_choice_expected_counts = { 
    # Set 2022-04-28
    'Latin America Portal - coal-steel': {
        'coal_plant': 187,
        'steel_plant': 60,
        'coal_terminal': 27,
        'coal_mine': 24,
    },
    # Set mid 2022?
    'Latin America Portal - oil-gas': {
        'gas_power_plant': 591,
        'gas_pipeline': 150,
        'oil_pipeline': 49,
        'oil_and_gas_extraction_area': 595, # updated 2023-10; it's less than before; could look into this
        'lng_terminal': 79,
    }, 
    'Latin America Portal - renewables': {
        'solar_power': 0, # TO DO: fill in
        'wind_power': 0, # TO DO: fill in 
    },
    # TO DO: need to update values in 2023 to reflect addition of Iran & Afghanistan to this map
    # Set 2022-04-28
    'Asia Gas Tracker': {
        'Gas Power Plant': 1909, # was 1864 from Jan 2022 data
        'Gas Pipeline': 701,
        'LNG Terminal': 349,
        'Gas Extraction Area': 177,
    },
    # Set 2022-08-02
    'Africa Gas Tracker': {
        'gas_power_plant': 642, # was 669 from Jan 2022 data # changed for map columns needed  feb 2024
        'gas_pipeline': 104, # changed for map columns needed  feb 2024
        'lng_terminal': 102, # changed for map columns needed  2024 
        'gas_extraction_area': 193,  # changed for map columns needed feb 2024
    },
    # TO DO: need to update values in 2023 to reflect addition of Turkey to this map
    # Set 2022-04-28
    'Europe Gas Tracker': {
        'gas_power_plant': 1329, 
        # For GGPT: was 1289 from Jan 2022 data; then had the value 1364, I guess from mid-2022 release... but for early 2023, had only 1329 for Europe; why did it drop?
        'gas_pipeline': 666, # note name is singular--different from other regional gas maps
        'gas_extraction_area': 527,
        'lng_terminal': 116,
        # 'LNG Shipping Routes': 1, # don't apply fudge factor
    },
    # Set 2023-04-25; excludes those with no route (in column WKTFormat)
    'Oil Infrastructure': {
        'Oil Pipelines': 939,
        'NGL Pipelines': 39,
    },
    # Set 2022-12-09
    'Gas Infrastructure': {
        'Gas Pipelines': 2602, # value is after removing those 
        'LNG Terminals (Import)': 606,
        'LNG Terminals (Export)': 567,
    },
    # Set 2022-12-09:
    'Coal Terminals': {
        'coal_plant': 13490,
        'coal_mine': 3670,
        'steel_plant': 1201,
        'coal_terminal': 445, 
    },
    # TO DO: add other maps here
}
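
One way the expected counts above could be used is to compare them with the actual number of rows per project type, allowing a small tolerance. This is a sketch under assumptions: the column name `type`, the helper `compare_counts`, and the 5% tolerance are all illustrative choices, not values taken from the notebook.

```python
import pandas as pd


def compare_counts(df, expected, type_col="type", tolerance=0.05):
    """Return (type, expected, actual) tuples that fall outside the tolerance."""
    actual = df[type_col].value_counts()
    problems = []
    for project_type, expected_n in expected.items():
        actual_n = int(actual.get(project_type, 0))
        allowed = tolerance * expected_n
        if abs(actual_n - expected_n) > allowed:
            problems.append((project_type, expected_n, actual_n))
    return problems


# Hypothetical data: pipelines match, LNG terminals are short by 12
example = pd.DataFrame(
    {"type": ["gas_pipeline"] * 104 + ["lng_terminal"] * 90}
)
expected = {"gas_pipeline": 104, "lng_terminal": 102}
problems = compare_counts(example, expected)
```

An empty `problems` list would mean every type is within tolerance of its expected count.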

# Other parameters

In [1141]:
# accepted statuses for maps (for legend & filtering)
# (needed to check data is formatted as needed for maps)

renewable_other_power_accepted_statuses_one_col = [
    'operating', 
    'construction', 
    'announced',
    'pre-construction',
    'cancelled', 
    'shelved', 
    'retired',
    'mothballed', 
]

accepted_statuses_one_col = {
    'Oil & Gas Plant': [
        'operating', 'construction', 
        'announced', 'pre-construction', 
        'shelved', 'cancelled',
        'mothballed', 'retired',
    ],
    'Geothermal Power': [
        'operating', 'construction', 
        'announced', 
        'development', # may change 'development' to 'pre-construction'
        'shelved', 'cancelled',
        'mothballed', 'retired',
    ],
    'Oil Infrastructure': [
        'operating', 'construction', 'proposed', 'mothballed', 'idle',
        'shelved', 'cancelled', 'retired',
    ],
    'Gas Infrastructure': [
        'operating', 'construction', 'proposed', 'mothballed', 'idle',
        'shelved', 'cancelled', 'retired', 'unknown',
    ],
    'Coal Terminals': [
        'operating', 
        'construction', 
        'proposed', 
        'permitted',
        'cancelled', 
        'shelved', 
        'retired',
        'mothballed', 
    ],
    'Latin America Portal - renewables': renewable_other_power_accepted_statuses_one_col,
    'Solar Power': renewable_other_power_accepted_statuses_one_col,
    'Wind Power': renewable_other_power_accepted_statuses_one_col, 
    'Nuclear Power': renewable_other_power_accepted_statuses_one_col, 
    'Geothermal Power': renewable_other_power_accepted_statuses_one_col, # NOTE: duplicate key; this entry overrides the 'Geothermal Power' list above
    'Bioenergy Power': renewable_other_power_accepted_statuses_one_col,
    'GOGET': ['operating', 'in development', 'discovered', 'shut in']
}

# for checking the column 'status_legend'
# (when using two columns for status, entries in the column 'status' are the original values)
oil_gas_two_col_standarized = [
    'operating',
    'construction_plus', # includes all 'construction' & also 'in development' (GOGET)
    'proposed_plus', # includes all 'proposed' & also 'discovered' (GOGET)
    'cancelled', 
    'shelved', 
    'retired', # includes retired (GGIT, GOIT); for now, GOGET doesn't include any decommissioned
    'mothballed_plus', # includes mothballed, also 'idle' (GGIT & GOIT) & 'shut in' (GOGET)
    'pre-construction', # new in July 2022 GGPT data
]
accepted_statuses_two_col = {
    'Latin America Portal - coal-steel': [
        'operating', 
        'construction', 
        'proposed', 
        'permitted',
        'cancelled', 
        'shelved', 
        'retired_plus', # includes all 'retired' & also 'closed' (coal mines)
        'mothballed', 
    ],
    'Europe Gas Tracker': oil_gas_two_col_standarized,
    'Latin America Portal - oil-gas': oil_gas_two_col_standarized,
    'Asia Gas Tracker': oil_gas_two_col_standarized,
    'Africa Gas Tracker': oil_gas_two_col_standarized,
}

if two_column_status:
    accepted_statuses_sel = accepted_statuses_two_col
else:
    accepted_statuses_sel = accepted_statuses_one_col
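
The comments on the standardized oil/gas list describe how original statuses are grouped for the legend. A minimal sketch of deriving `status_legend` from `status` is below. The grouping follows those comments; the helper name `add_status_legend` and the mapping dict are assumptions for illustration.

```python
import pandas as pd

# Grouping taken from the comments on the standardized oil/gas status list
status_to_legend = {
    "construction": "construction_plus",
    "in development": "construction_plus",
    "proposed": "proposed_plus",
    "discovered": "proposed_plus",
    "mothballed": "mothballed_plus",
    "idle": "mothballed_plus",
    "shut in": "mothballed_plus",
}


def add_status_legend(df, mapping=status_to_legend):
    """Add 'status_legend'; statuses not in the mapping are kept unchanged."""
    out = df.copy()
    out["status_legend"] = out["status"].map(mapping).fillna(out["status"])
    return out


example = pd.DataFrame(
    {"status": ["operating", "in development", "idle", "discovered"]}
)
with_legend = add_status_legend(example)
```

The original `status` column is left untouched, which matches the two-column idea: one column for the original values, one for the legend.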

In [1142]:
lat_am_carib_countries = [
    'Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia',
    'Brazil', 'Chile', 'Colombia', 'Costa Rica', 'Cuba',
    'Dominican Republic', 'Ecuador', 'El Salvador', 'French Guiana', 'Grenada',
    'Guadeloupe', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 
    'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Paraguay',
    'Peru', 'Suriname', 'Trinidad and Tobago', 'Uruguay', 'Venezuela'
]
european_union_countries = [
    'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus',
    'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 
    'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 
    'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 
    'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 
    'Spain', 'Sweden',
]
other_europe_countries = [
    'Albania', 'Andorra', 'Belarus', 'Bosnia and Herzegovina', 'Holy See', 'Iceland',
    'Liechtenstein', 'Moldova', 'Monaco', 'Montenegro', 'North Macedonia', 
    'Norway', 'San Marino', 'Serbia', 'Switzerland', 'Türkiye', 'Ukraine', 
    'United Kingdom',
]
all_europe_countries = european_union_countries + other_europe_countries
# Notes: 
# In UN M49, Cyprus is within the subregion Western Asia.
# The definition of Europe above does NOT include: Armenia, Azerbaijan, Georgia, Israel;
# in UN's M49, those countries (and many more) are part of the subregion Western Asia.

asia_countries = [
    # UN Eastern Asia:
    'China', 'Hong Kong', 'Japan', 'Macao', 'Mongolia', 'North Korea', 'South Korea', 'Taiwan', 
    # UN South-eastern Asia:
    'Brunei', 'Cambodia', 'Indonesia', 'Laos', 'Malaysia', 'Myanmar',
    'Philippines', 'Singapore', 'Thailand', 'Timor-Leste', 'Vietnam',
    # UN Southern Asia:
    'Afghanistan', 'Iran', # new as of March 2023
    'Bangladesh', 'Bhutan', 'India', 'Maldives', 'Nepal', 'Pakistan', 'Sri Lanka',      
]

africa_countries = [
    'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
    'Burundi', 'Cameroon', 'Cape Verde', 'Central African Republic', 'Chad',
    'Comoros', "Côte d'Ivoire", 'Djibouti', 'DR Congo', 'Egypt', 
    'Equatorial Guinea', 'Eritrea', 'Eswatini', 'Ethiopia', 'Gabon', 
    'The Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 
    'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi',
    'Mali', 'Mauritania', 'Mauritius', 'Mayotte (France)', 'Morocco', 
    'Mozambique', 'Namibia', 'Niger', 'Nigeria', 'Republic of the Congo', 
    'Réunion (France)', 'Rwanda', 'Saint Helena, Ascension and Tristan da Cunha (UK)', 'São Tomé and Príncipe', 'Senegal',
    'Seychelles', 'Sierra Leone', 'Somalia', 'South Africa', 'South Sudan', 
    'Sudan', 'Tanzania', 'Togo', 'Tunisia', 'Uganda',
    'Western Sahara', 'Zambia', 'Zimbabwe'
]
# Notes:
# Uses "Côte d'Ivoire" instead of 'Ivory Coast'
# Uses 'DR Congo' instead of 'Democratic Republic of the Congo'
# Uses 'The Gambia' instead of 'Gambia'

# lists of countries for filtering
sel_countries = {
    'Latin America Portal - coal-steel': lat_am_carib_countries,
    'Latin America Portal - oil-gas': lat_am_carib_countries,
    'Latin America Portal - renewables': lat_am_carib_countries,
    'Europe Gas Tracker': all_europe_countries,
    'Asia Gas Tracker': asia_countries,
    'Africa Gas Tracker': africa_countries,
}
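
The country lists above are meant for filtering. A minimal sketch of such a filter follows, covering a single-country column and a multi-country column. The helper `filter_by_countries` and the `;` separator for multi-country cells are assumptions; the actual separator in the pipeline data may differ.

```python
import pandas as pd


def filter_by_countries(df, countries, country_col="Country", sep=None):
    """Keep rows whose country (or any listed country) is in `countries`."""
    if sep is None:
        keep = df[country_col].isin(countries)
    else:
        # Multi-country cells: keep the row if any part matches
        keep = df[country_col].fillna("").apply(
            lambda value: any(
                part.strip() in countries for part in value.split(sep)
            )
        )
    return df[keep].reset_index(drop=True)


plants = pd.DataFrame({"Country": ["Kenya", "France", "Ghana"]})
africa_plants = filter_by_countries(plants, ["Kenya", "Ghana"])

pipes = pd.DataFrame({"Countries": ["Algeria; Italy", "Norway; Germany"]})
africa_pipes = filter_by_countries(
    pipes, ["Algeria"], country_col="Countries", sep=";"
)
```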

In [1143]:
# for pipelines, specify details of which aspects to use in each data set
pipelines_to_use_dict = {
    "Gas Infrastructure": {
        'gas pipes': True,
        'oil pipes': False,
        'ngl pipes': False,
    },
    "Oil Infrastructure": {
        'gas pipes': False,
        'oil pipes': True,
        'ngl pipes': True,
    },
    "Asia Gas Tracker": {
        'gas pipes': True,
        'oil pipes': False,
        'ngl pipes': False,
    },
    "Africa Gas Tracker": {
        'gas pipes': True,
        'oil pipes': False,
        'ngl pipes': False,
    },
    "Europe Gas Tracker": {
        'gas pipes': True,
        'oil pipes': False,
        'ngl pipes': False,
    },
    "Latin America Portal - oil-gas": {
        'gas pipes': True, 
        'oil pipes': True, 
        'ngl pipes': False,
    },
}

# for pipelines, entries that indicate there is no route
no_route_entries = [
    'Capacity expansion only',
    'Bidirectionality upgrade only',
    'Unavailable', 
    'Short route (< 100 km)',
    'N/A',
    '',
]
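
The no-route entries above could be used to separate pipelines that have a mappable route from those that do not. A minimal sketch, assuming the route is stored in a column named `WKTFormat` (the name mentioned in the expected-count notes); the helper `split_by_route` is hypothetical.

```python
import pandas as pd

# Copied from the notebook cell above
no_route_entries = [
    'Capacity expansion only',
    'Bidirectionality upgrade only',
    'Unavailable',
    'Short route (< 100 km)',
    'N/A',
    '',
]


def split_by_route(df, route_col="WKTFormat"):
    """Return (rows with a route, rows without a route)."""
    # Treat missing values as empty strings so they count as "no route"
    route = df[route_col].fillna("").astype(str).str.strip()
    has_route = ~route.isin(no_route_entries)
    return df[has_route], df[~has_route]


example = pd.DataFrame({
    "WKTFormat": ["LINESTRING (0 0, 1 1)", "Unavailable", None, "N/A"],
})
with_route, without_route = split_by_route(example)
```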

In [1144]:
# save timestamp for all exports
save_timestamp = time.strftime('%Y-%m-%d_%H%M', time.localtime())

In [1145]:
standard_folder = 'download and map files/'
if map_choice == 'Latin America Portal - coal-steel':
    path_for_download_and_map_files = gem_path + 'Latin America Portal/' + standard_folder
elif map_choice == 'Latin America Portal - oil-gas':
    path_for_download_and_map_files = gem_path + 'Latin America Portal/' + standard_folder
elif map_choice == 'Latin America Portal - renewables':
    path_for_download_and_map_files = gem_path + 'Latin America Portal/' + standard_folder
elif map_choice == 'Europe Gas Tracker':
    path_for_download_and_map_files = gem_path + 'Europe gas/' + standard_folder
elif map_choice == 'Asia Gas Tracker':
    path_for_download_and_map_files = gem_path + 'Asia gas/' + standard_folder
elif map_choice == 'Africa Gas Tracker':
    path_for_download_and_map_files = gem_path + 'Africa gas/' + standard_folder
elif map_choice == 'Coal Terminals':
    path_for_download_and_map_files = gem_path + 'Global Coal Terminals Tracker/' + standard_folder
elif map_choice == 'Oil & Gas Plant':
    path_for_download_and_map_files = gem_path + 'Global Gas Plant Tracker/' + standard_folder
elif map_choice == 'Solar Power':
    path_for_download_and_map_files = gem_path + 'Global Solar Power Tracker/' + standard_folder
elif map_choice == 'Wind Power':
    path_for_download_and_map_files = gem_path + 'Global Wind Power Tracker/' + standard_folder
elif map_choice == 'Geothermal Power':
    path_for_download_and_map_files = gem_path + 'Global Geothermal Power Tracker/' + standard_folder
elif map_choice == 'Bioenergy Power':
    path_for_download_and_map_files = gem_path + 'Global Bioenergy Power Tracker/' + standard_folder
elif map_choice == 'Nuclear Power':
    path_for_download_and_map_files = gem_path + 'Global Nuclear Power Tracker/' + standard_folder
elif map_choice == 'Oil Infrastructure':
    path_for_download_and_map_files = gem_path + 'Global Oil Infrastructure Tracker/' + standard_folder
elif map_choice == 'Gas Infrastructure':
    path_for_download_and_map_files = gem_path + 'Global Gas Infrastructure Tracker/' + standard_folder
elif map_choice == 'GOGET':
    path_for_download_and_map_files = gem_path + 'GOGET/' + standard_folder
else:
    print("Error!" + f" Unexpected value of map_choice: {map_choice}")


create_folder_if_no(path_for_download_and_map_files)

Folder '/Users/gem-tah/Desktop/GEM_INFO/GEM_WORK/maps/output/Africa gas/download and map files/' already exists.
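
The if/elif chain above maps each map choice to an output folder. As an alternative sketch (not the notebook's current approach), the same mapping could live in a dict, so an unexpected choice fails loudly instead of leaving the path undefined. The names `map_folder_names` and `get_output_path` are hypothetical; the folder names are copied from the cell above.

```python
import os

map_folder_names = {
    'Latin America Portal - coal-steel': 'Latin America Portal',
    'Latin America Portal - oil-gas': 'Latin America Portal',
    'Latin America Portal - renewables': 'Latin America Portal',
    'Europe Gas Tracker': 'Europe gas',
    'Asia Gas Tracker': 'Asia gas',
    'Africa Gas Tracker': 'Africa gas',
    'Coal Terminals': 'Global Coal Terminals Tracker',
    'Oil & Gas Plant': 'Global Gas Plant Tracker',
    'Solar Power': 'Global Solar Power Tracker',
    'Wind Power': 'Global Wind Power Tracker',
    'Geothermal Power': 'Global Geothermal Power Tracker',
    'Bioenergy Power': 'Global Bioenergy Power Tracker',
    'Nuclear Power': 'Global Nuclear Power Tracker',
    'Oil Infrastructure': 'Global Oil Infrastructure Tracker',
    'Gas Infrastructure': 'Global Gas Infrastructure Tracker',
    'GOGET': 'GOGET',
}


def get_output_path(map_choice, base_path,
                    standard_folder='download and map files'):
    """Build the output folder path; raise if map_choice is unknown."""
    try:
        folder = map_folder_names[map_choice]
    except KeyError:
        raise ValueError(
            f"Unexpected value of map_choice: {map_choice}"
        ) from None
    return os.path.join(base_path, folder, standard_folder) + os.sep


example_path = get_output_path('Africa Gas Tracker', '/tmp/out')
```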


# Initializing functions

In [1146]:
# translations for Latin America column headings
# filename: "Latin America - Download file - translations"
lat_am_col_trans_key = '1rYERG8B1tL7dccwPuZN8UYcNDTtKPicsaxjaqE4lA80'

In [1147]:
def get_standard_country_names():
    
    df = gspread_access_file_read_only(
        key = '1mtlwSJfWy1gbIwXVgpP3d6CcUEWo2OM0IvPD6yztGXI', 
        title = 'Countries',
    )
    gem_standard_country_names = df['GEM Standard Country Name'].tolist()
    
    return gem_standard_country_names

In [1148]:
# run initializing functions
gem_standard_country_names = get_standard_country_names()

# General functions

In [1149]:
def fix_one_offs(df, fixes_dict, identifier_col, change_col):
    """
    Fill in missing values for individual projects.
    fixes_dict is a dictionary: each key is an identifier to look for (unique for each project),
    and each value is the value to fill in, if missing.
    Column to look for identifier in is identifier_col.
    Column to add data to, if missing, is change_col.
    Existing (non-missing) values are never overwritten.
    """
    
    if identifier_col not in df.columns:
        print("Error!" + f" identifier_col {identifier_col} not in df.columns")
        print(f"df.columns: {df.columns.tolist()}")
    if change_col not in df.columns:
        print("Error!" + f" change_col {change_col} not in df.columns")
        print(f"df.columns: {df.columns.tolist()}")
    
    for to_fix in list(fixes_dict.keys()):
        sel_rows = df[df[identifier_col]==to_fix].index
        for row in sel_rows:
            val_to_change = df.at[row, change_col]
            if pd.isna(val_to_change):
                if error_verbose == True:
                    print(f"For {to_fix}, filling in {change_col} {fixes_dict[to_fix]}")
                df.at[row, change_col] = fixes_dict[to_fix]
            else:
                if error_verbose == True:
                    print(f"Didn't change; already was a value in that cell: {val_to_change}")
                pass
            
    print(f"Finished fix_one_offs for {identifier_col}")
    print()
    return df
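
A vectorized alternative to the row loop in `fix_one_offs` is sketched below. It is only an illustration, not the code used in this notebook: the function name `fill_missing_from_dict` and the example data are hypothetical. It assumes the same rule, that a value from the dict is written only where the cell is currently missing.

```python
import pandas as pd

def fill_missing_from_dict(df, fixes_dict, identifier_col, change_col):
    """Return a copy of df with missing change_col values filled from fixes_dict."""
    out = df.copy()
    # Look up each row's identifier in fixes_dict; identifiers not in the dict give NaN.
    candidates = out[identifier_col].map(fixes_dict)
    # fillna only touches cells that are currently missing, so existing values are kept.
    out[change_col] = out[change_col].fillna(candidates)
    return out

# hypothetical example data
example = pd.DataFrame({
    'project': ['Plant A', 'Plant B', 'Plant C'],
    'owner': ['Acme', None, None],
})
fixes = {'Plant A': 'Other Co', 'Plant B': 'Beta Energy'}
fixed = fill_missing_from_dict(example, fixes, 'project', 'owner')
print(fixed)
```

Here 'Plant A' keeps its existing owner, 'Plant B' is filled in, and 'Plant C' stays missing because it has no entry in the dict.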

In [1150]:
def harmonize_countries(df):
    """
    Standardize country names, based on file "GEM Country Naming Conventions"
    """
    
    if error_verbose == True:
        print('-'*40 + '\n' + "Starting running harmonize_countries")
    
    if 'Countries' in df.columns:
        country_col = 'Countries'
    else:
        country_col = 'Country'
        
    # strip white space
    df[country_col] = df[country_col].str.strip()
        
    # harmonize countries:
    country_harm_dict = {
        'Czechia': 'Czech Republic',
        'Ivory Coast': "Côte d'Ivoire",
        "Cote d'Ivoire": "Côte d'Ivoire", # adds accent
        "Republic of Congo": "Republic of the Congo", # adds "the"
        "Rep Congo": "Republic of the Congo",
        "Democratic Republic of Congo": "DR Congo",
        "Democratic Republic of the Congo": "DR Congo", # in case step above adds "the"
        "Republic of Guinea": "Guinea",
        "Republic of Sudan": "Sudan",
        "FYROM": "North Macedonia",
        "Chinese Taipei": "Taiwan",
        "East Timor": "Timor-Leste",
        "USA": "United States",
        'Turkey': 'Türkiye',
        'Canary Islands': 'Spain', # used in LNG 2023-10
    }
    for key in country_harm_dict.keys():
        sel = df[df[country_col]==key]
        if len(sel) > 0:
            if error_verbose == True:
                print(f"Found non-standardized country name before trying to standardize: {key} ({len(sel)} rows)")
        # literal substring replacement, so it also covers entries listing several countries
        df[country_col] = df[country_col].str.replace(key, country_harm_dict[key], regex=False)
    
    # fix typo (in gas pipelines):
    df[country_col] = df[country_col].replace("Chna", "China")
    
    # clean up, checking if countries are in standard GEM list
    hyphenated_countries = ['Timor-Leste', 'Guinea-Bissau']
    for row in df.index:
        if pd.isna(df.at[row, country_col])==False:    
            try:
                countries_list = df.at[row, country_col].split(', ')
                countries_list = [x.split('-') for x in countries_list if x not in hyphenated_countries]
            except:
                print("Error!" + f" Exception for row {row}, country_col: {df.at[row, country_col]}")
                countries_list = []
                
            # flatten list
            countries_list = [
                country
                for group in countries_list
                for country in group
            ]
            # clean up
            countries_list = [x.strip() for x in countries_list]
        
            # check that countries are standardized
            for country in countries_list:
                if country not in gem_standard_country_names:
                    print(f"For row {row}, non-standard country name after trying to standardize: {country}")
        else:
            print(f"No countries listed for row {row}")
    
    if error_verbose == True:
        print("Finished running harmonize_countries" + '\n' + '-'*40)
    
    return df
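
Because `str.replace` works on substrings, the order of the keys in `country_harm_dict` matters (see the comment on "Democratic Republic of the Congo" above). A possible alternative is to split each entry and map whole names only. The sketch below is an assumption, not the notebook's method: `harmonize_country_list` is a hypothetical helper, and it assumes countries are separated by commas.

```python
def harmonize_country_list(value, mapping):
    """Map each comma-separated country name through mapping, using exact matches only."""
    names = [name.strip() for name in value.split(',')]
    # Names that are not in the mapping are passed through unchanged.
    return ', '.join(mapping.get(name, name) for name in names)

# a few entries taken from country_harm_dict above
mapping = {
    'Republic of Congo': 'Republic of the Congo',
    'Democratic Republic of Congo': 'DR Congo',
    'Turkey': 'Türkiye',
}
harmonized = harmonize_country_list('Democratic Republic of Congo, Turkey', mapping)
print(harmonized)
```

With exact matching, "Democratic Republic of Congo" maps straight to "DR Congo", without first being rewritten by the "Republic of Congo" rule.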

In [1151]:
def test_for_country_entries(df):
    """
    Make sure all rows contain an entry for the country column.
    """
    
    if 'countries' in df.columns:
        country_col = 'countries'
    elif 'country' in df.columns:
        country_col = 'country'
    else:
        print("Error!" + f" Unexpected case for columns in test_for_country_entries; df.columns:")
        print(df.columns.tolist())
        # can't run the test without a country column
        return
        
    sel = df[(df[country_col].isna()) | (df[country_col]=='')]
    
    # exclude row for LNG shipping routes
    sel = sel[sel['project'] != 'LNG shipping routes']
    
    if len(sel) > 0:
        print(f"There were {len(sel)} rows with no country value (these are NOT being excluded):")
        print(sel[['project', country_col]])
        print('-'*40)
    else:
        if error_verbose == True:
            print("Test passed; all rows had country entries.")

In [1152]:
def clean_nan_not_found_tbd(df):
    """ Clean up nan, 'not found', 'TBD' """
    dtypes_ser = df.dtypes
    for col in df.columns:
        dtype_col = dtypes_ser.at[col]
        
        # print(f"col {col}, dtype: {dtype_col}") # for db
        
        if dtype_col == float:
            # df[col] = df[col].fillna('').astype(str)
            pass
        else:
            df[col] = df[col].replace('nan', '')
            df[col] = df[col].replace('not found', '')
            df[col] = df[col].replace('TBD', '')
        
    return df

In [1153]:
def filter_points_by_country(df, map_choice, sel_countries):
    """
    Works for all point data; doesn't work for pipelines.
    So Global Oil Infrastructure Tracker isn't in this list.
    """
    global_trackers = [
        # single tracker:
        'Oil & Gas Plant',
        'Geothermal Power',
        'Solar Power', 
        'Wind Power',
        # multi-tracker:
        'Coal Terminals',
        'Gas Infrastructure',
        'GOGET',
    ]
    
    # default: unfiltered
    df_filtered = df.copy()
    
    if map_choice not in global_trackers:
        if map_choice in sel_countries.keys():
            country_col = ''
            for col in ['country', 'countries', 'Country']:
                if col in df.columns:
                    country_col = col
            if country_col != '':
                # overwrite df_filtered set above
                df_filtered = df[df[country_col].isin(sel_countries[map_choice])]
                
            else: 
                # then country_col == ''
                print("Error!" + f" Unexpected case; neither 'country' nor 'countries' nor 'Country' in df.columns")
                print(df.columns.tolist())

        else:
            print("Error!" + f" Not yet set up to handle this map_choice: {map_choice}")
    
    return df_filtered

In [1154]:
def test_convert_col_to_float(df, col_list):
    for col in col_list:
        try:
            float_col = df[col].astype(float)
        except:
            print("Error!" + f" Couldn't convert col {col} to float.")
            for row in df.index:
                val = df.at[row, col]
                try:
                    float_val = float(val)
                except:
                    print("Error!" + f" For row {row}, couldn't convert to float: {val}.")
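
The same check can be done without a row loop by coercing the column and comparing against the original. This is a sketch under assumptions: `find_non_numeric` is a hypothetical name, and `pd.to_numeric` is not identical to `float()` in every edge case (for example, an empty string is coerced to NaN and flagged here).

```python
import pandas as pd

def find_non_numeric(df, col):
    """Return the values in col that are present but can't be converted to a number."""
    converted = pd.to_numeric(df[col], errors='coerce')
    # A value is a problem if it was present originally but became NaN on conversion.
    bad_mask = converted.isna() & df[col].notna()
    return df.loc[bad_mask, col]

# hypothetical example data
example = pd.DataFrame({'capacity': ['12.5', 'n/a', None, '3']})
bad = find_non_numeric(example, 'capacity')
print(bad)
```

Only the 'n/a' row is reported; the missing value is not, because it was already empty before conversion.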

In [1155]:
def find_multi_instead_of_one_to_one(df, cols_to_check):
    """
    Find cases in which there is not a one-to-one correspondence between two different columns,
    which should have a one-to-one correspondence with each other.
    """
    if find_multi_values == True:
        perms = permutations(cols_to_check, 2)

        for perm in list(perms):
            x = perm[0]
            y = perm[1]
            df_no_dup = df[[x, y]].drop_duplicates()
            counts = df_no_dup.groupby(y)[x].count()
            multi = counts[counts > 1]

            df_multi = df[df[y].isin(multi.index)]

            keep_cols = [] # initialize
            if len(df_multi) == 0:
                pass
            else:
                for col_to_check in cols_to_check:
                    if col_to_check in df.columns.tolist() and col_to_check not in [x, y]:
                        keep_cols += [col_to_check]
                    else:
                        pass

                # put x & y into keep_cols
                keep_cols += [y, x]

                df_multi = df_multi.sort_values(by=[y, x])

                for col_y_val in df_multi[y].drop_duplicates().tolist():
                    print(f"For a given value in column {y}: {col_y_val}")
                    df_sel = df_multi[df_multi[y] == col_y_val]
                    print(f"Multiple values in col {x}")
                    for row in df_sel.index:
                        print(df_sel.at[row, x])

                    print("=========")
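
For a single pair of columns, the core of this check can be written with `groupby` and `nunique`. The sketch below is a simplified illustration, not a replacement for the function above; `values_with_multiple_partners` and the example data are hypothetical.

```python
import pandas as pd

def values_with_multiple_partners(df, key_col, other_col):
    """List values of key_col that appear with more than one distinct value of other_col."""
    counts = df.groupby(key_col)[other_col].nunique()
    return counts[counts > 1].index.tolist()

# hypothetical example data
example = pd.DataFrame({
    'project': ['Alpha', 'Alpha', 'Beta'],
    'url': ['wiki/Alpha', 'wiki/Alpha_2', 'wiki/Beta'],
})
by_project = values_with_multiple_partners(example, 'project', 'url')
by_url = values_with_multiple_partners(example, 'url', 'project')
print(by_project, by_url)
```

Running it in both directions, as the permutations above do, shows which side breaks the one-to-one correspondence.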

In [1156]:
def latin_america_fill_in_missing_local_language_versions(df):
    if map_choice in ['Latin America Portal - oil-gas', 'Latin America Portal - coal-steel']:
        # updated below 2023-10-27 to remove local language wiki step
        for col in ['project']: # , 'url']:
            for row in df.index:
                if pd.isna(df.at[row, col]) or df.at[row,col]=='':
                    # fill in English values
                    eng_val = df.at[row, f'{col}_en']
                    df.at[row, col] = eng_val
                    print(f"English value was filled in for missing value in col '{col}' for {df.at[row, 'project']}") # for UI
                    
    return df

In [1157]:
def test_map_specified_cells_have_values(df, sector):
    """
    To make sure that map file has data entered in all the columns it needs to.
    """
    print("Running test_map_specified_cells_have_values")
    
    cols_to_check = ['project', 'type', 'status'] # 'url', # updated below 2023-10-27 to remove local language wiki step
    cols_to_print = ['country', 'project']
    
    single_tracker_maps = ['Oil & Gas Plant', 'Geothermal Power', 'GOGET']
    
    if map_choice.startswith('Latin America Portal'):
        # check additional columns
        cols_to_check += ['project_en', 'url_en']
        
        # swap columns to print
        cols_to_print += ['project_en']
        cols_to_print.remove('project')
        
    elif map_choice in single_tracker_maps:
        # remove column to check
        cols_to_check.remove('type')
        
    else:
        # it's one of the other multi-tracker maps, besides Lat Am
        pass

    print(f"show cols_to_check: {cols_to_check}") # for db

    if sector == 'oil_gas':
        if 'geom' in df.columns.tolist() and 'countries' in df.columns.tolist():
            # check different columns
            cols_to_check += ['geom']
        
            cols_to_print.remove('country')
            cols_to_print = ['countries'] + cols_to_print
    
    for col in cols_to_check:
        test = df[df[col].isna()]
        
        # handle coal terminals
        if 'type' in cols_to_check:
            if 'coal_terminal' in test['type'].tolist():
                if col == 'url':
                    coal_terminal_mask = test['type']=='coal_terminal'
                    print("Coal terminals don't have local language wiki pages (as of Apr 2022).")
                    print(f"There were {len(test[coal_terminal_mask])} coal terminal rows with NaN in {col}.")

                    # filter out coal terminals with NaNs
                    test = test[~coal_terminal_mask]
                else:
                    pass
            else:
                pass
        
        # see if there are problematic entries
        if len(test)==0:
            if error_verbose == True:
                print(f"All OK with column {col}")
            
        else:
            print('-'*40)
            print(f"\nError!" + f" In test_map_specified_cells_have_values, column '{col}' has rows with NaNs:")
            sel_cols = [] # initialize
            sel_cols = list(set(cols_to_print + [col]))
            print(test[sel_cols])
            print('-'*40)
            
    # no return

In [1158]:
def test_status_for_map(df):
    """
    Map choice is e.g., "Latin America Portal - oil-gas", 'Europe Gas Tracker', etc.
    """
    if error_verbose == True:
        print("\nRunning test_status_for_map")
        print(f"Accepted statuses are: {accepted_statuses_sel[map_choice]}")
    
    if two_column_status == True:
        status_col_to_check = 'status_legend'
    else:
        status_col_to_check = 'status'
    
    for status in df[status_col_to_check].fillna('[no entry]').unique().tolist():
        if status not in accepted_statuses_sel[map_choice]:
            print("Error!" + f" Status not in accepted list: {status}")
        else:
            pass

    # show statuses (check for outliers)
    if error_verbose == True:
        print(f"Show statuses--in column '{status_col_to_check}'")
        print(df[status_col_to_check].fillna('').value_counts())
        print()
    
        if two_column_status == True:
            print("Also show values in column 'status'")
            print(df['status'].value_counts())
        
    print("Completed test_status_for_map\n")
    # END OF TEST
    # no return

In [1159]:
def read_eez_file_and_standardize():
    """
    Code from GOGET data compilation (map and public) 2022-01-13.ipynb
    """
    # use boundaries from MarineRegions.com
    # union of world country boundaries and Exclusive Economic Zones (2014)
    # http://www.marineregions.org/downloads.php#unioneezcountry

    df = gpd.read_file(
        gem_path + 'EEZ_land_union_v2_201410/' +  
        'EEZ_land_v2_201410.shp'
    )

    # change EEZ country names to those used by GEM
    eez_to_gem_standard_country_dict = {
        'Antigua & Barbuda': 'Antigua and Barbuda',
        'The Bahamas': 'Bahamas',
        'Bosnia & Herzegovina': 'Bosnia and Herzegovina',
        'Congo, DRC': 'DR Congo',
        'Congo': 'Republic of the Congo',
        "Cote d'Ivoire": "Côte d'Ivoire",
        'Czechia': 'Czech Republic',
        # Hong Kong not in EEZ file, it seems
        # Kosovo not in EEZ file, it seems
        'Trinidad & Tobago': 'Trinidad and Tobago',
        'Turkey': 'Türkiye',
    }
    df['Country'] = df['Country'].replace(eez_to_gem_standard_country_dict)

    df = df.set_crs('epsg:4326')
    df = df.set_index('Country')

    eez_and_land_boundaries = df
    
    return eez_and_land_boundaries

In [1160]:
def get_centroids_from_lat_lon_boundaries(df):
    """
    Code from GOGET data compilation (map and public) 2022-01-13.ipynb
    """
    # convert to Cartesian
    df = df.to_crs('epsg:4087')

    # calculate centroids
    centroids = df['geometry'].centroid

    # convert centroids to lat-lon
    centroids = centroids.to_crs('epsg:4326')

    return centroids

In [1161]:
def assign_lat_lon_for_one_row(df, row, centroids, national_centroids, jurisdiction, country):
    """
    Code from GOGET data compilation (map and public) 2022-01-13.ipynb
    """
    if jurisdiction in centroids.index:
        # latitude = y
        df.at[row, 'lat'] = '{0:.5g}'.format(centroids.at[jurisdiction].y)
        # longitude = x
        df.at[row, 'lng'] = '{0:.5g}'.format(centroids.at[jurisdiction].x)
        
    else:
        print(f"The province was not found in the shapefile index; province/country is {jurisdiction}/{country}")
        
        # assign based on country:
        # latitude = y
        df.at[row, 'lat'] = '{0:.5g}'.format(national_centroids.at[country].y)
        # longitude = x
        df.at[row, 'lng'] = '{0:.5g}'.format(national_centroids.at[country].x)
      
    # EXPERIMENTAL
    # df['lat'] = df['lat'].apply(lambda x: '{:,.5f}'.format(x))
    # df['lng'] = df['lng'].apply(lambda x: '{:,.5f}'.format(x))
        
    return df

In [1162]:
# def test_compare_eez_country_names_against_gem_standard(df):
#     """
#     Compare against GEM standard list; show countries that haven't been reconciled yet.
    
#     df argument is the EEZ file
#     """
#     # gem_standard_countries_key = '1mtlwSJfWy1gbIwXVgpP3d6CcUEWo2OM0IvPD6yztGXI'
#     # gc = pygsheets.authorize(client_secret_full_path)
#     # gem_standard_countries = gc.open_by_key(gem_standard_countries_key)
#     # gem_standard_countries = gem_standard_countries.worksheet('title', 'Countries')
#     # gem_standard_countries = gem_standard_countries.get_as_df()
#     # gem_standard_countries_list = gem_standard_countries['GEM Standard Country Name'].tolist()

#     countries_not_standardized = []
#     for country in gem_standard_list:
#         if country not in df['Country'].tolist():
#             gem_countries_not_in_eez += [country]
#     if len(gem_countries_not_in_eez) > 0:
#         print(f"Countries in data set (following GEM standard list) that are not in EEZ names: {gem_countries_not_in_eez}")
#     # END TEST

In [1163]:
def test_compare_eez_country_names_against_map_df(eez_df, map_df):
    """ Find countries in EEZ that haven't yet been reconciled to GEM standard names that are actually in use for this map.
    
    Excludes LNG shipping routes from the comparison (don't have a country listed on that row).
    """
    
    # exclude LNG shipping routes
    df = map_df.copy()[map_df['project']!='LNG shipping routes']
    
    # create map_country_list
    if 'country' in df.columns:
        map_country_list = df['country'].unique().tolist()
    elif 'countries' in df.columns:
        map_country_list = df['countries'].str.split(',').explode().str.strip().unique().tolist()
        
    map_countries_not_in_eez = [] # initialize
    for country in map_country_list:
        if country not in eez_df.index.tolist():
            map_countries_not_in_eez += [country]
            
    if len(map_countries_not_in_eez) > 0:
        if error_verbose == True:
            print(f"Countries in data set (following GEM standard list) that are not in EEZ names: {map_countries_not_in_eez}")
    # END TEST

In [1164]:
def add_centroids_approximate_locations(df, national_centroids, sector):
    """
    If changing to add province/state level data, need to add back argument: centroid_df_dict
    Code from GOGET data compilation (map and public) 2022-01-13.ipynb
    """
    approx_coord = [] # initialize
    
    if sector == 'oil_gas' and map_choice != 'GOGET':
        country_col = 'countries'
    else:
        country_col = 'country'

    if map_choice in ['Oil & Gas Plant', 'GOGET']:
        if 'geom' not in df.columns:
            df['geom'] = 'point'
        
    for row in df.index:
        if 'lat' in df.columns and df.at[row, 'geom'] == 'point':
            lat = df.at[row, 'lat']
            if pd.isna(lat)==True or lat == '':     
                try:
                    country = df.at[row, country_col]
                except:
                    print(f"Exception for: sector: {sector}; country_col: {country_col}; row: {row}") # for UI

    #             province = df.at[row, 'province']
    #             if pd.isna(province) == False: # or province == '':
    #                 # the entry has state/province, so try to use that
    #                 if country in ['Australia', 'Argentina', 'Canada']:
    #                     df = assign_lat_lon_for_one_row(
    #                         df = df, 
    #                         row = row, 
    #                         centroids = centroid_df_dict[country], 
    #                         national_centroids = national_centroids,
    #                         jurisdiction = province,
    #                         country = country,
    #                     )            
    #                 else:
    #                     pass # placeholder

    #             else:
                # use centroid of country
                if country in national_centroids.index:
                    df = assign_lat_lon_for_one_row(
                        df = df, 
                        row = row, 
                        centroids = national_centroids, 
                        national_centroids = national_centroids,
                        jurisdiction = country,
                        country = country,
                    )
                    approx_coord += [f"{df.at[row, 'project']} ({country})"]
                else:
                    print("Error!" + f" Country was not in national_centroids.index: {country}")
            else:
                # lat is not nan; skip
                pass
        else:
            # Either it's geom route (pipeline), or some other case where we don't have the info we need
            pass
        
    if error_verbose == True: 
        print(f"Assigned approximate coordinates for extraction units: {approx_coord}")

    return df

In [1165]:
def assign_approximate_coordinates(df, sector):
    """
    Code from GOGET data compilation (map and public) 2022-01-13.ipynb
    """
    eez_and_land_boundaries = read_eez_file_and_standardize()    
    national_centroids = get_centroids_from_lat_lon_boundaries(eez_and_land_boundaries)
    
    # TO DO: could also use code from Jupyter notebook to assign province/state-level coordinates;
    # already set up for Australia, Argentina, Canada
    df = add_centroids_approximate_locations(df, national_centroids, sector)

    return df

In [1166]:
def test_type_counts(df, map_choice_expected_counts=map_choice_expected_counts):
    """
    Expected counts below are *minimums*, with some clearance to allow for deleting projects occasionally.
    
    Only set up to run on map files, not on download files.
    """
    print('-'*40 + '\n' + "Running test of counts of each type of project")
    
    fudge_factor = 0.05
    
    if len(df) == 0:
        print(f"There was no data in the map file for this map_choice: {map_choice}")
    else:
        if map_choice in ['Oil & Gas Plant']:
            pass
        else:
            # TO DO: change this to catch cases where there are additional types that weren't intended
            # (it happened for Oil Infrastructure where it was including gas pipelines also)
            type_counts = df['type'].value_counts()
            # try:
            if map_choice in map_choice_expected_counts.keys():
                # look up expected types only after confirming map_choice is set up (avoids KeyError)
                expected_list = list(map_choice_expected_counts[map_choice].keys())
                for type_name in expected_list:
                    type_count_expected = map_choice_expected_counts[map_choice][type_name]

                    if type_name in type_counts.index:
                        type_count_actual = type_counts.at[type_name]
                    else:
                        actual_list = list(set(type_counts.index.tolist()))
                        print("Error!" + f" The map df didn't contain this type from expected list: {type_name}")
                        print(f"Here are the types in the map df: {actual_list}")
                        type_count_actual = 'n/a'

                    try:
                        if type_count_actual >= type_count_expected * (1 - fudge_factor):
                            print(f"Passed: Number of {type_name}: {type_count_actual} (expected at least ~{int(round(type_count_expected, 0))})")
                            pass
                        else:
                            print("Error!" + f" For {type_name}, expected {type_count_expected}, but found {type_count_actual}")
                    except:
                        print(f"type_count_actual: {type_count_actual}; type_count_expected: {type_count_expected}; fudge_factor: {fudge_factor}") # for db
            else:
                print("Warning!" + f" test_type_counts not set up for this map_choice: {map_choice}")
                print(f"Counts are:\n{type_counts}")
                
            # except:
            #     print("Error!" + " test_type_counts failed; just show the counts:")
            #     print(type_counts)
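
The TO DO in `test_type_counts` mentions catching types that weren't intended (such as gas pipelines showing up in Oil Infrastructure). One way to do both checks together is sketched below. It is an assumption about how this could look, not existing code: `check_minimum_counts` is a hypothetical name, and the counts in the example are made up.

```python
from collections import Counter

def check_minimum_counts(types, expected_counts, tolerance=0.05):
    """Return (types below their minimum count, types that weren't expected at all)."""
    actual = Counter(types)
    too_few = {}
    for type_name, expected in expected_counts.items():
        found = actual.get(type_name, 0)
        # allow some clearance below the expected count, as with fudge_factor above
        if found < expected * (1 - tolerance):
            too_few[type_name] = (found, expected)
    # types present in the data but absent from the expected list
    unexpected = sorted(set(actual) - set(expected_counts))
    return too_few, unexpected

# made-up example: 97 gas pipelines, 2 LNG terminals, 1 stray oil pipeline
types = ['gas_pipeline'] * 97 + ['lng_terminal'] * 2 + ['oil_pipeline']
expected = {'gas_pipeline': 100, 'lng_terminal': 5}
too_few, unexpected = check_minimum_counts(types, expected)
print(too_few, unexpected)
```

A missing type is treated as a count of 0, so it is reported through the same path instead of needing a separate 'n/a' case.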

In [1167]:
def lat_am_convert_one_tracker_col_names_to_spanish(
    tracker_df, trans_sheet_name):

    # read from Google Sheets
    lat_am_trans_one_tracker = gspread_access_file_read_only(
        key = lat_am_col_trans_key,
        title = trans_sheet_name,
    )
    
    # # create version with Spanish column names
    # lat_am_trans_one_tracker = get_lat_am_col_trans_one_tracker(lat_am_col_trans_gsheet, trans_sheet_name)
    
    lat_am_trans_one_tracker_dict = lat_am_trans_one_tracker.set_index('English')['Spanish'].to_dict()
    tracker_df_for_download_spanish = tracker_df.copy().rename(columns=lat_am_trans_one_tracker_dict)

    # check that all names got translated; none of the English names should be in tracker_df_for_download_spanish
    for col in tracker_df.columns:
        if col in tracker_df_for_download_spanish.columns:
            print("Error!" + f" A column from the English version didn't get translated: {col}")
        else:
            pass
        
    return tracker_df_for_download_spanish

In [1168]:
def get_lat_am_col_trans_one_tracker(lat_am_col_trans_gsheet, trans_sheet_name):
    df = (lat_am_col_trans_gsheet.worksheet('title', trans_sheet_name)).get_as_df()
    
    # for now, only keep those with 'Category' == 'column headings'
    df = df[df['Category']=='column headings']
    
    
    # TO DO: later incorporate other translations of data entries
    
    # keep only needed columns:
    df = df[['English', 'Spanish']]
    
    return df

# Coal & steel data

In [1169]:
# class CoalData:
#     """
#     EXPERIMENTAL:
#     A class to hold all data sets for coal (and steel), 
#     including DataFrames to be exported for map files and downloads.
#     """
    
#     def __init__(self, data_files_and_paths):
#         self.coal_plants = pd.DataFrame() # initialize
#         self.data_files_and_paths = data_files_and_paths
        
#     def read_coal_plants_official(self, data_files_and_paths):
#         path = data_files_and_paths['coal_plants_official_path']
#         file = data_files_and_paths['coal_plants_official_file']

#         print("*"*40)
#         print(f"Coal plants: Reading official data (local Excel file): {file}")
#         print('-'*40)

#         coal_plants = pd.read_excel(path + file, sheet_name = "Units")

#         coal_plants_official_columns_jan_2022 = [
#             'Tracker ID', 'TrackerLOC', 'ParentID', 'Wiki page', 'Country', 
#             'Subnational unit (province, state)', 'Unit', 'Plant', 'Chinese Name', 
#             'Other names', 'Sponsor', 'Parent', 'Capacity (MW)', 'Status', 'Year', 
#             'RETIRED', 'Planned Retire', 'Combustion technology', 'Coal type', 
#             'Coal source', 'Location', 'Local area (taluk, county)', 
#             'Major area (prefecture, district)', 'Region', 'Latitude', 'Longitude', 
#             'Accuracy', 'Permits', 'Captive', 'Captive industry use', 
#             'Captive residential use', 'Heat rate (Btu per kWh)', 
#             'Emission factor (kg of CO2 per TJ)', 'Capacity factor', 
#             'Annual CO2 (million tonnes / annum)', 'Lifetime CO2', 
#             'Remaining plant lifetime (years)'
#         ]
#         print("Checking columns in coal_plants")
#         for col in coal_plants.columns:
#             if col not in coal_plants_official_columns_jan_2022:
#                 print("Warning!" + f" There was a column in the file read in that wasn't in Jan 2022 official release: {col}")

#         for col in coal_plants_official_columns_jan_2022:
#             if col not in coal_plants.columns:
#                 print("Warning!" + f" There was a column in Jan 2022 official release that wasn't in the file read in: {col}")

#         self.coal_plants = coal_plants
#         # return self.coal_plants

### Coal plants

In [1170]:
def read_coal_plants_official(data_files_and_paths):
    file = data_files_and_paths['coal_plants_official_file']

    print("*"*40)
    print(f"Coal plants: Reading official data (local Excel file): {file}")
    print('-'*40)
    
    coal_plants = pd.read_excel(data_files_and_paths['coal_plants_official_path'] + file, sheet_name = "Units")
    # coal_plants_official_columns_jul_2022 = [
    #     'Tracker ID', 'TrackerLOC', 'ParentID', 'Wiki page', 'Country', 
    #     'Subnational unit (province, state)', 'Unit', 'Plant', 'Chinese Name', 
    #     'Other names', 'Owner', 'Parent', 'Capacity (MW)', 'Status', 'Year', 
    #     'RETIRED', 'Planned Retire', 'Combustion technology', 'Coal type', 
    #     'Coal source', 'Location', 'Local area (taluk, county)', 
    #     'Major area (prefecture, district)', 'Region', 'Latitude', 'Longitude', 
    #     'Accuracy', 'Permits', 'Captive', 'Captive industry use', 
    #     'Captive residential use', 'Heat rate (Btu per kWh)', 
    #     'Emission factor (kg of CO2 per TJ)', 'Capacity factor', 
    #     'Annual CO2 (million tonnes / annum)', 'Lifetime CO2', 
    #     'Remaining plant lifetime (years)'
    # ]
    coal_plants_official_columns_jul_2023 = [
        'GEM unit/phase ID', 'GEM location ID', 'Country', 'Wiki URL', 'Plant name', 'Unit name', 
        'Plant name (local)', 'Plant name (other)', 'Owner', 'Parent', 'Capacity (MW)', 
        'Status', 'Start year', 'Retired year', 'Planned retirement', 'Combustion technology', 
        'Coal type', 'Coal source', 'Alternate Fuel', 'Location', 'Local area (taluk, county)', 
        'Major area (prefecture, district)', 'Subnational unit (province, state)', 'Subregion', 'Region', 
        'Previous Region', 'Latitude', 'Longitude', 'Location accuracy', 'Permits', 'Captive', 'Captive industry use', 
        'Captive residential use', 'Heat rate (Btu per kWh)', 'Emission factor (kg of CO2 per TJ)', 'Capacity factor', 
        'Annual CO2 (million tonnes / annum)', 'Lifetime CO2', 'Remaining plant lifetime (years)'
    ]
    
    print("Checking columns in coal_plants")
    for col in coal_plants.columns:
        if col not in coal_plants_official_columns_jul_2023:
            print("Error!" + f" There was a column in the file read in that wasn't in July 2023 official release: {col}")
            
    for col in coal_plants_official_columns_jul_2023:
        if col not in coal_plants.columns:
            print("Error!" + f" There was a column in July 2023 official release that wasn't in the file read in: {col}")

    return coal_plants

In [1171]:
def read_coal_plants_working(data_files_and_paths):
    """ Reads working version of coal plant tracker in order to fill in local language information.
    
    Only used for Latin America Portal.
    """
    path = data_files_and_paths['coal_plants_working_latam_path'] 
    file = data_files_and_paths['coal_plants_working_latam_file']
    coal_plants = pd.read_excel(path + file)

    return coal_plants

In [1172]:
# def coal_plants_one_off_fixes(coal_plants):
#     # one-off
#     coal_plants['Wiki page'] = coal_plants['Wiki page'].replace({
#         'https://www.gem.wiki/President_Medici_(Candiota)_power_station': 
#         'https://www.gem.wiki/Presidente_M%C3%A9dici_Candiota_power_station',
#     })

#     coal_plants['Plant'] = coal_plants['Plant'].replace({
#         'Presidente Médici (Candiota) power station': 
#         'Presidente Médici Candiota power station',
#     })
            
#     coal_plants['Local name'] = coal_plants['Local name'].replace({
#         # multiple names:
#         'Termoeléctrica Mejillones, IEM1, Dragón Rojo': 'Termoeléctrica Mejillones',
#         'Termoeléctrica Mejillones, IEM2, Dragón Rojo': 'Termoeléctrica Mejillones',
#         # typo:
#         'Central Termoeléctrica de ilo': 'Central Termoeléctrica de Ilo',
#     })
    
#     return coal_plants

In [1173]:
# TO DO: remove this function; no longer needed after unit names were shortened in Coal Plant Tracker release July 2022

# def coal_plant_extract_unit_names_lat_am(coal_plants):
#     # change plant names before creating copy to help with parsing
#     coal_plants['Plant'] = coal_plants['Plant'].replace({
#         'Ilo 2 power station': 'Ilo power station',
#     })
#     # note: wiki page https://www.gem.wiki/Ilo_2_power_station redirects to .../Ilo_power_station
    
#     # pull short unit name out of column 'Unit'
#     coal_plants['Plant name in unit'] = coal_plants.copy()['Plant']

#     coal_plants['Plant'] = coal_plants['Plant'].str.replace('President Medici', 'Presidente Médici')
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('President Medici', 'Presidente Médici')

#     coal_plants['Plant name in unit'] = coal_plants['Plant name in unit'].replace({
#         'AES Fonseca power station': 'AES Fonseca',
#         'San José power station': 'San Jose power station',
#         'Le Moule Power Station': 'Le Moule power station',
#         'Coahuila power station': 'Coahuila power plant',
#         'Sao Luis Alumar power station': 'São Luis Alumar',
#         'Seival power station': 'Seival thermal power project',
#         'Barcarena Alunorte power station': 'Barcarena Alunorte',
#         'Pedras Altas power station': 'Pedras Altas',
#         'Porto do Pecém power station': 'Porto do Pecém',
#         'Sao Luis Alumar power station': 'São Luis Alumar',
#         'Presidente Médici-C (Candiota) power station': 'Presidente Médici (Candiota) power station C-'
#     })

#     # those with numbers or letters in unit name splitting up plant name
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('Gecelca-3 power station', 'Gecelca power station 3-')
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('San Nicolás-2 power station', 'San Nicolás power station 2-')
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace(
#         'President Medici-C (Candiota) power station', 
#         'President Medici (Candiota) power station C-', 
#         regex=False)
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('Ilo 2 power station', 'Ilo power station 2-')
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('Las Palmas 2 power station', 'Las Palmas power station 2-')
#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('Barahona 1', 'Barahona power station 1-')

#     coal_plants['Unit'] = coal_plants['Unit'].str.replace('- ', '-')
#     coal_plants['Unit'] = coal_plants['Unit'].str.strip('-')

#     for row in coal_plants.index:
#         plant_name = coal_plants.at[row, 'Plant name in unit']    
#         unit_short = coal_plants.at[row, 'Unit'].replace(plant_name, '').strip()

#         if plant_name == 'Presidente Médici Candiota power station':
#             if unit_short.startswith('Presidente Médici-A power station'):
#                 unit_short = unit_short.replace('Presidente Médici-A power station Unit ', 'Unit A-')
#             elif unit_short.startswith('Presidente Médici-B power station'):
#                 unit_short = unit_short.replace('Presidente Médici-B power station Unit ', 'Unit B-')
#         elif plant_name == 'Petacalco power station':
#             if unit_short.startswith('Pacífico coal-fired expansion project'):
#                 # for now, don't do anything
#                 pass

#         # rearrange order:
#         if unit_short.startswith('I Unit '):
#             unit_short = unit_short.replace('I Unit ', 'Unit I-')
#         elif unit_short.startswith('II Unit '):
#             unit_short = unit_short.replace('II Unit ', 'Unit II-')

#         # remove 'Unit'
#         unit_short = unit_short.replace('Unit ', '')

#         # remove 'Ilo power station' (not sure why it's not working to remove above)
#         unit_short = unit_short.replace('Ilo power station', '')
#         unit_short = unit_short.replace('Presidente Médici-C (Candiota) power station', 'C-')
#         unit_short = unit_short.strip()

#         coal_plants.at[row, 'Unit short'] = unit_short

#     coal_plants['Unit short'] = coal_plants['Unit short'].str.replace('- ', '-')
#     coal_plants['Unit short'] = coal_plants['Unit short'].str.strip('-')

#     coal_plants = coal_plants.rename(columns={
#         'Unit': 'Unit long',
#         'Unit short': 'Unit',
#     })
    
#     coal_plants['Plant'] = coal_plants['Plant'].str.replace('(Candiota)', 'Candiota', regex=False)
    
#     return coal_plants

In [1174]:
def coal_plants_create_file_for_map(coal_plants):
    df = coal_plants.copy()
    
    # convert statuses to those used in PEPAL interface
    df['Status'] = df['Status'].str.lower().str.strip()
    df['Status'] = df['Status'].replace({
        'announced': 'proposed', # converted per Gregor's email 2022-01-31
        # 'pnnounced': 'proposed', # this is a typo; should be 'announced'
        'pre-permit development': 'proposed', # converted per Gregor's email 2022-01-31
        'pre-permit': 'proposed', # alternative version of 'pre-permit development'
    })

    # make only first letter uppercase
    df['Status'] = df['Status'].str[0].str.upper() + df['Status'].str[1:].str.lower()
    
    df = df.rename(columns={
        'Unit': 'unit',
        'Year': 'start_year',
        'Parent': 'parent',
        'Sponsor': 'owner',
        'Subnational unit (province, state)': 'province',
        'Country': 'country',
        'Status': 'status',
        'Capacity (MW)': 'capacity',
        'Latitude': 'lat', 
        'Longitude': 'lng',
        # additional columns unique to coal plants:
        'Coal type': 'coal_type',
        'Coal source': 'coal_source',
    })
    
    if map_choice == 'Latin America Portal - coal-steel':
        df = df.rename(columns={
            'Plant': 'project_en', # english name
            'Local name': 'project', # local name
            'Wiki page': 'url_en', # english page
            # 'Span/Port wiki page': 'url', # local language page # updated below 2023-10-27 to remove local language wiki step
        })
    else:
        df = df.rename(columns={
            'Plant': 'project', # english name
            'Wiki page': 'url', # english page
        })

    # add capacity unit
    df['capacity_production_unit'] = 'MW'
    
    # add category for map
    if map_choice in ['Coal Terminals', 'Latin America Portal - coal-steel']:
        df['type'] = 'coal_plant'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
    
    coal_plants_for_map = df
    return coal_plants_for_map

In [1175]:
# def coal_plants_merge_local_language_info(coal_plants, data_files_and_paths = data_files_and_paths):
#     """
#     Only runs if map_choice is Latin America Portal.
    
#     Coal plants processed: File from Flora, including new calculations for emissions.
#     Note that coal plants processed here is for Latin America.
#     Other maps use official data release, and don't need processed version.
    
#     Coal plants working: File(s) from working folder, used here only for adding local language info.
#     """
    
#     coal_plants_working = read_coal_plants_working(data_files_and_paths)

#     # Tracker ID is the ID for each unit; supposed to be unique
#     # If these IDs are in fact unique, then can use to merge the data Flora prepared with working data
#     num_duplicated_rows = coal_plants['Tracker ID'].duplicated().sum()
#     num_duplicated_rows_working = coal_plants_working['Tracker ID'].duplicated().sum()

#     if num_duplicated_rows == 0 and num_duplicated_rows_working == 0:
#         coal_plants_initial_len = len(coal_plants)
#         coal_plants_merged = pd.merge(
#             coal_plants, 
#             coal_plants_working[['Tracker ID', 'Span/Port wiki page', 'Local name']],
#             on='Tracker ID',
#             how='left'
#         )
#         if coal_plants_initial_len == len(coal_plants_merged):
#             coal_plants = coal_plants_merged.copy()
#             print("Merged in local language names for coal plants & wiki pages")
#             return coal_plants

#         else:
#             print("Error!" + f" There was a difference in the length of dfs.")
#             return pd.DataFrame()

#     else:
#         print("Error!" + f" There were duplicate Tracker ID entries.")
#         return pd.DataFrame()

In [1176]:
def coal_plants_check_for_local_names(coal_plants):
    test = coal_plants[coal_plants['Local name'].isna()][['Local name', 'Span/Port wiki page']]
    if len(test) == 0:
        pass
    else:
        print("Test failed!" + " There was no local name for some rows")
        print(test)
        
    # no return

In [1177]:
def coal_plants_create_data_download_version(coal_plants):
    """
    Argument should be coal_plants, read directly from working sheet.
    """
    
    coal_plants_for_download_dict = {'English': coal_plants.copy()}
    
    if map_choice == 'Latin America Portal - coal-steel':
        # modified 2023-10-27 to update with new column names
        # coal_plants = coal_plants.drop('Chinese Name', axis=1)
        
        coal_plants_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = coal_plants, 
            trans_sheet_name = 'coal plants',
        )
        
        # TO DO: instead of specifying the list below, make sure the column names in the file used for translations has the columns in the desired order;
        # then use that file to generate a list of column names used for reordering the data below
        # reorder columns:
        coal_plants_spanish_cols_new_order = [
            '№ de identificación', '№ de identificación - ubicación', '№ de identificación - empresa matriz', 
            'Página wiki', 'País', 'Unidad subnacional (provincia, estado, departamento)', 
            'Planta', 'Unidad', 'Nombre(s) alternativo(s)', 'Propietario', 'Empresa matriz', 'Capacidad (MW)', 
            'Estado', 'Año de inicio', 'Retirado', 'Fecha de retiro prevista', 'Tecnología de combustión', 
            'Tipo de carbón', 'Fuente de carbón', 'Ubicación', 'Municipio', 'Distrito', 'Región', 'Latitud', 'Longitud', 'Precisión', 
            'Permisos', 'Uso interno', 'Uso interno industrial', 'Uso interno residencial', 'Tasa de calor (Btu por kWh)', 
            'Factor de emisión (kg de CO2 por TJ)', 'Factor de capacidad', 'CO2 anual (millones de toneladas/año)', 
            'CO2 acumulado (vida útil)', 'Vida útil restante (años)', 'Planta (inglés)', 'Página wiki (inglés)',
        ]
        coal_plants_for_download_spanish = coal_plants_for_download_spanish[coal_plants_spanish_cols_new_order]
        
        # put into dict
        coal_plants_for_download_dict['Spanish'] = coal_plants_for_download_spanish
        print("Coal plants: Added Spanish version to download dict") # for UI

    else:
        pass
    
    return coal_plants_for_download_dict

In [1178]:
def run_all_coal_plant_functions(
    map_choice, data_versions_dict, data_files_and_paths,
):
    if data_versions_dict[map_choice]['coal plants'] == 'official':
        coal_plants = read_coal_plants_official(data_files_and_paths)
        coal_plants = harmonize_countries(coal_plants)
        
    else:
        # stop here: coal_plants would be undefined below for any other data version
        raise ValueError(f"Not yet set up to handle this data_version: {data_versions_dict[map_choice]['coal plants']}")
        
    # modified 2023-10-27 to remove merge of local language URL
    # if map_choice == 'Latin America Portal - coal-steel':
    #     coal_plants = coal_plants_merge_local_language_info(coal_plants)

    # filter by country:
    coal_plants = filter_points_by_country(coal_plants, map_choice, sel_countries)

    # ======
    # create version for download
    coal_plants_for_download_dict = coal_plants_create_data_download_version(coal_plants)
    
    # ======
    # create version for map
    
    # RUN TESTS:
    coal_plants_mod = coal_plants.copy()
    test_convert_col_to_float(coal_plants_mod, ['Latitude', 'Longitude'])
    
    # TO DO: remove this block; no longer needed after unit names were shortened in Coal Plant Tracker release July 2022
    # if map_choice == 'Latin America Portal - coal-steel':
    #     coal_plants_mod = coal_plant_extract_unit_names_lat_am(coal_plants_mod)
    #     # coal_plants_mod = coal_plants_one_off_fixes(coal_plants_mod)
    #     coal_plants_check_for_local_names(coal_plants_mod)

    # CREATE DF FOR MAP
    coal_plants_for_map = coal_plants_create_file_for_map(coal_plants_mod)
    
    return coal_plants_for_download_dict, coal_plants_for_map
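
The status handling in `coal_plants_create_file_for_map` can be tried out in isolation. The sketch below is a minimal illustration on made-up rows; the helper name `normalize_plant_status`, the constant `STATUS_MAP`, and the sample data are assumptions for illustration, not part of this notebook. It applies the same three steps: lowercase and strip, remap to the accepted categories, then capitalize the first letter.

```python
import pandas as pd

# Hypothetical mapping, copied from the replace() call in
# coal_plants_create_file_for_map for illustration only.
STATUS_MAP = {
    'announced': 'proposed',
    'pre-permit development': 'proposed',
    'pre-permit': 'proposed',
}

def normalize_plant_status(status: pd.Series) -> pd.Series:
    """Lowercase/strip, remap to accepted categories, capitalize first letter."""
    cleaned = status.str.lower().str.strip()
    cleaned = cleaned.replace(STATUS_MAP)
    # str.capitalize() uppercases the first character and lowercases the rest
    return cleaned.str.capitalize()

# Made-up sample rows (not real tracker data)
sample = pd.DataFrame({'Status': [' Announced', 'pre-permit', 'Operating', 'CANCELLED ']})
sample['Status'] = normalize_plant_status(sample['Status'])
print(sample['Status'].tolist())
```

Keeping the mapping in one dictionary makes it easier to see at a glance which source statuses collapse into `proposed`.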

### Coal mines

In [1179]:
def read_coal_mines(data_versions_dict, data_files_and_paths):
    """ Read coal mine data from spreadsheet.
    
    Currently only set up to read official coal mine data.
    Previously had read working version of coal mine data, in which small mines were separate from large mines.
    """
    if data_versions_dict[map_choice]['coal mines'] == 'official':
        print("*"*40)
        print(f"Coal mines: Reading official version (local Excel file): {data_files_and_paths['coal_mines_official_file']}")
        print('-'*40)
        coal_mines_xl = pd.ExcelFile(
            data_files_and_paths['coal_mines_official_path'] + 
            data_files_and_paths['coal_mines_official_file']
        )
        coal_mines = pd.read_excel(coal_mines_xl, sheet_name='Global Coal Mine Tracker')
                
    elif data_versions_dict[map_choice]['coal mines'] == 'working':
        # coal_mines would be undefined at the return below, so fail explicitly
        raise NotImplementedError("Not set up to handle working file for coal mines")
    
    else:
        raise ValueError(f"Unexpected value for data_versions_dict[map_choice]['coal mines']: {data_versions_dict[map_choice]['coal mines']}")

    return coal_mines

In [1180]:
def clean_coal_mines(coal_mines):
    
    coal_mines = harmonize_countries(coal_mines)
    
    # exclude any rows with no mine name
    coal_mines = coal_mines.dropna(subset=['Mine Name'])
    
    return coal_mines

In [1181]:
def test_coal_mine_name_counts(coal_mines):
    coal_mines_name_counts = coal_mines.groupby('Mine Name')['Mine Name'].count()
    df = coal_mines_name_counts[coal_mines_name_counts > 1]
    
    if len(df)==0:
        pass
    else:
        print("Error!" + " Test test_coal_mine_name_counts failed")
        print(df)

    # no return

In [1182]:
def coal_mine_split_production_vs_capacity_data(df):
    sel = df[df['Coal Output (Annual, Mt)']=='*']
    if len(sel) > 0:
        print(f"\nThere were {len(sel)} rows with the entry '*' in the column 'Coal Output (Annual, Mt)'; will be replaced with NaNs")
    
    for row in df.index:
        quantity_mtpa = df.at[row, 'Coal Output (Annual, Mt)']
        
        if quantity_mtpa == '*':
            quantity_mtpa = np.nan
        
        prod_cap = df.at[row, 'Production or Capacity Data (Mtpa)']
        if prod_cap == 'Production':
            df.at[row, 'production'] = quantity_mtpa
        elif prod_cap == 'Capacity':
            df.at[row, 'capacity'] = quantity_mtpa
        else:
            print("Error!" + f" There was an unexpected value for row {row}: {prod_cap}")
                  
    return df

In [1183]:
def coal_mines_convert_statuses(df):
    # convert statuses
    df['Status Detail'] = df['Status Detail'].replace('War in Ukraine', '')
    
    # GCMT uses two columns to define status; reduce to one column, and use only accepted categories
    df['Status concat'] = (df['Status'].fillna('') + '_' + df['Status Detail'].fillna('')).str.strip('_')
    
    df['status'] = df.copy()['Status concat']
    df['status'] = df['status'].str.lower().replace({     
        'proposed_announced': 'proposed', # based on PEPAL 2021-11-10
        'proposed_pre-permit': 'proposed', # based on PEPAL 2021-11-10
        'proposed_exploration': 'proposed',
        'proposed_permitted': 'permitted',
        'proposed_construction': 'construction',
    })
    
    for status in ['operating', 'cancelled', 'shelved', 'mothballed']:
        for row in df.index:
            if df.at[row, 'status'].startswith(f"{status}_"):
                # replace with version without the portion with the underscore
                df.at[row, 'status'] = status
        
    return df

In [1184]:
def coal_mines_reorder_cols_lat_am(df):
    """ Reorder columns in the Spanish version of the download file.
    """
    coal_mines_cols_new_order = [
        '№ de identificación', 'Nombre de la mina', 'Nombre de la mina (inglés)', 
        'Página wiki', 'Página wiki (inglés)', 'Estado', 'Estado detallado', 'Tipo de proyecto', 'Fase del proyecto', 
        'Operadores', 'Propietario', 'Empresa matriz', 'Sede de la empresa', 'Datos de producción o capacidad (Mt / año)', 'Producción de carbón (Mt / año)', 
        'Producción de carbón (Mtc / año)', 'Tipo de mina', 'Método de extracción', 'Tamaño de la mina (km2)', 
        'Profundidad de la mina (m)', 'Precisión de profundidad', 'Número de empleados', 'Tipo de carbón', 'Grado de carbón', 
        'Reservas totales (probadas y probables)', 'Reserva probada', 'Reserva probable', 'Recurso (medido)', 'Recurso (indicado)', 
        'Recurso (medido e indicado)', 'Recurso (total - inferido, indicado, medido)', 'Relación reservas/producción (R/P)', 
        'Año de inicio', 'Vida útil reportada', 'Yacimiento de carbón', 
        'Ubicación', 'Distrito',
        # 'Municipio', removed from July 2022 release
        'Unidad subnacional (provincia, estado, departamento)', 'País', 'Código ISO', 'Región', 'Latitud', 'Longitud', 
        'Precisión de ubicación', 'Consumidor primario, destino', 'Planta de carbón, Planta siderúrgica, Terminal', 
        'Página wiki (planta de carbón, planta siderúrgica, terminal)', 'Contenido de gas metano (m3 por tonelada)', 
        'Emisiones de metano (millones de m3/año)', 
        # 'Emisiones de metano (CO2e 20 años)', removed in July 2022 version
        'Emisiones de metano (CO2e 100 años)', 'Emisiones de CO2 (millones de toneladas/año)'
    ]
    df = df[coal_mines_cols_new_order]
    
    return df

In [1185]:
def coal_mines_create_file_for_map(coal_mines):
    df = coal_mines.copy()
    
    if map_choice in ['Coal Terminals', 'Latin America Portal - coal-steel']:
        df['type'] = 'coal_mine'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
    
    df = coal_mine_split_production_vs_capacity_data(df)
    
    df = coal_mines_convert_statuses(df)

    # ==============
    # for extra columns for coal mines, Gregor asked for:
    # Reserve size, Coal type, Coal grade, Mine type
    df = df.rename(columns={
         # 'Project Phase': for Lat Am, there aren't any with an entry
        'Project Phase': 'unit', 
        'Owners': 'owner',
        'Parent Company': 'parent',
        'State, Province': 'province',
        'Country': 'country',
        # have separated production vs capacity earlier
        # already created column 'status' (lowercase) above
        'Opening Year': 'start_year',
        'Latitude': 'lat', 
        'Longitude': 'lng',
        # extra columns
        'Coal Type': 'coal_type', 
        'Reserves Total (Proven & Probable)': 'reserve_size', 
        'Coal Grade': 'coal_grade', 
        'Mine Type': 'mine_type', 
    })
    
    if map_choice == 'Latin America Portal - coal-steel':
        df = df.rename(columns={
            'Mine Name': 'project_en',
            'Mine Name (Non-ENG)': 'project',
            'GEM Wiki Page (ENG)': 'url_en',
            # 'GEM Wiki Page (Non-ENG)': 'url', # updated below 2023-10-27 to remove local language wiki step
        })
        # exclude column 'Operators'
        print("Excluding column 'Operators'; map not yet set up to handle it")
        df = df.drop('Operators', axis=1)
        
    else:
        df = df.rename(columns={
            'Mine Name': 'project',
            'GEM Wiki Page (ENG)': 'url',
        })

    # ==============
    # add capacity unit column
    df['capacity_production_unit'] = 'MTPA'
    
    coal_mines_for_map = df
    
    return coal_mines_for_map

In [1186]:
def coal_mines_test_for_local_name(coal_mines):
    if map_choice == 'Latin America Portal - coal-steel':
        test = coal_mines[coal_mines['Mine Name (Non-ENG)'].isna()]
        if len(test) == 0:
            pass
        else:
            print("Test failed!" + " There were some coal mines with no local language name:")
            print(test)

    else:
        pass
            
    # no return

In [1187]:
def coal_mines_create_data_download_version(coal_mines):
    """
    Argument should be coal_mines DataFrame, read directly from working sheet.
    """
    
    # from July 2022 release (note that it's a bit different than Jan 2022 release):
    coal_mines_official_cols = [
        'Mine IDs', 'Mine Name', 'Mine Name AKAs', 'Mine Name (Non-ENG)',
        'GEM Wiki Page (ENG)', 'GEM Wiki Page (Non-ENG)',
        'Status', 'Status Detail', 'Project Type', 'Project Phase',
        'Operators', 'Owners', 'Parent Company', 'Company HQs',
        'Production or Capacity Data (Mtpa)', 'Coal Output (Annual, Mt)', 'Coal Output (Annual, Mst)',
        'Mine Type', 'Mining Method', 'Mine Size (Km2)', 'Mine Depth (m)', 'Depth Accuracy',
        'Workforce Size', 
        'Coal Type', 'Coal Grade', 'Reserves Total (Proven & Probable)', 'Proven Reserve', 'Probable Reserve',
        'Resource (Measured)', 'Resource (Indicated)', 'Resource (Measured + Indicated)', 'Resource (Total - Inferred, Indicated, Measured)',
        'Reserve to Production Ratio (R/P)',
        'Opening Year', 'Reported Life of Mine',
        'Coalfield', 'Location', 'Prefecture, District', 'State, Province', 'Country', 'ISO Code',
        'Region', 'Latitude', 'Longitude', 'Location Accuracy', 
        'Primary Consumer, Destination', 'Coal Plant, Steel Plant, Terminal', 'Coal Plant, Steel Plant, Terminal GEM Wiki',
        'Methane Gas Content (m^3/tonne)', 'Coal Mine Methane Emissions Estimate (MCM/yr)', 'CMM Emissions (CO2e 100 years)', 
        'Carbon Dioxide Emissions (Mt CO2/yr)'
    ]
    
    coal_mines_for_download = coal_mines[coal_mines_official_cols]
    coal_mines_for_download_dict = {'English': coal_mines.copy()}
    
    if map_choice == 'Latin America Portal - coal-steel':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        
        coal_mines_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = coal_mines_for_download,
            trans_sheet_name = 'coal mines',
        )
        coal_mines_for_download_spanish = coal_mines_reorder_cols_lat_am(coal_mines_for_download_spanish)
        
        # append another key-value pair to dictionary
        coal_mines_for_download_dict['Spanish'] = coal_mines_for_download_spanish
        print("Coal mines: Added Spanish version to download dict") # for UI
    
    return coal_mines_for_download_dict

In [1188]:
def run_all_coal_mine_functions(
    map_choice, data_versions_dict, data_files_and_paths,
):
    coal_mines = read_coal_mines(data_versions_dict, data_files_and_paths)
    coal_mines = filter_points_by_country(coal_mines, map_choice, sel_countries)
    
    # TO DO: find notebook version with coal_mines_fix_one_offs; accidentally deleted?
    # coal_mines = coal_mines_fix_one_offs(coal_mines)
    
    # ======
    # version for download
    coal_mines_for_download_dict = coal_mines_create_data_download_version(coal_mines)
    
    # ======
    # version for map
    
    coal_mines_mod = coal_mines.copy()
    coal_mines_mod = clean_coal_mines(coal_mines_mod)
    
    # test_coal_mine_name_counts(coal_mines_mod) # don't run; repeats are intentional
    coal_mines_test_for_local_name(coal_mines_mod)
    test_convert_col_to_float(coal_mines_mod, ['Latitude', 'Longitude'])

    # CREATE DF FOR MAP:
    coal_mines_for_map = coal_mines_create_file_for_map(coal_mines_mod)
    
    return coal_mines_for_download_dict, coal_mines_for_map
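
The row-by-row split in `coal_mine_split_production_vs_capacity_data` could also be written without a loop. The following is only a sketch under stated assumptions: the function name `split_production_vs_capacity` and the sample rows are hypothetical, and unlike the notebook function it coerces the quantity column to numeric (so `'*'` and any other non-numeric entry become NaN) and does not print a message for unexpected values.

```python
import pandas as pd

def split_production_vs_capacity(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical vectorized variant; returns a copy with two new columns."""
    out = df.copy()
    # '*' (and any other non-numeric entry) becomes NaN
    qty = pd.to_numeric(out['Coal Output (Annual, Mt)'], errors='coerce')
    kind = out['Production or Capacity Data (Mtpa)']
    # keep the quantity only where the row is flagged as that kind of data
    out['production'] = qty.where(kind == 'Production')
    out['capacity'] = qty.where(kind == 'Capacity')
    return out

# Made-up sample rows (not real tracker data)
sample = pd.DataFrame({
    'Mine Name': ['Mine A', 'Mine B', 'Mine C'],
    'Coal Output (Annual, Mt)': [1.5, '*', 3.0],
    'Production or Capacity Data (Mtpa)': ['Production', 'Capacity', 'Capacity'],
})
result = split_production_vs_capacity(sample)
print(result[['Mine Name', 'production', 'capacity']])
```

Rows with an unexpected value in the production/capacity column end up with NaN in both new columns, so a separate check would still be needed to report them.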

### Coal terminals

In [1189]:
def coal_terminals_read_data(
    map_choice, data_versions_dict, data_files_and_paths
):
    """
    Since the only version (as of Mar 2022) is the working file, 
    and the working file has the other language names in it,
    no need to merge in other language names from another file,
    separate from main data set.
    """
    print("*"*40)
    if data_versions_dict[map_choice]['coal terminals'] == 'official':
        print(f"Coal terminals: Reading official version from local Excel file: {data_files_and_paths['coal_terminals_official_file']}")
        print('-'*40)
        coal_terminals_xl = pd.ExcelFile(
            data_files_and_paths['coal_terminals_official_path'] + 
            data_files_and_paths['coal_terminals_official_file']
        )
        coal_terminals = pd.read_excel(coal_terminals_xl, sheet_name='Coal Terminals')
        
    elif data_versions_dict[map_choice]['coal terminals'] == 'working':
        # coal_terminals = coal_terminals_read_working_pygsheets()
        # coal_terminals would be undefined at the return below, so fail explicitly
        raise NotImplementedError("Not currently set up to use working sheet for coal terminals.")
        
    else:
        raise ValueError(f"Not yet set up to handle data_versions_dict[map_choice]['coal terminals']: {data_versions_dict[map_choice]['coal terminals']}")

    return coal_terminals

In [1190]:
# def coal_terminals_read_working_pygsheets():
#     print("Coal terminals: Reading working version using pygsheets")
#     print('-'*40)
    
#     gc = pygsheets.authorize(client_secret_full_path)
#     coal_terminals_gsheet = gc.open_by_key(data_files_and_paths['coal_terminals_working_key'])

#     main_worksheet = coal_terminals_gsheet.worksheet('title', 'Terminals')
#     coal_terminals = main_worksheet.get_as_df()
    
#     # change column names to match official release (Dec 2021)
#     coal_terminals = coal_terminals.rename(columns={'Terminal Name': 'Name'})
          
#     return coal_terminals

In [1191]:
def coal_terminals_test_names(coal_terminals):
    coal_terminals_name_counts = coal_terminals.groupby('Coal Terminal Name')['Coal Terminal Name'].count()
    test = coal_terminals_name_counts[coal_terminals_name_counts > 1]
    if len(test) == 0:
        pass
    else:
        print("Error!" + " coal_terminals_test_names failed; there were duplicates:")
        print(test)
    # no return

In [1192]:
def coal_terminals_create_file_for_map(coal_terminals):
    
    df = coal_terminals.copy()
    if map_choice in ['Coal Terminals', 'Latin America Portal - coal-steel']:
        df['type'] = 'coal_terminal'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
        
    # fix capacity asterisks
    is_asterisk = df['Capacity (Mpta)'] == '*'
    if is_asterisk.sum() > 0:
        print(f"\nThere were {is_asterisk.sum()} rows with the entry '*' in the column 'Capacity (Mpta)'; will be replaced with NaNs")
    # write the NaNs back into df (assigning to a loop variable would leave df unchanged)
    df.loc[is_asterisk, 'Capacity (Mpta)'] = np.nan
    
    # rename columns
    df = df.rename(columns={
        # ____: 'unit', # no column like this
        'Owner': 'owner',
        # ____: 'parent', # no column like this
        'State/Province': 'province',
        'Country': 'country',
        'Status': 'status',
        'Capacity (Mpta)': 'capacity', # previously 'Capacity (mmtpa)'
        'Start Year': 'start_year',
        'Latitude': 'lat', 
        'Longitude': 'lng',
        # extras:
        # 'Coal source (specific mine or region)': 'coal_source', # not in Coal Terminals official release Dec 2021
        'Terminal Type': 'terminal_type',
    })
    
    if map_choice == 'Latin America Portal - coal-steel':
        df = df.rename(columns={
            'Coal Terminal Name': 'project_en', # previously 'Name'
            'Coal Terminal AKAs': 'project', # previously 'Other Names'
            'GEM Wiki': 'url_en',
            # 'GEM Wiki (Non-ENG)': 'url', # updated below 2023-10-27 to remove local language wiki step
        })
    else:
        df = df.rename(columns={
            'Coal Terminal Name': 'project', # previously 'Name'
            'GEM Wiki': 'url' # previously 'Wiki'
        })

    # add capacity unit
    df['capacity_production_unit'] = 'MTPA' # meaning million tons per annum
    
    coal_terminals_for_map = df
    
    return coal_terminals_for_map

In [1193]:
def coal_terminals_create_data_download_version(df):
    """
    Argument should be coal_terminals, before being modified for map.
    """
    # exclude columns
    if 'Checked?' in df.columns:
        df = df.drop(['Checked?'], axis=1)
    
    coal_terminals_for_download = df    
    coal_terminals_for_download_dict = {'English': coal_terminals_for_download}
    
    if map_choice == 'Latin America Portal - coal-steel':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        coal_terminals_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = coal_terminals_for_download, 
            trans_sheet_name = 'coal terminals',
        )
        # append another pair to dictionary
        coal_terminals_for_download_dict['Spanish'] = coal_terminals_for_download_spanish
        print("Coal terminals: Added Spanish version to download dict") # for UI
    
    return coal_terminals_for_download_dict

In [1194]:
def coal_terminals_merge_wiki_local(coal_terminals):
    """
    Official release Dec 2021 has local name, but not local wiki page.
    This function merges in the local wiki page.
    """
    # # read working file using pygsheets
    # coal_terminals_working = coal_terminals_read_working_pygsheets()
    
    # read working file - local
    coal_terminals_working = pd.read_excel(
        data_files_and_paths['coal_terminals_working_path'] + 
        data_files_and_paths['coal_terminals_working_file'],
        sheet_name = 'Terminals',
    )
    
    coal_terminals_init_len = len(coal_terminals) # for later test
    
    coal_terminals_working_for_merge = coal_terminals_working[['Coal Terminal Name', 'GEM Wiki (Non-ENG)']]
    # pass subset as a list; a bare string is only accepted by newer pandas versions
    coal_terminals_working_for_merge = coal_terminals_working_for_merge.dropna(subset=['GEM Wiki (Non-ENG)']).drop_duplicates()
    
    coal_terminals = pd.merge(
        coal_terminals,
        coal_terminals_working_for_merge,
        on='Coal Terminal Name',
        how='left'
    )
    
    # TEST:
    # Did len change?
    if len(coal_terminals) != coal_terminals_init_len:
        print("Error!" + f" There was a change in length when merging local wiki pages.")
        print(f"coal_terminals_init_len: {coal_terminals_init_len}; len(coal_terminals): {len(coal_terminals)}")
    # END OF TEST
    
    print("Coal terminals: merged in local language wiki URLs")
    
    return coal_terminals

In [1195]:
def run_all_coal_terminal_functions(
    map_choice, data_versions_dict, data_files_and_paths
):
    coal_terminals = coal_terminals_read_data(map_choice, data_versions_dict, data_files_and_paths)
    coal_terminals = harmonize_countries(coal_terminals)
    coal_terminals = filter_points_by_country(coal_terminals, map_choice, sel_countries)
    if map_choice == 'Latin America Portal - coal-steel':
        coal_terminals = coal_terminals_merge_wiki_local(coal_terminals)

    # version for download:
    coal_terminals_for_download_dict = coal_terminals_create_data_download_version(coal_terminals)

    # version for map:
    coal_terminals_test_names(coal_terminals)
    test_convert_col_to_float(coal_terminals, ['Latitude', 'Longitude'])
    coal_terminals_for_map = coal_terminals_create_file_for_map(coal_terminals)
    
    return coal_terminals_for_download_dict, coal_terminals_for_map
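
The length check in `coal_terminals_merge_wiki_local` guards against a left merge that duplicates rows. As a possible alternative, `pd.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the right-hand keys are not unique. The sketch below uses made-up rows and placeholder `example.org` URLs; the function name `merge_local_wiki` is an assumption and is not used elsewhere in this notebook.

```python
import pandas as pd

def merge_local_wiki(terminals: pd.DataFrame, working: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant: left-merge local wiki URLs, failing on ambiguous keys."""
    lookup = (
        working[['Coal Terminal Name', 'GEM Wiki (Non-ENG)']]
        .dropna(subset=['GEM Wiki (Non-ENG)'])
        .drop_duplicates()
    )
    # validate='many_to_one' raises MergeError if a terminal name
    # appears more than once in lookup (e.g. two different URLs)
    return pd.merge(
        terminals, lookup,
        on='Coal Terminal Name', how='left', validate='many_to_one',
    )

# Made-up sample data (placeholder URLs, not real wiki pages)
terminals = pd.DataFrame({
    'Coal Terminal Name': ['Terminal A', 'Terminal B'],
    'Country': ['Chile', 'Brazil'],
})
working = pd.DataFrame({
    'Coal Terminal Name': ['Terminal A', 'Terminal A', 'Terminal B'],
    'GEM Wiki (Non-ENG)': [
        'https://example.org/es/Terminal_A',
        'https://example.org/es/Terminal_A',
        None,
    ],
})
merged = merge_local_wiki(terminals, working)
print(merged)
```

With this approach a conflicting entry stops the run with an exception instead of printing an error message and continuing, which may or may not be the desired behavior for this workflow.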

### Steel plants

In [1196]:
def steel_plants_read_official_file(data_files_and_paths):
    """
    Official release has other language names & wiki pages in it,
    so for Latin America Portal, no need to read another file and merge it in.
    """
    print("*"*40)
    print(f"Steel plants: Reading official version (local file): {data_files_and_paths['steel_plants_official_file']}")
    # TO DO: remove print lines below; 2022 release doesn't have this issue
    # print("Known issue with openpyxl, will give this warning: 'UserWarning: Unknown extension is not supported and will be removed'")
    # print("(It seems the steel plant tracker has something installed on the sheet that openpyxl can't handle, but it reads the data fine.)")
    print('-'*40)
    
    steel_plants_xl = pd.ExcelFile(
        data_files_and_paths['steel_plants_official_path'] + 
        data_files_and_paths['steel_plants_official_file']
    )
    
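    if '2021' in data_files_and_paths['steel_plants_official_file']:
        main_data_sheet = "Steel Plants - all data"
    elif '2022' in data_files_and_paths['steel_plants_official_file']:
        main_data_sheet = "Steel Plants"
    else:
        # without this branch, main_data_sheet would be undefined below
        raise ValueError(
            "Steel plants: no known release year (2021 or 2022) in file name: "
            f"{data_files_and_paths['steel_plants_official_file']}"
        )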
        
    steel_plants = pd.read_excel(
        steel_plants_xl, 
        sheet_name = main_data_sheet,
        header = 0,
    )   
    
    for col in steel_plants.columns:
        if 'Unnamed: ' in col:
            steel_plants = steel_plants.drop(col, axis=1)
    
#     steel_plants_official_columns_2021 = [
#         'Plant ID', 'Plant name (English)', 'Plant name (other language)', 'GEM wiki page link', 
#         'Other plant names (English)', 'Other plant names (other language)', 'State or private ownership', 
#         'Department (if state-owned or mixed)', 'Parent', 'Owner', 'Intermediate subsidiaries', 
#         'Location address', 'Municipality', 'Subnational unit (province/state)', 'Country', 'Region', 
#         'Other language location address', 'Coordinates', 'Coordinate accuracy', 
#         'GEM wiki page (other language)', 'Status', 'Start', 'Plant age (years)', 
#         'Nominal crude steel capacity (thousand tonnes per annum)', 'Recent actual crude steel production (thousand tonnes per annum)', 
#         'Recent actual crude steel production year', 'Nominal BOF steel capacity (thousand tonnes per annum)', 
#         'Recent actual BOF steel production (thousand tonnes per annum)', 'Recent actual BOF steel production year', 
#         'Nominal EAF steel capacity (thousand tonnes per annum)', 'Recent actual EAF steel production (thousand tonnes per annum)', 
#         'Recent actual EAF steel production year', 'Nominal OHF steel capacity (thousand tonnes per annum)', 
#         'Recent actual OHF steel production (thousand tonnes per annum)', 'Recent actual OHF steel production year', 
#         'Nominal iron capacity (thousand tonnes per annum)', 'Recent actual iron production (thousand tonnes per annum)', 
#         'Recent actual iron production year', 'Nominal BF capacity (thousand tonnes per annum)', 
#         'Recent actual BF production (thousand tonnes per annum)', 'Recent actual BF production year', 
#         'Nominal DRI capacity (thousand tonnes per annum)', 'Recent actual DRI production (thousand tonnes per annum)', 
#         'Recent actual DRI production year', 'Ferronickel capacity (thousand tonnes per annum)', 
#         'Recent actual ferronickel production (thousand tonnes per annum)', 'Recent actual ferronickel production year', 
#         'Sinter plant capacity (thousand tonnes per annum)', 'Recent actual sinter plant production (thousand tonnes per annum)', 
#         'Recent actual sinter plant production year', 'Coking plant capacity (thousand tonnes per annum)', 
#         'Recent actual coking plant production (thousand tonnes per annum)', 'Recent actual coking plant production year', 
#         'Pelletizing plant capacity', 'Recent actual pelletizing plant production (thousand tonnes per annum)', 
#         'Recent actual pelletizing plant production year', 'Category of steel products', 'Steel products', 
#         'Steel product end use sector', 'Primary steelmaking process (integrated, electric, or oxygen)', 
#         'Primary steel production equipment', 'Detailed primary steel production equipment', 'Power source', 
#         'Iron ore source', 'Met coal source'
#     ]
    
#     steel_plants_official_columns_2022 = copy.deepcopy(steel_plants_official_columns_2021)
#     # known changes:
#     'GEM wiki page link' --> 'GEM wiki page'
#     'Start' --> 'Start year'
    
#     print("Checking columns in steel_plants")
#     for col in steel_plants.columns:
#         if col not in steel_plants_official_columns_2021:
#             print("Error!" + f" There was a column in the file read in that wasn't in the 2021 official release: {col}")

#     for col in steel_plants_official_columns_2021:
#         if col not in steel_plants.columns:
#             print("Error!" + f" There was a column in the 2021 official release that wasn't in the file read: {col}")

    return steel_plants
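
The sheet name depends on which release year appears in the file name. A minimal sketch of that lookup as a standalone helper that fails loudly for an unrecognized release; the helper name `pick_main_sheet`, the constant `MAIN_SHEET_BY_RELEASE`, and the example file name are hypothetical and not part of this notebook:

```python
# Hypothetical helper (not part of this notebook): maps a release year found
# in the file name to the name of the main data sheet.
MAIN_SHEET_BY_RELEASE = {
    '2021': 'Steel Plants - all data',
    '2022': 'Steel Plants',
}


def pick_main_sheet(file_name):
    """Return the main sheet name for the release year found in file_name."""
    for year, sheet_name in MAIN_SHEET_BY_RELEASE.items():
        if year in file_name:
            return sheet_name
    # no known release year: stop here instead of continuing without a sheet name
    raise ValueError(f"No known release year in file name: {file_name}")


# example file name is made up for illustration
print(pick_main_sheet('steel plants official 2022.xlsx'))
```

Keeping the mapping in one dictionary means a future release only needs a new entry rather than another `elif` branch.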

In [1197]:
def read_steel_plants_working_pygsheets(data_files_and_paths):
    print("-"*40)
    print("Reading steel plant data from working file (using pygsheets)")
    gc = pygsheets.authorize(client_secret_full_path)
    steel_plants_gsheet = gc.open_by_key(data_files_and_paths['steel_plants_working_key'])

    main_worksheet = steel_plants_gsheet.worksheet('title', 'Steel Plants')
    # first row is instruction to researchers, so have to start at A2
    steel_plants = main_worksheet.get_as_df(start = 'A2', include_tailing_empty=False)
    
    # exclude empty columns:
    if '' in steel_plants.columns:
        steel_plants = steel_plants.drop('', axis=1)
    
    # columns= is required; without it, rename() targets the index labels
    steel_plants = steel_plants.rename(columns={'Parent [formula]': 'Parent'})

    meta_worksheet = steel_plants_gsheet.worksheet('title', 'Metadata')
    # first two rows are other info, so have to start at A3
    steel_plants_meta = meta_worksheet.get_as_df(start = 'A3')

    # =======
    # clean up
    
    # exclude empty rows:
    steel_plants = steel_plants[steel_plants['Plant name (English)']!='']
    steel_plants = steel_plants[steel_plants['Plant name (English)'].isna()==False]
    
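    # drop unnamed columns
    # (assign back to steel_plants; reassigning a loop variable would not change it)
    unnamed_cols = [col for col in steel_plants.columns if 'Unnamed: ' in col]
    steel_plants = steel_plants.drop(unnamed_cols, axis=1)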
    
    steel_plants = steel_plants.reset_index(drop=True)
    
    # ======
#     # TEST: check column headings in steel_plants_meta
#     if steel_plants_meta.columns[0] == '' and steel_plants_meta.columns[1] == 'Column Name':
#         steel_plants_meta = steel_plants_meta.rename(columns={
#             '': 'Keep column',
#         })
#         print(f"steel_plants_meta.columns (after rename): {steel_plants_meta.columns}")
#         steel_plants_meta['Column Name'] = steel_plants_meta['Column Name'].replace({
#             'Parent': 'Parent [formula]',
#         })
#     else:
#         print("Error!" + f" Expected steel_plants.columns[1] == 'Column Name'; was {steel_plants.columns[1]}")
#     # END TEST

#     # TO DO: in steel_plants_meta, get column names marked 'y' (to keep)
#     steel_plants_meta_keep = steel_plants_meta[steel_plants_meta['Keep column'].str.lower()=='y']
#     steel_plants_keep_cols = steel_plants_meta_keep['Column Name'].tolist()

#     for col in steel_plants_keep_cols:
#         if col not in steel_plants.columns:
#             print(f"Column not found in steel_plants: {col}")
#     for col in steel_plants.columns:
#         if col not in steel_plants_meta['Column Name'].tolist():
#             print(f"Column not found in steel_plants_meta: {col}")

    # TO DO: use steel_plants_meta to filter which columns are used
    
    # ======
    # convert dtypes
    # steel_plants = for_pygsheets_convert_steel_dtypes_and_values(steel_plants)

    steel_plants_working = steel_plants
    return steel_plants_working

In [1198]:
# def for_pygsheets_convert_steel_dtypes_and_values(steel_plants):
#     df = steel_plants.copy()
#     for col in df.columns:
#         df[col] = df[col].replace('', np.nan)    
    
# #     # main sheet: convert to float
# #     for col in ['Latitude', 'Longitude']:
# #         steel_plants[col] = steel_plants[col].replace('', np.nan).astype(float)
#     for num in range(1, 5+1):
#         owner_pct_col = f'Owner {num} %'
#         owner_ser_strs = steel_plants[owner_pct_col].astype(str)
#         owner_ser_strs = owner_ser_strs.replace('', np.nan).str.replace('%', '').astype(float).div(100)
#         steel_plants[owner_pct_col] = owner_ser_strs

#     # parent sheet: convert to float
#     for num in range(1, 5+1):
#         parent_pct_col = f'Parent {num} %'
#         try:
#             parent_ser_strs = steel_plants_parent_df[parent_pct_col]
#             parent_ser_strs = parent_ser_strs.astype(str).replace('', np.nan).str.replace('%', '').astype(float).div(100)
#             steel_plants_parent_df[parent_pct_col] = parent_ser_strs
            
#         except:
#             print(f"Exception in trying to convert to float for column {parent_pct_col}")
# #             print("All columns in df:")
# #             print(steel_plants_parent_df.columns.tolist())
# #             print("=======")

#     steel_plants = df
#     return steel_plants

In [1199]:
def steel_plants_read_data(map_choice, data_versions_dict, data_files_and_paths):
    if data_versions_dict[map_choice]['steel plants'] == 'official':
        steel_plants = steel_plants_read_official_file(data_files_and_paths)
        
    elif data_versions_dict[map_choice]['steel plants'] == 'working':
        # stop here; otherwise the return below fails because steel_plants is undefined
        raise NotImplementedError("Steel plants: not currently handling reading working version")
    else:
        raise ValueError(
            "Error!" + f" Unexpected case for data_versions_dict[map_choice]['steel plants']: "
            f"{data_versions_dict[map_choice]['steel plants']}"
        )
        
    return steel_plants

In [1200]:
def steel_plants_clean_data(steel_plants): 
    
    steel_plants = harmonize_countries(steel_plants)
    
    # remove reference columns:
    for col in steel_plants.columns:
        if col.endswith('[ref]'):
            steel_plants = steel_plants.drop(col, axis=1)
            
    # remove empty rows
    steel_plants = steel_plants[steel_plants['Plant name (English)']!='']
    steel_plants = steel_plants[steel_plants['Plant name (English)'].isna()==False]
    
    # remove unnamed columns
    for col in steel_plants.columns:
        if 'Unnamed: ' in col:
            steel_plants = steel_plants.drop(col, axis=1)

    # rename from 2021 version to 2022 version (in case reading 2021 version again for some reason)
    steel_plants = steel_plants.rename(columns={
        'GEM wiki page link': 'GEM wiki page',
        'Start': 'Start year',
        'Primary steelmaking process (integrated, electric, or oxygen)': 'Main production process',
        'Primary steel production equipment': 'Main production equipment',
        'Nominal crude steel capacity (thousand tonnes per annum)': 'Nominal crude steel capacity (ttpa)',
        'Nominal iron capacity (thousand tonnes per annum)': 'Nominal iron capacity (ttpa)',
    })
    
    steel_plants = steel_exclude_if_no_wiki_page(steel_plants)
    
    return steel_plants

In [1201]:
def steel_exclude_if_no_wiki_page(steel_plants):
    """ 
    Exclude any rows that don't have wiki pages included.
    
    For Latin America, excludes rows if missing either English or foreign language wiki page.
    """
    
    if exclude_no_wiki == True:
        no_wiki_plant_ids = [] # initialize

        if map_choice == 'Latin America Portal - coal-steel':
            wiki_cols = ['GEM wiki page', 'GEM wiki page (other language)']
        else:
            wiki_cols = ['GEM wiki page']

        for wiki_col in wiki_cols:
            if wiki_col in steel_plants.columns:
                no_wiki_df = steel_plants[steel_plants[wiki_col].isna()]
                if len(no_wiki_df) > 0:
                    no_wiki_plant_ids += no_wiki_df['Plant ID'].tolist()
                    print(f"There were {len(no_wiki_df)} plant IDs with missing values in {wiki_col}")

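        if len(no_wiki_plant_ids) > 0:
            # count rows actually removed (no_wiki_df only holds the last column checked)
            n_rows_before = len(steel_plants)
            steel_plants = steel_plants[~steel_plants['Plant ID'].isin(no_wiki_plant_ids)]
            print(f"Excluded {n_rows_before - len(steel_plants)} rows due to missing wiki URLs")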

        elif len(no_wiki_plant_ids) == 0:
            if map_choice == 'Latin America Portal - coal-steel':
                print(f"All rows had wiki URLs (for both English & Spanish/Portuguese)")
            else:
                print(f"All rows had wiki URLs")
                
        else:
            print("Error!" + f" Unexpected case for len(no_wiki_plant_ids): {len(no_wiki_plant_ids)}")
    
    else:
        pass
    
    return steel_plants
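
The exclusion rule in the docstring above (drop a row when any required wiki column is missing) can be sketched without pandas. This is a simplified illustration, not the notebook's method: the function name `exclude_rows_missing_wiki` is hypothetical, rows are plain dicts, the sample values are made up, and (as an assumption) both `None` and `''` count as missing, whereas the pandas version only checks `isna()`:

```python
def exclude_rows_missing_wiki(rows, wiki_cols):
    """Keep only rows (dicts) that have a value in every column of wiki_cols.

    Assumption for this sketch: None and '' both count as missing.
    Returns the kept rows and the number of rows excluded.
    """
    kept = [
        row for row in rows
        if all(row.get(col) not in (None, '') for col in wiki_cols)
    ]
    return kept, len(rows) - len(kept)


# made-up sample rows; the IDs and page values are placeholders
sample_rows = [
    {'Plant ID': 'P1', 'GEM wiki page': 'page-1-en', 'GEM wiki page (other language)': 'page-1-es'},
    {'Plant ID': 'P2', 'GEM wiki page': 'page-2-en', 'GEM wiki page (other language)': None},
    {'Plant ID': 'P3', 'GEM wiki page': None, 'GEM wiki page (other language)': None},
]

# Latin America case: both wiki columns are required
kept_rows, n_excluded = exclude_rows_missing_wiki(
    sample_rows, ['GEM wiki page', 'GEM wiki page (other language)']
)
print([row['Plant ID'] for row in kept_rows], n_excluded)
```

Counting excluded rows as the difference in length avoids double counting a row that is missing both pages.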

In [1202]:
def steel_plants_fix_one_offs(df):
    # fill in missing local language names:
#     local_name_fixes = {
#         # 'Aceros Arequipa Pisco steel plant': 'Planta siderúrgica de Pisco (Spanish)',
#         'Acero Simec Apizaco steel plant': 'Acería Simec Apizaco',
#         'Ternium Apodaca steel plant': 'Acería Ternium Apodaca',
#         'Gerdau Tultitlán (Sidertul) steel plant': 'Acería Gerdau Tultitlán',
#         'Grupo Acerero steel plant': 'Acería Grupo Acerero',
#         'Acero Simec San Luis steel plants': 'Acería Simec San Luis',
#     }      
#     df = fix_one_offs(df, local_name_fixes, 'Plant name (English)', 'Plant name (other language)')
                
#     # fill in missing local language wiki URLs:
#     local_wiki_fixes = {
#         'Acero Simec Apizaco steel plant': 'https://www.gem.wiki/Acer%C3%ADa_Simec_Apizaco',
#         'Ternium Apodaca steel plant': 'https://www.gem.wiki/Acer%C3%ADa_Ternium_Apodaca',
#         'Gerdau Tultitlán (Sidertul) steel plant': 'https://www.gem.wiki/Acer%C3%ADa_Gerdau_Tultitl%C3%A1n',
#         'Grupo Acerero steel plant': 'https://www.gem.wiki/Acer%C3%ADa_Grupo_Acerero',
#         'Acero Simec San Luis steel plants': 'https://www.gem.wiki/Acer%C3%ADa_Simec_San_Luis',
#     }
#     df = fix_one_offs(df, local_wiki_fixes, 'Plant name (English)', 'GEM wiki page (other language)')               

    return df

In [1203]:
# def steel_plants_merge_local_info(steel_plants):
#     """
#     Only runs for Latin America Portal
#     """
    
#     steel_working = read_steel_plants_working_pygsheets(data_files_and_paths)
    
#     # clean up 'Plant name (other language)'
#     steel_working['Plant name (other language)'] = steel_plants['Plant name (other language)'].str.replace(
#         '(Portuguese)', '', regex=False).str.strip()
#     steel_working['Plant name (other language)'] = steel_plants['Plant name (other language)'].str.replace(
#         '(Spanish)', '', regex=False).str.strip()
    
#     # pare down
#     steel_working = steel_working[['Plant ID', 'Plant name (other language)', 'GEM wiki page (other language)']]    
#     steel_working = steel_working.set_index('Plant ID')
    
#     # fill in missing data based on 'Plant ID'
#     # (which isn't actually a plant ID; it's unique for each row)
#     for row in steel_plants.index:
#         for col in ['Plant name (other language)', 'GEM wiki page (other language)']:
#             old_val = steel_plants.at[row, col]
#             if pd.isna(old_val) or old_val == '':
#                 # fill in missing value
#                 sel_id = steel_plants.at[row, 'Plant ID']
#                 working_val = steel_working.at[sel_id, col]
#                 steel_plants.at[row, col] = working_val
#             else:
#                 # print(f"Already a value in the df: {old_val}")
#                 pass
    
#     return steel_plants

In [1204]:
def steel_plants_clean_local_info(steel_plants):
    """
    Only runs for Latin America Portal
    """
    
    # clean up 'Plant name (other language)'
    steel_plants['Plant name (other language)'] = steel_plants['Plant name (other language)'].str.replace(
        '(Portuguese)', '', regex=False).str.strip()
    steel_plants['Plant name (other language)'] = steel_plants['Plant name (other language)'].str.replace(
        '(Spanish)', '', regex=False).str.strip()
    
    return steel_plants

In [1205]:
def steel_plants_create_file_for_map(steel_plants):
    df = steel_plants.copy() 
    
    if map_choice in ['Coal Terminals', 'Latin America Portal - coal-steel']:
        df['type'] = 'steel_plant'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
    
    # create columns as in other trackers, from what is entered in the steel tracker
    df['Latitude'] = df['Coordinates'].str.split(',').str[0].str.strip().astype(float)
    df['Longitude'] = df['Coordinates'].str.split(',').str[1].str.strip().astype(float)
    
    # TEST: check that specified columns can be converted to floats
    test_convert_col_to_float(df, ['Latitude', 'Longitude'])

    # clean capacity columns:
    df = df.rename(columns={
        'Nominal crude steel capacity (ttpa)': 'steel_cap_ktpa',
        'Nominal iron capacity (ttpa)': 'iron_cap_ktpa',
    })
        
    for col in ['steel_cap_ktpa', 'iron_cap_ktpa']:
        df[col] = df[col].astype(str)
        df[col] = df[col].replace('unknown', np.nan, regex=False)
        df[col] = df[col].replace('>0', np.nan, regex=False)
        df[col] = df[col].replace('N/A', np.nan, regex=False)
        df[col] = df[col].str.replace(' (idled)', '', regex=False)
        df[col] = df[col].str.strip()
        df[col] = df[col].astype(float)
    
    # convert values for capacity from thousand tons per year to million tons per year (MTPA)
    df['steel_capacity_mtpa'] = df['steel_cap_ktpa'] / 1000
    df['iron_capacity_mtpa'] = df['iron_cap_ktpa'] / 1000
    
    # add capacity unit
    df['capacity_production_unit'] = 'MTPA'
    
    # remove names of languages:
    df['Plant name (other language)'] = df['Plant name (other language)'].str.rsplit(pat=' (Spanish)', n=1).str[0]
    df['Plant name (other language)'] = df['Plant name (other language)'].str.rsplit(pat=' (Portuguese)', n=1).str[0]
    
    df = df.rename(columns={
        # ____: 'unit', # no such column
        'Parent': 'parent',
        'Owner': 'owner',
        'Subnational unit (province/state)': 'province',
        'Country': 'country',
        'Status': 'status',
        'steel_capacity_mtpa': 'capacity',
        'iron_capacity_mtpa': 'iron_capacity',
        'Start year': 'start_year',
        'Latitude': 'lat', 
        'Longitude': 'lng',
        # extras:
        'Main production process': 'steel_process',
        'Main production equipment': 'steel_equipment',
    })
    
    if map_choice == 'Latin America Portal - coal-steel':
        df = df.rename(columns={
            'Plant name (English)': 'project_en',
            'Plant name (other language)': 'project',
            'GEM wiki page': 'url_en',
            # 'GEM wiki page (other language)': 'url', # updated below 2023-10-27 to remove local language wiki step
        })
    else:
        df = df.rename(columns={
            'Plant name (English)': 'project',
            'GEM wiki page': 'url',
        })
    
    # clean data: replace 'TBD' start years with NaN
    # (steel tracker (working) also has some start year values with "(anticipated)"; not handled here)
    df['start_year'] = df['start_year'].replace('TBD', np.nan) # .astype(float)
    
    steel_plants_for_map = df

    return steel_plants_for_map
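
The two value conversions in the map step (splitting `Coordinates` into latitude and longitude, and converting capacity from thousand tonnes per annum to MTPA) can be illustrated on single values. This is a plain-Python sketch under stated assumptions, not the notebook's pandas code; the function names `parse_coordinates` and `capacity_ktpa_to_mtpa` are hypothetical, and the sample values are made up:

```python
def parse_coordinates(coordinates):
    """Split a 'lat, lng' string into a (lat, lng) tuple of floats."""
    lat_str, lng_str = coordinates.split(',')
    return float(lat_str.strip()), float(lng_str.strip())


def capacity_ktpa_to_mtpa(raw_value):
    """Convert a raw capacity entry (thousand tonnes per annum) to MTPA.

    Placeholder entries ('unknown', '>0', 'N/A', empty, 'nan') give None;
    a trailing ' (idled)' note is removed before conversion.
    """
    text = str(raw_value).replace(' (idled)', '').strip()
    if text in ('unknown', '>0', 'N/A', 'nan', ''):
        return None
    return float(text) / 1000


# made-up sample values
print(parse_coordinates('19.5, -99.25'))
print(capacity_ktpa_to_mtpa('2500 (idled)'))
print(capacity_ktpa_to_mtpa('unknown'))
```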

In [1206]:
def test_steel_plant_names(steel_plants):
    # TEST: check that there is only one row for each entry of 'Plant name (English)'
    steel_plants_name_count = steel_plants.groupby('Plant name (English)')['Plant name (English)'].count()
    test = steel_plants_name_count[steel_plants_name_count > 1]
    if len(test)==0:
        pass
    else:
        print("Error!" + f" Problem with steel plants; more than one row for some entries of 'Plant name (English)'")
        print(test)
        
    # no return
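
The duplicate-name check above counts rows per name and reports any name that appears more than once. The same idea as a small plain-Python sketch using `collections.Counter`; the function name `find_duplicate_names` is hypothetical and the sample names are made up:

```python
from collections import Counter


def find_duplicate_names(names):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# made-up sample names
duplicates = find_duplicate_names(['Plant A', 'Plant B', 'Plant A'])
print(duplicates)
```

Returning the duplicates instead of printing them lets the caller decide whether repeats are an error or, as noted for steel plants, intentional.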

In [1207]:
def steel_plants_create_data_download_version(steel_plants):
    """
    For Latin America Portal, translates column headings to Spanish.
    """
    
    steel_plants_for_download_dict = {'English': steel_plants.copy()}
    
    if map_choice == 'Latin America Portal - coal-steel':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        
        steel_plants_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = steel_plants, 
            trans_sheet_name = 'steel plants',
        )
        
        steel_plants_for_download_dict['Spanish'] = steel_plants_for_download_spanish
        print("Steel plants: Added Spanish version to download dict") # for UI

    else:    
        pass
    
    return steel_plants_for_download_dict

In [1208]:
def run_all_steel_plant_functions(map_choice, data_versions_dict, data_files_and_paths):
    # RUN MAIN FUNCTIONS:
    steel_plants = steel_plants_read_data(map_choice, data_versions_dict, data_files_and_paths)
    steel_plants = filter_points_by_country(steel_plants, map_choice, sel_countries)
    steel_plants = steel_plants_fix_one_offs(steel_plants)
                
    if map_choice == 'Latin America Portal - coal-steel':
        # steel_plants = steel_plants_merge_local_info(steel_plants)
        steel_plants = steel_plants_clean_local_info(steel_plants)
    
    steel_plants = steel_plants_clean_data(steel_plants)

    # create version for download    
    steel_plants_for_download_dict = steel_plants_create_data_download_version(steel_plants)

    # create version for map
    steel_plants_mod = steel_plants.copy()

    # RUN TESTS:
#     test_steel_plant_names(steel_plants_mod) # don't run; repeats are intentional

    # CREATE DF FOR MAP
    steel_plants_for_map = steel_plants_create_file_for_map(steel_plants_mod)

    # run lat-lon test on steel_plants_for_map; 
    # the df steel_plants has both coordinates in one column, 'Coordinates',
    # so this has to be done after creating version for map
    test_convert_col_to_float(steel_plants_for_map, ['lat', 'lng'])
    
    print('-'*40)
    print("Finished processing steel plants")
    
    return steel_plants_for_download_dict, steel_plants_for_map

## Compile all coal data

In [1209]:
def coal_data_for_map_final_processing_and_export(list_of_dfs, path_for_download_and_map_files):
    print("Final map prep (coal_data_for_map_final_processing_and_export)")
    
    # combine all the dfs into one df
    df = pd.concat(list_of_dfs, sort=False).reset_index(drop=True)
    
    # reorder columns:
    coal_keep_cols = [
        'type',
        'project',
        'url',
        'unit',
        'owner',
        'parent',
        'province',
        'country',
        'status',
        'capacity',
        'production',
        'capacity_production_unit',
        'start_year',
        'lat', 
        'lng',
    ]
    coal_remove_cols = [] # initialize
    
    if map_choice == 'Latin America Portal - coal-steel':
        # add additional columns
        coal_keep_cols += [
            'project_en',
            'url_en',
            'coal_type', 
            'coal_grade', 
            'mine_type', 
            'reserve_size', 
            'terminal_type',
            # 'coal_source', # not in map table or pop-up
            'iron_capacity',
            'steel_process',
            'steel_equipment',
        ]
        # modified 2023-10-27 to drop local language URL
        # (coal_keep_cols is a list, so use remove(); lists have no drop() method)
        coal_keep_cols.remove('url')
    elif map_choice == 'Coal Terminals':
        # these are in the map pop-ups but not in the table
        coal_keep_cols += [
            'coal_type',
            'reserve_size', # changed 2022-12
            'coal_grade',
            'mine_type',
            'terminal_type',
            'coal_source',
            'iron_capacity', # changed 2022-12
            'steel_process', # changed 2022-12
            'steel_equipment', # changed 2022-12
        ]
        # check and reorder cols
        coal_terminals_specified_cols = [
            'type', 'project', 'url', 'unit', 'owner', 'parent', 'province', 'country', 
            'status', 'capacity', 'production', 'capacity_production_unit', 'start_year', 
            'lat', 'lng', 'coal_type', 
            # 'reserve_size', # changed 2022-12
            'coal_grade', 'mine_type', 'coal_source', 
            'terminal_type', 
            # 'iron_capacity', 'steel_process', 'steel_equipment', # changed 2022-12
        ]
        
        if set(coal_terminals_specified_cols) == set(coal_keep_cols):
            # use coal_terminals_specified_cols to overwrite value of coal_keep_cols
            coal_keep_cols = coal_terminals_specified_cols
        else:
            print("Error!" + f" For {map_choice}, columns were not as expected.")

    # pare and/or reorder columns:
    df = df[coal_keep_cols]
    
    # clean up text
    for col in ['coal_type', 'coal_grade', 'mine_type', 'terminal_type']:
        if col in df.columns:
            df[col] = df[col].str.lower()
        
    # clean up start year entries, removing decimal places
    df['start_year'] = df['start_year'].astype(str).str.replace('.0', '', regex=False).replace('nan', '').replace(np.nan, '')

    df = clean_nan_not_found_tbd(df)
    
    if df['capacity'].dtype == object:
        print("For column 'capacity', replacing empty strings with NaNs")
        df['capacity'] = df['capacity'].replace('', np.nan).replace('*', np.nan)
        
    df['capacity'] = df['capacity'].astype(float)
    
    df = coal_convert_statuses_for_map(df)
    test_status_for_map(df)
    
    # --------------
    # COUNTRIES
    # change 'Bahamas' to 'the Bahamas'
    df['country'] = df['country'].str.replace('Bahamas', 'the Bahamas')
    df['country'] = df['country'].str.replace('the the Bahamas', 'the Bahamas')
    
    # check that all rows have a country entered
    test_for_country_entries(df)
    # --------------
    
    df = latin_america_fill_in_missing_local_language_versions(df)
    
    # check for missing values
    test_map_specified_cells_have_values(df, 'coal_steel')
    
    # TEST: check for missing coordinates
    no_coord = df[(df['lat'].isna()) | (df['lng'].isna())]
    if len(no_coord) > 0:
        # show any rows with no location data
        # ('project_en' only exists for the Latin America Portal)
        name_col = 'project_en' if 'project_en' in df.columns else 'project'
        print(no_coord[[name_col, 'country', 'lat', 'lng']])

        # keep only rows with both lat & lng
        df = df[df['lat'].notna() & df['lng'].notna()]
    # end of test
        
    if map_choice == 'Latin America Portal - coal-steel':
        cols_to_check = ['project_en', 'project', 'url_en'] # 'url' # modified 2023-10-27 to remove local language URL
    else:
        cols_to_check = ['project', 'url']
    find_multi_instead_of_one_to_one(df, cols_to_check)   

    # export coal data for map
    if export_files == True:        
        coal_compiled_for_map_file_name = f'{map_choice} - map data {save_timestamp}.xlsx'
        df.to_excel(
            path_for_download_and_map_files + 
            coal_compiled_for_map_file_name,
            index=False
        )
        print("*"*40)
        print(f"Exported file {coal_compiled_for_map_file_name}")
        print("*"*40)
    else:
        print("*"*40)
        print("Did not export coal map file")
        print("*"*40)
        
    return df
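
The missing-coordinates step in the function above keeps a row only when both `lat` and `lng` are present. A plain-Python sketch of that rule, assuming rows are dicts and that a missing value is either `None` or a float NaN; the function names `is_missing` and `keep_rows_with_coordinates` are hypothetical and the sample rows are made up:

```python
import math


def is_missing(value):
    """True for None or a float NaN (the assumed markers for missing data)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_rows_with_coordinates(rows):
    """Keep only rows where both 'lat' and 'lng' have values."""
    return [
        row for row in rows
        if not is_missing(row['lat']) and not is_missing(row['lng'])
    ]


# made-up sample rows
sample_points = [
    {'project': 'A', 'lat': 10.0, 'lng': -70.0},
    {'project': 'B', 'lat': float('nan'), 'lng': -70.0},
    {'project': 'C', 'lat': 10.0, 'lng': None},
]
print([row['project'] for row in keep_rows_with_coordinates(sample_points)])
```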

In [1210]:
def coal_convert_statuses_for_map(df):
    """
    Do for current maps (as of March 2022). 
    """
    df['status'] = df['status'].str.lower()
    
    # convert statuses
    if two_column_status == True:
        # create column 'status_legend'
        df['status_legend'] = df.copy()['status'].str.lower().replace({
            'retired': 'retired_plus',
            'closed': 'retired_plus',
        })
        
    else:
        # convert values within column 'status'
        df['status'] = df.copy()['status'].str.lower().replace({
            'closed': 'retired',
        })
        
    if map_choice == 'Latin America Portal - coal-steel':
        # capitalize first letter
        df['status'] = df['status'].str[0].str.upper() + df['status'].str[1:].str.lower()
    else:
        # leave as lowercase
        pass

    return df
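
The status rules above can be shown on single values. This is a plain-Python sketch, not the notebook's pandas code: `convert_status` models the single-column branch (where `closed` becomes `retired`), `status_legend` models the extra legend column of the two-column branch, and both function names are hypothetical:

```python
def convert_status(status, capitalize=False):
    """Lowercase a status and map 'closed' to 'retired'.

    With capitalize=True, the first letter is upper-cased, as done for the
    Latin America Portal.
    """
    status = status.lower()
    if status == 'closed':
        status = 'retired'
    if capitalize:
        status = status[:1].upper() + status[1:]
    return status


def status_legend(status):
    """Return the legend value: 'retired' and 'closed' share 'retired_plus'."""
    status = status.lower()
    return {'retired': 'retired_plus', 'closed': 'retired_plus'}.get(status, status)


print(convert_status('Closed'))
print(convert_status('OPERATING', capitalize=True))
print(status_legend('closed'))
```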

In [1211]:
def test_cols_coal_steel_map(coal_steel_map_df):
    """ Test that column headings are as expected for each map.
    """
    if map_choice == 'Latin America Portal - coal-steel':
        coal_steel_expected_cols = [
            'type', 'project_en', 'project', 'url_en', 'unit', 
            # 'url', modified 2023-10-27 to remove local language URL
            'owner', 'parent', 'province', 'country', 'status', 
            'capacity', 'production', 'capacity_production_unit', 
            'start_year', 'lat', 'lng', 'coal_type', 'coal_grade', 
            'mine_type', 'reserve_size', 'terminal_type', 'iron_capacity', 
            'steel_process', 'steel_equipment', 'status_legend'
        ]
        test_coal_steel_map_columns_evaluate(coal_steel_map_df, coal_steel_expected_cols)
        
    elif map_choice == 'Coal Terminals':
        # note: currently some entries are only in the pop-ups, not in the map table:
        # 'coal_type', 'coal_grade', 'mine_type', 'coal_source', 'terminal_type', 
        coal_steel_expected_cols = [
            'type', 'project', 'url', 'unit', 'owner', 'parent', 'province', 'country', 'status', 
            'capacity', 'production', 'capacity_production_unit', 'start_year', 'lat', 'lng', 
            'coal_type', 'coal_grade', 'mine_type', 'coal_source', 'terminal_type', 
        ]
        test_coal_steel_map_columns_evaluate(coal_steel_map_df, coal_steel_expected_cols)
    else:
        print(f"Not set up to check coal-steel map file columns for map_choice: {map_choice}")

In [1212]:
def test_coal_steel_map_columns_evaluate(coal_steel_map_df, expected_cols):
    if set(coal_steel_map_df.columns.tolist()) == set(expected_cols):
        pass
    else:
        print('Error!' + f" Map columns were not as expected.")
        for x in coal_steel_map_df.columns:
            if x not in expected_cols:
                print(f"Column in map df not in expected_cols: {x}")
        for x in expected_cols:
            if x not in coal_steel_map_df.columns:
                print(f"Column in expected_cols not in map df: {x}")
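
The column check above reports differences in both directions: columns present but not expected, and columns expected but not present. A small plain-Python sketch of the same comparison that returns the two lists instead of printing them; the function name `compare_columns` is hypothetical:

```python
def compare_columns(actual_cols, expected_cols):
    """Return (unexpected, missing) column names, preserving input order.

    unexpected: in actual_cols but not in expected_cols
    missing:    in expected_cols but not in actual_cols
    """
    unexpected = [col for col in actual_cols if col not in expected_cols]
    missing = [col for col in expected_cols if col not in actual_cols]
    return unexpected, missing


unexpected, missing = compare_columns(
    ['type', 'project', 'extra'],
    ['type', 'project', 'url'],
)
print(unexpected, missing)
```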

In [1213]:
def compile_all_coal_steel_data(
    map_choice, data_versions_dict, data_files_and_paths, export_files
):
    """
    For download, update Excel file with multiple sheets:
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
    
    For map, write to Excel file specifically for the map.
    """
    print('*'*40)
    print(f"Running compile_all_coal_steel_data for map_choice: {map_choice}")
    print('-'*40)
    
    # CREATE FILE NAME:
    download_file_name = f'{map_choice} - data download {save_timestamp}.xlsx'
    path_and_filename_for_download = path_for_download_and_map_files + download_file_name
    
    if map_choice == 'Latin America Portal - coal-steel':
        download_file_name_spanish = f'Portal Latino Americano - carbón y acero - descargar datos {save_timestamp}.xlsx'
        path_and_filename_for_download_spanish = path_for_download_and_map_files + download_file_name_spanish
    # =======
    
    coal_data_for_download_list = [] # initialize
    coal_data_for_download_list_spanish = [] # initialize
    coal_data_for_map_list = [] # initialize

    if 'coal plants' in data_versions_dict[map_choice].keys():
        coal_plants_for_download_dict, coal_plants_for_map = run_all_coal_plant_functions(
            map_choice, data_versions_dict, data_files_and_paths,
        )
        coal_data_for_download_list += [('Coal plants - data', coal_plants_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - coal-steel':
            coal_data_for_download_list_spanish += [('Centrales de carbón - datos', coal_plants_for_download_dict['Spanish'])]
        
        coal_data_for_map_list += [coal_plants_for_map]

    if 'coal mines' in data_versions_dict[map_choice].keys():
        coal_mines_for_download_dict, coal_mines_for_map = run_all_coal_mine_functions(
            map_choice, data_versions_dict, data_files_and_paths,
        )
        coal_data_for_download_list += [('Coal mines - data', coal_mines_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - coal-steel':
            coal_data_for_download_list_spanish += [('Minas de carbón - datos', coal_mines_for_download_dict['Spanish'])]
        
        coal_data_for_map_list += [coal_mines_for_map]

    if 'coal terminals' in data_versions_dict[map_choice].keys():
        coal_terminals_for_download_dict, coal_terminals_for_map = run_all_coal_terminal_functions(
            map_choice, data_versions_dict, data_files_and_paths
        )
        coal_data_for_download_list += [('Coal terminals - data', coal_terminals_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - coal-steel':
            coal_data_for_download_list_spanish += [('Terminales de carbón - datos', coal_terminals_for_download_dict['Spanish'])]
        
        coal_data_for_map_list += [coal_terminals_for_map]

    if 'steel plants' in data_versions_dict[map_choice].keys():
        steel_plants_for_download_dict, steel_plants_for_map = run_all_steel_plant_functions(
            map_choice, data_versions_dict, data_files_and_paths
        )
        coal_data_for_download_list += [('Steel plants - data', steel_plants_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - coal-steel':
            coal_data_for_download_list_spanish += [('Plantas de acero - datos', steel_plants_for_download_dict['Spanish'])]
        
        coal_data_for_map_list += [steel_plants_for_map]
        
    print("*"*40 + "\nFinished running functions for each tracker\n" + "-"*40)
    
    if len(coal_data_for_map_list) == 0:
        print("There was no coal data to add to map")
        print("*"*40)
        coal_steel_map_df = pd.DataFrame()
    
    else:
        if export_files == False:
            print("*"*40 + "\nThere was coal data for download file, but did not create Excel file\n" + "*"*40)
            
        elif export_files == True:            
            print(f"Exporting data to download file: {download_file_name}")
            with pd.ExcelWriter(path_and_filename_for_download) as writer:
                for df_name_tuple in coal_data_for_download_list:
                    sheet_name = df_name_tuple[0]
                    df = df_name_tuple[1]
                    df.to_excel(
                        writer, 
                        sheet_name=sheet_name, 
                        index=False
                    )
                    print(f"Wrote to download file: {sheet_name}")
            if map_choice == 'Latin America Portal - coal-steel':
                print("Writing to Spanish version of download file")
                with pd.ExcelWriter(path_and_filename_for_download_spanish) as writer:
                    for df_name_tuple in coal_data_for_download_list_spanish:
                        sheet_name = df_name_tuple[0]
                        df = df_name_tuple[1]
                        df.to_excel(
                            writer, 
                            sheet_name=sheet_name, 
                            index=False
                        )
                        print(f"Wrote to download file: {sheet_name}")
            print("-"*40)
                        
        # finish map data & export Excel
        coal_steel_map_df = coal_data_for_map_final_processing_and_export(coal_data_for_map_list, path_for_download_and_map_files)
        
    # TESTS:
    test_type_counts(coal_steel_map_df)
    test_cols_coal_steel_map(coal_steel_map_df)
    
    print("-"*40)
    print("Finished compile_all_coal_steel_data")
    print("*"*40)

    return (coal_steel_map_df, coal_data_for_download_list, coal_data_for_download_list_spanish)

# Oil & gas data

## Oil & gas: general functions

In [1214]:
def create_owner_and_parent_strings(main_df, parent_df):
    """
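    Fill in columns 'Owner' and 'Parent' of main_df with semicolon-separated strings,
    built from columns 'Owner 1'...'Owner 5' (and their '%' columns) in main_df
    and 'Parent 1'...'Parent 5' (and their '%' columns) in parent_df.

    Works for GOGET & GGPT.
    """
    if 'Owner' in parent_df.columns:
        parent_df = parent_df.set_index('Owner')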
    
    for row in main_df.index:
        owners_str = '' # initialize
        parents_str = '' # initialize

        for o_num in range(1, 5+1):
            owner_num = main_df.at[row, f'Owner {o_num}']
            owner_num_fract = main_df.at[row, f'Owner {o_num} %']
            if pd.isna(owner_num):
                owner_str = ''
            elif owner_num.lower() == 'other':
                # fill in 'other' as parent as well, with same percentage
                owner_num_pct = convert_ownership_fract_to_pct(owner_num_fract)                
                parents_str += f"other [{owner_num_pct}]; "
            else:   
                # create owner_str, add to collection (owners_str)
                owner_str = create_owner_string_for_row_and_owner_num(main_df, row, o_num)
                owners_str += owner_str
                
                if 'unknown %' not in owner_str:
                
                    # for each owner, look up parents in sheet parent_df
                    for p_num in range(1, 5+1):
                        if owner_num in parent_df.index:
                            parent_num = parent_df.at[owner_num, f'Parent {p_num}']
                            try:
                                if pd.isna(parent_num):
                                    pass
                                else:
                                    # get share of owner that parent owns
                                    parent_num_fract = parent_df.at[owner_num, f'Parent {p_num} %']

                                    # calculate fractional ownership of the O&G unit for this parent
                                    parent_num_own_unit_fract = owner_num_fract * parent_num_fract

                                    if parent_num_own_unit_fract == float(1):
                                        parent_str = f"{parent_num}; "
                                    else:
                                        parent_num_own_unit_pct = convert_ownership_fract_to_pct(parent_num_own_unit_fract)
                                        parent_str = f"{parent_num} [{parent_num_own_unit_pct}]; "

                                    # add to collection (parents_str)
                                    parents_str += parent_str
                            except Exception:
                                print(f"Hit exception for pd.isna(parent_num), for owner: {owner_str}:")

                        else:
                            print("Error!" + f" Owner isn't in 'ownership' sheet: {owner_num}")
                            
                else:
                    pass

        # clean up ending
        owners_str = owners_str.strip('; ')
        parents_str = parents_str.strip('; ')
        
        # put into main_df
        main_df.at[row, 'Owner'] = owners_str
        main_df.at[row, 'Parent'] = parents_str
        
    return main_df

In [1215]:
def create_owner_string_for_row_and_owner_num(main_df, row, num):
    owner_num = main_df.at[row, f'Owner {num}']
    owner_num_fract = main_df.at[row, f'Owner {num} %']
    if pd.isna(owner_num):
        owner_str = ''
    else:
        if owner_num_fract == float(1):
            owner_str = f"{owner_num}; "
        else:
            owner_num_pct = convert_ownership_fract_to_pct(owner_num_fract)
            owner_str = f"{owner_num} [{owner_num_pct}]; "
    
    return owner_str

In [1216]:
def convert_ownership_fract_to_pct(fract):
    if pd.isna(fract):
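        # use 'unknown %' so that the check `'unknown %' not in owner_str` (in create_owner_and_parent_strings) can match
        pct = 'unknown %'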
    else:
        pct = str(round(fract*100, 1))
        pct = pct.rsplit('.0', 1)[0]
        pct = pct + '%'

    return pct
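
The three functions above share one labeling convention: a full (100%) share is shown as the bare name, a partial share as `name [x%]`, and a missing share as `unknown %`. Below is a minimal standalone sketch of that convention, assuming plain Python floats rather than DataFrame cells. The names `format_share` and `owner_label` are hypothetical and are not part of this notebook.

```python
import math


def format_share(fract):
    """Hypothetical helper: turn an ownership fraction into a display label."""
    if fract is None or math.isnan(fract):
        return "unknown %"
    pct = round(fract * 100, 1)
    # the 'g' format drops a trailing ".0", so 50.0 is shown as "50"
    return f"{pct:g}%"


def owner_label(name, fract):
    """Hypothetical helper: bare name for a 100% share, 'name [x%]' otherwise."""
    if fract == 1:
        return name
    return f"{name} [{format_share(fract)}]"


print(owner_label("Acme Energy", 1.0))
print(owner_label("Acme Energy", 0.25))
print(owner_label("Acme Energy", float("nan")))
```

Using the `g` format instead of string splitting is a design choice of this sketch only; the notebook's own function strips the trailing `.0` with `rsplit`.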

## Gas plants

In [1217]:
def gas_plants_read_official_data_from_excel(data_keys_titles):
    """
    Read GGPT data from the official data release, then check its columns
    against the expected list of official release columns.
    
    If map is Latin America Portal, also have to read working file, merge in local language info.
    """
    name = 'gas_plants_official'
    
    print("*"*40)
    print("Gas plants: read official version of data from local Excel file.")
    print(f'"{name}"')
    print("-"*40)
  
    key = data_keys_titles[name][0]
    title = data_keys_titles[name][1]
    gas_plants = gspread_access_file_read_only(key, title)
    # gas_plant_xl = pd.ExcelFile(data_files_and_paths['gas_plants_official_path'] + file_name)
    gas_plant_dtypes = {'Plant name': str, 'Unit name': str, 'Longitude': float, 'Latitude': float}
    columns_to_change = list(gas_plant_dtypes.keys())
    gas_plants[columns_to_change] = gas_plants[columns_to_change].astype(gas_plant_dtypes)

    gas_plants_official_columns_aug_2023 = [
        'Wiki URL', 'Country', 'Plant name', 'Plant name (local script)', 'Unit name', 
        'Fuel', 'Capacity (MW)', 'Status', 'Technology', 'CHP', 'Start year', 
        'Retired year', 'Planned retire', 
        'Owner', 'Parent', 'Operator', # note: 'Operator' was new in official release Feb 2023
        'Latitude', 'Longitude', 
        'Location accuracy', 'City', 'Local area (taluk, county)', 
        'Region', 'Sub-region', # note: 'Subregion' and 'Previous Region' were new in official release Feb 2023 
        # note: Sub-region replaced Subregion in 2024 release and Previous Region no longer included
        'Major area (prefecture, district)', 'Subnational unit (province, state)', 
        'Other IDs (location)', 'Other IDs (unit)', 'Other plant names', 
        'Captive [heat, power, both]', 'Captive industry type', 
        'Captive non-industry use [heat, power, both, none]', 
        'GEM location ID', 'GEM unit ID',
        # 'Wiki URL local language',
        'Hydrogen capable?', 'CCS attachment?', 'Coal-to-gas conversion/replacement?',        
    ]
    print("Checking columns in gas_plants")
    for col in gas_plants.columns:
        if col not in gas_plants_official_columns_aug_2023:
            print("Error!" + f" There was a column in the file read in that wasn't in the expected official release columns (Aug 2023 list): {col}")

    for col in gas_plants_official_columns_aug_2023:
        if col not in gas_plants.columns:
            print("Error!" + f" There was a column in the expected official release columns (Aug 2023 list) that wasn't in the file read in: {col}")
            gas_plants[col] = ''
    
    return gas_plants

In [1218]:
def gas_plants_read_working_or_interim_local_copy(data_files_and_paths):
    """ Reads gas plant data from local copy of working file or interim release (update from major release).
    
    Works for interim files formatted the same way as the working file, 
    with various sheets for different regions, and a separate owner-parent sheet.
    
    (If an interim release file has already been created in the same format as a major release, 
    it can be put into the code for specifying versions as an 'official' release.)
    
    Currently only used for:
    * Global Gas Plant Tracker: compilation from working to create official files
    * Latin America: working file read to get local language info
    * Europe Gas Tracker: interim for update of Europe since last major release
    * Global Gas Plant Tracker: interim for update of Europe since last major release
    """
    
    all_data_sheets = [
        "North America", "China", "European Union", "Europe", "Latin America",
        "Africa (sub-Saharan)", "Middle East & North Africa", 
        "East Asia", "Eurasia", "South Asia", "SE Asia",
        "Russia", "Australia and New Zealand"
    ]
    
    if map_choice == 'Africa Gas Tracker':
        sel_sheets = ["Africa (sub-Saharan)", "Middle East & North Africa"]
    elif map_choice == 'Asia Gas Tracker':
        sel_sheets = ["China", "East Asia", "South Asia", "SE Asia"]
    elif map_choice == 'Europe Gas Tracker':
        sel_sheets = ["European Union", "Europe"]
    elif map_choice == 'Latin America Portal - oil-gas':
        sel_sheets = ["Latin America"]
    else:
        # global
        sel_sheets = all_data_sheets

    print("Reading gas plants from working file (local copy)")
    gas_plants_xl = pd.ExcelFile(
        data_files_and_paths['gas_plants_working_path'] + 
        data_files_and_paths['gas_plants_working_file']
    )

    # read all specified sheets, and then compile them all
    # (will be filtered by country in a later step)
    gas_plant_dfs_list = [] # initialize

    for sel_sheet in sel_sheets:
        try:
            gas_plants_one_sheet = pd.read_excel(gas_plants_xl, sheet_name=sel_sheet)
            gas_plant_dfs_list += [gas_plants_one_sheet]
            print(f"Read sheet {sel_sheet}")
        except Exception:
            print(f"Unable to read sheet {sel_sheet}")

    gas_plants = pd.concat(gas_plant_dfs_list, sort=False).reset_index(drop=True)
    
    # parent data
    gas_plants_parent_df = pd.read_excel(gas_plants_xl, sheet_name='owner-parent')
    
    # =======
    # clean up
    
    # MAIN DATA:
    # exclude empty rows:
    gas_plants = gas_plants[gas_plants['Plant name'].notna()]
    gas_plants = gas_plants[gas_plants['Plant name']!='']
    gas_plants = gas_plants.reset_index(drop=True)
    
    # round off capacities
    gas_plants['Capacity (MW)'] = gas_plants['Capacity (MW)'].replace('not found', np.nan).astype(float).round(0)
    
    # PARENT DATA:
    gas_plants_parent_df = gas_plants_parent_df.dropna(subset=['Owner'])
    gas_plants_parent_df = gas_plants_parent_df[gas_plants_parent_df['Owner']!='']
    
    # before setting index, test that there aren't repeated owners
    test_repeated_owners(gas_plants_parent_df)
    gas_plants_parent_df = gas_plants_parent_df.set_index('Owner')
    
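    # ALL DATA:
    # drop empty columns
    # (reassigning a loop variable `df` would not change the original DataFrames,
    # so drop from each DataFrame explicitly)
    gas_plants = gas_plants.drop(
        columns=[col for col in gas_plants.columns if 'Unnamed: ' in str(col)]
    )
    gas_plants_parent_df = gas_plants_parent_df.drop(
        columns=[col for col in gas_plants_parent_df.columns if 'Unnamed: ' in str(col)]
    )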

    return gas_plants, gas_plants_parent_df
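
A pitfall when cleaning up several DataFrames in one loop: `df = df.drop(...)` only rebinds the loop variable, so the original DataFrames keep their columns. The sketch below shows the effect on a toy frame, then a helper that returns the cleaned frame so the caller can reassign it. The helper name `drop_unnamed_columns` and the toy data are assumptions for illustration.

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Hypothetical helper: return a copy of df without 'Unnamed: N' columns."""
    unnamed = [col for col in df.columns if str(col).startswith("Unnamed: ")]
    return df.drop(columns=unnamed)


frame = pd.DataFrame({"Plant name": ["A"], "Unnamed: 3": [None]})

# Rebinding the loop variable leaves `frame` itself untouched
for df in [frame]:
    df = df.drop(columns="Unnamed: 3")
columns_after_loop = frame.columns.tolist()
print(columns_after_loop)

# Reassigning the result of the helper does change what `frame` refers to
frame = drop_unnamed_columns(frame)
print(frame.columns.tolist())
```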

In [1219]:
def for_pygsheets_convert_ggpt_dtypes_and_values(gas_plants, gas_plants_parent_df):
    for df in [gas_plants, gas_plants_parent_df]:
        for col in df.columns:
            df[col] = df[col].replace('', np.nan)    
    
    # main sheet: convert to float
    for col in ['Latitude', 'Longitude']:
        gas_plants[col] = gas_plants[col].replace('', np.nan).astype(float)
        
    for num in range(1, 5+1):
        owner_pct_col = f'Owner {num} %'
        owner_ser_strs = gas_plants[owner_pct_col].astype(str)
        owner_ser_strs = owner_ser_strs.replace('', np.nan).str.replace('%', '').astype(float).div(100)
        gas_plants[owner_pct_col] = owner_ser_strs

    # =====
    # parent sheet: convert to float
    for num in range(1, 10+1):
        parent_pct_col = f'Parent {num} %'
        try:
            parent_ser_strs = gas_plants_parent_df[parent_pct_col]
            parent_ser_strs = parent_ser_strs.astype(str).replace('', np.nan).str.replace('%', '').astype(float).div(100)
            gas_plants_parent_df[parent_pct_col] = parent_ser_strs
            
        except:
            print(f"Exception in trying to convert column {parent_pct_col}")
            print("All columns in df:")
            print(gas_plants_parent_df.columns.tolist())
            print("=======")

    return gas_plants, gas_plants_parent_df
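
The percentage conversion above can be tried on a toy Series. This is a minimal sketch using `pd.to_numeric(..., errors="coerce")`, which is a swapped-in technique rather than the notebook's own `astype(float)` chain: anything that cannot be parsed (empty string, missing value) becomes NaN instead of raising. The function name `pct_strings_to_fractions` is hypothetical.

```python
import pandas as pd


def pct_strings_to_fractions(ser):
    """Hypothetical helper: '50%' -> 0.5; empty or missing entries -> NaN."""
    cleaned = ser.astype(str).str.replace("%", "", regex=False).str.strip()
    # errors="coerce" turns unparseable entries ('', 'None', 'nan') into NaN
    return pd.to_numeric(cleaned, errors="coerce").div(100)


shares = pd.Series(["50%", "", None, "12.5%"])
fractions = pct_strings_to_fractions(shares)
print(fractions.tolist())
```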

In [1220]:
# def gas_plants_merge_local_language_info(gas_plants):
#     """
#     Merge in local language information (plant name, wiki page URL) from working file,
#     to create augmented version of official data.
    
#     Only runs for Latin America Portal.
#     """

#     if map_choice == 'Latin America Portal - oil-gas':
        
#         # have to read working data, merge in local language info
#         gas_plants_working, gas_plants_parent_df_working = gas_plants_read_working_or_interim_local_copy(data_files_and_paths)

#         # Tracker ID is the ID for each unit; supposed to be unique
#         # If these IDs are in fact unique, then can merge info from working file using this ID
#         num_duplicated_rows = gas_plants['GEM unit ID'].duplicated().sum()
        
#         # in working, filter out any rows with NaN for GEM unit ID
#         gas_plants_working = gas_plants_working[gas_plants_working['GEM unit ID'].isna()==False]
        
#         num_duplicated_rows_working = gas_plants_working['GEM unit ID'].duplicated().sum()

#         if num_duplicated_rows == 0 and num_duplicated_rows_working == 0:
#             gas_plants_initial_len = len(gas_plants)
#             # note: official release Feb 2022 has 'Plant name (local script)', but not local language wiki pages
#             gas_plants_merged = pd.merge(
#                 gas_plants, 
#                 gas_plants_working[['GEM unit ID', 'Wiki URL local language']],
#                 on='GEM unit ID',
#                 how='left'
#             )
#             if gas_plants_initial_len == len(gas_plants_merged):
#                 # create new version of gas_plants, overwriting previous version
#                 gas_plants = gas_plants_merged.copy()
                
#                 print("Merged in local language wiki pages")
                
#                 return gas_plants

#             else:
#                 print("Error!" + f" There was a difference in the length of dfs.")
#                 return pd.DataFrame()

#         else:
#             print("Error!" + f" There were duplicate GEM unit ID entries; will return empty DataFrame.")
#             print(f"num_duplicated_rows: {num_duplicated_rows} & num_duplicated_rows_working: {num_duplicated_rows_working}")
#             print(gas_plants[gas_plants['GEM unit ID'].duplicated()][['Plant name', 'Unit name', 'GEM unit ID']])
#             return pd.DataFrame()
        
#     else:
#         return gas_plants

In [1221]:
def gas_plants_clean_data(df):
    df = harmonize_countries(df)
    
    for col in df.columns:
        if 'Unnamed: ' in col:
            df = df.drop(col, axis=1)
    
    # remove empty rows
    df = df.dropna(how='all')
  
    df['Capacity (MW)'] = df['Capacity (MW)'].replace('not found', np.nan)

    df = gas_plant_convert_fuel_entries(df)

    df = df.reset_index(drop=True)
    
    # convert numerical columns
    for col in ['Capacity (MW)', 'Latitude', 'Longitude']:
        df[col] = df[col].astype(str).str.replace(',', '').astype(float)
        
    test_gas_plant_missing_plant_name(df, 'Plant name')
    
    return df

In [1222]:
def gas_plant_convert_fuel_entries(df):
    """
    Convert the fuels from short codes to full terms
    """
    # put slashes at ends, so that each fuel can be identified with slashes on either side
    df['Fuel'] = '/' + df['Fuel'] + '/'

    fuel_dict = {
        'B': 'biomass', 
        'BFG': 'blast furnace gas',
        'BL': 'bioliquids',
        'C': 'coal',
        'CM': 'coalbed methane',
        'COG': 'coke oven gas',
        'CR': 'crude oil',
        'D': 'diesel',
        'FO': 'fuel oil',
        'FOG': 'FINEX off gas',
        'G': 'gasoline',
        'H': 'hydrogen',
        'HFO': 'fuel oil', # supposed to be only 'FO' (GGPT editing manual)
        'J': 'jet fuel',
        'KER': 'kerosene',
        'LFG': 'refuse (landfill gas)',
        'LNG': 'liquefied natural gas',
        'LPG': 'liquefied petroleum gas',
        'N': 'naphtha',
        'NG': 'natural gas',
        'OG': 'gas (unknown)',
        'S': 'solar',
        'WSTH-NG': 'waste heat from natural gas',
        'SG': 'synthesized gas',
    }
    for fuel in list(fuel_dict.keys()):
        df['Fuel'] = df['Fuel'].str.replace(
            f'/{fuel}/', f'/{fuel_dict[fuel]}/', 
            regex=False)

    # remove the slashes at ends added at start of function
    # (plain '/' rather than '\/', which is an invalid escape sequence in a Python string)
    df['Fuel'] = df['Fuel'].str.strip('/')
    
    # replace slashes with commas
    df['Fuel'] = df['Fuel'].str.replace('/', ', ', regex=False)
    
    return df
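
The function above wraps each entry in slashes so that a short code such as `G` cannot match inside `NG` or `LNG`. An alternative that gives the same protection is to split on `/` and map each code separately. The sketch below shows that split-and-map variant on plain strings, assuming codes never contain a slash. The name `expand_fuel_codes` and the three-entry dictionary are for illustration only.

```python
def expand_fuel_codes(fuel, fuel_names):
    """Hypothetical helper: 'NG/D' -> 'natural gas, diesel'.

    Codes that are not in fuel_names are kept unchanged.
    """
    codes = fuel.split("/")
    return ", ".join(fuel_names.get(code, code) for code in codes)


# small subset of the code dictionary, for illustration
fuel_names = {"D": "diesel", "NG": "natural gas", "LNG": "liquefied natural gas"}

print(expand_fuel_codes("NG/D", fuel_names))
print(expand_fuel_codes("LNG", fuel_names))
print(expand_fuel_codes("NG/XYZ", fuel_names))
```

On a DataFrame column, this could be applied per entry with `Series.map`, with missing values handled separately.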


In [1223]:
def clean_owners_gas_plants(df):
    for col in ['Owner', 'Parent']:
        df[col] = df[col].str.replace(' [0%]', '', regex=False)
        df[col] = df[col].str.replace(' [100%]', '', regex=False)
        
    return df

In [1224]:
# def fill_in_missing_local_project_name(df):
#     if map_choice.startswith('Latin America Portal'):
#         # 'project': name in Spanish/Portuguese
#         # 'project_en': name in English

#         print("Running fill_in_missing_local_project_name")
        
#         for row in df.index:
#             loc_name = df.at[row, 'project']
#             if loc_name == '' or pd.isna(loc_name):
#                 eng_name = df.at[row, 'project_en']
#                 df.at[row, 'project'] = eng_name
#                 # print("Warning!" + f" For row {row}, local language project name was missing; set equal to English name: {eng_name}")
                
#     return df

In [1225]:
def test_gas_plant_missing_plant_name(df, plant_name_col):
    # if map_choice == 'Latin America Portal - oil-gas':
    #     plant_name_col = 'project_en'
    # else:
    #     if 'project' in df.columns:
    #         plant_name_col = 'project'
    #     elif 'Plant name' in df.columns:
    #         plant_name_col = 'Plant name'
    #     else:
    #         print("Error!" + " Didn't find column with name either 'Plant name' or 'project'")
    
    no_plant_name = df[df[plant_name_col].isna()]
    if len(no_plant_name) > 0:
        print(f"There were {len(no_plant_name)} rows with no plant name in column {plant_name_col}:")
        # print(no_plant_name.T)
        print(no_plant_name[no_plant_name.columns.tolist()[0:5]])
    else:
        # print(f"Test passed, checking col {plant_name_col}")
        pass

In [1226]:
def gas_plants_modify_for_map(gas_plants):    
    print('-'*40) # for UI

    df = gas_plants.copy()    
    df = gas_plants_clean_data(df) 
    df = df.dropna(how='all')
    df = clean_owners_gas_plants(df)

    ggpt_rename_for_map_universal_dict = {
        'Unit name': 'unit',
        'Capacity (MW)':  'capacity',
        'TOTAL Capacity elec. (MW)': 'capacity', # to handle new version in Feb 2023 release
        'Owner': 'owner', 
        'Parent': 'parent', 
        'Status': 'status',   
        'Technology': 'technology',
        'Fuel': 'fuel_type',
        'Subnational unit (province, state)': 'province',
        'Latitude': 'lat',
        'Longitude': 'lng',
        'Start year': 'start_year',
        'Region': 'region', 
    }
    df = df.rename(columns=ggpt_rename_for_map_universal_dict)
    # TEST:
    for col in list(ggpt_rename_for_map_universal_dict.values()):
        if col not in df.columns:
            print("Error!" + f" For all maps, after rename, still missing column {col}")
    # END OF TEST
    
    if map_choice == 'Oil & Gas Plant':
        # map file column names from "Global Gas Plant Tracker (GGPT) - Latest Data for export"
        # https://docs.google.com/spreadsheets/d/1l3khUwO0otjXN7dBfsJPhaoQjrQFrMJn4SktBbkm1Kk/edit#gid=997296159
        ggpt_rename_for_map_file_dict = {
            'Plant name': 'project', 
            'Plant name (local script)': 'project_loc', 
            'Country': 'country', 
            'Wiki URL': 'url',
        }
        df = df.rename(columns=ggpt_rename_for_map_file_dict)
    else:
        # gas plants used for a map other than GGPT; it's a multi-tracker map
        if map_choice == 'Latin America Portal - oil-gas':
            print("Renaming columns for Latin America Portal")
            lat_am_rename_dict = {
                'Plant name': 'project_en',
                'Plant name (local script)': 'project',
                'Wiki URL': 'url_en',
                # 'Wiki URL local language': 'url', # modified 2023-10-27 to remove local language URL
                'Country': 'countries',
            }
            df = df.rename(columns=lat_am_rename_dict)

            # TEST:
            for col in list(lat_am_rename_dict.values()):
                if col not in df.columns:
                    print("Error!" + f" For Latin America Portal, after rename, still missing column {col}")
            # END OF TEST

            # create another column 'unit_en', same as 'unit'
            df['unit_en'] = df['unit']

        else:
            df = df.rename(columns={
                'Plant name': 'project',
                'Wiki URL': 'url',
                'Country': 'countries',
            })
        
    # fill in additional columns
    if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
        df['type'] = 'gas_power_plant' 
    elif map_choice in ['Latin America Portal - oil-gas']:
        df['type'] = 'gas_power_plant'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
        
    df['capacity_production_unit'] = 'MW'
    df['geom'] = 'point'
    df['start_year'] = df['start_year'].fillna('').astype(str)
    
    # drop reference columns (for easier viewing)
    for col in df.columns:
        if ' [ref]' in col:
            df = df.drop(col, axis=1)

    # drop columns with no name:
    if '' in df.columns:
        df = df.drop('', axis=1)
        
    
    gas_plants_for_map = df
    
    return gas_plants_for_map

In [1227]:
def test_repeated_owners(df):
    """
    Check whether there are any owners repeated in the 'Ownership' sheet.
    
    There is no need for repeated entries.
    If there are repeated entries, it causes an error in the function create_owner_and_parent_strings.
    """
    
    owner_count = df['Owner'].value_counts()
    owner_count_multi = owner_count[owner_count > 1]
    if len(owner_count_multi) > 0:
        print("Error!" + " There was at least one company with multiple entries in 'Ownership' sheet:")
        print(owner_count_multi)
    else:
        if error_verbose == True:
            print("Test passed: no owners had multiple entries")
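
The duplicate check above can be tried on a toy owner-parent table. This is a minimal sketch in which the helper returns the repeated owner names instead of printing them, so the caller can decide whether to stop. The name `find_repeated_owners` and the sample data are assumptions.

```python
import pandas as pd


def find_repeated_owners(parent_df, owner_col="Owner"):
    """Hypothetical helper: list owners that appear in more than one row."""
    counts = parent_df[owner_col].value_counts()
    return counts[counts > 1].index.tolist()


toy_parent_df = pd.DataFrame({
    "Owner": ["A Co", "B Co", "A Co"],
    "Parent 1": ["P1", "P2", "P1"],
})
repeated = find_repeated_owners(toy_parent_df)
print(repeated)
```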

In [1228]:
def gas_plants_fix_one_offs(df): 
    # if map_choice == 'Latin America Portal - oil-gas':
    #     local_wiki_fixes = {
    #         # # plants removed from working sheet after Feb 2022 GGPT release, because were too small or not using gas:
    #         # 'Santa Cruz power station (Bolivia)': 'https://gem.wiki/Termoel%C3%A9ctrica_de_Santa_Cruz',
    #         # 'Andes Vallenar power station': 'https://gem.wiki/Central_Térmica_Andes_Vallenar',
    #         # 'Cardones power station': 'https://gem.wiki/Central_Térmica_Cardones',
    #         # 'Orazul Acajutla power station': 'https://gem.wiki/Central_Termoeléctrica_Orazul_Acajutla',
    #         # 'Pacífico Acajutla power station': 'https://gem.wiki/Termoeléctrica_Pacífico_Acajutla',
    #         # 'Ilo power station': 'https://gem.wiki/Central_Termoel%C3%A9ctrica_de_Ilo',
    #         # 'Puerto Bravo power station': 'https://gem.wiki/Central_Termoeléctrica_Puerto_Bravo',
    #     }
    #     df = fix_one_offs(df, local_wiki_fixes, 'Plant name', 'Wiki URL local language')

    return df

In [1229]:
def gas_plants_create_data_download_versions_in_dict(gas_plants):
    
    gas_plants_for_download_dict = {'English': gas_plants.copy()}
    
    if map_choice == 'Latin America Portal - oil-gas':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        gas_plants_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = gas_plants, 
            trans_sheet_name = 'gas plants',
        )
        # append another key-value pair to the dictionary
        gas_plants_for_download_dict['Spanish'] = gas_plants_for_download_spanish
    
    return gas_plants_for_download_dict

In [1230]:
def gas_plants_process_working_owner_parent(
    gas_plants_working, gas_plants_parent_df_working
):

    # condense owner data
    for row in gas_plants_working.index:
        owner_str = '' # initialize
        for owner_num in range(1, 5+1):
            owner_name = gas_plants_working.at[row, f'Owner {owner_num}']
            owner_fract = gas_plants_working.at[row, f'Owner {owner_num} %']
            
            # although read by pygsheets, owner pcts were turned into floats earlier
            if pd.isna(owner_name) == False:
                if pd.isna(owner_fract) == False:
                    owner_pct = '%.1f' % (owner_fract * 100) + '%'
                    owner_str += f"{owner_name} [{owner_pct}]; "
                else:
                    owner_str += f"{owner_name} [unknown %]; "

        owner_str = owner_str.strip('; ').replace(' [100.0%]', '')

        gas_plants_working.at[row, 'Owner'] = owner_str

    # for each owner, calculate weighted parent shares
    for row in gas_plants_working.index:
        # 1. create df of parents
        owner_all_parents_list = [] # initialize
        for owner_num in range(1, 5+1):
            owner_name = gas_plants_working.at[row, f'Owner {owner_num}']
            owner_fract = gas_plants_working.at[row, f'Owner {owner_num} %']

            if owner_name != '' and pd.isna(owner_name)==False and owner_name!='nan':
                if owner_name.lower() == 'other':
                    owner_all_parents_list += [('other', owner_fract)]
                else:
                    # get corresponding parents as a series
                    parent_ser = gas_plants_parent_df_working.loc[owner_name]

                    # adjust pcts
                    for par_num in range(1, 10+1):
                        par_name = parent_ser.at[f'Parent {par_num}']
                        if pd.isna(par_name) == False:
                            par_fract = parent_ser.at[f'Parent {par_num} %']
                            par_fract_weighted = owner_fract * par_fract
                            owner_all_parents_list += [(par_name, par_fract_weighted)]

        owner_all_parents_df = pd.DataFrame(owner_all_parents_list, columns = ['Parent', 'Parent fract'])
        owner_all_parents_df = owner_all_parents_df.sort_values(by='Parent fract', ascending=False)
        
        # TEST:
        parent_sum = owner_all_parents_df['Parent fract'].sum()
        if parent_sum == 0:
            # no data
            pass
        else:
            abs_diff = abs(1 - parent_sum)
            if abs_diff < 1e-2:
                pass
            else:
                # print("Warning!" + f" For gas_plants_working row {row}, the parent sums didn't add to 100%: {parent_sum}")
                pass
        # END TEST

        # 2. create parent string
        owner_par_str = '' # initialize
        for row2 in owner_all_parents_df.index:
            par_name = owner_all_parents_df.at[row2, 'Parent']
            par_fract = owner_all_parents_df.at[row2, 'Parent fract']
            if pd.isna(par_fract) == False:
                par_pct = '%.1f' % (par_fract * 100) + '%'
                owner_par_str += f"{par_name} [{par_pct}]; "
            else:
                owner_par_str += f"{par_name} [unknown %]; "


        owner_par_str = owner_par_str.strip('; ')        
        gas_plants_working.at[row, 'Parent'] = owner_par_str
    
    return gas_plants_working
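
The weighting step above multiplies each owner's share of the unit by each parent's share of that owner. The sketch below works one example in plain Python. Unlike the function above, it sums the shares of a parent that is reached through more than one owner, which is a design choice of this sketch. The name `weighted_parent_shares` and the sample data are hypothetical.

```python
def weighted_parent_shares(owners, parents_by_owner):
    """Hypothetical helper: combine owner and parent fractions per parent.

    owners: list of (owner_name, owner_fract)
    parents_by_owner: dict of owner_name -> list of (parent_name, parent_fract)
    """
    shares = {}
    for owner_name, owner_fract in owners:
        if owner_name.lower() == "other":
            # 'other' is carried over as its own parent, with the same share
            parent_links = [("other", 1.0)]
        else:
            parent_links = parents_by_owner[owner_name]
        for parent_name, parent_fract in parent_links:
            weighted = owner_fract * parent_fract
            shares[parent_name] = shares.get(parent_name, 0.0) + weighted
    # largest share first
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


owners = [("A Co", 0.6), ("B Co", 0.4)]
parents_by_owner = {
    "A Co": [("P1", 1.0)],
    "B Co": [("P1", 0.5), ("P2", 0.5)],
}
result = weighted_parent_shares(owners, parents_by_owner)
total = sum(share for _, share in result)
print(result)
print(abs(1 - total) < 1e-2)  # same tolerance as the test in the function above
```

Here P1 holds 0.6 × 1.0 + 0.4 × 0.5 = 0.8 of the unit and P2 holds 0.4 × 0.5 = 0.2, so the shares add up to 1 within floating-point tolerance.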

In [1231]:
def gas_plants_create_official_version(gas_plants_working, gas_plants_parent_df_working):
    """
    Take working file read by pygsheets, and create version for official download, 
    and which can also be used for the map file.
    
    Major steps:
    * Compile all parent data into a single string
    """   
    
    df = gas_plants_process_working_owner_parent(
        gas_plants_working, gas_plants_parent_df_working
    )
    
    ggpt_official_cols = [
        'Wiki URL', 'Country', 'Plant name', 'Plant name (local script)', 'Unit name', 'Fuel', 
        'Capacity (MW)', 'Status', 'Technology', 'CHP', 'Start year', 'Retired year', 'Planned retire', 
        'Owner', 'Parent', 'Latitude', 'Longitude', 'Location accuracy', 
        'Region', 'City', 'Local area (taluk, county)', 'Major area (prefecture, district)', 'Subnational unit (province, state)', 
        'Other IDs (location)', 'Other IDs (unit)', 'Other plant names', 
        'Captive [heat, power, both]', 'Captive industry type', 'Captive non-industry use [heat, power, both, none]', 
        'Hydrogen capable?', 'CCS attachment?', 'Coal-to-gas conversion/replacement?',
        'GEM location ID', 'GEM unit ID',
    ]

    df = df[ggpt_official_cols]

    gas_plants = df
    return gas_plants

In [1232]:
def run_all_gas_plant_functions(
    map_choice, data_versions_dict, data_keys_titles,
):

    if data_versions_dict[map_choice]['gas plants'] == 'official':
        gas_plants = gas_plants_read_official_data_from_excel(data_keys_titles)
        
    elif data_versions_dict[map_choice]['gas plants'] in ['working']: # 'interim'
        gas_plants_working, gas_plants_parent_df_working = gas_plants_read_working_or_interim_local_copy(data_keys_titles)
        # gas_plants, gas_plants_parent_df = for_pygsheets_convert_ggpt_dtypes_and_values(gas_plants, gas_plants_parent_df)
        
        # compile parent data into strings and select the official columns;
        # without this step, gas_plants is undefined when the filtering below runs
        gas_plants = gas_plants_create_official_version(gas_plants_working, gas_plants_parent_df_working)

    else:
        x = data_versions_dict[map_choice]['gas plants']
        print("Error!" + f" Unexpected case for data_versions_dict[map_choice]['gas plants']: {x}")

    # modified 2023-10-27 to remove local language URL
    # gas_plants = gas_plants_merge_local_language_info(gas_plants)

    gas_plants = filter_points_by_country(gas_plants, map_choice, sel_countries)
    gas_plants = gas_plants_fix_one_offs(gas_plants)
    
    # check data before processing for map
    if map_choice == 'Latin America Portal - oil-gas':
        cols_to_check = ['Plant name', 'Plant name (local script)', 'Wiki URL'] # 'Wiki URL local language'
    else:
        cols_to_check = ['Plant name', 'Wiki URL']
    find_multi_instead_of_one_to_one(gas_plants, cols_to_check)   
    
    gas_plants_for_download_dict = gas_plants_create_data_download_versions_in_dict(gas_plants)
    
    gas_plants_for_map = gas_plants_modify_for_map(gas_plants)
    
    print("-"*40)
    print("Gas plants: finished processing")
    print("-"*40)
    
    return gas_plants_for_download_dict, gas_plants_for_map

In [1233]:
# sandbox:
gas_plants = gas_plants_read_official_data_from_excel(data_keys_titles)
len(gas_plants[gas_plants['GEM location ID'].str.startswith('L1')]['GEM location ID'].unique())

****************************************
Gas plants: read official version of data from local Excel file.
"gas_plants_official"
----------------------------------------
Checking columns in gas_plants


5358

## Oil & gas pipelines

In [1234]:
# official columns as of Dec 2023 gas pipelines
pipeline_official_cols = [
    'ProjectID', 'Fuel', 'StartCountry', 'EndCountry', 'Countries', 
    'Wiki', 'PipelineName', 'SegmentName', 'OtherEnglishNames', 'Parent', 
    'Status', 'StartYear1', 'StartYear2', 'StartYear3', 'StopYear', 
    'Capacity', 'CapacityUnits', 'CapacityBcm/y', 'LengthKnownKm', 
    'LengthEstimateKm', 'LengthMergedKm', 'Diameter', 'DiameterUnits', 
    'FuelSource', 'StartLocation', 'StartPrefecture/District', 'StartState/Province', 
    'StartRegion', 'EndLocation', 'EndPrefecture/District', 'EndState/Province', 
    'EndRegion', 'WKTFormat',
    # 'FID', 'FIDYear', 
]

pipeline_EGT_2024_official_cols = [
    "PipelineName", "SegmentName", "Wiki", "ProjectID", "LastUpdated",
    "Fuel", "Countries", "Status", "OtherLanguagePrimaryPipelineName", "Owner",
    "Parent", "ParentHQCountry", "StartYear1", "StartYear2", "StartYear3",
    "ShelvedYear", "CancelledYear", "StopYear", "Capacity", "CapacityUnits",
    "CapacityBcm/y", "CapacityBOEd", "LengthKnownKm", "LengthEstimateKm",
    "LengthMergedKm", "Diameter", "DiameterUnits", "FuelSource", "StartLocation",
    "StartPrefecture/District", "StartState/Province", "StartCountry", "StartRegion",
    "StartSubRegion", "EndLocation", "EndPrefecture/District", "EndState/Province",
    "EndCountry", "EndRegion", "EndSubRegion", "Cost", "CostUnits", "CostUSD",
    "FIDStatus", "FIDYear", "PCI5", "PCI6", "OtherEnglishNames", "OtherLanguageSegmentName",
    "WKTFormat", "DraftPCI6List", "PCI6ProjectCode"
]

# TO DO: in future, will add 'Owner'

In [1235]:
lng_official_cols = [] # TO ADD 

In [1236]:
def pipelines_fix_one_offs(df):
    print("Running pipelines_fix_one_offs")
    
#     # fill in missing local language names:
#     local_name_fixes = {
#         'Comodoro Rivadavia–Buenos Aires Pipeline': 'Gasoducto Comodoro Rivadavia–Buenos Aires',
#         'GASUN Gas Pipeline': 'Gasoduto GASUN',
#         'Gran Gasoducto del Sur Gas Pipeline': 'Gran Gasoducto del Sur',
#     }      
#     df = fix_one_offs(df, local_name_fixes, 'PipelineName', 'OtherLanguagePrimaryPipelineName')
    
#     # fill in missing local language wikis:
#     local_wiki_fixes = {
#         'https://www.gem.wiki/Comodoro_Rivadavia%E2%80%93Buenos_Aires_Pipeline': 
#         'https://www.gem.wiki/Gasoducto_Comodoro_Rivadavia%E2%80%93Buenos_Aires',
#         'https://www.gem.wiki/GASUN_Gas_Pipeline': 
#         'https://www.gem.wiki/Gasoduto_GASUN',
#         'https://www.gem.wiki/Gran_Gasoducto_del_Sur_Gas_Pipeline': 
#         'https://www.gem.wiki/Gran_Gasoducto_del_Sur',
#     }      
#     df = fix_one_offs(df, local_wiki_fixes, 'Wiki', 'OtherLanguageWikiPage')
    
    return df

In [1237]:
# def pipelines_read_owner_parent_data_working_file():
#     gc = pygsheets.authorize(client_secret_full_path)
#     pipelines_gsheet = gc.open_by_key(data_files_and_paths['pipelines_working_key'])

#     # OWNERS
#     oil_pipe_owners_sheet = pipelines_gsheet.worksheet('title', 'Pipeline operators/owners (1/3)')
#     oil_pipe_owners = oil_pipe_owners_sheet.get_as_df()
    
#     oil_pipe_owners = oil_pipe_owners.dropna(subset=['ProjectID'])
#     oil_pipe_owners = oil_pipe_owners[oil_pipe_owners['ProjectID']!='']
    
#     # PARENTS
#     oil_pipe_parents_sheet = pipelines_gsheet.worksheet('title', 'Owner–parent relationships (2/3)')
#     oil_pipe_parents = oil_pipe_parents_sheet.get_as_df()
    
#     oil_pipe_parents = oil_pipe_parents[oil_pipe_parents['Owner']!='']
    
#     return oil_pipe_owners, oil_pipe_parents

In [1238]:
def pipelines_condense_parents(oil_pipe_owners, oil_pipe_parents):
    all_parents = [] # initialize
    for row in oil_pipe_owners.index:
        project_id = oil_pipe_owners.at[row, 'ProjectID']
        for owner_num in range(1, 11+1):
            owner_name = oil_pipe_owners.at[row, f'Owner{owner_num}']
            # convert to str first (as done for parents below), in case the sheet returned a number
            owner_pct = str(oil_pipe_owners.at[row, f'Owner{owner_num}%']).strip().replace('%', '')
            if owner_pct == '':
                owner_fract = 0
            else:
                owner_fract = float(owner_pct)/100

            if owner_name != '':
                # get parents from oil_pipe_parents
                parent_sel = oil_pipe_parents[oil_pipe_parents['Owner']==owner_name]
                if len(parent_sel) == 1:
                    parent_sel_ser = parent_sel.loc[parent_sel.index[0]]
                    for parent_num in range(1, 10+1):
                        parent_name = parent_sel_ser.at[f"Parent{parent_num}"]
                        if parent_name != '':
                            parent_pct = str(parent_sel_ser.at[f"Parent{parent_num}%"]).strip().replace('%', '')
                            if parent_pct == '':
                                parent_fract = 0
                            else:
                                parent_fract = float(parent_pct)/100

                            parent_fract_scaled = parent_fract * owner_fract
                            all_parents += [(project_id, parent_name, parent_fract_scaled)]

                elif len(parent_sel) == 0:
                    # no info in parent sheet for this owner
                    parent_name = owner_name
                    parent_fract = owner_fract
                    all_parents += [(project_id, parent_name, parent_fract)]

                else:
                    print("Error!" + f" Unexpected len(parent_sel): {len(parent_sel)} for owner_name: {owner_name}")

    all_parents_df = pd.DataFrame(all_parents, columns=['ProjectID', 'Parent', 'Parent fract'])

    all_parents_df['Parent pct'] = ((all_parents_df['Parent fract']*100).apply(lambda x: '{:,.2f}'.format(x))).astype(str) + '%'

    all_parents_cond_list = [] # initialize

    for project_id in all_parents_df['ProjectID'].unique().tolist():
        sel = all_parents_df[all_parents_df['ProjectID']==project_id]
        sel = sel.sort_values('Parent fract', ascending=False)
        parent_str = '' # initialize
        for row in sel.index:
            parent_str += f"{sel.at[row, 'Parent']} [{sel.at[row, 'Parent pct']}]; "
        parent_str = parent_str.replace(' [100.00%]', '')
        parent_str = parent_str.replace(' [0.00%]', '')
        parent_str = parent_str.strip('; ')
        all_parents_cond_list += [(project_id, parent_str)]

    all_parents_cond = pd.DataFrame(all_parents_cond_list, columns=['ProjectID', 'Parent'])
    
    return all_parents_cond
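
The share arithmetic in `pipelines_condense_parents` can be checked on a small case. The sketch below is illustrative only: the owner and parent names, the `P0001` ID, and the dictionary layout are hypothetical and are not the working-file format. It shows each parent's share being scaled by the owner's share, and an owner with no parent entry standing in as its own parent.

```python
import pandas as pd

# Hypothetical toy data: one pipeline, two owners; only 'Owner A' has parent info.
owner_shares = {'Owner A': 0.60, 'Owner B': 0.40}  # owner -> share of project
parent_shares = {'Owner A': {'Parent X': 0.75, 'Parent Y': 0.25}}  # owner -> parent shares

rows = []
for owner, owner_fract in owner_shares.items():
    # an owner with no entry in the parent table is treated as its own parent
    for parent, parent_fract in parent_shares.get(owner, {owner: 1.0}).items():
        rows.append(('P0001', parent, parent_fract * owner_fract))

toy = pd.DataFrame(rows, columns=['ProjectID', 'Parent', 'Parent fract'])
toy = toy.sort_values('Parent fract', ascending=False)

# same "Name [xx.xx%]; ..." layout as the function above
parent_str = '; '.join(
    f"{name} [{fract * 100:,.2f}%]"
    for name, fract in zip(toy['Parent'], toy['Parent fract'])
)
print(parent_str)
```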

In [1239]:
def pipelines_create_data_download_version(arg_df):
    df = arg_df.copy()
    
    df = df.rename(columns={
        'OtherNames': 'OtherEnglishNames', 
        'Source': 'FuelSource'
    })
    
    if map_choice == 'Europe Gas Tracker':
        df = df[pipeline_EGT_2024_official_cols]
    else:
        df = df[pipeline_official_cols]
        
    # create dictionary
    pipelines_for_download_dict = {'English': df}
    
    if map_choice == 'Latin America Portal - oil-gas':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        pipelines_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = pipelines_for_download_dict['English'], 
            trans_sheet_name = 'pipelines',
        )
        # append another key-value pair to the dictionary
        pipelines_for_download_dict['Spanish'] = pipelines_for_download_spanish
    
    return pipelines_for_download_dict

In [1240]:
def read_pipelines_official_file(
    data_keys_titles, 
    pipelines_to_use_dict,
    map_choice,
):
    if map_choice == 'Oil Infrastructure':
        # pass data_keys_titles, consistent with the call further below
        df = read_and_process_oil_pipelines_official_file(data_keys_titles)
        
    else:
        
        if map_choice == 'Europe Gas Tracker':
            name = 'ggit_pipes_official_europe'
            gas_pipes = gspread_access_file_read_only(data_keys_titles[name][0], data_keys_titles[name][1])
        else:
            name = 'ggit_pipes_official'
            gas_pipes = gspread_access_file_read_only(data_keys_titles[name][0], data_keys_titles[name][1])


        print('*'*40)
        print("Gas pipelines: reading data from official release, local Excel file")
        print(name)
        print('-'*40)

        # Overwrite any existing 'Route' column with values converted from 'WKTFormat':
        # drop the column if present, then run the conversion
        if 'Route' in gas_pipes.columns:
            gas_pipes = gas_pipes.drop('Route', axis=1)
        gas_pipes = convert_wkt_to_google_maps(gas_pipes)

        if pipelines_to_use_dict[map_choice]['oil pipes'] == True:
            oil_pipes = read_and_process_oil_pipelines_official_file(data_keys_titles)

            # bring together oil and gas pipes
            df = pd.concat([gas_pipes, oil_pipes], sort=False)
        else:
            # only have gas pipelines
            df = gas_pipes
            print(f'this is after convert wkt logic: {df["Route"].to_list()}')
            
    
    # clean up
    df = df.dropna(subset=['PipelineName'])
    df = df.reset_index(drop=True)
    
    # clean up
    for col in df.columns:
        if 'Unnamed: ' in col:
            df = df.drop(col, axis=1)
            
    for col in ['PipelineName', 'SegmentName']:
        df[col] = df[col].str.strip()
    
    pipelines_df = df
    return pipelines_df

In [1241]:
def test_gas_pipeline_columns(gas_pipes):
    gas_pipes_specified_columns = [
        'ProjectID', 'Fuel', 'StartCountry', 'EndCountry', 'Countries', 'Wiki', 
        'PipelineName', 'SegmentName', 'OtherNames', 'Parent', 'Status', 
        'StartYear1', 'StartYear2', 'StartYear3', 'StopYear', 'Capacity', 
        'CapacityUnits', 'CapacityBcm/y', 'LengthKnownKm', 'LengthEstimateKm', 
        'LengthMergedKm', 'Diameter', 'DiameterUnits', 'FuelSource', 'StartLocation', 
        'StartPrefecture/District', 'StartState/Province', 'StartRegion', 'EndLocation', 
        'EndPrefecture/District', 'EndState/Province', 'EndRegion', 'FID', 'FIDYear', 'WKTFormat',
    ]
    print("Checking columns in gas_pipes")
    for col in gas_pipes.columns:
        if col not in gas_pipes_specified_columns:
            additional_working_cols = [
                'Route (Google Maps)', 'OtherLanguagePrimaryPipelineName', 
                'OtherLanguageSegmentName', 'OtherLanguageWikiPage'
            ]
            if col in additional_working_cols:
                # this is OK; not in official release, but needed for map file
                pass
            else:
                print("Error!" + f" There was a column in the file read in that wasn't in the specified list: {col}")

    for col in gas_pipes_specified_columns:
        if col not in gas_pipes.columns:
            print("Error!" + f" There was a column in the specified list that wasn't in the file read in: {col}")
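
The two-way check in `test_gas_pipeline_columns` can also be written with set differences, which returns both lists instead of only printing them. This is a minimal sketch; `compare_columns` and the example column names are hypothetical helpers for illustration, not part of this notebook.

```python
def compare_columns(actual_cols, expected_cols, allowed_extra=()):
    """Return (unexpected, missing) column names as sorted lists."""
    actual, expected = set(actual_cols), set(expected_cols)
    # columns that were read in but are neither specified nor explicitly allowed
    unexpected = sorted(actual - expected - set(allowed_extra))
    # columns that were specified but not found in the data
    missing = sorted(expected - actual)
    return unexpected, missing

unexpected, missing = compare_columns(
    actual_cols=['ProjectID', 'Fuel', 'Route (Google Maps)', 'Notes'],
    expected_cols=['ProjectID', 'Fuel', 'Wiki'],
    allowed_extra=['Route (Google Maps)'],
)
print(unexpected, missing)
```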

In [1242]:
def fill_in_missing_capacity_bcm_per_year(df):
    """
    The column 'CapacityBcm/y' includes only values converted from other units. If the units are already Bcm/y, it doesn't include them.
    This function puts the values that are already Bcm/y into the 'CapacityBcm/y' column
    """
    for row in df.index:
        units = df.at[row, 'CapacityUnits']
        if units == 'bcm/year':
            df.at[row, 'CapacityBcm/y'] = df.at[row, 'Capacity']
    return df
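
The same fill can be done without a row loop by using a boolean mask. The sketch below assumes the same three column names as the function above; `fill_bcm_vectorized` and the toy values are hypothetical.

```python
import pandas as pd

def fill_bcm_vectorized(df):
    out = df.copy()  # leave the caller's frame untouched
    already_bcm = out['CapacityUnits'] == 'bcm/year'
    # copy capacities that are already in bcm/y into the standardized column
    out.loc[already_bcm, 'CapacityBcm/y'] = out.loc[already_bcm, 'Capacity']
    return out

# illustrative values only
toy = pd.DataFrame({
    'Capacity': [10.0, 500.0],
    'CapacityUnits': ['bcm/year', 'MMcf/d'],
    'CapacityBcm/y': [float('nan'), 5.17],
})
filled = fill_bcm_vectorized(toy)
print(filled['CapacityBcm/y'].tolist())
```

Rows whose units are not `'bcm/year'` keep whatever converted value they already had.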

In [1243]:
# def pipelines_merge_local_language_info(pipelines_df):
#     """
#     Merge in local language information (plant name, wiki page URL) from working file,
#     to create augmented version of official data.
    
#     Only is needed for Latin America Portal.
#     """
    
#     # create pipelines_working
#     # df = read_pipelines_working_pygsheets(data_files_and_paths, pipelines_to_use_dict)
#     df = read_pipelines_working_local(data_files_and_paths, pipelines_to_use_dict)
    
#     # clean pipelines working before using for merging local language info
#     df = df[df['ProjectID']!='']
#     df = df[df['ProjectID'].isna()==False]
#     df = df[df['PipelineName']!='']
#     df = df[df['PipelineName'].isna()==False]
    
#     pipelines_working = df    
    
#     # =======

#     # Tracker ID is the ID for each unit; supposed to be unique
#     # If these IDs are in fact unique, then can use to merge the data Flora prepared with working data
#     num_duplicated_rows = pipelines_df['ProjectID'].duplicated().sum()
#     num_duplicated_rows_working = pipelines_working['ProjectID'].duplicated().sum()

#     local_language_cols = ['ProjectID', 'OtherLanguagePrimaryPipelineName', 
#                            'OtherLanguageSegmentName', 'OtherLanguageWikiPage']
#     if num_duplicated_rows == 0 and num_duplicated_rows_working == 0:
#         pipelines_df_initial_len = len(pipelines_df)
#         # note: official release Feb 2022 has 'Plant name (local script)', but not local language wiki pages
#         pipelines_merged = pd.merge(
#             pipelines_df, 
#             pipelines_working[local_language_cols],
#             on='ProjectID',
#             how='left'
#         )
#         if pipelines_df_initial_len == len(pipelines_merged):
#             # create new version of pipelines_df, overwriting previous version
#             pipelines_df = pipelines_merged.copy()
            
#             print("Merged in local language wiki pages")
            
#             return pipelines_df

#         else:
#             print("Error!" + f" There was a difference in the length of dfs.")
#             return pd.DataFrame()

#     else:
#         print("Error!" + f" Possibly there were duplicate ProjectID entries; will return empty DataFrame.")
#         if num_duplicated_rows > 0:
#             print(f"num_duplicated_rows: {num_duplicated_rows}")
#             print(pipelines_df[pipelines_df['ProjectID'].duplicated()][['ProjectID', 'PipelineName', 'SegmentName']])
#         if num_duplicated_rows_working > 0:
#             print(f"num_duplicated_rows_working: {num_duplicated_rows_working}")
#             print(pipelines_working[pipelines_working['ProjectID'].duplicated()][['ProjectID', 'PipelineName', 'SegmentName']])
        
#         return pd.DataFrame()

In [1244]:
def clean_parents(df):
    for row in df.index:
        parents_new_list = [] # initialize
        parent_str = df.at[row, 'Parent']
        if pd.isna(parent_str)==False:
            try:
                parent_list = parent_str.split(', ')
                for parent_element in parent_list:
                    if parent_element == '' or (parent_element.endswith('%') and '.' in parent_element):
                        # empty, or a fractional share (e.g., '50.0%'); leave it out
                        pass
                    else:
                        # put parent_element into parents_new_list
                        parents_new_list += [parent_element]

                # after the loop: rebuild the string once, from the kept elements only
                df.at[row, 'Parent'] = ', '.join(parents_new_list)
            except Exception as e:
                print(f"Error cleaning parent_str: {parent_str}; for row {row}: {e}")

    # after "for row in df.index:"
    return df

In [1245]:
def pipelines_select_by_country(df, sel_countries):
    """
    Mark pipelines for inclusion, then filter the dataframe.
    """
    df['sel for map'] = 'N' # initialize; also ensures the column exists if df is empty

    for row in df.index:
        keep_for_map = 'N' # initialize for each row

        try:
            countries_list = df.at[row, 'Countries'].split(', ')
        except Exception:
            print("Error!" + f" Exception for row {row}, for df.at[row, 'Countries']: {df.at[row, 'Countries']}")
            countries_list = []

        if map_choice in sel_countries.keys():
            # need to filter by country
            for country in countries_list:
                if country in sel_countries[map_choice]:
                    # overwrite the default value of 'N'
                    keep_for_map = 'Y'
        else:
            # set to keep all pipelines
            keep_for_map = 'Y'

        # put value for keep_for_map into df (once per row, after checking all countries)
        df.at[row, 'sel for map'] = keep_for_map

    # filter to keep only selected entries for the map
    df = df[df['sel for map']=='Y'].copy()
    
    return df
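
For reference, the country filter can be expressed as a single boolean mask. This is a sketch under stated assumptions: `select_rows_by_country` is a hypothetical helper that takes the allowed countries directly, rather than reading the `map_choice` and `sel_countries` globals used above.

```python
import pandas as pd

def select_rows_by_country(df, allowed_countries):
    allowed = set(allowed_countries)
    # a row is kept if any of its comma-separated countries is allowed;
    # missing values become '' and therefore never match
    keep = df['Countries'].fillna('').apply(
        lambda s: any(c.strip() in allowed for c in s.split(','))
    )
    return df[keep].copy()

toy = pd.DataFrame({
    'PipelineName': ['A', 'B', 'C'],
    'Countries': ['Argentina, Chile', 'Canada', None],
})
selected = select_rows_by_country(toy, ['Chile', 'Brazil'])
print(selected['PipelineName'].tolist())
```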

In [1246]:
def process_wkt_linestring(wkt_format_str, row):
    # split on commas to separate coordinate pairs from each other
    line = wkt_format_str.replace('LINESTRING', '').strip('() ')
    line_list = line.split(', ')

    line_list_rev = [] # initialize
    for pair in line_list:
        try:
            # in WKT, order is lon lat
            lon = pair.split(' ')[0]
            lat = pair.split(' ')[1]
            # put into Google Maps order & format
            line_list_rev += [f"{lat},{lon}:"]
        except:
            print(f"In process_wkt_linestring, couldn't process {pair} (in row {row})")

    google_maps_line = ''.join(line_list_rev).strip(':')

    return google_maps_line

In [1247]:
def process_gmaps_linestring(gmap_format_str):
    """
    Convert from Google Maps format to WKT format
    """
    
    # split on colons to separate coordinate pairs from each other
    line_list = gmap_format_str.strip().split(':')
    # clean up
    line_list = [x.strip() for x in line_list]

    line_list_rev = [] # initialize
    for pair in line_list:
        # in Google Maps, order is lat, lon
        lat = pair.split(',')[0].strip()
        lon = pair.split(',')[1].strip()
        # put into WKT order & format
        line_list_rev += [f"{lon} {lat},"]

    wkt_line = ' '.join(line_list_rev).strip(',')

    return wkt_line

In [1248]:
def convert_google_maps_to_wkt(
    pipes_df, 
    no_route_entries = no_route_entries
):
    """
    GFIT has pipeline routes in Google Maps format (renamed in this notebook to 'Route (Google Maps)').
    For download file, want to use WKT format to be consistent with GGIT.
    Put WKT format into column 'WKTFormat'.

    In WKT:
    * Each coordinate pair is longitude then latitude, separated by spaces
    * Within linestrings: Coordinate pairs are separated by commas
    * Within multilinestrings: Linestrings are bookended by parentheses
    
    In Google Maps:
    * Each coordinate pair is latitude then longitude, separated by comma
    * Within linestrings: Coordinate pairs are separated by colons
    * Within multilinestrings: Linestrings are separated by semicolons
    """
    
    # rename column if not already in this format
    pipes_df = pipes_df.rename(columns={'Route': 'Route (Google Maps)'})
    
    for row in pipes_df.index:
        val = pipes_df.at[row, 'Route (Google Maps)']
        if pd.isna(val)==True:
            # can't convert
            pass
        
        else:
            gmap_format_str = val

            if ';' in gmap_format_str:
                # split on ';' -- the marker of the end of a linestring
                gmap_multiline_list = gmap_format_str.split(';')

                # clean up:
                gmap_multiline_list = [x.strip() for x in gmap_multiline_list]

                multiline_list_rev = [] # initialize

                for gmap_line in gmap_multiline_list:
                    wkt_line = process_gmaps_linestring(gmap_line)
                    # put parentheses around wkt lines
                    wkt_line = f"({wkt_line})"
                    multiline_list_rev += [wkt_line]

                wkt_str = ', '.join(multiline_list_rev)
                # add MULTILINESTRING wrapper
                wkt_str = f"MULTILINESTRING({wkt_str})"
                pipes_df.at[row, 'WKTFormat'] = wkt_str

            elif ';' not in gmap_format_str and ':' in gmap_format_str:
                wkt_str = process_gmaps_linestring(gmap_format_str)
                # add LINESTRING wrapper
                wkt_str = f"LINESTRING({wkt_str})"
                pipes_df.at[row, 'WKTFormat'] = wkt_str

            elif gmap_format_str in no_route_entries:
                # Known values for no route
                pass

            else:
                print("Error!" + f" Couldn't convert to WKT format: {gmap_format_str}")
    
    return pipes_df

In [1249]:
def convert_wkt_to_google_maps(pipes_df):
    """
    GGIT official release has pipeline routes in WKT format only.
    For map file, need to convert to Google Maps format.
    Put Google Maps format into column 'Route'.

    In WKT:
    * Each coordinate pair is longitude then latitude, separated by spaces
    * Within linestrings: Coordinate pairs are separated by commas
    * Within multilinestrings: Linestrings are bookended by parentheses
    
    In Google Maps:
    * Each coordinate pair is latitude then longitude, separated by comma
    * Within linestrings: Coordinate pairs are separated by colons
    * Within multilinestrings: Linestrings are separated by semicolons
    """
    print("Running convert_wkt_to_google_maps")
    truncated = [] # initialize
    for row in pipes_df.index:
        # route = pipes_df.at[row, 'Route']
        wkt_format_str = pipes_df.at[row, 'WKTFormat']
        name = pipes_df.at[row, 'PipelineName']

        # if len(route) > 1:
            # print(f'ROUTE IS MORE THAN 1: {route}')
            # want to keep route information for few cases that it has it feb 2024
            # pass     
        if pd.isna(wkt_format_str) or wkt_format_str == '--':
            # Missing value or known empty value; nothing to convert
            pass
        else:
            if wkt_format_str.endswith(')') == True:
                # formatted correctly; not truncated
                pass
            elif wkt_format_str.endswith(')') == False:
                # it is truncated; need to get rid of partial coordinates
                truncated += [(
                    pipes_df.at[row, 'PipelineName'], 
                    pipes_df.at[row, 'Countries'], 
                    wkt_format_str[-30:]
                )]
                
                wkt_format_str = wkt_format_str.rsplit(',', 1)[0].strip()
                if wkt_format_str.startswith('LINESTRING'):
                    # close with single parentheses
                    wkt_format_str = f"{wkt_format_str})"
                elif wkt_format_str.startswith('MULTILINESTRING'):
                    # close with double parentheses
                    wkt_format_str = f"{wkt_format_str}))"

            if wkt_format_str.startswith('LINESTRING'):
                google_maps_str = process_wkt_linestring(wkt_format_str, row)
                pipes_df.at[row, 'Route'] = google_maps_str

            elif wkt_format_str.startswith('MULTILINESTRING'):
                wkt_multiline = wkt_format_str.replace('MULTILINESTRING', '').strip('() ')
                # split on '), '--marker of the end of a linestring
                wkt_multiline_list = wkt_multiline.split('), ')

                # clean up:
                wkt_multiline_list = [x.strip('(') for x in wkt_multiline_list]

                multiline_list_rev = [] # initialize
                for wkt_line in wkt_multiline_list:
                    google_maps_line = process_wkt_linestring(wkt_line, row)
                    multiline_list_rev += [google_maps_line]

                google_maps_str = ';'.join(multiline_list_rev)
                pipes_df.at[row, 'Route'] = google_maps_str

            else:
                print("Error!" + f" Couldn't convert to Google Maps: {wkt_format_str}")
                print((name, wkt_format_str))
            
                pass
    
    # after end of for row in pipes_df.index
    if len(truncated) > 0:
        print(f"WKTFormat was truncated for {len(truncated)} pipelines")
        print(truncated)
        if error_verbose == True:
            for x in truncated:
                print(f"{x[0]} in {x[1]}; last 30 characters: {x[2]}")
            print('-'*40)
            
    return pipes_df
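
The two coordinate formats described in the docstrings above can be illustrated with a single linestring. This is a simplified sketch, not the notebook's implementation: the helper names and coordinates are hypothetical, and truncated or multi-line inputs are not handled.

```python
def wkt_linestring_to_gmaps(wkt):
    # WKT: "lon lat" pairs separated by commas
    body = wkt.replace('LINESTRING', '').strip('() ')
    pairs = []
    for pair in body.split(','):
        lon, lat = pair.split()
        # Google Maps: "lat,lon" pairs separated by colons
        pairs.append(f"{lat},{lon}")
    return ':'.join(pairs)

def gmaps_to_wkt_linestring(route):
    pairs = []
    for pair in route.split(':'):
        lat, lon = [x.strip() for x in pair.split(',')]
        pairs.append(f"{lon} {lat}")
    return f"LINESTRING ({', '.join(pairs)})"

wkt = 'LINESTRING (-58.38 -34.60, -57.95 -34.92)'
route = wkt_linestring_to_gmaps(wkt)
print(route)
print(gmaps_to_wkt_linestring(route) == wkt)
```

Splitting on `','` and then on whitespace tolerates both `", "` and `","` separators between coordinate pairs.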

In [1250]:
def test_pipeline_type_results(df):
    if map_choice in ['Oil Infrastructure']:
        expected_types = ['Oil Pipelines', 'NGL Pipelines'] # note plural
    elif map_choice in ['Gas Infrastructure']:
        expected_types = ['Gas Pipelines'] # note plural
    elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker']:
        expected_types = ['gas_pipeline'] # matches Fuel.str.lower() + '_pipeline' set in modify_pipelines_for_map
    elif map_choice in ['Europe Gas Tracker']:
        expected_types = ['gas_pipeline', 'hydrogen_pipeline']  # change for feb 2024 update ggit
    elif map_choice in ['Latin America Portal - oil-gas']:
        expected_types = ['oil_pipeline', 'gas_pipeline'] # not including ngl_pipeline for now
    else:
        print("Error!" + f" test_pipeline_type_results not set up to run for map_choice: {map_choice}")
        expected_types = []
    
    actual_types = df['type'].fillna('___').unique().tolist()
            
    if set(actual_types) != set(expected_types):
        print("Error!" + f" For map_choice {map_choice}, there were unexpected results in the 'type' column: {actual_types}")
        print(f"Expected types: {expected_types}")

In [1251]:
# def lat_am_oil_gas_reorganize_capacity_data(df):
#     """ For Latin America only, reorganize to use standardized data for capacities.
    
#     Latin America oil-gas map is different from regional gas maps, because it includes two fuels.
#     Instead of using the standard capacity column with all sorts of mixed units,
#     better to use the standardized values in the columns 'CapacityBOEd' & 'CapacityBcm/y'.
    
    
#     """
    
#     if map_choice == 'Latin America Portal - oil-gas':
        
    
#     return df

In [1252]:
def modify_pipelines_for_map(
    df_arg, 
    no_route_entries = no_route_entries
): 
    df = df_arg.copy()
    df = clean_parents(df)
    
    df = df.dropna(subset=['PipelineName'])
    
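    # remove only a trailing '.0' (e.g., '2023.0' -> '2023'); splitting on '0' would cut the year itself
    df['StartYear1'] = df['StartYear1'].fillna('').astype(str).str.replace(r'\.0$', '', regex=True)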
    
    df['Parent'] = df['Parent'].replace('--', '')

    # exclude those with status 'Speculative' 
    # (used for hydrogen pipelines in Europe, as of Mar 2023)
    # df = df[df['Status']!='Speculative']
    
    df = fill_in_missing_capacity_bcm_per_year(df)
    
    # TO DO: remove function call below (and def of function above) if not needed
    # df = lat_am_oil_gas_reorganize_capacity_data(df)
    
    # rename columns:
    # for all map choices
    df = df.rename(columns={
        'Parent': 'parent',
        'Status': 'status',
        'StartYear1': 'start_year',
        'Countries': 'countries',
        'Route': 'route',
        # 'PCI3': 'pci3',
        # 'PCI4': 'pci4',
        'PCI5': 'pci5',
        'PCI6': 'pci6',
    })
    # if map_choice == 'Europe Gas Tracker':
    #     df = df.rename(columns={
    #         'status': 'status_tabular',
    #         'CapacityUnits': 'units',
    #         'status_legend': 'status',
    #         'PipelineName': 'project',
    #         'SegmentName': 'unit',
    #         'Capacity': 'capacity',
    #         'PipelineName': 'project',
    #         'SegmentName': 'unit',
    #         'Wiki': 'url',

    #     })
    if map_choice == 'Latin America Portal - oil-gas':
        df = df.rename(columns={
            'PipelineName': 'project_en',
            'SegmentName': 'unit_en',
            'OtherLanguagePrimaryPipelineName': 'project',
            'OtherLanguageSegmentName': 'unit',
            'Wiki': 'url_en',
            # 'OtherLanguageWikiPage': 'url', # modified 2023-10-27 to remove local language URL
            'Capacity': 'capacity',
            'CapacityUnits': 'units',
        })
    elif map_choice == 'Oil Infrastructure':
        df = df.rename(columns={
            'PipelineName': 'project',
            'SegmentName': 'unit',
            'Wiki': 'url',
            'Capacity': 'capacity',
            'CapacityUnits': 'units',
            # TO DO: get all values into common units of BOE/d
        })
    elif map_choice == 'Gas Infrastructure':
        df = df.rename(columns={
            'PipelineName': 'project',
            'SegmentName': 'unit',
            'Wiki': 'url',
            'CapacityBcm/y': 'capacity',
        })
        df['capacity_production_unit'] = 'Bcm/y'
    else:
        df = df.rename(columns={
            'PipelineName': 'project',
            'SegmentName': 'unit',
            'Capacity': 'capacity',
            'CapacityUnits': 'capacity_production_unit',
            'Wiki': 'url',
        })
    
    # exclude those with no route
    no_route_filter = df['route'].fillna('').isin(no_route_entries) 
    no_route = df[no_route_filter]
    print("Warning!" + f" Number of rows with no route: {len(no_route)}") # for UI
    
    df = df[~no_route_filter]
    df = df[df['route'].isna()==False]
    
    # add entries in column 'type'
    if map_choice in ['Oil Infrastructure']:
        df['type'] = df['Fuel'] + ' Pipelines' # note use of plural
    elif map_choice in ['Gas Infrastructure']:
        # keep only those with fuel = 'Gas'
        df = df[df['Fuel']=='Gas']
        df['type'] = df['Fuel'] + ' Pipelines' # note use of plural
    elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
        df['type'] = df['Fuel'].str.lower() + '_pipeline'
        print(df['type'].to_list())
     
    elif map_choice in ['Latin America Portal - oil-gas']:
        df['type'] = df['Fuel'].str.lower() + '_pipeline'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
        
    test_pipeline_type_results(df)
    df['geom'] = 'line'

    pipelines_for_map = df
    return pipelines_for_map

In [1253]:
def clean_pipeline_data(pipelines_df):
    
    # if there are spaces after country names, remove them
    pipelines_df['Countries'] = pipelines_df['Countries'].str.strip()
    pipelines_df['Countries'] = pipelines_df['Countries'].str.replace(' , ', ', ', regex=False)

    # remove "(H2 only)" from statuses
    pipelines_df['Status'] = pipelines_df['Status'].str.replace(' (H2 only)', '', regex=False)
    
    return pipelines_df

In [1254]:
def run_all_pipeline_functions(
    map_choice, 
    data_versions_dict,
    data_keys_titles,
    pipelines_to_use_dict,
    no_route_entries,
):
    
    if map_choice in ['Gas Infrastructure', 'Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
        data_type = data_versions_dict[map_choice]['gas pipelines']
        if data_type == 'official':
            pipelines_df = read_pipelines_official_file(data_keys_titles, pipelines_to_use_dict, map_choice)


        elif data_type == 'working':
            print("Error!" + f" For map_choice {map_choice}, unexpected case for data_type: {data_type}")

        # elif data_type == 'interim':
        #     pipelines_df = read_gas_pipelines_interim_local_copy_working_format()

        else:
            print("Error!" + f" For map_choice {map_choice}, unexpected case for data_type: {data_type}")

    elif map_choice in ['Oil Infrastructure']:
        data_type = data_versions_dict[map_choice]['oil and NGL pipelines']
        if data_type == 'official':
            pipelines_df = read_pipelines_official_file(data_keys_titles, pipelines_to_use_dict, map_choice)

        elif data_type == 'working':
            # pipelines_df = read_pipelines_working_gspread(data_files_and_paths, pipelines_to_use_dict)
            print("Error!" + " Not currently using pipelines working data for oil pipelines")
            
        else:
            print("Error!" + f" For map_choice {map_choice}, unexpected case for data_type: {data_type}")
        
    elif map_choice in ['Latin America Portal - oil-gas']:
        data_type_gas = data_versions_dict[map_choice]['gas pipelines']
        data_type_oil = data_versions_dict[map_choice]['oil pipelines']
        if data_type_gas == 'official' and data_type_oil == 'official':
            pipelines_df = read_pipelines_official_file(data_keys_titles, pipelines_to_use_dict, map_choice)
        else:
            print("Error!" + f" For map_choice {map_choice}, unexpected case for data_type_gas: {data_type_gas} and/or data_type_oil: {data_type_oil}")
        
    else:
        print(f"No pipeline data will be read for map_choice: {map_choice}")
        pipelines_df = pd.DataFrame() # placeholder

    pipelines_df = clean_pipeline_data(pipelines_df)
    pipelines_df = harmonize_countries(pipelines_df)
    
    # filter for specific countries
    pipelines_df = pipelines_select_by_country(pipelines_df, sel_countries)
    
    # TO DO: remove block below, if all working with local language info now in both GGIT & GOIT
    # if map_choice == 'Latin America Portal - oil-gas':        
    #     # merge in local language info
    #     pipelines_df = pipelines_merge_local_language_info(pipelines_df)
        
    pipelines_df = pipelines_fix_one_offs(pipelines_df)
    
    # TEST: data before creating download & map versions
    if map_choice == 'Latin America Portal - oil-gas':
        cols_to_check = ['PipelineName', 'OtherLanguagePrimaryPipelineName', 'Wiki', 'OtherLanguageWikiPage']
    else:
        cols_to_check = ['PipelineName', 'Wiki']
    find_multi_instead_of_one_to_one(pipelines_df, cols_to_check)
    # END OF TEST
    
#     # TO DO: remove steps below if using oil pipeline official release with parents condensed
#     # TO DO: check whether I need to condense parents for gas pipelines also
#     if map_choice == 'Oil Infrastructure':
#         oil_pipe_owners, oil_pipe_parents = pipelines_read_owner_parent_data_working_file()
#         all_parents_cond = pipelines_condense_parents(oil_pipe_owners, oil_pipe_parents)
    
#         pipelines_df_len_init = len(pipelines_df)
#         pipelines_df = pd.merge(pipelines_df, all_parents_cond, on='ProjectID', how='left')

#         if len(pipelines_df) != pipelines_df_len_init:
#             print("Error!" + f" After merge, len(pipelines_df) changed.")
#     else:
#         pass
    
    # create pipeline download file immediately after filtering by country
    pipelines_for_download_dict = pipelines_create_data_download_version(pipelines_df)

    # create pipeline data for map
    pipelines_for_map = modify_pipelines_for_map(pipelines_df)
        
    print('-'*40)
    print("Pipelines: finished processing")
    print('-'*40)
    
    return pipelines_for_download_dict, pipelines_for_map

## LNG terminals

In [1255]:
def read_lng_official():
    
    if map_choice == 'Europe Gas Tracker':
        # path = data_files_and_paths['ggit_lng_official_path_europe']
        # file = data_files_and_paths['ggit_lng_official_file_europe'] 
        # sheet_name = 'LNG Terminals 2023-07-10'
        name = 'ggit_lng_official_europe'
        lng_term = gspread_access_file_read_only(data_keys_titles[name][0], data_keys_titles[name][1])

    else:
        # path = data_files_and_paths['ggit_lng_official_path']
        # file = data_files_and_paths['ggit_lng_official_file'] 
        # sheet_name = 'LNG Terminals 2023-12-14'
        name = 'ggit_lng_official'
        lng_term = gspread_access_file_read_only(data_keys_titles[name][0], data_keys_titles[name][1])

    

    print('*'*40)
    print("LNG Terminals: reading data from official release (via gspread)")
    print(name)
    print('-'*40)
    
    # lng_term = pd.read_excel(path + file, sheet_name = sheet_name)
    # lng_term = gspread_access_file_read_only(data_keys_titles[name][0], data_keys_titles[name][1])

    print("Checking columns in lng_term")
    lng_term_official_columns_2022 = [
        'TerminalID', 'ProjectID', 'ComboID', 'Country', 'Region', 'Wiki', 'TerminalName', 'UnitName', 
        'OtherEnglishNames', 'Owner', 'Parent', 'Status', 'StartYear1', 'StartYear2', 'StartYear3', 'StopYear', 
        'Import/Export', 'Capacity', 'CapacityUnits', 'CapacityInMtpa', 'Location', 'Prefecture/District', 
        'State/Province', 'Latitude', 'Longitude', 'Accuracy', 'Floating', 'FID', 'FIDYear', 'OtherLanguageName', 
        'PowerPlantsSupplied', 'OtherLanguageWikiPage',
    ]
     
    lng_term_official_EGT_columns_2024 = [
        "TerminalID", "ProjectID", "ComboID", "Wiki", "TerminalName",
        "UnitName", "FacilityType", "Status", "Country", "OtherLanguageName",
        "LastUpdated", "OtherEnglishNames", "Owner", "Parent", "ParentHQCountry",
        "ProposalYear", "ProposalMonth", "ConstructionYear", "ConstructionMonth", "StartYear1",
        "StartMonth1", "StartYear2", "StartYear3", "DelayedStartYear", "StartYearEarliest",
        "ShelvedYear", "CancelledYear", "StopYear", "Capacity", "CapacityUnits",
        "CapacityInMtpa", "CapacityInBcm/y", "Region", "Location", "Prefecture/District",
        "State/Province", "Latitude", "Longitude", "Accuracy", "PowerPlantsSupplied", "Cost",
        "CostUnits", "CostUSD", "FIDStatus", "FIDYear", "Floating"
    ]

    if map_choice == 'Europe Gas Tracker':
        # lng_term_official_EGT_columns_2024
        for col in lng_term.columns:
            if col not in lng_term_official_EGT_columns_2024:
                print("Error!" + f" There was a column in the file read in that wasn't in the 2024 official release: {col}")

        for col in lng_term_official_EGT_columns_2024:
            if col not in lng_term.columns:
                print("Error!" + f" There was a column in 2024 official release that wasn't in the file read in: {col}")
        
    else:
        for col in lng_term.columns:
            if col not in lng_term_official_columns_2022:
                print("Error!" + f" There was a column in the file read in that wasn't in the 2022 official release: {col}")

        for col in lng_term_official_columns_2022:
            if col not in lng_term.columns:
                print("Error!" + f" There was a column in 2022 official release that wasn't in the file read in: {col}")
    

    # TODO ADD THIS FOR LNG but where? too early here
        
    # if map_choice == 'Europe Gas Tracker':
    #     df = df[lng_term_official_EGT_columns_2024]
    # else:
    #     df = df[lng_term_official_columns_2022]
        
    return lng_term
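
The two-way column check in `read_lng_official` is written out once per release. One possible way to factor it is a small helper; `compare_columns` is a hypothetical name and is not part of the notebook.

```python
import pandas as pd

def compare_columns(df, expected_columns):
    """Return (unexpected, missing) column names, preserving their original order."""
    unexpected = [col for col in df.columns if col not in expected_columns]
    missing = [col for col in expected_columns if col not in df.columns]
    return unexpected, missing

# Made-up frame and expected list, just to show the two directions of the check
demo = pd.DataFrame(columns=['TerminalID', 'Wiki', 'Extra'])
unexpected, missing = compare_columns(demo, ['TerminalID', 'Wiki', 'Status'])

for col in unexpected:
    print("Error!" + f" Column in the file read in that wasn't expected: {col}")
for col in missing:
    print("Error!" + f" Expected column that wasn't in the file read in: {col}")
```

The same helper could be called with either the 2022 or the 2024 column list, which would remove the duplicated loops.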

In [1256]:
# def lng_read_europe_update():
#     file_name = data_files_and_paths['ggit_lng_europe_update_file']
    
#     print('*'*40)
#     print("LNG Terminals: reading data from update for Europe (local Excel file)")
#     print(f'"{file_name}"')
#     print('-'*40)
    
#     lng_term = pd.read_excel(
#         data_files_and_paths['ggit_lng_europe_update_path'] + file_name, 
#         sheet_name = 'Terminals',
#     )
    
#     lng_term = lng_term.rename(columns={
#         'Owner': 'Parent',
#         'OtherEnglishNames': 'OtherNames',
#     })
    
#     # filter to keep only LNG terminals
#     lng_term = lng_term[lng_term['Type1'].str.strip()=='LNG']
    
#     lng_extraneous_cols = [
#         'Type1', 'Researcher', 'LastUpdated', 
#         'Type2', # greenfield vs brownfield
#         'ProposalYear', 'ConstructionYear', 'Delayed', 'DelayType', 'StartYearEarliest',
#         'ShelvedYear', 'CancelledYear', 'ShelvedCancelledStatusType', 'CapacityInBcm/y',
#         'Source', 'PowerPlantsSupplied', 'CostEst', 'CostEstUnits', 'CostEstYear',
#         'CostEstUSD', 'WriteDown', 'ReExport', 'EuropeTracker', 'PCINumber',
#         'PCI3', 'PCI4', 'PCI5', 'Opposition', 'ESJNotes',
#         'Defeated', 'OtherLanguageName', 'OtherLanguageWikiPage', 'H2Proposed', 'H2Notes',
#         'ResearcherNotes1', 'ResearcherNotes2', 'ResearcherNotes3', 'ResearcherNotes4', 'CostUSDPerBcm/y',
#     ]
#     for col in lng_extraneous_cols:
#         if col in lng_term.columns:
#             lng_term = lng_term.drop(col, axis=1)
#         else:
#             print("Warning!" + f" Extraneous column {col} wasn't in lng_term.")
        
#     print("Checking columns in lng_term")
#     lng_term_official_columns_2022 = [
#         'TerminalID', 'ProjectID', 'ComboID', 'Country', 'Region', 'Wiki', 'TerminalName', 'UnitName', 
#         'OtherEnglishNames', 'Owner', 'Parent', 'Status', 'StartYear1', 'StartYear2', 'StartYear3', 'StopYear', 
#         'Import/Export', 'Capacity', 'CapacityUnits', 'CapacityInMtpa', 'Location', 'Prefecture/District', 
#         'State/Province', 'Latitude', 'Longitude', 'Accuracy', 'Floating', 'FID', 'FIDYear', 'OtherLanguageName', 
#         'PowerPlantsSupplied', 'OtherLanguageWikiPage',
#     ]
#     for col in lng_term.columns:
#         if col not in lng_term_official_columns_2022:
#             print("Warning!" + f" There was a column in the file read in that wasn't in the 2022 official release: {col}")

#     for col in lng_term_official_columns_2022:
#         if col not in lng_term.columns:
#             print("Warning!" + f" There was a column in 2022 official release that wasn't in the file read in: {col}")
    
#     return lng_term

In [1257]:
# def lng_read_custom_file_july_2022():
#     """
#     Note this is the first version of LNG tracker that has terminal name and unit name in separate columns.
#     """
#     print("LNG Terminals: reading custom file for map (July 2022)" + '\n' + '-'*40)
    
#     # read working file (with local language name, wiki URL)
#     lng_term = pd.read_excel(
#         gem_path + 'GFIT & GGIT & GOIT (pipelines & LNG)/GGIT LNG Terminals - versions saved/' + 
#         'LNG Terminals - main - July 2022 version - reformat for map 2022-08-17_1703.xlsx',
#         sheet_name = 'Sheet1', 
#         na_values = ['--'],
#         dtype = {'CapacityInMtpa': float}    
#     )
    
#     return lng_term

In [1258]:
def read_lng_term_working_pygsheets(data_files_and_paths):
    print("Reading LNG terminals working file using pygsheets")
    print("-"*40)
    gc = pygsheets.authorize(client_secret_full_path)
    lng_term_working_gsheet = gc.open_by_key(data_files_and_paths['ggit_lng_working_key'])

    lng_term_working_sheet = lng_term_working_gsheet.worksheet('title', 'Terminals')
    df = lng_term_working_sheet.get_as_df(start='A2')

    # =======
    # clean up
    
    # exclude empty rows:
    df = df[
        df['TerminalName'].notna() & 
        (df['TerminalName']!='')
    ]

    # drop empty columns
    for col in df.columns:
        if 'Unnamed: ' in col:
            df = df.drop(col, axis=1)
            
    for col in ['OtherLanguageName']:
        df[col] = df[col].str.strip()
    
    df = df.reset_index(drop=True)
    
    # ======
    # TO DO: create function below and run it, if need any columns from working file that are not strings
#     # convert dtypes
#     df = for_pygsheets_convert_lng_term_dtypes_and_values(df)

    lng_term_working = df
    return lng_term_working

In [1259]:
def lng_clean_data(df):
    df = harmonize_countries(df)
    
    # exclude empty columns
    for col in df.columns:
        if 'Unnamed: ' in col:
            df = df.drop(col, axis=1)
            
    # exclude empty rows
    for x in ['Name', 'TerminalName']:
        if x in df.columns:
            df = df.dropna(subset=[x])
    if 'Name' not in df.columns and 'TerminalName' not in df.columns:
        print("Error!" + " Columns didn't include 'Name' or 'TerminalName'; didn't clean up by excluding empty rows")
    
    # handle parent vs owner:
    if 'Parent' in df.columns:
        # don't rename
        pass        
    else:
        if 'Owner' in df.columns:
            df = df.rename(columns={'Owner': 'Parent'})
            print("Warning!" + " Renamed column 'Owner' to 'Parent'") # for UI
        else:
            pass    
        
    return df

In [1260]:
# def lng_exclude_one_offs_latam(df):
#     if 'Tango FLNG Terminal' in df['Name'].tolist():
#         # Tango FLNG Terminal was moved to be part of Bahia Blanca LNG Terminal
#         # Can use old Spanish page, which redirects to Bahia Blanca: https://www.gem.wiki/Unidad_Flotante_Tango_FLNG
#         sel_rows = df[df['Name']=='Tango FLNG Terminal']
#         for row in sel_rows.index:
#             url = df.at[row, 'OtherLanguageWikiPage']
#             # check to make sure we got a row where the URL was missing
#             if pd.isna(url)==True:
#                 # put in missing data
#                 df.at[row, 'OtherLanguageWikiPage'] = 'https://www.gem.wiki/Unidad_Flotante_Tango_FLNG'
#                 df.at[row, 'OtherLanguageName'] = 'Unidad Flotante Tango FLNG'
#                 df.at[row, 'Wiki'] = 'https://www.gem.wiki/Tango_FLNG_Terminal'
#                 print("Added missing data for Tango FLNG Terminal")
#             else:
#                 print("Already has wiki page entry for Tango FLNG Terminal")

#     # exclude 'Penco FLNG Terminal'; no longer in data set
#     if 'Penco FLNG Terminal' in df['Name'].tolist():
#         df = df[df['Name']!='Penco FLNG Terminal']
#         print("Removed Penco FLNG Terminal")
    
#     return df

In [1261]:
def lng_fix_one_offs(df):
    """
    Can add new fixes if any. Fix for T049202 was for June 2021 version of data; fixed in July 2022 version of data
    """
    # status_fixes = {
    #     'T049202': 'Proposed', # had missing status
    # }
    # df = fix_one_offs(df, status_fixes, 'ComboID', 'Status')
    
    return df

In [1262]:
def lng_clean_and_prepare_for_map(df, map_choice):
    df = clean_parents(df)

    # rename columns - all map choices
    df = df.rename(columns={
        'UnitName': 'unit',
        'Parent': 'parent',
        'Status': 'status',
        'State/Province': 'province',
        'Country': 'countries',
        'CapacityInMtpa': 'capacity',
        'StartYear1': 'start_year',
        'Latitude': 'lat',
        'Longitude': 'lng',
        # 'Facility': 'terminal_type',
        # 'Import/Export': 'terminal_type',
        'FacilityType': 'terminal_type', # change in 2023-12 release; was previously called 'Import/Export', and before that 'Facility'        
        # 'PCI3': 'pci3',
        # 'PCI4': 'pci4',
        'PCI5': 'pci5',
        'PCI6': 'pci6',
    })
    
    if map_choice == 'Latin America Portal - oil-gas':
        df = df.rename(columns={
            'TerminalName': 'project_en',
            'OtherLanguageName': 'project',
            # unit: no equivalent at this point; train name within column 'Name'
            # if splitting out train names, then need to translate to Spanish/Portuguese?
            'Wiki': 'url_en',
            # 'OtherLanguageWikiPage': 'url', # modified 2023-10-27 to remove local language URL
        })
    else:
        df = df.rename(columns={
            'TerminalName': 'project', # previously was 'Name'
            'Wiki': 'url',
        })
        
    # create column 'unit_en' & translate values in 'unit' (Spanish)
    if map_choice == 'Latin America Portal - oil-gas':
        df['unit_en'] = df['unit'].copy()
        
        # translate after copying to English column
        df['unit'] = df['unit'].str.replace('Phase', 'Fase')
        df['unit'] = df['unit'].str.replace('Unit', 'Unidad')
        df['unit'] = df['unit'].str.replace('Block', 'Bloque')
    
    df['start_year'] = df['start_year'].fillna('').astype(str)

    # =========
    # add values in column 'type'
    if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
        df['type'] = 'lng_terminal' # lng_terminal
    elif map_choice in ['Gas Infrastructure']:
        df['type'] = 'lng_terminal'
    elif map_choice in ['Latin America Portal - oil-gas']:
        df['type'] = 'lng_terminal'
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
        
    if map_choice in ['Gas Infrastructure']:
        # combine 'type' column and 'terminal_type' column (import/export; renamed from 'FacilityType' above)
        df['type'] = df['type'] + ' (' + df['terminal_type'].fillna('') + ')'
    
        print("Show all types:") # for db
        print(df['type'].fillna('__').value_counts()) # for db
    # =========
        
    df['geom'] = 'point'
    df['capacity_production_unit'] = 'MTPA'
    
    # # new steps to handle July 2022 version of LNG terminals:
    # if 'Owner' in df.columns:
    #     df = df.rename(columns={'Owner': 'owner'})
    
    return df
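
A minimal sketch of the unit-name translation used for the Latin America map, on made-up unit names. The dictionary form is an assumption for illustration; the function above uses three separate `str.replace` calls.

```python
import pandas as pd

df = pd.DataFrame({'unit': ['Phase 1', 'Unit 2', 'Block A', None]})

# Keep the English value before translating
df['unit_en'] = df['unit'].copy()

# Same three substitutions as in lng_clean_and_prepare_for_map
translations = {'Phase': 'Fase', 'Unit': 'Unidad', 'Block': 'Bloque'}
for english, spanish in translations.items():
    df['unit'] = df['unit'].str.replace(english, spanish, regex=False)

print(df)
```

These are plain substring replacements, so a unit name that merely contains one of the words (for example "United") would also be altered. That is worth checking if unit names ever include proper nouns.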

In [1263]:
def lng_term_create_data_download_version(lng_term):
    lng_term_for_download = lng_term.copy()
    
    lng_term_official_EGT_columns_2024 = [
        "TerminalID", "ProjectID", "ComboID", "Wiki", "TerminalName",
        "UnitName", "FacilityType", "Status", "Country", "OtherLanguageName",
        "LastUpdated", "OtherEnglishNames", "Owner", "Parent", "ParentHQCountry",
        "ProposalYear", "ProposalMonth", "ConstructionYear", "ConstructionMonth", "StartYear1",
        "StartMonth1", "StartYear2", "StartYear3", "DelayedStartYear", "StartYearEarliest",
        "ShelvedYear", "CancelledYear", "StopYear", "Capacity", "CapacityUnits",
        "CapacityInMtpa", "CapacityInBcm/y", "Region", "Location", "Prefecture/District",
        "State/Province", "Latitude", "Longitude", "Accuracy", "PowerPlantsSupplied", "Cost",
        "CostUnits", "CostUSD", "FIDStatus", "FIDYear", "Floating"
    ]
    if map_choice == 'Europe Gas Tracker':
        lng_term_for_download = lng_term_for_download[lng_term_official_EGT_columns_2024]
    else:
        pass
    lng_term_for_download_dict = {'English': lng_term_for_download}
    
    if map_choice == 'Latin America Portal - oil-gas':
        # TO DO: check that Spanish/Portuguese name is in the download sheet
        lng_term_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = lng_term_for_download, 
            trans_sheet_name = 'LNG terminals',
        )
        # append another key-value pair to the dictionary
        lng_term_for_download_dict['Spanish'] = lng_term_for_download_spanish
        
    return lng_term_for_download_dict

In [1264]:
def run_all_lng_terminal_functions(map_choice, data_versions_dict, data_keys_titles):
    
    if data_versions_dict[map_choice]['ggit lng'] == 'official':
        lng_term = read_lng_official()
    # elif data_versions_dict[map_choice]['ggit lng'] == 'interim Mar 2022':
    #     lng_term = lng_read_europe_update()
    # elif data_versions_dict[map_choice]['ggit lng'] == 'custom map file July 2022':
    #     lng_term = lng_read_custom_file_july_2022()
    else:
        print("Error!" + f" Not yet set up to handle data_versions_dict[map_choice]['ggit lng']: {data_versions_dict[map_choice]['ggit lng']}")
    
    lng_term = lng_clean_data(lng_term)
    lng_term = filter_points_by_country(lng_term, map_choice, sel_countries)
    
    lng_term = lng_fix_one_offs(lng_term)

    # TO DO: remove block below; no longer needed because local language info is in official release
    # if map_choice == 'Latin America Portal - oil-gas':
    #     # merge in local language info
    #     lng_term = lng_term_merge_local_language_info(lng_term)
    #     # lng_term = lng_exclude_one_offs_latam(lng_term)
    # else:
    #     pass
    
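    # check data before creating download & map versions
    # (name column is 'TerminalName' in recent releases; it was previously 'Name')
    name_col = 'TerminalName' if 'TerminalName' in lng_term.columns else 'Name'
    if map_choice == 'Latin America Portal - oil-gas':
        cols_to_check = [name_col, 'OtherLanguageName', 'Wiki', 'OtherLanguageWikiPage']    
    else:
        cols_to_check = [name_col, 'Wiki']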
    find_multi_instead_of_one_to_one(lng_term, cols_to_check)
    
    # create data download file at this point: lng_term_for_download
    lng_term_for_download = lng_term_create_data_download_version(lng_term)

    # create df for map
    lng_terminals_for_map = lng_clean_and_prepare_for_map(lng_term, map_choice)        
    
    print('-'*40)
    print("LNG Terminals: finished processing")
    print('-'*40)

    return lng_term_for_download, lng_terminals_for_map

## Oil & Gas Extraction

In [1265]:
goget_main_official_columns_2022 = [
    'Unit name', 'Unit name local script', 'Fuel type', 'Unit type', 'Country', 
    'Subnational unit (province, state)', 'Latitude', 'Longitude', 'Location accuracy', 
    'Status', 'Status year', 'Discovery year', 'Production start year', 'Operator', 'Owner', 
    'Parent', 'Basin', 'Concession / block', 'Project or complex', 'GEM Unit ID', 
    'Government unit ID', 'Wiki URL',
    'Wiki URL local', # newly added to official
]
goget_prod_official_columns_2022 = [
    'GEM Unit ID', 'Unit name', 'Wiki URL', 'Production/reserves', 'Fuel description', 
    'Reserves classification (original)', 'Reserves classification (converted)', 
    'Data year', 'Quantity (original)', 'Units (original)', 'Quantity (converted)', 
    'Units (converted)'
]

In [1266]:
# def read_goget_working_pygsheets_main(data_files_and_paths):
#     gc = pygsheets.authorize(client_secret_full_path)
#     goget_working_gsheet = gc.open_by_key(data_files_and_paths['goget_working_key'])

#     goget_main_working_sheet = goget_working_gsheet.worksheet('title', 'Main data')
#     df = goget_main_working_sheet.get_as_df()

#     for own_num in range(1, 5+1):
#         df[f'Owner {own_num}'] = df[f'Owner {own_num}'].replace('', np.nan)
#         df[f'Owner {own_num} %'] = df[f'Owner {own_num} %'].replace('', np.nan).str.replace('%', '').astype(float) / 100
        
#     for col in ['Latitude', 'Longitude']:
#         df[col] = df[col].replace('', np.nan).astype(float)
        
#     df = df.rename(columns={'Unit ID': 'GEM Unit ID'})
    
#     # remove any empty rows:
#     df = df[df['Unit name'].isna() == False]
#     df = df[df['Unit name'] != '']
        
#     goget_main_working = df
    
#     return goget_main_working

In [1267]:
def read_goget_working_gspread_main(data_files_and_paths):
    df = gspread_access_file_read_only(
        key = data_files_and_paths['goget_working_key'],
        title = 'Main data',
    )
    return df

In [1268]:
def read_goget_working_local_main(data_files_and_paths):
    file_name = data_files_and_paths['goget_interim_file']
    print("Reading local file (interim release):")
    print(f"{data_files_and_paths['goget_interim_file']}")
    
    df = pd.read_excel(
        data_files_and_paths['goget_interim_path'] + file_name,
        sheet_name = 'Main data',
    )
    
    # remove extraneous columns
    for col in [
        'Production data?', 'Reserves data?', 'Last updated',
        'Owner % sum', 'Notes',
    ]:
        if col in df.columns:
            df = df.drop(col, axis=1)
        else:
            print(f"In read_goget_working_local_main, column not found: {col}")
    
    return df

In [1269]:
def clean_goget_working_main(df):
    for own_num in range(1, 5+1):
        df[f'Owner {own_num}'] = df[f'Owner {own_num}'].replace('', np.nan)
        
        if df[f'Owner {own_num} %'].dtype != float:
            df[f'Owner {own_num} %'] = df[f'Owner {own_num} %'].astype(str).replace('', np.nan).str.replace('%', '').astype(float) / 100
        
    for col in ['Latitude', 'Longitude']:
        df[col] = df[col].replace('', np.nan).astype(float)
        
    df = df.rename(columns={'Unit ID': 'GEM Unit ID'})
    
    # remove any empty rows:
    df = df[df['Unit name'].notna()]
    df = df[df['Unit name'] != '']
        
    return df

In [1270]:
def read_goget_working_gspread_prod(data_files_and_paths):
    df = gspread_access_file_read_only(
        key = data_files_and_paths['goget_working_key'],
        title = 'Reserves and Production',
    )    
    return df

In [1271]:
def read_goget_working_local_prod(data_files_and_paths):
    df = pd.read_excel(
        data_files_and_paths['goget_interim_path'] + data_files_and_paths['goget_interim_file'],
        sheet_name = 'Reserves and Production',
    )
    
    return df

In [1272]:
def clean_goget_working_prod(df):
    
    df = df.rename(columns={'Unit ID': 'GEM Unit ID'})
    
    # remove any rows with no data; messes up conversion to float below
    df = df[~df['Quantity (converted)'].isin(['#N/A', '#VALUE!', ''])]
    
    for col in ['Data source', 'Notes']:
        if col in df.columns:
            df = df.drop(col, axis=1)
    
    df = goget_prod_quantity_cols_convert_from_str_to_float(df)
    
    return df

In [1273]:
def read_goget_working_gspread_parent(data_files_and_paths):
    df = gspread_access_file_read_only(
        key = data_files_and_paths['goget_working_key'],
        title = 'Ownership',
    )
    
    # fill in empty values for 'Owner (local script)'
    df['Owner (local script)'] = df['Owner (local script)'].fillna('')
    
    return df

In [1274]:
def read_goget_working_local_parent(data_files_and_paths):
    df = pd.read_excel(
        data_files_and_paths['goget_interim_path'] + data_files_and_paths['goget_interim_file'],
        sheet_name = 'Ownership',
    )
    
    # fill in empty values for 'Owner (local script)'
    df['Owner (local script)'] = df['Owner (local script)'].fillna('')
    
    return df

In [1275]:
def clean_goget_working_parent(df):

    for par_num in range(1, 5+1):
        df[f'Parent {par_num}'] = df[f'Parent {par_num}'].replace('', np.nan)
        
        if df[f'Parent {par_num} %'].dtype != float:
            df[f'Parent {par_num} %'] = df[f'Parent {par_num} %'].replace('', np.nan).str.replace('%', '').astype(float) / 100

    # '% sum' doesn't depend on par_num, so convert it once, outside the loop
    if df['% sum'].dtype != float:
        df['% sum'] = df['% sum'].replace('', np.nan).str.replace('%', '').astype(float) / 100

    return df
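
The percentage conversion used for the owner and parent columns can be sketched in isolation. `percent_strings_to_fraction` is a hypothetical helper, and it uses `pd.to_numeric` rather than the `astype(float)` chain in the functions above, so unparseable values silently become NaN instead of raising.

```python
import pandas as pd

def percent_strings_to_fraction(series):
    """Convert values like '45%' to fractions (0.45); blanks become NaN."""
    text = series.astype(str).str.replace('%', '', regex=False).str.strip()
    return pd.to_numeric(text, errors='coerce') / 100

# Made-up ownership shares as they might come back from a sheet read as strings
demo = pd.Series(['45%', '', '100%', '12.5'])
fractions = percent_strings_to_fraction(demo)
print(fractions.tolist())
```

Because the input is cast to string first, the helper also accepts columns that were already read as numbers.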

In [1276]:
def goget_reformat_main_from_working_to_official(goget_main_working, goget_parent_working):
    df = goget_main_working.copy()
    
    df = df.rename(columns={'Unit ID': 'GEM Unit ID'})
        
    df = create_owner_and_parent_strings(df.copy(), goget_parent_working)
    
    # remove extraneous columns
    for col in df.columns:
        if col.endswith(' source'):
            df = df.drop(col, axis=1)
    for col in [
        'Wiki name', 
        # 'Notes', 
        'Rystad Asset ID',
        # 'Production data?', 'Reserves data?', 'Last updated', 
        # 'Data source', 
    ]:
        if col in df.columns:
            df = df.drop(col, axis=1)
        else:
            print(f"In goget_reformat_main_from_working_to_official, column not found: {col}")
        
    for own_num in range(1, 5+1):
        df = df.drop([f'Owner {own_num}', f'Owner {own_num} %'], axis=1)

    return df

In [1277]:
def test_official_columns_goget_main(goget_main):
    print("Checking columns goget_main")

    deliberately_excluded = ['Owner % sum']
    
    for col in goget_main.columns:
        if col not in goget_main_official_columns_2022:
            if col not in deliberately_excluded:
                print("Warning!" + f" There was a column in the file read in that wasn't in the 2022 official release: {col}")

    for col in goget_main_official_columns_2022:
        if col not in goget_main.columns:
            if col not in deliberately_excluded:
                print("Warning!" + f" There was a column in 2022 official release (main sheet) that wasn't in the file read in: {col}")
                print(goget_main.columns)

In [1278]:
def test_official_columns_goget_prod(goget_prod):
    print("Checking columns goget_prod")

    deliberately_excluded = ['Country', 'Subnational unit (province, state)', 'Data source', 'Notes']
    # 'Wiki URL local language', 
    
    for col in goget_prod.columns:
        if col not in goget_prod_official_columns_2022:
            if col not in deliberately_excluded:
                print("Warning!" + f" There was a column in the file read in that wasn't in the 2022 official release: {col}")

    for col in goget_prod_official_columns_2022:
        if col not in goget_prod.columns:
            if col not in deliberately_excluded:
                print("Warning!" + f" There was a column in 2022 official release (production sheet) that wasn't in the file read in: {col}")
                if error_verbose == True:
                    print("Show columns in file read in:")
                    print(goget_prod.columns)

In [1279]:
def goget_prod_quantity_cols_convert_from_str_to_float(df):
    for col in ['Quantity (original)', 'Quantity (converted)']:
        could_not_convert = [] # initialize for each column
        try:
            df[col] = df[col].astype(str).str.replace(',', '').replace('', np.nan).astype(float)
        except (ValueError, TypeError):
            print("Error!" + f" Wasn't able to convert column '{col}' to float")
            # find the individual values that can't be converted
            for row in df.index:
                try:
                    # cast to str first, so non-string values don't fail on .replace
                    float(str(df.at[row, col]).replace(',', ''))
                except ValueError:
                    could_not_convert += [(f'row {row}', df.at[row, col])]                
            
        if len(could_not_convert) > 0:
            print(f"Couldn't convert these values to float in column '{col}':\n")
            print(could_not_convert)
    return df
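
As an alternative to the row-by-row fallback, the unconvertible values can be located in one vectorized pass. This is a sketch under assumptions: `find_unconvertible` is a hypothetical helper, and treating blank cells as "missing" rather than "unconvertible" is a choice made here, not something the notebook specifies.

```python
import pandas as pd

def find_unconvertible(series):
    """Return the original values that can't be parsed as floats after removing commas."""
    as_text = series.astype(str).str.replace(',', '', regex=False).str.strip()
    numeric = pd.to_numeric(as_text, errors='coerce')
    # A failure is a NaN result whose text was not blank or already 'nan'
    failed = numeric.isna() & ~as_text.isin(['', 'nan'])
    return series[failed]

# Made-up quantities, including a spreadsheet error value
demo = pd.Series(['1,234.5', '', '#N/A', '42'])
bad = find_unconvertible(demo)
print(bad)
```

The returned Series keeps the original index, so the offending rows can be looked up directly in the working sheet.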

In [1280]:
def read_goget_official(data_keys_titles):
    name = 'goget_official'
    
    print('*'*40)
    print("GOGET: Reading official data (via gspread):")
    print(f'"{name}"')
    print('-'*40)
    
    print("Known issue with openpyxl, will give this warning: 'UserWarning: Unknown extension is not supported and will be removed'")
    print("(It seems GOGET early 2023 release has something in the sheet that openpyxl doesn't like, but it reads the data fine.)")

    # path_file = data_files_and_paths['goget_official_path'] + file_name
    # goget_xl = pd.ExcelFile(path_file)
    # goget_main = pd.read_excel(goget_xl, sheet_name = 'Main data')
    goget_main = gspread_access_file_read_only(data_keys_titles[name][0],data_keys_titles[name][1][0])
    goget_main = goget_main.rename(columns={'Unit ID': 'GEM Unit ID'})
    
    # goget_prod = pd.read_excel(goget_xl, sheet_name = 'Production & reserves')
    # goget_prod = goget_prod.rename(columns={'Unit ID': 'GEM Unit ID'})
    # TODO Production & reserves 
    goget_prod = gspread_access_file_read_only(data_keys_titles[name][0],data_keys_titles[name][1][1])
    goget_prod = goget_prod.rename(columns={'Unit ID': 'GEM Unit ID'})

    goget_prod = goget_prod_quantity_cols_convert_from_str_to_float(goget_prod)
    
    return goget_main, goget_prod

In [1281]:
def test_goget_dtypes(df, float_cols):
    for col in float_cols:
        if df[col].dtype != float:
            print("Error!" + f" For col {col}, expected dtype float64 but was: {df[col].dtype}")

    other_cols = [x for x in df.columns.tolist() if x not in float_cols]
        
    for col in other_cols:
        if df[col].dtype != object:
            print("Error!" + f" For col {col}, expected dtype str but was: {df[col].dtype}")

In [1282]:
def goget_create_data_download_version(df, map_choice, sel_countries):
    """
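    Create the download versions of the data: {'English': df}, plus 'Spanish' for the Latin America Portal.
    No country filtering happens in this function (sel_countries is unused), so df must already be filtered.
    Country isn't in goget_prod (official version), so goget_prod has to be filtered using the units kept in goget_main.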
    """
    
    df_dict = {'English': df}
    
    if map_choice == 'Latin America Portal - oil-gas':
        df_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
            tracker_df = df, 
            trans_sheet_name = 'oil & gas extraction',
        )
        df_dict['Spanish'] = df_spanish
        
    # TO DO: delete block below
#     # PRODUCTION SHEET
#     # filter goget_prod based on goget_main filtering
#     goget_prod_for_download = goget_prod[goget_prod['GEM Unit ID'].isin(goget_main_for_download['GEM Unit ID'])]
    
#     goget_prod_for_download_dict = {'English': goget_prod_for_download} 
#     if map_choice == 'Latin America Portal - oil-gas':
#         goget_prod_for_download_spanish = lat_am_convert_one_tracker_col_names_to_spanish(
#             tracker_df = goget_prod_for_download, 
#             trans_sheet_name = 'oil & gas extraction',
#         )
#         goget_prod_for_download_dict['Spanish'] = goget_prod_for_download_spanish
    
    return df_dict

In [1283]:
def prepare_goget_for_map(
    df, 
    map_choice, 
    goget_main, 
    goget_prod
):
    
    # Insert production data (summed for each unit)
    if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker', 'Latin America Portal - oil-gas']:
        # put production data into the df
        df = goget_process_production_data(df, goget_prod, goget_main)
    else:
        pass

    df = df.rename(columns={
        # 'unit' will be blank
        'Operator': 'operator',
        'Owner': 'owner',
        'Parent': 'parent',
        'Status': 'status',
        'Production start year': 'start_year',
        # 'capacity' will be blank
        'Quantity (converted)': 'production',
        'Latitude': 'lat',
        'Longitude': 'lng',
    })
    
    if map_choice == 'Latin America Portal - oil-gas':
        # TO DO: remove this step if we always have 'Wiki URL local' already in the file
        
#         # merge in local language URLs from working file
#         # goget_main_working = read_goget_working_gspread_main(data_files_and_paths)
#         goget_main_working = read_goget_working_local_main(data_files_and_paths)
#         goget_main_working = clean_goget_working_main(goget_main_working)
        
#         df = pd.merge(
#             df, goget_main_working[['Unit ID', 'Wiki URL local']],
#             left_on = 'GEM Unit ID', right_on = 'Unit ID',
#             how = 'left',
#         )
        
        lat_am_rename_dict = {
            'Unit name local script': 'project',
            'Unit name': 'project_en',
            'Wiki URL': 'url_en',
            # 'Wiki URL local': 'url', # modified 2023-10-27 to remove local language URL
            'Country': 'countries',
        }
        df = df.rename(columns=lat_am_rename_dict)
        
    elif map_choice == 'GOGET':
        goget_rename_dict = {
            'Discovery year': 'discovery_year',
            'Start year': 'start_year',
            'Fuel type': 'fuel_type',
            'Unit type': 'unit_type', 
            'Unit name': 'project',
            'Country': 'country',
            'Wiki URL': 'url',
        }
        df = df.rename(columns=goget_rename_dict)

    else:
        all_regions_rename_dict = {
            'Unit name': 'project',
            'Wiki URL': 'url',
            'Country': 'countries',
        }
        df = df.rename(columns=all_regions_rename_dict)
    
    # clean up start_year column
    df['start_year'] = df['start_year'].replace('nan', '').replace(np.nan, '')
    
    # assign 'type'
    if map_choice == 'GOGET':
        df['type'] = "oil_and_gas_extraction_area"
    elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
        df['type'] = "gas_extraction_area" 
    elif map_choice in ['Latin America Portal - oil-gas']:
        df['type'] = "oil_and_gas_extraction_area"
    else:
        print(f"Not set up to add 'type' values for this map_choice: {map_choice}")
    
    # assign geometry
    df['geom'] = 'point'

    return df

In [1284]:
def select_max_data_year_for_each_unit_and_fuel(df):
    if error_verbose == True:
        print("Running select_max_data_year_for_each_unit_and_fuel")

    # keep only annual production data
    df = df.copy()[df['Production/reserves']=='production']
    
    len_orig = len(df)

    list_of_df_max_years = [] # initialize
        
    prod_metadata_columns = ['GEM Unit ID', 'Unit name', 'Fuel description', 'Data year']
    prod_metadata_dups = df[df.duplicated(subset=prod_metadata_columns)]
    if len(prod_metadata_dups) > 0:
        print("Warning!" + " There are some units with more than one entry for production of a given type of fuel, in a given year (additional entries will be removed):")
        print(prod_metadata_dups[['Unit name', 'Fuel description', 'Quantity (converted)', 'Units (converted)']])

    # drop the duplicates
    df = df.drop_duplicates(subset=prod_metadata_columns, keep='first')

    # after dropping duplicates, there is only one value for each unit, fuel, and year
    # then can keep only the data for the latest year (max year)
    df['combo ID'] = df[['GEM Unit ID', 'Fuel description']].agg('_'.join, axis=1)
    
    if error_verbose == True:
        print(f"In select_max_data_year_for_each_unit_and_fuel, number of combo_ids to process: {len(df['combo ID'].unique())}") # for UI
    
    counter = 0 # initialize
    for combo_id in df['combo ID'].unique():
        df_one_combo = df[df['combo ID']==combo_id]
        
        # clean up data year entries, including changing '[not stated]' to be 0
        data_year_mod = df_one_combo.copy()['Data year']
        data_year_mod = data_year_mod.astype(str)
        data_year_mod = data_year_mod.replace('[not stated]', '0')
        data_year_mod = data_year_mod.str.replace(' predicted', '')
        data_year_mod = data_year_mod.str.strip().str.split(r'\.0', n=1).str[0]
        data_year_mod = data_year_mod.astype(int)
        data_year_mod.name = 'data_year_mod'
        data_year_mod_df = pd.DataFrame(data_year_mod)
        df_one_combo = pd.concat([df_one_combo, data_year_mod_df], axis=1)
        
        # keep only the rows for the maximum data year
        df_one_combo_year_max = df_one_combo[df_one_combo['data_year_mod']==df_one_combo['data_year_mod'].max()]
        if len(df_one_combo_year_max) == 1:
            list_of_df_max_years += [df_one_combo_year_max]
        else:
            print("Error!" + f" len(df_one_combo_year_max) was != 1; it was: {len(df_one_combo_year_max)}; for combo_id: {combo_id}")

        if counter % 1000 == 0:
            print(f"Finished combo_id #{counter}")
        counter += 1

    df = pd.concat(list_of_df_max_years, sort=False)
    
    if error_verbose == True:
        print("Finished select_max_data_year_for_each_unit_and_fuel")
        print(f"Original df len: {len_orig}; after removals: {len(df)}")
        print()
    
    return df
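
The per-combo loop above can be slow when there are many unit/fuel combinations. Below is a minimal vectorized sketch of the same idea, assuming the same data-year cleaning rules ('[not stated]' becomes 0, ' predicted' is dropped, a trailing '.0' is stripped). The name `latest_year_rows_sketch` and the toy data are hypothetical. Unlike the loop, it keeps the first row when the maximum year is tied instead of printing an error, and it leaves out the filtering for 'production' rows and the duplicate check.

```python
import pandas as pd

def latest_year_rows_sketch(df):
    """Keep one row per (GEM Unit ID, Fuel description): the one with the latest data year."""
    year = (
        df['Data year'].astype(str)
        .replace('[not stated]', '0')
        .str.replace(' predicted', '', regex=False)
        .str.strip()
        .str.replace(r'\.0$', '', regex=True)
        .astype(int)
    )
    out = df.assign(data_year_mod=year)
    # idxmax gives the index label of the first row holding the max year in each group
    idx = out.groupby(['GEM Unit ID', 'Fuel description'])['data_year_mod'].idxmax()
    return out.loc[idx]

toy_prod = pd.DataFrame({
    'GEM Unit ID': ['OG1', 'OG1', 'OG1', 'OG2'],
    'Fuel description': ['oil', 'oil', 'gas', 'gas'],
    'Data year': ['2019', '2021.0', '[not stated]', '2020 predicted'],
    'Quantity (converted)': [1.0, 2.0, 3.0, 4.0],
})
toy_latest = latest_year_rows_sketch(toy_prod)
print(toy_latest)
```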

In [1285]:
def clean_liquids_data(df):
    """
    Compare 'total liquids' vs other data (e.g., oil); 
    if 'total liquids' is newer, keep it; otherwise delete it
    """
    orig_len = len(df)
    
    total_liq = df[df['Fuel description']=='total liquids']
    if len(total_liq) > 0:
        comp_mask = df['GEM Unit ID'].isin(total_liq['GEM Unit ID'])
        comp = df[comp_mask]
        remainder = df[~comp_mask]
        comp_clean_list = [] # initialize

        for unit_id in comp['GEM Unit ID'].unique():
            sel_unit = comp[comp['GEM Unit ID']==unit_id]
            sel_unit_tot_liq_year = sel_unit[sel_unit['Fuel description']=='total liquids']['Data year'].max()
            sel_unit_oil_year = sel_unit[sel_unit['Fuel description']=='oil']['Data year'].max()
            # if total liquids data is older than oil data, 
            # then get rid of total liquids data
            if sel_unit_oil_year > sel_unit_tot_liq_year:
                sel_unit_no_tot_liq = sel_unit[sel_unit['Fuel description']!='total liquids']
                comp_clean_list += [sel_unit_no_tot_liq]
            else:
                # get rid of any other liquids data, to leave only total liquids
                individual_liq = ['oil', 'condensate']
                sel_unit_no_individual_liq = sel_unit[~sel_unit['Fuel description'].isin(individual_liq)]
                comp_clean_list += [sel_unit_no_individual_liq]

        comp_clean = pd.concat(comp_clean_list, sort=False)
        # recombine
        df = pd.concat([comp_clean, remainder])
        print(f"Original df len: {orig_len}; after removals in clean_liquids_data: {len(df)}")
    else:
        print("No 'total liquids' entries to handle.")
    
    print("Finished clean_liquids_data")
    return df

In [1286]:
def clean_gas_data(df):
    overlaps = 0 # initialize (counted across both gas subtypes)
    for sel_gas in ['associated gas', 'nonassociated gas']:
        sel_df = df[df['Fuel description']==sel_gas]
    
        if error_verbose == True:
            print(f"Checking {sel_gas} ({len(sel_df)} rows) vs total gas")

        # check if any of these units also have production for simply 'gas' (aka total gas)
        for unit_id in sel_df['GEM Unit ID'].unique():
            # look at all of the unit's rows in df (sel_df only contains rows for sel_gas)
            one_unit_df = df[df['GEM Unit ID']==unit_id]
            if 'gas' in one_unit_df['Fuel description'].tolist():
                print(f"There's total gas data overlapping with {sel_gas} (for unit ID {unit_id}); need to address it")
                overlaps += 1
            else:
                pass

    if overlaps == 0:
        if error_verbose == True:
            print("No overlaps in gas data")

    if error_verbose == True:
        print("Finished clean_gas_data")
        
    return df
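
The overlap check described in the comments above (does a unit report both a gas subtype and plain 'gas'?) can also be written without loops. This is a sketch under the assumption that only the list of affected unit IDs is needed; the name `units_with_gas_overlap_sketch` and the toy rows are hypothetical.

```python
import pandas as pd

def units_with_gas_overlap_sketch(df):
    """Return unit IDs that have both a gas subtype and total 'gas' rows."""
    subtypes = ['associated gas', 'nonassociated gas']
    ids_with_subtype = set(df.loc[df['Fuel description'].isin(subtypes), 'GEM Unit ID'])
    ids_with_total = set(df.loc[df['Fuel description'] == 'gas', 'GEM Unit ID'])
    # units in both sets would be double counted if everything were summed
    return sorted(ids_with_subtype & ids_with_total)

toy_gas = pd.DataFrame({
    'GEM Unit ID': ['OG1', 'OG1', 'OG2', 'OG3'],
    'Fuel description': ['associated gas', 'gas', 'nonassociated gas', 'gas'],
})
toy_overlaps = units_with_gas_overlap_sketch(toy_gas)
print(toy_overlaps)
```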

In [1287]:
def sum_total_production_per_unit(df):
    print("Running sum_total_production_per_unit")
    sums = df.groupby(['GEM Unit ID', 'Units (converted)'])['Quantity (converted)'].sum()

    sums = sums.reset_index()
    # convert gas into boe
    # 169.98 m³ = 1 boe (BP Statistical Review 2021; conversion used within GOGET as well)
    cubic_meters_ng_per_boe = 169.98

    for row in sums.index:
        units = sums.at[row, 'Units (converted)']
        if units in ['million m³/y']: 
            sums.at[row, 'Production million boe/y'] = sums.at[row, 'Quantity (converted)'] / cubic_meters_ng_per_boe
        elif units in ['million bbl/y', 'million boe/y']:
            sums.at[row, 'Production million boe/y'] = sums.at[row, 'Quantity (converted)']
        else:
            print("Error!" + f" In sum_total_production_per_unit, unexpected value for 'Units (converted)': {units}")

    production_per_unit = sums.groupby('GEM Unit ID')[['Production million boe/y']].sum().reset_index()
    
    return production_per_unit
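
A small worked example of the unit handling above, using the same factor (169.98 m³ of gas per boe). The helper name `to_million_boe_sketch` and the toy quantities are hypothetical; unlike the function above, the sketch raises an error for unexpected units instead of printing a message.

```python
import pandas as pd

CUBIC_METERS_NG_PER_BOE = 169.98  # same factor as in sum_total_production_per_unit

def to_million_boe_sketch(quantity, units):
    """Convert one quantity to million boe/y. Hypothetical helper."""
    if units == 'million m³/y':
        # million m³ divided by (m³ per boe) gives million boe
        return quantity / CUBIC_METERS_NG_PER_BOE
    if units in ('million bbl/y', 'million boe/y'):
        return quantity
    raise ValueError(f"Unexpected value for 'Units (converted)': {units}")

toy_sums = pd.DataFrame({
    'GEM Unit ID': ['OG1', 'OG1', 'OG2'],
    'Units (converted)': ['million m³/y', 'million bbl/y', 'million m³/y'],
    'Quantity (converted)': [1699.8, 5.0, 339.96],
})
toy_sums['Production million boe/y'] = [
    to_million_boe_sketch(q, u)
    for q, u in zip(toy_sums['Quantity (converted)'], toy_sums['Units (converted)'])
]
toy_totals = toy_sums.groupby('GEM Unit ID')['Production million boe/y'].sum()
print(toy_totals)
```

In the toy data, OG1 has 1699.8 million m³/y of gas (about 10 million boe/y) plus 5 million bbl/y of oil, so its total is about 15 million boe/y.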

In [1288]:
def goget_process_production_data(map_df, goget_prod, goget_main_filtered):
    """Only for creating production data for map file.
    
    Use production data sheet to gather all the production data for a unit,
    while avoiding (if at all possible) any double-counting, and without missing any subcomponents.
    
    select_max_data_year_for_each_unit_and_fuel: 
    Gets the most recent data, for each type of fuel (to avoid double counting).
    
    clean_liquids_data & clean_gas_data:
    Clean the data sets to avoid overlaps (to avoid double counting).
    
    sum_total_production_per_unit:
    Calculates total for each unit, in million boe/y.
    """
    if error_verbose == True:
        print("Running goget_process_production_data")
    
    df = goget_prod.copy()

    # select only the units selected earlier in goget_main_filtered
    df = df[df['GEM Unit ID'].isin(goget_main_filtered['GEM Unit ID'])]

    df = select_max_data_year_for_each_unit_and_fuel(df)
    df = clean_liquids_data(df)
    df = clean_gas_data(df)

    if error_verbose == True:
        print("Fuel description value_counts after removals:")
        print(df['Fuel description'].value_counts())
        # print(df[df['Fuel description']=='[not stated]'])
        print()

    production_per_unit = sum_total_production_per_unit(df)
    production_per_unit = production_per_unit.rename(columns={
        'Production million boe/y': 'production',
    })
    
    # convert values to strings with 2 decimal places
    # (done on the whole column at once, to avoid writing strings into a float column row by row)
    production_per_unit['production'] = production_per_unit['production'].round(2).map("{0:.2f}".format)
    
    # assign capacity/production unit
    production_per_unit['capacity_production_unit'] = 'million boe/y'

    # merge production data with main data set
    map_df = pd.merge(
        map_df, production_per_unit,
        on='GEM Unit ID',
        how='left')
    
    return map_df

In [1289]:
def goget_filter_ids_based_on_wikis(goget_main):
    # in the file GOGET data compilation (map and public) 2022-01-13.ipynb,
    # we filtered data based on which had wiki pages, since we had already determined which met certain thresholds, and created wiki pages for those
    # then if additional units get added that are of interest (such as Goddard) and those also have wiki pages, this way of filtering will also get them

    df = goget_main.copy()
    df = df[df['Wiki URL']!='']
    df = df[df['Wiki URL'].notna()]

    keep_ids = df['GEM Unit ID'].tolist()
    
    return keep_ids

In [1290]:
def goget_filter_ids_for_gas_production_or_reserves(goget_prod, keep_ids):
    """ Keep only gas units, for data sets that only have gas (and not oil). 
    
    Also retains any that report production/reserves as "hydrocarbons," because it's not clear if gas is included.
    
    Returns a list of GEM unit IDs to keep.
    """
    wiki_and_gas_prod_or_res = goget_prod[
        (goget_prod['GEM Unit ID'].isin(keep_ids)) & 
        (goget_prod['Quantity (converted)']>0) & 
        ((goget_prod['Fuel description'].str.lower().str.contains('gas', na=False)) | 
         (goget_prod['Fuel description'].str.lower()=='hydrocarbons')
        )
    ]
    wiki_and_gas_prod_or_res_ids = wiki_and_gas_prod_or_res['GEM Unit ID'].unique().tolist()

    # update keep_ids list
    keep_ids = [x for x in keep_ids if x in wiki_and_gas_prod_or_res_ids]
    
    return keep_ids

In [1291]:
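def goget_filter_data_based_on_ids(goget_main, goget_prod, keep_ids):
    """ For the regional gas trackers and the Latin America Portal, keep only the rows
    of goget_main & goget_prod with a GEM Unit ID in keep_ids.
    
    For any other map_choice, the data is returned unchanged.
    Note: relies on the global variable map_choice.
    """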
    
    if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker', 'Latin America Portal - oil-gas']:
        # main_en = goget_main_for_download_dict['English']
        # prod_en = goget_prod_for_download_dict['English']
        
        # filter data:
        goget_main = goget_main[goget_main['GEM Unit ID'].isin(keep_ids)]
        goget_prod = goget_prod[goget_prod['GEM Unit ID'].isin(keep_ids)]
        
        # # overwrite values in dictionary
        # goget_main_for_download_dict['English'] = main_en
        # goget_prod_for_download_dict['English'] = prod_en
        
    else:
        pass
        
#     if map_choice in ['Latin America Portal - oil-gas']:
#         # also process Spanish dfs
#         main_sp = goget_main_for_download_dict['Spanish']
#         prod_sp = goget_prod_for_download_dict['Spanish']
        
#         # filter data:
#         main_sp = main_sp[main_sp['GEM Unit ID'].isin(keep_ids)]
#         prod_sp = prod_sp[prod_sp['GEM Unit ID'].isin(keep_ids)]
        
#         # overwrite values in dictionary
#         goget_main_for_download_dict['Spanish'] = main_sp
#         goget_prod_for_download_dict['Spanish'] = prod_sp
        
#     else:
#         pass
    
    return goget_main, goget_prod

In [1292]:
def run_all_goget_functions(map_choice, data_versions_dict, data_keys_titles):
    print('*'*40 + "\nGOGET: running all functions\n") # for UI
    if data_versions_dict[map_choice]['goget'] == 'official':
        goget_main, goget_prod = read_goget_official(data_keys_titles)
        
        goget_main['Discovery year'] = goget_main['Discovery year'].fillna('').astype(str).str.rsplit(pat='.0', n=1).str[0]
        # TO DO: also add for 'Start year'? Or would this remove "expected" values
            # main sheet: convert to float
        for col in ['Latitude', 'Longitude']:
            goget_main[col] = goget_main[col].replace('', np.nan).astype(float)
            
        test_goget_dtypes(goget_main, ['Latitude', 'Longitude']) # these were converted from object to float above
        test_goget_dtypes(goget_prod, ['Quantity (original)', 'Quantity (converted)'])
        
    elif data_versions_dict[map_choice]['goget'] == 'interim':
        # TO DO: can speed it up by reading ExcelFile first, then putting that into functions
        
        # goget_main_working = read_goget_working_gspread_main(data_files_and_paths)
        goget_main_working = read_goget_working_local_main(data_files_and_paths)
        goget_main_working = clean_goget_working_main(goget_main_working)
        
        # goget_prod_working = read_goget_working_gspread_prod(data_files_and_paths)
        goget_prod_working = read_goget_working_local_prod(data_files_and_paths)
        goget_prod_working = clean_goget_working_prod(goget_prod_working)
        
        # goget_parent_working = read_goget_working_gspread_parent(data_files_and_paths)
        goget_parent_working = read_goget_working_local_parent(data_files_and_paths)
        goget_parent_working = clean_goget_working_parent(goget_parent_working)
    
        test_goget_dtypes(
            goget_main_working, 
            ['Latitude', 'Longitude', 'Owner 1 %', 'Owner 2 %', 'Owner 3 %', 'Owner 4 %', 'Owner 5 %']
        )
        test_goget_dtypes(goget_prod_working, ['Quantity (original)', 'Quantity (converted)'])
        test_goget_dtypes(goget_parent_working, ['% sum', 'Parent 1 %', 'Parent 2 %', 'Parent 3 %', 'Parent 4 %', 'Parent 5 %'])
        
        goget_main = goget_reformat_main_from_working_to_official(goget_main_working, goget_parent_working)
        
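        goget_prod = goget_prod_working # no mods

    else:
        # without this, goget_main & goget_prod would be undefined below
        raise ValueError(f"Unexpected GOGET data version for {map_choice}: {data_versions_dict[map_choice]['goget']}")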
    
    test_official_columns_goget_main(goget_main)
    test_official_columns_goget_prod(goget_prod)
    
    # filter by country before other filtering
    goget_main = harmonize_countries(goget_main)
    goget_main = filter_points_by_country(goget_main, map_choice, sel_countries)
    
    # filter based on which units are the correct fuel and are over production/reserves threshold, or other criteria
    # (those that meet the latter criteria have wiki pages)
    keep_ids = goget_filter_ids_based_on_wikis(goget_main)
    keep_ids = goget_filter_ids_for_gas_production_or_reserves(goget_prod, keep_ids)
    goget_main, goget_prod = goget_filter_data_based_on_ids(
        goget_main, goget_prod, keep_ids)
    
    # create download dictionaries
    goget_main_for_download_dict = goget_create_data_download_version(
        goget_main, map_choice, sel_countries)
    goget_prod_for_download_dict = goget_create_data_download_version(
        goget_prod, map_choice, sel_countries)
    
    # create map file
    goget_for_map = prepare_goget_for_map(
        df = goget_main_for_download_dict['English'].copy(), 
        map_choice = map_choice, 
        goget_main = goget_main, 
        goget_prod = goget_prod,
    )
    
    print("-"*40 + "\nGOGET: Finished processing\n" + "-"*40)
    return goget_main_for_download_dict, goget_prod_for_download_dict, goget_for_map

In [1293]:
# notes:
# some in Ireland just say "hydrocarbons"; include them in gas map, since we don't know if oil or gas?
# ditto for Italy
# Johan Castberg (OG0001426) is excluded; it has gas reserves in the gov data, but listed with value 0

# Troll West goes to wiki page "Troll oil and gas field"; but according to the map, they are very far away from each other
# so either one of the coordinates is wrong, or there should be separate wiki pages
# From a quick look online, I think the coordinates for Troll West are wrong
# We don't have any production or reserves for it in our spreadsheet; with new approach here, it will be excluded from the map
# then later on we can figure out if it should have some data, and what the correct coordinates are

# Yme: 0 gas listed

# Moesia: fuel is "oil, NGL, and gas"
# Achmelvich: fuel is "hydrocarbons"

## Functions to compile all oil & gas data

In [1294]:
def test_name_vs_url(og_compiled):
    # check that URL and project name match at the start
    test = og_compiled.copy()
    if map_choice in ['Latin America Portal - coal-steel', 'Latin America Portal - oil-gas']:
        project_col_name = 'project_en'
        url_col_name = 'url_en'
    else:
        project_col_name = 'project'
        url_col_name = 'url'
        
    test['project 1st word'] = test[project_col_name].str.split(' ').str[0]
    test['url 1st word'] = test[url_col_name].str.replace(
        'https://www.gem.wiki/', 'https://gem.wiki/', regex=False).str.replace(
        'https://gem.wiki/', '', regex=False).str.split('_').str[0]
    test = test.loc[test['project 1st word'] != test['url 1st word']]
    test = test.loc[~test[url_col_name].str.contains('//bit.ly', regex=False, na=False)]
    test = test.reset_index()
    
    if len(test) > 0:
        print("\nThere were some mismatches between name & URL:")
        print(test[[project_col_name, url_col_name]])
        print()

In [1295]:
def test_missing_local_unit_names(df, map_choice = map_choice):
    """
    TO DO: would need to modify this function to work correctly.
    
    For example, GOGET doesn't have any entries in column 'unit'.
    A lot of gas plants, pipelines, etc. also don't have entries in 'unit'
    """
    if map_choice in ['Latin America Portal - coal-steel', 'Latin America Portal - oil-gas']:
        unit_col_name = 'unit_en'
    else:
        unit_col_name = 'unit'
    
    missing_local_unit_names = df[
        (df[unit_col_name].fillna('')!='') & 
        (df['unit'].fillna('')=='')
    ]
    if len(missing_local_unit_names)==0:
        pass
    else:
        if map_choice in ['Latin America Portal - coal-steel', 'Latin America Portal - oil-gas']:
            print(missing_local_unit_names[['project_en', 'project', 'unit_en', 'unit']])
        else:
            print(missing_local_unit_names[['project', 'unit']])
            
    # no return

In [1296]:
def test_for_missing_coordinates_or_route(df, cols_to_check, country_col):
    missing_coord_route_list = [] # initialize
    if 'lat' in cols_to_check:
        missing_coord = df.copy()[
            (df['geom']=='point') & 
            ((df['lat'].isna()) | (df['lng'].isna()))
        ]
        missing_coord_route_list += [missing_coord]

    if 'route' in cols_to_check:
        missing_route = df.copy()[
            (df['geom']=='line') & 
            (df['route'].isna())
        ]
        missing_coord_route_list += [missing_route]

    if len(missing_coord_route_list) == 0:
        print("Neither 'lat' nor 'route' was in cols_to_check, so nothing was tested")
        
    else:
        missing_coord_route = pd.concat(missing_coord_route_list, sort=False)

        if len(missing_coord_route) == 0:
            print("There were no rows with missing coordinates/routes")
        else:
            # show any rows with no location data
            print(f"\nThere were {len(missing_coord_route)} rows with missing coordinates/routes that are being excluded")
            if map_choice in ['Latin America Portal - oil-gas', 'Latin America Portal - coal-steel']:
                project_name_col = 'project_en'
            else:
                project_name_col = 'project'

            if len(missing_coord_route) <= 5:
                # only show columns that exist (e.g., route-only data may not have lat & lng)
                cols_to_show = [project_name_col, country_col, 'type', 'lat', 'lng']
                cols_to_show = [x for x in cols_to_show if x in missing_coord_route.columns]
                print(missing_coord_route[cols_to_show])
            else:
                print("summary of rows with missing coordinates/routes:")
                print(missing_coord_route['type'].value_counts())
    # no return

In [1297]:
def exclude_missing_coordinates_or_route(df, sector):    
    """
    Exclude any point data that doesn't have both lat & lng.

    Exclude any line data that doesn't have a route
    """
    
    if sector == 'coal_steel':
        country_col = 'country'
        cols_to_check = ['lat', 'lng']
    elif sector == 'oil_gas':
        if map_choice == 'GOGET':
            country_col = 'country'
        else:
            country_col = 'countries'
            
        if map_choice in ['Oil Infrastructure']:
            cols_to_check = ['route']
        elif map_choice in ['Oil & Gas Plant', 'GOGET']:
            cols_to_check = ['lat', 'lng']
        elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker', 
                            'Gas Infrastructure', 'Latin America Portal - oil-gas']:
            cols_to_check = ['lat', 'lng', 'route']
        else:
            print("Error!" + f" In exclude_missing_coordinates_or_route, unexpected value for map_choice: {map_choice}")
    else:
        print("Error!" + f" Not yet handling sector: {sector}")
    
    test_for_missing_coordinates_or_route(df, cols_to_check, country_col)
        
    # show any rows with no coordinates/routes
    if 'lat' in cols_to_check:
        df_sel = df.copy()
        if 'geom' in df.columns:
            df_sel = df_sel[df_sel['geom']=='point']
        else:
            pass
        df_sel = df_sel[df_sel['lat'].isna()]
        if len(df_sel) > 0:
            print(f"Missing lat-lon:")
            print(df_sel)
    if 'route' in cols_to_check:
        df_sel = df.copy()
        if 'geom' in df.columns:
            df_sel = df_sel[df_sel['geom']=='line']
        else:
            pass
        df_sel = df_sel[df_sel['route'].isna()]
        if len(df_sel) > 0:
            print(f"Missing route:")
            print(df_sel)
        
    # exclude any rows with no coordinates/routes
    df = df.dropna(subset=cols_to_check, how='all')
    
    print("Finished exclude_missing_coordinates_or_route\n")
    return df

In [1298]:
def oil_gas_convert_statuses_for_map(df):
    """
    Do for current maps (as of March 2022).
    
    TO DO: if no status, fill in 'Unknown'???
    """
    df['status'] = df['status'].str.lower()
    
    if error_verbose == True:
        print(f"Show statuses before conversion:\n{df['status'].fillna('___').value_counts()}")
    
    # convert statuses
    if two_column_status == True:
        # create column 'status_legend'
        df['status_legend'] = df.copy()['status'].str.lower().replace({
            # proposed_plus
            'proposed': 'proposed_plus',
            'announced': 'proposed_plus',
            'discovered': 'proposed_plus',
            # construction_plus
            'construction': 'construction_plus',
            'in development': 'construction_plus',
            # mothballed
            'mothballed': 'mothballed_plus',
            'idle': 'mothballed_plus',
            'shut in': 'mothballed_plus',
        })
        
    else:
        if map_choice in ['Oil Infrastructure', 'Gas Infrastructure', 'GOGET']:
            pass
        else:
            # convert values within column 'status'
            df['status'] = df.copy()['status'].str.lower().replace({
                'discovered': 'proposed',
                'in development': 'construction',
                'idle': 'mothballed',
                'shut in': 'mothballed',
            })
            print(f"Show statuses after conversion:\n{df['status'].fillna('___').value_counts()}")
        
    if map_choice == 'Latin America Portal - oil-gas':
        # capitalize first letter
        df['status'] = df['status'].str[0].str.upper() + df['status'].str[1:].str.lower()
    else:
        # leave as lowercase
        pass
    
    # handle missing statuses
    missing_mask = (df['status'].isna()) | (df['status']=='')
    missing = df.copy()[missing_mask]
    if len(missing) > 0:
        print(f"There are {len(missing)} rows with no status; they will be excluded from the map file.")
        print("They are of the following types, with counts by country:")
        if 'type' not in missing.columns:
            missing['type'] = '____'
        missing['type'] = missing['type'].fillna('____')
        if 'country' in missing.columns:
            country_col = 'country'
        else:
            country_col = 'countries'
        missing[country_col] = missing[country_col].fillna('____')
        print(missing.groupby(['type', country_col])[country_col].count())
        len_before_removal = len(df)
        df = df[~missing_mask]
        print(f"Before removal, were {len_before_removal} rows; after removal, {len(df)} rows.")
    elif len(missing) == 0:
        print("Test passed; all rows had statuses.")
    else:
        print("Error!" + f" Unexpected case in oil_gas_convert_statuses_for_map for len(missing): {len(missing)}")

    return df
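
A toy illustration of the two-column status approach above: `status` keeps the original value, while `status_legend` groups it into a legend category. The mapping is copied from the function; the input values are made up, and the name `STATUS_LEGEND_MAP_SKETCH` is hypothetical.

```python
import pandas as pd

# same mapping as in oil_gas_convert_statuses_for_map (two_column_status == True)
STATUS_LEGEND_MAP_SKETCH = {
    'proposed': 'proposed_plus',
    'announced': 'proposed_plus',
    'discovered': 'proposed_plus',
    'construction': 'construction_plus',
    'in development': 'construction_plus',
    'mothballed': 'mothballed_plus',
    'idle': 'mothballed_plus',
    'shut in': 'mothballed_plus',
}

toy_status = pd.DataFrame({'status': ['Proposed', 'in development', 'shut in', 'operating', None]})
toy_status['status'] = toy_status['status'].str.lower()
# statuses not in the mapping (such as 'operating') pass through unchanged
toy_status['status_legend'] = toy_status['status'].replace(STATUS_LEGEND_MAP_SKETCH)
print(toy_status)
```

Rows with a missing status stay missing in both columns, which is why the function removes them afterwards.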

In [1299]:
def oil_gas_data_for_map_final_processing_and_export(
    list_of_dfs, 
    path_for_download_and_map_files
):    
    sector = 'oil_gas'
    
    print("For map data, running final processing & export")
    
    # combine all the dfs into one df
    df = pd.concat(list_of_dfs, sort=False).reset_index(drop=True)
    
    # ===========
    # Change column names in certain cases:
    if map_choice in ['Gas Infrastructure']:
        df = df.rename(columns={'capacity_production_unit': 'capacity_units'})
        
    print(df.columns) # for db

    df = oil_gas_map_pare_and_reorder_columns(df, map_choice)
    
    # ===========
    # SORT:
    if error_verbose == True:
        print(f"show columns before sort: {df.columns.tolist()}") # for db

    if map_choice == 'Latin America Portal - oil-gas':
        df = df.sort_values(by=['type', 'project_en', 'unit_en'])
    elif map_choice == 'GOGET':
        df = df.sort_values(by=['country', 'project'])
    elif map_choice == 'Oil & Gas Plant':
        df = df.sort_values(by=['country', 'project', 'unit'])
    else:
        df = df.sort_values(by=['type', 'project'])
    # ===========

    # ===========
    # STATUSES    
    df = oil_gas_convert_statuses_for_map(df)

    # fix statuses
    if two_column_status == True:
        status_col = 'status_legend'
    else:
        status_col = 'status'
        
    statuses_not_accepted = df[~df[status_col].isin(accepted_statuses_sel[map_choice])]
    if len(statuses_not_accepted)>0:
        print("Warning!" + f" Number of rows with non-standard statuses (which will be excluded from map file): {len(statuses_not_accepted)}") # for UI
        print(statuses_not_accepted[status_col].value_counts())
    df = df[df[status_col].isin(accepted_statuses_sel[map_choice])]

    test_status_for_map(df)
    
    # ===========
    # COORDINATES & GEOMETRY
    # show geometries: check that all rows have a geom
    if error_verbose == True:
        if 'geom' in df.columns:
            print("Show geometries (including any empty rows)")
            print(df['geom'].fillna('').value_counts())
            print()
    
    df = assign_approximate_coordinates(df, sector)    
    df = exclude_missing_coordinates_or_route(df, sector)
    
    if map_choice in ['GOGET', 'Gas Plants'] and 'geom' in df.columns:
        df = df.drop('geom', axis=1)
        print("dropped col 'geom'")   
    
    # ===========
    # CLEAN UP
    # clean up start year entries, removing decimal places
    if 'start_year' in df.columns:
        # remove only a trailing '.0' (a plain replace of '.0' would remove it anywhere in the string)
        df['start_year'] = df['start_year'].astype(str).str.replace(r'\.0$', '', regex=True).replace('nan', '').replace(np.nan, '')

    df = clean_nan_not_found_tbd(df)
        
    # convert to float
    # (need to do after step above, in which all columns are handled as strings)
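    if 'capacity' in df.columns:
        # remove thousands separators; only touch string values, because the
        # .str accessor would turn any numeric entries into NaN
        capacity_is_str = df['capacity'].apply(lambda x: isinstance(x, str))
        if capacity_is_str.any():
            df.loc[capacity_is_str, 'capacity'] = (
                df.loc[capacity_is_str, 'capacity'].str.replace(',', '', regex=False)
            )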
        
    if map_choice == 'GOGET':
        float_cols = ['lat', 'lng']
    else:
        float_cols = ['capacity', 'lat', 'lng']
        
    for col in float_cols:
        if col in df.columns:
            if df[col].dtype == str or df[col].dtype == object:
                df[col] = df[col].replace('--', np.nan)
                df[col] = df[col].replace('', np.nan)
                df[col] = df[col].replace('Unknown', np.nan)
                df[col] = df[col].replace('to be determined', np.nan)
                try:
                    df[col] = df[col].astype(float)
                    print(f"Successfully converted values to float for col: {col}") # for UI
                except (ValueError, TypeError):
                    # fall back to checking each value; blank out those that can't be converted
                    for row in df.index:
                        val = df.at[row, col]
                        try:
                            float(val)
                        except (ValueError, TypeError):
                            project_name = df.at[row, 'project']
                            print(f"For {project_name}, for col {col}, couldn't convert val to float: {val}; changed to blank value") # for UI
                            df.at[row, col] = np.nan

                    # try again
                    df[col] = df[col].astype(float)
                    
                    # after second try/except
                    print(f"Successfully converted values to float for col: {col}") # for UI

                # EXPERIMENTAL
                if col in ['lat', 'lng']:
                    df[col] = df[col].apply(lambda x: format(x, '.5g'))
            else:
                print(f"For col {col}, didn't try to clean; dtype was: {df[col].dtype}")
        else:
            print(f"For {map_choice}, df didn't include the column {col}") # for UI

    # clean up URLs
            
    # ===========
    # COUNTRIES
    
    # Check country names:
    # TO DO: change this step; may want to move test to be within read_eez_file_and_standardize again
    # What is the point of this test? Is it to find mismatches between EEZ and GEM standard? Or between EEZ and a specific data set?
    eez_and_land_boundaries = read_eez_file_and_standardize()
    test_compare_eez_country_names_against_map_df(eez_and_land_boundaries, df)
   
    # check that all rows have a country entered
    test_for_country_entries(df)

    # ===========
    # TESTS FOR INCONSISTENCIES
    if map_choice == 'Latin America Portal - oil-gas':
        cols_to_check = ['project_en', 'project', 'url_en'] # 'url'; modified 2023-10-27 to remove local language URL
    else:
        cols_to_check = ['project', 'url']
    find_multi_instead_of_one_to_one(df, cols_to_check)
        
    # test_name_vs_url(df) # TO DO: restore
    # test_missing_local_unit_names(df) # TO DO: fix function or delete
    
    df = latin_america_fill_in_missing_local_language_versions(df)
    
    # check for missing values
    test_map_specified_cells_have_values(df, sector)
        
    # ===========
    # Check counts of the column 'type'
    if map_choice in ['Oil & Gas Plant', 'GOGET']:
        pass
    else:
        if 'type' in df.columns:
            test_type_counts(df)
        else:
            print("Map df doesn't have column 'type'")
        
    # ===========
    # TEST COLUMNS - just before export  
    # TO DO: make the changes below in code, so that no manual edits in Excel are needed (as Tom was doing):
    # change status to status_tabular
    # change status_legend to status
    # convert capacity_production_unit to units column: 
    # gas_extraction_area: if in production million boe/y
    # gas_pipeline: bcm/y, MMcf/d, MMSCMD,  
    # gas_power_plant: MW
    # lng_terminal: MTPA  
    test_oil_gas_map_columns(df)
        
    # ===========
    # EXPORT MAP FILE
    if export_files == True:        
        oil_gas_compiled_for_map_file_name = f'{map_choice} - map data {save_timestamp}.xlsx'
        df.to_excel(
            path_for_download_and_map_files + 
            oil_gas_compiled_for_map_file_name,
            index=False
        )
        print("*"*40)
        print(f"Exported map file: {oil_gas_compiled_for_map_file_name}")
        print(f"len: {len(df)}")
        print("*"*40)
    else:
        print("*"*40)
        print("Did not export oil & gas map file")
        print("*"*40)

    return df
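
The float-conversion step in the function above first tries a whole-column `astype(float)` and then falls back to a row-by-row check. A minimal sketch of a vectorized alternative is given below, assuming the same placeholder strings. The helper `coerce_to_float` is hypothetical (it is not part of this notebook) and it reports by row index rather than by project name.

```python
import numpy as np
import pandas as pd

def coerce_to_float(series, placeholders=('--', '', 'Unknown', 'to be determined')):
    """Hypothetical helper: convert a column to float, blanking unparseable values."""
    # treat the known placeholder strings as missing
    cleaned = series.mask(series.isin(list(placeholders)))
    # remove thousands separators from string values only
    cleaned = cleaned.map(lambda x: x.replace(',', '') if isinstance(x, str) else x)
    converted = pd.to_numeric(cleaned, errors='coerce')
    # report values that were present but could not be parsed
    failed = cleaned[converted.isna() & cleaned.notna()]
    for idx, val in failed.items():
        print(f"Row {idx}: couldn't convert to float: {val}; changed to blank value")
    return converted.astype(float)

example = pd.Series(['1,200', '35.5', '--', 'n/a', 40])
result = coerce_to_float(example)
```

Whether this would be a drop-in replacement depends on how the rest of the pipeline expects blanks to be represented, so it is best treated as a starting point.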

In [1300]:
def oil_gas_export_download_files(
    oil_gas_data_for_download_list, 
    download_file_name, path_and_filename_for_download, 
    oil_gas_data_for_download_list_spanish,
    download_file_name_spanish, path_and_filename_for_download_spanish
):
    
    if export_files == False:
        print("*"*40)
        print("There was oil and/or gas data for data download, but did not create Excel file")
        print("*"*40)
        
    elif export_files == True:
        print(f"Exporting data to download file: {download_file_name}")
        oil_gas_export_file(
            path_and_filename_for_download, 
            oil_gas_data_for_download_list
        )
        
#         with pd.ExcelWriter(path_and_filename_for_download) as writer:
#             for df_name_tuple in oil_gas_data_for_download_list:
#                 sheet_name = df_name_tuple[0]
#                 df = df_name_tuple[1]
#                 df.to_excel(writer, sheet_name=sheet_name, startrow=1, header=False, index=False)
#                 # specify startrow 1 to leave row 0 empty for headers

#                 workbook = writer.book
#                 worksheet = writer.sheets[sheet_name]

#                 # Add a header format; from: https://xlsxwriter.readthedocs.io/working_with_pandas.html
#                 header_format = workbook.add_format({
#                     'bold': True,
#                     'text_wrap': True,
#                     'valign': 'top',
#                     'border': 0,
#                     'bottom': 1,
#                 })

#                 # Write the column headers with the defined format.
#                 for col_num, value in enumerate(df.columns.values):
#                     worksheet.write(0, col_num, value, header_format)
    
        if map_choice == 'Latin America Portal - oil-gas':
            print(f"Exporting data to Spanish download file: {download_file_name_spanish}")
            oil_gas_export_file(
                path_and_filename_for_download_spanish, 
                oil_gas_data_for_download_list_spanish
            )
            
            # with pd.ExcelWriter(path_and_filename_for_download_spanish) as writer:
            #     for df_name_tuple in oil_gas_data_for_download_list_spanish:
            #         sheet_name = df_name_tuple[0]
            #         df = df_name_tuple[1]
            #         df.to_excel(
            #             writer, 
            #             sheet_name=sheet_name, 
            #             index=False
            #         )
            #         print(f"Wrote to download file: {sheet_name}")
        print("-"*40)
        
    else:
        print("Error!" + f" Unexpected case for export_files: {export_files}")
    
    # no return

In [1301]:
def oil_gas_export_file(path, list_of_tuples):

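    # specify the engine explicitly, because add_format & set_column below are xlsxwriter methods
    with pd.ExcelWriter(path, engine='xlsxwriter') as writer: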
        for df_name_tuple in list_of_tuples:
            
            sheet_name = df_name_tuple[0]
            print(f"Processing sheet {sheet_name}")
            
            df = df_name_tuple[1]
            df.to_excel(writer, sheet_name=sheet_name, startrow=1, header=False, index=False)
            # specify startrow 1 to leave row 0 empty for headers

            workbook = writer.book
            worksheet = writer.sheets[sheet_name]

            # Add a header format; from: https://xlsxwriter.readthedocs.io/working_with_pandas.html
            header_format = workbook.add_format({
                'bold': True,
                'text_wrap': True,
                'valign': 'top',
                'border': 0,
                'bottom': 1,
            })
            
            # set_column takes inclusive, zero-based first & last column indices
            worksheet.set_column(0, len(df.columns) - 1, 12)

            # Write the column headers with the defined format.
            for col_num, value in enumerate(df.columns.values):
                worksheet.write(0, col_num, value, header_format)


    # no return

In [1302]:
def oil_gas_map_pare_and_reorder_columns(df, map_choice):
    """ Removes columns not needed for map & reorders columns.
    
    Accidentally lost this function. 
    Version here restored from file "Cross-tracker data compilation for map and download 2022-05-08.ipynb".
    """
    
    if map_choice == 'Latin America Portal - oil-gas':
        if 'production' not in df.columns:
            df['production'] = np.nan
            print("Added column 'production' because it was missing")
        else:
            pass

        oil_gas_keep_cols = [
            'project_en', 'project', 'unit_en', 'unit', 'type', 'owner', 'parent',
            'province', 'countries', 'status', 'start_year', 
            'capacity', 'production', 'capacity_production_unit',
            'geom', 'lat', 'lng', 'route',
            'url_en', 
            # 'url', # modified 2023-10-27 to remove local language URL
            # 'terminal_type', # TO DO: add this in the future, and coordinate with Tom; but it's not part of the map now
        ]
    elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker']:
        oil_gas_keep_cols = [
            'project', 'unit', 'type', 'owner', 'parent', 'countries',
            'status', 'start_year', 'capacity', 'production', 'capacity_production_unit',
            'geom', 'lat', 'lng', 'route', 'url',
            # 'terminal_type', # TO DO: add this in the future, and coordinate with Tom; but it's not part of the map now
        ]
    elif map_choice == 'Europe Gas Tracker':
        oil_gas_keep_cols = [
            'project', 'unit', 'type', 'owner', 'parent', 'countries',
            'status', 'start_year', 'capacity', 'production', 'capacity_production_unit',
            'geom', 'lat', 'lng', 'route', 'url', 
            # 'pci3', 'pci4', 
            'pci5', 'pci6', # new for feb 2024
            # 'terminal_type', # TO DO: add this in the future, and coordinate with Tom; but it's not part of the map now
        ]
    elif map_choice in ['Oil Infrastructure']:
        oil_gas_keep_cols = [
            'project', 'unit', 'type',
            # note: owner is not included currently
            'parent', 'countries', 'status', 'start_year', 'capacity',
            'geom', 'route', 'url',
        ]
    
    elif map_choice in ['Gas Infrastructure']:
        oil_gas_keep_cols = [
            'project', 'unit', 'type',
            # note: owner is not included currently
            'parent', 'countries', 'status', 'start_year', 'capacity', 'capacity_units',
            'geom', 'lat', 'lng', 'route', 'url',
        ]
        
    elif map_choice == 'Oil & Gas Plant':
        oil_gas_keep_cols = [
            'project', 'project_loc', 'unit', 'province', 'country', 'region', 'status', 'fuel_type', 
            'capacity', 'technology', 'start_year', 'owner', 'parent', 'lat', 'lng', 'url',
        ]
    elif map_choice == 'GOGET':
        oil_gas_keep_cols = [
            'project', 'country', 'status', 'discovery_year', 'start_year', 
            'operator', 'owner', 'parent',
            'fuel_type', 'unit_type',
            'lat', 'lng', 'url'
        ]
    # TO DO: specify which maps use the columns below, rather than using else case to do this
    else:
        oil_gas_keep_cols = [
            'project', 'unit', 'type', 'owner', 'parent', 'countries',
            'status', 'start_year', 'capacity', 'capacity_production_unit',
            'geom', 'lat', 'lng', 'route', 'url',
            # 'terminal_type', # TO DO: add this in the future, and coordinate with Tom; but it's not part of the map now
        ]
    df = df[oil_gas_keep_cols]
    
    return df
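
The TO DO in the function above asks for each map to be named explicitly rather than falling through to the `else` case. One way to sketch that is a lookup table keyed by `map_choice` that fails loudly for unknown maps. The names `KEEP_COLS_BY_MAP` and `get_keep_cols` are hypothetical, and only two of the column lists are reproduced here for brevity.

```python
# Hypothetical lookup table; the column lists are copied from the function above.
KEEP_COLS_BY_MAP = {
    'Oil Infrastructure': [
        'project', 'unit', 'type', 'parent', 'countries', 'status',
        'start_year', 'capacity', 'geom', 'route', 'url',
    ],
    'GOGET': [
        'project', 'country', 'status', 'discovery_year', 'start_year',
        'operator', 'owner', 'parent', 'fuel_type', 'unit_type',
        'lat', 'lng', 'url',
    ],
}

def get_keep_cols(map_choice):
    """Return the column list for a map; raise for maps without an entry."""
    try:
        return KEEP_COLS_BY_MAP[map_choice]
    except KeyError:
        raise ValueError(f"No column list defined for map_choice: {map_choice}") from None

goget_cols = get_keep_cols('GOGET')
```

With this approach, adding a new map means adding one dictionary entry, and a misspelled `map_choice` raises an error instead of silently using a default column list.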

In [1303]:
def test_oil_gas_map_columns(oil_gas_map_df):
    
    # set expected columns
    if map_choice == 'Latin America Portal - oil-gas':
        oil_gas_expected_cols = [
            'project_en', 'project', 'unit_en', 'unit', 'type', 'owner', 'parent', 
            'province', 'countries', 'status', 'start_year', 'capacity', 'production', 
            'capacity_production_unit', 'geom', 'lat', 'lng', 'route',
            'url_en', 
            # 'url', # modified 2023-10-27 to remove local language URL
            'status_legend'
            # TO DO: add terminal_type in the future (to distinguish import/export)
        ]
    elif map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker']:
        oil_gas_expected_cols = [
            'project', 'unit', 'type', 'owner', 'parent', 'countries', 'status', 
            'start_year', 'capacity', 'production', 'capacity_production_unit', 
            'geom', 'lat', 'lng', 'route', 'url', 'status_legend'
            # TO DO: add terminal_type in the future (to distinguish import/export)
        ]
    elif map_choice == 'Europe Gas Tracker':
        oil_gas_expected_cols = [
            'project', 'unit', 'type', 'owner', 'parent', 'countries', 'status', 
            'start_year', 'capacity', 'production', 'capacity_production_unit', 
            'geom', 'lat', 'lng', 'route', 'url', 'status_legend',
            'pci5', 'pci6',
            # TO DO: add terminal_type in the future (to distinguish import/export)
        ]
    elif map_choice in ['Gas Infrastructure']:
        oil_gas_expected_cols = [
            'project', 'unit', 'type', 'parent', 
            'countries', 'status', 'start_year', 'capacity', 'capacity_units', 
            'geom', 'lat', 'lng', 'route', 'url',
        ]
    elif map_choice == 'GOGET':
        oil_gas_expected_cols = [
            'project', 'country', 'status', 'discovery_year', 'start_year', 
            'operator', 'owner', 'parent', 'fuel_type', 'unit_type', 'lat', 'lng', 'url'
        ]       
    else:
        print(f"Not set up to check oil-gas map file columns for map_choice: {map_choice}")
        oil_gas_expected_cols = [] # initialize
        
    # check for expected columns
    if oil_gas_expected_cols != []:
        if set(oil_gas_map_df.columns.tolist()) == set(oil_gas_expected_cols):
            print("Test passed. All columns were as expected")
        else:
            print('-'*40 + '\n' + "Error! Map columns were not as expected.")
            for x in oil_gas_map_df.columns:
                if x not in oil_gas_expected_cols:
                    print(f"Column in map df not in oil_gas_expected_cols: {x}")
            for x in oil_gas_expected_cols:
                if x not in oil_gas_map_df.columns:
                    print(f"Column in oil_gas_expected_cols not in map df: {x}")
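
The column check above compares two sets and then loops over each list to report the differences. As a sketch, the same comparison can be expressed with set differences that return both lists, which makes the check easier to reuse in an automated test. `compare_columns` is a hypothetical helper, and the column names in the example are made up.

```python
def compare_columns(actual_cols, expected_cols):
    """Hypothetical helper: return (unexpected, missing) column names, sorted."""
    actual, expected = set(actual_cols), set(expected_cols)
    unexpected = sorted(actual - expected)  # in the map df, but not expected
    missing = sorted(expected - actual)     # expected, but not in the map df
    return unexpected, missing

unexpected, missing = compare_columns(
    ['project', 'status', 'lat', 'lng', 'notes'],
    ['project', 'status', 'lat', 'lng', 'url'],
)
```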

In [1304]:
def compile_all_oil_gas_data(
    map_choice, data_versions_dict, data_keys_titles, export_files
):
    """
    For download, update Excel file with multiple sheets:
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
    
    For map, write to Excel file specifically for the map.
    """
    
    print("*"*40)
    print(f"Running compile_all_oil_gas_data for map_choice: {map_choice}")
    if map_choice == 'Latin America Portal - oil-gas':
        fuel_details = ' - oil & gas - '
        fuel_details_spanish = ' - petróleo y gas - '
    else:
        fuel_details = ' '
        fuel_details_spanish = ' '
    download_file_name = f'{map_choice} - download file{fuel_details}{save_timestamp}.xlsx'
    download_file_name_spanish = f'{map_choice} - descargar datos{fuel_details_spanish}{save_timestamp}.xlsx' 
    
    path_and_filename_for_download = path_for_download_and_map_files + download_file_name   
    path_and_filename_for_download_spanish = path_for_download_and_map_files + download_file_name_spanish
    
    oil_gas_data_for_download_list = [] # initialize
    oil_gas_data_for_download_list_spanish = [] # initialize
    oil_gas_data_for_map_list = [] # initialize

    if 'gas plants' in data_versions_dict[map_choice].keys():
        gas_plants_for_download_dict, gas_plants_for_map = run_all_gas_plant_functions(
            map_choice, data_versions_dict, data_keys_titles,
        )
        oil_gas_data_for_map_list += [gas_plants_for_map]
        
        oil_gas_data_for_download_list += [('Gas plants - data', gas_plants_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - oil-gas':
            oil_gas_data_for_download_list_spanish += [('Central de gas - datos', gas_plants_for_download_dict['Spanish'])]
    else:
        print("Not processing gas plants")

    # TO DO: make sure below will work as we expect for gas-only trackers
    # (Europe gas, Asia gas, etc.)
    run_pipeline_functions = False # initialize
    for option in ['gas pipelines', 'oil and NGL pipelines']:
        if option in data_versions_dict[map_choice].keys():
            run_pipeline_functions = True
            
    if run_pipeline_functions == True:
        pipelines_for_download_dict, pipelines_for_map = run_all_pipeline_functions(
            map_choice, data_versions_dict, data_keys_titles, pipelines_to_use_dict,
            no_route_entries,
        )
        oil_gas_data_for_map_list += [pipelines_for_map]

        # print(f"Columns for pipelines_for_map: {pipelines_for_map.columns.tolist()}") # for db
        
        if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Gas Infrastructure', 'Europe Gas Tracker']:
            oil_gas_data_for_download_list += [('Gas pipelines - data', pipelines_for_download_dict['English'])]
        elif map_choice in ['Oil Infrastructure']:
            oil_gas_data_for_download_list += [('Oil pipelines - data', pipelines_for_download_dict['English'])]        
        elif map_choice == 'Latin America Portal - oil-gas':
            oil_gas_data_for_download_list += [('Oil & gas pipelines - data', pipelines_for_download_dict['English'])]
            oil_gas_data_for_download_list_spanish += [('Gasoducto o oleoducto - datos', pipelines_for_download_dict['Spanish'])]
        else:
            print("Unexpected case for map_choice, within compile_all_oil_gas_data")
            
    else:
        print("Not processing pipelines")

    if 'ggit lng' in data_versions_dict[map_choice].keys():
        lng_terminals_for_download_dict, lng_terminals_for_map = run_all_lng_terminal_functions(
            map_choice, data_versions_dict, data_keys_titles
        )
        
        oil_gas_data_for_map_list += [lng_terminals_for_map]
        
        oil_gas_data_for_download_list += [('LNG terminals - data', lng_terminals_for_download_dict['English'])]
        if map_choice == 'Latin America Portal - oil-gas':
            oil_gas_data_for_download_list_spanish += [('Terminal de GNL - datos', lng_terminals_for_download_dict['Spanish'])]
    else:
        print("Not processing LNG")

    if 'goget' in data_versions_dict[map_choice].keys():
        goget_main_for_download_dict, goget_prod_for_download_dict, goget_for_map = run_all_goget_functions(
            map_choice, data_versions_dict, data_keys_titles
        )
        oil_gas_data_for_map_list += [goget_for_map]
        
        if map_choice in ['Africa Gas Tracker', 'Asia Gas Tracker', 'Europe Gas Tracker']:
            oil_gas_data_for_download_list += [('Gas extraction - main', goget_main_for_download_dict['English'])]
            oil_gas_data_for_download_list += [('Gas extraction - production', goget_prod_for_download_dict['English'])]      
        else:
            oil_gas_data_for_download_list += [('Oil & gas extract - main', goget_main_for_download_dict['English'])]
            oil_gas_data_for_download_list += [('Oil & gas extract - production', goget_prod_for_download_dict['English'])]
            
        if map_choice == 'Latin America Portal - oil-gas':
            # put additional data into the list
            oil_gas_data_for_download_list_spanish += [('Extrac petróleo gas principal', goget_main_for_download_dict['Spanish'])]
            oil_gas_data_for_download_list_spanish += [('Extrac petróleo gas producción', goget_prod_for_download_dict['Spanish'])]
    else:
        print("Not processing GOGET")
    
    print("*"*40)
    print("Finished running functions for each tracker")
    print("-"*40)    
    
    if len(oil_gas_data_for_map_list) == 0:
        print("There was no oil & gas data to add to map")
        oil_gas_map_df = pd.DataFrame()
    
    else:
        # handle download file
        oil_gas_export_download_files(
            oil_gas_data_for_download_list, 
            download_file_name, path_and_filename_for_download, 
            oil_gas_data_for_download_list_spanish,
            download_file_name_spanish, path_and_filename_for_download_spanish
        )
            
        # handle map file (uses export_files value to decide whether to export)
        oil_gas_map_df = oil_gas_data_for_map_final_processing_and_export(
            list_of_dfs = oil_gas_data_for_map_list, 
            path_for_download_and_map_files = path_for_download_and_map_files,
        )

    print("-"*40)
    print("Finished compile_all_oil_gas_data")
    print("*"*40)
    
    # don't try to return download file because it is multiple dfs, essentially unchanged from the original trackers
    return (oil_gas_map_df, oil_gas_data_for_download_list, oil_gas_data_for_download_list_spanish)

In [1305]:
# # sandbox:
# lng_terminals_for_download_dict, lng_terminals_for_map = run_all_lng_terminal_functions(
#     map_choice, data_versions_dict, data_files_and_paths
# )

In [1306]:
# # sandbox:
# pipelines_for_download_dict, pipelines_for_map = run_all_pipeline_functions(
#     map_choice, data_versions_dict, data_files_and_paths,
# )

# Renewables & other power data

## Common functions

In [1307]:
def test_statuses(df, accepted_statuses, status_col_name, other_cols_to_print):
    """
    other_cols_to_print (example): ['Project Name', 'Unit Name']
    """
    sel = df[~df[status_col_name].isin(accepted_statuses)]
    if len(sel) > 0:
        print("Error!" + f" There were {len(sel)} rows with unaccepted statuses:")
        # status_col_name is a string, so wrap it in a list before concatenating
        print(sel[other_cols_to_print + [status_col_name]])
    elif len(sel) == 0:
        print("Test passed! All statuses were in accepted list.")
    else:
        print("Unexpected case in test_statuses")
    print()

In [1308]:
def find_duplicate_column_names(col_name_list):
    dups = [x for x in col_name_list if col_name_list.count(x) > 1]
    unique_dups = list(set(dups))
    if len(unique_dups) > 0:
        print(f"Duplicated columns: {unique_dups}")
    else:
        pass
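
`find_duplicate_column_names` calls `list.count` once per element, which is fine for a few dozen columns. A sketch of an equivalent single-pass version using `collections.Counter` is shown below. `duplicated_names` is a hypothetical variant that returns the duplicates (in first-seen order) instead of printing them.

```python
from collections import Counter

def duplicated_names(col_name_list):
    """Hypothetical variant: return names that appear more than once."""
    counts = Counter(col_name_list)
    return [name for name, n in counts.items() if n > 1]

dups = duplicated_names(['Project Name', 'Status', 'Owner', 'Status', 'Owner', 'Owner'])
```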

In [1309]:
# def convert_cols_to_float_solar_or_wind(df, cols_to_convert):
#     for col_to_convert in cols_to_convert:
#         try:
#             df[col_to_convert] = df[col_to_convert].astype(float)
#             print(f"Converted to float for col {col_to_convert}")
#         except:
#             for row in df.index:
#                 row_val = df.at[row, col_to_convert]
#                 try:
#                     row_val_float = float(row_val)
#                 except:
#                     project_name = df.at[row, 'Project Name']
#                     sheet_name = df.at[row, 'Sheet name']
#                     print("Error!" + f" For {col_to_convert}, couldn't convert to float: {row_val}; from {project_name} in sheet {sheet_name}")
#     print()                
#     return df

In [1310]:
def convert_cols_to_float_renewables_and_other_power(df, cols_to_convert, project_name_col):

    for col_to_convert in cols_to_convert:
        col_dtype = df[col_to_convert].dtype
        if col_dtype not in ['int64', 'float64']:
            try:
                df[col_to_convert] = df[col_to_convert].astype(float)
                print(f"Converted to float for col {col_to_convert}")
            except (ValueError, TypeError):
                # report each value that blocks the conversion; the column is left unconverted
                for row in df.index:
                    row_val = df.at[row, col_to_convert]
                    try:
                        float(row_val)
                    except (ValueError, TypeError):
                        project_name = df.at[row, project_name_col]
                        print("Error!" + f" For {col_to_convert}, couldn't convert to float: {row_val}; from {project_name}")
        else:
            print(f"Column {col_to_convert} was already a number (dtype {col_dtype})")
    print()
    
    return df

## Solar & wind combined functions

In [1311]:
def solar_or_wind_read_working_using_pygsheets(exclude_sheets):
    df = pd.DataFrame() # initialize empty df
    gc = pygsheets.authorize(client_secret_full_path)
    
    if map_choice == 'Solar Power':
        print("set working key for solar")
        working_key = data_files_and_paths['solar_working_key']
    elif map_choice == 'Wind Power':
        working_key = data_files_and_paths['wind_working_key']
    else:
        working_key = ''
        
    if working_key != '':
        working_gsheet = gc.open_by_key(working_key)

        all_sheets = [] # initialize

        for worksheet in working_gsheet.worksheets():
            if worksheet.title not in exclude_sheets:
                print(f"reading sheet {worksheet.title}")

                one_sheet = working_gsheet.worksheet('title', worksheet.title)
                one_sheet_df = one_sheet.get_as_df()
                one_sheet_df['Sheet name'] = worksheet.title
                
                # if '' in one_sheet_df.columns:
                #     one_sheet_df = one_sheet_df.drop('', axis=1)
                
                all_sheets += [one_sheet_df]
                
                find_duplicate_column_names(one_sheet_df.columns.tolist()) # for db

        # overwrite initial df
        df = pd.concat(all_sheets, sort=False).reset_index(drop=True)
            
    df_raw = df
    return df_raw

In [1312]:
def solar_read_official():
    solar_raw = pd.read_excel(
        data_files_and_paths['solar_official_path'] + 
        data_files_and_paths['solar_official_file'],
        sheet_name = 'Large Utility-Scale', # changed 2023-11-21; previously was 'Data'
    )
    
    return solar_raw

In [1313]:
def wind_read_official():
    wind_raw = pd.read_excel(
        data_files_and_paths['wind_official_path'] + 
        data_files_and_paths['wind_official_file'],
        sheet_name = 'Data',
    )
    
    return wind_raw

In [1314]:
def solar_or_wind_clean(df_raw):
    df = df_raw.copy()
    
    # remove columns with no name; clean up columns we're keeping
    for col in df.columns:
        if 'Unnamed: ' in col:
            df = df.drop(col, axis=1)
        else:
            # TO DO: remove steps below if not needed
            # # convert all to str; strip white space
            # df[col] = df[col].astype(str).str.strip()

            # # replace empty strings with NaN
            # df[col] = df[col].replace('', np.nan)
            
            pass

    # remove rows with no entry for 'Project Name' (either NaN or empty string)
    df = df.dropna(subset=['Project Name'])
    df = df[df['Project Name']!='']
    
    return df

In [1315]:
def renewables_exclude_no_coord(df):
    sel = df[(df['Latitude'].isna()) | (df['Longitude'].isna())]
    if len(sel)==0:
        print("Test passed! All rows have lat & lon")
    elif len(sel) > 0:
        print("Error!" + f" There were {len(sel)} rows with no lat and/or lon:")
        print(sel[['Project Name', 'Latitude', 'Longitude']])
        
        
    # remove those with no lat or lon
    len_init = len(df)
    df = df[~df['Latitude'].isna()]
    df = df[~df['Longitude'].isna()]
    len_final = len(df)
    print(f"Removed {len_init - len_final} rows with missing coordinates.")
    print()
    return df
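
`renewables_exclude_no_coord` builds the selection of rows with missing coordinates and then filters the frame in two further steps. A minimal sketch that computes the mask once and returns both the kept and the dropped rows is shown below. `drop_rows_missing_coords` is a hypothetical helper, and the sample data is made up.

```python
import numpy as np
import pandas as pd

def drop_rows_missing_coords(df, lat_col='Latitude', lon_col='Longitude'):
    """Hypothetical helper: return (kept_rows, dropped_rows)."""
    missing = df[lat_col].isna() | df[lon_col].isna()
    return df[~missing].copy(), df[missing].copy()

sample = pd.DataFrame({
    'Project Name': ['A', 'B', 'C'],
    'Latitude': [10.5, np.nan, -3.2],
    'Longitude': [20.1, 30.0, np.nan],
})
kept, dropped = drop_rows_missing_coords(sample)
```

Returning the dropped rows lets the caller decide how to report them, for example by printing the project names or writing them to a log.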

In [1316]:
def test_solar_wind_statuses(df, accepted_statuses):

    sel = df[~df['Status'].isin(accepted_statuses)]
    if len(sel) > 0:
        print("Error!" + f" There were {len(sel)} rows with unaccepted statuses:")
        print(sel[['Project Name', 'Phase Name', 'Status']])
        print("Accepted statuses:")
        print(accepted_statuses)
    elif len(sel) == 0:
        print("Test passed! All statuses were in accepted list.")
    else:
        print("Unexpected case in test_solar_wind_statuses")
    print()

## Solar Power

In [1317]:
def solar_create_map_file(df): 
    df['Phase Name'] = df['Phase Name'].fillna('')
    
    solar_map_col = [
        'Project Name', 'Phase Name', 
        "Project Name in Local Language / Script",
        'Capacity (MW)',
        'Capacity Rating',
        'Status', 'Start year', 
        'Operator', 'Owner', 'Country',
        'Wiki URL',
        'Latitude', 'Longitude', 'Location accuracy',
        'Technology Type',
        'Region', 'Subregion',
    ]
    
    df = df[solar_map_col]
    
    if map_choice == 'Latin America Portal - renewables':
        # use lower-case names; handle English & local language
        df = df.rename(columns={
            'Project Name': 'project_en',
            'Project Name in Local Language / Script': 'project',
            'Phase Name': 'phase_en', # equivalent of 'unit' in coal/steel & oil/gas
            'Capacity Rating': 'capacity_rating',
            'Start year': 'start_year',
            'Wiki URL': 'url_en',
            'Owner': 'owner',
            'Operator': 'operator',
            'Country': 'country',
            'Status': 'status',
            'Operating status': 'status',
            'Capacity (MW)': 'capacity',
            'Latitude': 'lat',
            'Longitude': 'lng',            
            'Location accuracy': 'loc_accuracy',
        })
        
        # modified 2023-10-27 to remove local language URL
        # # while we don't have translations, put English wiki page into column for local language wiki name
        # df['url'] = df['url_en']
        
        # put unit name in English into column for local language unit name
        df['phase'] = df['phase_en']
        
        # add 'type' column
        df['type'] = 'Solar farm'
        
        # print(f"Show df.columns after rename for {map_choice}: {df.columns.tolist()}") # for db
        
    else:
        df = df.rename(columns={
            'Project Name': 'Project name',
            'Project Name in Local Language / Script': 'Project name local',
            'Phase Name': 'Phase name',
            'Capacity Rating': 'Capacity rating',
            'Technology Type': 'Type',
            'Status': 'Status',
            'Operating status': 'Status',
            # 'Region': 'Region',
            # 'Subregion': 'Subregion',
        })
    
    if export_files == True:
        if map_choice == 'Latin America Portal - renewables':
            print("Not exporting map file for solar alone for Latin America; will later export combined solar-wind file")
            pass
        else:
            export_file = f'Global Solar Power Tracker map file {save_timestamp}.xlsx'
            df.to_excel(path_for_download_and_map_files + export_file, index=False)
            # # note: if an IllegalCharacterError comes up, can try in line above: engine='xlsxwriter'
            print(f"Exported solar map file to Excel: {export_file}") # for UI
       
    return df

In [1318]:
def run_all_solar_functions():
    print('-'*40 + '\n' + "Running run_all_solar_functions") # for UI
    solar_version = data_versions_dict[map_choice]['solar power']
    
    if solar_version == 'official':
        solar_raw = solar_read_official()
        
    elif solar_version == 'working':
        solar_exclude_sheets = [
            'Data Dictionary', 'sponsor-parent', 'parent metadata', 
            'references', 'countryregion', 'country/region', 'Units Removed'
        ]
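        solar_raw = solar_or_wind_read_working_using_pygsheets(exclude_sheets = solar_exclude_sheets)

    else:
        # without this, solar_raw would be undefined below
        raise ValueError(f"Unexpected value for solar_version: {solar_version}")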

    solar_clean = solar_or_wind_clean(solar_raw)
    
    # filter for Latin America
    if map_choice == 'Latin America Portal - renewables':
        solar_clean = solar_clean[solar_clean['Country'].isin(lat_am_carib_countries)]
        
    elif map_choice == 'Solar Power':
        # global solar data; don't filter
        pass
    
    else:
        print("Error!" + f" In processing solar, unexpected value for map_choice: {map_choice}")
    
    solar_clean = convert_cols_to_float_renewables_and_other_power(
        df=solar_clean, 
        cols_to_convert=['Capacity (MW)', 'Latitude', 'Longitude'],
        project_name_col='Project Name',
    )

    solar_clean = renewables_exclude_no_coord(solar_clean)
    
    test_solar_wind_statuses(solar_clean, accepted_statuses_sel[map_choice])

    solar_map = solar_create_map_file(solar_clean)
    
    # TEST:
    if map_choice == 'Latin America Portal - renewables':
        cols_to_test = ['phase', 'phase_en', 'start_year', 'operator', 'owner']
    else:
        cols_to_test = ['Phase name', 'Start year', 'Operator', 'Owner']
    for col in cols_to_test:
        x = solar_map[solar_map[col]=='nan']
        if len(x) > 0:
            print(f"For solar_map, the string 'nan' found in column {col}; number of times: {len(x)}")
    # END TEST
    
    return solar_clean, solar_map
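
The check for the literal string 'nan' is repeated for the solar and wind frames. As a minimal sketch, the check could be factored into one helper that returns the counts instead of printing them. The name `count_nan_strings` and the sample data are hypothetical and not part of this notebook.

```python
import pandas as pd


def count_nan_strings(df, cols):
    """Return {column: count} for the columns that contain the literal string 'nan'."""
    counts = {}
    for col in cols:
        n = int((df[col] == 'nan').sum())
        if n > 0:
            counts[col] = n
    return counts


# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    'Phase name': ['1', 'nan', ''],
    'Owner': ['A', 'B', 'C'],
})
print(count_nan_strings(sample, ['Phase name', 'Owner']))
```

Returning a dictionary rather than printing makes it possible to assert on the result in a test cell.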

## Wind Power

In [1319]:
def wind_create_map_file(df): 
    df['Phase Name'] = df['Phase Name'].fillna('')

    wind_map_col = [
        'Project Name', 'Phase Name', 
        "Project Name in Local Language / Script",
        'Capacity (MW)',
        # 'Capacity Rating', not a column in the wind tracker
        'Installation Type',
        'Status', 'Start year', 
        'Operator', 'Owner', 'Country',
        'Wiki URL',
        'Latitude', 'Longitude', 'Location accuracy',
        'Region', 'Subregion',
    ]
    
    # take a copy so the column assignments below don't act on a view of the input df
    df = df[wind_map_col].copy()
    
    if map_choice == 'Latin America Portal - renewables':
        # use lower-case names; handle English & local language
        df = df.rename(columns={
            'Project Name': 'project_en',
            'Project Name in Local Language / Script': 'project',
            'Phase Name': 'phase_en', # equivalent of 'unit' in coal/steel & oil/gas
            'Capacity Rating': 'capacity_rating',
            'Start year': 'start_year',
            'Wiki URL': 'url_en',
            'Owner': 'owner',
            'Operator': 'operator',
            'Country': 'country',
            'Status': 'status',
            'Capacity (MW)': 'capacity',
            'Latitude': 'lat',
            'Longitude': 'lng',      
            'Location accuracy': 'loc_accuracy',
        })
        
        # modified 2023-10-27 to remove local language URLs
        # # while we don't have translations, put English wiki page into column for local language wiki name
        # df['url'] = df['url_en']
        
        # put unit name in English into column for local language unit name
        df['phase'] = df['phase_en']
        
        # create 'type' column that draws on 'Installation type' to distinguish onshore/offshore
        # df['type'] = df['Installation Type'].str.lower().replace({
        #     'offshore hard mount': 'offshore',
        #     'offshore floating': 'offshore',
        #     'offshore unknown mount': 'offshore',
        # })
        # df['type'] = 'Wind farm (' + df['type'] + ')'
        df['type'] = 'Wind farm (' + df['Installation Type'].str.lower() + ')'
        
        # remove column 'Installation Type'
        df = df.drop('Installation Type', axis=1)
        
        # TEST:
        wind_expected_types = [
            'Wind farm (offshore hard mount)', 
            'Wind farm (offshore floating)', 
            'Wind farm (offshore mount unknown)',
            'Wind farm (onshore)',
        ]
        # flag only values outside the expected list; a region need not contain every type
        unexpected_types = set(df['type'].unique().tolist()) - set(wind_expected_types)
        if len(unexpected_types) > 0:
            print("Error!" + f" Unexpected values for wind 'type': {list(unexpected_types)}") # for UI
            print(f"Expected values are: {wind_expected_types}") # for UI
        # END TEST
        
        # print(f"Show df.columns after rename for {map_choice}: {df.columns.tolist()}") # for db
        
    else:
        df = df.rename(columns={
            'Project Name': 'Project name',
            'Project Name in Local Language / Script': 'Project name local',
            'Phase Name': 'Phase name',
            'Technology Type': 'Type',
            'Region': 'Region',
            'Subregion': 'Subregion',
        })
    
    if export_files == True:
        if map_choice == 'Latin America Portal - renewables':
            print("Not exporting map file for wind alone for Latin America; will later export combined solar-wind file")
            pass
        else:
            export_file = f'Global Wind Power Tracker map file {save_timestamp}.xlsx'
            df.to_excel(path_for_download_and_map_files + export_file, index=False)
            # note: if an IllegalCharacterError comes up, can try in line above: engine='xlsxwriter'
            print(f"Exported wind map file to Excel: {export_file}") # for UI
       
    return df

In [1320]:
def run_all_wind_functions():
    print('-'*40 + '\n' + "Running run_all_wind_functions") # for UI
    wind_version = data_versions_dict[map_choice]['wind power']
    
    if wind_version == 'official':
        wind_raw = wind_read_official()
        
    elif wind_version == 'working':
        wind_exclude_sheets = [
            'Data Dictionary', 'sponsor-parent', 'parent metadata', 
            'references', 'countryregion', 'country/region', 'Units Removed', 'Temporary Imports',
        ]
        wind_raw = solar_or_wind_read_working_using_pygsheets(exclude_sheets = wind_exclude_sheets)

    wind_clean = solar_or_wind_clean(wind_raw)

    wind_clean = convert_cols_to_float_renewables_and_other_power(
        df=wind_clean, 
        cols_to_convert=['Capacity (MW)', 'Latitude', 'Longitude'],
        project_name_col='Project Name',
    )

    wind_clean = renewables_exclude_no_coord(wind_clean)
    
    # filter for Latin America
    if map_choice == 'Latin America Portal - renewables':
        wind_clean = wind_clean[wind_clean['Country'].isin(lat_am_carib_countries)]
        
    elif map_choice == 'Wind Power':
        # global wind data; don't filter
        pass
    else:
        print("Error!" + f" In processing wind, unexpected value for map_choice: {map_choice}")
        
    # TEST:
    for col in ['Phase Name', 'Start year', 'Operator', 'Owner']:
        x = wind_clean[wind_clean[col]=='nan']
        if len(x) > 0:
            print(f"For wind_clean, the string 'nan' found in column {col}; number of times: {len(x)}")
    # END TEST

    test_solar_wind_statuses(wind_clean, accepted_statuses_sel[map_choice])

    wind_map = wind_create_map_file(wind_clean)
    
    # TEST:
    if map_choice == 'Latin America Portal - renewables':
        cols_to_test = ['phase', 'phase_en', 'start_year', 'operator', 'owner']
    else:
        cols_to_test = ['Phase name', 'Start year', 'Operator', 'Owner']
    for col in cols_to_test:
        x = wind_map[wind_map[col]=='nan']
        if len(x) > 0:
            print(f"For wind_map, the string 'nan' found in column {col}; number of times: {len(x)}")
    # END TEST
    
    return wind_clean, wind_map

## Geothermal

In [1321]:
def geothermal_read_official():
    geothermal_xl = (        
        data_files_and_paths['geothermal_official_path'] + 
        data_files_and_paths['geothermal_official_file']
    )
    
    df_main = pd.read_excel(geothermal_xl, sheet_name = 'Data')
    df_subthreshold = pd.read_excel(geothermal_xl, sheet_name = 'Below Threshold')
    df = pd.concat([df_main, df_subthreshold]).reset_index(drop=True)
    
    return df

In [1322]:
def geothermal_read_working_using_pygsheets():
    gc = pygsheets.authorize(client_secret_full_path)
    
    working_key = data_files_and_paths['geothermal_working_key']
    working_gsheet = gc.open_by_key(working_key)

    print("reading sheet: 'Main'")

    one_sheet = working_gsheet.worksheet('title', 'Main')
    df_raw = one_sheet.get_as_df()

    find_duplicate_column_names(df_raw.columns.tolist()) # for db
    
    return df_raw

In [1323]:
def geothermal_clean_pygsheets(df):
    # clean data for 'Project Capacity (MW)'; if blank, fill with zero
    df['Project Capacity (MW)'] = df['Project Capacity (MW)'].replace('', 0)
    
    for col in df.columns.tolist():
        col_dtype = df[col].dtype
        if col_dtype == 'O':
            df[col] = df[col].astype(str).str.strip()
    
    return df

In [1324]:
def geothermal_condense_types(df):
    """
    For flash steam, don't distinguish between single, double, triple, unknown.
    """
    
    for row in df.index:
        row_type = df.at[row, 'Type'].strip()
        if row_type.startswith('flash steam'):
            # overwrite value
            df.at[row, 'Type'] = 'flash steam'
            
    print("Show technology ('Type') value counts after change:")
    print(df['Type'].value_counts())
    
    return df
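
The row loop above can also be written as a single vectorized assignment. The sketch below is one possible form under stated assumptions: `condense_flash_steam` is a hypothetical name, and the sample 'Type' values are made up for illustration. Unlike the loop, it tolerates missing values in the column.

```python
import pandas as pd


def condense_flash_steam(df, type_col='Type'):
    """Collapse every 'flash steam ...' variant into plain 'flash steam'."""
    out = df.copy()
    # na=False treats missing values as non-matching instead of raising
    is_flash = out[type_col].str.strip().str.startswith('flash steam', na=False)
    out.loc[is_flash, type_col] = 'flash steam'
    return out


# Hypothetical sample values, for illustration only
sample = pd.DataFrame({
    'Type': ['flash steam - double', 'binary cycle', None, ' flash steam - unknown'],
})
result = condense_flash_steam(sample)
print(result['Type'].tolist())
```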

In [1325]:
# def convert_cols_to_float_geothermal(df, cols_to_convert):

#     for col_to_convert in cols_to_convert:
#         col_dtype = df[col_to_convert].dtype
#         if col_dtype not in ['int64', 'float64']:
#             try:
#                 df[col_to_convert] = df[col_to_convert].astype(float)
#                 print(f"Converted to float for col {col_to_convert}")
#             except:
#                 for row in df.index:
#                     row_val = df.at[row, col_to_convert]
#                     try:
#                         row_val_float = float(row_val)
#                     except:
#                         project_name = df.at[row, 'Project Name']
#                         print("Error!" + f" For {col_to_convert}, couldn't convert to float: {row_val}; from {project_name}")
#         else:
#             print(f"Column {col_to_convert} was already a number (dtype {col_dtype})")
#     print()
    
#     return df

In [1326]:
def test_geothermal_statuses(df, accepted_statuses):

    sel = df[~df['Status'].isin(accepted_statuses)]
    if len(sel) > 0:
        print("Error!" + f" There were {len(sel)} rows with unaccepted statuses:")
        print(sel[['Project Name', 'Unit Name', 'Status']])
    else:
        print("Test passed! All statuses were in accepted list.")
    print()

In [1327]:
def geothermal_create_map_file(geothermal):
    df = geothermal.copy()
    
    df = geothermal_condense_types(df)
    df['Unit Name'] = df['Unit Name'].fillna('')
    
    # (from Ingrid on Asana 2022-04-21); mod based on GSPT Map - Notes for Tom file (2022-05-05)
    geothermal_map_col = [
        'Project Name', 'Unit Name', 
        'Project Name in Local Language / Script',
        'Type',
        'Unit Capacity (MW)',
        'Status', 'Start year', 
        'Operator', 'Owner', 'Country',
        'Wiki URL', 
        'Latitude', 'Longitude', 'Location accuracy',
        'Region', 'Subregion',
    ]
    
    df = df[geothermal_map_col]
    
    df = df.rename(columns={
        'Project Name': 'Project name',
        'Project Name in Local Language / Script': 'Project name local',
        'Unit Name': 'Unit name',
        'Unit Capacity (MW)': 'Unit capacity (MW)',
        'Start Year': 'Start year',
    })
    
    if export_files == True:
        if map_choice == 'Latin America Portal - renewables':
            print("Not exporting map file for geothermal for Latin America; will later export combined solar-wind file")
            pass
        else:
            export_file = f'Global Geothermal Power Tracker map file {save_timestamp}.xlsx'
            df.to_excel(path_for_download_and_map_files + export_file, index=False)
            # note: if an IllegalCharacterError comes up, can try in line above: engine='xlsxwriter'
            print("Exported geothermal map file to Excel:")
            print(export_file)
       
    return df

In [1328]:
def run_all_geothermal_functions():
    
    if data_versions_dict[map_choice]['geothermal power'] == 'working':
        geothermal_raw = geothermal_read_working_using_pygsheets()

        geothermal = geothermal_raw.copy()
        geothermal = geothermal_clean_pygsheets(geothermal)
        
    elif data_versions_dict[map_choice]['geothermal power'] == 'official':
        geothermal = geothermal_read_official()

    geothermal_cols_to_convert = ['Unit Capacity (MW)', 'Project Capacity (MW)', 'Latitude', 'Longitude']
    geothermal = convert_cols_to_float_renewables_and_other_power(
        df=geothermal, 
        cols_to_convert=geothermal_cols_to_convert,
        project_name_col='Project Name',
    )

    geothermal = renewables_exclude_no_coord(geothermal)

    test_geothermal_statuses(geothermal, accepted_statuses_sel[map_choice])

    geothermal_map = geothermal_create_map_file(geothermal)
    
    return geothermal_map

In [1329]:
# ANOTHER OLD VERSION!
# def create_map_file_latin_america_renewables():
#         print("For Latin America, compiling combined solar & wind data")
#         # create combined df, solar_and_wind_map
#         solar_clean, solar_map = run_all_solar_functions()
        
#         wind_clean, wind_map = run_all_wind_functions()

#         df = pd.concat([
#             solar_map, 
#             wind_map,
#         ], sort=False).reset_index(drop=True)

#         print(f"Show df.columns before reorder: {df.columns}") # for db

#         # reorder columns
#         # (have to order after concat of wind & solar, because concat can alter order of columns
#         renewables_col = [
#             'project_en', 'project',
#             'phase_en', 'phase',
#             'type',
#             'capacity', 'capacity_rating',
#             'status', 'start_year', 
#             'owner', 'operator', 'country',
#             'url_en', 'url',
#             'lat', 'lng', 'loc_accuracy',
#         ]
#         # TEST:
#         for col in df.columns:
#             if col not in renewables_col:
#                 print(f"Missing from renewables_col: {col}")
#         for col in renewables_col:
#             if col not in df.columns:
#                 print(f"Missing from df.columns: {col}")
#         # END TEST

#         df = df[renewables_col]
#         renewables_compiled_for_map = df
# ## Run renewables functions

## Bioenergy

In [1330]:
def bioenergy_read_working_local_file():
    all_sheets = [] # initialize
    
    bioenergy_xl = pd.ExcelFile(
        data_files_and_paths['bioenergy_working_path'] + 
        data_files_and_paths['bioenergy_working_file']
    )
    bioenergy_sheets = [
        'Africa',
        'Asia',
        'China',
        'India',
        'Central America & the Caribbean',
        'Eurasia',
        'Europe',
        'Middle East',
        'North America',
        'Oceania',
        'South America'
    ]
    for sheet_name in bioenergy_sheets:
        all_sheets += [pd.read_excel(
            bioenergy_xl, 
            sheet_name = sheet_name,
            dtype = {'Unit Name': str},
        )]

    df = pd.concat(all_sheets, sort=False).reset_index(drop=True)
    
    # remove instructions to researchers
    instructions = df[df['Researcher']=='Required']
    for row in instructions.index:
        df = df.drop(row)
        
    # remove empty rows
    df = df.dropna(how='all').reset_index(drop=True)
        
    find_duplicate_column_names(df.columns.tolist()) # for db
    
    return df

In [1331]:
def bioenergy_read_official_local_file():
    all_sheets = [] # initialize
    
    bioenergy_xl = pd.ExcelFile(
        data_files_and_paths['bioenergy_official_path'] + 
        data_files_and_paths['bioenergy_official_file']
    )
    df = pd.read_excel(
        bioenergy_xl, 
        sheet_name = 'Data',
        dtype = {'Unit Name': str},
    )
        
    # remove empty rows
    df = df.dropna(how='all').reset_index(drop=True)
        
    find_duplicate_column_names(df.columns.tolist()) # for db
    
    return df

In [1332]:
def bioenergy_condense_fuels(df):
    """
    For fuels, reduce the number of categories.
    """
    
    for col in ['Fuel 1', 'Fuel 2', 'Fuel 3']:
        df[col] = df[col].str.rsplit(pat='(', n=1).str[0].str.strip()
            
    print("Show fuels value counts after change:")
    fuels = pd.concat([df['Fuel 1'], df['Fuel 2'], df['Fuel 3']]).dropna()
    print(fuels.value_counts())
    
    return df
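
To illustrate what the `rsplit` chain in `bioenergy_condense_fuels` does, here is a small sketch on made-up fuel strings (the values are hypothetical, not taken from the tracker). Splitting once from the right on '(' drops only the last parenthetical; values without a parenthesis pass through unchanged, and missing values stay missing.

```python
import pandas as pd

# Hypothetical fuel descriptions, for illustration only
fuels = pd.Series(['wood chips (solid)', 'biogas (landfill)', 'bagasse', None])

# Split once from the right on '(', keep the part before it, trim whitespace
condensed = fuels.str.rsplit(pat='(', n=1).str[0].str.strip()
print(condensed.tolist())
```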

In [1333]:
def convert_cols_to_float_bioenergy(df, cols_to_convert):

    for col_to_convert in cols_to_convert:
        col_dtype = df[col_to_convert].dtype
        if col_dtype not in ['int64', 'float64']:
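            try:
                df[col_to_convert] = df[col_to_convert].astype(float)
                print(f"Converted to float for col {col_to_convert}")
            except (ValueError, TypeError):
                # report each value that blocks the conversion; the column itself is left unchanged
                # (the project name column is 'Project Name' or 'Project name', depending on the data version)
                name_col = 'Project Name' if 'Project Name' in df.columns else 'Project name'
                for row in df.index:
                    row_val = df.at[row, col_to_convert]
                    try:
                        float(row_val)
                    except (ValueError, TypeError):
                        project_name = df.at[row, name_col]
                        print("Error!" + f" For {col_to_convert}, couldn't convert to float: {row_val}; from {project_name}")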
        else:
            print(f"Column {col_to_convert} was already a number (dtype {col_dtype})")
    print()
    
    return df
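
As an alternative to looping over rows to find the values that block conversion, `pd.to_numeric` with `errors='coerce'` can locate them in one pass. This is a minimal sketch, not the notebook's method: `find_non_numeric` is a hypothetical helper and the sample data is made up.

```python
import pandas as pd


def find_non_numeric(df, col):
    """Return the rows whose value in `col` is present but cannot be parsed as a number."""
    parsed = pd.to_numeric(df[col], errors='coerce')
    # A value is a problem if it was non-missing before parsing and NaN afterwards
    bad = parsed.isna() & df[col].notna()
    return df[bad]


# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    'Project Name': ['A', 'B', 'C'],
    'Capacity (MW)': ['12.5', 'unknown', None],
})
print(find_non_numeric(sample, 'Capacity (MW)'))
```

Rows with a genuinely missing capacity are not reported, only rows with a value that fails to parse.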

In [1334]:
# def test_bioenergy_statuses(df, accepted_statuses):

#     sel = df[~df['Operating Status'].isin(accepted_statuses)]
#     if len(sel) > 0:
#         print("Error!" + f" There were {len(sel)} rows with unaccepted statuses:")
#         print(sel[['Project Name', 'Unit Name', 'Operating Status']])
#     elif len(sel) == 0:
#         print("Test passed! All statuses were in accepted list.")
#     else:
#         print("Unexpected case in test_bioenergy_statuses")
#     print()

In [1335]:
def bioenergy_combine_fuels_into_str(df):
    df['Fuel'] = ''
    
    for col in ['Fuel 1', 'Fuel 2', 'Fuel 3']:
        df['Fuel'] += df[col].fillna('').astype(str) + ', '
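    # collapse repeated separators left by blank fuels (e.g. 'a, , c, '), then trim the ends
    df['Fuel'] = df['Fuel'].str.replace(r'(, )+', ', ', regex=True).str.strip(', ')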
    
    return df

In [1336]:
def bioenergy_create_map_file(bioenergy):
    df = bioenergy.copy()
    
    print(df.columns) # for db
    
    df = bioenergy_condense_fuels(df)
    # df = bioenergy_combine_fuels_into_str(df)
    
    df = df.rename(columns={
        'Project Name': 'Project name',
        'Unit Name': 'Unit name',
        'Capacity (MW)': 'Unit capacity (MW)',
        'Start Year': 'Start year',
        'Unit start year': 'Start year',
        'Operating Status': 'Status',
        'Operating status': 'Status',
        'Is conversion?': 'Conversion',
    })
    
    # fill after the rename, so this works whether the source column was 'Unit Name' or 'Unit name'
    df['Unit name'] = df['Unit name'].fillna('')
    
    bioenergy_map_col = [
        'Project name', 'Unit name', 
        # 'Type',
        'Fuel 1', 'Fuel 2', 'Fuel 3',
        'Unit capacity (MW)',
        'Status', 'Start year', 
        'Conversion',
        'Unit conversion year', # new as of 2023-11
        'Owner', 'Operator', 'Country',
        'Wiki URL', 
        'Latitude', 'Longitude', 'Location accuracy',
    ]
    df = df[bioenergy_map_col]
    
    if export_files == True:
        if map_choice == 'Latin America Portal - renewables':
            print("Not exporting map file for bioenergy for Latin America; will later export combined solar-wind file")
            pass
        else:
            export_file = f'Global Bioenergy Power Tracker map file {save_timestamp}.xlsx'
            df.to_excel(path_for_download_and_map_files + export_file, index=False)
            # note: if an IllegalCharacterError comes up, can try in line above: engine='xlsxwriter'
            print("Exported bioenergy map file to Excel")
       
    return df

In [1337]:
def bioenergy_clean_working_local(df_raw):
    df = df_raw.copy()
    for col in df.columns:
        if ' [ref]' in col:
            df = df.drop(col, axis=1)
    
    return df

In [1338]:
def clean_unit_names(df, unit_name_col):
    for num in range(1, 10):
        df[unit_name_col] = df[unit_name_col].replace(f"{num}.0", str(num))
        
    return df
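
`clean_unit_names` handles the exact strings '1.0' through '9.0'. If unit numbers of 10 or more can occur, a regular expression covers every whole number written with a trailing '.0'. The sketch below assumes that is the intent; `strip_trailing_point_zero` is a hypothetical name and the sample values are made up.

```python
import pandas as pd


def strip_trailing_point_zero(series):
    """Turn strings such as '3.0' or '12.0' into '3' or '12'; leave other values alone."""
    # anchored pattern: only values that consist entirely of digits followed by '.0'
    return series.str.replace(r'^(\d+)\.0$', r'\1', regex=True)


# Hypothetical unit names, for illustration only
units = pd.Series(['1.0', '12.0', 'Unit 2.0', 'A', None])
print(strip_trailing_point_zero(units).tolist())
```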

In [1339]:
def run_all_bioenergy_functions():
    if data_versions_dict[map_choice]['bioenergy power'] == 'working pygsheets':
        df_raw = bioenergy_read_working_using_pygsheets()
        df = bioenergy_clean_pygsheets(df_raw)
    elif data_versions_dict[map_choice]['bioenergy power'] == 'working local':
        df_raw = bioenergy_read_working_local_file()
        df = bioenergy_clean_working_local(df_raw)
    elif data_versions_dict[map_choice]['bioenergy power'] == 'official':
        df = bioenergy_read_official_local_file()

    bioenergy_cols_to_convert = ['Capacity (MW)', 'Latitude', 'Longitude']
    df = convert_cols_to_float_bioenergy(df, bioenergy_cols_to_convert)

    df = clean_unit_names(df, 'Unit name')
    
    df = renewables_exclude_no_coord(df)

    test_statuses(
        df = df, 
        accepted_statuses = accepted_statuses_sel[map_choice], 
        status_col_name = 'Operating status', 
        other_cols_to_print = ['Project name', 'Unit name'],
    )
    bioenergy_map = bioenergy_create_map_file(df)
    
    return bioenergy_map

In [1340]:
# sandbox:
# bioenergy_map = run_all_bioenergy_functions()

In [1341]:
# bioenergy map column names Dec 2022:
# Project name	Unit name	Fuel 1	Fuel 2	Fuel 3	Unit capacity (MW)	Status	Start year	Conversion	Owner	Operator	Country	Wiki URL	Latitude	Longitude	Location accuracy

## Nuclear

In [1342]:
def nuclear_read_working_local_file():
    print("Reading nuclear power working file local")
    all_sheets = [] # initialize
    
    df = pd.read_excel(
        data_files_and_paths['nuclear_working_path'] + 
        data_files_and_paths['nuclear_working_file'],
        sheet_name = 'Data',
    )
        
    # remove empty rows
    df = df.dropna(how='all').reset_index(drop=True)
    
    # remove row with instructions
    if df.iloc[0, 0] == 'Required':
        df = df.drop(0)
        print("Found row with instructions; removed it.")
        
    find_duplicate_column_names(df.columns.tolist()) # for db
    
    return df

In [1343]:
def nuclear_clean_working_local(df_raw):
    df = df_raw.copy()
    for col in df.columns:
        if ' [ref]' in col:
            df = df.drop(col, axis=1)
    
    return df

In [1344]:
def nuclear_read_official_local_file():
    df = pd.read_excel(
        data_files_and_paths['nuclear_official_path'] + 
        data_files_and_paths['nuclear_official_file'],
        sheet_name = 'Data',
    )
    df = df.rename(columns={'Capacity (MW)': 'Unit Capacity (MW)'})
    
    return df

In [1345]:
def nuclear_create_map_file(nuclear):
    df = nuclear.copy()
    
    df['Unit Name'] = df['Unit Name'].fillna('')
    
    df = df.rename(columns={
        'Project Name': 'Project name',
        'Unit Name': 'Unit name',
        'Unit Capacity (MW)': 'Unit capacity (MW)',
        'Start Year': 'Start year',
        'Location Accuracy': 'Location accuracy',
    })
    
    nuclear_map_col = [
        'Project name', 'Unit name', 
        # 'Type',
        'Unit capacity (MW)',
        'Reactor Type',
        'Status', 'Start year', 
        'Owner', 'Operator', 'Country',
        'Wiki URL', 
        'Latitude', 'Longitude', 'Location accuracy',
    ]
    df = df[nuclear_map_col]
    
    if export_files == True:
        if map_choice == 'Latin America Portal - renewables':
            print("Not exporting map file for nuclear for Latin America; will later export combined solar-wind file")
            pass
        else:
            export_file = f'Global Nuclear Power Tracker map file {save_timestamp}.xlsx'
            df.to_excel(path_for_download_and_map_files + export_file, index=False)
            # note: if an IllegalCharacterError comes up, can try in line above: engine='xlsxwriter'
            print("Exported nuclear map file to Excel")
       
    return df

In [1346]:
def run_all_nuclear_functions():
    if data_versions_dict[map_choice]['nuclear power'] == 'working pygsheets':
        nuclear_raw = nuclear_read_working_using_pygsheets()
        nuclear = nuclear_clean_pygsheets(nuclear_raw)
    elif data_versions_dict[map_choice]['nuclear power'] == 'working local':
        nuclear_raw = nuclear_read_working_local_file()
        nuclear = nuclear_clean_working_local(nuclear_raw)
    elif data_versions_dict[map_choice]['nuclear power'] == 'official local':
        nuclear = nuclear_read_official_local_file()
    else:
        print(f"Unexpected value for data_versions_dict[map_choice]['nuclear power']: {data_versions_dict[map_choice]['nuclear power']}")

    nuclear_cols_to_convert = ['Unit Capacity (MW)', 'Latitude', 'Longitude']
    nuclear = convert_cols_to_float_renewables_and_other_power(
        nuclear, nuclear_cols_to_convert, 'Project Name')

    nuclear = renewables_exclude_no_coord(nuclear)
    
    test_statuses(
        df = nuclear, 
        accepted_statuses = accepted_statuses_sel[map_choice], 
        status_col_name = 'Status', 
        other_cols_to_print = ['Project Name', 'Unit Name'],
    )
    nuclear_map = nuclear_create_map_file(nuclear)
    
    return nuclear_map

## Renewables - overall functions

In [1347]:
def renewables_create_download_files_lat_am(solar_clean, wind_clean):
    """ Creates download files in English and Spanish.
    
    Only runs for Latin America Portal. 
    For global maps, already have official download files.
    No other regional maps include renewables.
    
    """
    
    download_file_name = f'Latin America - data download Eng {save_timestamp}.xlsx'
    download_file_name_spanish = f'Latin America - data download Span {save_timestamp}.xlsx'

    path_and_filename_for_download = path_for_download_and_map_files + download_file_name   
    path_and_filename_for_download_spanish = path_for_download_and_map_files + download_file_name_spanish

    # filter for Latin America
    solar_download = solar_clean[solar_clean['Country'].isin(lat_am_carib_countries)]
    wind_download = wind_clean[wind_clean['Country'].isin(lat_am_carib_countries)]

    # put into dictionary
    renewables_download_df_dict_eng = {
        'Solar power': solar_download,
        'Wind power': wind_download
    }

    renewables_export_download_file(
        path_and_filename_for_download, 
        renewables_download_df_dict_eng
    )

    solar_download_sp = lat_am_convert_one_tracker_col_names_to_spanish(
        tracker_df = solar_download,
        trans_sheet_name = 'solar',
    )
    # solar_download_sp = solar_wind_reorder_cols_lat_am(solar_download_sp)

    wind_download_sp = lat_am_convert_one_tracker_col_names_to_spanish(
        tracker_df = wind_download,
        trans_sheet_name = 'wind',
    )
    # wind_download_sp = solar_wind_reorder_cols_lat_am(wind_download_sp)

    renewables_download_df_dict_sp = {
        'Solar power': solar_download_sp,
        'Wind power': wind_download_sp
    }
    renewables_export_download_file(
        path_and_filename_for_download_spanish, 
        renewables_download_df_dict_sp
    )        
    # no return

In [1348]:
def renewables_export_download_file(path_and_filename, renewables_download_df_dict_eng):

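    # workbook.add_format and worksheet.set_column below are xlsxwriter methods, so set the engine explicitly
    with pd.ExcelWriter(path_and_filename, engine='xlsxwriter') as writer: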
        processed_data_sets = ''
        for data_set_name in renewables_download_df_dict_eng.keys():
            
            df = renewables_download_df_dict_eng[data_set_name]
            df.to_excel(writer, sheet_name=data_set_name, startrow=1, header=False, index=False)
            # specify startrow 1 to leave row 0 empty for headers

            workbook = writer.book
            worksheet = writer.sheets[data_set_name]

            # Add a header format; from: https://xlsxwriter.readthedocs.io/working_with_pandas.html
            header_format = workbook.add_format({
                'bold': True,
                'text_wrap': True,
                'valign': 'top',
                'border': 0,
                'bottom': 1,
            })
            
            # set width of all columns
            # set_column takes zero-based first and last column indexes
            worksheet.set_column(0, max(len(df.columns) - 1, 0), 12)

            # Write the column headers with the defined format.
            for col_num, value in enumerate(df.columns.values):
                worksheet.write(0, col_num, value, header_format)
            
            processed_data_sets += data_set_name + ', ' 
            
        print(f"Exported file: {path_and_filename.rsplit('/', 1)[-1]}")
        print(f"Includes sheets: {processed_data_sets.strip(', ')}")
                
    # no return

In [1349]:
def create_map_file_latin_america_renewables(solar_map, wind_map):
    print("For Latin America, compiling combined solar & wind data")
    # create combined df, solar_and_wind_map

    df = pd.concat([
        solar_map, 
        wind_map,
    ], sort=False).reset_index(drop=True)

    # reorder columns
    # (have to order after concat of wind & solar, because concat can alter order of columns)
    renewables_col = [
        'project_en', 'project',
        'phase_en', 'phase',
        'type',
        'capacity', 'capacity_rating',
        'status', 'start_year', 
        'owner', 'operator', 'country',
        'url_en', 
        # 'url', # modified 2023-10-27 to remove local language URL
        'lat', 'lng', 'loc_accuracy',
    ]
    
    # convert 'type' values
    df['type'] = df['type'].replace({
        'Wind farm (offshore hard mount)': 'Wind farm (offshore)',
        'Wind farm (offshore floating)': 'Wind farm (offshore)',
        'Wind farm (offshore mount unknown)': 'Wind farm (offshore)',
    })
    
    # TEST:
    for col in df.columns:
        if col not in renewables_col:
            print(f"Missing from renewables_col: {col}")
    for col in renewables_col:
        if col not in df.columns:
            print(f"Missing from df.columns: {col}")
    # END TEST

    df = df[renewables_col]
    
    return df   
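
The pattern in `create_map_file_latin_america_renewables` (concatenate frames whose columns differ, check the result against an expected column list, then reorder) can be seen on a small example. This is a sketch with made-up data: the column subset and values are hypothetical. A column that exists in only one input is filled with NaN for the rows from the other.

```python
import pandas as pd

# Hypothetical miniature inputs: solar has 'capacity_rating', wind has 'type'
solar = pd.DataFrame({'project': ['S1'], 'capacity': [10.0], 'capacity_rating': ['MWac']})
wind = pd.DataFrame({'project': ['W1'], 'capacity': [20.0], 'type': ['Wind farm (onshore)']})

combined = pd.concat([solar, wind], sort=False).reset_index(drop=True)

expected_cols = ['project', 'type', 'capacity', 'capacity_rating']
missing = [c for c in expected_cols if c not in combined.columns]
extra = [c for c in combined.columns if c not in expected_cols]
print(f"missing: {missing}; extra: {extra}")

# Reorder only after confirming nothing is missing, since selecting a missing column raises KeyError
combined = combined[expected_cols]
print(combined)
```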

In [1350]:
def export_map_latin_america_renewables(renewables_compiled_for_map):
    if export_files == True:
        renewables_compiled_for_map_file_name = f'{map_choice} - map data {save_timestamp}.xlsx'

        renewables_compiled_for_map.to_excel(
            path_for_download_and_map_files + 
            renewables_compiled_for_map_file_name,
            index=False
        )
        print("*"*40)
        print(f"Exported map file: {renewables_compiled_for_map_file_name}")
        print(f"len: {len(renewables_compiled_for_map)}")
        print("*"*40)
    else:
        print("*"*40)
        print(f"Did not export map file for {map_choice}")
        print("*"*40)

In [1351]:
def run_all_renewables_functions():
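    if map_choice == 'Solar Power':
        # run_all_solar_functions returns (solar_clean, solar_map); keep only the map df
        _, map_df = run_all_solar_functions()

    elif map_choice == 'Wind Power':
        # run_all_wind_functions returns (wind_clean, wind_map); keep only the map df
        _, map_df = run_all_wind_functions()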

    elif map_choice == 'Geothermal Power':
        map_df = run_all_geothermal_functions()

    elif map_choice == 'Bioenergy Power':
        map_df = run_all_bioenergy_functions()

    elif map_choice == 'Nuclear Power':
        map_df = run_all_nuclear_functions()

    elif map_choice == 'Latin America Portal - renewables':
        solar_clean, solar_map = run_all_solar_functions()
        wind_clean, wind_map = run_all_wind_functions()
        map_df = create_map_file_latin_america_renewables(solar_map, wind_map)
        export_map_latin_america_renewables(map_df)

        renewables_create_download_files_lat_am(solar_clean, wind_clean)        

    else:
        print(f"Nothing to compile for renewables, given map_choice: {map_choice}")
        map_df = pd.DataFrame()
        
    return map_df

# Run compiling functions (all fuels)

In [1352]:
# (oil_gas_map_df, 
#  oil_gas_data_for_download_list, 
#  oil_gas_data_for_download_list_spanish) = compile_all_oil_gas_data(
#     map_choice, data_versions_dict, data_files_and_paths, export_files)

(oil_gas_map_df, 
 oil_gas_data_for_download_list, 
 oil_gas_data_for_download_list_spanish) = compile_all_oil_gas_data(
    map_choice, data_versions_dict, data_keys_titles, export_files)




****************************************
Running compile_all_oil_gas_data for map_choice: Africa Gas Tracker
****************************************
Gas plants: read official version of data from local Excel file.
"gas_plants_official"
----------------------------------------
Checking columns in gas_plants
----------------------------------------
----------------------------------------
Gas plants: finished processing
----------------------------------------


****************************************
Gas pipelines: reading data from official release, local Excel file
ggit_pipes_official
----------------------------------------
Running convert_wkt_to_google_maps
this is after convert wkt logic: [nan, nan, nan, nan, nan, nan, '30.341591,112.250118:32.015042,112.125658', '30.595731,114.311586:30.20346,115.039205', nan, nan, '28.779254,105.369961:23.67901,103.07499', '28.446109,115.366511:28.167889,115.775752', nan, '28.165851,115.778914:27.955148,116.367699', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, '28.401137,114.814691:28.56155,114.462178', '27.575417,110.007939:27.449768,109.666682:27.44839,109.954338:27.213279,109.837764', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,

  goget_main[col] = goget_main[col].replace('', np.nan).astype(float)
  goget_main[col] = goget_main[col].replace('', np.nan).astype(float)


Checking columns goget_main
Checking columns goget_prod
       Unit name Fuel description  Quantity (converted) Units (converted)
14732  Mnazi Bay              gas            857.850772      million m³/y
Finished combo_id #0
No 'total liquids' entries to handle.
Finished clean_liquids_data
Running sum_total_production_per_unit
Error! In sum_total_production_per_unit, unexpected value for 'Units (converted)': million m³
----------------------------------------
GOGET: Finished processing
----------------------------------------
****************************************
Finished running functions for each tracker
----------------------------------------
Exporting data to download file: Africa Gas Tracker - download file 2024-03-06_1644.xlsx
Processing sheet Gas plants - data


  production_per_unit.at[row, 'production'] = "{0:.2f}".format(val)


Processing sheet Gas pipelines - data
Processing sheet LNG terminals - data
Processing sheet Gas extraction - main
Processing sheet Gas extraction - production
----------------------------------------
For map data, running final processing & export
Index(['url', 'countries', 'project', 'Plant name (local script)', 'unit',
       'fuel_type', 'capacity', 'status', 'technology', 'CHP',
       ...
       'Subnational unit (province, state)', 'Status year', 'Discovery year',
       'operator', 'Basin', 'Concession / block', 'Project or complex',
       'Government unit ID', 'Wiki URL local', 'production'],
      dtype='object', length=114)
Test passed; all rows had statuses.
Completed test_status_for_map

Finished exclude_missing_coordinates_or_route

Successfully converted values to float for col: capacity
Successfully converted values to float for col: lat
Successfully converted values to float for col: lng
Running test_map_specified_cells_have_values
show cols_to_check: ['project', 'typ

  df[col] = df[col].replace('', np.nan)


****************************************
Exported map file: Africa Gas Tracker - map data 2024-03-06_1644.xlsx
len: 1353
****************************************
----------------------------------------
Finished compile_all_oil_gas_data
****************************************
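
The log above reports an unexpected `Units (converted)` value (`million m³`) in `sum_total_production_per_unit`, while the sample production row shows `million m³/y`. A minimal sketch of a check that counts every unexpected unit string before totals are summed follows. `EXPECTED_UNITS` and `find_unexpected_units` are hypothetical names, and the expected set contains only the one unit visible in the log, which is an assumption.

```python
from collections import Counter

# Assumption: only the annual-rate unit seen in the log above is expected.
EXPECTED_UNITS = {"million m³/y"}


def find_unexpected_units(units):
    """Return a Counter of unit strings that are not in EXPECTED_UNITS."""
    return Counter(u for u in units if u not in EXPECTED_UNITS)


# Illustrative values only, standing in for the 'Units (converted)' column.
units_column = ["million m³/y", "million m³", "million m³/y", "million m³"]
unexpected = find_unexpected_units(units_column)
for unit, count in unexpected.items():
    print(f"Unexpected 'Units (converted)' value {unit!r} in {count} row(s)")
```

Counting the unexpected values rather than reporting only the first one shows how many rows are affected before deciding whether to fix the source data or extend the expected set.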


In [1353]:
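# TO DO: data_files_and_paths is not defined in this session (see NameError below).
# The oil & gas call above now passes data_keys_titles; this call probably needs the same change.
(coal_steel_map_df,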
 coal_data_for_download_list, 
 coal_data_for_download_list_spanish) = compile_all_coal_steel_data(
    map_choice, data_versions_dict, data_files_and_paths, export_files)

NameError: name 'data_files_and_paths' is not defined

In [None]:
# renewables functions
renewables_map_df = run_all_renewables_functions()

Nothing to compile for solar & wind, given map_choice: Africa Gas Tracker


In [None]:
# Map changes:
# Add status 'cancelled'

In [None]:
# TO DO: find where I generate the download file for Latin America renewables (solar & wind)