# Summary
This notebook automatically extracts data from the ATB yearly workbook file and generates a summary according to the specification.

## Getting Ready
- First, import the required package and set the file paths.
    - Two input paths are required:
        1. raw_data_path: the path to the raw data file
        2. ancillary_path: the path to the ancillary file, which declares the techs, indexes, and years of interest.

- Build a tech-to-sheet dictionary and three lists (techs, indexes, years) from the ancillary file.
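As a rough illustration of the second bullet, the sketch below builds the same kind of dictionary and lists from a small in-memory frame instead of `ancillary.xlsx`. The column names `Tech`, `Sheet`, `Index`, and `Year` match the ones used in this notebook; the rows, and every name ending in `_demo`, are made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for ancillary.xlsx; the rows are illustrative only.
# The Index and Year columns are shorter than Tech, so they are padded with None.
ancillary_demo = pd.DataFrame({
    'Tech': ['Land-Based Wind', 'Offshore Wind', 'Wind Subtype A'],
    'Sheet': ['Land-Based Wind', 'Offshore Wind', 'Land-Based Wind'],
    'Index': ['CAPEX ($/kW)', 'Fixed Operation and Maintenance Expenses ($/kW-yr)', None],
    'Year': [2020, 2030, None],
})

# tech -> sheet lookup
tech_sheet_demo = dict(ancillary_demo[['Tech', 'Sheet']].values)

# plain lists; dropna() removes the padding cells
tech_demo = list(ancillary_demo['Tech'].values)
index_demo = list(ancillary_demo['Index'].dropna().values)
year_demo = list(ancillary_demo['Year'].dropna().values)
```

Because the padded `Year` column contains missing values, pandas stores it as floats, which is one plausible reason the years show up as `2025.0` rather than `2025` in the outputs further down.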

In [90]:
import pandas as pd

In [91]:
raw_data_path = '/Users/zhixuan/PycharmProjects/ATB-Raw-Summarization/data/2022 v2 Annual Technology Baseline Workbook Corrected 7-21-2022.xlsx'

ancillary_path = '/Users/zhixuan/PycharmProjects/ATB-Raw-Summarization/data/ancillary.xlsx'

raw = pd.ExcelFile(raw_data_path)

summary_path = '/Users/zhixuan/PycharmProjects/ATB-Raw-Summarization/data/summary.xlsx'

del raw_data_path

In [92]:
ancillary = pd.read_excel(ancillary_path)

# tech: sheet dictionary
tech_sheet_dict = dict(ancillary[['Tech', 'Sheet']].values)

# make tech list out of ancillary
tech_list = list(ancillary['Tech'].values)

# make index list out of ancillary
index_list = list(ancillary['Index'].dropna().values)

# make year list out of ancillary
year_list = list(ancillary['Year'].dropna().values)

# free the names that are no longer needed
del ancillary, ancillary_path

## Main loop
- Iterate over the required tech sheets and drop the extraneous rows.
- Store each sheet's cleaned dataframe in a dictionary keyed by sheet name.
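The row clean-up can be hard to picture without data, so here is a minimal sketch on a toy block. It assumes the layout implied by the loop in this section: each block starts with a year header row, the index label appears only on the first data row, and the following rows leave the label blank. The sketch uses a vectorised `ffill`/`shift` variant rather than the row-by-row loop, and the names `toy`, `wanted_indexes`, and `sheets_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

wanted_indexes = ['CAPEX']  # hypothetical index of interest

# Toy sheet: a year header row, then a labelled row, then unlabelled rows.
rows = [
    [np.nan, np.nan, np.nan, 2020, 2030],            # header (year) row
    ['CAPEX', 'Class 1', 'Moderate', 100.0, 80.0],
    [np.nan, 'Class 2', 'Moderate', 120.0, 90.0],
    [np.nan, np.nan, np.nan, 2020, 2030],            # header (year) row
    ['Notes', 'Class 1', 'Moderate', 1.0, 1.0],      # block we do not want
    [np.nan, 'Class 2', 'Moderate', 2.0, 2.0],
]
toy = pd.DataFrame(rows)

first_col = toy.iloc[:, 0]
labels = first_col.ffill()                 # propagate each label down its block
is_header = first_col.shift(-1).notna()    # the row just above a labelled row
keep = labels.isin(wanted_indexes) & ~is_header

labelled = toy.copy()
labelled[0] = labels
cleaned = labelled[keep].reset_index(drop=True)
cleaned.columns = ['Index', 'Display Name', 'Scenario', 2020, 2030]

# one cleaned dataframe per sheet, keyed by sheet name
sheets_demo = {'Demo Tech': cleaned}
```

Only the two `CAPEX` rows survive; the header rows and the unwanted `Notes` block are dropped.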

In [93]:
# sheet names in the workbook that also appear in the tech list
sheet_list = list(set(raw.sheet_names) & set(tech_list))
# techs without a sheet of their own (sub-techs)
sub_tech_list = list(set(tech_list)-set(sheet_list))

In [94]:
sheets = {}
last_i = None

# iterate over all sheet-level techs
for sheet in sheet_list:
    # storage sheets keep columns from position 3 onward; all other sheets from position 9
    first_col = 3 if sheet.split()[-1] == 'Storage' else 9
    sheet_df = pd.read_excel(raw, sheet_name=str(sheet)).iloc[:, first_col:].dropna(how='all')
    index = None
    # iterate over rows
    for i in range(len(sheet_df)):
        # get the value on the current index column
        working_index = sheet_df.iloc[i, 0]

        if not pd.isna(working_index):
            sheet_df.iloc[i-1, 0] = None    # get rid of the header (year) row
            if working_index in index_list:
                index = working_index
                last_i = i
            else:
                index = None

        sheet_df.iloc[i, 0] = index

    # get the header
    header = list(sheet_df.iloc[last_i-1, :].values)
    header[0:3] = ['Index', 'Display Name', 'Scenario']
    sheet_df.columns = header

    sheet_df = sheet_df.dropna(how='any', subset=['Index']).reset_index(drop=True)
    sheets[str(sheet)] = sheet_df

- Some techs are more detailed than a single sheet (sub-techs), so we first build a dictionary mapping each one to its parent sheet.
- For now, each sub-tech is matched to the sheet whose name is most similar to it.
- If that does not work, **which is likely the case**, we will fall back to a simpler method such as manual specification (by modifying the ancillary file).
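One possible way to do the name-similarity matching is the standard library's `difflib`. This is only a sketch: the notebook does not say which similarity measure is intended, and the helper `closest_sheet`, the `cutoff` value, and the example names are all assumptions.

```python
import difflib

# Hypothetical sheet and sub-tech names, for illustration only.
sheet_names = ['Land-Based Wind', 'Offshore Wind', 'Solar - Utility PV']
sub_techs = ['Land-Based Wind - Class 4', 'Utility PV - Class 5']


def closest_sheet(sub_tech, candidates, cutoff=0.5):
    """Return the most similar candidate name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(sub_tech, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# sub-tech -> parent sheet (None means "specify manually in the ancillary file")
parent_demo = {name: closest_sheet(name, sheet_names) for name in sub_techs}
```

Any sub-tech that maps to `None`, or to an obviously wrong sheet, would be a candidate for the manual specification mentioned above.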

- The main logic follows four steps:
    1. iterate over the dataframes of the different techs (and sub-techs)
    2. iterate over the indexes, selecting the matching subset of each dataframe
    3. calculate each year's value as a ratio of the 2020 baseline value
    4. write the results to the summary file (the 2020 baseline data and the other data are meant to be organized in separate sheets with different layouts)
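Steps 1 to 3 can be sketched on a toy frame as follows. The numbers and the names `demo`, `baseline`, `targets`, and `ranges` are invented for the example, and `groupby` stands in for the explicit loop over indexes; the result keeps the `[min_ratio, max_ratio]` pair per index and year, as the cell below does.

```python
import pandas as pd

# Toy data: two indexes, two scenarios each; values are illustrative only.
demo = pd.DataFrame({
    'Index': ['CAPEX', 'CAPEX', 'Fixed O&M', 'Fixed O&M'],
    'Scenario': ['Advanced', 'Conservative', 'Advanced', 'Conservative'],
    2020: [100.0, 100.0, 50.0, 50.0],
    2030: [60.0, 90.0, 25.0, 50.0],
})
baseline = 2020
targets = [2030]

ranges = {}
for index_name, group in demo.groupby('Index'):
    # divide every target-year column by the baseline column, row by row
    ratios = group[targets].div(group[baseline], axis=0)
    ranges[index_name] = {
        str(year): [ratios[year].min(), ratios[year].max()] for year in targets
    }
```

Here the `CAPEX` rows give ratios of 60/100 and 90/100 for 2030, so the stored range runs from the smaller to the larger of the two.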

In [101]:
import collections
import numpy as np

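# the first year listed in the ancillary file is the baseline
baseline_year = year_list[0]
# nested mapping: index -> tech -> year -> [min_ratio, max_ratio]
ddd = collections.defaultdict(lambda: collections.defaultdict(dict))

# [tech, index, year] combinations whose ratios could not be computed
debug_session = []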

# iterate over sheet names and dataframes
for tech, df in sheets.items():
    # get all unique indexes in the df
    indexes = list(df['Index'].unique())
    year_dict = {}

    # iterate over the indexes
    for index in indexes:
        index_df = df[(df['Index']==index)]
        range_list = [-1, -1]
        # iterate over years
        for year in year_list[1:]:
            try:
                target_year = pd.to_numeric(index_df[year])
                baseline = pd.to_numeric(index_df[baseline_year])
                min_ratio = (target_year/baseline).min()
                max_ratio = (target_year/baseline).max()
                range_list = [min_ratio, max_ratio]

                # record combinations whose ratio range is undefined (NaN)
                if np.isnan(max_ratio) or np.isnan(min_ratio):
                    debug_session.append([str(tech), str(index), str(year)])

            # pandas division normally yields inf/NaN instead of raising, so this branch is only a safeguard
            except ZeroDivisionError as e:
                if (index_df[year]==0).all():
                    range_list = [0, 0]
                else:
                    debug_session.append([str(tech), str(index), str(year)])

            finally:
                year_dict[str(year)] = range_list.copy()

        ddd[index][tech] = year_dict.copy()


In [102]:
pd.DataFrame(ddd[index_list[1]])

| Year | Pumped Storage Hydropower | Biopower | Hydropower | Land-Based Wind | Solar - CSP | Offshore Wind | Solar - PV Dist. Comm | Utility-Scale Battery Storage | Solar - Utility PV |
|---|---|---|---|---|---|---|---|---|---|
| 2025.0 | [1.0, 1.0] | [1.0, 1.0] | [0.8408964152537147, 1.0] | [0.8997674418604651, 1.0] | [0.8313973063973064, 1.0] | [0.821767149670645, 0.9174386759765845] | [0.7576481043139284, 0.9300351051356408] | [0.5754451323662406, 0.7710775177325845] | [0.7625285197142309, 0.8963493747990888] |
| 2030.0 | [1.0, 1.0] | [1.0, 1.0] | [0.7071067811865478, 1.0] | [0.7995348837209303, 1.0] | [0.6627946127946127, 1.0] | [0.7335423729337738, 0.8765710448658974] | [0.5118328474306393, 0.8997035992794921] | [0.3892717071889275, 0.689676731232011] | [0.5857721701425453, 0.8819916719469735] |
| 2035.0 | [1.0, 1.0] | [1.0, 1.0] | [0.594603557501361, 1.0] | [0.7395697674418605, 0.9886598837209302] | [0.6611405723905724, 1.0] | [0.6744077561670052, 0.8491786071258295] | [0.4890973671219775, 0.8294251203087463] | [0.3649422254896194, 0.689676731232011] | [0.5583544015728088, 0.8296220092771015] |
| 2040.0 | [1.0, 1.0] | [1.0, 1.0] | [0.5623730565710472, 1.0] | [0.6796046511627908, 0.9773197674418604] | [0.6594865319865322, 1.0] | [0.6298649631774956, 0.8285454188099878] | [0.4663618868133156, 0.7591466413380002] | [0.3406127437903113, 0.689676731232011] | [0.5316716642246694, 0.7773039225750643] |
| 2045.0 | [1.0, 1.0] | [1.0, 1.0] | [0.5477534639888454, 1.0] | [0.6196395348837209, 0.9659796511627907] | [0.6578324915824919, 1.0] | [0.5941194587230305, 0.8119873254495772] | [0.44362640650465374, 0.6888681623672542] | [0.3162832620910033, 0.689676731232011] | [0.5056280844605274, 0.725035673325092] |
| 2050.0 | [1.0, 1.0] | [1.0, 1.0] | [0.5335139260425694, 1.0] | [0.5596744186046512, 0.954639534883721] | [0.6561784511784512, 1.0] | [0.5642628847479636, 0.7981571124802533] | [0.4208909261959916, 0.6185896833965077] | [0.2919537803916956, 0.689676731232011] | [0.48014376758238364, 0.6728156002787821] |


In [103]:
# iterate over the outermost layer of the 3-level dictionary; one sheet per index
with pd.ExcelWriter(summary_path, mode='w') as writer:
    for index, index_dict in ddd.items():
        index_df = pd.DataFrame(index_dict)
        # name each sheet after the first two words of the index
        index_df.to_excel(writer, sheet_name=' '.join(str(index).split()[:2]))

In [104]:
ddd['Fixed Operation and Maintenance Expenses ($/kW-yr)']['Utility-Scale Battery Storage']

{'2025.0': [0.5754451323662406, 0.7710775177325845],
 '2030.0': [0.3892717071889275, 0.689676731232011],
 '2035.0': [0.3649422254896194, 0.689676731232011],
 '2040.0': [0.3406127437903113, 0.689676731232011],
 '2045.0': [0.3162832620910033, 0.689676731232011],
 '2050.0': [0.2919537803916956, 0.689676731232011]}

In [105]:
debug_session

[['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2025.0'],
 ['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2030.0'],
 ['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2035.0'],
 ['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2040.0'],
 ['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2045.0'],
 ['Hydropower',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2050.0'],
 ['Land-Based Wind',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2025.0'],
 ['Land-Based Wind',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2030.0'],
 ['Land-Based Wind',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2035.0'],
 ['Land-Based Wind',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2040.0'],
 ['Land-Based Wind',
  'Variable Operation and Maintenance Expenses ($/MWh)',
  '2045.0'],
 ['Land-Based Wind',
  'Varia

In [106]:
df = sheets['Utility-Scale Battery Storage']
df[df['Index']=='Variable Operation and Maintenance Expenses ($/MWh)'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 15 to 29
Data columns (total 34 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Index         15 non-null     object 
 1   Display Name  15 non-null     object 
 2   Scenario      15 non-null     object 
 3   2020.0        15 non-null     float64
 4   2021.0        15 non-null     float64
 5   2022          15 non-null     object 
 6   2023.0        15 non-null     float64
 7   2024.0        15 non-null     float64
 8   2025          15 non-null     object 
 9   2026.0        15 non-null     float64
 10  2027.0        15 non-null     float64
 11  2028.0        15 non-null     float64
 12  2029.0        15 non-null     float64
 13  2030.0        15 non-null     float64
 14  2031.0        15 non-null     float64
 15  2032.0        15 non-null     float64
 16  2033.0        15 non-null     float64
 17  2034.0        15 non-null     float64
 18  2035.0        15 non-null     f