Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make vma.py support XLS files #2

Closed
DentonGentry opened this issue Feb 8, 2020 · 8 comments
Closed

Make vma.py support XLS files #2

DentonGentry opened this issue Feb 8, 2020 · 8 comments
Labels
good first issue Good for newcomers

Comments

@DentonGentry
Copy link
Contributor

model/vma.py implements Variable Meta-Analysis, where we produce a value for an input variable like the yield for soybeans or the cost for a megawatt from natural gas powerplants. Researchers at Project Drawdown vet data sources as inputs, ideally multiple sources, and vma.py collects these sources and produces a single resulting value. This is typically the mean, though +/- multiples of the standard deviation is also common.

Right now in the __init__ method we call VMA._read_csv, which reads in a CSV file. The existing CSV files were produced via the vma_xls_extract.py code generator from the original Drawdown Excel models.

This issue concerns adding support for XLS files for VMA input, to allow researchers to more easily perform data normalizations like currency or unit conversion.

The desired steps are:

  1. In vma.py VMA::__init__, check the file extension of the data source and add support for *.xlsx and *.xlsm

  2. For CSV we require a separate file for each VMA. For Excel we want to support multiple VMA definitions within one file, to allow the researcher to implement their needed conversions once not have copies in multiple files.
    Therefore, the code should open the Excel file and then search for the definition of its VMA. Searching for the name of the VMA within the first sheet of the workbook, and figuring out where the VMA definition is below that, is preferred.

  3. advanced_controls.py, which instantiates VMA objects, knows the human-readable Title of the VMA it is looking for. The VMA::__init__ does not currently receive the Title as a parameter, but it can be added.
    Please do not add a default value, the codebase is small enough that we can update all existing callers to pass in a proper value. If the backing file is CSV, the title argument may just not be used.

  4. Add unit tests to model/tests/test_vma.py. Add an Excel file for use in the test in model/tests/data.

  5. Please use Pandas read_excel() and ensure it works with the 'xlrd' backend, as we already have dependencies on xlrd in the tree. We do not currently have any dependencies on other Excel+python packages like openpyxl, and would prefer not to add new dependencies without a really good reason.

@DentonGentry DentonGentry added the good first issue Good for newcomers label Feb 8, 2020
@DentonGentry
Copy link
Contributor Author

It would be great if a single Excel file for a solution could supply VMA (issue #2), customadoption/adoptiondata/etc (issue #3), plus allow additional sheets for various and sundry stuff specific to the solution which doesn't benefit from being done within Python.

A way to do this would be to check if a sheet named "VMA" or "VMA Data" exists, and use it if it does.

@FranzEricSchneider
Copy link
Contributor

Let me see if I've got this right. Currently I think we have

.xlsm files, containing all sorts of data
    - long-form column names
    - deleted from repo, in git-lfs

.csv files, containing all data from xlsm files, extracted
    - long-form column names

vma.VMA objects, instatiated by solution/XXX/__init__.py, which draw from certain CSV files
    - short-form column names

And what's being proposed is (I think)

.xlsm files, containing all sorts of data
    - long-form column names
    - deleted from repo, in git-lfs

.csv files, containing all data from xlsm files, extracted
    - long-form column names

.xlsx files, containing certain sheets that are currently in the `.xlsm` files
    - long-form column names

vma.VMA objects, instatiated by `solution/XXX/__init__.py`, which draw from the nearby CSV files OR from nearby `.xlsx` files
    - short-form column names

Is this correct, that the new files are intended to be in addition to the existing files? Or should something be replaced? Also, is the long-form/short-form column names what is intended? I've been assuming that the new .xlsx files would need to have the long-form/old column names, but I don't know if that's true.

Could you also clarify what you mean by supporting multiple VMA definitions within one file? For example, right now in solution/cars/vma_data/ there are a number of CSVs. Does multiple VMA definitions within one file mean all of these CSVs should become tabs in a single file solution/cars/vma_data/cars_data.xlsx?

solutions $ ls solution/cars/vma_data/
Average_Global_Car_Occupancy_Conventional.csv
Average_Global_Car_Occupancy_Solution.csv
CONVENTIONAL_Average_Annual_Use.csv
CONVENTIONAL_First_Cost_per_Implementation_Unit.csv
...
SOLUTION_Lifetime_Capacity.csv
SOLUTION_Variable_Operating_Cost_VOM_per_Functional_Unit.csv
Urban_Travel_TAM_Current.csv
Urban_Travel_TAM_Projected.csv
VMA_info.csv

@DentonGentry
Copy link
Contributor Author

At this point I'd recommend maximizing for the human usage: Long-form column names, in an XLSX file, with multiple VMAs defined within one "Variable Meta-analysis" sheet essentially identical to what is currently in the XLSM file. (In earlier comments I misremembered this as being named "VMA Data")

A number of the existing XLSM files use their "Variable Meta-analysis" sheet for calculations like:

  • currency conversion
  • adjusting currency values from different years for inflation
  • Imperial/metric unit conversion

We did not consider any of these uses when we first created the CSV files, we just had tools/vma_xls_extract.py fetch the final computed values and write them to CSV. So using the cars example, I think it would be great if we could:

  1. open solution/cars/vma_data/VMA.xlsx
  2. Discover that the "Current Adoption" VMA starts at C46 of the "Variable Meta-analysis-DD" sheet
  3. Discover that the "CONVENTIONAL First Cost per Implementation Unit for replaced practices/technologies" VMA starts at C82 of the "Variable Meta-analysis-DD" sheet
  4. etc, etc

A couple other notes:

  • using a single sheet for VMA means we could use the same file for XLS files in tam.py, adoptiondata.py, customadoption.py #3, with other sheets named for TAM and Adoption Data and so on. This would likely mean moving the file out of vma_data and renaming it.
  • In the XLSM files, "Variable meta-analysis" is the primary sheet which the research team used. However, some of the data sources for climate work are licensed, not free, and have restrictions on redistribution. "Variable meta-analysis-DD" was added which clears the raw data values and only supplies a computed mean/low/high, to avoid infringing the license of the underlying data.
  • there is code in tools/vma_xls_extract.py which may be useful in finding the boundaries of VMA definitions in a "Variable meta-analysis" sheet. It also knows about checking for "Variable meta-analysis-DD" if "Variable meta-analysis" is empty.

@DentonGentry
Copy link
Contributor Author

Also: it would be nice if model/VMA.py continued to support use of a solution/<name>/vma_data/VMA_NAME.csv file, so that we don't have to go change all of the solutions all at once. In actual usage I would expect that if a VMA.xlsx file is present then there would probably be no CSV files for VMAs in that solution, that all of the VMAs would be defined in the XLSX file.

@FranzEricSchneider
Copy link
Contributor

Okay, I've made an example xslx file using just the "Advanced Controls" and "Variable Meta-Analysis DD" sheets of the cars testdata. "Advanced Controls" is necessary to include because of how many cells reference it.

Do you know of any examples that have non-empty "Variable Meta-Analysis" sheets? All the xlsm files I sampled have stuff in the DD sheet but not in the basic VMA sheet.

@DentonGentry
Copy link
Contributor Author

DentonGentry commented Mar 29, 2020

The Excel files checked into the repository (and later deleted) are the Public versions. Some of the data used in the Drawdown models is licensed and has restrictions on redistribution. The Public models copy the VMA information to a Variable Meta-AnalysisDD sheet but remove the raw data, to avoid redistributing it publicly. The DD tab retains the Mean and standard deviation computed from the VMA data, which is sufficient for the model to run and generate results but avoids redistributing the licensed data.

I placed two of the full files in https://drive.google.com/drive/folders/16ToiESaPkpz8Z-Hda2r2SuDIIoKXp1ik and made it available to anyone with the link. The CSP file is from solution/concentratedsolar, the SolarPVUtility file is from solution/solarpvutil.

I'll leave them there for a while, long enough to retrieve them to work on this issue, though I'll need to take them down later.

@FranzEricSchneider
Copy link
Contributor

Thanks, you can delete them now. I'm finding this slow going to find time to work on, but I'd like to keep trying.

@FranzEricSchneider
Copy link
Contributor

I believe this is closed by #126

ksatan added a commit to ksatan/solutions that referenced this issue Jul 20, 2021
denised pushed a commit that referenced this issue May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants