Make vma.py support XLS files #2

DentonGentry · 2020-02-08T06:26:59Z

model/vma.py implements Variable Meta-Analysis, where we produce a value for an input variable like the yield for soybeans or the cost for a megawatt from natural gas power plants. Researchers at Project Drawdown vet data sources as inputs, ideally multiple sources, and vma.py collects these sources and produces a single resulting value. This is typically the mean, though the mean +/- a multiple of the standard deviation is also common.
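
As a rough illustration of that reduction (a sketch, not the actual vma.py code): assuming each vetted source contributes one already-normalized number, the result is the mean, with low and high values a chosen number of standard deviations either side. The name `summarize`, the `n_sd` parameter, and the choice of the population standard deviation are assumptions made for this example, and the input values are invented.

```python
import statistics


def summarize(values, n_sd=1.0):
    """Return (mean, low, high) for a list of source values.

    low/high are the mean minus/plus n_sd population standard deviations.
    Hypothetical helper; the real VMA class may weight or filter sources.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return mean, mean - n_sd * sd, mean + n_sd * sd


# Invented values standing in for three vetted data sources.
mean, low, high = summarize([2.0, 4.0, 6.0])
```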

Right now in the __init__ method we call VMA._read_csv, which reads in a CSV file. The existing CSV files were produced via the vma_xls_extract.py code generator from the original Drawdown Excel models.

This issue concerns adding support for XLS files for VMA input, to allow researchers to more easily perform data normalizations like currency or unit conversion.

The desired steps are:

In vma.py VMA::__init__, check the file extension of the data source and add support for *.xlsx and *.xlsm
For CSV we require a separate file for each VMA. For Excel we want to support multiple VMA definitions within one file, to allow the researcher to implement their needed conversions once rather than keep copies in multiple files.
Therefore, the code should open the Excel file and then search for the definition of its VMA. Searching for the name of the VMA within the first sheet of the workbook, and figuring out where the VMA definition is below that, is preferred.
advanced_controls.py, which instantiates VMA objects, knows the human-readable Title of the VMA it is looking for. The VMA::__init__ does not currently receive the Title as a parameter, but it can be added.
Please do not add a default value; the codebase is small enough that we can update all existing callers to pass in a proper value. If the backing file is CSV, the title argument may simply go unused.
Add unit tests to model/tests/test_vma.py. Add an Excel file for use in the test in model/tests/data.
Please use Pandas read_excel() and ensure it works with the 'xlrd' backend, as we already have dependencies on xlrd in the tree. We do not currently have any dependencies on other Excel+python packages like openpyxl, and would prefer not to add new dependencies without a really good reason.
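
The steps above could be sketched roughly as follows. This is a hypothetical outline, not the code that was eventually merged: `source_kind` and `VMASource` are invented names, and the real `VMA.__init__` takes other arguments. It only shows the two requested behaviours, dispatching on the file extension and accepting a `title` parameter that has no default and is unused for CSV.

```python
from pathlib import Path

EXCEL_SUFFIXES = {".xlsx", ".xlsm"}


def source_kind(filename):
    """Classify a VMA data source by its file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix in EXCEL_SUFFIXES:
        return "excel"
    raise ValueError(f"unsupported VMA source: {filename!r}")


class VMASource:
    """Hypothetical stand-in for the part of VMA.__init__ discussed here."""

    def __init__(self, filename, title):  # title deliberately has no default
        self.filename = Path(filename)
        self.kind = source_kind(filename)
        # A CSV file holds exactly one VMA, so the title is not needed to
        # locate it; for Excel it is the key used to search the sheet.
        self.title = title if self.kind == "excel" else None
```

In the real class, the Excel branch would go on to load the sheet with pandas `read_excel()` and search it for the title.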

DentonGentry · 2020-02-17T00:49:27Z

It would be great if a single Excel file for a solution could supply VMA (issue #2), customadoption/adoptiondata/etc (issue #3), plus allow additional sheets for various and sundry stuff specific to the solution which doesn't benefit from being done within Python.

A way to do this would be to check if a sheet named "VMA" or "VMA Data" exists, and use it if it does.
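
A minimal sketch of that check, assuming the list of sheet names has already been read from the workbook (with pandas, `pandas.ExcelFile(path).sheet_names` provides it). The function name `find_named_sheet` is made up for this example.

```python
def find_named_sheet(sheet_names, candidates=("VMA", "VMA Data")):
    """Return the first candidate present in sheet_names, else None.

    Candidates are tried in order, so "VMA" wins if both sheets exist.
    """
    available = set(sheet_names)
    for name in candidates:
        if name in available:
            return name
    return None
```

Returning `None` leaves the caller free to fall back to another source when no such sheet exists.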

FranzEricSchneider · 2020-03-22T21:12:19Z

Let me see if I've got this right. Currently I think we have

.xlsm files, containing all sorts of data
    - long-form column names
    - deleted from repo, in git-lfs

.csv files, containing all data from xlsm files, extracted
    - long-form column names

vma.VMA objects, instantiated by solution/XXX/__init__.py, which draw from certain CSV files
    - short-form column names

And what's being proposed is (I think)

.xlsm files, containing all sorts of data
    - long-form column names
    - deleted from repo, in git-lfs

.csv files, containing all data from xlsm files, extracted
    - long-form column names

.xlsx files, containing certain sheets that are currently in the `.xlsm` files
    - long-form column names

vma.VMA objects, instantiated by `solution/XXX/__init__.py`, which draw from the nearby CSV files OR from nearby `.xlsx` files
    - short-form column names

Is this correct, that the new files are intended to be in addition to the existing files? Or should something be replaced? Also, is the long-form/short-form column names what is intended? I've been assuming that the new .xlsx files would need to have the long-form/old column names, but I don't know if that's true.

Could you also clarify what you mean by supporting multiple VMA definitions within one file? For example, right now in solution/cars/vma_data/ there are a number of CSVs. Does multiple VMA definitions within one file mean all of these CSVs should become tabs in a single file solution/cars/vma_data/cars_data.xlsx?

solutions $ ls solution/cars/vma_data/
Average_Global_Car_Occupancy_Conventional.csv
Average_Global_Car_Occupancy_Solution.csv
CONVENTIONAL_Average_Annual_Use.csv
CONVENTIONAL_First_Cost_per_Implementation_Unit.csv
...
SOLUTION_Lifetime_Capacity.csv
SOLUTION_Variable_Operating_Cost_VOM_per_Functional_Unit.csv
Urban_Travel_TAM_Current.csv
Urban_Travel_TAM_Projected.csv
VMA_info.csv

DentonGentry · 2020-03-22T22:49:03Z

At this point I'd recommend optimizing for human use: long-form column names, in an XLSX file, with multiple VMAs defined within one "Variable Meta-analysis" sheet essentially identical to what is currently in the XLSM file. (In earlier comments I misremembered this as being named "VMA Data".)

A number of the existing XLSM files use their "Variable Meta-analysis" sheet for calculations like:

currency conversion
adjusting currency values from different years for inflation
Imperial/metric unit conversion
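
For a sense of what such normalizations look like if done in Python rather than in the sheet, here is a small sketch. Both helper names are hypothetical, and the exchange-rate and inflation factors are left as caller-supplied parameters because this thread does not give any real rates. The only fixed number is the definition of the US short ton (2000 lb at 0.45359237 kg each).

```python
KG_PER_SHORT_TON = 907.18474  # 2000 lb * 0.45359237 kg/lb, exact by definition


def per_short_ton_to_per_tonne(cost_per_short_ton):
    """Convert a cost per US short ton into a cost per metric tonne."""
    return cost_per_short_ton * 1000.0 / KG_PER_SHORT_TON


def adjust(value, exchange_rate=1.0, inflation_factor=1.0):
    """Apply caller-supplied currency and inflation factors to a value."""
    return value * exchange_rate * inflation_factor
```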

We did not consider any of these uses when we first created the CSV files; we just had tools/vma_xls_extract.py fetch the final computed values and write them to CSV. So using the cars example, I think it would be great if we could:

open solution/cars/vma_data/VMA.xlsx
Discover that the "Current Adoption" VMA starts at C46 of the "Variable Meta-analysis-DD" sheet
Discover that the "CONVENTIONAL First Cost per Implementation Unit for replaced practices/technologies" VMA starts at C82 of the "Variable Meta-analysis-DD" sheet
etc, etc
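
The discovery step could look something like the sketch below. It assumes the sheet has already been loaded as a list of rows, for example via `pandas.read_excel(path, sheet_name=..., header=None).values.tolist()`, and that a VMA starts at the cell whose text equals its title. `column_letters` and `find_title` are names invented here; the existing logic in tools/vma_xls_extract.py may work differently.

```python
def column_letters(index):
    """Convert a 0-based column index to Excel letters (0 -> 'A', 26 -> 'AA')."""
    letters = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def find_title(grid, title):
    """Return the A1-style address of the first cell equal to title, or None.

    grid is a list of rows, each a list of cell values; non-string cells
    (numbers, None, NaN) are skipped.
    """
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if isinstance(cell, str) and cell.strip() == title:
                return f"{column_letters(c)}{r + 1}"
    return None
```

A title placed in the third column of the 46th row is reported as `C46`, matching the cell notation used in the list above.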

A couple other notes:

using a single sheet for VMA means we could use the same file for XLS files in tam.py, adoptiondata.py, customadoption.py #3, with other sheets named for TAM and Adoption Data and so on. This would likely mean moving the file out of vma_data and renaming it.
In the XLSM files, "Variable meta-analysis" is the primary sheet which the research team used. However, some of the data sources for climate work are licensed, not free, and have restrictions on redistribution. "Variable meta-analysis-DD" was added which clears the raw data values and only supplies a computed mean/low/high, to avoid infringing the license of the underlying data.
there is code in tools/vma_xls_extract.py which may be useful in finding the boundaries of VMA definitions in a "Variable meta-analysis" sheet. It also knows about checking for "Variable meta-analysis-DD" if "Variable meta-analysis" is empty.
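
A hedged sketch of that fallback, independent of how tools/vma_xls_extract.py actually does it: given each sheet as a list of rows, pick the first sheet that contains any non-blank cell. The function names are hypothetical, the sheet names are the ones quoted in this thread (real workbooks may spell them differently), and "empty" is taken to mean "no non-blank cells", which would need tightening if the cleared sheet still carries headers.

```python
def is_blank(cell):
    """True for None, NaN, or whitespace-only strings."""
    if cell is None:
        return True
    if isinstance(cell, float) and cell != cell:  # NaN never equals itself
        return True
    return isinstance(cell, str) and cell.strip() == ""


def first_nonempty_sheet(
    sheets, names=("Variable Meta-analysis", "Variable Meta-analysis-DD")
):
    """Return the first name in names whose sheet has a non-blank cell.

    sheets maps sheet name -> list of rows (each a list of cell values).
    """
    for name in names:
        grid = sheets.get(name)
        if grid and any(not is_blank(cell) for row in grid for cell in row):
            return name
    return None
```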

DentonGentry · 2020-03-22T22:51:27Z

Also: it would be nice if model/vma.py continued to support use of a solution/<name>/vma_data/VMA_NAME.csv file, so that we don't have to change all of the solutions at once. In actual usage I would expect that if a VMA.xlsx file is present then there would probably be no CSV files for VMAs in that solution, and that all of the VMAs would be defined in the XLSX file.
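
One way to keep both layouts working during a transition, sketched under assumptions: the workbook is named `VMA.xlsx` and lives in the solution's `vma_data` directory (an earlier comment notes it may later be moved and renamed), and the workbook takes precedence when present. `resolve_vma_source` is a hypothetical helper, not part of the codebase.

```python
from pathlib import Path


def resolve_vma_source(solution_dir, csv_name, workbook_name="VMA.xlsx"):
    """Return the workbook path if it exists, else the per-VMA CSV path.

    Raises FileNotFoundError if neither file is present.
    """
    vma_dir = Path(solution_dir) / "vma_data"
    workbook = vma_dir / workbook_name
    if workbook.is_file():
        return workbook
    csv_path = vma_dir / csv_name
    if csv_path.is_file():
        return csv_path
    raise FileNotFoundError(f"no VMA source for {csv_name!r} in {vma_dir}")
```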

FranzEricSchneider · 2020-03-29T17:59:36Z

Okay, I've made an example xlsx file using just the "Advanced Controls" and "Variable Meta-Analysis DD" sheets of the cars testdata. "Advanced Controls" is necessary to include because of how many cells reference it.

Do you know of any examples that have non-empty "Variable Meta-Analysis" sheets? All the xlsm files I sampled have stuff in the DD sheet but not in the basic VMA sheet.

DentonGentry · 2020-03-29T22:45:44Z

The Excel files checked into the repository (and later deleted) are the Public versions. Some of the data used in the Drawdown models is licensed and has restrictions on redistribution. The Public models copy the VMA information to a Variable Meta-AnalysisDD sheet but remove the raw data, to avoid redistributing it publicly. The DD tab retains the Mean and standard deviation computed from the VMA data, which is sufficient for the model to run and generate results but avoids redistributing the licensed data.

I placed two of the full files in https://drive.google.com/drive/folders/16ToiESaPkpz8Z-Hda2r2SuDIIoKXp1ik and made it available to anyone with the link. The CSP file is from solution/concentratedsolar, the SolarPVUtility file is from solution/solarpvutil.

I'll leave them there for a while, long enough to retrieve them to work on this issue, though I'll need to take them down later.

FranzEricSchneider · 2020-04-09T03:29:26Z

Thanks, you can delete them now. I'm finding it slow going to find time to work on this, but I'd like to keep trying.

FranzEricSchneider · 2020-04-30T04:02:36Z

I believe this is closed by #126

DentonGentry added the good first issue Good for newcomers label Feb 8, 2020

DentonGentry mentioned this issue Feb 17, 2020

XLS files in tam.py, adoptiondata.py, customadoption.py #3

Closed

DentonGentry mentioned this issue Mar 20, 2020

Rationalize VMA dataframe #7

Open

FranzEricSchneider mentioned this issue Apr 24, 2020

Adding xlsx/xlsm support to vma.VMA #126

Merged

DentonGentry closed this as completed Apr 30, 2020

Sunishchal mentioned this issue Jul 14, 2020

Health & Education solutions #23

Open

Sunishchal mentioned this issue Sep 12, 2020

Initial script for issue #23 health & education electricity cluster #208

Draft

ksatan added a commit to ksatan/solutions that referenced this issue Jul 20, 2021

kirill's test ProjectDrawdown#2

b3a9783

what gon happen?

denised pushed a commit that referenced this issue May 3, 2022

Merge pull request #2 from ProjectDrawdown/develop

7e1f2d7

Refresh personal fork
