
Conversation

Contributor

@isond isond commented Dec 3, 2021

Pull Request doc

Purpose

What is the larger goal of this change?
To add the data processing script that will organize the NHTS data.

What the code is doing

How is the purpose executed?
This code filters the NHTS data by the input census division and reorganizes the information. It also calculates additional quantities derived from the given data.

Testing

How did you test this change (unit/functional testing, manual testing, etc.)?
This was tested through manual testing using data from "trippub.csv".

Where to look

  • It's helpful to clarify where your new code lives if you moved files around or there could be confusion.
    The new code is located at "PreREISE/prereise/gather/demanddata/transportation_electrification/data_process.py".

  • What files are most important?
    data_process.py, which is currently the only file.

Usage Example/Visuals

How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).

Time estimate

How long will it take for reviewers and observers to understand this code change?
It should take a few days for reviewers to understand.

@danielolsen danielolsen requested a review from BainanXia December 3, 2021 22:17
@danielolsen
Contributor

As a note for BE folks: the target for this PR is currently the develop branch, so that we can enable the CI linting, but we will want to change the target to the transportation_electrification branch before merging.

@danielolsen
Contributor

After the refactor, the code runs much faster: less than 4 minutes now to go through all nine regions on my laptop, rather than the hour and a half before. Bravo!

Comparing the results from before the refactor to the results afterward, I see a few differences:

  • Some column renaming for spelling (travelled -> traveled)
  • Different lengths of the dataframes for each region
| Region | Length before refactor | Length after refactor |
| ------ | ---------------------- | --------------------- |
| 1      | 13,323                 | 9,030                 |
| 2      | 123,616                | 82,355                |
| 3      | 104,046                | 73,197                |
| 4      | 35,424                 | 24,964                |
| 5      | 192,380                | 136,168               |
| 6      | 8,424                  | 5,999                 |
| 7      | 183,089                | 130,259               |
| 8      | 34,968                 | 23,520                |
| 9      | 189,215                | 122,217               |

If I look at the indices from the newly returned dataframes and compare them with the corresponding values in the previously returned dataframes:

  • Most columns are the same.
  • Some columns have values that are different but nearly identical: Scaling Factor Applied, Miles traveled, Vehicle miles traveled, Start time (hour decimal), End time (hour decimal), Travel time (hour decimal), and Vehicle speed (mi/hour). These are probably differences due to floating-point rounding across the two different methods of summing.
  • Some columns have values that are noticeably different: Dwell time (hour decimal), sample vehicle number, total vehicle trips, and total vehicle miles traveled.

@isond can you confirm that the behavior change is intentional, and that the new behavior is the one we want?

One way to validate the new behavior could be to refactor the data_filtering function to take the raw NHTS dataframe as an explicit input, rather than to load it from disk. So instead of the user getting data as:

df1 = data_filtering(census_division=1)

we could:

raw_nhts = pd.read_csv(FILENAME)
df1 = data_filtering(raw_nhts, census_division=1)

The value of an approach like this is that we can design a test that passes a very small amount of data to data_filtering, and compares the output to a known expectation (calculated manually). Another benefit is that we can run the data_filtering function from anywhere, and we're less tied to specific filenames; we won't need the current working directory of the script to contain a file named trippub.csv.
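A test along those lines could look like the sketch below. The `data_filtering` body here is a simplified stand-in for the real function (which does much more reorganization), and the tiny dataframe, the `-1` sentinel filter, and the column names `CENSUS_D` and `VMT_MILE` are assumptions based on the NHTS trippub.csv schema as discussed in this PR:

```python
import pandas as pd


def data_filtering(raw_nhts, census_division):
    """Stand-in sketch: keep rows in the given census division whose
    vehicle miles traveled carry a valid value (not the -1 sentinel)."""
    filtered = raw_nhts[raw_nhts["CENSUS_D"] == census_division]
    return filtered[filtered["VMT_MILE"] != -1].reset_index(drop=True)


def test_data_filtering_drops_invalid_miles():
    # A hand-made three-row input with a known expected output.
    raw = pd.DataFrame(
        {
            "CENSUS_D": [1, 1, 2],
            "VMT_MILE": [5.0, -1.0, 3.0],
        }
    )
    result = data_filtering(raw, census_division=1)
    # Only the first row survives: division 1 with valid mileage.
    assert len(result) == 1
    assert result.loc[0, "VMT_MILE"] == 5.0
```

Because the function takes the dataframe as an argument, the test never touches disk and runs in milliseconds.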

@isond
Contributor Author

isond commented Jan 4, 2022


Regarding the differences in the lengths of the dataframes: I noticed that the condition filtering out the repeated trips was using the incorrect column prior to the refactor.

nhts = nhts[nhts.TRPMILES != -1] should be nhts = nhts[nhts.VMT_MILE != -1]

In the original code the two columns were switched around, and I've updated the code to reflect that change.
I also manually filtered the columns in trippub.csv to double-check the lengths of the resulting dataframes after the refactor, and they all match for each region. So the lengths that you are seeing now should be good :)

The significant change in values for Dwell time (hour decimal), sample vehicle number, total vehicle trips, and total vehicle miles traveled is because those columns are derived from vehicle miles traveled (originally taken from TRPMILES, now from VMT_MILE), which was affected by the fix described above.
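A toy illustration (invented data, not the real NHTS file) of why filtering on the wrong column changed the dataframe lengths: TRPMILES and VMT_MILE disagree on which rows carry the -1 sentinel, so each filter keeps a different set of rows.

```python
import pandas as pd

# Hypothetical three-trip sample: the first trip has valid trip miles but
# no vehicle miles (e.g. not a personal-vehicle trip), so VMT_MILE is -1.
trips = pd.DataFrame(
    {
        "TRPMILES": [10.0, -1.0, 7.0],
        "VMT_MILE": [-1.0, -1.0, 7.0],
    }
)

wrong = trips[trips.TRPMILES != -1]  # keeps 2 rows, including the non-vehicle trip
right = trips[trips.VMT_MILE != -1]  # keeps 1 row, the intended filter
```

Every downstream quantity summed from the surviving rows (total vehicle trips, total vehicle miles traveled, dwell times) shifts accordingly.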

As for the spelling change, I updated the traveled variable to be consistent with the variables listed in Kate's data conversion document.

def data_filtering(raw_nhts, census_division):
"""Filter raw NHTS data to be used in mileage.py

:param: (*pandas.DataFrame*) raw_nhts: raw NHTS dataframe
Contributor

:param: (*pandas.DataFrame*) raw_nhts: -> :param pandas.DataFrame raw_nhts:. We only add the parentheses for the return variable.
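Applied to the snippet above, the requested convention would look like this sketch. The description of the second parameter and the return line are my assumptions, not text from the PR:

```python
def data_filtering(raw_nhts, census_division):
    """Filter raw NHTS data to be used in mileage.py.

    :param pandas.DataFrame raw_nhts: raw NHTS dataframe.
    :param int census_division: census division used to filter the data.
    :return: (*pandas.DataFrame*) -- filtered and reorganized NHTS data.
    """
```

Note that types on `:param` lines are written without parentheses; the parentheses appear only on the `:return:` line.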

@danielolsen
Contributor

danielolsen commented Jan 5, 2022

All checks pass, and the code runs successfully and produces dataframes of the appropriate length, in 2 minutes:

import pandas as pd
from prereise.gather.demanddata.transportation_electrification.data_process import data_filtering
raw_nhts = pd.read_csv("trippub.csv")
data = {i: data_filtering(raw_nhts, i) for i in range(1, 10)}

There's one small tweak that needs to be made to a docstring, and the branch needs to be rebased onto the latest transportation_electrification commit, and then we're ready to merge!

@danielolsen danielolsen changed the base branch from develop to transportation_electrification January 5, 2022 18:06
Contributor

@danielolsen danielolsen left a comment

Looks good to me!

@danielolsen danielolsen merged commit a0157e7 into Breakthrough-Energy:transportation_electrification Jan 5, 2022
danielolsen pushed a commit that referenced this pull request Jan 31, 2022
danielolsen pushed a commit that referenced this pull request Mar 14, 2022
dmuldrew pushed a commit that referenced this pull request Sep 6, 2022
dmuldrew pushed a commit that referenced this pull request Nov 14, 2022
rouille pushed a commit that referenced this pull request Jan 11, 2023
rouille pushed a commit that referenced this pull request Jan 19, 2023
rouille pushed a commit that referenced this pull request Jan 20, 2023
rouille pushed a commit that referenced this pull request Feb 1, 2023
rouille pushed a commit that referenced this pull request Feb 1, 2023
jenhagg pushed a commit that referenced this pull request Feb 23, 2023
jenhagg pushed a commit that referenced this pull request Feb 24, 2023
rouille pushed a commit that referenced this pull request Feb 24, 2023