
Conversation

Contributor

@isond isond commented Dec 3, 2021

Pull Request doc

Purpose

What is the larger goal of this change?
To add the data processing script that will organize the NHTS data.

What the code is doing

How is the purpose executed?
This code filters the NHTS data by the input census division and reorganizes the information. It also calculates additional quantities derived from the given data.

Testing

How did you test this change (unit/functional testing, manual testing, etc.)?
This was tested through manual testing using data from "trippub.csv".

Where to look

  • It's helpful to clarify where your new code lives if you moved files around or there could be confusion.
    The new code is located at "PreREISE/prereise/gather/demanddata/transportation_electrification/data_process.py".

  • What files are most important?
    data_process.py, which is currently the only file.

Usage Example/Visuals

How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).

Time estimate

How long will it take for reviewers and observers to understand this code change?
It should take a few days for reviewers to understand.

@danielolsen danielolsen requested a review from BainanXia December 3, 2021 22:17
@danielolsen
Contributor

As a note for BE folks: the target for this PR is currently the develop branch, so that we can enable the CI linting, but we will want to change the target to the transportation_electrification branch before merging.

@danielolsen
Contributor

After the refactor, the code runs much faster: less than 4 minutes now to go through all nine regions on my laptop, rather than the hour and a half before. Bravo!

Comparing the results from before the refactor to the results afterward, I see a few differences:

  • Some column renaming for spelling (travelled -> traveled)
  • Different lengths of the dataframes for each region
| Region | Length before refactor | Length after refactor |
| ------ | ---------------------- | --------------------- |
| 1      | 13,323                 | 9,030                 |
| 2      | 123,616                | 82,355                |
| 3      | 104,046                | 73,197                |
| 4      | 35,424                 | 24,964                |
| 5      | 192,380                | 136,168               |
| 6      | 8,424                  | 5,999                 |
| 7      | 183,089                | 130,259               |
| 8      | 34,968                 | 23,520                |
| 9      | 189,215                | 122,217               |

If I look at the indices from the newly returned dataframes and compare them with the corresponding values in the previously returned dataframes:

  • Most columns are the same.
  • Some columns have values that are different but nearly identical: Scaling Factor Applied, Miles traveled, Vehicle miles traveled, Start time (hour decimal), End time (hour decimal), Travel time (hour decimal), and Vehicle speed (mi/hour). These are probably differences due to floating-point rounding across the two different methods of summing.
  • Some columns have values that are noticeably different: Dwell time (hour decimal), sample vehicle number, total vehicle trips, and total vehicle miles traveled.

@isond can you confirm that the behavior change is intentional, and that the new behavior is the one we want?

One way to validate the new behavior could be to refactor the data_filtering function to take the raw NHTS dataframe as an explicit input, rather than to load it from disk. So instead of the user getting data as:

df1 = data_filtering(census_division=1)

we could:

raw_nhts = pd.read_csv(FILENAME)
df1 = data_filtering(raw_nhts, census_division=1)

The value of an approach like this is that we can design a test that passes a very small amount of data to data_filtering, and compares the output to a known expectation (calculated manually). Another benefit is that we can run the data_filtering function from anywhere, and we're less tied to specific filenames; we won't need the current working directory of the script to contain a file named trippub.csv.
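A test along those lines could look like the sketch below. The `data_filtering` body here is a simplified stand-in for the real function (which does much more reorganization), and the tiny dataframe, the `-1` sentinel filter, and the column names `CENSUS_D` and `VMT_MILE` are assumptions based on the NHTS trippub.csv schema as discussed in this PR:

```python
import pandas as pd


def data_filtering(raw_nhts, census_division):
    """Stand-in sketch: keep rows in the given census division whose
    vehicle miles traveled carry a valid value (not the -1 sentinel)."""
    filtered = raw_nhts[raw_nhts["CENSUS_D"] == census_division]
    return filtered[filtered["VMT_MILE"] != -1].reset_index(drop=True)


def test_data_filtering_drops_invalid_miles():
    # A hand-made three-row input with a known expected output.
    raw = pd.DataFrame(
        {
            "CENSUS_D": [1, 1, 2],
            "VMT_MILE": [5.0, -1.0, 3.0],
        }
    )
    result = data_filtering(raw, census_division=1)
    # Only the first row survives: division 1 with valid mileage.
    assert len(result) == 1
    assert result.loc[0, "VMT_MILE"] == 5.0
```

Because the function takes the dataframe as an argument, the test never touches disk and runs in milliseconds.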

@isond
Contributor Author

isond commented Jan 4, 2022


Regarding the differences in the lengths of the dataframes: I noticed that the condition filtering out the repeated trips was using the incorrect column prior to the refactor.

nhts = nhts[nhts.TRPMILES != -1] should be nhts = nhts[nhts.VMT_MILE != -1]

In the original code the two columns were switched around, and I've updated the code to reflect that change.
I also manually filtered the columns in trippub.csv to double-check the lengths of the resulting dataframes after the refactor, and they all match for each region. So the lengths that you are seeing now should be good :)

The significant change in values for Dwell time (hour decimal), sample vehicle number, total vehicle trips, and total vehicle miles traveled is because those columns are derived from vehicle miles traveled (originally taken from TRPMILES, now from VMT_MILE), which was affected by the fix described above.
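A toy illustration (invented data, not the real NHTS file) of why filtering on the wrong column changed the dataframe lengths: TRPMILES and VMT_MILE disagree on which rows carry the -1 sentinel, so each filter keeps a different set of rows.

```python
import pandas as pd

# Hypothetical three-trip sample: the first trip has valid trip miles but
# no vehicle miles (e.g. not a personal-vehicle trip), so VMT_MILE is -1.
trips = pd.DataFrame(
    {
        "TRPMILES": [10.0, -1.0, 7.0],
        "VMT_MILE": [-1.0, -1.0, 7.0],
    }
)

wrong = trips[trips.TRPMILES != -1]  # keeps 2 rows, including the non-vehicle trip
right = trips[trips.VMT_MILE != -1]  # keeps 1 row, the intended filter
```

Every downstream quantity summed from the surviving rows (total vehicle trips, total vehicle miles traveled, dwell times) shifts accordingly.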

As for the spelling change, I updated the traveled variable to be consistent with the variables listed in Kate's data conversion document.

def data_filtering(raw_nhts, census_division):
"""Filter raw NHTS data to be used in mileage.py

:param: (*pandas.DataFrame*) raw_nhts: raw NHTS dataframe
Contributor

:param: (*pandas.DataFrame*) raw_nhts: -> :param pandas.DataFrame raw_nhts:. We only add the parentheses for the return variable.
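Applied to the snippet above, the requested convention would look like this sketch. The description of the second parameter and the return line are my assumptions, not text from the PR:

```python
def data_filtering(raw_nhts, census_division):
    """Filter raw NHTS data to be used in mileage.py.

    :param pandas.DataFrame raw_nhts: raw NHTS dataframe.
    :param int census_division: census division used to filter the data.
    :return: (*pandas.DataFrame*) -- filtered and reorganized NHTS data.
    """
```

Note that types on `:param` lines are written without parentheses; the parentheses appear only on the `:return:` line.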

@danielolsen
Contributor

danielolsen commented Jan 5, 2022

All checks pass, and the code runs successfully and produces dataframes of the appropriate length, in 2 minutes:

import pandas as pd
from prereise.gather.demanddata.transportation_electrification.data_process import data_filtering
raw_nhts = pd.read_csv("trippub.csv")
data = {i: data_filtering(raw_nhts, i) for i in range(1, 10)}

There's one small tweak that needs to be made to a docstring, and the branch needs to be rebased onto the latest transportation_electrification commit, and then we're ready to merge!

@danielolsen danielolsen changed the base branch from develop to transportation_electrification January 5, 2022 18:06
Contributor

@danielolsen danielolsen left a comment

Looks good to me!

@danielolsen danielolsen merged commit a0157e7 into Breakthrough-Energy:transportation_electrification Jan 5, 2022
danielolsen pushed a commit that referenced this pull request Jan 31, 2022
danielolsen pushed a commit that referenced this pull request Mar 14, 2022
dmuldrew pushed a commit that referenced this pull request Sep 6, 2022
dmuldrew pushed a commit that referenced this pull request Nov 14, 2022
rouille pushed a commit that referenced this pull request Jan 11, 2023
rouille pushed a commit that referenced this pull request Jan 19, 2023
rouille pushed a commit that referenced this pull request Jan 20, 2023
rouille pushed a commit that referenced this pull request Feb 1, 2023
rouille pushed a commit that referenced this pull request Feb 1, 2023
jenhagg pushed a commit that referenced this pull request Feb 23, 2023
jenhagg pushed a commit that referenced this pull request Feb 24, 2023
rouille pushed a commit that referenced this pull request Feb 24, 2023