feat: adding data processing script #243
As a note for BE folks: the target for this PR is currently the
After the refactor, the code runs much faster: it now takes less than 4 minutes to go through all nine regions on my laptop, rather than the hour and a half it took before. Bravo! Comparing the results from before the refactor to the results afterward, I see a few differences:
If I take the indices from the newly returned dataframes and look up the corresponding values in the previously returned dataframes, most columns are the same, and some columns have values that are different but nearly identical:
@isond can you confirm that the behavior change is intentional, and that the new behavior is the one we want? One way to validate the new behavior could be to refactor the
    df1 = data_filtering(census_division=1)
we could:
    raw_nhts = pd.read_csv(FILENAME)
    df1 = data_filtering(raw_nhts, census_division=1)
The value of an approach like this is that we can design a test that passes a very small amount of data to
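A minimal sketch of the kind of test this suggestion enables, under stated assumptions: `filter_by_division` is a hypothetical stand-in, not the real `data_filtering`, and the column names `CENSUS_D` and `TRPMILES` are assumed for illustration only. The point is only that a function which receives its dataframe as an argument can be exercised with a few hand-built rows instead of the full CSV.

```python
import pandas as pd


def filter_by_division(raw_trips, census_division):
    """Hypothetical stand-in for data_filtering: keep rows for one census division.

    The column name "CENSUS_D" is an assumption made for this sketch.
    """
    selected = raw_trips[raw_trips["CENSUS_D"] == census_division]
    return selected.reset_index(drop=True)


def test_filter_by_division_keeps_only_requested_rows():
    # A tiny hand-built frame replaces the full CSV, so the test needs no file I/O.
    raw_trips = pd.DataFrame(
        {
            "CENSUS_D": [1, 1, 2, 9],
            "TRPMILES": [3.5, 12.0, 7.25, 1.0],
        }
    )
    result = filter_by_division(raw_trips, census_division=1)
    assert list(result["CENSUS_D"]) == [1, 1]
    assert list(result["TRPMILES"]) == [3.5, 12.0]
    # The input frame is left untouched.
    assert len(raw_trips) == 4


test_filter_by_division_keeps_only_requested_rows()
```

Because the dataframe is passed in, the same raw data can also be read once and reused across all census divisions rather than re-read on every call.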
Regarding the differences in the lengths of the dataframes: I noticed that, prior to the refactor, the condition that filters out repeated trips was using the incorrect column.
Basically, in the original code the two columns were switched around, so I've updated the code to correct that. The significant change in values for
As for the spelling change, I updated the
def data_filtering(raw_nhts, census_division):
    """Filter raw NHTS data to be used in mileage.py

    :param: (*pandas.DataFrame*) raw_nhts: raw NHTS dataframe
:param: (*pandas.DataFrame*) raw_nhts: -> :param pandas.DataFrame raw_nhts:. We only add the parentheses for the return variable.
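As a sketch of how a docstring might look once this comment is applied: the function below is an illustrative stub, not the real implementation. The `int` type for `census_division`, the 1 through 9 range, and the `--` separator on the return line are assumptions about the project's conventions.

```python
def data_filtering_stub(raw_nhts, census_division):
    """Illustrative stub only; it shows docstring layout, not the real filtering.

    :param pandas.DataFrame raw_nhts: raw NHTS dataframe
    :param int census_division: census division to select (assumed 1 through 9)
    :return: (*pandas.DataFrame*) -- the input, returned unchanged by this stub
    """
    # No filtering happens here; the stub exists only to carry the docstring.
    return raw_nhts
```

Parameter lines put the type between `:param` and the name with no parentheses, while the return line keeps the parenthesized type.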
All checks pass, and the code runs successfully and produces dataframes of the appropriate length, in 2 minutes:
    import pandas as pd
    from prereise.gather.demanddata.transportation_electrification.data_process import data_filtering
    raw_nhts = pd.read_csv("trippub.csv")
    data = {i: data_filtering(raw_nhts, i) for i in range(1, 10)}
There's one small tweak that needs to be made to a docstring, and the branch needs to be rebased onto the latest
danielolsen
left a comment
Looks good to me!
a0157e7
into
Breakthrough-Energy:transportation_electrification
Pull Request doc
Purpose
What is the larger goal of this change?
To add the data processing script that will organize the NHTS data.
What the code is doing
How is the purpose executed?
This code filters the NHTS data by the given census division and reorganizes the information. It also calculates additional information from the given data.
Testing
How did you test this change (unit/functional testing, manual testing, etc.)?
This was tested through manual testing using data from "trippub.csv".
Where to look
It's helpful to clarify where your new code lives if you moved files around or there could be confusion.
The new code is located at "PreREISE/prereise/gather/demanddata/transportation_electrification/data_process.py"
What files are most important?
data_process.py, which is currently the only file.
Usage Example/Visuals
How the code can be used and/or images of any graphs, tables or other visuals (not always applicable).
Time estimate
How long will it take for reviewers and observers to understand this code change?
It should take a few days for reviewers to understand.