-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/la data #117
Feature/la data #117
Conversation
Bump node version
use geopandas/rtree capability to do a spatial join for the pruning schedules
|
||
def get_maximal_df(self): | ||
df = super().get_maximal_df() | ||
df = df.assign( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question from @daniellezegelstein @Cdower @cajaks2
why is the variable df reassigned in every function? why isn't the variable renamed? @allentran
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a parent class CityParser
that does some useful things common to all cities.
Each cities actual parser is a subclass of CityParser
, e.g BeverlyHillsParser
. Each subclass reimplements/overloads (I forget the correct term) the method get_maximal_df()
to do all the custom things required for that city. The name df
refers to dataframe and each operation is just adding/changing stuff to that dataframe. I don't know if this is understood, but each get_maximal_df()
is a different implementation specific to the city. (This is a common pattern in object oriented programming).
You could rename it but why would you when you're just operating on the same dataframe. I could rename every single instance of the dataframe, e.g df1, df2, ..., dfn
but 1) since a lot of Pandas operations are not in-place, I'd be wasting memory and 2) even more confusing to someone reading the code as then I start to think why would the engineer create N different copies of the dataframe if they didn't need them. If I just see one variable/dataframe thats being operated on, the intent is clear (even without documentation) that the code just does stuff to a single dataframe and then returns it.
|
||
import numpy as np | ||
import geohash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which geohash library is this? needs to be added to environment.yml.
addresses #114
Motivation and context
Speeds up pruning and parsing SM trees data (avoids two brute force N^2 computations) and some initial parsing for other LA datasets.
What I did
sjoin()
to do a spatial join based on the predicatecontains
(uses a RTree to avoid doing the N^2 calculation)