Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/la data #117

Merged
merged 6 commits into from
Jul 23, 2020
Merged

Feature/la data #117

merged 6 commits into from
Jul 23, 2020

Conversation

allentran
Copy link
Contributor

addresses #114

Motivation and context

Speeds up pruning and parsing SM trees data (avoids two brute force N^2 computations) and some initial parsing for other LA datasets.

What I did

  • LA city/county, city parsers for Matt Stiles data upto Beverly Hills (starting alphabetically)
  • geohash street line segments and tree lat/lon points, match on same geohash while decrementing geohash digit precision (order(N) calculation)
  • use GeoPandas sjoin() to do a spatial join based on the predicate contains (uses a RTree to avoid doing the N^2 calculation)

@emillipede emillipede added this to for review in santa monica tree map Jun 24, 2020

def get_maximal_df(self):
df = super().get_maximal_df()
df = df.assign(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question from @daniellezegelstein @Cdower @cajaks2

why is the variable df reassigned in every function? why isn't the variable renamed? @allentran

Copy link
Contributor Author

@allentran allentran Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a parent class CityParser that does some useful things common to all cities.

Each cities actual parser is a subclass of CityParser, e.g BeverlyHillsParser. Each subclass reimplements/overloads (I forget the correct term) the method get_maximal_df() to do all the custom things required for that city. The name df refers to dataframe and each operation is just adding/changing stuff to that dataframe. I don't know if this is understood, but each get_maximal_df() is a different implementation specific to the city. (This is a common pattern in object oriented programming).

You could rename it but why would you when you're just operating on the same dataframe. I could rename every single instance of the dataframe, e.g df1, df2, ..., dfn but 1) since a lot of Pandas operations are not in-place, I'd be wasting memory and 2) even more confusing to someone reading the code as then I start to think why would the engineer create N different copies of the dataframe if they didn't need them. If I just see one variable/dataframe thats being operated on, the intent is clear (even without documentation) that the code just does stuff to a single dataframe and then returns it.

@emillipede emillipede added this to the la county expansion milestone Jul 22, 2020
@emillipede emillipede merged commit 454606e into test-circleci Jul 23, 2020
santa monica tree map automation moved this from for review to done Jul 23, 2020

import numpy as np
import geohash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which geohash library is this? needs to be added to environment.yml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants