-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up the calculation of the penalization matrix for FDataGrid #519
Conversation
From this point on, memory usage might start to be a problem. Thus, it does not make sense to optimize the running time of the algoritm much more.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## develop #519 +/- ##
===========================================
+ Coverage 85.75% 85.77% +0.01%
===========================================
Files 144 145 +1
Lines 11428 11500 +72
===========================================
+ Hits 9800 9864 +64
- Misses 1628 1636 +8
☔ View full report in Codecov by Sentry. |
Otherwise, mypy is not able to infer a type for it. I think that is due to the mutability introduced by having it as a paramter in a function call.
With these changes, a FDataBasis using a GridBasis cannot be evaluated. This functionality is removed since, when evaluation is required, an FDataGrid should be used
return f"{type(self).__name__}(grid_points={self.grid_points}) " | ||
|
||
def __hash__(self) -> int: | ||
return hash((super(), f"{self.grid_points}")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is grid_points
hashable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vnmabus,
By itself no. But converted to a string it is no longer mutable so I would hope so. I might be missing something but that's why it's in an f-string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh-uh, that is a bit fishy. I would convert it to tuples instead or something like that (see what BSplineBasis
does with the knots
).
In the long term, I think that the hash
for the bases should not depend on the attributes, as the class is mutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider that changing the grid_points
results in a different grid basis. That is why I included it in the hash. I think it defines the identity of the basis.
I have changed it to use a tuple instead of convertin to string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true. However, we only need that objects that compare equal have the same hash. The opposite, although desirable for performance reasons, is not required. Note that our only reason for having hashable bases in the first case is because they are an attribute for the FDataBasis
ExtensionDType, and pandas requires that those are hashable.
I think that in the long term we should just hash the class type itself or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could change the grid_points
definition to be a property. That would make it inmutable and then the hash would make more sense.
I understand that hashing the class type itself is enough for the current needs. However, providing hashable objets with good performance is not a bad thing. Someone might want to store functional data in a dictionary or a set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For dictionaries, only the keys need to be hashable. Normally, mutable objects such as lists or arrays are not made hashable on purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. I was also thinking of sets, and I got them mixed up.
To me, GridBasis
does seem like it should be an immutable object. And it is fully characterized by the discretization points. With that said, since we do not have a need to implement effective hash methods, I think it is not unreasonable just to use the basis type as a hash. But then the method would be basically returning a constant.
Personally, I'd rather change the grid_points
to a property, making the object immutable, and keep the points in the hash. But I think the other option also makes sense. In any case, let me know what you think because I want to get to closing this PR rather sooner than later. The PLS functionality depends on this, and I'd like to have a clean PLS PR at this point (with these commits integrated into develop).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will accept this PR in its current state. If we change the hash behaviour for basis objects, that should be done in another PR.
For me, at the beginning I though of basis objects as immutable too. However, they were several disadvantages with that approach, and I am reconsidering it now. First, Python does not have first-class support for custom immutable objects, apart from frozen dataclasses. Second, and more important, the current approach makes difficult or even impossible the choice of basis parameters in a BasisSmoother
transformer using hyperparameter selection tools such as GridSearchCV
. For example, for choosing the number of elements in the basis, one has to create different Basis
objects with the appropriate parameter set. Choosing several basis parameters at the same time is too complex for average users.
Forgetting the idea of immutable basis, we can make Basis
inherit from BaseEstimator
, thus allowing it to play well with the hyperparameter selection tools in scikit-learn. I think we should investigate this approach further.
Taking advantage of the sparsity of the penalization matrix, it is possible to improve the time complexity of its calculation significantly