Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the calculation of the penalization matrix for FDataGrid #519

Merged
merged 34 commits into from
Jun 24, 2023

Conversation

Ddelval
Copy link
Contributor

@Ddelval Ddelval commented Mar 1, 2023

Taking advantage of the sparsity of the penalization matrix, it is possible to improve the time complexity of its calculation significantly

From this point on, memory usage might
start to be a problem.
Thus, it does not make sense to optimize the running time of the algoritm much more.
@codecov
Copy link

codecov bot commented Mar 1, 2023

Codecov Report

Patch coverage: 93.25% and project coverage change: +0.01 🎉

Comparison is base (0154d69) 85.75% compared to head (f69db8f) 85.77%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #519      +/-   ##
===========================================
+ Coverage    85.75%   85.77%   +0.01%     
===========================================
  Files          144      145       +1     
  Lines        11428    11500      +72     
===========================================
+ Hits          9800     9864      +64     
- Misses        1628     1636       +8     
Impacted Files Coverage Δ
skfda/representation/basis/_grid_basis.py 80.00% <80.00%> (ø)
...da/misc/operators/_linear_differential_operator.py 95.74% <96.07%> (-0.18%) ⬇️
skfda/preprocessing/dim_reduction/_fpca.py 89.84% <100.00%> (-0.08%) ⬇️
skfda/representation/basis/_custom_basis.py 86.25% <100.00%> (ø)
skfda/tests/test_linear_differential_operator.py 98.50% <100.00%> (+0.39%) ⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Otherwise, mypy is not able to infer a type for it.
I think that is due to the mutability introduced
by having it as a paramter in a function call.
@Ddelval Ddelval requested a review from vnmabus March 1, 2023 22:33
@Ddelval Ddelval marked this pull request as draft March 22, 2023 16:05
With these changes, a FDataBasis using a GridBasis cannot be evaluated.
This functionality is removed since, when evaluation is required,
an FDataGrid should be used
skfda/misc/operators/_linear_differential_operator.py Outdated Show resolved Hide resolved
skfda/tests/test_fpca_regression.py Outdated Show resolved Hide resolved
skfda/tests/test_fpca_regression.py Outdated Show resolved Hide resolved
skfda/tests/test_fpca_regression.py Outdated Show resolved Hide resolved
skfda/tests/test_fpca_regression.py Outdated Show resolved Hide resolved
@Ddelval Ddelval marked this pull request as ready for review April 16, 2023 10:46
@Ddelval Ddelval requested a review from vnmabus April 16, 2023 10:46
skfda/representation/basis/_grid_basis.py Outdated Show resolved Hide resolved
skfda/representation/basis/_grid_basis.py Outdated Show resolved Hide resolved
skfda/misc/operators/_linear_differential_operator.py Outdated Show resolved Hide resolved
skfda/misc/operators/_linear_differential_operator.py Outdated Show resolved Hide resolved
@Ddelval Ddelval requested a review from vnmabus June 22, 2023 06:24
return f"{type(self).__name__}(grid_points={self.grid_points}) "

def __hash__(self) -> int:
return hash((super(), f"{self.grid_points}"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is grid_points hashable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnmabus,
By itself no. But converted to a string it is no longer mutable so I would hope so. I might be missing something but that's why it's in an f-string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh-uh, that is a bit fishy. I would convert it to tuples instead or something like that (see what BSplineBasis does with the knots).

In the long term, I think that the hash for the bases should not depend on the attributes, as the class is mutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider that changing the grid_points results in a different grid basis. That is why I included it in the hash. I think it defines the identity of the basis.
I have changed it to use a tuple instead of convertin to string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. However, we only need that objects that compare equal have the same hash. The opposite, although desirable for performance reasons, is not required. Note that our only reason for having hashable bases in the first case is because they are an attribute for the FDataBasis ExtensionDType, and pandas requires that those are hashable.

I think that in the long term we should just hash the class type itself or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change the grid_points definition to be a property. That would make it inmutable and then the hash would make more sense.
I understand that hashing the class type itself is enough for the current needs. However, providing hashable objets with good performance is not a bad thing. Someone might want to store functional data in a dictionary or a set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For dictionaries, only the keys need to be hashable. Normally, mutable objects such as lists or arrays are not made hashable on purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right. I was also thinking of sets, and I got them mixed up.
To me, GridBasis does seem like it should be an immutable object. And it is fully characterized by the discretization points. With that said, since we do not have a need to implement effective hash methods, I think it is not unreasonable just to use the basis type as a hash. But then the method would be basically returning a constant.

Personally, I'd rather change the grid_points to a property, making the object immutable, and keep the points in the hash. But I think the other option also makes sense. In any case, let me know what you think because I want to get to closing this PR rather sooner than later. The PLS functionality depends on this, and I'd like to have a clean PLS PR at this point (with these commits integrated into develop).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will accept this PR in its current state. If we change the hash behaviour for basis objects, that should be done in another PR.

For me, at the beginning I though of basis objects as immutable too. However, they were several disadvantages with that approach, and I am reconsidering it now. First, Python does not have first-class support for custom immutable objects, apart from frozen dataclasses. Second, and more important, the current approach makes difficult or even impossible the choice of basis parameters in a BasisSmoother transformer using hyperparameter selection tools such as GridSearchCV. For example, for choosing the number of elements in the basis, one has to create different Basis objects with the appropriate parameter set. Choosing several basis parameters at the same time is too complex for average users.

Forgetting the idea of immutable basis, we can make Basis inherit from BaseEstimator, thus allowing it to play well with the hyperparameter selection tools in scikit-learn. I think we should investigate this approach further.

@vnmabus vnmabus merged commit c40e372 into develop Jun 24, 2023
16 of 17 checks passed
@vnmabus vnmabus deleted the feature/speedUpPenalization branch August 18, 2023 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants