@willmcgugan did a code review. This is his feedback:
Kloppy
General observations
- Names like utils.py and helpers.py are best avoided. Best to come up with descriptive names, even if that means separating them in to smaller files.
- Typing throughout which is great. Consider enabling strict mode in Mypy to catch missing types.
- I think you will need to add a py.typed files to expose typing information to user of the library
- I wonder if you can convert some dataclasses to namedtuples, particularly the small mathematical entities, like Point and Dimensions. They are immutable by default, and typically smaller / faster than dataclasses.
- Your install requirements are unbounded. For instance,
lxml>=4.5.0. If lxml follows semver then version 5.0.0 of lxml may have breaking changes, which could in turn break kloppy.
Package structure
I see a fair number of * imports, which are generally frowned upon these days. The problem is that you import every symbol, including many things that are unintended. It's better to explicitly import every symbol (as tedious as that can sometimes be).
You might also want to prefix module / file names that you don't expect to be imported directly with an underscore.
Some of the module names are quite verbose (and difficult to remember). It's best to strive for as flat a module structure that makes sense.
for instance, I'm looking at EventDataset and the import is:
kloppy.domain.models.event.EventDataset
Consider discarding a few levels to simplify that. For instance domain.models is probably irrelevant to the caller. Perhaps it could be simplified to the following:
kloppy.events.EventDatset
These changes will likely require restructuring the project to reflect how the caller would want to use the library, rather than the internal logic.
In general, the less you require the user of a library to remember, the more positively they will perceive it.
Some of your helper methods look like perhaps they should be methods of the Dataset object, for instance transform and to_pandas.
Exceptions
There are a number of instances of raise Exception. Raising an exception like this makes it impossible to explicitly catch, as except Exception would catch everything. Better to create custom exception derived from Exception, so the caller can explicitly catch those.
There are also a few ValueErrors raised. It generally not a good idea to raise builtin exceptions for similar reasons as above. Any number of bugs may raise ValueErrors and you risk hiding this by conflating them with a ValueError. Better to use a custom exception for these.
Serializers
I focused on serializers, since this seems to be a major feature of the library.
I see you have an abstract base class for serializers, CodeDataSerializer which has serialize and deserialize methods. The deserialize method is problematic because of the options dict which can take arbitrary parameters, and similarly inputs which has an arbitrary number of Readable objects. Passing in an object without any structure means that you lose out on a) typing and b) expressive method signatures.
When we spoke I suggested leaving these options out of the abstract methods and putting them in the constructor. With the options in the constructor they can be part of the signature rather than bundled in a dict.
Here's a (rough) example:
class MetricaTrackingSerializer(EventDataSerializer):
def __init__(self, raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
self.raw_data_home = raw_data_home
self.raw_data_away = raw_data_away
self.sample_rate = sample_rate
def deserialize(self) -> EventDataset:
...
def serialize(self) -> Tuple[str, str]:
...
It may be the case that you will only need those parameters for deserialize, which brings me to another suggestion: separate the two classes in to a Serializer and Deserializer. A class should ideally do only one thing, and your current serializer baseclass does two things. By separating them you can also encode any options in the constructor which doesn't need to be abstract.
If you separate the serializers in to two classes, the method names deserialize and serialize become redundant. You may as well make them callable by adding a __call__ method.
class MetricaTrackingDeserializer(Deserializer):
def __init__(self, raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
....
def __call__(self) -> EventDataset:
....
That way you can use them like this:
deserializer = MetricaTrackingDeserializer(foo, bar):
dataset = deserializer()
This brings me to another point. Do you need an abstract base class at all?
An base class is useful when you want to accept an object with a common interface. But as far as I can tell, you don't accept the base class as an argument anywhere. Which suggests to me that they could be functions and not a class.
How about a module for each format, with a load and save function in each (or import and export if you prefer). These are somewhat friendlier terms that serializer, and easier to type.
Your deserializer could simply be:
def load(raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
...
You could structure your project to put these formats at the top level, and you could import them like this:
from kloppy import metrica
my_dataset = metrica.load_tracking("home_file_.csv", "away_file.csv")
This is not too dissimilar to your load_ helper functions, but I think friendlier still, and there wouldn't need to be separate helper methods.
Just to labour the point a bit more, here's an example from the docs:
from kloppy import StatsBombSerializer, Provider
with open(
f"{base_dir}/files/statsbomb_lineup.json", "rb"
) as lineup_data, open(
f"{base_dir}/files/statsbomb_event.json", "rb"
) as event_data:
dataset = serializer.deserialize(
inputs={"lineup_data": lineup_data, "event_data": event_data},
options={"coordinate_system": Provider.STATSBOMB},
)
return dataset
Consider this as an alternative:
from kloppy import statsbomb
dataset = statsbomb.load(
f"{base_dir}/files/statsbomb_lineup.json",
f"{base_dir}/files/statsbomb_event.json",
coordinates="statsbomb"
)
Note that rather than files, this accepts paths. I suspect this will be the most common requirement. If you also need to accept file-like objects you could accept a Union[str, IO[str]] and handle that within the load method.
@willmcgugan did a code review. This is his feedback:
Kloppy
General observations
lxml>=4.5.0. If lxml follows semver then version 5.0.0 of lxml may have breaking changes, which could in turn break kloppy.Package structure
I see a fair number of * imports, which are generally frowned upon these days. The problem is that you import every symbol, including many things that are unintended. It's better to explicitly import every symbol (as tedious as that can sometimes be).
You might also want to prefix module / file names that you don't expect to be imported directly with an underscore.
Some of the module names are quite verbose (and difficult to remember). It's best to strive for as flat a module structure that makes sense.
for instance, I'm looking at EventDataset and the import is:
Consider discarding a few levels to simplify that. For instance
domain.modelsis probably irrelevant to the caller. Perhaps it could be simplified to the following:These changes will likely require restructuring the project to reflect how the caller would want to use the library, rather than the internal logic.
In general, the less you require the user of a library to remember, the more positively they will perceive it.
Some of your helper methods look like perhaps they should be methods of the Dataset object, for instance
transformandto_pandas.Exceptions
There are a number of instances of
raise Exception. Raising an exception like this makes it impossible to explicitly catch, asexcept Exceptionwould catch everything. Better to create custom exception derived from Exception, so the caller can explicitly catch those.There are also a few
ValueErrors raised. It generally not a good idea to raise builtin exceptions for similar reasons as above. Any number of bugs may raise ValueErrors and you risk hiding this by conflating them with a ValueError. Better to use a custom exception for these.Serializers
I focused on serializers, since this seems to be a major feature of the library.
I see you have an abstract base class for serializers,
CodeDataSerializerwhich hasserializeanddeserializemethods. Thedeserializemethod is problematic because of the options dict which can take arbitrary parameters, and similarlyinputswhich has an arbitrary number ofReadableobjects. Passing in an object without any structure means that you lose out on a) typing and b) expressive method signatures.When we spoke I suggested leaving these options out of the abstract methods and putting them in the constructor. With the options in the constructor they can be part of the signature rather than bundled in a dict.
Here's a (rough) example:
It may be the case that you will only need those parameters for deserialize, which brings me to another suggestion: separate the two classes in to a
SerializerandDeserializer. A class should ideally do only one thing, and your current serializer baseclass does two things. By separating them you can also encode any options in the constructor which doesn't need to be abstract.If you separate the serializers in to two classes, the method names
deserializeandserializebecome redundant. You may as well make them callable by adding a__call__method.That way you can use them like this:
This brings me to another point. Do you need an abstract base class at all?
An base class is useful when you want to accept an object with a common interface. But as far as I can tell, you don't accept the base class as an argument anywhere. Which suggests to me that they could be functions and not a class.
How about a module for each format, with a
loadandsavefunction in each (orimportandexportif you prefer). These are somewhat friendlier terms that serializer, and easier to type.Your deserializer could simply be:
You could structure your project to put these formats at the top level, and you could import them like this:
This is not too dissimilar to your
load_helper functions, but I think friendlier still, and there wouldn't need to be separate helper methods.Just to labour the point a bit more, here's an example from the docs:
Consider this as an alternative:
Note that rather than files, this accepts paths. I suspect this will be the most common requirement. If you also need to accept file-like objects you could accept a
Union[str, IO[str]]and handle that within theloadmethod.