Code review feedback

@willmcgugan did a code review. This is his feedback:

# Kloppy

- URL: https://github.com/PySport/kloppy
- Date Reviewed: Sep 3 2021

## General observations

- Names like utils.py and helpers.py are best avoided. Best to come up with descriptive names, even if that means separating them in to smaller files. 
- Typing throughout which is great. Consider enabling strict mode in Mypy to catch missing types.
- I think you will need to add a [py.typed](https://mypy.readthedocs.io/en/stable/installed_packages.html#creating-pep-561-compatible-packages) files to expose typing information to user of the library
- I wonder if you can convert some dataclasses to namedtuples, particularly the small mathematical entities, like Point and Dimensions. They are immutable by default, and typically smaller / faster than dataclasses.
- Your install requirements are unbounded. For instance, `lxml>=4.5.0`. If lxml follows [semver](https://semver.org/) then version 5.0.0 of lxml may have breaking changes, which could in turn break kloppy.

## Package structure

I see a fair number of * imports, which are generally frowned upon these days. The problem is that you import every symbol, including many things that are unintended. It's better to explicitly import every symbol (as tedious as that can sometimes be).

You might also want to prefix module / file names that you don't expect to be imported directly with an underscore.

Some of the module names are quite verbose (and difficult to remember). It's best to strive for as flat a module structure that makes sense.

for instance, I'm looking at EventDataset and the import is:

```python
kloppy.domain.models.event.EventDataset
```

Consider discarding a few levels to simplify that. For instance `domain.models` is probably irrelevant to the caller. Perhaps it could be simplified to the following:

```python
kloppy.events.EventDatset
```

These changes will likely require restructuring the project to reflect how the caller would want to use the library, rather than the internal logic. 

In general, the less you require the user of a library to remember, the more positively they will perceive it.

Some of your helper methods look like perhaps they should be methods of the Dataset object, for instance `transform` and `to_pandas`.

## Exceptions

There are a number of instances of `raise Exception`. Raising an exception like this makes it impossible to explicitly catch, as `except Exception` would catch *everything*. Better to create custom exception derived from Exception, so the caller can explicitly catch those.

There are also a few `ValueError`s raised. It generally not a good idea to raise builtin exceptions for similar reasons as above. Any number of bugs may raise ValueErrors and you risk hiding this by conflating them with a ValueError. Better to use a custom exception for these.

## Serializers

I focused on serializers, since this seems to be a major feature of the library.

I see you have an abstract base class for serializers, [`CodeDataSerializer`](https://github.com/PySport/kloppy/blob/0ebacb5faa86627974b6d48ff1e0e81a0ad1b145/kloppy/infra/serializers/code/base.py) which has `serialize` and `deserialize` methods. The `deserialize` method is problematic because of the options dict which can take arbitrary parameters, and similarly `inputs` which has an arbitrary number of `Readable` objects. Passing in an object without any structure means that you lose out on a) typing and b) expressive method signatures.

When we spoke I suggested leaving these options out of the abstract methods and putting them in the constructor. With the options in the constructor they can be part of the signature rather than bundled in a dict.

Here's a (rough) example:

```python
class MetricaTrackingSerializer(EventDataSerializer):

    def __init__(self, raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
        self.raw_data_home = raw_data_home
        self.raw_data_away = raw_data_away
        self.sample_rate = sample_rate
    
    def deserialize(self) -> EventDataset:
        ...

    def serialize(self) -> Tuple[str, str]:
        ...
```

It may be the case that you will only need those parameters for deserialize, which brings me to another suggestion: separate the two classes in to a `Serializer` and `Deserializer`. A class should ideally do only one thing, and your current serializer baseclass does two things. By separating them you can also encode any options in the constructor which doesn't need to be abstract.

If you separate the serializers in to two classes, the method names `deserialize` and `serialize` become redundant. You may as well make them callable by adding a `__call__` method.

```python
class MetricaTrackingDeserializer(Deserializer):

    def __init__(self, raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
        ....
    
    def __call__(self) -> EventDataset:
        ....
```

That way you can use them like this:

```python
deserializer = MetricaTrackingDeserializer(foo, bar):
dataset = deserializer()
```

This brings me to another point. Do you need an abstract base class at all?

An base class is useful when you want to accept an object with a common interface. But as far as I can tell, you don't accept the base class as an argument anywhere. Which suggests to me that they could be functions and not a class.

How about a module for each format, with a `load` and `save` function in each (or `import` and `export` if you prefer). These are somewhat friendlier terms that *serializer*, and easier to type. 

Your deserializer could simply be:

```python
def load(raw_data_home: Readable, raw_data_away: Readable, sample_rate: float = 1/12):
    ...
```

You could structure your project to put these formats at the top level, and you could import them like this:

```python
from kloppy import metrica

my_dataset = metrica.load_tracking("home_file_.csv", "away_file.csv")
```

This is not too dissimilar to your `load_` helper functions, but I think friendlier still, and there wouldn't need to be separate helper methods.

Just to labour the point a bit more, here's an example from the docs:

```python
from kloppy import StatsBombSerializer, Provider

with open(
        f"{base_dir}/files/statsbomb_lineup.json", "rb"
    ) as lineup_data, open(
        f"{base_dir}/files/statsbomb_event.json", "rb"
    ) as event_data:
        dataset = serializer.deserialize(
            inputs={"lineup_data": lineup_data, "event_data": event_data},
            options={"coordinate_system": Provider.STATSBOMB},
        )
        return dataset
```

Consider this as an alternative:

```python
from kloppy import statsbomb

dataset = statsbomb.load(
    f"{base_dir}/files/statsbomb_lineup.json",
    f"{base_dir}/files/statsbomb_event.json",
    coordinates="statsbomb"
 )

```

Note that rather than files, this accepts paths. I suspect this will be the most common requirement. If you also need to accept file-like objects you could accept a `Union[str, IO[str]]` and handle that within the `load` method. 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code review feedback #112

Kloppy

General observations

Package structure

Exceptions

Serializers

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Code review feedback #112

Description

Kloppy

General observations

Package structure

Exceptions

Serializers

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions