# Python Data Classes Tutorial: N Things You Must Learn About Data Classes

## Why learn about data classes?

Data classes are one of those Python features that, once you discover them, you never want to go back to the old way. Consider this regular class:

In [1]:
class Exercise:
    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight

To me, that class definition is very inefficient: in the `__init__` method, each parameter name appears three times. This may not sound like a big deal, but think about how many classes you will write in your lifetime, often with many more parameters.

In comparison, take a look at the data class alternative to the above code:

In [1]:
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float  # Weight in lbs

This modest-looking piece of code is far more compact than the regular class. The tiny `@dataclass` decorator implements the `__init__`, `__repr__`, and `__eq__` methods behind the scenes, which would take roughly 20 lines of code to write by hand. Besides, many other features such as comparison operators, object ordering, and immutability are all a single line away from being created for our class.
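
To make the saving concrete, here is a rough hand-written approximation of what the decorator generates. It is only a sketch: the real generated methods cover more edge cases, and the name `ManualExercise` is made up for this comparison.

```python
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


class ManualExercise:
    """Hand-written approximation of the methods @dataclass generates."""

    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight

    def __repr__(self):
        return (
            f"ManualExercise(name={self.name!r}, reps={self.reps!r}, "
            f"sets={self.sets!r}, weight={self.weight!r})"
        )

    def __eq__(self, other):
        # Only compare against objects of exactly the same class
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.name, self.reps, self.sets, self.weight) == (
            other.name,
            other.reps,
            other.sets,
            other.weight,
        )


manual = ManualExercise("Bench press", 10, 3, 52.5)
auto = Exercise("Bench press", 10, 3, 52.5)

print(manual)  # ManualExercise(name='Bench press', reps=10, sets=3, weight=52.5)
print(auto)    # Exercise(name='Bench press', reps=10, sets=3, weight=52.5)
print(manual == ManualExercise("Bench press", 10, 3, 52.5))  # True
```

Every new field means touching all three methods in the manual version, while the data class only needs one extra line.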

So, the purpose of this tutorial is to show you why data classes are the best thing to happen to Python if you love object-oriented programming. 

Let's get started!

## Basics of data classes

This section covers:

1. Defining data classes and the methods they generate automatically
2. Type hints, which are required but not enforced
3. Using any type from the `typing` module
4. Adding default values to fields
5. Why fields with defaults must come after fields without them
6. Creating data classes on the fly with `make_dataclass`

### Some methods are automatically generated in data classes

Despite all their features, data classes are regular classes that take much less code to implement the same functionality. Here is the `Exercise` class again:

In [2]:
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


ex1 = Exercise("Bench press", 10, 3, 52.5)

In [3]:
# Verifying Exercise is a regular class
ex1.name

'Bench press'

At this point, `Exercise` already has the `__repr__` and `__eq__` methods implemented. Let's verify it:

In [4]:
repr(ex1)

"Exercise(name='Bench press', reps=10, sets=3, weight=52.5)"

By convention, `repr` should return, where possible, a string that looks like the code needed to recreate the object, and that is exactly the case for `ex1`.
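
One way to check the "recreate itself" property is to evaluate the string and compare the result with the original. The snippet below is a small sketch of that idea; `eval` appears here only for demonstration and should not be used on untrusted input.

```python
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


ex1 = Exercise("Bench press", 10, 3, 52.5)

# Evaluate the repr string as Python code to build a second object
clone = eval(repr(ex1))

print(clone == ex1)  # True: same field values
print(clone is ex1)  # False: a separate object
```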

In comparison, `Exercise` defined in the old way would look like this:

In [5]:
class Exercise:
    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight


ex3 = Exercise("Bench press", 10, 3, 52.5)

ex3

<__main__.Exercise at 0x7f6834100130>

Pretty awful!

Now, let's verify the existence of `__eq__`, the method behind the `==` operator:

In [7]:
# Redefine the class
@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


ex1 = Exercise("Bench press", 10, 3, 52.5)
ex2 = Exercise("Bench press", 10, 3, 52.5)

Comparing an instance to itself and to another instance with identical field values must return `True`:

In [8]:
ex1 == ex2

True

In [9]:
ex1 == ex1

True

And so it does! In regular classes, this logic would have been a pain to write.
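
For contrast, a regular class without its own `__eq__` falls back to identity comparison, so two objects with identical values are not considered equal. A minimal sketch, using a made-up `PlainExercise` class:

```python
class PlainExercise:
    def __init__(self, name, reps):
        self.name = name
        self.reps = reps


a = PlainExercise("Bench press", 10)
b = PlainExercise("Bench press", 10)

print(a == b)  # False: different objects, even though the values match
print(a == a)  # True: an object is always identical to itself
```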

### Data classes require type hints 

As you might have noticed, data classes require type hints when defining fields. In fact, data classes allow any type from the `typing` module. For example, here is how to create a field that accepts any data type by using `Any`:

In [10]:
from typing import Any


@dataclass
class Dummy:
    attr: Any

However, a quirk of Python is that even though data classes require type hints, the types aren't actually enforced at runtime.

For example, creating an instance of the `Exercise` class with completely incorrect data types runs without errors:

In [None]:
silly_exercise = Exercise("Bench press", "ten", "three sets", 52.5)

silly_exercise.sets

If you want to catch such mistakes, use a static type checker such as [Mypy](https://mypy-lang.org/), which reports type mismatches before the code runs.
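
If a runtime check is needed as well, one possible approach is to loop over the fields and compare each value with its annotation. The helper below, `check_types`, is a hypothetical sketch and not part of the `dataclasses` module. It assumes every annotation is a plain class such as `int` or `str`, so it will not work with `typing` generics or string annotations.

```python
from dataclasses import dataclass, fields


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


def check_types(instance):
    """Hypothetical helper: raise TypeError when a field's value does not
    match its annotation. Only handles plain classes such as int or str."""
    for f in fields(instance):
        value = getattr(instance, f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name} should be {f.type.__name__}, "
                f"got {type(value).__name__}"
            )


check_types(Exercise("Bench press", 10, 3, 52.5))  # passes silently

try:
    check_types(Exercise("Bench press", "ten", "three sets", 52.5))
except TypeError as err:
    print(err)  # reps should be int, got str
```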

### Data classes allow default values in fields

So far, we haven't added any defaults to our classes. Let's fix that:

In [11]:
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0


# Now, all fields have defaults
ex5 = Exercise()
ex5

Exercise(name='Push-ups', reps=10, sets=3, weight=0)

Keep in mind that non-default fields can't follow default fields. For example, the code below raises an error as soon as the class is defined:

```python
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float  # NOT ALLOWED


ex5 = Exercise()
ex5
```

```
TypeError: non-default argument 'weight' follows default argument
```

The `name: type = value` syntax is enough for simple defaults. When you need more control, you can use the `field` function, which lets you configure not just the default value but other aspects of each field as well:

In [13]:
from dataclasses import field


@dataclass
class Exercise:
    name: str = field(default="Push-up")
    reps: int = field(default=10)
    sets: int = field(default=3)
    weight: float = field(default=0)


# Now, all fields have defaults
ex5 = Exercise()
ex5

Exercise(name='Push-up', reps=10, sets=3, weight=0)

The `field` function has more parameters such as:
- `repr`
- `init`
- `compare`
- `default_factory`

and so on. We will discuss these in the coming sections.
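
As a quick preview, here is a small sketch of three of these parameters in action. The `notes` and `completed` fields are invented for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    # Hidden from repr() and ignored by ==
    notes: str = field(default="", repr=False, compare=False)
    # Not accepted by __init__; always starts at its default
    completed: bool = field(default=False, init=False)


a = Exercise("Squats", 12, notes="felt easy")
b = Exercise("Squats", 12, notes="felt hard")

print(a)       # Exercise(name='Squats', reps=12, completed=False)
print(a == b)  # True, because notes is excluded from comparison
```

Passing `completed=True` to the constructor would raise a `TypeError`, since `init=False` removes the field from the `__init__` parameters.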

### Data classes can be created with a function

A final note on the data class basics is that their definition can be even shorter by using the `make_dataclass` function:

In [14]:
from dataclasses import make_dataclass

Exercise = make_dataclass(
    "Exercise",
    [
        ("name", str),
        ("reps", int),
        ("sets", int),
        ("weight", float),
    ],
)

ex3 = Exercise("Deadlifts", 8, 3, 69.0)
ex3

Exercise(name='Deadlifts', reps=8, sets=3, weight=69.0)

But, you will sacrifice readability, so I don't recommend using this function.
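
If you do use `make_dataclass`, defaults can be supplied by adding a `field(...)` object as a third element of each tuple. A short sketch, with default values chosen only for illustration:

```python
from dataclasses import field, make_dataclass

Exercise = make_dataclass(
    "Exercise",
    [
        ("name", str),
        ("reps", int, field(default=10)),
        ("sets", int, field(default=3)),
        ("weight", float, field(default=0)),
    ],
)

print(Exercise("Deadlifts"))
# Exercise(name='Deadlifts', reps=10, sets=3, weight=0)
```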

## Advanced data classes

In this section, we will discuss advanced features of data classes that bring more benefits. One such feature is a default factory.

### Default factories
To explain default factories, let's create another class named `WorkoutSession` that accepts two fields:

In [16]:
from dataclasses import dataclass
from typing import List


@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0


@dataclass
class WorkoutSession:
    exercises: List[Exercise]
    duration_minutes: int

By using the `List` type, we are specifying that `WorkoutSession` accepts a list of `Exercise` instances.

In [17]:
# Define the Exercise instances for HIIT training
ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex2 = Exercise(name="Mountain Climbers", reps=20, sets=3)
ex3 = Exercise(name="Jump Squats", reps=12, sets=3)
exercises_monday = [ex1, ex2, ex3]

hiit_monday = WorkoutSession(exercises=exercises_monday, duration_minutes=30)

Right now, each session instance requires exercises to be initialized. But this doesn't mirror how people work out - first, they start a session (probably in an app) and then, add exercises as they work out.

So, we must be able to create sessions with no exercises and no duration. Let's make this happen by adding an empty list as a default value for `exercises`:

```python
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = []
    duration_minutes: int = 0


hiit_monday = WorkoutSession()
```

```python
ValueError: mutable default <class 'list'> for field exercises is not allowed: use default_factory
```

But we got an error. Data classes reject mutable default values such as lists, because a single default object would be shared by every instance of the class.

However, we can fix this by using a default factory:

In [20]:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=list)
    duration_minutes: int = 0


hiit_monday = WorkoutSession()
hiit_monday

WorkoutSession(exercises=[], duration_minutes=0)

The `default_factory` parameter accepts a function that takes no arguments and returns the initial value for a field. This means it can be any zero-argument callable:
- `tuple`
- `dict`
- `set`
- Any user-defined custom function

regardless of whether the result of the function is mutable or not.
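
The reason data classes insist on a factory becomes clearer next to the classic shared-default pitfall in a regular class. The sketch below uses a made-up `PlainSession` class and a cut-down `WorkoutSession` that stores exercise names as strings.

```python
from dataclasses import dataclass, field
from typing import List


class PlainSession:
    # Pitfall: this one list is created once and reused by every call
    def __init__(self, exercises=[]):
        self.exercises = exercises


s1 = PlainSession()
s2 = PlainSession()
s1.exercises.append("Burpees")
print(s2.exercises)  # ['Burpees']: s2 changed by accident


@dataclass
class WorkoutSession:
    # The factory runs once per instance, so nothing is shared
    exercises: List[str] = field(default_factory=list)


w1 = WorkoutSession()
w2 = WorkoutSession()
w1.exercises.append("Burpees")
print(w2.exercises)  # []: each instance got its own list
```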

Now, if we think about it, most people start their training with warm-up exercises that are typically similar for any kind of workout. So, initializing sessions with no exercises may not be what some people want.

Instead, let's create a function that returns three warm-up `Exercise`s:

In [24]:
def create_warmup():
    return [
        Exercise("Jumping jacks", 30, 1),
        Exercise("Squat lunges", 10, 2),
        Exercise("High jumps", 20, 1),
    ]

In [25]:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5  # Increase the default duration as well


hiit_monday = WorkoutSession()
hiit_monday

WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0)], duration_minutes=5)

Now, any time we create a session, it will come with some warm-up exercises already logged. The new version of `WorkoutSession` has a default duration of five minutes to account for that.

### Adding methods to data classes

Since data classes are regular classes, adding methods to them works exactly the same way. Let's add two methods to our `WorkoutSession` data class:

In [26]:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

Using these methods, we can now log any new activity to a session:

In [29]:
hiit_monday = WorkoutSession()

# Log a new exercise
new_exercise = Exercise("Deadlifts", 6, 4, 60)

hiit_monday.add_exercise(new_exercise)
hiit_monday.increase_duration(15)

But, there is a problem:

In [30]:
hiit_monday

WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0), Exercise(name='Deadlifts', reps=6, sets=4, weight=60)], duration_minutes=20)

When we display the session, its default representation is too verbose and hard to read, since it mirrors the code needed to recreate the object. Let's fix that.

### `__repr__` and `__str__` in data classes

Data classes implement `__repr__` automatically but not `__str__`. This makes the class fall back on `__repr__` when we call `print` on it.

So, let's override this behavior:

In [34]:
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0

    def __str__(self):
        base = f"{self.name}: {self.reps}/{self.sets}"
        if self.weight == 0:
            return base
        return base + f", {self.weight} lbs"


ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex1

Exercise(name='Burpees', reps=15, sets=3, weight=0)

The `__repr__` is still the same but:

In [35]:
print(ex1)

Burpees: 15/3


The class's string representation is much nicer. Now, let's fix `WorkoutSession` as well:

In [33]:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5  # Increase the default duration as well

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""

        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."

        return base


hiit_monday = WorkoutSession()
print(hiit_monday)

Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1

Session duration: 5 minutes.


Now we have a readable and compact output.

### Comparison in data classes

For many classes, it makes sense to compare their objects by some logic. For workouts, it can be the workout duration, the exercise intensity or the weight. 

First, let's see what happens if we try to compare two workouts in the current state:

In [36]:
hiit_wednesday = WorkoutSession()

hiit_wednesday.add_exercise(Exercise("Pull-ups", 7, 3))
print(hiit_wednesday)

Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1
Pull-ups: 7/3

Session duration: 5 minutes.


In [37]:
hiit_monday > hiit_wednesday

TypeError: '>' not supported between instances of 'WorkoutSession' and 'WorkoutSession'

We receive a `TypeError` because data classes don't implement the ordering operators (`<`, `<=`, `>`, `>=`) by default. But this is easily fixable by setting the `order` parameter to `True`:

In [39]:
@dataclass(order=True)
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""

        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."

        return base

In [40]:
hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession()

hiit_monday > hiit_wednesday

True

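This time, the comparison works. But what are we even comparing?

In data classes, instances are compared as if they were tuples of their fields, in the order the fields are defined. Right now, both sessions hold the same default warm-up exercises, so the first field, `exercises`, is equal on both sides and the comparison falls through to the workout duration. If the lists differed at some position, Python would try to order the `Exercise` objects at that position, which raises a `TypeError` unless `Exercise` also sets `order=True`.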
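
The field-by-field behavior can be seen in a cut-down sketch. The two classes below are simplified stand-ins for the ones in this tutorial, and `Exercise` is deliberately left without `order=True`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10


@dataclass(order=True)
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=list)
    duration_minutes: int = 5


shorter = WorkoutSession([Exercise()], 5)
longer = WorkoutSession([Exercise()], 15)

# The exercise lists are equal, so the second field decides
print(longer > shorter)  # True

# Lists that differ at index 0 force Python to order two Exercise
# objects, which fails because Exercise does not define "<"
other = WorkoutSession([Exercise("Squats")], 5)
try:
    shorter < other
except TypeError as err:
    print("TypeError:", err)
```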

We can verify this by increasing the duration of the Wednesday session:

In [42]:
hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)

hiit_wednesday = WorkoutSession()
hiit_wednesday.increase_duration(10)

hiit_monday > hiit_wednesday

False

As expected, we received `False`. 

But what would happen if the first field of `WorkoutSession` was of another type, say, a string? Let's try and find out:

In [45]:
@dataclass(order=True)
class WorkoutSession:
    date: str = None  # DD-MM-YYYY
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""

        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."

        return base

In [47]:
hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession("27-02-2024")

hiit_monday > hiit_wednesday

False

Even though the Monday session lasts longer, the comparison tells us that it is smaller than Wednesday's. The reason is that the `date` field is compared first, and "25" comes before "27" in Python string comparison.
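
String comparison is lexicographic, so `DD-MM-YYYY` strings do not sort chronologically in general. The short sketch below contrasts them with `datetime.date` objects, which do. The dates are arbitrary examples.

```python
from datetime import date

# Lexicographic comparison looks at the day digits first
print("25-03-2024" > "01-04-2024")  # True, although 25 March is earlier

# date objects compare chronologically
print(date(2024, 3, 25) > date(2024, 4, 1))  # False
```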

So, how do we keep the order of the fields and still sort sessions by workout duration? This is easy with the `field` function:

In [49]:
@dataclass(order=True)
class WorkoutSession:
    date: str = field(default=None, compare=False)
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""

        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."

        return base

In [50]:
hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession("27-02-2024")

hiit_monday > hiit_wednesday

True

By setting `compare` to `False` for a field, we exclude it from both equality and ordering comparisons, as evidenced by the above result.
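
Note that `compare=False` affects `==` too, not only the ordering operators. A minimal sketch with a simplified, hypothetical `Session` class:

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Session:
    date: str = field(default=None, compare=False)
    duration_minutes: int = 5


monday = Session("25-02-2024", 15)
wednesday = Session("27-02-2024", 15)

print(monday == wednesday)  # True: date is ignored by == as well
print(monday > wednesday)   # False: the durations are equal
```

If two sessions on different dates should not count as equal, the date has to stay in the comparison, and the field order needs to change instead.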

## Post-init field manipulation

Right now, we have a default session duration of five minutes to account for warm-up exercises. However, this only makes sense if a user starts a session with a warm-up. What if they start a session with other exercises?

In [55]:
new_session = WorkoutSession(exercises=[Exercise("Diamond push-ups", 10, 3)])

new_session.duration_minutes

5

For just a single exercise, the total duration is five minutes, which doesn't make sense. Each session should estimate its duration from the number of sets in each exercise. This means we should make `duration_minutes` dependent on the `exercises` field.

Let's implement it:

In [56]:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = field(default=0, init=False)

    def __post_init__(self):
        set_duration = 3
        for ex in self.exercises:
            self.duration_minutes += ex.sets * set_duration

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""

        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."

        return base

This time, we define `duration_minutes` with `init` set to `False`, which removes it from the parameters of `__init__`, so it always starts at its default of 0. Then, inside the special method `__post_init__`, which the generated `__init__` calls as its last step, we update its value based on the number of sets in each `Exercise`.

Now, when we initialize `WorkoutSession`, `duration_minutes` is increased by three minutes for each set in each exercise.

In [60]:
# Adding an exercise with three sets
hiit_friday = WorkoutSession([Exercise("Diamond push-ups", 10, 3)])

hiit_friday.duration_minutes

9

In general, if you want to define a field that depends on other fields of your data class, you can use the `__post_init__` method.
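
The same pattern works for any derived value. As a small sketch, here is an `Exercise` variant with a made-up `total_reps` field computed from `reps` and `sets`:

```python
from dataclasses import dataclass, field


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    # Derived value: excluded from __init__ and filled in afterwards
    total_reps: int = field(init=False)

    def __post_init__(self):
        self.total_reps = self.reps * self.sets


ex = Exercise("Diamond push-ups", 10, 3)
print(ex.total_reps)  # 30
print(ex)
# Exercise(name='Diamond push-ups', reps=10, sets=3, total_reps=30)
```

Keep in mind that the derived value is computed once, at creation time. Changing `reps` later does not update `total_reps`.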

## Immutability in data classes

Passing `frozen=True` to the decorator makes instances read-only after they are created:

In [None]:
@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: int | float = 0


ex1 = FrozenExercise("Muscle-ups", 5, 3)
ex1.sets

Assigning to an existing field of a frozen instance raises a `FrozenInstanceError`:

In [None]:
ex1.sets = 5

Adding a new attribute to a frozen instance raises a `FrozenInstanceError`:

In [None]:
ex1.new_field = 10

Freezing is shallow, so be careful with mutable fields such as lists:

In [None]:
@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5


session1 = ImmutableWorkoutSession()

Rebinding the `exercises` field of a frozen instance raises a `FrozenInstanceError`:

In [None]:
session1.exercises = exercises_monday

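Changing the list in place still works, though, because `frozen=True` only blocks attribute assignment:

In [None]: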
session1.exercises[1] = FrozenExercise("Totally new exercise", 5, 5)

print(session1)
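
Since frozen instances cannot be changed, the usual way to "modify" one is to build a copy with `dataclasses.replace`. The sketch below also shows that frozen data classes are hashable with the default settings, so they can go into sets or serve as dictionary keys.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: float = 0


ex1 = FrozenExercise("Muscle-ups", 5, 3)

try:
    ex1.sets = 5
except FrozenInstanceError as err:
    print("FrozenInstanceError:", err)

# replace() builds a new instance instead of mutating the old one
ex2 = replace(ex1, sets=5)
print(ex1.sets, ex2.sets)  # 3 5

# Equal frozen instances hash the same, so the set keeps only one
print(len({ex1, FrozenExercise("Muscle-ups", 5, 3)}))  # 1
```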

## Inheritance in data classes

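Data classes support inheritance just like regular classes. A subclass's fields are added after the fields of its parent, so the rule about defaults still applies: if the parent's fields have defaults, every field the subclass adds must have a default too. The subclass below raises a `TypeError` when it is defined: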

In [None]:
@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5


@dataclass(frozen=True)
class CardioWorkoutSession(ImmutableWorkoutSession):
    intensity_level: str  # Not allowed, must have a default
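
One way to make the subclass valid is to give the new field a default. The sketch below uses simplified, hypothetical class names (`BaseSession`, `CardioSession`, `BrokenSession`) and an arbitrary default of `"medium"`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BaseSession:
    duration_minutes: int = 5


@dataclass(frozen=True)
class CardioSession(BaseSession):
    # A default keeps the "defaults last" rule intact
    intensity_level: str = "medium"


session = CardioSession(intensity_level="high")
print(session)
# CardioSession(duration_minutes=5, intensity_level='high')

# Parent fields come first, then the fields added by the subclass
print([f.name for f in fields(CardioSession)])
# ['duration_minutes', 'intensity_level']

# Without a default, the class definition itself fails
try:

    @dataclass(frozen=True)
    class BrokenSession(BaseSession):
        intensity_level: str

except TypeError as err:
    print("TypeError:", err)
```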

## Conclusion and further resources