# The data structure

A data structure is the way we store and organize our data. It matters because it determines how your functions and your machine learning system can access that data. Formatting your data and feeding it to your model correctly can take a single line, or it can turn into a real mess of 10 functions.

## The problem

Let's say we have a super model that recognizes faces in an image. We want to store its output in a structured way so that we can plot our results, store them in a DB, write a script that shows us all the detected faces to calculate accuracy, etc.

## Solutions

In [41]:
from typing import Dict, List, Optional, Tuple

### Dictionary

The first idea you might have is to create a dictionary:

In [24]:
# A dict to store your data
my_image: dict = {
    "name": "image 1",
    "height": 800,
    "width": 300,
    "resolution": 800 * 300,
    "face_detected": [
        {"x0": 10, "x1": 60, "y0": 200, "y1": 250},
        {"x0": 10, "x1": 60, "y0": 300, "y1": 350},
    ],
    "confidence_score": 1.0
}

my_image["resolution"]

240000

It works, but you cannot easily type each element, you need to build a new dictionary by hand each time, and it leaves room for easy mistakes like typos in the keys. It's not a good data structure if you're going to use it often.
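
To make the typo risk concrete, here is a small sketch. The `image_record` dictionary and the misspelled keys are made up for illustration. Writing to a misspelled key silently creates a new entry, and reading a misspelled key only fails once that line runs:

```python
# Hypothetical record, shaped like the dictionary above.
image_record = {"name": "image 1", "height": 800, "width": 300}

# Typo on write: no error, the dict just grows a new, unrelated key.
image_record["heigth"] = 600

# The real field is untouched and the mistake goes unnoticed.
print(image_record["height"])  # 800
print(sorted(image_record))    # ['height', 'heigth', 'name', 'width']

# Typo on read: this fails, but only when the line actually runs.
try:
    image_record["widht"]
except KeyError as error:
    print(f"KeyError: {error}")  # KeyError: 'widht'
```

Nothing in the dictionary itself tells you which keys are expected, so this kind of bug tends to surface far from where it was introduced.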

### Class

That's when you will think of a class! You create a class that contains your data fields, then you just have to instantiate it.
You can even type each property! It looks like the perfect fit for your issue.

In [34]:
class Image:
    """Class that stores the image's data."""
    def __init__(
        self,
        name: str,
        height: int,
        width: int,
        score: int,
        face_detected: List[Dict[str, int]]
    ):

        self.name = name
        self.height = height
        self.width = width
        self.score = score
        # Store the detections on the instance (a bare assignment would be lost)
        self.face_detected = face_detected
        # Derived attribute, computed from two other attributes
        self.resolution = self.height * self.width


faces = [
    {"x0": 10, "x1": 60, "y0": 200, "y1": 250},
    {"x0": 10, "x1": 60, "y0": 300, "y1": 350},
]
# Instantiate an Image
my_image = Image(name="image 1", height=800, width=300, score=10, face_detected=faces)

my_image.resolution

240000

On the one hand, the syntax isn't great: it's heavy and verbose, and you will have to define a lot of classes to store all kinds of data, so you can end up with a file that contains thousands of lines.
On the other hand, you keep control over your data: if an Image is instantiated without a height, it will raise an error.

Classes have another super feature: you can create attributes that are computed from other attributes, like the resolution above.
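
Both points can be sketched in a few lines. `ImageSketch` is a hypothetical, trimmed-down class written only for this illustration, and using `@property` is one possible way to derive an attribute, not the approach used in the cell above:

```python
class ImageSketch:
    """Hypothetical, trimmed-down version of the Image class above."""

    def __init__(self, name: str, height: int, width: int):
        self.name = name
        self.height = height
        self.width = width

    @property
    def resolution(self) -> int:
        # Recomputed on every access, so it follows height and width.
        return self.height * self.width


image = ImageSketch(name="image 1", height=800, width=300)
print(image.resolution)  # 240000

image.height = 400
print(image.resolution)  # 120000

# Leaving out a required argument fails immediately with a TypeError.
try:
    ImageSketch(name="image 2", width=300)
except TypeError as error:
    print(f"TypeError: {error}")
```

The difference from computing the value in `__init__` is that a property never goes stale when `height` or `width` change later.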

### Dataclass

Fortunately, Python has an answer to this heavy syntax, and it's called a dataclass. Dataclass is a decorator that lets you create a class with a short and simple syntax.

In [42]:
from dataclasses import dataclass

@dataclass
class Image:
    """Class that stores the image's data."""
    name: str
    height: int
    width: int
    score: int
    face_detected: List[Dict[str, int]]
    resolution: int


faces = [
    {"x0": 10, "x1": 60, "y0": 200, "y1": 250},
    {"x0": 10, "x1": 60, "y0": 300, "y1": 350},
]
# Instantiate an Image (resolution has to be passed in by hand here)
my_image = Image(name="image 1", height=800, width=300, score=10, resolution=800 * 300, face_detected=faces)

my_image.face_detected[0]["x0"]

10

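A dataclass is a perfect fit when there is no relation between attributes. Here, resolution depends on height and width, so we have to compute it ourselves and pass it in. If we didn't need the resolution attribute, a dataclass would be a better choice than a regular class.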
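
If a derived attribute is needed anyway, the standard library's `field(init=False)` and `__post_init__` can fill it in. The following is a minimal sketch of that option. The class name `ImageWithResolution` is hypothetical, and this is not the structure this notebook ends up choosing:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ImageWithResolution:
    """Hypothetical variant of the dataclass above with a derived field."""
    name: str
    height: int
    width: int
    score: int
    face_detected: List[Dict[str, int]]
    # Excluded from __init__; filled in by __post_init__ below.
    resolution: int = field(init=False)

    def __post_init__(self) -> None:
        # Runs once, right after the generated __init__.
        self.resolution = self.height * self.width


image = ImageWithResolution(
    name="image 1",
    height=800,
    width=300,
    score=10,
    face_detected=[{"x0": 10, "x1": 60, "y0": 200, "y1": 250}],
)
print(image.resolution)  # 240000
```

Note that the value is computed once at construction, so it will not follow later changes to `height` or `width`.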

### Named tuple

A named tuple is clearly a poor choice for the whole image, because all it gives us is a tuple whose fields can be accessed by name, like attributes on a class. But it's good for storing small pieces of data, which is exactly what we need for our face coordinates!

In [43]:
from collections import namedtuple

Coordinate = namedtuple('Coordinate', ['x0', 'x1', 'y0', 'y1'])

faces = [
    Coordinate(10, 60, 200, 250),
    Coordinate(10, 60, 300, 350),
]

faces[0].x0

10
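
A few properties make named tuples convenient for small records like this. The sketch below relies only on standard `collections.namedtuple` behavior. The `area` computation is a made-up example of using the fields:

```python
from collections import namedtuple

Coordinate = namedtuple("Coordinate", ["x0", "x1", "y0", "y1"])
face = Coordinate(x0=10, x1=60, y0=200, y1=250)

# Still a regular tuple: indexing and unpacking both work.
x0, x1, y0, y1 = face
print(face[0], x1)  # 10 60

# Hypothetical derived value: the area of the bounding box.
area = (face.x1 - face.x0) * (face.y1 - face.y0)
print(area)  # 2500

# Fields are read-only; assignment raises AttributeError.
try:
    face.x0 = 0
except AttributeError:
    print("Coordinate is immutable")

# _asdict() gives a plain mapping, handy for JSON or a DB row.
print(dict(face._asdict()))  # {'x0': 10, 'x1': 60, 'y0': 200, 'y1': 250}
```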

### Merge solutions

We can of course merge multiple datatypes to fit our needs!

For example, my choice here is a class combined with a namedtuple:

In [40]:
Coordinate = namedtuple('Coordinate', ['x0', 'x1', 'y0', 'y1'])

class ImageFinal:
    """Class that stores the image's data."""
    def __init__(
        self,
        name: str,
        height: int,
        width: int,
        score: int,
        # One Coordinate per detected face
        face_detected: List[Coordinate]
    ):

        self.name = name
        self.height = height
        self.width = width
        self.score = score
        self.face_detected = face_detected
        self.resolution = self.height * self.width


faces = [
    Coordinate(10, 60, 200, 250),
    Coordinate(10, 60, 300, 350),
]

my_image = ImageFinal(name="image 1", height=800, width=300, score=10, face_detected=faces)

my_image.face_detected[0].x0

10
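
Since one of the goals stated at the start is to store results in a DB, here is a rough sketch of how the combined structure could be flattened into JSON-friendly data. The `image_to_record` helper is hypothetical, and the class is repeated in compact form only so the example runs on its own:

```python
import json
from collections import namedtuple
from typing import List

Coordinate = namedtuple("Coordinate", ["x0", "x1", "y0", "y1"])


class ImageFinal:
    """Same shape as the class above, repeated so the sketch runs alone."""

    def __init__(self, name: str, height: int, width: int, score: int,
                 face_detected: List[Coordinate]):
        self.name = name
        self.height = height
        self.width = width
        self.score = score
        self.face_detected = face_detected
        self.resolution = self.height * self.width


def image_to_record(image: ImageFinal) -> dict:
    """Hypothetical helper: flatten an image into JSON-friendly data."""
    record = dict(vars(image))  # shallow copy of the instance attributes
    record["face_detected"] = [dict(f._asdict()) for f in image.face_detected]
    return record


my_image = ImageFinal("image 1", 800, 300, 10, [Coordinate(10, 60, 200, 250)])
print(json.dumps(image_to_record(my_image)))
```

Each layer keeps its own role: the class owns the image-level fields, and the named tuple converts itself with `_asdict()`.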

## Conclusion

Most of the time, the best solution is to mix multiple data structures to obtain something that is easy to read, use and store. Also, don't forget that the perfect solution doesn't exist! It's all a matter of choice!

A good indicator that you need a different data structure is when you're adding types to your code and you end up with something like this:

In [None]:
data: Dict[str, List[Dict[str, Dict[str, Tuple[int]]]]] = ...

It's still simple, but you could also have several possible types in your second dictionary, so you would have to use the Union type, and so on. If your structure starts to get very confusing, then you probably need a different data structure!
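
One way to untangle such an annotation is to give the inner pieces a name. The sketch below reuses the face coordinates from this notebook. The `FacesByImage` alias and the "before" shape are assumptions made for the example, and `typing.NamedTuple` is used as the typed counterpart of `namedtuple`:

```python
from typing import Dict, List, NamedTuple


class Coordinate(NamedTuple):
    """Bounding box of one detected face."""
    x0: int
    x1: int
    y0: int
    y1: int


# Before (hypothetical): Dict[str, List[Dict[str, int]]]
# After: the inner dict has a name, and the annotation reads on its own.
FacesByImage = Dict[str, List[Coordinate]]

detections: FacesByImage = {
    "image 1": [Coordinate(10, 60, 200, 250), Coordinate(10, 60, 300, 350)],
}

print(detections["image 1"][1].y0)  # 300
```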

## Additional resources

Make sure to have a look at these resources:
* https://docs.python.org/3/library/collections.html
* https://docs.python.org/3/tutorial/datastructures.html
* https://www.edureka.co/blog/data-structures-in-python/