# Labels Profiling
This notebook shows a preliminary exploration of the labels, written in the COCO Keypoint 1.0 format.<br>
In this format each label contains a number for the class (always 0, because we only have foosball tables), 4 numbers for the bounding box (center coordinates, width and height) and the various keypoints represented as *(x, y, v)*, where *x* and *y* are the coordinates and *v* is the visibility (2 means visible, 1 means not visible and 0 means not present).

For a foosball table we have 8 keypoints: the first 4 form the upper rectangle of the play area, and the last 4 form the lower rectangle of the play area.<br>
**Note** that we technically need just the lower rectangle, but because some of its keypoints may be cut off from the image (visibility = 0), we also need the upper rectangle, which is **always** present in the image, to build the perspective and recover the position of the cut-off points.
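
As a minimal sketch of the layout described above, the following parses one label into its class, bounding box and keypoints. It assumes each label is a single whitespace-separated line of text; the `Label` class, `parse_label_line` function and the example values are hypothetical and are not part of the project's `utility` module.

```python
from dataclasses import dataclass

NUM_KEYPOINTS = 8  # 4 upper-rectangle keypoints followed by 4 lower-rectangle keypoints


@dataclass
class Label:
    class_id: int
    bbox: tuple      # (x_center, y_center, width, height)
    keypoints: list  # [(x, y, v), ...] with v in {0, 1, 2}


def parse_label_line(line: str) -> Label:
    """Parse one label line (assumed layout: class, 4 bbox values, 8 x/y/v triplets)."""
    values = line.split()
    expected = 1 + 4 + 3 * NUM_KEYPOINTS
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    class_id = int(values[0])
    bbox = tuple(float(v) for v in values[1:5])
    raw = values[5:]
    keypoints = [
        (float(raw[i]), float(raw[i + 1]), int(float(raw[i + 2])))
        for i in range(0, len(raw), 3)
    ]
    return Label(class_id, bbox, keypoints)


# Made-up example: upper rectangle visible, lower rectangle not present
example = "0 0.5 0.5 0.8 0.6 " + " ".join(["0.1 0.2 2"] * 4 + ["0.0 0.0 0"] * 4)
label = parse_label_line(example)
```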

## Setup
Execute this cell before the other ones.

In [None]:
import pandas as pd
from pathlib import Path
import sys
import ipywidgets as widgets
from IPython.display import display
sys.path.append(str(Path("../src").resolve()))
from utility import *
from config import *


clustering_options = [
    ("Default", DEFAULT_LABELS_DATAFRAME_DIRECTORY),
    ("Added", ADDED_LABELS_DATAFRAME_DIRECTORY),
    ("Augmented", AUGMENTED_LABELS_DATAFRAME_DIRECTORY)
]

# Dropdown widget
dropdown = widgets.Dropdown(
    options=clustering_options,
    description='Clustering:'
)
display(dropdown)

button = widgets.Button(description="Load Data")

# Output widget
out = widgets.Output()
display(out)

def on_load_data_button_clicked(b):
    '''
    Function executed when the 'Load Data' button is clicked.
    '''
    global df
    global df_chosen
    df_path = dropdown.value
    df_chosen = next(key for key, val in clustering_options if val == df_path)
    
    with out: 
        out.clear_output()
        print(f"Dataframe selected: {df_path}")
        df = pd.read_parquet(df_path)
        print("Dataframe loaded.")

button.on_click(on_load_data_button_clicked)
display(button)


## Centers and Directions Analysis
By center we mean the center of the upper rectangle, which we obtain by intersecting its diagonals: the line that connects the first keypoint to the third, and the line that connects the second to the fourth.<br>
By direction we mean the direction from the center to the vanishing point, which we obtain by intersecting the line that connects the first keypoint to the fourth with the line that connects the second to the third.
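
The two intersections described above can be sketched with homogeneous coordinates, where the cross product of two points gives the line through them and the cross product of two lines gives their intersection. This is only an illustration: the function names are hypothetical, and the sign and axis conventions of the `direction` column stored in the dataframe may differ from the ones used here.

```python
import numpy as np


def line_through(p, q):
    """Homogeneous line through the 2D points p and q."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])


def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = np.cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return np.array([x / w, y / w])


def center_and_direction(k1, k2, k3, k4):
    """Center (intersection of the diagonals) and unit direction towards the vanishing point."""
    center = intersect(line_through(k1, k3), line_through(k2, k4))
    vanishing = intersect(line_through(k1, k4), line_through(k2, k3))
    if center is None or vanishing is None:
        # Degenerate case: the two sides are parallel, so there is no finite vanishing point
        return center, None
    d = vanishing - center
    return center, d / np.linalg.norm(d)


# Made-up trapezoid: keypoints 1 and 2 on the far side, 3 and 4 on the near side
center, direction = center_and_direction((0.2, 0.2), (0.8, 0.2), (1.0, 0.8), (0.0, 0.8))
```

When the two sides are exactly parallel in the image the function returns `None` for the direction, so a real implementation would need to decide how to handle that case.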

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x=df['center'].str[0], y=df['center'].str[1])
plt.title(f"Centers ({df_chosen})")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("X center")
plt.ylabel("Y center")
plt.show()


Dividing a square that contains all the centers into a grid of n x n subsquares, we can count how many centers fall in each subsquare. The result is a heatmap that shows which positions are the most common.
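
As an alternative sketch of the same counting idea, `np.histogram2d` can bin the centers in one call. This version assumes the centers are normalized to the unit square and bins over that whole square; the `count_centers` function and the toy data are hypothetical.

```python
import numpy as np


def count_centers(centers, n=10, value_range=((0.0, 1.0), (0.0, 1.0))):
    """Count how many centers fall in each cell of an n x n grid."""
    centers = np.asarray(centers, dtype=float)
    counts, x_edges, y_edges = np.histogram2d(
        centers[:, 0], centers[:, 1], bins=n, range=value_range
    )
    # histogram2d indexes the result as [x_bin, y_bin];
    # transpose so that rows follow y and columns follow x
    return counts.T.astype(int), x_edges, y_edges


# Toy data: two centers near the origin and one on the right side
toy_centers = [(0.05, 0.05), (0.06, 0.04), (0.95, 0.55)]
counts, x_edges, y_edges = count_centers(toy_centers, n=10)
```

Note that here row 0 holds the smallest y values, so the array would have to be flipped vertically before plotting if the largest y should appear at the top of the heatmap.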

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

n = 10  # number of subdivision

centers = np.stack(df["center"].values)

# we build a square, centered on the average point, that contains all the centers
average_point = centers.mean(axis=0)
max_distance = np.max([np.linalg.norm(average_point - center) for center in centers])

# we divide the square in a grid of n x n squares
# and we count how many points fall in every square

# squares creation
x_values = np.linspace(average_point[0] - max_distance, average_point[0] + max_distance, n + 1)
y_values = np.linspace(average_point[1] - max_distance, average_point[1] + max_distance, n + 1)
y_values = y_values[::-1]  # invert array

# counts[j][i] will hold how many points fall in the square at row j and column i
# rows go from top to bottom and columns from left to right (the usual matrix order)
counts = [[0 for _ in range(n)] for _ in range(n)]
squares_length = (max_distance * 2.0) / n

for i in range(n):
    for j in range(n):
        for center in centers:
            if x_values[i] < center[0] < x_values[i + 1] and y_values[j + 1] < center[1] < y_values[j]:
                # the point is in the square (i, j)
                counts[j][i] += 1

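# create the tick labels, using the center of each square
xticklabels = [f'{x + (squares_length / 2.0):.2f}' for x in x_values[0:n]]
# y_values is in descending order, so the center of each row lies below its upper edge
yticklabels = [f'{y - (squares_length / 2.0):.2f}' for y in y_values[0:n]]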

plt.figure(figsize=(8, 6))
sns.heatmap(np.array(counts), annot=True, fmt="d", xticklabels=xticklabels, yticklabels=yticklabels, cmap="viridis")
plt.title(f"Centers Heatmap for n = {n} ({df_chosen})")
plt.xlabel("X center")
plt.ylabel("Y center")
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x=df['direction'].str[0], y=df['direction'].str[1])
plt.title(f"Normalized Directions ({df_chosen})")
OFFSET = 0.1
plt.xlim(-(1.0 + OFFSET), 1.0 + OFFSET)
plt.ylim(-(0.0 + OFFSET), 1.0 + OFFSET)
plt.xlabel("X direction")
plt.ylabel("Y direction")
plt.show()

If we convert the direction *(x, y)* to the angle *theta*, we can see which *theta* is the most common.<br>
*theta* is measured anticlockwise from the positive x axis, so it starts pointing to the right and ends pointing to the left.<br>
(For reference: *(1, 0)* corresponds to 0 degrees, *(0, 1)* corresponds to 90 degrees and *(-1, 0)* corresponds to 180 degrees)
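
A small sketch of the conversion, using `math.atan2` on a few reference directions. The `direction_to_theta` helper is hypothetical and only illustrates the angle convention.

```python
import math


def direction_to_theta(x, y):
    """Angle in degrees, measured anticlockwise from the positive x axis."""
    return math.degrees(math.atan2(y, x))


# Reference directions: right, diagonal, up, left
reference = [(1, 0), (1, 1), (0, 1), (-1, 0)]
thetas = [direction_to_theta(x, y) for x, y in reference]
```

For these four directions the angles are 0, 45, 90 and 180 degrees respectively. Directions with a negative *y* would give negative angles.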

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import math

# let's convert the direction from (x, y) to theta
# where theta is the angle going from 0, that is (1, 0) to 180 that is (-1, 0)

df_aux = df.copy()
theta = [math.degrees(math.atan2(y,x)) for x, y in df_aux["direction"]]
df_aux = df_aux.assign(theta = theta)

sns.displot(df_aux["theta"])
plt.title(f"Theta ({df_chosen})")
plt.xlabel("Theta (degrees)")
plt.ylabel("Number of images")
plt.show()



## Visibilities Analysis
In the COCO Keypoint 1.0 format, every keypoint has a visibility that is 0 if the keypoint is not present in the image, 1 if it is present but not visible, and 2 if it is present and visible.

**Note** that the first 4 keypoints never have a visibility of 0, because this is a requirement of the project: as mentioned earlier, we need those 4 keypoints to build the perspective and recover the other 4 keypoints when they are cut off.
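
The requirement above can be checked directly on the data. The sketch below uses a made-up dataframe with a `visibilities` column holding 8 values per image, which is the shape the plotting cell in this section appears to rely on; the helper name is hypothetical.

```python
import pandas as pd

# Toy data: two images, 8 visibilities each (made up for illustration)
toy = pd.DataFrame({
    "visibilities": [
        [2, 2, 2, 1, 2, 0, 0, 2],
        [2, 1, 2, 2, 0, 0, 2, 2],
    ]
})


def rows_missing_upper_keypoints(frame):
    """Index labels of rows where any of the first 4 keypoints has visibility 0."""
    mask = frame["visibilities"].apply(lambda vis: any(v == 0 for v in vis[:4]))
    return frame.index[mask.to_numpy(dtype=bool)].tolist()


# Count how many images have each visibility value for each keypoint
long = toy.explode("visibilities")
long["keypoint_id"] = long.groupby(level=0).cumcount()  # 0...7 within each image
long["visibility"] = long["visibilities"].astype(int)
counts = long.groupby(["keypoint_id", "visibility"]).size()
```

An empty list from `rows_missing_upper_keypoints` means the requirement holds for every row of the dataframe it is given.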

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_aux = df.copy()
df_aux = df_aux.explode('visibilities')  # create a row for each keypoint
df_aux['keypoint_id'] = df_aux.groupby(level=0).cumcount()  # 0...7
df_aux.rename(columns={'visibilities': 'visibility'}, inplace=True)

KEYPOINT_NAMES = ["keypoint 0","keypoint 1","keypoint 2","keypoint 3","keypoint 4","keypoint 5","keypoint 6","keypoint 7"]
df_aux['keypoint_name'] = df_aux['keypoint_id'].map(dict(enumerate(KEYPOINT_NAMES)))

sns.countplot(data=df_aux, x="keypoint_name", hue="visibility", palette=["blue", "red", "green"])
plt.xticks(rotation=30)
plt.title(f"Visibilities ({df_chosen})")
plt.xlabel("Keypoints")
plt.ylabel("Number of images")
plt.show()



## Dimension Analysis
By dimension we mean the percentage of the image area that is covered by the bounding box.
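
As a rough sketch, if the bounding box width and height are normalized to the image size (an assumption, as is the helper name), the covered percentage is simply their product scaled to 100. The `dimension` column in the dataframe may have been computed differently.

```python
def bbox_coverage_percent(width, height):
    """Percentage of the image area covered by a box with normalized width and height."""
    if not (0.0 <= width <= 1.0 and 0.0 <= height <= 1.0):
        raise ValueError("width and height are assumed to be normalized to [0, 1]")
    return 100.0 * width * height


# A box spanning 80% of the image width and 50% of its height
coverage = bbox_coverage_percent(0.8, 0.5)
```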

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.displot(df["dimension"])
plt.title(f"Bounding Box Dimension ({df_chosen})")
plt.xlabel("Dimension (%)")
plt.ylabel("Number of images")
plt.show()

## Highest Keypoint Analysis
For the annotations we followed this rule: the highest keypoint in the image is either the first or the second keypoint.<br>
It is the first keypoint if the second one, obtained by following the clockwise order, lies on the short side of the foosball table, and the second keypoint otherwise.

We needed this rule because, given an image of a foosball table, the vertices of the upper and lower rectangles have no natural order. Following this rule may have introduced an imbalance, so we should look at how often the first and the second keypoint are the highest.
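
A possible way to find the highest keypoint of the upper rectangle and tally the result is sketched below. It assumes image coordinates in which *y* grows downwards, so the highest keypoint is the one with the smallest *y*; the function name and the toy annotations are hypothetical, and the `highest_keypoint` column in the dataframe may have been computed differently.

```python
from collections import Counter


def highest_keypoint_index(keypoints):
    """Index of the highest upper-rectangle keypoint (smallest y, assuming y grows downwards)."""
    upper = keypoints[:4]
    return min(range(len(upper)), key=lambda i: upper[i][1])


# Toy annotations: each entry lists the 4 upper-rectangle keypoints as (x, y, v)
annotations = [
    [(0.3, 0.1, 2), (0.7, 0.2, 2), (0.8, 0.6, 2), (0.2, 0.5, 2)],
    [(0.2, 0.3, 2), (0.6, 0.1, 2), (0.9, 0.5, 2), (0.4, 0.7, 2)],
    [(0.4, 0.1, 2), (0.8, 0.3, 2), (0.7, 0.7, 2), (0.1, 0.4, 2)],
]
tally = Counter(highest_keypoint_index(kps) for kps in annotations)
```

If the annotation rule was followed, only the indices 0 and 1 should appear in the tally, and comparing their counts shows how balanced the two cases are.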

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.countplot(x="highest_keypoint", data=df)

for p in ax.patches:
    ax.text(
        x=p.get_x() + p.get_width() / 2,
        y=p.get_height() + 5,
        s=int(p.get_height()),
        ha='center'
    )

plt.title(f"Highest keypoint ({df_chosen})")
plt.xlabel("Keypoint")
plt.ylabel("Number of images")
plt.show()