# Analysis of the Datapoints Dataframe

## Imports and reading


In [13]:
import pandas as pd
from pathlib import Path
from utils import print_pretty_df

# Quick ANSI color code shortcuts
r = "\033[31m"
y = "\033[33m"
g = "\033[32m"
b = "\033[34m"
e = "\033[0m"

pickleName = "all_datapoints_2024-12-24_11-59-55.pkl"
datapointsDfPath = Path("..") / ".." / "data" / "Review_ML-RS-FPGA" / "Dataframes" / pickleName
datapointsDf = pd.read_pickle(datapointsDfPath)

In [14]:
print_pretty_df(datapointsDf)

+----+-------------------------------------------------------+-------------------------------+------------------+----------------+----------------+---------------------------------------------------+----------------------+-----------------------------------------+------------------------------------------------+----------------------+------------------+-----------+-------------+-----------+--------------+-------------------+------------+-------------+-------------------------------------+----------------+-----------------------------------------+-------------------------------------------------------------+---------------------------------+------------+----------+----------+---------------------+
|    |                   BBT Citation Key                    |             Model             | Equivalent model |    Backbone    |    Modality    |                      Dataset                      |         Task         |               Application               |                     Board    

## Quick statistics / Overview


### Implementation means

Typical implementation tags look like `'RTL design (VHDL)'` or `'Vitis AI (v1.4)'`.
When grouping them by "family", I discard the language or version information given in parentheses.
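
To make explicit what the family grouping throws away, a tag can also be split into its family and its parenthesised detail. This is only a sketch on a hand-written series: the `tags` values and the column names `family` and `detail` are illustrative assumptions, not part of the dataframe.

```python
import pandas as pd

# Hand-written example tags mimicking the "Implementation" column (assumed data).
tags = pd.Series(["RTL design (VHDL)", "Vitis AI (v1.4)", "FINN", "N/A"])

# family: text before the first "(" ; detail: text inside the parentheses, if any.
pattern = r"^(?P<family>[^(]+?)\s*(?:\((?P<detail>[^)]*)\))?$"
parts = tags.str.extract(pattern)

print(parts)
```

Tags without parentheses (such as `FINN`) keep their full text as the family and get a missing value in `detail`.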


In [5]:
# --- Initial raw count ---
implementationCounts = datapointsDf["Implementation"].value_counts()
print(implementationCounts)
print()

# --- Group by "family", i.e., discard the version or language information between parentheses ---
def determine_impl_group(index: str) -> str:
    return index.split("(")[0].strip()
implementationGrouped = implementationCounts.groupby(determine_impl_group).sum()
print(implementationGrouped.sort_values(ascending=False))

Implementation
RTL design    28
HLS           12
Vitis AI      12
N/A            8
FINN           4
Name: count, dtype: int64

Implementation
RTL design (VHDL)       11
N/A                      8
RTL design (N/A)         7
RTL design (Verilog)     7
Vitis AI (N/A)           6
HLS (Vitis)              6
FINN                     4
Vitis AI (DNNDK)         4
HLS (N/A)                3
RTL design (XSG)         3
HLS (MATLAB)             1
HLS (VGT)                1
Vitis AI (v2.5)          1
Vitis AI (v1.4)          1
HLS (Vivado)             1
Name: count, dtype: int64


### FPGA boards

A typical board tag looks like this: `'Zynq 7000 (Z7020) {Arty Z7}'`, i.e., `'<family> (<model>) {<evaluation board>}'`.
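
The tag format can also be parsed in one pass with a regular expression, which fails loudly on a malformed tag instead of silently returning a wrong slice. The names `BoardTag`, `BOARD_TAG_RE` and `parse_board_tag` below are hypothetical helpers, and mapping empty braces to `"N/A"` is an assumption about how missing evaluation boards are encoded.

```python
import re
from typing import NamedTuple


class BoardTag(NamedTuple):
    family: str
    model: str
    eval_board: str


# '<family> (<model>) {<evaluation board>}' ; the braces are assumed to be allowed to be empty.
BOARD_TAG_RE = re.compile(
    r"^\s*(?P<family>[^({]+?)\s*\((?P<model>[^)]*)\)\s*\{(?P<eval_board>[^}]*)\}\s*$"
)


def parse_board_tag(tag: str) -> BoardTag:
    """Split a board tag into family, model and evaluation board."""
    match = BOARD_TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"Unexpected board tag format: {tag!r}")
    return BoardTag(
        family=match["family"].strip(),
        model=match["model"].strip(),
        eval_board=match["eval_board"].strip() or "N/A",
    )


print(parse_board_tag("Zynq 7000 (Z7020) {Arty Z7}"))
```

Such a function could be passed to `groupby` in the same way as the three helpers in the next cell, by selecting one field of the returned tuple.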

The following cell groups the boards by each of these three components.


In [7]:
# --- Initial raw count (of the full tags) ---
boardCounts = datapointsDf["Board"].value_counts()
# print(boardCounts)
# print()

# --- Group by "family", i.e., discard the model between parentheses and the evaluation board between curly braces ---
def determine_board_family_group(index: str) -> str:
    return index.split("(")[0].strip()
boardFamilyGrouped = boardCounts.groupby(determine_board_family_group).sum()
print(boardFamilyGrouped.sort_values(ascending=False))
print()

# --- Group by "board specific model", i.e., discard the family and the evaluation board name between curly braces ---
def determine_board_model_group(index: str) -> str:
    return index.split("(")[1].split("{")[0].strip()[:-1]
boardModelGrouped = boardCounts.groupby(determine_board_model_group).sum()
print(boardModelGrouped.sort_values(ascending=False))
print()

# -- Group by "evaluation board/kit" (the name in between curly braces) ---
def determine_board_eval_group(index: str) -> str:
    boardKit: str = index.split("{")[1][:-1].strip()
    return boardKit if boardKit else "N/A   "
boardKitGrouped = boardCounts.groupby(determine_board_eval_group).sum()
print(boardKitGrouped.sort_values(ascending=False))

Board
Zynq US+      25
Zynq 7000     14
Virtex-7      10
Artix-7        5
Kintex-7       3
Kintex US      2
Virtex-6       2
Alveo          1
Cyclone V      1
Spartan-3A     1
Name: count, dtype: int64

Board
Z7020         10
ZU7EV         10
VX690T        10
ZU3EG          8
XC7K325T       3
ZU15EG         3
ZU9EG          3
XC7A35T        3
XCKU040        2
Z7045          2
XC7A200T       2
VLX240T        2
U280           1
Z7035          1
Z7100          1
ZU19EG         1
XC3SD1800A     1
5CSXC6         1
Name: count, dtype: int64

Board
N/A                     15
VC709                    7
ZCU104                   6
UltraZed-EG              4
OVC3                     3
PYNQ-Z1                  3
Arty-35T                 3
ZCU102                   3
Alinx AXU15EG            3
ZC706                    2
Z-turn                   2
AC701                    2
KCU105                   2
KC705                    2
Quad-FPGA                1
Ultra96                  1
Kria KV260          

### Model data

Model information is already split into 3 Series: `'Model'`, `'Equivalent model'` and `'Backbone'`.
I think grouping by `'Model'`, i.e., what the model is called in the article, makes no sense. However, even though `'Equivalent model'` is a subjective tag that I assigned myself, it conveys interesting information.
The same goes for `'Backbone'`.
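
Untagged rows are stored as empty strings in these columns, so they show up under a blank label in `value_counts()`. A minimal sketch of one way to make them explicit before counting, using a hand-written series (the values are illustrative, not taken from the dataframe):

```python
import pandas as pd

# Toy stand-in for the "Backbone" column; empty strings mean "no backbone tagged".
backbones = pd.Series(["ResNet-18", "", "LeNet-5", "LeNet-5", ""])

# Replace the empty tag by an explicit label, then count.
counts = backbones.replace("", "N/A").value_counts()
print(counts)
```

Replacing the values before counting, rather than renaming the index afterwards, keeps the same approach usable for any of the three columns.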


In [15]:
# ----- Initial dataframe -----
print_pretty_df(datapointsDf[["Model", "Equivalent model", "Backbone"]], max_rows=10)

# ----- Group by "Equivalent model" -----
equivalentModelCounts = datapointsDf["Equivalent model"].value_counts()
print(equivalentModelCounts)
print()

# ----- Group by "Backbone" -----
backboneCounts = (datapointsDf["Backbone"].value_counts().rename(lambda x: "N/A" if x == "" else x))
print(backboneCounts)

+---+---------------------------+------------------+-----------+
|   |           Model           | Equivalent model | Backbone  |
+---+---------------------------+------------------+-----------+
| 0 |     ResNet-18+YOLOv2      |      YOLOv2      | ResNet-18 |
| 1 |         ResNet-50         |       CNN        | ResNet-50 |
| 2 |          SICNet           |       CNN        |           |
| 3 |       LeNet-5 [32]        |     LeNet-5      |  LeNet-5  |
| 4 |        LeNet-5 [8]        |     LeNet-5      |  LeNet-5  |
| 5 | Weightless Neural Systems |     Diverse      |           |
| 6 |          LeNet-5          |                  |  LeNet-5  |
| 7 |            BNN            |       CNN        |           |
| 8 |           A2NN            |       CNN        |   VGG11   |
| 9 |       Fuzzy ARTMAP        |     Diverse      |           |
+---+---------------------------+------------------+-----------+
Equivalent model
CNN            19
Diverse        11
               10
YOLOv2          6
Y

### Datasets, RS Applications and ML formulations

In our reporting method, each experiment is associated with a single dataset (i.e., we selected the most relevant/common dataset when the authors reported results on several).
Each dataset is used (or even built) for a specific Remote Sensing application, which is formulated as a Machine Learning problem or task.

The `Application` tag is not an exact science and was kept in order to give some context.
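
The `Dataset` tags in this dataframe appear to embed the task in curly braces (for example `'MSTAR {Classification}'`), which makes a consistency check against the `Task` column possible. The sketch below assumes that convention; `task_from_dataset_tag` is a hypothetical helper and the rows are hand-written, with the last one deliberately inconsistent.

```python
def task_from_dataset_tag(tag: str) -> str:
    """Return the text between the last '{' and the last '}' of a dataset tag."""
    start, end = tag.rfind("{"), tag.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No '{{task}}' suffix in dataset tag: {tag!r}")
    return tag[start + 1 : end].strip()


# (Dataset tag, Task) pairs; illustrative values only.
rows = [
    ("DOTAv1.0 {Object Detection}", "Object detection"),
    ("MSTAR {Classification}", "Classification"),
    ("Potsdam {Segmentation}", "Classification"),  # deliberately inconsistent
]

# Compare case-insensitively, since the two columns may differ in capitalisation.
mismatches = [
    (dataset, task)
    for dataset, task in rows
    if task_from_dataset_tag(dataset).casefold() != task.casefold()
]
print(mismatches)
```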


In [17]:
# ----- Print all unique ML problem formulations -----
mlTaskList = datapointsDf["Task"].unique()
print(mlTaskList)

# ----- Group by "Dataset" -----
datasetCount = datapointsDf["Dataset"].value_counts()
print(datasetCount)

# ----- Group by "Application" -----
applicationCount = datapointsDf["Application"].value_counts()
print(applicationCount)


['Object detection' 'Classification' 'Pixel classification' 'Segmentation'
 'Regression']
Dataset
University of Pavia {Pixel classification}           6
MSTAR {Classification}                               5
DOTAv1.0 {Object Detection}                          4
NWPU-RESISC45 {Classification}                       4
Potsdam {Segmentation}                               4
SSDD {Object Detection}                              3
PennSyn2Real {Object Detection}                      3
UAV RGB (cust.) {Object Detection}                   3
AVIRIS-NG {Pixel classification}                     3
UAV RGB (cust.) {Classification}                     2
MASATI {Classification}                              2
UAV RGB (cust.) {Pixel classification}               2
ALOS-2 (cust.) {Classification}                      2
RGB (cust.) {Classification}                         2
L8 Biome {Classification}                            2
DIOR {Object Detection}                              2
UAV RGB+MMW (cust.) {O

## Analyze reporting: missing metrics


In [31]:
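def is_undefined(item) -> bool:
    """Return True if a cell holds no usable value (empty, 'N/A...' or '???...')."""
    if isinstance(item, str):
        return item == "" or item.startswith(("N/A", "???"))
    elif isinstance(item, list):
        # A list is undefined only if all its elements are (an empty list counts as undefined)
        return all(is_undefined(subitem) for subitem in item)
    else:
        raise ValueError(f"Unsupported type: {type(item)}")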
    
# ---  Quick check if any of the main information is missing ---
for index, article in datapointsDf.iterrows():
    if is_undefined(article["Model"]):
        print(f"Item N°{b}{index}{e} has no Model")
    if is_undefined(article["Dataset"]):
        print(f"Item N°{b}{index}{e} has no Dataset")
    if is_undefined(article["Board"]):
        print(f"Item N°{b}{index}{e} has no Board")
    if is_undefined(article["Task"]):
        print(f"Item N°{b}{index}{e} has no Task")
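
`is_undefined` raises on anything that is not a string or a list, which would include `None` or `NaN` cells if the pickle ever contained them. The variant below is a sketch of a more tolerant check. Treating `None` and `NaN` as undefined, and any other value as defined, is an assumption, and the name `is_undefined_lenient` is hypothetical.

```python
import math


def is_undefined_lenient(item) -> bool:
    """Like is_undefined, but tolerant of None, NaN and other scalar types."""
    if item is None:
        return True
    if isinstance(item, float) and math.isnan(item):
        return True
    if isinstance(item, str):
        return item == "" or item.startswith(("N/A", "???"))
    if isinstance(item, (list, tuple)):
        # Undefined only if every element is undefined.
        return all(is_undefined_lenient(subitem) for subitem in item)
    # Assumption: any other value (e.g. a number) counts as defined.
    return False


print(is_undefined_lenient("N/A (not reported)"))  # True
print(is_undefined_lenient(float("nan")))          # True
print(is_undefined_lenient(["N/A", "12 ms"]))      # False
print(is_undefined_lenient(42))                    # False
```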

### Check per article: Which articles miss the most metrics

#### First for the "performance" metrics


In [None]:
performanceMetrics: list[str] = [
    "Latency",
    "Task score",
    "Footprint",
    "Throughput",
    "Frequency",
    "Complexity",
    "Power consumption",
]
# --- Compute the number of missing (performance) metrics for each model ---
# Add a column to the dataframe with the number of missing metrics
datapointsDf["Missing perf metrics"] = datapointsDf.apply(
    lambda article: sum(
        [
            is_undefined(article[metric])
            for metric in performanceMetrics
        ]
    ),
    axis=1,
)
# Print only the 'BBT Citation Key' and the number of missing metrics, and only for rows with more than 3 missing metrics
print(
    datapointsDf[datapointsDf["Missing perf metrics"] >= 4][
        ["BBT Citation Key", "Missing perf metrics"]
    ]
)

                               BBT Citation Key  Missing metrics
0            gargAircraftDetectionSatellite2024                4
1      upadhyayDesignImplementationCNNbased2024                6
2      upadhyayDesignImplementationCNNbased2024                5
5            torresCombinedWeightlessNeural2020                6
6   chenHardwareImplementationConvolutional2020                4
7               myojinDetectingUncertainBNN2020                6
11          hashimotoShipClassificationSAR2019a                5
15              fraczekEmbeddedVisionSystem2018                4
16    matos-carvalhoStaticDynamicAlgorithms2019                6
19                boyleHighlevelFPGADesign2023a                5
23         chellaswamyFPGAbasedRemoteTarget2024                5
24         nerisFPGABasedImplementationCNN2022a                4
25         nerisFPGABasedImplementationCNN2022a                4
29      pitsisEfficientConvolutionalNeural2019a                4
38             shibiOnboa

#### Then for the "FPGA" metrics


In [40]:
FPGAMetrics: list[str] = [
    "Design",
    "Memory",
    "Precision",
    "Optimizations",
    "FPGA Util",
]
DPUMetrics: list[str] = [
    "Precision",
    "DPU Config",
    "DPU Core",
    "DPU Optimizations",
    "DPU Util",
]

# --- Compute the number of missing (fpga) metrics for each model ---
datapointsDf["Missing fpga metrics"] = datapointsDf.apply(
    lambda article: sum(
        [
            is_undefined(article[metric])
            for metric in FPGAMetrics
        ]
    ),
    axis=1,
)
datapointsDf["Missing dpu metrics"] = datapointsDf.apply(
    lambda article: sum(
        [
            is_undefined(article[metric])
            for metric in DPUMetrics
        ]
    ),
    axis=1,
)
# Print the 'BBT Citation Key' with the number of missing FPGA and DPU metrics, for every row
print_pretty_df(
    datapointsDf[
        ["BBT Citation Key", "Missing fpga metrics", "Missing dpu metrics"]
    ]
)

+----+-------------------------------------------------------+----------------------+---------------------+
|    |                   BBT Citation Key                    | Missing fpga metrics | Missing dpu metrics |
+----+-------------------------------------------------------+----------------------+---------------------+
| 0  |          gargAircraftDetectionSatellite2024           |          2           |          4          |
| 1  |       upadhyayDesignImplementationCNNbased2024        |          4           |          3          |
| 2  |       upadhyayDesignImplementationCNNbased2024        |          4           |          3          |
| 3  |       weiFPGABasedHybridTypeImplementation2019        |          0           |          4          |
| 4  |       weiFPGABasedHybridTypeImplementation2019        |          0           |          4          |
| 5  |          torresCombinedWeightlessNeural2020           |          3           |          4          |
| 6  |      chenHardwareImpl

### Check per metric: Which metrics are the least reported


In [None]:
allMetrics = performanceMetrics + FPGAMetrics + DPUMetrics
for metric in allMetrics:
    missing_metrics = 0
    for index, article in datapointsDf.iterrows():
        if is_undefined(article[metric]):
            missing_metrics += 1

    print(f"{r}{missing_metrics}{e} models miss the {b}{metric}{e} metric.")

[31m8[0m models miss the [34mLatency[0m metric.
[31m7[0m models miss the [34mTask score[0m metric.
[31m34[0m models miss the [34mFootprint[0m metric.
[31m40[0m models miss the [34mThroughput[0m metric.
[31m12[0m models miss the [34mFrequency[0m metric.
[31m34[0m models miss the [34mComplexity[0m metric.
[31m19[0m models miss the [34mPower consumption[0m metric.
[31m17[0m models miss the [34mDesign[0m metric.
[31m17[0m models miss the [34mMemory[0m metric.
[31m2[0m models miss the [34mPrecision[0m metric.
[31m27[0m models miss the [34mOptimizations[0m metric.
[31m17[0m models miss the [34mFPGA Util[0m metric.
[31m2[0m models miss the [34mPrecision[0m metric.
[31m54[0m models miss the [34mDPU Config[0m metric.
[31m61[0m models miss the [34mDPU Core[0m metric.
[31m59[0m models miss the [34mDPU Optimizations[0m metric.
[31m60[0m models miss the [34mDPU Util[0m metric.
