# (U) Exploratory analysis of the BigBasin dataset

(U) This notebook gives a simple exposition of the structure of the fully unpacked dataset, along with some tricks and pitfalls for its usage.

In [1]:
from pathlib import Path
import re

import fsspec
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm_notebook

(U) Put all relevant secrets for accessing the dataset in an optional `.env` file in the same directory as this notebook. For instance, to access the data through an Azure Blob Container with a SAS token, set up the `.env` file as follows:

```
export AZURE_STORAGE_ACCOUNT_NAME="timcarchivestorage"
export AZURE_STORAGE_SAS_TOKEN="token string, within the quotes"
```

(U) For local file system access, no `.env` file is needed.

In [2]:
%load_ext dotenv
%dotenv
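
Once the extension has loaded the file, it can help to confirm which variables the notebook process actually sees. The following is a minimal sketch: the helper name `missing_secrets` is hypothetical, and the two variable names are simply those of the Azure example above, so they would need adapting for other backends.

```python
import os

# Variable names taken from the Azure example above; adapt for other backends.
REQUIRED_FOR_ABFS = ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_SAS_TOKEN")


def missing_secrets(required=REQUIRED_FOR_ABFS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

An empty list from `missing_secrets()` means both variables are set; for local file system access the check can be skipped altogether.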

(U) We use [**fsspec**](https://filesystem-spec.readthedocs.io/en/latest/) to abstract away the low-level issues with accessing the dataset. Change the following definition of `PROTOCOL` to the protocol string appropriate for your storage technology (e.g. `file` for the local file system, `s3` for an AWS S3 bucket, `abfs` for an Azure Blob Container).

In [3]:
PROTOCOL = "abfs"
fs = fsspec.filesystem(PROTOCOL)

(U) The ROOT variable points to the directory that contains the top level of the unpacked Parquet hierarchy. Note that this value may depend on the access protocol defined above.

In [4]:
ROOT = "bigbasin/unpacked"

(U) The following should show a list of cross-compilation architecture tandems if both `PROTOCOL` and `ROOT` are set correctly.

In [5]:
fs.ls(ROOT)

['bigbasin/unpacked/arch=linux-ubuntu18.04-a64fx-AARCH64',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-aarch64-AARCH64',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-arm-ARM',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-arm-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-broadwell-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-bulldozer-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-core2-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-excavator-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-graviton-AARCH64',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-graviton2-AARCH64',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-haswell-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-i686-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-ivybridge-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-k10-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-mic_knl-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-nehalem-x86',
 'bigbasin/unpacked/arch=linux-ubuntu18.04-nocona-x86',
 'bigbasin/unpacked/arch

## Partitioning analysis

How many Parquet files (*leaves*) does this hierarchy hold? We could leverage `pyarrow.dataset.dataset` to hold this information, but as reported below, it's been lacking in robustness for certain purposes involving this dataset.

In [6]:
%%time
parquets = [p for p in fs.find(ROOT, withdirs=False) if p.endswith(".parquet")]
len(parquets)

CPU times: user 1min 6s, sys: 902 ms, total: 1min 7s
Wall time: 1min 38s


160931

That's a lot of partitions! Many systems that build up distributed computations against partitioned data would have trouble processing all of it in one gulp.

Let's take a closer look at how the hierarchy is _hived_:

In [7]:
hive = pd.DataFrame([Path(p).relative_to(ROOT).parts for p in parquets])
with pd.option_context("display.max_colwidth", None):
    display(hive)

Unnamed: 0,0,1,2,3,4
0,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=aeskeyfind,version=master,aeskeyfind.4c96ada082b42828d4d9ede6487f370ecbecafec95a5315b8e73c7e8cd30ac26.parquet
1,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=earlyoom,version=1.6,earlyoom.849fe438fc62a5981db19d3e0f57338a6ececd407cf19bd42c15768585d1895b.parquet
2,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,chkfont.f220efe72fba81c74e7b67edeb7fb3cdbb0d7c94376c38d2967f63a17181d2bf.parquet
3,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,figlet.5761c63e7cf0cc0086c91579005198c53bab23020fe27d964e7207e5383670ad.parquet
4,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=flash,version=1.2.11,flash.f2e8ed41b3bd6732c41a81c6400003db5e3071f71a759ebf253122b31cf1a7c6.parquet
...,...,...,...,...,...
160926,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.3437286233b61ecaf211d7ace3f66b64b08adea4afe53e9d35c47f2dda653f94.parquet
160927,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.387b0930f805c5dc43b62e45b34ca2e1f1acae2ee340da99f5d7a2e8edaff537.parquet
160928,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.52462a04e601dca0343ed6fcb6562374dcd93cc40d09f81315534fd199040e4d.parquet
160929,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.c98d28d693a1327dcfa675017a7579b8649c5dfee543cecc09540d7e4fa3b7d9.parquet


We see 4 explicit partitioning axes:

1. Cross-compilation architecture tandem: each string is structured as `linux-DISTRO-HOST-TARGET`.
    1. `DISTRO` is the particular GNU/Linux operating system that is home to both the compiler and the target binary.
    1. `HOST` is the CPU architecture of the system that builds the binary.
    1. `TARGET` is the CPU architecture for which the binary is built.
1. Compiler and its version.
1. Name of the package.
1. Nominal package version.
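
The four axes can also be recovered from any leaf path without pandas. The sketch below is illustrative only: the helper name `parse_leaf` is hypothetical, and splitting the tandem on hyphens assumes that none of `DISTRO`, `HOST` and `TARGET` contains a hyphen itself, which holds for the tandems listed above but is not guaranteed in general.

```python
from pathlib import PurePosixPath


def parse_leaf(path, root):
    """Split a leaf path into its four partition values and its file name."""
    *dirs, filename = PurePosixPath(path).relative_to(root).parts
    # Each directory level is named KEY=VALUE.
    keys = dict(part.split("=", 1) for part in dirs)
    # The tandem reads linux-DISTRO-HOST-TARGET (assumes hyphen-free fields).
    _, distro, host, target = keys["arch"].split("-")
    return {**keys, "distro": distro, "host": host, "target": target, "file": filename}


example = parse_leaf(
    "bigbasin/unpacked/arch=linux-ubuntu18.04-a64fx-AARCH64/compiler=gcc-7.5.0/"
    "package=figlet/version=2.2.4/"
    "chkfont.f220efe72fba81c74e7b67edeb7fb3cdbb0d7c94376c38d2967f63a17181d2bf.parquet",
    root="bigbasin/unpacked",
)
print(example["distro"], example["host"], example["target"], example["package"])
```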

The name of the Parquet file itself seems to contain further information. In addition, some of the partitioning quadruplets admit more than a single file. This raises the question of whether more than one binary can belong to a given package. Beyond that, is the data associated with a partitioning quadruplet divided only for file size control, or are there more semantics behind this split?

In [8]:
hive.columns = ["arch", "compiler", "package", "version", "file"]
hive_agg_count = hive.groupby(["arch", "compiler", "package", "version"]).agg({"file": "count"})
hive_agg_count

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,file
arch,compiler,package,version,Unnamed: 4_level_1
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=aeskeyfind,version=master,1
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=earlyoom,version=1.6,1
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,2
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=flash,version=1.2.11,1
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=iniparser,version=4.1,1
...,...,...,...,...
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-10.3.0,package=zlib,version=1.2.8,24
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,24
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.8,24
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.11,24


Package `figlet` immediately stands out as relevant to the first question.

In [9]:
with pd.option_context("display.max_colwidth", None):
    display(
        hive.loc[
            (hive.arch == "arch=linux-ubuntu18.04-a64fx-AARCH64")
            & (hive.compiler == "compiler=gcc-7.5.0")
            & (hive.package == "package=figlet")
            & (hive.version == "version=2.2.4")
        ]
    )

Unnamed: 0,arch,compiler,package,version,file
2,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,chkfont.f220efe72fba81c74e7b67edeb7fb3cdbb0d7c94376c38d2967f63a17181d2bf.parquet
3,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,figlet.5761c63e7cf0cc0086c91579005198c53bab23020fe27d964e7207e5383670ad.parquet


It does indeed seem possible for the dataset to store multiple executables for certain packages.

What about packages such as `zlib`, for which there are multiple files, but for which the presence of multiple binaries is not obvious?

In [10]:
hive_zlib = hive.loc[
    (hive.arch == "arch=linux-ubuntu21.04-zen2-x86")
    & (hive.compiler == "compiler=gcc-11.1.0")
    & (hive.package == "package=zlib")
    & (hive.version == "version=1.2.11")
]
with pd.option_context("display.max_colwidth", None):
    display(hive_zlib)

Unnamed: 0,arch,compiler,package,version,file
160835,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.11bff7576adb32be81a9dffde950ec1b3c1635a5a2adf640c33d37e17486f7dd.parquet
160836,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.168ffad0ead04b0615a23592c9687b9ad9035984472dfd004b369046e21a9ab6.parquet
160837,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.897b4396bbfdcfd5a11dec926077c56cb1fc1fcbddaba7ad9a2ae5fcec55b3e6.parquet
160838,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.cb0ae46af0ece152373862882cbb9ffb97e6d8454e6a7ce6c6990815c63dcad6.parquet
160839,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.e0a042ac5d8cf84880821d67e2e83fbff4b2f42ff4196409edd80522a1c1bfb2.parquet
160840,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.ede647019b947ffa6c6593ccdc575d372f56153800cefbbab2276370eccb9bee.parquet
160841,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.postprocessed.kiteshield-all.1f156c90ee84453ea75badb3b8bb70c9f540dfe94601dc8f7584d905bd10d7e9.parquet
160842,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.postprocessed.kiteshield-all.2d0d0645a79ecad5c5961daaa9c275f366ce9279ade08f742bf0ffe675c4e59f.parquet
160843,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.postprocessed.kiteshield-all.4a1a099c3a465a914ec1a50635782bf7469c1d6a40a3b94d0997ace0385a6dae.parquet
160844,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,package=zlib,version=1.2.11,libz.so.1.2.11.postprocessed.kiteshield-all.a0d933a016a46cdbbf636bd5d82c3c4ed458929e51be6ff0980bfffca6396d2b.parquet


The names in play suggest that each file could correspond to a distinct compiler configuration. Let's dig into each file to check this hypothesis.

In [11]:
data_zlib = pd.concat(
    [
        pd.read_parquet(f"{PROTOCOL}://{parquets[i]}").assign(path=parquets[i])
        for i in hive_zlib.index
    ],
    ignore_index=True
)
data_zlib

Unnamed: 0,r_arch,r_compiler,r_package,r_version,compile_options,compile_options_hash,bin_name,bin_size,bin_md5,bin_sha256,func_name,func_addr_start,func_addr_end,basic_blocks,func_asm_summary,pcode,path
0,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-O2,tfsvZKJmhgbt,libz.so.1.2.11,109200,ce6ec7c4196041386413eb36472889fd,a93cc0233af92bbd6863074c2283260efaca634e01b8bb...,_init,00103000,0010301a,"[{'blk_addr_start': '00103000', 'blk_id': 0, '...","[{'asm_scrub_type': 'mode_2', 'func_asm_hash':...","[{'hi_pcode': ['r8I4zCth3MxqDkqw', 'i3dHXOZLeS...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
1,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-O2,tfsvZKJmhgbt,libz.so.1.2.11,109200,ce6ec7c4196041386413eb36472889fd,a93cc0233af92bbd6863074c2283260efaca634e01b8bb...,FUN_00103020,00103020,0010302c,"[{'blk_addr_start': '00103020', 'blk_id': 0, '...","[{'asm_scrub_type': 'mode_2', 'func_asm_hash':...","[{'hi_pcode': ['g6uDeuvdQTWjaBP4', 'p1_ah6QRdO...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
2,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-O2,tfsvZKJmhgbt,libz.so.1.2.11,109200,ce6ec7c4196041386413eb36472889fd,a93cc0233af92bbd6863074c2283260efaca634e01b8bb...,deregister_tm_clones,001035e0,00103608,"[{'blk_addr_start': '001035e0', 'blk_id': 0, '...","[{'asm_scrub_type': 'mode_2', 'func_asm_hash':...","[{'hi_pcode': ['XlH8FVpskIWnFx7-'], 'hi_pcode_...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
3,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-O2,tfsvZKJmhgbt,libz.so.1.2.11,109200,ce6ec7c4196041386413eb36472889fd,a93cc0233af92bbd6863074c2283260efaca634e01b8bb...,register_tm_clones,00103610,00103648,"[{'blk_addr_start': '00103610', 'blk_id': 0, '...","[{'asm_scrub_type': 'mode_2', 'func_asm_hash':...","[{'hi_pcode': ['XlH8FVpskIWnFx7-'], 'hi_pcode_...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
4,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-O2,tfsvZKJmhgbt,libz.so.1.2.11,109200,ce6ec7c4196041386413eb36472889fd,a93cc0233af92bbd6863074c2283260efaca634e01b8bb...,__do_global_dtors_aux,00103650,00103688,"[{'blk_addr_start': '00103650', 'blk_id': 0, '...","[{'asm_scrub_type': 'mode_2', 'func_asm_hash':...","[{'hi_pcode': ['sO4_DgLqQuPE2OEN', 'ikCekTeW5J...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-Os,hY6x3CeUOd5O,libz.so.1.2.11.postprocessed.stripped,100936,ee9ec2729dcdfb409f48eac7f3c9277c,fee270a141917e9be1d8dabacf72868b5b8327ff4fb194...,gzprintf,001106a0,0011075c,"[{'blk_addr_start': '001106a0', 'blk_id': 0, '...","[{'asm_scrub_type': 'no_scrub', 'func_asm_hash...","[{'hi_pcode': ['0tLbt_4c_5SP2YRK', 'fw115HVidD...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
2024,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-Os,hY6x3CeUOd5O,libz.so.1.2.11.postprocessed.stripped,100936,ee9ec2729dcdfb409f48eac7f3c9277c,fee270a141917e9be1d8dabacf72868b5b8327ff4fb194...,gzflush,00110760,001107dc,"[{'blk_addr_start': '00110760', 'blk_id': 0, '...","[{'asm_scrub_type': 'no_scrub', 'func_asm_hash...","[{'hi_pcode': ['GRmP1MsYMKIL9_bn', 'gV_5rftzsd...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
2025,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-Os,hY6x3CeUOd5O,libz.so.1.2.11.postprocessed.stripped,100936,ee9ec2729dcdfb409f48eac7f3c9277c,fee270a141917e9be1d8dabacf72868b5b8327ff4fb194...,gzsetparams,001107e0,001108b7,"[{'blk_addr_start': '001107e0', 'blk_id': 0, '...","[{'asm_scrub_type': 'no_scrub', 'func_asm_hash...","[{'hi_pcode': ['GRmP1MsYMKIL9_bn', 'GQi3lFs73L...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...
2026,linux-ubuntu21.04-zen2-x86,gcc-11.1.0,zlib,1.2.11,-Os,hY6x3CeUOd5O,libz.so.1.2.11.postprocessed.stripped,100936,ee9ec2729dcdfb409f48eac7f3c9277c,fee270a141917e9be1d8dabacf72868b5b8327ff4fb194...,gzclose_w,001108c0,001109b7,"[{'blk_addr_start': '001108c0', 'blk_id': 0, '...","[{'asm_scrub_type': 'no_scrub', 'func_asm_hash...","[{'hi_pcode': ['GRmP1MsYMKIL9_bn', 'Ghj9aPBJAG...",bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-...


In [12]:
data_zlib_agg_compile_options = data_zlib.groupby("path").agg({"compile_options": "nunique"})
data_zlib_agg_compile_options

Unnamed: 0_level_0,compile_options
path,Unnamed: 1_level_1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.11bff7576adb32be81a9dffde950ec1b3c1635a5a2adf640c33d37e17486f7dd.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.168ffad0ead04b0615a23592c9687b9ad9035984472dfd004b369046e21a9ab6.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.897b4396bbfdcfd5a11dec926077c56cb1fc1fcbddaba7ad9a2ae5fcec55b3e6.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.cb0ae46af0ece152373862882cbb9ffb97e6d8454e6a7ce6c6990815c63dcad6.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.e0a042ac5d8cf84880821d67e2e83fbff4b2f42ff4196409edd80522a1c1bfb2.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.ede647019b947ffa6c6593ccdc575d372f56153800cefbbab2276370eccb9bee.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.postprocessed.kiteshield-all.1f156c90ee84453ea75badb3b8bb70c9f540dfe94601dc8f7584d905bd10d7e9.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.postprocessed.kiteshield-all.2d0d0645a79ecad5c5961daaa9c275f366ce9279ade08f742bf0ffe675c4e59f.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.postprocessed.kiteshield-all.4a1a099c3a465a914ec1a50635782bf7469c1d6a40a3b94d0997ace0385a6dae.parquet,1
bigbasin/unpacked/arch=linux-ubuntu21.04-zen2-x86/compiler=gcc-11.1.0/package=zlib/version=1.2.11/libz.so.1.2.11.postprocessed.kiteshield-all.a0d933a016a46cdbbf636bd5d82c3c4ed458929e51be6ff0980bfffca6396d2b.parquet,1


This confirms that all records within each Parquet file share a single compilation configuration, as described in column `compile_options`.

More such information still seems to be embedded in the name of the binary (`bin_name`): some of these names carry a `.postprocessed` component. Let's unpack this.

In [13]:
data_zlib["postprocessing"] = data_zlib["bin_name"].str.split(".postprocessed.").apply(lambda x: x[-1] if x[-1] != x[0] else "vanilla")
data_zlib_agg_co_pp = data_zlib.groupby(["compile_options", "postprocessing"]).agg({"path": "nunique"})
data_zlib_agg_co_pp.unstack("postprocessing")

Unnamed: 0_level_0,path,path,path,path
postprocessing,kiteshield-all,kiteshield-outer,stripped,vanilla
compile_options,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
-O0,1,1,1,1
-O1,1,1,1,1
-O2,1,1,1,1
-O3,1,1,1,1
-Og,1,1,1,1
-Os,1,1,1,1


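This confirms that this package has a distinct Parquet file for each combination of compile options and post-processing: 6 option sets times 4 post-processing variants gives the 24 files counted earlier for this partition.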
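
The same post-processing label can be read directly off a leaf's file name, without opening the file. The sketch below rests on an assumption inferred from the listings above, namely that every leaf is named `<bin_name>.<64 hex digits>.parquet`; what the hex digest is computed from is not established here. The helper name `split_leaf_name` is hypothetical, and `"vanilla"` follows the convention of the previous cell.

```python
import re

# Assumed layout, inferred from the listings above:
#   <bin_name>.<64 hex digits>.parquet
# where <bin_name> may end in ".postprocessed.<label>".
LEAF_NAME = re.compile(r"^(?P<bin_name>.+)\.(?P<digest>[0-9a-f]{64})\.parquet$")


def split_leaf_name(filename):
    """Return (base binary name, post-processing label, digest) for a leaf."""
    match = LEAF_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected leaf name: {filename!r}")
    base, sep, label = match["bin_name"].partition(".postprocessed.")
    # No ".postprocessed." separator means the binary was left untouched.
    return base, (label if sep else "vanilla"), match["digest"]
```

Raising on names that do not match keeps the assumption about the layout visible: a leaf that breaks the pattern is reported rather than silently mislabelled.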

## Deciding on a data subset for experimenting

The goal of research on this dataset is to compare binaries. Assumptions:

1. Binaries coming from distinct source codes should be distinct (dissimilar).
    1. Binaries that share functions at the source level or statically-compiled libraries should be more similar.
1. Binaries coming from the same source code should be similar across compilers.
1. Binaries coming from the same source code should be more or less similar depending on compiler configurations.
    1. A higher degree of optimization should decrease similarity.
1. Binaries compiled on distinct host architectures should be similar.
1. Across successive versions, binaries should tend to be similar.

Most experiments would verify assumption 1; it is a basic sanity check, really. Let's also focus on evaluating assumption 5, selecting the architecture tandem and compiler that best enable it.

In [14]:
tandems_x_compilers = (
    hive
    .groupby(["arch", "compiler"])
    .agg({"package": "nunique"})
    .unstack("compiler")
)
tandems_x_compilers.replace(np.nan, "")

Unnamed: 0_level_0,package,package,package,package,package,package,package,package,package,package,package
compiler,compiler=aocc-3_1_0,compiler=clang-11.0.1,compiler=clang-12.0.0,compiler=clang-6.0.0,compiler=clang-9.0.1,compiler=gcc-10.3.0,compiler=gcc-11.1.0,compiler=gcc-7.5.0,compiler=gcc-9.3.0,compiler=intel-2021.2.0,compiler=nvhpc-21.3
arch,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
arch=linux-ubuntu18.04-a64fx-AARCH64,,,,,,,,8.0,,,
arch=linux-ubuntu18.04-aarch64-AARCH64,,,,,,,,1.0,,,
arch=linux-ubuntu18.04-arm-ARM,,,,,,,,3.0,,,
arch=linux-ubuntu18.04-arm-x86,,,,,,,,1.0,,,
arch=linux-ubuntu18.04-broadwell-x86,,,,,,,,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...
arch=linux-ubuntu21.04-westmere-x86,,,,,,8.0,8.0,,1.0,,
arch=linux-ubuntu21.04-x86-x86,,,,,,1.0,1.0,,1.0,,
arch=linux-ubuntu21.04-x86_64-x86,,92.0,85.0,,96.0,113.0,128.0,,110.0,,
arch=linux-ubuntu21.04-zen-x86,,,,,,1.0,1.0,,1.0,,


In [15]:
tandems_x_compilers.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,package
arch,compiler,Unnamed: 2_level_1
arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,8.0
arch=linux-ubuntu18.04-aarch64-AARCH64,compiler=gcc-7.5.0,1.0
arch=linux-ubuntu18.04-arm-ARM,compiler=gcc-7.5.0,3.0
arch=linux-ubuntu18.04-arm-x86,compiler=gcc-7.5.0,1.0
arch=linux-ubuntu18.04-broadwell-x86,compiler=gcc-7.5.0,1.0
...,...,...
arch=linux-ubuntu21.04-zen-x86,compiler=gcc-11.1.0,1.0
arch=linux-ubuntu21.04-zen-x86,compiler=gcc-9.3.0,1.0
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-10.3.0,1.0
arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-11.1.0,1.0


Let's filter this down a bit: any tandem and compiler combination with fewer than 100 packages to explore is a no-go.

In [16]:
txc_stacked = tandems_x_compilers.stack()
txc100 = txc_stacked.loc[txc_stacked["package"] >= 100.0].unstack("compiler")
txc100.replace(np.nan, "")

Unnamed: 0_level_0,package,package,package,package,package,package
compiler,compiler=clang-6.0.0,compiler=gcc-10.3.0,compiler=gcc-7.5.0,compiler=gcc-9.3.0,compiler=intel-2021.2.0,compiler=gcc-11.1.0
arch,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
arch=linux-ubuntu18.04-x86_64-x86,112.0,101.0,121.0,105.0,110.0,
arch=linux-ubuntu21.04-x86_64-x86,,113.0,,110.0,,128.0


The choice is now much easier. We will use the architecture tandem `linux-ubuntu21.04-x86_64-x86` and compiler `gcc-11.1.0`. Let's measure the number of versions for each package.

In [17]:
hive_selected = hive.loc[
    (hive["arch"] == "arch=linux-ubuntu21.04-x86_64-x86")
    & (hive["compiler"] == "compiler=gcc-11.1.0")
].drop(columns=["arch", "compiler"])
hive_selected

Unnamed: 0,package,version,file
112572,package=activeharmony,version=4.6.0,agg.so.36a81602bddf2f541896823e823f0ef1f926e9e...
112573,package=activeharmony,version=4.6.0,agg.so.57923f7e852de6542122911d54c7acce3f2dccc...
112574,package=activeharmony,version=4.6.0,agg.so.5d5dcab6f344bf089fa0e70d771c863e60b279c...
112575,package=activeharmony,version=4.6.0,agg.so.7692f55d8154308fd3c85c97120c58446927e5a...
112576,package=activeharmony,version=4.6.0,agg.so.af2129723349b12ae72d53d65ce6fe2362562cf...
...,...,...,...
152044,package=zstd,version=1.4.5,zstd.postprocessed.kiteshield-outer.b6cedac767...
152045,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.bd17cf0506318084f3108...
152046,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.cb0a2efb14054e20fb465...
152047,package=zstd,version=1.4.5,zstd.postprocessed.upx-best.145f04c3b68e1e3d8e...


In [18]:
hive_selected_package_versions = (
    hive_selected
    .groupby("package")
    .agg({"version": "nunique"})
)
hive_selected_package_versions

Unnamed: 0_level_0,version
package,Unnamed: 1_level_1
package=activeharmony,1
package=aeskeyfind,1
package=agrep,1
package=alglib,1
package=aragorn,2
...,...
package=wtdbg2,1
package=xxhash,4
package=xz,1
package=zlib,2


Many of these have only one version. Let's focus on those with at least 2.

In [19]:
df_packages_selected = hive_selected_package_versions.loc[
    hive_selected_package_versions["version"] > 1
]
df_packages_selected

Unnamed: 0_level_0,version
package,Unnamed: 1_level_1
package=aragorn,2
package=argon2,3
package=blktrace,2
package=cachefilesd,5
package=earlyoom,2
package=figlet,3
package=haproxy,2
package=hardlink,2
package=hiredis,4
package=iniparser,4


In [20]:
packages_selected = set(df_packages_selected.index)
len(packages_selected)

35

This looks like a suitable subset: small enough for experimentation to be fast, large enough for results to matter. How many binaries are thus in play?

In [21]:
hive_selected

Unnamed: 0,package,version,file
112572,package=activeharmony,version=4.6.0,agg.so.36a81602bddf2f541896823e823f0ef1f926e9e...
112573,package=activeharmony,version=4.6.0,agg.so.57923f7e852de6542122911d54c7acce3f2dccc...
112574,package=activeharmony,version=4.6.0,agg.so.5d5dcab6f344bf089fa0e70d771c863e60b279c...
112575,package=activeharmony,version=4.6.0,agg.so.7692f55d8154308fd3c85c97120c58446927e5a...
112576,package=activeharmony,version=4.6.0,agg.so.af2129723349b12ae72d53d65ce6fe2362562cf...
...,...,...,...
152044,package=zstd,version=1.4.5,zstd.postprocessed.kiteshield-outer.b6cedac767...
152045,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.bd17cf0506318084f3108...
152046,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.cb0a2efb14054e20fb465...
152047,package=zstd,version=1.4.5,zstd.postprocessed.upx-best.145f04c3b68e1e3d8e...


In [22]:
subset_exp = hive_selected.loc[hive_selected["package"].isin(packages_selected)]
subset_exp

Unnamed: 0,package,version,file
112734,package=aragorn,version=1.2.36,aragorn.04e26c90c5323cc4f4677bbeed81396beba82e...
112735,package=aragorn,version=1.2.36,aragorn.4ba4380fdf8f54e45ad7c138cafb9d61ede70d...
112736,package=aragorn,version=1.2.36,aragorn.4fa7ba312c88261d9a2795b0d65d095c1b0548...
112737,package=aragorn,version=1.2.36,aragorn.58f16bf402ce44d50e710b277b79c8f11e9d33...
112738,package=aragorn,version=1.2.36,aragorn.658427ef2395ae117298f5f418efdb650f812f...
...,...,...,...
152044,package=zstd,version=1.4.5,zstd.postprocessed.kiteshield-outer.b6cedac767...
152045,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.bd17cf0506318084f3108...
152046,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.cb0a2efb14054e20fb465...
152047,package=zstd,version=1.4.5,zstd.postprocessed.upx-best.145f04c3b68e1e3d8e...


We know that each file corresponds to a distinct binary, and that the file name encodes some versioning information (because of the way shared object versioning works on Linux) as well as details of the compilation procedure. We can thus strip these off to get a good lower bound on the number of nominally distinct binaries.

In [23]:
len(np.unique(subset_exp["file"].str.split(".").apply(lambda x: x[0])))

265

## Compiler configuration analysis

Since we have a running assumption that optimization and other compilation tricks alter the structure of binaries, we would presumably run experiments on binaries that share not only a certain compiler version, but also the whole build configuration. Let's index the various binaries in play by this aspect. <a name="processing-batches"></a>

In [24]:
compile_options = {}
for i in tqdm_notebook(subset_exp.index):
    df = pd.read_parquet(f"{PROTOCOL}://{parquets[i]}")
    compile_options_file = df["compile_options"].unique()
    assert len(compile_options_file) == 1
    compile_options[i] = compile_options_file[0]
hive["compile_options"] = pd.Series(compile_options)
hive.loc[~hive["compile_options"].isna()]

  0%|          | 0/9396 [00:00<?, ?it/s]

Unnamed: 0,arch,compiler,package,version,file,compile_options
112734,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.04e26c90c5323cc4f4677bbeed81396beba82e...,-O1
112735,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.4ba4380fdf8f54e45ad7c138cafb9d61ede70d...,-Og
112736,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.4fa7ba312c88261d9a2795b0d65d095c1b0548...,-O0
112737,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.58f16bf402ce44d50e710b277b79c8f11e9d33...,-O2
112738,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.658427ef2395ae117298f5f418efdb650f812f...,
...,...,...,...,...,...,...
152044,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,zstd.postprocessed.kiteshield-outer.b6cedac767...,
152045,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.bd17cf0506318084f3108...,
152046,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,zstd.postprocessed.upx-1.cb0a2efb14054e20fb465...,
152047,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,zstd.postprocessed.upx-best.145f04c3b68e1e3d8e...,


The nominal non-optimized compilation configuration corresponds to option `-O0` for the GCC compiler. Having no compile-affecting option at all yields the same configuration, since `-O0` is GCC's default optimization level. Let's also reject all post-processed variants of the linked binary. How many versions do we then have for each package?

In [25]:
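# Extract the post-processing label from each file name. Names without a
# ".postprocessed." component get a dummy tail so that they reduce to "".
hive["postprocessing"] = (
    hive["file"]
    .str.split(".postprocessed.")
    .apply(lambda x: x[-1] if len(x) > 1 else ".asdf.parquet")
    .str.split(".")
    .apply(lambda y: ".".join(y[:-2]))  # drop the trailing hash and "parquet"
)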
hive

Unnamed: 0,arch,compiler,package,version,file,compile_options,postprocessing
0,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=aeskeyfind,version=master,aeskeyfind.4c96ada082b42828d4d9ede6487f370ecbe...,,
1,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=earlyoom,version=1.6,earlyoom.849fe438fc62a5981db19d3e0f57338a6ecec...,,
2,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,chkfont.f220efe72fba81c74e7b67edeb7fb3cdbb0d7c...,,
3,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=figlet,version=2.2.4,figlet.5761c63e7cf0cc0086c91579005198c53bab230...,,
4,arch=linux-ubuntu18.04-a64fx-AARCH64,compiler=gcc-7.5.0,package=flash,version=1.2.11,flash.f2e8ed41b3bd6732c41a81c6400003db5e3071f7...,,
...,...,...,...,...,...,...,...
160926,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.343728623...,,stripped
160927,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.387b0930f...,,stripped
160928,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.52462a04e...,,stripped
160929,arch=linux-ubuntu21.04-zen2-x86,compiler=gcc-9.3.0,package=zlib,version=1.2.8,libz.so.1.2.8.postprocessed.stripped.c98d28d69...,,stripped


In [26]:
hive_noopt_nopost = hive.loc[hive.compile_options.isin({"", "-O0"}) & ~hive.postprocessing.apply(bool)]
hive_noopt_nopost

Unnamed: 0,arch,compiler,package,version,file,compile_options,postprocessing
112736,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.4fa7ba312c88261d9a2795b0d65d095c1b0548...,-O0,
112738,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.36,aragorn.658427ef2395ae117298f5f418efdb650f812f...,,
112747,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=aragorn,version=1.2.38,aragorn.ec7b5999c8195f3225bc9c910d21c4bc4a4ae3...,,
112756,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=argon2,version=20161029,argon2.4ae4beef8818c768ef08fd62cd20ee9cb514adb...,,
112758,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=argon2,version=20161029,argon2.58ca0778d7fdf5e0281d4478170da7650b58157...,-O0,
...,...,...,...,...,...,...,...
151982,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.3,libzstd.so.1.4.3.5ed526a291b0b6804077a13b51dc1...,,
152003,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.4,libzstd.so.1.4.4.0dd132b969ebac53edc01d0e44f7f...,,
152025,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,libzstd.so.1.4.5.25ac422728b9ec12be141f63f9bf9...,-O0,
152028,arch=linux-ubuntu21.04-x86_64-x86,compiler=gcc-11.1.0,package=zstd,version=1.4.5,libzstd.so.1.4.5.5999c8c098d5f894042ba5b6a4ed8...,,


The dataset seems to carry distinct binaries built either with `-O0` or with no option at all. How are these split?

In [27]:
hive_noopt_nopost.groupby("compile_options").agg({"file": "count"})

Unnamed: 0_level_0,file
compile_options,Unnamed: 1_level_1
,455
-O0,639


There are more binaries with `-O0`; furthermore, the non-optimized configuration is deliberate in this case. Likely, one would want to start experimenting with these binaries.

## Known pitfall when using the dataset

The `pyarrow.dataset` routines are not very robust when dealing with Parquet files with a complicated schema that might be subtly non-uniform throughout. For instance, the following idiom would hang my notebook kernel:

It seems the PyArrow code fails to handle the optimized case of loading a projection of the column space when this projection includes one of the dataset's heavily-recursive columns, such as `basic_blocks`. The workaround consists in accepting the memory and runtime hit of loading in full every Parquet file that contributes to one's computation. This idiom has been demonstrated [above](#processing-batches). The approach that has worked best for me has been to iterate over the Parquet leaves of my choice, loading each in sequence using Pandas.
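
A minimal sketch of such a loop follows. The function name `summarize_leaves` and its `load` parameter are hypothetical (the latter exists only so that the loader can be swapped out); the `compile_options` column and the `f"{protocol}://{path}"` addressing are those used earlier in this notebook. Each leaf is read whole and reduced straight away, so that full leaves do not accumulate in memory.

```python
import pandas as pd


def summarize_leaves(paths, protocol, load=pd.read_parquet):
    """Load each leaf whole, reduce it immediately, and keep only the summary."""
    rows = []
    for path in paths:
        df = load(f"{protocol}://{path}")  # whole file: no column projection
        rows.append(
            {
                "path": path,
                "n_rows": len(df),
                # Assumes a non-empty leaf with a single compile configuration.
                "compile_options": df["compile_options"].iloc[0],
            }
        )
    return pd.DataFrame(rows)
```

With the variables defined earlier, a call would look like `summarize_leaves([parquets[i] for i in subset_exp.index], PROTOCOL)`.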

This simple loop is easily processed in parallel using `joblib` or, for fancier processing, using Dask. In the latter case, note that `dask.dataframe.read_parquet` will run into problems similar to those raised by `pyarrow.dataset.Dataset.to_batches`. One can work around them by setting up a `dask.delayed` routine that invokes `pandas.read_parquet` and passing its results to `dask.dataframe.from_delayed`.
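
As a dependency-free illustration of the parallel variant, the sketch below uses the standard library's `concurrent.futures` thread pool as a stand-in for `joblib`; it is not the setup used for this notebook. The names `map_leaves` and `process_leaf` are hypothetical: `process_leaf` stands for any function that loads one leaf with `pandas.read_parquet` and returns a small result.

```python
from concurrent.futures import ThreadPoolExecutor


def map_leaves(process_leaf, paths, max_workers=4):
    """Apply `process_leaf` to every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `paths`, not of completion.
        return list(pool.map(process_leaf, paths))
```

Threads suit this loop when most of the time is spent waiting on remote storage. With `joblib`, the same mapping is written `Parallel(n_jobs=4)(delayed(process_leaf)(p) for p in paths)`.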