# Analysis of Android App Build Reproducibility

This notebook analyzes the reproducibility of Android app builds across different environments and build methods.

In [10]:
import hashlib
import io
import itertools
import json
import os
import pathlib
import shutil
import sys
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Methodology

To test the reproducibility of the Android application, we followed [the guide in the official repository](https://github.com/signalapp/Signal-Android/tree/main/reproducible-builds). We first ran each step manually, and then wrote [a custom script](https://github.com/TheTechZone/reproducible-tests/blob/main/build_signal.py) to automate the process and prevent human error.
Each build output was obtained using a clean<a name="clean"></a>[<sup>[clean]</sup>](#clean) Docker container. 

All tests were carried out against version `v7.25.2` of the Signal application, which is the most recent version currently available on the Play Store (`region=CH`, `releaseChannel=production`). [According to public info](https://github.com/signalapp/Signal-Android/issues/13754#issuecomment-2450435519), this version should include the recent ProGuard fixes that ensure build determinism (TODO: link the commits).

The only variability in the environment is the host operating system (`fedora-40`, `fedora-41`, and `ubuntu-24.04`). Each machine was connected to a different physical Android device for retrieving the Play Store app.



<a name="clean"></a> [^clean](#clean) Before invoking Gradle, the Docker image was built from scratch with no caching to prevent build artifacts from affecting reproducibility (see the `./clean` script).

## Data Collection

First, let's collect information about all the builds in our outputs directory.

In [11]:
def get_file_hash(filepath):
    """Calculate SHA-256 hash of a file."""
    sha256_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()


def collect_build_data(output_dir):
    """Collect build data from the outputs directory."""
    builds_data = []

    def human_size(num_bytes, units=(" bytes", "KB", "MB", "GB", "TB", "PB", "EB")):
        """Return a human-readable size, rounded down to whole units."""
        return (
            str(num_bytes) + units[0]
            if num_bytes < 1024
            else human_size(num_bytes >> 10, units[1:])
        )

    for build_dir in Path(output_dir).iterdir():
        if not build_dir.is_dir():
            continue

        if "checkpoints" in build_dir.name:
            continue

        # Parse environment info from directory name
        env_info = build_dir.name.split("-")
        os_name = env_info[0]
        build_type = env_info[1]
        execution_count = 1 if len(env_info) == 2 else int(env_info[2])

        # Process APKs
        for apk_dir in build_dir.glob("**/apks*"):
            apk_type = apk_dir.name
            for apk_file in apk_dir.glob("**/*.apk"):
                size = apk_file.stat().st_size
                builds_data.append(
                    {
                        "os": os_name,
                        "build_type": build_type,
                        "exec_count": execution_count,
                        "apk_type": apk_type,
                        "local_apk": apk_type == "apks-i-built",
                        "filename": apk_file.name,
                        "filepath": str(apk_file),
                        "hash": get_file_hash(apk_file),
                        "size": size,
                        "human_size": human_size(size),
                    }
                )

    return pd.DataFrame(builds_data)


# Collect data
output_dir = "outputs"
df = collect_build_data(output_dir)
df.head(50)

Unnamed: 0,os,build_type,exec_count,apk_type,local_apk,filename,filepath,hash,size,human_size
0,fedora41_dfs_sort,scripted,1,apks-i-built,True,base-master.apk,outputs/fedora41_dfs_sort-scripted/apks-i-buil...,ed0c8b3e94ba72dc96920b55732c29dd8038c6446df5b0...,52380815,49MB
1,fedora41_dfs_sort,scripted,1,apks-i-built,True,base-arm64_v8a.apk,outputs/fedora41_dfs_sort-scripted/apks-i-buil...,995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a7...,10699335,10MB
2,fedora41_dfs_sort,scripted,1,apks-i-built,True,base-xxhdpi.apk,outputs/fedora41_dfs_sort-scripted/apks-i-buil...,41b12629671ba0778364d671909add3a94380955c5c366...,1637689,1MB
3,ubuntu22,scripted,1,apks-from-device,False,base-master.apk,outputs/ubuntu22-scripted-1/apks-from-device/b...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,85037580,81MB
4,ubuntu22,scripted,1,apks-from-device,False,base-arm64_v8a.apk,outputs/ubuntu22-scripted-1/apks-from-device/b...,d296fff8691b3632f85725202045875af7e89dafedc9df...,23569519,22MB
5,ubuntu22,scripted,1,apks-from-device,False,base-xxhdpi.apk,outputs/ubuntu22-scripted-1/apks-from-device/b...,c65a2691cde28650f8934d492bff6c241b2e55eee4d222...,1666719,1MB
6,ubuntu22,scripted,1,apks-i-built,True,base-master.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,84693087,80MB
7,ubuntu22,scripted,1,apks-i-built,True,base-arm64_v8a.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,4832f3a837a4059db7476c45d995045b0c561388234335...,23552089,22MB
8,ubuntu22,scripted,1,apks-i-built,True,base-xxhdpi.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,41b12629671ba0778364d671909add3a94380955c5c366...,1637689,1MB
9,fedora40,scripted,1,apks-from-device,False,base-master.apk,outputs/fedora40-scripted-1/apks-from-device/b...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,85037580,81MB
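The directory-name convention that `collect_build_data` relies on can be illustrated in isolation. The following is a minimal sketch; the helper name `parse_build_dir_name` is hypothetical and not part of the notebook, and the two sample names are taken from the paths shown in the table above.

```python
def parse_build_dir_name(name: str) -> dict:
    """Split '<os>-<build_type>[-<run>]' into its parts (run defaults to 1)."""
    parts = name.split("-")
    return {
        "os": parts[0],
        "build_type": parts[1],
        "exec_count": int(parts[2]) if len(parts) > 2 else 1,
    }


# Underscores stay inside the OS label because only "-" is used as a separator.
for name in ["ubuntu22-scripted-1", "fedora41_dfs_sort-scripted"]:
    print(name, "->", parse_build_dir_name(name))
```

Because the split is on `-` only, variant labels such as `fedora41_dfs_sort` end up in the `os` column, which is why the later group-by tables treat each variant as its own environment.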


In [12]:
def print_tree_with_hashes(df):
    """Print directory tree with file hashes."""
    # Create a nested dictionary structure
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    # ANSI color codes
    BLUE = "\033[34m"
    PURPLE = "\033[35m"
    BOLD = "\033[1m"
    RESET = "\033[0m"

    # Populate the tree structure
    for _, row in df.iterrows():
        # Keep only the first 8 characters of each hash
        hash_prefix = row["hash"][:8]
        tree[row["os"]][row["build_type"]][row["apk_type"]][row["filename"]] = hash_prefix

    # Print the tree
    # Use `os_name` rather than `os` to avoid shadowing the imported `os` module
    for os_name in sorted(tree.keys()):
        print(f"{BLUE}{BOLD}{os_name}/{RESET}")
        for build_type in sorted(tree[os_name].keys()):
            print(f"├── {BLUE}{BOLD}{build_type}/{RESET}")
            for apk_type in sorted(tree[os_name][build_type].keys()):
                print(f"│   ├── {BLUE}{BOLD}{apk_type}/{RESET}")
                for filename, hash_prefix in sorted(
                    tree[os_name][build_type][apk_type].items()
                ):
                    print(
                        f"│   │   ├── {filename} [hash: {PURPLE}{BOLD}{hash_prefix}{RESET}]"
                    )
    print("..... Done")


print("Directory Tree with Hash Prefixes (first 8 chars):")
print_tree_with_hashes(df)

Directory Tree with Hash Prefixes (first 8 chars):
[34m[1mfedora40/[0m
├── [34m[1mmanual/[0m
│   ├── [34m[1mapks-from-device/[0m
│   │   ├── base-arm64_v8a.apk [hash: [35m[1md296fff8[0m]
│   │   ├── base-master.apk [hash: [35m[1meb1902b3[0m]
│   │   ├── base-xxhdpi.apk [hash: [35m[1mc65a2691[0m]
│   ├── [34m[1mapks-i-built/[0m
│   │   ├── base-arm64_v8a.apk [hash: [35m[1ma74d526f[0m]
│   │   ├── base-master.apk [hash: [35m[1m47686026[0m]
│   │   ├── base-xxhdpi.apk [hash: [35m[1m51119b04[0m]
├── [34m[1mscripted/[0m
│   ├── [34m[1mapks-from-device/[0m
│   │   ├── base-arm64_v8a.apk [hash: [35m[1md296fff8[0m]
│   │   ├── base-master.apk [hash: [35m[1meb1902b3[0m]
│   │   ├── base-xxhdpi.apk [hash: [35m[1mc65a2691[0m]
│   ├── [34m[1mapks-i-built/[0m
│   │   ├── base-arm64_v8a.apk [hash: [35m[1ma74d526f[0m]
│   │   ├── base-master.apk [hash: [35m[1m47686026[0m]
│   │   ├── base-xxhdpi.apk [hash: [35m[1m51119b04[0m]
[34m[1mfedora40_

## Build Reproducibility Analysis

Let's analyze the reproducibility of builds by comparing hashes across different environments.

In [13]:
df.groupby(["os", "build_type"])["exec_count"].nunique().reset_index(
    name="execution_times"
).rename_axis(None).set_index(["os", "build_type"])

os,build_type,execution_times
fedora40,manual,1
fedora40,scripted,2
fedora40_dfs_sort,scripted,1
fedora40_dfs_sort_reverse,scripted,1
fedora41,scripted,1
fedora41_dfs_sort,scripted,1
ubuntu22,scripted,1
ubuntu22_dfs_sort,scripted,1
ubuntu22_dfs_sort_reverse,scripted,1


> So the outputs cover 10 runs in total, 9 of them scripted and 1 manual.

In [14]:
df[~df["local_apk"]].groupby("filename")["hash"].nunique().reset_index(
    name="distinct file hashes"
).set_index("filename")

filename,distinct file hashes
base-arm64_v8a.apk,1
base-master.apk,1
base-xxhdpi.apk,1


> So, there is only one unique variant of each APK file, which implies that every phone received the same set of split APKs.

Sanity check: do all the builds in the same environment produce the same apk?

In [16]:
df[df["local_apk"]].groupby(["os", "filename"])["hash"].nunique().reset_index(
    name="distinct file hashes"
).set_index(["os", "filename"])

os,filename,distinct file hashes
fedora40,base-arm64_v8a.apk,1
fedora40,base-master.apk,1
fedora40,base-xxhdpi.apk,1
fedora40_dfs_sort,base-arm64_v8a.apk,1
fedora40_dfs_sort,base-master.apk,1
fedora40_dfs_sort,base-xxhdpi.apk,1
fedora40_dfs_sort_reverse,base-arm64_v8a.apk,1
fedora40_dfs_sort_reverse,base-master.apk,1
fedora40_dfs_sort_reverse,base-xxhdpi.apk,1
fedora41,base-arm64_v8a.apk,1


> So, within a given host environment, subsequent builds produce the same APKs (bundles): no environment yields more than one hash per file.

In [17]:
df[df["local_apk"]].groupby(["filename"])["hash"].nunique()

filename
base-arm64_v8a.apk    3
base-master.apk       7
base-xxhdpi.apk       2
Name: hash, dtype: int64
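The counting step above can be reproduced on a toy table to make the logic explicit. This is a small sketch with made-up host names, filenames, and hashes (none of them come from the real builds): a file counts as diverging when it has more than one distinct hash across hosts.

```python
import pandas as pd

# Toy data: a.apk is identical on both hosts, b.apk is not.
toy = pd.DataFrame(
    {
        "os": ["hostA", "hostB", "hostA", "hostB"],
        "filename": ["a.apk", "a.apk", "b.apk", "b.apk"],
        "hash": ["1111", "1111", "2222", "3333"],
    }
)

# Number of distinct hashes observed per filename.
distinct = toy.groupby("filename")["hash"].nunique()

# Filenames with more than one distinct hash are not reproducible across hosts.
diverging = distinct[distinct > 1].index.tolist()

print(distinct.to_dict())
print(diverging)
```

A count of 1 means every host produced a byte-identical file; any higher count only says that the bytes differ somewhere, not whether the difference is meaningful.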

In [19]:
# Step 1: Group by 'filename' and count unique 'hash' values
unique_hashes_count = df[df["local_apk"]].groupby("filename")["hash"].nunique()

# Step 2: Filter filenames where the unique hash count is greater than 1
diverging_filenames = unique_hashes_count[unique_hashes_count > 1].index

# Step 3: Extract rows where the 'filename' is in the list of filenames with multiple unique hashes
divergent_data = df[df["local_apk"] & df["filename"].isin(diverging_filenames)]

# Step 4: Group by 'filename' and 'hash', keeping one representative row per distinct hash
result = (
    divergent_data.groupby(["filename", "hash"])[["os", "filepath"]]
    .agg(
        {
            "os": "first",  # First 'os' that produced this hash (other hosts may share it)
            "filepath": "first",  # Path of the first file with this hash
        }
    )
    .reset_index()
)

# Step 5: No further filtering is needed: `diverging_filenames` already restricts
# the rows to filenames with more than one distinct hash.

# Show the result
result.set_index(["filename", "os"])

filename,os,hash,filepath
base-arm64_v8a.apk,ubuntu22,4832f3a837a4059db7476c45d995045b0c561388234335...,outputs/ubuntu22-scripted-1/apks-i-built/base-...
base-arm64_v8a.apk,fedora41_dfs_sort,995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a7...,outputs/fedora41_dfs_sort-scripted/apks-i-buil...
base-arm64_v8a.apk,fedora40,a74d526fd13f4fe3bda7e066c575fa0c931d8da7333b8d...,outputs/fedora40-scripted-1/apks-i-built/base-...
base-master.apk,ubuntu22_dfs_sort_reverse,11b479c5082ca4323ab811d05288283f079f881f665619...,outputs/ubuntu22_dfs_sort_reverse-scripted/apk...
base-master.apk,fedora40_dfs_sort_reverse,243e0a3b54cd8b21046155f7fccbe08def0a998f8ecdf0...,outputs/fedora40_dfs_sort_reverse-scripted/apk...
base-master.apk,ubuntu22_dfs_sort,391c7b876ff6aa67b5922f3699c17f3beb21ff0913ba6a...,outputs/ubuntu22_dfs_sort-scripted/apks-i-buil...
base-master.apk,fedora40,4768602685c6e5ceff078093c87e9aba5f19597b0d76c5...,outputs/fedora40-scripted-1/apks-i-built/base-...
base-master.apk,fedora41,dbe76a3649f90f061170c6d3d2071bcddc3c32672bebc2...,outputs/fedora41-scripted-1/apks-i-built/base-...
base-master.apk,fedora41_dfs_sort,ed0c8b3e94ba72dc96920b55732c29dd8038c6446df5b0...,outputs/fedora41_dfs_sort-scripted/apks-i-buil...
base-master.apk,ubuntu22,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,outputs/ubuntu22-scripted-1/apks-i-built/base-...


> What about ubuntu22?

In [20]:
frame = df[(df["os"] == "ubuntu22") & (df["local_apk"])][["os", "filename", "filepath", "hash"]]
frame2 = df[(df["os"] == "ubuntu22_dfs_sort") & (df["local_apk"])][["os", "filename", "filepath", "hash"]]
frame3 = df[(df["os"] == "ubuntu22_dfs_sort_reverse") & (df["local_apk"])][["os", "filename", "filepath", "hash"]]
frame = pd.concat([frame, frame2, frame3])
frame

Unnamed: 0,os,filename,filepath,hash
6,ubuntu22,base-master.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...
7,ubuntu22,base-arm64_v8a.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,4832f3a837a4059db7476c45d995045b0c561388234335...
8,ubuntu22,base-xxhdpi.apk,outputs/ubuntu22-scripted-1/apks-i-built/base-...,41b12629671ba0778364d671909add3a94380955c5c366...
15,ubuntu22_dfs_sort,base-master.apk,outputs/ubuntu22_dfs_sort-scripted/apks-i-buil...,391c7b876ff6aa67b5922f3699c17f3beb21ff0913ba6a...
16,ubuntu22_dfs_sort,base-arm64_v8a.apk,outputs/ubuntu22_dfs_sort-scripted/apks-i-buil...,995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a7...
17,ubuntu22_dfs_sort,base-xxhdpi.apk,outputs/ubuntu22_dfs_sort-scripted/apks-i-buil...,41b12629671ba0778364d671909add3a94380955c5c366...
27,ubuntu22_dfs_sort_reverse,base-master.apk,outputs/ubuntu22_dfs_sort_reverse-scripted/apk...,11b479c5082ca4323ab811d05288283f079f881f665619...
28,ubuntu22_dfs_sort_reverse,base-arm64_v8a.apk,outputs/ubuntu22_dfs_sort_reverse-scripted/apk...,995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a7...
29,ubuntu22_dfs_sort_reverse,base-xxhdpi.apk,outputs/ubuntu22_dfs_sort_reverse-scripted/apk...,41b12629671ba0778364d671909add3a94380955c5c366...


> So, ubuntu22 produced the same `base-arm64_v8a.apk` and `base-xxhdpi.apk` as fedora41, but different ones from fedora40. Each OS has a unique `base-master.apk`.
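One way to see such cross-host agreement at a glance is a file-by-host matrix of hash prefixes. The sketch below uses invented host names and hashes purely for illustration; it is not how the notebook computed the result above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "os": ["hostA", "hostA", "hostB", "hostB"],
        "filename": ["base.apk", "split.apk", "base.apk", "split.apk"],
        "hash": ["aaaa1111ffff", "bbbb2222ffff", "cccc3333ffff", "bbbb2222ffff"],
    }
)

# One row per file, one column per host, cells hold an 8-character hash prefix.
matrix = toy.assign(prefix=toy["hash"].str[:8]).pivot(
    index="filename", columns="os", values="prefix"
)

# A file is identical across hosts when its row holds a single distinct value.
matrix["identical"] = matrix.nunique(axis=1) == 1
print(matrix)
```

With the real `df`, the same pivot would first need to be restricted to locally built APKs and to one row per (`os`, `filename`), since `pivot` raises on duplicate index/column pairs.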

## Investigation

In [21]:
from build_signal import SignalBuilder
import argparse
args = argparse.Namespace()
args.version = "v7.25.2"
args.clean = None
args.dfs = None
sb = SignalBuilder(args)

In [11]:
# sb.clone_signal("7.25.2")

In [22]:
from pathlib import Path

sb.setup_apkdiff(Path("./"))


Setting up apkdiff.py...


PosixPath('/home/chrissy/Code/reproducible-tests/reproducible-signal/apkdiff.py')

In [23]:
from apkdiff import ApkDiff

In [24]:
df.columns

Index(['os', 'build_type', 'exec_count', 'apk_type', 'local_apk', 'filename',
       'filepath', 'hash', 'size', 'human_size'],
      dtype='object')

In [25]:
# Keep one row per distinct (filename, hash) pair, ordered by filename,
# without relying on `groupby(...).apply`.
df_unique = (
    df.drop_duplicates(subset=["filename", "hash"])
    .sort_values("filename", kind="stable")
    .reset_index(drop=True)
)

# Display the result
df_unique.set_index(["filename", "os"])


filename,os,build_type,exec_count,apk_type,local_apk,filepath,hash,size,human_size
base-arm64_v8a.apk,fedora41_dfs_sort,scripted,1,apks-i-built,True,outputs/fedora41_dfs_sort-scripted/apks-i-buil...,995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a7...,10699335,10MB
base-arm64_v8a.apk,ubuntu22,scripted,1,apks-from-device,False,outputs/ubuntu22-scripted-1/apks-from-device/b...,d296fff8691b3632f85725202045875af7e89dafedc9df...,23569519,22MB
base-arm64_v8a.apk,ubuntu22,scripted,1,apks-i-built,True,outputs/ubuntu22-scripted-1/apks-i-built/base-...,4832f3a837a4059db7476c45d995045b0c561388234335...,23552089,22MB
base-arm64_v8a.apk,fedora40,scripted,1,apks-i-built,True,outputs/fedora40-scripted-1/apks-i-built/base-...,a74d526fd13f4fe3bda7e066c575fa0c931d8da7333b8d...,23561264,22MB
base-master.apk,fedora41_dfs_sort,scripted,1,apks-i-built,True,outputs/fedora41_dfs_sort-scripted/apks-i-buil...,ed0c8b3e94ba72dc96920b55732c29dd8038c6446df5b0...,52380815,49MB
base-master.apk,ubuntu22,scripted,1,apks-from-device,False,outputs/ubuntu22-scripted-1/apks-from-device/b...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,85037580,81MB
base-master.apk,ubuntu22,scripted,1,apks-i-built,True,outputs/ubuntu22-scripted-1/apks-i-built/base-...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,84693087,80MB
base-master.apk,fedora40,scripted,1,apks-i-built,True,outputs/fedora40-scripted-1/apks-i-built/base-...,4768602685c6e5ceff078093c87e9aba5f19597b0d76c5...,85021050,81MB
base-master.apk,ubuntu22_dfs_sort,scripted,1,apks-i-built,True,outputs/ubuntu22_dfs_sort-scripted/apks-i-buil...,391c7b876ff6aa67b5922f3699c17f3beb21ff0913ba6a...,52381303,49MB
base-master.apk,fedora40_dfs_sort_reverse,scripted,1,apks-i-built,True,outputs/fedora40_dfs_sort_reverse-scripted/apk...,243e0a3b54cd8b21046155f7fccbe08def0a998f8ecdf0...,52381043,49MB


In [26]:
# Step 1: Group by the 'filename' index level
grouped = df_unique.groupby("filename")

# Step 2: For each group, generate all pairs of rows (without repetition)
pairs = []
for filename, group in grouped:
    # If the group has more than one row, create combinations of indices
    if len(group) > 1:
        # Generate all unique combinations of row indices for this group
        for index1, index2 in itertools.combinations(group.index, 2):
            pair_data = {
                "filename": filename,
                "pair": (index1, index2),
                "row_1_data": group.loc[index1].to_dict(),
                "row_2_data": group.loc[index2].to_dict(),
            }
            pairs.append(pair_data)

# Step 3: Convert the list of pairs into a DataFrame
pairs_df = pd.DataFrame(pairs)

# Show the pairs DataFrame
pairs_df

Unnamed: 0,filename,pair,row_1_data,row_2_data
0,base-arm64_v8a.apk,"(0, 1)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'ubuntu22', 'build_type': 'scripted', '..."
1,base-arm64_v8a.apk,"(0, 2)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'ubuntu22', 'build_type': 'scripted', '..."
2,base-arm64_v8a.apk,"(0, 3)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'fedora40', 'build_type': 'scripted', '..."
3,base-arm64_v8a.apk,"(1, 2)","{'os': 'ubuntu22', 'build_type': 'scripted', '...","{'os': 'ubuntu22', 'build_type': 'scripted', '..."
4,base-arm64_v8a.apk,"(1, 3)","{'os': 'ubuntu22', 'build_type': 'scripted', '...","{'os': 'fedora40', 'build_type': 'scripted', '..."
5,base-arm64_v8a.apk,"(2, 3)","{'os': 'ubuntu22', 'build_type': 'scripted', '...","{'os': 'fedora40', 'build_type': 'scripted', '..."
6,base-master.apk,"(4, 5)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'ubuntu22', 'build_type': 'scripted', '..."
7,base-master.apk,"(4, 6)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'ubuntu22', 'build_type': 'scripted', '..."
8,base-master.apk,"(4, 7)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'fedora40', 'build_type': 'scripted', '..."
9,base-master.apk,"(4, 8)","{'os': 'fedora41_dfs_sort', 'build_type': 'scr...","{'os': 'ubuntu22_dfs_sort', 'build_type': 'scr..."
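The number of comparisons grows quadratically with the number of distinct variants of a file: `n` variants give `n * (n - 1) / 2` unordered pairs. A short sketch, using illustrative variant labels:

```python
import itertools

# Four distinct variants of one file (labels are illustrative).
variants = ["playstore", "fedora40", "fedora41", "ubuntu22"]

# combinations() yields each unordered pair exactly once, never (x, x).
pairs = list(itertools.combinations(variants, 2))

print(len(pairs))
for left, right in pairs:
    print(f"{left} vs {right}")
```

This matches the table above, where the four distinct `base-arm64_v8a.apk` rows (indices 0 to 3) produce six pairs.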


In [27]:
pairs_df["row_1_data"][0]

{'os': 'fedora41_dfs_sort',
 'build_type': 'scripted',
 'exec_count': 1,
 'apk_type': 'apks-i-built',
 'local_apk': True,
 'filename': 'base-arm64_v8a.apk',
 'filepath': 'outputs/fedora41_dfs_sort-scripted/apks-i-built/base-arm64_v8a.apk',
 'hash': '995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a723c99dcd8e04d5e0e2',
 'size': 10699335,
 'human_size': '10MB'}

In [28]:
# ApkDiff.compare()
for index, test in pairs_df.iterrows():
    r1, r2 = test["row_1_data"], test["row_2_data"]
    print(r1)
    print(r2)
    # ApkDiff().compare(r1['filepath'], r2['filepath'])

    break

{'os': 'fedora41_dfs_sort', 'build_type': 'scripted', 'exec_count': 1, 'apk_type': 'apks-i-built', 'local_apk': True, 'filename': 'base-arm64_v8a.apk', 'filepath': 'outputs/fedora41_dfs_sort-scripted/apks-i-built/base-arm64_v8a.apk', 'hash': '995c37061435d46d5f3d7caf4fe7e1817a8b9a0d4049a723c99dcd8e04d5e0e2', 'size': 10699335, 'human_size': '10MB'}
{'os': 'ubuntu22', 'build_type': 'scripted', 'exec_count': 1, 'apk_type': 'apks-from-device', 'local_apk': False, 'filename': 'base-arm64_v8a.apk', 'filepath': 'outputs/ubuntu22-scripted-1/apks-from-device/base-arm64_v8a.apk', 'hash': 'd296fff8691b3632f85725202045875af7e89dafedc9df17a455120bff6f1791', 'size': 23569519, 'human_size': '22MB'}


In [29]:
import io
import re
from contextlib import redirect_stdout

In [30]:
apkdiff_result_path = pathlib.Path("apkdiff-results")
mismatch_folder = pathlib.Path("mismatches")

if apkdiff_result_path.exists() and apkdiff_result_path.is_dir():
    shutil.rmtree(apkdiff_result_path)
apkdiff_result_path.mkdir(parents=True, exist_ok=True)
if mismatch_folder.exists() and mismatch_folder.is_dir():
    shutil.rmtree(mismatch_folder)


def extract_filename(output: str):
    # Regular expression to extract filenames
    pattern = r"APKs differ on file ([\w/.-]+)!"
    return re.findall(pattern, output)


for index, row in pairs_df.iterrows():
    r1, r2 = row["row_1_data"], row["row_2_data"]
    l1 = r1["os"] if r1["local_apk"] else "playstore"
    l2 = r2["os"] if r2["local_apk"] else "playstore"
    test_idx_folder = apkdiff_result_path / f"{r1['filename']}_{l1}_vs_{l2}"

    print(f"Test #{index+1}: comparing {r1['filename']} -- {l1} with {l2}")

    output, res = None, False
    with io.StringIO() as buf, redirect_stdout(buf):
        res = ApkDiff().compare(r1["filepath"], r2["filepath"])
        output = buf.getvalue()

    if res:
        print("\tAPKS match [OK] ^^")
    else:
        diffs = extract_filename(output)
        print(diffs)

        # Move the mismatch folders (expected under `mismatches/`) into this comparison's results folder
        diff1_folder = mismatch_folder / "first"
        diff2_folder = mismatch_folder / "second"

        test_idx_folder.mkdir(parents=True, exist_ok=True)
        shutil.move(str(diff1_folder), str(test_idx_folder / "first"))
        shutil.move(str(diff2_folder), str(test_idx_folder / "second"))

    print(f"Output results in {test_idx_folder}")
    print("output")
    print("=====================")

if mismatch_folder.exists() and mismatch_folder.is_dir():
    shutil.rmtree(mismatch_folder)

Test #1: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with playstore
['AndroidManifest.xml']
Output results in apkdiff-results/base-arm64_v8a.apk_fedora41_dfs_sort_vs_playstore
output
Test #2: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with ubuntu22
	APKS match [OK] ^^
Output results in apkdiff-results/base-arm64_v8a.apk_fedora41_dfs_sort_vs_ubuntu22
output
Test #3: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with fedora40
	APKS match [OK] ^^
Output results in apkdiff-results/base-arm64_v8a.apk_fedora41_dfs_sort_vs_fedora40
output
Test #4: comparing base-arm64_v8a.apk -- playstore with ubuntu22
['AndroidManifest.xml']
Output results in apkdiff-results/base-arm64_v8a.apk_playstore_vs_ubuntu22
output
Test #5: comparing base-arm64_v8a.apk -- playstore with fedora40
['AndroidManifest.xml']
Output results in apkdiff-results/base-arm64_v8a.apk_playstore_vs_fedora40
output
Test #6: comparing base-arm64_v8a.apk -- ubuntu22 with fedora40
	APKS match [OK] ^^
Output results in a
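The filename extraction used in the loop above can be checked on its own. In this sketch the sample text is made up to fit the regular expression; it is not captured `apkdiff` output, and the second path is purely hypothetical.

```python
import re

# Same pattern as in the notebook cell above.
PATTERN = r"APKs differ on file ([\w/.-]+)!"

# Sample text constructed to fit the pattern (not real tool output).
sample_output = (
    "APKs differ on file AndroidManifest.xml!\n"
    "APKs differ on file res/raw/example.bin!\n"
)

# findall() returns only the captured group, i.e. the file paths.
print(re.findall(PATTERN, sample_output))
```

Note that the character class only admits word characters, `/`, `.`, and `-`, so an entry name containing spaces or other punctuation would be silently skipped.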

In [31]:
def extract_apk_comparison_results(pairs_df):
    """
    Compare APK pairs and collect comparison results into a DataFrame.

    Parameters:
    - pairs_df: DataFrame containing APK pairs to compare

    Returns:
    - results_df: DataFrame with comparison results
    """
    # Create base directories for results and mismatches
    apkdiff_result_path = pathlib.Path("apkdiff-results")
    mismatch_folder = pathlib.Path("mismatches")

    # Clean up existing directories
    if apkdiff_result_path.exists() and apkdiff_result_path.is_dir():
        shutil.rmtree(apkdiff_result_path)
    apkdiff_result_path.mkdir(parents=True, exist_ok=True)

    if mismatch_folder.exists() and mismatch_folder.is_dir():
        shutil.rmtree(mismatch_folder)

    results = []

    for index, row in pairs_df.iterrows():
        r1, r2 = row["row_1_data"], row["row_2_data"]
        l1 = r1["os"] if r1["local_apk"] else "playstore"
        l2 = r2["os"] if r2["local_apk"] else "playstore"
        assert r1["filename"] == r2["filename"]

        # Create test index folder
        test_idx_folder = apkdiff_result_path / f"{r1['filename']}_{l1}_vs_{l2}"
        print(f"Test #{index+1}: comparing {r1['filename']} -- {l1} with {l2}")

        # Capture comparison output
        with io.StringIO() as buf, redirect_stdout(buf):
            res = ApkDiff().compare(r1["filepath"], r2["filepath"])
            output = buf.getvalue()

        # Prepare result dictionary
        result = {
            "pair": row["pair"],
            "filename": r1["filename"],
            "apk1_origin": l1,
            "apk2_origin": l2,
            "match": res,
            "match_reason": "apkdiff",
            "differences": None,
            "apk1_filepath": r1["filepath"],
            "apk2_filepath": r2["filepath"],
            "apk1_hash": r1["hash"],
            "apk2_hash": r2["hash"],
            "apk1_local": r1["local_apk"],
            "apk2_local": r2["local_apk"],
            "apk1_size": r1["size"],
            "apk2_size": r2["size"],
            "results_folder": str(test_idx_folder),
        }

        # If APKs don't match, move files and capture differences
        if not res:
            print("\tAPKs DON'T match [BAD]")

            # Regular expression to extract filenames with differences
            pattern = r"APKs differ on file ([\w/.-]+)!"
            differences = re.findall(pattern, output)
            result["differences"] = differences
            result["match_reason"] = None

            # Create test index folder and move mismatch files
            diff1_folder = mismatch_folder / "first"
            diff2_folder = mismatch_folder / "second"
            test_idx_folder.mkdir(parents=True, exist_ok=True)

            try:
                shutil.move(str(diff1_folder), str(test_idx_folder / "first"))
                shutil.move(str(diff2_folder), str(test_idx_folder / "second"))
            except Exception as e:
                print(f"Error moving mismatch files: {e}")

            print(f"\tMismatch files moved to {test_idx_folder}")
        else:
            print("\tAPKs match [OK]")

        print("=====================")

        results.append(result)

    # Clean up mismatch folder if it exists
    if mismatch_folder.exists() and mismatch_folder.is_dir():
        shutil.rmtree(mismatch_folder)

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
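The function above depends on capturing text that a tool prints rather than returns. A minimal sketch of that pattern follows; `noisy_compare` is a hypothetical stand-in and does not reflect how `ApkDiff.compare` is implemented.

```python
import io
from contextlib import redirect_stdout


def noisy_compare(a: str, b: str) -> bool:
    """Hypothetical stand-in for a tool that reports via print() and returns a bool."""
    same = a == b
    if not same:
        print(f"inputs differ: {a!r} vs {b!r}")
    return same


with io.StringIO() as buf, redirect_stdout(buf):
    matched = noisy_compare("x", "y")
    captured = buf.getvalue()  # read before the buffer is closed

print(matched, repr(captured))
```

Reading the buffer inside the `with` block matters: once the `StringIO` is closed, `getvalue()` raises `ValueError`. Text written to stderr is not captured by `redirect_stdout`.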

In [51]:
apkdiff_results = extract_apk_comparison_results(pairs_df)
# Add previously observed results: some artifacts already match by SHA-256 hash
ub22_1 = df[(df["os"] == "ubuntu22") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
ub22_2 = df[(df["os"] == "ubuntu22_dfs_sort") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
ub22_3 = df[(df["os"] == "ubuntu22_dfs_sort_reverse") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
ub22 = pd.concat([ub22_1, ub22_2, ub22_3])

f41_1 = df[(df["os"] == "fedora41") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
f41_2 = df[(df["os"] == "fedora41_dfs_sort") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
f41_3 = df[(df["os"] == "fedora41_dfs_sort_reverse") & (df["local_apk"])][
    ["os", "filename", "filepath", "hash", "size"]
]
f41 = pd.concat([f41_1, f41_2, f41_3])
merged_hashes = pd.merge(f41, ub22, on=["filename", "hash"], how="inner")


def create_match(filename: str, origin1, origin2, hash1, hash2, fp1, fp2, size1, size2):
    data = {
        "pair": (-1, -1),
        "filename": filename,
        "apk1_origin": origin1,
        "apk2_origin": origin2,
        "match": hash1 == hash2,
        "match_reason": "sha256",
        "differences": None,
        "apk1_filepath": fp1,
        "apk2_filepath": fp2,
        "apk1_hash": hash1,
        "apk2_hash": hash2,
        "apk1_local": True,
        "apk2_local": True,
        "apk1_size": size1,
        "apk2_size": size2,
        "results_folder": None,
    }
    apkdiff_results.loc[len(apkdiff_results)] = data


for index, row in merged_hashes.iterrows():
    create_match(
        row["filename"],
        row["os_x"],
        row["os_y"],
        row["hash"],
        row["hash"],
        row["filepath_x"],
        row["filepath_y"],
        row["size_x"],
        row["size_y"],
    )
print("apkdiff dataframe is done, ready for analysis")

Test #1: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with playstore
	APKs DON'T match [BAD]
	Mismatch files moved to apkdiff-results/base-arm64_v8a.apk_fedora41_dfs_sort_vs_playstore
Test #2: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with ubuntu22
	APKs match [OK]
Test #3: comparing base-arm64_v8a.apk -- fedora41_dfs_sort with fedora40
	APKs match [OK]
Test #4: comparing base-arm64_v8a.apk -- playstore with ubuntu22
	APKs DON'T match [BAD]
	Mismatch files moved to apkdiff-results/base-arm64_v8a.apk_playstore_vs_ubuntu22
Test #5: comparing base-arm64_v8a.apk -- playstore with fedora40
	APKs DON'T match [BAD]
	Mismatch files moved to apkdiff-results/base-arm64_v8a.apk_playstore_vs_fedora40
Test #6: comparing base-arm64_v8a.apk -- ubuntu22 with fedora40
	APKs match [OK]
Test #7: comparing base-master.apk -- fedora41_dfs_sort with playstore
	APKs DON'T match [BAD]
	Mismatch files moved to apkdiff-results/base-master.apk_fedora41_dfs_sort_vs_playstore
Test #8: comparing base-m
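The hash-based matches added in the cell above come from an inner join on `filename` and `hash`. The following toy example, with invented hosts and hashes, sketches how such a join keeps only the files that are byte-identical on both sides.

```python
import pandas as pd

left = pd.DataFrame(
    {"os": ["hostA", "hostA"], "filename": ["a.apk", "b.apk"], "hash": ["1111", "2222"]}
)
right = pd.DataFrame(
    {"os": ["hostB", "hostB"], "filename": ["a.apk", "b.apk"], "hash": ["1111", "9999"]}
)

# Rows survive only when both filename and hash agree; the non-key "os"
# column is duplicated with the default _x / _y suffixes.
shared = pd.merge(left, right, on=["filename", "hash"], how="inner")
print(shared[["filename", "os_x", "os_y"]])
```

Only `a.apk` remains, because `b.apk` has a different hash on each host. The `_x` and `_y` suffixes are why the notebook reads `os_x`, `filepath_x`, and so on from `merged_hashes`.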

In [52]:
# Select the summary columns without mutating `apkdiff_results`, so that later
# cells can still access "filename" and "pair" as regular columns.
filtered = apkdiff_results[
    ["filename", "pair", "apk1_origin", "apk2_origin", "match", "match_reason", "differences"]
]
filtered.head()


> Did anyone match the playstore apk?

In [49]:
matches_playstore = apkdiff_results[
    (
        apkdiff_results["match"]
        & (apkdiff_results["apk1_local"] ^ apkdiff_results["apk2_local"])
    )
].set_index("filename")
if matches_playstore.empty:
    print("No local (split) apk matched playstore (split) apk")
matches_playstore


> No luck :c
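The filter above uses XOR on the two `*_local` flags to pick comparisons where exactly one side was built locally, i.e. local versus Play Store. A toy sketch of that mask, with made-up rows that do not reflect the real results:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "match": [True, True, False],
        "apk1_local": [True, True, False],
        "apk2_local": [False, True, True],
    }
)

# XOR is true only when exactly one side of the pair was built locally.
mixed_origin = toy["apk1_local"] ^ toy["apk2_local"]

# Keep pairs that both match and mix a local build with a non-local one.
local_vs_store_matches = toy[toy["match"] & mixed_origin]
print(local_vs_store_matches)
```

In this toy table only the first row survives: the second row compares two local builds, and the third does not match.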

In [25]:
apkdiff_results[
    (
        apkdiff_results["match"]
        & (apkdiff_results["apk1_local"] & apkdiff_results["apk2_local"])
    )
].set_index(["filename"])

filename,pair,apk1_origin,apk2_origin,match,match_reason,differences,apk1_filepath,apk2_filepath,apk1_hash,apk2_hash,apk1_local,apk2_local,apk1_size,apk2_size,results_folder
base-arm64_v8a.apk,"(1, 2)",fedora40,fedora41,True,apkdiff,,outputs/fedora40-scripted-1/apks-i-built/base-...,outputs/fedora41-scripted-1/apks-i-built/base-...,a74d526fd13f4fe3bda7e066c575fa0c931d8da7333b8d...,4832f3a837a4059db7476c45d995045b0c561388234335...,True,True,23561264,23552089,apkdiff-results/base-arm64_v8a.apk_fedora40_vs...
base-master.apk,"(4, 5)",fedora40,fedora41,True,apkdiff,,outputs/fedora40-scripted-1/apks-i-built/base-...,outputs/fedora41-scripted-1/apks-i-built/base-...,4768602685c6e5ceff078093c87e9aba5f19597b0d76c5...,dbe76a3649f90f061170c6d3d2071bcddc3c32672bebc2...,True,True,85021050,84693087,apkdiff-results/base-master.apk_fedora40_vs_fe...
base-xxhdpi.apk,"(8, 9)",fedora40,fedora41,True,apkdiff,,outputs/fedora40-scripted-1/apks-i-built/base-...,outputs/fedora41-scripted-1/apks-i-built/base-...,51119b048c0e03525cbb253ed7c19974738b58804908d3...,41b12629671ba0778364d671909add3a94380955c5c366...,True,True,1658464,1637689,apkdiff-results/base-xxhdpi.apk_fedora40_vs_fe...
base-arm64_v8a.apk,"(-1, -1)",fedora41,ubuntu22,True,sha256,,outputs/fedora41-scripted-1/apks-i-built/base-...,outputs/ubuntu22-scripted-1/apks-i-built/base-...,4832f3a837a4059db7476c45d995045b0c561388234335...,4832f3a837a4059db7476c45d995045b0c561388234335...,True,True,23552089,23552089,
base-xxhdpi.apk,"(-1, -1)",fedora41,ubuntu22,True,sha256,,outputs/fedora41-scripted-1/apks-i-built/base-...,outputs/ubuntu22-scripted-1/apks-i-built/base-...,41b12629671ba0778364d671909add3a94380955c5c366...,41b12629671ba0778364d671909add3a94380955c5c366...,True,True,1637689,1637689,


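So: the `fedora40` and `fedora41` builds are equivalent according to apkdiff, even though their SHA-256 hashes differ. `fedora41` and `ubuntu22` produce byte-identical `base-arm64_v8a.apk` and `base-xxhdpi.apk` files (same SHA-256), but `base-master.apk` differs between Ubuntu and Fedora. Either way, no local build matches the Play Store APKs.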
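
The difference between an `apkdiff` match and a `sha256` match in the table above can be illustrated with a minimal sketch. It compares two APKs entry by entry and skips entries that only carry signing data, so two files with different overall hashes can still count as equivalent. The ignore rules (`IGNORED_PREFIXES`, `IGNORED_NAMES`) and the function names are assumptions made for illustration; they are not the list used by the apkdiff script in this notebook.

```python
import hashlib
import zipfile

# Assumption: entries that only carry signing data are skipped. These rules are
# illustrative, not the ignore list of the apkdiff script used in this notebook.
IGNORED_PREFIXES = ("META-INF/",)
IGNORED_NAMES = {"stamp-cert-sha256"}


def entry_digests(apk):
    """Map each non-ignored zip entry name to the SHA-256 of its content."""
    digests = {}
    with zipfile.ZipFile(apk) as archive:
        for info in archive.infolist():
            name = info.filename
            if name in IGNORED_NAMES or name.startswith(IGNORED_PREFIXES):
                continue
            digests[name] = hashlib.sha256(archive.read(info)).hexdigest()
    return digests


def differing_entries(apk1, apk2):
    """Return sorted entry names whose content differs or that exist on one side only."""
    first, second = entry_digests(apk1), entry_digests(apk2)
    return sorted(
        name
        for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )
```

An empty list from `differing_entries` would correspond to a content-level match, while a `sha256` match means the two files are identical byte for byte, signing data included.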

TODO: find out why the builds diverge; run apkdiff over the full matrix of diverging APK pairs and inspect the results.
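
One way to enumerate that matrix is to generate every unordered pair of origins for each APK file. This is only a sketch: `comparison_pairs` is a hypothetical helper, and the `builds` mapping is filled in by hand with the origin names that appear in the tables above.

```python
from itertools import combinations


def comparison_pairs(builds):
    """Yield (filename, origin_a, origin_b) for every unordered pair of origins.

    builds maps an APK filename to the list of origins that provide that file.
    """
    pairs = []
    for filename, origins in sorted(builds.items()):
        for first, second in combinations(origins, 2):
            pairs.append((filename, first, second))
    return pairs


# Hand-written example input using the origins seen in this notebook.
builds = {"base-master.apk": ["playstore", "fedora40", "fedora41", "ubuntu22"]}
for filename, first, second in comparison_pairs(builds):
    print(f"{filename}: {first} vs {second}")
```

With four origins this gives six comparisons per file, which matches the six `base-master.apk` pairs that appear across the two tables above.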

In [26]:
mismatched = apkdiff_results[~apkdiff_results["match"]]
mismatched

,pair,filename,apk1_origin,apk2_origin,match,match_reason,differences,apk1_filepath,apk2_filepath,apk1_hash,apk2_hash,apk1_local,apk2_local,apk1_size,apk2_size,results_folder
0,"(0, 1)",base-arm64_v8a.apk,playstore,fedora40,False,,[AndroidManifest.xml],outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora40-scripted-1/apks-i-built/base-...,d296fff8691b3632f85725202045875af7e89dafedc9df...,a74d526fd13f4fe3bda7e066c575fa0c931d8da7333b8d...,False,True,23569519,23561264,apkdiff-results/base-arm64_v8a.apk_playstore_v...
1,"(0, 2)",base-arm64_v8a.apk,playstore,fedora41,False,,[AndroidManifest.xml],outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora41-scripted-1/apks-i-built/base-...,d296fff8691b3632f85725202045875af7e89dafedc9df...,4832f3a837a4059db7476c45d995045b0c561388234335...,False,True,23569519,23552089,apkdiff-results/base-arm64_v8a.apk_playstore_v...
3,"(3, 4)",base-master.apk,playstore,fedora40,False,,"[AndroidManifest.xml, classes2.dex, classes3.d...",outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora40-scripted-1/apks-i-built/base-...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,4768602685c6e5ceff078093c87e9aba5f19597b0d76c5...,False,True,85037580,85021050,apkdiff-results/base-master.apk_playstore_vs_f...
4,"(3, 5)",base-master.apk,playstore,fedora41,False,,"[AndroidManifest.xml, classes2.dex, classes3.d...",outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora41-scripted-1/apks-i-built/base-...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,dbe76a3649f90f061170c6d3d2071bcddc3c32672bebc2...,False,True,85037580,84693087,apkdiff-results/base-master.apk_playstore_vs_f...
5,"(3, 6)",base-master.apk,playstore,ubuntu22,False,,"[AndroidManifest.xml, classes3.dex, classes4.d...",outputs/fedora40-scripted-1/apks-from-device/b...,outputs/ubuntu22-scripted-1/apks-i-built/base-...,eb1902b3aa98e15a140e9de574dfa47b78bf8bb85fc556...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,False,True,85037580,84693087,apkdiff-results/base-master.apk_playstore_vs_u...
7,"(4, 6)",base-master.apk,fedora40,ubuntu22,False,,"[classes2.dex, classes3.dex, classes4.dex, cla...",outputs/fedora40-scripted-1/apks-i-built/base-...,outputs/ubuntu22-scripted-1/apks-i-built/base-...,4768602685c6e5ceff078093c87e9aba5f19597b0d76c5...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,True,True,85021050,84693087,apkdiff-results/base-master.apk_fedora40_vs_ub...
8,"(5, 6)",base-master.apk,fedora41,ubuntu22,False,,"[classes2.dex, classes3.dex, classes4.dex, cla...",outputs/fedora41-scripted-1/apks-i-built/base-...,outputs/ubuntu22-scripted-1/apks-i-built/base-...,dbe76a3649f90f061170c6d3d2071bcddc3c32672bebc2...,fa1e7eec6dc3e634aae9ccba817d04db46b46c10ca4556...,True,True,84693087,84693087,apkdiff-results/base-master.apk_fedora41_vs_ub...
9,"(7, 8)",base-xxhdpi.apk,playstore,fedora40,False,,"[AndroidManifest.xml, resources.arsc]",outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora40-scripted-1/apks-i-built/base-...,c65a2691cde28650f8934d492bff6c241b2e55eee4d222...,51119b048c0e03525cbb253ed7c19974738b58804908d3...,False,True,1666719,1658464,apkdiff-results/base-xxhdpi.apk_playstore_vs_f...
10,"(7, 9)",base-xxhdpi.apk,playstore,fedora41,False,,"[AndroidManifest.xml, resources.arsc]",outputs/fedora40-scripted-1/apks-from-device/b...,outputs/fedora41-scripted-1/apks-i-built/base-...,c65a2691cde28650f8934d492bff6c241b2e55eee4d222...,41b12629671ba0778364d671909add3a94380955c5c366...,False,True,1666719,1637689,apkdiff-results/base-xxhdpi.apk_playstore_vs_f...


## Diffuse

In [27]:
import zipfile

import requests

In [28]:
# Create directories
diffuse_dir = pathlib.Path("tools/diffuse")
diffuse_dir.mkdir(parents=True, exist_ok=True)

diffuse_results_dir = pathlib.Path("diffuse-results")
diffuse_results_dir.mkdir(parents=True, exist_ok=True)
# Download diffuse manually and unpack it into tools/diffuse:
# https://github.com/JakeWharton/diffuse/releases/latest

In [29]:
!./tools/diffuse/bin/diffuse diff -h

[33mUsage:[39m diffuse diff [2m[<options>][22m <old> <new>

  Display changes between two binaries.

[33mInput options:[39m
  [36m--apk[39m, [36m--aar[39m, [36m--aab[39m, [36m--jar[39m, [36m--dex[39m  Input file type. Default is 'apk'.
  [36m--old-mapping[39m=[33;2m<file>[39;22m               Mapping file produced by R8 or ProGuard.
  [36m--new-mapping[39m=[33;2m<file>[39;22m               Mapping file produced by R8 or ProGuard.

[33mOutput options:[39m
  [36m--text[39m=[33;2m<file>[39;22m         File to write text report. Note: Specifying this option
                        will disable printing the text report to standard out by
                        default. Specify '--stdout text' to restore that
                        behavior.
  [36m--html[39m=[33;2m<file>[39;22m         File to write HTML report. Note: Specifying this option
                        will disable printing the text report to standard out by
                        default. Spe

In [30]:
import subprocess
from contextlib import redirect_stderr, redirect_stdout

In [31]:
mismatched[~(mismatched["filename"].str.contains("arm"))].shape

(7, 16)

In [32]:
# for index, row in mismatched[mismatched["filename"].str.contains("master")].iterrows():
# diffuse can only compare the apk that has a resources.arsc entry
for index, row in mismatched[~(mismatched["filename"].str.contains("arm"))].iterrows():
    print("==============================")
    print(
        f"./diffuse check between {row['filename']} {row['apk1_origin']} ({row['apk1_filepath']}) and {row['apk2_origin']} ({row['apk2_filepath']})"
    )
    name = (
        f"diffuse_{row['filename']}__{row['apk1_origin']}_vs_{row['apk2_origin']}.txt"
    )

    result_file_path = str(diffuse_results_dir / name)

    try:
        # Run diffuse and capture its stdout and stderr together
        output = subprocess.check_output(
            [
                "./tools/diffuse/bin/diffuse",
                "diff",
                row["apk1_filepath"],
                row["apk2_filepath"],
            ],
            stderr=subprocess.STDOUT,
        )

        # Decode the captured bytes and strip leading/trailing whitespace
        output = output.decode("utf-8").strip()
        print(len(output))  # length of the report in characters
        # Check if the output is empty
        if not output:
            print(f"The output file {name} is empty.")
        else:
            # Replace "OLD" and "NEW" APK names in the output
            output = output.replace(
                f"OLD: {row['filename']}", f"OLD: {row['apk1_filepath']}"
            )
            output = output.replace(
                f"NEW: {row['filename']}", f"NEW: {row['apk2_filepath']}"
            )

            with open(result_file_path, "w") as file:
                file.write(output)
                # print(
                #     f"Captured output from the command:\n{output[:100]}..."
                # )  # Print first 100 characters of captured output
        # break
    except subprocess.CalledProcessError as e:
        print(f"An error occurred while running diffuse for {row['filename']}: {e}")

./diffuse check between base-master.apk playstore (outputs/fedora40-scripted-1/apks-from-device/base-master.apk) and fedora40 (outputs/fedora40-scripted-1/apks-i-built/base-master.apk)
4723
./diffuse check between base-master.apk playstore (outputs/fedora40-scripted-1/apks-from-device/base-master.apk) and fedora41 (outputs/fedora41-scripted-1/apks-i-built/base-master.apk)
4580
./diffuse check between base-master.apk playstore (outputs/fedora40-scripted-1/apks-from-device/base-master.apk) and ubuntu22 (outputs/ubuntu22-scripted-1/apks-i-built/base-master.apk)
4488
./diffuse check between base-master.apk fedora40 (outputs/fedora40-scripted-1/apks-i-built/base-master.apk) and ubuntu22 (outputs/ubuntu22-scripted-1/apks-i-built/base-master.apk)
3227
./diffuse check between base-master.apk fedora41 (outputs/fedora41-scripted-1/apks-i-built/base-master.apk) and ubuntu22 (outputs/ubuntu22-scripted-1/apks-i-built/base-master.apk)
2436
./diffuse check between base-xxhdpi.apk playstore (outputs/f

In [33]:
# !diffoscope --html diffoscope.html outputs/fedora40-manual/apks-i-built/base-arm64_v8a.apk outputs/fedora40-manual/apks-from-device/base-arm64_v8a.apk

In [34]:
from bs4 import BeautifulSoup


def clean_diffoscope_report(infile, outfile):
    # Define the ignore list (this could be keywords, specific file names, etc.)
    ignore_list = [
        "APK Signing Block",
        "zipinfo",
        "apksigner",
        # Add more ignore patterns here as needed
    ]
    ignore_list.extend(ApkDiff.IGNORE_FILES)

    # Function to check if the section should be ignored
    def should_ignore(section):
        text = section.text.lower()
        # print(text)
        # print("++++++++++++++++++++++++++++++++++++++++++++")
        for pattern in ignore_list:
            if pattern.lower() in text:
                print(f"{pattern} in text!")
                return True
        return False

    # Load the HTML file
    with open(infile, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "lxml")

    # Find all sections with the class 'difference' (or any other class you're interested in)
    parent_difference_div = soup.body.find("div", class_="difference")

    # If the parent div exists, proceed to find its children with class 'difference'
    if parent_difference_div:
        # Find all child divs with class 'difference' within this parent div
        difference_sections = parent_difference_div.find_all("div", class_="difference")

        # Iterate over all the 'difference' sections
        for section in difference_sections:
            if should_ignore(section):
                section.decompose()  # Remove the section from the tree

    # Save the modified HTML back to a file
    with open(outfile, "w", encoding="utf-8") as file:
        file.write(str(soup))
    print(f"Report updated and saved as '{outfile}'.")

In [35]:
# clean_diffoscope_report("diffoscope.html", "sanitized.html")

In [36]:
diffoscope_results_dir = pathlib.Path("diffoscope-results2")
diffoscope_results_dir.mkdir(parents=True, exist_ok=True)

# for index, row in mismatched.iterrows():
#     print("==============================")
#     print(
#         f"./diffoscope check between {row['filename']} {row['apk1_origin']} ({row['apk1_filepath']}) and {row['apk2_origin']} ({row['apk2_filepath']})"
#     )
#     name = f"diffoscope_{row['filename']}__{row['apk1_origin']}_vs_{row['apk2_origin']}"
#     orig_name = f"{name}.orig.html"
#     name += ".html"

#     !diffoscope --html {str(diffoscope_results_dir / orig_name)} {row['apk1_filepath']} {row['apk2_filepath']}
#     print("Ran diffoscope. Now filtering...")
#     clean_diffoscope_report(str(diffoscope_results_dir / orig_name), str(diffoscope_results_dir / name))
print("> Wrote all diffoscope results.")

> Wrote all diffoscope results.


In [37]:
# compare resources
!aapt2 dump chunks outputs/fedora40-scripted-1/apks-from-device/base-xxhdpi.apk > r1
!aapt2 dump chunks outputs/fedora41-scripted-1/apks-i-built/base-xxhdpi.apk > r2
!diff r1 r2
!apkanalyzer apk compare --different-only outputs/fedora40-scripted-1/apks-from-device/base-xxhdpi.apk outputs/fedora41-scripted-1/apks-i-built/base-xxhdpi.apk

1666719	1637689	-29030	/
32	0	-32	/stamp-cert-sha256
1052	808	-244	/AndroidManifest.xml
44149	0	-44149	/META-INF/
1019	0	-1019	/META-INF/BNDLTOOL.RSA
21511	0	-21511	/META-INF/MANIFEST.MF
21619	0	-21619	/META-INF/BNDLTOOL.SF


In [38]:
!aapt2 dump resources outputs/fedora40-scripted-1/apks-from-device/base-xxhdpi.apk > r1
!aapt2 dump resources outputs/fedora41-scripted-1/apks-i-built/base-xxhdpi.apk > r2
!diff r1 r2

In [39]:
!apkanalyzer compare packages outputs/fedora40-scripted-1/apks-from-device/base-xxhdpi.apk # outputs/fedora41-scripted-1/apks-i-built/base-xxhdpi.apk

Subject must be one of: apk, files, manifest, dex, resources

apk summary              Prints the application Id, version code and version name.
apk file-size            Prints the file size of the APK.                         
apk download-size        Prints an estimate of the download size of the APK.      
apk features             Prints features used by the APK.                         
apk compare              Compares the sizes of two APKs.                          
files list               Lists all files in the zip.                              
files cat                Prints the given file contents to stdout                 
manifest print           Prints the manifest in XML format                        
manifest application-id  Prints the application id.                               
manifest version-name    Prints the version name.                                 
manifest version-code    Prints the version code.                                 
manifest min-sdk         

In [40]:
!ls apkdiff-results/base-master.apk_playstore_vs_fedora40/first/classes2.dex

apkdiff-results/base-master.apk_playstore_vs_fedora40/first/classes2.dex


In [41]:
!dex-diff outputs/fedora40-scripted-1/apks-from-device/base-master.apk outputs/fedora40-scripted-1/apks-i-built/base-master.apk

⚔️ dex-diff v0.1.2
🧠 Heap size: 16384 MB
🚀 Initialising...
➡️ Decompiling before APK... (this may take some time)
🙌 decompiling base-master.apk skipped as cache exist
✅ Decompiling before APK finished
➡️ Decompiling after APK... (this may take some time)
🙌 decompiling base-master.apk skipped as cache exist
✅ Decompiling after APK finished
✅ Decompile finished (1920ms)
🙌 skipping new report file generation as cache exist
✅ Report ready (2.41s) -> file:///home/adrian/Dev/Shenanigans/reproducible-stuff/dex-diff-result/6a2e31ff088ecd970fe82fa52464eb0b_0070f5b6ad9290c0fe97211f9d459fee_report.html 


In [42]:
assert 1 == 4

AssertionError: 

============

In [43]:
!pip install dexparser


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [44]:
from dexparser import DEXParser

with open(
    "apkdiff-results/base-master.apk_playstore_vs_fedora40/first/classes2.dex", "rb"
) as fileobj:
    dex = DEXParser(fileobj=fileobj.read())

In [None]:
# dexdiff (from Google) is too verbose to be analyzed easily.
# dex-diff produces nice-looking reports, but it relies on deprecated libraries and emits a lot of
# decompiled code, which makes it unsuitable for a git repo (except maybe in CI).

In [50]:
dex.header

{'magic': b'dex\n035\x00',
 'checksum': 3783480039,
 'signiture': b'\xc6b\xb71l+\xa0E8\x91\x1e\xba\xa7\xe2;\x11\xc4*0\xfe',
 'file_size': 10835656,
 'header_size': 112,
 'endian_tag': 305419896,
 'link_size': 0,
 'link_off': 0,
 'map_off': 10835436,
 'string_ids_size': 77189,
 'string_ids_off': 112,
 'type_ids_size': 13638,
 'type_ids_off': 308868,
 'proto_ids_size': 18241,
 'proto_ids_off': 363420,
 'field_ids_size': 33484,
 'field_ids_off': 582312,
 'method_ids_size': 65248,
 'method_ids_off': 850184,
 'class_defs_size': 10126,
 'class_defs_off': 1372168,
 'data_size': 9139456,
 'data_off': 1696200}

Error comparing files: 'DEXParser' object has no attribute 'items'


Traceback (most recent call last):
  File "/tmp/ipykernel_1484695/454505238.py", line 229, in main
    comparison_results = comparator.compare_dex_files()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipykernel_1484695/454505238.py", line 113, in compare_dex_files
    for (name1, dex1), (name2, dex2) in zip(self.dex_files1.items(), self.dex_files2.items()):
                                            ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DEXParser' object has no attribute 'items'
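
As a lighter alternative to a full DEX diff, the header fields shown above can be read directly with the standard library and compared between two `classes*.dex` files. The sketch below assumes the standard 112-byte little-endian DEX header layout; `parse_dex_header` and `diff_headers` are hypothetical helpers, not part of `dexparser`, and the key `signature` is spelled differently from the `signiture` key that `dexparser` returns.

```python
import struct

# 112-byte DEX header, little-endian: 8-byte magic, 32-bit checksum,
# 20-byte SHA-1 signature, then twenty 32-bit unsigned fields.
HEADER_STRUCT = struct.Struct("<8sI20s20I")
UINT_FIELDS = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off", "class_defs_size",
    "class_defs_off", "data_size", "data_off",
)


def parse_dex_header(data):
    """Parse the fixed-size header from the first bytes of a DEX file."""
    magic, checksum, signature, *rest = HEADER_STRUCT.unpack_from(data)
    header = {"magic": magic, "checksum": checksum, "signature": signature}
    header.update(zip(UINT_FIELDS, rest))
    return header


def diff_headers(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

Differences in the `*_size` fields (for example `string_ids_size` or `class_defs_size`) would point to added or removed items, whereas a differing `checksum` with identical sizes would suggest reordering or changed contents.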


In [45]:
# dex.get_classdef_data()
typeids = dex.get_typeids()
strings = dex.get_strings()
for class_def in dex.get_classdef_data():
    type_id = typeids[class_def["class_idx"]]
    class_name = strings[type_id]
    print(class_def)
    print(class_name)
    # print("Parent:", strings[1935])
    break

{'class_idx': 817, 'access': ['public', 'abstract', 'synthetic'], 'superclass_idx': 1935, 'interfaces_off': 0, 'source_file_idx': 36293, 'annotation_off': 0, 'class_data_off': 9846208, 'static_values_off': 0}
b'Lcom/android/tools/r8/RecordTag;'
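
Once class descriptors like the one above have been collected for two DEX files, a plain set difference shows which classes exist on only one side. This is a sketch: `diff_class_descriptors` is a hypothetical helper, and apart from `RecordTag` the descriptors in the example are made up.

```python
def diff_class_descriptors(old_descriptors, new_descriptors):
    """Return the class descriptors that appear on only one side."""
    old_set, new_set = set(old_descriptors), set(new_descriptors)
    return {
        "removed": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
    }


# Example input; only the RecordTag descriptor comes from the output above.
old = [b"Lcom/android/tools/r8/RecordTag;", b"Lexample/OnlyInOld;"]
new = [b"Lcom/android/tools/r8/RecordTag;", b"Lexample/OnlyInNew;"]
print(diff_class_descriptors(old, new))
```

This only catches classes that were added, removed, or renamed; classes with the same descriptor but different bytecode need a method-level comparison.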


In [None]:
for index, row in mismatched.iterrows():
    print("==============================")
    print(
        f"./dex-diff check between {row['filename']} {row['apk1_origin']} ({row['apk1_filepath']}) and {row['apk2_origin']} ({row['apk2_filepath']})"
    )
    name = f"dex-diff_{row['filename']}__{row['apk1_origin']}_vs_{row['apk2_origin']}"
    orig_name = f"{name}.orig.html"
    name += ".html"

    !dex-diff {row['apk1_filepath']} {row['apk2_filepath']}
    # !mkdir -p dex-diff-result/{name}
    # !mv dex-diff-result/*.html dex-diff-result/{name}
    # clean_diffoscope_report(str(diffoscope_results_dir / orig_name), str(diffoscope_results_dir / name))
print("> Wrote all dex-diff results.")

In [None]:
# dex_analysis = analyze_dex_files(['./apkdiff-results/base-master.apk_playstore_vs_fedora40/first/classes2.dex', './apkdiff-results/base-master.apk_playstore_vs_fedora40/second/classes2.dex'])

## Conclusions

Based on the analysis above, we can draw the following conclusions:

1. Build Reproducibility: [This will be filled based on actual results]
2. Size Consistency: [This will be filled based on actual results]
3. Environment Impact: [This will be filled based on actual results]

### Recommendations

Based on these findings, here are some recommendations for improving build reproducibility:

1. [Will be filled based on actual results]
2. [Will be filled based on actual results]
3. [Will be filled based on actual results]