## 🧪 White-box Testing: Data Flow Coverage under Different Python Versions

### 1️⃣ Background and Objectives
> This notebook evaluates whether `pickle.dumps()` produces byte-identical output in data flow scenarios across two **Python versions** (3.12.4 vs. 3.7.12) on the same system. We focus on internal data definitions and uses (e.g., `dispatch`, `memo`, function lookups).

### 2️⃣ Environment Information
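> We record the operating system and Python version in use. Python 3.7 predates Windows 11 and reports it as release `10` through `platform.release()`, so the differing OS strings below are consistent with a single machine.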

In [1]:
import platform
import sys

def print_environment_info():
    print("📌 Current operating system:", platform.system(), platform.release())
    print("📌 Python version:", sys.version)

In [2]:
print_environment_info()

📌 Current operating system: Windows 10
📌 Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:35:01) [MSC v.1916 64 bit (AMD64)]


In [3]:
print_environment_info()

📌 Current operating system: Windows 11
📌 Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:03:56) [MSC v.1929 64 bit (AMD64)]


### 3️⃣ Test Cases and Input Structures
> Each input structure is mapped to a specific def-use path inside the pickler (e.g., a shared list that triggers a memo lookup vs. distinct lists that do not).

In [4]:
import pickle
import hashlib

def get_hash(obj):
    """Return the SHA256 hash value of the object after pickle serialization"""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

In [5]:
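import os
import sys

def save_result_by_python_version(case_name, value_hash, file_path="dataflow_python_hashes.txt"):
    """Insert or update this interpreter's hash for case_name in the results file."""
    # Major.minor version of the running interpreter, e.g. "3.7" or "3.12"
    py_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    block = f"{case_name}\n{py_version.ljust(10)}Result: {value_hash}\n"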

    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
    else:
        content = ""

    blocks = content.strip().split("\n\n") if content else []
    updated = False

    for i in range(len(blocks)):
        if blocks[i].startswith(case_name):
            lines = blocks[i].split("\n")
            # Match the padded version column so "3.1" does not also drop "3.12" entries
            lines = [line for line in lines if not line.startswith(py_version.ljust(10))]
            lines.append(f"{py_version.ljust(10)}Result: {value_hash}")
            blocks[i] = "\n".join(lines)
            updated = True
            break

    if not updated:
        blocks.append(block.strip())

    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks) + "\n")

In [6]:
def run_dataflow_tests_by_version():
    save_result_by_python_version("TC_DF_01 dispatch[int] used", get_hash(123))

    a = [1, 2]
    obj_memo_used = [a, a]
    save_result_by_python_version("TC_DF_02 memo reused", get_hash(obj_memo_used))

    obj_memo_not_used = [[1, 2], [3, 4]]
    save_result_by_python_version("TC_DF_03 memo not reused", get_hash(obj_memo_not_used))

    save_result_by_python_version("TC_DF_04 dispatch function call", get_hash({"a": 1}))


In [7]:
run_dataflow_tests_by_version()

In [8]:
def print_version_results(file_path="dataflow_python_hashes.txt"):
    if not os.path.exists(file_path):
        print("❌ No result file found.")
        return
    with open(file_path, "r", encoding="utf-8") as f:
        print("✅ Recorded results:\n")
        print(f.read())

### 4️⃣ Version-specific Hash Results
> Hash outputs for each test case executed under Python 3.7 and Python 3.12.

In [9]:
print_version_results("dataflow_python_hashes.txt")

✅ Recorded results:

TC_DF_01 dispatch[int] used
3.7       Result: ca9493975e3875030e2d5a5c2265f13827a049d4473f62a448d71c05cd0e41ce
3.12      Result: b78afd939a4aef912cfa7945f436bb5de305a4dc69cae7af84ddd948519f3a31

TC_DF_02 memo reused
3.7       Result: 3bae964ad9614c65791f0d9418f31a10d5f8f562095ddeff4250a538cb8bf848
3.12      Result: fd6f1c67c6a01a87b514543fcc927e76d2d0cea7fdedcd6fb96e63d781f99a2f

TC_DF_03 memo not reused
3.7       Result: 3b041a7e32dcde82e4c5436d46030883002f558fcd7910bde65a2846458b5e3a
3.12      Result: b24b57aa92344c16c3b542410745c70f62724188e914967ba94565ffb3821eee

TC_DF_04 dispatch function call
3.7       Result: b393a0717c5c0feabe9f330343c4443d6596fd8fcd17d2e70683e19d72e97cd5
3.12      Result: 02985f17a95b8ad0ca0b37d9659510b998086e750bf325b5b11465734d053bec



### 5️⃣ Consistency Analysis and Divergence Detection

All data flow test cases were executed on the same system using **Python 3.7.12** and **Python 3.12.4**.  
The serialized outputs from `pickle.dumps()` were hashed with SHA256 to detect binary-level consistency across versions.

**Result:** Every test case produced **different hashes** between Python versions.

#### ⚠️ Test cases with inconsistent hashes:

| Test Case ID     | Description                          |
|------------------|--------------------------------------|
| TC_DF_01         | dispatch[int] used (int type)        |
| TC_DF_02         | memo reused (repeated list reference)|
| TC_DF_03         | memo not reused (distinct lists)     |
| TC_DF_04         | dispatch function call               |

<br>

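> These differences show that byte-level consistency is not guaranteed across versions, even if functional behavior remains the same. A likely cause is the default protocol: Python 3.7 defaults to protocol 3, while Python 3.8 and later default to protocol 4, which changes the protocol byte in the header, can add framing, and encodes memo writes with a different opcode. The hashes alone therefore do not show that the dispatch or memoization logic itself changed.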
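
One way to see where two byte streams diverge is to disassemble them instead of comparing hashes. The sketch below is an illustration, not part of the original test run: it uses the standard `pickletools.genops` function to list the opcodes emitted for the TC_DF_02 input at protocols 3 and 4. The helper name `opcode_names` is hypothetical, and the comparison runs inside a single interpreter, so it isolates protocol effects only.

```python
import pickle
import pickletools

def opcode_names(obj, protocol):
    """Return the opcode names in the pickle stream of obj for one protocol."""
    data = pickle.dumps(obj, protocol=protocol)
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(data)]

a = [1, 2]
shared = [a, a]  # same input as TC_DF_02: the second element is read back from the memo

for protocol in (3, 4):
    print(f"protocol {protocol}:", opcode_names(shared, protocol))
```

In both streams the second reference to `a` is emitted as a memo read (`BINGET`), while the memo writes use different opcodes (`BINPUT` for protocol 3, `MEMOIZE` for protocol 4).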


### 6️⃣ Conclusions and Findings

The `pickle` module, called with default settings, does not yield hash-identical outputs across the two Python versions, even when internal def-use paths (e.g., memoization, dispatch dictionaries) follow the same logic.

#### 🔍 Key Findings:

- ❌ All four def-use test cases showed binary-level divergence;
- 🔁 memo writes and reference tracking are encoded differently in the serialized form;
- ⚙️ dispatch lookups executed as expected, but the serialized byte layout changed.
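
The claim that functional behavior is preserved can be checked with a round trip. The following is a minimal sketch, assuming the TC_DF_02 and TC_DF_03 inputs and a hypothetical helper `round_trip`. It runs within one interpreter, so it verifies behavior per protocol rather than across Python versions.

```python
import pickle

def round_trip(obj, protocol):
    """Serialize and immediately deserialize obj with the given protocol."""
    return pickle.loads(pickle.dumps(obj, protocol=protocol))

a = [1, 2]
shared = [a, a]                # TC_DF_02: one list referenced twice
distinct = [[1, 2], [3, 4]]    # TC_DF_03: two separate lists

for protocol in (3, 4):
    restored_shared = round_trip(shared, protocol)
    restored_distinct = round_trip(distinct, protocol)

    # Values survive the round trip
    assert restored_shared == shared
    assert restored_distinct == distinct

    # Identity structure survives too: the memo restores the shared reference
    assert restored_shared[0] is restored_shared[1]
    assert restored_distinct[0] is not restored_distinct[1]

print("round-trip checks passed for protocols 3 and 4")
```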

#### ⚠️ Limitations:

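- Def-use scenarios were chosen manually; no automated data flow tracing was used;
- Only the default pickle protocol was tested, and that default differs between the two versions (3 in Python 3.7, 4 in Python 3.12), so version effects and protocol effects are not separated;
- Deeply nested and cyclic object graphs were not included.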
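
A follow-up run could address the protocol and cyclic-graph limitations by passing `protocol` explicitly and adding a self-referencing input. The sketch below is one possible shape for that run, not the method used in this notebook. The names `get_hash_for_protocol`, `build_cases`, and `hash_matrix` are hypothetical, and the cyclic case is an extra input that was not part of the original test set. Whether pinned protocols yield matching hashes on 3.7 and 3.12 would still need to be confirmed by running it on both interpreters.

```python
import hashlib
import pickle

def get_hash_for_protocol(obj, protocol):
    """SHA256 of the pickle stream produced with an explicit protocol."""
    return hashlib.sha256(pickle.dumps(obj, protocol=protocol)).hexdigest()

def build_cases():
    """Return the four original inputs plus one hypothetical cyclic input."""
    a = [1, 2]
    cyclic = []
    cyclic.append(cyclic)  # the list contains itself
    return {
        "TC_DF_01": 123,
        "TC_DF_02": [a, a],
        "TC_DF_03": [[1, 2], [3, 4]],
        "TC_DF_04": {"a": 1},
        "cyclic (extra, hypothetical)": cyclic,
    }

def hash_matrix(protocols=(2, 3, 4)):
    """Map each case name to a {protocol: hash} dictionary."""
    return {
        name: {p: get_hash_for_protocol(obj, p) for p in protocols}
        for name, obj in build_cases().items()
    }

for name, row in hash_matrix().items():
    print(name)
    for protocol, digest in row.items():
        print(f"  protocol {protocol}: {digest}")
```

Protocols 2 to 4 are available on both Python 3.7 and 3.12, which is why they are used as the default range here.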

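> Developers who need consistent serialized outputs should not rely on the default output of `pickle` across Python versions. Pinning the `protocol` argument may reduce the divergence, but this was not tested here. A standardized format like JSON or Protobuf is preferred for long-term stability.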
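
For JSON-compatible data, a stable hash can be sketched with the standard `json` module. This is an illustration under stated assumptions rather than a drop-in replacement: the helper name `get_json_hash` is hypothetical, it only handles types that `json.dumps` accepts, and JSON does not record shared references, so inputs like TC_DF_02 lose their identity structure.

```python
import hashlib
import json

def get_json_hash(obj):
    """SHA256 of a canonical JSON encoding (sorted keys, compact separators)."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = [1, 2]
print(get_json_hash(123))
print(get_json_hash([a, a]))      # encoded as [[1,2],[1,2]]; the sharing is not recorded
print(get_json_hash({"a": 1}))
```

Sorting keys and fixing the separators removes the main sources of formatting variation, so the hash depends on the data rather than on dictionary insertion order.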


### 📎 Appendix: Raw Data File

The complete per-version hash records can be found in the following file:

👉 [Download dataflow_python_hashes.txt](./dataflow_python_hashes.txt)