2. Chromatography is frequently used to determine the outcome of experiments. However, most chromatography instrument manufacturers provide data in proprietary data formats. We’ve developed the [Rainbow](https://github.com/evanyeyeye/rainbow) package to unlock these files and we want to know whether you can extend Rainbow.

[Here](https://drive.google.com/drive/folders/1tyYTM94BdOkCkvZCJ4a1gT5CYYb-EKDj?usp=sharing) are three folders with artificially generated and encoded chromatography data:

(a) **pear** challenge (easy): time vs. intensity data

(b) **scale** challenge (intermediate): time vs. wavelength vs. absorbance data

(c) **sixtysix** challenge (hard): time vs. mass vs. intensity data

In each folder, you will find a `sample/` subfolder and `problemX` subfolders (where X=1,2,3). The sample subfolder contains a matched binary/csv pair. You should examine this pair with a hex editor (or any other tool of your choice) to determine its binary organization. The rest of the folders contain only binaries. Your decoding script should run on these files. I will check that the csv output matches what is expected.

**Note:** your answers should not include any hard-coded magic numbers (other than the lengths of headers, chunks, footers, etc.)

**Please provide a concise and clear explanation for each file structure in markdown format.** What is the format of the header, data, and footer? I suggest writing a couple paragraphs to accompany a table like this:

| Location | Length (bytes) | Endianness | Format | Value        |
|----------|----------------|------------|--------|--------------|
| 0x180    | 4              | big        | uint   | time[0] (ms) |
| 0x184    | 4              | little     | uint   | intensity[0] |
| ...      |                |            |        |              |

Please document your code clearly with comments and docstrings. Please provide your answer as one `.py` file per problem (so, one for pear, one for scale, and one for sixtysix). Please provide the decoded `.csv` files so I can check them against the expected results. Place one decoded csv file per problem directory like this: `pear/problem1/pear.csv`.


In [None]:
import struct
import pandas as pd
import os


def extract_pear_to_df(input_path, header_size=0x140, footer_size=0x1e0):
    """
    Extracts binary data from a file to a DataFrame.
    
    :param input_path: Path to the binary file.
    :param header_size: Size of the header in bytes.
    :param footer_size: Size of the footer in bytes.
    :return: DataFrame containing the extracted data.
    """
    col0 = []
    col1 = []

    with open(input_path, 'rb') as f:
        # Skip the header
        f.seek(header_size)

        # Calculate the size of the body (excluding header and footer)
        file_size = os.path.getsize(input_path)
        body_size = file_size - header_size - footer_size

        # Read the body
        bytes_read = 0
        while bytes_read < body_size:
            # Read 8 bytes (2 columns of 4 bytes each)
            chunk = f.read(8)
            if len(chunk) < 8:
                break

            # Extract values from the specified columns
            col0_val = struct.unpack('<I', chunk[0:4])[0]
            col1_val = struct.unpack('<I', chunk[4:8])[0]

            col0.append(col0_val)
            col1.append(col1_val)

            bytes_read += 8

    # Create DataFrame
    df = pd.DataFrame({'Time (ms)': col0, 'Intensity': col1})
    return df


def main(input_path=None, header_size=0x140, footer_size=0x1e0):
    """
    Extracts binary data from a file to a DataFrame and saves it to a CSV file.

    :param input_path: Path to the binary file.
    :param header_size: Size of the header in bytes.
    :param footer_size: Size of the footer in bytes.
    """
    if input_path is None:
        input_path = input('Enter the path to the binary file: ')

    # Extract the binary data to a DataFrame
    df = extract_pear_to_df(input_path, header_size, footer_size)
    # Save the DataFrame to a CSV file
    df.to_csv(input_path + '.csv', index=True)
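
The read loop in `extract_pear_to_df` can also be expressed with `struct.iter_unpack`, which walks a buffer one fixed-size record at a time. The following is only a sketch under the same assumptions as the function above (a 0x140-byte header, a 0x1e0-byte footer, and two little-endian unsigned 32-bit columns). The names `RECORD` and `decode_pear_bytes` are hypothetical, and the input is a synthetic buffer rather than a real pear file.

```python
import struct

RECORD = struct.Struct('<II')  # little-endian: time (ms), intensity


def decode_pear_bytes(raw, header_size=0x140, footer_size=0x1e0):
    """Return a list of (time_ms, intensity) tuples from the raw file bytes."""
    body = raw[header_size:len(raw) - footer_size]
    # Drop any trailing partial record so iter_unpack does not raise.
    usable = len(body) - len(body) % RECORD.size
    return list(RECORD.iter_unpack(body[:usable]))


# Synthetic file: zeroed header and footer around two records.
fake = bytes(0x140) + RECORD.pack(0, 10) + RECORD.pack(5, 12) + bytes(0x1e0)
rows = decode_pear_bytes(fake)
print(rows)
```

Working on the whole byte string keeps the record size in one place (`RECORD.size`) instead of repeating the literal 8 in the loop.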


## Test

In [None]:
# Set the header and footer size
header_size = 0x140
footer_size = 0x1e0
file_path = './pear/sample/pear'
df = extract_pear_to_df(file_path, header_size, footer_size)

# Read the CSV file into a DataFrame
csv_df = pd.read_csv('./pear/sample/pear.csv')

# Compare the DataFrames
print("DataFrames are equal:", df.equals(csv_df))

In [None]:
# List of file paths
file_paths = ["./pear/problem1/pear", "./pear/problem2/pear",
              "./pear/problem3/pear"]

# Loop through each file path and decode the binary to CSV
for p in file_paths:
    main(p)

# Scale

First data rows of the sample CSV (lines 2 to 4 of the file):

0.0000,-1,-2,27,23,98,1,48,69,-1,-2093,39,-1,822,1,0,16,40,0
0.0004,0,0,26,23,99,1,50,70,-1,-2109,36,-1,830,-1,0,14,40,-1
0.0009,-2,1,26,24,100,-1,47,68,-2,-2132,38,-1,835,0,-2,15,40,0

In [45]:
import struct
import binascii


def float_to_hex(f):
    print(hex(struct.unpack('<I', struct.pack('<f', f))[0]))


float_to_hex(0.0004)
float_to_hex(0.0009)
float_to_hex(0.0013)
float_to_hex(0.3809)
float_to_hex(5.0000)
float_to_hex(0.5622)


0x39d1b717
0x3a6bedfa
0x3aaa64c3
0x3ec30553
0x40a00000
0x3f0fec57


## Observations

line 3 -> 0.0004 -> 0x39d1b717 -> (17 b7 d1 39) -> location: 0x250
line 4 -> 0.0009 -> 0x3a6bedfa -> (fa ed 6b 3a) -> location: 0x29e
line 5 -> 0.0013 -> 0x3aaa64c3 -> (c3 64 aa 3a) -> location: 0x2ec
...
last line -> 5.0000 -> 0x40a00000 -> (00 00 a0 40) -> location: 0xdb9d6

Consecutive time values sit 0x4e (78) bytes apart, so each CSV row occupies 78 bytes in the binary.
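
The offsets listed above were located by hand in a hex editor. The same search can be scripted with `bytes.find`, as in this rough sketch. The function name `find_float_offsets` is hypothetical, and the buffer is synthetic: it assumes a 0x202-byte prefix (0x250 minus one 78-byte row) and uses filler bytes in place of the real columns.

```python
import struct


def find_float_offsets(data, values):
    """Return the first offset of each value's little-endian float32 bytes."""
    offsets = []
    for v in values:
        pattern = struct.pack('<f', v)
        offsets.append(data.find(pattern))  # -1 if the pattern is absent
    return offsets


# Synthetic buffer: three 78-byte rows after an assumed 0x202-byte prefix.
times = [0.0, 0.0004, 0.0009]
data = bytes(0x202) + b''.join(
    struct.pack('<f', t) + b'\xAA' * 74 for t in times
)
offsets = find_float_offsets(data, times[1:])
strides = [b - a for a, b in zip(offsets, offsets[1:])]
print([hex(o) for o in offsets], strides)
```

A constant stride between matches is the row size; a varying stride would point to a variable-length layout instead.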



Each row in the dumps below ends with the same two bytes, `48 48`:

line 2: [0.0000,-1,-2,27,23,98,1,48,69,-1,-2093,39,-1,822,1,0,16,40,0]

`00 00 00 00 [FF FF FF EC] [FF FF FF D8] 00 00 02 1C 00 00 01 CC 00 00 07 A8 00 00 00 14 00 00 03 C0 00 00 05 64 [FF FF FF EC] FF FF 5C 7C 00 00 03 0C [FF FF FF EC] 00 00 40 38 00 00 00 14 00 00 00 00 00 00 01 40 00 00 03 20 00 00 00 00 [48 48]`

line 3: [0.0004,0,0,26,23,99,1,50,70,-1,-2109,36,-1,830,-1,0,14,40,-1]

`17 B7 D1 39 00 00 00 00 00 00 00 00 00 00 02 08 00 00 01 CC 00 00 07 BC 00 00 00 14 00 00 03 E8 00 00 05 78 [[FF FF FF EC]] FF FF 5B 3C 00 00 02 D0 [[FF FF FF EC]] 00 00 40 D8 [FF FF FF EC] 00 00 00 00 00 00 01 18 00 00 03 20 [FF FF FF EC] 48 48`

line 4: [0.0009,-2,1,26,24,100,-1,47,68,-2,-2132,38,-1,835,0,-2,15,40,0]

`FA ED 6B 3A [FF FF FF D8] 00 00 00 14 00 00 02 08 00 00 01 E0 00 00 07 D0 [FF FF FF EC] 00 00 03 AC 00 00 05 50 [FF FF FF D8] FF FF 59 70 00 00 02 F8 [FF FF FF EC] 00 00 41 3C 00 00 00 00 [FF FF FF D8] 00 00 01 2C 00 00 03 20 [00 00 00 00 48 48]` 


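The two `48 48` bytes seem to act as padding between rows, which gives 2 + 4 + 18 × 4 = 78 bytes per row. The time floats are stored in little-endian format, while the integer columns are big-endian two's-complement values scaled by 20 (for example, `FF FF FF EC` is -20, which maps to -1).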

I'm not sure about the order of the columns. We'll see. 
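
One way to check the column order is to decode a single dumped row and compare it with the CSV. The sketch below takes the bytes of CSV line 3 from the hex dump, leaving out the trailing `48 48`. It assumes a little-endian float32 time followed by 18 big-endian signed 32-bit integers, each stored as the CSV value times 20. The name `SCALE` is introduced here for illustration.

```python
import struct

SCALE = 20  # assumed from the dump: FF FF FF EC (-20) maps to -1 in the CSV

# Bytes of CSV line 3, copied from the hex dump (without the trailing 48 48)
row = bytes.fromhex(
    "17B7D139" "00000000" "00000000" "00000208" "000001CC" "000007BC"
    "00000014" "000003E8" "00000578" "FFFFFFEC" "FFFF5B3C" "000002D0"
    "FFFFFFEC" "000040D8" "FFFFFFEC" "00000000" "00000118" "00000320"
    "FFFFFFEC"
)

(time,) = struct.unpack_from('<f', row, 0)      # little-endian float32
raw_ints = struct.unpack_from('>18i', row, 4)   # big-endian signed int32
values = [v // SCALE for v in raw_ints]

print(round(time, 4), values)
```

The decoded values come out in the same order as CSV line 3 (`0.0004,0,0,26,23,99,...`), which suggests the binary columns follow the CSV column order for this row.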


In [63]:
# Negative numbers are stored as 32-bit two's complement after scaling by 20
def int_scale_to_hex(i):
    # Scale first, then wrap into the unsigned 32-bit range
    return hex((i * 20) & 0xFFFFFFFF)


def hex_to_int_scale(h):
    i = int(h, 16)
    # Undo the two's complement before dividing by the scale factor
    if i >= 2 ** 31:
        i -= 2 ** 32
    return i // 20


for x in [-1, -2, 27, 23, 98, 1, 48, 69, -1, -2093, 39, -1, 822, 1, 0, 16, 40,
          0]:
    hex_val = int_scale_to_hex(x)
    print(x, hex_val, hex_to_int_scale(hex_val))




-1 0xffffffec -1
-2 0xffffffd8 -2
27 0x21c 27
23 0x1cc 23
98 0x7a8 98
1 0x14 1
48 0x3c0 48
69 0x564 69
-1 0xffffffec -1
-2093 0xffff5c7c -2093
39 0x30c 39
-1 0xffffffec -1
822 0x4038 822
1 0x14 1
0 0x0 0
16 0x140 16
40 0x320 40
0 0x0 0


In [1]:
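import binascii
import struct
import pandas as pd
import os


def hex_to_int_scale(h):
    """Convert a 32-bit big-endian hex string to a signed int divided by 20."""
    i = int(h, 16)
    # Undo the two's complement before dividing by the scale factor
    if i >= 2 ** 31:
        i -= 2 ** 32
    return i // 20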


def extract_scale_to_df(input_path, header_size=0x200, footer_size=0):
    """
    Extracts binary data from a file to a DataFrame.
    
    :param input_path: Path to the binary file.
    :param header_size: Size of the header in bytes.
    :param footer_size: Size of the footer in bytes.
    :return: DataFrame containing the extracted data.
    """
    # Initialize columns 
    columns = [[] for _ in
               range(19)]  # Assuming 19 columns based on the provided data

    with open(input_path, 'rb') as f:
        # Skip the header
        f.seek(header_size)

        # Calculate the size of the body (excluding header and footer)
        file_size = os.path.getsize(input_path)
        body_size = file_size - header_size - footer_size

        # Read the body
        bytes_read = 0

        while bytes_read < body_size:
            chunk = f.read(78)
            if len(chunk) < 78:
                break

            # The first 2 bytes of each chunk are padding and are skipped

            # Column 0: the time, a little-endian float32 at bytes 2..5
            float_val = struct.unpack('<f', chunk[2:6])[0]
            columns[0].append(float_val)

            # Extract the rest of the chunk 4 bytes at a time,
            # convert to hex and then to a scaled int
            for i in range(1, 19):
                hex_val = binascii.hexlify(chunk[6 + (i - 1) * 4: 6 + i * 4])
                int_val = hex_to_int_scale(hex_val)
                columns[i].append(int_val)
                
            bytes_read += 78

    # Create DataFrame
    df = pd.DataFrame({f'Column {i}': col for i, col in enumerate(columns)})
    return df


def main(input_path=None, header_size=0x200, footer_size=0):
    """
    Extracts binary data from a file to a DataFrame and saves it to a CSV file.

    :param input_path: Path to the binary file.
    :param header_size: Size of the header in bytes.
    :param footer_size: Size of the footer in bytes.
    """
    if input_path is None:
        input_path = input('Enter the path to the binary file: ')

    # Extract the binary data to a DataFrame
    df = extract_scale_to_df(input_path, header_size, footer_size)
    # Save the DataFrame to a CSV file
    df.to_csv(input_path + '.csv', index=True)
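
The literals 78 and 19 in `extract_scale_to_df` follow from the record layout, so they can be derived rather than typed in. The sketch below does this with `struct.Struct.size`. It assumes the layout the function already uses: 2 padding bytes, a little-endian float32, then 18 big-endian int32 columns. The names `N_INTS`, `PAD`, `RECORD_SIZE`, and `count_records` are hypothetical. It has not been checked whether a real scale file ends exactly on a record boundary.

```python
import struct

N_INTS = 18                          # integer columns per row (assumed)
PAD = 2                              # bytes skipped before each time value
TIME = struct.Struct('<f')           # little-endian float32 time
INTS = struct.Struct(f'>{N_INTS}i')  # big-endian signed int32 columns
RECORD_SIZE = PAD + TIME.size + INTS.size


def count_records(file_size, header_size=0x200, footer_size=0):
    """Return the number of whole records, or raise if bytes are left over."""
    body = file_size - header_size - footer_size
    n, leftover = divmod(body, RECORD_SIZE)
    if leftover:
        raise ValueError(f'{leftover} stray bytes after {n} records')
    return n


print(RECORD_SIZE, count_records(0x200 + 3 * RECORD_SIZE))
```

A leftover-byte error would be a hint that the assumed header or footer size is off by a few bytes.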


## Test

In [2]:
header_size = 0x200
footer_size = 0
file_path = './scale/sample/scale'
df = extract_scale_to_df(file_path, header_size, footer_size)
print(df)


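NameError: name 'binascii' is not defined

The call to `binascii.hexlify` inside `extract_scale_to_df` raises this error whenever `binascii` has not been imported in the running kernel, so the module has to be imported together with `struct`, `pandas`, and `os`.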