# Binary File Operations: Complete Guide

Binary files store raw bytes, and Python reads and writes them as `bytes` objects without applying a text encoding. This notebook covers binary file modes, basic reads and writes, searching and updating records, performance, and pickle serialization.

## 🔢 Understanding Binary Files

**Binary files** contain bytes that are not meant to be interpreted as text, so they are generally not human-readable in a text editor. Examples include:
- Images (PNG, JPEG)
- Videos (MP4, AVI)
- Executables (.exe, .app)
- Pickle files (.pkl)
- Database files

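**Key differences from text files:**
- No character encoding/decoding: reads and writes use `bytes`, not `str`
- Often more compact storage (for example, a 32-bit integer always takes 4 bytes)
- Often faster read/write operations, because there is no text parsing or newline translation
- Portable across platforms only when byte order and field sizes are fixed explicitly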
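
To make the first and last points concrete, here is a minimal sketch. The file name `demo.txt` and the record layout `"<Id"` are illustrative choices, not something this notebook uses elsewhere. It writes a string in text mode, reads the same file back in binary mode, and then packs two values with an explicit little-endian layout using the standard `struct` module.

```python
import struct
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"  # hypothetical file name

    # Text mode: the str is encoded to bytes (UTF-8 here) on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write("caf\u00e9")

    # Binary mode: the raw bytes come back with no decoding step.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)                         # b'caf\xc3\xa9'
print(len("caf\u00e9"), len(raw))  # 4 characters, 5 bytes

# Explicit byte order keeps the layout identical on every platform:
# '<' = little-endian, 'I' = unsigned 32-bit int, 'd' = 64-bit float.
packed = struct.pack("<Id", 12345, 4.2)
record_id, gpa = struct.unpack("<Id", packed)
print(len(packed), record_id, gpa)  # 12 12345 4.2
```

Without the `<` prefix, `struct` uses the machine's native byte order and alignment, which is where portability problems usually start.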

## 📂 Binary File Modes

All binary modes include the 'b' character:

| Mode | Description | Creates File? | Truncates? | Position |
|------|-------------|---------------|------------|----------|
| `rb` | Read binary | No | No | Beginning |
| `wb` | Write binary | Yes | Yes | Beginning |
| `ab` | Append binary | Yes | No | End |
| `rb+` | Read/Write binary | No | No | Beginning |
| `wb+` | Write/Read binary | Yes | Yes | Beginning |
| `ab+` | Append/Read binary | Yes | No | End |
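
The table can be checked directly. The sketch below uses a temporary directory and a made-up file name, `modes.bin`, and exercises one claim per mode: `rb` does not create a file, `wb` truncates, `ab+` starts at the end, and `rb+` overwrites from the beginning without truncating.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "modes.bin"  # hypothetical file name

    # 'rb' does not create a missing file.
    try:
        open(path, "rb")
        rb_created_file = True
    except FileNotFoundError:
        rb_created_file = False

    # 'wb' creates the file and truncates it on every later open.
    with open(path, "wb") as f:
        f.write(b"0123456789")
    with open(path, "wb") as f:
        f.write(b"abc")
    size_after_wb = path.stat().st_size  # 3, not 10

    # 'ab+' starts at the end; seek back to the start before reading.
    with open(path, "ab+") as f:
        start_pos = f.tell()             # 3
        f.write(b"def")
        f.seek(0)
        contents = f.read()              # b'abcdef'

    # 'rb+' keeps the existing contents and writes from the beginning.
    with open(path, "rb+") as f:
        f.write(b"XY")
    final = path.read_bytes()            # b'XYcdef'

print(rb_created_file, size_after_wb, start_pos, contents, final)
```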

## 🔧 Basic Binary File Operations

In [1]:
import os
from pathlib import Path

# Create sample_files directory
Path('../sample_files').mkdir(parents=True, exist_ok=True)

def demonstrate_binary_modes():
    """Demonstrate all binary file modes"""
    print("🔢 Binary File Modes Demonstration")
    print("=" * 50)
    
    # Sample binary data
    sample_data = bytes([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33])  # "Hello World!"
    additional_data = bytes([10, 71, 111, 111, 100, 98, 121, 101, 33])  # "\nGoodbye!"
    
    print(f"📊 Sample data: {sample_data}")
    print(f"📝 As text: {sample_data.decode('utf-8')}")
    
    # Mode 'wb' - Write Binary
    print("\n--- Mode 'wb' (Write Binary) ---")
    with open('../sample_files/binary_demo.bin', 'wb') as f:
        bytes_written = f.write(sample_data)
        print(f"✅ Wrote {bytes_written} bytes")
    
    # Mode 'rb' - Read Binary
    print("\n--- Mode 'rb' (Read Binary) ---")
    with open('../sample_files/binary_demo.bin', 'rb') as f:
        read_data = f.read()
        print(f"📖 Read: {read_data}")
        print(f"📝 As text: {read_data.decode('utf-8')}")
        print(f"✅ Data matches: {read_data == sample_data}")
    
    # Mode 'ab' - Append Binary
    print("\n--- Mode 'ab' (Append Binary) ---")
    with open('../sample_files/binary_demo.bin', 'ab') as f:
        bytes_written = f.write(additional_data)
        print(f"✅ Appended {bytes_written} bytes")
    
    # Read the combined result
    with open('../sample_files/binary_demo.bin', 'rb') as f:
        combined_data = f.read()
        print(f"📖 Combined data: {combined_data}")
        print(f"📝 As text: {combined_data.decode('utf-8')}")
    
    # Mode 'rb+' - Read/Write Binary
    print("\n--- Mode 'rb+' (Read/Write Binary) ---")
    with open('../sample_files/binary_demo.bin', 'rb+') as f:
        # Read first 5 bytes
        first_part = f.read(5)
        print(f"📖 First 5 bytes: {first_part} ({first_part.decode('utf-8')})")
        
        # Get current position
        pos = f.tell()
        print(f"📍 Current position: {pos}")
        
        # Write at current position (overwrite)
        f.write(b'XYZ')
        print("✅ Overwrote 3 bytes with 'XYZ'")
    
    # Read the modified result
    with open('../sample_files/binary_demo.bin', 'rb') as f:
        modified_data = f.read()
        print(f"📖 Modified data: {modified_data}")
        print(f"📝 As text: {modified_data.decode('utf-8', errors='replace')}")

demonstrate_binary_modes()

🔢 Binary File Modes Demonstration
📊 Sample data: b'Hello World!'
📝 As text: Hello World!

--- Mode 'wb' (Write Binary) ---
✅ Wrote 12 bytes

--- Mode 'rb' (Read Binary) ---
📖 Read: b'Hello World!'
📝 As text: Hello World!
✅ Data matches: True

--- Mode 'ab' (Append Binary) ---
✅ Appended 9 bytes
📖 Combined data: b'Hello World!\nGoodbye!'
📝 As text: Hello World!
Goodbye!

--- Mode 'rb+' (Read/Write Binary) ---
📖 First 5 bytes: b'Hello' (Hello)
📍 Current position: 5
✅ Overwrote 3 bytes with 'XYZ'
📖 Modified data: b'HelloXYZrld!\nGoodbye!'
📝 As text: HelloXYZrld!
Goodbye!
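
The `rb+` step above depends on the file position, which `tell()` reports and `seek()` changes. As a small sketch, the example below uses `io.BytesIO` as a stand-in for a file opened in `rb+` mode, so nothing is written to disk. It shows the three `whence` options and reproduces the in-place overwrite.

```python
import io
import os

# In-memory stand-in for the demo file's contents after the append step.
f = io.BytesIO(b"Hello World!\nGoodbye!")

f.seek(6)                    # absolute offset from the start (os.SEEK_SET)
word = f.read(5)             # b'World'

f.seek(-8, os.SEEK_END)      # 8 bytes before the end of the data
tail = f.read()              # b'Goodbye!'

f.seek(0)
f.seek(5, os.SEEK_CUR)       # 5 bytes forward from the current position
pos = f.tell()               # 5

f.write(b"XYZ")              # overwrites 3 bytes in place; nothing is inserted
result = f.getvalue()        # b'HelloXYZrld!\nGoodbye!'

print(word, tail, pos, result)
```

Writing in the middle of a file replaces bytes rather than shifting them, which is why the total length stays the same.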


## 🥒 Pickle Module: Serializing Python Objects

The `pickle` module allows you to serialize (convert to bytes) and deserialize (convert back) Python objects.

In [2]:
import pickle
import time

def demonstrate_pickle_operations():
    """Comprehensive pickle demonstration"""
    print("🥒 Pickle Module Demonstration")
    print("=" * 40)
    
    # Create complex data structures to pickle
    student_data = {
        'name': 'Alice Johnson',
        'id': 12345,
        'grades': [85, 92, 78, 96, 88],
        'subjects': ['Math', 'Physics', 'Chemistry', 'Biology'],
        'metadata': {
            'enrollment_date': '2024-01-15',
            'active': True,
            'gpa': 4.2
        },
        'test_scores': [(85, 'Math'), (92, 'Physics'), (78, 'Chemistry')]
    }
    
    print(f"📊 Original data type: {type(student_data)}")
    print(f"📝 Sample content: {student_data['name']}, GPA: {student_data['metadata']['gpa']}")
    
    # Method 1: Using dump() and load()
    print("\n--- Method 1: dump() and load() ---")
    
    # Serialize (dump) to file
    pickle_file = '../sample_files/student_data.pkl'
    with open(pickle_file, 'wb') as f:
        pickle.dump(student_data, f)
    print("✅ Data serialized with pickle.dump()")
    
    # Check file size
    file_size = os.path.getsize(pickle_file)
    print(f"📏 Pickle file size: {file_size} bytes")
    
    # Deserialize (load) from file
    with open(pickle_file, 'rb') as f:
        loaded_data = pickle.load(f)
    print("✅ Data deserialized with pickle.load()")
    
    # Verify data integrity
    print(f"🔍 Data integrity check: {loaded_data == student_data}")
    print(f"📊 Loaded data type: {type(loaded_data)}")
    
    # Method 2: Using dumps() and loads() (in-memory)
    print("\n--- Method 2: dumps() and loads() (in-memory) ---")
    
    # Serialize to bytes
    serialized_bytes = pickle.dumps(student_data)
    print(f"✅ Serialized to bytes: {len(serialized_bytes)} bytes")
    print(f"📊 First 50 bytes: {serialized_bytes[:50]}")
    
    # Deserialize from bytes
    deserialized_data = pickle.loads(serialized_bytes)
    print("✅ Deserialized from bytes")
    print(f"🔍 Data integrity check: {deserialized_data == student_data}")

demonstrate_pickle_operations()

🥒 Pickle Module Demonstration
📊 Original data type: <class 'dict'>
📝 Sample content: Alice Johnson, GPA: 4.2

--- Method 1: dump() and load() ---
✅ Data serialized with pickle.dump()
📏 Pickle file size: 231 bytes
✅ Data deserialized with pickle.load()
🔍 Data integrity check: True
📊 Loaded data type: <class 'dict'>

--- Method 2: dumps() and loads() (in-memory) ---
✅ Serialized to bytes: 231 bytes
📊 First 50 bytes: b'\x80\x04\x95\xdc\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x04name\x94\x8c\rAlice Johnson\x94\x8c\x02id\x94M90\x8c\x06gra'
✅ Deserialized from bytes
🔍 Data integrity check: True
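
The first bytes in the output above are not random. A pickle written with protocol 2 or later starts with the `PROTO` opcode (`0x80`) followed by the protocol number, and the `M90` after `id` is the two-byte little-endian encoding of 12345. The sketch below checks this on a smaller, made-up payload and prints a disassembly with the standard `pickletools` module, which inspects the opcodes without executing them.

```python
import pickle
import pickletools

# Small illustrative payload; protocol 4 is requested explicitly.
payload = pickle.dumps({"id": 12345}, protocol=4)

# Byte 0 is the PROTO opcode, byte 1 is the protocol number.
opcode, protocol = payload[0], payload[1]
print(hex(opcode), protocol)   # 0x80 4

# 12345 == 0x3039, stored little-endian as bytes 0x39 ('9') and 0x30 ('0')
# after the BININT2 opcode 'M'.
print(b"M90" in payload)       # True

# Print the opcode stream in readable form.
pickletools.dis(payload)
```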


## 🔍 Binary File Search Operations

In [3]:
def demonstrate_binary_search():
    """Demonstrate searching in binary files"""
    print("🔍 Binary File Search Operations")
    print("=" * 40)
    
    # Create a binary file with multiple records
    students = [
        {'id': 101, 'name': 'Alice', 'grade': 85},
        {'id': 102, 'name': 'Bob', 'grade': 92},
        {'id': 103, 'name': 'Charlie', 'grade': 78},
        {'id': 104, 'name': 'Diana', 'grade': 96},
        {'id': 105, 'name': 'Eve', 'grade': 88}
    ]
    
    # Save all students to binary file
    students_file = '../sample_files/students.pkl'
    with open(students_file, 'wb') as f:
        for student in students:
            pickle.dump(student, f)
    print(f"✅ Saved {len(students)} student records")
    
    # Search function
    def search_student_by_id(filename, target_id):
        """Search for a student by ID in binary file"""
        try:
            with open(filename, 'rb') as f:
                while True:
                    try:
                        student = pickle.load(f)
                        if student['id'] == target_id:
                            return student
                    except EOFError:
                        break  # End of file reached
            return None  # Student not found
        except FileNotFoundError:
            print(f"❌ File not found: {filename}")
            return None
    
    # Search for specific students
    search_ids = [102, 105, 999]  # 999 doesn't exist
    
    for student_id in search_ids:
        result = search_student_by_id(students_file, student_id)
        if result:
            print(f"🎯 Found student {student_id}: {result['name']}, Grade: {result['grade']}")
        else:
            print(f"❌ Student {student_id} not found")
    
    # Search by grade range
    def search_students_by_grade_range(filename, min_grade, max_grade):
        """Find all students within a grade range"""
        matching_students = []
        try:
            with open(filename, 'rb') as f:
                while True:
                    try:
                        student = pickle.load(f)
                        if min_grade <= student['grade'] <= max_grade:
                            matching_students.append(student)
                    except EOFError:
                        break
        except FileNotFoundError:
            print(f"❌ File not found: {filename}")
        
        return matching_students
    
    # Find students with grades between 85 and 95
    high_performers = search_students_by_grade_range(students_file, 85, 95)
    print(f"\n🏆 High performers (85-95): {len(high_performers)} students")
    for student in high_performers:
        print(f"   {student['name']}: {student['grade']}")

demonstrate_binary_search()

🔍 Binary File Search Operations
✅ Saved 5 student records
🎯 Found student 102: Bob, Grade: 92
🎯 Found student 105: Eve, Grade: 88
❌ Student 999 not found

🏆 High performers (85-95): 3 students
   Alice: 85
   Bob: 92
   Eve: 88
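
Both search functions above repeat the same `while True` / `EOFError` loop. One way to factor it out is a generator that yields each record in turn. The helper name `iter_pickled_records` is hypothetical, and the sketch uses an in-memory buffer in place of `students.pkl`.

```python
import io
import pickle

def iter_pickled_records(f):
    """Yield objects from a binary file that holds back-to-back pickles."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return  # no more records

# Build a small in-memory "file" of three records.
buffer = io.BytesIO()
for record in [
    {'id': 101, 'name': 'Alice', 'grade': 85},
    {'id': 102, 'name': 'Bob', 'grade': 92},
    {'id': 103, 'name': 'Charlie', 'grade': 78},
]:
    pickle.dump(record, buffer)

# Search by ID: stop at the first match.
buffer.seek(0)
found = next((s for s in iter_pickled_records(buffer) if s['id'] == 102), None)

# Search by grade range: collect every match.
buffer.seek(0)
in_range = [s['name'] for s in iter_pickled_records(buffer) if 85 <= s['grade'] <= 95]

print(found, in_range)
```

Either way the search is a linear scan: each record has to be unpickled to be inspected, so lookup time grows with the file size.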


## ✏️ Binary File Update Operations

In [None]:
def demonstrate_binary_updates():
    """Demonstrate updating records in binary files"""
    print("✏️ Binary File Update Operations")
    print("=" * 40)
    
    # Create initial data
    products = [
        {'id': 'P001', 'name': 'Laptop', 'price': 999.99, 'stock': 10},
        {'id': 'P002', 'name': 'Mouse', 'price': 29.99, 'stock': 50},
        {'id': 'P003', 'name': 'Keyboard', 'price': 79.99, 'stock': 25}
    ]
    
    products_file = '../sample_files/products.pkl'
    
    # Save initial data
    with open(products_file, 'wb') as f:
        pickle.dump(products, f)
    print(f"✅ Saved {len(products)} products")
    
    # Display current data
    def display_products(filename):
        """Display all products from file"""
        with open(filename, 'rb') as f:
            products = pickle.load(f)
        
        print("📦 Current Products:")
        for product in products:
            print(f"   {product['id']}: {product['name']} - ${product['price']:.2f} (Stock: {product['stock']})")
    
    display_products(products_file)
    
    # Update function
    def update_product_price(filename, product_id, new_price):
        """Update the price of a specific product"""
        # Read all data
        with open(filename, 'rb') as f:
            products = pickle.load(f)
        
        # Find and update the product
        updated = False
        for product in products:
            if product['id'] == product_id:
                old_price = product['price']
                product['price'] = new_price
                print(f"✅ Updated {product_id}: ${old_price:.2f} → ${new_price:.2f}")
                updated = True
                break
        
        if not updated:
            print(f"❌ Product {product_id} not found")
            return False
        
        # Write back all data
        with open(filename, 'wb') as f:
            pickle.dump(products, f)
        
        return True
    
    # Update stock function
    def update_product_stock(filename, product_id, quantity_change):
        """Update the stock of a specific product"""
        with open(filename, 'rb') as f:
            products = pickle.load(f)
        
        updated = False
        for product in products:
            if product['id'] == product_id:
                old_stock = product['stock']
                product['stock'] += quantity_change
                print(f"✅ Updated {product_id} stock: {old_stock} → {product['stock']} ({quantity_change:+d})")
                updated = True
                break
        
        if not updated:
            print(f"❌ Product {product_id} not found")
            return False
        
        with open(filename, 'wb') as f:
            pickle.dump(products, f)
        
        return True
    
    # Perform updates
    print("\n🔄 Performing Updates:")
    update_product_price(products_file, 'P001', 899.99)  # Reduce laptop price
    update_product_stock(products_file, 'P002', -5)      # Sell 5 mice
    update_product_stock(products_file, 'P003', 10)      # Restock keyboards
    
    print("\n📦 After Updates:")
    display_products(products_file)
    
    # Add new product function
    def add_product(filename, new_product):
        """Add a new product to the file"""
        with open(filename, 'rb') as f:
            products = pickle.load(f)
        
        # Check if product ID already exists
        for product in products:
            if product['id'] == new_product['id']:
                print(f"❌ Product {new_product['id']} already exists")
                return False
        
        products.append(new_product)
        
        with open(filename, 'wb') as f:
            pickle.dump(products, f)
        
        print(f"✅ Added new product: {new_product['id']} - {new_product['name']}")
        return True
    
    # Add a new product
    new_product = {'id': 'P004', 'name': 'Monitor', 'price': 299.99, 'stock': 15}
    add_product(products_file, new_product)
    
    print("\n📦 After Adding New Product:")
    display_products(products_file)

demonstrate_binary_updates()
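
Each update above rewrites the whole file in `wb` mode, so a crash partway through the write could leave a truncated file. A common mitigation, shown here as a sketch and not something the notebook's functions do, is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper name `atomic_pickle_dump` is hypothetical.

```python
import os
import pickle
import tempfile

def atomic_pickle_dump(obj, filename):
    """Write obj to filename so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())        # push the data to disk before renaming
        os.replace(tmp_path, filename)  # replaces the target in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage: overwrite an existing file twice and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'products.pkl')
    atomic_pickle_dump([{'id': 'P001', 'price': 999.99}], target)
    atomic_pickle_dump([{'id': 'P001', 'price': 899.99}], target)
    with open(target, 'rb') as f:
        reloaded = pickle.load(f)
    leftover = os.listdir(tmp)          # only the target file should remain

print(reloaded, leftover)
```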

## 📊 Binary File Performance Analysis

In [None]:
import time
import json

def performance_comparison():
    """Compare binary vs text file performance"""
    print("📊 Binary vs Text File Performance")
    print("=" * 45)
    
    # Create test data
    test_data = []
    for i in range(1000):
        test_data.append({
            'id': i,
            'name': f'User_{i:04d}',
            'email': f'user{i}@example.com',
            'scores': [i % 100, (i * 2) % 100, (i * 3) % 100],
            'active': i % 2 == 0
        })
    
    print(f"📝 Test data: {len(test_data)} records")
    
    # Binary file operations
    binary_file = '../sample_files/performance_test.pkl'
    
    # Write binary
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    with open(binary_file, 'wb') as f:
        pickle.dump(test_data, f)
    binary_write_time = time.perf_counter() - start_time
    
    # Read binary
    start_time = time.perf_counter()
    with open(binary_file, 'rb') as f:
        binary_loaded = pickle.load(f)
    binary_read_time = time.perf_counter() - start_time
    
    # Text file operations (JSON)
    json_file = '../sample_files/performance_test.json'
    
    # Write JSON
    start_time = time.perf_counter()
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(test_data, f)
    json_write_time = time.perf_counter() - start_time
    
    # Read JSON
    start_time = time.perf_counter()
    with open(json_file, 'r', encoding='utf-8') as f:
        json_loaded = json.load(f)
    json_read_time = time.perf_counter() - start_time
    
    # Get file sizes
    binary_size = os.path.getsize(binary_file)
    json_size = os.path.getsize(json_file)
    
    # Display results
    print(f"\n⏱️ Performance Results:")
    print(f"   Binary write: {binary_write_time:.4f}s")
    print(f"   JSON write:   {json_write_time:.4f}s")
    print(f"   Binary read:  {binary_read_time:.4f}s")
    print(f"   JSON read:    {json_read_time:.4f}s")
    
    print(f"\n💾 File Sizes:")
    print(f"   Binary file: {binary_size:,} bytes")
    print(f"   JSON file:   {json_size:,} bytes")
    print(f"   Size ratio:  {json_size/binary_size:.2f}x (JSON vs Binary)")
    
    print(f"\n🚀 Speed Comparison (JSON time / pickle time; above 1.0 means pickle was faster):")
    print(f"   Write ratio: {json_write_time/binary_write_time:.1f}x")
    print(f"   Read ratio:  {json_read_time/binary_read_time:.1f}x")
    
    # Verify data integrity
    print(f"\n🔍 Data Integrity:")
    print(f"   Binary data matches: {binary_loaded == test_data}")
    print(f"   JSON data matches:   {json_loaded == test_data}")
    
    # Clean up large files
    os.remove(binary_file)
    os.remove(json_file)
    print(f"\n🧹 Cleaned up test files")

performance_comparison()
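
A single timed run like the one above can be noisy, because disk caching and other processes affect the result. As a rough sketch, the helper below (the name `best_of` is hypothetical) repeats a call several times and keeps the fastest run. It measures in-memory `dumps`/`loads` round trips rather than file I/O, so the numbers are not directly comparable to the cell above and will vary by machine.

```python
import json
import pickle
import time

def best_of(func, repeats=5):
    """Return the shortest wall-clock time over several calls to func."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best

# Smaller version of the test data used above.
records = [
    {'id': i, 'name': f'User_{i:04d}', 'active': i % 2 == 0}
    for i in range(1000)
]

pickle_time = best_of(lambda: pickle.loads(pickle.dumps(records)))
json_time = best_of(lambda: json.loads(json.dumps(records)))

print(f"pickle round trip: {pickle_time:.6f}s")
print(f"JSON round trip:   {json_time:.6f}s")
```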

## 🔒 Binary File Best Practices and Security

In [None]:
def demonstrate_best_practices():
    """Show best practices for binary file operations"""
    print("🔒 Binary File Best Practices")
    print("=" * 35)
    
    # 1. Always use context managers
    print("\n1. ✅ Always use context managers (with statement)")
    
    # Good practice
    data = {'message': 'Hello, World!'}
    with open('../sample_files/good_practice.pkl', 'wb') as f:
        pickle.dump(data, f)
    print("   ✅ File automatically closed even if error occurs")
    
    # 2. Handle exceptions properly
    print("\n2. ✅ Handle exceptions properly")
    
    def safe_pickle_load(filename):
        """Safely load pickle file with error handling"""
        try:
            with open(filename, 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            print(f"   ❌ File not found: {filename}")
            return None
        except pickle.UnpicklingError:
            print(f"   ❌ Invalid pickle file: {filename}")
            return None
        except Exception as e:
            print(f"   ❌ Unexpected error: {e}")
            return None
    
    # Test with valid file
    result = safe_pickle_load('../sample_files/good_practice.pkl')
    print(f"   ✅ Loaded data: {result}")
    
    # Test with non-existent file
    result = safe_pickle_load('../sample_files/nonexistent.pkl')
    
    # 3. Validate data after loading
    print("\n3. ✅ Validate data after loading")
    
    def validate_student_data(data):
        """Validate student data structure"""
        required_fields = ['id', 'name', 'grade']
        
        if not isinstance(data, dict):
            return False, "Data must be a dictionary"
        
        for field in required_fields:
            if field not in data:
                return False, f"Missing required field: {field}"
        
        if not isinstance(data['id'], int) or data['id'] <= 0:
            return False, "ID must be a positive integer"
        
        if not isinstance(data['name'], str) or not data['name'].strip():
            return False, "Name must be a non-empty string"
        
        if not isinstance(data['grade'], (int, float)) or not (0 <= data['grade'] <= 100):
            return False, "Grade must be a number between 0 and 100"
        
        return True, "Valid"
    
    # Test validation
    test_students = [
        {'id': 1, 'name': 'Alice', 'grade': 85},  # Valid
        {'id': -1, 'name': 'Bob', 'grade': 90},   # Invalid ID
        {'id': 2, 'name': '', 'grade': 75},       # Invalid name
        {'id': 3, 'name': 'Charlie', 'grade': 150} # Invalid grade
    ]
    
    for i, student in enumerate(test_students):
        is_valid, message = validate_student_data(student)
        status = "✅" if is_valid else "❌"
        print(f"   {status} Student {i+1}: {message}")
    
    # 4. Use appropriate pickle protocol
    print("\n4. ✅ Use appropriate pickle protocol")
    
    test_data = {'version': '1.0', 'data': [1, 2, 3, 4, 5]}
    
    # Different protocols
    protocols = [2, 3, 4, pickle.HIGHEST_PROTOCOL]
    
    for protocol in protocols:
        filename = f'../sample_files/protocol_{protocol}.pkl'
        with open(filename, 'wb') as f:
            pickle.dump(test_data, f, protocol=protocol)
        
        file_size = os.path.getsize(filename)
        print(f"   Protocol {protocol}: {file_size} bytes")
    
    print(f"   💡 Protocol {pickle.HIGHEST_PROTOCOL} is the newest available here; older Python versions may not be able to read it")
    
    # 5. Security warning
    print("\n5. ⚠️ Security Warning")
    print("   ❌ NEVER load pickle files from untrusted sources!")
    print("   ❌ Pickle can execute arbitrary code during deserialization")
    print("   ✅ For untrusted data, use JSON or other safe formats")
    
    # Clean up protocol test files
    for protocol in protocols:
        filename = f'../sample_files/protocol_{protocol}.pkl'
        if os.path.exists(filename):
            os.remove(filename)

demonstrate_best_practices()

## 🎯 Key Takeaways

### Binary File Modes
- **`rb`**: Read binary files
- **`wb`**: Write binary files (truncates existing)
- **`ab`**: Append to binary files
- **`rb+`, `wb+`, `ab+`**: Combined read/write modes

### Pickle Operations
- **`pickle.dump(obj, file)`**: Serialize object to file
- **`pickle.load(file)`**: Deserialize object from file
- **`pickle.dumps(obj)`**: Serialize to bytes
- **`pickle.loads(bytes)`**: Deserialize from bytes

### Best Practices
1. **Always use context managers** (`with` statement)
2. **Handle exceptions** properly
3. **Validate data** after loading
4. **Choose a pickle protocol** that every Python version reading the file supports
5. **Never trust untrusted pickle files** (security risk)

### Performance Benefits
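- **Often faster** read/write operations than text formats, though results depend on the data, so measure for your workload
- **Often smaller** file sizes, though this is not guaranteed
- **Preserves** Python data types such as tuples and sets
- **No text encoding/decoding** step when reading or writing raw bytes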
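
The type-preservation point is easy to see side by side. In this small sketch with made-up data, the same dictionary goes through pickle and through JSON. Pickle returns an equal object, while JSON turns the tuple into a list and the integer key into a string, and rejects sets outright.

```python
import json
import pickle

record = {'pair': (85, 'Math'), 1: 'integer key'}

via_pickle = pickle.loads(pickle.dumps(record))
via_json = json.loads(json.dumps(record))

print(via_pickle == record)  # True
print(via_json)              # {'pair': [85, 'Math'], '1': 'integer key'}

# Sets survive pickling but are not JSON-serializable by default.
set_round_trip = pickle.loads(pickle.dumps({'a', 'b'}))
try:
    json.dumps({'tags': {'a', 'b'}})
    json_accepts_sets = True
except TypeError:
    json_accepts_sets = False

print(set_round_trip == {'a', 'b'}, json_accepts_sets)  # True False
```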

## 🚨 Security Warning

**NEVER** load pickle files from untrusted sources! Pickle can execute arbitrary code during deserialization, making it a security risk. For untrusted data, use safer formats like JSON.
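
When a pickle of uncertain origin must be read anyway, the `pickle` documentation describes restricting which globals may be loaded by overriding `Unpickler.find_class`. The sketch below follows that pattern. The allow-list contents and the helper name `restricted_loads` are illustrative. This narrows the attack surface but is not a guarantee of safety, so a format like JSON remains the better choice for untrusted input.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these builtins may be referenced by a pickle.
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global (class or function).
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers and allowed builtins load normally.
ok = restricted_loads(pickle.dumps([1, 2, range(15)]))

# A payload that references os.system is rejected before anything runs.
malicious = b"cos\nsystem\n(S'echo hello world'\ntR."
try:
    restricted_loads(malicious)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(ok, blocked)
```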

## 🔜 What's Next?

In the next notebook, we'll explore CSV file operations, which provide a safer and more portable way to store structured data.