# Python Data Serialization and I/O
This notebook covers data serialization and input/output (I/O) in Python, including pickle, CSV, XML, Excel, images, and PDFs, with real-life use cases, best practices, and code examples.

## 1. Pickle (Object Serialization)
**Definition:** The `pickle` module serializes Python objects to a byte stream and deserializes them back. It is useful for saving models or data structures between runs.

**Syntax and Example:**

In [None]:
import pickle
import numpy as np
import os
import time

# Basic pickle example
# Create a Python dictionary
data = {'a': 1, 'b': 2, 'c': [3, 4, 5], 'd': {'nested': 'dictionary'}}

# Serialize the data to a file
print("Pickling data to file...")
with open('data.pkl', 'wb') as f:  # 'wb' mode for writing binary data
    pickle.dump(data, f)

# Deserialize the data from the file
print("Unpickling data from file...")
with open('data.pkl', 'rb') as f:  # 'rb' mode for reading binary data
    loaded_data = pickle.load(f)

# Check if the loaded data matches the original
print(f"Original data: {data}")
print(f"Loaded data: {loaded_data}")
print(f"Data matches: {data == loaded_data}")

# Pickling multiple objects
print("\nPickling multiple objects...")
data1 = [1, 2, 3, 4]
data2 = "Hello, world!"
data3 = {"key": "value"}

with open('multiple.pkl', 'wb') as f:
    pickle.dump(data1, f)
    pickle.dump(data2, f)
    pickle.dump(data3, f)

with open('multiple.pkl', 'rb') as f:
    loaded1 = pickle.load(f)
    loaded2 = pickle.load(f)
    loaded3 = pickle.load(f)

print(f"Loaded objects: {loaded1}, {loaded2}, {loaded3}")

# Pickling more complex objects (like NumPy arrays)
print("\nPickling NumPy arrays...")
array = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Original array:\n{array}")

with open('array.pkl', 'wb') as f:
    pickle.dump(array, f)

with open('array.pkl', 'rb') as f:
    loaded_array = pickle.load(f)

print(f"Loaded array:\n{loaded_array}")
print(f"Arrays equal: {np.array_equal(array, loaded_array)}")

# Custom objects can also be pickled
print("\nPickling custom objects...")

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def greet(self):
        return f"Hello, my name is {self.name} and I am {self.age} years old."
    
    def __eq__(self, other):
        if not isinstance(other, Person):
            return False
        return self.name == other.name and self.age == other.age

# Create an instance
person = Person("Alice", 30)
print(f"Original greeting: {person.greet()}")

# Pickle the object
with open('person.pkl', 'wb') as f:
    pickle.dump(person, f)

# Unpickle the object
with open('person.pkl', 'rb') as f:
    loaded_person = pickle.load(f)

print(f"Loaded greeting: {loaded_person.greet()}")
print(f"Objects equal: {person == loaded_person}")

# Performance comparison (pickle vs JSON serialization)
print("\nPerformance comparison:")

# Generate a large dictionary
large_dict = {i: f"value_{i}" for i in range(10000)}

# Measure time for pickle
start_time = time.time()
with open('large.pkl', 'wb') as f:
    pickle.dump(large_dict, f)
with open('large.pkl', 'rb') as f:
    _ = pickle.load(f)
pickle_time = time.time() - start_time
print(f"Pickle time: {pickle_time:.4f} seconds")

# Measure time for JSON serialization
# (import before starting the timer so the import cost is not measured)
import json
start_time = time.time()
with open('large.json', 'w') as f:
    json.dump(large_dict, f)  # Note: JSON converts the integer keys to strings
with open('large.json', 'r') as f:
    _ = json.load(f)
json_time = time.time() - start_time
print(f"JSON time: {json_time:.4f} seconds")

# Different pickle protocols
print("\nPickle protocols:")
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    filename = f"protocol_{protocol}.pkl"
    start_time = time.time()
    with open(filename, 'wb') as f:
        pickle.dump(large_dict, f, protocol=protocol)
    end_time = time.time()
    filesize = os.path.getsize(filename)
    print(f"Protocol {protocol}: {end_time - start_time:.4f} seconds, {filesize} bytes")

# Security warning
print("\nSECURITY WARNING: Never unpickle data from untrusted sources!")
print("Pickle can execute arbitrary code during unpickling.")

# Clean up created files
for file in ['data.pkl', 'multiple.pkl', 'array.pkl', 'person.pkl', 'large.pkl', 'large.json'] + \
           [f"protocol_{i}.pkl" for i in range(pickle.HIGHEST_PROTOCOL + 1)]:
    if os.path.exists(file):
        os.remove(file)

# Expected output:
# Pickling data to file...
# Unpickling data from file...
# Original data: {'a': 1, 'b': 2, 'c': [3, 4, 5], 'd': {'nested': 'dictionary'}}
# Loaded data: {'a': 1, 'b': 2, 'c': [3, 4, 5], 'd': {'nested': 'dictionary'}}
# Data matches: True
#
# Pickling multiple objects...
# Loaded objects: [1, 2, 3, 4], Hello, world!, {'key': 'value'}
#
# Pickling NumPy arrays...
# Original array:
# [[1 2 3]
#  [4 5 6]]
# Loaded array:
# [[1 2 3]
#  [4 5 6]]
# Arrays equal: True
#
# Pickling custom objects...
# Original greeting: Hello, my name is Alice and I am 30 years old.
# Loaded greeting: Hello, my name is Alice and I am 30 years old.
# Objects equal: True
#
# Performance comparison:
# Pickle time: <elapsed> seconds (varies by machine)
# JSON time: <elapsed> seconds (varies by machine)
#
# Pickle protocols:
# Protocol 0: <elapsed> seconds, <size> bytes
# ...
# Protocol 5: <elapsed> seconds, <size> bytes
# (one line per protocol from 0 to pickle.HIGHEST_PROTOCOL; timings and file sizes vary by machine and Python version)
#
# SECURITY WARNING: Never unpickle data from untrusted sources!
# Pickle can execute arbitrary code during unpickling.

**Output:**
Data matches: True
Arrays equal: True
Objects equal: True

**Real-life use case:** Saving trained machine learning models for later use.

**Common mistakes:** Pickle files are not secure against code execution attacks. Never unpickle data from untrusted sources.
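
When a pickle has to be loaded from a source that is only partly trusted, one mitigation is to limit which globals the unpickler may resolve by overriding `Unpickler.find_class`. The following is a minimal sketch of that idea, not a complete defense; the allow-list `SAFE_BUILTINS` and the helper `restricted_loads` are names chosen for this illustration.

```python
import builtins
import io
import pickle

# Hypothetical allow-list for this sketch: only these builtins may be resolved.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allow-list."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(payload: bytes):
    """Behaves like pickle.loads(), but uses the restricted find_class above."""
    return RestrictedUnpickler(io.BytesIO(payload)).load()


# Plain containers need no forbidden lookups, so they load normally.
safe_result = restricted_loads(pickle.dumps({"a": [1, 2, 3], "b": range(3)}))
print(safe_result)

# A pickle that references builtins.eval is rejected during loading.
try:
    restricted_loads(pickle.dumps(eval))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Rejected: {exc}")
```

An allow-list only narrows the attack surface. For data that crosses a trust boundary, a data-only format such as JSON is usually the safer choice.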

**Best practices:** Use pickle only for trusted data and consider alternatives for cross-language compatibility.
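
As one possible cross-language alternative, plain data (dicts, lists, strings, numbers, booleans) can be round-tripped through JSON with the standard library. This sketch uses made-up sample data and also shows two caveats: JSON object keys are always strings, and tuples come back as lists.

```python
import json

# Sample record made up for this illustration
record = {"name": "Alice", "age": 30, "scores": [85.5, 92.0], "active": True}

encoded = json.dumps(record, sort_keys=True)  # text that any language can parse
decoded = json.loads(encoded)
print(decoded == record)

# Caveats: non-string keys become strings, and tuples become lists
with_caveats = json.loads(json.dumps({1: (1, 2)}))
print(with_caveats)
```

Unlike pickle, JSON cannot store arbitrary Python objects such as class instances or NumPy arrays without a custom encoding step.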

## 2. CSV Files
**Definition:** CSV (Comma-Separated Values) is a common format for tabular data.

**Syntax and Example:**

In [None]:
import csv
import os
import pandas as pd
import io

# Basic CSV writing example
print("Basic CSV writing example:")

# Create a CSV file
with open('example.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write the header row
    writer.writerow(['name', 'age', 'city'])
    # Write data rows
    writer.writerow(['Alice', 30, 'New York'])
    writer.writerow(['Bob', 25, 'San Francisco'])
    writer.writerow(['Charlie', 35, 'Chicago'])

# Basic CSV reading example
print("\nBasic CSV reading example:")
with open('example.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        if i == 0:
            print(f"Header: {row}")
        else:
            print(f"Data row {i}: {row}")

# Using DictReader and DictWriter for more readable code
print("\nUsing DictReader and DictWriter:")

# Create a CSV file with DictWriter
with open('dict_example.csv', 'w', newline='') as f:
    fieldnames = ['id', 'name', 'department', 'salary']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    
    writer.writeheader()  # Write the header row
    writer.writerow({'id': 1, 'name': 'John', 'department': 'IT', 'salary': 75000})
    writer.writerow({'id': 2, 'name': 'Jane', 'department': 'HR', 'salary': 65000})
    writer.writerow({'id': 3, 'name': 'Bob', 'department': 'Sales', 'salary': 80000})

# Read CSV with DictReader
with open('dict_example.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    print("Employee data:")
    for row in reader:
        print(f"  ID: {row['id']}, Name: {row['name']}, Dept: {row['department']}, Salary: ${row['salary']}")

# Handling different delimiters
print("\nHandling different delimiters:")

# Write a TSV (Tab-Separated Values) file
with open('tab_example.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['Name', 'Country', 'Profession'])
    writer.writerow(['Maria', 'Brazil', 'Engineer'])
    writer.writerow(['Ahmed', 'Egypt', 'Doctor'])

# Read TSV file
with open('tab_example.tsv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    print("Tab-separated data:")
    for row in reader:
        print(f"  {' | '.join(row)}")

# Handling quotes and special characters
print("\nHandling quotes and special characters:")

# Create data with commas and quotes
complex_data = [
    ['ID', 'Description', 'Notes'],
    [1, 'Product, with comma', 'Customer said "excellent quality"'],
    [2, 'Another, item', 'Multiple\nlines\nof text']
]

# Write to CSV with appropriate quoting
with open('complex.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(complex_data)

# Read back with proper handling
with open('complex.csv', 'r', newline='') as f:  # newline='' keeps embedded newlines intact
    reader = csv.reader(f)
    print("Complex data with special characters:")
    for row in reader:
        print(f"  {row}")

# Different quoting styles
print("\nDifferent quoting styles:")
quoting_styles = {
    'QUOTE_MINIMAL': csv.QUOTE_MINIMAL,     # Quote fields only if needed (default)
    'QUOTE_ALL': csv.QUOTE_ALL,             # Quote all fields
    'QUOTE_NONNUMERIC': csv.QUOTE_NONNUMERIC, # Quote all non-numeric fields
    'QUOTE_NONE': csv.QUOTE_NONE            # Never quote fields
}

sample_data = [['Name', 'Age', 'Comment'], ['Alice', 30, 'Good, student']]

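for name, style in quoting_styles.items():
    output = io.StringIO()
    # QUOTE_NONE cannot quote the embedded comma, so it needs an escape character;
    # without one, csv raises "need to escape, but no escapechar set"
    escapechar = '\\' if style == csv.QUOTE_NONE else None
    writer = csv.writer(output, quoting=style, escapechar=escapechar)
    writer.writerows(sample_data)
    # Show the '\r\n' row terminator as a literal \n so each style prints on one line
    one_line = output.getvalue().strip().replace('\r\n', '\\n')
    print(f"  {name}: {one_line}")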

# Handling encoding issues
print("\nHandling encoding issues:")
print("When working with CSV files that contain non-ASCII characters:")
print("with open('data.csv', 'r', encoding='utf-8') as f:  # Specify encoding")
print("    reader = csv.reader(f)")
print("    # etc...")

# Working with large CSV files
print("\nWorking with large CSV files:")
print("For large files, read in chunks to avoid memory issues:")
print("""def process_large_csv(filename, chunk_size=1000):
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        header = next(reader)  # Read the header
        
        chunk = []
        for i, row in enumerate(reader):
            chunk.append(row)
            
            # Process each chunk
            if (i + 1) % chunk_size == 0:
                process_chunk(chunk, header)
                chunk = []
                
        # Process the last chunk if it exists
        if chunk:
            process_chunk(chunk, header)
""")

# Using pandas with CSV
print("\nUsing pandas with CSV:")

# Create a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [30, 25, 35, 28],
    'city': ['New York', 'San Francisco', 'Chicago', 'Boston'],
    'score': [85.5, 92.0, 78.5, 90.0]
})

# Write to CSV
df.to_csv('pandas_example.csv', index=False)

# Read from CSV
df_read = pd.read_csv('pandas_example.csv')
print("DataFrame from CSV:")
print(df_read.head())

# Pandas additional options
print("\nPandas additional options:")
print("Writing with specific options:")
print("df.to_csv('output.csv', index=False, sep='|', na_rep='NULL')")

print("\nReading with specific options:")
print("pd.read_csv('input.csv', usecols=['name', 'age'], nrows=100, skiprows=1)")

# Best practices
print("\nCSV best practices:")
best_practices = [
    "1. Always specify newline='' when opening files to avoid platform-specific issues",
    "2. Specify encoding (usually utf-8) when working with international characters",
    "3. Use DictReader/DictWriter for more readable code with named columns",
    "4. Consider using pandas for complex CSV operations in data science",
    "5. Use appropriate quoting settings for data with special characters",
    "6. Process large files in chunks to avoid memory issues",
    "7. Include error handling for malformed CSV data",
    "8. Always close file handles (or use 'with' statements)"
]

for practice in best_practices:
    print(practice)

# Clean up created files
for file in ['example.csv', 'dict_example.csv', 'tab_example.tsv', 'complex.csv', 'pandas_example.csv']:
    if os.path.exists(file):
        os.remove(file)

# Expected output:
# Basic CSV writing example:
#
# Basic CSV reading example:
# Header: ['name', 'age', 'city']
# Data row 1: ['Alice', '30', 'New York']
# Data row 2: ['Bob', '25', 'San Francisco']
# Data row 3: ['Charlie', '35', 'Chicago']
#
# Using DictReader and DictWriter:
# Employee data:
#   ID: 1, Name: John, Dept: IT, Salary: $75000
#   ID: 2, Name: Jane, Dept: HR, Salary: $65000
#   ID: 3, Name: Bob, Dept: Sales, Salary: $80000
#
# Handling different delimiters:
# Tab-separated data:
#   Name | Country | Profession
#   Maria | Brazil | Engineer
#   Ahmed | Egypt | Doctor
#
# Handling quotes and special characters:
# Complex data with special characters:
#   ['ID', 'Description', 'Notes']
#   ['1', 'Product, with comma', 'Customer said "excellent quality"']
#   ['2', 'Another, item', 'Multiple\nlines\nof text']
#
# Different quoting styles:
#   QUOTE_MINIMAL: Name,Age,Comment\nAlice,30,"Good, student"
#   QUOTE_ALL: "Name","Age","Comment"\n"Alice","30","Good, student"
#   QUOTE_NONNUMERIC: "Name","Age","Comment"\n"Alice",30,"Good, student"
#   QUOTE_NONE (needs an escapechar for the embedded comma, here '\'): Name,Age,Comment\nAlice,30,Good\, student
#
# Handling encoding issues:
# When working with CSV files that contain non-ASCII characters:
# with open('data.csv', 'r', encoding='utf-8') as f:  # Specify encoding
#     reader = csv.reader(f)
#     # etc...
#
# Working with large CSV files:
# For large files, read in chunks to avoid memory issues:
# <code example for processing large files>
#
# Using pandas with CSV:
# DataFrame from CSV:
#       name  age           city  score
# 0    Alice   30       New York   85.5
# 1      Bob   25  San Francisco   92.0
# 2  Charlie   35        Chicago   78.5
# 3    Diana   28         Boston   90.0
#
# Pandas additional options:
# Writing with specific options:
# df.to_csv('output.csv', index=False, sep='|', na_rep='NULL')
#
# Reading with specific options:
# pd.read_csv('input.csv', usecols=['name', 'age'], nrows=100, skiprows=1)
#
# CSV best practices:
# 1. Always specify newline='' when opening files to avoid platform-specific issues
# <more best practices...>

**Output:**
Header: ['name', 'age', 'city']
Data row 1: ['Alice', '30', 'New York']
Data row 2: ['Bob', '25', 'San Francisco']

**Real-life use case:** Importing/exporting data between Excel and Python.

**Common mistakes:** Not handling newlines or encoding issues.

**Best practices:** Always specify newline='' and encoding when working with CSV files.
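
The effect of `newline=''` and an explicit encoding can be checked with a small round trip. This is a minimal sketch with made-up rows containing non-ASCII characters and an embedded line break; the file name `people.csv` is arbitrary, and a temporary directory is used so nothing is left behind.

```python
import csv
import os
import tempfile

# Made-up sample rows: non-ASCII text plus a field with an embedded newline
rows = [
    ["name", "city", "note"],
    ["Zoë", "São Paulo", "line one\nline two"],
    ["Łukasz", "Kraków", "plain"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.csv")

    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

    # Read back with the same newline and encoding settings
    with open(path, "r", newline="", encoding="utf-8") as f:
        round_tripped = list(csv.reader(f))

print(round_tripped == rows)
```

All fields here are strings, so the rows compare equal after the round trip; numeric values would come back as strings and need explicit conversion.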

## 3. XML Files
**Definition:** XML (eXtensible Markup Language) is used for structured data exchange.

**Syntax and Example:**

In [None]:
import xml.etree.ElementTree as ET
import os
import io
from pprint import pprint

# Basic XML parsing example
print("Basic XML parsing example:")

# Create a simple XML string
xml_data = '''
<root>
  <person id="1">
    <name>Alice</name>
    <age>30</age>
    <skills>
      <skill>Python</skill>
      <skill>Data Science</skill>
    </skills>
  </person>
  <person id="2">
    <name>Bob</name>
    <age>25</age>
    <skills>
      <skill>Java</skill>
      <skill>DevOps</skill>
    </skills>
  </person>
</root>
'''

# Parse XML from string
root = ET.fromstring(xml_data)

# Find all person elements
print("People in the XML:")
for person in root.findall('person'):
    # Get attribute
    person_id = person.get('id')
    
    # Get child element text
    name = person.find('name').text
    age = person.find('age').text
    
    # Get multiple child elements
    skills = [skill.text for skill in person.find('skills').findall('skill')]
    
    print(f"  Person ID: {person_id}")
    print(f"    Name: {name}")
    print(f"    Age: {age}")
    print(f"    Skills: {', '.join(skills)}")

# Creating XML with ElementTree
print("\nCreating XML with ElementTree:")

# Create root element
new_root = ET.Element('employees')

# Create first employee element
emp1 = ET.SubElement(new_root, 'employee')
emp1.set('id', '1')  # Set attribute

ET.SubElement(emp1, 'name').text = 'John Doe'
ET.SubElement(emp1, 'position').text = 'Manager'
ET.SubElement(emp1, 'department').text = 'IT'

# Create second employee element
emp2 = ET.SubElement(new_root, 'employee')
emp2.set('id', '2')

ET.SubElement(emp2, 'name').text = 'Jane Smith'
ET.SubElement(emp2, 'position').text = 'Developer'
ET.SubElement(emp2, 'department').text = 'IT'

# Serialize to an in-memory buffer (with XML declaration, not yet indented)
output = io.StringIO()
ET.ElementTree(new_root).write(output, encoding='unicode', xml_declaration=True, method='xml')

# Add manual indentation for readability
# (Python 3.9+ offers ET.indent(new_root) as a built-in alternative)
def indent_xml(elem, level=0):
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent_xml(child, level+1)
        # The last child's tail precedes the parent's closing tag,
        # so it gets the parent's indentation level
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

indent_xml(new_root)
xml_str = ET.tostring(new_root, encoding='unicode')
print(xml_str)

# Working with XML files
print("\nWorking with XML files:")

# Write XML to file
tree = ET.ElementTree(new_root)
tree.write('employees.xml')

# Read XML from file
print("Reading XML from file:")
tree = ET.parse('employees.xml')
root = tree.getroot()

print(f"Root tag: {root.tag}")
print(f"Number of employees: {len(root)}")

# Finding elements with XPath
print("\nFinding elements with XPath:")

# Parse original XML data again for XPath examples
root = ET.fromstring(xml_data)

# Examples of XPath expressions
xpath_examples = [
    (".//name", "All name elements"),
    ("./person[@id='1']", "Person with id=1"),
    (".//skill", "All skill elements"),
    ("./person[age='25']", "Person with age=25"),
]

for xpath, description in xpath_examples:
    elements = root.findall(xpath)
    print(f"{description} ({xpath}):")
    for elem in elements:
        if elem.text and elem.text.strip():  # Element with non-whitespace text
            print(f"  {elem.tag}: {elem.text}")
        else:  # Container element
            attrs = ' '.join([f"{k}='{v}'" for k, v in elem.attrib.items()])
            print(f"  <{elem.tag} {attrs}>")

# Handling large XML files efficiently
print("\nHandling large XML files efficiently:")
print("Using iterparse to process large XML files without loading everything into memory:")

# Example code (not executed)
large_xml_code = '''
import xml.etree.ElementTree as ET

def process_large_xml(filename):
    # Context object for keeping track of what we've seen
    context = {}
    
    # Get an iterable for XML elements
    for event, elem in ET.iterparse(filename, events=('start', 'end')):
        if event == 'start':
            # Do something when we first encounter an element
            pass
        elif event == 'end':
            # Process element after all children have been processed
            if elem.tag == 'person':
                process_person(elem, context)
                # Clear element to free memory
                elem.clear()
'''
print(large_xml_code)

# Handling namespaces in XML
print("\nHandling namespaces in XML:")

# Example XML with namespaces
xml_with_ns = '''
<root xmlns="http://default-namespace.org" 
      xmlns:ns1="http://example.org/ns1" 
      xmlns:ns2="http://example.org/ns2">
  <ns1:item id="1">Default namespace item</ns1:item>
  <ns2:item id="2">Another namespaced item</ns2:item>
</root>
'''

# Parse XML with namespaces
root = ET.fromstring(xml_with_ns)

# Define namespace map
namespaces = {
    'default': 'http://default-namespace.org',
    'ns1': 'http://example.org/ns1',
    'ns2': 'http://example.org/ns2'
}

print("Finding elements with namespaces:")
# Find elements in specific namespace
ns1_items = root.findall('.//{http://example.org/ns1}item')
print(f"Items in ns1 namespace: {len(ns1_items)}")
for item in ns1_items:
    print(f"  ID: {item.get('id')}, Text: {item.text}")

# Using namespace map with find/findall
print("\nUsing namespace map:")
ns2_items = root.findall('.//ns2:item', namespaces)
print(f"Items in ns2 namespace: {len(ns2_items)}")
for item in ns2_items:
    print(f"  ID: {item.get('id')}, Text: {item.text}")

# Alternative XML libraries in Python
print("\nAlternative XML libraries in Python:")
alternatives = {
    "lxml": "Faster and more feature-rich XML processing, compatible with ElementTree API",
    "BeautifulSoup": "More forgiving XML/HTML parser, good for web scraping",
    "xmltodict": "Converts XML to Python dictionaries for easier handling"
}
for lib, desc in alternatives.items():
    print(f"- {lib}: {desc}")

# Clean up files created
if os.path.exists('employees.xml'):
    os.remove('employees.xml')

# Expected output (abbreviated):
# Basic XML parsing example:
# People in the XML:
#   Person ID: 1
#     Name: Alice
#     Age: 30
#     Skills: Python, Data Science
#   Person ID: 2
#     Name: Bob
#     Age: 25
#     Skills: Java, DevOps
#
# Creating XML with ElementTree:
# <employees>
#   <employee id="1">
#     <name>John Doe</name>
#     <position>Manager</position>
#     <department>IT</department>
#   </employee>
#   <employee id="2">
#     <name>Jane Smith</name>
#     <position>Developer</position>
#     <department>IT</department>
#   </employee>
# </employees>
#
# Working with XML files:
# Reading XML from file:
# Root tag: employees
# Number of employees: 2
#
# Finding elements with XPath:
# All name elements (.//name):
#   name: Alice
#   name: Bob
# Person with id=1 (./person[@id='1']):
#   <person id='1'>
# All skill elements (.//skill):
#   skill: Python
#   skill: Data Science
#   skill: Java
#   skill: DevOps
# Person with age=25 (./person[age='25']):
#   <person id='2'>
#
# Handling large XML files efficiently:
# Using iterparse to process large XML files without loading everything into memory:
# <code example>
#
# Handling namespaces in XML:
# Finding elements with namespaces:
# Items in ns1 namespace: 1
#   ID: 1, Text: Default namespace item
#
# Using namespace map:
# Items in ns2 namespace: 1
#   ID: 2, Text: Another namespaced item
#
# Alternative XML libraries in Python:
# - lxml: Faster and more feature-rich XML processing, compatible with ElementTree API
# - BeautifulSoup: More forgiving XML/HTML parser, good for web scraping
# - xmltodict: Converts XML to Python dictionaries for easier handling

**Output:**
Root tag: employees
Number of employees: 2

**Real-life use case:** Reading configuration files or data from web services.

**Common mistakes:** Not handling namespaces or large files efficiently.

**Best practices:** Use iterparse for large XML files.
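
A runnable sketch of the `iterparse` pattern follows. It parses a small in-memory document so it can be executed as-is; the sample XML and the generator name `iter_people` are made up for this illustration, and a real use case would pass a file path or an open file instead.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory document standing in for a large file
xml_bytes = b"""<root>
  <person id="1"><name>Alice</name><age>30</age></person>
  <person id="2"><name>Bob</name><age>25</age></person>
  <person id="3"><name>Carol</name><age>41</age></person>
</root>"""


def iter_people(source):
    """Yield (id, name, age) tuples one person at a time."""
    # 'end' events fire once an element and all of its children are parsed
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "person":
            yield elem.get("id"), elem.findtext("name"), int(elem.findtext("age"))
            # Drop the element's children and text to release memory
            elem.clear()


people = list(iter_people(io.BytesIO(xml_bytes)))
print(people)
```

Note that `elem.clear()` empties each processed element, but the root still holds references to the emptied elements; for very large files those references may also need to be removed from the root.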

## 4. Excel Files
**Definition:** Excel files are widely used for data storage and analysis. Use `pandas` for easy reading/writing.

**Syntax and Example:**

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from datetime import datetime

# Basic Excel writing with pandas
print("Basic Excel writing with pandas:")

# Create a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Evan'],
    'age': [30, 25, 35, 28, 42],
    'department': ['IT', 'HR', 'Finance', 'IT', 'Marketing'],
    'salary': [75000, 65000, 80000, 70000, 85000],
    'hire_date': ['2020-01-15', '2021-03-10', '2019-07-22', '2020-11-05', '2018-05-30']
}

df = pd.DataFrame(data)

# Convert date strings to datetime objects
df['hire_date'] = pd.to_datetime(df['hire_date'])

# Display the DataFrame
print("Sample DataFrame:")
print(df)

# Write to Excel file
print("\nWriting DataFrame to Excel...")
df.to_excel('example.xlsx', index=False)

# Read from Excel file
print("Reading DataFrame from Excel...")
df2 = pd.read_excel('example.xlsx')
print(df2.head())

# Working with multiple sheets
print("\nWorking with multiple sheets:")

# Create another DataFrame
projects = pd.DataFrame({
    'project_id': ['P001', 'P002', 'P003', 'P004'],
    'project_name': ['Website Redesign', 'Database Migration', 'Mobile App', 'Cloud Integration'],
    'budget': [50000, 75000, 100000, 60000],
    'deadline': ['2023-12-31', '2023-10-15', '2024-03-01', '2023-11-30']
})

# Write multiple DataFrames to different sheets
with pd.ExcelWriter('company_data.xlsx') as writer:
    df.to_excel(writer, sheet_name='Employees', index=False)
    projects.to_excel(writer, sheet_name='Projects', index=False)

# Read specific sheet from Excel
employees = pd.read_excel('company_data.xlsx', sheet_name='Employees')
print("Employees sheet:")
print(employees.head(3))  # Show first 3 rows

projects = pd.read_excel('company_data.xlsx', sheet_name='Projects')
print("\nProjects sheet:")
print(projects.head(3))  # Show first 3 rows

# List all sheets in an Excel file
print("\nListing all sheets in Excel file:")
# Use a context manager so the file handle is closed before the cleanup step
with pd.ExcelFile('company_data.xlsx') as xl:
    print(f"Sheets in file: {xl.sheet_names}")

# Formatting Excel output
print("\nFormatting Excel output:")

# Create a more complete example (won't run for real in this notebook, just demonstration)
formatting_code = '''
# Create a new Excel file with formatting
with pd.ExcelWriter('formatted_report.xlsx', engine='xlsxwriter') as writer:
    # Write DataFrame to sheet
    df.to_excel(writer, sheet_name='Report', index=False)
    
    # Get the xlsxwriter workbook and worksheet objects
    workbook = writer.book
    worksheet = writer.sheets['Report']
    
    # Add formats
    header_format = workbook.add_format({
        'bold': True,
        'text_wrap': True,
        'valign': 'top',
        'fg_color': '#D7E4BC',
        'border': 1
    })
    
    currency_format = workbook.add_format({
        'num_format': '$#,##0'
    })
    
    date_format = workbook.add_format({
        'num_format': 'yyyy-mm-dd'
    })
    
    # Apply formats to specific columns
    worksheet.set_column('A:A', 15)  # Width of name column
    worksheet.set_column('D:D', 12, currency_format)  # Apply currency format to salary
    worksheet.set_column('E:E', 12, date_format)  # Apply date format to hire_date
    
    # Apply header format to the header row
    for col_num, value in enumerate(df.columns.values):
        worksheet.write(0, col_num, value, header_format)
        
    # Create a chart
    chart = workbook.add_chart({'type': 'column'})
    
    # Configure the chart
    chart.add_series({
        'name': 'Salaries',
        'categories': ['Report', 1, 0, 5, 0],  # Names (A2:A6)
        'values': ['Report', 1, 3, 5, 3],      # Salary values (D2:D6)
    })
    
    chart.set_title({'name': 'Employee Salaries'})
    chart.set_x_axis({'name': 'Employee'})
    chart.set_y_axis({'name': 'Salary ($)', 'major_gridlines': {'visible': False}})
    
    # Insert the chart into the worksheet
    worksheet.insert_chart('G2', chart)
'''
print(formatting_code)

# Using openpyxl directly for more control
print("\nUsing openpyxl directly for more control:")
openpyxl_code = '''
import openpyxl
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.chart import BarChart, Reference

# Create a new workbook and select the active sheet
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Employee Data"

# Add headers
headers = ["Name", "Age", "Department", "Salary", "Hire Date"]
for col_num, header in enumerate(headers, 1):
    cell = ws.cell(row=1, column=col_num)
    cell.value = header
    cell.font = Font(bold=True)
    cell.fill = PatternFill("solid", fgColor="A9D08E")
    cell.alignment = Alignment(horizontal="center")

# Add data
data = [
    ["Alice", 30, "IT", 75000, "2020-01-15"],
    ["Bob", 25, "HR", 65000, "2021-03-10"],
    # ... more data ...
]

for row_num, row_data in enumerate(data, 2):
    for col_num, cell_value in enumerate(row_data, 1):
        cell = ws.cell(row=row_num, column=col_num)
        cell.value = cell_value

# Set column widths
for col in ws.columns:
    column = col[0].column_letter
    ws.column_dimensions[column].width = 15

# Create a chart
chart = BarChart()
chart.title = "Employee Salaries"
chart.x_axis.title = "Employee"
chart.y_axis.title = "Salary"

# Add data to chart
data = Reference(ws, min_col=4, min_row=1, max_row=6, max_col=4)
categories = Reference(ws, min_col=1, min_row=2, max_row=6)
chart.add_data(data, titles_from_data=True)
chart.set_categories(categories)

# Add chart to worksheet
ws.add_chart(chart, "G2")

# Save workbook
wb.save("openpyxl_example.xlsx")
'''
print(openpyxl_code)

# Reading Excel files with specific options
print("\nReading Excel files with specific options:")
print("Example options for pd.read_excel():")

options = {
    "sheet_name": "Specify sheet by name or index (0-based)",
    "header": "Row to use as column names (default 0)",
    "names": "List of column names to use",
    "usecols": "Which columns to read (e.g., 'A:C' or [0, 2])",
    "skiprows": "Skip rows at the beginning (int or list)",
    "nrows": "Number of rows to read",
    "na_values": "Values to recognize as NaN",
    "parse_dates": "List of columns to parse as dates",
    "dtype": "Dict of column data types"
}

for option, description in options.items():
    print(f"- {option}: {description}")

# Common Excel tasks in data science
print("\nCommon Excel tasks in data science:")

tasks = [
    "1. Data import/export: Reading raw data from Excel and writing processed results",
    "2. Data cleaning: Filtering out invalid rows, handling missing values",
    "3. Automated reporting: Generating Excel reports with charts and formatting",
    "4. Data transformation: Reshaping data from Excel into analysis-ready format",
    "5. Interactive dashboards: Excel as a frontend for data analysis results"
]

for task in tasks:
    print(task)

# Saving analysis results with charts
print("\nSaving analysis results with charts (example):")

# Create some sample data for demonstration
x = np.linspace(0, 10, 20)
y1 = np.sin(x)
y2 = np.cos(x)

# Create a figure with matplotlib
plt.figure(figsize=(8, 4))
plt.plot(x, y1, 'b-', label='sin(x)')
plt.plot(x, y2, 'r--', label='cos(x)')
plt.legend()
plt.title('Trigonometric Functions')
plt.grid(True)

# Save plot to an image file
plt.savefig('plot.png')
plt.close()

print("Plot saved to 'plot.png'. In a real workflow, you could:")
print("1. Generate multiple analyses and charts")
print("2. Insert charts into Excel worksheets")
print("3. Create an executive summary sheet")
print("4. Format the data for presentation")

# Best practices
print("\nExcel file best practices:")
best_practices = [
    "1. Always specify 'index=False' when writing DataFrames unless you need the index",
    "2. Use descriptive sheet names",
    "3. Include metadata and documentation sheets",
    "4. Apply appropriate formatting for different data types",
    "5. Set column widths for readability",
    "6. Add data validation where applicable",
    "7. Use named ranges for important data sections",
    "8. Include summary statistics and charts",
    "9. Test Excel files with actual users before distribution",
    "10. Consider file size and performance for large datasets"
]

for practice in best_practices:
    print(practice)

# Clean up created files
for file in ['example.xlsx', 'company_data.xlsx', 'plot.png']:
    if os.path.exists(file):
        os.remove(file)

**Output:**
   name  age
0 Alice   30
1   Bob   25

**Real-life use case:** Automating report generation for business analytics.

**Common mistakes:** Not installing the required engine libraries (`openpyxl` for `.xlsx` files; `xlrd` only for legacy `.xls` files).
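
One way to catch a missing engine early is to check whether the package is importable before calling `pd.read_excel`. This is a minimal sketch using only the standard library; the mapping `ENGINE_FOR_EXTENSION` and the helper `missing_excel_engines` are names invented for this illustration, and the mapping reflects the engines pandas uses by default for reading.

```python
import importlib.util

# Assumed mapping for this sketch: default pandas reader engine per extension
ENGINE_FOR_EXTENSION = {".xlsx": "openpyxl", ".xls": "xlrd"}


def missing_excel_engines(extensions):
    """Return the engine packages that are not installed for the given extensions."""
    missing = []
    for ext in extensions:
        package = ENGINE_FOR_EXTENSION[ext]
        # find_spec returns None when a top-level package cannot be found
        if importlib.util.find_spec(package) is None:
            missing.append(package)
    return missing


missing = missing_excel_engines([".xlsx", ".xls"])
if missing:
    print(f"Install before reading Excel files: pip install {' '.join(missing)}")
else:
    print("All Excel engines for .xlsx and .xls are available.")
```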

**Best practices:** Always specify index=False unless you want to save the DataFrame index.

## 5. Images
**Definition:** Use the Pillow library (imported as `PIL`) to create, open, and manipulate images.

**Syntax and Example:**

In [None]:
from PIL import Image, ImageDraw, ImageFont, ImageFilter, ImageEnhance, ImageOps, ExifTags
import numpy as np
import os
import glob
import matplotlib.pyplot as plt
import io
import requests
from datetime import datetime

# 1. Basic image creation and manipulation
print("1. Basic image creation and manipulation:")
# Create a new image with a solid color
img = Image.new('RGB', (200, 100), color='red')
img.save('pil_red.png')

# Open an existing image
img2 = Image.open('pil_red.png')
print(f"Image size: {img2.size}")
print(f"Image mode: {img2.mode}")
print(f"Image format: {img2.format}")

# Resizing images
img_resized = img2.resize((100, 50))
img_resized.save('pil_red_resized.png')
print(f"Resized image size: {img_resized.size}")

# Cropping images
# Crop format is (left, upper, right, lower)
left = 50
upper = 25
right = 150
lower = 75
img_cropped = img2.crop((left, upper, right, lower))
img_cropped.save('pil_red_cropped.png')
print(f"Cropped image size: {img_cropped.size}")

# 2. Drawing on images
print("\n2. Drawing on images:")
# Create a blank canvas
canvas = Image.new('RGB', (400, 200), color='white')
draw = ImageDraw.Draw(canvas)

# Draw shapes
draw.rectangle([(50, 50), (150, 100)], fill='blue', outline='black')
draw.ellipse([(200, 50), (300, 150)], fill='green', outline='red', width=2)
draw.line([(0, 0), (400, 200)], fill='red', width=3)
draw.polygon([(350, 50), (350, 150), (250, 100)], fill='yellow', outline='purple')

# Add text
# Note: arial.ttf is not present on every system, so fall back to Pillow's built-in font
try:
    font = ImageFont.truetype("arial.ttf", 20)
except OSError:
    # Raised when the font file cannot be found or read
    font = ImageFont.load_default()

draw.text((150, 150), "Hello, Pillow!", fill='black', font=font)

# Save the drawing
canvas.save('pil_drawing.png')
print("Drawing saved to 'pil_drawing.png'")

# 3. Image transformations
print("\n3. Image transformations:")

# Rotate image
img_rotated = img2.rotate(45, expand=True)  # expand=True prevents cropping
img_rotated.save('pil_red_rotated.png')
print("Rotated image saved")

# Flip image
img_flipped_h = ImageOps.mirror(img2)  # horizontal flip
img_flipped_v = ImageOps.flip(img2)    # vertical flip
img_flipped_h.save('pil_red_flipped_h.png')
img_flipped_v.save('pil_red_flipped_v.png')
print("Flipped images saved")

# 4. Image enhancement and filtering
print("\n4. Image enhancement and filtering:")

# Create a test image with a gradient
grad_img = Image.new('RGB', (256, 100), color='black')
draw = ImageDraw.Draw(grad_img)
for x in range(256):
    draw.line([(x, 0), (x, 100)], fill=(x, x, 255-x))
grad_img.save('gradient.png')
print("Created gradient test image")

# Apply filters
blur_img = grad_img.filter(ImageFilter.GaussianBlur(radius=5))
blur_img.save('gradient_blurred.png')
print("Applied Gaussian blur filter")

edge_img = grad_img.filter(ImageFilter.FIND_EDGES)
edge_img.save('gradient_edges.png')
print("Applied edge detection filter")

# Enhance images
enhancer = ImageEnhance.Contrast(grad_img)
enhanced_img = enhancer.enhance(2.0)  # Increase contrast by factor of 2
enhanced_img.save('gradient_enhanced.png')
print("Enhanced image contrast")

# Convert to grayscale
gray_img = ImageOps.grayscale(grad_img)
gray_img.save('gradient_gray.png')
print("Converted to grayscale")

# 5. Working with image data as numpy arrays
print("\n5. Working with image data as numpy arrays:")

# Convert image to numpy array
img_array = np.array(grad_img)
print(f"Image array shape: {img_array.shape}")
print(f"Image data type: {img_array.dtype}")

# Manipulate image data directly
inverted_array = 255 - img_array  # Invert colors
inverted_img = Image.fromarray(inverted_array.astype('uint8'))
inverted_img.save('gradient_inverted.png')
print("Created inverted image using numpy")

# 6. Image metadata
print("\n6. Image metadata:")
print("EXIF data extraction example (would work on photos with EXIF data):")

exif_code = '''
# Get EXIF data from a photo
img = Image.open('photo.jpg')
exif_data = img.getexif()  # public API; returns an empty mapping if there is no EXIF data

if exif_data:
    for tag_id, value in exif_data.items():
        tag_name = ExifTags.TAGS.get(tag_id, tag_id)
        print(f"{tag_name}: {value}")
else:
    print("No EXIF data found")
'''
print(exif_code)

# 7. Working with images from the web
print("\n7. Working with images from the web:")
print("Example code for downloading and processing an image from the web:")

web_image_code = '''
# Download an image from a URL
response = requests.get('https://example.com/image.jpg', timeout=10)
response.raise_for_status()  # stop early on HTTP errors (404, 500, ...)
img = Image.open(io.BytesIO(response.content))

# Process the image
img_resized = img.resize((300, 200))
img_resized.save('web_image_resized.jpg')
'''
print(web_image_code)

# 8. Batch processing images
print("\n8. Batch processing images:")
print("Example code for batch processing all PNG images in a directory:")

batch_code = '''
# Process all PNG images in a directory
os.makedirs('output_dir', exist_ok=True)  # make sure the output directory exists
for filename in glob.glob('input_dir/*.png'):
    with Image.open(filename) as img:
        # Get the base filename without extension
        base_name = os.path.splitext(os.path.basename(filename))[0]
        
        # Process the image (example: resize and convert to grayscale)
        img_processed = ImageOps.grayscale(img.resize((100, 100)))
        
        # Save to output directory with new name
        img_processed.save(f'output_dir/{base_name}_processed.png')
        print(f"Processed: {filename}")
'''
print(batch_code)

# 9. Creating animated GIFs
print("\n9. Creating animated GIFs:")
print("Example code for creating an animated GIF:")

gif_code = '''
# Create frames for animation
frames = []
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'purple']

for color in colors:
    # Create a colored frame
    frame = Image.new('RGB', (100, 100), color=color)
    frames.append(frame)

# Save as animated GIF
frames[0].save('animation.gif', 
               save_all=True,
               append_images=frames[1:],
               optimize=True,
               duration=200,  # milliseconds per frame
               loop=0)  # 0 means loop forever
'''
print(gif_code)

# 10. Best practices for image processing
print("\n10. Best practices for image processing:")
best_practices = [
    "1. Always use context managers (with statement) or close() images to free resources",
    "2. Process images in batches for efficiency when dealing with multiple files",
    "3. Use appropriate image formats: JPEG for photos, PNG for graphics with transparency",
    "4. Consider memory usage when working with large images",
    "5. Resize large images before applying compute-intensive operations",
    "6. Use NumPy for complex pixel manipulations for better performance",
    "7. Create thumbnails to improve web page loading times",
    "8. Apply progressive JPEG for faster web loading",
    "9. Use proper error handling when opening files that might not exist",
    "10. Consider using threading or multiprocessing for batch processing"
]

for practice in best_practices:
    print(practice)

# Clean up created files
files_to_clean = [
    'pil_red.png', 'pil_red_resized.png', 'pil_red_cropped.png',
    'pil_drawing.png', 'pil_red_rotated.png', 'pil_red_flipped_h.png',
    'pil_red_flipped_v.png', 'gradient.png', 'gradient_blurred.png',
    'gradient_edges.png', 'gradient_enhanced.png', 'gradient_gray.png',
    'gradient_inverted.png'
]

for file in files_to_clean:
    if os.path.exists(file):
        os.remove(file)

# Expected output:        
# 1. Basic image creation and manipulation:
# Image size: (200, 100)
# Image mode: RGB
# Image format: PNG
# Resized image size: (100, 50)
# Cropped image size: (100, 50)
#
# 2. Drawing on images:
# Drawing saved to 'pil_drawing.png'
#
# 3. Image transformations:
# Rotated image saved
# Flipped images saved
#
# 4. Image enhancement and filtering:
# Created gradient test image
# Applied Gaussian blur filter
# Applied edge detection filter
# Enhanced image contrast
# Converted to grayscale
#
# 5. Working with image data as numpy arrays:
# Image array shape: (100, 256, 3)
# Image data type: uint8
# Created inverted image using numpy
#
# 6. Image metadata:
# EXIF data extraction example (would work on photos with EXIF data):
# <EXIF code example>
#
# 7. Working with images from the web:
# Example code for downloading and processing an image from the web:
# <web image code example>
#
# 8. Batch processing images:
# Example code for batch processing all PNG images in a directory:
# <batch processing code example>
#
# 9. Creating animated GIFs:
# Example code for creating an animated GIF:
# <GIF creation code example>
#
# 10. Best practices for image processing:
# 1. Always use context managers (with statement) or close() images to free resources
# 2. Process images in batches for efficiency when dealing with multiple files
# And so on...

## 6. PDFs
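**Definition:** Use the `PyPDF2` library to read PDF files. PyPDF2 is no longer developed under that name; its successor `pypdf` provides the same `PdfReader` and `PdfWriter` classes.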

**Syntax and Example:**

In [None]:
import PyPDF2
with open('example.pdf', 'rb') as f:
    reader = PyPDF2.PdfReader(f)
    page = reader.pages[0]
    print(page.extract_text())  # Output: Text from the first page of the PDF

In [None]:
import os
import tempfile

# Note: This code requires the following libraries to be installed:
# pip install PyPDF2 reportlab pdf2image

print("Note: This code requires PyPDF2, reportlab, and pdf2image libraries.")
print("Install with: pip install PyPDF2 reportlab pdf2image")

# 1. Basic PDF Reading with PyPDF2
print("\n1. Basic PDF Reading with PyPDF2:")
print("Example code for reading a PDF file:")
reading_code = '''
import PyPDF2
# Open the PDF file in binary read mode
with open('example.pdf', 'rb') as file:
    # Create a PDF reader object
    reader = PyPDF2.PdfReader(file)

    # Get basic information
    print(f"Number of pages: {len(reader.pages)}")

    # Extract metadata
    metadata = reader.metadata
    if metadata:
        print(f"Title: {metadata.title}")
        print(f"Author: {metadata.author}")
        print(f"Subject: {metadata.subject}")
        print(f"Creator: {metadata.creator}")

    # Extract text from the first page
    page = reader.pages[0]
    text = page.extract_text()
    print("\nText from first page:")
    print(text[:100] + "..." if len(text) > 100 else text)

    # Iterate through all pages
    print("\nExtracting text from all pages...")
    all_text = ""
    for i, page in enumerate(reader.pages):
        all_text += page.extract_text()
        print(f"Page {i+1}: Extracted {len(page.extract_text())} characters")

    print(f"\nTotal characters extracted: {len(all_text)}")
'''
print(reading_code)

# 2. Creating PDFs with ReportLab
print("\n2. Creating PDFs with ReportLab:")
print("Example code for creating a PDF document:")
creating_code = '''
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib.styles import getSampleStyleSheet

def create_sample_pdf(filename):
    # Create a PDF document with letter size pages
    doc = SimpleDocTemplate(filename, pagesize=letter)

    # Container for elements to build the PDF
    elements = []

    # Get sample styles
    styles = getSampleStyleSheet()

    # Add a title
    title = Paragraph("Sample PDF Report", styles['Title'])
    elements.append(title)
    elements.append(Spacer(1, 12))

    # Add a paragraph
    paragraph_text = (
        "This is a paragraph in a sample PDF document created using ReportLab. "
        "ReportLab is a powerful library for creating PDF documents in Python. "
        "It allows for complex layouts, tables, charts, and more."
    )
    paragraph = Paragraph(paragraph_text, styles['Normal'])
    elements.append(paragraph)
    elements.append(Spacer(1, 12))

    # Add a heading
    heading = Paragraph("Sample Data Table", styles['Heading2'])
    elements.append(heading)
    elements.append(Spacer(1, 12))

    # Add a table
    data = [
        ['Name', 'Age', 'City', 'Occupation'],
        ['Alice', '30', 'New York', 'Data Scientist'],
        ['Bob', '25', 'San Francisco', 'Software Engineer'],
        ['Charlie', '35', 'Chicago', 'Marketing Analyst'],
        ['Diana', '28', 'Boston', 'UX Designer']
    ]

    table = Table(data)

    # Add table style
    table_style = TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 14),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
        ('GRID', (0, 0), (-1, -1), 1, colors.black)
    ])

    table.setStyle(table_style)
    elements.append(table)

    # Build the PDF
    doc.build(elements)
    print(f"PDF created: {filename}")

# Create a sample PDF
create_sample_pdf("sample_report.pdf")
'''
print(creating_code)

# 3. Manipulating PDFs with PyPDF2
print("\n3. Manipulating PDFs with PyPDF2:")
print("Example code for merging, splitting, rotating, and adding watermarks to PDFs:")
manipulating_code = '''
import PyPDF2
import os

# Create sample PDF files for demonstration
def create_text_pdf(filename, text, num_pages=1):
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter

    c = canvas.Canvas(filename, pagesize=letter)
    width, height = letter

    for i in range(num_pages):
        c.drawString(100, height - 100, f"{text} - Page {i+1} of {num_pages}")
        c.drawString(100, height - 120, f"This is sample content for demonstration.")
        if i < num_pages - 1:
            c.showPage()

    c.save()

# Create sample files
create_text_pdf("document1.pdf", "Document 1", 3)
create_text_pdf("document2.pdf", "Document 2", 2)

# 1. Merging PDFs
print("Merging PDFs:")
merger = PyPDF2.PdfMerger()
files_to_merge = ["document1.pdf", "document2.pdf"]
for file in files_to_merge:
    merger.append(file)
merger.write("merged_document.pdf")
merger.close()
print("Created merged_document.pdf")

# 2. Extracting pages from a PDF
print("\nExtracting pages:")
with open("document1.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()

    # Extract the first page
    writer.add_page(reader.pages[0])

    with open("extracted_page.pdf", "wb") as output_file:
        writer.write(output_file)

print("Created extracted_page.pdf with the first page of document1.pdf")

# 3. Rotating pages
print("\nRotating pages:")
with open("document2.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()

    # Get the first page and rotate it 90 degrees clockwise
    page = reader.pages[0]
    page.rotate(90)
    writer.add_page(page)

    # Add any other pages as is
    for i in range(1, len(reader.pages)):
        writer.add_page(reader.pages[i])

    with open("rotated_document.pdf", "wb") as output_file:
        writer.write(output_file)
print("Created rotated_document.pdf")

# 4. Adding a watermark
print("\nAdding a watermark:")
# Create a watermark PDF
def create_watermark(filename, text):
    c = canvas.Canvas(filename, pagesize=letter)
    width, height = letter

    # Set transparency
    c.setFillColorRGB(0.5, 0.5, 0.5, 0.3)  # Gray color with 30% opacity
    c.setFont("Helvetica", 60)

    # Rotate and draw the watermark text
    c.saveState()
    c.translate(width/2, height/2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.restoreState()

    c.save()

# Create the watermark
create_watermark("watermark.pdf", "CONFIDENTIAL")

# Apply the watermark to a document
with open("document1.pdf", "rb") as doc_file:
    doc_reader = PyPDF2.PdfReader(doc_file)
    with open("watermark.pdf", "rb") as watermark_file:
        watermark_reader = PyPDF2.PdfReader(watermark_file)
        watermark_page = watermark_reader.pages[0]

        writer = PyPDF2.PdfWriter()

        # Apply watermark to each page
        for page in doc_reader.pages:
            page.merge_page(watermark_page)
            writer.add_page(page)

        with open("watermarked_document.pdf", "wb") as output_file:
            writer.write(output_file)
print("Created watermarked_document.pdf")

# 5. Encrypting a PDF
print("\nEncrypting a PDF:")
with open("document2.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()

    # Add all pages
    for page in reader.pages:
        writer.add_page(page)

    # Encrypt the PDF
    writer.encrypt("userpassword", "ownerpassword")

    with open("encrypted_document.pdf", "wb") as output_file:
        writer.write(output_file)
print("Created encrypted_document.pdf with passwords:")
print("User password: 'userpassword'")
print("Owner password: 'ownerpassword'")

# Clean up files
for file in ["document1.pdf", "document2.pdf", "watermark.pdf",
            "merged_document.pdf", "extracted_page.pdf",
            "rotated_document.pdf", "watermarked_document.pdf",
            "encrypted_document.pdf"]:
    if os.path.exists(file):
        os.remove(file)
'''
print(manipulating_code)

# 4. Converting PDF to and from other formats
print("\n4. Converting PDF to and from other formats:")
print("Example code for converting between PDF and other formats:")
converting_code = '''
# Converting PDF to images
from pdf2image import convert_from_path
import tempfile

# Convert PDF to images (this requires poppler to be installed on your system)
with tempfile.TemporaryDirectory() as path:
    images = convert_from_path('example.pdf', output_folder=path)

    print(f"Converted PDF to {len(images)} images")

    # Save images
    for i, image in enumerate(images):
        image_path = f"page_{i+1}.png"
        image.save(image_path, "PNG")
        print(f"Saved {image_path}")

# Converting HTML to PDF using weasyprint
from weasyprint import HTML

html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample HTML Document</title>
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; }
        h1 { color: navy; }
        table { border-collapse: collapse; width: 100%; }
        th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
        th { background-color: #f2f2f2; }
    </style>
</head>
<body>
    <h1>Sample HTML Document</h1>
    <p>This HTML will be converted to a PDF document.</p>

    <h2>Sample Table</h2>
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
        </tr>
        <tr>
            <td>Alice</td>
            <td>30</td>
            <td>New York</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>25</td>
            <td>San Francisco</td>
        </tr>
    </table>
</body>
</html>
"""

# Create HTML file
with open("sample.html", "w") as f:
    f.write(html_content)

# Convert HTML to PDF
HTML("sample.html").write_pdf("html_to_pdf.pdf")
print("\nConverted HTML to PDF: html_to_pdf.pdf")

# Convert PDF to text (simple text extraction)
import PyPDF2
with open("example.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    text = ""
    for page in reader.pages:
        text += page.extract_text()

    with open("pdf_to_text.txt", "w") as text_file:
        text_file.write(text)
print("\nConverted PDF to text: pdf_to_text.txt")
'''
print(converting_code)

# 5. PDF Forms
print("\n5. Working with PDF Forms:")
print("Example code for filling and extracting data from PDF forms:")
forms_code = '''
# Create a PDF form (simple example with reportlab)
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

def create_form_pdf(filename):
    c = canvas.Canvas(filename, pagesize=letter)
    width, height = letter

    # Add title
    c.setFont("Helvetica-Bold", 16)
    c.drawString(100, height - 50, "Sample PDF Form")

    # Draw form fields (note: these are just visual, not interactive fields)
    c.setFont("Helvetica", 12)
    c.drawString(100, height - 100, "Name:")
    c.drawString(100, height - 150, "Email:")
    c.drawString(100, height - 200, "Date:")

    # Draw lines for form fields
    c.line(150, height - 100, 500, height - 100)
    c.line(150, height - 150, 500, height - 150)
    c.line(150, height - 200, 500, height - 200)

    c.save()

# Create a simple form
create_form_pdf("sample_form.pdf")
print("Created a sample form: sample_form.pdf")

# Creating interactive forms requires libraries like pdfrw or PyPDF2
print("\nFilling a PDF form with PyPDF2:")
import PyPDF2

def fill_pdf_form_example():
    """This is a demonstration - for actual use you need a proper PDF form"""
    print("Example code to fill a PDF form:")
    code = """
    from PyPDF2 import PdfReader, PdfWriter

    # Open the form PDF
    reader = PdfReader("form.pdf")
    writer = PdfWriter()

    # Get the first page with the form
    page = reader.pages[0]
    writer.add_page(page)

    # Update form fields
    writer.update_page_form_field_values(
        writer.pages[0],
        {
            "name_field": "John Doe",
            "email_field": "john@example.com",
            "date_field": "2023-05-01"
        }
    )

    # Write the output
    with open("filled_form.pdf", "wb") as output_file:
        writer.write(output_file)
    """
    print(code)

fill_pdf_form_example()

# Clean up files
if os.path.exists("sample_form.pdf"):
    os.remove("sample_form.pdf")
'''
print(forms_code)

# 6. PDF Best Practices
print("\n6. PDF Best Practices:")
best_practices = [
    "1. Memory management: Close file handlers promptly when done",
    "2. Error handling: Use try-except blocks when working with files",
    "3. Performance: Process large PDFs page-by-page rather than loading the entire document",
    "4. Security: Be cautious when opening PDFs from unknown sources",
    "5. Accessibility: Create PDFs with proper tagging and metadata for screen readers",
    "6. Text extraction: Be aware that complex layouts may not extract perfectly",
    "7. PDF creation: Use higher-level libraries (like ReportLab) for complex documents",
    "8. Testing: Verify PDFs across different readers (Adobe, Chrome, etc.)",
    "9. Dependencies: Always specify exact library versions in requirements.txt",
    "10. Alternatives: Consider HTML/CSS for web-based reports instead of PDFs"
]

for practice in best_practices:
    print(practice)

# 7. PDF Libraries Comparison
print("\n7. PDF Libraries Comparison:")
libraries = {
    "PyPDF2": "General-purpose PDF manipulation, reading, merging, etc.",
    "ReportLab": "PDF generation from scratch with precise control",
    "WeasyPrint": "HTML/CSS to PDF conversion",
    "pdfrw": "Low-level PDF manipulation library",
    "pdf2image": "Convert PDFs to images",
    "pdfplumber": "More precise PDF text extraction with layout awareness",
    "PDFMiner": "Advanced PDF parsing and text extraction"
}

for lib, desc in libraries.items():
    print(f"- {lib}: {desc}")

print("\nNote: The example codes provided require various PDF libraries to be installed.")
print("Some examples wouldn't run directly in this notebook without the proper environment setup.")

# Expected output:
# Note: This code requires PyPDF2, reportlab, and pdf2image libraries.
# Install with: pip install PyPDF2 reportlab pdf2image
#
# 1. Basic PDF Reading with PyPDF2:
# Example code for reading a PDF file:
# <reading code example>
#
# 2. Creating PDFs with ReportLab:
# Example code for creating a PDF document:
# <creating code example>
#
# 3. Manipulating PDFs with PyPDF2:
# Example code for merging, splitting, rotating, and adding watermarks to PDFs:
# <manipulating code example>
#
# 4. Converting PDF to and from other formats:
# Example code for converting between PDF and other formats:
# <converting code example>
#
# 5. Working with PDF Forms:
# Example code for filling and extracting data from PDF forms:
# <forms code example>
#
# 6. PDF Best Practices:
# 1. Memory management: Close file handlers promptly when done
# 2. Error handling: Use try-except blocks when working with files
# <more practices...>
#
# 7. PDF Libraries Comparison:
# - PyPDF2: General-purpose PDF manipulation, reading, merging, etc.
# - ReportLab: PDF generation from scratch with precise control
# <more libraries...>

**Output:**
```
Number of pages: 5
Title: Sample Document
Author: John Doe
Text from first page: This is the beginning of the sample document...
```

**Real-life use cases:**
- Automated report generation in business intelligence applications
- Extracting data from PDF forms and invoices
- Creating digital contracts with encryption and security features
- Combining multiple reports into a single document
- Adding watermarks or stamps to official documents
- Converting data visualizations and analysis results to shareable PDFs
- PDF form creation and automated filling for paperwork automation
- Batch processing of PDFs for data extraction and archiving

**Common mistakes:**
- Not properly closing file handles when working with many PDFs
- Ignoring PDF structure complexity when extracting text
- Using string parsing instead of proper PDF libraries
- Not handling encryption or password protection properly
- Inefficient handling of large PDF files (loading entire files into memory)
- Not considering PDF reader compatibility when creating PDFs

**Best practices:**
- Always use binary mode ('rb', 'wb') when working with PDF files
- Close file handlers promptly using context managers (with statement)
- Process large PDFs page-by-page rather than loading the entire document
- Use specialized libraries for specific tasks (PyPDF2 for manipulation, ReportLab for creation)
- Add proper metadata and structure for accessibility
- Test PDFs with different PDF readers to ensure compatibility
- Consider security implications when processing PDFs from unknown sources
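
Two of the practices above (binary mode and caution with files from unknown sources) can be combined into a cheap pre-check before a file is handed to a PDF library. The sketch below uses only the standard library; `looks_like_pdf` is a hypothetical helper name, and the two sample files are placeholders rather than valid documents. It tests for the `%PDF-` signature that PDF files start with, which is a quick filter and not a full validation.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'."""
    # Binary mode matters: PDFs are not text and must not be decoded.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate with two small throwaway files (placeholder contents,
# not valid documents).
with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_like = os.path.join(tmp_dir, "report.pdf")
    not_pdf = os.path.join(tmp_dir, "notes.pdf")

    with open(pdf_like, "wb") as f:
        f.write(b"%PDF-1.7\n% placeholder body\n")
    with open(not_pdf, "wb") as f:
        f.write(b"just some text")

    results = {
        "report.pdf": looks_like_pdf(pdf_like),
        "notes.pdf": looks_like_pdf(not_pdf),
    }

print(results)
```

A file that passes this check can still be malformed or malicious, so the parsing step itself should still be wrapped in error handling.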

## 7. JSON: Human-Readable Data Serialization
JSON (JavaScript Object Notation) is a widely used, human-readable format for data exchange between languages and systems.

In [None]:
import json
# Serialize a Python object to JSON and save to file
person = {'name': 'Alice', 'age': 30, 'city': 'New York'}
with open('person.json', 'w') as f:
    json.dump(person, f)

### Load data from a JSON file
This cell shows how to read and parse JSON data from a file.

In [None]:
with open('person.json', 'r') as f:
    loaded_person = json.load(f)
print(loaded_person)

**Use case:** Web APIs, config files, and cross-language data exchange.
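
JSON only represents a small set of types (objects, arrays, strings, numbers, booleans, and null), so values such as `datetime` need an explicit conversion before they can be exchanged. The following is a minimal sketch, assuming a hypothetical `event` record and a hypothetical fallback function `encode_extra`; neither name comes from the cells above.

```python
import json
from datetime import datetime

# Hypothetical record containing a type JSON cannot represent natively.
event = {"name": "Alice", "signup": datetime(2024, 1, 15, 9, 30), "tags": ["a", "b"]}


def encode_extra(obj):
    """Fallback called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


# indent=2 makes the output readable; sort_keys gives a stable key order.
text = json.dumps(event, default=encode_extra, indent=2, sort_keys=True)
print(text)

# Round trip: the datetime comes back as a plain ISO 8601 string,
# so it has to be parsed again explicitly.
restored = json.loads(text)
restored_signup = datetime.fromisoformat(restored["signup"])
```

The conversion is one-way by default: `json.loads` has no way to know that a string was once a `datetime`, which is why the last line parses it back by hand.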

## 8. Plain Text Files: Reading and Writing
Plain text files are the simplest way to store and share data, logs, or notes.

In [None]:
# Write to a text file
with open('notes.txt', 'w') as f:
    f.write('This is a line of text.\nAnother line.')

### Read from a text file
This cell demonstrates how to read all lines from a text file.

In [None]:
with open('notes.txt', 'r') as f:
    lines = f.readlines()
print(lines)

**Use case:** Logging, configuration, and simple data storage.
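
For the logging use case, three details tend to matter in practice: an explicit encoding, append mode, and reading line by line instead of loading the whole file. The sketch below illustrates them under simple assumptions; the file name `app.log` and the `INFO`/`ERROR` prefixes are made up for the example.

```python
import os
import tempfile

# Write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "app.log")  # hypothetical file name

    # An explicit encoding avoids depending on the platform default.
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("INFO start\n")
        f.write("ERROR caf\u00e9 not found\n")

    # Mode 'a' appends instead of truncating, which suits log files.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write("INFO done\n")

    # Iterating over the file object reads one line at a time,
    # so memory use stays flat even for large files.
    error_lines = []
    with open(log_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.startswith("ERROR"):
                error_lines.append(line.rstrip("\n"))

print(error_lines)
```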

## 9. ZIP Files: Compressing and Extracting Data
ZIP files are used to compress and bundle multiple files for storage or sharing.

In [None]:
import zipfile
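# Create a ZIP file and add files to it
# (ZIP_DEFLATED compresses the entries; the default, ZIP_STORED, only bundles them)
with zipfile.ZipFile('archive.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zipf: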
    zipf.write('notes.txt')
    zipf.write('person.json')

### Extract files from a ZIP archive
This cell shows how to extract all files from a ZIP archive.

In [None]:
with zipfile.ZipFile('archive.zip', 'r') as zipf:
    zipf.extractall('extracted_files')

**Use case:** Data backup, sharing datasets, and packaging projects.
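
When sharing a dataset it can be useful to inspect an archive without extracting it to disk. The sketch below builds a small archive in memory and reads it back; the entry names and the repetitive CSV payload are invented for illustration.

```python
import io
import zipfile

# Highly repetitive sample data compresses well (hypothetical content).
payload = "timestamp,value\n" + "2024-01-01,42\n" * 1000

# Build the archive in memory; a file path works the same way.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
    zipf.writestr("data/readings.csv", payload)
    zipf.writestr("README.txt", "Sample archive")

# Reopen for reading and inspect the contents without extracting to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zipf:
    names = zipf.namelist()
    info = zipf.getinfo("data/readings.csv")
    restored = zipf.read("data/readings.csv").decode("utf-8")

print(names)
print(f"{info.file_size} bytes stored as {info.compress_size} bytes")
```

`ZipInfo.file_size` and `ZipInfo.compress_size` give the sizes before and after compression, which is a quick way to confirm that compression was actually applied.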

## 10. YAML: Human-Friendly Data Serialization
YAML (YAML Ain't Markup Language) is a readable format for configuration and data exchange, popular in DevOps and data science.

In [None]:
# Requires: pip install pyyaml
import yaml
# Serialize Python object to YAML
config = {'version': 1, 'settings': {'theme': 'dark', 'autosave': True}}
with open('config.yaml', 'w') as f:
    yaml.dump(config, f)

### Load data from a YAML file
This cell shows how to read YAML data from a file.

In [None]:
with open('config.yaml', 'r') as f:
    loaded_config = yaml.safe_load(f)
print(loaded_config)

**Use case:** Application configuration, cloud infrastructure, and data pipelines.
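
Configuration files often come from outside the program, which is why the loading cell above uses `yaml.safe_load`: plain `yaml.load` with an unsafe loader can construct arbitrary Python objects from tagged input. The sketch below shows the matching `safe_dump`/`safe_load` round trip as a string, assuming PyYAML 5.1 or newer for the `sort_keys` argument; the `config` dictionary mirrors the one used earlier.

```python
import yaml  # provided by the PyYAML package

config = {"version": 1, "settings": {"theme": "dark", "autosave": True}}

# safe_dump emits only standard YAML tags; sort_keys=False keeps insertion
# order and default_flow_style=False forces the indented block layout.
text = yaml.safe_dump(config, sort_keys=False, default_flow_style=False)
print(text)

# safe_load builds only basic Python types (dict, list, str, int, bool, ...).
loaded = yaml.safe_load(text)
```

Keeping both directions on the "safe" functions means the file never contains Python-specific tags, so other YAML tooling can read it as well.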