# 12 - File Handling

## Introduction

File handling is essential for data engineering. You'll frequently need to read data from files and write results to files. This notebook covers reading and writing text files and CSV files.

## What You'll Learn

- Reading text files
- Writing text files
- Reading CSV files
- Writing CSV files
- Working with file paths


## Reading Text Files

The `open()` function is used to open files. Always use a `with` statement so the file is closed automatically, even if an error occurs while you are working with it.


In [1]:
# Reading a text file
# First, let's create a sample file
with open("sample.txt", "w") as file:
    file.write("Hello, World!\n")
    file.write("This is a sample file.\n")
    file.write("Python is great for data engineering!")

# Now read it back
with open("sample.txt", "r") as file:
    content = file.read()
    print(content)


Hello, World!
This is a sample file.
Python is great for data engineering!
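
To make the role of `with` more concrete, here is a rough sketch of what it saves you from writing by hand with `try`/`finally`. The file name `with_demo.txt` is only an illustrative choice, not part of the notebook's sample data.

```python
# Create a small file so this example runs on its own
# ("with_demo.txt" is a hypothetical name used only here)
with open("with_demo.txt", "w") as file:
    file.write("Hello, World!\n")

# Manual version: we must remember to close the file ourselves
file = open("with_demo.txt", "r")
try:
    content = file.read()
finally:
    file.close()  # runs even if read() raises an exception

print(content)
print("Closed after try/finally:", file.closed)

# The `with` version closes the file for us when the block ends
with open("with_demo.txt", "r") as file:
    content_with = file.read()

print("Closed after with:", file.closed)
```

Both versions leave the file closed; the `with` form is shorter and harder to get wrong.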


## Reading File Line by Line

For large files, it's better to read line by line instead of loading everything into memory.


In [2]:
# Read file line by line
with open("sample.txt", "r") as file:
    for line in file:
        print(line.strip())  # strip() removes leading/trailing whitespace, including the newline


Hello, World!
This is a sample file.
Python is great for data engineering!
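
As a small illustration of why line-by-line reading helps, the sketch below computes a line count and the longest line while holding only one line in memory at a time. The file name `lines_demo.txt` and the variable names are assumptions made for this example.

```python
# Create a sample file so this example runs on its own
# ("lines_demo.txt" is a hypothetical name used only here)
with open("lines_demo.txt", "w") as file:
    file.write("Hello, World!\n")
    file.write("This is a sample file.\n")
    file.write("Python is great for data engineering!")

line_count = 0
longest_line = ""

with open("lines_demo.txt", "r") as file:
    for line in file:
        line = line.rstrip("\n")  # drop only the trailing newline
        line_count += 1
        if len(line) > len(longest_line):
            longest_line = line

print("Total lines:", line_count)
print("Longest line:", longest_line)
```

The same loop shape works for files far larger than available memory, because the file object yields one line per iteration.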


## Reading All Lines into a List

You can read all lines at once using `readlines()`. This loads the whole file into memory, and each line keeps its trailing newline character.


In [3]:
# Read all lines into a list
with open("sample.txt", "r") as file:
    lines = file.readlines()
    print("Total lines:", len(lines))
    print("Lines:", lines)


Total lines: 3
Lines: ['Hello, World!\n', 'This is a sample file.\n', 'Python is great for data engineering!']
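
If the trailing `\n` on each element is unwanted, one possible alternative is `read().splitlines()`, which returns the lines without their line endings. This is a minimal sketch; `splitlines_demo.txt` is a hypothetical file name used only here.

```python
# Create a sample file so this example runs on its own
with open("splitlines_demo.txt", "w") as file:
    file.write("Hello, World!\n")
    file.write("This is a sample file.\n")
    file.write("Python is great for data engineering!")

# read() returns one string; splitlines() splits it and drops the newlines
with open("splitlines_demo.txt", "r") as file:
    clean_lines = file.read().splitlines()

print("Lines:", clean_lines)
```

Like `readlines()`, this still reads the entire file into memory, so it suits small files best.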


## Writing to Text Files

Use `"w"` mode to write (this creates the file, or overwrites it if it already exists) or `"a"` mode to append to the end of an existing file.


In [4]:
# Writing to a file
with open("output.txt", "w") as file:
    file.write("Line 1\n")
    file.write("Line 2\n")
    file.write("Line 3\n")

# Verify by reading it
with open("output.txt", "r") as file:
    print(file.read())


Line 1
Line 2
Line 3
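
The `"a"` mode mentioned above can be sketched as follows. The file name `append_demo.txt` is an assumption for this example; the point is that the second `open()` adds to the file instead of replacing its contents.

```python
# Start with a fresh file ("append_demo.txt" is a hypothetical name)
with open("append_demo.txt", "w") as file:
    file.write("Line 1\n")

# "a" mode keeps the existing content and writes at the end
with open("append_demo.txt", "a") as file:
    file.write("Line 2\n")

# Verify that both lines are present
with open("append_demo.txt", "r") as file:
    appended = file.read()

print(appended)
```

Had the second block used `"w"` instead of `"a"`, the file would contain only `Line 2`.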



## Reading CSV Files

CSV (Comma-Separated Values) files are very common in data engineering. We'll use Python's built-in `csv` module.


In [5]:
import csv

# First, create a sample CSV file
with open("employees.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age", "Department"])
    writer.writerow(["Alice", 25, "Engineering"])
    writer.writerow(["Bob", 30, "Sales"])
    writer.writerow(["Charlie", 28, "Marketing"])

# Now read the CSV file (the csv docs recommend newline="" for reading too)
with open("employees.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)


['Name', 'Age', 'Department']
['Alice', '25', 'Engineering']
['Bob', '30', 'Sales']
['Charlie', '28', 'Marketing']
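
Notice in the output above that every value comes back as a string, including the ages. A minimal sketch of handling this is shown below: it separates the header row with `next()` and converts the age column with `int()`. The file name `employees_demo.csv` and the variable names are assumptions for this example.

```python
import csv

# Create a sample CSV so this example runs on its own
# ("employees_demo.csv" is a hypothetical name used only here)
with open("employees_demo.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age", "Department"])
    writer.writerow(["Alice", 25, "Engineering"])
    writer.writerow(["Bob", 30, "Sales"])
    writer.writerow(["Charlie", 28, "Marketing"])

ages = []
with open("employees_demo.csv", "r", newline="") as file:
    reader = csv.reader(file)
    header = next(reader)          # consume the header row first
    for row in reader:
        ages.append(int(row[1]))   # convert the "Age" column from str to int

print("Header:", header)
print("Ages:", ages)
print("Average age:", sum(ages) / len(ages))
```

Converting explicitly like this is needed because the `csv` module does not guess column types for you.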


## Reading CSV as Dictionary

Using `csv.DictReader` makes CSV files easier to work with: each row becomes a dictionary keyed by the column names from the header row, so you can access values by name instead of by position.


In [6]:
# Read CSV as dictionary
with open("employees.csv", "r", newline="") as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(f"{row['Name']} is {row['Age']} years old and works in {row['Department']}")


Alice is 25 years old and works in Engineering
Bob is 30 years old and works in Sales
Charlie is 28 years old and works in Marketing
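
Because `DictReader` rows are accessed by column name, filtering tends to read naturally. The sketch below keeps only employees older than 26; the threshold, the file name `employees_filter.csv`, and the variable names are all assumptions chosen for illustration.

```python
import csv

# Create a sample CSV so this example runs on its own
# ("employees_filter.csv" is a hypothetical name used only here)
with open("employees_filter.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Age", "Department"])
    writer.writerow(["Alice", 25, "Engineering"])
    writer.writerow(["Bob", 30, "Sales"])
    writer.writerow(["Charlie", 28, "Marketing"])

with open("employees_filter.csv", "r", newline="") as file:
    reader = csv.DictReader(file)
    # Values are strings, so convert Age before comparing
    older_names = [row["Name"] for row in reader if int(row["Age"]) > 26]

print(older_names)
```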


## Writing CSV Files

You can write data to CSV files using `csv.writer` or `csv.DictWriter`.


In [7]:
# Write data to CSV
data = [
    ["Product", "Price", "Quantity"],
    ["Laptop", 999.99, 10],
    ["Mouse", 29.99, 50],
    ["Keyboard", 79.99, 30]
]

with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Verify by reading the file back
with open("products.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)


['Product', 'Price', 'Quantity']
['Laptop', '999.99', '10']
['Mouse', '29.99', '50']
['Keyboard', '79.99', '30']
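
The example above uses `csv.writer`. For completeness, here is a minimal sketch of the `csv.DictWriter` alternative, which writes rows given as dictionaries. The file name `products_dict.csv` and the variable names are assumptions for this example.

```python
import csv

products = [
    {"Product": "Laptop", "Price": 999.99, "Quantity": 10},
    {"Product": "Mouse", "Price": 29.99, "Quantity": 50},
    {"Product": "Keyboard", "Price": 79.99, "Quantity": 30},
]

# fieldnames fixes the column order in the output file
# ("products_dict.csv" is a hypothetical name used only here)
with open("products_dict.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["Product", "Price", "Quantity"])
    writer.writeheader()        # writes the header row from fieldnames
    writer.writerows(products)  # writes one row per dictionary

# Verify by reading the rows back as dictionaries
with open("products_dict.csv", "r", newline="") as file:
    rows = list(csv.DictReader(file))

for row in rows:
    print(row["Product"], row["Price"], row["Quantity"])
```

`DictWriter` is a reasonable choice when your data already lives in dictionaries, since the column order comes from `fieldnames` and not from the order of keys in each row.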


## Key Points to Remember

- Always use a `with` statement when working with files - it automatically closes the file
- Use `"r"` for reading, `"w"` for writing (overwrites), `"a"` for appending
- For CSV files, use the `csv` module instead of manually parsing
- When reading large files, read line by line to save memory
- In PySpark, you'll use similar concepts but with DataFrames instead of direct file operations
