The Core Python Libraries That Power a Data Engineer's Day

Let's go through the real heroes: the libraries that quietly do the heavy lifting every day in production.

1. Pandas: The Swiss Army Knife

If you work with tabular data, you'll live inside Pandas.
It's your go-to for reading, transforming, filtering, and exploring data quickly.

In [0]:
import pandas as pd

df = pd.read_csv("sales.csv")
df["Revenue"] = df["Quantity"] * df["Price"]
region_summary = df.groupby("Region")["Revenue"].sum().reset_index()

print(region_summary)

In [0]:
from pathlib import Path


for file in Path("/Volumes/workspace/default/emp").rglob("*.csv"):
    print(file.name)

# for file in Path("/Volumes/workspace/default/emp").rglob("*.csv"):
#     print(file.name, file.stat().st_size)

In [0]:
# rglob is a method from pathlib.Path that recursively searches for files matching a pattern.
# Example: Path("/path/to/dir").rglob("*.csv") finds all CSV files in the directory and its subdirectories.

4. SQLAlchemy: Talking to Databases the Pythonic Way

SQLAlchemy bridges Python with SQL databases, which makes it a good fit for extracting data from or loading data into SQL Server, PostgreSQL, or Snowflake.

In [0]:
from sqlalchemy import create_engine
import pandas as pd

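# Placeholder credentials. With pyodbc and a hostname you typically also need to
# name the ODBC driver, e.g. by appending "?driver=ODBC+Driver+17+for+SQL+Server".
engine = create_engine("mssql+pyodbc://user:password@server/database")
df = pd.read_sql("SELECT TOP 100 * FROM dbo.Customers", engine)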

5. datetime: The Secret Ingredient in Incremental Loads

Incremental pipelines revolve around dates: "load yesterday's data," "get last week's changes," etc.

In [0]:
from datetime import datetime, timedelta

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
print(f"Processing data for: {yesterday}")

In [0]:
# timedelta represents a duration, the difference between two dates or times.
from datetime import timedelta

delta = timedelta(days=2, hours=3, minutes=15)
print(f"Timedelta example: {delta}")
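
For the "last week's changes" case, a small helper that returns an explicit date window can be easier to reason about than a single date. This is a minimal sketch: the function name `load_window`, the half-open range, and the `updated_at` column in the message are assumptions for illustration, not part of any particular pipeline.

```python
from datetime import date, timedelta

def load_window(run_date, days_back=1):
    """Return (start, end) ISO date strings for a half-open window [start, end).

    The window covers the `days_back` full days before `run_date`.
    """
    end = run_date
    start = run_date - timedelta(days=days_back)
    return start.isoformat(), end.isoformat()

# Passing the run date explicitly keeps reruns reproducible.
start, end = load_window(date(2024, 3, 15), days_back=7)
print(f"Loading rows where {start} <= updated_at < {end}")
```

Taking the run date as a parameter, rather than calling `datetime.now()` inside the helper, means a backfill for an old date produces the same window every time.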

6. logging: Because Print Statements Don't Scale

When you're debugging in production, print() won't help.
Logging helps you trace and debug without flooding your output.

In [0]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

logging.info("Pipeline started")
logging.warning("Missing column detected")
logging.error("File not found")

Optional But Handy Libraries

requests: Call REST APIs (extracting data from external systems)
boto3 / azure-storage-blob: Move data to/from cloud storage
pytest: Automate testing for your ETL logic
json: Handle API responses and metadata files
re: Clean up messy text columns
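
As a quick taste of the last two entries, here is a minimal sketch that parses a JSON payload and tidies a messy text field with a regular expression. The payload and its field names are made up for illustration.

```python
import json
import re

# Hypothetical API payload; the field names are invented for this example.
payload = '{"customers": [{"name": "  ACME   Corp. "}, {"name": "beta\\tLLC"}]}'

def clean_text(value):
    """Collapse runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()

records = json.loads(payload)["customers"]
names = [clean_text(record["name"]) for record in records]
print(names)
```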

What I Wish I Knew Earlier

If I could go back and give my beginner self some advice, I'd say:

Don't try to learn everything.
Master a few libraries deeply: Pandas, Pathlib, PySpark, and logging will take you far.

Focus on writing clean code, not clever code.
You'll read your old scripts more often than you'll write new ones.

Understand your data flow.
Learn how data moves: from ingestion → transformation → storage → reporting.
Python is just the glue that holds these steps together.

Use version control early (Git).
It's not optional in production.
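
To make the ingestion → transformation → storage → reporting flow concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the function names, the sample rows, and the dict standing in for a real table or file.

```python
def ingest():
    """Ingestion: pretend these rows came from a file or an API."""
    return [
        {"region": "north", "quantity": 2, "price": 10.0},
        {"region": "south", "quantity": 1, "price": 5.0},
        {"region": "north", "quantity": 3, "price": 10.0},
    ]

def transform(rows):
    """Transformation: total revenue per region."""
    totals = {}
    for row in rows:
        revenue = row["quantity"] * row["price"]
        totals[row["region"]] = totals.get(row["region"], 0.0) + revenue
    return totals

def store(totals, sink):
    """Storage: a dict stands in for a table or file."""
    sink.update(totals)

def report(sink):
    """Reporting: one formatted line per region, sorted by name."""
    return [f"{region}: {total:.2f}" for region, total in sorted(sink.items())]

warehouse = {}
store(transform(ingest()), warehouse)
print(report(warehouse))
```

Each stage takes its input as an argument and returns its output, so any one of them can be swapped for a real implementation without touching the others.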

If you can:

Read and clean data with Pandas
Manipulate files with Pathlib
Scale with PySpark
Connect to databases with SQLAlchemy
Manage dates and logs properly

Then congratulations, you already understand the core of Python for data engineering.

It worked. But it wasn't maintainable.
It was a 300-line .py file with zero structure.

Code that's hard to test is often code that's hard to trust.

The senior dev's first comment?

"This should be a class. This isn't a script, it's a small application."

So I refactored:

In [0]:
class ReportGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def fetch_data(self):
        # logic for API call
        pass

    def generate_csv(self, data):
        # write to csv
        pass

    def send_email(self, csv_path):
        # send email with attachment
        pass
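
One way the pieces might fit together is a single entry point that chains the three steps. The sketch below is an assumption, not the original code: the `run` method is hypothetical, and the method bodies are stand-ins that return canned values so the example runs on its own.

```python
class ReportGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def fetch_data(self):
        # Stand-in for the API call: returns canned rows.
        return [{"id": 1, "status": "open"}]

    def generate_csv(self, data):
        # Stand-in for writing a file: returns the path it would write.
        return "report.csv"

    def send_email(self, csv_path):
        # Stand-in for sending mail: reports what it would attach.
        return f"sent {csv_path}"

    def run(self):
        """Hypothetical entry point that chains the three steps."""
        data = self.fetch_data()
        csv_path = self.generate_csv(data)
        return self.send_email(csv_path)

class FakeReportGenerator(ReportGenerator):
    # In a test, override only the step that would touch the network.
    def fetch_data(self):
        return []

print(ReportGenerator(api_key="dummy").run())
```

Splitting the work into methods is what makes the fake possible: a test can replace `fetch_data` and still exercise the rest of the flow.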

Step 2: "Don't Repeat Yourself" Is Not Optional
I thought duplicating three lines of code was harmless.

Turns out, those three lines were copy-pasted in five places.
When the logic changed later, I had to update it in five places.

That was the second red flag.

Fix? Create utilities.

In [0]:
import csv

def save_to_csv(data, filename):
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

In [0]:
import csv

def save_to_csv(data, filename):
    # Save a list of dictionaries to a CSV file.
    # Each dictionary in 'data' represents a row; keys are column names.
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()  # Write column headers
        writer.writerows(data)  # Write data rows

Step 3: Automation Should Be Idempotent
This one hit hard.

My script wrote files directly to disk. If it crashed halfway, it left half-written files behind. If it ran twice, it duplicated everything.

The senior dev asked me one question:

"Can I run this twice and still get the same result?"

I couldn't say yes.

So I made it idempotent:




In [0]:

from datetime import datetime

def generate_unique_filename(base_name):
    # One name per calendar day: a rerun on the same day targets the same file.
    date_str = datetime.now().strftime("%Y-%m-%d")
    return f"{base_name}_{date_str}.csv"

bname = generate_unique_filename("customer")
print(bname)

In [0]:
from datetime import datetime

def generate_unique_filename(base_name, idx):
    date_str = datetime.now().strftime("%Y-%m-%d")
    return f"{base_name}_{date_str}_{idx}.csv"

for i in range(1, 1001):
    fname = generate_unique_filename("customer", i)
    print(fname)

In [0]:

from datetime import datetime

def generate_unique_filename(base_name, idx):
    date_str = datetime.now().strftime("%Y-%m-%d")
    return f"{base_name}_{date_str}_{idx}.csv"

sample_data = [{"id": i, "value": f"val_{i}"} for i in range(10)]

for i in range(1, 1001):
    fname = generate_unique_filename("customer", i)
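
A predictable filename handles the "ran twice" half of the problem. For the "crashed halfway" half, one common approach is to write to a temporary file and swap it into place only when the write has finished. The sketch below is an assumption about how that could look, not the original script: `write_csv_atomically` is a hypothetical helper, and it relies on `os.replace`, which is atomic on POSIX systems when both paths are on the same filesystem.

```python
import csv
import os
import tempfile

def write_csv_atomically(rows, target_path):
    """Write rows to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        # os.replace overwrites the target, so a rerun yields the same file.
        os.replace(tmp_path, target_path)
    except BaseException:
        # On failure, remove the partial temp file and leave the target alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "customer_2024-03-15.csv")
write_csv_atomically(rows, target)
write_csv_atomically(rows, target)  # second run overwrites, no duplicates
```

Running it twice leaves exactly one file with the same contents, which is the property the senior dev's question was probing for.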


Step 4: Log Everything. Seriously.
Automation without logging is like flying a drone blindfolded.
It might work, until it doesn't.

Before the review, I had exactly one print() statement in the script. No logs. No error handling.

Now?

In [0]:
import logging

logging.basicConfig(
    filename='report.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

logging.info("Fetching data from CRM...")
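
Logging pairs naturally with error handling. Here is a minimal sketch, assuming a hypothetical `parse_amount` helper and a logger named "report", of recording a failure with its traceback instead of letting the script die silently.

```python
import logging

logger = logging.getLogger("report")  # hypothetical logger name

def parse_amount(raw):
    """Convert a raw string to float, logging failures instead of crashing."""
    try:
        return float(raw)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Could not parse amount: %r", raw)
        return None

print(parse_amount("19.99"))
print(parse_amount("n/a"))
```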

Step 5: Functional Is Not Always Pythonic
Yes, the code ran.
But it didn't read like Python.

One section had a massive chain of filters, maps, and lambdas.

In [0]:
raw_list = ["  Alice ", None, "BOB"]  # sample input for illustration
cleaned = list(map(lambda x: x.strip().lower(), filter(lambda x: x is not None, raw_list)))

Look familiar? It worked, but it wasn't readable. Not to the next dev. Not even to me a week later.

Refactored into something readable:


In [0]:
cleaned = []
for item in raw_list:
    # Match the original filter: skip only None, keep empty strings.
    if item is not None:
        cleaned.append(item.strip().lower())
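
A list comprehension is a common middle ground between the map/filter chain and the explicit loop. A minimal sketch, using sample input that is not from the original script:

```python
raw_list = ["  Alice ", None, "BOB", ""]  # sample input for illustration

# Drops only None values, then trims and lowercases what remains.
cleaned = [item.strip().lower() for item in raw_list if item is not None]
print(cleaned)
```

Note that the empty string survives here because the filter checks for `None` specifically; use `if item` instead if blank values should be dropped too.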

Step 6: Stop Ignoring Edge Cases (Because Real Users Won't)
My script assumed the API would always return data.
And that the email address would always be valid.
And that the CSV would always be generated.

Reality? Not so cooperative.

So I added graceful error handling:

In [0]:
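import logging
import requests

# Method of ReportGenerator; assumes self.headers was set in __init__.
def fetch_data(self):
    try:
        # A timeout keeps a stalled server from hanging the whole job.
        response = requests.get("https://api.example.com/data", headers=self.headers, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        logging.error(f"Failed to fetch data: {e}")
        return None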

This single change turned a brittle tool into a robust one. No more midnight alerts from failed automations.
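
The other two assumptions, that there is data to report and that the email address is valid, can be checked up front before any file is written or mail is sent. This is a sketch under stated assumptions: `check_before_sending` is a hypothetical helper, and the email pattern is deliberately loose rather than a full address validator.

```python
import re

# Loose pattern: something@something.tld. An assumption for illustration,
# not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_before_sending(data, recipient):
    """Return a list of problems; an empty list means it is safe to proceed."""
    problems = []
    if not data:
        problems.append("API returned no data")
    if not recipient or not EMAIL_RE.match(recipient):
        problems.append(f"Invalid recipient address: {recipient!r}")
    return problems

print(check_before_sending([], "not-an-email"))
print(check_before_sending([{"id": 1}], "ops@example.com"))
```

Returning a list of problems, rather than raising on the first one, lets the caller log everything that is wrong in a single run.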


Step 7: It's Not Clean Until Someone Else Can Maintain It
After my script was deployed, another intern took over a similar task.

Guess what they did?

They used my refactored class. Extended it. Added logging. Didn't break anything.

That was the moment it hit me: clean code isn't about perfection.
It's about readability. Maintainability. Predictability.

And most of all, it's about empathy for the next developer who picks it up.