# Introduction to Data Engineering: Bridging the gap between raw data and insights 💡

### What we'll cover in this tutorial

In this tutorial, we'll be using SQLite, a simple and easy-to-use database, to explore the fundamentals of data engineering and data preprocessing. We'll cover the following topics:
+ Introduction to SQL: We'll learn the basics of SQL (Structured Query Language), the standard language for interacting with databases.
+ Data Extraction: We'll use SQL to extract data from our SQLite database.
+ Data Transformation: We'll use SQL to clean, transform, and prepare the data for analysis. This will include handling missing values, inconsistent formatting, and duplicates.
+ Data Loading: We'll load the transformed data into a new table or file for further analysis.


### Quick Start: Where the hell are we? 📓

Don't panic! This unfamiliar interface you're looking at is called a **Jupyter notebook** – think of it as a smart document that can actually run code. While it's built on Python (a programming language), today we're using it purely as our SQL playground.

Phew 😌 So now what?

Jupyter notebooks work like an interactive document where you can mix explanatory text with executable code. Each gray box (called a "cell") contains either text or code. To run any code cell, simply click inside it and press `Shift + Enter` – the code executes and results appear directly below. Magic! ✨
You can edit any code cell by clicking in it and typing, just like a regular text editor. The notebook has a memory – if you create a database connection in one cell, all the cells below can use it. This is perfect for our SQL journey: connect once, query many times.


If something goes wrong or you want a fresh start, just use the menu: `Kernel → Restart & Clear Output`. This clears the notebook's memory and lets you start over. 

🫡 That's genuinely all you need to know! The beauty of notebooks is that you can experiment freely, see immediate results, and keep notes alongside your SQL queries. Each query you run stays visible with its results, creating a perfect learning trail. Ready to write some SQL? Let's go! 🚀

### Understanding SQL and Relational Databases

Think of relational databases as a collection of interconnected Excel spreadsheets 📊. 

Each spreadsheet (or table) represents a specific business entity – like "Customers," "Products," or "Orders." To communicate with these databases, we use SQL (Structured Query Language), the universal language of databases.

#### 📋 Database Building Blocks
- Tables: Like a spreadsheet, each table stores information about one type of thing. Think Excel sheets, but more powerful and connected.

- Rows (Records): Individual entries in your table. One row = one specific instance.
  For example, in a table storing "Orders" information, each row represents a single order. Rows are also called "records" in database terminology.

- Columns (also called attributes): The characteristics you track. Each column stores one type of information, for example "UnitPrice", "Quantity", and "Discount" in the "Order Details" table.
  Every column has a specific data type (text, number, date, etc.), and it's important to understand what we are storing and why we store it that way!

  

#### 🔗 The Magic of Relationships
The real power of relational databases comes from connecting tables together. This happens through special columns called keys:

 👉 Primary Key (PK) 🔑
It is a unique identifier for each record in a table, like a passport number or an ID card number: two records can't share the same primary key.
Example: CustomerID in the "Customers" table ensures each customer is unique; otherwise we couldn't tell them apart.

 👉 Foreign Key (FK) 🔗
A column that points to the primary key in another table.
This creates the bridge between related data, and this is where the power lies!
Example: The "Orders" table contains a CustomerID column (FK) that links to the CustomerID (PK) in "Customers". This connection tells you exactly which customer placed each order.
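
To make the two kinds of keys concrete, here is a minimal sketch using a throwaway in-memory SQLite database. The two tables are simplified stand-ins, not the real Northwind definitions, and the rows are demo data. It assumes nothing from the rest of the notebook. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set on the connection.

```python
import sqlite3

# Throwaway in-memory database, separate from northwind.db
demo = sqlite3.connect(":memory:")
demo.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on

demo.execute("""
    create table Customers (
        CustomerID  text primary key,   -- PK: one value per customer, no repeats
        CompanyName text not null
    )""")
demo.execute("""
    create table Orders (
        OrderID    integer primary key,
        CustomerID text references Customers(CustomerID)   -- FK: must match a customer
    )""")

demo.execute("insert into Customers values ('ALFKI', 'Alfreds Futterkiste')")
demo.execute("insert into Orders values (1, 'ALFKI')")   # fine: ALFKI exists

rejected = []
try:
    # Reusing an existing primary key value is refused
    demo.execute("insert into Customers values ('ALFKI', 'Duplicate Co')")
except sqlite3.IntegrityError as err:
    rejected.append(str(err))
try:
    # An order pointing at a customer that does not exist is refused
    demo.execute("insert into Orders values (2, 'NOPE')")
except sqlite3.IntegrityError as err:
    rejected.append(str(err))

for message in rejected:
    print("Rejected:", message)
```

Both bad inserts are refused by the database itself, which is exactly the guarantee that keys provide.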


In this tutorial, we will be using the Northwind database, a sample database that contains data about a fictional company that imports and exports specialty foods. It includes tables for customers, orders, products, and more. The schema diagram below shows the relationships between the tables in the Northwind database.

#### 📐 Your First Database Schema
Below is a visual map of the Northwind database – this is called a schema diagram. Each box represents a table, with its columns listed inside. The lines between boxes show how tables connect through their keys. 

Don't worry about memorizing this! Just notice how everything links together like a web. The beauty is that you can start at any table and follow the connections to find related information.


![Northwind Database Schema](./images/db-schema-northwind.svg)


---

# We need to populate our database with some data!



We've prepared a small script for you that downloads a ready-made copy of the Northwind database, already populated with sample data. To make things easy, we use SQLite, a lightweight database that is easy to set up and lets us work with SQL without needing a separate server.
In a real-world scenario, you would typically connect to a database server (like PostgreSQL, MySQL, Snowflake, etc.) and run SQL commands to create and populate your database.

💡 **You don't really need to understand what's happening down there, you can just execute the code and move on to the next section.**

In [None]:
#Import required libraries for data processing

import urllib.request
import os
import sqlite3
import pandas as pd  # added for dataframe operations

In [None]:
# Check whether the database file exists; if not, download it. Then open a connection to the database.

url = "https://github.com/jpwhite3/northwind-SQLite3/raw/refs/heads/main/dist/northwind.db"
db_path = "northwind.db"

if not os.path.exists(db_path):
    print("Downloading the database...")
    urllib.request.urlretrieve(url, db_path)
    print("Database downloaded successfully!")
else:
    print("Database already exists.")

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()


In [None]:
def query_and_print(query):
    """
    Execute a SQL query and print the results.
    """
    cursor.execute(query)
    results = cursor.fetchall()
    
    # Convert results to a DataFrame for better readability
    df = pd.DataFrame(results, columns=[desc[0] for desc in cursor.description])
    
    print(df.to_string(index=False))
    return df


---

# Back to our SQL queries

SQL is the standard language for talking to databases – the organized stores of information that power your business. Think of SQL as the tool that lets you ask questions of your data and get the answers you need to make better decisions. It's like having a direct line to your company's knowledge.

Why is SQL Important for Business Professionals?

SQL empowers you to:

Get answers to critical business questions: Instead of relying solely on reports or technical teams, you can use SQL to directly query your data and answer questions like:
+ What are my top-selling products this quarter?
+ Which marketing campaigns are generating the most leads?
+ What is the average order value for customers in a specific region?
+ Who are my most valuable customers, and what are they buying?



---

# 🤖 SQL + AI: Smarter Queries, Safer Data

AI has revolutionized how we write SQL queries – transforming natural language questions into database commands in seconds. But like any powerful tool, it requires a thoughtful approach and a bit of practice to get the most out of it.

### Choosing Your AI Assistant

Not all AI tools are created equal when it comes to SQL:

**Claude 4 Sonnet / Opus (Anthropic)** - Excellent at understanding complex business logic. From experience, it handles large schemas well, which is very important on critical databases.
It is also more cautious with potentially destructive queries and better at explaining its reasoning.

**ChatGPT 4.1 (OpenAI)** 🥈 - Good for standard queries
- Sometimes overly confident with complex joins

**Company / private LLMs** 🏢 - Keep your data inside! A schema can contain important or confidential information, so be careful where you paste it.
The good news is that most companies offer a private version of a public LLM.

### The Golden Rule: Schema First! 📋

The #1 mistake people make? Asking for SQL without providing context. AI models aren't mind readers – they need to know your database structure.

**❌ Bad Request:**
What is a bad request? Well...
```
"Show me our top customers"
```
With no context given, the LLM will likely hallucinate table and column names and return something completely off track.

**✅ Good Request:**
```
Given this database schema:
- Customers (CustomerID, CompanyName, Country)
- Orders (OrderID, CustomerID, OrderDate, Freight)
- Order Details (OrderID, ProductID, UnitPrice, Quantity)

Write a SQL query to show me the top 5 customers by total order value in 2023.
```
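
You don't have to type the schema by hand. The sketch below shows one way to pull table and column names out of a SQLite database so they can be pasted into a prompt. `describe_schema` is a hypothetical helper written for this illustration, and the two demo tables are made up; only `sqlite_master` and `PRAGMA table_info` are SQLite features.

```python
import sqlite3

def describe_schema(connection):
    """Return one line per table, e.g. 'Customers (CustomerID, CompanyName)'."""
    tables = connection.execute(
        "select name from sqlite_master "
        "where type = 'table' and name not like 'sqlite_%' order by name"
    ).fetchall()
    lines = []
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; the column name is at index 1
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        lines.append(f"{table} ({', '.join(col[1] for col in columns)})")
    return "\n".join(lines)

# Demo on a tiny in-memory database (hypothetical tables, not the full Northwind schema)
demo = sqlite3.connect(":memory:")
demo.execute("create table Customers (CustomerID text primary key, CompanyName text)")
demo.execute('create table "Order Details" (OrderID integer, ProductID integer, Quantity integer)')

schema_text = describe_schema(demo)
print(schema_text)
```

In this notebook, `print(describe_schema(conn))` should give you the same kind of summary for the Northwind database. Remember the confidentiality point above before pasting a real company schema into a public tool.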

### Effective Prompting Strategies 🎯

**1. Set the Technical Context**
Always start with:
- Your database system (PostgreSQL, MySQL, Snowflake, etc.). This matters because SQL syntax varies slightly from one database system to another. Also mention any specific SQL dialect requirements.

Example:
```
"Using PostgreSQL 14, with tables containing millions of records..."
```

**2. Be Crystal Clear About Your Needs**

Instead of: "I want customer data"

Try: "I need a list of customers who:
- Placed orders in the last 90 days
- Have a total order value exceeding 10,000 CHF
- Are based in Switzerland or Germany
Include their company name, total spent, and last order date"

**3. Specify the Output Format**
Tell the AI exactly what columns you want and how they should be named:
```
"Return columns: CustomerName, TotalRevenue, LastOrderDate, Country
Sort by TotalRevenue descending"
```

### Real-World Example 💼

Here's a complete, effective prompt:

```
I'm using MySQL 8.0. Here's my schema:

Products (ProductID, ProductName, CategoryID, UnitPrice, UnitsInStock)
Categories (CategoryID, CategoryName)
Order_Details (OrderID, ProductID, UnitPrice, Quantity, Discount)
Orders (OrderID, CustomerID, EmployeeID, OrderDate)

Question: Which product categories generated the most revenue in Q4 2023, 
excluding any orders with more than 20% discount?

Please write an efficient query that:
- Joins all necessary tables
- Calculates total revenue per category
- Filters for Q4 2023 (October-December)
- Excludes high-discount orders
- Shows top 5 categories with their revenue
```

### The Safety Checklist 🛡️

**Always Review Before Running:**

1. 🧨 **Check for Destructive Commands**
   - Never run queries with `DROP`, `DELETE`, `TRUNCATE` without double-checking
   - Be wary of `UPDATE` statements without proper `WHERE` clauses

2. 🔍 **Validate the Logic**
   - Are the date ranges correct?
   - Do the JOIN conditions make sense?
   - Will this query return way too much data?

3. 🫸 **Test on Development First**
   - Run new queries on test data before production
   - Add `LIMIT 10` to test the output structure

4. 📉 **Watch for Performance Killers**
   - Missing WHERE clauses on large tables
   - Unnecessary nested subqueries
   - DISTINCT on large result sets

### Pro Tips for AI-Assisted SQL 🚀

**Iterative Refinement**
Don't expect perfection on the first try. Start simple, then add complexity:
1. Get the basic query working
2. Add filters and conditions
3. Optimize for performance

**Learn from the AI**
When the AI generates a query, ask it to:
- Explain each part of the query
- Suggest performance improvements
- Show alternative approaches

**Build Your Prompt Library**
Save successful prompts as templates (on SharePoint, in your notes, wherever you can find and share them):
```
"[STANDARD SCHEMA HERE]
Task: [Monthly sales report / Customer segmentation / Inventory analysis]
Constraints: [Date range / Specific regions / Minimum thresholds]
Output: [Required columns and format]"
```

### Common Pitfalls to Avoid with AI 🤖⚠️

1. **The Schema Assumption**
   AI might assume column names or relationships. Always verify against your actual schema.

2. **The Timezone Trap**
   Be explicit about timezones when dealing with dates: "Using UTC timestamps...". This one is painful, so make sure you double-check!

3. **The NULL Surprise**
   AI often forgets about NULL values. Explicitly mention how to handle them. Do not hesitate to feed your errors back into the LLM.

4. **The Performance Blind Spot**
   AI doesn't know your table sizes. Mention if you're dealing with millions of rows, and fingers crossed, it will optimize the query.
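
The NULL surprise is easy to reproduce. The following sketch uses a made-up three-row table in an in-memory SQLite database (the company names are invented) to show why `= NULL` and `<>` comparisons quietly drop rows with missing values.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table Customers (CompanyName text, Region text)")
demo.executemany(
    "insert into Customers values (?, ?)",
    [("Alpha Foods", "West"), ("Beta Imports", None), ("Gamma Deli", None)],
)

def count_where(condition):
    """Count the rows of the demo table matching a WHERE condition."""
    return demo.execute(f"select count(*) from Customers where {condition}").fetchone()[0]

equals_null = count_where("Region = NULL")     # never true: NULL is not equal to anything
is_null = count_where("Region is null")        # the correct way to test for missing values
not_west = count_where("Region <> 'West'")     # rows with a NULL Region are silently dropped
not_west_or_missing = count_where("Region <> 'West' or Region is null")

print(equals_null, is_null, not_west, not_west_or_missing)   # prints: 0 2 0 2
```

If a generated query filters on a column that can be empty, check whether it needs an explicit `is null` branch like the last one.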

👉 Remember: AI is your assistant, not your replacement. It speeds up query writing dramatically, but you're still the pilot. Always understand what the query does before hitting that execute button! It's hard telling your boss that the "magic robot" deleted your production database 🎮


---

# Let's cross the Rubicon !

Alright, let's get our hands dirty. Imagine we've just been hired as the new data analysts for a company called "Northwind traders". They are a small but ambitious company that imports and exports gourmet food products around the world.

Our manager has given us our first big mission: Help the company boost its sales.

That's a broad goal. Where do we even begin? As data people, our first instinct should be to look at the data we have. We need to understand the business before we can change it. Our main tool for this investigation will be SQL.

Let's start by asking some basic questions.


## Peeking into the data with select and limit
Before we can think about sales, we need to know what we're actually selling. Our database has a table called Products that seems like a good place to start.

Let's ask the database to show us everything in that table. The command for "show me" in SQL is select. If we want to see every column, we use the asterisk *, which is a wildcard for "all".

In [None]:
query_and_print("SELECT * from Products;")

Running this, you'll see a lot of data. Maybe too much. We don't need to see every single product right now, we just want to get a feel for the table's structure.

To ask for just a sample, we can add a LIMIT clause. This is useful for taking a quick peek without overwhelming our screen. Let's look at just the first 5 products.

In [None]:
query_and_print("SELECT * from Products LIMIT 5;")

Much better. Now we can see the columns clearly: ProductID, ProductName, SupplierID, UnitPrice, etc.

The * is handy, but most of the time we only care about a few specific columns. For our sales mission, the product name and its price are probably quite important. Let's select just those two columns.

In [None]:
query_and_print("select ProductName, UnitPrice from Products limit 5;")

Here we go. We've just used the most fundamental command in SQL.

+ Select lets us choose the columns we want to see.
+ From tells the database which table to look in.
+ Limit restricts the number of rows returned.

## Finding what you need with where
Looking at a random list of products is a start, but to make smart decisions, we need to ask more specific questions. For instance, a simple sales strategy could be to promote our high-value items.

So, let's ask the database: "Show me the products that are expensive". We need to define "expensive". Let's say, for now, any product with a UnitPrice greater than 50.

To filter our data based on a condition, we use the where clause. Think of it as adding an "if" to your request.

In [None]:
query_and_print("select ProductName, UnitPrice from Products where UnitPrice > 50;");

Now we have a list of our premium products. Notice we used the greater-than symbol >. You can use other familiar comparison operators too:

+ = : equals
+ < : less than
+ >= : greater than or equal to
+ <= : less than or equal to
+ <> or != : not equal to

👉 Let's try another one. What if we want to find information about a specific customer? Let's look at the Customers table and find all of our customers based in Germany. When we filter on text (also called a 'string'), we need to put the value in single quotes ' '.

In [None]:
query_and_print("select CompanyName, ContactName, City from Customers where Country = 'Germany';");

## Combining conditions with and and or

Our questions can get more complex. It's good that we found our expensive products, but what if an expensive product is out of stock? We can't sell it.

Let's refine our previous query. We want products that are both expensive (UnitPrice > 50) and actually in stock (UnitsInStock > 0).

To check for multiple conditions where all must be true, we use and.

In [None]:
query_and_print("select ProductName, UnitPrice, UnitsInStock from Products where UnitPrice > 50 and UnitsInStock > 0;");

This list is much more useful for a sales campaign.

Now, let's consider a different strategy. Perhaps we want to run a marketing campaign in our key European markets, say Switzerland and France. We need a list of customers who are in Switzerland or in France.

For this, we use the or operator. It will show a row if any of the conditions are met.

In [None]:
query_and_print("select CompanyName, Country from Customers where Country = 'Switzerland' or Country = 'France';");
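
Two details are worth knowing once you start mixing conditions: `in (...)` is a shorter way to write several `or` tests on the same column, and `and` is evaluated before `or` unless you add parentheses. The sketch below illustrates both on a made-up five-row table in an in-memory database; the single-letter company names are placeholders.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table Customers (CompanyName text, Country text, City text)")
demo.executemany(
    "insert into Customers values (?, ?, ?)",
    [
        ("A", "Switzerland", "Bern"),
        ("B", "Switzerland", "Geneva"),
        ("C", "France", "Paris"),
        ("D", "France", "Lyon"),
        ("E", "Germany", "Berlin"),
    ],
)

def names_where(condition):
    """Return the company names matching a WHERE condition, sorted by name."""
    rows = demo.execute(
        f"select CompanyName from Customers where {condition} order by CompanyName"
    )
    return [row[0] for row in rows]

# `in` is shorthand for a chain of `or` on the same column
with_or = names_where("Country = 'Switzerland' or Country = 'France'")
with_in = names_where("Country in ('Switzerland', 'France')")

# `and` binds tighter than `or`, so this reads as: Switzerland OR (France AND Paris)
no_parens = names_where("Country = 'Switzerland' or Country = 'France' and City = 'Paris'")

# Parentheses make the intent explicit: (Switzerland OR France) AND Paris
with_parens = names_where("(Country = 'Switzerland' or Country = 'France') and City = 'Paris'")

print(with_or, with_in, no_parens, with_parens)
```

Here `no_parens` returns A, B and C while `with_parens` returns only C, so when in doubt, add the parentheses.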

## Creating new information with calculations
Let's look at the Order Details table. It seems to hold the key to our sales performance. It has OrderID, ProductID, UnitPrice, and Quantity.

Wait a minute 💡. It shows the price of a single unit and how many units were sold, but it doesn't have a column for the total value of that line item. No problem, we can calculate it ourselves, directly in the select statement.

In [None]:
# The 3 consecutive double quotes allow us to write a multi-line string in Python and to ignore quotes inside the string.
query_and_print("""select OrderID, ProductID, UnitPrice, Quantity, UnitPrice * Quantity
from "Order Details" limit 10;""")
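
A calculated column has no natural name, so it is usually worth giving it one with `as`. The sketch below uses a two-row stand-in for the "Order Details" table in an in-memory database (the values are invented for the demo) and reads the column names back to show the effect of the alias.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    'create table "Order Details" '
    "(OrderID integer, ProductID integer, UnitPrice real, Quantity integer)"
)
demo.execute('insert into "Order Details" values (10248, 11, 14.0, 12), (10248, 42, 9.5, 10)')

# `as LineTotal` names the calculated column; without it, the column is
# typically labelled with the raw expression text
result = demo.execute("""
    select OrderID, ProductID, UnitPrice * Quantity as LineTotal
    from "Order Details"
""")
column_names = [desc[0] for desc in result.description]
line_items = result.fetchall()

print(column_names)
print(line_items)
```

The same alias works in the notebook query: adding `as LineTotal` after `UnitPrice * Quantity` gives the calculated column a readable header.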


---

# Can't beat'em? JOIN'em!

So, we've hit our first real data-wrangling challenge. We can see a ProductID in our list of valuable orders, but that's just a number. It doesn't tell our sales team whether the customer bought "Chai" or "Chef Anton's Gumbo Mix". 

That important piece of information, the product's name, lives in a completely different table, the Products table. This is intentional. 

**Databases are designed to be tidy** - instead of repeating the product name and description in every single order line, they just use a unique ID as a reference. It's like having a contact list with just phone numbers, and a separate address book. To find someone's address, you need to use their name to look them up in the other book. 

In SQL, this process of looking up and combining information from multiple tables is called a join. It's how we connect the dots using a shared piece of information—a "key" like ProductID—to build a more complete picture.



### What are the types of join ?

The *inner join*, which we'll use in the queries below, is the most common, but it's just one tool in the box.

Think of it as finding the perfect overlap between two lists; it only shows you records that have a match in both tables. Sometimes, however, you need a different perspective. What if you want to see all your products, including the ones that have never been sold? For this, you would use a *left join*.

👉 A *left join* keeps every single record from the first ("left") table and pulls in the matching data from the second ("right") table. If there's no match, it simply leaves the space blank (with a special value called null).

👉 A *right join* mirrors this, keeping every record from the right table instead, and a *full outer join* goes all in, keeping every record from both tables, whether they match or not. The type of join you choose all depends on the question you're asking: are you interested only in the matching data, or do you need to see the whole story, including what's missing?

Here is a graphical representation of these concepts. Use that as a CheatSheet !

![SQL Join Cheat sheet (source: reddit)](./images/sql-join-cheatsheet$.png)
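
To see the difference between an inner and a left join in practice, here is a small sketch on made-up data: three products, of which only two have ever been ordered. The tables are simplified stand-ins created in an in-memory database, not the real Northwind tables. SQLite versions before 3.39 do not support `right join` or `full outer join`, so only the two most common types are shown.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table Products (ProductID integer primary key, ProductName text)")
demo.execute('create table "Order Details" (OrderID integer, ProductID integer, Quantity integer)')
demo.execute("insert into Products values (1, 'Chai'), (2, 'Chang'), (3, 'Aniseed Syrup')")
demo.execute('insert into "Order Details" values (100, 1, 5), (101, 1, 2), (102, 2, 10)')

# Inner join: only products that appear in at least one order line
inner_rows = demo.execute("""
    select p.ProductName, od.Quantity
    from Products as p
    inner join "Order Details" as od on od.ProductID = p.ProductID
    order by p.ProductID, od.OrderID
""").fetchall()

# Left join: every product, with NULL (None in Python) where no order line matches
left_rows = demo.execute("""
    select p.ProductName, od.Quantity
    from Products as p
    left join "Order Details" as od on od.ProductID = p.ProductID
    order by p.ProductID, od.OrderID
""").fetchall()

# Products never sold: keep only the left-join rows where no match was found
never_sold = demo.execute("""
    select p.ProductName
    from Products as p
    left join "Order Details" as od on od.ProductID = p.ProductID
    where od.OrderID is null
""").fetchall()

print(inner_rows)
print(left_rows)
print(never_sold)
```

The left join returns one extra row, `('Aniseed Syrup', None)`, and filtering on that missing match is a common way to answer "what has never happened?" questions.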

## Finding our sales superstars with join


Well ! It's that time of year at Northwind traders: bonus season. 

The management wants to reward the salespeople based on their performance, but to do that, we need to answer a critical question: Who are our top-performing employees in terms of total sales value?

This is the most complex question we've faced yet, and it's impossible to answer by looking at just one table.

+ 💡 The employee's name is in the Employees table.
+ 💵 The value of each sale is in the "Order Details" table, which we have to calculate (UnitPrice * Quantity).
+ 🪢 The link between an employee and a sale is the Orders table, which connects an EmployeeID to an OrderID.

To get our answer, we need to build a bridge across all three tables. This is a perfect job for join. We'll start with the most detailed information (the individual line items of each order) and add the other pieces of information one by one.

Let's construct the query step-by-step:

👉 First, we join "Order Details" (which has the money) and Orders (which has the EmployeeID).

👉 Then, we join that result with Employees to get the salesperson's name.

In [None]:
query_and_print("""
select
  e.FirstName,
  e.LastName,
  od.OrderID,
  (od.UnitPrice * od.Quantity) as LineTotal
from "Order Details" as od
inner join Orders as o on od.OrderID = o.OrderID
inner join Employees as e on o.EmployeeID = e.EmployeeID;
""")

It is a bit complex so let's break that down:

👉 We start from "Order Details" (od) because it contains the UnitPrice and Quantity we need to calculate the value of each transaction.
👉 We use an inner join to connect to the Orders table (o) using the shared OrderID. Now, for each line item, we know which employee made the sale via o.EmployeeID.
👉 We use a second inner join to connect to the Employees table (e) using the shared EmployeeID. Now, instead of just an ID, we have their FirstName and LastName.

When you run this, you get a long, detailed list. It shows every single product line from every single sale, right next to the name of the salesperson who secured it. For example, you can see all the sales made by "Steven Buchanan" or "Michael Suyama".


But joins are not just for building massive queries. They are also the perfect tool for answering small, specific, everyday questions. For instance, imagine a colleague from human resources asks: "Who is the direct manager for Steven Buchanan and Michael Suyama?"

The answer is right there in the Employees table, but getting it requires a join on the table itself—a self-join. We need to find the employees, then use their ReportsTo ID to look up their manager's name in the very same table.

We use an inner join because we know they have managers. The key is to filter the employee list down to just the two people we care about.

In [None]:
query_and_print("""select
  e.FirstName || ' ' || e.LastName as EmployeeName,
  m.FirstName || ' ' || m.LastName as ManagerName
from Employees as e
inner join Employees as m on e.ReportsTo = m.EmployeeID
where e.LastName in ('Buchanan', 'Suyama');""")

## A conclusion on our journey so far
Let's pause and appreciate the ground we've covered. We started with a simple, practical goal: help Northwind traders boost sales. We didn't have to learn a hundred complex commands. Instead, with a handful of core tools, we transformed ourselves from passive observers into active investigators.

+ We learned to peek at data with select and limit, and to ask for specific columns instead of everything.
+ We used where, and, and or to filter the noise and focus on the data that mattered, like finding premium products or customers in key markets.
+ We created new information on the fly by doing calculations directly in our queries.
+ Most importantly, we learned to connect the dots with join. This was the key that unlocked the true power of our relational database, allowing us to build a complete picture from separate tables. We saw how to answer complex business questions, from finding our top sales performers to mapping out the company's internal reporting structure.

You now possess the foundational syntax to query almost any relational database. You understand the logic of asking for what you want, filtering it down, and connecting it to related information.


### 👉 Digging deeper: The next steps in your SQL adventure
We left our bonus-season query at a crucial point. We have a long list of every sale transaction next to the employee who made it, but it's not the final answer. The immediate next step in your SQL journey is to learn about aggregation. These are functions that summarise or group your data. You would use:

+ group by to tell the database to bundle rows together (e.g., group all sales by employee).
+ sum() to add up values (e.g., sum(LineTotal) to get the total sales value for that group).
+ count() to count the number of rows (e.g., how many orders a customer has placed).
+ avg(), min(), and max() to find the average, minimum, and maximum values in a group.

Using these aggregation functions will allow you to turn that long list of sales into a concise leaderboard, finally answering the "who gets the bonus" question. Beyond aggregation, you can explore more advanced SQL concepts like window functions (for complex rankings and comparisons) and common table expressions or CTEs (for making long queries more readable).
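
As a preview, here is a rough sketch of what such a leaderboard query could look like. It runs on a tiny in-memory database with invented figures (two employees, three orders), so the numbers mean nothing. The table and column names mirror the ones used in the join above, and `order by ... desc`, which sorts the result from highest to lowest, is an extra clause not covered in this tutorial.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    create table Employees (EmployeeID integer primary key, FirstName text, LastName text);
    create table Orders (OrderID integer primary key, EmployeeID integer);
    create table "Order Details" (OrderID integer, ProductID integer, UnitPrice real, Quantity integer);

    insert into Employees values (1, 'Nancy', 'Davolio'), (2, 'Andrew', 'Fuller');
    insert into Orders values (10, 1), (11, 1), (12, 2);
    insert into "Order Details" values
        (10, 1, 10.0, 2),   -- Nancy:  20
        (10, 2,  5.0, 4),   -- Nancy:  20
        (11, 1, 10.0, 1),   -- Nancy:  10
        (12, 3, 20.0, 5);   -- Andrew: 100
""")

# Same joins as before, but grouped so that each employee ends up on one row
leaderboard = demo.execute("""
    select
      e.FirstName,
      e.LastName,
      count(distinct o.OrderID)       as OrderCount,
      sum(od.UnitPrice * od.Quantity) as TotalSales
    from "Order Details" as od
    inner join Orders as o on od.OrderID = o.OrderID
    inner join Employees as e on o.EmployeeID = e.EmployeeID
    group by e.EmployeeID, e.FirstName, e.LastName
    order by TotalSales desc
""").fetchall()

for row in leaderboard:
    print(row)
```

Passing the same `select` statement to `query_and_print` should produce the leaderboard for the real Northwind data. Like the earlier line-total calculation, it ignores the Discount column.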

#### The leap to data engineering: What to expect in a company
The clean, tidy Northwind database in our Jupyter notebook is a perfect learning environment. A real company environment is a bit different, but the skills you've learned are the bedrock you'll build upon.

👉 The tools will be bigger: Instead of a single SQLite file, you will likely work with a cloud data warehouse. Tools like Snowflake, Amazon Redshift, or Google BigQuery are designed to handle enormous amounts of data (terabytes or even petabytes). The good news is that they all speak SQL, so most of the syntax you've learned transfers directly, apart from some dialect differences (date functions, for example). These systems can typically scale storage and query compute independently, which lets companies store everything and analyse it on demand.

👉 The data will be messy: The Northwind database is like a perfectly curated garden. Real-world data is often more like a wild jungle. You should expect to find:

+ Lots of null values: Missing information is a fact of life.
+ Inconsistent data: You might find the same country listed as 'USA', 'US', and 'United States' in the same column.
+ Incorrect data types: Numbers stored as text, dates in strange formats.
+ Duplicate records: The same order or customer entered twice.

A huge part of a data engineer's job is not just querying data, but cleaning and transforming it so that it becomes reliable and useful. This process is often called ETL (Extract, Transform, Load).
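
Here is a minimal sketch of that Extract, Transform, Load idea in plain SQL, covering each of the problems listed above. Everything in it is hypothetical: the `RawCustomers` table, its rows, and the cleaning rules (for instance, treating a missing revenue as 0 is a business decision, not a universal rule).

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table RawCustomers (CompanyName text, Country text, Revenue text)")
demo.executemany(
    "insert into RawCustomers values (?, ?, ?)",
    [
        ("Acme Foods", "USA", "1200"),
        ("Acme Foods", "USA", "1200"),             # exact duplicate
        ("Bern Bakery", " switzerland ", "300"),   # stray spaces and lower case
        ("Lyon Traiteur", None, "450"),            # missing country
        ("Texas Grill", "United States", "800"),   # same country, different spelling
        ("Ohio Diner", "US", None),                # missing revenue
    ],
)

# Extract from the raw table, transform in the select, load into a new table
demo.execute("""
    create table CleanCustomers as
    select distinct                                   -- removes exact duplicates
      CompanyName,
      case
        when upper(trim(Country)) in ('USA', 'US', 'UNITED STATES') then 'USA'
        when Country is null then 'Unknown'
        else upper(substr(trim(Country), 1, 1)) || lower(substr(trim(Country), 2))
      end as Country,                                 -- one spelling per country
      coalesce(cast(Revenue as real), 0.0) as Revenue -- text to number, NULL to 0
    from RawCustomers
""")

cleaned = demo.execute("select * from CleanCustomers order by CompanyName").fetchall()
for row in cleaned:
    print(row)
```

Six messy rows become five consistent ones. In a real pipeline, the same logic would usually live in a scheduled job or a transformation tool rather than a notebook cell.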

### Your Learning Roadmap 🗺️

For foundational skills, "SQL for Data Analysis" on Udacity stands out with its focus on analytical patterns and real datasets from tech companies. If you prefer interactive learning, DataCamp's "The Complete SQL Bootcamp" offers hands-on exercises with immediate feedback. 

🇨🇭 In the Swiss job market, certain certifications carry real weight. The Snowflake SnowPro Core certification is particularly valued, as many Swiss companies use Snowflake for their data warehousing. Google Cloud's Professional Data Engineer certification is comprehensive and well-recognized across European markets. For those starting out, Microsoft's Azure Data Fundamentals provides a solid foundation and pairs well with Power BI skills that many Swiss businesses use. 

Look at your activities and your company's tools, and follow what makes the most sense! There is no single right answer to this question.

💡 Don't neglect soft skills: in data work, they matter just as much. The ability to translate a vague business question into a precise SQL query is pure gold. You'll spend significant time explaining your findings to non-technical stakeholders and documenting your work so others can understand and maintain it.

