# Introduction to Data Engineering: Bridging the gap between raw data and insights

What we'll cover in this tutorial

In this tutorial, we'll use SQLite, a simple and easy-to-use database, to explore the fundamentals of data engineering and data preprocessing. We'll cover the following topics:
+ Introduction to SQL: We'll learn the basics of SQL (Structured Query Language), the standard language for interacting with databases.
+ Data Extraction: We'll use SQL to extract data from our SQLite database.
+ Data Transformation: We'll use SQL to clean, transform, and prepare the data for analysis. This will include handling missing values, inconsistent formatting, and duplicates.
+ Data Loading: We'll load the transformed data into a new table or file for further analysis.


### Understanding SQL and Relational Databases

Relational databases organize data into tables, which you can visualize as spreadsheets with rows and columns. Each table represents a specific type of entity or concept – imagine tables for "Customers," "Products," or "Orders." 
To query these databases, we use **SQL**, the standard language for interacting with relational databases.

+ Rows (Records): Each row in a table represents a single instance of that entity. For example, in the diagram below, each row of the "Order" table represents one specific order. We often refer to rows as records.
+ Columns (Attributes): Each column in a table represents a characteristic or attribute of that entity. In the "Order" table, columns might include "UnitPrice," "Quantity," and "Discount." Each column has a specific data type (e.g., text, number, date).

The power of relational databases lies in their ability to represent relationships between different entities. These relationships are created using keys:

+ Primary Key: A unique identifier for each record within a table. For example, "CustomerID" in the "Customers" table would likely be a primary key, ensuring that each customer has a unique identifier.
+ Foreign Key: A column in one table that refers to the primary key of another table. This establishes a link between the two tables. For example, an "Orders" table might have a "CustomerID" column that is a foreign key referencing the "CustomerID" (primary key) in the "Customers" table. This way, you know which customer placed each order.

Tables are linked together through these keys, allowing you to perform complex queries that combine data from multiple tables. They can be linked using different types of relationships, such as one-to-one, one-to-many, or many-to-many.
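
To make the idea of keys concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It builds two tiny tables in a throwaway in-memory database; the table and column names mirror the examples above, but the rows (`Example Foods`, `Sample Traders`) are invented for illustration and are not Northwind data.

```python
import sqlite3

# Throwaway in-memory database; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.execute("""
    CREATE TABLE Customers (
        CustomerID  TEXT PRIMARY KEY,
        CompanyName TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID TEXT NOT NULL REFERENCES Customers(CustomerID),
        OrderDate  TEXT
    )
""")

conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [("C1", "Example Foods"), ("C2", "Sample Traders")])
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                 [(1, "C1", "2024-01-05"), (2, "C1", "2024-02-10"), (3, "C2", "2024-02-11")])

# One-to-many: a customer can have many orders; the JOIN follows the foreign key.
orders_with_names = conn.execute("""
    SELECT o.OrderID, c.CompanyName
    FROM Orders AS o
    JOIN Customers AS c ON c.CustomerID = o.CustomerID
    ORDER BY o.OrderID
""").fetchall()
print(orders_with_names)

# An order that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (4, 'C9', '2024-03-01')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("Orphan order rejected:", orphan_rejected)
```

The `PRAGMA foreign_keys = ON` line is specific to SQLite, which leaves foreign-key enforcement off by default; other database systems may behave differently.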

Below is a database schema diagram that illustrates these concepts:


![Northwind Database Schema](./images/db-schema-northwind.svg)


---

# We need to populate our database with some data!

In this tutorial, we will be using the Northwind database, a sample database that contains data about a fictional company that imports and exports specialty foods. It includes tables for customers, orders, products, and more.
The schema diagram above shows the relationships between the tables in the Northwind database.

We've created a small script for you that will download the Northwind database, already populated with sample data. To make things easy, we use SQLite, a lightweight database that is easy to set up and lets you work with SQL without needing a separate server.
In a real-world scenario, you would typically connect to a database server (such as PostgreSQL, MySQL, or Snowflake) and run SQL commands to create and populate your database.

💡 **You don't really need to understand what's happening in the code below; you can just execute it and move on to the next section.**

In [None]:
#Import required libraries for data processing

import urllib.request
import os
import sqlite3
import pandas as pd  # added for dataframe operations

In [None]:
# Verifying if the database file exists, if not, download it. Instantiate the connection to the database.

url = "https://github.com/jpwhite3/northwind-SQLite3/raw/refs/heads/main/dist/northwind.db"
db_path = "northwind.db"

if not os.path.exists(db_path):
    print("Downloading the database...")
    urllib.request.urlretrieve(url, db_path)
    print("Database downloaded successfully!")
else:
    print("Database already exists.")

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()


Database already exists.


In [None]:
def query_and_print(query):
    """
    Execute a SQL query and print the results.
    """
    cursor.execute(query)
    results = cursor.fetchall()
    
    # Convert results to a DataFrame for better readability
    df = pd.DataFrame(results, columns=[desc[0] for desc in cursor.description])
    
    print(df.to_string(index=False))
    return df


---

# Back to our SQL queries

SQL is the standard language for talking to databases – the organized stores of information that power your business. Think of SQL as the tool that lets you ask questions of your data and get the answers you need to make better decisions. It's like having a direct line to your company's knowledge.

Why is SQL Important for Business Professionals?

SQL empowers you to:

Get answers to critical business questions: Instead of relying solely on reports or technical teams, you can use SQL to directly query your data and answer questions like:
+ What are my top-selling products this quarter?
+ Which marketing campaigns are generating the most leads?
+ What is the average order value for customers in a specific region?
+ Who are my most valuable customers, and what are they buying?




---

# 🤖 SQL + AI: Smarter Queries, Safer Data

AI can greatly speed up and simplify SQL query generation, but it requires a thoughtful approach and a bit of practice to get the most out of it. Here are some tips to help you use AI effectively for SQL query generation:

### 1. Prompt Engineering:

Set the Stage: Begin by defining the context: "As a database engineer using [Your Database System]...". This helps the system understand the environment and its constraints.
This is important because SQL syntax varies between database systems, and naming yours helps the query come out with the proper syntax and without errors.

🫸 Schema as Context: Always provide the relevant database schema as text at the beginning of your prompt or discussion. Without it, the AI model cannot understand what data it is manipulating.
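
One possible way to get the schema as text from SQLite is to read the `CREATE TABLE` statements that SQLite stores in its built-in `sqlite_master` table. The sketch below is only an illustration: the helper name `schema_as_text`, the two demo tables, and the prompt wording are assumptions, not part of the tutorial's script.

```python
import sqlite3

def schema_as_text(conn: sqlite3.Connection) -> str:
    """Return the CREATE statements of all user tables as one block of text."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return "\n\n".join(sql for (sql,) in rows if sql)

# Demo on a throwaway in-memory database with two hypothetical tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, City TEXT)")
demo.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT)")

schema_text = schema_as_text(demo)

# Paste the schema at the top of the prompt, before the actual request.
prompt = (
    "As a database engineer using SQLite, here is my schema:\n\n"
    f"{schema_text}\n\n"
    "Write a query that counts orders per city."
)
print(prompt)
```

In this notebook, calling `schema_as_text(conn)` with the Northwind connection created earlier should produce the text to paste into your prompt.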

### 2. Clarity and Critical Review:

Clarity is Key: Articulate your needs precisely. Avoid ambiguity!

Example Prompts:
+ "I want the products generating the most revenue between 2014 and 2018."
+ "I want the name and phone number of our lead customer with over $1 million in purchases."

Critical Review: Always validate the AI's output!

Watch Out For:
+ Inaccurate dates or filters: these can give you wrong results and misleading insights.
+ Destructive commands like DELETE or DROP TABLE: these can endanger your whole database and disrupt the work of others!

---

# Let's cross the Rubicon!

Alright, let's get our hands dirty. Imagine we've just been hired as the new data analysts for a company called "Northwind Traders", a small but ambitious company that imports and exports gourmet food products around the world.

Our manager has given us our first big mission: Help the company boost its sales.

That's a broad goal. Where do we even begin? As data people, our first instinct should be to look at the data we have. We need to understand the business before we can change it. Our main tool for this investigation will be SQL.

Let's start by asking some basic questions.


### Peeking into the data with SELECT and LIMIT

Before we can think about sales, we need to know what we're actually selling. Our database has a table called Products, which seems like a good place to start.

Let's ask the database to show us everything in that table. The command for "show me" in SQL is SELECT. If we want to see every column, we use the asterisk *, which is a wildcard for "all".

In [None]:
query_and_print("SELECT * FROM Products;")

Running this, you'll see a lot of data, maybe too much. We don't need to see every single product right now; we just want to get a feel for the table's structure.

To ask for just a sample, we can add a LIMIT clause, which is useful for taking a quick peek without overwhelming our screen. Let's look at just the first 5 products.

In [None]:
query_and_print("SELECT * FROM Products LIMIT 5;")


---

### Focusing your search: SELECT (Specific Columns) and LIMIT


Use Case: You're only interested in a few specific details from the folder (table). For example, you only want to see the order number, customer, and date. This is like skimming a document for only the key points.

Instead of using * (which means "everything"), you list the specific column names you want to see. Applied to the Orders table, this is like saying "Show me only the Order Number, Customer, and Date for the first 5 orders."
Translated to SQL, it would be:

```sql
SELECT OrderID, CustomerID, OrderDate FROM Orders LIMIT 5;
```


In [10]:
query_and_print("SELECT OrderID, CustomerID, OrderDate FROM Orders LIMIT 5;")

 OrderID CustomerID  OrderDate
   10248      VINET 2016-07-04
   10249      TOMSP 2016-07-05
   10250      HANAR 2016-07-08
   10251      VICTE 2016-07-08
   10252      SUPRD 2016-07-09




Only the 3 columns we specified were returned! Much more readable.


---

### Finding Exactly What You Need: WHERE (Simple Filter) 

You need to find specific documents (rows) that meet certain criteria. For instance, you want to find all orders placed by a particular customer. This is like using a keyword search to find a specific file.

The WHERE clause acts as a filter, allowing you to specify conditions that must be met for a row to be included in the results.

In [12]:
query_and_print("SELECT * FROM Orders WHERE CustomerID = 'ALFKI';")

|  | OrderID | CustomerID | EmployeeID | OrderDate | RequiredDate | ShippedDate | ShipVia | Freight | ShipName | ShipAddress | ShipCity | ShipRegion | ShipPostalCode | ShipCountry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10643 | ALFKI | 6 | 2017-08-25 | 2017-09-22 | 2017-09-02 | 1 | 19.50 | Alfreds Futterkiste | Obere Str. 57 | Berlin | Western Europe | 12209 | Germany |
| 1 | 10692 | ALFKI | 4 | 2017-10-03 | 2017-10-31 | 2017-10-13 | 2 | 15.00 | Alfred-s Futterkiste | Obere Str. 57 | Berlin | Western Europe | 12209 | Germany |
| 2 | 10702 | ALFKI | 4 | 2017-10-13 | 2017-11-24 | 2017-10-21 | 1 | 15.25 | Alfred-s Futterkiste | Obere Str. 57 | Berlin | Western Europe | 12209 | Germany |
| 3 | 10835 | ALFKI | 1 | 2018-01-15 | 2018-02-12 | 2018-01-21 | 3 | 14.25 | Alfred-s Futterkiste | Obere Str. 57 | Berlin | Western Europe | 12209 | Germany |
| 4 | 10952 | ALFKI | 1 | 2018-03-16 | 2018-04-27 | 2018-03-24 | 1 | 14.50 | Alfred-s Futterkiste | Obere Str. 57 | Berlin | Western Europe | 12209 | Germany |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 158 | 25970 | ALFKI | 8 | 2014-05-30 03:45:30 | 2014-06-12 15:55:28 | 2014-06-01 01:30:12 | 2 | 360.00 | Godos Cocina Típica | C/ Romero, 33 | Sevilla | Southern Europe | 41101 | Spain |
| 159 | 26123 | ALFKI | 2 | 2021-08-20 09:27:19 | 2021-09-16 07:56:40 | 2021-09-05 02:35:47 | 3 | 27.00 | Wolski Zajazd | ul. Filtrowa 68 | Warszawa | Eastern Europe | 01-012 | Poland |
| 160 | 26255 | ALFKI | 9 | 2018-09-08 04:33:44 | 2018-09-17 16:41:31 | 2018-09-08 04:46:15 | 2 | 470.50 | Let-s Stop N Shop | 87 Polk St. Suite 5 | San Francisco | North America | 94117 | USA |
| 161 | 26376 | ALFKI | 3 | 2023-10-15 09:15:19 | 2023-10-18 02:17:28 | 2023-10-22 14:59:32 | 3 | 107.00 | France restauration | 54, rue Royale | Nantes | Western Europe | 44000 | France |
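
When the value in a WHERE filter comes from a variable (a customer picked by a user, for example), a common practice with Python's `sqlite3` module is to pass it through a `?` placeholder rather than pasting it into the SQL string. The sketch below assumes a tiny in-memory Orders table with invented rows, and the helper `orders_for_customer` is hypothetical; it is not part of the tutorial's `query_and_print` function.

```python
import sqlite3

# Throwaway in-memory table; the rows are invented for illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT)")
demo.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(1, "ALFKI"), (2, "VINET"), (3, "ALFKI")])

def orders_for_customer(conn, customer_id):
    """Return the OrderIDs for one customer, using a ? placeholder for the value."""
    rows = conn.execute(
        "SELECT OrderID FROM Orders WHERE CustomerID = ? ORDER BY OrderID",
        (customer_id,),  # the value travels separately from the SQL text
    ).fetchall()
    return [order_id for (order_id,) in rows]

print(orders_for_customer(demo, "ALFKI"))
print(orders_for_customer(demo, "NOPE"))
```

Because the value is never spliced into the query text, an input such as `x' OR '1'='1` is treated as an ordinary (non-matching) customer ID instead of changing the filter.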


### Fine-tuning your search: WHERE with AND / OR (Multiple Filters)

You want to find documents (rows) that meet multiple requirements. For example, you want to find customers who live in a specific city and country. The AND and OR operators let you combine multiple conditions in the WHERE clause:

+ AND: "Must match all of these!" Every condition must be true for a row to be included.
+ OR: "Must match at least one of these!" One or more conditions must be true for a row to be included.

In [14]:
query_and_print("SELECT * FROM Customers WHERE City = 'London' AND Country = 'UK';")

|  | CustomerID | CompanyName | ContactName | Title | Address | City | Region | PostalCode | Country | Phone | Fax |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AROUT | Around the Horn | Thomas Hardy | Sales Representative | 120 Hanover Sq. | London | British Isles | WA1 1DP | UK | (171) 555-7788 | (171) 555-6750 |
| 1 | BSBEV | B's Beverages | Victoria Ashworth | Sales Representative | Fauntleroy Circus | London | British Isles | EC2 5NT | UK | (171) 555-1212 | None |
| 2 | CONSH | Consolidated Holdings | Elizabeth Brown | Sales Representative | Berkeley Gardens 12 Brewery | London | British Isles | WX1 6LT | UK | (171) 555-2282 | (171) 555-9199 |
| 3 | EASTC | Eastern Connection | Ann Devon | Sales Agent | 35 King George | London | British Isles | WX3 6FW | UK | (171) 555-0297 | (171) 555-3373 |
| 4 | NORTS | North/South | Simon Crowther | Sales Associate | South House 300 Queensbridge | London | British Isles | SW7 1RZ | UK | (171) 555-7733 | (171) 555-2530 |
| 5 | SEVES | Seven Seas Imports | Hari Kumar | Sales Manager | 90 Wadhurst Rd. | London | British Isles | OX15 4NB | UK | (171) 555-1717 | (171) 555-5646 |


The OR operator provides broader filtering. Try this query:

`SELECT * FROM Customers WHERE City = 'London' OR Country = 'UK';`

This retrieves customers located either in London or anywhere in the UK. Before running such queries, it's crucial to consider: What data do I really need? Can the filtering logic be simplified to achieve a more precise result? Thoughtful filtering ensures you extract only the most relevant information.

In [5]:
query_and_print("SELECT * FROM Customers WHERE City = 'London' OR Country = 'UK';")

|  | CustomerID | CompanyName | ContactName | Title | Address | City | Region | PostalCode | Country | Phone | Fax |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AROUT | Around the Horn | Thomas Hardy | Sales Representative | 120 Hanover Sq. | London | British Isles | WA1 1DP | UK | (171) 555-7788 | (171) 555-6750 |
| 1 | BSBEV | B's Beverages | Victoria Ashworth | Sales Representative | Fauntleroy Circus | London | British Isles | EC2 5NT | UK | (171) 555-1212 | None |
| 2 | CONSH | Consolidated Holdings | Elizabeth Brown | Sales Representative | Berkeley Gardens 12 Brewery | London | British Isles | WX1 6LT | UK | (171) 555-2282 | (171) 555-9199 |
| 3 | EASTC | Eastern Connection | Ann Devon | Sales Agent | 35 King George | London | British Isles | WX3 6FW | UK | (171) 555-0297 | (171) 555-3373 |
| 4 | ISLAT | Island Trading | Helen Bennett | Marketing Manager | Garden House Crowther Way | Cowes | British Isles | PO31 7PJ | UK | (198) 555-8888 | None |
| 5 | NORTS | North/South | Simon Crowther | Sales Associate | South House 300 Queensbridge | London | British Isles | SW7 1RZ | UK | (171) 555-7733 | (171) 555-2530 |
| 6 | SEVES | Seven Seas Imports | Hari Kumar | Sales Manager | 90 Wadhurst Rd. | London | British Isles | OX15 4NB | UK | (171) 555-1717 | (171) 555-5646 |
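
To see how AND and OR differ, and when a filter can be simplified, here is a small sketch on an invented four-row table (the rows, including a hypothetical London outside the UK, are assumptions for illustration and not Northwind data). The helper `ids` inserts the condition text directly into the query, which is acceptable only because the conditions are hard-coded here.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, City TEXT, Country TEXT)")
demo.executemany("INSERT INTO Customers VALUES (?, ?, ?)", [
    ("A", "London", "UK"),
    ("B", "Cowes", "UK"),
    ("C", "Berlin", "Germany"),
    ("D", "London", "Canada"),  # a London outside the UK
])

def ids(condition):
    """Return the CustomerIDs matching a hard-coded WHERE condition."""
    rows = demo.execute(
        f"SELECT CustomerID FROM Customers WHERE {condition} ORDER BY CustomerID"
    ).fetchall()
    return [customer_id for (customer_id,) in rows]

both = ids("City = 'London' AND Country = 'UK'")    # must satisfy both conditions
either = ids("City = 'London' OR Country = 'UK'")   # must satisfy at least one
uk_only = ids("Country = 'UK'")

print("AND:", both)
print("OR: ", either)
print("UK: ", uk_only)
```

In this toy table the OR query returns one more row than `Country = 'UK'` because of customer D. In the Northwind output above, every returned row has Country UK, so for that data the simpler filter `Country = 'UK'` appears to return the same customers.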
