# Introduction to Data Engineering: Bridging the gap between raw data and insights

What we'll cover in this tutorial

In this tutorial, we'll use SQLite, a lightweight and easy-to-use database, to explore the fundamentals of data engineering and data preprocessing. We'll cover the following topics:
+ Introduction to SQL: We'll learn the basics of SQL (Structured Query Language), the standard language for interacting with databases.
+ Data Extraction: We'll use SQL to extract data from our SQLite database.
+ Data Transformation: We'll use SQL to clean, transform, and prepare the data for analysis. This includes handling missing values, inconsistent formatting, and duplicates.
+ Data Loading: We'll load the transformed data into a new table or file for further analysis.


### Understanding SQL and Relational Databases

Relational databases organize data into tables, which you can picture as spreadsheets with rows and columns. Each table represents one type of entity or concept, such as "Customers," "Products," or "Orders."
To query these databases, we use **SQL**, the standard language for interacting with relational databases.

+ Rows (Records): Each row in a table represents a single instance of that entity. For example, in the diagram below, each row of the "Orders" table represents one specific order. We often refer to rows as records.
+ Columns (Attributes): Each column in a table represents a characteristic or attribute of that entity. In the "Order Details" table, columns include "UnitPrice," "Quantity," and "Discount." Each column has a specific data type (e.g., text, number, date).

The power of relational databases lies in their ability to represent relationships between different entities. These relationships are created using keys:

+ Primary Key: A unique identifier for each record within a table. For example, "CustomerID" in the "Customers" table would likely be a primary key, ensuring that each customer has a unique identifier.
+ Foreign Key: A column in one table that refers to the primary key of another table. This establishes a link between the two tables. For example, an "Orders" table might have a "CustomerID" column that is a foreign key referencing the "CustomerID" (primary key) in the "Customers" table. This way, you know which customer placed each order.

Tables are linked together through these keys, allowing you to perform complex queries that combine data from multiple tables. They can be linked using different types of relationships, such as one-to-one, one-to-many, or many-to-many.
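
The key concepts above can be sketched with two tiny tables. This is a minimal illustration, not the real Northwind schema: the table definitions are simplified, the order dates are made up, and the `demo` connection points at a throwaway in-memory database. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

# Throwaway in-memory database (assumption for illustration; no file is touched).
demo = sqlite3.connect(":memory:")
demo.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Simplified tables: CustomerID is the primary key of Customers
# and a foreign key in Orders.
demo.execute("""
    CREATE TABLE Customers (
        CustomerID  TEXT PRIMARY KEY,
        CompanyName TEXT NOT NULL
    )
""")
demo.execute("""
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID TEXT NOT NULL REFERENCES Customers(CustomerID),
        OrderDate  TEXT
    )
""")

demo.execute("INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste')")
demo.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [(1, "ALFKI", "2024-01-05"), (2, "ALFKI", "2024-02-11")],  # made-up sample orders
)

# One-to-many relationship: one customer, several orders.
orders_per_customer = demo.execute("""
    SELECT Customers.CompanyName, COUNT(Orders.OrderID)
    FROM Customers
    JOIN Orders ON Orders.CustomerID = Customers.CustomerID
    GROUP BY Customers.CustomerID
""").fetchall()
print(orders_per_customer)

# An order that points at a customer that does not exist violates the foreign key.
try:
    demo.execute("INSERT INTO Orders VALUES (3, 'NOPE', '2024-03-01')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("Foreign key violation rejected:", fk_rejected)

demo.close()
```

The primary key guarantees that each customer appears once, and the foreign key guarantees that every order belongs to a customer that actually exists.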

Below is a database schema diagram that illustrates these concepts:


![Northwind Database Schema](./images/db-schema-northwind.svg)

# Let's populate our database with some data!

In this tutorial, we will be using the Northwind database, a sample database that contains data about a fictional company that imports and exports specialty foods. It includes tables for customers, orders, products, and more.
The schema diagram above shows the relationships between the tables in the Northwind database.

We've prepared a small script that downloads a ready-made copy of the Northwind database. To keep things easy, we use SQLite, a lightweight database engine that stores the whole database in a single file and needs no separate server.
In a real-world scenario, you would typically connect to a database server (such as PostgreSQL, MySQL, or Snowflake) and run SQL commands to create and populate your database.


In [27]:
import urllib.request
import os
import sqlite3
import pandas as pd  # added for dataframe operations

In [28]:
url = "https://github.com/jpwhite3/northwind-SQLite3/raw/refs/heads/main/dist/northwind.db"
db_path = "northwind.db"

if not os.path.exists(db_path):
    print("Downloading the database...")
    urllib.request.urlretrieve(url, db_path)
    print("Database downloaded successfully!")
else:
    print("Database already exists.")

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()


Database already exists.
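
Once connected, a quick sanity check is to list the tables the database contains. The sketch below does this by querying SQLite's built-in `sqlite_master` catalog. The helper name `list_tables` is hypothetical, and the demonstration uses a throwaway in-memory database so the block runs on its own.

```python
import sqlite3

def list_tables(connection):
    """Return the names of the user tables in a SQLite database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "  # skip SQLite's internal tables
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Stand-alone demonstration on an in-memory database (illustrative tables only).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY)")
demo.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY)")
demo_tables = list_tables(demo)
print(demo_tables)
demo.close()
```

Calling `list_tables(conn)` with the connection opened above should return the Northwind tables instead.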


# Basic data engineering concepts with SQL

In this section, you will discover the fundamental operations on a relational database:

- **SELECT**: extract data from one or more tables
- **UPDATE**: modify existing data
- **Joins (JOIN)**: combine data from several tables
- **Renaming columns**: give a column a new name

We will use the Northwind database, a classic example of a business management database (customers, orders, products, etc.).

This lets you practice writing and running SQL queries on a realistic example.
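
Renaming a column does not have to change the table itself. As a minimal sketch, an `AS` alias renames columns only in the result of a query. The example runs on a throwaway in-memory table, and the alias names `id`, `company`, and `title` are illustrative choices, not part of the Northwind database.

```python
import sqlite3

# Throwaway in-memory table with a few Customers columns (assumption for illustration).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, CompanyName TEXT, ContactTitle TEXT)"
)
demo.execute(
    "INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste', 'Sales Representative')"
)

# AS gives each column a new name in the result set only.
cur = demo.execute("""
    SELECT CustomerID   AS id,
           CompanyName  AS company,
           ContactTitle AS title
    FROM Customers
""")
result_columns = [col[0] for col in cur.description]  # column names of the result
rows = cur.fetchall()
print(result_columns)
print(rows)

# The table definition is untouched: the original column names are still there.
table_columns = [row[1] for row in demo.execute("PRAGMA table_info(Customers)")]
print(table_columns)

demo.close()
```

By contrast, the `ALTER TABLE ... RENAME COLUMN` statement used in the cell below permanently changes the table, so every later query has to use the new name.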

In [7]:
# Select all columns from the 'Customers' table
cursor.execute("SELECT * FROM Customers;")
customers = cursor.fetchall()
print("First 5 Customers:")
for row in customers[:5]:  # Print only the first 5 rows
    print(row)

# Use pandas to display the data
df = pd.read_sql_query("SELECT * FROM Customers", conn)
print(df.head())

# Select specific columns (CustomerID, CompanyName)
cursor.execute("SELECT CustomerID, CompanyName FROM Customers;")
customer_ids = cursor.fetchall()
print("\nFirst 5 Customer IDs and Company Names:")
for row in customer_ids[:5]:
    print(row)

# Using WHERE clause (e.g., customers from 'Germany')
cursor.execute("SELECT * FROM Customers WHERE Country = 'Germany';")
german_customers = cursor.fetchall()
print("\nCustomers from Germany:")
for row in german_customers:
    print(row)

# Using UPDATE (Careful with UPDATE statements! Always use WHERE to limit the scope)
# Let's update the ContactTitle for a specific customer (e.g., CustomerID = 'ALFKI')
# This is for demonstration, be very cautious when using UPDATE
try:
    cursor.execute("UPDATE Customers SET ContactTitle = 'Sales Manager' WHERE CustomerID = 'ALFKI';")
    conn.commit() # Save the changes
    print("\nContactTitle updated for ALFKI.")
except sqlite3.Error as e:
    print(f"An error occurred: {e}")

# verify update
cursor.execute("SELECT CustomerID, ContactTitle FROM Customers WHERE CustomerID = 'ALFKI'")
updated_record = cursor.fetchone()
print(updated_record)

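# Example using ALTER TABLE (renaming a column). Be CAREFUL with altering tables in production.
# Rename ContactTitle to Title.
# Note: the rename is saved in northwind.db, so re-running this cell afterwards fails
# on the ContactTitle queries above. Delete northwind.db and download it again to start over.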
try:
    cursor.execute("ALTER TABLE Customers RENAME COLUMN ContactTitle TO Title")
    conn.commit()
    print("Column renamed successfully")
except sqlite3.Error as e:
    print(f"An error occurred: {e}")

# Verify the change (you'll get an error if you try to select from ContactTitle now)
cursor.execute("SELECT CustomerID, Title FROM Customers WHERE CustomerID = 'ALFKI'")
updated_record = cursor.fetchone()
print(updated_record)


# Example of a JOIN (Orders and Customers)

sql_join_query = """
SELECT
    Orders.OrderID,
    Customers.CompanyName,
    Orders.OrderDate
FROM
    Orders
JOIN
    Customers ON Orders.CustomerID = Customers.CustomerID
LIMIT 5;
"""

cursor.execute(sql_join_query)

join_results = cursor.fetchall()

print("\nExample JOIN (Orders and Customers):")
for row in join_results:
    print(row)

# Clean up connection
conn.close()


First 5 Customers:
('ALFKI', 'Alfreds Futterkiste', 'Maria Anders', 'Sales Representative', 'Obere Str. 57', 'Berlin', 'Western Europe', '12209', 'Germany', '030-0074321', '030-0076545')
('ANATR', 'Ana Trujillo Emparedados y helados', 'Ana Trujillo', 'Owner', 'Avda. de la Constitución 2222', 'México D.F.', 'Central America', '05021', 'Mexico', '(5) 555-4729', '(5) 555-3745')
('ANTON', 'Antonio Moreno Taquería', 'Antonio Moreno', 'Owner', 'Mataderos  2312', 'México D.F.', 'Central America', '05023', 'Mexico', '(5) 555-3932', None)
('AROUT', 'Around the Horn', 'Thomas Hardy', 'Sales Representative', '120 Hanover Sq.', 'London', 'British Isles', 'WA1 1DP', 'UK', '(171) 555-7788', '(171) 555-6750')
('BERGS', 'Berglunds snabbköp', 'Christina Berglund', 'Order Administrator', 'Berguvsvägen  8', 'Luleå', 'Northern Europe', 'S-958 22', 'Sweden', '0921-12 34 65', '0921-12 34 67')
  CustomerID                         CompanyName         ContactName  \
0      ALFKI                 Alfreds Futterk