<h1 style="color: rgb(241, 90, 36)"><img src="./images/SQLIcon.png?modified=24223" width=80px height=80px style="vertical-align: middle;"> Database Optimisation</h1>

Database optimisation plays a crucial role in ensuring the efficiency, reliability, and scalability of database systems. Here's why optimisation is essential:

1. Performance Improvement
    - Optimisation enhances the performance of database operations, including data retrieval, insertion, update and deletion
    - Improved performance leads to faster response time for queries, which is critical for applications with high user traffic or real-time processing requirements <br><br>

2. Resource Utilisation
    - Optimised databases consume fewer resources such as CPU, memory and disk I/O
    - Efficient resource utilisation allows the database to handle more concurrent users and transactions without experiencing performance degradation <br><br>

3. Cost Reduction
    - Optimised databases require fewer hardware resources, reducing upfront procurement costs, maintenance expenses, and energy consumption for both on-premises and cloud deployments <br><br>

4. Scalability
    - Optimisation prepares the database for future growth by improving scalability
    - A well-optimised database can accommodate increased data volume and user load without sacrificing performance <br><br>

5. Data Integrity
    - Optimisation techniques, such as normalisation (which we will cover extensively in this lesson), help maintain data integrity by minimising redundancy and dependency

<h2 style="color: rgb(241, 90, 36)"> Database Normalisation</h2>

Normalisation is the process of organising data in a database efficiently. It involves structuring tables and relationships to minimise redundancy and dependency. The primary goal of normalisation is to ensure data integrity and reduce anomalies such as:

- *Insertion Anomalies*: An insertion anomaly occurs when you cannot add data to the database without first adding other unrelated data. In a denormalised or poorly normalised database, inserting new records may require adding redundant information or leaving some fields blank.
- *Update Anomalies*: Update anomalies arise when updating data in a database results in inconsistencies. For example, if you have redundant data stored in multiple places, and you update one copy but not the other, the data becomes inconsistent.
- *Deletion Anomalies*: Deletion anomalies occur when deleting data from a database unintentionally removes other unrelated data as well
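
The update and deletion anomalies described above can be reproduced with a small sketch. The table `enrolments_flat`, its columns, and the replacement instructor name are hypothetical and exist only for illustration; the statements are intended for an empty scratch database.

```sql
-- Hypothetical denormalised table: the instructor is repeated on every enrolment row
CREATE TABLE enrolments_flat (
    student_id  INT,
    course_name VARCHAR(50),
    instructor  VARCHAR(50)
);

INSERT INTO enrolments_flat VALUES (1, 'Math', 'Mr. Johnson');
INSERT INTO enrolments_flat VALUES (2, 'Math', 'Mr. Johnson');
INSERT INTO enrolments_flat VALUES (2, 'Chemistry', 'Mrs. Anderson');

-- Update anomaly: only one of the two 'Math' rows is changed,
-- so the table now names two different instructors for the same course
UPDATE enrolments_flat
SET instructor = 'Ms. Patel'  -- made-up name, for illustration only
WHERE course_name = 'Math' AND student_id = 1;

-- Deletion anomaly: removing student 2's Chemistry enrolment
-- also removes the only record of who teaches Chemistry
DELETE FROM enrolments_flat
WHERE student_id = 2 AND course_name = 'Chemistry';

-- Shows the two conflicting instructors now stored for 'Math'
SELECT DISTINCT course_name, instructor
FROM enrolments_flat
WHERE course_name = 'Math';
```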

> Normalisation is typically achieved through a series of rules or *normal forms*, each building on the principles of the previous ones.

Here's an overview of the key normal forms:

<h3 style="color: rgb(241, 90, 36)"> First Normal Form (1NF)</h3>

1NF ensures that each table's structure adheres to certain criteria:
- **Atomicity**: Each cell in the table holds a single value, eliminating the presence of multi-valued attributes
- **Primary Key**: There exists a primary key that uniquely identifies each record
- **No Duplication**: No duplicated rows or columns exist in the table <br><br>

Consider a table storing information about students and their courses: 

<table border="1">
  <tr>
    <th>Student ID</th>
    <th>Name</th>
    <th>Courses</th>
  </tr>
  <tr>
    <td>001</td>
    <td>John Doe</td>
    <td>Math, Physics, Chemistry</td>
  </tr>
  <tr>
    <td>002</td>
    <td>Jane Smith</td>
    <td>Biology, History</td>
  </tr>
  <tr>
    <td>003</td>
    <td>Alice Lee</td>
    <td>Math, English</td>
  </tr>
</table>

In the unnormalised form:
- The `Courses` column contains multiple values separated by commas, violating atomicity
- Although `Student ID` identifies each row, no key identifies an individual student-course pairing, because several courses are packed into a single cell
- Duplicated information exists in the `Courses` column

To bring this table into 1NF, we'll split it into two separate tables:

**Students Table**
<table border="1">
  <tr>
    <th>Student ID</th>
    <th>Name</th>
  </tr>
  <tr>
    <td>001</td>
    <td>John Doe</td>
  </tr>
  <tr>
    <td>002</td>
    <td>Jane Smith</td>
  </tr>
  <tr>
    <td>003</td>
    <td>Alice Lee</td>
  </tr>
</table>

**Courses Table**
<table border="1">
  <tr>
    <th>Student ID</th>
    <th>Course</th>
  </tr>
  <tr>
    <td>001</td>
    <td>Math</td>
  </tr>
  <tr>
    <td>001</td>
    <td>Physics</td>
  </tr>
  <tr>
    <td>001</td>
    <td>Chemistry</td>
  </tr>
  <tr>
    <td>002</td>
    <td>Biology</td>
  </tr>
  <tr>
    <td>002</td>
    <td>History</td>
  </tr>
  <tr>
    <td>003</td>
    <td>Math</td>
  </tr>
  <tr>
    <td>003</td>
    <td>English</td>
  </tr>
</table>

Now, each table satisfies 1NF criteria:
- Atomicity: Each cell contains a single, indivisible value
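- Primary Key: `Student ID` is the primary key of the `Students` table. In the `Courses` table a student appears once per course, so the combination of `Student ID` and `Course` uniquely identifies each row
- No Duplication: There are no duplicated rows or columns in either table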
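
As a rough sketch, the two 1NF tables could be declared as follows. The table and column names (`students`, `student_courses`, and so on) and the column types are assumptions chosen to mirror the example; they are not prescribed by the lesson. Run it in an empty scratch database.

```sql
CREATE TABLE students (
    student_id CHAR(3) PRIMARY KEY,       -- CHAR keeps the leading zeros in '001'
    name       VARCHAR(100) NOT NULL
);

CREATE TABLE student_courses (
    student_id CHAR(3) NOT NULL,
    course     VARCHAR(50) NOT NULL,
    -- A student appears once per course, so the key is the pair of columns
    PRIMARY KEY (student_id, course),
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);

INSERT INTO students VALUES ('001', 'John Doe');
INSERT INTO students VALUES ('002', 'Jane Smith');
INSERT INTO students VALUES ('003', 'Alice Lee');

INSERT INTO student_courses VALUES ('001', 'Math');
INSERT INTO student_courses VALUES ('001', 'Physics');
INSERT INTO student_courses VALUES ('001', 'Chemistry');
INSERT INTO student_courses VALUES ('002', 'Biology');
INSERT INTO student_courses VALUES ('002', 'History');
INSERT INTO student_courses VALUES ('003', 'Math');
INSERT INTO student_courses VALUES ('003', 'English');

-- With one value per cell, "who takes Math?" is a plain equality filter
-- Returns Alice Lee and John Doe
SELECT s.name
FROM students s
JOIN student_courses sc ON sc.student_id = s.student_id
WHERE sc.course = 'Math'
ORDER BY s.name;
```

With the comma-separated `Courses` column, the same question would need string matching on every row, which is slower and more error-prone.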


<h3 style="color: rgb(241, 90, 36)"> Second Normal Form (2NF)</h3>

Building upon 1NF, 2NF further refines the database structure by addressing partial dependencies. A partial dependency occurs when a non-key attribute (an attribute that is not part of the primary key) is functionally dependent on only part of a *composite primary key*.

> A composite key is a type of primary key that consists of two or more attributes. These attributes together uniquely identify a record in a table. Unlike a simple primary key, which might be an integer or UUID, a composite key requires the combination of multiple columns to ensure uniqueness

In simpler terms, consider a scenario where a table has a composite primary key made up of two or more attributes. If a non-key attribute depends on only one part of the composite key but not the other, it exhibits a partial dependency.

> 2NF aims to eliminate partial dependencies by ensuring that each non-key attribute is fully functionally dependent on the entire primary key.

Consider a table representing sales data:

<table border="1">
  <tr>
    <th>Order ID</th>
    <th>Product ID</th>
    <th>Product Name</th>
    <th>Price</th>
    <th>Quantity</th>
  </tr>
  <tr>
    <td>001</td>
    <td>101</td>
    <td>Laptop</td>
    <td>$800</td>
    <td>2</td>
  </tr>
  <tr>
    <td>001</td>
    <td>102</td>
    <td>Mouse</td>
    <td>$20</td>
    <td>1</td>
  </tr>
  <tr>
    <td>002</td>
    <td>101</td>
    <td>Laptop</td>
    <td>$800</td>
    <td>1</td>
  </tr>
</table>

In this table:
- `(Order ID, Product ID)` together form the composite primary key because the combination of these two attributes uniquely identifies each record. This means that while neither `Order ID` nor `Product ID` alone can uniquely identify a record, their combination does.
- `Product Name` and `Price` are functionally dependent on `Product ID` alone, which is only part of the primary key. This is a partial dependency.

To bring this table into 2NF, we need to separate out the attributes that are functionally dependent on only part of the primary key:

**Orders Table**
<table border="1">
  <tr>
    <th>Order ID</th>
    <th>Product ID</th>
    <th>Quantity</th>
  </tr>
  <tr>
    <td>001</td>
    <td>101</td>
    <td>2</td>
  </tr>
  <tr>
    <td>001</td>
    <td>102</td>
    <td>1</td>
  </tr>
  <tr>
    <td>002</td>
    <td>101</td>
    <td>1</td>
  </tr>
</table>

**Products Table**
<table border="1">
  <tr>
    <th>Product ID</th>
    <th>Product Name</th>
    <th>Price</th>
  </tr>
  <tr>
    <td>101</td>
    <td>Laptop</td>
    <td>$800</td>
  </tr>
  <tr>
    <td>102</td>
    <td>Mouse</td>
    <td>$20</td>
  </tr>
</table>

Now each table satisfies 2NF. In the `Orders` table, `Quantity` depends on the whole composite key `(Order ID, Product ID)`. In the `Products` table, the non-key attributes `Product Name` and `Price` depend on its entire primary key, `Product ID`. No partial dependencies remain.
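
The 2NF split might look like this in SQL. It is a sketch under assumptions: the lower-case table and column names, the column types, and the new price used in the `UPDATE` are all illustrative. Run it in an empty scratch database.

```sql
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    price        DECIMAL(10, 2) NOT NULL
);

CREATE TABLE orders (
    order_id   INT NOT NULL,
    product_id INT NOT NULL,
    quantity   INT NOT NULL,
    -- Quantity depends on the whole (order, product) pair
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

INSERT INTO products VALUES (101, 'Laptop', 800.00);
INSERT INTO products VALUES (102, 'Mouse', 20.00);

INSERT INTO orders VALUES (1, 101, 2);
INSERT INTO orders VALUES (1, 102, 1);
INSERT INTO orders VALUES (2, 101, 1);

-- A price change now touches exactly one row (750.00 is an illustrative value)
UPDATE products SET price = 750.00 WHERE product_id = 101;

-- The original wide layout can still be rebuilt with a join when needed
SELECT o.order_id, o.product_id, p.product_name, p.price, o.quantity
FROM orders o
JOIN products p ON p.product_id = o.product_id
ORDER BY o.order_id, o.product_id;
```

In the unnormalised table the same price change would have to be applied to every order row containing the laptop, which is where update anomalies come from.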

<h3 style="color: rgb(241, 90, 36)"> Third Normal Form (3NF)</h3>

A transitive dependency occurs when an attribute in a table is functionally dependent on another *non-prime attribute*, rather than directly on the primary key.

> A non-prime attribute is an attribute that is not part of any *candidate key* in the table. A candidate key is a minimal set of attributes that can uniquely identify a record in the table. Therefore, non-prime attributes do not contribute to the uniqueness of a record.

3NF aims to eliminate transitive dependencies, ensuring that all non-prime attributes are directly functionally dependent only on the primary key.

Suppose we have a table storing information about students and the courses they are enrolled in. Initially, the table includes attributes such as `Student ID`, `Course ID`, `Course Name`, and `Instructor`:

<table border="1">
  <tr>
    <th>Student ID</th>
    <th>Course ID</th>
    <th>Course Name</th>
    <th>Instructor</th>
  </tr>
  <tr>
    <td>001</td>
    <td>101</td>
    <td>Math</td>
    <td>Mr. Johnson</td>
  </tr>
  <tr>
    <td>001</td>
    <td>102</td>
    <td>Physics</td>
    <td>Mr. Smith</td>
  </tr>
  <tr>
    <td>002</td>
    <td>101</td>
    <td>Math</td>
    <td>Mr. Johnson</td>
  </tr>
  <tr>
    <td>002</td>
    <td>103</td>
    <td>Chemistry</td>
    <td>Mrs. Anderson</td>
  </tr>
  <tr>
    <td>003</td>
    <td>102</td>
    <td>Physics</td>
    <td>Mr. Smith</td>
  </tr>
</table>

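In this table:
- `Student ID` on its own cannot be the primary key, because a student appears once for every course they take. Each row is identified by the combination `(Student ID, Course ID)`.
- `Course Name` and `Instructor` are not part of that key, so they are non-prime attributes. Both are determined by `Course ID` alone: they describe the course rather than the enrolment. Because `Course ID` is part of the key here, this is strictly a partial dependency; it would be a transitive dependency in the 3NF sense if `Course ID` were itself a non-key column (for example, in a table keyed by a separate enrolment number). In both cases the course details are repeated on every enrolment row, and the remedy is the same.

To remove this dependency and reach 3NF, we restructure the table into three separate tables: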

**Students Table**
<table border="1">
  <tr>
    <th>Student ID</th>
    <th>Course ID</th>
  </tr>
  <tr>
    <td>001</td>
    <td>101</td>
  </tr>
  <tr>
    <td>001</td>
    <td>102</td>
  </tr>
  <tr>
    <td>002</td>
    <td>101</td>
  </tr>
  <tr>
    <td>002</td>
    <td>103</td>
  </tr>
  <tr>
    <td>003</td>
    <td>102</td>
  </tr>
</table>

**Courses Table**
<table border="1">
  <tr>
    <th>Course ID</th>
    <th>Course Name</th>
  </tr>
  <tr>
    <td>101</td>
    <td>Math</td>
  </tr>
  <tr>
    <td>102</td>
    <td>Physics</td>
  </tr>
  <tr>
    <td>103</td>
    <td>Chemistry</td>
  </tr>
</table>

**Instructors Table**
<table border="1">
  <tr>
    <th>Course ID</th>
    <th>Instructor</th>
  </tr>
  <tr>
    <td>101</td>
    <td>Mr. Johnson</td>
  </tr>
  <tr>
    <td>102</td>
    <td>Mr. Smith</td>
  </tr>
  <tr>
    <td>103</td>
    <td>Mrs. Anderson</td>
  </tr>
</table>

By breaking the original table into three separate tables:
- We remove the dependency of `Course Name` and `Instructor` on `Course ID` from the enrolment rows
- Each table now has attributes that are directly dependent on the primary key:
  - `Students` table contains only student and course information
  - `Courses` table contains unique course details
  - `Instructors` table contains unique instructor details for each course
- We eliminate redundancy and potential update and deletion anomalies
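
One possible way to declare the three tables is sketched below. The names `student_courses`, `courses` and `instructors`, along with the column types, are assumptions made for illustration (`student_courses` plays the role of the `Students` table above). Run it in an empty scratch database.

```sql
CREATE TABLE courses (
    course_id   INT PRIMARY KEY,
    course_name VARCHAR(100) NOT NULL
);

CREATE TABLE instructors (
    course_id  INT PRIMARY KEY,           -- one instructor per course in this example
    instructor VARCHAR(100) NOT NULL,
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);

CREATE TABLE student_courses (
    student_id CHAR(3) NOT NULL,
    course_id  INT NOT NULL,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);

INSERT INTO courses VALUES (101, 'Math');
INSERT INTO courses VALUES (102, 'Physics');
INSERT INTO courses VALUES (103, 'Chemistry');

INSERT INTO instructors VALUES (101, 'Mr. Johnson');
INSERT INTO instructors VALUES (102, 'Mr. Smith');
INSERT INTO instructors VALUES (103, 'Mrs. Anderson');

INSERT INTO student_courses VALUES ('001', 101);
INSERT INTO student_courses VALUES ('001', 102);
INSERT INTO student_courses VALUES ('002', 101);
INSERT INTO student_courses VALUES ('002', 103);
INSERT INTO student_courses VALUES ('003', 102);

-- Rebuilds the original four-column view from the three tables
SELECT sc.student_id, sc.course_id, c.course_name, i.instructor
FROM student_courses sc
JOIN courses c     ON c.course_id = sc.course_id
JOIN instructors i ON i.course_id = sc.course_id
ORDER BY sc.student_id, sc.course_id;
```

Each course name and instructor is now stored once, so renaming a course or reassigning an instructor is a single-row change.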

<h2 style="color: rgb(241, 90, 36)"> SQL Query Optimisation</h2>

SQL query optimisation is a crucial aspect of database optimisation and is aimed at improving the efficiency and performance of query execution. Proper optimisation ensures that queries run faster and use fewer resources, which is particularly important for large databases and high-traffic applications.

Strategies for optimisation include:

<h3 style="color: rgb(241, 90, 36)"> Use of Indexes</h3>

An index in SQL is a separate data object outside the table itself. It provides a quick lookup mechanism to access rows in the table without scanning the entire dataset. This concept is similar to the index at the back of a book, which lets you find a specific topic without reading every page.

> It's important to understand that an index is not a column in the table, but rather an additional structure that the database maintains to optimise search operations. In contrast to the indexes you might encounter in a data frame, SQL indexes are distinct and do not alter the table's schema.

By creating an index on one or more columns, you can significantly speed up queries that involve searching, sorting or filtering on those columns.

Consider a table `employees` with columns such as `id`, `name`, `department`, and `salary`. If we frequently query this table by the `name` column, creating an index on the `name` column can improve performance:

```sql
CREATE INDEX idx_employee_name ON employees(name);
```
<h3 style="color: rgb(241, 90, 36)"> Avoid SELECT * statements</h3>

Using `SELECT *` retrieves all columns from a table, which can be inefficient if only a few columns are needed. Instead, specify only the required columns.
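
A minimal comparison, assuming an `employees` table with the columns mentioned earlier; the sample rows are made up for illustration. Run it in an empty scratch database.

```sql
CREATE TABLE employees (
    id         INT PRIMARY KEY,
    name       VARCHAR(100),
    department VARCHAR(50),
    salary     DECIMAL(10, 2)
);

-- Illustrative rows only
INSERT INTO employees VALUES (1, 'John Doe', 'Engineering', 55000.00);
INSERT INTO employees VALUES (2, 'Jane Smith', 'Finance', 60000.00);

-- Retrieves every column, including ones the caller never reads
SELECT * FROM employees WHERE department = 'Engineering';

-- Retrieves only the columns that are actually needed
SELECT name, salary FROM employees WHERE department = 'Engineering';
```

Naming the columns also protects the calling code from breaking or silently changing when columns are later added to the table.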

<h3 style="color: rgb(241, 90, 36)"> Effective Use of WHERE Clause</h3>

The `WHERE` clause is used to filter records. Using indexed columns in the `WHERE` clause can greatly enhance query performance.

Assume we have a table `employees` with the following structure and an index on the `name` column:

```sql
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    department_id INT
);

CREATE INDEX idx_employee_name ON employees(name);
```

If we now filter the records with an indexed column, like so:

```sql
SELECT * FROM employees WHERE name = 'John Doe';
```

The database can use the `idx_employee_name` index to quickly locate all the records where `name` is `John Doe`, avoiding a full table scan and thus speeding up query execution. If, on the other hand, we filter using a non-indexed column like `age`:

```sql
SELECT * FROM employees WHERE age > 30;
```

The database will perform a full table scan to find all the records where `age` is greater than `30`. This results in a slower query execution compared to using an indexed column.
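
If filtering by `age` turns out to be common, one option is to index that column as well and then ask the database how it plans to run the query. The sketch below assumes an empty scratch database; the index name `idx_employee_age` is hypothetical. `EXPLAIN` is accepted by PostgreSQL, MySQL and SQLite, but its output format differs between them, and other systems use different syntax.

```sql
CREATE TABLE employees (
    id            INT PRIMARY KEY,
    name          VARCHAR(100),
    age           INT,
    department_id INT
);

CREATE INDEX idx_employee_name ON employees(name);

-- Hypothetical extra index so that filters on age no longer require a full scan
CREATE INDEX idx_employee_age ON employees(age);

-- Shows the query plan instead of running the query;
-- look for a reference to idx_employee_age in the output
EXPLAIN SELECT id, name FROM employees WHERE age > 30;
```

Whether the index is actually chosen depends on the planner: for a range filter that matches most of the table, a full scan can still be cheaper, and every extra index adds some cost to inserts and updates.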

<h3 style="color: rgb(241, 90, 36)"> Optimising Joins</h3>

Properly written joins can reduce the time taken for data retrieval. Prefer `INNER JOIN` over `OUTER JOIN` when the query does not need unmatched rows, and ensure that the columns used in the join conditions are indexed.

When you perform a join operation between tables, the database engine needs to match rows from each table based on the join condition. If the columns involved in the join condition are not indexed, the database might need to perform a full table scan on one or both tables, which can be very slow, especially for larger tables. Indexing these columns can significantly speed up the join operation.
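
The sketch below illustrates both points. The `departments` table, the index name `idx_employee_department` and the sample rows are assumptions introduced for the example. Run it in an empty scratch database.

```sql
CREATE TABLE departments (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE employees (
    id            INT PRIMARY KEY,
    name          VARCHAR(100),
    age           INT,
    department_id INT,
    FOREIGN KEY (department_id) REFERENCES departments(id)
);

-- departments.id is already indexed through its primary key;
-- this indexes the other side of the join condition
CREATE INDEX idx_employee_department ON employees(department_id);

-- Illustrative rows only
INSERT INTO departments VALUES (1, 'Engineering');
INSERT INTO employees VALUES (1, 'John Doe', 35, 1);
INSERT INTO employees VALUES (2, 'Jane Smith', 28, NULL);

-- INNER JOIN: returns only employees that have a matching department
SELECT e.name, d.name AS department
FROM employees e
INNER JOIN departments d ON d.id = e.department_id;

-- LEFT OUTER JOIN: also returns employees without a department (department is NULL)
SELECT e.name, d.name AS department
FROM employees e
LEFT OUTER JOIN departments d ON d.id = e.department_id;
```

The two joins answer different questions, so the inner join is only a valid substitute when unmatched rows are not wanted in the result.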

<h3 style="color: rgb(241, 90, 36)"> Avoid Subqueries</h3>

Subqueries, or nested queries, can often lead to inefficient execution plans, especially if they need to run multiple times for each row in the outer query. This can result in significant performance overhead. Instead, rewriting subqueries as joins can usually improve performance by allowing the database to use more efficient join algorithms and indexes.

Consider a scenario where you have two tables: `orders` and `customers`. You want to retrieve all orders along with the total number of orders placed by each customer:

```sql
SELECT 
    order_id, 
    customer_id,
    (SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = orders.customer_id) AS total_orders
FROM orders;
```

In this example, the subquery `SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = orders.customer_id` is correlated to the outer query, because it references a column from the outer query (`orders.customer_id`). In the case of correlated subqueries, the subquery is executed once for each row returned by the outer query. This is because the subquery depends on the results of the outer query and needs to be evaluated repeatedly for each row.

Therefore, in our example above, the subquery needs to be executed multiple times, once for each row in the `orders` table.

If we now used `JOIN` instead:

```sql
SELECT
    orders.order_id,
    orders.customer_id,
    customer_orders.total_orders
FROM orders
JOIN (
    -- Count each customer's orders once, rather than once per order row
    SELECT customer_id, COUNT(order_id) AS total_orders
    FROM orders
    GROUP BY customer_id
) AS customer_orders ON orders.customer_id = customer_orders.customer_id;

In this rewritten query, we first create a derived table, `customer_orders`, using a subquery that calculates the total orders placed by each customer. We then join this derived table with the `orders` table on the `customer_id` column. Because the totals are computed once per customer rather than once per order row, this form can perform better, particularly on engines that do not optimise correlated subqueries well.
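
As an aside, the same result can also be written with a window function, a different technique from the join shown above. This sketch assumes an engine that supports window functions (for example PostgreSQL, MySQL 8 or later, or SQLite 3.25 or later); the sample rows are made up. Run it in an empty scratch database.

```sql
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL
);

-- Illustrative rows only
INSERT INTO orders VALUES (1, 10);
INSERT INTO orders VALUES (2, 10);
INSERT INTO orders VALUES (3, 20);

-- COUNT(*) OVER (...) attaches each customer's order count to every order row
-- without a correlated subquery or an explicit join
-- Returns (1, 10, 2), (2, 10, 2), (3, 20, 1)
SELECT
    order_id,
    customer_id,
    COUNT(*) OVER (PARTITION BY customer_id) AS total_orders
FROM orders
ORDER BY order_id;
```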

## Key Takeaways

- Database optimisation improves query performance, decreases resource usage, and supports scalability, leading to cost savings
- First Normal Form (1NF) ensures atomicity by requiring that each cell in the database table holds a single value, preventing the presence of multi-valued attributes
- Second Normal Form (2NF) eliminates partial dependencies by ensuring that each non-key attribute is fully functionally dependent on the entire primary key
- Third Normal Form (3NF) eliminates transitive dependencies, ensuring that all non-prime attributes are directly dependent only on the primary key
- Indexes are used to improve query performance by allowing the database to locate data quickly without scanning the entire table, resulting in faster execution times
- Avoiding the use of `SELECT *` statements reduces unnecessary data retrieval overhead by specifying only the required columns
- Effective use of the `WHERE` clause in queries improves performance by filtering records based on indexed columns
- Optimising joins involves preferring `INNER JOIN` over `OUTER JOIN` and ensuring join conditions are indexed
- Avoiding subqueries can help prevent inefficient execution plans, especially when they need to run multiple times for each row in the outer query. Instead, rewriting subqueries as joins typically improves performance by allowing the database to utilise more efficient join algorithms and indexes.