<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Data anomalies – 2NF [Exercise]
© ExploreAI Academy

Database normalisation is a design technique for decoupling table structures to **reduce data redundancies and anomalies**. 

In this train, we will go through a practical example of normalising a database up to the **Second Normal Form**. At the end of the train, we will reflect on the data anomalies that can occur in practice and how **2NF** attempts to remedy their  state. 

## Learning objectives
In this train, we will:
* Learn how to normalise a database up to the Second Normal Form.
* Learn how to decompose a 1NF database into multiple tables to eliminate partial dependencies.
* Understand data anomalies and how database normalisation reduces the likelihood of their occurrence. 

## Imports and DB connections

> ⚠️ ⚠️ This exercise extends the concepts introduced in the previous one, `Data anomalies – 1NF`. Therefore, it's important to ensure that you continue using the modified SoftDevEmployees.db database after successfully completing the previous exercise.

> ⚠️ Since the queries here will modify the database, you will have to get a fresh copy of the modified database to redo the code cells.

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.
%load_ext sql

In [2]:
## Load the SoftDevEmployees database stored in your local machine. 
# Make sure the file is saved in the same folder as this notebook.
%sql sqlite:///SoftDevEmployees.db

'Connected: @SoftDevEmployees.db'

## Data anomalies

Data anomalies are issues that present themselves in poorly structured or denormalised databases. The following are examples of commonly occurring anomalies which you may find: 

 - **Deletion anomaly**: The deletion of a record that leads to the unintentional removal of another required attribute from the database. 
 - **Insertion anomaly**: The inability to insert a record as it requires additional data which may presently not be available.
 - **Update anomaly**: This occurs when we have duplicated data; if we were to update the affected rows and a single row gets missed, this will lead to a data inconsistency.

## First Normal Form database

Below is the ERD for the **`SoftDevEmployees.db`** database which contains a single table called **`Employees_1NF`**. Currently, our database is in the **First Normal Form (1NF)**. Our goal within this train is to transform this database to conform to the **Second Normal Form**. 


<img src ="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/1NF.png" alt="first Normal Form" >

## Second Normal Form – 2NF

To convert to the Second Normal Form, we need to make sure that we meet the following conditions: 

1. The table needs to already be in the First Normal Form.
2. The table should not contain any **partial dependencies.**

**No partial dependency** simply means that **every non-key attribute should be fully dependent on the primary key**. This translates to each table serving a single purpose.

## Converting to 2NF

The strategy to "employ" here is to create new tables that each serve a single purpose. Have a look at the desired ERD sketch given below:

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/SQL4DS/Practical_Normalization/2NF.png" alt="Second Normal Form">

Note that most of the heavy lifting was performed when setting up the First Normal Form. So we will use this table to create the required fields for the Second Normal Form.
We want to achieve the following:

1. Create the required tables. 
2. For each table, pay attention to relationships that exist between the tables and create the appropriate foreign keys to maintain the referential integrity of the tables.

### Exercise 1 – Create the required tables

Let's start by creating the required 2NF tables based on the above structure. They include: 
- **`Titles_2NF`**
- **`Roles_2NF`**
- **`Departments_2NF`**
- **`Employees_2NF`**
- **`Employee_Department_2NF`**
- **`Employee_Role_2NF`**

Remember to `AUTOINCREMENT` the **ID** `PRIMARY KEY` columns for each table.

In [8]:
%%sql

DROP TABLE IF EXISTS Titles_2NF;

CREATE TABLE Titles_2NF
(
    TitleID INTEGER PRIMARY KEY AUTOINCREMENT ,
    Title VARCHAR
);

DROP TABLE IF EXISTS Roles_2NF;

CREATE TABLE Roles_2NF
(
    RoleID INTEGER PRIMARY KEY AUTOINCREMENT ,
    Role VARCHAR
);

DROP TABLE IF EXISTS Departments_2NF;

CREATE TABLE Departments_2NF
(
    DepartmentID INTEGER PRIMARY KEY AUTOINCREMENT ,
    Department VARCHAR
);

DROP TABLE IF EXISTS Employees_2NF;

CREATE TABLE Employees_2NF
(
    EmployeeID INTEGER PRIMARY KEY AUTOINCREMENT ,
    Name VARCHAR ,
    Surname VARCHAR ,
    Salary REAL ,
    OccupationBand VARCHAR ,
    TitleID INTEGER NOT NULL ,
    FOREIGN KEY (TitleID) REFERENCES Titles_2NF(TitleID)
);

DROP TABLE IF EXISTS Employee_Department_2NF;

CREATE TABLE Employee_Department_2NF
(
    EmployeeID INTEGER NOT NULL,
    DepartmentID INTEGER NOT NULL,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF (EmployeeID),
    FOREIGN KEY (DepartmentID) REFERENCES Departments_2NF (DepartmentID),
    PRIMARY KEY (EmployeeID, DepartmentID)
);

DROP TABLE IF EXISTS Employee_Role_2NF;

CREATE TABLE Employee_Role_2NF
(
    EmployeeID INTEGER NOT NULL ,
    RoleID INTEGER NOT NULL ,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF(EmployeeID),
    FOREIGN KEY (RoleID) REFERENCES Roles_2NF(RoleID),
    PRIMARY KEY (EmployeeID , RoleID)
);

 * sqlite:///SoftDevEmployees.db
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.


[]

### Exercise 2

Let us proceed to populate the 2NF tables we have created in Exercise 1 using the relevant data from the **`Employees_1NF`** table. 

### 2.1 – Populate the `Titles_2NF`, `Roles_2NF`, and `Departments_2NF` tables

We start with the **`Titles_2NF`, `Roles_2NF`**, and **`Departments_2NF`** tables as the queries for these insertions are fairly trivial. 

Write a query to populate these tables by selecting **distinct** values in the relevant columns from the **`Employees_1NF`** table.

Remember to only select rows where the **row value is not blank**.

In [10]:
%%sql

INSERT INTO Titles_2NF(Title)
SELECT
    Title
FROM
    Employees_1NF;
    
INSERT INTO Roles_2NF(Role)
SELECT
    Role
FROM
    Employees_1NF;
    
INSERT INTO Departments_2NF(Department)
SELECT
    Department
FROM
    Employees_1NF;

 * sqlite:///SoftDevEmployees.db
61 rows affected.
61 rows affected.
61 rows affected.


[]

###  2.2 – Populate the `Employees_2NF` table

Now we move on to the **`Employees_2NF`** table. Things become a bit more complex here since we start to take the foreign keys into account. This is because, in the **`Employees_2NF`** table, we have the **`TitleID`** column which references the **`TitleID`** column in the **`Titles_2NF`** table. 

Write a query to populate the **`Employees_2NF`** table by selecting the relevant data from the **`Employees_1NF`** table.

**`Hint:`** We need to join with the appropriate table in order to populate the **`TitleID`** such that we maintain referential integrity.

In [19]:
%%sql

DELETE FROM Employees_2NF;

INSERT INTO Employees_2NF(Name , Surname , Salary , OccupationBand , TitleID)
SELECT DISTINCT
    Name ,
    Surname ,
    Salary ,
    OccupationBand ,
    T.TitleID
FROM
    Employees_1NF E
INNER JOIN
    Titles_2NF T
ON
    T.Title = E.Title

 * sqlite:///SoftDevEmployees.db
860 rows affected.
688 rows affected.


[]

###  2.3 – Populate the `Employee_Department_2NF` and `Employee_Role_2NF` tables.

Finally, we insert data into our **mapping tables**: **`Employee_Department_2NF`** and **`Employee_Role_2NF`**. These are tables that will be used to establish links between the different tables we have already created.

Write a query to populate these tables using foreign keys that reference the primary keys in the 2NF tables being connected.

**`Hint:`** Again, we need to join with the appropriate tables based on the foreign key references that exist in the database structure.

In [17]:
%%sql

SELECT
    *
FROM
    Employees_1NF

 * sqlite:///SoftDevEmployees.db
Done.


Name,Surname,Role,Department,Title,OccupationBand,Salary
André,gerber,Front-End Developer,Web Applications,Mrs,Junior,52357.0
Antoinette,Van Der Berg,UI/UX Developer,Mobile Applications,Dr,Junior,118731.0
Bronwyn,Swartz,UI/UX Developer,Mobile Applications,Miss,Graduate,34350.0
Christopher,Walker,Back-End Developer,Mobile Applications,Mr,Junior,122894.0
Christopher,Walker,Back-End Developer,Web Applications,Mr,Junior,122894.0
Claire,Morris,Full-Stack Developer,Mobile Applications,Ms,Intern,36000.0
Contact,Xaba,UI/UX Developer,Mobile Applications,Dr,Mid-Level,85836.0
Danie,Campbell,Business Analyst,Mobile Applications,Mrs,Mid-Level,205621.0
Danie,Campbell,Business Analyst,Web Applications,Mrs,Mid-Level,205621.0
Danie,davies,Database Analyst,Web Applications,Dr,Senior,313491.0


In [24]:
%%sql

SELECT
    *
FROM
    Employees_2NF

 * sqlite:///SoftDevEmployees.db
Done.


EmployeeID,Name,Surname,Salary,OccupationBand,TitleID
861,André,gerber,52357.0,Junior,1
862,André,gerber,52357.0,Junior,8
863,André,gerber,52357.0,Junior,9
864,André,gerber,52357.0,Junior,23
865,André,gerber,52357.0,Junior,33
866,André,gerber,52357.0,Junior,37
867,André,gerber,52357.0,Junior,38
868,André,gerber,52357.0,Junior,42
869,André,gerber,52357.0,Junior,53
870,Antoinette,Van Der Berg,118731.0,Junior,2


In [29]:
%%sql

DELETE FROM Employee_Department_2NF;
DELETE FROM Employee_Role_2NF;

INSERT INTO Employee_Department_2NF(EmployeeID , DepartmentID)
SELECT DISTINCT
    EE.EmployeeID,
    D.DepartmentID
FROM
    Employees_1NF E
INNER JOIN
    Departments_2NF D
ON
    E.Department = D.Department
INNER JOIN
    Employees_2NF EE
ON
    E.Name = EE.Name
    AND
    E.Surname = EE.Surname;
    
INSERT INTO Employee_Role_2NF(EmployeeID , RoleID)
SELECT DISTINCT
    EE.EmployeeID,
    R.RoleID
FROM
    Employees_1NF E
INNER JOIN
    Employees_2NF EE
ON
    E.Name = EE.Name
    AND
    E.Surname = EE.Surname
INNER JOIN
    Roles_2NF R
ON
    R.Role = E.Role

 * sqlite:///SoftDevEmployees.db
25291 rows affected.
0 rows affected.
25291 rows affected.
5817 rows affected.


[]

## Solutions

### Exercise 1 – Create the required tables

In [None]:
%%sql

DROP TABLE IF EXISTS Employees_2NF;
DROP TABLE IF EXISTS Titles_2NF;
DROP TABLE IF EXISTS Roles_2NF;
DROP TABLE IF EXISTS Departments_2NF;
DROP TABLE IF EXISTS Employee_Department_2NF;
DROP TABLE IF EXISTS Employee_Role_2NF;

CREATE TABLE Titles_2NF (
    TitleID INTEGER NOT NULL,
    Title VARCHAR,
    PRIMARY KEY(TitleID AUTOINCREMENT)
);

CREATE TABLE Roles_2NF (
    RoleID INTEGER NOT NULL,
    Role VARCHAR,
    PRIMARY KEY(RoleID AUTOINCREMENT)
);

CREATE TABLE  Departments_2NF (
    DepartmentID INTEGER NOT NULL,
    Department VARCHAR,
    PRIMARY KEY(DepartmentID AUTOINCREMENT)
);

CREATE TABLE Employees_2NF (
    EmployeeID INTEGER NOT NULL,
    Name VARCHAR, 
    Surname VARCHAR,
    Salary REAL,
    OccupationBand VARCHAR,
    TitleID INTEGER,
    FOREIGN KEY(TitleID) REFERENCES Titles_2NF (TitleID), 
    PRIMARY KEY(EmployeeID AUTOINCREMENT)
);


CREATE TABLE Employee_Role_2NF(
    EmployeeID INTEGER NOT NULL,
    RoleID INTEGER NOT NULL,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF (EmployeeID),
    FOREIGN KEY (RoleID) REFERENCES Roles_2NF (RoleID),
    PRIMARY KEY(EmployeeID, RoleID)
);

CREATE TABLE Employee_Department_2NF(
    EmployeeID INTEGER NOT NULL,
    DepartmentID INTEGER NOT NULL,
    FOREIGN KEY (EmployeeID) REFERENCES Employees_2NF (EmployeeID),
    FOREIGN KEY (DepartmentID) REFERENCES Departments_2NF (DepartmentID),
    PRIMARY KEY(EmployeeID, DepartmentID)
);

### Exercise 2

#### 2.1 – Populate the `Titles_2NF, Roles_2NF`, and `Departments_2NF` tables

In [None]:
%%sql
DELETE FROM Titles_2NF;
DELETE FROM Roles_2NF;
DELETE FROM Departments_2NF;

INSERT INTO Titles_2NF (Title)
SELECT 
    DISTINCT Title 
FROM Employees_1NF 
WHERE Title <> '';

INSERT INTO Roles_2NF (Role) 
SELECT 
    DISTINCT Role
FROM Employees_1NF
WHERE Role <>'';

INSERT INTO Departments_2NF (Department)
SELECT
    DISTINCT Department
FROM Employees_1NF
WHERE Department <>'';

####  2.2 – Populate the `Employees_2NF` table

In [None]:
%%sql
DELETE FROM Employees_2NF;

INSERT INTO Employees_2NF (Name, Surname, Salary, OccupationBand, TitleID)
SELECT DISTINCT
    EMP.Name,
    EMP.Surname,
    EMP.Salary,
    EMP.OccupationBand,
    T.TitleID
FROM 
    Employees_1NF AS EMP
JOIN 
    Titles_2NF AS T 
    ON T.Title = EMP.Title;

####  2.3 – Populate the `Employee_Department_2NF` and `Employee_Role_2NF` tables

In [None]:
%%sql
DELETE FROM Employee_Department_2NF;
DELETE FROM Employee_Role_2NF;

INSERT INTO Employee_Department_2NF (EmployeeID,DepartmentID)
SELECT DISTINCT
    EMP2.EmployeeID,
    DPT.DepartmentID
FROM 
    Employees_1NF AS EMP1
JOIN 
    Employees_2NF AS EMP2 
    ON EMP1.Name = EMP2.Name AND EMP1.Surname = EMP2.Surname
JOIN 
    Departments_2NF AS DPT 
    ON EMP1.Department = DPT.Department;
    

INSERT INTO Employee_Role_2NF (EmployeeID,RoleID)
SELECT DISTINCT
    EMP2.EmployeeID,
    R.RoleID
FROM 
    Employees_1NF AS EMP1
JOIN 
    Employees_2NF AS EMP2 
    ON EMP1.Name = EMP2.Name AND EMP1.Surname = EMP2.Surname
JOIN 
    Roles_2NF AS R 
    ON EMP1.Role = R.Role

## Conclusion

By normalising our database up to the **Second Normal Form (2NF)**, we have ensured that all the data in the database are entirely dependent to the primary key which has, in turn, helped us address various anomalies:

 - **Deletion anomaly**: We have eliminated the deletion anomalies that could occur on the **`Roles`**, **`Departments`**, and **`Titles`** columns by creating separate tables for them. For example, if Jessica Mchunu gets deleted from the **Employees_2NF** table, the **Scrum Master** role will continue to persist in the **Roles_2NF** table.

 - **Update anomaly**: Christoper only appears once in the **Employees_2NF**, so should he get a raise we only need to change his salary information in one place. This reduces the chances of having any data inconsistencies.

 - **Insertion anomaly**: Now we can insert new graduates into the database without having to define a role or place them in a specific department.

Also, as we organise our data into separate tables, we have seen the need for foreign keys to establish relationships between tables and enforcing referential integrity.

Referential integrity plays a key role in understanding the relationships between tables – which are usually underpinned by business rules. Take time to understand these business rules when creating your database. This effort will serve you well for organising your data as a future data professional.

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>