### Database Normalization

In this post I normalize music track data that starts out in spreadsheet form.
I will do a few things differently this time: instead of keeping a separate "raw" table, I will use ALTER TABLE statements to drop columns once I no longer need them.

In [2]:
import pandas as pd


In [16]:

names = ['Track', 'Artist', 'Album', 'Count', 'Rating', 'Len']
tracks = pd.read_csv('library.csv', header=None, names=names)

tracks.head(10)


|   | Track | Artist | Album | Count | Rating | Len |
|---|-------|--------|-------|-------|--------|-----|
| 0 | Another One Bites The Dust | Queen | Greatest Hits | 55.0 | 100.0 | 217 |
| 1 | Asche Zu Asche | Rammstein | Herzeleid | 79.0 | 100.0 | 231 |
| 2 | Beauty School Dropout | Various | Grease | 48.0 | 100.0 | 239 |
| 3 | Black Dog | Led Zeppelin | IV | 109.0 | 100.0 | 296 |
| 4 | Bring The Boys Back Home | Pink Floyd | The Wall [Disc 2] | 33.0 | 100.0 | 87 |
| 5 | Circles | Bryan Lee | Blues Is | 54.0 | 60.0 | 355 |
| 6 | Comfortably Numb | Pink Floyd | The Wall [Disc 2] | 36.0 | 100.0 | 384 |
| 7 | Crazy Little Thing Called Love | Queen | Greatest Hits | 38.0 | 100.0 | 163 |
| 8 | Electric Funeral | Black Sabbath | Paranoid | 44.0 | 100.0 | 293 |
| 9 | Fat Bottomed Girls | Queen | Greatest Hits | 38.0 | 100.0 | 257 |


In [14]:
print(tracks.shape)

(296, 6)


As it stands, this data repeats the same values many times down the Artist and Album columns. That duplication wastes storage and makes the database harder to maintain and to update consistently.
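To put a rough number on that duplication, here is a minimal sketch in plain Python that counts distinct artists and albums in the ten sample rows printed above. The `sample` list is copied by hand from that output and is only an illustration; the full file has 296 rows, so the real ratio will differ.

```python
from collections import Counter

# (track, artist, album) triples copied from the ten rows shown above
sample = [
    ("Another One Bites The Dust", "Queen", "Greatest Hits"),
    ("Asche Zu Asche", "Rammstein", "Herzeleid"),
    ("Beauty School Dropout", "Various", "Grease"),
    ("Black Dog", "Led Zeppelin", "IV"),
    ("Bring The Boys Back Home", "Pink Floyd", "The Wall [Disc 2]"),
    ("Circles", "Bryan Lee", "Blues Is"),
    ("Comfortably Numb", "Pink Floyd", "The Wall [Disc 2]"),
    ("Crazy Little Thing Called Love", "Queen", "Greatest Hits"),
    ("Electric Funeral", "Black Sabbath", "Paranoid"),
    ("Fat Bottomed Girls", "Queen", "Greatest Hits"),
]

# Count how often each artist and album string is stored
artist_counts = Counter(artist for _, artist, _ in sample)
album_counts = Counter(album for _, _, album in sample)

# Values stored more than once are what a lookup table would deduplicate
repeated_artists = {name: n for name, n in artist_counts.items() if n > 1}

print(len(sample), "rows")
print(len(artist_counts), "distinct artists")
print(len(album_counts), "distinct albums")
print("repeated artists:", repeated_artists)
```

Every repeated string here is a value that would be stored once in its own table and referenced by id after normalization.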

To address these issues, we need to import this dataset into a relational database and use relations between tables to eliminate the duplication and keep the data consistent.

Let's start by creating a PostgreSQL database.

In [3]:
CREATE DATABASE musical;

The database was created successfully; now I need to create a table and copy the contents of the library.csv file into it.

In [2]:
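-- Note: album_id references album(id), so the album table (defined further below)
-- must already exist when this statement runs.
-- IF EXISTS keeps the DROP from failing on a fresh database.
DROP TABLE IF EXISTS track CASCADE;
CREATE TABLE track (
    id SERIAL,
    title TEXT,
    artist TEXT,
    album TEXT,
    album_id INTEGER REFERENCES album(id) ON DELETE CASCADE,
    count INTEGER,
    rating INTEGER,
    len INTEGER,
    PRIMARY KEY(id)
);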

Good!
I now have a table that holds the data in its original form.
Let's copy the spreadsheet into the table.

Writing `CSV HEADER` instead of `CSV` would make Postgres skip the first row as a header. The command below uses plain `CSV`, so every line of the file is loaded as data.

In [None]:
\copy track(title,artist,album,count,rating,len) FROM '...\Database Normalization\library.csv' WITH DELIMITER ',' CSV;


In [None]:
SELECT COUNT(*) FROM track;

 count
-------
   296
(1 row)

OK!
The data is now in the track table, ready to be normalized. Let's review the normal form (NF) rules.

#### 1NF Rules

 - Each table cell should contain a single value
 - No duplicated rows or columns
 - Each column must have only one value for each row in the table
 - There must be a primary key for identification
 
The last item is covered by the `id` primary key on the track table, and the data complies with the rest of this first set of rules.
So, I'll go ahead with the next rules.
 
#### 2NF Rules

 - Create separate tables for sets of values that apply to multiple records
 - Relate these tables with a foreign key
 
According to these rules, I have to create separate tables for the album and artist columns, since their values apply to multiple records.
 
 Back to SQL!
  

In [None]:
DROP TABLE IF EXISTS album CASCADE;
CREATE TABLE album (
    id SERIAL,
    title VARCHAR(128) UNIQUE,
    PRIMARY KEY(id)
);

In [None]:
DROP TABLE IF EXISTS artist CASCADE;
CREATE TABLE artist (
    id SERIAL,
    name VARCHAR(128) UNIQUE,
    PRIMARY KEY(id)
);

In [None]:
DROP TABLE IF EXISTS tracktoartist CASCADE;
CREATE TABLE tracktoartist (
    id SERIAL,
    track VARCHAR(128),
    track_id INTEGER REFERENCES track(id) ON DELETE CASCADE,
    artist VARCHAR(128),
    artist_id INTEGER REFERENCES artist(id) ON DELETE CASCADE,
    PRIMARY KEY(id)
);

In [None]:
unesco=# \dt

 Schema |     Name      | Type  |      Owner
--------+---------------+-------+-----------------
 public | album         | table | pg4e_14526e0dc5
 public | artist        | table | pg4e_14526e0dc5
 public | track         | table | pg4e_14526e0dc5
 public | tracktoartist | table | pg4e_14526e0dc5
(4 rows)


Fantastic!
All the necessary tables are now in my database, ready to be populated with data from the track table.


In [None]:
INSERT INTO album (title) SELECT DISTINCT album FROM track;
INSERT INTO tracktoartist (track, artist) SELECT DISTINCT title, artist FROM track;
INSERT INTO artist (name) SELECT DISTINCT artist FROM track;

In [None]:

UPDATE track SET album_id = (SELECT album.id FROM album WHERE album.title = track.album);
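-- Match on artist as well as title, so two different tracks that share a title
-- do not make the subquery return more than one row.
UPDATE tracktoartist SET track_id = (SELECT track.id FROM track WHERE tracktoartist.track = track.title AND tracktoartist.artist = track.artist);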
UPDATE tracktoartist SET artist_id = (SELECT artist.id FROM artist WHERE tracktoartist.artist = artist.name);
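The populate-then-link pattern above can be tried end to end without a PostgreSQL server. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in, so `SERIAL` becomes `INTEGER PRIMARY KEY` and the `ON DELETE CASCADE` clauses are left out. The four sample rows are taken from the output earlier in the post, and the sketch matches tracks on both title and artist, which is an assumption of mine to keep the lookup unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# SQLite stand-ins for the PostgreSQL tables (assumption: simplified column set)
conn.executescript("""
CREATE TABLE album  (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE artist (id INTEGER PRIMARY KEY, name  TEXT UNIQUE);
CREATE TABLE track (
    id INTEGER PRIMARY KEY,
    title TEXT, artist TEXT, album TEXT,
    album_id INTEGER REFERENCES album(id)
);
CREATE TABLE tracktoartist (
    id INTEGER PRIMARY KEY,
    track TEXT,  track_id  INTEGER REFERENCES track(id),
    artist TEXT, artist_id INTEGER REFERENCES artist(id)
);
""")

rows = [
    ("Another One Bites The Dust", "Queen", "Greatest Hits"),
    ("Fat Bottomed Girls", "Queen", "Greatest Hits"),
    ("Comfortably Numb", "Pink Floyd", "The Wall [Disc 2]"),
    ("Electric Funeral", "Black Sabbath", "Paranoid"),
]
conn.executemany("INSERT INTO track (title, artist, album) VALUES (?, ?, ?)", rows)

# Step 1: fill the lookup tables with distinct values.
# Step 2: resolve the text columns to foreign keys.
conn.executescript("""
INSERT INTO album (title) SELECT DISTINCT album FROM track;
INSERT INTO artist (name) SELECT DISTINCT artist FROM track;
INSERT INTO tracktoartist (track, artist) SELECT DISTINCT title, artist FROM track;

UPDATE track SET album_id =
    (SELECT album.id FROM album WHERE album.title = track.album);
UPDATE tracktoartist SET track_id =
    (SELECT track.id FROM track
     WHERE track.title = tracktoartist.track AND track.artist = tracktoartist.artist);
UPDATE tracktoartist SET artist_id =
    (SELECT artist.id FROM artist WHERE artist.name = tracktoartist.artist);
""")

album_count = conn.execute("SELECT COUNT(*) FROM album").fetchone()[0]
artist_count = conn.execute("SELECT COUNT(*) FROM artist").fetchone()[0]
# Any NULL foreign key would mean a lookup failed to find its match
unlinked = conn.execute(
    "SELECT COUNT(*) FROM tracktoartist WHERE track_id IS NULL OR artist_id IS NULL"
).fetchone()[0]
print(album_count, artist_count, unlinked)
```

Checking for NULL foreign keys before dropping the text columns is a cheap safeguard: once the columns are gone, a failed lookup can no longer be repaired from the table itself.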


In [None]:

ALTER TABLE track DROP COLUMN album;
ALTER TABLE track DROP COLUMN artist;
ALTER TABLE tracktoartist DROP COLUMN track;
ALTER TABLE tracktoartist DROP COLUMN artist;
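Once the text columns are gone, the spreadsheet view can still be rebuilt with joins. Here is a minimal sketch of that check, again using Python's `sqlite3` as a stand-in for PostgreSQL. The tables are created directly in their normalized shape with a few hand-entered rows, so the ids and the reduced column set are illustrative assumptions rather than the contents of the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized tables as they look after the DROP COLUMN statements (simplified)
conn.executescript("""
CREATE TABLE album  (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE artist (id INTEGER PRIMARY KEY, name  TEXT UNIQUE);
CREATE TABLE track (
    id INTEGER PRIMARY KEY, title TEXT,
    album_id INTEGER REFERENCES album(id), len INTEGER
);
CREATE TABLE tracktoartist (
    id INTEGER PRIMARY KEY,
    track_id  INTEGER REFERENCES track(id),
    artist_id INTEGER REFERENCES artist(id)
);
INSERT INTO album  (id, title) VALUES (1, 'Greatest Hits'), (2, 'Paranoid');
INSERT INTO artist (id, name)  VALUES (1, 'Queen'), (2, 'Black Sabbath');
INSERT INTO track (id, title, album_id, len) VALUES
    (1, 'Another One Bites The Dust', 1, 217),
    (2, 'Electric Funeral', 2, 293),
    (3, 'Fat Bottomed Girls', 1, 257);
INSERT INTO tracktoartist (track_id, artist_id) VALUES (1, 1), (2, 2), (3, 1);
""")

# Follow the foreign keys back to the original spreadsheet columns
query = """
SELECT track.title, artist.name, album.title, track.len
FROM track
JOIN album         ON track.album_id = album.id
JOIN tracktoartist ON tracktoartist.track_id = track.id
JOIN artist        ON tracktoartist.artist_id = artist.id
ORDER BY track.title;
"""
rebuilt = conn.execute(query).fetchall()
for row in rebuilt:
    print(row)
```

If the joined result has the same number of rows as the original import, no track lost its album or artist link during normalization.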



And now 3NF.

#### 3NF Rules

 - Eliminate fields that do not depend on the key.
 
To check compliance with this rule more rigorously, I would need to know more about the real-world application, i.e. the context in which the database is used. As it stands, the database complies with this rule: every field in every table depends on that table's key.
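One way to probe this rule on actual data is to test whether one non-key column determines another, since that would signal a field depending on something other than the key. Below is a minimal sketch in plain Python; the `determines` helper and the sample rows are hypothetical illustrations, not part of the database above, and a result on sample data can only suggest a dependency, never prove one.

```python
def determines(rows, lhs, rhs):
    """Return True if each value of column `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hand-picked rows in the original spreadsheet shape
sample = [
    {"title": "Another One Bites The Dust", "artist": "Queen", "album": "Greatest Hits"},
    {"title": "Fat Bottomed Girls", "artist": "Queen", "album": "Greatest Hits"},
    {"title": "Comfortably Numb", "artist": "Pink Floyd", "album": "The Wall [Disc 2]"},
    {"title": "Bring The Boys Back Home", "artist": "Pink Floyd", "album": "The Wall [Disc 2]"},
    {"title": "Beauty School Dropout", "artist": "Various", "album": "Grease"},
]

album_to_artist = determines(sample, "album", "artist")
artist_to_title = determines(sample, "artist", "title")
print("album -> artist:", album_to_artist)
print("artist -> title:", artist_to_title)
```

In this sample each album maps to a single artist while an artist maps to several titles. Whether a dependency like album to artist really holds is a question about the application, which is why the context of the database matters for 3NF.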

#### Done with Normalizing the database.

:-)