# Data Transformation
---

In [1]:
import pandas as pd
import numpy as np

# read in our data 
artist = pd.read_csv('../../data/artist.csv')
album = pd.read_csv('../../data/album.csv')
track = pd.read_csv('../../data/track.csv')
track_feat = pd.read_csv('../../data/track_feat.csv')

# create function to check if there are duplicate row values in the dfs
def check_for_dups(df, table):
    '''Function takes in a pandas dataframe (df) and the name of the df as a string (table),
    then prints the non-numeric columns whose values are not all unique'''
    potential_dups = []
    # iterate through the columns in the df
    for col in df.columns:
        # skip int64/float64 columns, where repeated numbers are expected
        if df.dtypes[col] != np.int64 and df.dtypes[col] != np.float64:
            # fewer unique values than non-null values means some value repeats
            if df[col].count() != df[col].nunique():
                potential_dups.append(col)
    print(f'The following columns in {table} have potential duplicates: {potential_dups}')

## Checking for duplicates
---

## Artist dataframe:

In [2]:
check_for_dups(artist, 'artist')

The following columns in artist have potential duplicates: ['genre', 'type']


### Repeated values are expected in the columns above, genre and type. The artist df is free of redundant duplicates.
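
The check in `check_for_dups` compares the number of non-null values in a column with the number of distinct values. A minimal sketch on a made-up frame shows what gets flagged (the `toy` data is hypothetical and not part of the Spotify pull):

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the count-vs-unique comparison
toy = pd.DataFrame({
    'artist_id': ['a1', 'a2', 'a3'],   # every value is distinct
    'genre': ['rap', 'rap', 'pop'],    # 'rap' appears twice
})

# A column is flagged when it has fewer distinct values than non-null values
flagged = [col for col in toy.columns
           if toy[col].count() != toy[col].nunique()]
print(flagged)
```

A flagged column is only a candidate: whether the repetition is a problem depends on what the column means, which is why genre and type are acceptable here.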

## Album dataframe:

In [3]:
check_for_dups(album, 'album')

The following columns in album have potential duplicates: ['album_id', 'album_name', 'external_url', 'image_url', 'release_date', 'type', 'album_uri', 'artist_id']


### Of the columns above, only type, release_date, and artist_id are expected to contain repeated values. Let's take a closer look at the album df.

In [4]:
album[album['album_name'].duplicated(keep=False)]

Unnamed: 0,album_id,album_name,external_url,image_url,release_date,total_tracks,type,album_uri,artist_id
34,3XADnbi4uhYXb7RuSJ7bre,Slime Language 2 (Deluxe),https://open.spotify.com/album/3XADnbi4uhYXb7R...,https://i.scdn.co/image/ab67616d0000b273bbb071...,2021-04-16,31,album,spotify:album:3XADnbi4uhYXb7RuSJ7bre,2hlmm7s2ICUX0LVIhVFlZQ
35,3ihwKkIMJWmmp1huNH0iWC,Slime Language 2,https://open.spotify.com/album/3ihwKkIMJWmmp1h...,https://i.scdn.co/image/ab67616d0000b2737939ff...,2021-04-16,23,album,spotify:album:3ihwKkIMJWmmp1huNH0iWC,2hlmm7s2ICUX0LVIhVFlZQ
36,4z8IoUDFp5dmiQNDlT4hu5,Slime Language 2 (Deluxe),https://open.spotify.com/album/4z8IoUDFp5dmiQN...,https://i.scdn.co/image/ab67616d0000b273f376cc...,2021-04-15,31,album,spotify:album:4z8IoUDFp5dmiQNDlT4hu5,2hlmm7s2ICUX0LVIhVFlZQ
37,2MUc9nWkppdNnM8dfz0l1w,Slime Language 2,https://open.spotify.com/album/2MUc9nWkppdNnM8...,https://i.scdn.co/image/ab67616d0000b273f56ecc...,2021-04-15,23,album,spotify:album:2MUc9nWkppdNnM8dfz0l1w,2hlmm7s2ICUX0LVIhVFlZQ
66,3aITAVBURujVe8fhI2seeR,Pluto x Baby Pluto (Deluxe),https://open.spotify.com/album/3aITAVBURujVe8f...,https://i.scdn.co/image/ab67616d0000b2738efe5b...,2020-11-13,24,album,spotify:album:3aITAVBURujVe8fhI2seeR,4O15NlyKLIASxsJ0PrXPfz
67,48xpWR8K6CGpy3ETAym3pt,Pluto x Baby Pluto,https://open.spotify.com/album/48xpWR8K6CGpy3E...,https://i.scdn.co/image/ab67616d0000b27357928e...,2020-11-13,16,album,spotify:album:48xpWR8K6CGpy3ETAym3pt,4O15NlyKLIASxsJ0PrXPfz
68,27fzM2E0lgovCD7PCq6eh4,Pluto x Baby Pluto (Deluxe),https://open.spotify.com/album/27fzM2E0lgovCD7...,https://i.scdn.co/image/ab67616d0000b27319be1f...,2020-11-12,24,album,spotify:album:27fzM2E0lgovCD7PCq6eh4,4O15NlyKLIASxsJ0PrXPfz
69,6HcU64bPPXTHIbWmGblIkT,Pluto x Baby Pluto,https://open.spotify.com/album/6HcU64bPPXTHIbW...,https://i.scdn.co/image/ab67616d0000b2739e3502...,2020-11-12,16,album,spotify:album:6HcU64bPPXTHIbWmGblIkT,4O15NlyKLIASxsJ0PrXPfz
108,3aITAVBURujVe8fhI2seeR,Pluto x Baby Pluto (Deluxe),https://open.spotify.com/album/3aITAVBURujVe8f...,https://i.scdn.co/image/ab67616d0000b2738efe5b...,2020-11-13,24,album,spotify:album:3aITAVBURujVe8fhI2seeR,1RyvyyTE3xzB2ZywiAwp0i
109,48xpWR8K6CGpy3ETAym3pt,Pluto x Baby Pluto,https://open.spotify.com/album/48xpWR8K6CGpy3E...,https://i.scdn.co/image/ab67616d0000b27357928e...,2020-11-13,16,album,spotify:album:48xpWR8K6CGpy3ETAym3pt,1RyvyyTE3xzB2ZywiAwp0i


### The output above shows multiple albums with the same name, and in some cases the same album id. One cause is joint albums: when two artists share an album, pulling data for each artist returns the same album twice. The pull may also have picked up both the clean and the explicit version of these albums. Let's remove the duplicate albums.
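
The two situations described above can be told apart with a couple of pandas calls. This is only a sketch on invented rows (`toy_album` and its ids are hypothetical), not the check that was run on the real data:

```python
import pandas as pd

# Hypothetical rows mimicking the two kinds of duplicates
toy_album = pd.DataFrame({
    'album_id':   ['id1', 'id1', 'id2'],
    'album_name': ['Joint Album', 'Joint Album', 'Joint Album'],
    'artist_id':  ['artistA', 'artistB', 'artistA'],
})

# Same album_id under more than one artist: a joint album pulled once per artist
same_id = toy_album[toy_album['album_id'].duplicated(keep=False)]

# Same name with more than one album_id: separate releases of one album
ids_per_name = toy_album.groupby('album_name')['album_id'].nunique()
multi_release = ids_per_name[ids_per_name > 1]

print(same_id)
print(multi_release)
```

Dropping on `album_name` removes both kinds at once, at the cost of also merging any unrelated albums that happen to share a name.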

In [5]:
# drop duplicate albums with the same name, keeping the first occurrence of each
album.drop_duplicates(subset=['album_name'], inplace=True)

check_for_dups(album, 'album')

The following columns in album have potential duplicates: ['release_date', 'type', 'artist_id']


### Repeated values are expected in the columns above, release_date, type, and artist_id. The album df is free of redundant duplicates.

In [6]:
# get list of album ids from our deduplicated dataframe
final_alb_ids = album['album_id'].tolist()

## Track dataframe:

In [7]:
# keep only the tracks that belong to albums in our deduplicated album dataframe
# (.copy() avoids a SettingWithCopyWarning when we drop duplicates in place later)
track = track[track['album_id'].isin(final_alb_ids)].copy()
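
The filter above works like a semi-join: a track row survives only if its `album_id` appears in the list of retained albums. A small sketch with made-up ids (`kept_album_ids` and `toy_track` are hypothetical):

```python
import pandas as pd

kept_album_ids = ['alb1', 'alb2']   # hypothetical ids that survived deduplication

toy_track = pd.DataFrame({
    'track_id': ['t1', 't2', 't3'],
    'album_id': ['alb1', 'alb3', 'alb2'],
})

# isin builds a boolean mask; .copy() makes the result an independent frame
filtered = toy_track[toy_track['album_id'].isin(kept_album_ids)].copy()
print(filtered['track_id'].tolist())
```

For a long list of ids, converting it to a `set` first is a common choice, though `isin` accepts either.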

In [8]:
check_for_dups(track, 'track')

The following columns in track have potential duplicates: ['track_id', 'song_name', 'external_url', 'explicit', 'type', 'song_uri', 'album_id']


### Of the columns above, only song_name, explicit, type, and album_id are expected to contain repeated values. Let's take a closer look at the track df.

In [9]:
track[track['track_id'].duplicated(keep=False)]

Unnamed: 0,track_id,song_name,external_url,duration_ms,explicit,disc_number,type,song_uri,album_id
1121,0anUgBvnA4u5LsHUeLRprc,Tic Tac,https://open.spotify.com/track/0anUgBvnA4u5LsH...,189147,True,1,track,spotify:track:0anUgBvnA4u5LsHUeLRprc,3aITAVBURujVe8fhI2seeR
1122,4V9K7DLPZNhCH77wLsAiNF,My Legacy,https://open.spotify.com/track/4V9K7DLPZNhCH77...,193333,True,1,track,spotify:track:4V9K7DLPZNhCH77wLsAiNF,3aITAVBURujVe8fhI2seeR
1123,669gYFBhezfYPXQP1hBMOX,Heart In Pieces,https://open.spotify.com/track/669gYFBhezfYPXQ...,196834,True,1,track,spotify:track:669gYFBhezfYPXQP1hBMOX,3aITAVBURujVe8fhI2seeR
1124,6vbe3GmRG30x2nraEvHudq,Because of You,https://open.spotify.com/track/6vbe3GmRG30x2nr...,216541,True,1,track,spotify:track:6vbe3GmRG30x2nraEvHudq,3aITAVBURujVe8fhI2seeR
1125,2z0LyXYmPgos95olIAQIO3,Bust a Move,https://open.spotify.com/track/2z0LyXYmPgos95o...,226285,True,1,track,spotify:track:2z0LyXYmPgos95olIAQIO3,3aITAVBURujVe8fhI2seeR
...,...,...,...,...,...,...,...,...,...
1899,7yOoJXW6wkpzlJpfjDKhIV,She Never Been To Pluto,https://open.spotify.com/track/7yOoJXW6wkpzlJp...,204117,True,1,track,spotify:track:7yOoJXW6wkpzlJpfjDKhIV,48xpWR8K6CGpy3ETAym3pt
1900,5qk0TivMEVkujTu7xJLKu7,Off Dat,https://open.spotify.com/track/5qk0TivMEVkujTu...,184736,True,1,track,spotify:track:5qk0TivMEVkujTu7xJLKu7,48xpWR8K6CGpy3ETAym3pt
1901,7JwmlNz8ic9ATnLn7DyAKx,I Don’t Wanna Break Up,https://open.spotify.com/track/7JwmlNz8ic9ATnL...,243936,True,1,track,spotify:track:7JwmlNz8ic9ATnLn7DyAKx,48xpWR8K6CGpy3ETAym3pt
1902,3h1LpNOGh1NgKVQPetS2Ed,Bankroll,https://open.spotify.com/track/3h1LpNOGh1NgKVQ...,186846,True,1,track,spotify:track:3h1LpNOGh1NgKVQPetS2Ed,48xpWR8K6CGpy3ETAym3pt


### As with the albums, the track data contains duplicate songs. Some albums also have a regular and a deluxe version, where the deluxe version is usually the same songs plus an additional track or two. Let's remove the duplicate tracks.

In [10]:
# drop duplicate tracks with the same track_id
track.drop_duplicates(subset=['track_id'], inplace=True)

check_for_dups(track, 'track')

The following columns in track have potential duplicates: ['song_name', 'explicit', 'type', 'album_id']


### Repeated values are expected in the columns above, song_name, explicit, type, and album_id. The track df is free of redundant duplicates.

In [11]:
# get list of track ids from our deduplicated dataframe
final_track_ids = track['track_id'].tolist()

## Track_feature dataframe:

In [12]:
# keep only the track_feat rows whose track is in our deduplicated track dataframe
# (.copy() avoids a SettingWithCopyWarning when we drop duplicates in place later)
track_feat = track_feat[track_feat['track_id'].isin(final_track_ids)].copy()

In [13]:
check_for_dups(track_feat, 'track_feat')

The following columns in track_feat have potential duplicates: ['track_id', 'type', 'song_uri']


### Of the columns above, only type is expected to contain repeated values. Let's take a closer look at the track_feat df.

In [14]:
track_feat[track_feat['track_id'].duplicated(keep=False)]

Unnamed: 0,track_id,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,type,valence,song_uri
1121,0anUgBvnA4u5LsHUeLRprc,0.942,0.566,0.0,0.1100,-4.834,0.2260,132.858,audio_features,0.479,spotify:track:0anUgBvnA4u5LsHUeLRprc
1122,4V9K7DLPZNhCH77wLsAiNF,0.890,0.682,0.0,0.1160,-5.249,0.3230,143.979,audio_features,0.526,spotify:track:4V9K7DLPZNhCH77wLsAiNF
1123,669gYFBhezfYPXQP1hBMOX,0.872,0.699,0.0,0.1030,-6.127,0.0487,138.929,audio_features,0.725,spotify:track:669gYFBhezfYPXQP1hBMOX
1124,6vbe3GmRG30x2nraEvHudq,0.781,0.469,0.0,0.0921,-8.456,0.0549,132.989,audio_features,0.518,spotify:track:6vbe3GmRG30x2nraEvHudq
1125,2z0LyXYmPgos95olIAQIO3,0.804,0.683,0.0,0.0693,-7.344,0.0550,140.035,audio_features,0.229,spotify:track:2z0LyXYmPgos95olIAQIO3
...,...,...,...,...,...,...,...,...,...,...,...
1899,7yOoJXW6wkpzlJpfjDKhIV,0.827,0.615,0.0,0.3070,-5.879,0.2220,132.996,audio_features,0.221,spotify:track:7yOoJXW6wkpzlJpfjDKhIV
1900,5qk0TivMEVkujTu7xJLKu7,0.898,0.641,0.0,0.6650,-7.673,0.2970,151.937,audio_features,0.578,spotify:track:5qk0TivMEVkujTu7xJLKu7
1901,7JwmlNz8ic9ATnLn7DyAKx,0.615,0.584,0.0,0.1140,-5.638,0.2350,162.169,audio_features,0.529,spotify:track:7JwmlNz8ic9ATnLn7DyAKx
1902,3h1LpNOGh1NgKVQPetS2Ed,0.830,0.735,0.0,0.2160,-5.230,0.3010,139.932,audio_features,0.682,spotify:track:3h1LpNOGh1NgKVQPetS2Ed


### As in the track dataframe, we have the same 80 duplicates, caused by albums shared between artists. Let's remove the duplicate tracks.
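
One way to confirm that two frames contain the same duplicated ids is to compare the sets of repeated `track_id` values. The sketch below uses invented stand-in frames. Applying it to the real data would require copies of both frames taken before either was deduplicated, which is an assumption about how the notebook is run:

```python
import pandas as pd

# Hypothetical stand-ins for the track and track_feat frames before deduplication
toy_track = pd.DataFrame({'track_id': ['t1', 't2', 't1', 't3']})
toy_feat = pd.DataFrame({'track_id': ['t1', 't2', 't1', 't3']})

def duplicated_ids(df):
    '''Return the set of track_id values that occur more than once.'''
    return set(df.loc[df['track_id'].duplicated(keep=False), 'track_id'])

# True when both frames repeat exactly the same ids
same_dups = duplicated_ids(toy_track) == duplicated_ids(toy_feat)

# Number of rows that drop_duplicates(subset=['track_id']) would remove
n_extra_rows = int(toy_feat['track_id'].duplicated().sum())
print(same_dups, n_extra_rows)
```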

In [15]:
# drop duplicate tracks with the same track_id
track_feat.drop_duplicates(subset=['track_id'], inplace=True)

check_for_dups(track_feat, 'track_feat')

The following columns in track_feat have potential duplicates: ['type']


### Repeated values are expected in the type column. The track_feat df is free of redundant duplicates.

## Checking for null values
---

In [16]:
# function to quickly check if there are null values in the dataframes
def check_for_nulls(dfs, tables):
    '''Function takes in a list of pandas dataframes (dfs) and a list of their names as strings (tables)'''
    for df, table in zip(dfs, tables):
        # total number of null cells across every column of the df
        n_nulls = df.isnull().sum().sum()
        if n_nulls == 0:
            print(f'No null values in {table} dataframe!')
        else:
            print(f'{table} has {n_nulls} null value(s), further investigate.')

In [17]:
check_for_nulls([artist, album, track, track_feat], ['artist', 'album', 'track', 'track_feat'])

No null values in artist dataframe!
No null values in album dataframe!
No null values in track dataframe!
No null values in track_feat dataframe!
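
If a dataframe did report nulls, a per-column breakdown is a natural next step. A minimal sketch on a made-up frame (`toy` is hypothetical; the real dataframes above have no nulls):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
toy = pd.DataFrame({
    'track_id': ['t1', 't2', 't3'],
    'tempo': [120.0, np.nan, 98.5],
})

null_counts = toy.isnull().sum()                 # number of nulls in each column
cols_with_nulls = null_counts[null_counts > 0]   # keep only the affected columns
print(cols_with_nulls)
```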


## Checking correct data types
---

In [18]:
artist.dtypes

artist_id       object
artist_name     object
external_url    object
genre           object
image_url       object
followers        int64
popularity       int64
type            object
artist_uri      object
dtype: object

### Comparing the data types above to the example tables, we can see that our data types are good to go. Now let's check the album dataframe.

In [19]:
album.dtypes

album_id        object
album_name      object
external_url    object
image_url       object
release_date    object
total_tracks     int64
type            object
album_uri       object
artist_id       object
dtype: object

### All of the above are good except release_date, which we will convert to datetime.

In [20]:
# change release_date data type to be datetime
album['release_date'] = pd.to_datetime(album.release_date, format='%Y-%m-%d', errors='coerce')

In [21]:
album.dtypes

album_id                object
album_name              object
external_url            object
image_url               object
release_date    datetime64[ns]
total_tracks             int64
type                    object
album_uri               object
artist_id               object
dtype: object
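
Because `errors='coerce'` turns any string that does not parse into `NaT` without raising, the conversion can introduce missing values after the null check has already passed. A hedged sketch of how one might look for such rows, using invented date strings (`raw` is hypothetical):

```python
import pandas as pd

# Hypothetical release_date strings; the last one is not a date at all
raw = pd.Series(['2021-04-16', '2020-11-13', 'unknown'])

parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Values that were present before parsing but are NaT afterwards failed to parse
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())
```

Running the same comparison on a copy of the original release_date column would show whether any album dates were silently lost.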

### Data types for album are good. Now let's check the track dataframe.

In [22]:
track.dtypes

track_id        object
song_name       object
external_url    object
duration_ms      int64
explicit          bool
disc_number      int64
type            object
song_uri        object
album_id        object
dtype: object

### Data types for track are good. Lastly, let's check the track_feat dataframe.

In [23]:
track_feat.dtypes

track_id             object
danceability        float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
speechiness         float64
tempo               float64
type                 object
valence             float64
song_uri             object
dtype: object

### Data types for track_feat are good. All of the dataframes now have the data types shown in the example tables!
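
The visual comparison against the example tables could also be automated. The sketch below assumes a hand-written mapping of expected dtypes; both `expected` and `toy_album` are hypothetical, since the example tables themselves are not reproduced here:

```python
import pandas as pd

# Hypothetical expected dtypes; the real targets would come from the example tables
expected = {'total_tracks': 'int64', 'release_date': 'datetime64[ns]'}

# Hypothetical frame where release_date has not been converted yet
toy_album = pd.DataFrame({
    'total_tracks': [31, 23],
    'release_date': ['2021-04-16', '2021-04-15'],
})

# Collect every column whose actual dtype differs from the expected one
mismatches = {col: str(toy_album[col].dtype)
              for col, want in expected.items()
              if str(toy_album[col].dtype) != want}
print(mismatches)
```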

## Save transformed data to new CSV files
---

In [24]:
# save each fully transformed df to its own csv file, named final_<table>.csv
dataframes = [artist, album, track, track_feat]
df_names = ['artist', 'album', 'track', 'track_feat']

for df, df_name in zip(dataframes, df_names):
    # index=False keeps the row index out of the saved file
    df.to_csv(f'../../data/final_{df_name}.csv', index=False)
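
CSV files do not store data types, so the datetime conversion has to be repeated when the saved files are read back. A small round-trip sketch, written to an in-memory buffer rather than to `../../data/` (`toy_album` is a hypothetical stand-in):

```python
import io
import pandas as pd

toy_album = pd.DataFrame({
    'album_id': ['id1', 'id2'],
    'release_date': pd.to_datetime(['2021-04-16', '2020-11-13']),
})

# Write to an in-memory buffer instead of a file, for illustration only
buffer = io.StringIO()
toy_album.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                                  # dates come back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['release_date'])   # dates restored

print(plain['release_date'].dtype, parsed['release_date'].dtype)
```

Passing `parse_dates=['release_date']` when loading `final_album.csv` would restore the datetime column in the same way.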