# Why Data Cleaning
While developing the ETL process, I came across values that count as "unclean data".

For example:
* In the song data files, the release year is often 0.
* In the song data files, some string fields are the empty string "".


I believe we shouldn't load these values into the database as they are. Even if we can't recover the real values, we should at least convert them to `NULL`, so that missing values are represented in one standard way.
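
To make the goal concrete, here is a minimal sketch of how a missing value ends up as `NULL` in a database. It uses the standard library's in-memory `sqlite3` purely for illustration; the actual target database of the ETL is not shown in this notebook, and the `songs` table layout and the `to_db_value` helper are hypothetical.

```python
import math
import sqlite3


def to_db_value(value):
    """Return None for NaN so the driver writes NULL; pass other values through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Hypothetical two-column table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id TEXT, year INTEGER)")

rows = [
    ("SOMZWCG12A8C13C480", float("nan")),  # release year unknown
    ("SOCIWDW12A8C13D406", 1969),
]
conn.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [tuple(to_db_value(v) for v in row) for row in rows],
)

# Rows with a missing year can now be found with a plain IS NULL filter.
null_years = conn.execute(
    "SELECT COUNT(*) FROM songs WHERE year IS NULL"
).fetchone()[0]
print(null_years)
```

With one representation for "missing", queries only need `IS NULL` instead of special cases such as `year = 0` or `artist_location = ''`.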

In [43]:
import os
import glob
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import List

In [44]:
def get_files(filepath:str) -> List[str]:
    """returns all json files in the directory tree under the filepath

    Arguments:
        filepath -- the root path

    Returns:
        a list of json filepaths under the root path
    """
    all_files = []
    for root, dirs, files in os.walk(filepath):
        files = glob.glob(os.path.join(root,'*.json'))
        for f in files :
            all_files.append(os.path.abspath(f))
    
    return all_files
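
As an aside, the same directory walk can probably be expressed with a single recursive glob. The sketch below is an alternative, not the function used in the rest of this notebook; the name `get_files_recursive` is hypothetical.

```python
import glob
import os
from typing import List


def get_files_recursive(filepath: str) -> List[str]:
    """Hypothetical variant of get_files built on one recursive glob."""
    pattern = os.path.join(filepath, "**", "*.json")
    # With recursive=True, "**" matches the root itself and any depth of subdirectories.
    matches = glob.glob(pattern, recursive=True)
    # Sorting makes the order of the returned paths deterministic.
    return sorted(os.path.abspath(p) for p in matches)
```

Calling `get_files_recursive("../data/raw/song_data")` should return the same set of paths as `get_files`, in sorted order.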

# Part 1: The data under the `song_data` directory

## 1.1 Identifying the problems

Reading the JSON files into a single DataFrame

In [45]:
song_files = get_files("../data/raw/song_data")
# Read every file first, then concatenate once instead of growing the DataFrame in a loop
df_songs_artists = pd.concat([pd.read_json(path, lines = True) for path in song_files])
df_songs_artists.head()

|   | num_songs | artist_id | artist_latitude | artist_longitude | artist_location | artist_name | song_id | title | duration | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ARD7TVE1187B99BFB1 |  |  | California - LA | Casual | SOMZWCG12A8C13C480 | I Didn't Mean To | 218.93179 | 0 |
| 0 | 1 | ARMJAGH1187FB546F3 | 35.14968 | -90.04892 | Memphis, TN | The Box Tops | SOCIWDW12A8C13D406 | Soul Deep | 148.03546 | 1969 |
| 0 | 1 | ARKRRTF1187B9984DA |  |  |  | Sonora Santanera | SOXVLOJ12AB0189215 | Amor De Cabaret | 177.47546 | 0 |
| 0 | 1 | AR7G5I41187FB4CE6C |  |  | London, England | Adam Ant | SONHOTT12A8C13493C | Something Girls | 233.40363 | 1982 |
| 0 | 1 | ARXR32B1187FB57099 |  |  |  | Gob | SOFSOCN12A8C143F5D | Face the Ashes | 209.60608 | 2007 |


There are 2 observable issues here:
* Some years are set to 0. They should be NaN.
* Some `artist_location` values are empty strings. They should also be NaN.

Checking the data types

In [46]:
df_songs_artists.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 0 to 0
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   num_songs         71 non-null     int64  
 1   artist_id         71 non-null     object 
 2   artist_latitude   31 non-null     float64
 3   artist_longitude  31 non-null     float64
 4   artist_location   71 non-null     object 
 5   artist_name       71 non-null     object 
 6   song_id           71 non-null     object 
 7   title             71 non-null     object 
 8   duration          71 non-null     float64
 9   year              71 non-null     int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 6.1+ KB
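
Note that `info()` reports `artist_location` and `year` as fully non-null: to pandas, `""` and `0` are ordinary values, not missing ones. A minimal sketch of counting these placeholder values explicitly, using a small hand-built sample that mirrors the five rows shown by `head()` above (the `sample` DataFrame is an illustration, not the full dataset):

```python
import pandas as pd

# Sample mirroring the five rows displayed earlier; not the full 71-row dataset.
sample = pd.DataFrame(
    {
        "artist_location": ["California - LA", "Memphis, TN", "", "London, England", ""],
        "year": [0, 1969, 0, 1982, 2007],
    }
)

# Comparing against the placeholder gives a boolean Series; summing it counts the matches.
zero_years = int((sample["year"] == 0).sum())
empty_locations = int((sample["artist_location"] == "").sum())
print(zero_years, empty_locations)
```

Running the same two comparisons on `df_songs_artists` would give the number of affected rows in the real data.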


Checking for duplicates in the song data and the artist data

In [47]:
print(df_songs_artists.shape[0],
      df_songs_artists["song_id"].nunique(),
      df_songs_artists["artist_id"].nunique())

71 71 69


So, some artists appear more than once, which is fine: the JSON entries are per song, and an artist can have more than one song.

However, we should also check whether the artist data is consistent across those entries.

In [48]:
artist_columns = ["artist_id", "artist_name", "artist_location", "artist_longitude", "artist_latitude"]
df_artists = df_songs_artists[artist_columns].copy()
df_artists.head()

|   | artist_id | artist_name | artist_location | artist_longitude | artist_latitude |
|---|---|---|---|---|---|
| 0 | ARD7TVE1187B99BFB1 | Casual | California - LA |  |  |
| 0 | ARMJAGH1187FB546F3 | The Box Tops | Memphis, TN | -90.04892 | 35.14968 |
| 0 | ARKRRTF1187B9984DA | Sonora Santanera |  |  |  |
| 0 | AR7G5I41187FB4CE6C | Adam Ant | London, England |  |  |
| 0 | ARXR32B1187FB57099 | Gob |  |  |  |


We can group by `artist_id` and count the unique values in each column for each artist (`dropna = False` makes NaN count as a value too). If the data is consistent, every count should be 1.

In [49]:
grouped = df_artists.groupby(["artist_id"]).agg(lambda x: x.nunique(dropna = False))
grouped

| artist_id | artist_name | artist_location | artist_longitude | artist_latitude |
|---|---|---|---|---|
| AR051KA1187B98B2FF | 1 | 1 | 1 | 1 |
| AR0IAWL1187B9A96D0 | 1 | 1 | 1 | 1 |
| AR0RCMP1187FB3F427 | 1 | 1 | 1 | 1 |
| AR10USD1187B99F3F1 | 1 | 1 | 1 | 1 |
| AR1Y2PT1187FB5B9CE | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... |
| ARULZCI1241B9C8611 | 1 | 1 | 1 | 1 |
| ARVBRGZ1187FB4675A | 1 | 1 | 1 | 1 |
| ARWB3G61187FB49404 | 1 | 1 | 1 | 1 |
| ARXR32B1187FB57099 | 1 | 1 | 1 | 1 |


In [50]:
(grouped == 1).all()

artist_name         True
artist_location     True
artist_longitude    True
artist_latitude     True
dtype: bool

Every count is 1, so the artist data is consistent.
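
To show what this check would catch, here is a small sketch on made-up data where one artist appears with two different names. The IDs `A1` and `A2`, the name `Casual & Friends`, and the `toy` DataFrame are hypothetical; `DataFrameGroupBy.nunique(dropna=False)` is used as a shorter equivalent of the `agg` with a lambda above.

```python
import pandas as pd

# Hypothetical data: artist A1 is listed under two different names.
toy = pd.DataFrame(
    {
        "artist_id": ["A1", "A1", "A2"],
        "artist_name": ["Casual", "Casual & Friends", "Gob"],
        "artist_location": ["California - LA", "California - LA", ""],
    }
)

# Number of distinct values per column within each artist_id group.
counts = toy.groupby("artist_id").nunique(dropna=False)

# An artist is inconsistent if any of its columns has more than one distinct value.
inconsistent_ids = counts.index[(counts != 1).any(axis=1)].tolist()
print(inconsistent_ids)
```

Listing the offending IDs, rather than only the per-column `all()`, makes it easier to inspect the conflicting rows if the check ever fails on new data.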

### 1.1.1 Issue Summary
* Some years are set to 0. They should be NaN.
* Some `artist_location` values are empty strings. They should also be NaN.

## 1.2 Fixing

In [51]:
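# Assign the result back: an in-place replace on a selected column may not update the DataFrame in newer pandas versions
df_songs_artists["year"] = df_songs_artists["year"].replace({0: np.nan})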
df_songs_artists.head()

|   | num_songs | artist_id | artist_latitude | artist_longitude | artist_location | artist_name | song_id | title | duration | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ARD7TVE1187B99BFB1 |  |  | California - LA | Casual | SOMZWCG12A8C13C480 | I Didn't Mean To | 218.93179 |  |
| 0 | 1 | ARMJAGH1187FB546F3 | 35.14968 | -90.04892 | Memphis, TN | The Box Tops | SOCIWDW12A8C13D406 | Soul Deep | 148.03546 | 1969.0 |
| 0 | 1 | ARKRRTF1187B9984DA |  |  |  | Sonora Santanera | SOXVLOJ12AB0189215 | Amor De Cabaret | 177.47546 |  |
| 0 | 1 | AR7G5I41187FB4CE6C |  |  | London, England | Adam Ant | SONHOTT12A8C13493C | Something Girls | 233.40363 | 1982.0 |
| 0 | 1 | ARXR32B1187FB57099 |  |  |  | Gob | SOFSOCN12A8C143F5D | Face the Ashes | 209.60608 | 2007.0 |
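
The years now display as `1969.0` because NaN is a float, so the whole column is converted to `float64`. If integer years are preferred, one option (not what this notebook does) is pandas' nullable `Int64` dtype. A minimal sketch on a hand-built Series mirroring the five years shown above:

```python
import pandas as pd

# The five years displayed earlier, before cleaning.
year = pd.Series([0, 1969, 0, 1982, 2007], name="year")

# where() keeps values that satisfy the condition and puts NaN elsewhere (float64);
# astype("Int64") then converts to the nullable integer dtype, turning NaN into <NA>.
year_nullable = year.where(year != 0).astype("Int64")
print(year_nullable)
```

The trade-off is that downstream code has to handle `pd.NA` instead of `np.nan`, so this is only worth doing if the integer representation matters for the load step.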


In [52]:
# Assign the result back for the same reason as with the year column
df_songs_artists["artist_location"] = df_songs_artists["artist_location"].replace({"": np.nan})
df_songs_artists.head()

|   | num_songs | artist_id | artist_latitude | artist_longitude | artist_location | artist_name | song_id | title | duration | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ARD7TVE1187B99BFB1 |  |  | California - LA | Casual | SOMZWCG12A8C13C480 | I Didn't Mean To | 218.93179 |  |
| 0 | 1 | ARMJAGH1187FB546F3 | 35.14968 | -90.04892 | Memphis, TN | The Box Tops | SOCIWDW12A8C13D406 | Soul Deep | 148.03546 | 1969.0 |
| 0 | 1 | ARKRRTF1187B9984DA |  |  |  | Sonora Santanera | SOXVLOJ12AB0189215 | Amor De Cabaret | 177.47546 |  |
| 0 | 1 | AR7G5I41187FB4CE6C |  |  | London, England | Adam Ant | SONHOTT12A8C13493C | Something Girls | 233.40363 | 1982.0 |
| 0 | 1 | ARXR32B1187FB57099 |  |  |  | Gob | SOFSOCN12A8C143F5D | Face the Ashes | 209.60608 | 2007.0 |


## 1.3 Saving the cleaned data

In [54]:
# artist data
artist_columns = ["artist_id", "artist_name", "artist_location", "artist_longitude", "artist_latitude"]
df_artists = df_songs_artists[artist_columns].copy()
df_artists.drop_duplicates(inplace = True)
df_artists.to_csv("../data/cleaned/artists.csv")

In [55]:
# songs data
song_columns = ["song_id", "title", "artist_id", "duration", "year"]
df_songs = df_songs_artists[song_columns].copy()
df_songs.to_csv("../data/cleaned/songs.csv")
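
One thing to be aware of: `to_csv` writes the DataFrame index by default, and here every row has index 0 because each JSON file contributed a single row. If that column is not wanted in the cleaned files, passing `index=False` drops it. The sketch below writes a two-row hand-built sample to an in-memory buffer instead of a file, to show how the output looks and that NaN is written as an empty field; the `sample` DataFrame is an illustration only.

```python
import io

import numpy as np
import pandas as pd

# Two rows with the same index value 0, as in the concatenated song data.
sample = pd.DataFrame(
    {
        "song_id": ["SOMZWCG12A8C13C480", "SOCIWDW12A8C13D406"],
        "year": [np.nan, 1969.0],
    },
    index=[0, 0],
)

# index=False leaves the (uninformative) index out of the CSV.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)

# Reading the text back turns the empty field into NaN again.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
```

Whether to drop the index depends on how the cleaned files are read later, which is outside this notebook.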

# Part 2: The data under the `log_data` directory

## 2.1 Identifying the problems