# Why Data Cleaning
While developing the ETL process, I came across some values that count as "unclean data".

For example:
* In the songs data files, the release year is often 0
* In the songs data files, some string fields are the empty string `""`


I believe we shouldn't load these values into the database as they are. Even if we can't recover the real values, we should at least convert them to `NULL`. That way, missing values are represented in one standard way.
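As a minimal sketch of that idea (the sample rows and the names `sample`, `cleaned`, and `csv_text` are illustrative assumptions, not part of the ETL code), both placeholder values can be mapped to `NaN`, which pandas writes to CSV as an empty field by default:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample mimicking the two problems listed above.
sample = pd.DataFrame({
    "title": ["Soul Deep", "Amor De Cabaret"],
    "artist_location": ["Memphis, TN", ""],
    "year": [1969, 0],
})

# Map each placeholder to NaN so every missing value has one representation.
cleaned = sample.assign(
    artist_location=sample["artist_location"].replace("", np.nan),
    year=sample["year"].replace(0, np.nan),
)

# to_csv writes NaN as an empty field by default.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Whether an empty CSV field ends up as `NULL` depends on how the database loader is configured, so that step still needs to be checked on the loading side.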

In [43]:
import os
import glob
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import List

In [44]:
def get_files(filepath:str) -> List[str]:
    """returns all json files in the directory tree under the filepath

    Arguments:
        filepath -- the root path

    Returns:
        a list of json filepaths under the root path
    """
    all_files = []
    for root, dirs, files in os.walk(filepath):
        files = glob.glob(os.path.join(root,'*.json'))
        for f in files :
            all_files.append(os.path.abspath(f))
    
    return all_files
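
For comparison, the same recursive search could be sketched with `pathlib`. The function name `get_files_pathlib` is hypothetical, and unlike `os.path.abspath`, `Path.resolve()` also resolves symlinks, so the two are not strictly identical:

```python
import tempfile
from pathlib import Path
from typing import List


def get_files_pathlib(filepath: str) -> List[str]:
    """Hypothetical alternative to get_files, built on Path.rglob."""
    # rglob walks the whole directory tree; sorting makes the order deterministic.
    return sorted(str(p.resolve()) for p in Path(filepath).rglob("*.json"))


# Small self-check on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "A" / "B"
    nested.mkdir(parents=True)
    (nested / "song.json").write_text("{}")
    (Path(tmp) / "notes.txt").write_text("ignored")
    found = get_files_pathlib(tmp)

print(found)
```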

# Part 1: The data under `song_data` directory

## 1.1 Identifying the problems

Reading the JSON files into a single DataFrame

In [45]:
song_files = get_files("../data/raw/song_data")
# Read every file first, then concatenate once instead of growing the frame in a loop
df_songs_artists = pd.concat([pd.read_json(path, lines=True) for path in song_files])
df_songs_artists.head()

Unnamed: 0,num_songs,artist_id,artist_latitude,artist_longitude,artist_location,artist_name,song_id,title,duration,year
0,1,ARD7TVE1187B99BFB1,,,California - LA,Casual,SOMZWCG12A8C13C480,I Didn't Mean To,218.93179,0
0,1,ARMJAGH1187FB546F3,35.14968,-90.04892,"Memphis, TN",The Box Tops,SOCIWDW12A8C13D406,Soul Deep,148.03546,1969
0,1,ARKRRTF1187B9984DA,,,,Sonora Santanera,SOXVLOJ12AB0189215,Amor De Cabaret,177.47546,0
0,1,AR7G5I41187FB4CE6C,,,"London, England",Adam Ant,SONHOTT12A8C13493C,Something Girls,233.40363,1982
0,1,ARXR32B1187FB57099,,,,Gob,SOFSOCN12A8C143F5D,Face the Ashes,209.60608,2007


There are 2 observable issues here:
* Some years are set to 0. They should be NaN
* Some `artist_location`s are empty strings. They should also be NaN
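
Neither placeholder counts as null in pandas, so `isna()` and `info()` will not reveal them. A small sketch of counting them explicitly, using a hypothetical `sample` frame that copies the `year` and `artist_location` values from the five rows shown above:

```python
import pandas as pd

# Hypothetical frame holding the values from the five rows shown above.
sample = pd.DataFrame({
    "artist_location": ["California - LA", "Memphis, TN", "", "London, England", ""],
    "year": [0, 1969, 0, 1982, 2007],
})

# Count the placeholder values explicitly, since they are not null.
zero_years = int((sample["year"] == 0).sum())
empty_locations = int((sample["artist_location"] == "").sum())
nulls_seen_by_pandas = int(sample.isna().sum().sum())

print(zero_years, empty_locations, nulls_seen_by_pandas)
```

Running the same two comparisons on `df_songs_artists` would give the counts for the full dataset.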

Checking the datatypes

In [46]:
df_songs_artists.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 0 to 0
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   num_songs         71 non-null     int64  
 1   artist_id         71 non-null     object 
 2   artist_latitude   31 non-null     float64
 3   artist_longitude  31 non-null     float64
 4   artist_location   71 non-null     object 
 5   artist_name       71 non-null     object 
 6   song_id           71 non-null     object 
 7   title             71 non-null     object 
 8   duration          71 non-null     float64
 9   year              71 non-null     int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 6.1+ KB


Checking for duplicates in the song data and the artist data

In [47]:
print(df_songs_artists.shape[0],
      df_songs_artists["song_id"].nunique(),
      df_songs_artists["artist_id"].nunique())

71 71 69


So, some artists appear more than once, which is fine: each JSON entry describes a song, and an artist can have more than one song.
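
To see which artists repeat, one option is `value_counts`. This is only a sketch on a hypothetical three-row `sample`; the ids `S1`, `A1`, and so on are made up:

```python
import pandas as pd

# Hypothetical song rows: artist A1 has two songs.
sample = pd.DataFrame({
    "song_id": ["S1", "S2", "S3"],
    "artist_id": ["A1", "A2", "A1"],
})

# Count songs per artist and keep only the artists that occur more than once.
songs_per_artist = sample["artist_id"].value_counts()
repeated_artists = songs_per_artist[songs_per_artist > 1]
print(repeated_artists.to_dict())
```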

However, we should also check whether the artist data is consistent

In [48]:
artist_columns = ["artist_id", "artist_name", "artist_location", "artist_longitude", "artist_latitude"]
df_artists = df_songs_artists[artist_columns].copy()
df_artists.head()

Unnamed: 0,artist_id,artist_name,artist_location,artist_longitude,artist_latitude
0,ARD7TVE1187B99BFB1,Casual,California - LA,,
0,ARMJAGH1187FB546F3,The Box Tops,"Memphis, TN",-90.04892,35.14968
0,ARKRRTF1187B9984DA,Sonora Santanera,,,
0,AR7G5I41187FB4CE6C,Adam Ant,"London, England",,
0,ARXR32B1187FB57099,Gob,,,


We can group by `artist_id` and count the unique values in each column for each artist. If the data is consistent, every count should be 1.

In [49]:
grouped = df_artists.groupby(["artist_id"]).agg(lambda x: x.nunique(dropna = False))
grouped

artist_id,artist_name,artist_location,artist_longitude,artist_latitude
AR051KA1187B98B2FF,1,1,1,1
AR0IAWL1187B9A96D0,1,1,1,1
AR0RCMP1187FB3F427,1,1,1,1
AR10USD1187B99F3F1,1,1,1,1
AR1Y2PT1187FB5B9CE,1,1,1,1
...,...,...,...,...
ARULZCI1241B9C8611,1,1,1,1
ARVBRGZ1187FB4675A,1,1,1,1
ARWB3G61187FB49404,1,1,1,1
ARXR32B1187FB57099,1,1,1,1


In [50]:
(grouped == 1).all()

artist_name         True
artist_location     True
artist_longitude    True
artist_latitude     True
dtype: bool

The data is consistent: each `artist_id` maps to a single name, location, longitude, and latitude.

### 1.1.1 Issue Summary
* Some years are set to 0. They should be NaN
* Some `artist_location`s are empty strings. They should also be NaN

## 1.2 Fixing

In [51]:
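# Assign the result back: an in-place replace on a selected column may not update the frame in newer pandas
df_songs_artists["year"] = df_songs_artists["year"].replace(0, np.nan)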
df_songs_artists.head()

Unnamed: 0,num_songs,artist_id,artist_latitude,artist_longitude,artist_location,artist_name,song_id,title,duration,year
0,1,ARD7TVE1187B99BFB1,,,California - LA,Casual,SOMZWCG12A8C13C480,I Didn't Mean To,218.93179,
0,1,ARMJAGH1187FB546F3,35.14968,-90.04892,"Memphis, TN",The Box Tops,SOCIWDW12A8C13D406,Soul Deep,148.03546,1969.0
0,1,ARKRRTF1187B9984DA,,,,Sonora Santanera,SOXVLOJ12AB0189215,Amor De Cabaret,177.47546,
0,1,AR7G5I41187FB4CE6C,,,"London, England",Adam Ant,SONHOTT12A8C13493C,Something Girls,233.40363,1982.0
0,1,ARXR32B1187FB57099,,,,Gob,SOFSOCN12A8C143F5D,Face the Ashes,209.60608,2007.0


In [52]:
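# Assign the result back: an in-place replace on a selected column may not update the frame in newer pandas
df_songs_artists["artist_location"] = df_songs_artists["artist_location"].replace("", np.nan)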
df_songs_artists.head()

Unnamed: 0,num_songs,artist_id,artist_latitude,artist_longitude,artist_location,artist_name,song_id,title,duration,year
0,1,ARD7TVE1187B99BFB1,,,California - LA,Casual,SOMZWCG12A8C13C480,I Didn't Mean To,218.93179,
0,1,ARMJAGH1187FB546F3,35.14968,-90.04892,"Memphis, TN",The Box Tops,SOCIWDW12A8C13D406,Soul Deep,148.03546,1969.0
0,1,ARKRRTF1187B9984DA,,,,Sonora Santanera,SOXVLOJ12AB0189215,Amor De Cabaret,177.47546,
0,1,AR7G5I41187FB4CE6C,,,"London, England",Adam Ant,SONHOTT12A8C13493C,Something Girls,233.40363,1982.0
0,1,ARXR32B1187FB57099,,,,Gob,SOFSOCN12A8C143F5D,Face the Ashes,209.60608,2007.0


## 1.3 Extracting and Saving the cleaned data

In [96]:
# artist data
artist_columns = ["artist_id", "artist_name", "artist_location", "artist_longitude", "artist_latitude"]
df_artists = df_songs_artists[artist_columns].copy()
df_artists.drop_duplicates(inplace = True)
df_artists.to_csv("../data/cleaned/artists.csv", index = False)

In [97]:
# songs data
song_columns = ["song_id", "title", "artist_id", "duration", "year"]
df_songs = df_songs_artists[song_columns].copy()
df_songs.to_csv("../data/cleaned/songs.csv", index = False)

# Part 2: The Data under the `log_data` directory

## 2.1 Identifying problems

Reading the JSON files into a single DataFrame

In [75]:
log_files = get_files("../data/raw/log_data")
# Read every file first, then concatenate once instead of growing the frame in a loop
df_logs = pd.concat([pd.read_json(path, lines=True) for path in log_files])
df_logs.head(2)

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919000000.0,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540345000000.0,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8


In [57]:
df_logs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8056 entries, 0 to 387
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   artist         6820 non-null   object 
 1   auth           8056 non-null   object 
 2   firstName      7770 non-null   object 
 3   gender         7770 non-null   object 
 4   itemInSession  8056 non-null   int64  
 5   lastName       7770 non-null   object 
 6   length         6820 non-null   float64
 7   level          8056 non-null   object 
 8   location       7770 non-null   object 
 9   method         8056 non-null   object 
 10  page           8056 non-null   object 
 11  registration   7770 non-null   float64
 12  sessionId      8056 non-null   int64  
 13  song           6820 non-null   object 
 14  status         8056 non-null   int64  
 15  ts             8056 non-null   int64  
 16  userAgent      7770 non-null   object 
 17  userId         8056 non-null   object 
dtypes: float64(2), int64(4), object(12)

In [77]:
df_logs["auth"].unique()

array(['Logged In', 'Logged Out'], dtype=object)

In [81]:
df_logs[df_logs["auth"] == 'Logged Out']

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged Out,,,0,,,free,,PUT,Login,,52,,307,1541207073796,,
6,,Logged Out,,,0,,,free,,GET,Home,,18,,200,1541239749796,,
11,,Logged Out,,,3,,,paid,,GET,Home,,128,,200,1541310732796,,
12,,Logged Out,,,4,,,paid,,PUT,Login,,128,,307,1541310733796,,
90,,Logged Out,,,0,,,paid,,GET,Home,,175,,200,1541329386796,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,,Logged Out,,,1,,,paid,,PUT,Login,,977,,307,1543585407796,,
235,,Logged Out,,,15,,,paid,,GET,Home,,977,,200,1543588286796,,
236,,Logged Out,,,16,,,paid,,PUT,Login,,977,,307,1543588287796,,
258,,Logged Out,,,0,,,paid,,PUT,Login,,1097,,307,1543589944796,,


In [80]:
df_logs["page"].unique()

array(['Home', 'NextSong', 'Upgrade', 'Downgrade', 'Settings',
       'Save Settings', 'Login', 'Logout', 'Help', 'Error', 'About',
       'Submit Upgrade', 'Submit Downgrade'], dtype=object)

So, it seems the logs track many of the user actions, including
* Which page the action led to
* The type of HTTP request and its response
* Whether the user was logged in or not
* Whether the user was using a paid or a free service
* What agent the user was using (type of web browser for example)


We should note that our user data can change. It makes sense to let users switch between the free and premium levels. For the other data (like name), some platforms allow changes and others do not, so it's up to us to decide.

I will go with allowing users to change their other information as well.

For example, take the user with id 15:

In [126]:
df_logs[df_logs["userId"] == 15].iloc[345:350][["page", "level"]]

Unnamed: 0,page,level
268,Submit Downgrade,paid
269,Home,free
270,NextSong,free
271,Upgrade,free
272,Submit Upgrade,free


This user was on premium, then downgraded, and then upgraded again.

We can handle this with upsert operations on the database, or we can handle it here in the extraction phase. In both cases, we will have to sort the data by timestamp.
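
A minimal sketch of the extraction-phase option, assuming a hypothetical `logs` frame whose rows arrive out of order: sort by `ts`, then keep only the most recent row per user.

```python
import pandas as pd

# Hypothetical log rows, deliberately out of chronological order.
logs = pd.DataFrame({
    "userId": [15, 8, 15, 15],
    "level": ["paid", "free", "paid", "free"],
    "ts": [3000, 1500, 1000, 2000],
})

# Sort chronologically, then keep the most recent row for each user.
latest = (
    logs.sort_values("ts")
        .drop_duplicates(subset="userId", keep="last")
        .set_index("userId")
)
print(latest)
```

Without the sort, the last row in file order for user 15 would be the `free` record at `ts` 2000, even though the `paid` record at `ts` 3000 is newer.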

In [83]:
df_logs["userId"].unique()

array([39, 8, 10, 26, 101, 83, 66, 48, 86, 17, 15, 89, 80, 44, 88, 49,
       100, 61, 75, 50, 12, 71, 54, 3, '', '53', '69', '62', '101', '95',
       '10', '15', '63', '49', '6', '52', '99', '43', '25', '51', '26',
       '44', '16', '80', '32', '37', '28', '77', '78', '74', '100', '55',
       '33', '61', '73', '58', '83', '94', '57', '42', '60', '84', '91',
       '24', '97', '75', '35', '81', '27', '29', '12', '66', '88', '50',
       '34', '30', '2', '92', '8', '9', '89', '14', '86', '23', '98',
       '54', '45', '20', '11', '85', '48', '72', '36', '7', '64', '47',
       '67', '13', '18', 96, 6, 16, 52, 37, 69, 32, 74, 7, 18, 36, 14, 35,
       '96', '41', '68', '76', '40', '4', '59', '19', '90', '70', '79',
       '17', '71', '65', '56', '87', '21', '38', '5', '82', '39', '22'],
      dtype=object)

There's a problem here. Some userIds are integers, while others are strings containing digits. Unless we are careful, we might treat the number `17` and the string `'17'` as different users (which we shouldn't).

Also, there's the empty string id. We need to standardize these values.
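
One possible way to standardize the ids is sketched below. It is an alternative to a plain float cast, not the approach this notebook uses: `pd.to_numeric` parses both representations, and the nullable `Int64` dtype keeps the ids as whole numbers while still allowing missing values. The `raw_ids` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical mixed-type ids like the ones in the output above.
raw_ids = pd.Series([17, "17", "", 8, "101"], dtype=object)

# to_numeric parses ints and digit strings alike; with errors="coerce",
# values that cannot be parsed become NaN. The empty string also ends up as NaN.
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

# Nullable Int64 keeps whole-number ids and represents the missing one as <NA>.
user_ids = numeric_ids.astype("Int64")
print(user_ids.tolist())
```

After the conversion, `17` and `'17'` compare equal, and the empty string is a proper missing value.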

### 2.1.1 Summary of problems

* userId representation needs to be unified
* To account for possible change of user data, we will need to extract the user data of the latest timestamp

## 2.2 Fixing and extracting

Unify the representation

In [87]:
df_logs["userId"] = df_logs["userId"].replace("", np.nan).astype(np.float64)
df_logs["userId"].unique()

array([ 39.,   8.,  10.,  26., 101.,  83.,  66.,  48.,  86.,  17.,  15.,
        89.,  80.,  44.,  88.,  49., 100.,  61.,  75.,  50.,  12.,  71.,
        54.,   3.,  nan,  53.,  69.,  62.,  95.,  63.,   6.,  52.,  99.,
        43.,  25.,  51.,  16.,  32.,  37.,  28.,  77.,  78.,  74.,  55.,
        33.,  73.,  58.,  94.,  57.,  42.,  60.,  84.,  91.,  24.,  97.,
        35.,  81.,  27.,  29.,  34.,  30.,   2.,  92.,   9.,  14.,  23.,
        98.,  45.,  20.,  11.,  85.,  72.,  36.,   7.,  64.,  47.,  67.,
        13.,  18.,  96.,  41.,  68.,  76.,  40.,   4.,  59.,  19.,  90.,
        70.,  79.,  65.,  56.,  87.,  21.,  38.,   5.,  82.,  22.])

Sort the data by timestamp

In [88]:
df_logs.sort_values(by = "ts", inplace = True)

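Group by `userId` and take the last entry. Note that `agg("last")` returns the last non-null value of each column within the group, not necessarily one intact row.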

In [128]:
grouped = df_logs.groupby("userId").agg("last")
grouped.head()

userId,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent
2.0,The Ramones,Logged In,Jizelle,F,3,Benjamin,210.18077,free,"Plymouth, IN",PUT,NextSong,1539909000000.0,354,Pet Semetary,200,1542450490796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
3.0,The Rakes,Logged In,Isaac,M,2,Valdez,150.59546,free,"Saginaw, MI",PUT,NextSong,1541078000000.0,112,Strasbourg,200,1541191397796,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...
4.0,James Newton Howard,Logged In,Alivia,F,0,Terrell,141.5571,free,"Parkersburg-Vienna, WV",GET,Home,1540505000000.0,1070,I'm Sorry,200,1543541644796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
5.0,Deas Vail,Logged In,Elijah,M,0,Davis,237.68771,free,"Detroit-Warren-Dearborn, MI",PUT,NextSong,1540772000000.0,985,Anything You Say (Unreleased Version),200,1543607664796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4..."
6.0,Toto,Logged In,Cecilia,F,0,Owens,411.19302,free,"Atlanta-Sandy Springs-Roswell, GA",GET,Home,1541032000000.0,1027,Home Of The Brave,200,1543586278796,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) G...
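
In the output above, user 4's row pairs the page `Home` with a non-null artist. That is consistent with how `GroupBy.last` works: it takes the last non-null value of each column independently. A small sketch of the difference from taking the final row, on a hypothetical two-event `events` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical, already time-sorted events for a single user.
events = pd.DataFrame({
    "userId": [4, 4],
    "page": ["NextSong", "Home"],
    "artist": ["James Newton Howard", np.nan],  # assume the Home event has no artist
    "ts": [1, 2],
})

# GroupBy.last: last non-null value per column, so the row can mix events.
last_values = events.groupby("userId").last()

# GroupBy.tail(1): the final row of each group, left intact.
last_rows = events.groupby("userId").tail(1).set_index("userId")

print(last_values)
print(last_rows)
```

For the user columns extracted next this makes little difference, since they are filled on every logged-in event, but it matters if the grouped frame is reused for event-level fields such as `artist` or `song`.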


### 2.3.1 Extract and Save the user data

In [100]:
user_cols = ["firstName", "lastName", "gender", "level"]
df_users = grouped[user_cols]
df_users

userId,firstName,lastName,gender,level
2.0,Jizelle,Benjamin,F,free
3.0,Isaac,Valdez,M,free
4.0,Alivia,Terrell,F,free
5.0,Elijah,Davis,M,free
6.0,Cecilia,Owens,F,free
...,...,...,...,...
97.0,Kate,Harrell,F,paid
98.0,Jordyn,Powell,F,free
99.0,Ann,Banks,F,free
100.0,Adler,Barrera,M,free


Save the users' data

In [127]:
df_users.to_csv("../data/cleaned/users.csv")

### 2.3.2 Extract and Save the Timestamp data

In [130]:
session_datetime = pd.to_datetime(df_logs["ts"], unit = "ms")
hour = session_datetime.dt.hour
day = session_datetime.dt.day
weekday = session_datetime.dt.weekday
week = session_datetime.dt.isocalendar().week
month = session_datetime.dt.month
year = session_datetime.dt.year

time_data_dict = {
    "ts": df_logs["ts"].copy(),
    "hour":hour,
    "day": day,
    "weekday": weekday,
    "week": week,
    "month": month,
    "year": year
}

In [131]:
time_df = pd.DataFrame(time_data_dict)
time_df.head()

Unnamed: 0,ts,hour,day,weekday,week,month,year
0,1541105830796,20,1,3,44,11,2018
1,1541106106796,21,1,3,44,11,2018
2,1541106106796,21,1,3,44,11,2018
3,1541106132796,21,1,3,44,11,2018
4,1541106352796,21,1,3,44,11,2018
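
As a quick sanity check of the conversion, here is a sketch that decodes the first `ts` value from the table above on its own. `pd.to_datetime` with `unit="ms"` treats the number as milliseconds since the Unix epoch and returns a timezone-naive value in UTC; the variable names are illustrative.

```python
import pandas as pd

# First timestamp from the table above, in milliseconds since the Unix epoch.
ts = pd.Series([1541105830796])

# Convert to timezone-naive datetimes (UTC wall-clock time).
dt = pd.to_datetime(ts, unit="ms")
first = dt.iloc[0]

hour, day, weekday = first.hour, first.day, first.weekday()
iso_week = first.isocalendar()[1]  # isocalendar() yields (year, week, weekday)
print(first, hour, day, weekday, iso_week)
```

The values agree with the first row of `time_df`: hour 20, day 1, weekday 3 (Thursday, with Monday as 0), and ISO week 44.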


Save the `time` data

In [132]:
time_df.to_csv("../data/cleaned/time.csv", index = False)

### 2.3.3 Extract and Save the `Songplay` data

In [137]:
songplay_cols = ["sessionId", "ts", "userId", "level", "location", "userAgent"]
df_songplays = df_logs[df_logs["page"] == "NextSong"].copy()
df_songplays = df_songplays[songplay_cols]
df_songplays.head()

Unnamed: 0,sessionId,ts,userId,level,location,userAgent
2,139,1541106106796,8.0,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
4,139,1541106352796,8.0,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
5,139,1541106496796,8.0,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
6,139,1541106673796,8.0,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."
7,139,1541107053796,8.0,free,"Phoenix-Mesa-Scottsdale, AZ","""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK..."


In [138]:
df_songplays.to_csv("../data/cleaned/songplays.csv", index = False)