### Exploration

In this notebook I am going to explore the data and processes of the first project of the NanoDegree Data Engineering Program.

- 'Walk' the directory with Path
- Retrieve all JSON files
- Load them into Pandas 
- Explore and clean
- Ready for insert -> NOTE: watch out with insertion in the database, something like auto increment oid should be ON 
- Create a SQL database (SQLlite for training purposes?)
- Insert the transformed and cleaned data
- Bonus: Logging (!) -> try to keep it simple but logging is essential for these tasks -> especially for Exception statements etc.

Make sure to write clean robust code, add sensible checks -> assert. Eventually this will become your etl.ipynb 

----------------------------

After the first exploration, there appears to be a recurring process:
- Collect the data for a particular table
- Assert that the data is correct
- Insert the data into the specified table
- Verify the results

There is probably a lot of functionality we can re-use for each table.

#### Find JSON files and return the directories 

In [1]:
import json
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
def create_path_list(file_path: str, extension: str = '.json') -> list:
    """Returns a list of Paths of all the files with the extension. All subdirectories of file_path are included.
    
    Example
        from pathlib import Path
        
        data_path = Path('.') / 'data'
        csv_path_list = create_path_list(data_path, '.csv')     
    """
    return_list = [x for x in file_path.glob(f"**/*{extension}")]
    print(f"{file_path} contains {len(return_list)} {extension} files.")    
    
    return return_list    

In [3]:
data_path = Path('.') / 'data'

In [4]:
data_path_list = create_path_list(data_path)

data contains 101 .json files.


### Process song data

From this data 2 tables are created:
- songs - songs in music database -> song_id, title, artist_id, year, duration
- artists - artists in music database -> artist_id, name, location, latitude, longitude

Create a list of tuples for a direct insert into the Postgres table. Validate each row, concatenate to a Dataframe, transform to a list of tuples and return.

https://realpython.com/python-exceptions/#the-assertionerror-exception

Observations:
- Songs: year contains 0 values

#### Songs table

In [5]:
song_path_list = create_path_list(data_path / 'song_data')

data\song_data contains 71 .json files.


In [6]:
song_columns = sorted(['title', 'song_id', 'year', 'duration'])  

In [None]:
def df_assertions(df: pd.DataFrame, target_columns: list) -> None:
    """Assert statements to make sure the retrieved data is valid and clean before insertion into the Postgres table."""     
    df_cols = df.columns.str.lower() 
    found_cols = [x for x in df_cols if x in target_columns]
    
    assert sorted(found_cols) == sorted(target_columns), f"The columns do not match."
    assert df[target_columns].isnull().values.any() == False, f"Missing values in not nullable target columns."
    # assert data types of each column   
    # assert any constraint important to the Postgres Table 
    
    return None

In [None]:
song_df = pd.DataFrame(columns=song_columns)

for idx, file in enumerate(song_path_list):
    temp_df = pd.read_json(file, lines=True)    
    try:
        df_assertions(temp_df, song_columns)
    except AssertionError as error:
        print(f"Error @ file {idx} {file}: {error} NOTE: this file will not be inserted.")
    else:
        song_df = song_df.append(temp_df[song_columns], ignore_index=True)

In [7]:
song_table_data = list(song_df.to_records(index=False))

NameError: name 'song_df' is not defined

#### Artists table

In [None]:
example_df[['artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']]

In [None]:
artist_columns = ['artist_id', 'artist_name', 'artist_location', 'artist_latitude', 'artist_longitude']

In [None]:
artist_df = pd.DataFrame(columns=artist_columns)

for idx, file in enumerate(song_path_list):
    temp_df = pd.read_json(file, lines=True)    
    artist_df = artist_df.append(temp_df[artist_columns], ignore_index=True)

In [None]:
artist_df.info()

In [None]:
artist_df

In [None]:
artist_df.replace({'': None, np.nan: None})

### Logging data

There are 2 tables we want to extract from logging data
- users
- time

Challenge:
- users needs to be filtered on ['auth']=='Logged In']
- time needs to be filterd on ['page']=='NextSong'
- Ideally, we do not want to load the data twice since that would cause A lot of overhead...

Create a function which takes temp_df as input, and returns a temp_users and temp_time df

In [61]:
log_path_list = create_path_list(data_path / 'log_data')

data\log_data contains 30 .json files.


In [62]:
log_df = pd.read_json(log_path_list[0], lines=True)

In [65]:
def expands_dfs(temp_df):
    """Returns 2 dataframes which can be used for the users and time tables in Postgres."""
    users_columns = ['userId', 'firstName', 'lastName', 'gender', 'level']
    time_columns = ['ts']
    
    song_df = temp_df[temp_df['auth']=='Logged In']
    time_df = temp_df[temp_df['page']=='NextSong']
    
    return (song_df[users_columns], time_df[time_columns])

In [66]:
expands_dfs(log_df)

(    userId firstName lastName gender level
 0       39    Walter     Frye      M  free
 1        8    Kaylee  Summers      F  free
 2        8    Kaylee  Summers      F  free
 3        8    Kaylee  Summers      F  free
 4        8    Kaylee  Summers      F  free
 5        8    Kaylee  Summers      F  free
 6        8    Kaylee  Summers      F  free
 7        8    Kaylee  Summers      F  free
 8        8    Kaylee  Summers      F  free
 9        8    Kaylee  Summers      F  free
 10      10    Sylvie     Cruz      F  free
 11      26      Ryan    Smith      M  free
 12      26      Ryan    Smith      M  free
 13      26      Ryan    Smith      M  free
 14     101    Jayden      Fox      M  free,
                ts
 2   1541106106796
 4   1541106352796
 5   1541106496796
 6   1541106673796
 7   1541107053796
 8   1541107493796
 9   1541107734796
 10  1541108520796
 12  1541109125796
 13  1541109325796
 14  1541110994796)

In [67]:
song_df = pd.DataFrame()
time_df = pd.DataFrame()

for idx, file in enumerate(log_path_list):
    temp_df = pd.read_json(file, lines=True)   
    
    temp_song, temp_time = expands_dfs(temp_df)
    
    song_df = song_df.append(temp_song, ignore_index=True)
    time_df = time_df.append(temp_time, ignore_index=True)

In [69]:
time_df

Unnamed: 0,ts
0,1541106106796
1,1541106352796
2,1541106496796
3,1541106673796
4,1541107053796
...,...
6815,1543603205796
6816,1543603476796
6817,1543603678796
6818,1543603884796


In [75]:
time_df[]

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [77]:
expand_ms(time_df['ts']).drop_duplicates(subset='start_time')

Unnamed: 0,start_time,hour,day,month,year,weekday
0,2018-11-01 21:01:46.796,21,44,11,2018,3
1,2018-11-01 21:05:52.796,21,44,11,2018,3
2,2018-11-01 21:08:16.796,21,44,11,2018,3
3,2018-11-01 21:11:13.796,21,44,11,2018,3
4,2018-11-01 21:17:33.796,21,44,11,2018,3
...,...,...,...,...,...,...
6815,2018-11-30 18:40:05.796,18,48,11,2018,4
6816,2018-11-30 18:44:36.796,18,48,11,2018,4
6817,2018-11-30 18:47:58.796,18,48,11,2018,4
6818,2018-11-30 18:51:24.796,18,48,11,2018,4


In [71]:
def expand_ms(ms_series: pd.Series) -> pd.DataFrame:
    """Expands a Pandas series with milliseconds with several datetime attributes."""
    df = pd.DataFrame({'start_time': pd.to_datetime(ms_series, unit='ms')})
    
    df['hour'] = df['start_time'].dt.hour
    df['day'] = df['start_time'].dt.day
    df['day'] = df['start_time'].dt.isocalendar().week  
    df['month'] = df['start_time'].dt.month
    df['year'] = df['start_time'].dt.year
    df['weekday'] = df['start_time'].dt.weekday

    return df    

In [None]:
for time in time_df:
    pd.DataFrame({'start_time': pd.to_datetime(time, unit='ms')}

In [73]:
type(time_df)

pandas.core.frame.DataFrame

In [8]:
log_path_list = create_path_list(data_path / 'log_data')

data\log_data contains 30 .json files.


In [25]:
log_df = pd.read_json(log_path_list[0], lines=True)

In [26]:
log_df.head()

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919166796,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540344794796,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
2,Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,You Gotta Be,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
3,,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1540344794796,139,,200,1541106132796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
4,Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Flat 55,200,1541106352796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8


In [27]:
log_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   artist         11 non-null     object 
 1   auth           15 non-null     object 
 2   firstName      15 non-null     object 
 3   gender         15 non-null     object 
 4   itemInSession  15 non-null     int64  
 5   lastName       15 non-null     object 
 6   length         11 non-null     float64
 7   level          15 non-null     object 
 8   location       15 non-null     object 
 9   method         15 non-null     object 
 10  page           15 non-null     object 
 11  registration   15 non-null     int64  
 12  sessionId      15 non-null     int64  
 13  song           11 non-null     object 
 14  status         15 non-null     int64  
 15  ts             15 non-null     int64  
 16  userAgent      15 non-null     object 
 17  userId         15 non-null     int64  
dtypes: float64(1

In [None]:
# songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

In [None]:
log_df['page'].value_counts()

In [None]:
next_songs = log_df[log_df['page']=='NextSong']

#### Users -> [user_id, first_name, last_name, gender, level]
Levels feels like something we should update... hence the hints on the cheat sheet :D -> take time into account, the files are in chronological order

- Per file we should only keep one row per userid
- If the userid already exists, check if we can update gender and / or level, else skip
- make sure to read the files in the correct order

In [13]:
users_dtype_dict = {'userId': int,
                    'firstName': str, 
                    'lastName': str,
                    'gender': str,
                    'level': str} 

In [14]:
users_columns = ['userId', 'firstName', 'lastName', 'gender', 'level']

In [15]:
log_path_list = create_path_list(data_path / 'log_data')

data\log_data contains 30 .json files.


#### make sure the data files are in correct chronological order + make sure the data is appended in chronological order!! this way we can keep track whom changes their level accordingly

In [7]:
sorted(log_path_list)

[WindowsPath('data/log_data/2018/11/2018-11-01-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-02-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-03-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-04-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-05-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-06-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-07-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-08-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-09-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-10-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-11-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-12-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-13-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-14-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-15-events.json'),
 WindowsPath('data/log_data/2018/11/2018-11-16-events.json'),
 Windows

In [10]:
def df_log_assertions(df: pd.DataFrame, target_columns: list, not_nullable_columns: list = None) -> None:
    """Assert statements to make sure the retrieved data is valid and clean before insertion into the Postgres table."""     
    low_df_columns = [x.lower() for x in df.columns]
    low_target_columns = [x.lower() for x in target_columns] 
    
    found_cols = [x for x in low_df_columns if x in low_target_columns]    
    assert sorted(found_cols) == sorted(low_target_columns), f"The columns do not match."
    
    if not_nullable_columns:
        assert df[not_nullable_columns].isnull().values.any() == False, f"Missing values in not nullable target columns."
    else:
        assert df[target_columns].isnull().values.any() == False, f"Missing values in the target columns, if allowed please specify these columns."
    
    return None

In [11]:
def create_log_insert_list(file_path_list: list, insert_columns: list, primary_keys: list,
                           dtype_dict: dict, not_nullable_columns: list = None) -> list:
    """Takes a raw file_path list as input, performs several validation checks, and returns a list of tuples
    ready for insertion in Postgres."""
    target_df = pd.DataFrame(columns=insert_columns)

    for idx, file in enumerate(file_path_list):
        temp_df = pd.read_json(file, lines=True)   

        try:
            df_log_assertions(temp_df, insert_columns, not_nullable_columns)
        except AssertionError as error:
            print(f"AssertionError @ file {idx} {file}: {error} NOTE: this file will not be inserted.")
        else:
            try:
                # we do not want to store non logged in users
                temp_df = temp_df[temp_df['auth']=='Logged In']
                temp_df[insert_columns] = temp_df[insert_columns].astype(dtype_dict)
            except ValueError as error:
                print(f"ValueError @ file {idx} {file}: {error} NOTE: this file will not be inserted")
            else:
                target_df = target_df.append(temp_df[insert_columns], ignore_index=True)
                
    insert_df = target_df.drop_duplicates(subset=primary_keys)
    print(f"There were {target_df.shape[0]-insert_df.shape[0]} duplicate primary keys removed from the insert dataframe")
                
    if not_nullable_columns:
        insert_df = insert_df.replace({'': None, np.nan: None})  # Postgres does not recognize '' or np.nan as NULL
    
    # This list comprehension converts the numpy dtypes to standard python dtypes which are necessary for Postgres
    return (insert_df, [tuple(row) for row in insert_df.itertuples(index=False)])

In [16]:
log_df, log_tuple_list = create_log_insert_list(file_path_list=log_path_list,
                                                insert_columns=users_columns,
                                                primary_keys=['userId', 'gender', 'level'],
                                                dtype_dict=users_dtype_dict,
                                                not_nullable_columns=['userId', 'level'])

There were 7665 duplicate primary keys removed from the insert dataframe


In [17]:
log_df

Unnamed: 0,userId,firstName,lastName,gender,level
0,39,Walter,Frye,M,free
1,8,Kaylee,Summers,F,free
10,10,Sylvie,Cruz,F,free
11,26,Ryan,Smith,M,free
14,101,Jayden,Fox,M,free
...,...,...,...,...,...
6097,21,Preston,Sanders,M,free
6346,38,Gianna,Jones,F,free
6376,5,Elijah,Davis,M,free
6688,82,Avery,Martinez,F,paid


In [41]:
def _df_assertions(df: pd.DataFrame, target_columns: list, not_nullable_columns: list = None) -> None:
    """Assert statements to make sure the retrieved data is valid and clean before insertion into the Postgres table."""     
    low_df_columns = [x.lower() for x in df.columns]
    low_target_columns = [x.lower() for x in target_columns] 
    
    found_cols = [x for x in low_df_columns if x in low_target_columns]    
    assert sorted(found_cols) == sorted(low_target_columns), f"The columns do not match."
    
    if not_nullable_columns:
        assert df[not_nullable_columns].isnull().values.any() == False, f"Missing values in not nullable target columns."
    else:
        assert df[target_columns].isnull().values.any() == False, f"Missing values in the target columns, if allowed please specify these columns."
    
    return None

In [50]:
def _expand_ms(ms_series: pd.Series) -> pd.DataFrame:
    """Expands a Pandas series of milliseconds with several datetime attributes."""
    df = pd.DataFrame({'start_time': pd.to_datetime(ms_series, unit='ms')})
    
    df['hour'] = df['start_time'].dt.hour
    df['day'] = df['start_time'].dt.day
    df['day'] = df['start_time'].dt.isocalendar().week  
    df['month'] = df['start_time'].dt.month
    df['year'] = df['start_time'].dt.year
    df['weekday'] = df['start_time'].dt.weekday

    return df  

In [51]:
def create_log_insert_lists(file_path_list: list, insert_columns: list, primary_keys: list,
                            dtype_dict: dict, not_nullable_columns: list = None) -> list:
    """Takes a raw file_path list as input, performs several validation checks, and returns a list of tuples
    ready for insertion in Postgres."""
    target_df = pd.DataFrame(columns=insert_columns)

    for idx, file in enumerate(file_path_list):
        temp_df = pd.read_json(file, lines=True)   

        try:
            _df_log_assertions(temp_df, insert_columns, not_nullable_columns)
        except AssertionError as error:
            print(f"AssertionError @ file {idx} {file}: {error} NOTE: this file will not be inserted.")
        else:
            try:
                # we do not want to store non logged in users
                temp_df = temp_df[temp_df['auth']=='Logged In']
                temp_df[insert_columns] = temp_df[insert_columns].astype(dtype_dict)
            except ValueError as error:
                print(f"ValueError @ file {idx} {file}: {error} NOTE: this file will not be inserted")
            else:
                target_df = target_df.append(temp_df[insert_columns], ignore_index=True)
                
    insert_users_df = target_df.drop_duplicates(subset=primary_keys)
    print(f"There were {target_df.shape[0]-insert_users_df.shape[0]} duplicate primary keys removed from the insert dataframe")
    
    insert_time_df = _expand_ms(target_df['ts'])
                
    if not_nullable_columns:
        insert_users_df = insert_users_df.replace({'': None, np.nan: None})  # Postgres does not recognize '' or np.nan as NULL
    
    # The list comprehension converts the numpy dtypes to standard python dtypes which are necessary for Postgres
    return ([tuple(row) for row in insert_users_df.itertuples(index=False)], [tuple(row) for row in insert_time_df.itertuples(index=False)])

In [47]:
log_dtype_dict = {'userId': int,
                  'firstName': str, 
                  'lastName': str,
                  'gender': str,
                  'level': str,
                  'ts': int} 

In [48]:
log_columns = ['userId', 'firstName', 'lastName', 'gender', 'level', 'ts']

In [52]:
users_table_data, time_table_data = create_log_insert_lists(file_path_list=log_path_list,
                                                            insert_columns=log_columns,
                                                            primary_keys=['userId', 'gender', 'level'],
                                                            dtype_dict=log_dtype_dict,
                                                            not_nullable_columns=['userId', 'level'])

There were 7665 duplicate primary keys removed from the insert dataframe


In [57]:
len(time_table_data)

7770

### Time table
- start_time, hour, day, week, month, year, weekday
- ts (timestamp) of records in log data with page=NextSong
- Convert back to ms when joining on the other tables..if necessary..

#### Extract Data for Time Table
- Filter records by `NextSong` action
- Convert the `ts` timestamp column to datetime
  - Hint: the current timestamp is in milliseconds
- Extract the timestamp, hour, day, week of year, month, year, and weekday from the `ts` column and set `time_data` to a list containing these values in order
  - Hint: use pandas' [`dt` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) to access easily datetimelike properties.
- Specify labels for these columns and set to `column_labels`
- Create a dataframe, `time_df,` containing the time data for this file by combining `column_labels` and `time_data` into a dictionary and converting this into a dataframe

In [58]:
log_path_list = create_path_list(data_path / 'log_data')

data\log_data contains 30 .json files.


In [59]:
log_df = pd.read_json(log_path_list[0], lines=True)

In [60]:
log_df

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,,Logged In,Walter,M,0,Frye,,free,"San Francisco-Oakland-Hayward, CA",GET,Home,1540919166796,38,,200,1541105830796,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",39
1,,Logged In,Kaylee,F,0,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Home,1540344794796,139,,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
2,Des'ree,Logged In,Kaylee,F,1,Summers,246.30812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,You Gotta Be,200,1541106106796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
3,,Logged In,Kaylee,F,2,Summers,,free,"Phoenix-Mesa-Scottsdale, AZ",GET,Upgrade,1540344794796,139,,200,1541106132796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
4,Mr Oizo,Logged In,Kaylee,F,3,Summers,144.03873,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Flat 55,200,1541106352796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
5,Tamba Trio,Logged In,Kaylee,F,4,Summers,177.18812,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Quem Quiser Encontrar O Amor,200,1541106496796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
6,The Mars Volta,Logged In,Kaylee,F,5,Summers,380.42077,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Eriatarka,200,1541106673796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
7,Infected Mushroom,Logged In,Kaylee,F,6,Summers,440.2673,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Becoming Insane,200,1541107053796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
8,Blue October / Imogen Heap,Logged In,Kaylee,F,7,Summers,241.3971,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Congratulations,200,1541107493796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8
9,Girl Talk,Logged In,Kaylee,F,8,Summers,160.15628,free,"Phoenix-Mesa-Scottsdale, AZ",PUT,NextSong,1540344794796,139,Once again,200,1541107734796,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",8


In [20]:
log_df['ts'].head()

0    1541105830796
1    1541106106796
2    1541106106796
3    1541106132796
4    1541106352796
Name: ts, dtype: int64

In [21]:
def expand_ms(ms_series: pd.Series) -> pd.DataFrame:
    """Expands a Pandas series with milliseconds with several datetime attributes."""
    df = pd.DataFrame({'start_time': pd.to_datetime(ms_series, unit='ms')})
    
    df['hour'] = df['start_time'].dt.hour
    df['day'] = df['start_time'].dt.day
    df['day'] = df['start_time'].dt.isocalendar().week  
    df['month'] = df['start_time'].dt.month
    df['year'] = df['start_time'].dt.year
    df['weekday'] = df['start_time'].dt.weekday

    return df    

In [69]:
expand_ms(log_df['ts'])

Unnamed: 0,start_time,hour,day,month,year,weekday
0,2018-11-01 20:57:10.796,20,44,11,2018,3
1,2018-11-01 21:01:46.796,21,44,11,2018,3
2,2018-11-01 21:01:46.796,21,44,11,2018,3
3,2018-11-01 21:02:12.796,21,44,11,2018,3
4,2018-11-01 21:05:52.796,21,44,11,2018,3
5,2018-11-01 21:08:16.796,21,44,11,2018,3
6,2018-11-01 21:11:13.796,21,44,11,2018,3
7,2018-11-01 21:17:33.796,21,44,11,2018,3
8,2018-11-01 21:24:53.796,21,44,11,2018,3
9,2018-11-01 21:28:54.796,21,44,11,2018,3


In [23]:
expand_ms(log_df['ts']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   start_time  15 non-null     datetime64[ns]
 1   hour        15 non-null     int64         
 2   day         15 non-null     UInt32        
 3   month       15 non-null     int64         
 4   year        15 non-null     int64         
 5   weekday     15 non-null     int64         
dtypes: UInt32(1), datetime64[ns](1), int64(4)
memory usage: 803.0 bytes


In [24]:
list(expand_ms(log_df['ts']).columns)

['start_time', 'hour', 'day', 'month', 'year', 'weekday']

### Songplays

#### Extract Data and Songplays Table
This one is a little more complicated since information from the songs table, artists table, and original log file are all needed for the `songplays` table. Since the log file does not specify an ID for either the song or the artist, you'll need to get the song ID and artist ID by querying the songs and artists tables to find matches based on song title, artist name, and song duration time.
- Implement the `song_select` query in `sql_queries.py` to find the song ID and artist ID based on the title, artist name, and duration of a song.
- Select the timestamp, user ID, level, song ID, artist ID, session ID, location, and user agent and set to `songplay_data`

#### Insert Records into Songplays Table
- Implement the `songplay_table_insert` query and run the cell below to insert records for the songplay actions in this log file into the `songplays` table. Remember to run `create_tables.py` before running the cell below to ensure you've created/resetted the `songplays` table in the sparkify database.