# Initial Exploration of Music Trends of a User

# To-do List

- Write code that loads all Spotify music history files into pandas
    - So that this project can be used by other people
    - Shouldn't depend on specific file names, only on the correct kind of file being located within a directory
- Determine if Spotify API is needed
    - ex. pulling more metadata about songs such as genre tags
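A minimal sketch of the first to-do item, assuming every history file is a JSON array of records sitting in one directory. The function name `load_streaming_history`, the `*.json` pattern, and the demo file names are placeholders I made up, not part of the Spotify export spec.

```python
from pathlib import Path
import json
import tempfile

import pandas as pd


def load_streaming_history(data_dir, pattern="*.json"):
    """Read every JSON file matching `pattern` in `data_dir` into one DataFrame."""
    files = sorted(Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")
    frames = [pd.read_json(path) for path in files]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        records = [{"ts": f"2020-01-0{i + 1}T00:00:00Z", "ms_played": 1000 * (i + 1)}]
        Path(tmp, f"Streaming_History_Audio_{i}.json").write_text(json.dumps(records))
    demo = load_streaming_history(tmp)

print(demo.shape)
```

A narrower pattern would keep unrelated JSON files in the same folder from being picked up.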

# Imports

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Data File Imports

## Separate File Imports
    Converted to RAW format cells so that code doesn't execute

### Original Data Imports

### Combine into a single DF
    - Can separate by date or other criteria as needed
    - Easier to drop unneeded cols from all df's
    - Export as a JSON so that I can reduce total code of notebook
        - Convert prev. cells to non-coding cells and simply import the full_history file

Confirmed that cols are shared across all df's
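A rough sketch of how the shared-cols check and the combine step could look. The two small frames are made-up stand-ins for the per-file DataFrames, and `full_history` is just an illustrative name.

```python
import pandas as pd

# Made-up stand-ins for the per-file DataFrames.
df_a = pd.DataFrame({"ts": ["2016-05-01T10:00:00Z"], "ms_played": [1000]})
df_b = pd.DataFrame({"ts": ["2017-05-01T10:00:00Z"], "ms_played": [2000]})
frames = [df_a, df_b]

# Every frame should expose the same set of columns before combining.
reference_cols = set(frames[0].columns)
cols_match = all(set(frame.columns) == reference_cols for frame in frames)
print("Cols shared across all df's:", cols_match)

# Stack the frames and renumber the index from 0.
full_history = pd.concat(frames, ignore_index=True)

# orient="records" gives a list of row objects, similar to a list-of-records file.
json_text = full_history.to_json(orient="records")
print(json_text)
```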

### Export as a file
Keeping as JSON for consistency with other files, otherwise arbitrary choice.

## Import of Full History File

**For col descriptions, see file "ReadMeFirst_ExtendedStreamingHistory.pdf" included in this project**

In [None]:
df_music = pd.read_json('Data/All_Streaming_History_Audio_2016-2024.json')

In [None]:
print(df_music.info())
df_music.head()

# Early Observations/Notes

Tentative plans for each col
- "ts"
    - convert to datetime
    - Best if split into separate date & time cols
- "username"
    - **drop**
    - will have only 1 unique value (because all data is from my own Spotify history)
    - I plan to drop this col separately from other code so that the cell can easily be converted to non-code format if more than a single user is included in the analysis
- "platform"
    - **drop**
    - No apparent use for which platform is used in a music trends analysis
- "ms_played"
    - convert to datetime (time specifically)
    - Interested in accuracy to the second. No apparent reason for more accuracy
- 'conn_country'
    - **drop**
    - All same value (US)
- 'ip_addr_decrypted'
    - **drop**, col is N/A
- 'user_agent_decrypted'
    - Needs looking into. Could all be same value or maybe hold helpful info
- 'master_metadata_track_name'
    - change col name to "track_name"
- 'master_metadata_album_artist_name'
    - change col name to "artist_name"
- 'master_metadata_album_album_name'
    - change col name to "album_name"
    - Potentially drop.
        - I don't usually care about which album a song is from when I listen to it.
        - If combined with release date, some patterns may be able to be found
- 'spotify_track_uri'
    - Rename to "track_uri"
    - Will be helpful if/when further information about the track needs to be found
    - Will be helpful for when the song recommendation algorithm is built
    - Can be helpful for spotify_better_shuffler project
- 'episode_name'
    - **drop**, col is N/A
- 'episode_show_name'
    - **drop**, col is N/A
- 'spotify_episode_uri'
    - **drop**, col is N/A
- 'reason_start'
    - Need to determine meaning for all values (not fully elaborated in the Spotify-provided document)
- 'reason_end'
    - Need to determine meaning for all values (not fully elaborated in the Spotify-provided document)
- 'shuffle'
    - keep
- 'skipped'
    - keep
    - There are NaN values
    - Need to investigate whether NaN equates to "No" (not skipped)
- 'offline'
    - Will be a fairly reliable indicator of if the song is included in my "Liked" songs playlist
        - My personal "offline" strategy for spotify is to keep my "Liked" songs playlist downloaded to my device(s)
        - It is a recent development that I have other playlists downloaded to devices for offline use
            - Estimated start of Jan 2023 (more likely March or May but as I am unsure, I will default to JAN)
- 'offline_timestamp'
    - Needs looking into.
- 'incognito_mode'
    - **Drop**
    - No apparent use in a music trends analysis
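For the 'skipped' question in the list above, one possible starting point is to look at how playback ended for the rows where 'skipped' is NaN. This is only a sketch on made-up rows; the `reason_end` values shown are placeholders, not confirmed values from my data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df_music.
demo = pd.DataFrame({
    "skipped": [True, False, np.nan, np.nan, np.nan],
    "reason_end": ["fwdbtn", "trackdone", "trackdone", "trackdone", "fwdbtn"],
})

# Boolean mask of the rows with no 'skipped' value.
missing_skip = demo["skipped"].isna()
print(missing_skip.sum(), "rows have no 'skipped' value")

# How did playback end for the rows where 'skipped' is missing?
reason_counts = demo.loc[missing_skip, "reason_end"].value_counts(dropna=False)
print(reason_counts)
```

If the NaN rows mostly end the same way as the rows marked not skipped, treating NaN as "No" becomes easier to justify.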

**Desired Information**
- Release dates of songs or albums
- Decade that the music was released in
    - Can help detect patterns in preferences
- genre that spotify assigns to the track
    - OR All of the genre tags that spotify has labeled the track with
- Which songs and genres have been listened to consistently from the first year through the current year

# Early Cleaning

- Removing information that is obviously not applicable to the analysis

## Drop Cols Round 1

In [None]:
# errors="ignore" avoids a KeyError if the cell is run again after the cols are already dropped
df_music = df_music.drop(["platform", 'conn_country', 'ip_addr_decrypted', 'episode_name', 'episode_show_name', 'spotify_episode_uri', 'incognito_mode'], axis=1, errors="ignore")

### Drop username col

**If more than one user is being analyzed:** prevent this cell from executing

In [None]:
df_music = df_music.drop(["username"], axis=1, errors="ignore")

In [None]:
df_music.head()

## Investigate cols that need keep/drop determination

'user_agent_decrypted' , 'offline_timestamp'

Determinations:
- 'user_agent_decrypted'
    - Drop
    - "unknown" is the value in the vast majority of records, and the other values don't apply to a music trends analysis
    - Empty string is the second most common value
- 'offline_timestamp'
    - Drop
    - Value of 0 (zero) is the vast majority at 38408 records; value of 1 is the second most common at 306 records
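The determinations above come from looking at value frequencies. A sketch of that kind of check on made-up values (not my real counts):

```python
import pandas as pd

# Made-up values standing in for the two cols under investigation.
demo = pd.DataFrame({
    "user_agent_decrypted": ["unknown", "unknown", "", None],
    "offline_timestamp": [0, 0, 1, 0],
})

summaries = {}
for col in ["user_agent_decrypted", "offline_timestamp"]:
    # dropna=False keeps missing values visible as their own row.
    summary = demo[col].value_counts(dropna=False).to_frame("count")
    summary["share"] = summary["count"] / len(demo)
    summaries[col] = summary
    print(col)
    print(summary, end="\n\n")
```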

## Drop Cols Round 2

In [None]:
df_music = df_music.drop(['user_agent_decrypted', 'offline_timestamp'], axis=1, errors="ignore")

df_music.head()

# Early Exploration

In [None]:
df_music.head()

In [None]:
df_music.info()

Cols with nulls: ['master_metadata_track_name', 'master_metadata_album_artist_name', 'master_metadata_album_album_name', 'spotify_track_uri', 'reason_end', 'skipped']

## Quick analysis

In [None]:
df_music.describe(include='all')

Inconsistency:

    Unique values for "ts" should equal observation count.
    Requires looking into.
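One way to start looking into this is to pull out the rows that share a "ts" value and check whether they are exact duplicates or different plays logged at the same instant. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: the first two share a timestamp but differ in ms_played.
demo = pd.DataFrame({
    "ts": ["2020-01-01T10:00:00Z", "2020-01-01T10:00:00Z", "2020-01-01T10:03:00Z"],
    "master_metadata_track_name": ["A", "A", "B"],
    "ms_played": [0, 180000, 200000],
})

print("rows:", len(demo), "| unique ts:", demo["ts"].nunique())

# keep=False flags every member of a duplicate group, not just the repeats.
shared_ts = demo[demo.duplicated(subset="ts", keep=False)]
exact_dupes = demo[demo.duplicated(keep=False)]

print(shared_ts)
print("fully identical rows:", len(exact_dupes))
```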

**Most played song titles and most played title/artist combos**

In [None]:
print(df_music[['master_metadata_track_name']].value_counts().sort_values(ascending=False).head(10))
print()
print(df_music[['master_metadata_track_name', 'master_metadata_album_artist_name']].value_counts().sort_values(ascending=False).head(10))

**Note:** Delete observations for the "Binaural Beats (Restful Sleep)" track. It was listened to for a time as a sleep aid, not for actual listening pleasure.
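A possible way to remove those observations, sketched on made-up rows. It matches on the track title only, which assumes no other track I care about shares that exact title.

```python
import pandas as pd

# Made-up rows standing in for df_music (col names as they are at this point).
demo = pd.DataFrame({
    "master_metadata_track_name": [
        "Binaural Beats (Restful Sleep)", "Drunken Sailor", None,
    ],
})

sleep_aid_title = "Binaural Beats (Restful Sleep)"

# .eq() returns False for missing titles, so rows with a null title are kept.
is_sleep_aid = demo["master_metadata_track_name"].eq(sleep_aid_title)
cleaned = demo[~is_sleep_aid].reset_index(drop=True)

print(len(demo), "->", len(cleaned))
```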

Check the artists associated with the two most played track titles (Drunken Sailor and Spanish Ladies)
    Assumption: there will be many artists for each track title

In [None]:
df_music[df_music['master_metadata_track_name']=='Drunken Sailor'][['master_metadata_album_artist_name']].value_counts().sort_values(ascending=False)

In [None]:
df_music[df_music['master_metadata_track_name']=='Spanish Ladies'][['master_metadata_album_artist_name']].value_counts().sort_values(ascending=False)

# Data Cleaning

- Col name edits
- split "ts" col into separate date & time cols
    - Note to determine if time will be helpful data to this analysis
- dtype conversions (ex. "ts" to datetime)
    - convert "ms_played" to datetime (to seconds accuracy)
- Determine if dropping 'offline_timestamp' col (done: dropped in Round 2 above)
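For the "ms_played" item above, one option is a timedelta rather than a datetime, since the value is a duration and not a point in time. That choice is my assumption, and the new col names below are placeholders. A sketch on made-up values:

```python
import pandas as pd

# Made-up play durations in milliseconds.
demo = pd.DataFrame({"ms_played": [1499, 61600, 215000]})

# Interpret the integers as milliseconds, then round to whole seconds.
demo["played"] = pd.to_timedelta(demo["ms_played"], unit="ms").dt.round("s")

# Plain integer seconds can be handier for sums and plots.
demo["seconds_played"] = demo["played"].dt.total_seconds().astype(int)

print(demo)
```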

## Col name edits

In [None]:
df_music.rename({'master_metadata_track_name': "track_name", 'master_metadata_album_artist_name' : "artist_name", 'master_metadata_album_album_name' : "album_name", 'spotify_track_uri' : "track_uri"}, axis=1, inplace=True)

In [None]:
df_music.head(2)

## "ts" col edits

### Date col

In [None]:
df_music['date'] = pd.to_datetime(pd.to_datetime(df_music['ts'], yearfirst=True).dt.date, yearfirst=True)

In [None]:
print(df_music[['date']].dtypes)
df_music[['date']].head()

### Time col

In [None]:
df_music['time'] = pd.to_datetime(df_music['ts'], yearfirst=True).dt.time

In [None]:
print(df_music[['time']].dtypes)
df_music[['time']].head()

**The "time" col holds `datetime.time` objects even though the reported dtype is "object".**
Proof below:

In [None]:
print(df_music['time'][0].hour)
print(df_music['time'][0].minute)
print(df_music['time'][0].second)
df_music['time'][0]

### Drop ts col

In [None]:
df_music.drop(['ts'], axis=1, errors='ignore', inplace=True)

In [None]:
df_music.head(2)

# Understanding non-obvious cols