# 🎵 Apple Music Stream Data Analysis

In this project, we explore my personal music streaming data from Apple Music, the music and video streaming service developed by Apple Inc. The dataset records my own listening activity on the platform.

The dataset contains many fields, such as:

- List of songs played
- List of Singers/Artists
- Start time and position of songs
- Music Labels

and much more. 

I got this data from Apple's privacy website.

## Requesting and downloading data

Follow these steps to request your personal data from Apple:
- Go to privacy.apple.com
- Log in to your account
- Click on **Request a copy of your data**
- Be sure to check **Apple Media Services Information** and click Continue at the bottom
- Choose the default size and click on **Complete Request**

See the screenshots below for reference.

![Picture title](image-20210605-194813.png)

![Picture title](image-20210605-194845.png)

![Picture title](image-20210605-194900.png)

## Data Preparation and cleaning


Steps:

1. Load the dataset (csv file)
2. Check the shape and columns of the DataFrame
3. Check for missing values
4. Check the basic statistics of columns


In [None]:
file_path = '../input/applemusicstreaming/Apple Music Play Activity.csv'

In [None]:
# Install plotly express if not already installed
!pip install plotly_express --quiet

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px

pd.set_option('display.max_columns', None)

In [None]:
from matplotlib import rcParams
# figure size in inches
rcParams['figure.figsize'] = 12, 8

In [None]:
music_df = pd.read_csv(file_path)

In [None]:
music_df.head()

In [None]:
print("Rows x Columns: {}".format(music_df.shape))

In [None]:
# check for available columns
music_df.columns.to_list()

Display basic statistics of the numerical variables

In [None]:
music_df.describe()

In [None]:
music_df.info()

Check the number of missing values in each column

In [None]:
music_df.isnull().sum()
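
Raw counts are hard to compare across columns, so a percentage view can help. The snippet below is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, not the real export); the same expression should work on `music_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for music_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Artist Name": ["A", "B", None, "D"],
    "Genre": [None, None, None, None],
    "Play Duration Milliseconds": [1000, 2000, 3000, np.nan],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns at 100% are the ones that carry no information at all.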

In [None]:
sns.heatmap(music_df.isnull())
plt.show()

Our DataFrame has many columns that contain only NULL values. Such columns have to be removed from the dataset. Our goal is to make the above heatmap as dark as possible (i.e. without any white marks).

In [None]:
# collect the columns in which every value is missing
nans = [col for col in music_df.columns if music_df[col].isnull().all()]

In [None]:
# drop the above columns from the dataframe
music_df.drop(nans, axis=1, inplace=True)
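
As an alternative to building the column list by hand, pandas can drop fully empty columns in one call with `dropna(axis=1, how="all")`. This is a small sketch on a made-up frame (`toy_df` is an assumption, not the real data):

```python
import pandas as pd

# Toy frame: one useful column and one column that is entirely missing
toy_df = pd.DataFrame({
    "Content Name": ["Song 1", "Song 2", None],
    "Empty Column": [None, None, None],
})

# how="all" removes a column only when every value in it is missing
cleaned = toy_df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Unlike `inplace=True`, this returns a new DataFrame, so the original stays available for comparison.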

In [None]:
# check for current shape of df
music_df.shape

We have reduced our DataFrame from 45 columns to 30 columns.

In [None]:
music_df.isnull().sum()

There are more columns, such as IDs, that are not going to contribute much to our analysis. We will drop these columns manually as well.

In [None]:
to_delete = ['Apple Id Number', 'Build Version', 'Client IP Address', 'Device Identifier', 'Metrics Bucket Id', 'Metrics Client Id', 'UTC Offset In Seconds', 'Store Country Name']
music_df.drop(to_delete, axis=1, inplace=True)

In [None]:
music_df.isnull().sum()

In [None]:
music_df.shape

### Converting timestamp columns to actual timestamps

The timestamp columns (Event End Timestamp, Event Start Timestamp and Event Received Timestamp) are stored as strings rather than real timestamps. We have to convert these columns into actual timestamps.

In [None]:
music_df['Event End Timestamp'] = pd.to_datetime(music_df['Event End Timestamp'], format='%Y-%m-%dT%H:%M:%S')
music_df['Event Received Timestamp'] = pd.to_datetime(music_df['Event Received Timestamp'], format='%Y-%m-%dT%H:%M:%S')
music_df['Event Start Timestamp'] = pd.to_datetime(music_df['Event Start Timestamp'], format='%Y-%m-%dT%H:%M:%S')
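
If some rows do not match the expected format, `pd.to_datetime` raises an error by default. The sketch below, on made-up strings (`raw` is an assumption, not taken from the export), shows how `errors="coerce"` turns unparseable values into `NaT` so they can be counted and inspected instead.

```python
import pandas as pd

# Made-up sample: two valid timestamps and one broken value
raw = pd.Series(["2021-01-05T10:15:30", "2021-02-11T23:59:59", "not a timestamp"])

# errors="coerce" replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S", errors="coerce")
print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```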

In [None]:
music_df.head()

## Data Analysis

Questions to ask for data analysis

1. Who are the top 10 favourite artists?
2. Which are the top 20 songs played?
3. Who are the top 10 favourite content providers or music labels?
4. Which are the top 10 songs that were listened to for the longest time?
5. What is the most common reason for a song ending?
6. Which are your most loved genres?
7. Which media type do you prefer most on Apple Music?
8. Do you prefer listening to music when you are online or offline?
9. What time do you prefer to listen to music?
10. Which was the most active month?
11. Which was the most active year?
12. How much total time was spent on the platform?

### 1. Who are your top 10 favourite artists/singers/band?

In [None]:
top_10_artist = music_df['Artist Name'].value_counts()[:10]

In [None]:
fig = px.bar(top_10_artist, title="Top 10 favourite artists", labels={"index":"Artists", 'value':"No. of times song played"}, color_discrete_sequence=px.colors.qualitative.Set2)
fig.show()

### 2. Which are the top 20 songs played? (favourite songs)

In [None]:
top_20_songs = music_df['Content Name'].value_counts()[:20]

In [None]:
# (optional)
# shorten the longest song name so the axis labels stay readable
as_list = top_20_songs.index.tolist()
idx = as_list.index("I'm the One (feat. Justin Bieber, Quavo, Chance the Rapper & Lil Wayne)")
as_list[idx] = 'I am the one (ft. Justin Bieber)'
top_20_songs.index = as_list

In [None]:
fig = px.bar(top_20_songs, title="Top 20 favourite songs", labels={"index":"Songs", 'value':"No. of times song played"}, color_discrete_sequence=px.colors.qualitative.Bold)
fig.update_xaxes(tickangle=22)
fig.show()

### 3. Which are the top 10 favourite content providers?

In [None]:
top_10_labels = music_df['Content Provider'].value_counts()[:10]

In [None]:
# (optional)
# shorten the longest label name so the axis labels stay readable
as_list = top_10_labels.index.tolist()
idx = as_list.index("Super Cassettes Industries Pvt Limited a.k.a. T-Series")
as_list[idx] = 'T-Series'
top_10_labels.index = as_list

In [None]:
fig = px.bar(top_10_labels, title="Top 10 favourite labels", labels={"index":"Music Labels", 'value':"No. of times song label played"}, color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_xaxes(tickangle=25)
fig.show()

What are the top songs played from a particular label?

In [None]:
def top_10_song_of_label(label):
    """
    Print and plot the 10 most played songs from a particular label (content provider).
    """
    # keep only the rows for this label, then count plays per song
    label_df = music_df[music_df['Content Provider'] == label]
    top_10_song = label_df['Content Name'].value_counts()[:10]
    print(top_10_song)
    fig = px.bar(top_10_song, labels={"index": "Song Names", "value": "No. of times song played", "variable":"Song name"}, title=f"Top songs from {label}")
    fig.show()

In [None]:
top_10_song_of_label('The Warner Music Group')

In [None]:
top_10_song_of_label('Super Cassettes Industries Pvt Limited a.k.a. T-Series')

### 4. Which are the top 10 songs that were listened to for the longest time?

In [None]:
top_longest_played = music_df.groupby('Content Name')['Play Duration Milliseconds'].sum().sort_values(ascending=False)

In [None]:
# Converting milliseconds to minutes
top_longest_played = top_longest_played / 60000

In [None]:
colors = px.colors.qualitative

In [None]:
fig = px.bar(top_longest_played[:10], labels={"Content Name": "Song Names", "value": "Play Time (in mins)", "variable":"Duration"}, color_discrete_sequence=colors.G10_r)
fig.show()
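
The group, sum and convert steps can also be written as one chain. This is a minimal sketch on made-up plays (`toy_df` and its durations are assumptions for illustration):

```python
import pandas as pd

# Made-up plays: "Song A" was played twice
toy_df = pd.DataFrame({
    "Content Name": ["Song A", "Song B", "Song A", "Song C"],
    "Play Duration Milliseconds": [120000, 60000, 180000, 30000],
})

# Sum the play time per song, convert milliseconds to minutes, sort descending
minutes_per_song = (
    toy_df.groupby("Content Name")["Play Duration Milliseconds"]
    .sum()
    .div(60_000)
    .sort_values(ascending=False)
)
print(minutes_per_song.head(2))
```

Note that this ranks songs by total listening time, so a long song played a few times can outrank a short song played often.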

### 5. What is the most common reason for a song ending?

In [None]:
music_df['End Reason Type'].value_counts()

In [None]:
fig = px.pie(music_df, names='End Reason Type', color_discrete_sequence=colors.Set3)
fig.show()

I don't usually listen to full songs 😂

### 6. Which are your most loved genres?

In [None]:
top_genre = music_df.Genre.value_counts()[:10]

In [None]:
fig = px.bar(top_genre, color_discrete_sequence=colors.T10_r)
fig.show()

### 7. Which media type do you prefer most on Apple Music?

In [None]:
fig = px.pie(music_df, names='Media Type', color_discrete_sequence=colors.Dark2, title="Most preferable Media Type (eg. Audio/Video)")
fig.show()

### 8. Do you prefer listening to music when you are online/offline?

In [None]:
music_df.Offline.value_counts()

In [None]:
fig = px.pie(music_df, names="Offline", title="Do you prefer listening to music Offline?")
fig.show()

Yeah, a lot! Around 38% of the time, I listen to songs while I am offline. Most probably, I close my eyes and feel the music.
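
The offline share can also be computed directly instead of being read off the pie chart. This sketch assumes the `Offline` column holds booleans; `toy_offline` is made-up data, not the real column.

```python
import pandas as pd

# Made-up flags: True means the song was played offline
toy_offline = pd.Series([True, False, False, True, False], name="Offline")

# The mean of a boolean column is the fraction of True values
offline_pct = toy_offline.mean() * 100
print("Offline share: {:.1f}%".format(offline_pct))
```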

### 9. What time do you prefer to listen to music?

In [None]:
music_df['Event Start Timestamp']

In [None]:
# converting event start timestamp to separate time section
music_df["Event Start Time"] = music_df['Event Start Timestamp'].dt.time
music_df["Event Start Time"].head()

In [None]:
hours = music_df["Event Start Time"].groupby(music_df["Event Start Timestamp"].dt.hour).count()

In [None]:
fig = px.bar(hours, title="Most active hours (24hr)", labels={"value": "count", "Event Start Timestamp":"Timings (hours)"}, color_discrete_sequence=colors.Prism)
fig.update_xaxes(dtick=1)
fig.show()
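
A `groupby` on the hour only returns hours that actually occur in the data. If the chart should show every hour from 0 to 23, including silent ones, the counts can be reindexed. This is a small sketch on made-up timestamps (`starts` is an assumption):

```python
import pandas as pd

# Made-up start times: two plays at 8 o'clock and one at 22 o'clock
starts = pd.Series(pd.to_datetime([
    "2021-03-01 08:05:00",
    "2021-03-01 08:45:00",
    "2021-03-02 22:10:00",
]))

# Count plays per hour, then fill the hours with no plays with 0
hour_counts = starts.dt.hour.value_counts().reindex(range(24), fill_value=0)
print(hour_counts[hour_counts > 0])
```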

Looking at the graph above, it seems I listen to music at any hour. HAHA!!

### 10. In which month have you listened to the most songs?

In [None]:
months = music_df["Event Start Time"].groupby(music_df["Event Start Timestamp"].dt.month).count()

In [None]:
m = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov','Dec']
fig = px.bar(months, title="Most active Months", text=m, labels={"value": "count", "Event Start Timestamp":"Months"}, color_discrete_sequence=colors.Light24)
fig.update_xaxes(dtick=1)
fig.show()
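
Passing a fixed list of 12 month names as `text` relies on every month being present in the data. A more defensive variant, sketched here on made-up dates (`starts` is an assumption), reindexes the counts to months 1 to 12 and derives the labels from the month numbers.

```python
import calendar
import pandas as pd

# Made-up start dates: two plays in January, one in March
starts = pd.Series(pd.to_datetime(["2021-01-15", "2021-01-20", "2021-03-02"]))

# Count plays per month and make sure all 12 months are present
month_counts = starts.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

# Replace month numbers with abbreviations (these depend on the system locale)
month_counts.index = [calendar.month_abbr[m] for m in month_counts.index]
print(month_counts.head(3))
```

The resulting Series can be plotted directly, with its index serving as the month labels.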

### 11. In which year have you listened to the most songs on Apple Music?

In [None]:
years = music_df["Event Start Time"].groupby(music_df["Event Start Timestamp"].dt.year).count()

In [None]:
fig = px.bar(years, title="Most active years", labels={"value": "count", "Event Start Timestamp":"Year"}, color_discrete_sequence=colors.Prism_r)
fig.update_xaxes(dtick=1)
fig.show()

### 12. Total time spent listening to music

In [None]:
total_time = music_df['Play Duration Milliseconds'].sum()

In [None]:
total_mins = total_time/60000
print("Total minutes spent: {:.2f} mins".format(total_mins))
total_hours = total_mins/60
print("Total hours spent: {:.2f} hours".format(total_hours))

In [None]:
start_time = music_df['Event End Timestamp'].min()
end_time = music_df['Event End Timestamp'].max()

In [None]:
total_possible_time = (end_time - start_time).days

In [None]:
total_possible_hours = total_possible_time * 24
print("Total time that could possibly have been spent: {:.2f} hours".format(total_possible_hours))

In [None]:
# Split the available time into listening time and the remainder,
# so the first slice is the true share of the possible hours
hours_spent_list = np.array([total_hours, total_possible_hours - total_hours])
hours_spent_list_labels = ["Actual Hours Spent", "Remaining Possible Hours"]

fig, ax = plt.subplots(figsize=(12,6))
ax.pie(hours_spent_list, labels=hours_spent_list_labels, autopct='%1.1f%%', explode=[0.2,0.2], startangle=180, shadow=True);
plt.title("Hours Spent Percentage");
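
The share of time spent listening can also be computed as a plain number. The sketch below uses made-up values (`first_event`, `last_event` and `total_play_ms` are assumptions) and `total_seconds()` rather than `.days`, since `.days` drops any partial day from the span.

```python
from datetime import datetime

# Made-up values standing in for the first and last event and the summed play time
first_event = datetime(2021, 1, 1, 0, 0, 0)
last_event = datetime(2021, 1, 11, 12, 0, 0)
total_play_ms = 54_000_000  # 15 hours of listening

# total_seconds() keeps the partial day that .days would truncate
span_hours = (last_event - first_event).total_seconds() / 3600
listened_hours = total_play_ms / 3_600_000

pct = listened_hours / span_hours * 100
print("Listened {:.1f} of {:.1f} hours ({:.2f}%)".format(listened_hours, span_hours, pct))
```

pandas `Timedelta` objects expose the same `total_seconds()` method, so this should carry over to `end_time - start_time`.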

### Daily average songs played

In [None]:
total_songs = music_df.shape[0]
print("Daily average of songs played: {:.2f} songs".format(total_songs/total_possible_time))
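
Dividing by the full date span counts days with no listening at all. For comparison, here is a sketch that computes the average both ways on made-up timestamps (`starts` is an assumption): per active day, and per calendar day.

```python
import pandas as pd

# Made-up start times: three plays on 1 May, none on 2 May, one on 3 May
starts = pd.Series(pd.to_datetime([
    "2021-05-01 09:00", "2021-05-01 21:30", "2021-05-01 22:00",
    "2021-05-03 07:15",
]))

# Average over days on which at least one song was played
plays_per_day = starts.groupby(starts.dt.date).size()
avg_active = plays_per_day.mean()

# Average over every calendar day between the first and last play
all_days = pd.date_range(starts.min().normalize(), starts.max().normalize(), freq="D")
avg_calendar = len(starts) / len(all_days)

print("Per active day: {:.2f}, per calendar day: {:.2f}".format(avg_active, avg_calendar))
```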
