# Music Listening Habits Data Mining

In [1]:
import pandas as pd
import ast
import json
import utils
import re

## Data needs to support visualisation

This document aims to provide a detailed overview of the music listening habits of users from different music streaming platforms. The data is extracted from various sources, including Deezer, Spotify, and YouTube Music. The statistics include the following key metrics:

- Top 15 artists (name & link) 
- Listening duration per artist 
- Number of different tracks listened per artist 

- Top 15 genres for each user 
- Listening time for each genre 
- Proportion of listening per genre 

- Average listening time per user 

- Ranking of genres per period (each week, each month)
- Ranking of tracks per period (each week, each month)
- Ranking of artists per period (each week, each month)

- Exact listening time per user per date 
- Average listening time per user per hour 
- Average listening time per user per month 

## Import Data from Deezer

The data is loaded from the Excel file containing the listening history of the user.

In [77]:
df = pd.read_excel("./data/Archive-Clement-Deezer/clement-deezer-data.xlsx", "10_listeningHistory")

The data is then processed to keep only the relevant columns, drop plays with a zero or negative listening time, and sort the records by date (most recent first).

In [79]:
# Keep only the columns needed for the analysis
df = df[["Date", "Song Title", "Artist", "Album Title", "Listening Time"]]
# Drop zero or negative listening times (copy to avoid a SettingWithCopyWarning below)
df = df[df["Listening Time"] > 0].copy()
# Convert the date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Sort the data by date
df = df.sort_values("Date", ascending=False)

Here is a preview of the data:

In [80]:
df.head(5)

Unnamed: 0,Date,Song Title,Artist,Album Title,Listening Time
1740,2024-12-03 18:02:15,Cirice,Ghost,Meliora,53
15427,2024-12-03 16:29:51,I Love Rock 'N Roll,Joan Jett and the Blackhearts,I Love Rock 'N' Roll (Expanded Edition),175
30156,2024-12-03 16:26:56,Wish I Had an Angel,Nightwish,Once,245
31984,2024-12-03 16:22:51,Something To Hide,Grandson,Something To Hide,119
15186,2024-12-03 16:20:42,Monster,PVRIS,Monster,178


For each record in the DataFrame, we have the date of the listening session, the title of the song, the artist, the album title, and the listening time in seconds.

# API Request for genres

The LastFM API is used to fetch the tags for each song. Each request is built from the song title and the artist name. The implementation details are in the `utils.py` file.

In [11]:
# Step 1: Create a unique DataFrame for Title and Artist
unique_songs = df[["Song Title", "Artist"]].drop_duplicates()

# Step 2: Fetch tags for each unique song
unique_songs["Tags"] = unique_songs.apply(lambda row: utils.fetch_tags(row["Song Title"], row["Artist"]), axis=1)

# Step 3: Merge the tags back to the original DataFrame
df = df.merge(unique_songs, on=["Song Title", "Artist"], how="left")
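
The actual `utils.fetch_tags` lives in `utils.py` and is not shown here. As a rough illustration only, the sketch below shows how a request URL for the Last.fm `track.getTopTags` method could be built and how the tag names could be read from its JSON response. The helper names `build_tags_url` and `parse_top_tags`, the `YOUR_API_KEY` placeholder, and the hand-written sample payload are assumptions made for this example, not the author's implementation.

```python
from urllib.parse import urlencode

LASTFM_ROOT = "https://ws.audioscrobbler.com/2.0/"


def build_tags_url(title, artist, api_key):
    """Build a track.getTopTags request URL (hypothetical helper)."""
    params = {
        "method": "track.gettoptags",
        "track": title,
        "artist": artist,
        "api_key": api_key,  # placeholder: supply your own key
        "format": "json",
    }
    return f"{LASTFM_ROOT}?{urlencode(params)}"


def parse_top_tags(payload, limit=5):
    """Return up to `limit` tag names from a decoded JSON payload (hypothetical helper)."""
    tags = payload.get("toptags", {}).get("tag", [])
    # Be defensive in case a single tag is returned as a dict instead of a list
    if isinstance(tags, dict):
        tags = [tags]
    return [tag["name"] for tag in tags[:limit]]


# Hand-written sample payload (not a real API response)
sample_payload = {
    "toptags": {
        "tag": [
            {"name": "heavy metal", "count": 100},
            {"name": "doom metal", "count": 80},
        ]
    }
}

url = build_tags_url("Cirice", "Ghost", "YOUR_API_KEY")
tags = parse_top_tags(sample_payload)
print(url)
print(tags)
```

Fetching tags once per unique (title, artist) pair, as the cell above does, keeps the number of API calls well below the number of rows in the listening history.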

The tags are then added to the DataFrame. Here is a preview of the data:

In [12]:
df.head(5)

Unnamed: 0,Date,Song Title,Artist,Album Title,Listening Time,Tags
0,2024-12-03 18:02:15,Cirice,Ghost,Meliora,53,"[heavy metal, doom metal, metal, 2015, hard rock]"
1,2024-12-03 16:29:51,I Love Rock 'N Roll,Joan Jett and the Blackhearts,I Love Rock 'N' Roll (Expanded Edition),175,"[rock, 80s, classic rock, female vocalists, ha..."
2,2024-12-03 16:26:56,Wish I Had an Angel,Nightwish,Once,245,"[symphonic metal, Gothic Metal, metal, Power m..."
3,2024-12-03 16:22:51,Something To Hide,Grandson,Something To Hide,119,[my top songs]
4,2024-12-03 16:20:42,Monster,PVRIS,Monster,178,"[alternative rock, electronic rock, Hip-Hop, e..."


Now that we have the tags for each song, we can proceed with the analysis. The data is then exported to a CSV file for further processing.

In [14]:
output_path = "clement_songs_with_tags_verified.csv"
df.to_csv(output_path, index=False)
print(f"DataFrame exported to {output_path}")

DataFrame exported to clement_songs_with_tags_verified.csv


# Data Exploration

Since the data is now ready, we can proceed with the exploration. The following sections provide insights into the listening habits of the user, including top artists, genres, tracks, and more. The CSV file is loaded and contains the following columns: Date, Song Title, Artist, Album Title, Listening Time, and Tags.

In [17]:
df = pd.read_csv("clement_songs_with_tags_verified.csv")

We need to convert the 'Date' column to a datetime format and the 'Tags' column to a list of strings.

In [18]:
df['Date'] = pd.to_datetime(df['Date'])
df["Tags"] = df["Tags"].apply(ast.literal_eval)
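
The `ast.literal_eval` step is needed because a CSV file stores each list of tags as its string representation. The small round trip below, using an in-memory buffer and a made-up one-row DataFrame, illustrates the idea.

```python
import ast
import io

import pandas as pd

# Made-up sample row with a list-valued column
original = pd.DataFrame(
    {"Song Title": ["Cirice"], "Tags": [["heavy metal", "doom metal"]]}
)

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip, the list has become a plain string
raw_value = reloaded.loc[0, "Tags"]
print(type(raw_value), raw_value)

# literal_eval safely parses the string back into a Python list
reloaded["Tags"] = reloaded["Tags"].apply(ast.literal_eval)
print(type(reloaded.loc[0, "Tags"]), reloaded.loc[0, "Tags"])
```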

In [19]:
df.head(5)

Unnamed: 0,Date,Song Title,Artist,Album Title,Listening Time,Tags
0,2024-12-03 18:02:15,Cirice,Ghost,Meliora,53,"[heavy metal, doom metal, metal, 2015, hard rock]"
1,2024-12-03 16:29:51,I Love Rock 'N Roll,Joan Jett and the Blackhearts,I Love Rock 'N' Roll (Expanded Edition),175,"[rock, 80s, classic rock, female vocalists, ha..."
2,2024-12-03 16:26:56,Wish I Had an Angel,Nightwish,Once,245,"[symphonic metal, Gothic Metal, metal, Power m..."
3,2024-12-03 16:22:51,Something To Hide,Grandson,Something To Hide,119,[my top songs]
4,2024-12-03 16:20:42,Monster,PVRIS,Monster,178,"[alternative rock, electronic rock, Hip-Hop, e..."


## Explore within time interval

We can filter the data based on a specific date interval to analyze the listening habits of the user during that period.

In [20]:
# Define the date interval
start_date = "2024-01-01"
end_date = "2025-01-31"

# Filter the DataFrame
date_filtered_df = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
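
One detail worth keeping in mind: comparing a datetime column with a date-only string such as `"2025-01-31"` compares against midnight at the start of that day, so plays later on the end date are excluded. The sketch below, on made-up sample rows, contrasts that filter with a half-open interval that includes the whole end day. The variable names are illustrative only.

```python
import pandas as pd

# Made-up sample plays around the end of the interval
plays = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2025-01-30 22:00:00", "2025-01-31 09:30:00", "2025-02-01 00:10:00"]
        ),
        "Listening Time": [120, 200, 90],
    }
)

start = pd.Timestamp("2025-01-01")
end = pd.Timestamp("2025-01-31")

# Closed interval on midnight: the 09:30 play on the end date is dropped
closed = plays[(plays["Date"] >= start) & (plays["Date"] <= end)]

# Half-open interval [start, end + 1 day): the whole end date is included
half_open = plays[
    (plays["Date"] >= start) & (plays["Date"] < end + pd.Timedelta(days=1))
]

print(len(closed), len(half_open))
```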

Now that we have filtered the data based on the specified date interval, we can proceed with the extraction of some statistics.

## Top Artists (based on listening time)

We can calculate the total listening time for each artist and identify the top artists based on this metric.

In [25]:
artist_listening_time = date_filtered_df.groupby('Artist')['Listening Time'].sum().reset_index(name='Total Listening Time')
artist_listening_time.sort_values("Total Listening Time", ascending=False, inplace=True)
artist_listening_time.head(5)

Unnamed: 0,Artist,Total Listening Time
129,Grandson,43328
141,Imagine Dragons,40106
296,Starset,36294
131,Green Day,18743
308,Sub Urban,17903


## Top Artists (based on listening count)

We can also count the number of plays for each artist (every row in the history is one play) and identify the top artists based on this metric.

In [26]:
artist_listening_count = date_filtered_df.groupby('Artist').size().reset_index(name='Total Artists Listening Count')
artist_listening_count.sort_values("Total Artists Listening Count", ascending=False, inplace=True)
artist_listening_count.head(5)

Unnamed: 0,Artist,Total Artists Listening Count
141,Imagine Dragons,285
129,Grandson,261
296,Starset,168
186,Linkin Park,138
104,Ethan Bortnick,116
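
The count above is a number of plays, so a track played ten times counts ten times. For the "number of different tracks listened per artist" metric listed at the top of this document, one possible approach is `nunique`. This is a sketch on made-up sample rows; the column names `plays` and `distinct_tracks` are chosen for the example.

```python
import pandas as pd

# Made-up sample history: one track is played twice
plays = pd.DataFrame(
    {
        "Artist": ["Grandson", "Grandson", "Grandson", "Starset"],
        "Song Title": [
            "Something To Hide",
            "Something To Hide",
            "Blood // Water",
            "My Demons",
        ],
    }
)

# Per artist: total plays and number of distinct song titles
summary = (
    plays.groupby("Artist")
    .agg(plays=("Song Title", "count"), distinct_tracks=("Song Title", "nunique"))
    .reset_index()
    .sort_values("plays", ascending=False)
)
print(summary)
```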


## Top Genres (Based on Listening Time)

We can analyze the top genres based on the total listening time for each genre.

In [29]:
# Step 1: Expand the DataFrame
expanded_df = date_filtered_df.explode("Tags")  # Creates one row per genre
# Step 2: Aggregate by genre
genre_time = expanded_df.groupby("Tags")["Listening Time"].sum().reset_index()
# Step 3: Sort by total listening time
genre_time.sort_values(by="Listening Time", ascending=False, inplace=True)
# Display the top genres
genre_time.head(5)

Unnamed: 0,Tags,Listening Time
358,rock,337503
146,alternative,230041
151,alternative rock,198391
265,hard rock,93412
334,pop,86592
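
Because `explode` creates one row per tag, a song with several tags contributes its full listening time to each of them. If the "proportion of listening per genre" is computed against the total listening time of the unexploded data, the shares can therefore add up to more than 1. The sketch below shows this on made-up data; it is one possible definition of the proportion, not necessarily the one used in `utils.py`.

```python
import pandas as pd

# Made-up sample: song A has two tags, song B has one
plays = pd.DataFrame(
    {
        "Song Title": ["A", "B"],
        "Listening Time": [100, 300],
        "Tags": [["rock", "alternative"], ["rock"]],
    }
)

# Total is taken before exploding, so each play is counted once
total_time = plays["Listening Time"].sum()

# Listening time per genre, counted once per tag
genre_time = plays.explode("Tags").groupby("Tags")["Listening Time"].sum()

# Share of the total listening time that carries each tag
genre_share = (genre_time / total_time).sort_values(ascending=False)
print(genre_share)
```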


## Top Tracks (Based on Listening Time)

We can identify the top tracks based on the total listening time for each track.

In [30]:
# Note: grouping by title alone merges songs that share a title across artists
track_listening_time = date_filtered_df.groupby('Song Title')['Listening Time'].sum().reset_index(name='Total Listening Time')
track_listening_time.sort_values("Total Listening Time", ascending=False, inplace=True)
track_listening_time.head(5)

Unnamed: 0,Song Title,Total Listening Time
208,Holiday / Boulevard of Broken Dreams,10185
321,Monster,9603
326,My Demons,9577
250,It Has Begun,9496
367,Pull Me Under,8727
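
Grouping by `Song Title` alone adds up every song that shares a title, even when the artists differ. A possible refinement, sketched here on made-up rows, is to group by both title and artist.

```python
import pandas as pd

# Made-up sample: two different songs that share the title "Monster"
plays = pd.DataFrame(
    {
        "Song Title": ["Monster", "Monster", "Monster"],
        "Artist": ["PVRIS", "Starset", "PVRIS"],
        "Listening Time": [178, 200, 150],
    }
)

# Grouping by title only merges both songs into one entry
by_title = plays.groupby("Song Title")["Listening Time"].sum()

# Grouping by title and artist keeps them apart
by_track = plays.groupby(["Song Title", "Artist"])["Listening Time"].sum()

print(by_title)
print(by_track)
```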


## Top Tracks (Based on Play Count)

We can also identify the top tracks based on the total number of times they were played.

In [31]:
# Count the number of plays per song title
track_play_count = date_filtered_df.groupby('Song Title').size().reset_index(name='Total Artists Listening Count')
track_play_count.sort_values("Total Artists Listening Count", ascending=False, inplace=True)
track_play_count.head(5)

Unnamed: 0,Song Title,Total Artists Listening Count
546,cut my fingers off,90
173,Gasoline,69
321,Monster,62
77,Carsick,62
484,To Ashes and Blood (from the series Arcane Lea...,47


## Daily Listening Time

This section aims to provide insights into the daily listening habits of the user. We will calculate the total listening time for each day. The following code snippet demonstrates the process.

In [40]:
# Copy the original DataFrame
df_daily = df.copy()

# Ensure 'Date' column is in datetime format and convert to just date
df_daily['Date'] = pd.to_datetime(df_daily['Date']).dt.date

# Group by Date and calculate total Listening Time
daily_listening_time = (
    df_daily.groupby('Date', as_index=False)['Listening Time']
    .sum()
    .sort_values('Date', ascending=False)
)

# Generate the full range of dates
min_date = daily_listening_time['Date'].min()
max_date = daily_listening_time['Date'].max()
date_range = pd.date_range(start=min_date, end=max_date).date  # Create a list of all dates

# Reindex to include all dates and fill missing values with 0
daily_listening_time.set_index('Date', inplace=True)  
daily_listening_time = daily_listening_time.reindex(date_range, fill_value=0)  
daily_listening_time.index.name = 'Date'  # Set index name back to 'Date'

# Reset the index and return to a flat structure
daily_listening_time.reset_index(inplace=True)

# Restore the 'Date' column name in case the reset produced a generic 'index' column
daily_listening_time.rename(columns={"index": "Date"}, inplace=True)
# Sort the DataFrame in descending order by date
daily_listening_time.sort_values('Date', ascending=False, inplace=True)

# Display the top 15 rows
daily_listening_time.head(15)


Unnamed: 0,Date,Listening Time
3453,2024-12-03,10212
3452,2024-12-02,566
3451,2024-12-01,12263
3450,2024-11-30,11029
3449,2024-11-29,7703
3448,2024-11-28,1965
3447,2024-11-27,3095
3446,2024-11-26,944
3445,2024-11-25,0
3444,2024-11-24,0
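
As an alternative to building the date range and reindexing by hand, pandas can produce the same kind of gap-free daily series with `resample`. The following is a minimal sketch on made-up sample rows; note that it keeps `Date` as timestamps at midnight rather than as `datetime.date` objects.

```python
import pandas as pd

# Made-up sample: two plays on one day, then a two-day gap
plays = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2024-11-26 08:00", "2024-11-26 21:15", "2024-11-29 10:00"]
        ),
        "Listening Time": [500, 444, 7703],
    }
)

# Resample to calendar days; days without plays sum to 0
daily = (
    plays.set_index("Date")["Listening Time"]
    .resample("D")
    .sum()
    .sort_index(ascending=False)
    .reset_index()
)
print(daily)
```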


## Average Monthly Listening Time

This section aims to provide insights into the average monthly listening time of the user. We will calculate the average listening time per month and display the results. This stat is computed across all years in the dataset.

In [44]:
df_monthly = df.copy()
# Extract year and month
df_monthly['Year'] = df['Date'].dt.year
df_monthly['Month'] = df['Date'].dt.month_name()
# Group by Year and Month to calculate monthly sums
monthly_sums = df_monthly.groupby(['Year', 'Month'], sort=False)['Listening Time'].sum().reset_index()
# Group by Month to calculate the average of monthly sums
average_monthly_sums = monthly_sums.groupby('Month', sort=False)['Listening Time'].mean()
# Ensure months are in calendar order
calendar_order = [
    'January', 'February', 'March', 'April', 'May', 'June', 
    'July', 'August', 'September', 'October', 'November', 'December'
]
average_monthly_sums = average_monthly_sums.reindex(calendar_order)

average_monthly_sums

Month
January      58469.857143
February     55076.333333
March        56511.714286
April        67287.142857
May          55616.857143
June         37841.555556
July         61056.333333
August       36617.375000
September    72816.555556
October      63333.625000
November     56088.888889
December     46058.000000
Name: Listening Time, dtype: float64

## Average Hourly Listening Time

This section aims to provide insights into the average hourly listening time of the user. We will calculate the average listening time per hour and display the results. Unlike the daily listening time, this stat is computed across all days within the time interval.

In [47]:
df_hourly = date_filtered_df.copy()

# Extract the hour
df_hourly['Hour'] = df_hourly['Date'].dt.hour

# Determine the min and max date
min_date = df_hourly['Date'].min()
max_date = df_hourly['Date'].max()

# Calculate the total number of days (inclusive)
total_days = (max_date - min_date).days + 1

# Create a placeholder with every hour of the day (0-23)
all_hours = pd.DataFrame({'Hour': range(24)})

# Aggregate total listening time per hour
hourly_sums = df_hourly.groupby('Hour')['Listening Time'].sum().reset_index()

# Merge to ensure every hour (0-23) is represented
hourly_sums = pd.merge(all_hours, hourly_sums, on='Hour', how='left').fillna(0)

# Compute the average listening time per hour (divide by total days)
hourly_sums['Average Listening Time'] = hourly_sums['Listening Time'] / total_days
hourly_sums

Unnamed: 0,Hour,Listening Time,Average Listening Time
0,0,8913,26.526786
1,1,4107,12.223214
2,2,521,1.550595
3,3,46,0.136905
4,4,220,0.654762
5,5,15450,45.982143
6,6,17485,52.03869
7,7,21665,64.479167
8,8,17441,51.907738
9,9,31028,92.345238


---- 

# Data Export

Now that we've played around with the data, we can export it to a JSON file for further processing and visualization. The JSON file will contain the following structure:

```json
{
    "users": [
        {
            "user_id": "user_id",
            "username": "username",
            "top_artists": [
                {
                    "start": "2024-11-22",
                    "end": "2024-12-22",
                    "label": "4 weeks",
                    "count": 15,
                    "ranking": [
                        {
                            "Artist": "artist_name",
                            "Listening Time": 12345
                        },
                        ...
                    ]
                },
                ...
            ],
            "top_genres": [
                {
                    "start": "2024-11-22",
                    "end": "2024-12-22",
                    "label": "4 weeks",
                    "count": 15,
                    "ranking": [
                        {
                            "Genre": "genre_name",
                            "Listening Time": 12345,
                            "list": [
                                {
                                    "Song Title": "song title - artist_name",
                                    "Listening Time": 12345
                                },
                                ...   
                            ]
                        },
                        ...
                    ]
                },
                ...
            ],
            "top_tracks": [
                {
                    "start": "2024-11-22",
                    "end": "2024-12-22",
                    "label": "4 weeks",
                    "count": 15,
                    "ranking": [
                        {
                            "Song Title": "song title ",
                            "Artist": "artist_name",  
                            "Listening Time": 12345
                        },
                        ...
                    ]
                },
                ... 
            ],
            "average_listening_time": {
                "dataMonth": [],
                "dataYear": [],
                "dataDay": []
            }
        }
    ], "merged_data": {
        "dataMonth": [],
        "dataYear": [],
        "dataDay": []
    }
}
```

We will now export the data to a JSON file using the structure defined above. The full implementation can be found in the `utils.py` file.
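
To make the structure more concrete, here is a minimal sketch of how one entry of the `top_artists` list could be produced for a given period. The function name `build_top_artists_entry` and the sample rows are hypothetical; the real export logic is in `utils.py` and may differ.

```python
import json

import pandas as pd


def build_top_artists_entry(df, start, end, label, count=15):
    """Build one period entry of the 'top_artists' list (hypothetical helper)."""
    start_ts, end_ts = pd.Timestamp(start), pd.Timestamp(end)
    # Keep plays from the start date up to and including the whole end date
    window = df[(df["Date"] >= start_ts) & (df["Date"] < end_ts + pd.Timedelta(days=1))]
    ranking = (
        window.groupby("Artist", as_index=False)["Listening Time"]
        .sum()
        .sort_values("Listening Time", ascending=False)
        .head(count)
    )
    return {
        "start": start,
        "end": end,
        "label": label,
        "count": count,
        # Round-trip through to_json to get plain Python types for json.dump
        "ranking": json.loads(ranking.to_json(orient="records")),
    }


# Made-up sample history
plays = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2024-11-25 10:00", "2024-12-01 09:00", "2024-12-22 18:00", "2025-01-05 12:00"]
        ),
        "Artist": ["Grandson", "Starset", "Grandson", "Ghost"],
        "Listening Time": [100, 150, 200, 999],
    }
)

entry = build_top_artists_entry(plays, "2024-11-22", "2024-12-22", "4 weeks", count=2)
print(json.dumps(entry, indent=4, ensure_ascii=False))
```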

--- 

# Handling Spotify Data


We will now proceed with the extraction and processing of Spotify data. The data is loaded from the JSON files containing the listening history of the user.

In [56]:
df = pd.concat([
    pd.read_json("data/Archive-Matthieu-Spotify/Spotify Account Data/StreamingHistory_music_0.json"),
    pd.read_json("data/Archive-Matthieu-Spotify/Spotify Account Data/StreamingHistory_music_1.json"),
    pd.read_json("data/Archive-Matthieu-Spotify/Spotify Account Data/StreamingHistory_music_2.json")
])

Some data cleaning and adjustments are required to prepare the data and make it consistent with the previous dataset. 

In [57]:
# Ensure datetime format
df["endTime"] = pd.to_datetime(df['endTime'])
# Sort the data by date
df.sort_values("endTime", ascending=True, inplace=True)
# Convert msPlayed from milliseconds to whole seconds (1 s = 1000 ms)
df['msPlayed'] = (df["msPlayed"] / 1000).astype(int)
# Drop plays shorter than one second
df = df[df["msPlayed"] > 0]
# Rename columns to match the Deezer dataset
df = df.rename(columns={"endTime": "Date", "artistName": "Artist", "trackName": "Song Title", "msPlayed": "Listening Time"})

In [58]:
df.head(5)

Unnamed: 0,Date,Artist,Song Title,Listening Time
0,2023-09-17 19:40:00,AJR,Sober Up (feat. Rivers Cuomo),4
1,2023-12-04 17:07:00,Lecrae,They Ain’t Know,69
2,2023-12-05 01:01:00,gio.,shadows,69
3,2023-12-05 07:09:00,Young Oceans,You Are Not Far,134
4,2023-12-05 07:13:00,Josiah Queen,Fishes and Loaves,216


Since it follows the same structure as the Deezer data, we can proceed with the same steps to extract the tags using the LastFM API. Then we can export the data to a CSV and use it for the analysis.

## YouTube Music Data

We will now proceed with the extraction and processing of YouTube Music data. The data is loaded from the JSON file containing the listening history of the user.

Additional data cleaning and adjustments are required to prepare the data and make it consistent with the previous datasets.

In [36]:
df = pd.read_json("data/Archive-Thomas-Youtube-Music/Thomas-Youtube-history.json")

In [37]:
# Filter out non-music entries
df = df[df['header'].str.contains("Music")]
# Strip the French "Vous avez regardé " ("You watched") prefix to get the song title
df["Song Title"] = df["title"].apply(lambda x: re.sub("Vous avez regardé ", "", x))
# Drop rows whose 'subtitles' value is missing (NaN)
df = df[df['subtitles'].notna()]
# Extract the artist name from the first 'subtitles' entry, without the " - Topic" suffix
df["Artist"] = df["subtitles"].apply(lambda x: re.sub(" - Topic", "", x[0]['name']))
# Convert the 'time' column to datetime format
df['Date'] = pd.to_datetime(df['time'], format="mixed")
df['Date'] = df['Date'].dt.tz_convert('Europe/Paris')
# Drop unnecessary columns
df.drop(columns=["activityControls", "products", "titleUrl", 'header', "subtitles", "time", "title"], inplace=True)
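
To illustrate what these cleaning steps do, the sketch below applies them to two hand-written records that mimic only the fields used above (`header`, `title`, `subtitles`, `time`). The records are assumptions made for this example, not real export data. Here the regular expressions are anchored (`^` and `$`) so that only a leading prefix or a trailing suffix is removed, which is a small variation on the cell above.

```python
import re

import pandas as pd

# Hand-written sample records (not real export data)
records = [
    {
        "header": "YouTube Music",
        "title": "Vous avez regardé Personne",
        "subtitles": [{"name": "47Ter - Topic"}],
        "time": "2024-12-04T12:56:40.714Z",
    },
    {
        "header": "YouTube",
        "title": "Vous avez regardé une vidéo",
        "subtitles": None,
        "time": "2024-12-04T12:00:00.000Z",
    },
]
df_yt = pd.DataFrame(records)

# Keep music entries that have artist information
df_yt = df_yt[df_yt["header"].str.contains("Music") & df_yt["subtitles"].notna()].copy()

# Remove the leading "Vous avez regardé " prefix from the title
df_yt["Song Title"] = df_yt["title"].str.replace(r"^Vous avez regardé ", "", regex=True)

# Remove the trailing " - Topic" suffix from the channel name
df_yt["Artist"] = df_yt["subtitles"].apply(
    lambda subs: re.sub(r" - Topic$", "", subs[0]["name"])
)

# Parse the UTC timestamps and convert them to Paris local time
df_yt["Date"] = pd.to_datetime(df_yt["time"], utc=True).dt.tz_convert("Europe/Paris")

print(df_yt[["Song Title", "Artist", "Date"]])
```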

Unfortunately, the YouTube Music data does not provide the listening time for each track. As a result, we will set the listening time to 0 for all entries in this dataset.

In [38]:
df['Listening Time'] = 0

In [39]:
df.head(5)

Unnamed: 0,Song Title,Artist,Date,Listening Time
0,Personne,47Ter,2024-12-04 13:56:40.714000+01:00,0
1,La vérité,Fredz,2024-12-04 13:54:20.071000+01:00,0
2,Doss,47Ter,2024-12-04 13:50:47.128000+01:00,0
121,Swim,Chase Atlantic,2024-12-03 20:02:43.415000+01:00,0
122,RICOCHET,Chase Atlantic,2024-12-03 19:59:48.639000+01:00,0


In [101]:
# Step 1: Create a unique DataFrame for Title and Artist
unique_songs = df[["Song Title", "Artist"]].drop_duplicates()

# Step 2: Fetch tags for each unique song
unique_songs["Tags"] = unique_songs.apply(lambda row: utils.fetch_tags(row["Song Title"], row["Artist"]), axis=1)

# Step 3: Merge the tags back to the original DataFrame
df = df.merge(unique_songs, on=["Song Title", "Artist"], how="left")

# Export the data to a CSV file
df.to_csv("thomas_songs_with_tags_verified.csv", index=False)

-----

# Export from CSV to JSON

The following calls export each user's CSV file to a JSON file.

In [2]:
utils.export_music_data_to_json("clement_songs_with_tags_verified.csv", "clement.json", "clement", "Clément Laurent")
utils.export_music_data_to_json("matthieu_songs_with_tags_verified.csv", "matthieu.json", "matthieu", "Matthieu Randriantsoa")
utils.export_music_data_to_json("celine_songs_with_tags_verified.csv", "celine.json", "celine", "Céline Constant")
utils.export_music_data_to_json("thomas_songs_with_tags_verified.csv", "thomas.json", "thomas", "Thomas Halvick")

Data has been processed and saved to clement.json
Data has been processed and saved to matthieu.json
Data has been processed and saved to celine.json
Data has been processed and saved to thomas.json


# Additional Data Tweaks

Some structural changes are required but are easier to make directly in the JSON file itself. Also, we need to merge the data from different users into a single JSON entry.

In [3]:
# Define users and their corresponding files
users = [
    ("Céline", "celine.json"),
    ("Clément", "clement.json"),
    ("Matthieu", "matthieu.json"),
    ("Thomas", "thomas.json")
]

# Initialize empty lists for merged data
merged_months = []
merged_years = []
merged_days = []

# Process each user's data and merge
for user, file in users:
    with open(file, encoding="utf-8") as f:
        data = json.load(f)

    # Append the user name to each entry in the respective data categories
    for entry in data["users"][0]['average_listening_time']['dataMonth']:
        entry['user'] = user
        merged_months.append(entry)
    for entry in data["users"][0]['average_listening_time']['dataYear']:
        entry['user'] = user
        merged_years.append(entry)
    for entry in data["users"][0]['average_listening_time']['dataDay']:
        entry['user'] = user
        merged_days.append(entry)

# Create the final merged object
final_merged_data = {
    "dataMonth": merged_months,
    "dataYear": merged_years,
    "dataDay": merged_days
}

# Export the merged data to a JSON file
with open("merged_data.json", "w", encoding="utf-8") as f:
    json.dump(final_merged_data, f, indent=4, ensure_ascii=False)

