# Project: Intelligent Playlist Sorter (Proof of Concept)

**Author:** Lakshay Chhabra

## 1. Objective
The primary goal of this project is to design and simulate a machine learning-based playlist sorter. Instead of a simple random shuffle, this system will analyze a user's listening habits to create a more coherent and personalized playlist flow. This notebook serves as a Proof of Concept (POC) to test the core logic using a simulated dataset.

## 2. The Problem
Standard music players like VLC often use a purely random shuffling algorithm. This can lead to a jarring listening experience where songs of drastically different genres, moods, or energy levels play back-to-back (e.g., a high-energy party song followed by a slow, sad ballad), disrupting the listener's mood.

## 3. The Proposed Solution
We will create a **"Preference Score"** for each song. This score will be more intelligent than a simple play count. It will be calculated based on:

*   **How many times a song has been played.**
*   **The average play duration vs. the total length of the song (the "Play Ratio").** A high ratio indicates that the user doesn't skip the song and likely enjoys it more.

We will then sort the playlist based on this score to generate a "smart" playlist.

## 4. Simulation Setup
*   **Environment:** Jupyter Notebook (`.ipynb`) in VS Code
*   **Libraries:** `pandas`, `numpy`
*   **Dataset:** We will create a simulated dataset of Hindi and Punjabi songs with the following attributes:
    *   `Sr. No`
    *   `Name of the song`
    *   `Genre`
    *   `Music Type`
    *   `Artist`
    *   `Length of song`
    *   `Play duration (average)`
    *   `How many times Played`

In [None]:
# Step 1: Importing the necessary libraries
import pandas as pd
import numpy as np

In [8]:
# Step 2: Create the simulated dataset
data = {
    'Name of the song': [
        'Chaleya', 'Apna Bana Le', 'Heeriye', 'Maan Meri Jaan', 'Obsessed', 'Kesariya', 
        'Pehle Bhi Main', 'Satranga', 'Jale 2', 'What Jhumka?', 'Arjan Vailly', 'Akhiyaan Gulaab'
    ],
    'Genre': [
        'Romantic', 'Romantic', 'Chill Vibe', 'Romantic', 'Upbeat', 'Romantic', 
        'Sad', 'Romantic', 'Haryanvi Pop', 'Upbeat', 'Action', 'Chill Vibe'
    ],
    'Music Type': [
        'Soft Pop', 'Ballad', 'Lofi', 'Pop', 'Punjabi Pop', 'Ballad', 
        'Ballad', 'Ballad', 'Dance', 'Bollywood Remix', 'Folk-Action', 'Pop'
    ],
    'Artist': [
        'Arijit Singh', 'Arijit Singh', 'Jasleen Royal', 'King', 'Vicky Kaushal', 'Arijit Singh', 
        'Vishal Mishra', 'Arijit Singh', 'Sapna Choudhary', 'Arijit Singh', 'Bhupinder Babbal', 'Shahid Mallya'
    ],
    'Length of song': [
        '3:20', '3:25', '3:15', '3:10', '2:50', '4:10', 
        '4:00', '4:31', '3:05', '3:33', '3:02', '3:10'
    ],
    # Humne play duration aur counts ko thoda random rakha hai
    'Play duration (average)': [
        '3:18', '2:10', '3:14', '1:30', '2:48', '4:05', 
        '1:15', '4:25', '3:00', '1:45', '3:01', '3:08'
    ],
    'How many times Played': [
        25, 40, 55, 30, 45, 15, 20, 18, 50, 60, 12, 35
    ]
}

# Step 3: Create a DataFrame
df = pd.DataFrame(data)


# Step 4: Display the DataFrame
print("--- Original Unsorted DataFrame of Songs ---")
print(df.head())

--- Original Unsorted DataFrame of Songs ---
  Name of the song       Genre   Music Type         Artist Length of song  \
0          Chaleya    Romantic     Soft Pop   Arijit Singh           3:20   
1     Apna Bana Le    Romantic       Ballad   Arijit Singh           3:25   
2          Heeriye  Chill Vibe         Lofi  Jasleen Royal           3:15   
3   Maan Meri Jaan    Romantic          Pop           King           3:10   
4         Obsessed      Upbeat  Punjabi Pop  Vicky Kaushal           2:50   

  Play duration (average)  How many times Played  
0                    3:18                     25  
1                    2:10                     40  
2                    3:14                     55  
3                    1:30                     30  
4                    2:48                     45  


In [9]:
# --- Step 5: Feature Engineering & Score Calculation ---

# Helper function to convert 'MM:SS' time format to total seconds for easier calculation.
def time_to_seconds(time_str):
    """
    This function takes a time string in 'MM:SS' format and returns the total number of seconds.
    Example: '3:20' -> 200
    """
    try:
        minutes, seconds = map(int, time_str.split(':'))
        return (minutes * 60) + seconds
    except:
        # Agar koi format me galti ho to 0 return karega
        return 0
    

# --- Apply the function to create new columns with time in seconds ---
# Create a new column 'Length_sec' by converting 'Length of song'
df['Length_sec'] = df['Length of song'].apply(time_to_seconds)

# Create a new column 'Play_duration_sec' by converting 'Play duration (average)'
df['Play_duration_sec'] = df['Play duration (average)'].apply(time_to_seconds)

# --- Calculate the "Play Ratio" ---
# Play Ratio shows what percentage of a song is listened to on average.
# A ratio close to 1.0 is very good.
# We will handle the case where song length is 0 to avoid division by zero error.
df['Play_Ratio'] = np.where(df['Length_sec'] > 0, df['Play_duration_sec'] / df['Length_sec'], 0)

# We use np.clip to ensure the ratio doesn't go above 1.0 (in case of bad data).
df['Play_Ratio'] = np.clip(df['Play_Ratio'], 0, 1)

# --- Calculate the final "Preference Score" ---
# Formula: Score = (How many times Played) * (Play Ratio)
# This score gives weight to songs that are played often AND listened to completely.
df['Preference_Score'] = df['How many times Played'] * df['Play_Ratio']

# --- Display the DataFrame with the new calculated columns to verify our logic ---
print("------ Data with New Calculated Features (Scores) ------")

# Hum sirf zaroori columns display karenge taaki table saaf dikhe.
# .round(2) se hum decimal ke baad sirf 2 digit tak show karenge.
df[['Name of the song', 'How many times Played', 'Play_Ratio', 'Preference_Score']].round(2)

------ Data with New Calculated Features (Scores) ------


Unnamed: 0,Name of the song,How many times Played,Play_Ratio,Preference_Score
0,Chaleya,25,0.99,24.75
1,Apna Bana Le,40,0.63,25.37
2,Heeriye,55,0.99,54.72
3,Maan Meri Jaan,30,0.47,14.21
4,Obsessed,45,0.99,44.47
5,Kesariya,15,0.98,14.7
6,Pehle Bhi Main,20,0.31,6.25
7,Satranga,18,0.98,17.6
8,Jale 2,50,0.97,48.65
9,What Jhumka?,60,0.49,29.58


In [10]:
# --- Step 6: Sorting the Playlist based on Preference Score ---

# We sort the entire DataFrame in descending order based on the 'Preference_Score'.
# This brings the most "preferred" songs to the top.
# inplace=False ensures that the original df is not modified, and we get a new DataFrame.
sorted_df = df.sort_values(by='Preference_Score', ascending=False)

# --- Step 7: Displaying the Final "Smart" Playlist ---

# To make the final output clean, we will only select the columns that matter to the user.
# We also reset the index to get a clean 0, 1, 2... sequence.
final_playlist = sorted_df[['Name of the song', 'Genre', 'Artist', 'Preference_Score']].reset_index(drop=True)

# To make it look like a real playlist, let's add a "New Order" or "Rank" column.
# The index of a DataFrame starts at 0, so we add 1 to start our ranking from 1.
final_playlist.index = final_playlist.index + 1
final_playlist.index.name = "New Order"

print("=====================================================")
print("          Your New Intelligent Playlist            ")
print("=====================================================")
print("Sorted based on your listening habits (play count & duration)\n")

# Display the final, beautiful playlist
final_playlist.round(2)



          Your New Intelligent Playlist            
Sorted based on your listening habits (play count & duration)



Unnamed: 0_level_0,Name of the song,Genre,Artist,Preference_Score
New Order,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Heeriye,Chill Vibe,Jasleen Royal,54.72
2,Jale 2,Haryanvi Pop,Sapna Choudhary,48.65
3,Obsessed,Upbeat,Vicky Kaushal,44.47
4,Akhiyaan Gulaab,Chill Vibe,Shahid Mallya,34.63
5,What Jhumka?,Upbeat,Arijit Singh,29.58
6,Apna Bana Le,Romantic,Arijit Singh,25.37
7,Chaleya,Romantic,Arijit Singh,24.75
8,Satranga,Romantic,Arijit Singh,17.6
9,Kesariya,Romantic,Arijit Singh,14.7
10,Maan Meri Jaan,Romantic,King,14.21


In [11]:
# --- Step 8: Preparing Data for the Machine Learning Model ---
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Select the features our model should learn from.
# Notice how we are now including Genre, Music Type, and Artist!
features_to_use = ['Genre', 'Music Type', 'Artist', 'Preference_Score']
model_df = df[features_to_use]

# --- One-Hot Encoding for categorical features ---
# pd.get_dummies() automatically converts columns like 'Genre' into multiple
# numerical columns (e.g., 'Genre_Romantic', 'Genre_Upbeat') with 0s and 1s.
encoded_df = pd.get_dummies(model_df, columns=['Genre', 'Music Type', 'Artist'])

# --- Scaling the numerical features ---
# We scale the Preference_Score so it doesn't overly dominate the other features.
scaler = StandardScaler()
encoded_df['Preference_Score'] = scaler.fit_transform(encoded_df[['Preference_Score']])

print("------ Data Ready for ML Model (First 5 rows) ------")
encoded_df.head()

------ Data Ready for ML Model (First 5 rows) ------


Unnamed: 0,Preference_Score,Genre_Action,Genre_Chill Vibe,Genre_Haryanvi Pop,Genre_Romantic,Genre_Sad,Genre_Upbeat,Music Type_Ballad,Music Type_Bollywood Remix,Music Type_Dance,...,Music Type_Punjabi Pop,Music Type_Soft Pop,Artist_Arijit Singh,Artist_Bhupinder Babbal,Artist_Jasleen Royal,Artist_King,Artist_Sapna Choudhary,Artist_Shahid Mallya,Artist_Vicky Kaushal,Artist_Vishal Mishra
0,-0.166657,False,False,False,True,False,False,False,False,False,...,False,True,True,False,False,False,False,False,False,False
1,-0.125407,False,False,False,True,False,False,True,False,False,...,False,False,True,False,False,False,False,False,False,False
2,1.840579,False,True,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,-0.872585,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,1.154217,False,False,False,False,False,True,False,False,False,...,True,False,False,False,False,False,False,False,True,False


In [12]:
# --- Step 9: Training the K-Means Clustering Model ---

# We choose 'k' (the number of clusters/groups) we want to create.
# Let's start with k=4 to create 4 distinct groups of songs.
# Finding the 'best' k is a separate topic (using the Elbow Method),
# but for now, 4 is a good starting point.
k = 4
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)

# Train the model and assign each song to a cluster
df['Cluster'] = kmeans.fit_predict(encoded_df)

print("------ Songs with their assigned Cluster ------")
print(df[['Name of the song', 'Genre', 'Artist', 'Cluster']].to_string())

# --- Step 10: Creating the NEW Smart Playlist (Sorted by Cluster) ---

# This is the key step! We will now sort by two levels:
# 1. First, sort by the 'Cluster' number.
# 2. Then, WITHIN each cluster, sort by our original 'Preference_Score'.
final_clustered_playlist = df.sort_values(by=['Cluster', 'Preference_Score'], ascending=[True, False])


# --- Display the final, truly intelligent playlist ---

print("\n\n========================================================")
print("     Final ML-Powered Playlist (Grouped by Mood)      ")
print("========================================================")
print("Songs are grouped by similarity, then ranked by preference.\n")

# Displaying the final result
final_view = final_clustered_playlist[['Name of the song', 'Genre', 'Artist', 'Cluster', 'Preference_Score']].reset_index(drop=True)
final_view.index = final_view.index + 1
final_view.index.name = "New Order"
final_view.round(2)

------ Songs with their assigned Cluster ------
   Name of the song         Genre            Artist  Cluster
0           Chaleya      Romantic      Arijit Singh        0
1      Apna Bana Le      Romantic      Arijit Singh        0
2           Heeriye    Chill Vibe     Jasleen Royal        1
3    Maan Meri Jaan      Romantic              King        0
4          Obsessed        Upbeat     Vicky Kaushal        2
5          Kesariya      Romantic      Arijit Singh        0
6    Pehle Bhi Main           Sad     Vishal Mishra        3
7          Satranga      Romantic      Arijit Singh        0
8            Jale 2  Haryanvi Pop   Sapna Choudhary        2
9      What Jhumka?        Upbeat      Arijit Singh        0
10     Arjan Vailly        Action  Bhupinder Babbal        3
11  Akhiyaan Gulaab    Chill Vibe     Shahid Mallya        1


     Final ML-Powered Playlist (Grouped by Mood)      
Songs are grouped by similarity, then ranked by preference.



Unnamed: 0_level_0,Name of the song,Genre,Artist,Cluster,Preference_Score
New Order,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,What Jhumka?,Upbeat,Arijit Singh,0,29.58
2,Apna Bana Le,Romantic,Arijit Singh,0,25.37
3,Chaleya,Romantic,Arijit Singh,0,24.75
4,Satranga,Romantic,Arijit Singh,0,17.6
5,Kesariya,Romantic,Arijit Singh,0,14.7
6,Maan Meri Jaan,Romantic,King,0,14.21
7,Heeriye,Chill Vibe,Jasleen Royal,1,54.72
8,Akhiyaan Gulaab,Chill Vibe,Shahid Mallya,1,34.63
9,Jale 2,Haryanvi Pop,Sapna Choudhary,2,48.65
10,Obsessed,Upbeat,Vicky Kaushal,2,44.47


In [14]:
# --- Step 11: Testing the Trained Model on a New, Unseen Song ---

# Let's create a new song as a dictionary.
# Test Case 1: Ek naya Romantic gaana Arijit Singh ka.
test_song = {
    'Name of the song': 'Naya Arijit Romantic Song',
    'Genre': 'Romantic',
    'Music Type': 'Ballad',
    'Artist': 'Vicky Kaushal',  # Maan lete hain yeh artist hai
    'Length of song': '3:30',  # Maan lete hain iska
    'Play duration (average)': '3:00',  # Maan lete hain iska average play duration hai
    'Preference_Score': 50.0  # Maan lete hain iska score high hai
}

# Convert the test song into a DataFrame
test_df = pd.DataFrame([test_song])

print(f"--- Testing with a new song: '{test_song['Name of the song']}' ---\n")

# --- Preparing the test data (CRUCIAL STEP) ---
# Humein is naye gaane ko bilkul waise hi taiyaar karna hoga jaise humne training data ko kiya tha.

# 1. One-Hot Encode the test song
test_encoded = pd.get_dummies(test_df, columns=['Genre', 'Music Type', 'Artist'])

# 2. Align columns: Ensure the test data has the EXACT same columns as the training data.
# The 'reindex' command adds any missing columns (from the training set) and fills them with 0.
# It also removes any columns from the test set that weren't in the training set.
final_test_data = test_encoded.reindex(columns=encoded_df.columns, fill_value=0)

# 3. Scale the Preference_Score using the SAME scaler we used for training
# We use .transform() here, NOT .fit_transform(), because we want to use the
# scaling rules learned from the training data.
final_test_data['Preference_Score'] = scaler.transform(final_test_data[['Preference_Score']])

# --- Making the Prediction ---
# Use our trained 'kmeans' model to predict the cluster
predicted_cluster = kmeans.predict(final_test_data)

print(f"Prediction Complete!")
print(f"The model has placed this song in: Cluster {predicted_cluster[0]}\n")

# --- Verification: Let's see what other songs are in this cluster ---
print(f"--- Let's verify! Here are all the other songs from Cluster {predicted_cluster[0]}: ---")

# Filter the original DataFrame to show all songs from the predicted cluster
songs_in_same_cluster = df[df['Cluster'] == predicted_cluster[0]]

# Displaying the result
songs_in_same_cluster[['Name of the song', 'Genre', 'Artist', 'Cluster']]

--- Testing with a new song: 'Naya Arijit Romantic Song' ---

Prediction Complete!
The model has placed this song in: Cluster 2

--- Let's verify! Here are all the other songs from Cluster 2: ---


Unnamed: 0,Name of the song,Genre,Artist,Cluster
4,Obsessed,Upbeat,Vicky Kaushal,2
8,Jale 2,Haryanvi Pop,Sapna Choudhary,2


In [15]:
# --- Step 12: Creating a Brand New, Unseen Playlist for Testing ---

# Yeh hamari 'unseen' test playlist hai. Isme naye gaane,
# naye artists, aur alag listening stats hain.
new_playlist_data = {
    'Name of the song': [
        'Shayad', 'Bijlee Bijlee', 'Raatan Lambiyan', 'Ghodey Pe Sawaar', 
        'Kal Ho Naa Ho', 'Lover', 'Tu Hai Kahan'
    ],
    'Genre': [
        'Romantic', 'Upbeat', 'Romantic', 'Chill Vibe', 'Sad', 
        'Punjabi Pop', 'Lofi'
    ],
    'Music Type': [
        'Ballad', 'Punjabi Pop', 'Ballad', 'Retro Pop', 'Bollywood Classic', 
        'Pop', 'Indie'
    ],
    'Artist': [
        'Arijit Singh', 'Hardy Sandhu', 'Tanishk Bagchi', 'Amit Trivedi', 
        'Sonu Nigam', 'Diljit Dosanjh', 'AUR'
    ],
    'Length of song': [
        '4:07', '2:48', '3:50', '3:18', '5:20', '3:03', '4:24'
    ],
    'Play duration (average)': [
        '4:05', '2:00', '3:48', '3:15', '5:15', '1:30', '4:20'
    ],
    'How many times Played': [
        60, 50, 45, 30, 15, 55, 40
    ]
}

# Is data se ek naya DataFrame banate hain.
new_playlist_df = pd.DataFrame(new_playlist_data)

print("------ The New Unsorted Playlist We Will Test ------")
new_playlist_df

------ The New Unsorted Playlist We Will Test ------


Unnamed: 0,Name of the song,Genre,Music Type,Artist,Length of song,Play duration (average),How many times Played
0,Shayad,Romantic,Ballad,Arijit Singh,4:07,4:05,60
1,Bijlee Bijlee,Upbeat,Punjabi Pop,Hardy Sandhu,2:48,2:00,50
2,Raatan Lambiyan,Romantic,Ballad,Tanishk Bagchi,3:50,3:48,45
3,Ghodey Pe Sawaar,Chill Vibe,Retro Pop,Amit Trivedi,3:18,3:15,30
4,Kal Ho Naa Ho,Sad,Bollywood Classic,Sonu Nigam,5:20,5:15,15
5,Lover,Punjabi Pop,Pop,Diljit Dosanjh,3:03,1:30,55
6,Tu Hai Kahan,Lofi,Indie,AUR,4:24,4:20,40


In [16]:
# --- Step 13: Processing and Sorting the New Playlist with our Trained Model ---

print("--- Starting the sorting process for the new playlist... ---\n")

# 1. Calculate Preference Score for the new playlist
new_playlist_df['Length_sec'] = new_playlist_df['Length of song'].apply(time_to_seconds)
new_playlist_df['Play_duration_sec'] = new_playlist_df['Play duration (average)'].apply(time_to_seconds)
new_playlist_df['Play_Ratio'] = np.where(new_playlist_df['Length_sec'] > 0, new_playlist_df['Play_duration_sec'] / new_playlist_df['Length_sec'], 0)
new_playlist_df['Play_Ratio'] = np.clip(new_playlist_df['Play_Ratio'], 0, 1)
new_playlist_df['Preference_Score'] = new_playlist_df['How many times Played'] * new_playlist_df['Play_Ratio']

# 2. Prepare the new playlist data for the model (One-Hot Encode, Align, Scale)
model_features_new = new_playlist_df[['Genre', 'Music Type', 'Artist', 'Preference_Score']]
encoded_new = pd.get_dummies(model_features_new, columns=['Genre', 'Music Type', 'Artist'])
aligned_new = encoded_new.reindex(columns=encoded_df.columns, fill_value=0) # IMPORTANT: Aligning to original columns
aligned_new['Preference_Score'] = scaler.transform(aligned_new[['Preference_Score']]) # IMPORTANT: Using old scaler

# 3. Predict the cluster for EACH song in the new playlist
predicted_clusters_new = kmeans.predict(aligned_new)
new_playlist_df['Predicted_Cluster'] = predicted_clusters_new

# 4. Sort the new playlist based on the predicted clusters and preference score
final_sorted_new_playlist = new_playlist_df.sort_values(by=['Predicted_Cluster', 'Preference_Score'], ascending=[True, False])

print("--- Process Complete! ---\n")

# 5. Display the final result
print("==========================================================")
print("     The New Playlist - Now Intelligently Sorted!     ")
print("==========================================================")

final_view_new = final_sorted_new_playlist[['Name of the song', 'Artist', 'Genre', 'Predicted_Cluster', 'Preference_Score']].reset_index(drop=True)
final_view_new.index = final_view_new.index + 1
final_view_new.index.name = "New Order"
final_view_new.round(2)

--- Starting the sorting process for the new playlist... ---

--- Process Complete! ---

     The New Playlist - Now Intelligently Sorted!     


Unnamed: 0_level_0,Name of the song,Artist,Genre,Predicted_Cluster,Preference_Score
New Order,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Lover,Diljit Dosanjh,Punjabi Pop,0,27.05
2,Ghodey Pe Sawaar,Amit Trivedi,Chill Vibe,1,29.55
3,Shayad,Arijit Singh,Romantic,2,59.51
4,Raatan Lambiyan,Tanishk Bagchi,Romantic,2,44.61
5,Tu Hai Kahan,AUR,Lofi,2,39.39
6,Bijlee Bijlee,Hardy Sandhu,Upbeat,2,35.71
7,Kal Ho Naa Ho,Sonu Nigam,Sad,3,14.77
