**XGBoost: A Powerful Machine Learning Algorithm**

XGBoost (eXtreme Gradient Boosting) is an efficient, scalable implementation of gradient-boosted decision trees for classification and regression. It is known for its training speed and accuracy on large tabular datasets, and it includes built-in regularization to limit overfitting and improve generalization.
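
For reference, the regularization mentioned above enters through the training objective. The notation below follows the usual presentation in the XGBoost documentation and is not taken from this notebook: $l$ is a differentiable loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w$ is its vector of leaf weights. In its basic form:

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2
```

Larger values of $\gamma$ and $\lambda$ penalize trees with many leaves or large leaf weights, which is how the model trades training fit for generalization.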

**XGBoost: Leveraging Decision Trees and Embeddings**

* **XGBoost:** Builds an ensemble of decision trees that learns patterns in the input features and scores candidate tracks for recommendation.
* **DistilBERT Embeddings (Only for Feature Engineering):** Convert track names into dense semantic vectors that capture textual meaning.

XGBoost uses both DistilBERT embeddings and numerical features to predict the next track.

**Why DistilBERT Outperforms Bag of Words for Next-Track Prediction**

DistilBERT offers a significant advantage over Bag of Words for predicting the next track in a playlist:

* **Semantic Understanding:** DistilBERT can understand the semantic similarity between track names. For example, it can recognize that "Upbeat Party Anthem" and "Dance All Night" are related by their shared theme of energetic, danceable music, even if they don't contain identical words.
* **Contextual Relationships:** Track names with similar themes or moods receive nearby embeddings, which lets the downstream model learn that such tracks tend to follow each other.

In contrast, Bag of Words treats each word independently, missing these contextual relationships and making it less effective for sequence-based tasks like predicting the next track.
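
The limitation described above can be illustrated with a small, self-contained sketch. It computes cosine similarity over raw word counts for the two example titles; `bow_cosine` is a hypothetical helper written for this illustration and is not part of the notebook's pipeline.

```python
from collections import Counter
import math


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two titles using raw word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    # Counter returns 0 for missing words, so only shared words contribute
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The two titles from the example share no words, so the score is exactly zero
no_overlap = bow_cosine("Upbeat Party Anthem", "Dance All Night")

# Titles only look similar to Bag of Words when they reuse the same words
some_overlap = bow_cosine("Dance All Night", "All Night Long")

print(no_overlap, some_overlap)
```

Under Bag of Words the two thematically related titles are orthogonal, whereas a dense embedding can place them close together even without shared vocabulary.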

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
path = 'https://raw.githubusercontent.com/Niyanta5/NumpyDataSet/refs/heads/main/df_merge.csv'

In [None]:
df = pd.read_csv(path)

In [None]:
df.shape

(100000, 15)

In [None]:
df.isnull().sum()

Unnamed: 0,0
track_pos,0
track_artist_name,0
track_track_name,0
track_duration_ms,0
track_album_name,0
playlist_name,0
playlist_num_artists,0
playlist_num_albums,0
playlist_num_tracks,0
playlist_num_followers,0


In [None]:
df.dropna(inplace=True)

In [None]:
# df = df_merge.copy()

In [None]:
df.head()

Unnamed: 0,track_pos,track_artist_name,track_track_name,track_duration_ms,track_album_name,playlist_name,playlist_num_artists,playlist_num_albums,playlist_num_tracks,playlist_num_followers,playlist_num_edits,playlist_duration_ms,playlist_collaborative,bag_of_words,sentiment_bag_of_words
0,0,The Jackson 5,ABC,174866,ABC,party party,116,142,152,1,3,39413578,False,jackson c easy love b baby michael sing come s...,0.7964
1,1,Streetlight Manifesto,Point/Counterpoint,327920,Everything Goes Numb,party party,116,142,152,1,3,39413578,False,know dont never would ill ive like wont cant im,0.1316
2,2,Michael Jackson,Billie Jean,293826,Thriller 25 Super Deluxe Edition,party party,116,142,152,1,3,39413578,False,jean one billie lover uh son baby kid hoo girl,0.5859
3,3,Green Day,Basket Case,181533,Dookie,party party,116,142,152,1,3,39413578,False,sometimes chorus give creeps mind plays tricks...,0.128
4,4,The White Stripes,Seven Nation Army,231800,Elephant,party party,116,142,152,1,3,39413578,False,im na gon back comin prechorus instrumental bl...,0.0


In [None]:
df.drop(columns=['track_pos'],inplace=True)

In [None]:
pip install --upgrade torch transformers




In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Use distilBERT model for faster performance
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print("DistilBERT model loaded successfully.")



DistilBERT model loaded successfully.


In [None]:
# The tokenizer and model were already loaded in the previous cell.
# Use the GPU for faster computation if available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

In [None]:
# import torch
# from transformers import AutoTokenizer, AutoModel
# import pandas as pd

# Function to extract embeddings
# def get_bert_embeddings_batch(track_names, batch_size=8):
#     batch_embeddings = []

#     for i in range(0, len(track_names), batch_size):
#         batch = track_names[i:i + batch_size]
#         print(f"Processing batch {i // batch_size + 1} of {-(-len(track_names) // batch_size)}")

#         # Tokenize the batch of track names
#         inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
#         inputs = {key: val.to(device) for key, val in inputs.items()}  # Move inputs to the same device as the model

#         # Get the embeddings for the batch without computing gradients
#         with torch.no_grad():
#             outputs = model(**inputs)

#         # Extract the CLS token's embeddings (sentence-level embeddings)
#         batch_cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
#         batch_embeddings.extend(batch_cls_embeddings)

#     return batch_embeddings


# # Ensure that the 'track_track_name' column exists in the DataFrame
# if 'track_track_name' not in df.columns:
#     raise ValueError("'track_track_name' column not found in the DataFrame")

# # Convert the 'track_track_name' column to a list
# track_names = df['track_track_name'].tolist()

# # Get embeddings for the track names in batches
# batch_embeddings = get_bert_embeddings_batch(track_names, batch_size=8)

# # Convert embeddings into a DataFrame
# embedding_dim = len(batch_embeddings[0])  # Length of one embedding (dimension size)
# embeddings_df = pd.DataFrame(batch_embeddings, columns=[f"dim_{i}" for i in range(embedding_dim)])

# # Add the embeddings back to the original DataFrame
# df = pd.concat([df, embeddings_df], axis=1)

# # Save the DataFrame with embeddings (optional)
# df.to_csv("output_with_embeddings.csv", index=False)

# print("Embeddings generated and added to the DataFrame.")
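
The commented-out cell above uses the first-token (`[CLS]`) hidden state as the vector for each track name. A common alternative is to average the token vectors while ignoring padding positions. The NumPy sketch below shows that pooling step on toy arrays; `masked_mean_pool` is a hypothetical helper, and in the notebook its inputs would correspond to `outputs.last_hidden_state` and `inputs["attention_mask"]` converted to NumPy. Whether it works better than the first-token vector for this dataset is not something the notebook tests.

```python
import numpy as np


def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden:         (batch, seq_len, dim) array of token embeddings
    attention_mask: (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts


# Toy batch: two sequences, three token slots, two embedding dimensions
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)
```

The padded slot in the first sequence does not affect its pooled vector, which is the point of applying the mask.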


**Embedding Generation Using DistilBERT**

**DistilBERT Setup**

1. **Model Initialization:** A DistilBERT model is initialized to convert track names into dense vector embeddings.

**Embedding Generation Process**

1. **Text Tokenization:** Track names are tokenized in batches.
2. **Embedding Generation:** The tokenized batches are fed into the DistilBERT model, and the 768-dimensional hidden state of the first token is kept as the embedding (semantic representation) of each track name.
3. **Feature Concatenation:** These embeddings, representing the semantic meaning of track names, are concatenated with other numerical features (like track duration, number of followers) and categorical features (encoded track/artist names).
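
The concatenation step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the embeddings are random placeholders standing in for real DistilBERT output, and the two numerical columns (duration and follower count) are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tracks, embedding_dim = 4, 768  # 768 matches DistilBERT's hidden size
# Placeholder embeddings (assumption): random values instead of real model output
embeddings = rng.normal(size=(n_tracks, embedding_dim))

# Toy numerical features: track duration in ms and playlist follower count
numeric = np.array([
    [174866, 1],
    [327920, 1],
    [293826, 1],
    [181533, 1],
], dtype=float)

# Standardize each numeric column; constant columns keep a divisor of 1
std = numeric.std(axis=0)
std[std == 0] = 1.0
numeric_scaled = (numeric - numeric.mean(axis=0)) / std

# One row per track: scaled numeric features followed by the embedding
features = np.hstack([numeric_scaled, embeddings])
print(features.shape)  # (4, 770)
```

Tree-based models such as XGBoost do not require feature scaling, since splits depend only on the ordering of values; the scaling here simply mirrors the preprocessing used later in the notebook.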

**Benefits of Using DistilBERT Embeddings**

* **Semantic Understanding:** DistilBERT captures the semantic meaning of track names, enabling the model to understand relationships between tracks based on their content and themes.
* **Improved Recommendation Accuracy:** With semantic information available, the model can suggest tracks that are similar not only in their explicit features but also in the meaning of their titles, which may improve accuracy.
* **Enhanced Cold-Start Recommendations:** For new tracks with little historical data, embeddings can still relate them to existing tracks through the semantic similarity of their names.

In [None]:
# Importing necessary library
import pandas as pd

# URL for the CSV file
url = "https://raw.githubusercontent.com/Niyanta5/NumpyDataSet/refs/heads/main/output_with_embeddings.csv"

# Reading the CSV file into a DataFrame
df_embeddings = pd.read_csv(url)

# Display the first few rows of the DataFrame to ensure it's read correctly
df_embeddings.head()


Unnamed: 0,track_pos,track_artist_name,track_track_name,track_duration_ms,track_album_name,playlist_name,playlist_num_artists,playlist_num_albums,playlist_num_tracks,playlist_num_followers,...,dim_758,dim_759,dim_760,dim_761,dim_762,dim_763,dim_764,dim_765,dim_766,dim_767
0,0.0,The Jackson 5,ABC,174866.0,ABC,party party,116.0,142.0,152.0,1.0,...,0.108369,0.072873,-0.038432,-0.094065,0.225188,-0.193449,0.127536,-0.10499,0.262299,0.372101
1,1.0,Streetlight Manifesto,Point/Counterpoint,327920.0,Everything Goes Numb,party party,116.0,142.0,152.0,1.0,...,0.285177,-0.254053,0.125463,-0.03481,0.069405,-0.014522,-0.041621,-0.155827,-0.055627,0.382087
2,2.0,Michael Jackson,Billie Jean,293826.0,Thriller 25 Super Deluxe Edition,party party,116.0,142.0,152.0,1.0,...,0.183318,-0.181992,-0.018647,0.03193,0.143976,-0.128821,0.031139,-0.162357,0.098784,0.255263
3,3.0,Green Day,Basket Case,181533.0,Dookie,party party,116.0,142.0,152.0,1.0,...,0.004572,-0.082458,0.125939,-0.043712,0.116102,-0.088013,-0.102764,-0.113838,0.109734,0.210027
4,4.0,The White Stripes,Seven Nation Army,231800.0,Elephant,party party,116.0,142.0,152.0,1.0,...,0.023708,-0.132355,0.095664,-0.22145,0.127104,0.007802,-0.063387,-0.030303,0.122991,0.279553


In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score

def advanced_preprocessing(df):
    # Work on a copy so the caller's DataFrame is not modified
    df_processed = df.copy()

    # Text columns to convert into integer codes
    categorical_cols = [
        'track_artist_name',
        'track_album_name',
        'track_track_name'
    ]

    # Label-encode each categorical column with its own encoder.
    # Fill missing values before casting: astype(str) would turn NaN into the string 'nan'.
    for col in categorical_cols:
        label_encoder = LabelEncoder()
        df_processed[f'{col}_encoded'] = label_encoder.fit_transform(df_processed[col].fillna('Unknown').astype(str))

    # Start the feature list with the encoded categorical columns
    feature_cols = [f'{col}_encoded' for col in categorical_cols]

    # Add every numeric column (this includes the dim_* embedding columns)
    numerical_cols = df_processed.select_dtypes(include=['float64', 'int64']).columns
    feature_cols.extend([col for col in numerical_cols if col not in feature_cols])

    # Prepare features
    X = df_processed[feature_cols]

    # Handle target variable.
    # NOTE: the encoded columns were appended above, so the last column is
    # 'track_track_name_encoded', a feature, and the first branch runs.
    # The target is therefore random 0/1 noise: a placeholder only. Replace it
    # with a real label before interpreting the prediction scores.
    if df_processed.columns[-1].startswith('dim_') or df_processed.columns[-1] in feature_cols:
        y = np.random.randint(0, 2, size=len(df_processed))
    else:
        # Use the last column as target
        y = df_processed[df_processed.columns[-1]]

    # Ensure binary classification: positive values map to 1, everything else to 0
    y = (pd.Series(y) > 0).astype(int)

    # Scale the features (the scaler is fitted on all rows, before the train/test split)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, y, df_processed
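
The function above falls back to a random 0/1 target, so it does not yet encode which track actually comes next. One possible way to derive a next-track label from the playlist data is sketched below on a toy DataFrame. It assumes that `playlist_name` identifies a playlist and that `track_pos` gives the order within it (both columns appear in the source data, but treating the name as a unique key is an assumption); `next_track_name` is a hypothetical column introduced for this illustration.

```python
import pandas as pd

# Toy data using the notebook's column names
toy = pd.DataFrame({
    "playlist_name": ["party party"] * 3 + ["chill"] * 2,
    "track_pos": [0, 1, 2, 0, 1],
    "track_track_name": ["ABC", "Point/Counterpoint", "Billie Jean", "Warmth", "Bukowski"],
})

# Order tracks within each playlist, then look one position ahead
toy = toy.sort_values(["playlist_name", "track_pos"])
toy["next_track_name"] = toy.groupby("playlist_name")["track_track_name"].shift(-1)

# The last track of each playlist has no successor, so drop those rows
pairs = toy.dropna(subset=["next_track_name"]).reset_index(drop=True)
print(pairs[["track_track_name", "next_track_name"]])
```

Each remaining row is a (current track, next track) pair that could serve as a positive training example; how to sample negatives is a separate design choice not covered here.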



In [None]:
def train_xgboost_recommendation_model(X, y):
    # Hold out 20% of the rows (the held-out split is not evaluated in this function)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # XGBoost model (use_label_encoder is deprecated in recent XGBoost releases and is omitted)
    model = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42,
        eval_metric='logloss'
    )

    # Train the model
    model.fit(X_train, y_train)

    # Predict the positive-class probability for every row, training rows included
    y_pred_proba = model.predict_proba(X)[:, 1]

    return y_pred_proba, model

def recommend_tracks_using_xgboost(X, initial_indices, df, y_pred_proba):
    # Create a DataFrame with predictions and original data
    pred_df = pd.DataFrame({
        'track_artist_name': df['track_artist_name'],
        'track_track_name': df['track_track_name'],
        'prediction_score': y_pred_proba
    })

    # Remove initial tracks from the recommendations
    initial_tracks = df.iloc[initial_indices]
    pred_df = pred_df[~pred_df['track_track_name'].isin(initial_tracks['track_track_name'])]

    # Sort by prediction score and select top recommendations
    top_recommendations = pred_df.sort_values('prediction_score', ascending=False).head(5)

    return top_recommendations




In [None]:
def main(df):
    # Preprocessing
    X_processed, y, original_df = advanced_preprocessing(df)

    # Train XGBoost model and get prediction probabilities
    y_pred_proba, model = train_xgboost_recommendation_model(X_processed, y)

    # Select initial tracks (replace with actual indices from your dataset)
    initial_indices = [0,88,2]  # Example initial tracks

    # Get top recommended tracks based on XGBoost predictions
    recommendations = recommend_tracks_using_xgboost(X_processed, initial_indices, original_df, y_pred_proba)


    return recommendations

# Usage
recommendations = main(df_embeddings)


In [None]:
recommendations

Unnamed: 0,track_artist_name,track_track_name,prediction_score
693,Tyga,Molly,0.986659
333,Travis Scott,goosebumps,0.985959
803,Moose Blood,Bukowski,0.982462
628,Noah Mac,Warmth,0.981924
316,A$AP Mob,Way Hii,0.980591


In [None]:
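def train_xgboost_recommendation_model(X_train, y_train):
    # Train an XGBoost classifier with default hyperparameters.
    # Note: this redefines the earlier function of the same name.
    model = xgb.XGBClassifier()  # xgb is the alias imported earlier (import xgboost as xgb)
    model.fit(X_train, y_train)
    return model

def recommend_tracks(X, df, model):
    # 1. Generate prediction probabilities for each track
    y_pred_proba = model.predict_proba(X)[:, 1]  # Probability of the positive class (track being liked)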

    # 2. Create DataFrame with track details and prediction scores
    pred_df = pd.DataFrame({
        'track_artist_name': df['track_artist_name'],
        'track_track_name': df['track_track_name'],
        'prediction_score': y_pred_proba
    })

    # 3. Sort by prediction score in descending order and return top 5
    top_recommendations = pred_df.sort_values('prediction_score', ascending=False).head(5)

    return top_recommendations


The code above performs the following steps:

1. **Generates probabilities for each track:** extracts the predicted probability that each track is liked (the positive class).
2. **Builds recommendation scores:** stores the artist name, track name, and prediction score in a DataFrame (`pred_df`).
3. **Returns top recommendations:** sorts tracks by prediction score in descending order and returns the five tracks with the highest likelihood of being liked.
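
The ranking step can also be sketched on its own with pandas. The example below adds two refinements that are assumptions rather than part of the code above: it excludes tracks already in the playlist and collapses repeated rows for the same track, which occur when a track appears in several playlists. The scores are made-up toy values, not model output.

```python
import pandas as pd

# Toy scores (invented for illustration); "Molly" appears twice on purpose
scores = pd.DataFrame({
    "track_artist_name": ["Tyga", "Tyga", "Travis Scott", "Noah Mac", "Michael Jackson"],
    "track_track_name": ["Molly", "Molly", "goosebumps", "Warmth", "Billie Jean"],
    "prediction_score": [0.91, 0.97, 0.95, 0.60, 0.99],
})
seed_tracks = {"Billie Jean"}  # hypothetical tracks already in the playlist

ranked = (
    scores[~scores["track_track_name"].isin(seed_tracks)]
    .sort_values("prediction_score", ascending=False)
    # After sorting, the first occurrence of a track is its highest-scoring row
    .drop_duplicates(subset=["track_artist_name", "track_track_name"])
    .head(3)
    .reset_index(drop=True)
)
print(ranked)
```

Sorting before `drop_duplicates` matters: it ensures that the row kept for each track is the one with the highest score.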









In [None]:
import tensorflow as tf

# Define a single-neuron model with a sigmoid output (weights are randomly initialized)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Define the inputs (as TensorFlow constants)
inputs = tf.constant([[0.0, 1.0], [1.0, 1.0]], dtype=tf.float32)

# Get predictions from the model
outputs = model(inputs)

# Print the outputs
print(outputs.numpy())  # Convert the tensor to numpy for easy printing


[[0.73830277]
 [0.42321122]]




## Conclusion

By combining XGBoost with DistilBERT's semantic representation of track names, this music recommendation system aims to provide more accurate and personalized recommendations and, in turn, a better user experience.