**Classifying Spotify Song Explicitness with LangChain and LLAMA3-70B**
This notebook demonstrates how to use LangChain to classify the explicitness of Spotify tracks. The classification is performed by the LLAMA3-70B tool-calling model, accessed through the Fireworks API.

We chose this approach because the Kaggle dataset we use, "Most Streamed Spotify Songs 2024", has been reported to contain some misclassified explicitness values, as discussed in this forum thread: https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024/discussion/513790

In [3]:
import pandas as pd
from typing import List, Literal

from langchain.tools import tool
from langchain_fireworks import ChatFireworks
from langchain.pydantic_v1 import BaseModel, Field
from langchain.prompts import ChatPromptTemplate


In [6]:
spotify_df = pd.read_csv('spotify_music_stream_data.csv', encoding='latin-1')

Using the Fireworks AI API, we can access the LLAMA3-70B model. More information is available here: https://docs.fireworks.ai/getting-started/quickstart

Pass the API key to the `key` variable.

In [7]:
# firefunction-v2-rc

key = ...
llama_model = ChatFireworks(model="accounts/fireworks/models/firefunction-v2-rc",fireworks_api_key=key)
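Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored under `FIREWORKS_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env_var: str = "FIREWORKS_API_KEY") -> str:
    """Return the API key stored in `env_var`, or fail with a clear message.

    Both this helper and the default variable name are assumptions
    made for this sketch.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return value
```

With this in place, `key = load_api_key()` could replace the hard-coded assignment above.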

In [8]:
spotify_df.head() 

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,SiriusXM Spins,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,684,62.0,17598718,114.0,18004655,22931,4818457.0,2669262,,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,3,67.0,10422430,111.0,7780028,28444,6623075.0,1118279,,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,536,136.0,36321847,172.0,5022621,5639,7208651.0,5285340,,0
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,2182,264.0,24684248,210.0,190260277,203384,,11822942,,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,1,82.0,17660624,105.0,4493884,7006,207179.0,457017,,1


In [10]:
track_df = spotify_df.iloc[:,:3]
tracks = track_df.to_dict(orient="tight")["data"]

# sample check
tracks[:5]

[['MILLION DOLLAR BABY', 'Million Dollar Baby - Single', 'Tommy Richman'],
 ['Not Like Us', 'Not Like Us', 'Kendrick Lamar'],
 ['i like the way you kiss me', 'I like the way you kiss me', 'Artemas'],
 ['Flowers', 'Flowers - Single', 'Miley Cyrus'],
 ['Houdini', 'Houdini', 'Eminem']]

In [12]:
# Define a tool whose Pydantic schema makes the model return a list of 0/1 ratings

@tool
class ClassifyExplicit(BaseModel):
    "output parser"
    explicit : List[Literal[0,1]] = Field(...,description="list containing  explicit ratings (0 or 1) as values. \
                                               Explicity: True=1 & False=0")
    

print(ClassifyExplicit)

name='ClassifyExplicit' description='output parser' args_schema=<class 'pydantic.v1.main.ClassifyExplicitSchema'> func=<class '__main__.ClassifyExplicit'>


In [14]:
# adding tools to the llama model
model_with_tools = llama_model.bind_tools([ClassifyExplicit])
model_with_tools.kwargs

{'tools': [{'type': 'function',
   'function': {'name': 'ClassifyExplicit',
    'description': 'output parser',
    'parameters': {'type': 'object',
     'properties': {'explicit': {'type': 'array',
       'items': {'enum': [0, 1], 'type': 'integer'}}},
     'required': ['explicit']}}}]}

In [15]:

prompt = ChatPromptTemplate.from_messages([
    ("system","You are a Song Track Explicity Classifier. Your task is to take the name of the track and classify it as explicit or not. \
     Value 1 represents explicity , and 0 as non-explicity track. Give the output response using the ClassifyExplicit tool."),
    ("user","List of tracks with format [track_name,Album_name,Artist]: {tracks}"),
])

# building chain
chain = prompt | model_with_tools 


Running the chain once per track returns one response per call. Each response looks like this:

```
['MILLION DOLLAR BABY', 'Million Dollar Baby - Single', 'Tommy Richman']
content='' additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_THUlk1IX0zajhwCWuXzowi1P', 'type': 'function', 'function': {'name': 'ClassifyExplicit', 'arguments': '{"explicit": [0]}'}}]} response_metadata={'token_usage': {'prompt_tokens': 438, 'total_tokens': 460, 'completion_tokens': 22}, 'model_name': 'accounts/fireworks/models/firefunction-v2-rc', 'system_fingerprint': '', 'finish_reason': 'tool_calls', 'logprobs': None} id='run-521f9d06-f300-4317-8846-1960c3e53008-0' tool_calls=[{'name': 'ClassifyExplicit', 'args': {'explicit': [0]}, 'id': 'call_THUlk1IX0zajhwCWuXzowi1P'}]
```
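The rating sits inside the `tool_calls` field of the response. Indexing it directly raises an error if the model answers without calling the tool, so a defensive reader may be useful. The sketch below works on plain dictionaries shaped like the `tool_calls` list shown above; the helper name `extract_explicit` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def extract_explicit(tool_calls: List[Dict[str, Any]]) -> Optional[int]:
    """Return the first 0/1 rating from a ClassifyExplicit tool call, else None."""
    for call in tool_calls:
        if call.get("name") != "ClassifyExplicit":
            continue
        ratings = (call.get("args") or {}).get("explicit")
        if isinstance(ratings, list) and ratings and ratings[0] in (0, 1):
            return ratings[0]
    return None


# The tool_calls list copied from the sample response above.
sample = [
    {
        "name": "ClassifyExplicit",
        "args": {"explicit": [0]},
        "id": "call_THUlk1IX0zajhwCWuXzowi1P",
    }
]
print(extract_explicit(sample))  # prints 0
print(extract_explicit([]))      # prints None
```

Returning `None` instead of raising lets the caller decide whether to retry the track or leave it unlabeled.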

In [23]:
# This cell iterates over the rows to classify each track's explicitness.
# Running it over the full dataset takes a long time.

track_score = []
for en,track in enumerate(tracks[:2]): # experimenting for 2 tracks only. 

    input_grid = {"tracks":track}
    response = chain.invoke(input=input_grid)
    print(track)
    print(response)
    res = response.tool_calls[0]["args"]["explicit"][0]
    track_score.append(res)

['MILLION DOLLAR BABY', 'Million Dollar Baby - Single', 'Tommy Richman']

content='' additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_THUlk1IX0zajhwCWuXzowi1P', 'type': 'function', 'function': {'name': 'ClassifyExplicit', 'arguments': '{"explicit": [0]}'}}]} response_metadata={'token_usage': {'prompt_tokens': 438, 'total_tokens': 460, 'completion_tokens': 22}, 'model_name': 'accounts/fireworks/models/firefunction-v2-rc', 'system_fingerprint': '', 'finish_reason': 'tool_calls', 'logprobs': None} id='run-521f9d06-f300-4317-8846-1960c3e53008-0' tool_calls=[{'name': 'ClassifyExplicit', 'args': {'explicit': [0]}, 'id': 'call_THUlk1IX0zajhwCWuXzowi1P'}]

['Not Like Us', 'Not Like Us', 'Kendrick Lamar']

content='' additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_RAE8NMVvS2KVVh8BS6V8MSsM', 'type': 'function', 'function': {'name': 'ClassifyExplicit', 'arguments': '{"explicit": [1]}'}}]} response_metadata={'token_usage': {'prompt_tokens': 430, 'total_tokens': 452, 'completio
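Because the `ClassifyExplicit` schema accepts a list of ratings, one way to reduce the number of requests might be to send several tracks per call. The sketch below only covers the bookkeeping: splitting the rows into batches and checking that a returned list has one valid rating per track. The helpers `chunked` and `labels_match_batch` are hypothetical, and whether the model reliably returns one rating per track in order is an assumption the notebook does not verify.

```python
from typing import Iterator, List, Sequence


def chunked(rows: Sequence[List[str]], size: int) -> Iterator[List[List[str]]]:
    """Yield consecutive batches of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def labels_match_batch(batch: Sequence[List[str]], labels: Sequence[int]) -> bool:
    """Check that there is exactly one 0/1 label for every row in the batch."""
    return len(labels) == len(batch) and all(label in (0, 1) for label in labels)


# First three rows from the sample check earlier in the notebook.
demo_tracks = [
    ["MILLION DOLLAR BABY", "Million Dollar Baby - Single", "Tommy Richman"],
    ["Not Like Us", "Not Like Us", "Kendrick Lamar"],
    ["i like the way you kiss me", "I like the way you kiss me", "Artemas"],
]
for batch in chunked(demo_tracks, 2):
    print(len(batch))  # prints 2, then 1
```

Each batch could then be passed as `{"tracks": batch}` to `chain.invoke`, with `labels_match_batch` run on the returned list before extending `track_score`; a batch that fails the check can fall back to one call per track.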

In [61]:
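# Requires the loop above to have run over all of `tracks`, so track_score has one label per row.
spotify_refined_df = spotify_df.copy()
spotify_refined_df["Explicit Classified"] = track_score
spotify_refined_df.to_csv("spotify_refined.csv")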

In [25]:
# Last column - Explicit Classified
spotify_refined_df.head()

Unnamed: 0,Track,Album Name,Artist,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,Deezer Playlist Count,Deezer Playlist Reach,Amazon Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazam Counts,TIDAL Popularity,Explicit Track,Explicit Classified
0,MILLION DOLLAR BABY,Million Dollar Baby - Single,Tommy Richman,4/26/2024,QM24S2402528,1,725.4,390470936,30716,196631588,...,62.0,17598718,114.0,18004655,22931,4818457.0,2669262,,0,0
1,Not Like Us,Not Like Us,Kendrick Lamar,5/4/2024,USUG12400910,2,545.9,323703884,28113,174597137,...,67.0,10422430,111.0,7780028,28444,6623075.0,1118279,,1,1
2,i like the way you kiss me,I like the way you kiss me,Artemas,3/19/2024,QZJ842400387,3,538.4,601309283,54331,211607669,...,136.0,36321847,172.0,5022621,5639,7208651.0,5285340,,0,1
3,Flowers,Flowers - Single,Miley Cyrus,1/12/2023,USSM12209777,4,444.9,2031280633,269802,136569078,...,264.0,24684248,210.0,190260277,203384,,11822942,,0,0
4,Houdini,Houdini,Eminem,5/31/2024,USUG12403398,5,423.3,107034922,7223,151469874,...,82.0,17660624,105.0,4493884,7006,207179.0,457017,,1,1


**Using the Llama model, we classified the tracks from their track, album, and artist names.** The model's labels disagree with the Kaggle dataset's `Explicit Track` column for 1,082 tracks, which we count as misclassified.


In [48]:
# Count rows where the model's label differs from the dataset's original flag.
mismatched = (spotify_refined_df["Explicit Track"] != spotify_refined_df["Explicit Classified"]).sum()

print(f"Total tracks found misclassified: {mismatched}")

Total tracks found misclassified: 1082
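The total alone does not show in which direction the labels disagree. A minimal sketch of a breakdown follows, using only the five `(Explicit Track, Explicit Classified)` pairs visible in the `head()` output above; the function name `disagreement_breakdown` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def disagreement_breakdown(pairs: Iterable[Tuple[int, int]]) -> Dict[str, int]:
    """Count disagreements by direction for (dataset_label, model_label) pairs."""
    counts = Counter(pairs)
    return {
        "dataset_clean_model_explicit": counts[(0, 1)],
        "dataset_explicit_model_clean": counts[(1, 0)],
    }


# (Explicit Track, Explicit Classified) for the five rows shown above.
head_pairs = [(0, 0), (1, 1), (0, 1), (0, 0), (1, 1)]
print(disagreement_breakdown(head_pairs))
# prints {'dataset_clean_model_explicit': 1, 'dataset_explicit_model_clean': 0}
```

On the full dataframe, the pairs could be built with `zip(spotify_refined_df["Explicit Track"], spotify_refined_df["Explicit Classified"])`.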


Here is the link for the refined dataset:
https://www.kaggle.com/datasets/pragyantiwari/spotify-refined-explicity-classified-1

If you found this notebook informative and helpful, like and share it 🙂.
