# Naive Bayes Classifier for Determining Whether a Song is Explicit

Our goal is to create a Naive Bayes Classifier model for determining whether a song is explicit given its lyrics. Our data was obtained using the Musixmatch API. 

We retrieved the 13 top artists in the US and, for each artist, all of their songs. For each song, we then retrieved the lyrics and whether the song is explicit, and wrote that data to a .csv file.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import ast
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

## Data Collection

To collect the data, we used the Python wrapper for the Musixmatch API. The code used to obtain the data is contained in `getData.ipynb`. 

## Data Preparation

To prepare the data, we stem the song lyrics and remove stop words using `nltk`. Then, we use a `CountVectorizer` to convert the lyrics into a matrix of word counts. Finally, we group features and labels into a format compatible with `nltk.NaiveBayesClassifier`.

First, we read in the data. Then, we stem it and remove stop words to make things simpler for our classifier.

In [None]:
tracks = pd.read_csv("data/tracks.csv",index_col=0)
tracks.drop_duplicates(inplace=True)
tracks.dropna(inplace=True)
tracks.reset_index(inplace=True)
del tracks["index"]

# Stemming words and removing stop words
nltk.download('stopwords')
stopWords = stopwords.words('english')
stemmer = PorterStemmer()

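# Stem each song's lyrics, dropping stop words along the way
def stemLyrics(stop_words):
    stop_words = set(stop_words)  # set membership is faster than scanning a list
    stemmedLyricsList = []
    for lyrics in tracks['lyrics']:
        # split() with no argument also splits on newlines and tabs, not just spaces
        lyricsList = lyrics.split()
        stemmedLyrics = [stemmer.stem(word) for word in lyricsList if word.lower() not in stop_words]
        stemmedLyrics = ' '.join(stemmedLyrics)
        stemmedLyricsList.append(stemmedLyrics)
    return stemmedLyricsList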

stemmedLyrics = stemLyrics(stopWords)
tracks = tracks.assign(stemmed_lyrics=stemmedLyrics)
tracks

stemmed_lyrics = tracks['stemmed_lyrics']
y = tracks['explicit']

print(len(tracks))
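To see what this preprocessing does to a single line of text, here is a minimal sketch. The sentence, the tiny `STOP_WORDS` set, and the `stem_text` helper are made up for illustration; the notebook itself uses the full `nltk` English stop-word list.

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word set (assumption: stands in for nltk's English list)
STOP_WORDS = {"the", "is", "a", "and", "in"}

def stem_text(text, stop_words):
    """Drop stop words, then stem each remaining word (hypothetical helper)."""
    stemmer = PorterStemmer()
    kept = [stemmer.stem(word) for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

example = stem_text("The singer is singing songs in the rain", STOP_WORDS)
print(example)
```

Stop words disappear entirely, and inflected forms such as "songs" collapse to a shared stem, which shrinks the vocabulary the classifier has to handle.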

Now, we supply the lyrics to a count vectorizer to get the counts of each word present. Then, we put our data into the format that `nltk` expects.

In [None]:
vectorizer = CountVectorizer()
vec = list(vectorizer.fit_transform(stemmed_lyrics).toarray())
words = vectorizer.get_feature_names_out()
tracks = tracks.assign(vectorized_lyrics=vec)
tracks

# featuresets have the form ({"feature1": count, "feature2": count, ...}, isExplicit)
featuresets = []
for i in range(len(y)):
    counts = tracks['vectorized_lyrics'].iloc[i]
    # Pair each vocabulary word with its count for song i
    featureX = {word: count for word, count in zip(words, counts)}
    featuresets.append((featureX, y.iloc[i]))

## Machine Learning Model

Now, we can define a function which will train a Naive Bayes Classifier on supplied data, and then test the classifier and return both the classifier and the metrics.

In [None]:
# Naive bayes classifier function
def naive_bayes(featuresets, printMetrics=True, showConfusionMatrix=False):

    # Split into training and testing
    trainSet, testSet = train_test_split(featuresets)
    # Train the classifier
    classifier = nltk.NaiveBayesClassifier.train(trainSet)

    # Predict the test data
    # zip(*testSet) separates the (features, label) pairs into two sequences
    test_features, testY = zip(*testSet)
    predictY = [classifier.classify(features) for features in test_features]

    # Get metrics for classifier (precision, recall, fscore, support)
    metrics = {}
    p,r,f,s = precision_recall_fscore_support(testY,predictY)

    metrics["precision"] = p
    metrics["recall"] = r
    metrics["f-score"] = f
    metrics["support"] = s

    if printMetrics:
        for metric in metrics.keys():
            print(f"{metric}: {metrics[metric]}")

    # Plot confusion matrix
    if showConfusionMatrix:
        labels = ["Not Explicit", "Explicit"]
        confusionMatrix = confusion_matrix(testY,predictY)
        display = ConfusionMatrixDisplay(confusion_matrix=confusionMatrix, display_labels=labels)
        display.plot()

    return classifier, metrics
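For a sense of what the training and classification calls do on their own, here is a minimal sketch on made-up data with a single feature. The `toy_train` list and the `curse` feature name are assumptions for illustration only.

```python
import nltk

# Made-up training data: each item is (feature dict, label)
toy_train = [
    ({"curse": 1}, True), ({"curse": 1}, True), ({"curse": 1}, True),
    ({"curse": 0}, False), ({"curse": 0}, False), ({"curse": 0}, False),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Classify an unseen song whose only feature is one occurrence of "curse"
prediction = toy_classifier.classify({"curse": 1})
print(prediction)
```

Note that `nltk.NaiveBayesClassifier` treats each feature value as a discrete category, so a count of 2 and a count of 3 are simply two different values rather than points on a numeric scale.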

Use the `naive_bayes` function to get a classifier and metrics for the dataset. Then, display the most informative features for the model.

In [None]:
# Run naive bayes function
classifier, metrics = naive_bayes(featuresets,showConfusionMatrix=True)
# classifier, metrics = naive_bayes(data['featureset'],showConfusionMatrix=True)
classifier.show_most_informative_features(15)
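When reading the printed metrics, keep in mind that `precision_recall_fscore_support` returns one value per class when called without an `average` argument, as it is in `naive_bayes`. A small sketch with made-up labels (0 for not explicit, 1 for explicit) shows the shape of the output:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for illustration: 0 = not explicit, 1 = explicit
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]

p, r, f, s = precision_recall_fscore_support(true_labels, predicted_labels)

# Each result is an array with one entry per class, ordered by sorted label
print("precision:", p)
print("recall:", r)
print("f-score:", f)
print("support:", s)
```

Here the first entry of each array describes the non-explicit class and the second the explicit class; `support` is the number of true examples of each class in the test set.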


## Feature Selection

We now keep only a subset of the features, the 30 most informative ones according to the classifier trained above, and see how the model performs. With fewer features, models train more quickly.

In [None]:
# Drop unimportant features
num_most_informative = 30
most_informative_features = [featureName for (featureName,_) in classifier.most_informative_features(num_most_informative)]
informative_featuresets = []
for (featureDict,isExplicit) in featuresets:
    newFeatureDict = {}
    for word in featureDict.keys():
        if word in most_informative_features:
            newFeatureDict[word] = featureDict[word]
    informative_featuresets.append((newFeatureDict,isExplicit))
informative_featuresets
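The filtering step can also be written as a small reusable helper. This is a sketch on made-up data; `keep_features` and `toy` are hypothetical names, and converting the allowed names to a set is an optional speed-up, not something the notebook does.

```python
def keep_features(featuresets, allowed):
    """Return featuresets restricted to the feature names in `allowed`."""
    allowed = set(allowed)  # set lookup avoids scanning a list for every word
    return [
        ({name: value for name, value in features.items() if name in allowed}, label)
        for features, label in featuresets
    ]

# Made-up featuresets for illustration
toy = [
    ({"love": 2, "song": 1, "curse": 0}, False),
    ({"love": 0, "song": 1, "curse": 3}, True),
]

reduced = keep_features(toy, ["curse", "love"])
print(reduced)
```

The labels are carried over unchanged; only the dictionaries shrink.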

Supply the new dataset with fewer features to `naive_bayes`:

In [None]:
# Make new classifier based only on most informative features
informative_classifier, informative_metrics = naive_bayes(informative_featuresets,showConfusionMatrix=True)

As shown by the metrics displayed above, recall is very low. The confusion matrix tells us that dropping less informative features results in a large number of false negatives. 
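The link between false negatives and recall follows directly from the standard definitions. Writing $TP$, $FP$, and $FN$ for the true positives, false positives, and false negatives of the explicit class:

```latex
\[
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
\]
```

Every explicit song that the classifier labels as not explicit adds to $FN$, which enlarges the denominator of recall without touching precision.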

## Validation

We now define a Monte Carlo cross validation function that trains `n` Naive Bayes classifiers, each on a randomly chosen train/test split. The function returns the metrics averaged over those classifiers.

In [None]:
# Monte Carlo cross validation: average the metrics over n random train/test splits
def MonteCarloCrossValidation(featuresets,n):
    metrics_averages = {"precision": 0, "recall": 0, "f-score": 0, "support": 0}
    for _ in range(n):
        _, metrics = naive_bayes(featuresets,printMetrics=False)
        for metric in metrics.keys():
            metrics_averages[metric] += metrics[metric]
    for metric in metrics_averages.keys():
        metrics_averages[metric] = metrics_averages[metric]/n

    # Iterate over metrics_averages, not the loop-local metrics dict from the last run
    for metric in metrics_averages.keys():
        print(f"average {metric}: {metrics_averages[metric]}")
    return metrics_averages
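The repeated random splitting can be pictured separately from the classifier. The sketch below uses scikit-learn's `ShuffleSplit`, which the notebook does not use (it calls `train_test_split` repeatedly instead), on a made-up list of eight items standing in for the featuresets.

```python
from sklearn.model_selection import ShuffleSplit

items = list(range(8))  # stand-in for a list of featuresets (assumption)

# Five independent random splits; 25% test matches train_test_split's default
splitter = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = [
    (list(train_idx), list(test_idx))
    for train_idx, test_idx in splitter.split(items)
]

for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Unlike k-fold cross validation, the test sets of different runs may overlap, and some items may never land in a test set at all; the averaged metrics are an estimate over random splits rather than over a partition of the data.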

Run the Monte Carlo cross validation on the data with all the features:

In [None]:
# Performance of classifier using all features
metrics_averages_all = MonteCarloCrossValidation(featuresets,10)

Now, we run Monte Carlo cross validation on the data with only the most informative features:

In [None]:
metrics_averages_most_informative = MonteCarloCrossValidation(informative_featuresets,20)

The validation runs above support the claim that dropping less informative features results in a large number of false negatives and therefore low recall. For this model, keeping all the features is the better choice.