# Song Lyric Challenge

**Author**: Cliff Hammett

**Last update**: 12/11/2024

---

For these challenges, we will analyse song lyrics from popular artists, including BTS, Taylor Swift, Beyoncé and others, using Natural Language Processing and Sentiment Analysis. For Challenges A and B, we will use the following dataset:

[Song Lyrics dataset on Kaggle](https://www.kaggle.com/datasets/deepshah16/song-lyrics-dataset)

This is located in the directory `data_challengeA+B/`. The dataset contains some duplicate songs in the form of remixes, which have the same or largely similar lyrics. You can attempt to deduplicate the data; otherwise, proceed with the analysis but bear this limitation in mind. Note too that the lyric field sometimes contains information about which artist is singing at a given point.
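If you do attempt deduplication, one possible starting point is sketched below. It treats two lyrics as duplicates when their sets of distinct words overlap heavily (Jaccard similarity). The function names and the 0.9 threshold are illustrative assumptions rather than part of the dataset or any library, and the example strings are made up.

```python
import re


def word_set(lyric):
    """Lower-case a lyric and return its set of distinct words."""
    return set(re.findall(r"\w+", lyric.lower()))


def drop_near_duplicates(lyrics, threshold=0.9):
    """Keep the first of any group of lyrics whose word sets overlap heavily.

    Similarity is the Jaccard index (shared words / all words). The 0.9
    threshold is an arbitrary starting point, not a tuned value.
    """
    kept, kept_sets = [], []
    for lyric in lyrics:
        words = word_set(lyric)
        is_duplicate = any(
            len(words & seen) / len(words | seen) >= threshold
            for seen in kept_sets
            if words | seen  # skip the comparison when both lyrics are empty
        )
        if not is_duplicate:
            kept.append(lyric)
            kept_sets.append(words)
    return kept


# Made-up example lyrics, for illustration only.
songs = [
    "dancing in the rain tonight la la la",
    "Dancing in the rain, tonight! La la la",  # "remix": same words
    "walking home alone under city lights",
]
unique_songs = drop_near_duplicates(songs)
```

With a data frame, something like `drop_near_duplicates(df.Lyric.dropna().tolist())` should work. Every song is compared against every kept song, which is fine for one artist but may be slow across the whole dataset.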

The above dataset is not suitable for sentiment analysis, because it is missing punctuation/line breaks needed to break it into smaller units. So for Challenge C, on sentiment analysis, we will look at the following dataset, which breaks the data into lines:

[Taylor Swift lyrics dataset](https://www.kaggle.com/datasets/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums)

This is located in the directory `data_challengeC/`.

## Challenge A: Analyse an artist’s lyrics

**Knowledge required:** You will need to use a Natural Language Processing (NLP) library, such as spaCy in Python, or udpipe in R. You will also need basic knowledge of data frames. For Python users, I have included code that puts a single artist’s lyrics into one string for analysis, to allow you to concentrate on analysing the text (this should be easier to achieve in R).

**Skills gained:** Practice in NLP skills, focussing on Parts of Speech (POS) analysis.

For this challenge, pick a single artist from the datasets available in the directory `data_challengeA+B/`. Excluding stop words, perform a parts of speech analysis on lemmas to identify the following about this artist’s song lyrics:

* What are the 10 most frequently used verbs?
* What are the 10 most frequently used nouns?
* What are the 10 most frequently used adjectives?
* What are the 10 most frequently used adverbs?


Here are some useful libraries you might want to utilise in your investigation.

```python
import pandas as pd
import spacy
```

The code below will load one artist and put their lyrics in a single string. Change the filename to your chosen artist (look in the directory for filenames).

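```python
# Change this to the CSV file for your chosen artist.
file = "ArianaGrande.csv"
df = pd.read_csv('data_challengeA+B/' + file)
# Join every song's lyrics into one string, separated by '; '.
lyrics = df.Lyric.str.cat(sep='; ')
```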
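As a hint for the counting step, here is a minimal sketch of one way to tally lemmas by part of speech. The helper name `top_lemmas` and the stand-in tokens are assumptions made for illustration. The function relies only on the `lemma_`, `pos_`, `is_stop` and `is_alpha` attributes that spaCy tokens expose, so a real spaCy `Doc` can be passed in place of the demo list.

```python
from collections import Counter, namedtuple


def top_lemmas(tokens, pos, n=10):
    """Return the n most common lemmas with the given coarse POS tag.

    `tokens` can be a spaCy Doc, or any iterable of objects exposing
    lemma_, pos_, is_stop and is_alpha.
    """
    counts = Counter(
        tok.lemma_.lower()
        for tok in tokens
        if tok.pos_ == pos and not tok.is_stop and tok.is_alpha
    )
    return counts.most_common(n)


# Stand-in tokens so the sketch runs without a language model.
# With spaCy you would instead do something like:
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp(lyrics)
# and pass `doc` in place of `demo`.
Tok = namedtuple("Tok", "lemma_ pos_ is_stop is_alpha")
demo = [
    Tok("love", "VERB", False, True),
    Tok("be", "AUX", True, True),
    Tok("love", "VERB", False, True),
    Tok("night", "NOUN", False, True),
    Tok("run", "VERB", False, True),
]
top_verbs = top_lemmas(demo, "VERB")
```

The coarse tags for the four questions are `"VERB"`, `"NOUN"`, `"ADJ"` and `"ADV"`. If the combined lyrics string is longer than spaCy's `nlp.max_length`, you may need to raise that limit or process the songs one at a time with `nlp.pipe`.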

Add more cells as needed, and document as you go.

What does this suggest to you about the themes that are covered in this artist’s lyrics? And what are the limitations of this approach? Write some reflections.

If, based on your reflections, you can make some improvements to your approach, then make them below:


## Challenge B: Compare three artists

**Knowledge required:** You will need to use a Natural Language Processing (NLP) library, such as spaCy in Python, or udpipe in R. You will also need basic knowledge of data frames.

**Skills practised:** Practice in NLP skills, focussing on Parts of Speech (POS) analysis, interpreting results.

Perform a comparative analysis of at least two further artists, to identify the following:

How often are your first artist’s top 10 verb, noun, adjective and adverb lemmas used by these two artists?

What are the two additional artists’ top 10 verb, noun, adjective and adverb lemmas?

You might want to write a function for some of the steps to make some of this easier.
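For instance, a small lookup function could report how often each of the first artist's top lemmas appears for the other artists. This is only a sketch: the name `usage_of` and the counts are hypothetical. It assumes you have already built one `Counter` of lemma frequencies per artist.

```python
from collections import Counter


def usage_of(reference_lemmas, counts_by_artist):
    """For each reference lemma, look up its raw count for every artist."""
    return {
        lemma: {
            artist: counts[lemma]  # a Counter returns 0 for unseen lemmas
            for artist, counts in counts_by_artist.items()
        }
        for lemma in reference_lemmas
    }


# Hypothetical counts for illustration only; not taken from the dataset.
verb_counts = {
    "artist_b": Counter({"love": 40, "want": 25, "go": 10}),
    "artist_c": Counter({"know": 30, "love": 5}),
}
table = usage_of(["love", "know"], verb_counts)
```

The nested dictionary can be passed to `pd.DataFrame(table)` if you prefer to inspect the result as a table.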

How similar or different are the two new artists from the first artist? What does this tell you about the original artist you analysed? What would you attribute this to? Does it challenge any of your original ideas?

Can you change these measures so that they show how frequently each word is used relative to other content words (e.g. as a proportion or a percentage)? Does this change your analysis?
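A relative measure can be obtained by dividing each count by the total number of counted content words. The sketch below assumes a `Counter` of content-word lemmas already exists. The function name and the figures are made up for illustration.

```python
from collections import Counter


def as_percentages(counts):
    """Convert raw lemma counts into percentages of all counted lemmas."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lemma: 100 * n / total for lemma, n in counts.items()}


# Hypothetical counts for illustration only.
content_counts = Counter({"love": 30, "night": 15, "good": 5})
shares = as_percentages(content_counts)
```

Which words go into the denominator is a design choice. Counting all content words gives a different picture from counting only words of the same part of speech, so it is worth stating which you used.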

## Challenge C: Sentiment analysis

**Knowledge required:** You will need to use a sentiment analysis library, such as nltk/vader in Python, or syuzhet in R. You will also need basic knowledge of data frames.

**Skills practised:** Use of and interpretation of sentiment analysis

Focus again on one artist. Find a biographical article about the artist, and see how their career is split into different phases. Split the data into these phases (e.g. by year of release), and extract the lyrics from these to compare how sentiment has changed.

Calculate average positive or negative sentiment for lyrics in these phases. Is there a meaningful change in sentiment between the chosen phases? How would you account for that change?

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
```
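One way to organise the phase comparison is sketched below. It assumes you can build `(year, line)` pairs from the data frame. The phase boundaries, function names and toy scorer are all hypothetical. The scorer is passed in as a function, so the sketch runs without the VADER lexicon. In practice you would pass a function that returns VADER's `compound` score.

```python
from collections import defaultdict
from statistics import mean


def phase_of(year, phases):
    """Return the name of the phase whose inclusive year range contains year."""
    for name, (start, end) in phases.items():
        if start <= year <= end:
            return name
    return None


def mean_score_by_phase(rows, phases, score):
    """Average score(line) over the lyric lines that fall in each phase.

    rows is an iterable of (year, line) pairs; score maps text to a number.
    """
    grouped = defaultdict(list)
    for year, line in rows:
        phase = phase_of(year, phases)
        if phase is not None:
            grouped[phase].append(score(line))
    return {phase: mean(values) for phase, values in grouped.items()}


def toy_score(text):
    """Toy stand-in for a sentiment scorer: +1 per 'happy', -1 per 'sad'."""
    return text.count("happy") - text.count("sad")


# With VADER you would replace toy_score with something like:
#   nltk.download("vader_lexicon")
#   sia = SentimentIntensityAnalyzer()
#   score = lambda text: sia.polarity_scores(text)["compound"]
phases = {"early": (2006, 2012), "later": (2013, 2017)}  # hypothetical phases
rows = [(2006, "happy happy"), (2010, "sad"), (2014, "happy"), (1999, "sad")]
averages = mean_score_by_phase(rows, phases, toy_score)
```

Lines whose year falls outside every phase are ignored. A plain mean can hide a wide spread of scores, so it may be worth checking the distribution before calling a difference meaningful.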

As a final challenge, see if you can write a loop that applies sentiment analysis to the lyrics of all the artists, to see whose lyrics have the most positive and most negative sentiment of the artists offered. Is it who you expected?
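The loop could take roughly the following shape. This sketch reads the files with the standard-library `csv` module rather than pandas, treats each CSV as one artist, and assumes the `Lyric` column used earlier. The function name `rank_artists` is hypothetical, and the scorer is again passed in as a function.

```python
import csv
from pathlib import Path
from statistics import mean


def rank_artists(data_dir, score):
    """Return (artist, mean score) pairs, most positive first.

    Each CSV in data_dir is treated as one artist, and score(text) is
    applied to every non-empty value in its Lyric column.
    """
    results = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            scores = [
                score(row["Lyric"])
                for row in csv.DictReader(handle)
                if row.get("Lyric")
            ]
        if scores:
            results.append((path.stem, mean(scores)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A call such as `rank_artists("data_challengeA+B", score)` would then give the ranking, where `score` returns VADER's compound value for a text. Because this dataset lacks line breaks, each score covers a whole song, so treat the results as a rough comparison.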