# Love in Music Genres

## Overview

The main purpose of this project is to draw some general conclusions about how artists in different musical genres approach the theme of love in their songs.

To achieve this goal, the project will examine a dataset of songs, searching for patterns that repeat within the same musical genre and that best represent the topic of love, but are also distinctive for the genre itself.

The analysis follows these steps:

1. Construction of the dataset (of love songs only, divided by genre)
2. Retrieval of the lyrics for each song
3. Extraction, for each song, of the keywords related to the concept of love (two techniques will be used: Word2Vec and the inspection of the feature coefficients of a LinearSVC classifier trained on the dataset)
4. Construction of a set of 30 representative keywords for each genre from the keywords of each song
5. Sentiment analysis on the songs
6. Sentiment analysis on the genre keywords

The experimental results will be presented in the following two ways:

1. Comparison of the keywords obtained with the two methods
2. Comparison of the results of sentiment analysis

## Procedure

### Step 0: Initialization

For the purposes of the project, a class named `Engine` has been developed. This class will function as an interface with other classes, facilitating the implementation of all the steps outlined above.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install gensim
!pip install nltk
!pip install bornrule

In [11]:
import nltk
import pandas as pd

from engine import Engine

# Download the NLTK part-of-speech tagger model, then initialize the engine
nltk.download('averaged_perceptron_tagger', quiet=True)
Engine.init()

### Step 1: Building the dataset (getting the songs)

The dataset is built through the Spotify API by fetching 100 love songs for each of the seven genres considered (blues, country, metal, pop, rap, rock, soul).

To achieve this, specific Spotify playlists were chosen in advance based on the genre, with a focus on love songs only. The playlist IDs were extracted and used through the API to retrieve 100 songs randomly chosen from each playlist.

As a result, the final dataset contains a total of 700 songs.

Note that if the database is already populated with songs, this method does nothing. To run it again, first clean and reset the database (see the appendix).

In [None]:
#Populating the dataset with 700 different love songs, 100 for each genre.
Engine.load_songs()

### Step 2: Download the lyrics for each song

For each song, the title and artist are isolated and used to query the API of genius.com (a well-known website that provides song lyrics).

The Genius API returns the full link to the lyrics page of the searched song. That page is then scraped to isolate the lyrics, which are cleaned of unnecessary data and added to the dataset.

Note: if the `load_songs()` call completes without errors, you can skip `load_lyrics()`, because it is already embedded in the `load_songs()` method.

In [None]:
#Retrieving the lyrics from genius.com website -> result will be written in the column "lyrics" of the table "songs" of our SQLite database
Engine.load_lyrics()

### Step 3: Extracting the keywords bound to "love" for each song

For each song, love-related keywords will be extracted using two different methodologies:

1. **Word2Vec:** keeping the words whose vectors are sufficiently similar to the vector of the word "love".

2. **LinearSVC:** keeping the words with the highest coefficients in a classifier trained on the dataset.

#### Step 3.1 Keywords using Word2Vec

Word2Vec is a technique that represents words in a vector space, where words that occur in similar contexts are assigned similar vectors. In this case study, we keep as keywords the words whose vector similarity with the word "love" is above a certain threshold.

We will use a pre-trained Word2Vec model to achieve this result.

In [None]:
#Using word2vec to get the keywords bound to the concept of love
# >>> Result will be written in the column "keywords_w2v" of the table "songs" of our SQLite database
Engine.load_w2v_keywords()
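
The thresholding idea described in this step can be sketched in a few lines. This is only an illustration, not the code behind `Engine.load_w2v_keywords()`: the three-dimensional vectors, the `love_keywords` helper, and the 0.8 threshold are all made up for the example, and cosine similarity is assumed as the similarity measure. With a real pre-trained model loaded through gensim, the similarity lookup would typically go through `KeyedVectors.similarity` instead of the hand-written function.

```python
import math

# Toy 3-dimensional vectors standing in for a pre-trained Word2Vec model.
# The values are invented for illustration only.
TOY_VECTORS = {
    "love":  [0.9, 0.1, 0.0],
    "heart": [0.8, 0.2, 0.1],
    "kiss":  [0.7, 0.3, 0.0],
    "car":   [0.0, 0.1, 0.9],
    "road":  [0.1, 0.0, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def love_keywords(tokens, vectors, target="love", threshold=0.8):
    """Return the distinct tokens whose similarity to `target` exceeds `threshold`.

    Tokens missing from the vocabulary are skipped (hypothetical helper).
    """
    target_vector = vectors[target]
    keywords = []
    for token in tokens:
        if token not in vectors or token in keywords:
            continue
        if cosine_similarity(vectors[token], target_vector) > threshold:
            keywords.append(token)
    return keywords


lyrics_tokens = ["my", "heart", "drives", "a", "car", "down", "the", "road", "kiss"]
print(love_keywords(lyrics_tokens, TOY_VECTORS))
```

In this toy vocabulary only "heart" and "kiss" point in roughly the same direction as "love", so they are the only tokens kept; raising the threshold makes the selection stricter.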

#### Step 3.2 Keywords from LinearSVC's feature study

LinearSVC is a linear classification algorithm that belongs to the family of Support Vector Machines. In essence, it looks for a hyperplane that separates the classes in the feature space with the largest possible margin.

For the project's purpose, LinearSVC is trained on the entire dataset. After the training is complete, the algorithm's coefficients assigned to each feature are analyzed. This analysis helps identify the most important features at the individual class level. In this case, these features are the words that lyrics are composed of, and those with coefficients above a certain threshold will be chosen as representative keywords for each sample.

In [None]:
#Training LinearSVC in order to get the coefficients for each word, and then extracting the words with high coefficient
# >>> Result will be written in the column "keywords_tc" of the table "songs" of our SQLite database
Engine.load_tc_keywords()

### Step 4: Getting 30 representative keywords for each genre

Starting from the keywords computed for each song (both from Word2Vec and LinearSVC), the TF-IDF score of each word is calculated with respect to the genre. The "documents" for the TF-IDF calculation are built by merging, at the genre level, all the keywords from all the songs.

This process gives a high score to words that are highly distinctive for a genre, and a low score both to frequent words common to all genres and to infrequent words.

For each genre, the top 30 words will be extracted, ordered by score. This procedure will be performed for both Word2Vec and LinearSVC keywords.

In [None]:
#Using TfIdf to get the words that better identify the genre
# >>> Result will be written in the columns "top_kw_w2v" (for the word2vec words) and "top_kw_tc" (for the LinearSVC words) of the table "genres" of our SQLite database
Engine.load_genre_kws()
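
To make the genre-level scoring concrete, here is a minimal sketch in plain Python. It assumes the textbook weighting `tf * log(N / df)`, where each genre is one document; the formula actually used by `Engine.load_genre_kws()` may differ (library implementations often smooth the logarithm). The function names and the toy keyword lists are hypothetical.

```python
import math
from collections import Counter


def genre_tfidf(genre_keywords):
    """Score each word per genre with a plain tf * log(N / df) weighting.

    `genre_keywords` maps a genre to the merged keyword list of its songs,
    so each genre plays the role of one document.
    """
    n_genres = len(genre_keywords)
    # Document frequency: number of genres in which each word appears.
    doc_freq = Counter()
    for words in genre_keywords.values():
        doc_freq.update(set(words))

    scores = {}
    for genre, words in genre_keywords.items():
        counts = Counter(words)
        total = len(words)
        scores[genre] = {
            word: (count / total) * math.log(n_genres / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_keywords(scores, k=30):
    """Return the k highest-scoring words of each genre, best first."""
    return {
        genre: sorted(word_scores, key=word_scores.get, reverse=True)[:k]
        for genre, word_scores in scores.items()
    }


# Made-up keyword lists: "love" appears in every genre.
toy = {
    "blues": ["love", "rain", "rain", "lonely"],
    "metal": ["love", "pain", "pain", "dead"],
    "soul": ["love", "sweet", "sweet", "free"],
}
scores = genre_tfidf(toy)
print(top_keywords(scores, k=2))
```

In the toy data "love" occurs in all three genres, so its inverse document frequency is `log(3 / 3) = 0` and it never ranks among the top words, while a word repeated within a single genre ("rain" for blues) ranks first.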

### Step 5: Performing a sentiment analysis over the songs

A pre-trained zero-shot text classification model will be utilized to perform sentiment analysis on the lyrics of each song. The sentiment will be categorized as either "positive" or "negative."

In [None]:
#Using a zero-shot text-classification pre-trained model we perform a sentiment analysis over the song lyrics
# >>> Result will be written in the column "sentiment_zs" of the table "songs" of our SQLite database
Engine.load_song_sentiment()

### Step 6: Performing a sentiment analysis over the genre keywords

Using the same pre-trained zero-shot text classification model, a second sentiment analysis will be conducted on the 30 representative keywords for each genre, derived from both Word2Vec and LinearSVC.

In [None]:
#Using a zero-shot text-classification pre-trained model we perform a sentiment analysis over the representative keywords for each genre
# >>> Result will be written in the columns "sentiment_zs_w2v" (for the word2vec words) and "sentiment_zs_tc" (for the LinearSVC words) of the table "genres" of our SQLite database
Engine.load_genre_sentiment()

## Experimental Results

### About keywords

The following code shows two facts about the keywords:
1. Unique keywords for each genre (keywords that are important within a genre and do not appear significantly in other genres)
2. The ratio of overlapping keywords (the number of keywords in common between Word2Vec and LinearSVC over the total number of distinct keywords)

In [13]:
unique_w2v, unique_tc, over_keywds = Engine.compare_genre_keywords()
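
Both measures can be expressed with set operations. The sketch below is an assumption about how they could be computed, not the body of `Engine.compare_genre_keywords()`: the overlap ratio is written as intersection over union (the Jaccard index), and uniqueness is simplified to a strict "appears in no other genre" test, whereas the project's notion of "not appearing significantly" may be softer. The helper names and keyword lists are hypothetical.

```python
def overlap_ratio(keywords_a, keywords_b):
    """Shared keywords divided by the number of distinct keywords overall."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def unique_keywords(genre_keywords):
    """For each genre, keep the keywords found in no other genre (order preserved)."""
    result = {}
    for genre, words in genre_keywords.items():
        others = set()
        for other_genre, other_words in genre_keywords.items():
            if other_genre != genre:
                others.update(other_words)
        result[genre] = [word for word in words if word not in others]
    return result


# Two hypothetical 30-keyword lists that share 8 words.
w2v = [f"shared{i}" for i in range(8)] + [f"w2v{i}" for i in range(22)]
tc = [f"shared{i}" for i in range(8)] + [f"tc{i}" for i in range(22)]
print(f"{overlap_ratio(w2v, tc) * 100:.1f}%")  # 8 shared out of 52 distinct -> 15.4%

print(unique_keywords({"a": ["x", "y"], "b": ["y", "z"]}))
```

With two lists of 30 keywords, sharing `k` words leaves `60 - k` distinct words, so the ratio is `k / (60 - k)`.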

#### Unique keywords per genre

In [14]:
rows = [genre[0] for genre in unique_w2v]
cols = ["Word2Vec", "LinearSVC"]
data = [["; ".join(w2v[1]), "; ".join(tc[1])] for w2v, tc in zip(unique_w2v, unique_tc)]

pd.set_option('display.max_colwidth', None)

pd.DataFrame(data, columns=cols, index=rows)

| Genre | Word2Vec | LinearSVC |
| --- | --- | --- |
| blues | tears; loving | night; lonely; need; well; mine; mind; did; gone; rain; home; sun |
| country | crazy; hell; knows; dad | back; here; take; little; ever; tonight; town; kiss; song; left; old; hell; hair; high; crazy; down |
| metal | die; fucking | never; pain; too; die; light; something; please; close; inside; lost; taste; dead; enough; feels; found |
| pop | friends; knew; hate | na; oh; leave; head; everything; change; someone; words; new; fall; friends; walk; nobody; stop; face |
| rap | fuck; shit; ma; bitch; bitches; fucked; shawty; fuckin; really | make; gon; fuck; nigga; shit; more; bitch; told; same; hope; tryna; money; niggas; ride; ayy; hit; call; feelin |
| rock | dreams | is; forever; there; find; alone; nothing; all; am; hard; made; cry; hear; together; far; side; then |
| soul | | love; baby; let; way; want; day; sweet; feeling; free; boy; live; really; show |


#### Overlap ratio per genre

In [19]:
rows = [genre[0] for genre in unique_w2v]
cols = ["Ratio"]
data = [f"{x[1] * 100:.1f}%" for x in over_keywds]

pd.set_option('display.max_colwidth', None)

pd.DataFrame(data, columns=cols, index=rows)

| Genre | Ratio |
| --- | --- |
| blues | 15.4% |
| country | 11.1% |
| metal | 17.6% |
| pop | 13.2% |
| rap | 13.2% |
| rock | 13.2% |
| soul | 20.0% |


### About sentiment analysis

The following code shows some facts about sentiment. The tables below report:
1. the sentiment computed over the songs, over the Word2Vec genre keywords, and over the LinearSVC genre keywords
2. a comparison between these sentiments, showing whether the different techniques gave the same results


In [15]:
sentiments, report_s = Engine.compare_sentiments()

#### Sentiment detail
This table shows the predicted sentiment over:
1. song lyrics (at genre level: the sentiment of a genre is the one that occurs most often among its songs)
2. Word2Vec keywords (treated as a document composed of the relevant genre keywords)
3. LinearSVC keywords (treated as a document composed of the relevant genre keywords)
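
The genre-level aggregation in the first point amounts to a majority vote over the song-level labels. A minimal sketch, assuming the labels are plain strings; the `genre_sentiment` helper and the 62/38 split are invented for illustration, and how the project breaks ties is not specified in this document.

```python
from collections import Counter


def genre_sentiment(song_sentiments):
    """Return the most frequent label among a genre's song-level sentiments.

    On a tie, Counter.most_common keeps the label that was seen first.
    An empty list raises IndexError.
    """
    counts = Counter(song_sentiments)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical genre with 100 songs: 62 labelled negative, 38 positive.
labels = ["negative"] * 62 + ["positive"] * 38
print(genre_sentiment(labels))
```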

In [17]:
rows = []
cols = ["Sentiment from songs", "Sentiment from W2V keywords", "Sentiment from LSVC keywords"]
data = []
for s in sentiments:
    rows.append(s)
    data.append(sentiments[s])

pd.set_option('display.max_colwidth', None)

pd.DataFrame(data, columns=cols, index=rows)

| Genre | Sentiment from songs | Sentiment from W2V keywords | Sentiment from LSVC keywords |
| --- | --- | --- | --- |
| blues | negative | positive | negative |
| country | negative | negative | negative |
| metal | negative | negative | negative |
| pop | negative | negative | negative |
| rap | negative | negative | negative |
| rock | negative | negative | negative |
| soul | positive | positive | positive |


#### Sentiment comparison
The following table shows how often the different techniques agree with each other on the predicted sentiment.

In [18]:
titles = { "w2v_vs_tc" : "Word2Vec vs LinearSVC", "songs_vs_w2v": "Song sentiments vs Word2Vec", "songs_vs_tc" : "Song sentiments vs LinearSVC", "total" : "Total Agreement"}
cols = ["Agreement Percentage"]
rows = []
data = []

for e in report_s:
    rows.append(titles[e])
    data.append("{:.1f}".format(report_s[e]*100) + "%")

pd.set_option('display.max_colwidth', None)

pd.DataFrame(data, columns=cols, index=rows)

| Comparison | Agreement Percentage |
| --- | --- |
| Word2Vec vs LinearSVC | 85.7% |
| Song sentiments vs Word2Vec | 85.7% |
| Song sentiments vs LinearSVC | 100.0% |
| Total Agreement | 85.7% |
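
These percentages can be recomputed by hand from the sentiment detail table. The sketch below is not the code of `Engine.compare_sentiments()`: it assumes that agreement is the fraction of genres on which two sources give the same label, and that "total" means all three sources agree. The helper functions are hypothetical; the labels are copied from the detail table.

```python
# (songs, Word2Vec keywords, LinearSVC keywords) labels per genre,
# copied from the sentiment detail table.
sentiments = {
    "blues": ("negative", "positive", "negative"),
    "country": ("negative", "negative", "negative"),
    "metal": ("negative", "negative", "negative"),
    "pop": ("negative", "negative", "negative"),
    "rap": ("negative", "negative", "negative"),
    "rock": ("negative", "negative", "negative"),
    "soul": ("positive", "positive", "positive"),
}


def pairwise_agreement(label_rows, i, j):
    """Fraction of rows in which columns i and j carry the same label."""
    label_rows = list(label_rows)
    return sum(row[i] == row[j] for row in label_rows) / len(label_rows)


def total_agreement(label_rows):
    """Fraction of rows in which all columns carry the same label."""
    label_rows = list(label_rows)
    return sum(len(set(row)) == 1 for row in label_rows) / len(label_rows)


report = {
    "w2v_vs_tc": pairwise_agreement(sentiments.values(), 1, 2),
    "songs_vs_w2v": pairwise_agreement(sentiments.values(), 0, 1),
    "songs_vs_tc": pairwise_agreement(sentiments.values(), 0, 2),
    "total": total_agreement(sentiments.values()),
}
for name, value in report.items():
    print(f"{name}: {value * 100:.1f}%")
```

Under these assumptions blues is the only genre where the sources disagree, which gives 6 out of 7 (85.7%) for every comparison involving the Word2Vec keywords and 7 out of 7 for songs against LinearSVC.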


## Appendix

### Reset dataset commands

The following command creates a backup of the database currently in use and then clears it, so that the commands can be executed again from the beginning (otherwise, the commands described in steps 1 to 6 make no changes).

In [None]:
# Dataset reset
Engine.reset_database()