<a href="https://colab.research.google.com/github/Parsa2820/50-years-lyrics/blob/master/notebooks/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sharif University of Technology
Department of Computer Engineering

---
# Modern Information Retrieval Course
# **50 Years of the Best-Selling Music Artists Lyrics Comparison**
### Homework 2
### Dr. Asgari
### Parsa Mohammadian — 98102284
Spring 2022

---

## Introduction
The art of music play an important role in the human world. Besides the instrumental aspect, lyrics and content of a music are also important. In this project, I will explore and compare the lyrics of the best-selling music artists in a 50 years period (from 1969 to 2019). This list is taken from [Visual Capitalist](https://www.visualcapitalist.com/chart-toppers-50-years-of-the-best-selling-music-artists/) website. They have also visualized this data in an awesome [video](https://www.youtube.com/watch?v=a3w8I8boc_I). For the reference, I will use the image bellow to pick artist that has been top-selling for at least one consecutive year. Since the dataset is not provided, I have hardcoded the artists and their info in the code.

![top-seller-chart](../resources/top-sellers-chart.jpg)

---

In [11]:
class Artist:
    def __init__(self, name: str, top_seller_begin_year: int):
        self.name: str = name
        self.top_seller_begin_year: int = top_seller_begin_year
        self.lyrics: pd.DataFrame = None
        self.normalized_words_count: int = 0
        self.profanity_count: int = 0

artists = [
    Artist("The Beatles", 1969),
    Artist("Elvis Presley", 1973),
    Artist("Elton John", 1975),
    Artist("Eagles", 1977),
    Artist("Michael Jackson", 1980),
    Artist("Madonna", 1985),
    Artist("Eminem", 2001),
    Artist("Rihanna", 2008),
    Artist("Drake", 2013)
]

## Required Libraries

---

In [12]:
"""
Run this cell to install required python packages.
Skip if you have already installed following packages.
"""
!pip install pandas
!pip install tqdm
!pip install nltk
!pip install better-profanity



In [13]:
import re
import string
import itertools
import functools
import pandas as pd
import tqdm
import nltk
import better_profanity as bp

In [14]:
pd.set_option('display.expand_frame_repr', False)
nltk.download("punkt")
nltk.download("stopwords")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\p.mohammadian\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\p.mohammadian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Dataset
The dataset consists of multiple csv files, each file named as "`<artist> Lyrics.csv`" and contains all songs of the artist with their lyrics. I have written the script in [dataset/lyrics-script](../datasets/lyrics-script/genius.py) to generate the dataset. In order to get lyrics, I used [Genius](https://genius.com/) API. It worths mentioning that the order of the songs in every file is according to the number of views of the song in Genius website.

---

In [15]:
"""
Load data
"""
DATA_FILE_PREFIX = "../datasets/"
DATA_FILE_SUFFIX = " Lyrics.csv"

for artist in tqdm.tqdm(artists):
    artist.lyrics = pd.read_csv(f"{DATA_FILE_PREFIX}{artist.name}{DATA_FILE_SUFFIX}")
    artist.lyrics.rename(columns={"Unnamed: 0": "idx"}, inplace=True)
    artist.lyrics.set_index("idx", inplace=True)
    artist.lyrics.dropna(inplace=True)

for artist in artists:
    print(f"\n{artist.name} with {artist.lyrics.size} songs", artist.lyrics.head(1), sep='\n')

100%|██████████| 9/9 [00:00<00:00, 30.69it/s]


The Beatles with 1660 songs
     song_name                                        song_lyrics
idx                                                              
0    Let It Be  Let It Be Lyrics[Verse 1]\r\nWhen I find mysel...

Elvis Presley with 1622 songs
                      song_name                                        song_lyrics
idx                                                                               
0    Can’t Help Falling in Love  Can’t Help Falling in Love Lyrics[Verse 1]\r\n...

Elton John with 1204 songs
     song_name                                        song_lyrics
idx                                                              
0    Your Song  Your Song Lyrics[Verse 1]\r\nIt's a little bit...

Eagles with 270 songs
            song_name                                        song_lyrics
idx                                                                     
0    Hotel California  Hotel California Lyrics[Verse 1]\r\nOn a dark ...

Michael Jackson with 138




## Tokenization

---

In [16]:
for artist in artists:
    artist.lyrics["song_lyrics_tokenized"] = artist.lyrics["song_lyrics"].apply(lambda x: nltk.word_tokenize(x))

## Normalization

---

In [17]:
def to_lower(tokens):
    """
    Converts the tokens to lower case.
    """
    return [token.lower() for token in tokens]


def remove_lyrics_tags(tokens):
    """
    Removes the tags added by Genius from the lyrics. 
    For example, [Chorus], [Verse 1], ...
    """
    new_tokens = []
    tag = False
    for i in range(len(tokens)):
        if tokens[i] == '[':
            tag = True
        elif tokens[i] == ']':
            tag = False
        elif not tag:
            new_tokens.append(tokens[i])
    return new_tokens


def remove_song_name(tokens):
    """
    Removes the song name from the tokens.
    """
    keyword = "lyrics"
    if keyword in tokens:
        return tokens[tokens.index(keyword) + 1:]
    return tokens[:]


def remove_punctuation(tokens):
    """
    Removes punctuation from the given tokens.
    """
    return [token for token in tokens if token not in string.punctuation]


def remove_stop_words(tokens):
    """
    Removes stop words from the given tokens.
    """
    remove_stop_words.stop_words = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in remove_stop_words.stop_words]


def normalize_lyrics(tokens):
    """
    Normalizes the tokens of the lyrics.
    """
    normalization_functions = [to_lower, remove_lyrics_tags, remove_song_name, remove_punctuation, remove_stop_words]
    return functools.reduce(lambda x, f: f(x), normalization_functions, tokens)


In [18]:
for artist in artists:
    artist.lyrics["song_lyrics_normalized"] = artist.lyrics["song_lyrics_tokenized"].apply(normalize_lyrics)

## Profanity Analysis and Removal

---

In [26]:
for artist in artists:
    for l in tqdm.tqdm(artist.lyrics["song_lyrics_normalized"], desc=f"{artist.name} lyrics"):
        for token in l:
            if bp.profanity.contains_profanity(token):
                artist.profanity_count += 1
            

The Beatles lyrics: 100%|██████████| 830/830 [00:42<00:00, 19.36it/s]
Elvis Presley lyrics: 100%|██████████| 811/811 [00:26<00:00, 30.13it/s]
Elton John lyrics: 100%|██████████| 602/602 [00:25<00:00, 23.49it/s]
Eagles lyrics: 100%|██████████| 135/135 [00:05<00:00, 23.90it/s]
Michael Jackson lyrics: 100%|██████████| 692/692 [01:00<00:00, 11.52it/s]
Madonna lyrics: 100%|██████████| 1022/1022 [01:16<00:00, 13.36it/s]
Eminem lyrics: 100%|██████████| 508/508 [01:15<00:00,  6.75it/s]
Rihanna lyrics: 100%|██████████| 400/400 [00:36<00:00, 10.99it/s]
Drake lyrics: 100%|██████████| 506/506 [00:48<00:00, 10.39it/s]


In [27]:
for artist in artists:
    print(f"{artist.name} has {artist.profanity_count} profanity words")

The Beatles has 879 profanity words
Elvis Presley has 362 profanity words
Elton John has 930 profanity words
Eagles has 104 profanity words
Michael Jackson has 1136 profanity words
Madonna has 3107 profanity words
Eminem has 8860 profanity words
Rihanna has 980 profanity words
Drake has 4663 profanity words


## Stemming and Lemmatization

---