<a href="https://colab.research.google.com/github/Parsa2820/50-years-lyrics/blob/master/notebooks/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sharif University of Technology
Department of Computer Engineering

---
# Modern Information Retrieval Course
# **50 Years of the Best-Selling Music Artists Lyrics Comparison**
### Homework 2
### Dr. Asgari
### Parsa Mohammadian — 98102284
Spring 2022

---

## Introduction
Music plays an important role in human life. Besides the instrumental aspect, the lyrics and content of a song also matter. In this project, I explore and compare the lyrics of the best-selling music artists over a 50-year period (1969 to 2019). The list is taken from the [Visual Capitalist](https://www.visualcapitalist.com/chart-toppers-50-years-of-the-best-selling-music-artists/) website, which has also visualized the data in a [video](https://www.youtube.com/watch?v=a3w8I8boc_I). For reference, I use the image below to pick the artists who were the top seller for at least one continuous year. Since the dataset is not provided, I have hardcoded the artists and their info in the code.

![top-seller-chart](../resources/top-sellers-chart.jpg)

---

In [1]:
class Artist:
    def __init__(self, name: str, top_seller_begin_year: int):
        self.name: str = name
        self.top_seller_begin_year: int = top_seller_begin_year
        self.lyrics: pd.DataFrame = None

artists = [
    Artist("The Beatles", 1969),
    Artist("Elvis Presley", 1973),
    Artist("Elton John", 1975),
    Artist("Eagles", 1977),
    Artist("Michael Jackson", 1980),
    Artist("Madonna", 1985),
    Artist("Eminem", 2001),
    Artist("Rihanna", 2008),
    Artist("Drake", 2013)
]

## Required Libraries

---

In [2]:
"""
Run this cell to install the required Python packages.
Skip it if you have already installed the following packages.
"""
!pip install pandas
!pip install tqdm
!pip install nltk
!pip install better-profanity

Collecting pandas
  Downloading pandas-1.4.2-cp310-cp310-win_amd64.whl (10.6 MB)
     ---------------------------------------- 10.6/10.6 MB 3.6 MB/s eta 0:00:00
Collecting numpy>=1.21.0
  Downloading numpy-1.22.3-cp310-cp310-win_amd64.whl (14.7 MB)
     ---------------------------------------- 14.7/14.7 MB 2.3 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
     -------------------------------------- 503.5/503.5 KB 2.3 MB/s eta 0:00:00
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.22.3 pandas-1.4.2 pytz-2022.1




Collecting tqdm
  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
     -------------------------------------- 78.4/78.4 KB 396.6 kB/s eta 0:00:00
Installing collected packages: tqdm
Successfully installed tqdm-4.64.0




Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 1.8 MB/s eta 0:00:00
Collecting click
  Downloading click-8.1.2-py3-none-any.whl (96 kB)
     ---------------------------------------- 96.6/96.6 KB 2.8 MB/s eta 0:00:00
Collecting joblib
  Downloading joblib-1.1.0-py2.py3-none-any.whl (306 kB)
     -------------------------------------- 307.0/307.0 KB 1.1 MB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2022.3.15-cp310-cp310-win_amd64.whl (274 kB)
     -------------------------------------- 274.4/274.4 KB 3.3 MB/s eta 0:00:00
Installing collected packages: regex, joblib, click, nltk
Successfully installed click-8.1.2 joblib-1.1.0 nltk-3.7 regex-2022.3.15




Collecting better-profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
     -------------------------------------- 46.1/46.1 KB 254.4 kB/s eta 0:00:00
Installing collected packages: better-profanity
Successfully installed better-profanity-0.7.0


In [34]:
import re
import string
import itertools
import functools
import pandas as pd
import tqdm
import nltk
import better_profanity as bp

In [4]:
pd.set_option('display.expand_frame_repr', False)
nltk.download("punkt")
nltk.download("stopwords")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\p.mohammadian\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\p.mohammadian\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Dataset
The dataset consists of multiple CSV files, one per artist. Each file is named "`<artist> Lyrics.csv`" and contains all of the artist's songs with their lyrics. I wrote the script in [datasets/lyrics-script](../datasets/lyrics-script/genius.py) to generate the dataset; it fetches the lyrics through the [Genius](https://genius.com/) API. It is worth mentioning that the songs in every file are ordered by their number of views on the Genius website.
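
The file layout described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `lyrics_file_path`, `read_songs`, and the sample rows are hypothetical, and the `csv` module stands in for pandas so that the example runs without extra packages. The column names (`song_name`, `song_lyrics`) and the header-less first column follow what the loading output in this notebook shows.

```python
import csv
import io

# Hypothetical constants mirroring the naming scheme "<artist> Lyrics.csv".
DATA_DIR = "../datasets/"
FILE_SUFFIX = " Lyrics.csv"


def lyrics_file_path(artist_name: str) -> str:
    """Build the path of the CSV file that holds one artist's songs."""
    return f"{DATA_DIR}{artist_name}{FILE_SUFFIX}"


# Made-up placeholder rows; the real files contain lyrics fetched from Genius.
# The first column has no header, which is why pandas labels it "Unnamed: 0".
SAMPLE_CSV = (
    ",song_name,song_lyrics\n"
    "0,First Song,First Song Lyrics[Verse 1]\n"
    "1,Second Song,\n"
    "2,Third Song,Third Song Lyrics[Chorus]\n"
)


def read_songs(csv_text: str) -> list:
    """Parse the CSV text and keep only the rows that have lyrics."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["song_lyrics"]]


songs = read_songs(SAMPLE_CSV)
print(lyrics_file_path("Eagles"))           # ../datasets/Eagles Lyrics.csv
print([row["song_name"] for row in songs])  # ['First Song', 'Third Song']
```

Dropping the row without lyrics plays the same role here as `dropna` does when the files are loaded with pandas.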

---

In [23]:
"""
Load data
"""
DATA_FILE_PREFIX = "../datasets/"
DATA_FILE_SUFFIX = " Lyrics.csv"

for artist in tqdm.tqdm(artists):
    artist.lyrics = pd.read_csv(f"{DATA_FILE_PREFIX}{artist.name}{DATA_FILE_SUFFIX}")
    artist.lyrics.rename(columns={"Unnamed: 0": "idx"}, inplace=True)
    artist.lyrics.set_index("idx", inplace=True)
    artist.lyrics.dropna(inplace=True)

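for artist in artists:
    # Note: DataFrame.size counts cells (rows x columns). With the two columns
    # here (song_name, song_lyrics), the printed number is twice the number of
    # songs; len(artist.lyrics) gives the number of songs.
    print(f"\n{artist.name} with {artist.lyrics.size} songs", artist.lyrics.head(1), sep='\n')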

100%|██████████| 9/9 [00:00<00:00, 20.99it/s]


The Beatles with 1660 songs
     song_name                                        song_lyrics
idx                                                              
0    Let It Be  Let It Be Lyrics[Verse 1]\r\nWhen I find mysel...

Elvis Presley with 1622 songs
                      song_name                                        song_lyrics
idx                                                                               
0    Can’t Help Falling in Love  Can’t Help Falling in Love Lyrics[Verse 1]\r\n...

Elton John with 1204 songs
     song_name                                        song_lyrics
idx                                                              
0    Your Song  Your Song Lyrics[Verse 1]\r\nIt's a little bit...

Eagles with 270 songs
            song_name                                        song_lyrics
idx                                                                     
0    Hotel California  Hotel California Lyrics[Verse 1]\r\nOn a dark ...

Michael Jackson with 138




## Tokenization

---

In [47]:
for artist in artists:
    artist.lyrics["song_lyrics_tokenized"] = artist.lyrics["song_lyrics"].apply(nltk.word_tokenize)
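
To make the shape of the tokenized lyrics concrete, here is a minimal sketch. It assumes a made-up sample text in the Genius layout (`<song name> Lyrics[<tag>]` followed by the lyrics), and it uses a simplified regular expression as a stand-in for `nltk.word_tokenize` so that it runs without NLTK or its `punkt` data. `simple_tokenize` is a hypothetical helper, and NLTK treats contractions and quotation marks differently. The point it shares with NLTK is that square brackets come out as separate tokens, which the tag-removal step in the normalization relies on.

```python
import re


def simple_tokenize(text: str) -> list:
    """Split text into runs of word characters and single punctuation marks.

    A rough regex stand-in for nltk.word_tokenize, used only so that this
    example runs without NLTK.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up placeholder text in the Genius layout.
sample = "Some Song Lyrics[Verse 1]\r\nHello, the world is bright!"

tokens = simple_tokenize(sample)
print(tokens)
# ['Some', 'Song', 'Lyrics', '[', 'Verse', '1', ']',
#  'Hello', ',', 'the', 'world', 'is', 'bright', '!']
```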

## Normalization

---

In [65]:
def to_lower(tokens):
    """
    Converts the tokens to lower case.
    """
    return [token.lower() for token in tokens]


def remove_lyrics_tags(tokens):
    """
    Removes the tags added by Genius from the lyrics. 
    For example, [Chorus], [Verse 1], ...
    """
    new_tokens = []
    tag = False  # True while we are between '[' and ']'
    for token in tokens:
        if token == '[':
            tag = True
        elif token == ']':
            tag = False
        elif not tag:
            new_tokens.append(token)
    return new_tokens


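def remove_song_name(tokens):
    """
    Removes the song name from the tokens.
    Genius prefixes the lyrics with "<song name> Lyrics", so everything up to
    and including the first "lyrics" token is dropped.
    Expects lower-cased tokens.
    """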
    keyword = "lyrics"
    if keyword in tokens:
        return tokens[tokens.index(keyword) + 1:]
    return tokens[:]


def remove_punctuation(tokens):
    """
    Removes punctuation from the given tokens.
    """
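    # A token counts as punctuation if every character in it is punctuation,
    # which also drops multi-character tokens such as "..." or "``".
    return [token for token in tokens if not all(char in string.punctuation for char in token)]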


def remove_stop_words(tokens):
    """
    Removes stop words from the given tokens.
    """
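    # Build the stop-word set once and cache it on the function object
    # instead of rebuilding it on every call.
    if not hasattr(remove_stop_words, "stop_words"):
        remove_stop_words.stop_words = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in remove_stop_words.stop_words]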


def normalize_lyrics(tokens):
    """
    Normalizes the tokens of the lyrics.
    """
    normalization_functions = [to_lower, remove_lyrics_tags, remove_song_name, remove_punctuation, remove_stop_words]
    return functools.reduce(lambda x, f: f(x), normalization_functions, tokens)


In [66]:
for artist in artists:
    artist.lyrics["song_lyrics_normalized"] = artist.lyrics["song_lyrics_tokenized"].apply(normalize_lyrics)
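
The effect of each normalization step can be traced on a small, made-up token list. The sketch below uses condensed stand-ins for the functions above rather than the notebook's exact code. `SAMPLE_STOP_WORDS` is a tiny hand-picked set (an assumption for this example), whereas the notebook uses NLTK's much larger English list.

```python
import functools
import string

# Tiny hand-picked stop-word set (an assumption for this example).
SAMPLE_STOP_WORDS = {"the", "is", "a", "and"}


def to_lower(tokens):
    return [token.lower() for token in tokens]


def remove_lyrics_tags(tokens):
    """Drop everything between '[' and ']', brackets included."""
    kept, inside_tag = [], False
    for token in tokens:
        if token == "[":
            inside_tag = True
        elif token == "]":
            inside_tag = False
        elif not inside_tag:
            kept.append(token)
    return kept


def remove_song_name(tokens):
    """Drop the '<song name> lyrics' header, if present."""
    if "lyrics" in tokens:
        return tokens[tokens.index("lyrics") + 1:]
    return tokens[:]


def remove_punctuation(tokens):
    return [t for t in tokens if not all(c in string.punctuation for c in t)]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in SAMPLE_STOP_WORDS]


STEPS = [to_lower, remove_lyrics_tags, remove_song_name,
         remove_punctuation, remove_stop_words]

# Made-up tokens in the Genius layout: "<song name> Lyrics[<tag>] ...".
tokens = ["Some", "Song", "Lyrics", "[", "Verse", "1", "]",
          "Hello", ",", "the", "world", "is", "bright", "!"]

# Print the intermediate result after every step.
current = tokens
for step in STEPS:
    current = step(current)
    print(f"{step.__name__:<20} {current}")

# The same pipeline expressed as a single fold over the steps.
normalized = functools.reduce(lambda acc, step: step(acc), STEPS, tokens)
print(normalized)  # ['hello', 'world', 'bright']
```

The order of the steps matters: `remove_song_name` searches for the lower-case token `lyrics`, so it only finds the header after `to_lower` has run. Note also that it cuts at the first `lyrics` token wherever it occurs, so a text without the Genius header that happens to contain that word would lose everything before it.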