<a href="https://colab.research.google.com/github/Parsa2820/50-years-lyrics/blob/master/notebooks/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sharif University of Technology
Department of Computer Engineering

---
# Modern Information Retrieval Course
# **50 Years of the Best-Selling Music Artists Lyrics Comparison**
### Homework 2
### Dr. Asgari
### Parsa Mohammadian — 98102284
Spring 2022

---

## Introduction
The art of music play an important role in the human world. Besides the instrumental aspect, lyrics and content of a music are also important. In this project, I will explore and compare the lyrics of the best-selling music artists in a 50 years period (from 1969 to 2019). This list is taken from [Visual Capitalist](https://www.visualcapitalist.com/chart-toppers-50-years-of-the-best-selling-music-artists/) website. They have also visualized this data in an awesome [video](https://www.youtube.com/watch?v=a3w8I8boc_I). For the reference, I will use the image bellow to pick artist that has been top-selling for at least one consecutive year. Since the dataset is not provided, I have hardcoded the artists and their info in the code.

![top-seller-chart](../resources/top-sellers-chart.jpg)

---

In [3]:
class Artist:
    def __init__(self, name: str, top_seller_begin_year: int):
        self.name: str = name
        self.top_seller_begin_year: int = top_seller_begin_year
        self.lyrics: pd.DataFrame = None

artists = [
    Artist("The Beatles", 1969),
    Artist("Elvis Presley", 1973),
    Artist("Elton John", 1975),
    Artist("Eagles", 1977),
    Artist("Michael Jackson", 1980),
    Artist("Madonna", 1985),
    Artist("Eminem", 2001),
    Artist("Rihanna", 2008),
    Artist("Drake", 2013)
]

## Required Libraries

---

In [18]:
"""
Run this cell to install required python packages.
Skip if you have already installed following packages.
"""
!pip install pandas
!pip install tqdm
!pip install nltk
!pip install better-profanity

Collecting better-profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
     -------------------------------------- 46.1/46.1 KB 763.1 kB/s eta 0:00:00
Installing collected packages: better-profanity
Successfully installed better-profanity-0.7.0


In [19]:
import re
import string
import pandas as pd
import tqdm
import nltk
import better_profanity as bp

In [25]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\parsa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\parsa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Dataset
The dataset consists of multiple csv files, each file named as "`<artist> Lyrics.csv`" and contains all songs of the artist with their lyrics. I have written the script in [dataset/lyrics-script](../datasets/lyrics-script/genius.py) to generate the dataset. In order to get lyrics, I used [Genius](https://genius.com/) API. It worths mentioning that the order of the songs in every file is according to the number of views of the song in Genius website.

---

In [17]:
"""
Load data
"""
DATA_FILE_PREFIX = "../datasets/"
DATA_FILE_SUFFIX = " Lyrics.csv"

for artist in tqdm.tqdm(artists):
    artist.lyrics = pd.read_csv(f"{DATA_FILE_PREFIX}{artist.name}{DATA_FILE_SUFFIX}")
    artist.lyrics.rename(columns={"Unnamed: 0": "idx"}, inplace=True)
    artist.lyrics.set_index("idx", inplace=True)
    print(f"{artist.name} with {artist.lyrics.size} songs", artist.lyrics.head(3), sep='\n')

100%|██████████| 1/1 [00:00<00:00, 41.67it/s]

The Beatles with 1684 songs
         song_name                                        song_lyrics
idx                                                                  
0        Let It Be  Let It Be Lyrics[Verse 1]\nWhen I find myself ...
1        Yesterday  Yesterday Lyrics[Verse 1]\nYesterday\nAll my t...
2    Come Together  Come Together Lyrics[Intro]\nShoot me\nShoot m...





## Tokenization

---

## Normalization

---

In [21]:
def to_lower(tokens):
    """
    Converts the tokens to lower case.
    """
    return [token.lower() for token in tokens]


def remove_lyrics_tags(tokens):
    """
    Removes the tags added by Genius from the lyrics. 
    For example, [Chorus], [Verse 1], ...
    """
    new_tokens = []
    for i in range(len(tokens)):
        if tokens[i] == '[' and i+2 < len(tokens) and tokens[i+2] == ']':
            next(tokens, None)
            next(tokens, None)
            continue
        new_tokens.append(tokens[i])
    return new_tokens


def remove_punctuation(tokens):
    """
    Removes punctuation from the given tokens.
    """
    return [token for token in tokens if token not in string.punctuation]


def remove_stop_words(tokens):
    """
    Removes stop words from the given tokens.
    """
    remove_stop_words.stop_words = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in remove_stop_words.stop_words]


def normalize_lyrics(tokens):
    """
    Normalizes the tokens of the lyrics.
    """
    return (tokens.to_lower()
                  .remove_lyrics_tags()
                  .remove_punctuation()
                  .remove_stop_words())
