# BTS Lyrics Comparison


![](https://static.billboard.com/files/2020/11/bts-press-photo-2020-billboard-1548-1604933999-compressed.jpg)
(Photo credit: [Billboard](https://www.billboard.com/))

## Introduction 
BTS (Korean: 방탄소년단), also known as the Bangtan Boys, is a seven-member South Korean boy band that debuted in 2013 under Big Hit Entertainment.

Most of you have probably heard one of their latest and most popular songs, “Dynamite”, released in 2020. The song debuted by breaking several records: most-viewed video in a single day in YouTube history, most-commented video on the platform, and second most-commented video of all time. It also reached No. 1 on the Billboard chart for a total of three weeks, No. 1 on Spotify's daily Global Top 50 chart, and No. 1 on iTunes within 8 hours. The song even broke the record for the fastest viral music video by generating 101.1 million views within 24 hours, according to [Statista](https://www.statista.com/statistics/478082/fastest-viral-videos-views-in-24-hours/#:~:text=In%20August%202020%2C%20South%20Korean,record%20of%2086.3%20million%20views.).


## BTS songs and lyrics are catchy and get stuck in my head!

Whenever I listen to BTS songs, they get stuck in my head.

Personally, I prefer listening to songs with a positive vibe because they really affect my mood. I don't personally like listening to songs about bullying and mental health. BTS songs motivate me with relatable and inspirational words (e.g., "Love Yourself", "Life Goes On"). See [5 Reasons Why You Should Stan BTS](https://thehoneypop.com/2019/12/16/5-reasons-to-stan-bts/).

In this notebook, I want to see which words are used the most in BTS songs via [word embedding](https://en.wikipedia.org/wiki/Word_embedding), then visualize lyric similarity using a machine learning method called t-SNE and an interactive visualization library called Bokeh.

# Table of Contents

1. Data exploration
2. Focus on one album song by BTS
3. Tokenizing the lyrics + Word Cloud
4. Initializing a document-term matrix (DTM)
5. Creating a counter function
6. The Song-Lyric matrix!
7. Dimension reduction with t-SNE
8. Let's map the items with Bokeh
9. Adding a hover tool
10. Mapping the songs
11. Comparing lyrics of two songs

# Data Exploration

## Import Library

In [None]:
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

## Read Data

In [None]:
df = pd.read_csv('/kaggle/input/bts-lyrics/lyrics-v0.2.csv')
display(df.sample(5))

In [None]:
df.info()

### Pre-Processing

In [None]:
# Report how many duplicate rows exist, then remove them, if any
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

In [None]:
# make all na fields reflect as such
df = df.fillna('NA')

# ensure date format for album release date
df['album_rd'] = pd.to_datetime(df.album_rd)

# ignore any track that does not have any lyrics or are album notes
df = df[~df['eng_track_title'].str.contains('skit', case=False) & ~df['eng_track_title'].str.contains('note', case=False)]


In [None]:
# Inspect the types of album
df.eng_album_title.value_counts()

In [None]:
df.lyrics.value_counts()

In [None]:
df.info()

### Lyrics Normalisation
#### Method adapted from :

In [None]:
import re
def normalise(text, remove_punc=True):
    """method to normalise text"""
    # change text to lowercase and remove leading and trailing white spaces
    text = text.lower().strip()

    # remove punctuation
    if remove_punc:
        # remove punctuation
        text = re.sub(r'[\W]', ' ', text)
        # remove double spacing sometimes caused by removal of punctuation
        text = re.sub(r'\s+', ' ', text)

    return text

In [None]:
# normalise lyrics
df['lyrics'] = df['lyrics'].apply(lambda x: normalise(x))
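
To make the effect of the normalisation concrete, here is a minimal sketch on a made-up line (not taken from the dataset). `normalise_demo` is a hypothetical helper that only restates the steps of `normalise()` so the cell runs on its own. It shows two things worth knowing: Hangul characters survive because `\W` is Unicode-aware for Python strings, and a line that ends in punctuation leaves a trailing space behind.

```python
import re

def normalise_demo(text):
    """Same steps as normalise() above, restated so this cell runs on its own."""
    text = text.lower().strip()
    text = re.sub(r'[\W]', ' ', text)   # every non-word character becomes a space
    return re.sub(r'\s+', ' ', text)    # collapse runs of whitespace into one space

sample = "  Don't stop, 방탄!  "        # made-up example line
cleaned = normalise_demo(sample)

print(repr(cleaned))
print(cleaned.split(' '))   # splitting on a single space keeps a trailing empty token
print(cleaned.split())      # splitting on any whitespace drops it
```

The apostrophe in "Don't" is replaced by a space, so the word becomes the two tokens "don" and "t". Keep this in mind when reading the word counts later on.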

# 2. Focus on one album song by BTS 
- There are 16 albums in this dataset and a total of 226 songs. Let's set up a workflow so that its outputs (a t-SNE model and a visualization of that model) can be customized. For the example in this notebook, let's focus on the "MAP OF THE SOUL: 7" album and filter the data accordingly.

In [None]:
# Filter for the album title
df_album = df[df['eng_album_title'] == 'MAP OF THE SOUL: 7']

# Keep only the tracks performed by the whole group
df_title = df_album[df_album['performed_by'] == "BTS"]

# Reset the index so that rows can be addressed by position 0..M-1
df_title = df_title.reset_index(drop = True)
df_title

# 3. Tokenizing the lyrics 

- To reach our end goal of comparing the lyrics of each song, we first need to do some preprocessing and bookkeeping of the actual words in each song's lyrics. The first step is tokenizing the text in the "lyrics" column. After splitting it into tokens, we'll make a binary bag of words. Then we will create a dictionary with the tokens, lyric_idx, which will have the following format:

{**"token"**: index value, ...}

## a. Create a dictionary with the tokens, lyric_idx

In [None]:
# Initialize the token-to-index dictionary, the list of token lists, and the running index
lyric_idx = {}
corpus = []
idx = 0

# Tokenize each song and give every previously unseen token the next free index
for i in range(len(df_title)):
    btslyrics = df_title["lyrics"][i]
    btslyrics_lower = btslyrics.lower()
    # Split on single spaces (normalise() already collapsed runs of whitespace)
    tokens = btslyrics_lower.split(' ')
    corpus.append(tokens)
    for lyric in tokens:
        if lyric not in lyric_idx:
            lyric_idx[lyric] = idx
            idx += 1
lyric_idx
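
As a minimal sketch of the same bookkeeping on toy data, the cell below builds a token-to-index dictionary from two made-up "songs". The names `toy_lyrics`, `toy_idx`, and `toy_corpus` are hypothetical and exist only for this illustration.

```python
# Two made-up lyric strings standing in for rows of df_title["lyrics"]
toy_lyrics = ["love myself love", "dream of love"]

toy_idx = {}      # token -> column index
toy_corpus = []   # one list of tokens per song

for text in toy_lyrics:
    tokens = text.split()
    toy_corpus.append(tokens)
    for token in tokens:
        # A token gets the next free index the first time it is seen;
        # repeated tokens keep the index they already have
        toy_idx.setdefault(token, len(toy_idx))

print(toy_idx)
print(toy_corpus)
```

Each distinct word receives exactly one index, in order of first appearance, so "love" keeps index 0 even though it occurs in both songs.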

## b. WordCloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline


plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
                          background_color='white',
                          width=1920,
                          height=1080
                         ).generate(" ".join(df.lyrics))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('neighbourhood.png')
plt.show()

* As you can see above, there are many positive and relatable words. I don't really see words with negative or hurtful vibes!
* Apart from interjections ("na", "oh oh", "la la"), I see "dream", "my heart", "love", "let jump", "star", "my way", "smile", "love myself", "will", "hope", etc.

# 4. Initializing a document-term matrix (DTM)
The next step is making a document-term matrix (DTM). Here each song corresponds to a document, and each word in the lyrics corresponds to a term. This means we can think of the matrix as a “song-lyric” matrix, with one row per song and one column per unique word. To create this matrix, we'll first make an empty matrix filled with zeros. The length of the matrix is the total number of songs in the data. The width of the matrix is the total number of unique words in the lyrics. After initializing this empty matrix, we'll fill it in the following tasks.

In [None]:
M = len(df_title)
N = len(lyric_idx)
A = np.zeros((M, N))
A.shape

In [None]:
#check the result
print("The index for dream is", lyric_idx['dream'])

# 5. Creating a counter function
Before we can fill the matrix, let's create a function to count the tokens (i.e., the list of words in a song's lyrics) for each row. Our end goal is to fill the matrix with 1 or 0: if a word appears in a song, the value is 1. If not, it remains 0. The name of this function, oh_encoder, will become clear next.

In [None]:
# Define the oh_encoder function
def oh_encoder(tokens):
    x = np.zeros(N)
    for lyric in tokens:
        # Get the index for each lyric
        idx = lyric_idx[lyric]
        # Put 1 at the corresponding indices
        x[idx] = 1
    return x

# 6. The Song-Lyric matrix!
Now we'll apply the oh_encoder() function to the tokens in corpus and set the values in each row of this matrix. The result tells us which words each song is composed of. For example, if a song contains only "love" and "happiest" in its lyrics, its row has a 1 in the columns for "love" and "happiest" and a 0 everywhere else. This is what we call one-hot encoding. By encoding each word in the songs, the Song-Lyric matrix will be filled with binary values.

In [None]:
# Make a document-term matrix: one encoded row per song
for i, tokens in enumerate(corpus):
    A[i, :] = oh_encoder(tokens)
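
To see what a filled row looks like, here is a small self-contained sketch with a made-up four-word vocabulary. `toy_vocab` and `encode_tokens` are hypothetical names, and plain lists stand in for the NumPy array so that the cell has no dependencies.

```python
# Made-up vocabulary: token -> column index
toy_vocab = {"love": 0, "happiest": 1, "dream": 2, "star": 3}

def encode_tokens(tokens, vocab):
    """Return a binary row: 1 where the token occurs in the song, 0 elsewhere."""
    row = [0] * len(vocab)
    for token in tokens:
        row[vocab[token]] = 1   # repeated tokens still give 1, not a count
    return row

toy_songs = [
    ["love", "happiest", "love"],   # song 0
    ["dream", "star"],              # song 1
]
toy_matrix = [encode_tokens(tokens, toy_vocab) for tokens in toy_songs]
print(toy_matrix)
```

Because the values are binary, a word repeated fifty times in a chorus counts the same as a word sung once. The matrix records which words a song uses, not how often.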

# 7. Dimension reduction with t-SNE
The dimensions of the existing matrix are (10, 735), which means there are 735 features in our data. For visualization, we should reduce this to two dimensions. We'll use t-SNE to reduce the dimensionality of the data here.

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, this technique can reduce the dimension of data while keeping the similarities between the instances. This lets us plot the data on a coordinate plane: every song in our data is mapped to a two-dimensional coordinate, and the distances between the points indicate the similarities between the songs.
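
For readers who want the math behind "keeping the similarities", this is the standard textbook formulation of t-SNE, not something specific to this notebook. Here $x_i$ is the high-dimensional row for song $i$, $y_i$ is its two-dimensional point, $n$ is the number of songs, and $\sigma_i$ is a per-point bandwidth that the algorithm derives from the perplexity setting.

$$p_{j\mid i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

The points $y_i$ are moved around until the low-dimensional similarities $q_{ij}$ match the high-dimensional similarities $p_{ij}$ as closely as possible, measured by the KL divergence in the last line.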

In [None]:
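# Dimension reduction with t-SNE
# Recent scikit-learn versions require perplexity < number of samples,
# so cap it for a single album with only a handful of songs
model = TSNE(n_components = 2, learning_rate = 200,
             perplexity = min(30, M - 1), random_state = 42)
tsne_features = model.fit_transform(A)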

# Make X, Y columns 
df_title['X'] = tsne_features[:, 0]
df_title['Y'] = tsne_features[:, 1]

In [None]:
df_title



# 8. Let's map the items with Bokeh
We are now ready to start creating our plot. With the t-SNE values, we can plot all our songs on the coordinate plane. And the coolest part here is that it will also show us the eng_album_title, track_title, and album_seq of each song. Let's make a scatter plot using Bokeh and add a hover tool to show that information. Note that we won't display the plot yet as we will make some more additions to it.

In [None]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
output_notebook()

# Make a source and a scatter plot
source = ColumnDataSource(df_title)
plot = figure(x_axis_label ='T-SNE 1',
              y_axis_label ='T-SNE 2',
              width = 500, height = 400)
# scatter() draws circle markers by default; circle() with size is deprecated in newer Bokeh
plot.scatter(x = 'X',
    y = 'Y',
    source = source,
    size = 10, color = '#FF7373', alpha = .8)

# 9. Adding a hover tool
Why don't we add a hover tool? Adding a hover tool allows us to check the information of each song whenever the cursor is directly over a glyph. We'll add tooltips with each song's album title, track title, and position in the album (album_seq).

In [None]:
# Create a HoverTool object
hover = HoverTool(tooltips = [('eng_album_title', '@eng_album_title'),
                              ('track_title', '@track_title'),
                              ('album_seq', '@album_seq')])
                    
plot.add_tools(hover)


# 10. Mapping the songs 
Finally, it's show time! Let's see what the map we've made looks like. Each point on the plot corresponds to a song. Then what do the axes mean here? The axes of a t-SNE plot aren't easily interpretable in terms of the original data. As mentioned above, t-SNE is a visualization technique for plotting high-dimensional data in a low-dimensional space. Therefore, it's not advisable to interpret a t-SNE plot quantitatively.

Instead, what we can get from this map is the distance between the points (which songs are close and which are far apart). **The closer two songs are, the more similar their composition.**

**Therefore, this enables us to compare the songs.**

In [None]:
# Plot the map
show(plot)

# 11. Comparing lyrics of two songs

Let's use this little lyrics engine to compare two songs and see how similar their lyrics are.

In [None]:
# Print the lyrics of two similar songs
song_1 = df_title[df_title['track_title'] == "We Are Bulletproof: The Eternal "]
song_2 = df_title[df_title['track_title'] == "ON (Remix) ft. Sia "]

# Display each song's data and lyrics
display(song_1)
print(song_1.lyrics.values)

display(song_2)
print(song_2.lyrics.values)
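
Reading two full lyric sheets side by side is hard to judge by eye, so here is a hedged sketch of one way to put a number on the overlap. It uses Jaccard similarity on the sets of words, which is a different and simpler measure than the distances t-SNE works with, and `jaccard_similarity` is a hypothetical helper written for this illustration. The sample strings are made up.

```python
def jaccard_similarity(lyrics_a, lyrics_b):
    """Share of distinct words that two lyric strings have in common (0.0 to 1.0)."""
    words_a = set(lyrics_a.split())
    words_b = set(lyrics_b.split())
    union = words_a | words_b
    if not union:
        return 0.0   # two empty lyrics: define the similarity as 0
    return len(words_a & words_b) / len(union)

# Made-up normalised lyric strings
lyrics_1 = "love myself love yourself"
lyrics_2 = "love yourself peace"

score = jaccard_similarity(lyrics_1, lyrics_2)
print(score)
```

In the notebook, the same function could be called as `jaccard_similarity(song_1.lyrics.values[0], song_2.lyrics.values[0])`, assuming each of the two filters above matched exactly one row.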