# **Introduction**

Hello, Kagglers. This is my first notebook on Kaggle. I hope you enjoy this exploration of the "Billboard The Hot 100 Songs" dataset.

Dataset and information below are from: https://www.kaggle.com/dhruvildave/billboard-the-hot-100-songs

# **Features information**
The dataset has seven features:
* date: Date of chart
* rank: Rank of song
* song: Song title
* artist: Song artist
* last-week: Rank in previous week
* peak-rank: Top rank achieved by the song
* weeks-on-board: Weeks the song appeared on the chart

# **Packages import**

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt  # used directly for figure sizes, titles, and ticks
import seaborn as sns
import os

# **Data read**

In [None]:
df = pd.read_csv('../input/billboard-the-hot-100-songs/charts.csv')
df.head()

In [None]:
df.tail(10)

# **Null value check**
There are NaN values in the 'last-week' feature.
This is because:
1. Songs on the first chart (1958-08-04) have no previous-week rank.
2. A song in its first week on the chart has no previous-week rank.
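
The explanation above can be checked by cross-tabulating missing 'last-week' values against new-entry status. This is a minimal sketch on a made-up frame: `toy` and its values are hypothetical, not taken from the dataset, and it assumes new entries are the rows with `weeks-on-board == 1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy chart: two weeks, two positions each (not real data).
toy = pd.DataFrame({
    'date': ['1958-08-04', '1958-08-04', '1958-08-11', '1958-08-11'],
    'song': ['A', 'B', 'A', 'C'],
    'last-week': [np.nan, np.nan, 1.0, np.nan],
    'weeks-on-board': [1, 1, 2, 1],
})

missing = toy['last-week'].isna()              # rows without a previous rank
first_week = toy['date'] == toy['date'].min()  # rows on the earliest chart
new_entry = toy['weeks-on-board'] == 1         # rows in their first chart week

# Cross-tabulate missing 'last-week' against new-entry status.
summary = pd.crosstab(missing, new_entry)
print(summary)

# Rows with a missing 'last-week' that neither case explains.
unexplained = toy[missing & ~first_week & ~new_entry]
print(len(unexplained))
```

Running the same three masks on the real `df` would show whether any missing values fall outside these cases; a nonzero count would point to a further cause that the list above does not cover.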

In [None]:
df.isna().sum()

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(df.isna()) # null value visualization

In [None]:
import missingno as msno  # another way to check null value
msno.bar(df, color = 'orchid')

# **✨Data explore✨**
The contents are as follows:
* The number of new entry songs on the board each week
* The TOP artists
* The TOP songs
* Wordcloud of song titles

> # **The number of new entry songs on the board each week**
> How many songs entered the board each week?
> The answer is in the 'weeks-on-board' feature: if its value is 1, the song is a *new entry*.

In [None]:
newentry = df[df['weeks-on-board']==1] # all the new entry songs
# exclude the first chart week by date, so the result does not depend on row order
newentry = newentry[newentry['date'] != df['date'].min()]
newentry.shape # (number of new entry rows, number of columns)

In [None]:
# group by 'date' to count the new entry songs each week
# (every new entry row has weeks-on-board == 1, so the sum equals the row count)
date_grouped = newentry.groupby('date')[['weeks-on-board']].sum()
date_grouped.sort_values('weeks-on-board', ascending = False)
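
Because every new entry row has `weeks-on-board` equal to 1, summing that column gives the same numbers as counting rows. The sketch below compares the two forms on a hypothetical toy frame (`toy`, `toy_new`, `by_sum`, and `by_size` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical toy chart rows (not real data).
toy = pd.DataFrame({
    'date': ['2020-01-04', '2020-01-04', '2020-01-11',
             '2020-01-11', '2020-01-11', '2020-01-11'],
    'song': ['A', 'B', 'A', 'C', 'D', 'E'],
    'weeks-on-board': [1, 1, 2, 1, 1, 1],
})

# Keep only rows in their first chart week.
toy_new = toy[toy['weeks-on-board'] == 1]

# Sum of a column of ones versus a plain row count per date.
by_sum = toy_new.groupby('date')['weeks-on-board'].sum()
by_size = toy_new.groupby('date').size()

print(by_size.sort_values(ascending=False))
```

Counting with `size()` does not rely on the column holding ones, which may make the intent easier to read.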

**Amazing! In the week of 1998-12-05, 60 songs entered the chart.**
**Let's visualize the result.**

In [None]:
fig = plt.figure(figsize=(16, 9)) 
ax = fig.add_subplot(111) 
ax.plot(date_grouped.index, date_grouped['weeks-on-board'], 'red', 
        label="date_grouped", alpha = 0.5)
ax.set_title('Number of new entry songs per week', fontsize=30) 
ax.set_ylabel('Counts', fontsize=14) 
ax.set_xlabel('Date', fontsize=14) 
plt.xticks(['1958-08-11', '1974-03-30', '1989-12-02', '2005-07-23', '2021-03-13'])
plt.show()

> # **The TOP artists**
> 1. Who took first place most often?
> 2. Who appeared on the chart most frequently?

# 1. Who took first place most often?

In [None]:
topartists = df[df['rank']==1] # extract the rows ranked first place
# every remaining rank is 1, so the sum of 'rank' is the number of weeks at first place
topartists = topartists.groupby('artist')[['rank']].sum()
topartists[['rank']].sort_values('rank', ascending = False).head(10)

**Mariah Carey is the artist who took first place most often.**

In [None]:
topartists_rank = topartists.sort_values('rank', ascending = False)['rank']
topartists_rank = topartists_rank.head(10) # the first 10 artists

#color set
colors = ('red', 'darkorange', 'pink', 'yellow', 'yellowgreen', 
          'green', 'lightskyblue', 'lightblue', 'royalblue', 'orchid')

fig = plt.figure(figsize=(11,11))
ax = fig.add_subplot() 

# chart by rank 
ax.pie(topartists_rank,
       labels=topartists_rank.index,
       autopct='%1.1f%%',
       textprops={'size': 15},
       colors = colors,
       startangle=90, counterclock=False)
ax.set_title('The artists who took first place most often', fontsize=25)
fig.subplots_adjust(wspace=0.7)
plt.show()

# 2. Who appeared on the chart most frequently?

In [None]:
df_artist = df['artist'].value_counts().head(10)
print(df_artist)

In [None]:
df_artist = df_artist.head(10) # the first ten artists
plt.figure(figsize=(15,8)) # figsize: inch by inch
sns.barplot(x = df_artist.index, y = df_artist.values, alpha=0.9, palette = 'spring')
plt.ylabel('Count', fontsize=18)
plt.xlabel('Artists name', fontsize=18)
plt.title('The artists who appeared on the chart most frequently', fontsize=25)
plt.xticks(rotation=45, fontsize=15)
plt.show()

> # **The TOP songs**
> 1. Which songs took first place most often?
> 2. Which songs appeared on the chart for the longest time (most frequently)?

**💡 In this part, *song* alone is ambiguous because different artists have songs with the same *song title*. Therefore, songs should be grouped by title and *artist* together.**
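
One way to see how many titles are affected is to count the distinct artists per title. This is a minimal sketch on a hypothetical toy frame; the artist names and the variables `artists_per_title` and `shared` are made up for illustration.

```python
import pandas as pd

# Hypothetical chart rows (not real data): 'Stay' is used by two artists.
toy = pd.DataFrame({
    'song': ['Stay', 'Stay', 'Stay', 'Hello'],
    'artist': ['Artist X', 'Artist Y', 'Artist Y', 'Artist Z'],
})

# Number of distinct artists behind each title.
artists_per_title = toy.groupby('song')['artist'].nunique()

# Titles shared by more than one artist.
shared = artists_per_title[artists_per_title > 1]
print(shared)
```

Applied to the real `df`, the length of `shared` would give the number of ambiguous titles.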

> # 1. Which songs took first place most often?

In [None]:
topsongs = df[df['rank']==1] # extract the songs which ranked first place
topsongs = topsongs[['song','artist']].value_counts() 
topsongs = topsongs.reset_index(name='counts').head(15) # reset index name
topsongs

**Old Town Road is the song that took first place most often.**

In [None]:
plt.figure(figsize = (15, 20))
sns.barplot(data = topsongs, x = 'counts', y = 'song', palette = 'rainbow')
plt.yticks(fontsize = 30)
plt.xticks(fontsize = 30)
plt.title('The songs which took first place most often', fontsize = 60, loc='right')
plt.show()

> # 2. Which songs appeared on the chart for the longest time?

In [None]:
df_song = df['song'].value_counts()
df_song

> # A problem :

**'Stay' is the title that appears most often. However, several different songs share the same title.**

**For example, fourteen artists have songs titled 'Stay'.**

In [None]:
# check same title :'Stay'
stay = df[df['song']=='Stay']
stay = stay['artist'].value_counts()
stay

In [None]:
colors = ('red', 'orange', 'lightcyan', 'yellow', 'teal', 
          'palegreen', 'mistyrose', 'olive', 'lavender', 'dodgerblue', 'blue', 'purple', 'pink', 'cyan')
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot() 
ex = np.zeros(len(colors))
ex[-3:]=(0.1,0.2,0.3)

# <Stay>s 
ax.pie(stay,
       autopct='%1.1f%%',
       textprops={'size': 12},
       colors = colors, pctdistance=0.7, radius=1.3,
       startangle=180, counterclock=False, explode=ex)
ax.set_title('<Stay>s', fontsize=30, loc='left')
plt.subplots_adjust(left=0.0, bottom=0.0, right=0.85)

# use legend
plt.legend(stay.index.unique(), loc="center right", fontsize=12, 
           bbox_to_anchor=(1.35, 0.5), bbox_transform=plt.gcf().transFigure)
plt.show()

**To avoid this problem, I count by the 'song' and 'artist' columns together.**

In [None]:
longest = df[['song','artist']].value_counts()
longest = longest.reset_index(name='max') # reset index name
longest
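
The `max` column above holds the number of chart rows per (song, artist) pair. A related quantity is the largest `weeks-on-board` value recorded for each pair. The sketch below computes both on a hypothetical toy frame so they can be compared side by side; the names `grouped` and `compare` are illustrative, and whether the two columns agree on the real data is not checked here.

```python
import pandas as pd

# Hypothetical chart rows (not real data).
toy = pd.DataFrame({
    'song': ['Stay', 'Stay', 'Stay', 'Stay', 'Hello'],
    'artist': ['Artist X', 'Artist X', 'Artist X', 'Artist Y', 'Artist Z'],
    'weeks-on-board': [1, 2, 3, 1, 1],
})

grouped = toy.groupby(['song', 'artist'])['weeks-on-board']

# 'rows' counts chart appearances; 'max_weeks' is the largest recorded value.
compare = pd.DataFrame({
    'rows': grouped.size(),
    'max_weeks': grouped.max(),
}).sort_values('rows', ascending=False).reset_index()

print(compare)
```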

**The song "Radioactive" by Imagine Dragons was on the chart for 87 weeks. Next, let's look at the 10 songs that stayed on the chart longest.**

In [None]:
longest10 = longest.head(10)
plt.figure(figsize = (10, 10))
sns.barplot(data = longest10, x = 'max', y = 'song', palette = 'winter', alpha = 0.8)
plt.yticks(fontsize = 15)
plt.title('The songs that appeared on the chart for the longest time', fontsize = 25)
plt.show()

**I also want to see the distribution of the 'max' values (weeks on the chart per song).**

In [None]:
plt.rcParams["figure.figsize"]=(18,23)
sns.set(style="darkgrid")
ax = sns.countplot(y='max', palette="spring_r", data=longest)
plt.title('The distribution of max values in weeks on board', fontsize=30)

# annotation: write each bar's count next to the bar
for p in ax.patches:
    value = p.get_width()
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height() - 0.07
    ax.annotate(value, (x, y))
plt.show()

> # **Wordcloud of the song titles**
What words were used in the titles?

In [None]:
# import wordcloud
from PIL import Image as im
from wordcloud import WordCloud,STOPWORDS

plt.subplots(figsize=(14,10))
wc = WordCloud(max_words=100,
               stopwords=STOPWORDS, max_font_size=180,
               random_state=42, width=600, height=300, colormap='cool')
wc.generate(' '.join(df['song']))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.show()

**Love, heart... These are the words that appear most often in the titles.❤**
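
The word cloud shows relative sizes only. To read off actual counts, a plain word tally can be used as a cross-check. This is a minimal sketch with the standard library: the `titles` list and the small `stop_words` set are hypothetical stand-ins for `df['song']` and wordcloud's `STOPWORDS`, and the tokenization here is simpler than what `WordCloud` does internally, so results on real data may differ.

```python
import re
from collections import Counter

# Hypothetical titles standing in for df['song'].
titles = ['Love Me Do', 'Heart Of Glass', 'Crazy In Love', 'Love Story']

# Small illustrative stop-word set; wordcloud's STOPWORDS is larger.
stop_words = {'of', 'in', 'me', 'the', 'a'}

words = []
for title in titles:
    # Lower-case the title and keep runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", title.lower()):
        if word not in stop_words:
            words.append(word)

word_counts = Counter(words)
print(word_counts.most_common(3))
```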

# **✨Thank you✨**