## 1. Introduction

I really like this group, Linkin Park. Now they are back with a new singer, Emily Armstrong, who takes over from Chester Bennington, the best singer in my opinion.

That makes this a good moment to analyze Linkin Park's trajectory through the years. This work showcases my ability to clean a dataset and create insightful visualizations of what I consider the key parameters of this group.

### Objectives in Tableau
1. Album with the most songs
2. Relationship between song duration and popularity
3. Trajectory over the years
4. Top 10 most popular songs
5. Top 10 longest songs

Let's begin!

## 2. Identification, cleaning, filling, and transformation of the dataset

In [1]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('/kaggle/input/inputs/linkin_park_all_songs.csv')
print(f'Dimensions of the dataset: {data.shape}')

Dimensions of the dataset: (1000, 6)


In [2]:
# Quick look at the first rows of the dataframe
data.head()

Unnamed: 0,Song Name,Album,Duration (minutes),Release Year,Popularity,Track ID
0,In the End,Hybrid Theory (Bonus Edition),3.61,2000,88,60a0Rd6pjrkxjPbaKzXjfq
1,Numb,Meteora,3.13,2003,87,2nLtzopw4rPReszdYBJU6h
2,Faint,Meteora,2.7,2003,84,4Yf5bqU3NK4kNOypcrLYwU
3,The Emptiness Machine,The Emptiness Machine,3.17,2024,88,2PnlsTsOTLE5jnBnNe2K0A
4,One Step Closer,Hybrid Theory (Bonus Edition),2.62,2000,82,3K4HG9evC7dg3N0R9cYqk4


In [3]:
# Show the non-null count and dtype of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Song Name           1000 non-null   object 
 1   Album               1000 non-null   object 
 2   Duration (minutes)  1000 non-null   float64
 3   Release Year        1000 non-null   int64  
 4   Popularity          1000 non-null   int64  
 5   Track ID            1000 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 47.0+ KB


The output shows no missing values in any column, so there is no need for .dropna(). We will, however, use .drop() to remove the "Track ID" column, which is irrelevant to our analysis.

Once the cleaning is finished, we will compare the original dataset with the new one, since the differences are easier to see in the visualizations.
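As a minimal sketch of the two checks described above, the snippet below counts missing values per column and drops a column by name. The three-row `sample` frame is a stand-in built from the rows shown earlier, not the real file, and the variable names are only illustrative.

```python
import pandas as pd

# Small stand-in for the real dataset (three rows copied from data.head()).
sample = pd.DataFrame({
    "Song Name": ["In the End", "Numb", "Faint"],
    "Album": ["Hybrid Theory (Bonus Edition)", "Meteora", "Meteora"],
    "Track ID": [
        "60a0Rd6pjrkxjPbaKzXjfq",
        "2nLtzopw4rPReszdYBJU6h",
        "4Yf5bqU3NK4kNOypcrLYwU",
    ],
})

# Missing values per column; all zeros means .dropna() is unnecessary.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Drop the identifier column by name.
trimmed = sample.drop(columns="Track ID")
print(list(trimmed.columns))
```

`drop(columns=...)` does the same job as `drop(..., axis=1)`; it is simply a little more explicit about what is being removed.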

In [4]:
new_data = data.drop('Track ID',axis=1)
new_data.head()

Unnamed: 0,Song Name,Album,Duration (minutes),Release Year,Popularity
0,In the End,Hybrid Theory (Bonus Edition),3.61,2000,88
1,Numb,Meteora,3.13,2003,87
2,Faint,Meteora,2.7,2003,84
3,The Emptiness Machine,The Emptiness Machine,3.17,2024,88
4,One Step Closer,Hybrid Theory (Bonus Edition),2.62,2000,82


In [5]:
print(f'Dimensions BEFORE removing duplicates:{new_data.shape}')
new_data.drop_duplicates(inplace=True)
print(f'Dimensions AFTER removing duplicates:{new_data.shape}')

Dimensions BEFORE removing duplicates:(1000, 5)
Dimensions AFTER removing duplicates:(951, 5)
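Before dropping duplicates, it can help to look at which rows are affected. The sketch below uses a hypothetical four-row frame (the popularity value 60 is invented for illustration) to show that only rows identical in every column count as duplicates.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 are identical, row 2 differs in two columns.
sample = pd.DataFrame({
    "Song Name": ["Numb", "Numb", "Numb", "Faint"],
    "Album": ["Meteora", "Meteora", "Meteora (Bonus Edition)", "Meteora"],
    "Popularity": [87, 87, 60, 84],
})

# keep=False flags every member of a duplicated group, not only the repeats.
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates() keeps the first occurrence of each fully identical row.
deduplicated = sample.drop_duplicates()
print(f"Rows removed: {len(sample) - len(deduplicated)}")
```

This also explains why the same song name can still appear many times afterwards: a repeated title on a different album or with a different popularity is not a duplicate row.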


In [6]:
# Count the unique values in "Song Name"
new_data['Song Name'].nunique()

466

In [7]:
# First 15 song names
list(new_data['Song Name'].unique()[:15])

['In the End',
 'Numb',
 'Faint',
 'The Emptiness Machine',
 'One Step Closer',
 'Two Faced',
 'Heavy Is the Crown',
 'Cut the Bridge',
 'Stained',
 'Over Each Other',
 'Good Things Go',
 'Casualty',
 'Overflow',
 'Papercut',
 'IGYEIH']

In [8]:
# First look at how often each song name is repeated
new_data['Song Name'].value_counts()[:15]

Song Name
In the End            50
Numb                  41
Friendly Fire         22
Lost                  21
What I’ve Done        16
Fighting Myself       12
Somewhere I Belong    11
One More Light        11
BURN IT DOWN          11
Numb / Encore         11
My December           11
What I've Done        10
Faint                  9
One Step Closer        8
Breaking the Habit     7
Name: count, dtype: int64

In [9]:
# Convert everything to lowercase
new_data['Song Name']= new_data['Song Name'].str.lower()

# Split the strings using "-" as the separator
new_data['Song Name'] = new_data['Song Name'].str.split('-',n=1).str[0]

# Check dataframe
new_data['Song Name'].value_counts()[:15]

Song Name
in the end            50
numb                  41
friendly fire         22
lost                  21
what i’ve done        16
fighting myself       12
my december           11
one more light        11
burn it down          11
numb / encore         11
somewhere i belong    11
what i've done        10
faint                  9
in the end             9
crawling               8
Name: count, dtype: int64

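The counts barely change, but "in the end" now appears twice (50 and 9). Splitting on "-" most likely left a trailing space on some titles, so they are still counted separately until we strip whitespace in the next cell.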
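A small sketch of why a title can show up twice after the split. The " - Live" suffix is an assumed example of a variant title, not something confirmed from the dataset.

```python
import pandas as pd

# Hypothetical titles; " - Live" is an assumed variant suffix.
titles = pd.Series(["In the End", "In the End - Live", "Numb"])

lowered = titles.str.lower()
before_dash = lowered.str.split("-", n=1).str[0]

# repr() makes the trailing space left by the split visible.
print([repr(t) for t in before_dash])

# Stripping whitespace merges the two spellings into one.
stripped = before_dash.str.strip()
print(stripped.value_counts())
```

Two strings that look the same in `value_counts()` output can still differ by invisible whitespace, which is why the strip step matters.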

In [10]:
# Keep only the text before the first "/", "(" or '"'
new_data['Song Name'] = new_data['Song Name'].str.split('/',n=1).str[0]
new_data['Song Name'] = new_data['Song Name'].str.split('(',n=1).str[0]
new_data['Song Name'] = new_data['Song Name'].str.split('"',n=1).str[0]

# Remove white spaces
new_data['Song Name'] = new_data['Song Name'].str.strip()
new_data['Song Name'].value_counts()[:15]

Song Name
numb                   65
in the end             62
friendly fire          25
lost                   23
one step closer        18
what i've done         17
burn it down           17
faint                  17
what i’ve done         16
points of authority    16
somewhere i belong     16
crawling               15
breaking the habit     14
one more light         14
papercut               14
Name: count, dtype: int64
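Note that "what i've done" still appears twice in the output above, once with a straight apostrophe and once with a typographic one (’). The notebook leaves this as is. If one wanted to merge them, a possible sketch (not part of the original pipeline) would be:

```python
import pandas as pd

# Two spellings that differ only in the apostrophe character.
titles = pd.Series(["what i've done", "what i\u2019ve done", "numb"])

# Replace the typographic apostrophe (U+2019) with the ASCII one.
normalized = titles.str.replace("\u2019", "'", regex=False)
print(normalized.value_counts())
```

Applied to the real column, this would presumably combine the two "what i've done" counts, although the exact totals depend on the data.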

In [11]:
new_data.tail()

Unnamed: 0,Song Name,Album,Duration (minutes),Release Year,Popularity
993,guilty all the same,ALT. 21st Century Classics,5.92,2023,8
994,grr,Hybrid Theory (20th Anniversary Edition),0.45,2020,1
996,friendly fire,the new rock alt sounds 2024,2.95,2024,0
998,wretches and kings,Rock & Loud,4.18,2024,1
999,in the end,Nostalgia ultra,3.61,2023,2


We can see that some songs last less than 1 minute; row 994 is an example. We will remove the songs with this characteristic.

In [12]:
new_data = new_data[new_data['Duration (minutes)']>=1]
new_data.tail()

Unnamed: 0,Song Name,Album,Duration (minutes),Release Year,Popularity
992,numb,Cottage Rules Make Memories,3.13,2023,6
993,guilty all the same,ALT. 21st Century Classics,5.92,2023,8
996,friendly fire,the new rock alt sounds 2024,2.95,2024,0
998,wretches and kings,Rock & Loud,4.18,2024,1
999,in the end,Nostalgia ultra,3.61,2023,2
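To see which rows a filter removes before applying it, one option is to build the boolean mask first. The sketch below uses three rows taken from the tail output above; the variable names are illustrative.

```python
import pandas as pd

# Three rows copied from the tail of the cleaned dataframe.
sample = pd.DataFrame({
    "Song Name": ["guilty all the same", "grr", "friendly fire"],
    "Duration (minutes)": [5.92, 0.45, 2.95],
})

# Mask of tracks shorter than one minute.
too_short = sample["Duration (minutes)"] < 1

# Rows that will be dropped.
print(sample[too_short])

# With no missing durations, this matches keeping durations >= 1.
filtered = sample[~too_short]
print(filtered)
```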


In [13]:
print(f'The new dimension is {new_data.shape}, while the previous one was {data.shape}')

The new dimension is (934, 5), while the previous one was (1000, 6)


Now, we will follow the same steps for the "Album" column.
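Since the same steps are repeated for both columns, they could be wrapped in a small helper. The function below, `clean_text_column`, is a hypothetical name introduced here for illustration; the notebook itself applies the steps inline.

```python
import pandas as pd


def clean_text_column(series, separators=("-", "/", "(")):
    """Lowercase, keep the text before the first separator, strip whitespace."""
    cleaned = series.str.lower()
    for sep in separators:
        # pandas treats a single-character pattern as a literal string,
        # so "(" does not need escaping here.
        cleaned = cleaned.str.split(sep, n=1).str[0]
    return cleaned.str.strip()


# Album names taken from the unique values listed in this notebook.
albums = pd.Series([
    "Hybrid Theory (Bonus Edition)",
    "Meteora",
    "Numb / Encore: MTV Ultimate Mash-Ups Presents Collision Course",
])

cleaned_albums = clean_text_column(albums)
print(cleaned_albums.tolist())
```

For the "Song Name" column, the extra `'"'` separator could be passed through the `separators` argument.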

In [14]:
list(new_data['Album'].unique()[:15])

['Hybrid Theory (Bonus Edition)',
 'Meteora',
 'The Emptiness Machine',
 'From Zero',
 'Heavy Is the Crown',
 'Minutes to Midnight',
 'Two Faced',
 'Numb / Encore: MTV Ultimate Mash-Ups Presents Collision Course',
 'Meteora (Bonus Edition)',
 'Over Each Other',
 'LIVING THINGS',
 'One More Light',
 'Transformers: Revenge Of The Fallen The Album',
 'Lost',
 'A Thousand Suns']

In [15]:
# Lower case
new_data['Album']= new_data['Album'].str.lower()

# Keep only the text before the first "-", "/" or "(" and strip whitespace
new_data['Album'] = new_data['Album'].str.split('-',n=1).str[0]
new_data['Album'] = new_data['Album'].str.split('/',n=1).str[0]
new_data['Album'] = new_data['Album'].str.split('(',n=1).str[0]
new_data['Album'] = new_data['Album'].str.strip()

# Keep only albums that appear more than 3 times
count_album = new_data['Album'].value_counts()
values_to_keep_album = count_album[count_album > 3].index
new_data = new_data[new_data['Album'].isin(values_to_keep_album)]

In [16]:
print(f'The final dimensions are {new_data.shape}')

The final dimensions are (637, 5)
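The frequency filter used above keeps an album only when it appears strictly more than 3 times. A minimal sketch with made-up counts shows the boundary case: an album with exactly 3 rows is removed.

```python
import pandas as pd

# Hypothetical counts: 4 rows, 3 rows and 5 rows respectively.
albums = pd.Series(
    ["meteora"] * 4 + ["lost"] * 3 + ["from zero"] * 5, name="Album"
)

counts = albums.value_counts()

# Strictly greater than 3, so "lost" (exactly 3 rows) is dropped.
kept_albums = counts[counts > 3].index
filtered = albums[albums.isin(kept_albums)]

print(sorted(filtered.unique()))
print(len(filtered))
```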


In [17]:
new_data.head()

Unnamed: 0,Song Name,Album,Duration (minutes),Release Year,Popularity
0,in the end,hybrid theory,3.61,2000,88
1,numb,meteora,3.13,2003,87
2,faint,meteora,2.7,2003,84
4,one step closer,hybrid theory,2.62,2000,82
5,two faced,from zero,3.06,2024,81


## 3. Final dataset
In the end, we obtained a new dataset of Linkin Park songs, ready for use in BI tools (Tableau in my case).

In [18]:
new_data.to_csv('/kaggle/working/new_data_linkinpark.csv', index=False)
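Although the dashboard objectives are built in Tableau, a quick pandas check can help confirm that the exported file gives sensible answers. The sketch below uses only the five rows shown by `new_data.head()`, so the results are illustrative and not the real dashboard figures.

```python
import pandas as pd

# Five rows copied from new_data.head(); not the full dataset.
sample = pd.DataFrame({
    "Song Name": ["in the end", "numb", "faint", "one step closer", "two faced"],
    "Album": ["hybrid theory", "meteora", "meteora", "hybrid theory", "from zero"],
    "Duration (minutes)": [3.61, 3.13, 2.70, 2.62, 3.06],
    "Release Year": [2000, 2003, 2003, 2000, 2024],
    "Popularity": [88, 87, 84, 82, 81],
})

# Objective 1: distinct songs per album.
songs_per_album = sample.groupby("Album")["Song Name"].nunique()

# Objective 2: linear correlation between duration and popularity.
duration_popularity_corr = sample["Duration (minutes)"].corr(sample["Popularity"])

# Objectives 4 and 5: top-N by popularity and by duration (N=3 for this sample).
top_popular = sample.nlargest(3, "Popularity")[["Song Name", "Popularity"]]
top_longest = sample.nlargest(3, "Duration (minutes)")[["Song Name", "Duration (minutes)"]]

print(songs_per_album)
print(round(duration_popularity_corr, 2))
print(top_popular)
print(top_longest)
```

On the full dataset, the same song can still appear on several albums with different popularity scores, so a top-10 list may need one row per song first (for example, the maximum popularity per title).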

![PNG Dashboard](https://storage.googleapis.com/kagglesdsdata/datasets/6200869/10062070/LINKIN%20PARK.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20241130%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20241130T220928Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=c325c20495b776ff151038846f3cd783ba288bd886d676cb8383546d33f899787faad6c999f803b3461c06f3ec8861385b583966b766b52775458ee973d2a0bf9afe589937e57878a80b1c00b421b39433fec8e0aaea2bfa1c22a944824d1b77dce697381d2c4eb47871286b1cd4d2ea7f8b934323d40c2416c2346fe4de7417f016e2a0bdafb00915a01647be1a5e4e8089baa45843e8239363c158273c3d9ef1289999c33708d043ed96658ed918f0a03a5ac94217715ed141c9da718f33cbb2e45b371f10edee9b043e421aedc874584b01b5e7d5830d3e62b7ef2bd211d0b4b8d5169a7c906a93b67e54a5f99451de3bda2057825849448358d3d66aabaf)

[CLICK TO WATCH DASHBOARD ON MY TABLEAU 😊📊](https://public.tableau.com/views/Dashboard-LinkinPark/LINKINPARK?:language=es-ES&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)