Now that you have scraped the Billboard website to create a hot_songs dataset, it's time to prepare a new dataset of not_hot_songs. This dataset can contain songs of your choice, songs collected from the web, or any combination of the two. Some possible sources of songs are:

Wikipedia
A subset of the Million Song Dataset (note: this dataset takes several GB of disk space!)
Kaggle
Your favourite songs included in the Trello board as long as they are not hot songs :)

Considerations
You want your dataset of not_hot_songs to be:

As heterogeneous as possible in terms of genre, length, etc., to create better groups of songs.
Not too big and not too small (typically around 2-3K songs).
In a real-life scenario, you might want your dataset to be as big as possible and use specialized Big Data tools like PySpark to group similar songs together. However, you are going to work on your own laptop, which has limited power. Therefore, you need to limit the size of your not_hot_songs dataset; otherwise the process of grouping similar songs will take forever.

Deliverables
Your fork should contain a Jupyter notebook with the code to:

Gather the songs
Remove songs already present in the hot_songs dataset

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
data = pd.read_csv('Spotify_final_dataset.csv')

In [3]:
data.head()

Unnamed: 0,Position,Artist Name,Song Name,Days,Top 10 (xTimes),Peak Position,Peak Position (xTimes),Peak Streams,Total Streams
0,1,Post Malone,Sunflower SpiderMan: Into the SpiderVerse,1506,302.0,1,(x29),2118242,883369738
1,2,Juice WRLD,Lucid Dreams,1673,178.0,1,(x20),2127668,864832399
2,3,Lil Uzi Vert,XO TOUR Llif3,1853,212.0,1,(x4),1660502,781153024
3,4,J. Cole,No Role Modelz,2547,6.0,7,0,659366,734857487
4,5,Post Malone,rockstar,1223,186.0,1,(x124),2905678,718865961


In [14]:
hot_songs=pd.read_csv('hot_songs.csv')

In [15]:
hot_songs

Unnamed: 0,Title,Artists
0,Rockin' Around The Christmas Tree,Brenda Lee
1,All I Want For Christmas Is You,Mariah Carey
2,Jingle Bell Rock,Bobby Helms
3,Last Christmas,Wham!
4,A Holly Jolly Christmas,Burl Ives
...,...,...
95,El Amor de Su Vida,Grupo Frontera & Grupo Firme
96,Standing Next To You,Jung Kook
97,Man Made A Bar,Morgan Wallen Featuring Eric Church
98,Que Onda,Calle 24 x Chino Pacas x Fuerza Regida


In [6]:
data=data[["Artist Name","Song Name"]]

In [7]:
data

Unnamed: 0,Artist Name,Song Name
0,Post Malone,Sunflower SpiderMan: Into the SpiderVerse
1,Juice WRLD,Lucid Dreams
2,Lil Uzi Vert,XO TOUR Llif3
3,J. Cole,No Role Modelz
4,Post Malone,rockstar
...,...,...
11079,The Band Perry,If I Die Young
11080,Justin Timberlake,Not a Bad Thing
11081,Mike WiLL Made,It 23
11082,The Vamps,Somebody To You


In [8]:
# dropna() returns a new dataframe, so assign the result back to keep it
data = data.dropna()
data

Unnamed: 0,Artist Name,Song Name
0,Post Malone,Sunflower SpiderMan: Into the SpiderVerse
1,Juice WRLD,Lucid Dreams
2,Lil Uzi Vert,XO TOUR Llif3
3,J. Cole,No Role Modelz
4,Post Malone,rockstar
...,...,...
11079,The Band Perry,If I Die Young
11080,Justin Timberlake,Not a Bad Thing
11081,Mike WiLL Made,It 23
11082,The Vamps,Somebody To You


In [16]:
def duplicate_remover(df: pd.DataFrame) -> pd.DataFrame:
    """
    Return a copy of df with fully duplicated rows removed.

    A row is dropped only if every column matches an earlier row.
    The input dataframe is left unchanged.
    """
    # drop_duplicates() already returns a new dataframe, so no explicit copy is needed
    return df.drop_duplicates()

data = duplicate_remover(data)
data

Unnamed: 0,Artists,Title,Artists_lower,Title_lower
0,Post Malone,Sunflower SpiderMan: Into the SpiderVerse,post malone,sunflower spiderman: into the spiderverse
1,Juice WRLD,Lucid Dreams,juice wrld,lucid dreams
2,Lil Uzi Vert,XO TOUR Llif3,lil uzi vert,xo tour llif3
3,J. Cole,No Role Modelz,j. cole,no role modelz
4,Post Malone,rockstar,post malone,rockstar
...,...,...,...,...
11079,The Band Perry,If I Die Young,the band perry,if i die young
11080,Justin Timberlake,Not a Bad Thing,justin timberlake,not a bad thing
11081,Mike WiLL Made,It 23,mike will made,it 23
11082,The Vamps,Somebody To You,the vamps,somebody to you


In [17]:
data=data.rename(columns={"Artist Name":"Artists","Song Name":"Title"})
data

Unnamed: 0,Artists,Title,Artists_lower,Title_lower
0,Post Malone,Sunflower SpiderMan: Into the SpiderVerse,post malone,sunflower spiderman: into the spiderverse
1,Juice WRLD,Lucid Dreams,juice wrld,lucid dreams
2,Lil Uzi Vert,XO TOUR Llif3,lil uzi vert,xo tour llif3
3,J. Cole,No Role Modelz,j. cole,no role modelz
4,Post Malone,rockstar,post malone,rockstar
...,...,...,...,...
11079,The Band Perry,If I Die Young,the band perry,if i die young
11080,Justin Timberlake,Not a Bad Thing,justin timberlake,not a bad thing
11081,Mike WiLL Made,It 23,mike will made,it 23
11082,The Vamps,Somebody To You,the vamps,somebody to you


In [18]:
data['Artists_lower'] = data['Artists'].str.lower()
data['Title_lower'] = data['Title'].str.lower()
hot_songs['Artists_lower'] = hot_songs['Artists'].str.lower()
hot_songs['Title_lower'] = hot_songs['Title'].str.lower()

In [None]:
filtered_data = data[~((data['Artists_lower'].isin(hot_songs['Artists_lower'])) & (data['Title_lower'].isin(hot_songs['Title_lower'])))]

filtered_data = filtered_data.drop(['Artists_lower', 'Title_lower'], axis=1).reset_index(drop=True)
filtered_data
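
Note that the filter above checks artist and title independently: a row is dropped whenever its artist appears anywhere in hot_songs and its title appears anywhere in hot_songs, even if they come from different rows. A minimal sketch of a pair-wise alternative follows, assuming both dataframes have `Artists` and `Title` columns. The helper name `remove_hot_songs`, the key column names, and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def remove_hot_songs(candidates: pd.DataFrame, hot: pd.DataFrame) -> pd.DataFrame:
    """Drop rows of `candidates` whose (artist, title) pair also appears in `hot`.

    Hypothetical helper; assumes both frames have 'Artists' and 'Title' columns.
    """
    def make_keys(df: pd.DataFrame) -> pd.DataFrame:
        # Normalise case and surrounding whitespace so near-identical strings match.
        return pd.DataFrame({
            "artist_key": df["Artists"].str.strip().str.lower(),
            "title_key": df["Title"].str.strip().str.lower(),
        })

    # Deduplicate the hot keys so the left merge yields exactly one row per candidate.
    hot_keys = make_keys(hot).drop_duplicates()
    # `_merge` records whether each candidate pair was also found in `hot`.
    merged = make_keys(candidates).merge(
        hot_keys, on=["artist_key", "title_key"], how="left", indicator=True
    )
    keep = (merged["_merge"] == "left_only").to_numpy()
    return candidates[keep].reset_index(drop=True)


# Toy data for illustration only (not taken from the real datasets).
toy_candidates = pd.DataFrame({
    "Artists": ["Wham!", "Drake", "Drake"],
    "Title": ["Last Christmas", "Last Christmas", "God's Plan"],
})
toy_hot = pd.DataFrame({
    "Artists": ["Wham!", "Drake"],
    "Title": ["last christmas", "Some Other Song"],
})

toy_filtered = remove_hot_songs(toy_candidates, toy_hot)
print(toy_filtered)
```

In the toy data only the Wham! row is removed. The second row shares its title with one hot song and its artist with another, so the independent `isin` checks would have dropped it as well, while the pair-wise merge keeps it.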

Conclusion: the filter found no matches between data and the hot_songs list from the Billboard Top 100, so we are going to proceed and sample from the larger dataset.

In [28]:
data=filtered_data.sample(2500)
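
Without a seed, `sample` draws a different set of rows on every run, so the resulting not_hot_songs file cannot be reproduced exactly. A minimal sketch of a seeded variant follows; the helper name `sample_songs`, the seed value 42, and the toy dataframe are assumptions chosen for illustration, not part of the original notebook.

```python
import pandas as pd


def sample_songs(df: pd.DataFrame, n: int = 2500, seed: int = 42) -> pd.DataFrame:
    """Return up to `n` randomly chosen rows; the same seed gives the same rows."""
    # Cap n so that sampling without replacement never asks for more rows than exist.
    n = min(n, len(df))
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy dataframe for illustration only.
toy = pd.DataFrame({
    "Artists": [f"artist {i}" for i in range(10)],
    "Title": [f"song {i}" for i in range(10)],
})

first = sample_songs(toy, n=4)
second = sample_songs(toy, n=4)
print(first.equals(second))
```

Using `reset_index(drop=True)` here also avoids carrying the old row labels along as an extra `index` column.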


In [30]:
data=data.reset_index()

In [31]:
data

Unnamed: 0,index,Artists,Title
0,1306,Sam Feldt,Post Malone
1,7333,Quality Control,100 Racks (Offset feat. Playboi Carti)
2,17,Drake,God's Plan
3,10277,Francis and the Lights,May I Have This Dance (Remix)
4,5291,Future,Trapped In The Sun
...,...,...,...
2495,6012,YoungBoy Never Broke Again,Vette Motors
2496,951,Topic,Breaking Me
2497,4273,OneRepublic,Wanted
2498,9345,Eslabon Armado,Con Tus Besos


In [32]:
# index=False avoids writing the row labels as an extra unnamed column
data.to_csv("not_hot_songs.csv", index=False)