# Prototype

## Business case

Step 1:
- Scrape data from Wikipedia (https://en.wikipedia.org/wiki/Triple_J_Hottest_100) and Billboard (https://www.billboard.com/charts/hot-100)
- Ask the user for a song title
- Check whether the song is in the list
- If the song is in the list, recommend 3 other songs from the list (chosen at random)
- If the song is not in the list, return no recommendations
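
The Step 1 flow can be sketched as follows. This is a minimal sketch, assuming the stored titles are already lowercased; `recommend_songs` and the sample `hot_songs` list are hypothetical names used only for illustration.

```python
import random

def recommend_songs(title, hot_songs, n=3, rng=random):
    """Return up to n random other songs if title is in hot_songs, else []."""
    query = title.strip().lower()
    if query not in hot_songs:
        return []  # song is not in the list: no recommendations
    # Exclude the queried song so it is never recommended back
    others = [song for song in hot_songs if song != query]
    return rng.sample(others, min(n, len(others)))

# Hypothetical sample data, assumed to be lowercased already
hot_songs = ["heat waves", "booster seat", "the difference", "cherub", "up"]

print(recommend_songs("Heat Waves", hot_songs))    # three random titles other than "heat waves"
print(recommend_songs("unknown song", hot_songs))  # []
```

Passing `rng=random.Random(0)` would make the selection reproducible, which may help when testing.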

Step 2:
- Accept multiple values for '&' and '+' (and)
- Add a Spotify link to each recommended song
- Scrape the data every week, check for updates, remove songs that are no longer in the list, and add songs that are new
- Split the code into one Python file for web scraping and one for recommending songs
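
One possible way to handle the weekly update is to compare the stored songs with a fresh scrape using set differences. This is a minimal sketch under that assumption; `diff_charts` and the sample data are hypothetical and not part of the notebook.

```python
def diff_charts(stored, scraped):
    """Compare stored (song, artist) pairs with a fresh scrape.

    Returns two sorted lists: pairs to add and pairs to remove.
    """
    stored_set, scraped_set = set(stored), set(scraped)
    added = sorted(scraped_set - stored_set)    # new this week
    removed = sorted(stored_set - scraped_set)  # dropped off the list
    return added, removed

# Hypothetical sample data
stored = [("heat waves", "Glass Animals"), ("cherub", "Ball Park Music")]
scraped = [("heat waves", "Glass Animals"), ("up", "Cardi B")]

added, removed = diff_charts(stored, scraped)
print(added)    # [('up', 'Cardi B')]
print(removed)  # [('cherub', 'Ball Park Music')]
```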

## Scraping websites

In [None]:
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from time import sleep
import numpy as np
import random

### Scraping Billboard Hot 100 2021

In [None]:
url = "https://www.billboard.com/charts/hot-100"
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
song_2021 = []
artist_2021 = []
year_2021 = []
# Each chart entry sits in a span with class "chart-element__information"
songs_2021 = soup.find_all('span', attrs={"class": "chart-element__information"})
for x in songs_2021:
    try:
        song = x.find('span', attrs={'class': 'chart-element__information__song text--truncate color--primary'}).text.strip()
        song_2021.append(song)
    except AttributeError:  # find() returned None: the song span is missing
        song_2021.append('NA')
    try:
        artist = x.find('span', attrs={'class': 'chart-element__information__artist text--truncate color--secondary'}).text.strip()
        artist_2021.append(artist)
    except AttributeError:  # find() returned None: the artist span is missing
        artist_2021.append('NA')
    year_2021.append('2021')

df_2021 = pd.DataFrame({'song': song_2021, 'artist': artist_2021, 'year': year_2021})

In [None]:
df_2021

| | song | artist | year |
|---|---|---|---|
| 0 | Drivers License | Olivia Rodrigo | 2021 |
| 1 | 34+35 | Ariana Grande | 2021 |
| 2 | Calling My Phone | Lil Tjay Featuring 6LACK | 2021 |
| 3 | Blinding Lights | The Weeknd | 2021 |
| 4 | Up | Cardi B | 2021 |
| ... | ... | ... | ... |
| 95 | Almost Maybes | Jordan Davis | 2021 |
| 96 | Back To The Streets | Saweetie Featuring Jhene Aiko | 2021 |
| 97 | Bad Boy | Juice WRLD & Young Thug | 2021 |
| 98 | Opp Stoppa | YBN Nahmir Featuring 21 Savage | 2021 |


### Scraping Triple J Hottest 100 2020 from Wikipedia

In [None]:
url2 = "https://en.wikipedia.org/wiki/Triple_J_Hottest_100,_2020"
response2 = requests.get(url2)
response2.status_code # 200 status code means OK!

200

In [None]:
soup2 = BeautifulSoup(response2.content, "html.parser")
soup2.title  # check the page title instead of printing the whole document

<title>Triple J Hottest 100, 2020 - Wikipedia</title>

In [None]:
link_songs = soup2.select("table.wikitable.sortable > tbody > tr > td")[2].get_text()
link_songs 

'Glass Animals'

In [None]:
song_2020 = []
artist_2020 = []
year_2020 = []

if response2.status_code == 200:
    # Reuse the page fetched above; requesting it again for every table row is unnecessary
    soup2 = BeautifulSoup(response2.content, "html.parser")
    songs_2020 = soup2.select("table.wikitable.sortable > tbody > tr")
    for i in songs_2020:
        try:
            song_2020.append(i.select("td")[1].get_text().replace("\n", ""))
        except IndexError:  # row has no such <td> cell (e.g. a header row)
            song_2020.append('NA')
        try:
            artist_2020.append(i.select("td")[2].get_text().replace("\n", ""))
        except IndexError:  # row has no such <td> cell (e.g. a header row)
            artist_2020.append('NA')
        year_2020.append('2020')

# Keep the first 101 rows: the header row (dropped later) plus the 100 chart entries
df_2020 = pd.DataFrame({'song': song_2020[0:101], 'artist': artist_2020[0:101], 'year': year_2020[0:101]})

In [None]:
df_2020

| | song | artist | year |
|---|---|---|---|
| 0 | | | 2020 |
| 1 | Heat Waves | Glass Animals | 2020 |
| 2 | Booster Seat | Spacey Jane | 2020 |
| 3 | The Difference | Flume and Toro y Moi | 2020 |
| 4 | Cherub | Ball Park Music | 2020 |
| ... | ... | ... | ... |
| 96 | Germaphobe | Hockey Dad | 2020 |
| 97 | Audacity | Stormzy featuring Headie One | 2020 |
| 98 | Your Man | Joji | 2020 |
| 99 | Itch | Hockey Dad | 2020 |
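
Row 0 is most likely blank because the first `<tr>` of the table holds `<th>` header cells, so `select("td")` finds nothing in it. An alternative to dropping that row afterwards is to skip such rows while parsing. This is a minimal sketch, assuming a simplified inline table; the `html` string and the `rows` list are made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia table (assumed layout: rank, song, artist)
html = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>#</th><th>Song</th><th>Artist</th></tr>
    <tr><td>1</td><td>Heat Waves</td><td>Glass Animals</td></tr>
    <tr><td>2</td><td>Booster Seat</td><td>Spacey Jane</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("table.wikitable.sortable > tbody > tr"):
    cells = tr.select("td")
    if len(cells) < 3:
        continue  # header row (only <th> cells) or malformed row: skip it
    rows.append((cells[1].get_text(strip=True), cells[2].get_text(strip=True)))

print(rows)  # [('Heat Waves', 'Glass Animals'), ('Booster Seat', 'Spacey Jane')]
```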


### Concat songs 2021 and 2020

In [None]:
df_final = pd.concat([df_2020,df_2021], ignore_index=True)
df_final

| | song | artist | year |
|---|---|---|---|
| 0 | | | 2020 |
| 1 | Heat Waves | Glass Animals | 2020 |
| 2 | Booster Seat | Spacey Jane | 2020 |
| 3 | The Difference | Flume and Toro y Moi | 2020 |
| 4 | Cherub | Ball Park Music | 2020 |
| ... | ... | ... | ... |
| 196 | Almost Maybes | Jordan Davis | 2021 |
| 197 | Back To The Streets | Saweetie Featuring Jhene Aiko | 2021 |
| 198 | Bad Boy | Juice WRLD & Young Thug | 2021 |
| 199 | Opp Stoppa | YBN Nahmir Featuring 21 Savage | 2021 |


In [None]:
df_final = df_final[1:]
df_final

| | song | artist | year |
|---|---|---|---|
| 1 | Heat Waves | Glass Animals | 2020 |
| 2 | Booster Seat | Spacey Jane | 2020 |
| 3 | The Difference | Flume and Toro y Moi | 2020 |
| 4 | Cherub | Ball Park Music | 2020 |
| 5 | Lost in Yesterday | Tame Impala | 2020 |
| ... | ... | ... | ... |
| 196 | Almost Maybes | Jordan Davis | 2021 |
| 197 | Back To The Streets | Saweetie Featuring Jhene Aiko | 2021 |
| 198 | Bad Boy | Juice WRLD & Young Thug | 2021 |
| 199 | Opp Stoppa | YBN Nahmir Featuring 21 Savage | 2021 |


### Clean up song list

In [None]:
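def clean_input(series):
    # Punctuation to strip from the titles
    spec_chars = ["!",'"',"#","%","(",")","*",",","-",".","/",":",";","<","=",">","?","@","[","\\","]","^","_","`","{","|","}","~","–"]
    # Literal substrings to normalise; keys are lowercase and still contain parentheses
    song_dict = {'é': 'e', 'à' : 'a', ' (like a version)' : '', ' (flume remix)' : '', ' (go baby)' : '', ' (okokok)' : ''}
    # Lowercase first so the lowercase keys in song_dict can match
    series = series.str.lower()
    # Apply song_dict before stripping punctuation, otherwise the parenthesised suffixes never match
    for old, new in song_dict.items():
        series = series.str.replace(old, new, regex=False)
    # regex=False treats characters such as "(" and "*" literally
    for char in spec_chars:
        series = series.str.replace(char, '', regex=False)
    return series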
  

In [None]:
df_final = df_final.copy()  # df_final is a slice of the concatenated frame; copy it before assigning to avoid SettingWithCopyWarning
df_final['song'] = clean_input(df_final['song'])
df_final['song'].head()

1           heat waves
2         booster seat
3       the difference
4               cherub
5    lost in yesterday
Name: song, dtype: object

In [None]:
df_final

| | song | artist | year |
|---|---|---|---|
| 1 | heat waves | Glass Animals | 2020 |
| 2 | booster seat | Spacey Jane | 2020 |
| 3 | the difference | Flume and Toro y Moi | 2020 |
| 4 | cherub | Ball Park Music | 2020 |
| 5 | lost in yesterday | Tame Impala | 2020 |
| ... | ... | ... | ... |
| 196 | almost maybes | Jordan Davis | 2021 |
| 197 | back to the streets | Saweetie Featuring Jhene Aiko | 2021 |
| 198 | bad boy | Juice WRLD & Young Thug | 2021 |
| 199 | opp stoppa | YBN Nahmir Featuring 21 Savage | 2021 |
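
For the lookup in Step 1 to find a match, the user's input presumably needs the same normalisation as the stored titles. This is a minimal sketch for a single string, assuming the same punctuation list and accent mapping as `clean_input`; `clean_title` is a hypothetical helper, and it does not handle the remix or version suffixes.

```python
# Same punctuation as the spec_chars list in clean_input
SPEC_CHARS = '!"#%()*,-./:;<=>?@[\\]^_`{|}~–'
ACCENTS = {"é": "e", "à": "a"}

# One translation table: accents are mapped, punctuation is deleted
_TABLE = str.maketrans({**ACCENTS, **{ch: "" for ch in SPEC_CHARS}})

def clean_title(text):
    """Lowercase a title, map accented letters, and strip punctuation."""
    return text.lower().translate(_TABLE).strip()

print(clean_title("Heat Waves!"))  # heat waves
print(clean_title("34+35"))        # 34+35
```

Since `+` and `&` are not in the punctuation list, titles such as `34+35` pass through unchanged.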


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df_final.to_csv('/content/drive/MyDrive/unit7/day3/lab-unsupervised-learning-intro/prototype/dataset/hot_songs.csv', index=False)