## Web Scraping Lab 1:


### Prepare your project:

#### Business goal:

Make sure you've understood the big picture of your project: the goal of the company (Gnod), their current product (Gnoosic), their strategy, and how your project fits into this context. Re-read the business case and the e-mail from the CTO, take a look at the flowchart, and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Scraping popular songs:

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song will have to be similar to the input song, but the CTO thinks that if the song is currently on the top charts, the user will prefer a recommendation that is also currently popular.
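
The rule described above can be sketched in a few lines. This is only a minimal illustration under assumptions: `recommend`, `hot_songs`, and the `similar_song` fallback are hypothetical names that do not come from the business case, and the matching is a plain case-insensitive comparison.

```python
import random

def recommend(song, hot_songs, similar_song=None, rng=random):
    """Return another hot song if `song` is hot, else the fallback.

    `hot_songs` is a list of titles; `similar_song` is a hypothetical
    callable standing in for the similarity-based recommender.
    """
    normalized = song.strip().lower()
    hot_lookup = {title.lower(): title for title in hot_songs}
    if normalized in hot_lookup:
        # The song is on the chart: suggest a different hot song.
        others = [t for t in hot_songs if t.lower() != normalized]
        return rng.choice(others) if others else None
    # Not on the chart: defer to the similarity-based recommender, if any.
    return similar_song(song) if similar_song else None


# Illustrative titles only; the fallback lambda is a placeholder.
hot = ["Butter", "Good 4 U", "Levitating"]
print(recommend("butter", hot))
print(recommend("Heaven", hot, similar_song=lambda s: "Livin' On A Prayer"))
```

Passing the fallback as a callable keeps the "hot song" rule independent of whatever similarity model is chosen later.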

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: https://www.billboard.com/charts/hot-100.

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import random

# 2. find url and store it in a variable
url = "https://www.billboard.com/charts/hot-100"

# 3. download html with a get request
response = requests.get(url)

In [2]:
#check response status code 
response.status_code

200

In [3]:
#parse and store the contents of the url call
soup=BeautifulSoup(response.content, 'html.parser')

In [4]:
#prettify the soup 
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10 
print(soup.prettify())

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
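
The rate-limit message above appears because the whole page is sent to the notebook at once. One workaround, sketched here under the assumption that a short preview is enough for inspecting the markup, is to print only the first characters of the prettified string. `preview` and its `limit` parameter are hypothetical helpers, not part of BeautifulSoup.

```python
def preview(text, limit=2000):
    """Return at most `limit` characters of `text`, marking any truncation."""
    if len(text) <= limit:
        return text
    return text[:limit] + "\n... [truncated, {} more characters]".format(len(text) - limit)


# Stand-in for soup.prettify(); in the notebook you would call
# print(preview(soup.prettify())) instead.
sample_html = "<html>\n" + " <p>row</p>\n" * 500 + "</html>"
print(preview(sample_html, limit=60))
```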



In [5]:
#Miguel's solution: using CSS

songs = soup.select('button > span.chart-element__information > span.chart-element__information__song.text--truncate.color--primary')

In [6]:
titles=[]
artists=[]

for i in soup.find_all('span', {'class':'chart-element__information__song'}):
    titles.append(i.get_text())

for i in soup.find_all('span', {'class':'chart-element__information__artist'}):
    artists.append(i.get_text())

In [7]:
len(titles)

100

In [8]:
len(artists)

100

In [9]:
top_100_hot_songs=pd.DataFrame({'Title':titles, 'Artist':artists})

In [10]:
top_100_hot_songs

| | Title | Artist |
|---|---|---|
| 0 | Butter | BTS |
| 1 | Good 4 U | Olivia Rodrigo |
| 2 | Levitating | Dua Lipa Featuring DaBaby |
| 3 | Kiss Me More | Doja Cat Featuring SZA |
| 4 | Montero (Call Me By Your Name) | Lil Nas X |
| ... | ... | ... |
| 95 | All I Know So Far | P!nk |
| 96 | What's Next | Drake |
| 97 | Enough For You | Olivia Rodrigo |
| 98 | Juggernaut | Tyler, The Creator Featuring Lil Uzi Vert & Ph... |


### Bonus

Can you find other websites with lists of "hot" songs? What about songs that were popular in a certain decade?

You can scrape more lists and add extra features to the project.

In [11]:
# create another column with the decade (e.g. 2010) and then append rows from other scraped charts,
# with entries such as "top 2001", for example. That's one way.

# another option: record which charts each song has appeared in?

# we can't use "current" for the present-day songs, because then the column becomes an object (string) column.
# If these charts are released every month, we can use the release date instead.
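
One way to follow the first idea in these notes is to tag every scraped row with the chart it came from before combining the lists. The sketch below uses plain Python and a few sample rows copied from the outputs in this notebook; `tag_rows` and the `chart` column name are assumptions made for illustration.

```python
def tag_rows(titles, artists, chart):
    """Pair titles with artists and label each row with its source chart."""
    return [
        {"title": title, "artist": artist, "chart": chart}
        for title, artist in zip(titles, artists)
    ]


# Small hand-copied samples standing in for the scraped lists.
current = tag_rows(["Butter", "Good 4 U"], ["BTS", "Olivia Rodrigo"], "hot_100")
eighties = tag_rows(["Love Shack", "Heaven"], ["B-52's", "Bryan Adams"], "top_1980s")

all_rows = current + eighties
for row in all_rows:
    print(row)
# pd.DataFrame(all_rows) would turn this into a single dataframe.
```

Using a fixed label per chart, rather than the word "current", leaves room to swap in a release date later.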

In [12]:

# 2. find url and store it in a variable
url = "http://www.discjockey.org/top-100-songs-of-the-1980s/"

# 3. download html with a get request
response = requests.get(url)

In [13]:
response.status_code

200

In [14]:
#parse and store the contents of the url call
soup2=BeautifulSoup(response.content, 'html.parser')

In [15]:
#prettify the soup
print(soup2.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <title>
   Top 100 Songs of the 1980s - Chicago DJ Fourth Estate Audio
  </title>
  <meta content="Top 100 songs of the 1980s - Chicago DJ Fourth Estate Audio" name="description"/>
  <meta content="Chicago DJ Fourth Estate Audio Presents the Top Songs of the 1980s" name="keywords"/>
  Chicago DJ,Chicago Wedding DJ,Fourth Estate Audio,Top Songs of the 1980s
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="p0oIwbkp79V1fHXzfr6krKOff3OLHM1I3VUKcHEQx4c" name="google-site-verification"/>
  <script src="/site/template/assets/functions.js">
  </script>
  <script src="/site/template/assets/rollover.js">
  </script>
  <script src="/site/epage/assets/newwindow.js">
  </script>
  <link href="/site/template/assets/home_template_fourthestate_895/css/template0.css?v=2" id="templatecss" media="screen" rel="stylesheet" type="text/css"/>
  <link href="/site/template/assets/home_template_fourthestate_895/css/layou

In [16]:
top_80s_songs = []
top_80s_artists = []

for q in soup2.select('td:nth-child(2)'):
    top_80s_songs.append(q.get_text())

for q in soup2.select("td:nth-child(3)"):
    top_80s_artists.append(q.get_text())

In [17]:
len(top_80s_songs)

100

In [18]:
len(top_80s_artists)

99

In [19]:
# we remove the last song as it has no artist
# pop() deletes and returns the last element of a list
top_80s_songs.pop()

'Strokin'

In [20]:
len(top_80s_songs)

99
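
Dropping the last title works only if the missing artist really belongs to the last row. A more defensive sketch, assuming each song sits in its own `<tr>` with the title in the second cell and the artist in the third (which is what the `td:nth-child` selectors above imply), reads both cells from the same row so they cannot drift out of step. The HTML below is a hand-written stand-in, not the real page, and `rows_to_pairs` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the real page layout is an assumption.
sample_html = """
<table>
  <tr><td>1</td><td>Love Shack</td><td>B-52's</td></tr>
  <tr><td>2</td><td>Heaven</td><td>Bryan Adams</td></tr>
  <tr><td>3</td><td>Strokin'</td></tr>
</table>
"""

def rows_to_pairs(soup):
    """Return (title, artist) tuples, skipping rows without an artist cell."""
    pairs = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # no artist cell: skip the row rather than misalign the lists
        pairs.append((cells[1].get_text(strip=True), cells[2].get_text(strip=True)))
    return pairs

pairs = rows_to_pairs(BeautifulSoup(sample_html, "html.parser"))
print(pairs)
```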

In [21]:
hits_80s=pd.DataFrame({"title_80":top_80s_songs, "artist_80":top_80s_artists})

In [22]:
hits_80s

| | title_80 | artist_80 |
|---|---|---|
| 0 | Don't Stop Believin' | Journey |
| 1 | You Shook Me All Night Long | AC/DC |
| 2 | Love Shack | B-52's |
| 3 | Livin' On A Prayer | Bon Jovi |
| 4 | Pour Some Sugar On Me | Def Leppard |
| ... | ... | ... |
| 94 | Tainted Love | Soft Cell |
| 95 | Heaven | Bryan Adams |
| 96 | Here And Now | Luther Vandross |
| 97 | Nothin' But A Good Time | Poison |


## Creating the recommender

In [23]:
import random

In [24]:
#top_100_songs['Title'].values

In [25]:
#def song_choice():
#    song = input('Which song do you like? ')
#    if song in top_100_songs['Title'].values:
#        print('Oh, nice one! We recommend you listen to:', random.choice(top_100_songs['Title']))
#    else:
#        print('This song is not hot, try another song so we can give you a recommendation')

In [26]:
# song_choice()

In [27]:

def billboard_scraper():
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    import random
    # 2. find url and store it in a variable
    url = "https://www.billboard.com/charts/hot-100"
    # 3. download html with a get request
    response = requests.get(url)
    soup=BeautifulSoup(response.content, 'html.parser')
    titles=[]
    artists=[]
    for i in soup.find_all('span', {'class':'chart-element__information__song'}):
        titles.append(i.get_text())
    for i in soup.find_all('span', {'class':'chart-element__information__artist'}):
        artists.append(i.get_text())
    return pd.DataFrame({'song':titles, 'artist':artists})

In [28]:
# Create the prototype
def hits_80s_scraper():

    # 2. find url and store it in avariable
    url = "http://www.discjockey.org/top-100-songs-of-the-1980s/"

    # 3. download html with a get request
    response = requests.get(url)
    soup=BeautifulSoup(response.content, 'html.parser')
    
    top_80s_songs = []
    top_80s_artists = []

    for q in soup2.select('td:nth-child(2)'):
        top_80s_songs.append(q.get_text())

    for q in soup2.select("td:nth-child(3)"):
        top_80s_artists.append(q.get_text())
    
    top_80s_songs.pop()
    
    return pd.DataFrame({"title_80":top_80s_songs, "artist_80":top_80s_artists})

In [29]:
from random import randint

In [30]:
def hot_recommender():
    # billboard_scraper() is defined in cell In [27] above, so no import is needed here
    billboard = billboard_scraper()

    song = input("What song do you like? ")

    # case-insensitive, literal (non-regex) substring match on the song title
    song_row = billboard[billboard["song"].str.contains(song, case=False, regex=False)]
    if len(song_row) == 0:
        print("That song is not in the list")
    else:
        check_song = input("did you mean " + song_row["song"].values[0] + " by " + song_row["artist"].values[0] + "? ")

        if check_song == 'yes':
            print("That's a hot song")
            random_song = randint(0, len(billboard)-1)
            print("You might also like " + billboard["song"][random_song] + " by " + billboard["artist"][random_song])
        else:
            print("Ah, not the one I had in mind")

hot_recommender()

What song do you like? Butter
did you mean Butter by BTS? yes
That's a hot song
You might also like transparentsoul by Willow Featuring Travis Barker


## 80s from Spotify

In [31]:
import getpass

In [32]:
client_id=str(getpass.getpass('client_id?'))
client_secret = str(getpass.getpass('client_secret?'))

client_id?········
client_secret?········


In [33]:
import spotipy # install if needed
from spotipy.oauth2 import SpotifyClientCredentials

In [34]:
#Initialize Spotipy with user credentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=client_id,
    client_secret=client_secret))

In [35]:
# https://open.spotify.com/playlist/3acAVwCcB38UtE24t92tex?si=e39ecf230a2f4177



In [36]:
playlist = sp.user_playlist_tracks("spotify", "3acAVwCcB38UtE24t92tex")
# this function only returns up to 100 songs!


In [37]:
# to get all the results, I do:

def get_playlist_tracks(user_id, playlist_id):
    results = sp.user_playlist_tracks(user_id, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
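
The loop above keeps requesting pages until the `next` field is empty. To see the pattern without calling the API, the sketch below runs the same logic against a tiny fake client; `FakeClient` and its two pages are invented for illustration and only mimic the `items`/`next` shape of the responses used above.

```python
class FakeClient:
    """Stand-in for the Spotipy client, serving two hard-coded pages."""

    def __init__(self):
        self._pages = {
            "page1": {"items": ["a", "b"], "next": "page2"},
            "page2": {"items": ["c"], "next": None},
        }

    def first_page(self):
        return self._pages["page1"]

    def next(self, result):
        # Mirrors sp.next(result): follow the 'next' pointer, or return None.
        return self._pages[result["next"]] if result["next"] else None


def collect_all_items(client):
    """Follow 'next' links until exhausted, accumulating every item."""
    results = client.first_page()
    items = list(results["items"])  # copy so the first page is not mutated
    while results["next"]:
        results = client.next(results)
        items.extend(results["items"])
    return items


print(collect_all_items(FakeClient()))
```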

In [38]:
# 2- call function on playlist
full_playlist= get_playlist_tracks('spotify', '3acAVwCcB38UtE24t92tex')

In [39]:
full_playlist

[{'added_at': '2017-01-27T15:49:54Z',
  'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/tomdavis19'},
   'href': 'https://api.spotify.com/v1/users/tomdavis19',
   'id': 'tomdavis19',
   'type': 'user',
   'uri': 'spotify:user:tomdavis19'},
  'is_local': False,
  'primary_color': None,
  'track': {'album': {'album_type': 'album',
    'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0lZoBs4Pzo7R89JM9lxwoT'},
      'href': 'https://api.spotify.com/v1/artists/0lZoBs4Pzo7R89JM9lxwoT',
      'id': '0lZoBs4Pzo7R89JM9lxwoT',
      'name': 'Duran Duran',
      'type': 'artist',
      'uri': 'spotify:artist:0lZoBs4Pzo7R89JM9lxwoT'}],
    'available_markets': ['AD',
     'AE',
     'AG',
     'AL',
     'AM',
     'AO',
     'AR',
     'AT',
     'AU',
     'AZ',
     'BA',
     'BB',
     'BD',
     'BE',
     'BF',
     'BG',
     'BH',
     'BI',
     'BJ',
     'BN',
     'BO',
     'BR',
     'BS',
     'BW',
     'BY',
     'BZ',
     'CA',

In [40]:
len(full_playlist)

151

In [41]:
full_playlist[0].keys()
#info for each track

dict_keys(['added_at', 'added_by', 'is_local', 'primary_color', 'track', 'video_thumbnail'])

In [42]:
full_playlist[0]['track'].keys()
# = playlist['items'][0]['track'].keys()

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'episode', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track', 'track_number', 'type', 'uri'])

In [43]:
# Extract the song IDs:
track_id=[]
for item in full_playlist:
    track_id.append(item['track']['id'])
    
# Extract just the ID of each track's first artist:

artist_id=[]
for item in full_playlist:
    artist_id.append(item['track']['artists'][0]['id'])

In [44]:
len(track_id)

151

In [46]:
len(artist_id)

151

In [66]:
#audio features is not working because we have more than 100 tracks
#so we will do it later once we get the final list of songs and do a for loop to apply this
#function every 100 songs

#audio_feat=sp.audio_features(tracks=track_id)
#import pandas as pd 
#audio_feat_df=pd.DataFrame(sp.audio_features(tracks=track_id))
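
One way to implement the batching described in these comments is to split the ID list into chunks of at most 100 and call the endpoint once per chunk. In the sketch below, `chunked` and `fetch_in_batches` are hypothetical helpers, and `fetch` is any callable that accepts a list of IDs; in the notebook it would presumably be `sp.audio_features`. A dummy `fetch` is used so the example runs without credentials.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(ids, fetch, size=100):
    """Call `fetch` once per chunk and concatenate the results."""
    results = []
    for batch in chunked(ids, size):
        results.extend(fetch(batch))
    return results


# Dummy data: 151 fake IDs, matching the playlist length above.
fake_ids = ["id{}".format(n) for n in range(151)]
batch_sizes = [len(batch) for batch in chunked(fake_ids)]
features = fetch_in_batches(fake_ids, lambda batch: [{"id": i} for i in batch])
print(batch_sizes, len(features))
```

The concatenated list can then be handed to `pd.DataFrame(...)` in the same way as the commented-out line above.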

In [67]:
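def get_albums_from_artist(artist_id):
    # return the albums (album_type='album') of a single artist
    return sp.artist_albums(artist_id, album_type='album')

In [68]:
# query only the first artist in the list (Duran Duran)
get_albums_from_artist(artist_id[0])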

{'href': 'https://api.spotify.com/v1/artists/0lZoBs4Pzo7R89JM9lxwoT/albums?offset=0&limit=20&include_groups=album',
 'items': [{'album_group': 'album',
   'album_type': 'album',
   'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/0lZoBs4Pzo7R89JM9lxwoT'},
     'href': 'https://api.spotify.com/v1/artists/0lZoBs4Pzo7R89JM9lxwoT',
     'id': '0lZoBs4Pzo7R89JM9lxwoT',
     'name': 'Duran Duran',
     'type': 'artist',
     'uri': 'spotify:artist:0lZoBs4Pzo7R89JM9lxwoT'}],
   'available_markets': ['AD',
    'AE',
    'AG',
    'AL',
    'AM',
    'AO',
    'AR',
    'AT',
    'AU',
    'AZ',
    'BA',
    'BB',
    'BD',
    'BE',
    'BF',
    'BG',
    'BH',
    'BI',
    'BJ',
    'BN',
    'BO',
    'BR',
    'BS',
    'BT',
    'BW',
    'BY',
    'BZ',
    'CA',
    'CH',
    'CI',
    'CL',
    'CM',
    'CO',
    'CR',
    'CV',
    'CW',
    'CY',
    'CZ',
    'DE',
    'DJ',
    'DK',
    'DM',
    'DO',
    'DZ',
    'EC',
    'EE',
    'EG',
    'ES',


In [None]:
#playlist

In [None]:
#playlist['total']

In [None]:
#playlist.keys()

In [None]:
#playlist['items'][0]['track'].keys()

# if the playlist has more than 100 songs, this doesn't return them all, because the limit is 100

In [None]:
#playlist['items'][0]['track']['album']

In [None]:
# Extract just the song titles:

playlist_80s_titles=[]
for item in playlist['items']:
    playlist_80s_titles.append(item['track']['name'])

In [None]:
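playlist_80s_artists = [item['track']['artists'][0]['name'] for item in playlist['items']]
playlist_80s_artists_id = [item['track']['artists'][0]['id'] for item in playlist['items']]
d = list(zip(playlist_80s_titles,playlist_80s_artists, playlist_80s_artists_id))
titles_artists_df = pd.DataFrame(d, columns=['title','artist', 'artist_id'])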

In [None]:
pd.concat([titles_artists_df,audio_feat_df], axis=1)