# Web Scraping Lab

In [1]:
'''Dear xxxxxxxx,

We are thrilled to welcome you as a Data Analyst for Gnoosic!

As you know, we are trying to come up with ways to enhance our music recommendations. 
One of the new features we'd like to research is to recommend songs (not only bands). 
We're also aware of the limitations of our collaborative filtering algorithms, 
and would like to give users two new possibilities when searching for recommendations:

- Songs that are actually similar to the ones they picked from an acoustic point of view.
- Songs that are popular around the world right now, independently from their tastes.

Coming up with the perfect song recommender will take us months - no need to stress out too much. 
In this first week, we want you to explore new data sources for songs. 
The internet is full of information and our first step is to acquire it do an initial exploration. 
Feel free to use APIs or directly scrape the web to collect as much information as possible from popular songs. 
Eventually, we'll need to collect data from millions of songs, but we can start with a few hundreds or thousands 
from each source and see if the collected features are useful. 

Once the data is collected, we want you to create clusters of songs that are similar to each other. 
The idea is that if a user inputs a song from one group, we'll prioritize giving them recommendations 
of songs from that same group.

On Friday, you will present your work to me and Marek, the CEO and founder. 
Full disclosure: I need you to be very convincing about this whole song-recommender, 
as this has been my personal push and the main reason we hired you for!

Be open minded about this process: we are agile, and that means that we define our products and features 
on-the-go, while exploring the tools and the data that's available to us. We'd love you to provide your 
own vision of the product and the next steps to be taken.

Lots of luck and strength for this first week with us!

Jane
'''

"Dear xxxxxxxx,\n\nWe are thrilled to welcome you as a Data Analyst for Gnoosic!\n\nAs you know, we are trying to come up with ways to enhance our music recommendations. \nOne of the new features we'd like to research is to recommend songs (not only bands). \nWe're also aware of the limitations of our collaborative filtering algorithms, \nand would like to give users two new possibilities when searching for recommendations:\n\n- Songs that are actually similar to the ones they picked from an acoustic point of view.\n- Songs that are popular around the world right now, independently from their tastes.\n\nComing up with the perfect song recommender will take us months - no need to stress out too much. \nIn this first week, we want you to explore new data sources for songs. \nThe internet is full of information and our first step is to acquire it do an initial exploration. \nFeel free to use APIs or directly scrape the web to collect as much information as possible from popular songs. \nE

## Importing the Libraries

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from tqdm.notebook import tqdm

## Storing and reading the webpage

In [3]:
# find url and store it in a variable
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [4]:
# download html with a get request
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [5]:
# parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")

In [6]:
# The artist list
artist = []

for i in soup.select('.artist'):
    artist.append(i.get_text())

In [7]:
# The song list
title = []

for i in soup.select('.title'):
    title.append(i.get_text())

In [8]:
# Merging the lists into one DataFrame
song1 = pd.DataFrame({"artist":artist,
                       "title":title
                      })
song1

Unnamed: 0,artist,title
0,Sam Smith & Kim Petras,Unholy
1,Transformation Worship,Eagle (feat. KB)
2,David Guetta & Bebe Rexha,I'm Good (Blue)
3,Fleetwood Mac,Everywhere
4,HARDY & Lainey Wilson,wait in the truck
...,...,...
95,Auli'i Cravalho,How Far I'll Go
96,"Bette Midler, Sarah Jessica Parker & Kathy Najimy",One Way or Another (Hocus Pocus 2 Version)
97,Blasterjaxx & Timmy Trumpet,Narco
98,Thomas Rhett,Half Of Me (feat. Riley Green)


In [9]:
import random
var = input("Sing a song for me:")
random_name = random.choice(artist_song['title'])

if var in song1['title'].values:
    print("Pump up the volume and let us dance to", random_name)
else:
    print("When the music is over ...")

Sing a song for me:Unholy
Pump up the volume and let us dance to Lose Yourself


In [11]:
# save it as csv
#top100.to_csv('top100.csv', index = False, header = True)

## LAB2

## GETTING MORE SONGS 

In [12]:
# and again

# find url and store it in a variable
url = "https://playback.fm/charts/rock/1980"

In [13]:
# download html with a get request
response = requests.get(url)
response.status_code 

200

In [14]:
# parse html (create the soup)
soup1 = BeautifulSoup(response.content, 'html.parser')

# prettifying the soup 
#soup2.prettify

In [15]:
iterations = range(1970, 1980, 1)
#[i for i in iterations]
for i in iterations:
    start_at= str(i)
    url1 = 'https://playback.fm/charts/rock' + start_at
    print(url1)

https://playback.fm/charts/rock1970
https://playback.fm/charts/rock1971
https://playback.fm/charts/rock1972
https://playback.fm/charts/rock1973
https://playback.fm/charts/rock1974
https://playback.fm/charts/rock1975
https://playback.fm/charts/rock1976
https://playback.fm/charts/rock1977
https://playback.fm/charts/rock1978
https://playback.fm/charts/rock1979


In [16]:
import random
from time import sleep
from random import randint

In [17]:
pages = []

for i in iterations:
    # assemble the url:
    start_at= str(i)
    url1 = 'https://playback.fm/charts/rock' + start_at

    # download html with a get request:
    response = requests.get(url1)
    response = requests.get(url, headers = {"Accept-Language": "en-US"})

    # monitor the process by printing the status code
    print("Status code: " + str(response.status_code))

    # store response into "pages" list
    pages.append(response)

    # respectful nap:
    wait_time = randint(1,4)
    print("Sing me to sleep ... " + str(wait_time) + " please wait.")
    sleep(wait_time)

Status code: 200
Sing me to sleep ... 3 please wait.
Status code: 200
Sing me to sleep ... 3 please wait.
Status code: 200
Sing me to sleep ... 3 please wait.
Status code: 200
Sing me to sleep ... 1 please wait.
Status code: 200
Sing me to sleep ... 1 please wait.
Status code: 200
Sing me to sleep ... 1 please wait.
Status code: 200
Sing me to sleep ... 4 please wait.
Status code: 200
Sing me to sleep ... 2 please wait.
Status code: 200
Sing me to sleep ... 3 please wait.
Status code: 200
Sing me to sleep ... 2 please wait.


In [21]:
# The artist list 2
artist1 = []

for i in range(len(pages)):
    parsed = BeautifulSoup(pages[i].content, "html.parser")
    artist_html = parsed.select('td:nth-child(2) > a')
    for i in soup1.select('td:nth-child(2) > a'):
        artist1.append(i.get_text())

In [22]:
# The song list 2
title1 = []

for i in range(len(pages)):
    parsed = BeautifulSoup(pages[i].content, "html.parser")
    title_html = parsed.select('td.mobile-hide > a > span.song')
    for i in soup1.select('td.mobile-hide > a > span.song'):
        title1.append(i.get_text())

In [23]:
song2 = pd.DataFrame({"artist":artist1,
                       "title":title1
                      })
song2

Unnamed: 0,artist,title
0,\nREO Speedwagon\n,Keep On Loving You
1,\nThe Police\n,Don't Stand So Close to Me
2,\nPink Floyd\n,Another Brick in the Wall
3,\nThe J. Geils Band\n,Love Stinks
4,"\nLipps, Inc\n",Funkytown
...,...,...
995,\nThe Police\n,So Lonely
996,\nSupertramp\n,Dreamer
997,\nGenesis\n,Misunderstanding
998,\nGenesis\n,Turn It On Again


In [24]:
# Some Make Up - removing the '\n' from the artists
song2['artist'] = song2['artist'].str.replace('\n','')
song2

Unnamed: 0,artist,title
0,REO Speedwagon,Keep On Loving You
1,The Police,Don't Stand So Close to Me
2,Pink Floyd,Another Brick in the Wall
3,The J. Geils Band,Love Stinks
4,"Lipps, Inc",Funkytown
...,...,...
995,The Police,So Lonely
996,Supertramp,Dreamer
997,Genesis,Misunderstanding
998,Genesis,Turn It On Again
