# Lab | Web Scraping Single Page (GNOD part 1)

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO.

#### Instructions - Scraping popular songs

Your product will take a song as an input from the user and will output another song (the recommendation). In most cases, the recommended song will have to be similar to the inputted song, but the CTO thinks that if the song is on the top charts at the moment, the user will also enjoy a recommendation of another song that is popular at the moment.
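
The rule in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the product's actual logic: the function name `recommend_if_hot`, the list-of-dicts layout, and the placeholder songs are all assumptions made for the example.

```python
import random


def recommend_if_hot(song_title, hot_songs, rng=random):
    """Return a different hot song if `song_title` is on the chart, else None."""
    wanted = song_title.strip().casefold()
    on_chart = any(s["title"].strip().casefold() == wanted for s in hot_songs)
    if not on_chart:
        return None  # caller falls back to a similarity-based recommendation
    others = [s for s in hot_songs if s["title"].strip().casefold() != wanted]
    return rng.choice(others) if others else None


# Placeholder data (hypothetical); in the lab this would come from the scraped chart.
hot_songs = [
    {"artist": "Artist A", "title": "Song One"},
    {"artist": "Artist B", "title": "Song Two"},
    {"artist": "Artist C", "title": "Song Three"},
]

print(recommend_if_hot("song one", hot_songs))
print(recommend_if_hot("Not On The Chart", hot_songs))
```

Matching is case-insensitive and ignores surrounding whitespace, since user input rarely matches the scraped title exactly.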

You have to find data on the internet about currently popular songs. Popvortex maintains a weekly Top 100 of "hot" songs here: [http://www.popvortex.com/music/charts/top-100-songs.php](http://www.popvortex.com/music/charts/top-100-songs.php).

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# 2. find url and store it in a variable
url = "https://www.popvortex.com/music/charts/top-100-songs.php"

In [3]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [4]:
tunes = BeautifulSoup(response.content, "html.parser")
# tunes

In [5]:
# Still doesn't look great, so having another crack at it here
# print(tunes.prettify())

In [6]:
# Looks much better so can move on. Need to find the elements I want to recover

#chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p
#chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p > cite
#chart-position-1 > div.chart-content.col-xs-12.col-sm-8 > p > em

In [7]:
# Here are all the song titles 
# tunes.select("div.chart-content.col-xs-12.col-sm-8 > p > cite")

# Here are all the artists 
tunes.select("div.chart-content.col-xs-12.col-sm-8 > p > em")

[<em class="artist">Kendrick Lamar</em>,
 <em class="artist">Shaboozey</em>,
 <em class="artist">Kendrick Lamar</em>,
 <em class="artist">Randy Travis</em>,
 <em class="artist">Kendrick Lamar</em>,
 <em class="artist">Teddy Swims</em>,
 <em class="artist">Jimmie Allen</em>,
 <em class="artist">BTS</em>,
 <em class="artist">Drake</em>,
 <em class="artist">Benson Boone</em>,
 <em class="artist">Shaboozey</em>,
 <em class="artist">Hozier</em>,
 <em class="artist">Tommy Richman</em>,
 <em class="artist">Marshmello &amp; Kane Brown</em>,
 <em class="artist">Taylor Swift</em>,
 <em class="artist">Sabrina Carpenter</em>,
 <em class="artist">Black Oxygen</em>,
 <em class="artist">Future, Metro Boomin &amp; Kendrick Lamar</em>,
 <em class="artist">Dua Lipa</em>,
 <em class="artist">Dua Lipa</em>,
 <em class="artist">The Red Clay Strays</em>,
 <em class="artist">Drake</em>,
 <em class="artist">Tim McGraw</em>,
 <em class="artist">Keith Urban &amp; Lainey Wilson</em>,
 <em class="artist">Gloc23</

In [8]:
# tunes.select("div.chart-content.col-xs-12.col-sm-8 > p > cite")

title = []
for cite in tunes.select("div.chart-content.col-xs-12.col-sm-8 > p > cite"):
    # The <cite> text is the song title on its own, so there is nothing to split;
    # splitting on "-" would cut off any title that contains a hyphen.
    title.append(cite.get_text(strip=True))
print(title)

['Not Like Us', 'A Bar Song (Tipsy)', 'meet the grahams', 'Where That Came From', 'euphoria', 'Lose Control', 'G.R.I.T.S', 'Not Today', 'Family Matters', 'Beautiful Things', 'A Bar Song (Tipsy)', 'Too Sweet', 'MILLION DOLLAR BABY', 'Miles on It', 'Fortnight (feat. Post Malone)', 'Espresso', 'Hollywood Nights', 'Like That', 'Illusion', 'Training Season', 'Wondering Why', 'Push Ups', 'Live Like You Were Dying', 'GO HOME W U', 'Kendrick Lamar 6:16 In LA Drake Diss', 'Save Me (with Lainey Wilson)', 'The Door', 'Where the Wild Things Are', 'Halfway To Hell', 'Lobby', 'Lil Boo Thang', 'Unwritten', 'i like the way you kiss me', 'The Sound of Silence (CYRIL Remix)', 'I Can Do It With a Broken Heart', 'Need a Favor', 'White Horse', 'Austin', 'Cowgirls (feat. ERNEST)', 'Save Me', 'Wildflowers and Wild Horses (Single Version)', 'Last Night', 'A Country Boy Can Survive', 'Austin', 'GOOD DAY', 'Praise (feat. Brandon Lake, Chris Brown & Chandler Moore)', 'Beautiful Things', 'Lovin On Me', 'Forever a

In [9]:
artist = []
for em in tunes.select("div.chart-content.col-xs-12.col-sm-8 > p > em"):
    # The <em class="artist"> text is the artist name on its own;
    # splitting on "-" would cut off hyphenated artist names.
    artist.append(em.get_text(strip=True))
print(artist)

['Kendrick Lamar', 'Shaboozey', 'Kendrick Lamar', 'Randy Travis', 'Kendrick Lamar', 'Teddy Swims', 'Jimmie Allen', 'BTS', 'Drake', 'Benson Boone', 'Shaboozey', 'Hozier', 'Tommy Richman', 'Marshmello & Kane Brown', 'Taylor Swift', 'Sabrina Carpenter', 'Black Oxygen', 'Future, Metro Boomin & Kendrick Lamar', 'Dua Lipa', 'Dua Lipa', 'The Red Clay Strays', 'Drake', 'Tim McGraw', 'Keith Urban & Lainey Wilson', 'Gloc23', 'Jelly Roll', 'Teddy Swims', 'Luke Combs', 'Jelly Roll', 'SMITH', 'Paul Russell', 'Natasha Bedingfield', 'Artemas', 'Disturbed', 'Taylor Swift', 'Jelly Roll', 'Chris Stapleton', 'Dasha', 'Morgan Wallen', 'Jelly Roll', 'Lainey Wilson', 'Morgan Wallen', 'Hank Williams, Jr.', 'Dasha', 'Forrest Frank', 'Elevation Worship', 'Benson Boone', 'Jack Harlow', 'Randy Travis', 'Luke Combs', 'Beyoncé', 'August Moon', 'HARDY', 'Cody Johnson', 'Zach Bryan', 'Dua Lipa', 'Noah Kahan', 'Nate Smith', 'Tucker Wetmore', 'Benson Boone', 'Beyoncé & Miley Cyrus', 'Ariana Grande', 'Taylor Swift', 'f

In [10]:
top100 = pd.DataFrame({"artist":artist,
                           "title":title
                          })
top100

                         artist                                   title
0                Kendrick Lamar                             Not Like Us
1                     Shaboozey                      A Bar Song (Tipsy)
2                Kendrick Lamar                        meet the grahams
3                  Randy Travis                    Where That Came From
4                Kendrick Lamar                                euphoria
..                          ...                                     ...
95                          Djo                        End of Beginning
96                 Tyler Braden                          Devil You Know
97                 Jason Aldean                Let Your Boys Be Country
98  Seph Schlueter & Matt Maher  Counting My Blessings (Collab Version)
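
Building `title` and `artist` as two separate lists only works while both selectors return exactly the same number of elements. A slightly more defensive sketch reads both fields from the same `<p>` element, so a title and its artist cannot drift apart. The function name `parse_chart` and the `SAMPLE_HTML` snippet are assumptions for illustration: the snippet is a hand-written approximation of the chart markup, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the chart markup (an approximation, not the real page).
SAMPLE_HTML = """
<div id="chart-position-1">
  <div class="chart-content col-xs-12 col-sm-8">
    <p><cite>Not Like Us</cite> <em class="artist">Kendrick Lamar</em></p>
  </div>
</div>
<div id="chart-position-2">
  <div class="chart-content col-xs-12 col-sm-8">
    <p><cite>A Bar Song (Tipsy)</cite> <em class="artist">Shaboozey</em></p>
  </div>
</div>
<div id="chart-position-3">
  <div class="chart-content col-xs-12 col-sm-8">
    <p><cite>Title without an artist tag</cite></p>
  </div>
</div>
"""


def parse_chart(html):
    """Return one {"artist", "title"} dict per chart entry that has both fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for entry in soup.select("div.chart-content > p"):
        cite = entry.select_one("cite")
        em = entry.select_one("em")
        if cite is None or em is None:
            continue  # skip incomplete entries instead of misaligning the columns
        rows.append({
            "artist": em.get_text(strip=True),
            "title": cite.get_text(strip=True),
        })
    return rows


rows = parse_chart(SAMPLE_HTML)
print(rows)
```

In the notebook, `pd.DataFrame(parse_chart(response.content))` would give a frame with the same two columns as `top100`.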


## Instructions Part 1

#### Prioritize the MVP (Minimum Viable Product)

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

In [11]:
# Here is my link: https://en.wikipedia.org/wiki/List_of_Billboard_number-one_rap_singles_of_the_2000s

# 2. find url and store it in a variable
url2 = "https://en.wikipedia.org/wiki/List_of_Billboard_number-one_rap_singles_of_the_2000s"

In [12]:
response = requests.get(url2)
response.status_code # 200 status code means OK!
raps = BeautifulSoup(response.content, "html.parser")
# raps
# The raw soup is hard to read, so print the prettified version instead
print(raps.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Billboard number-one rap singles of the 2000s - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-c

In [13]:
# Trying to find the correct path
raps.select("td > a")

[<a href="/wiki/Billboard_Year-End" title="Billboard Year-End"><i>Billboard</i> year-end</a>,
 <a href="/wiki/List_of_Billboard_number-one_rap_singles_of_the_1980s_and_1990s" title="List of Billboard number-one rap singles of the 1980s and 1990s">← 1990s</a>,
 <a href="#2000">2000</a>,
 <a href="#2001">2001</a>,
 <a href="#2002">2002</a>,
 <a href="#2003">2003</a>,
 <a href="#2004">2004</a>,
 <a href="#2005">2005</a>,
 <a href="#2006">2006</a>,
 <a href="#2007">2007</a>,
 <a href="#2008">2008</a>,
 <a href="#2009">2009</a>,
 <a class="mw-redirect" href="/wiki/List_of_Billboard_Hot_Rap_Songs_number-one_hits_of_the_2010s" title="List of Billboard Hot Rap Songs number-one hits of the 2010s">2010s →</a>,
 <a href="/wiki/Hot_Boyz_(song)" title="Hot Boyz (song)">Hot Boyz</a>,
 <a href="/wiki/Nas" title="Nas">Nas</a>,
 <a class="mw-redirect" href="/wiki/Eve_(entertainer)" title="Eve (entertainer)">Eve</a>,
 <a href="/wiki/Q-Tip_(musician)" title="Q-Tip (musician)">Q-Tip</a>,
 <a href="/wiki/Y

In [14]:
raps.select("th > a")

[<a href="/wiki/Whistle_While_You_Twurk" title="Whistle While You Twurk">Whistle While You Twurk</a>,
 <a href="/wiki/Wobble_Wobble" title="Wobble Wobble">Wobble Wobble</a>,
 <a href="/wiki/Country_Grammar_(Hot_Shit)" title="Country Grammar (Hot Shit)">Country Grammar (Hot Shit)</a>,
 <a href="/wiki/Flamboyant_(Big_L_song)" title="Flamboyant (Big L song)">Flamboyant</a>,
 <a href="/wiki/Callin%27_Me_(Lil%27_Zane_song)" title="Callin' Me (Lil' Zane song)">Callin' Me</a>,
 <a href="/wiki/Bounce_with_Me" title="Bounce with Me">Bounce with Me</a>,
 <a href="/wiki/Callin%27_Me_(Lil%27_Zane_song)" title="Callin' Me (Lil' Zane song)">Callin' Me</a>,
 <a href="/wiki/Bounce_with_Me" title="Bounce with Me">Bounce with Me</a>,
 <a class="mw-redirect" href="/wiki/Move_Somethin%27_(song)" title="Move Somethin' (song)">Move Somethin'</a>,
 <a class="mw-redirect" href="/wiki/Souljas" title="Souljas">Souljas</a>,
 <a href="/wiki/Baby_If_You%27re_Ready" title="Baby If You're Ready">Baby If You're Ready

In [15]:
titles = []  # plural name, so the Popvortex `title` list is left untouched

for link in raps.select("th > a"):
    titles.append(link.get_text())  # text inside the <a> tag

print(titles)

['Whistle While You Twurk', 'Wobble Wobble', 'Country Grammar (Hot Shit)', 'Flamboyant', "Callin' Me", 'Bounce with Me', "Callin' Me", 'Bounce with Me', "Move Somethin'", 'Souljas', "Baby If You're Ready", 'Ms. Jackson', "It Wasn't Me", "Bow Wow (That's My Name)", 'What Would You Do?', 'Purple Pills', 'My Projects', 'Raise Up', 'Lights, Camera, Action!', "Feels Good (Don't Worry Bout a Thing)", 'I Need a Girl (Part One)', 'Oh Boy', 'The ROC (Just Fire)', 'Dilemma', 'Work It', 'Air Force Ones', '21 Questions', 'Magic Stick', 'Right Thurr', 'P.I.M.P.', 'Get Low', 'Shake Ya Tailfeather', 'Get Low', 'Damn!', 'Stand Up', 'Slow Jamz', 'One Call Away', 'Tipsy', 'Overnight Celebrity', 'Slow Motion', 'Candy Shop', 'Hate It or Love It', 'Just a Lil Bit', 'Let Me Hold You', 'Like You', 'Gold Digger', 'Soul Survivor', 'I Think They Like Me', 'Grillz', 'Lean wit It, Rock wit It', 'What You Know', 'Shoulder Lean', "Pullin' Me Back", 'Money Maker', 'Shortie Like Mine', 'We Fly High', 'Runaway Love', 

In [16]:
len(titles)

85

In [17]:
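artists = []  # plural name, so the Popvortex `artist` list is left untouched

# Note: "td > a" also matches the legend and year-navigation links, some song
# titles, and every featured artist, so this list does not line up with `titles`.
for link in raps.select("td > a"):
    artists.append(link.get_text())  # text inside the <a> tag

print(artists)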

['Billboard year-end', '← 1990s', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010s →', 'Hot Boyz', 'Nas', 'Eve', 'Q-Tip', 'Ying Yang Twins', '504 Boyz', 'Nelly', 'Big L', 'Lil Zane', '112', "Lil' Bow Wow", 'Xscape', 'Lil Zane', '112', "Lil' Bow Wow", 'Xscape', 'Hi-Tek', 'Reflection Eternal', 'André 3000', 'Master P', "Doggy's Angels", 'LaToiya', 'Mos Def', 'Pharoahe Monch', 'Nate Dogg', "Doggy's Angels", 'LaToiya', 'Outkast', 'Shaggy', "Lil' Bow Wow", 'City High', 'My Baby', "Lil' Romeo", 'D12', 'Coo Coo Cal', 'Petey Pablo', 'Jonell', 'Method Man', 'Mr. Cheeks', 'Naughty by Nature', '3LW', 'No Good', 'P. Diddy', 'Usher', 'Loon', "Cam'ron", 'Beanie Sigel', 'Memphis Bleek', 'Hot in Herre', 'Nelly', 'Nelly', 'Nelly', 'Kyjuan', 'Ali', 'Murphy Lee', 'In da Club', '50 Cent', '50 Cent', 'Nate Dogg', "Lil' Kim", '50 Cent', 'Chingy', '50 Cent', 'Lil Jon', 'Ying Yang Twins', 'Nelly', 'P. Diddy', 'Murphy Lee', 'Lil Jon', 'Ying Yang Twins', 'YoungBloodz', 'Lil

In [18]:
raps.select("tr > td")

[<td style="background-color:#ffff99">†
 </td>,
 <td><a href="/wiki/Billboard_Year-End" title="Billboard Year-End"><i>Billboard</i> year-end</a> number-one single
 </td>,
 <td style="background-color:#ffff99">‡
 </td>,
 <td><i>Billboard</i> decade-end number-one single
 </td>,
 <td style="background-color:#f2f2f2">↑
 </td>,
 <td>Return of a single to number one
 </td>,
 <td align="center"><a href="/wiki/List_of_Billboard_number-one_rap_singles_of_the_1980s_and_1990s" title="List of Billboard number-one rap singles of the 1980s and 1990s">← 1990s</a> • <a href="#2000">2000</a> • <a href="#2001">2001</a> • <a href="#2002">2002</a> • <a href="#2003">2003</a> • <a href="#2004">2004</a> • <a href="#2005">2005</a> • <a href="#2006">2006</a> • <a href="#2007">2007</a> • <a href="#2008">2008</a> • <a href="#2009">2009</a> • <a class="mw-redirect" href="/wiki/List_of_Billboard_Hot_Rap_Songs_number-one_hits_of_the_2010s" title="List of Billboard Hot Rap Songs number-one hits of the 2010s">2010s 

In [30]:
# pip install html5lib


In [33]:
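# This won't work because the two lists have different lengths:
# len(titles) is 85, while "td > a" also collected legend, navigation and featured-artist links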


# toprap = pd.DataFrame({"artist":artists,
#                            "title":titles
#                           })
# toprap
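
The length mismatch comes from selecting titles and artists independently across the whole page. One way around it is to walk the table row by row and read both cells from the same `<tr>`. The sketch below runs on a simplified, hand-written table, so `SAMPLE_TABLE` and `parse_rows` are assumptions for illustration. It assumes each data row has a `<th scope="row">` for the single and that the first `<td>` holds the artist. The real page also uses `rowspan` and highlighted rows, which this does not handle.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia table (assumed layout, not the real markup).
SAMPLE_TABLE = """
<table class="wikitable">
  <tr><th>Single</th><th>Artist</th><th>Reached number one</th></tr>
  <tr>
    <th scope="row">"<a href="/wiki/Whistle_While_You_Twurk">Whistle While You Twurk</a>"</th>
    <td><a href="/wiki/Ying_Yang_Twins">Ying Yang Twins</a></td>
    <td>April 1, 2000</td>
  </tr>
  <tr>
    <th scope="row">"<a href="/wiki/Hot_Boyz_(song)">Hot Boyz</a>"</th>
    <td>Missy Elliott featuring <a href="/wiki/Nas">Nas</a>, <a href="/wiki/Eve_(entertainer)">Eve</a> and <a href="/wiki/Q-Tip_(musician)">Q-Tip</a></td>
    <td>November 27, 1999</td>
  </tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def parse_rows(html):
    """Return one {"title", "artist"} dict per data row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.wikitable tr"):
        header = tr.find("th", scope="row")
        cells = tr.find_all("td")
        if header is None or not cells:
            continue  # column-header rows have no row header or no data cells
        rows.append({
            "title": clean(header.get_text()).strip('"'),
            "artist": clean(cells[0].get_text()),
        })
    return rows


rap_rows = parse_rows(SAMPLE_TABLE)
print(rap_rows)
```

Taking the whole cell text keeps "featuring" credits together as one artist string, instead of splitting them into separate links.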

In [36]:
# This is not scraping with BeautifulSoup, but it works just as well for our purposes. Better, in fact:
# pd.read_html returns a list with one DataFrame per <table> found on the page.

import pandas as pd
wikiurl = 'https://en.wikipedia.org/wiki/List_of_Billboard_number-one_rap_singles_of_the_2000s'
tables = pd.read_html(wikiurl)
tables

[   0                                       1
 0  †    Billboard year-end number-one single
 1  ‡  Billboard decade-end number-one single
 2  ↑        Return of a single to number one,
                                             Contents
 0  ← 1990s • 2000 • 2001 • 2002 • 2003 • 2004 • 2...,
                           Single  \
 0               "Hot Boyz" †[15]   
 1      "Whistle While You Twurk"   
 2                "Wobble Wobble"   
 3   "Country Grammar (Hot Shit)"   
 4                   "Flamboyant"   
 ..                           ...   
 90               "Boom Boom Pow"   
 91       "Best I Ever Had" †[30]   
 92               "Run This Town"   
 93                     "Forever"   
 94        "Empire State of Mind"   
 
                                               Artist  Reached number one  \
 0         Missy Elliott featuring Nas, Eve and Q-Tip   November 27, 1999   
 1                                    Ying Yang Twins       April 1, 2000   
 2                           
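
The `Single` column still carries the surrounding quotes, the legend symbols (†, ‡, ↑) and footnote markers such as `[15]`. A small helper can strip them before the titles are used for matching. This is a sketch using the standard library only, and `clean_single` is a hypothetical name, not something from the notebook.

```python
import re


def clean_single(raw):
    """Strip footnote markers, legend symbols and surrounding quotes from a title."""
    text = re.sub(r"\[\d+\]", "", raw)   # footnote markers such as [15]
    text = re.sub(r"[†‡↑]", "", text)    # legend symbols used on the page
    return text.strip().strip('"').strip()


for raw in ['"Hot Boyz" †[15]', '"Country Grammar (Hot Shit)"', '"Best I Ever Had" †[30]']:
    print(clean_single(raw))
```

On a DataFrame this could be applied with `df["Single"].map(clean_single)`.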

In [39]:
combined_df = pd.concat(tables)

# Print the combined DataFrame
print(combined_df)

      0                                       1  \
0     †    Billboard year-end number-one single   
1     ‡  Billboard decade-end number-one single   
2     ↑        Return of a single to number one   
0   NaN                                     NaN   
0   NaN                                     NaN   
..  ...                                     ...   
0   NaN                                     NaN   
1   NaN                                     NaN   
2   NaN                                     NaN   
3   NaN                                     NaN   
0   NaN                                     NaN   

                                             Contents            Single  \
0                                                 NaN               NaN   
1                                                 NaN               NaN   
2                                                 NaN               NaN   
0   ← 1990s • 2000 • 2001 • 2002 • 2003 • 2004 • 2...               NaN   
0           

In [38]:
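# This raises KeyError: 0. pd.read_html already returns DataFrames with parsed headers,
# and table[0] looks up a column labelled 0, which tables such as "Contents" do not have.
dataframes = [pd.DataFrame(table[1:], columns=table[0]) for table in tables]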

# Concatenate the DataFrames vertically (along axis 0)
combined_df = pd.concat(dataframes, ignore_index=True)

# Print the combined DataFrame
print(combined_df)

KeyError: 0
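
Since `pd.read_html` already returns proper DataFrames, an alternative to rebuilding them is to keep only the tables that have the columns of interest and concatenate those. The sketch below uses small hand-built frames as stand-ins for the list returned by `pd.read_html`. The helper name `select_tables` and the way the rows are split across two frames are assumptions for illustration.

```python
import pandas as pd


def select_tables(tables, required=("Single", "Artist")):
    """Concatenate only the tables that contain every required column."""
    keep = [t for t in tables if set(required).issubset(t.columns)]
    if not keep:
        return pd.DataFrame(columns=list(required))
    return pd.concat(keep, ignore_index=True)


# Stand-ins for the frames returned by pd.read_html (hand-built, not fetched).
tables = [
    pd.DataFrame({0: ["†"], 1: ["Billboard year-end number-one single"]}),
    pd.DataFrame({"Contents": ["navigation row"]}),
    pd.DataFrame({"Single": ['"Hot Boyz" †[15]'],
                  "Artist": ["Missy Elliott featuring Nas, Eve and Q-Tip"]}),
    pd.DataFrame({"Single": ['"Whistle While You Twurk"'],
                  "Artist": ["Ying Yang Twins"]}),
]

combined = select_tables(tables)
print(combined)
```

Filtering first avoids the NaN-filled result that `pd.concat(tables)` produces when the legend and navigation tables are mixed in.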

In [29]:
# import pandas as pd
# import requests
# from bs4 import BeautifulSoup


# page = requests.get('https://en.wikipedia.org/wiki/United_States_presidential_election').text
# soup = BeautifulSoup(page, 'html.parser')
# table = soup.find('table', class_="wikitable sortable")

# df = pd.read_html(str(table))
# df = pd.concat(df)
# print(df)

     Year        Party Presidential candidate Vice presidential candidate  \
0    1788  Independent      George Washington                     None[a]   
1    1788   Federalist          John Adams[b]                     None[a]   
2    1788   Federalist               John Jay                     None[a]   
3    1788   Federalist     Robert H. Harrison                     None[a]   
4    1788   Federalist          John Rutledge                     None[a]   
..    ...          ...                    ...                         ...   
215  2016   Republican            John Kasich               Carly Fiorina   
216  2016   Democratic         Bernie Sanders            Elizabeth Warren   
217  2016   Democratic    Faith Spotted Eagle           Winona LaDuke (G)   
218  2020   Democratic              Joe Biden               Kamala Harris   
219  2020   Republican           Donald Trump                  Mike Pence   

    Popular vote       % Electoral votes Notes  
0          43782   100.0  



## Instructions Part 2 

Practice web scraping. This part is not related to this week's GNOD project.

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some small challenges you can try to gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's Top Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display the info for the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and their number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
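
As a starting point for the first challenge, collecting the links on a page could look like the sketch below. It runs on a short hand-written HTML string rather than the live page, and the name `extract_links` and the sample markup are assumptions for illustration. In a notebook, the HTML would come from `requests.get(url).content`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample markup (hypothetical), standing in for the downloaded page.
SAMPLE_PAGE = """
<p>
  <a href="/wiki/Python_(programming_language)">Python (programming language)</a>
  <a href="#top">Back to top</a>
  <a href="https://www.python.org/">python.org</a>
  <a name="anchor-only">No href here</a>
  <a href="/wiki/Python_(programming_language)">Same link again</a>
</p>
"""


def extract_links(html, base_url):
    """Return absolute URLs of all links, without in-page anchors or duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("#"):
            continue  # in-page anchor, not a separate page
        links.append(urljoin(base_url, href))
    return list(dict.fromkeys(links))  # drop duplicates, keep first-seen order


links = extract_links(SAMPLE_PAGE, "https://en.wikipedia.org/wiki/Python")
print(links)
```

`urljoin` turns relative paths such as `/wiki/...` into full URLs, which matters if the links are going to be requested later.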