# Lab | Web Scraping Single Page

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions - Scraping popular songs

Your product will take a song as input from the user and output another song (the recommendation). In most cases, the recommended song should be similar to the input song, but the CTO thinks that if the input song is currently on the top charts, the user will prefer a recommendation that is also popular right now.

You have to find data on the internet about currently popular songs. Billboard maintains a weekly Top 100 of "hot" songs here: [https://www.billboard.com/charts/hot-100](https://www.billboard.com/charts/hot-100).

It's a good place to start! Scrape the current top 100 songs and their respective artists, and put the information into a pandas dataframe.

# Case Study: The site for recommendations - "Gnod"

### Scenario

You have been hired as a Data Analyst for "Gnod".

"Gnod" is a site that provides recommendations for music, art, literature and products based on collaborative filtering algorithms. Their flagship product is the _music recommender_, which you can try at [www.gnoosic.com](www.gnoosic.com). The site asks users to input 3 bands they like, and computes similarity scores with the rest of the users. Then, they recommend to the user bands that users with similar tastes have picked.

"Gnod" is a small company, and its only revenue stream so far is ads on the site. In the future, they would like to explore partnership options with music apps (such as _Deezer_, _Soundcloud_ or even _Apple Music_ and _Spotify_). However, for that to be possible, they need to expand and improve their recommendations.

That's precisely where you come in. They have hired you as a Data Analyst, and they expect you to bring a mix of technical expertise and business mindset to the table.

Jane, CTO of Gnod, has sent you an email assigning you your first task.

### Task(s)

> This is an e-mail that Jane, CTO of Gnod, sent to your inbox during your first weeks working there.

_Dear xxxxxxxx,
We are thrilled to welcome you as a Data Analyst for *Gnoosic*!_

_As you know, we are trying to come up with ways to enhance our music recommendations. One of the new features we'd like to research is to recommend songs (not only bands). We're also aware of the limitations of our collaborative filtering algorithms, and would like to give users two new possibilities when searching for recommendations:_

- _Songs that are actually similar to the ones they picked from an acoustic point of view._
- _Songs that are popular around the world right now, independently from their tastes._

_Coming up with the perfect song recommender will take us months - no need to stress out too much. In this first week, we want you to explore new data sources for songs. The Internet is full of information and our first step is to acquire it and do an initial exploration. Feel free to use APIs or directly scrape the web to collect as much information as possible from popular songs. Eventually, we'll need to collect data from millions of songs, but we can start with a few hundred or a few thousand from each source and see if the collected features are useful._

_Once the data is collected, we want you to create clusters of songs that are similar to each other. The idea is that if a user inputs a song from one group, we'll prioritize giving them recommendations of songs from that same group._

_On Friday, you will present your work to me and Marek, the CEO and founder. Full disclosure: I need you to be very convincing about this whole song-recommender, as this has been my personal push and the main reason we hired you!_

_Be open-minded about this process: we are agile, and that means that we define our products and features on-the-go, while exploring the tools and the data that are available to us. We'd love you to provide your own vision of the product and the next steps to be taken._

_Lots of luck and strength for this first week with us!_

_-Jane_

## Loading libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from time import sleep


## Storing the URL

In [2]:
url = "https://www.billboard.com/charts/hot-100"

## Getting the html code of the web page

In [3]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200
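
A status code of 200 is the happy path; a request can also time out or come back with an error code. The sketch below wraps the request in a small helper that fails loudly in those cases. The name `fetch_html`, the `User-Agent` string, the 10-second timeout and the injectable `get` parameter are assumptions made for illustration and are not part of the lab.

```python
import requests

# Hypothetical identification string; replace it with one that describes your project.
HEADERS = {"User-Agent": "gnod-lab-scraper/0.1 (educational project)"}


def fetch_html(url, get=requests.get, timeout=10):
    """Download `url` and return its HTML as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `fetch_html("https://www.billboard.com/charts/hot-100")` would return the page source, which can then be passed to `BeautifulSoup(..., "html.parser")`.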

## Parsing the html code

In [4]:
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html class="" lang="">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
<title>The Hot 100 Chart | Billboard</title>
<meta content="The Hot 100 Chart" name="title" property="title">
<meta content="@billboard" name="twitter:site"/>
<meta content="Billboard" property="og:site_name">
<meta content="article" property="og:type">
<link href="/manifest.json" rel="manifest"/>
<style>
        .chart-pro-access {
            background-image: url('https://www.billboard.com/assets/1606143657/images/piano/chart-pro-access-mb.png?472a790e67f42f9b25d0');
        }

        @media (min-width: 769px) {
            .chart-pro-access {
                background-image: url('https://www.billboard.com/assets/1606143657/images/piano/chart-pro-access-dk.png?472a790e67f42f9b25d0');
            }
        }
    </style>
<script async="async" data-cfasync="false" src="ht

## Retrieving the desired info from the Soup.

In [5]:
#charts > div > div.chart-list__wrapper > div > ol
soup.select("ol")

[<ol class="chart-list__elements">
 <li class="chart-list__element display--flex">
 <button class="chart-element__wrapper display--flex flex--grow sort--default">
 <span class="chart-element__rank flex--column flex--xy-center flex--no-shrink">
 <span class="chart-element__rank__number">1</span>
 <span class="chart-element__trend chart-element__trend--steady color--secondary"><i class="fa fa-arrow-right"><span class="sr--only">Steady</span></i></span>
 </span>
 <span class="chart-element__information">
 <span class="chart-element__information__song text--truncate color--primary">Mood</span>
 <span class="chart-element__information__artist text--truncate color--secondary">24kGoldn Featuring iann dior</span>
 <span class="chart-element__information__delta color--secondary">
 <span class="chart-element__information__delta__text text--default">-</span>
 <span class="chart-element__information__delta__text text--last">1 Last Week</span>
 <span class="chart-element__information__delta__text t

In [6]:
song_name = soup.find_all("span", "chart-element__information__song text--truncate color--primary")
song_name

[<span class="chart-element__information__song text--truncate color--primary">Mood</span>,
 <span class="chart-element__information__song text--truncate color--primary">Positions</span>,
 <span class="chart-element__information__song text--truncate color--primary">I Hope</span>,
 <span class="chart-element__information__song text--truncate color--primary">Laugh Now Cry Later</span>,
 <span class="chart-element__information__song text--truncate color--primary">Blinding Lights</span>,
 <span class="chart-element__information__song text--truncate color--primary">Lemonade</span>,
 <span class="chart-element__information__song text--truncate color--primary">Holy</span>,
 <span class="chart-element__information__song text--truncate color--primary">Dakiti</span>,
 <span class="chart-element__information__song text--truncate color--primary">Savage Love (Laxed - Siren Beat)</span>,
 <span class="chart-element__information__song text--truncate color--primary">For The Night</span>,
 <span class="

In [7]:
artist_name = soup.find_all("span", "chart-element__information__artist text--truncate color--secondary")
artist_name

[<span class="chart-element__information__artist text--truncate color--secondary">24kGoldn Featuring iann dior</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Ariana Grande</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Gabby Barrett Featuring Charlie Puth</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Drake Featuring Lil Durk</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">The Weeknd</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Internet Money &amp; Gunna Featuring Don Toliver &amp; NAV</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Justin Bieber Featuring Chance The Rapper</span>,
 <span class="chart-element__information__artist text--truncate color--secondary">Bad Bunny &amp; Jhay Cortez</span>,
 <span class="chart-element__inform

In [8]:
artist_name[0].get_text()

'24kGoldn Featuring iann dior'

In [9]:
song_name[0].get_text()

'Mood'
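
Passing the full class string to `find_all` only matches elements whose `class` attribute is exactly that string, so the lookup stops working if the site adds or reorders one of the styling classes. A CSS selector on the single distinctive class is a somewhat more forgiving alternative. The HTML below is a shortened copy of the markup shown above, and `sample_html`, `sample_soup`, `sample_songs` and `sample_artists` are names introduced only for this sketch.

```python
from bs4 import BeautifulSoup

sample_html = """
<ol class="chart-list__elements">
  <li class="chart-list__element display--flex">
    <span class="chart-element__information__song text--truncate color--primary">Mood</span>
    <span class="chart-element__information__artist text--truncate color--secondary">24kGoldn Featuring iann dior</span>
  </li>
  <li class="chart-list__element display--flex">
    <span class="chart-element__information__song text--truncate color--primary">Positions</span>
    <span class="chart-element__information__artist text--truncate color--secondary">Ariana Grande</span>
  </li>
</ol>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match on the one class that identifies the element; the other classes are styling helpers.
sample_songs = [
    tag.get_text(strip=True)
    for tag in sample_soup.select("span.chart-element__information__song")
]
sample_artists = [
    tag.get_text(strip=True)
    for tag in sample_soup.select("span.chart-element__information__artist")
]

print(sample_songs)
print(sample_artists)
```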

## Create lists for artist and song

In [10]:
artist = []
song = []

In [11]:
num_iter = len(artist_name)

for i in range(num_iter):
    artist.append(artist_name[i].get_text())
    song.append(song_name[i].get_text())

In [12]:
print(artist)
print('---------------------------------')
print(song)

['24kGoldn Featuring iann dior', 'Ariana Grande', 'Gabby Barrett Featuring Charlie Puth', 'Drake Featuring Lil Durk', 'The Weeknd', 'Internet Money & Gunna Featuring Don Toliver & NAV', 'Justin Bieber Featuring Chance The Rapper', 'Bad Bunny & Jhay Cortez', 'Jawsh 685 x Jason Derulo', 'Pop Smoke Featuring Lil Baby & DaBaby', 'Cardi B Featuring Megan Thee Stallion', 'Maluma & The Weeknd', 'Chris Brown & Young Thug', 'DaBaby Featuring Roddy Ricch', 'Ava Max', 'Lewis Capaldi', 'BTS', 'Harry Styles', 'Morgan Wallen', 'Kane Brown With Swae Lee & Khalid', 'Justin Bieber & benny blanco', 'Moneybagg Yo', 'surf mesa Featuring Emilee', 'AJR', 'Jack Harlow Featuring DaBaby, Tory Lanez & Lil Wayne', 'Dua Lipa Featuring DaBaby', 'Luke Combs', 'Pop Smoke', 'Lee Brice', 'Ariana Grande', 'Russell Dickerson', 'Shawn Mendes', 'HARDY Featuring Lauren Alaina & Devin Dawson', 'Jason Aldean', 'Mike WiLL Made-It, Nicki Minaj & YoungBoy Never Broke Again', 'Blake Shelton Featuring Gwen Stefani', 'Parker McCol
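
Indexing both result lists with one counter, as the loop above does, assumes that they have the same length and the same order. A small guard makes that assumption explicit and fails early if the selectors drift apart. `pair_up` is a hypothetical helper, and the sample values are taken from the output above.

```python
def pair_up(songs, artists):
    """Return (song, artist) tuples, refusing to pair lists of different lengths."""
    if len(songs) != len(artists):
        raise ValueError(
            f"Got {len(songs)} songs but {len(artists)} artists; "
            "the selectors probably matched different elements."
        )
    return list(zip(songs, artists))


pairs = pair_up(
    ["Mood", "Positions"],
    ["24kGoldn Featuring iann dior", "Ariana Grande"],
)
print(pairs)
```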

## Constructing the dataframe

In [13]:
# each list becomes a column
songs = pd.DataFrame({"song":song,
                       "artist":artist,
                      })

songs.head()

|   | song | artist |
|---|------|--------|
| 0 | Mood | 24kGoldn Featuring iann dior |
| 1 | Positions | Ariana Grande |
| 2 | I Hope | Gabby Barrett Featuring Charlie Puth |
| 3 | Laugh Now Cry Later | Drake Featuring Lil Durk |
| 4 | Blinding Lights | The Weeknd |
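
The CTO's idea is that a user who enters a song that is currently on the charts should get another chart song back. One way this lookup could work on a dataframe with `song` and `artist` columns is sketched below. The helpers `is_hot` and `recommend_other_hot`, the case-insensitive matching and the `seed` parameter are assumptions for illustration, not a prescribed design.

```python
import pandas as pd


def is_hot(title, hot_songs):
    """True if `title` appears in the `song` column, ignoring case and outer spaces."""
    wanted = title.strip().lower()
    return bool((hot_songs["song"].str.strip().str.lower() == wanted).any())


def recommend_other_hot(title, hot_songs, seed=None):
    """Pick one random chart row whose song differs from `title`."""
    wanted = title.strip().lower()
    others = hot_songs[hot_songs["song"].str.strip().str.lower() != wanted]
    return others.sample(n=1, random_state=seed).iloc[0]


# Three rows copied from the dataframe above, just to exercise the helpers.
sample_chart = pd.DataFrame({
    "song": ["Mood", "Positions", "I Hope"],
    "artist": [
        "24kGoldn Featuring iann dior",
        "Ariana Grande",
        "Gabby Barrett Featuring Charlie Puth",
    ],
})

if is_hot("mood", sample_chart):
    pick = recommend_other_hot("mood", sample_chart, seed=0)
    print(pick["song"], "-", pick["artist"])
```

Exact string matching is fragile against typos and alternative spellings, so a real product would likely need a more tolerant comparison.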


# Lab | Web Scraping Multiple Pages


#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url ='https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and number of related articles in the order they appear in [wikipedia.org](wikipedia.org): `url = 'https://www.wikipedia.org/'`
- Build a list of the different kinds of datasets available on [data.gov.uk](data.gov.uk): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`
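
For the first challenge in the list above, collecting links comes down to selecting every `<a>` element that carries an `href` attribute. The sketch below runs on a tiny inline HTML string rather than the live page, so its behaviour is easy to check. `extract_links` and `sample_page` are hypothetical names, and the sample markup is made up.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> element that has one, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" skips anchors without a link target, which would raise KeyError on a["href"].
    return [a["href"] for a in soup.select("a[href]")]


sample_page = """
<p>
  <a href="/wiki/Python_(programming_language)">Python (programming language)</a>
  <a href="https://www.python.org/">python.org</a>
  <a name="no-href">anchor without a link</a>
</p>
"""

links_found = extract_links(sample_page)
print(links_found)
```

On the real page, the same function could be applied to `requests.get(url).content`.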


### a. Scraping top 89 songs of 2019 from The Current

In [14]:
url_tc = "https://www.thecurrent.org/countdown/top89-2019"

In [15]:
response = requests.get(url_tc)
response.status_code

200

In [16]:
soup_tc = BeautifulSoup(response.content, "html.parser")
soup_tc

<!DOCTYPE html>

<html class="svg">
<head>
<script type="text/javascript">window.NREUM||(NREUM={});NREUM.info={"beacon":"bam-cell.nr-data.net","errorBeacon":"bam-cell.nr-data.net","licenseKey":"4feae813d0","applicationID":"59426475","transactionName":"dVpcTRRZClhVEU1aWkNbRl0JQQgbQwsNTg==","queueTime":0,"applicationTime":271,"agent":""}</script>
<script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"4feae813d0",applicationID:"59426475"};window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var i=t[n]={exports:{}};e[n][0].call(i.exports,function(t){var i=e[n][1][t];return r(i||t)},i,i.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(e,t,n){function r(){}function i(e,t,n){return function(){return o(e,[u.now()].concat(c(arguments)),t?null:this,n),t?void 0:this}}var o=e("handle"),a=e(6),c=e(7),f=e("ee").get("tracer"),u=e("loader"),s=NREUM;"undefi

In [17]:
#pjaxReplaceTarget > div > section > ol
soup_tc.select("ol")

[<ol data-default-img="/assets/album-cover-default-32217dc68a771f3a44aa2b7a640cf91133b61bd1f2ae68c9ddb00055e9a8ac1d.png">
 </ol>,
 <ol class="subNav" id="streamsSub">
 <li>
 <h5><a href="/local">Local Current</a></h5>
 <p>
           Minnesota Music<br/>
 <a class="listen player-open" href="/listen/local" target="player">Listen Live <span></span></a> <a class="blog" href="http://blog.thecurrent.org">Blog ›</a>
 </p>
 </li>
 <li><h5><a href="/rock-the-cradle">Rock The Cradle</a></h5>
 <p>
           Music for kids and their adults<br/>
 <a class="listen player-open" href="/listen/rock-the-cradle" target="player">Listen Live <span></span></a>
 </p>
 </li>
 <li><h5><a href="/heartland">Radio Heartland</a></h5>
 <p>
           Acoustic, Americana and Roots<br/>
 <a class="listen player-open" href="/listen/heartland" target="player">Listen Live <span></span></a>
 </p>
 </li>
 <li><h5><a href="/purple-current">Purple Current</a></h5>
 <p>
           Exploring the musical legacy of Prince and

In [18]:
#pjaxReplaceTarget > div > section > ol > li:nth-child(1) > div.countdown-info > h3

In [19]:
song_name_tc = soup_tc.select("li > div > h3")
song_name_tc

[<h3 class="countdown-title">
           "Juice"
         </h3>,
 <h3 class="countdown-title">
           "bad guy"
         </h3>,
 <h3 class="countdown-title">
           "Truth Hurts"
         </h3>,
 <h3 class="countdown-title">
           "Saying Goodbye"
         </h3>,
 <h3 class="countdown-title">
           "Seventeen"
         </h3>,
 <h3 class="countdown-title">
           "Not"
         </h3>,
 <h3 class="countdown-title">
           "Hey, Ma"
         </h3>,
 <h3 class="countdown-title">
           "Cuz I Love You"
         </h3>,
 <h3 class="countdown-title">
           "Harmony Hall"
         </h3>,
 <h3 class="countdown-title">
           "Like A Girl"
         </h3>,
 <h3 class="countdown-title">
           "This Land"
         </h3>,
 <h3 class="countdown-title">
           "Red Bull &amp; Hennessy"
         </h3>,
 <h3 class="countdown-title">
           "I Get No Joy"
         </h3>,
 <h3 class="countdown-title">
           "Dylan Thomas"
         </h3>,
 <h3 class=

In [20]:
artist_name_tc = soup_tc.select("li > div > div.countdown-artist")
artist_name_tc

[<div class="countdown-artist">
           Lizzo
         </div>,
 <div class="countdown-artist">
           Billie Eilish
         </div>,
 <div class="countdown-artist">
           Lizzo
         </div>,
 <div class="countdown-artist">
           J.S. Ondara
         </div>,
 <div class="countdown-artist">
           Sharon Van Etten
         </div>,
 <div class="countdown-artist">
           Big Thief
         </div>,
 <div class="countdown-artist">
           Bon Iver
         </div>,
 <div class="countdown-artist">
           Lizzo
         </div>,
 <div class="countdown-artist">
           Vampire Weekend
         </div>,
 <div class="countdown-artist">
           Lizzo
         </div>,
 <div class="countdown-artist">
           Gary Clark Jr.
         </div>,
 <div class="countdown-artist">
           Jenny Lewis
         </div>,
 <div class="countdown-artist">
           Jade Bird
         </div>,
 <div class="countdown-artist">
           Better Oblivion Community Center
     

In [21]:
artist_name_tc[0].get_text("", strip = True)

'Lizzo'

In [22]:
artist_tc = []
song_tc = []

In [23]:
num_iter_tc = len(artist_name_tc)

for i in range(num_iter_tc):
    artist_tc.append(artist_name_tc[i].get_text("", strip = True))
    song_tc.append(song_name_tc[i].get_text("", strip = True))

In [24]:
artist_tc

['Lizzo',
 'Billie Eilish',
 'Lizzo',
 'J.S. Ondara',
 'Sharon Van Etten',
 'Big Thief',
 'Bon Iver',
 'Lizzo',
 'Vampire Weekend',
 'Lizzo',
 'Gary Clark Jr.',
 'Jenny Lewis',
 'Jade Bird',
 'Better Oblivion Community Center',
 'Billie Eilish',
 'Dessa & the Minnesota Orchestra',
 'The Highwomen',
 'Vampire Weekend',
 'Jenny Lewis',
 'Lizzo',
 'Brittany Howard',
 'The National',
 'Lana Del Rey',
 'Wilco',
 'Lana Del Rey',
 'The Highwomen',
 'Cloud Cult',
 'Andrew Bird',
 'Sturgill Simpson',
 'Maggie Rogers',
 'The Black Keys',
 'Tame Impala',
 'Dessa',
 'The Cactus Blossoms',
 'J.S. Ondara',
 'Hozier',
 'The Raconteurs',
 'Jenny Lewis',
 'Angel Olsen',
 'Chance the Rapper',
 'Clairo',
 'Joseph',
 'Beck',
 'Frank Turner',
 'DJ Shadow',
 'Gary Clark Jr.',
 'Aldous Harding',
 'Tame Impala',
 'Michael Kiwanuka',
 'Billie Eilish',
 'Of Monsters And Men',
 'Fontaines D.C.',
 'The New Pornographers',
 'Nick Cave & the Bad Seeds',
 'Local Natives',
 'Bad Bad Hats',
 'Caamp',
 'Sharon Van Ette

In [25]:
song_tc

['"Juice"',
 '"bad guy"',
 '"Truth Hurts"',
 '"Saying Goodbye"',
 '"Seventeen"',
 '"Not"',
 '"Hey, Ma"',
 '"Cuz I Love You"',
 '"Harmony Hall"',
 '"Like A Girl"',
 '"This Land"',
 '"Red Bull & Hennessy"',
 '"I Get No Joy"',
 '"Dylan Thomas"',
 '"bury a friend"',
 '"Velodrome - Live"',
 '"Redesigning Women"',
 '"This Life"',
 '"Heads Gonna Roll"',
 '"Tempo (feat. Missy Elliott)"',
 '"Stay High"',
 '"Rylan"',
 '"Doin\' Time"',
 '"Love Is Everywhere (Beware)"',
 '"The greatest"',
 '"Highwomen"',
 '"Best Time of My Life"',
 '"Sisyphus"',
 '"Sing Along"',
 '"Burning"',
 '"Lo/Hi"',
 '"Borderline"',
 '"Good For You"',
 '"Please Don\'t Call Me Crazy"',
 '"Lebanon"',
 '"Almost (Sweet Music)"',
 '"Now That You\'re Gone"',
 '"Wasted Youth"',
 '"All Mirrors"',
 '"Do You Remember (feat. Death Cab for Cutie)"',
 '"Bags"',
 '"Fighter"',
 '"Saw Lightning"',
 '"Sister Rosetta"',
 '"Rocket Fuel (feat. De La Soul)"',
 '"Pearl Cadillac"',
 '"The Barrel"',
 '"Patience"',
 '"Hero"',
 '"when the party\'s ove

In [26]:
the_current_top_89 = pd.DataFrame({"song":song_tc,
                       "artist":artist_tc,
                      })

the_current_top_89.head()

|   | song | artist |
|---|------|--------|
| 0 | "Juice" | Lizzo |
| 1 | "bad guy" | Billie Eilish |
| 2 | "Truth Hurts" | Lizzo |
| 3 | "Saying Goodbye" | J.S. Ondara |
| 4 | "Seventeen" | Sharon Van Etten |
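
The titles scraped from this page are wrapped in literal double quotes, which would get in the way when matching them against titles from other sources. One option is to strip them before building the dataframe. `clean_title` is a hypothetical helper; it only removes straight double quotes at the start and end, so quotes or apostrophes inside a title are kept.

```python
def clean_title(raw_title):
    """Remove surrounding whitespace and straight double quotes from a scraped title."""
    return raw_title.strip().strip('"').strip()


raw_titles = ['"Juice"', '"bad guy"', '"Doin\' Time"']
cleaned_titles = [clean_title(title) for title in raw_titles]
print(cleaned_titles)
```

On the dataframe itself, `the_current_top_89["song"].str.strip('"')` should have the same effect.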


### b. Wikipedia Number One Songs

In [27]:
url3 = "https://en.wikipedia.org/wiki/Lists_of_number-one_songs"

In [28]:
response_wnos = requests.get(url3)
response_wnos.status_code # 200 status code means OK!

200

In [29]:
soup_wnos = BeautifulSoup(response_wnos.content, "html.parser")
soup_wnos

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Lists of number-one songs - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"5081a96d-bb88-4a01-afdc-efce5a36ee2c","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Lists_of_number-one_songs","wgTitle":"Lists of number-one songs","wgCurRevisionId":981048750,"wgRevisionId":981048750,"wgArticleId":25970693,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Lists of music lists","Lists of number-one songs"]

In [30]:
#mw-content-text > div.mw-parser-output > ul:nth-child(3)

In [31]:
links = soup_wnos.select("#mw-content-text > div.mw-parser-output > ul:nth-child(3) > li > a")

# Checking the output
for link in links:
    print(link)

<a href="/wiki/List_of_Billboard_number-one_singles" title="List of Billboard number-one singles">List of Billboard number-one singles</a>
<a href="/wiki/Lists_of_UK_Singles_Chart_number_ones" title="Lists of UK Singles Chart number ones">Lists of UK Singles Chart number ones</a>
<a href="/wiki/List_of_number-one_singles_in_Australia" title="List of number-one singles in Australia">List of number-one singles in Australia</a>
<a href="/wiki/List_of_number-one_singles_in_Canada" title="List of number-one singles in Canada">List of number-one singles in Canada</a>
<a href="/wiki/List_of_number-one_singles_in_France" title="List of number-one singles in France">List of number-one singles in France</a>
<a href="/wiki/List_of_number-one_hits_(Germany)" title="List of number-one hits (Germany)">List of number-one hits (Germany)</a>
<a href="/wiki/List_of_songs_that_reached_number_one_on_the_Irish_Singles_Chart" title="List of songs that reached number one on the Irish Singles Chart">List of s

In [32]:
urls = []

base = "https://en.wikipedia.org"

for link in links:
    urls.append(base + link['href'])

urls

['https://en.wikipedia.org/wiki/List_of_Billboard_number-one_singles',
 'https://en.wikipedia.org/wiki/Lists_of_UK_Singles_Chart_number_ones',
 'https://en.wikipedia.org/wiki/List_of_number-one_singles_in_Australia',
 'https://en.wikipedia.org/wiki/List_of_number-one_singles_in_Canada',
 'https://en.wikipedia.org/wiki/List_of_number-one_singles_in_France',
 'https://en.wikipedia.org/wiki/List_of_number-one_hits_(Germany)',
 'https://en.wikipedia.org/wiki/List_of_songs_that_reached_number_one_on_the_Irish_Singles_Chart',
 'https://en.wikipedia.org/wiki/List_of_number-one_hits_(Italy)',
 'https://en.wikipedia.org/wiki/List_of_number-one_songs_in_Norway',
 'https://en.wikipedia.org/wiki/List_of_number-one_hits_(Spain)',
 'https://en.wikipedia.org/wiki/European_Hot_100_Singles',
 'https://en.wikipedia.org/wiki/Recorded_Music_NZ',
 'https://en.wikipedia.org/wiki/List_of_number-one_singles_in_Switzerland',
 'https://en.wikipedia.org/wiki/List_of_number-one_singles_in_the_Dutch_Top_40',
 'htt
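
String concatenation works here because the `href` values shown above are site-relative paths. If a page mixes relative and absolute links, `urllib.parse.urljoin` from the standard library handles both. The sample `hrefs` list below reuses one path from the output above and adds one absolute URL for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

hrefs = [
    "/wiki/List_of_Billboard_number-one_singles",       # site-relative, as on the page above
    "https://en.wikipedia.org/wiki/Recorded_Music_NZ",  # already absolute (added for illustration)
]

# urljoin resolves relative paths against `base` and leaves absolute URLs untouched.
absolute_urls = [urljoin(base, href) for href in hrefs]
print(absolute_urls)
```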

In [33]:
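wnos_soups = []

# loop-local names keep `url`, `response` and `soup` from the Billboard cells intact
for list_url in urls:
    # send request
    list_response = requests.get(list_url)
    print(list_response.status_code)

    # parse the html and store the first "wikitable" (find() returns None if there is none)
    list_soup = BeautifulSoup(list_response.content, "html.parser")
    wnos_soups.append(list_soup.find("table", {"class":"wikitable"}))

    # respectful nap: 1 to 3 seconds between requests
    wait_time = np.random.randint(1,4)
    print("I will sleep for " + str(wait_time) + " second/s.")
    sleep(wait_time)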

200
I will sleep for 1 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 3 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 3 second/s.
200
I will sleep for 3 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 3 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 3 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 1 second/s.
200
I will sleep for 2 second/s.
200
I will sleep for 1 second/s.
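
Each entry of `wnos_soups` is either a `<table>` element or `None` (when a page has no table with the class `wikitable`). A possible next step, sketched under that assumption, is to turn a table into rows of cell text while tolerating missing tables. `table_to_rows` and the inline sample table with its placeholder values are hypothetical. Real Wikipedia tables often use `rowspan` and `colspan`, which this sketch does not handle.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return a list of rows, each a list of cell texts; a missing table gives []."""
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up table with placeholder values, only to show the shape of the result.
sample_table_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Song</th></tr>
  <tr><td>2019</td><td>Example Song A</td></tr>
  <tr><td>2020</td><td>Example Song B</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_table_html, "html.parser").find(
    "table", {"class": "wikitable"}
)
sample_rows = table_to_rows(sample_table)
print(sample_rows)
print(table_to_rows(None))
```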


### c. Top 10 languages spoken by number of speakers

In [34]:
url_t10 = "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers"

In [35]:
response_t10 = requests.get(url_t10)
response_t10.status_code

200

In [36]:
soup_t10 = BeautifulSoup(response_t10.content, "html.parser")
soup_t10

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"cae953a6-eceb-4dc4-9b90-d5f04acb6d46","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_languages_by_number_of_native_speakers","wgTitle":"List of languages by number of native speakers","wgCurRevisionId":985620308,"wgRevisionId":985620308,"wgArticleId":405385,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected pages","Articles with short desc

In [37]:
#mw-content-text > div.mw-parser-output > table:nth-child(18) > tbody

table = soup_t10.select("tbody")
table

[<tbody><tr>
 <th>Rank
 </th>
 <th>Language
 </th>
 <th>Speakers<br/><small>(millions)</small>
 </th>
 <th>% of World pop.<br/><small>(March 2019)<sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[8]</a></sup></small>
 </th>
 <th>Language family
 </th>
 <th>Branch
 </th></tr>
 <tr>
 <td>1
 </td>
 <td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>
 </td>
 <td>918
 </td>
 <td>11.922
 </td>
 <td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
 </td>
 <td><a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>
 </td></tr>
 <tr>
 <td>2
 </td>
 <td><a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>
 </td>
 <td>480
 </td>
 <td>5.994
 </td>
 <td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
 </td>
 <td><a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
 </td></tr>
 <tr>
 <td>3
 </td>
 <td><a href="/wiki/Engl

In [38]:
#mw-content-text > div.mw-parser-output > table:nth-child(18) > tbody > tr:nth-child(1) > td:nth-child(2) > a

languages = soup_t10.select("tr > td > a")
languages

[<a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>,
 <a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>,
 <a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>,
 <a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Romance_languages" title="Romance languages">Romance</a>,
 <a href="/wiki/English_language" title="English language">English</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Germanic_languages" title="Germanic languages">Germanic</a>,
 <a href="/wiki/Hindi" title="Hindi">Hindi</a>,
 <a href="/wiki/Hindustani_language" title="Hindustani language">Hindustani</a>,
 <a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>,
 <a href="/wiki/Indo-Aryan_languages" title="Indo-

In [39]:
languages[0].get_text()

'Mandarin Chinese'

In [43]:
len(languages)

384

In [40]:
#mw-content-text > div.mw-parser-output > table:nth-child(18) > tbody > tr:nth-child(1) > td:nth-child(3)

number_speakers = soup_t10.select("tr > td:nth-child(3)")
number_speakers

[<td>918
 </td>,
 <td>480
 </td>,
 <td>379
 </td>,
 <td>341
 </td>,
 <td>228
 </td>,
 <td>221
 </td>,
 <td>154
 </td>,
 <td>128
 </td>,
 <td>92.7
 </td>,
 <td>83.1
 </td>,
 <td>82.0
 </td>,
 <td>81.4
 </td>,
 <td>79.4
 </td>,
 <td>77.3
 </td>,
 <td>77.2
 </td>,
 <td>76.1
 </td>,
 <td>76.0
 </td>,
 <td>75.0
 </td>,
 <td>73.1
 </td>,
 <td>68.6
 </td>,
 <td>68.3
 </td>,
 <td>64.8
 </td>,
 <td>64.6
 </td>,
 <td>56.4
 </td>,
 <td>52.8
 </td>,
 <td>52.2
 </td>,
 <td>50.1
 </td>,
 <td>48.2
 </td>,
 <td>46.9
 </td>,
 <td>43.9
 </td>,
 <td>43.6
 </td>,
 <td>43.4
 </td>,
 <td>39.7
 </td>,
 <td>37.8
 </td>,
 <td>37.3
 </td>,
 <td>37.1
 </td>,
 <td>34.5
 </td>,
 <td>33.9
 </td>,
 <td>32.9
 </td>,
 <td>32.6
 </td>,
 <td>32.4
 </td>,
 <td>31.9
 </td>,
 <td>29.4
 </td>,
 <td>27.5
 </td>,
 <td>27.3
 </td>,
 <td>27.0
 </td>,
 <td>25.1
 </td>,
 <td>24.6
 </td>,
 <td>24.6
 </td>,
 <td>24.3
 </td>,
 <td>23.6
 </td>,
 <td>23.1
 </td>,
 <td>22.4
 </td>,
 <td>22.1
 </td>,
 <td>21.9
 </td>,
 <td>20.9
 </td>,


In [41]:
number_speakers[0].get_text().replace("\n", "")

'918'

In [44]:
len(number_speakers)

192

In [42]:
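language = []
amount_speakers = []

# `languages` (384 links) also contains the family and branch links, so it is longer
# than `number_speakers` (192 cells). Reading both values from the same row keeps them
# aligned; this assumes the language is in the 2nd cell and the speakers in the 3rd.
for row in soup_t10.select("tr"):
    cells = row.find_all("td", recursive=False)
    if len(cells) >= 3:
        language.append(cells[1].get_text(strip=True))
        amount_speakers.append(cells[2].get_text(strip=True))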

In [None]:
language
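
The two selections above have different lengths (384 links versus 192 speaker cells) because `tr > td > a` also matches the links in the language-family and branch columns. Reading the table one row at a time keeps each language next to its own speaker count. The sketch below assumes the column layout shown in the output above (rank, language, speakers in millions) and runs on a two-row excerpt of that table. `rows_to_dataframe` and the column name `speakers_millions` are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup


def rows_to_dataframe(table, top_n=10):
    """Build a dataframe of language and speakers (millions) from the first `top_n` data rows."""
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:  # the header row holds <th> cells only
            continue
        records.append({
            "language": cells[1].get_text(strip=True),
            "speakers_millions": cells[2].get_text(strip=True),
        })
    frame = pd.DataFrame(records, columns=["language", "speakers_millions"])
    # Values that are not plain numbers (for example with footnote marks) become NaN.
    frame["speakers_millions"] = pd.to_numeric(frame["speakers_millions"], errors="coerce")
    return frame.head(top_n)


excerpt_html = """
<table>
<tr><th>Rank</th><th>Language</th><th>Speakers (millions)</th><th>Language family</th></tr>
<tr><td>1</td><td><a href="/wiki/Mandarin_Chinese">Mandarin Chinese</a></td><td>918
</td><td><a href="/wiki/Sino-Tibetan_languages">Sino-Tibetan</a></td></tr>
<tr><td>2</td><td><a href="/wiki/Spanish_language">Spanish</a></td><td>480
</td><td><a href="/wiki/Indo-European_languages">Indo-European</a></td></tr>
</table>
"""

excerpt_table = BeautifulSoup(excerpt_html, "html.parser").find("table")
top_languages = rows_to_dataframe(excerpt_table)
print(top_languages)
```

On the live page, the same function could be given the ranking table selected from `soup_t10`; which table that is depends on the page layout at the time of scraping.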