# Lab | Web Scraping Multiple Pages

## Business goal:
- Check the case_study_gnod.md file.

- Make sure you've understood the big picture of your project:

1. the goal of the company (Gnod),
2. their current product (Gnoosic),
3. their strategy, and
4. how your project fits into this context.

Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

## Instructions

- Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

- Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Joy_Division'

In [3]:
response = requests.get(url)

In [4]:
response.status_code

200

In [5]:
soup = BeautifulSoup(response.content,"html.parser")

In [136]:
soup.find_all("th", {"scope": "row"})[1].text.strip()

'"As You Said"'

In [140]:
soup.find_all("th", {"scope": "row"})[54].text.strip()

'"You\'re No Good for Me"[b]'

In [110]:
soup.find("div",{"id":"bodyContent"}).select("th:nth-child(1)")[3].get_text().strip()

'"As You Said"'

In [119]:
soup.find("div",{"id":"bodyContent"}).select("th:nth-child(1)")[56].get_text().strip()

'"You\'re No Good for Me"[b]'

In [70]:
soup.find("div",{"id":"mw-content-text"}).select("td:nth-child(3)")

[<td>1980
 </td>,
 <td>1978
 </td>,
 <td>1980
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1979
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1978
 </td>,
 <td>1979
 </td>,
 <td>1994
 </td>,
 <td>1980
 </td>,
 <td>1981
 </td>,
 <td>1977
 </td>,
 <td>1979
 </td>,
 <td>1978
 </td>,
 <td>1994
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1981
 </td>,
 <td>1997
 </td>,
 <td>1980
 </td>,
 <td>1994
 </td>,
 <td>1979
 </td>,
 <td>1979
 </td>,
 <td>1980
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1977
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1977
 </td>,
 <td>1979
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1980
 </td>,
 <td>1979
 </td>,
 <td>1981
 </td>,
 <td>1981
 </td>,
 <td>1981
 </td>,
 <td>1980
 </td>,
 <td>1994
 </td>,
 <td>1979
 </td>,
 <td>1980
 </td>,
 <td>1981
 </td>,
 <td>1977
 </td>,
 <td>1979
 </td>,
 <td>1994
 </td>]

In [72]:
soup.find("div",{"id":"mw-content-text"}).select("td:nth-child(3)")[53].get_text().strip()

'1994'

In [124]:
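songs = []
years = []

# Select each column once instead of re-querying the page on every iteration
title_cells = soup.find("div",{"id":"bodyContent"}).select("th:nth-child(1)")
year_cells = soup.find("div",{"id":"mw-content-text"}).select("td:nth-child(3)")

# Song titles sit at indices 3-56 of the header-cell matches; earlier matches are skipped
for i in range(3,57):
    canciones = title_cells[i].get_text().strip()
    songs.append(canciones)
# The 54 year cells are read in the same order as the titles
for i in range(54):
    años = year_cells[i].get_text().strip()
    years.append(años)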
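
The index-based loops above rely on the title list and the year list lining up. A minimal sketch of a row-by-row alternative follows, so each title is paired with the year from the same `<tr>`. The inline HTML is a hypothetical miniature of the table (the middle cell is a placeholder), not the real page markup, and `rows` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the song table: a row-scoped <th> holds the title
# and the third cell of each row holds the year.
html = """
<table>
  <tr><th scope="col">Song</th><th scope="col">Placeholder</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"As You Said"</th><td>placeholder</td><td>1980</td></tr>
  <tr><th scope="row">"At a Later Date" (live)</th><td>placeholder</td><td>1978</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("tr"):
    title_cell = tr.find("th", {"scope": "row"})
    year_cell = tr.select_one("td:nth-child(3)")
    if title_cell is None or year_cell is None:
        continue  # header rows have no row-scoped <th>, so they are skipped
    rows.append((title_cell.get_text().strip(), year_cell.get_text().strip()))

print(rows)
```

Because both values come from the same row, a missing or extra cell elsewhere on the page cannot shift one list against the other.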

In [125]:
songs

['"As You Said"',
 '"At a Later Date" (live)',
 '"Atmosphere"[a]',
 '"Atrocity Exhibition"',
 '"Auto-suggestion"',
 '"Candidate"',
 '"Ceremony" (live)',
 '"Colony"',
 '"Day of the Lords"',
 '"Dead Souls"',
 '"Decades"',
 '"Digital"',
 '"Disorder"',
 '"The Drawback"[b]',
 '"The Eternal"',
 '"Exercise One"',
 '"Failures"',
 '"From Safety to Where...?"',
 '"Glass"',
 '"Gutz"[b]',
 '"Heart and Soul"',
 '"I Remember Nothing"',
 '"Ice Age"',
 '"In a Lonely Place (Detail)" (demo)',
 '"Incubation"',
 '"Inside the Line"[b]',
 '"Insight"',
 '"Interzone"',
 '"Isolation"',
 '"The Kill"',
 '"Komakino"',
 '"Leaders of Men"',
 '"(Living in the) Ice Age"',
 '"Love Will Tear Us Apart"',
 '"A Means to an End"',
 '"New Dawn Fades"',
 '"No Love Lost"',
 '"Novelty"',
 '"The Only Mistake"',
 '"Passover"',
 '"Shadowplay"',
 '"She\'s Lost Control (12" version)"',
 '"She\'s Lost Control"',
 '"Sister Ray"[c] (live)(The Velvet Underground cover)',
 '"Something Must Break"',
 '"The Sound of Music"',
 '"These Days

In [126]:
years

['1980',
 '1978',
 '1980',
 '1980',
 '1979',
 '1979',
 '1981',
 '1980',
 '1979',
 '1981',
 '1980',
 '1978',
 '1979',
 '1994',
 '1980',
 '1981',
 '1977',
 '1979',
 '1978',
 '1994',
 '1980',
 '1979',
 '1981',
 '1997',
 '1980',
 '1994',
 '1979',
 '1979',
 '1980',
 '1981',
 '1980',
 '1977',
 '1981',
 '1980',
 '1980',
 '1979',
 '1977',
 '1979',
 '1981',
 '1980',
 '1979',
 '1980',
 '1979',
 '1981',
 '1981',
 '1981',
 '1980',
 '1994',
 '1979',
 '1980',
 '1981',
 '1977',
 '1979',
 '1994']

In [127]:
List_of_songs_by_Joy_Division = pd.DataFrame({'song':songs,'years':years})
List_of_songs_by_Joy_Division

| | song | years |
|---|---|---|
| 0 | "As You Said" | 1980 |
| 1 | "At a Later Date" (live) | 1978 |
| 2 | "Atmosphere"[a] | 1980 |
| 3 | "Atrocity Exhibition" | 1980 |
| 4 | "Auto-suggestion" | 1979 |
| 5 | "Candidate" | 1979 |
| 6 | "Ceremony" (live) | 1981 |
| 7 | "Colony" | 1980 |
| 8 | "Day of the Lords" | 1979 |
| 9 | "Dead Souls" | 1981 |
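
Several scraped titles still carry Wikipedia footnote markers such as `[a]` or `[b]`. A small sketch of one way to strip them with a regular expression is shown here; `clean_title` is a hypothetical helper, and the pattern assumes the markers are always a single lowercase letter in square brackets.

```python
import re

# Matches single-letter footnote markers such as [a], [b], [c].
FOOTNOTE = re.compile(r"\[[a-z]\]")

def clean_title(raw):
    """Remove footnote markers and collapse whitespace; the quotes are kept."""
    without_notes = FOOTNOTE.sub("", raw)
    return " ".join(without_notes.split())

scraped = [
    '"Atmosphere"[a]',
    '"The Drawback"[b]',
    '"Sister Ray"[c] (live)(The Velvet Underground cover)',
    '"Colony"',
]
cleaned = [clean_title(title) for title in scraped]
print(cleaned)
```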


## Practice web scraping
As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" (an arbitrary example) and create a list of the links on that page: url ='https://en.wikipedia.org/wiki/Python'

In [319]:
url ='https://en.wikipedia.org/wiki/Python'

In [320]:
response = requests.get(url)

In [321]:
response.status_code

200

In [322]:
soup = BeautifulSoup(response.content,"html.parser")

In [323]:
soup.select("ul a")

[<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>,
 <a href="/wiki/Help:Contents" title="Guidance on how to use and edit Wikip

In [324]:
soup.select("ul a")[91]

<a href="/wiki/Pythonidae" title="Pythonidae">Pythonidae</a>

In [325]:
soup.select("ul a")[116]

<a href="/wiki/Pithon" title="Pithon">Pithon</a>

In [326]:
soup.select("ul a")[91].get('href')

'/wiki/Pythonidae'

In [327]:
soup.select("ul a")[91].get('title')

'Pythonidae'

In [328]:
soup.select("ul a:nth-child(1)")[19].get_text()[3:].strip()

'Snakes'

In [329]:
soup.select("ul a:nth-child(1)")[26].get_text()[3:].strip()

'See also'

In [337]:
name = []
links = []

# Query the page once; indices 91-116 run from Pythonidae to Pithon
anchors = soup.select("ul a")

for i in range(91,117):
    titulo = anchors[i].get('title')
    name.append(titulo)
    link = anchors[i].get('href')
    links.append(link)

In [247]:
name

['Pythonidae',
 'Python (genus)',
 'Python (mythology)',
 'Python (programming language)',
 'CMU Common Lisp',
 'PERQ',
 'Python of Aenus',
 'Python (painter)',
 'Python of Byzantium',
 'Python of Catana',
 'Python Anghelo',
 'Python (Efteling)',
 'Python (Busch Gardens Tampa Bay)',
 'Python (Coney Island, Cincinnati, Ohio)',
 'Python (automobile maker)',
 'Python (Ford prototype)',
 'Python (missile)',
 'Python (nuclear primary)',
 'Colt Python',
 'Python (codename)',
 'Python (film)',
 'Monty Python',
 'Python (Monty) Pictures',
 'Timon of Phlius',
 'Pyton',
 'Pithon']

In [251]:
Python_may_refer_to = pd.DataFrame({'name':name,'links':links})
Python_may_refer_to

| | name | links |
|---|---|---|
| 0 | Pythonidae | /wiki/Pythonidae |
| 1 | Python (genus) | /wiki/Python_(genus) |
| 2 | Python (mythology) | /wiki/Python_(mythology) |
| 3 | Python (programming language) | /wiki/Python_(programming_language) |
| 4 | CMU Common Lisp | /wiki/CMU_Common_Lisp |
| 5 | PERQ | /wiki/PERQ#PERQ_3 |
| 6 | Python of Aenus | /wiki/Python_of_Aenus |
| 7 | Python (painter) | /wiki/Python_(painter) |
| 8 | Python of Byzantium | /wiki/Python_of_Byzantium |
| 9 | Python of Catana | /wiki/Python_of_Catana |
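
The fixed slice `range(91,117)` depends on the page layout at the time of scraping. One possible alternative, sketched here under the assumption that article links are internal `/wiki/` paths without a namespace colon, is to filter by the shape of the `href`. `is_article_link` is a hypothetical helper, and the heuristic still lets some navigation pages such as `/wiki/Main_Page` through.

```python
def is_article_link(href):
    """Heuristic: keep internal /wiki/ paths that have no namespace colon."""
    return href is not None and href.startswith("/wiki/") and ":" not in href

hrefs = [
    "/wiki/Main_Page",
    "/wiki/Wikipedia:Contents",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    "/wiki/Pythonidae",
    "/wiki/Python_(genus)",
    None,  # .get('href') returns None for anchors without an href
]

article_links = [h for h in hrefs if is_article_link(h)]
print(article_links)
```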


In [308]:
url ='https://es.wikipedia.org/wiki/Pythonidae'
response = requests.get(url)
soup = BeautifulSoup(response.content,"html.parser")

In [315]:
soup.select("h1")[0].get_text()

'Pythonidae'

In [339]:
for i in range(len(links)):
    url = "https://en.wikipedia.org"+links[i]
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.select("h1")[0].get_text())
    print(url)

Pythonidae
Python (genus)
Python (mythology)
Python (programming language)
CMU Common Lisp
PERQ
Python of Aenus
Python (painter)
Python of Byzantium
Python of Catana
Python Anghelo
Python (Efteling)
Python (Busch Gardens Tampa Bay)
Python (Coney Island, Cincinnati, Ohio)
Python (automobile maker)
Python (Ford prototype)
Python (missile)
Python (nuclear primary)
Colt Python
Python (codename)
Python (film)
Monty Python
Python (Monty) Pictures
Timon of Phlius
Pyton
Pithon
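
String concatenation works for the relative links collected here, but a page can also contain protocol-relative or absolute links. A short sketch using the standard library's `urllib.parse.urljoin`, which resolves each form against the page the links came from, is given below; `BASE` and `absolute` are illustrative names.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/wiki/Python"  # page the links were scraped from

hrefs = [
    "/wiki/Pythonidae",                              # relative path
    "/wiki/PERQ#PERQ_3",                             # relative path with a fragment
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # protocol-relative link
]

absolute = [urljoin(BASE, h) for h in hrefs]
for link in absolute:
    print(link)
```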


- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'

In [281]:
url = 'http://uscode.house.gov/download/download.shtml'

In [282]:
response = requests.get(url)

In [283]:
response.status_code

200

In [284]:
soup = BeautifulSoup(response.content,"html.parser")

In [276]:
# Assumption: each title that has changed since the last release point is marked
# with its own <div class="usctitlechanged">, so counting those divs gives the answer
changed_titles = soup.find_all('div', {'class': 'usctitlechanged'})

# Print the number of titles that have changed
print(f'Number of titles changed since the last release point: {len(changed_titles)}')

- Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'

In [285]:
url = 'https://www.fbi.gov/wanted/topten'

In [286]:
response = requests.get(url)

In [287]:
response.status_code

200

In [288]:
soup = BeautifulSoup(response.content,"html.parser")

In [295]:
soup.select("li h3")[0].get_text().strip()

'YULAN ADONAY ARCHAGA CARIAS'

In [298]:
wanted = []

for i in range(len(soup.select("li h3"))):
    buscados = soup.select("li h3")[i].get_text().strip()
    wanted.append(buscados)

In [299]:
wanted

['YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'WILVER VILLEGAS-PALOMINO',
 'ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'ARNOLDO JIMENEZ',
 'OMAR ALEXANDER CARDENAS',
 'ALEXIS FLORES',
 'MICHAEL JAMES PRATT',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']

- Display information on the 20 latest earthquakes reported by the EMSC (date, time, latitude, longitude and region name) as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [340]:
url = 'https://www.emsc-csem.org/Earthquake/'

In [341]:
response = requests.get(url)

In [342]:
response.status_code

200

In [343]:
soup = BeautifulSoup(response.content,"html.parser")

In [352]:
soup.select("td.tabev6 a")[0].get_text().strip()

'2023-05-09\xa0\xa0\xa002:22:07.3'
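
The scraped timestamp separates date and time with non-breaking spaces (`\xa0`). A minimal sketch of normalizing it and parsing it into a `datetime`, from which the date and the time can be read separately, follows; `parse_emsc_timestamp` is a hypothetical helper that assumes every timestamp has the shape shown above.

```python
from datetime import datetime

def parse_emsc_timestamp(raw):
    """Replace non-breaking spaces, collapse whitespace, then parse the timestamp."""
    normalized = " ".join(raw.replace("\xa0", " ").split())
    return datetime.strptime(normalized, "%Y-%m-%d %H:%M:%S.%f")

parsed = parse_emsc_timestamp("2023-05-09\xa0\xa0\xa002:22:07.3")
print(parsed.date())  # date part
print(parsed.time())  # time part
```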

In [355]:
soup.select("td.tabev1")

[<td class="tabev1">34.10 </td>,
 <td class="tabev1">116.68 </td>,
 <td class="tabev1">19.39 </td>,
 <td class="tabev1">155.30 </td>,
 <td class="tabev1">39.08 </td>,
 <td class="tabev1">33.05 </td>,
 <td class="tabev1">37.48 </td>,
 <td class="tabev1">137.47 </td>,
 <td class="tabev1">39.11 </td>,
 <td class="tabev1">33.07 </td>,
 <td class="tabev1">37.50 </td>,
 <td class="tabev1">72.55 </td>,
 <td class="tabev1">37.63 </td>,
 <td class="tabev1">35.61 </td>,
 <td class="tabev1">38.41 </td>,
 <td class="tabev1">37.34 </td>,
 <td class="tabev1">11.46 </td>,
 <td class="tabev1">162.82 </td>,
 <td class="tabev1">45.59 </td>,
 <td class="tabev1">15.35 </td>,
 <td class="tabev1">9.36 </td>,
 <td class="tabev1">113.99 </td>,
 <td class="tabev1">19.41 </td>,
 <td class="tabev1">155.30 </td>,
 <td class="tabev1">21.09 </td>,
 <td class="tabev1">66.77 </td>,
 <td class="tabev1">36.53 </td>,
 <td class="tabev1">28.87 </td>,
 <td class="tabev1">38.45 </td>,
 <td class="tabev1">37.27 </td>,
 <td 

In [353]:
soup.select("td.tabev1")[0].get_text().strip()

'34.10'

In [356]:
soup.select("td.tabev1")[1].get_text().strip()

'116.68'

In [360]:
soup.select("td.tb_region")[0].get_text().strip()

'SOUTHERN CALIFORNIA'

In [387]:
data_time = []
latitud = []
longitude = []
region_name = []

# Query each column once; the td.tabev1 cells alternate latitude, longitude
coords = soup.select("td.tabev1")
dates = soup.select("td.tabev6 a")
regions = soup.select("td.tb_region")

for i in range(0,40,2):
    latitud.append(coords[i].get_text().strip())

for i in range(1,40,2):
    longitude.append(coords[i].get_text().strip())

for i in range(0,20):
    data_time.append(dates[i].get_text().strip())
    region_name.append(regions[i].get_text().strip())

In [389]:
latitud

['34.10',
 '19.39',
 '39.08',
 '37.48',
 '39.11',
 '37.50',
 '37.63',
 '38.41',
 '11.46',
 '45.59',
 '9.36',
 '19.41',
 '21.09',
 '36.53',
 '38.45',
 '39.05',
 '19.42',
 '18.58',
 '33.66',
 '2.18']
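
Since the `td.tabev1` cells alternate latitude and longitude, the two columns can also be separated with slice steps instead of two `range` loops. A minimal sketch on the first few scraped values, converting them to floats along the way, is given here; the names are illustrative. The scraped values carry no hemisphere sign, so this sketch does not recover one.

```python
# First six td.tabev1 values from the output above: lat, lon, lat, lon, ...
cells = ["34.10", "116.68", "19.39", "155.30", "39.08", "33.05"]

latitudes = [float(v) for v in cells[0::2]]   # even positions
longitudes = [float(v) for v in cells[1::2]]  # odd positions

pairs = list(zip(latitudes, longitudes))
print(pairs)
```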

In [390]:
Earthquakes_today = pd.DataFrame({'date_time':data_time,'latitud':latitud,'longitude':longitude,'region_name':region_name})
Earthquakes_today

| | date_time | latitud | longitude | region_name |
|---|---|---|---|---|
| 0 | 2023-05-09 02:22:07.3 | 34.1 | 116.68 | SOUTHERN CALIFORNIA |
| 1 | 2023-05-09 02:16:27.3 | 19.39 | 155.3 | ISLAND OF HAWAII, HAWAII |
| 2 | 2023-05-09 02:12:59.9 | 39.08 | 33.05 | CENTRAL TURKEY |
| 3 | 2023-05-09 02:05:56.6 | 37.48 | 137.47 | NEAR WEST COAST OF HONSHU, JAPAN |
| 4 | 2023-05-09 02:00:14.2 | 39.11 | 33.07 | CENTRAL TURKEY |
| 5 | 2023-05-09 01:47:41.6 | 37.5 | 72.55 | TAJIKISTAN |
| 6 | 2023-05-09 01:40:22.4 | 37.63 | 35.61 | CENTRAL TURKEY |
| 7 | 2023-05-09 01:26:10.9 | 38.41 | 37.34 | CENTRAL TURKEY |
| 8 | 2023-05-09 01:23:31.4 | 11.46 | 162.82 | SOLOMON ISLANDS |
| 9 | 2023-05-09 01:19:31.0 | 45.59 | 15.35 | CROATIA |


- List all language names and the number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'

In [391]:
url = 'https://www.wikipedia.org/'
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content,"html.parser")

200


In [426]:
soup.select('div.central-featured div')[0].get_text().split('\n')[2]

'English'

In [430]:
names_language = []

for i in range(len(soup.select('div.central-featured div'))):
    names_language.append(soup.select('div.central-featured div')[i].get_text().split('\n')[2])

In [431]:
names_language

['English',
 'Русский',
 '日本語',
 'Deutsch',
 'Español',
 'Français',
 'Italiano',
 '中文',
 'فارسی',
 'Português']
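
The task also asks for the number of articles per language, which the cells above do not extract. A rough sketch of how a block's text could be split into a name and a count follows. The sample string is hypothetical: it only assumes that the name is the first non-empty line (consistent with `split('\n')[2]` above) and that a later line contains the count as digits. `parse_language_block` is an illustrative helper.

```python
import re

def parse_language_block(text):
    """Return (name, article_count); count is None when no digits are found."""
    lines = [ln.strip() for ln in text.replace("\xa0", " ").split("\n") if ln.strip()]
    if not lines:
        return None, None
    name = lines[0]
    count = None
    for line in lines[1:]:
        digits = re.sub(r"\D", "", line)  # drop spaces, '+', and words
        if digits:
            count = int(digits)
            break
    return name, count

# Hypothetical block text, not scraped data
sample = "\n\nEnglish\n1 000 000+ articles\n"
result = parse_language_block(sample)
print(result)
```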

- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

In [432]:
url = 'https://data.gov.uk/'
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content,"html.parser")

200


In [435]:
soup.select('li a.govuk-link')

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>,
 <a class="govuk-link" href="/search?f

In [438]:
list_of_govuk = []
for i in range(len(soup.select('li a.govuk-link'))):
    list_of_govuk.append(soup.select('li a.govuk-link')[i].get_text().strip())

In [439]:
list_of_govuk

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [440]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url)
print(response.status_code)
soup = BeautifulSoup(response.content,"html.parser")

200


In [467]:
soup.select('tbody td a.mw-redirect')

[<a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:por" title="ISO 639:por">Portuguese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:ben" title="ISO 639:ben">Bengali</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:rus" title="ISO 639:rus">Russian</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:jpn" title="ISO 639:jpn">Japanese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:yue" title="ISO 639:yue">Yue Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:vie" title="ISO 639:vie">Vietnamese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:tur" title="ISO 639:tur">Turkish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:wuu" title="ISO 639:wuu">Wu Chinese<

In [472]:
soup.select('tbody td a.mw-redirect')[3].get_text()

'Hindi'

In [477]:
soup.select('tbody td')[1]

<td>939
</td>

In [478]:
soup.select('tbody td')[5]

<td>485
</td>

In [483]:
language=[]
native_speakers=[]

for i in range(10):
    language.append(soup.select('tbody td a.mw-redirect')[i].get_text().strip())

for i in range(1,40,4):
    native_speakers.append(soup.select('tbody td')[i].get_text().strip())

In [484]:
native_speakers

['939', '485', '380', '345', '236', '234', '147', '123', '86.1', '85.0']

In [485]:
list_of_languages = pd.DataFrame({'language':language,'native_speakers':native_speakers})
list_of_languages

| | language | native_speakers |
|---|---|---|
| 0 | Mandarin Chinese | 939.0 |
| 1 | Spanish | 485.0 |
| 2 | English | 380.0 |
| 3 | Hindi | 345.0 |
| 4 | Portuguese | 236.0 |
| 5 | Bengali | 234.0 |
| 6 | Russian | 147.0 |
| 7 | Japanese | 123.0 |
| 8 | Yue Chinese | 86.1 |
| 9 | Vietnamese | 85.0 |
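
The scraped `native_speakers` values are strings, so comparisons on them are lexicographic rather than numeric. A brief sketch of the difference and of a plain-Python conversion is shown below; inside pandas, `pd.to_numeric` on the column would be one way to get the same effect.

```python
native_speakers = ['939', '485', '380', '345', '236', '234', '147', '123', '86.1', '85.0']

# As strings, '123' sorts before '85.0' because '1' comes before '8'
smallest_as_text = min(native_speakers)

# Converting to float gives numeric ordering
as_numbers = [float(v) for v in native_speakers]
smallest_as_number = min(as_numbers)

print(smallest_as_text, smallest_as_number)
```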
