# <font color='orange'> Lab | Web Scraping Multiple Pages  </font>

- Business goal:
  - Check the case_study_gnod.md file.
  - Make sure you've understood the big picture of your project:
    - the goal of the company (Gnod),
    - their current product (Gnoosic),
    - their strategy, and
    - how your project fits into this context.
  - Re-read the business case and the e-mail from the CTO, take a look at the flowchart, and create an initial Trello board with the tasks you think you'll have to accomplish.


- Instructions
  - Prioritize the MVP
    - In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.
    - If you couldn't finish the first lab, use this time to go back there.
  - Expand the project
    - If you're done, you can try to expand the project on your own. Here are a few suggestions:
      - Find other lists of hot songs on the internet and scrape them too: a bigger pool of songs gives the recommender more to work with.
      - Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
      - Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
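
To build the bigger pool of songs suggested above, the scraped charts eventually have to be stacked into a single table. The following is a minimal sketch of one way to do that, not part of the lab itself. The helper name `pool_songs`, the `source` column, and the tiny sample frames are assumptions made for illustration; the sketch also assumes each chart has a `song` (or `songs`) column and an `artist` column.

```python
import pandas as pd


def pool_songs(charts):
    """Stack several scraped charts into one de-duplicated pool.

    `charts` maps a source label to a DataFrame with a song column
    (named "song" or "songs") and an "artist" column.
    """
    frames = []
    for source, df in charts.items():
        # Normalize the column name so every chart lines up.
        df = df.rename(columns={"songs": "song"})[["song", "artist"]].copy()
        df["source"] = source  # hypothetical column: remembers where a row came from
        frames.append(df)
    pool = pd.concat(frames, ignore_index=True)
    # Compare case-insensitively and ignore surrounding whitespace,
    # so the same song found in two charts is kept only once.
    key = (pool["song"].str.strip().str.lower()
           + "|" + pool["artist"].str.strip().str.lower())
    return pool.loc[~key.duplicated()].reset_index(drop=True)


# Tiny illustrative samples, not real scrape results.
billboard_sample = pd.DataFrame(
    {"song": ["Flowers", "Kill Bill"], "artist": ["Miley Cyrus", "SZA"]})
spotify_sample = pd.DataFrame(
    {"songs": ["Flowers", "TQG"], "artist": ["Miley Cyrus", "KAROL G"]})

pool = pool_songs({"billboard": billboard_sample, "spotify": spotify_sample})
print(pool)
```

Exact-match de-duplication will not catch spelling variants of the same title across sources, so a real pool would likely need fuzzier matching.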

- Trello link
https://trello.com/invite/b/NdM54rFw/ATTI63ea383c365c9d6f9c23698cc7321976C345E061/gnoosic

In [1581]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np



In [1582]:
# Bring over the code from the previous lab
url = 'https://www.billboard.com/charts/hot-100/'
billboard_html = requests.get(url).content
soup = BeautifulSoup(billboard_html, "html.parser")

In [1583]:
# Song titles and artist names share the "a-no-trucate" class (spelled that way in the page markup)
songs = [tag.get_text() for tag in soup.find_all("h3", class_="a-no-trucate")]
artists = [tag.get_text() for tag in soup.find_all("span", class_="a-no-trucate")]
billboard = pd.DataFrame({"song": songs, "artist": artists})
# Strip the newlines and tabs that surround each text node
billboard = billboard.replace({'\n': '', '\t': ''}, regex=True)
billboard

Unnamed: 0,song,artist
0,Flowers,Miley Cyrus
1,Kill Bill,SZA
2,"Boy's A Liar, Pt. 2",PinkPantheress & Ice Spice
3,Creepin',"Metro Boomin, The Weeknd & 21 Savage"
4,Last Night,Morgan Wallen
...,...,...
95,Slut Me Out,NLE Choppa
96,La Jumpa,Arcangel & Bad Bunny
97,Shut Up My Moms Calling,Hotel Ugly
98,Gold,Dierks Bentley


In [1584]:
# We found a new list with a top 200
url = 'https://kworb.net/spotify/country/global_daily.html'
spotify_chart = requests.get(url).content
soup = BeautifulSoup(spotify_chart, "html.parser")


In [1585]:
search = soup.select('div>a')  # Returns both the artist and the song links, alternating
artists = search[::2]
songs = search[1::2]

In [1586]:
# Get the text
for i in [artists, songs]:
    for j in range(len(i)):
        i[j] = i[j].getText()

In [1587]:
top200 = pd.DataFrame({"songs": songs,"artist": artists })
top200

Unnamed: 0,songs,artist
0,Flowers,Miley Cyrus
1,TQG,KAROL G
2,Kill Bill,SZA
3,Die For You - Remix,The Weeknd
4,Boy's a liar Pt. 2,PinkPantheress
...,...,...
195,STAR WALKIN' (League of Legends Worlds Anthem),Lil Nas X
196,After Hours,The Weeknd
197,Locked out of Heaven,Bruno Mars
198,Payphone,Maroon 5


In [1588]:
# List of anti-war songs
# General pacifist and anti-war songs
url = 'https://en.wikipedia.org/wiki/List_of_anti-war_songs'
anti_war = requests.get(url).content
soup = BeautifulSoup(anti_war, "html.parser")

In [1589]:
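search = soup.select('td')
search = [i.get_text() for i in search]
# Keep only the cells of the song list; every row has three cells: year, song, artist
search = search[2:917]
year = search[0::3]
song = search[1::3]
artist = search[2::3]
anti_war_songs = pd.DataFrame({'year': year,
                               'song': song,
                               'artist': artist})
# Remove the quotation marks around titles and the trailing newlines
anti_war_songs = anti_war_songs.replace({'"': '', '\n': ''}, regex=True)
anti_war_songs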

Unnamed: 0,year,song,artist
0,1971,And the Band Played Waltzing Matilda,Eric Bogle
1,1867,Johnny I Hardly Knew Ye,Traditional
2,1985,19,Paul Hardcastle
3,1966,7 O'Clock News/Silent Night,Simon & Garfunkel
4,1989,After the War,Gary Moore
...,...,...,...
300,1976,Zombie,Fela Kuti
301,1994,Zombie,The Cranberries
302,1968,Zor and Zam,The Monkees
303,1982,Радиоактивность (Radioactivity),Center


# <font color='orange'> Practice web scraping  </font>
- As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:
- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url ='https://en.wikipedia.org/wiki/Python'
- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
- Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
- Create a list with the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
- Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [1590]:
# Retrieve the Wikipedia page for "Python" and create a list of links on that page:
url ='https://en.wikipedia.org/wiki/Python'
python_links = requests.get(url).content
soup = BeautifulSoup(python_links, "html.parser")

In [1591]:
links = []
for link in soup.find_all("a"):
    href = link.get("href")
    # Some <a> tags have no href attribute, so guard against None
    if href and href.startswith("http"):
        links.append(href)
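
The filter above keeps only `href` values that start with `http`, so site-relative links such as `/wiki/...` are left out. If those are wanted too, one option is to resolve every `href` against the page URL with the standard library's `urljoin`. This is a sketch under assumptions: `absolute_links` is a hypothetical helper, and the sample `href` values are illustrative rather than scraped.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve href values against base_url, skipping missing and in-page anchors."""
    links = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # <a> tags without an href, or same-page anchors
        links.append(urljoin(base_url, href))
    return links


base = "https://en.wikipedia.org/wiki/Python"
# Illustrative href values of the kinds found on a wiki page.
sample = [None, "#History", "/wiki/Python_(programming_language)",
          "https://commons.wikimedia.org/wiki/Category:Python"]
resolved = absolute_links(sample, base)
print(resolved)
```

With the notebook's `soup`, the input would be something like `[a.get("href") for a in soup.find_all("a")]`.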

In [1592]:
# Display the collected links (links is already a list)
links

['https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://www.wikidata.org/wiki/Special:EntityPage/Q747452',
 'https://commons.wikimedia.org/wiki/Category:Python',
 'https://af.wikipedia.org/wiki/Python',
 'https://als.wikipedia.org/wiki/Python',
 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D8%AA%D9%88%D8%B6%D9%8A%D8%AD)',
 'https://az.wikipedia.org/wiki/Python_(d%C9%99qiql%C9%99%C5%9Fdirm%C9%99)',
 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%A6%E0%A7%8D%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%B0%E0%A7%8D%E0%A6%A5%E0%A6%A4%E0%A6%BE_%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A6%B8%E0%A6%A8)',
 'https://be.wikipedia.org/wiki/Python',
 'https://bg.wikipedia.org/wiki/%D0%9F%D0%B8%D1%82%D0%BE%D0%BD_(%D0%BF%D0%BE%D1%8F%D1%81%D0%BD%D0%B5%D0%BD%D0%B8%D0%B5)',
 'https://cs.wikipedia.org/wiki/Python_(rozcestn%C3%ADk)',
 'https://da.wikipedia.org/wik

In [1593]:
# Find the number of titles that have changed in the United States Code since its last release point:
url ='http://uscode.house.gov/download/download.shtml'
titles_UE = requests.get(url).content
soup = BeautifulSoup(titles_UE, "html.parser")

In [1594]:
titles = soup.find_all("div", class_="usctitlechanged")
titles = [i.get_text() for i in titles]

In [1595]:
# 6. Get the text
# for i in [titles]:
#     for j in range(len(i)):
#         i[j] = i[j].getText()

In [1596]:
titles=list(titles)

In [1597]:
titles = [x.replace('\n', '').replace('٭','').strip() for x in titles]
titles

['Title 2 - The Congress',
 'Title 3 - The President',
 'Title 5 - Government Organization and Employees',
 'Title 6 - Domestic Security',
 'Title 7 - Agriculture',
 'Title 8 - Aliens and Nationality',
 'Title 10 - Armed Forces',
 'Title 12 - Banks and Banking',
 'Title 14 - Coast Guard',
 'Title 15 - Commerce and Trade',
 'Title 16 - Conservation',
 'Title 18 - Crimes and Criminal Procedure',
 'Title 19 - Customs Duties',
 'Title 20 - Education',
 'Title 21 - Food and Drugs',
 'Title 22 - Foreign Relations and Intercourse',
 'Title 23 - Highways',
 'Title 24 - Hospitals and Asylums',
 'Title 25 - Indians',
 'Title 26 - Internal Revenue Code',
 'Title 29 - Labor',
 'Title 30 - Mineral Lands and Mining',
 'Title 31 - Money and Finance',
 'Title 33 - Navigation and Navigable Waters',
 'Title 34 - Crime Control and Law Enforcement',
 'Title 35 - Patents',
 'Title 36 - Patriotic and National Observances, Ceremonies, and Organizations',
 "Title 38 - Veterans' Benefits",
 'Title 39 - Postal 

In [1598]:
len(titles)

42

In [1599]:
# Create a Python list with the top ten FBI's Most Wanted names:
url = 'https://www.fbi.gov/wanted/topten'
fbi_mw = requests.get(url).content
soup = BeautifulSoup(fbi_mw, "html.parser")


In [1600]:
fbi = soup.find_all("h3", class_='title' )
fbi = [i.get_text() for i in fbi]
fbi=pd.DataFrame({"most_wanted":fbi})
fbi=fbi.replace({'\n':'','-':' ' }, regex=True)
fbi


Unnamed: 0,most_wanted
0,JOSE RODOLFO VILLARREAL HERNANDEZ
1,ALEJANDRO ROSALES CASTILLO
2,YULAN ADONAY ARCHAGA CARIAS
3,RUJA IGNATOVA
4,ARNOLDO JIMENEZ
5,OMAR ALEXANDER CARDENAS
6,ALEXIS FLORES
7,BHADRESHKUMAR CHETANBHAI PATEL
8,MICHAEL JAMES PRATT
9,RAFAEL CARO QUINTERO


In [1601]:
# Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: 
url = 'https://www.emsc-csem.org/Earthquake/'
earthquakes = requests.get(url).content
soup = BeautifulSoup(earthquakes, "html.parser")

In [1602]:
date_time=soup.select("b>a")
date_time = [i.get_text() for i in date_time]
date_time = [x.replace("\xa0\xa0\xa0", " ") for x in date_time]
date_time = [x.split(" ") for x in date_time]
date_time.pop(50)  # Drop element 50 (['Privacy']), which is not part of the data we want
date_time=pd.DataFrame(date_time)
date_time = date_time.set_axis(['date', 'time'], axis=1)


In [1603]:
# Latitude and longitude values
lat_long = soup.find_all("td", class_="tabev1")  # cells alternate: latitude first, then longitude
lat_long = [i.get_text() for i in lat_long]
lat_long = [x.replace("\xa0", "") for x in lat_long]
latitude = lat_long[0:100:2]
longitude = lat_long[1:100:2]
latitude= pd.DataFrame(latitude)
longitude= pd.DataFrame(longitude)

In [1604]:
# The latitude and longitude above are incomplete: we still need the hemisphere columns (N or S, E or W)
lat_long_let = soup.find_all("td", class_="tabev2")  # this class also returns the magnitude
lat_long_let=[i.get_text() for i in lat_long_let]
lat_long_let=[x.replace("\xa0","") for x in lat_long_let]
lat_let = lat_long_let[0:150:3]
lon_let = lat_long_let[1:150:3]
magnitude = lat_long_let[2:150:3]
lat_let= pd.DataFrame(lat_let)
lon_let= pd.DataFrame(lon_let)
magnitude= pd.DataFrame(magnitude)

In [1605]:
region=soup.find_all("td", class_="tb_region")
region = [i.get_text() for i in region]  # region name of each earthquake
region=[x.replace("\xa0","") for x in region]
region = pd.DataFrame(region)

In [1606]:
# Build a DataFrame with all our data
data = np.concatenate([date_time, latitude, lat_let, longitude, lon_let, magnitude, region], axis=1)
data=pd.DataFrame(data)


In [1607]:
data["latitude"]= data[2] + " " + data[3]
data["longitude"]= data[4] + " " + data[5]
data=data.drop([2,3,4,5], axis=1)
data=data.set_axis(['date', 'time','magnitude','region', 'latitude', 'longitude'], axis=1)
data=data[0:20]
data

Unnamed: 0,date,time,magnitude,region,latitude,longitude
0,2023-03-02,22:06:39.1,2.0,WESTERN TURKEY,38.34 N,27.17 E
1,2023-03-02,21:52:37.9,2.1,"ISLAND OF HAWAII, HAWAII",19.39 N,155.48 W
2,2023-03-02,21:49:59.0,4.5,CHINA-LAOS-VIETNAM BORDER REGION,22.28 N,102.28 E
3,2023-03-02,21:49:22.6,2.3,CENTRAL TURKEY,37.39 N,36.90 E
4,2023-03-02,21:35:51.2,3.0,WESTERN TURKEY,38.99 N,27.85 E
5,2023-03-02,21:31:53.1,2.2,CENTRAL TURKEY,37.48 N,37.13 E
6,2023-03-02,21:18:41.7,2.2,NORTHERN CALIFORNIA,38.79 N,122.78 W
7,2023-03-02,21:16:49.8,3.2,NORTHERN CALIFORNIA,38.83 N,122.70 W
8,2023-03-02,21:13:20.0,3.7,"SERAM, INDONESIA",3.35 S,128.35 E
9,2023-03-02,21:11:06.5,2.0,ALBANIA,41.36 N,19.59 E


In [1608]:
# List all language names and number of related articles in the order they appear in wikipedia.org: 
url = 'https://www.wikipedia.org/'
wikipedia = requests.get(url).content
soup = BeautifulSoup(wikipedia, "html.parser")

In [1609]:
languages = soup.find_all('strong')
articles =  soup.select('small>bdi') 

In [1610]:
languages = [i.get_text() for i in languages]
# Drop the two <strong> elements that are not language names
languages.pop(0)
languages.pop(10)
languages = pd.DataFrame(languages)

In [1611]:
articles = [i.get_text() for i in articles]
articles=pd.DataFrame(articles)
type(articles)

pandas.core.frame.DataFrame

In [1612]:
wikipedia=np.concatenate([languages, articles], axis=1)

In [1613]:
wikipedia = pd.DataFrame(wikipedia)
# First column is the language name, second the number of articles
wikipedia = wikipedia.set_axis(['language', 'articles'], axis=1)
wikipedia


Unnamed: 0,language,articles
0,English,6 606 000+
1,Русский,1 887 000+
2,日本語,1 359 000+
3,Deutsch,2 764 000+
4,Français,2 488 000+
5,Español,1 833 000+
6,Italiano,1 792 000+
7,中文,1 331 000+
8,فارسی,947 000+
9,Polski,1 552 000+


In [1614]:
# A list with the different kind of datasets available in data.gov.uk:
url = 'https://data.gov.uk/'
datasets = requests.get(url).content
soup = BeautifulSoup(datasets, "html.parser")


In [1615]:
datasets=soup.select('h3>a')
datasets = [i.get_text() for i in datasets ]
datasets = pd.DataFrame({"datasets": datasets})
datasets

Unnamed: 0,datasets
0,Business and economy
1,Crime and justice
2,Defence
3,Education
4,Environment
5,Government
6,Government spending
7,Health
8,Mapping
9,Society


In [1616]:
# Display the top 10 languages by number of native speakers stored in a pandas dataframe:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
speakers = requests.get(url).content
soup = BeautifulSoup(speakers, 'html.parser')

In [1617]:
speakers=soup.find_all('td')
speakers = [i.get_text() for i in speakers]
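# Keep the first 108 cells (27 rows of 4 cells each)
speakers = speakers[0:108]
# Within each row, the first cell is the language and the second the number of native speakers
languages = speakers[::4]
num_speakers = speakers[1::4]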


In [1618]:
speakers=pd.DataFrame({'languages': languages,
                       'num_speakers': num_speakers})

In [1619]:
speakers=speakers.replace({'\n':""}, regex=True)
speakers

Unnamed: 0,languages,num_speakers
0,"Mandarin Chinese(incl. Standard Chinese, but e...",939.0
1,Spanish,485.0
2,English,380.0
3,Hindi(excl. Urdu),345.0
4,Portuguese,236.0
5,Bengali,234.0
6,Russian,147.0
7,Japanese,123.0
8,Yue Chinese(incl. Cantonese),86.1
9,Vietnamese,85.0
