Practice web scraping
As you've seen, web scraping is a skill that lets you collect all sorts of information. Here are some small challenges you can try to gain more experience:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'


- Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'


- Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten'


- Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'


- List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'

- Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

- Display the top 10 languages by number of native speakers in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [81]:
from bs4 import BeautifulSoup
import requests as req

import pandas as pd

#### 1. Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'

In [18]:
url_wiki = 'https://en.wikipedia.org/wiki/Python'
response_wiki = req.get(url=url_wiki)
soup_wiki = BeautifulSoup(response_wiki.content, 'html.parser')

In [28]:

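# Collect every absolute link (href starting with 'http') into a list,
# as the task asks, then print each one
links = [tag['href'] for tag in soup_wiki.find_all(href=True)
         if str(tag['href']).startswith('http')]
for link in links:
    print("Found the URL:", link)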

Found the URL: https://creativecommons.org/licenses/by-sa/3.0/
Found the URL: https://en.wikipedia.org/wiki/Python
Found the URL: https://en.wiktionary.org/wiki/Python
Found the URL: https://en.wiktionary.org/wiki/python
Found the URL: https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Python&namespace=0
Found the URL: https://en.wikipedia.org/w/index.php?title=Python&oldid=1048703433
Found the URL: https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
Found the URL: https://www.wikidata.org/wiki/Special:EntityPage/Q747452
Found the URL: https://commons.wikimedia.org/wiki/Category:Python
Found the URL: https://af.wikipedia.org/wiki/Python
Found the URL: https://als.wikipedia.org/wiki/Python
Found the URL: https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D8%AA%D9%88%D8%B6%D9%8A%D8%AD)
Found the URL: https://az.wikipedia.org/wiki/Python
Found the URL: https://bn.wi
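
The filter above keeps only hrefs that already start with `http`, so relative links such as `/wiki/...` are dropped. A minimal sketch of one way to keep them is shown below, assuming the hrefs have already been pulled out of the soup as plain strings. The helper name `absolute_links` and the sample hrefs are made up for illustration; they are not taken from the page.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url; keep unique http(s) links in order."""
    seen = set()
    links = []
    for href in hrefs:
        if href.startswith('#'):
            continue  # skip in-page anchors
        full = urljoin(base_url, href)  # relative -> absolute
        if full.startswith(('http://', 'https://')) and full not in seen:
            seen.add(full)
            links.append(full)
    return links


# Hypothetical sample hrefs, standing in for the values scraped from the page
sample_hrefs = [
    '/wiki/Monty_Python',
    '#mw-head',
    'https://en.wiktionary.org/wiki/Python',
    '/wiki/Monty_Python',
]
links = absolute_links(sample_hrefs, 'https://en.wikipedia.org/wiki/Python')
print(links)
```

In the notebook, the input could be built with something like `[a['href'] for a in soup_wiki.find_all('a', href=True)]`.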

#### 2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'

In [29]:
url_usa = 'http://uscode.house.gov/download/download.shtml'
response_usa = req.get(url=url_usa)
soup_usa = BeautifulSoup(response_usa.content, 'html.parser')

In [34]:
res = soup_usa.select('.usctitlechanged')

In [40]:
for r in res:
    print(r.text.strip())

Title 18 - Crimes and Criminal Procedure ٭
Title 34 - Crime Control and Law Enforcement
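
The task asks for the number of changed titles, while the cell above only prints their names. A small sketch of how the printed lines could be turned into a count and a structured list follows. It assumes each line has the shape `Title <number> - <name>`, optionally followed by a star marker, as in the output above; the helper name `parse_changed_titles` is hypothetical.

```python
import re

# Assumed line shape: 'Title <number> - <name>' with an optional trailing marker
TITLE_PATTERN = re.compile(r'Title\s+(\d+)\s*-\s*(.+)')


def parse_changed_titles(lines):
    """Turn 'Title 18 - Name' strings into (number, name) tuples."""
    parsed = []
    for line in lines:
        match = TITLE_PATTERN.match(line.strip())
        if match:
            # Drop a trailing star marker and any whitespace around it
            name = match.group(2).rstrip('٭* ').strip()
            parsed.append((int(match.group(1)), name))
    return parsed


sample = [
    'Title 18 - Crimes and Criminal Procedure ٭',
    'Title 34 - Crime Control and Law Enforcement',
]
changed = parse_changed_titles(sample)
print(len(changed))
print(changed)
```

In the notebook, the input could be `[r.text for r in res]`; if only the count is needed, `len(res)` is enough.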


#### 3. Create a Python list with the top ten FBI's Most Wanted names: url = 'https://www.fbi.gov/wanted/topten' 


In [60]:
url_fbi = 'https://www.fbi.gov/wanted/topten'
response_fbi = req.get(url=url_fbi)
soup_fbi = BeautifulSoup(response_fbi.content, 'html.parser')

In [66]:
res_fbi = soup_fbi.select('.portal-type-person.castle-grid-block-item')


In [79]:
# Build the list of names from the first 'title' element of each entry
names = [i.find(class_='title').text.strip() for i in res_fbi]
for name in names:
    print(name)


RAFAEL CARO-QUINTERO
YULAN ADONAY ARCHAGA CARIAS
EUGENE PALMER
BHADRESHKUMAR CHETANBHAI PATEL
ALEJANDRO ROSALES CASTILLO
ARNOLDO JIMENEZ
JASON DEREK BROWN
ALEXIS FLORES
JOSE RODOLFO VILLARREAL-HERNANDEZ
OCTAVIANO JUAREZ-CORRO



#### 4. Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'

In [82]:
url_quake = 'https://www.emsc-csem.org/Earthquake/'
response_quake = req.get(url=url_quake)
soup_quake = BeautifulSoup(response_quake.content, 'html.parser')

In [92]:
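table_body = soup_quake.select('#tbody')[0]

CSS classes of the table cells that hold each field:

- time : tabev6 href
- date : tabev6 href
- lat : tabev1
- long : tabev1
- region : tb_region

In [113]:

# Count the 'tabev1' cells (latitude and longitude values)
print(len(table_body.find_all(class_='tabev1')))
dates = table_body.find_all(class_='tabev6')
for d in dates:
    # lstrip removes a set of characters, not a prefix; here it drops the leading 'earthquake' label
    date_raw = d.text.lstrip('earthquake')
    # Date and time are separated by non-breaking spaces
    date_ = date_raw.split('\xa0')
    #print(date_[0])
    # The time is still fused with the elapsed "... min ago" text at this point
    print(date_[3].rsplit('min ago'))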

100
['20:42:27.532', '']
['20:32:05.542', '']
['20:31:11.843', '']
['20:26:23.048', '']
['20:07:50.51hr 07', '']
['20:03:20.31hr 11', '']
['20:00:26.31hr 14', '']
['19:52:37.51hr 22', '']
['19:52:37.11hr 22', '']
['19:47:58.01hr 27', '']
['19:47:07.01hr 27', '']
['19:40:52.41hr 34', '']
['19:39:26.01hr 35', '']
['19:38:48.31hr 36', '']
['19:37:15.01hr 37', '']
['19:32:40.61hr 42', '']
['19:20:07.91hr 54', '']
['19:10:30.02hr 04', '']
['19:09:57.32hr 05', '']
['19:03:41.52hr 11', '']
['18:57:14.02hr 17', '']
['18:41:30.42hr 33', '']
['18:27:29.22hr 47', '']
['18:12:58.63hr 02', '']
['17:13:16.44hr 01', '']
['17:10:33.24hr 04', '']
['16:49:50.04hr 25', '']
['16:46:16.54hr 28', '']
['16:38:21.14hr 36', '']
['16:07:04.05hr 07', '']
['15:58:25.85hr 16', '']
['15:48:10.85hr 26', '']
['15:44:13.25hr 30', '']
['15:40:49.25hr 34', '']
['15:37:57.95hr 37', '']
['15:36:30.05hr 38', '']
['15:34:27.35hr 40', '']
['15:13:33.06hr 01', '']
['15:13:11.66hr 01', '']
['15:12:04.66hr 02', '']
['14:54:39.7
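
In the output above, the time of each earthquake is still fused with the elapsed time since the event (for example `'20:07:50.51hr 07'` is `20:07:50.5` followed by `1hr 07`). The sketch below shows one way to separate the two with a regular expression. It assumes the seconds always carry exactly one decimal digit and that the elapsed part is either `MM` or `Hhr MM`, which matches the lines shown here but is not confirmed for every row; `split_time_and_age` is a hypothetical helper name.

```python
import re

# Assumed shape: 'HH:MM:SS.f' (one decimal digit), then 'MM' or 'Hhr MM'
FUSED_PATTERN = re.compile(r'(\d{2}:\d{2}:\d{2}\.\d)(?:(\d+)hr )?(\d+)')


def split_time_and_age(fused):
    """Return (time string, minutes elapsed) from a fused cell value."""
    match = FUSED_PATTERN.fullmatch(fused.strip())
    if match is None:
        raise ValueError(f'unexpected format: {fused!r}')
    time_part, hours, minutes = match.groups()
    return time_part, int(hours or 0) * 60 + int(minutes)


# Values copied from the output above
samples = ['20:42:27.532', '20:07:50.51hr 07', '19:10:30.02hr 04']
parsed = [split_time_and_age(s) for s in samples]
print(parsed)
```

The cleaned time strings can then go into the time column of the dataframe, next to the date taken from `date_[0]`.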