### Lab | Web Scraping Multiple Pages

#### Instructions Part 2

#### Practice web scraping. This lab is not related to this week's GNOD project.

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field. Open a new Jupyter notebook and scrape at least 3 of these sites.

1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url = 'https://en.wikipedia.org/wiki/Python'
2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
3. Create a Python list with the names on the FBI's Ten Most Wanted list: url = 'https://www.fbi.gov/wanted/topten'
4. Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe: url = 'https://www.emsc-csem.org/Earthquake/'
5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'
6. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
7. Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

------------------------

In [1]:
# Importing libraries:

from bs4 import BeautifulSoup
import requests
import pandas as pd

1. Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page: url = 'https://en.wikipedia.org/wiki/Python'

In [2]:
url = 'https://en.wikipedia.org/wiki/Python'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    links = soup.find_all('a')

    link_list = [link.get('href') for link in links if link.get('href')]

    wiki_links = [link for link in link_list if link.startswith('/wiki/')]

    print("Wikipedia Links:")
    for link in wiki_links:
        print(link)
else:
    print("Failed to fetch the page")

Wikipedia Links:
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Python
/wiki/Talk:Python
/wiki/Python
/wiki/Python
/wiki/Special:WhatLinksHere/Python
/wiki/Special:RecentChangesLinked/Python
/wiki/Wikipedia:File_Upload_Wizard
/wiki/Special:SpecialPages
/wiki/Pythonidae
/wiki/Python_(genus)
/wiki/Python_(mythology)
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)

In [3]:
# Converting the relative links to absolute URLs:

from urllib.parse import urljoin

absolute_links = [urljoin(url, link) for link in wiki_links]
print("Absolute URLs of Wikipedia Links:")

for link in absolute_links:
    print(link)

Absolute URLs of Wikipedia Links:
https://en.wikipedia.org/wiki/Main_Page
https://en.wikipedia.org/wiki/Wikipedia:Contents
https://en.wikipedia.org/wiki/Portal:Current_events
https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Wikipedia:About
https://en.wikipedia.org/wiki/Help:Contents
https://en.wikipedia.org/wiki/Help:Introduction
https://en.wikipedia.org/wiki/Wikipedia:Community_portal
https://en.wikipedia.org/wiki/Special:RecentChanges
https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard
https://en.wikipedia.org/wiki/Main_Page
https://en.wikipedia.org/wiki/Special:Search
https://en.wikipedia.org/wiki/Help:Introduction
https://en.wikipedia.org/wiki/Special:MyContributions
https://en.wikipedia.org/wiki/Special:MyTalk
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Talk:Python
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki/Special:WhatLinksHere/Python
https://en.wikipedia.org/wiki/
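
The output above repeats several links (`/wiki/Main_Page` and `/wiki/Python` each appear more than once). Below is a minimal sketch of one way to drop the duplicates while keeping first-seen order. `unique_links` is a hypothetical helper name, and the sample list is copied from the output above rather than fetched live:

```python
from urllib.parse import urljoin


def unique_links(links):
    """Return links with duplicates removed, keeping first-seen order."""
    # dict keys preserve insertion order (Python 3.7+), so this de-duplicates
    # without reordering the links the way set() could.
    return list(dict.fromkeys(links))


# A few relative links taken from the printed output, including repeats
sample_links = [
    '/wiki/Main_Page',
    '/wiki/Help:Introduction',
    '/wiki/Main_Page',
    '/wiki/Python',
    '/wiki/Talk:Python',
    '/wiki/Python',
]

base_url = 'https://en.wikipedia.org/wiki/Python'
unique_absolute = [urljoin(base_url, link) for link in unique_links(sample_links)]

for link in unique_absolute:
    print(link)
```

In the notebook, the same helper could be applied to `wiki_links` before building `absolute_links`.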

2. Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'

In [4]:
url2 = 'http://uscode.house.gov/download/download.shtml'
response2 = requests.get(url2)

if response2.status_code == 200:
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    # Select every <div class="usctitle"> block on the download page
    title_elements = soup2.find_all("div", {'class': 'usctitle'})
    for title_element in title_elements:
        print(title_element.get_text(strip=True))
else:
    print("Failed to fetch the page")

All titles in the format selected compressed into a zip archive.

Title 1 - General Provisions٭
Title 2 - The Congress
Title 3 - The President٭
Title 4 - Flag and Seal, Seat of Government, and the States٭
Title 5 - Government Organization and Employees٭
Title 6 - Domestic Security
Title 7 - Agriculture
Title 8 - Aliens and Nationality
Title 9 - Arbitration٭
Title 10 - Armed Forces٭
Title 11 - Bankruptcy٭
Title 12 - Banks and Banking
Title 13 - Census٭
Title 14 - Coast Guard٭
Title 15 - Commerce and Trade
Title 16 - Conservation
Title 17 - Copyrights٭
Title 18 - Crimes and Criminal Procedure٭
Title 19 - Customs Duties
Title 20 - Education
Title 21 - Food and Drugs
Title 22 - Foreign Relations and Intercourse
Title 23 - Highways٭
Title 24 - Hospitals and Asylums
Title 25 - Indians
Title 26 - Internal Revenue Code
Title 27 - Intoxicating Liquors
Title 28 - Judiciary and Judicial Procedure٭
Title 29 - Labor
Title 30 - Mineral Lands and Mining
Title 31 - Money and Finance٭
Title 32 - Nation
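
The listing above prints title names, while the task asks for the number of titles that have changed. Below is a minimal sketch of the counting step. It assumes the page marks changed titles with a `usctitlechanged` class on the `<div>`. That class name is an assumption not confirmed by the output above, so inspect the live page before relying on it. `count_changed_titles` is a hypothetical helper, and the sample HTML is made up to imitate the page structure:

```python
from bs4 import BeautifulSoup


def count_changed_titles(html):
    """Count <div> elements carrying the (assumed) 'usctitlechanged' class."""
    soup = BeautifulSoup(html, 'html.parser')
    # class_ matches on individual class tokens, so plain 'usctitle'
    # divs are not counted here.
    return len(soup.find_all('div', class_='usctitlechanged'))


# Hypothetical snippet imitating the page layout; not real site data.
sample_html = """
<div class="usctitle">Title 1 - General Provisions</div>
<div class="usctitlechanged">Title 2 - The Congress</div>
<div class="usctitle">Title 3 - The President</div>
<div class="usctitlechanged">Title 6 - Domestic Security</div>
"""

print(count_changed_titles(sample_html))
```

In the notebook, the helper could be called as `count_changed_titles(response2.content)` once the request has succeeded.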

5. List all language names and number of related articles in the order they appear in wikipedia.org: url = 'https://www.wikipedia.org/'

In [8]:
url3 = 'https://www.wikipedia.org/'

response3 = requests.get(url3)

if response3.status_code == 200:
    soup3 = BeautifulSoup(response3.content, 'html.parser')

    # The featured languages live in the "central-featured" container
    lang_div = soup3.find('div', {'class': 'central-featured'})

    if lang_div:
        language = []
        article_language = []

        language_blocks = lang_div.find_all('div', {'class': 'central-featured-lang'})

        for block in language_blocks:
            # <strong> holds the language name, <small> the article count
            lang = block.find('strong').text.strip()
            art_lang = block.find('small').text.strip()

            language.append(lang)
            article_language.append(art_lang)

            print(f"Language: {lang}, Article Language: {art_lang}")

        data = {
            "Language": language,
            "Number of Articles": article_language
        }

        df = pd.DataFrame(data)
    else:
        print("Language information not found on the page.")
else:
    print("Failed to fetch the page")

Language: English, Article Language: 6 744 000+ articles
Language: Español, Article Language: 1 906 000+ artículos
Language: Русский, Article Language: 1 947 000+ статей
Language: 日本語, Article Language: 1 392 000+ 記事
Language: Deutsch, Article Language: 2 852 000+ Artikel
Language: Français, Article Language: 2 567 000+ articles
Language: Italiano, Article Language: 1 835 000+ voci
Language: 中文, Article Language: 1 387 000+ 条目 / 條目
Language: العربية, Article Language: 1 221 000+ مقالة
Language: Português, Article Language: 1 113 000+ artigos
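
The article counts above are stored as strings such as `6 744 000+ articles`, which cannot be sorted or summed numerically. Below is a minimal sketch of one way to convert them to integers. `parse_article_count` is a hypothetical helper name, and the sample strings are taken from the output above (the last one uses non-breaking spaces, on the assumption that the live page may do the same):

```python
import re


def parse_article_count(text):
    """Turn a string like '6 744 000+ articles' into the integer 6744000."""
    # Keep only what precedes the '+', then drop every non-digit character
    # (regular or non-breaking spaces used as thousands separators).
    number_part = text.split('+')[0]
    digits = re.sub(r'\D', '', number_part)
    return int(digits) if digits else None


samples = [
    '6 744 000+ articles',
    '1 387 000+ 条目 / 條目',
    '1\xa0113\xa0000+ artigos',
]

counts = [parse_article_count(s) for s in samples]
print(counts)
```

Applied to the DataFrame built in the cell above, this could look like `df['Number of Articles'].map(parse_article_count)`.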


6. Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'

In [None]:
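url4 = 'https://data.gov.uk/'

response4 = requests.get(url4)

if response4.status_code == 200:
    soup4 = BeautifulSoup(response4.content, 'html.parser')
    # Topic headings on the landing page name the available dataset categories
    dataset_types = soup4.find_all('h3', class_='govuk-heading-s dgu-topics__heading')

    types_list = [dataset.get_text(strip=True) for dataset in dataset_types]
    print(types_list)
else:
    print("Failed to fetch the page")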

7. Display the top 10 languages by number of native speakers stored in a pandas dataframe: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
url5 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response5 = requests.get(url5)

if response5.status_code == 200:
    soup5 = BeautifulSoup(response5.content, 'html.parser')

    table = soup5.find('table', {'class': 'wikitable sortable static-row-numbers'})

    if table is None:
        print("Language table not found on the page.")
    else:
        language_data = []
        rows = table.find_all('tr')

        # Skip the header row and read the next ten rows
        for row in rows[1:11]:
            cols = row.find_all('td')
            if len(cols) < 2:
                continue  # row has no data cells (e.g. an extra header row)
            language = cols[0].get_text(strip=True)
            native_speakers = cols[1].get_text(strip=True)
            language_data.append({'Language': language, 'Native Speakers (millions)': native_speakers})

        df1 = pd.DataFrame(language_data)
else:
    print("Failed to fetch the page.")


In [None]:
df1