# Lab | Web Scraping Multiple Pages

### Instructions Part 2

Create a Python list with the names of the FBI's top ten Most Wanted fugitives: url = 'https://www.fbi.gov/wanted/topten'

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import random
from time import sleep

In [2]:
url = "https://www.fbi.gov/wanted/topten"
response = requests.get(url)
response.status_code # 200 status code means OK!

200

In [3]:
soup = BeautifulSoup(response.content, "html.parser")
#soup

In [4]:
fugitives = []
for link in soup.select('h3.title > a'):
    fugitives.append(link['href'])

fugitives

['https://www.fbi.gov/wanted/topten/alejandro-castillo',
 'https://www.fbi.gov/wanted/topten/ruja-ignatova',
 'https://www.fbi.gov/wanted/topten/donald-eugene-fields-ii',
 'https://www.fbi.gov/wanted/topten/alexis-flores',
 'https://www.fbi.gov/wanted/topten/arnoldo-jimenez',
 'https://www.fbi.gov/wanted/topten/omar-alexander-cardenas',
 'https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias',
 'https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel',
 'https://www.fbi.gov/wanted/topten/wilver-villegas-palomino',
 'https://www.fbi.gov/wanted/topten/jose-rodolfo-villarreal-hernandez']

In [5]:
soup1 = BeautifulSoup(requests.get(fugitives[0]).content, "html.parser") # trying with the first one
#soup1

In [6]:
soup1.select("#content-core > div > div > h1")
soup1.select("#content-core > div > div > p")
soup1.select("#content-core > div > div > div.wanted-person-aliases > p")

[<p>Alexandro Castillo, Alex Castillo, Alejandro Rosales, Alejandro Castillo, Alejandro Rosales-Castillo, Alejandro Rosalescastillo</p>]
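`select` always returns a list, which is why the output above is a one-element list and why an index such as `[0]` is needed to reach the tag. As a minimal sketch, the lookup can be wrapped in a helper built on `select_one`, which returns `None` when nothing matches. The helper name `first_text` and the sample HTML are hypothetical and only stand in for a real fugitive page.

```python
from bs4 import BeautifulSoup

def first_text(soup, selector, default="NA"):
    """Return the stripped text of the first element matching selector, or default."""
    element = soup.select_one(selector)  # None when nothing matches
    return element.get_text(strip=True) if element is not None else default

# Hypothetical stand-in for a fugitive page, so the sketch runs offline.
sample_html = """
<div id="content-core"><div><div>
  <h1>JANE DOE</h1>
  <p>Example Offense</p>
</div></div></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

name = first_text(sample_soup, "#content-core > div > div > h1")
# The sample has no aliases block, so the default is returned instead of raising.
alias = first_text(sample_soup, "#content-core > div > div > div.wanted-person-aliases > p")
print(name, alias)
```

With a helper like this, a page that lacks one of the fields yields the default value rather than an exception.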

In [7]:
names = [] 
crimes = []
aliases = []

# this pulls in only the info needed from each page

for i in fugitives:
    soup1 = BeautifulSoup(requests.get(i).content, "html.parser")
    
    try:
        names.append(soup1.select("#content-core > div > div > h1")[0].get_text())
    except IndexError:  # the selector matched nothing on this page
        names.append('NA')
    
    try:
        crimes.append(soup1.select("#content-core > div > div > p")[0].get_text())
    except IndexError:  # the selector matched nothing on this page
        crimes.append('NA')
    
    try:
        aliases.append(soup1.select("#content-core > div > div > div.wanted-person-aliases > p")[0].get_text())
    except IndexError:  # no aliases block on this page
        aliases.append('NA')

    wait_time = random.randint(1,4000)
    print("I will sleep for " + str(wait_time/1000) + " second/s.")
    sleep(wait_time/1000)

I will sleep for 0.282 second/s.
I will sleep for 3.202 second/s.
I will sleep for 2.402 second/s.
I will sleep for 0.466 second/s.
I will sleep for 0.639 second/s.
I will sleep for 0.143 second/s.
I will sleep for 2.599 second/s.
I will sleep for 3.669 second/s.
I will sleep for 3.567 second/s.
I will sleep for 3.346 second/s.
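The waits printed above range from about 0.1 to 3.7 seconds, because `random.randint(1,4000)` can draw values close to zero. One possible variation, sketched below, draws the delay with `random.uniform` and enforces a minimum pause. The function name `polite_pause`, the 0.5 second floor, and the `sleep_fn` parameter are assumptions made for illustration, not part of the lab.

```python
import random
from time import sleep

def polite_pause(min_seconds=0.5, max_seconds=4.0, sleep_fn=sleep):
    """Sleep for a random duration between min_seconds and max_seconds and return it."""
    wait_time = random.uniform(min_seconds, max_seconds)
    sleep_fn(wait_time)
    return wait_time

# Demonstration with a no-op sleep so the example finishes instantly.
waited = polite_pause(sleep_fn=lambda seconds: None)
print(f"Would have slept for {waited:.3f} seconds.")
```

Inside the scraping loop, calling `polite_pause()` with its defaults would replace the three lines that compute, print, and apply `wait_time`.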


In [8]:
names

['ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'DONALD EUGENE FIELDS II',
 'ALEXIS FLORES',
 'ARNOLDO JIMENEZ',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'WILVER VILLEGAS-PALOMINO',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']

In [9]:
most_wanted = pd.DataFrame({"Name":names,"Wanted For":crimes,"Alias":aliases})
most_wanted

|   | Name | Wanted For | Alias |
|---|------|------------|-------|
| 0 | ALEJANDRO ROSALES CASTILLO | Unlawful Flight to Avoid Prosecution - Murder | Alexandro Castillo, Alex Castillo, Alejandro R... |
| 1 | RUJA IGNATOVA | Conspiracy to Commit Wire Fraud; Wire Fraud; C... | Dr. Ruja Ignatova, Ruja Plamenova Ignatova, Ru... |
| 2 | DONALD EUGENE FIELDS II | Sex Trafficking of Children | Don Fields, Donald Eugene Fields Jr., Eugene F... |
| 3 | ALEXIS FLORES | Unlawful Flight to Avoid Prosecution - Kidnapp... | Mario Flores, Mario Roberto Flores, Mario F. R... |
| 4 | ARNOLDO JIMENEZ | Unlawful Flight to Avoid Prosecution - First D... | Arnoldo Gimenez, Arnoldo Rochel Jimenez |
| 5 | OMAR ALEXANDER CARDENAS | Unlawful Flight to Avoid Prosecution - Murder | |
| 6 | YULAN ADONAY ARCHAGA CARIAS | Racketeering Conspiracy (RICO); Cocaine Import... | Alexander Mendoza, Yulan Andony Archaga Carias... |
| 7 | BHADRESHKUMAR CHETANBHAI PATEL | Unlawful Flight to Avoid Prosecution - First D... | Bhadreshkumar C. Patel |
| 8 | WILVER VILLEGAS-PALOMINO | Narco-Terrorism; International Cocaine Distrib... | Carlos El Puerco, El Puerco, Wilver Villegas, ... |
| 9 | JOSE RODOLFO VILLARREAL-HERNANDEZ | Interstate Stalking and Conspiracy to Commit M... | "El Gato" |


Display the top 10 languages by number of native speakers, stored in a pandas DataFrame: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [10]:
url2 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [11]:
response2 = requests.get(url2)
print(response2.status_code)
soup2 = BeautifulSoup(response2.content, "html.parser")
#soup2

200


In [12]:
table = soup2.select("table")[1]

languages = []

for i in table.select("tr td a"):
    languages.append(i.get_text())

languages

['Mandarin Chinese',
 'Spanish',
 'English',
 'Arabic',
 'Hindi',
 'Bengali',
 'Portuguese',
 'Russian',
 'Japanese',
 'Western Punjabi',
 'Javanese']

How can more information be pulled from that table, e.g. the number of speakers?
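One way to approach this question is to walk the table row by row rather than collecting every link, so that each cell stays attached to its row. The sketch below runs on a small hypothetical table. The column names and values are invented for illustration, and the real Wikipedia table may use different headers or place some row cells in `<th>` tags, in which case the selectors would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a wikitable; the real page's columns may differ.
sample_html = """
<table>
  <tr><th>Language</th><th>Native speakers (millions)</th><th>Language family</th></tr>
  <tr><td><a href="/wiki/A">Language A</a></td><td>100</td><td><a href="/wiki/F1">Family 1</a></td></tr>
  <tr><td><a href="/wiki/B">Language B</a></td><td>50</td><td><a href="/wiki/F2">Family 2</a></td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").select_one("table")

# Column names come from the header cells.
headers = [th.get_text(strip=True) for th in table.select("tr th")]

rows = []
for tr in table.select("tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if len(cells) == len(headers):  # skips the header row, which has no <td> cells
        rows.append(dict(zip(headers, cells)))

print(rows)
```

Passing `rows` to `pd.DataFrame` would give one column per header. The cell values are still strings at that point, so a numeric column such as the speaker count would need converting before sorting or arithmetic.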

In [13]:
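data2 = pd.DataFrame({"Language":languages[:10]}) # the first ten entries, i.e. the top 10
data2

|   | Language |
|---|----------|
| 0 | Mandarin Chinese |
| 1 | Spanish |
| 2 | English |
| 3 | Arabic |
| 4 | Hindi |
| 5 | Bengali |
| 6 | Portuguese |
| 7 | Russian |
| 8 | Japanese |
| 9 | Western Punjabi |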


A list of the different kinds of datasets available on data.gov.uk: url = 'https://www.data.gov.uk/'

In [14]:
url3 = 'https://www.data.gov.uk'
response3 = requests.get(url3)
print(response3.status_code)
soup3 = BeautifulSoup(response3.content, "html.parser")
#soup3

200


In [15]:
departments = []

for e in soup3.select('h3'):
    departments.append(e.get_text())
    
departments

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

In [16]:
soup3.select('h3 a')

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>,
 <a class="govuk-link" href="/search?f

In [17]:
links2 = []

for e in soup3.select('h3 a'):
    links2.append(url3+e["href"])
    
links2

['https://www.data.gov.uk/search?filters%5Btopic%5D=Business+and+economy',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Crime+and+justice',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Defence',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Education',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Environment',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Government',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Government+spending',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Health',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Mapping',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Society',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Towns+and+cities',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Transport',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Digital+service+performance',
 'https://www.data.gov.uk/search?filters%5Btopic%5D=Government+reference+data']
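Concatenating `url3+e["href"]` works here because every `href` on the page is a root-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative links against a base URL and leaves absolute links untouched. In the sketch below, the first `href` is copied from the output above, while the second is a hypothetical absolute link added to show the difference.

```python
from urllib.parse import urljoin

base_url = "https://www.data.gov.uk"

hrefs = [
    "/search?filters%5Btopic%5D=Defence",  # relative href, as seen in the output above
    "https://example.org/elsewhere",       # hypothetical absolute href
]

# urljoin resolves relative paths against base_url and keeps absolute URLs as they are.
links = [urljoin(base_url, href) for href in hrefs]
print(links)
```

Plain concatenation would turn the second entry into a malformed URL, which is the case `urljoin` guards against.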

In [18]:
dept = pd.DataFrame({"Department":departments,"Link To Data":links2})
dept

|   | Department | Link To Data |
|---|------------|--------------|
| 0 | Business and economy | https://www.data.gov.uk/search?filters%5Btopic... |
| 1 | Crime and justice | https://www.data.gov.uk/search?filters%5Btopic... |
| 2 | Defence | https://www.data.gov.uk/search?filters%5Btopic... |
| 3 | Education | https://www.data.gov.uk/search?filters%5Btopic... |
| 4 | Environment | https://www.data.gov.uk/search?filters%5Btopic... |
| 5 | Government | https://www.data.gov.uk/search?filters%5Btopic... |
| 6 | Government spending | https://www.data.gov.uk/search?filters%5Btopic... |
| 7 | Health | https://www.data.gov.uk/search?filters%5Btopic... |
| 8 | Mapping | https://www.data.gov.uk/search?filters%5Btopic... |
| 9 | Society | https://www.data.gov.uk/search?filters%5Btopic... |
