# Web Scraping Lab

This notebook contains exercises to practise your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.
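
The first tip can be made concrete with a small helper. This is a minimal sketch that uses only the standard library's `http.HTTPStatus`; the helper names `is_success` and `describe_status` are made up for illustration and are not part of `requests`.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for 2xx codes, which normally mean the request worked."""
    return 200 <= status_code < 300


def describe_status(status_code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '404 Not Found'."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        # The code is not one registered in the standard library
        return f"{status_code} (unknown status)"


for code in (200, 404, 503):
    verdict = "ok" if is_success(code) else "check the request"
    print(describe_status(code), "->", verdict)
```

With a real request you would pass `response.status_code` to these helpers, and `print(response.text[:500])` gives a quick look at the start of the body without flooding the notebook.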

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
response = requests.get(url=url)
response.status_code

200

In [4]:
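# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")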

In [5]:
soup


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-7ac375ffa6839741afb345694ec42581.css" integrity="sha512-esN1/6aDl0Gvs0VpTsQlgSFdM9A4iTeMmOmXnpAg1dy/FpI38lc+2tsMbWNz29y7yYSr7FiJt4EyTKfBU7ZsZQ==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-243ed7c0c7cea9f4cf2aba3a8442d64d.css" integrity="sha512-JD7XwMfOqfTPKro6hELWTUp8kPg2kxLmSGKmr/9lCzva5wqdN1n0AVkJid3/oyd+QJ0LsjQq2h+tLL4mqxdfnw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.gi

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
```

In [6]:
developer_names = soup.find_all(
    name="h1", 
    class_="h3 lh-condensed"
)

In [7]:
developer_names[0]

<h1 class="h3 lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":9248427,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="afec7e93325c060638c8494f34211f288e447e9ae021388fc86b97132e3c005e" data-view-component="true" href="/AkihiroSuda">
            Akihiro Suda
</a> </h1>

In [8]:
names = [name.text.replace(" ","").replace("\n","") for name in developer_names]
print(names)

['AkihiroSuda', 'SamuelColvin', 'ThakeeNathees', 'KrisNóva', 'ClaudéricDemers', 'chencheng(云谦)', 'JeffGeerling', 'BrentShaffer', 'BlakeBlackshear', 'KirkByers', 'DavidKhourshid', 'MaxHowell', 'ChrisBanes', '砖家', 'Mr.doob', 'DavidRodríguez', 'LuongVo', 'LysandreDebut', 'NikitaSobolev', 'DanielLemire', 'YoshiyaHinosawa', 'JesseDuffield', 'KoushikDutta', 'MinkoGechev', 'MikePenz']
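
The chained `replace` calls above also remove the spaces inside each name (`'AkihiroSuda'`). If you would rather keep them, one option is to collapse runs of whitespace with `split()` and `join()`. The sketch below runs on a small HTML sample modelled on the element printed earlier; the second entry and its `href` are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the <h1> printed above (attributes trimmed)
sample_html = """
<h1 class="h3 lh-condensed">
<a href="/AkihiroSuda">
            Akihiro Suda
</a> </h1>
<h1 class="h3 lh-condensed">
<a href="/samuelcolvin">
            Samuel Colvin
</a> </h1>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
headings = sample_soup.find_all("h1", class_="h3 lh-condensed")

# split() with no arguments drops all leading, trailing and repeated whitespace,
# and " ".join() puts single spaces back between the words
clean_names = [" ".join(heading.get_text().split()) for heading in headings]
print(clean_names)
```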


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [9]:
# This is the url you will scrape in this exercise
url_repos = 'https://github.com/trending/python?since=daily'

In [10]:
response = requests.get(url=url_repos)
response.status_code

200

In [11]:
soup = BeautifulSoup(response.content, "html.parser")
soup


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-7ac375ffa6839741afb345694ec42581.css" integrity="sha512-esN1/6aDl0Gvs0VpTsQlgSFdM9A4iTeMmOmXnpAg1dy/FpI38lc+2tsMbWNz29y7yYSr7FiJt4EyTKfBU7ZsZQ==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-243ed7c0c7cea9f4cf2aba3a8442d64d.css" integrity="sha512-JD7XwMfOqfTPKro6hELWTUp8kPg2kxLmSGKmr/9lCzva5wqdN1n0AVkJid3/oyd+QJ0LsjQq2h+tLL4mqxdfnw==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.gi

In [12]:
repos = soup.find_all(
    name="h1", 
    class_="h3 lh-condensed"
)

In [13]:
repos_list = [repo.text.replace(" ","").replace("\n","") for repo in repos]
print(repos_list)

['google-research/deeplab2', 'PaddlePaddle/PaddleClas', 'trekhleb/learn-python', 'rwightman/pytorch-image-models', 'tiangolo/fastapi', 'TCM-Course-Resources/Practical-Ethical-Hacking-Resources', 'Python-World/python-mini-projects', 'Asabeneh/30-Days-Of-Python', 'satwikkansal/wtfpython', 'alibaba/AliceMind', 'python-discord/cj8-qualifier', 'huggingface/transformers', 'Tencent/TFace', 'PeterL1n/BackgroundMattingV2', 'great-expectations/great_expectations', 'facebookresearch/simsiam', 'fighting41love/funNLP', 'boto/boto3', 'ZHKKKe/MODNet', 'RasaHQ/rasa', 'DAGsHub/fds', 'getredash/redash', 'unit8co/darts', 'CorentinJ/Real-Time-Voice-Cloning', 'chubin/wttr.in']


#### Display all the image links from the Walt Disney Wikipedia page (BORJA)

In [19]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [20]:
# your code
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
# Escape the dot so the pattern matches a literal ".jpg" rather than any character followed by "jpg"
images = bs.find_all('img', {'src': re.compile(r'\.jpg')})
for image in images:
    print(image['src'] + '\n')

//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg

//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/13/DisneySchiphol1951.jpg/220px-DisneySchiphol1951.jpg

//upload.wikimedia.org/w
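
Two details of this output are worth a closer look: a regex dot matches any character unless it is escaped, and the `src` values are protocol-relative (they start with `//`). The sketch below illustrates both using only `re`. The first two sample values are copied from the output above, the third is made up to show a false positive, and `to_absolute` is a hypothetical helper name.

```python
import re

sources = [
    "//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg",
    "//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg",
    "//example.org/static/logo_jpg_small.png",  # made-up value: not a JPEG
]

# "\." matches a literal dot and "$" anchors the extension at the end of the URL
jpg_pattern = re.compile(r"\.jpe?g$", re.IGNORECASE)


def to_absolute(src: str, scheme: str = "https") -> str:
    """Prefix protocol-relative URLs (those starting with //) with a scheme."""
    return f"{scheme}:{src}" if src.startswith("//") else src


jpg_links = [to_absolute(src) for src in sources if jpg_pattern.search(src)]
for link in jpg_links:
    print(link)
```

An unescaped pattern such as `'.jpg'` would also match the third sample, because `.` accepts the underscore in `_jpg`.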

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page (SANTI)

In [31]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [32]:
response = requests.get(url ='https://en.wikipedia.org/wiki/Python' )
response

<Response [200]>

In [33]:
# your code
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"adb09aac-960c-4b26-8db2-c4d19cfbedd9","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":1026447930,"wgRevisionId":1026447930,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common name disambiguat

In [34]:
links = []

# find_all is the current name; findAll is a deprecated alias
for link in soup.find_all('a'):
    links.append(link.get('href'))

print(links)

[None, '#mw-head', '#searchInput', 'https://en.wiktionary.org/wiki/Python', 'https://en.wiktionary.org/wiki/python', '/wiki/Pythons', '/wiki/Python_(genus)', '#Computing', '#People', '#Roller_coasters', '#Vehicles', '#Weaponry', '#Other_uses', '#See_also', '/w/index.php?title=Python&action=edit&section=1', '/wiki/Python_(programming_language)', '/wiki/CMU_Common_Lisp', '/wiki/PERQ#PERQ_3', '/w/index.php?title=Python&action=edit&section=2', '/wiki/Python_of_Aenus', '/wiki/Python_(painter)', '/wiki/Python_of_Byzantium', '/wiki/Python_of_Catana', '/wiki/Python_Anghelo', '/w/index.php?title=Python&action=edit&section=3', '/wiki/Python_(Efteling)', '/wiki/Python_(Busch_Gardens_Tampa_Bay)', '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)', '/w/index.php?title=Python&action=edit&section=4', '/wiki/Python_(automobile_maker)', '/wiki/Python_(Ford_prototype)', '/w/index.php?title=Python&action=edit&section=5', '/wiki/Python_(missile)', '/wiki/Python_(nuclear_primary)', '/wiki/Colt_Python', '/w/in
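
The raw list mixes `None` (anchors without an `href`), in-page fragments such as `#mw-head`, and site-relative paths. A possible clean-up, sketched with `urllib.parse.urljoin` from the standard library, is shown below. The sample values are copied from the output above, with one entry repeated on purpose to show de-duplication.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Python"

raw_links = [
    None,
    "#mw-head",
    "https://en.wiktionary.org/wiki/Python",
    "/wiki/Pythons",
    "/wiki/Python_(genus)",
    "/wiki/Pythons",  # repeated on purpose for this example
]

# Skip missing hrefs and in-page fragments, then resolve the rest against the page URL
absolute_links = [
    urljoin(base_url, href)
    for href in raw_links
    if href and not href.startswith("#")
]

# dict.fromkeys removes duplicates while keeping first-seen order
unique_links = list(dict.fromkeys(absolute_links))
print(unique_links)
```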

#### Number of Titles that have changed in the United States Code since its last release point (BORJA)

In [14]:
# This is the url you will scrape in this exercise
url_usa = 'http://uscode.house.gov/download/download.shtml'

In [15]:
response = requests.get(url_usa)
response.status_code

200

In [16]:
#your code
database_soup = BeautifulSoup(response.content, "html.parser")
database_soup

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=scripts" type="text/javasc

In [17]:
titles = database_soup.find_all("div", class_="usctitlechanged")

print(f"The number of titles that have changed: {len(titles)}")

The number of titles that have changed: 35


In [18]:
# Iterate over the elements directly and drop the first 12 characters of each entry's text
for title in titles:
    print(title.text[12:])

Title 2 - The Congress

        
Title 3 - The President ٭

Title 5 - Government Organization and Employees ٭

Title 6 - Domestic Security

        
Title 7 - Agriculture

        
Title 8 - Aliens and Nationality

        
Title 10 - Armed Forces ٭

Title 12 - Banks and Banking

        
Title 14 - Coast Guard ٭

Title 15 - Commerce and Trade

        
Title 16 - Conservation

        
Title 18 - Crimes and Criminal Procedure ٭

Title 20 - Education

        
Title 22 - Foreign Relations and Intercourse

        
Title 24 - Hospitals and Asylums

        
Title 28 - Judiciary and Judicial Procedure ٭

Title 31 - Money and Finance ٭

Title 32 - National Guard ٭

Title 33 - Navigation and Navigable Waters

        
Title 34 - Crime Control and Law Enforcement

        
Title 36 - Patriotic and National Observances, Ceremonies, and Organizations ٭

Title 37 - Pay and Allowances of the Uniformed Services ٭

Title 38 - Veterans' Benefits ٭

Title 40 - Public Buildings, Property, and Works 
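
The fixed `[12:]` slice leaves trailing whitespace and a marker character on some lines. A more tolerant approach, sketched below, normalises whitespace first and then pulls the number and name apart with a regex. The sample strings imitate the output above; writing the marker as `\u066d` is an assumption about which character the page uses, and `parse_title` is a hypothetical helper name.

```python
import re

raw_titles = [
    "Title 2 - The Congress\n\n        ",
    "Title 3 - The President \u066d\n",
    "Title 36 - Patriotic and National Observances, Ceremonies, and Organizations \u066d\n",
]

title_pattern = re.compile(r"Title\s+(\d+)\s+-\s+(.+)")


def parse_title(raw: str):
    """Return (number, name) for a title line, or None if it does not match."""
    # Collapse whitespace runs, then strip the trailing marker and spaces
    text = " ".join(raw.split()).rstrip(" \u066d")
    match = title_pattern.fullmatch(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


for raw in raw_titles:
    print(parse_title(raw))
```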

#### Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [32]:
# This is the url you will scrape in this exercise
url_earth = 'https://www.emsc-csem.org/Earthquake/'

In [34]:
response = requests.get(url = url_earth)
response.status_code

200

In [37]:
earth_soup = BeautifulSoup(response.content, "html.parser")
earth_soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/><meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/><meta content="43b36314ccb77957" name="y_key"/><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta content="en" http-equiv="Content-Language"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftershock,tremor" name="keyw

In [46]:
tbody = earth_soup.find("tbody")

In [132]:
earthquakes = tbody.find_all("tr")

In [144]:
def consigueme_la_info(lista_dic, earthquake):
    """Append one earthquake row's fields to lista_dic and return the list."""
    # Look up the row's cells once instead of once per field
    cells = earthquake.find_all("td")

    dic = {"date": cells[3].text[10:20],
           "time": cells[3].text[23:31],
           "latitud_coord": cells[4].text.strip(),
           "lat_orient": cells[5].text.strip(),
           "longitud_coord": cells[6].text.strip(),
           "longitud_orient": cells[7].text.strip(),
           "country": cells[11].text.strip()  # region name as shown on the page
    }

    lista_dic.append(dic)

    return lista_dic

In [145]:
final_list = []

for i in range(20): 
    final_list = consigueme_la_info(final_list, earthquakes[i])

In [148]:
final_list

[{'date': '2021-06-24',
  'time': '13:29:25',
  'latitud_coord': '47.54',
  'lat_orient': 'N',
  'longitud_coord': '8.17',
  'longitud_orient': 'E',
  'country': 'SWITZERLAND'},
 {'date': '2021-06-24',
  'time': '13:24:13',
  'latitud_coord': '36.38',
  'lat_orient': 'N',
  'longitud_coord': '27.00',
  'longitud_orient': 'E',
  'country': 'DODECANESE IS.-TURKEY BORDER REG'},
 {'date': '2021-06-24',
  'time': '13:01:41',
  'latitud_coord': '39.10',
  'lat_orient': 'N',
  'longitud_coord': '21.69',
  'longitud_orient': 'E',
  'country': 'GREECE'},
 {'date': '2021-06-24',
  'time': '12:45:43',
  'latitud_coord': '19.20',
  'lat_orient': 'N',
  'longitud_coord': '155.49',
  'longitud_orient': 'W',
  'country': 'ISLAND OF HAWAII, HAWAII'},
 {'date': '2021-06-24',
  'time': '12:42:40',
  'latitud_coord': '36.34',
  'lat_orient': 'N',
  'longitud_coord': '27.01',
  'longitud_orient': 'E',
  'country': 'DODECANESE IS.-TURKEY BORDER REG'},
 {'date': '2021-06-24',
  'time': '12:09:57',
  'latitu

In [149]:
earthquakes_df = pd.DataFrame(final_list)

In [151]:
earthquakes_df

| | date | time | latitud_coord | lat_orient | longitud_coord | longitud_orient | country |
|---|---|---|---|---|---|---|---|
| 0 | 2021-06-24 | 13:29:25 | 47.54 | N | 8.17 | E | SWITZERLAND |
| 1 | 2021-06-24 | 13:24:13 | 36.38 | N | 27.0 | E | DODECANESE IS.-TURKEY BORDER REG |
| 2 | 2021-06-24 | 13:01:41 | 39.1 | N | 21.69 | E | GREECE |
| 3 | 2021-06-24 | 12:45:43 | 19.2 | N | 155.49 | W | ISLAND OF HAWAII, HAWAII |
| 4 | 2021-06-24 | 12:42:40 | 36.34 | N | 27.01 | E | DODECANESE IS.-TURKEY BORDER REG |
| 5 | 2021-06-24 | 12:09:57 | 25.62 | S | 179.94 | W | SOUTH OF FIJI ISLANDS |
| 6 | 2021-06-24 | 12:05:42 | 36.35 | N | 27.02 | E | DODECANESE IS.-TURKEY BORDER REG |
| 7 | 2021-06-24 | 12:03:27 | 40.08 | N | 124.0 | W | NORTHERN CALIFORNIA |
| 8 | 2021-06-24 | 12:01:19 | 36.31 | N | 27.0 | E | DODECANESE IS.-TURKEY BORDER REG |
| 9 | 2021-06-24 | 11:56:30 | 37.87 | N | 118.21 | W | NEVADA |
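
All scraped values are strings, so the coordinates cannot be sorted or plotted numerically yet. The sketch below shows one way to derive signed coordinates and a timestamp column with pandas. The three records are copied from the output above; the new column names (`latitude`, `longitude`, `datetime`) are choices made for this example.

```python
import pandas as pd

records = [
    {"date": "2021-06-24", "time": "13:29:25", "latitud_coord": "47.54", "lat_orient": "N",
     "longitud_coord": "8.17", "longitud_orient": "E", "country": "SWITZERLAND"},
    {"date": "2021-06-24", "time": "12:45:43", "latitud_coord": "19.20", "lat_orient": "N",
     "longitud_coord": "155.49", "longitud_orient": "W", "country": "ISLAND OF HAWAII, HAWAII"},
    {"date": "2021-06-24", "time": "12:09:57", "latitud_coord": "25.62", "lat_orient": "S",
     "longitud_coord": "179.94", "longitud_orient": "W", "country": "SOUTH OF FIJI ISLANDS"},
]
quakes = pd.DataFrame(records)

# North and East are positive, South and West are negative
sign = {"N": 1, "E": 1, "S": -1, "W": -1}
quakes["latitude"] = quakes["latitud_coord"].astype(float) * quakes["lat_orient"].map(sign)
quakes["longitude"] = quakes["longitud_coord"].astype(float) * quakes["longitud_orient"].map(sign)

# Combine the two text columns into a single timestamp column
quakes["datetime"] = pd.to_datetime(quakes["date"] + " " + quakes["time"])

print(quakes[["datetime", "latitude", "longitude", "country"]])
```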


#### Display the date, days, title, city and country of the next 25 hackathon events as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'
url_hack = 'https://hackevents.co/search/anything/anywhere/anytime' 

In [67]:
# The websites are not working

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [156]:
# This is the url you will scrape in this exercise
url_wiki = 'https://www.wikipedia.org/'

In [157]:
response = requests.get(url=url_wiki)
response.status_code

200

In [158]:
wiki_soup = BeautifulSoup(response.content, "html.parser")

In [159]:
central = wiki_soup.find(
    name="div", 
    class_="central-featured"
)

In [204]:
len(central.find_all("div"))

10

In [210]:
central.find_all("div")[2]

<div class="central-featured-lang lang3" dir="ltr" lang="es">
<a class="link-box" data-slogan="La enciclopedia libre" href="//es.wikipedia.org/" id="js-link-box-es" title="Español — Wikipedia — La enciclopedia libre">
<strong>Español</strong>
<small><bdi dir="ltr">1 694 000+</bdi> <span>artículos</span></small>
</a>
</div>

In [206]:
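lang_list = list()

# Look up the language boxes once instead of on every iteration
lang_divs = central.find_all("div")

for i in range(10):
    lines = lang_divs[i].text.split("\n")
    lang = lines[2]
    # Drop the trailing "+" and the non-breaking spaces used as thousands separators
    number_art = lines[3].split("+")[0].replace("\xa0","")
    print(lang, number_art)
    
    dic = {"language": lang, "number_of_articles": number_art}
    
    lang_list.append(dic)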

English 6321000
日本語 1274000
Español 1694000
Deutsch 2588000
Русский 1733000
Français 2338000
中文 1205000
Italiano 1700000
Português 1066000
Polski 1479000


In [207]:
lang_list

# This may not match the order displayed on the page. I don't know how to get the
# displayed order, rather than the order `find_all` returns, without sorting manually.

[{'language': 'English', 'number_of_articles': '6321000'},
 {'language': '日本語', 'number_of_articles': '1274000'},
 {'language': 'Español', 'number_of_articles': '1694000'},
 {'language': 'Deutsch', 'number_of_articles': '2588000'},
 {'language': 'Русский', 'number_of_articles': '1733000'},
 {'language': 'Français', 'number_of_articles': '2338000'},
 {'language': '中文', 'number_of_articles': '1205000'},
 {'language': 'Italiano', 'number_of_articles': '1700000'},
 {'language': 'Português', 'number_of_articles': '1066000'},
 {'language': 'Polski', 'number_of_articles': '1479000'}]
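
Splitting the text on `"\n"` depends on the exact line layout of the markup. An alternative is to read the `<strong>` and `<bdi>` tags directly, as in the sketch below. The sample HTML is a trimmed copy of the element printed above, with the thousands separators written as `\xa0` to match the `replace("\xa0", "")` call in the original loop.

```python
import re

from bs4 import BeautifulSoup

sample_html = (
    '<div class="central-featured">'
    '<div class="central-featured-lang lang3" dir="ltr" lang="es">'
    '<a class="link-box" href="//es.wikipedia.org/">'
    '<strong>Español</strong>'
    '<small><bdi dir="ltr">1\xa0694\xa0000+</bdi> <span>artículos</span></small>'
    '</a></div></div>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

languages = []
for box in sample_soup.find_all("div", class_="central-featured-lang"):
    name = box.find("strong").get_text(strip=True)
    # Keep only the digits of the article count ("1 694 000+" becomes "1694000")
    digits = re.sub(r"\D", "", box.find("bdi").get_text())
    languages.append({"language": name, "number_of_articles": int(digits)})

print(languages)
```

Storing the count as an `int` makes it possible to sort the list later, for example with `sorted(languages, key=lambda d: d["number_of_articles"], reverse=True)`.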

#### A list with the different kinds of datasets available in data.gov.uk (BORJA)

In [26]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [27]:
response = requests.get(url)
response.status_code

200

In [28]:
#your code 
database_soup = BeautifulSoup(response.content, "html.parser")
database_soup


<!DOCTYPE html>

<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]-->
<!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->
<html class="govuk-template">
<head>
<meta charset="utf-8"/>
<title>Find open data - data.gov.uk</title>
<meta content="#0b0c0c" name="theme-color">
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/find-assets/application-7567a3b31be2f40957b40fb3d87bf807ac2d1e64b4429aeb5f57bdf77e57a82b.css" media="screen" rel="stylesheet"/>
<meta content="authenticity_token" name="csrf-param">
<meta content="3o2WeC367EcVGFHKB+kaxuCI7lHuakA9WGWgaloZ8iKVNZMN6ffBb5crqXChq81CwoxXAsuVVJcj+IIbu0DZlQ==" name="csrf-token"/>
</meta></meta></head>
<body class="govuk-template__body">
<script>document.body.className = ((document.body.className) ? document.body.className + ' js-enabled' : 'js-enabled');</script>
<div aria-label="cookie banner" class="gem-c-cookie-banner govuk-clearfix" data-module="cookie-banner" data-nosnippet="" id="global-cookie-

In [29]:
database = database_soup.find_all("h3")

# Each <h3> heading holds one dataset category
for heading in database:
    print(heading.text)

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport
Digital service performance
Government reference data


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe (SANTI)

In [36]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [37]:
response = requests.get(url = url)
response.status_code

200

In [38]:
Top10_soup = BeautifulSoup(response.content, "html.parser")
Top10_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e3895c30-5cbf-4481-9af2-ff357998ae9c","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_languages_by_number_of_native_speakers","wgTitle":"List of languages by number of native speakers","wgCurRevisionId":1028742968,"wgRevisionId":1028742968,"wgArticleId":405385,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected pages","Articles with short de

In [40]:
#your code
tbody = Top10_soup.find_all("tbody")
len(tbody)

10

In [43]:
t = tbody[1]

Top10 = t.find_all("tr")

In [44]:
def dame_el_top_10(lista_dic, top):
    # Take rows 1 to 10 of the table (row 0 is skipped)
    for i in range(1,11):
        # Keep the text before any "[" footnote marker
        language = top[i].find_all("td")[1].text.strip().split("[")[0]
    
        dic = {"Language": language}
    
        lista_dic.append(dic)
    
    return lista_dic

In [45]:
new_list = []
new_list = dame_el_top_10(new_list, Top10)
new_list

[{'Language': 'Mandarin Chinese'},
 {'Language': 'Spanish'},
 {'Language': 'English'},
 {'Language': 'Hindi (sanskritised Hindustani)'},
 {'Language': 'Bengali'},
 {'Language': 'Portuguese'},
 {'Language': 'Russian'},
 {'Language': 'Japanese'},
 {'Language': 'Western Punjabi'},
 {'Language': 'Marathi'}]
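
The exercise asks for a pandas DataFrame, while the cell above stops at a list of dictionaries. A minimal sketch of the last step follows. The names are copied from the output above, and the `rank` index and `language` column name are choices made for this example.

```python
import pandas as pd

languages = [
    "Mandarin Chinese", "Spanish", "English", "Hindi (sanskritised Hindustani)",
    "Bengali", "Portuguese", "Russian", "Japanese", "Western Punjabi", "Marathi",
]

top10_df = pd.DataFrame({"language": languages})

# Use a 1-based index so it doubles as the rank in the list
top10_df.index = pd.RangeIndex(start=1, stop=len(languages) + 1, name="rank")

print(top10_df)
```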

### BONUS QUESTIONS

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [212]:
# This is the url you will scrape in this exercise 
url_imdb = 'https://www.imdb.com/chart/top'

In [213]:
response = requests.get(url=url_imdb)
response.status_code

200

In [220]:
imdb_soup = BeautifulSoup(response.content, "html.parser")
type(imdb_soup)

bs4.BeautifulSoup

In [219]:
tbody = imdb_soup.find("tbody")
type(tbody)

bs4.element.Tag

In [277]:
# Now we have every movie in a list:

row_movies = tbody.find_all("tr")
type(row_movies)

bs4.element.ResultSet

In [280]:
# Let's get the movies info:

movies_list = []

for row in row_movies:
    # Look up the row's cells once instead of once per field
    cells = row.find_all("td")
    title_text = cells[1].text

    name = title_text[16:-8]
    year = title_text[-6:-2]
    director = cells[1].find("a").get("title").split(" (dir.),")[0]
    stars = cells[2].text.strip()
    
    dic = {"movie_name": name,
          "release_year": year,
          "director": director,
          "rating": stars}
    
    movies_list.append(dic)
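
The exercise title asks for the stars, while the loop above stores the rating. Judging by the `" (dir.),"` split, the link's `title` attribute appears to list the director followed by the lead actors. Assuming the format `"<director> (dir.), <star>, <star>"`, both can be extracted as sketched below. The sample string and the helper name `parse_credits` are assumptions for illustration.

```python
def parse_credits(title_attr: str):
    """Split a credits string into (director, [stars])."""
    # partition() returns the text before and after the first " (dir.)" marker
    director_part, _, cast_part = title_attr.partition(" (dir.)")
    stars = [star.strip() for star in cast_part.lstrip(",").split(",") if star.strip()]
    return director_part.strip(), stars


# Assumed example of the attribute's format
sample_title = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"
director, stars = parse_credits(sample_title)
print(director)
print(stars)
```

If the marker is missing, the whole string is returned as the director and the list of stars is empty, so the loop would not raise on an unexpected value.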

In [281]:
# READY FOR JSONIZING AND UPLOADING TO MONGO! :D

movies_list

[{'movie_name': 'Cadena perpetua',
  'release_year': '1994',
  'director': 'Frank Darabont',
  'rating': '9.2'},
 {'movie_name': 'El padrino',
  'release_year': '1972',
  'director': 'Francis Ford Coppola',
  'rating': '9.1'},
 {'movie_name': 'El padrino: Parte II',
  'release_year': '1974',
  'director': 'Francis Ford Coppola',
  'rating': '9.0'},
 {'movie_name': 'El caballero oscuro',
  'release_year': '2008',
  'director': 'Christopher Nolan',
  'rating': '9.0'},
 {'movie_name': '12 hombres sin piedad',
  'release_year': '1957',
  'director': 'Sidney Lumet',
  'rating': '8.9'},
 {'movie_name': 'La lista de Schindler',
  'release_year': '1993',
  'director': 'Steven Spielberg',
  'rating': '8.9'},
 {'movie_name': 'El señor de los anillos: El retorno del rey',
  'release_year': '2003',
  'director': 'Peter Jackson',
  'rating': '8.9'},
 {'movie_name': 'Pulp Fiction',
  'release_year': '1994',
  'director': 'Quentin Tarantino',
  'rating': '8.8'},
 {'movie_name': 'El bueno, el feo y el