# Web Scraping Lab

This notebook contains web scraping exercises to practice your skills with `requests` and `Beautiful Soup`.

**Tips:**

- Check the [response status code](https://http.cat/) for each request to ensure you have obtained the intended content.
- Look at the HTML code in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, and other markers used for the content you are expected to extract.
- Check out CSS selectors.
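
The tips above can be sketched roughly as follows. This is a minimal illustration, not part of the lab: `fetch_soup` is a hypothetical helper name, and the HTML snippet (including the `/octocat` link) is made up so that the CSS selector can be tried without a network request.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, raising if the status code is 4xx/5xx."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on an error status
    return BeautifulSoup(response.content, "html.parser")


# Offline demonstration of a CSS selector on a hypothetical snippet.
sample_html = (
    '<div><h1 class="h3 lh-condensed">'
    '<a href="/octocat">The Octocat</a>'
    "</h1></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "h1.h3.lh-condensed a" means: <a> tags inside an <h1> carrying both classes.
hrefs = [a["href"] for a in sample_soup.select("h1.h3.lh-condensed a")]
print(hrefs)
```

`fetch_soup` is only defined here, not called, so the block runs offline; in an exercise it could replace the separate `req.get(...)` and `bs(...)` calls.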

### Useful Resources
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### First of all, gathering our tools.

In [1]:
import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd
import json

⚠️ **Again, please remember to limit your output before submission so that your code doesn't get lost in the output.**

#### Challenge 1 - Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below (with different names):

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [3]:
html=req.get(url).content

soup=bs(html, 'html.parser')


We find that every box holding a user name is an `h1` with class "h3 lh-condensed". We collect them all with `find_all` and store the resulting list in a variable (`users`).

In [4]:
users = soup.find_all('h1', class_='h3 lh-condensed')

users

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":1148717,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="e63714fba102c81161256026f733701b107c533357596fd3a9b8b0ccdb58acba" data-view-component="true" href="/emilk">
             Emil Ernerfeldt
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":40610,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="a89d6b5a4c2d109b974e13cd420c0b82704b6785bb03c83e25090ee5468fe8d6" data-view-component="true" href="/kishikawakatsumi">
             Kishikawa Katsumi
 </a> </h1

Now we want to pull the information we need out of `users`.

In [5]:
user_names = []

for e in users:                  
    user_name_individual = (e.a['href'])+(e.a.text)              
    user_names.append(user_name_individual)
print(user_names)

['/emilk\n            Emil Ernerfeldt\n', '/kishikawakatsumi\n            Kishikawa Katsumi\n', '/jpsim\n            JP Simard\n', '/shuding\n            Shu Ding\n', '/agnivade\n            Agniva De Sarker\n', '/PySimpleGUI\n            PySimpleGUI\n', '/yujincheng08\n            LoveSy\n', '/danielroe\n            Daniel Roe\n', '/jsjoeio\n            Joe Previte\n', '/sgugger\n            Sylvain Gugger\n', '/dtolnay\n            David Tolnay\n', '/sharkdp\n            David Peter\n', '/Neo23x0\n            Florian Roth\n', '/klaasnicolaas\n            Klaas Schoute\n', '/gammazero\n            Andrew Gillis\n', '/muesli\n            Christian Muehlhaeuser\n', '/jonataslaw\n            Jonny Borges\n', '/JLLeitschuh\n            Jonathan Leitschuh\n', '/derailed\n            Fernand Galiana\n', '/klauspost\n            Klaus Post\n', '/mrdoob\n            mrdoob\n', '/jrfnl\n            Juliette\n', '/unknwon\n            Joe Chen\n', '/nomi9995\n            Numan\n', '/Ebazhanov\n

In [28]:
user_names = []

for e in users:
    user_name_individual = (e.a['href'].lstrip('/'))+" "+"("+(e.a.text.strip('\n '))+")"      
    user_names.append(user_name_individual)

print(user_names)

"""Building user_name_individual:
We access the content and clean it in the same step (doing it separately caused problems).

First we get the user handle --> the value of the 'href' attribute inside the 'a' tag.
We clean the handle --> .lstrip() removes the leading '/'.

Then we get the name --> the text inside the 'a' tag. The name is contained in the 'a' tag but is not an attribute; it is plain text, which we reach through .text
We clean the name --> .strip() removes newline characters and spaces from both ends of the string (up to the first other character on each side).

Building the element of user_names --> we add what is needed for the target format: a space between handle and name, and parentheses around the name."""

['emilk (Emil Ernerfeldt)', 'kishikawakatsumi (Kishikawa Katsumi)', 'jpsim (JP Simard)', 'shuding (Shu Ding)', 'agnivade (Agniva De Sarker)', 'PySimpleGUI (PySimpleGUI)', 'yujincheng08 (LoveSy)', 'danielroe (Daniel Roe)', 'jsjoeio (Joe Previte)', 'sgugger (Sylvain Gugger)', 'dtolnay (David Tolnay)', 'sharkdp (David Peter)', 'Neo23x0 (Florian Roth)', 'klaasnicolaas (Klaas Schoute)', 'gammazero (Andrew Gillis)', 'muesli (Christian Muehlhaeuser)', 'jonataslaw (Jonny Borges)', 'JLLeitschuh (Jonathan Leitschuh)', 'derailed (Fernand Galiana)', 'klauspost (Klaus Post)', 'mrdoob (mrdoob)', 'jrfnl (Juliette)', 'unknwon (Joe Chen)', 'nomi9995 (Numan)', 'Ebazhanov (Zhenja)']


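
As a compact variant of the cleaning described above, the sketch below uses `get_text(strip=True)` and an f-string inside a list comprehension. It runs on a small HTML sample copied from the output shown earlier (with the tracking attributes removed), so no request is made; `sample_html` and `sample_soup` are names introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two entries from the page output shown earlier.
sample_html = """
<h1 class="h3 lh-condensed">
<a href="/emilk">
            Emil Ernerfeldt
</a> </h1>
<h1 class="h3 lh-condensed">
<a href="/kishikawakatsumi">
            Kishikawa Katsumi
</a> </h1>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href gives the handle; get_text(strip=True) gives the name without padding.
user_names = [
    f"{h1.a['href'].lstrip('/')} ({h1.a.get_text(strip=True)})"
    for h1 in sample_soup.find_all("h1", class_="h3 lh-condensed")
]
print(user_names)
```

`get_text(strip=True)` trims whitespace and line breaks at both ends of the text, which here has the same effect as `.text.strip('\n ')`.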

#### Challenge 2 - Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [29]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [30]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [38]:
repos = soup.find_all('h1', class_='h3 lh-condensed')

In [47]:
repos_names = []

for e in repos:
    repos_names.append(e.a['href'].strip('/'))   # Access the content, clean it, and append it to the list in a single line.

repos_names


['RunaCapital/awesome-oss-alternatives',
 'jackfrued/Python-100-Days',
 'github/copilot-docs',
 'ansible/ansible',
 'TheAlgorithms/Python',
 'facebookresearch/mae',
 'scikit-learn/scikit-learn',
 'pymc-devs/pymc',
 'Jxck-S/plane-notify',
 'deepset-ai/haystack',
 'joeammond/CVE-2021-4034',
 'sinwindie/OSINT',
 'KurtBestor/Hitomi-Downloader',
 'joke2k/faker',
 'bottlesdevs/Bottles',
 'huggingface/transformers',
 'zulip/zulip',
 'ranger/ranger',
 'feast-dev/feast',
 'facebookresearch/detr',
 'dmlc/dgl',
 'NVlabs/imaginaire',
 'nftdevs/NFTs-Upload-to-OpenSea',
 'thumbor/thumbor',
 'yetao0806/NodeNote']

#### Challenge 3 - Display all the image links from Walt Disney wikipedia page

In [48]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [49]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [77]:
images = soup.find_all('a', class_='image') # Select the 'a' boxes that wrap the tag we are interested in ('img')

In [78]:
img_links = []

for e in images:
    img_links.append(e.img['src']) # Access the 'img' tag and append the attribute we want ('src' holds the image link)

img_links

['//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg',
 '//upload.wikimedia.or
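
The `src` values above are protocol-relative (they start with `//`). If full URLs are wanted, one option is `urllib.parse.urljoin`, sketched below with one link copied from the output; `page_url` and `absolute_links` are names introduced for this illustration.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# One protocol-relative link taken from the output above.
img_links = [
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/"
    "Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
]

# urljoin borrows the scheme (https) from the page URL.
absolute_links = [urljoin(page_url, src) for src in img_links]
print(absolute_links)
```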

#### Challenge 4 - Retrieve all links to pages on Wikipedia that refer to some kind of Python.

In [79]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [80]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [119]:
lis = soup.find_all('li')

In [120]:
links = []

for e in lis:
    if e.a:
        links.append(e.a['href'])

links

['/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '#Computing',
 '#People',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#Other_uses',
 '#See_also',
 '/wiki/Python_(programming_language)',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Colt_Python',
 '/wiki/PYTHON',
 '/wiki/Python_(film)',
 '/wiki/Python_(mythology)',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/wiki/Timon_of_Phlius',
 '/wiki/Cython',
 '/wiki/Pyton',
 '/wiki/Pithon',
 '/wiki/Category:Disambiguation_pages',
 '/wiki/Category:Human_name_disambiguation_pages',
 '/wiki/Category:Disambiguation_pages_with_given-name-ho

The links collected above (`links`) are not quite what we want:

- Some lead to Wikipedia pages unrelated to Python.
- Some are "repeated" in different versions (the same page in other languages, for example).

We can keep only the links that contain 'Python' and that are site-relative paths, that is, the page-specific part that follows the base URL (https://en.wikipedia.org).

We then prepend the base URL (https://en.wikipedia.org) to each of them and build a list of the results.

Even after this filter, some URLs are of no real interest: those containing 'index'. They point to Wikipedia pages related to 'Python' that do not display content but serve other functions, such as editing. We discard these as well when building the final list.

In [122]:
python_links = []

for e in links:
    if 'Python' in e and 'http' not in e and 'index' not in e:     # Keep only the links we care about (content about 'Python')
        python_links.append('https://en.wikipedia.org'+e)

python_links

['https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia.org/wiki/Python_(automobile_maker)',
 'https://en.wikipedia.org/wiki/Python_(Ford_prototype)',
 'https://en.wikipedia.org/wiki/Python_(missile)',
 'https://en.wikipedia.org/wiki/Python_(nuclear_primary)',
 'https://en.wikipedia.org/wiki/Colt_Python',
 'https://en.wikipedia.org/wiki/Python_(film)',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipe
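
A variation on the filter above, offered only as a sketch: it keeps links that start with `/wiki/` (instead of excluding 'http' and 'index'), builds absolute URLs with `urljoin`, and removes duplicates while preserving order. The `links` list here is a small hand-made sample in the style of the output above, and `is_python_article` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"

# Small sample in the style of the scraped hrefs (illustrative only).
links = [
    "/wiki/Pythonidae",
    "#Computing",
    "/wiki/Python_(genus)",
    "/wiki/Pythonidae",                      # duplicate
    "/w/index.php?title=Python&action=edit",  # edit link, not an article
    "https://de.wikipedia.org/wiki/Python",   # other language edition
]


def is_python_article(href):
    """Keep site-relative article paths that mention 'Python'."""
    return href.startswith("/wiki/") and "Python" in href


# dict.fromkeys drops duplicates while keeping the original order.
python_links = list(
    dict.fromkeys(urljoin(BASE_URL, h) for h in links if is_python_article(h))
)
print(python_links)
```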

#### Challenge 5 - Number of Titles that have changed in the United States Code since its last release point 

In [123]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [124]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [127]:
changed = soup.find_all('div', class_='usctitlechanged')    # Every Title that has changed has the class 'usctitlechanged'.

len(changed)

24

#### Challenge 6 - A Python list with the top ten FBI's Most Wanted names 

In [128]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [129]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [137]:
top10 = soup.find_all('h3', class_='title')

top10_names = []

for e in top10:
    top10_names.append(e.text.strip('\n'))
top10_names

['YULAN ADONAY ARCHAGA CARIAS',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'OCTAVIANO JUAREZ-CORRO',
 'RAFAEL CARO-QUINTERO']

#### Challenge 7 - List all language names and number of related articles in the order they appear in wikipedia.org

In [142]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [143]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [172]:
all_info = soup.find_all('a') # Boxes that contain all the info we want to extract.

[<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
 <strong>English</strong>
 <small><bdi dir="ltr">6 383 000+</bdi> <span>articles</span></small>
 </a>,
 <a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
 <strong>日本語</strong>
 <small><bdi dir="ltr">1 292 000+</bdi> <span>記事</span></small>
 </a>,
 <a class="link-box" data-slogan="Свободная энциклопедия" href="//ru.wikipedia.org/" id="js-link-box-ru" title="Russkiy — Википедия — Свободная энциклопедия">
 <strong>Русский</strong>
 <small><bdi dir="ltr">1 756 000+</bdi> <span>статей</span></small>
 </a>,
 <a class="link-box" data-slogan="Die freie Enzyklopädie" href="//de.wikipedia.org/" id="js-link-box-de" title="Deutsch — Wikipedia — Die freie Enzyklopädie">
 <strong>Deutsch</strong>
 <small><bdi dir="ltr">2 617 000+</bdi> <span>Artikel</span></small>
 </a>,

In [197]:
lang = []
for e in all_info:
    l = e.strong   # The 'strong' tag holds the language name.
    if l:
        lang.append(l.text)
        
    x = e.small    # The 'small' tag holds the number of articles.
    if x:
        # The text contains non-breaking spaces ('\xa0') wherever wikipedia.org shows a space.
        # We slice around them and insert dots to get a clean article count.
        lang.append(x.text[0]+'.'+x.text[2:5]+'.'+x.text[6:])
lang

['English',
 '6.383.000+ articles',
 '日本語',
 '1.292.000+ 記事',
 'Русский',
 '1.756.000+ статей',
 'Deutsch',
 '2.617.000+ Artikel',
 'Español',
 '1.717.000+ artículos',
 'Français',
 '2.362.000+ articles',
 'Italiano',
 '1.718.000+ voci',
 '中文',
 '1.231.000+ 條目',
 'Polski',
 '1.490.000+ haseł',
 'Português',
 '1.074.000+ artigos']
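
The slicing above relies on every count having the same number of digits. A sketch of a less position-dependent alternative is to replace the non-breaking spaces directly. It runs on two boxes rebuilt from the output shown earlier (attributes trimmed); `sample_html` and `languages` are names introduced for this illustration, and pairing name and count in a tuple is a design choice, not something the exercise requires.

```python
from bs4 import BeautifulSoup

# Two link boxes rebuilt from the output above; \xa0 is the non-breaking space.
sample_html = (
    '<a class="link-box" href="//en.wikipedia.org/">'
    "<strong>English</strong>"
    '<small><bdi dir="ltr">6\xa0383\xa0000+</bdi> <span>articles</span></small>'
    "</a>"
    '<a class="link-box" href="//de.wikipedia.org/">'
    "<strong>Deutsch</strong>"
    '<small><bdi dir="ltr">2\xa0617\xa0000+</bdi> <span>Artikel</span></small>'
    "</a>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

languages = []
for box in sample_soup.find_all("a", class_="link-box"):
    name = box.strong.get_text(strip=True)
    # Replace each non-breaking space with a dot, whatever the digit count.
    count = box.small.bdi.get_text(strip=True).replace("\xa0", ".")
    languages.append((name, count))

print(languages)
```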

#### Challenge 8 - A list with the different kind of datasets available in data.gov.uk 

In [198]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [199]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [212]:
datasets = [e.text for e in soup.find_all('a', class_= 'govuk-link')]    # Loop and build the list in a single line (much tidier than the earlier cells, now that we have more practice)
datasets

['cookies to collect information',
 'View cookies',
 'change your cookie settings',
 'feedback',
 'Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

Four links share the same class as the dataset links but are not what we are looking for.

We remove these elements to get the final list.

In [214]:
datasets = datasets[4:]

print(datasets)

['Business and economy', 'Crime and justice', 'Defence', 'Education', 'Environment', 'Government', 'Government spending', 'Health', 'Mapping', 'Society', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']
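
Slicing with `[4:]` depends on the unwanted links always coming first. As a sketch of an alternative, the entries can be filtered by their text instead. The `all_links` list below is copied (shortened) from the output above, and `unwanted` is a name introduced for this illustration.

```python
# Shortened copy of the link texts scraped above.
all_links = [
    "cookies to collect information",
    "View cookies",
    "change your cookie settings",
    "feedback",
    "Business and economy",
    "Crime and justice",
    "Defence",
]

# Texts that share the class but are not dataset categories.
unwanted = {
    "cookies to collect information",
    "View cookies",
    "change your cookie settings",
    "feedback",
}

datasets = [text for text in all_links if text not in unwanted]
print(datasets)
```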


#### Challenge 9 - Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [313]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [314]:
html=req.get(url).content

soup=bs(html, 'html.parser')

In [340]:
tablas = soup.find_all('table', class_= 'wikitable sortable jquery-tablesorter')

print(tablas)  

'''Nothing is found for 'table', class_= 'wikitable sortable jquery-tablesorter', and I do not know why.
After many attempts (including walking through the tags one by one), I could not reach the table we need.'''


[]


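
One likely explanation for the empty result: the `jquery-tablesorter` class is added by JavaScript in the browser, so it appears in DevTools but probably not in the raw HTML that `requests` downloads. Searching for the `wikitable` class alone may therefore work. The sketch below illustrates the idea on a made-up table; the rows are placeholders, not real data, and the column names are assumptions.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table as it might look in the raw HTML (placeholder rows).
sample_html = """
<table class="wikitable sortable">
<tr><th>Language</th><th>Native speakers (millions)</th></tr>
<tr><td>Language A</td><td>300</td></tr>
<tr><td>Language B</td><td>200</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The full class string from DevTools does not match the raw HTML.
exact_matches = len(
    sample_soup.find_all("table", class_="wikitable sortable jquery-tablesorter")
)

# A single class name matches any tag that carries it among its classes.
table = sample_soup.find("table", class_="wikitable")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# First row is the header; keep at most the first ten data rows.
df = pd.DataFrame(rows[1:], columns=rows[0]).head(10)
print(exact_matches)
print(df)
```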

### Stepping up the game

#### Challenge 10 - The 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

#### Challenge 11 - IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

#### Challenge 12 - Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

#### Challenge 13 - Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current

In [None]:
def weather(city):
    pass

#### Challenge 14 - Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

**Did you limit your output? Thank you! 🙂**