# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and take a look at its source through Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used for the HTML content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
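
The first two tips can be wrapped in a small helper. This is only a sketch: `check_response` is a hypothetical name, and it accepts any object with `status_code` and `text` attributes (such as the object returned by `requests.get`), so a stand-in response is used here to keep the example runnable offline.

```python
from types import SimpleNamespace


def check_response(resp, preview=200):
    """Return the first `preview` characters of the body, or raise on a non-200 status."""
    if resp.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {resp.status_code}")
    return resp.text[:preview]


# Stand-in for a real response so the example runs offline (hypothetical data).
fake_ok = SimpleNamespace(status_code=200, text="<!DOCTYPE html><html>...</html>")
print(check_response(fake_ok, preview=15))
```

With the imports used in this notebook, the equivalent call would be `check_response(req.get(url))`.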

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [16]:
import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd
from pprint import pprint
#from lxml import html
#from lxml.html import fromstring
import urllib.request
from urllib.request import urlopen
import random
import re
from IPython.display import Image # To display images in Python
#import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [43]:
html = req.get(url).content
html[:500]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.co'

In [44]:
len(html) # Length of the html content downloaded from the given URL

465606

In [45]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [46]:
#soup # Commented out to avoid printing the whole document

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
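
The whitespace cleanup described in step 3 can be sketched with plain string methods. `clean_text` is a hypothetical helper name, and the raw string is only an illustration of what the `.text` of a scraped element may look like.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Hypothetical raw text, shaped like the .text of a scraped element.
raw_name = "\n\n            Ismail Pelaseyed\n"
print(clean_text(raw_name))
```

Unlike `strip()`, which only trims both ends, this also normalises line breaks that appear in the middle of the text.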

In [47]:
len(soup.find_all('div'))

398

In [146]:
soup.find_all('article', class_='Box-row d-flex')


[]

In [58]:
lista25 = soup.find_all('article', class_='Box-row d-flex')
len(lista25)

25

In [92]:
# We have a list of 25 elements -> these are our 25 trending developers, so we are on track
# Explore the first one
lista25[0]

<article class="Box-row d-flex" id="pa-homanp">
<a class="Link color-fg-muted f6" data-view-component="true" href="#pa-homanp" style="width: 16px;" text="center">
    1
</a>
<div class="mx-3">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2464556,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="8353123892e0014c3391371c3a36a907e00ed35489535f2eee1054dbfe390357" data-view-component="true" href="/homanp">
<img alt="@homanp" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/2464556?s=96&amp;v=4" width="48"/>
</a> </div>
<div class="d-sm-flex flex-auto">
<div class="col-sm-8 d-md-flex">
<div class="col-md-6">
<h1 class="h3 lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING

In [93]:
# Search for the <a> tags, since the entry is split into 'a' blocks
lista25[0]('a')

[<a class="Link color-fg-muted f6" data-view-component="true" href="#pa-homanp" style="width: 16px;" text="center">
     1
 </a>,
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2464556,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="8353123892e0014c3391371c3a36a907e00ed35489535f2eee1054dbfe390357" data-view-component="true" href="/homanp">
 <img alt="@homanp" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/2464556?s=96&amp;v=4" width="48"/>
 </a>,
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2464556,"originating_url":"https://github.com/trending/developer

In [82]:
# Then look at the <a> at index 2, since that is where the developer's name is
lista25[0]('a')[2] 

<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2464556,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="8353123892e0014c3391371c3a36a907e00ed35489535f2eee1054dbfe390357" data-view-component="true" href="/homanp">
            Ismail Pelaseyed
</a>

In [83]:
type(lista25[0]('a')[2])

bs4.element.Tag

In [85]:
lista25[0]('a')[2].text.strip() 

'Ismail Pelaseyed'

In [94]:
lista25[0]('a')[3].text.strip() # Same for the username -> this time it is in the <a> at index 3

'homanp'

In [97]:
# Now loop to build a list of every name followed by its username
resultado = []  # Start with an empty list

for dev in lista25:  # Iterate over the 25 entries
    nombre = dev('a')[2].text.strip()  # Developer name
    user = dev('a')[3].text.strip()  # Username

    resultado.append(f'{nombre} ({user})')  # Combine both into one string

resultado

['Ismail Pelaseyed (homanp)',
 'Alex Stokes (ralexstokes)',
 'Don Jayamanne (DonJayamanne)',
 'Stefan Prodan (stefanprodan)',
 'Travis Cline (tmc)',
 'Oliver (SchrodingersGat)',
 'dgtlmoon (changedetection.io)',
 'Guillaume Klein (guillaumekln)',
 'lllyasviel (Fooocus)',
 'Angelos Chalaris (Chalarangelo)',
 'Jon Skeet (jskeet)',
 'Chris Banes (chrisbanes)',
 'Carlos Scheidegger (cscheid)',
 'Anders Eknert (anderseknert)',
 'Emil Ernerfeldt (emilk)',
 'Andrea Aime (aaime)',
 'Mattt (mattt)',
 'Meng Zhang (wsxiaoys)',
 'Shahed Nasser (shahednasser)',
 'dkhamsing (open-source-ios-apps)',
 'Kieron Quinn (KieronQuinn)',
 'afc163 (afc163)',
 'Alan Donovan (adonovan)',
 'Lee Calcote (leecalcote)',
 'Costa Huang (vwxyzjn)']

In [98]:
#cooool

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [100]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [101]:
html = req.get(url).content
html[:500]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.co'

In [102]:
len(html)

660152

In [103]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [104]:
len(soup.find_all('div'))

207

In [108]:
toprepos = soup.find_all('article', class_='Box-row') # Very similar to the previous exercise, but with a different class
len(toprepos)

25

In [141]:
toprepos[0]

<article class="Box-row">
<div class="float-right d-flex">
<div class="BtnGroup d-flex" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn BtnGroup-item" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":710159073,"auth_type":"LOG_IN","originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="1514517cbe4eb54cff5fbbb87940e0b38efee87bbb4a3994c3197da5dc19691b" data-view-component="true" href="/login?return_to=%2FTHUDM%2FChatGLM3" rel="nofollow"> <svg aria-hidden="true" class="octicon octicon-star v-align-text-bottom d-none d-md-inline-block mr-2" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.751.751 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75 0 0 1-1.088-.79

In [120]:
# Again we have 25 top repos -> work out how to get the name from the first element, then iterate

# The repo name is inside 'h2' -> there is only one such tag, so take index 0 -> get the text and split it to isolate the repo

toprepos[0]('h2')[0].text.split()

['THUDM', '/', 'ChatGLM3']

In [122]:
# We have the repo -> loop and apply the same steps to the whole toprepos list
resul2 = []  # Start with an empty list

for repo_item in toprepos:
    repo = repo_item('h2')[0].text.split()[2]  # Keep only the repo name, not the owner who created it

    resul2.append(repo)

resul2

['ChatGLM3',
 'Wonder3D',
 'public-apis',
 'system-design-primer',
 'devops-exercises',
 'ChatDev',
 'reflex',
 'face_recognition',
 'langflow',
 'raven',
 'nnUNet',
 'Fooocus',
 'CogVLM',
 'rag-demystified',
 'PaddleOCR',
 'deepface',
 'ungoogled-chromium',
 'poetry',
 'django',
 'bisheng',
 'litestar',
 'whisper',
 'discord_bot',
 'autotrain-advanced',
 'haystack']
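
Indexing `split()[2]` relies on the heading text breaking into exactly three tokens (owner, slash, repo). A slightly more tolerant variant, sketched below with hypothetical helper names and a made-up heading string shaped like the one above, removes all whitespace first and then splits on the slash.

```python
def repo_full_name(heading_text):
    """Turn heading text such as 'THUDM /\n ChatGLM3' into 'THUDM/ChatGLM3'."""
    return "".join(heading_text.split())


def repo_name(heading_text):
    """Return only the part after the last slash."""
    return repo_full_name(heading_text).rsplit("/", 1)[-1]


# Hypothetical heading text with the kind of padding found in scraped HTML.
heading = "\n      THUDM /\n\n      ChatGLM3\n"
print(repo_full_name(heading))
print(repo_name(heading))
```

Keeping the `owner/repo` form can be useful when two trending repositories share the same short name.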

#### Display all the image links from the Walt Disney Wikipedia page

In [3]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [4]:
html = req.get(url).content
html[:500]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-d'

In [5]:
len(html)

582596

In [6]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [7]:
soup.find_all('img')

[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img alt="Featured article" class="mw-file-element" data-file-height="443" data-file-width="466" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Extended-protected article" class="mw-file-element" data-file-height="512" data-file

In [8]:
len(soup.find_all('img'))

37

In [9]:
imagenes = soup.find_all('img') # Store the list of images in a variable
len(imagenes)

37

In [10]:
imagenes[10]

<img alt="Walt Disney sits in front of a set of models of the seven dwarfs" class="mw-file-element" data-file-height="388" data-file-width="500" decoding="async" height="171" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/330px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/440px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg 2x" width="220"/>

In [13]:
imagenes[10].attrs

{'alt': 'Walt Disney sits in front of a set of models of the seven dwarfs',
 'src': '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg',
 'decoding': 'async',
 'width': '220',
 'height': '171',
 'class': ['mw-file-element'],
 'srcset': '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/330px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/440px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg 2x',
 'data-file-width': '500',
 'data-file-height': '388'}

In [15]:
imagenes[10]['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg'

In [17]:
# Try displaying this one image first, then move on to the loops
url = imagenes[10]['src']
display(Image(url=url))

In [20]:
# Good -> now loop over imagenes
mostrar = []  # List of image links

for img in imagenes:
    mostrar.append(img['src'])

mostrar[:2]

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg']
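
The `src` values above are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`), so they are not usable on their own outside a browser. As a minimal sketch, the standard library's `urljoin` can resolve them against the page URL; the variable names here are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Two src values taken from the output above.
srcs = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
]

# Resolve each link against the page it was found on.
absolute = [urljoin(page_url, src) for src in srcs]
for link in absolute:
    print(link)
```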

In [22]:
# With the list of image links ready, display every image
for src in mostrar:
    display(Image(url=src))


#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [23]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [24]:
html = req.get(url).content
html[:500]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-d'

In [25]:
len(html)

64243

In [26]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [37]:
# All the Wikipedia hyperlinks are in 'a' tags -> store this selection in a list
lista = soup.find_all('a')
len(lista)

161

In [49]:
lista[10].attrs

{'href': '/wiki/Wikipedia:Community_portal', 'title': 'The hub for editors'}

In [48]:
lista[10].attrs['href']

'/wiki/Wikipedia:Community_portal'

In [50]:
lista[10].attrs['title']

'The hub for editors'

In [51]:
# The links appear to be inside href -> move on to a loop over the list

In [53]:
links = []

for e in lista:
    links.append(e.attrs['href'])

links[:10]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction']

In [54]:
# This can be done in one line: the expression inside append() comes first, then the for clause
linkspro = [e.attrs['href'] for e in lista]

linkspro[:10]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction']
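
Indexing `attrs['href']` raises a `KeyError` for any `<a>` tag that has no `href`. Since `attrs` is a plain dictionary, `.get('href')` avoids that. The sketch below uses ordinary dicts as stand-ins for `Tag.attrs` so it runs without downloading anything; `collect_links` is a hypothetical helper name.

```python
def collect_links(attr_dicts):
    """Return href values in order, skipping anchors without href and duplicates."""
    seen = set()
    links = []
    for attrs in attr_dicts:
        href = attrs.get("href")  # None when the anchor has no href
        if href is None or href in seen:
            continue
        seen.add(href)
        links.append(href)
    return links


# Stand-ins for Tag.attrs dictionaries (hypothetical data).
sample_attrs = [
    {"href": "/wiki/Main_Page"},
    {"id": "top"},  # anchor without href
    {"href": "/wiki/Help:Contents"},
    {"href": "/wiki/Main_Page"},  # duplicate
]
print(collect_links(sample_attrs))
```

On the real page the call would be `collect_links(a.attrs for a in lista)`.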

In [55]:
# Cooool

#### Number of Titles that have changed in the United States Code since its last release point 

In [56]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [57]:
html = req.get(url).content
html[:500]

b'<?xml version=\'1.0\' encoding=\'UTF-8\' ?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml"><head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=8" />\n        <meta http-equiv="pragma" content="no-cache" /><!-- HTTP 1.0 -->\n        <meta http-equiv="cache-control" content="no-cache,must-revalidate" '

In [58]:
len(html)

83015

In [59]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [62]:
titulos = soup.find_all(class_='usctitle') # Used 'usctitle' rather than 'uscitem' because the latter also counted the appendices as titles
len(titulos) # Check that the length matches the number of titles

54

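This gives the total number of titles: 54. Note that the heading asks for the titles that have changed since the last release point, so 54 counts every title rather than only the changed ones.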
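
To count only the changed titles, the elements would need to be selected by whatever marker the page uses for them. The sketch below assumes that marker is a CSS class named `usctitlechanged`; this is an assumption that is not confirmed anywhere in this notebook and should be checked in DevTools. The sample HTML is made up, and the standard library's `HTMLParser` is used instead of BeautifulSoup only so the example runs without extra installs.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


# Hypothetical markup; 'usctitlechanged' is an assumed class name.
sample = """
<div class="usctitle" id="t1">Title 1</div>
<div class="usctitlechanged" id="t2">Title 2</div>
<div class="usctitle" id="t3">Title 3</div>
"""

counter = ClassCounter("usctitlechanged")
counter.feed(sample)
print(counter.count)
```

If the class name holds, the BeautifulSoup equivalent would be `len(soup.find_all(class_='usctitlechanged'))`.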

#### A Python list with the top ten FBI's Most Wanted names 

In [73]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [74]:
html = req.get(url).content
html[:500]

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<!-- saved from url=(0023)http://kidmondo.com/404 -->\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n  \n  <meta http-equiv="imagetoolbar" content="no" />\n  <meta name="robots" content="noindex,nofollow" />\n  <title>There was an Error </title>\n\n\t\t<style>body{background:#fff;margin:0;padding:20p'

In [75]:
len(html)

213661

In [76]:
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [77]:
soup.find_all('h3')

[]

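Nothing is found: the response shown above is an error page (its title is "There was an Error"), so there are no `h3` elements and this exercise is skipped.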

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

This one does not work either.

#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [None]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

The link has expired.

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
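
The shape of the requested try/except block can be sketched as follows. It only covers retrieving the page and handling a missing account, not counting tweets. `fetch_profile` is a hypothetical name, and the `get` argument stands for any callable that behaves like `requests.get`; a fake is used here so the example runs offline.

```python
def fetch_profile(account, get):
    """Return the profile HTML, or None when the account page cannot be retrieved.

    `get` takes a URL and returns an object with `raise_for_status()` and `text`.
    """
    url = f"https://twitter.com/{account}"
    try:
        resp = get(url)
        resp.raise_for_status()
    except OSError as exc:  # requests' exceptions derive from OSError (IOError)
        print(f"Could not retrieve {account!r}: {exc}")
        return None
    return resp.text


class FakeResponse:
    """Offline stand-in for a response object (hypothetical)."""

    def __init__(self, status, text=""):
        self.status = status
        self.text = text

    def raise_for_status(self):
        if self.status >= 400:
            raise OSError(f"HTTP {self.status}")


def fake_get(url):
    """Pretend only the Khruangbin account exists."""
    if url.endswith("/Khruangbin"):
        return FakeResponse(200, "<html>ok</html>")
    return FakeResponse(404)


print(fetch_profile("Khruangbin", fake_get))
print(fetch_profile("no_such_account", fake_get))
```

With the notebook's imports, the real call would be `fetch_profile('Khruangbin', req.get)`.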

In [78]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/Khruangbin'

In [116]:
html = req.get(url).content
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [131]:
soup.find_all('h2')

[]

This does not work: no `h2` elements are returned.

This one cannot be done with BeautifulSoup either.

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

This one cannot be done with BeautifulSoup either.

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [132]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [133]:
html = req.get(url).content
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [137]:
len(soup.find_all('a', class_='link-box'))

10

In [138]:
# Good, we know the 10 languages and their article counts are in here
cajas = soup.find_all('a', class_='link-box')
cajas

[<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
 <strong>English</strong>
 <small><bdi dir="ltr">6 715 000+</bdi> <span>articles</span></small>
 </a>,
 <a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
 <strong>日本語</strong>
 <small><bdi dir="ltr">1 387 000+</bdi> <span>記事</span></small>
 </a>,
 <a class="link-box" data-slogan="La enciclopedia libre" href="//es.wikipedia.org/" id="js-link-box-es" title="Español — Wikipedia — La enciclopedia libre">
 <strong>Español</strong>
 <small><bdi dir="ltr">1 892 000+</bdi> <span>artículos</span></small>
 </a>,
 <a class="link-box" data-slogan="Свободная энциклопедия" href="//ru.wikipedia.org/" id="js-link-box-ru" title="Russkiy — Википедия — Свободная

In [146]:
caja_idioma = cajas[0].text.split()
caja_idioma

['English', '6', '715', '000+', 'articles']

In [147]:
# We now have a list with the language followed by the (separate) number groups that make up the article total
un_idioma = caja_idioma[0]
un_idioma

'English'

In [148]:
# Now extract the number
un_numero = ''.join(caja_idioma[1:4])
un_numero

'6715000+'

In [153]:
# Now do the same for each of the 10 languages -> loop
resultado = [] # Start with an empty list
for e in cajas:
    idioma = e.text.split()[0]
    numarts = ''.join(e.text.split()[1:4])

    resultado.append(idioma+' '+numarts)

resultado

['English 6715000+',
 '日本語 1387000+',
 'Español 1892000+',
 'Русский 1938000+',
 'Deutsch 2836000+',
 'Français 2553000+',
 'Italiano 1826000+',
 '中文 1377000+',
 'Português 1109000+',
 'العربية 1217000+']

In [154]:
#Cooooool
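
The slice `[1:4]` assumes the count always has three digit groups and the language name is a single word. A sketch that does not depend on either, and that returns the count as an integer, might look like this; `parse_link_box` is a hypothetical helper, and the second sample string is invented to show a two-word name with a smaller count.

```python
import re


def parse_link_box(text):
    """Split link-box text into (language, article_count)."""
    tokens = text.split()  # str.split() also splits on non-breaking spaces
    number_tokens = [t for t in tokens if t[0].isdigit()]
    first_number = tokens.index(number_tokens[0])
    language = " ".join(tokens[:first_number])
    # Drop everything that is not a digit, e.g. the trailing '+'.
    count = int(re.sub(r"\D", "", "".join(number_tokens)))
    return language, count


print(parse_link_box("\nEnglish\n6\xa0715\xa0000+ articles\n"))
print(parse_link_box("Bahasa Indonesia 987 000+ artikel"))  # hypothetical entry
```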

#### A list with the different kinds of datasets available in data.gov.uk 

In [155]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [156]:
html = req.get(url).content
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [158]:
len(soup.find_all('h3'))

14

In [159]:
# Good -> we have 14 dataset types, so we are on track
lista = soup.find_all('h3')
lista

[<h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a></h3>,
 <h3 class="govuk-heading-s dgu-topics__heading"><a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a></

In [167]:
lista[0].text

'Business and economy'

In [170]:
# We have the title -> get the rest -> loop
res = [e.text for e in lista]
res

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

In [None]:
# Eazy duz it

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [171]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [172]:
html = req.get(url).content
soup = bs(html, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [179]:
tabla = soup.find('table') # The table I want is the first one on the page, so I use find() instead of find_all()
tabla

<table class="wikitable sortable static-row-numbers">
<caption>Languages with at least 50 million first-language speakers<sup class="reference" id="cite_ref-e26_7-1"><a href="#cite_note-e26-7">[7]</a></sup>
</caption>
<tbody><tr>
<th>Language
</th>
<th data-sort-type="number">Native speakers<br/><small>(millions)</small>
</th>
<th>Language family
</th>
<th>Branch
</th></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a><br/>(incl. <a href="/wiki/Standard_Chinese" title="Standard Chinese">Standard Chinese</a>, but excl. <a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">other varieties</a>)
</td>
<td>939
</td>
<td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
</td>
<td><a href="/wiki/Sinitic_languages" title="Sinitic languages">Sinitic</a>
</td></tr>
<tr>
<td><a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>
</td>
<td>485
</td>
<td><a href="/wiki/Indo-

In [181]:
tabla.find_all('tr')[0].text.split('\n') # This gives the column headers

['',
 'Language',
 '',
 'Native speakers(millions)',
 '',
 'Language family',
 '',
 'Branch',
 '']

In [182]:
filas = tabla.find_all('tr')

filas = [f.text.split('\n') for f in filas] # split on the line break

filas[:2]

[['',
  'Language',
  '',
  'Native speakers(millions)',
  '',
  'Language family',
  '',
  'Branch',
  ''],
 ['',
  'Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)',
  '',
  '939',
  '',
  'Sino-Tibetan',
  '',
  'Sinitic',
  '']]

In [186]:
# cleanup

final = [] # empty list

for f in filas:

    tmp = []

    for palabra in f:

        if palabra != '': # drop the empty strings
            tmp.append(palabra)

    final.append(tmp)
    
final[:10]

[['Language', 'Native speakers(millions)', 'Language family', 'Branch'],
 ['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)',
  '939',
  'Sino-Tibetan',
  'Sinitic'],
 ['Spanish', '485', 'Indo-European', 'Romance'],
 ['English', '380', 'Indo-European', 'Germanic'],
 ['Hindi(excl. Urdu, and other languages)',
  '345',
  'Indo-European',
  'Indo-Aryan'],
 ['Portuguese', '236', 'Indo-European', 'Romance'],
 ['Bengali', '234', 'Indo-European', 'Indo-Aryan'],
 ['Russian', '147', 'Indo-European', 'Balto-Slavic'],
 ['Japanese', '123', 'Japonic', 'Japanese'],
 ['Yue Chinese(incl. Cantonese)', '86.1', 'Sino-Tibetan', 'Sinitic']]

In [184]:
final = [[palabra for palabra in f if palabra !=''] for f in filas] # the same double loop as a nested comprehension

final[0]

['Language', 'Native speakers(millions)', 'Language family', 'Branch']

In [187]:
final[1]

['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)',
 '939',
 'Sino-Tibetan',
 'Sinitic']

In [188]:
col_names = final[0]

data = final[1:]

df = pd.DataFrame(data, columns=col_names)

df.head()

| | Language | Native speakers(millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese(incl. Standard Chinese, but e... | 939 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 485 | Indo-European | Romance |
| 2 | English | 380 | Indo-European | Germanic |
| 3 | Hindi(excl. Urdu, and other languages) | 345 | Indo-European | Indo-Aryan |
| 4 | Portuguese | 236 | Indo-European | Romance |


In [189]:
# Good, the dataframe is ready; keep only the top 10 by overwriting it
df = df.head(10)
df

| | Language | Native speakers(millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese(incl. Standard Chinese, but e... | 939.0 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 485.0 | Indo-European | Romance |
| 2 | English | 380.0 | Indo-European | Germanic |
| 3 | Hindi(excl. Urdu, and other languages) | 345.0 | Indo-European | Indo-Aryan |
| 4 | Portuguese | 236.0 | Indo-European | Romance |
| 5 | Bengali | 234.0 | Indo-European | Indo-Aryan |
| 6 | Russian | 147.0 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123.0 | Japonic | Japanese |
| 8 | Yue Chinese(incl. Cantonese) | 86.1 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85.0 | Austroasiatic | Vietic |
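
The scraped cells are all strings, so the speaker counts need converting before they can be sorted or summed. One possible sketch, using only the standard library and a hypothetical `to_records` helper, converts that column while pairing each row with the header; the two sample rows are taken from the cleaned list above.

```python
header = ["Language", "Native speakers(millions)", "Language family", "Branch"]
rows = [
    ["Spanish", "485", "Indo-European", "Romance"],
    ["Yue Chinese(incl. Cantonese)", "86.1", "Sino-Tibetan", "Sinitic"],
]


def to_records(header, rows, numeric_col="Native speakers(millions)"):
    """Pair each row with the header and convert one column to float."""
    records = []
    for row in rows:
        record = dict(zip(header, row))
        record[numeric_col] = float(record[numeric_col])
        records.append(record)
    return records


records = to_records(header, rows)
print(records[0])
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(records)`, which then holds the counts as numbers rather than text.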


In [190]:
# Great.

### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
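
Concatenating the city straight into the URL can break for names that contain spaces or accented characters. As a sketch, `urllib.parse.urlencode` can build the query string instead; `build_weather_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode


def build_weather_url(city, api_key):
    """Build the current-weather URL with a safely encoded query string."""
    base = "http://api.openweathermap.org/data/2.5/weather"
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{base}?{query}"


print(build_weather_url("New York", "YOUR_API_KEY"))
```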

In [None]:
# your code

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
#your code