# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other markers used in the content you are expected to extract.
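
The first two tips can be sketched as a small helper. This is only a minimal sketch and not part of the lab: `preview_response` is a hypothetical name, and the `SimpleNamespace` object merely stands in for what `requests.get(url)` would return, so the snippet runs without network access.

```python
from types import SimpleNamespace


def preview_response(response, n_chars=200):
    """Return (ok, preview) for any object exposing .status_code and .text."""
    ok = response.status_code == 200
    # Only look at the body when the request succeeded.
    preview = response.text[:n_chars] if ok else ""
    return ok, preview


# Stand-in for requests.get(url); the status code and body are made up.
fake_response = SimpleNamespace(status_code=200, text="<html><title>Trending</title></html>")
ok, preview = preview_response(fake_response)
print(ok, preview)
```

The same helper should accept a real `requests` response, since it only reads the `status_code` and `text` attributes.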

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [60]:
import requests
import bs4
import pandas as pd
import re  # regular expressions

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [82]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [83]:
# Build the response object returned by the server
response = requests.get(url)

# Apply .content to the response to extract its body as bytes
contenido = response.content
type(contenido)

bytes

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
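
The instructions above can be sketched on a small piece of HTML. This is a sketch under stated assumptions: `sample_html` is a made-up fragment that only imitates the `article > h1 > a` nesting seen on the trending page, and the real page may use different tags or classes.

```python
import bs4

# Hypothetical fragment imitating the structure of the trending page.
sample_html = """
<article><h1 class="h3 lh-condensed"><a href="/hrydgard">
    Henrik Rydgård
</a></h1></article>
<article><h1 class="h3 lh-condensed"><a href="/str4d">
    str4d
</a></h1></article>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# get_text() returns the text with its line breaks; split() + join()
# collapses every run of whitespace into a single space.
names = [" ".join(a.get_text().split()) for a in soup.select("article h1 a")]
print(names)  # ['Henrik Rydgård', 'str4d']
```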

In [84]:
# Parse the raw bytes into a BeautifulSoup object
parsed_contenido = bs4.BeautifulSoup(contenido, "html.parser")
#print(parsed_contenido)
type(parsed_contenido)

bs4.BeautifulSoup

In [85]:
tags = [tag for tag in parsed_contenido.find_all('article')]
#print(tags)
type(tags)

list

In [86]:
tags[0].h1.a.attrs

{'data-hydro-click': '{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":130929,"originating_url":"https://github.com/trending/developers","user_id":null}}',
 'data-hydro-click-hmac': 'db59e47e8f5052ab99cffecf102de694f970c34eac257c5d19ef53b749ae2bec',
 'href': '/hrydgard',
 'data-view-component': 'true'}

In [87]:
tags[0].h1.a.string.strip()

'Henrik Rydgård'

In [88]:
[tag.h1.a.string.strip() for tag in tags if tag.h1.a.string is not None]

['Henrik Rydgård',
 'Lee Robinson',
 'bmaltais',
 'Jerry Liu',
 'J. Nick Koston',
 'Tom Payne',
 'Rich Harris',
 'Matt Kane',
 'Adi Eyal',
 'Ariel Mashraki',
 'Marten Seemann',
 'str4d',
 'Hassan Rezk Habib',
 'Yifei Zhang',
 'Jeff Dickey',
 'Yair Morgenstern',
 'Guillaume Klein',
 'Mattias Wadman',
 'Hans-Kristian Arntzen',
 'Abhinav Gupta',
 'Charlie Marsh',
 'Roman Gershman',
 'Nathan Rajlich',
 'Ayke',
 'Alessandro Ros']

In [89]:
tags = [tag.a for tag in parsed_contenido.find_all('h1',{'class':['h3', 'lh-condensed']})]
names = [tag.string for tag in tags]

# Transform step: drop missing names and strip surrounding whitespace

names = [name for name in names if name is not None]
names = [name.strip() for name in names]
names

['Henrik Rydgård',
 'Lee Robinson',
 'bmaltais',
 'Jerry Liu',
 'J. Nick Koston',
 'Tom Payne',
 'Rich Harris',
 'Matt Kane',
 'Adi Eyal',
 'Ariel Mashraki',
 'Marten Seemann',
 'str4d',
 'Hassan Rezk Habib',
 'Yifei Zhang',
 'Jeff Dickey',
 'Yair Morgenstern',
 'Guillaume Klein',
 'Mattias Wadman',
 'Hans-Kristian Arntzen',
 'Abhinav Gupta',
 'Charlie Marsh',
 'Roman Gershman',
 'Nathan Rajlich',
 'Ayke',
 'Alessandro Ros']

In [90]:
tags = [tag.a.attrs['href'] for tag in parsed_contenido.find_all('h1',{'class':['h3', 'lh-condensed']})]
users = [tag for tag in tags]

In [91]:
# Transform step
users = [name for name in users if name is not None]
users

['/hrydgard',
 '/hrydgard/ppsspp',
 '/leerob',
 '/leerob/leerob.io',
 '/bmaltais',
 '/bmaltais/kohya_ss',
 '/jerryjliu',
 '/jerryjliu/llama_index',
 '/bdraco',
 '/bdraco/denonavr',
 '/twpayne',
 '/twpayne/chezmoi',
 '/Rich-Harris',
 '/Rich-Harris/devalue',
 '/ascorbic',
 '/ascorbic/impala',
 '/adieyal',
 '/adieyal/sd-dynamic-prompts',
 '/a8m',
 '/a8m/golang-cheat-sheet',
 '/marten-seemann',
 '/marten-seemann/draft-seemann-quic-reliable-stream-reset',
 '/str4d',
 '/str4d/rage',
 '/hassanhabib',
 '/hassanhabib/Standard.AI.OpenAI',
 '/Yidadaa',
 '/Yidadaa/ChatGPT-Next-Web',
 '/jdxcode',
 '/jdxcode/rtx',
 '/yairm210',
 '/yairm210/Unciv',
 '/guillaumekln',
 '/guillaumekln/faster-whisper',
 '/wader',
 '/wader/fq',
 '/HansKristian-Work',
 '/HansKristian-Work/vkd3d-proton',
 '/abhinav',
 '/abhinav/doc2go',
 '/charliermarsh',
 '/charliermarsh/ruff',
 '/romange',
 '/romange/helio',
 '/TooTallNate',
 '/TooTallNate/Java-WebSocket',
 '/aykevl',
 '/aler9',
 '/aler9/rtsp-simple-server']

In [92]:
users = [name.strip().split('/')[1] for name in users]
users

['hrydgard',
 'hrydgard',
 'leerob',
 'leerob',
 'bmaltais',
 'bmaltais',
 'jerryjliu',
 'jerryjliu',
 'bdraco',
 'bdraco',
 'twpayne',
 'twpayne',
 'Rich-Harris',
 'Rich-Harris',
 'ascorbic',
 'ascorbic',
 'adieyal',
 'adieyal',
 'a8m',
 'a8m',
 'marten-seemann',
 'marten-seemann',
 'str4d',
 'str4d',
 'hassanhabib',
 'hassanhabib',
 'Yidadaa',
 'Yidadaa',
 'jdxcode',
 'jdxcode',
 'yairm210',
 'yairm210',
 'guillaumekln',
 'guillaumekln',
 'wader',
 'wader',
 'HansKristian-Work',
 'HansKristian-Work',
 'abhinav',
 'abhinav',
 'charliermarsh',
 'charliermarsh',
 'romange',
 'romange',
 'TooTallNate',
 'TooTallNate',
 'aykevl',
 'aler9',
 'aler9']

--- This is where we are so far

In [93]:
users_2 = list(dict.fromkeys(users))

In [94]:
lista_buena = []
for name, user in zip(names, users_2):
    lista_buena.append(f'{name} ({user})')
lista_buena
# Most users appear twice in the list above because the selector also matches the heading that links to each developer's repository.
# A set removes the duplicates but does not preserve order, so names and users end up mismatched.
# dict.fromkeys removes the duplicates while keeping the original order, so each name is paired with its own user.

['Henrik Rydgård (hrydgard)',
 'Lee Robinson (leerob)',
 'bmaltais (bmaltais)',
 'Jerry Liu (jerryjliu)',
 'J. Nick Koston (bdraco)',
 'Tom Payne (twpayne)',
 'Rich Harris (Rich-Harris)',
 'Matt Kane (ascorbic)',
 'Adi Eyal (adieyal)',
 'Ariel Mashraki (a8m)',
 'Marten Seemann (marten-seemann)',
 'str4d (str4d)',
 'Hassan Rezk Habib (hassanhabib)',
 'Yifei Zhang (Yidadaa)',
 'Jeff Dickey (jdxcode)',
 'Yair Morgenstern (yairm210)',
 'Guillaume Klein (guillaumekln)',
 'Mattias Wadman (wader)',
 'Hans-Kristian Arntzen (HansKristian-Work)',
 'Abhinav Gupta (abhinav)',
 'Charlie Marsh (charliermarsh)',
 'Roman Gershman (romange)',
 'Nathan Rajlich (TooTallNate)',
 'Ayke (aykevl)',
 'Alessandro Ros (aler9)']
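
One way to keep names and handles aligned is to read both from the same anchor element, so no separate list ever needs de-duplicating. The sketch below assumes a made-up `sample_html` in which the developer heading carries the classes `h3 lh-condensed` and the repository heading carries `h4 lh-condensed`; the real page may differ. The selector `h1.h3.lh-condensed` requires both classes, whereas passing a list of classes to `find_all` matches elements that have any one of them.

```python
import bs4

# Hypothetical fragment: one developer heading and one repository heading per article.
sample_html = """
<article>
  <h1 class="h3 lh-condensed"><a href="/hrydgard"> Henrik Rydgård </a></h1>
  <h1 class="h4 lh-condensed"><a href="/hrydgard/ppsspp"><svg></svg> ppsspp </a></h1>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/leerob"> Lee Robinson </a></h1>
  <h1 class="h4 lh-condensed"><a href="/leerob/leerob.io"><svg></svg> leerob.io </a></h1>
</article>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

developers = []
for link in soup.select("h1.h3.lh-condensed > a"):
    name = " ".join(link.get_text().split())  # collapse whitespace in the display name
    handle = link["href"].strip("/")          # '/hrydgard' -> 'hrydgard'
    developers.append(f"{name} ({handle})")

print(developers)  # ['Henrik Rydgård (hrydgard)', 'Lee Robinson (leerob)']
```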

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [14]:
# This is the url you will scrape in this exercise
url2 = 'https://github.com/trending/python?since=daily'

In [15]:
# Build the response object returned by the server
response2 = requests.get(url2)

# Apply .content to the response to extract its body as bytes
contenido2 = response2.content
type(contenido2)

bytes

In [16]:
soup = bs4.BeautifulSoup(contenido2, "html.parser") 
#print(parsed_contenido)
type(soup)

bs4.BeautifulSoup

In [17]:
soup.name

'[document]'

In [18]:
tags2 = [tag2 for tag2 in soup.find_all('article')]
#print(tags)
type(tags2)
# 'article' is used as in the previous exercise: each trending repository is wrapped in its own <article> element.

list

In [19]:
print(tags2)

[<article class="Box-row">
<div class="float-right d-flex">
<div class="BtnGroup d-flex" data-view-component="true">
<a aria-label="You must be signed in to star a repository" class="tooltipped tooltipped-s btn-sm btn BtnGroup-item" data-hydro-click='{"event_type":"authentication.click","payload":{"location_in_page":"star button","repository_id":616372661,"auth_type":"LOG_IN","originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="6fa18fc7723346458dbba9c4f37db34eb24227193d7ec989cf1d8a344f14d830" data-view-component="true" href="/login?return_to=%2Fbinary-husky%2Fchatgpt_academic" rel="nofollow"> <svg aria-hidden="true" class="octicon octicon-star v-align-text-bottom d-none d-md-inline-block mr-2" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 0 1 .673.418l1.882 3.815 4.21.612a.75.75 0 0 1 .416 1.279l-3.046 2.97.719 4.192a.751.751 0 0 1-1.088.791L8 12.347l-3.766 1.98a.75.75

In [20]:
tags2[0].h1.a.text.replace('\n', '').strip()

'binary-husky /      chatgpt_academic'

In [21]:
tags2[0].h1.a.text.replace('\n', '').strip()
#repo = ' '.join(repo.split())
#print(repo)
# Strip line breaks and surrounding whitespace from the text

'binary-husky /      chatgpt_academic'

In [22]:
listado = [tag.h1.a.text.replace('\n', '').strip() for tag in tags2]
listado

['binary-husky /      chatgpt_academic',
 'openai /      chatgpt-retrieval-plugin',
 'BlinkDL /      RWKV-LM',
 'acantril /      learn-cantrill-io-labs',
 'databrickslabs /      dolly',
 'gd3kr /      BlenderGPT',
 'n3d1117 /      chatgpt-telegram-bot',
 'sahil280114 /      codealpaca',
 'sdatkinson /      NeuralAmpModelerPlugin',
 'lllyasviel /      ControlNet',
 'BlinkDL /      ChatRWKV',
 'geohot /      tinygrad',
 'svc-develop-team /      so-vits-svc',
 'GammaTauAI /      reflexion-human-eval',
 'FMInference /      FlexGen',
 'bmaltais /      kohya_ss',
 '34j /      so-vits-svc-fork',
 'Stability-AI /      stablediffusion',
 'pelennor2170 /      NAM_models',
 'stochasticai /      xturing',
 'mindsdb /      mindsdb',
 'fkunn1326 /      openpose-editor',
 'showlab /      Tune-A-Video',
 'gururise /      AlpacaDataCleaned',
 'sympy /      sympy']

In [23]:
listado_sin_espacios = ["/".join(elem.split()).replace('///','/') for elem in listado]
listado_sin_espacios

['binary-husky/chatgpt_academic',
 'openai/chatgpt-retrieval-plugin',
 'BlinkDL/RWKV-LM',
 'acantril/learn-cantrill-io-labs',
 'databrickslabs/dolly',
 'gd3kr/BlenderGPT',
 'n3d1117/chatgpt-telegram-bot',
 'sahil280114/codealpaca',
 'sdatkinson/NeuralAmpModelerPlugin',
 'lllyasviel/ControlNet',
 'BlinkDL/ChatRWKV',
 'geohot/tinygrad',
 'svc-develop-team/so-vits-svc',
 'GammaTauAI/reflexion-human-eval',
 'FMInference/FlexGen',
 'bmaltais/kohya_ss',
 '34j/so-vits-svc-fork',
 'Stability-AI/stablediffusion',
 'pelennor2170/NAM_models',
 'stochasticai/xturing',
 'mindsdb/mindsdb',
 'fkunn1326/openpose-editor',
 'showlab/Tune-A-Video',
 'gururise/AlpacaDataCleaned',
 'sympy/sympy']
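
The whitespace clean-up can also be written without the intermediate `replace`. A minimal sketch, using two of the raw strings shown above as sample input (`raw_names` and `clean_names` are illustrative names):

```python
raw_names = [
    'binary-husky /      chatgpt_academic',
    'geohot /      tinygrad',
]

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with an empty string removes all of it.
clean_names = ["".join(name.split()) for name in raw_names]
print(clean_names)  # ['binary-husky/chatgpt_academic', 'geohot/tinygrad']
```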

#### Display all the image links from the Walt Disney Wikipedia page.

In [24]:
# This is the url you will scrape in this exercise
url3 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [25]:
# Build the response object returned by the server
response3 = requests.get(url3)

# Apply .content to the response to extract its body as bytes
contenido3 = response3.content

In [26]:
soup3 = bs4.BeautifulSoup(contenido3, "html.parser") 

In [27]:
etiq = soup3.find_all('img')

In [28]:
etiq[0]

<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>

In [29]:
image_links = [img['src'] for img in soup3.find_all('img') if img.has_attr('src')]
image_links

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_p

In [30]:
list_cl = [url.split(',') for url in image_links]
list_cl
# Create a new variable in which each link is its own element.
# split() is applied to each element because it cannot be applied to the whole list.
# Note: image_links already holds one link per element, so this only wraps each link in a one-item list.

[['/static/images/icons/wikipedia.png'],
 ['/static/images/mobile/copyright/wikipedia-wordmark-en.svg'],
 ['/static/images/mobile/copyright/wikipedia-tagline-en.svg'],
 ['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png'],
 ['//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png'],
 ['//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG'],
 ['//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png'],
 ['//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg'],
 ['//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg'],
 ['//upload.wikimedia.org/wikipedia/commons/thumb/0/0d

In [31]:
type(list_cl)

list
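
Many of the `src` values above are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`). If full URLs are wanted, one option is `urllib.parse.urljoin` from the standard library. The sketch below reuses the page URL and two of the links shown above; `absolute_links` is an illustrative name.

```python
from urllib.parse import urljoin

page_url = 'https://en.wikipedia.org/wiki/Walt_Disney'
image_links = [
    '/static/images/icons/wikipedia.png',
    '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
]

# urljoin resolves each link against the page it was found on.
absolute_links = [urljoin(page_url, link) for link in image_links]
for link in absolute_links:
    print(link)
```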

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [32]:
# This is the url you will scrape in this exercise
url4 ='https://en.wikipedia.org/wiki/Python' 

In [33]:
# Build the response object returned by the server
response4 = requests.get(url4)

# Apply .content to the response to extract its body as bytes
contenido4 = response4.content

In [34]:
soup4 = bs4.BeautifulSoup(contenido4, "html.parser") 

In [35]:
etiq4 = soup4.find_all('a')

In [59]:
etiq4

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>,
 <a href=

In [37]:
any_links = [(a['href'], a['title']) for a in soup4.find_all('a') if a.has_attr('href') and a.has_attr('title')]
any_links
# Both attributes, href and title, are required because filtering on href alone returned many stray entries that did not look like proper links.

[('/wiki/Main_Page', 'Visit the main page [z]'),
 ('/wiki/Wikipedia:Contents', 'Guides to browsing Wikipedia'),
 ('/wiki/Portal:Current_events', 'Articles related to current events'),
 ('/wiki/Special:Random', 'Visit a randomly selected article [x]'),
 ('/wiki/Wikipedia:About', 'Learn about Wikipedia and how it works'),
 ('//en.wikipedia.org/wiki/Wikipedia:Contact_us', 'How to contact Wikipedia'),
 ('https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
  'Support us by donating to the Wikimedia Foundation'),
 ('/wiki/Help:Contents', 'Guidance on how to use and edit Wikipedia'),
 ('/wiki/Help:Introduction', 'Learn how to edit Wikipedia'),
 ('/wiki/Wikipedia:Community_portal', 'The hub for editors'),
 ('/wiki/Special:RecentChanges', 'A list of recent changes to Wikipedia [r]'),
 ('/wiki/Wikipedia:File_upload_wizard',
  'Add images or other media for use on Wikipedia'),
 ('/wiki/Special:Search', 

In [38]:
df1 = pd.DataFrame(any_links, columns=['HREF', 'TITLE'])
print(df1)
# Converted to a DataFrame so it is easier to read.

                                                  HREF  \
0                                      /wiki/Main_Page   
1                             /wiki/Wikipedia:Contents   
2                          /wiki/Portal:Current_events   
3                                 /wiki/Special:Random   
4                                /wiki/Wikipedia:About   
..                                                 ...   
119  /wiki/Category:Disambiguation_pages_with_given...   
120  /wiki/Category:Short_description_is_different_...   
121    /wiki/Category:All_article_disambiguation_pages   
122            /wiki/Category:All_disambiguation_pages   
123  /wiki/Category:Animal_common_name_disambiguati...   

                                                 TITLE  
0                              Visit the main page [z]  
1                         Guides to browsing Wikipedia  
2                   Articles related to current events  
3                Visit a randomly selected article [x]  
4               Le

In [39]:
type(any_links)

list

In [40]:
list_cl2 = [url4.split(',') for url in any_links]
list_cl2
# I am not sure this is the right approach.
# any_links above is a list of tuples.
# The aim is to build a list of lists, splitting on "," because the second element is the title.
# The output repeats the same URL on every row because the comprehension splits url4 (the page URL) instead of the loop variable url; a tuple has no split() method in any case.
# The task statement is also unclear to me: the page has a very large number of links, and I am not sure whether a "link" and a "hyperlink" are the same thing.

[['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://en.wikipedia.org/wiki/Python'],
 ['https://

In [41]:
df = pd.DataFrame(list_cl2,columns=['URL'])


In [42]:
df

     URL
0    https://en.wikipedia.org/wiki/Python
1    https://en.wikipedia.org/wiki/Python
2    https://en.wikipedia.org/wiki/Python
3    https://en.wikipedia.org/wiki/Python
4    https://en.wikipedia.org/wiki/Python
...  ...
119  https://en.wikipedia.org/wiki/Python
120  https://en.wikipedia.org/wiki/Python
121  https://en.wikipedia.org/wiki/Python
122  https://en.wikipedia.org/wiki/Python


In [43]:
print(df)

                                      URL
0    https://en.wikipedia.org/wiki/Python
1    https://en.wikipedia.org/wiki/Python
2    https://en.wikipedia.org/wiki/Python
3    https://en.wikipedia.org/wiki/Python
4    https://en.wikipedia.org/wiki/Python
..                                    ...
119  https://en.wikipedia.org/wiki/Python
120  https://en.wikipedia.org/wiki/Python
121  https://en.wikipedia.org/wiki/Python
122  https://en.wikipedia.org/wiki/Python
123  https://en.wikipedia.org/wiki/Python

[124 rows x 1 columns]
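
A list of lists can be built from the `(href, title)` tuples by unpacking each tuple in the comprehension rather than splitting a string. A minimal sketch, using the first three tuples shown earlier as sample data (`link_rows` and `hrefs` are illustrative names):

```python
any_links = [
    ('/wiki/Main_Page', 'Visit the main page [z]'),
    ('/wiki/Wikipedia:Contents', 'Guides to browsing Wikipedia'),
    ('/wiki/Portal:Current_events', 'Articles related to current events'),
]

# Unpack each tuple with the loop variables; the page URL is not involved.
link_rows = [[href, title] for href, title in any_links]

# If only the links themselves are needed, keep the first element of each tuple.
hrefs = [href for href, _ in any_links]

print(link_rows)
print(hrefs)
```

Passing `link_rows` to `pd.DataFrame` would then need two column names, one for the link and one for the title.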


#### Find the number of titles that have changed in the United States Code since its last release point.

In [44]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [47]:
# I do not understand the task statement at all.

#### Create a Python list with the names of the FBI's top ten Most Wanted.

In [45]:
# This is the url you will scrape in this exercise
url5= 'https://www.fbi.gov/wanted/topten'

In [46]:
# Build the response object returned by the server
response5 = requests.get(url5)

# Apply .content to the response to extract its body as bytes
contenido5 = response5.content

In [47]:
soup5 = bs4.BeautifulSoup(contenido5, "html.parser") 
type(soup5)

bs4.BeautifulSoup

In [48]:
etiq5 = soup5.select('h3.title a[href]')
etiq5

[<a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>,
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>,
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>,
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>,
 <a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a>,
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>,
 <a href="https://www.fbi.gov/wanted/topten/michael-james-pratt">MICHAEL JAMES PRATT</a>,
 <a href="https://www.fbi.gov/wanted/topten/jose-rodolfo-villarreal-hernandez">JOSE RODOLFO VILLARREAL-HERNANDEZ</a>,
 <a href="https://www.fbi.gov/wanted/topten/rafael-caro-quintero">RAFAEL CARO-QUINTERO</a>]

In [49]:
wanted_names = pd.DataFrame([a.text for a in soup5.select('h3.title a[href]')], columns=['MOST WANTED'])
wanted_names

   MOST WANTED
0  OMAR ALEXANDER CARDENAS
1  ALEXIS FLORES
2  BHADRESHKUMAR CHETANBHAI PATEL
3  ALEJANDRO ROSALES CASTILLO
4  YULAN ADONAY ARCHAGA CARIAS
5  RUJA IGNATOVA
6  ARNOLDO JIMENEZ
7  MICHAEL JAMES PRATT
8  JOSE RODOLFO VILLARREAL-HERNANDEZ
9  RAFAEL CARO-QUINTERO
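
Since the exercise asks for a Python list, the same selector can feed a plain list instead of a DataFrame. This is a sketch on a made-up `sample_html`: the two anchors are copied from the output above, while the surrounding `ul`, `li` and `h3 class="title"` wrappers are assumptions chosen to match the selector.

```python
import bs4

# Hypothetical fragment; only the <a> elements are taken from the real output.
sample_html = """
<ul>
  <li><h3 class="title"><a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a></h3></li>
  <li><h3 class="title"><a href="https://www.fbi.gov/wanted/topten/ruja-ignatova">RUJA IGNATOVA</a></h3></li>
</ul>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the link text without surrounding whitespace.
wanted_list = [a.get_text(strip=True) for a in soup.select("h3.title a[href]")]
print(wanted_list)  # ['ALEXIS FLORES', 'RUJA IGNATOVA']
```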


#### Display info on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [50]:
# This is the url you will scrape in this exercise
url6 = 'https://www.emsc-csem.org/Earthquake/'

In [52]:
# your code here
# Build the response object returned by the server
response6 = requests.get(url6)

# Apply .content to the response to extract its body as bytes
contenido6 = response6.content

In [53]:
soup6 = bs4.BeautifulSoup(contenido6, "html.parser") 
type(soup6)

bs4.BeautifulSoup

In [54]:
#etiq6 = soup6.find_all('td')
#etiq6

In [55]:
etiq6 = soup6.find_all('td', {'class': ['tabev1', 'tabev2', 'tabev3', 'tabev5', 'tb_region']})
etiq6

[<td class="tabev1">40.83 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">14.13 </td>,
 <td class="tabev2">E  </td>,
 <td class="tabev3">2</td>,
 <td class="tabev5" id="magtyp0">Md</td>,
 <td class="tabev2">2.2</td>,
 <td class="tb_region" id="reg0"> SOUTHERN ITALY</td>,
 <td class="tabev1">38.07 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">36.51 </td>,
 <td class="tabev2">E  </td>,
 <td class="tabev3">5</td>,
 <td class="tabev5" id="magtyp1">ML</td>,
 <td class="tabev2">4.6</td>,
 <td class="tb_region" id="reg1"> CENTRAL TURKEY</td>,
 <td class="tabev1">35.76 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">25.45 </td>,
 <td class="tabev2">E  </td>,
 <td class="tabev3">3</td>,
 <td class="tabev5" id="magtyp2">ML</td>,
 <td class="tabev2">2.9</td>,
 <td class="tb_region" id="reg2"> CRETE, GREECE</td>,
 <td class="tabev1">19.19 </td>,
 <td class="tabev2">N  </td>,
 <td class="tabev1">155.39 </td>,
 <td class="tabev2">W  </td>,
 <td class="tabev3">32</td>,
 <

In [56]:
# Keep every <td> whose class list contains one of the earthquake table classes
list_terrem = [terremoto for terremoto in soup6.find_all('td') if 'class' in terremoto.attrs and any(c in terremoto['class'] for c in ['tabev1', 'tabev2', 'tabev3', 'tabev5', 'tb_region'])]


In [58]:
df_terrem= pd.DataFrame(list_terrem)
df_terrem
# I do not know how to finish this: the values end up in a single column instead of one row per earthquake.

     0
0    40.83
1    N
2    14.13
3    E
4    2
...  ...
395  W
396  29
397  M
398  3.4
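
A flat list of cell values like the one above can be regrouped into one row per earthquake. The sketch below assumes that every earthquake contributes exactly eight cells in a fixed order, as the first rows of the earlier output suggest, and that the third numeric cell is the depth; both are assumptions. The sample `cells` are the first sixteen values shown earlier, and in the notebook they could come from something like `[td.get_text() for td in list_terrem]`. Date and time are not among the selected cells, so they are not covered here.

```python
# First two earthquakes from the output above, as raw cell text.
cells = [
    "40.83 ", "N  ", "14.13 ", "E  ", "2", "Md", "2.2", " SOUTHERN ITALY",
    "38.07 ", "N  ", "36.51 ", "E  ", "5", "ML", "4.6", " CENTRAL TURKEY",
]
FIELDS_PER_QUAKE = 8  # assumed number of selected cells per earthquake

rows = []
for start in range(0, len(cells), FIELDS_PER_QUAKE):
    # Take one group of eight cells and strip the padding around each value.
    chunk = [cell.strip() for cell in cells[start:start + FIELDS_PER_QUAKE]]
    lat, ns, lon, ew, depth, mag_type, magnitude, region = chunk
    rows.append({
        "latitude": f"{lat} {ns}",
        "longitude": f"{lon} {ew}",
        "depth": depth,
        "magnitude": f"{magnitude} {mag_type}",
        "region": region,
    })

print(rows[0])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to get one column per key.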


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
# your code here

#### List the different kinds of datasets available on data.gov.uk.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [None]:
# your code here

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
# your code here

## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
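
Once the API responds, the report fields can be read from the JSON body. The sketch below does not call the API: `sample_payload` is a hypothetical response with made-up values, and its field names (`main.temp`, `wind.speed`, `weather[0].main`, `weather[0].description`) are assumed to follow the OpenWeatherMap current-weather documentation linked in the cell above.

```python
import json

# Hypothetical payload shaped like a current-weather response; values are made up.
sample_payload = """
{
  "name": "Madrid",
  "weather": [{"main": "Clouds", "description": "scattered clouds"}],
  "main": {"temp": 18.5},
  "wind": {"speed": 3.6}
}
"""

data = json.loads(sample_payload)

# Pull out the four values the exercise asks for, plus the city name.
report = {
    "city": data["name"],
    "temperature": data["main"]["temp"],
    "wind_speed": data["wind"]["speed"],
    "weather": data["weather"][0]["main"],
    "description": data["weather"][0]["description"],
}
print(report)
```

With a live request, `requests.get(url).json()` would take the place of `json.loads(sample_payload)`.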

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here