# Web Scraping Lab

This notebook contains exercises to practise your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used for the HTML content you are expected to extract.
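
The first tip can be sketched with the standard library's `http.HTTPStatus`, which maps a numeric code to its reason phrase. The helper names `is_success` and `describe_status` are hypothetical; in the lab the code would come from the `status_code` attribute of the object returned by `requests.get`.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for 2xx codes, which mean the request succeeded."""
    return 200 <= status_code < 300


def describe_status(status_code):
    """Return the code with its standard reason phrase, e.g. '404 Not Found'."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


# In the notebook the code would be read from requests.get(url).status_code
for code in (200, 404):
    action = "parse the body" if is_success(code) else "do not parse the body"
    print(describe_status(code), "->", action)
```

A 2xx code only says the request succeeded, so it is still worth printing part of the response text to confirm the page is the one you expected.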

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
html = requests.get(url)

In [4]:
#your code
soup = BeautifulSoup(html.content, "html.parser")

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
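
Step 3 of the instructions can be sketched with plain string methods. The helper name `clean_text` and the sample string are illustrative assumptions; in the notebook the input would be the text of each element returned by BeautifulSoup.

```python
def clean_text(raw, keep_spaces=True):
    """Remove line breaks and extra whitespace from scraped text.

    str.split() with no arguments splits on any run of whitespace
    (spaces, tabs, newlines), so joining the pieces normalises the text.
    """
    separator = " " if keep_spaces else ""
    return separator.join(raw.split())


raw_name = "\n            Linus   Torvalds\n"
print(clean_text(raw_name))                     # words separated by single spaces
print(clean_text(raw_name, keep_spaces=False))  # all whitespace removed
```

Passing `keep_spaces=False` reproduces the style of the expected output, where `Linus Torvalds` appears as `LinusTorvalds`.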

In [5]:
# Getting names and saving them into a variable nombres_
nombres = soup.find_all('h1',{'class': 'h3 lh-condensed'})
nombres_ = []
for nombre in nombres:
    nombres_.append(nombre.getText().strip())
# Getting usernames and saving them into a variable users_
users_ = []
users = soup.find_all('p', {'class': 'f4 text-normal mb-1'})
for user in users:
    users_.append(user.getText().strip())
# Create dataframe with names and usernames; the rank follows the order on the page
df = pd.DataFrame({'Rank': list(range(1, len(nombres_) + 1)), 'Name': nombres_, 'Username': users_})
df.set_index('Rank')

| Rank | Name | Username |
|---|---|---|
| 1 | Jonny Burger | JonnyBurger |
| 2 | Brandon | bee-san |
| 3 | Lennart | LekoArts |
| 4 | Artur Arseniev | artf |
| 5 | Remi Rousselet | rrousselGit |
| 6 | Jeroen Ooms | jeroen |
| 7 | Artem Zakharchenko | kettanaito |
| 8 | Kuitos | kuitos |
| 9 | 二货机器人 | zombieJ |
| 10 | Emiliano Heyns | retorquere |


In [6]:
# Getting names and saving them into a variable nombres_
nombres = soup.find_all('h1',{'class': 'h3 lh-condensed'})
nombres_ = []
for nombre in nombres:
    nombres_.append(nombre.getText().strip())
# Getting usernames and saving them into a variable users_
users_ = []
users = soup.find_all('p', {'class': 'f4 text-normal mb-1'})
for user in users:
    users_.append(user.getText().strip())
# Pair the cleaned strings rather than the raw tags. zip() stops at the
# shorter list, so no index can run past the end of either one.
nam_us = []
for nombre, user in zip(nombres_, users_):
    nam_us.append(f"{user} ({nombre})")
nam_us

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [7]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [8]:
#your code
html = requests.get(url)
soup = BeautifulSoup(html.content, "html.parser")

In [9]:
repos = soup.find_all("h1", {"class": "h3 lh-condensed"})

In [10]:
repos_f = [(repo.text.replace("\n","").replace(" ","")) for repo in repos]
repos_f

['donnemartin/system-design-primer',
 'bee-san/pyWhat',
 'geekcomputers/Python',
 'linkedin/greykite',
 'PyTorchLightning/lightning-flash',
 'Ma-Lab-Berkeley/ReduNet',
 'keras-team/keras',
 'lukemelas/EfficientNet-PyTorch',
 'ultralytics/yolov5',
 'willmcgugan/rich',
 'plctlab/v8-internals',
 'PyCQA/isort',
 'd2l-ai/d2l-zh',
 'openai/gym',
 'microsoft/recommenders',
 'magic-wormhole/magic-wormhole',
 'shenweichen/DeepCTR',
 'projectdiscovery/nuclei-templates',
 'eriklindernoren/ML-From-Scratch',
 'facebookresearch/detectron2',
 'amundsen-io/amundsen',
 'MIC-DKFZ/nnUNet',
 'Project-MONAI/MONAI',
 'frappe/erpnext',
 'EleutherAI/gpt-neo']

#### Display all the image links from the Walt Disney Wikipedia page

In [11]:
# This is the url you will scrape in this exercise
url_W = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [12]:
#your code
html = requests.get(url_W)
soup = BeautifulSoup(html.content, "html.parser")

In [13]:
i = soup.find_all("a", {"class": "image"})
enlaces = [tag.get("href") for tag in i]
# The hrefs already start with "/", so the base URL must not end with one
e_http = ["https://en.wikipedia.org" + enlace for enlace in enlaces]
e_http

['https://en.wikipedia.org/wiki/File:Walt_Disney_1946.JPG',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1942_signature.svg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 'https://en.wikipedia.org/wiki/File:Trolley_Troubles_poster.jpg',
 'https://en.wikipedia.org/wiki/File:Steamboat-willie.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1935.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 'https://en.wikipedia.org/wiki/File:Disney_drawing_goofy.jpg',
 'https://en.wikipedia.org/wiki/File:DisneySchiphol1951.jpg',
 'https://en.wikipedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_disney_portrait_right.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Grave.JPG',
 'https://en.wikipedia.org/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 'https://en.wikipedia.org/wiki/File:Disney_Display_Case.JPG',
 'https://en.wi
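
Relative links such as `/wiki/File:Walt_Disney_1946.JPG` can also be made absolute with `urllib.parse.urljoin` from the standard library, which resolves the path against the base URL and so never doubles the slash. A minimal sketch, using two of the hrefs from this page as sample data:

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"
relative_links = [
    "/wiki/File:Walt_Disney_1946.JPG",
    "/wiki/File:Steamboat-willie.jpg",
]

# urljoin resolves each href against the base URL, as a browser would
absolute_links = [urljoin(base_url, href) for href in relative_links]
for link in absolute_links:
    print(link)
```

The result is the same whether or not `base_url` ends with a slash, which makes this less fragile than string concatenation.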

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [14]:
# This is the url you will scrape in this exercise
url_P ='https://en.wikipedia.org/wiki/Python'
html = requests.get(url_P)
soup = BeautifulSoup(html.content, "html.parser")

In [15]:
#your code
lis = soup.find_all("li")
# find_all() returns a ResultSet (a list of tags), so call find() on each item
links = [li.find("a") for li in lis]

In [16]:
lis[1].a.get("href")

'/wiki/Python_(genus)'

In [17]:
enlaces = []
for x in lis:
    print(x)
    print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
enlaces

<li><a class="mw-redirect" href="/wiki/Pythons" title="Pythons">Pythons</a> or Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
<ul><li><a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>, a genus of nonvenomous Pythonidae found in Africa and Asia</li></ul></li>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<li><a href="/wiki/Python_(genus)" title="Python (genus)"><i>Python</i> (genus)</a>, a genus of nonvenomous Pythonidae found in Africa and Asia</li>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<li class="toclevel-1 tocsection-1"><a href="#Computing"><span class="tocnumber">1</span> <span class="toctext">Computing</span></a></li>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<li class="toclevel-1 tocsection-2"><a href="#People"><span class="tocnumber">2</span> <span class="toctext">People</span></a></li>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<li class="toclevel-1 tocsection-3"><a href="#Roller_coasters"><span class="tocnumber">3</span> <span class

[]

In [18]:
# <li> tags have no href of their own, so read it from the <a> tags instead,
# keeping only internal links whose href starts with /wiki/
enlaces = [a["href"] for a in soup.find_all("a", href=True) if re.match(r"/wiki/", a["href"])]
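
`Tag.get("href")` returns `None` when a tag has no `href` attribute, and `re.match` accepts only strings, so any filter over hrefs needs to skip missing values first. A minimal sketch on plain strings, using hrefs seen earlier in this exercise plus one `None` as sample data (the helper name `internal_links` is hypothetical):

```python
import re


def internal_links(hrefs):
    """Keep only hrefs that point at another article (they start with /wiki/)."""
    return [href for href in hrefs if href is not None and re.match(r"/wiki/", href)]


sample_hrefs = ["/wiki/Pythons", "/wiki/Python_(genus)", "#Computing", None, "#People"]
print(internal_links(sample_hrefs))
```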

#### Number of Titles that have changed in the United States Code since its last release point 

In [19]:
# This is the url you will scrape in this exercise
url_USC = 'http://uscode.house.gov/download/download.shtml'
html = requests.get(url_USC)
soup = BeautifulSoup(html.content, "html.parser")

In [20]:
#your code
titleschanged = soup.find_all("div", {"class": "usctitlechanged"})
print(titleschanged[0].text)



          Title 10 - Armed Forces ٭
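
The question asks for a number, so the remaining step is to count the matched elements and, optionally, tidy their text. A minimal sketch on plain strings: `raw_titles` is sample data standing in for the text of each element in `titleschanged`, and `clean_title` is a hypothetical helper that also drops the trailing `٭` character seen in the output.

```python
raw_titles = [
    "\n\n          Title 10 - Armed Forces ٭\n\n",
]


def clean_title(raw):
    """Strip surrounding whitespace and the trailing marker character."""
    return raw.strip().rstrip("٭").strip()


changed = [clean_title(raw) for raw in raw_titles]
print(len(changed))  # number of changed titles in the sample
print(changed)
```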



####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [21]:
# This is the url you will scrape in this exercise
url_e = 'https://www.emsc-csem.org/Earthquake/'
html = requests.get(url_e)
soup = BeautifulSoup(html.content, "html.parser")

In [22]:
#your code
earthquakes = soup.find_all("tr", {"class": ["ligne1 normal", "ligne2 normal"]})
earthquakes[5]

<tr class="ligne2 normal" id="989263" onclick="go_details(event,989263);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=989263">2021-05-26   16:49:57.0</a></b><i class="ago" id="ago5">47min ago</i></td><td class="tabev1">6.23 </td><td class="tabev2">N  </td><td class="tabev1">94.60 </td><td class="tabev2">E  </td><td class="tabev3">85</td><td class="tabev5" id="magtyp5"> M</td><td class="tabev2">3.5</td><td class="tb_region" id="reg5"> NICOBAR ISLANDS, INDIA REGION</td><td class="comment updatetimeno" id="upd5" style="text-align:right;">2021-05-26 17:10</td></tr>

In [23]:
earthquakes[0]

<tr class="ligne1 normal" id="989268" onclick="go_details(event,989268);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=989268">2021-05-26   17:26:55.1</a></b><i class="ago" id="ago0">10min ago</i></td><td class="tabev1">38.74 </td><td class="tabev2">S  </td><td class="tabev1">176.16 </td><td class="tabev2">E  </td><td class="tabev3">5</td><td class="tabev5" id="magtyp0">M </td><td class="tabev2">3.1</td><td class="tb_region" id="reg0"> NORTH ISLAND OF NEW ZEALAND</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2021-05-26 17:30</td></tr>

In [64]:
lista_earthquakes = []
for f in earthquakes:  # each f is one row of the table
    a = f.find_all("td", {"class": "tabev1"})
    b = f.find_all("td", {"class": "tabev2"})
    c = f.find("td", {"class": "tb_region"})
    eq = { "Fecha_hora" : f.find("b").find("a").text,
            "Latitude": f.find("td", {"class": "tabev1"}).text + (f.find("td", {"class": "tabev2"}).text),
            "Longitude": a[1].text + b[1].text,
            "Region_name": c.text
        }
    lista_earthquakes.append(eq)

In [66]:
data = pd.DataFrame(lista_earthquakes)
data

| | Fecha_hora | Latitude | Longitude | Region_name |
|---|---|---|---|---|
| 0 | 2021-05-26 17:26:55.1 | 38.74 S | 176.16 E | NORTH ISLAND OF NEW ZEALAND |
| 1 | 2021-05-26 17:20:40.1 | 46.25 N | 7.39 E | SWITZERLAND |
| 2 | 2021-05-26 16:53:36.0 | 46.84 N | 121.76 W | MOUNT RAINIER AREA, WASHINGTON |
| 3 | 2021-05-26 16:53:28.9 | 35.08 N | 95.34 W | OKLAHOMA |
| 4 | 2021-05-26 16:51:32.1 | 42.48 N | 2.05 E | PYRENEES |
| 5 | 2021-05-26 16:49:57.0 | 6.23 N | 94.60 E | NICOBAR ISLANDS, INDIA REGION |
| 6 | 2021-05-26 16:33:02.0 | 0.77 S | 128.19 E | HALMAHERA, INDONESIA |
| 7 | 2021-05-26 16:27:43.7 | 36.32 N | 95.75 W | OKLAHOMA |
| 8 | 2021-05-26 16:17:48.0 | 10.12 S | 75.74 W | CENTRAL PERU |
| 9 | 2021-05-26 16:13:27.0 | 11.30 N | 87.01 W | NEAR COAST OF NICARAGUA |
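
The exercise asks for date and time as separate fields, and numeric coordinates are easier to sort or plot than strings such as `38.74 S`. A minimal sketch of both conversions on plain strings: the helper names are hypothetical, the sample values come from the first row, and the code assumes the usual convention that southern and western hemispheres map to negative numbers.

```python
def split_timestamp(stamp):
    """Split 'YYYY-MM-DD   HH:MM:SS.s' into (date, time) on any whitespace."""
    date, time = stamp.split()
    return date, time


def signed_coordinate(value, hemisphere):
    """Convert e.g. ('38.74 ', 'S  ') to -38.74; S and W become negative."""
    number = float(value)  # float() ignores surrounding whitespace
    return -number if hemisphere.strip() in ("S", "W") else number


print(split_timestamp("2021-05-26   17:26:55.1"))
print(signed_coordinate("38.74 ", "S  "), signed_coordinate("176.16 ", "E  "))
```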


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [None]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'
url_hack = 'https://hackevents.co/search/anything/anywhere/anytime'

# The links do not work

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [67]:
# This is the url you will scrape in this exercise
url_WI = 'https://www.wikipedia.org/'
html = requests.get(url_WI)
soup = BeautifulSoup(html.content, "html.parser")

In [76]:
#your code
contenido = soup.find_all("div", {"dir": ["ltr"]})
for c in contenido:
    print(c.text)



English
6 299 000+ articles




日本語
1 268 000+ 記事




Español
1 684 000+ artículos




Deutsch
2 576 000+ Artikel




Русский
1 724 000+ статей




Français
2 329 000+ articles




Italiano
1 693 000+ voci




中文
1 197 000+ 條目




Português
1 066 000+ artigos




Polski
1 473 000+ haseł
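
Each printed block holds a language name followed by an article count, so it can be turned into a `(name, count)` pair. A minimal sketch on plain strings: `parse_language_block` is a hypothetical helper and the samples are copied from the output. Removing every non-digit character handles both the thousands separators and the localised word for "articles".

```python
import re


def parse_language_block(text):
    """Return (language, article_count) from a two-line text block."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    language, articles = lines[0], lines[1]
    count = int(re.sub(r"\D", "", articles))  # keep the digits only
    return language, count


blocks = ["\n\nEnglish\n6 299 000+ articles\n\n", "\n\n日本語\n1 268 000+ 記事\n\n"]
print([parse_language_block(block) for block in blocks])
```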




#### A list with the different kinds of datasets available in data.gov.uk

In [81]:
# This is the url you will scrape in this exercise
url_UK = 'https://data.gov.uk/'
html = requests.get(url_UK)
soup = BeautifulSoup(html.content, "html.parser")

In [88]:
#your code 
datasets = soup.find_all("h3", {"class": "govuk-heading-s dgu-topics__heading"})
data_available = []
for i in datasets:
    data_available.append(i.text)
data_available

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [89]:
# This is the url you will scrape in this exercise
url_LANG = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
html = requests.get(url_LANG)
soup = BeautifulSoup(html.content, "html.parser")

In [120]:
#your code
languages = soup.find_all("td")
l_lan = []
for i in languages:
    # find() returns None (the object, not the string "None") when a cell has no link
    b = i.find("a")
    print(b)

None
None
None
<a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>
None
None
<a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
<a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>
None
<a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>
None
None
<a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
<a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
None
<a href="/wiki/English_language" title="English language">English</a>
None
None
<a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
<a href="/wiki/Germanic_languages" title="Germanic languages">Germanic</a>
None
<a href="/wiki/Hindi" title="Hindi">Hindi</a>
None
None
<a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
<a href="/wiki/Indo-Aryan_languages" title="Indo-Aryan languages">Indo-Aryan</a>
No

None
None
<a href="/wiki/Greek_language" title="Greek language">Greek</a>
None
None
None
<a href="/wiki/Chewa_language" title="Chewa language">Chewa</a>
None
None
None
<a class="mw-redirect" href="/wiki/Deccan_language" title="Deccan language">Deccan</a>
None
None
None
<a href="/wiki/Akan_language" title="Akan language">Akan</a>
None
None
None
<a href="/wiki/Kazakh_language" title="Kazakh language">Kazakh</a>
None
None
None
<a href="/wiki/Northern_Min" title="Northern Min">Northern Min</a>
None
None
None
<a href="/wiki/Sylheti_language" title="Sylheti language">Sylheti</a>
None
None
None
<a href="/wiki/Zulu_language" title="Zulu language">Zulu</a>
None
None
None
<a href="/wiki/Czech_language" title="Czech language">Czech</a>
None
None
None
<a href="/wiki/Kinyarwanda" title="Kinyarwanda">Kinyarwanda</a>
None
None
None
<a href="/wiki/Dhundari_language" title="Dhundari language">Dhundhari</a>
<a href="#cite_note-Hindi-14">[b]</a>
None
None
<a href="/wiki/Haitian_Creole" title="Haitian Cre

### BONUS QUESTIONS

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code
