# Web Scraping Lab

This notebook contains some web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
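
The first tip above, checking the status code of every response, can be sketched with the standard library alone. The helper name `describe_status` is hypothetical and is not part of `requests`; it only classifies an integer status code, such as the `status_code` attribute of a `requests` response.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, label) for an HTTP status code.

    Hypothetical helper for this lab: a code counts as a success only
    when it falls in the 2xx range.
    """
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    is_success = 200 <= code < 300
    return is_success, f"{code} {status.phrase}"


# Illustrative codes: one success, one missing page, one rate limit
for code in (200, 404, 429):
    print(describe_status(code))
```

In the cells below, this could be called as `describe_status(res.status_code)` right after `requests.get(url)`, before parsing anything.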

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup`, and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
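
The clean-up described in step 3 can be sketched as follows. `clean_name` is a hypothetical helper, and the sample string is made up to mimic the text of an HTML element. It relies on `str.split()` with no arguments, which splits on any run of whitespace, including `\n`.

```python
def clean_name(raw_text, keep_spaces=False):
    """Remove line breaks and surrounding whitespace from scraped text.

    With keep_spaces=False every whitespace character is dropped, which
    matches the expected output above. With keep_spaces=True, runs of
    whitespace are collapsed to a single space instead.
    """
    parts = raw_text.split()  # splits on spaces, tabs, and newlines
    separator = " " if keep_spaces else ""
    return separator.join(parts)


# Made-up sample resembling the text of a scraped element
raw = "\n      Charles-Axel Dein\n    "
print(clean_name(raw))
print(clean_name(raw, keep_spaces=True))
```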

In [4]:
#your code
div= [i for i in soup.find_all("h1", {"class":"h3 lh-condensed"})]
results= []

for tag in div:
    a_tag= tag.find("a")
    developer_name= a_tag.text.strip()
    developer_nickname= a_tag.get("href").replace("/","")
    results.append(f"{developer_name} ({developer_nickname})")

for i in results:
    print(i)

Erik Bernhardsson (erikbern)
Hiroshiba (Hiroshiba)
Yair Morgenstern (yairm210)
Tom Payne (twpayne)
Hao Wu (swuecho)
Harrison Chase (hwchase17)
Matthias Fey (rusty1s)
sinclairzx81 (sinclairzx81)
Vectorized (Vectorized)
Nuno Campos (nfcampos)
Alessandro Ros (aler9)
Juan Font (juanfont)
Etienne BAUDOUX (veler)
Sander Verweij (sverweij)
Michael Lynch (mtlynch)
triple Mu (triple-Mu)
Oleh Dokuka (OlegDokuka)
Dongdong Tian (seisman)
vector (NewByVector)
Hadley Wickham (hadley)
Ana Hobden (Hoverbear)
boojack (boojack)
HeYunfei (hyf0)
Priyankar Pal (priyankarpal)
Charlie Marsh (charliermarsh)


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [6]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [7]:
div= [i for i in soup.find_all("h1", {"class":"h3 lh-condensed"})]
results= []

for tag in div:
    a_tag= tag.find("a")
    href_tag= a_tag.get("href")
    elements= href_tag.split("/")
    del elements[0]
    user= elements[0]
    repository_name= elements[1]
    results.append(f"{user} ({repository_name})")
        
for i in results:
    print(i)

vinta (awesome-python)
sdatkinson (neural-amp-modeler)
baaivision (Painter)
nomic-ai (gpt4all-ui)
Winfredy (SadTalker)
oobabooga (text-generation-webui)
biobootloader (wolverine)
chroma-core (chroma)
IDEA-Research (GroundingDINO)
THUDM (ChatGLM-6B)
TabbyML (tabby)
liujing04 (Retrieval-based-Voice-Conversion-WebUI)
rondinellimorais (facial-expression-recognition)
Torantulino (Auto-GPT)
sdatkinson (NeuralAmpModelerPlugin)
abetlen (llama-cpp-python)
rokstrnisa (Robo-GPT)
erikbern (ann-benchmarks)
jackfrued (Python-100-Days)
hwchase17 (langchain)
AUTOMATIC1111 (stable-diffusion-webui)
TheAlgorithms (Python)
whitead (paper-qa)
imClumsyPanda (langchain-ChatGLM)
fauxpilot (fauxpilot)
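
Splitting the `href` of each repository link into owner and repository name, which the cell above does with `split` and `del`, can also be sketched with tuple unpacking. `split_repo_path` is a hypothetical helper, and it assumes every link has the form `/owner/repository`.

```python
def split_repo_path(href):
    """Split a link such as '/owner/repository' into its two parts.

    Raises ValueError when the path does not contain both parts.
    """
    # strip("/") removes the leading slash; maxsplit=1 keeps the rest intact
    owner, repository = href.strip("/").split("/", 1)
    return owner, repository


owner, repository = split_repo_path("/vinta/awesome-python")
print(f"{owner} ({repository})")
```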


#### Display all the image links from the Walt Disney Wikipedia page

In [8]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [9]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [10]:
div= [i for i in soup.find_all("a", {"class":"image"})]
results= []

for tag in div:
    href_tag= tag.get("href")
    results.append(f"https://commons.wikimedia.org{href_tag}")
        
for i in results:
    print(i)

https://commons.wikimedia.org/wiki/File:Walt_Disney_1946.JPG
https://commons.wikimedia.org/wiki/File:Walt_Disney_1942_signature.svg
https://commons.wikimedia.org/wiki/File:Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
https://commons.wikimedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg
https://commons.wikimedia.org/wiki/File:Trolley_Troubles_poster.jpg
https://commons.wikimedia.org/wiki/File:Steamboat-willie.jpg
https://commons.wikimedia.org/wiki/File:Walt_Disney_1935.jpg
https://commons.wikimedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg
https://commons.wikimedia.org/wiki/File:Disney_drawing_goofy.jpg
https://commons.wikimedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg
https://commons.wikimedia.org/wiki/File:Walt_disney_portrait_right.jpg
https://commons.wikimedia.org/wiki/File:Walt_Disney_Grave.JPG
https://commons.wikimedia.org/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg
https://commons.wikimedia.org/wiki/File:D
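
Links scraped from a page are often relative, so they need a host before they can be opened. Below is a minimal sketch using `urllib.parse.urljoin` from the standard library, which resolves a link against the URL of the page it was found on. The sample `href` values are illustrative assumptions about what such a page may contain.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Illustrative href values, not scraped data
sample_hrefs = [
    "/wiki/File:Walt_Disney_1946.JPG",     # root-relative link
    "//upload.wikimedia.org/example.jpg",  # protocol-relative link (made-up path)
    "https://commons.wikimedia.org/wiki/File:Steamboat-willie.jpg",  # already absolute
]

# urljoin leaves absolute URLs untouched and completes the others
absolute_links = [urljoin(page_url, href) for href in sample_hrefs]

for link in absolute_links:
    print(link)
```

Note that this resolves links against the host of the page itself, whereas the cell above prepends `https://commons.wikimedia.org` by hand.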

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [11]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'

In [12]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [13]:
div= soup.find("div", {"class":"mw-parser-output"})
results= []
a_tag= div.find_all("a", href=True) # skip anchors that have no href attribute

for i in a_tag:
    href= i.get("href")
    if href.startswith("/wiki"):
        results.append(f"https://en.wikipedia.org{href}")

results.pop() # not part of the list of results for the arbitrary search; it is another element of the div
results.pop() # not part of the list of results for the arbitrary search; it is another element of the div
for i in results:
    print(i)

https://en.wikipedia.org/wiki/Pythonidae
https://en.wikipedia.org/wiki/Python_(genus)
https://en.wikipedia.org/wiki/Python_(mythology)
https://en.wikipedia.org/wiki/Python_(programming_language)
https://en.wikipedia.org/wiki/CMU_Common_Lisp
https://en.wikipedia.org/wiki/PERQ#PERQ_3
https://en.wikipedia.org/wiki/Python_of_Aenus
https://en.wikipedia.org/wiki/Python_(painter)
https://en.wikipedia.org/wiki/Python_of_Byzantium
https://en.wikipedia.org/wiki/Python_of_Catana
https://en.wikipedia.org/wiki/Python_Anghelo
https://en.wikipedia.org/wiki/Python_(Efteling)
https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)
https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
https://en.wikipedia.org/wiki/Python_(automobile_maker)
https://en.wikipedia.org/wiki/Python_(Ford_prototype)
https://en.wikipedia.org/wiki/Python_(missile)
https://en.wikipedia.org/wiki/Python_(nuclear_primary)
https://en.wikipedia.org/wiki/Colt_Python
https://en.wikipedia.org/wiki/Python_(codename)
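
The link filtering above can also be sketched without third-party packages. This example swaps in the standard library's `html.parser` in place of BeautifulSoup, purely for illustration; the class name `WikiLinkCollector` and the sample HTML are assumptions, not content from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WikiLinkCollector(HTMLParser):
    """Collect absolute URLs for every <a> whose href starts with /wiki/."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the anchor has no href
        if href and href.startswith("/wiki/"):
            self.links.append(urljoin(self.base_url, href))


# Made-up HTML: one anchor without href, one in-page anchor, two article links
sample_html = (
    '<div><a name="top">top</a>'
    '<a href="/wiki/Pythonidae">Pythonidae</a>'
    '<a href="#cite_note-1">[1]</a>'
    '<a href="/wiki/Colt_Python">Colt Python</a></div>'
)

collector = WikiLinkCollector("https://en.wikipedia.org/wiki/Python")
collector.feed(sample_html)
print(collector.links)
```

Checking that `href` is not `None` before calling `startswith` avoids an `AttributeError` on anchors that carry no `href` attribute.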

#### Number of Titles that have changed in the United States Code since its last release point 

In [14]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [15]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [16]:
divs= soup.find_all("div", {"class":"usctitle"})
results= []

for i in divs:
    if i.text.strip().startswith("Title"):
        results.append(i.text.strip())

print(f"The number of Titles that have changed in the United States Code since its last release point is: {len(results)}")

The number of Titles that have changed in the United States Code since its last release point is: 53


####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [18]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [19]:
div= soup.find("tbody", {"id":"tbody"})

# limit the results to 20
div20= []
limit= 20
for i in div:
    if limit > 0 and i != "\n":
        div20.append(i)
        limit-=1

# dict used to build the dataframe
data= {
    "date":[],
    "time":[],
    "latitude":[],
    "longitude":[],
    "region name":[],
}

for i in div20:
    a_tag= i.find("td", {"class":"tabev6"}).find("a").text

    # date
    date= a_tag[:10]
    data["date"].append(date)

    # time
    time= a_tag[-10:]
    data["time"].append(time)

    # latitude
    tabev1= i.find_all("td", {"class":"tabev1"}) # two td elements share this class: one for latitude and one for longitude
    tabev2= i.find_all("td", {"class":"tabev2"}) # three td elements share this class: latitude, longitude, and "Mag", which is ignored here
    pre_latitude= tabev1[0].text
    latitude_number= ""
    for c in pre_latitude:
        if c.isdigit() or c == ".":
            latitude_number+= c
    latitude_word= tabev2[0].text[0]
    latitude= latitude_number + " " + latitude_word
    data["latitude"].append(latitude)

    # longitude
    pre_longitude= tabev1[1].text
    longitude_number= ""
    for c in pre_longitude:
        if c.isdigit() or c == ".":
            longitude_number+= c
    longitude_word= tabev2[1].text[0]
    longitude= longitude_number + " " + longitude_word
    data["longitude"].append(longitude)

    # region_name
    region_name= i.find("td", {"class":"tb_region"}).text[1:]
    data["region name"].append(region_name)

# dataframe
df= pd.DataFrame(data)
df

| | date | time | latitude | longitude | region name |
|---|---|---|---|---|---|
| 0 | 2023-04-11 | 15:13:56.6 | 38.16 N | 38.55 E | EASTERN TURKEY |
| 1 | 2023-04-11 | 15:11:44.0 | 8.01 S | 116.03 E | LOMBOK REGION, INDONESIA |
| 2 | 2023-04-11 | 15:10:07.0 | 2.93 S | 128.76 E | CERAM SEA, INDONESIA |
| 3 | 2023-04-11 | 15:00:12.7 | 37.83 N | 36.74 E | CENTRAL TURKEY |
| 4 | 2023-04-11 | 14:56:02.0 | 9.50 S | 115.82 E | SOUTH OF BALI, INDONESIA |
| 5 | 2023-04-11 | 14:55:00.0 | 24.16 S | 67.55 W | SALTA, ARGENTINA |
| 6 | 2023-04-11 | 14:32:45.5 | 47.37 N | 6.91 E | SWITZERLAND |
| 7 | 2023-04-11 | 14:27:48.0 | 21.64 S | 68.58 W | ANTOFAGASTA, CHILE |
| 8 | 2023-04-11 | 14:22:45.7 | 38.73 N | 39.69 E | EASTERN TURKEY |
| 9 | 2023-04-11 | 14:12:28.1 | 17.94 N | 66.96 W | PUERTO RICO REGION |
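
The character-by-character loops used above to pull the number out of each coordinate cell can be sketched more compactly with a regular expression. Both helper names are hypothetical, and the sample cell texts (including the non-breaking spaces) are assumptions about what the table cells contain.

```python
import re


def parse_coordinate(number_text, hemisphere_text):
    """Combine a numeric cell and a hemisphere cell into e.g. '38.16 N'."""
    match = re.search(r"\d+(?:\.\d+)?", number_text)  # first integer or decimal
    if match is None:
        raise ValueError(f"no number found in {number_text!r}")
    hemisphere = hemisphere_text.strip()[0].upper()
    return f"{match.group()} {hemisphere}"


def to_signed_degrees(coordinate):
    """Turn '8.01 S' into -8.01; south and west are negative by convention."""
    value, hemisphere = coordinate.split()
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * float(value)


# Assumed cell contents, padded with non-breaking spaces
latitude = parse_coordinate("8.01\xa0", "S\xa0\xa0")
print(latitude)
print(to_signed_degrees(latitude))
```

Signed floats are easier to sort, filter, or plot in pandas than strings such as `8.01 S`.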


#### Display the date, days, title, city, and country of the next 25 hackathon events as a pandas DataFrame

In [20]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'
url_hack = 'https://hackevents.co/search/anything/anywhere/anytime' 

In [21]:
#your code

# I cannot access this website. It seems it no longer exists(?), not even when accessing it from its LinkedIn page

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [22]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [23]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [24]:
from bs4 import element

div= soup.find("div", {"class":"central-featured"})

# filter the bs4 elements found in the div to keep only the Tag objects
c_div= []
for i in div:
    if isinstance(i, element.Tag):
        c_div.append(i)

ranking= {}

for i in c_div:
    # language
    language= i.find("strong").text
    if language == "Русский":
        language= "Russian"
    if language == "日本語":
        language= "Japanese"
    if language == "中文":
        language= "Zhongwen"
    if language == "فارسی":
        language= "Farsi"
    # related articles
    rel_artc= i.find("bdi", {"dir":"ltr"}).text
    
    ranking[language]= rel_artc

print("List ordered by views/day ranking, as the elements are received:")
n=1
for k,v in ranking.items():
    if len(k) < 8:
        print(f"{n}.\t{k}\t\t-\t{v}")
    else:
        print(f"{n}.\t{k}\t-\t{v}")
    n+=1

print("\nList ordered as the divs appear on the page:")

order_dict= {
    1:"Español",
    2:"English",
    3:"Russian",
    4:"Japanese",
    5:"Deutsch",
    6:"Français",
    7:"Italiano",
    8:"Zhongwen",
    9:"Farsi",
    10:"Português",
}

# for each key/value in order_dict, print the key and value followed by the matching value from the ranking dict
for k,v in order_dict.items():
    if v in ranking.keys():
        if len(v) < 8:
            print(f"{k}.\t{v}\t\t-\t{ranking[v]}")
        else:
            print(f"{k}.\t{v}\t-\t{ranking[v]}")

List ordered by views/day ranking, as the elements are received:
1.	English		-	6 638 000+
2.	Russian		-	1 905 000+
3.	Español		-	1 851 000+
4.	Japanese	-	1 368 000+
5.	Deutsch		-	2 788 000+
6.	Français	-	2 510 000+
7.	Italiano	-	1 805 000+
8.	Zhongwen	-	1 344 000+
9.	Farsi		-	957 000+
10.	Português	-	1 103 000+

List ordered as the divs appear on the page:
1.	Español		-	1 851 000+
2.	English		-	6 638 000+
3.	Russian		-	1 905 000+
4.	Japanese	-	1 368 000+
5.	Deutsch		-	2 788 000+
6.	Français	-	2 510 000+
7.	Italiano	-	1 805 000+
8.	Zhongwen	-	1 344 000+
9.	Farsi		-	957 000+
10.	Português	-	1 103 000+
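
The article counts above are strings such as `6 638 000+`, and the columns are aligned by choosing one or two tabs depending on the name length. A minimal sketch of two alternatives follows: converting a count to an integer, and aligning columns with f-string widths. Both function names are hypothetical, and the separator inside the counts is assumed to be some kind of space.

```python
def parse_article_count(text):
    """Convert a count such as '6 638 000+' to the integer 6638000.

    Keeps only the digits, so any kind of space and the trailing '+'
    are ignored. Raises ValueError if the text contains no digits.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)


def format_row(position, language, count_text):
    """Align columns with fixed widths instead of a variable number of tabs."""
    return f"{position:>2}. {language:<10} - {count_text}"


print(parse_article_count("6 638 000+"))
print(format_row(1, "English", "6 638 000+"))
print(format_row(10, "Português", "1 103 000+"))
```

With integer counts, the languages could also be sorted by number of articles using `sorted` and a `key` function.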


#### A list with the different kinds of datasets available in data.gov.uk

In [25]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [26]:
#your code 
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [27]:
ul= soup.find("ul", {"class":"govuk-list dgu-topics__list"}).find_all("li")

the_list= []

for i in ul:
    dataset_kind= i.find("a").text
    the_list.append(dataset_kind)

for i in the_list:
    print(i)

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport
Digital service performance
Government reference data


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [108]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [109]:
#your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [110]:
data= {
    "Language":[],
    "Native speakers (millions)":[],
    "Language family":[],
    "Branch":[],
}

tbody= soup.find("tbody")
trs= tbody.find_all("tr")
del trs[0] # even though only <tbody> was requested, the header <tr> from <thead> came along, so drop it

for i in trs:
    tds= i.find_all("td")
    language= tds[0].find("a").getText()
    data["Language"].append(language)
    native_speakers= tds[1].getText()[:-1]
    data["Native speakers (millions)"].append(native_speakers)
    language_family= tds[2].find("a").getText()
    data["Language family"].append(language_family)
    branch= tds[3].getText()[:-1]
    data["Branch"].append(branch)

index= list(range(1, len(data["Language"])+1))
df= pd.DataFrame(data, index=index)
top10= df.head(10)
top10

| | Language | Native speakers (millions) | Language family | Branch |
|---|---|---|---|---|
| 1 | Mandarin Chinese | 939.0 | Sino-Tibetan | Sinitic |
| 2 | Spanish | 485.0 | Indo-European | Romance |
| 3 | English | 380.0 | Indo-European | Germanic |
| 4 | Hindi | 345.0 | Indo-European | Indo-Aryan |
| 5 | Portuguese | 236.0 | Indo-European | Romance |
| 6 | Bengali | 234.0 | Indo-European | Indo-Aryan |
| 7 | Russian | 147.0 | Indo-European | Balto-Slavic |
| 8 | Japanese | 123.0 | Japonic | Japanese |
| 9 | Yue Chinese | 86.1 | Sino-Tibetan | Sinitic |
| 10 | Vietnamese | 85.0 | Austroasiatic | Vietic |
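
The cell above trims the trailing line break from each table cell with `[:-1]`, which would silently drop a real character if a cell did not end with a newline. A more defensive clean-up can be sketched as follows; `clean_cell` is a hypothetical helper, and the footnote markers such as `[12]` in the samples are assumptions about what Wikipedia cells may contain.

```python
import re


def clean_cell(text):
    """Strip footnote markers like '[12]' and normalise whitespace."""
    without_notes = re.sub(r"\[[^\]]*\]", "", text)  # drop anything in square brackets
    return " ".join(without_notes.split())  # collapse whitespace, trim both ends


# Made-up cell texts resembling scraped table cells
print(clean_cell("Mandarin Chinese[12]\n"))
print(float(clean_cell(" 86.1 \n")))
```

Converting the speaker counts to `float` before building the DataFrame keeps the column numeric, so it can be sorted reliably.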


### BONUS QUESTIONS

#### IMDb's Top 250 data (movie name, initial release, director name, and stars) as a pandas DataFrame

In [111]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [112]:
# your code
res= requests.get(url)
soup= BeautifulSoup(res.content, 'html.parser')

In [113]:
data= {
    "Movie name":[],
    "Initial release":[],
    "Director name":[],
    "Stars":[],
}

movies= soup.find("tbody", {"class":"lister-list"})
movie_row= movies.find_all("tr")
for i in movie_row:
    movie_name= i.find("td", {"class":"titleColumn"}).find("a").getText()
    data["Movie name"].append(movie_name)
    initial_release= i.find("td", {"class":"titleColumn"}).find("span", {"class":"secondaryInfo"}).getText()
    data["Initial release"].append(initial_release)
    director_name= i.find("td", {"class":"titleColumn"}).find("a").get("title").split(",")[0][:-7]
    data["Director name"].append(director_name)
    stars= i.find("td", {"class":"ratingColumn imdbRating"}).find("strong").get("title")[:4]
    data["Stars"].append(stars)
    
index= list(range(1, len(data["Movie name"])+1))
df= pd.DataFrame(data, index=index)
df

| | Movie name | Initial release | Director name | Stars |
|---|---|---|---|---|
| 1 | Cadena perpetua | (1994) | Frank Darabont | 9.2 |
| 2 | El padrino | (1972) | Francis Ford Coppola | 9.2 |
| 3 | El caballero oscuro | (2008) | Christopher Nolan | 9.0 |
| 4 | El padrino (parte II) | (1974) | Francis Ford Coppola | 9.0 |
| 5 | 12 hombres sin piedad | (1957) | Sidney Lumet | 9.0 |
| ... | ... | ... | ... | ... |
| 246 | El gigante de hierro | (1999) | Brad Bird | 8.0 |
| 247 | Criadas y señoras | (2011) | Tate Taylor | 8.0 |
| 248 | Aladdín | (1992) | Ron Clements | 8.0 |
| 249 | Dersu Uzala (El cazador) | (1975) | Akira Kurosawa | 8.0 |
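
The `title` attribute that the cell above slices with `[:-7]` appears to hold the director followed by the leading actors. A minimal sketch of splitting it into both parts follows, assuming the attribute has the form `Director (dir.), Star One, Star Two`. The helper names and the sample string are illustrative, not scraped data.

```python
import re


def split_credits(title_attr):
    """Split 'Director (dir.), Star One, Star Two' into (director, [stars])."""
    names = [name.strip() for name in title_attr.split(",")]
    director = re.sub(r"\s*\(dir\.\)$", "", names[0])  # drop the '(dir.)' suffix
    return director, names[1:]


def parse_year(text):
    """Turn '(1994)' into the integer 1994."""
    return int(text.strip().strip("()"))


# Illustrative attribute value in the assumed format
director, stars = split_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman")
print(director)
print(stars)
print(parse_year("(1994)"))
```

Unlike a fixed-length slice, the regular expression leaves the director's name intact if the suffix is ever missing.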
