# Web Scraping Lab

This notebook contains web scraping exercises for practicing with `requests` and `Beautiful Soup`. The solutions below drive a headless Chrome browser with Selenium instead.

**Tips:**

- Check the [response status code](https://http.cat/) for each request to make sure you received the intended content.
- Look at the HTML returned by each request to understand what information you are getting and how it is formatted.
- Look for patterns in the response text that let you extract the data requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You will need to identify the HTML tags, class names, and other attributes that mark the content you are expected to extract.
- Learn the CSS selectors that match those elements.
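As a small illustration of the first tip, the sketch below turns a numeric status code into a readable label using only the standard library. The helper name `describe_status` is hypothetical and not part of `requests`; with `requests` you would pass it `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        return f"{code} (unknown status)"
    outcome = "ok" if 200 <= code < 300 else "check the request"
    return f"{code} {status.phrase}: {outcome}"


print(describe_status(200))
print(describe_status(404))
```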

### Useful Resources
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### First of all, gathering our tools.

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

In [2]:
# Driver options
opciones = Options()
opciones.add_experimental_option('excludeSwitches', ['enable-automation'])
opciones.add_experimental_option('useAutomationExtension', True)
opciones.add_argument('--headless')    # run Chrome without showing a window
opciones.add_argument('--start-maximized')
opciones.add_argument('--incognito')

⚠️ **Again, please remember to limit your output before submission so that your code doesn't get lost in the output.**

#### Challenge 1 - Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
from selenium.webdriver.chrome.service import Service
PATH = ChromeDriverManager().install()
driver = webdriver.Chrome(service=Service(PATH), options=opciones)  # load the driver with the options



Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
There is no [linux64] chromedriver for browser  in cache
Trying to download new driver from https://chromedriver.storage.googleapis.com/98.0.4758.102/chromedriver_linux64.zip
Driver has been saved in cache [/home/vp/.wdm/drivers/chromedriver/linux64/98.0.4758.102]


In [4]:
# This is the url you will scrape in this exercise
URL = 'https://github.com/trending/developers'

driver.get(URL)

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain HTML tags.

**Instructions:**

1. Find the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list.

1. Print the list of names.

Your output should look like this (with different names):

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
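
The string clean-up in step 3 can be sketched without any scraping library. This is a minimal example, assuming you already hold the raw text of each element (from Beautiful Soup's `get_text()` or Selenium's `.text`); the helper `clean_text` and the sample strings are hypothetical. It collapses whitespace to single spaces, whereas the sample output above removes it entirely.

```python
def clean_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Hypothetical raw strings, shaped like text pulled out of nested HTML tags.
raw_names = ["\n      Sindre Sorhus\n   ", "Linus\n\tTorvalds", "  script-8 "]
clean_names = [clean_text(name) for name in raw_names]
print(clean_names)
```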

In [5]:
devs_list = driver.find_elements(By.CLASS_NAME, 'Box-row')  # list of developers

In [6]:
ranking = []
for element in devs_list:
    dev = f"{element.find_element(By.CSS_SELECTOR,'div.d-sm-flex.flex-auto > div.col-sm-8.d-md-flex > div:nth-child(1) > h1').text}{element.find_element(By.CLASS_NAME,'Link--secondary').text}"
    ranking.append(dev)
ranking

['Matthias Feyrusty1s',
 'Yujia Qiaorapiz1',
 'David Rodríguezdeivid-rodriguez',
 'Geoff Bourneitzg',
 'Keno FischerKeno',
 'Vladimir Agafonkinmourner',
 'Jeremy Longjeremylong',
 'Steve Gordonstevejgordon',
 'Franck Nijhoffrenck',
 'Pedro S. Lopezpedroslopez',
 'Kazuaki MatsuoKazuCocoa',
 'Juliettejrfnl',
 'Seth Michael Larsonsethmlarson',
 'Robert Mosolgormosolgo',
 'Alex Goodmanwagoodman',
 'Florian RothNeo23x0',
 'Patrick Kidgerpatrick-kidger',
 'Dries Vintsdriesvints',
 'Sebastián Ramíreztiangolo',
 '花裤衩PanJiaChen',
 'Sameer Naiksameersbn',
 'Anthony Fuantfu',
 'Willem Melchingpd0wm',
 'Alex Rogozhnikovarogozhnikov',
 'Sylvain Guggersgugger']
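
The entries above fuse each display name with the handle because the two strings are concatenated without a separator. One way to get closer to the `handle (Name)` layout of the sample output is a small formatting helper. This is only a sketch: `format_developer` is a hypothetical name, and it assumes the name and the handle have already been read as two separate strings.

```python
def format_developer(name: str, handle: str) -> str:
    """Combine a display name and a handle as 'handle (name)'."""
    name, handle = name.strip(), handle.strip()
    # Some accounts show no display name; fall back to the handle alone.
    return f"{handle} ({name})" if name else handle


print(format_developer("Sindre Sorhus", "sindresorhus"))
print(format_developer("", "tensorflow"))
```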

#### Challenge 2 - Display the trending Python repositories on GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [7]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
driver.get(url)

In [8]:
repos = driver.find_elements(By.CLASS_NAME,'lh-condensed')
ranking = []
for element in repos:
    quitar = element.find_element(By.TAG_NAME, 'span').text  # the 'owner /' prefix to strip from the link text
    completo = element.find_element(By.TAG_NAME, 'a').text  # full 'owner / repo' text
    repo_limpio = completo.replace(quitar,'').lstrip()
    ranking.append(repo_limpio)
ranking

['ailab',
 'covid-19-data',
 'Lurnby',
 'taobao_seckill',
 'QHack',
 'evojax',
 'bombcrypto-robot',
 'mlflow',
 'starlite',
 'projected_gan',
 'pandas',
 'frame-interpolation',
 'sphinx',
 'recommenders',
 'pydantic',
 'Paddle',
 'httpie',
 'wxpy',
 'pipelines',
 'examples',
 'spack',
 'moco',
 'pytube',
 'TikTok-Api',
 'cookiecutter-django']
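
An alternative to removing the owner prefix with `replace` is to split the link text on its last slash. The sketch below assumes the link text has the form `owner / repo`; `repo_name` is a hypothetical helper and the sample strings are made up for illustration.

```python
def repo_name(full_name: str) -> str:
    """Return the repository part of an 'owner / repo' string."""
    # rpartition splits on the last slash; with no slash the whole string is kept.
    return full_name.rpartition("/")[2].strip()


samples = ["pandas-dev / pandas", "mlflow / mlflow", "httpie"]
print([repo_name(s) for s in samples])
```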

#### Challenge 3 - Display all the image links from the Walt Disney Wikipedia page

In [9]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
driver.get(url)

In [10]:
images = driver.find_elements(By.TAG_NAME,'img')
image_links = []
for element in images:
    image_links.append(element.get_attribute('src'))

image_links[:10] #len(image_links) = 33

['https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 'http
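
Selenium returns fully resolved `src` values, but `src` attributes read from raw HTML (for example with Beautiful Soup) can be relative or protocol-relative. The sketch below shows one way to resolve them against the page URL and drop empty values and duplicates. `absolute_links` is a hypothetical helper and the sample paths are invented.

```python
from urllib.parse import urljoin


def absolute_links(srcs, base_url):
    """Resolve src values against base_url, skipping empty ones and duplicates."""
    seen = set()
    result = []
    for src in srcs:
        if not src:
            continue  # missing or empty src attribute
        full = urljoin(base_url, src)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


page = "https://en.wikipedia.org/wiki/Walt_Disney"
raw = ["//upload.wikimedia.org/a.png", None, "/static/b.png", "//upload.wikimedia.org/a.png"]
print(absolute_links(raw, page))
```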

#### Challenge 4 - Retrieve all links to pages on Wikipedia that refer to some kind of Python.

In [11]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'
driver.get(url)

In [12]:
from selenium.common.exceptions import NoSuchElementException

lists = driver.find_elements(By.TAG_NAME,'li')
links = []
for element in lists:
    try:  # not every li element contains an 'a' tag
        link = element.find_element(By.TAG_NAME,'a')
        links.append(link.get_attribute('href'))
    except NoSuchElementException:
        pass
print('{},list length:{}'.format(links[:15],len(links)))

['https://en.wikipedia.org/wiki/Pythonidae', 'https://en.wikipedia.org/wiki/Python_(genus)', 'https://en.wikipedia.org/wiki/Python#Computing', 'https://en.wikipedia.org/wiki/Python#People', 'https://en.wikipedia.org/wiki/Python#Roller_coasters', 'https://en.wikipedia.org/wiki/Python#Vehicles', 'https://en.wikipedia.org/wiki/Python#Weaponry', 'https://en.wikipedia.org/wiki/Python#Other_uses', 'https://en.wikipedia.org/wiki/Python#See_also', 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'https://en.wikipedia.org/wiki/CMU_Common_Lisp', 'https://en.wikipedia.org/wiki/PERQ#PERQ_3', 'https://en.wikipedia.org/wiki/Python_of_Aenus', 'https://en.wikipedia.org/wiki/Python_(painter)', 'https://en.wikipedia.org/wiki/Python_of_Byzantium'],list length:130
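
The list above contains every first link found in an `li` element, including in-page anchors and unrelated articles. One possible filter keeps only article links whose title mentions "Python". This is a sketch under that assumption about what "refers to some kind of Python" means; `is_python_article` is a hypothetical helper.

```python
from urllib.parse import unquote, urlparse


def is_python_article(href):
    """True if href is a /wiki/ article link whose title mentions 'Python'."""
    if not href:
        return False
    parsed = urlparse(href)
    if parsed.fragment or not parsed.path.startswith("/wiki/"):
        return False  # skip in-page anchors and non-article links
    title = unquote(parsed.path[len("/wiki/"):])
    return "python" in title.lower()


hrefs = [
    "https://en.wikipedia.org/wiki/Pythonidae",
    "https://en.wikipedia.org/wiki/Python#Computing",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/CMU_Common_Lisp",
    None,
]
python_links = [h for h in hrefs if is_python_article(h)]
print(python_links)
```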


#### Challenge 5 - Number of Titles that have changed in the United States Code since its last release point 

In [13]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'
driver.get(url)

In [14]:
from selenium.common.exceptions import NoSuchElementException

item_list = driver.find_element(By.CLASS_NAME,'uscitemlist')
titles = item_list.find_elements(By.CLASS_NAME,'uscitem')
changes = []
for element in titles:
    try:
        bold = element.find_element(By.CLASS_NAME,'usctitlechanged').text  # all changed titles are written in bold
        changes.append(bold)
    except NoSuchElementException:
        pass
print('{}, {} total changes since 12/27/21'.format(changes, len(changes)))

['Title 1 - General Provisions ٭', 'Title 2 - The Congress', 'Title 5 - Government Organization and Employees ٭', 'Title 6 - Domestic Security', 'Title 7 - Agriculture', 'Title 12 - Banks and Banking', 'Title 15 - Commerce and Trade', 'Title 16 - Conservation', 'Title 19 - Customs Duties', 'Title 23 - Highways ٭', 'Title 25 - Indians', 'Title 26 - Internal Revenue Code', 'Title 29 - Labor', 'Title 30 - Mineral Lands and Mining', 'Title 33 - Navigation and Navigable Waters', 'Title 40 - Public Buildings, Property, and Works ٭', 'Title 41 - Public Contracts ٭', 'Title 42 - The Public Health and Welfare', 'Title 43 - Public Lands', 'Title 45 - Railroads', 'Title 46 - Shipping ٭', 'Title 47 - Telecommunications', 'Title 49 - Transportation ٭', 'Title 54 - National Park Service and Related Programs ٭'], 24 total changes since 12/27/21
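
Some of the titles above end with a star marker. If you want the title number and a clean name, a regular expression is one option. The sketch below assumes every entry has the form `Title <number> - <name>` with optional trailing punctuation; `parse_title` is a hypothetical helper, and the marker is written as `\u066d` on the assumption that it is that code point.

```python
import re

# Title number, then the name, then any trailing non-word characters (space, star).
TITLE_RE = re.compile(r"^Title (\d+) - (.*?)\W*$")


def parse_title(text):
    """Return (number, name) for a 'Title N - Name' string, or None if it does not match."""
    match = TITLE_RE.match(text.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_title("Title 1 - General Provisions \u066d"))
print(parse_title("Title 2 - The Congress"))
```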


#### Challenge 6 - A Python list with the names of the FBI's Top Ten Most Wanted

In [15]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'
driver.get(url)

In [16]:
criminals = driver.find_elements(By.CLASS_NAME,'title')
for criminal in criminals[1:]:  # the first element is the "Most Wanted" heading, not a name
    print(criminal.text)

JOSE RODOLFO VILLARREAL-HERNANDEZ
RAFAEL CARO-QUINTERO
YULAN ADONAY ARCHAGA CARIAS
EUGENE PALMER
BHADRESHKUMAR CHETANBHAI PATEL
ALEJANDRO ROSALES CASTILLO
ARNOLDO JIMENEZ
JASON DEREK BROWN
ALEXIS FLORES
OCTAVIANO JUAREZ-CORRO


#### Challenge 7 - List all language names and number of related articles in the order they appear in wikipedia.org

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
driver.get(url)

In [18]:
#wraper = driver.find_elements(By.CLASS_NAME,'central-featured')
languages = driver.find_elements(By.CLASS_NAME,'link-box')
lang_list = []
for element in languages:
    lang = element.find_element(By.TAG_NAME,'strong').text
    articles = element.find_element(By.TAG_NAME,'small').text
    lang_list.append(f'{lang} / {articles}')
lang_list

['Español / 1 717 000+ artículos',
 'English / 6 383 000+ articles',
 '日本語 / 1 292 000+ 記事',
 'Русский / 1 756 000+ статей',
 'Deutsch / 2 617 000+ Artikel',
 'Français / 2 362 000+ articles',
 '中文 / 1 231 000+ 條目',
 'Italiano / 1 718 000+ voci',
 'Polski / 1 490 000+ haseł',
 'Português / 1 074 000+ artigos']
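
The article counts above are strings with spaces as thousands separators and a trailing `+`. To sort or compare them you could convert them to integers. A minimal sketch, assuming the count always precedes the `+` sign; `article_count` is a hypothetical helper.

```python
import re


def article_count(label: str) -> int:
    """Extract the number before '+' from a label such as '6 383 000+ articles'."""
    # Keep only the digits, which also removes regular and non-breaking spaces.
    digits = re.sub(r"\D", "", label.split("+")[0])
    return int(digits)


print(article_count("6 383 000+ articles"))
print(article_count("1 717 000+ artículos"))
```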

#### Challenge 8 - A list with the different kinds of datasets available on data.gov.uk

In [19]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'
driver.get(url)

In [20]:
headings = driver.find_elements(By.CLASS_NAME,'dgu-topics__heading')
topics = []
for element in headings:
    topics.append(element.find_element(By.TAG_NAME,'a').text)
topics

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Challenge 9 - Top 10 languages by number of native speakers stored in a pandas DataFrame

In [21]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
driver.get(url)

In [22]:
table = driver.find_elements(By.CLASS_NAME,'wikitable')
heads = table[1].find_element(By.TAG_NAME,'thead') #Second table in web page
rows = table[1].find_elements(By.TAG_NAME,'tr')

data = [[i.text for i in e.find_elements(By.TAG_NAME,'td')] for e in rows]
index = [c.text for c in heads.find_elements(By.TAG_NAME,'th')]

In [23]:
import pandas as pd
Top_lang = pd.DataFrame(data, columns=index)
Top_lang.dropna(how="all", inplace=True)  # drop rows that had no td cells (e.g. the header row)
Top_lang.drop(columns='Rank', inplace=True)

Top_lang.head(10)

| | Language | Native\nspeakers\nin millions\n2007 (2010) | Percentage\nof world\npopulation\n(2007) |
|---|---|---|---|
| 1 | Mandarin (entire branch) | 935 (955) | 14.1% |
| 2 | Spanish | 390 (405) | 5.85% |
| 3 | English | 365 (360) | 5.52% |
| 4 | Hindi[b] | 295 (310) | 4.46% |
| 5 | Arabic | 280 (295) | 4.23% |
| 6 | Portuguese | 205 (215) | 3.08% |
| 7 | Bengali | 200 (205) | 3.05% |
| 8 | Russian | 160 (155) | 2.42% |
| 9 | Japanese | 125 (125) | 1.92% |
| 10 | Punjabi | 95 (100) | 1.44% |
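
The speaker and percentage columns are still text. If you need numbers, one approach is to pull them out with a regular expression before loading them into the DataFrame. A sketch with hypothetical helpers `parse_speakers` and `parse_percentage`, assuming each speakers cell holds the 2007 figure followed by the 2010 figure in parentheses.

```python
import re


def parse_speakers(cell: str):
    """Return (value_2007, value_2010) in millions from a cell such as '935 (955)'."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cell)]
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers in {cell!r}")
    return numbers[0], numbers[1]


def parse_percentage(cell: str) -> float:
    """Return the numeric value of a cell such as '5.85%'."""
    return float(cell.strip().rstrip("%"))


print(parse_speakers("935 (955)"))
print(parse_percentage("5.85%"))
```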


### Stepping up the game

#### Challenge 10 - Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas DataFrame

In [24]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'
driver.get(url)

In [25]:
table = driver.find_element(By.XPATH,'//*[@id="tbody"]')

heads = driver.find_elements(By.TAG_NAME,'th')  # this works with the '+' rows expanded
#indexes = [c.text for c in driver.find_elements(By.TAG_NAME,'th')]

rows = table.find_elements(By.TAG_NAME,'tr')
earthquakes = [[i.text for i in e.find_elements(By.TAG_NAME,'td')] for e in rows]

In [26]:
terremotos = pd.DataFrame(earthquakes)
# Drop the columns that are not needed for this exercise
terremotos.drop(columns=[0, 1, 2, 9, 12], inplace=True)
terremotos.shape

(50, 8)

In [27]:
indexes = ['Date&Time','Latitude','N/S','Longitude','E/W','Depth(km)','Magnitude','Region_Name']

In [28]:
terremotos.columns = indexes

In [29]:
terremotos.head(20)

| | Date&Time | Latitude | N/S | Longitude | E/W | Depth(km) | Magnitude | Region_Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-02-15 11:17:48.0\n09min ago | 1.03 | N | 120.11 | E | 10 | 3.6 | MINAHASA, SULAWESI, INDONESIA |
| 1 | 2022-02-15 10:47:28.0\n39min ago | 23.99 | S | 67.88 | W | 191 | 2.9 | ANTOFAGASTA, CHILE |
| 2 | 2022-02-15 10:43:19.7\n43min ago | 38.21 | N | 22.61 | E | 3 | 2.5 | GREECE |
| 3 | 2022-02-15 10:37:42.5\n49min ago | 19.23 | N | 155.4 | W | 32 | 2.2 | ISLAND OF HAWAII, HAWAII |
| 4 | 2022-02-15 10:32:49.8\n54min ago | 18.01 | N | 66.83 | W | 13 | 2.3 | PUERTO RICO |
| 5 | 2022-02-15 10:00:42.0\n1hr 26min ago | 4.62 | S | 129.54 | E | 194 | 3.6 | BANDA SEA |
| 6 | 2022-02-15 09:57:24.1\n1hr 29min ago | 47.73 | N | 121.73 | W | 27 | 2.3 | WASHINGTON |
| 7 | 2022-02-15 09:57:15.3\n1hr 29min ago | 39.17 | N | 32.08 | E | 7 | 2.1 | CENTRAL TURKEY |
| 8 | 2022-02-15 09:51:42.0\n1hr 35min ago | 0.67 | S | 131.48 | E | 10 | 3.4 | NEAR N COAST OF PAPUA, INDONESIA |
| 9 | 2022-02-15 09:50:41.0\n1hr 36min ago | 8.95 | S | 116.86 | E | 10 | 2.5 | LOMBOK REGION, INDONESIA |
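
The Date&Time cells carry a timestamp followed by a relative age, and the coordinates are split into a value and a hemisphere letter. The sketch below shows one way to turn them into a `datetime` and signed decimal degrees. Both helpers (`parse_timestamp`, `signed_coordinate`) are hypothetical, and the sign convention (south and west negative) is the usual one, assumed here.

```python
from datetime import datetime


def parse_timestamp(cell: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.f' part and ignore the 'ago' text."""
    date_part, time_part = cell.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


def signed_coordinate(value, hemisphere: str) -> float:
    """Return the coordinate as a signed float: negative for S and W."""
    sign = -1.0 if hemisphere.strip().upper() in ("S", "W") else 1.0
    return sign * float(value)


print(parse_timestamp("2022-02-15 11:17:48.0\n09min ago"))
print(signed_coordinate("23.99", "S"), signed_coordinate("120.11", "E"))
```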


#### Challenge 11 - IMDb's Top 250 data (movie name, initial release, director name and stars) as a pandas DataFrame

In [30]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'
driver.get(url)

In [36]:
table = driver.find_element(By.CLASS_NAME,'lister-list')
head=driver.find_element(By.TAG_NAME,'thead')
filas = table.find_elements(By.TAG_NAME,'tr')

movie = [[i.text for i in e.find_elements(By.TAG_NAME,'td')] for e in filas]

indexes = [c.text for c in head.find_elements(By.TAG_NAME,'th')]



In [42]:
len(indexes)

5

In [44]:
print(movie[:50])  # show the first 50 rows once

[['', '1. Cadena perpetua (1994)', '9,2', '', ''], ['', '2. El padrino (1972)', '9,1', '', ''], ['', '3. El padrino: Parte II (1974)', '9,0', '', ''], ['', '4. El caballero oscuro (2008)', '9,0', '', ''], ['', '5. 12 hombres sin piedad (1957)', '8,9', '', ''], ['', '6. La lista de Schindler (1993)', '8,9', '', ''], ['', '7. El señor de los anillos: El retorno del rey (2003)', '8,9', '', ''], ['', '8. Pulp Fiction (1994)', '8,8', '', ''], ['', '9. El bueno, el feo y el malo (1966)', '8,8', '', ''], ['', '10. El señor de los anillos: La comunidad del anillo (2001)', '8,8', '', ''], ['', '11. El club de la lucha (1999)', '8,7', '', ''], ['', '12. Forrest Gump (1994)', '8,7', '', ''], ['', '13. Origen (2010)', '8,7', '', ''], ['', '14. El señor de los anillos: Las dos torres (2002)', '8,7', '', ''], ['', '15. El Imperio contraataca (1980)', '8,7', '', ''], ['', '16. Matrix (1999)', '8,7', '', ''], ['', '17. Uno de los nuestros (1990)', '8,6', '', ''], ['', '18. Alguien voló sobre el nido d
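
Each title cell in the rows printed above packs the rank, the movie name and the release year into one string. A regular expression can separate them. This is a sketch that assumes every cell follows the `rank. title (year)` pattern seen above; `parse_title_cell` is a hypothetical helper.

```python
import re

# rank, a dot, the title, then a four-digit year in parentheses at the end
TITLE_CELL = re.compile(r"^(\d+)\.\s+(.*)\s+\((\d{4})\)$")


def parse_title_cell(cell: str):
    """Return (rank, title, year), or None if the cell does not match the pattern."""
    match = TITLE_CELL.match(cell.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))


print(parse_title_cell("1. Cadena perpetua (1994)"))
print(parse_title_cell("5. 12 hombres sin piedad (1957)"))
```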

In [45]:
chart_movies = pd.DataFrame(movie, columns=indexes)  # the 5 header names label the columns, not the 250 rows

#### Challenge 12 - Movie name, year and a brief summary of the top 10 random movies (IMDb) as a pandas DataFrame.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
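
One possible reading of "random" here is a random sample of ten movies drawn from the chart. The sketch below shows that sampling step only, assuming the titles are already in a list. `pick_random` is a hypothetical helper, and the `seed` parameter exists just to make a run repeatable.

```python
import random


def pick_random(items, k=10, seed=None):
    """Return k distinct items chosen at random; pass a seed for a repeatable choice."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# A few titles taken from the Challenge 11 output.
titles = [
    "Cadena perpetua", "El padrino", "El padrino: Parte II", "El caballero oscuro",
    "12 hombres sin piedad", "La lista de Schindler", "Pulp Fiction",
    "El bueno, el feo y el malo", "El club de la lucha", "Forrest Gump",
    "Origen", "Matrix",
]
picks = pick_random(titles, k=10, seed=0)
print(picks)
```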

#### Challenge 13 - Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
url = 'https://openweathermap.org/current'
driver.get(url)

In [None]:
def weather(city):
    pass
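
The page above documents the OpenWeatherMap current-weather API, which returns JSON. The sketch below shows the two pure steps around the request: building the URL and reading the four requested fields from the response. It assumes the endpoint and JSON layout described on that page; `build_url`, `summarize` and the sample payload are hypothetical, and a real call needs your own API key.

```python
from urllib.parse import urlencode

API_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_url(city: str, api_key: str) -> str:
    """Build the request URL for a city, asking for metric units."""
    query = urlencode({"q": city, "appid": api_key, "units": "metric"})
    return f"{API_URL}?{query}"


def summarize(payload: dict) -> dict:
    """Pick temperature, wind speed, weather and description out of the JSON payload."""
    first = payload["weather"][0]
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": first["main"],
        "description": first["description"],
    }


# Hypothetical payload with invented values, shaped like the documented response.
sample = {
    "weather": [{"main": "Clouds", "description": "scattered clouds"}],
    "main": {"temp": 12.3},
    "wind": {"speed": 4.1},
}
print(build_url("Madrid", "YOUR_API_KEY"))
print(summarize(sample))
```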

#### Challenge 14 - Book name, price and stock availability as a pandas DataFrame.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
driver.get(url)
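
Once the price and availability texts are scraped, they still need cleaning before they are useful in a DataFrame. A minimal sketch, assuming prices look like `£51.77` and availability reads `In stock`. Both helpers (`parse_price`, `in_stock`) and the sample strings are hypothetical.

```python
def parse_price(text: str) -> float:
    """Return the numeric price, ignoring the currency symbol and stray characters."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


def in_stock(text: str) -> bool:
    """True if the availability text says the book is in stock."""
    return "in stock" in text.lower()


print(parse_price("£51.77"))
print(in_stock("\n    In stock\n"))
```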

**Did you limit your output? Thank you! 🙂**