# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the urls below and take a look at their source code through Chrome DevTools. You'll need to identify the html tags, special class names, etc used in the html content you are expected to extract.

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries feel free to do it.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# your code here
response = requests.get(url)

In [16]:
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the elements containing developer names and repositories
    developer_items = soup.find_all('article', class_='Box-row')

    # Iterate through the developer items and print their details
    for index, developer_item in enumerate(developer_items, start=1):
        # Extract the developer name
        developer_name = developer_item.find('h1').get_text(strip=True)
        
        # Extract the developer's username/repository
        developer_repo = developer_item.find('h1').find_next('p').get_text(strip=True)
        
        # Print the developer's rank, name, and repository
        print(f'{index}. {developer_name} ({developer_repo})')
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

1. phuocng (moklick)
2. Moritz Klack (moklick)
3. Rui Chen (chenrui333)
4. Vladimir Kharlampidi (nolimits4web)
5. Stan Girard (StanGirard)
6. Matthias Fey (rusty1s)
7. Olivier Halligon (AliSoftware)
8. Sean DuBois (Sean-Der)
9. Paul Frazee (pfrazee)
10. Ariel Mashraki (a8m)
11. jdx (refcell)
12. refcell.eth (refcell)
13. MichaIng (Yidadaa)
14. Yifei Zhang (Yidadaa)
15. Eugene Yurtsev (eyurtsev)
16. Carlos Scheidegger (cscheid)
17. Robin Appelman (icewind1991)
18. Romain Beauxis (toots)
19. Zoltan Kochan (zkochan)
20. lwouis (ClearlyClaire)
21. Claire (ClearlyClaire)
22. Nathan Rajlich (TooTallNate)
23. James M Snell (jasnell)
24. Adrian Garcia Badaracco (adriangb)
25. Juan Julián Merelo Guervós (JJ)


#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools or clicking in 'Inspect' on any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [18]:
# your code here
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the elements containing developer names and repositories
    developer_items = soup.find_all('article', class_='Box-row')

    # Iterate through the developer items and print their details
    for index, developer_item in enumerate(developer_items, start=1):
        # Extract the developer name
        developer_name = developer_item.find('h1').get_text(strip=True)
        
        # Extract the developer's username/repository
        developer_repo = developer_item.find('h1').find_next('p').get_text(strip=True)
        
        # Print the developer's rank, name, and repository
        print(f'{index}. {developer_name} ({developer_repo})')
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

1. phuocng (moklick)
2. Moritz Klack (moklick)
3. Rui Chen (chenrui333)
4. Vladimir Kharlampidi (nolimits4web)
5. Stan Girard (StanGirard)
6. Matthias Fey (rusty1s)
7. Olivier Halligon (AliSoftware)
8. Sean DuBois (Sean-Der)
9. Paul Frazee (pfrazee)
10. Ariel Mashraki (a8m)
11. jdx (refcell)
12. refcell.eth (refcell)
13. MichaIng (Yidadaa)
14. Yifei Zhang (Yidadaa)
15. Eugene Yurtsev (eyurtsev)
16. Carlos Scheidegger (cscheid)
17. Robin Appelman (icewind1991)
18. Romain Beauxis (toots)
19. Zoltan Kochan (zkochan)
20. lwouis (ClearlyClaire)
21. Claire (ClearlyClaire)
22. Nathan Rajlich (TooTallNate)
23. James M Snell (jasnell)
24. Adrian Garcia Badaracco (adriangb)
25. Juan Julián Merelo Guervós (JJ)


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [19]:
# your code here
# Send an HTTP GET request to the URL
response = requests.get(url)

In [20]:
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the elements containing repository names
    repo_items = soup.find_all('h1', class_='h3 lh-condensed')

    # Iterate through the repository items and print their names
    for index, repo_item in enumerate(repo_items, start=1):
        # Extract the repository name
        repo_name = repo_item.get_text(strip=True)
        
        # Print the repository name
        print(f'{index}. {repo_name}')
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')



1. phuocng
2. Moritz Klack
3. Rui Chen
4. Vladimir Kharlampidi
5. Stan Girard
6. Matthias Fey
7. Olivier Halligon
8. Sean DuBois
9. Paul Frazee
10. Ariel Mashraki
11. jdx
12. refcell.eth
13. MichaIng
14. Yifei Zhang
15. Eugene Yurtsev
16. Carlos Scheidegger
17. Robin Appelman
18. Romain Beauxis
19. Zoltan Kochan
20. lwouis
21. Claire
22. Nathan Rajlich
23. James M Snell
24. Adrian Garcia Badaracco
25. Juan Julián Merelo Guervós


#### 2. Display all the image links from Walt Disney wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [21]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [22]:
# your code here
# Send an HTTP GET request to the URL
response = requests.get(url)

In [23]:
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all image tags in the page
    img_tags = soup.find_all('img')
    
    # Extract and print the source (URL) of each image
    for index, img_tag in enumerate(img_tags, start=1):
        img_src = img_tag.get('src')
        print(f'{index}. {img_src}')
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

1. /static/images/icons/wikipedia.png
2. /static/images/mobile/copyright/wikipedia-wordmark-en.svg
3. /static/images/mobile/copyright/wikipedia-tagline-en.svg
4. //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
5. //upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
6. //upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
7. //upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
8. //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
9. //upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
10. //upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/220px-Ste

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [30]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [32]:
# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the element that contains the list of languages and article counts
    language_items = soup.find_all('a', class_='link-box')
    
    # Create a list to store language names and article counts
    language_data = []
    
    # Iterate through the language items and add data to the list
    for language_item in language_items:
        # Extract the language name
        language_name = language_item.find('strong').get_text(strip=True)
        
        # Extract the number of articles
        article_count = language_item.find('small').get_text(strip=True)
        
        # Add language name and article count to the list
        language_data.append(f'{language_name}: {article_count}')
    
    # Print the list of language names and article counts
    for index, data in enumerate(language_data, start=1):
        print(f'{index}. {data}')
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')


1. English: 6 715 000+articles
2. æ¥æ¬èª: 1 387 000+è¨äº
3. EspaÃ±ol: 1 892 000+artÃ­culos
4. Ð ÑÑÑÐºÐ¸Ð¹: 1 938 000+ÑÑÐ°ÑÐµÐ¹
5. Deutsch: 2 836 000+Artikel
6. FranÃ§ais: 2 553 000+articles
7. Italiano: 1 826 000+voci
8. ä¸­æ: 1 377 000+æ¡ç® / æ¢ç®
9. PortuguÃªs: 1 109 000+artigos
10. Ø§ÙØ¹Ø±Ø¨ÙØ©: 1 217 000+ÙÙØ§ÙØ©


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way to do it is using pd.read_html(), check out documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).

In [33]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [41]:
import pandas as pd

# Define la URL
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

# Utiliza pd.read_html() para extraer las tablas de la página web
tables = pd.read_html(url)

# Inspecciona las tablas para encontrar la correcta
for i, table in enumerate(tables):
    print(f"Tabla {i + 1} - Filas: {table.shape[0]}, Columnas: {table.shape[1]}")
    print(table.head(10))  # Muestra las primeras filas de cada tabla para inspeccionarla


Tabla 1 - Filas: 27, Columnas: 4
                                            Language  \
0  Mandarin Chinese (incl. Standard Chinese, but ...   
1                                            Spanish   
2                                            English   
3            Hindi (excl. Urdu, and other languages)   
4                                         Portuguese   
5                                            Bengali   
6                                            Russian   
7                                           Japanese   
8                      Yue Chinese (incl. Cantonese)   
9                                         Vietnamese   

   Native speakers (millions) Language family        Branch  
0                       939.0    Sino-Tibetan       Sinitic  
1                       485.0   Indo-European       Romance  
2                       380.0   Indo-European      Germanic  
3                       345.0   Indo-European    Indo-Aryan  
4                       236.0   Indo-Eur

#### 3. Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [57]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [64]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from imdb import IMD

# Define the URL for IMDB's top 250 page
url = 'https://www.imdb.com/chart/top'

# Send a GET request to the URL
response = requests.get(url)
b

# Create an instance of the IMDb class
ia = IMDb()

# Get the top 250 movies
top250 = ia.get_top250_movies()
# Check if the request was successful (status code 200)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table that contains the movie data
    table = soup.find('table', {'class': 'chart full-width'})

    # Create empty lists to store data
    movie_names = []
    release_years = []
    director_names = []
    star_ratings = []

    # Iterate through the rows of the table
    for row in table.find_all('tr')[1:]:  # Start from the second row to skip headers
        cells = row.find_all('td')

        # Extract movie name
        movie_name = cells[1].find('a').text.strip()
        movie_names.append(movie_name)

        # Extract initial release year
        release_year = int(cells[1].find('span', {'class': 'secondaryInfo'}).text.strip('()'))
        release_years.append(release_year)

        # Extract director name from the "title" attribute of the movie link (hint)
        title = cells[1].find('a')['title']
        director_name = title.split('(')[-1].strip(')')
        director_names.append(director_name)

        # Extract star rating
        star_rating = float(cells[2].find('strong').text.strip())
        star_ratings.append(star_rating)

    # Create a pandas dataframe from the extracted data
    imdb_top_250_df = pd.DataFrame({
        'Movie Name': movie_names,
        'Initial Release': release_years,
        'Director Name': director_names,
        'Star Rating': star_ratings
    })

    # Display the dataframe
    print(imdb_top_250_df)

else:
    print('Error: Unable to retrieve the IMDB page.')


ImportError: cannot import name 'IMD' from 'imdb' (c:\Users\Win10\anaconda3\Lib\site-packages\imdb\__init__.py)

#### 3.1. Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [65]:
#This is the url you will scrape in this exercise
url = 'https://www.imdb.com/list/ls009796553/'

In [68]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import random

# Define the URL for the IMDb list page
url = 'https://www.imdb.com/list/ls009796553/'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the list of movies
    movie_list = soup.find('div', {'class': 'lister-list'})

    # Create empty lists to store data
    movie_names = []
    release_years = []
    movie_summaries = []

    # Iterate through the list of movies
    for movie in movie_list.find_all('div', {'class': 'lister-item-content'}):
        # Extract movie name
        movie_name = movie.find('h3').text.strip()
        movie_names.append(movie_name)

        # Extract release year
        release_year = movie.find('span', {'class': 'lister-item-year'}).text.strip('()')
        release_years.append(release_year)

        # Extract movie summary (if available)
        summary = movie.find('p', {'class': 'text-muted'})
        if summary:
            movie_summary = summary.text.strip()
        else:
            movie_summary = 'Summary not available'
        movie_summaries.append(movie_summary)

    # Create a pandas dataframe from the extracted data
    imdb_movie_list_df = pd.DataFrame({
        'Movie Name': movie_names,
        'Release Year': release_years,
        'Summary': movie_summaries
    })

    # Select 10 random movies from the dataframe
    random.seed(42)  # Setting a seed for reproducibility
    top_10_random_movies = imdb_movie_list_df.sample(10)

    # Display the top 10 random movies
    print(top_10_random_movies)

else:
    print('Error: Unable to retrieve the IMDb list page.')


                             Movie Name Release Year  \
69                   70.\nBones\n(2001)         2001   
19          20.\nAntwone Fisher\n(2002)         2002   
51          52.\nBatman Forever\n(1995)         1995   
54      55.\nLa novia de Chucky\n(1998)         1998   
82          83.\nBeyond the Sea\n(2004)         2004   
64                65.\nBaby Boy\n(2001)         2001   
60           61.\nEl informador\n(2000)         2000   
70  71.\nEl chico de la burbuja\n(2001)         2001   
53                54.\nBulworth\n(1998)         1998   
56                    57.\nBait\n(2000)         2000   

                                           Summary  
69               18\n|\n96 min\n|\n\nCrime, Horror  
19            7\n|\n120 min\n|\n\nBiography, Drama  
51          13\n|\n121 min\n|\n\nAction, Adventure  
54    18\n|\n89 min\n|\n\nComedy, Horror, Thriller  
82     A\n|\n118 min\n|\n\nBiography, Drama, Music  
64      18\n|\n130 min\n|\n\nCrime, Drama, Romance  
60      R\n|

## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here

####  Display the 100 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.
***Hint:*** Here the displayed number of earthquakes per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/?view='

# This is how you will loop through each page:
number_of_pages = int(100/20)
each_page_urls = []

for n in range(1, number_of_pages+1):
    link = url+str(n)
    each_page_urls.append(link)
    
each_page_urls

In [None]:
# your code here