# Lab | Web Scraping

This notebook contains some scraping exercises to practise your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and take a look at their source code through Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
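
The first tip can be tried out without any network call. This is a minimal sketch using only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return (ok, phrase) for an HTTP status code; `ok` is True only for 2xx."""
    ok = 200 <= status_code < 300
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not registered in http.HTTPStatus
        phrase = "Unknown"
    return ok, phrase


print(describe_status(200))  # (True, 'OK')
print(describe_status(404))  # (False, 'Not Found')
```

With `requests`, the value to pass in is `response.status_code`; `response.raise_for_status()` is the library's own way to turn a 4xx or 5xx answer into an exception.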

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [10]:
pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

This script first sends an HTTP GET request to the specified URL, checks that the request was successful (status code 200), and then uses BeautifulSoup to parse the HTML content of the page. It then finds and prints the names of the trending developers on the GitHub page.

In [4]:
# Send an HTTP GET request to the URL
response = requests.get(url)
response


<Response [200]>

In [21]:
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# Get headers
response.headers # Response headers (as a python dictionary)

{'Server': 'GitHub.com', 'Date': 'Fri, 22 Sep 2023 12:59:37 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Vary': 'X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X-Requested-With', 'ETag': 'W/"5211d4be8c76be15b5220efa9e5fc9d1"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'; base-uri 'self'; child-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.githubcopilot.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com 

In [6]:
# Show the content type
print(response.headers['Content-Type'])
type(response.content)

text/html; charset=utf-8


bytes

In [7]:
# Show content
response.content

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="false">\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-a09cef873428.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" m

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools or clicking in 'Inspect' on any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).
5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
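
The steps above can be sketched on a small inline document before running them against the live page. The markup below is made up and much simpler than GitHub's; only the `find_all(..., attrs=...)` and `get_text(strip=True)` calls carry over.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like the developer cards
sample_html = """
<article class="Box-row">
  <h1 class="h3 lh-condensed"><a href="/alice">
      Alice Example
  </a></h1>
  <p class="f4"><a href="/alice">alice</a></p>
</article>
<article class="Box-row">
  <h1 class="h3 lh-condensed"><a href="/bob">bob</a></h1>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 2: select the elements by tag and class via `attrs`
name_tags = soup.find_all("h1", attrs={"class": "h3"})

# Steps 3-4: get_text(strip=True) drops surrounding whitespace and line breaks
names = [tag.get_text(strip=True) for tag in name_tags]

# Step 5
print(names)  # ['Alice Example', 'bob']
```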

In [38]:

# Send an HTTP GET request to the URL
response = requests.get(url)
response


<Response [200]>

In [32]:
# Build the soup; naming the parser avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, 'html.parser')

In [14]:
print(soup)

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="false" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-a09cef873428.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com

In [13]:
# Can I make my soup pretty?
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="false" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-a09cef873428.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

In [30]:
# find_all returns a list-like ResultSet of matching tags
soup.find_all('article', class_='Box-row')


[<article class="Box-row d-flex" id="pa-phuocng">
 <a class="Link color-fg-muted f6" data-view-component="true" href="#pa-phuocng" style="width: 16px;" text="center">
     1
 </a>
 <div class="mx-3">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":57786711,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="7113908b4a723e4533ae0762dc72745f3d3d61b1201f288f30a2589767a9489d" data-view-component="true" href="/phuocng">
 <img alt="@phuocng" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/57786711?s=96&amp;v=4" width="48"/>
 </a> </div>
 <div class="d-sm-flex flex-auto">
 <div class="col-sm-8 d-md-flex">
 <div class="col-md-6">
 <h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click

In [79]:
# Initialize lists to store developer information
developer_names = []

In [50]:
# Find each developer card and extract the developer's name
developer_cards = soup.find_all('article', class_='Box-row')
for card in developer_cards:
    # The name sits in the card's <h1>; fall back to "N/A" if it is missing
    developer_name_elem = card.find('h1')
    developer_name = developer_name_elem.get_text(strip=True) if developer_name_elem else "N/A"
    developer_names.append(developer_name)


In [78]:
# Print the extracted developer names
for i, name in enumerate(developer_names, start=1):
    print(f"{i}. Developer: {name},")

1. Developer: phuocng,
2. Developer: Moritz Klack,
3. Developer: Rui Chen,
4. Developer: Vladimir Kharlampidi,
5. Developer: Stan Girard,
6. Developer: Matthias Fey,
7. Developer: Olivier Halligon,
8. Developer: Sean DuBois,
9. Developer: Paul Frazee,
10. Developer: Ariel Mashraki,
11. Developer: jdx,
12. Developer: refcell.eth,
13. Developer: MichaIng,
14. Developer: Yifei Zhang,
15. Developer: Eugene Yurtsev,
16. Developer: Carlos Scheidegger,
17. Developer: Robin Appelman,
18. Developer: Romain Beauxis,
19. Developer: Zoltan Kochan,
20. Developer: lwouis,
21. Developer: Claire,
22. Developer: Nathan Rajlich,
23. Developer: James M Snell,
24. Developer: Adrian Garcia Badaracco,
25. Developer: Juan Julián Merelo Guervós,


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [67]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [68]:
# Send an HTTP GET request to the URL
response = requests.get(url)
response

<Response [200]>

In [69]:
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

In [70]:
# Build the soup; naming the parser avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(response.content, 'html.parser')

In [71]:
print(soup)

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="false" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-a09cef873428.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-5d486a4ede8e.css" media="all" rel="stylesheet"/><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com

In [72]:
# Initialize a list to store repository names
developer_repos = []

In [76]:
# Find each repository card and extract its name
developer_cards = soup.find_all('article', class_='Box-row')
for card in developer_cards:
    # Extract repository name; fall back to "N/A" if the element is missing
    developer_repo_elem = card.find('span', class_='repo')
    developer_repo = developer_repo_elem.get_text(strip=True) if developer_repo_elem else "N/A"
    developer_repos.append(developer_repo)

In [77]:
# Print the extracted repository names
for i, repo in enumerate(developer_repos, start=1):
    print(f"{i}. Repository: {repo}")

1. Repository: N/A
2. Repository: N/A
3. Repository: N/A
4. Repository: N/A
5. Repository: N/A
6. Repository: N/A
7. Repository: N/A
8. Repository: N/A
9. Repository: N/A
10. Repository: N/A
11. Repository: N/A
12. Repository: N/A
13. Repository: N/A
14. Repository: N/A
15. Repository: N/A
16. Repository: N/A
17. Repository: N/A
18. Repository: N/A
19. Repository: N/A
20. Repository: N/A
21. Repository: N/A
22. Repository: N/A
23. Repository: N/A
24. Repository: N/A
25. Repository: N/A




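With this fallback in place, a missing element yields the placeholder "N/A" in `developer_names` or `developer_repos` instead of raising an `AttributeError` on `None`. Note that every repository above is "N/A", which means the `span` with class `repo` was not found in any card, so the selector needs to be checked against the page source.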
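
A minimal sketch of an alternative selector, assuming each repository card keeps its `owner / name` link inside an `<h2>`. That is an assumption about GitHub's markup and should be confirmed in DevTools; the sample HTML below is made up.

```python
from bs4 import BeautifulSoup

# Made-up cards; the second one deliberately lacks the heading
sample_html = """
<article class="Box-row">
  <h2 class="h3 lh-condensed">
    <a href="/octo-org/demo-repo">
      <span class="text-normal">octo-org /</span>

      demo-repo
    </a>
  </h2>
</article>
<article class="Box-row"><p>card without a heading</p></article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

repo_names = []
for card in soup.find_all("article", class_="Box-row"):
    link = card.select_one("h2 a")
    if link is None:
        repo_names.append("N/A")
        continue
    # Collapse the whitespace and line breaks between "owner /" and the name
    repo_names.append(" ".join(link.get_text().split()))

print(repo_names)  # ['octo-org / demo-repo', 'N/A']
```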

#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [83]:
# Send an HTTP GET request to the URL
response = requests.get(url)
response

<Response [200]>

In [84]:
# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')


In [89]:
# Find all image tags ('img'); the 'src' attribute is read in the next cell
image_tags = soup.find_all('img')
    

In [95]:

# Extract and display image links
for img_tag in image_tags:
    img_link = img_tag.get('src')
    if img_link:
        print(img_link)


/static/images/icons/wikipedia.png
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
/static/images/mobile/copyright/wikipedia-tagline-en.svg
//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/220px-Steamboat-willie.jpg
//upload.wiki

# For my reference:

1. Send an HTTP GET request to the Wikipedia page for Walt Disney.

2. If the request is successful (status code 200), parse the HTML content using BeautifulSoup.

3. Use `soup.find_all('img')` to find all image tags on the page.

4. For each image tag found, extract the `src` attribute using `img_tag.get('src')`, which contains the image link.

5. Print each image link.

This displays all the image links from the Walt Disney Wikipedia page.
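
The printed `src` values are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`). If full URLs are wanted, a sketch with the standard library's `urljoin` resolves both forms against the page URL; the two sample links are copied from the output above.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Two `src` values copied from the output above
raw_links = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
]

# urljoin fills in the missing scheme and, for relative paths, the host
absolute_links = [urljoin(page_url, src) for src in raw_links]

for link in absolute_links:
    print(link)
```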

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [97]:
# Send an HTTP GET request to the URL
response = requests.get(url)
response

<Response [200]>

In [98]:
# Parse the raw bytes so BeautifulSoup can detect the page's encoding itself
# instead of relying on the one requests infers from the headers
soup = BeautifulSoup(response.content, 'html.parser')
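
The choice between `response.text` and `response.content` matters for pages with non-ASCII text. As its documentation describes it, `requests` decodes `response.text` with the charset named in the `Content-Type` header and falls back to ISO-8859-1 for `text/*` responses that name none, which turns UTF-8 language names into mojibake. A standard-library-only sketch of that effect (the sample string is made up):

```python
# UTF-8 bytes as a server would send them
raw = "日本語 Español Русский".encode("utf-8")

# Decoding with the wrong codec yields mojibake rather than an error
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec restores the text
clean = raw.decode("utf-8")

print(clean)
print(garbled == clean)  # False
```

With `requests`, either set `response.encoding = 'utf-8'` before reading `response.text`, or pass `response.content` to BeautifulSoup.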

In [110]:
# Find the section that contains language links and article counts
language_section = soup.find('div', class_='central-featured')
language_section

<div class="central-featured" data-el-section="primary links">
<!-- #1. en.wikipedia.org - 1 719 062 000 views/day -->
<div class="central-featured-lang lang1" dir="ltr" lang="en">
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">6 715 000+</bdi> <span>articles</span></small>
</a>
</div>
<!-- #2. ja.wikipedia.org - 211 317 000 views/day -->
<div class="central-featured-lang lang2" dir="ltr" lang="ja">
<a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
<strong>日本語</strong>
<small><bdi dir="ltr">1 387 000+</bdi> <span>記事</span></small>
</a>
</div>
<!-- #3. es.wikipedia.org - 198 289 000 views/day -->
<div class="central-featured-lang lang3" dir="ltr" lang="es">
<a class="link-box" data-slogan="La enciclo

In [111]:
# Find all language elements within the section
language_elements = language_section.find_all('a', class_='link-box')
language_elements

[<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
 <strong>English</strong>
 <small><bdi dir="ltr">6 715 000+</bdi> <span>articles</span></small>
 </a>,
 <a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
 <strong>日本語</strong>
 <small><bdi dir="ltr">1 387 000+</bdi> <span>記事</span></small>
 </a>,
 <a class="link-box" data-slogan="La enciclopedia libre" href="//es.wikipedia.org/" id="js-link-box-es" title="Español — Wikipedia — La enciclopedia libre">
 <strong>Español</strong>
 <small><bdi dir="ltr">1 892 000+</bdi> <span>artículos</span></small>
 </a>,
 <a class="link-box" data-slogan="Свободная энциклопедия" href="//ru.wikipedia.org/" id="js-link-box-ru" title="Russkiy — Википедия — Свободная э

In [112]:
# Iterate through the language elements and extract language name and article count
for element in language_elements:
    language_name = element.find('strong').text
    article_count = element.find('bdi').text
    print(f"Language: {language_name}, Number of Articles: {article_count}")


Language: English, Number of Articles: 6 715 000+
Language: 日本語, Number of Articles: 1 387 000+
Language: Español, Number of Articles: 1 892 000+
Language: Русский, Number of Articles: 1 938 000+
Language: Deutsch, Number of Articles: 2 836 000+
Language: Français, Number of Articles: 2 553 000+
Language: Italiano, Number of Articles: 1 826 000+
Language: 中文, Number of Articles: 1 377 000+
Language: Português, Number of Articles: 1 109 000+
Language: العربية, Number of Articles: العربية
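
The counts above are strings such as `6 715 000+`, and in the last row the first `<bdi>` holds the language name rather than a number. A small sketch for turning the count into an integer and flagging rows without digits; the helper name `parse_article_count` is hypothetical.

```python
import re


def parse_article_count(text):
    """Turn a count such as '6 715 000+' into an int, or None if it has no digits."""
    digits = re.sub(r"\D", "", text)  # drop spaces, '+' and any other non-digit
    return int(digits) if digits else None


print(parse_article_count("6 715 000+"))  # 6715000
print(parse_article_count("العربية"))     # None
```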


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way to do it is using pd.read_html(), check out documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
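
The nested-loop route from the hint can be sketched on a tiny inline table. The table below is made up for illustration (its two data rows reuse values from the Wikipedia list); the real page's table has more columns.

```python
from bs4 import BeautifulSoup

# Made-up table, for illustration only
sample_html = """
<table class="wikitable">
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td>Spanish</td><td>485</td></tr>
  <tr><td>English</td><td>380</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):  # outer loop: one <tr> per row
    cells = []
    for cell in tr.find_all(["th", "td"]):  # inner loop: header or data cells
        cells.append(cell.get_text(strip=True))
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)  # ['Language', 'Native speakers (millions)']
print(data)    # [['Spanish', '485'], ['English', '380']]
```

From there, `pd.DataFrame(data, columns=header)` builds the frame.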

In [184]:
import pandas as pd

In [185]:
# Define the URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [186]:
# Use pd.read_html() to automatically extract tables from the page
tables = pd.read_html(url, attrs={'class': 'wikitable'})
tables

[                                             Language  \
 0   Mandarin Chinese(incl. Standard Chinese, but e...   
 1                                             Spanish   
 2                                             English   
 3              Hindi(excl. Urdu, and other languages)   
 4                                          Portuguese   
 5                                             Bengali   
 6                                             Russian   
 7                                            Japanese   
 8                        Yue Chinese(incl. Cantonese)   
 9                                          Vietnamese   
 10                                            Turkish   
 11                     Wu Chinese(incl. Shanghainese)   
 12                                            Marathi   
 13                                             Telugu   
 14                                             Korean   
 15                                             French   
 16           

In [187]:
# Assuming the first table extracted contains the language data, you can select it
language_table = tables[0]
language_table

Unnamed: 0,Language,Native speakers(millions),Language family,Branch
0,"Mandarin Chinese(incl. Standard Chinese, but e...",939.0,Sino-Tibetan,Sinitic
1,Spanish,485.0,Indo-European,Romance
2,English,380.0,Indo-European,Germanic
3,"Hindi(excl. Urdu, and other languages)",345.0,Indo-European,Indo-Aryan
4,Portuguese,236.0,Indo-European,Romance
5,Bengali,234.0,Indo-European,Indo-Aryan
6,Russian,147.0,Indo-European,Balto-Slavic
7,Japanese,123.0,Japonic,Japanese
8,Yue Chinese(incl. Cantonese),86.1,Sino-Tibetan,Sinitic
9,Vietnamese,85.0,Austroasiatic,Vietic


In [188]:
# Select the top 10 languages and their native speaker counts
top_10_languages = language_table.head(10)
top_10_languages

Unnamed: 0,Language,Native speakers(millions),Language family,Branch
0,"Mandarin Chinese(incl. Standard Chinese, but e...",939.0,Sino-Tibetan,Sinitic
1,Spanish,485.0,Indo-European,Romance
2,English,380.0,Indo-European,Germanic
3,"Hindi(excl. Urdu, and other languages)",345.0,Indo-European,Indo-Aryan
4,Portuguese,236.0,Indo-European,Romance
5,Bengali,234.0,Indo-European,Indo-Aryan
6,Russian,147.0,Indo-European,Balto-Slavic
7,Japanese,123.0,Japonic,Japanese
8,Yue Chinese(incl. Cantonese),86.1,Sino-Tibetan,Sinitic
9,Vietnamese,85.0,Austroasiatic,Vietic


In [202]:
# Select the first two columns (Language and Native Speaker Count)
top_10_languages = language_table.iloc[:, :2]

In [204]:
# Select the top 10 languages
top_10_languages = top_10_languages.head(10)

# Display the top 10 languages in a Pandas DataFrame
display(top_10_languages)

Unnamed: 0,Language,Native speakers(millions)
0,"Mandarin Chinese(incl. Standard Chinese, but e...",939.0
1,Spanish,485.0
2,English,380.0
3,"Hindi(excl. Urdu, and other languages)",345.0
4,Portuguese,236.0
5,Bengali,234.0
6,Russian,147.0
7,Japanese,123.0
8,Yue Chinese(incl. Cantonese),86.1
9,Vietnamese,85.0


# For my reference:
`pd.read_html(url, attrs={'class': 'wikitable'})` automatically extracts the matching tables from the Wikipedia page.

We assume that the language data is in the first table on the page and select it.

Then we select the top 10 rows of the table.

To keep the DataFrame readable, we keep only the first two columns, `Language` and `Native speakers(millions)`.

Finally, we display the top 10 languages and their native speaker counts in the pandas DataFrame.

#### 3. Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [153]:
# Define the URL of the IMDb top 250 page
url = 'https://www.imdb.com/chart/top'

In [154]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [155]:
# Define a user-agent header to mimic a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

In [156]:
# Send an HTTP GET request to the URL with the user-agent header
response = requests.get(url, headers=headers)
response

<Response [200]>

In [157]:
# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [158]:
# Find the list of top 250 movies (ordered list)
movie_list = soup.find('ol', class_='chart-list')
movie_list

In [159]:
# Initialize lists to store movie information
movie_names = []
release_years = []
director_names = []
star_ratings = []

# Find all components
components = soup.find_all('ul', class_='class="ipc-metadata-list ipc-metadata-list--dividers-between sc-3f13560f-0 sTTRj compact-list-view ipc-metadata-list--base"')
for component in components:
    row = component.find_all('td')
        
    # Extract movie name
    movie_name = component.find('h3',class_='ipc-title_text')
    movie_names.append(movie_name)

    # Extract release year
    release_year = component.find('span', class_='sc-4dcdad14-8 cvucyi cli-title-metadata-item').text.strip('()') 
    release_years.append(release_year)

    # Extract director name (from the title attribute)
    title_attr = movie_item.find('a')['title']
    director_name = title_attr.split(',')[0]
    director_names.append(director_name)

    # Extract star rating
    star_rating = component.find('span', class_= "ipc-rating-star ipc-rating-star--base ipc-rating-star--rate")
    star_ratings.append(star_rating)

 # Create a Pandas DataFrame from the extracted data
df = pd.DataFrame({
    'Movie Name': movie_names,
    'Initial Release': release_years,
    'Director Name': director_names,
    'Star Rating': star_ratings
})
df


Unnamed: 0,Movie Name,Initial Release,Director Name,Star Rating


# Note for TAs:
I tried many times, but I don't understand why my DataFrame is empty.
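
One thing worth checking in the cell above: the `class_` value passed to `find_all` starts with the literal text `class="`, and a multi-word `class_` string only matches when it equals the tag's whole `class` attribute. If nothing matches, the loop body never runs and every list stays empty. Separately, the director lookup uses `movie_item` where the loop variable is `component`. A minimal sketch of the matching behaviour on made-up markup (the class names `movie-list`, `movie` and `title` are hypothetical, not IMDb's):

```python
from bs4 import BeautifulSoup

sample_html = """
<ul class="movie-list compact">
  <li class="movie"><h3 class="title">1. First Film</h3></li>
  <li class="movie"><h3 class="title">2. Second Film</h3></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Including the attribute syntax in the value matches nothing
wrong = soup.find_all("ul", class_='class="movie-list compact"')

# A single class name matches any tag that carries it
right = soup.find_all("ul", class_="movie-list")

# Iterate over the items, not the container, to get one row per movie
titles = [
    li.find("h3", class_="title").get_text(strip=True)
    for li in right[0].find_all("li", class_="movie")
]

print(len(wrong), len(right))  # 0 1
print(titles)                  # ['1. First Film', '2. Second Film']
```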

#### 3.1. Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [127]:
#This is the url you will scrape in this exercise
url = 'https://www.imdb.com/list/ls009796553/'

In [128]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

In [129]:
# Define a user-agent header to mimic a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

In [130]:
# Define parameters for a random search on IMDb
params = {
    'genres': 'random',
    'sort': 'year,desc',
    'count': 10
}

In [131]:
# Send an HTTP GET request to the URL with the user-agent header and parameters
response = requests.get(url, headers=headers, params=params)
response

<Response [200]>

In [132]:
# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [133]:
# Find the list of movie items
movie_items = soup.find_all('div', class_='lister-item-content')
movie_items

[<div class="lister-item-content">
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">1.</span>
 <a href="/title/tt0816462/?ref_=ttls_li_tt">Conan the Barbarian</a>
 <span class="lister-item-year text-muted unbold">(2011)</span>
 </h3>
 <p class="text-muted text-small">
 <span class="certificate">16</span>
 <span class="ghost">|</span>
 <span class="runtime">113 min</span>
 <span class="ghost">|</span>
 <span class="genre">
 Action, Adventure, Fantasy            </span>
 </p>
 <div class="ipl-rating-widget">
 <div class="ipl-rating-star small">
 <span class="ipl-rating-star__star">
 <svg class="ipl-icon ipl-star-icon" fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
 <path d="M0 0h24v24H0z" fill="none"></path>
 <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
 <path d="M0 0h24v24H0z" fill="none"></path>
 </svg>
 </span>
 <span class="ipl-rating-star_

In [134]:
# Initialize lists to store movie information
movie_titles = []
movie_years = []
movie_summaries = []

# Randomize the order of movie items
random.shuffle(movie_items)

for movie_item in movie_items[:10]:  # Select the top 10 random movies
    # Extract movie title
    movie_title = movie_item.h3.a.text
    movie_titles.append(movie_title)

    # Extract movie year; keep only the last parenthesised group so that
    # a label such as "(I) (2010)" yields "2010"
    year_text = movie_item.h3.find('span', class_='lister-item-year').text
    movie_year = year_text.split('(')[-1].rstrip(')')
    movie_years.append(movie_year)

    # Extract movie summary (if available)
    movie_summary_tag = movie_item.find('p', class_='text-muted')
    movie_summary = movie_summary_tag.text.strip() if movie_summary_tag else 'Summary not available'
    movie_summaries.append(movie_summary)

# Create a Pandas DataFrame from the extracted data
df = pd.DataFrame({
    'Movie Name': movie_titles,
    'Year': movie_years,
    'Summary': movie_summaries
})
df


Unnamed: 0,Movie Name,Year,Summary
0,The A-Team,2010,"12\n|\n117 min\n|\n\nAction, Adventure, Crime"
1,Zookeeper,2011,"6\n|\n102 min\n|\n\nComedy, Family, Fantasy"
2,Date Night,2010,"9\n|\n88 min\n|\n\nComedy, Crime, Romance"
3,The Dilemma,2011,"12\n|\n111 min\n|\n\nComedy, Drama"
4,The Help,2011,12\n|\n146 min\n|\n\nDrama
5,The Inbetweeners,2011,12\n|\n97 min\n|\n\nComedy
6,Cyrus,2010,"12\n|\n91 min\n|\n\nComedy, Drama, Romance"
7,Inception,2010,"12\n|\n148 min\n|\n\nAction, Adventure, Sci-Fi"
8,Moneyball,2011,"AL\n|\n133 min\n|\n\nBiography, Drama, Sport"
9,Captain America: The First Avenger,2011,"12\n|\n124 min\n|\n\nAction, Adventure, Sci-Fi"
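
In the table above, the `Summary` column holds the certificate, runtime and genre line: the first `<p class="text-muted">` in each item is that metadata line (see the markup printed earlier), not the plot summary. Until the summary element is located in DevTools, here is a sketch of splitting that line into fields. The helper name `split_metadata` is hypothetical, and it assumes all three parts are present, as in every row above.

```python
def split_metadata(text):
    """Split 'certificate | runtime | genres' text into its three parts."""
    parts = [part.strip() for part in text.split("|")]
    # Assumes exactly three '|'-separated parts
    certificate, runtime, genres = parts
    return {
        "certificate": certificate,
        "runtime": runtime,
        "genres": [genre.strip() for genre in genres.split(",")],
    }


row = split_metadata("12\n|\n117 min\n|\n\nAction, Adventure, Crime")
print(row)
```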


## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
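
A possible starting point, sketched offline. The payload below is made up and contains only the fields used here; the field names (`weather`, `main.temp`, `wind.speed`, `name`) follow the response format described in the OpenWeatherMap documentation linked above and should be verified there. The helper name `summarize_weather` is hypothetical.

```python
import json

# Made-up payload; a real response carries many more fields
sample_body = """
{
  "name": "Lisbon",
  "weather": [{"main": "Clouds", "description": "scattered clouds"}],
  "main": {"temp": 21.4},
  "wind": {"speed": 3.6}
}
"""


def summarize_weather(payload):
    """Pick temperature, wind speed, weather and description out of the parsed JSON."""
    first = payload["weather"][0]
    return {
        "city": payload["name"],
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": first["main"],
        "description": first["description"],
    }


report = summarize_weather(json.loads(sample_body))
print(report)
```

Against the live API, `requests.get(url).json()` would supply the `payload` dictionary.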

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here
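
A possible starting point, sketched on an inline card. The class names (`product_pod`, `price_color`, `availability`) are assumptions about the site's markup and should be confirmed in DevTools; the book itself is made up.

```python
from bs4 import BeautifulSoup

# Simplified, made-up card in the style of a product listing
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/sample-book_1/index.html" title="A Sample Book">A Sample ...</a></h3>
  <div class="product_price">
    <p class="price_color">£12.34</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""

soup = BeautifulSoup(sample_html, "html.parser")

books = []
for card in soup.find_all("article", class_="product_pod"):
    books.append({
        # The link text may be truncated, so read the full name from `title`
        "name": card.h3.a.get("title"),
        "price": card.find("p", class_="price_color").get_text(strip=True),
        "availability": card.find("p", class_="availability").get_text(strip=True),
    })

print(books)
```

`pd.DataFrame(books)` then turns the list of dictionaries into the requested dataframe.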

####  Display the 100 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.
***Hint:*** Here the displayed number of earthquakes per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/?view='

# This is how you will loop through each page:
number_of_pages = int(100/20)
each_page_urls = []

for n in range(1, number_of_pages+1):
    link = url+str(n)
    each_page_urls.append(link)
    
each_page_urls

In [None]:
# your code here