# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used in the HTML content you are expected to extract.

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [69]:
response = requests.get(url)
response.status_code

200

In [70]:
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
soup


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-6448649c7147.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/light_high_contrast-42fc7e3b06b7.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d17b946fc

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. The names must not contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
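The steps above can be sketched on a small inline snippet instead of the live page. The markup below is a simplified, hand-written stand-in for GitHub's HTML: the `article` wrapper with class `Box-row` is an assumption, while the `h1` and `p` class names are the ones queried in this notebook's cells. Looping over one wrapper per developer keeps each name paired with its own username, even when the username is missing.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real GitHub page).
sample_html = """
<article class="Box-row">
  <h1 class="h3 lh-condensed">
    <a href="/monkeyWie">
      Levi
    </a>
  </h1>
  <p class="f4 text-normal mb-1"><a href="/monkeyWie">monkeyWie</a></p>
</article>
<article class="Box-row">
  <h1 class="h3 lh-condensed"><a href="/ibhagwan">ibhagwan</a></h1>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_developers = []
for card in sample_soup.find_all("article", attrs={"class": "Box-row"}):
    name_tag = card.find("h1", attrs={"class": "h3 lh-condensed"})
    handle_tag = card.find("p", attrs={"class": "f4 text-normal mb-1"})
    # strip=True removes the surrounding spaces and line breaks.
    name = name_tag.get_text(strip=True)
    if handle_tag is not None:
        sample_developers.append(f"{name} ({handle_tag.get_text(strip=True)})")
    else:
        # Some developers show only one name, so there is nothing to append.
        sample_developers.append(name)

print(sample_developers)
```

On the real page the wrapper tag and class may differ, so check them in DevTools before reusing this loop.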

In [5]:
name_element = soup.find_all("a", class_ = "Link")
name_element

[<a class="Link color-fg-accent text-normal ml-2" data-view-component="true" href="https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax" target="_blank">Search syntax tips</a>,
 <a aria-checked="false" class="select-menu-item Link" data-pjax="" data-view-component="true" href="/trending/developers/unknown?since=daily" role="menuitemradio"><span class="select-menu-item-text" data-menu-button-text="" data-view-component="true">
               Unknown languages
 </span></a>,
 <a aria-checked="false" class="select-menu-item Link" data-pjax="" data-view-component="true" href="/trending/developers/1c-enterprise?since=daily" role="menuitemradio"><span class="select-menu-item-text" data-menu-button-text="" data-view-component="true">
               1C Enterprise
 </span></a>,
 <a aria-checked="false" class="select-menu-item Link" data-pjax="" data-view-component="true" href="/trending/developers/2-dimensional-array?since=daily" role="menuitemradio"><

In [6]:
name_element = soup.find_all("h1", class_ ="h3 lh-condensed")
name_element

[<h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":59988195,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="dabcb3b6a31070bddc5f5ef6d27ee164c7edf3491c0e3f9bc4d8791903681671" data-view-component="true" href="/ibhagwan">ibhagwan</a> </h1>,
 <h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":13160176,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="7c3b373207da6de1ac17dc02c2890b1d70e8c3dcacd690cd5bb9aef48ee4f1a7" data-view-component="true" href="/monkeyWie">Levi</a> </h1>,
 <h1 class="h3 lh-conde

In [7]:
developers_names = []
for i in name_element:
    name = i.get_text().strip()
    developers_names.append(name)
len(developers_names)

25

In [8]:
user_element = soup.find_all("p",class_ = "f4 text-normal mb-1")

In [9]:
user_names = []
for i in user_element:
    user_name = i.get_text().strip()
    user_names.append(user_name)
len(user_names)

23

In [10]:
soup.find_all("div", class_ = "col-md-6")[0].get_text().strip().replace(" \n\n","&").split("&")

['ibhagwan']

In [11]:
list_of_elements = []
all_element = soup.find_all("div", class_ = "col-md-6")
# Step through every second "col-md-6" div; the ones in between are skipped.
for x in all_element[::2]:
    element = x.get_text().strip().replace(" \n\n","&").split("&")
    # Pad entries that have only one value so every entry holds two items.
    if len(element) != 2:
        element.append("Nan")
    list_of_elements.append(element)

In [12]:
list_of_elements

[['ibhagwan', 'Nan'],
 ['Levi', 'monkeyWie'],
 ['Mika Vilpas', 'mikavilpas'],
 ['Akshay Deo', 'akshaydeo'],
 ['Matt Arsenault', 'arsenm'],
 ['Lee Stott', 'leestott'],
 ['Stephen Akinyemi', 'appcypher'],
 ['Diptesh Choudhuri', 'IgnisDa'],
 ['Danny Mösch', 'SimplyDanny'],
 ['Mitchell Hashimoto', 'mitchellh'],
 ['Steve Cosman', 'scosman'],
 ['Arvin Xu', 'arvinxx'],
 ['Owen Schwartz', 'oschwartz10612'],
 ['Danilo Leal', 'danilo-leal'],
 ['Lukas Masuch', 'lukasmasuch'],
 ['comfyanonymous', 'Nan'],
 ['Eli Schleifer', 'EliSchleifer'],
 ['Really Him', 'hesreallyhim'],
 ['Derrick Hammer', 'pcfreak30'],
 ['Azure SDK Bot', 'azure-sdk'],
 ['Dominic Gannaway', 'trueadm'],
 ['Sebastian Raschka', 'rasbt'],
 ['Henrik Rydgård', 'hrydgard'],
 ['Logan', 'logan-markewich'],
 ['Chris Ferdinandi', 'cferdinandi']]

In [13]:
for i in list_of_elements:
    print(f"{i[0]} -----> ({i[1]})")

ibhagwan -----> (Nan)
Levi -----> (monkeyWie)
Mika Vilpas -----> (mikavilpas)
Akshay Deo -----> (akshaydeo)
Matt Arsenault -----> (arsenm)
Lee Stott -----> (leestott)
Stephen Akinyemi -----> (appcypher)
Diptesh Choudhuri -----> (IgnisDa)
Danny Mösch -----> (SimplyDanny)
Mitchell Hashimoto -----> (mitchellh)
Steve Cosman -----> (scosman)
Arvin Xu -----> (arvinxx)
Owen Schwartz -----> (oschwartz10612)
Danilo Leal -----> (danilo-leal)
Lukas Masuch -----> (lukasmasuch)
comfyanonymous -----> (Nan)
Eli Schleifer -----> (EliSchleifer)
Really Him -----> (hesreallyhim)
Derrick Hammer -----> (pcfreak30)
Azure SDK Bot -----> (azure-sdk)
Dominic Gannaway -----> (trueadm)
Sebastian Raschka -----> (rasbt)
Henrik Rydgård -----> (hrydgard)
Logan -----> (logan-markewich)
Chris Ferdinandi -----> (cferdinandi)


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
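As a minimal sketch of one way to do this, the owner and repository can be read from the link's `href` rather than from the heading text, which avoids cleaning up spaces and slashes. The sample markup is hand-written and simplified; only the `h2` class name and the `/owner/repo` shape of the `href` are taken from the page output shown in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real GitHub page).
sample_html = """
<h2 class="h3 lh-condensed">
  <a href="/QuentinFuxa/WhisperLiveKit">
    <span>QuentinFuxa /</span>
    WhisperLiveKit
  </a>
</h2>
<h2 class="h3 lh-condensed">
  <a href="/chubin/cheat.sh"><span>chubin /</span> cheat.sh</a>
</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_repositories = []
for heading in sample_soup.find_all("h2", class_="h3 lh-condensed"):
    link = heading.find("a")
    # The href has the form "/owner/repo", so split it on the slash.
    owner, repo = link.get("href").strip("/").split("/")
    sample_repositories.append(f"{owner}/{repo}")

print(sample_repositories)
```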

In [14]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [15]:
repo_element = soup.find_all("h2",class_ = "h3 lh-condensed")#[0].get_text()
repo_element

[<h2 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":905697354,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="a710cccbb828658a36751000dc1711f8c75ef1d6199857c1e10ae1ff44a0f5cd" data-view-component="true" href="/QuentinFuxa/WhisperLiveKit"><svg aria-hidden="true" class="octicon octicon-repo mr-1 color-fg-muted" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.08

In [16]:
repo_list = []
for x in repo_element:
    repo_name = x.get_text().strip().replace("/","").split()
    repo_list.append(repo_name)
repo_list

[['QuentinFuxa', 'WhisperLiveKit'],
 ['laramies', 'theHarvester'],
 ['TheAlgorithms', 'Python'],
 ['paperless-ngx', 'paperless-ngx'],
 ['All-Hands-AI', 'OpenHands'],
 ['chubin', 'cheat.sh'],
 ['microsoft', 'qlib'],
 ['RVC-Project', 'Retrieval-based-Voice-Conversion-WebUI'],
 ['llamastack', 'llama-stack'],
 ['OpenBMB', 'MiniCPM-V'],
 ['pydantic', 'pydantic-ai'],
 ['freqtrade', 'freqtrade'],
 ['iperov', 'DeepFaceLab'],
 ['microsoft', 'RD-Agent'],
 ['pwndbg', 'pwndbg'],
 ['inventree', 'InvenTree'],
 ['Johnserf-Seed', 'f2'],
 ['SylphAI-Inc', 'AdalFlow']]

In [17]:
for i in repo_list:
    print(f"{i[0]} -----> ({i[1]})")

QuentinFuxa -----> (WhisperLiveKit)
laramies -----> (theHarvester)
TheAlgorithms -----> (Python)
paperless-ngx -----> (paperless-ngx)
All-Hands-AI -----> (OpenHands)
chubin -----> (cheat.sh)
microsoft -----> (qlib)
RVC-Project -----> (Retrieval-based-Voice-Conversion-WebUI)
llamastack -----> (llama-stack)
OpenBMB -----> (MiniCPM-V)
pydantic -----> (pydantic-ai)
freqtrade -----> (freqtrade)
iperov -----> (DeepFaceLab)
microsoft -----> (RD-Agent)
pwndbg -----> (pwndbg)
inventree -----> (InvenTree)
Johnserf-Seed -----> (f2)
SylphAI-Inc -----> (AdalFlow)


#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access the attributes of a tag. Check out the documentation.

In [18]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-featu

In [19]:
img_element = soup.find_all("a", class_ = "mw-file-description")#[0].get("href")
img_element

[<a class="mw-file-description" href="/wiki/File:Walt_Disney_1946_(cropped2).JPG"><img class="mw-file-element" data-file-height="600" data-file-width="450" decoding="async" height="333" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/50/Walt_Disney_1946_%28cropped2%29.JPG/250px-Walt_Disney_1946_%28cropped2%29.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/50/Walt_Disney_1946_%28cropped2%29.JPG/375px-Walt_Disney_1946_%28cropped2%29.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/5/50/Walt_Disney_1946_%28cropped2%29.JPG 2x" width="250"/></a>,
 <a class="mw-file-description" href="/wiki/File:Walt_Disney_1942_signature.svg"><img class="mw-file-element" data-file-height="218" data-file-width="585" decoding="async" height="56" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/250px-Walt_Disney_1942_signature.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/330px-Walt_Disney_1942_s

In [20]:
list_of_img_links = []
for i in img_element:
    img_link = i.get("href")
    if img_link.lower().endswith(".jpg"):
        list_of_img_links.append(img_link)

In [21]:
len(list_of_img_links)

13
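The cells above collect the `href` of each file page and keep only the ones ending in `.jpg`, so SVG and PNG images are left out and the links point to Wikipedia file pages rather than to the image files. A possible alternative, sketched here on a hand-written sample, reads the `src` attribute of every `img` tag and resolves protocol-relative URLs (those starting with `//`) with `urllib.parse.urljoin`. The second anchor in the sample is invented to show what happens when `src` is missing.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Hand-written sample; the second <img> is hypothetical and has no src.
sample_html = """
<a class="mw-file-description" href="/wiki/File:Walt_Disney_1942_signature.svg"><img class="mw-file-element" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/250px-Walt_Disney_1942_signature.svg.png"/></a>
<a class="mw-file-description" href="/wiki/File:Example.JPG"><img alt="no source"/></a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_image_links = []
for img in sample_soup.find_all("img"):
    src = img.get("src")  # .get() returns None when the attribute is absent
    if src:
        # urljoin adds the scheme of page_url to "//host/path" style links.
        sample_image_links.append(urljoin(page_url, src))

print(sample_image_links)
```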

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [22]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia</title>
<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
<link href="//creativecommons.org/licenses/by-sa/4.0/" rel="license"/>
<style>
.sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-e49fbf32.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg-MediaWiki-logo_sister{background

In [23]:
number_element = soup.find_all("small")
number_element

[<small>7,041,000+ <span>articles</span></small>,
 <small>1,469,000+ <span>記事</span></small>,
 <small>2 059 000+ <span>статей</span></small>,
 <small>3.042.000+ <span>Artikel</span></small>,
 <small>2 702 000+ <span>articles</span></small>,
 <small>2.055.000+ <span>artículos</span></small>,
 <small>1,494,000+ <span>条目 / 條目</span></small>,
 <small>1.931.000+ <span>voci</span></small>,
 <small>1 666 000+ <span>haseł</span></small>,
 <small>1.152.000+ <span>artigos</span></small>,
 <small class="jsl10n" data-jsl10n="license">This page is available under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike License</a></small>,
 <small class="jsl10n" data-jsl10n="terms"><a href="https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Terms_of_Use">Terms of Use</a></small>,
 <small class="jsl10n" data-jsl10n="privacy-policy"><a href="https://foundation.wikimedia.org/wiki/Special:MyLanguage/Policy:Privacy_policy">Privacy Policy</a></sma

In [24]:
number_list = []
for n in number_element:
    number = n.get_text()
    if "+" in number:
        number_list.append(number)
number_list

['7,041,000+ articles',
 '1,469,000+ 記事',
 '2\xa0059\xa0000+ статей',
 '3.042.000+ Artikel',
 '2\u202f702\u202f000+ articles',
 '2.055.000+ artículos',
 '1,494,000+ 条目 / 條目',
 '1.931.000+ voci',
 '1\xa0666\xa0000+ haseł',
 '1.152.000+ artigos']

In [25]:
language_element = soup.find_all("strong")
language_element

[<strong class="jsl10n localized-slogan" data-jsl10n="portal.slogan">The Free Encyclopedia</strong>,
 <strong>English</strong>,
 <strong>日本語</strong>,
 <strong>Русский</strong>,
 <strong>Deutsch</strong>,
 <strong>Français</strong>,
 <strong>Español</strong>,
 <strong>中文</strong>,
 <strong>Italiano</strong>,
 <strong>Polski</strong>,
 <strong>Português</strong>,
 <strong class="jsl10n" data-jsl10n="portal.app-links.title">
 <a class="jsl10n" data-jsl10n="portal.app-links.url" href="https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications">
 Download Wikipedia for Android or iOS
 </a>
 </strong>]

In [26]:
language_list = []
for l in language_element[1:-1]:
    language = l.get_text()
    language_list.append(language)
language_list

['English',
 '日本語',
 'Русский',
 'Deutsch',
 'Français',
 'Español',
 '中文',
 'Italiano',
 'Polski',
 'Português']

In [27]:
for l,n in zip(language_list, number_list):
    print(f"{l} ----> {n}")

English ----> 7,041,000+ articles
日本語 ----> 1,469,000+ 記事
Русский ----> 2 059 000+ статей
Deutsch ----> 3.042.000+ Artikel
Français ----> 2 702 000+ articles
Español ----> 2.055.000+ artículos
中文 ----> 1,494,000+ 条目 / 條目
Italiano ----> 1.931.000+ voci
Polski ----> 1 666 000+ haseł
Português ----> 1.152.000+ artigos
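The counts above use different thousands separators (commas, dots, non-breaking spaces), so they cannot be compared as they are. If numeric values are wanted, one option is to drop every non-digit character before the `+` sign. This is an optional extra, not part of the exercise; `parse_article_count` is a hypothetical helper name.

```python
import re

# A few of the strings scraped above, copied as they appear in the output.
raw_counts = [
    "7,041,000+ articles",
    "2\xa0059\xa0000+ статей",
    "3.042.000+ Artikel",
    "2\u202f702\u202f000+ articles",
]


def parse_article_count(text):
    """Return the number before the '+' as an int, ignoring separators."""
    digits_part = text.split("+")[0]
    return int(re.sub(r"\D", "", digits_part))


article_counts = [parse_article_count(text) for text in raw_counts]
print(article_counts)
```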


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the table you want to analyse, you can use a nested **for** loop to collect the cells row by row (check out the `tr` and `td` tags). <br>An easier way is `pd.read_html()`; see the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
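The nested-loop approach from the hint might look like the sketch below. It runs on a tiny hand-written table, not the real Wikipedia page; the `wikitable` class and the two-column layout are assumptions, and the three figures are copied from the output later in this section. Reading the cells one `tr` at a time keeps each language on the same row as its number.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written sample table (an assumption about the page layout).
sample_html = """
<table class="wikitable">
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td>Mandarin Chinese</td><td>990</td></tr>
  <tr><td>Spanish</td><td>484</td></tr>
  <tr><td>English</td><td>390</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):              # outer loop: one table row
    cells = []
    for td in tr.find_all("td"):             # inner loop: one cell
        cells.append(td.get_text(strip=True))
    if cells:                                # the header row has only <th> cells
        rows.append(cells)

columns = ["Language", "Native speakers (millions)"]
languages_df = pd.DataFrame(rows, columns=columns)
languages_df[columns[1]] = languages_df[columns[1]].astype(int)

top_languages = languages_df.head(10)
print(top_languages)
```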

In [28]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited

In [29]:
language_name_element = soup.find_all("a", class_ = "mw-redirect")
language_name_element

[<a class="mw-redirect" href="/wiki/Native_speaker" title="Native speaker">native speakers</a>,
 <a class="mw-redirect" href="/wiki/Mutually_intelligible" title="Mutually intelligible">mutually intelligible</a>,
 <a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:por" title="ISO 639:por">Portuguese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:ben" title="ISO 639:ben">Bengali</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:rus" title="ISO 639:rus">Russian</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:jpn" title="ISO 639:jpn">Japanese</a>,
 <a class="mw-redirect" href="/w

In [30]:
empty_list = []
wiki_element = soup.find_all("td")
for i in wiki_element:
    values = i.get_text().strip()
    empty_list.append(values)
languages = empty_list[0:116:4]

In [31]:
native_speakers_mil = empty_list[1:116:4]

In [32]:
print("These are the top 10 languages spoken:")
for r,l,n in zip(range(1,11),languages, native_speakers_mil):
    print(f"{r}- {l} spoken by {n} millions")

These are the top 10 languages spoken:
1- Mandarin Chinese spoken by 990 millions
2- Spanish spoken by 484 millions
3- English spoken by 390 millions
4- Hindi spoken by 345 millions
5- Portuguese spoken by 250 millions
6- Bengali spoken by 242 millions
7- Russian spoken by 145 millions
8- Japanese spoken by 124 millions
9- Western Punjabi spoken by 90 millions
10- Vietnamese spoken by 86 millions


#### 3. Display Metacritic top 24 Best TV Shows of all time (TV Show name, initial release date, metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [33]:
# This is the url you will scrape in this exercise 
url = 'https://www.metacritic.com/browse/tv/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<html data-edition="us" data-n-head="%7B%22lang%22:%7B%22ssr%22:%22en%22%7D,%22data-edition%22:%7B%22ssr%22:%22us%22%7D%7D" data-n-head-ssr="" lang="en">
<head>
<!-- running tag = 'metacritic.prod.028db8' -->
<meta charset="utf-8" data-hid="charset" data-n-head="ssr"/><meta content="width=device-width, initial-scale=1" data-hid="viewport" data-n-head="ssr" name="viewport"/><meta content="100001036810388" data-hid="fb:admins" data-n-head="ssr" property="fb:admins"/><meta content="123113677890173" data-hid="fb:app_id" data-n-head="ssr" property="fb:app_id"/><meta content="Metacritic aggregates music, game, tv, and movie reviews from the leading critics. Only Metacritic.com uses METASCORES, which let you know at a glance how each item was reviewed." data-hid="description" data-n-head="ssr" name="description"/><meta content="I1kHyfzmmG1fEVjq8GBUgkfCHc6PNtxce1_VyUuJhws" data-hid="google-site-verification" data-n-head="ssr" name="google-site-verification"/><meta content="#42

In [34]:
show_names = []
show_rankings = []
shows_element= soup.find_all('h3', class_ = "c-finderProductCard_titleHeading")
for i in shows_element:
    show_name = i.get_text().strip().split(". ")
    show_names.append(show_name[1].strip())
    show_rankings.append(show_name[0])
show_names
show_rankings

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24']

In [35]:
release_rating_element = soup.find_all("div", class_ = "c-finderProductCard_meta")#[23].get_text().strip().split("\n")

In [36]:
show_release_dates = []
show_rating_system = []
release_rating_element = soup.find_all("div", class_ = "c-finderProductCard_meta")
for i in release_rating_element:
    show_name = i.get_text().strip().split("\n")
    show_release_dates.append(show_name[0])
    show_rating_system.append(show_name[-1].strip().replace("Apr 23, 2021", "Nan"))
print(show_rating_system)

['Rated TV-G', 'Rated TV-MA', 'Rated TV-14', 'Rated TV-14', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-G', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-14', 'Rated TV-MA', 'Rated TV-14', 'Rated TV-PG', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-MA', 'Nan', 'Rated TV-14', 'Rated TV-MA', 'Rated TV-MA', 'Rated TV-14', 'Rated TV-MA']


In [37]:
metascore_element = soup.find_all("div", class_ = "c-siteReviewScore u-flexbox-column u-flexbox-alignCenter u-flexbox-justifyCenter g-text-bold c-siteReviewScore_green g-color-gray90 c-siteReviewScore_xsmall")#[0].get_text()

In [38]:
metascore_list = []
for i in metascore_element:
    metascore = i.get_text().strip()
    metascore_list.append(metascore)
metascore_list

['97',
 '97',
 '96',
 '96',
 '96',
 '96',
 '96',
 '95',
 '95',
 '95',
 '94',
 '93',
 '93',
 '93',
 '93',
 '92',
 '92',
 '92',
 '92',
 '92',
 '91',
 '91',
 '91',
 '91']

In [39]:
description_element = soup.find_all("div", class_ = "c-finderProductCard_description")#[0].get_text()

In [40]:
description_list = []
for i in description_element:
    description = i.get_text().strip()
    description_list.append(description)
description_list

["Airing simultaneously on AMC, BBC America, IFC, SundanceTV, and WE tv, the seven-part mini-series takes a look at life in the world's oceans with narration by David Attenborough.",
 "The 10-part documentary series from Steve James takes a year-long look at Chicago's Oak Park and River Forest High School as students and staff deal with a variety of issues, including race, at one of the most diverse suburban public high schools in the U.S.",
 "Narrated by Peter Coyote, the three-part documentry series directed by Ken Burns, Lynn Novick and Sarah Botstein examines the United States' response to the rise of Nazism and the Holocaust through interviews, first-person accounts and archival footage.",
 "The seven-and-a-half-hour documentary chronicles O.J. Simpson's rise from the San Francisco housing projects to football fame in college and the NFL to his post-football career and infamy with the 1995 trial for the killings of his ex-wife, Nicole, and Ron Goldman. The first episode will air o

In [41]:
tv_shows_df = pd.DataFrame({"Ranking": show_rankings, "Tv Show": show_names, "Relase Date": show_release_dates, 
                          "Metascore": metascore_list, "Show Rating System": show_rating_system,
                          "Description": description_list})
tv_shows_df.set_index("Ranking")

| Ranking | Tv Show | Relase Date | Metascore | Show Rating System | Description |
| --- | --- | --- | --- | --- | --- |
| 1 | Planet Earth: Blue Planet II | Jan 20, 2018 | 97 | Rated TV-G | Airing simultaneously on AMC, BBC America, IFC... |
| 2 | The Office (UK) | Jan 23, 2003 | 97 | Rated TV-MA | "Trust, encouragement, reward, loyalty... sati... |
| 3 | America to Me | Aug 26, 2018 | 96 | Rated TV-14 | The 10-part documentary series from Steve Jame... |
| 4 | The U.S | Sep 18, 2022 | 96 | Rated TV-14 | Narrated by Peter Coyote, the three-part docum... |
| 5 | O.J.: Made in America | May 20, 2016 | 96 | Rated TV-MA | The seven-and-a-half-hour documentary chronicl... |
| 6 | Bo Burnham: Inside | May 30, 2021 | 96 | Rated TV-MA | The musical comedy special was filmed by the c... |
| 7 | Planet Earth II | Feb 18, 2017 | 96 | Rated TV-G | Narrated by David Attenborough, the sequel to ... |
| 8 | The Staircase | 2004 | 95 | Rated TV-MA | An 8-part documentary series about the celebra... |
| 9 | The Larry Sanders Show | Aug 15, 1992 | 95 | Rated TV-MA | Comic Garry Shandling draws upon his own talk ... |
| 10 | Homicide: Life on the Street | Jan 31, 1993 | 95 | Rated TV-14 | This series was the most reality-based police ... |
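Two details in the cells above are fragile. The separate lists only line up if every card contains every field, which is why the missing rating had to be patched with `replace("Apr 23, 2021", "Nan")`. Also, `split(". ")` without a limit cuts titles that contain ". " themselves, as the "The U.S" row shows. A possible alternative is to build one record per card and to split the heading only at the first ". ". The markup below is hand-written and simplified: the class names come from the cells above, the nesting is an assumption, and the second card is invented.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified cards (the real page nests these more deeply).
sample_html = """
<a class="c-finderProductCard_container" href="/tv/the-us-and-the-holocaust/">
  <h3 class="c-finderProductCard_titleHeading">4. The U.S. and the Holocaust</h3>
  <div class="c-finderProductCard_meta">
    Sep 18, 2022
    Rated TV-14
  </div>
  <div class="c-finderProductCard_description">Narrated by Peter Coyote.</div>
</a>
<a class="c-finderProductCard_container" href="/tv/example-show/">
  <h3 class="c-finderProductCard_titleHeading">19. Example Show</h3>
  <div class="c-finderProductCard_meta">Apr 23, 2021</div>
</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


records = []
for card in sample_soup.find_all("a", class_="c-finderProductCard_container"):
    heading = text_or_none(card, "h3", "c-finderProductCard_titleHeading")
    # partition splits at the first ". " only, so dots in the title survive.
    ranking, _, title = heading.partition(". ")
    meta = text_or_none(card, "div", "c-finderProductCard_meta") or ""
    meta_lines = [line.strip() for line in meta.splitlines() if line.strip()]
    records.append({
        "Ranking": ranking,
        "Tv Show": title,
        "Release Date": meta_lines[0] if meta_lines else None,
        # None when the card carries no "Rated ..." line.
        "Show Rating System": next(
            (line for line in meta_lines if line.startswith("Rated")), None
        ),
        "Description": text_or_none(card, "div", "c-finderProductCard_description"),
    })

print(records)
```

Passing `records` to `pd.DataFrame(records)` would give one row per card, with missing fields shown as `None`.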


#### 3.1. Find the image source link and the TV show link. Once you have retrieved them, add them to your initial dataframe.

In [44]:
image_element = soup.find_all("a", class_ = "c-finderProductCard_container g-color-gray80 u-grid")#[0]

In [49]:
soup.find_all("a", class_ = "c-finderProductCard_container g-color-gray80 u-grid")[0]

<a class="c-finderProductCard_container g-color-gray80 u-grid" href="/tv/planet-earth-blue-planet-ii/"><div class="c-finderProductCard_images g-outer-spacing-right-medium"><div class="c-finderProductCard_leftContainer"><div class="c-finderProductCard_imgContainer g-height-100 g-container-rounded-medium"><!-- --> <div class="c-finderProductCard_img g-height-100 g-width-100"><picture class="c-cmsImage"><!-- --> <img height="132" src="" style="display:none;" width="88"/> <div aria-hidden="true" class="c-globalImagePlaceholder c-cmsImage c-cmsImage-vertical"><div class="c-globalImagePlaceholder--vertical o-ratio o-ratio-tall g-bg-gray80 g-container-rounded-medium"><div class="g-inner-spacing-small o-ratio_content u-flexbox u-flexbox-alignCenter u-flexbox-justifyCenter"><svg class="g-width-100" viewbox="0 0 176 40"><use class="g-fill-gray60" xlink:href="#logoWordmarkPlaceholder"></use></svg></div></div></div></picture></div></div> <!-- --></div></div> <div class="c-finderProductCard_info u-

In [45]:
image_links = []
for i in image_element:
    image_element_2 = i.find("img")
    print(image_element_2)
    image_link = image_element_2['src']
    image_links.append(image_link)
image_links

<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:none;" width="88"/>
<img height="132" src="" style="display:

TypeError: 'NoneType' object is not subscriptable

In [46]:
image_links

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
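The traceback above arises because `i.find("img")` returns `None` for a card without an `<img>` tag, and indexing `None` with `['src']` fails. The empty `src=""` values suggest that the real image URLs are filled in later by JavaScript, which `requests` does not run; that is an inference from the output, not something verified here. The sketch below guards against both cases and also builds the show link with `urljoin`. The sample markup is hand-written, and the `example.com` image URL is a placeholder.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.metacritic.com"

# Hand-written sample cards; the third image URL is a made-up placeholder.
sample_html = """
<a class="c-finderProductCard_container" href="/tv/planet-earth-blue-planet-ii/">
  <img height="132" src="" style="display:none;" width="88"/>
</a>
<a class="c-finderProductCard_container" href="/tv/the-office-uk/"></a>
<a class="c-finderProductCard_container" href="/tv/america-to-me/">
  <img src="https://example.com/poster.jpg"/>
</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

card_image_links = []
card_page_links = []
for card in sample_soup.find_all("a", class_="c-finderProductCard_container"):
    img = card.find("img")  # None when the card has no <img> tag
    src = img.get("src") if img is not None else None
    card_image_links.append(src or None)  # treat an empty src as missing
    card_page_links.append(urljoin(base_url, card.get("href")))

print(card_image_links)
print(card_page_links)
```

Because every card appends exactly one value to each list, both lists keep the same length as the dataframe and can be added as columns.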

In [47]:
links_element= soup.find_all("a", class_ = "c-finderProductCard_container g-color-gray80 u-grid")#[0].get("href")

In [48]:
links = []
for l in links_element:
    link = l.get("href")
    full_link = "https://www.metacritic.com" + link
    links.append(full_link)
links

['https://www.metacritic.com/tv/planet-earth-blue-planet-ii/',
 'https://www.metacritic.com/tv/the-office-uk/',
 'https://www.metacritic.com/tv/america-to-me/',
 'https://www.metacritic.com/tv/the-us-and-the-holocaust/',
 'https://www.metacritic.com/tv/oj-made-in-america/',
 'https://www.metacritic.com/tv/bo-burnham-inside/',
 'https://www.metacritic.com/tv/planet-earth-ii/',
 'https://www.metacritic.com/tv/the-staircase/',
 'https://www.metacritic.com/tv/the-larry-sanders-show/',
 'https://www.metacritic.com/tv/homicide-life-on-the-street/',
 'https://www.metacritic.com/tv/the-sopranos/',
 'https://www.metacritic.com/tv/city-so-real/',
 'https://www.metacritic.com/tv/bleak-house/',
 'https://www.metacritic.com/tv/homecoming-a-film-by-beyonce/',
 'https://www.metacritic.com/tv/samurai-jack/',
 'https://www.metacritic.com/tv/fleabag/',
 'https://www.metacritic.com/tv/the-underground-railroad/',
 'https://www.metacritic.com/tv/atlanta/',
 'https://www.metacritic.com/tv/romeo-juliet/',
 '
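String concatenation works here because every `href` in the output starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The `hrefs` list below is illustrative: the first two paths come from the output above, and the third is a made-up case.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.metacritic.com"

hrefs = [
    "/tv/planet-earth-blue-planet-ii/",
    "/tv/the-office-uk/",
    "https://www.metacritic.com/tv/atlanta/",  # already absolute (made-up case)
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
links = [urljoin(BASE_URL, href) for href in hrefs]
print(links)
```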

In [50]:
tv_shows_df['links'] = links

In [51]:
tv_shows_df.set_index("Ranking", inplace =True)

In [52]:
tv_shows_df

Ranking,Tv Show,Relase Date,Metascore,Show Rating System,Description,links
1,Planet Earth: Blue Planet II,"Jan 20, 2018",97,Rated TV-G,"Airing simultaneously on AMC, BBC America, IFC...",https://www.metacritic.com/tv/planet-earth-blu...
2,The Office (UK),"Jan 23, 2003",97,Rated TV-MA,"""Trust, encouragement, reward, loyalty... sati...",https://www.metacritic.com/tv/the-office-uk/
3,America to Me,"Aug 26, 2018",96,Rated TV-14,The 10-part documentary series from Steve Jame...,https://www.metacritic.com/tv/america-to-me/
4,The U.S,"Sep 18, 2022",96,Rated TV-14,"Narrated by Peter Coyote, the three-part docum...",https://www.metacritic.com/tv/the-us-and-the-h...
5,O.J.: Made in America,"May 20, 2016",96,Rated TV-MA,The seven-and-a-half-hour documentary chronicl...,https://www.metacritic.com/tv/oj-made-in-america/
6,Bo Burnham: Inside,"May 30, 2021",96,Rated TV-MA,The musical comedy special was filmed by the c...,https://www.metacritic.com/tv/bo-burnham-inside/
7,Planet Earth II,"Feb 18, 2017",96,Rated TV-G,"Narrated by David Attenborough, the sequel to ...",https://www.metacritic.com/tv/planet-earth-ii/
8,The Staircase,2004,95,Rated TV-MA,An 8-part documentary series about the celebra...,https://www.metacritic.com/tv/the-staircase/
9,The Larry Sanders Show,"Aug 15, 1992",95,Rated TV-MA,Comic Garry Shandling draws upon his own talk ...,https://www.metacritic.com/tv/the-larry-sander...
10,Homicide: Life on the Street,"Jan 31, 1993",95,Rated TV-14,This series was the most reality-based police ...,https://www.metacritic.com/tv/homicide-life-on...


## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
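# The request below targets the WeatherAPI.com current-weather endpoint
# (the original comment pointed at https://openweathermap.org/current).
import os
city = input('Enter the city: ')
api_key = os.environ["WEATHERAPI_KEY"]  # assumed variable name; keeps the key out of the notebook
url = f'https://api.weatherapi.com/v1/current.json?key={api_key}&q={city}&aqi=no'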


In [None]:
weather_api = {
  "location": {
    "name": "Cairo",
    "region": "Al Qahirah",
    "country": "Egypt",
    "lat": 30.05,
    "lon": 31.25,
    "tz_id": "Africa/Cairo",
    "localtime_epoch": 1756720468,
    "localtime": "2025-09-01 12:54"
  },
  "current": {
    "last_updated_epoch": 1756719900,
    "last_updated": "2025-09-01 12:45",
    "temp_c": 33.2,
    "temp_f": 91.8,
    "is_day": 1,
    "condition": {
      "text": "Sunny",
      "icon": "//cdn.weatherapi.com/weather/64x64/day/113.png",
      "code": 1000
    },
    "wind_mph": 9.6,
    "wind_kph": 15.5,
    "wind_degree": 348,
    "wind_dir": "NNW",
    "pressure_mb": 1010,
    "pressure_in": 29.83,
    "precip_mm": 0,
    "precip_in": 0,
    "humidity": 24,
    "cloud": 0,
    "feelslike_c": 31.1,
    "feelslike_f": 88,
    "windchill_c": 35.7,
    "windchill_f": 96.3,
    "heatindex_c": 34.1,
    "heatindex_f": 93.4,
    "dewpoint_c": 9.4,
    "dewpoint_f": 49,
    "vis_km": 10,
    "vis_miles": 6,
    "uv": 8.8,
    "gust_mph": 11.1,
    "gust_kph": 17.8
  }
}
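The values the exercise asks for can be read straight out of a payload shaped like the sample above. This is a minimal sketch, assuming the response has already been decoded into a dict (for example with `response.json()`). `summarize_weather` is a hypothetical helper name, and the small `sample` dict only repeats fields from the response shown above.

```python
def summarize_weather(payload):
    """Pick the fields the exercise asks for from a WeatherAPI-style dict."""
    current = payload["current"]
    return {
        "city": payload["location"]["name"],
        "temperature_c": current["temp_c"],
        "wind_kph": current["wind_kph"],
        "description": current["condition"]["text"],
    }

# Trimmed copy of the sample response above (same field names and values).
sample = {
    "location": {"name": "Cairo"},
    "current": {"temp_c": 33.2, "wind_kph": 15.5, "condition": {"text": "Sunny"}},
}

summary = summarize_weather(sample)
print(summary)
```

Indexing with `[...]` rather than `.get(...)` is deliberate in this sketch: a missing field fails loudly with a `KeyError` instead of silently producing `None`.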

#### Find the book name, price and stock availability on the Books to Scrape website and return them as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [53]:
url = 'http://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [54]:
title_element = soup.find_all("a", title= True)
title_element

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>,
 <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>,
 <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>,
 <a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>,
 <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>,
 <a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>,
 <a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>,
 <a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the 

In [55]:
titles = []
for x in title_element:
    title = x.get("title")
    titles.append(title)
    
titles

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

In [56]:
price_element = soup.find_all("p", class_="price_color")
price_element

[<p class="price_color">£51.77</p>,
 <p class="price_color">£53.74</p>,
 <p class="price_color">£50.10</p>,
 <p class="price_color">£47.82</p>,
 <p class="price_color">£54.23</p>,
 <p class="price_color">£22.65</p>,
 <p class="price_color">£33.34</p>,
 <p class="price_color">£17.93</p>,
 <p class="price_color">£22.60</p>,
 <p class="price_color">£52.15</p>,
 <p class="price_color">£13.99</p>,
 <p class="price_color">£20.66</p>,
 <p class="price_color">£17.46</p>,
 <p class="price_color">£52.29</p>,
 <p class="price_color">£35.02</p>,
 <p class="price_color">£57.25</p>,
 <p class="price_color">£23.88</p>,
 <p class="price_color">£37.59</p>,
 <p class="price_color">£51.33</p>,
 <p class="price_color">£45.17</p>]

In [57]:
prices = []
for i in price_element:
    price = i.get_text()
    prices.append(price)
    
prices

['£51.77',
 '£53.74',
 '£50.10',
 '£47.82',
 '£54.23',
 '£22.65',
 '£33.34',
 '£17.93',
 '£22.60',
 '£52.15',
 '£13.99',
 '£20.66',
 '£17.46',
 '£52.29',
 '£35.02',
 '£57.25',
 '£23.88',
 '£37.59',
 '£51.33',
 '£45.17']
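The prices are still strings, so they cannot be sorted or averaged numerically yet. One possible conversion is sketched below. `parse_price` is a hypothetical helper, and the regular expression assumes each string contains a single decimal number, as in the output above.

```python
import re

def parse_price(text):
    """Return the first decimal number found in a price string as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return float(match.group())

prices = ["£51.77", "£53.74", "£50.10"]
numeric_prices = [parse_price(p) for p in prices]
print(numeric_prices)
```

Matching the digits rather than stripping `£` keeps the helper working if a stray character appears before the currency sign because of an encoding mismatch.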

In [58]:
stars = soup.find_all("p", class_="star-rating")
stars

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating Four">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating Five">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 <p class="star-rating One">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i 

In [59]:
rating_list = []
for i in stars:
    rating = i.get('class')[1]
    rating_list.append(rating)

rating_list

['Three',
 'One',
 'One',
 'Four',
 'Five',
 'One',
 'Four',
 'Three',
 'Four',
 'One',
 'Two',
 'Four',
 'Five',
 'Five',
 'Five',
 'Three',
 'One',
 'One',
 'Two',
 'Two']
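The ratings come back as words because the site encodes them in the class name. If numeric ratings are more convenient, a small lookup table is enough. This is a sketch; `RATING_WORDS` is a name introduced here, and the short `rating_list` repeats the first five values from the output above.

```python
# Map the class-name words to integers
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

rating_list = ["Three", "One", "One", "Four", "Five"]
numeric_ratings = [RATING_WORDS[word] for word in rating_list]
print(numeric_ratings)
```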

In [60]:
stock_element = soup.find_all("p", class_ = "instock availability")
stock_element

[<p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>,
 <p class="instock availability">
 <i cl

In [63]:
stock_list = []
for s in stock_element:
    stock = s.get_text().strip()
    stock_list.append(stock)
stock_list

['In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock']

In [65]:
book_dict = {"Title":titles, "Rating":rating_list, "Price":prices, "stock_availability" : stock_list}

df = pd.DataFrame(book_dict)
df

,Title,Rating,Price,stock_availability
0,A Light in the Attic,Three,£51.77,In stock
1,Tipping the Velvet,One,£53.74,In stock
2,Soumission,One,£50.10,In stock
3,Sharp Objects,Four,£47.82,In stock
4,Sapiens: A Brief History of Humankind,Five,£54.23,In stock
5,The Requiem Red,One,£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,Four,£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,Three,£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,Four,£22.60,In stock
9,The Black Maria,One,£52.15,In stock
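Collecting each column with a separate page-wide `find_all` relies on every book having every field; otherwise the lists drift out of step. A hedged alternative is to loop over one container per book and read all fields from it. The sketch below runs on a small inline HTML string and assumes each book sits in an `<article class="product_pod">` wrapper, so check the live page source before relying on that class name.

```python
from bs4 import BeautifulSoup

# Inline sample written for this sketch; the wrapper element is an assumption.
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability"><i class="icon-ok"></i> In stock </p>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
  <p class="instock availability"><i class="icon-ok"></i> In stock </p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

books = []
for card in soup.find_all("article", class_="product_pod"):
    # All four fields are read from the same card, so rows cannot misalign
    books.append({
        "Title": card.find("a", title=True)["title"],
        "Rating": card.find("p", class_="star-rating")["class"][1],
        "Price": card.find("p", class_="price_color").get_text(),
        "stock_availability": card.find("p", class_="availability").get_text().strip(),
    })

print(books)
```

A list of dicts like `books` can be passed directly to `pd.DataFrame(books)` to get the same columns as the dataframe above.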


#### Display the first 100 books listed on the site. Once again, collect each book's name, price and stock availability.

***Hint:*** Each page displays 20 books. To reach the following pages, loop over the page numbers you need and append each one to the end of the url.

In [66]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = 100 // 20  # 100 books at 20 books per page
each_page_urls = []
for n in range(1, number_of_pages+1):
    link = url+str(n)+".html"
    each_page_urls.append(link)
    
each_page_urls

['https://books.toscrape.com/catalogue/page-1.html',
 'https://books.toscrape.com/catalogue/page-2.html',
 'https://books.toscrape.com/catalogue/page-3.html',
 'https://books.toscrape.com/catalogue/page-4.html',
 'https://books.toscrape.com/catalogue/page-5.html']

In [67]:
titles = []
prices = []
rating_list = []
stock_list = []

for url in each_page_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    title_element = soup.find_all("a", title= True)
    
    for x in title_element:
        title = x.get("title")
        titles.append(title)
    
    price_element = soup.find_all("p", class_="price_color")
    
    for i in price_element:
        price = i.get_text()
        prices.append(price)
        
    stars = soup.find_all("p", class_="star-rating")
    
    for i in stars:
        rating = i.get('class')[1]
        rating_list.append(rating)
        
    
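    # Re-query the stock elements for this page; reusing `stock_element`
    # from the homepage cell would repeat the homepage values on every page.
    stock_element = soup.find_all("p", class_="instock availability")

    for s in stock_element:
        stock = s.get_text().strip()
        stock_list.append(stock)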
        
book_dict = {"Title":titles, "Rating":rating_list, "Price":prices, "stock_availability" : stock_list}

df = pd.DataFrame(book_dict)
    
    

In [68]:
df

,Title,Rating,Price,stock_availability
0,A Light in the Attic,Three,£51.77,In stock
1,Tipping the Velvet,One,£53.74,In stock
2,Soumission,One,£50.10,In stock
3,Sharp Objects,Four,£47.82,In stock
4,Sapiens: A Brief History of Humankind,Five,£54.23,In stock
...,...,...,...,...
95,Lumberjanes Vol. 3: A Terrible Plan (Lumberjan...,Two,£19.92,In stock
96,"Layered: Baking, Building, and Styling Spectac...",One,£40.11,In stock
97,Judo: Seven Steps to Black Belt (an Introducto...,Two,£53.90,In stock
98,Join,Five,£35.67,In stock
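When columns are gathered in separate lists across several pages, a quick length check before building the dataframe helps catch a page where one field was missing. The sketch below uses a hypothetical helper, `check_equal_lengths`, with made-up sample values.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Made-up sample values, two rows per column
columns = {"Title": ["A", "B"], "Price": ["£1.00", "£2.00"]}
lengths = check_equal_lengths(columns)
print(lengths)
```

Calling it on the dict of lists just before `pd.DataFrame(...)` gives an error message that names the short column.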
