# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and take a look at their source code through Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
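
As a small illustration of the first tip, the sketch below turns a numeric status code into a readable verdict using only the standard library. The helper name `describe_status` is hypothetical; in this notebook you would pass it `response.status_code` from a `requests` response.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes outside the standard registry are reported as unknown.
        phrase = "Unknown status"
    verdict = "OK to parse" if 200 <= code < 300 else "do not parse"
    return f"{code} {phrase}: {verdict}"


# Example with literal codes; in practice use response.status_code.
print(describe_status(200))
print(describe_status(404))
```

Treating only 2xx codes as parseable is a deliberately simple rule; a page can still return 200 and contain a login wall or an error message, which is why printing the response text remains useful.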

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [34]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [35]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [36]:
#connect and check connection
response = requests.get(url)
response

<Response [200]>

In [37]:
#print content of the url
response.content

b'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-efd2f2257c96.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-6b1e37da2254.css" /><link data-color-theme="dark_dimmed" crossori

In [38]:
#parse (using BeautifulSoup)
soup = BeautifulSoup(response.content, 'html.parser')

In [39]:
type(soup)
# type(soup) is bs4.BeautifulSoup: a data structure representing a parsed HTML or XML document.


In [40]:
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-efd2f2257c96.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-6b1e37da2254.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://g

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by clicking 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
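
Step 4 above can be sketched with plain string methods. The helper name `clean_text` and the raw strings are hypothetical; they are shaped like what `.text` returns for a heading element that contains indentation and line breaks.

```python
def clean_text(raw: str) -> str:
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(raw.split())


# Hypothetical raw text, as .text might return it for a heading element.
raw_names = ["\n\n            sobolevn\n ", "\n  Damián   Ricobelli\n"]
clean_names = [clean_text(name) for name in raw_names]
print(clean_names)
```

This version keeps one space between words. The sample output above removes spaces entirely, which `"".join(raw.split())` would reproduce.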

In [41]:
# your code here
developers = soup.find_all('h1', class_='h3 lh-condensed')
developers

[<h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":4660275,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="fa447ef919d8f8120caa7fb69e39e737fde9445107c0bdb44864c65c0eb414bc" data-view-component="true" href="/sobolevn">
             sobolevn
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":22398603,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="8edbb9f311ce39d6608a8c72ba869bea9d08551550d5c4f94a4c04d141d2d758" data-view-component="true" href="/damianricobelli">
             Dami

In [42]:
# your code here
developers_names = [dev.get_text(strip=True) for dev in developers]
for name in developers_names:
    print(name)

sobolevn
Damián Ricobelli
Tim
David Sherret
yetone
Logan
Stephen Celis
Bagheera
overlookmotel
sigoden
Quinn Slack
Matthias Fey
jdx
Jelte Fennema-Nio
Sebastian Raschka
lyuwenyu
Nathan Rajlich
LangChain4j
James Lamb
Huang Huang
Mo'men Sherif
Kilian Lieret
TheTurtle
Mika Vilpas
Morgante Pell


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [43]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [44]:
# your code here
response = requests.get(url)
print(response)

<Response [200]>


In [45]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())


<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-efd2f2257c96.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-6b1e37da2254.css" media="all" rel="stylesheet"/>
  <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://g

In [46]:
repositories= soup.find_all('h2', class_='h3 lh-condensed')
repositories


[<h2 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":230885748,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="8e2ce84785121782fcf14f6343b7e785de8c1906da333bbb41c6cf933ca64303" data-view-component="true" href="/goauthentik/authentik">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 color-fg-muted" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.087a.

In [47]:
repositories_names = [repo.get_text(strip=True) for repo in repositories]
for name in repositories_names:
    print(name)

goauthentik /authentik
microsoft /graphrag
OpenBB-finance /OpenBB
tinygrad /tinygrad
opendatalab /MinerU
LibreTranslate /LibreTranslate
thuml /Time-Series-Library
huggingface /lerobot
mealie-recipes /mealie
infiniflow /ragflow
karpathy /nanoGPT
YvanYin /Metric3D
pre-commit /pre-commit-hooks
SigmaHQ /sigma
princeton-nlp /SWE-agent
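
The names above keep a stray space before the slash because the owner and repository sit in separate elements. One alternative, sketched here under the assumption that each heading's `<a>` tag carries an `href` of the form `/owner/repo` (as in the `/goauthentik/authentik` link shown earlier), is to build the name from that path. The helper name `split_repo_path` is hypothetical, and the second path is inferred from the printed names.

```python
def split_repo_path(href: str) -> tuple:
    """Split a path such as '/owner/repo' into (owner, repo)."""
    owner, repo = href.strip("/").split("/", 1)
    return owner, repo


# Paths as they would come from a.get('href') on each repository heading.
hrefs = ["/goauthentik/authentik", "/microsoft/graphrag"]
full_names = ["/".join(split_repo_path(h)) for h in hrefs]
print(full_names)
```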


#### 2. Display all the image links from the Walt Disney Wikipedia page.



Hint: use `.get()` to access information inside tags. Check out the documentation.

In [48]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [49]:
# your code here
response = requests.get(url)
print(response)

<Response [200]>


In [50]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Walt Disney - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled v

In [51]:
image_tags = soup.find_all('img')
image_tags


[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img alt="Featured article" class="mw-file-element" data-file-height="443" data-file-width="466" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Extended-protected article" class="mw-file-element" data-file-height="512" data-file

In [52]:
image_links = [image.get('src') for image in image_tags]
for link in image_links:
    print(link)

/static/images/icons/wikipedia.png
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
/static/images/mobile/copyright/wikipedia-tagline-en.svg
//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.
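
The `src` values above are either site-relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`), so they cannot be opened as they stand. A minimal sketch of turning them into absolute URLs with the standard library follows; the short `raw_links` list is a hand-picked sample, and the `None` entry stands for a tag without a `src` attribute.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

raw_links = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
    None,  # what tag.get('src') returns when the attribute is missing
]

# urljoin resolves each link against the page it was found on;
# missing values are skipped.
absolute_links = [urljoin(page_url, link) for link in raw_links if link]
for link in absolute_links:
    print(link)
```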

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [53]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [54]:
# your code here
response = requests.get(url)
response

<Response [200]>

In [55]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia
  </title>
  <meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
  </script>
  <meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
  <link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
  <link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
  <link href="//creativecommons.org/licenses/by-sa/4.0/" rel="license"/>
  <style>
   .sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-de847d1a.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg

In [56]:
langs = soup.find_all('div', attrs={'class' : 'central-featured-lang'})
langs

[<div class="central-featured-lang lang1" dir="ltr" lang="en">
 <a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
 <strong>English</strong>
 <small>6,847,000+ <span>articles</span></small>
 </a>
 </div>,
 <div class="central-featured-lang lang2" dir="ltr" lang="ja">
 <a class="link-box" data-slogan="フリー百科事典" href="//ja.wikipedia.org/" id="js-link-box-ja" title="Nihongo — ウィキペディア — フリー百科事典">
 <strong>日本語</strong>
 <small>1,421,000+ <span>記事</span></small>
 </a>
 </div>,
 <div class="central-featured-lang lang3" dir="ltr" lang="de">
 <a class="link-box" data-slogan="Die freie Enzyklopädie" href="//de.wikipedia.org/" id="js-link-box-de" title="Deutsch — Wikipedia — Die freie Enzyklopädie">
 <strong>Deutsch</strong>
 <small>2.924.000+ <span>Artikel</span></small>
 </a>
 </div>,
 <div class="central-featured-lang lang4" dir="ltr" lang="ru">
 <a class="link-box" data-slogan="Свободная эн

In [57]:
langs_list = [lang.get_text(strip=True) for lang in langs]
def format_language_string(language):
    for i, char in enumerate(language):
        if char.isdigit():
            name = language[:i].strip()
            # Keep digits, commas and spaces only; other separators (such as '.') and trailing text are dropped
            articles = ''.join([c for c in language[i:] if c.isdigit() or c == ',' or c == ' '])
            return f'{name:<15} {articles}'
    return language

# Format and display the text to make it more readable
for language in langs_list:
    formatted_language = format_language_string(language)
    print(formatted_language)

English         6,847,000
日本語             1,421,000
Deutsch         2924000
Русский         1987000
Español         1965000
Français        2621000
中文              1,429,000  
Italiano        1871000
فارسی           ۱۰۰۶۰۰۰
Português       1128000
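
The counts above are printed as text and mix several conventions: comma, period or space as thousands separator, and Eastern Arabic digits for Persian. If numeric values are wanted, one possible approach is to keep only the decimal digits and convert them. The helper name `parse_article_count` is hypothetical, and the space-separated sample is an assumed form of the Russian entry.

```python
def parse_article_count(text: str) -> int:
    """Extract an integer from a count such as '6,847,000+'."""
    # str.isdecimal() is True for decimal digits in any script,
    # and int() accepts those same digits.
    digits = "".join(ch for ch in text if ch.isdecimal())
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


samples = ["6,847,000+", "2.924.000+", "1 987 000+", "۱۰۰۶۰۰۰+"]
counts = [parse_article_count(s) for s in samples]
print(counts)
```

This is safe here only because the counts are whole numbers; a value with a decimal part would need locale-aware parsing instead.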


#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the table you want to analyse, you can use a nested **for** loop to read it row by row (check out the `tr` and `td` tags). <br>An easier way is `pd.read_html()`; see the documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
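
The nested-loop approach from the hint can be sketched on a small inline table. The HTML below is a hand-written stand-in for the real page (the two data rows reuse values from the Wikipedia percentage table), so the column names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Language</th><th>Share</th></tr>
  <tr><td>1</td><td><a href="/wiki/Mandarin_Chinese">Mandarin Chinese</a></td><td>12.3%</td></tr>
  <tr><td>2</td><td><a href="/wiki/Spanish_language">Spanish</a></td><td>6.0%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):            # outer loop: one <tr> per row
    cells = []
    for cell in tr.find_all(["th", "td"]):  # inner loop: header or data cells
        cells.append(cell.get_text(strip=True))
    rows.append(cells)

header, body = rows[0], rows[1:]
print(header)
print(body)
```

From here, `pd.DataFrame(body, columns=header)` would give the dataframe the exercise asks for.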

In [58]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [59]:
# your code here
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Finding the table containing the data
table = soup.find('table', {'class': 'wikitable sortable'})
table


<table class="wikitable sortable">
<caption>Top first languages by population per <i>CIA</i><sup class="reference" id="cite_ref-CIA_8-1"><a href="#cite_note-CIA-8"><span class="cite-bracket">[</span>8<span class="cite-bracket">]</span></a></sup>
</caption>
<tbody><tr>
<th>Rank
</th>
<th>Language
</th>
<th>Percentage<br/>of world<br/>population<br/>(2018)
</th></tr>
<tr>
<td>1</td>
<td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a></td>
<td>12.3%
</td></tr>
<tr>
<td>2</td>
<td><a href="/wiki/Spanish_language" title="Spanish language">Spanish</a></td>
<td>6.0%
</td></tr>
<tr>
<td>3</td>
<td><a href="/wiki/English_language" title="English language">English</a></td>
<td>5.1%
</td></tr>
<tr>
<td>3</td>
<td><a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a></td>
<td>5.1%
</td></tr>
<tr>
<td>5</td>
<td><a class="mw-redirect" href="/wiki/Hindi_language" title="Hindi language">Hindi</a></td>
<td>3.5%
</td></tr>
<tr>
<td>6</td>


In [60]:
import pandas as pd

# URL of the Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

# Read every HTML table on the page into a list of DataFrames
tables = pd.read_html(url)

# Take the desired table (assuming it is the first table on the page)
df = tables[0]

# Display the top 10 rows of the selected columns
top_10_languages = df[['Language', 'Native speakers (in millions)', 'Language family', 'Branch']].head(10)
display(top_10_languages)
# Note: the BeautifulSoup approach in the previous cell always returned the 2nd table (percentages)
# instead of the 1st table (speaker counts), so pd.read_html is used here.


| | Language | Native speakers (in millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese | 941 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 486 | Indo-European | Romance |
| 2 | English | 380 | Indo-European | Germanic |
| 3 | Hindi | 345 | Indo-European | Indo-Aryan |
| 4 | Bengali | 237 | Indo-European | Indo-Aryan |
| 5 | Portuguese | 236 | Indo-European | Romance |
| 6 | Russian | 148 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123 | Japonic | Japanese |
| 8 | Yue Chinese | 86 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85 | Austroasiatic | Vietic |


#### 3. Display Metacritic top 24 Best TV Shows of all time (TV Show name, initial release date, metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [61]:
# This is the url you will scrape in this exercise
url = 'https://www.metacritic.com/browse/tv/'

In [62]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Metacritic page with top TV shows
url = 'https://www.metacritic.com/browse/tv/'

# Make a request to fetch the page content
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
content = response.content

# Parse the page content using BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

data = []
shows = soup.find_all('div', class_='c-finderProductCard_info u-flexbox-column')  # Adjusting for the page structure

for show in shows[:24]:  # Only taking top 24 shows
    # Extract TV Show name
    title_tag = show.find('h3', class_='c-finderProductCard_titleHeading')
    title = title_tag.text.strip() if title_tag else 'N/A'

    # Extract initial release date
    date_tag = show.find('span', class_="u-text-uppercase")
    release_date = date_tag.text.strip() if date_tag else 'N/A'

    # Extract metascore rating
    score_tag = show.find('div', class_='c-siteReviewScore')
    metascore = score_tag.text.strip() if score_tag else 'N/A'

    # Extract description
    description_tag = show.find('div', class_='c-finderProductCard_description')
    description = description_tag.text.strip() if description_tag else 'N/A'

    # Extract film rating
    meta_info = show.find('div', class_='c-finderProductCard_meta')
    spans = meta_info.find_all('span') if meta_info else []

    if len(spans) >= 3:
        film_rating = spans[2].text.strip().replace("Rated ", "")
    else:
        film_rating = 'No rating'

    data.append({
        'TV Show Name': title,
        'Initial Release Date': release_date,
        'Metascore': metascore,
        'Film Rating': film_rating,
        'Description': description
    })

# Create a DataFrame under its own name so it does not clash with the languages dataframe `df`
df_shows = pd.DataFrame(data)
df1 = df_shows[['TV Show Name', 'Initial Release Date', 'Metascore', 'Film Rating', 'Description']]

# Use display() rather than print() so the DataFrame is rendered as a table
display(df1)
# If this cell raises "TypeError: 'DataFrame' object is not callable", pd.DataFrame has been rebound
# earlier in the session (for example by an assignment such as `pd.DataFrame = ...`).
# Remove that assignment, restart the kernel and re-run; the same code works in a clean notebook.

#### 3.1. Find the image source link and the TV show link. Once you have retrieved them, add them to your initial dataframe.

In [63]:
# your code here
# Requires selenium; if it is not installed yet, run `%pip install selenium` in a notebook cell first.
# In this notebook the import failed with ModuleNotFoundError, although the same code ran in a fresh notebook.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# URL of the MetaCritic TV page
url = 'https://www.metacritic.com/browse/tv/'

# Set up options for Firefox WebDriver to run headless
options = webdriver.FirefoxOptions()
options.add_argument('--headless')  # Run Firefox in headless mode (no GUI)

# Set up Firefox WebDriver with the specified options
driver = webdriver.Firefox(options=options)  # Change to the appropriate WebDriver for your browser

# Open the URL in the browser
driver.get(url)

# Give the browser some time to load the page
time.sleep(5)

# Get the page source after it's been modified by JavaScript
page_source = driver.page_source

# Close the browser
driver.quit()

# Parse the HTML content
soup = BeautifulSoup(page_source, 'html.parser')

# Find all the div tags with class 'c-finderProductCard', which contain information about each TV show
tv_show_cards = soup.find_all('div', class_='c-finderProductCard')

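# Iterate over each TV show card to extract the image source link and the TV show link
image_sources = []
show_links = []
for card in tv_show_cards:
    # Extract the image source link, falling back to 'data-src' when 'src' is missing or empty
    image_tag = card.find('img')
    image_source = (image_tag.get('src') or image_tag.get('data-src')) if image_tag else None

    # Extract the TV show link (assumes the card's first <a> tag points to the show page)
    link_tag = card.find('a')
    show_link = link_tag.get('href') if link_tag else None

    image_sources.append(image_source)
    show_links.append(show_link)

    # Print the image source link and the TV show link
    print("Image Source:", image_source)
    print("TV Show Link:", show_link)
    print()

# To add the links to the dataframe from the previous task, assign the full lists
# (one entry per dataframe row is required):
#df1['Image Source'] = image_sources[:len(df1)]
#df1['TV Show Link'] = show_links[:len(df1)]
#display(df1)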



ModuleNotFoundError: No module named 'selenium'

## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
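# The URL below calls WeatherAPI (api.weatherapi.com); the OpenWeatherMap docs at https://openweathermap.org/current describe a different API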
city = input('Enter the city: ')
url = f'https://api.weatherapi.com/v1/current.json?key=5a68dbd3fe6242678ac130253242505&q={city}&aqi=no'


In [None]:
# your code here
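
One way to approach this exercise is sketched below without any network call. The field names `temp_c`, `wind_kph` and `condition.text` are assumptions about the JSON that the WeatherAPI endpoint returns, and `sample_body` is a made-up payload, not real weather data. The helper name `build_weather_url` and the placeholder key are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_weather_url(city: str, api_key: str) -> str:
    """Build the request URL, URL-encoding the city name."""
    query = urlencode({"key": api_key, "q": city, "aqi": "no"})
    return f"https://api.weatherapi.com/v1/current.json?{query}"


# Made-up response body with the assumed structure.
sample_body = """
{"location": {"name": "Lisbon"},
 "current": {"temp_c": 21.0, "wind_kph": 14.4,
             "condition": {"text": "Sunny"}}}
"""

data = json.loads(sample_body)
current = data["current"]
report = {
    "city": data["location"]["name"],
    "temperature_c": current["temp_c"],
    "wind_kph": current["wind_kph"],
    "description": current["condition"]["text"],
}
print(build_weather_url("New York", "YOUR_API_KEY"))
print(report)
```

With a live request, `requests.get(url).json()` would take the place of `json.loads(sample_body)`. Encoding the query keeps city names with spaces or accents from breaking the URL.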

#### Find the book name, price and stock availability from books to scrape website as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise.
# It is a fictional bookstore created to be scraped.
url = 'http://books.toscrape.com/'

In [None]:
# your code here
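
A possible starting point for this exercise is sketched below on an inline snippet rather than the live site. The markup is an assumption modelled on the product cards of books.toscrape.com (`article.product_pod`, `p.price_color`, `p.availability`); check the real page in DevTools before relying on these class names.

```python
from bs4 import BeautifulSoup

# Hand-written sample of one product card (assumed structure).
html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

books = []
for card in soup.find_all("article", class_="product_pod"):
    price_text = card.find("p", class_="price_color").get_text(strip=True)
    books.append({
        # The full title lives in the link's title attribute;
        # the visible link text is truncated.
        "name": card.h3.a["title"],
        "price": float(price_text.lstrip("£")),
        "availability": card.find("p", class_="availability").get_text(strip=True),
    })
print(books)
```

Passing `books` to `pd.DataFrame` gives one row per book with the three requested columns.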

#### Display the first 100 books available on the homepage. Once again, collect the book name, price and stock availability.

***Hint:*** The total number of displayed books per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = 100 // 20  # 100 books at 20 books per page
each_page_urls = [f'{url}{n}.html' for n in range(1, number_of_pages + 1)]

each_page_urls

In [None]:
# your code here