# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
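
The first two tips can be wrapped in a small helper. The sketch below is illustrative only: `summarize_response` is a hypothetical name, and a `SimpleNamespace` stands in for the object that `requests.get()` would return, so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_response(response, preview_chars=200):
    """Report status and a text preview for any object with .status_code and .text."""
    ok = 200 <= response.status_code < 300
    preview = response.text[:preview_chars]
    return {"ok": ok, "status": response.status_code, "preview": preview}


# Stand-in for requests.get(url); the content is made up for illustration.
fake_response = SimpleNamespace(status_code=200, text="<html><title>Demo</title></html>")

report = summarize_response(fake_response)
print(report["status"], report["ok"])  # 200 True
print(report["preview"])
```

In the exercises, the same helper could be called on a real response, for example `summarize_response(requests.get(url))`, before any parsing is attempted.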

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [5]:
# your code here
import requests
from bs4 import BeautifulSoup

# URL of GitHub Trending Developers
url = "https://github.com/trending/developers"

# Send HTTP GET request and raise an exception on a 4xx/5xx status
# (calling exit() inside a notebook would shut down the kernel)
response = requests.get(url)
response.raise_for_status()

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Each developer entry is inside <article class="Box-row">
developers = soup.find_all("article", class_="Box-row")

for dev in developers:
    # Developer name (username is usually inside 'h1' > 'a')
    name_tag = dev.find("h1", class_="h3 lh-condensed")
    if name_tag:
        username = name_tag.get_text(strip=True)
        profile_url = "https://github.com" + name_tag.find("a")["href"]
        print(f"{username} → {profile_url}")


ibhagwan → https://github.com/ibhagwan
Xunzhuo → https://github.com/Xunzhuo
Elie Steinbock → https://github.com/elie222
Daniel Öster → https://github.com/dalathegreat
Weblate (bot) → https://github.com/weblate
thinkasany → https://github.com/thinkasany
Karl Seguin → https://github.com/karlseguin
Jeremiah Lowin → https://github.com/jlowin
Henrik Rydgård → https://github.com/hrydgard
Tom Moor → https://github.com/tommoor
Danny Mösch → https://github.com/SimplyDanny
Luis M. Gallardo D. → https://github.com/lgallard
Liran Tal → https://github.com/lirantal
Jerry Zhao → https://github.com/jerryz123
Aleksandr Statciuk → https://github.com/freearhey
Sebastian Raschka → https://github.com/rasbt
Eric Buehler → https://github.com/EricLBuehler
Yorukot → https://github.com/yorukot
Derrick Hammer → https://github.com/pcfreak30
Yeuoly → https://github.com/Yeuoly
Folke Lemaitre → https://github.com/folke
AMIT SHEKHAR → https://github.com/amitshekhariitbhu
Thomas Schmelzer → https://github.com/tschm
We

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by right-clicking and choosing 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to remove extra whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
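
Step 4 can be handled with plain string methods. This is a minimal sketch, assuming raw text shaped roughly like what `.text` returns for a name element (the sample strings and function names are made up for illustration). Calling `str.split()` with no arguments splits on any run of whitespace, including `\n`, so re-joining the pieces cleans the text.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs and newlines with a single space."""
    return " ".join(text.split())


def remove_whitespace(text):
    """Drop all whitespace, as in the sample output above."""
    return "".join(text.split())


# Made-up raw strings imitating the text of a name element
raw_names = [
    "\n\n   Ada Lovelace\n\n      (ada)\n",
    "\n  grace-hopper \n",
]

clean_names = [collapse_whitespace(text) for text in raw_names]
print(clean_names)  # ['Ada Lovelace (ada)', 'grace-hopper']
```

Which variant to use depends on whether spaces inside a name should survive: `collapse_whitespace` keeps one space between words, while `remove_whitespace` removes them entirely.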

In [6]:
# your code here
import requests
from bs4 import BeautifulSoup

url = "https://github.com/trending/developers"

# Fetch page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all developer entries
developers = soup.find_all("article", class_="Box-row")

# Extract clean text names (GitHub usernames)
dev_names = []
for dev in developers:
    name_tag = dev.find("h1", class_="h3 lh-condensed")
    if name_tag:
        username = name_tag.get_text(strip=True)  # removes all HTML tags and extra spaces
        dev_names.append(username)

print(dev_names)


['Seth Vargo', 'Danny Mösch', 'ibhagwan', 'Liran Tal', 'yetone', 'Elie Steinbock', 'lauren', 'James Henry', 'Jeremiah Lowin', 'Luis M. Gallardo D.', 'hydai', 'Daniel Öster', 'Karl Seguin', 'Henrik Rydgård', 'Sebastian Raschka', 'comfyanonymous', 'Fatih Arslan', 'AMIT SHEKHAR', 'Travis Cline', "Paul D'Ambra", 'Niels Laute', 'Xunzhuo', 'Lucas Gomide', 'Xingchen Song(宋星辰)', 'Lukasz']


In [7]:
import requests
from bs4 import BeautifulSoup

url = "https://github.com/trending/developers"

# Step 1: Fetch page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Step 2: Use find_all() with attrs to get the elements containing developer names
name_elements = soup.find_all("h1", attrs={"class": "h3 lh-condensed"})

# Step 3: Loop through elements, clean text with .get_text(), strip spaces/linebreaks
dev_names = []
for elem in name_elements:
    clean_name = elem.get_text(separator=" ", strip=True)  # removes \n, trims whitespace
    dev_names.append(clean_name)

# Step 4: Print the list
print(dev_names)


['Seth Vargo', 'Danny Mösch', 'ibhagwan', 'Liran Tal', 'yetone', 'Elie Steinbock', 'lauren', 'James Henry', 'Jeremiah Lowin', 'Luis M. Gallardo D.', 'hydai', 'Daniel Öster', 'Karl Seguin', 'Henrik Rydgård', 'Sebastian Raschka', 'comfyanonymous', 'Fatih Arslan', 'AMIT SHEKHAR', 'Travis Cline', "Paul D'Ambra", 'Niels Laute', 'Xunzhuo', 'Lucas Gomide', 'Xingchen Song(宋星辰)', 'Lukasz']
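
One caveat when matching on the full class string: as the Beautiful Soup documentation describes it, a value such as `"h3 lh-condensed"` is compared against the exact attribute string, so an element listing the same classes in a different order is missed. A CSS selector passed to `select()` matches each class independently. The HTML below is a made-up fragment, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes, written in a different order
html = """
<article><h1 class="h3 lh-condensed"><a href="/alice"> Alice </a></h1></article>
<article><h1 class="lh-condensed h3"><a href="/bob"> Bob </a></h1></article>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute
exact = soup.find_all("h1", class_="h3 lh-condensed")
# CSS selector: any <h1> carrying both classes, in any order
by_selector = soup.select("h1.h3.lh-condensed")

exact_names = [tag.get_text(strip=True) for tag in exact]
selector_names = [tag.get_text(strip=True) for tag in by_selector]

print(exact_names)     # ['Alice']
print(selector_names)  # ['Alice', 'Bob']
```

The selector form tends to be a little more tolerant of small markup changes, at the cost of requiring CSS selector syntax.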


#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [8]:
# your code here
import requests
from bs4 import BeautifulSoup

url = "https://github.com/trending/python?since=daily"

# Fetch page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Step 1: Find all repo name elements
repo_elements = soup.find_all("h2", attrs={"class": "h3 lh-condensed"})

# Step 2: Loop through elements, clean text
repo_names = []
for elem in repo_elements:
    # Get text, strip whitespace/newlines, and collapse spaces
    clean_name = elem.get_text(separator=" ", strip=True).replace("\n", " ")
    repo_names.append(clean_name)

# Step 3: Print the list of repositories
print(repo_names)


['oraios / serena', 'HKUDS / DeepCode', 'crewAIInc / crewAI', 'murtaza-nasir / maestro', 'QwenLM / Qwen3', 'NVIDIA / Megatron-LM', 'livekit / agents', 'lllyasviel / Fooocus', 'microsoft / rStar', 'resemble-ai / chatterbox', 'denizsafak / abogen', 'huggingface / diffusers', 'JefferyHcool / BiliNote', 'PaddlePaddle / PaddleOCR', 'lfnovo / open-notebook']


#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [52]:
# your code here
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Fetch page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Step 1: Find all <img> tags
img_tags = soup.find_all("img")

# Step 2: Extract 'src' attribute using .get()
img_links = []
for img in img_tags:
    src = img.get("src")
    if src:  # ensure src exists
        # Wikipedia uses relative links, so prepend https: if missing
        if src.startswith("//"):
            src = "https:" + src
        elif src.startswith("/"):
            src = "https://en.wikipedia.org" + src
        img_links.append(src)

# Step 3: Print the list of image links
print(img_links)





['https://en.wikipedia.org/static/images/icons/wikipedia.png', 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg', 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg', 'https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png', 'https://upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/50/Walt_Disney_1946_%28cropped2%29.JPG/250px-Walt_Disney_1946_%28cropped2%29.JPG', 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/250px-Walt_Disney_1942_signature.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/250px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg', 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelo
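
The manual prefix checks for `//` and `/` can also be delegated to the standard library. A minimal sketch using `urllib.parse.urljoin`, which resolves protocol-relative, root-relative and absolute links against the page URL; the first and third `src` values are made-up examples.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

srcs = [
    "//upload.wikimedia.org/wikipedia/commons/example.jpg",  # protocol-relative (made up)
    "/static/images/icons/wikipedia.png",                    # root-relative
    "https://example.org/already-absolute.png",              # absolute (made up)
]

# urljoin leaves absolute URLs untouched and completes the others
absolute_links = [urljoin(page_url, src) for src in srcs]

for link in absolute_links:
    print(link)
```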

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.

In [48]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [58]:
# your code here
import requests
from bs4 import BeautifulSoup

url = "https://www.wikipedia.org/"

# Fetch page
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
# Parse the raw bytes so BeautifulSoup detects the UTF-8 encoding itself;
# response.text can be mis-decoded when the server does not declare a charset
soup = BeautifulSoup(response.content, "html.parser")

# Step 1: Find all language boxes
lang_boxes = soup.find_all("a", attrs={"class": "link-box"})

# Step 2: Extract language name + number of articles
languages = []
for box in lang_boxes:
    lang_name = box.find("strong").get_text(strip=True)
    article_count = box.find("small").get_text(strip=True)
    languages.append((lang_name, article_count))

# Step 3: Print results
for lang, count in languages:
    print(f"{lang}: {count}")


English: 7,050,000+articles
日本語: 1,471,000+記事
Русский: 2 061 000+статей
Deutsch: 3.046.000+Artikel
Français: 2 706 000+articles
Español: 2.058.000+artículos
中文: 1,497,000+条目 / 條目
Italiano: 1.933.000+voci
Polski: 1 667 000+haseł
Português: 1.154.000+artigos

#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way to do it is using pd.read_html(), check out documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).

In [59]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [65]:
from io import StringIO

import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)

# Step 1: Read all tables from the page
# (wrap the HTML in StringIO; recent pandas versions warn when given a literal HTML string)
tables = pd.read_html(StringIO(response.text))

# Step 2: The first table on this page is the one we want
df = tables[0]

# Step 3: Select only the top 10 rows
top10 = df.head(10)

# Step 4: Display the DataFrame
print(top10)


           Language  Native speakers (millions) Language family        Branch
0  Mandarin Chinese                         990    Sino-Tibetan       Sinitic
1           Spanish                         484   Indo-European       Romance
2           English                         390   Indo-European      Germanic
3             Hindi                         345   Indo-European    Indo-Aryan
4        Portuguese                         250   Indo-European       Romance
5           Bengali                         242   Indo-European    Indo-Aryan
6           Russian                         145   Indo-European  Balto-Slavic
7          Japanese                         124         Japonic             —
8   Western Punjabi                          90   Indo-European    Indo-Aryan
9        Vietnamese                          86   Austroasiatic        Vietic
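
For comparison, the nested-loop approach mentioned in the hint can be sketched as follows. The table below is a small hand-written fragment that reuses two rows from the output above; the real page has more columns and extra markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table
html = """
<table>
  <tr><th>Language</th><th>Native speakers (millions)</th></tr>
  <tr><td>Mandarin Chinese</td><td>990</td></tr>
  <tr><td>Spanish</td><td>484</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr"):            # outer loop: one iteration per row
    cells = []
    for cell in tr.find_all(["th", "td"]):  # inner loop: header or data cells
        cells.append(cell.get_text(strip=True))
    rows.append(cells)

# First row holds the column names, the rest are data rows
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]
print(records)
```

Passing `records` to `pd.DataFrame(records)` would give a dataframe with the same columns; note that the values are still strings at this point.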




#### 3. Display Metacritic top 24 Best TV Shows of all time (TV Show name, initial release date, metascore rating, film rating system and description) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?

In [66]:
# This is the url you will scrape in this exercise 
url = 'https://www.metacritic.com/browse/tv/'

In [67]:
# your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.metacritic.com/browse/tv/"

# Metacritic blocks requests without a User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Step 1: Find all shows (each is usually in a div with class 'clamp-summary-wrap')
show_divs = soup.find_all("div", class_="clamp-summary-wrap")

# Step 2: Extract details
data = []
for div in show_divs[:24]:  # top 24 shows
    show_name_tag = div.find("a", class_="title")
    show_name = show_name_tag.get_text(strip=True) if show_name_tag else None

    release_tag = div.find("div", class_="clamp-details")
    # Guard against a missing second <span> to avoid an IndexError
    release_spans = release_tag.find_all("span") if release_tag else []
    release_date = release_spans[1].get_text(strip=True) if len(release_spans) > 1 else None

    metascore_tag = div.find("a", class_="metascore_anchor")
    metascore_div = metascore_tag.find("div") if metascore_tag else None
    metascore = metascore_div.get_text(strip=True) if metascore_div else None

    # Film rating system is sometimes in 'clamp-score-wrap'
    rating_wrap = div.find("div", class_="clamp-score-wrap")
    rating_tag = rating_wrap.find("div", class_="clamp-rating") if rating_wrap else None
    rating_system = rating_tag.get_text(strip=True) if rating_tag else None

    # Description / summary
    desc_tag = div.find("div", class_="summary")
    description = desc_tag.get_text(strip=True) if desc_tag else None

    data.append([show_name, release_date, metascore, rating_system, description])

# Step 3: Create DataFrame
df = pd.DataFrame(data, columns=["Show Name", "Initial Release", "Metascore", "Rating System", "Description"])

# Step 4: Display
print(df)


Empty DataFrame
Columns: [Show Name, Initial Release, Metascore, Rating System, Description]
Index: []


#### 3.1. Find the image source link and the TV show link. Once you have retrieved them, add them to your initial dataframe.

In [68]:
# your code here
# We were told to skip this...

## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [69]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = f'https://api.weatherapi.com/v1/current.json?key=5a68dbd3fe6242678ac130253242505&q={city}&aqi=no'


In [75]:
# your code here
import requests

# Step 1: Get city input
city = input("Enter the city: ")

# Step 2: Build the request; passing params lets requests URL-encode the city name
url = "https://api.weatherapi.com/v1/current.json"
params = {"key": "5a68dbd3fe6242678ac130253242505", "q": city, "aqi": "no"}

# Step 3: Fetch data from API
response = requests.get(url, params=params, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()

    # Step 4: Extract desired information
    temp_c = data['current']['temp_c']
    wind_kph = data['current']['wind_kph']
    description = data['current']['condition']['text']
    weather_icon = data['current']['condition']['icon']

    # Step 5: Display results
    print(f"Weather report for {city}:")
    print(f"Temperature: {temp_c}°C")
    print(f"Wind Speed: {wind_kph} kph")
    print(f"Description: {description}")
    print(f"Weather icon URL: https:{weather_icon}")

else:
    print("Failed to fetch weather data. Please check the city name or API key.")


Weather report for berlin:
Temperature: 26.1°C
Wind Speed: 19.1 kph
Description: Sunny
Weather icon URL: https://cdn.weatherapi.com/weather/64x64/day/113.png
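
City names with spaces or special characters need URL-encoding, and fields can be missing from a JSON reply. The sketch below shows both ideas with the standard library only. `build_weather_url` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, and the sample payload is hand-written to mirror the fields used above rather than taken from a real response.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.weatherapi.com/v1/current.json"


def build_weather_url(city, api_key):
    """Return the request URL with every query value safely encoded."""
    query = urlencode({"key": api_key, "q": city, "aqi": "no"})
    return f"{BASE_URL}?{query}"


url = build_weather_url("New York", "YOUR_API_KEY")
print(url)

# Hand-written payload imitating the shape of the fields read above
sample_payload = json.loads(
    '{"current": {"temp_c": 26.1, "wind_kph": 19.1, "condition": {"text": "Sunny"}}}'
)

# .get() with defaults avoids a KeyError when a field is absent
current = sample_payload.get("current", {})
summary = {
    "temp_c": current.get("temp_c"),
    "wind_kph": current.get("wind_kph"),
    "description": current.get("condition", {}).get("text"),
}
print(summary)
```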


#### Find the book name, price and stock availability from the Books to Scrape website and store them in a pandas dataframe.

In [76]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [80]:
# your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_books_to_dataframe():
    # URL of the website to scrape
    url = 'http://books.toscrape.com/'
    
    try:
        # Send a GET request to the website
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all book containers
        books = soup.find_all('article', class_='product_pod')
        
        # Lists to store the data
        book_names = []
        prices = []
        stock_availabilities = []
        
        # Extract information for each book
        for book in books:
            # Extract book name
            name = book.h3.a['title']
            book_names.append(name)
            
            # Extract price
            price = book.find('p', class_='price_color').text
            prices.append(price)
            
            # Extract stock availability
            stock = book.find('p', class_='instock availability').text.strip()
            stock_availabilities.append(stock)
        
        # Create a pandas DataFrame
        df = pd.DataFrame({
            'Book Name': book_names,
            'Price': prices,
            'Stock Availability': stock_availabilities
        })
        
        return df
        
    except requests.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return pd.DataFrame()
    except Exception as e:
        print(f"An error occurred: {e}")
        return pd.DataFrame()

# Main execution
if __name__ == "__main__":
    # Scrape the books and create DataFrame
    books_df = scrape_books_to_dataframe()
    
    # Check if DataFrame was created successfully
    if not books_df.empty:
        print("Books Data:")
        print("=" * 80)
        print(books_df.to_string(index=False))
        print("\n" + "=" * 80)
        print(f"Total books scraped: {len(books_df)}")
    else:
        print("No data was scraped. Please check the URL or your internet connection.")

Books Data:
                                                                                     Book Name  Price Stock Availability
                                                                          A Light in the Attic £51.77           In stock
                                                                            Tipping the Velvet £53.74           In stock
                                                                                    Soumission £50.10           In stock
                                                                                 Sharp Objects £47.82           In stock
                                                         Sapiens: A Brief History of Humankind £54.23           In stock
                                                                               The Requiem Red £22.65           In stock
                                            The Dirty Little Secrets of Getting Your Dream Job £33.34           In stock
       The Coming Wo
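
The prices come back as strings such as `£51.77`, which cannot be sorted or averaged numerically. A minimal sketch of converting them to floats with a regular expression; `parse_price` is a hypothetical helper, and the third sample imitates a currency sign garbled by a wrong decoding.

```python
import re


def parse_price(price_text):
    """Extract the first decimal number from a price string, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None


sample_prices = ["£51.77", "£53.74", "Â£50.10", "n/a"]
numeric_prices = [parse_price(text) for text in sample_prices]
print(numeric_prices)  # [51.77, 53.74, 50.1, None]
```

Applied to the dataframe, the same function could be used as `books_df["Price"].map(parse_price)` to add a numeric column.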

#### Display the first 100 books available on the homepage. Once again, collect the book name, price and stock availability.

***Hint:*** The total number of displayed books per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.

In [81]:
# This is the url you will scrape in this exercise
url = 'https://books.toscrape.com/catalogue/page-'
# This is how you will loop through each page:
number_of_pages = 100 // 20  # 100 books at 20 books per page
each_page_urls = []
for n in range(1, number_of_pages+1):
    link = url+str(n)+".html"
    each_page_urls.append(link)
    
each_page_urls

['https://books.toscrape.com/catalogue/page-1.html',
 'https://books.toscrape.com/catalogue/page-2.html',
 'https://books.toscrape.com/catalogue/page-3.html',
 'https://books.toscrape.com/catalogue/page-4.html',
 'https://books.toscrape.com/catalogue/page-5.html']
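
The page loop can be kept separate from the download-and-parse step, which makes the stopping rule easy to check without touching the network. In this sketch, `collect_first_n` and `fake_fetch_page` are hypothetical names; the fake fetcher returns 20 dummy titles per page in place of real scraped books.

```python
def collect_first_n(fetch_page, page_urls, limit):
    """Walk the pages in order and stop once `limit` items are collected."""
    collected = []
    for page_url in page_urls:
        for item in fetch_page(page_url):
            if len(collected) >= limit:
                return collected
            collected.append(item)
    return collected


def fake_fetch_page(page_url):
    """Stand-in for the real request + parsing: 20 dummy titles per page."""
    page_no = int(page_url.rsplit("-", 1)[1].split(".")[0])
    return [f"book {page_no}-{i}" for i in range(1, 21)]


page_urls = [
    f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 6)
]

books = collect_first_n(fake_fetch_page, page_urls, limit=100)
print(len(books), books[0], books[-1])  # 100 book 1-1 book 5-20
```

In the exercise, `fake_fetch_page` would be replaced by a function that requests the page and returns one entry per `product_pod` article.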

In [84]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_100_books():
    base_url = 'http://books.toscrape.com/'
    books_data = []
    page_number = 1
    
    try:
        while len(books_data) < 100:
            # Construct URL for current page
            if page_number == 1:
                url = base_url
            else:
                url = f"{base_url}catalogue/page-{page_number}.html"
            
            print(f"Scraping page {page_number}: {url}")
            
            # Send GET request with headers to avoid blocking
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            
            # Parse HTML content
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Find all book containers
            books = soup.find_all('article', class_='product_pod')
            
            print(f"Found {len(books)} books on page {page_number}")
            
            # If no books found, break the loop
            if not books:
                print("No more books found. Stopping.")
                break
            
            # Extract information for each book
            for book in books:
                if len(books_data) >= 100:
                    break
                
                try:
                    # Extract book name
                    name = book.h3.a['title']
                    
                    # Extract price
                    price_element = book.find('p', class_='price_color')
                    price = price_element.text if price_element else 'Price not found'
                    
                    # Extract stock availability
                    stock_element = book.find('p', class_='instock')
                    if not stock_element:
                        stock_element = book.find('p', class_='availability')
                    stock = stock_element.text.strip() if stock_element else 'Stock not found'
                    
                    books_data.append({
                        'Book Name': name,
                        'Price': price,
                        'Stock Availability': stock
                    })
                    
                    print(f"Added book {len(books_data)}: {name}")
                    
                except Exception as e:
                    print(f"Error processing a book: {e}")
                    continue
            
            # Check if there's a next page
            next_button = soup.find('li', class_='next')
            if not next_button and len(books_data) < 100:
                print("No more pages available. Stopping.")
                break
                
            page_number += 1
            
        # Create pandas DataFrame
        df = pd.DataFrame(books_data)
        return df
        
    except requests.RequestException as e:
        print(f"Error fetching the webpage: {e}")
        return pd.DataFrame()
    except Exception as e:
        print(f"An error occurred: {e}")
        return pd.DataFrame()

# Alternative simpler approach for just the first page (20 books)
def scrape_first_page():
    url = 'http://books.toscrape.com/'
    
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        books = soup.find_all('article', class_='product_pod')
        
        books_data = []
        for book in books:
            name = book.h3.a['title']
            price = book.find('p', class_='price_color').text
            stock = book.find('p', class_='instock availability').text.strip()
            
            books_data.append({
                'Book Name': name,
                'Price': price,
                'Stock Availability': stock
            })
        
        return pd.DataFrame(books_data)
        
    except Exception as e:
        print(f"Error: {e}")
        return pd.DataFrame()

# Main execution
if __name__ == "__main__":
    print("Starting web scraping...")
    
    # Try the full 100 books approach first
    books_df = scrape_100_books()
    
    # If that fails, try just the first page
    if books_df.empty:
        print("\nTrying to scrape just the first page...")
        books_df = scrape_first_page()
    
    # Display results
    if not books_df.empty:
        print("\n" + "=" * 100)
        print("BOOKS DATA:")
        print("=" * 100)
        
        # Set display options for better formatting
        pd.set_option('display.max_rows', None)
        pd.set_option('display.max_columns', None)
        pd.set_option('display.width', None)
        pd.set_option('display.max_colwidth', 40)
        
        print(books_df.to_string(index=False))
        print("\n" + "=" * 100)
        print(f"Total books scraped: {len(books_df)}")
        
    else:
        print("Failed to scrape any data. Possible reasons:")
        print("1. Website might be down or blocking requests")
        print("2. Internet connection issue")
        print("3. Website structure might have changed")

Starting web scraping...
Scraping page 1: http://books.toscrape.com/
Found 20 books on page 1
Added book 1: A Light in the Attic
Added book 2: Tipping the Velvet
Added book 3: Soumission
Added book 4: Sharp Objects
Added book 5: Sapiens: A Brief History of Humankind
Added book 6: The Requiem Red
Added book 7: The Dirty Little Secrets of Getting Your Dream Job
Added book 8: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Added book 9: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
Added book 10: The Black Maria
Added book 11: Starving Hearts (Triangular Trade Trilogy, #1)
Added book 12: Shakespeare's Sonnets
Added book 13: Set Me Free
Added book 14: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Added book 15: Rip it Up and Start Again
Added book 16: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Added book 17: Olio
Added book 18: Mesaerion: The Best Scien