# Basic Web Scraping Tutorial with Beautiful Soup
## Scraping Books Data from books.toscrape.com

### Step 1: Install Required Packages

In [22]:
!pip install requests beautifulsoup4 pandas



### Step 2: Import Libraries

In [23]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

print("Libraries imported successfully!")

Libraries imported successfully!


### Step 3: Get the Webpage

In [24]:
# URL of the books website
url = "http://books.toscrape.com/"

# Get the webpage
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Successfully retrieved the webpage!")
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


### Step 4: Extract Book Information

In [25]:
# Find all book articles
books = soup.find_all('article', class_='product_pod')
print(f"Found {len(books)} books on the page")

# Create a list to store book data
books_data = []

# Extract information from each book
for book in books:
    # Get title (stored in the 'title' attribute of the <a> tag inside <h3>)
    title = book.h3.a['title']

    # Get price (in a <p> tag with class 'price_color')
    price = book.find('p', class_='price_color').text.strip()

    # Get availability (in a <p> tag with class 'availability')
    availability = book.find('p', class_='availability').text.strip()

    # Get rating (in the class attribute of <p> tag with class 'star-rating')
    rating = book.find('p', class_='star-rating')['class'][1]

    # Store the data
    book_info = {
        'Title': title,
        'Price': price,
        'Availability': availability,
        'Rating': rating
    }

    books_data.append(book_info)

# Create DataFrame
df = pd.DataFrame(books_data)

# Display first few rows
print("\nFirst 5 books:")
display(df.head())

Found 20 books on the page

First 5 books:


|   | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock | Three |
| 1 | Tipping the Velvet | £53.74 | In stock | One |
| 2 | Soumission | £50.10 | In stock | One |
| 3 | Sharp Objects | £47.82 | In stock | Four |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock | Five |
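
To make clear which part of the markup each selector in the loop reads, here is a minimal offline sketch. The HTML string is a hand-written stand-in that mirrors the structure the code above relies on (an `article.product_pod` containing an `h3 > a[title]`, a `p.price_color`, a `p.availability`, and a `p.star-rating` whose second class is the rating word). It is an assumption for illustration, not a copy of the live page, and the `sample_*` names are introduced only for this example.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, modeled on the selectors used above)
sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_book = sample_soup.find("article", class_="product_pod")

# The full title lives in the <a> tag's 'title' attribute, not in its visible text
sample_title = sample_book.h3.a["title"]
sample_price = sample_book.find("p", class_="price_color").text.strip()
# 'class' comes back as a list, e.g. ['star-rating', 'Three']; index 1 is the rating word
sample_rating = sample_book.find("p", class_="star-rating")["class"][1]
# .strip() removes the newlines and indentation around the availability text
sample_availability = sample_book.find("p", class_="availability").text.strip()

print(sample_title, sample_price, sample_rating, sample_availability)
```

Because `class_` matches any one of an element's classes, `class_="availability"` still finds the `<p class="instock availability">` tag in this sample.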


### Step 5: Clean the Data

In [26]:
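# Clean price (remove '£' symbol and convert to float)
df['Price'] = df['Price'].str.replace('£', '', regex=False).astype(float)

# Clean availability (extract number of books).
# Note: the listing page only says "In stock" without a count, so there are
# no digits to extract and this column becomes NaN for every row.
df['Availability'] = df['Availability'].str.extract(r'(\d+)', expand=False)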

# Display cleaned data
print("Cleaned data:")
display(df.head())

Cleaned data:


|   | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | NaN | Three |
| 1 | Tipping the Velvet | 53.74 | NaN | One |
| 2 | Soumission | 50.10 | NaN | One |
| 3 | Sharp Objects | 47.82 | NaN | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | NaN | Five |
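
The `Rating` column still holds words such as `Three`. If numeric ratings are wanted, one possible extra cleaning step is a dictionary lookup with `Series.map`. The sketch below runs on a small hand-made frame with values taken from the rows shown above. The `rating_map` name and the `RatingNum` column are illustrative choices, not part of the original tutorial.

```python
import pandas as pd

# Small stand-in frame using rows from the output above
sample = pd.DataFrame({
    "Title": ["A Light in the Attic", "Tipping the Velvet", "Sharp Objects"],
    "Rating": ["Three", "One", "Four"],
})

# Hypothetical lookup table from rating words to integers
rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# map() replaces each word with its dictionary value; unknown words would become NaN
sample["RatingNum"] = sample["Rating"].map(rating_map)

print(sample)
```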


### Step 6: Save to CSV File

In [27]:
# Save to CSV
df.to_csv('books_data.csv', index=False)
print("\nData saved to 'books_data.csv'")

# Verify the saved data
print("\nVerifying saved data:")
saved_df = pd.read_csv('books_data.csv')
display(saved_df.head())


Data saved to 'books_data.csv'

Verifying saved data:


|   | Title | Price | Availability | Rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | NaN | Three |
| 1 | Tipping the Velvet | 53.74 | NaN | One |
| 2 | Soumission | 50.10 | NaN | One |
| 3 | Sharp Objects | 47.82 | NaN | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | NaN | Five |
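
Passing `index=False` matters for the round trip: without it, `to_csv` writes the row index as an unnamed first column, and `read_csv` then loads it back as a column called `Unnamed: 0`. A small in-memory sketch of the difference follows. It uses `io.StringIO` instead of a file on disk, and the `demo` frame and variable names are choices made here for illustration.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Title": ["Soumission", "Sharp Objects"], "Price": [50.10, 47.82]})

# With the index written out, the reloaded frame gains an extra column
with_index = io.StringIO()
demo.to_csv(with_index)  # index=True is the default
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, the columns round-trip unchanged
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```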


# Beginner-Friendly Web Scraping Problems [10 points each]

## Problem 1: Scrape Book Titles and Prices
**Objective**: Extract a list of book titles and their corresponding prices from [Books to Scrape](http://books.toscrape.com).

### Steps:
1. Navigate to the homepage of the website.
2. Identify all book titles and prices listed on the page.
3. Save the data into a CSV file with two columns: `Title` and `Price`.

---

## Problem 2: Scrape Top 10 Quotes from [Quotes to Scrape](http://quotes.toscrape.com)
**Objective**: Extract the top 10 quotes, their authors, and the associated tags from [Quotes to Scrape](http://quotes.toscrape.com).

### Steps:
1. Go to the homepage of the website.
2. Extract the text of the first 10 quotes, their authors, and the tags associated with each quote.
3. Save the data in a CSV file with three columns: `Quote`, `Author`, and `Tags`.

---

## Problem 3: Scrape Weather Data from [timeanddate.com](https://www.timeanddate.com/weather/)
**Objective**: Extract the current weather conditions (temperature, weather condition, and humidity) for a given city.

### Steps:
1. Visit [https://www.timeanddate.com/weather/](https://www.timeanddate.com/weather/).
2. Search for the weather data for a city (e.g., New York).
3. Extract the current temperature, weather description, and humidity levels.
4. Save the data in a structured format (e.g., a JSON or CSV file).


In [28]:
# Problem 1:
# Scrape Book Titles and Prices
# Objective: Extract a list of book titles and their corresponding prices from Books to Scrape.

# Steps:
# Navigate to the homepage of the website.
# Identify all book titles and prices listed on the page.
# Save the data into a CSV file with two columns: Title and Price.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url="https://books.toscrape.com/"
response=requests.get(url)


In [29]:
if response.status_code == 200:
    print("Successfully retrieved the webpage!")
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [30]:
books = soup.find_all('article', class_='product_pod')
book_data=[]
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text.strip()
    book_info = {
        'Title': title,
        'Price': price
    }
    book_data.append(book_info)

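# print(book_data)
df_book = pd.DataFrame(book_data)

# Save to CSV as the problem requires (the file name is an arbitrary choice)
df_book.to_csv('book_titles_prices.csv', index=False)
display(df_book.head())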



|   | Title | Price |
|---|---|---|
| 0 | A Light in the Attic | £51.77 |
| 1 | Tipping the Velvet | £53.74 |
| 2 | Soumission | £50.10 |
| 3 | Sharp Objects | £47.82 |
| 4 | Sapiens: A Brief History of Humankind | £54.23 |


In [31]:
# Problem 2: Scrape Top 10 Quotes from Quotes to Scrape
# Objective: Extract the top 10 quotes, their authors, and the associated tags from Quotes to Scrape.

# Steps:
# Go to the homepage of the website.
# Extract the text of the first 10 quotes, their authors, and the tags associated with each quote.
# Save the data in a CSV file with three columns: Quote, Author, and Tags.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url="https://quotes.toscrape.com/"
response=requests.get(url)

if response.status_code == 200:
    print("Successfully retrieved the webpage!")
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [32]:
quotes = soup.find_all('div', class_='quote')
quote_data=[]
# print(quotes)
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    quote_info = {
        'Quote': text,
        'Author': author,
        'Tags': tags
    }
    quote_data.append(quote_info)

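# print(quote_data)
df_quotes = pd.DataFrame(quote_data)

# Save the first 10 quotes to CSV as the problem requires; tags are joined into
# one comma-separated string per quote (the file name is an arbitrary choice)
df_quotes.head(10).assign(Tags=lambda d: d['Tags'].str.join(', ')).to_csv('quotes_data.csv', index=False)
print(df_quotes.head(10))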

                                               Quote             Author  \
0  “The world as we have created it is a process ...    Albert Einstein   
1  “It is our choices, Harry, that show what we t...       J.K. Rowling   
2  “There are only two ways to live your life. On...    Albert Einstein   
3  “The person, be it gentleman or lady, who has ...        Jane Austen   
4  “Imperfection is beauty, madness is genius and...     Marilyn Monroe   
5  “Try not to become a man of success. Rather be...    Albert Einstein   
6  “It is better to be hated for what you are tha...         André Gide   
7  “I have not failed. I've just found 10,000 way...   Thomas A. Edison   
8  “A woman is like a tea bag; you never know how...  Eleanor Roosevelt   
9  “A day without sunshine is like, you know, nig...       Steve Martin   

                                             Tags  
0        [change, deep-thoughts, thinking, world]  
1                            [abilities, choices]  
2  [inspirational,

In [33]:
# Problem 3: Scrape Weather Data from timeanddate.com
# Objective: Extract the current weather conditions (temperature, weather condition, and humidity) for a given city.

# Steps:
# Visit https://www.timeanddate.com/weather/.
# Search for the weather data for a city (e.g., New York).
# Extract the current temperature, weather description, and humidity levels.
# Save the data in a structured format (e.g., a JSON or CSV file).
import requests
import pandas as pd
from bs4 import BeautifulSoup

url="https://www.timeanddate.com/weather/"
response=requests.get(url)

if response.status_code == 200:
    print("Successfully retrieved the webpage!")
    # Create BeautifulSoup object
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Successfully retrieved the webpage!


In [34]:
weather = soup.find_all('div', class_='my-city__item')
weather_data = []
for w in weather:
    city = w.find('span', class_='my-city__city').text
    temp = w.find('span', class_='my-city__temp').text
    desc = w.find('span', class_='my-city__wtdesc').text
    # Note: humidity is not collected here, only city, temperature, and description
    weather_info = {
        'City': city,
        'Temperature': temp,
        'Description': desc,
    }

    weather_data.append(weather_info)

# print(weather_data)
df = pd.DataFrame(weather_data)
df.to_csv('weather_data.csv', index=False)
display(df.head())



|   | City | Temperature | Description |
|---|---|---|---|
| 0 | Taipei | 19 °C | Partly cloudy. |
| 1 | New York | 7 °C | Clear. |
| 2 | London | 9 °C | Partly sunny. |
| 3 | Tokyo | 10 °C | Cool. |
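
The problem statement also allows JSON as an output format. Here is a minimal sketch of that option using only the standard library. The records are copied from the output above. The `parse_celsius` helper, the `TemperatureC` key, and the output path in the temporary directory are hypothetical names introduced for this example.

```python
import json
import os
import re
import tempfile

# Records as they appear in the scraped output above
records = [
    {"City": "Taipei", "Temperature": "19 °C", "Description": "Partly cloudy."},
    {"City": "New York", "Temperature": "7 °C", "Description": "Clear."},
]

def parse_celsius(text):
    """Return the first integer in a string like '19 °C', or None if there is none."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None

# Add a numeric temperature next to the original text value
for record in records:
    record["TemperatureC"] = parse_celsius(record["Temperature"])

# Hypothetical output location; any writable path would do
out_path = os.path.join(tempfile.gettempdir(), "weather_data.json")

# ensure_ascii=False keeps the degree sign readable in the file
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to verify what was saved
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded)
```

Keeping the original `Temperature` string alongside the parsed number makes it easy to spot parsing mistakes later.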


**Pandas Assignment [10 points each]**

1. Create a DataFrame `df` with index labels from the dictionary `data` below, and display a summary of the basic information about the DataFrame and its data.

In [35]:
# some random data related to sports
data = {
    "Player": ["Virat Kohli", "Steve Smith", "Kane Williamson", "Joe Root", "Babar Azam"],
    "Team": ["India", "Australia", "New Zealand", "England", "Pakistan"],
    "Runs": [12169, 9113, 6554, 8932, 5089],
    "Matches": [262, 146, 161, 163, 99],
    "Average": [58.23, 53.97, 47.83, 50.29, 59.17]
}
df = pd.DataFrame(data, index=["1", "2", "3", "4", "5"])
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
# Summary statistics of the numeric columns
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 5
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Player   5 non-null      object 
 1   Team     5 non-null      object 
 2   Runs     5 non-null      int64  
 3   Matches  5 non-null      int64  
 4   Average  5 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 240.0+ bytes
               Runs     Matches    Average
count      5.000000    5.000000   5.000000
mean    8371.400000  166.200000  53.898000
std     2709.386517   59.453343   4.909269
min     5089.000000   99.000000  47.830000
25%     6554.000000  146.000000  50.290000
50%     8932.000000  161.000000  53.970000
75%     9113.000000  163.000000  58.230000
max    12169.000000  262.000000  59.170000


2. Return the first 5 rows of the DataFrame df.

In [36]:
print(df.head())

            Player         Team   Runs  Matches  Average
1      Virat Kohli        India  12169      262    58.23
2      Steve Smith    Australia   9113      146    53.97
3  Kane Williamson  New Zealand   6554      161    47.83
4         Joe Root      England   8932      163    50.29
5       Babar Azam     Pakistan   5089       99    59.17


3. Explain how to create a pandas DataFrame from a Python list.

In [37]:
data = [
    ["Virat Kohli", "India", 12169, 262, 58.23],
    ["Steve Smith", "Australia", 9113, 146, 53.97],
    ["Kane Williamson", "New Zealand", 6554, 161, 47.83],
    ["Joe Root", "England", 8932, 163, 50.29],
    ["Babar Azam", "Pakistan", 5089, 99, 59.17]
]

# Define column names
columns = ["Player", "Team", "Runs", "Matches", "Average"]

# Create the DataFrame
df_ = pd.DataFrame(data, columns=columns)

print(df_)

            Player         Team   Runs  Matches  Average
0      Virat Kohli        India  12169      262    58.23
1      Steve Smith    Australia   9113      146    53.97
2  Kane Williamson  New Zealand   6554      161    47.83
3         Joe Root      England   8932      163    50.29
4       Babar Azam     Pakistan   5089       99    59.17


4. How can we rename an index using the `rename()` method?

In [38]:
# Rename index labels
df.rename(index={"1": "Player 1", "2": "Player 2", "3": "Player 3","4": "Player 4","5": "Player 5"}, inplace=True)
print(df)

                   Player         Team   Runs  Matches  Average
Player 1      Virat Kohli        India  12169      262    58.23
Player 2      Steve Smith    Australia   9113      146    53.97
Player 3  Kane Williamson  New Zealand   6554      161    47.83
Player 4         Joe Root      England   8932      163    50.29
Player 5       Babar Azam     Pakistan   5089       99    59.17
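
`rename()` only changes the labels listed in the mapping. Without `inplace=True`, it returns a new DataFrame and leaves the original untouched. A small sketch on a two-row frame shows both points (the `small` and `renamed` names are illustrative only):

```python
import pandas as pd

small = pd.DataFrame(
    {"Player": ["Virat Kohli", "Steve Smith"], "Runs": [12169, 9113]},
    index=["1", "2"],
)

# Only "1" is in the mapping, so "2" keeps its label
renamed = small.rename(index={"1": "Player 1"})

print(renamed.index.tolist())
print(small.index.tolist())  # the original frame is unchanged
```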


5. You have a 2D NumPy array that you have converted into a pandas DataFrame. You want to assign specific index values to the rows of this DataFrame. If you pass a list of index values to the DataFrame, how does it affect the DataFrame, and how would you apply these index values?

In [39]:
import numpy as np

# Create a 2D NumPy array
array = np.array([[12169, 262, 58.23],
                  [9113, 146, 53.97],
                  [6554, 161, 47.83],
                  [8932, 163, 50.29],
                  [5089, 99, 59.17]])

# Define custom index labels
index_labels = ["Player 1", "Player 2", "Player 3", "Player 4", "Player 5"]

# Define column names
columns = ["Runs", "Matches", "Average"]

# Create the DataFrame and assign the custom index
df = pd.DataFrame(array, index=index_labels, columns=columns)

print(df)

             Runs  Matches  Average
Player 1  12169.0    262.0    58.23
Player 2   9113.0    146.0    53.97
Player 3   6554.0    161.0    47.83
Player 4   8932.0    163.0    50.29
Player 5   5089.0     99.0    59.17
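
Note that `Runs` and `Matches` print as floats (`12169.0`). A NumPy array holds a single dtype, so mixing integers with values like `58.23` makes the whole array `float64`. One way to restore integer columns is `astype`, sketched here on the first two rows (the `df_fixed` name is only for this example):

```python
import numpy as np
import pandas as pd

array = np.array([[12169, 262, 58.23],
                  [9113, 146, 53.97]])
print(array.dtype)  # the whole array shares one dtype

df_fixed = pd.DataFrame(array,
                        index=["Player 1", "Player 2"],
                        columns=["Runs", "Matches", "Average"])

# Cast the whole-number columns back to integers; Average stays float
df_fixed = df_fixed.astype({"Runs": "int64", "Matches": "int64"})
print(df_fixed.dtypes)
```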


6. You have a dictionary of data that you want to store as a pandas Series. After creating the Series and storing it in the df variable, you print it and observe that the data is represented in a one-dimensional linear format. Explain how to create this Series from the dictionary and describe the output you would expect when printing the Series.

In [40]:
data = {
    "Virat Kohli": 12169,
    "Steve Smith": 9113,
    "Kane Williamson": 6554,
    "Joe Root": 8932,
    "Babar Azam": 5089
}

# Create the Series
df = pd.Series(data)

# Print the Series
print(df)

Virat Kohli        12169
Steve Smith         9113
Kane Williamson     6554
Joe Root            8932
Babar Azam          5089
dtype: int64


7. You create a dictionary and store it as a DataFrame in the df variable. After printing, the data appears as 2-dimensional rows and columns. How would you create this DataFrame from the dictionary, and what does the output look like?

In [41]:
data = {
    "Player": ["Virat Kohli", "Steve Smith", "Kane Williamson", "Joe Root", "Babar Azam"],
    "Team": ["India", "Australia", "New Zealand", "England", "Pakistan"],
    "Runs": [12169, 9113, 6554, 8932, 5089],
    "Matches": [262, 146, 161, 163, 99],
    "Average": [58.23, 53.97, 47.83, 50.29, 59.17]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

            Player         Team   Runs  Matches  Average
0      Virat Kohli        India  12169      262    58.23
1      Steve Smith    Australia   9113      146    53.97
2  Kane Williamson  New Zealand   6554      161    47.83
3         Joe Root      England   8932      163    50.29
4       Babar Azam     Pakistan   5089       99    59.17
