**Reasoning**:
I will start by fetching the HTML content from 'http://books.toscrape.com/' to verify the connection and inspect the structure of the page as per the instructions.



In [1]:
import requests

url = 'http://books.toscrape.com/'

# Fetch the page; the timeout keeps the call from hanging if the server never responds
response = requests.get(url, timeout=10)

print(f'Status Code: {response.status_code}')
if response.status_code == 200:
    print('Successfully connected to the website.')
    print('\nFirst 500 characters of the HTML content:')
    print(response.text[:500])
else:
    print(f'Failed to retrieve the page. Status code: {response.status_code}')

Status Code: 200
Successfully connected to the website.

First 500 characters of the HTML content:
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /


**Reasoning**:
Now that the connection is verified, I will parse the HTML with BeautifulSoup and identify the tags and classes that hold book titles and prices, as the final part of the subtask requires.



## Extract Data with BeautifulSoup




In [2]:
from bs4 import BeautifulSoup


soup = BeautifulSoup(response.text, 'html.parser')

first_book = soup.find('article', class_='product_pod')

if first_book:
    title = first_book.h3.a['title']
    price = first_book.find('p', class_='price_color').text
    print('Identified Tag Structure:')
    print('- Container: <article class="product_pod">')
    print('- Title Tag: <h3><a> (attribute: title)')
    print('- Price Tag: <p class="price_color">')
    print('\nSample extraction:')
    print(f'Title: {title}')
    print(f'Price: {price}')
else:
    print('Could not find the expected HTML elements.')

Identified Tag Structure:
- Container: <article class="product_pod">
- Title Tag: <h3><a> (attribute: title)
- Price Tag: <p class="price_color">

Sample extraction:
Title: A Light in the Attic
Price: Â£51.77
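
The `Â£` in the sample price is most likely an encoding artifact rather than page content. The page declares UTF-8 in its `<meta>` tag, but `response.text` appears to have been decoded as ISO-8859-1 (Latin-1), which requests falls back to for `text/*` responses when the HTTP `Content-Type` header names no charset. The sketch below reproduces the effect with plain strings under that assumption; it does not touch the network.

```python
# '£' (U+00A3) is two bytes in UTF-8: 0xC2 0xA3.
utf8_bytes = '£51.77'.encode('utf-8')

# Decoding those bytes as Latin-1 maps each byte to its own character,
# so 0xC2 becomes 'Â' and 0xA3 becomes '£'.
garbled = utf8_bytes.decode('latin-1')
print(garbled)

# Reversing the wrong decode recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)
```

If that diagnosis holds, passing the raw bytes in `response.content` to BeautifulSoup, or setting `response.encoding = 'utf-8'` before reading `response.text`, should yield `£51.77` directly.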


**Reasoning**:
I will extract the titles and prices of all books on the page by iterating through the book containers identified in the previous inspection step.



## Clean and Structure Data




In [3]:

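book_titles = []
book_prices = []

# Each book on the page sits in its own <article class="product_pod">
books = soup.find_all('article', class_='product_pod')

for book in books:
    # The full title is in the link's title attribute; the visible link text may be truncated
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text

    book_titles.append(title)
    book_prices.append(price)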


print(f'Successfully extracted data for {len(book_titles)} books.')
print('First 5 entries:')
for t, p in zip(book_titles[:5], book_prices[:5]):
    print(f'- {t}: {p}')

Successfully extracted data for 20 books.
First 5 entries:
- A Light in the Attic: Â£51.77
- Tipping the Velvet: Â£53.74
- Soumission: Â£50.10
- Sharp Objects: Â£47.82
- Sapiens: A Brief History of Humankind: Â£54.23
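
The loop above assumes every container has both a titled link and a price tag, and it would raise an error on the first container that lacks either. A more defensive variant might look like the sketch below. The `extract_books` helper and the `SAMPLE_HTML` snippet are hypothetical and only imitate the markup identified earlier; they are not taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the page's product markup.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="b.html">Link without a title attribute</a></h3>
</article>
"""


def extract_books(html):
    """Return one {'title', 'price'} dict per complete product container."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for book in soup.find_all('article', class_='product_pod'):
        heading = book.find('h3')
        link = heading.find('a') if heading else None
        price_tag = book.find('p', class_='price_color')
        # Skip containers missing the link, its title attribute, or the price.
        if link is None or not link.get('title') or price_tag is None:
            continue
        records.append({
            'title': link['title'],
            'price': price_tag.get_text(strip=True),
        })
    return records


books_found = extract_books(SAMPLE_HTML)
print(books_found)
```

Returning one record per book, instead of two parallel lists, also keeps each title paired with its price if a container is skipped.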


**Reasoning**:
I will import pandas, create a DataFrame from the extracted lists, clean the 'Price' column by removing currency symbols, and verify the data types as instructed.



In [4]:
import pandas as pd

# Create a DataFrame from the extracted lists
df = pd.DataFrame({
    'Title': book_titles,
    'Price': book_prices
})

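# Clean the Price column: strip everything except digits and the decimal point
# (this removes the mis-decoded 'Â£' prefix, or a plain '£'), then convert to float
df['Price'] = df['Price'].str.replace(r'[^\d.]', '', regex=True).astype(float)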

print("First 5 rows of the DataFrame:")
print(df.head())
print("\nDataFrame Info:")
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

First 5 rows of the DataFrame:
                                   Title  Price
0                   A Light in the Attic  51.77
1                     Tipping the Velvet  53.74
2                             Soumission  50.10
3                          Sharp Objects  47.82
4  Sapiens: A Brief History of Humankind  54.23

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Title   20 non-null     object
 1   Price   20 non-null     float64
dtypes: float64(1), object(1)
memory usage: 452.0+ bytes
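
The same cleaning step can be expressed on plain strings, which makes it easy to check individual values before they reach the DataFrame. The sketch below assumes each price string contains exactly one number with an optional decimal part; `parse_price` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_price(raw):
    """Extract the first decimal number from a price string as a float."""
    match = re.search(r'\d+(?:\.\d+)?', raw)
    if match is None:
        raise ValueError(f'no numeric price found in {raw!r}')
    return float(match.group())


# Works whether or not the currency symbol was mis-decoded.
print(parse_price('Â£51.77'))
print(parse_price('£53.74'))
```

Raising on a string with no number, instead of returning a default, surfaces unexpected markup early.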


**Reasoning**:
I will save the cleaned DataFrame 'df' to a CSV file named 'books_data.csv' without the index and verify the file creation as per the subtask instructions.



In [5]:

import os

file_path = 'books_data.csv'
df.to_csv(file_path, index=False)

# Confirm the file now exists on disk
if os.path.exists(file_path):
    print(f'Successfully exported the dataset to {file_path}.')
    print('\nFiles in current directory:')
    !ls {file_path}
else:
    print('Failed to export the CSV file.')

Successfully exported the dataset to books_data.csv.

Files in current directory:
books_data.csv
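
Checking that the file exists confirms the export ran, but not that its contents are intact. One way to go a step further is to read the file back and compare it with what was written. The sketch below does this with the standard `csv` module in a temporary directory; the two rows are stand-ins for the scraped data, and the writer only approximates what `df.to_csv(index=False)` produces.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped data.
rows = [
    {'Title': 'A Light in the Attic', 'Price': '51.77'},
    {'Title': 'Sapiens: A Brief History of Humankind', 'Price': '54.23'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'books_data.csv')

    # Write a header row followed by the data rows.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['Title', 'Price'])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back and compare it with what was written.
    with open(path, newline='', encoding='utf-8') as f:
        loaded = list(csv.DictReader(f))

round_trip_ok = loaded == rows
print(f'Rows read back: {len(loaded)}, round trip OK: {round_trip_ok}')
```

In the notebook itself, the equivalent check would be to load `books_data.csv` with `pd.read_csv` and compare its shape and dtypes with `df`.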
