# Mini-Project 2: Scraping “Scrape This Site” - Frames Page

## Tools and Libraries Required

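- selenium for navigating and interacting with the web page.
- beautifulsoup4 for parsing and extracting data from HTML.
- chromedriver_autoinstaller for automated ChromeDriver setup.
- requests for downloading the turtle images.
- pandas for tabulating the results and exporting them to CSV.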


## Task

- Initialize the Selenium WebDriver.
- Navigate to the web page: https://www.scrapethissite.com/pages/frames/
- Since the page contains frames, identify each frame and switch to it to access its content.
- Use Selenium to navigate through frames and extract necessary data.
- After switching to a frame, use BeautifulSoup to parse and extract data.
- Focus on extracting specific information like text, links, or any other relevant content from each frame.
- Organize the extracted data into a structured format such as a list of dictionaries or a pandas DataFrame.
- Save the data to a CSV file for further analysis or use.
- Properly close the Selenium WebDriver session.
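
The frame-parsing step can be tried offline before driving a browser. The sketch below is a minimal illustration, not the notebook's actual cell: `parse_turtle_page` and `SAMPLE_HTML` are hypothetical names, and the sample markup is only an assumption about what a detail page looks like (an `h3` heading, an image with class `turtle-image`, and a `p` description). In the real workflow the HTML string would come from `driver.page_source`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real page markup may differ.
SAMPLE_HTML = """
<div class="turtle-family-detail">
  <h3 class="family-name">Cheloniidae</h3>
  <img class="turtle-image" src="https://example.com/images/sea-turtle.jpg">
  <p class="lead">
      The Cheloniidae family of turtles is more commonly known as "Sea turtles".
  </p>
</div>
"""


def parse_turtle_page(html):
    """Return one record (a dict) for a single detail page."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3")
    image = soup.find(class_="turtle-image")
    paragraph = soup.find("p")
    return {
        "family_name": heading.get_text(strip=True) if heading else None,
        # .get() avoids a KeyError when the tag has no src attribute
        "file_path": image.get("src") if image else None,
        # split/join collapses newlines and indentation into single spaces
        "description": " ".join(paragraph.get_text().split()) if paragraph else None,
    }


record = parse_turtle_page(SAMPLE_HTML)
print(record)
```

Keeping the parsing in a function that takes a plain HTML string makes it possible to check the selectors without opening a browser, and missing elements yield `None` instead of raising.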


In [1]:
# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from selenium.common.exceptions import TimeoutException
# from bs4 import BeautifulSoup
# import os
# import requests
# import pandas as pd

In [None]:
# options = webdriver.ChromeOptions()
# options.add_argument("--start-minimized")  # Open browser in minimized mode
# driver = webdriver.Chrome(options=options)

# # navigate to the Scrapethissite page
# url = "https://www.scrapethissite.com/pages/frames/"
# driver.get(url)

# # wait for the "main" frame to load
# wait = WebDriverWait(driver, 30)

# try:
#     # Wait for the iframe to be present and switch to it
#     wait.until(EC.presence_of_element_located((By.ID, "iframe")))
#     driver.switch_to.frame("iframe")
#     print("Successfully accessed the iframe.")

#     # Find all buttons within the iframe
#     buttons = driver.find_elements(By.CSS_SELECTOR, 'a.btn.btn-default.btn-xs')
#     print(f'Found {len(buttons)} elements.')

#     # Create a list of links from the buttons
#     button_links = [button.get_attribute("href") for button in buttons]

# except Exception as e:
#     print(f"An error occurred: {e}")


# # Initialize a list to hold the dictionaries
# results = []

# # Iterate over each link in the button_links list
# for link in button_links:
#     driver.get(link)  # Navigate to the link
    
#     # Wait for the h3, img, and p elements to be present
#     try:
#         wait.until(EC.presence_of_element_located((By.TAG_NAME, 'h3')))
#         h3_element = driver.find_element(By.TAG_NAME, 'h3')
        
#         # Extract family name from h3
#         family_name = h3_element.text
        
#         # Find the img element and extract its src attribute
#         img_element = driver.find_element(By.CLASS_NAME, 'turtle-image')
#         file_name = img_element.get_attribute('src')

        
#         # Find the first paragraph element and extract its text
#         p_element = driver.find_element(By.TAG_NAME, 'p')
#         description = p_element.text
        
#         # Store in a dictionary
#         results.append({
#             'family_name': family_name,
#             'file_path': file_name,
#             'description': description
#         })
        
#         # Print family name for confirmation
#         print(f"Family Name: {family_name}, File Path: {file_name}, Description: {description}")

#         # Download the image directly into the turtle_images folder
#         image_response = requests.get(file_name)
        
#         if image_response.status_code == 200:
#             # Save the image in the existing 'turtle_images' folder without renaming
#             image_path = os.path.join('turtle_images', os.path.basename(file_name))  # Use original filename
            
#             with open(image_path, 'wb') as f:
#                 f.write(image_response.content)
#             print(f"Downloaded: {image_path}")
#         else:
#             print(f"Failed to download image from {file_name}")

#     except Exception as e:
#         print(f"An error occurred while accessing {link}: {e}")


# # Shut down the driver
# driver.quit()

# # Create a DataFrame from the results list
# df = pd.DataFrame(results)
# display(df)

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("--start-minimized")  # Open browser in minimized mode
driver = webdriver.Chrome(options=options)

# Navigate to the Scrapethissite page
url = "https://www.scrapethissite.com/pages/frames/"
driver.get(url)

# Create an explicit wait (up to 30 seconds) reused for the iframe and each detail page
wait = WebDriverWait(driver, 30)

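# Default to an empty list so the loop below is skipped if the iframe lookup fails
button_links = []

try: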
    # Wait for the iframe to be present and switch to it
    wait.until(EC.presence_of_element_located((By.ID, "iframe")))
    driver.switch_to.frame("iframe")
    print("Successfully accessed the iframe.")

    # Find all buttons within the iframe
    buttons = driver.find_elements(By.CSS_SELECTOR, 'a.btn.btn-default.btn-xs')
    print(f'Found {len(buttons)} elements.')

    # Create a list of links from the buttons
    button_links = [button.get_attribute("href") for button in buttons]

except Exception as e:
    print(f"An error occurred: {e}")

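# Initialize a list to hold the dictionaries
results = []

# Make sure the download folder exists before any image is saved
os.makedirs('turtle_images', exist_ok=True)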

# Iterate over each link in the button_links list
for link in button_links:
    driver.get(link)  # Navigate to the link
    
    # Wait for the page content to load and extract HTML
    try:
        wait.until(EC.presence_of_element_located((By.TAG_NAME, 'h3')))
        html_content = driver.page_source  # Get the HTML content of the page
        
        # Use BeautifulSoup to parse the HTML content
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Extract family name from h3 using BeautifulSoup
        family_name = soup.find('h3').text
        
        # Find the img element and extract its src attribute using BeautifulSoup
        img_element = soup.find(class_='turtle-image')
        file_name = img_element['src'] if img_element else None
        
        # Find the first paragraph element and extract its text using BeautifulSoup
        p_element = soup.find('p')
        description = p_element.text if p_element else None
        
        # Store in a dictionary
        results.append({
            'family_name': family_name,
            'file_path': file_name,
            'description': description
        })
        
        # Print the extracted fields for confirmation
        print(f"Family Name: {family_name}, File Path: {file_name}, Description: {description}")

        # Download the image directly into the turtle_images folder
        if file_name:
            image_response = requests.get(file_name, timeout=30)  # Avoid hanging on a stalled download
            
            if image_response.status_code == 200:
                # Save the image in the existing 'turtle_images' folder without renaming
                image_path = os.path.join('turtle_images', os.path.basename(file_name))  # Use original filename
                
                with open(image_path, 'wb') as f:
                    f.write(image_response.content)
                print(f"Downloaded: {image_path}")
            else:
                print(f"Failed to download image from {file_name}")

    except Exception as e:
        print(f"An error occurred while accessing {link}: {e}")

# Shut down the driver
driver.quit()

# Create a DataFrame from the results list
df = pd.DataFrame(results)
display(df)


# Export as a CSV file
df.to_csv('turtles.csv', index=False)

Successfully accessed the iframe.
Found 14 elements.
Family Name: Carettochelyidae, File Path: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Carettochelys_insculpta.jpg/200px-Carettochelys_insculpta.jpg, Description: 
                        The Carettochelyidae family of turtles — more commonly known as "Pig-nosed turtle" — were first discovered in 1887 by Boulenger.
                    
Downloaded: turtle_images\200px-Carettochelys_insculpta.jpg
Family Name: Cheloniidae, File Path: https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/GreenSeaTurtle-2.jpg/200px-GreenSeaTurtle-2.jpg, Description: 
                        The Cheloniidae family of turtles — more commonly known as "Sea turtles" — were first discovered in 1811 by Oppel.
                    
Downloaded: turtle_images\200px-GreenSeaTurtle-2.jpg
Family Name: Chelydridae, File Path: https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Alligator_snapping_turtle.jpg/200px-Alligator_snapping_turtle.jpg, Des
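
Each `Downloaded:` line above shows an image saved under its original file name. `os.path.basename` works on these URLs because they carry no query string; a slightly more defensive sketch parses the URL first. `local_image_path` is a hypothetical helper, and the `turtle_images` folder name is taken from the cell above.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def local_image_path(url, folder="turtle_images"):
    """Build a local path from the last segment of the URL path."""
    # urlsplit separates the path from any query string or fragment
    url_path = urlsplit(url).path
    # URL paths always use forward slashes, so use posixpath here
    file_name = unquote(posixpath.basename(url_path))
    return os.path.join(folder, file_name)


url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/7/71/"
    "GreenSeaTurtle-2.jpg/200px-GreenSeaTurtle-2.jpg"
)
print(local_image_path(url))
```

The path separator in the result depends on the operating system, which is why the output above shows a backslash.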

|  | family_name | file_path | description |
|---|---|---|---|
| 0 | Carettochelyidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Carettochelyidae... |
| 1 | Cheloniidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Cheloniidae fami... |
| 2 | Chelydridae | https://upload.wikimedia.org/wikipedia/commons... | \n The Chelydridae fami... |
| 3 | Dermatemydidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Dermatemydidae f... |
| 4 | Dermochelyidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Dermochelyidae f... |
| 5 | Emydidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Emydidae family ... |
| 6 | Geoemydidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Geoemydidae fami... |
| 7 | Kinosternidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Kinosternidae fa... |
| 8 | Platysternidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Platysternidae f... |
| 9 | Testudinidae | https://upload.wikimedia.org/wikipedia/commons... | \n The Testudinidae fam... |
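
The `description` values above still begin with the newline and indentation carried over from the page markup. If cleaner rows are wanted in the CSV, one option is to normalize whitespace before writing. The sketch below uses the standard-library `csv` module on a list of dictionaries rather than pandas; `clean_text`, `write_rows`, and the sample row are illustrative assumptions, not part of the notebook.

```python
import csv
import io


def clean_text(value):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(value.split()) if isinstance(value, str) else value


def write_rows(rows, handle):
    """Write a list of dicts as CSV, cleaning every string field first."""
    fieldnames = ["family_name", "file_path", "description"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: clean_text(row.get(name)) for name in fieldnames})


# Illustrative row only; real rows come from the scraping loop.
sample_rows = [
    {
        "family_name": "Cheloniidae",
        "file_path": "https://example.com/images/sea-turtle.jpg",
        "description": "\n      Sea turtles,\n      sample text.\n   ",
    },
]

# An in-memory buffer keeps the example side-effect free; for a real file use
# open("turtles.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

With pandas, a comparable cleanup could be applied to the `description` column before calling `to_csv`; the standard-library version is shown here only because it runs without extra dependencies.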
