# **Web Scraping AI Images**
---
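**Importing Required Libraries**

We start by importing the necessary libraries:
- `selenium`: For automating the Chrome browser.
- `requests`: To download images.
- `beautifulsoup4`: For parsing HTML and extracting data from web pages (optional here, since the code cells below locate elements with Selenium directly).
- `Pillow` (`PIL`) and `imagehash`: For opening downloaded images and hashing them to skip duplicates (used in the Gencraft section).
- `time`: To introduce delays during the scraping process.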

In the following sections, we will walk through the web scraping process for collecting AI-generated images from the specified sources using Selenium and requests.

---
**Running the Web Scraping Code**

To run the web scraping code for AI image collection, follow these instructions:

1. **Prerequisites**:
   - Ensure you have Python installed on your local machine.
   - Install the necessary Python libraries if you haven't already. You can use `pip` to install them (`Pillow` and `imagehash` are needed only for the Gencraft section):
     ```bash
     pip install selenium requests beautifulsoup4 Pillow imagehash
     ```

2. **WebDriver Setup**:
   - Download the appropriate WebDriver for your browser. In this code, we are using the Chrome WebDriver. Make sure it matches your Chrome browser version.
   - Download the Chrome WebDriver from [ChromeDriver Downloads](https://sites.google.com/chromium.org/driver/).
   - Place the WebDriver executable in a directory that's included in your system's PATH.


Ensure that you have the required WebDriver installed and set up correctly on your local machine to execute the web scraping code successfully.
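
One way to confirm the setup before running the scrapers is a small check script. The sketch below is only an illustration: the helper name `check_environment` and the module list are assumptions based on the imports used in this notebook, and `chromedriver` assumes the default executable name. Selenium 4.6 and later include Selenium Manager, which can usually obtain a matching driver on its own, so a missing `chromedriver` on PATH is not necessarily a problem with those versions.

```python
import importlib.util
import shutil

# Import names (not pip names) used by the cells in this notebook.
REQUIRED_MODULES = ["selenium", "requests", "bs4", "PIL", "imagehash"]


def check_environment(modules=REQUIRED_MODULES, driver_name="chromedriver"):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # shutil.which returns the executable's path if it is on PATH, else None.
    status[driver_name] = shutil.which(driver_name) is not None
    return status


for name, found in check_environment().items():
    print(f"{name:14s} {'found' if found else 'MISSING'}")
```

Any line reported as `MISSING` points to a package to install with `pip` or a driver to place on PATH.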


**Code For Web Scraping AI Images from Gencraft**

```python
# Import necessary libraries
import io  # In-memory byte streams, so PIL can open downloaded bytes
import os  # Creating the output folder
import time  # Pausing while the page loads more images

import imagehash  # Perceptual hashing, used to skip duplicate images
import requests  # HTTP requests for downloading images
from PIL import Image  # Opening and decoding images
from selenium import webdriver  # Browser automation
from selenium.webdriver.common.by import By  # Element locator strategies

OUTPUT_DIR = "AI_Generated"
MAX_IMAGES = 1000
os.makedirs(OUTPUT_DIR, exist_ok=True)  # open() fails if the folder does not exist

# Create a new instance of Google Chrome and open the website
driver = webdriver.Chrome()
driver.get("https://gencraft.com/explore")

# Class string used to locate the gallery images; it may change if the site is updated
target_class = "w-full h-full bg-blue-200"

downloaded_images = 0  # Counter for saved images
unique_image_hashes = set()  # Hashes of images already saved
seen_urls = set()  # URLs already requested, so each scroll only fetches new ones
idle_scrolls = 0  # Consecutive scrolls that produced no new image

try:
    # Scroll to load images until MAX_IMAGES unique images are saved
    while downloaded_images < MAX_IMAGES and idle_scrolls < 10:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)  # Wait for 2 seconds to let the images load

        # Find all img elements with the target class
        img_elements = driver.find_elements(By.XPATH, f'//img[contains(@class, "{target_class}")]')
        saved_before = downloaded_images

        for img_element in img_elements:
            src = img_element.get_attribute("src")
            if not src or src in seen_urls:
                continue
            seen_urls.add(src)

            response = requests.get(src, timeout=30)
            if response.status_code != 200:
                continue

            image = Image.open(io.BytesIO(response.content))
            image_hash = str(imagehash.average_hash(image))  # Named to avoid shadowing the built-in hash()

            # Save the image only if its hash has not been seen before
            if image_hash not in unique_image_hashes:
                with open(f"{OUTPUT_DIR}/image_{downloaded_images}.jpg", "wb") as file:
                    file.write(response.content)
                unique_image_hashes.add(image_hash)
                downloaded_images += 1
                if downloaded_images >= MAX_IMAGES:
                    break

        # Safeguard (the limit of 10 is arbitrary): stop if scrolling no longer yields new images
        idle_scrolls = idle_scrolls + 1 if downloaded_images == saved_before else 0
finally:
    driver.quit()  # Close the browser even if an error occurs
```
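
The Gencraft code relies on `imagehash.average_hash` to skip duplicates. As a rough illustration of the idea behind an average hash, the sketch below marks each pixel of a small grayscale grid as brighter or darker than the grid's mean. It is a simplification, not the library's implementation: it skips the resize-to-a-small-grid step, and the function names and the tiny 2x2 "images" are made up for the example.

```python
def average_hash_bits(gray_grid):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean, else 0.

    gray_grid is a list of rows of grayscale values (0-255).
    """
    pixels = [value for row in gray_grid for value in row]
    mean = sum(pixels) / len(pixels)
    return [1 if value > mean else 0 for value in pixels]


def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 2x2 grayscale "images"
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]  # Same picture, slightly brighter
inverted = [[200, 10], [30, 220]]  # A different picture

print(average_hash_bits(original))  # [0, 1, 1, 0]
print(hamming_distance(average_hash_bits(original), average_hash_bits(brighter)))  # 0
print(hamming_distance(average_hash_bits(original), average_hash_bits(inverted)))  # 4
```

The scraping code treats two images as duplicates only when their hash strings are identical. A looser variant would compare hashes by Hamming distance and skip images below a chosen threshold; `imagehash` supports this by subtracting one hash object from another.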


**Code For Web Scraping AI Images from Pixabay**

```python
# Import necessary libraries
import os  # Creating the output folder
import time  # Pausing while the page loads more images

import requests  # HTTP requests for downloading images
from selenium import webdriver  # Browser automation
from selenium.webdriver.common.by import By  # Element locator strategies

OUTPUT_DIR = "AI_50000"
TARGET_IMAGES = 50000
os.makedirs(OUTPUT_DIR, exist_ok=True)  # open() fails if the folder does not exist

# Create a new instance of Google Chrome
driver = webdriver.Chrome()

# Starting page and image count. The non-zero values appear to resume an earlier,
# interrupted run; use page_number = 1 and downloaded_images = 0 to start from scratch.
page_number = 436
downloaded_images = 8523

try:
    while downloaded_images < TARGET_IMAGES:
        # Open the search results for the current page number
        url = f"https://pixabay.com/images/search/ai%20generated/?pagi={page_number}"
        driver.get(url)

        # Scroll down repeatedly until new images stop loading
        prev_image_count = 0
        while True:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(2)  # Wait for 2 seconds to let new images load

            # Anchors wrapping each result; this generated class name may change over time
            anchor_elements = driver.find_elements(By.XPATH, '//a[@class="link--WHWzm"]')

            # Stop scrolling once the number of results no longer grows
            curr_image_count = len(anchor_elements)
            if curr_image_count == prev_image_count:
                break
            prev_image_count = curr_image_count

        # An empty page means the results ran out (or the selector no longer matches)
        if not anchor_elements:
            break

        # Download the image inside each anchor
        for anchor_element in anchor_elements:
            # find_elements returns [] instead of raising when the anchor has no <img>
            img_elements = anchor_element.find_elements(By.TAG_NAME, "img")
            src = img_elements[0].get_attribute("src") if img_elements else None
            if not src:
                continue
            response = requests.get(src, timeout=30)
            if response.status_code == 200:
                with open(f"{OUTPUT_DIR}/image_{downloaded_images}.jpg", "wb") as file:
                    file.write(response.content)
                downloaded_images += 1
                if downloaded_images >= TARGET_IMAGES:
                    break

        page_number += 1  # Move on to the next results page
finally:
    driver.quit()  # Close the browser even if an error occurs
```
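
The Pixabay code starts from `page_number = 436` and `downloaded_images = 8523`, which looks like a manual resume point after an interrupted run (an assumption, since the notebook does not say so). A minimal sketch of deriving the image counter from the output folder instead is shown below. The helper name `next_image_index` is hypothetical, and it assumes the `image_N.jpg` naming used above; the page number would still have to be recorded separately.

```python
import os
import re

# Matches the file names written by the scraping code, e.g. "image_8522.jpg".
IMAGE_NAME = re.compile(r"^image_(\d+)\.jpg$")


def next_image_index(folder):
    """Return one more than the highest N among files named image_N.jpg, or 0."""
    if not os.path.isdir(folder):
        return 0
    indices = []
    for name in os.listdir(folder):
        match = IMAGE_NAME.match(name)
        if match:
            indices.append(int(match.group(1)))
    return max(indices) + 1 if indices else 0


# Usage: continue numbering where the previous run stopped.
downloaded_images = next_image_index("AI_50000")
print(downloaded_images)
```

Because the counter comes from the files actually on disk, a restarted run does not overwrite earlier downloads even if the hardcoded value is out of date.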


**Code for Web Scraping AI Images from DeviantArt for Validation Set**

```python
# Import necessary libraries
import os  # Creating the output folder
import time  # Pausing while the page loads more images

import requests  # HTTP requests for downloading images
from selenium import webdriver  # Browser automation
from selenium.webdriver.common.by import By  # Element locator strategies

OUTPUT_DIR = "AI_Train"  # Folder name kept from the original code
LAST_PAGE = 155  # Pages 1 through 155 (the original condition was page_number < 156)
os.makedirs(OUTPUT_DIR, exist_ok=True)  # open() fails if the folder does not exist

# Create a new instance of Google Chrome
driver = webdriver.Chrome()

downloaded_images = 0  # Counter used to number the saved files

try:
    for page_number in range(1, LAST_PAGE + 1):
        # Open the gallery at the current page number
        url = f"https://www.deviantart.com/sono2000/gallery?page={page_number}"
        driver.get(url)

        # Scroll to the bottom once and wait for the thumbnails to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)  # Wait for 2 seconds to let new images load

        # Find all thumbnail images within the specified div structure
        img_elements = driver.find_elements(By.XPATH, '//div[@data-testid="thumb"]/img')

        # Download each image
        for img_element in img_elements:
            src = img_element.get_attribute("src")
            if not src:
                continue
            response = requests.get(src, timeout=30)
            if response.status_code == 200:
                with open(f"{OUTPUT_DIR}/image_{downloaded_images}.jpg", "wb") as file:
                    file.write(response.content)
                downloaded_images += 1
finally:
    driver.quit()  # Close the browser even if an error occurs
```
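
The DeviantArt code scrolls once per page, while the Pixabay code scrolls until the element count stops growing but has no upper bound on the number of scrolls. A bounded version of that pattern could look like the sketch below. The function name `scroll_until_stable` and its parameters are hypothetical; it takes plain callables so the logic can be tried without a browser.

```python
import time


def scroll_until_stable(scroll, count_items, pause=2.0, max_rounds=30):
    """Call scroll() until count_items() stops changing or max_rounds is reached.

    Returns the last observed count, or None if max_rounds is 0.
    """
    previous = None
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # Give lazily loaded content time to appear
        current = count_items()
        if current == previous:
            break
        previous = current
    return previous


# With Selenium, the callables could be (assuming `driver` and `By` from the cells above):
# scroll_until_stable(
#     scroll=lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)"),
#     count_items=lambda: len(driver.find_elements(By.XPATH, '//div[@data-testid="thumb"]/img')),
# )
```

The `max_rounds` cap keeps the loop from running indefinitely on pages that keep loading content for as long as you scroll.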


# Conclusion

In this notebook, we walked through three web scraping code snippets that collect AI-generated images from online sources. Web scraping is a powerful technique for data collection, and here it was used to build datasets for AI image classification projects.

We began by introducing the code for web scraping AI images from [Gencraft](https://gencraft.com/explore), followed by a code snippet for gathering AI images from [Pixabay](https://pixabay.com/images/search/ai%20generated/). Both sections use Selenium to automate the browser and locate images, and requests to download them.

We then added a code snippet for collecting AI images from [DeviantArt](https://www.deviantart.com/sono2000/gallery). This section shows how to page through a gallery, extract image elements, and store the downloaded images locally.

The images collected by these scrapers were used in a project that classifies images as AI-generated or human-captured. We used Convolutional Neural Networks (CNNs) for this purpose, training and evaluating our models on the scraped data.

To access the full details of the AI-generated and human-captured image classification project using CNN, you can follow this [link](https://colab.research.google.com/drive/1gSk-zoGXZ15lRTYeTAw8VL12JR4uRkkZ?usp=share_link), which provides a comprehensive overview of the project, including the model architecture, dataset preparation, and evaluation metrics.

These code snippets can be further customized and integrated into larger machine learning projects to train and evaluate image classification models.

**Remember that web scraping should always be conducted in accordance with the terms of use and policies of the websites you are scraping. It's important to be respectful of website owners and their content.**
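
One concrete check that can accompany reading a site's terms is consulting its `robots.txt`. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed`, the example rules, and the `example.com` URLs are made up for illustration. Note that `robots.txt` is separate from a site's terms of use, so passing this check does not by itself mean scraping is permitted.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/gallery?page=1"))  # True
print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/image.jpg"))  # False
```

For a live site, the text would first be downloaded from the site's `/robots.txt` path, for example with `requests.get(...).text`.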

*Happy web scraping, deep learning, and image classification!*
