# Task: Web Scraping Data

Web scraping is a technique for extracting data from websites. It involves writing code that automatically collects information from web pages, which can then be processed and used for purposes such as data analysis, market research, or content aggregation. The common Python libraries and the typical workflow are outlined below:

1. **Python Libraries**:
   - **BeautifulSoup**: A Python library for parsing HTML and XML documents, useful for extracting specific elements from a webpage.
   - **Scrapy**: A web crawling framework that allows you to build and run web spiders to scrape websites.
   - **Selenium**: A tool for automating web browsers, often used to scrape data from dynamically loaded content (e.g., JavaScript-heavy websites).

2. **Web Scraping Process**:
   - **Step 1**: Identify the target website and determine if the data you want is publicly available.
   - **Step 2**: Inspect the structure of the website (HTML, CSS) using browser developer tools to identify the elements containing the data.
   - **Step 3**: Write a scraper using libraries like BeautifulSoup or Scrapy to access and extract the data.
   - **Step 4**: Process and store the extracted data in a usable format, such as CSV or a database.
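
Steps 3 and 4 can be sketched on a small scale as follows. This is only a minimal illustration: it assumes an inline HTML snippet in place of a live page, and it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The names `ImageSrcParser`, `extract_image_urls`, `to_csv`, and the sample markup are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="gallery">
  <img src="https://example.com/a.jpg" alt="First">
  <img src="https://example.com/b.jpg" alt="Second">
  <img alt="No source">
</div>
"""


class ImageSrcParser(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag that has a src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        if attributes.get("src"):
            self.images.append((attributes["src"], attributes.get("alt", "")))


def extract_image_urls(html):
    """Step 3: parse the markup and extract the data of interest."""
    parser = ImageSrcParser()
    parser.feed(html)
    return parser.images


def to_csv(rows):
    """Step 4: store the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["src", "alt"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_image_urls(SAMPLE_HTML)
print(to_csv(rows))
```

With BeautifulSoup installed, the extraction step would typically be written with `BeautifulSoup(html, "html.parser").find_all("img")` instead of a custom parser class.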

### Installing some necessary packages

In [3]:
pip install selenium

Collecting selenium
  Downloading selenium-4.24.0-py3-none-any.whl (9.6 MB)
Collecting trio~=0.17
  Downloading trio-0.26.2-py3-none-any.whl (475 kB)
Collecting urllib3[socks]<3,>=1.26
  Downloading urllib3-2.2.2-py3-none-any.whl (121 kB)
Collecting certifi>=2021.10.8
  Downloading certifi-2024.8.30-py3-none-any.whl (167 kB)
Collecting websocket-client~=1.8
  Downloading websocket_client-1.8.0-py3-none-any.whl (58 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting outcome
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting sortedcontai

In [6]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install webdriver-manager

Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv, webdriver-manager
Successfully installed python-dotenv-1.0.1 webdriver-manager-4.0.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install zipfile36

Collecting zipfile36
  Downloading zipfile36-0.1.3-py3-none-any.whl (20 kB)
Installing collected packages: zipfile36
Successfully installed zipfile36-0.1.3
Note: you may need to restart the kernel to use updated packages.


### Necessary libraries

In [None]:
import os
import time
import requests
import zipfile
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

In [None]:
# Create a directory to store the downloaded images
def create_directory(directory_name):
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(directory_name, exist_ok=True)
    return directory_name

In [None]:
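# Function to download a single image and save it as image_<n>.jpg
def download_image(image_url, save_folder, image_num):
    try:
        # A timeout keeps a stalled request from blocking the whole run
        response = requests.get(image_url, timeout=10)
        response.raise_for_status()  # Treat HTTP error responses as failures
        file_name = os.path.join(save_folder, f'image_{image_num}.jpg')
        with open(file_name, 'wb') as handler:
            handler.write(response.content)
        print(f"Downloaded: {file_name}")
    except Exception as e:
        print(f"Failed to download {image_url}: {e}")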
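
The download function always saves files with a `.jpg` extension, even when the server returns a PNG or WebP image. One possible refinement, sketched here under assumptions, is to derive the extension from the response's `Content-Type` header. The mapping `EXTENSIONS` and the helper `extension_for` are hypothetical names, and the list of MIME types is deliberately short.

```python
# Hypothetical mapping from common image MIME types to file extensions.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_for(content_type, default=".jpg"):
    """Return a file extension for a Content-Type header value."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=binary" and normalize the case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(mime_type, default)


print(extension_for("image/png"))
print(extension_for("IMAGE/WEBP; charset=binary"))
print(extension_for(None))
```

With `requests`, the header value is available on a response object as `response.headers.get("Content-Type")`, so the result could replace the hard-coded `.jpg` when building the file name.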

In [None]:
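# Function to create a ZIP file
def create_zip_file(folder_name, zip_file_name):
    # ZIP_DEFLATED enables compression; the default (ZIP_STORED) only archives
    with zipfile.ZipFile(zip_file_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(folder_name):
            for file in files:
                # Store each file under its bare name, without the folder path
                zipf.write(os.path.join(root, file), file)
    print(f"All images compressed into {zip_file_name}")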

In [None]:
# Set up the Selenium WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
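
Driver setup tends to be the most fragile part of a Selenium script, so an alternative is sketched below. It assumes Selenium 4.6 or later, where the bundled Selenium Manager locates a matching driver when `webdriver.Chrome()` is called without a `Service`, and it assumes a local Chrome installation. The helper names `build_chrome_arguments` and `make_driver`, and the chosen flags and window size, are illustrative rather than part of the original notebook.

```python
def build_chrome_arguments(headless=True, window_size=(1280, 2000)):
    """Build command-line arguments for Chrome (illustrative defaults)."""
    arguments = [f"--window-size={window_size[0]},{window_size[1]}"]
    if headless:
        # Run Chrome without opening a visible browser window.
        arguments.append("--headless=new")
    return arguments


def make_driver(headless=True):
    """Create a Chrome driver; requires selenium and a local Chrome install."""
    # Imported here so the helper above can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(headless=headless):
        options.add_argument(argument)
    # No Service is passed: Selenium Manager is expected to resolve the driver.
    return webdriver.Chrome(options=options)


print(build_chrome_arguments())
```

Calling `make_driver()` would then take the place of the `ChromeDriverManager` line, at the cost of depending on a recent Selenium release.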

In [None]:
# URL-encoding helper for the search query
from urllib.parse import quote_plus

# Function to scrape and download images from Pinterest
def scrape_pinterest_images(query, limit=10, download_folder='pinterest_images', zip_file_name='images.zip'):
    # Create a directory to store the images
    save_folder = create_directory(download_folder)

    # Encode the query so spaces and special characters are safe in the URL
    search_url = f"https://www.pinterest.com/search/pins/?q={quote_plus(query)}"
    driver.get(search_url)
    time.sleep(5)  # Let the page load

    images = set()
    scroll_pause = 2
    scrolls = 0
    image_count = 0

    while image_count < limit:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause)

        # Find image elements
        img_elements = driver.find_elements(By.CSS_SELECTOR, "img")
        for img in img_elements:
            img_url = img.get_attribute('src')
            # Skip missing, inline (data:), and already-seen sources
            if img_url and img_url.startswith('http') and img_url not in images:
                images.add(img_url)
                download_image(img_url, save_folder, image_count + 1)
                image_count += 1
                if image_count >= limit:
                    break

        scrolls += 1
        if scrolls > 10:  # Safety stop to avoid scrolling too far
            break

    # Close the browser; the global driver cannot be reused after this call
    driver.quit()

    # Create a ZIP file with all downloaded images
    create_zip_file(save_folder, zip_file_name)

In [2]:
# Example usage
scrape_pinterest_images('beauty products', limit=20, download_folder='beauty_images', zip_file_name='beauty_images.zip')


Downloaded: beauty_images/image_1.jpg
Downloaded: beauty_images/image_2.jpg
Downloaded: beauty_images/image_3.jpg
Downloaded: beauty_images/image_4.jpg
Downloaded: beauty_images/image_5.jpg
Downloaded: beauty_images/image_6.jpg
Downloaded: beauty_images/image_7.jpg
Downloaded: beauty_images/image_8.jpg
Downloaded: beauty_images/image_9.jpg
Downloaded: beauty_images/image_10.jpg
Downloaded: beauty_images/image_11.jpg
Downloaded: beauty_images/image_12.jpg
Downloaded: beauty_images/image_13.jpg
Downloaded: beauty_images/image_14.jpg
Downloaded: beauty_images/image_15.jpg
Downloaded: beauty_images/image_16.jpg
Downloaded: beauty_images/image_17.jpg
Downloaded: beauty_images/image_18.jpg
Downloaded: beauty_images/image_19.jpg
Downloaded: beauty_images/image_20.jpg
All images compressed into beauty_images.zip


## The code ran successfully, but the results have some issues, such as the dimensions of the images and the number of images collected. Another difficulty concerns the *Chromedriver*.
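
One possible way to approach the image-dimension issue is sketched below. It rests on an assumption, not on a documented API: that Pinterest image URLs carry a size segment such as `/236x/` or `/75x75_RS/` in their path, as commonly observed. The helpers `width_from_url` and `filter_by_min_width` and the sample URLs are hypothetical.

```python
import re

# Matches a path segment that starts with "<digits>x", e.g. "/236x/".
SIZE_SEGMENT = re.compile(r"/(\d+)x[^/]*/")


def width_from_url(url):
    """Return the width encoded in the URL path, or None if there is none."""
    match = SIZE_SEGMENT.search(url)
    return int(match.group(1)) if match else None


def filter_by_min_width(urls, min_width, keep_unknown=False):
    """Keep URLs whose encoded width is at least min_width."""
    kept = []
    for url in urls:
        width = width_from_url(url)
        if width is None:
            # No size segment found: keep only if the caller asks for it.
            if keep_unknown:
                kept.append(url)
        elif width >= min_width:
            kept.append(url)
    return kept


# Hypothetical URLs in the assumed format.
sample_urls = [
    "https://i.pinimg.com/236x/aa/bb/cc/one.jpg",
    "https://i.pinimg.com/736x/aa/bb/cc/two.jpg",
    "https://i.pinimg.com/75x75_RS/aa/bb/cc/avatar.jpg",
]
print(filter_by_min_width(sample_urls, min_width=200))
```

Applying such a filter before downloading would skip small thumbnails and avatars, which may also make the number of useful images closer to the requested limit.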