# Image Scraper

---

**Author:** Kevin Milli

---

## Objective

The aim of this notebook is to develop a script for extracting high-resolution images from the website [Unsplash](https://unsplash.com/).

## Process Description

The notebook uses web scraping to retrieve the images returned by a specific search.<br>
Premium images and user profile images are then filtered out, and the remaining images are downloaded in high resolution.

## Approach to the Problem

The approach uses the following Python libraries:
- `requests`
- `BeautifulSoup`
- `os`
- `logging`

Four main functions have been implemented:
1. `get_img_tags_for`: Requests the search results page for a term and selects all image tags with BeautifulSoup.
2. `img_filter_out`: Returns `True` only for URLs that contain none of the given keywords.
3. `get_high_res_img_url`: Takes the last URL listed in each image's `srcset` (the high-resolution candidate) and discards unwanted URLs using `img_filter_out`.
4. `save_images`: Takes a list of image URLs and saves each image as a `.jpeg` file.

## Acknowledgments

Gratitude is expressed to the entire staff of [Unsplash](https://unsplash.com/) for their availability and the quality of the content provided.

In [1]:
import requests as r
from bs4 import BeautifulSoup
import os
import logging

In [2]:
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s - %(levelname)s - %(message)s"
                   )

In [2]:
def get_img_tags_for(term=None):
    """
    Fetch the search results page for a term and collect its image tags.

    Parameters:
    term : str, optional
        The search term to fetch images for.

    Returns:
    bs4.element.ResultSet
        A list-like collection of image tags selected by BeautifulSoup.
        
    Raises:
    Exception
        If no search term is provided or if there is an error getting the response from the server.
    """
    if term is None:
        raise Exception("You must insert something to search")
    
    url = f"https://unsplash.com/s/foto/{term}"
    resp = r.get(url)

    if resp.status_code != 200:
        raise Exception("Error Getting response")

    soup = BeautifulSoup(resp.content, 'html.parser')
    img_tags = soup.select("figure a img")

    return img_tags
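
The search term is placed directly in the URL path, so terms containing spaces (such as "Artificial Intelligence") rely on percent-encoding. The sketch below makes that encoding explicit with the standard library. The helper name `build_search_url` is hypothetical and not part of the notebook; the URL pattern is the one used in `get_img_tags_for`.

```python
from urllib.parse import quote


def build_search_url(term):
    """Build the search URL, percent-encoding characters such as spaces.

    Hypothetical helper: it only illustrates the URL construction step.
    """
    if not term:
        raise ValueError("A search term is required")
    # quote() turns "Artificial Intelligence" into "Artificial%20Intelligence"
    return f"https://unsplash.com/s/foto/{quote(term)}"


print(build_search_url("Artificial Intelligence"))
# prints: https://unsplash.com/s/foto/Artificial%20Intelligence
```

Raising `ValueError` instead of a bare `Exception` is a design choice of this sketch: it lets callers catch the missing-term case specifically.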
    

In [4]:
def img_filter_out(url, keywords):
    """
    Check that a URL contains none of the specified keywords.
    
    Parameters:
    url : str
        URL of the image.
    keywords : list of str
        Keywords that mark a URL as unwanted.
        
    Returns:
    bool
        True if the URL does not contain any of the keywords, False otherwise.
    """
    return not any(x in url for x in keywords)
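
As a quick illustration of how the filter behaves, the sketch below applies it to a few made-up URLs. The `example.com` addresses are assumptions chosen for the demonstration and are not real image URLs.

```python
def img_filter_out(url, keywords):
    """Return True if the URL contains none of the keywords."""
    return not any(x in url for x in keywords)


# Hypothetical URLs, used only to show the filtering behavior
urls = [
    "https://example.com/photo-123?w=2000",
    "https://example.com/premium_photo-456?w=2000",
    "https://example.com/profile-789?w=2000",
]
keywords = ["premium", "plus", "profile"]

kept = [u for u in urls if img_filter_out(u, keywords)]
print(kept)
# prints: ['https://example.com/photo-123?w=2000']
```

The check is a plain substring match, so a keyword is detected anywhere in the URL. For example, `img_filter_out("https://example.com/surplus-goods", ["plus"])` returns `False` because "surplus" contains "plus".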
    

In [84]:
def get_high_res_img_url(img_tags, keywords=("premium", "plus", "profile")):
    """
    Extract the high-resolution URL of each image and filter out unwanted ones.
    
    Parameters:
    img_tags : list of bs4 image tags
        A list of image tags obtained from BeautifulSoup.
    keywords : sequence of str, optional
        Keywords that mark an image as unwanted. Default is ("premium", "plus", "profile").
        
    Returns:
    list
        The last srcset URL of each image, excluding URLs that contain a keyword.
    """
    # Keep tags whose "src" starts with "h" (e.g. http/https) and split "srcset" into tokens
    img_urls = [img['srcset'].split() for img in img_tags if img['src'].startswith("h")]
    # The second-to-last token is the URL of the last srcset candidate
    hd_content = [img[-2] for img in img_urls]
    final_hd_urls = [i for i in hd_content if img_filter_out(i, keywords)]

    return final_hd_urls
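
The function above picks the high-resolution URL by position, which works when the largest candidate is listed last in `srcset`. A possible alternative, sketched below, selects the candidate with the largest width descriptor regardless of order. The helper name `largest_srcset_url` and the sample `srcset` string are hypothetical, and the sketch assumes every candidate carries a width descriptor such as `2000w` (not a density descriptor such as `2x`).

```python
def largest_srcset_url(srcset):
    """Return the URL with the largest width descriptor in a srcset string.

    Hypothetical helper: assumes each candidate is "<url> <width>w".
    """
    tokens = srcset.split()
    candidates = []
    # Tokens alternate between a URL and a descriptor such as "400w,"
    for url, descriptor in zip(tokens[0::2], tokens[1::2]):
        width = int(descriptor.rstrip(",").rstrip("w"))
        candidates.append((width, url))
    if not candidates:
        return None
    # Compare candidates by width only
    return max(candidates, key=lambda pair: pair[0])[1]


# Made-up srcset in which the largest candidate is not listed last
sample = (
    "https://example.com/photo-1?w=400 400w, "
    "https://example.com/photo-1?w=2000 2000w, "
    "https://example.com/photo-1?w=800 800w"
)
print(largest_srcset_url(sample))
# prints: https://example.com/photo-1?w=2000
```

Splitting on whitespace rather than on commas keeps URLs intact even when their query strings contain commas.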

In [102]:
def save_images(img_urls, dest_dir="image", tag=""):
    """
    Save images to a destination folder.
    
    Parameters:
    img_urls : list
        List of URLs of images to be saved.
    dest_dir : str, optional
        Destination directory where images will be saved. Default is "image".
    tag : str, optional
        Tag to be added to the beginning of each saved image file name. Default is an empty string.
    """
    # Create the destination folder once, before downloading anything
    os.makedirs(dest_dir, exist_ok=True)

    for url in img_urls:
        logging.info(f"Downloading {url} ...")
        resp = r.get(url)
        # Use the last path segment of the URL, without its query string, as the file name
        file_name = url.split("/")[-1].split("?")[0]

        with open(f"{dest_dir}/{tag}{file_name}.jpeg", "wb") as f:
            f.write(resp.content)
            logging.info(f"Saved {file_name}, with size {round(len(resp.content)/1024/1024, 2)} MB.")
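
The file name is derived from the last path segment of the URL, with the query string removed, and the `tag` is prepended without a separator. The sketch below shows the same derivation with `urllib.parse` and `pathlib`. The helper name `build_file_path` and the sample URL are hypothetical, introduced only for illustration.

```python
from pathlib import Path
from urllib.parse import urlsplit


def build_file_path(url, dest_dir="image", tag=""):
    """Return the destination path for an image URL.

    Hypothetical helper: mirrors the naming scheme of save_images.
    """
    # urlsplit separates the path from the query string
    name = Path(urlsplit(url).path).name
    return Path(dest_dir) / f"{tag}{name}.jpeg"


# Made-up URL used only to show the resulting path
path = build_file_path("https://example.com/photo-123?w=2000&q=80", "images", "AI_")
print(path.as_posix())
# prints: images/AI_photo-123.jpeg
```

Because the tag is joined directly to the file name, ending it with a separator such as `_` keeps the saved names readable.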
        

In [86]:
# Sending the request and getting all images
imgs = get_img_tags_for("Artificial Intelligence")

In [100]:
# Filtering data to obtain images in HD resolution
ia_imgs = get_high_res_img_url(imgs)

In [104]:
# Saving images to a given folder -> images
save_images(ia_imgs, "images", "Artificial_Intelligence")