<a href="https://colab.research.google.com/github/MelikaRad/Image_Colorization_Using_GAN/blob/main/Making_Landscape_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Making "Landscape Dataset" for image colorization

Collected by: Melika Heydari Rad

Here is how I built this dataset by web scraping and then cleaning the results:  
  
1. Scraped images with bing-image-downloader, searching for each category and saving the results in separate folders.  
(There are 10 categories, with 100 images requested for each.)
2. Made a backup.  
3. Deleted files with unsupported extensions.
4. Fixed unsupported modes (converted RGBA and grayscale images to RGB, and removed images in any other mode).
5. Renamed all images with unique random numeric names.
6. Resized all images to 200×200 pixels.
7. Made the test_y folder: moved 10 random images from each category folder into a separate folder called test_y.
8. Made the train_y folder: moved all remaining images into a separate folder called train_y.
9. Made the test_x folder: saved grayscale copies of all images in test_y to the folder test_x.
10. Made the train_x folder: saved grayscale copies of all images in train_y to the folder train_x.
  
The final dataset is accessible through the link below:  
https://drive.google.com/drive/folders/1mS3zGZDpGRYTbPKc19vitHJ4fK08Watx?usp=sharing

___

In [None]:
!pip install bing-image-downloader



In [None]:
# searching for various categories and saving them in separate folders using
# bing-image-downloader

from bing_image_downloader import downloader

output_dir = "drive/MyDrive/Landscape_Dataset_raw"

for category in (## landscape
                 "landcape",
                 "beach",
                 "waterfall",
                 "mountains",
                 "Nature",
                 "forest",
                 "city",
                 "street",
                 "buildings",
                 "town"
                 ):
  downloader.download(category, limit=100,  output_dir=output_dir,
  adult_filter_off=True, force_replace=False, timeout=70)

Streaming output truncated to the last 5000 lines.



[!!]Indexing page: 384

[%] Indexed 7 Images on Page 384.




[!!]Indexing page: 385

[%] Indexed 31 Images on Page 385.




[!!]Indexing page: 386

[%] Indexed 31 Images on Page 386.




[!!]Indexing page: 387

[%] Indexed 7 Images on Page 387.




[!!]Indexing page: 388

[%] Indexed 31 Images on Page 388.




[!!]Indexing page: 389

[%] Indexed 28 Images on Page 389.




[!!]Indexing page: 390

[%] Indexed 1 Images on Page 390.




[!!]Indexing page: 391

[%] Indexed 31 Images on Page 391.




[!!]Indexing page: 392

[%] Indexed 7 Images on Page 392.




[!!]Indexing page: 393

[%] Indexed 28 Images on Page 393.




[!!]Indexing page: 394

[%] Indexed 31 Images on Page 394.




[!!]Indexing page: 395

[%] Indexed 31 Images on Page 395.




[!!]Indexing page: 396

[%] Indexed 6 Images on Page 396.




[!!]Indexing page: 397

[%] Indexed 31 Images on Page 397.




[!!]Indexing page: 398

[%] Indexed 28 Images on Page 3

In [None]:
# Making a backup from the raw dataset

import shutil
import os

def duplicate_folder(source, destination):
    try:
        # Check if the source folder exists
        if not os.path.exists(source):
            print(f"The source folder '{source}' does not exist.")
            return

        # Copy the folder
        shutil.copytree(source, destination)
        print(f"Folder duplicated from '{source}' to '{destination}' successfully.")

    except FileExistsError:
        print(f"The destination folder '{destination}' already exists.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Define the source and destination paths
source_folder = "drive/MyDrive/Landscape_Dataset_raw"
destination_folder = "drive/MyDrive/Landscape_Dataset_raw_Backup"

# Call the function to duplicate the folder
duplicate_folder(source_folder, destination_folder)


Folder duplicated from 'drive/MyDrive/Landscape_Dataset_raw' to 'drive/MyDrive/Landscape_Dataset_raw_Backup' successfully.


In [None]:
# checking extensions
# as there might be unsupported ones

import os

base_dir = "drive/MyDrive/Landscape_Dataset_raw"

def list_extensions_in_directory(directory):
  extensions = set()  # Use a set to avoid duplicates

  for folder_name in os.listdir(directory):
      folder_path = os.path.join(directory, folder_name)
      if not os.path.isdir(folder_path):
          continue  # Skip stray files next to the category folders

      for file in os.listdir(folder_path):
          _, ext = os.path.splitext(file)  # Split the file name and extension
          extensions.add(ext)  # Add the extension to the set
  return extensions

print(list_extensions_in_directory(base_dir))

{'.png', '.jpg', '.JPG', '.webp', '.WEBP', '.jpeg'}
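
The set above lists `.jpg`/`.JPG` and `.webp`/`.WEBP` separately and does not say how many files each extension covers. As a minimal sketch (the helper name `count_extensions` is hypothetical, not part of the original notebook), a case-insensitive tally could look like this:

```python
import os
from collections import Counter
from pathlib import Path


def count_extensions(directory):
    """Tally file extensions, case-insensitively, below `directory`."""
    counts = Counter()
    for _root, _dirs, files in os.walk(directory):
        for name in files:
            # Path.suffix keeps the leading dot; lower() merges '.JPG' and '.jpg'
            counts[Path(name).suffix.lower()] += 1
    return counts


# Assumed location, taken from the cells above; skipped if Drive is not mounted
base_dir = "drive/MyDrive/Landscape_Dataset_raw"
if os.path.isdir(base_dir):
    for ext, n in count_extensions(base_dir).most_common():
        print(f"{ext or '(no extension)'}: {n}")
```

Knowing the counts before deleting anything makes it easier to judge how much data an extension filter will discard.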


In [None]:
# deleting unsupported files (everything except .png, .jpg and .jpeg, e.g. .webp and .gif)

base_dir = "drive/MyDrive/Landscape_Dataset_raw"

# Define the allowed extensions
allowed_extensions = [".png", ".jpg", ".jpeg", ".JPG", ".JPEG"]
count = 0

# Iterate through the folders in the base directory
for folder_name in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder_name)

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            # Get the full path of the file
            file_path = os.path.join(root, file)

            # Check if the file extension is not in the allowed list
            if not any(file.endswith(ext) for ext in allowed_extensions):
                # Delete the file
                os.remove(file_path)
                print(f"Deleted file: {file_path}")
                count+=1

print(f"Finished deleting {count} non-image files.")

Deleted file: drive/MyDrive/Landscape_Dataset_raw/waterfall/Image_84.WEBP
Deleted file: drive/MyDrive/Landscape_Dataset_raw/waterfall/Image_85.WEBP
Deleted file: drive/MyDrive/Landscape_Dataset_raw/waterfall/Image_90.WEBP
Deleted file: drive/MyDrive/Landscape_Dataset_raw/waterfall/Image_93.webp
Deleted file: drive/MyDrive/Landscape_Dataset_raw/forest/Image_61.webp
Deleted file: drive/MyDrive/Landscape_Dataset_raw/street/Image_30.webp
Deleted file: drive/MyDrive/Landscape_Dataset_raw/town/Image_32.webp
Deleted file: drive/MyDrive/Landscape_Dataset_raw/town/Image_97.webp
Finished deleting 8 non-image files.


___

In [None]:
# fixing unsupported modes (images that are not 3-channel RGB)

import os
from PIL import Image

def check_and_convert_images(directory):
    count = 0

    for folder_name in os.listdir(directory):
        folder_path = os.path.join(directory, folder_name)

        for filename in os.listdir(folder_path):
            if filename.endswith(('.jpg', '.jpeg', '.png', '.JPG', '.JPEG')):
                image_path = os.path.join(folder_path, filename)
                img = Image.open(image_path)
                mode = img.mode

                # Note: converted files are rewritten as JPEG data under their
                # original name, so a converted .png keeps its .png extension.
                if mode == 'RGBA':
                    img = img.convert('RGB')
                    img.save(image_path, 'JPEG')
                    print(f"Converted {filename} from RGBA to RGB JPEG")
                    count += 1

                elif mode == 'L':
                    img = img.convert('RGB')
                    img.save(image_path, 'JPEG')
                    print(f"Converted {filename} from grayscale to RGB JPEG")
                    count += 1

                elif mode == 'RGB':
                    pass

                else:
                    # Any other mode (e.g. 'P', 'CMYK') is removed from the dataset
                    print(f"Unsupported mode '{mode}' for {filename}")
                    os.remove(image_path)
                    count += 1

    print(str(count) + " images modified or removed")

# Example usage:
base_dir = "drive/MyDrive/Landscape_Dataset_raw"
check_and_convert_images(base_dir)


Converted Image_38.png from RGBA to RGB JPEG
Converted Image_59.png from RGBA to RGB JPEG
Unsupported mode 'P' for Image_67.png
Converted Image_74.png from RGBA to RGB JPEG
Converted Image_91.png from RGBA to RGB JPEG
Unsupported mode 'P' for Image_42.png
Unsupported mode 'P' for Image_23.png
Unsupported mode 'P' for Image_28.png
Unsupported mode 'P' for Image_94.png
Unsupported mode 'P' for Image_1.png
Converted Image_25.jpg from grayscale to RGB JPEG
Converted Image_37.png from RGBA to RGB JPEG
Unsupported mode 'P' for Image_42.png
Unsupported mode 'P' for Image_81.png
Converted Image_30.png from RGBA to RGB JPEG
Converted Image_88.png from RGBA to RGB JPEG
Unsupported mode 'CMYK' for Image_94.jpg
Converted Image_21.png from RGBA to RGB JPEG
Unsupported mode 'P' for Image_27.png
Unsupported mode 'P' for Image_29.png
Converted Image_30.png from RGBA to RGB JPEG
Converted Image_36.png from RGBA to RGB JPEG
Unsupported mode 'P' for Image_38.png
Converted Image_91.png from RGBA to RGB JP
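
The cell above deletes palette (`'P'`) and CMYK images. The published dataset was built that way, but Pillow can convert those modes as well, so the images could be kept instead. The following is only a sketch of that alternative; the helper name `to_rgb` and the choice of a white background for transparent pixels are assumptions, not part of the original notebook:

```python
from PIL import Image


def to_rgb(img):
    """Return `img` in RGB mode, whatever mode it started in."""
    if img.mode == "RGB":
        return img
    has_alpha = img.mode in ("RGBA", "LA") or (
        img.mode == "P" and "transparency" in img.info
    )
    if has_alpha:
        # Flatten transparency onto a white background (an assumed choice)
        rgba = img.convert("RGBA")
        background = Image.new("RGB", rgba.size, (255, 255, 255))
        background.paste(rgba, mask=rgba.split()[3])  # band 3 is alpha
        return background
    # 'L', 'P' without transparency, 'CMYK', etc.
    return img.convert("RGB")
```

With a helper like this, the `else` branch that removes files would only be needed for images Pillow cannot open at all.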

In [None]:
# renaming with random numbers

import os
import random

# Set the base directory
base_dir = "drive/MyDrive/Landscape_Dataset_raw"

random_names = list(range(1001))
random.shuffle(random_names)
random_names = [str(name) for name in random_names]

count = 0

# Iterate through the folders in the base directory
for folder_name in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder_name)

    # Check if the item is a directory
    if os.path.isdir(folder_path):
        # Iterate through the files in the folder

        for filename in os.listdir(folder_path):

            image_extension = filename.split(".")[-1]
            new_filename = f"{random_names[count]}.{image_extension}"

            # Construct the full paths
            old_file_path = os.path.join(folder_path, filename)
            new_file_path = os.path.join(folder_path, new_filename)

            # Rename the file
            os.rename(old_file_path, new_file_path)

            # Increment the count
            count += 1


In [None]:
# resizing images

import os
from PIL import Image

base_dir = "drive/MyDrive/Landscape_Dataset_raw"

for folder_name in os.listdir(base_dir):
    folder_path = os.path.join(base_dir, folder_name)


    # Get all the files in the input directory
    files = os.listdir(folder_path)


    # Iterate over each file
    for file in files:
        # Check if the file is an image
        if file.endswith(('.jpg', '.jpeg', '.png', '.JPEG', '.JPG')):
            # Open the image
            img = Image.open(os.path.join(folder_path, file))

            # Resize the image
            img = img.resize((200, 200))

            # Save the resized image
            img.save(os.path.join(folder_path, file))

            print(f"Resized and saved {file}")
        else:
            print(f"Skipping {file}, not an image")


Resized and saved 440.jpg
Resized and saved 286.jpg
Resized and saved 1.jpg
Resized and saved 491.jpg
Resized and saved 164.jpg
Resized and saved 342.jpg
Resized and saved 222.jpg
Resized and saved 966.jpg
Resized and saved 192.jpg
Resized and saved 119.jpg
Resized and saved 58.jpg
Resized and saved 185.jpg
Resized and saved 125.jpg
Resized and saved 598.jpg
Resized and saved 590.jpg
Resized and saved 193.jpg
Resized and saved 783.png
Resized and saved 959.jpg
Resized and saved 270.jpg
Resized and saved 238.jpeg
Resized and saved 944.jpg
Resized and saved 899.jpg
Resized and saved 463.jpg
Resized and saved 268.jpg
Resized and saved 175.jpg
Resized and saved 988.png
Resized and saved 408.jpg
Resized and saved 476.jpg
Resized and saved 72.jpg
Resized and saved 824.jpg
Resized and saved 397.jpg
Resized and saved 656.jpg
Resized and saved 303.jpg
Resized and saved 895.jpg
Resized and saved 146.jpg
Resized and saved 172.jpg
Resized and saved 691.jpg
Resized and saved 958.png
Resized and sav
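
Note that `img.resize((200, 200))` stretches any non-square image to fit, which distorts its proportions. If that matters, one option is to scale and center-crop instead. This is a sketch of that alternative using Pillow's `ImageOps.fit`; the helper name `resize_center_crop` is hypothetical, and the published dataset was produced with the plain `resize` call above:

```python
from PIL import Image, ImageOps


def resize_center_crop(img, size=(200, 200)):
    """Scale `img` to cover `size`, then crop the overflow around the center."""
    return ImageOps.fit(img, size)


# A 400x300 landscape image becomes 200x200 without being stretched
sample = Image.new("RGB", (400, 300))
print(resize_center_crop(sample).size)
```

The trade-off is that the edges of wide or tall images are cut off rather than squeezed.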

In [None]:
# making the test_y directory (10 random images from each category)

import os
import random
import shutil

source_dir = "drive/MyDrive/Landscape_Dataset_raw"
dest_dir = "drive/MyDrive/Landscape_Dataset/test_y"

# Create the destination directory if it doesn't exist
if not os.path.exists(dest_dir):
    os.makedirs(dest_dir)

for folder in os.listdir(source_dir):
    folder_path = os.path.join(source_dir, folder)

    if os.path.isdir(folder_path):
        images = os.listdir(folder_path)
        if len(images) >= 10:
            selected_images = random.sample(images, 10)
            for image in selected_images:
                source_file = os.path.join(folder_path, image)
                dest_file = os.path.join(dest_dir, image)
                shutil.move(source_file, dest_file)
            print(f"Moved 10 random images from {folder} folder.")
        else:
            print(f"Skipping {folder} folder as it has fewer than 10 images.")


Moved 10 random images from landcape folder.
Moved 10 random images from beach folder.
Moved 10 random images from waterfall folder.
Moved 10 random images from mountains folder.
Moved 10 random images from Nature folder.
Moved 10 random images from forest folder.
Moved 10 random images from city folder.
Moved 10 random images from street folder.
Moved 10 random images from buildings folder.
Moved 10 random images from town folder.
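
The split above does not fix a random seed, so rerunning the cell selects different test images each time. A minimal sketch of a reproducible selection follows; the function name `pick_test_images` and the default seed are assumptions made for illustration:

```python
import os
import random


def pick_test_images(folder_path, k=10, seed=0):
    """Return `k` file names from `folder_path`, chosen reproducibly.

    Returns an empty list if the folder holds fewer than `k` files,
    mirroring the "skip" branch in the cell above.
    """
    # os.listdir order is arbitrary, so sort before sampling
    images = sorted(os.listdir(folder_path))
    if len(images) < k:
        return []
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(images, k)


# Assumed location, taken from the cells above; skipped if Drive is not mounted
source_dir = "drive/MyDrive/Landscape_Dataset_raw"
if os.path.isdir(source_dir):
    for folder in sorted(os.listdir(source_dir)):
        folder_path = os.path.join(source_dir, folder)
        if os.path.isdir(folder_path):
            print(folder, pick_test_images(folder_path))
```

The same seed and the same folder contents always give the same selection, which makes the train/test split repeatable.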


In [None]:
# making the train_y directory (moving all remaining images from the category folders into train_y)


import os
import shutil

# Set the source and destination directories
source_dir = "drive/MyDrive/Landscape_Dataset_raw"
dest_dir = "drive/MyDrive/Landscape_Dataset/train_y"

# Create the destination directory if it doesn't exist
if not os.path.exists(dest_dir):
    os.makedirs(dest_dir)

# Iterate through the source directory
for category_folder in os.listdir(source_dir):
    category_path = os.path.join(source_dir, category_folder)

    # Check if the item is a directory
    if os.path.isdir(category_path):
        # Iterate through the files in the category folder
        for filename in os.listdir(category_path):

            # Construct the full source and destination paths
            source_file = os.path.join(category_path, filename)
            dest_file = os.path.join(dest_dir, filename)

            # Move the file to the destination directory (the file name is unchanged)
            shutil.move(source_file, dest_file)

print(f"All images have been moved to the '{dest_dir}' folder.")


All images have been moved to the 'drive/MyDrive/Landscape_Dataset/train_y' folder.


___

In [None]:
# checking the number of training images
import os
from pathlib import Path

def count_images(folder_path):
    """
    Counts the number of image files in the given folder.

    Args:
        folder_path (str): The path to the folder containing the images.

    Returns:
        int: The number of image files in the folder.
    """
    image_extensions = {".jpg", ".jpeg", ".png"}  # lowercase only: suffixes are lowercased before the check
    count = 0
    for entry in os.listdir(folder_path):
        entry_path = os.path.join(folder_path, entry)
        if os.path.isfile(entry_path) and Path(entry_path).suffix.lower() in image_extensions:
            count += 1
    return count

# Example usage
folder_path = "drive/MyDrive/Landscape_Dataset/train_y"
num_images = count_images(folder_path)
print(f"Number of images in {folder_path}: {num_images}")


Number of images in drive/MyDrive/Landscape_Dataset/train_y: 880


### Making train_x by grayscaling train_y

In [None]:
# making train_x (grayscale images for training)

input_dir = "drive/MyDrive/Landscape_Dataset/train_y"
output_dir = "drive/MyDrive/Landscape_Dataset/train_x"

import os
from PIL import Image

if not os.path.exists(output_dir):
    os.makedirs(output_dir)


for filename in os.listdir(input_dir):
    if filename.endswith(('.jpg', '.jpeg', '.png', '.JPEG', '.JPG')):
        # Load the image
        img_path = os.path.join(input_dir, filename)
        img = Image.open(img_path)

        # Convert the image to grayscale
        gray_img = img.convert('L')

        name = filename.split('.')[0]
        extension = filename.split('.')[-1]
        name = name + "_gray."
        filename = name + extension

        # Save the grayscale image to the output directory
        output_path = os.path.join(output_dir, filename)
        gray_img.save(output_path)
        print(f'Saved grayscale image: {output_path}')


Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/440_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/286_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/1_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/491_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/164_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/342_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/192_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/58_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/185_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/125_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/598_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/590_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/train_x/193_gray.jpg
Saved grayscale

### Making test_x by grayscaling test_y

In [None]:
# making test_x (grayscale images for testing)

input_dir = "drive/MyDrive/Landscape_Dataset/test_y"
output_dir = "drive/MyDrive/Landscape_Dataset/test_x"

import os
from PIL import Image

if not os.path.exists(output_dir):
    os.makedirs(output_dir)


for filename in os.listdir(input_dir):
    if filename.endswith(('.jpg', '.jpeg', '.png', '.JPEG', '.JPG')):
        # Load the image
        img_path = os.path.join(input_dir, filename)
        img = Image.open(img_path)

        # Convert the image to grayscale
        gray_img = img.convert('L')

        name = filename.split('.')[0]
        extension = filename.split('.')[-1]
        name = name + "_gray."
        filename = name + extension

        # Save the grayscale image to the output directory
        output_path = os.path.join(output_dir, filename)
        gray_img.save(output_path)
        print(f'Saved grayscale image: {output_path}')



Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/222_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/966_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/119_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/268_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/988_gray.png
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/408_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/958_gray.png
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/347_gray.jpeg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/618_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/577_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/198_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/106_gray.jpg
Saved grayscale image: drive/MyDrive/Landscape_Dataset/test_x/706_gray.jpg
Saved grayscale image: d
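
Since each grayscale input is named after its color target with a `_gray` suffix, the pairing between the `*_x` and `*_y` folders can be verified by file name alone. The following is a sketch of such a check; the function name `unpaired_images` is hypothetical:

```python
import os


def unpaired_images(x_dir, y_dir, suffix="_gray"):
    """Compare a grayscale folder with its color counterpart.

    Returns (missing, orphans): grayscale names expected in `x_dir` but
    absent, and files in `x_dir` that have no color original in `y_dir`.
    """
    x_names = set(os.listdir(x_dir))
    expected_x = set()
    for name in os.listdir(y_dir):
        stem, ext = os.path.splitext(name)  # '440.jpg' -> ('440', '.jpg')
        expected_x.add(f"{stem}{suffix}{ext}")
    missing = sorted(expected_x - x_names)
    orphans = sorted(x_names - expected_x)
    return missing, orphans


# Assumed location, taken from the cells above; skipped if Drive is not mounted
dataset_dir = "drive/MyDrive/Landscape_Dataset"
for split in ("train", "test"):
    x_dir = os.path.join(dataset_dir, f"{split}_x")
    y_dir = os.path.join(dataset_dir, f"{split}_y")
    if os.path.isdir(x_dir) and os.path.isdir(y_dir):
        missing, orphans = unpaired_images(x_dir, y_dir)
        print(f"{split}: {len(missing)} missing, {len(orphans)} orphaned")
```

Two empty lists for both splits mean every color image has exactly one grayscale partner.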

___

Done !!!