# Data Cleaning

In this notebook, our goal is to prepare all the data needed for training and testing our models.

In [1]:
import cv2
import os
from Utils import get_min_dimensions
    

In [2]:
folder_path = "data"  # Change this to the path of your image folder
min_width, min_height = get_min_dimensions(folder_path)

if min_width == float('inf') or min_height == float('inf'):
    print("No valid images found in the folder.")
else:
    print(f"Minimum Width: {min_width}px")
    print(f"Minimum Height: {min_height}px")


Error processing Abyssinian_34.jpg: 'NoneType' object has no attribute 'shape'
Error processing Egyptian_Mau_139.jpg: 'NoneType' object has no attribute 'shape'
Error processing Egyptian_Mau_145.jpg: 'NoneType' object has no attribute 'shape'
Error processing Egyptian_Mau_167.jpg: 'NoneType' object has no attribute 'shape'
Error processing Egyptian_Mau_177.jpg: 'NoneType' object has no attribute 'shape'
Error processing Egyptian_Mau_191.jpg: 'NoneType' object has no attribute 'shape'
Minimum Width: 114px
Minimum Height: 103px
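
The `'NoneType' object has no attribute 'shape'` messages above are what you see when an image loader returns `None` instead of an array, which `cv2.imread` does for files it cannot decode. As a rough, standard-library-only way to locate suspicious files, the sketch below flags `.jpg` files that do not start with the JPEG signature bytes. This is a cruder check than actually decoding the image, and `find_non_jpeg_files` is a hypothetical helper for illustration, not part of `Utils`.

```python
import os

JPEG_SIGNATURE = b"\xff\xd8\xff"  # first three bytes of a JPEG file


def find_non_jpeg_files(folder_path):
    """Return sorted names of .jpg files whose header is not a JPEG signature.

    Hypothetical helper: it only inspects the first bytes of each file,
    it does not try to decode the image.
    """
    suspicious = []
    for filename in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, filename)
        # Skip sub-folders and files with other extensions
        if not (os.path.isfile(path) and filename.lower().endswith(".jpg")):
            continue
        with open(path, "rb") as f:
            header = f.read(len(JPEG_SIGNATURE))
        if header != JPEG_SIGNATURE:
            suspicious.append(filename)
    return suspicious
```

Calling `find_non_jpeg_files("data")` before the reorganization step would list candidates to inspect or exclude; whether they match the six files reported above depends on why those files failed to load.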


## 1. Reorganize images into folders

In this step, we group the images into one folder per breed so that both the train and test sets contain every breed. Without this step, the test set could end up containing breeds the model never saw during training.
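
To make the grouping rule concrete, here is a minimal sketch that works on plain filename strings and never touches the disk. It assumes filenames follow the `<breed>_<number>.jpg` convention seen in this dataset; `group_by_breed` is a hypothetical helper name used only for illustration.

```python
import re
from collections import defaultdict

# The breed is everything before the final "_<number>.jpg"
pattern = re.compile(r'(.+)_\d+\.jpg')


def group_by_breed(filenames):
    """Map each breed name to the list of filenames that belong to it."""
    groups = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match:  # files that do not follow the convention are ignored
            groups[match.group(1)].append(name)
    return dict(groups)


example = group_by_breed([
    "Egyptian_Mau_139.jpg",
    "Egyptian_Mau_145.jpg",
    "american_pit_bull_terrier_12.jpg",
    "notes.txt",
])
print(example)
```

Because `(.+)` is greedy, multi-word breeds such as `american_pit_bull_terrier` stay in one piece: only the last underscore followed by digits is treated as the separator.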

In [18]:
import os
import re
from shutil import copy

# Path to your data folder
data_folder_path = "data"

# Create a dictionary to store breeds and corresponding folder paths
breed_folders = {}

# Regular expression pattern to extract breed name
pattern = re.compile(r'(.+)_\d+\.jpg')

# Iterate through each file in the data folder
for filename in os.listdir(data_folder_path):
    # Match the pattern to extract the breed name
    match = pattern.match(filename)
    
    if match:
        breed_name = match.group(1)
        
        # Create a folder for the breed if it doesn't exist
        if breed_name not in breed_folders:
            breed_folder = os.path.join(data_folder_path, breed_name)
            os.makedirs(breed_folder, exist_ok=True)
            breed_folders[breed_name] = breed_folder
        
        # Copy the file to the respective breed folder
        src_path = os.path.join(data_folder_path, filename)
        dest_path = os.path.join(breed_folders[breed_name], filename)
        copy(src_path, dest_path)
        
        # Remove the file from the data folder
        os.remove(src_path)

# Print the list of breed folders
print("Breed folders created:\n")
print(100*'*')
for breed_name, folder_path in breed_folders.items():
    print(f"{breed_name}: {folder_path}")





Breed folders created:

****************************************************************************************************
Abyssinian: data\Abyssinian
american_bulldog: data\american_bulldog
american_pit_bull_terrier: data\american_pit_bull_terrier
basset_hound: data\basset_hound
beagle: data\beagle
Bengal: data\Bengal
Birman: data\Birman
Bombay: data\Bombay
boxer: data\boxer
British_Shorthair: data\British_Shorthair
chihuahua: data\chihuahua
Egyptian_Mau: data\Egyptian_Mau
english_cocker_spaniel: data\english_cocker_spaniel
english_setter: data\english_setter
german_shorthaired: data\german_shorthaired
great_pyrenees: data\great_pyrenees
havanese: data\havanese
japanese_chin: data\japanese_chin
keeshond: data\keeshond
leonberger: data\leonberger
Maine_Coon: data\Maine_Coon
miniature_pinscher: data\miniature_pinscher
newfoundland: data\newfoundland
Persian: data\Persian
pomeranian: data\pomeranian
pug: data\pug
Ragdoll: data\Ragdoll
Russian_Blue: data\Russian_Blue
saint_bernard: data

In [19]:
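files_folder = 200  # assumed number of images in each breed folder
train_ratio = 0.8   # fraction of each breed's images that goes to train
num_files = int(train_ratio * files_folder)  # first 160 files per breed go to train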

data_folder="data/"
train_dest="data/train/"
test_dest="data/test/"

# Specify the path where you want to create the new folders
base_path = data_folder

# Create "train" folder
train_path = os.path.join(base_path, "train")
os.makedirs(train_path, exist_ok=True)
print(f"Folder 'train' created successfully at '{train_path}'")

# Create "test" folder
test_path = os.path.join(base_path, "test")
os.makedirs(test_path, exist_ok=True)
print(f"Folder 'test' created successfully at '{test_path}'\n")

for folder in os.listdir(data_folder):
    folder_path = os.path.join(data_folder, folder)
    if os.path.isdir(folder_path) and folder not in ("train","test"):
        i=0
        print("Processing folder: "+str(folder))
        for filename in os.listdir(folder_path):
            img = os.path.join(folder_path, filename)
            if i<num_files:
                dest_path=train_dest
            else:
                dest_path=test_dest
            copy(img,dest_path)
            os.remove(img)
            i=i+1
        os.rmdir(folder_path)


Folder 'train' created successfully at 'data/train'
Folder 'test' created successfully at 'data/test'
Processing folder: Abyssinian
Processing folder: american_bulldog
Processing folder: american_pit_bull_terrier
Processing folder: basset_hound
Processing folder: beagle
Processing folder: Bengal
Processing folder: Birman
Processing folder: Bombay
Processing folder: boxer
Processing folder: British_Shorthair
Processing folder: chihuahua
Processing folder: Egyptian_Mau
Processing folder: english_cocker_spaniel
Processing folder: english_setter
Processing folder: german_shorthaired
Processing folder: great_pyrenees
Processing folder: havanese
Processing folder: japanese_chin
Processing folder: keeshond
Processing folder: leonberger
Processing folder: Maine_Coon
Processing folder: miniature_pinscher
Processing folder: newfoundland
Processing folder: Persian
Processing folder: pomeranian
Processing folder: pug
Processing folder: Ragdoll
Processing folder: Russian_Blue
Processing folder: sai

Now all images are split into train and test sets in an 80/20 ratio per breed, assuming each breed folder holds 200 images.
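
Since the split relies on a fixed per-folder count, it can be worth checking the ratio that was actually achieved for each breed. The following is a minimal sketch, assuming the `data/train` and `data/test` layout created above and the `<breed>_<number>.jpg` naming; `count_per_breed` and `split_report` are hypothetical helper names.

```python
import os
import re
from collections import Counter

pattern = re.compile(r'(.+)_\d+\.jpg')


def count_per_breed(folder):
    """Count the images of each breed in a flat folder of <breed>_<n>.jpg files."""
    counts = Counter()
    for filename in os.listdir(folder):
        match = pattern.match(filename)
        if match:
            counts[match.group(1)] += 1
    return counts


def split_report(train_dir, test_dir):
    """Return {breed: (n_train, n_test, train_fraction)} for every breed found."""
    train = count_per_breed(train_dir)
    test = count_per_breed(test_dir)
    report = {}
    for breed in sorted(set(train) | set(test)):
        total = train[breed] + test[breed]  # Counter yields 0 for missing breeds
        report[breed] = (train[breed], test[breed], train[breed] / total)
    return report
```

Running `split_report("data/train", "data/test")` and looking for breeds whose test count is 0, or whose train fraction is far from 0.8, would reveal folders that did not contain the assumed number of images.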

## 2. Store labels in a csv

Another problem we need to solve is that we have all labels in different 