# Data Cleaning

In this notebook, our goal is to prepare all the data needed for training and testing our models.

In [16]:
import cv2
import os
import pandas as pd
from utils.Utils import get_min_dimensions, check_duplicates, remove_duplicates, breeds, extract_breed, process_file

## 1. Reorganize images into folders

In this step, we group the images into one folder per breed so that both the train and test sets contain every breed. Had we skipped this step, the test set could have ended up with breeds that the model was never trained on.

In [17]:
import os
import re
from shutil import copy

# Path to your data folder
data_folder_path = "data"

# Create a dictionary to store breeds and corresponding folder paths
breed_folders = {}

# Regular expression pattern to extract breed name
pattern = re.compile(r'(.+)_\d+\.jpg')

# Iterate through each file in the data folder
for filename in os.listdir(data_folder_path):
    # Match the pattern to extract the breed name
    match = pattern.match(filename)
    
    if match:
        breed_name = match.group(1)
        
        # Create a folder for the breed if it doesn't exist
        if breed_name not in breed_folders:
            breed_folder = os.path.join(data_folder_path, breed_name)
            os.makedirs(breed_folder, exist_ok=True)
            breed_folders[breed_name] = breed_folder
        
        # Copy the file to the respective breed folder
        src_path = os.path.join(data_folder_path, filename)
        dest_path = os.path.join(breed_folders[breed_name], filename)
        copy(src_path, dest_path)
        
        # Remove the file from the data folder
        os.remove(src_path)

# Print the list of breed folders
print("Breed folders created:\n")
print(100*'*')
for breed_name, folder_path in breed_folders.items():
    print(f"{breed_name}: {folder_path}")





Breed folders created:

****************************************************************************************************


In [18]:
files_folder = 200  # expected number of images in each breed folder
train_ratio = 0.8
num_files = train_ratio * files_folder  # images per breed that go to train

data_folder="data/"
train_dest="data/train/"
test_dest="data/test/"

# Specify the path where you want to create the new folders
base_path = data_folder

# Create "train" folder
train_path = os.path.join(base_path, "train")
os.makedirs(train_path, exist_ok=True)
print(f"Folder 'train' created successfully at '{train_path}'")

# Create "test" folder
test_path = os.path.join(base_path, "test")
os.makedirs(test_path, exist_ok=True)
print(f"Folder 'test' created successfully at '{test_path}'\n")

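for folder in os.listdir(data_folder):
    folder_path = os.path.join(data_folder, folder)
    # Process every subfolder except the train/test destinations.
    # Note: hidden folders such as .ipynb_checkpoints also pass this check.
    if os.path.isdir(folder_path) and folder not in ("train", "test"):
        i = 0
        print("Processing folder: " + str(folder))
        # os.listdir returns the files in arbitrary order
        for filename in os.listdir(folder_path):
            img = os.path.join(folder_path, filename)
            # The first num_files images go to train, the rest go to test
            if i < num_files:
                dest_path = train_dest
            else:
                dest_path = test_dest
            copy(img, dest_path)
            os.remove(img)
            i += 1
        # Remove the now-empty breed folder
        os.rmdir(folder_path)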


Folder 'train' created successfully at 'data/train'
Folder 'test' created successfully at 'data/test'

Processing folder: .ipynb_checkpoints


Now we have all images split into train and test sets in an 80/20 ratio.
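A quick sanity check of the split can be sketched as follows. This is a minimal sketch, not part of the original notebook: `count_breeds` and `split_report` are hypothetical helpers, and they assume the same `<breed>_<number>.jpg` file naming used in the reorganization step.

```python
import os
import re
from collections import Counter

# Assumed naming convention (same as in step 1): <breed>_<number>.jpg
BREED_PATTERN = re.compile(r"(.+)_\d+\.jpg$")


def count_breeds(folder):
    """Count images per breed in a folder, based on file names only."""
    counts = Counter()
    for filename in os.listdir(folder):
        match = BREED_PATTERN.match(filename)
        if match:
            counts[match.group(1)] += 1
    return counts


def split_report(train_folder, test_folder):
    """Return the train fraction per breed and the breeds missing from a split."""
    train_counts = count_breeds(train_folder)
    test_counts = count_breeds(test_folder)
    all_breeds = set(train_counts) | set(test_counts)
    # Fraction of each breed's images that ended up in the train folder
    ratios = {
        breed: train_counts[breed] / (train_counts[breed] + test_counts[breed])
        for breed in all_breeds
    }
    # Breeds that are absent from either train or test
    missing = {
        breed for breed in all_breeds
        if train_counts[breed] == 0 or test_counts[breed] == 0
    }
    return ratios, missing
```

Calling `split_report("data/train", "data/test")` should return an empty `missing` set when every breed is present in both folders, and ratios close to 0.8 when each breed folder held the expected number of images.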

## 2. Store labels in a csv

Another problem we need to solve is that our labels are spread across different sources. The class and breed IDs are in txt files, the bounding boxes are in xml files, and the trimaps are in yet another location. Our goal is to unify most of this data in a single csv, while the trimaps stay in a separate folder.
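To illustrate the kind of extraction this involves, here is a minimal sketch of reading one bounding box from an xml file. It is not the `process_file` implementation from `utils.Utils`, which is not shown here. The helper name `read_bounding_box` is hypothetical, and the sketch assumes a Pascal VOC-style layout (`object/bndbox` with `xmin`, `ymin`, `xmax`, `ymax` tags), which may differ from the actual annotation files.

```python
import os
import xml.etree.ElementTree as ET


def read_bounding_box(xml_path):
    """Return (xmin, ymin, xmax, ymax) from a VOC-style xml file.

    Returns None when the file does not exist or contains no bounding box,
    so that images without annotations can be detected later.
    """
    if not os.path.exists(xml_path):
        return None
    # Assumed layout: <annotation><object><bndbox>...</bndbox></object></annotation>
    box = ET.parse(xml_path).getroot().find("object/bndbox")
    if box is None:
        return None
    return tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
```

Returning `None` instead of raising keeps the missing-annotation case explicit, so the caller can decide whether to write an empty row or skip the image.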

In [19]:
import os
import csv


# Specify the paths for the train and test folders
train_folder_path = "data/train"
test_folder_path = "data/test"

# Specify the paths for the output CSV files
train_output = "data/y_train.csv"
test_output = "data/y_test.csv"

# Process files in the train folder
for filename in os.listdir(train_folder_path):
    file_path = os.path.join(train_folder_path, filename)
    process_file(file_path, train_output)

# Process files in the test folder
for filename in os.listdir(test_folder_path):
    file_path = os.path.join(test_folder_path, filename)
    process_file(file_path, test_output)



In [20]:
import csv
# Specify the path for the existing CSV file
existing_csv_path = "data/y_train.csv"
existing_csv_path2 = "data/y_test.csv"
# Check for duplicates
train_duplicates = check_duplicates(existing_csv_path)
test_duplicates = check_duplicates(existing_csv_path2)

if train_duplicates or test_duplicates:
    print("Duplicate IDs found:")
    print("Train.csv has "+str(len(train_duplicates))+" duplicated rows")
    print("Test.csv has "+str(len(test_duplicates))+" duplicated rows")
else:
    print("No duplicates found.")


Duplicate IDs found:
Train.csv has 0 duplicated rows
Test.csv has 3 duplicated rows
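The `check_duplicates` helper lives in `utils.Utils` and is not shown in this notebook. As a rough idea of what such a check can look like, here is a minimal sketch using only the standard library. The function name `find_duplicate_rows` is hypothetical, and it assumes that a duplicate means two rows that are identical in every field, which may not match the actual helper (it could, for example, compare only the ID column).

```python
import csv
from collections import Counter


def find_duplicate_rows(csv_path):
    """Return a dict mapping each repeated row to the number of times it appears."""
    with open(csv_path, newline="") as f:
        # Rows are converted to tuples so they can be counted
        rows = [tuple(row) for row in csv.reader(f)]
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}
```

An empty dict means the file has no fully identical rows.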


In [21]:
import pandas as pd

# Specify the paths for the existing and output CSV files
existing_csv_path = "data/y_train.csv"
output_csv_path = "data/y_train.csv"

# Remove training duplicated rows
remove_duplicates(existing_csv_path, output_csv_path)

existing_csv_path = "data/y_test.csv"
output_csv_path = "data/y_test.csv"

# Remove testing duplicated rows
remove_duplicates(existing_csv_path, output_csv_path)

print("Duplicated rows removed.")


Duplicated rows removed.


As we can see, there are images that have no annotations. We should complete them.
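One way to list the affected images is sketched below. This is a minimal sketch under an assumption: it treats a missing annotation as an empty field in the csv, which depends on how `process_file` writes rows and is not confirmed here. The function name `rows_with_missing_values` is hypothetical.

```python
import csv


def rows_with_missing_values(csv_path):
    """Return (row_number, row) pairs where at least one field is empty."""
    incomplete = []
    with open(csv_path, newline="") as f:
        for row_number, row in enumerate(csv.reader(f), start=1):
            # Assumption: a missing annotation is stored as an empty field
            if any(field.strip() == "" for field in row):
                incomplete.append((row_number, row))
    return incomplete
```

Running it on `data/y_train.csv` and `data/y_test.csv` would give the rows that still need to be completed.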

## 3. 