# Data Wrangling for Model 2

This notebook was made after I finished my redesign of model 1 (done using MNIST). It contains the data preparation for training model 2.

The goal is to get the data into the same format as MNIST when you load it using torch. That is, with integer-encoded labels.

The data that I am using comes from: https://www.kaggle.com/datasets/robinreni/signature-verification-dataset. It is the same data that I used for model 1, but this version has been processed into a far better storage structure. The dataset contains real and forged signatures for 70 people. The signatures for each person, both real and forged, are grouped into their own subdirectories.

I will prepare the first of two versions of this dataset now. The first version will include only the real signatures. It will be labeled with integer encoding based on who the signature belongs to.

The second version, which I will hold off on making until I can train a model on the first, will contain both the real and forged signatures from the dataset. Signatures of the same person but different forgery status will be considered different classes. This version will also be labeled with integer encoding.
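
As a rough illustration of the integer encoding described above, the sketch below maps directory names to class indices so that a person's real and forged signatures end up with different labels. The `_forg` suffix, the sample directory names, and the `build_class_index` helper are assumptions made for illustration only.

```python
def build_class_index(dir_names):
    """Map each distinct directory name to an integer class.

    Sorting first keeps the mapping reproducible between runs.
    """
    return {name: idx for idx, name in enumerate(sorted(set(dir_names)))}


# Hypothetical directory names: one real and one forged folder per person.
dir_names = ["001", "001_forg", "002", "002_forg"]
class_index = build_class_index(dir_names)
print(class_index)
# → {'001': 0, '001_forg': 1, '002': 2, '002_forg': 3}
```

With this scheme the real and forged signatures of person 001 become classes 0 and 1, which matches the idea of treating forgery status as part of the class.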

In [None]:
import boto3
import os

# Initialize the S3 client
s3 = boto3.client('s3')

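def download_folder_from_s3(bucket_name, s3_folder, local_path):
    """Download every object under the s3_folder prefix to local_path/<object key>."""
    paginator = s3.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket_name, Prefix=s3_folder):
        # Download each file individually
        for obj in result.get('Contents', []):
            file_key = obj['Key']
            # Skip "folder" placeholder keys, which have no file to download
            if file_key.endswith('/'):
                continue
            # Use os.path.join so the local root and the key are not run together
            destination = os.path.join(local_path, file_key)
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            s3.download_file(bucket_name, file_key, destination)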

# Example usage
download_folder_from_s3('signature-data', 'test', 'test')
download_folder_from_s3('signature-data', 'train', 'train')


### Dataset Version 1
Below I gather the paths to all the real-signature subdirectories in the train and test directories.

In [10]:
import os

test_dir = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Signature-Similarity-Checker\data\signature-verification-dataset\sign_data\test"
train_dir = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Signature-Similarity-Checker\data\signature-verification-dataset\sign_data\train"

test_dir_contents = os.listdir(test_dir)
train_dir_contents = os.listdir(train_dir)

test_real_sig_path, train_real_sig_paths = [], []

# Get all the paths to the real signatures 
for subdir in test_dir_contents:
    if "forg" not in subdir:
        test_real_sig_path.append(os.path.join(test_dir, subdir))

for subdir in train_dir_contents:
    if "forg" not in subdir:
        train_real_sig_paths.append(os.path.join(train_dir, subdir))


#print first 5 paths of each list
print(test_real_sig_path[:5])
print(train_real_sig_paths[:5])

['C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\test\\049', 'C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\test\\050', 'C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\test\\051', 'C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\test\\052', 'C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\test\\053']
['C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verification-dataset\\sign_data\\train\\001', 'C:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Signature-Similarity-Checker\\data\\signature-verifi

I call the get_tensor_labels_features() function to load the images from the directory paths gathered above, apply the appropriate tensor transformations to them, and return a labels tensor and a features (images) tensor.

In [12]:
from PIL import Image
import torchvision.transforms as transforms
import torch

# Initialize transforms to apply to images
transform = transforms.ToTensor()
resize = transforms.Resize((224, 224))



def get_tensor_labels_features(tensor_image_list, path_list):
    '''
        This function receives a list of paths to directories containing multiple images, where the 
        name of each directory is the label for the images inside it. The function iterates 
        through each path, converts the images to tensors, resizes them to 3x224x224, and appends them 
        to tensor_image_list (which is modified in place). It also builds a list of labels, one per 
        image in each path. These lists are converted to tensors and stacked. The function returns a 
        tensor of labels and a tensor of images.
    '''

    labels = []

    # iterate through each path in the list
    for path in path_list:

        # get label from the directory name (last component of the path), independent of the OS path separator
        label = int(os.path.basename(os.path.normpath(path)))

        # initialize list to hold labels for each image in the path
        dir_labels = []

        # iterate through each file in the path
        for file in os.listdir(path):

            # convert image to a 3-channel tensor (convert("RGB") guards against grayscale or RGBA files)
            tensor_image = transform(Image.open(os.path.join(path, file)).convert("RGB"))

            # resize images to 224x224
            tensor_image = resize(tensor_image)

            # append tensor to list
            tensor_image_list.append(tensor_image)
            
            # append single item tensors to list of labels for the path
            dir_labels.append(torch.tensor(label))

        # append list of labels to labels list after stacking
        labels.append(torch.stack(dir_labels))

    # concatenate the per-directory label tensors into one tensor, stack the images into one tensor, and return both
    return torch.cat(labels), torch.stack(tensor_image_list)



            


# initialize lists to hold images as tensors
test_real_sig_images, train_real_sig_images = [], []

# get tensor images
test_labels, test_images = get_tensor_labels_features(test_real_sig_images, test_real_sig_path)
train_labels, train_images = get_tensor_labels_features(train_real_sig_images, train_real_sig_paths)



In [13]:
#print shapes
print(f'Test Images Shape: {test_images.shape}')
print(f'Test Labels Shape: {test_labels.shape}\n')

print(f'Train Images Shape: {train_images.shape}')
print(f'Train Labels Shape: {train_labels.shape}')


Test Images Shape: torch.Size([252, 3, 224, 224])
Test Labels Shape: torch.Size([252])

Train Images Shape: torch.Size([887, 3, 224, 224])
Train Labels Shape: torch.Size([887])
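
The labels produced above are the directory numbers themselves (for example, 49 for directory 049), so they do not necessarily start at 0 or run without gaps. If the training loss expects class indices in the range 0 to C-1, as `torch.nn.CrossEntropyLoss` does for class-index targets, one possible remapping is sketched below. The `remap_labels` helper and the sample values are hypothetical.

```python
import torch


def remap_labels(labels):
    """Return (classes, contiguous) where contiguous holds indices 0..C-1.

    classes[contiguous] reproduces the original labels.
    """
    classes, contiguous = torch.unique(labels, return_inverse=True)
    return classes, contiguous


# Hypothetical raw labels taken from directory names such as 049, 050, 053.
raw = torch.tensor([49, 49, 50, 53, 50])
classes, contiguous = remap_labels(raw)
print(classes.tolist())     # → [49, 50, 53]
print(contiguous.tolist())  # → [0, 0, 1, 2, 1]
```

Keeping `classes` around makes it possible to translate a predicted index back to the original person number.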


All that's left to do before training is to convert these tensors to tensor datasets and wrap them in dataloaders, which should be done in the training environment.

In [14]:
# save tensors to file
torch.save(test_images, "test_images.pt")
torch.save(test_labels, "test_labels.pt")
torch.save(train_images, "train_images.pt")
torch.save(train_labels, "train_labels.pt")
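
In the training environment, the saved files could be loaded and wrapped along these lines. This is a minimal sketch: the `make_loader` helper, the batch size, and the small random stand-in tensors are assumptions so that the example runs without the real `.pt` files.

```python
import os
import tempfile

import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(images_path, labels_path, batch_size=32, shuffle=True):
    """Load saved image and label tensors and wrap them in a DataLoader."""
    images = torch.load(images_path)
    labels = torch.load(labels_path)
    dataset = TensorDataset(images, labels)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# Stand-in data: 10 tiny random "images" with integer labels.
with tempfile.TemporaryDirectory() as tmp:
    images_file = os.path.join(tmp, "train_images.pt")
    labels_file = os.path.join(tmp, "train_labels.pt")
    torch.save(torch.rand(10, 3, 8, 8), images_file)
    torch.save(torch.arange(10), labels_file)

    loader = make_loader(images_file, labels_file, batch_size=4)
    batch_images, batch_labels = next(iter(loader))

print(tuple(batch_images.shape), tuple(batch_labels.shape))
# → (4, 3, 8, 8) (4,)
```

With the real files, the same helper would be called on `train_images.pt` and `train_labels.pt`, typically with `shuffle=False` for the test set.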