# Construction of Signature Pairs Dataset from Forged Signatures Dataset

I found this dataset, which consists of PNG images of handwritten signatures. For each signature, there are 5 real signature instances and 5 fake signature instances. The dataset is intended for binary classification.

https://www.kaggle.com/datasets/divyanshrai/handwritten-signatures 

What I am going to do is construct a dataset of image pairs, each labeled according to whether the two images show the same signature or different signatures. The resulting data will be the first data the model (which I wrote the code for yesterday) will be trained on. For now, I am not going to differentiate between real and forged signatures, only whether the signature itself is the same. This is to determine the capacity of the model from a lower starting point.

## Data Format

The data directory (set as `data_dir` in the first cell below) contains four subdirectories. Each of those subdirectories contains a 'forged' directory and a 'real' directory.

The dataset contains 30 signatures belonging to 30 people. For each person, there are 5 real signature instances written by that person and 5 forged instances written by others.

The file names are 8 digits: the first 3 identify the person who wrote the signature, the next 2 identify which of the 5 instances it is, and the last 3 identify the person the signature actually belongs to.

For example, 00602023.png is person 023's signature forged by person 006, and it is the second of those forged signatures.
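
The naming scheme can be sketched as a small parser. This is only an illustration of the 3 + 2 + 3 digit layout described above; `SignatureName`, `parse_signature_name`, and `is_forgery` are hypothetical names that do not appear elsewhere in this notebook, and treating "writer differs from owner" as the definition of a forgery is my assumption.

```python
from typing import NamedTuple


class SignatureName(NamedTuple):
    writer: str    # first 3 digits: person who wrote this instance
    instance: str  # next 2 digits: which instance it is
    owner: str     # last 3 digits: person the signature belongs to


def parse_signature_name(file_name: str) -> SignatureName:
    """Split an 8-digit file name such as '00602023.png' into its parts."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension
    if len(stem) != 8 or not stem.isdigit():
        raise ValueError(f"expected 8 digits, got {stem!r}")
    return SignatureName(stem[:3], stem[3:5], stem[5:])


def is_forgery(name: SignatureName) -> bool:
    # assumption: a real signature is one written by its owner
    return name.writer != name.owner


parsed = parse_signature_name("00602023.png")
print(parsed, is_forgery(parsed))
```

For the example file name, this yields writer 006, instance 02, and owner 023.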

### 1.)

Below I am gathering the file paths of all the images in the dataset. The images themselves are loaded later.

In [1]:
import os

data_dir = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Signature-Similarity-Checker\data\handwritten-signatures\Dataset_Signature_Final\Dataset"

# lists to store paths of real and forged signatures
real = []
forged = []



# gather paths of all images in the dataset
def get_image_paths(data_dir):
    ''' gathers image paths into two lists: real and forged based on the subfolders in the dataset'''
    for folder in os.listdir(data_dir): # <----- this leads to 4 subdirectories, each containing a 'real' and 'forged' folder
        for subfolder in os.listdir(os.path.join(data_dir, folder)): # <----- this leads to the 'real' and 'forged' folders
            for file in os.listdir(os.path.join(data_dir, folder, subfolder)): # <----- this leads to the images in each of the 'real and 'forged' folders
                if subfolder == 'real':
                    real.append(os.path.join(data_dir, folder, subfolder, file)) # <----- append paths of real signatures to 'real' list
                else:
                    forged.append(os.path.join(data_dir, folder, subfolder, file)) # <----- append paths of forged signatures to 'forged' list


# call function
get_image_paths(data_dir)

print(f'length of real signatures paths: {len(real)}')
print(f'length of forged signatures paths: {len(forged)}')

length of real signatures paths: 270
length of forged signatures paths: 450
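
As a side note, the same traversal could be written in an OS-independent way with `pathlib`. This is a minimal sketch, not the code used above: `gather_image_paths` is a hypothetical name, and it assumes every image has a lowercase `.png` extension and sits directly inside a folder named 'real' or 'forged'.

```python
from pathlib import Path


def gather_image_paths(data_dir):
    """Return (real, forged) lists of PNG paths found under data_dir."""
    real, forged = [], []
    # rglob walks every nested folder, so the directory depth does not matter
    for path in sorted(Path(data_dir).rglob("*.png")):
        if path.parent.name == "real":
            real.append(path)
        elif path.parent.name == "forged":
            forged.append(path)
    return real, forged


# usage (assuming data_dir points at the dataset root):
# real, forged = gather_image_paths(data_dir)
```

Unlike the cell above, this version returns the lists instead of appending to module-level ones, so calling it twice does not double the counts.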


### 2.)
Next I am checking whether there are any duplicates based on the naming schema and separating out only the unique names.

In [2]:
def check_duplicates(image_path_list, name):
    ''' checks for duplicates in the list of image paths
        and returns a dict of the unique file names and 
        their paths'''

    # create lists to store file names and file paths
    file_names, file_paths = [], []

    # iterate through the list of image paths
    for path in image_path_list:
        file_names.append(os.path.basename(path)) # <----- append the file name to the list of file names
        file_paths.append(path)                 # <----- append the file path to the list of file paths

    #check duplicates
    if len(file_names) != len(set(file_names)):  # <----- calling set() removes duplicates
        print(f'There are duplicates in the list of {name} file names.')
        print(f'Out of the {len(file_names)} {name} signatures, there are {len(set(file_names))} unique file names.\n')

    # create list of only the unique file names
    non_duplicate_names = list(set(file_names)) 

    # map each unique file name to a path that contains it; if several
    # paths share a file name, the last one found is kept
    unique_file_paths = {}

    # iterate through the list of unique file names
    # (the loop variable is fname so it does not shadow the name parameter)
    for fname in non_duplicate_names:
        for path in file_paths: # <----- iterate through the list of file paths
            if fname in path:
                unique_file_paths[fname] = path # <----- store the file path in the dictionary with the file
                                                #        name as the key if the file name is in the path
     
    return unique_file_paths


# create dictionaries of the image paths with the file name as the key
real_fnames_unique = check_duplicates(real, name='real')
forged_fnames_unique = check_duplicates(forged, name='forged')


There are duplicates in the list of real file names.
Out of the 270 real signatures, there are 162 unique file names.

There are duplicates in the list of forged file names.
Out of the 450 forged signatures, there are 400 unique file names.
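
The same de-duplication can also be done in a single pass. The sketch below is an alternative, not the function used above: `unique_paths_by_name` is a hypothetical name, and `PureWindowsPath` is used only because it splits on both backslashes and forward slashes regardless of the OS running the code.

```python
from pathlib import PureWindowsPath


def unique_paths_by_name(paths):
    """Map each file name to one path; later duplicates overwrite earlier ones."""
    unique = {}
    for path in paths:
        file_name = PureWindowsPath(path).name  # last path component
        unique[file_name] = path
    return unique


example = [
    r"C:\data\set1\real\00101001.png",
    r"C:\data\set2\real\00101001.png",  # same file name, different folder
    r"C:\data\set1\real\00102001.png",
]
print(len(unique_paths_by_name(example)))
```

Keeping the last path found for each name matches the behaviour of the nested loop in `check_duplicates`, while avoiding the comparison of every name against every path.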



I am going to check that this worked correctly by verifying that each file name key appears in the path stored under it.

In [3]:
# I am going to check if this worked correctly by seeing if all the fnames in the paths match with the fname key
# in the dictionary

non_matching = 0

for key, value in real_fnames_unique.items():
    if key not in value:    
        non_matching += 1

print(f'\n\nThere are {non_matching} real signatures that do not match with their file name.')    



There are 0 real signatures that do not match with their file name.


### 3.)

Next I am going to group the image paths into nested lists, each containing only the paths to signatures belonging to the same owner.

In [4]:
# each owner is identified by a 3-character string: the owner number
# left-padded with zeros (e.g. 1 -> '001')

# create a list of the signature owner strings
signature_owners = []
for i in range(1, 31):
    signature_owners.append(f'{i:0>3}') # <----- formats i right-aligned, with zeros padding
                                        #        the left side of the number up to 3 digits


def group_like_signatures(fname_path_dict_real, fname_path_dict_forged):
    ''' Inputs are dictionaries of the real and forged signatures with keys as the file names
    and values as the paths. The function groups the paths of signatures that belong to the 
    same person into a list of lists'''

    # create a list to store the lists of paths of signatures that belong to the same person
    nested_like_signatures_list = []

    # iterate through the list of signature owners
    for signature_owner in signature_owners:

        # create a list to store the paths of signatures that belong to the same person
        like_signatures_list = []

        # iterate through the dictionary of real signatures
        for fname, path in fname_path_dict_real.items():

            if fname[-7:-4] == signature_owner:  # <----- compare the last 3 digits, excluding .png
                like_signatures_list.append(path)

        # iterate through the dictionary of forged signatures
        for fname, path in fname_path_dict_forged.items():

            if fname[-7:-4] == signature_owner:  # <----- compare the last 3 digits, excluding .png
                like_signatures_list.append(path)
            
        # append the list of paths of signatures that belong to the same person to the list of lists
        nested_like_signatures_list.append(like_signatures_list) 

    return nested_like_signatures_list



# call function to group the real and forged signatures
nested_like_signatures_list = group_like_signatures(real_fnames_unique, forged_fnames_unique)
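
The grouping step can also be expressed with a dictionary keyed by owner, which visits each file once instead of once per owner. This is a sketch under the same naming assumption (the owner is the last 3 digits before `.png`); `group_by_owner` is a hypothetical name, not part of the code above.

```python
from collections import defaultdict


def group_by_owner(*name_to_path_dicts):
    """Group paths by the owner encoded in the last 3 digits of the file name."""
    groups = defaultdict(list)
    for name_to_path in name_to_path_dicts:
        for file_name, path in name_to_path.items():
            owner = file_name[-7:-4]  # last 3 digits, excluding '.png'
            groups[owner].append(path)
    return dict(groups)


real_example = {"00101001.png": "real/00101001.png",
                "00201002.png": "real/00201002.png"}
forged_example = {"00601001.png": "forged/00601001.png"}
groups = group_by_owner(real_example, forged_example)
print(groups["001"])
```

To recover the ordered list of lists used in the rest of the notebook, something like `[groups.get(f'{i:0>3}', []) for i in range(1, 31)]` would work.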

### 4.)
Next I am going to load the like images into nested lists of PIL Images.

In [5]:
from PIL import Image

# create a list to store the PIL images of signatures that belong to the same person
nested_like_PIL_signatures = []

# iterate through the list of lists of paths of signatures that belong to the same person
for path_list in nested_like_signatures_list:

    # create a list to store the PIL images of signatures that belong to the same person
    like_PIL_signatures = []

    # iterate through the list of paths of signatures that belong to the same person
    for path in path_list:

        # append the PIL image of the signature to the list of PIL images
        like_PIL_signatures.append(Image.open(path))

    # append the list of PIL images of signatures that belong to the same person to the list of lists
    nested_like_PIL_signatures.append(like_PIL_signatures)


### 5.)
Next I will begin the labeling and collation process for the images loaded in. The labels will be binary. Images will be grouped into sets of two per example. An example will receive a 1 if the two images belong to the same signature owner. Otherwise the pair will receive a 0.

The first thing I will do to accomplish this is to collate the nested list of PIL images into a list of tensor stacks. To do this I will use the ImageCollator class from image_collator.py. It applies max pooling as a dimensionality reduction technique and converts all images to grayscale tensors.

In [6]:
from image_collator import ImageCollator

# create an instance of the ImageCollator class
collator = ImageCollator()

# collate all like PIL images into separate tensors
like_tensor_image_stacks = [collator.collate(PIL_image_list, num_poolings=2) for PIL_image_list in nested_like_PIL_signatures]
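
To illustrate how max pooling shrinks an image, here is a minimal pure-Python sketch. It is not the ImageCollator code, whose internals are not shown here: `max_pool_2x2` is a hypothetical name, and a 2x2 window with stride 2 is my assumption about the pooling parameters.

```python
def max_pool_2x2(image):
    """Halve height and width by keeping the max of each 2x2 block.

    image is a list of rows of numbers; a trailing odd row or column is dropped.
    """
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[r]) - 1, 2):
            block = (image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
once = max_pool_2x2(grid)   # 4x4 -> 2x2
twice = max_pool_2x2(once)  # 2x2 -> 1x1
print(once, twice)
```

Under that assumption, each pooling pass divides both spatial dimensions by 2, so two passes reduce them by a factor of 4. On tensors, `torch.nn.functional.max_pool2d` performs the same kind of operation.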



Below are the shapes of all the like image tensor groupings. Interestingly, contrary to the documentation of the data on Kaggle, many people have more than 5 forged + 5 legitimate signature images: 14 of the 30 stacks hold more than 10 images, and the largest holds 26.

In [7]:
for tensor in like_tensor_image_stacks:
    print(tensor.shape) 

torch.Size([21, 1, 50, 150])
torch.Size([26, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([23, 1, 50, 150])
torch.Size([18, 1, 50, 150])
torch.Size([20, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([17, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([20, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([15, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])
torch.Size([10, 1, 50, 150])


### 6.)
Next I will take that list of tensor stacks and build image pairs from it with the Build_Batch class in get_batch.py. Labels are not attached in this step; they are created later.

This will be done in two ways. The first uses the .build_like_pairs() method to group all possible pairs of like signatures, which will later be labeled 1. The method takes a single tensor stack of like images and outputs a single stack of these pair combinations. This is applied to every stack in the list.

Below I map the .build_like_pairs() method over each of the tensor stacks.

In [8]:
from get_batch import Build_Batch

# create an instance of the Build_Batch class
builder = Build_Batch()

# map the build_like_pairs method to the list of like tensor image stacks
like_tensor_combos = map(builder.build_like_pairs, like_tensor_image_stacks)

# convert the map object to a list
like_tensor_combos = list(like_tensor_combos)
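
As a rough illustration of what pairing within one stack involves, here is a sketch that builds ordered pairs from a plain list with `itertools.permutations`. It is an assumption about the behaviour, not the actual Build_Batch code, and `build_ordered_pairs` is a hypothetical name. It produces n(n-1) pairs for n items, so a stack of 21 images would give 420 pairs and a stack of 10 would give 90.

```python
from itertools import permutations


def build_ordered_pairs(items):
    """Return every ordered pair of distinct positions in items."""
    return list(permutations(items, 2))


small = build_ordered_pairs(["a", "b", "c"])
print(small)
print(len(build_ordered_pairs(range(21))), len(build_ordered_pairs(range(10))))
```

Note that ordered pairing includes both (a, b) and (b, a); `itertools.combinations` would keep only one of each.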

The output tensors are of the shape (num_examples, 2 images, 1 singleton, height, width).

In [9]:
print(f'There are {len(like_tensor_combos)} stacks of like signature combinations.')

# print all shapes of the like tensors
for tensor in like_tensor_combos:
    print(tensor.shape)

There are 30 stacks of like signature combinations.
torch.Size([420, 2, 1, 50, 150])
torch.Size([650, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([506, 2, 1, 50, 150])
torch.Size([306, 2, 1, 50, 150])
torch.Size([380, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([272, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([380, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([210, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])
torch.Size([90, 2, 1, 50, 150])

Here I am taking the list of per-person stacks of like image pairs and concatenating them all into a single tensor. These are the positive examples of the dataset.

In [10]:
import torch

#concatenate all the like tensors into one tensor
like_tensor_combos = torch.cat(like_tensor_combos, dim=0)

print(f'The shape of the like tensor combos after concatenating is {like_tensor_combos.shape}.')

The shape of the like tensor combos after concatenating is torch.Size([6196, 2, 1, 50, 150]).


### 7.)
I am now going to build a stack of unlike image pairs with the .build_unlike_pairs() method.

As the name suggests, it returns a tensor stack of all combinations of pairs between images that do not belong to the same signature owner.

Currently, a weakness in the algorithm I am using to do this is that it does not check for duplicate examples.

Duplicates present two issues. The first is that, in a single epoch, the model will predict on the same example multiple times, so that example contributes to the learning update more than once. The effect is similar to giving the duplicated examples extra weight in the gradient. This is not particularly harmful, but it leads to less control over training.

The second issue is a bit more insidious: if copies of the same example end up in both the training set and the test set, the evaluation scores can be inflated, which leads to less of an understanding of how to improve the model.

I will address this soon. For now, though, I want to build out the rest of the pipeline.
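
One possible way to detect such duplicates is to hash the raw bytes of each image and treat a pair as an unordered set of two hashes. This is a sketch of the idea only, not something implemented in get_batch.py: `image_key` and `dedupe_pairs` are hypothetical names, and it assumes each image can be turned into bytes (for a tensor, something like `tensor.numpy().tobytes()`).

```python
import hashlib


def image_key(image_bytes: bytes) -> str:
    """Content hash, so identical images from different files share a key."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedupe_pairs(pairs):
    """Drop repeated pairs, treating (a, b) and (b, a) as the same example."""
    seen = set()
    unique = []
    for first, second in pairs:
        key = frozenset((image_key(first), image_key(second)))
        if key not in seen:
            seen.add(key)
            unique.append((first, second))
    return unique


pairs = [(b"img_x", b"img_y"),
         (b"img_y", b"img_x"),  # mirrored copy of the first pair
         (b"img_x", b"img_z"),
         (b"img_x", b"img_y")]  # exact repeat of the first pair
print(len(dedupe_pairs(pairs)))
```

Running this kind of filter before the train and test split would keep a pair and its mirror from landing on opposite sides of the split.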

In [11]:
unlike_tensor_combos = builder.build_unlike_pairs(like_tensor_image_stacks)

print(f'The shape of the unlike tensor combos is {unlike_tensor_combos.shape}.')

The shape of the unlike tensor combos is torch.Size([171466, 2, 1, 50, 150]).
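
The pair counts can be sanity-checked with plain arithmetic from the stack sizes printed in step 5. The sketch below assumes ordered pairs (each pair counted in both orders), which is my reading of the numbers rather than something stated in get_batch.py.

```python
# number of images per signature owner, copied from the shapes printed in step 5
stack_sizes = [21, 26, 17, 17, 17, 17, 17, 23, 18, 20,
               17, 17, 10, 10, 10, 20, 10, 10, 10, 10,
               10, 10, 10, 15, 10, 10, 10, 10, 10, 10]

total_images = sum(stack_sizes)

# ordered pairs within each owner's stack: n * (n - 1)
like_pairs = sum(n * (n - 1) for n in stack_sizes)

# all ordered pairs of distinct images, minus the within-owner ones
unlike_pairs = total_images * (total_images - 1) - like_pairs

print(total_images, like_pairs, unlike_pairs)
```

Both results agree with the tensor shapes above: 6196 like pairs and 171466 unlike pairs from 422 images.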


### 8.)
Next, I am going to create a train and validation split for each of the like and unlike tensor combos. I'm doing this separately for the two labels so that I can control the distribution of labels in each of these sets. I will also create the labels for these tensors here. They will be kept in a separate tensor from the feature tensors so that both can easily be wrapped in a data loader.

In [12]:

def split(tensor_combos, like_or_unlike):
    ''' splits the tensor combos into training and validation sets
        and creates labels for the tensors'''

    # compute 80 percent of the length of the tensor combos
    eighty_percent = int(0.8 * len(tensor_combos))

    # split the tensor combos into training and validation sets
    train = tensor_combos[:eighty_percent]
    val = tensor_combos[eighty_percent:]

    # create labels for the unlike or like tensors depending on the input
    if like_or_unlike == 'unlike':
        # unlike pairs are labeled 0
        train_labels = torch.zeros(len(train))
        val_labels = torch.zeros(len(val))
    elif like_or_unlike == 'like':
        # like pairs are labeled 1
        train_labels = torch.ones(len(train))
        val_labels = torch.ones(len(val))
    else:
        # fail early instead of hitting an unbound variable at the return
        raise ValueError("like_or_unlike must be 'like' or 'unlike'")

    return train, val, train_labels, val_labels


# split like tensors
like_train, like_val, like_train_labels, like_val_labels = split(like_tensor_combos, like_or_unlike = 'like')

# split unlike tensors
unlike_train, unlike_val, unlike_train_labels, unlike_val_labels = split(unlike_tensor_combos, like_or_unlike='unlike')

In [15]:
def print_shapes(train_, val_, train_labels_, val_labels_, condition):
    ''' prints the shapes of the training and validation sets and their labels'''
    # print shapes
    print(f'\n\nThe shape of the {condition} training set is {train_.shape}.')
    print(f'The shape of the {condition} validation set is {val_.shape}.')
    print(f'The shape of the {condition} train labels is {train_labels_.shape}.')
    print(f'The shape of the {condition} val labels is {val_labels_.shape}.')

# print shapes of like tensors
print_shapes(like_train, like_val, like_train_labels, like_val_labels, condition = 'like')

# print shapes of unlike tensors
print_shapes(unlike_train, unlike_val, unlike_train_labels, unlike_val_labels, condition = 'unlike')



The shape of the like training set is torch.Size([4956, 2, 1, 50, 150]).
The shape of the like validation set is torch.Size([1240, 2, 1, 50, 150]).
The shape of the like train labels is torch.Size([4956]).
The shape of the like val labels is torch.Size([1240]).


The shape of the unlike training set is torch.Size([137172, 2, 1, 50, 150]).
The shape of the unlike validation set is torch.Size([34294, 2, 1, 50, 150]).
The shape of the unlike train labels is torch.Size([137172]).
The shape of the unlike val labels is torch.Size([34294]).


Here I am concatenating the like and unlike tensors into one training set and one validation set. The same is done for the labels, so features and targets stay aligned.

In [16]:
#concatenate the like and unlike training features and labels
train = torch.cat((like_train, unlike_train), dim=0)
train_labels = torch.cat((like_train_labels, unlike_train_labels), dim=0)

#concatenate the like and unlike validation sets and labels
val = torch.cat((like_val, unlike_val), dim=0)
val_labels = torch.cat((like_val_labels, unlike_val_labels), dim=0)

The wrangling is done. The tensors will be wrapped in data loaders before running training, which will take care of shuffling the examples.

In [17]:
# print shapes
print(f'The shape of the training set is {train.shape}.')
print(f'The shape of the training labels is {train_labels.shape}.\n')

print(f'The shape of the validation set is {val.shape}.')
print(f'The shape of the validation labels is {val_labels.shape}.')

The shape of the training set is torch.Size([142128, 2, 1, 50, 150]).
The shape of the training labels is torch.Size([142128]).

The shape of the validation set is torch.Size([35534, 2, 1, 50, 150]).
The shape of the validation labels is torch.Size([35534]).
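
Since the like and unlike pairs were split separately, the share of positive examples should be nearly the same in both sets. A quick check with the counts printed above (`positive_fraction` is a hypothetical helper, not part of the notebook code):

```python
def positive_fraction(num_like, num_unlike):
    """Share of examples labeled 1 in a set."""
    return num_like / (num_like + num_unlike)


# counts taken from the shapes printed in step 8
train_fraction = positive_fraction(4956, 137172)
val_fraction = positive_fraction(1240, 34294)

print(f'train: {train_fraction:.4f}, val: {val_fraction:.4f}')
```

Both come out at roughly 3.5% positives, so the two sets are balanced against each other, although each one is heavily skewed toward unlike pairs.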


In [19]:
import os

cwd = os.getcwd()

# make a data directory (no error if it already exists)
os.makedirs(os.path.join(cwd, 'data'), exist_ok=True)

# save the training and validation sets to the data directory
torch.save(train, os.path.join(cwd, 'data', 'train_examples.pt'))
torch.save(train_labels, os.path.join(cwd, 'data', 'train_labels.pt'))
torch.save(val, os.path.join(cwd, 'data', 'val_examples.pt'))
torch.save(val_labels, os.path.join(cwd, 'data', 'val_labels.pt'))