## Part 1: Preparing the CelebA Dataset for a Known vs. Unknown Face Recognition Task

In this part of the project, we prepare the CelebA face dataset for a classification task in which the model must decide whether a face belongs to a known or an unknown individual. The dataset contains over 200 thousand face images, and each identity appears anywhere from a few times to 30 or more. We therefore build a list containing only the IDs of identities with 30 or more images, then randomly sample 10 of them to serve as the "known" individuals.

1. We read from the `identity_CelebA.txt` file to map each image filename to an identity ID. This gives us the necessary labels for determining which images correspond to which person.

2. We create a subset of identities where each chosen ID appears at least 30 times within the dataset. The rest of the identities are excluded from the known-class pool.



In [19]:
import pandas as pd
import random

random.seed(2)

df = pd.read_csv("data/identity_CelebA.txt", sep = " ", header = None, names=["filename", "id"])
counts = df["id"].value_counts()
possible_celebs = counts[counts >= 30].index.tolist()

# Uncomment the print statement below to see the list of celebrity IDs that appear 30 or more times in the dataset.
# print("Possible celebrity IDs: ", possible_celebs)

known_celebs = random.sample(possible_celebs, 10)
print(known_celebs)

[4421, 1231, 6727, 8916, 5027, 1149, 7725, 5982, 4946, 395]
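
To see how the count-based filter behaves without the CelebA files on disk, here is a minimal sketch on a small in-memory table. The filenames, the IDs, and the threshold of 3 (`toy_threshold`) are made up for illustration; only the `filename id` line layout mirrors what the cell above reads from `identity_CelebA.txt`.

```python
import io
import pandas as pd

# Toy stand-in for identity_CelebA.txt: one "filename id" pair per line.
# All values here are invented for illustration.
toy_text = "\n".join(
    [f"{i:06d}.jpg 1" for i in range(1, 5)]      # identity 1: 4 images
    + [f"{i:06d}.jpg 2" for i in range(5, 7)]    # identity 2: 2 images
    + [f"{i:06d}.jpg 3" for i in range(7, 10)]   # identity 3: 3 images
)

toy_df = pd.read_csv(io.StringIO(toy_text), sep=" ", header=None, names=["filename", "id"])

# Count images per identity, then keep identities at or above the threshold.
toy_counts = toy_df["id"].value_counts()
toy_threshold = 3  # hypothetical; the notebook uses 30
toy_possible = sorted(toy_counts[toy_counts >= toy_threshold].index.tolist())

print(toy_possible)  # [1, 3]
```

Identity 2 has only two images, so it falls below the threshold and is dropped from the pool, just as identities with fewer than 30 images are dropped in the real dataset.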


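Now let's partition the data. For each known identity, the images are shuffled and split 70/15/15 into training, validation, and test sets. A further 2,000 images of other identities are sampled to form the "unknown" test set.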

In [26]:
def prep_celeba_splits(df, known_celebs, min_images = 30, train_ratio = 0.7, val_ratio = 0.15, seed = 2):
    # Note: min_images is not used in this function; the 30-image filter
    # is applied earlier, when possible_celebs is built.

    # Separate images of the known identities from all other images.
    df_known = df[df["id"].isin(known_celebs)]
    df_unknown = df[~df["id"].isin(known_celebs)]

    train_rows, val_rows, test_known_rows = [], [], []
    for celeb in known_celebs:
        # Shuffle this identity's images reproducibly before splitting.
        df_celeb = df_known[df_known["id"] == celeb].sample(frac = 1, random_state = seed)
        n = len(df_celeb)

        # int() truncates, so any leftover images go to the test split.
        n_train = int(n * train_ratio)
        n_val = int(n * val_ratio)

        train_rows.append(df_celeb.iloc[:n_train])
        val_rows.append(df_celeb.iloc[n_train:n_train + n_val])
        test_known_rows.append(df_celeb.iloc[n_train + n_val:])

    train_df = pd.concat(train_rows)
    val_df = pd.concat(val_rows)
    test_known_df = pd.concat(test_known_rows)

    # Sample 2,000 images of identities outside the known set for the unknown test set.
    test_unknown_df = df_unknown.sample(2000, random_state = seed)

    return train_df, val_df, test_known_df, test_unknown_df

def test_the_data(train_df, val_df, test_known_df, test_unknown_df, known_celebs):
    print("Training Samples: ", len(train_df))
    print("Validation Samples: ", len(val_df))
    print("Testing Known Samples: ", len(test_known_df))
    print("Testing Unknown Samples: ", len(test_unknown_df))
    print("Known Celebrities: ", len(known_celebs))

train_df, val_df, test_known_df, test_unknown_df = prep_celeba_splits(df, known_celebs)
test_the_data(train_df, val_df, test_known_df, test_unknown_df, known_celebs)

Training Samples:  210
Validation Samples:  40
Testing Known Samples:  50
Testing Unknown Samples:  2000
Known Celebrities:  10
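
The sizes above follow from integer truncation in the split. The totals (210 + 40 + 50 = 300 across 10 identities that each have at least 30 images) imply exactly 30 images per identity, so each one contributes `int(30 * 0.7) = 21` training, `int(30 * 0.15) = 4` validation, and the remaining 5 test images. The sketch below reproduces that arithmetic on synthetic data and also checks that no filename appears in more than one split. The frame `demo_df` and its filenames are invented for illustration; this is not the notebook's own code path.

```python
import pandas as pd

# Synthetic stand-in: 10 identities with exactly 30 images each.
# Filenames and IDs are made up; only the counts matter here.
rows = [(f"img_{cid}_{k}.jpg", cid) for cid in range(10) for k in range(30)]
demo_df = pd.DataFrame(rows, columns=["filename", "id"])

train_parts, val_parts, test_parts = [], [], []
for cid, group in demo_df.groupby("id"):
    shuffled = group.sample(frac=1, random_state=2)
    n_train = int(len(shuffled) * 0.7)    # 21 for 30 images
    n_val = int(len(shuffled) * 0.15)     # 4 for 30 images (4.5 truncated)
    train_parts.append(shuffled.iloc[:n_train])
    val_parts.append(shuffled.iloc[n_train:n_train + n_val])
    test_parts.append(shuffled.iloc[n_train + n_val:])  # remaining 5

demo_train = pd.concat(train_parts)
demo_val = pd.concat(val_parts)
demo_test = pd.concat(test_parts)

print(len(demo_train), len(demo_val), len(demo_test))  # 210 40 50

# Leakage check: the three splits must not share any filename.
train_files = set(demo_train["filename"])
val_files = set(demo_val["filename"])
test_files = set(demo_test["filename"])
splits_disjoint = (
    train_files.isdisjoint(val_files)
    and train_files.isdisjoint(test_files)
    and val_files.isdisjoint(test_files)
)
print(splits_disjoint)  # True
```

Because the validation count is truncated from 4.5 to 4, the effective split per identity is 70% / 13.3% / 16.7% rather than an exact 70/15/15.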
