## Part 1: Preparing the CelebA Dataset for a Known vs. Unknown Face Recognition Task

In this part of the project, we prepare the CelebA face dataset for a classification task in which the model must decide whether a face belongs to a known or an unknown individual. The dataset contains over 200 thousand face images, and each identity appears anywhere from only a few times to 30 or more. We therefore choose the "known" individuals from a list that contains only the IDs of identities with 30 or more images.

1. We read from the `identity_CelebA.txt` file to map each image filename to an identity ID. This gives us the necessary labels for determining which images correspond to which person.

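2. We build a pool of candidate identities in which every ID appears at least 30 times in the dataset; all other identities are excluded from the known-class pool. From this pool we randomly sample 10 IDs, with a fixed random seed for reproducibility, to serve as the known individuals.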



In [19]:
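import pandas as pd
import random

# Fix the seed so the same 10 identities are sampled on every run.
random.seed(2)

# Each line of the file is "<filename> <identity id>", separated by a single space.
df = pd.read_csv("data/identity_CelebA.txt", sep=" ", header=None, names=["filename", "id"])

# Count the images per identity and keep the IDs with at least 30 images.
counts = df["id"].value_counts()
possible_celebs = counts[counts >= 30].index.tolist()

# Uncomment the print statement below to see the list of celebrity IDs that appear 30 or more times in the dataset.
# print("Possible celebrity IDs: ", possible_celebs)

# Randomly pick 10 of the eligible identities as the "known" individuals.
known_celebs = random.sample(possible_celebs, 10)
print(known_celebs)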

[4421, 1231, 6727, 8916, 5027, 1149, 7725, 5982, 4946, 395]
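
One way the sampled IDs could be used afterwards is to tag every image as known or unknown. The following is only a minimal sketch under stated assumptions: it runs on a tiny in-memory stand-in for `identity_CelebA.txt`, and the names `toy_df`, `toy_known`, and `is_known`, as well as the toy IDs, are hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for identity_CelebA.txt (same "filename id" layout, made-up rows).
toy_file = io.StringIO(
    "000001.jpg 7\n"
    "000002.jpg 3\n"
    "000003.jpg 7\n"
    "000004.jpg 9\n"
)
toy_df = pd.read_csv(toy_file, sep=" ", header=None, names=["filename", "id"])

# Hypothetical known list; in the notebook this role would be played by known_celebs.
toy_known = [7]

# isin() flags each row whose identity ID is in the known list.
toy_df["is_known"] = toy_df["id"].isin(toy_known)

# Collect the filenames of images that belong to known identities.
known_files = toy_df.loc[toy_df["is_known"], "filename"].tolist()
print(known_files)
```

With the real data, the same `isin` call on `df["id"]` with `known_celebs` would presumably give a boolean column that separates the known-class images from everything else.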
