## Part 1: Preparing the CelebA Dataset for a Known vs. Unknown Face Recognition Task

In this part of the project, we prepare the CelebA face dataset for a classification task in which the model must decide whether a face belongs to a known or an unknown individual. The dataset contains over 200 thousand face images, and each identity appears anywhere from only a few times to 30 or more. We therefore select the "known" individuals from a list that contains only the IDs of individuals with 30 or more appearances.

1. We read from the `identity_CelebA.txt` file to map each image filename to an identity ID. This gives us the necessary labels for determining which images correspond to which person.

2. We create a subset of identities where each chosen ID appears at least 30 times within the dataset. The rest of the identities are excluded from the known-class pool.



In [1]:
import pandas as pd
import random
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
import torchvision.transforms as transforms
from torchvision.models import resnet50
from PIL import Image
import os
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

random.seed(2)

df = pd.read_csv("data/identity_CelebA.txt", sep = " ", header = None, names=["filename", "id"])
counts = df["id"].value_counts()
possible_celebs = counts[counts >= 30].index.tolist()
# Uncomment the print statement below to see the celebrity IDs that appear 30 or more times in the dataset.
# print("Possible celebrity IDs: ", possible_celebs)

known_celebs = random.sample(possible_celebs, 10)
# print(known_celebs)

# creating a list of possible celebs with the filename still associated
celebs_with_rows = df[df["id"].isin(possible_celebs)]
# print(celebs_with_rows)

# known celebs with row data
known_with_rows = df[df["id"].isin(known_celebs)]
# print(known_with_rows)

# One-hot encode the id column
one_hot_labels = pd.get_dummies(known_with_rows["id"], prefix="id")
# print(one_hot_labels)

known_with_onehot = pd.concat([known_with_rows, one_hot_labels], axis=1)
# print(known_with_onehot)

ResNet-50 is a pretrained convolutional network. With its final classification layer removed, it maps each image to a 2048-dimensional feature vector. We resize and normalize each image, convert it to a tensor so it is compatible with the model, and run the pretrained model on our dataset.
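
As a small illustration of the preprocessing used in the next cell, the sketch below mimics in plain Python what `ToTensor` and `Normalize` do to a single 8-bit RGB pixel. The helper names `to_unit_range` and `normalize` are hypothetical and exist only for this illustration; the actual pipeline uses the torchvision transforms.

```python
# ImageNet channel statistics, the same values passed to Normalize in the next cell.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def to_unit_range(pixel):
    """Mimic ToTensor for one 8-bit RGB pixel: scale 0..255 to 0.0..1.0."""
    return [channel / 255.0 for channel in pixel]


def normalize(pixel01):
    """Mimic Normalize: subtract the channel mean, then divide by the channel std."""
    return [(v - m) / s for v, m, s in zip(pixel01, IMAGENET_MEAN, IMAGENET_STD)]


# Hypothetical example pixels (pure white and pure black).
white = normalize(to_unit_range([255, 255, 255]))
black = normalize(to_unit_range([0, 0, 0]))

print("white:", [round(v, 3) for v in white])
print("black:", [round(v, 3) for v in black])
```

After this step, a pixel equal to the ImageNet mean maps to zero in every channel, which matches the input distribution the pretrained weights were trained on.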

In [2]:
# Load the pretrained ResNet-50 and remove the final classification layer
model = resnet50(weights="DEFAULT")
model = torch.nn.Sequential(*list(model.children())[:-1])  
model.eval()

# Resize every image to 224x224, convert it to a tensor, and normalize with the ImageNet mean and std that ResNet expects
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# Load an image, apply the transform, and run it through ResNet; no_grad skips gradient tracking to save time and memory
def extract_feature(img_path):
    img = Image.open(img_path).convert("RGB")
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        feat = model(x).squeeze().numpy()  
    return feat

features = []

# For each row, extract the image's features and store the filename, id, feature vector, and one-hot labels
for _, row in known_with_onehot.iterrows():
    filename = row["filename"]
    img_path = os.path.join("data/img_align_celeba", filename)

    feature_vec = extract_feature(img_path)

    features.append({
        "filename": filename,
        "id": row["id"],
        "feature": feature_vec,
        **{col: row[col] for col in one_hot_labels.columns}
    })

feature_df = pd.DataFrame(features)
# print(feature_df.head())


In [11]:
# stack the feature vectors into a 2D array (one row per image)
X = np.stack(feature_df["feature"].values)

# remap the ids to class indices 0-9, keeping the row order (and so the link to filenames)
y_raw = feature_df["id"].values
unique_ids = np.unique(y_raw)
id_to_idx = {old:i for i, old in enumerate(unique_ids)}
y = np.array([id_to_idx[x] for x in y_raw])


#train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# run data through linear SVC
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Test Accuracy: 0.6444444444444445

Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.89      0.73         9
           1       0.82      1.00      0.90         9
           2       0.50      0.56      0.53         9
           3       0.50      0.67      0.57         9
           4       0.67      0.44      0.53         9
           5       0.44      0.44      0.44         9
           6       0.71      0.56      0.62         9
           7       1.00      0.44      0.62         9
           8       0.67      0.89      0.76         9
           9       0.83      0.56      0.67         9

    accuracy                           0.64        90
   macro avg       0.68      0.64      0.64        90
weighted avg       0.68      0.64      0.64        90



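The SVC correctly classified 83% of the test samples at a 10% test size, 77% at 20%, and 64% at 30% (the run shown above). These large swings suggest the dataset is too small to give a reliable estimate: at a 30% split the test set holds only 90 images, 9 per identity, so a handful of misclassifications shifts the accuracy by several points.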
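
To put the fluctuation in perspective, a rough binomial standard error can be computed for each test size. This is only a sketch: it uses the normal approximation, and it assumes about 300 images in total, so that the 10%, 20%, and 30% splits hold roughly 30, 60, and 90 test samples (only the 90 is confirmed by the classification report). The function name `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# (reported accuracy, assumed number of test samples) for each split.
# Assumption: roughly 300 images overall, so 10%/20%/30% -> 30/60/90 samples.
runs = [(0.83, 30), (0.77, 60), (0.64, 90)]

for acc, n in runs:
    half_width = 1.96 * accuracy_standard_error(acc, n)
    print(f"n_test={n:3d}  accuracy={acc:.2f}  approx. 95% interval: +/- {half_width:.2f}")
```

With test sets this small, the uncertainty on each accuracy is on the order of ten percentage points, so part of the gap between the splits may simply be sampling noise.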

In [12]:
import numpy as np

probs = clf.predict_proba(X_test)
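# fraction of test samples whose true label is among the k most probable classes
# (column i of probs corresponds to clf.classes_[i], which equals label i here)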
def top_k_accuracy(probs, y_true, k=3):
    k = min(k, probs.shape[1])
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return np.mean([
        y_true[i] in top_k[i]
        for i in range(len(y_true))
    ])

print("Top-5 Accuracy:", top_k_accuracy(probs, y_test, k=5))


Top-5 Accuracy: 0.8444444444444444


In the run shown above, the correct ID is among the five most probable classes for about 84% of the test samples. That is not nearly as impressive as it sounds when there are only ten options, but it is a reasonable foundation for a model. With more time and resources, this model could learn known faces with a higher level of accuracy.
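
Top-k figures are easier to judge against a chance baseline. With ten balanced classes, a classifier that ranks the classes at random places the true class in its top k with probability k/10. The sketch below computes that baseline and checks it with a small simulation; the function names are hypothetical and the simulation is independent of the SVC above.

```python
import random


def chance_top_k(num_classes, k):
    """Probability that a random ranking puts the true class in the top k."""
    return min(k, num_classes) / num_classes


def simulated_random_top_k(num_classes, k, trials=20000, seed=0):
    """Estimate the same probability by shuffling class rankings."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    hits = 0
    for _ in range(trials):
        true_class = rng.randrange(num_classes)
        ranking = classes[:]
        rng.shuffle(ranking)
        # Count a hit when the true class is among the first k ranked classes.
        if true_class in ranking[:k]:
            hits += 1
    return hits / trials


for k in (1, 2, 5):
    print(f"k={k}: chance={chance_top_k(10, k):.2f}, "
          f"simulated={simulated_random_top_k(10, k):.3f}")
```

For ten classes the top-5 chance level is 50%, so a top-5 score should be read as an improvement over that baseline rather than over zero.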

In [5]:
# same data prep as before except reformatted for torch
y_raw = feature_df["id"].values

unique_ids = np.unique(y_raw)
id_to_idx = {old:i for i, old in enumerate(unique_ids)}
y = np.array([id_to_idx[x] for x in y_raw])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test  = torch.tensor(X_test,  dtype=torch.float32)

y_train_t = torch.tensor(y_train, dtype=torch.long)
y_test_t  = torch.tensor(y_test,  dtype=torch.long)

num_classes = len(np.unique(y))
input_dim = X_train.shape[1] 

class PerceptronClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

# a single linear layer trained with cross entropy (i.e. softmax regression);
# cross entropy heavily penalizes confident incorrect guesses
model = PerceptronClassifier(input_dim, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train_t)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

with torch.no_grad():
    preds = model(X_test).argmax(dim=1)
    accuracy = (preds == y_test_t).float().mean().item()

print("Perceptron Accuracy:", accuracy)

Epoch 10/100, Loss: 1.6985
Epoch 20/100, Loss: 1.1868
Epoch 30/100, Loss: 0.8467
Epoch 40/100, Loss: 0.6252
Epoch 50/100, Loss: 0.4784
Epoch 60/100, Loss: 0.3779
Epoch 70/100, Loss: 0.3068
Epoch 80/100, Loss: 0.2547
Epoch 90/100, Loss: 0.2154
Epoch 100/100, Loss: 0.1850
Perceptron Accuracy: 0.6333333253860474
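
The comment in the cell above motivates cross entropy by how strongly it punishes incorrect guesses. As a small numeric illustration (plain Python with hypothetical helper names, not the PyTorch implementation), the per-sample loss is the negative log of the probability assigned to the true class, so it grows quickly when the model is confident and wrong:

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Per-sample cross entropy: -log of the probability of the true class."""
    return -math.log(softmax(logits)[true_class])


# Three hypothetical 3-class predictions for a sample whose true class is 0.
confident_right = cross_entropy([5.0, 0.0, 0.0], true_class=0)
unsure = cross_entropy([0.0, 0.0, 0.0], true_class=0)
confident_wrong = cross_entropy([0.0, 5.0, 0.0], true_class=0)

print(f"confident and right: {confident_right:.3f}")
print(f"unsure:              {unsure:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

`nn.CrossEntropyLoss` applies the same idea to the raw outputs of the linear layer and, by default, averages the per-sample losses over the batch.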


In [6]:
# Create a one-hot encoded tensor for training labels
y_train_onehot = torch.zeros(len(y_train), num_classes)
y_train_onehot[torch.arange(len(y_train)), y_train] = 1.0


# Create a one-hot encoded tensor for test labels
y_test_onehot = torch.zeros(len(y_test), num_classes)
y_test_onehot[torch.arange(len(y_test)), y_test] = 1.0


# linear regression model
class LinearRegressionClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)


# Instantiate the model
model_lr = LinearRegressionClassifier(input_dim, num_classes)

# MSE is used because it can be applied directly to one-hot encoded targets
criterion = nn.MSELoss()

# use the Adam optimizer from torch.optim
optimizer = torch.optim.Adam(model_lr.parameters(), lr=1e-3)


epochs = 50
for epoch in range(epochs):
    # Clear previous gradients
    optimizer.zero_grad()

    # input data to Linear regression model
    outputs = model_lr(X_train)
    loss = criterion(outputs, y_train_onehot)

    # Backpropagate gradients and update
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

# no_grad to save processing time
with torch.no_grad():
    # Get predicted class indices by taking argmax over class dimension
    preds = model_lr(X_test).argmax(dim=1)
    accuracy = (preds == y_test_t).float().mean().item()

print("Linear Regression Classifier Accuracy:", accuracy)


Epoch 10/50, Loss: 0.0537
Epoch 20/50, Loss: 0.0293
Epoch 30/50, Loss: 0.0172
Epoch 40/50, Loss: 0.0111
Epoch 50/50, Loss: 0.0075
Linear Regression Classifier Accuracy: 0.6499999761581421


The perceptron and linear regression models both perform poorly on this dataset, reaching about 63% and 65% test accuracy respectively. Their training loss drops quickly while test accuracy stays low, which suggests they overfit because there is not enough data to learn from. A