## Part 1: Preparing the CelebA Dataset for a Known vs. Unknown Face Recognition Task

In this part of the project, we prepare the CelebA face dataset for a classification task in which the model must decide whether a face belongs to a known or an unknown individual. The dataset contains over 200 thousand face images, and each identity appears anywhere from a few times to 30 or more. We therefore pick the "known" individuals from a list containing only the IDs with 30 or more appearances.

1. We read from the `identity_CelebA.txt` file to map each image filename to an identity ID. This gives us the necessary labels for determining which images correspond to which person.

2. We create a subset of identities where each chosen ID appears at least 30 times within the dataset. The rest of the identities are excluded from the known-class pool.

3. After filtering, we select 10 identities at random. Each one has at least 30 images, and all images are resized to fit each model's input. For the MLP and CNN, the images are resized to 64x64 pixels, giving an input dimensionality of 3x64x64. The final dataset is split into training, validation, and test sets per identity so that each class has balanced representation.



In [71]:
import pandas as pd
import numpy as np
import random
from torch.utils.data import Dataset
from PIL import Image
import os
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from torchvision.models import resnet50
from sklearn.model_selection import train_test_split

random.seed(2)

df = pd.read_csv("data/identity_CelebA.txt", sep = " ", header = None, names=["filename", "id"])
counts = df["id"].value_counts()
possible_celebs = counts[counts >= 30].index.tolist()

#Uncomment the print statement below to see a list of celebrity IDs that appear 30 or more times within the dataset.
#print("Possible celebrity IDs: ", possible_celebs) 

known_celebs = random.sample(possible_celebs, 10)
# Map each selected celebrity ID to a class label 0-9.
id_to_label = {celeb_id: i for i, celeb_id in enumerate(known_celebs)}
print(known_celebs)
print(id_to_label)

[6958, 6695, 7073, 5703, 10001, 5672, 6878, 1886, 1842, 6399]
{6958: 0, 6695: 1, 7073: 2, 5703: 3, 10001: 4, 5672: 5, 6878: 6, 1886: 7, 1842: 8, 6399: 9}


**Now let's partition the data into three major splits**



In [37]:
def prep_celeba_splits(df, known_celebs, min_images = 30, train_ratio = 0.7, val_ratio = 0.15, seed = 2):
    df_known = df[df["id"].isin(known_celebs)]
    df_unknown = df[~df["id"].isin(known_celebs)]

    train_rows, val_rows, test_known_rows = [], [], []
    for celeb in known_celebs:
        df_celeb = df_known[df_known["id"] == celeb].sample(frac = 1, random_state = seed)
        n = len(df_celeb)

        n_train = int(n * train_ratio)
        n_val = int(n * val_ratio)

        train_rows.append(df_celeb.iloc[:n_train])
        val_rows.append(df_celeb.iloc[n_train:n_train + n_val])
        test_known_rows.append(df_celeb.iloc[n_train+n_val:])

    train_df = pd.concat(train_rows)
    val_df = pd.concat(val_rows)
    test_known_df = pd.concat(test_known_rows)

    test_unknown_df = df_unknown.sample(2000, random_state = seed)

    return train_df, val_df, test_known_df, test_unknown_df

def test_the_data(train_df, val_df, test_known_df, test_unknown_df, known_celebs):
    print("Training Samples: ", len(train_df))
    print("Validation Samples: ", len(val_df))
    print("Testing Known Samples: ", len(test_known_df))
    print("Testing Unknown Samples: ", len(test_unknown_df))
    print("Known Celebrities: ", len(known_celebs))

train_df, val_df, test_known_df, test_unknown_df = prep_celeba_splits(df, known_celebs)
test_the_data(train_df, val_df, test_known_df, test_unknown_df, known_celebs)

Training Samples:  210
Validation Samples:  40
Testing Known Samples:  50
Testing Unknown Samples:  2000
Known Celebrities:  10
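
These counts follow from the integer truncation inside `prep_celeba_splits`. As a quick check, the per-identity arithmetic can be sketched as below. It assumes every selected identity has exactly 30 images, which the 300-image total implies because each identity has at least 30.

```python
# Reproduce the split sizes by hand, using the same int() truncation
# as prep_celeba_splits.
# Assumption: every selected identity has exactly 30 images.
images_per_identity = 30
num_identities = 10
train_ratio, val_ratio = 0.7, 0.15

n_train = int(images_per_identity * train_ratio)   # 21
n_val = int(images_per_identity * val_ratio)       # 4 (4.5 truncated)
n_test = images_per_identity - n_train - n_val     # 5, whatever is left over

totals = (
    n_train * num_identities,
    n_val * num_identities,
    n_test * num_identities,
)
print(totals)  # (210, 40, 50)
```

Because the validation count is truncated, the test split picks up the leftover image per identity, which is why the known test set (50) is slightly larger than the validation set (40).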


**Now that we have split the data and run a few small sanity checks, we can move on to actually prepping the data for the models.**

A small disclaimer: I had a lot of difficulty working out how to prep image data for the CNN and MLP, so I used the data_loader.py file from this GitHub repo, which seemed to have a nice setup:


https://github.com/zamaex96/ML-LSTM-CNN-RNN-MLP/blob/main/data_loader.py

In [38]:
class CelebADataset(Dataset):
    def __init__(self, df, img_dir, id_to_label, transform = None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.id_to_label = id_to_label
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, row["filename"])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image)
        label = self.id_to_label[row["id"]]
        return image, label

basic_transform = transforms.Compose([
    transforms.Resize((64,64)),
    transforms.ToTensor()
])

train_dataset = CelebADataset(train_df, img_dir = "data/img_align_celeba", id_to_label = id_to_label, transform = basic_transform)
val_dataset = CelebADataset(val_df, img_dir = "data/img_align_celeba", id_to_label = id_to_label, transform = basic_transform)

#Sanity check
x, y = train_dataset[0]
print("Shape: ", x.shape)
print("Label: ", y)

train_loader = DataLoader(train_dataset, batch_size = 32, shuffle = True)
val_loader = DataLoader(val_dataset, batch_size = 32)
images, labels = next(iter(train_loader))
print("Batch image shape:", images.shape)
print("Batch labels shape:", labels.shape)

Shape:  torch.Size([3, 64, 64])
Label:  0
Batch image shape: torch.Size([32, 3, 64, 64])
Batch labels shape: torch.Size([32])


## Part 2: Training a CNN Baseline and an MLP Baseline

For this part of the project we want to evaluate the accuracy of two baseline models on the identity dataset. We first test a Convolutional Neural Network and then move on to a Multilayer Perceptron. Both are trained on the same set of known identities using identical train and validation splits. The idea is to highlight the difference in accuracy that arises because a CNN preserves spatial structure whereas an MLP does not.

Citation: https://github.com/asabenhur/CS345/blob/ef562f15f2bb5a3ee23615b29291f918e3878132/fall24/notebooks/module07_04_cnns.ipynb#L516

In [44]:
class CelebA_CNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()

        self.conv1 = nn.Conv2d(3, 32, kernel_size = 5)
        self.conv2 = nn.Conv2d(32, 32, kernel_size = 5)
        self.conv3 = nn.Conv2d(32, 64, kernel_size = 5)
        self.fc1 = nn.Linear(64 * 12 * 12, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = F.relu(F.max_pool2d(self.conv3(x), 2))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training = self.training)
        x = self.fc2(x)
        return x

num_classes = len(id_to_label)
cnn_model = CelebA_CNN(num_classes)

def train_epoch(dataloader, model, loss_fn, optimizer, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    total_loss = 0.0

    model.train()
    for batch_idx, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        preds = model(X)
        loss = loss_fn(preds, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        if batch_idx % 50 == 0:
            current = (batch_idx + 1) * len(X)
            print(f"loss: {loss.item():>7f}  [{current:>5d}/{size:>5d}]")

    return total_loss / num_batches

def validate(dataloader, model, loss_fn, device):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    total_loss = 0.0
    correct = 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            preds = model(X)
            total_loss += loss_fn(preds, y).item()
            correct += (preds.argmax(1) == y).sum().item()

    avg_loss = total_loss / num_batches
    accuracy = correct / size
    print (f"Validation Accuracy: {accuracy*100:.1f}%, Avg loss: {avg_loss:.4f}")
    return avg_loss, accuracy
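
The `64 * 12 * 12` input size of `fc1` in `CelebA_CNN` comes from tracking the image side length through each layer. The following is a minimal sketch of that arithmetic, assuming the defaults used above (stride 1 and no padding for the convolutions, and a stride equal to the kernel size for `F.max_pool2d(x, 2)`). The helper `conv_out` is a hypothetical name introduced only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

side = 64                                    # input is 3 x 64 x 64
side = conv_out(side, kernel=5)              # conv1      -> 60
side = conv_out(side, kernel=5)              # conv2      -> 56
side = conv_out(side, kernel=2, stride=2)    # max pool 2 -> 28
side = conv_out(side, kernel=5)              # conv3      -> 24
side = conv_out(side, kernel=2, stride=2)    # max pool 2 -> 12

flat_features = 64 * side * side             # conv3 has 64 output channels
print(side, flat_features)                   # 12 9216
```

If the input resolution or any kernel size changes, this number changes too, and `fc1` has to be updated to match.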
    

**Now that we have everything ready, let's spin it up and see what sort of accuracy we get.**

In [60]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cnn_model = CelebA_CNN(num_classes).to(device)
learning_rate = 0.001
epochs = 15
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr = learning_rate)

train_losses = []
val_losses = []
val_accuracies = []

for epoch in range(epochs):
    print(f"Epoch {epoch+1}\n-------------------------------")
    train_loss = train_epoch(train_loader, cnn_model, loss_fn, optimizer, device)
    train_losses.append(train_loss)
    val_loss, val_acc = validate(val_loader, cnn_model, loss_fn, device)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)

print("Training Complete :)")

Epoch 1
-------------------------------
loss: 2.302838  [   32/  210]
Validation Accuracy: 15.0%, Avg loss: 2.2981
Epoch 2
-------------------------------
loss: 2.285281  [   32/  210]
Validation Accuracy: 20.0%, Avg loss: 2.2501
Epoch 3
-------------------------------
loss: 2.255957  [   32/  210]
Validation Accuracy: 22.5%, Avg loss: 1.9147
Epoch 4
-------------------------------
loss: 2.037496  [   32/  210]
Validation Accuracy: 32.5%, Avg loss: 1.8390
Epoch 5
-------------------------------
loss: 1.791372  [   32/  210]
Validation Accuracy: 42.5%, Avg loss: 1.4766
Epoch 6
-------------------------------
loss: 1.752915  [   32/  210]
Validation Accuracy: 37.5%, Avg loss: 1.5907
Epoch 7
-------------------------------
loss: 1.392975  [   32/  210]
Validation Accuracy: 52.5%, Avg loss: 1.4350
Epoch 8
-------------------------------
loss: 1.074172  [   32/  210]
Validation Accuracy: 57.5%, Avg loss: 1.2787
Epoch 9
-------------------------------
loss: 1.025722  [   32/  210]
Validation

**We can see that at epochs 11-13 the model approaches its best accuracy and then begins to overfit and memorize the training data.**
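
One common way to act on this kind of observation is early stopping: keep track of the best validation loss and halt once it has failed to improve for a few epochs. The sketch below is not part of the training loop above. The function name `early_stop_epoch`, the `patience` value, and the example losses are all made up for illustration.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is the epoch at which training would halt because the
    validation loss has not improved for `patience` consecutive epochs,
    or None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up losses for illustration only (not taken from the run above).
example_losses = [2.30, 1.90, 1.50, 1.20, 1.25, 1.31, 1.40]
print(early_stop_epoch(example_losses))  # (4, 6)
```

In this notebook the same function could be applied to the recorded `val_losses` list after training to see which epoch a patience-based rule would have picked.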

Now let's look at something we expect to do worse.

In [52]:
mlp_transform = transforms.Compose([
    transforms.Resize((64,64)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))
])

mlp_train_dataset = CelebADataset(train_df, img_dir = "data/img_align_celeba", id_to_label = id_to_label, transform = mlp_transform)
mlp_val_dataset = CelebADataset(val_df, img_dir = "data/img_align_celeba", id_to_label = id_to_label, transform = mlp_transform)
#mlp_test_dataset = CelebADataset(test_df, img_dir = "data/img_align_celeba", id_to_label = id_to_label, transform = mlp_transform)
mlp_train_loader = DataLoader(mlp_train_dataset, batch_size = 32, shuffle = True)
mlp_val_loader = DataLoader(mlp_val_dataset, batch_size = 32)
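#mlp_test_loader = DataLoader(mlp_test_dataset, batch_size = 32)

# Sanity check: each sample should now be a flat 12288-element vector
x, y = mlp_train_dataset[0]
print(x.shape)
print(y)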


torch.Size([12288])
0


In [53]:
class MLP(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training = self.training)
        x = self.fc2(x)
        return x

input_dim = 3 * 64 * 64
mlp_model = MLP(input_dim, num_classes).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(mlp_model.parameters(), lr = learning_rate)

mlp_train_losses = []
mlp_val_losses = []
mlp_val_accuracies = []

for epoch in range(epochs):
    print(f"Epoch {epoch+1}\n-------------------------------")
    train_loss = train_epoch(mlp_train_loader, mlp_model, loss_fn, optimizer, device)
    mlp_train_losses.append(train_loss)
    mlp_val_loss, mlp_val_acc = validate(mlp_val_loader, mlp_model, loss_fn, device)
    mlp_val_losses.append(mlp_val_loss)
    mlp_val_accuracies.append(mlp_val_acc)

print("Training Complete :)")


Epoch 1
-------------------------------
loss: 2.263547  [   32/  210]
Validation Accuracy: 20.0%, Avg loss: 3.2991
Epoch 2
-------------------------------
loss: 5.728381  [   32/  210]
Validation Accuracy: 17.5%, Avg loss: 2.6118
Epoch 3
-------------------------------
loss: 2.380456  [   32/  210]
Validation Accuracy: 27.5%, Avg loss: 1.9695
Epoch 4
-------------------------------
loss: 2.423108  [   32/  210]
Validation Accuracy: 30.0%, Avg loss: 1.7995
Epoch 5
-------------------------------
loss: 1.839095  [   32/  210]
Validation Accuracy: 37.5%, Avg loss: 1.8151
Epoch 6
-------------------------------
loss: 1.782277  [   32/  210]
Validation Accuracy: 30.0%, Avg loss: 1.7303
Epoch 7
-------------------------------
loss: 1.604099  [   32/  210]
Validation Accuracy: 45.0%, Avg loss: 1.6536
Epoch 8
-------------------------------
loss: 1.642818  [   32/  210]
Validation Accuracy: 42.5%, Avg loss: 1.7160
Epoch 9
-------------------------------
loss: 1.561949  [   32/  210]
Validation

**We can tell that there is a noticeable difference in accuracy between the MLP and the CNN.**

In [78]:
best_cnn_accuracy = max(val_accuracies)
best_mlp_accuracy = max(mlp_val_accuracies)
results_df = pd.DataFrame({
    "Model": ["CNN", "MLP"],
    "Best Validation Accuracy %": [best_cnn_accuracy *100, best_mlp_accuracy * 100]
})
print(results_df)

  Model  Best Validation Accuracy %
0   CNN                        75.0
1   MLP                        55.0
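
For context on model size, the parameter counts of both networks can be worked out by hand from the layer definitions above. This is a rough sketch: the helpers `conv_params` and `linear_params` are hypothetical names, and the layer shapes are copied from `CelebA_CNN` and `MLP` with `num_classes = 10`.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features

num_classes = 10

cnn_total = (
    conv_params(3, 32, 5)
    + conv_params(32, 32, 5)
    + conv_params(32, 64, 5)
    + linear_params(64 * 12 * 12, 256)
    + linear_params(256, num_classes)
)
mlp_total = linear_params(3 * 64 * 64, 512) + linear_params(512, num_classes)

print(cnn_total, mlp_total)  # 2441450 6297098
```

The MLP has roughly 2.6 times as many parameters as the CNN yet scores lower, which suggests the accuracy gap is not simply a matter of model capacity.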


## Part 3: Results and Discussion

Across multiple runs, the convolutional neural network consistently outperformed the multilayer perceptron by around 20 percentage points in validation accuracy. This performance gap highlights the value of a model that can exploit spatial structure: the CNN operates on the image grid, whereas the MLP takes flattened pixel vectors and discards the spatial relationships between pixels.
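
To make the flattening point concrete, here is a small sketch of where neighboring pixels end up in the 12288-element vector. It assumes the row-major layout that `x.view(-1)` produces for a contiguous 3x64x64 tensor. The helper `flat_index` is a hypothetical name used only for this illustration.

```python
C, H, W = 3, 64, 64

def flat_index(c, h, w):
    """Position of pixel (c, h, w) after row-major flattening of a C x H x W tensor."""
    return c * H * W + h * W + w

# Distance in the flat vector between a pixel and its neighbors.
horizontal_gap = flat_index(0, 10, 21) - flat_index(0, 10, 20)  # pixel to the right
vertical_gap = flat_index(0, 11, 20) - flat_index(0, 10, 20)    # pixel below
channel_gap = flat_index(1, 10, 20) - flat_index(0, 10, 20)     # same pixel, next channel

print(horizontal_gap, vertical_gap, channel_gap)  # 1 64 4096
```

Two pixels that touch vertically sit 64 positions apart, and the three color values of a single pixel sit 4096 positions apart. A fully connected layer gets no hint about which inputs are adjacent, whereas a 5x5 convolution kernel looks at a local neighborhood by construction.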

## Part 4: Extension - Pretrained Feature Extraction with ResNet-50

In this extension we explore an alternative approach to face recognition: pretrained feature extraction. Instead of training our models on raw images, as in the experiment above, we use a ResNet-50 model to extract a feature embedding for each image. These embeddings are then fed to linear classifiers. This shows how representation quality affects classification performance.

In [67]:
# creating a list of possible celebs with the filename still associated
celebs_with_rows = df[df["id"].isin(possible_celebs)]
# print(celebs_with_rows)

# known celebs with row data
known_with_rows = df[df["id"].isin(known_celebs)]
# print(known_with_rows)

# One-hot encode the id column
one_hot_labels = pd.get_dummies(known_with_rows["id"], prefix="id")
# print(one_hot_labels)

known_with_onehot = pd.concat([known_with_rows, one_hot_labels], axis=1)
# print(known_with_onehot)



ResNet-50 is a pretrained convolutional network. With its final classification layer removed, it maps each image to a single 2048-dimensional feature vector. I convert each image to a normalized tensor so it is compatible with the model, then run the pretrained model over our dataset.

In [72]:
# pretrained model, load resnet remove the classification layer
model = resnet50(weights="DEFAULT")
model = torch.nn.Sequential(*list(model.children())[:-1])  
model.eval()

# set all images to the same size, convert to tensor and normalize to clean data for resnet
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

# load image, transform. Run through resnet using no_grad to speed up processing
def extract_feature(img_path):
    img = Image.open(img_path).convert("RGB")
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        feat = model(x).squeeze().numpy()  
    return feat

features = []

# get image data from row send through feature extraction, create list of processed data with filename, id, feature vector, and the labels
for _, row in known_with_onehot.iterrows():
    filename = row["filename"]
    img_path = os.path.join("data/img_align_celeba", filename)

    feature_vec = extract_feature(img_path)

    features.append({
        "filename": filename,
        "id": row["id"],
        "feature": feature_vec,
        **{col: row[col] for col in one_hot_labels.columns}
    })

feature_df = pd.DataFrame(features)
# print(feature_df.head())

In [73]:
# split processed data into features (X) and labels (y)
X = np.stack(feature_df["feature"].values)

# remap the raw celebrity IDs to class indices 0-9 (row order is unchanged)
y_raw = feature_df["id"].values
unique_ids = np.unique(y_raw)
id_to_idx = {old:i for i, old in enumerate(unique_ids)}
y = np.array([id_to_idx[x] for x in y_raw])


#train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# run data through linear SVC
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Test Accuracy: 0.7444444444444445

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.78      0.82         9
           1       0.62      0.89      0.73         9
           2       0.83      0.56      0.67         9
           3       0.62      0.89      0.73         9
           4       0.71      0.56      0.62         9
           5       0.89      0.89      0.89         9
           6       0.73      0.89      0.80         9
           7       0.80      0.89      0.84         9
           8       0.78      0.78      0.78         9
           9       0.75      0.33      0.46         9

    accuracy                           0.74        90
   macro avg       0.76      0.74      0.73        90
weighted avg       0.76      0.74      0.73        90



The SVC correctly classified 83% of the test samples at a 10% test size, 77% at a 20% test size, and 74% at a 30% test size (the run shown above). Swings of this size suggest the dataset is too small to give a stable accuracy estimate.
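
A rough way to quantify how noisy such an estimate is on a small test set is the binomial standard error of the accuracy. The sketch below uses the 30% split above, where 0.744 on 90 samples corresponds to 67 correct predictions. The 95% interval relies on a normal approximation, which is only an approximation at this sample size.

```python
import math

correct, n_test = 67, 90          # 67 / 90 = 0.744..., the accuracy printed above
accuracy = correct / n_test

# Binomial standard error and a 95% normal-approximation interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"accuracy = {accuracy:.3f}, std err = {std_err:.3f}")
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The interval runs from roughly 0.65 to 0.83, so differences of several percentage points between runs are within what sampling noise alone could produce. Smaller test sets, such as the 10% split, widen it further.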

In [74]:
probs = clf.predict_proba(X_test)
# fraction of samples whose true label is among the k highest-probability classes
def top_k_accuracy(probs, y_true, k=3):
    k = min(k, probs.shape[1])
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return np.mean([
        y_true[i] in top_k[i]
        for i in range(len(y_true))
    ])

print("Top-5 Accuracy:", top_k_accuracy(probs, y_test, k=5))

Top-5 Accuracy: 0.9777777777777777


The correct ID falls within the model's top five guesses for about 98% of test samples, which isn't nearly as impressive as it sounds when there are only ten options. The top two guesses contain the correct ID over 84% of the time, however, which is a strong foundation for a model. With more time and resources, this model could plausibly learn known faces with a high level of accuracy.
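
To put the top-k numbers in perspective, it helps to compare them with what random guessing would achieve. A uniformly random ranking of the classes places the true label in the top k with probability k divided by the number of classes. The helper `chance_top_k` below is a hypothetical name used only for this sketch.

```python
def chance_top_k(k, num_classes):
    """Expected top-k accuracy of a uniformly random ranking of the classes."""
    return min(k, num_classes) / num_classes

for k in (1, 2, 5):
    print(k, chance_top_k(k, num_classes=10))
# 1 0.1
# 2 0.2
# 5 0.5
```

With ten classes, top-5 accuracy starts from a 50% chance baseline, while top-2 starts from 20%, so the top-2 figure is the more informative of the two.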

In [75]:
# same data prep as before except reformatted for torch
y_raw = feature_df["id"].values

unique_ids = np.unique(y_raw)
id_to_idx = {old:i for i, old in enumerate(unique_ids)}
y = np.array([id_to_idx[x] for x in y_raw])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_test  = torch.tensor(X_test,  dtype=torch.float32)

y_train_t = torch.tensor(y_train, dtype=torch.long)
y_test_t  = torch.tensor(y_test,  dtype=torch.long)

num_classes = len(np.unique(y))
input_dim = X_train.shape[1] 

class PerceptronClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

# a single linear layer trained with cross entropy, which heavily penalizes confident wrong guesses
model = PerceptronClassifier(input_dim, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train_t)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

with torch.no_grad():
    preds = model(X_test).argmax(dim=1)
    accuracy = (preds == y_test_t).float().mean().item()

print("Perceptron Accuracy:", accuracy)

Epoch 10/100, Loss: 1.7549
Epoch 20/100, Loss: 1.3172
Epoch 30/100, Loss: 1.0093
Epoch 40/100, Loss: 0.7938
Epoch 50/100, Loss: 0.6397
Epoch 60/100, Loss: 0.5262
Epoch 70/100, Loss: 0.4403
Epoch 80/100, Loss: 0.3737
Epoch 90/100, Loss: 0.3212
Epoch 100/100, Loss: 0.2790
Perceptron Accuracy: 0.7666666507720947


In [76]:
# Create a one-hot encoded tensor for training labels
y_train_onehot = torch.zeros(len(y_train), num_classes)
y_train_onehot[torch.arange(len(y_train)), y_train] = 1.0


# Create a one-hot encoded tensor for test labels
y_test_onehot = torch.zeros(len(y_test), num_classes)
y_test_onehot[torch.arange(len(y_test)), y_test] = 1.0


# linear regression model
class LinearRegressionClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        return self.fc(x)


# Instantiate the model
model_lr = LinearRegressionClassifier(input_dim, num_classes)

# MSE is used because it works directly with one-hot encoded targets
criterion = nn.MSELoss()

# use torch's Adam optimizer
optimizer = torch.optim.Adam(model_lr.parameters(), lr=1e-3)


epochs = 50
for epoch in range(epochs):
    # Clear previous gradients
    optimizer.zero_grad()

    # input data to Linear regression model
    outputs = model_lr(X_train)
    loss = criterion(outputs, y_train_onehot)

    # Backpropagate gradients and update
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

# no_grad to save processing time
with torch.no_grad():
    # Get predicted class indices by taking argmax over class dimension
    preds = model_lr(X_test).argmax(dim=1)
    accuracy = (preds == y_test_t).float().mean().item()

print("Linear Regression Classifier Accuracy:", accuracy)

Epoch 10/50, Loss: 0.0609
Epoch 20/50, Loss: 0.0391
Epoch 30/50, Loss: 0.0274
Epoch 40/50, Loss: 0.0200
Epoch 50/50, Loss: 0.0152
Linear Regression Classifier Accuracy: 0.7333333492279053


The perceptron and linear regression models both seem to perform fairly poorly on this dataset, reaching about 77% and 73% test accuracy. Their training loss drops quickly while test accuracy lags, suggesting they overfit because there is not enough data to learn from.

## Part 5: Conclusion

The goal of this project was to explore how different models and their architectures affect performance on a face recognition task using the CelebA dataset. We first trained our models on raw image data and then used a pretrained ResNet-50 combined with linear classifiers to evaluate the accuracy of those models. Despite their simplicity, the linear models combined with ResNet-50 features achieved good performance, roughly matching the CNN trained from scratch, which demonstrates the value of large-scale pretraining.

Overall, this project reinforced the importance of selecting model architectures that align with the structure of the data. Because our problem and dataset were image based, models capable of preserving spatial information, such as the convolutional neural network or pretrained feature extractors, performed far better than baseline linear or fully connected approaches.