# **Facial Recognition System Using Celebrity Images**

## **Introduction**
Welcome to this Jupyter Notebook, where I build a **Face Recognition System**. The project demonstrates data gathering, preprocessing, feature extraction, and model training in computer vision, with the goal of a system that recognizes faces accurately.

---

## **Project Objective**
The primary objective of this project is to develop a face recognition system capable of identifying individuals from a curated dataset of celebrity images. The images were downloaded with the Google Custom Search API and then supplemented with the CelebA dataset from Kaggle (https://www.kaggle.com/datasets/jessicali9530/celeba-dataset). This project addresses key challenges in computer vision, such as:
- Handling datasets with limited images per individual.
- Extracting meaningful features for face recognition using modern deep learning techniques.
- Deploying a user-friendly application (Streamlit) to demonstrate practical utility.



In particular, this project emphasizes inclusivity by focusing on a diverse dataset featuring celebrities from various racial backgrounds, with a special emphasis on African representation.

---

## **Approach**

This project follows a structured, step-by-step process to build the face recognition system:

1. **Data Gathering and Preprocessing**  
   I curated a dataset of prominent global celebrities using the Google Custom Search API. From publicly available resources, I collected ~55 images per person, focusing on diversity and balanced representation. I then organized the dataset into train and test sets following a 70-30 split.

2. **Face Detection and Preprocessing**  
   Faces are detected and cropped from the images using the **Haar Cascade model**. Each face is resized to 160x160 pixels to ensure compatibility with the **FaceNet** model during the feature extraction phase.

3. **Feature Extraction Using FaceNet**  
   A pre-trained FaceNet model is used to extract an embedding from each face image. These embeddings are fixed-length vectors in which faces of the same person lie close together, which makes them suitable input for a classifier.

4. **Model Training and Deployment**  
   Using the extracted features, I train a simple classification model (SVM) to identify individuals. Finally, a web-based interactive demo is deployed using **Streamlit**, enabling users to test the face recognition system.
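
Steps 3 and 4 can be sketched end to end on stand-in data. The block below is only an illustration under stated assumptions: the embeddings are synthetic random clusters rather than real FaceNet output, the 128-dimensional size is assumed, and the identity names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, Normalizer
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: 3 hypothetical identities, 20 synthetic 128-d vectors each.
# Real embeddings would come from FaceNet; these are random clusters.
names = ["person_a", "person_b", "person_c"]
centers = rng.normal(size=(3, 128))
X = np.vstack([c + 0.1 * rng.normal(size=(20, 128)) for c in centers])
y = np.repeat(names, 20)

# L2-normalise each vector so comparisons depend on direction, not magnitude.
normalizer = Normalizer(norm="l2")
X = normalizer.fit_transform(X)

# Encode the string labels as integers for the classifier.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# A linear SVM is a common simple choice on top of face embeddings.
classifier = SVC(kernel="linear")
classifier.fit(X, y_encoded)

# Classify a new synthetic vector drawn near the second identity's cluster.
query = normalizer.transform(centers[1:2] + 0.1 * rng.normal(size=(1, 128)))
predicted_name = encoder.inverse_transform(classifier.predict(query))[0]
print(predicted_name)
```

With real data, `X` would be replaced by the FaceNet embeddings of the cropped 160x160 faces and `y` by the folder-derived labels.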

---

## **Structure of the Notebook**
This notebook is organized into the following sections:
1. **Data Gathering and Preprocessing**: Steps to collect, organize, and prepare the dataset.
2. **Face Detection and Cropping**: Detecting faces using Haar Cascade and resizing them for further processing.
3. **Feature Extraction**: Leveraging FaceNet to generate embeddings for each face.
4. **Model Training & Deployment**: Building and evaluating a classification model for face recognition and an interactive web application using Streamlit.

---

# **Phase 1: Data Collection & Preprocessing**

**Used the Google Custom Search API for image collection**: a Python script, `download-images.py` (saved in the main directory), searches for, downloads, and saves images of the specified celebrities.
I then supplemented the downloaded images with the CelebA dataset from Kaggle so that there would be enough images for the train and test sets.
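
The contents of `download-images.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might query the Custom Search JSON API. The helper names (`page_requests`, `build_search_url`, `fetch_image_links`) are hypothetical; `key`, `cx`, `q`, `searchType`, `start`, and `num` are the API's query parameters, and the API returns at most 10 results per request, so ~55 images per person requires paging.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def page_requests(total, page_size=10):
    """Yield (start, num) pairs covering `total` results; `start` is 1-based."""
    for offset in range(0, total, page_size):
        yield offset + 1, min(page_size, total - offset)


def build_search_url(query, api_key, cx, start, num):
    """Build one image-search request URL for the given results page."""
    params = {
        "key": api_key,         # API key (placeholder value in the demo below)
        "cx": cx,               # search engine ID (placeholder value below)
        "q": query,
        "searchType": "image",  # restrict results to images
        "start": start,
        "num": num,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_image_links(query, api_key, cx, total=55):
    """Collect image URLs for `query`. Performs network requests."""
    links = []
    for start, num in page_requests(total):
        with urlopen(build_search_url(query, api_key, cx, start, num)) as response:
            payload = json.load(response)
        links.extend(item["link"] for item in payload.get("items", []))
    return links


# Offline demonstration: show the paging plan without calling the API.
print(list(page_requests(55)))
```

Each returned link would then be downloaded into a folder named after the celebrity, which is the layout the preprocessing functions expect.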

In [1]:
# Key libraries
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

The cells below define the key preprocessing functions, `extract_face`, `load_faces`, and `process_dataset`, and then run the resulting pipeline.

In [2]:
# Step 1: extract_face function

def extract_face(image_path, target_size=(160, 160)):
    
    # Load the image
    image = cv2.imread(image_path)
    if image is None:
        print(f"Could not read image: {image_path}")
        return None
    
    # Converting image to grayscale for improved face detection
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Loading the HAAR cascade model
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    
    # Detecting faces in the image
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    
    if len(faces) == 0:
        print(f"No face detected in: {image_path}")
        return None
    
    # extract the first detected face
    x, y, w, h = faces[0]
    face = image[y:y+h, x:x+w]
    
    # Resizing the face to target size.
    face_resized = cv2.resize(face, target_size)
    return face_resized
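
`extract_face` keeps `faces[0]`, which is not necessarily the largest detection when an image contains several faces or a false positive. One possible refinement, shown here as a sketch rather than as part of the pipeline above, is to pick the box with the largest area. The helper name `largest_face` is hypothetical; it assumes detections in the `(x, y, w, h)` form returned by `detectMultiScale`.

```python
import numpy as np


def largest_face(detections):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    # detectMultiScale yields an (n, 4) array, or an empty tuple when
    # nothing is found; reshape handles both cases uniformly.
    boxes = np.asarray(detections).reshape(-1, 4)
    if len(boxes) == 0:
        return None
    areas = boxes[:, 2] * boxes[:, 3]  # width * height per box
    x, y, w, h = boxes[int(np.argmax(areas))]
    return int(x), int(y), int(w), int(h)


# Two hypothetical detections: a small box and a larger one.
example = np.array([[0, 0, 10, 10], [5, 5, 40, 50]])
print(largest_face(example))
```

Inside `extract_face`, the line `x, y, w, h = faces[0]` could then be swapped for `x, y, w, h = largest_face(faces)`.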

In [3]:
# Step 2: load_faces function (returns faces and their labels).

def load_faces(directory):

    faces, labels = [], []
    for label in os.listdir(directory):  # folder name is the label
        label_path = os.path.join(directory, label)
        if not os.path.isdir(label_path):
            continue
        
        for image_name in os.listdir(label_path):
            image_path = os.path.join(label_path, image_name)
            face = extract_face(image_path)
            if face is not None:
                faces.append(face)
                labels.append(label)
    return faces, labels
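
`load_faces` passes every directory entry to `extract_face`, including hidden files such as `.DS_Store`, and some folders may lose more images to failed detections than others. The two helpers below are a hedged sketch of how to filter hidden entries and check the per-person counts. Both names (`visible_files`, `faces_per_label`) are hypothetical and are not part of the pipeline above.

```python
from collections import Counter


def visible_files(names):
    """Drop hidden entries such as .DS_Store; keep the rest in sorted order."""
    return sorted(name for name in names if not name.startswith("."))


def faces_per_label(labels):
    """Return (label, count) pairs, smallest classes first."""
    return sorted(Counter(labels).items(), key=lambda item: (item[1], item[0]))


# Hypothetical directory listing and label list for illustration.
print(visible_files([".DS_Store", "b.jpg", "a.jpg"]))
print(faces_per_label(["x", "y", "x"]))
```

`visible_files(os.listdir(label_path))` could replace the bare `os.listdir(label_path)` call, and `faces_per_label(labels)` gives a quick view of which identities ended up with the fewest usable faces.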

In [4]:
# Step 3: Main routine function for processing the dataset

def process_dataset(parent_directory): #loading and splitting dataset into train-test sets & saving it.
    
    faces, labels = load_faces(parent_directory)
    print(f"Loaded {len(faces)} faces.")
    
    # Split into train-test sets
    train_faces, test_faces, train_labels, test_labels = train_test_split(
        faces, labels, test_size=0.3, random_state=42
    )
    
    # Saving the data into a .npz file
    np.savez_compressed('celeb_faces_dataset.npz',
                        train_faces=np.array(train_faces),
                        train_labels=np.array(train_labels),
                        test_faces=np.array(test_faces),
                        test_labels=np.array(test_labels))
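
The split in `process_dataset` is random across the whole dataset, so with only ~55 images per person the 70-30 ratio may not hold within each identity. A stratified split is one option to consider. The sketch below uses placeholder data (integers standing in for face arrays, hypothetical labels `a`, `b`, `c`) and the `stratify` argument of `train_test_split`; it is not what the pipeline above does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 3 hypothetical identities with 10 "faces" each.
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
faces = list(range(len(labels)))  # integers standing in for face arrays

# stratify=labels keeps each identity's share equal in train and test.
train_faces, test_faces, train_labels, test_labels = train_test_split(
    faces, labels, test_size=0.3, random_state=42, stratify=labels
)

test_counts = Counter(test_labels)
print(len(train_faces), len(test_faces), dict(test_counts))
```

Every identity then contributes the same proportion of images to the test set, which makes per-person accuracy easier to compare.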

In [5]:
# Step 4: Executing the pipeline

parent_directory = 'celeb_images'
process_dataset(parent_directory)

No face detected in: celeb_images/Roger Federer/Roger Federer_28.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_59.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_56.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_55.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_34.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_9.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_23.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_32.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_19.jpg
No face detected in: celeb_images/Roger Federer/Roger Federer_25.jpg
No face detected in: celeb_images/Robert Downey Jr/Robert Downey Jr_11.jpg
No face detected in: celeb_images/Brad Pitt/Brad Pitt_17.jpg
Could not read image: celeb_images/Idris_Elba/.DS_Store
Could not read image: celeb_images/Idris_Elba/Idris_Elba_0013
No face detected in: celeb_images/Idris_Elba/Idris_Elba_0

In [7]:
# confirming dataset status

# loading saved dataset
data = np.load('celeb_faces_dataset.npz')

# size of the training and testing sets
print(f"Train Faces: {len(data['train_faces'])}")
print(f"Test Faces: {len(data['test_faces'])}")

Train Faces: 1294
Test Faces: 555
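
Beyond the set sizes, it can help to confirm that the arrays keep the expected shape and dtype after a save and load round trip, and to see how many faces each label has. The block below is a self-contained sketch on a tiny synthetic archive written to a temporary directory; the file name `demo_faces.npz` and the labels are placeholders, not the real dataset.

```python
import os
import tempfile

import numpy as np

# Tiny synthetic stand-in for the real dataset: 4 blank 160x160 BGR faces.
faces = np.zeros((4, 160, 160, 3), dtype=np.uint8)
labels = np.array(["person_a", "person_a", "person_b", "person_b"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_faces.npz")
    np.savez_compressed(path, train_faces=faces, train_labels=labels)

    # Indexing the archive reads each array fully into memory.
    with np.load(path) as data:
        loaded_faces = data["train_faces"]
        loaded_labels = data["train_labels"]

# Count faces per label to check class balance.
unique_labels, counts = np.unique(loaded_labels, return_counts=True)
print(loaded_faces.shape, loaded_faces.dtype)
print(dict(zip(unique_labels.tolist(), counts.tolist())))
```

The same `np.unique(..., return_counts=True)` call can be run on `data['train_labels']` and `data['test_labels']` from `celeb_faces_dataset.npz`.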
