# **D3APL: Aplicações em Ciência de Dados - IFSP Campinas**

**Professor:** Dr. Samuel Martins (Samuka)

**Alunos:**

* Gabrielly Baratella de Carvalho 
* Halisson Gomides de Souza
* Hugo Martinelli Watanuki

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
 #   for filename in filenames:
  #      print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Celebrity Face Recognition Competition**

The objective of this notebook is to document and provide the Python code for a specific computer vision problem: recognizing the faces of celebrities in a public dataset ([http://www.briancbecker.com/blog/research/pubfig83-lfw-dataset/](http://www.briancbecker.com/blog/research/pubfig83-lfw-dataset/)).

This notebook is structured in six main sections, which are also the main steps adopted to address the problem:
1. Profile the dataset
1. Pre-process the images
1. Define the proper neural network
1. Train the model
1. Fine-tune the model
1. Validate the results

The approach has two parts: undersampling the training dataset to address its class imbalance, and transfer learning from a pre-trained face recognition model (VGGFace).

# 1. Profiling the Dataset

The objective of this step is to understand the dataset structure, handle the class imbalance and generate a dataframe.

In [10]:
# Listing important libraries required for profiling
import glob  
import os
import pandas as pd
import random

## 1.1 The Dataset

The dataset used in this competition is a combination of the [PubFig83](http://vision.seas.harvard.edu/pubfig83/) and [LFW](http://vis-www.cs.umass.edu/lfw/) datasets.

The dataset has 13,840 color images of 83 celebrities, previously resized to 100x100 pixels according to the position of the eyes of the individual in the image. The labeled training dataset comprises 12,180 images and the unlabeled test dataset comprises 1,660 images.

The dataset is located in the '/kaggle/input/' directory of the current kernel.

In [2]:
# Defining the dataset_folder and evaluating the number of classes
dataset_folder = '../input/ifsp-d3apl-2023-face-recognition/train/train/'

class_folders = sorted(os.listdir(dataset_folder))

print(class_folders)
print(f'Number of classes: {len(class_folders)}')

['Adam Sandler', 'Alec Baldwin', 'Angelina Jolie', 'Anna Kournikova', 'Ashton Kutcher', 'Avril Lavigne', 'Barack Obama', 'Ben Affleck', 'Beyonce Knowles', 'Brad Pitt', 'Cameron Diaz', 'Cate Blanchett', 'Charlize Theron', 'Christina Ricci', 'Claudia Schiffer', 'Clive Owen', 'Colin Farrell', 'Colin Powell', 'Cristiano Ronaldo', 'Daniel Craig', 'Daniel Radcliffe', 'David Beckham', 'David Duchovny', 'Denise Richards', 'Drew Barrymore', 'Dustin Hoffman', 'Ehud Olmert', 'Eva Mendes', 'Faith Hill', 'George Clooney', 'Gordon Brown', 'Gwyneth Paltrow', 'Halle Berry', 'Harrison Ford', 'Hugh Jackman', 'Hugh Laurie', 'Jack Nicholson', 'Jennifer Aniston', 'Jennifer Lopez', 'Jennifer Love Hewitt', 'Jessica Alba', 'Jessica Simpson', 'Joaquin Phoenix', 'John Travolta', 'Julia Roberts', 'Julia Stiles', 'Kate Moss', 'Kate Winslet', 'Katherine Heigl', 'Keira Knightley', 'Kiefer Sutherland', 'Leonardo DiCaprio', 'Lindsay Lohan', 'Mariah Carey', 'Martha Stewart', 'Matt Damon', 'Meg Ryan', 'Meryl Streep', '

In [3]:
# Evaluating the class proportions: number of samples per class
for class_folder in class_folders:
    full_class_folder = os.path.join(dataset_folder, class_folder)
    
    class_img_filenames = os.listdir(full_class_folder)
    print(f'Number of Images for Class "{class_folder}": {len(class_img_filenames)}')

Number of Images for Class "Adam Sandler": 88
Number of Images for Class "Alec Baldwin": 83
Number of Images for Class "Angelina Jolie": 194
Number of Images for Class "Anna Kournikova": 151
Number of Images for Class "Ashton Kutcher": 81
Number of Images for Class "Avril Lavigne": 279
Number of Images for Class "Barack Obama": 249
Number of Images for Class "Ben Affleck": 97
Number of Images for Class "Beyonce Knowles": 106
Number of Images for Class "Brad Pitt": 280
Number of Images for Class "Cameron Diaz": 226
Number of Images for Class "Cate Blanchett": 140
Number of Images for Class "Charlize Theron": 175
Number of Images for Class "Christina Ricci": 123
Number of Images for Class "Claudia Schiffer": 102
Number of Images for Class "Clive Owen": 114
Number of Images for Class "Colin Farrell": 125
Number of Images for Class "Colin Powell": 92
Number of Images for Class "Cristiano Ronaldo": 148
Number of Images for Class "Daniel Craig": 148
Number of Images for Class "Daniel Radclif

In [4]:
# Sorting and identifying the class with the fewest images

# Dictionary to store directory and file count
file_counts = {}

# Count files in each directory
for class_folder in class_folders:
    full_class_folder = os.path.join(dataset_folder, class_folder)
    file_counts[class_folder] = len(glob.glob(os.path.join(full_class_folder, '*')))

# Sort file counts by value in descending order
sorted_counts = sorted(file_counts.items(), key=lambda x: x[1], reverse=True)

print(sorted_counts)

[('Miley Cyrus', 348), ('Lindsay Lohan', 334), ('Brad Pitt', 280), ('Jessica Simpson', 280), ('Avril Lavigne', 279), ('Scarlett Johansson', 253), ('Barack Obama', 249), ('Orlando Bloom', 240), ('Katherine Heigl', 237), ('Gwyneth Paltrow', 233), ('Cameron Diaz', 226), ('Daniel Radcliffe', 226), ('Jennifer Aniston', 210), ('George Clooney', 207), ('Angelina Jolie', 194), ('Meg Ryan', 190), ('Sharon Stone', 186), ('Shakira', 181), ('Denise Richards', 180), ('Leonardo DiCaprio', 179), ('Tom Cruise', 177), ('Charlize Theron', 175), ('Keira Knightley', 175), ('Zac Efron', 173), ('Nicole Richie', 168), ('David Beckham', 167), ('Nicole Kidman', 165), ('Jessica Alba', 155), ('Anna Kournikova', 151), ('Cristiano Ronaldo', 148), ('Daniel Craig', 148), ('Hugh Laurie', 148), ('Uma Thurman', 147), ('Steve Carell', 146), ('Cate Blanchett', 140), ('Hugh Jackman', 137), ('Reese Witherspoon', 137), ('Matt Damon', 134), ('Kate Moss', 133), ('Drew Barrymore', 132), ('Shahrukh Khan', 132), ('Harrison Ford'

## 1.2 Handling the Class Imbalance by Undersampling

The class with the fewest images is "Robert Gates" (80 images) and the one with the most is "Miley Cyrus" (348 images).

We decided to adopt an undersampling technique to handle the class imbalance.

Because of the hard memory limits of the kernel, and because the pre-trained model does not accept multiprocessing, we decided to use 50 images per class for training.
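The undersampling rule described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact cell: `undersample` and the toy `files_by_class` dictionary are hypothetical names, and the `min(...)` cap is an added safeguard, since `random.sample` raises `ValueError` when asked for more items than a class holds.

```python
import random


def undersample(files_by_class, n_per_class, seed=22):
    """Return at most n_per_class randomly chosen files for each class."""
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    selected = {}
    for class_name in sorted(files_by_class):
        files = sorted(files_by_class[class_name])
        k = min(n_per_class, len(files))  # guard against small classes
        selected[class_name] = rng.sample(files, k)
    return selected


# Toy data (hypothetical filenames) using the class sizes reported above
files_by_class = {
    'Robert Gates': [f'{i}.jpg' for i in range(80)],
    'Miley Cyrus': [f'{i}.jpg' for i in range(348)],
}

balanced = undersample(files_by_class, n_per_class=50)
print({name: len(files) for name, files in balanced.items()})
```

With 50 images per class, even the smallest class (80 images) is never exhausted, so the cap only matters if the per-class limit is raised.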

In [5]:
# Defining the maximum number of images per class
max_n_samples_per_class = 50

In [6]:
# Randomly selecting 50 images from each class
dataset_folder = '../input/ifsp-d3apl-2023-face-recognition/train/train/'
class_folders = sorted(os.listdir(dataset_folder))

# OPTIONAL: just to get the same selected images
random.seed(22)

img_full_paths = []
img_classes = []

for class_folder in class_folders:
    img_class = class_folder  
    print(f'Class: {img_class}')
     
    # get the full class folder pathname
    full_class_folder = os.path.join(dataset_folder, class_folder)
    print(full_class_folder)
    
    # get all image filenames (without their parent dir) for the current class/celebrity
    class_img_filenames = sorted(os.listdir(full_class_folder))
    print(len(class_img_filenames))
    
    class_img_filenames = random.sample(class_img_filenames, max_n_samples_per_class)
    print(f'Number of images: {len(class_img_filenames)}')
    
    for img_filename in class_img_filenames:
        full_img_path = os.path.join(full_class_folder, img_filename)
        
        img_full_paths.append(full_img_path)
        img_classes.append(img_class)

    print()

Class: Adam Sandler
../input/ifsp-d3apl-2023-face-recognition/train/train/Adam Sandler
88
Number of images: 50

Class: Alec Baldwin
../input/ifsp-d3apl-2023-face-recognition/train/train/Alec Baldwin
83
Number of images: 50

Class: Angelina Jolie
../input/ifsp-d3apl-2023-face-recognition/train/train/Angelina Jolie
194
Number of images: 50

Class: Anna Kournikova
../input/ifsp-d3apl-2023-face-recognition/train/train/Anna Kournikova
151
Number of images: 50

Class: Ashton Kutcher
../input/ifsp-d3apl-2023-face-recognition/train/train/Ashton Kutcher
81
Number of images: 50

Class: Avril Lavigne
../input/ifsp-d3apl-2023-face-recognition/train/train/Avril Lavigne
279
Number of images: 50

Class: Barack Obama
../input/ifsp-d3apl-2023-face-recognition/train/train/Barack Obama
249
Number of images: 50

Class: Ben Affleck
../input/ifsp-d3apl-2023-face-recognition/train/train/Ben Affleck
97
Number of images: 50

Class: Beyonce Knowles
../input/ifsp-d3apl-2023-face-recognition/train/train/Beyonce K

In [9]:
# Assessing the total number of images (50 images x 83 classes = 4,150)
print(len(img_full_paths))
print(len(img_classes))

4150
4150


In [11]:
# Creating a dataframe to store the image full pathnames and their corresponding classes
dataset_df = pd.DataFrame({
    'image_pathname': img_full_paths,
    'class': img_classes
})

dataset_df

Unnamed: 0,image_pathname,class
0,../input/ifsp-d3apl-2023-face-recognition/trai...,Adam Sandler
1,../input/ifsp-d3apl-2023-face-recognition/trai...,Adam Sandler
2,../input/ifsp-d3apl-2023-face-recognition/trai...,Adam Sandler
3,../input/ifsp-d3apl-2023-face-recognition/trai...,Adam Sandler
4,../input/ifsp-d3apl-2023-face-recognition/trai...,Adam Sandler
...,...,...
4145,../input/ifsp-d3apl-2023-face-recognition/trai...,Zac Efron
4146,../input/ifsp-d3apl-2023-face-recognition/trai...,Zac Efron
4147,../input/ifsp-d3apl-2023-face-recognition/trai...,Zac Efron
4148,../input/ifsp-d3apl-2023-face-recognition/trai...,Zac Efron


In [12]:
# Listing the counts for the images of each class
dataset_df['class'].value_counts()

Adam Sandler         50
Nicole Kidman        50
Miley Cyrus          50
Mickey Rourke        50
Michael Bloomberg    50
                     ..
Ehud Olmert          50
Dustin Hoffman       50
Drew Barrymore       50
Denise Richards      50
Zac Efron            50
Name: class, Length: 83, dtype: int64

In [13]:
# Saving the undersampled dataset
dataset_df.to_csv('../working/faces_dataset_balanced.csv', index=False)

CAUTION:
The image pathnames stored in the CSV are relative to the directory of this notebook.
If you try to open an image from a notebook started in another location, an error will occur.
One solution is to save the absolute path of each image, or simply to adjust the relative path as needed.
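As a minimal sketch of the absolute-path option, the snippet below resolves a few relative pathnames against the current working directory. The pathnames are hypothetical placeholders, not files from the dataset.

```python
import os

# Hypothetical relative pathnames, shaped like the ones stored in the CSV
relative_paths = [
    '../input/some-dataset/train/Adam Sandler/1.jpg',
    '../input/some-dataset/train/Zac Efron/2.jpg',
]

# os.path.abspath resolves each path against the current working directory
absolute_paths = [os.path.abspath(path) for path in relative_paths]

for path in absolute_paths:
    print(path)
```

On a dataframe, the same idea could be applied with `dataset_df['image_pathname'].map(os.path.abspath)` before calling `to_csv`.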

## 1.5 Inspecting an image

In [None]:
import cv2

In [None]:
dataset_df.loc[0, 'image_pathname']

In [None]:
# read an image
img = cv2.imread(dataset_df.loc[0, 'image_pathname'])
print(type(img))
img.shape

In [None]:
# channel BLUE
img[:, :, 0]

In [None]:
# channel GREEN
img[:, :, 1]

In [None]:
# channel RED
img[:, :, 2]

In [None]:
img.min(), img.max()

In [None]:
import matplotlib.pyplot as plt

# cv2.imread returns BGR but matplotlib expects RGB, so the colors look swapped here
plt.imshow(img)

In [None]:
img_RGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
#img_RGB = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
plt.imshow(img_RGB)
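The BGR to RGB conversion only reorders the three channels. A minimal NumPy-only sketch on a toy two-pixel image makes this visible; for 3-channel `uint8` images, reversing the last axis gives the same pixel values as `cv2.COLOR_BGR2RGB`.

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel and one pure-red pixel
img_bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels: BGR -> RGB
img_rgb = img_bgr[..., ::-1]

print(img_rgb[0, 0])  # the blue pixel, now in RGB order
print(img_rgb[0, 1])  # the red pixel, now in RGB order
```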

In [None]:
img = cv2.imread(dataset_df.loc[500, 'image_pathname'])  # BGR
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert BGR to RGB
plt.imshow(img)

In [None]:
img.shape

## 1.6 Create the training dataset

In [None]:
import tensorflow as tf
tf.__version__

In [None]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

In [None]:
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

In [None]:
dataset_df

In [None]:
dataset_df["class"].unique()

In [None]:
class_names = sorted(dataset_df["class"].unique())
n_classes = len(class_names)

print(f'Number of classes: {n_classes}')
print(f'Classes: {class_names}')

In [None]:
# number of samples per class
dataset_df['class'].value_counts()

In [None]:
from sklearn.model_selection import train_test_split

# for a stratified sampling, we need to pass the labels
labels = dataset_df['class']

dataset_df_full_train, dataset_df_test = train_test_split(dataset_df, test_size=0.2, random_state=22, stratify=labels)

In [None]:
dataset_df_full_train.shape

In [None]:
dataset_df_full_train.head()

In [None]:
dataset_df_test.shape

In [None]:
# for a stratified sampling, we need to pass the labels
labels_full_train = dataset_df_full_train['class']
#labels = dataset_df['class']

dataset_df_train, dataset_df_val = train_test_split(dataset_df_full_train, train_size=0.8, random_state=22, stratify=labels_full_train)
#dataset_df_train, dataset_df_val = train_test_split(dataset_df, train_size=0.8, random_state=42, stratify=labels)

dataset_df_train['class'].value_counts()

In [None]:
# checking class balancing in the validation set
dataset_df_val['class'].value_counts()

In [None]:
dataset_df_test['class'].value_counts()
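The two stratified splits above can be checked on stand-in data. The sketch below replaces the dataframe with an array of integer labels (83 classes, 50 samples each, matching the balanced dataset) and reuses the same fractions and random seed. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels: 83 classes with 50 samples each, as in the balanced dataset
labels = np.repeat(np.arange(83), 50)
indices = np.arange(labels.size)

# First split: 80% full-train, 20% test
idx_full_train, idx_test = train_test_split(
    indices, test_size=0.2, random_state=22, stratify=labels)

# Second split: 80% of full-train for training, the rest for validation
idx_train, idx_val = train_test_split(
    idx_full_train, train_size=0.8, random_state=22,
    stratify=labels[idx_full_train])

print(len(idx_train), len(idx_val), len(idx_test))
```

The arithmetic is 4,150 x 0.2 = 830 test images, leaving 3,320 for full-train, of which 3,320 x 0.8 = 2,656 go to training and 664 to validation.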

## 1.7 Preprocessing the images

In [None]:
dataset_df.loc[0, 'image_pathname']

In [None]:
import cv2
import matplotlib.pyplot as plt

# BGR
# read the first image of the dataset (the pathname displayed in the previous cell)
img = cv2.imread(dataset_df.loc[0, 'image_pathname'])
# BGR ==> RGB
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

plt.imshow(img)

In [None]:
# aspect ratio = width / height (img.shape is (height, width, channels))
aspect_ratio = img.shape[1] / img.shape[0]
aspect_ratio

In [None]:
import cv2
from sklearn.preprocessing import LabelEncoder
import numpy as np


#def preprocess_faces_dataset(dataset_df, label_encoder: LabelEncoder, new_img_dims=(100, 100), verbose=0):
def preprocess_faces_dataset(dataset_df, label_encoder: LabelEncoder, new_img_dims=(224, 224), verbose=0):
    # load the images as a feature matrix
    image_list = []  # list of numpy arrays
    
    for index, img_path in enumerate(dataset_df['image_pathname']):
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        # image resizing
        # linear interpolation is a reasonable default for gray or color images
        img = cv2.resize(img, new_img_dims, interpolation=cv2.INTER_LINEAR)        
        image_list.append(img)
        
        if verbose and (index % verbose) == 0:
            print(f'{index + 1}/{dataset_df.shape[0]} - {img_path}')
    
    # numpy array 4D: n_imgs, height, width, n_channels
    X = np.array(image_list)
    
    # feature scaling
    # 4D numpy array (float64) with values within [0, 1]
    X = X / 255.0
    
    # encoding the classes
    # numpy array 1D with integer labels
    y = label_encoder.transform(dataset_df['class'])
    
    return X, y
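Dividing a `uint8` image array by `255.0` produces a `float64` array, which is why the preprocessed data is memory-hungry. The sketch below estimates the footprint of the training array. `array_nbytes` is a hypothetical helper, and the training-set size of 2,656 images is an assumption derived from the split fractions (4,150 x 0.8 x 0.8), not a value printed by the notebook.

```python
import numpy as np


def array_nbytes(n_images, height, width, channels, dtype):
    """Bytes needed for a dense (n_images, height, width, channels) array."""
    return n_images * height * width * channels * np.dtype(dtype).itemsize


# Dividing a uint8 array by a Python float yields float64
sample = np.zeros((2, 4, 4, 3), dtype=np.uint8) / 255.0
print(sample.dtype)

# Assumed training-set size: 4,150 images * 0.8 * 0.8 = 2,656
n_train = 2656
for dtype in (np.uint8, np.float32, np.float64):
    gib = array_nbytes(n_train, 224, 224, 3, dtype) / 2**30
    print(f'{np.dtype(dtype).name}: {gib:.2f} GiB')
```

If memory is tight, casting with `X.astype(np.float32)` before scaling would halve the footprint relative to `float64`. That is a possible variation, not what the notebook does.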

In [None]:
from tensorflow.keras.utils import Sequence
import numpy as np
import math
import cv2


class MyGenerator(Sequence):
    def __init__(self, dataset_df, label_encoder, batch_size, new_dims=(100, 100)):
        self.dataset_df = dataset_df
        self.label_encoder = label_encoder
        self.batch_size = batch_size
        self.new_dims = new_dims
    
    
    def __len__(self):
        n_samples = self.dataset_df.shape[0]
        
        return math.ceil(n_samples / float(self.batch_size))
    
    
    def __getitem__(self, idx):
        batch_begin = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size
        
        batch_df = self.dataset_df[batch_begin:batch_end]
        
        X_batch, y_batch = preprocess_faces_dataset(batch_df, self.label_encoder, self.new_dims, verbose=0)
        #X_batch, y_batch = preprocess_faces_dataset(batch_df, self.label_encoder, self.new_dims)
                
        return X_batch, y_batch
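The batch indexing used by the generator can be illustrated without Keras. `batch_bounds` is a hypothetical helper that lists the row ranges visited by `__getitem__`. The end index is clipped here for readability, whereas the class above relies on slicing to tolerate an out-of-range end.

```python
import math


def batch_bounds(n_samples, batch_size):
    """List the (begin, end) row ranges a batch generator would visit."""
    n_batches = math.ceil(n_samples / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]


bounds = batch_bounds(n_samples=10, batch_size=4)
print(bounds)
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.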

In [None]:
# training a Label Encoder from the train set
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(dataset_df_train['class'])

label_encoder.classes_
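For reference, a minimal sketch of how `LabelEncoder` maps class names to integers and back, using three of the celebrity names on a separate, throwaway encoder. Classes are stored in sorted order, so the integer codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['Zac Efron', 'Alec Baldwin', 'Claudia Schiffer', 'Alec Baldwin'])

# Unique classes, sorted alphabetically
print(list(encoder.classes_))

# String labels -> integer codes
codes = encoder.transform(['Alec Baldwin', 'Claudia Schiffer', 'Zac Efron'])
print(list(codes))

# Integer codes -> string labels
names = encoder.inverse_transform([2, 0])
print(list(names))
```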

In [None]:
#batch_size = 83

#training_batch_generator = MyGenerator(dataset_df_train, label_encoder, batch_size, new_dims=(100, 100))
#validation_batch_generator = MyGenerator(dataset_df_val, label_encoder, batch_size, new_dims=(100, 100))
#test_batch_generator = MyGenerator(dataset_df_test, label_encoder, batch_size, new_dims=(100, 100))

In [None]:
#test_batch_generator = MyGenerator(dataset_df_test, label_encoder, batch_size, new_dims=(100, 100))

In [None]:
# transform/map the string class to the trained numeric class
#label_encoder.transform(['Alec Baldwin', 'Claudia Schiffer', 'Zac Efron'])

In [None]:
# preprocessing the train set
#X_train, y_train = preprocess_faces_dataset(dataset_df_train, label_encoder)
#X_train, y_train = preprocess_faces_dataset(dataset_df_train, label_encoder, new_img_dims=(100, 100))
X_train, y_train = preprocess_faces_dataset(dataset_df_train, label_encoder, new_img_dims=(224, 224))

In [None]:
print(f'X_train.shape: {X_train.shape}')
print(f'y_train (classes): {np.unique(y_train)}')
print(f'y_train.shape: {y_train.shape}')

# rescaled 24-bit color image
print(f'Min. value of X_train: {X_train.min()}')
print(f'Max. value of X_train: {X_train.max()}\n')

In [None]:
import matplotlib.pyplot as plt
plt.imshow(X_train[0])

In [None]:

# preprocessing the validation set
#X_val, y_val = preprocess_faces_dataset(dataset_df_val, label_encoder)
#X_val, y_val = preprocess_faces_dataset(dataset_df_val, label_encoder, new_img_dims=(100, 100))
X_val, y_val = preprocess_faces_dataset(dataset_df_val, label_encoder, new_img_dims=(224, 224))

In [None]:

print(f'X_val.shape: {X_val.shape}')
print(f'y_val (classes): {np.unique(y_val)}')
print(f'y_val.shape: {y_val.shape}')

# rescaled 24-bit color image
print(f'Min. value of X_val: {X_val.min()}')
print(f'Max. value of X_val: {X_val.max()}\n')


In [None]:
import matplotlib.pyplot as plt
plt.imshow(X_val[0])

In [None]:
# preprocessing the test set
#X_test, y_test = preprocess_faces_dataset(dataset_df_test, label_encoder, new_img_dims=(100, 100))
X_test, y_test = preprocess_faces_dataset(dataset_df_test, label_encoder, new_img_dims=(224, 224))

In [None]:
import matplotlib.pyplot as plt
plt.imshow(X_test[0])

## 1.8 Saving the preprocessed data

In [None]:
import os

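out_dir = '../working/preprocessed'

os.makedirs(out_dir, exist_ok=True)

# NOTE: the '64x64x3' suffix in the .npy filenames below is a leftover name;
# the arrays saved here were resized to 224x224x3 in the preprocessing step.
dataset_df_full_train.to_csv(os.path.join(out_dir, 'full_train.csv'), index=False)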

dataset_df_train.to_csv(os.path.join(out_dir, 'train.csv'), index=False)
np.save(os.path.join(out_dir, 'train_data_64x64x3.npy'), X_train)
np.save(os.path.join(out_dir, 'train_labels.npy'), y_train)

dataset_df_val.to_csv(os.path.join(out_dir, 'validation.csv'), index=False)
np.save(os.path.join(out_dir, 'validation_data_64x64x3.npy'), X_val)
np.save(os.path.join(out_dir, 'validation_labels.npy'), y_val)

dataset_df_test.to_csv(os.path.join(out_dir, 'test.csv'), index=False)
np.save(os.path.join(out_dir, 'test_data_64x64x3.npy'), X_test)
np.save(os.path.join(out_dir, 'test_labels.npy'), y_test)


# 2 Training the model

## 2.1 Establish the base model for transfer learning (VGG16)

In [None]:
'''
# https://keras.io/api/applications/vgg/
# https://towardsdatascience.com/transfer-learning-with-vgg16-and-keras-50ea161580b4

from tensorflow.keras.applications import VGG16

base_model = VGG16(include_top=None,   # we will ignore the top layers that consists of the MLP classifier of VGG16
                   weights="imagenet", # we will use the weights learned for the ImageNet dataset
                   input_shape=(100, 100, 3))  # let's consider a smaller resolution than the original paper due to lack of memory


# freeze the base model weights ==> these weights won't be updated during training
# i.e., the weights of all layers from the base model are not updated
base_model.trainable = False
'''

In [None]:
#!pip install keras-vggface
!pip install git+https://github.com/rcmalli/keras-vggface.git
!pip install Keras-Applications

    
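# keras_vggface imports from `keras.engine.topology`; patch the installed file so that it
# imports from `tensorflow.keras.utils` instead (the path is specific to this kernel)
filename = "/opt/conda/lib/python3.10/site-packages/keras_vggface/models.py"
with open(filename) as f:
    text = f.read()
with open(filename, "w") as f:
    f.write(text.replace('keras.engine.topology', 'tensorflow.keras.utils'))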
import tensorflow as tf

from keras_vggface.vggface import VGGFace
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout



In [None]:

import tensorflow as tf

from keras_vggface.vggface import VGGFace
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout



#from keras_vggface.vggface import VGGFace

# Based on VGG16 architecture -> old paper(2015)
#vggface = VGGFace(model='vgg16') # or VGGFace() as default

# Based on RESNET50 architecture -> new paper(2017)
vggface = VGGFace(model='resnet50')

# Based on SENET50 architecture -> new paper(2017)
#vggface = VGGFace(model='senet50')


#from keras.engine import  Model
#from keras.layers import Input
#from keras_vggface.vggface import VGGFace

# Convolution Features
#vgg_features = VGGFace(include_top=False, input_shape=(100, 100, 3), pooling='avg') # pooling: None, avg or max
vgg_features = VGGFace(include_top=False, input_shape=(224, 224, 3), pooling='avg') # pooling: None, avg or max

# After this point you can use your model to predict.
# ...

#from keras.engine import  Model
#from keras.layers import Input
#from keras_vggface.vggface import VGGFace

# Layer Features
#layer_name = 'layer_name' # edit this line
#vgg_model = VGGFace() # pooling: None, avg or max
#out = vgg_model.get_layer(layer_name).output
#vgg_model_new = Model(vgg_model.input, out)

# After this point you can use your model to predict.
# ...

#from keras.engine import  Model
#from keras.layers import Flatten, Dense, Input
#from keras_vggface.vggface import VGGFace

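# custom parameters
nb_class = 83      # number of celebrities (classes)
hidden_dim = 128   # units in each hidden dense layer

#vgg_model = VGGFace(include_top=False, input_shape=(100, 100, 3))
# NOTE: no `model` argument is passed, so VGGFace falls back to its default
# backbone ('vgg16'); the ResNet50 instance created above is not used here.
vgg_model = VGGFace(include_top=False, input_shape=(224, 224, 3))
vgg_model.trainable = False  # freeze the pre-trained convolutional base
last_layer = vgg_model.get_layer('pool5').output

# new classification head trained on the celebrity dataset
x = Flatten(name='flatten')(last_layer)
x = Dense(hidden_dim, activation='relu', name='fc6')(x)
x = Dense(hidden_dim, activation='relu', name='fc7')(x)
x = Dropout(0.3)(x)
out = Dense(nb_class, activation='softmax', name='fc8')(x)
vggface_model = Model(vgg_model.input, out)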

# Train your model as usual.
# ...

'''
def define_vggface_model(input_shape, num_classes):
    # Load the VGGFace model
    vggface_model = VGGFace(model='vgg16', weights='vggface', include_top=False, input_shape=input_shape)

    # Freeze the layers of the VGGFace model
    for layer in vggface_model.layers:
       layer.trainable = False

    # Flatten the output of the VGGFace model
    x = Flatten()(vggface_model.output)
    #x = Flatten()(vggface_model.get_layer('avg_pool').output)

    # Add a fully connected layer with dropout
    #x = Dense(1024, activation='relu')(x)
    #x = Dense(512, activation='relu')(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.1)(x)

    # Add the output layer for the desired number of classes
    output = Dense(num_classes, activation='softmax')(x)

    # Create the model
    model = Model(inputs=vggface_model.input, outputs=output)

    return model

# Define the input shape of your images and the number of classes
input_shape = (100, 100, 3)
num_classes = 83

# Create the VGGFace transfer learning model
vggface_model = define_vggface_model(input_shape, num_classes)

'''

# Print a summary of the model architecture
vggface_model.summary()
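The number of trainable parameters reported by `summary()` can be cross-checked by hand, since only the new dense layers are trainable. The sketch below assumes that `pool5` outputs a 7 x 7 x 512 feature map for 224 x 224 inputs, as in the standard VGG16 architecture. Treat the resulting figures as an estimate to compare against the summary, not as the notebook's recorded output.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Assumption: 'pool5' outputs a 7 x 7 x 512 feature map for 224 x 224 inputs
flatten_dim = 7 * 7 * 512

fc6 = dense_params(flatten_dim, 128)  # flatten -> fc6
fc7 = dense_params(128, 128)          # fc6 -> fc7
fc8 = dense_params(128, 83)           # fc7 -> fc8 (83 classes)

print(f'fc6: {fc6:,}  fc7: {fc7:,}  fc8: {fc8:,}')
print(f'trainable parameters in the new head: {fc6 + fc7 + fc8:,}')
```

Almost all of the trainable weights sit in `fc6`, because the flattened feature map is large. The dropout layer adds no parameters.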

In [None]:
#base_model = VGGFace(model='vgg16', include_top=None,   # we will ignore the top layers that consists of the MLP classifier of VGG16
                   #weights='vggface', # we will use the weights learned for the ImageNet dataset
                   #input_shape=(100, 100, 3))  # let's consider a smaller resolution than the original paper due to lack of memory

#base_model.trainable = False
#vgg_model.trainable = False

In [None]:
#base_model.summary()

## 2.2 Define the connected model

In [None]:
'''
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense

from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.layers import RandomFlip, RandomRotation, RandomTranslation

model = Sequential([
    # our base model - feature extraction
    base_model,

    # data augmentation layers
#        RandomFlip("horizontal"),
#        RandomRotation(factor=0.1),
#        RandomTranslation(height_factor=0.1, width_factor=0.1),
        
        # CNN
#        Conv2D(filters=32, kernel_size=(1,1), activation='relu'),
#        MaxPool2D(pool_size=(1,1)),
#        Conv2D(filters=32, kernel_size=(1,1), activation='relu'),
 #       MaxPool2D(pool_size=(1,1)),
  
    
    Flatten(),
    
    # FC classifier
  Dense(512, activation='relu'),
    # Dense(256, activation='relu'),
 #  Dense(128, activation='relu'),
  #  Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(83, activation='softmax')
])
'''

In [None]:
vggface_model.summary()

In [None]:
#model.summary()

## 2.3 Compile and run the model

In [None]:
from tensorflow.keras.optimizers import Adam
#opt = Adam(learning_rate=0.001)
opt = Adam(learning_rate=0.0005)
#model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
vggface_model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

In [None]:
import tensorflow
early_stopping_cb = tensorflow.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
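To make the effect of `patience=3` concrete, here is a simplified pure-Python simulation of the stopping rule. It is a sketch, not the Keras implementation: `simulate_early_stopping` is a hypothetical helper that only tracks a monitored loss (Keras monitors `val_loss` by default) and ignores options such as `min_delta`. The loss values are made up.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: the improvement at the last epoch is never reached
val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(f'stopped after epoch {stop_epoch}, best epoch was {best_epoch}')
```

With `restore_best_weights=True`, the model ends up with the weights from the best epoch rather than from the epoch at which training stopped.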

In [None]:
history = vggface_model.fit(X_train, y_train, epochs=20, batch_size=83, validation_data=(X_val, y_val), callbacks=[early_stopping_cb])
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
#history = model.fit(training_batch_generator, epochs=20, validation_data=validation_batch_generator, callbacks=[early_stopping_cb],  use_multiprocessing=True, workers=16, max_queue_size=32)


#history = vggface_model.fit_generator(training_batch_generator, epochs=20, validation_data=validation_batch_generator, callbacks=[early_stopping_cb],  use_multiprocessing=True, workers=16, max_queue_size=32)
                    # Used for generator or keras.utils.Sequence input only
                   

In [None]:
from tensorflow.keras.utils import plot_model
# vertical
plot_model(vggface_model, show_shapes=True, show_layer_activations=True)

In [None]:
# creates a HDF5 file
vggface_model.save('../working/'+
    'transfer_learning_trained' +
    '_face_cnn_model.h5')

## 2.4 Visualizing the training history

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

history_df = pd.DataFrame(history.history)

In [None]:
history_df[['loss', 'val_loss']].plot(figsize=(8, 5))
plt.grid(True)
plt.xlabel('Epochs')
plt.ylabel('Loss')

history_df[['accuracy', 'val_accuracy']].plot(figsize=(8, 5))
plt.grid(True)
plt.xlabel('Epochs')
plt.ylabel('Accuracy')

In [None]:
# checking class balancing in the training set
#test_folder = '../input/ifsp-d3apl-2023-face-recognition/test/test/'

#dataset_df_test['class'].value_counts()

# Model evaluation

In [None]:
vggface_model.evaluate(X_test, y_test)
#model.evaluate(test_batch_generator)

In [None]:
y_test_proba = vggface_model.predict(X_test)
#y_test_proba = model.predict(test_batch_generator)
y_test_proba

In [None]:
y_test_pred = np.argmax(y_test_proba, axis=1)
y_test_pred
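The step from class probabilities to predicted labels is a row-wise `argmax`. A minimal sketch on made-up softmax outputs (3 images, 4 classes):

```python
import numpy as np

# Hypothetical softmax outputs for 3 images and 4 classes (each row sums to 1)
proba = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.90, 0.05, 0.03, 0.02],
])

pred = np.argmax(proba, axis=1)   # index of the largest probability per row
confidence = proba.max(axis=1)    # the probability assigned to that class

print(pred)
print(confidence)
```

The resulting integers are the codes produced by the label encoder, so they can be mapped back to names with `inverse_transform`.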

In [None]:
from sklearn.metrics import classification_report

y_test = label_encoder.transform(dataset_df_test['class'])
class_names = label_encoder.classes_

print(classification_report(y_test, y_test_pred, target_names=[name for name in class_names]))

# Generating the prediction file for submission

In [None]:
#import os

# listing the unlabeled test images used for the submission
test_folder = '../input/ifsp-d3apl-2023-face-recognition/test/test/'

img_test_list = sorted(os.listdir(test_folder))

img_test_full_paths = []
#img_pred_classes = []

for img_test in img_test_list:
    full_img_test_path = os.path.join(test_folder, img_test)
    img_test_full_paths.append(full_img_test_path)

# creating a dataframe to store the full pathnames of the test images
#import pandas as pd

dataset_sub_test = pd.DataFrame({
    'image_pathname': img_test_full_paths
   })

dataset_sub_test
    
    # print()
#print(img_test_list)
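One thing worth checking at this point is the ordering produced by `sorted(os.listdir(...))`, because the submission later assigns ids by position. Plain string sorting is lexicographic, which differs from numeric order when filenames are numbers without zero padding. The sketch below uses hypothetical filenames; the actual naming pattern of the test folder is not shown in this notebook, so this is a caveat to verify rather than a confirmed problem.

```python
from pathlib import Path

# Hypothetical filenames; the real test folder may follow a different pattern
filenames = ['1.jpg', '10.jpg', '2.jpg', '100.jpg']

lexicographic = sorted(filenames)
numeric = sorted(filenames, key=lambda name: int(Path(name).stem))

print(lexicographic)
print(numeric)
```

If the test files were named this way, deriving each id from its filename (for example `int(Path(name).stem)`) would be safer than relying on list position.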

In [None]:
dataset_sub_test['image_pathname'].value_counts()

In [None]:
import cv2
from sklearn.preprocessing import LabelEncoder
import numpy as np


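# NOTE: this redefinition shadows the earlier function and returns only X,
# since the submission images are unlabeled (label_encoder is kept but unused)
def preprocess_faces_dataset(dataset_df, label_encoder: LabelEncoder, new_img_dims=(224, 224), verbose=0):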
    # load the images as a feature matrix
    image_list = []  # list of numpy arrays
    
    for index, img_path in enumerate(dataset_df['image_pathname']):
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        # image resizing
        # for gray or color images, the linear interpolation sounds good
        img = cv2.resize(img, new_img_dims, interpolation=cv2.INTER_LINEAR)        
        image_list.append(img)
        
        if verbose and (index % verbose) == 0:
            print(f'{index + 1}/{dataset_df.shape[0]} - {img_path}')
    
    # numpy array 4D: n_imgs, height, width, n_channels
    X = np.array(image_list)
    
    # feature scaling
    # 4D numpy array (float64) with values within [0, 1]
    X = X / 255.0
    
    # encoding the classes
    # numpy array 1D with integer labels
    #y = label_encoder.transform(dataset_df['class'])
    
    return X

x_test=preprocess_faces_dataset(dataset_sub_test, label_encoder, new_img_dims=(224, 224))
x_test


In [None]:
import os

out_dir = '../working/preprocessed'

if not os.path.exists(out_dir):
    os.makedirs(out_dir)
    
dataset_sub_test.to_csv(os.path.join(out_dir, 'sub_test.csv'), index=False)
np.save(os.path.join(out_dir, 'test_data_224x224x3.npy'), x_test)


In [None]:
y_test_proba = vggface_model.predict(x_test)
y_test_proba

In [None]:
y_test_pred = np.argmax(y_test_proba, axis=1)
y_test_pred


In [None]:
len(y_test_pred)

In [None]:
# map the predicted integer labels back to the celebrity names,
# using the encoder that was fitted on the training set
inverted = label_encoder.inverse_transform(y_test_pred)

print(inverted)
len(inverted)

In [None]:
# image ids run from 1 to the number of test images (1,660)
image_id = list(range(1, len(y_test_pred) + 1))
len(image_id)



In [None]:
dataset_submission = pd.DataFrame({
    'image-id': image_id,
    'prediction': inverted
   })

dataset_submission

In [None]:
dataset_submission.to_csv('../working/prediction_400.csv', index=False)