Project Report for Artificial Intelligence 2:
    Gaurav Dongare
    Vinay Pawar
    
Under the Guidance of:
    Prof. Shreeganesh Thottempudi

Real-Time Face Emotion Recognition Using a Deep Learning Model: CNN (Convolutional Neural Network).

Topics:
1) Introduction
2) CNN History
3) CNN Architecture
4) CNN Use Case


Introduction:

Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision and pattern recognition. This report provides an overview of CNNs, their architecture, and their applications. It explains the underlying principles of CNNs, their layers, and how they are used for tasks such as image classification, object detection, and image segmentation.

What is CNN?

    A Convolutional Neural Network (CNN) is a type of neural network designed for grid-like data such as images. It slides small learned filters over the input to pick out the most informative parts, such as edges and textures. Because the same filter weights are reused at every position (weight sharing), a CNN needs far fewer parameters than a fully connected network of similar size, and it tolerates small shifts and noise in the input. CNNs are mainly used to understand images, for example to spot objects in pictures, and they have also been applied to text. They learn step by step: early layers detect simple patterns and later layers combine them into more complex ones. These features make CNNs both interesting and useful.


CNN History:

    The history of Convolutional Neural Networks (CNNs) is a journey of innovation in the world of artificial intelligence. Starting in the late 20th century, CNNs were initially inspired by the human visual system. In 1989, Yann LeCun and colleagues trained a convolutional network with backpropagation to recognize handwritten digits, and this line of work led to LeNet-5 in 1998, which paved the way for image recognition. However, due to limited computational resources, CNNs didn't gain much popularity until the 2010s, when the explosion of data and computing power allowed for their practical implementation. This era saw the rise of deep CNNs with remarkable achievements, like AlexNet winning the ImageNet competition in 2012. Further developments, such as VGGNet, GoogLeNet, and ResNet, continued to enhance CNN performance, enabling breakthroughs in image classification, object detection, and more. As of today, CNNs remain a cornerstone of modern AI, with applications spanning many industries.

CNN Architecture:
At the heart of a CNN, you have layers that do different tasks:

Input Layer:
This is where you feed in the image you want the CNN to understand.

Convolutional Layers:
These layers use special filters to find specific patterns in the image, like edges, corners, or textures. Imagine you're looking at a puzzle piece by piece.

Activation Layers:
After finding patterns, the network applies a non-linear function such as ReLU, which keeps positive responses and sets negative ones to zero. It's like telling the network, "Hey, this part is important, and this part is not."

Pooling Layers:
These layers shrink down the picture by keeping only the main information. Imagine you're looking at the bigger picture instead of every tiny detail.

Fully Connected Layers:
These layers understand what patterns the network found and make a guess about what the image is showing. It's like the network saying, "Based on all these things I've seen, I think this picture is a cat!"

Output Layer:
The final answer is given here. The network tells you its best guess about what's in the picture.

CNNs work by repeating these layers in a deep structure, learning more and more about the image as they go along. This architecture is designed to capture both simple and complex features, helping the network recognize intricate patterns in images.
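
The convolution, activation, and pooling steps described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: the 6x6 image and the hand-written vertical-edge filter are made up for the example, whereas a real CNN learns its filters during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like deep learning frameworks, this computes a cross-correlation:
    the kernel is not flipped before the element-wise products are summed.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive responses, set negative ones to zero."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """Pooling: keep the largest value in each non-overlapping 2x2 window."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]  # drop an odd last row/column, if any
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy image (assumed for illustration): dark left half, bright right half
image = np.array([[0, 0, 0, 1, 1, 1]] * 6, dtype=float)
# Hand-written vertical-edge filter (assumed; not a learned filter)
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

features = conv2d_valid(image, kernel)  # responds only where dark meets bright
activated = relu(features)
pooled = max_pool_2x2(activated)        # smaller map that still contains the edge

print(features.shape, pooled.shape)
print(pooled)
```

The filter responds only in the column where the image changes from dark to bright, and pooling keeps that response while shrinking the feature map.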


CNN Use Case:

The following use case is taken from Kaggle. It is used as a reference to understand the architecture of a CNN and its application in real time. 

The code begins by importing the TensorFlow library, which is a widely used open-source machine learning framework.

In [3]:
import tensorflow as tf

The paths to the directories containing training and testing data are specified here.

In [4]:
train_dir = "train"
test_dir = "test"

1) Data Preprocessing:
ImageDataGenerator is a class from TensorFlow's Keras module. It is used to preprocess and augment image data. The options provided include:

    rescale=1./255: This rescales the pixel values of the images to be between 0 and 1, which is a common preprocessing step.

    horizontal_flip=True: It randomly flips images horizontally, which augments the dataset and makes the model more robust.

    validation_split=0.2: It sets aside 20% of the data for validation during training.



In [6]:
dataGenerator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, horizontal_flip=True, validation_split=0.2)
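
To make the rescale and horizontal_flip options concrete, here is a minimal NumPy sketch of what the two operations do to a single image. The 2x3 pixel array is made up for illustration, and the flip is applied unconditionally here, whereas the generator applies it at random.

```python
import numpy as np

# Hypothetical 2x3 grayscale image with 8-bit pixel values (0..255)
image = np.array([[0, 51, 255],
                  [102, 204, 153]], dtype=np.uint8)

# Same arithmetic as rescale=1./255: every pixel is multiplied by 1/255
rescaled = image * (1.0 / 255)

# A horizontal flip mirrors the image left to right (column order reversed)
flipped = rescaled[:, ::-1]

print(rescaled)
print(flipped)
```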


2) Generating Training Data and Validation Data:

    train_dir is the directory path where the training images are stored.
    
    batch_size=64: It means that 64 images will be processed in each iteration.
    
    target_size=(48, 48): The images are resized to 48x48 pixels.
    
    shuffle=True: The data is shuffled randomly.
    
    color_mode='grayscale': Images are converted to grayscale.
    
    class_mode='categorical': The labels are categorical (one-hot encoded) indicating different classes.
    
    subset='training': It specifies that this data generator will be used for training.

    subset='validation': It indicates that this data generator will be used for validation.

In [8]:
training_data = dataGenerator.flow_from_directory(train_dir, batch_size=64, target_size=(48, 48), shuffle=True, color_mode='grayscale', class_mode='categorical', subset='training')
validation_set = dataGenerator.flow_from_directory(train_dir, batch_size=64, target_size=(48, 48), shuffle=True, color_mode='grayscale', class_mode='categorical', subset='validation')



Found 22968 images belonging to 7 classes.
Found 5741 images belonging to 7 classes.
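
The sketch below illustrates, in plain Python, how class_mode='categorical' turns a class folder into a one-hot label and how the two image counts above relate to validation_split=0.2. The folder names are an assumption made for the example (the report does not list them); flow_from_directory orders classes alphanumerically by folder name.

```python
# Assumed sub-directory names, one folder per emotion (hypothetical)
folders = ["happy", "angry", "sad", "fear", "surprise", "neutral", "disgust"]

# Classes are indexed in sorted (alphanumeric) folder order
class_indices = {name: index for index, name in enumerate(sorted(folders))}

def one_hot(index, num_classes):
    """Return a list with 1.0 at the class index and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

label = one_hot(class_indices["happy"], len(class_indices))
print(class_indices)
print(label)

# Share of the training directory that ended up in the validation subset
train_images, validation_images = 22968, 5741
validation_fraction = validation_images / (train_images + validation_images)
print(round(validation_fraction, 3))
```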


3) Generating Test Data:

    test_dir is the directory path where the test images are stored.
    
    The testDataGenerator object is similar to the one used for training and validation, with rescaling and horizontal flipping.
    
    The test data generator doesn't have the subset parameter because all test data is used.

In [7]:
testDataGenerator = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, horizontal_flip=True)

test_data = testDataGenerator.flow_from_directory(test_dir, batch_size=64, target_size=(48, 48), shuffle=True, color_mode='grayscale', class_mode='categorical')

Found 7178 images belonging to 7 classes.


4) Defining a function to create model:

    The create_model function defines a CNN architecture with multiple convolutional and pooling layers, dropout for regularization, and fully connected layers for classification. These layers are structured to learn and extract features from input images, gradually capturing more complex patterns as the network gets deeper.
    
    weight_decay is a parameter used to control regularization, which helps prevent overfitting. A Sequential model is created, which is a linear stack of layers.
    
    Conv2D layers perform convolutions on the input images. Here, 64 filters of size (4, 4) are applied in the first block. padding='same' pads the input so that the output has the same height and width as the input. 
    
    Activation('relu') applies the ReLU activation function to the output of the convolution.
    
    BatchNormalization() helps normalize the output, making training more stable and faster.
    
    MaxPool2D layers perform max pooling, reducing the spatial dimensions of the data while keeping important features.
    
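    Dropout(0.2) randomly "drops out" 20% of the units during training to prevent overfitting.
    
This pattern of convolution, activation, batch normalization, pooling, and dropout is repeated in three blocks: two 64-filter convolutions, then one 128-filter convolution, then two more 128-filter convolutions, with the dropout rate rising from 0.2 to 0.3 to 0.4.

    The Flatten layer reshapes the output into a 1D vector.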
    
    Dense layers are fully connected layers, where each neuron is connected to every neuron in the previous layer.
    
    The final layer has 7 neurons and uses softmax activation to predict the probability of each of the 7 classes (emotions).
    
    
    

In [6]:
def create_model():
    weight_decay = 1e-4
    model = tf.keras.models.Sequential()

    model.add(tf.keras.layers.Conv2D(64, (4, 4), padding='same', kernel_regularizer=tf.keras.regularizers.l2(weight_decay), input_shape=(48, 48, 1)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Conv2D(64, (4, 4), padding='same', kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.2))

    model.add(tf.keras.layers.Conv2D(128, (4, 4), padding='same', kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.3))
    
    model.add(tf.keras.layers.Conv2D(128, (4, 4), padding='same', kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Conv2D(128, (4, 4), padding='same', kernel_regularizer=tf.keras.regularizers.l2(weight_decay)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.4))
    
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation="linear"))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dense(7, activation='softmax'))
    
    return model
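
As a rough check on the architecture above, the following plain-Python sketch traces the tensor shape through the three blocks and counts the parameters of two layers. The helper functions are hypothetical and written only for this illustration; they assume stride-1 convolutions with padding='same' and 2x2 pooling with the default stride, as in create_model.

```python
def same_conv(height, width, filters):
    # padding='same' with stride 1 keeps height and width; channels become `filters`
    return height, width, filters

def max_pool(height, width, channels, pool=2):
    # 2x2 pooling with the default stride halves height and width (floor division)
    return height // pool, width // pool, channels

def conv_params(kernel, in_channels, filters):
    # one kernel x kernel x in_channels weight tensor plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

shape = (48, 48, 1)                      # grayscale input image

# Block 1: two 64-filter convolutions, then pooling
shape = same_conv(shape[0], shape[1], 64)
shape = same_conv(shape[0], shape[1], 64)
shape = max_pool(*shape)
after_block1 = shape

# Block 2: one 128-filter convolution, then pooling
shape = same_conv(shape[0], shape[1], 128)
shape = max_pool(*shape)
after_block2 = shape

# Block 3: two 128-filter convolutions, then pooling
shape = same_conv(shape[0], shape[1], 128)
shape = same_conv(shape[0], shape[1], 128)
shape = max_pool(*shape)
after_block3 = shape

flattened = shape[0] * shape[1] * shape[2]     # input size of the first Dense layer
first_conv_params = conv_params(4, 1, 64)      # first Conv2D layer
dense_params = flattened * 128 + 128           # Dense(128): weights plus biases

print(after_block1, after_block2, after_block3)
print(flattened, first_conv_params, dense_params)
```

Under these assumptions most of the trainable parameters sit in the first Dense layer, which is a common pattern in small CNNs of this kind.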

5) Creating Model
    The line of code below calls the previously defined create_model function to instantiate the CNN model. It means that the architecture described in the create_model function will be used to create the actual neural network.

    compile is a method used to configure the learning process of the model.

    loss='categorical_crossentropy' specifies the loss function to measure how well the model's predictions match the actual labels.

    optimizer=tf.keras.optimizers.Adam(0.0003) sets the optimizer used to adjust the model's parameters during training. Here, Adam optimizer with a learning rate of 0.0003 is chosen.

    metrics=['accuracy'] defines the metric to evaluate the model's performance during training. In this case, accuracy (the proportion of correctly predicted cases) is chosen as the metric.

In [7]:
model = create_model()

model.compile(loss='categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(0.0003), metrics=['accuracy'])
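
To show what the categorical cross-entropy loss measures, here is a small sketch using only the standard library. It is a simplified version of the idea, not Keras's implementation (which, among other things, clips probabilities for numerical stability), and the logit values are made up.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_label, probabilities) if t > 0)

true_label = [0, 0, 0, 1, 0, 0, 0]           # class index 3 is the true class

# Hypothetical network outputs before softmax
undecided = softmax([0.0] * 7)                # every class equally likely
confident = softmax([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])

loss_undecided = categorical_crossentropy(true_label, undecided)
loss_confident = categorical_crossentropy(true_label, confident)

print(loss_undecided)   # equals ln(7) for a uniform guess over 7 classes
print(loss_confident)   # much smaller, because the true class gets most of the probability
```

The optimizer adjusts the weights to reduce this loss, which pushes probability toward the correct class.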

The checkpointer is a list of two callbacks: 
EarlyStopping stops training when the validation accuracy plateaus, and ModelCheckpoint saves the best model weights during training. These callbacks enhance the training process, preventing overfitting and allowing you to restore the best-performing model.

The code below uses the following attributes:

EarlyStopping is a callback that monitors a specific metric during training and stops the training process if that metric stops improving.

monitor='val_accuracy' indicates that the validation accuracy will be monitored.

verbose=1 means that messages will be printed to the console to provide information about the training process.

restore_best_weights=True restores the model's weights to the best state when training is stopped.

mode="max" specifies that the metric should be maximized (in this case, validation accuracy).

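patience=10 means that if the validation accuracy doesn't improve for 10 consecutive epochs, the training will be stopped. With training limited to 10 epochs in this report, this limit cannot be reached, because the first epoch always counts as an improvement.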


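ModelCheckpoint is a callback that saves the model during training.

filepath='final_model_weights.hdf5' specifies the file where the best model will be saved. Despite the file name, save_weights_only is left at its default of False, so the full model (architecture and weights) is written, which is why it can later be restored with tf.keras.models.load_model().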

monitor="val_accuracy" means that the validation accuracy will be monitored to decide when to save the weights.

verbose=1 provides information about the saving process.

save_best_only=True indicates that only the weights associated with the best validation accuracy will be saved.

mode="max" specifies that the metric (validation accuracy) should be maximized.

In [8]:
checkpointer = [tf.keras.callbacks.EarlyStopping(monitor = 'val_accuracy', verbose = 1, restore_best_weights=True, mode="max",patience = 10),
                tf.keras.callbacks.ModelCheckpoint(
                    filepath='final_model_weights.hdf5',
                    monitor="val_accuracy",
                    verbose=1,
                    save_best_only=True,
                    mode="max")]
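
The patience logic can be illustrated with a short plain-Python simulation. This is a simplified imitation of the behaviour described above for mode="max", not the Keras implementation, and the accuracy values are hypothetical.

```python
def simulate_early_stopping(val_accuracies, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None when training runs through all epochs.
    """
    best_value = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best_value:            # strictly better than the best so far
            best_value = value
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation accuracies, one value per epoch
history_values = [0.30, 0.45, 0.50, 0.49, 0.50, 0.48]

stop_epoch, best_epoch = simulate_early_stopping(history_values, patience=3)
print(stop_epoch, best_epoch)

# With a patience larger than the number of epochs, training is never cut short
never_stopped = simulate_early_stopping(history_values, patience=10)
print(never_stopped)
```

With restore_best_weights=True, the weights from best_epoch would be the ones kept after stopping.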

6) Training Model

    The code below trains the model using the training data and validates its performance using the validation data. The number of steps and epochs is controlled, and the defined callbacks are utilized to enhance the training process and monitor the model's progress.

    model.fit() trains the model using the provided data.

    x=training_data specifies the training data to be used.

    validation_data=validation_set indicates the validation data to be used to check the model's performance.

    epochs=10 sets the number of times the model will go through the entire training dataset.

    callbacks=[checkpointer] passes the callbacks defined earlier (early stopping and model checkpointing). Since checkpointer is already a list, callbacks=checkpointer would also work and avoids the nested list.

    steps_per_epoch=steps_per_epoch determines how many batches of data are processed in each epoch.

    validation_steps=validation_steps sets the number of batches used for validation in each epoch.

In [9]:
# Number of steps (batches) needed for each epoch of training and validation:
# the number of samples in the dataset divided by the batch size, rounded down.
steps_per_epoch = training_data.n // training_data.batch_size
validation_steps = validation_set.n // validation_set.batch_size

history = model.fit(x=training_data,
                 validation_data=validation_set,
                 epochs=10,
                 callbacks=[checkpointer],
                 steps_per_epoch=steps_per_epoch,
                 validation_steps=validation_steps)

Epoch 1/10
Epoch 1: val_accuracy improved from -inf to 0.26650, saving model to final_model_weights.hdf5
Epoch 2/10
Epoch 2: val_accuracy improved from 0.26650 to 0.42591, saving model to final_model_weights.hdf5
Epoch 3/10
Epoch 3: val_accuracy improved from 0.42591 to 0.47946, saving model to final_model_weights.hdf5
Epoch 4/10
Epoch 4: val_accuracy improved from 0.47946 to 0.50878, saving model to final_model_weights.hdf5
Epoch 5/10
Epoch 5: val_accuracy improved from 0.50878 to 0.53178, saving model to final_model_weights.hdf5
Epoch 6/10
Epoch 6: val_accuracy improved from 0.53178 to 0.53652, saving model to final_model_weights.hdf5
Epoch 7/10
Epoch 7: val_accuracy improved from 0.53652 to 0.54951, saving model to final_model_weights.hdf5
Epoch 8/10
Epoch 8: val_accuracy improved from 0.54951 to 0.57128, saving model to final_model_weights.hdf5
Epoch 9/10
Epoch 9: val_accuracy improved from 0.57128 to 0.58445, saving model to final_model_weights.hdf5
Epoch 10/10
Epoch 10: val_accuracy did not improve from 0.58445
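
The step counts used in the training call follow directly from the image counts reported by the data generators and the batch size of 64. A quick arithmetic sketch:

```python
batch_size = 64
train_images = 22968        # reported for the training subset
validation_images = 5741    # reported for the validation subset

# Integer division keeps only the full batches
steps_per_epoch = train_images // batch_size
validation_steps = validation_images // batch_size

# Images left over after the last full batch
train_remainder = train_images - steps_per_epoch * batch_size
validation_remainder = validation_images - validation_steps * batch_size

print(steps_per_epoch, validation_steps)
print(train_remainder, validation_remainder)
```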


The code below calculates and displays the accuracy of the trained model on the test dataset, giving an indication of how well the model generalizes to new, unseen data.

In [None]:
# The generator already yields batches, so batch_size is not passed to evaluate().
test_loss, test_accuracy = model.evaluate(test_data, steps=test_data.n // test_data.batch_size)
print(f"Test accuracy = {test_accuracy * 100}%")

Test accuracy = 58.03571343421936%


This code captures real-time webcam frames, detects faces, resizes and preprocesses face images, uses the trained CNN model to predict emotions, and overlays rectangles and labels on the frames to show the recognized emotions. The code provides an interactive way to visualize the model's performance in real time.

In [None]:
# Code for real-time prediction

# Import the necessary libraries: cv2 for image processing, tensorflow for
# using the trained model, and numpy for numerical operations.
import cv2
import tensorflow as tf
import numpy as np

# class_names is a list of emotion labels that the model can predict.
# The order must match the class indices used during training.
class_names = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

# Load the trained model saved by the ModelCheckpoint callback.
model = tf.keras.models.load_model('final_model_weights.hdf5')

# Initialize a video capture object to access the webcam feed
# (0 indicates the default camera).
video = cv2.VideoCapture(0)

# faceDetect is an instance of CascadeClassifier from OpenCV, used for face
# detection. The XML file is expected in the working directory.
faceDetect = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

while True:
    # Capture a frame from the video feed; stop if no frame is returned.
    ret, frame = video.read()
    if not ret:
        break

    # Convert the frame to grayscale. detectMultiScale detects faces in the
    # grayscale frame and returns their coordinates.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = faceDetect.detectMultiScale(gray, 1.3, 3)

    # Iterate through each detected face.
    for x, y, w, h in faces:
        # Extract the face region and resize it to 48x48 pixels.
        sub_face_img = gray[y : y + h, x : x + w]
        resized = cv2.resize(sub_face_img, (48, 48))
        # Normalize pixel values to the range 0..1 and reshape the image
        # to match the model's input shape.
        normalize = resized / 255.0
        reshaped = np.reshape(normalize, (1, 48, 48, 1))
        # Predict the emotion; label stores the index of the predicted emotion.
        result = model.predict(reshaped)
        label = np.argmax(result, axis=1)[0]
        print(label)

        # Draw rectangles around the detected face with cv2.rectangle, add a
        # filled banner above it, and write the predicted emotion with cv2.putText.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 1)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (50, 50, 255), 2)
        cv2.rectangle(frame, (x, y - 40), (x + w, y), (50, 50, 255), -1)
        cv2.putText(frame, class_names[label], (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)

    # Display the processed frame with rectangles and labels in a window.
    cv2.imshow("Frame", frame)
    # The loop continues until the user presses the 'q' key.
    k = cv2.waitKey(1) & 0xFF
    if k == ord('q'):
        break

# Release the video capture and close the windows.
video.release()
cv2.destroyAllWindows()
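
The per-face preprocessing and label lookup can be tried without a webcam or a trained model. In the sketch below the face crop is replaced by random pixels and the model output by a made-up probability vector, so it only illustrates the shapes and the argmax step, not real predictions.

```python
import numpy as np

class_names = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

# Stand-in for a resized 48x48 grayscale face crop (random pixels, not a real face)
rng = np.random.default_rng(0)
resized = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)

# Same preprocessing as in the loop: scale to 0..1, then add the batch and
# channel dimensions expected by the model
normalized = resized / 255.0
batch = normalized.reshape(1, 48, 48, 1)

# Hypothetical model output: one row with a probability for each of the 7 classes
result = np.array([[0.05, 0.02, 0.08, 0.60, 0.15, 0.07, 0.03]])

label = int(np.argmax(result, axis=1)[0])   # index of the most probable class
emotion = class_names[label]
confidence = float(result[0, label])

print(batch.shape, emotion, confidence)
```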

In conclusion, the project involves the implementation of a real-time emotion recognition system using a Convolutional Neural Network (CNN). The primary goal of the project is to accurately detect and label emotions in faces captured through a webcam feed. The project integrates various components, including data preprocessing, model architecture design, training, and real-time inference.

The key components of the project are as follows:

Data Preprocessing:
The project begins by preprocessing image data using TensorFlow's ImageDataGenerator. The images are rescaled to a common range, augmented with horizontal flips, and split into training and validation sets. This preprocessing enhances the model's ability to generalize and perform well on new, unseen data.

CNN Model Architecture:
The CNN model architecture is designed to capture meaningful features from face images. It comprises multiple convolutional layers for feature extraction, activation functions to introduce non-linearity, batch normalization to stabilize training, max pooling for spatial reduction, and dropout layers for regularization. The model culminates with fully connected layers for classification into emotion categories.

Training and Evaluation:
The model is compiled with appropriate loss, optimizer, and metrics settings. It is then trained using the prepared training data, with callbacks such as EarlyStopping and ModelCheckpoint to prevent overfitting and save the best model weights. After training, the model's accuracy and performance are evaluated on a separate test dataset.

Real-Time Emotion Recognition:
To demonstrate the model's capability, a real-time emotion recognition system is developed using OpenCV for webcam access. The system detects faces, preprocesses them, feeds them into the trained model, and overlays emotion labels on the frames. This interactive application provides a practical way to showcase the model's performance in real-world scenarios.

In essence, the project delivers an end-to-end emotion recognition system. By combining deep learning techniques with real-time video processing, the system is able to predict and display emotions in real time. After 10 epochs of training, the model reaches about 58% accuracy on the test set, which suggests there is still room for improvement. This project serves as an example of how CNNs can be applied to real-world applications, showcasing their power in understanding and interpreting visual data.

References:

Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN Variants for Computer Vision: History, Architecture, Application, Challenges and Future Scope. Electronics 2021, 10, 2470. https://doi.org/10.3390/electronics10202470

https://www.kaggle.com/code/mihaililie/real-time-face-emotion-recognition-with-tensorflow

https://www.ibm.com/topics/convolutional-neural-networks

https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/

https://medium.com/analytics-vidhya/building-a-real-time-emotion-detector-towards-machine-with-e-q-c20b17f89220