<a href="https://colab.research.google.com/github/NadiaHolmlund/Semester_Project/blob/main/ver_3_Facial_Emotion_Recognition_(FER).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Facial Emotion Recognition (FER) with the Vision Transformer (ViT) by Google Brain

This notebook fine-tunes a pre-trained Vision Transformer (ViT) on the FER2013 dataset. The [dataset](https://www.kaggle.com/datasets/deadskull7/fer2013) is a collection of 35,887 grayscale face images of 48x48 pixels, divided into 7 classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral). The training set consists of 28,709 images, while the validation and test sets consist of 3,589 images each.

The Vision Transformer (ViT) is similar to BERT, but it operates on images rather than text: an image is split into fixed-size patches that are fed to a Transformer encoder as a sequence. According to the [paper](https://arxiv.org/abs/2010.11929) on ViT, it attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
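
The checkpoint used later in this notebook, `google/vit-base-patch16-224-in21k`, indicates 16x16 patches and 224x224 inputs in its name. The following is a small sketch of the resulting sequence length, assuming square, non-overlapping patches plus one classification token:

```python
# Patch arithmetic for a ViT with 16x16 patches on 224x224 inputs
image_size = 224   # input resolution, taken from the checkpoint name
patch_size = 16    # side length of each square patch, taken from the checkpoint name

patches_per_side = image_size // patch_size   # patches along one image edge
num_patches = patches_per_side ** 2           # patches per image
sequence_length = num_patches + 1             # one extra position for the [CLS] token

print(patches_per_side, num_patches, sequence_length)
```

The extra position is the CLS token whose output the classifier head reads further down.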

In the notebook, the data is prepared using 🤗 [datasets](https://github.com/huggingface/datasets) and the model is trained using the 🤗 [Trainer](https://huggingface.co/transformers/main_classes/trainer.html).

The process is inspired by a tutorial by Niels Rogge, ML engineer at 🤗 [HuggingFace](https://huggingface.co), who fine-tuned ViT on the CIFAR-10 dataset using HuggingFace's [Trainer](https://huggingface.co/transformers/main_classes/trainer.html). The tutorial can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer).

# Imports

In [1]:
# Pip installs
!pip install -q transformers==4.28.0 # Installing version 4.28.0 to circumvent an issue with Accelerator and the introduction of PartialState in later versions
!pip install -q transformers datasets
!pip install -q mlflow
!pip install -q pyngrok

In [25]:
# Libraries
from datasets import *
from transformers import ViTImageProcessor
from transformers import ViTModel, ViTConfig
from transformers import PreTrainedModel
from transformers import TrainingArguments, Trainer
from transformers.modeling_outputs import SequenceClassifierOutput
import numpy as np
import pandas as pd 
import torch.nn as nn
import pickle
from matplotlib import pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.metrics import confusion_matrix
import mlflow
from pyngrok import ngrok
from getpass import getpass

# Connecting to Google Drive

At 301 MB, the dataset exceeds GitHub's file-size limit, so it is loaded from Google Drive (requires personal access).

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
%cd /content/gdrive/MyDrive/Semester_Project

/content/gdrive/MyDrive/Semester_Project


# Setting up MLflow for experiment tracking

## Setting up the MLflow UI

In [None]:
# run tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")

# Terminate open tunnels if any exist
ngrok.kill()

In [None]:
import os

google_drive_path = "/content/gdrive/MyDrive/Semester_Project/mlruns"
mlflow_tracking_uri = f"file://{google_drive_path}"

os.environ["MLFLOW_TRACKING_URI"] = mlflow_tracking_uri

In [None]:
# Log in at ngrok.com and get your authtoken from https://dashboard.ngrok.com/auth
# Enter the authtoken when prompted
NGROK_AUTH_TOKEN = getpass('Enter the ngrok authtoken: ')
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

Enter the ngrok authtoken: ··········




MLflow Tracking UI: https://6164-34-91-191-89.ngrok-free.app


## Setting up new experiment
Note: Only run this section if setting up a new experiment

In [5]:
experiment_name = "saving_strategy_mlruns"
run_name = "experiment_1"

In [6]:
# Only run this code if creating a whole new experiment (not just a new run)
#mlflow.create_experiment(experiment_name)

In [7]:
# Get the experiment ID for the experiment with the specified name
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id

In [8]:
# Start an MLflow run in the selected experiment
mlflow.start_run(run_name=run_name, nested=True, experiment_id=experiment_id)

<ActiveRun: >

# Loading the dataset



In [9]:
fer_df = pd.read_csv("/content/gdrive/MyDrive/Semester_Project/FER2013.csv")  # available on kaggle

In [10]:
fer_df.head()

Unnamed: 0,emotion,pixels,Usage
0,0,70 80 82 72 58 58 60 63 54 58 60 48 89 115 121...,Training
1,0,151 150 147 155 148 133 111 140 170 174 182 15...,Training
2,2,231 212 156 164 174 138 161 173 182 200 106 38...,Training
3,4,24 32 36 30 32 23 19 20 30 41 21 22 32 34 21 1...,Training
4,6,4 0 0 0 0 0 0 0 0 0 0 0 3 15 23 28 48 50 58 84...,Training


In [11]:
fer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35887 entries, 0 to 35886
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   emotion  35887 non-null  int64 
 1   pixels   35887 non-null  object
 2   Usage    35887 non-null  object
dtypes: int64(1), object(2)
memory usage: 841.2+ KB


# Preprocessing

In [12]:
# Defining the labels for emotions in the dataset
string_labels = ['Anger', 'Disgust', 'Fear', 'Happiness', 'Sadness', 'Surprise', 'Neutral']

In [13]:
# Loading the ViT image processor from HuggingFace
# The processor resizes every image to the resolution the model expects (224x224) and normalizes the channels
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')

Downloading (…)rocessor_config.json:   0%|          | 0.00/160 [00:00<?, ?B/s]

First preprocessing step

In [14]:
def prepare_fer_data(data):
    """ Prepare FER data for the vision transformer
        input: FER dataframe loaded from csv
        output: dataframe that can be loaded into a HuggingFace dataset """

    # outputs
    image_list = []
    image_labels = list(map(int, data['emotion']))
    
    # go over all images
    for row in data.index:
        image = np.fromstring(data.loc[row, 'pixels'], dtype=int, sep=' ')
        image = np.reshape(image, (48, 48))
        # adapt grayscale to rgb format (change single values to triplets of the same value)
        image = image[..., np.newaxis]
        image = np.repeat(image, 3, axis=2)
        # convert to list format used by the later functions
        image = image.astype(int).tolist()
        # save to output
        image_list.append(image)

    output_df = pd.DataFrame(list(zip(image_list, image_labels)),
               columns =['img', 'label'])
        
    return output_df

In [15]:
fer_train_df = prepare_fer_data(fer_df[fer_df['Usage']=='Training'].sample(n = 50))
fer_test_df = prepare_fer_data(fer_df[fer_df['Usage']=='PrivateTest'].sample(n = 5))
fer_val_df = prepare_fer_data(fer_df[fer_df['Usage']=='PublicTest'].sample(n = 5))

In [16]:
fer_train_df.head()

Unnamed: 0,img,label
0,"[[[192, 192, 192], [97, 97, 97], [123, 123, 12...",0
1,"[[[98, 98, 98], [100, 100, 100], [105, 105, 10...",4
2,"[[[109, 109, 109], [99, 99, 99], [93, 93, 93],...",0
3,"[[[15, 15, 15], [15, 15, 15], [15, 15, 15], [1...",1
4,"[[[43, 43, 43], [51, 51, 51], [78, 78, 78], [9...",3


In [17]:
print(len(fer_train_df))
print(len(fer_test_df))
print(len(fer_val_df))

50
5
5


In [18]:
train_ds = Dataset.from_pandas(fer_train_df)
val_ds = Dataset.from_pandas(fer_val_df)
test_ds = Dataset.from_pandas(fer_test_df)

print(train_ds)
print(val_ds)
print(test_ds)

Dataset({
    features: ['img', 'label'],
    num_rows: 50
})
Dataset({
    features: ['img', 'label'],
    num_rows: 5
})
Dataset({
    features: ['img', 'label'],
    num_rows: 5
})


In [19]:
print(len(train_ds))
print(len(val_ds))
print(len(test_ds))

50
5
5


In [20]:
# image size 
np.array(train_ds[0]["img"]).shape

(48, 48, 3)
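
The `(48, 48, 3)` shape follows from repeating the single grayscale channel three times. The same conversion can be sketched on a toy 2x2 image; the pixel string below is made up for illustration and only mimics the space-separated format of the `pixels` column:

```python
import numpy as np

# Hypothetical 2x2 "image" in the same space-separated format as the `pixels` column
pixel_string = "0 64 128 255"

# Parse the string into a 2x2 grayscale array
gray = np.array([int(p) for p in pixel_string.split()], dtype=np.uint8).reshape(2, 2)

# Add a trailing channel axis and copy the gray value into R, G and B
rgb = np.repeat(gray[..., np.newaxis], 3, axis=2)   # (H, W) -> (H, W, 3)

print(rgb.shape)            # (2, 2, 3)
print(rgb[0, 1].tolist())   # [64, 64, 64]
```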

Second preprocessing step using the ViT feature extractor

In [21]:
def preprocess_images(examples):
    """ Prepare datasets for the vision transformer
    input: dataset with images in their original size
    output: dataset with the pixel values computed by the image processor added """
    # get batch of images
    images = examples['img']
    # convert to list of NumPy arrays of shape (C, H, W)
    images = [np.array(image, dtype=np.uint8) for image in images]
    images = [np.moveaxis(image, source=-1, destination=0) for image in images]
    # preprocess and add pixel_values
    inputs = processor(images=images)
    examples['pixel_values'] = inputs['pixel_values']

    return examples

In [22]:
# features of the new dataset, with an additional column for the preprocessed 3x224x224 images
features = Features({
    'label': ClassLabel(names=['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']),
    'img': Array3D(dtype="int64", shape=(3,48,48)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})

preprocessed_train_ds = train_ds.map(preprocess_images, batched=True, batch_size=1, features=features)
#with open('preprocessed_train_ds.pickle', 'wb') as handle:
#    pickle.dump(preprocessed_train_ds, handle, protocol=pickle.HIGHEST_PROTOCOL)
preprocessed_val_ds = val_ds.map(preprocess_images, batched=True, features=features)
#with open('preprocessed_val_ds.pickle', 'wb') as handle:
#    pickle.dump(preprocessed_val_ds, handle, protocol=pickle.HIGHEST_PROTOCOL)
preprocessed_test_ds = test_ds.map(preprocess_images, batched=True, features=features)
#with open('preprocessed_test_ds.pickle', 'wb') as handle:
#    pickle.dump(preprocessed_test_ds, handle, protocol=pickle.HIGHEST_PROTOCOL)

preprocessed_train_ds

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Dataset({
    features: ['label', 'img', 'pixel_values'],
    num_rows: 50
})

In [23]:
# final image size
print(len(preprocessed_train_ds[0]["pixel_values"]))       
print(len(preprocessed_train_ds[0]["pixel_values"][0]))     
print(len(preprocessed_train_ds[0]["pixel_values"][0][0]))  

3
224
224
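
The `pixel_values` are floats rather than 0-255 integers because the processor rescales and normalizes each channel after resizing. The sketch below reproduces only that last step in NumPy. The rescale factor, mean and standard deviation are assumptions about this checkpoint's processor configuration, not values read from it here; the actual settings can be inspected by printing `processor`.

```python
import numpy as np

# Assumed processor settings (not read from the checkpoint in this sketch):
# rescale by 1/255, then normalize every channel with mean 0.5 and std 0.5
RESCALE_FACTOR = 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5

def normalize(pixels):
    """Map uint8 pixel values to the normalized float range."""
    scaled = pixels.astype(np.float32) * RESCALE_FACTOR
    return (scaled - IMAGE_MEAN) / IMAGE_STD

raw = np.array([0, 255], dtype=np.uint8)   # darkest and brightest possible pixel
normalized = normalize(raw)
print(normalized)
```

Under these assumptions black maps to about -1 and white to about 1, which is why the plotting cells near the end of the notebook clip negative values before calling `imshow`.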


## Defining the model

The model architecture is defined in PyTorch, with dropout and a linear layer added on top of the ViT model's output of the special CLS token representing the input picture. 
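
As a standalone illustration of that head, the sketch below applies dropout and a linear layer to the first token of a random tensor. The tensor is a hypothetical stand-in for the ViT output, and the hidden size of 768 and sequence length of 197 are assumed values for ViT-Base at 224x224 with 16x16 patches.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 7   # assumed ViT-Base width; 7 emotion classes
batch_size, seq_len = 2, 197       # assumed: 196 patches + 1 [CLS] token

# Random stand-in for `outputs.last_hidden_state` from the ViT encoder
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Dropout followed by a linear layer, as in the classification head
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_labels))
head.eval()   # disable dropout for a deterministic forward pass

cls_embedding = last_hidden_state[:, 0]   # first token of every sequence: (batch, hidden)
logits = head(cls_embedding)              # one score per emotion class: (batch, num_labels)
print(logits.shape)
```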


In [34]:
class ViTForImageClassification(PreTrainedModel):
    #define architecture
    def __init__(self, config, num_labels=len(string_labels)):
        super(ViTForImageClassification, self).__init__(config)
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224-in21k')
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.vit.config.hidden_size, num_labels)
        self.num_labels = num_labels

    #define a forward pass through that architecture + loss computation
    def forward(self, pixel_values, labels=None):
        outputs = self.vit(pixel_values=pixel_values)
        output = self.dropout(outputs.last_hidden_state[:, 0])
        logits = self.classifier(output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Training uses the standard HuggingFace [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) interface. 

In [35]:
metric_name = "accuracy"

args = TrainingArguments(
    "HF_Training_Log",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    logging_dir='HF_Training_Log',
)

In [36]:
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
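
What `compute_metrics` calculates can be checked by hand on a toy batch. The logits and labels below are invented for illustration, and the sketch uses plain NumPy instead of the `accuracy` metric object:

```python
import numpy as np

# Made-up logits for 4 samples and 3 classes, with made-up reference labels
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 0.1, 1.5],
                   [0.9, 1.1, 0.0],
                   [0.4, 0.3, 0.2]])
labels = np.array([0, 2, 0, 1])

predicted = logits.argmax(axis=1)                # highest-scoring class per sample
accuracy = float((predicted == labels).mean())   # fraction of correct predictions

print(predicted.tolist(), accuracy)   # [0, 2, 1, 0] 0.5
```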

In [37]:
config = ViTConfig.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTForImageClassification(config)

Downloading (…)lve/main/config.json:   0%|          | 0.00/502 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/346M [00:00<?, ?B/s]

In [38]:
trainer = Trainer(
    model = model,
    args = args,
    train_dataset = preprocessed_train_ds,
    eval_dataset = preprocessed_val_ds,
    compute_metrics = compute_metrics,
)

## Fine-tuning ViT


The model is fine-tuned by calling the `train()` method.

In [39]:
trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.989853,0.2
2,No log,1.975646,0.2
3,No log,1.968999,0.2
4,No log,1.958616,0.2
5,No log,1.954828,0.2
6,No log,1.953366,0.2


TrainOutput(global_step=24, training_loss=1.7830179532368977, metrics={'train_runtime': 560.1733, 'train_samples_per_second': 0.536, 'train_steps_per_second': 0.043, 'total_flos': 0.0, 'train_loss': 1.7830179532368977, 'epoch': 6.0})
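
One way to read the table above: a 7-class classifier that assigns equal probability to every class has a cross-entropy of ln 7, and the validation losses of roughly 1.95 to 1.99 sit close to that value. With only 50 training and 5 validation images in this run, an accuracy of 0.2 corresponds to a single correct image. A quick check of the chance-level reference values:

```python
import math

num_classes = 7

# Cross-entropy of a uniform prediction over all classes: -ln(1/7) = ln(7)
chance_loss = math.log(num_classes)

# Expected accuracy when guessing uniformly at random
chance_accuracy = 1 / num_classes

print(round(chance_loss, 4), round(chance_accuracy, 4))   # 1.9459 0.1429
```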

## MLflow

In [40]:
# Generate a unique model filename based on the run name
model_filename = f"Model_{run_name}"
model_filename

'Model_experiment_1'

In [41]:
experiment_id = mlflow.active_run().info.experiment_id
experiment_id

'355554225098101950'

In [42]:
run_id = mlflow.active_run().info.run_id
run_id

'71e8bc2658ca43339a0f42befc39de4f'

In [43]:
model.save_pretrained(f"/content/gdrive/MyDrive/Semester_Project/mlruns/{experiment_id}/{run_id}/artifacts/{model_filename}")

In [None]:
# Generate unique artifact names for the datasets based on the run name
preprocessed_train_ds_filename = f"Preprocessed_train_ds_{run_name}"
preprocessed_val_ds_filename = f"Preprocessed_val_ds_{run_name}"
preprocessed_test_ds_filename = f"Preprocessed_test_ds_{run_name}"

mlflow.log_artifact(local_path="/content/gdrive/MyDrive/Semester_Project/preprocessed_train_ds.pickle", artifact_path=preprocessed_train_ds_filename)
mlflow.log_artifact(local_path="/content/gdrive/MyDrive/Semester_Project/preprocessed_val_ds.pickle", artifact_path=preprocessed_val_ds_filename)
mlflow.log_artifact(local_path="/content/gdrive/MyDrive/Semester_Project/preprocessed_test_ds.pickle", artifact_path=preprocessed_test_ds_filename)

## Evaluation on Test Set

In [None]:
outputs = trainer.predict(preprocessed_test_ds)
print(outputs.metrics)

## MLflow

In [None]:
test_loss = outputs.metrics['test_loss']
test_accuracy = outputs.metrics['test_accuracy']
test_runtime = outputs.metrics['test_runtime']
test_samples_per_second = outputs.metrics['test_samples_per_second']
test_steps_per_second = outputs.metrics['test_steps_per_second']

In [None]:
mlflow.log_metric("test_loss", test_loss)
mlflow.log_metric("test_accuracy", test_accuracy)
mlflow.log_metric("test_runtime", test_runtime)
mlflow.log_metric("test_samples_per_second", test_samples_per_second)
mlflow.log_metric("test_steps_per_second", test_steps_per_second)

In [None]:
mlflow.end_run()

The test-set results as a confusion matrix

In [None]:
y_true = outputs.label_ids
y_pred = outputs.predictions.argmax(1)

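# pass the full label list so the matrix is always 7x7 and lines up with the tick labels,
# even when some classes do not occur in a small test sample
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(string_labels))))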

# plot with seaborn
fig, ax = plt.subplots(figsize=(8,6))  
ax = sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, xticklabels=string_labels, yticklabels=string_labels)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()
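
With only a handful of test images, some emotion classes may not appear at all, so it helps to build the matrix on a fixed 7x7 grid. The sketch below does this in plain NumPy on invented labels; `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs on a fixed num_classes x num_classes grid."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # rows are true classes, columns are predictions
    return cm

# Made-up labels for 5 test images; classes 1, 2 and 5 never occur
y_true = np.array([0, 3, 3, 6, 4])
y_pred = np.array([3, 3, 3, 6, 3])

cm = confusion_counts(y_true, y_pred, num_classes=7)
print(cm.shape)        # (7, 7)
print(cm[3].tolist())  # [0, 0, 0, 2, 0, 0, 0]
```

Rows for absent classes stay at zero, so the heatmap keeps one row and one column per emotion.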

## Examining the data

In [None]:
# show a single image from the dataset
data_for_image = np.array(preprocessed_train_ds[0]["pixel_values"])
data_for_image[data_for_image < 0] = 0 

plt.imshow(np.transpose(data_for_image, (1,2,0)), interpolation='nearest')
plt.show()

print(string_labels[preprocessed_train_ds[0]["label"]])

In [None]:
# show up to 100 images from the dataset
fig, axes = plt.subplots(10,10, figsize=(11,11))
for i,ax in enumerate(axes.flat):
  ax.set_axis_off()
  if i >= len(preprocessed_train_ds):
    continue  # leave the cell empty when fewer than 100 images are available
  data_for_image = np.array(preprocessed_train_ds[i]["pixel_values"])
  data_for_image[data_for_image < 0] = 0 
  ax.imshow(np.transpose(data_for_image, (1,2,0)), interpolation='nearest')


In [None]:
# distribution of labels in the training set
# bincount with minlength keeps one bar per class even if a class is absent from the sample
counts = np.bincount(preprocessed_train_ds["label"], minlength=len(string_labels))
plt.bar(string_labels, counts)
plt.show()

# Single-image test

In [None]:
import pandas as pd

fer_df = pd.read_csv("/content/gdrive/MyDrive/Semester_Project/FER2013.csv")  # available on kaggle

In [None]:
test_df = fer_df.head(1)

In [None]:
test_df

# Preprocessing

In [None]:
# Defining the labels for emotions in the dataset
string_labels = ['Anger', 'Disgust', 'Fear', 'Happiness', 'Sadness', 'Surprise', 'Neutral']

In [None]:
# Loading the ViT image processor from HuggingFace (ViTFeatureExtractor is not imported in this notebook)
# The processor resizes every image to the resolution the model expects (224x224) and normalizes the channels
feature_extractor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')

First preprocessing step

In [None]:
def prepare_fer_data(data):
    """ Prepare FER data for the vision transformer
        input: FER dataframe loaded from csv
        output: dataframe that can be loaded into a HuggingFace dataset """

    # outputs
    image_list = []
    image_labels = list(map(int, data['emotion']))
    
    # go over all images
    for row in data.index:
        image = np.fromstring(data.loc[row, 'pixels'], dtype=int, sep=' ')
        image = np.reshape(image, (48, 48))
        # adapt grayscale to rgb format (change single values to triplets of the same value)
        image = image[..., np.newaxis]
        image = np.repeat(image, 3, axis=2)
        # convert to list format used by the later functions
        image = image.astype(int).tolist()
        # save to output
        image_list.append(image)

    output_df = pd.DataFrame(list(zip(image_list, image_labels)),
               columns =['img', 'label'])
        
    return output_df

In [None]:
prep_test_df = prepare_fer_data(test_df)

In [None]:
prep_test_df.head()

In [None]:
print(len(prep_test_df))

In [None]:
test_ds = Dataset.from_pandas(prep_test_df)

print(test_ds)

In [None]:
print(len(test_ds))

In [None]:
# image size 
np.array(test_ds[0]["img"]).shape

Second preprocessing step using the ViT feature extractor

In [None]:
def preprocess_images(examples):
    """ Prepare datasets for the vision transformer
    input: dataset with images in their original size
    output: dataset with the pixel values computed by the feature extractor added """
    # get batch of images
    images = examples['img']
    # convert to list of NumPy arrays of shape (C, H, W)
    images = [np.array(image, dtype=np.uint8) for image in images]
    images = [np.moveaxis(image, source=-1, destination=0) for image in images]
    # preprocess and add pixel_values
    inputs = feature_extractor(images=images)
    examples['pixel_values'] = inputs['pixel_values']

    return examples

In [None]:
# features of the new dataset, with an additional column for the preprocessed 3x224x224 images
features = Features({
    'label': ClassLabel(names=['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']),
    'img': Array3D(dtype="int64", shape=(3,48,48)),
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
})

preprocessed_test_ds = test_ds.map(preprocess_images, batched=True, batch_size=1, features=features)

preprocessed_test_ds

In [None]:
# final image size
print(len(preprocessed_test_ds[0]["pixel_values"]))       
print(len(preprocessed_test_ds[0]["pixel_values"][0]))     
print(len(preprocessed_test_ds[0]["pixel_values"][0][0]))  

In [None]:
outputs = trainer.predict(preprocessed_test_ds)

In [None]:
outputs

In [None]:
# trainer.predict returns a PredictionOutput; the logits are stored in `predictions`
logits = outputs.predictions

# the model predicts one of the 7 emotion classes
predicted_class_idx = int(logits.argmax(-1)[0])
print("Predicted class:", string_labels[predicted_class_idx])