
# Vision Transformer (ViT) for Facial Expression Recognition Model Card

## Model Overview

- **Model Name:** [trpakov/vit-face-expression](https://huggingface.co/trpakov/vit-face-expression)

- **Task:** Facial Expression/Emotion Recognition

- **Dataset:** [FER2013](https://www.kaggle.com/datasets/msambare/fer2013)

- **Model Architecture:** [Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)

- **Fine-tuned from model:** [vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)

## Model Description

The vit-face-expression model is a Vision Transformer fine-tuned for facial emotion recognition.

It was trained on the FER2013 dataset, which consists of facial images categorized into seven emotions:
- Angry
- Disgust
- Fear
- Happy
- Sad
- Surprise
- Neutral

(Yes, FER2013 has only 7 emotions, not 8.)

## Evaluation Metrics

- **Validation set accuracy:** 0.7113
- **Test set accuracy:** 0.7116
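
Accuracy here presumably means the fraction of images whose predicted label matches the ground-truth label. A minimal sketch of that calculation follows; the `accuracy` helper and the toy labels are illustrative assumptions, not FER2013 data or the author's evaluation code.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels for illustration only; these are not FER2013 results.
true_labels = ["happy", "sad", "angry", "happy", "neutral"]
predictions = ["happy", "sad", "fear", "happy", "sad"]
print(accuracy(predictions, true_labels))
```

Under this definition, a test accuracy of 0.7116 would correspond to roughly 71 correct predictions per 100 test images.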


The **trpakov/vit-face-expression** model can be integrated with OpenCV to perform real-time emotion detection on a webcam feed. The process involves loading the pre-trained model and image processor from Hugging Face, detecting faces in the video stream with OpenCV, and classifying the emotion of each detected face with the model.


---

### **Explanation**:
1. **Model and Processor**:
   - The model (`AutoModelForImageClassification`) and processor (`AutoImageProcessor`) are loaded from Hugging Face. 
   - The processor handles image preprocessing like resizing, normalization, and tensor conversion.
   
2. **Face Detection**:
   - OpenCV's Haar cascade is used to detect faces in the video frame.
   - Each detected face is cropped from the frame (`face_roi`) for emotion classification.

3. **Emotion Analysis**:
   - The face is converted into a PIL image and passed through the processor for preprocessing.
   - The preprocessed image is fed into the model to get the logits.
   - Softmax is applied to convert logits to probabilities, and the class with the highest probability is selected as the predicted emotion.

4. **Visualization**:
   - Bounding boxes are drawn around the detected faces.
   - The predicted emotion is displayed as a label near the bounding box.
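
The softmax-and-argmax step from item 3 can be illustrated without loading the model. The sketch below uses made-up logits and a hypothetical `ID2LABEL` mapping (the real mapping is stored in `model.config.id2label` and may order the classes differently); `softmax` and `predict_label` are illustrative helper names, not library functions.

```python
import math

# Hypothetical label mapping for illustration only; the real one is
# stored in model.config.id2label and may use a different order.
ID2LABEL = {0: "angry", 1: "disgust", 2: "fear", 3: "happy",
            4: "sad", 5: "surprise", 6: "neutral"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for a single face crop (one score per class).
example_logits = [0.2, -1.3, 0.1, 3.4, 0.5, 1.0, 0.9]
label, confidence = predict_label(example_logits)
print(f"{label}: {confidence:.2f}")
```

Returning the probability alongside the label makes it possible to show a confidence value next to the emotion, or to suppress low-confidence predictions.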



In [None]:
import cv2
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

# Load model and processor
processor = AutoImageProcessor.from_pretrained("trpakov/vit-face-expression")
model = AutoModelForImageClassification.from_pretrained("trpakov/vit-face-expression")

# Load OpenCV's Haar Cascade for face detection
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Start video capture
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Convert frame to grayscale for face detection
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    for (x, y, w, h) in faces:
        # Extract face ROI
        face_roi = frame[y:y + h, x:x + w]

        try:
            # Convert to PIL Image and preprocess
            face_pil = Image.fromarray(cv2.cvtColor(face_roi, cv2.COLOR_BGR2RGB))
            inputs = processor(face_pil, return_tensors="pt")

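            # Run inference without tracking gradients (saves memory and time)
            with torch.no_grad():
                outputs = model(**inputs)
            # Convert logits to probabilities and pick the most likely class
            probs = torch.nn.functional.softmax(outputs.logits, dim=1)
            predicted_class = probs.argmax(dim=1).item()
            predicted_label = model.config.id2label[predicted_class]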

            # Draw bounding box and label (keep the label inside the frame
            # when the face touches the top edge)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, predicted_label, (x, max(y - 10, 20)), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

        except Exception as e:
            print(f"Error analyzing face: {e}")

    # Display the video feed
    cv2.imshow("Emotion Detection", frame)

    # Exit on 'q' key or if the window is closed
    key = cv2.waitKey(1)
    if key == ord('q') or cv2.getWindowProperty("Emotion Detection", cv2.WND_PROP_VISIBLE) < 1:
        break

cap.release()
cv2.destroyAllWindows()


The preview window can also display the current **frames per second (FPS)**. A simple approach in OpenCV is to record a timestamp for each frame, take the inverse of the time elapsed since the previous frame, and draw that value on the video stream.


### **Changes for FPS Calculation**:
1. **Track Time for FPS**:
   - We use `time.time()` to record a timestamp once per loop iteration, right after a frame is read.
   - The FPS is calculated as the inverse of the difference between the current timestamp and the previous one (`1 / (curr_time - prev_time)`).

2. **Display FPS**:
   - We use `cv2.putText()` to draw the FPS value on the frame. The text is placed in the top-left corner of the window (`(10, 30)`).

3. **Frame Processing**:
   - For each frame, we calculate the FPS and update the `prev_time` for the next frame.


Because FPS is derived from the time elapsed between consecutive frames, it reflects the full per-frame cost (capture, face detection, and inference). The faster the frames are processed, the higher the FPS, which gives a real-time indication of throughput.
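
An FPS value computed from a single frame interval tends to jitter. As one possible refinement (not part of the original code), the sketch below smooths the reading with an exponential moving average. The `FpsMeter` class, its `alpha` parameter, and the injectable `now` timestamp are hypothetical names introduced for illustration.

```python
import time


class FpsMeter:
    """Smoothed frames-per-second estimate from per-frame timestamps."""

    def __init__(self, alpha=0.1):
        # alpha is the weight given to the newest reading (0 < alpha <= 1).
        self.alpha = alpha
        self.prev_time = None
        self.fps = 0.0

    def tick(self, now=None):
        """Record one frame and return the current FPS estimate."""
        if now is None:
            now = time.perf_counter()
        if self.prev_time is not None:
            elapsed = now - self.prev_time
            if elapsed > 0:  # skip zero intervals to avoid division by zero
                instant = 1.0 / elapsed
                if self.fps == 0.0:
                    # First valid interval: start from the instantaneous value.
                    self.fps = instant
                else:
                    self.fps = self.alpha * instant + (1 - self.alpha) * self.fps
        self.prev_time = now
        return self.fps


# Simulated timestamps: three frames 50 ms apart, then one 100 ms gap.
meter = FpsMeter(alpha=0.5)
readings = [meter.tick(t) for t in (0.0, 0.05, 0.10, 0.20)]
print([round(r, 1) for r in readings])
```

Inside the capture loop, a single `fps = meter.tick()` call per frame would stand in for the manual `curr_time` and `prev_time` bookkeeping.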
  


In [None]:
import cv2
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import time

# Load model and processor
processor = AutoImageProcessor.from_pretrained("trpakov/vit-face-expression")
model = AutoModelForImageClassification.from_pretrained("trpakov/vit-face-expression")

# Load OpenCV's Haar Cascade for face detection
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Start video capture
cap = cv2.VideoCapture(0)

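# FPS calculation variables: start the clock now so the first reading is meaningful
prev_time = time.time()

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Get the current time
    curr_time = time.time()

    # Calculate FPS as the inverse of the time since the previous frame;
    # guard against a zero interval to avoid ZeroDivisionError
    elapsed = curr_time - prev_time
    fps = 1 / elapsed if elapsed > 0 else 0.0
    prev_time = curr_time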

    # Convert frame to grayscale for face detection
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

    for (x, y, w, h) in faces:
        # Extract face ROI
        face_roi = frame[y:y + h, x:x + w]

        try:
            # Convert to PIL Image and preprocess
            face_pil = Image.fromarray(cv2.cvtColor(face_roi, cv2.COLOR_BGR2RGB))
            inputs = processor(face_pil, return_tensors="pt")

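            # Run inference without tracking gradients (saves memory and time)
            with torch.no_grad():
                outputs = model(**inputs)
            # Convert logits to probabilities and pick the most likely class
            probs = torch.nn.functional.softmax(outputs.logits, dim=1)
            predicted_class = probs.argmax(dim=1).item()
            predicted_label = model.config.id2label[predicted_class]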

            # Draw bounding box and label
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, predicted_label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

        except Exception as e:
            print(f"Error analyzing face: {e}")

    # Display FPS on the frame
    cv2.putText(frame, f"FPS: {fps:.2f}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    # Display the video feed with detected emotions
    cv2.imshow("Emotion Detection", frame)

    # Exit on 'q' key or if the window is closed
    key = cv2.waitKey(1)
    if key == ord('q') or cv2.getWindowProperty("Emotion Detection", cv2.WND_PROP_VISIBLE) < 1:
        break

cap.release()
cv2.destroyAllWindows()
