<a href="https://colab.research.google.com/github/ShreyAgarwal11/Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding/blob/main/privacy_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Preliminaries**

In [None]:
!pip install numpy==1.23.5



In [None]:
!git clone https://github.com/ShreyAgarwal11/Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding.git
%cd /content/
!git clone https://github.com/facebookresearch/av_hubert.git

%cd av_hubert
!git submodule init
!git submodule update
!pip install scipy
!pip install sentencepiece
!pip install python_speech_features
!pip install scikit-video

%cd fairseq
!pip install ./

Cloning into 'Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding'...
remote: Enumerating objects: 103717, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 103717 (delta 6), reused 0 (delta 0), pack-reused 103705
Receiving objects: 100% (103717/103717), 2.95 GiB | 36.98 MiB/s, done.
Resolving deltas: 100% (19/19), done.
Updating files: 100% (103015/103015), done.
/content
fatal: destination path 'av_hubert' already exists and is not an empty directory.
/content/av_hubert
Collecting python_speech_features
  Downloading python_speech_features-0.6.tar.gz (5.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python_speech_features
  Building wheel for python_speech_features (setup.py) ... done
  Created wheel for python_speech_features: filename=python_speech_features-0.6-py3-none-any.whl size=5870 sha256=b2da6c21bd4f5c1a9e2fc11ef9556

In [None]:
!mkdir -p /content/data/misc/
!wget http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2 -O /content/data/misc/shape_predictor_68_face_landmarks.dat.bz2
!bzip2 -d /content/data/misc/shape_predictor_68_face_landmarks.dat.bz2
!wget --content-disposition https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/raw/master/preprocessing/20words_mean_face.npy -O /content/data/misc/20words_mean_face.npy

--2024-04-05 00:49:06--  http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
Resolving dlib.net (dlib.net)... 107.180.26.78
Connecting to dlib.net (dlib.net)|107.180.26.78|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64040097 (61M)
Saving to: ‘/content/data/misc/shape_predictor_68_face_landmarks.dat.bz2’


2024-04-05 00:49:09 (17.7 MB/s) - ‘/content/data/misc/shape_predictor_68_face_landmarks.dat.bz2’ saved [64040097/64040097]

--2024-04-05 00:49:21--  https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/raw/master/preprocessing/20words_mean_face.npy
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks/master/preprocessing/20words_mean_face.npy [following]
--2024-04-05 00:49:22--  https://raw.githubusercontent.c

**Import a pre-trained model**

This notebook uses the fine-tuned Noise-Augmented AV-HuBERT Base checkpoint (`base_noise_pt_noise_ft_433h.pt`).

In [None]:
!pwd
%mkdir -p /content/data/
!wget https://dl.fbaipublicfiles.com/avhubert/model/lrs3_vox/avsr/base_noise_pt_noise_ft_433h.pt -O /content/data/finetune-model.pt

/content/av_hubert/fairseq
--2024-04-05 00:49:22--  https://dl.fbaipublicfiles.com/avhubert/model/lrs3_vox/avsr/base_noise_pt_noise_ft_433h.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.226.210.25, 13.226.210.15, 13.226.210.111, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.226.210.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1928060481 (1.8G) [binary/octet-stream]
Saving to: ‘/content/data/finetune-model.pt’


2024-04-05 00:49:38 (118 MB/s) - ‘/content/data/finetune-model.pt’ saved [1928060481/1928060481]
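A download of this size can be cut short without an obvious error. As a minimal sketch, the byte count reported by `wget` above (1928060481) can be compared with the file on disk. `has_expected_size` is a hypothetical helper written for this check, not part of AV-HuBERT or fairseq.

```python
import os

# Size in bytes reported by wget in the log above.
EXPECTED_CKPT_BYTES = 1928060481


def has_expected_size(path, expected_bytes):
    """Return True if `path` is a file that is exactly `expected_bytes` long."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


ckpt = "/content/data/finetune-model.pt"
if os.path.exists(ckpt):
    print("checkpoint complete:", has_expected_size(ckpt, EXPECTED_CKPT_BYTES))
```

A size match does not prove the file is uncorrupted, but it catches truncated downloads cheaply.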



**Create Video Out of Frames**

In [None]:
import cv2
import os
import numpy as np

# Folder holding the frames of one VidTIMIT utterance, and the video to write.
frame_folder = '/content/Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding/VidTIMIT/fadg0/video/sa1'
output_video_path = '/content/output_video.mp4'
frame_rate = 25

# Collect the frame files and order them by name.
frame_files = sorted(
    f for f in os.listdir(frame_folder)
    if os.path.isfile(os.path.join(frame_folder, f))
)

# Output size as (width, height). Set to None to use the first frame's size.
video_resolution = (512, 384)

if video_resolution is None:
    first_frame = cv2.imread(os.path.join(frame_folder, frame_files[0]))
    video_resolution = (first_frame.shape[1], first_frame.shape[0])

# Lowercase 'mp4v' is the MPEG-4 FourCC tag normally used with .mp4 output.
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_video_path, fourcc, frame_rate, video_resolution)

for frame_file in frame_files:
    frame = cv2.imread(os.path.join(frame_folder, frame_file))
    if frame is None:
        # cv2.imread returns None for files it cannot decode; skip them.
        continue
    if (frame.shape[1], frame.shape[0]) != video_resolution:
        frame = cv2.resize(frame, video_resolution)
    out.write(frame)

out.release()
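
The cell above orders frames by sorting their file names, which is a lexicographic sort. That matches the true frame order only when the names are zero-padded to the same width. A minimal sketch of a numeric-aware sort key, assuming frame names might not be zero-padded; `natural_key` is a hypothetical helper, not part of OpenCV or this repository.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks for numeric-aware sorting."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


names = ["10", "2", "1"]
print(sorted(names))                   # ['1', '10', '2'] (lexicographic)
print(sorted(names, key=natural_key))  # ['1', '2', '10'] (numeric)
```

With such a key, the frame list could be built with `sorted(..., key=natural_key)` instead of a plain sort.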


In [None]:
%cd /content/av_hubert/avhubert
import cv2
import tempfile
import torch
import utils as avhubert_utils
from argparse import Namespace
import fairseq
from fairseq import checkpoint_utils, options, tasks, utils
from IPython.display import HTML
from python_speech_features import logfbank
from scipy.io import wavfile

/content/av_hubert/avhubert


**Feature Extraction using AV-HUBERT**

In [None]:
def stacker(feats, stack_order):
    """
    Concatenate consecutive audio frames.
    Args:
        feats - numpy.ndarray of shape [T, F]
        stack_order - int (number of neighboring frames to concatenate)
    Returns:
        feats - numpy.ndarray of shape [T', F'], where T' = ceil(T / stack_order)
        and F' = stack_order * F
    """
    feat_dim = feats.shape[1]
    if len(feats) % stack_order != 0:
        # Zero-pad so the number of frames is a multiple of stack_order.
        res = stack_order - len(feats) % stack_order
        res = np.zeros([res, feat_dim]).astype(feats.dtype)
        feats = np.concatenate([feats, res], axis=0)
    feats = feats.reshape((-1, stack_order, feat_dim)).reshape(-1, stack_order*feat_dim)
    return feats
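
The stack order of 4 used in the next cell follows from the frame rates. `logfbank` defaults to a 10 ms step (100 frames/s) and 26 filters, so grouping four consecutive frames gives 25 frames/s, which matches `frame_rate = 25` for the video, and 4 × 26 = 104 features per step, which matches the `(119, 104)` audio shape logged by the extraction cell. A minimal sketch with random data standing in for real filterbank features; `stack_frames` is a compact stand-in for `stacker`, written only for this illustration.

```python
import numpy as np


def stack_frames(feats, stack_order):
    """Zero-pad [T, F] features to a multiple of stack_order, then group rows."""
    n_frames, feat_dim = feats.shape
    pad = (-n_frames) % stack_order
    if pad:
        padding = np.zeros((pad, feat_dim), dtype=feats.dtype)
        feats = np.concatenate([feats, padding], axis=0)
    return feats.reshape(-1, stack_order * feat_dim)


# Stand-in for logfbank output: 10 frames at 100 frames/s, 26 filters each.
feats = np.random.rand(10, 26).astype(np.float32)
stacked = stack_frames(feats, 4)
print(stacked.shape)  # (3, 104): 10 frames padded to 12, grouped in fours
```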

In [None]:
def extract_visual_feature(video_path, audio_path, ckpt_path, user_dir, is_finetune_ckpt=False):
  """Extract AV-HuBERT audio-visual features for one video/audio pair.

  Returns (layer_features, feature): the outputs requested with
  output_layer = 1..12, and the final output with the batch axis removed.
  `is_finetune_ckpt` is unused; the checkpoint type is detected from the model.
  """
  utils.import_user_module(Namespace(user_dir=user_dir))
  models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  # NOTE: this transform is built but never applied to `frames` below, so the
  # video reaches the model uncropped and unnormalized.
  transform = avhubert_utils.Compose([
      avhubert_utils.Normalize(0.0, 255.0),
      avhubert_utils.CenterCrop((task.cfg.image_crop_size, task.cfg.image_crop_size)),
      avhubert_utils.Normalize(task.cfg.image_mean, task.cfg.image_std)])
  frames = avhubert_utils.load_video(video_path)
  print(f"Load video {video_path}: shape {frames.shape}")
  # Log filterbank features, stacked in groups of 4 to match the video frame rate.
  sample_rate, wav_data = wavfile.read(audio_path)
  audio_features = logfbank(wav_data, sample_rate).astype(np.float32)
  audio_features = stacker(audio_features, 4)
  print(f"Load audio {audio_path}: shape {audio_features.shape}")
  # Audio: [T, F] -> [1, F, T]. Video: [T, H, W] -> [1, 1, T, H, W].
  audio_features = torch.FloatTensor(audio_features).unsqueeze(dim=0).permute(0, 2, 1).cuda()
  frames = torch.FloatTensor(frames).unsqueeze(dim=0).unsqueeze(dim=0).cuda()
  model = models[0]
  if hasattr(models[0], 'decoder'):
    # Fine-tuned checkpoints wrap the AV-HuBERT encoder; unwrap it.
    print("Checkpoint: fine-tuned")
    model = models[0].encoder.w2v_model
  else:
    print("Checkpoint: pre-trained w/o fine-tuning")
  model.cuda()
  model.eval()
  source = {'video': frames, 'audio': audio_features}
  with torch.no_grad():
    # One forward pass per layer: output_layer=k returns the output of layer k.
    layer_features = []
    for i in range(12):
      feature, _ = model.extract_finetune(source=source, padding_mask=None, output_layer=(i+1))
      layer_features.append(feature)
    # output_layer=None returns the final output.
    feature, _ = model.extract_finetune(source=source, padding_mask=None, output_layer=None)
    feature = feature.squeeze(dim=0)
  print(f"AvHuBert Feature shape: {feature.shape}")
  return layer_features, feature

mouth_roi_path, ckpt_path = "/content/output_video.mp4", "/content/data/finetune-model.pt"
audio_path = "/content/Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding/VidTIMIT/fadg0/audio/sa1.wav"
user_dir = "/content/av_hubert/avhubert"
layer_features, feature = extract_visual_feature(mouth_roi_path, audio_path, ckpt_path, user_dir)



Load video /content/output_video.mp4: shape (119, 384, 512)
Load audio /content/Privacy-Preserving-Representation-for-Audio-Visual-Speech-Understanding/VidTIMIT/fadg0/audio/sa1.wav: shape (119, 104)
Checkpoint: fine-tuned
AvHuBert Feature shape: torch.Size([119, 768])
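
The per-layer outputs can be combined into a single array for downstream use. A minimal sketch, assuming each entry of `layer_features` has shape `[1, T, 768]` and has been converted with `.cpu().numpy()`. Mean pooling over time is an illustrative choice here, not necessarily what the later sections of this notebook do, and random arrays stand in for the real features.

```python
import numpy as np

# Stand-ins for the 12 per-layer outputs, each [1, T, 768] with T = 119.
rng = np.random.default_rng(0)
layer_features_np = [
    rng.standard_normal((1, 119, 768)).astype(np.float32) for _ in range(12)
]

# Drop the batch axis and stack the layers: [12, T, 768].
stacked_layers = np.stack([f[0] for f in layer_features_np], axis=0)

# Average over time to get one fixed-size vector per layer: [12, 768].
utterance_embeddings = stacked_layers.mean(axis=1)

print(stacked_layers.shape, utterance_embeddings.shape)
```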


# Emotion Recognition