# From Voice to Face: Exploring Speech2Face - Learning the Face Behind a Voice


**Name:** Rishab Sonthalia  
**Roll No.:** 220150035  
**Course:** DA323


## Motivation

In [10]:
from IPython.display import Audio


audio_path1 = 'voices/voice1.wav'
audio_path2 = 'voices/voice2.wav'

In [11]:
Audio(audio_path1)

In [12]:
Audio(audio_path2)



The connection between how someone looks and how they sound has always fascinated me. When we hear a voice on the phone or radio, we often unconsciously form mental images of the speaker. This natural human tendency to correlate vocal characteristics with physical appearance raises an intriguing question: can a machine learn to visualize a person's face just from hearing their voice?

This question led me to explore the groundbreaking paper "Speech2Face: Learning the Face Behind a Voice" by Oh et al. The research tackles a fascinating cross-modal inference problem with applications ranging from accessibility technology to human-computer interaction. What makes this work particularly compelling is that it addresses a capability that humans naturally possess—inferring physical attributes from voice—and attempts to replicate it using deep learning techniques.

## Historical Perspective on Multimodal Learning



Speech2Face sits within a rich tradition of multimodal learning research that aims to bridge different sensory modalities using computational methods. Here's how it connects with past and current work:


### Early Cross-Modal Research

Early work in multimodal learning focused primarily on audio-visual correspondence, such as matching speech with lip movements. Research from the early 2000s explored statistical correlations between vocal features and physical characteristics, but these approaches typically relied on hand-crafted features and explicit modeling of specific attributes.


### The Rise of Self-Supervised Learning

Around 2016-2017, Arandjelović and Zisserman introduced the Audio-Visual Correspondence (AVC) task, which leveraged the natural co-occurrence of audio and visual signals in videos as a form of self-supervision. This approach moved away from explicit modeling toward letting neural networks discover correlations between modalities directly from data.


### Recent Developments

More recent work has explored increasingly complex cross-modal tasks:
- **Sound Source Localization/Separation**: Identifying which objects in a scene are producing sounds, or isolating a chosen speaker's voice from a mixture (Ephrat et al., 2018)
- **Cross-Modal Retrieval**: Searching for images based on audio queries and vice versa (Owens et al., 2016)
- **Audio-Visual Representation Learning**: Creating joint embeddings that capture information from both modalities (Aytar et al., 2016)

Speech2Face extends this tradition by tackling a particularly challenging cross-modal synthesis task: generating plausible face images directly from speech signals. Unlike earlier approaches that might have explicitly modeled discrete attributes (age, gender, etc.), Speech2Face attempts to learn these correlations directly from data in an end-to-end fashion.


## Technical Deep-Dive: How Speech2Face Works

### Core Concept

Speech2Face doesn't aim to recover the exact face of a speaker (an impossible task, since people with very different faces can have similar voices). Instead, it attempts to reconstruct a face whose dominant visual traits correlate statistically with the input speech.


### Data Foundation

The researchers utilized the AVSpeech dataset, which contains:
- Millions of YouTube video clips with speaking faces
- Over 100,000 different speakers
- Diverse demographic representation (though with some imbalance)

This large-scale dataset provided the natural co-occurrence of faces and voices needed for self-supervised learning.



![avc_speech_data_stats](images/avc_speech_data_stats.png)


### Model Architecture

The Speech2Face pipeline consists of two primary components:


![Model Architecture](images/model_arch.png)


#### 1. Voice Encoder

The trainable component that maps from speech to face feature vectors:

```
Input → Complex Spectrogram → CNN → 4096-D Face Feature Vector
```

Specifically, the voice encoder architecture consists of:
- **Input**: Complex spectrogram (598 × 257 dimensions for 6-second audio, 2 channels representing real and imaginary components)
- **Convolutional Blocks**: 8 convolutional layers with batch normalization and ReLU activation
- **Pooling Strategy**: Max-pooling only along the temporal dimension to preserve frequency information (crucial for vocal characteristics)
- **Final Layers**: Average pooling over time followed by two fully-connected layers producing a 4096-D face feature vector

Key architectural insight: The model pools only in the temporal dimension, preserving frequency information that carries vocal characteristics, while linguistic information spans longer time durations.
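
To make the time-only pooling concrete, here is a minimal NumPy sketch. The function name `max_pool_time` and the (time, frequency, channels) array layout are my own assumptions for illustration; this is not code from the paper.

```python
import numpy as np

def max_pool_time(x, size=2):
    """Max-pool a (time, freq, channels) array along the time axis only."""
    t, f, c = x.shape
    t_out = t // size                      # drop a trailing frame if t is not divisible
    x = x[: t_out * size]                  # trim so the time axis divides evenly
    # Group consecutive frames and keep the maximum of each group.
    return x.reshape(t_out, size, f, c).max(axis=1)

features = np.random.rand(598, 257, 64)    # toy activations, not real data
pooled = max_pool_time(features)
print(features.shape, "->", pooled.shape)  # time halves, frequency is unchanged
```

Each pooling step halves the number of time frames but leaves all 257 frequency bins in place, which is the behaviour the 2×1 pooling entries in the table describe.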

#### Table 1. Voice Encoder Architecture

| Layer        | Input | CONV + RELU + BN | CONV + RELU + BN | CONV + RELU + BN | MAXPOOL | CONV + RELU + BN | MAXPOOL | CONV + RELU + BN | MAXPOOL |
|--------------|-------|------------------|------------------|------------------|---------|------------------|---------|------------------|---------|
| Channels     | 2     | 64               | 64               | 128              | –       | 128              | –       | 128              | –       |
| Stride       | –     | 1                | 1                | 1                | 2×1     | 1                | 2×1     | 1                | 2×1     |
| Kernel Size  | –     | 4×4              | 4×4              | 4×4              | 2×1     | 4×4              | 2×1     | 4×4              | 2×1     |

---

| Layer        | CONV | RELU | FC   | FC   |
|--------------|------|------|------|------|
| Input        | RELU | RELU | RELU | RELU |
|              | BN   |      |      |      |
| Channels     | 256  | 512  | 512  | –    |
| Stride       | 2×1  | 1    | 1    | 1    |
| Kernel Size  | 4×4  | 4×4  | ∞×1  | 1×1  |


**Notes:**
- The input spectrogram dimensions are 598 × 257 (time × frequency) for a 6-second audio segment.

- The two input channels correspond to the real and imaginary components of the spectrogram.

- "–" indicates no operation or parameter for that cell.



#### 2. Face Decoder

A pre-trained network, based on the face decoder model by Cole et al., that converts face features into canonical face images. It stays fixed while the voice encoder is trained:
- Takes the 4096-D feature vector as input
- First layer transforms it to a 1000-D representation
- Produces a canonical face image as output

The reconstructed face is normalized, with:
- Frontal orientation
- Neutral expression
- Normalized lighting condition

### Data Processing


![preprocessing_pipeline](images/preprocessing_pipeline.png)




#### Dataset
- **AVSpeech**: Large-scale, "in-the-wild" audiovisual dataset
- **Training/Test Split**: 1.7M / 0.15M spectra-face feature pairs

---

#### Audio Processing
- **Segment length**: 6 seconds (shorter clips are repeated until they fill 6 seconds)
- **Sampling rate**: 16 kHz, single channel
- **Spectrogram computation**:
  - STFT with:
    - 25ms Hann window
    - 10ms hop length
    - 512-point FFT (257 frequency bins)
- **Power-law compression**:
  - Applied to real/imaginary components as:
    - `sgn(S) × |S|^0.3`
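
The audio steps above can be sketched with plain NumPy. This is a rough illustration under my own assumptions: the helper names `fit_to_length` and `complex_spectrogram` are hypothetical, frames are taken without any edge padding, and `np.hanning` stands in for whatever Hann window implementation the authors used.

```python
import numpy as np

def fit_to_length(wave, n_samples):
    """Repeat a short waveform until it covers n_samples, then trim."""
    reps = int(np.ceil(n_samples / len(wave)))
    return np.tile(wave, reps)[:n_samples]

def complex_spectrogram(wave, sr=16000, win_ms=25, hop_ms=10, n_fft=512, power=0.3):
    """Return a (frames, n_fft // 2 + 1, 2) array of compressed real/imag STFT parts."""
    win = sr * win_ms // 1000              # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000              # 160 samples at 16 kHz
    n_frames = 1 + (len(wave) - win) // hop
    window = np.hanning(win)
    frames = np.stack([wave[i * hop : i * hop + win] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=1)     # zero-pads each frame to n_fft
    parts = np.stack([spec.real, spec.imag], axis=-1)
    return np.sign(parts) * np.abs(parts) ** power  # sgn(S) * |S|^0.3

wave = np.random.randn(2 * 16000)          # 2 s of noise standing in for speech
clip = fit_to_length(wave, 6 * 16000)
print(complex_spectrogram(clip).shape)     # (598, 257, 2)
```

With these settings a 6-second clip gives 598 frames and 257 frequency bins, matching the 598 × 257 two-channel input described for the voice encoder.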

---

#### Face Processing
- **Face Detection**: CNN-based using Dlib
- **Preprocessing**: Cropped and resized to 224×224 pixels
- **Feature Extraction**: VGG-Face (4096-dimensional vector)

---


### Training Objective




The **voice encoder** maps the **speech input** to a feature vector `v_s`, and is trained so that `v_s` **predicts the face feature vector** `v_f` that VGG-Face extracts from the speaker's face.  
The goal is to minimize the difference between the predicted and actual face features.

---

#### Loss Function Design

A simple L1 loss was insufficient due to **unstable training**, so the final loss combines **three components**:

$$
L_{\text{total}} = \left\| f_{\text{dec}}(v_f) - f_{\text{dec}}(v_s) \right\|_1 + \lambda_1 \left\| \frac{v_f}{\|v_f\|} - \frac{v_s}{\|v_s\|} \right\|_2^2 + \lambda_2 L_{\text{distill}}\left(f_{\text{VGG}}(v_f), f_{\text{VGG}}(v_s)\right)
$$


Where:

- **First term**: L1 distance between decoder activations
- **Second term**: Normalized feature alignment with **λ₁ = 0.025**
- **Third term**: **Knowledge distillation loss** with **λ₂ = 200**

#### Knowledge Distillation Loss

$$
L_{\text{distill}}(a, b) = -\sum_i p_{(i)}(a) \log p_{(i)}(b)
$$

Where:

$$
p_{(i)}(a) = \frac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}
$$

- **T = 2** (temperature parameter) smooths activation distributions
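
As a quick sanity check on the formula, here is a small NumPy sketch of the distillation loss. The function names and the toy activation vectors are mine, chosen only for illustration.

```python
import numpy as np

def softmax_t(a, T=2.0):
    """Softmax with temperature T; a larger T gives a flatter distribution."""
    z = a / T
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(a, b, T=2.0):
    """Cross-entropy between the temperature-softened distributions of a and b."""
    return -np.sum(softmax_t(a, T) * np.log(softmax_t(b, T)))

a = np.array([2.0, 0.5, -1.0])             # toy activations, not real network outputs
b = np.array([1.5, 0.7, -0.8])
print(distill_loss(a, b), distill_loss(a, a))
```

Because this is a cross-entropy, the loss is smallest when both inputs produce the same softened distribution, but it does not drop to zero even then: what remains is the entropy of `p(a)`.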

---

#### Hyperparameters & Optimization

- **Framework**: TensorFlow
- **Optimizer**: ADAM
  - β₁ = 0.5
  - ε = 1e-4
- **Learning Rate**: 0.001 with exponential decay (rate = 0.95 every 10,000 iterations)
- **Batch Size**: 8
- **Training Duration**: 3 epochs
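
The learning-rate schedule is easy to write out directly. The sketch below assumes a "staircase" decay, where the rate drops once every 10,000 iterations rather than continuously; the summary above does not say which variant was used, so treat that as my assumption.

```python
def learning_rate(step, base_lr=1e-3, decay_rate=0.95, decay_steps=10_000):
    """Staircase exponential decay: multiply by decay_rate every decay_steps."""
    return base_lr * decay_rate ** (step // decay_steps)

for step in (0, 9_999, 10_000, 50_000):
    print(step, learning_rate(step))
```

Under this reading the rate stays at 0.001 for the first 10,000 iterations, then shrinks by 5% at each subsequent 10,000-iteration boundary.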

---

#### Loss Function Rationale

- **Decoder-Activation Matching**: The L1 term on the decoder activations (`f_dec`) keeps predicted and true features close in the space the face decoder consumes
- **Normalized Feature Alignment**: Directional alignment independent of magnitude
- **Knowledge Distillation**: Ensures behavioral similarity using pre-trained VGG-Face features

The weights **λ₁** and **λ₂** were carefully tuned to **balance gradient magnitudes** early in training.  
This **multi-term loss design** significantly improved training **stability** and **reconstruction quality**, outperforming simple L1 loss strategies.
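
To see how the three terms combine, here is a toy NumPy version of the total loss. The matrices `W_dec` and `W_vgg` are hypothetical random linear maps standing in for `f_dec` and `f_VGG`; the real layers come from pre-trained networks and are not reproduced here. Only the weights λ₁ = 0.025, λ₂ = 200 and T = 2 are taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the fixed layers (assumption, not the real networks).
W_dec = rng.normal(size=(16, 8)) * 0.1     # plays the role of f_dec
W_vgg = rng.normal(size=(5, 8)) * 0.1      # plays the role of f_VGG

def softmax_t(a, T):
    e = np.exp((a - a.max()) / T)
    return e / e.sum()

def total_loss(v_f, v_s, lam1=0.025, lam2=200.0, T=2.0):
    # Term 1: L1 distance between decoder activations.
    l_dec = np.abs(W_dec @ v_f - W_dec @ v_s).sum()
    # Term 2: squared L2 distance between unit-normalized features.
    l_norm = np.sum((v_f / np.linalg.norm(v_f) - v_s / np.linalg.norm(v_s)) ** 2)
    # Term 3: distillation (cross-entropy of temperature-softened activations).
    p, q = softmax_t(W_vgg @ v_f, T), softmax_t(W_vgg @ v_s, T)
    l_distill = -np.sum(p * np.log(q))
    return l_dec + lam1 * l_norm + lam2 * l_distill

v_f = rng.normal(size=8)                   # toy "true" face feature
print(total_loss(v_f, v_f + 0.5), total_loss(v_f, v_f))
```

In this toy setup a perturbed prediction always scores worse than a perfect one, since each of the three terms is minimized when `v_s` equals `v_f`.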


## Results and Analysis

In [6]:
Audio(audio_path1)


![result1](images/result1.png)


In [7]:
Audio(audio_path2)


![result2](images/result2.png)



![qualitative_results](images/qualitative_results.png)


### 1. Qualitative Results

The Speech2Face reconstructions successfully capture several key facial attributes:
- Age category (young, middle-aged, elderly)
- Gender
- Ethnicity 
- Face/head shape (elongated vs. round)

Visual inspection shows that while reconstructions appear somewhat like "average faces," they still contain person-specific information that correlates with the voice input.


![confusion_matrix_attributes](images/confusion_matrix_attributes.png)



![avc_speech_data_stats](images/avc_speech_data_stats.png)


### 2. Demographic Attribute Analysis

Using Face++ (a commercial facial attribute classifier), the researchers quantified agreement between original faces and Speech2Face reconstructions:

**Gender Classification**:
- 94% overall agreement between true images and reconstructions
- 84% accuracy for males, 84% for females

**Age Estimation**:
- Strong diagonal pattern in confusion matrices
- Best performance for younger and middle-aged adults

**Ethnicity Classification**:
- Variable performance across groups:
  - White: 81% accuracy
  - Asian: 76% accuracy
  - Indian: 48% accuracy
  - Black: 49% accuracy

The lower performance on some ethnic groups correlates with their underrepresentation in the training data:
- White: 50.2% of dataset
- Asian: 28.9% of dataset
- Indian: 12.1% of dataset
- Black: 8.7% of dataset



![demographic_attributes](images/demographic_attributes.png)


### 3. Craniofacial Measurements

One of the most fascinating findings is the statistically significant correlation between craniofacial measurements in true faces and Speech2Face reconstructions:

| Face Measurement           | Correlation | p-value   |
|---------------------------|-------------|-----------|
| Upper lip height          | 0.16        | p < 0.001 |
| Lateral upper lip heights | 0.26        | p < 0.001 |
| Jaw width                 | 0.11        | p < 0.001 |
| Nose height               | 0.14        | p < 0.001 |
| Nose width                | 0.35        | p < 0.001 |
| Labio oral region         | 0.17        | p < 0.001 |
| Mandibular idx            | 0.20        | p < 0.001 |
| Intercanthal idx          | 0.21        | p < 0.001 |
| Nasal index               | 0.38        | p < 0.001 |
| Vermilion height idx      | 0.29        | p < 0.001 |
| Mouth face width idx      | 0.20        | p < 0.001 |
| Nose area                 | 0.28        | p < 0.001 |
| Random baseline           | 0.02        | –         |


The strongest correlations appear in nose-related features, suggesting that nasal structure (which affects speech resonance) leaves detectable patterns in the voice that the model can pick up.
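
For readers who want to see what such a correlation measures, here is a small sketch using synthetic numbers. The arrays are made up for illustration and are not the paper's measurements. The shuffled baseline is my own stand-in for a "random" comparison and may differ from how the paper computed its baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins: one measurement on true faces, and a noisy, weakly related
# version standing in for the same measurement on the reconstructions.
true_vals = rng.normal(size=n)
recon_vals = 0.3 * true_vals + rng.normal(size=n)

r = np.corrcoef(true_vals, recon_vals)[0, 1]      # Pearson correlation
# Shuffling breaks the pairing between each true face and its reconstruction.
r_random = np.corrcoef(true_vals, rng.permutation(recon_vals))[0, 1]
print(f"paired r = {r:.2f}, shuffled r = {r_random:.2f}")
```

Even a modest correlation stands clearly apart from the shuffled pairing when there are many samples, which is why values in the 0.1 to 0.4 range can still be statistically significant.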


### 4. Cross-Modal Retrieval Performance



| Length     | cos (deg)         | L2              | L1              |
|------------|-------------------|------------------|------------------|
| 3 seconds  | 48.43 ± 6.01      | 0.19 ± 0.03     | 9.81 ± 1.74     |
| 6 seconds  | 45.75 ± 5.09      | 0.18 ± 0.02     | 9.42 ± 1.54     |





![stf_3s_6s_comparison](images/stf_3s_6s_comparison.png)


**The researchers tested how well Speech2Face features could retrieve the correct face from a database of 5,000 images:**


| Duration | Metric | R@1  | R@2  | R@5  | R@10 |
|----------|--------|------|------|------|------|
| 3 sec    | L2     | 5.86 | 10.02| 18.98| 28.92|
| 3 sec    | L1     | 6.22 | 9.92 | 18.94| 28.70|
| 3 sec    | cos    | 8.54 | 13.64| 24.80| 38.54|
| 6 sec    | L2     | 8.28 | 13.66| 24.66| 35.84|
| 6 sec    | L1     | 8.34 | 13.70| 24.66| 36.22|
| 6 sec    | cos    |10.92 | 17.00| 30.60| 45.82|
| Random   | –      | 1.00 | 2.00 | 5.00 | 10.00|

**S2F→Face retrieval performance. Retrieval performance is measured by recall at K (R@K, in %), which indicates the chance of retrieving the true image of a speaker within the top-K results.**



![feature_similarity_2](images/feature_similarity_2.png)


- **Direct Feature Comparison**

When comparing the model's **predicted face features** to the **real person's face features**, **longer audio clips** (6 seconds vs. 3 seconds) consistently improved accuracy.
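
The three distance measures in the table can be computed as below. This is a minimal sketch with a function name of my own choosing. It does not reproduce any feature normalization the authors may have applied before measuring L1 and L2.

```python
import numpy as np

def feature_distances(v_s, v_f):
    """Angle (degrees), L2 distance, and L1 distance between two feature vectors."""
    cos = np.dot(v_s, v_f) / (np.linalg.norm(v_s) * np.linalg.norm(v_f))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angle, np.linalg.norm(v_s - v_f), np.abs(v_s - v_f).sum()

# Toy 2-D example: the vectors are about 45 degrees apart, with L2 = 1 and L1 = 1.
print(feature_distances(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```

A smaller angle or distance means the feature predicted from the voice sits closer to the feature extracted from the real face.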

---

- **Face Retrieval Test**

    In a more practical test, the researchers:

    1. Used the model to **predict face features from a voice**
    2. **Searched a database** of 5,000 faces to find matches
    3. Measured how often the **correct face appeared in the top results**
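
The three steps above can be sketched as a recall-at-K computation. Everything here is a toy illustration: the function name `recall_at_k` is hypothetical, the features are random vectors, and the database holds 500 entries rather than 5,000 to keep the example light.

```python
import numpy as np

def recall_at_k(queries, database, ks=(1, 2, 5, 10)):
    """queries[i] should retrieve database[i]; returns {K: recall in %}."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ d.T                                   # cosine similarity matrix
    true_sim = np.diag(sims)[:, None]
    # Rank of the true face = number of database entries scoring strictly higher.
    ranks = (sims > true_sim).sum(axis=1)
    return {k: 100.0 * np.mean(ranks < k) for k in ks}

rng = np.random.default_rng(0)
faces = rng.normal(size=(500, 64))                   # toy face features
voices = faces + 2.0 * rng.normal(size=faces.shape)  # noisy "predicted" features
print(recall_at_k(voices, faces))
```

Swapping the cosine similarity for a negative L1 or L2 distance would give the other rows of the retrieval table.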

---

- **Results**

    - With **6 seconds** of speech (cosine distance):
        - Correct face was the **#1 match** about **11%** of the time
        - Correct face appeared in the **top 5** results about **31%** of the time
        - Correct face appeared in the **top 10** results about **46%** of the time
    - Compared with 3-second input, 6-second input:
        - Improved feature similarity (cosine distance: 45.75° vs. 48.43°)
        - Increased retrieval recall by roughly 2 to 7.5 percentage points across metrics
        - Noticeably improved the visual quality of reconstructions
    - The **random baseline** reported in the table is **1%** at R@1 and **10%** at R@10, so these results **significantly outperform random**.

---




### 5. **Conclusion**
These results demonstrate that:

- **Voices carry meaningful information about facial appearance**
- Even when the exact match wasn't found, retrieved faces shared **similar traits** (age, gender, ethnicity, facial structure)
- **Deep learning** can effectively capture these **cross-modal relationships**

#### **Our voices reveal more about our physical appearance than we might expect!**

### Training Components Analysis

The ablation studies revealed important insights about training methodology:
- **Batch Normalization**: Significantly improved convergence speed and stability
- **Loss Function**: The full composite loss produced much sharper, more accurate reconstructions than pixel-level loss alone
- **Audio Duration**: Models trained with 6-second clips consistently outperformed those trained with 3-second clips


## Reflections

Several aspects of this work were particularly surprising:

1. **Craniofacial Correlations**: The significant correlation between specific facial measurements and voice was unexpected. The fact that nasal features showed the strongest correlation makes physiological sense (nasal cavity shapes affect voice resonance), but it's remarkable that the model discovered this relationship without explicit guidance.

2. **Retrieval Performance**: The ability to retrieve the correct face from a database of 5,000 faces at a rate of 10.92% (R@1) is surprisingly good considering the one-to-many relationship between voices and faces. This suggests the voice contains more identifying information than I initially expected.

3. **Training Stability Challenges**: The fact that a simple L1 loss between features was insufficient for stable training highlights the complexity of cross-modal learning. The need for a composite loss function with knowledge distillation reveals the subtlety of the optimization landscape.

4. **Demographic Bias Effects**: While not entirely surprising, the clear correlation between dataset representation and reconstruction accuracy for different ethnic groups emphasizes the importance of balanced training data in multimodal systems.


### Scope for Improvement

This work opens several promising avenues for further research:

1. **Balanced Data Representation**: Training on a more demographically balanced dataset would likely improve performance across all groups. The paper acknowledges this limitation, and addressing it would be a straightforward yet impactful improvement.

2. **Temporal Dynamics**: The current model uses average pooling over time to create a single face representation. Incorporating temporal dynamics of speech might capture additional information about facial structure and expressions.

3. **Multiple Face Hypotheses**: Since voice-to-face is a one-to-many mapping, generating multiple plausible face hypotheses (perhaps using variational or generative approaches) could better represent the inherent ambiguity.

4. **Additional Modalities**: Incorporating linguistic content analysis alongside acoustic features might improve reconstruction, as certain speech patterns correlate with cultural or regional facial characteristics.

5. **Ethical Safeguards**: While the paper addresses ethical considerations, developing more robust privacy guarantees would be valuable for real-world applications.

6. **Dynamic Face Reconstruction**: Moving beyond static, neutral faces to reconstruct plausible facial movements during speech would be a fascinating extension.

## References

1. Oh, T. H., Dekel, T., Kim, C., Mosseri, I., Freeman, W. T., Rubinstein, M., & Matusik, W. (2019). Speech2Face: Learning the Face Behind a Voice. *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.

2. Arandjelović, R., & Zisserman, A. (2017). Look, Listen and Learn. *IEEE International Conference on Computer Vision (ICCV)*.

3. Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. *ACM Transactions on Graphics (TOG)*.

4. Cole, F., Belanger, D., Krishnan, D., Sarna, A., Mosseri, I., & Freeman, W. T. (2017). Synthesizing Normalized Faces from Facial Identity Features. *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

5. Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning Sound Representations from Unlabeled Video. *Advances in Neural Information Processing Systems (NIPS)*.

6. Liu, M.-Y., & Tuzel, O. (2016). Coupled Generative Adversarial Networks. *Advances in Neural Information Processing Systems (NIPS)*.

7. Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient Sound Provides Supervision for Visual Learning. *European Conference on Computer Vision (ECCV)*.

8. [Speech2Face project page](https://speech2face.github.io/) (official)