# Research


## <u>DeepFace</u>

[DeepFace: Closing the Gap to Human-Level Performance in Face Verification](https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf)

- The pipeline in short: detect ⇒ align ⇒ represent ⇒ classify.
- The alignment and representation steps were revised: alignment uses explicit 3D face modelling and the representation is derived from a nine-layer deep neural network.
- The neural network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers.
- Evaluated on the [Labeled Faces in the Wild (LFW) dataset](https://vis-www.cs.umass.edu/lfw/)
    - This dataset contains faces in unconstrained environments
- An accuracy of 97.35% was achieved on LFW, reducing the error of the previous state of the art by more than 27% and closely approaching human-level performance.
- Uses a deep learning (DL) approach.
    - DL is especially suitable for dealing with large training sets.
- Deep neural networks outperform traditional machine learning in handling challenges like deformations, clutter, occlusion, and illumination.
- Face recognition error rates have significantly decreased in controlled environments, but unconstrained settings remain challenging.
- The Deep Neural Network (DNN) architecture for face recognition involves 3D alignment and convolutional layers.
- Model training and structure are explained in the paper's **Representation** section.
- The proposed face representation is learned from a large collection of photos from Facebook, referred to as the Social Face Classification (SFC) dataset. The representations are then applied to the Labeled Faces in the Wild database (LFW), which is the de facto benchmark dataset for face verification in unconstrained environments, and the YouTube Faces (YTF) dataset, which is modeled similarly to the LFW but focuses on video clips.
    - The SFC dataset includes 4.4 million labeled faces from 4,030 people each with 800 to 1200 faces, where the most recent 5% of face images of each identity are left out for testing.
- Results for the different datasets are given in the paper's Results section; the reported accuracy is above 90% in every case.
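
To make the point about locally connected layers without weight sharing more concrete, the sketch below compares the parameter count of a standard convolutional layer with that of a locally connected layer of the same kernel size. The function names and layer sizes are hypothetical and chosen only for illustration; they are not taken from these notes.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights + biases of a standard convolution: one filter bank shared by all positions."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


def locally_connected_params(out_h, out_w, kernel_h, kernel_w, in_channels, out_channels):
    """Weights + biases when every output position has its own, unshared filter bank."""
    per_position = conv_params(kernel_h, kernel_w, in_channels, out_channels)
    return out_h * out_w * per_position


# Hypothetical layer sizes (assumption, for illustration only).
shared = conv_params(9, 9, 16, 16)
unshared = locally_connected_params(55, 55, 9, 9, 16, 16)

print(f"shared weights:   {shared:,}")
print(f"unshared weights: {unshared:,}")
print(f"ratio:            {unshared // shared}x")
```

Without weight sharing, the parameter count grows by a factor equal to the number of output positions, which is how a handful of such layers can push a network past a hundred million parameters.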

## <u>Identifying Bias in datasets</u>

[Metrics for identifying bias in dataset](https://ceur-ws.org/Vol-3118/p02.pdf)

- Two different metrics based on data completeness:
    - Using combination of values in the dataset (Benchmark) 
    - Using frame theory (Quality measure)
- Utilises an index that measures the degree of balance of the data. At the same time the proposed metric can be considered within the framework of the measures of the **ISO 25000 series of standards** (framework to evaluate software product quality)
- The paper notes that a person has a 1-to-N relationship with their attributes, such as gender, race and religion:
 ![image.png](attachment:image.png)

### Combinatorial metric
- Minimum completeness - must have at least one instance that belongs to each distinct combination of categories.
- The value for maximum completeness is calculated from the maximum number of duplicates of the same combinations of characterising columns.
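
The two conditions above can be checked with a small sketch. It does not reproduce the paper's exact formula, which these notes do not record; it only counts instances per combination of characterising columns, lists the combinations with no instance, and reports the largest number of duplicates. The column names, categories and helper names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod


def combination_counts(rows, columns):
    """Count how many rows fall into each observed combination of the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def missing_combinations(rows, columns, categories):
    """Return the combinations of categories that have no instance in the data."""
    observed = combination_counts(rows, columns)
    possible = product(*(categories[c] for c in columns))
    return [combo for combo in possible if combo not in observed]


# Hypothetical toy data (assumption, for illustration only).
rows = [
    {"gender": "F", "age": "young"},
    {"gender": "F", "age": "old"},
    {"gender": "M", "age": "young"},
    {"gender": "M", "age": "young"},
]
columns = ["gender", "age"]
categories = {"gender": ["F", "M"], "age": ["young", "old"]}

counts = combination_counts(rows, columns)
missing = missing_combinations(rows, columns, categories)
total_combinations = prod(len(categories[c]) for c in columns)

coverage = len(counts) / total_combinations  # share of combinations with >= 1 instance
max_duplicates = max(counts.values())        # largest group sharing one combination

print("missing combinations:", missing)
print("coverage:", coverage)
print("max duplicates:", max_duplicates)
```

In this toy example one combination has no instance, so the minimum-completeness condition is not met.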

### Frame theory
- Didn't understand method 
- Gini-Simpson index
- Eigen values & vectors
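
As a starting point for this method, the Gini-Simpson index itself is a standard diversity measure, $1 - \sum_i p_i^2$, where $p_i$ are proportions. The sketch below computes it for any list of non-negative weights. How the paper derives its eigenvalues is not covered in these notes, so passing normalised eigenvalues as the weights would be an assumption; the example uses plain group counts instead.

```python
def gini_simpson(weights):
    """Gini-Simpson index 1 - sum(p_i^2), with weights normalised to proportions p_i."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return 1.0 - sum((w / total) ** 2 for w in weights)


# Hypothetical group counts (assumption, for illustration only).
balanced = gini_simpson([25, 25, 25, 25])  # four equally sized groups
skewed = gini_simpson([97, 1, 1, 1])       # one dominant group

print("balanced:", balanced)
print("skewed:  ", skewed)
```

The index is higher for the balanced case and approaches 0 as one group dominates, which is why it can serve as a measure of balance.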

Two measures were proposed: a combinatorial metric and the Gini index on eigenvalues. The paper illustrates them with a case study and discusses their strengths and limitations.

## <u>A survey on bias in visual datasets</u>

[A survey on bias in visual datasets](https://www-sciencedirect-com.ejournals.um.edu.mt/science/article/pii/S1077314222001308?via%3Dihub)

- The study aims to address the concern of bias in datasets by:
    1. Identifying various forms of bias in visual datasets 
    2. Reviewing methods for detecting and measuring bias
    3. Discussing efforts to create datasets with bias awareness
 
- While algorithms may be responsible for amplifying pre-existing biases in the training data, issues in the quality of the data itself can also contribute significantly to the development of discriminatory AI applications.

- Two ways in which bias is encoded in the data were identified:
    1. Correlations and causal influences among the protected attributes and other features
    2. The lack of representation of protected groups in the data
    - It was also noted that biases manifest in ways that are specific to the data type.
    
- Types of bias that pertain to the capture and collection of visual data:
    1. **Selection bias** - The way we pick which pictures or images to include can create unfair differences or connections. This can lead to mistakes in understanding or results. For example, if we're studying faces and only include certain types of people, we might not get a complete picture of how well our system works for everyone. So, we need to be careful when choosing what images to use in our datasets to avoid this bias. **(Simplified explanation via ChatGPT)**
    
    2. **Capture/Framing bias** - When we "frame" something, we choose certain parts of what we're talking about and make them more noticeable in our communication. This helps shape how people see the issue and what they think about it. This concept of framing isn't just for words – it also applies to pictures. In visual studies, framing means picking a particular view, scene, or angle when creating or editing an image. These definitions show that framing bias has two parts. First, the way an image is put together can send different messages. Second, how an image is taken or edited can also introduce bias. So, when we talk about framing bias, we mean any differences or connections in an image that make people think differently, and these differences might come from how the image was put together.

    3. **Label bias** - In supervised learning, we need labeled data to teach the computer. The accuracy of these labels is crucial but can be challenging because today's datasets are complex and vast. Label bias happens when the labels given to the computer are different from the actual truth. For example, a face recognition dataset might have labels that are not very accurate compared to human annotations. Sometimes bias also comes from how things are labeled. For instance, different people might call the same thing by different names. This can be a big problem when dealing with things related to people like race or gender. So, label bias means mistakes in labeling data, either because the labels are wrong compared to the truth or because they use unclear or inappropriate categories.

    4. **Negative set bias** - The negative class is labelled in a way that does not fully reflect its population (say, non-white in a binary feature [white people/non-white people]). Negative set bias is considered an instance of both selection and label bias.
    
    ![image.png](attachment:image.png)
    
- Mentions the story of a Black American individual wrongfully arrested due to an error made by a facial recognition algorithm

- During the capture stage of the data, selection and framing bias can be introduced

- Dissemination of visual content suffers from both selection and framing bias

- Data collection for the sake of creating a dataset can result in selection & label bias

- Algorithms also raise issues: biased generative models can produce images that introduce further bias into the Visual Content Life Cycle.

Section 3 of the paper aims to answer the following question: "Given a visual dataset, is it possible to discover/quantify what types of bias it exhibits?"

- Strategies used to discover bias:
    
    **Reduction to tabular data** (important for our case)
    1. These rely on the attributes and labels attached to or extracted from the visual data and try to measure bias as if it were a tabular dataset.
    2. The features for the tabular description can be extracted either directly from the images (using, for example, some image recognition tools) or indirectly from some accompanying image description/annotation or both. This is therefore prone to errors and biases.
    3. The biases that exist in the original images (selection, framing, or label bias) might be reflected and even amplified in the tabular representation due to the bias-prone feature extraction process.
    4. Bias may also exist due to the labelling process and the automatic feature extraction. The impact of such additional sources on the results is typically omitted.
    
    

    **Biased image representations** 
    1. These rely on lower-dimensional representations of the data to discover bias.

    **Cross-dataset bias detection** 
    1. These assess bias by comparing different datasets, trying to discover some sort of “signature” due to the data collection process.

    **Other methods** 
    1. Different methods that could not fit any of the above categories.
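
Since reduction to tabular data is the strategy flagged as relevant here, the following is a minimal sketch of what such a check might look like once each image has been reduced to a row of attributes. It reports how well each group is represented and how often each group carries a given label, which touches on both ways bias is encoded in data (lack of representation, and correlation between a protected attribute and other features). The records, attribute names and helper functions are hypothetical, and the sketch assumes the extracted attributes are correct, which the survey warns is often not the case.

```python
from collections import Counter, defaultdict


def group_shares(records, attribute):
    """Share of records belonging to each value of a (protected) attribute."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def label_rate_by_group(records, attribute, label, positive):
    """Fraction of each group's records whose label equals the positive value."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for r in records:
        totals[r[attribute]] += 1
        if r[label] == positive:
            positives[r[attribute]] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical per-image annotations (assumption, for illustration only).
records = [
    {"group": "A", "label": "smiling"},
    {"group": "A", "label": "smiling"},
    {"group": "A", "label": "neutral"},
    {"group": "B", "label": "neutral"},
]

shares = group_shares(records, "group")
rates = label_rate_by_group(records, "group", "label", "smiling")

print("group shares:", shares)
print("'smiling' rate per group:", rates)
```

A large gap in group shares points to under-representation, while a large gap in label rates between groups suggests a correlation worth inspecting before training on the data.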

# Possible important papers

- https://arxiv.org/ftp/arxiv/papers/1505/1505.01257.pdf
- https://ieeexplore-ieee-org.ejournals.um.edu.mt/document/5995347