# FG-Net

- This notebook contains the EDA carried out before tackling this dataset
- I found this dataset on [this webpage](https://yanweifu.github.io/FG_NET_data/)
- There is a [papers with code entry](https://paperswithcode.com/dataset/fg-net) that shows how other people have used this dataset
- From the *papers with code* entry, I found the **[following paper](https://arxiv.org/abs/1602.06149)**
    - They propose a new dataset, Large Age-Gap dataset (LAG dataset)
    - They talk about the **LFW dataset**: it is the most famous dataset with almost no constraints (lighting, pose...)
    - But it does constrain age, which is precisely what we are interested in!
    - They talk about **FG-NET dataset** as one of the most famous datasets with aging gaps
    - So I think it is **valuable to talk about this paper in my thesis**
- *Papers with code* says that [this github repo](https://github.com/Hzzone/MTLFace) has the best model for the age-invariant recognition problem
    - They have a related [paper](https://arxiv.org/abs/2103.01520)
    - They use attention mechanisms
    - They talk about age-invariant face recognition or _**AIFR**_
    - They have a table with the results of different papers in this dataset, so **it can be interesting to talk about this paper in my thesis**
    - They say that *FG-NET* is the most challenging dataset for *AIFR*
    - They **describe precisely how testing is done**
    - They describe the **leave-one-out** protocol: each element of the dataset is used in turn as a query against the rest of the dataset, and Rank@1 is reported
    - They use **FG-Net only for validation**; they train on another, much larger dataset
- The previous paper links to [this paper](https://arxiv.org/abs/1904.04972)
    - They tackle AIFR with a novel approach (I am not interested in that approach)
    - They test against FG-Net with three metrics:
        1. Leave one out
        2. Mega Face Challenge 1: they test AIFR models by introducing a large number of distractors
        3. Mega Face Challenge 2
- **NOTE**: some papers use the following protocol:
    - They train on a much larger dataset
    - Then, they use the whole FG-Net as the evaluation dataset
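
The leave-one-out Rank@1 protocol described in the notes above can be sketched as follows. This is only an illustration under stated assumptions: the function name `leave_one_out_rank1` is hypothetical, the embeddings are assumed to come from some already trained model, and cosine similarity is an assumed choice of metric that the notes do not confirm.

```python
import numpy as np


def leave_one_out_rank1(embeddings: np.ndarray, ids: np.ndarray) -> float:
    """Fraction of queries whose nearest neighbour (excluding the query
    itself) belongs to the same person.

    Assumption: cosine similarity between embeddings is the ranking metric.
    """
    # L2-normalize so that the dot product equals the cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    similarity = normalized @ normalized.T

    # A query must never be matched against itself
    np.fill_diagonal(similarity, -np.inf)

    # Rank@1: only the single most similar gallery element counts
    nearest = similarity.argmax(axis=1)
    return float(np.mean(ids[nearest] == ids))


# Toy example: two people, two images each, well separated embeddings
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_ids = np.array([0, 0, 1, 1])
print(leave_one_out_rank1(toy_embeddings, toy_ids))
```

Because every image acts once as the query and the remaining images form the gallery, no separate train/test split of FG-Net is needed for this metric.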

# Imports

In [None]:
import os
import requests, zipfile, io
import itertools
from typing import Union, Tuple, List

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torchvision
import torchvision.transforms as T

import lib.visualizations as visualizations
import lib.datasets as datasets

# Global parameters of the notebook

In [None]:
# Lib to define paths
import os

# - For ease of use, we are going to store all global parameters into a dict
# - This way, we can pass this dict directly to wandb init, so we can keep track
# of which parameters produced which output

from typing import Dict, Union
GLOBALS: Dict[str, Union[str, int, float, bool]] = dict()

# Define if we are running the notebook in our computer ("local")
# or in Google Colab ("remote")
GLOBALS['RUNNING_ENV'] = "local"

# Base path for the rest of paths defined in the notebook
GLOBALS['BASE_PATH'] = "./" if GLOBALS['RUNNING_ENV'] == "local" else "/content/drive/MyDrive/Colab Notebooks/"

# Path to our lib dir
GLOBALS['LIB_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "lib")

# Path where we store training / test data
GLOBALS['DATA_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "data/FG_NET")

# URL of the zipfile with the dataset
GLOBALS['DATASET_URL'] = "http://yanweifu.github.io/FG_NET_data/FGNET.zip"

# Dataset has images and metadata. Here we store the path to the img dir 
GLOBALS['IMAGE_DIR_PATH'] = os.path.join(GLOBALS['DATA_PATH'], "FGNET/images")

# Auth for Google Drive

In [None]:
if GLOBALS['RUNNING_ENV'] == "remote":
    from google.colab import drive
    drive.mount('/content/drive')

# Dataset downloading 

In [None]:
datasets.download_fg_dataset(
    GLOBALS['DATA_PATH'],
    GLOBALS['DATASET_URL'],
    can_skip_download = True
)

# Putting the data into a pytorch `Dataset`

In [None]:
transform = T.Compose([
    T.ToPILImage(),
])
dataset = datasets.FGDataset(path = GLOBALS['IMAGE_DIR_PATH'], transform = transform)
dataset.set_exploration_mode(mode = True)

# Exploratory Data Analysis

## Show some examples of the data

In [None]:
# Show the first three elements of the dataset

for index in range(3):

    sample = dataset[index]
    img = sample["image"]
    age = sample["age"]
    # Use `person_id` instead of `id` to avoid shadowing the builtin
    person_id = sample["id"]

    print(f"Id {person_id} at age {age}")

    plt.imshow(img)
    plt.show()

## Show all the images of a given individual, identified by its ID, sorted by their age

In [None]:
# Set the id of the individual we want to identify
person_id = 14

# Select all the indices corresponding to that individual
id_indices = [idx for idx, element in enumerate(dataset) if element["id"] == person_id]

# Sort the list of indices by age
id_indices = sorted(
    id_indices,
    key = lambda idx: dataset[idx]["age"],
    reverse = False
)

# With the sorted list of indices, now we can get the images
# and also use the ages as the titles for the subplots

images = [
    dataset[idx]["image"]
    for idx in id_indices
]

ages = [dataset[idx]["age"] for idx in id_indices]
titles = [f"Age: {age}" for age in ages]

# Plot the images, using the formatted titles built above
visualizations.PIL_show_images_with_titles_same_window(images, titles = titles, figsize = (20, 40))

Checking different IDs shows that the dataset generation seems to be properly implemented.

## Show the shapes of the images

In [None]:
# Images are stored in PIL format, so we need a transform to pytorch tensors
# Build it once, outside the loop, and without overwriting the dataset `transform`
to_tensor = T.ToTensor()

for idx in range(10):

    # Get the image from the dataset
    img = dataset[idx]["image"]

    # Convert the PIL image to a pytorch tensor
    tensor = to_tensor(img)

    # And now we can query its shape
    print(tensor.shape)

The images have different shapes, so they need to be resized to a common size. Also, some images are in color (3 channels) and others are in grayscale (1 channel), so we should convert all the images to grayscale.
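
A minimal sketch of that preprocessing, assuming the images are PIL images. The helper name `to_common_format` and the 128x128 target size are hypothetical choices made for illustration, not values taken from this project.

```python
from typing import Tuple

from PIL import Image


def to_common_format(img: Image.Image, size: Tuple[int, int] = (128, 128)) -> Image.Image:
    """Convert an image to single-channel grayscale and resize it.

    Assumption: `size` is (width, height), and 128x128 is an arbitrary target.
    """
    # "L" is the 8-bit single-channel (grayscale) mode in PIL
    grayscale = img.convert("L")
    return grayscale.resize(size)


# Toy example with synthetic images of different shapes and channel counts
color_img = Image.new("RGB", (300, 400))
gray_img = Image.new("L", (50, 60))

for example in (color_img, gray_img):
    result = to_common_format(example)
    print(result.mode, result.size)
```

Inside a torchvision pipeline, the same effect could be obtained with `T.Grayscale` and `T.Resize` placed in the `Compose` list.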

## Exploring the *images-per-person* distribution

- One key aspect of the problem we are solving is the number of images per person
- For example, when doing `P-K` sampling, people with fewer than `K` images can cause problems (we have some mechanisms to deal with that)
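
To make the issue concrete, here is a hedged sketch of `P-K` sampling, where each batch holds `P` people with `K` images each. The function `sample_pk_batch` is hypothetical, and falling back to sampling with replacement is just one possible way to handle people with fewer than `K` images; it is not necessarily the mechanism used in our lib.

```python
import random
from typing import Dict, List, Tuple


def sample_pk_batch(
    images_by_person: Dict[int, List[int]],
    P: int,
    K: int,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Return a batch of (person_id, image_index) pairs with P people and K images each."""
    # Choose P distinct people
    persons = rng.sample(sorted(images_by_person), P)

    batch = []
    for person in persons:
        images = images_by_person[person]
        if len(images) >= K:
            # Enough images: sample K distinct ones
            chosen = rng.sample(images, K)
        else:
            # Assumed fallback: repeat images by sampling with replacement
            chosen = rng.choices(images, k=K)
        batch.extend((person, image) for image in chosen)
    return batch


# Toy example: person 1 only has two images, but K = 3
toy_images_by_person = {0: [0, 1, 2, 3], 1: [4, 5], 2: [6, 7, 8]}
toy_batch = sample_pk_batch(toy_images_by_person, P=3, K=3, rng=random.Random(0))
print(toy_batch)
```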

First, show the histogram of how many images per person there are:

In [None]:
# Remember that `dataset.individuals` is a dict whose keys are the ids of the persons and whose values
# are lists of ages (each age corresponds to a stored image, thus there might be repeated ages if there
# is more than one image for one concrete age)
imgs_per_user = [len(individual_imgs) for individual_imgs in dataset.individuals.values()]

# Now, plot the distribution of the number of images per user
visualizations.plot_histogram(
    values = imgs_per_user,
    num_bins = 20,
    title = "Images per user",
    xlabel = "Images per user",
    ylabel = "Number of instances",
    figsize = (10, 8)
)

There seem to be at least 6 images per person, and at most 18. The distribution looks roughly normal, but we are not interested in checking that assumption. Let's check the numbers by putting the data into a pandas dataframe:

In [None]:
imgs_per_user_df = pd.DataFrame({
    # Materialize the dict keys view as a list before handing it to pandas
    "IDs": list(dataset.individuals.keys()),
    "Nº images": imgs_per_user
})
imgs_per_user_df.describe()

So, in fact, we have at least 6 images per class, and at most 18! This is a huge improvement over the *LFW dataset*.

## Exploring the age distribution

- We have worked with the *LFW dataset*, but it had no variance in the age distribution (which is a key component of our problem)
- So now we study the age distribution of this dataset

In [None]:
# Get a flat list with all the ages in the dataset
ages = dataset.individuals.values()
ages = list(itertools.chain(*ages))

# Now, plot the histogram of the ages distribution
visualizations.plot_histogram(
    values = ages,
    num_bins = 70,
    title = "Age distribution",
    xlabel = "Age",
    ylabel = "Number of instances",
    figsize = (10, 8)
)

This histogram shows a skewed distribution. That is to say, we have more samples of young people than of older people. This bias can be a problem if we want to use the trained model in real-world environments, where the distribution can vary a lot!

We can observe a large spike around age 20. Let's explore this distribution further with a few basic metrics:
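
The skew can also be quantified instead of only read off the plot. The sketch below computes the sample skewness (third central moment divided by the 1.5 power of the second central moment); the helper `sample_skewness` and the toy ages are hypothetical and only illustrate the idea. A positive value means a longer tail towards older ages.

```python
import numpy as np


def sample_skewness(values) -> float:
    """Sample skewness: m3 / m2**1.5, using central moments."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)


# Toy data (not the real dataset): many young ages and a few old ones
toy_ages = [1, 2, 3, 4, 5, 18, 18, 18, 30, 60]
print(sample_skewness(toy_ages))
```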

In [None]:
print(f"Mean age: {np.mean(ages)}")
print(f"Min age: {min(ages)}")
print(f"Max age: {max(ages)}")
print(f"Most frequent age = {max(set(ages), key = ages.count)}")
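
The most frequent age lookup above calls `ages.count` once per distinct age. An equivalent single-pass sketch uses `collections.Counter`; the helper name `most_frequent` and the toy list are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def most_frequent(values: Iterable[Hashable]) -> Tuple[Hashable, int]:
    """Return the most common value together with how many times it appears."""
    value, count = Counter(values).most_common(1)[0]
    return value, count


# Toy data (not the real dataset)
toy_age_list = [18, 3, 18, 40, 3, 18]
print(most_frequent(toy_age_list))
```

Returning the count as well shows how dominant the most frequent age actually is.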

The most frequent age, which we saw as the spike in the histogram, is in fact 18 years. Ages range from 0 to 69 years. Let's get more detailed information using pandas:

In [None]:
min_ages = [min(age_list) for age_list in dataset.individuals.values()]
max_ages = [max(age_list) for age_list in dataset.individuals.values()]

ages_df = pd.DataFrame({
    # Materialize the dict keys view as a list before handing it to pandas
    "IDs": list(dataset.individuals.keys()),
    "Min age": min_ages,
    "Max age": max_ages,
})
ages_df["Age range"] = ages_df["Max age"] - ages_df["Min age"]

ages_df.head(5)

In [None]:
ages_df.describe()

We can see that every person has at least 11 years of difference between their youngest and oldest images. The mean age range is 27.80 years, which can make solving this task hard. On the other hand, it shows that this dataset is relevant to the problem we are trying to solve. The biggest age range is 54 years.

Let's see a histogram of the age range:

In [None]:
visualizations.plot_histogram(
    values = ages_df["Age range"],
    num_bins = 82,
    title = "Distribution of the age range",
    xlabel = "Difference in years for the same person",
    ylabel = "Frequency",
    figsize = (15, 10)
)