# FG-Net

- This notebook contains the exploratory data analysis (EDA) done before trying to solve this dataset
- I found this dataset on [this webpage](https://yanweifu.github.io/FG_NET_data/)
- There is a [papers with code entry](https://paperswithcode.com/dataset/fg-net) that shows how other people have used this dataset
- From the *papers with code* entry, I found the **[following paper](https://arxiv.org/abs/1602.06149)**
    - They propose a new dataset, the Large Age-Gap dataset (LAG dataset)
    - They talk about the **LFW dataset**: the most famous dataset with almost no constraints (lighting, pose...)
    - But age is constrained there, and age is precisely what we are interested in!
    - They talk about the **FG-NET dataset** as one of the most famous datasets with aging gaps
    - So I think it is **valuable to talk about this paper in my thesis**
- *Papers with code* says that [this github repo](https://github.com/Hzzone/MTLFace) has the best model for the age-invariant recognition problem
    - They have a related [paper](https://arxiv.org/abs/2103.01520)
    - They use attention mechanisms
    - They talk about age-invariant face recognition, or _**AIFR**_
    - They have a table with the results of different papers on this dataset, so **it can be interesting to talk about this paper in my thesis**
    - They say that *FG-NET* is the most challenging dataset for *AIFR*
    - They **describe precisely how testing is done**

# Imports

In [None]:
import os
import requests, zipfile, io
import itertools
from typing import Dict, List, Optional, Tuple, Union

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torchvision
import torchvision.transforms as T

import lib.visualizations as visualizations

# Global parameters of the notebook

In [None]:
# Lib to define paths
import os

# - For ease of use, we are going to store all global parameters into a dict
# - This way, we can pass this dict directly to wandb init, so we can keep track
# of which parameters produced which output

from typing import Dict, Union
GLOBALS: Dict[str, Union[str, int, float, bool]] = dict()

# Define if we are running the notebook in our computer ("local")
# or in Google Colab ("remote")
GLOBALS['RUNNING_ENV'] = "local"

# Base path for the rest of paths defined in the notebook
GLOBALS['BASE_PATH'] = "./" if GLOBALS['RUNNING_ENV'] == "local" else "/content/drive/MyDrive/Colab Notebooks/"

# Path to our lib dir
GLOBALS['LIB_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "lib")

# Path where we store training / test data
GLOBALS['DATA_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "data/FG_NET")

# URL of the zipfile with the dataset
GLOBALS['DATASET_URL'] = "http://yanweifu.github.io/FG_NET_data/FGNET.zip"

# Dataset has images and metadata. Here we store the path to the img dir 
GLOBALS['IMAGE_DIR_PATH'] = os.path.join(GLOBALS['DATA_PATH'], "FGNET/images")

# Auth for Google Drive

In [None]:
if GLOBALS['RUNNING_ENV'] == "remote":
    from google.colab import drive
    drive.mount('/content/drive')

# Dataset downloading 

In [None]:
def get_size(start_path = '.'):
    """
    Got from:
        https://stackoverflow.com/questions/1392413/calculating-a-directorys-size-using-python
    """
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size


def download_dataset(path: str, url: str, can_skip_download: bool = False):
    """Downloads and extracts the dataset from a given `url` into a given `path`"""

    # Create the dir (and any missing parent dirs) if it does not exist
    if not os.path.exists(path):
        print(f"Dir {path} does not exist, creating that dir")
        os.makedirs(path)

    # If the dir has a filesize bigger than 44.2MB, then it should be already
    # downloaded, and we can skip this step. However, the user can tell this
    # function not to skip, to make sure there has not been any data
    # corruption
    file_B = get_size(path)
    file_MB = file_B / (1024 * 1024)

    if file_MB > 44.2 and can_skip_download:
        print("Skipping the download, files are already downloaded")
        return

    # Download the dataset, treating HTTP error codes as failures
    try:
        print("Downloading the dataset")
        req = requests.get(url, timeout = 60)
        req.raise_for_status()
    except Exception as e:
        print("ERROR: could not download data from url")
        print(f"ERROR: error is:\n{e}")
        return

    # Extract the dataset contents
    print("Extracting the dataset contents")
    zip_file = zipfile.ZipFile(io.BytesIO(req.content))
    zip_file.extractall(path)

    print("Successful download")

    
download_dataset(
    GLOBALS['DATA_PATH'],
    GLOBALS['DATASET_URL'],
    can_skip_download = True
)
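
The size threshold used by `download_dataset` is only a rough proxy for "the dataset is already here". A minimal sketch of an alternative check is to count the image files instead. The helper names `count_images` and `is_already_downloaded` are hypothetical, and the expected count of 1002 is taken from the integrity check of the dataset class in this notebook.

```python
import os

# Assumption: a complete FG-NET download holds 1002 image files, the same
# figure asserted by the integrity check of the dataset class.
EXPECTED_NUM_IMAGES = 1002


def count_images(image_dir: str, extension: str = ".JPG") -> int:
    """Count the files in `image_dir` whose name ends with `extension` (case-insensitive)."""
    if not os.path.isdir(image_dir):
        return 0
    return sum(
        1
        for name in os.listdir(image_dir)
        if name.upper().endswith(extension.upper())
    )


def is_already_downloaded(image_dir: str, expected: int = EXPECTED_NUM_IMAGES) -> bool:
    """Hypothetical helper: True when the image dir holds exactly `expected` images."""
    return count_images(image_dir) == expected
```

A check like this would be called with the image dir, for example `is_already_downloaded(GLOBALS['IMAGE_DIR_PATH'])`, before deciding whether to skip the download.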

# Putting the data into a pytorch `Dataset`

In [None]:
# TODO -- properly document this class
# TODO -- say that we are not using all the metadata that the dataset has
class FGDataset(torch.utils.data.Dataset):
    def __init__(self, path: str, transform = None):
        
        # Path of the dir where all the images are stored
        # NOTE: This is the path of the image dir, and not the dataset dir, where
        # some metadata is also stored
        self.path = path
        
        # Transformation to apply to the images of the dataset
        # Items of this dataset are made up of: image, id and age
        # But the transformation is done only to the image
        self.transform = transform
        
        # Dict containing the following data:
        # - Keys are the ids of the individuals
        # - Values are lists with all the ages associated with that individual
        self.individuals: Optional[Dict[int, List[int]]] = None
        
        # Number of images stored in this class 
        self.__number_images: Union[int, None] = None
        
        # All the filenames of the images stored in `self.path` dir 
        self.file_names: Union[List, None] = None

        # Get the data from the dir and instantiate all the attributes of this class
        # Before that, all the attrs are None
        self.__generate_dataset()
                
        # Check that the dataset is properly created
        self.__check_integrity()

        super(FGDataset, self).__init__()
    
    def __len__(self) -> int:

        # Check that we have the number of images of the dataset
        if self.__number_images is None:
            raise Exception("Dataset is not initialized, thus, number of images is unknown")
        
        return self.__number_images
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist() 

        # Get the image from the index
        img_name = os.path.join(self.path, self.file_names[idx])
        image = torchvision.io.read_image(img_name)

        # Get the id and age from the file_name
        id, age = self.__id_and_age_from_file_name(self.file_names[idx])

        # Put together all the info
        sample = {
            "image": image,
            "id": id,
            "age": age,
        }

        # Items are made up of: image, id and age; as the prev code shows
        # But the transform is only made to the image, that is the only 
        # part of the dict where it makes sense
        if self.transform:
            sample["image"] = self.transform(sample["image"])

        return sample

    def __generate_dataset(self):

        # Get all the names of the files
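        # `os.listdir` returns names in arbitrary order, so sort them to keep
        # the index -> image mapping the same across runs and machines
        self.file_names = sorted(os.listdir(self.path))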

        # Use that for computing the size
        self.__number_images = len(self.file_names)
            
        self.individuals = dict()

        # Use the names to get the persons IDs and their ages
        for file_name in self.file_names:

            # Split into id and age
            id, age = self.__id_and_age_from_file_name(file_name)

            
            # If there is not already an entry for this id, create it
            # and set the initial value for the list
            if self.individuals.get(id) is None:
                self.individuals[id] = [age]
                continue
                
            # This individual already has a list (we have checked before)
            # so append to that list
            self.individuals[id].append(age)

    def __id_and_age_from_file_name(self, file_name: str) -> Tuple[int, int]:
        # Remove file extension
        file_name_no_extension = file_name.split(".JPG")[0]

        # Split into id and age
        id, age = file_name_no_extension.split("A")

        # Age can contain trailing letters, for example, the entries corresponding to:
        # `068A10a.JPG` and `068A10b.JPG`
        age = ''.join(filter(str.isdigit, age))

        # Now it is safe to cast both id and age to an int
        id, age = int(id), int(age)
        
        return id, age

    
    def __check_integrity(self):
                
        # Check that we have the proper number of images
        assert len(self) == 1002, f"Expected 1002 images, found {len(self)}"
    
    
transform = T.Compose([
    T.ToPILImage(),
])
dataset = FGDataset(path = GLOBALS['IMAGE_DIR_PATH'], transform = transform)
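
The file-name parsing inside `FGDataset` assumes that every file in the image dir follows the `<id>A<age>[letter].JPG` scheme, so a stray file (for example a hidden system file) would make the cast to `int` fail. The following is a minimal sketch of a stricter parser based on a regular expression. The pattern is inferred from names such as `068A10a.JPG`, and `parse_file_name` is a hypothetical helper, not part of the class above.

```python
import re
from typing import Optional, Tuple

# Assumed naming scheme, inferred from names such as `068A10a.JPG`:
# <numeric id> "A" <numeric age> <optional letter suffix> ".JPG"
FILE_NAME_PATTERN = re.compile(r"^(\d+)A(\d+)[a-z]?\.jpg$", re.IGNORECASE)


def parse_file_name(file_name: str) -> Optional[Tuple[int, int]]:
    """Return (id, age), or None when the name does not follow the scheme."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Returning `None` instead of raising lets the caller decide whether to skip unexpected files or to report them.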

# Exploratory Data Analysis

## Show some examples of the data

In [None]:
# Show the first three elements of the dataset

for index in range(3):
    
    sample = dataset[index]
    img = sample["image"]
    age = sample["age"]
    id = sample["id"]
    
    print(f"Id {id} at age {age}")

    plt.imshow(img)
    plt.show()

## Show all the images of a given individual, identified by their ID, sorted by age

In [None]:
# Set the id of the individual we want to show
id = 14

# Select all the indices corresponding to that individual
id_indices = [idx for idx, element in enumerate(dataset) if element["id"] == id]

# Sort the list of indices by age
id_indices = sorted(
    id_indices,
    key = lambda idx: dataset[idx]["age"],
    reverse = False
)

# With the sorted list of indices, now we can get the images
# and also use the ages as the title for the subplots

images = [
    dataset[idx]["image"]
    for idx in id_indices
]

ages = [dataset[idx]["age"] for idx in id_indices]
titles = [f"Age: {age}" for age in ages]

# Plot the images
visualizations.PIL_show_images_with_titles_same_window(images, titles = titles, figsize = (20, 40))

Checking different IDs shows us that the dataset generation seems to be properly implemented.
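
Iterating with `enumerate(dataset)` decodes every image just to read its id, and sorting with `dataset[idx]["age"]` decodes the selected ones again. Since id and age come from the file names alone, the selection can work on metadata only. Below is a minimal sketch, assuming a list of `(id, age)` pairs built in the same order as `dataset.file_names`; the function name and the toy metadata are hypothetical.

```python
from typing import List, Tuple


def indices_sorted_by_age(metadata: List[Tuple[int, int]], target_id: int) -> List[int]:
    """Indices of the entries belonging to `target_id`, youngest first.

    `metadata[i]` is assumed to be the (id, age) pair of the i-th image.
    """
    matching = [
        idx
        for idx, (person_id, _) in enumerate(metadata)
        if person_id == target_id
    ]
    return sorted(matching, key=lambda idx: metadata[idx][1])


# Toy metadata, not taken from the real dataset
toy_metadata = [(14, 30), (2, 5), (14, 3), (14, 18)]
print(indices_sorted_by_age(toy_metadata, target_id=14))
```

With indices selected this way, only the images that are actually plotted need to be loaded.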

## Exploring the *images-per-person* distribution

- One key aspect of the problem we are solving is the number of images per person
- For example, when doing `P-K` sampling, persons with fewer than `K` images can cause problems (we have some mechanisms to deal with that)
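
In `P-K` sampling each batch is built from `P` persons with `K` images per person. The mechanisms mentioned above are not shown in this notebook, so the following is only a sketch of one possible fallback: when a person has fewer than `K` images, sample with replacement. The function name, the `images_by_person` layout (person id to list of image indices) and the fallback itself are assumptions.

```python
import random
from typing import Dict, List, Optional


def sample_pk_batch(
    images_by_person: Dict[int, List[int]],
    p: int,
    k: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Draw `p` persons and `k` image indices per person.

    Persons with fewer than `k` images are sampled with replacement, which
    is an assumption of this sketch and only one possible fallback.
    """
    if rng is None:
        rng = random.Random()

    batch: List[int] = []
    for person in rng.sample(sorted(images_by_person), p):
        candidates = images_by_person[person]
        if len(candidates) >= k:
            # Enough images: draw k distinct ones
            batch.extend(rng.sample(candidates, k))
        else:
            # Too few images: repeats are unavoidable
            batch.extend(rng.choices(candidates, k=k))
    return batch
```

Every batch then has exactly `P * K` elements, at the cost of repeated images for persons with few samples.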

First, show the histogram of how many images per person there are:

In [None]:
# Remember that `dataset.individuals` is a dict whose keys are the IDs of the persons and whose values
# are lists of ages (each age corresponds to a stored image, thus there might be repeated ages if there
# is more than one image for a given age)
imgs_per_user = [len(individual_imgs) for individual_imgs in dataset.individuals.values()]

# Now, plot the distribution of the number of images per person
visualizations.plot_histogram(
    values = imgs_per_user,
    num_bins = 20,
    title = "Images per user",
    xlabel = "Images per user",
    ylabel = "Number of instances",
    figsize = (10, 8)
)

There seem to be at least 6 images per person and at most 18. The distribution looks roughly normal, but we are not interested in checking that assumption. Let's check the numbers by putting the data into a pandas dataframe:

In [None]:
imgs_per_user_df = pd.DataFrame({
    "IDs": dataset.individuals.keys(),
    "Nº images": imgs_per_user
})
imgs_per_user_df.describe()

So, in fact, we have at least 6 images per class, and 18 images per class at most! This is a huge improvement over the *LFW dataset*

## Exploring the age distribution

- We have worked with the *LFW dataset*, but it had no variance in the age distribution (which is a key component of our problem)
- So now we study that age distribution

In [None]:
# Get a flat list with all the ages in the dataset
ages = dataset.individuals.values()
ages = list(itertools.chain(*ages))

# Now, plot the histogram of the ages distribution
visualizations.plot_histogram(
    values = ages,
    num_bins = 70,
    title = "Age distribution",
    xlabel = "Age",
    ylabel = "Number of instances",
    figsize = (10, 8)
)

This histogram shows us a skewed distribution. That is to say, we have more samples of young people than of older people. This bias can be a problem if we want to use the trained model in real-world environments, where the distribution can vary a lot!
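
The skew can also be quantified instead of being read off the plot. A minimal sketch, using the Fisher-Pearson coefficient $g_1 = m_3 / m_2^{3/2}$ (biased estimator), where $m_2$ and $m_3$ are the second and third central moments. The function name and the toy values are hypothetical; a positive result means a longer tail towards high values, which is the shape described above.

```python
from typing import Sequence


def sample_skewness(values: Sequence[float]) -> float:
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5 (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy ages, not taken from the real dataset: many young, few old
toy_ages = [1, 1, 2, 2, 3, 18, 40]
print(f"Skewness: {sample_skewness(toy_ages):.2f}")
```

In the notebook this would be called on the flat `ages` list built before plotting the histogram.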

We can observe a large spike around the age of 20. Let's explore this distribution further with a few basic metrics:

In [None]:
print(f"Mean age: {np.mean(ages)}")
print(f"Min age: {min(ages)}")
print(f"Max age: {max(ages)}")
print(f"Most frequent age = {max(set(ages), key = ages.count)}")
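
The expression `max(set(ages), key = ages.count)` in the cell above rescans the whole list once per distinct age. A sketch of an alternative with `collections.Counter` counts everything in one pass and also gives the runners-up. The helper name and the toy values are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent(values: List[int], top: int = 3) -> List[Tuple[int, int]]:
    """Return the `top` most common values as (value, count) pairs."""
    return Counter(values).most_common(top)


# Toy ages, not taken from the real dataset
toy_ages = [18, 18, 18, 5, 5, 30]
for age, count in most_frequent(toy_ages, top=2):
    print(f"Age {age}: {count} images")
```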

The most frequent age, which we saw as the spike in the histogram, is in fact 18 years. Ages range from 0 to 69 years. Let's get more detailed information using pandas:

In [None]:
min_ages = [min(age_list) for age_list in dataset.individuals.values()]
max_ages = [max(age_list) for age_list in dataset.individuals.values()]

ages_df = pd.DataFrame({
    "IDs": dataset.individuals.keys(),
    "Min age": min_ages,
    "Max age": max_ages,
})
ages_df["Age range"] = ages_df["Max age"] - ages_df["Min age"]

ages_df.head(5)

In [None]:
ages_df.describe()

We can see that there are at least 11 years of difference between the youngest and the oldest image of the same person. The mean age range is 27.80 years, which can make solving this task hard. On the other hand, it shows that this dataset is relevant for the problem that we are trying to solve. The biggest age range is 54 years.

Let's see a histogram of the age range:

In [None]:
visualizations.plot_histogram(
    values = ages_df["Age range"],
    num_bins = 82,
    title = "Distribution of the age range",
    xlabel = "Difference in years for the same person",
    ylabel = "Frequency",
    figsize = (15, 10)
)
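
The age ranges are whole numbers between 11 and 54 years, so a histogram with 82 equal-width bins necessarily contains bins that no integer falls into, and the plot shows artificial gaps. A minimal sketch of integer-aligned bins follows: edges sit at half-integers so that each year gets exactly one bin. The helper names are hypothetical. `matplotlib.pyplot.hist` accepts such a list of edges through its `bins` argument; whether `visualizations.plot_histogram` does as well is not known here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def integer_bin_edges(values: Sequence[int]) -> List[float]:
    """Edges at half-integers so that every integer value gets its own bin."""
    low, high = min(values), max(values)
    return [edge - 0.5 for edge in range(low, high + 2)]


def counts_per_value(values: Sequence[int]) -> Dict[int, int]:
    """Bar heights such a histogram would show, including empty bins."""
    counted = Counter(values)
    return {
        value: counted.get(value, 0)
        for value in range(min(values), max(values) + 1)
    }


# Toy age ranges, not taken from the real dataset
toy_ranges = [11, 13, 13]
print(integer_bin_edges(toy_ranges))
print(counts_per_value(toy_ranges))
```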