# CACD Dataset

- The purpose of this notebook is to document the EDA done before working with this dataset
- We are going to train on this dataset and validate on the *FG-Net dataset*. So the real target is FG-Net, which is too small to train on, and we approach it through this much larger dataset
- [Papers with code entry](https://paperswithcode.com/dataset/cacd)
- The dataset was introduced in the paper [Cross-Age Reference Coding for Age-Invariant Face Recognition and Retrieval](https://link.springer.com/chapter/10.1007/978-3-319-10599-4_49)
    - They work with *Matlab*, and the dataset structure reflects that somehow (as they say in their website)
- I think this is their [official website](https://bcsiriuschen.github.io/CARC/). In that website, they note:
    - Celebrities with rank smaller or equal to five might have some noise
    - The dataset might contain some duplicates
    - The dataset is designed for cross-age face recognition and retrieval. Year labels are rough, so the dataset is not suitable for age-estimation problems
    - So the dataset seems to be a perfect fit for our needs
    - They have prepared a **testing subset** with image pairs with half positives and half negatives
- Dataset metadata (from their [website](https://bcsiriuschen.github.io/CARC/)):
    - celebrityData - contains information of the 2,000 celebrities
        - name - celebrity name
        - identity - celebrity id
        - birth - celebrity birth year
        - rank - rank of the celebrity with same birth year in IMDB.com when the dataset was constructed
        - lfw - whether the celebrity is in LFW dataset
    - celebrityImageData - contains information of the face images
        - age - estimated age of the celebrity
        - identity - celebrity id
        - year - estimated year in which the photo was taken
        - feature - 75,520 dimension LBP feature extracted from 16 facial landmarks
        - name - file name of the image
- **IMPORTANT**: the dataset has been constructed by searching `<celebrity name> + <year between 2004 - 2013>`, so the max age difference for one person should be 9 years (2013 - 2004). In FG-Net we have age differences up to 54 years. So this might be a **PROBLEM**
- **NOTE**: the 4.5GB dataset with all the metadata seems to be useless for our purposes. We can see in this [Google Colab Notebook](https://colab.research.google.com/drive/1X0NftH0Y1b2vL6ytztS2di7Iqfdine60?usp=sharing) that we are not going to use all of that metadata
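
To make the metadata fields above concrete, here is a minimal sketch that models one `celebrityImageData` entry and computes the age range covered by each identity. The `ImageRecord` class, the helper `age_range_per_identity` and the sample values are hypothetical: they only mirror the field names listed above and are not the authors' loading code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ImageRecord:
    """Hypothetical mirror of one `celebrityImageData` entry."""
    name: str      # file name of the image
    identity: int  # celebrity id
    age: int       # estimated age of the celebrity
    year: int      # estimated year in which the photo was taken


def age_range_per_identity(records: List[ImageRecord]) -> Dict[int, int]:
    """Return, for each identity, max(age) - min(age) over its images."""
    ages: Dict[int, List[int]] = {}
    for record in records:
        ages.setdefault(record.identity, []).append(record.age)
    return {identity: max(a) - min(a) for identity, a in ages.items()}


# Made-up records, only for illustration
records = [
    ImageRecord("a_0001.jpg", identity=1, age=30, year=2005),
    ImageRecord("a_0002.jpg", identity=1, age=37, year=2012),
    ImageRecord("b_0001.jpg", identity=2, age=50, year=2009),
]
print(age_range_per_identity(records))
```

The same per-identity range is what the age range section of this notebook studies on the real data.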

# Imports

In [None]:
import os
import requests, zipfile, io
import itertools
from typing import Union, Tuple, List

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import torch
import torchvision
import torchvision.transforms as T

import lib.visualizations as visualizations
import lib.datasets as datasets

# Global parameters of the notebook

In [None]:
# Lib to define paths
import os

# - For ease of use, we are going to store all global parameters into a dict
# - This way, we can pass this dict directly to wandb init, so we can keep track
# of which parameters produced which output

from typing import Dict, Union
GLOBALS: Dict[str, Union[str, int, float, bool]] = dict()

# Define if we are running the notebook in our computer ("local")
# or in Google Colab ("remote")
GLOBALS['RUNNING_ENV'] = "local"

# Base path for the rest of paths defined in the notebook
GLOBALS['BASE_PATH'] = "./" if GLOBALS['RUNNING_ENV'] == "local" else "/content/drive/MyDrive/Colab Notebooks/"

# Path to our lib dir
GLOBALS['LIB_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "lib")

# Path where we store training / test data
GLOBALS['DATA_PATH'] = os.path.join(GLOBALS['BASE_PATH'], "data/CACD")

# Images are stored in a different folder, because the extraction
# method creates a new folder
GLOBALS['IMAGE_DIR_PATH'] = os.path.join(GLOBALS['DATA_PATH'], "CACD2000")

# URL of the zipfile with the dataset
GLOBALS['DATASET_URL'] = "https://drive.google.com/file/d/1hYIZadxcPG27Fo7mQln0Ey7uqw1DoBvM/view"

# Set some options for displaying better quality images

# Font for the labels of the axes
fontopts = {
    'fontname': 'serif',
    'fontsize': 16,
}

# Set higher DPI values
%config InlineBackend.figure_format = 'retina'  # 'retina' or 'png2x'
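
The comments in this cell mention passing the parameter dict to wandb. As a hedged sketch of that idea: `wandb.init` accepts a `config` argument, so a dict like `GLOBALS` can be handed over as-is. The project name `"cacd-eda"` and the helper `build_wandb_kwargs` are hypothetical, and `example_globals` is a stand-in for `GLOBALS`. The actual call is left commented out because it needs the `wandb` package and an account.

```python
from typing import Dict, Union

ParamDict = Dict[str, Union[str, int, float, bool]]


def build_wandb_kwargs(params: ParamDict, project: str) -> dict:
    """Build the keyword arguments for `wandb.init` (hypothetical helper)."""
    # Copy the dict, so later changes to the parameters do not alter what was logged
    return {"project": project, "config": dict(params)}


# Stand-in for the GLOBALS dict of this notebook
example_globals: ParamDict = {"RUNNING_ENV": "local", "BASE_PATH": "./"}
kwargs = build_wandb_kwargs(example_globals, project="cacd-eda")

# import wandb
# run = wandb.init(**kwargs)
print(kwargs)
```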

# Auth for Google Drive

In [None]:
if GLOBALS['RUNNING_ENV'] == "remote":
    from google.colab import drive
    drive.mount('/content/drive')

# Dataset downloading 

In [None]:
datasets.download_cacd_dataset(
    GLOBALS['DATA_PATH'],
    GLOBALS['DATASET_URL'],
    can_skip_download = True,
    can_skip_extraction = False,
)
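
The URL stored in `GLOBALS['DATASET_URL']` is a Google Drive *view* link, which serves an HTML page rather than the zip file itself. How `datasets.download_cacd_dataset` deals with this is not shown in this notebook. The sketch below is only one common approach, with hypothetical helper names, that extracts the file id and builds a direct-download style URL.

```python
import re


def drive_file_id(view_url: str) -> str:
    """Extract the file id from a `.../file/d/<id>/view` Google Drive link."""
    match = re.search(r"/file/d/([^/]+)", view_url)
    if match is None:
        raise ValueError(f"Not a Google Drive file link: {view_url}")
    return match.group(1)


def drive_download_url(view_url: str) -> str:
    """Build a direct-download style URL from a view link."""
    return f"https://drive.google.com/uc?export=download&id={drive_file_id(view_url)}"


url = "https://drive.google.com/file/d/1hYIZadxcPG27Fo7mQln0Ey7uqw1DoBvM/view"
print(drive_download_url(url))
```

For large files Google Drive may additionally ask for a confirmation step before serving the content, which this sketch does not handle.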

# Putting the data into a PyTorch `Dataset`

In [None]:
transform = T.transforms.Compose([
    T.ToPILImage(),
])
dataset = datasets.CACDDataset(path = GLOBALS['IMAGE_DIR_PATH'], transform = transform)
dataset.set_exploration_mode(mode = True)
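
The internals of `datasets.CACDDataset` are not shown in this notebook. As a rough sketch of how the `individuals` dict used later (person id mapped to a list of ages) could be built, assume that the image files follow the pattern `<age>_<name>_<index>.jpg`. Both helpers, the id assignment by order of first appearance, and the sample file names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_cacd_filename(filename: str) -> Tuple[int, str]:
    """Split `<age>_<name>_<index>.jpg` into (age, name). Assumed naming pattern."""
    stem = filename.rsplit(".", 1)[0]
    age_str, rest = stem.split("_", 1)
    name, _index = rest.rsplit("_", 1)
    return int(age_str), name


def build_individuals(filenames: List[str]) -> Dict[int, List[int]]:
    """Map each person id to the list of ages of their images."""
    name_to_id: Dict[str, int] = {}
    individuals: Dict[int, List[int]] = {}
    for filename in filenames:
        age, name = parse_cacd_filename(filename)
        # Hypothetical id scheme: ids are assigned in order of first appearance
        person_id = name_to_id.setdefault(name, len(name_to_id))
        individuals.setdefault(person_id, []).append(age)
    return individuals


# Illustrative file names only
files = [
    "53_Robin_Williams_0001.jpg",
    "54_Robin_Williams_0002.jpg",
    "14_Aaron_Johnson_0001.jpg",
]
print(build_individuals(files))
```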

# Exploratory Data Analysis

## Show some examples of the data

In [None]:
# Show the first three elements of the dataset

for index in range(3):

    sample = dataset[index]
    img = sample["image"]
    age = sample["age"]

    # Avoid the name `id`, which would shadow the Python builtin
    person_id = sample["id"]

    print(f"Id {person_id} at age {age}")

    plt.imshow(img)
    plt.show()

## Show all the images of a given individual, identified by its ID, sorted by their age

In [None]:
# Set the id of the individual we want to inspect
target_id = 14

# Select all the indexes corresponding to that individual
id_indexes = [idx for idx, element in enumerate(dataset) if element["id"] == target_id]

# Sort the list of indexes by age
id_indexes = sorted(
    id_indexes,
    key = lambda idx: dataset[idx]["age"],
    reverse = False
)

# With the sorted list of indexes, now we can get the images
# and also use the ages as the title for the subplots

images = [
    dataset[idx]["image"]
    for idx in id_indexes
]

ages = [dataset[idx]["age"] for idx in id_indexes]
titles = [f"Age: {age}" for age in ages]

# Plot the images, using the formatted titles built above
visualizations.PIL_show_images_with_titles_same_window(images, titles = titles, figsize = (20, 40))

Checking different IDs shows us that the dataset generation seems to be properly implemented.

## Show the shapes of the images

In [None]:
# Images are stored in PIL format, so we convert them to PyTorch tensors
# The transform is the same for every image, so we build it once outside the loop
to_tensor = T.ToTensor()

for idx in range(10):

    # Get the image from the dataset
    img = dataset[idx]["image"]
    tensor = to_tensor(img)

    # And now we can query about its shape
    print(tensor.shape)

All the images seem to have the same shape. Also, they all seem to be in color (3 channels). So normalization here is not as critical as in the FG-Net dataset. However, FG-Net having different shapes and color schemes might be problematic (FG-Net might be much harder than this dataset).
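
Printing ten shapes only samples the dataset. A sketch of a fuller check, assuming each image exposes a `(channels, height, width)` shape like the tensors printed above, is to count the distinct shapes. The helper `count_shapes` is hypothetical and the `shapes` list is made-up stand-in data, not real output.

```python
from collections import Counter
from typing import Iterable, Tuple

Shape = Tuple[int, int, int]


def count_shapes(shapes: Iterable[Shape]) -> Counter:
    """Count how many images have each (channels, height, width) shape."""
    return Counter(shapes)


# Made-up shapes: two color images and one grayscale image
shapes = [(3, 250, 250), (3, 250, 250), (1, 250, 250)]
counts = count_shapes(shapes)
print(counts.most_common())
```

If the counter ends up with a single key, every inspected image shares one shape and one color scheme.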

## Exploring the *images-per-person* distribution

- One key aspect of the problem we are solving is the number of images per person
- For example, when doing `P-K` sampling, if there are persons with fewer than `K` images, there might be a problem (we have some mechanisms to deal with that problem)
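
As a reminder of why the number of images per person matters, `P-K` sampling builds each batch from `P` persons and `K` images per person. The sketch below is a minimal version with a hypothetical function name. Sampling with replacement when a person has fewer than `K` images is just one possible fallback, not necessarily the mechanism used in this project.

```python
import random
from typing import Dict, List, Tuple


def pk_batch(
    images_by_person: Dict[int, List[int]],
    p: int,
    k: int,
    rng: random.Random,
) -> List[Tuple[int, int]]:
    """Return P * K pairs (person id, image index) for one batch."""
    persons = rng.sample(sorted(images_by_person), p)
    batch: List[Tuple[int, int]] = []
    for person in persons:
        candidates = images_by_person[person]
        if len(candidates) >= k:
            # Enough images: sample K distinct ones
            chosen = rng.sample(candidates, k)
        else:
            # Fewer than K images: sample with replacement (assumed fallback)
            chosen = rng.choices(candidates, k=k)
        batch.extend((person, idx) for idx in chosen)
    return batch


# Toy data: person 2 has fewer than K = 3 images
toy = {0: [0, 1, 2, 3], 1: [4, 5, 6], 2: [7]}
print(pk_batch(toy, p=3, k=3, rng=random.Random(0)))
```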

First, show the histogram of how many images per person there are:

In [None]:
# Remember that `dataset.individuals` is a dict whose keys are the ids of the persons and whose
# values are lists of ages (each age corresponds to a stored image, so an age can be repeated
# if there is more than one image for that age)
imgs_per_user = [len(individual_imgs) for individual_imgs in dataset.individuals.values()]

# Now, plot the distribution of the number of images per person
visualizations.plot_histogram(
    values = imgs_per_user,
    num_bins = 20,
    title = "Images per user",
    xlabel = "Images per user",
    ylabel = "Number of instances",
    figsize = (10, 8),
    fontopts = fontopts,
)

There seem to be a bit more than 20 images per person at least, and 138 or 139 at most. The distribution looks roughly bell-shaped, but we are not interested in testing for normality. To check the exact numbers, put the data into a pandas dataframe:

In [None]:
imgs_per_user_df = pd.DataFrame({
    "IDs": dataset.individuals.keys(),
    "Nº images per user": imgs_per_user
})
imgs_per_user_df.describe()

We have at least 22 images per user, and at most 139. This might let us use bigger values of `K` in `P-K` sampling. But remember that in FG-Net we have at least 6 images per user and at most 18.

Having values bigger than 18 should not produce Python errors, as we are going to use FG-Net to validate with Rank@k. But if we want to use other metrics, such as Local Rank@k, it might matter.
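
For reference, a minimal sketch of Rank@k under one common definition, which is assumed here rather than taken from this project's code: a query counts as a hit if at least one of its `k` nearest neighbours shares its identity, and the metric is the fraction of hits. The function name and the toy embeddings are hypothetical.

```python
from typing import Sequence


def rank_at_k(
    embeddings: Sequence[Sequence[float]],
    ids: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries with a same-identity item among their k nearest neighbours."""
    hits = 0
    for query, (query_emb, query_id) in enumerate(zip(embeddings, ids)):
        # Squared Euclidean distance from the query to every other item
        distances = [
            (sum((a - b) ** 2 for a, b in zip(query_emb, emb)), idx)
            for idx, emb in enumerate(embeddings)
            if idx != query
        ]
        nearest = sorted(distances)[:k]
        if any(ids[idx] == query_id for _, idx in nearest):
            hits += 1
    return hits / len(embeddings)


# Toy data: three close points of identity 1 and one isolated point of identity 2
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.0]]
toy_ids = [1, 1, 2, 1]
print(rank_at_k(toy_embeddings, toy_ids, k=1))
```

In the toy data the isolated identity has no second image, so its query can never be a hit. This is why very small numbers of images per person can distort the metric.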

## Exploring the age distribution

- We have worked with the *LFW dataset*, but there was no variance in the age distribution (which is a key component in our problem)
- So now we study the age distribution of this dataset

In [None]:
# Get a flat list with all the ages in the dataset
ages = dataset.individuals.values()
ages = list(itertools.chain(*ages))

# Now, plot the histogram of the ages distribution
visualizations.plot_histogram(
    values = ages,
    num_bins = len(set(ages)),
    title = "Age distribution",
    xlabel = "Age",
    ylabel = "Number of instances",
    figsize = (10, 8),
    fontopts = fontopts,
)

FG-Net showed a skewed distribution, with more samples at young ages. This time we have a more symmetrical distribution, without the same concentration at lower ages.

These differences in distribution might be a problem in the future.
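
The visual impression of skew can be backed by a number. A minimal sketch, using the Fisher-Pearson skewness coefficient $g_1 = m_3 / m_2^{3/2}$ (where $m_2$ and $m_3$ are the second and third central moments): values near 0 indicate a symmetrical distribution, and positive values indicate a longer tail towards higher ages. The function name and the toy lists are hypothetical and are not real dataset values.

```python
from typing import Sequence


def sample_skewness(values: Sequence[float]) -> float:
    """Fisher-Pearson coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Toy data, only for illustration
symmetric_ages = [30, 35, 40, 45, 50]
young_heavy_ages = [1, 2, 2, 3, 5, 40]

print(sample_skewness(symmetric_ages))
print(sample_skewness(young_heavy_ages))
```

On the real data the function could be called with the flat `ages` list built above.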

In [None]:
print(f"Mean age: {np.mean(ages)}")
print(f"Min age: {min(ages)}")
print(f"Max age: {max(ages)}")
print(f"Most frequent age = {max(set(ages), key = ages.count)}")

Let's compare these metrics with the FG-Net dataset ones:

| Dataset | Mean Age | Min Age | Max Age | Most frequent age | 
|:---     | :---     | :---    | :---    | :---              | 
| FG-Net  | 15.84    | 0       | 69      | 18                | 
| CACD    | 38.03    | 14      | 62      | 37                | 

## Exploring the age range distribution

First, compute the age range data:

In [None]:
min_ages = [min(age_list) for age_list in dataset.individuals.values()]
max_ages = [max(age_list) for age_list in dataset.individuals.values()]

ages_df = pd.DataFrame({
    "IDs": dataset.individuals.keys(),
    "Min age": min_ages,
    "Max age": max_ages,
})
ages_df["Age range"] = ages_df["Max age"] - ages_df["Min age"]

ages_df.head(5)

In [None]:
ages_df.describe()

We can see that we have at least 11 years of difference among images of the same person. The mean age range is 27.80 years, which can make solving this task hard. On the other hand, it shows that this dataset is relevant for the problem that we are trying to solve. The biggest age range is 54 years.

Let's see a histogram for the age range:

In [None]:
visualizations.plot_histogram(
    values = ages_df["Age range"],
    num_bins = ages_df["Age range"].nunique(),
    title = "Distribution of the age range",
    xlabel = "Difference in years for the same person",
    ylabel = "Frequency",
    figsize = (15, 10),
    fontopts = fontopts,
)

Show the age ranges of the individuals:

In [None]:
def plot_age_range(data: List[Tuple[int, int]]):
    """
    Given a list with the following structure:
        `[(lowest_age, highest_age), (lowest_age, highest_age), ...]`
    plots, per individual, a vertical bar from their lowest to their highest age.

    Individuals are sorted first by lowest age, then by highest age
    """

    # Sort a copy of the data, so the caller's list is not modified
    data = sorted(data, key=lambda x: (x[0], x[1]))

    # Initialize a figure and axis
    fig, ax = plt.subplots()

    # Horizontal spacing between consecutive individuals
    # Used for offsetting and getting non-overlapping lines
    width = 200.0

    # Loop through the sorted data and plot a line with circles for each individual
    for i, (lowest, highest) in enumerate(data):

        # Horizontal position of this individual, controlled by the width variable
        x = i + i * width

        # Vertical line
        ax.plot([x, x], [lowest, highest], color='b', linewidth=2)

        # Two circles: blue for the lowest age, red for the highest
        ax.plot(x, lowest, 'bo', markersize=2)
        ax.plot(x, highest, 'ro', markersize=2)

    # Remove X-axis ticks and their labels
    ax.set_xticks([])

    # Set labels and title
    ax.set_xlabel('Individual')
    ax.set_ylabel('Age')
    ax.set_title('Age Range per Individual (Sorted by Lowest Age, Secondary by Highest Age)')

    # Show the plot
    plt.grid()
    plt.show()

In [None]:
# Compute the age ranges
age_lower_and_upper = []

for el in dataset.individuals.values():
    age_lower_and_upper.append((min(el), max(el)))

# And use that data to plot
plot_age_range(age_lower_and_upper)

Almost all individuals have an age range of 9 years. Let's see how many individuals have a smaller age range:

In [None]:
lower_ranges = [
    age_range for age_range in list(ages_df['Age range'])
    if age_range < 9
]
print("Nº of individuals having less than 9 years of age range: ", len(lower_ranges))