# Data Exploration Notebook
## MLA Project - Fader Networks

**Authors:** Adrien PETARD, Robin LEVEQUE, Théo MAGOUDI, Eliot CHRISTON

**Group:** 11

---
![alt text](data/CelebA.png "CelebA")

## Introduction

The goal of this notebook is to explore the CelebA dataset and to understand how it is structured.

Here is the link to the dataset: [https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html](https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)

The data is divided into 3 folders:
- <u>**Anno** (`annotation`):</u> contains the annotations of the dataset
- <u>**Eval** (`evaluation`):</u> contains the evaluation files of the dataset
- <u>**Img** (`images`):</u> contains the images of the dataset; here we use the aligned and cropped images
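
Before loading anything, it can help to confirm that the three folders are where the notebook expects them. The snippet below is a minimal sketch: the root folder `data` is taken from the paths used later in this notebook, and the helper name `missing_folders` is hypothetical.

```python
from pathlib import Path

# Folder names come from the list above; the "data" root is an assumption
# based on the file paths used in the following cells.
EXPECTED_FOLDERS = ("Anno", "Eval", "Img")


def missing_folders(root):
    """Return the expected CelebA sub-folders that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


# An empty list means the layout matches the description above.
print("Missing folders:", missing_folders("data"))
```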


## Imports

In [1]:
import pandas as pd
import numpy as np

## Annotations

#### Identity

There are 10,177 distinct identities across 202,599 images, with between 1 and 35 images per identity.

In [2]:
identity = pd.read_csv('data/Anno/identity_CelebA.txt', sep=" ", header=None, index_col=0)
identity.columns = ["identity_id"]
identity.index.name = "image_id"

identity.head()

image_id,identity_id
000001.jpg,2880
000002.jpg,2937
000003.jpg,8692
000004.jpg,5805
000005.jpg,9295


In [3]:
# here we count the images per identity and sort the identities by frequency
identity_counts = identity.groupby("identity_id").size().reset_index(name="count")
identity_counts = identity_counts.sort_values(by="count", ascending=False).reset_index(drop=True)
display(identity_counts.head()) # first 5 (default) most frequent identities
display(identity_counts.tail()) # last 5 (default) least frequent identities

,identity_id,count
0,2820,35
1,3227,35
2,3782,35
3,3699,34
4,3745,34


,identity_id,count
10172,9280,1
10173,9966,1
10174,7778,1
10175,1100,1
10176,8591,1


In [10]:
print("Number of images: {}".format(len(identity)))
print("Number of identities: {}".format(len(identity_counts)))

Number of images: 202599
Number of identities: 10177
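
The head and tail above only show the extremes of the distribution. A short summary of the number of images per identity gives a fuller picture. The sketch below uses a toy DataFrame with the same columns as `identity` (its values are illustrative, not taken from CelebA), and `images_per_identity` is a hypothetical helper name.

```python
import pandas as pd


def images_per_identity(identity):
    """Summary statistics of the number of images per identity."""
    counts = identity["identity_id"].value_counts()
    return counts.describe()


# Toy stand-in for the `identity` DataFrame (illustrative values only).
toy_identity = pd.DataFrame(
    {"identity_id": [1, 1, 1, 2, 2, 3]},
    index=pd.Index([f"{i:06d}.jpg" for i in range(1, 7)], name="image_id"),
)

summary = images_per_identity(toy_identity)
print(summary)
```

On the real data, calling `images_per_identity(identity)` should report the minimum, quartiles and maximum of the per-identity counts in one table.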


#### Attributes

In [5]:
attributes = pd.read_csv('data/Anno/list_attr_celeba.txt', sep=" ", header=1, index_col=0)
attributes.index.name = "image_id"
attributes.head()

image_id,Arched_Eyebrows,Attractive,Bags_Under_Eyes,Bald,Bangs,Big_Lips,Big_Nose,Black_Hair,Blond_Hair,Blurry,...,Smiling,Straight_Hair,Wavy_Hair,Wearing_Earrings,Wearing_Hat,Wearing_Lipstick,Wearing_Necklace,Wearing_Necktie,Young,Unnamed: 40
000001.jpg,-1,1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,1,1,-1,1,-1,1,-1,-1,1
000002.jpg,-1,-1,-1,1,-1,-1,-1,1,-1,-1,...,-1,1,-1,-1,-1,-1,-1,-1,-1,1
000003.jpg,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,...,-1,-1,-1,1,-1,-1,-1,-1,-1,1
000004.jpg,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,...,-1,-1,1,-1,1,-1,1,1,-1,1
000005.jpg,-1,1,1,-1,-1,-1,1,-1,-1,-1,...,-1,-1,-1,-1,-1,-1,1,-1,-1,1
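
The trailing `Unnamed: 40` column in this table hints that the header names may be offset by one column relative to the values, so it is worth checking. The sketch below shows one way to parse such a file, assuming the header row lists only the attribute names (no image-name column). It runs on a synthetic three-attribute sample rather than the real file, and `load_attributes` is a hypothetical helper.

```python
import io

import pandas as pd

# Synthetic sample mimicking the assumed layout: line 1 is the image count,
# line 2 holds the attribute names only, then one row per image.
SAMPLE = (
    "2\n"
    "Bald Smiling Young\n"
    "000001.jpg -1  1  1\n"
    "000002.jpg  1 -1  1\n"
)


def load_attributes(source):
    """Read an attribute file whose header row omits the image-name column."""
    # Data rows have one more field than the header, so pandas uses the
    # first field of each row as the index and keeps names and values aligned.
    df = pd.read_csv(source, sep=r"\s+", header=1)
    df.index.name = "image_id"
    return df


attrs = load_attributes(io.StringIO(SAMPLE))
print(attrs)
```

If the real file follows the same layout, `load_attributes('data/Anno/list_attr_celeba.txt')` should return a table with no `Unnamed` column.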


In [7]:
print("Number of images: {}".format(len(attributes)))
print("Number of attributes: {}".format(len(attributes.columns)))

Number of images: 202599
Number of attributes: 40
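
Each attribute is encoded as `1` (present) or `-1` (absent), so the share of positive labels per attribute shows how balanced each one is. The sketch below computes that share and converts the labels to `0`/`1`. It uses a toy DataFrame with illustrative values, and the helper names `positive_rate` and `to_binary` are hypothetical.

```python
import pandas as pd


def positive_rate(attributes):
    """Fraction of images labelled 1 for each attribute, highest first."""
    return (attributes == 1).mean().sort_values(ascending=False)


def to_binary(attributes):
    """Map the -1/1 encoding to 0/1."""
    return (attributes == 1).astype("int8")


# Toy stand-in for the `attributes` DataFrame (illustrative values only).
toy_attributes = pd.DataFrame(
    {"Smiling": [1, -1, 1, 1], "Bald": [-1, -1, -1, 1]},
    index=pd.Index([f"{i:06d}.jpg" for i in range(1, 5)], name="image_id"),
)

rates = positive_rate(toy_attributes)
print(rates)
print(to_binary(toy_attributes))
```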


## Evaluation
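
As a possible starting point for this section, the sketch below reads a partition file that assigns each image to a split. Several things here are assumptions rather than facts from this notebook: that the `Eval` folder contains a file such as `data/Eval/list_eval_partition.txt`, that each line holds an image name and an integer, and that `0`, `1` and `2` stand for train, validation and test. The example runs on a synthetic sample, and `load_partition` is a hypothetical helper.

```python
import io

import pandas as pd

# Assumed encoding of the partition column.
PARTITION_NAMES = {0: "train", 1: "validation", 2: "test"}


def load_partition(source):
    """Read a two-column partition file and add a readable split name."""
    partition = pd.read_csv(
        source,
        sep=r"\s+",
        header=None,
        names=["image_id", "partition"],
        index_col="image_id",
    )
    partition["split"] = partition["partition"].map(PARTITION_NAMES)
    return partition


# Synthetic sample in the assumed format (not real CelebA rows).
SAMPLE = "000001.jpg 0\n000002.jpg 0\n000003.jpg 1\n000004.jpg 2\n"

partition = load_partition(io.StringIO(SAMPLE))
print(partition["split"].value_counts())
```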

## Images