# Exploratory Data Analysis of CIFAR-10
In this notebook, I will perform an exploratory data analysis of the CIFAR-10 dataset.

## Objectives

The objectives of my EDA are to answer the following questions:
- What is the size of the dataset?
- How many classes are in the dataset?
- What is the resolution of images in the dataset?
- How many images are there per class?
- What are the statistics of the images per class? (e.g. mean pixel value)
- What are the ground truth KID, FID, and IS scores of the data?
- What do images of each class look like in the dataset?
- What is the diversity of images in the dataset?

## Google Colab Setup

In [None]:
! mkdir -p /root/.ssh
import os

# Read the deploy key from the environment rather than hard-coding it in the notebook.
# Assumption: the key is supplied through a DEPLOY_KEY environment variable (hypothetical name).
with open("/root/.ssh/id_rsa", mode="w") as fp: # Repository Deploy Key
    fp.write(os.environ["DEPLOY_KEY"])
! ssh-keyscan -t rsa github.com >> ~/.ssh/known_hosts
! chmod go-rwx /root/.ssh/id_rsa
! git clone git@github.com:Tien-Cheng/dele-generative-adversarial-networks.git
%cd /content/dele-generative-adversarial-networks

In [None]:
%%capture
%pip install -U torch-fidelity wandb pytorch-lightning

## Setup

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch_fidelity
from utils.visualize import visualize
from data.dataset import CIFAR10DataModule

## Data
### CIFAR-10

CIFAR-10 is a labelled subset of the 80 Million Tiny Images dataset. It consists of 60,000 32x32 color images in 10 classes:

- airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

There are 6,000 images per class. The data is split into 50,000 training images and 10,000 test images. CIFAR-10 is a common benchmark for evaluating computer vision models. It is also widely used to benchmark GANs: GAN training typically takes a long time, and a small, low-resolution dataset like CIFAR-10 is cheaper to train on, so GAN models can be evaluated more easily.
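As a quick sanity check on the figures above, the following sketch recomputes the totals from the per-class count and the image shape. The constant names are illustrative, and the size estimate assumes each pixel value is stored as a single byte (uint8).

```python
# Figures quoted in the text above; the constant names are illustrative.
CIFAR10_CLASSES = (
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
)
IMAGES_PER_CLASS = 6_000
TRAIN_SIZE = 50_000
TEST_SIZE = 10_000
HEIGHT, WIDTH, CHANNELS = 32, 32, 3  # 32x32 color (RGB) images

# Total number of images implied by the per-class count.
total_images = len(CIFAR10_CLASSES) * IMAGES_PER_CLASS

# Number of scalar values stored per image.
values_per_image = HEIGHT * WIDTH * CHANNELS

# Uncompressed size, assuming 1 byte per value (uint8).
raw_size_mib = total_images * values_per_image / 2**20

print(f"{total_images} images, {values_per_image} values per image")
print(f"approx. {raw_size_mib:.1f} MiB uncompressed")
```

The per-class total matches the sum of the two splits (50,000 + 10,000), so the numbers in the description are internally consistent.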

In [None]:
dm = CIFAR10DataModule(
    data_dir="./data",
    batch_size=64,
    num_workers=2,
)

In [None]:
dm.prepare_data()
dm.setup()

## EDA

### Basic Dataset Details

In [None]:
train_len = len(dm.cifar_train)
test_len = len(dm.cifar_test)

print(f"The size of the training split is {train_len}, the size of the test split is {test_len}")
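The next question in the objectives, how many images there are per class, could be answered along the following lines. This is a minimal sketch: `class_counts` is a hypothetical helper, and the labels used here are a tiny synthetic stand-in rather than the real CIFAR-10 labels.

```python
from collections import Counter


def class_counts(labels, class_names):
    """Count how many labels fall into each class.

    `labels` is an iterable of integer class indices and
    `class_names` maps each index to a readable name.
    """
    counts = Counter(labels)
    # Report every class, including those with zero samples.
    return {name: counts.get(idx, 0) for idx, name in enumerate(class_names)}


# Tiny synthetic example (not real CIFAR-10 labels).
toy_names = ["airplane", "automobile", "bird"]
toy_labels = [0, 2, 1, 0, 2, 2]
toy_counts = class_counts(toy_labels, toy_names)
print(toy_counts)
```

If `dm.cifar_train` is a torchvision-style CIFAR-10 dataset, the integer labels would typically be available as a `targets` attribute, but that is an assumption about how `CIFAR10DataModule` is implemented.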