# Dataset stats

This notebook calculates the per-channel mean and standard deviation of a given dataset.

### Notebook Contents:
1. Imports.
2. Loading the dataset.
3. Calculating the mean and standard deviation.
4. Saving the mean and standard deviation into a .yaml file.

## 1. Imports:

In [2]:
# PyTorch imports:
import torch
from torchvision.datasets import ImageFolder
import torchvision.transforms as T

# Dataset imports:
from dataset import DigitDataset

# Other imports:
import os
import yaml
from tqdm import tqdm
from pathlib import Path


## 2. Loading the dataset:

#### a) Getting the path of the dataset:

In [5]:
# Getting the path of the .yaml file that contains the path to the dataset:
yaml_file_path = Path().resolve().parent / "Dataset" / "dataset_path.yaml"

# Getting the dataset path from the .yaml file:
with open(yaml_file_path, 'r') as file:
    dataset_path = yaml.safe_load(file)['train']

print(f'Dataset path: {dataset_path}')


Dataset path: D:\Datasets\Kaggle\DigitRecognizer\train.csv


#### b) Creating a dataset object:

In [9]:
dataset = DigitDataset(transforms=[T.ToTensor()])
dataset


<dataset.DigitDataset at 0x1c05d0e5280>

## 3. Calculating the mean and standard deviation:

For each channel, let $x_i$ be the pixel values and $n$ the total number of pixels across all images. The mean is calculated as follows:
$$\text{Mean} \space (\mu) = \frac{\sum_{i=1}^{n} x_i}{n}$$

And the (population) variance is calculated as follows:
$$\text{Variance} \space (\sigma^2) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2 = \frac{1}{n} \left(\sum_{i=1}^{n} x_i^2\right) - \mu^2$$


Finally, the standard deviation is the square root of the variance:

$$\text{Standard Deviation} \space (\sigma) = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$$
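
As a quick sanity check of the formulas above, here is a minimal pure-Python sketch. The `pixels` list is a hypothetical set of values chosen only for illustration and is not taken from the dataset. It accumulates the sum and the sum of squares in one pass, then compares the result with `statistics.pstdev`, which also divides by $n$ rather than $n - 1$.

```python
import math
import statistics

# Hypothetical pixel values in [0, 1], used only for illustration.
pixels = [0.0, 0.25, 0.5, 1.0]
n = len(pixels)

# One pass over the data: the sum of x_i and the sum of x_i^2.
total = sum(pixels)
total_sq = sum(x * x for x in pixels)

mean = total / n
variance = total_sq / n - mean ** 2
std = math.sqrt(variance)

# Reference value: population standard deviation (divides by n).
reference_std = statistics.pstdev(pixels)

print(f'Mean: {mean}')
print(f'Standard Deviation: {std} (reference: {reference_std})')
```

The two standard deviations should agree up to floating-point rounding. The cell below applies the same accumulation per channel over the whole dataset.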

In [10]:
pixel_sum = torch.tensor([0.0, 0.0, 0.0])
pixel_sum_sq = torch.tensor([0.0, 0.0, 0.0])

pixel_count = 0

for im, _ in tqdm(dataset):
    # Calculating the number of pixels this way since the images
    # have different sizes.
    pixel_count += im.shape[1] * im.shape[2]

    # Summing the pixels of each channel and accumulating them:
    pixel_sum += im.sum(axis=(1, 2))

    # Accumulating the sum of the squared pixel values of each channel:
    pixel_sum_sq += (im**2).sum(axis=(1, 2))


100%|██████████| 42000/42000 [00:04<00:00, 10191.70it/s]


In [11]:
mean = pixel_sum / pixel_count

# Population variance: the mean of the squares minus the square of the mean.
variance = (pixel_sum_sq / pixel_count) - mean**2
std = torch.sqrt(variance)

print(f'Mean: {mean.tolist()}\nStandard Deviation: {std.tolist()}')


Mean: [0.1310141682624817, 0.1310141682624817, 0.1310141682624817]
Standard Deviation: [0.30854013562202454, 0.30854013562202454, 0.30854013562202454]


## 4. Saving the mean and standard deviation into a .yaml file:

In [13]:
# Creating a dict that holds the mean and standard deviation:
stats_dict = {'mean': mean.tolist(), 'std': std.tolist()}

# Saving the stats_dict into a .yaml file:
file_path = Path().resolve() / "dataset_stats.yaml"

with open(file_path, 'w') as file:
    yaml.dump(stats_dict, file)
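
To confirm that the saved file can be read back unchanged, the following sketch writes the same structure to a temporary directory and loads it with `yaml.safe_load`. The values are copied from the output printed above. Using a temporary directory is an assumption made here so that the real `dataset_stats.yaml` is left untouched.

```python
import tempfile
from pathlib import Path

import yaml

# Values copied from the output printed earlier in this notebook.
stats_dict = {
    'mean': [0.1310141682624817, 0.1310141682624817, 0.1310141682624817],
    'std': [0.30854013562202454, 0.30854013562202454, 0.30854013562202454],
}

# Writing to a temporary directory so the real stats file is not overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file_path = Path(tmp_dir) / "dataset_stats.yaml"

    with open(tmp_file_path, 'w') as file:
        yaml.dump(stats_dict, file)

    # Reading the stats back as plain Python lists of floats:
    with open(tmp_file_path, 'r') as file:
        loaded_stats = yaml.safe_load(file)

print(loaded_stats)
```

Because `tolist()` produces plain Python floats, the file holds ordinary YAML lists that any later script can load without PyTorch.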
