# A Real Dataset
- Fashion MNIST dataset
    - Training samples = 60,000 
    - Testing samples = 10,000
    - Sample size = 28x28 px
    - #classes = 10, each being a different clothing item
    - Training samples per class = 6,000, so the dataset is balanced
- Common preprocessing:
    - Grayscale (go from 3-channel RGB values per pixel to a single black-to-white value in the range 0-255 per px)
    - Resize images to normalize their dimensions
    
    
- Notes
    - If the dataset isn't already balanced, the NN $\rightarrow$ biased to predict the class containing the most images. Gradient descent follows the steepest local decrease in loss, and always predicting the majority class is a quick way to lower it, so the model may settle into a local minimum and fail to find the global loss minimum.
    - Possible fixes:
        - Trim samples from high-frequency classes in the dataset
        - Use class weights (needs to be validated in practice)
        - Augment samples of under-represented classes: crop, rotate, flip, etc.
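
A minimal NumPy sketch of two of the fixes above (class weights and trimming), run on a toy label vector rather than the real dataset. The helper names `inverse_frequency_weights` and `undersample`, and the weighting formula `n_samples / (n_classes * class_count)`, are assumptions chosen for illustration and not part of these notes; whether either one helps still has to be validated in practice.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def undersample(labels, seed=0):
    """Return sorted indices that keep the same number of samples for every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    keep = counts.min()  # size of the rarest class
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(picked))

# Toy imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
weights = inverse_frequency_weights(toy_labels)
balanced_idx = undersample(toy_labels)

print(weights)                   # rarer classes get larger weights
print(toy_labels[balanced_idx])  # one sample left per class
```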
---
## Data prep
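- Get the data from the nnfs.io site
- Variable names in ALL CAPS mark CONSTANTS by Python convention (PEP 8). Python does not enforce this: the names can still be reassigned, so they are not actually immutable
- Installed from PyPI w/ pip (bc they throw errors w/ conda installation); check versions w/ `conda list`: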
    - matplotlib=3.5.2
    - numpy=1.21.6
    - opencv-python=4.5.5.64
    - pillow=9.1.0
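
The installed versions can also be checked from inside Python. This is a small sketch using the standard-library `importlib.metadata` (Python 3.8+); the helper name `installed_versions` is made up for illustration, and the package names are the PyPI distribution names from the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names, taken from the list above
PACKAGES = ['matplotlib', 'numpy', 'opencv-python', 'pillow']

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f'{name}: {ver if ver is not None else "not installed"}')
```

Comparing the printed versions against the pinned ones is a quick sanity check before running the cells below.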

In [1]:
from zipfile import ZipFile
import os
import urllib.request
import cv2

import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Download the compressed data (if the file is absent under the given path)
# using urllib, a standard Python library

URL = 'https://nnfs.io/datasets/fashion_mnist_images.zip'
FILE = 'fashion_mnist_images.zip'
FOLDER = 'fashion_mnist_images'


# Download the archive only if it is not already present
if not os.path.isfile(FILE):
    print(f'Downloading {URL} and saving as {FILE}...')
    urllib.request.urlretrieve(URL, FILE)

# Unzip files
print('Unzipping images...')

# `with` opens the archive and closes it automatically when the block ends
with ZipFile(FILE) as zip_images:
    zip_images.extractall(FOLDER)
print('Done!')
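
Once the archive is extracted, the balance claim (6,000 training images per class) can be checked by counting the files in each class folder. This is a sketch only: the helper name `count_images_per_class` is hypothetical, and it assumes the `fashion_mnist_images/train/<label>/` layout used in the Data loading cell.

```python
import os

def count_images_per_class(split_dir):
    """Return {label: number of files} for each class sub-folder of split_dir."""
    counts = {}
    # os.listdir returns entries in arbitrary order, so sort for stable output
    for label in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, label)
        if os.path.isdir(class_dir):
            counts[label] = len(os.listdir(class_dir))
    return counts

TRAIN_DIR = os.path.join('fashion_mnist_images', 'train')

# Only run the check if the dataset has actually been extracted
if os.path.isdir(TRAIN_DIR):
    for label, n_files in count_images_per_class(TRAIN_DIR).items():
        print(label, n_files)
```

If the counts differ noticeably between labels, the balancing notes above apply.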

## Data loading

In [2]:
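# Each sub-folder of train/ is named after a class label; os.listdir returns them in arbitrary order
labels = os.listdir('fashion_mnist_images/train')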
print('labels : ', labels)

files = os.listdir('fashion_mnist_images/train/0')
# print('\ntotal #imgs in 0', len(files), '\nall imgs in 0 : \n', files)
# print('\nset of imgs in 0 : ', files[:10])

# Default flag (cv2.IMREAD_COLOR) loads the image as 3-channel BGR, shape (28, 28, 3)
image_data1 = cv2.imread('fashion_mnist_images/train/7/0002.png')
# print("image_data1 : ", image_data1)

# cv2.IMREAD_UNCHANGED tells cv2 to read the image in the same format
# it was saved in (grayscale in this case)
image_data2 = cv2.imread('fashion_mnist_images/train/7/0002.png', cv2.IMREAD_UNCHANGED)

# Widen the print line so each 28-value row fits on one line without wrapping
np.set_printoptions(linewidth=200)
print("image_data2 : ", np.shape(image_data2), type(image_data2), "\n", image_data2) 


labels :  ['9', '0', '7', '6', '1', '8', '4', '3', '2', '5']
image_data2 :  (28, 28) <class 'numpy.ndarray'> 
 [[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0  49 135 182 15