<a href="https://colab.research.google.com/github/DukeFens/QuickDraw-by-Scratch/blob/main/QuickDrawScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing Data

We train a CNN to recognize five animals from the QuickDraw dataset: cat, dog, dolphin, elephant, and zebra. First, we download the bitmap file for each class:

In [None]:
import os
import numpy as np
import requests

def download_QuickDraw_dataset():
  # Create a folder to store the dataset
  os.makedirs("quickdraw_animals", exist_ok=True)

  # List of animals to download
  animals = ["cat", "dog", "elephant", "zebra", "dolphin"]

  # Base URL for QuickDraw .npy files
  base_url = "https://storage.googleapis.com/quickdraw_dataset/full/numpy_bitmap/"

  for animal in animals:
      file_name = f"{animal}.npy"
      url = base_url + file_name.replace(" ", "%20")
      print(f"Downloading {animal}...")
      response = requests.get(url, timeout=60)
      response.raise_for_status()  # fail early instead of saving an HTTP error page as .npy
      with open(os.path.join("quickdraw_animals", file_name), "wb") as f:
          f.write(response.content)

  print("Download complete!")

In [None]:
def load_data(animal):
  """
  Purpose:
        - load data of the animal by its name
  Params:
        - animal (str): name of class (cat, dog, dolphin, elephant, zebra)
  Returns:
        - data (ndarray): data of the animal
  """
  path = os.path.join("quickdraw_animals", f"{animal}.npy")
  data = np.load(path)
  return data

cat_data = load_data("cat")
print(cat_data.shape)

(123202, 784)


We label each example with a one-hot vector of length 5: the element at the class index is 1 and all others are 0. The class indices follow this order:
- 0 (cat)      → [1, 0, 0, 0, 0]
- 1 (dog)      → [0, 1, 0, 0, 0]
- 2 (dolphin) → [0, 0, 1, 0, 0]
- 3 (elephant)    → [0, 0, 0, 1, 0]
- 4 (zebra)  → [0, 0, 0, 0, 1]
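
The mapping above can be sketched with the rows of an identity matrix. The names `CLASSES` and `one_hot` are chosen here for illustration and are not part of the notebook:

```python
import numpy as np

# Class order assumed to match the list above.
CLASSES = ["cat", "dog", "dolphin", "elephant", "zebra"]

def one_hot(index, num_classes=len(CLASSES)):
    """Return a one-hot vector with a 1 at position `index`."""
    return np.eye(num_classes, dtype=int)[index]

for i, name in enumerate(CLASSES):
    print(i, name, one_hot(i))
```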

In [None]:
def label_data(data, label, X, Y):
    """
    Purpose:
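        - attach labels to one class's data and append both to the running lists
    Params:
        - data (ndarray): unlabeled data of one animal, shape (N, 784)
        - label (ndarray): one-hot label of the animal, shape (5,)
        - X (list): list collecting the reshaped data arrays (modified in place)
        - Y (list): list collecting the label arrays (modified in place)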
    """
    label = np.tile(label, (data.shape[0], 1))  # shape is (N, 5)
    data = data.reshape(-1, 28, 28, 1)

    X.append(data)
    Y.append(label)

def initialize_data():
  """
  Purpose:
        - load, truncate, and label the data of every class
  Returns:
        - X (ndarray): images, shape (N, 28, 28, 1)
        - Y (ndarray): one-hot labels, shape (N, 5)
  """
  X = []
  Y = []
  target = np.array(["cat", "dog", "dolphin", "elephant", "zebra"])

  for animal in target:
    print(f"Processing {animal}...")
    data = load_data(animal)
    data = data[:10000] #limit data
    label = np.array((target == animal).astype(int))

    label_data(data, label, X, Y)
    print(f"Done {animal}!")

  X = np.concatenate(X, axis=0)
  Y = np.concatenate(Y, axis=0)

  return X, Y

In [None]:
X, Y = initialize_data()
print(X.shape, Y.shape)

Processing cat...
Done cat!
Processing dog...
Done dog!
Processing dolphin...
Done dolphin!
Processing elephant...
Done elephant!
Processing zebra...
Done zebra!
(50000, 28, 28, 1) (50000, 5)


Now we split the data into two sets, one for training and one for testing:



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(40000, 28, 28, 1) (10000, 28, 28, 1) (40000, 5) (10000, 5)


# Training Data

We start by building the convolution layer. Before that, let's go over the math behind it. Suppose the input is a 28x28 image and the filter is 3x3:

$$
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,28} \\
x_{21} & x_{22} & \cdots & x_{2,28} \\
\vdots & \vdots & \ddots & \vdots \\
x_{28,1} & x_{28,2} & \cdots & x_{28,28}
\end{bmatrix} \; , \; \; K =
\begin{bmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33}
\end{bmatrix}
$$

We convolve X with K: we slide the kernel over X and, for the patch whose top-left corner is at (m, n), compute:
$$
a_{m,n} = \sum_{i=1}^{3} \sum_{j=1}^{3} x_{m+i-1,\, n+j-1} \, w_{ij}
$$
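
A minimal NumPy sketch of this sliding sum, written with 0-based array indices and no padding. The function name `conv2d_valid` and the toy input are assumptions made for illustration:

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide kernel k over x with stride 1 and no padding."""
    f = k.shape[0]
    out_h = x.shape[0] - f + 1
    out_w = x.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            # Element-wise product of the patch and the kernel, then sum.
            out[m, n] = np.sum(x[m:m + f, n:n + f] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input
k = np.ones((3, 3))                           # toy 3x3 kernel
print(conv2d_valid(x, k))
print(conv2d_valid(np.zeros((28, 28)), k).shape)
```

Without padding, a 28x28 input shrinks to 26x26, which is the size reduction the next step addresses.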

To keep the output the same size as the input, we pad X with a border of zeros before convolving. Adding the bias b to every entry then gives:
$$
Z = X * K + b = \begin{bmatrix}
z_{1,1} & z_{1,2} & \cdots & z_{1,28} \\
\vdots & \vdots & \ddots & \vdots \\
z_{28,1} & z_{28,2} & \cdots & z_{28,28}
\end{bmatrix}, \quad z_{m,n} = a_{m,n} + b \text{ (computed on the padded input)}
$$

We then apply the ReLU function to get the final feature map. In general:

$$
A = \text{ReLU}(Z) = \text{ReLU}(X * K + b) \text{, with stride 1 and padding 1.}
$$
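
As a rough check of the shapes, the following NumPy sketch pads a stand-in image, applies one 3x3 filter with a bias, and then ReLU. The random inputs and the bias value are assumptions, not values from the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((28, 28))          # stand-in for one 28x28 drawing
k = rng.standard_normal((3, 3))   # one 3x3 filter
b = 0.1                           # hypothetical bias value

x_pad = np.pad(x, 1)              # zero border of width 1 -> 30x30
z = np.zeros((28, 28))
for m in range(28):
    for n in range(28):
        z[m, n] = np.sum(x_pad[m:m + 3, n:n + 3] * k) + b

a = np.maximum(z, 0)              # ReLU: negative entries become 0
print(x_pad.shape, z.shape, a.shape)
```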

We keep the strongest responses by applying 2x2 max pooling to A with stride 2, which gives the pooled feature map:
$$
P = \begin{bmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,14} \\
\vdots & \vdots & \ddots & \vdots \\
p_{14,1} & p_{14,2} & \cdots & p_{14,14}
\end{bmatrix}
$$
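
A small NumPy sketch of 2x2 max pooling with stride 2 on a toy matrix. The helper `max_pool_2x2` is a name introduced here, and it assumes even height and width:

```python
import numpy as np

def max_pool_2x2(a):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = a.shape
    # Split into (h/2, 2, w/2, 2) blocks and take the max inside each block.
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [6, 2, 3, 8]])
print(max_pool_2x2(a))
print(max_pool_2x2(np.zeros((28, 28))).shape)
```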

That completes convolution layer 1. The overall design, from the input **X** to the input of the fully connected layer **X_FC**, is:

$$
X \xrightarrow{\text{conv}} Z^{[1]} \xrightarrow{\text{ReLU}} A^{[1]} \xrightarrow{\text{pooling}} P^{[1]} \xrightarrow{\text{conv}} Z^{[2]} \xrightarrow{\text{ReLU}} A^{[2]} \xrightarrow{\text{pooling}} P^{[2]}
\xrightarrow{\text{flatten}} X_{FC}
$$

We have two convolution layers. At each convolution or pooling step, the spatial size changes according to:
$$
m = \left\lfloor \frac{n+2p-f}{s} \right\rfloor + 1
$$
where:
  - $m$: Output size
  - $n$: Input size
  - $p$: Padding size
  - $f$: Filter size or pooling size
  - $s$: Stride size
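
The sizes through the network can be traced with a small helper that uses the standard floor-based form, floor((n + 2p - f) / s) + 1. The function name `output_size` is hypothetical:

```python
def output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution or pooling step."""
    return (n + 2 * p - f) // s + 1

size = 28
size = output_size(size, f=3, p=1, s=1)  # conv 1: 28 -> 28
size = output_size(size, f=2, p=0, s=2)  # pool 1: 28 -> 14
size = output_size(size, f=3, p=1, s=1)  # conv 2: 14 -> 14
size = output_size(size, f=2, p=0, s=2)  # pool 2: 14 -> 7
print(size)
```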

The pooled feature map of layer 2 is then 7x7:
$$
P^{[2]} = \begin{bmatrix}
p_{1,1} & p_{1,2} & \cdots & p_{1,7} \\
\vdots & \vdots & \ddots & \vdots \\
p_{7,1} & p_{7,2} & \cdots & p_{7,7}
\end{bmatrix}
$$

The actual code is more involved: the image has 1 channel (grayscale), and we use **32 filters** in layer 1 and **64 filters** in layer 2 to detect more features. The general formulation then looks like:

$$
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,28} \\
x_{21} & x_{22} & \cdots & x_{2,28} \\
\vdots & \vdots & \ddots & \vdots \\
x_{28,1} & x_{28,2} & \cdots & x_{28,28}
\end{bmatrix} \; , \; \; K_1 =
\begin{bmatrix}
k_1 & k_2 & \cdots & k_{32}
\end{bmatrix}, K_2 =
\begin{bmatrix}
k_1 & k_2 & \cdots & k_{64}
\end{bmatrix}
$$

where each $k_i$ in $K_1$ is a 3x3 kernel (a filter in $K_2$ has one such 3x3 slice per input channel, 32 in total):
$$
k_{i}=\begin{pmatrix}
w_{i11} & w_{i12} & w_{i13} \\
w_{i21} & w_{i22} & w_{i23} \\
w_{i31} & w_{i32} & w_{i33}
\end{pmatrix}
$$

In [None]:
import torch

def initialize_filters(filter_size, num_filters, in_channels=1):
  """
  Purpose:
        - initialize the filters
  Params:
        - filter_size (int): size of the filter
        - num_filters (int): number of filters
        - in_channels (int): number of channels of the layer's input
  Returns:
        - filters (dict):
          • "K": kernels, shape (num_filters, in_channels, filter_size, filter_size) (tensor)
          • "b": biases, shape (num_filters,) (tensor)
  """
  filters = {}
  # F.conv2d expects weights shaped (out_channels, in_channels, kH, kW)
  filters["K"] = torch.randn(num_filters, in_channels, filter_size, filter_size, requires_grad=True)
  filters["b"] = torch.randn(num_filters, requires_grad=True)
  return filters

In [None]:
layer1_filters = initialize_filters(3, 32)
layer2_filters = initialize_filters(3, 64, in_channels=32)  # layer 1 outputs 32 channels
print("X =" , X[0].shape)
print("K_1 =", layer1_filters["K"].shape[0])
print("K_2 =", layer2_filters["K"].shape[0])
print("k_i =", layer1_filters["K"][0, 0].shape)

X = (28, 28, 1)
K_1 = 32
K_2 = 64
k_i = torch.Size([3, 3])


Then, for each filter $k_i$ in $K_1$ with its own bias $b_i$, we compute:

$$
\begin{align*}
z_i &= X * k_i + b_i \\
a_i &= \text{ReLU}(z_i) \\
p_i &= \text{pooling}(a_i)
\end{align*}
$$

Therefore $P^{[1]}$ has shape (32, 14, 14), since the channel dimension is replaced by the number of filters. By the same calculation, $P^{[2]}$ has shape (64, 7, 7).


Here "*" denotes the convolution operation.

In [None]:
import torch.nn.functional as F
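def convolution(X, filters):
  # X must be (N, C, H, W); stride 1 and padding 1 keep the spatial size for 3x3 kernels
  return F.conv2d(X, filters["K"], filters["b"], stride=1, padding=1)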

def pooling(X):
  return F.max_pool2d(X, 2, 2)

def relu(X):
  return F.relu(X)

def forward_propagation_convolution_layer(X, filter, func):
  """
    Purpose:
        - forward propagation for convolution layer
    Params:
        - X (tensor): input of the layer, shape (N, C, H, W)
        - filter (dict):
          • "K": kernels of the layer (tensor)
          • "b": biases of the layer (tensor)
        - func (function): activation function
    Returns:
        - P (tensor): output of the layer
  """
  Z = convolution(X, filter)
  A = func(Z)
  P = pooling(A)
  return P
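
The two convolution layers can be traced end to end as a shape check. This is a sketch under assumptions: the batch is random stand-in data in the notebook's (N, 28, 28, 1) layout, the variable names are illustrative, and the weights use the (out_channels, in_channels, kH, kW) shape that `F.conv2d` expects:

```python
import torch
import torch.nn.functional as F

# Stand-in batch of 4 images; random values, not real drawings.
x_nhwc = torch.rand(4, 28, 28, 1)
x = x_nhwc.permute(0, 3, 1, 2)      # -> (N, 1, 28, 28), channels first

# Layer 1 sees 1 input channel, layer 2 sees the 32 channels of layer 1.
k1, b1 = torch.randn(32, 1, 3, 3), torch.randn(32)
k2, b2 = torch.randn(64, 32, 3, 3), torch.randn(64)

p1 = F.max_pool2d(F.relu(F.conv2d(x, k1, b1, padding=1)), 2, 2)
p2 = F.max_pool2d(F.relu(F.conv2d(p1, k2, b2, padding=1)), 2, 2)
x_fc = p2.flatten(start_dim=1)      # one row of features per example
print(p1.shape, p2.shape, x_fc.shape)
```

A NumPy batch such as `X_train` would first need converting, for example with `torch.from_numpy(X_train).float()`, before the `permute` step.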

We flatten the output to get the input of the fully connected layer: each example has **64x7x7=3136** features. Suppose we have n examples and 128 units in dense layer 1; then:
$$
X_{FC} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1,3136} \\
x_{21} & x_{22} & \cdots & x_{2,3136} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,3136}
\end{bmatrix} \; , \; \; W^{[1]} =
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1,128} \\
w_{21} & w_{22} & \cdots & w_{2,128} \\
\vdots & \vdots & \ddots & \vdots \\
w_{3136,1} & w_{3136,2} & \cdots & w_{3136,128}
\end{bmatrix}
$$

In [None]:
def initial_parameters(layer_dims):
  """
  Purpose:
        - initialize the parameters
  Params:
        - layer_dims (list): number of units in each layer (including the input layer)
  Returns:
        - parameters (dict):
          • "W1", "W2", ...: weights of each layer (tensor)
          • "b1", "b2", ...: biases of each layer (tensor)
  """
  parameters = {}
  L = len(layer_dims)
  for l in range(1, L):
    parameters["W" + str(l)] = torch.randn(layer_dims[l-1], layer_dims[l], requires_grad=True)
    parameters["b" + str(l)] = torch.randn(layer_dims[l], requires_grad=True)
  return parameters

$$
\begin{align*}
\therefore Z^{[1]} &= X_{FC}W^{[1]} + b^{[1]} \\
\therefore A^{[1]} &= \text{ReLU}(Z^{[1]})
\end{align*}
$$
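
A shape-level NumPy sketch of this dense layer. The batch size, the random inputs, and the 0.01 weight scale are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                          # hypothetical batch size
x_fc = rng.random((n, 3136))                   # flattened features
w1 = rng.standard_normal((3136, 128)) * 0.01   # small scale chosen for illustration
b1 = np.zeros(128)

z1 = x_fc @ w1 + b1       # (n, 3136) @ (3136, 128) -> (n, 128); b1 broadcasts over rows
a1 = np.maximum(z1, 0)    # ReLU
print(z1.shape, a1.shape)
```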

For layer 2, we use the softmax function:

$$
\text{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$
- $K$: number of classes

Then:
$$
A^{[2]} = \text{Softmax}(Z^{[2]}) = \text{Softmax}(A^{[1]}W^{[2]} + b^{[2]})
$$
where:
- $W^{[2]}$: has 5 columns, one unit per output class.
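
A minimal NumPy sketch of a row-wise softmax. Subtracting each row's maximum before exponentiating is a common numerical-stability step added here, not something the notebook specifies, and the scores in `z2` are made up:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; shifting by the row max avoids overflow without changing the result."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two made-up rows of scores for the 5 classes.
z2 = np.array([[2.0, 1.0, 0.1, -1.0, 0.5],
               [0.0, 0.0, 0.0, 0.0, 0.0]])
a2 = softmax(z2)
print(a2.sum(axis=1))      # each row sums to 1
print(a2.argmax(axis=1))   # predicted class index per row
```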
