##### What is CCT, and why do we use it?

**CCT** stands for **Compact Convolutional Transformer**. It is an architecture for **Computer Vision** problems that combines the local feature extraction of **convolutional neural networks (CNNs)** with the global attention of **transformers**.

---

### Key Features of CCT:
- A **Vision Transformer (ViT)** splits the image into fixed-size patches and embeds each patch as a token. **CCT** instead starts with **convolutional layers**.
  - These layers extract important features from images, such as **edges** and **textures**.
  - This gives a better image representation and lets the model capture **local patterns** in the image.
  
- After the convolutional layers, **pooling** downsamples the feature maps, which are then flattened (reshaped) into a sequence of tokens.
  - This sequence is the input to the **transformer model**.
  
- **Positional embedding** is usually needed for patches in **ViT**, but in **CCT**, positional embedding is optional.
  - This is because the convolutional layers use overlapping windows, so the resulting tokens already carry local spatial information for the transformer encoder.
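
The tokenization step described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the class name `ConvTokenizer`, the single conv layer, the kernel sizes, and the embedding size of 64 are all assumptions chosen for a 28×28 grayscale input.

```python
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """Hypothetical tokenizer: conv -> ReLU -> max-pool -> flatten to a sequence."""

    def __init__(self, in_channels=1, embed_dim=64):
        super().__init__()
        # The 3x3 convolution keeps the spatial size (padding=1) and
        # produces `embed_dim` feature maps.
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.act = nn.ReLU()
        # Max-pooling with stride 2 halves height and width.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        x = self.pool(self.act(self.conv(x)))  # (B, embed_dim, H', W')
        # Flatten the spatial grid and move channels last:
        # every spatial location becomes one token.
        return x.flatten(2).transpose(1, 2)    # (B, H' * W', embed_dim)


tokenizer = ConvTokenizer()
images = torch.randn(8, 1, 28, 28)  # a fake batch of grayscale images
tokens = tokenizer(images)
print(tokens.shape)                 # torch.Size([8, 196, 64])
```

With these settings a 28×28 image becomes a 14×14 grid, so the transformer receives 196 tokens of dimension 64 per image.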
  
---

### Transformer Encoder in CCT:

- The **Transformer Encoder** processes the sequence of image tokens (feature vectors that each describe a small region of the image) and learns how they relate to each other.
- It captures both **local** and **global** patterns in the image.
- The **self-attention** mechanism allows each token to look at every other token in the image and identify the important relationships.
- This helps the model understand how different parts of the image, such as edges and textures, interact.
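
To make the points above concrete, here is a small sketch using PyTorch's built-in attention and encoder modules. The sizes (196 tokens, embedding dimension 64, 4 heads, 2 layers) are assumptions for illustration only and are not taken from a specific CCT configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 64, 4
tokens = torch.randn(8, 196, embed_dim)  # (batch, sequence length, embedding dim)

# Self-attention on its own: each token produces one weight for every token
# in the sequence, which is what lets it "look at" the whole image.
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
_, attn_weights = attention(tokens, tokens, tokens)
print(attn_weights.shape)  # torch.Size([8, 196, 196])

# A small encoder stack: self-attention followed by a feed-forward block,
# repeated `num_layers` times. The sequence shape is preserved.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=num_heads, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
encoded = encoder(tokens)
print(encoded.shape)       # torch.Size([8, 196, 64])
```

Each row of `attn_weights` sums to 1, so it can be read as how strongly one token attends to every other token.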
  
---

### Sequence Pooling in CCT:

- After the transformer encoder processes the tokens, **Sequence Pooling** gathers information from all tokens and creates a **single, condensed representation** of the entire image.
- The pooled representation is a weighted sum of all tokens, where the weights are learned (a softmax over a per-token score), and it is used for classification. A plain average is the special case where every token gets the same weight.
- This step summarizes the whole image into **one vector**, so the classifier no longer needs to look at individual tokens.
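
A minimal sketch of sequence pooling is shown below. It assumes the learned, attention-style weighting (a softmax over one score per token), with plain mean pooling included for comparison. The class name `SeqPool` and the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SeqPool(nn.Module):
    """Hypothetical sequence pooling: a learned weighted sum over tokens."""

    def __init__(self, embed_dim):
        super().__init__()
        # One scalar importance score per token.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, x):                                # x: (B, N, D)
        weights = torch.softmax(self.score(x), dim=1)    # (B, N, 1), sums to 1 over N
        # (B, 1, N) @ (B, N, D) -> (B, 1, D), then drop the middle axis.
        return (weights.transpose(1, 2) @ x).squeeze(1)  # (B, D)


encoded = torch.randn(8, 196, 64)  # stand-in for the encoder output
pooled = SeqPool(embed_dim=64)(encoded)
mean_pooled = encoded.mean(dim=1)  # simple average, for comparison
print(pooled.shape, mean_pooled.shape)
```

Both variants reduce 196 tokens to a single 64-dimensional vector per image. The learned version lets the model decide which tokens matter most.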

---

### MLP Head in CCT:

- The **MLP Head** (Multilayer Perceptron) is the final part of the model.
  - It takes the pooled representation from sequence pooling and predicts the class of the image (e.g., t-shirt, shoe, etc.).
  - The MLP Head is a simple, fully connected neural network made up of one or more layers of neurons.
  - These neurons process the pooled image representation and output the final prediction, like a 90% chance that the image is a "t-shirt".
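
As a rough sketch of this last step, the snippet below maps a pooled vector to class probabilities. The hidden size of 128 and the two-layer layout are assumptions for illustration. The class names are the ten Fashion-MNIST labels.

```python
import torch
import torch.nn as nn

CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

embed_dim = 64
# Illustrative head: one hidden layer, then one output per class.
mlp_head = nn.Sequential(
    nn.Linear(embed_dim, 128),
    nn.GELU(),
    nn.Linear(128, len(CLASS_NAMES)),
)

pooled = torch.randn(8, embed_dim)    # stand-in for the sequence-pooling output
logits = mlp_head(pooled)             # (8, 10) raw class scores
probs = torch.softmax(logits, dim=1)  # each row sums to 1
predicted = probs.argmax(dim=1)       # index of the most likely class

# The head is untrained here, so the prediction is arbitrary.
print(CLASS_NAMES[predicted[0].item()], round(probs[0].max().item(), 3))
```

During training, the raw `logits` are usually passed straight to `nn.CrossEntropyLoss`, which applies the softmax internally.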
  
---

### Summary:
- **CCT** is particularly effective for **small datasets**, as it uses convolution to extract local features and combines it with transformers to understand **global relationships** within the image.
- This gives it a strong ability to **recognize complex visual patterns**.
- **ViT**, on the other hand, typically needs **large datasets** (or large-scale pre-training) to perform well.


![alt text](model_sym.png)

![alt text](comparison.png)

##### Now that we have a grip on the CCT basics, we are going to implement CCT on the Fashion-MNIST dataset.

- First, we import the necessary libraries and load the dataset.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import matplotlib.pyplot as plt