# Exploring the rank of trained Neural Networks

In this final assignment, you're going to explore trained neural networks and study the rank of their weight matrices.

**Reminder**: The rank is the number of linearly independent columns of the matrix. If a matrix $A \in \mathbb{R}^{n\times m}$ has rank $k$, then $A$ can be factored as

$$A = B \cdot C$$

where $B \in \mathbb{R}^{n\times k}$ and $C \in \mathbb{R}^{k\times m}$. If $A$ is only close to a rank-$k$ matrix, the same product gives an approximation, $A \approx B \cdot C$.

You can find the rank of matrix $A$ by performing Gaussian elimination and counting the number of pivots. This can be done in a few lines of `numpy` code.
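As a minimal sketch of the pivot-counting idea above: the helper name `rank_by_elimination` and the tolerance `tol` are assumptions made for illustration. Floating-point matrices rarely produce exact zeros, so a pivot is only counted when it exceeds `tol`. The result is compared with `np.linalg.matrix_rank`, which uses the SVD rather than elimination.

```python
import numpy as np


def rank_by_elimination(A, tol=1e-10):
    """Count the pivots found by Gaussian elimination with partial pivoting."""
    M = np.array(A, dtype=float)  # work on a copy
    n_rows, n_cols = M.shape
    rank = 0  # also the index of the next pivot row
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = rank + np.argmax(np.abs(M[rank:, col]))
        if abs(M[pivot, col]) <= tol:
            continue  # no pivot in this column
        M[[rank, pivot]] = M[[pivot, rank]]  # swap rows
        # Eliminate the entries below the pivot.
        factors = M[rank + 1:, col] / M[rank, col]
        M[rank + 1:] -= np.outer(factors, M[rank])
        rank += 1
    return rank


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))  # rank 2 by construction
print(rank_by_elimination(A), np.linalg.matrix_rank(A))
```

For trained weight matrices the answer depends strongly on the tolerance, so it is worth reporting which threshold was used.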

**References**:
- https://arxiv.org/pdf/1804.08838
- https://arxiv.org/pdf/2209.13569
- https://arxiv.org/pdf/2012.13255

Note: The references above are not needed to complete this notebook, but reading them might give you additional insights.

## Important

1. For every training run, plot the loss and the accuracy at each epoch.

    - You can either use TensorBoard or make a static matplotlib plot.

2. Don't add biases to the layers in the network; they are not important for this notebook.
3. There is no need to use Dropout or BatchNorm in the network.
4. Remember to use GPUs during training.
5. Always test your hypotheses on both the training and test sets; the results are sometimes surprising.

## Task 1: Downloading MNIST and Dataloaders

Download the MNIST dataset, split it into training and test sets, and create dataloaders for both.

Link: https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html
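One possible shape for this step is sketched below. The helper names `load_mnist` and `make_loaders`, the `./data` directory, and the batch size of 128 are assumptions, not requirements of the assignment. The stand-in tensors at the bottom only mimic MNIST's shapes so that the loader helper can be tried without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def load_mnist(root="./data"):
    """Download MNIST and return (train_set, test_set)."""
    # Imported here so the loader helper below can be reused without torchvision.
    from torchvision import datasets, transforms

    to_tensor = transforms.ToTensor()  # scales pixels to [0, 1]
    train_set = datasets.MNIST(root, train=True, download=True, transform=to_tensor)
    test_set = datasets.MNIST(root, train=False, download=True, transform=to_tensor)
    return train_set, test_set


def make_loaders(train_set, test_set, batch_size=128):
    """Shuffle only the training data; keep the test order fixed."""
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader


# Stand-in data with MNIST's shapes (hypothetical, for a quick check only).
fake_train = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
fake_test = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
train_loader, test_loader = make_loaders(fake_train, fake_test)
```

With the real data, the call would be `make_loaders(*load_mnist())`.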

## Task 2: Train a neural network

Build a simple multi-layer perceptron (MLP) with ReLU activations, and train it on MNIST until it reaches at least 95% accuracy.
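A bias-free MLP along these lines could look like the sketch below. The class name `MLP`, the hidden sizes `(256, 128)`, and the `accuracy` helper are assumptions chosen for illustration; the activation is a constructor argument so that the same class can be reused when comparing activation functions later.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Bias-free multi-layer perceptron for flattened 28x28 images."""

    def __init__(self, hidden_sizes=(256, 128), activation=nn.ReLU):
        super().__init__()
        sizes = [28 * 28, *hidden_sizes]
        layers = [nn.Flatten()]
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(n_in, n_out, bias=False), activation()]
        layers.append(nn.Linear(sizes[-1], 10, bias=False))  # 10 digit classes
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified samples in `loader`."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total


model = MLP()
logits = model(torch.rand(4, 1, 28, 28))  # one row of 10 logits per image
```

Calling `accuracy` on both the training and the test loader after each epoch gives the two curves to plot.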


## Task 3: Analyze the rank of the matrices in this network

Perform experiments and answer the following questions:
- What is the average rank of the weight matrices across all layers?
- How does the rank change (does it increase?) as we go to deeper layers?
- Try the same MLP, but change the activation function to others ($\tanh, \sigma, \dots$). Do the answers change?
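Trained weights are almost never exactly rank-deficient, so one way to make these questions measurable is a numerical rank: the number of singular values above a relative threshold. The sketch below assumes that definition; the names `numerical_rank` and `rank_report` and the default `rel_tol=1e-3` are illustrative choices, and the matrices are stand-ins rather than trained weights.

```python
import numpy as np


def numerical_rank(W, rel_tol=1e-3):
    """Number of singular values above rel_tol times the largest one."""
    s = np.linalg.svd(W, compute_uv=False)
    if s.size == 0 or s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


def rank_report(weights, rel_tol=1e-3):
    """Per-layer ranks, the ranks as a fraction of full rank, and the mean rank."""
    ranks = [numerical_rank(W, rel_tol) for W in weights]
    fractions = [r / min(W.shape) for r, W in zip(ranks, weights)]
    return ranks, fractions, float(np.mean(ranks))


# Stand-in matrices; with a trained PyTorch model one could pass
# [p.detach().cpu().numpy() for p in model.parameters()] instead.
rng = np.random.default_rng(0)
weights = [
    rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32)),  # rank 4 by construction
    rng.normal(size=(10, 64)),                            # full rank (10)
]
ranks, fractions, mean_rank = rank_report(weights)
```

Because layers have different shapes, the fraction of full rank is often easier to compare across depth than the raw rank.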

## Task 4: Overfit by scaling the MLP

1. Create a bigger network and train it on MNIST to the point of overfitting.
2. Now check the rank of the matrices in this network, and answer the same questions as in Task 3.

## Task 5: Approximate low-rank

Some of the references given at the beginning suggest that trained neural networks have low intrinsic dimensionality (that is, their weight matrices are approximately low-rank).

In this task, take the overparameterized network trained in Task 4 and try to approximate each layer's weight matrix with a product of two low-rank matrices.

This means that if a layer has a matrix $A \in\mathbb{R}^{n\times m}$, you should find two matrices $B \in \mathbb{R}^{n\times r}$ and $C \in \mathbb{R}^{r\times m}$ so that $\lVert A - B\cdot C\rVert_F$ is minimized, where $\lVert \cdot\rVert_F$ denotes the Frobenius norm. You can use a different norm if you think it makes sense. To learn $B$ and $C$, you can use a gradient-descent-style algorithm that alternates between updating $B$ and $C$ at each optimization step.
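The alternating scheme described above might be sketched as follows, minimizing $\tfrac{1}{2}\lVert A - B\cdot C\rVert_F^2$. The function name `factorize`, the small random initialization, and the step sizes are assumptions: each step size is the inverse of the largest eigenvalue of $CC^\top$ (or $B^\top B$), which keeps every partial update from increasing the error. Other optimizers or step-size rules are equally valid.

```python
import numpy as np


def factorize(A, r, n_steps=500, seed=0):
    """Fit A ~ B @ C by alternating gradient steps on 0.5 * ||A - B @ C||_F^2."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    B = 0.1 * rng.normal(size=(n, r))
    C = 0.1 * rng.normal(size=(r, m))
    errors = [np.linalg.norm(A - B @ C)]  # Frobenius norm by default
    for _ in range(n_steps):
        # Gradient step on B with C fixed; the gradient is -(A - B C) C^T.
        step_B = 1.0 / (np.linalg.norm(C @ C.T, 2) + 1e-12)
        B = B + step_B * (A - B @ C) @ C.T
        # Gradient step on C with the updated B fixed; the gradient is -B^T (A - B C).
        step_C = 1.0 / (np.linalg.norm(B.T @ B, 2) + 1e-12)
        C = C + step_C * B.T @ (A - B @ C)
        errors.append(np.linalg.norm(A - B @ C))
    return B, C, errors


rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))  # exactly rank 3
B, C, errors = factorize(A, r=3)
```

The returned `errors` list can be plotted against the step index to check that the fit has converged before comparing different values of $r$.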

**Ablate**:
Try different values for $r$ and analyze how good your approximation is (for example, by averaging the Frobenius norm of the error across all layers) as you increase $r$. Plot the result.

Determine the effective rank $r$: the smallest rank for which the approximation is good enough (meaning the Frobenius norm of the error is smaller than a threshold chosen by you).
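As a reference point for the sweep over $r$, the truncated SVD gives the best possible rank-$r$ approximation in the Frobenius norm (Eckart-Young), so its error is a lower bound for any learned $B\cdot C$. The sketch below uses the SVD in place of the gradient-based fit; the names `svd_errors` and `effective_rank` and the 5% relative threshold are assumptions.

```python
import numpy as np


def svd_errors(A):
    """errors[r] = Frobenius error of the best rank-r approximation of A."""
    s = np.linalg.svd(A, compute_uv=False)
    # The error at rank r is the norm of the discarded singular values.
    tail = np.sqrt(np.cumsum((s ** 2)[::-1])[::-1])  # tail[r] = sqrt(sum_{i>=r} s_i^2)
    return np.append(tail, 0.0)  # full rank reproduces A exactly


def effective_rank(A, rel_threshold=0.05):
    """Smallest r with ||A - A_r||_F / ||A||_F below rel_threshold."""
    errors = svd_errors(A)
    if errors[0] == 0:
        return 0  # zero matrix
    relative = errors / errors[0]  # errors[0] equals ||A||_F
    return int(np.argmax(relative < rel_threshold))  # index of the first True


# Stand-in matrix: rank 5 plus a little noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 30)) + 0.01 * rng.normal(size=(40, 30))
r_eff = effective_rank(A)
```

Plotting `svd_errors(A)` against $r$ next to the errors of the learned factorization shows how close the gradient-based fit gets to the optimum.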

## Task 6: Learning with low-rank factorization

Once you have found the effective rank $r$, take the same architecture from the previous task and replace each layer $A \in \mathbb{R}^{n\times m}$ by a layer that applies $B\cdot C$, with $B\in \mathbb{R}^{n\times r}$ and $C \in \mathbb{R}^{r\times m}$.

**Question**: How much memory do you save? (You can simply count the parameters of the original network and compare that with the parameter count of the new network.)
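A factorized layer and the parameter comparison could be sketched as below. The class name `LowRankLinear`, the helper `count_parameters`, and the sizes 784, 512 and rank 16 are assumptions for illustration. In the notation above, `up.weight` plays the role of $B$ and `down.weight` the role of $C$, so a dense layer with $n\cdot m$ parameters is replaced by one with $r\,(n+m)$.

```python
import torch
from torch import nn


class LowRankLinear(nn.Module):
    """Bias-free linear layer whose weight is the product of two thin factors."""

    def __init__(self, in_features, out_features, rank):
        super().__init__()
        # x -> down(x) -> up(down(x)); the effective weight is up.weight @ down.weight.
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)

    def forward(self, x):
        return self.up(self.down(x))


def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())


dense = nn.Linear(784, 512, bias=False)       # 784 * 512 parameters
low_rank = LowRankLinear(784, 512, rank=16)   # 16 * (784 + 512) parameters
saving = 1 - count_parameters(low_rank) / count_parameters(dense)
```

The factorization only saves parameters when $r < \frac{n\,m}{n+m}$, which is worth checking for the narrow output layer.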

Initialize $B$ and $C$ with a standard initialization scheme, and train this network.

**Question**: How does the learning change? Does it converge faster or slower? What about accuracy on both training and testing sets?

**Question**: Now run inference. How much improvement (for example, in speed or memory) do you see?

## Task 7: Final conclusions

Based on all the previous experiments, report your conclusions and try to explain the behaviours you observed.

Can you think of other ways of using the low-rank factorizations? What about SVD? Provide an explanation.

## BONUS Task: LoRA

Propose ways in which low-rank factorization could improve fine-tuning and training. What disadvantages does it have?

Read about LoRA (given in one of the references at the beginning of the notebook).

Now, take MNIST and remove one digit from the dataset (keep the same labels, just remove the datapoints of a specific label).

Train a simple MLP on this modified dataset.
Then fine-tune on the datapoints of the removed digit using LoRA.

Report the memory and time overheads.
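A LoRA-style layer for this fine-tuning step might look like the sketch below: the pretrained weight is frozen and only a low-rank update is trained. The class name `LoRALinear` and the defaults `rank=4` and `alpha=1.0` are assumptions. The zero initialization of one factor follows the usual LoRA convention, so the wrapped layer starts out identical to the pretrained one.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pretrained weight stays fixed
        # Effective weight: W + (alpha / rank) * lora_B @ lora_A
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


base = nn.Linear(784, 256, bias=False)  # stands in for a pretrained layer
layer = LoRALinear(base, rank=4)
x = torch.rand(5, 784)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

Comparing `trainable` with the full parameter count of the layer gives the memory overhead of fine-tuning; the time overhead can be measured by timing an epoch with and without the extra branch.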