<a href="https://colab.research.google.com/github/MichalSlowakiewicz/Machine-Learning/blob/master/Homework8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homework Assignment: Working with Other Loss Functions**

-------------------------------

During the class today, we reconstructed an **ellipse**. The ellipse was defined with two **foci** and $C$ (the sum of distances of the ellipse points from the foci).

To reconstruct the ellipse through optimization, we began with **$N$ points** scattered randomly in the 2D plane. Our goal was to adjust their positions so that they satisfy the elliptical constraint as closely as possible. We achieved this by minimizing the **error-related loss** $L^{(2)}_{\text{ellipse}}$, which was defined in today's class with the $\ell_2$ norm as:

$$
L^{(2)}_{\text{ellipse}} = \frac{1}{N} \sum_{i=1}^{N} \epsilon_i^2
$$

where
$$
 \epsilon_i = d_{i1} + d_{i2} - C
$$
Here $N$ is the number of points, and $d_{i1}, d_{i2}$ are the distances of the $i$-th point to the two foci.
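
As a point of reference for the other losses, the $\ell_2$ definition above can be sketched in PyTorch as follows. This is a minimal sketch, not the code from class: the foci and the constant $C = 6$ are the values used later in this notebook, while the helper names `ellipse_errors` and `l2_ellipse_loss` are hypothetical and introduced only for illustration.

```python
import torch

# Ellipse parameters (same values as in the notebook cells below)
focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0  # C, the required sum of distances to the foci


def ellipse_errors(points):
    """Return eps_i = d_i1 + d_i2 - C for every row of `points`."""
    d1 = torch.norm(points - focus1, dim=1)
    d2 = torch.norm(points - focus2, dim=1)
    return d1 + d2 - constant_sum


def l2_ellipse_loss(points):
    """Mean of squared errors, i.e. the L2-based ellipse loss."""
    return torch.mean(ellipse_errors(points) ** 2)


# Both points lie exactly on the ellipse, so every eps_i is 0
on_ellipse = torch.tensor([[3.0, 0.0], [-3.0, 0.0]])
# The second point is the origin: d1 = d2 = 2, so eps = -2
off_ellipse = torch.tensor([[3.0, 0.0], [0.0, 0.0]])

print(l2_ellipse_loss(on_ellipse).item())   # 0.0
print(l2_ellipse_loss(off_ellipse).item())  # 2.0
```

The other three losses in this assignment reuse the same errors $\epsilon_i$ and differ only in how they are aggregated.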

In the homework assignment you will experiment with 3 other loss definitions based on some other possible norms.



1. $\ell_0$ norm resulting in
  $$
  L^{(0)}_{\text{ellipse}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\epsilon_i \neq 0)
  $$
  - If you succeed in coding this loss function, the question for you to answer in relation to $L^{(0)}_{\text{ellipse}}$ is why the training does not progress with passing epochs.
  - If you fail to code this loss function, explain the failure and reason out theoretically why the training would not progress with passing epochs anyway.

1. $\ell_1$ norm resulting in
  $$
  L^{(1)}_{\text{ellipse}} = \frac{1}{N} \sum_{i=1}^{N} |\epsilon_i|
  $$
  The question for you to answer in relation to $L^{(1)}_{\text{ellipse}}$ is why the training loss doesn't converge, even after the ellipse has been fully drawn.

1. $\ell_\infty$ norm resulting in
  $$
  L^{(\infty)}_{\text{ellipse}} =  \max_{i} |\epsilon_i|
  $$
  The question for you to answer in relation to $L^{(\infty)}_{\text{ellipse}}$ is why the training takes so long and it doesn't converge in the end, either.

## **Points to Note**

1. Draw both the shape that the points draw as they move, and the loss value after each epoch, just as we did in class today.

2. Note that the purpose of this exercise is not to reconstruct a perfect ellipse, but rather to give it a try. Even if you fail, you should document and explain the failure, and answer the question related to the given loss definition.

3. You can also play around with the learning rate to try to improve convergence.

## **Task & Deliverables**
  
   - Document your experiments (python code and charts) and **write down your conclusions** into the Colab notebook.
   - It is not strictly required, but **if you make a movie showing the optimization progress it will be considered a strong point of your solution**
     - You can make a movie programmatically as we did in clustering class (our second class) with EM clustering,
     - or, you can save to disk the image files with epoch charts and use an external tool to bind them into a movie. Provide links to movie files in the README.
   - Place the Colab notebook with the solution in your **GitHub repository** for this course.
   - In your repository’s **README**, add a **link** to the notebook (and any movies you created) and also include an **“Open in Colab”** badge at the top of the notebook so it can be launched directly from GitHub.

## Sample code

   You can use the sample code provided below:



In [2]:
# defining a function for plotting the optimization process
# the code below comes from class
# the only adjustment is the last 3 lines, which save the plot as a PNG file
# these PNG files will be used to make a video showing the optimization process

import torch
import matplotlib.pyplot as plt
import numpy as np


def plot_results(epoch, trajectories, loss_history=None):
    if loss_history is not None:
      fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    else:
      fig, axes = plt.subplots(1, 1, figsize=(6, 6))
      axes = [axes]
    points = np.array([trajectories[i][-1] for i in range(num_points)])
    # Left plot: Scatter of points with trajectories
    axes[0].scatter(points[:, 0], points[:, 1], label=f'Points - Epoch {epoch}')
    f1 = focus1.detach().cpu().numpy()
    f2 = focus2.detach().cpu().numpy()
    axes[0].scatter([f1[0], f2[0]], [f1[1], f2[1]], color='red', marker='x', s=100, label='Foci')

    # Draw movement traces
    for i in range(num_points):
        trajectory = np.array(trajectories[i])
        axes[0].plot(trajectory[:, 0], trajectory[:, 1], color='gray', linestyle='-', linewidth=0.5)

    axes[0].set_xlabel('X')
    axes[0].set_ylabel('Y')
    axes[0].legend()


    axes[0].grid()

    if loss_history is not None:
      axes[0].set_title(f'Points after Epoch {epoch}')
      # Right plot: Loss history
      axes[1].plot(loss_history, color='blue')
      axes[1].set_xlabel("Epoch")
      axes[1].set_ylabel("Loss")
      axes[1].set_title("Loss Convergence")
      axes[1].grid()
    else:
      axes[0].set_title(f'Points before Epoch {epoch}')

    # saving png file of plot to "frames2" folder
    filename = f"frames2/frame_{epoch:04d}.png"
    plt.savefig(filename)
    plt.close()

In [10]:
# L0 loss function
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import imageio
from IPython.display import Image, HTML, Video

# fixing the random seed
torch.manual_seed(42)
np.random.seed(42)

# defining ellipse parameters
focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0

# initializing random points
num_points = 100
points = torch.rand((num_points, 2)) * 10 - 5
points.requires_grad = True

# resetting trajectories
trajectories = [[] for _ in range(num_points)]
loss_history = []

# making a folder in which we'll save images for video showing the optimization process
os.makedirs("frames2", exist_ok=True)



# training
optimizer = torch.optim.Adam([points], lr=0.1)
num_epochs = 1000

for epoch in range(num_epochs):
    optimizer.zero_grad()

    dist1 = torch.norm(points - focus1, dim=1)
    dist2 = torch.norm(points - focus2, dim=1)
    eps = dist1 + dist2 - constant_sum

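    # using L0 loss function
    # NOTE: the comparison `eps != 0` returns a boolean tensor that is detached
    # from the autograd graph, so loss.backward() below raises a RuntimeError
    loss = torch.mean((eps != 0).float())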

    loss.backward()
    optimizer.step()

    loss_history.append(loss.item())

    for i in range(num_points):
        trajectories[i].append(points[i].detach().cpu().clone().numpy())

    if epoch % 10 == 0:
        plot_results(epoch, trajectories, loss_history)
        print(f"Epoch {epoch}: Loss = {loss.item():.6f}")


# reading all frames
filenames = sorted([f for f in os.listdir("frames2") if f.endswith(".png")])
frames2 = [imageio.imread(os.path.join("frames2", f)) for f in filenames]

# saving joined png files as MP4
video_path = "ellipse_training2.mp4"
imageio.mimsave(video_path, frames2, fps=7)

# displaying the video (unfortunately, only works in Colab editor and the video doesn't load in Colab notebook. So, below there's a link to the video)
Video(video_path, embed=True)




RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

**$L_0$ loss function**

  - If you succeed in coding this loss function, the question for you to answer in relation to $L^{(0)}_{\text{ellipse}}$ is why the training does not progress with passing epochs.
  - If you fail to code this loss function, explain the failure and reason out theoretically why the training would not progress with passing epochs anyway.

In the code above, the implementation of the $L_0$ loss function failed. The error is raised at the line:

*loss.backward()*

This happens because the $L_0$ loss function has the form:

  $$
  L^{(0)}_{\text{ellipse}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\epsilon_i \neq 0)
  $$

which is not differentiable. In PyTorch, the comparison *eps != 0* returns a boolean tensor that is not part of the autograd graph, so the resulting loss has no *grad_fn* and its gradient cannot be computed.

Even if this loss function could be implemented, the training would not progress with passing epochs. The function $\mathbf{1}(\epsilon_i \neq 0)$ is piecewise constant: it equals 0 for $\epsilon_i = 0$ and 1 otherwise. Consequently, a small change in the position of a point does not change the value of the loss unless $\epsilon_i$ becomes exactly 0, which is very unlikely. The gradient is therefore 0 almost everywhere, so the algorithm would not know in which direction to move the points.
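
The failure can be reproduced in isolation. The snippet below is a minimal sketch with two hand-picked points (an assumption made for illustration, not the notebook's 100 random points); it checks where the autograd graph is cut and catches the resulting error instead of crashing.

```python
import torch

focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0

# Two arbitrary illustrative points that do not lie on the ellipse
points = torch.tensor([[3.0, 1.0], [0.0, 2.0]], requires_grad=True)

eps = (torch.norm(points - focus1, dim=1)
       + torch.norm(points - focus2, dim=1)
       - constant_sum)

indicator = eps != 0                   # boolean tensor, not differentiable
loss = torch.mean(indicator.float())   # L0 loss: fraction of nonzero errors

print(eps.requires_grad)        # True: eps still depends on points
print(indicator.requires_grad)  # False: the comparison cuts the graph
print(loss.requires_grad)       # False: nothing left to differentiate

error_message = None
try:
    loss.backward()
except RuntimeError as err:
    error_message = str(err)
print(error_message)
```

Because `loss` carries no gradient information, no optimizer can update `points` from it, regardless of the learning rate.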

In [12]:
# L1 loss function
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import imageio
from IPython.display import Image, HTML, Video

# fixing the random seed
torch.manual_seed(42)
np.random.seed(42)

# defining ellipse parameters
focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0

# initializing random points
num_points = 100
points = torch.rand((num_points, 2)) * 10 - 5
points.requires_grad = True

# resetting trajectories
trajectories = [[] for _ in range(num_points)]
loss_history = []

# making a folder in which we'll save images for video showing the optimization process
os.makedirs("frames2", exist_ok=True)



# training
optimizer = torch.optim.Adam([points], lr=0.1)
num_epochs = 1000

for epoch in range(num_epochs):
    optimizer.zero_grad()

    dist1 = torch.norm(points - focus1, dim=1)
    dist2 = torch.norm(points - focus2, dim=1)
    eps = dist1 + dist2 - constant_sum

    # using L1 loss function
    loss = torch.mean(torch.abs(eps))

    loss.backward()
    optimizer.step()

    loss_history.append(loss.item())

    for i in range(num_points):
        trajectories[i].append(points[i].detach().cpu().clone().numpy())

    if epoch % 10 == 0:
        plot_results(epoch, trajectories, loss_history)
        print(f"Epoch {epoch}: Loss = {loss.item():.6f}")


# reading all frames
filenames = sorted([f for f in os.listdir("frames2") if f.endswith(".png")])
frames2 = [imageio.imread(os.path.join("frames2", f)) for f in filenames]

# saving joined png files as MP4
video_path = "ellipse_training2.mp4"
imageio.mimsave(video_path, frames2, fps=7)

# displaying the video (unfortunately, only works in Colab editor and the video doesn't load in Colab notebook. So, below there's a link to the video)
Video(video_path, embed=True)



Epoch 0: Loss = 2.754212
Epoch 10: Loss = 1.059926
Epoch 20: Loss = 0.310943
Epoch 30: Loss = 0.188926
Epoch 40: Loss = 0.093729
Epoch 50: Loss = 0.053497
Epoch 60: Loss = 0.037769
Epoch 70: Loss = 0.029401
Epoch 80: Loss = 0.025541
Epoch 90: Loss = 0.026995
Epoch 100: Loss = 0.024422
Epoch 110: Loss = 0.022661
Epoch 120: Loss = 0.023052
Epoch 130: Loss = 0.024824
Epoch 140: Loss = 0.022773
Epoch 150: Loss = 0.022459
Epoch 160: Loss = 0.020810
Epoch 170: Loss = 0.022611
Epoch 180: Loss = 0.021731
Epoch 190: Loss = 0.018421
Epoch 200: Loss = 0.018269
Epoch 210: Loss = 0.023981
Epoch 220: Loss = 0.021062
Epoch 230: Loss = 0.020594
Epoch 240: Loss = 0.022908
Epoch 250: Loss = 0.019486
Epoch 260: Loss = 0.020366
Epoch 270: Loss = 0.024502
Epoch 280: Loss = 0.019642
Epoch 290: Loss = 0.018578
Epoch 300: Loss = 0.021560
Epoch 310: Loss = 0.021064
Epoch 320: Loss = 0.023591
Epoch 330: Loss = 0.025661
Epoch 340: Loss = 0.020722
Epoch 350: Loss = 0.019547
Epoch 360: Loss = 0.020766
Epoch 370: L



**$L_1$ loss function**

**Link to the animation**: https://drive.google.com/drive/folders/1uXs2glkMpHFTqmKGXcrSd4rH9R3u4XL2?usp=sharing

The question for you to answer in relation to $L^{(1)}_{\text{ellipse}}$ is why the training loss doesn't converge, even after the ellipse has been fully drawn.


Looking at the loss values in successive epochs, it is easy to see that the loss decreases until it reaches a level of about 0.02 and then oscillates around it. So with this loss function we do not get convergence.
Let us try to explain why this happens.

The function $|\epsilon_i|$ has the derivative $\operatorname{sgn}(\epsilon_i)$, which equals 1 or -1. The derivatives therefore remain relatively large even when the points lie close to the theoretical ellipse. In other words, the $L_1$ loss does not reward points lying in the close neighbourhood of the ellipse as strongly as the $L_2$ loss does (for $L_2$ the derivatives get smaller as the points approach the theoretical ellipse). With a fixed learning rate, a point that is already very close to the ellipse still takes a step of roughly the same size, so it tends to overshoot the ellipse and land on its other side. That is why the *loss* for $L_1$ oscillates around 0.02: there is always a relatively large gradient that keeps the points moving.
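
The effect of a constant-magnitude gradient can be illustrated on a single scalar error. The sketch below is a simplification under stated assumptions: it uses plain gradient descent on $\epsilon$ itself rather than Adam on point coordinates, and the helper `descend`, the starting value 0.25 and the step count are hypothetical choices made only for illustration.

```python
import torch


def descend(loss_fn, start=0.25, lr=0.1, steps=20):
    """Run plain gradient descent on a scalar eps and record its values."""
    eps = torch.tensor(start, requires_grad=True)
    history = []
    for _ in range(steps):
        loss = loss_fn(eps)
        (grad,) = torch.autograd.grad(loss, eps)
        with torch.no_grad():
            eps -= lr * grad  # fixed learning rate, no momentum
        history.append(eps.item())
    return history


l1_path = descend(torch.abs)         # gradient is always +1 or -1
l2_path = descend(lambda e: e ** 2)  # gradient 2*eps shrinks with eps

print([round(v, 3) for v in l1_path[-4:]])  # keeps jumping across zero
print([round(v, 4) for v in l2_path[-4:]])  # shrinks steadily towards zero
```

With the $L_1$ loss the error ends up bouncing between roughly $+0.05$ and $-0.05$, half the step size, while with the $L_2$ loss each step multiplies the error by $0.8$, so it decays towards zero.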


In [11]:
# L_inf loss function
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import imageio
from IPython.display import Image, HTML, Video

# fixing the random seed
torch.manual_seed(42)
np.random.seed(42)

# defining ellipse parameters
focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0

# initializing random points
num_points = 100
points = torch.rand((num_points, 2)) * 10 - 5
points.requires_grad = True

# resetting trajectories
trajectories = [[] for _ in range(num_points)]
loss_history = []

# making a folder in which we'll save images for video showing the optimization process
os.makedirs("frames2", exist_ok=True)



# training
optimizer = torch.optim.Adam([points], lr=0.1)
num_epochs = 1000

for epoch in range(num_epochs):
    optimizer.zero_grad()

    dist1 = torch.norm(points - focus1, dim=1)
    dist2 = torch.norm(points - focus2, dim=1)
    eps = dist1 + dist2 - constant_sum

    # using L_inf loss function
    loss = torch.max(torch.abs(eps))

    loss.backward()
    optimizer.step()

    loss_history.append(loss.item())

    for i in range(num_points):
        trajectories[i].append(points[i].detach().cpu().clone().numpy())

    if epoch % 10 == 0:
        plot_results(epoch, trajectories, loss_history)
        print(f"Epoch {epoch}: Loss = {loss.item():.6f}")


# reading all frames
filenames = sorted([f for f in os.listdir("frames2") if f.endswith(".png")])
frames2 = [imageio.imread(os.path.join("frames2", f)) for f in filenames]

# saving joined png files as MP4
video_path = "ellipse_training2.mp4"
imageio.mimsave(video_path, frames2, fps=7)

# displaying the video (unfortunately, only works in Colab editor and the video doesn't load in Colab notebook. So, below there's a link to the video)
Video(video_path, embed=True)



Epoch 0: Loss = 7.323938
Epoch 10: Loss = 6.148848
Epoch 20: Loss = 5.248571
Epoch 30: Loss = 4.691919
Epoch 40: Loss = 4.248756
Epoch 50: Loss = 3.796906
Epoch 60: Loss = 3.553381
Epoch 70: Loss = 3.303754
Epoch 80: Loss = 2.946027
Epoch 90: Loss = 2.604202
Epoch 100: Loss = 2.393175
Epoch 110: Loss = 2.174458
Epoch 120: Loss = 1.918397
Epoch 130: Loss = 1.716803
Epoch 140: Loss = 1.519077
Epoch 150: Loss = 1.389654
Epoch 160: Loss = 1.143647
Epoch 170: Loss = 1.074789
Epoch 180: Loss = 1.002527
Epoch 190: Loss = 0.911202
Epoch 200: Loss = 0.886059
Epoch 210: Loss = 0.862133
Epoch 220: Loss = 0.769010
Epoch 230: Loss = 0.855409
Epoch 240: Loss = 0.760064
Epoch 250: Loss = 0.849135
Epoch 260: Loss = 0.608963
Epoch 270: Loss = 0.598429
Epoch 280: Loss = 0.600041
Epoch 290: Loss = 0.555163
Epoch 300: Loss = 0.537401
Epoch 310: Loss = 0.562063
Epoch 320: Loss = 0.596848
Epoch 330: Loss = 0.525826
Epoch 340: Loss = 0.524653
Epoch 350: Loss = 0.512818
Epoch 360: Loss = 0.488097
Epoch 370: L



**$L_\infty$ loss function**

**Link to the animation:** https://drive.google.com/drive/folders/1_eQOTXYetmeVrOvYYFQ-sWAzyQp9RBGk?usp=sharing

The question for you to answer in relation to $L^{(\infty)}_{\text{ellipse}}$ is why the training takes so long and it doesn't converge in the end, either.

Let us look at the form of the $L_\infty$ loss function:

  $$
  L^{(\infty)}_{\text{ellipse}} =  \max_{i} |\epsilon_i|
  $$

Note that the whole value of this loss function depends on the single point that is placed worst of all. As a result, the training progresses much more slowly, because only a very small fraction of the points matters in the learning process. The training works more "locally": it improves the situation of the worst points one after another. The learning process does not try to improve the positions of all points at once (only of the worst one), and that is why the training takes so long (much longer than for the $L_2$ and $L_1$ loss functions).

In this case, too, we can see that there is no convergence: the loss values eventually oscillate around a level of about 0.3. The lack of convergence follows from the mechanism described above and from the fact that (just as for the $L_1$ loss) the derivative of $\max_i |\epsilon_i|$ does not decrease as the error decreases. As with $L_1$, there is always a relatively large gradient driving the movement of the points during training. Moreover, because the loss takes into account only the largest error, the learning process is less efficient than for the $L_1$ loss.
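
The claim that only the worst point receives an update can be checked directly on the gradient. The snippet below is a minimal sketch with four hand-picked points (an assumption for illustration; their errors are all different, so the maximum is attained by exactly one point).

```python
import torch

focus1 = torch.tensor([-2.0, 0.0])
focus2 = torch.tensor([2.0, 0.0])
constant_sum = 6.0

# Four illustrative points; the first one is farthest from the ellipse
points = torch.tensor(
    [[4.0, 3.0], [3.0, 0.5], [0.0, 2.0], [-1.0, -4.0]],
    requires_grad=True,
)

eps = (torch.norm(points - focus1, dim=1)
       + torch.norm(points - focus2, dim=1)
       - constant_sum)

loss = torch.max(torch.abs(eps))  # L_inf loss
loss.backward()

worst = torch.argmax(torch.abs(eps)).item()
has_gradient = points.grad.norm(dim=1) > 0

print(worst)                       # index of the worst point
print(has_gradient.tolist())       # True only for the worst point
print(int(has_gradient.sum()))     # number of points that would move
```

In a single step only one of the points gets a nonzero gradient; with 100 points, as in the notebook, the other 99 are ignored by the loss in that step, which is consistent with the slow decrease seen in the loss log above.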