From feeead419f9d375e79a2f4e6883473652d0fae14 Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Sat, 18 Jan 2025 11:56:57 +0300
Subject: [PATCH 1/6] added q94 ROC AUC

Implementation of ROC Area Under Curve, a metric used for measuring predictive quality of a binary classifier.

Input: true labels and predicted probabilities
Output: ROC AUC rounded to 5 decimal places
---
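Note for reviewers (not part of the commit): a minimal sketch of the
threshold sweep and trapezoidal rule that learn.md describes. It assumes
the decision rule "predict 1 when the probability is >= t", and it sweeps
distinct scores only, which gives the same area because repeated
thresholds add zero-width trapezoids. The names `roc_points` and
`trapezoid_auc` are illustrative and do not appear in the patch.

```python
import numpy as np


def roc_points(y_true, probas):
    """Return (fpr, tpr) arrays for thresholds swept from high to low."""
    y_true = np.asarray(y_true)
    probas = np.asarray(probas, dtype=float)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    # Start at (0, 0): a threshold above every score predicts no positives.
    fpr, tpr = [0.0], [0.0]
    # Distinct scores in descending order, plus 0 so the curve ends at (1, 1).
    for t in sorted(set(probas.tolist()) | {0.0}, reverse=True):
        y_pred = probas >= t
        tpr.append((y_pred & (y_true == 1)).sum() / n_pos)
        fpr.append((y_pred & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    dx = np.diff(fpr)
    return float(np.sum(dx * (tpr[1:] + tpr[:-1]) / 2))


# Same inputs as "Test 1" in solution.py.
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(trapezoid_auc(fpr, tpr))
```

Each trapezoid uses the sum of the two neighbouring TPR values, which is
the same accumulation solution.py performs inside its loop.
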
 Problems/94_roc_auc/learn.md    | 64 +++++++++++++++++++++++
 Problems/94_roc_auc/solution.py | 90 +++++++++++++++++++++++++++++++++
 2 files changed, 154 insertions(+)
 create mode 100644 Problems/94_roc_auc/learn.md
 create mode 100644 Problems/94_roc_auc/solution.py

diff --git a/Problems/94_roc_auc/learn.md b/Problems/94_roc_auc/learn.md
new file mode 100644
index 00000000..84c518e4
--- /dev/null
+++ b/Problems/94_roc_auc/learn.md
@@ -0,0 +1,64 @@
+## Overview
+ROC-AUC is a metric used for measuring predictive quality of a binary classifier with the highest value being $1$ and the lowest being $0$.
+
+## $TPR$ and $FPR$
+Consider a trivial case, when we have true binary labels $y_i\in\{0, 1\}$ and our predicted labels by the model $\hat{y_i}\in\{0, 1\}$. We also denote any arbitrary example labeled as $1$ as positive and $0$ as negative. Using $(y_i, \hat{y_i})$ combinations we build a set of $Y$, with the help of which we then generate statistics such as $TP$ (True Positive), $TN$ (True Negative), $FP$ (False Positive) and $FN$ (False Negative):
+
+| Total population = P + N     | Predicted positive (PP)                            | Predicted negative (PN)                               |
+|-----------------------------------|----------------------------------------------------|-------------------------------------------------------|
+| **Actual positive (P)**           | $TP=\#\{Y\|y_i=\hat{y_i}=1\}$                        | $FN=\#\{Y\|y_i=1;\hat{y_i}=0\}$ (also called type II error) |
+| **Actual negative (N)**           | $FP=\#\{Y\|y_i=0;\hat{y_i}=1\}$ (also called type I error)                      | $TN=\#\{Y\|y_i=\hat{y_i}=0\}$                           |
+
+This table, also referenced as **confusion matrix**, could provide an overview of the model's performance for this particular task. Now with the help of these statistics we can calculate the following estimates:
+$$
+TPR=\frac{TP}{TP+FN}\quad(\text{also called a recall}) \\ 
+
+FPR=\frac{FP}{FP+TN}
+$$
+
+Intuition-wise, **TPR** shows the model's sensitivity to positive cases, where the true label $y_i=1$. In some cases, for example in credit scoring or cancer detection tasks, we even neglect other metrics in favor of recall, since any $FN$-case could turn out a very costly mistake. **FPR**, on the other hand, shows how biased are we towards positive cases at the expense of $y_i=0$. 
+
+## Threshold
+Now recall that we originally obtain a vector $\hat{y_i}\in\{0, 1\}$ of predicted labels. But the model itself is not able to directly output either $0$ or $1$. Instead we look at the probability $z_i$ the model has provided us with and compare it with empirically chosen threshold $t$. For example, for a chosen $t=0.7$ we would have the following decision rule:
+$$
+\hat{y_i}=\begin{cases} 1, & \text{if } z_i\gt 0.7 \\ 0, & \text{otherwise } \end{cases}
+$$
+
+With this idea in mind, we can see that our estimates of $TPR$ and $FPR$ change with $t$, so we can denote them as $TPR(t)$ and $FPR(t)$. But we also want our measure of model quality to be robust and not depend on which threshold we choose. That is why we often look at the **ROC** curve, which plots $TPR(t)$ against $FPR(t)$ over all thresholds $t$.
+
+## ROC curve
+Each point of this curve is obtained via this algorithm:
+$$
+
+
+\begin{array}{l}
+\textbf{Input}: y\_true, y\_pred \text{ (true labels and output probabilities)} \\
+\textbf{Output: } \text{points} \text{ (a set of (x, y) coordinates)} \\
+\text{\textbf{function} roc\_points}(y\_true, y\_pred): \\
+\quad thresholds \leftarrow y\_pred \cup \{0\} \\
+\quad points \leftarrow [\quad ] \\
+\quad \textbf{for } t\in\{thresholds\,:\ t_i\ge t_{i+1}\} \textbf{ do}: \\
+\quad \quad y \leftarrow TPR(t) \\
+\quad \quad x \leftarrow FPR(t) \\
+\quad \quad \text{points.append}((x, y)) \\
+\quad \textbf{end for} \\
+\textbf{end function}
+\end{array}
+$$
+
+Both coordinates of the ROC curve stay within $[0, 1]$. To break it down, first consider a threshold $t=1$. Then every prediction is assigned the label $0$. Therefore $TP=0\implies TPR=0$ and $FP=0\implies FPR=0$ (since all negative examples are assigned the correct label). On the other hand, if we have $t=0$ (and every $z_i>0$), then $FN = 0\implies TPR=\frac{TP}{TP}=1$ and $TN = 0 \implies FPR=\frac{FP}{FP}=1$, since no prediction can be assigned the label $0$.
+
+The best case scenario is when, as the threshold $t$ decreases, $TPR$ reaches $1$ while $FPR$ stays at or near $0$, so the curve hugs the top-left corner. The worst case scenario is when the model is random and the curve follows the $FPR=TPR$ diagonal line.
+
+## ROC-AUC
+If you consider two ROC curves mentioned above, you could see that the space underneath the first one is greater than the second one. This is why we usually calculate **ROC-AUC** - area under the ROC curve. You might think that the larger is the AUC, the better is the model, but in fact it's a common misconception.
+
+Consider you want to choose a model between model #1 with $AUC_{ROC}=0.6$ and model #2 with $AUC_{ROC}=0.3$. The correct answer is actually #2, since we can always invert our decision rule in favor of the ROC-AUC and our $AUC_{ROC}$ for model #2 would actually become $0.7$. Therefore, when looking at the ROC AUC, we should consider how large is the **absolute** difference between the area of $0.5$ (worst case performance) and the one our model has generated.
+
+## Calculating AUC
+There are also various ways for calculating an area under the curve. The most applicable one, which is also used in scikit-learn, is the trapezoidal rule:
+$$
+\int_0^1 f(x)\,dx\approx\sum_i\frac{1}{2}\Delta x_i \left(f(x_i)+f(x_{i-1})\right) ,
+$$
+
+where $\Delta x_i=x_i-x_{i-1}$. This method breaks the total area under the curve into a sum of $90^\circ$-rotated trapezoids that together approximate it.
\ No newline at end of file
diff --git a/Problems/94_roc_auc/solution.py b/Problems/94_roc_auc/solution.py
new file mode 100644
index 00000000..1dd1f841
--- /dev/null
+++ b/Problems/94_roc_auc/solution.py
@@ -0,0 +1,90 @@
+import numpy as np
+
+
+def roc_auc(y_true: list[float], probas: list[float]) -> float:
+    """
+    Parameters
+    ----------
+    y_true : list[float]
+        True labels
+    probas : list[float]
+        Output probabilities of our binary classifier
+        
+    Returns
+    -------
+    auc : float
+        ROC AUC rounded to 5 decimal places
+    """
+    thresh = sorted(probas + [0], reverse=True)
+    y_true, probas = np.array(y_true), np.array(probas)
+
+    fpr, tpr = [0], [0]
+    auc = 0
+    
+    for t in thresh:
+        y_pred = np.where(probas < t, 0, 1)
+        tp = ((y_true == 1) & (y_pred == 1)).sum()
+        fn = (y_true == 1).sum() - tp
+
+        fp = (y_pred == 1).sum() - tp
+        tn = (y_true == 0).sum() - fp
+
+        fpr.append(fp / (fp + tn))
+        tpr.append(tp / (tp + fn))
+    
+        auc += (fpr[-1] - fpr[-2]) * (tpr[-1] + tpr[-2])
+
+    return round(1/2 * auc, 5)
+
+
+def test_roc_auc():
+    # Test 1
+    y = [0, 0, 1, 1]
+    y_proba = [0.1, 0.4, 0.35, 0.8]
+    assert roc_auc(y, y_proba) == .75, 'Test case 1 failed'
+
+    # Test 2
+    y = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
+    y_proba = [
+        0.9945685360621648,
+        0.9937332904188113,
+        0.9958526266087151,
+        4.391062222999706e-09,
+        0.9959272720187046,
+        0.10851446498385146,
+        0.001096202856869512,
+        4.995474609174945e-06,
+        0.9921605697799972,
+        0.9826790537446354
+    ]
+    assert roc_auc(y, y_proba) == 1.0, 'Test case 2 failed'
+
+    # Test 3
+    y = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
+    y_proba = [
+        0.8318040739657637,
+        0.421445304232661,
+        0.003309769194418868,
+        0.015529393142531172,
+        0.0001635684705459328,
+        0.6988867797464966,
+        0.9534132112895218,
+        0.8471417487716292,
+        0.0005832121647006822,
+        0.9990059733653113
+    ]
+    assert roc_auc(y, y_proba) == 0.95833, 'Test case 3 failed'
+
+    # Test 4
+    y = [0, 0, 1, 1, 1, 0, 1]
+    y_proba = [
+        8.99e-1,9.95e-1,5e-3,
+        2.3e-4,1e-4,9e-1,2.1e-4
+    ]
+    assert roc_auc(y, y_proba) == 0.0, 'Test case 4 failed'
+
+    print('All tests passed')
+
+
+if __name__ == '__main__':
+    test_roc_auc()
\ No newline at end of file

From f036168d3a8c85373f4d71a27004a7ff5d637a87 Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Sat, 18 Jan 2025 12:01:00 +0300
Subject: [PATCH 2/6] renamed dir 94_roc_auc -> roc_auc

---
 Problems/{94_roc_auc => roc_auc}/learn.md    | 0
 Problems/{94_roc_auc => roc_auc}/solution.py | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename Problems/{94_roc_auc => roc_auc}/learn.md (100%)
 rename Problems/{94_roc_auc => roc_auc}/solution.py (100%)

diff --git a/Problems/94_roc_auc/learn.md b/Problems/roc_auc/learn.md
similarity index 100%
rename from Problems/94_roc_auc/learn.md
rename to Problems/roc_auc/learn.md
diff --git a/Problems/94_roc_auc/solution.py b/Problems/roc_auc/solution.py
similarity index 100%
rename from Problems/94_roc_auc/solution.py
rename to Problems/roc_auc/solution.py

From d1acad8f3e71e41adc1462518e6d7ad53177eb83 Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Sun, 19 Jan 2025 21:58:02 +0300
Subject: [PATCH 3/6] Code and theory for train_logreg

The logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables.

* Input:
  * X, y
  * learning_rate
  * iterations - n of epochs
* Output:
  * list of updated parameters
  * list of recorded losses over iterations (starting from 0)
---
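Note for reviewers (not part of the commit): a minimal sketch that checks
the closed-form gradient X^T(sigmoid(X beta) - y) from learn.md against
central finite differences of the sum-reduced BCE loss. The data is
random, the seed is arbitrary, and the helper names `bce` and `bce_grad`
are illustrative and do not appear in the patch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(beta, X, y):
    """Sum-reduced binary cross-entropy (negative log-likelihood)."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def bce_grad(beta, X, y):
    """Closed-form gradient X^T (sigmoid(X beta) - y)."""
    return X.T @ (sigmoid(X @ beta) - y)


rng = np.random.default_rng(0)  # arbitrary seed, illustration only
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
beta = rng.normal(size=3)

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (bce(beta + eps * e, X, y) - bce(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_diff = float(np.max(np.abs(numeric - bce_grad(beta, X, y))))
print(max_abs_diff)
```

A small difference suggests that the sign and the x_j factor in the
derivation are consistent with the loss being minimized.
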
 Problems/train_logreg/learn.md    | 73 +++++++++++++++++++++++++++
 Problems/train_logreg/solution.py | 84 +++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)
 create mode 100644 Problems/train_logreg/learn.md
 create mode 100644 Problems/train_logreg/solution.py

diff --git a/Problems/train_logreg/learn.md b/Problems/train_logreg/learn.md
new file mode 100644
index 00000000..4ab7307f
--- /dev/null
+++ b/Problems/train_logreg/learn.md
@@ -0,0 +1,73 @@
+## Overview
+Softmax regression or multinomial logistic regression is a type of generalized logistic regression for not only 2 classes, but more as well.
+
+## Prerequisites for a regular logistic regression
+**tl;dr** regular (binary) logistic regression outputs probabilities using a sigmoid $\frac{1}{e^{-X\beta}+1}$ and is called a regression, because it is originally meant to approximate a logit function of odds.
+
+Logistic regression is based on the concept of "logits of odds". **Odds** is a measure of how frequently we encounter success relative to failure. It also allows us to shift our probability domain of $[0, 1]$ to $[0,\infty)$. Consider a probability of scoring a goal $p=0.8$, then our $odds=\frac{0.8}{0.2}=4$. This means that we could expect about $4$ goals for every miss. So the higher the odds, the more consistent is our streak of goals. **Logit** is the inverse of the standard logistic function, i.e. sigmoid: $logit(p)=\sigma^{-1}(p)=\ln\frac{p}{1-p}$. In our case $p$ is a probability, therefore we call $\frac{p}{1-p}$ the "odds". The logit allows us to further expand our domain from $[0,\infty)$ to $(-\infty,\infty)$.
+
+With this domain expansion we can treat our problem as a linear regression and try to approximate our logit function: $X\beta=logit(p)$. However what we really want for this approximation is to yield predictions for probabilities:
+$$
+X\beta=ln\frac{p}{1-p} \\
+e^{-X\beta}=\frac{1-p}{p} \\ 
+e^{-X\beta}+1 = \frac{1}{p} \\
+p = \frac{1}{e^{-X\beta}+1}
+$$
+
+What we practically just did is taking an inverse of a logit function w.r.t. our approximation and go back to sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
+$$
+\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
+$$
+
+## Loss in logistic regression
+The loss function used for solving the logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method allows us to search for $\beta$ that maximize our **likelihood function** $L(\beta)$. This function tells us how likely it is that $X$ has come from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.
+
+In the case of a regular logistic regression we expect our output to belong to a single Bernoulli-distributed random variable (hence the univariance), since our true label is either $y_i=0$ or $y_i=1$. The Bernoulli's PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Also let's denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function would look like this:
+$$
+\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
+$$
+
+Then we convert our function from likelihood to log-likelihood by taking $ln$ (or $log$) of it:
+$$
+\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
+$$
+
+And then we replace $p(x_i)$ with the sigmoid from previously defined equality to get a final version of our **loss function**:
+$$
+\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
+$$
+
+## Optimization objective
+Recall that originally we wanted to search for $\beta$ that maximizes the likelihood function. Since $\log$ is a monotonically increasing transformation, the maximizer does not change, so we can equivalently search for $\beta$ that maximizes our log-likelihood. Hence we can finally write our actual objective as:
+
+$$
+argmax_\beta [\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))] = \\
+
+= argmin_\beta -[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))]
+$$
+
+where $\sigma$ is the sigmoid. The function we're trying to minimize is also called the **Binary Cross Entropy** (BCE) loss function. To find the minimum we need the gradient of this negative log-likelihood (written as $LLF$ below), i.e. the vector of its derivatives with respect to every individual $\beta_j$, obtained with the chain rule:
+
+$$
+\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\partial\beta_j} = \\
+
+=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]}\frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
+
+=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
+
+=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
+
+=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
+
+=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
+$$
+
+This sum can be then rewritten in a more convenient gradient matrix form as:
+$$
+X^T(\sigma(X\beta)-Y)
+$$
+
+Then we can finally use gradient descent in order to iteratively update our parameters:
+$$
+\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
+$$
diff --git a/Problems/train_logreg/solution.py b/Problems/train_logreg/solution.py
new file mode 100644
index 00000000..dc50ff52
--- /dev/null
+++ b/Problems/train_logreg/solution.py
@@ -0,0 +1,84 @@
+import numpy as np
+
+
+def train_logreg(X: np.ndarray, y: np.ndarray, 
+                 learning_rate: float, iterations: int) -> tuple[list[float], ...]:
+    """        
+    Gradient-descent training algorithm for logistic regression, that collects sum-reduced
+    BCE losses, accuracies. Assigns label "0" if the P(x_i)<=0.5 and "1" otherwise.
+
+    Returns
+    -------
+    B : list[float]
+        1xM updated parameter vector rounded to 4 decimal places
+    losses : list[float]
+        collected values of a BCE loss function (LLF) rounded to 4 decimal places
+    """
+
+    def sigmoid(x):
+        return 1 / (1 + np.exp(-x))
+
+    def accuracy(y_pred, y_true):
+        return (y_true == np.rint(y_pred)).sum() / len(y_true)
+    
+    def bce_loss(y_pred, y_true):
+        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
+
+    y = y.reshape(-1, 1)
+    X = np.hstack((np.ones((X.shape[0], 1)), X))
+    B = np.zeros((X.shape[1], 1))
+    accuracies, losses = [], []
+
+    for epoch in range(iterations):
+        y_pred = sigmoid(X @ B)
+        B -= learning_rate * X.T @ (y_pred - y)
+        losses.append(round(bce_loss(y_pred, y), 4))
+        accuracies.append(round(accuracy(y_pred, y), 4))
+
+    return B.flatten().round(4).tolist(), losses
+
+
+def test_train_logreg():
+    # Test 1
+    X = np.array([[ 0.76743473, -0.23413696, -0.23415337,  1.57921282],
+       [-1.4123037 ,  0.31424733, -1.01283112, -0.90802408],
+       [-0.46572975,  0.54256004, -0.46947439, -0.46341769],
+       [-0.56228753, -1.91328024,  0.24196227, -1.72491783],
+       [-1.42474819, -0.2257763 ,  1.46564877,  0.0675282 ],
+       [ 1.85227818, -0.29169375, -0.60063869, -0.60170661],
+       [ 0.37569802,  0.11092259, -0.54438272, -1.15099358],
+       [ 0.19686124, -1.95967012,  0.2088636 , -1.32818605],
+       [ 1.52302986, -0.1382643 ,  0.49671415,  0.64768854],
+       [-1.22084365, -1.05771093, -0.01349722,  0.82254491]])
+    y = np.array([1., 0., 0., 0., 1., 1., 0., 0., 1., 0.])
+    learning_rate = 1e-3
+    iterations = 10
+    b, llf = train_logreg(X, y, learning_rate, iterations)
+    assert b == [-0.0097, 0.0286, 0.015, 0.0135, 0.0316] and \
+        llf == [6.9315, 6.9075, 6.8837, 6.8601, 6.8367, 6.8134, 6.7904, 6.7675, 6.7448, 6.7223], \
+            'Test case 1 failed'
+
+    # Test 2
+    X = np.array([[ 0.76743473,  1.57921282, -0.46947439],
+       [-0.23415337,  1.52302986, -0.23413696],
+       [ 0.11092259, -0.54438272, -1.15099358],
+       [-0.60063869,  0.37569802, -0.29169375],
+       [-1.91328024,  0.24196227, -1.72491783],
+       [-1.01283112, -0.56228753,  0.31424733],
+       [-0.1382643 ,  0.49671415,  0.64768854],
+       [-0.46341769,  0.54256004, -0.46572975],
+       [-1.4123037 , -0.90802408,  1.46564877],
+       [ 0.0675282 , -0.2257763 , -1.42474819]])
+    y = np.array([1., 1., 0., 0., 0., 0., 1., 1., 0., 0.])
+    learning_rate = 1e-1
+    iterations = 10
+    b, llf = train_logreg(X, y, learning_rate, iterations)
+    assert b == [-0.2509, 0.9325, 1.6218, 0.6336] and \
+        llf == [6.9315, 5.5073, 4.6382, 4.0609, 3.6503, 3.3432, 3.1045, 2.9134, 2.7567, 2.6258], \
+            'Test case 2 failed'
+
+    print('All tests passed')
+
+
+if __name__ == '__main__':
+    test_train_logreg()
\ No newline at end of file

From 33dca3deb81b7edc0e4519a977d2241300027a8d Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Sun, 19 Jan 2025 22:03:14 +0300
Subject: [PATCH 4/6] Removed ROC AUC files

---
 Problems/roc_auc/learn.md    | 64 -------------------------
 Problems/roc_auc/solution.py | 90 ------------------------------------
 2 files changed, 154 deletions(-)
 delete mode 100644 Problems/roc_auc/learn.md
 delete mode 100644 Problems/roc_auc/solution.py

diff --git a/Problems/roc_auc/learn.md b/Problems/roc_auc/learn.md
deleted file mode 100644
index 84c518e4..00000000
--- a/Problems/roc_auc/learn.md
+++ /dev/null
@@ -1,64 +0,0 @@
-## Overview
-ROC-AUC is a metric used for measuring predictive quality of a binary classifier with the highest value being $1$ and the lowest being $0$.
-
-## $TPR$ and $FPR$
-Consider a trivial case, when we have true binary labels $y_i\in\{0, 1\}$ and our predicted labels by the model $\hat{y_i}\in\{0, 1\}$. We also denote any arbitrary example labeled as $1$ as positive and $0$ as negative. Using $(y_i, \hat{y_i})$ combinations we build a set of $Y$, with the help of which we then generate statistics such as $TP$ (True Positive), $TN$ (True Negative), $FP$ (False Positive) and $FN$ (False Negative):
-
-| Total population = P + N     | Predicted positive (PP)                            | Predicted negative (PN)                               |
-|-----------------------------------|----------------------------------------------------|-------------------------------------------------------|
-| **Actual positive (P)**           | $TP=\#\{Y\|y_i=\hat{y_i}=1\}$                        | $FN=\#\{Y\|y_i=1;\hat{y_i}=0\}$ (also called type II error) |
-| **Actual negative (N)**           | $FP=\#\{Y\|y_i=0;\hat{y_i}=1\}$ (also called type I error)                      | $TN=\#\{Y\|y_i=\hat{y_i}=0\}$                           |
-
-This table, also referenced as **confusion matrix**, could provide an overview of the model's performance for this particular task. Now with the help of these statistics we can calculate the following estimates:
-$$
-TPR=\frac{TP}{TP+FN}\quad(\text{also called a recall}) \\ 
-
-FPR=\frac{FP}{FP+TN}
-$$
-
-Intuition-wise, **TPR** shows the model's sensitivity to positive cases, where the true label $y_i=1$. In some cases, for example in credit scoring or cancer detection tasks, we even neglect other metrics in favor of recall, since any $FN$-case could turn out a very costly mistake. **FPR**, on the other hand, shows how biased are we towards positive cases at the expense of $y_i=0$. 
-
-## Threshold
-Now recall that we originally obtain a vector $\hat{y_i}\in\{0, 1\}$ of predicted labels. But the model itself is not able to directly output either $0$ or $1$. Instead we look at the probability $z_i$ the model has provided us with and compare it with empirically chosen threshold $t$. For example, for a chosen $t=0.7$ we would have the following decision rule:
-$$
-\hat{y_i}=\begin{cases} 1, & \text{if } z_i\gt 0.7 \\ 0, & \text{otherwise } \end{cases}
-$$
-
-With this idea in mind, we can see that our estimates of $TPR$ and $FPR$ change with $t$, so we can denote them as $TPR(t)$ and $FPR(t)$. But we also want our measure of model quality to be robust and not depend on which threshold we choose. That is why we often look at the **ROC** curve, which plots $TPR(t)$ against $FPR(t)$ over all thresholds $t$.
-
-## ROC curve
-Each point of this curve is obtained via this algorithm:
-$$
-
-
-\begin{array}{l}
-\textbf{Input}: y\_true, y\_pred \text{ (true labels and output probabilities)} \\
-\textbf{Output: } \text{points} \text{ (a set of (x, y) coordinates)} \\
-\text{\textbf{function} roc\_points}(y\_true, y\_pred): \\
-\quad thresholds \leftarrow y\_pred \cup \{0\} \\
-\quad points \leftarrow [\quad ] \\
-\quad \textbf{for } t\in\{thresholds\,:\ t_i\ge t_{i+1}\} \textbf{ do}: \\
-\quad \quad y \leftarrow TPR(t) \\
-\quad \quad x \leftarrow FPR(t) \\
-\quad \quad \text{points.append}((x, y)) \\
-\quad \textbf{end for} \\
-\textbf{end function}
-\end{array}
-$$
-
-Both coordinates of the ROC curve stay within $[0, 1]$. To break it down, first consider a threshold $t=1$. Then every prediction is assigned the label $0$. Therefore $TP=0\implies TPR=0$ and $FP=0\implies FPR=0$ (since all negative examples are assigned the correct label). On the other hand, if we have $t=0$ (and every $z_i>0$), then $FN = 0\implies TPR=\frac{TP}{TP}=1$ and $TN = 0 \implies FPR=\frac{FP}{FP}=1$, since no prediction can be assigned the label $0$.
-
-The best case scenario is when, as the threshold $t$ decreases, $TPR$ reaches $1$ while $FPR$ stays at or near $0$, so the curve hugs the top-left corner. The worst case scenario is when the model is random and the curve follows the $FPR=TPR$ diagonal line.
-
-## ROC-AUC
-If you consider two ROC curves mentioned above, you could see that the space underneath the first one is greater than the second one. This is why we usually calculate **ROC-AUC** - area under the ROC curve. You might think that the larger is the AUC, the better is the model, but in fact it's a common misconception.
-
-Consider you want to choose a model between model #1 with $AUC_{ROC}=0.6$ and model #2 with $AUC_{ROC}=0.3$. The correct answer is actually #2, since we can always invert our decision rule in favor of the ROC-AUC and our $AUC_{ROC}$ for model #2 would actually become $0.7$. Therefore, when looking at the ROC AUC, we should consider how large is the **absolute** difference between the area of $0.5$ (worst case performance) and the one our model has generated.
-
-## Calculating AUC
-There are also various ways for calculating an area under the curve. The most applicable one, which is also used in scikit-learn, is the trapezoidal rule:
-$$
-\int_0^1 f(x)\,dx\approx\sum_i\frac{1}{2}\Delta x_i \left(f(x_i)+f(x_{i-1})\right) ,
-$$
-
-where $\Delta x_i=x_i-x_{i-1}$. This method breaks the total area under the curve into a sum of $90^\circ$-rotated trapezoids that together approximate it.
\ No newline at end of file
diff --git a/Problems/roc_auc/solution.py b/Problems/roc_auc/solution.py
deleted file mode 100644
index 1dd1f841..00000000
--- a/Problems/roc_auc/solution.py
+++ /dev/null
@@ -1,90 +0,0 @@
-import numpy as np
-
-
-def roc_auc(y_true: list[float], probas: list[float]) -> float:
-    """
-    Parameters
-    ----------
-    y_true : list[float]
-        True labels
-    probas : list[float]
-        Output probabilities of our binary classifier
-        
-    Returns
-    -------
-    auc : float
-        ROC AUC rounded to 5 decimal places
-    """
-    thresh = sorted(probas + [0], reverse=True)
-    y_true, probas = np.array(y_true), np.array(probas)
-
-    fpr, tpr = [0], [0]
-    auc = 0
-    
-    for t in thresh:
-        y_pred = np.where(probas < t, 0, 1)
-        tp = ((y_true == 1) & (y_pred == 1)).sum()
-        fn = (y_true == 1).sum() - tp
-
-        fp = (y_pred == 1).sum() - tp
-        tn = (y_true == 0).sum() - fp
-
-        fpr.append(fp / (fp + tn))
-        tpr.append(tp / (tp + fn))
-    
-        auc += (fpr[-1] - fpr[-2]) * (tpr[-1] + tpr[-2])
-
-    return round(1/2 * auc, 5)
-
-
-def test_roc_auc():
-    # Test 1
-    y = [0, 0, 1, 1]
-    y_proba = [0.1, 0.4, 0.35, 0.8]
-    assert roc_auc(y, y_proba) == .75, 'Test case 1 failed'
-
-    # Test 2
-    y = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
-    y_proba = [
-        0.9945685360621648,
-        0.9937332904188113,
-        0.9958526266087151,
-        4.391062222999706e-09,
-        0.9959272720187046,
-        0.10851446498385146,
-        0.001096202856869512,
-        4.995474609174945e-06,
-        0.9921605697799972,
-        0.9826790537446354
-    ]
-    assert roc_auc(y, y_proba) == 1.0, 'Test case 2 failed'
-
-    # Test 3
-    y = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
-    y_proba = [
-        0.8318040739657637,
-        0.421445304232661,
-        0.003309769194418868,
-        0.015529393142531172,
-        0.0001635684705459328,
-        0.6988867797464966,
-        0.9534132112895218,
-        0.8471417487716292,
-        0.0005832121647006822,
-        0.9990059733653113
-    ]
-    assert roc_auc(y, y_proba) == 0.95833, 'Test case 3 failed'
-
-    # Test 4
-    y = [0, 0, 1, 1, 1, 0, 1]
-    y_proba = [
-        8.99e-1,9.95e-1,5e-3,
-        2.3e-4,1e-4,9e-1,2.1e-4
-    ]
-    assert roc_auc(y, y_proba) == 0.0, 'Test case 4 failed'
-
-    print('All tests passed')
-
-
-if __name__ == '__main__':
-    test_roc_auc()
\ No newline at end of file

From 3011acd34970efdbda57ec728fa52115bca922b7 Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Sun, 19 Jan 2025 22:11:04 +0300
Subject: [PATCH 5/6] Replaced softmax regression with logistic

---
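Note for reviewers (not part of the commit): a minimal sketch of the
odds, logit and sigmoid relationship that the tl;dr in learn.md relies
on, using the p = 0.8 example from that file. The function names are
illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Standard logistic function, the inverse of logit."""
    return 1 / (1 + math.exp(-z))


p = 0.8
odds = p / (1 - p)             # about 4: roughly four successes per failure
roundtrip = sigmoid(logit(p))  # recovers p up to floating-point error
print(odds, roundtrip)
```
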
 Problems/train_logreg/learn.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Problems/train_logreg/learn.md b/Problems/train_logreg/learn.md
index 4ab7307f..1200e467 100644
--- a/Problems/train_logreg/learn.md
+++ b/Problems/train_logreg/learn.md
@@ -1,5 +1,5 @@
 ## Overview
-Softmax regression or multinomial logistic regression is a type of generalized logistic regression for not only 2 classes, but more as well.
+Logistic regression is a model used for binary classification problems.
 
 ## Prerequisites for a regular logistic regression
 **tl;dr** regular (binary) logistic regression outputs probabilities using a sigmoid $\frac{1}{e^{-X\beta}+1}$ and is called a regression, because it is originally meant to approximate a logit function of odds.

From 9498dd5be4fba652099fb8737c451addde2fb773 Mon Sep 17 00:00:00 2001
From: "Turkunov Y." <55660526+turkunov@users.noreply.github.com>
Date: Mon, 20 Jan 2025 10:02:18 +0300
Subject: [PATCH 6/6] Left only ROC AUC code, removed logreg

---
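Note for reviewers (not part of the commit): a minimal sketch of the
"invert the decision rule" remark in learn.md. It does not use the
trapezoidal code from solution.py. Instead it assumes the equivalent
pairwise-ranking view of ROC AUC (the share of positive/negative pairs
ranked correctly, with ties counted as one half), and `pairwise_auc` is
an illustrative name.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Same inputs as "Test 1" in solution.py.
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y, scores)
# Flipping every score reverses the ranking, which mirrors the AUC around 0.5.
auc_inverted = pairwise_auc(y, [1 - s for s in scores])
print(auc, auc_inverted)
```

The two values sum to 1, which is why a model scoring 0.3 can be turned
into one scoring 0.7 by flipping its outputs.
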
 Problems/roc_auc/learn.md         | 64 ++++++++++++++++++++++
 Problems/roc_auc/solution.py      | 90 +++++++++++++++++++++++++++++++
 Problems/train_logreg/learn.md    | 73 -------------------------
 Problems/train_logreg/solution.py | 84 -----------------------------
 4 files changed, 154 insertions(+), 157 deletions(-)
 create mode 100644 Problems/roc_auc/learn.md
 create mode 100644 Problems/roc_auc/solution.py
 delete mode 100644 Problems/train_logreg/learn.md
 delete mode 100644 Problems/train_logreg/solution.py

diff --git a/Problems/roc_auc/learn.md b/Problems/roc_auc/learn.md
new file mode 100644
index 00000000..84c518e4
--- /dev/null
+++ b/Problems/roc_auc/learn.md
@@ -0,0 +1,64 @@
+## Overview
+ROC-AUC is a metric used for measuring predictive quality of a binary classifier with the highest value being $1$ and the lowest being $0$.
+
+## $TPR$ and $FPR$
+Consider a trivial case, when we have true binary labels $y_i\in\{0, 1\}$ and our predicted labels by the model $\hat{y_i}\in\{0, 1\}$. We also denote any arbitrary example labeled as $1$ as positive and $0$ as negative. Using $(y_i, \hat{y_i})$ combinations we build a set of $Y$, with the help of which we then generate statistics such as $TP$ (True Positive), $TN$ (True Negative), $FP$ (False Positive) and $FN$ (False Negative):
+
+| Total population = P + N     | Predicted positive (PP)                            | Predicted negative (PN)                               |
+|-----------------------------------|----------------------------------------------------|-------------------------------------------------------|
+| **Actual positive (P)**           | $TP=\#\{Y\|y_i=\hat{y_i}=1\}$                        | $FN=\#\{Y\|y_i=1;\hat{y_i}=0\}$ (also called type II error) |
+| **Actual negative (N)**           | $FP=\#\{Y\|y_i=0;\hat{y_i}=1\}$ (also called type I error)                      | $TN=\#\{Y\|y_i=\hat{y_i}=0\}$                           |
+
+This table, also referenced as **confusion matrix**, could provide an overview of the model's performance for this particular task. Now with the help of these statistics we can calculate the following estimates:
+$$
+TPR=\frac{TP}{TP+FN}\quad(\text{also called a recall}) \\ 
+
+FPR=\frac{FP}{FP+TN}
+$$
+
+Intuition-wise, **TPR** shows the model's sensitivity to positive cases, where the true label $y_i=1$. In some cases, for example in credit scoring or cancer detection tasks, we even neglect other metrics in favor of recall, since any $FN$-case could turn out a very costly mistake. **FPR**, on the other hand, shows how biased are we towards positive cases at the expense of $y_i=0$. 
+
+## Threshold
+Recall that we started from a vector of predicted labels $\hat{y_i}\in\{0, 1\}$. The model itself, however, usually does not output $0$ or $1$ directly. Instead, we take the probability $z_i$ that the model provides and compare it with an empirically chosen threshold $t$. For example, for $t=0.7$ we would have the following decision rule:
+$$
+\hat{y_i}=\begin{cases} 1, & \text{if } z_i\ge 0.7 \\ 0, & \text{otherwise} \end{cases}
+$$
+
+With this in mind, we can see that our estimates of $TPR$ and $FPR$ change with every $t$, so we can write them as $TPR(t)$ and $FPR(t)$. We also want our model to be robust and not depend on which threshold we choose. That is why, when measuring the quality of a model, we often look at the **ROC** curve, which plots $TPR(t)$ against $FPR(t)$ across all thresholds.
+
+## ROC curve
+The points of this curve are obtained with the following algorithm:
+
+$$
+\begin{array}{l}
+\textbf{Input}: y\_true, y\_pred \text{ (true labels and output probabilities)} \\
+\textbf{Output: } \text{points} \text{ (a list of (x, y) coordinates)} \\
+\text{\textbf{function} roc\_points}(y\_true, y\_pred): \\
+\quad thresholds \leftarrow y\_pred \cup \{0\} \\
+\quad points \leftarrow [(0, 0)] \\
+\quad \textbf{for } t\in thresholds \text{ (in descending order)} \textbf{ do}: \\
+\quad \quad y \leftarrow TPR(t) \\
+\quad \quad x \leftarrow FPR(t) \\
+\quad \quad \text{points.append}((x, y)) \\
+\quad \textbf{end for} \\
+\quad \textbf{return } points \\
+\textbf{end function}
+\end{array}
+$$
+
+Both coordinates of the ROC curve stay within $[0, 1]$. To see why, first consider a threshold $t=1$. Then every prediction is assigned the label $0$ (assuming no predicted probability reaches $1$). Therefore $TP=0\implies TPR=0$ and $FP=0\implies FPR=0$ (since all negative examples are assigned the correct label). On the other hand, if $t=0$, then $FN = 0\implies TPR=\frac{TP}{TP}=1$ and $TN = 0 \implies FPR=\frac{FP}{FP}=1$, since no prediction can be assigned $0$.
+
+The best-case scenario is when lowering the threshold $t$ increases sensitivity without increasing the bias ($FPR$ stays at or near $0$ while $TPR$ is always high). The worst-case scenario is when the model is random and the curve follows the diagonal line $FPR=TPR$.
+
+## ROC-AUC
+If you compare the two ROC curves described above, you can see that the area under the first one is greater than the area under the second. This is why we usually calculate **ROC-AUC**, the area under the ROC curve. You might think that the larger the AUC, the better the model, but this is a common misconception.
+
+Suppose you want to choose between model #1 with $AUC_{ROC}=0.6$ and model #2 with $AUC_{ROC}=0.3$. The correct answer is actually #2, since we can always invert its decision rule, after which the $AUC_{ROC}$ of model #2 becomes $1-0.3=0.7$. Therefore, when looking at ROC-AUC, we should consider how large the **absolute** difference is between $0.5$ (worst-case performance) and the area our model produces.
+
+## Calculating AUC
+There are several ways to calculate the area under a curve. The most common one, which is also used in scikit-learn, is the trapezoidal rule:
+$$
+\int f(x)\,dx\approx\sum_i\frac{1}{2}\Delta x_i \cdot (f(x_i)+f(x_{i-1})) ,
+$$
+
+where $\Delta x_i=x_i-x_{i-1}$. This method splits the total area under the curve into a sum of $90^\circ$-rotated trapezoids, one between each pair of adjacent points on the curve.
\ No newline at end of file
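Review note: the `roc_points` pseudocode and the trapezoidal rule described in `learn.md` can be sketched in NumPy as shown here. This is a minimal illustration under stated assumptions, not the code in `solution.py`. The helper name `trapezoid_auc` is hypothetical, and starting the sweep at `np.inf` (so that the first point is the origin) is a choice made for this sketch. It also assumes the rule "predict 1 when the score is at least the threshold" and that both classes are present.

```python
import numpy as np


def roc_points(y_true, y_score):
    """Return (FPR, TPR) points for thresholds swept from high to low."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()

    # Assumption of this sketch: begin above every score so the curve
    # starts at (0, 0), then visit each distinct score in descending order.
    thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))

    points = []
    for t in thresholds:
        y_pred = y_score >= t  # predict positive when score >= threshold
        tp = (y_pred & (y_true == 1)).sum()
        fp = (y_pred & (y_true == 0)).sum()
        points.append((float(fp / n_neg), float(tp / n_pos)))
    return points


def trapezoid_auc(points):
    """Sum the trapezoid areas between consecutive (x, y) points."""
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += 0.5 * (x1 - x0) * (y1 + y0)
    return auc


pts = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(trapezoid_auc(pts))  # 0.75
```

The example data is the first test case from `solution.py`. Note that each trapezoid uses the sum of the two heights, which is what `solution.py` computes as well.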
diff --git a/Problems/roc_auc/solution.py b/Problems/roc_auc/solution.py
new file mode 100644
index 00000000..1dd1f841
--- /dev/null
+++ b/Problems/roc_auc/solution.py
@@ -0,0 +1,90 @@
+import numpy as np
+
+
+def roc_auc(y_true: list[int], probas: list[float]) -> float:
+    """
+    Parameters
+    ----------
+    y_true : list[int]
+        True binary labels (0 or 1); both classes must be present
+    probas : list[float]
+        Output probabilities of our binary classifier
+
+    Returns
+    -------
+    auc : float
+        ROC AUC rounded to 5 decimal places
+    """
+    thresh = sorted(list(probas) + [0], reverse=True)  # thresholds, high to low
+    y_true, probas = np.array(y_true), np.array(probas)
+
+    fpr, tpr = [0], [0]  # the curve starts at (0, 0)
+    auc = 0
+
+    for t in thresh:
+        y_pred = np.where(probas < t, 0, 1)  # predict 1 when proba >= t
+        tp = ((y_true == 1) & (y_pred == 1)).sum()
+        fn = (y_true == 1).sum() - tp
+
+        fp = (y_pred == 1).sum() - tp
+        tn = (y_true == 0).sum() - fp
+
+        fpr.append(fp / (fp + tn))
+        tpr.append(tp / (tp + fn))
+        # Twice the trapezoid area between the last two points (halved on return)
+        auc += (fpr[-1] - fpr[-2]) * (tpr[-1] + tpr[-2])
+
+    return round(1/2 * auc, 5)
+
+
+def test_roc_auc():
+    # Test 1
+    y = [0, 0, 1, 1]
+    y_proba = [0.1, 0.4, 0.35, 0.8]
+    assert roc_auc(y, y_proba) == .75, 'Test case 1 failed'
+
+    # Test 2
+    y = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
+    y_proba = [
+        0.9945685360621648,
+        0.9937332904188113,
+        0.9958526266087151,
+        4.391062222999706e-09,
+        0.9959272720187046,
+        0.10851446498385146,
+        0.001096202856869512,
+        4.995474609174945e-06,
+        0.9921605697799972,
+        0.9826790537446354
+    ]
+    assert roc_auc(y, y_proba) == 1.0, 'Test case 2 failed'
+
+    # Test 3
+    y = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
+    y_proba = [
+        0.8318040739657637,
+        0.421445304232661,
+        0.003309769194418868,
+        0.015529393142531172,
+        0.0001635684705459328,
+        0.6988867797464966,
+        0.9534132112895218,
+        0.8471417487716292,
+        0.0005832121647006822,
+        0.9990059733653113
+    ]
+    assert roc_auc(y, y_proba) == 0.95833, 'Test case 3 failed'
+
+    # Test 4
+    y = [0, 0, 1, 1, 1, 0, 1]
+    y_proba = [
+        8.99e-1, 9.95e-1, 5e-3,
+        2.3e-4, 1e-4, 9e-1, 2.1e-4
+    ]
+    assert roc_auc(y, y_proba) == 0.0, 'Test case 4 failed'
+
+    print('All tests passed')
+
+
+if __name__ == '__main__':
+    test_roc_auc()
\ No newline at end of file
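Review note: test case 4 above (AUC of `0.0`) is a handy illustration of the point made in `learn.md` that a decision rule can be inverted. The sketch below is not part of the patch. It uses a hypothetical helper, `pairwise_auc`, which relies on the standard equivalence between ROC AUC and the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. Flipping the scores with `1 - p` turns an AUC of `a` into `1 - a`.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Hypothetical cross-check helper; assumes both classes are present.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]

    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


y = [0, 0, 1, 1, 1, 0, 1]
p = [8.99e-1, 9.95e-1, 5e-3, 2.3e-4, 1e-4, 9e-1, 2.1e-4]

print(pairwise_auc(y, p))                       # 0.0
print(pairwise_auc(y, [1 - v for v in p]))      # 1.0
```

A model that ranks every pair the wrong way round is therefore as informative as one that ranks every pair correctly, which is why the distance from `0.5` matters.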
diff --git a/Problems/train_logreg/learn.md b/Problems/train_logreg/learn.md
deleted file mode 100644
index 1200e467..00000000
--- a/Problems/train_logreg/learn.md
+++ /dev/null
@@ -1,73 +0,0 @@
-## Overview
-Logistic regression is a model used for a binary classification poblem.
-
-## Prerequisites for a regular logistic regression
-**tl;dr** regular (binary) logistic regression outputs probabilities using a sigmoid $\frac{1}{e^{-X\beta}+1}$ and is called a regression, because it is originally meant to approximate a logit function of odds.
-
-Logistic regression is based on the concept of "logits of odds". **Odds** is measure of how frequent we encounter success. It also allows us to shift our probabilities domain of $[0, 1]$ to $[0,\infty]$ Consider a probability of scoring a goal $p=0.8$, then our $odds=\frac{0.8}{0.2}=4$. This means that every $4$ matches we could be expecting a goal followed by a miss. So the higher the odds, the more consistent is our streak of goals. **Logit** is an inverse of the standard logistic function, i.e. sigmoid: $logit(p)=\sigma^{-1}(p)=ln\frac{p}{1-p}$. In our case $p$ is a probability, therefore we call $\frac{p}{1-p}$ the "odds". The logit allows us to further expand our domain from $[0,\infty]$ to $[-\infty,\infty]$.
-
-With this domain expansion we can treat our problem as a linear regression and try to approximate our logit function: $X\beta=logit(p)$. However what we really want for this approximation is to yield predictions for probabilities:
-$$
-X\beta=ln\frac{p}{1-p} \\
-e^{-X\beta}=\frac{1-p}{p} \\ 
-e^{-X\beta}+1 = \frac{1}{p} \\
-p = \frac{1}{e^{-X\beta}+1}
-$$
-
-What we practically just did is taking an inverse of a logit function w.r.t. our approximation and go back to sigmoid. This is also the backbone of the regular logistic regression, which is commonly defined as:
-$$
-\pi=\frac{e^{\alpha+X\beta}}{1+e^{\alpha+X\beta}}=\frac{1}{1+e^{-(\alpha+X\beta)}}.
-$$
-
-## Loss in logistic regression
-The loss function used for solving the logistic regression for $\beta$ is derived from MLE (Maximum Likelihood Estimation). This method allows us to search for $\beta$ that maximize our **likelihood function** $L(\beta)$. This function tells us how likely it is that $X$ has come from the distribution generated by $\beta$: $L(\beta)=L(\beta|X)=P(X|\beta)=\prod_{\{x\in X\}}f^{univar}_X(x;\beta)$, where $f$ is a PMF and $univar$ means univariate, i.e. applied to a single variable.
-
-In the case of a regular logistic regression we expect our output to belong to a single Bernoulli-distributed random variable (hence the univariance), since our true label is either $y_i=0$ or $y_i=1$. The Bernoulli's PMF is defined as $P(Y=y)=p^y(1-p)^{(1-y)}$, where $y\in\{0, 1\}$. Also let's denote $\{x\in X\}$ simply as $X$ and refer to a single pair of vectors from the training set as $(x_i, y_i)$. Thus, our likelihood function would look like this:
-$$
-\prod_X p\left(x_i\right)^{y_i} \times\left[1-p\left(x_i\right)\right]^{1-y_i}
-$$
-
-Then we convert our function from likelihood to log-likelihood by taking $ln$ (or $log$) of it:
-$$
-\sum_X y_i \log \left[p\left(x_i\right)\right]+\left(1-y_i\right) \log \left[1-p\left(x_i\right)\right]
-$$
-
-And then we replace $p(x_i)$ with the sigmoid from previously defined equality to get a final version of our **loss function**:
-$$
-\sum_X y_i \log \left(\frac{1}{1+e^{-x_i\beta}}\right)+\left(1-y_i\right)\log \left(1-\frac{1}{1+e^{-x_i\beta}}\right)
-$$
-
-## Optimization objective
-Recall that originally we wanted to search for $\beta$ that maximize the likelihood function. Since $log$ is a monotonic transformation, our maximization objective does not change and we can confindently say that now we can equally search for $\beta$ that maximize our log-likelihood. Hence we can finally write our actual objective as:
-
-$$
-argmax_\beta [\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))] = \\
-
-= argmin_\beta -[\sum_X y_i \log\sigma(x_i\beta)+\left(1-y_i\right)\log (1-\sigma(x_i\beta))]
-$$
-
-where $\sigma$ is the sigmoid. This function we're trying to minimize is also called **Binary Cross Entropy** loss function (BCE). To find the minimum we would need to take the gradient of this LLF (Log-Likelihood Function), or find a vector of derivatives with respect to every individual $\beta_j$, using a chain rule, i.e.:
-
-$$
-\frac{\partial LLF}{\partial\beta_j}=\frac{\partial LLF}{\partial\sigma}\frac{\partial\sigma}{\partial[X\beta]}\frac{\partial[X\beta]}{\beta_j} = \\
-
-=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \frac{\partial\sigma}{\partial[x^{(i)}\beta]} = \\
-
-=-\sum_{i=1}^n\left(y^{(i)} \frac{1}{\sigma\left(x^{(i)}\beta\right)}-(1-y^{(i)} ) \frac{1}{1-\sigma\left(x^{(i)}\beta\right)}\right) \sigma\left(x^{(i)}\beta\right)\left(1-\sigma\left(x^{(i)}\beta\right)\right) \frac{\partial[x^{(i)}\beta]}{\partial\beta_j} = \\
-
-=-\sum_{i=1}^n\left(y^{(i)}\left(1-\sigma\left(x^{(i)}\beta\right)\right)-(1-y^{(i)} ) \sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
-
-=-\sum_{i=1}^n\left(y^{(i)}-\sigma\left(x^{(i)}\beta\right)\right) x_j^{(i)} = \\
-
-=\sum_{i=1}^n\left(\sigma\left(x^{(i)}\beta\right)-y^{(i)}\right) x_j^{(i)}.
-$$
-
-This sum can be then rewritten in a more convenient gradient matrix form as:
-$$
-X^T(\sigma(X\beta)-Y)
-$$
-
-Then we can finally use gradient descent in order to iteratively update our parameters:
-$$
-\beta_{t+1}=\beta_t - \eta [X^T(\sigma(X\beta_t)-Y)]
-$$
diff --git a/Problems/train_logreg/solution.py b/Problems/train_logreg/solution.py
deleted file mode 100644
index dc50ff52..00000000
--- a/Problems/train_logreg/solution.py
+++ /dev/null
@@ -1,84 +0,0 @@
-import numpy as np
-
-
-def train_logreg(X: np.ndarray, y: np.ndarray, 
-                 learning_rate: float, iterations: int) -> tuple[list[float], ...]:
-    """        
-    Gradient-descent training algorithm for logistic regression, that collects sum-reduced
-    BCE losses, accuracies. Assigns label "0" if the P(x_i)<=0.5 and "1" otherwise.
-
-    Returns
-    -------
-    B : list[float]
-        1xM updated parameter vector rounded to 4 floating points
-    losses : list[float]
-        collected values of a BCE loss function (LLF) rounded to 4 floating points
-    """
-
-    def sigmoid(x):
-        return 1 / (1 + np.exp(-x))
-
-    def accuracy(y_pred, y_true):
-        return (y_true == np.rint(y_pred)).sum() / len(y_true)
-    
-    def bce_loss(y_pred, y_true):
-        return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
-
-    y = y.reshape(-1, 1)
-    X = np.hstack((np.ones((X.shape[0], 1)), X))
-    B = np.zeros((X.shape[1], 1))
-    accuracies, losses = [], []
-
-    for epoch in range(iterations):
-        y_pred = sigmoid(X @ B)
-        B -= learning_rate * X.T @ (y_pred - y)
-        losses.append(round(bce_loss(y_pred, y), 4))
-        accuracies.append(round(accuracy(y_pred, y), 4))
-
-    return B.flatten().round(4).tolist(), losses
-
-
-def test_train_logreg():
-    # Test 1
-    X = np.array([[ 0.76743473, -0.23413696, -0.23415337,  1.57921282],
-       [-1.4123037 ,  0.31424733, -1.01283112, -0.90802408],
-       [-0.46572975,  0.54256004, -0.46947439, -0.46341769],
-       [-0.56228753, -1.91328024,  0.24196227, -1.72491783],
-       [-1.42474819, -0.2257763 ,  1.46564877,  0.0675282 ],
-       [ 1.85227818, -0.29169375, -0.60063869, -0.60170661],
-       [ 0.37569802,  0.11092259, -0.54438272, -1.15099358],
-       [ 0.19686124, -1.95967012,  0.2088636 , -1.32818605],
-       [ 1.52302986, -0.1382643 ,  0.49671415,  0.64768854],
-       [-1.22084365, -1.05771093, -0.01349722,  0.82254491]])
-    y = np.array([1., 0., 0., 0., 1., 1., 0., 0., 1., 0.])
-    learning_rate = 1e-3
-    iterations = 10
-    b, llf = train_logreg(X, y, learning_rate, iterations)
-    assert b == [-0.0097, 0.0286, 0.015, 0.0135, 0.0316] and \
-        llf == [6.9315, 6.9075, 6.8837, 6.8601, 6.8367, 6.8134, 6.7904, 6.7675, 6.7448, 6.7223], \
-            'Test case 1 failed'
-
-    # Test 2
-    X = np.array([[ 0.76743473,  1.57921282, -0.46947439],
-       [-0.23415337,  1.52302986, -0.23413696],
-       [ 0.11092259, -0.54438272, -1.15099358],
-       [-0.60063869,  0.37569802, -0.29169375],
-       [-1.91328024,  0.24196227, -1.72491783],
-       [-1.01283112, -0.56228753,  0.31424733],
-       [-0.1382643 ,  0.49671415,  0.64768854],
-       [-0.46341769,  0.54256004, -0.46572975],
-       [-1.4123037 , -0.90802408,  1.46564877],
-       [ 0.0675282 , -0.2257763 , -1.42474819]])
-    y = np.array([1., 1., 0., 0., 0., 0., 1., 1., 0., 0.])
-    learning_rate = 1e-1
-    iterations = 10
-    b, llf = train_logreg(X, y, learning_rate, iterations)
-    assert b == [-0.2509, 0.9325, 1.6218, 0.6336] and \
-        llf == [6.9315, 5.5073, 4.6382, 4.0609, 3.6503, 3.3432, 3.1045, 2.9134, 2.7567, 2.6258], \
-            'Test case 2 failed'
-
-    print('All tests passed')
-
-
-if __name__ == '__main__':
-    test_train_logreg()
\ No newline at end of file