> Igor Sorochan DSU-31

# Loss function and optimization problem implementation without sklearn

In [160]:
from sklearn import datasets
import numpy as np
import pandas as pd

import plotly.express as px
import matplotlib.pyplot as plt

### Loading data

In [3]:
iris = datasets.load_iris()
iris.feature_names, iris.target_names

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

In [127]:
print(iris.target_names)
# ~(iris.target_names == 'setosa')

['setosa' 'versicolor' 'virginica']


In [5]:
# boolean mask: True for 'versicolor' and 'virginica' (targets 1 and 2), False for 'setosa'
mask = iris.target > 0

In [129]:
X = iris.data[mask]
# reshape y into a column vector of shape (n, 1)
y = iris.target[mask].reshape(-1, 1)
X.shape, y.shape

((100, 4), (100, 1))

### Visualizing input data

In [132]:
fig = px.scatter_matrix(X, color = y.ravel(), height= 900)
fig.update_traces(diagonal_visible=False)





### The least squares method in a matrix form

Given a linear regression model with a **matrix X [n, p]** of observations and a **column vector y [n, 1]** holding the dependent variable, where:
*  n is the number of observations,
*  p is the number of independent variables.

The least squares method seeks the p x 1 column vector β that minimizes the sum of squared residuals, where the residual vector e is:

$$ e = y - X\beta $$

The sum of squared residuals is $e^Te$, so minimizing it is equivalent to minimizing the Euclidean norm of e. Setting the gradient of $e^Te$ with respect to β to zero gives the normal equations $X^TX\beta = X^Ty$. When $X^TX$ is invertible, the solution is:
$$ \beta = (X^TX)^{-1}X^Ty $$
where:
* $\beta$ is the least squares estimate of the parameters.
* $X^T$ is the transpose of the X matrix.  
* $(X^T X)^{-1}$ is the inverse of the product of $X^T$ and $X$.  
* $y$ is the column vector of actual values of the dependent variable.

|$$ \beta = (X^TX)^{-1}X^Ty $$|
|:---|
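
As a quick sanity check of the closed-form formula, the sketch below builds a small synthetic dataset and compares the estimate with NumPy's `np.linalg.lstsq`. The data, the seed, and the names `X_demo`, `y_demo`, `beta_true` are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative only

# Hypothetical synthetic data: 50 observations, 3 features plus a column of ones
n, p = 50, 3
X_demo = np.hstack((rng.normal(size=(n, p)), np.ones((n, 1))))
beta_true = np.array([[1.5], [-2.0], [0.5], [3.0]])   # last entry is the intercept
y_demo = X_demo @ beta_true + 0.01 * rng.normal(size=(n, 1))

# Closed-form least squares estimate: beta = (X^T X)^{-1} X^T y
beta_closed = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Reference solution from NumPy's least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_closed, beta_lstsq))
```

Both routes should agree up to floating-point error, and with this little noise the estimate should sit close to `beta_true`.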

Let's implement this in code:

In [133]:
class Lin_regr():
    def __init__(self):
        # coefficients are set by fit(); the bias (intercept) is stored last
        self.coefs = None
        print('Initialized')

    def fit(self, X, y):
        # number of observations
        self.rows = X.shape[0]

        # append a column of ones on the right so that the bias (intercept, B0)
        # is estimated together with the other coefficients
        XX = np.hstack((X, np.ones((self.rows, 1))))

        # calculate the coefficients with the closed-form formula defined earlier
        self.coefs = np.linalg.inv(XX.T @ XX) @ XX.T @ y
        print('Fitted')

    def predict(self, X):
        # use the row count of the data passed in, not of the training set,
        # so that prediction also works on inputs of a different size
        XX = np.hstack((X, np.ones((X.shape[0], 1))))

        y_pred = XX @ self.coefs

        return y_pred

    def coefficients(self):
        print('Linear regression coefficients:')
        return self.coefs

In [138]:
lr = Lin_regr() # initialize a LR model
lr.fit(X, y) # fitting the model

Initialized
Fitted


In [154]:
# ravel the array to match the DataFrame input format; the bias comes last because fit() appends the column of ones on the right
pd.DataFrame({'Coef':lr.coefficients().ravel()}, index= iris.feature_names + ['bias'] ) # print out LR coefficients

Linear regression coefficients:


| |Coef|
|:---|---:|
|sepal length (cm)|-0.19606|
|sepal width (cm)|-0.30755|
|petal length (cm)|0.384264|
|petal width (cm)|0.682845|
|bias|0.581361|


In [159]:
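# ravel arrays to match the DataFrame input format
df = pd.DataFrame({'Predicted': np.round(lr.predict(X).ravel()), 'GT': y.ravel()})
# rows where the rounded prediction differs from the ground truth
misclassified = df[df.Predicted != df.GT]
# label each row: 'TP' marks a correct prediction, 'FP' an incorrect one
df["Result"] = df.apply(lambda x: 'TP' if x.Predicted == x.GT else 'FP', axis=1)
# look the counts up by label instead of relying on the ordering of value_counts()
counts = df.Result.value_counts()
TP, FP = counts.get('TP', 0), counts.get('FP', 0)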

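$Accuracy = \frac{TP}{TP + FP}$, where TP is the number of correct predictions and FP is the number of incorrect predictions, following the labels used in the code.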

In [157]:
print(f'Model accuracy :{TP / (TP + FP)}')

Model accuracy :0.97
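
The same ratio can be computed directly from two label arrays. The sketch below uses made-up arrays (`y_true_demo` and `y_pred_demo` are hypothetical and are not the notebook's predictions) to show the arithmetic: 6 matching labels out of 8 give an accuracy of 0.75.

```python
import numpy as np

# Hypothetical ground-truth and predicted labels, for illustration only
y_true_demo = np.array([1, 1, 2, 2, 2, 1, 2, 1])
y_pred_demo = np.array([1, 2, 2, 2, 2, 1, 1, 1])

correct = int((y_true_demo == y_pred_demo).sum())     # 6 matching labels
incorrect = int((y_true_demo != y_pred_demo).sum())   # 2 mismatches

accuracy = correct / (correct + incorrect)
print(accuracy)  # 0.75
```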


![](https://u.netology.ru/backend/uploads/lms/attachments/files/data/49656/FEML_3_%D1%84%D0%BE%D1%80%D0%BC%D1%83%D0%BB%D1%8B.png)

|$$Loss = (\theta_0 + \theta_1 \cdot x - y)^2 $$|
|:---|
|$$\frac{\partial Loss}{\partial \theta_0} = 2 \cdot (\theta_0 + \theta_1 \cdot x - y) \cdot 1 $$|
|$$\frac{\partial Loss}{\partial \theta_1} = 2 \cdot (\theta_0 + \theta_1 \cdot x - y) \cdot x $$|
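
The two derivatives above can be checked numerically. The following sketch evaluates them for one observation and compares them with central finite differences. The function names `loss` and `loss_grad` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def loss(theta0, theta1, x, y):
    """Squared error of a one-feature linear model for a single observation."""
    return (theta0 + theta1 * x - y) ** 2

def loss_grad(theta0, theta1, x, y):
    """Analytic partial derivatives with respect to theta0 and theta1."""
    residual = theta0 + theta1 * x - y
    return 2 * residual * 1, 2 * residual * x

# Hypothetical values chosen only for illustration
theta0, theta1, x, y = 0.5, 2.0, 3.0, 4.0

g0, g1 = loss_grad(theta0, theta1, x, y)

# Central finite differences as an independent numerical check
eps = 1e-6
g0_num = (loss(theta0 + eps, theta1, x, y) - loss(theta0 - eps, theta1, x, y)) / (2 * eps)
g1_num = (loss(theta0, theta1 + eps, x, y) - loss(theta0, theta1 - eps, x, y)) / (2 * eps)

print(g0, g1)                                            # 5.0 15.0
print(np.isclose(g0, g0_num), np.isclose(g1, g1_num))
```

Here the residual is 0.5 + 2.0 * 3.0 - 4.0 = 2.5, so the derivatives are 2 * 2.5 = 5.0 and 2 * 2.5 * 3.0 = 15.0.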

### Gradient descent optimization method

The inverse matrix $(X^TX)^{-1}$ used to compute the column vector $\beta$ does not always exist, for example when some features are linearly dependent.  
As the number of observations and features in the model grows, computing the inverse also becomes expensive.  
That is why we often rely on iterative numerical methods to find the minimum of the loss function.  
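
To make the first point concrete, here is a small sketch with a made-up design matrix whose third column is exactly twice the first. The matrix `X_col`, the target `y_col`, and the vector `null_vec` are hypothetical. Because a nonzero vector is mapped to zero by $X^TX$, the inverse cannot exist, yet `np.linalg.lstsq` still returns a least squares solution.

```python
import numpy as np

# Hypothetical design matrix: the third column is exactly twice the first,
# so the columns are linearly dependent
X_col = np.array([[1.0, 2.0, 2.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 5.0, 6.0],
                  [4.0, 3.0, 8.0]])
y_col = np.array([[1.0], [2.0], [3.0], [4.0]])

gram = X_col.T @ X_col

# A nonzero vector mapped to zero shows that X^T X has no inverse
null_vec = np.array([[2.0], [0.0], [-1.0]])   # 2 * column 1 - column 3 = 0
print(np.allclose(gram @ null_vec, 0))        # True

# lstsq still returns a least squares solution for rank-deficient input
beta_ls, *_ = np.linalg.lstsq(X_col, y_col, rcond=None)
print(np.allclose(X_col @ beta_ls, y_col))
```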

One very popular method is **Stochastic Gradient Descent (SGD)**, an optimization algorithm widely used in machine learning to minimize the loss function of a model.  
Instead of calculating the gradient over the entire dataset,  
SGD updates the model parameters using the gradient computed on a **random subset of the dataset**,  
which is referred to as a **"mini-batch"**.  
This makes each update cheaper, especially when working with large datasets.  

The **"stochastic"** part of the name comes from the **random selection** of the mini-batch,  
which introduces randomness into the optimization process.  
For non-convex loss functions this noise can help the algorithm escape local minima and find a better overall solution.
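
A minimal sketch of mini-batch SGD for linear regression is shown below, under stated assumptions: the function name `sgd_linear_regression`, its hyperparameters (`learning_rate`, `epochs`, `batch_size`), and the synthetic data are all chosen for illustration and are not taken from the notebook. The update uses the gradient of the mean squared error over each mini-batch, which is the vectorized form of the derivatives with respect to $\theta_0$ and $\theta_1$ shown earlier.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=200, batch_size=16, seed=0):
    """Mini-batch SGD for least squares; returns coefficients with the bias last."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    XX = np.hstack((X, np.ones((n, 1))))       # column of ones for the bias
    theta = np.zeros((XX.shape[1], 1))
    for _ in range(epochs):
        order = rng.permutation(n)             # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = XX[idx] @ theta - y[idx]
            # gradient of the mean squared error over the mini-batch
            grad = 2 * XX[idx].T @ residual / len(idx)
            theta -= learning_rate * grad
    return theta

# Hypothetical synthetic data: y = 3*x1 - 2*x2 + 1 plus small noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([[3.0], [-2.0]]) + 1.0 + 0.01 * rng.normal(size=(200, 1))

theta_sgd = sgd_linear_regression(X_demo, y_demo)
print(theta_sgd.ravel())
```

With these settings the estimate should land close to the generating coefficients (3, -2) and the bias 1. On real data the learning rate and the number of epochs usually need tuning, and unscaled features can make convergence slow.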
