> Igor Sorochan DSU-31

# Loss function and optimization problem implementation without sklearn

In [68]:
from sklearn import datasets
import numpy as np
import pandas as pd

import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, accuracy_score

# np.random.seed(66)

### Loading data

In [69]:
iris = datasets.load_iris()
iris.feature_names, iris.target_names

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

In [70]:
print(iris.target_names)
# ~(iris.target_names == 'setosa')

['setosa' 'versicolor' 'virginica']


In [71]:
# masking as True 'versicolor' and 'virginica'
mask = iris.target > 0

In [72]:
X= iris.data[mask]
# reshape to assign dimensions
y= iris.target[mask].reshape(-1,1)
X.shape, y.shape

((100, 4), (100, 1))

### Visualizing input data

In [73]:
fig = px.scatter_matrix(X, color = y.ravel(), height= 900)
# fig.update_traces(diagonal_visible=False)
fig.show()





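## Solving the problem using `Multiple Linear Regression`

Let's use multiple linear regression to predict iris classes: the class labels (1 = versicolor, 2 = virginica) are treated as a numeric target, and the predictions are rounded to the nearest label.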

### The least squares method in a matrix form

Given a linear regression model with a **matrix X [n, p]** of observations and a **column vector y [n, 1]** as the dependent variable, where:
*  n is the number of observations
*  p is the number of independent variables

The least squares method seeks the p x 1 column vector β that minimizes the sum of squared residuals (with an intercept, a column of ones is prepended to X, so β has p + 1 entries). The residual vector e is given by:

e = y - Xβ

The least squares estimate of β is the value that minimizes the sum of squared residuals $e^Te$, which is equivalent to minimizing the Euclidean norm of e. Provided $X^TX$ is invertible, the solution in matrix form is:
$$ \beta = (X^TX)^{-1}X^Ty $$
where:
* $\beta$ is the least squares estimate of the parameters.
* $X^T$ is the transpose of the X matrix.  
* $(X^T X)^{-1}$ is the inverse of the product of $X^T$ and $X$.  
* $y$ is the column vector of actual values of the dependent variable.
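
For completeness, here is a short sketch of where this formula comes from, assuming $X^TX$ is invertible (the columns of $X$ are linearly independent). The symbol $S(\beta)$ for the sum of squared residuals is introduced here only for this derivation:

$$
\begin{aligned}
S(\beta) &= e^Te = (y - X\beta)^T(y - X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \\
\frac{\partial S}{\partial \beta} &= -2X^Ty + 2X^TX\beta = 0 \\
X^TX\beta &= X^Ty \quad\Rightarrow\quad \beta = (X^TX)^{-1}X^Ty
\end{aligned}
$$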

|$$ \beta = (X^TX)^{-1}X^Ty $$|
|:---|

Let's write it in code:

In [74]:
class Lin_regr():
    def __init__(self):
        self.coeffs = None
        print('Linear regression model is initialized')

    def fit(self, X, y):
        # number of observations
        self.rows = X.shape[0]

        # prepend a column of ones so the first coefficient acts as the bias (intercept, B0)
        XX = np.hstack((np.ones((self.rows, 1)), X))

        # calculate coeffs with the normal equation defined earlier
        self.coeffs = np.linalg.inv(XX.T @ XX) @ XX.T @ y
        print('Linear regression model is fitted')

    def predict(self, X):
        # use the row count of the data being predicted, not of the training data
        XX = np.hstack((np.ones((X.shape[0], 1)), X))

        y_pred = XX @ self.coeffs

        return y_pred

    def coefficients(self):
        print('Linear regression coefficients:')
        return self.coeffs

In [75]:
lr = Lin_regr() # initialize a LR model
lr.fit(X, y) # fitting the model

Linear regression model is initialized
Linear regression model is fitted
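
As a sanity check of the closed-form fit, the sketch below compares the normal equation with NumPy's least-squares solver `np.linalg.lstsq`. The synthetic data and the names `X_syn`, `y_syn`, and `true_beta` are assumptions made for illustration only and are not part of the iris experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 50 observations, 3 features, known coefficients.
X_syn = rng.normal(size=(50, 3))
true_beta = np.array([[1.0], [2.0], [-3.0], [0.5]])   # bias first
X_design = np.hstack((np.ones((50, 1)), X_syn))       # prepend the ones column
y_syn = X_design @ true_beta + 0.01 * rng.normal(size=(50, 1))

# Normal equation, the same expression used in Lin_regr.fit
beta_normal = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y_syn

# Library least-squares solver for comparison
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_syn, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

On well-conditioned data like this the two estimates should agree to numerical precision.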


In [76]:
# ravel to get a 1-D array for the DataFrame; bias comes first because fit() prepends the ones column
pd.DataFrame({'Coef':lr.coefficients().ravel()}, index= ['bias'] + iris.feature_names ) # print out LR coefficients

Linear regression coefficients:


| |Coef|
|:---|:---|
|bias|0.581361|
|sepal length (cm)|-0.19606|
|sepal width (cm)|-0.30755|
|petal length (cm)|0.384264|
|petal width (cm)|0.682845|


Defining a metric: the proportion of correct answers among all outputs.

In [81]:
# ravel arrays to conform to the DataFrame input format
df = pd.DataFrame({'Predicted':np.round(lr.predict(X).ravel()), 'GT': y.ravel()})
df["Result"] = df.apply(lambda x: 'TP' if x.Predicted == x.GT else 'FP', axis= 1 )
df[df.Predicted != df.GT]

| |Predicted|GT|Result|
|:---|:---|:---|:---|
|20|2.0|1|FP|
|33|2.0|1|FP|
|83|1.0|2|FP|


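$Accuracy = \frac{TP}{TP + FP}$

Note that in this notebook `TP` labels every correct prediction and `FP` every incorrect one, so the ratio is the share of correct predictions (accuracy) rather than precision.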

In [82]:
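# look the counts up by label instead of relying on the order returned by value_counts()
counts = df.Result.value_counts()
TP, FP = counts['TP'], counts.get('FP', 0)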
print(f'Model accuracy :{TP / (TP + FP)}')

Model accuracy :0.97


### Comparing the metric with sklearn accuracy_score

In [83]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(y.ravel(), df.Predicted) == TP / (TP + FP)

True
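
In keeping with the "without sklearn" goal, the same metric can be sketched directly in NumPy. The helper name `accuracy` and the toy labels below are hypothetical and only illustrate the idea.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: 4 of 5 predictions are correct.
demo_true = [1, 2, 2, 1, 2]
demo_pred = [1, 2, 1, 1, 2]
print(accuracy(demo_true, demo_pred))  # 0.8
```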

Visualizing some feature relations.

In [84]:
fig =px.scatter(df, x=X[:,2], y=X[:,3], color= df.GT.astype(str)+':' + df.Result, title= 'Visualizing the regression predictions of iris classes')
fig.update_layout(yaxis_title="petal width (cm)",
                xaxis_title = 'petal length (cm)')
fig.show()

### Gradient descent optimization method

The inverse matrix $(X^TX)^{-1}$ used to compute the column vector $\beta$ does not always exist, for example when features are linearly dependent.  
Moreover, as the number of observations and features in the model grows,  
computing the inverse may become **costly** in terms of computing resources.  
That is why we often rely on **iterative numerical methods** to find the minimum of the loss function.  
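
To illustrate the first point, here is a minimal sketch with a deliberately duplicated feature, so that $X^TX$ is singular. The data and the names `X_dup` and `y_dup` are made up for this example. `np.linalg.lstsq` is used because it still returns a least-squares solution for a rank-deficient matrix and also reports the rank.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the second column is an exact multiple of the first,
# so X_dup has rank 1 and X_dup.T @ X_dup cannot be inverted.
x1 = rng.normal(size=(20, 1))
X_dup = np.hstack((x1, 2 * x1))
y_dup = 3 * x1

beta_dup, _, rank, _ = np.linalg.lstsq(X_dup, y_dup, rcond=None)

print(rank)                                   # effective rank of X_dup
print(np.allclose(X_dup @ beta_dup, y_dup))   # the fit still reproduces y_dup
```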

One very popular method is **Stochastic Gradient Descent** (SGD),  
an optimization algorithm used in machine learning to minimize the loss function of a model.  
Instead of calculating the gradient over the entire dataset,  
SGD updates the model parameters based on the gradient of a **random subset of the dataset**,  
which is referred to as a **"mini-batch"**.  
This makes each update cheaper, especially when working with large datasets.  

The **"stochastic"** part of the name comes from the fact that the **random selection** of the mini-batch  
introduces some randomness into the optimization process,  
which can help the algorithm escape local minima and find a better overall solution.
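
The idea can be sketched on a small linear regression problem. Everything below (the synthetic data, the learning rate, the batch size, and the variable names) is an assumption chosen for illustration, not a tuned setup.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic data: y = 2 + 3 * x plus small noise.
n_samples = 200
x_sgd = rng.uniform(-1, 1, size=(n_samples, 1))
X_sgd = np.hstack((np.ones((n_samples, 1)), x_sgd))   # bias column first
y_sgd = 2 + 3 * x_sgd + 0.05 * rng.normal(size=(n_samples, 1))

w_sgd = np.zeros((2, 1))
learning_rate, batch_size, n_epochs = 0.1, 20, 200

for _ in range(n_epochs):
    perm = rng.permutation(n_samples)                 # reshuffle every epoch
    for start in range(0, n_samples, batch_size):
        idx = perm[start:start + batch_size]
        X_b, y_b = X_sgd[idx], y_sgd[idx]
        # gradient of the mean squared error on this mini-batch only
        grad = 2 * X_b.T @ (X_b @ w_sgd - y_b) / len(idx)
        w_sgd -= learning_rate * grad

print(w_sgd.ravel())  # should end up close to the true values [2, 3]
```

Each update touches only 20 rows, yet the weights still drift toward the coefficients used to generate the data.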

### Gradient descent optimizations


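For binary classification we switch to logistic regression, where $p_i = \sigma(x_i^T w)$ is the predicted probability of the positive class and the loss is the binary cross-entropy:

$$Loss(y, p) = -\sum_{i=1}^{n} (y_i \log (p_i) + (1 - y_i) \log (1 - p_i))$$

Its gradient with respect to the weights $w$ is:

$$ \frac{\partial Loss}{\partial w} = X^T (p - y)$$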
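
A quick way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the loss. The sketch below does this on random data. The helper names `sigmoid_demo`, `bce_loss`, and `bce_grad` are hypothetical and kept separate from the notebook's own functions.

```python
import numpy as np

def sigmoid_demo(h):
    return 1.0 / (1.0 + np.exp(-h))

def bce_loss(w, X, y):
    # summed binary cross-entropy, matching the loss formula above
    p = sigmoid_demo(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(w, X, y):
    # analytic gradient X^T (p - y)
    return X.T @ (sigmoid_demo(X @ w) - y)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(30, 3))
y_chk = rng.integers(0, 2, size=30).astype(float)
w_chk = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w_chk)
for j in range(w_chk.size):
    step = np.zeros_like(w_chk)
    step[j] = eps
    numeric[j] = (bce_loss(w_chk + step, X_chk, y_chk)
                  - bce_loss(w_chk - step, X_chk, y_chk)) / (2 * eps)

print(np.allclose(numeric, bce_grad(w_chk, X_chk, y_chk), atol=1e-5))
```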

In [85]:
def logit(x, w): 
    # linear score x @ w (the log-odds fed into the sigmoid)
    return np.dot(x, w)

def sigmoid(h):
    return 1. / (1 + np.exp(-h))

def chunks(X, y, batch_size):
    # perm - array of permuted indexes, reshuffled on every call
    perm = np.random.permutation(len(X))
    # samples left over after the last full batch are skipped
    for batch_idx in range(len(X)//batch_size):
        start_idx = batch_idx * batch_size
        end_idx = (batch_idx + 1) * batch_size
        yield X[perm[start_idx:end_idx]], y[perm[start_idx:end_idx]] # [0:10], [10:20], [20:30] ...if batch size=10
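
When the dataset size is not a multiple of the batch size, the generator above leaves some samples unused in that pass. One possible variant, sketched here under the hypothetical name `chunks_with_remainder`, also yields the final partial batch:

```python
import numpy as np

def chunks_with_remainder(X, y, batch_size, rng=None):
    """Yield shuffled mini-batches; the last one may be smaller than batch_size."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]

# Hypothetical toy arrays with 25 rows, batch size 10
X_small = np.arange(50).reshape(25, 2)
y_small = np.arange(25).reshape(-1, 1)
sizes = [len(xb) for xb, _ in chunks_with_remainder(X_small, y_small, batch_size=10)]
print(sizes)  # [10, 10, 5]
```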

In [86]:
class Logistic_Regression():
    def __init__(self):
        self.W = None
    
    def fit(self, X, y, epochs= 10, lr=0.5, batch_size=100):
        rows, features = X.shape   
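        if self.W is None:
            # random initial weights: one per feature plus the bias
            # np.random.seed(66)
            self.W = np.random.randn(features + 1)
            # self.W = np.zeros(features + 1)
            print(self.W) 
            
        # prepend a column of ones for the bias term
        XX = np.concatenate((np.ones((rows, 1)), X), axis=1) 
        losses = []
        for i in range(epochs):
            for X_batch, y_batch in chunks(XX, y, batch_size):
                # column vector of probabilities, same shape as y_batch
                predictions = self.predict_proba(X_batch)[:,np.newaxis]
                # loss = self.__loss(y_batch, predictions)               
                # losses.append(loss)

                self.W -= lr * self.get_grad(X_batch, y_batch, predictions) 

            # loss of the last mini-batch of the epoch (computed before its update), hence the noisy curve
            loss = self.__loss(y_batch, predictions)
            losses.append(loss) 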

            if i % 50 == 0: 
                print(f'epoch {i}, loss {loss}')
            if loss < 1e-10:
                break
                
        return losses 
    
    def get_grad(self, X_batch, y_batch, predictions):

        grad_basic = X_batch.T @ (predictions - y_batch)  

        return grad_basic.ravel()
        

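    def predict_proba(self, X): 
        # X must already contain the leading column of ones (fit() adds it internally)
        return sigmoid(logit(X, self.W))

    def predict(self, X, threshold=0.5):
        return self.predict_proba(X) >= threshold
      
    def __loss(self, y, p):  
        # clip probabilities away from 0 and 1 so the logarithms stay finite
        p = np.clip(p, 1e-8, 1 - 1e-8) 
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))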

In [92]:
lr = Logistic_Regression()
XX = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1) 

losses= lr.fit(X, y - 1, epochs=301, lr=.01, batch_size=10)
fig=px.line(losses, title= "Loss function")
fig.update_layout(yaxis_title="Loss",
                xaxis_title = 'Epochs')
fig.show()
print(accuracy_score(lr.predict(XX), y - 1))

[-0.94034109 -1.08627122 -1.37284787  0.06320324 -0.07778473]
epoch 0, loss 6.866616766261986
epoch 50, loss 2.94822556184178
epoch 100, loss 3.019407706133533
epoch 150, loss 0.5101084325634185
epoch 200, loss 0.4169767827266931
epoch 250, loss 0.9878199358252409
epoch 300, loss 0.4388660889694349


0.97


In [88]:
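# y still holds labels 1 and 2 while predict() returns 0/1 booleans,
# so this unshifted comparison is not a meaningful score (use y - 1 as above)
accuracy_score(lr.predict(XX), y)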

0.03

In [93]:
# ravel arrays to conform to the DataFrame input format
df = pd.DataFrame({'Predicted':lr.predict(XX).ravel(), 'GT': y.ravel()-1})
df["Result"] = df.apply(lambda x: 'TP' if x.Predicted == x.GT else 'FP', axis= 1 )
df[df.Predicted != df.GT]

| |Predicted|GT|Result|
|:---|:---|:---|:---|
|20|True|0|FP|
|22|True|0|FP|
|33|True|0|FP|


In [94]:
fig =px.scatter(x=X[:,2], y=X[:,3], color= df.GT.astype(str)+':' + df.Result, title= 'Visualizing the Positive and Negative predictions')
fig.update_layout(yaxis_title="petal width (cm)",
                xaxis_title = 'petal length (cm)')
fig.show()