# Scalars, vectors, and matrices

The definitions below are commonly used in ML practice:

## Scalars

**Scalar** - a single number. Scalars are denoted with lowercase letters. For example,

$a = 1$

$b = 67.5$

$c = -100$

$\lambda = 0.01$

## Vectors

**Vector** - an ordered collection of numbers, written between square brackets. Vectors are denoted with bold lowercase letters. For example,

$$\textbf{v} = [v_{1}, v_{2}, ..., v_{n}]$$

Usually, a vector is defined as a column vector with $n$ rows and one column:

$$\textbf{v} = \begin{bmatrix} v_{1} \\ v_{2} \\ ... \\ v_{n} \end{bmatrix}_{n \times 1}$$

## Matrices 

**Matrix** - a rectangular table of numbers, written between square brackets. Each matrix element is identified by its row and column index. For example,

$$\mathbb{X} = \begin{bmatrix} x_{11} & x_{12} & ... & x_{1p} \\ x_{21} & x_{22} & ... & x_{2p} \\ ... \\ x_{n1} & x_{n2} & ... & x_{np} \end{bmatrix}_{n \times p}$$

## Tensors

**Tensor** - an $n^{th}$-rank tensor in $m$-dimensional space is a mathematical object that has $n$ indices and $m^{n}$ components and obeys certain transformation rules. It is a generalization of a vector and a matrix: a vector is a 1st-rank tensor and a matrix is a 2nd-rank tensor. For example, in a 3rd-rank tensor every element is identified by three coordinates.

![](docs/tensor.png)

The above tensor is a 3rd-rank tensor in which each scalar is identified by its $x$, $y$, and $z$ coordinates.
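
The four objects above can be sketched with NumPy arrays. This is only an illustration: the variable names and the values are made up, and NumPy is an assumed choice of library. Note that NumPy counts indices from 0, while the notation above counts from 1.

```python
import numpy as np

# A scalar: a single number (0 dimensions)
lam = np.array(0.01)

# A vector: an ordered collection of numbers (1 dimension)
v = np.array([1.0, 2.0, 3.0])

# A matrix: a rectangular table with 2 rows and 3 columns (2 dimensions)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# A 3rd-rank tensor: every element is addressed by three indices
T = np.zeros((2, 3, 4))

for name, obj in [("scalar", lam), ("vector", v), ("matrix", X), ("tensor", T)]:
    print(f"{name}: ndim={obj.ndim}, shape={obj.shape}")

# Accessing a matrix element by row and column index (first row, second column)
print(X[0, 1])
```

The `ndim` attribute gives the number of indices needed to address one element, which matches the rank described above.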

# Functions 

## Function definition

A function in math is a rule that maps items from one set to another. 

The mathematical notation for a function is: 

$$f: \mathbb{X} \rightarrow \mathbb{Y}$$

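The above notation says that $f$ maps every item of the set $\mathbb{X}$ to an item of the set $\mathbb{Y}$.

The sets $\mathbb{X}$ and $\mathbb{Y}$ are called the `domain` and `codomain` of the function, respectively. The subset of $\mathbb{Y}$ that the function actually reaches is called the `range`.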

The $\mathbb{X}$ set can hold any type of data: scalars, vectors, matrices, tensors, etc.  

The $\mathbb{Y}$ set in ML usually holds a scalar or vector.

### Example

A simple example is a function that maps a 2d vector to a scalar.

Let us say that $\mathbb{X} = \mathbb{R}^{2}$ (each observation in the dataset is a 2d vector) and $\mathbb{Y} = \mathbb{R}^{1}$ (the output is a scalar). Then, for an observation $x = (x_{1}, x_{2})$ and fixed coefficients $\beta_{0}, \beta_{1}, \beta_{2}$,

$$ f(x) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} $$
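
As a minimal sketch, the same function can be written in Python. The coefficient values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1, beta_2 = 1.0, 2.0, -0.5

def f(x):
    """Map a 2d vector x = (x_1, x_2) to a scalar."""
    x_1, x_2 = x
    return beta_0 + beta_1 * x_1 + beta_2 * x_2

x = np.array([3.0, 4.0])
y = f(x)

# 1 + 2 * 3 - 0.5 * 4 = 5.0
print(y)
```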




# Machine Learning

Machine Learning (**ML**) - the process by which a computer system, given a metric of `goodness`, improves the value of that metric as it is given more `data`. The key points here are `metric` and `data`: without them we cannot define any machine learning model.

![](docs/ml-high-level.png)

The machine can now create rules based on the data and the evidence we give it. A programmer does not need to spend countless hours creating rules based on the data they see; the computer can do that much more efficiently.

Often, the terms machine learning and machine learning model are used interchangeably.

A machine learning model is a function that maps data to a scalar or vector: 

$$ f : \mathbb{X}^{n} \rightarrow \mathbb{Y}$$

Thus machine learning model $\approx$ function. 

## Loss function 

At the heart of every machine learning algorithm is a `loss function`, usually denoted by an uppercase $L$. The loss function measures how **good** the predictions of the machine learning algorithm are by quantifying their error. For example, in a typical linear regression model, the loss function is the **mean squared error**:

$$MSE = \dfrac{1}{n} \sum_{i=1}^{n} \left(y_{i} - \widehat{y_{i}} \right)^{2}$$

Here

$n$ - the number of observations

$i$ - the index of the observation

$y_{i}$ - the true value of observation $i$

$\widehat{y_{i}}$ - the predicted value of observation $i$
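
A minimal NumPy sketch of the mean squared error follows. The function name `mse` and the three example values are assumptions made for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up values: the squared errors are 1, 0 and 4
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

loss = mse(y_true, y_pred)  # (1 + 0 + 4) / 3
print(loss)
```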

For binary classification problems, a popular loss function is the **binary cross entropy**:

$$CE = - \dfrac{1}{n} \sum_{i=1}^{n}\left( y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \right)$$

Here 

$p_{i} = P(y_{i} = 1 | x_{i}) \in (0, 1)$ 
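
The binary cross entropy can be sketched in the same way. The function name, the `eps` clipping parameter (added here only to keep the logarithm finite) and the example probabilities are assumptions of this sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Binary cross entropy for labels in {0, 1} and predicted probabilities p."""
    y_true = np.asarray(y_true, dtype=float)
    # Clipping keeps log() finite if a probability is exactly 0 or 1
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = [1, 0, 1, 0]

# Probabilities that lean towards the correct class
confident = binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.2])

# Probabilities that carry no information
unsure = binary_cross_entropy(y_true, [0.5, 0.5, 0.5, 0.5])

print(f"confident: {confident:.4f}, unsure: {unsure:.4f}")
```

Predictions that put more probability on the correct class produce a lower loss than uninformative predictions of 0.5.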

In both the regression and the classification problems, the **smaller the loss, the better the model**.


# Optimization 

The task of optimization in math is to find the arguments that either **minimize** or **maximize** a given function. In most ML algorithms, the general rule is to **minimize** the loss function. 

The notation for finding the arguments that minimize the loss function is:

$$\underset{x}{\mathrm{argmin}} f(x)$$

In machine learning, the data (meaning $\mathbb{X}$ and $\mathbb{Y}$) is fixed and the only things that a model can change are the coefficients (sometimes called weights). Thus, the goal of the optimization is to find the coefficients that minimize the loss function. Accordingly, loss functions are treated as functions of the coefficients.
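
To make the $\mathrm{argmin}$ notation concrete, here is a small sketch that searches a grid of candidate arguments. The function $f(x) = (x - 2)^{2} + 1$ and the grid are made up for illustration; a grid search is used only because it is easy to read, not because it is how models are trained.

```python
import numpy as np

def f(x):
    """A made-up function with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0

# Candidate arguments between -5 and 5 with a step of 0.01
grid = np.linspace(-5.0, 5.0, 1001)
values = f(grid)

# np.argmin returns the position of the smallest value,
# so we look up the corresponding argument in the grid
best_index = np.argmin(values)
x_best = grid[best_index]

print(f"argmin: {x_best:.2f}, minimum value: {values[best_index]:.2f}")
```

The result of $\mathrm{argmin}$ is the argument `x_best`, not the minimum value of the function itself.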

## Test case - predicting weight based on height for the NBA players

The data is taken from https://www.kaggle.com/datasets/justinas/nba-players-data. 

In [None]:
# Data reading package 
import pandas as pd 

# Plotting
import matplotlib.pyplot as plt

In [None]:
# Reading the data 
d = pd.read_csv('data/nba-data.csv')
print(f"Number of observations: {d.shape[0]}")
print(d.head())

In [None]:
# Plotting the relationship between player_height and player_weight
d.plot(x='player_height', y='player_weight', kind='scatter', figsize=(10,10))
plt.title('Height vs Weight')
plt.show()

There is a clear linear relationship here: the taller the player, the heavier the player. We will try to fit a linear model to the data:

$$ w = \beta_{0} + \beta_{1}h $$ 

Here 

$w$ - the weight

$h$ - the height

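$\beta_{0}$ - the intercept (the predicted weight at $h = 0$; when the height is centered around its mean, this is the average weight)

$\beta_{1}$ - the slope (the change in weight per unit change in height)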

## Constructing the machine learning problem 

In order to solve the problem, we need to define the data, the model, the loss function and an optimization algorithm for the loss function. 

The data is the height and the weight of the players in the NBA: 

$$ \mathbb{D} := \{(h_{i}, w_{i})\} $$ 

$$ i \in \{1, 2, ..., 11700\} $$

The model: 

$$ w_{i} = \beta_{0} + \beta_{1}h_{i} $$ 

The loss function is the **mean squared error**:

$$L(\beta_{0}, \beta_{1}) = \dfrac{1}{N} \sum_{i=1}^{N} \left(w_{i} - (\beta_{0} + \beta_{1} h_{i}) \right)^{2}$$ 

The optimization algorithm is **gradient descent**. 

### Gradient descent 

The main idea of gradient descent is that the derivative tells us in which direction a differentiable function decreases: if the derivative at a point $x$ is negative, the function decreases as $x$ increases, and if it is positive, the function decreases as $x$ decreases. Moving a small step against the sign of the derivative therefore lowers the value of the function.

The algorithm is iterative: at each step, the coefficients $\beta$ are updated based on the derivative of the loss function at the current coefficients.

We set a number of iterations $M$ and the learning rate $\alpha > 0$. 

Then, for $m = 0$ to $M - 1$, we update the coefficients $\beta$ as follows:

$$ \beta_{0}^{m + 1} \leftarrow \beta_{0}^{m} - \alpha \dfrac{\partial L}{\partial \beta_{0}}$$

$$ \beta_{1}^{m + 1} \leftarrow \beta_{1}^{m} - \alpha \dfrac{\partial L}{\partial \beta_{1}}$$

Here 

$m$ - iteration number 
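
Before applying the update rule to the loss function, it may help to see it on a function of a single variable. The sketch below minimizes the made-up function $f(x) = (x - 3)^{2}$; the starting point, learning rate and number of iterations are assumed values.

```python
def f(x):
    """A made-up function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def df(x):
    """Derivative of f."""
    return 2.0 * (x - 3.0)

alpha = 0.1  # learning rate (assumed value)
M = 50       # number of iterations (assumed value)
x = 0.0      # starting point (assumed value)

for m in range(M):
    # Step against the sign of the derivative
    x = x - alpha * df(x)

print(f"x after {M} iterations: {x:.5f}, f(x): {f(x):.2e}")
```

At the starting point the derivative is negative, so each update increases $x$ and moves it towards the minimum at 3.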

In order to implement the algorithm, we need to define the partial derivatives. 

$$\dfrac{\partial L}{\partial \beta_{0}} = \dfrac{1}{N} \Sigma_{i=1}^{N} -2 (w_{i} - (\beta_{0} + \beta_{1} h_{i}))$$

$$\dfrac{\partial L}{\partial \beta_{1}} = \dfrac{1}{N} \Sigma_{i=1}^{N} -2 h_{i} (w_{i} - (\beta_{0} + \beta_{1} h_{i}))$$
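
One way to gain confidence in the two partial derivatives is to compare them with a numerical approximation. The sketch below does this with central finite differences on synthetic data; the function names, the random data and the evaluation point are all assumptions made for illustration.

```python
import numpy as np

def loss(h, w, beta_0, beta_1):
    """Mean squared error of the linear model."""
    return np.mean((w - (beta_0 + beta_1 * h)) ** 2)

def analytic_gradient(h, w, beta_0, beta_1):
    """Partial derivatives from the two formulas above."""
    residual = w - (beta_0 + beta_1 * h)
    return np.mean(-2 * residual), np.mean(-2 * h * residual)

def numeric_gradient(h, w, beta_0, beta_1, eps=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    d0 = (loss(h, w, beta_0 + eps, beta_1) - loss(h, w, beta_0 - eps, beta_1)) / (2 * eps)
    d1 = (loss(h, w, beta_0, beta_1 + eps) - loss(h, w, beta_0, beta_1 - eps)) / (2 * eps)
    return d0, d1

# Synthetic data, used only for the check
rng = np.random.default_rng(0)
h = rng.normal(size=200)
w = 0.8 * h + rng.normal(scale=0.5, size=200)

analytic = analytic_gradient(h, w, beta_0=0.3, beta_1=-0.2)
numeric = numeric_gradient(h, w, beta_0=0.3, beta_1=-0.2)

print(f"analytic: {analytic}")
print(f"numeric:  {numeric}")
```

If the formulas are correct, the two pairs of numbers should agree up to a small numerical error.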

In [None]:
def MSE(x, y, beta_0, beta_1):
    """
    Mean squared error implementation
    """
    n = len(x)
    mse = 0
    for i in range(n):
        mse += (y[i] - (beta_0 + beta_1 * x[i])) ** 2
    return mse / n

def gradient_update(x, y, beta_0, beta_1, alpha): 
    """
    Weight update based on gradient implementation for 1 epoch
    """
    dL_d0 = 0
    dL_d1 = 0

    # Saving the number of obs 
    n = len(x)

    # Getting the sums 
    for i in range(n):
        dL_d0 += -2 * (y[i] - (beta_0 + beta_1 * x[i]))
        dL_d1 += -2 * x[i] * (y[i] - (beta_0 + beta_1 * x[i]))

    # Getting the means 
    dL_d0 /= n
    dL_d1 /= n

    # Updating the betas
    beta_0 = beta_0 - alpha * dL_d0
    beta_1 = beta_1 - alpha * dL_d1

    # Returning the updated weights 
    return beta_0, beta_1

def gradient_descent(x, y, beta_0, beta_1, alpha, epochs):
    """
    Gradient descent implementation for multiple epochs
    """
    for i in range(epochs):
        beta_0, beta_1 = gradient_update(x, y, beta_0, beta_1, alpha)

        # Logging the loss every 10 epochs 
        if i % 10 == 0:
            print(f"Epoch {i}: {MSE(x, y, beta_0, beta_1)}")
    return beta_0, beta_1
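
The loops above mirror the formulas term by term. As an alternative sketch, the same update can be written with NumPy array operations, which avoids the explicit Python loops over observations. The function name `gradient_descent_vectorized` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gradient_descent_vectorized(x, y, beta_0, beta_1, alpha, epochs):
    """Gradient descent for the linear model using array operations."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    for _ in range(epochs):
        residual = y - (beta_0 + beta_1 * x)
        # Means of the per-observation derivatives
        dL_d0 = np.mean(-2 * residual)
        dL_d1 = np.mean(-2 * x * residual)
        beta_0 = beta_0 - alpha * dL_d0
        beta_1 = beta_1 - alpha * dL_d1
    return beta_0, beta_1

# Synthetic data generated from y = 1.5 + 2.0 * x plus a little noise
rng = np.random.default_rng(42)
x_demo = rng.normal(size=500)
y_demo = 1.5 + 2.0 * x_demo + rng.normal(scale=0.1, size=500)

b0, b1 = gradient_descent_vectorized(x_demo, y_demo, 0.0, 0.0, alpha=0.1, epochs=500)
print(f"Estimated intercept: {b0:.3f}, estimated slope: {b1:.3f}")
```

On this synthetic data the estimates should land close to the intercept and slope used to generate it.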

## Data standardization 

Gradient descent works best when the data is standardized. To ensure that gradient descent moves smoothly towards the minimum and that the coefficients of all the features are updated at a similar rate, we scale the data before feeding it to the model. A popular standardization technique is **mean-variance** scaling (also known as z-score standardization):

We transform every entry in $x$ by subtracting the mean of $x$ and dividing by its standard deviation. 

$$ z_{i} = \dfrac{x_{i} - \overline{x}}{\sigma(x)}$$ 
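
The transformation can be sketched by hand with NumPy. The height values below are made up. The sketch uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Made-up heights, used only for illustration
x = np.array([180.0, 190.0, 200.0, 210.0, 220.0])

x_mean = x.mean()
x_std = x.std()  # population standard deviation (ddof=0)

# Standardizing: subtract the mean, divide by the standard deviation
z = (x - x_mean) / x_std

# Inverting the transformation recovers the original values
x_back = z * x_std + x_mean

print(f"z: {np.round(z, 3)}")
print(f"mean of z: {z.mean():.3f}, std of z: {z.std():.3f}")
```

After the transformation the values have a mean of 0 and a standard deviation of 1, and the inverse step is what allows predictions to be mapped back to the original units.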

## Applying the gradient descent algorithm

In [None]:
# Sorting the x variable
d = d.sort_values(by='player_height')

# Selecting the variables to standardize 
x = d['player_height']
y = d['player_weight']

# Importing the scalers 
from sklearn.preprocessing import StandardScaler
_x_scaler = StandardScaler()
_y_scaler = StandardScaler()

# Applying 
z_x = _x_scaler.fit_transform(x.values.reshape(-1, 1))
z_y = _y_scaler.fit_transform(y.values.reshape(-1, 1))

In [None]:
# Searching for the best parameters 
alpha = 0.1
epochs = 100
beta_0 = 0
beta_1 = 0

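# Flattening the (n, 1) arrays so that the coefficients and the logged loss are plain numbers
_beta_0, _beta_1 = gradient_descent(x=z_x.ravel(), y=z_y.ravel(), beta_0=beta_0, beta_1=beta_1, alpha=alpha, epochs=epochs)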

In [None]:
print(f"Gradient descent results: {_beta_0}, {_beta_1}")

In [None]:
# Getting the predictions 
_predictions_gd = _beta_0 + _beta_1 * z_x

# Scaling back 
_predictions_gd = _y_scaler.inverse_transform(_predictions_gd.reshape(-1, 1))

In [None]:
# Linear regression in scikit learn 
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(z_x, z_y)

# Getting the coefficients
print(f"Scikit learn results: {lr.intercept_}, {lr.coef_}")

In [None]:
# Getting the predictions 
predictions = lr.predict(z_x)

# Inverse transforming 
predictions = _y_scaler.inverse_transform(predictions.reshape(-1, 1))

In [None]:
# Plotting the true and predicted values
d.plot(x='player_height', y='player_weight', kind='scatter', figsize=(10,10))
plt.title('Height vs Weight')
plt.plot(x, predictions, 'r', label='Predicted - Linear Regression')
plt.plot(x, _predictions_gd, 'g', label='Predicted - Gradient Descent')
plt.legend()
plt.show()