# PCA derivation with Lagrange multipliers and usage

# Outline
- [ 1 - Theory behind PCA](#1)
- [ 2 - Import Data and brief EDA](#2)
- [ 3 - Feature selection and train custom model](#3)
- [ 4 - Training sklearn model and fine tuning](#4)

<a name="1"></a>
## 1 - Theory behind PCA

<br>The goal of PCA is to find a set of orthogonal vectors (principal components) that capture the maximum variance in the data. We want to find these vectors in order of decreasing variance explained.
<br>Deriving PCA with Lagrange multipliers is an elegant way to obtain the principal components: it leads directly to the directions of maximum variance in the data. Let's break down this process step by step!

1.	Matrix Representation of Data: First, let's consider how data is represented in matrix form. Suppose we have n observations of p variables. We can represent this data as a matrix X:

$$
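X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}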
$$

Where xᵢⱼ represents the value of the j-th variable for the i-th observation.

2.	Variance-Covariance Matrix: The variance of this multivariate data is typically represented by the variance-covariance matrix, often denoted as Σ (sigma). This matrix is a p × p symmetric matrix where:
<br>•	The diagonal elements (i = j) represent the variances of individual variables.
<br>•	The off-diagonal elements (i ≠ j) represent the covariances between pairs of variables.

$$
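\Sigma = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{bmatrix}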
$$

3.	Calculating the Variance-Covariance Matrix: To calculate Σ, we use the following formula:

$$
\Sigma = \frac{1}{n-1} (X - \bar{X})^T (X - \bar{X})
$$
<br>Where:


<br>•	X is the original data matrix
<br>•	X̄ is an n × p matrix in which every entry of column j equals the mean of column j of X
<br>•	(X - X̄) is the centered data matrix
<br>•	ᵀ denotes the transpose of a matrix
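
As a quick sanity check, the formula above can be reproduced with NumPy. This is a minimal sketch on synthetic data; the array sizes, the random seed, and the variable names (`X_bar`, `X_centered`, `Sigma`) are assumptions made for illustration and are not part of the dataset used later.

```python
import numpy as np

# Synthetic data: n observations of p variables (illustrative values only)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Column means; broadcasting subtracts the mean of column j from every row
X_bar = X.mean(axis=0)
X_centered = X - X_bar

# Sample variance-covariance matrix: (1 / (n - 1)) * (X - X_bar)^T (X - X_bar)
Sigma = X_centered.T @ X_centered / (n - 1)

# np.cov with rowvar=False treats columns as variables and also divides by n - 1
assert np.allclose(Sigma, np.cov(X, rowvar=False))
# The result is a symmetric p x p matrix
assert Sigma.shape == (p, p)
assert np.allclose(Sigma, Sigma.T)
```

Broadcasting `X - X_bar` plays the role of subtracting the full n × p matrix X̄ without building it explicitly.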


4.	Properties of the Variance-Covariance Matrix:
<br>•	It is always symmetric: σᵢⱼ = σⱼᵢ
<br>•	The diagonal elements (σᵢᵢ) represent the variance of the i-th variable
<br>•	The off-diagonal elements (σᵢⱼ, i ≠ j) represent the covariance between the i-th and j-th variables
5.	Interpretation:
<br>•	Large values on the diagonal indicate high variability in that particular variable.
<br>•	Off-diagonal elements show how variables covary with each other. Positive values indicate positive correlation, negative values indicate negative correlation, and values close to zero indicate little to no linear relationship.
6.	Correlation Matrix: Sometimes, it's useful to standardize the variance-covariance matrix to get the correlation matrix. This is done by dividing each element by the product of the standard deviations of the corresponding variables:

$$
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}} \, \sqrt{\sigma_{jj}}}
$$

This matrix has 1's on the diagonal and correlation coefficients (between -1 and 1) on the off-diagonal elements.
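
The standardization step can be sketched in NumPy as follows. The synthetic data, the seed, and the names `std` and `R` are assumptions chosen for illustration; the point is only that dividing Σ element-wise by the outer product of the standard deviations gives the same result as `np.corrcoef`.

```python
import numpy as np

# Synthetic data with one deliberately correlated pair of columns
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]

# Variance-covariance matrix (columns are variables)
Sigma = np.cov(X, rowvar=False)

# Standard deviations are the square roots of the diagonal of Sigma
std = np.sqrt(np.diag(Sigma))

# rho_ij = sigma_ij / (sqrt(sigma_ii) * sqrt(sigma_jj))
R = Sigma / np.outer(std, std)

# Ones on the diagonal, and agreement with NumPy's own correlation matrix
assert np.allclose(np.diag(R), 1.0)
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```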

Let's consider a data matrix X (n × p), where n is the number of observations and p is the number of variables. We assume the data is centered (mean-subtracted).

We want to find a unit vector w that maximizes the variance of the projected data:
$$
\begin{cases}
        \max_{w} \left( \operatorname{var}(Xw) = w^T \Sigma w \right)\\
        \| w \| = 1
\end{cases}
$$

<br>where Σ is the covariance matrix of X.

Breaking down the projections on w:
<br>1.	Starting with the data matrix X: X is an n × p matrix, where n is the number of observations and p is the number of variables. We assume X is centered (mean-subtracted).
<br>2.	Projection onto w: When we project X onto a unit vector w, we get Xw, which is an n × 1 vector.
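<br>3.	Variance of the projection: The variance of this projection is what we want to maximize. Because X is centered, the projection Xw also has zero mean, so its sample variance is:

$$
\operatorname{var}(Xw) = \frac{1}{n-1} (Xw)^T (Xw) = w^T \left( \frac{1}{n-1} X^T X \right) w = w^T \Sigma w
$$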
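
The identity var(Xw) = wᵀΣw for centered data can also be checked numerically. The sketch below uses synthetic data and an arbitrary random unit vector; the mixing matrix, seed, and variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data with unequal spread across directions, then centered
rng = np.random.default_rng(2)
n, p = 500, 3
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(n, p)) @ mixing
X = X - X.mean(axis=0)

# Covariance matrix of the centered data
Sigma = X.T @ X / (n - 1)

# An arbitrary unit vector w
w = rng.normal(size=p)
w = w / np.linalg.norm(w)

# Projection of every observation onto w: an n-vector
projection = X @ w

# Sample variance of the projection vs. the quadratic form w^T Sigma w
var_direct = projection.var(ddof=1)
var_quadratic = w @ Sigma @ w

assert np.isclose(var_direct, var_quadratic)
```

Using `ddof=1` matches the 1 / (n - 1) factor in the definition of Σ.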

<br>Let's write down a Lagrangian of our optimization problem:
$$
L(w,\lambda) = w^T \Sigma w - \lambda (w^T w - 1)
$$
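
To make the role of the constraint term concrete, here is a small sketch that evaluates this Lagrangian. The 2 × 2 matrix `Sigma`, the vectors, and the helper name `lagrangian` are hypothetical and chosen only for illustration. For any unit vector the term λ(wᵀw − 1) vanishes, so the Lagrangian equals the projected variance regardless of λ; off the constraint, the value depends on λ.

```python
import numpy as np

def lagrangian(w, lam, Sigma):
    """Evaluate L(w, lambda) = w^T Sigma w - lambda * (w^T w - 1)."""
    return w @ Sigma @ w - lam * (w @ w - 1.0)

# Hypothetical symmetric covariance matrix, for illustration only
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

w_unit = np.array([0.6, 0.8])   # satisfies ||w|| = 1
w_long = 2.0 * w_unit           # violates the constraint (||w|| = 2)

# On the constraint, L reduces to the variance w^T Sigma w for every lambda
assert np.isclose(lagrangian(w_unit, 3.0, Sigma), w_unit @ Sigma @ w_unit)
assert np.isclose(lagrangian(w_unit, 3.0, Sigma), lagrangian(w_unit, -1.0, Sigma))

# Off the constraint, different multipliers give different values
assert not np.isclose(lagrangian(w_long, 3.0, Sigma), lagrangian(w_long, -1.0, Sigma))
```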