In [1]:
%%html
<style>
.output_subarea.output_text.output_stream.output_stdout > pre {
    width:max-content;
}
.p-Widget.jp-RenderedText.jp-OutputArea-output > pre {
   width:max-content;
}
</style>

In [2]:
import install_ophelia

from ophelia.spark.OpheliaMain import Ophelia

In [3]:
ophelia = Ophelia("Spark Singular Value Decomposition")

22:13:49.007 Ophelia [TAPE] +---------------------------------------------------------------------+
22:13:49.008 Ophelia [INFO] | My name is Ophelia Vendata                                          |
22:13:49.008 Ophelia [INFO] | I am an artificial assistant for data mining & ML engine with spark |
22:13:49.008 Ophelia [INFO] | Welcome to Ophelia spark miner engine                               |
22:13:49.008 Ophelia [INFO] | Lib Version Ophelia.dev1.0                                          |
22:13:49.008 Ophelia [WARN] | V for Vendata...                                                    |
22:13:49.008 Ophelia [TAPE] +---------------------------------------------------------------------+
22:13:49.008 Ophelia [WARN] Initializing Spark Session
22:13:58.168 Ophelia [INFO] Spark Version: 3.0.0
22:13:58.168 Ophelia [INFO] This Is: 'Spark Singular Value Decomposition' App


In [4]:
spark = ophelia.SparkSession

In [5]:
day_price_path = 'data/staging/benchmark/close_day_price'
day_price_df = spark.read.parquet(day_price_path)

### The calculation is performed using Singular Value Decomposition (SVD). The SVD of any $m \times n$ matrix $A$ is:

$$A = U \Sigma V^{T}$$

### Where $U$ is an $m \times m$ orthogonal matrix whose columns are the eigenvectors of $AA^{T}$, $V$ is an $n \times n$ orthogonal matrix whose columns are the eigenvectors of $A^{T}A$, and $\Sigma$ is an $m \times n$ diagonal matrix, i.e. its values are zero except along the diagonal.

### When applying PCA we have to center the data and, depending on its nature, we may also need to standardize it (rescale each feature to a mean of 0 and a variance of 1). If the columns are on different scales, such as year, temperature, and carbon dioxide concentration, we have to standardize the data. If the columns are in the same unit, on the other hand, standardization can discard important information. In that second case, when the columns share a unit and a similar scale, we run the SVD on the centered data, which corresponds to PCA on the covariance matrix; when the units differ we standardize first, which corresponds to PCA on the correlation matrix.
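
### As a small illustration of the standardization step, the sketch below rescales each column to mean 0 and unit standard deviation using only the Python standard library. The sample rows and the helper name `standardize_columns` are hypothetical, and the sample (n - 1) standard deviation is an assumption made for the example.

```python
from statistics import mean, stdev


def standardize_columns(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical columns on very different scales: year, temperature, CO2 (ppm).
rows = [
    [2018.0, 14.1, 408.5],
    [2019.0, 14.3, 411.4],
    [2020.0, 14.9, 414.2],
    [2021.0, 14.6, 416.5],
]
scaled = standardize_columns(rows)
```

### After this step every column contributes on the same scale, so no single feature dominates the decomposition only because of its unit.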

### The principal components (PC) are the matrix product of the original data $A$ and the matrix $V$, which is equal to the product of the matrices $U$ and $\Sigma$.
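
### To see why both products agree, multiply the decomposition by $V$ and use the orthogonality of $V$ ($V^{T}V = I$). The same factorization also gives the variance carried by each component, assuming $A$ has been centered and has $m$ rows:

$$AV = U \Sigma V^{T} V = U \Sigma$$

$$\frac{1}{m-1} A^{T}A = V \left( \frac{\Sigma^{T}\Sigma}{m-1} \right) V^{T}$$

### So the $i$-th principal component has variance $\sigma_i^{2}/(m-1)$, where $\sigma_i$ is the $i$-th diagonal entry of $\Sigma$.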

# Singular Value Decomposition analysis.

### As a first step we take two input parameters. The first, ___n___, is the total number of rows in the dataframe. The second, ___d___, is the total number of columns, called _features_. Together they define a matrix with _(n, d)_ dimensions.

### What we want to confirm is that every vector $\vec{V_i}$ of length __d__ is a _dense vector_. That is, we want full vectors without any null values.

### Let's standardize these dense vectors of length __d__ with the _Standard Scaler_ method, i.e. each feature is re-scaled using its mean and standard deviation.

### In order to compute the SVD we have to transform the Spark dataframe into a matrix object whose rows are the indexed, scaled feature vectors. For that we will use the _IndexedRowMatrix_ class.

### Now let's compute the singular value decomposition of the IndexedRowMatrix. The given row matrix $A$ of dimension __$(m \times n)$__ is decomposed into
### _$$U s V^{T}$$ where:_
* $U$: $(m \times k)$ __*left singular vectors*, an IndexedRowMatrix whose columns are the eigenvectors of $(A A^{T})$__
* $s$: __DenseVector of the *singular values*, the square roots of the eigenvalues of $(A^{T} A)$, in descending order.__
* $V$: $(n \times k)$ __*right singular vectors*, a Matrix whose columns are the eigenvectors of $(A^{T} A)$__
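
### The relations in this list can be checked by hand on a tiny matrix. The sketch below is a pure-Python illustration, not the Spark implementation: it builds $V$ from the eigenvectors of $A^{T}A$, takes $s$ as the square roots of the eigenvalues in descending order, and recovers $U$ as $A V s^{-1}$. The helper name `svd_two_columns` and the sample matrix are hypothetical, and the closed form only covers matrices with two columns and full column rank.

```python
import math


def svd_two_columns(A):
    """SVD of a matrix with two columns via the eigenvectors of A^T A.

    Returns (U, s, V): U as a list of rows (m x 2), s as [s1, s2] in
    descending order, V as a list of rows (2 x 2).
    Assumes full column rank, so both singular values are positive.
    """
    # Entries of the symmetric 2 x 2 matrix A^T A = [[a, b], [b, c]].
    a = sum(row[0] * row[0] for row in A)
    b = sum(row[0] * row[1] for row in A)
    c = sum(row[1] * row[1] for row in A)

    # Eigenvalues of A^T A in descending order (closed form for 2 x 2).
    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = [half_trace + radius, half_trace - radius]

    # Unit eigenvector of the largest eigenvalue; the second is orthogonal.
    if b != 0.0:
        v1 = (b, eigenvalues[0] - a)
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])

    s = [math.sqrt(ev) for ev in eigenvalues]
    V = [[v1[0], v2[0]], [v1[1], v2[1]]]  # columns are v1 and v2

    # U = A V s^{-1}: project each row on v1 and v2, then divide by s.
    U = [
        [(row[0] * v1[0] + row[1] * v1[1]) / s[0],
         (row[0] * v2[0] + row[1] * v2[1]) / s[1]]
        for row in A
    ]
    return U, s, V


A = [[3.0, 0.0], [4.0, 5.0]]
U, s, V = svd_two_columns(A)

# Rebuild A from U, s and V^T to check the factorization.
rebuilt = [
    [sum(U[i][k] * s[k] * V[j][k] for k in range(2)) for j in range(2)]
    for i in range(len(A))
]
```

### For this matrix $A^{T}A$ has eigenvalues 45 and 5, so the singular values are $\sqrt{45}$ and $\sqrt{5}$, and `rebuilt` matches `A` up to floating-point error.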

### The _computeSVD_ interface receives two main arguments:
* $k$, the number of leading singular values to keep, an integer with $0 < k \le n$
* _computeU_, a boolean flag (__True__ here) that controls whether $U$ is computed. If set to __True__, $U$ is computed as $A  V  s^{-1}$
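
### One common way to decide how many of the $k$ computed components to keep is the cumulative share of variance, taken from the squared singular values. The sketch below shows that rule in plain Python. The function name `components_for_variance` and the singular values are hypothetical, the total is taken over the supplied values only, and this is not necessarily how the Ophelia library makes its selection.

```python
def components_for_variance(singular_values, threshold):
    """Smallest k whose leading components reach `threshold` of the variance."""
    variances = [value ** 2 for value in singular_values]
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    # Fallback for rounding effects when the threshold is (close to) 1.0.
    return len(variances)


# Hypothetical singular values in descending order, not taken from the notebook.
singular_values = [8.0, 4.0, 2.0, 1.0, 1.0]
selected = {
    threshold: components_for_variance(singular_values, threshold)
    for threshold in (0.75, 0.85, 0.95)
}
# selected == {0.75: 2, 0.85: 2, 0.95: 3}
```

### Here the squared values are 64, 16, 4, 1, 1 (total 86), so two components already carry about 93% of the variance and three carry about 98%.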

In [None]:
from ophelia.spark.ml.unsupervised.FeatureExtraction import SingularValueDecomposition

In [8]:
feature_selection = day_price_df.drop('close_timestamp', 'close_date', 'close_year')

svd_df = SingularValueDecomposition(k=10).transform(feature_selection)
svd_df.show(5)

22:32:11.097 Ophelia [INFO] Build Vector Assembler
22:32:11.440 Ophelia [INFO] Feature Vector Assembling
22:32:11.471 Ophelia [INFO] Feature Standard Normalization
22:32:14.226 Ophelia [INFO] Indexing RDD Row Matrix
22:32:17.770 Ophelia [TAPE] +-------------------------------------+
22:32:17.770 Ophelia [WARN] | Compute SVD With K=10, d=71, n=3062 |
22:32:17.770 Ophelia [TAPE] +-------------------------------------+
22:32:18.036 Ophelia [INFO] Compute Variance For Each Component
22:32:18.036 Ophelia [TAPE] +------------------------------------------+
22:32:18.036 Ophelia [WARN] | Components Over 0.75 Of The Variance k=2 |
22:32:18.036 Ophelia [WARN] | Components Over 0.85 Of The Variance k=3 |
22:32:18.036 Ophelia [WARN] | Components Over 0.95 Of The Variance k=5 |
22:32:18.040 Ophelia [TAPE] +------------------------------------------+
22:32:18.047 Ophelia [INFO] Set Components Over 0.95 Of The Variance k=5
+---+--------------------+--------------------+--------------------+
| id|    