# Dimensionality Reduction Function Tutorial

## 1.0 Purpose of the Function

Allow both methods, PCA and SVD, to be called and tested for the model through a single function.




## 2.0 Library Imports

The function relies on two libraries: SciPy for SVD and scikit-learn for PCA.

In [2]:
from sklearn.decomposition import PCA
from scipy.sparse.linalg import svds

## 3.0 Input and Output

The function receives a sparse matrix and returns an array with reduced dimensionality. The method is selected through the `decomposition_method` parameter, a string that can be either "SVD" or "PCA".

In [3]:
def dimensionality_reduction(
    df_input,
    decomposition_method: str = None,
    k: int = None,
    explained_variance: float = None,
):
    """
    Implements methods for dimensionality reduction.
    
    :param df_input: Array to compute SVD or PCA on, of shape (M, N)
    :type df_input: sparse matrix or array
    :param decomposition_method: Choice of method, either "SVD" or "PCA"
    :type decomposition_method: str, default = None
    :param k: Number of singular values (SVD) or principal components (PCA) to compute.
        Must be 1 <= k < min(df_input.shape)
    :type k: int, default = None
    :param explained_variance: 0 < explained_variance < 1, select the number of components such that the
        amount of variance explained is greater than the fraction
        specified by explained_variance (PCA only)
    :type explained_variance: float, default = None
    :raise ValueError: k or explained_variance must be defined.
    :raise TypeError: explained_variance must be a float.
    :raise ValueError: explained_variance must be in the interval (0, 1).
    :raise NotImplementedError: Model not implemented yet. Available names: 'SVD', 'PCA'
    :return: Input data with reduced dimensions
    :rtype: numpy.ndarray

    """
    if k is None and explained_variance is None:
        raise ValueError("k or explained_variance must be defined")

    if decomposition_method == "SVD":
        # Truncated SVD: the k left singular vectors are returned as the reduced data
        if k is None:
            raise ValueError("k must be defined for the SVD method")
        u, s, vt = svds(df_input, k=k)
        return u

    elif decomposition_method == "PCA":
        if k is not None:
            # PCA keeping a fixed number of components
            u = PCA(n_components=k).fit_transform(df_input)
        elif explained_variance is not None:
            if not isinstance(explained_variance, float):
                raise TypeError(
                    f"explained_variance must be a float, but the value passed was {explained_variance!r}."
                )
            if not 0 < explained_variance < 1:
                raise ValueError("explained_variance must be in the interval (0, 1)")
            # PCA keeping enough components to explain the requested fraction of variance
            u = PCA(n_components=explained_variance).fit_transform(df_input)
        return u

    else:
        raise NotImplementedError(
            "Model not implemented yet. Available names: 'SVD', 'PCA'."
        )
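The `explained_variance` option is not demonstrated in this tutorial's examples, so here is a minimal NumPy-only sketch of what a variance threshold means. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper name `components_for_variance` is hypothetical, and it simply picks the smallest number of components whose cumulative explained-variance ratio reaches the threshold.

```python
import numpy as np


def components_for_variance(x, threshold):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)            # PCA works on mean-centered columns
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    ratio = s**2 / np.sum(s**2)              # share of variance per component
    cumulative = np.cumsum(ratio)
    # first position where the running total reaches the threshold
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(ratio)), cumulative    # guard against rounding at the tail


data = np.array(
    [
        [-1, -1, 2, -2],
        [-2, -1, 3, -1],
        [-3, -2, 5, 1],
        [1, 1, 6, 1],
        [2, 1, 7, 1],
        [3, 2, 8, 1],
    ],
    dtype=float,
)
k, cumulative = components_for_variance(data, 0.95)
print(k, np.round(cumulative, 4))
```

Passing the same float as `explained_variance` to the function above delegates this selection to scikit-learn's `PCA`.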

## 3.1 Example of SVD Output

Creating the input matrix as an `np.array`:

In [5]:
import numpy as np

In [12]:
df = np.array(
    [
        [-1, -1, 2, -2],
        [-2, -1, 3, -1],
        [-3, -2, 5, 1],
        [1, 1, 6, 1],
        [2, 1, 7, 1],
        [3, 2, 8, 1],
    ],
    dtype=np.float32,
)

Calling the function with its parameters: the input matrix, the method, and `k` as the number of dimensions to reduce to:

In [13]:
dimensionality_reduction(df, decomposition_method="SVD", k=2)

array([[ 0.34073317,  0.10864144],
       [ 0.47650996,  0.17692444],
       [ 0.7011086 ,  0.31684482],
       [-0.06755795,  0.44724497],
       [-0.17133306,  0.52770823],
       [-0.36244583,  0.61481714]], dtype=float32)
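The array above holds left singular vectors, so its columns have unit length and do not carry the scale of the data. The sketch below illustrates a common alternative, scaling those vectors by the singular values to obtain the coordinates of each row in the reduced basis. It is an illustration, not the tutorial's method: it uses `np.linalg.svd` on the dense matrix instead of `svds`, and the variable names are assumptions.

```python
import numpy as np

x = np.array(
    [
        [-1, -1, 2, -2],
        [-2, -1, 3, -1],
        [-3, -2, 5, 1],
        [1, 1, 6, 1],
        [2, 1, 7, 1],
        [3, 2, 8, 1],
    ],
    dtype=np.float64,
)
k = 2

# Dense SVD; NumPy returns the singular values in descending order
u, s, vt = np.linalg.svd(x, full_matrices=False)
u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

left_vectors = u_k        # unit-length columns, the kind of output returned for "SVD"
projected = u_k * s_k     # each row of x expressed in the top-k right singular basis

# Both routes to the projection agree: U_k * S_k == X @ V_k
print(np.allclose(projected, x @ vt_k.T))
```

Whether to keep the unscaled vectors or the scaled projection depends on whether the downstream model should see the relative importance of each component.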

Using SVD with a DataFrame

In [17]:
import pandas as pd

In [18]:
def get_df(m, n):
    df = pd.DataFrame(np.random.rand(m,n))
    return df

In [20]:
df = get_df(6, 5)
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.800836 | 0.669391 | 0.408557 | 0.83719 | 0.055235 |
| 1 | 0.001072 | 0.169011 | 0.405014 | 0.774019 | 0.329619 |
| 2 | 0.755557 | 0.321485 | 0.429555 | 0.288146 | 0.446964 |
| 3 | 0.912103 | 0.883492 | 0.153328 | 0.243961 | 0.750254 |
| 4 | 0.399057 | 0.259895 | 0.398195 | 0.73586 | 0.617795 |
| 5 | 0.965679 | 0.71516 | 0.512099 | 0.398048 | 0.354085 |


In [28]:
dimensionality_reduction(df, decomposition_method="SVD", k=4)

array([[-0.27882797,  0.7226522 ,  0.18870167, -0.46029838],
       [-0.15486814, -0.17888954,  0.68342389, -0.23462217],
       [ 0.70840188, -0.12869914, -0.12085638, -0.36062803],
       [-0.54870519, -0.42397015, -0.49923854, -0.48682442],
       [ 0.09320274, -0.43399306,  0.42215305, -0.36293365],
       [ 0.29440359,  0.24720075, -0.23508485, -0.48405207]])
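The call above uses `k=4` on a 6×5 input, which is the largest value allowed by the docstring's constraint `1 <= k < min(shape)`. A small check along these lines could report an invalid `k` before the decomposition runs. This is only a sketch: `check_k` is a hypothetical helper, not part of the tutorial's function, and it accepts plain Python integers only.

```python
def check_k(shape, k):
    """Validate k against the constraint 1 <= k < min(shape).

    Hypothetical helper for illustration; not part of the tutorial's function.
    """
    limit = min(shape)
    if not isinstance(k, int) or isinstance(k, bool):
        raise TypeError(f"k must be an int, got {type(k).__name__}")
    if not 1 <= k < limit:
        raise ValueError(f"k must satisfy 1 <= k < {limit}, got {k}")
    return k


check_k((6, 5), 4)        # accepted: matches the call above

try:
    check_k((6, 5), 5)    # rejected: k equals min(shape)
except ValueError as err:
    print(err)
```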

## 3.2 Example of PCA Output

In [29]:
df = np.array(
    [
        [-1, -1, 2, -2],
        [-2, -1, 3, -1],
        [-3, -2, 5, 1],
        [1, 1, 6, 1],
        [2, 1, 7, 1],
        [3, 2, 8, 1],
    ],
)

In [30]:
dimensionality_reduction(df, decomposition_method="PCA", k=2)

array([[-3.55416303, -2.01120392],
       [-3.29181646, -0.37102401],
       [-2.55517195,  2.67830462],
       [ 1.76504972,  0.04940415],
       [ 2.9974995 ,  0.00646722],
       [ 4.63860223, -0.35194805]])
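The PCA result differs from the SVD result on the same matrix largely because PCA subtracts each column's mean before decomposing, and because the projected rows are scaled by the singular values. The sketch below reproduces that idea with NumPy only. It assumes scikit-learn's default, non-whitened PCA, so its values should agree with the output above up to the sign of each column. The variable names are illustrative.

```python
import numpy as np

x = np.array(
    [
        [-1, -1, 2, -2],
        [-2, -1, 3, -1],
        [-3, -2, 5, 1],
        [1, 1, 6, 1],
        [2, 1, 7, 1],
        [3, 2, 8, 1],
    ],
    dtype=float,
)
k = 2

centered = x - x.mean(axis=0)      # remove each column's mean first
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u[:, :k] * s[:k]          # rows projected onto the top-k principal axes

print(np.round(scores, 4))
```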

3.2.1 Using PCA with a DataFrame

To use PCA with a DataFrame, `.values` must be called on `df_input` inside the function. The modified function follows:

In [31]:
def dimensionality_reduction(
    df_input,
    decomposition_method: str = None,
    k: int = None,
    explained_variance: float = None,
):
    """
    Implements methods for dimensionality reduction.
    
    :param df_input: Data to compute SVD or PCA on, of shape (M, N)
    :type df_input: pandas.DataFrame
    :param decomposition_method: Choice of method, either "SVD" or "PCA"
    :type decomposition_method: str, default = None
    :param k: Number of singular values (SVD) or principal components (PCA) to compute.
        Must be 1 <= k < min(df_input.shape)
    :type k: int, default = None
    :param explained_variance: 0 < explained_variance < 1, select the number of components such that the
        amount of variance explained is greater than the fraction
        specified by explained_variance (PCA only)
    :type explained_variance: float, default = None
    :raise ValueError: k or explained_variance must be defined.
    :raise TypeError: explained_variance must be a float.
    :raise ValueError: explained_variance must be in the interval (0, 1).
    :raise NotImplementedError: Model not implemented yet. Available names: 'SVD', 'PCA'
    :return: Input data with reduced dimensions
    :rtype: numpy.ndarray

    """
    if k is None and explained_variance is None:
        raise ValueError("k or explained_variance must be defined")

    if decomposition_method == "SVD":
        # Truncated SVD: the k left singular vectors are returned as the reduced data
        if k is None:
            raise ValueError("k must be defined for the SVD method")
        u, s, vt = svds(df_input, k=k)
        return u

    elif decomposition_method == "PCA":
        if k is not None:
            # PCA keeping a fixed number of components; .values extracts the NumPy array
            u = PCA(n_components=k).fit_transform(df_input.values)
        elif explained_variance is not None:
            if not isinstance(explained_variance, float):
                raise TypeError(
                    f"explained_variance must be a float, but the value passed was {explained_variance!r}."
                )
            if not 0 < explained_variance < 1:
                raise ValueError("explained_variance must be in the interval (0, 1)")
            # PCA keeping enough components to explain the requested fraction of variance
            u = PCA(n_components=explained_variance).fit_transform(df_input.values)
        return u

    else:
        raise NotImplementedError(
            "Model implemented yet. Available names: 'SVD', 'PCA'."
        )
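Because `.values` exists only on pandas objects, the modified function no longer accepts a plain NumPy array. One possible way to support both input types is to convert with `np.asarray`, sketched below. The helper name `as_ndarray` is hypothetical, and the sketch does not cover SciPy sparse matrices, which would need `.toarray()` instead.

```python
import numpy as np
import pandas as pd


def as_ndarray(data):
    """Return `data` as a NumPy array (hypothetical helper).

    np.asarray accepts ndarrays, nested lists, and pandas DataFrames alike,
    so the caller does not need to know which one was passed in.
    """
    return np.asarray(data)


values = [[1.0, 2.0], [3.0, 4.0]]
from_array = as_ndarray(np.array(values))
from_frame = as_ndarray(pd.DataFrame(values))

print(type(from_array).__name__, type(from_frame).__name__)
print(np.array_equal(from_array, from_frame))
```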

In [32]:
def get_df(m, n):
    df = pd.DataFrame(np.random.rand(m,n))
    return df

In [33]:
df = get_df(6, 5)
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.437756 | 0.066781 | 0.964135 | 0.553633 | 0.790037 |
| 1 | 0.367796 | 0.732573 | 0.764866 | 0.562353 | 0.755449 |
| 2 | 0.40388 | 0.41887 | 0.930407 | 0.654264 | 0.214346 |
| 3 | 0.702962 | 0.400857 | 0.057568 | 0.624704 | 0.299899 |
| 4 | 0.514132 | 0.558977 | 0.093727 | 0.299503 | 0.796508 |
| 5 | 0.88313 | 0.021541 | 0.577477 | 0.058158 | 0.426417 |


In [41]:
dimensionality_reduction(df, decomposition_method="PCA", k=4)

array([[-0.46726287,  0.13219022, -0.21369465,  0.22122619],
       [-0.23578174, -0.41607618, -0.11447279, -0.12626969],
       [-0.39799175, -0.03072337,  0.37207247, -0.07933675],
       [ 0.4961718 , -0.0416109 ,  0.31764444,  0.13969046],
       [ 0.46463338, -0.24004234, -0.26407372, -0.01697833],
       [ 0.14023119,  0.59626256, -0.09747575, -0.13833187]])