# Data Normalization

This notebook presents several normalization methods for single-cell RNA sequencing (scRNA-seq) data. scRNA-seq measures gene expression in individual cells, and the methods below prepare these measurements for further computational processing.

*Employed methods:*

- CPM normalization
- Log normalization
- Min-Max normalization

## Libraries

Library | Version | Channel
--- | --- | ---
NumPy | 1.26.4 | Default
RNAnorm | 2.1.0 | Bioconda
Scikit-Learn | 1.4.2 | Default
SciPy | 1.12.0 | Default

In [1]:
import sys
import os

import numpy as np
from rnanorm import CPM
import scipy.sparse as sp
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

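# Get the absolute path of the 'notebooks' directory.
# `__file__` is not defined inside a notebook, so the current working
# directory is used (the notebook is assumed to be run from 'notebooks').
notebooks_dir = os.getcwd()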

# Construct the path to the 'src' directory
src_path = os.path.abspath(os.path.join(notebooks_dir, "..", "src"))

# Add the 'src' directory to the Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Self-built modules
import data_transformation.methods as mtd

##### Example Matrix

All example calculations are performed with this matrix.  
Each normalization method is first tested on a dense matrix. Because the **h5ad** data is stored in a sparse format, the methods are also tested on a sparse representation of the same matrix.

In [3]:
# Row indices of the non-zero entries
row = np.array([5, 8, 2, 3, 1, 7, 0, 4, 6, 9, 2, 3, 8, 1, 0])
# Column indices of the non-zero entries
col = np.array([4, 3, 2, 0, 1, 2, 0, 3, 1, 4, 3, 2, 4, 3, 1])
# Values of the non-zero entries
data = np.array([56, 183, 109, 24, 71, 145, 92, 12, 176, 31, 198, 64, 37, 115, 82])

# Creates sparse matrix with 10 rows (cells) and 5 cols (genes)
test_sparse = sp.csr_matrix((data, (row, col)), shape=(10, 5))
# Dense matrix
test_dense = test_sparse.toarray()

## Counts per Million (CPM)

To compute the counts per million (CPM) for a gene in a sample, the gene's count is divided by the total number of mapped reads in that sample and multiplied by one million. The result is the count the gene would have if the sample contained exactly one million mapped reads, which makes expression levels comparable across samples [[1]](https://www.reneshbedre.com/blog/expression_units.html).
In mathematical terms, the formula looks like this:  

$$CPM_{ij} = \frac{\text{count of gene } j \text{ in sample } i \cdot 10^6}{\sum \text{counts in sample } i}$$  

**Note:** CPM does not take gene length into account. Gene length is not considered relevant for this analysis.

In scRNA-seq data, each cell takes the role of a sample. The focus is not on comparing the expression levels of different genes against each other, but on comparing expression patterns across cells. The formula is adjusted accordingly:  

$$CPM_{ij} = \frac{\text{count of gene } j \text{ in cell } i \cdot 10^6}{\sum \text{counts in cell } i}$$  
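The per-cell formula can be sketched in a few lines of NumPy. This is only an illustration under the assumption that rows are cells and columns are genes; `cpm_rows` is a hypothetical name and not necessarily how `mtd.dense_cpm` is implemented.

```python
import numpy as np


def cpm_rows(counts: np.ndarray) -> np.ndarray:
    """Scale each row (cell) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    # Total counts per cell, kept as a column vector for broadcasting
    totals = counts.sum(axis=1, keepdims=True)
    # Cells without any counts stay all-zero instead of producing NaN
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals * 1e6


# First row of the example matrix plus an empty cell
example = np.array([[92, 82, 0, 0, 0],
                    [0, 0, 0, 0, 0]])
cpm_example = cpm_rows(example)
```

Every non-empty row of the result sums to one million, which is a quick way to check a CPM implementation.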

##### Example - Sanity

Sanity check for the correctness of the self-implemented CPM functions.  
The reference result is computed with the CPM class from the RNAnorm library [[2]](https://github.com/genialis/RNAnorm?tab=readme-ov-file).

In [3]:
cpm_sanity = CPM().fit_transform(test_dense)
cpm_sanity

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

##### Example - Dense

CPM normalization using the dense matrix as input.

In [4]:
cpm_dense = mtd.dense_cpm(test_dense)
cpm_dense

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

##### Example - Sparse

To divide the values of the sparse matrix by their row sums, a diagonal matrix containing the reciprocals of the row sums is created. Multiplying the count matrix by this diagonal matrix from the left emulates the row-wise division [[3]](https://stackoverflow.com/questions/42225269/scipy-sparse-matrix-division).
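The diagonal-matrix approach might look roughly like the sketch below. The function name `sparse_cpm_sketch` is hypothetical, and the actual `mtd.sparse_cpm` may differ in its details; the sketch only assumes a CSR matrix with cells as rows.

```python
import numpy as np
import scipy.sparse as sp


def sparse_cpm_sketch(counts: sp.csr_matrix) -> sp.csr_matrix:
    """CPM for a sparse matrix without converting it to a dense array."""
    # Row sums as a flat 1-D float array
    totals = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    # Reciprocals of the row sums; empty rows get 0 to avoid division by zero
    reciprocals = np.divide(
        1.0, totals, out=np.zeros_like(totals), where=totals != 0
    )
    # Left-multiplying by the diagonal matrix scales row i by its factor
    scaling = sp.diags(reciprocals * 1e6)
    return (scaling @ counts).tocsr()


demo = sp.csr_matrix(np.array([[92, 82, 0],
                               [0, 0, 0],
                               [24, 0, 64]]))
demo_cpm = sparse_cpm_sketch(demo)
```

Because only the stored non-zero entries are scaled, the result has the same sparsity pattern as the input.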

In [5]:
cpm_sparse = mtd.sparse_cpm(test_sparse)

# Print as dense matrix
cpm_sparse.toarray()

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

### Conclusion

&rarr; **All 3 functions appear to compute the same results.**
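
Comparing printed arrays by eye can be backed by a numerical check. The following is a minimal sketch; `results_match` is a hypothetical helper that is not part of the notebook's `data_transformation` module.

```python
import numpy as np
import scipy.sparse as sp


def results_match(*matrices, rtol=1e-9) -> bool:
    """Return True if all given dense or sparse matrices are numerically equal."""
    # Convert sparse inputs to dense arrays so they can be compared directly
    dense = [m.toarray() if sp.issparse(m) else np.asarray(m) for m in matrices]
    return all(np.allclose(dense[0], other, rtol=rtol) for other in dense[1:])


reference = np.array([[0.5, 0.0], [0.0, 1.0]])
same = results_match(reference, reference.copy(), sp.csr_matrix(reference))
different = results_match(reference, reference + 1e-3)
```

In this notebook, a call such as `results_match(cpm_sanity, cpm_dense, cpm_sparse)` would be expected to return `True` if the three implementations agree.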

***

## Log Transformation

Log transformation is a feature transformation technique. The natural logarithm is applied to each value of the matrix, which reduces the impact of outliers and can improve the fit of the model [[4]](https://www.pythonprog.com/log-transformation-in-machine-learning/).  
**1** is added to each value before the logarithm is taken, because the logarithm is undefined at 0 and the many zero counts would otherwise not map to a finite value.

$$\log(0) \text{ is undefined}, \qquad \log(0 + 1) = 0$$
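
How NumPy treats zeros with and without the shift by 1 can be checked directly. The values below are arbitrary illustration data.

```python
import numpy as np

values = np.array([0.0, 1.0, 99.0])

# np.log(0) evaluates to -inf; the warning is silenced for this demo
with np.errstate(divide="ignore"):
    plain_log = np.log(values)

# np.log1p(x) computes log(1 + x), so zeros map to exactly 0
shifted_log = np.log1p(values)
```

`np.log1p` is equivalent to `np.log(values + 1)` here and avoids having to add 1 by hand.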

##### Example - Sanity

Sanity check for the correctness of the self-implemented log function.  
The reference result is computed with the `FunctionTransformer` from the Scikit-Learn package [[5]](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer).

In [6]:
# Define log transformer
log_transformer = FunctionTransformer(np.log)

# Apply transformation
log_sanity = log_transformer.transform(test_dense + 1)
log_sanity

array([[4.53259949, 4.41884061, 0.        , 0.        , 0.        ],
       [0.        , 4.27666612, 0.        , 4.75359019, 0.        ],
       [0.        , 0.        , 4.70048037, 5.29330482, 0.        ],
       [3.21887582, 0.        , 4.17438727, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 2.56494936, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 4.04305127],
       [0.        , 5.17614973, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 4.98360662, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 5.21493576, 3.63758616],
       [0.        , 0.        , 0.        , 0.        , 3.4657359 ]])

##### Example - Dense

Log transformation with a dense matrix as input.

In [7]:
log_dense = mtd.dense_log(test_dense)
log_dense

array([[4.53259949, 4.41884061, 0.        , 0.        , 0.        ],
       [0.        , 4.27666612, 0.        , 4.75359019, 0.        ],
       [0.        , 0.        , 4.70048037, 5.29330482, 0.        ],
       [3.21887582, 0.        , 4.17438727, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 2.56494936, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 4.04305127],
       [0.        , 5.17614973, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 4.98360662, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 5.21493576, 3.63758616],
       [0.        , 0.        , 0.        , 0.        , 3.4657359 ]])

##### Example - Sparse

The CSR matrix class from SciPy already provides the `log1p` method, which computes the logarithm of 1 plus each value. Zero entries therefore stay zero, and the matrix remains sparse.

In [8]:
log_sparse = test_sparse.log1p()
log_sparse.toarray()

array([[4.53259949, 4.41884061, 0.        , 0.        , 0.        ],
       [0.        , 4.27666612, 0.        , 4.75359019, 0.        ],
       [0.        , 0.        , 4.70048037, 5.29330482, 0.        ],
       [3.21887582, 0.        , 4.17438727, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 2.56494936, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 4.04305127],
       [0.        , 5.17614973, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 4.98360662, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 5.21493576, 3.63758616],
       [0.        , 0.        , 0.        , 0.        , 3.4657359 ]])

### Conclusion

&rarr; **All 3 functions appear to compute the same results.**

***

## Min-0-Max-1 Normalization

Min-Max normalization is a scaling method commonly used in machine learning.  
With this method, each feature (here: each gene, i.e. each column) is rescaled so that its lowest value becomes the **min** value 0 and its highest value becomes the **max** value 1 [[6]](https://www.datacamp.com/tutorial/normalization-in-machine-learning).  
The formula for this method looks like this:  
$$X'_{ij} = \frac{X_{ij} - X_{min,j}}{X_{max,j} - X_{min,j}}$$  
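
A minimal NumPy sketch of this scaling is shown below. It assumes that minimum and maximum are taken per column (gene), as scikit-learn's `MinMaxScaler` does; `min_max_columns` is a hypothetical name and not necessarily how `mtd.dense_min_max` is written.

```python
import numpy as np


def min_max_columns(x: np.ndarray) -> np.ndarray:
    """Rescale every column to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Constant columns have range 0; divide by 1 so they map to 0
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (x - col_min) / safe_range


example = np.array([[92.0, 82.0],
                    [0.0, 71.0],
                    [24.0, 176.0]])
scaled = min_max_columns(example)
```

After scaling, each non-constant column has minimum 0 and maximum 1, regardless of its original range.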

##### Example - Sanity

Sanity check for the correctness of the self-implemented min-max functions.  
The reference result is computed with the `MinMaxScaler` from the Scikit-Learn package [[7]](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler).

In [9]:
# Define min-max scaler
min_max_transformer = MinMaxScaler().fit(test_dense)

# Apply transformation
min_max_sanity = min_max_transformer.transform(test_dense)
min_max_sanity

array([[1.        , 0.46590909, 0.        , 0.        , 0.        ],
       [0.        , 0.40340909, 0.        , 0.58080808, 0.        ],
       [0.        , 0.        , 0.75172414, 1.        , 0.        ],
       [0.26086957, 0.        , 0.44137931, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.06060606, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.92424242, 0.66071429],
       [0.        , 0.        , 0.        , 0.        , 0.55357143]])

##### Example - Dense

Min-Max normalization with the dense matrix as input.

In [10]:
min_max_dense = mtd.dense_min_max(test_dense)
min_max_dense

array([[1.        , 0.46590909, 0.        , 0.        , 0.        ],
       [0.        , 0.40340909, 0.        , 0.58080808, 0.        ],
       [0.        , 0.        , 0.75172414, 1.        , 0.        ],
       [0.26086957, 0.        , 0.44137931, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.06060606, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.92424242, 0.66071429],
       [0.        , 0.        , 0.        , 0.        , 0.55357143]])

##### Example - Sparse

Min-Max normalization with the sparse matrix as input.
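
One way to keep the matrix sparse during min-max scaling is sketched below. It rests on a loudly stated assumption: counts are non-negative and every column contains at least one zero, so each column minimum is 0 and the formula reduces to dividing by the column maximum. `sparse_min_max_sketch` is a hypothetical name, and `mtd.sparse_min_max` may work differently.

```python
import numpy as np
import scipy.sparse as sp


def sparse_min_max_sketch(counts: sp.csr_matrix) -> sp.csr_matrix:
    """Min-max scaling per column, assuming every column minimum is 0."""
    # Column maxima as a flat 1-D float array
    col_max = counts.max(axis=0).toarray().ravel().astype(float)
    # Reciprocals of the maxima; all-zero columns get 0 and stay all-zero
    inverse = np.divide(
        1.0, col_max, out=np.zeros_like(col_max), where=col_max != 0
    )
    # Right-multiplying by the diagonal matrix scales column j by inverse[j]
    return (counts @ sp.diags(inverse)).tocsr()


demo = sp.csr_matrix(np.array([[92, 0, 0],
                               [0, 71, 0],
                               [24, 176, 0]]))
demo_scaled = sparse_min_max_sketch(demo)
```

If a column had a non-zero minimum, subtracting it would turn the implicit zeros into non-zero values, and the matrix would no longer be sparse.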

In [4]:
min_max_sparse = mtd.sparse_min_max(test_sparse)
min_max_sparse.toarray()

Scale features: 100%|██████████| 5/5 [00:00<00:00, 3418.90it/s]


array([[1.        , 0.46590909, 0.        , 0.        , 0.        ],
       [0.        , 0.40340909, 0.        , 0.58080808, 0.        ],
       [0.        , 0.        , 0.75172414, 1.        , 0.        ],
       [0.26086957, 0.        , 0.44137931, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.06060606, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.92424242, 0.66071429],
       [0.        , 0.        , 0.        , 0.        , 0.55357143]])

### Conclusion

&rarr; **All 3 functions appear to compute the same results.**