# Data Normalization

This notebook outlines several normalization methods for processing single-cell RNA sequencing (scRNA-seq) data. scRNA-seq profiles gene expression at the level of individual cells.

*Employed methods:*

- CPM normalization
- Log normalization
- Min-Max normalization

## Libraries

Library | Version | Channel
--- | --- | ---
NumPy | 1.26.4 | Default
RNAnorm | 2.1.0 | Bioconda
SciPy | 1.12.0 | Default

In [13]:
import numpy as np
from rnanorm import CPM
import scipy.sparse as sp

import modules.normalization as norm

##### Example Matrix

All example calculations are performed with this matrix.  
The normalization methods are first tested on a dense matrix. Because the **h5ad** data is stored in a sparse format, the methods are then also tested on a sparse matrix to make sure they work on both representations.

In [6]:
# Non zero row indices
row = np.array([5, 8, 2, 3, 1, 7, 0, 4, 6, 9, 2, 3, 8, 1, 0])
# Non zero col indices
col = np.array([4, 3, 2, 0, 1, 2, 0, 3, 1, 4, 3, 2, 4, 3, 1])
# Non zero data
data = np.array([56, 183, 109, 24, 71, 145, 92, 12, 176, 31, 198, 64, 37, 115, 82])

# Creates sparse matrix with 10 rows (cells) and 5 cols (genes)
test_sparse = sp.csr_matrix((data, (row, col)), shape=(10, 5))
# Dense matrix
test_dense = test_sparse.toarray()

## Counts per Million (CPM)

To compute the Counts Per Million (CPM) for a gene in a sample, the gene's count is multiplied by one million and divided by the total number of mapped reads in that sample. This makes expression levels comparable between samples with different sequencing depths [[1]](https://www.reneshbedre.com/blog/expression_units.html).
In mathematical terms, the formula looks like this:  

$$CPM_{ij} = \frac{\text{count of gene } j \text{ in sample } i \cdot 10^6}{\sum \text{counts in sample } i}$$  

**Note:** Gene length is not considered during normalization. For this analysis, gene length does not appear to be relevant.

When analyzing scRNA-seq data, the focus is not on comparing the expression levels of different genes against each other, but rather on comparing the expression patterns across different cells. Consequently, the calculation formula is adjusted to suit this objective:  

$$CPM_{ij} = \frac{\text{count of gene } j \text{ in cell } i \cdot 10^6}{\sum \text{counts in cell } i}$$  
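
The per-cell formula can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation in `modules.normalization`; the function name `cpm_dense_sketch` and the handling of empty cells (left as all zeros) are assumptions made for this example.

```python
import numpy as np

def cpm_dense_sketch(counts):
    """Hypothetical helper: scale every row (cell) to a total of one million."""
    counts = np.asarray(counts, dtype=float)
    # Total counts per cell (row); keepdims=True keeps shape (n_cells, 1)
    # so the division broadcasts across all genes of a cell
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells without any counts stay all zero instead of producing NaN
    safe_totals = np.where(totals == 0, 1.0, totals)
    return counts / safe_totals * 1e6

# Two cells (rows) and three genes (columns)
counts = np.array([[92, 82, 0],
                   [0, 0, 12]])
cpm = cpm_dense_sketch(counts)
print(np.allclose(cpm.sum(axis=1), 1e6))  # True
```

After normalization every non-empty cell sums to one million, regardless of how many reads it originally had.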

##### Example - Sanity

This example uses the `CPM` class from the RNAnorm library [[2]](https://github.com/genialis/RNAnorm?tab=readme-ov-file) as a reference.

In [14]:
sanity_result = CPM().fit_transform(test_dense)
sanity_result

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

##### Example - Dense

CPM normalization using the dense matrix as input.

In [9]:
dense_result = norm.dense_cpm(test_dense)
dense_result

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

##### Example - Sparse

To divide each row of the sparse matrix by its row sum while keeping the matrix sparse, a diagonal matrix containing the reciprocals of the row sums is created. Multiplying this diagonal matrix with the count matrix emulates the row-wise division [[3]](https://stackoverflow.com/questions/42225269/scipy-sparse-matrix-division).
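
The diagonal-matrix approach could look roughly like the following sketch. It is not the code behind `norm.sparse_cpm`; the name `cpm_sparse_sketch` and the choice to give empty rows a scaling factor of 0 are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp

def cpm_sparse_sketch(counts):
    """Hypothetical helper: row-wise CPM that keeps the matrix sparse."""
    counts = sp.csr_matrix(counts, dtype=float)
    # Row sums as a flat 1-D array
    totals = np.asarray(counts.sum(axis=1)).ravel()
    # Reciprocals of the row sums; assumption: empty rows get a factor of 0
    reciprocals = np.zeros_like(totals)
    nonzero = totals > 0
    reciprocals[nonzero] = 1.0 / totals[nonzero]
    # Left-multiplying by diag(1 / row_sum) divides every row by its own sum
    scaling = sp.diags(reciprocals)
    return (scaling @ counts) * 1e6

# Three cells, three genes; the last cell has no counts at all
counts = sp.csr_matrix(np.array([[92, 82, 0],
                                 [0, 0, 12],
                                 [0, 0, 0]]))
cpm = cpm_sparse_sketch(counts)
expected = np.array([[92 / 174 * 1e6, 82 / 174 * 1e6, 0.0],
                     [0.0, 0.0, 1e6],
                     [0.0, 0.0, 0.0]])
print(np.allclose(cpm.toarray(), expected))  # True
```

Because only a sparse-by-sparse product and a scalar multiplication are involved, the matrix is never converted to a dense array.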

In [11]:
sparse_result = norm.sparse_cpm(test_sparse)
sparse_result.toarray()

array([[ 528735.63218391,  471264.36781609,       0.        ,
              0.        ,       0.        ],
       [      0.        ,  381720.43010753,       0.        ,
         618279.56989247,       0.        ],
       [      0.        ,       0.        ,  355048.85993485,
         644951.14006515,       0.        ],
       [ 272727.27272727,       0.        ,  727272.72727273,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
        1000000.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
              0.        , 1000000.        ],
       [      0.        , 1000000.        ,       0.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        , 1000000.        ,
              0.        ,       0.        ],
       [      0.        ,       0.        ,       0.        ,
         831818.18181818,  168181.81818182],
       [      0.        ,       0.   

### Conclusion

&rarr; **All 3 functions appear to compute the same results.**

## Min-0-Max-1 Normalization

Min-Max normalization is commonly used in machine learning.  
This method rescales the values so that the lowest value in the dataset is mapped to the **min** bound (here 0) and the highest value to the **max** bound (here 1) [[4]](https://www.datacamp.com/tutorial/normalization-in-machine-learning).  
The formula for this method looks like this:  
$$X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$$  

**Note:** This functionality is also implemented in the Python library *sklearn* (`MinMaxScaler`, which scales each feature column separately).

##### Example

In [None]:
def calculate_min_max(data, min_val=0, max_val=1):
    min_data = np.min(data)
    max_data = np.max(data)

    # If min and max are not set to 0 and 1 this is the full formula
    min_max_normalized = (data - min_data) / (max_data - min_data) * (max_val - min_val) + min_val

    return min_max_normalized

In [None]:
# Assumption: the CPM result of the dense example above is used as input
example_matrix = dense_result
print(example_matrix)

example_result_2 = calculate_min_max(example_matrix)
example_result_2

Compared to the formula above, the function contains two extra steps: the result is multiplied by the range of the target bounds and then shifted by the lower bound. These steps only matter if the bounds are **not 0 and 1**. If the bounds **are 0 and 1**, neither step changes the result:  
$$bound_{max} - bound_{min}$$ 
- Evaluates to 1 - 0 = 1 (the result is multiplied by 1)
$$bound_{min}$$
- Evaluates to 0 (0 is added to the result)
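
A small worked example may make this concrete. The values and the helper name `rescale` are made up for illustration. With bounds 0 and 1 the full formula gives the same result as the basic formula, while other bounds (here -1 and 1) stretch and shift the result.

```python
import numpy as np

values = np.array([0.0, 250000.0, 1000000.0])
x_min, x_max = values.min(), values.max()

# Basic formula: maps the data to the range [0, 1]
basic = (values - x_min) / (x_max - x_min)

def rescale(values, bound_min, bound_max):
    """Hypothetical helper: full Min-Max formula with arbitrary target bounds."""
    scaled = (values - x_min) / (x_max - x_min)
    # Multiply by the range of the bounds, then shift by the lower bound
    return scaled * (bound_max - bound_min) + bound_min

default_bounds = rescale(values, 0, 1)   # multiplies by 1 and adds 0
custom_bounds = rescale(values, -1, 1)   # multiplies by 2 and subtracts 1

print(basic)           # [0.   0.25 1.  ]
print(default_bounds)  # [0.   0.25 1.  ]
print(custom_bounds)   # [-1.  -0.5  1. ]
```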

In [None]:
def calculate_sparse_min_max(data, min_val=0, max_val=1):
    # Get min value from sparse matrix (implicit zeros are taken into account)
    min_data = data.min()
    # Get max value from sparse matrix
    max_data = data.max()

    # Calculate Min-Max as described above.
    # SciPy only allows adding or subtracting a scalar of 0 for sparse matrices,
    # so this works as long as min_data and min_val are both 0.
    min_max_matrix = (data - min_data) / (max_data - min_data) * (max_val - min_val) + min_val

    return min_max_matrix

In [None]:
min_max_normalized = calculate_sparse_min_max(cpm_normalized)
min_max_normalized
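
SciPy does not support subtracting a nonzero scalar from a sparse matrix, because the result would no longer be sparse. For non-negative data that contains at least one zero, the minimum is 0 and the formula with bounds 0 and 1 reduces to a division by the maximum, which can be applied to the stored values only. The following is a minimal sketch under that assumption; the function name `sparse_min_max_sketch` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def sparse_min_max_sketch(matrix):
    """Hypothetical helper: Min-Max to [0, 1] for a sparse matrix with minimum 0."""
    matrix = sp.csr_matrix(matrix, dtype=float)
    min_data = matrix.min()  # implicit zeros are taken into account
    max_data = matrix.max()
    if min_data != 0:
        # Assumption of this sketch: zeros must stay zeros to preserve sparsity
        raise ValueError("sketch only supports matrices whose minimum is 0")
    result = matrix.copy()
    if max_data == 0:
        # All values are zero, nothing to scale
        return result
    # (x - 0) / (max - 0): only the stored non-zero values need to be divided
    result.data /= max_data
    return result

example = sp.csr_matrix(np.array([[0, 50],
                                  [100, 0]]))
scaled = sparse_min_max_sketch(example)
print(scaled.toarray())
```

The number of stored entries stays the same, so the memory footprint of the normalized matrix matches that of the input.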

## Save Results

Save the results as additional layers in the AnnData object.

In [None]:
adata.layers["cpm_normalized"] = cpm_normalized
adata.layers["min_max_normalized"] = min_max_normalized
adata

In [None]:
adata.write_h5ad(filename="../data/adata_30kx10k_normalized_sample.h5ad")