# Data Normalization

Implementation of different normalization methods for the annotated data.

## Libraries

- anndata: 0.10.7
- numpy: 1.26.4
- scipy: 1.12.0

In [33]:
import anndata as ad
import numpy as np
import scipy.sparse as sp

In [34]:
file_path = "../data/adata_1000x1000_sample.h5ad"
# file_path = "/home/ubuntu/projects/project_data/thesis/global_raw.h5ad"

adata = ad.read_h5ad(filename=file_path)

## Normalization

### Counts per Million (CPM)

CPM normalization divides each mapped read count by the total number of mapped reads and scales the result by one million [[1]](https://www.reneshbedre.com/blog/expression_units.html).  
In scRNA-seq we want to compare individual cells with each other, therefore the calculation differs slightly from standard RNA-seq: the total is taken per cell. The formula looks like this:
$$CPM_{ij} = \frac{count_{ij} \cdot 10^6}{\sum_{k} count_{ik}}$$  
where $count_{ij}$ is the count of gene $j$ in cell $i$ and the sum runs over all genes $k$ of cell $i$.

**Note:** Gene length is not considered during normalization. For this analysis, gene length does not appear to be important.

##### Example

In [35]:
def calculate_cpm(count_matrix):
    # Sum each row (cell) across all genes to get the total counts per cell
    cell_counts = np.sum(count_matrix, axis=1)

    # Calculate the CPM normalized values
    cpm_matrix = count_matrix * 1e6 / cell_counts[:, np.newaxis]

    return cpm_matrix

In [46]:
example_matrix = np.array([[10, 0, 30],
                           [5, 0, 25],
                           [20, 0, 5]])

example_result = calculate_cpm(example_matrix)
example_result

array([[250000.        ,      0.        , 750000.        ],
       [166666.66666667,      0.        , 833333.33333333],
       [800000.        ,      0.        , 200000.        ]])

Expected result after running:  
```python
array([[250000.        ,      0.        , 750000.        ],
       [166666.66666667,      0.        , 833333.33333333],
       [800000.        ,      0.        , 200000.        ]])
```
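
A quick sanity check on the formula: because every count of a cell is divided by that cell's total and multiplied by $10^6$, each row of the CPM matrix should sum to one million. The sketch below repeats the calculation in a self-contained way (the helper name `cpm_dense` is illustrative and not part of the notebook) and assumes that every cell has at least one count.

```python
import numpy as np

def cpm_dense(count_matrix):
    # Hypothetical helper mirroring the CPM formula: rows are cells, columns are genes
    cell_totals = count_matrix.sum(axis=1, keepdims=True)
    return count_matrix * 1e6 / cell_totals

counts = np.array([[10, 0, 30],
                   [5, 0, 25],
                   [20, 0, 5]])
cpm = cpm_dense(counts)

# Every cell (row) should add up to one million after normalization
row_sums = cpm.sum(axis=1)
print(np.allclose(row_sums, 1e6))
```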

To divide the values of a sparse matrix row by row, a diagonal matrix holding the reciprocals of the row sums is created. Multiplying the count matrix by this diagonal matrix from the left scales every row by the reciprocal of its sum, which emulates the division [[2]](https://stackoverflow.com/questions/42225269/scipy-sparse-matrix-division).

In [37]:
adata_counts = adata.X.copy()
adata_counts

<1000x1000 sparse matrix of type '<class 'numpy.float32'>'
	with 39019 stored elements in Compressed Sparse Row format>

In [38]:
def calculate_sparse_cpm(count_matrix):
    # Build the diagonal matrix of reciprocal row sums as mentioned above
    cell_counts = sp.diags(1 / np.asarray(count_matrix.sum(axis=1)).ravel())
    # Calculate the numerator of the formula
    multiplied_counts = count_matrix * 1e6

    # Calculate the CPM values
    cpm_matrix = cell_counts.dot(multiplied_counts)

    return cpm_matrix

In [39]:
cpm_normalized = calculate_sparse_cpm(adata_counts)
cpm_normalized


<1000x1000 sparse matrix of type '<class 'numpy.float32'>'
	with 39019 stored elements in Compressed Sparse Row format>
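
The diagonal-matrix approach divides by the row sums, so a cell without any counts would lead to a division by zero. The following is a minimal sketch of a guarded variant on a small toy matrix. The function name `sparse_cpm_safe` and the choice to leave empty cells at zero are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import scipy.sparse as sp

def sparse_cpm_safe(count_matrix):
    # Hypothetical variant of the sparse CPM calculation that guards against empty cells
    counts = sp.csr_matrix(count_matrix, dtype=np.float64)
    cell_totals = np.asarray(counts.sum(axis=1)).ravel()

    # Reciprocal of each row sum; cells with zero counts get a factor of 0
    # instead of inf (an assumption about the desired behaviour)
    reciprocals = np.zeros_like(cell_totals)
    nonzero = cell_totals > 0
    reciprocals[nonzero] = 1.0 / cell_totals[nonzero]

    # Left-multiplying by a diagonal matrix scales row i by reciprocals[i]
    scaled = sp.diags(reciprocals) @ counts
    return scaled * 1e6

dense_counts = np.array([[10, 0, 30],
                         [0, 0, 0],
                         [20, 0, 5]])
sparse_result = sparse_cpm_safe(dense_counts).toarray()
print(sparse_result)
```

The second row contains no counts and stays at zero, while the other rows match the dense example.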

### Min-0-Max-1 Normalization

Min-Max normalization is commonly used in machine learning.  
It rescales the values so that the lowest value in the dataset is mapped to **min** and the highest value is mapped to **max**, here 0 and 1 [[3]](https://www.datacamp.com/tutorial/normalization-in-machine-learning).  
The formula for this method looks like this:  
$$X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$$  

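**Note:** This functionality is also implemented in the Python library *sklearn* (`sklearn.preprocessing.MinMaxScaler`), which scales each feature (column) separately, whereas the functions below use the global minimum and maximum of the whole matrix.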

##### Example

In [42]:
def calculate_min_max(data, min_val=0, max_val=1):
    min_data = np.min(data)
    max_data = np.max(data)

    # Full formula, which also supports target ranges other than 0 and 1
    min_max_normalized = (data - min_data) / (max_data - min_data) * (max_val - min_val) + min_val

    return min_max_normalized

In [47]:
example_matrix = example_result
print(example_matrix)

example_result_2 = calculate_min_max(example_matrix)
example_result_2

[[250000.              0.         750000.        ]
 [166666.66666667      0.         833333.33333333]
 [800000.              0.         200000.        ]]


array([[0.3 , 0.  , 0.9 ],
       [0.2 , 0.  , 1.  ],
       [0.96, 0.  , 0.24]])

The function contains two extra steps, the multiplication by $bound_{max} - bound_{min}$ and the addition of $bound_{min}$. They provide full functionality if the target range is **not 0 to 1**. If the boundaries **are 0 and 1**, both steps have no effect:
- $bound_{max} - bound_{min}$ turns out to 1 - 0 = 1 (result multiplied by 1)
- $bound_{min}$ turns out to 0 (adds 0 to the result)
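
To illustrate when the extra terms matter, the sketch below applies the same formula once with the default bounds and once with a target range of -1 to 1. The function name `min_max_scale` and the sample values are illustrative only.

```python
import numpy as np

def min_max_scale(data, min_val=0.0, max_val=1.0):
    # Same formula as above, written out for an arbitrary target range
    min_data = np.min(data)
    max_data = np.max(data)
    return (data - min_data) / (max_data - min_data) * (max_val - min_val) + min_val

values = np.array([0.0, 2.5, 5.0, 10.0])

unit_range = min_max_scale(values)            # default range 0 to 1
custom_range = min_max_scale(values, -1, 1)   # custom range -1 to 1

print(unit_range)    # [0.   0.25 0.5  1.  ]
print(custom_range)  # [-1.  -0.5  0.   1. ]
```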

In [57]:
def calculate_sparse_min_max(data, min_val=0, max_val=1):
    # Get min value from sparse matrix
    min_data = data.min()
    # Get max value from sparse matrix
    max_data = data.max()

    # Calculate Min-Max as described above, using the matrix passed in as `data`
    # Note: on sparse input this only works if min_data and min_val are both 0,
    # because SciPy does not support adding or subtracting a non-zero scalar
    min_max_matrix = (data - min_data) / (max_data - min_data) * (max_val - min_val) + min_val

    return min_max_matrix

In [58]:
min_max_normalized = calculate_sparse_min_max(cpm_normalized)
min_max_normalized

<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 39019 stored elements in Compressed Sparse Row format>
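
Subtracting a non-zero scalar from a SciPy sparse matrix is not supported, since it would turn every implicit zero into a stored value. The call above works because the smallest value of the CPM matrix is 0 and the target range starts at 0. A minimal sketch for exactly that special case, assuming non-negative data and a target range of 0 to 1 (the function name `sparse_min_max_unit` is illustrative), only divides by the maximum and therefore keeps the matrix sparse:

```python
import numpy as np
import scipy.sparse as sp

def sparse_min_max_unit(matrix):
    # Hypothetical helper: min-max scaling to [0, 1] for non-negative sparse data
    matrix = sp.csr_matrix(matrix, dtype=np.float64)
    min_data = matrix.min()
    max_data = matrix.max()

    if min_data != 0:
        # Shifting by a non-zero minimum would fill in all implicit zeros
        raise ValueError("a non-zero minimum would require a dense result")
    if max_data == 0:
        # All values are zero, nothing to scale
        return matrix.copy()

    # With min = 0 the formula reduces to X / X_max, which preserves sparsity
    return matrix * (1.0 / max_data)

example = sp.csr_matrix(np.array([[250000.0, 0.0, 750000.0],
                                  [0.0, 0.0, 1000000.0]]))
scaled = sparse_min_max_unit(example)
print(scaled.toarray())
print(scaled.nnz)
```

The number of stored elements stays the same before and after scaling, which matches the unchanged count of stored elements in the output above.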

## Save Results

Save the results as additional layers in the anndata object.

In [60]:
adata.layers["cpm_normalized"] = cpm_normalized
adata.layers["min_max_normalized"] = min_max_normalized
adata

In [63]:
adata.write_h5ad(filename="../data/adata_normalized_sample.h5ad")