# Normalization

To normalize the dataset and account for sequencing depth, we use a method inspired by Transcripts Per Million (TPM). The normalization proceeds in the following steps:

#### 1. Remove Non-mtRNA Data
First, we discard the column containing non-mitochondrial RNA counts so that the analysis focuses solely on mitochondrial RNA.

#### 2. Gene Length Normalization
For each gene, divide the raw count by the gene's length in kilobases. This adjusts for differences in gene length, so that longer genes do not appear more highly expressed simply because they attract more reads.

#### 3. Row Sum Normalization
After length normalization, compute the sum of the normalized values in each row (each row represents one cell). Then divide every value in the row by that row sum to adjust for differences in total read depth among cells.

#### 4. Extract Relevant Columns
Once normalization is complete, select the columns of interest from the normalized dataset and scale them by one million for further analysis.
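
The steps above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual pipeline: the function name `tpm_like_normalize`, the toy counts, and the gene lengths in `lengths_kb` are all assumptions made for the example, and nothing is assumed about the layout of `gene_length.csv`.

```python
import pandas as pd

def tpm_like_normalize(counts, lengths_kb, genes, scale=1_000_000):
    """Length-normalize, depth-normalize, then select and scale `genes`."""
    # Step 2: divide each gene's counts by its length in kilobases
    rate = counts.div(lengths_kb[counts.columns], axis=1)
    # Step 3: divide each row (cell) by its row sum, skipping all-zero rows
    row_sums = rate.sum(axis=1)
    nonzero = row_sums > 0
    fractions = rate.loc[nonzero].div(row_sums[nonzero], axis=0)
    # Step 4: keep the genes of interest and express them per million
    return (fractions[genes] * scale).round()

# Toy data: hypothetical counts, gene lengths are illustrative only
counts = pd.DataFrame(
    {"MT-CO1": [30, 0, 10], "MT-ND1": [10, 0, 10], "non-MT": [500, 3, 80]}
)
lengths_kb = pd.Series({"MT-CO1": 1.5, "MT-ND1": 1.0})

# Step 1: drop the non-mitochondrial column
mt_counts = counts.drop(columns=["non-MT"])
result = tpm_like_normalize(mt_counts, lengths_kb, ["MT-CO1", "MT-ND1"])
print(result)
```

In this toy example the second cell has no mitochondrial reads, so it is dropped before the row-sum division. Each remaining row sums to roughly one million across the selected genes, because the example selects every mitochondrial column.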

In [20]:
import pandas as pd

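# Read the per-cell gene counts and the gene-length table
data1_cd4 = pd.read_csv("mt-datasets/Donor1_CD4_Genes.csv")
gene_length = pd.read_csv("mt-datasets/gene_length.csv")

# Drop stray 'Unnamed: N' columns (typically an index saved to the CSV)
data1_cd4 = data1_cd4.loc[:, ~data1_cd4.columns.str.contains('^Unnamed')]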

In [28]:
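# Drop the non-mitochondrial column.
# Note: gene_length is loaded above but is not applied in this cell,
# so the values below are normalized by row sum only.
df_norm_1 = data1_cd4.drop(columns=['non-MT'])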

row_sums = df_norm_1.sum(axis=1)

# Drop all-zero rows to avoid dividing by zero
zero_indices = row_sums.index[row_sums == 0].tolist()
df_norm_1 = df_norm_1.drop(index=zero_indices)

row_sums = row_sums.drop(index=zero_indices)
df_normalized = df_norm_1.div(row_sums, axis=0)

# Select the mtRNA genes of interest
interested_genes = ["MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB",
                    "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4",
                    "MT-ND4L", "MT-ND5", "MT-ND6", "MT-ATP6",
                    "MT-ATP8"]
# Scale each fraction to a per-million value and round
df_normalized_fin = (df_normalized[interested_genes] * 1_000_000).round()
print(df_normalized_fin)

        MT-CO1    MT-CO2    MT-CO3    MT-CYB   MT-ND1   MT-ND2   MT-ND3  \
0      77441.0  178451.0  127946.0   67340.0  40404.0  10101.0  20202.0   
1     126984.0  185185.0  179894.0   89947.0  42328.0  21164.0  15873.0   
2      67511.0  126582.0   80169.0   63291.0  29536.0   8439.0  25316.0   
3      89606.0   82437.0   82437.0   50179.0  10753.0   7168.0   7168.0   
4      23346.0  101167.0   42802.0   27237.0  23346.0  11673.0  15564.0   
...        ...       ...       ...       ...      ...      ...      ...   
2883  163934.0  122951.0   90164.0   32787.0  32787.0      0.0  16393.0   
2884   92199.0  198582.0  127660.0   56738.0  14184.0  35461.0  42553.0   
2885   59524.0  130952.0   83333.0  119048.0  29762.0      0.0  23810.0   
2886   99723.0  124654.0  110803.0  102493.0  16620.0  33241.0  13850.0   
2887   45918.0  122449.0  112245.0   86735.0  35714.0   5102.0   5102.0   

       MT-ND4  MT-ND4L   MT-ND5   MT-ND6  MT-ATP6   MT-ATP8  
0     1