# Normalization

To normalize the dataset and account for sequencing depth, we use a method inspired by Transcripts Per Million (TPM). The procedure consists of the following steps:

#### 1. Remove Non-mtRNA Data
First, we discard the column containing non-mitochondrial RNA counts to focus solely on mitochondrial RNA.
#### 2. Gene Length Normalization
For each gene, divide the raw count by the gene's length (in kilobases). This adjusts for differences in gene length, so that longer genes do not appear more highly expressed simply because they attract more reads.
#### 3. Row Sum Normalization
After length normalization, compute the sum of the normalized values in each row (each row represents one cell), then divide every value in the row by that sum. This adjusts for differences in total read depth among cells. Rows whose sum is zero are dropped first to avoid division by zero.
#### 4. Extract Relevant Columns
Once normalization is complete, select the genes of interest from the normalized dataset and scale the values by 1000 for further analysis.
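
The steps above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual data: the counts and the gene lengths (in kilobases) are made up, and `tpm_like` is a hypothetical helper name introduced only for this example.

```python
import pandas as pd

def tpm_like(counts: pd.DataFrame, lengths_kb: pd.Series, scale: float = 1000.0) -> pd.DataFrame:
    """Length-normalize counts, then rescale each row (cell) to sum to `scale`."""
    # Divide each gene column by its length in kilobases
    per_kb = counts.div(lengths_kb, axis=1)
    # Drop cells with no reads, then divide each remaining row by its sum
    row_sums = per_kb.sum(axis=1)
    keep = row_sums > 0
    return per_kb.loc[keep].div(row_sums[keep], axis=0) * scale

# Made-up counts for three cells; the last cell has no mitochondrial reads
toy_counts = pd.DataFrame(
    {"MT-CO1": [30, 10, 0], "MT-ND1": [10, 10, 0], "non-MT": [500, 800, 900]}
)
# Made-up gene lengths in kilobases (illustrative values only)
toy_lengths_kb = pd.Series({"MT-CO1": 1.5, "MT-ND1": 1.0})

# Remove the non-mitochondrial column before normalizing
toy_mt = toy_counts.drop(columns=["non-MT"])
toy_norm = tpm_like(toy_mt, toy_lengths_kb)
print(toy_norm)
```

Each remaining row sums to `scale`, so a value can be read as that gene's share of the cell's length-adjusted mitochondrial signal. The all-zero cell is removed rather than producing NaN values.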

In [34]:
import pandas as pd

# Read the CSV files
data1_cd4 = pd.read_csv("mt-datasets/Donor1_CD4_Genes.csv")
gene_length = pd.read_csv("mt-datasets/gene_length.csv")
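
Because the division in the next cell matches gene lengths to count columns by name, it can help to confirm that every gene column has a length entry before normalizing. The following is a hedged sketch on made-up data. It assumes, as the later use of `gene_length.iloc[0]` suggests, that the lengths file stores a single row with one column per gene; `missing_lengths` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def missing_lengths(counts: pd.DataFrame, lengths: pd.Series) -> list:
    """Return the count columns that have no (or a NaN) entry in `lengths`."""
    aligned = lengths.reindex(counts.columns)
    return aligned[aligned.isna()].index.tolist()

# Made-up stand-ins for the two CSV files
demo_counts = pd.DataFrame({"MT-CO1": [3, 5], "MT-ND1": [1, 0], "MT-CYB": [2, 2]})
demo_length_table = pd.DataFrame({"MT-CO1": [1.5], "MT-ND1": [1.0]})

# Genes listed here would turn into all-NaN columns after the division
print(missing_lengths(demo_counts, demo_length_table.iloc[0]))
```

An empty list means every gene column can be length-normalized.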


In [36]:
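# Drop the non-mitochondrial column so that only mtRNA counts remain
df_norm_1 = data1_cd4.drop(columns=['non-MT'])

# Divide the read counts by the length of each gene.
# The first row of gene_length is used as the per-gene lengths; reindexing it to
# the remaining columns keeps the division from re-adding any extra columns.
df_norm_1 = df_norm_1.div(gene_length.iloc[0].reindex(df_norm_1.columns))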

# Sum the length-normalized values of each row (cell)
row_sums = df_norm_1.sum(axis=1)

# Drop all-zero rows to avoid dividing by zero
zero_indices = row_sums.index[row_sums == 0].tolist()
df_norm_1 = df_norm_1.drop(index=zero_indices)
row_sums = row_sums.drop(index=zero_indices)

# Divide each value by the sum of its row
df_normalized = df_norm_1.div(row_sums, axis=0)

# Select the mtRNA genes of interest
interested_genes = ["MT-CO1", "MT-CO2", "MT-CO3", "MT-CYB",
                    "MT-ND1", "MT-ND2", "MT-ND3", "MT-ND4",
                    "MT-ND4L", "MT-ND5", "MT-ND6", "MT-ATP6",
                    "MT-ATP8"]
# Scale the normalized values by 1000
df_normalized_fin = df_normalized[interested_genes] * 1000
print(df_normalized_fin)

         MT-CO1      MT-CO2      MT-CO3     MT-CYB     MT-ND1     MT-ND2  \
0     30.321946  157.519157   98.532873  35.633457  25.517432   5.852847   
1     56.998655  187.391412  158.818319  54.563349   30.64572   14.05821   
2     34.060899  143.974522   79.553269  43.154502  24.035914   6.300612   
3     48.433953  100.453718   87.640743  36.655291   9.374706   5.733985   
4       11.5994  113.314603   41.825896  18.288624  18.709492   8.582665   
...         ...         ...         ...        ...        ...        ...   
2883  76.569801  129.463414   82.830164  20.695992  24.700969        0.0   
2884  35.842777  174.038205   97.611078  29.808973   8.894361  20.400694   
2885  28.668374  142.185075   78.940415  77.487524  23.120624        0.0   
2886  50.678805  142.811982   110.75215  70.392164  13.623905   24.99895   
2887  24.533582  147.488432  117.953121  62.627647  30.778182   4.033993   

         MT-ND3     MT-ND4     MT-ND4L     MT-ND5     MT-