# Data Processing

This notebook contains the code for processing the raw mRNA count CSV. The raw file is formatted like this:

<img src="mRNA_raw_csv.png" width="400">

Column labels represent <font color = 'red'> <b> patients </b> </font> with either mild or severe RSV. Row labels represent <font color = 'red'> <b> genes </b> </font>. Each cell holds the <font color = 'red'> <b> non-normalized mRNA count </b> </font> of one gene in one patient.

Let's start by importing our packages, then loading in our data.

In [18]:
# Imports
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load in data
pd.set_option('display.max_columns', None) # We'd like to be able to view all columns

mRNA_data = pd.read_csv("../RAW_DATA/mRNA_count_data.csv", encoding = "UTF-8") # mRNA data
mRNA_data.head(10) # Let's take a look at the first 10 rows.



Unnamed: 0,Gene,GSM4715942,GSM4715943,GSM4715947,GSM4715948,GSM4715951,GSM4715952,GSM4715953,GSM4715954,GSM4715956,GSM4715957,GSM4715961,GSM4715962,GSM4715963,GSM4715964,GSM4715965,GSM4715966,GSM4715968,GSM4715970,GSM4715971,GSM4715972,GSM4715973,GSM4715975,GSM4715977,GSM4715980,GSM4715986,GSM4715987,GSM4715988,GSM4715989,GSM4715990,GSM4715994,GSM4715995,GSM4715996,GSM4715997,GSM4715999,GSM4716002,GSM4716003,GSM4716004
0,DPM1,201,259,257,248,810,353,333,125,367,351,325,270,392,433,313,425,354,361,348,607,407,320,514,442,543,321,400,454,476,484,428,211,76,307,333,194,263
1,SCYL3,635,454,519,585,1323,1013,964,339,695,617,634,624,541,558,503,705,705,540,514,890,622,791,654,818,1343,899,1111,1213,1294,1059,1058,540,295,820,759,397,834
2,C1orf112,174,173,144,219,393,242,323,102,218,202,341,186,215,412,117,299,228,146,217,322,230,324,274,315,482,283,371,372,384,363,361,134,105,304,201,174,209
3,FGR,1925,4931,4142,4080,4505,11083,8909,3411,3504,5208,4884,7127,2956,2763,5713,4106,5247,8067,3317,12291,7987,7154,4805,6989,5308,6174,5757,6331,6466,7845,4402,14466,718,4426,8914,3905,6426
4,CFH,5,16,24,16,45,21,7,53,30,42,25,19,15,34,37,10,29,8,24,18,52,16,17,32,63,17,11,30,13,37,7,1,40,20,27,11,19
5,FUCA2,135,228,206,164,316,301,311,64,205,240,230,219,142,208,290,194,283,290,161,436,341,238,264,329,451,386,389,371,391,258,377,141,64,216,215,100,230
6,GCLC,254,258,175,224,539,551,363,148,215,266,460,1152,340,587,635,476,1650,362,401,677,920,492,579,1377,1040,424,462,361,614,464,444,187,250,256,241,344,354
7,NFYA,606,772,842,691,1773,1191,897,277,684,753,863,1091,984,913,843,949,1063,1120,1033,1933,1242,1006,849,1439,1368,706,892,1100,1181,1709,1185,805,305,930,731,642,919
8,STPG1,80,61,67,102,182,122,65,85,90,85,67,128,54,66,90,71,78,46,49,76,79,70,49,138,159,119,180,126,133,110,93,33,112,99,73,38,106
9,NIPAL3,1010,940,675,959,2466,2005,1089,180,757,1056,1069,1265,975,1011,959,1503,1610,977,735,1313,1486,1158,1168,1937,2466,1079,1855,1974,1986,1690,1924,507,522,1437,948,613,1405


First order of business: let's replace those opaque accession-number column labels. For each column, we'd like the label to encode the patient's sex, as well as whether the case was mild or severe. We'll get that info from the patient data CSV (in the `Sex` column, 0 = female and 1 = male).

In [19]:
patient_data = pd.read_csv("../RAW_DATA/patient_data.csv") # patient data
patient_data.head(40)

Unnamed: 0,Accession,Sex,Age,Batch,Hospital,WHO_LRTI,Severity
0,GSM4715942,0,1,0,0,2,mild
1,GSM4715943,0,0,0,0,1,mild
2,GSM4715947,0,1,0,0,4,severe
3,GSM4715948,1,0,0,0,2,mild
4,GSM4715951,0,0,1,0,1,mild
5,GSM4715952,1,0,1,0,1,mild
6,GSM4715953,1,1,1,0,2,mild
7,GSM4715954,1,3,1,0,4,severe
8,GSM4715956,0,2,1,0,4,severe
9,GSM4715957,0,1,1,0,1,mild


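There are few enough patients that we can enter the new names by hand. Each name has the form `sex_severity_n`, where `n` is a running count within the mild or severe group.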

In [20]:
new_names = {"GSM4715942": "F_mild_1", "GSM4715943": "F_mild_2", "GSM4715947": "F_severe_1",
             "GSM4715948": "M_mild_3", "GSM4715951": "F_mild_4", "GSM4715952": "M_mild_5",
             "GSM4715953": "M_mild_6", "GSM4715954": "M_severe_2", "GSM4715956": "F_severe_3",
             "GSM4715957": "F_mild_7", "GSM4715961": "M_mild_8", "GSM4715962": "M_mild_9", 
             "GSM4715963": "M_mild_10", "GSM4715964": "F_mild_11", "GSM4715965": "F_mild_12",
             "GSM4715966": "M_mild_13", "GSM4715968": "F_mild_14", "GSM4715970": "M_mild_15",
             "GSM4715971": "F_mild_16", "GSM4715972": "M_severe_4", "GSM4715973": "M_mild_17",
             "GSM4715975": "F_severe_5", "GSM4715977": "M_mild_18", "GSM4715980": "F_mild_19",
             "GSM4715986": "M_mild_20", "GSM4715987": "M_mild_21", "GSM4715988": "F_mild_22",
             "GSM4715989": "M_mild_23", "GSM4715990": "M_mild_24", "GSM4715994": "F_mild_25",
             "GSM4715995": "F_mild_26", "GSM4715996": "F_severe_6", "GSM4715997": "F_mild_27",
             "GSM4715999": "M_mild_28", "GSM4716002": "M_severe_7", "GSM4716003": "F_severe_8",
             "GSM4716004": "M_mild_29"}

mRNA_data = mRNA_data.rename(columns = new_names)
mRNA_data.head(10)

Unnamed: 0,Gene,F_mild_1,F_mild_2,F_severe_1,M_mild_3,F_mild_4,M_mild_5,M_mild_6,M_severe_2,F_severe_3,F_mild_7,M_mild_8,M_mild_9,M_mild_10,F_mild_11,F_mild_12,M_mild_13,F_mild_14,M_mild_15,F_mild_16,M_severe_4,M_mild_17,F_severe_5,M_mild_18,F_mild_19,M_mild_20,M_mild_21,F_mild_22,M_mild_23,M_mild_24,F_mild_25,F_mild_26,F_severe_6,F_mild_27,M_mild_28,M_severe_7,F_severe_8,M_mild_29
0,DPM1,201,259,257,248,810,353,333,125,367,351,325,270,392,433,313,425,354,361,348,607,407,320,514,442,543,321,400,454,476,484,428,211,76,307,333,194,263
1,SCYL3,635,454,519,585,1323,1013,964,339,695,617,634,624,541,558,503,705,705,540,514,890,622,791,654,818,1343,899,1111,1213,1294,1059,1058,540,295,820,759,397,834
2,C1orf112,174,173,144,219,393,242,323,102,218,202,341,186,215,412,117,299,228,146,217,322,230,324,274,315,482,283,371,372,384,363,361,134,105,304,201,174,209
3,FGR,1925,4931,4142,4080,4505,11083,8909,3411,3504,5208,4884,7127,2956,2763,5713,4106,5247,8067,3317,12291,7987,7154,4805,6989,5308,6174,5757,6331,6466,7845,4402,14466,718,4426,8914,3905,6426
4,CFH,5,16,24,16,45,21,7,53,30,42,25,19,15,34,37,10,29,8,24,18,52,16,17,32,63,17,11,30,13,37,7,1,40,20,27,11,19
5,FUCA2,135,228,206,164,316,301,311,64,205,240,230,219,142,208,290,194,283,290,161,436,341,238,264,329,451,386,389,371,391,258,377,141,64,216,215,100,230
6,GCLC,254,258,175,224,539,551,363,148,215,266,460,1152,340,587,635,476,1650,362,401,677,920,492,579,1377,1040,424,462,361,614,464,444,187,250,256,241,344,354
7,NFYA,606,772,842,691,1773,1191,897,277,684,753,863,1091,984,913,843,949,1063,1120,1033,1933,1242,1006,849,1439,1368,706,892,1100,1181,1709,1185,805,305,930,731,642,919
8,STPG1,80,61,67,102,182,122,65,85,90,85,67,128,54,66,90,71,78,46,49,76,79,70,49,138,159,119,180,126,133,110,93,33,112,99,73,38,106
9,NIPAL3,1010,940,675,959,2466,2005,1089,180,757,1056,1069,1265,975,1011,959,1503,1610,977,735,1313,1486,1158,1168,1937,2466,1079,1855,1974,1986,1690,1924,507,522,1437,948,613,1405
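
The hand-typed mapping could also be generated from the patient table. Here is a minimal sketch, assuming the `Sex` and `Severity` columns shown earlier and the same numbering rule (a running count within each severity group, in table order). The names `patients`, `sex_code`, `running`, and `generated_names` are illustrative, and only the first five patients are included.

```python
import pandas as pd

# First five rows of the patient table as displayed above
# (only the columns needed to build the labels).
patients = pd.DataFrame({
    "Accession": ["GSM4715942", "GSM4715943", "GSM4715947", "GSM4715948", "GSM4715951"],
    "Sex": [0, 0, 0, 1, 0],
    "Severity": ["mild", "mild", "severe", "mild", "mild"],
})

# 0 = female, 1 = male, as in the patient data.
sex_code = patients["Sex"].map({0: "F", 1: "M"})

# Running count within each severity group, starting at 1.
running = patients.groupby("Severity").cumcount() + 1

# Assemble labels such as "F_mild_1" and pair them with the accession numbers.
labels = sex_code + "_" + patients["Severity"] + "_" + running.astype(str)
generated_names = dict(zip(patients["Accession"], labels))
print(generated_names)
```

If the patient table lists accessions that are not columns of `mRNA_data`, it would need to be filtered first so the running counts line up with the hand-typed names.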


Now let's rearrange these columns so that mild and severe patients are in separate blocks.

In [21]:
mild_cases = []
severe_cases = []

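# Generate new ordering of column labels, with mild ones listed first.
# Patient labels look like "F_mild_1", so label[2] is the first letter of the severity.
# Other columns such as "Gene" match neither branch; "Gene" is re-added explicitly below.
for label in mRNA_data.columns.tolist():
    if label[2] == 'm':
        mild_cases.append(label)
    elif label[2] == 's':
        severe_cases.append(label)
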
mRNA_data = mRNA_data.reindex(columns = ["Gene"] + mild_cases + severe_cases)
mRNA_data.head(30)

Unnamed: 0,Gene,F_mild_1,F_mild_2,M_mild_3,F_mild_4,M_mild_5,M_mild_6,F_mild_7,M_mild_8,M_mild_9,M_mild_10,F_mild_11,F_mild_12,M_mild_13,F_mild_14,M_mild_15,F_mild_16,M_mild_17,M_mild_18,F_mild_19,M_mild_20,M_mild_21,F_mild_22,M_mild_23,M_mild_24,F_mild_25,F_mild_26,F_mild_27,M_mild_28,M_mild_29,F_severe_1,M_severe_2,F_severe_3,M_severe_4,F_severe_5,F_severe_6,M_severe_7,F_severe_8
0,DPM1,201,259,248,810,353,333,351,325,270,392,433,313,425,354,361,348,407,514,442,543,321,400,454,476,484,428,76,307,263,257,125,367,607,320,211,333,194
1,SCYL3,635,454,585,1323,1013,964,617,634,624,541,558,503,705,705,540,514,622,654,818,1343,899,1111,1213,1294,1059,1058,295,820,834,519,339,695,890,791,540,759,397
2,C1orf112,174,173,219,393,242,323,202,341,186,215,412,117,299,228,146,217,230,274,315,482,283,371,372,384,363,361,105,304,209,144,102,218,322,324,134,201,174
3,FGR,1925,4931,4080,4505,11083,8909,5208,4884,7127,2956,2763,5713,4106,5247,8067,3317,7987,4805,6989,5308,6174,5757,6331,6466,7845,4402,718,4426,6426,4142,3411,3504,12291,7154,14466,8914,3905
4,CFH,5,16,16,45,21,7,42,25,19,15,34,37,10,29,8,24,52,17,32,63,17,11,30,13,37,7,40,20,19,24,53,30,18,16,1,27,11
5,FUCA2,135,228,164,316,301,311,240,230,219,142,208,290,194,283,290,161,341,264,329,451,386,389,371,391,258,377,64,216,230,206,64,205,436,238,141,215,100
6,GCLC,254,258,224,539,551,363,266,460,1152,340,587,635,476,1650,362,401,920,579,1377,1040,424,462,361,614,464,444,250,256,354,175,148,215,677,492,187,241,344
7,NFYA,606,772,691,1773,1191,897,753,863,1091,984,913,843,949,1063,1120,1033,1242,849,1439,1368,706,892,1100,1181,1709,1185,305,930,919,842,277,684,1933,1006,805,731,642
8,STPG1,80,61,102,182,122,65,85,67,128,54,66,90,71,78,46,49,79,49,138,159,119,180,126,133,110,93,112,99,106,67,85,90,76,70,33,73,38
9,NIPAL3,1010,940,959,2466,2005,1089,1056,1069,1265,975,1011,959,1503,1610,977,735,1486,1168,1937,2466,1079,1855,1974,1986,1690,1924,522,1437,1405,675,180,757,1313,1158,507,948,613
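
Checking the third character works for the labels used here, but it depends on the prefix being exactly two characters long. As a hedged alternative, the same split can be done by looking for the severity word itself. This sketch uses a small made-up frame (`toy_columns`) with a few of the labels from above.

```python
import pandas as pd

# Empty frame carrying a handful of the column labels used in this notebook.
toy_columns = pd.DataFrame(
    columns=["Gene", "F_mild_1", "F_severe_1", "M_mild_3", "M_severe_2", "F_mild_4"]
)

# Match on the severity word rather than a fixed character position.
mild = [c for c in toy_columns.columns if "_mild_" in c]
severe = [c for c in toy_columns.columns if "_severe_" in c]

# Gene first, then the mild block, then the severe block.
ordered = ["Gene"] + mild + severe
print(ordered)
```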


Now let's pull in the data for our controls. We'll want to merge it into our existing dataframe.

In [22]:
# Read in controls dataframe
controls_data = pd.read_csv("controls.csv")
print(controls_data.columns.tolist())
controls_data.head(10)

['Gene', 'Case 1', 'Case 2', 'Case 3', 'Case 4', 'Case 5', 'Case 6', 'Case 7', 'Case 8', 'Case 9', 'Case 10', 'Case 11', 'Case 12', 'Case 13', 'Case 14', 'Case 15', 'Case 16', 'Case 17', 'Case 18', 'Case 19', 'Case 20', 'Case 21', 'Case 22', 'Case 23', 'Case 24', 'Case 25', 'Case 26', 'Case 27', 'Case 28', 'Case 29', 'Case 30', 'Case 31', 'Case 32', 'Case 33', 'Case 34', 'Case 35', 'Case 36', 'Case 37', 'Case 38', 'Case 39', 'Case 40', 'Case 41', 'Case 42', 'Case 43', 'Case 44', 'Case 45', 'Case 46', 'Case 47', 'Case 48', 'Case 49', 'Case 50', 'Case 51', 'Case 52', 'Case 53', 'Case 54', 'Case 55', 'Case 56', 'Case 57', 'Case 58', 'Case 59', 'Case 60', 'Case 61', 'Case 62', 'Case 63', 'Case 64']


Unnamed: 0,Gene,Case 1,Case 2,Case 3,Case 4,Case 5,Case 6,Case 7,Case 8,Case 9,Case 10,Case 11,Case 12,Case 13,Case 14,Case 15,Case 16,Case 17,Case 18,Case 19,Case 20,Case 21,Case 22,Case 23,Case 24,Case 25,Case 26,Case 27,Case 28,Case 29,Case 30,Case 31,Case 32,Case 33,Case 34,Case 35,Case 36,Case 37,Case 38,Case 39,Case 40,Case 41,Case 42,Case 43,Case 44,Case 45,Case 46,Case 47,Case 48,Case 49,Case 50,Case 51,Case 52,Case 53,Case 54,Case 55,Case 56,Case 57,Case 58,Case 59,Case 60,Case 61,Case 62,Case 63,Case 64
0,TSPAN6,7,12,7,7,5,6,5,3,3,4,41,17,2,28,25,44,58,4,2,5,7,6,3,0,7,4,11,11,5,5,3,11,8,5,8,19,10,5,6,21,9,16,7,11,5,53,10,18,22,8,12,27,9,4,18,3,60,34,22,37,1,18,6,29
1,TNMD,0,1,0,0,0,0,0,0,0,0,0,0,0,8,13,10,13,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,4,2,0,0,0,19,1,0,2,0,1,0,0
2,DPM1,227,201,259,224,154,143,257,248,133,147,810,353,333,125,164,367,351,199,253,220,325,270,392,433,313,425,431,354,284,361,348,607,407,249,320,374,514,307,505,442,709,586,455,606,519,543,321,400,454,476,336,285,386,484,428,211,76,312,307,391,172,333,194,263
3,SCYL3,500,635,454,643,473,309,519,585,413,333,1323,1013,964,339,330,695,617,570,725,677,634,624,541,558,503,705,749,705,497,540,514,890,622,601,791,641,654,688,754,818,882,998,873,914,882,1343,899,1111,1213,1294,1026,567,993,1059,1058,540,295,892,820,1011,378,759,397,834
4,C1orf112,158,174,173,157,134,131,144,219,116,79,393,242,323,102,106,218,202,167,267,270,341,186,215,412,117,299,227,228,194,146,217,322,230,177,324,226,274,273,251,315,379,317,306,313,235,482,283,371,372,384,262,208,248,363,361,134,105,317,304,341,128,201,174,209
5,FGR,4723,1925,4931,10412,6132,3896,4142,4080,2408,3355,4505,11083,8909,3411,2203,3504,5208,7497,8619,4944,4884,7127,2956,2763,5713,4106,5945,5247,6825,8067,3317,12291,7987,9422,7154,5977,4805,5964,8811,6989,11269,8919,6361,10061,11220,5308,6174,5757,6331,6466,7375,2746,6846,7845,4402,14466,718,3665,4426,6031,5007,8914,3905,6426
6,CFH,7,5,16,9,27,11,24,16,2,3,45,21,7,53,41,30,42,26,18,11,25,19,15,34,37,10,14,29,10,8,24,18,52,19,16,14,17,19,10,32,23,8,87,15,17,63,17,11,30,13,14,26,40,37,7,1,40,30,20,18,8,27,11,19
7,FUCA2,158,135,228,176,195,170,206,164,108,140,316,301,311,64,53,205,240,150,174,145,230,219,142,208,290,194,333,283,175,290,161,436,341,281,238,194,264,268,267,329,377,278,275,360,363,451,386,389,371,391,440,195,310,258,377,141,64,245,216,314,191,215,100,230
8,GCLC,184,254,258,248,241,179,175,224,269,288,539,551,363,148,110,215,266,707,406,894,460,1152,340,587,635,476,666,1650,886,362,401,677,920,690,492,916,579,496,404,1377,776,847,722,1306,544,1040,424,462,361,614,380,170,243,464,444,187,250,448,256,298,281,241,344,354
9,NFYA,616,606,772,1001,769,438,842,691,786,883,1773,1191,897,277,372,684,753,1384,1401,1071,863,1091,984,913,843,949,1065,1063,1212,1120,1033,1933,1242,822,1006,1137,849,947,1449,1439,1220,1836,1551,1268,1991,1368,706,892,1100,1181,797,515,1217,1709,1185,805,305,1095,930,1081,492,731,642,919
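
A note on combining two gene tables like these: `DataFrame.merge` performs an inner join by default, so any gene that appears in only one table is dropped without warning. The sketch below, on two small made-up frames (`patients_toy` and `controls_toy`, with a few counts copied from the previews above), shows one way to see which genes would be lost. `validate="one_to_one"` additionally raises an error if a gene name is repeated in either table.

```python
import pandas as pd

# Toy stand-ins for the patient and control tables.
patients_toy = pd.DataFrame({"Gene": ["DPM1", "SCYL3", "FGR"],
                             "F_mild_1": [201, 635, 1925]})
controls_toy = pd.DataFrame({"Gene": ["TSPAN6", "DPM1", "SCYL3"],
                             "Case 1": [7, 227, 500]})

# Default (inner) join keeps only genes present in both tables.
inner = patients_toy.merge(controls_toy, on="Gene", validate="one_to_one")

# An outer join with indicator=True labels where each row came from,
# which makes the dropped genes easy to list.
audit = patients_toy.merge(controls_toy, on="Gene", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", "Gene"].tolist()

print(inner["Gene"].tolist())
print(sorted(dropped))
```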


In [23]:
print(mRNA_data.columns.tolist())
mRNA_data = mRNA_data.merge(controls_data, on = "Gene")
mRNA_data.head(20)

['Gene', 'F_mild_1', 'F_mild_2', 'M_mild_3', 'F_mild_4', 'M_mild_5', 'M_mild_6', 'F_mild_7', 'M_mild_8', 'M_mild_9', 'M_mild_10', 'F_mild_11', 'F_mild_12', 'M_mild_13', 'F_mild_14', 'M_mild_15', 'F_mild_16', 'M_mild_17', 'M_mild_18', 'F_mild_19', 'M_mild_20', 'M_mild_21', 'F_mild_22', 'M_mild_23', 'M_mild_24', 'F_mild_25', 'F_mild_26', 'F_mild_27', 'M_mild_28', 'M_mild_29', 'F_severe_1', 'M_severe_2', 'F_severe_3', 'M_severe_4', 'F_severe_5', 'F_severe_6', 'M_severe_7', 'F_severe_8']


Unnamed: 0,Gene,F_mild_1,F_mild_2,M_mild_3,F_mild_4,M_mild_5,M_mild_6,F_mild_7,M_mild_8,M_mild_9,M_mild_10,F_mild_11,F_mild_12,M_mild_13,F_mild_14,M_mild_15,F_mild_16,M_mild_17,M_mild_18,F_mild_19,M_mild_20,M_mild_21,F_mild_22,M_mild_23,M_mild_24,F_mild_25,F_mild_26,F_mild_27,M_mild_28,M_mild_29,F_severe_1,M_severe_2,F_severe_3,M_severe_4,F_severe_5,F_severe_6,M_severe_7,F_severe_8,Case 1,Case 2,Case 3,Case 4,Case 5,Case 6,Case 7,Case 8,Case 9,Case 10,Case 11,Case 12,Case 13,Case 14,Case 15,Case 16,Case 17,Case 18,Case 19,Case 20,Case 21,Case 22,Case 23,Case 24,Case 25,Case 26,Case 27,Case 28,Case 29,Case 30,Case 31,Case 32,Case 33,Case 34,Case 35,Case 36,Case 37,Case 38,Case 39,Case 40,Case 41,Case 42,Case 43,Case 44,Case 45,Case 46,Case 47,Case 48,Case 49,Case 50,Case 51,Case 52,Case 53,Case 54,Case 55,Case 56,Case 57,Case 58,Case 59,Case 60,Case 61,Case 62,Case 63,Case 64
0,DPM1,201,259,248,810,353,333,351,325,270,392,433,313,425,354,361,348,407,514,442,543,321,400,454,476,484,428,76,307,263,257,125,367,607,320,211,333,194,227,201,259,224,154,143,257,248,133,147,810,353,333,125,164,367,351,199,253,220,325,270,392,433,313,425,431,354,284,361,348,607,407,249,320,374,514,307,505,442,709,586,455,606,519,543,321,400,454,476,336,285,386,484,428,211,76,312,307,391,172,333,194,263
1,SCYL3,635,454,585,1323,1013,964,617,634,624,541,558,503,705,705,540,514,622,654,818,1343,899,1111,1213,1294,1059,1058,295,820,834,519,339,695,890,791,540,759,397,500,635,454,643,473,309,519,585,413,333,1323,1013,964,339,330,695,617,570,725,677,634,624,541,558,503,705,749,705,497,540,514,890,622,601,791,641,654,688,754,818,882,998,873,914,882,1343,899,1111,1213,1294,1026,567,993,1059,1058,540,295,892,820,1011,378,759,397,834
2,C1orf112,174,173,219,393,242,323,202,341,186,215,412,117,299,228,146,217,230,274,315,482,283,371,372,384,363,361,105,304,209,144,102,218,322,324,134,201,174,158,174,173,157,134,131,144,219,116,79,393,242,323,102,106,218,202,167,267,270,341,186,215,412,117,299,227,228,194,146,217,322,230,177,324,226,274,273,251,315,379,317,306,313,235,482,283,371,372,384,262,208,248,363,361,134,105,317,304,341,128,201,174,209
3,FGR,1925,4931,4080,4505,11083,8909,5208,4884,7127,2956,2763,5713,4106,5247,8067,3317,7987,4805,6989,5308,6174,5757,6331,6466,7845,4402,718,4426,6426,4142,3411,3504,12291,7154,14466,8914,3905,4723,1925,4931,10412,6132,3896,4142,4080,2408,3355,4505,11083,8909,3411,2203,3504,5208,7497,8619,4944,4884,7127,2956,2763,5713,4106,5945,5247,6825,8067,3317,12291,7987,9422,7154,5977,4805,5964,8811,6989,11269,8919,6361,10061,11220,5308,6174,5757,6331,6466,7375,2746,6846,7845,4402,14466,718,3665,4426,6031,5007,8914,3905,6426
4,CFH,5,16,16,45,21,7,42,25,19,15,34,37,10,29,8,24,52,17,32,63,17,11,30,13,37,7,40,20,19,24,53,30,18,16,1,27,11,7,5,16,9,27,11,24,16,2,3,45,21,7,53,41,30,42,26,18,11,25,19,15,34,37,10,14,29,10,8,24,18,52,19,16,14,17,19,10,32,23,8,87,15,17,63,17,11,30,13,14,26,40,37,7,1,40,30,20,18,8,27,11,19
5,FUCA2,135,228,164,316,301,311,240,230,219,142,208,290,194,283,290,161,341,264,329,451,386,389,371,391,258,377,64,216,230,206,64,205,436,238,141,215,100,158,135,228,176,195,170,206,164,108,140,316,301,311,64,53,205,240,150,174,145,230,219,142,208,290,194,333,283,175,290,161,436,341,281,238,194,264,268,267,329,377,278,275,360,363,451,386,389,371,391,440,195,310,258,377,141,64,245,216,314,191,215,100,230
6,GCLC,254,258,224,539,551,363,266,460,1152,340,587,635,476,1650,362,401,920,579,1377,1040,424,462,361,614,464,444,250,256,354,175,148,215,677,492,187,241,344,184,254,258,248,241,179,175,224,269,288,539,551,363,148,110,215,266,707,406,894,460,1152,340,587,635,476,666,1650,886,362,401,677,920,690,492,916,579,496,404,1377,776,847,722,1306,544,1040,424,462,361,614,380,170,243,464,444,187,250,448,256,298,281,241,344,354
7,NFYA,606,772,691,1773,1191,897,753,863,1091,984,913,843,949,1063,1120,1033,1242,849,1439,1368,706,892,1100,1181,1709,1185,305,930,919,842,277,684,1933,1006,805,731,642,616,606,772,1001,769,438,842,691,786,883,1773,1191,897,277,372,684,753,1384,1401,1071,863,1091,984,913,843,949,1065,1063,1212,1120,1033,1933,1242,822,1006,1137,849,947,1449,1439,1220,1836,1551,1268,1991,1368,706,892,1100,1181,797,515,1217,1709,1185,805,305,1095,930,1081,492,731,642,919
8,STPG1,80,61,102,182,122,65,85,67,128,54,66,90,71,78,46,49,79,49,138,159,119,180,126,133,110,93,112,99,106,67,85,90,76,70,33,73,38,51,80,61,54,53,50,67,102,92,61,182,122,65,85,81,90,85,32,53,83,67,128,54,66,90,71,101,78,50,46,49,76,79,47,70,60,49,54,78,138,106,120,85,92,46,159,119,180,126,133,87,69,73,110,93,33,112,116,99,96,20,73,38,106
9,NIPAL3,1010,940,959,2466,2005,1089,1056,1069,1265,975,1011,959,1503,1610,977,735,1486,1168,1937,2466,1079,1855,1974,1986,1690,1924,522,1437,1405,675,180,757,1313,1158,507,948,613,552,1010,940,601,959,465,675,959,475,404,2466,2005,1089,180,143,757,1056,595,845,954,1069,1265,975,1011,959,1503,1598,1610,665,977,735,1313,1486,700,1158,1472,1168,1102,1239,1937,1280,1654,1519,1444,868,2466,1079,1855,1974,1986,1465,947,829,1690,1924,507,522,2028,1437,1234,395,948,613,1405
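
One thing worth checking after the merge: in the rows displayed above, some control columns show the same counts as patient columns (`Case 2` matches `F_mild_1`, and `Case 3` matches `F_mild_2`). A preview cannot tell us whether whole columns coincide, so here is a minimal sketch of a check, run on a toy frame copied from the first five rows shown above. The name `toy` and the label-matching rules are assumptions for illustration.

```python
import pandas as pd

# First five genes for a few columns, copied from the merged preview above.
toy = pd.DataFrame({
    "Gene": ["DPM1", "SCYL3", "C1orf112", "FGR", "CFH"],
    "F_mild_1": [201, 635, 174, 1925, 5],
    "F_mild_2": [259, 454, 173, 4931, 16],
    "Case 1": [227, 500, 158, 4723, 7],
    "Case 2": [201, 635, 174, 1925, 5],
    "Case 3": [259, 454, 173, 4931, 16],
})

patient_cols = [c for c in toy.columns if "_mild_" in c or "_severe_" in c]
control_cols = [c for c in toy.columns if c.startswith("Case ")]

# Collect every (patient, control) pair whose counts agree for all genes.
matches = [(p, c) for p in patient_cols for c in control_cols
           if (toy[p] == toy[c]).all()]
print(matches)
```

Running the same comparison on the full merged dataframe would show whether any control column is an exact copy of a patient column.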


Great! Next, let's normalize each column by that patient's total mRNA count, then take the natural log of each proportion.

In [24]:
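# Helper: natural log of a proportion, with zeros left at 0.
# Caveat: every nonzero proportion has a negative log, so a gene with a zero
# count ends up with the largest value in its column.
def logarithm(element):
    if element > 0:
        return math.log(element)
    else:
        return 0

# Normalize by total mRNA content per patient: divide each column by its sum to get proportions.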
for label in mRNA_data.columns.tolist()[1:]:
    mRNA_data[label] = mRNA_data[label].astype(int)
    col_sum = mRNA_data[label].sum()
    mRNA_data[label] = (mRNA_data[label].div(col_sum))
    mRNA_data[label] = mRNA_data[label].apply(logarithm) # Take log proportion to avoid really small numbers.

# Let's take a look again.
mRNA_data.head(10)

Unnamed: 0,Gene,F_mild_1,F_mild_2,M_mild_3,F_mild_4,M_mild_5,M_mild_6,F_mild_7,M_mild_8,M_mild_9,M_mild_10,F_mild_11,F_mild_12,M_mild_13,F_mild_14,M_mild_15,F_mild_16,M_mild_17,M_mild_18,F_mild_19,M_mild_20,M_mild_21,F_mild_22,M_mild_23,M_mild_24,F_mild_25,F_mild_26,F_mild_27,M_mild_28,M_mild_29,F_severe_1,M_severe_2,F_severe_3,M_severe_4,F_severe_5,F_severe_6,M_severe_7,F_severe_8,Case 1,Case 2,Case 3,Case 4,Case 5,Case 6,Case 7,Case 8,Case 9,Case 10,Case 11,Case 12,Case 13,Case 14,Case 15,Case 16,Case 17,Case 18,Case 19,Case 20,Case 21,Case 22,Case 23,Case 24,Case 25,Case 26,Case 27,Case 28,Case 29,Case 30,Case 31,Case 32,Case 33,Case 34,Case 35,Case 36,Case 37,Case 38,Case 39,Case 40,Case 41,Case 42,Case 43,Case 44,Case 45,Case 46,Case 47,Case 48,Case 49,Case 50,Case 51,Case 52,Case 53,Case 54,Case 55,Case 56,Case 57,Case 58,Case 59,Case 60,Case 61,Case 62,Case 63,Case 64
0,DPM1,-10.495318,-10.537266,-10.411985,-10.035166,-10.775418,-10.76328,-10.33238,-10.431805,-10.781632,-10.144019,-10.213861,-10.507874,-10.234819,-10.608586,-10.517203,-10.325758,-10.556518,-10.090653,-10.552841,-10.438853,-10.59839,-10.564971,-10.500881,-10.529793,-10.509058,-10.479612,-10.824071,-10.540797,-10.719486,-10.647084,-10.593963,-10.075558,-10.441338,-10.574109,-11.040751,-10.552491,-10.653034,-10.573358,-10.494605,-10.536768,-11.076896,-11.09473,-10.74624,-10.646705,-10.411492,-10.773244,-10.804425,-10.034403,-10.774905,-10.762649,-10.593538,-10.280745,-10.07466,-10.33164,-11.051653,-10.899309,-10.815565,-10.431016,-10.781135,-10.143308,-10.213,-10.507164,-10.234045,-10.496478,-10.608011,-10.668865,-10.51659,-10.325073,-10.440802,-10.555987,-10.801691,-10.573476,-10.560253,-10.089866,-10.586054,-10.38264,-10.552261,-10.21406,-10.441881,-10.444915,-10.376679,-10.628495,-10.43801,-10.597669,-10.564167,-10.500057,-10.529017,-10.678219,-10.247863,-10.640927,-10.508425,-10.47888,-11.040543,-10.823269,-10.646576,-10.54009,-10.436501,-10.875967,-10.551842,-10.65271,-10.718914
1,SCYL3,-9.344997,-9.975997,-9.553802,-9.544543,-9.721215,-9.700331,-9.768298,-9.763581,-9.943903,-9.821862,-9.96024,-10.033487,-9.728711,-9.919686,-10.114512,-9.935737,-10.132391,-9.849768,-9.937288,-9.533301,-9.568548,-9.543419,-9.518126,-9.529717,-9.726063,-9.5746,-9.467829,-9.55834,-9.565407,-9.944256,-9.596277,-9.437008,-10.058645,-9.669132,-10.10104,-9.728632,-9.936955,-9.7837,-9.344285,-9.975499,-10.022397,-9.972587,-9.975744,-9.943877,-9.553309,-9.640145,-9.986715,-9.54378,-9.720702,-9.6997,-9.595852,-9.581518,-9.43611,-9.767557,-9.999321,-9.846527,-9.691521,-9.762792,-9.943407,-9.821151,-9.959379,-10.032777,-9.727936,-9.943847,-9.91911,-10.109249,-10.113899,-9.935052,-10.058109,-10.13186,-9.920549,-9.668499,-10.021479,-9.848982,-9.779113,-9.981807,-9.936708,-9.995723,-9.909448,-9.793277,-9.965729,-10.098207,-9.532458,-9.567827,-9.542616,-9.517302,-9.528941,-9.561907,-9.559993,-9.696034,-9.725429,-9.573867,-10.100832,-9.467027,-9.596113,-9.557634,-9.486513,-10.088568,-9.727983,-9.936632,-9.564835
2,C1orf112,-10.639567,-10.940803,-10.536342,-10.758391,-11.152948,-10.79377,-10.884899,-10.383748,-11.154307,-10.744643,-10.263576,-11.491904,-10.586465,-11.048538,-11.422475,-10.798063,-11.127252,-10.719748,-10.891578,-10.558018,-10.724384,-10.640233,-10.700084,-10.744568,-10.79674,-10.649858,-10.500844,-10.550617,-10.949306,-11.226347,-10.797304,-10.596425,-11.075315,-10.561687,-11.494769,-11.057329,-10.761836,-10.935713,-10.638854,-10.940304,-11.432296,-11.233843,-10.833888,-11.225967,-10.535849,-10.910003,-11.42541,-10.757628,-11.152435,-10.793139,-10.796879,-10.717172,-10.595526,-10.884158,-11.226964,-10.84545,-10.61077,-10.382959,-11.15381,-10.743932,-10.262714,-11.491193,-10.585691,-11.137636,-11.047962,-11.049981,-11.421861,-10.797378,-11.074779,-11.126721,-11.142995,-10.561054,-11.063974,-10.718961,-10.70343,-11.081746,-10.890998,-10.840379,-11.0563,-10.841627,-11.037356,-11.420814,-10.557175,-10.723664,-10.639429,-10.69926,-10.743792,-10.926985,-10.562814,-11.083336,-10.796107,-10.649125,-11.494561,-10.500042,-10.630677,-10.549911,-10.573326,-11.171432,-11.05668,-10.761513,-10.948734
3,FGR,-8.235941,-7.590797,-7.611562,-8.319257,-7.328718,-7.476605,-7.635215,-7.72191,-7.508408,-8.123689,-8.360527,-7.603578,-7.966704,-7.912472,-7.410544,-8.071145,-7.57976,-7.855464,-7.792058,-8.158992,-7.641729,-7.898263,-7.865765,-7.920898,-7.723511,-8.148921,-8.578335,-7.872393,-7.523533,-7.867226,-7.287516,-7.819259,-7.433244,-7.467003,-6.813053,-7.265256,-7.650879,-7.538109,-8.235228,-7.590299,-7.237827,-7.410407,-7.441379,-7.866847,-7.611069,-7.877041,-7.676651,-8.318494,-7.328206,-7.475974,-7.287091,-7.683036,-7.818361,-7.634475,-7.422699,-7.370974,-7.703262,-7.721121,-7.507911,-8.122978,-8.359665,-7.602868,-7.96593,-7.87228,-7.911896,-7.489491,-7.409931,-8.070459,-7.432708,-7.57923,-7.168342,-7.46637,-7.788835,-7.854677,-7.619405,-7.523443,-7.791478,-7.448105,-7.719262,-7.807271,-7.567137,-7.554946,-8.158149,-7.641008,-7.89746,-7.864941,-7.920121,-7.589479,-7.982451,-7.765345,-7.722878,-8.148188,-6.812845,-8.577533,-8.182995,-7.871687,-7.70054,-7.50487,-7.264606,-7.650555,-7.52296
4,CFH,-14.189185,-13.321506,-13.152825,-12.925538,-13.597364,-14.625512,-12.455497,-12.996754,-13.435615,-13.407231,-12.758238,-12.64316,-13.984323,-13.110588,-14.32664,-12.999907,-12.614087,-13.499663,-13.178415,-12.592828,-13.536618,-14.15854,-13.217781,-14.130261,-13.080225,-14.592825,-11.465925,-13.271912,-13.347201,-13.018107,-11.451985,-12.579722,-13.959495,-13.569841,-16.392609,-13.064797,-13.522996,-14.052398,-14.188472,-13.321007,-14.291317,-12.835846,-13.31119,-13.017727,-13.152332,-14.970446,-14.696246,-12.924775,-13.596851,-14.624881,-11.45156,-11.667039,-12.578824,-12.454756,-13.086861,-13.542327,-13.811297,-12.995965,-13.435118,-13.40652,-12.757377,-12.642449,-13.983549,-13.923529,-13.110012,-14.015254,-14.326026,-12.999221,-13.958959,-12.613556,-13.374705,-13.569209,-13.845451,-13.498876,-13.368462,-14.304614,-13.177835,-13.642421,-14.73576,-12.099304,-14.075509,-14.047186,-12.591985,-13.535897,-14.157736,-13.216957,-14.129485,-13.856273,-12.642255,-12.907885,-13.079592,-14.592093,-16.392401,-11.465123,-12.988381,-13.271206,-13.514837,-13.94402,-13.064148,-13.522673,-13.346629
5,FUCA2,-10.893348,-10.664749,-10.825548,-10.976458,-10.934776,-10.831629,-10.712528,-10.777551,-10.990982,-11.159454,-10.947061,-10.584197,-11.01905,-10.832436,-10.7362,-11.096556,-10.733448,-10.756927,-10.848093,-10.624495,-10.413993,-10.592856,-10.702776,-10.726503,-11.138183,-10.60649,-10.995921,-10.892366,-10.853561,-10.868284,-11.263394,-10.65791,-10.772225,-10.870159,-11.443849,-10.989996,-11.315722,-10.935713,-10.892635,-10.66425,-11.318058,-10.858683,-10.573287,-10.867905,-10.825055,-10.981462,-10.853216,-10.975695,-10.934263,-10.830998,-11.262969,-11.410319,-10.657011,-10.711787,-11.334322,-11.273643,-11.232458,-10.776762,-10.990485,-11.158743,-10.9462,-10.583486,-11.018276,-10.754444,-10.831861,-11.153053,-10.735587,-11.095871,-10.771688,-10.732918,-10.68079,-10.869527,-11.216651,-10.75614,-10.721914,-11.01995,-10.847513,-10.84567,-11.18758,-10.948441,-10.897455,-10.985996,-10.623652,-10.413273,-10.592052,-10.701952,-10.725727,-10.408555,-10.627352,-10.860193,-11.13755,-10.605758,-11.443641,-10.995119,-10.888321,-10.89166,-10.655815,-10.771189,-10.989347,-11.315398,-10.852989
6,GCLC,-10.261288,-10.541135,-10.513768,-10.442485,-10.330151,-10.677019,-10.60967,-10.084404,-9.330799,-10.286335,-9.909574,-9.800453,-10.121491,-9.069353,-10.514437,-10.183999,-9.740957,-9.971573,-9.416488,-9.788986,-10.320097,-10.42087,-10.7301,-10.275216,-10.551259,-10.442911,-9.633343,-10.722467,-10.422343,-11.031374,-10.425065,-10.610282,-10.332196,-10.143951,-11.1615,-10.875837,-10.08025,-10.783372,-10.260575,-10.540636,-10.975113,-10.646886,-10.521699,-11.030995,-10.513275,-10.068881,-10.131897,-10.441722,-10.329638,-10.676388,-10.42464,-10.680131,-10.609383,-10.60893,-9.783927,-10.426345,-9.413486,-10.083614,-9.330302,-10.285624,-9.908713,-9.799742,-10.120716,-10.061296,-9.068777,-9.531122,-10.513824,-10.183314,-10.331659,-9.740426,-9.782453,-10.143319,-9.664492,-9.970787,-10.106325,-10.605784,-9.415908,-10.123763,-10.073501,-9.983187,-9.608835,-10.58145,-9.788143,-10.319377,-10.420066,-10.729276,-10.274439,-10.555159,-10.764554,-11.103703,-10.550625,-10.442178,-11.161293,-9.632541,-10.284786,-10.721761,-10.708115,-10.385107,-10.875188,-10.079926,-10.421771
7,NFYA,-9.391743,-9.44511,-9.387274,-9.251772,-9.559338,-9.772366,-9.569101,-9.455215,-9.385204,-9.223655,-9.467863,-9.517111,-9.4315,-9.509033,-9.384997,-9.237738,-9.440853,-9.588817,-9.372447,-9.514857,-9.810216,-9.762969,-9.615913,-9.621094,-9.247479,-9.461238,-9.434492,-9.43246,-9.468354,-9.46038,-9.798259,-9.452962,-9.283038,-9.428693,-9.701767,-9.766221,-9.456303,-9.575061,-9.39103,-9.444612,-9.579787,-9.486592,-9.626866,-9.460001,-9.386781,-8.996636,-9.011533,-9.251009,-9.558825,-9.771735,-9.797835,-9.461717,-9.452063,-9.568361,-9.112225,-9.187757,-9.232844,-9.454426,-9.384707,-9.222944,-9.467002,-9.5164,-9.430725,-9.591856,-9.508457,-9.217812,-9.384384,-9.237053,-9.282502,-9.440322,-9.607404,-9.42806,-9.44836,-9.58803,-9.459602,-9.32857,-9.371867,-9.671309,-9.299857,-9.218557,-9.638363,-9.284007,-9.514014,-9.809495,-9.762165,-9.615089,-9.620318,-9.814475,-9.656185,-9.492621,-9.246846,-9.460505,-9.701559,-9.433691,-9.391069,-9.431754,-9.419567,-9.824983,-9.765571,-9.45598,-9.467782
8,STPG1,-11.416596,-11.983221,-11.300441,-11.528194,-11.837865,-12.397035,-11.750515,-12.010938,-11.528023,-12.126297,-12.094944,-11.754268,-12.024229,-12.121175,-12.57744,-12.28614,-12.195883,-12.441056,-11.716897,-11.667058,-11.590707,-11.363478,-11.782696,-11.804862,-11.990663,-12.006136,-10.436305,-11.672524,-11.628201,-11.991468,-10.979626,-11.48111,-12.519133,-12.093935,-12.896101,-12.070175,-12.283306,-12.066482,-11.415883,-11.982722,-12.499558,-12.161391,-11.797062,-11.991088,-11.299948,-11.141804,-11.683984,-11.527431,-11.837352,-12.396404,-10.979201,-10.986162,-11.480212,-11.749775,-12.879222,-12.462406,-11.790351,-12.010148,-11.527527,-12.125586,-12.094083,-11.753558,-12.023454,-11.947466,-12.120599,-12.405816,-12.576826,-12.285455,-12.518597,-12.195352,-12.468997,-12.093302,-12.390164,-12.440269,-12.323917,-12.25049,-11.716317,-12.114476,-12.02771,-12.122561,-12.261771,-13.051758,-11.666215,-11.589987,-11.362675,-11.781872,-11.804085,-12.029422,-11.666245,-12.306305,-11.990029,-12.005403,-12.895894,-10.435503,-11.635989,-11.671818,-11.84086,-13.02773,-12.069525,-12.282982,-11.627629
9,NIPAL3,-8.880917,-9.248215,-9.059523,-8.921848,-9.038487,-9.578407,-9.230923,-9.241151,-9.237226,-9.232844,-9.365904,-9.388187,-8.97169,-9.093894,-9.521595,-9.57809,-9.261488,-9.269828,-9.075255,-8.92561,-9.386041,-9.030795,-9.031161,-9.101333,-9.258659,-8.976574,-8.897137,-8.997331,-9.043848,-9.681448,-10.22932,-9.351556,-9.669797,-9.28798,-10.164098,-9.506279,-9.502527,-9.68476,-8.880204,-9.247716,-10.089947,-9.265792,-9.567048,-9.681068,-9.05903,-9.500278,-9.793443,-8.921085,-9.037974,-9.577776,-10.228895,-10.417766,-9.350658,-9.230183,-9.956396,-9.693362,-9.348528,-9.240362,-9.23673,-9.232132,-9.365042,-9.387476,-8.970916,-9.186078,-9.093318,-9.818052,-9.520981,-9.577405,-9.669261,-9.260957,-9.768064,-9.287348,-9.190131,-9.269041,-9.308019,-9.485139,-9.074675,-9.6233,-9.404249,-9.239405,-9.508387,-10.114208,-8.924767,-9.385321,-9.029991,-9.030337,-9.100557,-9.205719,-9.047053,-9.876545,-9.258026,-8.975841,-10.16389,-8.896335,-8.774773,-8.996625,-9.287192,-10.044576,-9.50563,-9.502203,-9.043275
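
For reference, the same log-proportion transform can be written without a Python-level loop. This is a sketch on made-up toy counts, not the code used to produce the table above, and it differs in one deliberate way: zero counts become `NaN` instead of 0, so they cannot be mistaken for a proportion of 1. The names `toy`, `counts`, `proportions`, and `log_props` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up counts; the zero is included to show how it is handled.
toy = pd.DataFrame({
    "Gene": ["DPM1", "SCYL3", "CFH"],
    "F_mild_1": [201, 635, 0],
    "F_mild_2": [259, 454, 16],
})

counts = toy.set_index("Gene")

# Divide every column by its own total to get per-patient proportions.
proportions = counts / counts.sum(axis=0)

# Mask zeros as NaN before taking the natural log.
log_props = np.log(proportions.where(proportions > 0))
print(log_props)
```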


Perfect! Let's write this into a CSV.

In [25]:
mRNA_data.to_csv("normalized_mRNA_counts.csv", index = False)