# Phase 1: Data Preprocessing
**Shreya Das**

In this phase we are going to do some data clean-up and preprocessing. This is an important step because raw data may contain missing or incorrect values that are not compatible with the data analysis we will perform in later phases.

## Importing Libraries
Here we will import the pandas package for data manipulation of the cancer dataset. We will then load the cancer CSV file into a dataframe so that we can view it in a table format.

In [16]:
import pandas as pd
df = pd.read_csv("Brain_GSE50161 2.csv")
df

Unnamed: 0,samples,type,1007_s_at,1053_at,117_at,121_at,1255_g_at,1294_at,1316_at,1320_at,...,AFFX-r2-Ec-bioD-3_at,AFFX-r2-Ec-bioD-5_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at,AFFX-ThrX-3_at,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at
0,834,ependymoma,12.498150,7.604868,6.880934,9.027128,4.176175,7.224920,6.085942,6.835999,...,9.979005,9.926470,12.719785,12.777792,5.403657,4.870548,4.047380,3.721936,4.516434,4.749940
1,835,ependymoma,13.067436,7.998090,7.209076,9.723322,4.826126,7.539381,6.250962,8.012549,...,11.924749,11.215930,13.605662,13.401342,5.224555,4.895315,3.786437,3.564481,4.430891,4.491416
2,836,ependymoma,13.068179,8.573674,8.647684,9.613002,4.396581,7.813101,6.007746,7.178156,...,12.154405,11.532460,13.764593,13.477800,5.303565,5.052184,4.005343,3.595382,4.563494,4.668827
3,837,ependymoma,12.456040,9.098977,6.628784,8.517677,4.154847,8.361843,6.596064,6.347285,...,11.969072,11.288801,13.600828,13.379029,4.953429,4.708371,3.892318,3.759429,4.748381,4.521275
4,838,ependymoma,12.699958,8.800721,11.556188,9.166309,4.165891,7.923826,6.212754,6.866387,...,11.411701,11.169317,13.751442,13.803646,4.892677,4.773806,3.796856,3.577544,4.504385,4.541450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,959,pilocytic_astrocytoma,12.658228,8.843270,7.672655,9.125912,5.495477,8.603892,7.747514,5.828978,...,13.170441,12.676080,14.124837,13.996436,4.913579,4.399176,3.878855,3.680103,4.726784,4.564637
126,960,pilocytic_astrocytoma,12.812823,8.510550,8.729699,9.104402,3.967228,7.719089,7.092496,6.504812,...,13.040267,12.403316,13.978009,13.812916,5.189600,4.912618,3.764800,3.664920,4.628355,4.761351
127,961,pilocytic_astrocytoma,12.706991,8.795721,7.772359,8.327273,6.329383,8.550471,6.613332,6.308945,...,12.825383,12.439265,14.328373,14.008693,4.931460,4.712895,3.913637,3.700964,4.764693,4.834952
128,962,pilocytic_astrocytoma,12.684593,8.293938,7.228186,8.494428,6.049414,8.214729,7.287758,5.732710,...,13.116581,12.657967,14.390346,14.194904,4.871092,4.739400,3.782980,3.920363,4.665584,4.613326


Here we see that there are 130 rows by 54677 columns (as of 01-30-2025). This means that the dataset contains 130 samples and 54675 genes. (Note that columns 1 and 2 are just the sample ID and the tissue type, so they are not included in the gene count.)
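
The row and column counts can be read directly from `df.shape`. Below is a minimal sketch on a tiny made-up dataframe (the `toy` name and its values are hypothetical stand-ins, not the real dataset), assuming the same layout of two metadata columns followed by gene columns.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset: two metadata
# columns ("samples", "type") followed by gene-expression columns.
toy = pd.DataFrame({
    "samples": [834, 835, 836],
    "type": ["ependymoma", "ependymoma", "normal"],
    "1007_s_at": [12.5, 13.1, 12.7],
    "1053_at": [7.6, 8.0, 8.8],
})

n_samples, n_columns = toy.shape   # shape is (rows, columns)
n_genes = n_columns - 2            # exclude "samples" and "type"
print(n_samples, n_columns, n_genes)
```

Running the same two lines on `df` should reproduce the 130, 54677, and 54675 figures quoted above.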

## Handling Missing Data
Our next step is to check whether there are any null or incorrect values. If we find any, we have to remove those samples from the dataset; otherwise they will skew the data analysis that we will perform in later phases.

First we will figure out if there are any NaN values in the entire dataframe.

In [17]:
print(df.isnull().T.any().sum())

0


Luckily, we don't have any cells that are null, so we can move on to the next step.
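
Had the check returned a non-zero count, the affected samples would have to be removed, as described above. Here is a minimal sketch of that step on a made-up dataframe (the `toy` data and gene names are hypothetical), using `isnull().any(axis=1)`, which counts the same rows as the transpose-based check, together with `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing expression value.
toy = pd.DataFrame({
    "samples": [1, 2, 3],
    "type": ["normal", "normal", "ependymoma"],
    "gene_a": [5.1, np.nan, 6.3],
    "gene_b": [4.0, 4.2, 4.4],
})

# Number of rows (samples) with at least one missing value.
rows_with_nan = toy.isnull().any(axis=1).sum()

# Drop every sample that has a missing value in any column.
toy_clean = toy.dropna(axis=0, how="any")
print(rows_with_nan, len(toy_clean))
```

Note that this only catches values pandas recognises as missing; "incorrect" values such as out-of-range numbers would need a separate, dataset-specific check.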

### Subdividing the Data
Since we know that there are 5 main tissue types in this dataset, we are going to subdivide it into 5 groups. First we will use the **unique()** function to obtain the unique names in the "type" column and print them out. The unique() function is in the numpy package, so we need to import numpy first.

In [25]:
import numpy as np

print(np.unique(df['type']))

['ependymoma' 'glioblastoma' 'medulloblastoma' 'normal'
 'pilocytic_astrocytoma']
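
As an aside, pandas also provides its own `unique` method on a column, so the numpy import is optional for this step. The sketch below (on a hypothetical `types` series) shows the one practical difference: `np.unique` returns the labels sorted, while `Series.unique` keeps the order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical label column, deliberately not in alphabetical order.
types = pd.Series(["normal", "ependymoma", "normal", "glioblastoma"], name="type")

sorted_labels = list(np.unique(types))    # sorted alphabetically
ordered_labels = list(types.unique())     # order of first appearance
print(sorted_labels)
print(ordered_labels)
```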


Perfect, we see that there are 5 tissue types: ependymoma, glioblastoma, medulloblastoma, normal, and pilocytic_astrocytoma. Based on these names, we will find which samples correspond to each tissue type.

In [32]:
# One dataframe per tissue type
df_ependymoma = df[df['type'] == 'ependymoma']
df_glioblastoma = df[df['type'] == 'glioblastoma']
df_medulloblastoma = df[df['type'] == 'medulloblastoma']
df_normal = df[df['type'] == 'normal']
df_pilocytic_astrocytoma = df[df['type'] == 'pilocytic_astrocytoma']

# Print the number of samples in each dataframe
print(len(df_ependymoma))
print(len(df_glioblastoma))
print(len(df_medulloblastoma))
print(len(df_normal))
print(len(df_pilocytic_astrocytoma))

46
34
22
13
15


Doing some simple math (46 + 34 + 22 + 13 + 15 = 130 samples), we confirm that we are not missing any samples. Unfortunately, we see that not every tissue type has the same number of samples available. This makes our comparison to the control (normal tissue) a lot less straightforward. Additionally, the control group has the fewest samples available to us.
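
The arithmetic and the degree of imbalance can also be checked in code. The sketch below rebuilds only the "type" column from the group sizes printed above (expression values are omitted, and the `sizes`, `labels`, and `ratio_to_control` names are illustrative); on the real data, `df['type'].value_counts()` should give the same counts.

```python
import pandas as pd

# Group sizes reported by the cell above.
sizes = {"ependymoma": 46, "glioblastoma": 34, "medulloblastoma": 22,
         "normal": 13, "pilocytic_astrocytoma": 15}

# Rebuild just the "type" column from those sizes.
labels = pd.Series([t for t, n in sizes.items() for _ in range(n)], name="type")

counts = labels.value_counts()          # samples per tissue type
total = counts.sum()                    # should match the number of rows
ratio_to_control = (counts / counts["normal"]).round(2)
print(total)
print(ratio_to_control.to_dict())
```

The ratios make the imbalance concrete: the largest group, ependymoma, has roughly three and a half times as many samples as the control group.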

There are several ways we can solve this issue:
1. **Randomly sample from the experimental groups to match the number of samples in the control group.** This solution is straightforward but can reduce the statistical power of our analysis, that is, the likelihood that a statistical test detects a true effect. Since we already have very small sample sizes, this may not be a viable option for us.

2. **Performing t-tests.** This is also not a great option. The standard (Student's) t-test assumes *homogeneity of variance*, meaning that both groups, although different in size, are assumed to have approximately the same variance. When the groups are similar in size, the test tolerates moderate violations of this assumption, but when sample sizes are very different, unequal variances can make its results unreliable.

3. **