# Data preprocessing: more art than science?

## Contents of this notebook:
<ol>
<li>Load and examine your data</li>
<li>Merging two dataframes</li>
<li>Removing features that you do not need</li>
<li>Making your data machine-readable</li>
<li>Handling not available (NA) and inf data</li>
<li>Removing columns with a standard deviation of 0</li>
<li>Feature scaling</li>
<li>Data visualization</li>
<li>Loading local files into Google Colab</li>
<li>Mini assignment: data visualization</li>
</ol>

# Setup
Fetch the dataset that you'll be working with throughout this assignment.


In [76]:
!git clone https://github.com/NadiaBlostein/McMedHacks2022_Prep_Week_3_Assignment.git

Cloning into 'McMedHacks2022_Prep_Week_3_Assignment'...
remote: Enumerating objects: 34, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 34 (delta 4), reused 27 (delta 1), pack-reused 0
Unpacking objects: 100% (34/34), done.


Examine your directory structure with the `ls` command. To invoke this command from a Jupyter notebook, precede it with `!`.

In [77]:
!ls

HCP_2D_slices_MRI_data	McMedHacks2022_Prep_Week_3_Assignment
HCP_csv_data		README.md


Change the working directory of the notebook to within the folder `McMedHacks2022_Prep_Week_3_Assignment`.

In [78]:
%cd McMedHacks2022_Prep_Week_3_Assignment 
!ls

/content/McMedHacks2022_Prep_Week_3_Assignment/McMedHacks2022_Prep_Week_3_Assignment/McMedHacks2022_Prep_Week_3_Assignment
HCP_2D_slices_MRI_data	HCP_csv_data  README.md


In this assignment we will be working with `.png` files contained in the `HCP_2D_slices_MRI_data` folder and `.csv` files within the `HCP_csv_data` folder.

In [79]:
!ls HCP_2D_slices_MRI_data

 HCP_102109_T1w_acpc_dc_restore_brain_t1_axial.png
 HCP_102109_T1w_acpc_dc_restore_brain_t1_coronal.png
 HCP_102109_T1w_acpc_dc_restore_brain_t1_sagittal.png
 HCP_210112_T1w_acpc_dc_restore_brain_t1_axial.png
 HCP_210112_T1w_acpc_dc_restore_brain_t1_coronal.png
 HCP_210112_T1w_acpc_dc_restore_brain_t1_sagittal.png
' HCP_552241_T1w_acpc_dc_restore_brain_t1_axial.png'
' HCP_552241_T1w_acpc_dc_restore_brain_t1_coronal.png'
' HCP_552241_T1w_acpc_dc_restore_brain_t1_sagittal.png'
 HCP_615441_T1w_acpc_dc_restore_brain_t1_axial.png
 HCP_615441_T1w_acpc_dc_restore_brain_t1_coronal.png
 HCP_615441_T1w_acpc_dc_restore_brain_t1_sagittal.png
 HCP_995174_T1w_acpc_dc_restore_brain_t1_axial.png
 HCP_995174_T1w_acpc_dc_restore_brain_t1_coronal.png
 HCP_995174_T1w_acpc_dc_restore_brain_t1_sagittal.png


In [80]:
!ls HCP_csv_data 

HCP_volumes.csv  unrestricted_HCP_behavioral.csv


## 1. Load and examine your data

### Preliminary examination of your data

In [81]:
import pandas as pd
import numpy as np

df = pd.read_csv("HCP_csv_data/unrestricted_HCP_behavioral.csv")

**Question 1:** How many subjects (rows) and features (columns) do you have?

In [82]:
#@title
print(f"Num subjects: {df.shape[0]}")
print(f"Num features: {df.shape[1]}")

Num subjects: 1206
Num features: 499


In [83]:
df.columns

Index(['Subject', 'Gender', 'Age', 'MMSE_Score', 'PSQI_Score', 'PSQI_Comp1',
       'PSQI_Comp2', 'PSQI_Comp3', 'PSQI_Comp4', 'PSQI_Comp5',
       ...
       'Noise_Comp', 'Odor_Unadj', 'Odor_AgeAdj', 'PainIntens_RawScore',
       'PainInterf_Tscore', 'Taste_Unadj', 'Taste_AgeAdj', 'Mars_Log_Score',
       'Mars_Errs', 'Mars_Final'],
      dtype='object', length=499)

In [84]:
df['Gender'].unique()

array(['M', 'F'], dtype=object)

**Question 2:** How many males and how many females are in this dataset?

In [85]:
#@title
df['Gender'].value_counts()

F    656
M    550
Name: Gender, dtype: int64

**Question 3:** Display only the rows associated with female subjects.

In [86]:
#@title
df[df['Gender']=="F"]

Unnamed: 0,Subject,Gender,Age,MMSE_Score,PSQI_Score,PSQI_Comp1,PSQI_Comp2,PSQI_Comp3,PSQI_Comp4,PSQI_Comp5,...,Noise_Comp,Odor_Unadj,Odor_AgeAdj,PainIntens_RawScore,PainInterf_Tscore,Taste_Unadj,Taste_AgeAdj,Mars_Log_Score,Mars_Errs,Mars_Final
2,100307,F,26-30,29,4,1,0,1,0,2,...,3.6,101.12,86.45,0.0,38.6,71.69,71.76,1.76,0.0,1.76
5,101006,F,31-35,28,2,1,1,0,0,0,...,6.0,122.25,111.41,0.0,38.6,123.80,123.31,1.80,0.0,1.80
7,101208,F,31-35,30,6,1,2,0,0,2,...,4.4,101.12,87.11,1.0,50.1,105.57,102.32,1.92,0.0,1.92
10,101612,F,26-30,28,4,1,1,0,0,1,...,4.4,122.25,111.41,2.0,48.7,97.26,96.41,1.84,0.0,1.84
11,101915,F,31-35,29,6,1,1,1,1,1,...,4.4,96.87,77.61,0.0,38.6,112.11,111.70,1.84,1.0,1.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,984472,F,26-30,30,5,1,1,1,0,1,...,2.8,122.25,110.45,0.0,38.6,108.73,108.00,1.76,0.0,1.76
1196,987983,F,26-30,30,3,1,1,0,0,1,...,5.2,108.79,97.19,5.0,56.4,88.02,87.70,1.88,0.0,1.88
1200,992673,F,31-35,30,0,0,0,0,0,0,...,3.6,122.25,111.41,1.0,38.6,101.63,99.26,1.80,0.0,1.80
1202,993675,F,26-30,30,4,1,1,1,0,1,...,0.4,122.25,110.45,0.0,38.6,84.07,84.25,1.80,1.0,1.76


### A quick note on documentation
Woah! So many columns and abbreviations! What do they all mean? Make sure you know where your [dataset's documentation](https://wiki.humanconnectome.org/display/PublicData/HCP-YA+Data+Dictionary-+Updated+for+the+1200+Subject+Release#HCPYADataDictionaryUpdatedforthe1200SubjectRelease-Instrument:Demographics) is.

Unfortunately, thorough documentation is not always available. Some data types are also very field-specific and require the help of experts, which is part of what makes machine learning so wonderfully interdisciplinary.

Let's take a look at what features we have here:

In [87]:
print(df.columns)

Index(['Subject', 'Gender', 'Age', 'MMSE_Score', 'PSQI_Score', 'PSQI_Comp1',
       'PSQI_Comp2', 'PSQI_Comp3', 'PSQI_Comp4', 'PSQI_Comp5',
       ...
       'Noise_Comp', 'Odor_Unadj', 'Odor_AgeAdj', 'PainIntens_RawScore',
       'PainInterf_Tscore', 'Taste_Unadj', 'Taste_AgeAdj', 'Mars_Log_Score',
       'Mars_Errs', 'Mars_Final'],
      dtype='object', length=499)


**Question 4:** The script above is not printing ALL of the columns... How do we fix that to be able to see all the features in the dataset?

In [88]:
#@title
for col in df.columns: print(col)

Subject
Gender
Age
MMSE_Score
PSQI_Score
PSQI_Comp1
PSQI_Comp2
PSQI_Comp3
PSQI_Comp4
PSQI_Comp5
PSQI_Comp6
PSQI_Comp7
PSQI_BedTime
PSQI_Min2Asleep
PSQI_GetUpTime
PSQI_AmtSleep
PSQI_Latency30Min
PSQI_WakeUp
PSQI_Bathroom
PSQI_Breathe
PSQI_Snore
PSQI_TooCold
PSQI_TooHot
PSQI_BadDream
PSQI_Pain
PSQI_Other
PSQI_Quality
PSQI_SleepMeds
PSQI_DayStayAwake
PSQI_DayEnthusiasm
PSQI_BedPtnrRmate
PicSeq_Unadj
PicSeq_AgeAdj
CardSort_Unadj
CardSort_AgeAdj
Flanker_Unadj
Flanker_AgeAdj
PMAT24_A_CR
PMAT24_A_SI
PMAT24_A_RTCR
ReadEng_Unadj
ReadEng_AgeAdj
PicVocab_Unadj
PicVocab_AgeAdj
ProcSpeed_Unadj
ProcSpeed_AgeAdj
DDisc_SV_1mo_200
DDisc_SV_6mo_200
DDisc_SV_1yr_200
DDisc_SV_3yr_200
DDisc_SV_5yr_200
DDisc_SV_10yr_200
DDisc_SV_1mo_40K
DDisc_SV_6mo_40K
DDisc_SV_1yr_40K
DDisc_SV_3yr_40K
DDisc_SV_5yr_40K
DDisc_SV_10yr_40K
DDisc_AUC_200
DDisc_AUC_200.1
DDisc_AUC_40K
VSPLOT_TC
VSPLOT_CRTE
VSPLOT_OFF
SCPT_TP
SCPT_TN
SCPT_FP
SCPT_FN
SCPT_TPRT
SCPT_SEN
SCPT_SPEC
SCPT_LRNR
IWRD_TOT
IWRD_RTC
ListSort_Unadj
ListSort

Handpicking the columns you want to work with:



In [89]:
basics = ['Subject','Gender','Age','PSQI_BedTime']
df[basics]

Unnamed: 0,Subject,Gender,Age,PSQI_BedTime
0,100004,M,22-25,9:00:00
1,100206,M,26-30,22:30:00
2,100307,F,26-30,22:00:00
3,100408,M,31-35,22:00:00
4,100610,M,26-30,21:30:00
...,...,...,...,...
1201,992774,M,31-35,22:00:00
1202,993675,F,26-30,22:00:00
1203,994273,M,26-30,2:00:00
1204,995174,M,22-25,23:00:00


Let's also add some cognitive variables to the mix! Specifically, we'll select the measures related to fluid intelligence (they start with `PMAT`, for Penn Matrix Test) and impulsivity (they start with `DDisc`, for Delay Discounting).

In [90]:
cognition = ['Subject','Gender','Age','PSQI_BedTime']
for col in df.columns:
    if (col.find("PMAT")!=-1 or col.find("DDisc")!=-1):
        cognition.append(col)
print(f"List of variables we will be looking at: {cognition}") # PS: f-strings will be very useful for you in your Python journey!

List of variables we will be looking at: ['Subject', 'Gender', 'Age', 'PSQI_BedTime', 'PMAT24_A_CR', 'PMAT24_A_SI', 'PMAT24_A_RTCR', 'DDisc_SV_1mo_200', 'DDisc_SV_6mo_200', 'DDisc_SV_1yr_200', 'DDisc_SV_3yr_200', 'DDisc_SV_5yr_200', 'DDisc_SV_10yr_200', 'DDisc_SV_1mo_40K', 'DDisc_SV_6mo_40K', 'DDisc_SV_1yr_40K', 'DDisc_SV_3yr_40K', 'DDisc_SV_5yr_40K', 'DDisc_SV_10yr_40K', 'DDisc_AUC_200', 'DDisc_AUC_200.1', 'DDisc_AUC_40K']
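
The same selection can be written more compactly. The sketch below is only illustrative: `toy` is a hypothetical stand-in for `df` that carries a handful of column names, and it relies on `str.startswith`, which accepts a tuple of prefixes. Note that `startswith` only matches at the beginning of a name, whereas `find` matches anywhere in it.

```python
import pandas as pd

# Hypothetical stand-in for df: only the column names matter here
toy = pd.DataFrame(columns=["Subject", "Gender", "Age", "PSQI_BedTime",
                            "PMAT24_A_CR", "DDisc_AUC_200", "Flanker_Unadj"])

basics = ["Subject", "Gender", "Age", "PSQI_BedTime"]

# Keep every column whose name starts with one of the two prefixes
extra = [col for col in toy.columns if col.startswith(("PMAT", "DDisc"))]

cognition = basics + extra
print(cognition)
```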


**Question 5:** Now that we made a list of all of the features we want to examine, select this subset of our data and make it a separate dataframe called `df_cognition`.

In [91]:
df_cognition = df[cognition]
df_cognition.head()

Unnamed: 0,Subject,Gender,Age,PSQI_BedTime,PMAT24_A_CR,PMAT24_A_SI,PMAT24_A_RTCR,DDisc_SV_1mo_200,DDisc_SV_6mo_200,DDisc_SV_1yr_200,...,DDisc_SV_10yr_200,DDisc_SV_1mo_40K,DDisc_SV_6mo_40K,DDisc_SV_1yr_40K,DDisc_SV_3yr_40K,DDisc_SV_5yr_40K,DDisc_SV_10yr_40K,DDisc_AUC_200,DDisc_AUC_200.1,DDisc_AUC_40K
0,100004,M,22-25,9:00:00,19.0,0.0,15590.0,153.13,46.88,21.88,...,21.88,34375.0,24375.0,625.0,625.0,625.0,625.0,0.121811,0.121811,0.067448
1,100206,M,26-30,22:30:00,20.0,0.0,18574.5,78.13,34.38,9.38,...,9.38,30625.0,625.0,625.0,3125.0,625.0,625.0,0.097072,0.097072,0.05
2,100307,F,26-30,22:00:00,17.0,2.0,11839.0,103.13,46.88,103.13,...,9.38,19375.0,29375.0,24375.0,9375.0,9375.0,9375.0,0.162176,0.162176,0.311459
3,100408,M,31-35,22:00:00,7.0,12.0,3042.0,153.13,46.88,46.88,...,9.38,39375.0,29375.0,24375.0,19375.0,18125.0,4375.0,0.203061,0.203061,0.421354
4,100610,M,26-30,21:30:00,23.0,0.0,12280.0,196.88,196.88,184.38,...,146.88,39375.0,39375.0,39375.0,39375.0,36875.0,24375.0,0.801629,0.801629,0.86875


In [92]:
df_cognition.shape

(1206, 22)

## 2. Merging two dataframes

So for our dataset of 1206 subjects, we have information about their gender, age range and a variety of cognitive measures. It would be cool to be able to integrate some other data as well. You have been provided with a separate file that contains brain structure volume data obtained from the neuroimaging data of these same subjects (note that not all subjects in the HCP have neuroimaging data).

In [93]:
df_volumes = pd.read_csv("HCP_csv_data/HCP_volumes.csv")

**Question 6:** How many subjects and features does `df_volumes` have?

In [94]:
#@title
print(f"Num subjects: {df_volumes.shape[0]}")
print(f"Num features: {df_volumes.shape[1]}")

Num subjects: 1086
Num features: 14


**Question 7:** Print the mean and standard deviation of total brain volume (TBV) of this sample.

In [95]:
#@title
print(f"Mean TBV: {df_volumes['TBV'].mean()}")
print(f"TBV standard deviation: {df_volumes['TBV'].std()}")

Mean TBV: 1406181.3996316758
TBV standard deviation: 151214.5002924137


**Question 8:** List all the subjects whose TBV is above average.

In [96]:
#@title
df_volumes[df_volumes['TBV']>df_volumes['TBV'].mean()]

Unnamed: 0,Subject,TBV,Gp_Left,Gp_Right,Str_Left,Str_Right,Thal_Left,Thal_Right,str_left_SA,str_right_SA,thal_left_SA,thal_right_SA,gp_left_SA,gp_right_SA
0,100206,1694580,1746.90,1597.69,11848.20,12030.00,7827.26,7750.77,15156.32547,16506.40087,7902.211643,8394.220729,3603.322526,3751.440646
2,100408,1563630,1617.93,1531.84,10274.90,10693.40,6865.83,6842.16,14165.48884,15590.63180,7235.193241,7763.609633,3476.901847,3725.802635
3,100610,1585390,1841.22,1639.88,12464.30,12575.80,7452.36,7598.48,15493.70123,16940.05756,7594.851693,8316.436428,3862.828117,3893.266274
6,101309,1426200,1690.65,1507.14,10144.90,10251.60,6266.95,6182.92,13848.22982,15324.77410,6798.623361,7329.928997,3582.231784,3677.827998
7,101410,1588370,1440.26,1328.78,9376.25,9713.07,6865.83,6683.70,13217.76480,14592.03987,7153.875176,7624.042048,3143.810997,3339.830347
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1075,987074,1451930,1541.44,1384.69,10053.70,10266.00,6290.62,6012.10,13970.55221,14972.19121,7221.456229,7054.193554,3353.157608,3502.805296
1079,991267,1802370,1804.52,1681.04,12363.10,12466.30,7433.50,7367.64,15554.95026,16965.02114,7649.207918,8214.273885,3620.911699,3828.376050
1083,994273,1487180,1646.74,1501.31,11164.60,11670.90,6711.82,6886.75,14637.99165,16242.63411,7100.454111,7637.944904,3441.358390,3549.415428
1084,995174,1517230,1678.30,1594.26,11154.40,11338.60,6766.36,6734.46,14516.88893,15906.36860,7263.501869,7809.016115,3535.013585,3761.052946


Ok! Let's merge our dataframes. One problem is that our behavioral data has 1206 subjects and our volume data has 1086 subjects. 

**Question 9:** Create a new dataframe where we only keep the subjects for which we have all of our features.

In [97]:
#@title
df_final=pd.merge(df_cognition,df_volumes, on='Subject')
df_final.head()

Unnamed: 0,Subject,Gender,Age,PSQI_BedTime,PMAT24_A_CR,PMAT24_A_SI,PMAT24_A_RTCR,DDisc_SV_1mo_200,DDisc_SV_6mo_200,DDisc_SV_1yr_200,...,Str_Left,Str_Right,Thal_Left,Thal_Right,str_left_SA,str_right_SA,thal_left_SA,thal_right_SA,gp_left_SA,gp_right_SA
0,100206,M,26-30,22:30:00,20.0,0.0,18574.5,78.13,34.38,9.38,...,11848.2,12030.0,7827.26,7750.77,15156.32547,16506.40087,7902.211643,8394.220729,3603.322526,3751.440646
1,100307,F,26-30,22:00:00,17.0,2.0,11839.0,103.13,46.88,103.13,...,9470.57,9749.09,5743.19,5773.38,12555.09774,13845.51983,6368.274739,6969.617635,3149.392096,3156.322277
2,100408,M,31-35,22:00:00,7.0,12.0,3042.0,153.13,46.88,46.88,...,10274.9,10693.4,6865.83,6842.16,14165.48884,15590.6318,7235.193241,7763.609633,3476.901847,3725.802635
3,100610,M,26-30,21:30:00,23.0,0.0,12280.0,196.88,196.88,184.38,...,12464.3,12575.8,7452.36,7598.48,15493.70123,16940.05756,7594.851693,8316.436428,3862.828117,3893.266274
4,101006,F,31-35,23:00:00,11.0,8.0,6569.0,140.63,96.88,115.63,...,9774.47,9967.58,5856.38,5877.3,13309.15216,14458.50988,6514.710165,7003.36731,3586.431707,3773.869259


In [98]:
print(df_final.shape)

(1086, 35)


Mission accomplished! We have 1086 subjects and 22 + 14 - 1 = 35 features (the shared `Subject` column is only counted once).

## 3. Removing features that you do not need
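
To see why the merged shape comes out this way, here is a minimal sketch on two toy dataframes (`behav` and `vols` are hypothetical stand-ins, not the HCP files). It shows that `pd.merge` defaults to an inner join, and that `how="left"` with `indicator=True` can be used to list the subjects that have no match in the second table.

```python
import pandas as pd

# Hypothetical toy tables: 4 subjects with behavioral data, 3 with volumes
behav = pd.DataFrame({"Subject": [1, 2, 3, 4], "Gender": ["M", "F", "F", "M"]})
vols = pd.DataFrame({"Subject": [2, 3, 4], "TBV": [1500000, 1400000, 1600000]})

# Default is how="inner": only subjects present in both tables are kept
inner = pd.merge(behav, vols, on="Subject")
print(inner.shape)  # (3, 3): 2 + 2 columns, minus the shared key

# A left join with indicator=True flags rows that have no match in vols
left = pd.merge(behav, vols, on="Subject", how="left", indicator=True)
missing = left.loc[left["_merge"] == "left_only", "Subject"].tolist()
print(missing)  # [1]
```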

**Question 10:** Find and remove any duplicate columns.

In [99]:
#@title
cols_to_drop=[]
for i in range(df_final.shape[1]):
    for j in range(i+1,df_final.shape[1]):
        col1=df_final.columns[i]
        col2=df_final.columns[j]
        if (df_final[col1].equals(df_final[col2])):
            print(f"Duplicate columns: {col1, col2}")
            cols_to_drop.append(col2)
df_final.drop(cols_to_drop, inplace=True, axis=1)

Duplicate columns: ('DDisc_AUC_200', 'DDisc_AUC_200.1')
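
A shorter alternative to the nested loop, sketched on a toy dataframe (`toy` is hypothetical): transpose the dataframe so that columns become rows, then let `DataFrame.duplicated` flag every row that repeats an earlier one. Transposing copies the data, so the loop above may be the safer choice for very large tables.

```python
import pandas as pd

# Hypothetical toy data: "a.1" is an exact copy of "a"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                    "b": [4.0, 5.0, 6.0],
                    "a.1": [1.0, 2.0, 3.0]})

# After transposing, each original column is a row; duplicated() marks repeats
is_dup = toy.T.duplicated()

# Keep only the columns that are not repeats of an earlier column
deduped = toy.loc[:, ~is_dup.to_numpy()]
print(list(deduped.columns))  # ['a', 'b']
```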


## 4. Making your data machine-readable

To be machine-readable, your variables need to be numerical. Want to check?

In [100]:
df_final.dtypes

Subject                int64
Gender                object
Age                   object
PSQI_BedTime          object
PMAT24_A_CR          float64
PMAT24_A_SI          float64
PMAT24_A_RTCR        float64
DDisc_SV_1mo_200     float64
DDisc_SV_6mo_200     float64
DDisc_SV_1yr_200     float64
DDisc_SV_3yr_200     float64
DDisc_SV_5yr_200     float64
DDisc_SV_10yr_200    float64
DDisc_SV_1mo_40K     float64
DDisc_SV_6mo_40K     float64
DDisc_SV_1yr_40K     float64
DDisc_SV_3yr_40K     float64
DDisc_SV_5yr_40K     float64
DDisc_SV_10yr_40K    float64
DDisc_AUC_200        float64
DDisc_AUC_40K        float64
TBV                    int64
Gp_Left              float64
Gp_Right             float64
Str_Left             float64
Str_Right            float64
Thal_Left            float64
Thal_Right           float64
str_left_SA          float64
str_right_SA         float64
thal_left_SA         float64
thal_right_SA        float64
gp_left_SA           float64
gp_right_SA          float64
dtype: object

We have 3 columns that are non-numerical: `Gender`,`Age`,`PSQI_BedTime`. Let's figure out how to handle them, one at a time. First of all, we know that the `Gender` column is categorical:

In [101]:
df_final['Gender'].value_counts()

F    590
M    496
Name: Gender, dtype: int64

### One-hot encoding or binarizing your data
In this specific dataset, the `Gender` column has one of two values: `M` or `F`. 

**Question 11:** Given that we only have two categories here, you can just replace all of your `M` values with 1 and `F` values with 2.

In [None]:
#@title
df_final['Gender'] = df_final['Gender'].replace('M',1)
df_final['Gender'] = df_final['Gender'].replace('F',2)
df_final

**A note on one-hot encoding:** Suppose that you actually had more than 2 categories for this feature (e.g. `M`,`F`,`other`). If you just convert categorical variables to numerical values (ex: `M`=1,`F`=2,`other`=3), you give a "distance" to the relationship between categories. For instance, since 1 is closer to 2 than to 3, you are telling your machine that `M` is "closer" to `F` (`distance = 2 - 1 = 1`) than to `other` (`distance = 3 - 1 = 2`). One-hot encoding is a way to make sure the categories remain independent: "[A representation of categorical variables as binary vectors](https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/)"
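
As a minimal sketch of what this looks like in practice, `pd.get_dummies` turns one categorical column into one binary column per category. The three-category `toy` dataframe below is hypothetical; our actual `Gender` column only contains `M` and `F`.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"Gender": ["M", "F", "other", "F"]})

# One binary column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(toy["Gender"], prefix="Gender", dtype=int)
print(one_hot)
# Columns are Gender_F, Gender_M, Gender_other
```

Every pair of categories is now equally far apart, so no artificial ordering is introduced.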

### Parsing strings in your data

#### 4.2.1 Handling the `Age` feature

Recall: our `Age` feature is also non-numerical! Luckily, we have an age range, so the string can be split up into two new columns: min age and max age. Let's replace our age range in `df_final` with an average age estimate for each subject.

In [None]:
df_final['Age'].value_counts()

For the sake of this exercise, let's replace 36+ with a range of 36-40.

In [None]:
df_final['Age'] = df_final['Age'].replace("36+",'36-40')
df_final['Age'].value_counts()

In [None]:
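# Split "26-30" into two columns; n must be passed by keyword in recent pandas versions
fix_age = df_final['Age'].str.split('-', n=1, expand=True)
fix_age.columns = ['min','max']
fix_age
# The two columns are still strings here, so they are converted to float before averaging (next cell)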

In [None]:
fix_age["min"] = fix_age['min'].astype(float)
fix_age["max"] = fix_age['max'].astype(float)
fix_age['mean'] = (fix_age['max']+fix_age['min'])/2
fix_age

In [None]:
df_final['Age'] = fix_age['mean']
df_final

#### 4.2.2 Handling the `PSQI_BedTime` feature

Convert your bed time variable from HH:MM:SS to seconds!

In [None]:
df_final['PSQI_BedTime']

In [None]:
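ftr = [3600,60,1]  # seconds per hour, minute and second

def hms_to_seconds(hms):
    # "22:30:00" -> 22*3600 + 30*60 + 0
    return sum(a*b for a,b in zip(ftr, map(int, hms.split(':'))))

# Assign the whole column at once; chained indexing (df[col][i] = x) may silently modify a copy
df_final['PSQI_BedTime'] = df_final['PSQI_BedTime'].apply(hms_to_seconds)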
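
An alternative sketch, assuming every entry is a well-formed `H:MM:SS` string: pandas can parse such strings as durations with `pd.to_timedelta`, and `.dt.total_seconds()` then gives the number of seconds directly. The `bedtimes` series below is a hypothetical toy example built from values seen in the table above.

```python
import pandas as pd

# Hypothetical toy series of bed times in H:MM:SS format
bedtimes = pd.Series(["9:00:00", "22:30:00", "2:00:00"])

# Parse each string as a duration, then express it in seconds
seconds = pd.to_timedelta(bedtimes).dt.total_seconds()
print(seconds.tolist())  # [32400.0, 81000.0, 7200.0]
```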

In [None]:
df_final

#### 4.2.3 Other strings that often crop up in dataframes and need to be replaced with numbers!
`
df_final = df_final.replace('FALSE',0)
df_final = df_final.replace('TRUE',1)
df_final = df_final.replace(False,0)
df_final = df_final.replace(True,1)
df_final = df_final.replace('0',0) # example of a number stored as a string
df_final = df_final.replace(' ',np.NaN) # example of random spaces
`

#### 4.2.4 Making sure every column is of type float
The next line of code will throw an error if you forgot to replace any strings, and it will tell you what those strings are.

In [None]:
for col in df_final.columns:
    df_final[col] = df_final[col].astype(float)

In [None]:
df_final.head()

## 5. Handling not available (NA) and inf data:

Some operations produce + or - infinity (for example, dividing a non-zero float by zero or taking the log of 0 in NumPy or pandas), which will result in downstream errors. Convert these values to NA, and then handle them as NA values.
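
To make this concrete, here is a small sketch of how infinities can sneak into a column and how the replacement handles them. The `toy` dataframe and its `num` and `den` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three of the four denominators are zero
toy = pd.DataFrame({"num": [1.0, -1.0, 0.0, 2.0],
                    "den": [0.0, 0.0, 0.0, 4.0]})

# Float division by zero gives inf or -inf; 0/0 gives NaN
ratio = toy["num"] / toy["den"]

# Turn the infinities into NaN so that all missing values are handled the same way
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().sum())  # 3
```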

In [None]:
df_final = df_final.replace([np.inf, -np.inf], np.nan)

Next, you need to deal with your NA values. How many NAs do you have?

In [None]:
df_final.isna().sum()

There are a variety of ways to handle `na` data. The simplest approach is to replace `na` data with the median (or mean) value of the feature of interest. There are [other](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779) [more](https://arxiv.org/abs/1804.11087) sophisticated data imputation techniques out there, many of which actually leverage machine learning tools (so meta)!

However, if a feature has too many NAs, you may want to remove it completely. Define a threshold: any feature with at least that many missing values is removed from your dataset. Here, 12 is quite stringent (about 1% of our 1086 subjects).

In [None]:
threshold=12

remove_cols = []
for i in range(len(df_final.columns)):
    if (df_final.iloc[:,i].isnull().sum() >= threshold):
        remove_cols.append(df_final.columns[i])
df_final = df_final.drop(columns=remove_cols)

Next, let's replace the NA values we have left with the feature-specific median (the median is more robust against outliers than the mean is).

In [None]:
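# Fill each column's NAs with that column's median.
# Reassigning avoids calling fillna(inplace=True) on a column selection, which may act on a copy.
df_final = df_final.fillna(df_final.median())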

In [None]:
df_final.isna().sum().sum()
# note the difference between df_final.isna().sum() and df_final.isna().sum().sum()

## 6. Removing columns with a standard deviation of 0:

In [None]:
df_final.std()

In [None]:
df_final = df_final.loc[:, df_final.std() > 0]

## 7. Feature scaling

You usually need to perform some sort of feature scaling to make sure that all of your variables are in the same range (this affects gradient-descent-based algorithms and distance-based algorithms).

**Min-Max Scaling / Normalization:** X' = (X-Xmin) / (Xmax-Xmin), X' always ends up with a range of \[0,1\] \
**Standardization / Standard Scaler / Z-score:** X' = (X-mu)/sigma, where mu and sigma are the mean and standard deviation of the feature, so X' ends up with a mean of 0 and a standard deviation of 1

Which to use? Depends on your data! ["Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution."](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/). Other [popular scaling techniques](https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/) include the log transform (you often see this with GWAS, ie genome-wide association studies) and dividing your column-wise values by the absolute value of the maximal value of each column (max abs scaler).
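
As a quick worked example of the two formulas, here they are applied to a hypothetical four-value feature `x`. This is only a sketch to show what each transform does to the numbers.

```python
import numpy as np

# Hypothetical feature with four observations
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())
print(min_max)  # 0, 1/3, 2/3, 1

# Standardization: subtract the mean, divide by the sample standard deviation
z = (x - x.mean()) / x.std(ddof=1)
print(z.mean(), z.std(ddof=1))  # mean of 0 (up to rounding) and standard deviation of 1
```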

Example 1: Min-Max Scaling

In [None]:
minMaxScaled_df_final=(df_final-df_final.min())/(df_final.max()-df_final.min())

Example 2: Standard Scaling

In [None]:
standardized_df_final=(df_final-df_final.mean())/df_final.std()

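Example 3: Sklearn Min-Max Scaler\
Slightly more general than the Min-Max Scaling defined above, because the result is rescaled to a chosen `feature_range = (min, max)`. With the default range of (0, 1), it gives the same result as Example 1:\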
`
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
`

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms=MinMaxScaler()
mms.fit(df_final)
df_final_mms=mms.transform(df_final)
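
One practical detail: `MinMaxScaler.transform` returns a NumPy array, not a dataframe, so the column names are lost. The sketch below, on a hypothetical `toy` dataframe, wraps the result back into a dataframe and checks that it matches the manual formula from Example 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with two features on very different scales
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 50.0]})

# fit_transform learns each column's min and max, then scales; the result is a NumPy array
scaled = MinMaxScaler().fit_transform(toy)

# Restore the column names and index
scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

# With the default feature_range of (0, 1), this matches the manual formula
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(np.allclose(scaled_df, manual))  # True
```

Note that the equivalent check for standardization would not match exactly: sklearn's `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default.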

## 8. Data visualization

Data visualization is a wonderful way to get to know your data in order to plan a relevant analysis or find an appropriate machine learning application. [Matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) are two canonical data visualization tools that you can use in Python.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### 8.1 Scatter plot

In [None]:
plt.scatter(df_final["PSQI_BedTime"],df_final["TBV"])
plt.ylabel("Total brain volume (mm3)")
plt.xlabel("Bedtime (seconds)")
plt.title("Total brain volume as a function of bed time")

#### 8.2 [Violin plot](https://chartio.com/learn/charts/violin-plot-complete-guide/)

In [None]:
plt.violinplot(df_final['PSQI_BedTime'])
plt.ylabel("Bed time")
plt.title("Violin plot of bed times in our sample")

#### 8.3 Histogram plot

In [None]:
sns.displot(df_final,x='TBV',kind='kde',fill=True) # smoothed histogram
plt.ylabel("Density")
plt.xlabel("Total brain volume (mm3)")
plt.title("Distribution of total brain volume in our sample")

In [None]:
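# stat='density' makes the y-axis a density (the default is a count), matching the axis label
sns.histplot(df_final,x='Str_Left',stat='density',fill=True,color='orange',label='Left striatum')
sns.histplot(df_final,x='Thal_Left',stat='density',fill=True,color='turquoise',label='Left thalamus')
plt.legend() # uses the label given to each histogram
plt.ylabel("Density")
plt.xlabel("Structure-specific volume (mm3)")
plt.title("Distribution of different structure volumes in our sample")
plt.show()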

## 10. Mini data visualization assignment
PS: this [list of named colours](https://matplotlib.org/3.5.0/gallery/color/named_colors.html) can help make your graphs nicer :)
<ol>
<li>Generate a scatter plot of TBV as a function of bed time, by gender.</li>
<li>Generate a violin plot of bed times in our sample for subjects over 30 years.</li>
<li>Generate two smoothed and superimposed histograms of bed times in subjects that are above and below 30 years.</li>
<li>Generate two superimposed histograms of the volume distributions in the left hemisphere (left striatum, thalamus and globus pallidus) and right hemisphere (right striatum, thalamus and globus pallidus)</li>
</ol>

# Image Visualization
Let's visualize a medical image (either a chest X-ray or a brain scan) of a given person.

In [None]:
# Get the image as an array
from PIL import Image
import os

png_file_path = os.path.join("data_png", "chest_xrays_pngs", "CHNCXR_0001_0.png")
png_array = np.array(Image.open(png_file_path))

In [None]:
# Write a function that plots a *list* of numpy arrays in grayscale
def plot_png_array(png_array_list, title, figsize=(10, 10)):
  for png_array in png_array_list:
    # Plot the image with imshow, using the figure size passed in
    fig, ax = plt.subplots(figsize=figsize)

    ax.imshow(png_array, cmap='gray')

    # Use the title passed in rather than a hard-coded string
    ax.set_title(title)

    plt.show()

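Reduce the dimensionality of the image using average pooling: each kernel-sized block of pixels is replaced by its mean. Note that the function below defaults to a 21 x 30 kernel (height x width), not 3 x 3.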

In [None]:
# Write a function that reduces the dimensionality of the image
def reduce_png_array_dim(png_array, kernel_height=21, kernel_width=30):
  cropped_png_array = []
  for i in range(0, png_array.shape[0], kernel_height):
    cropped_array = []
    for j in range(0, png_array.shape[1], kernel_width):
      cropped_array.append(png_array[i:i+kernel_height, j:j+kernel_width].mean())
    cropped_png_array.append(cropped_array)
  return np.array(cropped_png_array)
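
The same idea can be written without explicit loops. The sketch below (`average_pool` is a hypothetical helper, not part of the notebook) reshapes a 2D image into blocks and averages over each block. Unlike the loop above, which keeps partial blocks at the edges, this version trims the image so that both dimensions are exact multiples of the kernel size.

```python
import numpy as np

def average_pool(image, kernel_height, kernel_width):
    # Trim the 2D image so that both dimensions are exact multiples of the kernel size
    h = (image.shape[0] // kernel_height) * kernel_height
    w = (image.shape[1] // kernel_width) * kernel_width
    # Split into (row blocks, kernel_height, column blocks, kernel_width)
    blocks = image[:h, :w].reshape(h // kernel_height, kernel_height,
                                   w // kernel_width, kernel_width)
    # Average over the two within-block axes
    return blocks.mean(axis=(1, 3))

# Toy 4 x 4 "image" holding the values 0..15
image = np.arange(16, dtype=float).reshape(4, 4)
pooled = average_pool(image, 2, 2)
print(pooled)
# [[ 2.5  4.5]
#  [10.5 12.5]]
```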

In [None]:
cropped_png_array = reduce_png_array_dim(png_array)

In [None]:
# Replot the original and the reduced (average-pooled) image
plot_png_array([png_array, cropped_png_array], "Title")