## Data Preprocessing

This Jupyter notebook explores and preprocesses the data, including handling missing values, outliers, and normalisation. ER, HER2, and Gene are treated as the most important features, so they are retained for the modelling process.

In [89]:
#GENERAL IMPORTS
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
import IPython

In [90]:
data = pd.read_excel('TrainDataset2024.xls')
pd.options.display.max_rows = len(data)
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Columns: 121 entries, ID to original_ngtdm_Strength
dtypes: float64(108), int64(12), object(1)
memory usage: 378.2+ KB


| | ID | pCR (outcome) | RelapseFreeSurvival (outcome) | Age | ER | PgR | HER2 | TrippleNegative | ChemoGrade | Proliferation | ... | original_glszm_SmallAreaHighGrayLevelEmphasis | original_glszm_SmallAreaLowGrayLevelEmphasis | original_glszm_ZoneEntropy | original_glszm_ZonePercentage | original_glszm_ZoneVariance | original_ngtdm_Busyness | original_ngtdm_Coarseness | original_ngtdm_Complexity | original_ngtdm_Contrast | original_ngtdm_Strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TRG002174 | 1 | 144.0 | 41.0 | 0 | 0 | 0 | 1 | 3 | 3 | ... | 0.517172 | 0.375126 | 3.325332 | 0.002314 | 3880771.5 | 473.464852 | 0.000768 | 0.182615 | 0.030508 | 0.000758 |
| 1 | TRG002178 | 0 | 142.0 | 39.0 | 1 | 1 | 0 | 0 | 3 | 3 | ... | 0.444391 | 0.444391 | 3.032144 | 0.005612 | 2372009.744 | 59.45971 | 0.004383 | 0.032012 | 0.001006 | 0.003685 |
| 2 | TRG002204 | 1 | 135.0 | 31.0 | 0 | 0 | 0 | 1 | 2 | 1 | ... | 0.534549 | 0.534549 | 2.485848 | 0.006752 | 1540027.421 | 33.935384 | 0.007584 | 0.024062 | 0.000529 | 0.006447 |
| 3 | TRG002206 | 0 | 12.0 | 35.0 | 0 | 0 | 0 | 1 | 3 | 3 | ... | 0.506185 | 0.506185 | 2.606255 | 0.003755 | 6936740.794 | 46.859265 | 0.005424 | 0.013707 | 0.000178 | 0.004543 |
| 4 | TRG002210 | 0 | 109.0 | 61.0 | 1 | 0 | 0 | 0 | 2 | 1 | ... | 0.462282 | 0.462282 | 2.809279 | 0.006521 | 1265399.054 | 39.621023 | 0.006585 | 0.034148 | 0.001083 | 0.005626 |


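The dataset contains 400 patients, each identified by an ID (the single object-typed column).

There are two outcome columns: 'pCR (outcome)' (PCR) and 'RelapseFreeSurvival (outcome)' (RFS).

The remaining 118 columns are features: 11 clinical and 107 extracted from the tumour region of MRI scans.

The 11 clinical features are Age, ER, PgR, HER2, TrippleNegative, ChemoGrade, Proliferation, HistologyType, LNStatus, TumourStage, and Gene.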
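The column breakdown above can be expressed in code, which makes later steps (imputing features only, excluding outcomes from the inputs) easier to write. The following is a minimal sketch, not part of the original notebook: the helper `split_columns` and the toy frame are hypothetical, and the toy values are illustrative only. The column names themselves are taken from the `data.head()` output and the clinical feature list.

```python
import pandas as pd

# Column names as they appear in the dataset
ID_COL = "ID"
OUTCOME_COLS = ["pCR (outcome)", "RelapseFreeSurvival (outcome)"]
CLINICAL_COLS = [
    "Age", "ER", "PgR", "HER2", "TrippleNegative", "ChemoGrade",
    "Proliferation", "HistologyType", "LNStatus", "TumourStage", "Gene",
]


def split_columns(df):
    """Hypothetical helper: return (clinical, imaging) column-name lists.

    Anything that is not the ID, an outcome, or a clinical feature is
    assumed to be an MRI-derived imaging feature.
    """
    non_features = {ID_COL, *OUTCOME_COLS}
    clinical = [c for c in df.columns if c in CLINICAL_COLS]
    imaging = [
        c for c in df.columns
        if c not in non_features and c not in CLINICAL_COLS
    ]
    return clinical, imaging


# Toy frame with a handful of columns (illustrative values only)
toy = pd.DataFrame({
    "ID": ["P1", "P2"],
    "pCR (outcome)": [1, 0],
    "RelapseFreeSurvival (outcome)": [144.0, 142.0],
    "Age": [41.0, 39.0],
    "ER": [0, 1],
    "Gene": [1, 0],
    "original_ngtdm_Contrast": [0.030508, 0.001006],
    "original_ngtdm_Strength": [0.000758, 0.003685],
})

clinical, imaging = split_columns(toy)
print("clinical:", clinical)
print("imaging:", imaging)
```

Applied to the full dataset, the same helper would be expected to return 11 clinical and 107 imaging column names, matching the counts above.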

### Regression Imputation

The dataset contains missing values, represented by the value 999. To handle these, regression imputation is employed: missing entries are replaced with estimates derived from other available information in the dataset.

In [93]:
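# Replace the 999 missing-value sentinel with NaN
data.replace(999, np.nan, inplace=True)
# Count missing values per column and sort from fewest to most
null_data = pd.DataFrame(data.isnull().sum())
null_data = null_data.sort_values(by=[0], ascending=True)
print('null_data:\n', null_data)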

null_data:
                                                      0
ID                                                   0
original_glrlm_HighGrayLevelRunEmphasis              0
original_glrlm_GrayLevelVariance                     0
original_glrlm_GrayLevelNonUniformityNormalized      0
original_glrlm_GrayLevelNonUniformity                0
original_gldm_SmallDependenceLowGrayLevelEmphasis    0
original_gldm_SmallDependenceHighGrayLevelEmphasis   0
original_gldm_SmallDependenceEmphasis                0
original_gldm_LowGrayLevelEmphasis                   0
original_gldm_LargeDependenceLowGrayLevelEmphasis    0
original_gldm_LargeDependenceHighGrayLevelEmphasis   0
original_gldm_LargeDependenceEmphasis                0
original_gldm_HighGrayLevelEmphasis                  0
original_glrlm_LongRunEmphasis                       0
original_gldm_GrayLevelVariance                      0
original_gldm_DependenceVariance                     0
original_gldm_DependenceNonUniformityNormalized      

This illustrates that all features, aside from Gene, have 3 or fewer missing values, which is negligible in a dataset of 400 rows. These values can therefore be estimated using regression imputation. Because Gene is a very valuable feature, it will be retained despite its large number of missing entries.
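One way regression imputation could look in code is sketched below. This is an assumption about the approach rather than the notebook's actual implementation: the helper `regression_impute`, the toy frame, and its column names `x` and `y` are all hypothetical. It uses `linear_model.LinearRegression` from scikit-learn, fits on rows where the target is observed, and predicts the target where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model


def regression_impute(df, target, predictors):
    """Hypothetical helper: fill NaNs in `target` using a linear model.

    The model is fitted on rows where `target` is observed and all
    `predictors` are present, then used to predict the missing entries.
    Rows whose predictors are themselves missing are left untouched.
    """
    out = df.copy()
    missing = out[target].isna()
    complete = out[predictors].notna().all(axis=1)
    train = ~missing & complete
    fill = missing & complete
    if fill.any() and train.any():
        model = linear_model.LinearRegression()
        model.fit(out.loc[train, predictors], out.loc[train, target])
        out.loc[fill, target] = model.predict(out.loc[fill, predictors])
    return out


# Toy data where y = 0.2 * x - 3, with one missing y value
toy = pd.DataFrame({
    "x": [30.0, 40.0, 50.0, 60.0, 70.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0],
})

imputed = regression_impute(toy, target="y", predictors=["x"])
print(imputed)
```

Because a linear model returns continuous estimates, binary or ordinal clinical features would probably need rounding or a classifier instead. Which predictors to use for each column is a design choice that this sketch leaves open.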

It is important to note, however, that 'pCR (outcome)' has missing entries. These rows should not be used when developing the PCR model.
'RelapseFreeSurvival (outcome)' has no missing entries.
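Excluding rows with a missing PCR outcome could be done with `DataFrame.dropna` restricted to that column, as in the minimal sketch below. The toy frame and the names `pcr_data` and `rfs_data` are hypothetical; only the two outcome column names come from the dataset.

```python
import numpy as np
import pandas as pd

PCR_COL = "pCR (outcome)"
RFS_COL = "RelapseFreeSurvival (outcome)"

# Toy frame: one patient has no recorded PCR outcome (illustrative values)
toy = pd.DataFrame({
    "ID": ["P1", "P2", "P3", "P4"],
    PCR_COL: [1.0, np.nan, 0.0, 1.0],
    RFS_COL: [144.0, 142.0, 135.0, 12.0],
})

# Drop rows only where the PCR label is missing; other NaNs are kept
pcr_data = toy.dropna(subset=[PCR_COL])

# RFS has no missing labels, so every row remains usable for that task
rfs_data = toy

print(len(pcr_data), len(rfs_data))
```

Keeping a separate frame per task means the rows without a PCR label are excluded from PCR modelling but remain available for RFS.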