# Effects of Data Leakage in Convolutional Neural Networks

This blog is based on the paper “Effect of data leakage in brain MRI classification using 
2D convolutional neural networks”.

Deep Learning (DL) models are changing the way neurological diseases are diagnosed from 
MRI scans, offering more advanced detection and treatment. DL models, specifically deep 
convolutional neural networks (CNNs), are good at analysing complex imaging data and 
identifying otherwise hard-to-detect patterns indicative of neurological ailments. CNNs 
remove the need to handcraft features because they learn complex features directly from 
the input data. Applications of deep learning in neuroimaging continue to grow [1] and 
include image enhancement and transformation, identification of subtle patterns, and 
prediction of patient outcomes. Although DL methods have shown good performance in the 
classification of neurological diseases, many challenges remain unaddressed, such as the 
complexity and non-reproducibility involved in interpreting the results of highly 
nonlinear computations. 

The study explores the impact of data leakage in brain MRI classification of neurological 
diseases such as Alzheimer’s and Parkinson’s by comparing subject-level and slice-level data 
splits using 2D convolutional neural networks across multiple datasets. It addresses the 
overestimation in model evaluation caused by slice-level cross-validation, which erroneously 
places slices from the same patient in both the training and test sets, i.e. induces data 
leakage. The focus is on leakage in 2D CNNs caused by improper splitting of 3D MRI data 
(such as T1-weighted brain scans), which undermines the reliability of the reported model 
performance. Nested cross-validation is applied with both the subject-level split and the 
slice-level split, and several datasets are used to assess how much performance is 
overestimated because of data leakage, a critical factor when the goal is to predict 
patient outcomes accurately.
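
The difference between the two splitting strategies can be sketched in a few lines of 
plain Python. This is an illustrative sketch, not the code used in the paper: the 
`(subject_id, slice_index)` pairs stand in for 2D slices cut from 3D scans, and the 
function names, subject counts and test fraction are assumptions chosen for the example.

```python
import random

def make_slices(n_subjects=10, slices_per_subject=5):
    """Return (subject_id, slice_index) pairs standing in for 2D slices of 3D scans."""
    return [(s, k) for s in range(n_subjects) for k in range(slices_per_subject)]

def slice_level_split(slices, test_fraction=0.25, seed=0):
    """Shuffle individual slices: slices of one subject can land on both sides."""
    rng = random.Random(seed)
    shuffled = slices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def subject_level_split(slices, test_fraction=0.25, seed=0):
    """Shuffle subjects first, then send all slices of a subject to one side only."""
    rng = random.Random(seed)
    subjects = sorted({s for s, _ in slices})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [x for x in slices if x[0] not in test_subjects]
    test = [x for x in slices if x[0] in test_subjects]
    return train, test

def shared_subjects(train, test):
    """Subjects that appear in both the training and the test set."""
    return {s for s, _ in train} & {s for s, _ in test}

slices = make_slices()
leaky_train, leaky_test = slice_level_split(slices)
clean_train, clean_test = subject_level_split(slices)

print("slice-level split, subjects on both sides:",
      len(shared_subjects(leaky_train, leaky_test)))
print("subject-level split, subjects on both sides:",
      len(shared_subjects(clean_train, clean_test)))
```

With the slice-level split, neighbouring slices of the same brain end up in both sets, so 
the test set no longer measures how the model behaves on an unseen patient.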

Data leakage in brain MRI classification can introduce errors and inflated results. It 
arises from several factors, chiefly improper data handling such as the choice of data 
splitting method. The research emphasises the use of subject-level rather than slice-level 
cross-validation to avoid inflated estimates of model performance. Data leakage refers to 
using information during model training that is not available when making predictions [2]. 
It often results from incorrect data splitting, for instance performing feature selection 
on the entire dataset before cross-validation, in which case the target variables of the 
test data are mistakenly used to improve model learning. Other cases include using the 
same test data both to optimise hyperparameters and to evaluate the model, or performing 
data augmentation before the train-test split, so that variants of the same original 
sample end up in both the training and test data and inflate the accuracy. 
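
The feature selection case can be reproduced on data that contains no signal at all. The 
sketch below assumes scikit-learn and NumPy are installed; the noise data, the choice of 
`SelectKBest` with `f_classif`, and the value `k=20` are illustrative assumptions, not 
taken from the paper. Because the features are pure noise, an honest estimate should sit 
near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))  # pure noise features
y = np.repeat([0, 1], 30)        # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees the labels of every row, including rows that
# later serve as test folds, before cross-validation starts.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
).mean()

# Safer: the selector is part of the pipeline, so it is refit on the
# training portion of each fold and never sees the test fold's labels.
pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(pipeline, X, y, cv=cv).mean()

print(f"selection before CV: {leaky_score:.2f}")
print(f"selection inside CV: {honest_score:.2f}")
```

The same pattern applies to any step that learns from the data (scaling, imputation, 
augmentation): fitting it inside the cross-validation loop keeps test folds unseen.
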
Data leakage from improperly splitting data into training, validation and test groups 
arises when training data is included in the held-out test set. The evaluation then becomes 
overoptimistic about the true generalisation error, an effect similar to overfitting: the 
model has already seen the data, which biases the result. Leakage can still arise after a 
correct split if the model is regularised after examining the distribution of the held-out 
test set, in which case any performance improvement is deceptive.

It can also occur if an input feature is directly related to the target label. For 
instance, when creating a student attrition model to predict the risk of students not 
finishing their degree, the label may be derived from log data (i.e. 0 credits = dropped 
out, more than 0 credits = enrolled). If credits are also used as a feature, the model 
will assign a higher probability of dropping out to students with fewer credits, which may 
not be a true representation.

Another form of data leakage occurs when a model expects all its features to be available 
at run time but the values themselves are volatile, which can be difficult to recognise. 
Suppose a feature is derived from a dataset column that changes after every user action: 
the training data then contains volatile values, and at inference time the column may 
reflect a different relationship. 

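
The student attrition case can be made concrete with a small check. This is a hedged 
sketch in plain Python: the records, the `attendance` feature and the helper names are 
hypothetical and exist only to illustrate the idea. If a single-feature threshold rule 
reproduces the label perfectly, that feature may simply be the label in disguise.

```python
# Hypothetical student records; "attendance" is an invented second feature.
records = [
    {"credits": 0,  "attendance": 0.40},
    {"credits": 0,  "attendance": 0.85},
    {"credits": 30, "attendance": 0.90},
    {"credits": 45, "attendance": 0.55},
    {"credits": 0,  "attendance": 0.70},
    {"credits": 60, "attendance": 0.95},
]

def dropped_out(record):
    """Label derived from the log data: 0 credits means the student dropped out."""
    return record["credits"] == 0

labels = [dropped_out(r) for r in records]

def rule_accuracy(feature, threshold):
    """Accuracy of the rule 'value <= threshold means dropped out' for one feature."""
    predictions = [r[feature] <= threshold for r in records]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

def best_rule_accuracy(feature):
    """Best accuracy over all thresholds taken from the observed values."""
    return max(rule_accuracy(feature, r[feature]) for r in records)

for feature in ("credits", "attendance"):
    print(feature, round(best_rule_accuracy(feature), 2))
```

A perfect score from one simple rule is a warning sign rather than proof of leakage, but 
it is a cheap signal that the feature and the label come from the same source and that 
the feature should be reviewed before training.
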
In relation to feature engineering, there are two reasons why data may be unavailable at 
inference time. First, if a query takes too long to return a particular piece of data, 
that data should not be used to train the model. Secondly, the data generating process 
itself may not produce the required data at the right time. Returning to the student 
attrition example, assume that final-year grades are highly predictive of a student 
dropping out. If predictions are needed before the final grades are available, the 
corresponding values in the database may be null (tree-based models can accept null 
features), and the model will underperform in actual predictions. Data leakage can 
therefore lead to suboptimal user experiences, lost profits, and even life-threatening 
situations, so it must be addressed by the machine learning engineer.

The following practices help prevent data leakage:

- During cross-validation, the validation set must remain independent of the training 
set. No information from the validation set should be used to generate features; 
feature engineering must be based on the training data only. 
- Techniques like Time Series Cross-Validation involve splitting data based on time, 
which ensures that the validation set contains data only from a later period than the 
training set, preventing contamination. 
- Datasets containing grouped samples (for example several slices from one subject) can 
use group-aware cross-validation techniques like StratifiedGroupKFold and GroupKFold, 
so that all samples of a group stay together in either the training or the test set. 
- When sampling datasets, it is advisable to set a seed to ensure reproducibility of 
results. A reproducible split also makes it easier to inspect and trace suspected data 
leakage. 
- Nested cross-validation uses an inner loop for hyperparameter tuning and an outer loop 
for estimating performance, so the data used to tune the model is never the data used 
to evaluate it.
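
Several of the practices above can be combined in one short sketch. It assumes 
scikit-learn and NumPy are installed and uses synthetic data in place of real slices; the 
subject counts, the logistic regression model and the `C` grid are illustrative 
assumptions, not the setup of the paper. `GroupKFold` is used for both loops, and 
`StratifiedGroupKFold` could be swapped in when class balance per fold matters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(42)  # fixed seed so the sampled data is reproducible

# Synthetic stand-in for slices: 30 subjects, 4 slices each, one label per subject.
n_subjects, slices_per_subject = 30, 4
groups = np.repeat(np.arange(n_subjects), slices_per_subject)
y = np.repeat(np.arange(n_subjects) % 2, slices_per_subject)
X = rng.normal(size=(len(groups), 8)) + 0.5 * y[:, None]

outer_cv = GroupKFold(n_splits=5)  # outer loop: estimates performance
inner_cv = GroupKFold(n_splits=3)  # inner loop: tunes hyperparameters
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups):
    # No subject may contribute slices to both sides of the outer split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    # Tuning only ever sees the outer-training subjects.
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=inner_cv)
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("nested CV accuracy per outer fold:", np.round(outer_scores, 2))

# Time-ordered data: every validation index comes after every training index.
timeline = np.arange(20).reshape(-1, 1)
time_ordered = all(
    train_idx.max() < val_idx.min()
    for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(timeline)
)
print("validation always later than training:", time_ordered)
```

Passing `groups` to both the outer split and the inner search is the important design 
choice here: a group-aware outer loop alone would still let the hyperparameter search 
mix slices of one subject across its own folds.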