# Effects of Data Leakage in Convolutional Neural Networks

This blog is based on the paper “Effect of data leakage in brain MRI classification using 2D convolutional neural networks”.

Deep learning (DL) models are changing the way neurological diseases are diagnosed from MRI scans, offering more advanced detection and treatment. DL models, specifically deep convolutional neural networks (CNNs), are well suited to analysing complex imaging data and identifying otherwise hard-to-detect patterns indicative of neurological ailments. CNNs remove the need to handcraft features because they learn complex features directly from the input data. Applications of deep learning in neuroimaging continue to grow [1], including image improvement and transformation, identification of subtle patterns, and prediction of patient outcomes. Although DL methods have performed well in classifying neurological diseases, many challenges remain unaddressed, such as the complexity and non-reproducibility involved in interpreting highly nonlinear computation results.

The study explores the impact of data leakage in brain MRI classification of neurological diseases such as Alzheimer’s and Parkinson’s by comparing subject-level and slice-level data splits using 2D convolutional neural networks across multiple datasets. It addresses the overestimation in model evaluation caused by slice-level cross-validation, which erroneously places data from the same patient in both the training and test sets, i.e. it induces data leakage. The focus is on leakage in 2D CNNs that arises when 3D MRI data, such as T1-weighted brain scans, are split improperly, leading to compromised estimates of model performance. Nested cross-validation is applied with both the subject-level split and the slice-level split. Several datasets are used to assess the potential performance overestimation due to data leakage, which is a critical factor in predicting patient outcomes accurately.
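
The difference between the two splitting strategies can be illustrated with a small sketch. It uses synthetic `(subject, slice)` pairs rather than real MRI data, and the function names `slice_level_split`, `subject_level_split`, and `shared_subjects` are illustrative, not taken from the paper. A slice-level split shuffles individual slices, so one subject's slices can land on both sides; a subject-level split assigns each subject wholly to one side.

```python
import random


def make_slices(n_subjects=10, slices_per_subject=8):
    """Return (subject_id, slice_index) pairs standing in for 2D slices of 3D scans."""
    return [(s, k) for s in range(n_subjects) for k in range(slices_per_subject)]


def slice_level_split(slices, test_fraction=0.25, seed=0):
    """Shuffle individual slices, ignoring which subject each one came from."""
    rng = random.Random(seed)
    shuffled = slices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def subject_level_split(slices, test_fraction=0.25, seed=0):
    """Shuffle subjects, then send all slices of a subject to the same side."""
    rng = random.Random(seed)
    subjects = sorted({s for s, _ in slices})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [x for x in slices if x[0] not in test_subjects]
    test = [x for x in slices if x[0] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Subjects that contribute slices to both the training and the test set."""
    return {s for s, _ in train} & {s for s, _ in test}


slices = make_slices()
leaky_train, leaky_test = slice_level_split(slices)
clean_train, clean_test = subject_level_split(slices)

print("slice-level split, shared subjects:", len(shared_subjects(leaky_train, leaky_test)))
print("subject-level split, shared subjects:", len(shared_subjects(clean_train, clean_test)))
```

With the subject-level split, no subject appears on both sides. With the slice-level split, at least one subject does, which is the leakage the study describes.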

# Study Overview
The study embarks on a journey back to 1989, revisiting one of the earliest applications of neural networks trained with backpropagation for handwritten zip code recognition. Using modern tools and computational power, the study aims to replicate and extend the original findings.

## Methodology at a Glance
- Dataset: Simulated from MNIST, adjusted to match the original study's conditions.
- Network Architecture: A small, four-layer convolutional neural network.
- Training Process: Leveraging PyTorch for an efficient training loop.

## Delving into the Experimental Setup
The recreation of Yann LeCun et al.’s pioneering work involves meticulously simulating the dataset and adhering to the original network's blueprint, with adjustments to accommodate today's computational standards.

- Dataset Preparation: The MNIST dataset's images were resized to 16x16 pixels, echoing the original study's constraints.
- Neural Network Specification: A concise convolutional neural network, constructed with PyTorch, mirrors the architecture from 1989, albeit with modern optimizations.
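
The resizing step can be sketched as follows. This is a minimal illustration, not the original preprocessing: a random tensor stands in for a batch of 28x28 MNIST digits, and bilinear interpolation is an assumption, since the text only says that the images were resized to 16x16.

```python
import torch
import torch.nn.functional as F

# Stand-in for a batch of MNIST digits: 4 grayscale images of 28x28 pixels.
# Real data would come from a dataset loader; random values keep the sketch self-contained.
batch = torch.rand(4, 1, 28, 28)

# Downsample to 16x16. The interpolation mode is an assumption; the source
# does not say which resizing method was used.
resized = F.interpolate(batch, size=(16, 16), mode="bilinear", align_corners=False)

print(resized.shape)
```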

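```python
# Detailed PyTorch model definition
import torch.nn as nn


class LeCunNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Two 5x5 convolutions. stride=2 and padding=2 are an assumed fix so that a
        # 16x16 input shrinks to 8x8 and then 4x4, which is what the 12*4*4 input
        # size of the first linear layer requires.
        self.conv_layers = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(6, 12, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        # Classifier head: flattened feature maps -> 120 hidden units -> 10 digit classes.
        self.fc_layers = nn.Sequential(
            nn.Linear(12 * 4 * 4, 120), nn.ReLU(),
            nn.Linear(120, 10),
        )

    def forward(self, x):
        x = self.conv_layers(x)      # (N, 1, 16, 16) -> (N, 12, 4, 4)
        x = x.view(x.size(0), -1)    # flatten to (N, 192)
        x = self.fc_layers(x)        # class logits, shape (N, 10)
        return x
```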
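
The `12*4*4` in the first linear layer is the flattened size of the last convolutional output: 12 channels of 4x4. Whether two 5x5 convolutions actually produce a 4x4 map from a 16x16 input depends on stride and padding, following the standard formula `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below, with illustrative helper names, checks two configurations: stride 1 with no padding (the `nn.Conv2d` defaults), and stride 2 with padding 2.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def feature_map_size(input_size, layers):
    """Apply a sequence of (kernel_size, stride, padding) layers to a square input."""
    size = input_size
    for kernel_size, stride, padding in layers:
        size = conv_output_size(size, kernel_size, stride, padding)
    return size


# Two 5x5 convolutions on a 16x16 input, 12 output channels in the last layer.
unit_stride = feature_map_size(16, [(5, 1, 0), (5, 1, 0)])  # 16 -> 12 -> 8
strided = feature_map_size(16, [(5, 2, 2), (5, 2, 2)])      # 16 -> 8 -> 4

print("stride 1, no padding:", unit_stride, "->", 12 * unit_stride ** 2, "features")
print("stride 2, padding 2:", strided, "->", 12 * strided ** 2, "features")
```

Under these assumptions, only the strided configuration yields 192 features, so it is the one compatible with `nn.Linear(12*4*4, 120)` for 16x16 inputs.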


## Optimization and Training
The training regimen is adapted to modern standards, utilizing the Adam optimizer for improved convergence.


```python
# PyTorch training loop with optimizer details
# Assumes `model`, `dataloader`, and `total_epochs` are already defined.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
for epoch in range(total_epochs):
    for images, labels in dataloader:
        ...
```
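
For readers who want to see what a complete training step might look like, here is a hedged, self-contained sketch. It is not the post's actual loop: the data are random tensors, the model is a one-layer stand-in, `total_epochs = 3` is an illustrative value, and cross-entropy is an assumed loss, since the source does not name one. Only the Adam optimizer and its learning rate of `3e-4` come from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 random 16x16 "images" with random digit labels.
images = torch.rand(64, 1, 16, 16)
labels = torch.randint(0, 10, (64,))
dataloader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

# Stand-in model so the sketch runs on its own; any nn.Module that outputs 10 logits works.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 10))

criterion = nn.CrossEntropyLoss()  # assumed loss; the source does not specify one
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
total_epochs = 3  # illustrative value

epoch_losses = []
for epoch in range(total_epochs):
    running = 0.0
    for batch_images, batch_labels in dataloader:
        optimizer.zero_grad()                   # clear gradients from the previous step
        logits = model(batch_images)            # forward pass
        loss = criterion(logits, batch_labels)  # compare logits with the target labels
        loss.backward()                         # backpropagate
        optimizer.step()                        # update the parameters
        running += loss.item()
    epoch_losses.append(running / len(dataloader))
    print(f"epoch {epoch + 1}: mean loss {epoch_losses[-1]:.4f}")
```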


The recreation highlights both the timeless nature of deep learning fundamentals and the evolution of methodologies over three decades.