Patient overlap in medical data is one instance of a more general problem in machine learning called **data leakage**.
To identify patient overlap, we check whether a patient's ID appears in both the training set and the test set. We should also verify that there is no patient overlap between the training and validation sets, which is what we will do here.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import seaborn as sns
sns.set()

#1. Data

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')
path = '/content/gdrive/MyDrive/train-small.csv'
train_df = pd.read_csv(path)
print(f'There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in the training dataframe')
train_df.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
There are 1000 rows and 16 columns in the training dataframe


Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00008270_015.png,0,0,0,0,0,0,0,0,0,0,0,8270,0,0,0
1,00029855_001.png,1,0,0,0,1,0,0,0,1,0,0,29855,0,0,0
2,00001297_000.png,0,0,0,0,0,0,0,0,0,0,0,1297,1,0,0
3,00012359_002.png,0,0,0,0,0,0,0,0,0,0,0,12359,0,0,0
4,00017951_001.png,0,0,0,0,0,0,0,0,1,0,0,17951,0,0,0


In [4]:
# Read the CSV file containing the validation data from Google Drive
path_valid = '/content/gdrive/MyDrive/valid-small.csv'
valid_df = pd.read_csv(path_valid)
print(f"There are {valid_df.shape[0]} rows and {valid_df.shape[1]} columns in the validation dataset")
valid_df.head()

There are 109 rows and 16 columns in the validation dataset


Unnamed: 0,Image,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,Nodule,PatientId,Pleural_Thickening,Pneumonia,Pneumothorax
0,00027623_007.png,0,0,0,1,1,0,0,0,0,0,0,27623,0,0,0
1,00028214_000.png,0,0,0,0,0,0,0,0,0,0,0,28214,0,0,0
2,00022764_014.png,0,0,0,0,0,0,0,0,0,0,0,22764,0,0,0
3,00020649_001.png,1,0,0,0,1,0,0,0,0,0,0,20649,0,0,0
4,00022283_023.png,0,0,0,0,0,0,0,0,0,0,0,22283,0,0,0


##1.2 Extracting Patient IDs
1.- Extract patient IDs from the training and validation sets

In [5]:
# Extract patient IDs for the training set
ids_train = train_df.PatientId.values
# Extract patient IDs for the validation set
ids_valid = valid_df.PatientId.values

##1.3 Comparing PatientIDs
2.- Convert these arrays of numbers into set() datatypes for easy comparison


3.- Identify patient overlap in the intersection of two sets

In [6]:
# Create a set of the training set IDs to identify unique IDs
ids_train_set = set(ids_train)
print(f"There are {len(ids_train_set)} unique Patient IDs in the training set")
# Create a set of the validation set IDs to identify unique IDs
ids_valid_set = set(ids_valid)
print(f"There are {len(ids_valid_set)} unique Patient IDs in the validation set")

There are 928 unique Patient IDs in the training set
There are 97 unique Patient IDs in the validation set


In [7]:
# Identify patient overlap by looking at the intersection between the sets
patient_overlap = list(ids_train_set.intersection(ids_valid_set))
n_overlap = len(patient_overlap)
print(f'There are {n_overlap} Patient IDs in both the training and validation sets')
print('')
print(f'These patients are in both the training and validation datasets:')
print(f'{patient_overlap}')

There are 11 Patient IDs in both the training and validation sets

These patients are in both the training and validation datasets:
[20290, 27618, 9925, 10888, 22764, 19981, 18253, 4461, 28208, 8760, 7482]


##1.4 Identifying and Removing Overlapping Patients
1. Create lists of the overlapping row numbers in both the training and validation sets.
2. Drop the overlapping patient records from the validation set.

In [8]:
train_overlap_idxs = []
valid_overlap_idxs = []
for patient_id in patient_overlap:
  # Collect the row indices where this overlapping patient appears in each set
  train_overlap_idxs.extend(train_df.index[train_df['PatientId'] == patient_id].tolist())
  valid_overlap_idxs.extend(valid_df.index[valid_df['PatientId'] == patient_id].tolist())
print(f'These are the indices of overlapping patients in the training set: ')
print(f'{train_overlap_idxs}')
print(f'These are the indices of overlapping patients in the validation set: ')
print(f'{valid_overlap_idxs}')

These are the indices of overlapping patients in the training set: 
[306, 186, 797, 98, 408, 917, 327, 913, 10, 51, 276]
These are the indices of overlapping patients in the validation set: 
[104, 88, 65, 13, 2, 41, 56, 70, 26, 75, 20, 52, 55]
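
The validation list holds 13 row indices for only 11 overlapping patients, so at least one patient contributes more than one image to the validation set. A minimal sketch of how one might count rows per overlapping patient follows. It uses a small made-up dataframe; the IDs and the names `toy_valid_df`, `toy_overlap`, and `rows_per_patient` are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical validation rows: patient 101 appears twice, patient 102 once.
toy_valid_df = pd.DataFrame({"PatientId": [101, 102, 101, 103]})
# Hypothetical list of overlapping patient IDs.
toy_overlap = [101, 102]

# Keep only rows that belong to overlapping patients, then count rows per patient.
overlap_rows = toy_valid_df.loc[toy_valid_df["PatientId"].isin(toy_overlap), "PatientId"]
rows_per_patient = overlap_rows.value_counts()
print(rows_per_patient)
```

Applied to the real `valid_df` and `patient_overlap`, the same pattern should show which patients account for the extra rows.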


In [9]:
# Drop the overlapping rows from the validation set
valid_df.drop(valid_overlap_idxs, inplace=True)
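
As an alternative to collecting indices in a loop and dropping them in place, the same filtering can be sketched with a boolean mask. This is an illustrative variant on made-up data, not the approach used in this notebook; `toy_train_df`, `toy_valid_df`, and `toy_valid_clean` are hypothetical names.

```python
import pandas as pd

# Hypothetical data: patient 3 appears in both sets.
toy_train_df = pd.DataFrame({"PatientId": [1, 2, 3, 3]})
toy_valid_df = pd.DataFrame({"PatientId": [3, 4, 5, 3]})

# True for validation rows whose patient also appears in the training set.
overlap_mask = toy_valid_df["PatientId"].isin(toy_train_df["PatientId"])

# Keep only the non-overlapping rows; the original dataframe is left unchanged.
toy_valid_clean = toy_valid_df[~overlap_mask].reset_index(drop=True)
print(toy_valid_clean["PatientId"].tolist())  # prints [4, 5]
```

Returning a new dataframe instead of using `inplace=True` means the cell can be rerun without raising an error about already-dropped labels.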

##1.5 Sanity Check
Let's check that everything worked as planned by rerunning the patient ID comparison between train and validation sets.

In [10]:
# Extract patient IDs for the validation set
ids_valid = valid_df.PatientId.values
# Create a set of the validation set IDs to identify unique IDs
ids_valid_set = set(ids_valid)
print(f"There are {len(ids_valid_set)} unique Patient IDs in the validation set")

There are 86 unique Patient IDs in the validation set


In [11]:
# Identify patient overlap by looking at the intersection between the sets
patient_overlap = list(ids_train_set.intersection(ids_valid_set))
n_overlap = len(patient_overlap)
print(f"There are {n_overlap} Patient IDs in both the training and validation sets")

There are 0 Patient IDs in both the training and validation sets
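
The sanity check above could be wrapped in a small reusable function so it can be run on any pair of splits. The sketch below assumes both dataframes share a patient ID column; the function name `has_patient_overlap` and the toy dataframes are hypothetical.

```python
import pandas as pd

def has_patient_overlap(df1, df2, patient_col="PatientId"):
    """Return True if any patient ID occurs in both dataframes."""
    ids1 = set(df1[patient_col])
    ids2 = set(df2[patient_col])
    # A non-empty intersection means at least one shared patient.
    return len(ids1 & ids2) > 0

# Made-up examples: toy_a and toy_b share patient 2, toy_a and toy_c share none.
toy_a = pd.DataFrame({"PatientId": [0, 1, 2]})
toy_b = pd.DataFrame({"PatientId": [2, 3, 4]})
toy_c = pd.DataFrame({"PatientId": [5, 6]})

print(has_patient_overlap(toy_a, toy_b))  # True
print(has_patient_overlap(toy_a, toy_c))  # False
```

With the notebook's variables, `has_patient_overlap(train_df, valid_df)` would be expected to return `False` after the overlapping rows are dropped.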
