# Group members

**Name:** Shilong Yin  
**UNI:** sy2792

**Name:** Diwei Xiong  
**UNI:** dx2183

## Introduction  
This notebook cleans the dataset from https://www.kaggle.com/paultimothymooney/blood-cells. The original dataset is split into train and test sets, each divided into 4 folders, one per cell type. Some images in different class folders share the same file name. Here the four folders of each split are merged into a single folder (one for train, one for test), and images with duplicate names are removed so that each name appears only once. Two CSV files listing each image id and its label are created for later reference.

In [1]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm

Map each blood cell type to a numerical label:  
**0: Neutrophil  
1: Eosinophil  
2: Monocyte  
3: Lymphocyte**

In [2]:
dict_characters = {'NEUTROPHIL':0,'EOSINOPHIL':1,'MONOCYTE':2,'LYMPHOCYTE':3}
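
When reading the CSV files later, it can help to translate a numeric label back to its cell type. A minimal sketch of the inverse mapping; `label_to_name` is a hypothetical name that does not appear in the original notebook:

```python
# Mapping used in this notebook (cell type -> numeric label).
dict_characters = {'NEUTROPHIL': 0, 'EOSINOPHIL': 1, 'MONOCYTE': 2, 'LYMPHOCYTE': 3}

# Hypothetical inverse mapping (numeric label -> cell type).
label_to_name = {label: name for name, label in dict_characters.items()}

print(label_to_name[1])  # EOSINOPHIL
```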

## Merge data

List every file in the four class folders, storing the file names in X and the corresponding labels in Y.

In [3]:
# Store file names and labels in lists

def merge(path):
    X = []  # file names
    Y = []  # labels, aligned with X
    
    for wbc_type in os.listdir(path):
        if wbc_type not in dict_characters:
            continue  # skip anything that is not one of the four class folders
        label = dict_characters[wbc_type]
        for image_filename in tqdm(os.listdir(os.path.join(path, wbc_type))):
            X.append(image_filename)
            Y.append(label)
      
    return X, Y

In [4]:
train_dir = './bcc/TRAIN/'
test_dir = './bcc/TEST/'

X_train,Y_train = merge(train_dir)
X_test,Y_test = merge(test_dir)

100%|██████████| 2497/2497 [00:00<00:00, 2583418.13it/s]
100%|██████████| 2483/2483 [00:00<00:00, 2486143.91it/s]
100%|██████████| 2478/2478 [00:00<00:00, 1255706.82it/s]
100%|██████████| 2499/2499 [00:00<00:00, 2498585.39it/s]
100%|██████████| 623/623 [00:00<?, ?it/s]
100%|██████████| 620/620 [00:00<00:00, 621824.12it/s]
100%|██████████| 620/620 [00:00<?, ?it/s]
100%|██████████| 624/624 [00:00<?, ?it/s]


## Delete duplicate data

Drop rows with duplicate file names so that each image id maps to a single label.  

At the same time, merge the images from the four folders in the file explorer, keeping only the first file for each name.
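
The file-explorer merge could also be done in code. The following is only a sketch under assumptions: `merge_folders` and the `train_merged` folder name are hypothetical, and the demonstration runs on a small temporary directory rather than the real dataset.

```python
import os
import shutil
import tempfile


def merge_folders(src_dir, dst_dir):
    """Copy images from each class folder in src_dir into dst_dir,
    keeping only the first file seen for any given file name."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    # sorted() makes the traversal order, and so which copy wins, reproducible.
    for wbc_type in sorted(os.listdir(src_dir)):
        class_dir = os.path.join(src_dir, wbc_type)
        if not os.path.isdir(class_dir):
            continue
        for image_filename in sorted(os.listdir(class_dir)):
            target = os.path.join(dst_dir, image_filename)
            if os.path.exists(target):
                continue  # a file with this name was already copied
            shutil.copy(os.path.join(class_dir, image_filename), target)
            copied.append((image_filename, wbc_type))
    return copied


# Toy demonstration on a temporary directory (not the real dataset).
root = tempfile.mkdtemp()
toy_files = {'EOSINOPHIL': ['a.jpeg', 'b.jpeg'], 'MONOCYTE': ['b.jpeg', 'c.jpeg']}
for wbc_type, names in toy_files.items():
    os.makedirs(os.path.join(root, 'TRAIN', wbc_type))
    for name in names:
        with open(os.path.join(root, 'TRAIN', wbc_type, name), 'w') as f:
            f.write(wbc_type)  # file content records the source folder

merged_dir = os.path.join(root, 'train_merged')
copied = merge_folders(os.path.join(root, 'TRAIN'), merged_dir)
print(copied)
```

For the labels in the CSV to match the retained images, the label listing and the folder merge need to visit the class folders in the same order. `os.listdir` does not guarantee any order, which is why this sketch sorts explicitly.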

In [5]:
train_dataset = pd.DataFrame({'id':X_train,'label':Y_train})
train_dataset = train_dataset.drop_duplicates(subset='id', keep='first')  # Only retain the first file with the same name

test_dataset = pd.DataFrame({'id':X_test,'label':Y_test})
test_dataset = test_dataset.drop_duplicates(subset='id', keep='first')

In [6]:
train_dataset.head()

| | id | label |
|---|---|---|
| 0 | _0_1169.jpeg | 1 |
| 1 | _0_1414.jpeg | 1 |
| 2 | _0_207.jpeg | 1 |
| 3 | _0_2142.jpeg | 1 |
| 4 | _0_2370.jpeg | 1 |


In [7]:
test_dataset.head()

| | id | label |
|---|---|---|
| 0 | _0_1616.jpeg | 1 |
| 1 | _0_1794.jpeg | 1 |
| 2 | _0_1845.jpeg | 1 |
| 3 | _0_187.jpeg | 1 |
| 4 | _0_196.jpeg | 1 |


In [8]:
if len(train_dataset['id']) == train_dataset['id'].nunique() and len(test_dataset['id']) == test_dataset['id'].nunique():
    print('Data is clean.')  # Make sure that all file names in each dataset are unique

Data is clean.
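
The check above only confirms that names are unique after the drop. To see which names collided across classes in the first place, one option is to group the labels by file name before deduplicating. A minimal sketch; `find_conflicts` and the toy lists are hypothetical and only mimic the shape of `X_train` / `Y_train`:

```python
from collections import defaultdict


def find_conflicts(names, labels):
    """Return {file name: sorted labels} for names seen under more than one label."""
    seen = defaultdict(set)
    for name, label in zip(names, labels):
        seen[name].add(label)
    return {name: sorted(found) for name, found in seen.items() if len(found) > 1}


# Toy input in the same shape as X_train / Y_train (not the real data).
X_toy = ['_0_1.jpeg', '_0_2.jpeg', '_0_1.jpeg', '_0_3.jpeg']
Y_toy = [1, 1, 2, 3]

conflicts = find_conflicts(X_toy, Y_toy)
print(conflicts)  # {'_0_1.jpeg': [1, 2]}
```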


## Output .csv file

Save the cleaned datasets to CSV files.

In [9]:
train_dataset.to_csv('train.csv',index = False)
test_dataset.to_csv('test.csv',index = False)
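
A quick sanity check after saving is to read a file back and count the rows per label. The sketch below uses only the standard library and a toy CSV with the same `id` and `label` columns; the file name `toy.csv` and its rows are made up for illustration:

```python
import csv
import os
import tempfile
from collections import Counter

# Toy CSV with the same columns as train.csv (not the real data).
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.csv')
with open(toy_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'label'])
    writer.writerows([['_0_1.jpeg', 1], ['_0_2.jpeg', 1], ['_1_5.jpeg', 3]])

# Read the file back and count rows per label.
with open(toy_path, newline='') as f:
    rows = list(csv.DictReader(f))

label_counts = Counter(int(row['label']) for row in rows)
ids = [row['id'] for row in rows]

print(label_counts)
print(len(ids) == len(set(ids)))  # True when every id is unique
```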