# Data Preparation

### Input:
##### 1) CSV of Targets for each image
##### 2) All folders of train images from Kaggle in a new folder called train in a local only SML_Project_Data folder (38.1 GB total). The SML_Project_Data folder is in the same folder as the GitHub folder of DS-5220-Supervised-Machine-Learning-Project
### Output:
##### A high-level directory called data is created. Inside it are two major directories, train and val. Each major directory has a separate folder for each of the 5 classes. This results in 10 folders with resized images in them.
### Next Steps:
##### Move the high-level directory and all of its contents next to the script that will train the model. That script can create all of the folders itself, although they would of course be empty.

# Data Notes
- Classes are very unbalanced, with most images in class 0. This must be solved before upload to the system which trains the model.
- The Test data has no labels, so we should use the training data to evaluate the models
- The left and right eye images from the same person are next to each other in the training set
- The JPEG files are compressed, so they are smaller than the equivalent RGB arrays (tensors) by a factor of about 15, even though the JPEG and the array hold the same image.
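
The gap between file size and in-memory size can be estimated with simple arithmetic. The sketch below is illustrative only: `decoded_bytes` is a hypothetical helper, it assumes one byte per value (uint8), and the shape and image count are the ones reported elsewhere in this notebook.

```python
from math import prod

def decoded_bytes(shape, bytes_per_value=1):
    """Bytes needed to hold an image array of the given shape (uint8 by default)."""
    return prod(shape) * bytes_per_value

original = decoded_bytes((3168, 4752, 3))  # shape of 10_left.jpeg as loaded in this notebook
resized = decoded_bytes((500, 500, 3))     # one 500 x 500 color image
dataset = resized * 35126                  # every labeled training image at 500 x 500

print(original, resized, dataset)
```

Under these assumptions the full resized training set would need roughly 26 GB as uint8 arrays, which is one reason to keep the images as JPEG files and read them in batches.
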
# Possible Deep Learning Classification Improvements
- Each eye can have a different disease value. However, the disease level of one eye is strongly correlated with that of the other eye.
- Images are different sizes, but it is easy to rescale all images to a set size. Other sizes can be tried, such as larger sizes more similar to the original images.
- Deal with use of notch on reversed eye images.
- Images have color, but not too much. Can try to convert to greyscale to reduce the bytes of the image which results in more room to have a larger image with more pixels. Note, this will require changing the model because it expects a size of (x, y, 3) of each of Red Green and Blue.
- The CNN architecture was for 150 by 150 color dog and cat images. The 500 by 500 eye images may need a different architecture. In particular, other filter windows may be important. Since the image is larger, more convolution layers may be needed.
- Data augmentation on non-0 class
- Remove black outer edge in images
- Use of existing image models for transfer learning like VGG16-VGG19
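
As one example of the ideas above, removing the black outer edge could be approached by cropping each image to the bounding box of its non-dark pixels. This is a minimal sketch, not a tested preprocessing step: `crop_black_border` and the `threshold` value of 10 are assumptions chosen for illustration, and the demo uses a synthetic array rather than a real eye image.

```python
import numpy as np

def crop_black_border(img, threshold=10):
    """Crop an H x W x C uint8 array to the bounding box of pixels brighter than threshold."""
    mask = img.max(axis=2) > threshold   # True where any channel exceeds the threshold
    if not mask.any():
        return img                       # fully dark image: nothing to crop to
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic example: a 6 x 8 black image with a bright 2 x 3 patch
demo = np.zeros((6, 8, 3), dtype=np.uint8)
demo[2:4, 3:6] = 200
cropped = crop_black_border(demo)
print(cropped.shape)
```

Cropping before resizing would let the eye fill more of the fixed-size output, but the threshold would need to be checked against real images.
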
# Sick/Healthy Model
- If we use a different format of the data, such as a sick/healthy class, this is done by changing the final layer of the CNN model as well as the input to the Keras DirectoryIterator. The latter can be done by making sure that there are exactly two folders inside each of the train and val subfolders.
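
A possible way to derive the two folders from the five existing ones is sketched below. The rule that level 0 means healthy and levels 1 to 4 mean sick is an assumption, as are the folder names `healthy` and `sick` and the helper `binary_class`.

```python
def binary_class(level):
    """Map a 0-4 disease level to a two-class folder name (assumed rule: level 0 is healthy)."""
    if level not in (0, 1, 2, 3, 4):
        raise ValueError('unexpected level: {}'.format(level))
    return 'healthy' if level == 0 else 'sick'

# Destination folder for each of the five existing resized_<level> folders
folder_map = {'resized_{}'.format(level): binary_class(level) for level in range(5)}
print(folder_map)
```

With only these two folders under train and val, a directory-based loader would see two classes, matching the two-class final layer described above.
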
# Notes of Manual Steps
- Balancing the classes. Manually moved a set number of files from each class into a train and validate folder. This should be automatable, most easily by first building a list of the names of all files in a folder.
- Moving images into the proper folders of Google Colab. This took around 40 minutes for the 500 x 500 images.
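
The manual balancing step could be automated along the following lines. This is a sketch under assumptions: `sample_balanced` is a hypothetical helper, it draws a random sample rather than the first files in each folder, and the demo builds empty placeholder files in a temporary directory instead of touching the real data.

```python
import os
import random
import tempfile

def sample_balanced(class_dirs, seed=0):
    """Return {class_dir: [file names]} with the same number of files drawn from each folder."""
    listing = {d: sorted(os.listdir(d)) for d in class_dirs}
    n = min(len(names) for names in listing.values())  # size of the smallest class
    rng = random.Random(seed)                          # fixed seed so the draw is repeatable
    return {d: sorted(rng.sample(names, n)) for d, names in listing.items()}

# Demo with three placeholder class folders holding 5, 2 and 3 empty files
with tempfile.TemporaryDirectory() as root:
    dirs = []
    for level, count in enumerate((5, 2, 3)):
        d = os.path.join(root, 'resized_{}'.format(level))
        os.mkdir(d)
        for i in range(count):
            open(os.path.join(d, '{}_left.jpeg'.format(i)), 'w').close()
        dirs.append(d)
    picked = sample_balanced(dirs)
    counts = [len(names) for names in picked.values()]

print(counts)
```
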
# Suggested Next Steps before Preliminary Report
- 0) **Must Complete Before Preliminary Report:** Carry out more detailed analysis of results, such as where there is confusion, and the level of the errors
- 1) Try simpler models such as logistic classification or an ensemble method. Can try more diverse data resizing (such as only making the images the same size), and pre-processing techniques before further training in neural networks. Would not suggest creating features of textures since that is automatically done in deep learning, so we would not be able to transfer that over.
- 2) Split apart validation and test set to allow easier comparison with simpler models. Can combine part of the validation set with the training set if cross validation or no distinct validation set is being used. Would suggest 10% on the test set, since currently it is 20% validation set.
- 3) Get automated way to place resized images in desired train or validate folders to speed up testing process of any models.
- 4) Try data augmentation on the non class 0 images. Must keep validation images unchanged though.
- 5) Try other CNN architectures that suit the larger tensors and the multi-class classification of eye images, instead of binary cats versus dogs.
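
For step 2, one way to carve out separate validation and test sets is sketched below. Keeping both eyes of a person in the same split is an assumption motivated by the correlation between eyes noted earlier, not something the current pipeline does; `split_by_patient`, the 10% fractions, and the seed are illustrative choices.

```python
import random

def split_by_patient(image_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Split ids like '10_left' into train/val/test so both eyes of a patient stay together."""
    patients = sorted({image_id.split('_')[0] for image_id in image_ids})
    random.Random(seed).shuffle(patients)
    n_test = int(len(patients) * test_frac)
    n_val = int(len(patients) * val_frac)
    test_p = set(patients[:n_test])
    val_p = set(patients[n_test:n_test + n_val])
    split = {'train': [], 'val': [], 'test': []}
    for image_id in image_ids:
        patient = image_id.split('_')[0]
        if patient in test_p:
            split['test'].append(image_id)
        elif patient in val_p:
            split['val'].append(image_id)
        else:
            split['train'].append(image_id)
    return split

# Demo with 20 made-up patient ids, two eyes each
ids = ['{}_{}'.format(p, side) for p in range(10, 30) for side in ('left', 'right')]
split = split_by_patient(ids)
print({name: len(items) for name, items in split.items()})
```

The ids in the demo are synthetic; in the notebook the `image` column of the labels CSV would be passed in instead.
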

In [50]:
import numpy as np
import cv2
import os

# Analysis of a Single Image

In [52]:
# Load a color image (flag 1 = cv2.IMREAD_COLOR)
img = cv2.imread('../SML_Project_Data/train/10_left.jpeg',1)
print(img.shape)
img

(3168, 4752, 3)


array([[[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       ...,

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]],

       [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        ...,
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]], dtype=uint8)

### Create JPEG of Image Object

In [53]:
cv2.imwrite('10_left_new.jpeg',img) # Confirmed manually the exported image is the same number of bytes as the original image

True

### Inspect Image Object

In [54]:
img.shape

(3168, 4752, 3)

In [55]:
img[1500][2300][1]

146

In [56]:
np.max(img)

255

In [57]:
np.min(img)

0

In [58]:
print('Original Size', img.shape)

print("%d bytes in original image" % (img.size * img.itemsize))

print('MB in the original array image is', img.size * img.itemsize / 1000000, 'versus the original image JPEG has 1.5 MB') # 1 million bytes in a MB

resized = cv2.resize(img, (100, 100), interpolation = cv2.INTER_AREA)
print('Resized Size', resized.shape)
cv2.imwrite('10_left_reshaped_new.jpeg',resized) # Goes from 1.5 MB to 3 KB, which reduces the file size by a factor of about 500

print("%d bytes in compressed image" % (resized.size * resized.itemsize))

print('Original pixel size is', np.prod(img.shape))
print('New pixel size is', np.prod(resized.shape))
print('Factor of reduction is by', np.prod(img.shape) / np.prod(resized.shape))


Original Size (3168, 4752, 3)
45163008 bytes in original image
MB in the original array image is 45.163008 versus the original image JPEG has 1.5 MB
Resized Size (100, 100, 3)
30000 bytes in compressed image
Original pixel size is 45163008
New pixel size is 30000
Factor of reduction is by 1505.4336


# Downscale Images to 500 x 500 x 3 and Export

In [59]:
from pathlib import Path
import pandas as pd

In [60]:
train_target = pd.read_csv('../SML_Project_Data/trainLabels.csv', delimiter=',')
print(len(train_target))
train_target.head()

35126


|   | image | level |
|---|---|---|
| 0 | 10_left | 0 |
| 1 | 10_right | 0 |
| 2 | 13_left | 0 |
| 3 | 13_right | 0 |
| 4 | 15_left | 1 |


### Count Images

In [35]:
total_images = 0
for index, row in train_target.iterrows():
    split_image = row['image'].split('_')
    image_num = split_image[0]
    side = split_image[1]
    
    image_name = str(image_num) + '_' + side + '.jpeg'

    train_image_path = '../SML_Project_Data/train/' + image_name

    my_file = Path(train_image_path)

    try:
        my_abs_path = my_file.resolve(strict=True)
    except FileNotFoundError:
        pass
    else:
        total_images = total_images + 1
total_images

35126

## Downsizes all images and moves to subfolders in the main Data folder

In [36]:
# Create folders for each
try:
    os.mkdir('../SML_Project_Data/data')
except FileExistsError:
    print('Data Folder Already Exists \n')
else:
    print('Creating Data Folder \n')

Data Folder Already Exists 



In [37]:
for disease in (0, 1, 2, 3, 4):
    path = '../SML_Project_Data/data/resized_' + str(disease)
    print('Looking to create folder of', path)
    try:
        os.mkdir(path)
    except FileExistsError:
        print('Folder Already Exists \n')
    else:
        print('Creating Folder \n')

Looking to create folder of ../SML_Project_Data/data/resized_0
Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_1
Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_2
Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_3
Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_4
Folder Already Exists 



In [46]:
# Distribute Images to each Folder
counter = 0

status = int(total_images / 20)
print('Will give a status every', status, 'images')

new_shape = (500, 500)

for index, row in train_target.iterrows():
    #print(row['image'], row['level'])
    
    split_image = row['image'].split('_')
    image_num = split_image[0]
    side = split_image[1]

    image_name = str(image_num) + '_' + side + '.jpeg'

    train_image_path = '../SML_Project_Data/train/' + image_name

    my_file = Path(train_image_path)

    try:
        my_abs_path = my_file.resolve(strict=True)
    except FileNotFoundError:
        print('File not found')
        pass
    else:
        counter = counter + 1
        
        # Pull Image
        img2 = cv2.imread(train_image_path, 1)
        # Resize
        resized = cv2.resize(img2, new_shape, interpolation = cv2.INTER_AREA)
        # Create New Image File
        path = '../SML_Project_Data/data/resized_' + str(row['level']) + '/' + image_name
        #print('Path:', path)
        cv2.imwrite(path, resized)
        # Report progress only after an image has actually been processed
        if counter % status == 0:
            print('At', counter, 'images out of', total_images)
        
print('Processed Image Count', counter)

Will give a status every 1756 images
At 1756 images out of 35126
At 3512 images out of 35126
At 5268 images out of 35126
At 7024 images out of 35126
At 8780 images out of 35126
At 10536 images out of 35126
At 12292 images out of 35126
At 14048 images out of 35126
At 15804 images out of 35126
At 17560 images out of 35126
At 19316 images out of 35126
At 21072 images out of 35126
At 22828 images out of 35126
At 24584 images out of 35126
At 26340 images out of 35126
At 28096 images out of 35126
At 29852 images out of 35126
At 31608 images out of 35126
At 33364 images out of 35126
At 35120 images out of 35126
Processed Image Count 35126


In [66]:
import os
import shutil

min_file_count = np.inf
for disease_type in (0, 1, 2, 3, 4):
    path, dirs, files = next(os.walk('../SML_Project_Data/data/resized_' + str(disease_type)))
    file_count = len(files)
    if file_count < min_file_count:
        min_file_count = file_count
    print('Disease Class of', disease_type, 'has count of images of', file_count)
print('Minimum count by class is', min_file_count)
    
train_file_count = int(min_file_count * .80)
print('Training set will have', train_file_count)
val_file_count = min_file_count - train_file_count
print('Validation/Test set will have', val_file_count)

Disease Class of 0 has count of images of 25810
Disease Class of 1 has count of images of 2443
Disease Class of 2 has count of images of 5292
Disease Class of 3 has count of images of 873
Disease Class of 4 has count of images of 708
Minimum count by class is 708
Training set will have 566
Validation/Test set will have 142
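
Balancing every class down to the smallest one discards most of the data. The short calculation below uses the class counts printed above to show how many images would remain; it is plain arithmetic, not part of the pipeline.

```python
class_counts = {0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708}  # counts printed above

min_count = min(class_counts.values())
kept = min_count * len(class_counts)   # images kept if every class is cut to the smallest
total = sum(class_counts.values())
percent_kept = round(100 * kept / total, 1)

print(kept, total, percent_kept)
```

Only about a tenth of the labeled images survive this kind of balancing, which is part of the motivation for trying data augmentation on the non-0 classes.
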


In [72]:
try:
    os.mkdir('../SML_Project_Data/data/train')
    os.mkdir('../SML_Project_Data/data/val')
except FileExistsError:
    print('Train and/or Val Folder Already Exists \n')
else:
    pass

# Create Training Set
for disease in (0, 1, 2, 3, 4):
    path = '../SML_Project_Data/data/resized_' + str(disease)
    
    train_path = '../SML_Project_Data/data/train/resized_' + str(disease)
    
    val_path = '../SML_Project_Data/data/val/resized_' + str(disease)
    
    print('Looking to create folder of', path)
    try:
        os.mkdir(train_path)
        os.mkdir(val_path)
    except FileExistsError:
        print('Train and/or Val Folder Already Exists \n')
    else:
        pass

Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_0
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_1
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_2
Looking to create folder of ../SML_Project_Data/data/resized_3
Looking to create folder of ../SML_Project_Data/data/resized_4
Looking to create folder of ../SML_Project_Data/data/resized_0
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_1
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_2
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_3
Train and/or Val Folder Already Exists 

Looking to create folder of ../SML_Project_Data/data/resized_4
Train and/or Val Folder Already Exists 



In [None]:

# Create balanced training and validation sets
for disease in (0, 1, 2, 3, 4):
    path = '../SML_Project_Data/data/resized_' + str(disease)
    train_path = '../SML_Project_Data/data/train/resized_' + str(disease)
    val_path = '../SML_Project_Data/data/val/resized_' + str(disease)

    # exist_ok=True lets this cell be re-run; files are copied whether or not the folders existed
    os.makedirs(train_path, exist_ok=True)
    os.makedirs(val_path, exist_ok=True)

    # Images are named like '10_left.jpeg', so list the folder contents
    # instead of building file names from a numeric range
    fnames = sorted(f for f in os.listdir(path) if f.endswith('.jpeg'))

    # Copy the first train_file_count images to the training set dir
    for fname in fnames[:train_file_count]:
        src = os.path.join(path, fname)
        dst = os.path.join(train_path, fname)
        shutil.copyfile(src, dst)
    print('Copied files to', train_path)

    # Copy the next val_file_count images to the validation set dir
    for fname in fnames[train_file_count:min_file_count]:
        src = os.path.join(path, fname)
        dst = os.path.join(val_path, fname)
        shutil.copyfile(src, dst)
    print('Copied files to', val_path)



In [None]:

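# Reference template from the cats-versus-dogs example. original_dataset_dir and the
# *_dir variables below are not defined in this notebook, so this cell will not run as-is.
# Copy next 500 cat images to test_cats_dir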
fnames = ['cat.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(test_cats_dir, fname)
    shutil.copyfile(src, dst)
    
# Copy first 1000 dog images to train_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(train_dogs_dir, fname)
    shutil.copyfile(src, dst)
    
# Copy next 500 dog images to validation_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1000, 1500)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(validation_dogs_dir, fname)
    shutil.copyfile(src, dst)
    
# Copy next 500 dog images to test_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(test_dogs_dir, fname)
    shutil.copyfile(src, dst)

In [None]:
# 1
1544_right.jpeg
Upload failed
1669_right.jpeg
Upload failed
2802_right.jpeg
Upload failed
4533_left.jpeg
Upload failed
4614_right.jpeg
Upload failed
4711_right.jpeg
Upload failed
4852_left.jpeg
Upload failed
4885_right.jpeg
Upload failed
6947_left.jpeg
Upload failed
9404_right.jpeg
Upload failed