# Diabetic Retinopathy Part 0: Data Acquisition
#### Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)
Last updated: April 21, 2019

This part of the work was partially inspired by the following tutorial: [https://machinelearningmastery.com/how-to-load-convert-and-save-images-with-the-keras-api/](https://machinelearningmastery.com/how-to-load-convert-and-save-images-with-the-keras-api/)

In [1]:
import tensorflow as tf
import keras
import numpy as np
import os
import pandas as pd

Using TensorFlow backend.


Next, we import the Keras preprocessing functions and PIL, which let us quickly convert each image to a NumPy array.

In [2]:
from keras.preprocessing.image import load_img, save_img
from keras.preprocessing.image import img_to_array, array_to_img
from PIL import Image

The original images are very high resolution. However, we have limited resources on a laptop, so we shrink them to a width of 500 pixels while keeping the aspect ratio.

In [3]:
DESIRED_PIXEL_WIDTH = 500

In [None]:
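def resizeImg(filename):
    # Scale the image to DESIRED_PIXEL_WIDTH, preserving its aspect ratio.
    new_width = DESIRED_PIXEL_WIDTH
    img = Image.open(filename)
    ratio = new_width / img.size[0]  # img.size is (width, height)
    new_height = int(img.size[1] * ratio)
    # Image.LANCZOS is the same filter as Image.ANTIALIAS, which Pillow 10 removed.
    img = img.resize((new_width, new_height), Image.LANCZOS)
    return img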
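
As a rough check on the arithmetic in `resizeImg`, the size computation can be pulled out into a pure function. This is a minimal sketch, not part of the original pipeline: `scaled_size` is a hypothetical helper, and the 4288 x 2848 input is only an assumed example resolution.

```python
def scaled_size(width, height, new_width=500):
    """Return (new_width, new_height), keeping the width:height ratio.

    Hypothetical helper mirroring the arithmetic in resizeImg.
    """
    ratio = new_width / width
    # int() truncates toward zero, exactly as resizeImg does
    return new_width, int(height * ratio)


# Assumed example resolution; substitute the real size of your images.
example = scaled_size(4288, 2848)
print(example)
```

Under that assumption, a 4288 x 2848 scan comes out at 500 x 332 pixels, so each image shrinks by a factor of roughly 8.6 along each axis.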

Next, we walk over the raw retinal scans we've been given and load each one as a resized NumPy array, so that we can store them somewhere more accessible.

In [None]:
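def convertTrainImgToNumpy(file_id):
    # Build the path to training image number file_id, e.g. IDRiD_007.jpg.
    file = os.getcwd() + '/original-images-disease-grades/train/'
    # Adding 1000 and dropping the leading digit zero-pads the ID to three digits.
    file += 'IDRiD_' + str(file_id + 1000)[1:] + '.jpg'
    return img_to_array(resizeImg(file))


def convertTestImgToNumpy(file_id):
    # Same as above, but for the test split.
    file = os.getcwd() + '/original-images-disease-grades/test/'
    file += 'IDRiD_' + str(file_id + 1000)[1:] + '.jpg'
    return img_to_array(resizeImg(file))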
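
The `str(file_id + 1000)[1:]` expression zero-pads the ID to three digits. Below is a minimal sketch of the same path construction using a format specifier and `os.path.join`. `image_path` and its `root` parameter are hypothetical names, and the sketch assumes IDs between 1 and 999.

```python
import os


def image_path(file_id, split, root="original-images-disease-grades"):
    """Build the path to one image file (hypothetical helper).

    split is 'train' or 'test'; file_id is assumed to be in 1..999.
    """
    name = f"IDRiD_{file_id:03d}.jpg"  # 7 -> 'IDRiD_007.jpg'
    return os.path.join(os.getcwd(), root, split, name)


# The format specifier agrees with the add-1000-and-slice trick for 1..999.
for file_id in (1, 42, 413):
    assert f"{file_id:03d}" == str(file_id + 1000)[1:]

print(os.path.basename(image_path(7, "train")))
```

A single parameterized helper like this would also remove the duplication between the train and test functions, at the cost of passing an extra argument through `imap`.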

To speed things up, we use Python's multiprocessing module to convert several images in parallel. We also use tqdm to track the progress of the conversion.

In [None]:
from multiprocessing import Pool
from tqdm.notebook import tqdm  # replaces the deprecated tqdm.tqdm_notebook

In [None]:
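# Training images are numbered 1 through 413.
entry_range = list(range(1, 414))
# Convert the images across 20 worker processes, showing a progress bar.
with Pool(20) as p:
    train_list = list(tqdm(p.imap(convertTrainImgToNumpy, entry_range), total=len(entry_range)))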


In [None]:
# Test images are numbered 1 through 103.
entry_range = list(range(1, 104))
with Pool(20) as p:
    test_list = list(tqdm(p.imap(convertTestImgToNumpy, entry_range), total=len(entry_range)))
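
Because `imap` yields results in the order of its inputs, entry `i` of each list corresponds to image ID `i + 1`, even when workers finish out of order. A minimal sketch of that ordering guarantee follows. It swaps in `multiprocessing.pool.ThreadPool`, which exposes the same `imap` interface as `Pool`, purely so the example runs anywhere without pickling concerns. `slow_square` is a hypothetical stand-in for the image conversion.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_square(x):
    # Hypothetical stand-in for the image conversion: later inputs finish sooner.
    time.sleep(0.01 * (5 - x))
    return x * x


entry_range = list(range(1, 6))
with ThreadPool(4) as pool:
    # imap returns results in input order, even when tasks finish out of order
    results = list(pool.imap(slow_square, entry_range))

print(results)
```

This ordering matters later: the image arrays only line up with the label rows if both follow the same ID order, which this sketch assumes of the label files. `imap_unordered` would be faster in some cases but would break that correspondence.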

Now we store the images for later pre-processing in the Part 1 notebook. We save them in the `part1` folder as `xtrain.npy` and `xtest.npy`.

In [None]:
xtrain_file = os.getcwd() + '/part1/xtrain.npy'
train_np = np.array(train_list, 'float32')
np.save(xtrain_file, train_np)

In [None]:
xtest_file = os.getcwd() + '/part1/xtest.npy'
test_np = np.array(test_list, 'float32')
np.save(xtest_file, test_np)
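
Note that `np.array(train_list, 'float32')` only produces a single 4-D array when every image has the same shape. Since the resize keeps the aspect ratio, that holds only if all source images share one aspect ratio. Below is a minimal sketch of checking this before saving and then reloading the file, using tiny constant arrays in place of real images. `stack_images` is a hypothetical helper, and the temporary directory stands in for the `part1` folder.

```python
import os
import tempfile

import numpy as np


def stack_images(images):
    """Stack same-shaped image arrays into one float32 array (hypothetical helper)."""
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"images have differing shapes: {sorted(shapes)}")
    return np.stack(images).astype("float32")


# Three fake 4x5 RGB "images" instead of real retinal scans.
fake_images = [np.full((4, 5, 3), i, dtype="float64") for i in range(3)]
batch = stack_images(fake_images)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xtrain.npy")
    np.save(path, batch)
    reloaded = np.load(path)

print(reloaded.shape, reloaded.dtype)
```

The leading axis indexes images, so `reloaded[0]` is the first image in ID order.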

Finally, we save the labels (our y-values) from the ground-truth CSV files as well.

In [None]:
y_train_data = np.genfromtxt(os.getcwd() + '/original-images-disease-grades/groundtruths/training_labels.csv', delimiter=',', skip_header=1, encoding='utf-8')
y_train = np.array(y_train_data[:, [1,2]], dtype='int32')
ytrain_file = os.getcwd() + '/part1/ytrain.npy'
np.save(ytrain_file, y_train)

In [None]:
y_test_data = np.genfromtxt(os.getcwd() + '/original-images-disease-grades/groundtruths/test_labels.csv', delimiter=',', skip_header=1, encoding='utf-8')
y_test = np.array(y_test_data[:, [1,2]], dtype='int32')
ytest_file = os.getcwd() + '/part1/ytest.npy'
np.save(ytest_file, y_test)
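
The indexing `[:, [1,2]]` keeps the second and third CSV columns and discards the first. Below is a minimal sketch of that step on an in-memory CSV. The header names and rows are invented for illustration and are not taken from the real label files.

```python
from io import StringIO

import numpy as np

# Invented sample data: one text column followed by two integer label columns.
csv_text = (
    "image,label_a,label_b\n"
    "IDRiD_001,3,2\n"
    "IDRiD_002,0,0\n"
)

raw = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1, encoding="utf-8")
# The non-numeric first column is parsed as NaN, so only columns 1 and 2 are kept.
labels = np.array(raw[:, [1, 2]], dtype="int32")

print(labels.tolist())
```

Because `genfromtxt` parses everything as floats by default, the explicit cast to `int32` is what turns the labels back into integer class values.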

This notebook is continued in Part 1: Pre-Processing.