# Converting Raw Data (JPEG/PNG) into a Dataset

1. Images are saved in x.pickle; labels (the index in CATEGORIES of the directory each image is in) are saved in y.pickle
2. Both are saved as NumPy arrays
3. x has a shape of (n, IMG_SIZE, IMG_SIZE, 1) when greyscale is enabled, where n is the total number of images
4. y has a shape of (n,)
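To make the shapes above concrete, here is a minimal sketch of writing and reading back two pickle files. The helper names save_pickle and load_pickle, the temporary directory, and the tiny all-zero arrays are assumptions made for the demo; they are not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np

IMG_SIZE = 299  # same value as in the notebook
n = 4           # hypothetical tiny dataset, for illustration only

# Stand-in arrays with the shapes described in the list above.
x_demo = np.zeros((n, IMG_SIZE, IMG_SIZE, 1), dtype=np.float64)
y_demo = np.array([0, 1, 2, 1])


def save_pickle(obj, path):
    """Hypothetical helper: pickle obj to path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Hypothetical helper: load a pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write into a temporary directory so the demo does not touch real data.
tmpdir = tempfile.mkdtemp()
x_path = os.path.join(tmpdir, "x.pickle")
y_path = os.path.join(tmpdir, "y.pickle")
save_pickle(x_demo, x_path)
save_pickle(y_demo, y_path)

x_loaded = load_pickle(x_path)
y_loaded = load_pickle(y_path)

# Image i in x_loaded belongs to label i in y_loaded.
print(x_loaded.shape, y_loaded.shape)
```

Checking that x and y have the same first dimension after loading is a cheap guard against pairing images with the wrong labels.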

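Data Directory File Structure: DATADIR contains one subdirectory per entry in CATEGORIES (curved, negative, straight), and each subdirectory holds the images for that category.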

## Static Variables and Initialising Directories

In [1]:
import os
import pickle
import random

import cv2                              # Image library
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm                   # Progress bar library

# directories
DATADIR = "dataset/stairs2"
CATEGORIES = ["curved", "negative", "straight"]

# static variables
IMG_SIZE = 299                          # Images are resized to IMG_SIZE x IMG_SIZE
randseed = 21                           # Pseudo-random number generator seed
greyscale = True                        # Whether to convert colour images to greyscale

## Parsing Training Images

In [2]:
training_data = []

def create_training_data():
    for class_num, category in enumerate(CATEGORIES):
        path = os.path.join(DATADIR, category)                  # Directory holding this category's images
        for img in tqdm(os.listdir(path)):                      # Iterate over each file in the category directory
            img_array = cv2.imread(os.path.join(path, img))     # Read the image as an array; None if unreadable
            if img_array is None:                               # Skip files OpenCV cannot decode
                continue
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))  # Resize so every image is IMG_SIZE x IMG_SIZE
            training_data.append((new_array, class_num))        # Store the image with its class index

create_training_data()

100%|██████████| 1178/1178 [00:01<00:00, 1006.68it/s]
100%|██████████| 1063/1063 [00:02<00:00, 417.68it/s]
100%|██████████| 912/912 [00:00<00:00, 980.63it/s] 
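
The progress bars above total 3153 files (1178 + 1063 + 912), while the final cell reports 3147 training examples, which suggests that six files were skipped silently. cv2.imread returns None instead of raising when it cannot read a file, so one way to see which files are dropped is to record them. The sketch below is an assumption about how that could look, not the notebook's method: collect_images and fake_read are hypothetical names, and fake_read stands in for cv2.imread so the example runs without any image files.

```python
import numpy as np


def collect_images(paths, read_image):
    """Return (loaded, skipped): arrays that loaded, and paths that did not."""
    loaded, skipped = [], []
    for path in paths:
        img = read_image(path)
        if img is None:          # same failure signal that cv2.imread uses
            skipped.append(path)
        else:
            loaded.append(img)
    return loaded, skipped


def fake_read(path):
    """Stand-in reader for the demo: only .jpg paths 'decode'."""
    if path.endswith(".jpg"):
        return np.zeros((2, 2, 3), dtype=np.uint8)
    return None


loaded, skipped = collect_images(["a.jpg", "notes.txt", "b.jpg"], fake_read)
print(len(loaded), "loaded; skipped:", skipped)
```

Passing cv2.imread as read_image would apply the same bookkeeping to real files.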


## Writing Training Images to Pickle Binary Files

In [3]:
#Random Shuffle Training Data
random.seed(randseed)
random.shuffle(training_data)

#Write Training Data to Lists
x = []
y = []

for img,label in training_data:
    x.append(img)
    y.append(label)

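#Convert to a NumPy Array and, if Requested, Average the Colour Channels into Greyscale
x = np.array(x)
if greyscale:
    x = np.average(x, axis=3)                       # Unweighted mean of the three colour channels
    x = x.reshape(len(x), IMG_SIZE, IMG_SIZE, 1)    # Add a trailing channel axis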

#Output to Pickle Files
y = np.array(y)
with open("x.pickle", "wb") as pickle_out:
    pickle.dump(x, pickle_out)

with open("y.pickle", "wb") as pickle_out:
    pickle.dump(y, pickle_out)

print("Dataset Conversion Complete")
print("Number of Training Examples (M)", len(training_data))

Dataset Conversion Complete
Number of Training Examples (M) 3147
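
The greyscale step above takes the unweighted mean of the colour channels for each image and then adds a trailing channel axis. As a rough check of what that produces, the sketch below runs the per-image approach on a small random batch and compares it with a single vectorised call. The batch size, the 4 x 4 image size, and the random pixel values are assumptions chosen to keep the demo small.

```python
import numpy as np

IMG_SIZE = 4                       # tiny size for the demo; the notebook uses 299
rng = np.random.default_rng(21)    # seeded so the demo is repeatable

# Hypothetical batch of 3 colour images, shape (3, IMG_SIZE, IMG_SIZE, 3).
batch = rng.integers(0, 256, size=(3, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)

# Per-image approach: average over the channel axis, then add a channel axis.
grey_loop = np.array([np.average(img, 2) for img in batch])
grey_loop = grey_loop.reshape(len(grey_loop), IMG_SIZE, IMG_SIZE, 1)

# Vectorised alternative: keepdims=True keeps the channel axis with length 1.
grey_vec = batch.mean(axis=-1, keepdims=True)

print(grey_loop.shape, grey_vec.shape)
print(np.allclose(grey_loop, grey_vec))
```

Both results are float64 arrays, so a full-size dataset takes eight bytes per pixel in memory and on disk. Note also that an unweighted mean treats all three channels equally, unlike luminance-weighted greyscale conversions.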
