<a href="https://colab.research.google.com/github/RudyMartin/dsai-2024/blob/main/data_prep_v3_from_student_pics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation.
 - updated Aug 3, 2024

## Introduction

The first part of building any AI model is finding appropriate data. For your project, the model will be classifying images, so your data should be images as well. In this module, you will learn how to obtain your data using different methods and process it, so it is ready to be ingested by the model.

## Image Representation

Most digital images available today are stored in __raster__ format. A raster image contains a two-dimensional table of _pixels_: small square (or rectangular) elements of the image.

![](https://drive.google.com/uc?export=view&id=1AeC4a6YCNDBFm3ArdyiKgM7rCHn92xg1)

There are various ways to encode information in a pixel; we will use the most popular one, RGB, which represents each pixel as a three-element tuple of Red, Green, and Blue intensities.


![](https://drive.google.com/uc?export=view&id=1yjsqrUxGGecJXvrwFHSX6ZJ1jrjUK11R)

In other words, every pixel is a sequence of three numbers (usually from 0 to 255), and an image is a table of pixels.

![](https://drive.google.com/uc?export=view&id=1Hi3HFJU-yXJX1vtd7lyhIBtSY5WEeXCW)

Hence, the input representation of any color image can be thought of as a three-dimensional table with dimensions ($width$, $height$, $3$), where the ($width$, $height$) tuple is known as the _image resolution_. The ratio of the two ($width/height$) is known as the _aspect ratio_.
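To make the "table of pixels" idea concrete, here is a minimal sketch in plain Python. The tiny 3 x 2 image and the variable names are made up for illustration; real images are loaded with a library rather than typed by hand.

```python
# A tiny image with 2 rows and 3 columns; every pixel is an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],        # red, green, blue
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],  # black, gray, white
]

height = len(image)          # number of rows
width = len(image[0])        # number of pixels in each row
channels = len(image[0][0])  # three intensities per pixel

aspect_ratio = width / height

print(f"resolution: {width} x {height}, channels: {channels}")
print(f"aspect ratio: {aspect_ratio}")
```

Note that array libraries such as NumPy and OpenCV usually index images row first, so the same picture has shape ($height$, $width$, $3$) there. The `height, width = image.shape[:2]` line later in this notebook relies on that ordering.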

## Part 1: Obtaining Data. (Only run this if you do NOT have enough images)

There are various ways to obtain data for the project; we will focus on two methods:

1. Taking photos with a camera (such as the one in your phone).
2. Capturing snapshots from a video feed (for example, from a web camera).

These methods have no meaningful advantages over each other for this project. You can use whichever you prefer, or a combination of both. Below is the skeleton code that you can use to prepare the data for pre-processing.

Mount your Google Drive using your Gmail account. Proceed to the folder where you will store the data for your project.

In [1]:
### Name Project folder
import os
from google.colab import drive

drive.mount("/content/gdrive", force_remount=True)
root_dir = "/content/gdrive/My Drive/dsai-2024/MVPS"
proj_dir = os.path.join(root_dir, 'Camp-Rock-Paper-Scissors')
os.makedirs(proj_dir, exist_ok=True)  # create the project folder on first run
os.chdir(proj_dir)

rps_dir = os.path.join(proj_dir, 'rps') # this points to data folder
train_dir = os.path.join(rps_dir, 'train')
test_dir = os.path.join(rps_dir, 'test')
model_dir = os.path.join(root_dir, 'model')

# Define and create directories for rps
# (the loop variable is named `name` to avoid shadowing the built-in `dir`)
for name in ['train', 'test', 'model']:
    os.makedirs(os.path.join(rps_dir, name), exist_ok=True)

# Define and create subdirectories for train and test
for sub_dir in ['rock', 'paper', 'scissors']:
    os.makedirs(os.path.join(train_dir, sub_dir), exist_ok=True)
    os.makedirs(os.path.join(test_dir, sub_dir), exist_ok=True)

Mounted at /content/gdrive



**Switching to the Project Folder and Creating Test Images**

1. **Change to the Training or Testing Folder:**
   First, navigate to the appropriate folder where you will be working. You can switch to either the training folder or the testing folder depending on your task. Here's an example of how to do it in code:
   



In [None]:
## pick either training or testing directory to send images
%cd {train_dir}
# or
#%cd {test_dir}
%ls

/content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train
paper/  rock/  scissors/



2. **Create Unique Test Images:**
   Make sure your test images are unique, not duplicates of training images. For testing, aim for at least 20% of the number of images in your training set. With 10 training images per category (rock, paper, scissors), 20% is 2 images per category; this walkthrough asks for 4 unique test images per category, which comfortably meets that minimum.

   For example, if your training set uses:
   
   - maxRock = 10 # rock images
   - maxPaper = 10 # paper images
   - maxScissors = 10 # scissors images

   then switch to the test folder and create 4 unique test images for rock, 4 for paper, and 4 for scissors.
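One way to check the "unique, not duplicates" requirement is to compare file contents between a training folder and the matching test folder. The sketch below is an illustration only: `file_digest` and `find_duplicates` are hypothetical helper names, and the check catches byte-for-byte copies only, not two nearly identical photos.

```python
import hashlib
import os

def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def find_duplicates(train_folder, test_folder):
    """Return names of test files whose bytes exactly match a training file."""
    train_hashes = {
        file_digest(os.path.join(train_folder, name))
        for name in os.listdir(train_folder)
        if os.path.isfile(os.path.join(train_folder, name))
    }
    return sorted(
        name
        for name in os.listdir(test_folder)
        if os.path.isfile(os.path.join(test_folder, name))
        and file_digest(os.path.join(test_folder, name)) in train_hashes
    )
```

For example, `find_duplicates(os.path.join(train_dir, 'rock'), os.path.join(test_dir, 'rock'))` would list any rock test images that are exact copies of training images; an empty list means no exact copies were found.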


### Using Web Camera

Run the following cell. It defines a function that takes a snapshot from your webcam when you press a button and saves the file in your current folder.

In [None]:
from IPython.display import display, Javascript
from google.colab.output import eval_js
from base64 import b64decode

def take_photo(filename='photo.jpg', quality=0.8):
  js = Javascript('''
    async function takePhoto(quality) {
      const div = document.createElement('div');
      const capture = document.createElement('button');
      capture.textContent = 'Capture';
      div.appendChild(capture);

      const video = document.createElement('video');
      video.style.display = 'block';
      const stream = await navigator.mediaDevices.getUserMedia({video: true});

      document.body.appendChild(div);
      div.appendChild(video);
      video.srcObject = stream;
      await video.play();

      // Resize the output to fit the video element.
      google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);

      // Wait for Capture to be clicked.
      await new Promise((resolve) => capture.onclick = resolve);

      const canvas = document.createElement('canvas');
      canvas.width = video.videoWidth;
      canvas.height = video.videoHeight;
      canvas.getContext('2d').drawImage(video, 0, 0);
      stream.getVideoTracks()[0].stop();
      div.remove();
      return canvas.toDataURL('image/jpeg', quality);
    }
    ''')
  display(js)
  data = eval_js('takePhoto({})'.format(quality))
  binary = b64decode(data.split(',')[1])
  with open(filename, 'wb') as f:
    f.write(binary)
  return filename

Change the parameters in the cell below to choose the number of snapshots for each gesture. Run the cell and follow the prompts.

![](https://drive.google.com/uc?export=view&id=1YrurRp9FYI3yGmzqqIG9wuPrIpBiAO1D)

In [None]:
from IPython.display import Image

### Choose these parameters
# maxRock - number of images taken with "rock" gesture
# maxPaper - number of images taken with "paper" gesture
# maxScissors - number of images taken with "scissors" gesture
###

maxRock = 10 # rock images
maxPaper = 10 # paper images
maxScissors = 10 # scissors images

### End initialization
###
try:
  ct = int(1)
  maxPaper = maxRock + maxPaper
  maxScissors = maxPaper + maxScissors
  while True:
    if ct <= maxRock:
      print("Show Rock and press Capture!")
      label = "rock"
    elif ct <= maxPaper:
      print("Show Paper and press Capture!")
      label = "paper"
    else:
      print("Show Scissors and press Capture!")
      label = "scissors"

    filename = take_photo(label + "/" +label+'_'+str(ct))

    print('Saved to {}'.format(filename))

    ct += 1
    if ct > maxScissors:
      break
except Exception as err:
  # Errors will be thrown if the user does not have a webcam or if they do not
  # grant the page permission to access it.
  print(str(err))

Show Rock and press Capture!
Saved to rock/rock_1
Show Rock and press Capture!
Saved to rock/rock_2
Show Paper and press Capture!
Saved to paper/paper_3
Show Paper and press Capture!
Saved to paper/paper_4
Show Scissors and press Capture!
Saved to scissors/scissors_5
Show Scissors and press Capture!
Saved to scissors/scissors_6


### Using photo camera

0. If you have an iPhone, change the file format before you take pictures: in Settings -> Camera -> Formats, choose Most Compatible. You can change it back after the pictures are taken.
1. Take pictures of the different gestures using your camera.
2. Download and install the Google Drive app (if not already installed).
3. Copy the images into their respective folders.
4. Run the pre-processing cells in Part 2 to rename the files to the common convention.

## Creating Data Tips

1. Ensure the image is of high quality and the gesture is discernible to the human eye. If a human cannot classify the image, neither can a machine!
2. Keep the background clean and the object being classified large enough to improve the quality of learning.
  - If there are extraneous items in the background, the algorithm may incorrectly treat them as part of the classification.
  - If the object is too small and not prominent, the algorithm will tend to focus more on the background, reducing accuracy.
3. Ensure an appropriate aspect ratio. Changing the aspect ratio stretches or shrinks the object in the image, which makes learning more difficult.
4. Ensure the labels are correct, i.e. the images are stored in their respective folders.
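Tip 3 can be illustrated with a little arithmetic. The sketch below computes a new size that fits inside a target box without changing the aspect ratio. `fit_within` is a hypothetical helper, not part of this notebook, and the 640 x 480 box is simply the resolution used by the conversion cell in Part 2.

```python
def fit_within(width, height, max_width, max_height):
    """Scale (width, height) to fit inside the box, keeping the aspect ratio."""
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(4000, 3000, 640, 480))  # landscape photo
print(fit_within(3000, 4000, 640, 480))  # portrait photo
```

A 4000 x 3000 landscape photo fits the box exactly, while a 3000 x 4000 portrait photo would become 360 x 480. Forcing that portrait photo to 640 x 480 instead would stretch the hand horizontally, so it may be easier to take all pictures in landscape orientation.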

## Part 2: Data pre-processing (Only run after ALL images have been loaded)

In [None]:
### Name Project folder
import os
from google.colab import drive
import datetime

# Record the start time for performance evaluation
start_time = datetime.datetime.now()

drive.mount("/content/gdrive", force_remount=True)
root_dir = "/content/gdrive/My Drive/"
rps_dir = os.path.join(root_dir, 'dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps')
train_dir = os.path.join(rps_dir, 'train')
test_dir = os.path.join(rps_dir, 'test')
model_dir = os.path.join(root_dir, 'model')

# Define and create directories for rps
# (the loop variable is named `name` to avoid shadowing the built-in `dir`)
for name in ['train', 'test', 'model']:
    os.makedirs(os.path.join(rps_dir, name), exist_ok=True)

# Define and create subdirectories for train and test
for sub_dir in ['rock', 'paper', 'scissors']:
    os.makedirs(os.path.join(train_dir, sub_dir), exist_ok=True)
    os.makedirs(os.path.join(test_dir, sub_dir), exist_ok=True)

# Ensure the data directory exists
if not os.path.exists(rps_dir):
    raise FileNotFoundError(f"Directory {rps_dir} does not exist.")

print(f"'rps' directory contents: {os.listdir(rps_dir)}")

Mounted at /content/gdrive
'rps' directory contents: ['train', 'test', 'model']


To uphold the principle of fairness and simplify the machine learning process, all images need to be converted to the same file format (.jpg) and the same resolution. Run the cells below to convert the files to a common format.

Participants are also encouraged to manually edit the pictures, using the tips above, to improve the accuracy of their learning algorithm.

In [None]:
# Load training data
import glob

print(f"Train directory: {train_dir}")
print(f"Test directory: {test_dir}")

print(f"Number of train scissors images: {len(glob.glob(f'{train_dir}/scissors/*.jpg'))}")
print(f"Number of train rock images: {len(glob.glob(f'{train_dir}/rock/*.jpg'))}")
print(f"Number of train paper images: {len(glob.glob(f'{train_dir}/paper/*.jpg'))}")

print(f"Number of test scissors images: {len(glob.glob(f'{test_dir}/scissors/*.jpg'))}")
print(f"Number of test rock images: {len(glob.glob(f'{test_dir}/rock/*.jpg'))}")
print(f"Number of test paper images: {len(glob.glob(f'{test_dir}/paper/*.jpg'))}")

## Original webcam photos have no file extension, so they are not counted here (a count of zero is OK) - but check the folders visually

Train directory: /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train
Test directory: /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/test
Number of train scissors images: 4
Number of train rock images: 4
Number of train paper images: 4
Number of test scissors images: 0
Number of test rock images: 0
Number of test paper images: 0
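Because the `*.jpg` pattern skips files without an extension, the counts above can look lower than what is actually in the folders. As a rough cross-check, the sketch below counts every file in a folder grouped by extension. `count_by_extension` is a hypothetical helper introduced here for illustration.

```python
import os
from collections import Counter

def count_by_extension(folder):
    """Count the files directly inside `folder`, grouped by lowercase extension."""
    counts = Counter()
    for name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, name)):
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return dict(counts)
```

Calling `count_by_extension(os.path.join(train_dir, 'rock'))` returns a dictionary such as one entry for `.jpg` files and one for `(none)`, which makes freshly captured webcam snapshots visible before they are converted.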


The cell below resizes the images, saves them as .jpg, and renames any file it cannot process by adding a 'z_' prefix.

In [None]:
from os import listdir, rename
from os.path import isfile, join
from PIL import Image
from os.path import splitext

def process_images(proj_path):
    for label in ['rock', 'paper', 'scissors']:
        for i in [f for f in listdir(join(proj_path, label)) if isfile(join(proj_path, label, f))]:
            file_path = join(proj_path, label, i)
            file_name, file_ext = splitext(i)

            try:
                im = Image.open(file_path)
                im = im.resize((640, 480))

                # JPEG cannot store alpha or palette data, so convert any non-RGB image first
                if im.mode != 'RGB':
                    im = im.convert('RGB')
                new_file_path = join(proj_path, label, f"{file_name}.jpg")
                im.save(new_file_path, 'jpeg')

                if file_ext.lower() in ['.png', '.jpeg']:
                    print(f"Successfully Converted {i} to {file_name}.jpg")
                    # Optionally, remove the original file
                    # os.remove(file_path)
                else:
                    print(f"Successfully Resized {i} and saved as {file_name}.jpg")

            except Exception as e:
                print(f"Error processing file {i} at {file_path}: {e}")
                # Rename the problematic file with 'z_' prefix
                new_file_path = join(proj_path, label, f"z_{i}")
                rename(file_path, new_file_path)
                print(f"Renamed problematic file {i} to z_{i}")

# Example usage:

process_images(test_dir)

process_images(train_dir)

Successfully Resized rock_1 and saved as rock_1.jpg
Successfully Resized rock_2 and saved as rock_2.jpg
Successfully Resized paper_3 and saved as paper_3.jpg
Successfully Resized paper_4 and saved as paper_4.jpg
Successfully Resized scissors_5 and saved as scissors_5.jpg
Successfully Resized scissors_6 and saved as scissors_6.jpg
Successfully Resized rock_4.jpg and saved as rock_4.jpg
Successfully Resized rock_3.jpg and saved as rock_3.jpg
Successfully Resized rock_1.jpg and saved as rock_1.jpg
Successfully Resized rock_2.jpg and saved as rock_2.jpg
Successfully Resized paper_1.jpg and saved as paper_1.jpg
Successfully Resized paper_2.jpg and saved as paper_2.jpg
Successfully Resized paper_3.jpg and saved as paper_3.jpg
Successfully Resized paper_4.jpg and saved as paper_4.jpg
Successfully Resized scissors_1.jpg and saved as scissors_1.jpg
Successfully Resized scissors_3.jpg and saved as scissors_3.jpg
Successfully Resized scissors_2.jpg and saved as scissors_2.jpg
Successfully Resized

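Run the renumbering cell below ONLY AFTER reviewing the folders and deleting unwanted files, including any renamed with the 'z_' prefix. It renames and renumbers all remaining files as .jpg. Wait 1-2 minutes after running the previous section to be sure the 'z_' files are no longer in the folder.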
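Before renumbering, it can help to confirm that no flagged files remain instead of relying on the wait alone. The following is a minimal sketch under that assumption; `find_flagged_files` is a hypothetical helper, and the 'z_' prefix matches the one used by the conversion cell.

```python
import os

def find_flagged_files(base_path, prefix="z_"):
    """Return relative paths of files under base_path whose names start with prefix."""
    flagged = []
    for folder, _subdirs, names in os.walk(base_path):
        for name in names:
            if name.startswith(prefix):
                flagged.append(os.path.relpath(os.path.join(folder, name), base_path))
    return sorted(flagged)
```

If `find_flagged_files(rps_dir)` returns an empty list, no 'z_' files are left under the data folder and the renumbering step will not pick them up.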

In [None]:
import os

def renumber_images(base_path):
    categories = ['rock', 'paper', 'scissors']
    for category in categories:
        # Renumber images in the train folder
        train_path = os.path.join(base_path, 'train', category)
        renumber_folder(train_path, category)

        # Renumber images in the test folder
        test_path = os.path.join(base_path, 'test', category)
        renumber_folder(test_path, category)

def renumber_folder(folder_path, class_name):
    if not os.path.exists(folder_path):
        print(f"Folder not found: {folder_path}")
        return

    files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
    files.sort()  # Sort to maintain a consistent order

    # Rename files to a temporary name to avoid conflicts
    temp_files = []
    for idx, filename in enumerate(files, start=1):
        file_path = os.path.join(folder_path, filename)
        temp_filename = f"temp_{class_name}_{idx:05d}.jpg"
        temp_file_path = os.path.join(folder_path, temp_filename)
        os.rename(file_path, temp_file_path)
        temp_files.append(temp_filename)

    # Then, rename temporary files to the final name
    for idx, temp_filename in enumerate(temp_files, start=1):
        temp_file_path = os.path.join(folder_path, temp_filename)
        new_filename = f"{class_name}_{idx}.jpg"
        new_file_path = os.path.join(folder_path, new_filename)
        os.rename(temp_file_path, new_file_path)
        print(f"Renamed {temp_file_path} to {new_file_path}")

# Path to your dataset
renumber_images(rps_dir)

Renamed /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/temp_rock_00001.jpg to /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/rock_1.jpg
Renamed /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/temp_rock_00002.jpg to /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/rock_2.jpg
Renamed /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/temp_rock_00003.jpg to /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/rock_3.jpg
Renamed /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/temp_rock_00004.jpg to /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/train/rock/rock_4.jpg
Renamed /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/test/rock/temp_rock_00001.jpg to /content/gdrive/My Drive/dsai-2024/MVPS/Camp-Rock-Paper-Scissors/rps/test/rock/roc

**Test manipulating images** - Preview of what should happen later in experiments

In [None]:
import cv2  # needed by make_square_and_resize below
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Define a custom preprocessing function to ensure uniform image sizes and shapes
def make_square_and_resize(image):
    """Pad an image to make it square and resize it to (224, 224)."""
    target_size = (224, 224)
    height, width = image.shape[:2]
    delta_w = max(height - width, 0)
    delta_h = max(width - height, 0)
    top, bottom = delta_h // 2, delta_h - (delta_h // 2)
    left, right = delta_w // 2, delta_w - (delta_w // 2)
    color = [255, 255, 255]  # white background for padding
    new_img = cv2.copyMakeBorder(image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
    new_img = cv2.resize(new_img, target_size)
    return new_img

# Prepare image data generators with real-time augmentation and custom preprocessing
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    preprocessing_function=make_square_and_resize
)
test_datagen = ImageDataGenerator(rescale=1./255, preprocessing_function=make_square_and_resize)

# Load images from directories and prepare them for training and validation
train_generator = train_datagen.flow_from_directory(train_dir, target_size=(224, 224), batch_size=32, class_mode='categorical')
test_generator = test_datagen.flow_from_directory(test_dir, target_size=(224, 224), batch_size=32, class_mode='categorical')



Found 12 images belonging to 3 classes.
Found 12 images belonging to 3 classes.


In [None]:
# Get the number of training images (run after the generators above are created)
num_train_images = train_generator.samples
print(f"Number of training images: {num_train_images}")

# Get the number of testing images
num_test_images = test_generator.samples
print(f"Number of testing images: {num_test_images}")

Number of training images: 12
Number of testing images: 12
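The padding arithmetic inside `make_square_and_resize` can be checked on its own, without OpenCV or any image data. The sketch below repeats the same calculation in a hypothetical helper, `square_padding`, so the border sizes can be inspected for a few shapes.

```python
def square_padding(height, width):
    """Return (top, bottom, left, right) padding that makes the image square."""
    delta_w = max(height - width, 0)  # columns to add when the image is taller than wide
    delta_h = max(width - height, 0)  # rows to add when the image is wider than tall
    top = delta_h // 2
    bottom = delta_h - top
    left = delta_w // 2
    right = delta_w - left
    return top, bottom, left, right

print(square_padding(480, 640))  # landscape: pads top and bottom
print(square_padding(640, 480))  # portrait: pads left and right
print(square_padding(224, 224))  # already square: no padding
```

A 640 x 480 landscape image gets 80 rows of padding above and below. As far as the Keras documentation describes it, `preprocessing_function` runs after the image has been resized to `target_size`, so in this notebook the function would normally receive images that are already 224 x 224 and add no padding; treat that as something to verify in your own environment.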
