# 1.0 Feature extraction

In this iteration we classify images manually. To do so, we extract only a few features that make each image easy to categorize.

### 1. Adding the libraries we need

In [1]:
import os
import skimage as si
from skimage import io
import numpy as np

### 2. Load in all images

All images are loaded into a list of `(image, label)` tuples, so we always know which image we are dealing with. We also lowercase every label, because "Tyr" is capitalized in the file names while all other labels are lowercase.
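As a small illustration of the labelling rule, the sketch below applies the same `split("_")[0].lower()` expression to a few file names. The file names and the helper `label_from_filename` are hypothetical; the only assumption is the `<label>_<suffix>.png` pattern implied by the loading code.

```python
# Hypothetical file names; only the "<label>_<suffix>.png" pattern is assumed.
filenames = ["Tyr_0001.png", "elk-sedge_0042.png", "sun_7.png", "notes.txt"]


def label_from_filename(filename):
    """Return the lowercased text before the first underscore."""
    return filename.split("_")[0].lower()


# Skip anything that is not a PNG, as the loading loop does.
labels = [label_from_filename(f) for f in filenames if f.endswith(".png")]
print(labels)
```

Note that a hyphenated label such as `elk-sedge` survives intact, because only the underscore is used as a separator.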

In [2]:
directory = "../dataset-images/" # Path to the dataset
images = [] # List of images

for filename in os.listdir(directory):
    # Check if the file is an image
    if filename.endswith(".png"):
        label = filename.split("_")[0].lower() # Get the label
        # Label and load the image
        img = io.imread(os.path.join(directory, filename))
        images.append((img, label))

# Print the number of images
print("Number of images: " + str(len(images)))
print("All labels: " + str(set([label for _, label in images])))

Number of images: 1472
All labels: {'sun', 'wealth', 'bow', 'ash', 'elk-sedge', 'oak', 'need', 'spear', 'tyr', 'joy', 'gift', 'serpent'}


### 3. Feature extraction

We first check whether the image array has only two dimensions, i.e. it is a grayscale image without a channel axis. A runtime error revealed that at least one image in the dataset is stored this way.
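The shape check can be sketched with plain NumPy. The helper `to_rgb` below is hypothetical and uses `np.stack` as a stand-in for `gray2rgb`; it is only meant to show how arrays of different shapes end up with three channels.

```python
import numpy as np


def to_rgb(img):
    """Return an (H, W, 3) array from a grayscale, RGB, or RGBA image."""
    if img.ndim == 2:
        # Grayscale: repeat the single plane three times along a new last axis.
        img = np.stack([img, img, img], axis=-1)
    # Keep the first three channels; this drops the alpha channel if present.
    return img[:, :, :3]


gray = np.zeros((4, 5), dtype=np.uint8)     # no channel axis
rgba = np.zeros((4, 5, 4), dtype=np.uint8)  # RGB plus alpha

print(to_rgb(gray).shape, to_rgb(rgba).shape)
```

Both inputs come out with shape `(4, 5, 3)`, so the later steps can treat every image the same way.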

Then, we process the image with the following steps:

1. **Remove alpha channel:** we do not care about the alpha channel, it is just extra data, so we remove it to save on processing power
2. **Convert to grayscale:** we do not care about the color of the image, so we convert it to grayscale
3. **Binarize:** we threshold the grayscale image with Otsu's method, so that every pixel is either black or white
4. **Binary erosion:** we erode the binary image to remove small specks of noise and make the shapes clearer
5. **Label regions:** we label the connected regions so that they can be measured in the next step
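To make the effect of thresholding followed by erosion concrete, here is a minimal NumPy sketch on two toy images. It is not scikit-image's implementation: `erode_3x3` is a hypothetical helper, the fixed threshold of 0.5 stands in for Otsu's method, and treating pixels outside the image as `True` is a choice made for this sketch.

```python
import numpy as np


def erode_3x3(mask):
    """Erode a boolean mask with a 3x3 square.

    A pixel stays True only if its whole 3x3 neighbourhood is True.
    Pixels outside the image are treated as True in this sketch.
    """
    padded = np.pad(mask, 1, constant_values=True)
    out = np.ones_like(mask, dtype=bool)
    rows, cols = mask.shape
    for dr in range(3):
        for dc in range(3):
            out &= padded[dr:dr + rows, dc:dc + cols]
    return out


# Toy image 1: dark image with a single bright speck.
speck = np.zeros((5, 5))
speck[2, 2] = 1.0
# Toy image 2: bright image with a single dark pixel.
stroke = np.ones((5, 5))
stroke[2, 2] = 0.0

speck_eroded = erode_3x3(speck > 0.5)    # fixed threshold, stand-in for Otsu
stroke_eroded = erode_3x3(stroke > 0.5)

print(int(speck_eroded.sum()))           # bright pixels left after erosion
print(int((~stroke_eroded).sum()))       # dark pixels after erosion
```

Because bright pixels are the `True` values after thresholding, erosion removes the isolated bright speck, while the single dark pixel grows into a 3x3 dark block.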

In [None]:
processed = [] # List of processed images

for img, label in images:
    # Grayscale images have no channel axis; convert them to RGB first
    if img.ndim == 2:
        img = si.color.gray2rgb(img)
    # Drop the alpha channel (keep only R, G, B)
    img = img[:, :, :3]
    # Convert to grayscale, then binarize with Otsu's threshold
    img = si.color.rgb2gray(img)
    threshold_value = si.filters.threshold_otsu(img)
    img = img > threshold_value
    # Erode with a 3x3 square footprint
    img = si.morphology.binary_erosion(img, si.morphology.square(3))
    # Label connected regions so they can be measured later
    img = si.measure.label(img)

    processed.append((img, label))

# We do this only to print an example processed image
for img, label in processed:
    if label == 'spear':
        print(f'Label: {label}')
        io.imshow(img)
        io.show()
        break


### 4. Measure the labeled image properties

Finally, we extract two features from each labeled image with scikit-image's `regionprops`: the number of regions and the black pixel count. We extract only these two because we categorize manually and want few but precise features.
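The two features can be illustrated on a toy binary image. The sketch below does not call scikit-image: `count_regions` is a hypothetical flood-fill stand-in for labeling plus `regionprops`, and it assumes 8-connectivity (diagonal neighbours count as connected).

```python
import numpy as np


def count_regions(mask):
    """Count 8-connected groups of True pixels with an iterative flood fill."""
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Found an unvisited True pixel: start a new region.
            regions += 1
            stack = [(r, c)]
            seen[r, c] = True
            while stack:
                y, x = stack.pop()
                for ny in range(max(y - 1, 0), min(y + 2, rows)):
                    for nx in range(max(x - 1, 0), min(x + 2, cols)):
                        if mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return regions


# Toy binary image: True = white, False = black. A black ring encloses one
# white pixel, so there are two white regions (outside and inside the ring).
mask = np.ones((5, 5), dtype=bool)
mask[1:4, 1:4] = False
mask[2, 2] = True

black_pixels = mask.size - int(mask.sum())
regions = count_regions(mask)
print(black_pixels, regions)
```

Here the ring contributes 8 black pixels and splits the white area into 2 regions, which is the kind of pair stored for each image.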

We then save the features to a CSV file, which is the output of this notebook. It has the columns `label, black_pixels, regions` and is saved as `dataset-numpy/1.0 - features.csv`.

In [4]:
features = [] # List of features

for labeled_img, label in processed:
    data = si.measure.regionprops(labeled_img)
    # Count black pixels: every pixel that is not part of a labeled region
    total_pixels = labeled_img.size
    foreground_pixels = np.sum(labeled_img > 0)
    black_pixels = total_pixels - foreground_pixels
    # Number of labeled regions
    regions = len(data)
    features.append((label, black_pixels, regions))

# Write the features to a CSV file
output_directory = "../dataset-numpy/"
path = os.path.join(output_directory, '1.0 - features.csv')
with open(path, 'w', newline='') as f:
    # Header row
    print("label,black_pixels,regions", file=f)
    # One row per image
    for label, pixels, regions in features:
        print(f"{label},{pixels},{regions}", file=f)
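For the manual categorization step, the CSV can be read back and grouped by label. The sketch below is only an illustration: the rows in `sample_csv` are made up to match the column layout and are not values from the dataset, and it reads from an in-memory string rather than the real file.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the notebook's column layout; NOT real dataset values.
sample_csv = """label,black_pixels,regions
sun,120,2
sun,130,2
oak,200,3
"""

# Group (black_pixels, regions) pairs by label.
by_label = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    by_label[row["label"]].append((int(row["black_pixels"]), int(row["regions"])))

# Print the image count and mean black pixel count per label.
for label, rows in by_label.items():
    mean_black = sum(b for b, _ in rows) / len(rows)
    print(label, len(rows), mean_black)
```

To use the real output, the `io.StringIO(sample_csv)` argument would be replaced by a file opened at the path written above.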