# Section Two - Feature Engineering

## Library Installation
The commands below install the libraries required for section two.

In [None]:
%pip install scipy
%pip install numpy

## Function Definitions
After loading each CSV image into a NumPy array, I computed the following features:

1.	`nr_pix`: The total number of black pixels, calculated by taking the sum of all values in the matrix.

2.	`rows_with_1` and `cols_with_1`: The counts of rows and columns, respectively, that contain exactly one black pixel. I achieved this by summing along each row, or column, and counting how many of these equal one.

3.	`rows_with_3p` and `cols_with_3p`: The number of rows and columns that contain three or more black pixels. This was achieved in a similar fashion to calculating rows and columns with one pixel, but instead, by counting how many are greater than or equal to three.

4.	`aspect_ratio`: Using NumPy’s argwhere function, which finds the indices of array elements that are non-zero grouped by element, I was able to calculate the minimal bounding box around the symbol. The aspect ratio was then calculated as the width divided by the height of the bounding box, capturing the overall shape of the symbol.

5.	`neigh_1`: I implemented a helper function, “count_neighbours”, to count the 8-connected neighbours for a pixel. I then iterated through every pixel, tallying the number of black pixels that had exactly one black neighbour.

6.	**Directional Isolation Features**:
•	`no_neigh_above`, `no_neigh_below`, `no_neigh_left`, `no_neigh_right`: For each black pixel, I examined the immediate neighbours in the specified directions using a sliding window approach and counted those pixels that had no adjacent black pixels in that region.
•	`no_neigh_horiz` and `no_neigh_vert`: These features apply the isolation check to two opposite sides at once. `no_neigh_horiz` counts the black pixels with no black pixel directly to the left or right, and `no_neigh_vert` counts those with no black pixel directly above or below.

7.	`connected_areas`: To calculate this feature, I used an 8-connected neighbourhood, defined using SciPy’s “generate_binary_structure(2, 2)”, to determine connectivity between pixels. This means that any two foreground pixels with value 1 are considered connected if they are adjacent horizontally, vertically, or diagonally. The “nd_label” function (SciPy’s `label`, imported under that alias) then labels each unique connected component. The number of labels it returns is the number of connected areas in the image.

8.	`eyes`: According to the assignment specification, an “eye” is a region of whitespace that is completely surrounded by lines of the character. To compute this, I started by inverting the binary image so that white becomes 1 and black becomes 0. I then applied connected component analysis, as in `connected_areas`, but with a 4-connected structure. Finally, I counted the connected white regions that do not touch the image border, as these represent truly enclosed regions.

9.	`custom - Normalised Horizontal Symmetry`: Initially, I planned to use the Euler number as my custom feature. However, I found it to be too similar to the `eyes` feature, so I based my custom feature on horizontal symmetry instead. I started by calculating the symbol’s bounding box, as in the `aspect_ratio` feature, and then the midpoint column. I then flipped the right half of the bounding box horizontally and calculated the absolute differences between the left half and the flipped right half. When the bounding box has an odd width, the two halves are cropped to the same width, so the middle column is left out of the comparison. The summed difference is normalised by the number of pixels in one half and subtracted from 1 to yield a value between 0 and 1, with 1 indicating perfect symmetry. Additionally, to handle edge cases (an empty image, or a bounding box fewer than two columns wide), I return a default score of 1.0, under the assumption that insufficient width implies a lack of asymmetry.
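
To illustrate the connectivity choices behind `connected_areas` and `eyes`, the sketch below applies the same SciPy calls to two tiny glyphs. The arrays `ring` and `diamond` and the helper names `count_connected_areas` and `count_eyes` are hypothetical and exist only for this illustration; they are not part of the feature code.

```python
import numpy
from scipy.ndimage import label as nd_label, generate_binary_structure

# Hypothetical 5x5 glyphs (1 = black, 0 = white), made up for this example.
ring = numpy.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
diamond = numpy.array([
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])

def count_connected_areas(image):
    # 8-connected foreground: diagonal contact joins black pixels.
    _, num_regions = nd_label(image, structure=generate_binary_structure(2, 2))
    return num_regions

def count_eyes(image):
    # 4-connected background: white pixels only join edge to edge.
    labelled, num_white = nd_label(1 - image, structure=generate_binary_structure(2, 1))
    eyes = 0
    for region_label in range(1, num_white + 1):
        region = labelled == region_label
        touches_border = (region[0, :].any() or region[-1, :].any()
                          or region[:, 0].any() or region[:, -1].any())
        if not touches_border:
            eyes += 1
    return eyes

ring_result = (count_connected_areas(ring), count_eyes(ring))
diamond_result = (count_connected_areas(diamond), count_eyes(diamond))
print(ring_result, diamond_result)
```

Both glyphs give one connected area and one eye. In `diamond` the black pixels touch only diagonally, so the 8-connected foreground still counts as a single stroke, while the 4-connected background stops the centre pixel from joining the outside through the diagonal gaps.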



In [None]:
import numpy
from scipy.ndimage import label as nd_label, generate_binary_structure

#region Helper Functions
def count_neighbours(x, y, matrix):
    count = 0

    for dx in [-1, 0, 1]:
        for dy in [-1, 0, 1]:
            if dx == 0 and dy == 0:
                continue
                
            nx, ny = x + dx, y + dy
            
            if 0 <= nx < matrix.shape[0] and 0 <= ny < matrix.shape[1]:
                count += matrix[nx, ny]
                
    return count
#endregion

def calculate_features(matrix):

    features = {}
    matrix_arr = numpy.array(matrix)

    # 1. nr_pix
    features['nr_pix'] = int(numpy.sum(matrix_arr))
    
    # 2. rows_with_1
    features['rows_with_1'] = int(numpy.sum(numpy.sum(matrix_arr, axis=1) == 1))
    
    # 3. cols_with_1
    features['cols_with_1'] = int(numpy.sum(numpy.sum(matrix_arr, axis=0) == 1))
    
    # 4. rows_with_3p
    features['rows_with_3p'] = int(numpy.sum(numpy.sum(matrix_arr, axis=1) >= 3))
    
    # 5. cols_with_3p
    features['cols_with_3p'] = int(numpy.sum(numpy.sum(matrix_arr, axis=0) >= 3))
    
    # 6. aspect_ratio
    black_indices = numpy.argwhere(matrix_arr == 1)
    if black_indices.size == 0:
        features['aspect_ratio'] = 0
    else:
        min_row, min_col = numpy.min(black_indices, axis=0)
        max_row, max_col = numpy.max(black_indices, axis=0)

        height = max_row - min_row + 1
        width = max_col - min_col + 1
        features['aspect_ratio'] = width/height if height != 0 else 0
        
    # 7. neigh_1
    neigh_1 = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            # Only black pixels (value 1) with exactly one black neighbour are counted
            if matrix_arr[i, j] == 1 and count_neighbours(i, j, matrix_arr) == 1:
                neigh_1 += 1
    features['neigh_1'] = neigh_1
        
    # 8. no_neigh_above
    no_neigh_above = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                neigh_vals = []
                for dj in [-1, 0, 1]:
                    ni, nj = i - 1, j + dj

                    if 0 <= ni < matrix_arr.shape[0] and 0 <= nj < matrix_arr.shape[1]:
                        neigh_vals.append(matrix_arr[ni, nj])
                    else:
                        neigh_vals.append(0)

                if sum(neigh_vals) == 0:
                    no_neigh_above += 1
    features['no_neigh_above'] = no_neigh_above
                    
    # 9. no_neigh_below
    no_neigh_below = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                neigh_vals = []
                for dj in [-1, 0, 1]:
                    ni, nj = i + 1, j + dj

                    if 0 <= ni < matrix_arr.shape[0] and 0 <= nj < matrix_arr.shape[1]:
                        neigh_vals.append(matrix_arr[ni, nj])
                    else:
                        neigh_vals.append(0)

                if sum(neigh_vals) == 0:
                    no_neigh_below += 1
    features['no_neigh_below'] = no_neigh_below
        
    # 10. no_neigh_left
    no_neigh_left = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                neigh_vals = []
                for di in [-1, 0, 1]:
                    ni, nj = i + di, j - 1

                    if 0 <= ni < matrix_arr.shape[0] and 0 <= nj < matrix_arr.shape[1]:
                        neigh_vals.append(matrix_arr[ni, nj])
                    else:
                        neigh_vals.append(0)

                if sum(neigh_vals) == 0:
                    no_neigh_left += 1
    features['no_neigh_left'] = no_neigh_left
        
    # 11. no_neigh_right
    no_neigh_right = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                neigh_vals = []
                for di in [-1, 0, 1]:
                    ni, nj = i + di, j + 1

                    if 0 <= ni < matrix_arr.shape[0] and 0 <= nj < matrix_arr.shape[1]:
                        neigh_vals.append(matrix_arr[ni, nj])
                    else:
                        neigh_vals.append(0)

                if sum(neigh_vals) == 0:
                    no_neigh_right += 1
    features['no_neigh_right'] = no_neigh_right
        
    # 12. no_neigh_horiz
    no_neigh_horiz = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                left = matrix_arr[i, j - 1] if j - 1 >= 0 else 0
                right = matrix_arr[i, j + 1] if j + 1 < matrix_arr.shape[1] else 0

                if left + right == 0:
                    no_neigh_horiz += 1
    features['no_neigh_horiz'] = no_neigh_horiz
        
    # 13. no_neigh_vert
    no_neigh_vert = 0
    for i in range(matrix_arr.shape[0]):
        for j in range(matrix_arr.shape[1]):
            if matrix_arr[i, j] == 1:
                up = matrix_arr[i - 1, j] if i - 1 >= 0 else 0
                down = matrix_arr[i + 1, j] if i + 1 < matrix_arr.shape[0] else 0

                if up + down == 0:
                    no_neigh_vert += 1
    features['no_neigh_vert'] = no_neigh_vert
        
    # 14. connected_areas
    bin_structure = generate_binary_structure(2, 2)
    labeled_arr, num_regions = nd_label(matrix_arr, structure=bin_structure)
    features['connected_areas'] = num_regions
        
    # 15. eyes
    inverted_matrix = 1 - matrix_arr

    bin_structure = generate_binary_structure(2, 1)
    labeled_arr, num_white = nd_label(inverted_matrix, structure=bin_structure)

    eyes_count = 0
    for region_label in range(1, num_white + 1):
        region = (labeled_arr == region_label)
        if (region[0, :].any() or region[-1, :].any() or region[:, 0].any() or region[:, -1].any()):
            continue
        eyes_count += 1
    features['eyes'] = eyes_count
        
    # 16. custom (horizontal symmetry)
    foreground = numpy.argwhere(matrix_arr == 1)
    if foreground.size == 0:
        features['custom'] = 1.0
    else:
        min_row, min_col = foreground.min(axis=0)
        max_row, max_col = foreground.max(axis=0)

        symbol_region = matrix_arr[min_row:max_row + 1, min_col:max_col + 1]
    
        region_columns = symbol_region.shape[1]
        if region_columns < 2:
            features['custom'] = 1.0
        else:
            midpoint = region_columns // 2
            left_half = symbol_region[:, :midpoint]
            right_half = symbol_region[:, midpoint:]
    
            right_half_flipped = numpy.fliplr(right_half)

            min_width = min(left_half.shape[1], right_half_flipped.shape[1])
            left_half_cropped = left_half[:, :min_width]
            right_half_cropped = right_half_flipped[:, :min_width]
    
            difference = numpy.abs(left_half_cropped - right_half_cropped)

            total_pixels = left_half_cropped.size
            if total_pixels == 0:
                features['custom'] = 1.0
            else:
                normalised_diff = numpy.sum(difference) / total_pixels

                symmetry_score = 1 - normalised_diff
                features['custom'] = symmetry_score
                
    return features
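
As a quick sanity check of the symmetry score computed at the end of `calculate_features`, the same steps can be applied to two tiny inputs. This is a minimal standalone sketch: the function name `horizontal_symmetry` and the arrays `symmetric` and `lopsided` are hypothetical, chosen only for this illustration.

```python
import numpy

def horizontal_symmetry(image):
    """Symmetry score in [0, 1], following the steps used for the custom feature."""
    image = numpy.asarray(image)
    foreground = numpy.argwhere(image == 1)
    if foreground.size == 0:
        return 1.0  # empty image: default score

    # Crop to the bounding box of the black pixels
    min_row, min_col = foreground.min(axis=0)
    max_row, max_col = foreground.max(axis=0)
    region = image[min_row:max_row + 1, min_col:max_col + 1]

    width = region.shape[1]
    if width < 2:
        return 1.0  # too narrow to compare halves: default score

    midpoint = width // 2
    left = region[:, :midpoint]
    # Flip the right half, then keep as many columns as the left half has;
    # for an odd width this drops the middle column.
    right_flipped = numpy.fliplr(region[:, midpoint:])[:, :midpoint]

    return 1 - numpy.abs(left - right_flipped).sum() / left.size

# Hypothetical 2x3 glyphs made up for this example
symmetric = [[1, 0, 1],
             [1, 1, 1]]
lopsided = [[1, 0, 0],
            [1, 1, 1]]

print(horizontal_symmetry(symmetric))  # 1.0
print(horizontal_symmetry(lopsided))   # 0.5
```

In `lopsided`, the left column `[1, 1]` is compared with the outer right column `[0, 1]`. One of the two pixel pairs differs, so the score is 1 - 1/2 = 0.5.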

## File Processing
This cell loops through every file in the `IMAGE_CSV_FOLDER` (which contains the CSV files exported in section one), calculates its features, and saves the information in a file called `40394874_features.csv`.
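
The label and index are taken from each file name, which is expected to have three underscore-separated parts. The sketch below isolates that parsing step. The helper name `parse_image_filename` and the example file names are hypothetical; the real names come from the files exported in section one.

```python
import os

def parse_image_filename(filename):
    """Return (label, index) for names with three '_'-separated parts, else None."""
    parts = filename.split('_')
    if len(parts) != 3:
        return None  # names that do not match the expected pattern are skipped
    label = parts[1]
    index = os.path.splitext(parts[2])[0]  # drop the '.csv' extension
    return label, index

# Hypothetical file names, for illustration only
print(parse_image_filename('prefix_label_07.csv'))  # ('label', '07')
print(parse_image_filename('notes.csv'))            # None
```

The index stays a string at this point. It is only converted with `int` when the rows are sorted, so an index such as `'07'` is written to the features file unchanged.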

In [None]:
import os
import csv
import numpy

IMAGE_CSV_FOLDER = 'images/CSV'
FEATURES_FILE = '40394874_features.csv'

CSV_HEADER = ['label', 'Index', 'nr_pix', 'rows_with_1', 'cols_with_1', 'rows_with_3p', 'cols_with_3p', 'aspect_ratio', 'neigh_1', 'no_neigh_above', 'no_neigh_below', 'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz', 'no_neigh_vert', 'connected_areas', 'eyes', 'custom']

feature_rows = [CSV_HEADER]

image_files = [file for file in os.listdir(IMAGE_CSV_FOLDER) if file.lower().endswith('.csv')]

for file in image_files:
    parts = file.split('_')
    if len(parts) != 3:
        continue
    label = parts[1]
    index = os.path.splitext(parts[2])[0]

    img_path = os.path.join(IMAGE_CSV_FOLDER, file)
    img_matrix = numpy.loadtxt(img_path, delimiter=',', dtype=int)

    features = calculate_features(img_matrix)

    row = [label, index, features['nr_pix'], features['rows_with_1'], features['cols_with_1'], features['rows_with_3p'], features['cols_with_3p'], features['aspect_ratio'], features['neigh_1'], features['no_neigh_above'], features['no_neigh_below'], features['no_neigh_left'], features['no_neigh_right'], features['no_neigh_horiz'], features['no_neigh_vert'], features['connected_areas'], features['eyes'], features['custom']]
    feature_rows.append(row)

header_row, *data_rows = feature_rows
data_rows.sort(key=lambda x: (x[0], int(x[1])))
feature_rows = [header_row] + data_rows

with open(FEATURES_FILE, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(feature_rows)

print(f"Features saved to {FEATURES_FILE}")