The goal of this project is to create a model that can predict which common user-interface element (button, toggle, window, etc.) a hand-drawn sketch depicts.

We are using the UISketch dataset from Kaggle, which can be found at https://www.kaggle.com/datasets/vinothpandian/uisketch.

This notebook covers the first step of that process: data wrangling and exploratory data analysis.

In [44]:
# Import necessary packages
import numpy as np
import cv2
import pandas as pd
import glob
import os
import pyarrow as pa
import pyarrow.parquet as pq

First, we create a list of all the category labels, which are also the names of the class folders.

In [30]:
labels = os.listdir('/Users/grahamsmith/Documents/SpringboardWork/UIsketch dataset')

# remove the files that are not classes
labels.remove('labels.csv')
labels.remove('.DS_Store')
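
Removing known non-class files by name works for this folder, but it raises an error if one of those files is absent. As a minimal sketch of an alternative, the hypothetical helper below (list_class_folders is not part of the original notebook) keeps only sub-directories, and sorts them so the label order is reproducible. It is demonstrated on a throwaway directory with made-up folder names rather than on the real dataset.

```python
import os
import tempfile

def list_class_folders(dataset_dir):
    """Return the sorted names of the sub-directories of dataset_dir."""
    return sorted(
        name for name in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, name))
    )

# Demonstration on a temporary directory (folder and file names are made up)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['button', 'toggle']:
        os.mkdir(os.path.join(demo_dir, name))
    for name in ['labels.csv', '.DS_Store']:
        open(os.path.join(demo_dir, name), 'w').close()
    demo_labels = list_class_folders(demo_dir)

print(demo_labels)  # ['button', 'toggle']
```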

Next, we check the class sizes to see whether the classes are significantly imbalanced.

In [4]:
label_length = []

for x in labels:
    label_length.append(len(os.listdir('/Users/grahamsmith/Documents/SpringboardWork/Springboard/UIsketch dataset/' + x)))

print('The smallest class size is ', min(label_length), ', and the largest class size is ', max(label_length),
     '. The average class size is ', round(sum(label_length)/len(label_length)))

The smallest class size is  847 , and the largest class size is  1157 . The average class size is  948


There are a roughly equivalent number of images in each class, although the balance should still be watched in case it hurts accuracy on the smaller classes later.

Our goal is to convert the images into a single matrix where each row is an image, each column is a single pixel location, and each value represents the brightness of that pixel in grayscale.

First, we crawl through the class folders and create a list of all the images as 1-D arrays:
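
To put a number on the balance, the sketch below computes each class's share of the data and the ratio of the largest class to the smallest. The helper name class_balance and the class names are hypothetical, and only the sizes 847 and 1157 come from the output above (948 is the reported average, used here as a stand-in).

```python
def class_balance(counts):
    """Return each class's share of the data and the largest/smallest size ratio."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Hypothetical class names; 847 and 1157 are the reported extremes
demo_counts = {'class_a': 847, 'class_b': 948, 'class_c': 1157}
shares, ratio = class_balance(demo_counts)

print(round(ratio, 2))  # 1.37
```

With the reported extremes, the largest class is about 1.37 times the size of the smallest.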

In [5]:
images = []

for label in labels:
    filenames = glob.glob('/Users/grahamsmith/Documents/SpringboardWork/Springboard/UIsketch dataset/' + label + '/*')
    for filename in filenames:
        images.append(cv2.imread(filename).flatten())

Double check that the images are all the same dimensions.

In [6]:
for image in images:
    # Report any image whose flattened length differs from the expected 150528 values
    if len(image) != 150528:
        print('image not equal to 150528')

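Nothing printed, so we can conclude that all the images are the same size: 150528 values each. Because cv2.imread loads three color channels by default, that is 50176 pixels (consistent with 224 × 224 images) times 3 channels, not 150528 pixels.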
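
The sketch below illustrates where the 150528 figure comes from and how a single-channel image would give one value per pixel. It uses a synthetic array in place of a real file, and the 224 × 224 size is an assumption inferred from 150528 = 224 × 224 × 3. Averaging the channels is only a simple approximation of grayscale; in the notebook itself, passing cv2.IMREAD_GRAYSCALE as the second argument to cv2.imread would load one channel directly.

```python
import numpy as np

# Synthetic stand-in for cv2.imread(filename): a white image with 3 channels.
# The 224 x 224 size is an assumption inferred from 150528 = 224 * 224 * 3.
color_image = np.full((224, 224, 3), 255, dtype=np.uint8)
color_image[100:120, 50:170, :] = 0  # a black rectangle as a fake sketch stroke

# Flattening the 3-channel image gives three values per pixel
flat_color = color_image.flatten()

# Average the channels to get one brightness value per pixel
gray_image = color_image.mean(axis=2).astype(np.uint8)
flat_gray = gray_image.flatten()

print(flat_color.shape)  # (150528,)
print(flat_gray.shape)   # (50176,)
```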

Now another list is generated with all of the image labels, in the same order as the list of images.

In [7]:
label_list = []

for label in labels:
    filenames = glob.glob('/Users/grahamsmith/Documents/SpringboardWork/Springboard/UIsketch dataset/' + label + '/*')
    for filename in filenames:
        label_list.append(str(label))
        

Now that all the data is loaded, the lists need to be combined into our final design matrix. The following code is so repetitive because it was written for the Kaggle IDE, which would break if more than 1000 lines were run at once (even with a loop).

In [8]:
imgsub = images[0:1000]
# Stack the chunk's 1-D arrays into a 2-D array with one row per image
images_mat = np.vstack(imgsub)

In [9]:
# Images 1000 through 1999, one row per image
imgsub1 = images[1000:2000]
images_mat1 = np.vstack(imgsub1)

In [10]:
# Images 2000 through 2999, one row per image
imgsub2 = images[2000:3000]
images_mat2 = np.vstack(imgsub2)

In [11]:
# Images 3000 through 3999, one row per image
imgsub3 = images[3000:4000]
images_mat3 = np.vstack(imgsub3)

In [12]:
# Images 4000 through 4999, one row per image
imgsub4 = images[4000:5000]
images_mat4 = np.vstack(imgsub4)

In [13]:
# Images 5000 through 5999, one row per image
imgsub5 = images[5000:6000]
images_mat5 = np.vstack(imgsub5)

In [14]:
# Images 6000 through 6999, one row per image
imgsub6 = images[6000:7000]
images_mat6 = np.vstack(imgsub6)

In [15]:
# Images 7000 through 7999, one row per image
imgsub7 = images[7000:8000]
images_mat7 = np.vstack(imgsub7)

In [16]:
# Images 8000 through 8999, one row per image
imgsub8 = images[8000:9000]
images_mat8 = np.vstack(imgsub8)

In [17]:
# Images 9000 through 9999, one row per image
imgsub9 = images[9000:10000]
images_mat9 = np.vstack(imgsub9)

In [18]:
# Images 10000 through 10999, one row per image
imgsub10 = images[10000:11000]
images_mat10 = np.vstack(imgsub10)

In [19]:
# Images 11000 through 11999, one row per image
imgsub11 = images[11000:12000]
images_mat11 = np.vstack(imgsub11)

In [20]:
# Images 12000 through 12999, one row per image
imgsub12 = images[12000:13000]
images_mat12 = np.vstack(imgsub12)

In [21]:
# Images 13000 through 13999, one row per image
imgsub13 = images[13000:14000]
images_mat13 = np.vstack(imgsub13)

In [22]:
# Images 14000 through 14999, one row per image
imgsub14 = images[14000:15000]
images_mat14 = np.vstack(imgsub14)

In [23]:
# Images 15000 through 15999, one row per image
imgsub15 = images[15000:16000]
images_mat15 = np.vstack(imgsub15)

In [24]:
# Images 16000 through 16999, one row per image
imgsub16 = images[16000:17000]
images_mat16 = np.vstack(imgsub16)

In [25]:
# Images 17000 through 17999, one row per image
imgsub17 = images[17000:18000]
images_mat17 = np.vstack(imgsub17)

In [26]:
# All remaining images from index 18000 onward, one row per image
imgsub18 = images[18000:]
images_mat18 = np.vstack(imgsub18)

Stack the arrays together into the final matrix.

In [27]:
# Create final matrix
design_matrix = np.vstack([images_mat, images_mat1, images_mat2, images_mat3, images_mat4, images_mat5,
                           images_mat6, images_mat7, images_mat8, images_mat9, images_mat10, images_mat11,
                           images_mat12, images_mat13, images_mat14, images_mat15, images_mat16,
                           images_mat17, images_mat18])
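
The chunk-then-stack pattern can also be written as a loop over slice boundaries. The sketch below is one way to do that; stack_in_chunks and chunk_size are hypothetical names, and whether this form avoids the Kaggle limitation described earlier has not been tested. It runs on a small synthetic list in place of images.

```python
import numpy as np

def stack_in_chunks(rows, chunk_size=1000):
    """Stack 1-D arrays into a 2-D matrix, chunk_size rows at a time."""
    chunks = [
        # Half-open slices [start, start + chunk_size) cover every row exactly once
        np.vstack(rows[start:start + chunk_size])
        for start in range(0, len(rows), chunk_size)
    ]
    return np.vstack(chunks)

# Synthetic stand-in for `images`: 2500 "images" of 4 values each,
# where every value in image i equals i
demo_images = [np.full(4, i, dtype=np.int32) for i in range(2500)]
demo_matrix = stack_in_chunks(demo_images)

print(demo_matrix.shape)  # (2500, 4)
```

Because row i of the result is built from element i of the input list, the rows stay in the same order as a label list built in the same traversal.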

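Double-check that the matrix dimensions are correct. There should be one row per image, so the row count should equal both len(images) and len(label_list), and there should be 150528 columns.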

In [28]:
design_matrix.shape

(19845, 150528)

Finally, we turn the matrix into a dataframe so it can be easily exported as a parquet file, a column-oriented and space-efficient file type.

In [29]:
df = pd.DataFrame(design_matrix)
df['label'] = label_list
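
Assigning the label column only makes sense if there is exactly one label per matrix row. pandas raises a ValueError when the lengths differ, but an explicit check before building the dataframe gives a clearer message. The sketch below uses a hypothetical helper, check_alignment, on synthetic stand-ins for design_matrix and label_list.

```python
import numpy as np

def check_alignment(matrix, labels):
    """Raise ValueError unless there is exactly one label per matrix row."""
    if matrix.shape[0] != len(labels):
        raise ValueError(f'{matrix.shape[0]} rows but {len(labels)} labels')

# Synthetic stand-ins for design_matrix and label_list
demo_matrix = np.zeros((3, 5), dtype=np.uint8)
demo_labels = ['button', 'button', 'toggle']
check_alignment(demo_matrix, demo_labels)  # passes silently

# Dropping one label should be detected
try:
    check_alignment(demo_matrix, demo_labels[:2])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)  # True
```

A matching count does not prove that the rows and labels are in the same order; that still depends on both lists being built by the same traversal of the class folders.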

In [42]:
# Final check to make sure everything looks good
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,150519,150520,150521,150522,150523,150524,150525,150526,150527,label
0,255,255,255,255,255,255,255,255,255,255,...,255,255,255,255,255,255,255,255,255,dropdown_menu
1,255,255,255,255,255,255,255,255,255,255,...,255,255,255,255,255,255,255,255,255,dropdown_menu
2,255,255,255,255,255,255,255,255,255,255,...,255,255,255,255,255,255,255,255,255,dropdown_menu
3,255,255,255,255,255,255,255,255,255,255,...,255,255,255,255,255,255,255,255,255,dropdown_menu
4,255,255,255,255,255,255,255,255,255,255,...,255,255,255,255,255,255,255,255,255,dropdown_menu


In [45]:
table = pa.Table.from_pandas(df)
pq.write_table(table, 'UIsketch.parquet')

