The cell below imports all necessary libraries and defines some global constants.  

In [12]:
from matplotlib.image import imread
from sklearn.model_selection import train_test_split
from sklearn.cluster import MiniBatchKMeans
import numpy as np
import pandas as pd

# Each image has shape (128, 384). After segmentation, each character has the dimensions below.
charHeight = 128
charWidth = 128
numImages = 48574

# Length of one flattened character image (128 * 128 = 16384).
vectorLength = charHeight * charWidth

imgPath = "SoML-50/SoML-50/data/"
csvPath = "SoML-50/SoML-50/annotations.csv"

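def getPathOfImg (index):
    # Build the file path of image number `index`, e.g. "SoML-50/SoML-50/data/5.jpg".
    return (imgPath + str (index) + ".jpg")

def getLabelOfImg (df,index):
    # Look up the 'Label' column (e.g. 'prefix') of the row whose 'Image' is "<index>.jpg".
    return (df.loc[df['Image'] == (str(index) + '.jpg')]['Label'].values[0])

def getValueOfImg (df, index):
    # Look up the 'Value' column of the same row and return it as an int.
    return (int(df.loc[df['Image'] == (str(index) + '.jpg')]['Value'].values[0]))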


Now we divide the data set into training and testing sets, holding out 10% of the images for testing.

In [13]:
imgIndex = list (range (1, numImages+1))
train_set, test_set = train_test_split (imgIndex, test_size= 0.1, random_state= 1)

df = pd.read_csv (csvPath)

print (getLabelOfImg (df,5))
print (getValueOfImg (df,6))

prefix
4
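
As a quick sanity check on the split, here is a minimal sketch that repeats it on the bare index list and confirms that the two sets are disjoint and together cover every image. The lowercase names are only for this illustration; the values mirror the constants above.

```python
from sklearn.model_selection import train_test_split

num_images = 48574  # same value as numImages above
img_index = list(range(1, num_images + 1))

# Same arguments as in the cell above: 10% held out, fixed seed.
train_set, test_set = train_test_split(img_index, test_size=0.1, random_state=1)

# The two sets share no index and together cover all images.
assert set(train_set).isdisjoint(test_set)
assert len(train_set) + len(test_set) == num_images
print(len(train_set), len(test_set))
```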


The cell below takes an image index and returns a numpy array holding the three character vectors of that image (operator first, then the two operands), ready for the K-means clustering algorithm.

In [14]:
def vectoriseImg (image):
    return np.reshape (image, (1,-1))

def getSegmentedVectors (df,index):
    """Return a numpy array of shape (3, 1, vectorLength) holding the three (128, 128) character images of index.jpg, each flattened into a vector.
    K-Means cannot be applied directly to image matrices, so each image matrix is vectorised first.
    The operator is always at [0] and the two operands at [1] and [2], in the order in which the operator is applied to them."""
    
    image = imread (getPathOfImg(index))
    label = getLabelOfImg (df, index)
    if (label == 'prefix'):
        charArray = np.array ([image[:, 0:charWidth],image[:, charWidth:(2*charWidth)],image[:, (2*charWidth):]])
    elif (label == 'postfix'):
        charArray = np.array ([image[:, (2*charWidth):],image[:, 0:charWidth],image[:, charWidth:(2*charWidth)]])
    else:
        charArray = np.array ([image[:, charWidth:(2*charWidth)],image[:, 0:charWidth],image[:, (2*charWidth):]])
    
    ans = np.array ([vectoriseImg (charArray[i]) for i in range (3)])
    return ans

print ((getSegmentedVectors (df, 6).shape))

(3, 1, 16384)
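
To make the slicing order easier to check, here is a minimal sketch of the same segmentation on a synthetic image, so it runs without the data set. `split_characters` is a hypothetical helper, not part of the notebook; it returns shape (3, 16384) rather than (3, 1, 16384), and it assumes that any label other than 'prefix' or 'postfix' means the operator is in the middle.

```python
import numpy as np

CHAR_WIDTH = 128  # same value as charWidth above

def split_characters(image, label):
    """Split a (128, 384) image into operator, operand, operand vectors."""
    left = image[:, :CHAR_WIDTH]
    middle = image[:, CHAR_WIDTH:2 * CHAR_WIDTH]
    right = image[:, 2 * CHAR_WIDTH:]
    if label == "prefix":      # operator is the leftmost character
        ordered = (left, middle, right)
    elif label == "postfix":   # operator is the rightmost character
        ordered = (right, left, middle)
    else:                      # assumed infix: operator is in the middle
        ordered = (middle, left, right)
    # Flatten each (128, 128) block into one row of length 16384.
    return np.stack([block.reshape(-1) for block in ordered])

# Synthetic stand-in for a real image: each third is filled with its position (0, 1, 2).
image = np.zeros((128, 384), dtype=np.uint8)
image[:, 128:256] = 1
image[:, 256:] = 2

vectors = split_characters(image, "postfix")
print(vectors.shape, vectors[:, 0])
```

For the 'postfix' label the first row comes from the rightmost third of the image, which is where the operator sits in that layout.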


Now we create the numpy matrices whose rows the KMeans class object will cluster. We create two numpy arrays: one for the operator vectors and one for the operand vectors (both operands of an image go into the same array).

Note that we allocate the arrays with the required shape at the very beginning. We should not append to numpy arrays: they are stored in contiguous blocks of memory, so every append has to copy the whole array again. Source: https://stackoverflow.com/questions/568962/how-do-i-create-an-empty-array-matrix-in-numpy
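
A minimal sketch of the difference, on small made-up arrays: `np.append` returns a new array and leaves the original untouched, whereas a preallocated array is filled in place row by row. The sizes and names here are illustrative only.

```python
import numpy as np

# np.append never grows an array in place: it allocates a new one and copies.
a = np.zeros(3)
b = np.append(a, 1.0)
print(a.shape, b.shape, np.shares_memory(a, b))

# Preallocating once and assigning rows avoids the repeated copies.
n_rows, vector_length = 10, 16384
batch = np.empty((n_rows, vector_length), dtype=np.uint8)
for row in range(n_rows):
    batch[row] = row  # stand-in for one flattened character image
```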

I ran into a problem here: I cannot allocate 3 arrays as big as len(train_set) * vectorLength ≈ 45000 * 16000 ≈ 7 × 10^8 elements each. My entire 16 GB of RAM was not enough and the laptop kept freezing. Hence I chose mini-batch training, which takes mini-batches of 10 images at a time and updates the k-means clustering model incrementally.
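
The memory estimate can be worked out explicitly. This is a rough sketch that assumes len(train_set) is 43716 (90% of the 48574 images) and that `dtype = int` maps to a 64-bit integer, as it does on most 64-bit platforms; `array_gib` is a helper introduced only for this illustration. The 8-bit row shows what the same arrays would need if the pixel values were stored as uint8, which is an assumption about the image data rather than something the notebook does.

```python
import numpy as np

n_train = 43716            # assumed size of train_set (90% of 48574 images)
vector_length = 128 * 128  # one flattened character image

def array_gib(n_rows, n_cols, dtype):
    """Return the size in GiB of an (n_rows, n_cols) array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**30

for dtype in (np.int64, np.uint8):
    per_array = array_gib(n_train, vector_length, dtype)
    print(f"{np.dtype(dtype).name}: {per_array:.2f} GiB per array, "
          f"{3 * per_array:.2f} GiB for three arrays")
```

Under these assumptions three 64-bit arrays come to about 16 GiB before anything else is loaded, which is consistent with the machine running out of memory.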



In [19]:
# Declare the matrices on which mini-batch KMeans clustering will work: one batch holds 10 operators and 20 operands.
operators = np.empty (shape = [10, vectorLength], dtype = int)
operands = np.empty (shape = [20, vectorLength], dtype = int)


# Now initialize the MiniBatchKMeans class objects.
operatorCluster = MiniBatchKMeans( n_clusters = 4, random_state=0, batch_size = 10)
operandCluster = MiniBatchKMeans( n_clusters= 10, random_state=0, batch_size = 20)

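# Start at index 0 so that train_set[0] is not skipped; a final partial batch of fewer than 10 images is ignored.
for i in range (0, len(train_set) - 9, 10):
    for j in range (i,i+10):
        segments = getSegmentedVectors (df, train_set[j])
        # The operator goes into row j-i of operators; the two operands go into rows j-i and j-i+10 of operands.
        operators[j-i] , operands[j-i], operands[j-i + 10] = segments[0] , segments[1] , segments[2]
    operatorCluster = operatorCluster.partial_fit(operators)
    operandCluster = operandCluster.partial_fit(operands)
    
# We have now trained our clustering models. This cell took approximately 190 seconds to run.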
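
The same incremental pattern can be tried on a toy problem that runs in well under a second. This is a minimal sketch, not the notebook's model: the 2-D cluster centres, batch size and number of batches are all made up for illustration, and only `MiniBatchKMeans.partial_fit` and `predict` are taken from scikit-learn.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D cluster centres standing in for the 16384-dimensional character vectors.
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
labels = np.repeat(np.arange(3), 10)  # 10 points per cluster in every batch

model = MiniBatchKMeans(n_clusters=3, random_state=0)
batch = np.empty((30, 2))  # buffer allocated once and refilled for every batch
for _ in range(50):
    batch[:] = true_centers[labels] + rng.normal(scale=0.5, size=(30, 2))
    model.partial_fit(batch)  # one incremental update per call

# Each true centre should now map to its own cluster id.
assigned = model.predict(true_centers)
print(assigned)
```

The cluster ids themselves are arbitrary, so after training on the character vectors each id still has to be matched to an actual digit or operator.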

In [25]:
sorted_df = df.sort_values (by = ['Value'])
#print (sorted_df.tail (20))
sorted_df.describe()


       Value
count  50000.0
mean   8.98578
std    14.079506
min    -9.0
25%    0.0
50%    5.0
75%    12.0
max    81.0
