## Faraaz Beyabani, Fuyao Du

In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from numpy import concatenate

Using TensorFlow backend.


## Dataset

For this project, we used the ICDAR 2013 dataset from a Kaggle competition to predict the gender of the author of a handwritten document. The dataset contains the writings of 282 unique authors, each writing 2 pages in English and 2 pages in Arabic, for a total of 1128 pages of handwriting. We opted to use the pre-extracted features from the dataset rather than extracting our own from the images. This allowed us to spend more time testing various model types instead of building a robust feature extraction pipeline.

Unfortunately, we ran into some technical difficulties: the dataset on the Kaggle website became inaccessible some time after we had individually downloaded it, so we had to keep backups in case we still needed some files.

The dataset features include tortuosity, curviness, chaincode, and others. We explore the first three below, then transition to using all pre-extracted features. Later, we also validate on a held-out set with a 75/25 split: withholding 1 document from each author leaves 846 documents for training and 282 for validation.
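The hold-out split described above can be sketched in plain Python. This is a minimal illustration, assuming each author's 4 pages occupy consecutive rows of the feature table; the helper name `page_holdout_indices` is hypothetical and not part of the notebook.

```python
def page_holdout_indices(n_authors, pages_per_author=4, held_out_page=3):
    """Return (train_idx, test_idx), holding out one page per author.

    Assumes rows are grouped by author, with `pages_per_author`
    consecutive rows for each author.
    """
    test_idx = [a * pages_per_author + held_out_page for a in range(n_authors)]
    held_out = set(test_idx)
    n_rows = n_authors * pages_per_author
    train_idx = [i for i in range(n_rows) if i not in held_out]
    return train_idx, test_idx


train_idx, test_idx = page_holdout_indices(n_authors=282)
print(len(train_idx), len(test_idx))  # 846 282
```

With the default arguments, the held-out rows are 3, 7, 11, and so on, which is the same set that `np.arange(3, 1128, 4)` selects in the later cells.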

In [3]:
train_data = pd.read_csv('./train.csv')
train_ans = pd.read_csv('./train_answers.csv')

In [4]:
train_ans = train_ans.iloc[:,1]
y = np.repeat(train_ans.to_numpy(), 4)

## Stochastic Gradient Descent with hinge loss on tortuosity features

First, we test classic stochastic gradient descent classification on the first 40 features. These features describe **tortuosity**, the "twistedness" of the handwriting.

In [98]:
train_tort_data = train_data.iloc[:, 4:44]

x = train_tort_data.to_numpy()

In [99]:
print(np.shape(x))
print(np.shape(y))

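# Assumes the label column encodes female as 0 and male as 1.
print(f'\nThere are {(y==0).sum()} samples from female writers.')
print(f'\nThere are {(y==1).sum()} samples from male writers.')

(1128, 40)
(1128,)

There are 572 samples from female writers.

There are 556 samples from male writers.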


In [103]:
model = SGDClassifier()
model.fit(X=x, y=y)

SGDClassifier()

In [104]:
answers = model.predict(X=x)

In [105]:
print((answers != y).sum())
print(answers[:32])
print(y[:32])

439
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Here, we get 61% training accuracy (439 misclassifications over 1128 samples). We also print the first 32 predictions and their corresponding ground truths. This result is only modest, so we decided to try other features to achieve a better one.

## Stochastic Gradient Descent with hinge loss on curviness features

Next, we used the following 846 features, which describe the **curviness** of the handwriting. We wanted to test which descriptor of handwriting was most accurate at classifying the gender of the author.

In [106]:
train_curve_data = train_data.iloc[:, 54:900]
x = train_curve_data.to_numpy()

In [107]:
print(np.shape(x))
print(np.shape(y))

(1128, 846)
(1128,)


In [108]:
model.fit(X=x, y=y)

SGDClassifier()

In [109]:
answers = model.predict(X=x)

In [110]:
print((answers != y).sum())
print(answers[:32])
print(y[:32])

499
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Here, we got somewhat worse results than with the tortuosity features: 499 misclassifications, or about 56% accuracy. We were not optimistic about this method, but we wanted to give one other set of features a try before writing off SGD entirely.
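For reference, the accuracy figures for the two SGD runs so far follow directly from the printed misclassification counts. The helper name `accuracy_from_errors` is hypothetical; the counts and the sample size are the ones shown in the outputs above.

```python
def accuracy_from_errors(n_errors, n_samples):
    """Fraction of correctly classified samples."""
    return 1 - n_errors / n_samples


n_samples = 1128
for name, n_errors in [("tortuosity", 439), ("curviness", 499)]:
    print(f"{name}: {accuracy_from_errors(n_errors, n_samples):.3f}")
# tortuosity: 0.611
# curviness: 0.558
```

Both numbers are measured on the same data the model was fitted on, so they are training accuracies rather than estimates of performance on unseen pages.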

## Stochastic Gradient Descent with hinge loss on chaincode features

These next 4119 features are called "chaincode". We weren't sure what that meant, but we inferred that it referred to how letters connect to each other.

In [111]:
train_chain_data = train_data.iloc[:, 901:5020]
x = train_chain_data.to_numpy()

In [112]:
print(np.shape(x))
print(np.shape(y))

(1128, 4119)
(1128,)


In [113]:
model.fit(X=x, y=y)

SGDClassifier()

In [114]:
answers = model.predict(X=x)

In [70]:
print((answers != y).sum())
print(model.score(X=x, y=y))
print(answers[:32])
print(y[:32])

392
0.648936170212766
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Here, our results were a bit better than with the previous two feature sets (about 65% accuracy). However, the results varied considerably every time we ran classification, which we attributed to SGD being a poor fit for this use case. Nevertheless, we decided to continue attempting classification using these chaincode features, in case the jump in accuracy was not a coincidence.
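Two common contributors to this kind of run-to-run variance are worth noting: SGD-based linear models are sensitive to feature scale, and `SGDClassifier` shuffles the data randomly unless `random_state` is fixed. The cells above neither scale the features nor set a seed. The sketch below shows column-wise standardization (z-scores) in plain Python on made-up toy data; `standardize_columns` is a hypothetical helper, and whether scaling would actually help on this dataset is untested.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (z-scores).

    Constant columns are mapped to 0.0 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: two features on very different scales plus a constant column.
rows = [[1.0, 1000.0, 5.0], [2.0, 3000.0, 5.0], [3.0, 2000.0, 5.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

In scikit-learn, the same idea is usually expressed by placing a `StandardScaler` in front of the classifier in a `Pipeline` and passing a fixed `random_state` to `SGDClassifier`.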

## Random Forest using chaincode features

Next, we tried **Random Forest** classification. We felt this would be a good fit because it builds on traditional decision tree methods. In addition, to keep track of potential overfitting, we validated on a held-out portion of the training set, using a 75/25 split.

In [20]:
train_chain_data = train_data.iloc[:, 901:5020]
x = train_chain_data.to_numpy()

In [21]:
model = RandomForestClassifier(n_estimators=250, min_samples_leaf=3)
model.fit(X=x, y=y)
answers = model.predict(X=x)
print((answers != y).sum())

0


Here, we get 0 misclassifications on the training data, so we must check that we are not overfitting.

In [22]:
test_idx = np.arange(3, 1128, 4)
train_idx = np.delete(np.arange(1128), test_idx)

x_test = x[test_idx, :]
x_train = x[train_idx, :]
y_train = np.repeat(train_ans.to_numpy(), 3)
y_test = train_ans

In [23]:
model.fit(X=x_train, y=y_train)

RandomForestClassifier(min_samples_leaf=3, n_estimators=250)

In [24]:
print(model.score(X=x_train, y=y_train))
print((model.predict(X=x_train) != y_train).sum())
print('\n')
print(model.score(X=x_test, y=y_test))
print((model.predict(X=x_test) != y_test).sum())

1.0
0


0.7056737588652482
83


The gap between training accuracy (100%) and validation accuracy (about 71%) confirms that we are overfitting. However, our validation accuracy is still a bit better than our earlier SGD accuracies of around 65%, which were measured on the training data itself. We experimented with different sklearn model parameters, but unfortunately did not gain a significant boost in accuracy.
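One caveat about this split: every author contributes pages to both the training and validation sets, so the validation score reflects performance on new pages from already-seen writers. A stricter check, which this notebook does not perform, would hold out entire authors. The following is a minimal sketch under the assumption that rows are grouped by author in blocks of 4; the helper `author_holdout_indices` and the every-fourth-author rule are illustrative choices only.

```python
def author_holdout_indices(n_authors, pages_per_author=4, every=4):
    """Hold out all pages of every `every`-th author."""
    train_idx, test_idx = [], []
    for author in range(n_authors):
        start = author * pages_per_author
        rows = range(start, start + pages_per_author)
        target = test_idx if author % every == every - 1 else train_idx
        target.extend(rows)
    return train_idx, test_idx


train_idx, test_idx = author_holdout_indices(n_authors=282)
train_authors = {i // 4 for i in train_idx}
test_authors = {i // 4 for i in test_idx}
print(len(train_idx), len(test_idx), train_authors & test_authors)
# 848 280 set()
```

The empty intersection confirms that no author appears on both sides. scikit-learn offers group-aware splitters such as `GroupKFold` for the same purpose.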

## Random Forest using all features

Out of curiosity, we decided to try using all of the features rather than only one feature type at a time. This gives the model more diverse data to look at: rather than tortuosity alone, it sees tortuosity, curviness, and chaincode together. This seemed especially worthwhile since random forest was decently suited for this task.

In [25]:
train_all_data = train_data.iloc[:, 4:]
x = train_all_data.to_numpy()

In [122]:
model = RandomForestClassifier(n_estimators=250, min_samples_leaf=1,
                               bootstrap=False, max_features='log2')
model.fit(X=x, y=y)
answers = model.predict(X=x)
print((answers != y).sum())

0


In [27]:
test_idx = np.arange(3, 1128, 4)
train_idx = np.delete(np.arange(1128), test_idx)

x_test = x[test_idx, :]
x_train = x[train_idx, :]
y_train = np.repeat(train_ans.to_numpy(), 3)
y_test = train_ans

In [131]:
model.fit(X=x_train, y=y_train)

RandomForestClassifier(bootstrap=False, max_features='log2', n_estimators=250)

In [132]:
print(model.score(X=x_train, y=y_train))
print((model.predict(X=x_train) != y_train).sum())
print('\n')
print(model.score(X=x_test, y=y_test))
print((model.predict(X=x_test) != y_test).sum())

1.0
0


0.8014184397163121
56


We had decent success, reaching about 80% accuracy on the validation set. We were still overfitting to the training data, but in our opinion the validation accuracy was still quite good.

## LSTM Neural Net with all features

As one last attempt to obtain decent accuracy without overfitting, one of the team members developed an LSTM recurrent neural network.

In [1]:
def fit_network(train_X, train_y, test_X, test_y, layer_num):
    np.random.seed(9)
    model = Sequential()
    model.add(LSTM(layer_num, activation='relu', input_shape=(train_X.shape[1],train_X.shape[2])))
    model.add(Dense(1, activation='relu'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    # fit network
    history = model.fit(train_X, train_y, epochs=300, validation_split=0.15, verbose=2, shuffle=False)
    # plot the training loss per epoch
    pyplot.plot(history.history['loss'], label='train')
    pyplot.legend()
    pyplot.show()
    # predict on the training set and threshold at 0.5 to get class labels
    yhat_train = (model.predict(train_X) > 0.5).astype(int).ravel()

    # predict on the held-out set in the same way
    yhat_test = (model.predict(test_X) > 0.5).astype(int).ravel()

    # calculate accuracy
    accuracy_train = sum(yhat_train == train_y)/len(train_y)
    accuracy_test = sum(yhat_test == test_y)/len(test_y)
    print('Train accuracy: %.3f' % accuracy_train)
    print('Test accuracy: %.3f' % accuracy_test )
    return model
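The output layer above pairs a `relu` activation with the `binary_crossentropy` loss. Binary cross-entropy expects a probability in (0, 1), which a sigmoid guarantees and a ReLU does not. The standalone sketch below uses plain Python rather than Keras, and the `eps` clipping value is an assumption chosen for illustration; it compares the two activations on a few raw scores.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample; p is clipped to [eps, 1 - eps] so log() stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


for z in (-2.0, 0.5, 3.0):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  sigmoid={sigmoid(z):.3f}")

# A negative score for a positive example: ReLU outputs exactly 0, so the loss
# is finite only because of the clipping, while the sigmoid loss stays moderate.
loss_relu = binary_cross_entropy(1, relu(-2.0))
loss_sigmoid = binary_cross_entropy(1, sigmoid(-2.0))
print(f"loss with relu: {loss_relu:.3f}, loss with sigmoid: {loss_sigmoid:.3f}")
```

In Keras, the sigmoid variant would correspond to `Dense(1, activation='sigmoid')`. Switching would change the trained model, so the accuracies reported below would need to be measured again.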

In [6]:
train_data = pd.read_csv('./train.csv')
train_ans = pd.read_csv('./train_answers.csv')

train_ans = train_ans.iloc[:,1]
y = np.repeat(train_ans.to_numpy(), 4)
All_features = (train_data.iloc[:, 5:5020]).to_numpy()
test_idx = np.arange(3, 1128, 4)
train_idx = np.delete(np.arange(1128), test_idx)

test_X = All_features[test_idx, :]
train_X = All_features[train_idx, :]
train_y = np.repeat(train_ans.to_numpy(), 3)
# Reshape to 3-D (samples, timesteps=1, features), the input shape the LSTM layer expects
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

test_y = train_ans
layer_num = 100

model_LSTM = fit_network(train_X,train_y,test_X,test_y, layer_num)

Our final results for this model type were:

**Train accuracy: 84.5%**

**Test accuracy: 81.6%**

The training and test accuracies were very close, and the test accuracy was slightly better than the one we obtained from the Random Forest model (81.6% versus 80.1%). Because the gap between training and test accuracy is small, this model appears to overfit far less than the Random Forest, so its training accuracy gives a more reliable indication of performance on unseen data.

## Conclusion

In conclusion, in our opinion, we did very well in classifying the given dataset with the limited amount of data and time that we had. With only 4 samples from each author, we had data from just 282 unique authors, making our results somewhat unreliable on a larger scale.

Given more time and/or resources, we believe that this problem space could benefit from more data, namely human-written texts in multiple languages. It may also be worth trying the raw pixel-based image data, just to see what sort of results may occur. Furthermore, we believe extracting our own features could help increase the achievable accuracy on this type of data. Finally, additional, more sophisticated model types would also be very useful in achieving better results (90%+) on this data, including hand-crafted logistic regression and/or deep-learning methods.