In [1]:
import sklearn.datasets
import keras
import numpy as np

We'll use a classic data set about adult incomes (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). It is already pre-processed into a format usable for machine learning, with categories expressed as 0 or 1 flag features.

scikit-learn (sklearn) can load this file format. In general, always check whether sklearn already does what you need. It's a huge timesaver!

In [2]:
features, labels = sklearn.datasets.load_svmlight_file('./a1a.txt')
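
For orientation, the svmlight / libsvm text format stores one sample per line: a label followed by space-separated `index:value` pairs for the non-zero features only, with indices conventionally starting at 1. The sketch below parses one such line by hand into a dense list. The sample line and the helper name `parse_svmlight_line` are made up for illustration; in practice `load_svmlight_file` does this for you and returns a sparse matrix.

```python
def parse_svmlight_line(line, n_features):
    """Parse one 'label idx:val idx:val ...' line into (label, dense_row)."""
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * n_features
    for pair in parts[1:]:
        index, value = pair.split(":")
        row[int(index) - 1] = float(value)  # assumes 1-based indices in the file
    return label, row

# Hypothetical sample line, not taken from a1a.txt
label, row = parse_svmlight_line("1 2:1 5:1", n_features=6)
print(label, row)
```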

Now that the data is loaded, let's explore its shape and columns.

In [3]:
features.shape

(1605, 119)

In [4]:
features[0].todense()  # 0/1 flags: 1 means the attribute is present in this sample

matrix([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
         0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,
         0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0.]])

In [5]:
labels[0:10]  # labels are a single 0/1 column; the model needs one distinct output per class

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])

Our features are all set, already in the range 0-1, but we need to transform the output labels to one-hot classes so that there are two distinct output possibilities: effectively yes and no.
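
As a rough picture of what one-hot encoding does to 0/1 labels, here is a minimal NumPy sketch. The array `toy_labels` is invented for illustration; `keras.utils.to_categorical` in the next cell does the equivalent on the real labels.

```python
import numpy as np

toy_labels = np.array([0., 0., 1., 0.])  # made-up labels, not from a1a.txt
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[toy_labels.astype(int)]
print(one_hot)
```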

In [6]:
one_hot_labels = keras.utils.to_categorical(labels).astype(np.float32)  # one-hot encoding

In [7]:
one_hot_labels[0:10]  # each row now flags the class: [1., 0.] for label 0, [0., 1.] for label 1

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]], dtype=float32)

In [8]:
model = keras.models.Sequential()
# logistic regression is a one layer model
model.add(keras.layers.Dense(one_hot_labels.shape[1], activation='softmax', input_dim=features.shape[1]))  # one output unit per class (second dimension of one_hot_labels)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(features, one_hot_labels, epochs=16, batch_size=16)

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x7f128adb45f8>
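
To make the "one layer" concrete: a Dense layer with softmax activation computes softmax(xW + b), and categorical cross-entropy is the negative log of the probability assigned to the true class. A minimal NumPy sketch of that forward pass follows. The weights, bias and input are arbitrary made-up values, not the parameters Keras learned above.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability; each output row sums to 1
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = np.array([[0., 1., 1., 0.]])   # one toy sample with 4 binary features
W = rng.normal(size=(4, 2))        # toy weights: 4 features -> 2 classes
b = np.zeros(2)                    # toy bias
y_true = np.array([[1., 0.]])      # one-hot target: class 0

probs = softmax(x @ W + b)         # class probabilities
loss = -np.sum(y_true * np.log(probs), axis=1).mean()  # categorical cross-entropy
print(probs, loss)
```

With only two classes, a softmax over two outputs is equivalent to a sigmoid applied to the difference of the two logits, which is why this single layer counts as logistic regression.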

And here we can see how accurately this model predicts the training data.

In [9]:
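predictions = np.argmax(model.predict(features), axis=1)  # predict_classes was removed in newer Keras; argmax over the class probabilities gives the same result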

In [10]:
import sklearn.metrics
print(sklearn.metrics.accuracy_score(labels, predictions))

0.8398753894080997


Hmm, that seems good, but let's make sure that the model can actually predict both classes.
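
One reason to check: the classes are imbalanced. Using the class counts that the classification report below prints as `support` (1210 samples of class 0 and 395 of class 1), a model that always predicted class 0 would already reach about 75% accuracy. The arithmetic, as a quick sketch:

```python
n_class0, n_class1 = 1210, 395   # support values from the classification report
baseline_accuracy = n_class0 / (n_class0 + n_class1)  # always predict class 0
print(round(baseline_accuracy, 4))  # 0.7539
```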

In [11]:
print(sklearn.metrics.classification_report(labels, predictions))  # per-class precision, recall and F1
# Precision = of the samples predicted as a class, the fraction that really belong to it
# Recall = of the samples that really belong to a class, the fraction the model found

             precision    recall  f1-score   support

        0.0       0.87      0.93      0.90      1210
        1.0       0.72      0.57      0.64       395

avg / total       0.83      0.84      0.83      1605
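
To make these columns concrete, the sketch below computes precision, recall and F1 for the positive class from raw counts. The label lists are made up for illustration; the formulas are the standard definitions.

```python
y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # made-up ground truth
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]   # made-up predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # of everything predicted 1, how much really is 1
recall = tp / (tp + fn)      # of everything that really is 1, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```

Plugging the report's class 1 values into the same F1 formula gives 2 * 0.72 * 0.57 / (0.72 + 0.57) ≈ 0.64, which matches the f1-score column.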



Aha, it does. Not with fantastic accuracy (recall for class 1 is only 0.57), but this is only one layer 'deep', so this is a shallow model.