# RBM Image-based Feature Extraction
Use an RBM to perform feature extraction on an image-based dataset that you find or create. If you go this route, present the features you extract and explain why this is a useful feature extraction method in the context you’re operating in. DO NOT USE either the MNIST digit recognition database or the iris data set. Both have been worked on publicly many times, and the code is easily available (though that code could be a useful resource to refer to).

## Identifying ships in satellite imagery
This dataset is from Kaggle.

## About the Data
The dataset consists of image chips extracted from Planet satellite imagery collected over the San Francisco Bay and San Pedro Bay areas of California. It includes 4000 80x80 RGB images labeled with either a "ship" or "no-ship" classification. Image chips were derived from PlanetScope full-frame visual scene products, which are orthorectified to a 3 meter pixel size. The pixel value data for each 80x80 RGB image is stored as a list of 19200 integers within the data list. The first 6400 entries contain the red channel values, the next 6400 the green, and the final 6400 the blue. The image is stored in row-major order, so that the first 80 entries of the array are the red channel values of the first row of the image.

The "ship" class includes 1000 images. Images in this class are near-centered on the body of a single ship. Ships of different sizes, orientations, and atmospheric collection conditions are included. The "no-ship" class includes 3000 images. A third of these are a random sampling of different landcover features (water, vegetation, bare earth, buildings, etc.) that do not include any portion of a ship. The next third are "partial ships" that contain only a portion of a ship, but not enough to meet the full definition of the "ship" class. The last third are images that have previously been mislabeled by machine learning models, typically caused by bright pixels or strong linear features. Example images from this class are shown below.
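The storage layout described above (channel by channel, each channel in row-major order) can be turned back into a displayable image with a reshape and a transpose. The following is a minimal sketch on a synthetic stand-in for one entry of the data list; the helper name `chip_to_image` is hypothetical and not part of the dataset or the notebook.

```python
import numpy as np

def chip_to_image(flat_pixels, size=80):
    """Turn a flat [R..., G..., B...] sequence into a (size, size, 3) array."""
    arr = np.asarray(flat_pixels, dtype=np.uint8)
    # Stored channel by channel, each channel row by row:
    # reshape to (channel, row, col), then move the channel axis last.
    return arr.reshape(3, size, size).transpose(1, 2, 0)

# Synthetic stand-in for one entry of the data list (not real imagery).
flat = np.concatenate([np.full(6400, 10), np.full(6400, 20), np.full(6400, 30)])
image = chip_to_image(flat)
print(image.shape)  # (80, 80, 3)
```

An array in this form could be passed directly to `plt.imshow` to inspect individual chips.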

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.base import clone
%matplotlib inline

In [12]:
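import json
# Load the dataset JSON; its keys (data, labels, locations, scene_ids)
# become the DataFrame columns.
with open('ships-in-satellite-imagery/shipsnet.json') as data_file:
    dataset = json.load(data_file)
df = pd.DataFrame(dataset)
print(df.head())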

                                                data  labels  \
0  [82, 89, 91, 87, 89, 87, 86, 86, 86, 86, 84, 8...       1   
1  [76, 75, 67, 62, 68, 72, 73, 73, 68, 69, 69, 6...       1   
2  [125, 127, 129, 130, 126, 125, 129, 133, 132, ...       1   
3  [102, 99, 113, 106, 96, 102, 105, 105, 103, 10...       1   
4  [78, 76, 74, 78, 79, 79, 79, 82, 86, 85, 83, 8...       1   

                                   locations             scene_ids  
0    [-118.2254694333423, 33.73803725920789]  20180708_180909_0f47  
1    [-122.33222866289329, 37.7491755586813]  20170705_180816_103e  
2  [-118.14283073363218, 33.736016066914175]  20180712_211331_0f06  
3   [-122.34784341495181, 37.76648707436548]  20170609_180756_103a  
4   [-122.34852408322172, 37.75878462398653]  20170515_180653_1007  


In [13]:
X = np.array(dataset['data']).astype('uint8')
Y = np.array(dataset['labels']).astype('uint8')
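def describeData(a,b):
    # a: flattened pixel arrays, b: binary labels (1 = ship, 0 = no-ship)
    print('Total number of images: {}'.format(len(a)))
    print('Number of NoShip Images: {}'.format(np.sum(b==0)))
    print('Number of Ship Images: {}'.format(np.sum(b==1)))
    print('Percentage of positive images: {:.2f}%'.format(100*np.mean(b)))
    # Note: each image is still a flat vector at this point, so the shape
    # printed here is (19200,) rather than (80, 80, 3).
    print('Image shape (Width, Height, Channels): {}'.format(a[0].shape))
describeData(X,Y)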

Total number of images: 4000
Number of NoShip Images: 3000
Number of Ship Images: 1000
Percentage of positive images: 25.00%
Image shape (Width, Height, Channels): (19200,)


In [14]:
df.isnull().sum()

data         0
labels       0
locations    0
scene_ids    0
dtype: int64

In [15]:
len(Y)

4000

In [16]:
len(X)

4000

In [17]:
print(X)

[[ 82  89  91 ...  86  88  89]
 [ 76  75  67 ...  54  57  58]
 [125 127 129 ... 111 109 115]
 ...
 [171 135 118 ...  95  95  85]
 [ 85  90  94 ...  96  95  89]
 [122 122 126 ...  51  46  69]]


In [18]:
X.shape

(4000, 19200)

In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(
 X, Y, test_size=0.2, random_state=0)
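With a 3:1 class imbalance, a plain random split can leave the training and test sets with slightly different ship proportions. One option, sketched here on synthetic labels with the same ratio, is the `stratify` argument of `train_test_split`, which preserves the class proportions in both subsets. The names `X_demo` and `Y_demo` are placeholders, not the notebook's arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same 3:1 class ratio as the notebook's labels.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 256, size=(400, 12)).astype("uint8")
Y_demo = np.array([0] * 300 + [1] * 100, dtype="uint8")

# stratify=Y_demo keeps the share of each class equal in both subsets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0, stratify=Y_demo)

print(Y_tr.mean(), Y_te.mean())
```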


In [10]:
# 0-1 scaling was left disabled for this run, so the models below are fit
# on raw 0-255 pixel values.
#X_train = (X_train - np.min(X_train, 0)) / (np.max(X_train, 0) - np.min(X_train, 0) + 0.0001)  # 0-1 scaling
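scikit-learn's `BernoulliRBM` treats its inputs as binary values or probabilities in [0, 1], so 8-bit pixel data would normally be rescaled before fitting. The sketch below is one simple way to do that, assuming the pixels are 8-bit values; it divides by the fixed constant 255 rather than using the per-feature min-max formula in the commented-out line, and the helper name `scale_pixels` is hypothetical.

```python
import numpy as np

def scale_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return np.asarray(pixels, dtype=np.float64) / 255.0

# Tiny synthetic example (not the ship data).
X_demo = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)
X_scaled = scale_pixels(X_demo)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```

Because the divisor is a constant, the same transformation can be applied to training and test data without leaking information between them.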

In [20]:
# Models we will use
logistic = linear_model.LogisticRegression(solver='lbfgs', max_iter=10000,
                                           multi_class='multinomial')
rbm = BernoulliRBM(random_state=0, verbose=True)

rbm_features_classifier = Pipeline(
    steps=[('rbm', rbm), ('logistic', logistic)])

In [21]:
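# Hyperparameters: number of RBM hidden units and the inverse
# regularization strength of the logistic regression
rbm.n_components = 100
logistic.C = 6000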
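Assigning attributes on the estimators works because the pipeline holds references to the same objects. An alternative that keeps all hyperparameters in one place is `Pipeline.set_params` with the `<step name>__<parameter>` convention. This is a sketch of that pattern using the same step names as above; the variable `pipe` is a placeholder.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("rbm", BernoulliRBM(random_state=0)),
    ("logistic", LogisticRegression(solver="lbfgs", max_iter=10000)),
])

# "<step name>__<parameter>" addresses a parameter of a named step.
pipe.set_params(rbm__n_components=100, logistic__C=6000)

print(pipe.named_steps["rbm"].n_components, pipe.named_steps["logistic"].C)
```

The same parameter names can be used as keys in a `GridSearchCV` parameter grid if the values are to be tuned rather than fixed by hand.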

In [26]:

np.array(np.unique(Y_train, return_counts=True)).T

array([[   0, 2419],
       [   1,  781]])

In [27]:
print(X_train)

[[110 108 109 ...  99  95  92]
 [ 91  93  91 ...  91  90  90]
 [193 195 195 ... 186 186 194]
 ...
 [ 77  76  83 ...  75  76  76]
 [111 108 108 ...  95  97  99]
 [128 127 125 ... 116 114 121]]


In [28]:
# #############################################################################
# Training
# More components tend to give better prediction performance, but take
# longer to fit
# Train the RBM-Logistic pipeline
rbm_features_classifier.fit(X_train, Y_train)

# Train a logistic regression classifier directly on the raw pixels
raw_pixel_classifier = clone(logistic)
raw_pixel_classifier.C = 100.
raw_pixel_classifier.fit(X_train, Y_train)

# #############################################################################

[BernoulliRBM] Iteration 1, pseudo-likelihood = 0.00, time = 14.01s
[BernoulliRBM] Iteration 2, pseudo-likelihood = 0.00, time = 13.86s
[BernoulliRBM] Iteration 3, pseudo-likelihood = 0.00, time = 14.35s
[BernoulliRBM] Iteration 4, pseudo-likelihood = 0.00, time = 13.48s
[BernoulliRBM] Iteration 5, pseudo-likelihood = 0.00, time = 12.97s
[BernoulliRBM] Iteration 6, pseudo-likelihood = 0.00, time = 13.53s
[BernoulliRBM] Iteration 7, pseudo-likelihood = 0.00, time = 15.27s
[BernoulliRBM] Iteration 8, pseudo-likelihood = 0.00, time = 14.75s
[BernoulliRBM] Iteration 9, pseudo-likelihood = 0.00, time = 13.70s
[BernoulliRBM] Iteration 10, pseudo-likelihood = 0.00, time = 13.75s


LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=10000, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
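The training log reports a pseudo-likelihood of 0.00 at every iteration, which is worth a closer look. One possible check, sketched here on synthetic data rather than the ship images, is to compare how much the hidden-unit activations vary across samples when the RBM is fit on raw 0-255 values versus values rescaled to [0, 1]. Near-constant activations would leave a downstream classifier little to work with. The helper `hidden_feature_spread` and the array sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
# Synthetic 8-bit "images" (200 samples, 48 features); not the ship data.
X_raw = rng.randint(0, 256, size=(200, 48)).astype(np.float64)

def hidden_feature_spread(X, n_components=8):
    """Fit a small RBM and return the per-unit std of its hidden activations."""
    rbm = BernoulliRBM(n_components=n_components, n_iter=5, random_state=0)
    hidden = rbm.fit_transform(X)
    return hidden.std(axis=0)

spread_raw = hidden_feature_spread(X_raw)             # inputs in [0, 255]
spread_scaled = hidden_feature_spread(X_raw / 255.0)  # inputs in [0, 1]
print("raw:   ", spread_raw.round(4))
print("scaled:", spread_scaled.round(4))
```

Running the same comparison on `X_train` would show whether the features passed to the logistic regression carry any sample-to-sample variation.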

In [29]:
# Evaluation

y_pred = rbm_features_classifier.predict(X_test)
print("Logistic regression using RBM features:\n%s\n" % (
    metrics.classification_report(Y_test, y_pred)))

y_pred = raw_pixel_classifier.predict(X_test)
print("Logistic regression using raw pixel features:\n%s\n" % (
    metrics.classification_report(Y_test, y_pred)))

Logistic regression using RBM features:
              precision    recall  f1-score   support

           0       0.73      1.00      0.84       581
           1       0.00      0.00      0.00       219

   micro avg       0.73      0.73      0.73       800
   macro avg       0.36      0.50      0.42       800
weighted avg       0.53      0.73      0.61       800


Logistic regression using raw pixel features:
              precision    recall  f1-score   support

           0       0.94      0.90      0.92       581
           1       0.76      0.85      0.80       219

   micro avg       0.89      0.89      0.89       800
   macro avg       0.85      0.87      0.86       800
weighted avg       0.89      0.89      0.89       800
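In the report above, the RBM pipeline predicts class 0 for every test image (recall 0.00 for class 1), while logistic regression on raw pixels reaches 0.89 accuracy. A plausible cause is that the RBM was fit on raw 0-255 values, whereas `BernoulliRBM` expects inputs that are binary or in [0, 1]. The sketch below shows one possible revision, with a `MinMaxScaler` step placed in front of the RBM, fit on synthetic data only. The variable names, the synthetic data, and the small `n_components` value are assumptions for illustration, and no claim is made about the scores this would produce on the ship images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
# Synthetic stand-in: class 1 samples are brighter on average than class 0.
y_demo = np.array([0] * 150 + [1] * 50)
X_demo = rng.randint(0, 128, size=(200, 48)) + 100 * y_demo[:, None]

scaled_rbm_classifier = Pipeline(steps=[
    ("scale", MinMaxScaler()),  # maps each feature to [0, 1] using the fit data
    ("rbm", BernoulliRBM(n_components=16, n_iter=10, random_state=0)),
    ("logistic", LogisticRegression(solver="lbfgs", max_iter=10000)),
])
scaled_rbm_classifier.fit(X_demo, y_demo)
y_hat = scaled_rbm_classifier.predict(X_demo)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, y_hat)
print(cm)
```

A confusion matrix makes the failure mode in the original report easy to spot: a second column of zeros means no sample was ever predicted as a ship.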




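To present the features the RBM extracts, each row of `components_` can be reshaped back into the image layout, since every hidden unit holds one weight per input pixel. This is a minimal sketch on small synthetic chips so that it runs quickly; `SIZE`, `X_demo`, and `rbm_demo` are placeholders, and for the notebook's data the chip size would be 80.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

SIZE = 8  # small chips keep the sketch fast; the notebook's chips are 80x80
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3 * SIZE * SIZE)  # synthetic values already in [0, 1]

rbm_demo = BernoulliRBM(n_components=4, n_iter=5, random_state=0).fit(X_demo)

# components_ has shape (n_components, n_features); each row is one hidden
# unit's weights over the flattened [R..., G..., B...] input.
filters = rbm_demo.components_.reshape(-1, 3, SIZE, SIZE).transpose(0, 2, 3, 1)

# Rescale each filter to [0, 1] so it could be shown with plt.imshow.
lo = filters.min(axis=(1, 2, 3), keepdims=True)
hi = filters.max(axis=(1, 2, 3), keepdims=True)
filters_01 = (filters - lo) / (hi - lo + 1e-12)
print(filters_01.shape)  # (4, 8, 8, 3)
```

Plotting these arrays in a grid would give a visual summary of what each hidden unit responds to, which is the kind of feature presentation the assignment asks for.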
