Version 1.1.0

# The task

In this assignment you will implement features based on nearest neighbors.

A KNN classifier (or regressor) is a very powerful model when the features are homogeneous, and it is common practice to use KNN as a first-level model. In this homework we will extend the KNN model and compute more features based on nearest neighbors and their distances.

You will implement a number of features that were among the key features that led the instructors to prizes in the [Otto](https://www.kaggle.com/c/otto-group-product-classification-challenge) and [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response) competitions. The list can of course be extended; in the competitions it was at least 3 times larger. So when solving a real competition, do not hesitate to make up your own features.

You can optionally implement multicore feature computation. Nearest neighbors are expensive to compute, so a parallel version of the algorithm is preferable. Knowing how to use `multiprocessing`, `joblib`, and similar tools is a useful skill, and in this homework you will have a chance to see the benefits of a parallel algorithm.

# Check your versions

Some functions we use here are not present in old versions of the libraries, so make sure you have up-to-date software. 

In [1]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 
from multiprocessing import Pool

for p in [np, pd, sklearn, scipy]:
    print (p.__name__, p.__version__)

numpy 1.13.1
pandas 0.20.3
sklearn 0.19.0
scipy 0.19.1


The versions should be at least:

    numpy 1.13.1
    pandas 0.20.3
    sklearn 0.19.0
    scipy 0.19.1
   
**IMPORTANT!** The results with `scipy=1.0.0` will be different! Make sure you use _exactly_ version `0.19.1`.

# Load data

Load the features and labels. These features are actually out-of-fold (OOF) predictions of linear models.

In [2]:
train_path = '../readonly/KNN_features_data/X.npz'
train_labels = '../readonly/KNN_features_data/Y.npy'

test_path = '../readonly/KNN_features_data/X_test.npz'
test_labels = '../readonly/KNN_features_data/Y_test.npy'

# Train data
X = scipy.sparse.load_npz(train_path)
Y = np.load(train_labels)

# Test data
X_test = scipy.sparse.load_npz(test_path)
Y_test = np.load(test_labels)

# Out-of-fold features we loaded above were generated with n_splits=4 and skf seed 123
# So it is better to use seed 123 for generating KNN features as well 
skf_seed = 123
n_splits = 4

<h2>How to use KNN for feature generation</h2>

In [3]:
X[1:2]

<1x52418 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

Below you need to implement features, based on nearest neighbors.

In [4]:
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10, 
                                      n_jobs=1, 
                                      algorithm='brute')

Here `n_neighbors` is the maximum number of neighbors to compute.

In [5]:
XX = X[:100]

In [6]:
res = nn.fit(XX)

Find the neighbors of a single point, `XX[1:2]`:

In [7]:
neighs_dist, neighs = res.kneighbors(XX[1:2])

In [8]:
print(neighs, neighs_dist)

[[ 1 95 29 87 59 72 41 43 18 94]] [[ 0.          1.41421356  1.41421356  1.41421356  1.41421356  1.41421356
   1.41421356  1.41421356  1.41421356  1.41421356]]


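`neighs` holds the indices of the 10 closest neighbors of `XX[1:2]`.<br>
`neighs_dist` holds the distances to those 10 neighbors. The first neighbor is the query point itself (index 1, distance 0), because `XX[1:2]` is part of the data `nn` was fitted on.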
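
To make the two return values concrete, here is a NumPy-only sketch of what a brute-force Euclidean search computes. It is not sklearn's implementation: `brute_force_kneighbors` and the toy `train` array are hypothetical names used only for illustration, and tie-breaking among equidistant points may differ from `NearestNeighbors`.

```python
import numpy as np

def brute_force_kneighbors(train, query, n_neighbors):
    """Return (distances, indices) of the training rows closest to `query`."""
    # Euclidean distance from the query point to every training row
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    # Stable sort so that ties keep their original row order
    order = np.argsort(dists, kind="stable")[:n_neighbors]
    return dists[order], order

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
dists, idx = brute_force_kneighbors(train, train[1], n_neighbors=3)
print(idx)    # the query row itself comes first, at distance 0
print(dists)
```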

In [9]:
YY = Y[:100]

In [10]:
neighs_y = YY[neighs[0]]

In [11]:
print(neighs_y)

[26  0  0 19 22 14  8 11  1 13]


These are the labels of the 10 closest neighbors.

In [12]:
k = 5
k_closest_neighs_y = neighs_y[:k]
print(k_closest_neighs_y)

[26  0  0 19 22]


These are the labels of the `k` closest neighbors of `XX[1:2]`.

In [13]:
classes_count = np.bincount(k_closest_neighs_y)
print(classes_count)

[2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1]


`classes_count[c]` is the number of times class `c` appears in `k_closest_neighs_y`.

In [14]:
n_classes = 30

In [15]:
feats = [0 for _ in range(n_classes)]
class_counts = np.bincount(neighs_y[1:k+1])
total_classes = sum(class_counts)
for class_number in range(n_classes):
    if class_number < len(class_counts):
        feats[class_number] = class_counts[class_number] / total_classes
assert len(feats) == n_classes

 1. Fraction of objects of every class.
    It is basically a KNN classifier's predictions.

    Take a look at the `np.bincount` function; it can be very helpful.
    Note that the values should sum to one.
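
A minimal sketch of the same feature using `np.bincount` with its `minlength` argument, which pads the counts to `n_classes` so no explicit loop over classes is needed. The helper name `class_fractions` and the toy labels are assumptions made for this example.

```python
import numpy as np

def class_fractions(neighbor_labels, k, n_classes):
    """Fraction of each class among the first `k` neighbor labels."""
    # minlength guarantees one slot per class even if high labels are absent
    counts = np.bincount(neighbor_labels[:k], minlength=n_classes)
    return counts / counts.sum()

labels = np.array([0, 0, 19, 22, 14, 8, 11, 1, 13])
feats = class_fractions(labels, k=5, n_classes=30)
print(feats[[0, 14, 19, 22]])
```

Because floating-point sums are rarely exact, comparing the total to 1 with a tolerance is safer than `== 1`.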

In [16]:
print(feats)

[0.40000000000000002, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.20000000000000001, 0.0, 0.0, 0.0, 0.0, 0.20000000000000001, 0.0, 0.0, 0.20000000000000001, 0, 0, 0, 0, 0, 0, 0]


In [17]:
assert sum(feats) == 1

In [18]:
neighs_y == neighs_y[0]

array([ True, False, False, False, False, False, False, False, False, False], dtype=bool)

In [19]:
x = np.arange(9.).reshape(3, 3)
print(x)
np.where(x > 5)

[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]


(array([2, 2, 2]), array([0, 1, 2]))

For point 2, I don't really know how to use `np.where`, so the cell below uses a plain loop instead.

In [20]:
neighs_y = [1,1,1,5,2,7,5]
'''
    2. Same label streak: the largest number N, 
       such that N nearest neighbors have the same label.

       What can help you: `np.where`
'''

res = 1
first_class = neighs_y[0]
for i in range(1,len(neighs_y)):
    if neighs_y[i] == first_class: res += 1
    else: break
print(res)

3
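
For comparison, one possible way to get the same streak with `np.where`: find the positions whose label differs from the first label, and take the first such position. This is a sketch, and `same_label_streak` is a hypothetical helper name.

```python
import numpy as np

def same_label_streak(neighbor_labels):
    """Largest N such that the N nearest neighbors share one label."""
    labels = np.asarray(neighbor_labels)
    # Positions where the label differs from the nearest neighbor's label
    mismatches = np.where(labels != labels[0])[0]
    # The first mismatch position equals the streak length;
    # with no mismatch at all, every neighbor agrees
    return int(mismatches[0]) if mismatches.size else len(labels)

print(same_label_streak([1, 1, 1, 5, 2, 7, 5]))
```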


In [21]:
neighs_y = YY[neighs[0]]
neighs_dist = neighs_dist[0]

In [22]:
print(neighs_y)
print(neighs_dist)

[26  0  0 19 22 14  8 11  1 13]
[ 0.          1.41421356  1.41421356  1.41421356  1.41421356  1.41421356
  1.41421356  1.41421356  1.41421356  1.41421356]


In [23]:
'''
    3. Minimum distance to objects of each class
       Find the first instance of a class and take its distance as features.

       If there are no neighboring objects of some classes, 
       Then set distance to that class to be 999.

       `np.where` might be helpful
'''
feats = [999 for _ in range(n_classes)]
for i, _class in enumerate(neighs_y):
    feats[_class] = min(neighs_dist[i], feats[_class])

print(feats)
assert len(feats) == n_classes

[1.4142135623730949, 1.4142135623730949, 999, 999, 999, 999, 999, 999, 1.4142135623730949, 999, 999, 1.4142135623730949, 999, 1.4142135623730949, 1.4142135623730949, 999, 999, 999, 999, 1.4142135623730949, 999, 999, 1.4142135623730949, 999, 999, 999, 0.0, 999, 999, 999]
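
The loop above can also be written without Python-level iteration. The sketch below uses `np.minimum.at`, which applies the minimum in place and handles repeated class indices correctly; `min_distance_per_class` and the toy arrays are names made up for this example.

```python
import numpy as np

def min_distance_per_class(neighbor_labels, neighbor_dists, n_classes, missing=999.0):
    """Smallest neighbor distance for each class, `missing` if the class is absent."""
    feats = np.full(n_classes, missing, dtype=float)
    # Unbuffered in-place minimum: each label keeps its smallest distance
    np.minimum.at(feats, neighbor_labels, neighbor_dists)
    return feats

labels = np.array([2, 0, 2, 1])
dists = np.array([0.5, 0.7, 0.9, 1.2])
print(min_distance_per_class(labels, dists, n_classes=4))
```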


In [24]:
min(neighs_dist[1:])

1.4142135623730949

In [25]:
'''
    4. Minimum *normalized* distance to objects of each class
       As 3. but we normalize (divide) the distances
       by the distance to the closest neighbor.

       If there are no neighboring objects of some classes, 
       Then set distance to that class to be 999.

       Do not forget to add self.eps to denominator.
'''
distance_to_closest = min(neighs_dist[1:]) + 1e-6
feats = [999 / (distance_to_closest) for _ in range(n_classes)] # should I normalize also 999?
for i, _class in enumerate(neighs_y):
    feats[_class] = min(neighs_dist[i] / distance_to_closest, feats[_class])

print(feats)
assert len(feats) == n_classes

[0.99999929289371892, 0.99999929289371892, 706.39917490571429, 706.39917490571429, 706.39917490571429, 706.39917490571429, 706.39917490571429, 706.39917490571429, 0.99999929289371892, 706.39917490571429, 706.39917490571429, 0.99999929289371892, 706.39917490571429, 0.99999929289371892, 0.99999929289371892, 706.39917490571429, 706.39917490571429, 706.39917490571429, 706.39917490571429, 0.99999929289371892, 706.39917490571429, 706.39917490571429, 0.99999929289371892, 706.39917490571429, 706.39917490571429, 706.39917490571429, 0.0, 706.39917490571429, 706.39917490571429, 706.39917490571429]
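
On the question of whether 999 should be normalized too: the class implementation later in this notebook leaves the 999 placeholder unscaled, and the sketch below does the same by scaling only the real distances. It assumes the distances are sorted in ascending order, so index 0 is the closest neighbor; the helper name is hypothetical.

```python
import numpy as np

def min_normalized_distance_per_class(neighbor_labels, neighbor_dists, n_classes,
                                      eps=1e-6, missing=999.0):
    """Like the minimum distance per class, but divided by the closest distance."""
    # eps keeps the division defined when the closest distance is 0
    scaled = neighbor_dists / (neighbor_dists[0] + eps)
    feats = np.full(n_classes, missing, dtype=float)
    # Only real distances are scaled; absent classes keep the raw placeholder
    np.minimum.at(feats, neighbor_labels, scaled)
    return feats

labels = np.array([2, 0, 2, 1])
dists = np.array([0.5, 1.0, 1.5, 2.0])
norm_feats = min_normalized_distance_per_class(labels, dists, n_classes=4)
print(norm_feats)
```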


In [26]:
'''
    5. 
       5.1 Distance to Kth neighbor
           Think of this as of quantiles of a distribution
       5.2 Distance to Kth neighbor normalized by 
           distance to the first neighbor

       feat_51, feat_52 are answers to 5.1. and 5.2.
       should be scalars

       Do not forget to add self.eps to denominator.
'''
for k in [2,4,8]:

    feat_51 = neighs_dist[k - 1] #distances is zero based
    feat_52 = neighs_dist[k - 1] / distance_to_closest

print([[feat_51, feat_52]])

[[1.4142135623730949, 0.99999929289371892]]
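
Note that the loop above overwrites `feat_51` and `feat_52` on every iteration, so only the values for the last `k` are printed. A small sketch that keeps one pair per `k` instead (the helper name and toy distances are assumptions for illustration):

```python
import numpy as np

def kth_neighbor_distances(neighbor_dists, k_list, eps=1e-6):
    """Raw and normalized distance to the k-th neighbor, for every k in k_list."""
    feats = []
    for k in k_list:
        d_k = neighbor_dists[k - 1]  # k-th neighbor sits at zero-based index k - 1
        feats.append([d_k, d_k / (neighbor_dists[0] + eps)])
    return np.hstack(feats)

dists = np.array([0.5, 1.0, 1.5, 2.0])
kth_feats = kth_neighbor_distances(dists, k_list=[2, 4])
print(kth_feats)
```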


In [27]:
k = 10
neighs_dist[k - 1]

1.4142135623730949

In [28]:
len(neighs_dist)

10

In [29]:
neighs_y

array([26,  0,  0, 19, 22, 14,  8, 11,  1, 13])

In [30]:
neighs_dist

array([ 0.        ,  1.41421356,  1.41421356,  1.41421356,  1.41421356,
        1.41421356,  1.41421356,  1.41421356,  1.41421356,  1.41421356])

In [31]:
print(np.bincount(neighs_y, neighs_dist))  # bincount with distances as weights: per-class sum of distances
print(np.bincount(neighs_y))  # plain bincount: per-class neighbor counts
mean_classes = np.bincount(neighs_y, neighs_dist) / (np.bincount(neighs_y) + 1e-6)  # per-class mean distance
print(mean_classes)

[ 2.82842712  1.41421356  0.          0.          0.          0.          0.
  0.          1.41421356  0.          0.          1.41421356  0.
  1.41421356  1.41421356  0.          0.          0.          0.
  1.41421356  0.          0.          1.41421356  0.          0.          0.
  0.        ]
[2 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1]
[ 1.41421286  1.41421215  0.          0.          0.          0.          0.
  0.          1.41421215  0.          0.          1.41421215  0.
  1.41421215  1.41421215  0.          0.          0.          0.
  1.41421215  0.          0.          1.41421215  0.          0.          0.
  0.        ]


In [32]:
mean_classes > 0

array([ True,  True, False, False, False, False, False, False,  True,
       False, False,  True, False,  True,  True, False, False, False,
       False,  True, False, False,  True, False, False, False, False], dtype=bool)

In [33]:
'''
    6. Mean distance to neighbors (of x) belonging to each class, for each K from `k_list`
           For each class, select the neighbors of x belonging to that specific class among the K nearest neighbors
           and compute the average distance from x to those objects

           If there are no objects of a certain class among K neighbors, set mean distance to 999

       You can use `np.bincount` with appropriate weights
       Don't forget, that if you divide by something, 
       You need to add `self.eps` to denominator.
'''
for k in [2,4]:
    feats = [999 for _ in range(n_classes)]
    k_closest_neighs_y = neighs_y[1:k + 1]
    k_neighs_dist = neighs_dist[1:k + 1]
    mean_distance_classes = np.bincount(k_closest_neighs_y, k_neighs_dist) / (np.bincount(k_closest_neighs_y) + 1e-6)
    for i, _class in enumerate(mean_distance_classes):
        if _class > 0:
            feats[i] = _class
    
    print(feats)
    assert len(feats) == n_classes

[1.4142128552666673, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
[1.4142128552666673, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 1.4142121481609469, 999, 999, 1.4142121481609469, 999, 999, 999, 999, 999, 999, 999]
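
A vectorized sketch of the same idea, using two `np.bincount` calls (one weighted by distance, one plain) and `np.where` on the counts. Deciding "class absent" from the counts rather than from the mean being positive differs from the loop above in one edge case: a class whose only neighbors lie at distance 0 gets mean 0 here instead of 999. The source does not say how the reference features treat that case, so treat this as an assumption; the helper name is hypothetical.

```python
import numpy as np

def mean_distance_per_class(neighbor_labels, neighbor_dists, k, n_classes,
                            eps=1e-6, missing=999.0):
    """Mean distance to the neighbors of each class among the k nearest."""
    labels, dists = neighbor_labels[:k], neighbor_dists[:k]
    counts = np.bincount(labels, minlength=n_classes)
    sums = np.bincount(labels, weights=dists, minlength=n_classes)
    # Classes with no neighbor among the first k get the placeholder value
    return np.where(counts > 0, sums / (counts + eps), missing)

labels = np.array([2, 0, 2, 1])
dists = np.array([0.0, 0.7, 0.9, 1.2])
mean_feats = mean_distance_per_class(labels, dists, k=3, n_classes=4)
print(mean_feats)
```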


------------

<h1>How to implement multiprocessing</h1>

In [34]:
from tqdm import tqdm_notebook as tqdm
def _compute_feats(_X):
    test_feats = []
    for i in tqdm(range(_X.shape[0])):
        for _ in range(300): pass
        feats = np.arange(2) # of course this is a mock of the real compute feats
        assert feats.shape == (2,) or feats.shape == (2, 1)
        test_feats.append(feats)
    return test_feats

The function above is the computation we want to split across multiple processes.

In [35]:
from time import time
t1 = time()
got = _compute_feats(X_test)
print(time() - t1)

0.21630549430847168


In [36]:
int(X.shape[0]/4) + 1

12569

In [37]:
"""this is a good multiprocessing implementation including TQDM
X_test: input sparse matrix to be split into chunks
_compute_feats: function that will be applied to each chunk
"""

from multiprocessing import Pool
from time import time
from tqdm import tqdm_notebook as tqdm

n_jobs = 4

# Split the dataset into n_jobs contiguous chunks
splitted = []
prev = 0
for x in range(n_jobs):
    i = min(prev + int(X_test.shape[0]/n_jobs) + 1, X_test.shape[0])
    print(prev, i)
    splitted.append(X_test[prev:i])
    prev = i
print(splitted)

t1 = time()
res = []
with Pool(processes=n_jobs) as p:
    with tqdm(total=len(splitted)) as pbar:
        # imap (unlike imap_unordered) yields chunk results in input order,
        # so the recombined rows line up with the rows of X_test
        for _res in p.imap(_compute_feats, splitted):
            print(_res[:10])
            res.append(_res)
            pbar.update()
test_feats = [elem for sublist in res for elem in sublist]  # recombine the chunk results
print(time() - t1)


0 1397
1397 2794
2794 4191
4191 5586
[<1397x52418 sparse matrix of type '<class 'numpy.float64'>'
	with 7647 stored elements in Compressed Sparse Row format>, <1397x52418 sparse matrix of type '<class 'numpy.float64'>'
	with 7635 stored elements in Compressed Sparse Row format>, <1397x52418 sparse matrix of type '<class 'numpy.float64'>'
	with 7411 stored elements in Compressed Sparse Row format>, <1395x52418 sparse matrix of type '<class 'numpy.float64'>'
	with 7464 stored elements in Compressed Sparse Row format>]


[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

1.089226245880127


In [38]:
len(got),len(test_feats)

(5586, 5586)

In [39]:
test_feats[:2], got[:2]

([array([0, 1]), array([0, 1])], [array([0, 1]), array([0, 1])])

In [40]:
np.vstack(test_feats)

array([[0, 1],
       [0, 1],
       [0, 1],
       ..., 
       [0, 1],
       [0, 1],
       [0, 1]])
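
When chunk results are glued back together, their order matters: a pool's `imap_unordered` yields chunks in completion order, while `imap` yields them in submission order, so only the latter guarantees that row `i` of the output belongs to row `i` of the input. The sketch below shows the split, map, and recombine pattern. To stay runnable in any environment it uses `multiprocessing.pool.ThreadPool`, which exposes the same `imap` API; swapping in `multiprocessing.Pool` gives real processes, provided the worker function is picklable. `chunk_bounds`, `row_sums`, and `parallel_apply` are hypothetical names.

```python
import numpy as np
from multiprocessing.pool import ThreadPool  # same map/imap API as multiprocessing.Pool

def chunk_bounds(n_rows, n_jobs):
    """Start/stop pairs covering range(n_rows) in at most n_jobs contiguous chunks."""
    edges = np.linspace(0, n_rows, n_jobs + 1).astype(int)
    # Skip empty chunks, which appear when n_jobs exceeds n_rows
    return [(a, b) for a, b in zip(edges[:-1], edges[1:]) if b > a]

def row_sums(chunk):
    # Stand-in for the real per-chunk feature computation
    return [row.sum() for row in chunk]

def parallel_apply(func, X, n_jobs):
    chunks = [X[a:b] for a, b in chunk_bounds(X.shape[0], n_jobs)]
    with ThreadPool(n_jobs) as pool:
        # imap yields results in submission order, so rows stay aligned with X
        results = list(pool.imap(func, chunks))
    return [item for part in results for item in part]

X_demo = np.arange(20).reshape(10, 2)
out = parallel_apply(row_sums, X_demo, n_jobs=4)
print(out)
```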

In [41]:
del splitted
del got
del test_feats

-------------

In [42]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from multiprocessing import Pool
from tqdm import tqdm_notebook as tqdm

import numpy as np


class NearestNeighborsFeats(BaseEstimator, ClassifierMixin):
    '''
        This class should implement KNN features extraction 
    '''
    def __init__(self, n_jobs, k_list, metric, n_classes=None, n_neighbors=None, eps=1e-6):
        self.n_jobs = n_jobs
        self.k_list = k_list
        self.metric = metric
        
        if n_neighbors is None:
            self.n_neighbors = max(k_list) 
        else:
            self.n_neighbors = n_neighbors
            
        self.eps = eps        
        self.n_classes_ = n_classes
    
    def fit(self, X, y):
        '''
            Sets up the train set and the self.NN object
        '''
        # Create a NearestNeighbors (NN) object. We will use it in `predict` function 
        self.NN = NearestNeighbors(n_neighbors=max(self.k_list), 
                                      metric=self.metric, 
                                      n_jobs=1, 
                                      algorithm='brute' if self.metric=='cosine' else 'auto')
        self.NN.fit(X)
        
        # Store labels 
        self.y_train = y
        
        # Save how many classes we have
        self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_
        
    def compute_feats(self, _X):
        test_feats = []
        for i in tqdm(range(_X.shape[0])):
            test_feats.append(self.get_features_for_one(_X[i:i+1]))
        return test_feats
        
        
    def predict(self, X):       
        '''
            Produces KNN features for every object of a dataset X
        '''
               
        if self.n_jobs == 1:
            test_feats = self.compute_feats(X)
        else:
            # Split X into n_jobs contiguous chunks
            splitted = []
            prev = 0
            for x in range(self.n_jobs):
                i = min(prev + int(X.shape[0]/self.n_jobs) + 1, X.shape[0])
                print(prev, i)
                splitted.append(X[prev:i])
                prev = i
                
            # Start the worker processes
            res = []
            with Pool(processes=self.n_jobs) as p:
                with tqdm(total=len(splitted)) as pbar:
                    # imap keeps the chunk order, so output rows match the rows of X
                    # (imap_unordered would return chunks in completion order)
                    for _res in p.imap(self.compute_feats, splitted):
                        res.append(_res)
                        pbar.update()
            
            test_feats = [elem for sublist in res for elem in sublist]  # recombine the chunk results

    
        return np.vstack(test_feats)
        
        
    def get_features_for_one(self, x):
        '''
            Computes KNN features for a single object `x`
        '''

        NN_output = self.NN.kneighbors(x)
        
        # Vector of size `n_neighbors`
        # Stores indices of the neighbors
        neighs = NN_output[1][0]
        
        # Vector of size `n_neighbors`
        # Stores distances to corresponding neighbors
        neighs_dist = NN_output[0][0] 

        # Vector of size `n_neighbors`
        # Stores labels of corresponding neighbors
        neighs_y = self.y_train[neighs] 
        
        ## ========================================== ##
        ##              YOUR CODE BELOW
        ## ========================================== ##
        
        # We will accumulate the computed features here
        # Eventually it will be a list of lists or np.arrays
        # and we will use np.hstack to concatenate those
        return_list = [] 
        
        
        ''' 
            1. Fraction of objects of every class.
               It is basically a KNN classifier's predictions.

               Take a look at `np.bincount` function, it can be very helpful
               Note that the values should sum up to one
        '''
        for k in self.k_list:
            feats = [0 for _ in range(self.n_classes)]
            class_counts = np.bincount(neighs_y[:k])  # labels of the k nearest neighbors; slicing never goes out of bounds
            total_classes = sum(class_counts)
            for class_number in range(self.n_classes):
                if class_number < len(class_counts):
                    feats[class_number] = class_counts[class_number] / total_classes
            
            assert len(feats) == self.n_classes
            assert 0.999 <= sum(feats) <= 1.001
            return_list += [feats]
        
        
        '''
            2. Same label streak: the largest number N, 
               such that N nearest neighbors have the same label.
               
               What can help you: `np.where`
        '''
        res = 1
        first_class = neighs_y[0]
        for i in range(1,len(neighs_y)):
            if neighs_y[i] == first_class: res += 1
            else: break
        feats = [res]
        
        assert len(feats) == 1
        return_list += [feats]
        
        '''
            3. Minimum distance to objects of each class
               Find the first instance of a class and take its distance as features.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.

               `np.where` might be helpful
        '''
        feats = [999 for _ in range(self.n_classes)]
        for i, _class in enumerate(neighs_y):
            feats[_class] = min(neighs_dist[i], feats[_class])

        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            4. Minimum *normalized* distance to objects of each class
               As 3. but we normalize (divide) the distances
               by the distance to the closest neighbor.
               
               If there are no neighboring objects of some classes, 
               Then set distance to that class to be 999.
               
               Do not forget to add self.eps to denominator.
        '''
        distance_to_closest = neighs_dist[0]
        feats = [999 for _ in range(self.n_classes)] 
        for i, _class in enumerate(neighs_y):
            if feats[_class] == 999:
                feats[_class] = (neighs_dist[i] / (distance_to_closest + self.eps))

        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            5. 
               5.1 Distance to Kth neighbor
                   Think of this as of quantiles of a distribution
               5.2 Distance to Kth neighbor normalized by 
                   distance to the first neighbor
               
               feat_51, feat_52 are answers to 5.1. and 5.2.
               should be scalars
               
               Do not forget to add self.eps to denominator.
        '''
        for k in self.k_list:
            
            feat_51 = neighs_dist[k - 1] #distances is zero based
            feat_52 = neighs_dist[k - 1] / (distance_to_closest + self.eps)

            return_list += [[feat_51, feat_52]]
        
        '''
            6. Mean distance to neighbors of each class for each K from `k_list` 
                   For each class select the neighbors of that class among K nearest neighbors 
                   and compute the average distance to those objects
                   
                   If there are no objects of a certain class among K neighbors, set mean distance to 999
                   
               You can use `np.bincount` with appropriate weights
               Don't forget, that if you divide by something, 
               You need to add `self.eps` to denominator.
        '''
        for k in self.k_list:
            feats = [999 for _ in range(self.n_classes)]
            k_closest_neighs_y = neighs_y[:k]
            k_neighs_dist = neighs_dist[:k]
            mean_distance_classes = np.bincount(k_closest_neighs_y, k_neighs_dist) / (np.bincount(k_closest_neighs_y) + self.eps)
            for i, _class in enumerate(mean_distance_classes):
                if _class > 0:
                    feats[i] = _class


            assert len(feats) == self.n_classes
            return_list += [feats]
        
        
        # merge
        knn_feats = np.hstack(return_list)
        
        assert knn_feats.shape == (239,) or knn_feats.shape == (239, 1)
        return knn_feats

## Sanity check

To make sure you've implemented everything correctly we provide you the correct features for the first 50 objects.

In [43]:
# a list of K in KNN, starts with one 
k_list = [3, 8, 32]

# Load correct features
true_knn_feats_first50 = np.load('../readonly/KNN_features_data/knn_feats_test_first50.npy')

# Create instance of our KNN feature extractor
NNF = NearestNeighborsFeats(n_jobs=1, k_list=k_list, metric='minkowski')

# Fit on train set
NNF.fit(X, Y)

# Get features for test
test_knn_feats = NNF.predict(X_test[:50])

# This should be zero
print ('Deviation from ground truth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50).sum())

deviation = np.abs(test_knn_feats - true_knn_feats_first50).sum(0)
for m in np.where(deviation > 1e-3)[0]: 
    p = np.where(np.array([87, 88, 117, 146, 152, 239]) > m)[0][0]
    print ('There is a problem in feature %d, which is a part of section %d.' % (m, p + 1))
    for i in range(50):
        if test_knn_feats[i][m] != true_knn_feats_first50[i][m]:
            print (i, test_knn_feats[i][m], true_knn_feats_first50[i][m])

Deviation from ground truth features: 0.000000


In [44]:
test_knn_feats[:10]

array([[   0.        ,    0.        ,    0.        , ...,  999.        ,
         999.        ,  999.        ],
       [   0.        ,    0.        ,    0.        , ...,  999.        ,
           1.31804749,  999.        ],
       [   0.        ,    0.        ,    0.        , ...,  999.        ,
         999.        ,  999.        ],
       ..., 
       [   1.        ,    0.        ,    0.        , ...,  999.        ,
         999.        ,  999.        ],
       [   0.        ,    0.        ,    0.        , ...,  999.        ,
         999.        ,  999.        ],
       [   0.        ,    0.        ,    0.        , ...,  999.        ,
           1.24359388,    1.26195733]])
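
The hard-coded boundaries `[87, 88, 117, 146, 152, 239]` in the sanity check follow from the widths of the six feature sections. The sketch below recomputes them, assuming 29 classes (the value implied by 239 = 8 * 29 + 7 with three values in `k_list`); `section_boundaries` is a hypothetical helper.

```python
import numpy as np

def section_boundaries(n_classes, n_k):
    """Cumulative end index of each of the six feature sections."""
    widths = [
        n_k * n_classes,  # 1. class fractions, one block per k
        1,                # 2. same-label streak
        n_classes,        # 3. minimum distance per class
        n_classes,        # 4. minimum normalized distance per class
        2 * n_k,          # 5. k-th neighbor distance, raw and normalized
        n_k * n_classes,  # 6. mean distance per class, one block per k
    ]
    return np.cumsum(widths).tolist()

print(section_boundaries(n_classes=29, n_k=3))  # [87, 88, 117, 146, 152, 239]
```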

Now implement parallel computations and compute features for the train and test sets. 

## Get features for test

Now compute features for the whole test set.

In [None]:
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Create instance of our KNN feature extractor
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Fit on train set
    NNF.fit(X, Y)

    # Get features for test
    test_knn_feats = NNF.predict(X_test)
    
    # Dump the features to disk
    np.save('knn_feats_%s_test.npy' % metric , test_knn_feats)

minkowski
0 1397
1397 2794
2794 4191
4191 5586


cosine
0 1397
1397 2794
2794 4191
4191 5586



## Get features for train

Compute features for train, using out-of-fold strategy.
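
As a reminder of what "out-of-fold" means here: each training row gets its features from a model fitted only on the other folds, so a row is never its own neighbor. The sketch below illustrates this with a plain shuffled K-fold and a toy feature (distance to the closest training row). It is a simplification, not the notebook's pipeline: it does not stratify by label, and all names in it are made up for illustration.

```python
import numpy as np

def nearest_train_distance(train, query):
    """Toy feature: distance from each query row to its closest training row."""
    diffs = query[:, None, :] - train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def out_of_fold_feature(X, n_splits, seed):
    rng = np.random.RandomState(seed)
    # Shuffle the row indices and cut them into n_splits folds
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.empty(len(X))
    for fold in folds:
        rest = np.setdiff1d(np.arange(len(X)), fold)
        # "Fit" on the other folds only, then fill in the held-out rows
        oof[fold] = nearest_train_distance(X[rest], X[fold])
    return oof

X_demo = np.random.RandomState(0).rand(20, 3)
oof = out_of_fold_feature(X_demo, n_splits=4, seed=123)
print(oof[:5])
```

Because no row is ever compared with itself, every value is strictly positive, unlike the walkthrough earlier where the query point showed up as its own nearest neighbor at distance 0.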

In [45]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=n_splits, random_state=skf_seed, shuffle=True)

In [46]:
from sklearn.model_selection import cross_val_predict
#cross_val_predict(NNF, X, Y, cv=skf, n_jobs=4)

In [None]:
# Unlike in the other homework, we will not implement OOF predictions ourselves
# but use sklearn's `cross_val_predict`
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold

# We will use two metrics for KNN
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Set up splitting scheme, use StratifiedKFold
    # use skf_seed and n_splits defined above with shuffle=True
    skf = StratifiedKFold(n_splits=n_splits, random_state=skf_seed, shuffle=True)
    
    # Create instance of our KNN feature extractor
    # n_jobs can be larger than the number of cores
    NNF = NearestNeighborsFeats(n_jobs=15, k_list=k_list, metric=metric)
    print('NNF')
    # Get KNN features using OOF use cross_val_predict with right parameters
    preds = cross_val_predict(NNF, X, Y, cv=skf)
    
    # Save the features
    np.save('knn_feats_%s_train.npy' % metric, preds)

minkowski
NNF
0 839
839 1678
1678 2517
2517 3356
3356 4195
4195 5034
5034 5873
5873 6712
6712 7551
7551 8390
8390 9229
9229 10068
10068 10907
10907 11746
11746 12578


0 839
839 1678
1678 2517
2517 3356
3356 4195
4195 5034
5034 5873
5873 6712
6712 7551
7551 8390
8390 9229
9229 10068
10068 10907
10907 11746
11746 12572


0 838
838 1676
1676 2514
2514 3352
3352 4190
4190 5028
5028 5866
5866 6704
6704 7542
7542 8380
8380 9218
9218 10056
10056 10894
10894 11732
11732 12568


0 838
838 1676
1676 2514
2514 3352
3352 4190
4190 5028
5028 5866
5866 6704
6704 7542
7542 8380
8380 9218
9218 10056
10056 10894
10894 11732
11732 12556


cosine
NNF
0 839
839 1678
1678 2517
2517 3356
3356 4195
4195 5034
5034 5873
5873 6712
6712 7551
7551 8390
8390 9229
9229 10068
10068 10907
10907 11746
11746 12578


0 839
839 1678
1678 2517
2517 3356
3356 4195
4195 5034
5034 5873
5873 6712
6712 7551
7551 8390
8390 9229
9229 10068
10068 10907
10907 11746
11746 12572


0 838
838 1676
1676 2514
2514 3352
3352 4190
4190 5028
5028 5866
5866 6704
6704 7542
7542 8380
8380 9218
9218 10056
10056 10894
10894 11732
11732 12568


0 838
838 1676
1676 2514
2514 3352
3352 4190
4190 5028
5028 5866
5866 6704
6704 7542
7542 8380
8380 9218
9218 10056
10056 10894
10894 11732
11732 12556


In [None]:
a=_

# Submit

If you made the above cells work, just run the following cell to produce a number to submit.

In [4]:
s = 0
for metric in ['minkowski', 'cosine']:
    knn_feats_train = np.load('knn_feats_%s_train.npy' % metric)
    knn_feats_test = np.load('knn_feats_%s_test.npy' % metric)
    
    s += knn_feats_train.mean() + knn_feats_test.mean()
    
answer = np.floor(s)
print (answer)

3838.0


Submit!

In [9]:
from grader import Grader

grader = Grader()

grader.submit_tag('statistic', answer)

STUDENT_EMAIL = 'alessandro.solbiati@post.com'
STUDENT_TOKEN = 'XTv5gl22jZQL3X3G'
grader.status()

grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Current answer for task statistic is: 3838.0
You want to submit these numbers:
Task statistic: 3838.0
Submitted to Coursera platform. See results on assignment page!
