# Fast(er) KNN Feature Selection

While working with KNN I realized how difficult it is to select good features for a classifier.  Moreover, with small data samples I needed to run long simulations with bootstrapping to smooth out the high variance in accuracy values.  Hence, a quick way to screen for potentially high-accuracy feature sets was imperative.  The following method began as an attempt to speed up the scikit-learn KNN algorithms, which I suspected carried a lot of overhead: handling arbitrary labels rather than just binary 1s and 0s, repeatedly computing distance matrices that do not change, and setting up the KNN class.  Initial testing showed that this method quickly picks out high-accuracy feature sets, as confirmed by the high accuracies found with the much slower simulations.  However, for the best k-values the method usually appears to overestimate the accuracy by 1-3%, which leads me to believe that it computes some sort of upper bound, though I have no proof of this claim yet.

Algorithm
1. Let X be a point cloud and let kMax-1 be the maximum number of nearest neighbors to check.
2. Compute the distance d(x_1,x_2) between every pair of points in the cloud and place the results in the distance matrix D.
3. Use the sort order of each row of D to sort the corresponding row of the label matrix L (that is, assign the label of x_2 to its distance in D), so that each row of L lists labels from nearest to farthest.
4. Construct the prediction matrix P, which fills each row with classifications according to majority-wins voting from the left of the row (neglecting the first label).  In this way, row i of P contains its own label at index 0 and the k-nearest-neighbor majority vote at index k.  That is, each index holds the label that the index-number of nearest neighbors would vote for.
5. Compute and return the average accuracy for each column (k-value); that is, for all rows compare the classification at index k with the value at index 0 and divide the number of agreeing labels by the number of rows.

In this way the algorithm essentially uses all of the data except one point to classify that point, for every k-value desired.  It does this once for each data point and then, for each k-value, averages the results over all points to get the accuracy at that k.
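
The five steps above can be sketched in a compact, vectorized form.  This is only an illustration, not the implementation used in this notebook: the name `loo_knn_accuracy` and the toy data are made up, labels are assumed to be binary 0/1, and (unlike the functions below) the diagonal of the distance matrix is set to -1 so that each point always sorts first in its own row even when duplicate points exist.

```python
import numpy as np

def loo_knn_accuracy(X, y, k_max):
    """Leave-one-out accuracy of binary (0/1) KNN for k = 1..k_max (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Step 2: all pairwise Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumption: force each point to be first in its own row.
    np.fill_diagonal(D, -1.0)
    # Step 3: sort every row of labels by distance (column 0 is the point itself).
    order = np.argsort(D, axis=1, kind="stable")
    sorted_labels = y[order]
    # Step 4: running count of 1-labels among the k nearest neighbors.
    ones = np.cumsum(sorted_labels[:, 1:k_max + 1], axis=1)
    ks = np.arange(1, k_max + 1)
    votes = (2 * ones > ks).astype(int)  # ties go to 0, as in majVote
    # Step 5: fraction of rows whose vote matches the true label, per k.
    return (votes == y[:, None]).mean(axis=0)

# Two well-separated 1-D clusters of three points each.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
acc_toy = loo_knn_accuracy(X_toy, y_toy, k_max=5)
print(acc_toy)
# k = 1..3 are perfect; at k = 4 every vote ties and goes to 0 (accuracy 0.5);
# at k = 5 every point is outvoted by the other cluster (accuracy 0.0).
```

The cumulative sum replaces the per-row, per-k call to a majority-vote function, which is where most of the Python-level looping would otherwise go.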

In [321]:
from time import time 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import MaxAbsScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from IPython.display import Image
import seaborn as sns

def getDistances(X,d = 'Euclidean'):
    '''Given a point cloud X, returns a Euclidean distance matrix.
    The metric argument d is a placeholder: only 'Euclidean' is
    implemented and the value of d is currently ignored.
    
    Expects X is a np.array of np.arrays all of the same length L.
    
    Returns a symmetric matrix of dimension N x N, where |X| = N.'''
    N = len(X)
    D = np.zeros((N,N))
    for i in range(N):
        for j in range(i,N):
            D[i][j] = np.linalg.norm(X[i]-X[j])
            D[j][i] = D[i][j]
    return D

def sortLabels(D,L):
    '''Given a square distance matrix D and a corresponding
    array of labels L, this method sorts a generated 
    labels matrix according to how the distance matrix rows  
    would be sorted.  Returns the row-sorted label matrix.   
    
    Expects D is an np.array matrix of numerical values.'''
    LMatrix = np.array([L for i in range(D.shape[0])])

    return np.take_along_axis(LMatrix,np.argsort(D),axis = 1)

def majVote(a):
    '''Assumes 'a' is an array of 1s and 0s.  Returns whichever
    of the two is the majority.  In the case of a tie, a 0 is
    returned.'''
    
    if a.mean() > 0.5:
        return 1
    else:
        return 0

def getPred(L,kMax=-1):
    '''Given a sorted square matrix L, this method computes how a
    KNN classifier would vote for each row for all values of k up
    to the kMax.
    
    Returns a L.shape[0] x kMax matrix of predicted labels. Row i
    contains its own label at index 0 and the k-nearest neighbor
    majority vote at index k.  That is, each index holds the label
    that the index-number of nearest neighbors would vote.'''
    if kMax == -1:
        kMax = L.shape[0]
    
    pred = np.zeros((L.shape[0],kMax))
    for i in range(L.shape[0]):
        for k in range(kMax):
            if k == 0 or k == 1:
                pred[i][k] = int(L[i][k])
            else:
                pred[i][k] = int(majVote(L[i][1:k+1]))
                
    return pred

def getAcc(P,kMax = -1):
    '''For a given prediction matrix, returns the accuracy that a KNN
    classifier would have if neighborhoods were left out one at a time 
    and classified based on all other neighborhoods for each k-value.'''
    if kMax == -1:
        kMax = P.shape[1]-1
        # Holds the accuracies for each k-value.
        kAcc = np.zeros(kMax)
    else:
        kAcc = np.zeros(kMax-1)
    
    for row in P:
        kAcc = kAcc + np.array([1 if row[0]==row[i] else 0 for i in range(1,len(row))])
    
    return kAcc/P.shape[0]

def KNN(data,labels,kMax = -1, prnt = True):
    '''Runs the pipeline of methods above.
    Get distances, feed into label sorter, get predictions,
    compute accuracies for leaving each neighborhood out.'''
    D = getDistances(data)
    L = sortLabels(D,labels)
    P = getPred(L,kMax)
    A = getAcc(P,kMax)
    
    if prnt:
        print("Label Matrix")
        print(L)
        print()
        print("Predictions")
        print(P)
        print()
        print("Accuracies for each K")
        print(A)
    
    return A

In [307]:
dataFrame = pd.read_csv("Dataset.csv")
fullfeaturelist = ['nei_final_simple','walk_score','transit_score','bike_score','population',
           'population_density','household_income','marital_status_married','marital_status_separated_divorce',
           'marital_status_widowed','marital_status_never_married','white_popl',
            'hispanic_popl','black_popl','asian_popl','mixed_popl','other_popl','food_stamps_total',
            'educational_attainment_no_hs','educational_attainment_bachelors','educational_attainment_very_advanced_degrees',
            'household_type_married_count', 'household_type_single_female_count',
            'household_type_single_male_count','household_type_one_person_count',
            'household_type_other_non_family_count','household_type_with_children','Age_0_to_17','Age_18_to_21',
            'Age_22_to_29','Age_30_to_39','Age_40_to_49','Age_50_to_59','Age_60_to_69','Age_70_to_79','Age_80_older']

featurelist = fullfeaturelist[1:]

# A new dataframe after removing job sector data since much of this is missing across many neighborhoods.
dfFull = dataFrame[fullfeaturelist]
# Taking out Pastures Neighborhood because of large lack of data.
df = pd.DataFrame(dfFull[dfFull['nei_final_simple']!= 'Pastures'],columns = fullfeaturelist)
# Taking out West End Neighborhood because of large lack of data.
df = pd.DataFrame(df[df['nei_final_simple'] != 'West End'],columns = fullfeaturelist)
df.reset_index(drop = True, inplace=True)

scaler = MaxAbsScaler()
scaler.fit(df[featurelist])
scaledData = scaler.transform(df[featurelist])
dfScaled = pd.DataFrame(scaledData,columns=featurelist)
dfScaled['nei_final_simple']=df['nei_final_simple']

# Labels WITH Eagle Hill and Pastures included.
#df.loc[:,'fridge_count'] = pd.Series(np.array([1,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,]),index=df.index)
# Labels WITHOUT Eagle Hill and Pastures included.
df.loc[:,'fridge_count'] = pd.Series(np.array([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0]),index=df.index)
dfScaled.loc[:,'fridge_count'] = pd.Series(np.array([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0]),index=dfScaled.index)

### Inspection...

We know several good classifiers for 2- and 3-feature sets.  We will inspect the results for a few to see how our output relates (if at all) and move on from there.

In [308]:
labels = np.array(dfScaled['fridge_count'])

In [319]:
# Known to have an average accuracy of 0.85314 for k = 1 after 10000 trials.
data = np.array(dfScaled[['marital_status_widowed','Age_18_to_21']])
accuracyList = KNN(data,labels,prnt=False)
accuracyList[:8]
# Looks like for k = 1 the algorithm suggests a very good accuracy.

array([0.88461538, 0.80769231, 0.80769231, 0.80769231, 0.80769231,
       0.80769231, 0.76923077, 0.80769231])

In [320]:
# Known to have an average accuracy of 0.834 for k = 3 after 10000 trials.
data = np.array(dfScaled[['marital_status_widowed','marital_status_separated_divorce']])
accuracyList = KNN(data,labels,prnt=False)
a = str(accuracyList[:8]).replace('\n','')
a = a.replace(' ',',')
print(a)
# Looks like for k = 3 the algorithm suggests a very good accuracy.

[0.73076923,0.80769231,0.84615385,0.80769231,0.80769231,0.80769231,0.80769231,0.80769231]


# Well, I'm convinced

Looks good and fast.  Let's see how it performs by printing the predicted accuracies for every pair of features whose best accuracy exceeds the cutoff of 0.85.

In [323]:
cutoff = 0.85

begin = time()
for i in range(len(featurelist)-1):
    for j in range(i+1,len(featurelist)):
        features = [featurelist[i],featurelist[j]]
        data = np.array(dfScaled[features])
        accuracyList = KNN(data,labels,kMax = 6, prnt=False)
        a = str(accuracyList[:5]).replace('\n','')
        a = a.replace(' ',',')
        if accuracyList.max() > cutoff:
            print(features,", ", a)
end = time()
print("Total time: ", str(end-begin))

['marital_status_widowed', 'Age_18_to_21'] ,  [0.88461538,0.80769231,0.80769231,0.80769231,0.80769231]
['marital_status_widowed', 'Age_80_older'] ,  [0.88461538,0.76923077,0.65384615,0.76923077,0.73076923]
['Age_22_to_29', 'Age_80_older'] ,  [0.80769231,0.88461538,0.73076923,0.69230769,0.65384615]
Total time:  3.0616257190704346


Runs in less than 5 seconds and correctly identifies the features and k-values for the known classifiers.  However, the additional feature sets it discovered have not proven to be significant in the slower simulations.  This is probably due to the 'leave one out' nature of the heuristic.  I'll take it.  On to the three-feature sets...
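
To make the 'leave one out' remark concrete, here is a small sketch comparing the sorted-label shortcut with an explicit loop that holds out one point at a time.  Everything in it is illustrative: the function names and the random data are made up, labels are assumed binary 0/1 with ties going to 0, and no two distances are assumed to tie exactly.  Under those assumptions the two computations should agree, which would mean the heuristic is a leave-one-out estimate.  One possible (unverified) reason it differs from the random-split simulations is that it classifies each point from all the other points rather than from an 80% training split.

```python
import numpy as np

def heuristic_accuracy(X, y, k):
    """Sorted-label shortcut: accuracy at a single k (hypothetical helper)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, -1.0)                      # each point sorts first in its own row
    order = np.argsort(D, axis=1, kind="stable")
    ones = y[order][:, 1:k + 1].sum(axis=1)        # 1-labels among the k nearest
    votes = (2 * ones > k).astype(int)             # ties go to 0
    return float(np.mean(votes == y))

def explicit_loo_accuracy(X, y, k):
    """Hold out each point in turn and classify it from the remaining points."""
    n = len(X)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        X_train, y_train = X[keep], y[keep]
        d = np.linalg.norm(X_train - X[i], axis=1)
        nearest = np.argsort(d, kind="stable")[:k]
        pred = int(2 * y_train[nearest].sum() > k)
        hits += int(pred == y[i])
    return hits / n

# Made-up data: 26 points in two dimensions with noisy binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(26, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=26) > 0).astype(int)

for k in range(1, 6):
    print(k, heuristic_accuracy(X_demo, y_demo, k), explicit_loo_accuracy(X_demo, y_demo, k))
```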

# Three Feature Heuristic


In [324]:
cutoff = 0.9

begin = time()
for i in range(len(featurelist)-2):
    for j in range(i+1,len(featurelist)-1):
        for k in range(j+1,len(featurelist)):
            features = [featurelist[i],featurelist[j],featurelist[k]]
            data = np.array(dfScaled[features])
            accuracyList = KNN(data,labels,prnt=False)
            a = str(accuracyList[:5]).replace('\n','')
            a = a.replace(' ',',')
            if accuracyList.max() > cutoff:
                print(features,",",a)
end = time()
print("Total time: ", str(end-begin))

['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees'] , [0.92307692,0.88461538,0.80769231,0.80769231,0.84615385]
['household_type_other_non_family_count', 'Age_22_to_29', 'Age_80_older'] , [0.65384615,0.92307692,0.73076923,0.73076923,0.69230769]
Total time:  67.9499990940094


# Checking Results
The above cell pointed out two three-feature sets which I had not discovered before because of the length of time required to check each set.  Let's see how they perform in random trials.

In [325]:
def getAccuracy(features, trainData, testData,k_max = 5,label = 'fridge_count',k_min = 1):
    '''Given a list of features and a train-test split of a data frame,
    this method returns the classification accuracy of a KNN classifier
    for each k from k_min up to (but not including) k_max.  Assumes
    features is a list of column names from the data frame from which
    trainData and testData were derived.  trainData and testData must
    have a column titled "fridge_count", or the name of the target column
    must be supplied as label.'''
    accuracylist = []
    for i in range(k_min,k_max):
        neighFridge = KNeighborsClassifier(n_neighbors = i)
        neighFridge.fit(trainData[features], trainData[label])
        pred = neighFridge.predict(testData[features])
        accuracy = round(sum([1 for i in range(len(pred)) if pred[i] == np.array(testData[label])[i]])/len(pred),3)
        accuracylist.append(accuracy)
    return np.array(accuracylist)

def getAvgAccuracyNTrials(features,data,trials = 100,split_size = 0.2,k_max = 5,label = 'fridge_count',k_min=1):
    '''Given a list of features, a data frame, a train-test split size in (0,1),
    and a k_max value, this method runs trials of KNN and returns the average
    accuracy for each k from k_min up to (but not including) k_max.'''
    accuracyList = []
    for i in range(trials):
        trainFood, testFood = train_test_split(data, test_size = split_size)
        accuracyList.append(getAccuracy(features,trainFood,testFood,k_max,label,k_min))
    avgArray = np.zeros(len(accuracyList[0]))
    for array in accuracyList:
        avgArray = avgArray + array
    return avgArray/trials

def getCI(p,N,z=1.96):
    '''A method for determining the confidence interval (CI) for a given accuracy value p.
    When an accuracy of p* has been found after running N simulations, we wish to know the CI 
    to determine if p* is statistically significant, or if it was caused by variations within 
    our resampling. N is the number of times you have rerun your simulations.  p is the average
    accuracy of your simulations.  z is the t-value with N-1 DF or the z-value for sufficiently high
    DFs.  i.e. for very large samples you can use z = 1.96 for a 95% CI.'''
    interval = z*(np.sqrt(p*(1-p)/N))
    return np.array([p-interval,p+interval])

In [326]:
bestTwoFeats = [['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees'],
               ['household_type_other_non_family_count', 'Age_22_to_29', 'Age_80_older']]

featuresets = [['walk_score', 'white_popl', 'other_popl'] ,['walk_score', 'other_popl', 'educational_attainment_very_advanced_degrees'] ,
['bike_score', 'population_density', 'household_income'] ,['bike_score', 'asian_popl', 'household_type_one_person_count'] ,
['population', 'marital_status_widowed', 'household_type_other_non_family_count'] ,['population', 'marital_status_widowed', 'Age_18_to_21'] ,
['population', 'white_popl', 'household_type_single_male_count'] ,['population', 'hispanic_popl', 'other_popl'] ,
['population', 'other_popl', 'household_type_married_count'] ,['population', 'other_popl', 'household_type_single_male_count'] ,
['population_density', 'household_income', 'Age_80_older'] ,['population_density', 'white_popl', 'other_popl'] ,
['population_density', 'white_popl', 'educational_attainment_very_advanced_degrees'] ,['population_density', 'asian_popl', 'educational_attainment_very_advanced_degrees'] ,
['population_density', 'other_popl', 'educational_attainment_no_hs'] ,['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees'] ,
['population_density', 'other_popl', 'household_type_married_count'] ,['population_density', 'other_popl', 'Age_80_older'] ,
['population_density', 'food_stamps_total', 'educational_attainment_no_hs'] ,['population_density', 'educational_attainment_no_hs', 'household_type_single_male_count'] ,
['marital_status_married', 'marital_status_never_married', 'other_popl'] ,['marital_status_married', 'other_popl', 'household_type_single_female_count'] ,
['marital_status_married', 'other_popl', 'household_type_single_male_count'] ,['marital_status_married', 'other_popl', 'household_type_with_children'] ,
['marital_status_married', 'Age_30_to_39', 'Age_80_older'] ,['marital_status_separated_divorce', 'marital_status_widowed', 'hispanic_popl'] ,
['marital_status_widowed', 'Age_18_to_21', 'Age_80_older'] ,['marital_status_never_married', 'white_popl', 'Age_80_older'] ,
['marital_status_never_married', 'other_popl', 'household_type_married_count'] ,['marital_status_never_married', 'Age_18_to_21', 'Age_80_older'] ,
['marital_status_never_married', 'Age_22_to_29', 'Age_80_older'] ,['white_popl', 'mixed_popl', 'household_type_single_male_count'] ,
['white_popl', 'other_popl', 'household_type_with_children'] ,['hispanic_popl', 'other_popl', 'household_type_single_female_count'] ,
['hispanic_popl', 'other_popl', 'household_type_single_male_count'] ,['mixed_popl', 'other_popl', 'household_type_married_count'] ,
['mixed_popl', 'educational_attainment_bachelors', 'Age_80_older'] ,['other_popl', 'household_type_married_count', 'household_type_single_female_count'] ,
['other_popl', 'household_type_married_count', 'household_type_single_male_count'] ,['other_popl', 'household_type_married_count', 'household_type_with_children'] ,
['other_popl', 'household_type_single_female_count', 'Age_18_to_21'] ,['other_popl', 'household_type_single_female_count', 'Age_40_to_49'] ,
['other_popl', 'household_type_single_male_count', 'Age_18_to_21'] ,['other_popl', 'household_type_single_male_count', 'Age_30_to_39'] ,
['other_popl', 'household_type_single_male_count', 'Age_40_to_49'] ,['household_type_other_non_family_count', 'Age_18_to_21', 'Age_80_older'] ,
['household_type_other_non_family_count', 'Age_22_to_29', 'Age_80_older']]


In [318]:
for featureset in bestTwoFeats:
    print(featureset," , ",getAvgAccuracyNTrials(featureset,dfScaled,trials = 10000,k_max = 6))


['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees']  ,  [0.8907558 0.8447555 0.7916189 0.8124696 0.8038102]
['household_type_other_non_family_count', 'Age_22_to_29', 'Age_80_older']  ,  [0.6784912 0.8447958 0.6974508 0.7539666 0.7181438]


Looks like the two feature sets discovered do in fact perform very well.  My previous most-accurate three-feature classifier had an accuracy of 0.85495 and took about one week to discover.  The best found above has an accuracy of 0.89075.  Very nice, considering we found it in about 68 seconds.
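
As a rough check on whether 0.89075 really beats 0.85495, the normal-approximation interval from `getCI` can be worked out by hand.  The sketch below re-implements the same formula under a made-up name (`normal_approx_ci`) so that it runs on its own, and it inherits the same assumption: the 10000 trials are treated as independent draws, which is questionable for repeated splits of the same 26 rows.

```python
from math import sqrt

def normal_approx_ci(p, n, z=1.96):
    """p +/- z * sqrt(p * (1 - p) / n), the same formula as getCI (hypothetical name)."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

previous_best = 0.85495                       # earlier three-feature classifier
low, high = normal_approx_ci(0.89075, 10000)  # best three-feature set found above
print(round(low, 4), round(high, 4))          # approximately 0.8846 and 0.8969
print(previous_best < low)                    # the old best falls below the interval
```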

# Four Feature Set Heuristic

In [327]:
cutoff = 0.92

begin = time()
featStr2 = ""
for i in range(len(featurelist)-3):
    for j in range(i+1,len(featurelist)-2):
        for k in range(j+1,len(featurelist)-1):
            for p in range(k+1,len(featurelist)):
                features = [featurelist[i],featurelist[j],featurelist[k],featurelist[p]]
                data = np.array(dfScaled[features])
                accuracyList = KNN(data,labels,prnt=False)
                a = str(accuracyList[:5]).replace('\n','')
                a = a.replace(' ',',')
                if accuracyList.max() > cutoff:
                    print(features,",",a)
                    featStr2 = featStr2+str(features)
                    
end = time()
print("Total time: ", str(end-begin))

['population', 'marital_status_widowed', 'household_type_other_non_family_count', 'Age_18_to_21'] , [0.92307692,0.80769231,0.76923077,0.80769231,0.80769231]
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'] , [0.92307692,0.84615385,0.76923077,0.84615385,0.84615385]
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'] , [0.88461538,0.92307692,0.88461538,0.76923077,0.73076923]
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'Age_18_to_21'] , [0.92307692,0.88461538,0.80769231,0.80769231,0.84615385]
['marital_status_married', 'mixed_popl', 'other_popl', 'Age_18_to_21'] , [0.92307692,0.84615385,0.80769231,0.80769231,0.73076923]
['marital_status_widowed', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older'] , [0.92307692,0.80769231,0.69230769,0.80769231,0.80769231]
['marital_status_never_married', 'white_popl', 'Age_18_to_21', 'Age_80_older'] , 

Of the 52,360 (thirty-five choose four) feature sets checked in 544 seconds, these nine quadruples produced highly favorable heuristic scores above 92% accuracy.  Let's see how they perform under random resampling.  Of special note, no quadruples had been checked before at all because of the computation time required.
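
The subset counts, and the nested loops that generate the subsets, can also be expressed with the standard library.  A minimal sketch follows; `iter_feature_sets` and the placeholder feature names are made up for illustration, while `itertools.combinations` and `math.comb` are standard-library functions (`math.comb` needs Python 3.8 or newer).

```python
from itertools import combinations
from math import comb

def iter_feature_sets(feature_names, size):
    """Yield every unordered subset of the given size as a list (hypothetical helper)."""
    for combo in combinations(feature_names, size):
        yield list(combo)

# Placeholder names standing in for the 35 real feature columns.
demo_features = [f"f{i}" for i in range(35)]

n_quads = comb(len(demo_features), 4)    # 52360
n_quints = comb(len(demo_features), 5)   # 324632
first_quad = next(iter_feature_sets(demo_features, 4))
print(n_quads, n_quints, first_quad)
```

A single loop over `iter_feature_sets(featurelist, size)` would visit the same subsets, in the same order, as the hand-written nested loops, and would avoid writing a new cell for each subset size.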

In [274]:
fourFeatP = [['population', 'marital_status_widowed', 'household_type_other_non_family_count', 'Age_18_to_21'] ,
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'] ,
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'] ,
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'Age_18_to_21'] , 
['marital_status_married', 'mixed_popl', 'other_popl', 'Age_18_to_21'] , 
['marital_status_widowed', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older'] , 
['marital_status_never_married', 'white_popl', 'Age_18_to_21', 'Age_80_older'] , 
['marital_status_never_married', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older'] , 
['mixed_popl', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_80_older']]
for featureset in fourFeatP:
    print(featureset," , ",getAvgAccuracyNTrials(featureset,dfScaled,trials = 10000,k_max = 6))

['population', 'marital_status_widowed', 'household_type_other_non_family_count', 'Age_18_to_21']  ,  [0.8751364 0.8006425 0.7706649 0.8071058 0.806389 ]
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees']  ,  [0.8821157 0.8208687 0.782172  0.8223102 0.800449 ]
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count']  ,  [0.8684905 0.8702796 0.8242167 0.774959  0.7394808]
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'Age_18_to_21']  ,  [0.8901078 0.8456081 0.7913026 0.81371   0.8071393]
['marital_status_married', 'mixed_popl', 'other_popl', 'Age_18_to_21']  ,  [0.8622299 0.8252584 0.7879649 0.7897151 0.7578321]
['marital_status_widowed', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older']  ,  [0.8526322 0.7907026 0.7251389 0.806707  0.803891 ]
['marital_status_never_married', 'white_popl', 'Age_18_to_21', 'Age_80_older']  ,  [0.8134037 0.862

# Five Feature Set Heuristic

In [328]:
cutoff = 0.92

begin = time()
featStr3 = ""
for i in range(len(featurelist)-4):
    for j in range(i+1,len(featurelist)-3):
        for k in range(j+1,len(featurelist)-2):
            for p in range(k+1,len(featurelist)-1):
                for q in range(p+1,len(featurelist)):
                    features = [featurelist[i],featurelist[j],featurelist[k],featurelist[p],featurelist[q]]
                    data = np.array(dfScaled[features])
                    accuracyList = KNN(data,labels,prnt=False)
                    a = str(accuracyList[:5]).replace('\n','')
                    a = a.replace(' ',',')
                    if accuracyList.max() > cutoff:
                        print(features,",",a)
                        featStr3 = featStr3+str(features)
end = time()
print("Total time: ", str(end-begin))

['transit_score', 'marital_status_married', 'hispanic_popl', 'other_popl', 'household_type_single_female_count'] , [0.92307692,0.80769231,0.76923077,0.80769231,0.80769231]
['bike_score', 'population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'] , [0.88461538,0.84615385,0.92307692,0.80769231,0.80769231]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_single_male_count'] , [0.88461538,0.84615385,0.92307692,0.80769231,0.80769231]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_one_person_count'] , [0.92307692,0.84615385,0.84615385,0.80769231,0.80769231]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'Age_30_to_39'] , [0.92307692,0.84615385,0.84615385,0.80769231,0.80769231]
['bike_score', 'population_density', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 

Of the 324,632 (thirty-five choose five) feature sets checked in 3601 seconds, let's see how many appear favorable for further inspection.  Of special note, no quintuples had been checked before at all because of the computation time required.
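
For a sense of how the search scales, the reported timings can be divided by the number of subsets checked.  The figures below are copied from the runs above; the six-feature number is only an extrapolation, assuming the cost per feature set stays flat, and is not a measured result.

```python
from math import comb

n_features = 35
timings = {3: 67.95, 4: 544.0, 5: 3601.0}  # seconds, as reported above

# Average cost of scoring one feature set at each subset size.
per_set = {size: secs / comb(n_features, size) for size, secs in timings.items()}
for size, cost in per_set.items():
    print(f"{size} features: {comb(n_features, size):>7} sets, {1000 * cost:.1f} ms per set")

# Rough extrapolation to six features (an assumption, not a measurement).
six_estimate = comb(n_features, 6) * per_set[5]
print(f"6 features: {comb(n_features, 6)} sets, roughly {six_estimate / 3600:.1f} hours")
```

The per-set cost comes out at roughly 10 to 11 ms for all three sizes, so the total run time is driven almost entirely by the number of subsets.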

In [281]:
fiveFeatures = [['transit_score', 'marital_status_married', 'hispanic_popl', 'other_popl', 'household_type_single_female_count'],
['bike_score', 'population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_single_male_count'],
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_one_person_count'],
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'Age_30_to_39'],
['bike_score', 'population_density', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'Age_50_to_59'],
['bike_score', 'marital_status_married', 'household_type_one_person_count', 'household_type_other_non_family_count', 'Age_22_to_29'],
['bike_score', 'marital_status_married', 'household_type_one_person_count', 'Age_18_to_21', 'Age_22_to_29'],
['bike_score', 'marital_status_never_married', 'asian_popl', 'household_type_one_person_count', 'Age_22_to_29'],
['bike_score', 'household_type_married_count', 'household_type_one_person_count', 'household_type_other_non_family_count', 'Age_22_to_29'],
['population', 'population_density', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population', 'population_density', 'black_popl', 'other_popl', 'household_type_married_count'],
['population', 'population_density', 'mixed_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population', 'population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population', 'population_density', 'other_popl', 'Age_30_to_39', 'Age_80_older'],
['population_density', 'marital_status_married', 'marital_status_never_married', 'black_popl', 'other_popl'],
['population_density', 'marital_status_married', 'white_popl', 'black_popl', 'other_popl'],
['population_density', 'marital_status_married', 'hispanic_popl', 'other_popl', 'household_type_single_male_count'],
['population_density', 'marital_status_married', 'black_popl', 'other_popl', 'educational_attainment_bachelors'],
['population_density', 'marital_status_married', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population_density', 'marital_status_married', 'black_popl', 'other_popl', 'household_type_other_non_family_count'],
['population_density', 'marital_status_married', 'black_popl', 'other_popl', 'Age_80_older'],
['population_density', 'marital_status_married', 'other_popl', 'educational_attainment_bachelors', 'household_type_single_male_count'],
['population_density', 'marital_status_married', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'marital_status_never_married', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population_density', 'marital_status_never_married', 'black_popl', 'other_popl', 'household_type_married_count'],
['population_density', 'marital_status_never_married', 'other_popl', 'educational_attainment_bachelors', 'household_type_single_male_count'],
['population_density', 'marital_status_never_married', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'white_popl', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population_density', 'white_popl', 'other_popl', 'educational_attainment_bachelors', 'household_type_single_male_count'],
['population_density', 'white_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'white_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'Age_18_to_21'],
['population_density', 'hispanic_popl', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population_density', 'hispanic_popl', 'black_popl', 'other_popl', 'household_type_married_count'],
['population_density', 'hispanic_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_with_children'],
['population_density', 'hispanic_popl', 'other_popl', 'household_type_married_count', 'household_type_single_male_count'],
['population_density', 'black_popl', 'mixed_popl', 'other_popl', 'educational_attainment_very_advanced_degrees'],
['population_density', 'black_popl', 'mixed_popl', 'other_popl', 'household_type_married_count'],
['population_density', 'black_popl', 'other_popl', 'educational_attainment_bachelors', 'household_type_married_count'],
['population_density', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_married_count'],
['population_density', 'black_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_with_children'],
['population_density', 'black_popl', 'other_popl', 'household_type_married_count', 'household_type_single_male_count'],
['population_density', 'black_popl', 'other_popl', 'household_type_married_count', 'Age_22_to_29'],
['population_density', 'black_popl', 'other_popl', 'household_type_married_count', 'Age_80_older'],
['population_density', 'black_popl', 'other_popl', 'Age_40_to_49', 'Age_80_older'],
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_bachelors', 'educational_attainment_very_advanced_degrees'],
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_bachelors', 'household_type_single_male_count'],
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_other_non_family_count'],
['population_density', 'asian_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'Age_18_to_21'],
['population_density', 'mixed_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'mixed_popl', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_with_children'],
['population_density', 'other_popl', 'educational_attainment_bachelors', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count'],
['population_density', 'other_popl', 'educational_attainment_bachelors', 'household_type_married_count', 'household_type_single_male_count'],
['population_density', 'other_popl', 'educational_attainment_bachelors', 'household_type_single_male_count', 'Age_30_to_39'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_married_count', 'household_type_single_male_count'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'household_type_other_non_family_count'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'Age_18_to_21'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'Age_22_to_29'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'Age_30_to_39'],
['population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_with_children', 'Age_40_to_49'],
['population_density', 'other_popl', 'Age_30_to_39', 'Age_40_to_49', 'Age_80_older'],
['marital_status_never_married', 'white_popl', 'other_popl', 'household_type_married_count', 'household_type_other_non_family_count'],
['marital_status_never_married', 'white_popl', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_80_older'],
['marital_status_never_married', 'white_popl', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older'],
['marital_status_never_married', 'mixed_popl', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_80_older'],
['marital_status_never_married', 'other_popl', 'educational_attainment_bachelors', 'household_type_married_count', 'household_type_other_non_family_count'],
['marital_status_never_married', 'other_popl', 'household_type_married_count', 'household_type_other_non_family_count', 'Age_22_to_29'],
['marital_status_never_married', 'educational_attainment_bachelors', 'household_type_other_non_family_count', 'Age_18_to_21', 'Age_80_older'],
['marital_status_never_married', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older'],
['white_popl', 'mixed_popl', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_80_older'],
['mixed_popl', 'educational_attainment_bachelors', 'Age_18_to_21', 'Age_22_to_29', 'Age_80_older']]

And let's check their accuracies.  

In [282]:
for featureset in fiveFeatures:
    print(featureset," , ",getAvgAccuracyNTrials(featureset,dfScaled,trials = 10000,k_max = 6))

['transit_score', 'marital_status_married', 'hispanic_popl', 'other_popl', 'household_type_single_female_count']  ,  [0.8788005 0.8006637 0.7620105 0.8075767 0.8068931]
['bike_score', 'population_density', 'other_popl', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count']  ,  [0.8504918 0.8434005 0.8780055 0.8075057 0.7989717]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_single_male_count']  ,  [0.8657361 0.8443535 0.8669588 0.8055041 0.788274 ]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'household_type_one_person_count']  ,  [0.9040723 0.8339667 0.8357331 0.8092873 0.8063355]
['bike_score', 'population_density', 'other_popl', 'household_type_married_count', 'Age_30_to_39']  ,  [0.8901577 0.8334973 0.8324299 0.8096012 0.8072336]
['bike_score', 'population_density', 'educational_attainment_very_advanced_degrees', 'household_type_single_male_count', 'Age_50_to_59']

#  Conclusion: Very nice!

This is a fantastic result.  It allows us to much more quickly discover possible high-accuracy feature-sets.  
- Is this a general result or is it specific to these data?  
- How can we further improve this result?  
- What other results can we use to suggest high-accuracy feature sets?
- How inaccurate is this method? i.e. Are there bounds on how far off the predicted accuracy is from the true accuracy?