#### In this example we will look at how to use the K-Nearest-Neighbor (KNN) algorithm for classification. We will use a modified version of the Video Store data set. The "Incidentals" attribute serves as the target attribute for classification (the class attribute). The goal is to classify an unseen instance as "Yes" or "No" based on the "Incidentals" labels of its nearest training instances.

In [9]:
import numpy as np
import pandas as pd

In [10]:
vstable = pd.read_csv("../data/Video_Store_2.csv", index_col=0)
vstable.shape

(50, 7)

In [11]:
vstable.head()

Cust ID,Gender,Income,Age,Rentals,Avg Per Visit,Genre,Incidentals
1,M,45000,25,32,2.5,Action,Yes
2,F,54000,33,12,3.4,Drama,No
3,F,32000,20,42,1.6,Comedy,No
4,F,59000,70,16,4.2,Drama,Yes
5,M,37000,35,25,3.2,Action,Yes


#### We will split the data into training and test partitions. The test partition is used to evaluate the model's error rate, and the training partition is used to find the K nearest neighbors. Before splitting the data, we randomly reshuffle the rows so that the order of the instances is randomized.

In [12]:
vs = vstable.reindex(np.random.permutation(vstable.index))
vs.head(10)

Cust ID,Gender,Income,Age,Rentals,Avg Per Visit,Genre,Incidentals
12,F,26000,22,32,2.9,Action,Yes
24,F,79000,35,22,3.8,Drama,Yes
45,M,56000,38,30,3.5,Drama,Yes
2,F,54000,33,12,3.4,Drama,No
15,M,68000,30,36,2.7,Comedy,Yes
49,M,31000,25,42,3.4,Action,Yes
3,F,32000,20,42,1.6,Comedy,No
30,M,41000,25,17,1.4,Action,Yes
10,F,65000,40,21,3.3,Drama,No
26,F,56000,35,40,2.6,Action,Yes
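
#### Because the permutation above is unseeded, the row order (and every result that follows) changes from run to run. The following is a minimal sketch of a repeatable shuffle, assuming a seeded NumPy generator is acceptable. The frame `toy` and the seed 42 are illustrative assumptions, not part of the Video Store data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for vstable (values are made up for illustration)
toy = pd.DataFrame(
    {"Income": [45000, 54000, 32000, 59000, 37000],
     "Incidentals": ["Yes", "No", "No", "Yes", "Yes"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Cust ID"),
)

# A seeded generator makes the permutation repeatable across runs
rng = np.random.default_rng(42)  # 42 is an arbitrary seed
shuffled = toy.reindex(rng.permutation(toy.index.to_numpy()))

# Re-creating the generator with the same seed reproduces the same order
rng_again = np.random.default_rng(42)
shuffled_again = toy.reindex(rng_again.permutation(toy.index.to_numpy()))

print(shuffled.index.equals(shuffled_again.index))
```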


In [13]:
len(vs)

50

In [14]:
vs_names = vs.columns.values
vs_names

array(['Gender', 'Income', 'Age', 'Rentals', 'Avg Per Visit', 'Genre',
       'Incidentals'], dtype=object)

#### The target attribute for classification is Incidentals:

In [15]:
vs_target = vs.Incidentals
vs_target

Cust ID
12    Yes
24    Yes
45    Yes
2      No
15    Yes
49    Yes
3      No
30    Yes
10     No
26    Yes
44     No
18    Yes
29    Yes
40     No
11    Yes
13     No
9      No
27     No
35    Yes
50     No
39     No
14     No
31     No
16    Yes
33     No
47    Yes
48     No
38    Yes
46     No
5     Yes
25    Yes
36     No
6      No
1     Yes
34    Yes
4     Yes
7      No
17    Yes
32    Yes
23     No
8     Yes
21     No
41     No
43    Yes
20    Yes
42    Yes
19     No
37     No
22    Yes
28     No
Name: Incidentals, dtype: object

#### Before we can compute distances, we need to convert the categorical attributes into binary dummy variables. The target attribute containing the class labels is excluded from this step.

In [16]:
# Note: the dummy-coded frame is stored in a new variable (vs_dummies) instead of re-assigning vs.
# Re-assigning vs would drop the Incidentals column, and running cells out of order could then
# result in errors (one frame does not contain dummies and is missing one column).
# old line  vs = pd.get_dummies(vs[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']])
vs_dummies = pd.get_dummies(vs[['Gender','Income','Age','Rentals','Avg Per Visit','Genre']])
vs_dummies.head(10)

Cust ID,Income,Age,Rentals,Avg Per Visit,Gender_F,Gender_M,Genre_Action,Genre_Comedy,Genre_Drama
12,26000,22,32,2.9,1,0,1,0,0
24,79000,35,22,3.8,1,0,0,0,1
45,56000,38,30,3.5,0,1,0,0,1
2,54000,33,12,3.4,1,0,0,0,1
15,68000,30,36,2.7,0,1,0,1,0
49,31000,25,42,3.4,0,1,1,0,0
3,32000,20,42,1.6,1,0,0,1,0
30,41000,25,17,1.4,0,1,1,0,0
10,65000,40,21,3.3,1,0,0,0,1
26,56000,35,40,2.6,1,0,1,0,0
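
#### To see what the dummy coding does in isolation, here is a small sketch on a made-up three-row frame (`toy` is an illustrative assumption). Numeric columns pass through unchanged, and each categorical column is replaced by one 0/1 column per category. Passing `dtype=int` is a choice made here to match the 0/1 display above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Income": [45000, 54000, 32000],
    "Genre": ["Action", "Drama", "Comedy"],
})

# Each categorical column becomes one indicator column per distinct value
toy_dummies = pd.get_dummies(toy, dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```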


#### To be able to evaluate the accuracy of our predictions, we will split the data into training and test sets. In this case, we will use 80% for training and the remaining 20% for testing. Note that we must also do the same split to the target attribute.

In [21]:
print(vs.shape)
print(vs.columns)

print(vs_dummies.shape)
print(vs_dummies.columns)

(50, 7)
Index(['Gender', 'Income', 'Age', 'Rentals', 'Avg Per Visit', 'Genre',
       'Incidentals'],
      dtype='object')
(50, 9)
Index(['Income', 'Age', 'Rentals', 'Avg Per Visit', 'Gender_F', 'Gender_M',
       'Genre_Action', 'Genre_Comedy', 'Genre_Drama'],
      dtype='object')


In [22]:
# Fraction of the data used for training (80%)
tpercent = 0.8
# floor(0.8 * 50) = 40 training rows
tsize = int(np.floor(tpercent * len(vs_dummies)))
vs_train = vs_dummies[:tsize] # first 40 rows
vs_test = vs_dummies[tsize:]  # remaining 10 rows

In [23]:
print (vs_train.shape)
print (vs_test.shape)

(40, 9)
(10, 9)
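
#### Since the features and the target must always be cut at the same row, it can help to do both splits in one place. The following is a sketch under the assumption that the rows are already shuffled. The helper name `split_rows` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def split_rows(features, target, train_fraction=0.8):
    """Hypothetical helper: positional split of already-shuffled rows."""
    n_train = int(np.floor(train_fraction * len(features)))
    return (features.iloc[:n_train], features.iloc[n_train:],
            target.iloc[:n_train], target.iloc[n_train:])

# Toy data with 50 rows, mirroring the size of the Video Store table
X = pd.DataFrame({"a": range(50), "b": range(50, 100)})
y = pd.Series(["Yes", "No"] * 25)

X_train, X_test, y_train, y_test = split_rows(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```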


In [24]:
np.set_printoptions(suppress=True, linewidth=120)

vs_train.head(10)

Cust ID,Income,Age,Rentals,Avg Per Visit,Gender_F,Gender_M,Genre_Action,Genre_Comedy,Genre_Drama
12,26000,22,32,2.9,1,0,1,0,0
24,79000,35,22,3.8,1,0,0,0,1
45,56000,38,30,3.5,0,1,0,0,1
2,54000,33,12,3.4,1,0,0,0,1
15,68000,30,36,2.7,0,1,0,1,0
49,31000,25,42,3.4,0,1,1,0,0
3,32000,20,42,1.6,1,0,0,1,0
30,41000,25,17,1.4,0,1,1,0,0
10,65000,40,21,3.3,1,0,0,0,1
26,56000,35,40,2.6,1,0,1,0,0


In [25]:
vs_test

Cust ID,Income,Age,Rentals,Avg Per Visit,Gender_F,Gender_M,Genre_Action,Genre_Comedy,Genre_Drama
8,74000,25,31,2.4,0,1,1,0,0
21,47000,52,11,3.1,1,0,0,0,1
41,50000,33,17,1.4,1,0,0,0,1
43,49000,28,48,3.3,1,0,0,0,1
20,12000,16,23,2.2,0,1,1,0,0
42,32000,25,26,2.2,0,1,1,0,0
19,24000,25,41,3.1,1,0,0,1,0
37,89000,46,12,1.2,0,1,0,1,0
22,25000,33,16,2.9,0,1,0,0,1
28,57000,52,22,4.1,0,1,0,1,0


#### Splitting the target attribute ("Incidentals") accordingly:

In [26]:
vs_target_train = vs_target[:int(tsize)]
vs_target_test = vs_target[int(tsize):]
print(vs_target_test.shape)

(10,)


In [27]:
vs_target_train = vs_target[0:int(tsize)]
vs_target_test = vs_target[int(tsize):len(vs)]
print(vs_target_test.shape)

(10,)


In [28]:
vs_target_train.head()

Cust ID
12    Yes
24    Yes
45    Yes
2      No
15    Yes
Name: Incidentals, dtype: object

In [29]:
vs_target_test.head()

Cust ID
8     Yes
21     No
41     No
43    Yes
20    Yes
Name: Incidentals, dtype: object

#### Next, we normalize the attributes so that every value lies in the [0, 1] range. We could use the normalization functions from the kNN module in Ch. 2 of the text. In this case, however, we will use the more flexible and robust scaler from the preprocessing module of the scikit-learn package.

In [30]:
from sklearn import preprocessing

In [31]:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(vs_train)

MinMaxScaler()

In [32]:
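vs_train_norm = min_max_scaler.fit_transform(vs_train)
# Caution: fit_transform re-fits the scaler on the test partition's own min/max.
# The outputs below were produced this way; min_max_scaler.transform(vs_test)
# would instead reuse the min/max learned from the training partition.
vs_test_norm = min_max_scaler.fit_transform(vs_test)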

#### Note that while these Scikit-learn functions accept Pandas dataframes as input, they return Numpy arrays (in this case the normalized versions of vs_train and vs_test).
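
#### The cell above calls fit_transform on both partitions, so the test data is rescaled using its own minimum and maximum. A common alternative is to fit the scaler on the training partition only and reuse those statistics for the test partition. The following is a minimal sketch on made-up arrays (`train` and `test` are illustrative, not the Video Store data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up partitions with two numeric features
train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
test = np.array([[15.0, 2.0], [40.0, 5.0]])

scaler = MinMaxScaler()
train_norm = scaler.fit_transform(train)  # learns min/max from the training rows
test_norm = scaler.transform(test)        # reuses the training min/max

print(train_norm)
print(test_norm)
```

#### With this approach a test value can fall outside [0, 1] when it lies outside the range seen in training, as the 40.0 in the first column does here.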

In [33]:
np.set_printoptions(precision=2, linewidth=100)
vs_train_norm[:10]

array([[0.3 , 0.13, 0.59, 0.5 , 1.  , 0.  , 1.  , 0.  , 0.  ],
       [0.95, 0.36, 0.33, 0.75, 1.  , 0.  , 0.  , 0.  , 1.  ],
       [0.67, 0.42, 0.54, 0.67, 0.  , 1.  , 0.  , 0.  , 1.  ],
       [0.65, 0.33, 0.08, 0.64, 1.  , 0.  , 0.  , 0.  , 1.  ],
       [0.82, 0.27, 0.69, 0.44, 0.  , 1.  , 0.  , 1.  , 0.  ],
       [0.37, 0.18, 0.85, 0.64, 0.  , 1.  , 1.  , 0.  , 0.  ],
       [0.38, 0.09, 0.85, 0.14, 1.  , 0.  , 0.  , 1.  , 0.  ],
       [0.49, 0.18, 0.21, 0.08, 0.  , 1.  , 1.  , 0.  , 0.  ],
       [0.78, 0.45, 0.31, 0.61, 1.  , 0.  , 0.  , 0.  , 1.  ],
       [0.67, 0.36, 0.79, 0.42, 1.  , 0.  , 1.  , 0.  , 0.  ]])

#### For consistency, we'll also convert the training and test target labels into Numpy arrays.

In [34]:
vs_target_train = np.array(vs_target_train)
vs_target_test = np.array(vs_target_test)

In [35]:
print (vs_target_train)
print ("\n")
print (vs_target_test)

['Yes' 'Yes' 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'No'
 'No' 'Yes' 'No' 'No' 'No' 'No' 'Yes' 'No' 'Yes' 'No' 'Yes' 'No' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes'
 'Yes' 'No' 'Yes' 'Yes' 'No']


['Yes' 'No' 'No' 'Yes' 'Yes' 'Yes' 'No' 'No' 'Yes' 'No']


#### The following function illustrates how we can perform a k-nearest-neighbor search. The "measure" argument selects the distance function: 0 for Euclidean distance, or 1 for a distance based on Cosine similarity (computed as 1 minus the similarity):

In [36]:
def knn_search(x, D, K, measure):
    """Find the K nearest neighbors of an instance x among the instances in D.

    measure: 0 = Euclidean distance, 1 = cosine distance (1 - Cosine similarity).
    """
    if measure == 0:
        # Euclidean distance from x to every instance in D
        dists = np.sqrt(((D - x)**2).sum(axis=1))
    elif measure == 1:
        # first find the vector norm for each instance in D as well as the norm for vector x
        D_norm = np.array([np.linalg.norm(D[i]) for i in range(len(D))])
        x_norm = np.linalg.norm(x)
        # Compute Cosine: divide the dot product of x and each instance in D by the product of the two norms
        sims = np.dot(D,x)/(D_norm * x_norm)
        # The distance measure is the complement of Cosine similarity
        dists = 1 - sims
    else:
        raise ValueError("measure must be 0 (Euclidean) or 1 (cosine)")
    idx = np.argsort(dists) # sorting
    # return the indexes of K nearest neighbors
    return idx[:K], dists
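
#### The two measures can rank neighbors differently, because Euclidean distance depends on vector length while Cosine similarity depends only on direction. The sketch below shows this on four made-up points. The helper names `euclidean_dists` and `cosine_dists` are hypothetical and are written in vectorized form, so they are not the function above.

```python
import numpy as np

def euclidean_dists(x, D):
    """Euclidean distance from x to every row of D."""
    return np.sqrt(((D - x) ** 2).sum(axis=1))

def cosine_dists(x, D):
    """1 minus the Cosine similarity between x and every row of D."""
    sims = D @ x / (np.linalg.norm(D, axis=1) * np.linalg.norm(x))
    return 1.0 - sims

# Made-up training points and query
D = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.9, 0.1]])
x = np.array([1.0, 0.9])

results = {}
for name, fn in [("euclidean", euclidean_dists), ("cosine", cosine_dists)]:
    dists = fn(x, D)
    idx = np.argsort(dists)[:2]   # indexes of the 2 nearest rows
    results[name] = idx.tolist()
    print(name, idx, dists[idx])
```

#### Here the point [2.0, 2.0] is far from x in Euclidean terms but points in almost the same direction, so it is the nearest neighbor only under the cosine measure.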

In [37]:
# Let's use vs_test_norm[0] as a test instance x and find its K nearest neighbors
neigh_idx, distances = knn_search(vs_test_norm[0], vs_train_norm, 5, 0)

In [38]:
vs_test_norm[0]

array([0.81, 0.25, 0.54, 0.41, 0.  , 1.  , 1.  , 0.  , 0.  ])

In [39]:
print (neigh_idx)
print ("\n")
vs_train.iloc[neigh_idx]
#print(distances.shape)

[33 29 24  7  5]




Cust ID,Income,Age,Rentals,Avg Per Visit,Gender_F,Gender_M,Genre_Action,Genre_Comedy,Genre_Drama
1,45000,25,32,2.5,0,1,1,0,0
5,37000,35,25,3.2,0,1,1,0,0
33,23000,25,28,2.7,0,1,1,0,0
30,41000,25,17,1.4,0,1,1,0,0
49,31000,25,42,3.4,0,1,1,0,0


In [40]:
print (distances[neigh_idx])

[0.28 0.44 0.54 0.57 0.58]


In [41]:
neigh_labels = vs_target_train[neigh_idx]
print (neigh_labels)

['Yes' 'Yes' 'No' 'Yes' 'Yes']


#### Now that we know the nearest neighbors, we need to find the majority class label among them. The majority class is the class assigned to the new instance x.

In [42]:
from collections import Counter
print (Counter(neigh_labels))

Counter({'Yes': 4, 'No': 1})


In [43]:
Counter(neigh_labels).most_common(1)

[('Yes', 4)]
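
#### The Counter-based vote can be wrapped in a small function. The sketch below uses a hypothetical helper name, `majority_label`, and also shows what happens on a tie: with an even number of neighbors, Counter returns the label that was encountered first among the tied ones.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

clear_vote = majority_label(["Yes", "Yes", "No", "Yes", "Yes"])
tied_vote = majority_label(["No", "Yes", "Yes", "No"])

print(clear_vote)
print(tied_vote)
```

#### Using an odd K for a two-class problem avoids ties altogether.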

#### Next, we'll use the Knn module from Chapter 2 of Machine Learning in Action. Before importing the whole module, let's illustrate what the code does by stepping through it with some specific input values.

In [44]:
dataSetSize = vs_train_norm.shape[0]
print (dataSetSize)
#print(vs_train_norm.shape)

40


In [45]:
#inX = vs_test_norm[0]
#print(inX.shape) (9,)
#test = np.tile(inX, (dataSetSize,1))
#print(test.shape) (40,9)

In [46]:
inX = vs_test_norm[0]   # We'll use the first instance in the test data for this example
diffMat = np.tile(inX, (dataSetSize,1)) - vs_train_norm  # Create dataSetSize copies of inX, as rows of a 2D matrix
                                                         # Compute a matrix of differences
print (diffMat[:5,:])

[[ 0.5   0.12 -0.05 -0.09 -1.    1.    0.    0.    0.  ]
 [-0.15 -0.11  0.21 -0.34 -1.    1.    1.    0.   -1.  ]
 [ 0.13 -0.17  0.   -0.25  0.    0.    1.    0.   -1.  ]
 [ 0.16 -0.08  0.46 -0.23 -1.    1.    1.    0.   -1.  ]
 [-0.01 -0.02 -0.15 -0.03  0.    0.    1.   -1.    0.  ]]


In [47]:
sqDiffMat = diffMat**2  # The matrix of squared differences
sqDistances = sqDiffMat.sum(axis=1)  # 1D array of the sum of squared differences (one element for each training instance)
distances = sqDistances**0.5  # and finally the 1D array of Euclidean distances to inX
print (distances)

[1.51 2.05 1.45 2.07 1.42 0.58 2.09 0.57 2.03 1.45 1.5  1.65 1.58 0.67 2.08 1.54 1.5  2.05 0.65
 0.81 2.06 1.45 2.11 0.65 0.54 2.01 2.1  1.49 2.13 0.44 1.68 2.06 0.67 0.28 1.52 2.21 2.12 1.49
 1.46 2.17]
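
#### The np.tile step makes the row-by-row subtraction explicit, but NumPy broadcasting produces the same differences without building the tiled matrix. The sketch below checks that three formulations agree on random stand-in data (`train` and `inX` here are random arrays, not the normalized Video Store data).

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((40, 9))  # stand-in for vs_train_norm
inX = rng.random(9)          # stand-in for one test instance

# Explicit tiling, as in the step-by-step cells
tiled = ((np.tile(inX, (train.shape[0], 1)) - train) ** 2).sum(axis=1) ** 0.5

# Broadcasting: inX is subtracted from every row automatically
broadcast = np.sqrt(((inX - train) ** 2).sum(axis=1))

# Row-wise vector norm of the differences
via_norm = np.linalg.norm(train - inX, axis=1)

print(np.allclose(tiled, broadcast), np.allclose(tiled, via_norm))
```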


In [48]:
sortedDistIndicies = distances.argsort() # the indices of the training instances in increasing order of distance to inX
print (sortedDistIndicies)

[33 29 24  7  5 23 18 32 13 19  4  9  2 21 38 37 27 10 16  0 34 15 12 11 30 25  8  1 17 20 31  3 14
  6 26 22 36 28 39 35]


#### To see how the test instance should be classified, we need to find the majority class among the neighbors (here we do not use distance weighting, only a simple voting approach).

In [49]:
classCount={}
k = 5  # We'll use the top 5 neighbors
for i in range(k):
    voteIlabel = vs_target_train[sortedDistIndicies[i]]
    #print("voteILabel",     voteIlabel)
    classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1  # add 1 to the label's count (the count starts at 0 on first occurrence)
    print (sortedDistIndicies[i], voteIlabel, classCount[voteIlabel])


33 Yes 1
29 Yes 2
24 No 1
7 Yes 3
5 Yes 4
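
#### For contrast with simple voting, the sketch below shows one possible form of distance weighting, in which each neighbor contributes 1/distance to its label's score. This is an illustrative variant, not part of the kNN module. The helper name `weighted_vote` is hypothetical, and the labels and distances are the rounded values printed earlier for the five nearest neighbors.

```python
from collections import defaultdict

def weighted_vote(labels, distances, eps=1e-9):
    """Score each label by the sum of inverse distances of its neighbors."""
    weights = defaultdict(float)
    for label, d in zip(labels, distances):
        weights[label] += 1.0 / (d + eps)  # eps guards against a zero distance
    winner = max(weights, key=weights.get)
    return winner, dict(weights)

labels = ["Yes", "Yes", "No", "Yes", "Yes"]
distances = [0.28, 0.44, 0.54, 0.57, 0.58]

winner, weights = weighted_vote(labels, distances)
print(winner)
print(weights)
```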


#### Now, let's find the predicted class for the test instance.

In [50]:
import operator
# Sort the (label, count) pairs in classCount by count, in decreasing order.
# The result is a list of tuples, so the majority class label is the first
# element of the first tuple.
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print (sortedClassCount)
print (sortedClassCount[0][0])

[('Yes', 4), ('No', 1)]
Yes


#### A better way to find the majority class given a list of class labels from neighbors:

In [51]:
from collections import Counter

k = 5  # We'll use the top 5 neighbors
vote = vs_target_train[sortedDistIndicies[0:k]]
maj_class = Counter(vote).most_common(1)

print (vote)

print (maj_class)

print ("Class label for the classified instance: ", maj_class[0][0])

['Yes' 'Yes' 'No' 'Yes' 'Yes']
[('Yes', 4)]
Class label for the classified instance:  Yes


#### Let's now import the whole kNN module from MLA (uploaded on D2L) and use it as part of a more robust evaluation process. We will step through all test instances, use our kNN classifier to predict a class label for each instance, and in each case compare the predicted label to the actual value from the target test labels.

In [52]:
import kNN # importing kNN.py from MLA

In [53]:
numTestVecs = len(vs_target_test)
print (numTestVecs)

10


In [54]:
errorCount = 0.0
for i in range(numTestVecs):
    # classify0 function uses Euclidean distance to find k nearest neighbors
    classifierResult = kNN.classify0(vs_test_norm[i,:], vs_train_norm, vs_target_train, 3)
    print ("the classifier came back with: %s, the real answer is: %s" % (classifierResult, vs_target_test[i]))
    if (classifierResult != vs_target_test[i]): errorCount += 1.0
        
print ("the total error rate is: %f" % (errorCount/float(numTestVecs)))

the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the total error rate is: 0.300000
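
#### Since kNN.py itself is not shown here, the following sketch illustrates the same evaluation loop with a stand-in classifier. The functions `classify_euclid` and `error_rate` are hypothetical names written for this example, not the source of kNN.classify0, and the toy arrays are made up.

```python
import numpy as np
from collections import Counter

def classify_euclid(inX, dataSet, labels, k):
    """Hypothetical stand-in: Euclidean kNN with simple majority voting."""
    dists = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(labels[nearest]).most_common(1)[0][0]

def error_rate(test_X, test_y, train_X, train_y, k):
    """Fraction of test rows whose predicted label differs from the actual one."""
    preds = np.array([classify_euclid(row, train_X, train_y, k) for row in test_X])
    return float(np.mean(preds != test_y)), preds

# Two made-up clusters: "No" near the origin, "Yes" near (1, 1)
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 0.8], [0.8, 0.9]])
train_y = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
test_X = np.array([[0.15, 0.1], [0.95, 0.9], [0.4, 0.4]])
test_y = np.array(["No", "Yes", "Yes"])

rate, preds = error_rate(test_X, test_y, train_X, train_y, k=3)
print(preds)
print("error rate:", rate)
```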


#### I have added a new classifier function to the kNN module that uses Cosine similarity instead of Euclidean distance:

In [55]:
from importlib import reload
reload(kNN)

<module 'kNN' from '/home/roselyne/classes/DSC478/week3/kNN.py'>

In [56]:
errorCount = 0.0
for i in range(numTestVecs):
    # classify1 function uses 1 - Cosine similarity as the distance to find k nearest neighbors
    classifierResult2 = kNN.classify1(vs_test_norm[i,:], vs_train_norm, vs_target_train, 3)
    print ("the classifier came back with: %s, the real answer is: %s" % (classifierResult2, vs_target_test[i]))
    if (classifierResult2 != vs_target_test[i]): errorCount += 1.0
print ("the total error rate is: %f" % (errorCount/float(numTestVecs)))

the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the classifier came back with: No, the real answer is: No
the classifier came back with: Yes, the real answer is: Yes
the classifier came back with: No, the real answer is: No
the total error rate is: 0.300000
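
#### Both classifiers produced the same ten predictions, so a single error rate hides where the mistakes occur. The sketch below tabulates the predicted and actual labels printed above into a confusion table using pandas crosstab. The lists are copied by hand from that output.

```python
import pandas as pd

# Labels copied from the classifier output above, in test-set order
actual    = ["Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "No",  "No",  "No",  "No", "No", "Yes", "No"]

# Rows are actual classes, columns are predicted classes
table = pd.crosstab(pd.Series(actual, name="actual"),
                    pd.Series(predicted, name="predicted"))
print(table)

error_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print("error rate:", error_rate)
```

#### All three errors are "Yes" instances predicted as "No"; no "No" instance is misclassified.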
