# kNN for Text Classification
kNN classification assigns the majority class of the k nearest neighbors to a test document.   
  
kNN requires no explicit training: it uses the unprocessed training set directly at classification time. For this reason it is also called "lazy learning".  
  
When k is set to 1, the method is called 1NN. Intuitively it is not very robust, because the decision for each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN with k > 1 is more robust: it assigns each document to the majority class of its k closest neighbors, with ties broken randomly.  
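
The voting rule described above can be sketched in a few lines. This is only an illustration: `majority_vote` is a hypothetical helper, and the neighbor labels are made up.

```python
import random
from collections import Counter

def majority_vote(neighbor_labels, rng=random):
    """Return the most frequent label; break ties uniformly at random."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    # All labels that share the highest count are candidates.
    tied = sorted(label for label, c in counts.items() if c == top)
    return rng.choice(tied)

# Labels of the k = 5 nearest neighbors of some test document (made-up values).
print(majority_vote(["guns", "guns", "space", "guns", "space"]))  # guns
```

With k = 1 the vote reduces to copying the label of the single nearest neighbor, which is why one mislabeled training document can flip the prediction.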
  
Since the algorithm relies on distance between samples, we need a way to define our distance.  
  
The most popular distance measure is Euclidean distance: the length of the straight line segment between two points.  
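
For two vectors $x$ and $y$, Euclidean distance is $\sqrt{\sum_i (x_i - y_i)^2}$. A minimal NumPy check, with vectors chosen only for illustration:

```python
import numpy as np

x = np.array([3.0, 0.0, 0.0])
y = np.array([0.0, 4.0, 0.0])

# Square root of the summed squared differences.
d_manual = np.sqrt(np.sum((x - y) ** 2))
# The same value via the vector norm of the difference.
d_norm = np.linalg.norm(x - y)

print(d_manual, d_norm)  # 5.0 5.0
```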
  
Cosine similarity is another popular measure. For two vectors $x$ and $y$, it is defined as:  
  
$\frac{x \cdot y}{\|x\| \, \|y\|}$

The corresponding cosine distance is one minus this value, which is what SciPy's `cosine` metric computes.
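
As a small numeric check of the formula, the sketch below evaluates it with NumPy. The two vectors are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 0.0])

# Dot product divided by the product of the vector lengths.
similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# A distance is obtained by subtracting the similarity from 1.
distance = 1.0 - similarity

print(round(similarity, 4), round(distance, 4))  # 0.7071 0.2929
```

Because only the angle between the vectors matters, a long and a short document with the same word proportions end up at distance 0.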

## Read the data

The data is in the format of **label word:count word:count ...**  
For example, "talk.politics.guns a:4 accidents:2 advance:1 age:2 an:1 and:3 any:1"  

To load the data, split each line on whitespace, take the first item as the label, and split each remaining item on the colon to get the word and its count. The parser below keeps only the words and discards the counts.

In [11]:
def parse_line(line):
    """
    Parse one line in the file
    return: set of words, class
    """
    l = line.strip().split()
    tag = l[0]
    # Split on the last colon only, so the count is dropped and the word is kept whole
    word_set = {w.rsplit(':', 1)[0] for w in l[1:]}
    return word_set, tag

def parse_file(filename):
    """
    Parse file
    return: X - word set list, y - class list
    """
    X, y = [], []
    with open(filename, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which parse_line cannot handle
            word_set, target = parse_line(line)
            X.append(word_set)
            y.append(target)
    return X, y
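
The parser above keeps only the set of words. If the counts are also needed, a variant could look like the sketch below. `parse_line_with_counts` is a hypothetical helper that is not part of the original notebook, and it assumes every line is non-empty and every item ends in `:<integer>`.

```python
def parse_line_with_counts(line):
    """Parse 'label word:count ...' into ({word: count}, label)."""
    label, *items = line.strip().split()
    counts = {}
    for item in items:
        # rpartition splits on the last colon, so a word containing ':' stays intact.
        word, _, count = item.rpartition(':')
        counts[word] = int(count)
    return counts, label

example = "talk.politics.guns a:4 accidents:2 advance:1 age:2 an:1 and:3 any:1"
counts, label = parse_line_with_counts(example)
print(label)                # talk.politics.guns
print(counts["accidents"])  # 2
```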

## Build kNN class

The key part of kNN is calculating distances in the vector space. SciPy's `cdist` is useful here.

In [17]:
import numpy as np
from collections import Counter
from itertools import chain
from scipy.spatial.distance import cdist

class kNN:
    def __init__(self):
        self.X_raw = None
        self.y_raw = None
        self.y = None
        self.idx_word = None
        self.idx_label = None
        self.word_idx = None
        self.label_idx = None
        self.m = None

    def __create_feature_class_matrix__(self, X, y):
        bow = set(chain.from_iterable(X))
        self.X_raw = X
        self.y_raw = y
        self.idx_word = {i: w for i, w in enumerate(sorted(bow))}
        self.word_idx = {v: k for k, v in self.idx_word.items()}
        self.idx_label = {i: l for i, l in enumerate(sorted(set(y)))}
        self.label_idx = {v: k for k, v in self.idx_label.items()}
        m = np.zeros([len(X), len(self.idx_word)])
        for idx, X_ in enumerate(X):
            for w in X_:
                m[idx][self.word_idx[w]] += 1
        self.m = m
        self.y = np.array([self.label_idx[i] for i in y])

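    def __calculate_distance__(self, inst, dist='euclidean'):
        # Pairwise distances between the rows of inst and the training matrix
        if dist == 'euclidean':
            d_m = cdist(inst, self.m)
        elif dist == 'cosine':
            d_m = cdist(inst, self.m, metric='cosine')
        else:
            raise ValueError(f"unsupported distance: {dist!r}")
        return d_m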

    def train(self, X, y):
        self.__create_feature_class_matrix__(X, y)

    def __predict__(self, X, k, dist):
        m_pre = np.zeros([len(X), len(self.idx_word)])
        for idx, X_ in enumerate(X):
            for w in X_:
                pos = self.word_idx.get(w)
                # Compare with None: index 0 is a valid position but is falsy
                if pos is not None:
                    m_pre[idx][pos] += 1
        d_m = self.__calculate_distance__(m_pre, dist)
        top_k = d_m.argsort()[:,:k]
        pre = []
        for i in range(len(top_k)):
            label = Counter(self.y[top_k[i]]).most_common()[0][0]
            pre.append(label)
        return np.array(pre)

    def predict(self, X, k=5, dist='cosine'):
        pre_idx = self.__predict__(X, k, dist)
        return [self.idx_label[i] for i in pre_idx]

    def evaluate(self, X, y, k=5, dist='cosine'):
        pre = self.__predict__(X, k, dist)
        true = np.array([self.label_idx.get(i, len(self.label_idx)) for i in y])
        return np.sum(pre==true) / len(pre)
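
To make the feature-matrix step easier to follow, here is a standalone sketch of the same idea on three made-up documents. The names `train_docs`, `vocab`, and `encode` are illustrative and not part of the class. Words that were never seen in training are ignored, and the lookup compares against `None` because position 0 is a valid column.

```python
import numpy as np
from itertools import chain

# Each document is a set of words (made-up data).
train_docs = [{"gun", "law"}, {"orbit", "space"}, {"gun", "orbit"}]

# Sorted vocabulary gives every word a fixed column index.
vocab = sorted(set(chain.from_iterable(train_docs)))
word_idx = {w: i for i, w in enumerate(vocab)}

def encode(docs):
    """Turn word sets into a binary document-by-word matrix."""
    m = np.zeros((len(docs), len(vocab)))
    for row, words in enumerate(docs):
        for w in words:
            pos = word_idx.get(w)
            if pos is not None:  # skip words outside the training vocabulary
                m[row, pos] = 1
    return m

m_train = encode(train_docs)
m_new = encode([{"gun", "moon"}])  # "moon" is unseen and is dropped

print(vocab)  # ['gun', 'law', 'orbit', 'space']
print(m_new)  # [[1. 0. 0. 0.]]
```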

## Run the Model

In [15]:
training_data = 'data/train.txt'
test_data = 'data/test.txt'

X_train, y_train = parse_file(training_data)
X_test, y_test = parse_file(test_data)

In [18]:
knn = kNN()
knn.train(X_train, y_train)
knn.evaluate(X_test, y_test)

0.8333333333333334

With the default k=5 and cosine distance, we get an accuracy of about 0.83, which is high but not as good as Naive Bayes. However, we cannot conclude that Naive Bayes is definitely better than kNN, since there is a tradeoff between bias and variance: kNN with a small k has low bias and high variance.
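
One way to explore this tradeoff is to evaluate several values of k. The sketch below does so on a tiny made-up dataset, so its numbers say nothing about the newsgroup data. `knn_accuracy` is a hypothetical helper that repeats the neighbor search and vote from the class above in compact form.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import cdist

def knn_accuracy(train, y_train, test, y_test, k, metric="cosine"):
    """Accuracy of a majority-vote kNN classifier for one value of k."""
    # Indices of the k closest training rows for every test row.
    order = cdist(test, train, metric=metric).argsort(axis=1)[:, :k]
    pred = np.array([Counter(y_train[row]).most_common(1)[0][0] for row in order])
    return float(np.mean(pred == y_test))

# Toy binary bag-of-words data: 5 training documents, 2 test documents.
train = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 1]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1])
test = np.array([[1, 1, 0], [0, 0, 1]], dtype=float)
y_test = np.array([0, 1])

for k in (1, 3, 5):
    print(k, knn_accuracy(train, y_train, test, y_test, k))
```

On this toy data k = 1 and k = 3 classify both test documents correctly, while k = 5 uses every training document and therefore always predicts the majority class, giving 0.5. That is the high-bias end of the tradeoff. On real data the usual approach is to pick k on a held-out validation set.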