# Chapter 17. Classification
Classification is about learning how to make predictions from past examples. In a classification task, each individual or situation where we'd like to make a prediction is called an ***observation***. Each observation has multiple ***attributes***, which are known. Also, each observation has a ***class***, which is the answer to the question we care about (for example, fraudulent or not, or voting for you or not).

Consider Amazon trying to determine which orders are fraudulent. When a customer places a new order, we do not observe whether it is fraudulent, but we do observe its attributes, and we will try to predict its class using those attributes.

Classification requires data.  It involves looking for patterns, and to find patterns, you need data.  That's where the data science comes in.  In particular, we're going to assume that we have access to *training data*: a bunch of observations, where we know the class of each observation.  The collection of these pre-classified observations is also called a training set.  A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

### 17.1 Nearest Neighbors
A surprisingly effective algorithm for classification is called *nearest neighbor classification*. The main idea is that you classify a new observation using the class of the training observation closest to the individual or situation you're trying to predict.

The decision boundary between two classes is not always clean. When the boundary isn't so obvious, there is a simple generalization of the nearest neighbor classifier, called ***k-Nearest Neighbors***, where we take the majority class among the $k$ points in our training set that are nearest to the point we are trying to classify, and use it as that point's class.
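To see how the choice of $k$ can change a prediction, here is a minimal sketch using only the Python standard library. The training data, the function name `knn_classify`, and the point being classified are all made up for illustration; this is not the chapter's implementation.

```python
import math
from collections import Counter

# Hypothetical training data: (attributes, class) pairs, invented for this sketch.
training = [
    ((1.0, 1.0), 0),
    ((1.5, 2.0), 0),
    ((2.0, 1.0), 0),
    ((4.0, 4.0), 1),
    ((5.0, 4.5), 1),
    ((2.6, 2.4), 1),  # a class-1 point sitting close to the class-0 cluster
]

def knn_classify(training, point, k):
    """Return the majority class among the k training points closest to point."""
    # Sort the training pairs by Euclidean distance from the new point.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], point))
    # Collect the classes of the k closest pairs and take the most common one.
    k_classes = [cls for _, cls in by_distance[:k]]
    return Counter(k_classes).most_common(1)[0][0]

new_point = (2.3, 2.0)
print(knn_classify(training, new_point, 1))  # prints 1: the single nearest point is class 1
print(knn_classify(training, new_point, 3))  # prints 0: two of the three nearest are class 0
```

With $k = 1$ the lone class-1 point near the boundary decides the answer; with $k = 3$ it is outvoted by its two class-0 neighbors.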

### 17.2 Training and Testing

When we computed confidence intervals for numerical parameters, we wanted to have many new random samples from a population, but we only had access to a single sample. We solved that problem by taking bootstrap resamples from our sample.

We will use an analogous idea to test our classifier. We will create two samples out of the original training set, use one of the samples as our training set, and the other one for testing.

So we will have three groups of individuals:
- a training set on which we can do any amount of exploration to build our classifier;
- a separate testing set on which to try out our classifier and see what fraction of times it classifies correctly;
- the underlying population of individuals for whom we don't know the true classes; the hope is that our classifier will succeed about as well for these individuals as it did for our testing set.
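The split described above can be sketched as follows, assuming the labeled data is a plain list of `(attributes, class)` pairs. The helper names `split_train_test` and `accuracy`, the toy data, and the stand-in classifier are hypothetical and exist only to make the example runnable.

```python
import random

def split_train_test(observations, test_fraction=0.5, seed=0):
    """Shuffle a copy of the observations, then split it into (train, test)."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test observations whose predicted class matches the known class."""
    correct = sum(1 for attributes, cls in test_set if classifier(attributes) == cls)
    return correct / len(test_set)

# Hypothetical labeled data: the class is 1 when the single attribute exceeds 5.
observations = [((x,), int(x > 5)) for x in range(10)]
train, test = split_train_test(observations)

# Stand-in classifier for illustration only; a real one would be built from `train`.
def threshold_classifier(attributes):
    return int(attributes[0] > 5)

print(len(train), len(test))
print(accuracy(threshold_classifier, test))
```

The key design point is that the testing set plays no part in building the classifier, so the fraction it classifies correctly is an honest estimate of how the classifier might do on new individuals.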

### 17.3 Rows of Tables
Rows are in general **not arrays**, as their elements can be of different types. For example, a row may contain some elements that are strings (like 'abnormal') and some that are numerical, so it can't always be converted into an array. Rows whose elements are all numerical (or all strings) can be converted to arrays by the `np.array` function.
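A small sketch of why this matters, using plain tuples as hypothetical stand-ins for rows. When the elements are all numerical, `np.array` produces a numerical array that supports arithmetic. When strings and numbers are mixed, NumPy coerces every element to a string, which is not useful for computing distances.

```python
import numpy as np

# Hypothetical rows, written as plain tuples for illustration.
numeric_row = (1.2, 3.4, 5.0)
mixed_row = ('abnormal', 1.2, 3.4)

numeric_array = np.array(numeric_row)
mixed_array = np.array(mixed_row)

print(numeric_array.dtype.kind)  # 'f': a floating-point array
print(mixed_array.dtype.kind)    # 'U': every element was coerced to a string
print(numeric_array - 1)         # arithmetic works on the numerical array
```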

If you use `apply` without specifying a column label, then the entire row is passed to the function.


### 17.4 Implementing The Classifier
A general phenomenon in classification is that each attribute can potentially give you new information, so more attributes sometimes help you build a better classifier.

##### A Plan for the Implementation
It's time to write some code to implement the classifier.  The input is a `point` that we want to classify.  The classifier works by finding the $k$ nearest neighbors of `point` from the training set.  So, our approach will go like this:

1. Find the closest $k$ neighbors of `point`, i.e., the $k$ wines from the training set that are most similar to `point`.

2. Look at the classes of those $k$ neighbors, and take the majority vote to find the most-common class of wine.  Use that as our predicted class for `point`.

So that will guide the structure of our Python code.


```python
import numpy as np
from datascience import *

# STEP 1: find the k nearest neighbors of a point

def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array
    consisting of the coordinates of the point"""
    return np.sqrt(np.sum((point1 - point2)**2))

def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    # Drop the class column so that only the attributes are compared.
    attributes = training.drop('Class')
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    # With no column label, apply passes each entire row to the function.
    return attributes.apply(distance_from_point)

def table_with_distances(training, new_point):
    """Augments the training table
    with a column of distances from new_point"""
    return training.with_column('Distance', all_distances(training, new_point))

def closest(training, new_point, k):
    """Returns a table of the k rows of the augmented table
    corresponding to the k smallest distances"""
    with_dists = table_with_distances(training, new_point)
    sorted_by_distance = with_dists.sort('Distance')
    topk = sorted_by_distance.take(np.arange(k))
    return topk

# STEP 2: take the majority vote among those neighbors

def majority(topkclasses):
    """Returns the more common class (1 or 0) in the 'Class' column;
    a tie is resolved in favor of 0"""
    ones = topkclasses.where('Class', are.equal_to(1)).num_rows
    zeros = topkclasses.where('Class', are.equal_to(0)).num_rows
    if ones > zeros:
        return 1
    else:
        return 0

def classify(training, new_point, k):
    """Returns the predicted class of new_point,
    based on its k nearest neighbors in the training set"""
    closestk = closest(training, new_point, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)
```

### 17.5 The Accuracy of the Classifier



Questions:
1. For 17.2 why do you randomly sample half for testing and half for training? How does this give you the accuracy of your classification method since you are reducing the data by half?
2. Why is a training set not like a random sample from the population?
3. What is training?
4. What is testing?