The contents of this course, including lectures, labs, homework assignments, and exams, have all been adapted from the [Data 8 course at the University of California, Berkeley](https://data.berkeley.edu/education/courses/data-8). Through their generosity and passion for undergraduate education, the Data 8 community at Berkeley has opened their content and expertise for other universities to adapt in the name of undergraduate education.

In [None]:
#!pip install datascience
from datascience import *
import numpy as np

from IPython.display import display, Math, Latex

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
!git clone https://github.com/Mrsnellek/DS410.git

In [None]:
cd DS410/Week_7/Week_7_Lec/

# Chapter 17: Classification

Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. You have already seen linear regression, which is one kind of machine learning. This chapter introduces a new one: classification.

Classification is about learning how to make predictions from past examples. We are given some examples where we have been told what the correct prediction was, and we want to learn from those examples how to make good predictions in the future.

Classification requires data. It involves looking for patterns, and to find patterns, you need data. That’s where the data science comes in. In particular, we’re going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of these pre-classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

### Nearest Neighbor Classifier

Let’s work through an example data set that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the data set represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We’d like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as “has CKD” or “doesn’t have CKD” based on their blood test results.

In [None]:
ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd

Some of the variables are categorical (words like “abnormal”), and some quantitative. The quantitative variables all have different scales. We’re going to want to make comparisons and estimate distances, often by eye, so let’s select just a few of the variables and work in standard units. Then we won’t have to worry about the scale of each of the different variables.
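
To see what working in standard units does, here is a minimal sketch in plain Python (standard library only, with made-up numbers rather than the CKD data). The helper name `to_standard_units` is hypothetical; it subtracts the mean and divides by the population SD, which is also what `np.std` computes by default.

```python
import statistics

def to_standard_units(values):
    """Convert a list of numbers to standard units (hypothetical helper)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, like np.std's default
    return [(v - mean) / sd for v in values]

# Made-up measurements on very different scales
glucose = [70, 90, 110, 130, 150]
hemoglobin = [11.0, 12.5, 14.0, 15.5, 17.0]

glucose_su = to_standard_units(glucose)
hemoglobin_su = to_standard_units(hemoglobin)

# After conversion, both lists have mean 0 and SD 1, so they can be compared directly
print(glucose_su)
print(hemoglobin_su)
```

In this made-up example the two lists come out identical in standard units, even though the raw numbers differ by an order of magnitude.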

In [None]:
def standard_units(x):
    return (x - np.mean(x))/np.std(x)

In [None]:
ckd = Table().with_columns(
    'Hemoglobin', standard_units(ckd.column('Hemoglobin')),
    'Glucose', standard_units(ckd.column('Glucose')),
    'White Blood Cell Count', standard_units(ckd.column('White Blood Cell Count')),
    'Class', ckd.column('Class')
)

In [None]:
ckd

Let’s look at two columns in particular: the hemoglobin level (in the patient’s blood), and the blood glucose level (at a random time in the day; without fasting specially for the blood test).

In [None]:
color_table = Table().with_columns(
    'Class', make_array(1, 0),
    'Color', make_array('darkblue', 'gold')
)
ckd = ckd.join('Class', color_table)

In [None]:
ckd.scatter('Hemoglobin', 'Glucose', group='Class')

Suppose Alice is a new patient who is not in the data set. If I tell you Alice’s hemoglobin level and blood glucose level, could you predict whether she has CKD?

In [None]:
# In this example, Alice's Hemoglobin attribute is 0 and her Glucose is 1.5.
ckd.scatter('Hemoglobin', 'Glucose', group='Class')
plots.scatter(0, 1.5, color = 'red')

If we have Alice’s hemoglobin and glucose numbers, we can put her somewhere on this scatterplot; the hemoglobin is her x-coordinate, and the glucose is her y-coordinate. Now, to predict whether she has CKD or not, we find the nearest point in the scatterplot and check whether it is blue or gold; we predict that Alice should receive the same diagnosis as that patient.

This is called ***nearest neighbor classification***.

Thus our nearest neighbor classifier works like this:

   * Find the point in the training set that is nearest to the new point.

   * If that nearest point is a “CKD” point, classify the new point as “CKD”. If the nearest point is a “not CKD” point, classify the new point as “not CKD”.
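
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four training points are made up, and the function name `nearest_neighbor_class` is hypothetical rather than part of the course code.

```python
import math

# Made-up training points: (hemoglobin, glucose) in standard units, plus a class label
training_points = [
    ((-1.2, 1.8), "CKD"),
    ((-0.4, 0.9), "CKD"),
    ((0.9, -0.5), "not CKD"),
    ((1.1, -0.3), "not CKD"),
]

def nearest_neighbor_class(new_point, training):
    """Return the class of the training point closest to new_point."""
    nearest_coords, nearest_label = min(
        training, key=lambda pair: math.dist(pair[0], new_point)
    )
    return nearest_label

# A new point at hemoglobin 0 and glucose 1.5
print(nearest_neighbor_class((0.0, 1.5), training_points))
```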

The scatterplot suggests that this nearest neighbor classifier should be pretty accurate. Points in the lower-right will tend to receive a “no CKD” diagnosis, as their nearest neighbor will be a gold point. The rest of the points will tend to receive a “CKD” diagnosis, as their nearest neighbor will be a blue point. So the nearest neighbor strategy seems to capture our intuition pretty well, for this example.

#### Decision Boundary

<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/Nearest_Neighbors.jpeg?raw=1" width=400 height=400 />
<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/Nearest_Neighbors_26_0.png?raw=1" width=600 height=400 />

However, the separation between the two classes won’t always be quite so clean. For instance, suppose that instead of hemoglobin levels we were to look at white blood cell count. Look at what happens:

In [None]:
ckd.scatter('White Blood Cell Count', 'Glucose', group='Class')

If we are given Alice’s glucose level and white blood cell count, can we predict whether she has CKD? Yes, we can make a prediction, but we shouldn’t expect it to be 100% accurate. 

To improve on *nearest neighbor* we will consider the $k$ nearest neighbors, using the *k-nearest neighbor classifier*. To predict Alice’s diagnosis, rather than looking at just the one neighbor closest to her, we can look at the $k$ points that are closest to her, and use the diagnosis for each of those $k$ points to predict Alice’s diagnosis. We usually pick $k$ to be an odd number so we do not have to deal with ties.
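
The reason for preferring an odd $k$ can be seen in a small sketch of the vote itself. The function `majority_class` below is a hypothetical helper written in plain Python, and the neighbor labels are made up; with two classes, an odd number of votes cannot split evenly, while an even number can.

```python
from collections import Counter

def majority_class(neighbor_classes):
    """Return (winning class, is_tie) for a list of neighbor labels."""
    counts = Counter(neighbor_classes).most_common()
    top_label, top_count = counts[0]
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie

# k = 5 with two classes: a strict majority always exists
print(majority_class(["CKD", "not CKD", "CKD", "CKD", "not CKD"]))

# k = 4 with two classes: the vote can split 2-2
print(majority_class(["CKD", "not CKD", "CKD", "not CKD"]))
```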

Let's put it to a test.  We will split our data into training and test sets.  We will build a model with the training data and apply the model to the test set to measure the accuracy of the model. 

Every model should have three groups of individuals:

   * a training set on which we can do any amount of exploration to build our classifier;

   * a separate testing set on which to try out our classifier and see what fraction of times it classifies correctly;

   * the underlying population of individuals for whom we don’t know the true classes; the hope is that our classifier will succeed about as well for these individuals as it did for our testing set.

How to generate the training and testing sets? You’ve guessed it – we’ll select at random.

There are 158 individuals in ckd. Let’s use a random half of them for training and the other half for testing. To do this, we’ll shuffle all the rows, take the first 79 as the training set, and the remaining 79 for testing.

In [None]:
shuffled_ckd = ckd.sample(with_replacement=False)
training = shuffled_ckd.take(np.arange(79))
testing = shuffled_ckd.take(np.arange(79, 158))

In [None]:
training.scatter('White Blood Cell Count', 'Glucose', group='Color')
plots.xlim(-2, 6)
plots.ylim(-2, 6);

### Rows of Tables

Until this chapter, we have worked mostly with single columns of tables. But now we have to see whether one individual is “close” to another. Data for individuals are contained in rows of tables.

So let’s start by taking a closer look at rows.

Here is the original ckd data:

In [None]:
ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

In [None]:
ckd.row(0)

Rows are in general **not arrays**, as their elements can be of different types. For example, some of the elements of the row above are strings (like 'abnormal') and some are numerical. So the row can’t be converted into an array.

However, rows share some characteristics with arrays. You can use item to access a particular element of a row. For example, to access the Albumin level of Patient 0, we can look at the labels in the printout of the row above to find that it’s item 3:

In [None]:
ckd.row(0).item(3)

Rows whose elements are all numerical (or all strings) can be converted to arrays. Converting a row to an array gives us access to arithmetic operations and other nice NumPy functions, so it is often useful.

Recall that we are trying to classify the patients as ‘CKD’ or ‘not CKD’, based on two attributes, Hemoglobin and Glucose, both measured in standard units.

In [None]:
ckd = Table().with_columns(
    'Hemoglobin', standard_units(ckd.column('Hemoglobin')),
    'Glucose', standard_units(ckd.column('Glucose')),
    'Class', ckd.column('Class')
)

color_table = Table().with_columns(
    'Class', make_array(1, 0),
    'Color', make_array('darkblue', 'gold')
)
ckd = ckd.join('Class', color_table)
ckd

Here is a scatter plot of the two attributes, along with a red point corresponding to Alice, a new patient. Her value of hemoglobin is 0 (that is, at the average) and glucose 1.1 (that is, 1.1 SDs above average).

In [None]:
alice = make_array(0, 1.1)
ckd.scatter('Hemoglobin', 'Glucose', group='Color')
plots.scatter(alice.item(0), alice.item(1), color='red', s=30);

To find the distance between Alice’s point and any of the other points, we only need the values of the attributes:

In [None]:
ckd_attributes = ckd.select('Hemoglobin', 'Glucose')
ckd_attributes

Because the rows now consist only of numerical values, it is possible to convert them to arrays.

In [None]:
ckd_attributes.row(3)

In [None]:
np.array(ckd_attributes.row(3))

The main calculation we need to do is to find the distance between Alice’s point and any other point. For this, the first thing we need is a way to compute the distance between any pair of points.

How do we do this? In 2-dimensional space, it’s pretty easy. If we have a point at coordinates $(x_0,y_0)$
and another at $(x_1,y_1)$, the distance between them is $D = \sqrt{(x_0-x_1)^2 + (y_0-y_1)^2}$
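
As a quick check of the formula, here is a worked example in plain Python with made-up coordinates. The function name `distance_2d` is hypothetical; the two points are chosen so that the horizontal and vertical differences are 3 and 4.

```python
import math

def distance_2d(p, q):
    """Euclidean distance between two points given as (x, y) pairs."""
    (x0, y0), (x1, y1) = p, q
    return math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)

# Differences of 3 and 4 give sqrt(9 + 16) = 5
print(distance_2d((0, 1.5), (3, 5.5)))  # 5.0
```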

In [None]:
patient3 = np.array(ckd_attributes.row(3))
alice, patient3

In [None]:
distance = np.sqrt(np.sum((alice - patient3)**2))
distance

Let's wrap this into a function.

In [None]:
def distance(point1, point2):
    """Returns the Euclidean distance between point1 and point2.
    
    Each argument is an array containing the coordinates of a point."""
    return np.sqrt(np.sum((point1 - point2)**2))

In [None]:
distance(alice, patient3)

If we want to classify Alice using a k-nearest neighbor classifier, we have to identify her $k$ nearest neighbors. What are the steps in this process? Suppose $k = 5$. Then the steps are:

   * Step 1. Find the distance between Alice and each point in the training sample.

   * Step 2. Sort the data table in increasing order of the distances.

   * Step 3. Take the top 5 rows of the sorted table.
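
The three steps can be sketched end to end in plain Python before we carry them out on the table. Everything here is an assumption for illustration: the eight training rows are made up, and the names `training_sample` and `new_patient` are hypothetical.

```python
import math

# Made-up training sample: (hemoglobin, glucose, class), in standard units
training_sample = [
    (-1.5, 1.9, 1), (-0.3, 1.0, 1), (0.2, 1.4, 1), (-0.8, 0.4, 1),
    (0.5, 0.3, 0), (0.9, -0.4, 0), (1.2, -0.6, 0), (0.3, 0.6, 0),
]
new_patient = (0.0, 1.1)
k = 5

# Step 1: distance from the new point to every training point
with_distances = [
    (math.dist(new_patient, (h, g)), h, g, cls) for h, g, cls in training_sample
]

# Step 2: sort in increasing order of distance
sorted_by_dist = sorted(with_distances, key=lambda row: row[0])

# Step 3: keep the top k rows
k_nearest = sorted_by_dist[:k]
for row in k_nearest:
    print(row)
```

In this made-up sample, three of the five nearest rows have class 1 and two have class 0.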

What we need is a function that finds the distance between Alice and another point whose coordinates are contained in a row. The function distance returns the distance between any two points whose coordinates are in arrays. We can use that to define distance_from_alice, which takes a row as its argument and returns the distance between that row and Alice.

In [None]:
def distance_from_alice(row):
    """Returns distance between Alice and a row of the attributes table"""
    return distance(alice, np.array(row))

In [None]:
distance_from_alice(ckd_attributes.row(3))

Recall that if you want to apply a function to each element of a column of a table, one way to do that is by the call *table_name.apply(function_name, column_label)*. This evaluates to an array consisting of the values of the function when we call it on each element of the column. If you call *apply* without a column label, the entire row is passed to the function, so each entry of the resulting array is based on the corresponding row of the table.

In [None]:
ckd_attributes.apply(distance_from_alice)

Let's put the distances into a table.

In [None]:
ckd_with_distances = ckd.with_column('Distance from Alice', ckd_attributes.apply(distance_from_alice))
ckd_with_distances

For Step 2, let’s sort the table in increasing order of distance:

In [None]:
sorted_by_distance = ckd_with_distances.sort('Distance from Alice')
sorted_by_distance

Step 3: The top 5 rows correspond to Alice’s 5 nearest neighbors; you can replace 5 by any other positive integer.

In [None]:
alice_5_nearest_neighbors = sorted_by_distance.take(np.arange(5))
alice_5_nearest_neighbors

Three of Alice’s five nearest neighbors are blue points and two are gold. So a 5-nearest neighbor classifier would classify Alice as blue: it would predict that Alice has chronic kidney disease.

The graph below zooms in on Alice and her five nearest neighbors. The two gold ones are just inside the circle, directly below the red point. The classifier says Alice is more like the three blue ones around her.

<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/Rows_of_Tables_49_0.png?raw=1" width=400 height=400 />

### Banknote authentication example

This time we’ll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a data set for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. For each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let’s load it into a table and take a look.

In [None]:
banknotes = Table.read_table('banknote.csv')
banknotes

In [None]:
color_table = Table().with_columns(
    'Class', make_array(1, 0),
    'Color', make_array('darkblue', 'gold')
)

In [None]:
banknotes = banknotes.join('Class', color_table)
banknotes.scatter('WaveletVar', 'WaveletCurt', group='Color')

Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the gold cluster. This indicates that there will be some images where it’s hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a k-nearest neighbor classifier to predict the legitimacy of a banknote.

In [None]:
banknotes.scatter('WaveletSkew', 'Entropy', group='Color')

There does seem to be a pattern, but it’s a pretty complex one. Nonetheless, the k-nearest neighbors classifier can still be used and will effectively “discover” patterns out of this. 

So far I’ve been assuming that we have exactly 2 attributes that we can use to help us make our prediction. What if we have more than 2? For instance, what if we have 3 attributes?

There’s nothing special about 2 or 3. If you have 4 attributes, you can use the k-nearest neighbors classifier in 4 dimensions. 5 attributes? Work in 5-dimensional space. And no need to stop there! This all works for arbitrarily many attributes; you just work in a very high dimensional space. 

In [None]:
ax = plots.figure(figsize=(8,8)).add_subplot(111, projection='3d')
ax.scatter(banknotes.column('WaveletSkew'), 
           banknotes.column('WaveletVar'), 
           banknotes.column('WaveletCurt'), 
           c=banknotes.column('Color'));

When we use these 3 attributes, the two clusters have almost no overlap. In other words, a classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

### Distance in multiple dimensions

In 3-dimensional space, the points are $(x_0, y_0, z_0)$ and $(x_1, y_1, z_1)$ , and the formula for the distance between them is:

$D = \sqrt{(x_0-x_1)^2 + (y_0-y_1)^2 + (z_0-z_1)^2}$

In $n$-dimensional space, things are a bit harder to visualize but the equation follows the same pattern.
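
Written out with assumed notation (coordinates $a_1, \ldots, a_n$ for one point and $b_1, \ldots, b_n$ for the other, which are not symbols used elsewhere in this chapter), the pattern is $D = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + \cdots + (a_n-b_n)^2}$. A minimal sketch in plain Python, with a hypothetical function name and made-up points:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Made-up 4-dimensional points; each coordinate differs by 1, so the distance is sqrt(4)
p = (1.0, 2.0, 3.0, 4.0)
q = (2.0, 3.0, 4.0, 5.0)
print(euclidean_distance(p, q))  # 2.0
```

The same function works unchanged for 2, 3, or 13 coordinates, which is why the array-based `distance` function defined earlier carries over to tables with many attributes.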

Let's look at a new example. The table [wine](https://archive.ics.uci.edu/ml/datasets/Wine) contains the chemical composition of 178 different Italian wines. The classes are the grape species, called cultivars. There are three classes but let’s just see whether we can tell Class 1 apart from the other two.

In [None]:
wine = Table.read_table('wine.csv')

# For converting Class to binary

def is_one(x):
    if x == 1:
        return 1
    else:
        return 0
    
wine = wine.with_column('Class', wine.apply(is_one, 0))
wine

In [None]:
wine_attributes = wine.drop('Class')

The first two wines are both in Class 1. To find the distance between them, we use the table of just the attributes, wine_attributes, created above:

In [None]:
distance(np.array(wine_attributes.row(0)), np.array(wine_attributes.row(1)))

The last wine in the table is of Class 0. Its distance from the first wine is:

In [None]:
distance(np.array(wine_attributes.row(0)), np.array(wine_attributes.row(177)))

In [None]:
wine_with_colors = wine.join('Class', color_table)

In [None]:
wine_with_colors.scatter('Flavanoids', 'Alcohol', group='Class')

In [None]:
wine_with_colors.scatter('Alcalinity of Ash', 'Ash', group='Class')

In [None]:
wine_with_colors.scatter('Magnesium', 'Total Phenols', group='Class')

### Let's build a classifier

The input is a point that we want to classify. The classifier works by finding the $k$ nearest neighbors of point from the training set. So, our approach will go like this:

   * Find the closest $k$ neighbors of point, i.e., the $k$ wines from the training set that are most similar to point.
   * Look at the classes of those $k$ neighbors, and take the majority vote to find the most-common class of wine. Use that as our predicted class for point.
   
To implement the first step for the kidney disease data, we had to compute the distance from each patient in the training set to point, sort them by distance, and take the $k$ closest patients in the training set.

That’s what we did in the previous section with the point corresponding to Alice. Let’s generalize that code. We’ll redefine distance here, just for convenience.

In [None]:
def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array 
    consisting of the coordinates of the point"""
    return np.sqrt(np.sum((point1 - point2)**2))

def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop('Class')
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    return attributes.apply(distance_from_point)

def table_with_distances(training, new_point):
    """Augments the training table 
    with a column of distances from new_point"""
    return training.with_column('Distance', all_distances(training, new_point))

def closest(training, new_point, k):
    """Returns a table of the k rows of the augmented table
    corresponding to the k smallest distances"""
    with_dists = table_with_distances(training, new_point)
    sorted_by_distance = with_dists.sort('Distance')
    topk = sorted_by_distance.take(np.arange(k))
    return topk

In [None]:
special_wine = wine.drop('Class').row(0)

In [None]:
closest(wine, special_wine, 5)

Next we need to take a “majority vote” of the nearest neighbors and assign our point the same class as the majority.

In [None]:
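def majority(topkclasses):
    """Returns the most common class among the k nearest neighbors"""
    # Sort counts in descending order so the most common class comes first
    return topkclasses.group('Class').sort('count', descending=True).column('Class').item(0)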

def classify(training, new_point, k):
    closestk = closest(training, new_point, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

In [None]:
classify(wine, special_wine, 5)

If we change special_wine to be the last one in the dataset, is our classifier able to tell that it’s in Class 0?

In [None]:
special_wine = wine.drop('Class').row(177)
classify(wine, special_wine, 5)

But we don’t yet know how it does with all the other wines, and in any case we know that testing on wines that are already part of the training set might be over-optimistic. We will split the data into training and test sets and measure the accuracy of the classifier.
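
Accuracy here means the fraction of test rows whose predicted class matches the known class. A minimal sketch of that calculation in plain Python, using a hypothetical `accuracy` helper and made-up predictions rather than the wine data:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known classes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Made-up predictions for an 8-row test set; 6 of the 8 match
predicted_classes = [1, 0, 0, 1, 1, 0, 0, 0]
actual_classes = [1, 0, 1, 1, 0, 0, 0, 0]
print(accuracy(predicted_classes, actual_classes))  # 0.75
```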

In [None]:
shuffled_wine = wine.sample(with_replacement=False) 
training_set = shuffled_wine.take(np.arange(89))
test_set  = shuffled_wine.take(np.arange(89, 178))

In [None]:
def evaluate_accuracy(training, test, k):
    test_attributes = test.drop('Class')
    def classify_testrow(row):
        return classify(training, row, k)
    c = test_attributes.apply(classify_testrow)
    # Compare predictions with the classes of the test table that was passed in
    return (test.num_rows - np.count_nonzero(c - test.column('Class'))) / test.num_rows

In [None]:
evaluate_accuracy(training_set, test_set, 5)

The accuracy rate isn’t bad at all for a simple classifier.

# Chapter 18: Updating Predictions


Suppose that we eventually find out the true class of our new point. Then we will know whether we got the classification right. Also, we will have a new point that we can add to our training set, because we know its class. This updates our training set. So, naturally, we will want to update our classifier based on the new training set.

This chapter looks at some simple scenarios where new data leads us to update our predictions. While the examples in the chapter are simple in terms of calculation, the method of updating can be generalized to work in complex settings and is one of the most powerful tools used for machine learning.



Let’s try to use data to classify a point into one of two categories, choosing the category that we think is more likely than not. To do this, we not only need the data but also a clear description of how chances are involved.

We will start out in a simple artificial setting just to develop the main technique, and then move to a more intriguing example.

Suppose there is a university class with the following composition:

   * 60% of the students are Second Years and the remaining 40% are Third Years

   * 50% of the Second Years have declared their major

   * 80% of the Third Years have declared their major

Now suppose I pick a student at random from the class. Can you classify the student as Second Year or Third Year, using our “more likely than not” criterion?

You can, because the student is picked at random and so you know that the chance that the student is a Second Year is 60%. That’s greater than the 40% chance of being a Third Year, so you would classify the student as Second Year.

The information about the majors is irrelevant, as we already know the proportions of Second and Third Years in the class.

We have a pretty simple classifier!

In [None]:
students = Table().with_columns('Year', np.concatenate((np.full(60, "Second"), np.full(40, "Third"))),
                               'Major', np.concatenate((np.full(30, "Declared"), np.full(30, "Undeclared"), np.full(32, "Declared"), np.full(8, "Undeclared"))))

In [None]:
students

In [None]:
students.pivot('Major', 'Year')

But now suppose I give you some additional information about the student who was picked:

The student has declared a major.

Would this knowledge change your classification?

Now it becomes important to look at the relation between year and major declaration. It’s still true that more students are Second Years than Third Years. But it’s also true that among the Third Years, a much higher percent have declared their major than among the Second Years. Our classifier has to take both of these observations into account.

Now the student can only be in one of the two Declared cells.

There are 62 students in those cells, and 32 out of the 62 are Third Years. That’s more than half, even though not by much.

So, in the light of the new information about the student’s major, we have to update our prediction and now classify the student as a Third Year.

What is the chance that our classification is correct? We will be right for all the 32 Third Years who are Declared, and wrong for the 30 Second Years who are Declared. The chance that we are correct is therefore about 0.516.

In other words, the chance that we are correct is the proportion of Third Years among the students who have Declared.

In [None]:
32/(30+32)

In [None]:
students.pivot('Major', 'Year')

<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/tree_students.png?raw=1" width=400 height=400 />

Like the pivot table, this diagram partitions the students into four distinct groups known as “branches”. Notice that the “Third Year, Declared” branch contains the proportion 0.4 x 0.8 = 0.32 of the students, corresponding to the 32 students in the “Third Year, Declared” cell of the pivot table. The “Second Year, Declared” branch contains 0.6 x 0.5 = 0.3 of the students, corresponding to the 30 in the “Second Year, Declared” cell of the pivot table.

We know that the student who was picked belongs to a “Declared” branch; that is, the student is either in the top branch or the third from top. Those two branches now form our reduced space of possibilities, and all chances have to be calculated relative to the total chance of this reduced space.

So, given that the student is Declared, the chance of them being a Third Year can be calculated directly from the tree. The answer is the proportion in the “Third Year, Declared” branch relative to the total proportion in the two “Declared” branches.

That is, the answer is the proportion of Third Years among students who are Declared, as before.

In [None]:
(0.4 * 0.8)/(0.6 * 0.5  +  0.4 * 0.8)

The method that we have just used is due to the Reverend [Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes) (1701-1761). His method solved what was called an “inverse probability” problem: given new data, how can you update chances you had found earlier? Though Bayes lived three centuries ago, his method is [widely used](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) now in machine learning.

We will state the rule in the context of our population of students. First, some terminology:

**Prior probabilities**. Before we knew the chosen student’s major declaration status, the chance that the student was a Second Year was 60% and the chance that the student was a Third Year was 40%. These are the prior probabilities of the two categories.

**Likelihoods**. These are the chances of the Major status, given the category of student; thus they can be read off the tree diagram. For example, the likelihood of Declared status given that the student is a Second Year is 0.5.

**Posterior probabilities**. These are the chances of the two Year categories, after we have taken into account information about the Major declaration status. We computed one of these:

The posterior probability that the student is a Third Year, given that the student has Declared, is denoted $P(\text{Third Year} ~\big{\vert}~ \text{Declared})$
and is calculated as follows:

$$\begin{aligned}
P(\text{Third Year} ~\big{\vert}~ \text{Declared})
~ &=~ \frac{0.4 \times 0.8}{0.6 \times 0.5 ~+~ 0.4 \times 0.8} \\ \\
&=~ \frac{\text{(prior probability of Third Year)} \times
\text{(likelihood of Declared given Third Year)}}
{\text{total probability of Declared}}
\end{aligned}$$

The other posterior probability is:

$$\begin{aligned}
P(\text{Second Year} ~\big{\vert}~ \text{Declared})
~ &=~ \frac{0.6 \times 0.5}{0.6 \times 0.5 ~+~ 0.4 \times 0.8} \\ \\
&=~ \frac{\text{(prior probability of Second Year)} \times
\text{(likelihood of Declared given Second Year)}}
{\text{total probability of Declared}}
\end{aligned}$$

In [None]:
(0.6 * 0.5)/(0.6 * 0.5  +  0.4 * 0.8)

Notice that both the posterior probabilities have the same denominator: the chance of the new information, which is that the student has Declared.

Because of this, Bayes’ method is sometimes summarized as a statement about proportionality: 

$\mbox{posterior} ~ \propto ~ \mbox{prior} \times \mbox{likelihood}$
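
The proportionality statement translates directly into a short calculation: multiply each prior by its likelihood, then divide by the total so the results add up to 1. The sketch below is plain Python; the function name `posteriors` and the dictionary layout are assumptions for illustration, while the numbers are the ones from the student example.

```python
def posteriors(priors, likelihoods):
    """Return posterior probabilities: prior x likelihood, then normalize.

    priors and likelihoods are dicts keyed by category.
    """
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())  # total probability of the new information
    return {c: value / total for c, value in unnormalized.items()}

# Numbers from the student example; likelihoods are P(Declared | Year)
priors = {"Second Year": 0.6, "Third Year": 0.4}
likelihood_declared = {"Second Year": 0.5, "Third Year": 0.8}

result = posteriors(priors, likelihood_declared)
print(result)
```

The value for Third Year agrees with the 32/(30+32) computed from the pivot table.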


Formulas are great for efficiently describing calculations. But in settings like our example about students, it is simpler not to think in terms of formulas. Just use the tree diagram.

### Medical Test Example

Many medical tests for diseases return Positive or Negative results.

Medical tests are carefully designed to be very accurate. But few tests are accurate 100% of the time. Almost all tests make errors of two kinds:

   * A false positive is an error in which the test concludes Positive but the patient doesn’t have the disease.

   * A false negative is an error in which the test concludes Negative but the patient does have the disease.
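
The two kinds of error can be counted by comparing each patient's true condition with their test result. This is a minimal sketch in plain Python with made-up results for six patients; the function name `count_errors` is hypothetical.

```python
def count_errors(has_disease, tested_positive):
    """Count false positives and false negatives from paired lists of booleans."""
    false_positives = sum(
        1 for sick, pos in zip(has_disease, tested_positive) if pos and not sick
    )
    false_negatives = sum(
        1 for sick, pos in zip(has_disease, tested_positive) if sick and not pos
    )
    return false_positives, false_negatives

# Made-up results for six patients
has_disease = [True, True, False, False, False, True]
tested_positive = [True, False, True, False, False, True]
print(count_errors(has_disease, tested_positive))  # (1, 1)
```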

These errors can affect people’s decisions. False positives can cause anxiety and unnecessary treatment (which in some cases is expensive or dangerous). False negatives can have even more serious consequences if the patient doesn’t receive treatment because of their Negative test result.


Suppose there is a large population and a disease that strikes a tiny proportion of the population. The tree diagram below summarizes information about such a disease and about a medical test for it.

<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/tree_disease_rare.png?raw=1" width=400 height=400 />

So suppose a person is picked at random from the population and tested. If the test result is Positive, how would you classify them: Disease, or No disease?

We can answer this by applying Bayes’ Rule and using our “more likely than not” classifier. Given that the person has tested Positive, the chance that they have the disease is the proportion in the top branch, relative to the total proportion in the Test Positive branches.

In [None]:
(0.004 * 0.99)/(0.004 * 0.99  +  0.996*0.005 )

The chance that they have the disease is only about 44%! That is less than half, so our “more likely than not” classifier says No disease. This is a strange conclusion. We have a pretty accurate test, and a person who has tested Positive, and our classification is … that they don’t have the disease? That doesn’t seem to make any sense.

However, keep in mind that the person was picked at random from the entire population, where the disease is very rare. In such a population, a Positive result is more likely to be a false positive than a true positive.

The tiny fraction of people without the disease who falsely test Positive still outnumber the people who correctly test Positive.

In [None]:
disease = Table().with_columns('True Condition', np.concatenate((np.full(400, "Disease"), np.full(99600, "No Disease"))),
                               'Test Result', np.concatenate((np.full(396, "Positive"), np.full(4, "Negative"), np.full(99102, "Negative"), np.full(498, "Positive"))))

In [None]:
disease.pivot('Test Result', 'True Condition')

In [None]:
396/(396 + 498)

Suppose the doctor’s subjective opinion is that there is a 5% chance that the patient has the disease. Then just the prior probabilities in the tree diagram will change:

<img src="https://github.com/Mrsnellek/DS410/blob/main/Week_7/Week_7_Lec/tree_disease_subj.png?raw=1" width=400 height=400 />

In [None]:
(0.05 * 0.99)/(0.05 * 0.99  +  0.95 * 0.005)

Even though the doctor has a pretty low prior probability (5%) that the patient has the disease, once the patient tests Positive the posterior probability of having the disease shoots up to more than 91%.

If the patient tests Positive, it would be reasonable for the doctor to proceed as though the patient has the disease.
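
To see how strongly the conclusion depends on the prior, the same calculation can be repeated for several priors. The sketch below is plain Python; the function and parameter names are hypothetical, and the rates 0.99 and 0.005 are the ones used in the calculations above. The prior of 0.5 is an extra made-up value added only for comparison.

```python
def posterior_given_positive(prior, true_positive_rate=0.99, false_positive_rate=0.005):
    """P(Disease | Positive) for a given prior probability of disease."""
    true_pos = prior * true_positive_rate
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Population prior, the doctor's subjective prior, and a made-up even prior
for prior in (0.004, 0.05, 0.5):
    print(prior, round(posterior_given_positive(prior), 3))
```

The test itself is the same in every row; only the prior changes, and the posterior rises with it.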

In [None]:
disease_subj = Table().with_columns('True Condition', np.concatenate((np.full(5000, "Disease"), np.full(95000, "No Disease"))),
                               'Test Result', np.concatenate((np.full(4950, "Positive"), np.full(50, "Negative"), np.full(94525, "Negative"), np.full(475, "Positive"))))

In [None]:
disease_subj.pivot('Test Result', 'True Condition')

In [None]:
4950/(4950 + 475)

# The End