In [1]:
from datascience import *
import numpy as np
#np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)  

## Years and Majors

Scenario:

Class consists of second years (60%) and third years (40%)

- 50% of the second years have declared their major
- 80% of the third years have declared their major
- I pick one student at random.

**Which is more likely: second year or third year?**


## Possible Approach 1: Tree Diagram Calculation

In [2]:
# P(third year | declared), from tree diagram

(0.4 * 0.8) / (0.6 * 0.5 + 0.4 * 0.8)

0.5161290322580645

In [3]:
# P(second year | declared), from tree diagram

(0.6 * 0.5) / (0.6 * 0.5 + 0.4 * 0.8)

0.4838709677419354
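
The two tree-diagram calculations above share one pattern: the branch you care about, divided by the sum of all branches that end in "declared". As a minimal sketch, that pattern can be wrapped in a helper. The function name `posterior` and its parameter names are illustrative and not part of the original notebook.

```python
def posterior(prior, likelihood, likelihood_alt):
    """P(H | E) for a two-branch tree.

    prior          = P(H)
    likelihood     = P(E | H)
    likelihood_alt = P(E | not H)
    """
    numerator = prior * likelihood
    return numerator / (numerator + (1 - prior) * likelihood_alt)

# Third year: prior 0.4, 80% declared; the other branch (second years) is 50% declared.
p_third = posterior(0.4, 0.8, 0.5)
# Second year: prior 0.6, 50% declared; the other branch (third years) is 80% declared.
p_second = posterior(0.6, 0.5, 0.8)

print(p_third, p_second)
```

The two results should sum to 1, since a declared student is either a second year or a third year.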

## Possible Approach 2: Data Science Approach

In [8]:
# Build the population as arrays of labels, one entry per student.
# np.append returns a new array with the given values added at the end.

n = 100
second = round(n * 0.6)
third = n - second

year = []
year = np.append(year, ['Second']*second)
year = np.append(year, ['Third']*third)

major = []
major = np.append(major, ['Declared'] * round(second * 0.5))
major = np.append(major, ['Undeclared'] * (second - round(second * 0.5)))
major = np.append(major, ['Declared'] * round(third * 0.8))
major = np.append(major, ['Undeclared'] * (third  - round(third * 0.8)))


In [9]:
students = Table().with_columns(
    'Year', year,
    'Major', major
)

students

Year,Major
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
Second,Declared
... (90 rows omitted)


In [10]:
students.group('Year')

Year,count
Second,60
Third,40


In [11]:
students.pivot('Year', 'Major')

Major,Second,Third
Declared,30,32
Undeclared,30,8


In [12]:
# I pick one student at random...That student has declared a major!
# Second Year or Third Year?

# Probability of third year
32 / (30 + 32)

0.5161290322580645

In [13]:
# Probability of second year
30 / (30 + 32)

0.4838709677419355
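
The counts above can also be checked empirically by actually picking students at random, many times, and looking only at the picks who have declared. The sketch below uses the standard library `random` module instead of the `datascience` table, and the seed and number of draws are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # arbitrary seed, used only to make the run repeatable

# Same 100-student population as the pivot table: (year, major) pairs.
population = (
    [('Second', 'Declared')] * 30
    + [('Second', 'Undeclared')] * 30
    + [('Third', 'Declared')] * 32
    + [('Third', 'Undeclared')] * 8
)

# Pick one student at random, many times over (with replacement).
draws = [random.choice(population) for _ in range(100_000)]

# Keep only the picks who have declared, then ask how many are third years.
declared_years = [year for year, major in draws if major == 'Declared']
share_third = declared_years.count('Third') / len(declared_years)

print(share_third)
```

The printed share should land close to 32 / (30 + 32), with the small gap due to sampling variability.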

## Monty Hall

Suppose you're on a game show, and you're given the choice of three doors: 

- Behind one door is a car; behind the others, goats. 
- You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat.
- He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

In [14]:
# P(win car | switched), from tree diagram

(2/3 * 1/2) / (2/3 * 1/2 + 1/3 * 1/2)

0.6666666666666666
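
The same answer can be approached by simulation: play the game many times with and without switching and compare the win rates. This is a sketch under the usual assumptions (the car is placed uniformly at random, and the host always opens a goat door other than your pick); the function name `play`, the seed, and the trial count are illustrative.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)  # arbitrary seed for repeatability
trials = 100_000
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials

print(switch_rate, stay_rate)
```

Switching should win roughly two thirds of the time and staying roughly one third.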

## Disease Decisions

### Interpretation by Physicians of Clinical Laboratory Results (1978)

"We asked 20 house officers, 20 fourth-year medical students and 20 attending physicians, selected in 67 consecutive hallway encounters at four Harvard Medical School teaching hospitals, the following question:

- 'If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming that you know nothing about the person's symptoms or signs?'"


## Possible Approach 1: Tree Diagram Calculation

In [15]:
# P(disease | tested +)
# if prior probability of disease is 1/1000
# (the factor of 1 assumes the test is positive for everyone who has the disease)

(0.001 * 1) / (0.001*1 + 0.999*0.05)

0.019627085377821395

## Possible Approach 2: Data Science Approach

In [16]:
def create_population(prior_disease_prob, n):
    """Build a table of n people with a disease status and a test result."""
    disease = round(n * prior_disease_prob)
    no_disease = n - disease
    false_positives = round(no_disease * 0.05)  # 5% false positive rate

    status = []
    status = np.append(status, ['Disease'] * disease)
    status = np.append(status, ['No disease'] * no_disease)

    # Everyone with the disease tests positive; the healthy group is
    # split into false positives and true negatives.
    result = []
    result = np.append(result, ['Test +'] * disease)
    result = np.append(result, ['Test +'] * false_positives)
    result = np.append(result, ['Test -'] * (no_disease - false_positives))

    t = Table().with_columns(
        'Status', status,
        'Test Result', result
    )
    return t

In [17]:
create_population(1/1000, 10000)

In [13]:
create_population(1/1000, 10000).pivot('Test Result', 'Status')

Status,Test +,Test -
Disease,10,0
No disease,500,9490
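
Reading the answer off the pivot table works the same way as in the Years and Majors example: divide the diseased people who tested positive by everyone who tested positive. A small sketch using exact fractions follows; the variable names are illustrative.

```python
from fractions import Fraction

# Counts taken from the 'Test +' column of the pivot table above.
true_positives = 10     # Disease, Test +
false_positives = 500   # No disease, Test +

# P(disease | tested +) as an exact fraction, then as a decimal.
p_disease_given_positive = Fraction(true_positives, true_positives + false_positives)

print(p_disease_given_positive, float(p_disease_given_positive))
```

This gives 1/51, or about 0.0196. It differs slightly from the tree-diagram value because 5% of 9,990 people is 499.5, which the population builder rounds to 500.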


## More Common Disease

If the prior probability of disease is 1/10, what do you observe?

In [17]:
# P(disease | tested +)
# if prior probability of disease is 1/10

(0.1 * 1) / (0.1*1 + 0.9*0.05)

0.689655172413793

In [20]:
# Can we confirm this result using the data science approach?

create_population(1/10, 10000).pivot('Test Result', 'Status')

Status,Test +,Test -
Disease,1000,0
No disease,450,8550


In [21]:
# P(disease | tested +)
# if prior probability of disease is 1/10

1000 / (1000 + 450)

0.6896551724137931
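
To see how strongly the answer depends on the prior, the tree-diagram calculation can be repeated over a range of priors. In the sketch below, the priors 1/1000 and 1/10 come from the examples above, while 1/100 and 1/2 are extra values added purely for illustration. The 5% false positive rate and the assumption that every diseased person tests positive are carried over unchanged.

```python
false_positive_rate = 0.05  # from the question quoted above
sensitivity = 1.0           # assumption used in the tree diagrams: no missed cases

posteriors = {}
for prior in [1/1000, 1/100, 1/10, 1/2]:
    positive_and_sick = prior * sensitivity
    positive_and_healthy = (1 - prior) * false_positive_rate
    posteriors[prior] = positive_and_sick / (positive_and_sick + positive_and_healthy)

for prior, post in posteriors.items():
    print(f"prior {prior:.3f} -> P(disease | tested +) = {post:.3f}")
```

With the test held fixed, the posterior rises steadily as the prior rises, which is why the same positive result means something very different for a rare disease than for a common one.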

**A probability of an outcome is…**

- The frequency with which it will occur in repeated trials, or
- The subjective degree of belief that it will occur (or has occurred)


**Why use subjective priors?**

- In order to quantify a belief that is relevant to a decision
- When the subject of your prediction was not selected randomly from the population
