# Probability

## Table of Contents

1. [Introduction](#first-bullet)
2. [The Probability Function](#The-Probability-Function)
3. [The Political Views and Parties](#The-Political-Views-and-Parties)

## Introduction <a class="anchor" id="first-bullet"></a>

- We will use data from the [General Social Survey](http://gss.norc.org/) (GSS) to work through the Linda the Banker example and show how to compute basic probabilities in Python.
- We will start by loading the dataset:

In [3]:
import pandas as pd
import pathlib

gss_file_location = pathlib.Path("../..") / "data" / "Think Bayes" / "Chapter 1" / "gss_bayes.csv"
gss_data = pd.read_csv(gss_file_location)
gss_data.head()

   caseid  year   age  sex  polviews  partyid  indus10
0       1  1974  21.0    1       4.0      2.0   4970.0
1       2  1974  41.0    1       5.0      0.0   9160.0
2       5  1974  58.0    2       6.0      1.0   2670.0
3       6  1974  30.0    1       5.0      4.0   6870.0
4       7  1974  48.0    1       5.0      4.0   7860.0


The DataFrame has one row for each person surveyed and one column for each variable I selected.

The columns are:

- caseid: Respondent id (loaded here as a regular column, because `read_csv` is called without `index_col`).
- year: Year when the respondent was surveyed.
- age: Respondent’s age when surveyed.
- sex: Male or female.
- polviews: Political views on a range from liberal to conservative.
- partyid: Political party affiliation, Democrat, Independent, or Republican.
- indus10: [Code](https://www.census.gov/eos/www/naics/) for the industry the respondent works in.
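If `caseid` should serve as the row index, `read_csv` accepts an `index_col` argument. The sketch below is illustrative only: it retypes the five rows displayed above as in-memory CSV text instead of reading the real file, and the names `csv_text` and `sample` are made up for the example.

```python
import io

import pandas as pd

# The five rows displayed above, retyped as CSV text.
# This is a stand-in for the real gss_bayes.csv file.
csv_text = """caseid,year,age,sex,polviews,partyid,indus10
1,1974,21.0,1,4.0,2.0,4970.0
2,1974,41.0,1,5.0,0.0,9160.0
5,1974,58.0,2,6.0,1.0,2670.0
6,1974,30.0,1,5.0,4.0,6870.0
7,1974,48.0,1,5.0,4.0,7860.0
"""

# index_col makes caseid the row index instead of a default 0..n-1 index.
sample = pd.read_csv(io.StringIO(csv_text), index_col="caseid")

print(sample.index.tolist())  # [1, 2, 5, 6, 7]
print(list(sample.columns))
```

With this option, Boolean Series derived from the table are labelled by `caseid` rather than by row position. The probabilities computed in the rest of the section do not depend on which index is used.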

- We will use `indus10` to find the bankers in the dataset. According to the code reference, banking falls under code 6870, so we can flag the bankers like this:

In [4]:
banker = (gss_data.indus10 == 6870)
banker.head()

0    False
1    False
2    False
3     True
4    False
Name: indus10, dtype: bool

Summing the Boolean Series counts its `True` values, which gives the number of bankers in the dataset:

In [5]:
f"Number of Bankers: {banker.sum()}"

'Number of Bankers: 728'

- To find the proportion of bankers in the sample we can use the `mean` method, which computes the fraction of `True` values.

In [6]:
f"The proportion of bankers is: {banker.mean()}"

'The proportion of bankers is: 0.014769730168391155'

This means that about 1.5% of the respondents work in banking. So if we choose someone at random from the sample, there is a probability of about 1.5% that they are a banker.
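To see why `sum` and `mean` behave this way, here is a minimal sketch on just the five `indus10` values displayed earlier (not the full dataset). The variable names are illustrative.

```python
import pandas as pd

# indus10 codes from the five rows displayed earlier (not the full dataset).
indus10 = pd.Series([4970.0, 9160.0, 2670.0, 6870.0, 7860.0])

is_banker = indus10 == 6870  # element-wise comparison yields a Boolean Series
count = is_banker.sum()      # True counts as 1 and False as 0
fraction = is_banker.mean()  # the sum divided by the number of rows

print(count, fraction)       # 1 0.2
```

One of the five rows has code 6870, so the count is 1 and the fraction is 1/5.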

## The Probability Function

- To make things easier we can wrap the code from the previous examples in a function that takes a proposition, represented as a Boolean Series, and returns its probability:

In [7]:
def prob(A):
    return A.mean()

f"The Probability of being a banker is: {prob(banker)}"

'The Probability of being a banker is: 0.014769730168391155'

- Going back to the Linda the Banker example, we can now look at some of the other variables, such as sex (where 2 means female):

In [8]:
female = (gss_data.sex == 2 )
f"The probability of being female is {prob(female)}"

'The probability of being female is 0.5378575776019476'

### The Political Views and Parties

- The other variables we will consider for this example are political views and party affiliation:
    - We define a respondent as liberal if their `polviews` value is 3 or lower (that is, less than 4).
    - We define a respondent as a Democrat if their `partyid` value is 0 or 1.

In [9]:
liberal = (gss_data.polviews <= 3 )
f"The Probability of being Liberal is {prob(liberal)}"

'The Probability of being Liberal is 0.27374721038750255'

In [10]:
democrat = (gss_data.partyid<=1)
f"The probability of being a democrat is: {prob(democrat)}"

'The probability of being a democrat is: 0.3662609048488537'

### Conjunction

- Conjunction is another name for the logical AND operation.
- In Python we can use the `&` operator to combine two Boolean Series element by element.
    - For instance, let's calculate the probability that someone is both a banker and a Democrat:

In [11]:
f"The Probability of being a Democrat and Banker is: {prob(banker & democrat)}"

'The Probability of being a Democrat and Banker is: 0.004686548995739501'

- As expected, conjunction is commutative:
$$P(Banker \cap Democrat) = P(Democrat \cap Banker)$$
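The commutativity of `&` can be checked directly. This is a minimal sketch on a hypothetical toy table (the values are made up, not taken from the GSS data). It also shows the parentheses needed when comparisons are combined inline.

```python
import pandas as pd

# Hypothetical toy respondents; the values are made up for illustration.
toy = pd.DataFrame({
    "indus10": [6870, 6870, 4970, 6870, 9160, 2670],
    "partyid": [0, 4, 1, 1, 6, 0],
})

banker = toy.indus10 == 6870
democrat = toy.partyid <= 1

# & is element-wise AND, so the operand order does not matter.
both = banker & democrat
assert (both == (democrat & banker)).all()

# When combining comparisons inline, parentheses are needed because
# & binds more tightly than == and <=.
inline = (toy.indus10 == 6870) & (toy.partyid <= 1)

print(both.mean())
```

Two of the six toy rows satisfy both conditions, so the printed probability is 2/6.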

### Conditional Probability

- A conditional probability is a probability that depends on a condition.
- Let's look at an example where we ask the following:
    - What is the probability that a respondent is a Democrat, given that they are liberal?
- Or, phrased differently:
    - Of all the respondents who are liberal, what fraction are Democrats?

In [12]:
democrat_liberal = democrat[liberal]
f"The probability of being a democrat given that the respondent is a liberal is: {prob(democrat_liberal)}"

'The probability of being a democrat given that the respondent is a liberal is: 0.5206403320240125'

In [13]:
def conditional(proposition, given):
    return prob(proposition[given])

- We can now try this function out to check the conditional probability of being liberal given that the respondent is female:

In [14]:
f"The probability of being liberal given that you are female is {conditional(liberal, given = female)}"

'The probability of being liberal given that you are female is 0.27581004111500884'

- So about 28% of female respondents are liberal.
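The Boolean indexing inside `conditional` amounts to restricting the table to the rows where the condition holds and then taking a fraction. A small sketch on hypothetical toy data (made-up values) compares it with an explicit ratio of counts:

```python
import pandas as pd

# Hypothetical toy respondents; the values are made up for illustration.
toy = pd.DataFrame({
    "sex":      [2, 2, 2, 2, 1, 1, 1, 1],
    "polviews": [2, 3, 5, 6, 1, 4, 4, 7],
})
female = toy.sex == 2
liberal = toy.polviews <= 3


def prob(A):
    return A.mean()


def conditional(proposition, given):
    # Boolean indexing keeps only the rows where `given` is True.
    return prob(proposition[given])


by_indexing = conditional(liberal, given=female)
by_counting = (liberal & female).sum() / female.sum()

print(by_indexing, by_counting)  # 0.5 0.5
```

Of the four female rows in the toy table, two are liberal, so both routes give 0.5.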

### Conditional Probability is not Commutative:

- While conjunction is commutative, conditional probability is not: in general, `conditional(A, B)` differs from `conditional(B, A)`.
- With the data and the functions we have, this is easy to demonstrate:

In [15]:
f"Conditional probability is not equal 1: {conditional(female, given=banker)}, 2: {conditional(banker,given=female)}"

'Conditional probability is not equal 1: 0.7706043956043956, 2: 0.02116102749801969'
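The two results differ because they share a numerator (the number of female bankers) but divide by different group sizes. A minimal sketch on hypothetical toy data (made-up values) makes the counts explicit:

```python
import pandas as pd

# Hypothetical toy respondents: 6 female, 4 male, 3 bankers in total.
toy = pd.DataFrame({
    "sex":     [2, 2, 2, 2, 2, 2, 1, 1, 1, 1],
    "indus10": [6870, 6870, 4970, 4970, 9160, 2670, 6870, 4970, 9160, 2670],
})
female = toy.sex == 2
banker = toy.indus10 == 6870

n_both = (female & banker).sum()               # shared numerator: 2 female bankers

p_female_given_banker = n_both / banker.sum()  # 2 out of 3 bankers
p_banker_given_female = n_both / female.sum()  # 2 out of 6 female respondents

print(p_female_given_banker, p_banker_given_female)
```

The two conditionals coincide only when the two groups happen to be the same size.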

### Condition and Conjunction:

- We can combine conjunctions with conditional probability.
- Let's look at the probability of being female given that the respondent is a liberal Democrat:

In [16]:
f"The Probability of being female given a respondent being a Liberal Democrat is {conditional(female, given= liberal & democrat)}"

'The Probability of being female given a respondent being a Liberal Democrat is 0.576085409252669'

- The conjunction can also appear in the proposition rather than in the condition:

In [17]:
f"The probability of being female and liberal given that the respondent is a banker is: {conditional(female & liberal, given=banker)}"

'The probability of being female and liberal given that the respondent is a banker is: 0.17307692307692307'

### Laws of Probability

- In this section we derive three relationships between conditional probability and conjunction:
    - Theorem 1: Using a conjunction to compute a conditional probability.
    - Theorem 2: Using a conditional probability to compute a conjunction.
    - Theorem 3: Using `conditional(A, B)` to compute `conditional(B, A)`.
- Theorem 3 is also known as Bayes's theorem.

#### Theorem 1

- For this we will use: `What fraction of bankers are female?`
- We can use our methods to answer this:

In [18]:
f"Fraction of bankers that are female: {conditional(female, given=banker)}"

'Fraction of bankers that are female: 0.7706043956043956'

- There is another way to calculate this conditional probability: as the ratio of two probabilities:
    - The fraction of respondents that are female bankers
    - The fraction of respondents that are bankers
- That is, of all the bankers, we want the fraction that are female bankers:
$$P(Female | Banker) = \frac{P(Female \cap Banker)}{P(Banker)} $$
- Using Python we can confirm that this gives the same result:

In [19]:
f"This should be the same as: {prob(female & banker)/prob(banker)}"

'This should be the same as: 0.7706043956043956'

#### Theorem 2

- If we start with Theorem 1 and multiply both sides by `P(Banker)` we get the following:
$$P(Female \cap Banker) = P(Female | Banker) P(Banker)$$
- This gives us a second way to calculate the conjunction. We can check this in Python:

In [21]:
f"These two should be equivalent 1: {prob(female & banker)}, 2: {conditional(female, given=banker)*prob(banker)}"

'These two should be equivalent 1: 0.011381618989653074, 2: 0.011381618989653074'
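Theorems 1 and 2 can also be checked with exact arithmetic, which avoids any floating-point concerns. The sketch below uses hypothetical toy data (made-up values) and two illustrative helpers, `exact_prob` and `exact_conditional`, which are variants of `prob` and `conditional` that return `fractions.Fraction` values. They are assumptions for this example, not part of the notebook.

```python
from fractions import Fraction

import pandas as pd

# Hypothetical toy respondents; the values are made up for illustration.
toy = pd.DataFrame({
    "sex":     [2, 2, 2, 2, 2, 2, 1, 1, 1, 1],
    "indus10": [6870, 6870, 4970, 4970, 9160, 2670, 6870, 4970, 9160, 2670],
})
female = toy.sex == 2
banker = toy.indus10 == 6870


def exact_prob(A):
    """Exact fraction of True values in a Boolean Series (illustrative helper)."""
    return Fraction(int(A.sum()), len(A))


def exact_conditional(proposition, given):
    """Exact conditional probability via Boolean indexing (illustrative helper)."""
    return exact_prob(proposition[given])


# Theorem 1: P(A|B) = P(A and B) / P(B)
assert exact_conditional(female, given=banker) == (
    exact_prob(female & banker) / exact_prob(banker)
)

# Theorem 2: P(A and B) = P(A|B) * P(B)
assert exact_prob(female & banker) == (
    exact_conditional(female, given=banker) * exact_prob(banker)
)

print(exact_conditional(female, given=banker))  # 2/3
```

Because the values are exact fractions, the two sides of each identity compare equal with `==` and no tolerance is needed.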

#### Theorem 3

- We saw earlier that conjunction is commutative, which means that:
$$P(A \cap B) = P(B \cap A) $$

- If we apply Theorem 2 to both sides we get the following:
$$P(B)P(A|B) = P(A)P(B|A)$$
- The way to interpret this is the following:
    - If we want to calculate the probability of A and B, we can either:
        - Calculate A first and then B conditioned on A
        - Calculate B first and then A conditioned on B
- If we divide both sides by `P(B)` we get Theorem 3:
$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$

- And this is Bayes's theorem.
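As a numerical check of Theorem 3, the following sketch computes a conditional probability directly and again through the right-hand side of the formula. The toy data is hypothetical (made-up values), and `math.isclose` is used because the two floating-point results need not match to the last digit.

```python
import math

import pandas as pd

# Hypothetical toy respondents; the values are made up for illustration.
toy = pd.DataFrame({
    "sex":     [2, 2, 2, 2, 2, 2, 1, 1, 1, 1],
    "indus10": [6870, 6870, 4970, 4970, 9160, 2670, 6870, 4970, 9160, 2670],
})
female = toy.sex == 2
banker = toy.indus10 == 6870


def prob(A):
    return A.mean()


def conditional(proposition, given):
    return prob(proposition[given])


# Left-hand side: P(banker | female), computed directly.
direct = conditional(banker, given=female)

# Right-hand side: P(banker) * P(female | banker) / P(female).
via_bayes = prob(banker) * conditional(female, given=banker) / prob(female)

print(math.isclose(direct, via_bayes))  # True
```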
### The Law of Total Probability

- In addition to the three theorems, there is one more thing we need for Bayesian statistics: the law of total probability.
- Here is one form of the law expressed as a mathematical equation:

$$P(A) = P(A \cap B_1) + P(A \cap B_2) $$
- In words: the total probability of `A` is the sum of two possibilities: either `B1` and `A` are both true, or `B2` and `A` are both true. The law only applies if `B1` and `B2` are:
    - Mutually exclusive: only one of them can be true.
    - Collectively exhaustive: one of them must be true.
- As an example consider the probability of a respondent being a Banker:

In [23]:
f"The probability that a respondent is a banker: {prob(banker)}"

'The probability that a respondent is a banker: 0.014769730168391155'

- Let's now confirm that we get the same result if we compute female and male bankers separately and add them up:

In [25]:
male = (gss_data.sex==1)
f"The probability of being a female banker is: {prob(banker & female)}, the probability of being a male banker is {prob(banker & male)} and the combined probability is {prob(female & banker) + prob(male & banker)}"

'The probability of being a female banker is: 0.011381618989653074, the probability of being a male banker is 0.003388111178738081 and the combined probability is 0.014769730168391155'

- We can also substitute Theorem 2 into each term, which gives the following equation:
$$ P(A) = P(B_1)P(A|B_1) + P(B_2)P(A|B_2)$$
- If we have more than two conditions, it is more concise to write the law as a summation:
$$P(A) = \sum_i P(B_i)P(A|B_i)$$
- As before, this holds only if the conditions $B_i$ are mutually exclusive and collectively exhaustive.
- The summation can be generalised in Python. For this example we will use the political views:

In [28]:
B = gss_data.polviews
B.value_counts()  # Series method; the top-level pd.value_counts is deprecated in recent pandas

4.0    18943
5.0     7940
6.0     7319
3.0     6243
2.0     5808
7.0     1595
1.0     1442
Name: polviews, dtype: int64

- On this scale 4 represents moderate, so we can calculate the probability of a respondent being a moderate banker as:

In [30]:
i = 4
f"The Probability of a moderate banker is {prob(B==i)*conditional(banker,B==i)}"

'The Probability of a moderate banker is 0.005822682085615744'

- And we can use sum and a generator expression to compute the summation:

In [31]:
sum(prob(B==i)*conditional(banker,B==i)
    for i in range(1,8))

0.014769730168391157
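The sum matches `prob(banker)` except in the last digit, which is floating-point rounding rather than a failure of the law. A minimal sketch of a tolerance-based comparison on hypothetical toy data (made-up values) is shown below. It iterates over the values actually present in `B`, an assumption made here so that no group is empty, since the mean of an empty selection would be `NaN`.

```python
import math

import pandas as pd

# Hypothetical toy respondents; the values are made up for illustration.
toy = pd.DataFrame({
    "polviews": [1, 2, 2, 4, 4, 4, 5, 6, 6, 7],
    "indus10":  [6870, 4970, 6870, 6870, 9160, 2670, 4970, 6870, 2670, 9160],
})
B = toy.polviews
banker = toy.indus10 == 6870


def prob(A):
    return A.mean()


def conditional(proposition, given):
    return prob(proposition[given])


# Sum P(B_i) * P(banker | B_i) over the values present in B.
total = sum(prob(B == i) * conditional(banker, B == i) for i in B.unique())

# Compare with a tolerance instead of == to allow for rounding.
print(math.isclose(total, prob(banker)))  # True
```

The distinct values of `B` form a valid partition because every row has exactly one value, so the groups are mutually exclusive and collectively exhaustive.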

## Summary

- Some things to note about this chapter:
    - Theorem 1 gives us a way to calculate a conditional probability using a conjunction.
    - Theorem 2 gives us a way to calculate a conjunction using a conditional probability.
    - Theorem 3, also known as Bayes's theorem, lets us switch the order of a conditional probability, getting from `P(B|A)` to `P(A|B)`.
    - The law of total probability gives us a way to add up all the pieces.
- We will see how these formulas will be useful to us later.

## Source

[Chapter 1 - Think Bayes 2](http://allendowney.github.io/ThinkBayes2/chap01.html)