<a href="https://colab.research.google.com/github/550tealeaves/DATA-70500-working-with-data/blob/main/HW_2_IntroProbability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Probability

In this notebook, we'll use probability to answer some sociological questions. This will give us an opportunity to practice some code for recoding and counting cases as well as computing odds.

For reference, I've copied the material on basic probability into this notebook.

We'll use Downey's probability function, which we reviewed here: https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/notebooks/chap01.ipynb#scrollTo=6L0EX3-337a2&line=2&uniqifier=1



In [1]:
def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()
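To see why taking a mean yields a probability, here is a minimal sketch on made-up answers (the `toy` Series is hypothetical, not Baylor data). A comparison produces a Boolean Series, and the mean of Booleans is the fraction of `True` values.

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Hypothetical answers from five respondents (not the Baylor data).
toy = pd.Series([5, 2, 5, 4, 3])

# The comparison yields True, False, True, False, False.
is_five = (toy == 5)

# The mean of a Boolean Series is the share of True values: 2 out of 5.
p_five = prob(is_five)
print(p_five)  # 0.4
```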

## Data for Computing Probability

We'll use the Baylor survey dataset from The Association of Religion Data Archives that we saw in an earlier notebook.

Here's the link to get the data file: https://www.thearda.com/data-archive?fid=BRSW5ED&tab=3

I downloaded the Excel file and the codebook.

Since we can't read the data from ARDA into a DataFrame directly, I put a copy of the Excel spreadsheet, saved as a comma-delimited (CSV) file, in a place where we can access it. (You could also download the file from ARDA, put it on your Google Drive, and read it in that way.)

We'll need the codebook to make sense of the variables and values. Here's the link: http://data.shortell.nyc/files/BaylorReligionSurveyWaveV2017InstructionalDatasetcb_data.TXT


In [2]:
# Code block 1a: Installing some libraries we'll need
!pip install researchpy

Collecting researchpy
  Downloading researchpy-0.3.6-py3-none-any.whl.metadata (1.2 kB)
Downloading researchpy-0.3.6-py3-none-any.whl (34 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.3.6


In [3]:
# Code block 1b. Libraries
import pandas as pd
import numpy as np
import researchpy as rp


In [4]:
# Code Block 2: Importing data
Baylor2017 = pd.read_csv('http://data.shortell.nyc/files/BaylorReligionSurveyWaveV2017InstructionalDataset.csv')

# We can inspect the top of the file to make sure that the data were read in correctly.
Baylor2017.head()

Unnamed: 0,MOTHERLODE_ID,RESPONDENT_DATE,RESPONDENT_LANGUAGE,ENTITY_ID,SCAN_RESPONDENT_ID,LANG1,Q1,Q2_DK,Q3,Q3_1,...,AGER,LIBCONR,PARTYIDR,CHILDSR,HRSWORKD,EDUCR,I_AGE,I_EDUC,I_RACE,I_RELIGION
0,165167557,2/14/2017,en-US,4221710666,4221710666,en-US,10.0,,4.0,4.0,...,5.0,1.0,1.0,1.0,,3.0,6.0,3.0,1.0,1.0
1,165172207,2/21/2017,en-US,4221711095,4221711095,en-US,45.0,1.0,4.0,4.0,...,,,,,,4.0,,4.0,1.0,1.0
2,165167589,2/14/2017,en-US,4221711129,4221711129,en-US,45.0,,4.0,4.0,...,6.0,1.0,1.0,1.0,,4.0,6.0,4.0,1.0,1.0
3,165167427,2/10/2017,en-US,4221709180,4221709180,en-US,20.0,,4.0,4.0,...,2.0,1.0,1.0,1.0,,4.0,2.0,4.0,1.0,1.0
4,165171895,2/14/2017,en-US,4221707213,4221707213,en-US,12.0,,3.0,3.0,...,4.0,1.0,1.0,1.0,2.0,2.0,4.0,2.0,1.0,1.0


We can use the religious identity or affiliation variable as an example for computing probabilities.



```
7) Q1
[I. Religious Behaviors and Attitudes] With what religious family, if any, do you most closely identify? (Please mark only one box.)
RANGE: 1 to 46
	N	Mean	Std. Deviation
Total	1431	26.779	13.946
1) Other	39	2.7
6) Adventist	3	0.2
7) African Methodist	5	0.3
8) Anabaptist	1	0.1
9) Asian Folk Religion	1	0.1
10) Assemblies of God	13	0.9
11) Baha'i	1	0.1
12) Baptist	194	13.6
13) Bible Church	15	1.0
14) Brethren	1	0.1
15) Buddhist	11	0.8
16) Catholic/Roman Catholic	376	26.3
17) Christian & Missionary Alliance	6	0.4
18) Christian Reformed	3	0.2
19) Christian Science	3	0.2
20) Church of Christ	22	1.5
21) Church of God	9	0.6
22) Church of the Nazarene	5	0.3
23) Congregational	3	0.2
25) Episcopal/Anglican	30	2.1
26) Hindu	1	0.1
27) Holiness	6	0.4
28) Jehovah's Witnesses	11	0.8
29) Jewish	29	2.0
30) Latter-day Saints	30	2.1
31) Lutheran	66	4.6
32) Mennonite	1	0.1
33) Methodist	67	4.7
34) Muslim	7	0.5
35) Orthodox (Eastern, Russian, Greek)	5	0.3
36) Pentecostal	20	1.4
37) Presbyterian	38	2.7
38) Quaker/Friends	5	0.3
39) Reformed Church in America/Dutch Reformed	2	0.1
40) Salvation Army	1	0.1
41) Seventh-Day Adventist	5	0.3
42) Sikh	1	0.1
43) Unitarian Universalist	11	0.8
44) United Church of Christ	11	0.8
45) Non-denominational Christian	153	10.7
46) No religion	220	15.4
Missing	70
```

Using the function that Downey created to compute probabilities, we can ask some questions about probabilities, including conditional probabilities.


## Health and Sleep
Let's begin with three questions. What is the probability that a randomly selected respondent self-reported their health as Excellent? As Fair? As Fair OR Excellent?

In [5]:
# What is the probability that a respondent self-reported their health as Excellent?
# H3 is the self-rated health variable
# 5 is the code for Excellent
ExHealth = (Baylor2017['H3'] == 5)
prob(ExHealth)

0.17188540972684876

In [6]:
# What is the probability that a respondent self-reported their health as Fair?
# H3 is the self-rated health variable
# 2 is the code for Fair
FairHealth = (Baylor2017['H3'] == 2)
prob(FairHealth)

0.09593604263824117

In [10]:
# What is the probability that a respondent self-reported their health as Fair OR Excellent?
# H3 is the self-rated health variable
# 2 is the code for Fair and 5 is the code for Excellent
FairExHealth = (Baylor2017['H3'] == 2) | (Baylor2017['H3'] == 5)
prob(FairExHealth)

0.2678214523650899
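Because a respondent cannot be both Fair and Excellent, the two events are mutually exclusive, so the probability of "Fair OR Excellent" should equal the sum of the two separate probabilities. The sketch below checks this with the values printed above and with a small made-up Series (`toy_h3` is hypothetical, not Baylor data).

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Values copied from the notebook output above.
p_excellent = 0.17188540972684876
p_fair = 0.09593604263824117
p_either = p_excellent + p_fair
print(round(p_either, 4))  # 0.2678

# The same rule on hypothetical health codes (not the Baylor data).
toy_h3 = pd.Series([5, 2, 3, 5, 4, 2, 1, 3, 4, 5])
excellent = (toy_h3 == 5)
fair = (toy_h3 == 2)

# No respondent satisfies both, so P(A or B) = P(A) + P(B).
p_both = prob(excellent & fair)
p_or = prob(excellent | fair)
p_sum = prob(excellent) + prob(fair)
print(p_both, round(p_or, 2), round(p_sum, 2))
```

If the two events could overlap, the sum would double-count the overlap, and `prob(A & B)` would have to be subtracted.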

Now, let's compute the probability that a person gets, on average, fewer than 6 hours of sleep per night over a month.

We can compute probabilities to answer questions about average sleep in a number of ways. To begin, let's compare people who reported getting fewer than 6 hours of sleep with people who reported getting 6 or more hours.

In [13]:
Poor_sleep = (Baylor2017['H6_HR'] < 6)
prob(Poor_sleep)

0.09060626249167222

In [14]:
Good_sleep = (Baylor2017['H6_HR'] >= 6)
prob(Good_sleep)

0.8634243837441705

Not surprisingly (for somnologists, at least), people in the contemporary US tend to report getting at least 6 hours of sleep a night in the past month.
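The two sleep probabilities above add up to about 0.954 rather than 1. A likely reason is missing answers: a comparison against `NaN` evaluates to `False`, so respondents who skipped the question count in the denominator of `prob` but in neither group. The sketch below illustrates this on made-up hours (`hours` is hypothetical, not Baylor data) and shows one way to restrict the calculation to people who answered.

```python
import numpy as np
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Hypothetical hours of sleep; NaN marks a respondent who skipped the item.
hours = pd.Series([5, 7, 8, 6, np.nan, 4, 7, 9, np.nan, 6])

poor = (hours < 6)    # NaN < 6 evaluates to False
good = (hours >= 6)   # NaN >= 6 also evaluates to False

# Two of ten answers are missing, so the two groups cover only 80% of rows.
total_all = prob(poor) + prob(good)
print(round(total_all, 2))  # 0.8

# Restricting to respondents who answered makes the two groups exhaustive.
answered = hours.notna()
total_answered = prob(poor[answered]) + prob(good[answered])
print(total_answered)  # 1.0
```

Which denominator to use is a judgment call: keeping all rows describes the whole sample, while dropping missing answers describes only those who responded to the item.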

We could also ask about the probability of getting 7-9 hours of sleep per night, the amount recommended by the NIH:

> https://www.nhlbi.nih.gov/health/sleep/how-much-sleep#:~:text=Experts%20recommend%20that%20adults%20sleep,or%20more%20hours%20a%20night.



In [15]:
Proper_sleep = (Baylor2017['H6_HR'] >= 7) & (Baylor2017['H6_HR'] <= 9)
prob(Proper_sleep)

0.6249167221852099

- Approximately 62% of respondents report getting the recommended average amount of sleep per night over the past month.

Now let's look at a compound probability.
- How likely is a respondent in this sample to self-report their health as Excellent while getting fewer than 6 hours of sleep on average in the past month?
- What about rating their health as Fair with the same amount of sleep?
- Excellent health and good sleep?
- Fair health and good sleep?
- Excellent health and proper sleep?
- Fair health and proper sleep?

In [16]:
# The compound probability of being ExHealth AND Poor_sleep
prob(ExHealth & Poor_sleep)

0.005329780146568954

In [17]:
# The compound probability of being FairHealth AND Poor_sleep
prob(FairHealth & Poor_sleep)

0.01798800799467022

In [18]:
# The compound probability of being ExHealth AND Good_sleep
prob(ExHealth & Good_sleep)

0.15989340439706862

In [19]:
# The compound probability of being FairHealth AND Good_sleep
prob(FairHealth & Good_sleep)

0.07461692205196535

In [25]:
# The compound probability of being FairHealth AND Proper_sleep
prob(FairHealth & Proper_sleep)

0.04397068620919387

In [26]:
# The compound probability of being ExHealth AND Proper_sleep
prob(ExHealth & Proper_sleep)

0.13257828114590273
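Rather than computing each compound probability in its own cell, all the combinations can be read off one table. The sketch below uses `pd.crosstab` with `normalize="all"`, which divides every cell count by the total number of rows. The data are made up (`toy` is hypothetical, not Baylor data), and the labels are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents (not the Baylor data): health code and hours of sleep.
toy = pd.DataFrame({
    "H3":    [5, 2, 5, 3, 2, 5, 4, 3, 5, 2],
    "H6_HR": [7, 5, 8, 6, 4, 7, 9, 5, 6, 7],
})

# Label the two health categories of interest; everything else is "Other".
health = toy["H3"].map({2: "Fair", 5: "Excellent"}).fillna("Other").rename("health")

# Split sleep at 6 hours, matching the Poor_sleep / Good_sleep definitions.
sleep = pd.Series(np.where(toy["H6_HR"] < 6, "<6 hours", "6+ hours"), name="sleep")

# Each cell is the joint probability of one health/sleep combination.
joint = pd.crosstab(health, sleep, normalize="all")
print(joint)

# A single cell agrees with the mask-based calculation used in the notebook.
p_mask = ((toy["H3"] == 5) & (toy["H6_HR"] >= 6)).mean()
print(joint.loc["Excellent", "6+ hours"], p_mask)
```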

### Summary
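- Whether someone rated their health as Fair or Excellent, the joint probability with proper sleep (0.04 and 0.13) is smaller than the joint probability with good sleep (0.07 and 0.16). This is expected, because 7-9 hours is a subset of 6 or more hours.
- The joint probabilities with poor sleep are the smallest (about 0.018 for Fair and 0.005 for Excellent), in part because only about 9% of respondents report fewer than 6 hours of sleep.
- The highest probability listed is for Excellent health and good sleep (0.16); the lowest is for Excellent health and poor sleep (0.005).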

Now we might ask about conditional probability. If a person gets fewer than 6 hours of sleep, what is the likelihood that they rate their health as Fair?

Again, we'll use Downey's function for conditional probability: https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/notebooks/chap01.ipynb#scrollTo=GbbG7bsh37bG&line=1&uniqifier=1



In [20]:
# Conditional probability

def conditional(proposition, given):
    """Probability of A conditioned on given."""
    return prob(proposition[given])

In [22]:
# Conditional probability of self-reporting health as Fair, given that one gets poor sleep
conditional(FairHealth, given=Poor_sleep)

0.19852941176470587

In [23]:
# Conditional probability of self-reporting health as Excellent, given that one gets poor sleep
conditional(ExHealth, given=Poor_sleep)

0.058823529411764705

In [24]:
# Conditional probability of self-reporting health as Fair, given that one gets good sleep
conditional(FairHealth, given=Good_sleep)

0.08641975308641975

In [27]:
# Conditional probability of self-reporting health as Excellent, given that one gets good sleep
conditional(ExHealth, given=Good_sleep)

0.18518518518518517

In [28]:
# Conditional probability of self-reporting health as Fair, given that one gets proper sleep
conditional(FairHealth, given=Proper_sleep)

0.07036247334754797

In [29]:
# Conditional probability of self-reporting health as Excellent, given that one gets proper sleep
conditional(ExHealth, given=Proper_sleep)

0.21215351812366737

- The highest conditional probability (0.21) is for rating health as Excellent given the proper amount of sleep over the past month.
- The lowest conditional probability (0.06) is for rating health as Excellent given poor sleep over the past month.
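The conditional and compound probabilities are tied together by the product rule, P(A and B) = P(B) × P(A | B). As a quick consistency check, the sketch below multiplies two values printed earlier in the notebook and compares the result with the compound probability computed directly; the variable names are introduced here only for illustration.

```python
# Values copied from the notebook output above.
p_poor_sleep = 0.09060626249167222        # prob(Poor_sleep)
p_fair_given_poor = 0.19852941176470587   # conditional(FairHealth, given=Poor_sleep)
p_joint_reported = 0.01798800799467022    # prob(FairHealth & Poor_sleep)

# Product rule: P(Fair and Poor sleep) = P(Poor sleep) * P(Fair | Poor sleep)
p_joint_rebuilt = p_poor_sleep * p_fair_given_poor

print(round(p_joint_rebuilt, 6), round(p_joint_reported, 6))
```

The two numbers agree to rounding, which is what we would expect if `conditional` and `prob` are applied to the same set of rows.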

Let's think sociologically about what we've learned from these probability computations.



We can now return to a question we've asked before, about the **relationship between self-reported Excellent health and poor health from 10 years ago**. In the codebook, health from 10 years ago is measured with H5B.

```
93) H5B
On a scale from zero to 10 where zero represents the worst possible health for you and 10 represents the best possible health for you, please rate your health at the following points in time: Your health 10 years ago.
RANGE: 0 to 10
	N	Mean	Std. Deviation
Total	1461	8.187	1.82
0) Worst possible	2	0.1
1) 	8	0.5
2) 	10	0.7
3) 	27	1.8
4) 	27	1.8
5) 	67	4.6
6) 	68	4.7
7) 	154	10.5
8) 	324	22.2
9) 	409	28.0
10) Best possible	365	25.0
Missing	40
```

For this question, we'll define 0-2 as poor health, 3-5 as fair health, 6-8 as good health, and 9-10 as excellent health.

We can compute the conditional probability of self-reported Excellent health given poor health 10 years ago, and then do the same given fair, good, and excellent health 10 years ago. If the probability of self-reported Excellent health is larger given excellent past health than given good, fair, or poor past health, it would suggest a relationship between current self-reported health and reported health from 10 years ago.


In [31]:
poor_past_health = (Baylor2017['H5B'] <= 2)

fair_past_health = ((Baylor2017['H5B'] >= 3) | (Baylor2017['H5B'] <= 5))

good_past_health = ((Baylor2017['H5B'] >= 6) | (Baylor2017['H5B'] <= 8))

excellent_past_health = (Baylor2017['H5B'] >= 9)


print("The probability of self-reported Excellent health given poor health 10 years ago: %3.2f" % conditional(ExHealth, given=poor_past_health))
print("The probability of self-reported Excellent health given fair health 10 years ago: %3.2f" % conditional(ExHealth, given=fair_past_health))
print("The probability of self-reported Excellent health given good health 10 years ago: %3.2f" % conditional(ExHealth, given=good_past_health))
print("The probability of self-reported Excellent health given excellent health 10 years ago: %3.2f" % conditional(ExHealth, given=excellent_past_health))

The probability of self-reported Excellent health given poor health 10 years ago: 0.05
The probability of self-reported Excellent health given fair health 10 years ago: 0.17
The probability of self-reported Excellent health given good health 10 years ago: 0.17
The probability of self-reported Excellent health given excellent health 10 years ago: 0.24


In [32]:
print("The probability of poor health 10 years ago: %3.2f" % prob(poor_past_health))
print("The probability of fair health 10 years ago: %3.2f" % prob(fair_past_health))
print("The probability of good health 10 years ago: %3.2f" % prob(good_past_health))
print("The probability of excellent health 10 years ago: %3.2f" % prob(excellent_past_health))

The probability of poor health 10 years ago: 0.01
The probability of fair health 10 years ago: 0.97
The probability of good health 10 years ago: 0.97
The probability of excellent health 10 years ago: 0.52
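The identical 0.97 values for fair and good past health deserve a second look. A range such as 3-5 needs both bounds to hold at once, which calls for `&`; with `|`, every non-missing answer satisfies at least one of the two comparisons. The sketch below rebuilds an H5B-like Series from the codebook counts shown earlier (an assumption: it is a reconstruction from the codebook, not the data file itself) and compares the two operators. `Series.between` is an equivalent way to write an inclusive range.

```python
import numpy as np
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Counts per answer copied from the H5B codebook entry, plus 40 missing cases.
counts = {0: 2, 1: 8, 2: 10, 3: 27, 4: 27, 5: 67,
          6: 68, 7: 154, 8: 324, 9: 409, 10: 365}
values = np.repeat(list(counts.keys()), list(counts.values())).astype(float)
h5b = pd.Series(np.concatenate([values, np.full(40, np.nan)]))

# With |, any non-missing value is >= 3 or <= 5, so the mask is almost all True.
fair_with_or = (h5b >= 3) | (h5b <= 5)

# With &, only the answers 3, 4, and 5 are selected.
fair_with_and = (h5b >= 3) & (h5b <= 5)

# between() is inclusive on both ends by default: answers 6, 7, and 8.
good_with_and = h5b.between(6, 8)

print("fair with |: %3.2f" % prob(fair_with_or))    # 0.97
print("fair with &: %3.2f" % prob(fair_with_and))   # 0.08
print("good with &: %3.2f" % prob(good_with_and))   # 0.36
```

Under this reconstruction, the `|` version reproduces the 0.97 printed above, while the `&` version gives the shares implied by the codebook. The conditional probabilities for fair and good past health would need to be recomputed on the real data with the corrected masks.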


### Summary
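- Results suggest that respondents who rated their health 10 years ago as excellent (9 or 10) were the most likely to self-report Excellent health now (0.24), compared with 0.05 for those who rated their past health as poor.
- The probability of Excellent current health is identical (0.17) given fair or good past health. As written, the `fair_past_health` and `good_past_health` masks combine their bounds with `|`, so each is True for every respondent with a non-missing H5B answer (hence the 0.97 for both). The two 0.17 values are therefore the same overall figure rather than separate estimates for the 3-5 and 6-8 groups.

- It seems that health status has remained largely unchanged or improved within a decade. I assume that several factors, such as sleep quality, nutrition, smoking and alcohol consumption, exercise, and stress, must be moderated in order to achieve those results.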


## Bayes's Theorem

I might have lost the plot here.





In [36]:
#create dataframe and prior probabilities
table = pd.DataFrame(index=['Poor Past Health', 'Fair Past Health', 'Good Past Health', 'Excellent Past Health'])
table['prior'] = [1/4, 1/4, 1/4, 1/4]
table

Unnamed: 0,prior
Poor Past Health,0.25
Fair Past Health,0.25
Good Past Health,0.25
Excellent Past Health,0.25


In [37]:
# Add the likelihood column (here, the past-health probabilities printed above)
table['likelihood'] = 0.01, 0.97, 0.97, 0.52
table

Unnamed: 0,prior,likelihood
Poor Past Health,0.25,0.01
Fair Past Health,0.25,0.97
Good Past Health,0.25,0.97
Excellent Past Health,0.25,0.52


In [38]:
table['unnorm'] = table['prior'] * table['likelihood']
table

Unnamed: 0,prior,likelihood,unnorm
Poor Past Health,0.25,0.01,0.0025
Fair Past Health,0.25,0.97,0.2425
Good Past Health,0.25,0.97,0.2425
Excellent Past Health,0.25,0.52,0.13


In [39]:
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
table

Unnamed: 0,prior,likelihood,unnorm,posterior
Poor Past Health,0.25,0.01,0.0025,0.004049
Fair Past Health,0.25,0.97,0.2425,0.392713
Good Past Health,0.25,0.97,0.2425,0.392713
Excellent Past Health,0.25,0.52,0.13,0.210526


In [41]:
# Starting with a different prior
# (these weights sum to 1.5 rather than 1; dividing by prob_data below rescales the result)
table['prior2'] = 1/4, 1/2, 1/4, 1/2
table['unnorm2'] = table['prior2'] * table['likelihood']
prob_data = table['unnorm2'].sum()
table['posterior2'] = table['unnorm2'] / prob_data  #Compute posterior prob
table

Unnamed: 0,prior,likelihood,unnorm,posterior,prior2,unnorm2,posterior2
Poor Past Health,0.25,0.01,0.0025,0.004049,0.25,0.0025,0.002525
Fair Past Health,0.25,0.97,0.2425,0.392713,0.5,0.485,0.489899
Good Past Health,0.25,0.97,0.2425,0.392713,0.25,0.2425,0.244949
Excellent Past Health,0.25,0.52,0.13,0.210526,0.5,0.26,0.262626
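For comparison, here is a minimal sketch of the same table method applied to the sleep and health numbers computed earlier, where the roles are easier to pin down. The hypotheses are the two sleep groups, the prior is how common each group is, the likelihood is the probability of Fair health given each group, and the posterior is the probability of each sleep group given Fair health. The row labels and the name `bayes_table` are introduced here for illustration only.

```python
import pandas as pd

# Hypotheses: the two sleep groups defined earlier in the notebook.
bayes_table = pd.DataFrame(index=["Poor sleep (<6 h)", "Good sleep (6+ h)"])

# Prior: prob(Poor_sleep) and prob(Good_sleep), copied from the output above.
bayes_table["prior"] = [0.09060626249167222, 0.8634243837441705]

# Likelihood: conditional(FairHealth, given=...) for each sleep group.
bayes_table["likelihood"] = [0.19852941176470587, 0.08641975308641975]

# Unnormalized posterior, then divide by the total probability of the data.
bayes_table["unnorm"] = bayes_table["prior"] * bayes_table["likelihood"]
prob_fair = bayes_table["unnorm"].sum()
bayes_table["posterior"] = bayes_table["unnorm"] / prob_fair

print(bayes_table.round(4))
```

The `unnorm` column matches the compound probabilities `prob(FairHealth & Poor_sleep)` and `prob(FairHealth & Good_sleep)` computed earlier. The priors do not sum to 1 because some respondents did not report their sleep, so the posterior describes only respondents with a sleep answer.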


### Activity

Identify a sociological question that you can ask of the Baylor survey data. This will require you to browse the codebook to see the kinds of variables available. Start with a simple question of the form "Is Y related to X?" where X and Y are sociological concepts. (In the example in this notebook, we asked if self-reported health was related to sleep and to health 10 years ago.)

Select variables that operationalize the concepts in your question.

Compute probabilities that allow you to formulate an answer to your question.

Explain your answer and why you computed the specific probabilities you chose.

