# Welcome to lab_populations! 👥🌎🌍🌏👥

In lecture, you have been learning about both sampling and inference. Inference is the idea that we can calculate statistics from a random sample of the population and use those statistics to estimate what we would get if we asked every single person in the population a question.

The goal of this lab is to gain a more intuitive understanding of what inference is. We will explore sampling from a population and the meaning behind confidence intervals, error, and the Central Limit Theorem (CLT).

<hr>

A few tips to remember:

- **You are not alone on your journey in learning programming!**  You have your lab TA, the CAs, your lab group, and the professors (Prof. Wade and Prof. Karle), who are all here to help you out!
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!

In [0]:
# Meet your CAs and TA if you haven't already!
# First name is enough, we'll know who they are! :)
ta_name = "Kunlun"
ca1_name = "Jai"
ca2_name = "Jessica"
ca3_name = ""

# Work with your group again this week! 
# QOTD to Ask Your Group: "What should be the new UIUC mascot?"
partner1_name = "Christ Goncalves"
partner1_netid = "christg2"
partner1_mascot = "No idea"

partner2_name = "Beichen Hu"
partner2_netid = "beichen7"
partner2_mascot = "No idea"

partner3_name = ""
partner3_netid = ""
partner3_mascot = ""

<hr style="color: #DD3403;">

## Part 1: Sampling the Population

The `DISCOVERY_populations` library is included with this lab and contains a **very large** simulated population (over 100,000 students) of current and former University of Illinois students.  For each person, we have simulated answers to three questions:

1. Do you support the Kingfisher as the new Illinois mascot?
2. Do you follow @datascienceduo on Instagram?
3. Are you a Data Science major?

Right now, **we do NOT know the answers for the entire population and there is NO WAY to ask everyone**. Instead, we can only ask a sample of students and get answers for that sample. Run the following code to import the `DISCOVERY_populations` library and retrieve the sample:

In [1]:
import DISCOVERY_populations
sample = DISCOVERY_populations.getSample()
sample.head()

Unnamed: 0,DSmajor,FollowsDuo,ProKingfisher
14197,0,0,0
109667,0,0,1
27498,0,0,0
116804,1,1,1
172357,0,0,1
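
To make the idea of "a sample from a population" concrete, here is a minimal sketch of drawing a random sample from a large 0/1-coded population. Everything in it is an assumption for illustration: the population is simulated with an assumed 52% follow rate, and `population`, `true_rate`, and `draw_sample` are hypothetical names. It is not how `DISCOVERY_populations` works internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=107)

# Hypothetical population: 100,000 people, each coded 1 (follows) or 0 (does not).
true_rate = 0.52  # assumed for illustration only
population = pd.DataFrame({
    "FollowsDuo": (rng.random(100_000) < true_rate).astype(int)
})

def draw_sample(df, size, seed):
    """Return `size` rows chosen uniformly at random, without replacement."""
    return df.sample(n=size, random_state=seed)

sample_demo = draw_sample(population, size=49, seed=1)

print(len(sample_demo))                   # number of people surveyed
print(sample_demo["FollowsDuo"].mean())   # sample proportion (changes from sample to sample)
print(population["FollowsDuo"].mean())    # population proportion (normally unknown to us)
```

In the lab itself we never get to compute the last line, which is exactly why we need inference.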


### Puzzle 1.1: Statistics about the Sample

You have received a **random sample** from the population and it looks like it has three columns: `DSmajor`, `FollowsDuo`, and `ProKingfisher`. Using the `len` function, create a variable `n` that stores the number of people in your sample:

In [2]:
n = len(sample)
n

49

We'll first focus on people who follow @datascienceduo -- the people who follow the DUO are coded with a `1` in the sample and the people who do not follow the DUO are coded with a `0`.  

In your sample, how many people follow the DUO?

In [5]:
followers = len(sample[sample["FollowsDuo"] == 1])
followers

29

In [6]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert("sample" in vars()), "Check to make sure you have the variable `sample`."
assert(len(sample) == n), "Check to make sure `n` stores the number of observations in your sample."
assert(followers == sum(sample.FollowsDuo)), "Check to make sure `followers` stores the number of people following @datascienceduo."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 1.2: Finding the 95% Confidence Interval for the Percentage of DUO followers

We want to estimate what percentage of the population follows @datascienceduo. To do that, we need to use the confidence interval formula you learned in lecture. 

$$ CI = {Sample \space Percent} \pm {Margin \space of \space Error}$$
$$ {Margin \space of \space Error} = {z} \times {Sample \space Standard \space Error} $$

Let's work on finding all four of the components we need: `samplePercent`, `marginOfError`, `z`, and `sampleSE`. For this entire puzzle, make sure your percentages (samplePercent and sampleSE) are in **percent form** and not decimal form. In other words, they should be numbers between 0% and 100%.


#### Puzzle 1.2(a): Finding `samplePercent`

 Using the `FollowsDuo` column, store the **percentage of the sample that follow the DUO** in the variable `samplePercent`:
 
 *Note: Since the `FollowsDuo` column is encoded so that a `0` is a non-follower and a `1` is a follower, the mean of the column will be a proportion (decimal), but we want to find a **percentage** so make sure to convert your answer to be between 0 and 100 percent.*

In [9]:
samplePercent = followers / n * 100
samplePercent

59.183673469387756

In [10]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
F = sample.FollowsDuo
assert(math.isclose(samplePercent, F.sum()/n*100)), "Check your `samplePercent`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉
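
Because the column is coded with 0s and 1s, there are a few equivalent ways to get the count and the percentage. The sketch below uses a small stand-in `responses` Series (a hypothetical name) built with the same counts as the outputs above, 49 responses with 29 followers; the order of the values is arbitrary.

```python
import pandas as pd

# Stand-in data: 29 followers (1) and 20 non-followers (0), 49 responses in total.
responses = pd.Series([1] * 29 + [0] * 20)

count_via_sum = responses.sum()                   # every 1 adds one follower
count_via_mask = len(responses[responses == 1])   # boolean-mask approach
percent = responses.mean() * 100                  # mean of 0/1 data is a proportion

print(count_via_sum, count_via_mask, percent)
```

All three approaches agree, so pick whichever reads most clearly to you.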


#### Puzzle 1.2(b): Finding `z`

We want to find the range where we are 95% sure that the true percentage of people who follow the DUO is within that range. Find the z-score we need to use to create a 95% CI:

*Hint: Because the sample size is greater than 30 and the sample was randomly selected from the population, we can use the standard normal curve to find the z-score when creating our 95% CI.*

In [18]:
from scipy.stats import norm
z = norm.ppf(0.975)
z

np.float64(1.959963984540054)

In [19]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(abs(z) + abs(z)**abs(z), 5.69931068079139)), "Check your `z`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")


🎉 All tests passed! 🎉
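
Why `0.975` for a 95% CI? A two-sided interval leaves half of the remaining 5% in each tail, so the cutoff sits at the 97.5th percentile. A minimal sketch of that reasoning, where `z_for_confidence` is a hypothetical helper name and not part of the lab's library:

```python
from scipy.stats import norm

def z_for_confidence(confidence):
    """Two-sided critical value: leave (1 - confidence) / 2 in each tail."""
    upper_tail_cutoff = 1 - (1 - confidence) / 2
    return norm.ppf(upper_tail_cutoff)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

Higher confidence levels give larger z-scores, which in turn give wider intervals.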


#### Puzzle 1.2(c): Finding `sampleSE`

Next, we need to find the standard error of the sample as a **percentage**.

Remember: $SE_{\%} = \frac{SD}{\sqrt{n}} \times 100\%$, where $SE$ is standard error, $SD$ is standard deviation, and $n$ is the sample size.

In [33]:
import pandas
sd = sample["FollowsDuo"].std()

In [34]:
# Standard error in percent: SD divided by the square root of the sample size
sampleSE = (sd / n ** 0.5) * 100
sampleSE

np.float64(7.094099868916398)

In [35]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(sampleSE, (n / F.var())**-0.5 * 100)), "Check your `sampleSE`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉
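
The standard error formula can be sketched on stand-in data with the same counts as this lab's sample (49 responses, 29 followers). The names `responses`, `n_demo`, and `sd_demo` are hypothetical. The sketch also shows that, for 0/1 data, the SD that pandas reports can be recovered from the proportion alone.

```python
import math
import pandas as pd

responses = pd.Series([1] * 29 + [0] * 20)   # 49 responses, 29 followers
n_demo = len(responses)

sd_demo = responses.std()                    # pandas default is the sample SD (ddof=1)
se_percent = sd_demo / math.sqrt(n_demo) * 100

# For 0/1 data, the sample SD depends only on the proportion p and the sample size.
p = responses.mean()
sd_from_p = math.sqrt(p * (1 - p) * n_demo / (n_demo - 1))

print(se_percent)
print(math.isclose(sd_demo, sd_from_p))
```

Using `math.sqrt(n_demo)` rather than a typed-in number keeps the calculation correct if the sample size changes.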


#### Puzzle 1.2(d): Finding `marginOfError`

Finally, we need to find the margin of error.


In [36]:
marginOfError = z * sampleSE
marginOfError

np.float64(13.90418024580646)

In [37]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose((n / F.var())**-0.5 * 100, marginOfError/z)), "Check your `marginOfError`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 1.3: Finding the Confidence Interval

The formula for the confidence interval has both a "lower bound" (when you subtract the margin of error from the sample percent) and an "upper bound" (when you add the margin of error to the sample percent). Recall the formula you learned in lecture:

$$ CI = {Sample \space Percent} \pm ({z} \times {Sample \space Standard \space Error})$$
$$ \text{or, equivalently,} $$
$$ CI = {Sample \space Percent} \pm {Margin \space of \space Error}$$


Using the variables you just calculated in the previous section, find the `lower_bound_CI` and `upper_bound_CI` of your confidence interval:

In [38]:
lower_bound_CI = samplePercent - marginOfError
lower_bound_CI

np.float64(45.2794932235813)

In [39]:
upper_bound_CI = samplePercent + marginOfError
upper_bound_CI

np.float64(73.08785371519421)

Putting it all together, run the following code that will write out your full confidence interval interpretation:

In [40]:
print(f"Based on the sample, we are 95% confident that the true percentage of followers of @datascienceduo in the full population is between:\n   {round(lower_bound_CI, 2)}% - {round(upper_bound_CI, 2)}%")

Based on the sample, we are 95% confident that the true percentage of followers of @datascienceduo in the full population is between:
   45.28% - 73.09%
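
The four steps above can be collected into one function. This is a sketch under stated assumptions, not the lab's required solution: `percent_confidence_interval` is a hypothetical helper, and `responses` is stand-in data with the same counts as this lab's sample (49 responses, 29 followers).

```python
import math
import pandas as pd
from scipy.stats import norm

def percent_confidence_interval(column, confidence=0.95):
    """Return (lower, upper) bounds, in percent, for a 0/1-coded column."""
    n_obs = len(column)
    sample_percent = column.mean() * 100
    se_percent = column.std() / math.sqrt(n_obs) * 100
    z_score = norm.ppf(1 - (1 - confidence) / 2)
    margin = z_score * se_percent
    return sample_percent - margin, sample_percent + margin

responses = pd.Series([1] * 29 + [0] * 20)
lower, upper = percent_confidence_interval(responses)
print(round(lower, 2), round(upper, 2))
```

Passing a different `confidence` value (for example `0.99`) reuses the same steps with a different z-score.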


### Reflections

**Q1**: Talk to your group members and share your confidence intervals.
- (a): What is the confidence interval of another group member's sample?
- (b): Is it the same or different?
- (c): Why is it okay that it's the same or different?

*(My group member's CI is from 43.72% to 74.65%, which is a little different from mine. It is okay that they differ: each of us drew a different random sample, so our sample percentages and standard deviations are not exactly the same.)*
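
A short simulation can show why different samples give different intervals and what "95% confident" means over many samples. This is a sketch with an assumed true percentage of 52% (chosen only for illustration); each trial draws a fresh sample of 49 and builds its own 95% CI.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=42)
true_percent = 52.0      # assumed population value, for illustration only
n_per_sample = 49
z95 = norm.ppf(0.975)

trials = 2000
hits = 0
for _ in range(trials):
    # 1.0 = follower, 0.0 = non-follower
    draws = (rng.random(n_per_sample) < true_percent / 100).astype(float)
    pct = draws.mean() * 100
    se = draws.std(ddof=1) / np.sqrt(n_per_sample) * 100
    lower, upper = pct - z95 * se, pct + z95 * se
    hits += int(lower <= true_percent <= upper)

coverage = hits / trials
print(f"{coverage:.1%} of the simulated intervals contain the true percentage")
```

Every trial produces a slightly different interval, and roughly 95% of them should capture the true value.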

**Q2**: Given your confidence interval you calculated, what statement can you make about whether or not at least 50% of the population follow @datascienceduo?

*(The lower bound of my CI is 45.28% and the upper bound is 73.09%. Since 50% falls within this range, values below 50% are still plausible, so I cannot claim with 95% confidence that at least 50% of the population follows @datascienceduo.)*

### Population Analysis

**Q3**: Suppose the entire population is exactly 1,000,000 people.

Professors Karle and Wade want a good estimate of the **minimum number of people** who are likely following the DUO.  If you want to be **95% certain** of the answer you give the professors, what is the minimum number of people you would claim to be following the DUO?


First, explain in at least one sentence how you will calculate this result using your confidence interval from above (with words, not code).  Then, calculate it and include your answer in the Python cell below.

*(I will take the lower bound of the confidence interval (45.28%) and multiply it by the total population of 1,000,000 people. This gives the minimum number of people who are likely following the DUO with 95% certainty, which is about 452,800 people.)*

In [43]:
# 95% confident that AT LEAST this many people are following the DUO in a population of 1,000,000 people
total_population = 1000000
lower_bound = 45.28 / 100 

min_people = total_population * lower_bound
min_people


452800.00000000006
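
One detail worth noticing: rounding the bound up to 45.28% before scaling slightly overstates the minimum. A sketch of the same calculation using the unrounded lower bound printed earlier in this lab, and rounding down so the claim stays conservative (`min_people_unrounded` is a hypothetical name):

```python
import math

total_population = 1_000_000
lower_bound_percent = 45.2794932235813   # unrounded lower bound from the CI output above

# Round down so the claim never exceeds the bound.
min_people_unrounded = math.floor(total_population * lower_bound_percent / 100)
print(min_people_unrounded)
```

The difference is only a handful of people here, but it grows with the size of the population.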

<hr style="color: #DD3403;">

## Part 2: Towards a Smaller Margin of Error

The number of followers of @datascienceduo is fun, but the large margin of error you had is a little alarming.  For really important issues, we want a smaller margin of error in our sample.

**Q4**: What are at least **TWO** things we can do as data scientists to reduce the margin of error?

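*(First, we can increase the sample size: the standard error is the SD divided by the square root of n, so a larger sample gives a smaller standard error and a smaller margin of error. Second, we can lower the confidence level (for example, 90% instead of 95%), which gives a smaller z-score. Improving the sampling method makes the data more reliable by reducing bias, but it does not by itself shrink the margin of error.)*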
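
The margin of error formula, $z \times SE$, has two levers: the sample size inside the standard error and the z-score. The sketch below tries both. It assumes an SD of 0.5 (about the largest a 0/1 column can have), and `margin_of_error_percent` is a hypothetical helper name.

```python
import math
from scipy.stats import norm

def margin_of_error_percent(sd, n_obs, confidence=0.95):
    """Margin of error, in percent, for a 0/1 column with standard deviation `sd`."""
    z_score = norm.ppf(1 - (1 - confidence) / 2)
    return z_score * sd / math.sqrt(n_obs) * 100

sd_assumed = 0.5   # assumed SD for illustration
for n_obs in (49, 196, 784):
    print(n_obs, round(margin_of_error_percent(sd_assumed, n_obs), 2))

# Lowering the confidence level also shrinks the margin at a fixed sample size.
print(round(margin_of_error_percent(sd_assumed, 49, confidence=0.90), 2))
```

Because of the square root, quadrupling the sample size only halves the margin of error, which is why precise estimates get expensive quickly.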

### Part 2.1: An Expensive Sample

The issue of making the UIUC mascot the Kingfisher is a big one, so we'll want to make sure we get an accurate representation of the UIUC population. Taking a large sample requires surveying more people and getting more responses, which is almost always more expensive.  In the `DISCOVERY_populations` library you imported in Part 1, we have a second function: `getExpensiveSample()`.

The following code gets a larger and more expensive sample and stores it in `sample2`:

In [28]:
sample2 = DISCOVERY_populations.getExpensiveSample()
sample2

Unnamed: 0,DSmajor,FollowsDuo,ProKingfisher
141130,1,1,1
62766,0,0,1
142512,1,1,1
914,1,1,1
36658,0,0,1
...,...,...,...
54208,1,1,1
16479,0,0,0
119040,1,1,1
96308,1,1,1


### Part 2.2: Finding the Confidence Interval for Kingfisher Support

Find the lower and upper bounds for the 99% CI for the support of the Kingfisher mascot, storing them in `kingfisher_CI_lower` and `kingfisher_CI_upper`.  We provided individual cells for each stage of the computation, and you should make sure your answer is reasonable at each step. We also want your answers as **percentages between 0 and 100 percent**.

Make sure you're using `sample2` since you have the better, more expensive sample now! :)

In [30]:
# Step 1: Find the samplePercent:
sample2Percent = len(sample2[sample2["ProKingfisher"] == 1])/len(sample2) * 100
sample2Percent

71.26654064272212

In [45]:
# Step 2: Find the z-score for the 99% CI and store it in `z2`:
z2 = norm.ppf(0.995)
z2

np.float64(2.5758293035489004)

In [46]:
# Step 3: Find the sampleSE:
import numpy as np
sd2 = sample2["ProKingfisher"].std()
number = np.sqrt(len(sample2))
sample2SE = ( sd2 / number)  * 100
sample2SE


np.float64(1.3918720162579108)

In [48]:
# Step 4: Find the margin of error:
marginOfError2 = z2 * sample2SE
marginOfError2

np.float64(3.585224726266818)

In [49]:
# Find the lower bound of the CI:
kingfisher_CI_lower = sample2Percent - marginOfError2
kingfisher_CI_lower

np.float64(67.6813159164553)

In [50]:
# Find the upper bound of the CI:
kingfisher_CI_upper = sample2Percent + marginOfError2
kingfisher_CI_upper

np.float64(74.85176536898894)

In [51]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
from scipy.stats import norm

F = sample2.ProKingfisher
N = norm(F.mean(), F.std() / (len(F)**0.5))
low, high = N.interval(0.99)
assert( math.isclose(z2, 2.5758293035489004) ), "Check your Z-score for a 99% CI."
assert(kingfisher_CI_upper > kingfisher_CI_lower), "The upper bound must be larger than the lower bound."
assert( math.isclose(kingfisher_CI_lower, low * 100) ), "Check your `kingfisher_CI_lower` calculation."
assert( math.isclose(kingfisher_CI_upper, high * 100) ), "Check your `kingfisher_CI_upper` calculation."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉
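
The checkpoint above builds the interval with `scipy.stats.norm(...).interval(0.99)` instead of computing the margin of error by hand. The sketch below shows that the two routes agree. It uses small stand-in data (49 responses, 29 ones) rather than `sample2`, and works in proportions, multiplying by 100 only if percentages are needed.

```python
import math
import pandas as pd
from scipy.stats import norm

responses = pd.Series([1] * 29 + [0] * 20)   # illustrative 0/1 data
mean = responses.mean()
se = responses.std() / math.sqrt(len(responses))

# Manual construction: mean plus or minus z * SE
z99 = norm.ppf(0.995)
manual = (mean - z99 * se, mean + z99 * se)

# Same interval from a normal distribution centred on the sample mean
low, high = norm(mean, se).interval(0.99)

print(manual)
print((low, high))
```

Either route is fine; the manual version makes each component visible, while `interval` is a compact cross-check.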


### Part 2.3: Reflections

**Q5**: Write out the interpretation of your confidence interval in a complete sentence.

*(I am 99% confident that the true level of support for the Kingfisher mascot lies between 67.68% and 74.85%)*

**Q6**: If the whole population voted on whether the next mascot should be the Kingfisher, how confident are you that the resolution will pass (that is, receive at least 50% of the vote)? Explain in at least one complete sentence how the data analysis you did backs up your confidence.

*(I am confident that the resolution to make the Kingfisher the next mascot would pass with at least 50% of the vote. Based on the 99% confidence interval we calculated, the lower bound is approximately 67.68%, which is well above 50%. This means I am 99% confident that the true level of support for the Kingfisher mascot is above the majority threshold, making it very likely that the resolution would succeed.)*

**Q7**: Is the confidence interval of your larger (and more expensive) sample larger (wider) or smaller (narrower) than the first sample?  Why or why not?  Explain in at least one complete sentence.

*(The confidence interval of this sample is narrower than the first sample. This is because increasing the sample size reduces the standard error, leading to a more precise estimate of the population parameter and thus a narrower confidence interval.)*

<hr style="color: #DD3403;">

## Part 3: The Election is Here!

The polling is complete and the election day is here!  Run the following code to find your election-day results:

In [52]:
DISCOVERY_populations.electionDay()

The election was held and 21% of the population voted.

== Kingfisher Support ==
SUPPORT KINGFISHER: 29398 70.0%
OPPOSE KINGFISHER : 12602 30.0%

== Follows @datascienceduo ==
FOLLOWS DUO    : 21903 52.15%
DOES NOT FOLLOW: 20097 47.85%


**Q8**: In at least one complete sentence, explain if your analysis of the samples accurately predicted the outcomes.

*(Yes, I think my analysis of the samples accurately predicted the outcomes. The actual Kingfisher support was 70.0%, which falls inside my 99% confidence interval of 67.68% to 74.85% and is above 50%, as I predicted. The actual follow rate for @datascienceduo was 52.15%, which falls inside my 95% confidence interval of 45.28% to 73.09%. Both results confirm that my sample analysis provided reliable estimates for the election outcomes.)*
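
A quick way to check a prediction is to test whether each election result lands inside the matching interval. The sketch below copies the interval bounds and election percentages from the outputs in this lab. Note that the election reports voters only (21% of the population), so this compares the intervals with the voter percentages.

```python
# (lower bound, upper bound, election result), all in percent, copied from this lab's outputs.
intervals = {
    "Kingfisher support (99% CI)": (67.6813159164553, 74.85176536898894, 70.0),
    "Follows @datascienceduo (95% CI)": (45.2794932235813, 73.08785371519421, 52.15),
}

results = {}
for label, (lower, upper, actual) in intervals.items():
    results[label] = lower <= actual <= upper
    print(f"{label}: actual {actual}% inside [{lower:.2f}%, {upper:.2f}%]? {results[label]}")
```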

<hr style="color: #DD3403;">

# Submission

You're almost done!  All you need to do is to commit your lab to GitHub:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the lab is due. :)