# Lab: Random Variable 💯

In statistics and data science, random variables are used to model events that have uncertain outcomes.  For example, in DISCOVERY, we explore the **binomial distribution** to model flipping a coin, drawing from a deck of cards, guessing on a multiple choice exam, and many other events with a single, fixed probability of success.  However, what if there are multiple different outcomes?  This lab will explore creating custom discrete distributions in Python to model complex events!

In this lab, you will explore a dataset of the real final scores of students in DISCOVERY!  Before we get to that, let's nerd out with the basics of a distribution! :)


A few tips to remember:

- **You are not alone on your journey in learning programming!**  You have your lab Teaching Assistant, your Course Aides, your lab group, and the professors (Prof. Wade and Prof. Karle), who are all here to help you out!
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!

In [0]:
# Meet your CAs and TA if you haven't already!
# First name is enough, we'll know who they are! :)
ta_name = ""
ca1_name = ""
ca2_name = ""
ca3_name = ""

# Say hello to each other!
# - Groups of 3 are ideal :)
# - However, groups of 2 or 4 are fine too!
#
# QOTD to Ask Your Group: "What's your favorite RSO on campus?"
partner1_name = ""
partner1_netid = ""
partner1_rso = ""

partner2_name = ""
partner2_netid = ""
partner2_rso = ""

partner3_name = ""
partner3_netid = ""
partner3_rso = ""

<hr style="color: #DD3403;">

## Random Variable #1: Modeling Flipping a Coin Twice

In DISCOVERY, we introduce flipping a coin twice as an example binomial distribution.  Create a variable that contains a binomial distribution called `COIN` that models the distribution of the number of heads we see when we flip a coin two times:

(Not sure?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)

In [0]:
COIN = ...

There are three different outcomes of flipping two coins and counting the number of heads:

| `COIN` ~ Number of Heads | P(`COIN` = #) |
| --------------: | ----------: |
| 0 heads | 25% |
| 1 head | 50% |
| 2 heads | 25% |

The **expected value** of the distribution is the weighted sum of the possible outcomes, where each outcome is weighted by its probability.  This means we need to add together:

- Zero heads, multiplied by the probability of getting zero heads,
- One head, multiplied by the probability of getting one head, and
- Two heads, multiplied by the probability of getting two heads.

Mathematically, applying the generic formula for the expected value of any discrete distribution to `COIN` gives:

$$EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$$


Solving the equation:

- $EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$
- $EV_{COIN} = (0\text{ heads}) + (0.5\text{ heads}) + (0.5\text{ heads})$
- $EV_{COIN} = 1\text{ head}$
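
As a quick check of the arithmetic above, the same weighted sum can be sketched in plain Python.  The list names `outcomes` and `probabilities` are only illustrative and are not variables the lab requires:

```python
# Outcomes and probabilities copied from the COIN table above.
outcomes = [0, 1, 2]
probabilities = [0.25, 0.50, 0.25]

# Expected value: multiply each outcome by its probability, then add.
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected_value)  # 1.0
```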


Alternatively, we can use the simplified formula for binomial variables:

$$EV_{binomial} = n * p$$

Solving the equation for the `COIN`:

- $EV_{COIN} = n * p$, where n=2, and p=0.5
- $EV_{COIN} = 2 * 0.5$
- $EV_{COIN} = 1\text{ head}$
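
To see the `n * p` shortcut on a different example, here is a minimal sketch that assumes a hypothetical quiz of 5 multiple choice questions, each guessed with a 25% chance of success.  The name `QUIZ` and its parameters are made up for illustration:

```python
from scipy.stats import binom

# Hypothetical example: guessing on 5 questions, each with a 25% chance of success.
n = 5
p = 0.25
QUIZ = binom(n, p)

print(QUIZ.mean())  # expected number of correct guesses
print(n * p)        # the shortcut formula gives the same value
```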


### Verifying our `COIN` Distribution in Python

Use `COIN.mean()` to verify the expected value in Python:

In [0]:
EV_COIN = ...
EV_COIN

### 📝 Technical Note 📝

By default, `COIN.mean()` will often give you a non-native value and display something similar to:

> ```
> np.float64(1.0)
> ```

You will see this occur when working with scientific libraries that build from a library called `numpy` or `np`.
- It's okay to leave the value as a numpy data type; the value does not change.
- However, if you find this messy, you can use `.item()` to extract the Python native value out of a numpy data type.

For example, instead of using `COIN.mean()` above, try the following:

> > ```py
> > COIN.mean().item()
> > ```
>
> Output: 1.0

All of the test cases in this lab will support both numpy data types and Python primitive data types, so do whichever one you think looks the best! :)
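
If you want to see the difference for yourself, this small sketch builds a numpy value directly (rather than from a distribution) and converts it with `.item()`.  The variable names `value` and `native` are only illustrative:

```python
import numpy as np

value = np.float64(1.0)   # a numpy value, similar to what .mean() displays
native = value.item()     # the same number as a plain Python float

print(type(value).__name__)   # float64
print(type(native).__name__)  # float
print(value == native)        # True
```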

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #1: Modeling Flipping a Coin Twice
tada = "\N{PARTY POPPER}"
import math
assert("COIN" in vars()), "Make sure your random variable is named `COIN`."
assert(COIN.mean() == 1), "Check your parameters for your COIN distribution."
assert(EV_COIN == 1), "Your expected value is incorrect."
assert(math.isclose(COIN.std(), 2**(1/2)/2)), "Check your parameters for your COIN distribution."
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #2: The Value of a Dice Roll

A common example in statistics is modeling the outcome of rolling a die.  Unfortunately, a binomial distribution only counts successes across trials where each trial is either a zero (not successful) or a one (successful).  However, a single die has six equally likely outcomes: 1, 2, 3, 4, 5, or 6.

To model this more complex event, we will use a **custom discrete distribution**.

### Requirements of a Custom Discrete Distribution

Similar to the binomial distribution, any custom discrete distribution we create must have three properties:

1. The event we are modeling must have a **fixed outcome that is independent** (it does not matter what happened in the past),

2. The event we are modeling must have a **probability that does not change** (no external factor changes the probability), **AND**

3. The event we are modeling must have a **finite number of outcomes** (as opposed to the normal distribution that can have any possible Z-score, like 0.000332, 0.094322, or any number you can imagine; the normal distribution is NOT finite. For example: we cannot create a distribution of rolling a die that has an infinite number of sides.)

### Dice Distribution

The following table describes the distribution of a six-sided die:

| Outcome | Probability |
| ------: | ----------: |
| 1 | 1/6 |
| 2 | 1/6 |
| 3 | 1/6 |
| 4 | 1/6 |
| 5 | 1/6 |
| 6 | 1/6 |

### Creating a Custom Discrete Distribution

In Python, we must provide two parallel lists of outcomes and the probabilities, similar to the table above.  One list will contain all the outcomes and one list will contain all the probabilities.

For example:

> ```py
> outcomes    = [   1,   2,   3,   4,   5,   6 ]
> probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]
> 
> # Programmers will often use a lot of extra spaces to make it line up visually,
> # just like we did in the code above, but it is not required.
> ```

Once we have our two lists, the scipy.stats `rv_discrete` function can be used to make our distribution using the following code:

> ```py
> from scipy.stats import rv_discrete
> DICE = rv_discrete( values=(outcomes, probability) )
> ```
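
The same pattern works for other finite sets of outcomes.  As a sketch, here is a hypothetical four-sided die; the name `D4` and its values are assumptions for illustration and are not part of this lab's required answers:

```python
from scipy.stats import rv_discrete

# Hypothetical four-sided die: each side is equally likely.
outcomes    = [    1,    2,    3,    4 ]
probability = [ 0.25, 0.25, 0.25, 0.25 ]

D4 = rv_discrete(values=(outcomes, probability))

print(D4.mean())   # expected value of one roll
print(D4.std())    # spread of one roll, computed with .std()
print(D4.pmf(3))   # probability of rolling exactly 3
```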

Create the `DICE` distribution below:


In [0]:
...

Once you have the `DICE` random variable, find the expected value and standard error for the `DICE` random variable:

In [0]:
# Expected Value for `DICE`:
EV_DICE = ...
EV_DICE

In [0]:
# Standard Error for `DICE`:
SE_DICE = ...
SE_DICE

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #2: The Value of a Dice Roll
tada = "\N{PARTY POPPER}"

import math
assert("DICE" in vars()), "Make sure your random variable is named `DICE`."
assert(math.isclose(EV_DICE, 3.5)), "The expected value for your distribution is incorrect."
assert(math.isclose(SE_DICE, (105 / 36)**0.5)), "The standard error for your distribution is incorrect."
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #3: Customer in a Tea Shop

Let's create a distribution to model a tea shop!  When a customer arrives, we have historical data to suggest the following pattern:

| Description | Outcome | Probability |
| ----------- | ------: | ----------: |
| Customer buys a black tea | $ 4.49 | 20% |
| Customer buys a bubble tea | $ 5.69 | 40% |
| Customer buys a black tea and treat | $ 7.69 | 15% |
| Customer buys a bubble tea and treat | $ 8.89 | 15% |
| Customer buys nothing | $ 0.00 | 10% |

Create a custom discrete distribution for this tea shop called `TEA`  (if you're not sure, re-read the previous section on how to create a custom distribution).


In [0]:
...

Once you have the `TEA` distribution, find the expected value and standard error of this new distribution:

In [0]:
# Expected Value for `TEA`:
EV_TEA = ...
EV_TEA

In [0]:
# Standard Error for `TEA`:
SE_TEA = ...
SE_TEA

In [0]:
### TEST CASE for Random Variable #3: Customer in a Tea Shop
tada = "\N{PARTY POPPER}"

import math
assert("TEA" in vars()), "Make sure your random variable is named `TEA`."
assert(math.isclose(EV_TEA, 5.661)), "The expected value for your distribution is incorrect."
assert(math.isclose(SE_TEA, 2.379237062589603)), "The standard error for your distribution is incorrect."
print(f"{tada} All Tests Passed! {tada}")

### Analysis

Using the `TEA` distribution, let's find `TEA.cdf(6)`:

In [0]:
TEA.cdf(6)

**Q1**: What is the real-world meaning of `TEA.cdf(6)` returning the value of `0.7` in terms of your tea shop?  Explain in at least one complete sentence the real-world significance of this to the tea shop owner  (you will not earn points for just explaining the definition of CDF).

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

## Random Variable #4: Final Grades in DISCOVERY

Finally, we want to create a random variable called `DISCOVERY` that will model the final grades for this semester in DISCOVERY.  We can use historical data to build the custom discrete random variable.

We have provided you with a dataset containing the **ACTUAL** final points for all 1,045 students in DISCOVERY during the Fall 2023 semester (exactly one year ago).  Using `pandas`, read the `fa23-final-points.csv` dataset into a DataFrame and store it as `df`:
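
As a reminder of how `pandas` reads CSV data, here is a minimal sketch that uses a made-up, in-memory stand-in for a file so it can run anywhere.  The three scores and the names `fake_csv` and `df_example` are hypothetical; only the `Final Points` column name comes from this lab:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the three scores below are made up.
fake_csv = io.StringIO("Final Points\n900\n950\n1000\n")

# With a real file, pass its filename (a string) instead of `fake_csv`.
df_example = pd.read_csv(fake_csv)
print(df_example)
```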

### Visualize a Histogram of DISCOVERY

Create a histogram of all the scores in DISCOVERY using **50 bins**:

In [0]:
# Create a histogram of all the final scores in DISCOVERY:
...

### Counting Unique Values

When you need to find the count of unique values contained in a DataFrame, the `.value_counts()` function will return a pandas Series that contains each unique value and the number of times that value appears.

We'll want to reset the index to work with the data, so make sure to use `reset_index()` just like we did with groupby.  That means the full Python syntax to find all the unique values for `Final Points` is:

> ```py
> df.value_counts().reset_index()
> ```
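
To see what this produces, here is a small sketch on a made-up DataFrame.  The names `df_small`, `df_counts`, and the `Score` column are hypothetical, and the name of the count column depends on your pandas version:

```python
import pandas as pd

# Tiny made-up DataFrame with a single column.
df_small = pd.DataFrame({"Score": [90, 85, 90, 70, 90, 85]})

# Count how often each unique value appears, then turn the result
# back into a regular DataFrame with a default 0, 1, 2, ... index.
df_counts = df_small.value_counts().reset_index()
print(df_counts)
```

The most common value appears in the first row, and the counts add up to the number of rows in the original DataFrame.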

Use this syntax to find all unique values for the `Final Points` column in `df` and assign this new DataFrame to the variable `df_courseFinalPoints`:

In [0]:
df_courseFinalPoints = ...
df_courseFinalPoints

In [0]:
### TEST CASE for `df_courseFinalPoints`
tada = "\N{PARTY POPPER}"

assert("df_courseFinalPoints" in vars()), "Make sure your new DataFrame is stored in the variable `df_courseFinalPoints`."
assert("Final Points" in df_courseFinalPoints), "Make sure to do .reset_index()"
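assert("count" in df_courseFinalPoints or 0 in df_courseFinalPoints), "Make sure to do .reset_index()"
count_column = "count" if "count" in df_courseFinalPoints else 0  # older pandas versions name this column 0
assert( df_courseFinalPoints[count_column].sum() == len(df) )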

print(f"{tada} All Tests Passed! {tada}")

### Building a Discrete Distribution for DISCOVERY

Recall from Random Variable #2 that we need to create two lists, `outcomes` and `probability`:

> ```py
> outcomes    = [   1,   2,   3,   4,   5,   6 ]
> probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]
> ```

Applying this to the DataFrame you just created, the `outcomes` are the various final total points in DISCOVERY of the 1,045 students last Fall.  The `probability` is the likelihood of each outcome.  Looking at the first few rows, this means two lists of values might be:

> ```py
> # List of total final points in DISCOVERY:
> outcomes    = [ 963, 969, 967, ... ]
>
> # Occurrences of each outcome:
> occurrences = [  15,  14,  13, ... ]
> ```

*(Note that we'll find the occurrences first, and then convert them to the final probability.)*

### Extracting the List of Outcomes

In your `df_courseFinalPoints` DataFrame, you have two columns: `Final Points` and `count` (or, on older versions of the library, just `0`).  To extract all the values from a specific column, use the `.values` attribute on the column, which returns the values as a NumPy array (a list-like sequence).  For example, to get the values for the column `Illinois` from the DataFrame `df`, you can use the following code:

> ```py
> df["Illinois"].values
> ```
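
Here is a small runnable sketch of what `.values` gives back.  The DataFrame `df_demo` and its numbers are made up; only the column name `Illinois` comes from the example above:

```python
import numpy as np
import pandas as pd

# Made-up DataFrame; only the column name `Illinois` comes from the text above.
df_demo = pd.DataFrame({"Illinois": [3, 1, 4]})

column_values = df_demo["Illinois"].values
print(column_values)        # [3 1 4]
print(type(column_values))  # <class 'numpy.ndarray'>
```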

Use your `df_courseFinalPoints` to find the list of all `outcomes` in DISCOVERY:

In [0]:
outcomes = ...
outcomes

### Extracting the List of Occurrences

Repeat the same process to find the list of all `occurrences`, or how many times each of the outcomes occurred:

In [0]:
occurrences = ...
occurrences

### Converting `occurrences` to a Probability

To use this data to create a distribution, we need both the **outcome** and the **probability**.  Currently, we have the outcomes (`outcomes`), but we do not have the probabilities.  To convert `occurrences` to probabilities, divide `occurrences` by the total number of students.

Create a new variable `probability` that holds the probability of getting each value in `Final Points`.

*Hint 1: You can divide an entire NumPy array (which is what `.values` returns) by a number, and it will divide each individual element by that number; ex: `probability = counts / 10` will divide each value in `counts` by 10.*

*Hint 2: To get the total number of students, think about what each value in `occurrences` represents.*
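
As a sketch of Hint 1, the example below uses a made-up array named `counts`.  It also shows that this kind of division works on NumPy arrays but not on plain Python lists:

```python
import numpy as np

counts = np.array([5, 3, 2])   # made-up counts

# Dividing a NumPy array by a number divides every element.
print(counts / 10)             # [0.5 0.3 0.2]

# A plain Python list does not support this and raises a TypeError.
try:
    [5, 3, 2] / 10
except TypeError as error:
    print("TypeError:", error)
```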

In [0]:
probability = ...
probability

### Create the DISCOVERY distribution

Using `rv_discrete` (see Random Variable #2 to refresh your memory), create the `DISCOVERY` discrete random variable:

In [0]:
DISCOVERY = ...

### Statistic #1: Average Score

Using the `DISCOVERY` distribution, what is the average score (or "expected value"), in points, in DISCOVERY?  Store your result in `avg_score`:

In [0]:
avg_score = ...
avg_score

### Statistic #2: Median Score

Using the `DISCOVERY` distribution, what is the median score (the 50th percentile), in points, in DISCOVERY?  Store your result in `median_score`:

In [0]:
median_score = ...
median_score

### Statistic #3: Earning an "A" in DISCOVERY

What percentage of students earned an "A" in DISCOVERY?  Earning an "A" requires a student to earn at least 930 points.  Store your result in `pct_A` as a proportion between 0 and 1 (for example, `0.25` for 25%):

(Not sure how to find this?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)
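
As a reminder of how `cdf` behaves on a discrete distribution, here is a minimal sketch that rebuilds the two-coin-flip table from earlier in this lab as a custom distribution.  The name `FLIPS` is only illustrative:

```python
from scipy.stats import rv_discrete

# The two-coin-flip table from earlier in this lab, built as a custom distribution.
FLIPS = rv_discrete(values=([0, 1, 2], [0.25, 0.50, 0.25]))

# cdf(x) is P(outcome <= x), so "at least 1 head" is everything NOT covered by cdf(0).
p_at_most_0  = FLIPS.cdf(0)
p_at_least_1 = 1 - FLIPS.cdf(0)

print(p_at_most_0)   # probability of 0 heads
print(p_at_least_1)  # probability of 1 or 2 heads
```

Notice that `cdf` includes the value you pass in, so think carefully about which value to use when a boundary outcome should count toward "at least".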

In [0]:
pct_A = ...
pct_A

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #4: Final Grades in DISCOVERY
tada = "\N{PARTY POPPER}"

import math
assert("DISCOVERY" in vars())
assert("avg_score" in vars())
assert(math.isclose(avg_score, 934.5502392344499)), "Your average score is incorrect."
assert(math.isclose(median_score, 958)), "Your median score is incorrect."
assert(pct_A > 0.50), "There were more than 50% As... are you sure the function is doing what you expect it to be doing?"
assert(not math.isclose(pct_A, 1 - 0.32248803827751177)), "If you earn a 930, you still get an \"A\".  It looks like your solution excluded 930 as an \"A\" -- make sure to include it."
assert(math.isclose(pct_A, 1 - 0.31866028708133953))

assert(math.isclose(DISCOVERY.mean(), avg_score))

print(f"{tada} All Tests Passed! {tada}")

### Analysis

Using the `DISCOVERY` distribution, let's find `DISCOVERY.ppf(0.9)`:

In [0]:
DISCOVERY.ppf(0.9)

**Q2**: What is the real-world meaning of `DISCOVERY.ppf(0.9)` returning the value of `1003` in terms of the DISCOVERY course?  Explain in at least one complete sentence the significance of 1003 to a fellow student in this class (you will not earn points for just explaining the definition of PPF).

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

# Submission

You're almost done!  All you need to do is to commit your lab to GitHub:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the lab is due. :)