<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Custom Discrete Distribution in Python</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/custom-discrete-distributions-in-python/">https://discovery.cs.illinois.edu/microproject/custom-discrete-distributions-in-python/</a></div>
</h1>

<hr style="color: #DD3403;">

## Random Variables and Distributions

In statistics and data science, random variables are used to model events that have uncertain outcomes.  For example, in DISCOVERY, we explore the **binomial distribution** to model flipping a coin, drawing from a deck of cards, guessing on a multiple choice exam, and many other events with a single, fixed probability of success.  However, what if there are multiple different outcomes?  This MicroProject will explore creating custom discrete distributions in Python to model complex events!

In this MicroProject, you will explore a dataset of the real final scores of students in DISCOVERY!  Before we get to that, let's nerd out with the basics of a distribution! :)

<hr style="color: #DD3403;">

## Random Variable #1: Modeling Flipping a Coin Twice

In DISCOVERY, we introduce flipping a coin twice as an example binomial distribution.  Create a variable that contains a binomial distribution called `COIN` that models the distribution of the number of heads we see when we flip a coin two times:

(Not sure?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)

In [0]:
COIN = ...

There are three different outcomes of flipping two coins and counting the number of heads:

| Number of Heads | Probability |
| --------------: | ----------: |
| 0 heads | 25% |
| 1 head | 50% |
| 2 heads | 25% |

The **expected value** of the distribution is the weighted sum of the possible outcomes.  This means we multiply each outcome by its probability and add the results together:

- Zero heads, multiplied by the probability of getting zero heads,
- One head, multiplied by the probability of getting one head, and
- Two heads, multiplied by the probability of getting two heads.

Mathematically, it's the following equation:

$$EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$$


Solving the equation:

- $EV_{COIN} = ((0\text{ heads}) * 25\%) + ((1\text{ head}) * 50\%) + ((2\text{ heads}) * 25\%)$
- $EV_{COIN} = (0\text{ heads}) + (0.5\text{ heads}) + (0.5\text{ heads})$
- $EV_{COIN} = 1\text{ head}$
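
The same weighted sum can be sketched by hand in plain Python.  This is only an illustration of the arithmetic above, not the `COIN` distribution itself; the variable names `heads`, `chance`, and `expected_value` are ours:

```py
# Outcomes and probabilities copied from the table above
heads  = [    0,    1,    2 ]
chance = [ 0.25, 0.50, 0.25 ]

# Multiply each outcome by its probability, then add the results together
expected_value = sum(h * c for h, c in zip(heads, chance))
print(expected_value)
```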


### Verifying our `COIN` Distribution in Python

Use `COIN.mean()` to verify the expected value in Python:

In [0]:
COIN.mean()

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #1: Modeling Flipping a Coin Twice
tada = "\N{PARTY POPPER}"
import math
assert("COIN" in vars())
assert(COIN.mean() == 1)
assert(math.isclose(COIN.std(), 2**(1/2)/2))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #2: The Value of a Dice Roll

A common example in statistics is modeling the outcome of rolling a die.  Unfortunately, each trial in a binomial distribution has only two outcomes: a zero (not successful) or a one (successful).  However, a single die has six equally likely outcomes: 1, 2, 3, 4, 5, or 6.

To model this more complex event, we will use a **custom discrete distribution**.

### Requirements of a Custom Discrete Distribution

Similar to the binomial distribution, any custom discrete distribution we create must have three properties:

1. The event we are modeling must have a **fixed outcome that is independent** (it does not matter what happened in the past),

2. The event we are modeling must have a **probability that does not change** (no external factor changes the probability), **AND**

3. The event we are modeling must have a **finite number of outcomes** (as opposed to the normal distribution that can have any possible Z-score, like 0.000332, 0.094322, or any number you can imagine; the normal distribution is NOT finite.)

### Dice Distribution

The following table describes the distribution of a six-sided die:

| Outcome | Probability |
| ------: | ----------: |
| 1 | 1/6 |
| 2 | 1/6 |
| 3 | 1/6 |
| 4 | 1/6 |
| 5 | 1/6 |
| 6 | 1/6 |

### Creating a Custom Discrete Distribution

In Python, we must provide two parallel lists of outcomes and the probabilities, similar to the table above.  One list will contain all the outcomes and one list will contain all the probabilities.

For example:

```py
outcomes    = [   1,   2,   3,   4,   5,   6 ]
probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]

# Programmers will often use a lot of extra spaces to make it line up visually,
# just like we did in the code above, but it is not required.
```

Once we have our two lists, the scipy.stats `rv_discrete` function can be used to make our distribution using the following code:

```py
from scipy.stats import rv_discrete
DICE = rv_discrete( values=(outcomes, probability) )
```
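
As a minimal sketch of the same pattern on a different event, assume a hypothetical four-sided die.  The name `D4` and its lists are illustrative and not part of this MicroProject; `.mean()` and `.pmf()` are standard methods on the resulting distribution:

```py
from scipy.stats import rv_discrete

# Hypothetical four-sided die: four equally likely outcomes
d4_outcomes    = [    1,    2,    3,    4 ]
d4_probability = [ 0.25, 0.25, 0.25, 0.25 ]

# The probabilities of a distribution should add up to 1
assert abs(sum(d4_probability) - 1) < 1e-9

D4 = rv_discrete( values=(d4_outcomes, d4_probability) )

print(D4.mean())   # expected value of one roll
print(D4.pmf(3))   # probability of rolling exactly a 3
```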

Create the `DICE` distribution below:


In [0]:
...

Let's check the expected value:

In [0]:
DICE.mean()

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #2: The Value of a Dice Roll
tada = "\N{PARTY POPPER}"

import math
assert("DICE" in vars())
assert(DICE.mean() == 3.5)
assert(math.isclose(DICE.std(), 1.707825127659933))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #3: Customer in a Tea Shop

Let's create a distribution to model a tea shop!  When a customer arrives, we have historical data to suggest the following pattern:

| Description | Outcome | Probability |
| ----------- | ------: | ----------: |
| Customer buys a black tea | $ 4.49 | 20% |
| Customer buys a bubble tea | $ 5.69 | 40% |
| Customer buys a black tea and treat | $ 7.69 | 15% |
| Customer buys a bubble tea and treat | $ 8.89 | 15% |
| Customer buys nothing | $ 0.00 | 10% |

Create a custom discrete distribution for this tea shop called `TEA`.  (If you're not sure, re-read the previous section on how to create a custom distribution.)


In [0]:
...

Let's check the expected value (in this case, the amount of money we expect an "average" person to spend):

In [0]:
round(TEA.mean(), 2)

In [0]:
### TEST CASE for Random Variable #3: Customer in a Tea Shop
tada = "\N{PARTY POPPER}"

import math
assert("TEA" in vars())
assert(math.isclose(TEA.mean(), 5.661))
assert(math.isclose(TEA.std(), 2.379237062589603))
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Random Variable #4: Final Grades in DISCOVERY

Finally, we want to create a random variable called `DISCOVERY` that will model the final scores for this semester in DISCOVERY.  We can use historical data to build the custom discrete random variable.

We have provided you with a dataset containing the **ACTUAL** final points for all 1,045 students in DISCOVERY during the Fall 2023 semester.  Using `pandas`, read the `fa23-final-points.csv` dataset into a DataFrame and store it as `df`:

### Counting Unique Values

When you need to find the count of unique values contained in a DataFrame, the `.value_counts()` function will return a pandas `Series` (a one-dimensional DataFrame) that contains each unique value and the number of times that value appears.
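
A minimal sketch of what `.value_counts()` returns, using a small made-up DataFrame.  The `toy_df` data and the `Score` column are assumptions for illustration, not the DISCOVERY dataset:

```py
import pandas as pd

# Made-up scores: 90 appears three times, 85 twice, 70 once
toy_df = pd.DataFrame({"Score": [90, 85, 90, 70, 90, 85]})

# Each unique value paired with how often it appears, most frequent first
toy_counts = toy_df["Score"].value_counts()
print(toy_counts)

print(toy_counts.index.tolist())    # the unique values
print(toy_counts.values.tolist())   # how many times each value appears
```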

To find the unique values in the `Final Points` column using `value_counts()`, we run the following code that stores the result as `valueCounts`:

In [0]:
valueCounts = df["Final Points"].value_counts()
valueCounts

### Storing the Index Column

The left column lists each unique value of points, starting with the most frequently occurring value.  Specifically, more students earned **963** points in DISCOVERY than any other number of points.  The right column contains the number of times that value occurred.  In total, **15** students earned 963 points.

The left column, storing the number of points, is referred to as the `index`.  Store the data in the index column, specifically `valueCounts.index`, as the variable `points`:

In [0]:
points = ...
points

### Storing the Value Column

The right column, storing the number of times that a specific number of points was earned, is referred to as the `values` column.  Store the values column in the variable `counts`:

In [0]:
counts = ...
counts

### Converting `counts` to a Probability

To use this data to create a distribution, we need both the **outcome** and the **probability**.  Currently, we have the outcomes (`points`), but we do not have a probability.  To convert `counts` to a probability, we need to divide `counts` by the total number of students.

Create a new variable `probability` that stores the probability of getting each value in `points`.  (*Hint: You can divide an array by a number, and it will divide each individual element; ex: `probability = counts / 10` will divide each value in `counts` by 10.*)
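
A minimal sketch of that element-by-element division, using made-up counts (`toy_counts` and `toy_probability` are illustrative names, not the DISCOVERY data).  Dividing by the total makes the probabilities add up to 1:

```py
import numpy as np

# Made-up counts for three outcomes
toy_counts = np.array([3, 2, 1])

# Dividing an array by a number divides every element
toy_probability = toy_counts / toy_counts.sum()

print(toy_probability)
print(toy_probability.sum())   # should be 1 (up to rounding)
```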

In [0]:
probability = ...
probability

### Create the DISCOVERY distribution

Using `rv_discrete` (learned earlier in this MicroProject), create the `DISCOVERY` discrete random variable:

In [0]:
DISCOVERY = ...

# SOLUTION
DISCOVERY = rv_discrete( values=(points, probability) )

### Statistic #1: Average Score

Using the `DISCOVERY` distribution, what is the average score (or "expected value"), in points, in DISCOVERY?  Store your result in `avg_score`:

In [0]:
avg_score = ...
avg_score

### Statistic #2: Median Score

Using the `DISCOVERY` distribution, what is the median score (50%-tile), in points, in DISCOVERY?  Store your result in `median_score`:

In [0]:
median_score = ...
median_score

### Statistic #3: Earning an "A" in DISCOVERY

What percentage of students earned an "A" in DISCOVERY?  Earning an "A" requires a student to earn at least 930 points.  Store the percentage of people in `pct_A`:

(Not sure how to find this?  Check out the DISCOVERY page on "Python Functions for Random Distributions" here:
https://discovery.cs.illinois.edu/learn/Polling-Confidence-Intervals-and-Hypothesis-Testing/Python-Functions-for-Random-Distributions/)
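
With a discrete distribution, "at least" and "more than" give different answers, because `.cdf(x)` includes `x` itself.  A minimal sketch on a six-sided die, assuming whole-number outcomes (the name `die` is illustrative):

```py
from scipy.stats import rv_discrete

outcomes    = [   1,   2,   3,   4,   5,   6 ]
probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]
die = rv_discrete( values=(outcomes, probability) )

# P(roll >= 5): remove everything strictly below 5, i.e. rolls of 4 or less
p_at_least_5 = 1 - die.cdf(4)

# P(roll > 5): removing rolls of 5 or less also removes the 5 itself
p_more_than_5 = 1 - die.cdf(5)

print(p_at_least_5)
print(p_more_than_5)
```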

In [0]:
pct_A = ...
pct_A

### Statistic #4: Being a Part of the Top 10%

How many points do you need to earn so that you would be among the top 10% of the course?  Store the number of points in `top10pct`:
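
One way to go from a percentile back to an outcome is the `.ppf()` method, which returns the smallest outcome whose cumulative probability reaches the requested value.  A minimal sketch on a six-sided die (the name `die` is illustrative, and this is not necessarily the approach the DISCOVERY page uses):

```py
from scipy.stats import rv_discrete

outcomes    = [   1,   2,   3,   4,   5,   6 ]
probability = [ 1/6, 1/6, 1/6, 1/6, 1/6, 1/6 ]
die = rv_discrete( values=(outcomes, probability) )

# Rolls of 5 or less cover about 83% of rolls, which is below 90%,
# so the 90th percentile of a die roll is a 6
cutoff = die.ppf(0.90)
print(cutoff)
```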

In [0]:
top10pct = ...
top10pct

### 🔬 Checkpoint Tests 🔬

In [0]:
### TEST CASE for Random Variable #4: Final Scores in DISCOVERY
tada = "\N{PARTY POPPER}"

import math
assert("DISCOVERY" in vars())
assert("avg_score" in vars())
assert(math.isclose(avg_score, 934.5502392344499))
assert(math.isclose(median_score, 958))
assert(pct_A > 0.50), "There were more than 50% As... are you sure the function is doing what you expect it to be doing?"
assert(not math.isclose(pct_A, 1 - 0.32248803827751177)), "If you earn a 930, you still get an \"A\".  It looks like your solution excluded 930 as an \"A\" -- make sure to include it."
assert(math.isclose(pct_A, 1 - 0.31866028708133953))
assert(top10pct > 1000), "To be in the top 10%, you need more than 1000 points!  We love extra credit! :)"
assert(top10pct - 3 == 1000)

assert(math.isclose(DISCOVERY.mean(), avg_score))

print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your notebook to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/custom-discrete-distributions-in-python/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉