# Product Defects

You are in charge of monitoring the number of defective products from a specific factory. You've been told that the number of defects on a given day follows a Poisson distribution with the rate parameter (lambda) equal to 7. You're new here, so you want to get a feel for what it means to follow the Poisson(7) distribution. You remember that the rate parameter of a Poisson distribution is also its expected value, so in this case, the expected value of the Poisson(7) distribution is 7 defects per day.

You will investigate certain attributes of the Poisson(7) distribution to get an intuition for how many defective objects you should expect to see in a given amount of time. You will also practice and apply what you know about the Poisson distribution on a practice data set that you will simulate yourself.

## Distribution in Theory

1. Create a variable called `lam` that represents the rate parameter of our distribution.

In [2]:
import scipy.stats as stats
import numpy as np

## Task 1:
lam = 7

2. You know that the rate parameter of a Poisson distribution is equal to the expected value. So in our factory, the rate parameter would equal the expected number of defects on a given day. You are curious about how often we might observe the exact expected number of defects.

   Calculate and print the probability of observing exactly `lam` defects on a given day.

In [3]:
## Task 2:
# ...observing exactly lam defects... means EXACT value of interest in this case is the same as lambda = 7
# stats.poisson.pmf(<EXACT value of interest>, <expected value (lambda)>)
print(stats.poisson.pmf(lam, lam))
# The result of 0.149 can be confirmed with the poisson-slider at
# (https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/poisson-slider/index.html)

0.14900277967433773
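
To see what `pmf` is computing, you can reproduce the value above from the Poisson formula $P(X = k) = e^{-\lambda}\lambda^k / k!$. The following is a minimal sketch that uses only the standard library; `poisson_pmf` is a hypothetical helper name chosen for illustration, not part of SciPy.

```python
import math

def poisson_pmf(k, lam):
    # Hypothetical helper: P(X = k) for X ~ Poisson(lam), straight from the formula
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 7
p_exact = poisson_pmf(lam, lam)
print(p_exact)  # should agree with stats.poisson.pmf(lam, lam) up to rounding
```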


3. Our boss said that having 4 or fewer defects on a given day is an exceptionally good day. You are curious about how often that might happen.

   Calculate and print the probability of having one of those days.

In [4]:
## Task 3:
# ...4 or fewer defects... means a range of 0, 1, 2, 3, 4
# stats.poisson.cdf(<RANGE of interest>, <expected value (lambda)>)
print(stats.poisson.cdf(4, lam))

0.17299160788207146
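
The CDF at 4 is the sum of the PMF over 0, 1, 2, 3, and 4. As a rough check, the sketch below adds those five probabilities by hand, using a hypothetical `poisson_pmf` helper built on the standard library rather than SciPy.

```python
import math

def poisson_pmf(k, lam):
    # Hypothetical helper: P(X = k) for X ~ Poisson(lam)
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 7
# P(X <= 4) = P(X = 0) + P(X = 1) + ... + P(X = 4)
p_good_day = sum(poisson_pmf(k, lam) for k in range(5))
print(p_good_day)  # should agree with stats.poisson.cdf(4, lam) up to rounding
```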


4. On the other hand, our boss said that having more than 9 defects on a given day is considered a bad day.

   Calculate and print the probability of having one of these bad days.

In [5]:
## Task 4:
# ...more than 9 defects... means from 10 to maximum
# Range 1: 0 to maximum --> Probability that a number will be in this range is 1.
# Range 2: 0 to 9
# Range of interest = Probability(Range 1) - Probability(Range 2)
print(1 - stats.poisson.cdf(9, lam))

0.16950406276132668
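
SciPy also exposes the complement of the CDF directly as the survival function `sf`, which returns P(X > k). The sketch below, assuming the same `lam = 7`, shows that it gives the same answer as subtracting the CDF from 1.

```python
import scipy.stats as stats

lam = 7
p_complement = 1 - stats.poisson.cdf(9, lam)  # 1 - P(X <= 9)
p_survival = stats.poisson.sf(9, lam)         # P(X > 9), the survival function
print(p_complement, p_survival)
```

For probabilities far out in the tail, `sf` tends to be the safer choice because it avoids subtracting two nearly equal numbers.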


## Distribution in Practice

5. You've familiarized yourself a little bit with how the Poisson distribution works in theory by calculating different probabilities. But let's look at what this might look like in practice.

   Create a variable called `year_defects` that has 365 random values from the Poisson distribution.

In [6]:
## Task 5:
# generate random variable
# stats.poisson.rvs(lambda, size = num_values)
year_defects = stats.poisson.rvs(lam, size=365)

# Since we are using randomly generated values, the following results may change if the notebook is executed again.
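
If you want the same simulated year every time the notebook runs, one option is to pass a seed through the `random_state` argument of `rvs`. This is a sketch only: the seed value `42` is an arbitrary choice for illustration, and the outputs shown in this notebook were generated without a seed, so a seeded run will produce different numbers.

```python
import scipy.stats as stats

lam = 7
SEED = 42  # arbitrary illustrative value, not used elsewhere in this notebook

# Two draws with the same seed produce identical arrays
sample_a = stats.poisson.rvs(lam, size=365, random_state=SEED)
sample_b = stats.poisson.rvs(lam, size=365, random_state=SEED)
print((sample_a == sample_b).all())
```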

6. Let's take a look at our new dataset. Print the first 20 values in this dataset.

In [8]:
## Task 6:
#print(year_defects.head(20)) # This did not work since this is not a dataframe
# year_defects is an array
print(year_defects[0:20])

[ 9  7  8  9  8  6 10  4  7  5 10 11  9  3  3  3 11 11  6  3]


7. If we expect 7 defects on a given day, what is the total number of defects we would expect over 365 days?

   Calculate and print this value to the output terminal.

In [9]:
## Task 7:
# Expected value or average value is lambda
print(lam * 365)

2555


8. Calculate and print the total sum of the dataset `year_defects`. How does this compare to the total number of defects we expected over 365 days?

In [12]:
## Task 8:
# Comparing theoretical value (lam * 365) to the simulation (sum of year_defects)
print(year_defects.sum())
print(sum(year_defects))
# Both methods for sum work.

2531
2531
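
To judge whether 2531 is "close" to 2555, it helps to know how much the yearly total typically varies. Assuming the days are independent, the sum of 365 Poisson(7) values is itself Poisson with rate 7 * 365 = 2555, and a Poisson variable's variance equals its mean. The sketch below uses the total printed above, which belongs to this particular run only.

```python
import math

lam = 7
days = 365
expected_total = lam * days            # 2555
observed_total = 2531                  # value printed above for this particular run

# For a Poisson variable the variance equals the mean
sd_total = math.sqrt(expected_total)
z = (observed_total - expected_total) / sd_total
print(round(sd_total, 1), round(z, 2))
```

A total within about one standard deviation of 2555 is entirely ordinary for a simulated year.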


9. Calculate and print the average number of defects per day from our simulated dataset.

   How does this compare to the expected average number of defects each day that we know from the given rate parameter of the Poisson distribution?

In [13]:
## Task 9:
# Comparing theoretical value (lam) to the simulation (average of year_defects)
print(year_defects.mean())
print(np.mean(year_defects))
# Both methods for mean work.

6.934246575342466
6.934246575342466


10. You're worried about what the highest number of defects in a single day might be because that would be a hectic day.

    Print the maximum value of `year_defects`.

In [15]:
## Task 10:
print(year_defects.max())
print(max(year_defects))

15
15


11. Wow, it would probably be super busy if there were that many defects on a single day. Hopefully, it is a rare event!

    Calculate and print the probability of observing that maximum value or more from the Poisson(7) distribution.

In [17]:
## Task 11:
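# This is not asked: Observing EXACTLY 15 defects (year_defects.max()): print(stats.poisson.pmf(year_defects.max(), lam))
# Note: 1 - cdf(k) is P(X > k), so the line below gives the probability of MORE than the maximum (16 or more here).
# For the maximum OR MORE (15 or more), use 1 - stats.poisson.cdf(year_defects.max() - 1, lam).
print(1 - stats.poisson.cdf(year_defects.max(), lam))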


0.0024065803473980463
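
Because the Poisson distribution is discrete, "the maximum or more" and "more than the maximum" are different events, and the choice of CDF argument decides which one you get. The sketch below computes both with a hypothetical `poisson_cdf` helper from the standard library, using 15 as the maximum from this particular simulated year.

```python
import math

def poisson_cdf(k, lam):
    # Hypothetical helper: P(X <= k) as a running sum of the pmf
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

lam = 7
max_defects = 15  # maximum observed in this particular simulated year

p_at_least_max = 1 - poisson_cdf(max_defects - 1, lam)  # P(X >= 15)
p_more_than_max = 1 - poisson_cdf(max_defects, lam)     # P(X > 15) = P(X >= 16)
print(p_at_least_max, p_more_than_max)
```

The two results differ by exactly P(X = 15), so the first is always the larger of the two.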


## Extra

12. Congratulations! You have now explored the Poisson distribution and even worked with some simulated data. We have a couple of extra tasks if you would like an extra challenge. Feel free to try them out or move on to the next topic!

    Let's say we want to know how many defects in a given day would put us in the 90th percentile of the Poisson(7) distribution. One way we could calculate this is by using the following method:
    
    ```py
    stats.poisson.ppf(percentile, lambda)
    ```
    
    `percentile` is equal to the desired percentile (a decimal between 0 and 1), and `lambda` is the lambda parameter of the Poisson distribution. This function is essentially the inverse of the CDF.
    
    Use this method to calculate and print the number of defects that would put us in the 90th percentile for a given day. In other words, on 90% of days, we will observe fewer defects than this number.

In [18]:
## Task 12:
# Average number of defects is 7
# Maximum number of defects is 15
# Number of defects on 90% of the days = ?
print(stats.poisson.ppf(0.9, lam))
# For a discrete distribution, ppf returns the smallest k with P(X <= k) >= 0.9,
# so on at least 90% of days we will observe this number of defects or fewer.
# The result is not a percentage. It is the number of defects.

10.0
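
For a discrete distribution, the inverse of the CDF cannot hit 0.9 exactly, so the percentile is the smallest count whose cumulative probability reaches at least 0.9. The sketch below searches for that count with hypothetical `poisson_cdf` and `poisson_ppf` helpers written with the standard library; it illustrates the idea and is not SciPy's implementation.

```python
import math

def poisson_cdf(k, lam):
    # Hypothetical helper: P(X <= k) as a running sum of the pmf
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def poisson_ppf(q, lam):
    # Hypothetical helper: smallest k with P(X <= k) >= q
    k = 0
    while poisson_cdf(k, lam) < q:
        k += 1
    return k

lam = 7
k90 = poisson_ppf(0.9, lam)
# The CDF steps over 0.9 between k90 - 1 and k90
print(k90, poisson_cdf(k90 - 1, lam), poisson_cdf(k90, lam))
```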


13. Now let's see what proportion of our simulated dataset `year_defects` is greater than or equal to the number we calculated in the previous step.

    By definition of a percentile, we would expect 1 - .90, or about 10% of days to be in this range.
    
    To calculate this:
    
     1) Count the number of values in the dataset that are greater than or equal to the 90th percentile value.
     
     2) Divide this number by the length of the dataset.

In [39]:
## Task 13:
# The task statement ("on 90% of days, we will observe fewer defects than this number") suggests that
# 90% of the time defects are < 10, and 10% of the time defects are >= 10, so a result of about 10% would be expected.
# Caveat: for a discrete distribution, ppf(0.9) is the smallest k with P(X <= k) >= 0.9, so the 90% covers <= 10, not < 10.


# year_defects_count = year_defects.count()
# This didn't work:
# AttributeError: 'numpy.ndarray' object has no attribute 'count'

# year_defects_count = np.count(year_defects)
# This didn't work:
# AttributeError: module 'numpy' has no attribute 'count'

# np.count_nonzero(year_defects)
# This returns 364 because 0 is not counted.

# np.count_nonzero(year_defects >= 10) # This works.
# This returns 59. Since we are using randomly generated values, this number may change if the notebook is executed again.
# len(year_defects) is 365
print((np.count_nonzero(year_defects >= 10)) / len(year_defects))
#print(59/365)
# Result is 0.1616, well above the roughly 10% that the task statement suggests.

# Why: ppf(0.9, lam) = 10 is the smallest k with P(X <= k) >= 0.9, so about 90% of days have 10 OR FEWER defects.
# P(X >= 10) = 1 - cdf(9, lam) = 0.1695 (the "bad day" probability from Task 4), which is consistent with 0.1616.
# The "about 10%" applies to days with MORE than 10 defects, so try > 10 instead of >= 10:
print((np.count_nonzero(year_defects > 10)) / len(year_defects))
# The result is 0.0904, close to 10% as expected.

0.16164383561643836
0.09041095890410959
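
The two simulated proportions can be compared with their theoretical counterparts. The sketch below computes P(X >= 10) and P(X > 10) with a hypothetical `poisson_cdf` helper, plus a rough standard error for a proportion estimated from 365 days. The standard error uses the binomial formula and assumes the days are independent.

```python
import math

def poisson_cdf(k, lam):
    # Hypothetical helper: P(X <= k) as a running sum of the pmf
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

lam, days = 7, 365
p_ge_10 = 1 - poisson_cdf(9, lam)   # theoretical share of days with 10 or more defects
p_gt_10 = 1 - poisson_cdf(10, lam)  # theoretical share of days with more than 10 defects

# Rough sampling noise for a proportion estimated from 365 independent days
se_ge_10 = math.sqrt(p_ge_10 * (1 - p_ge_10) / days)
print(p_ge_10, p_gt_10, se_ge_10)
```

The simulated 0.1616 sits well within two standard errors of the theoretical value for `>= 10`, so the gap from 10% comes from the choice of inequality, not from an unlucky sample.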
