# 1. Probability Limit Definition

According to the probability limit definition:

$$
P(A) = \lim_{n \to +\infty} \frac{n_A}{n}
$$

We want to check this relationship empirically.
The experiment is only an intuitive illustration and does not prove anything rigorously;
its purpose is to run statistical experiments in the R language and to get acquainted with this language.

Consider this example:
We flip a coin and roll a dice and want to calculate the probability that the coin shows heads and the dice shows an even number. 

We know that each dice has 6 faces and the coin has two possible outcomes of heads or tails. Based on the multiplication principle, the number of possible outcomes for flipping a coin and rolling a dice is equal to $6 \times 2 = 12$.
The set of our desired outcomes will be:
$$\{(h,2),(h,4),(h,6)\}$$
Assuming the dice and the coin are fair and do not influence each other, every dice face is equally likely, and so is each side of the coin. Therefore, each of the 12 possible pairs is equally likely as well. Hence, we can use the classical definition of probability and conclude that the probability of observing the desired outcome is $\frac{3}{12} = \frac{1}{4}$.
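
The counting argument above can be checked by enumerating the sample space directly. The following is a minimal sketch; the names `outcomes` and `favorable` are illustrative and are not used elsewhere in this notebook.

```r
# Enumerate all equally likely (coin, dice) outcomes.
outcomes <- expand.grid(
    coin = c("h", "t"),
    dice = 1:6,
    stringsAsFactors = FALSE
)

# The event of interest: heads together with an even roll.
favorable <- outcomes$coin == "h" & outcomes$dice %% 2 == 0

n_total <- nrow(outcomes)       # size of the sample space
n_favorable <- sum(favorable)   # number of desired outcomes
classical_probability <- n_favorable / n_total
print(classical_probability)    # 0.25
```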

Now we try to estimate this probability using the probability limit definition. First we need to prepare the sample space for this example:

In [57]:
dice_possible_observations <- seq(1, 6)
dice_observation_chance_to_appear <- rep(1/6, 6)

In [58]:
coin_possible_observations <- seq(1, 2)
coin_observation_chance_to_appear <- rep(1/2, 2)

<a href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/seq">Read more about `seq`</a>  
<a href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rep">Read more about `rep`</a>

<h2> Sampling </h2> <div> Sampling is the process of drawing a subset from a statistical population, usually in order to identify or estimate the parameters of that population. To perform sampling in R, we use the <code>sample</code> function.<br> For further reading, refer to <a href="https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sample">this</a> link. </div>
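
As a quick illustration of the `sample` arguments used in the cells that follow, here is a small sketch. The seed and the biased-coin weights are arbitrary assumptions chosen only for the example.

```r
set.seed(1)  # arbitrary seed, only so that the draw is reproducible

# Ten fair dice rolls: sampling with replacement from 1..6.
rolls <- sample(x = 1:6, size = 10, replace = TRUE)
print(rolls)

# A biased coin (hypothetical weights) through the prob argument.
flips <- sample(x = c("h", "t"), size = 10, replace = TRUE, prob = c(0.7, 0.3))
print(table(flips))
```

Without `replace = TRUE`, `sample` draws without replacement, so a request for more than six dice values would fail.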

In [59]:
sample_of_n_dice <- function(n) {
    return(
        sample(
            x = dice_possible_observations,
            size = n,
            replace = TRUE,
            prob = dice_observation_chance_to_appear
        )
    )
}

# 1 means the coin shows the head side and 
# 2 means it shows the tail side.
sample_of_n_coin <- function(n) {
    return(
        sample(
            x = coin_possible_observations,
            size = n,
            replace = TRUE,
            prob = coin_observation_chance_to_appear
        )
    )
}


In [60]:
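# For n = 1 this returns c(coin, dice). For n > 1 the n coin results
# come first, followed by the n dice results.
flip_coin_roll_dice <- function(n) {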
    return(
        c(
            sample_of_n_coin(n),
            sample_of_n_dice(n)
        )
    )
}


In the next two blocks, we simulate the experiment process.

In [61]:
rownames <- seq(1, 6)
colnames <- seq(1, 2)

observations1000 <- matrix(
    0, # the data elements
    nrow = 6,
    ncol = 2,
    byrow = TRUE,
    dimnames = list(rownames, colnames)
)

observations100 <- matrix(
    0, # the data elements
    nrow = 6,
    ncol = 2,
    byrow = TRUE,
    dimnames = list(rownames, colnames)
)

[Read more about matrices](https://www.r-tutor.com/r-introduction/matrix)

In [62]:
# At the beginning of each experiment, reset the counts in both matrices.
observations1000[, ] <- 0
observations100[, ] <- 0

for (i in 1:1000) {
  obs <- flip_coin_roll_dice(1)
  observations1000[obs[2], obs[1]] <- observations1000[obs[2], obs[1]] + 1
}


for (i in 1:100) {
  obs <- flip_coin_roll_dice(1)
  observations100[obs[2], obs[1]] <- observations100[obs[2], obs[1]] + 1
}

print(observations1000)
cat("------------------", sep="\n")
print(observations100)


   1  2
1 81 91
2 83 92
3 84 88
4 94 74
5 67 86
6 80 80
------------------
   1  2
1  7  4
2  8 12
3 12  3
4 15  8
5  8 12
6  7  4


In [63]:
# Estimate the probability of an even roll (rows 2, 4, 6) together with
# a head (column 1) as a relative frequency in each observation matrix:
result <- (observations100[2, 1] + observations100[4, 1] + observations100[6, 1]) / sum(observations100)
print(result)
result <- (observations1000[2, 1] + observations1000[4, 1] + observations1000[6, 1]) / sum(observations1000)
print(result)


[1] 0.3
[1] 0.257


<div> Run the main loop of the experiment with 100 and with 100,000 iterations and analyze the observations. In addition to whatever else you consider relevant, your analysis must address the following: <br> <li> Given the theoretical probability calculated at the beginning of this section, compare the accuracy of the experiment with a sample size of 100 against a sample size of 100,000, and explain the reason for the difference.</li> 
Answer: The estimate from the larger sample is expected to be closer to the theoretical value of 0.25 (in the run shown above, 0.257 with 1,000 iterations versus 0.3 with 100). The population itself does not change; what shrinks is the variance of the relative frequency $n_A/n$, which equals $p(1-p)/n$ and tends to zero as $n$ grows (the law of large numbers).
</div>
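
The effect of the sample size can be quantified with the standard deviation of the relative frequency, $\sqrt{p(1-p)/n}$. The sketch below tabulates it for three sample sizes and draws one simulated estimate per size; the seed and the variable names are assumptions made for illustration.

```r
p <- 1 / 4                  # theoretical probability from the start of this section
n <- c(100, 1000, 100000)   # sample sizes to compare

# Standard deviation of the relative frequency n_A / n for each n.
standard_error <- sqrt(p * (1 - p) / n)
print(data.frame(n = n, standard_error = standard_error))

# One simulated estimate per sample size; rbinom draws the count n_A directly.
set.seed(42)  # arbitrary seed for reproducibility
estimate <- rbinom(length(n), size = n, prob = p) / n
print(data.frame(n = n, estimate = estimate, abs_error = abs(estimate - p)))
```

Because the standard deviation falls like $1/\sqrt{n}$, going from 100 to 100,000 samples reduces the typical error by a factor of about 32.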

<font color='yellow' style='background-color: blue'>
Note) The code in this section uses loops. Be aware that explicit loops are generally undesirable in R and should be avoided where possible. This point is ignored here for the sake of getting acquainted with the language, but from the next section on, loops are avoided as much as possible.

The reason is that statistical calculations in R (and, with suitable libraries, in Python) can be expressed as vectorized operations on whole vectors, matrices, and data frames, which run in optimized compiled code. An explicit loop gives up this advantage, and runtimes can become very long. The alternatives you will become familiar with later are data frames, matrices, and the operators specific to them. Loops should only be used when the simulation is truly time dependent, that is, when each step requires the result of the previous step. In the next exercise section, you are not allowed to use loops and must use a function for iterative calculations.
</font>
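
As a sketch of what a loop-free version of the coin and dice experiment might look like, the code below draws all samples at once and counts with `table`. It reuses the encoding 1 = head, 2 = tail; the seed and the variable names are assumptions for illustration, not part of the original exercise.

```r
set.seed(2024)  # arbitrary seed for reproducibility
n <- 100000

# Draw all n coin flips and dice rolls in one call each (1 = head, 2 = tail).
coin <- sample(1:2, size = n, replace = TRUE)
dice <- sample(1:6, size = n, replace = TRUE)

# 6 x 2 table of counts, with dice faces as rows and coin sides as columns.
counts <- table(dice = dice, coin = coin)
print(counts)

# Relative frequency of "head and even roll".
estimate <- mean(coin == 1 & dice %% 2 == 0)
print(estimate)
```

The comparison `coin == 1 & dice %% 2 == 0` is evaluated on the whole vectors at once, which is the vectorized style the note refers to.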

# 2. Birthday Problem

As discussed in class, the birthday problem asks for the probability that in a set of n people selected randomly, at least two people share the same birthday. 

Interestingly, and contrary to intuition, the probability of a shared birthday in a group of 23 people is over 50%!

Below we calculate this probability using R.

First, solve this problem theoretically for k people. (What is the probability that at least two people out of k have the same birthday?) There is no need to write a proof here.
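
For reference, under the usual assumptions (365 equally likely birthdays, independence between people, no leap years, and $k \le 365$), the complement rule gives the probability that at least two of $k$ people share a birthday:

$$
P(\text{at least one shared birthday}) = 1 - \frac{365 \times 364 \times \cdots \times (365 - k + 1)}{365^k} = 1 - \prod_{i=0}^{k-1} \frac{365 - i}{365}
$$

The numerator counts the ways to give $k$ people distinct birthdays, and the denominator counts all possible birthday assignments.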

The calculations are done in R in the cell below.

In [64]:
k <- 23
print(1 - prod((365 - k + 1):365) / 365^k)


[1] 0.5072972


As you can see, for 23 people the probability that at least two of them share a birthday is over 50%.
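
The same formula can be evaluated for a whole range of group sizes at once. The sketch below uses a running product of the factors $(365 - i)/365$, which also avoids the very large intermediate values of the expression above (for k above roughly 120, `365^k` exceeds the double-precision range). The names `p_no_match`, `p_match`, and `smallest_k` are illustrative.

```r
k <- 1:60

# P(no shared birthday among k people) as a running product of (365 - i) / 365.
p_no_match <- cumprod((365 - (k - 1)) / 365)
p_match <- 1 - p_no_match

# Smallest group size whose probability exceeds one half.
smallest_k <- min(k[p_match > 0.5])
print(smallest_k)     # 23
print(p_match[23])    # matches the value printed above, about 0.5073
```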

R provides helper functions for well-known problems like this one.
Look up the `pbirthday` and `qbirthday` functions and explain what each one does.

Using the above two functions, find the probability that among 23 people, at least 3 people have the same birthday.

In [68]:
pbirthday(23, coincident = 3)
pbirthday(23, coincident = 2)

Now using the above functions, find the number of people required so that the probability of at least 4 people having the same birthday is greater than 0.8.

In [66]:
qbirthday(coincident = 4, prob = 0.8)

[Read more about pbirthday](https://www.rdocumentation.org/packages/TeachingDemos/versions/2.10/topics/pbirthday)  
[Read more about qbirthday](https://www.rdocumentation.org/packages/TeachingDemos/versions/2.10/topics/qbirthday) 
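
As a brief orientation, both functions ship with the `stats` package, which is loaded by default. `pbirthday(n, classes = 365, coincident = 2)` returns the probability of a coincidence among `n` people, and `qbirthday(prob = 0.5, classes = 365, coincident = 2)` returns the smallest group size that reaches the given probability. According to the R documentation, the formula is approximate for `coincident > 2`. The sketch below checks both functions against the exact calculation from earlier in this section; `exact_23` and `n_needed` are illustrative names.

```r
# Exact probability for 23 people, as computed earlier in this section.
exact_23 <- 1 - prod((365 - 23 + 1):365) / 365^23

print(pbirthday(23))          # agrees with exact_23 for coincident = 2
print(exact_23)
print(qbirthday(prob = 0.5))  # 23

# qbirthday inverts pbirthday: the returned n reaches the target, n - 1 does not.
n_needed <- qbirthday(prob = 0.8, coincident = 4)
print(pbirthday(n_needed, coincident = 4) >= 0.8)
print(pbirthday(n_needed - 1, coincident = 4) < 0.8)
```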

For better understanding of this problem, sampling can be used.
You got acquainted with the sample function in the previous section.
Write a code snippet that generates a sample of 23 birthdays in 365 days of the year.

Now repeat this experiment 10,000 times and obtain the probability that at least 2 people have a common birthday.

Note that you cannot use for loops in this section!

Hint: You can use tabulate to find the number of identical elements in a set.

In [67]:
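n <- 23    # people per group
N <- 10^4  # number of simulated groups
# For each group, draw n birthdays and record the largest number of people sharing one day.
r <- replicate(N, max(tabulate(sample(1:365, n, replace = TRUE))))
# Fraction of groups in which at least two people share a birthday.
sum(r >= 2) / N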
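
The simulated vector of maximum coincidences can be compared with the values from `pbirthday`. The following sketch repeats the simulation with a fixed seed (an arbitrary choice) and adds the Monte Carlo standard error of the estimate; `estimate_2`, `estimate_3`, and `se_2` are illustrative names.

```r
set.seed(7)  # arbitrary seed for reproducibility
n <- 23
N <- 10^4

# Largest number of people sharing one birthday in each simulated group.
r <- replicate(N, max(tabulate(sample(1:365, n, replace = TRUE), nbins = 365)))

estimate_2 <- mean(r >= 2)   # at least two people share a birthday
estimate_3 <- mean(r >= 3)   # at least three people share a birthday

# Monte Carlo standard error of the first estimate.
se_2 <- sqrt(estimate_2 * (1 - estimate_2) / N)

print(c(simulated = estimate_2, exact = pbirthday(n), std_error = se_2))
print(c(simulated = estimate_3, approximate = pbirthday(n, coincident = 3)))
```

With 10,000 repetitions the standard error is about 0.005, so the simulated value should typically agree with the exact probability to roughly two decimal places.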
