- [Basic knowledge of probability](#basic-knowledge-of-probability)
    - [Events, Sample Space, and Random Variables](#events-sample-space-and-random-variables)

## Basic knowledge of probability

### Events, Sample Space, and Random Variables


- The sample space is the set of all possible outcomes of an experiment. 
$$\Omega = \left\{\omega_{1}, \omega_{2}, \ldots ,\omega_{n}\right\}$$

- An event is a subset of the sample space. 
$$A \subseteq \Omega$$

- Two events are called mutually exclusive if their intersection is the empty set. $$A \cap B = \emptyset$$

- A set of events $A_1, \ldots, A_k$ is said to be exhaustive if its union is equal to the sample space. $$A_1 \cup A_2 \cup \ldots \cup A_k = \Omega$$

- A random variable is a real-valued function defined on the sample space: $\xi : \Omega \rightarrow \mathbb{R}$\
    Example:\
    We toss a coin: 
    + if it lands heads, we win 1 ruble
    + if it lands tails, we lose 1 ruble\
    In this situation the sample space is $\Omega = \{\text{Head}, \text{Tail}\}$.\
    The random variable takes the values $\xi(\omega_{1}) = \xi(\text{Head}) = 1$ and $\xi(\omega_{2}) = \xi(\text{Tail}) = -1$.\
    For a fair coin, the probability that $\xi$ equals $1$ is $P\{\xi = 1 \} = \frac{1}{2}$.\
    Likewise, the probability that $\xi$ equals $-1$ is $P\{\xi = -1 \} = \frac{1}{2}$.\
    \
    A random variable can be discrete or continuous.
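
The coin example can be written out in R as a minimal sketch. The names `omega`, `p`, and `xi` are illustrative choices, and the coin is assumed to be fair.

```r
# Sample space of one coin toss
omega <- c("Head", "Tail")

# Two events, each a subset of the sample space
A <- c("Head")
B <- c("Tail")

# Mutually exclusive: the intersection is empty
mutually_exclusive <- length(intersect(A, B)) == 0

# Exhaustive: the union equals the sample space
exhaustive <- setequal(union(A, B), omega)

# Probability of each outcome (assumption: fair coin)
p <- c(Head = 0.5, Tail = 0.5)

# Random variable xi: payoff in rubles for each outcome
xi <- c(Head = 1, Tail = -1)

# P{xi = v}: add up the probabilities of all outcomes that xi maps to v
p_win  <- sum(p[xi == 1])
p_lose <- sum(p[xi == -1])

cat("Mutually exclusive:", mutually_exclusive, "\n")
cat("Exhaustive:", exhaustive, "\n")
cat("P{xi = 1} =", p_win, " P{xi = -1} =", p_lose, "\n")
```

Storing `xi` as a named vector makes the "function on the sample space" reading explicit: each outcome name is looked up and mapped to a real number.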


    

### Probability, Conditional Probability and Independence

- Definition of probability\
    Consider a sample space $\Omega$ and let $A$ be an event in $\Omega$. The PROBABILITY is a function $P$ that assigns to each event $A$ a number $P(A)$ and satisfies the following three axioms:
    + $ 0 \leq P(A) \leq 1$;
    + $ P(\Omega) = 1$;
    + if $ A_1, A_2, \ldots, A_n $ are mutually exclusive, then $P(A_1 \cup A_2 \cup \ldots \cup A_n) = \sum_{i=1}^{n} P(A_i)$.

- Definition of conditional probability\
    The conditional probability of A given B is the probability of the event A given that event B is known to have occurred.
    $$
    P(A|B) = \frac{P(A \cap B)}{P(B)},\: P(B) > 0
    $$
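
As a small worked illustration of the definition, the sketch below uses one roll of a fair six-sided die. The die, the events `A` and `B`, and the helper names `prob` and `cond_prob` are assumptions made for the example, not part of the notes.

```r
# Sample space: one roll of a fair six-sided die (illustrative assumption)
omega <- 1:6

# Classical probability for equally likely outcomes
prob <- function(event) length(event) / length(omega)

A <- c(2, 4, 6)   # the roll is even
B <- c(4, 5, 6)   # the roll is greater than 3

# P(A | B) = P(A and B) / P(B), defined only when P(B) > 0
cond_prob <- function(a, b) {
  stopifnot(prob(b) > 0)
  prob(intersect(a, b)) / prob(b)
}

p_A_given_B <- cond_prob(A, B)
cat("P(A) =", prob(A), " P(A|B) =", p_A_given_B, "\n")
```

Knowing that the roll is greater than 3 raises the probability of an even roll from 1/2 to 2/3, because two of the three remaining outcomes are even.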


- Definition of independent events\
    Events A and B are independent if the occurrence of one of them does not affect the probability of the other. In that case
    $$
    P(A \cap B) = P(A)P(B), \quad \text{or equivalently} \quad P(A|B) = P(A) \: \text{ when } \: P(B) > 0
    $$
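
The product rule gives a direct numerical check for independence. The sketch below again assumes a fair six-sided die; the events `A`, `C`, `D` and the helper `is_independent` are hypothetical names chosen for illustration.

```r
# Sample space: one roll of a fair six-sided die (illustrative assumption)
omega <- 1:6
prob <- function(event) length(event) / length(omega)

A <- c(2, 4, 6)      # the roll is even
C <- c(1, 2, 3, 4)   # the roll is at most 4
D <- c(4, 5, 6)      # the roll is greater than 3

# Independent when P(A and B) equals P(A) * P(B);
# all.equal() compares with a small numerical tolerance
is_independent <- function(a, b) {
  isTRUE(all.equal(prob(intersect(a, b)), prob(a) * prob(b)))
}

indep_AC <- is_independent(A, C)
indep_AD <- is_independent(A, D)
cat("A and C independent:", indep_AC, "\n")
cat("A and D independent:", indep_AD, "\n")
```

Here `A` and `C` satisfy the product rule (1/3 on both sides), while `A` and `D` do not (1/3 against 1/4).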

- The law of total probability
    Let $B_1, B_2, \ldots, B_n$ be a sequence of mutually exclusive and exhaustive events in $\Omega$, and let $A$ be another event in $\Omega$. Then 
    $$
    P(A) = \sum_{i=1}^{n}P(A|B_i)P(B_i)
    $$


- Bayes' formula\
    Under the same assumptions, for each $i = 1, \ldots, n$ and $P(A) > 0$:
    $$
    P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{n}P(A|B_j)P(B_j)}
    $$
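
The law of total probability and Bayes' formula can be tried on a small numerical case. All numbers below are hypothetical: three machines $B_1, B_2, B_3$ produce 50%, 30%, and 20% of all items, and $A$ is the event that an item is defective.

```r
# Hypothetical priors P(B_i): share of items produced by each machine
prior <- c(B1 = 0.5, B2 = 0.3, B3 = 0.2)

# Hypothetical P(A | B_i): defect probability for each machine
likelihood <- c(B1 = 0.01, B2 = 0.02, B3 = 0.03)

# The B_i must be exhaustive and mutually exclusive, so the priors sum to 1
stopifnot(isTRUE(all.equal(sum(prior), 1)))

# Law of total probability: P(A) = sum over i of P(A | B_i) * P(B_i)
p_A <- sum(likelihood * prior)

# Bayes' formula: P(B_i | A) = P(A | B_i) * P(B_i) / P(A)
posterior <- likelihood * prior / p_A

cat("P(A) =", p_A, "\n")
print(round(posterior, 4))
```

The posterior probabilities sum to 1, since the denominator in Bayes' formula is exactly the total probability of $A$.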


### Probability and Distribution Functions

![image.png](attachment:image.png)

## Statistical Sampling

### Some definitions for introduction

Classification of data sets:
1. By the number of variables for each observation
    - One-dimensional
    - Two-dimensional
    - Multidimensional
2. By measurement type for each observation
    - Quantitative variables (numbers), measured on an absolute scale (example: temperature in kelvins):
        - Discrete
        - Continuous
    - Categorical (qualitative) variables:
        - Ordinal: measured on an ordinal scale, so the data can be ranked (example: 1st, 2nd, etc. grades in school).
        - Nominal: measured on a nominal scale, so the data cannot be ranked (example: products coded as 1 - ice cream, 2 - juice; the codes are only labels).
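
One possible way to represent these measurement types in R is sketched below. The variable names and values are made up for illustration; the only point is the distinction between numeric vectors, ordered factors, and unordered factors.

```r
# Quantitative, continuous: temperature in kelvins
temperature_K <- c(273.15, 293.15, 310.15)

# Quantitative, discrete: a count (illustrative example)
children <- c(0L, 2L, 1L)

# Categorical, ordinal: an ordered factor, so ranking is meaningful
grade <- factor(c("2nd", "1st", "3rd"),
                levels = c("1st", "2nd", "3rd"), ordered = TRUE)

# Categorical, nominal: an unordered factor, so the levels are only labels
product <- factor(c("ice cream", "juice", "juice"))

# Ordered factors support comparison and sorting by level order
grade_cmp    <- grade[1] > grade[2]
sorted_grade <- sort(grade)

cat("Is grade ordered?  ", is.ordered(grade), "\n")
cat("Is product ordered?", is.ordered(product), "\n")
print(sorted_grade)
```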

### The Sample and the Population


- The population is the set of all objects that need to be studied.
- The sample is a subset of elements selected at random from the population. The sample size ($n$) is the number of elements in it.

- Let $X = \{x_1, \ldots, x_n\}$ be a sample of size $n$ obtained during some observation.
    - A **variation series** is a sample with all elements arranged in ascending order, for example: (5, 5, 5, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 11, 12, 12, 12, 13)
    

In [1]:
# Read the data set; 'slb.txt' is expected to have a header row
slb <- read.table('slb.txt', header = TRUE)
# Sort the Price column in ascending order to obtain its variation series
sort(slb$Price)

### Cases when the random variable under study is discrete/continuous

#### For a discrete variable

**Grouped frequency distribution**  is a representation of a sample of a random variable when there is reason to believe that the random variable is discrete.

![image.png](attachment:image.png)

where:
- $x_i^*$ - unique values of the sample, arranged in ascending order: $x_i^* < x_j^*, \quad \forall i < j, \quad i, j = 1, \ldots, q$
- $k_i$ - frequency: the number of repetitions of value $x_i^*$ in the sample
- $p_i^*$ - relative frequency: $\tilde{p_i^*} = \frac{k_i}{n}, \quad \sum_{i=1}^{q} \tilde{p_i^*} = 1$
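
A grouped frequency distribution can be built in R with `table()`. The sketch below uses the variation series quoted earlier in these notes (5, 5, 5, 7, ...); the column names of the resulting data frame are illustrative.

```r
# Sample taken from the variation series example in the notes
x <- c(5, 5, 5, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 11, 12, 12, 12, 13)
n <- length(x)

# table() counts how often each unique value occurs (frequencies k_i)
k <- table(x)

# Assemble the grouped frequency distribution
grouped <- data.frame(
  x_star = as.numeric(names(k)),   # unique values in ascending order
  k      = as.integer(k),          # frequencies
  p      = as.numeric(k) / n       # relative frequencies k_i / n
)

print(grouped)
cat("Sum of relative frequencies:", sum(grouped$p), "\n")
```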

An example:

<img src="pictures\image7.png" alt="image-2" width="500px">

**The empirical distribution function** $F_n^*(x)$ is an estimate of the theoretical distribution function of the studied random variable; it is constructed from the grouped frequency distribution of the sample.

<img src="pictures\image8.png" alt="image-2" width="500px">


The graph of the empirical distribution function is called a **cumulative curve** or **the cumulant graph**.\
For a discrete random variable, it appears as a **piecewise constant function**.
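
In R, a step-shaped empirical distribution function is available through `ecdf()`. The sketch below applies it to the variation series example from these notes. Note that `ecdf()` uses the convention $F_n(x) = \frac{1}{n}\#\{x_i \leq x\}$, which may differ at the jump points from the definition shown in the figure above if that definition uses a strict inequality.

```r
# Sample taken from the variation series example in the notes
x <- c(5, 5, 5, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 11, 12, 12, 12, 13)

# ecdf() returns a function: Fn(t) = share of sample values <= t
Fn <- ecdf(x)

# Evaluate below the smallest value, at an inner value, and at the largest value
values <- Fn(c(4, 8, 13))
print(values)

# The jump points of the step function are the unique sample values
print(knots(Fn))

# Uncomment to draw the piecewise constant graph:
# plot(Fn, verticals = FALSE, main = "Empirical distribution function")
```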

An example:

<img src="pictures\image9.png" alt="image-2" width="500px">

#### For a continuous variable

**Interval frequency distribution** is a representation of a sample of a random variable when there is reason to believe that the random variable is continuous.

![image.png](attachment:image.png)\
\
where:
- $J_i, i = 1, \ldots , r$ - partition intervals
- $l_i$ - frequency: the number of sample elements that fall into the interval $J_i$
- $p_i^*$ - relative frequency: $\tilde{p_i^*} = \frac{l_i}{n}, \quad \sum_{i=1}^{r} \tilde{p_i^*} = 1$

The number of intervals $r$ is usually determined by Sturges' formula: $r = 1 + \log_2(n)$, always rounded up to the nearest integer.
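
A minimal sketch of Sturges' formula and the resulting interval frequency distribution follows. The data are the 25 `EffRate` values printed in the output later in these notes; using equal-width intervals between the sample minimum and maximum is an assumption of this sketch.

```r
# Effective rates copied from the EffRate output shown later in the notes
efr <- c(4.59, 3.4, 5.7, 5, 4.7, 4.59, 4.5, 4.5, 4.3, 4.3, 4.5, 4.2, 4.2,
         4.4, 4.2, 4.05, 4, 3.97, 3.66, 3.65, 3.7, 3.4, 3.4, 3.8, 2.85)
n <- length(efr)

# Sturges' formula, rounded up
r <- ceiling(1 + log2(n))

# r equal-width intervals from the minimum to the maximum (assumption)
breaks <- seq(min(efr), max(efr), length.out = r + 1)

# Assign each value to an interval; include.lowest keeps the minimum
intervals <- cut(efr, breaks = breaks, include.lowest = TRUE)

# Frequencies l_i and relative frequencies l_i / n
l <- table(intervals)
p <- l / n

cat("n =", n, " r =", r, "\n")
print(cbind(frequency = l, relative = round(as.numeric(p), 3)))
```

R also ships `nclass.Sturges()` in the grDevices package, which applies the same rule to a data vector.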

An example:

<img src="pictures\image1.png" alt="image-2" width="500px">

**The empirical distribution function** $F_n^*(x)$ is an estimate of the theoretical distribution function of the studied random variable; it is constructed from the interval frequency distribution of the sample.


<img src="pictures\image2.png" alt="image-2" width="500px">

The graph of the empirical distribution function is called a **cumulative curve** or **the cumulant graph**.\
For a continuous random variable, it appears as a **continuous piecewise linear function**.
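
One way to obtain such a continuous piecewise linear curve is to take the cumulative relative frequencies at the interval endpoints and interpolate linearly between them. The sketch below does this under stated assumptions: the data are the `EffRate` values printed later in these notes, the intervals are equal-width, and the exact construction in the figures may differ.

```r
# Effective rates copied from the EffRate output shown later in the notes
efr <- c(4.59, 3.4, 5.7, 5, 4.7, 4.59, 4.5, 4.5, 4.3, 4.3, 4.5, 4.2, 4.2,
         4.4, 4.2, 4.05, 4, 3.97, 3.66, 3.65, 3.7, 3.4, 3.4, 3.8, 2.85)
n <- length(efr)
r <- ceiling(1 + log2(n))                       # Sturges' formula
breaks <- seq(min(efr), max(efr), length.out = r + 1)

# Relative frequency of each interval
p <- as.numeric(table(cut(efr, breaks = breaks, include.lowest = TRUE))) / n

# Cumulative relative frequency at each endpoint: 0 at the left end, 1 at the right
F_at_breaks <- c(0, cumsum(p))

# Linear interpolation between endpoints; 0 to the left, 1 to the right
F_lin <- approxfun(breaks, F_at_breaks, yleft = 0, yright = 1)

grid <- seq(2, 6.5, by = 0.5)
print(round(cbind(x = grid, F = F_lin(grid)), 3))

# Uncomment to draw the cumulative curve:
# plot(breaks, F_at_breaks, type = "l", xlab = "x", ylab = "F(x)")
```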

An example:

<img src="pictures\image3.png" alt="image-2" width="500px">
<img src="pictures\image4.png" alt="image-2" width="500px">


**The empirical density distribution** $f_n^*(x)$ is an estimate of the theoretical probability density function of the studied random variable; it is constructed from the interval frequency distribution and is defined only for a continuous random variable.
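
A common form of the empirical density is the histogram estimate, which equals $\frac{l_i}{n h_i}$ on interval $J_i$, where $h_i$ is the interval width. The sketch below assumes that form (the figures may define it differently) and uses the `EffRate` values printed later in these notes with equal-width intervals.

```r
# Effective rates copied from the EffRate output shown later in the notes
efr <- c(4.59, 3.4, 5.7, 5, 4.7, 4.59, 4.5, 4.5, 4.3, 4.3, 4.5, 4.2, 4.2,
         4.4, 4.2, 4.05, 4, 3.97, 3.66, 3.65, 3.7, 3.4, 3.4, 3.8, 2.85)
n <- length(efr)
r <- ceiling(1 + log2(n))                       # Sturges' formula
breaks <- seq(min(efr), max(efr), length.out = r + 1)

# hist() with plot = FALSE only computes the counts and densities
h <- hist(efr, breaks = breaks, plot = FALSE)

# Histogram density by hand: count / (n * interval width)
width <- diff(h$breaks)
density_manual <- h$counts / (n * width)

# The area under the empirical density is 1
area <- sum(h$density * width)

print(round(cbind(mid = h$mids, count = h$counts, density = h$density), 3))
cat("Total area:", area, "\n")
```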

<img src="pictures\image5.png" alt="image-2" width="500px">
<img src="pictures\image6.png" alt="image-2" width="500px">



### Numerical characteristics of the sample

1. Position characteristics:
    - The **mode** is the value that appears most frequently in a data set; a sample may have several modes.
    - The sample **mean** is the arithmetic average of the sample values: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$.
    - The **median** is the value located in the middle of the variation series (the median divides the variation series into 2 equal parts).
    - The **quantile** of level $p \in (0, 1)$ is a value below which approximately a fraction $p$ of the ordered data lies. The quartiles ($p = 0.25, 0.5, 0.75$) divide an ordered set of data into 4 equal parts.

2. Scattering characteristics:

3. Characteristics of the relative spread (for samples with non-negative elements):

4. Characteristics of the distribution form:
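
The position characteristics listed above can be computed as in the following sketch, which uses the `EffRate` values printed in the output below. R's `quantile()` defaults to one particular interpolation rule (type 7); other textbooks may define sample quantiles slightly differently.

```r
# Effective rates copied from the EffRate output shown in the notes
efr <- c(4.59, 3.4, 5.7, 5, 4.7, 4.59, 4.5, 4.5, 4.3, 4.3, 4.5, 4.2, 4.2,
         4.4, 4.2, 4.05, 4, 3.97, 3.66, 3.65, 3.7, 3.4, 3.4, 3.8, 2.85)

# Mode(s): every value whose frequency equals the maximum frequency
tab   <- table(efr)
modes <- as.numeric(names(tab)[tab == max(tab)])

# Mean and median
x_bar <- mean(efr)
med   <- median(efr)

# Quartiles: quantiles of level 0.25, 0.5 and 0.75 (R default, type 7)
q <- quantile(efr, probs = c(0.25, 0.5, 0.75))

cat("Mode(s):", modes, "\n")
cat("Mean:", x_bar, " Median:", med, "\n")
print(q)
```

Base R has no built-in function for the statistical mode (`mode()` reports the storage type of an object), which is why the sketch reads it from the frequency table.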


In [14]:
# Read the file using the correct separator and a header row
dataBanki <- read.table('BankiRU2.txt', sep = ';', header = TRUE)
# Show the first few rows of the data
head(dataBanki)

efr <- dataBanki$EffRate
cat("The variable Effective Rate: ","\n", efr, "\n")

cat("Sort in ascending order (make the variation series): ","\n", sort(efr), "\n")
cat("Give median: ", median(efr), "\n")
cat("Give mean: ", mean(efr), "\n")

print(table(efr)) # read the mode(s) off this frequency table (there may be several)


| | No | Bank | EffRate |
|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` |
| 1 | 1 | Тинькофф, СмартВклад (с повышенной ставкой) | 4.59 |
| 2 | 2 | Сбербанк, Сохраняй Онлайн | 3.4 |
| 3 | 3 | Совкомбанк, Щедрая осень с Халвой | 5.7 |
| 4 | 4 | Уралсиб, Хорошая пора Онлайн | 5.0 |
| 5 | 5 | МКБ, МЕГА Онлайн | 4.7 |
| 6 | 6 | Абсолют Банк, Абсолютный доход | 4.59 |


The variable Effective Rate:  
 4.59 3.4 5.7 5 4.7 4.59 4.5 4.5 4.3 4.3 4.5 4.2 4.2 4.4 4.2 4.05 4 3.97 3.66 3.65 3.7 3.4 3.4 3.8 2.85 
Sort in ascending order (make the variation series):  
 2.85 3.4 3.4 3.4 3.65 3.66 3.7 3.8 3.97 4 4.05 4.2 4.2 4.2 4.3 4.3 4.4 4.5 4.5 4.5 4.59 4.59 4.7 5 5.7 
Give median:  4.2 
Give mean:  4.1424 
efr
2.85  3.4 3.65 3.66  3.7  3.8 3.97    4 4.05  4.2  4.3  4.4  4.5 4.59  4.7    5 
   1    3    1    1    1    1    1    1    1    3    2    1    3    2    1    1 
 5.7 
   1 
