<a href="https://colab.research.google.com/github/Sreeshbk/Machine_learning/blob/main/concept/Probability_n_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probability

- `Chance` is the possibility of something happening.
- `Probability` is the likelihood of an event in which we are interested.
- The `Likelihood` of an event is the frequency with which the event may occur.
- The `probabilistic model` is a generic structure that describes the random outcomes of an activity: the activity itself (experiment), its results (outcomes), and the frequency/likelihood with which those outcomes may occur.
- The process of observing an activity is termed an `experiment`.
- The results of an observation are termed the `outcomes` of the experiment.
- Experiments whose outcomes cannot be predicted with certainty are called `random experiments`.
- `Sample Space` is the set of all possible outcomes of a random experiment. It can be of two types:
 - Continuous sample space  
   Example: the value of a company's stock, treated as any real number between 15 and 25 dollars, i.e. the interval [15, 25].
 - Discrete sample space  
   Example: the number of people attending a meeting can be between 10 and 20, i.e. {10, 11, 12, …, 20}.
- If we narrow our focus to one particular outcome, or a set of some outcomes, from the entire sample space, it is termed an `Event`.

| Dependent Events | Independent Events |
|---------------|---------------|
|Occurrence of one event affects the probability of the other event | Occurrence of one event does not affect the probability of the other event|

- `Meta-phrasing` refers to literal translation: word-by-word, line-by-line translation without loss of information.
`Sets` and `Venn diagrams` facilitate the meta-phrasing of a real-world experiment (described as a word problem) into a mathematical model.

 - Sets help translate the problems into mathematical representations.
 - Venn diagrams help represent the Sets graphically.

- A `Set` is a collection of distinct objects/elements. In probability, the elements are outcomes, and the sample space is the set of all possible outcomes.
Set notation is what lets a word problem be written in mathematical form.

 - If a set contains a finite number of elements (x1,x2,x3), it can be represented as S={x1,x2,x3}
   
   Example: Set of all possible outcomes of a die roll S={1,2,3,4,5,6}

 - If a set contains an infinite number of elements (x1,x2,x3,…), it can be represented as S={x1,x2,x3,…}

   Example: Set of all positive integers S={1,2,3,…}. A set that cannot be listed element by element, such as the positive real numbers, is written with a condition instead: S={x : x > 0}

- Intersection: The intersection of two sets A and B (elements in both A and B) is represented as A ∩ B
- Union: The union of two sets A and B (elements in A, in B, or in both) is represented as A ∪ B
- Complement: The complement of Set A (elements of the sample space not in A) is represented as $\bar{A}$
- Difference: The difference of Set A and B (elements in A but not in B) is represented as A − B
- Symmetric Difference: The symmetric difference of Set A and B (elements in exactly one of A and B) is represented as A Δ B
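
The set operations above map directly onto Python's built-in `set` type. The following is a minimal sketch; the events `A` and `B` are illustrative choices on a single die roll, not taken from the original text.

```python
# Sample space and two illustrative events for one roll of a die
S = {1, 2, 3, 4, 5, 6}   # sample space
A = {2, 4, 6}            # event: the outcome is even
B = {4, 5, 6}            # event: the outcome is greater than 3

intersection = A & B     # A ∩ B
union = A | B            # A ∪ B
complement_A = S - A     # complement of A relative to S
difference = A - B       # A − B
sym_difference = A ^ B   # A Δ B

print("A ∩ B =", intersection)
print("A ∪ B =", union)
print("complement of A =", complement_A)
print("A − B =", difference)
print("A Δ B =", sym_difference)
```

Python sets have no built-in notion of a universal set, so the complement is taken explicitly against the sample space `S`.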

## Combination - deals with selection

Combination without replacement

\begin{equation}
 C_n^r = \frac{n!}{(n-r)! \space r!}
\end{equation}

In [None]:
from scipy.special import comb 
import math
n=10
r=4
comb(n, r, exact=True) , math.factorial(n)/(math.factorial(r)*math.factorial(n-r))

(210, 210.0)

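Combination with replacement (choosing r items from n types, where the same type may be chosen more than once)

\begin{equation}
 C_{n+r-1}^r = \frac{(n+r-1)!}{r! \space (n-1)!}
\end{equation}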
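
As a quick check of counting with replacement, the sketch below compares the standard multiset count, C(n+r-1, r), against a brute-force enumeration with `itertools.combinations_with_replacement`. The values `n = 10` and `r = 4` simply reuse the numbers from the earlier cell.

```python
import math
from itertools import combinations_with_replacement

n, r = 10, 4

# Closed form: number of multisets of size r drawn from n types
by_formula = math.comb(n + r - 1, r)

# Brute force: enumerate every unordered selection with replacement
by_enumeration = sum(1 for _ in combinations_with_replacement(range(n), r))

print(by_formula, by_enumeration)
```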

## Permutation - deals with ordering/sequence

Without repetition

\begin{equation}
 P_n^r = \frac{n!}{(n-r)!}
\end{equation}

In [None]:
import math
from scipy.special import comb
# P(n, r) = C(n, r) * r!, using n=10 and r=4 from the cell above
comb(n, r, exact=True) * math.factorial(r)

5040

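With repetition (arranging n letters in which one letter is repeated r times; if several letters repeat, divide by the factorial of each repeat count)

\begin{equation}
 P_n^r = \frac{(\text{number of letters})!}{(\text{number of repetitions})!} = \frac{n!}{r!}
\end{equation}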
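
A small sketch of counting arrangements when letters repeat: divide n! by the factorial of each letter's repeat count, then confirm by enumerating the distinct orderings. The word `"LETTER"` is a hypothetical example chosen because two of its letters repeat.

```python
import math
from collections import Counter
from itertools import permutations

word = "LETTER"  # hypothetical example: E and T each appear twice

# Formula: n! divided by the factorial of every letter's repeat count
by_formula = math.factorial(len(word))
for repeat_count in Counter(word).values():
    by_formula //= math.factorial(repeat_count)

# Brute force: generate all orderings and keep only the distinct ones
by_enumeration = len(set(permutations(word)))

print(by_formula, by_enumeration)
```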

### Probability
The probability of the event A is given as
\begin{equation}
P(A) = \frac{n(A)}{n}
\end{equation}

where,
 - P(A) is a real number,
 - n(A) is the number of times event A has occurred,
 - n is the total number of trials (or of equally likely outcomes), with n > 0.

For a finite sample space S, consider an event A.

The probability P(A), a real number assigned to the event A, must satisfy the following axioms:

- P(A) >= 0

- P(S) = 1

- If events A and B are mutually exclusive, i.e. A ∩ B = ∅, then
P(A ∪ B) = P(A) + P(B)

Using the axioms the following properties of probability can be obtained:

- P($\bar{A}$) = 1 - P(A)
- P(∅) = 0
- P(A) <= P(B) if A ⊂ B
- P(A) <= 1
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
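
These properties can be checked on a small equally likely sample space. The sketch below uses one roll of a fair die with exact fractions; the helper `P` and the events `A` and `B` are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space: one roll of a fair die

def P(event):
    """Probability of an event under equally likely outcomes: n(A) / n."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}   # even outcome
B = {4, 5, 6}   # outcome greater than 3

print("complement rule:", P(S - A) == 1 - P(A))
print("empty set:", P(set()) == 0)
print("subset rule:", P({6}) <= P(A))          # {6} is a subset of A
print("addition rule:", P(A | B) == P(A) + P(B) - P(A & B))
```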

## Conditional Probability
Conditional probability is used to determine the likelihood of an event A when partial information is known, namely that another event B has occurred (with P(B) > 0).
\begin{equation}
P(A|B)= \frac{P(A \cap B)}{P(B)}
\end{equation}
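
A minimal sketch of the definition, computed by enumerating the 36 outcomes of two fair dice. The events chosen here (A: the sum is 8, B: the first die is even) are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs from two fair dice
S = set(product(range(1, 7), repeat=2))

A = {o for o in S if sum(o) == 8}        # the sum is 8
B = {o for o in S if o[0] % 2 == 0}      # the first die is even

def P(event):
    """Probability under equally likely outcomes."""
    return Fraction(len(event), len(S))

p_a = P(A)
p_a_given_b = P(A & B) / P(B)            # P(A|B) = P(A ∩ B) / P(B)

print("P(A)   =", p_a)
print("P(A|B) =", p_a_given_b)
```

Knowing that the first die is even raises the probability of a sum of 8 from 5/36 to 1/6, which is the sense in which partial information changes the likelihood.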

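For example, suppose bread comes from several suppliers $S_i$ and D is the event that a loaf is defective. Given that a loaf is defective, the probability that it was supplied by supplier ‘k’ is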

\begin{equation}
P(S_k|D)=  \frac{P(S_k)P(D|S_k)}{\sum_i P(S_i)P(D|S_i)}
\end{equation} 
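
The supplier formula can be sketched numerically as follows. All numbers are hypothetical: three suppliers with assumed market shares P(S_i) and assumed defect rates P(D|S_i), none of which appear in the original text.

```python
from fractions import Fraction

# Hypothetical share of bread from each supplier: P(S_i)
prior = {"S1": Fraction(50, 100), "S2": Fraction(30, 100), "S3": Fraction(20, 100)}
# Hypothetical defect rate of each supplier: P(D|S_i)
defect_rate = {"S1": Fraction(2, 100), "S2": Fraction(3, 100), "S3": Fraction(5, 100)}

# Denominator: total probability of a defective loaf, sum_i P(S_i) P(D|S_i)
p_defective = sum(prior[s] * defect_rate[s] for s in prior)

# Posterior: P(S_k|D) for every supplier k
posterior = {s: prior[s] * defect_rate[s] / p_defective for s in prior}

for supplier, p in posterior.items():
    print(supplier, p, round(float(p), 4))
```

With these assumed numbers, the smallest supplier (S3) is as likely a source of a defective loaf as the largest one, because its higher defect rate offsets its smaller share.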


### Bayes Theorem
For events A and B,

\begin{equation} P(A \cap B) = P(A)P(B|A), \space \text{if} \space P(A) \neq 0  \end{equation} 

\begin{equation} P(A \cap B) = P(B)P(A|B), \space \text{if} \space P(B) \neq 0 \end{equation} 

Bayes Theorem relates P(A|B) to P(B|A):

\begin{equation}
P(A|B)= \frac{P(A)P(B|A)}{P(B)}
\end{equation} 
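
The step from the two product rules to the theorem can be spelled out as a short derivation, assuming P(B) ≠ 0. Both products equal P(A ∩ B), so they can be equated and then divided by P(B):

\begin{equation}
P(B)P(A|B) = P(A \cap B) = P(A)P(B|A) \quad \Longrightarrow \quad P(A|B) = \frac{P(A)P(B|A)}{P(B)}
\end{equation}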




# Statistics

The collection, analysis, interpretation, presentation and organization of data is termed as statistics.

Statistics is used to study a population/process (data).

- When all the data relevant to the study is collected and analyzed, the data is referred to as the `population`.
- When only a limited subset of the data is collected/analyzed, it is referred to as a `sample` and is used as indicative of the entire population.

- Classification
  - `Descriptive statistics`: Summarization of data to describe the main features of the sample.
  Descriptive statistics summarize the sample and are used to learn about the data under study.
  - `Inferential statistics`: The techniques and processes used to draw conclusions about a population when working with samples of it.


  

## Descriptive statistics 




### The Measures of central tendency

Central tendency measures are statistical measures used to describe or understand a sample using a single value.

A measure of central tendency could also indicate the tendency of data to centre around a particular value.
      

| Type | Description | Formula |
|-----|----------|------|
|Arithmetic mean|The sum of all values in the data set divided by the total number of values in the data set.<br> It is used for numeric data and is a single value<br> that represents the typical value in the given set of numbers.| $\frac{\sum_{i=1}^n x_i}{n}$|
|Weighted mean|The sum of each value multiplied by its weight, divided by the total weight.| $\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i}$|
|Geometric mean|The n-th root of the product of the values. It is used to average numbers that represent growth rates.| $\sqrt[n]{x_1x_2 \cdots x_n}$|
| |Equivalently, the exponential of the arithmetic mean of the log-transformed values of the data.|$\exp\left[\frac{\sum_{i=1}^n\log_e x_i}{n}\right]$|
|Harmonic mean|The harmonic mean is a better estimate of the average in situations where the data <br>represents rates/ratios such as speed (km per hr.), heart rate (beats per min.), frequency etc.<br>It is the reciprocal of the arithmetic mean of the reciprocals.| $\frac{n}{\frac{1}{x_1}+\frac{1}{x_2}+\cdots+\frac{1}{x_n}}$|
|Median|The value that lies at the centre of the data set when the data is ordered. <br>The median divides the sorted data into two equal halves.||
|Mode|The value that occurs most frequently in a data set.||



In [None]:
import statistics as st
print("mean ",st.mean([1,2,3,4]))
print("median ",st.median([7800,8000,7500,7300,7000,6500,6750,90000,8000,7800]))
print("mode",st.mode([1,2,2,3,4,4,4]))

mean  2.5
median  7650.0
mode 4
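
The weighted, geometric and harmonic means from the table are not covered by the cell above. The following sketch computes them with the standard library; all input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
import statistics

values, weights = [70, 80, 90], [1, 2, 3]   # hypothetical marks and their weights
growth = [1.10, 1.20, 0.90]                 # hypothetical yearly growth factors
speeds = [40, 60]                           # hypothetical speeds (km/h) over equal distances

# Weighted mean: sum of weight * value divided by the total weight
weighted = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Geometric mean, directly and via the mean of the logs
geometric = statistics.geometric_mean(growth)
geometric_via_logs = math.exp(sum(math.log(x) for x in growth) / len(growth))

# Harmonic mean: n divided by the sum of the reciprocals
harmonic = statistics.harmonic_mean(speeds)

print("weighted mean ", weighted)
print("geometric mean", geometric, geometric_via_logs)
print("harmonic mean ", harmonic)
```

The harmonic mean of the two speeds is 48 km/h rather than the arithmetic 50, because more time is spent travelling at the lower speed.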


### The Measures of dispersion

Dispersion represents the variability in the data. Measures of dispersion indicate how squeezed or stretched the data is around the measures of central tendency; in other words, they help us identify how the data is spread across the data set.

| Type | Description | Formula |
|-----|----------|------|
|Range|Maximum value in a data set minus the minimum value in the data set.|$\max(x)-\min(x)$|
|Sample Variance|Describes the deviation of the sample data from each other and from the sample mean $\bar{x}$.|$S^2 = \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}$|
|Population Variance|Describes the deviation of the population data from each other and from the population mean $\mu$.|$\sigma^2 =\frac{\sum_{i=1}^n (x_i-\mu)^2}{n}$|
|Standard Deviation|The square root of the variance.|$S = \sqrt{S^2}$, $\sigma = \sqrt{\sigma^2}$|
|Chebyshev's inequality|Relates the mean and the standard deviation.<br> It is a bound on the probability of finding a value far from the mean:<br>the probability that a value lies k or more standard deviations away from the mean<br> is at most $1/k^2$.|$P(\lvert x - \mu \rvert \geq k\sigma) \leq 1/k^2$|
|z-score|Indicates the distance of a data point from the mean in terms of standard deviations.|$z = (x - \mu) / \sigma$|
|Percentile|Represents the value below which a given percentage of observations fall.|Percentile rank of x = (Number of Values Below x / Total Number of Values) × 100|
|Quartile|Quartiles are 3 points in the data set that divide the data set into 4 equal groups. <br>The first quartile (Q1) is the 25th percentile of the data set i.e. the value below which 25% of the data lies.<br>The second quartile (Q2) is the 50th percentile of the data set i.e. the value below which 50% of the data lies.<br> This is the median (m) of the data set.<br>The third quartile (Q3) is the 75th percentile of the data set i.e. the value below which 75% of the data lies.||
|IQR|The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1).<br> This range is used to indicate the spread of the data and is a better measure than range because min and max<br> (used in range) are highly volatile in the presence of unusual values.|$IQR = Q_3-Q_1$|
|Coefficient of Variation|The ratio of the standard deviation to the mean, showing the extent of variability <br> in relation to the mean. A lower value means less variability relative to the mean.| $COV=S \times 100/\bar{x}$|

 


In [None]:
import statistics as st
import numpy as np
from scipy.stats.mstats import mquantiles

data = [8,12,4,11,15,10,8,7,4,11,6,6,8,7,4,9,16,6,14,6]
print("Sample variance ",st.variance(data)) #Sample variance of data.
print("Population variance ",st.pvariance(data)) #Population variance of data.
print("Sample Standard Deviation ",st.stdev(data)) # Sample standard deviation of the data.
print("Population Standard Deviation ",st.pstdev(data)) #Population standard deviation of the data.
print("90th percentile ", np.percentile(data,90))
print("Quartiles ", mquantiles(data,prob=[0.25, 0.5, 0.75])) # Q1, Q2 (median), Q3
print("Five Number Summary", mquantiles(data ,prob=[0.00,0.25, 0.5, 0.75,1.00])) # min, Q1, median, Q3, max

Sample variance  12.989473684210527
Population variance  12.34
Sample Standard Deviation  3.604091242492416
Population Standard Deviation  3.5128336140500593
90th percentile  14.100000000000001
Quartiles  [ 6.  8. 11.]
Five Number Summary [ 4.  6.  8. 11. 16.]
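
The z-score, coefficient of variation and Chebyshev's inequality from the table can be illustrated on the same `data` list. This is a sketch using only the standard library; treating the list as a whole population for the z-scores, and the choice `k = 2`, are assumptions made for the example.

```python
import statistics

data = [8,12,4,11,15,10,8,7,4,11,6,6,8,7,4,9,16,6,14,6]

mean = statistics.mean(data)
sigma = statistics.pstdev(data)   # population standard deviation

# z-score of every value: distance from the mean in standard deviations
z_scores = [(x - mean) / sigma for x in data]

# Coefficient of variation (in percent), based on the sample standard deviation
cov = statistics.stdev(data) * 100 / mean

# Chebyshev: the share of values at least k standard deviations from the mean
# can never exceed 1 / k**2
k = 2
share_outside = sum(1 for z in z_scores if abs(z) >= k) / len(data)

print("coefficient of variation (%)", cov)
print("share with |z| >= 2:", share_outside, " Chebyshev bound:", 1 / k**2)
```

Only one value (16) lies two or more standard deviations from the mean, so the observed share of 0.05 sits well inside the bound of 0.25. Chebyshev's inequality holds for any distribution, which is why it is usually loose.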


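In [None]:
from scipy import stats  # separate name, so the `statistics` alias `st` used in other cells is not overwritten
marks =[64, 47, 40, 45, 54, 63, 12, 21, 79, 95, 58, 74, 30, 50, 10, 21, 48, 85]
# percentile rank of the score 70: via scipy (in percent) and as the fraction of marks below 70
stats.percentileofscore(marks, 70), sum(1 for x in marks if x < 70) / len(marks)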

(77.77777777777777, 0.7777777777777778)

### The Measures of shape

| Type | Description | Formula |
|-----|----------|------|
|Histogram|A histogram represents the distribution of numeric data graphically.||
|Skewness|A measure of the lack of symmetry of the data in comparison with a symmetric distribution.<br> In a symmetrically distributed data set, the mean and the median coincide.|Skewness = 3 × (Mean − Median) / (Standard Deviation)|
|Kurtosis|Helps us identify how concentrated the data is around the mean.<br>A higher value of kurtosis indicates that data is largely centred around the mean, <br>while a lower value of kurtosis indicates that data is not centred around the mean.<br>**A value of kurtosis above 3 (the value for a normal distribution) indicates a peak, while a value below 3 indicates flatness**|$kurtosis =\frac{\frac{\sum_{i=1}^n (x_i-\mu)^4}{n}}{\sigma^4}$|

In [None]:
# skewness and kurtosis
import statistics as st
from scipy.stats import skew, kurtosis

data = [11730,5461,6655,7484,10242,8547,5521,13207,7598,9421,11563,6995,10531,11525,7770,7094,8339,8375,8808,6802]
# moment-based skewness, Pearson's median skewness, and kurtosis (not excess: normal = 3)
skew(data), 3*(st.mean(data)-st.median(data))/st.stdev(data), kurtosis(data,fisher=False,bias=True)


(0.45023241299126027, 0.45088222215824314, 2.2730363035525505)
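
To make the kurtosis formula in the table concrete, here is a sketch that computes moment-based skewness and kurtosis by hand from population central moments. The helper names are hypothetical. The results are expected to agree with `skew(data)` and `kurtosis(data, fisher=False, bias=True)` above, on the assumption that scipy's biased estimators use these same moment ratios.

```python
def central_moment(xs, k):
    """k-th central moment of xs, dividing by n (population form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def moment_skewness(xs):
    """Third central moment divided by sigma cubed."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def moment_kurtosis(xs):
    """Fourth central moment divided by sigma to the fourth (normal = 3)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

data = [11730,5461,6655,7484,10242,8547,5521,13207,7598,9421,11563,6995,10531,11525,7770,7094,8339,8375,8808,6802]
print(moment_skewness(data), moment_kurtosis(data))
```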