Statistics and Probability - Basics with Examples and Exercises

**Import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'factorial' from math library
from math import factorial

# import 'stats' package from scipy library
from scipy import stats
from scipy.stats import randint
from scipy.stats import skewnorm

The study of statistics is mainly divided into two parts: **Descriptive** and **Inferential**

# Descriptive Statistics

Descriptive statistics summarizes or describes the given data. It includes measures of central tendency, measures of dispersion and distribution of the data.

## Measures of Central Tendency

A measure of central tendency is a single value that identifies the central position of the data. It includes the mean, median, mode and partition values of the data.

### Mean:
It is defined as the ratio of the sum of all the observations to the total number of observations. It is affected by the presence of outliers.

### Median:
It is the middlemost observation in the data when it is arranged in the increasing or decreasing order based on the values. It divides the dataset into two equal parts.

### Mode: 
It is defined as the value in the data with the highest frequency. There can be more than one mode in the data.

### Partition values:
Partition values are defined as the values that divide the data into equal parts. `Quartiles` divide the data into 4 equal parts, `Deciles` divide the data into 10 equal parts and `Percentiles` divide the data into 100 equal parts.
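As a minimal sketch of the mode and the partition values, assuming a small made-up list `scores` (it is not part of the exercises), `statistics.multimode` returns every most frequent value and `np.quantile` returns the cut points:

```python
import numpy as np
from statistics import multimode

# hypothetical data, chosen only for illustration
scores = [2, 4, 4, 5, 7, 7, 9, 10, 12]

# every value that occurs with the highest frequency (there can be several)
modes = multimode(scores)

# quartiles Q1, Q2 (the median) and Q3 split the sorted data into 4 parts
quartiles = np.quantile(scores, [0.25, 0.5, 0.75])

# deciles D1..D9 split the sorted data into 10 parts
deciles = np.quantile(scores, np.arange(1, 10) / 10)

print('Modes:', modes)
print('Quartiles:', quartiles)
print('Deciles:', deciles)
```

Percentiles follow the same pattern with 99 cut points, for example `np.quantile(scores, np.arange(1, 100) / 100)`.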

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) of all the branches. Calculate the mean and median to find the average sale.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [2]:
Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
print('Mean:',np.mean(Sale))
print('Median:',np.median(Sale))

Mean: 169.33333333333334
Median: 173.0


## Measures of Dispersion

A measure of dispersion describes the variability in the data. Some of the measures of dispersion are range, variance, standard deviation, coefficient of variation, and IQR.

### Range:
It is defined as the difference between the largest and smallest observation in the data. It is affected by the presence of extreme observations. 

### Variance: 
It measures how far the observations are spread around the mean. It is defined as the average of the squared differences between each observation and the mean.

### Standard Deviation:
It is the positive square root of the variance. The unit of standard deviation is the same as the unit of the data points. A variable with near-zero standard deviation is almost constant, so it usually contributes little to the analysis.

### Coefficient of Variation:
It is the ratio of the standard deviation to the mean, so it measures the dispersion of data points relative to the mean. It is usually expressed as a percentage. Because it has no unit, we can compare the coefficient of variation of two or more groups to identify the group with more relative spread.

### Interquartile Range (IQR):
It is defined as the difference between the third and first quartiles. It returns the range of the middle 50% of the data. IQR can be used to identify the outliers in the data.

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) of all the branches. Calculate the standard deviation of the sales. Also, find the range in which the middle 50% of the sales lie.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [3]:
Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [4]:
range_sale = np.max(Sale)-np.min(Sale)
print("Range of sale is",range_sale)

Range of sale is 80


In [5]:
std_sale = np.std(Sale)
print("The Standard Deviation is",std_sale)

The Standard Deviation is 21.76898915634093
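`np.std` and `np.var` divide by `n` by default (`ddof=0`), which treats the 12 branches as the whole population. If the branches were instead a sample from a larger set of stores, the usual choice would be `ddof=1`. A small sketch comparing the two, under that assumption:

```python
import numpy as np

Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

# ddof=0 (NumPy's default): divide by n, treating the data as the population
pop_var = np.var(Sale, ddof=0)
pop_std = np.std(Sale, ddof=0)

# ddof=1: divide by n-1, the usual estimate when the data is a sample
sample_var = np.var(Sale, ddof=1)
sample_std = np.std(Sale, ddof=1)

print('Population variance and std:', pop_var, pop_std)
print('Sample variance and std:', sample_var, sample_std)
```

The sample versions are always slightly larger, and the gap shrinks as the number of observations grows.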


In [6]:
var_sale = np.var(Sale)
print("The Variance is",var_sale)

The Variance is 473.8888888888889


In [7]:
# coefficient of variation = (standard deviation / mean) * 100
coeff_of_var = round((np.std(Sale)/np.mean(Sale))*100,2)
print("Coefficient of Variation is",coeff_of_var)

Coefficient of Variation is 12.86


There is a 12.86% spread in the sales relative to average sales.

In [8]:
q1 = np.quantile(Sale,q=0.25)
q3 = np.quantile(Sale,q=0.75)
iqr = q3-q1
print("Range of middle 50% (aka) IQR of sale is",iqr)

Range of middle 50% (aka) IQR of sale is 22.5
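One common way to use the IQR for outlier detection is the 1.5 × IQR rule of thumb. The sketch below applies it to the same sales data; the multiplier 1.5 and the names `lower_fence` and `upper_fence` are conventions assumed here, not something the exercise prescribes.

```python
import numpy as np

Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

q1, q3 = np.quantile(Sale, [0.25, 0.75])
iqr = q3 - q1

# fences: anything more than 1.5 * IQR beyond the quartiles is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in Sale if x < lower_fence or x > upper_fence]

print('Fences:', lower_fence, upper_fence)
print('Flagged as outliers:', outliers)
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine low-sales day depends on the context.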


## Skewness and Kurtosis

### Skewness:
It measures the asymmetry of the distribution of the data around its mean. The value of skewness can be `positive` (longer right tail), `negative` (longer left tail), or `zero` (symmetric, as in the normal distribution).

### Kurtosis:
It describes the peakedness and tail heaviness of the data distribution relative to the normal distribution. `stats.kurtosis` returns the excess kurtosis by default, so a positive value represents a `leptokurtic` distribution, a negative value represents a `platykurtic` distribution, and a zero value represents a `mesokurtic` distribution.

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) of all the branches. Identify the type of Skewness and Kurtosis for the sales.
    
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [9]:
Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]

In [10]:
skew_sale = stats.skew(Sale)
print('Skewness of Sale is',skew_sale)

Skewness of Sale is -0.5285526567587567


In [11]:
kurt_sale = stats.kurtosis(Sale)
print('Kurtosis of Sale is',kurt_sale)

Kurtosis of Sale is -0.38240010775017863


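The negative skewness indicates that the distribution is left-tailed, for which typically Mean < Median < Mode (here the mean, 169.33, is below the median, 173.0). The kurtosis is also negative, which indicates a platykurtic distribution, i.e. the sales have a flatter distribution with lighter tails and fewer outliers than a normal distribution.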
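To make the two measures less of a black box, here is a sketch that rebuilds them from central moments, assuming the default settings of `stats.skew` and `stats.kurtosis` (biased moments, excess kurtosis). The names `m2`, `m3` and `m4` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

Sale = np.array([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])

# central moments about the mean, dividing by n (the biased versions)
dev = Sale - Sale.mean()
m2 = np.mean(dev**2)
m3 = np.mean(dev**3)
m4 = np.mean(dev**4)

# skewness: third moment scaled by the standard deviation cubed
skew_manual = m3 / m2**1.5

# subtracting 3 gives excess kurtosis, so a normal distribution scores 0
kurt_manual = m4 / m2**2 - 3

print(skew_manual, stats.skew(Sale))
print(kurt_manual, stats.kurtosis(Sale))
```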

## Covariance and Correlation

### Covariance:
It measures the degree to which two variables move together. The value of covariance can range from $-\infty$ to $\infty$. The magnitude of covariance depends on the units of the variables, so it is not easy to interpret.  

### Correlation:
It is the normalized value of covariance. The correlation value near to +1 indicates a `strong positive` correlation between the variables, and value near to -1 indicates a `strong negative` correlation.

#### 1. A manager handles 12 branches of a supermarket situated in the U.S.A. Consider the one-day sales (in dollars) and working hours of all the branches. Find the relationship between the working hours of a store and its sales.
    Sale = [165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175]
    Working hours = [7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9]

In [12]:
Sale = pd.Series([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
Working_hours = pd.Series([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])

In [13]:
corr_coeff = Working_hours.corr(Sale)
print("Correlation of working hours to that of Sale is",corr_coeff)

Correlation of working hours to that of Sale is 0.6447248082202144


The value of the correlation coefficient shows that there is a positive correlation between the working hours and sales of a store.
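Since correlation is described above as the normalized covariance, the following sketch computes it that way with NumPy: the sample covariance divided by the product of the two sample standard deviations. The variable names `cov_xy` and `corr_manual` are illustrative only.

```python
import numpy as np

Sale = np.array([165, 182, 140, 193, 172, 168, 174, 124, 187, 204, 148, 175])
Working_hours = np.array([7, 8.5, 8, 10, 9, 8, 8.5, 7.5, 9.5, 8.5, 8, 9])

# sample covariance (np.cov uses ddof=1 by default); [0, 1] is the off-diagonal entry
cov_xy = np.cov(Working_hours, Sale)[0, 1]

# dividing by both sample standard deviations removes the units
corr_manual = cov_xy / (np.std(Working_hours, ddof=1) * np.std(Sale, ddof=1))

print('Covariance:', cov_xy)
print('Correlation:', corr_manual)
```

The covariance carries the unit "hours × dollars", while the correlation is unit-free and always lies between -1 and +1.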

<a id="prob"></a>
# Probability

An event is the outcome or collection of outcomes of an experiment. It is a subset of the `sample space`, which is defined as the set of all possible outcomes of an experiment. 

In [14]:
# consider a set of first ten prime numbers
sample_space = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}

`Probability` is defined as the measure of the likelihood that an event occurs. The probability of occurrence of event A is denoted as `P(A)`. The probability of an event takes values between 0 and 1. The probability of the sample space is always 1. 

The probability of complement of an event A is `P(A') = 1-P(A)`.
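As a small illustration with the sample space of the first ten primes, assume every outcome is equally likely and take a hypothetical event A, "the prime drawn is greater than 10". Then P(A) is a ratio of counts, and the complement rule follows directly:

```python
from fractions import Fraction

sample_space = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}

# hypothetical event A: the prime drawn is greater than 10
A = {x for x in sample_space if x > 10}

# equally likely outcomes: P(A) = |A| / |sample space|
prob_A = Fraction(len(A), len(sample_space))

# complement rule: P(A') = 1 - P(A)
prob_not_A = 1 - prob_A

print('P(A) =', prob_A)
print("P(A') =", prob_not_A)
```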

#### 1. If the letters of the word `AABRAAKAADAABRAA` are arranged at random, find the probability that 10 A's come consecutively in the word.

In [15]:
word_length = len('AABRAAKAADAABRAA')
No_of_A = 10
No_of_B = 2
No_of_R = 2
No_of_K = 1
No_of_D = 1

If 10 A's come consecutively in the word, we consider the 10 A's as one group ([AAAAAAAAAA]BRKDBR).

The remaining 6 letters plus this group give 6 + 1 = 7 units to arrange.

In [16]:
no_words_with_10A = factorial(7)/(factorial(No_of_B)*factorial(No_of_R)*factorial(No_of_K)*factorial(No_of_D))

total_words = factorial(word_length)/(factorial(No_of_A)*factorial(No_of_B)*factorial(No_of_R)*factorial(No_of_K)*factorial(No_of_D))

req_prob = no_words_with_10A/total_words
print("The probability that 10 A's come consecutively in the word is",req_prob)

The probability that 10 A's come consecutively in the word is 0.0008741258741258741
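The grouping argument can be checked by brute force on a word short enough to enumerate. The sketch below uses a hypothetical five-letter word `AAABR` (not the word from the exercise) and compares the enumerated probability that all A's are consecutive with the counting formula.

```python
from itertools import permutations
from math import factorial

word = 'AAABR'  # hypothetical short word: 3 A's, 1 B, 1 R

# a set drops the duplicates caused by the identical A's
arrangements = {''.join(p) for p in permutations(word)}
favourable = [w for w in arrangements if 'AAA' in w]
brute_force_prob = len(favourable) / len(arrangements)

# formula: [AAA], B, R are 3 distinct units -> 3! ways; total words = 5!/3!
formula_prob = factorial(3) / (factorial(5) / factorial(3))

print(brute_force_prob, formula_prob)
```

Enumeration is only practical for short words; for the 16-letter word the counting formula is the workable route.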


#### 2. If the letters of the word `AABRAAKAADAABRAA` are arranged at random, find the probability that 2 B's and 2 R's come together.

We consider the 2 B's as one group and the 2 R's as one group. The units to arrange are then 10 A's + K + D + [BB] + [RR], so the total is 12 + 1 + 1 = 14. The letters inside each group are identical, so each group has only one internal arrangement.

In [17]:
# 14 units, of which the 10 A's are identical; [BB] and [RR] each count as a single unit
no_of_words_with_2B_2R = factorial(14)/(factorial(No_of_A)*factorial(No_of_K)*factorial(No_of_D))

total_words = factorial(word_length)/(factorial(No_of_A)*factorial(No_of_B)*factorial(No_of_R)*factorial(No_of_K)*factorial(No_of_D))


In [18]:
req_prob = no_of_words_with_2B_2R/total_words
print("The probability that 2B's and 2R's come together in the word is",req_prob)

The probability that 2B's and 2R's come together in the word is 0.016666666666666666


#### 3. A kitchen set contains 10 knives, 3 of which are defective. Two knives are drawn at random without replacement. What is the probability that neither of the two knives is defective?

In [19]:
def combination(n,r):
    comb_out = factorial(n)/(factorial(n-r)*factorial(r))
    return comb_out

probability_result = combination(7,2)/combination(10,2)
print(probability_result)

0.4666666666666667


#### 4. A new vaccine is to be tested on patients. There are 5 diabetic patients (with the same type of diabetes), 9 patients with a similar heart condition and 11 patients with the same liver condition. One patient is randomly chosen. What is the probability that the patient is not diabetic?

In [20]:
no_of_patients = 25

no_of_diabetic = 5

prob_of_diabetic = 5/25

prob_of_non_diabetic = 1-prob_of_diabetic

print('Probability of Non-Diabetic is',prob_of_non_diabetic)

Probability of Non-Diabetic is 0.8


### Odds

Probability can also be expressed in terms of `odds`. Odds is the ratio of the number of observations in favor of an event to the number of observations not in favor of an event. If odds in favor of event A are a:b then $P(A) = \frac{a}{a+b}$
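The conversion from odds to probability fits in a one-line helper. The function name `odds_to_probability` is made up for this sketch, and `Fraction` is used only to keep the result exact.

```python
from fractions import Fraction

def odds_to_probability(a, b):
    """Return P(A) when the odds in favor of event A are a:b."""
    return Fraction(a, a + b)

# odds of 14:11 in favor of an event give P(A) = 14/25
p_event = odds_to_probability(14, 11)

# the complement then has probability 11/25
p_complement = 1 - p_event

print(p_event, p_complement)
```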

#### 1. The odds that a New Yorker picked at random will be either overweight or obese are 14:11. What is the probability that the person is fit (is not overweight or obese)?

In [21]:
overweight_or_obese = 14
fit = 11

prob_fit = 11/(11+14)

print('Probability of being fit is',prob_fit)

Probability of being fit is 0.44


## Conditional Probability

#### 1. A random experiment results in an integer outcome from 21 to 30. Consider two events X and Y. 
        X: Occurrence of an even number
        Y: Occurrence of a number divisible by 4
        
#### Calculate the probability that an even number will occur given that the number is divisible by 4.

In [22]:
sample_space = set(np.arange(21,31))
sample_space

{21, 22, 23, 24, 25, 26, 27, 28, 29, 30}

In [23]:
X = [i for i in sample_space if i%2==0]
print("X")
display(X)
Y = [i for i in sample_space if i%4==0]
print("Y")
display(Y)

X


[22, 24, 26, 28, 30]

Y


[24, 28]

In [24]:
prob_x_inter_y = 2/10
prob_y = 2/10

req_prob = prob_x_inter_y/prob_y
print("The probability that an even number will occur given that the number is divisible by 4 is",req_prob)

The probability that an even number will occur given that the number is divisible by 4 is 1.0


Since Y $\subset$ X, we have P(X $\cap$ Y) = P(Y), which implies that P(X|Y) = P(X $\cap$ Y)/P(Y) = 1.

#### 2. A pair of fair dice is rolled. If the product of the numbers that appear is 6, find the probability that the second die shows an even number.

In [25]:
total_possible_outcomes = 36

# Let event A be: the product of the numbers is 6
# A = {(1,6),(6,1), (3,2), (2,3)}
A = 4

# Let event B be: the second die shows an even number
# B = {(1,2), (1,4), (1,6), (2,2), (2,4), (2,6), (3,2), (3,4), (3,6), (4,2), (4,4), (4,6), (5,2), (5,4), (5,6), 
#     (6,2), (6,4), (6,6)}
B = 18

B_inter_A = 2

prob_B_inter_A = 2/36
prob_A = 4/36

prob_B_given_A = prob_B_inter_A/prob_A

print('The probability that the second die shows an even number given the product of numbers is 6:',prob_B_given_A)

The probability that the second die shows an even number given the product of numbers is 6: 0.5
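Instead of counting the outcomes by hand, the events can be enumerated. This sketch lists all 36 outcomes with `itertools.product` and applies P(B|A) = P(A ∩ B)/P(A), which reduces to a ratio of counts when all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# all 36 equally likely outcomes as (first die, second die)
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] * o[1] == 6}  # product of the numbers is 6
B = {o for o in outcomes if o[1] % 2 == 0}     # second die shows an even number

# P(B|A) = |A ∩ B| / |A| for equally likely outcomes
prob_B_given_A = Fraction(len(A & B), len(A))

print(sorted(A))
print(sorted(A & B))
print(prob_B_given_A)
```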


### Bayes' Theorem

### Example:
<img src="matrix.png">

#### 1. What is the probability that a girl is chosen given that she likes pink color?

In [26]:
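# P(Pink | Girl)
prob_PG = 70/120

# P(Girl)
prob_G = 120/190

# P(Pink)
prob_P = 80/190

# Bayes' theorem: P(Girl | Pink) = P(Pink | Girl) * P(Girl) / P(Pink)
req_prob = ((prob_PG)*(prob_G))/prob_P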

print('The probability that a girl is chosen given that she likes Pink is', req_prob)

The probability that a girl is chosen given that she likes Pink is 0.875


#### 2. In an armament production station, an explosion can occur due to a short circuit, a fault in the machinery, or negligence of workers. From experience, the chances of these causes are 0.1, 0.3 and 0.6 respectively. The chief engineer feels that an explosion can occur with probability:
        1. 0.3 if there is a short circuit
        2. 0.2 if there is a fault in the machinery
        3. 0.25 if the workers are negligent
#### Given that an explosion has occurred, determine its most likely cause.

In [27]:
prob_sc = 0.1
prob_fault = 0.3
prob_negligence = 0.6

prob_exp_sc = 0.3
prob_exp_fault = 0.2
prob_exp_negligence = 0.25

# probability of explosion
prob_exp = (prob_sc*prob_exp_sc)+(prob_fault*prob_exp_fault)+(prob_negligence*prob_exp_negligence)

# using Bayes' theorem to calculate the probability of each cause given that an explosion has occurred
prob_sc_exp = (prob_exp_sc*prob_sc)/prob_exp
prob_fault_exp = (prob_exp_fault*prob_fault)/prob_exp
prob_negligence_exp = (prob_exp_negligence*prob_negligence)/prob_exp

print('Given an explosion, the probability that the cause is a short circuit is',prob_sc_exp)
print('Given an explosion, the probability that the cause is a fault in machinery is',prob_fault_exp)
print('Given an explosion, the probability that the cause is worker negligence is',prob_negligence_exp)

Given an explosion, the probability that the cause is a short circuit is 0.125
Given an explosion, the probability that the cause is a fault in machinery is 0.25
Given an explosion, the probability that the cause is worker negligence is 0.625


The negligence of workers is the most likely cause of an explosion in the factory. 
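The same calculation generalizes to any number of mutually exclusive causes. A minimal sketch, assuming the causes are exhaustive (their prior probabilities sum to 1); the helper `posterior` and the dictionary keys are names chosen here for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(cause | evidence) for each cause using Bayes' theorem."""
    # total probability of the evidence: sum of P(cause) * P(evidence | cause)
    evidence = sum(priors[c] * likelihoods[c] for c in priors)
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}

priors = {'short circuit': 0.1, 'machinery fault': 0.3, 'negligence': 0.6}
likelihoods = {'short circuit': 0.3, 'machinery fault': 0.2, 'negligence': 0.25}

post = posterior(priors, likelihoods)

# the most likely cause is the one with the largest posterior probability
most_likely = max(post, key=post.get)

print(post)
print('Most likely cause:', most_likely)
```

The posterior probabilities sum to 1 because the denominator is the total probability of the evidence.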

# Probability Distributions

## Discrete Probability Distributions

### Discrete Uniform Distribution

#### 1. A factory has 6 machines numbered from 1 to 6. Let the random variable X be the machine number. What is the probability that a machine chosen at random is either machine 3 or machine 4?

In [28]:
req_prob = randint.pmf(k=3,low=1,high=7)+randint.pmf(k=4,low=1,high=7)
print("The probability of selecting machine 3 or 4 is",req_prob)

The probability of selecting machine 3 or 4 is 0.3333333333333333


### Bernoulli Distribution

#### 1. A soccer player scores a goal on 7 out of 10 direct free kicks. What is the probability that he scores a goal on the next free kick? 

Consider a discrete random variable X representing a success (= 1) or failure (= 0) in scoring a goal on the next free kick. Here X follows a Bernoulli distribution with `p = 0.7`.

In [29]:
# probability of scoring a goal
p = 0.7

# calculate probability that player scores a goal for a free kick
# To find: P(X=1)
req_prob = stats.bernoulli.pmf(k = 1,p = p)
print("The probability that the player scores a goal for the next free kick is",req_prob)

The probability that the player scores a goal for the next free kick is 0.7


### Binomial Distribution

#### 1. Heaven Furnitures (HF) sells furniture like sofas, beds and tables. It is observed that 25% of their customers complain about the furniture purchased by them for many reasons. On Tuesday, 20 customers purchased furniture products from HF. 

Consider a discrete random variable X representing the number of customers, out of the 20, who complain about the purchased products. Here X follows a binomial distribution with `n = 20, p = 0.25`. 

#### a. Calculate the probability that exactly 3 customers will complain about the purchased products.

In [30]:
# pass required number of customers who'll complain to 'k'
# pass total number of customers to the parameter 'n'
# Here, success is defined as the customer's complaint on the product i.e. p = 0.25

req_prob = stats.binom.pmf(k = 3, n = 20, p = 0.25)
print("The probability that exactly 3 customers will complain is",round(req_prob,2))

The probability that exactly 3 customers will complain is 0.13


####  b. Calculate the probability that more than 3 customers will complain about the furniture purchased by them.

In [31]:
# Here, we use stats.binom.sf(): This is 1-cdf i.e. P(X > x)
req_prob_grt_3 = stats.binom.sf(k = 3, n = 20, p = 0.25)
print("The probability that more than 3 customers will complain about the furniture is",round(req_prob_grt_3,2))

The probability that more than 3 customers will complain about the furniture is 0.77
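To see what `stats.binom.pmf` and `stats.binom.sf` compute, the sketch below evaluates the binomial formula directly with `math.comb` (available from Python 3.8) and compares the results with SciPy. The names `pmf_manual` and `sf_manual` are illustrative.

```python
from math import comb
from scipy import stats

n, p, k = 20, 0.25, 3

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
pmf_manual = comb(n, k) * p**k * (1 - p)**(n - k)

# P(X > k) = 1 - P(X <= k), summing the pmf from 0 to k
sf_manual = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(pmf_manual, stats.binom.pmf(k, n, p))
print(sf_manual, stats.binom.sf(k, n, p))
```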


#### 2. In a shooting academy, data was collected on the precision shooting of a student. From 15 shots fired 11 were on target. Consider the same student, what is the probability that out of 50 shots fired, exactly 35 will hit the target?

In [32]:
req_prob = stats.binom.pmf(k = 35, n = 50, p = 11/15)
print("The probability that out of 50 shots fired, exactly 35 will hit the target is",round(req_prob,2))

The probability that out of 50 shots fired, exactly 35 will hit the target is 0.11


### Poisson Distribution

#### 1. The number of pizzas sold per day by a food zone "Fapinos" follows a Poisson distribution at a rate of 67 pizzas per day. Calculate the probability that the number of pizza sales exceeds 70 in a day.

Consider a discrete random variable X representing the number of pizzas sold per day. Here X follows a Poisson distribution with mean `m = 67`. 

In [33]:
# Here, since we are asked for the probability of X > 70, let's use stats.poisson.sf() i.e. 1-cdf
# The mean rate i.e. lambda is 67 pizzas per day

req_prob = stats.poisson.sf(k = 70, mu = 67)
print("The probability that number of pizza sold exceeds 70 in a day is",round(req_prob,2))

The probability that number of pizza sold exceeds 70 in a day is 0.33


#### 2. The number of calls received at a telephone exchange in a day follows a Poisson distribution. The probability that the exchange receives 5 calls is three times that of the exchange receiving 10 calls. Obtain the average number of calls that the telephone exchange receives in a day.

In [34]:
# Given, P(X=5) = 3*P(X=10)
# To find: m, the average number of calls that the telephone exchange receives in a day
# e^(-m) * m^5 / 5! = 3 * e^(-m) * m^10 / 10!  =>  m^5 = 10! / (3 * 5!)
m_raised_5 = factorial(10)/(3*factorial(5))

m = m_raised_5**(1/5)

print("The average number of calls the telephone exchange receives in a day is",round(m,2))

The average number of calls the telephone exchange receives in a day is 6.32
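A quick sanity check on this solution is to plug the rate back into the Poisson pmf and confirm that the stated condition P(X=5) = 3·P(X=10) holds. The sketch recomputes `m` so that it runs on its own; `p5` and `p10` are illustrative names.

```python
from math import factorial
from scipy import stats

# from P(X=5) = 3*P(X=10):  m**5 = 10! / (3 * 5!)
m = (factorial(10) / (3 * factorial(5))) ** (1 / 5)

p5 = stats.poisson.pmf(5, mu=m)
p10 = stats.poisson.pmf(10, mu=m)

print('m =', m)
print('P(X=5) / P(X=10) =', p5 / p10)
```

The rate of a Poisson distribution does not have to be a whole number, so truncating it to an integer would no longer satisfy the condition exactly.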


### Continuous Uniform Distribution

#### 1. A gas supplying company has a 200 km pipe from its supply center to the city. What is the probability that a leak occurs in the middle 100 km of the pipe? (Assume that a leak is equally likely anywhere along the route.)

Consider a continuous random variable X representing the position of the leak along the pipe (in km from the supply center), taking values in the range [0,200]. Here X follows a continuous uniform distribution.

In [35]:
# Distance between gas supply center and city is 200kms
# To find: P(50 <= X <= 150)
# X follows uniform distribution over  the interval [0,200]

# let's use stats.uniform.cdf() to calculate cdf for continuous uniform distribution
# pass required value of X to the parameter, 'x'
# pass the start point of the interval to the parameter, 'loc'
# pass the difference between the end point and start point(i.e. range) of the interval to the parameter, 'scale'
prob_50 = stats.uniform.cdf(x = 50, loc = 0, scale = 200)
prob_150 = stats.uniform.cdf(x = 150, loc = 0, scale = 200)

req_prob = prob_150 - prob_50
print("The probability that pipe leaks in the middle 100 kms is",req_prob)

The probability that pipe leaks in the middle 100 kms is 0.5


### Normal Distribution

#### 1. A survey found that people spend, on average, 300 minutes a day surfing online shopping sites, with a standard deviation of 127 minutes. Assume that the time spent surfing follows a normal distribution. Calculate the following probabilities:

Consider a continuous random variable X representing the time (in minutes) spent surfing online shopping sites. Here X follows a normal distribution with mean 300 and standard deviation 127.


#### a. What is the probability that the users are spending less than or equal to 100 minutes per day?

In [36]:
# on an average people spend 300 minutes surfing shopping sites in a day
avg = 300

sd = 127

# standardize the variable x = 100
z = (100-avg)/sd

# cdf() returns P(X <= x) i.e. P(X <= 100)
req_prob = stats.norm.cdf(z)

print("The probability that users are surfing less than or equal to 100 minutes daily is",round(req_prob,2))

The probability that users are surfing less than or equal to 100 minutes daily is 0.06


#### b. What is the probability that people are spending more than 400 minutes on online shopping sites per day?

In [37]:
# standardize the variable with x = 400
z = (400-avg)/sd

req_prob = stats.norm.sf(z)

print("The probability that people are spending more than 400 minutes on online shopping sites per day is",round(req_prob,2))

The probability that people are spending more than 400 minutes on online shopping sites per day is 0.22


#### c. What is the probability that people are spending time between 250 minutes and 350 minutes per day?

In [38]:
# standardize the variable x = 250
z_250 = (250-avg)/sd

# standardize the variable x = 350
z_350 = (350-avg)/sd

req_prob = stats.norm.cdf(z_350) - stats.norm.cdf(z_250)

print("The probability that people are spending time between 250 and 350 minutes per day is",round(req_prob,2))

The probability that people are spending time between 250 and 350 minutes per day is 0.31
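Standardizing by hand is one option; `stats.norm` also accepts the mean and standard deviation through its `loc` and `scale` parameters, which gives the same probabilities without computing z first. A sketch for the three parts above, with illustrative variable names:

```python
from scipy import stats

avg, sd = 300, 127

# pass the mean as loc and the standard deviation as scale
p_le_100 = stats.norm.cdf(100, loc=avg, scale=sd)
p_gt_400 = stats.norm.sf(400, loc=avg, scale=sd)
p_between = (stats.norm.cdf(350, loc=avg, scale=sd)
             - stats.norm.cdf(250, loc=avg, scale=sd))

print(round(p_le_100, 2), round(p_gt_400, 2), round(p_between, 2))
```

Both routes are equivalent because P(X <= x) = P(Z <= (x - mean)/sd) for a normally distributed X.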


#### 2. The monthly balance in the bank accounts of credit card holders is assumed to be normally distributed with mean 500 dollars and variance 100 (in squared dollars). What is the probability that the balance is more than 513.5 dollars?

In [39]:
avg = 500
var = 100
sd = var**(1/2)

# standardize the variable x = 513.5
z = (513.5-avg)/sd


req_prob = stats.norm.sf(z)

print("The probability that balance can be more than 513.5 dollars is",round(req_prob,2))

The probability that balance can be more than 513.5 dollars is 0.09
