
In [None]:
import pandas as pd # data manipulation
import numpy as np  # linear algebra
import seaborn as sns  # plots
import matplotlib.pyplot as plt
import math
import scipy.stats as st # Statistics & Probability


data=sns.load_dataset('tips')
data.info()

In [None]:
numerical_columns = [ i for i in data.columns if data[i].dtype not in ['object','category','bool']]
categorical_columns =[ i for i in data.columns if i not in numerical_columns]

# Descriptive Statistics
**Summarizes the data so that it can be reviewed at a glance.**

In [None]:
''' Measures of central tendency: describe the centre of the data with a single value. '''

# Mean: the arithmetic average.

print('Mean')
print(data[numerical_columns].mean())
print('------------------')

# Median: the middle value when the data is sorted.

print('Median')
print(data[numerical_columns].quantile(0.5))
print('------------------')

## Mode: the most frequent value in the data.

print('Mode')
print(data[categorical_columns].mode())

### When to use what
1. Mean: use when the data is roughly symmetric and free of extreme values. The mean is sensitive to outliers (e.g. the mean of 10, 12, 16, 10, 999 is 209.4, which represents none of the values well). In that case use the median or a trimmed mean.
2. Median: robust to outliers.
3. Trimmed mean: drop a fixed share (e.g. 5%) of the smallest values, the largest values, or both, so that extreme values are excluded, then take the mean of the remaining data.
4. Mode: mostly used with categorical values to find the most frequent category.
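
As a rough illustration of the guidance above, the sketch below compares the three measures on the small example with one extreme value. It assumes SciPy's `trim_mean`, which cuts the given proportion from both ends; the 20% cut is an arbitrary choice that suits a five-value sample.

```python
import numpy as np
import scipy.stats as st

values = np.array([10, 12, 16, 10, 999])   # one extreme value (999)

mean_value = values.mean()                 # pulled towards the outlier
median_value = np.median(values)           # unaffected by the outlier
# trim_mean drops the given proportion from BOTH ends before averaging;
# with 5 values and 0.2, one value is removed from each end
trimmed_value = st.trim_mean(values, 0.2)

print("Mean:", mean_value)
print("Median:", median_value)
print("Trimmed mean (20% each side):", trimmed_value)
```

The median and the trimmed mean stay close to the bulk of the data, while the mean does not.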

In [None]:
''' Measures of variability: how far individual values are from the central point '''

## Range: shows the spread of the data as the difference between the maximum and minimum data points.

print('range')
print(data[numerical_columns].max()-data[numerical_columns].min())
print('----------------------------')

## Variance: the average squared distance of individual values from the mean (pandas uses the sample variance, dividing by n-1).

print('variance')
print(data[numerical_columns].var())
print('--------------------------')

## Standard deviation: square root of the variance.

print('Standard_deviation')
print(data[numerical_columns].std())

In [None]:
## A quartile marks the value below which a given share of the data falls (q1: 25% of the data lies at or below it)

q1=data[numerical_columns].quantile(0.25)
q3=data[numerical_columns].quantile(0.75)

## The interquartile range spans the middle 50% of the data (it measures the centre half of our dataset)

IQR=q3-q1
print('Outlier fences')
print('Lower_Fence',q1-1.5*IQR,'Upper_Fence',q3+1.5*IQR,sep='\n')

## a box plot also shows outliers: points drawn beyond the whiskers lie outside these fences
data[numerical_columns].plot(kind='box')

In [None]:
'''      Skewness: measures asymmetry around the mean
     skew = 0 for a symmetric distribution (mean = median)
     Positive skew (right skew: the long tail and its outliers are on the right), typically mean > median
     Negative skew (left skew: the long tail and its outliers are on the left), typically mean < median
'''
data[numerical_columns].skew()

In [None]:
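''' Kurtosis: a statistical measure of how heavy the tails are compared with a normal distribution
        A normal distribution has kurtosis 3; pandas kurt() returns excess kurtosis (kurtosis - 3), so normal data gives about 0
        lighter tails (excess < 0) mean fewer outliers
        heavier tails (excess > 0) mean more outliers. '''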
data[numerical_columns].kurt()

**Skewness and kurtosis are easier to interpret from plots than from values or formulas.**

In [None]:
# coefficient of variation (CV): the standard deviation relative to the mean, so spread can be compared across columns with different scales
data[numerical_columns].std()/data[numerical_columns].mean()

In [None]:
## pandas has a built-in function that gives an overview of the descriptive statistics

data.describe(include='all')
#data.describe(exclude='object')

### Permutation and Combinations

In [None]:
## Permutation: the number of different ways to arrange objects in order. Order matters (i.e. (a,b,c) and (b,c,a) are not the same)
import math as mt

''' Arranging n different objects: n! ways '''

print("Total no of ways to arrange 11 Cricket Players: {}".format(mt.factorial(11)))

''' Arranging r objects chosen from n objects: n!/(n-r)! ways '''
print("Total no of ways to arrange 11 players from group of 15:  {}".format(mt.factorial(15)//mt.factorial(15-11)))

In [None]:
## Combination: the number of different ways to select objects. Order does not matter (i.e. {a,b,c} and {b,c,a} are the same)

''' Selecting r objects from n objects: n!/((n-r)!*r!) ways '''
print("No of ways to select 11 players from 15 players:  {}".format(mt.factorial(15)//(mt.factorial(11)*mt.factorial(15-11))))
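
The two factorial formulas are also available directly in the standard library. A minimal sketch, assuming Python 3.8 or later, where `math.perm` and `math.comb` exist:

```python
import math

n, r = 15, 11

# math.perm(n, r) = n!/(n-r)!          -> ordered arrangements
arrangements = math.perm(n, r)
# math.comb(n, r) = n!/((n-r)! * r!)   -> unordered selections
selections = math.comb(n, r)

print("Arrangements:", arrangements)
print("Selections:", selections)
# every selection of r players can be ordered in r! ways
print(arrangements == selections * math.factorial(r))
```

The last line shows how the two counts relate: permutations are combinations multiplied by the r! orderings of each selection.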

### Probability

In [None]:
'''
 Probability: measures how likely an event is to happen (the probability of heads when tossing a fair coin is 0.5).
 A probability lies between 0 and 1.
 The probabilities of all outcomes in the sample space sum to 1.
'''

''' The probability of event A is the ratio of outcomes favourable to A to the total number of equally likely outcomes. '''
def event_prob(event_outcomes,total_outcomes):
    return round((event_outcomes/total_outcomes),2)

'''Sample space: the set of all possible outcomes.'''
cards = 52 ## there are 52 cards in a pack.

hearts=event_prob(13,cards)
print('Probability of getting a heart:  {}'.format(hearts))

'''P(~A) = 1 - P(A)'''
# not getting a heart
print('Probability of not getting a heart:  {}'.format(1-hearts))

face_cards=event_prob(12,cards)
print('Probability of getting a Face Card:  {}'.format(face_cards))

'''Multiplication Rule: used to find the probability of two events both happening.'''
## the card must be a heart, and given that it is a heart it must be the queen => (13/52) * (1/13) = (1/52)
## computed from unrounded values: rounding each factor first would give 0.02 instead of 0.0192
queen_of_hearts=round((13/cards)*(1/13),4)
print('Probability of getting the queen of hearts:  {}'.format(queen_of_hearts))

'''Addition Rule: used to find the probability of at least one of two events happening: P(A or B) = P(A) + P(B) - P(A and B)'''
# event-A: getting a face card
# event-B: getting a queen
## A and B can happen together (every queen is also a face card), so subtract P(A and B) to avoid double counting
face_card_queen = round(event_prob(12,cards)+event_prob(4,cards)-event_prob(4,cards),2)
print("Probability of getting either a face card or a queen is: {}".format(face_card_queen))

In [None]:
' Mutually exclusive events: events that cannot occur at the same time '
# event-A: getting a face card
# event-B: getting a 5

''' If A and B are mutually exclusive then P(A intersection B) = 0 '''
face_card_rank_5 = event_prob(12,cards)+event_prob(4,cards)
print("Probability of getting either a face card or a 5 is: {}".format(face_card_rank_5))

In [None]:
## Conditional Probability: used to find the probability of event-A given that event-B has occurred.
''' Formula: P(A|B) = P(A intersection B)/P(B) '''

# event-A: Getting a king
# event-B: Getting a red card

## probability of event-A given that event-B has occurred (event-B covers 26 red cards, 2 of which are kings)
cp= event_prob(2,cards)/event_prob(26,cards)
print('Probability of a king given a red card: {}'.format(cp))

In [None]:
## Independent events: events whose outcomes do not affect each other.
''' P(A|B)=P(A) '''

# event-A: getting a 6 on the second throw of a die
# event-B: getting a 5 on the first throw of a die

print('Probability of getting 6 after getting 5 is: {}'.format(event_prob(1,6)))

1. **For independent events, multiplication rule: P(A and B) = P(A).P(B)**
2. **For dependent events, multiplication rule: P(A and B) = P(A|B).P(B)**
3. **For mutually exclusive events: P(A or B) = P(A) + P(B) [P(A and B) = 0]**
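
The rules above can be checked by brute force on a full deck. The sketch below is only an illustration: the `prob` helper and the `is_*` predicates are hypothetical names introduced here, and `Fraction` is used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))          # 52 (rank, suit) pairs

def prob(event):
    """Exact probability that a randomly drawn card satisfies event(card)."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))

is_face = lambda c: c[0] in ('J', 'Q', 'K')
is_queen = lambda c: c[0] == 'Q'
is_king = lambda c: c[0] == 'K'
is_five = lambda c: c[0] == '5'
is_heart = lambda c: c[1] == 'hearts'
is_red = lambda c: c[1] in ('hearts', 'diamonds')

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_face_or_queen = prob(lambda c: is_face(c) or is_queen(c))
# Mutually exclusive events: P(A and B) = 0, so the probabilities simply add
p_face_or_five = prob(lambda c: is_face(c) or is_five(c))
# Conditional probability: P(A|B) = P(A and B) / P(B)
p_king_given_red = prob(lambda c: is_king(c) and is_red(c)) / prob(is_red)
# Independent events: P(A and B) = P(A) * P(B)
p_queen_of_hearts = prob(lambda c: is_queen(c) and is_heart(c))

print(p_face_or_queen, p_face_or_five, p_king_given_red, p_queen_of_hearts)
```

Counting outcomes directly avoids the small errors that appear when rounded probabilities are multiplied or divided.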

### Probability Distribution
**It is nothing but a mathematical function that gives the probabilities of the possible outcomes of an experiment.**
* **Random variable: a variable that takes its value from the set of possible outcomes of an experiment**

There are two types of distributions:
1. Discrete   {Binomial, Uniform, Poisson}              Probability Mass Function
2. Continuous {Normal, T, Exponential, Chi-Square}       Probability Density Function

Cumulative Distribution Function: F(x) = P(X <= x)


In [None]:
' Binomial distribution: gives the probability of a specific number of successes in a fixed number of independent trials '
## e.g. getting 4 heads out of 6 tosses.
'''
Each trial has only two possible outcomes: success or failure.
Mean: n*p 
Standard Deviation: sqrt(n*p*(1-p))
'''

''' 
Consider the tips dataset.
  1. Treat a non-smoker as a success.
  2. Find the probability that exactly 5 out of 30 customers are non-smokers.
'''

n=30  # total number of trials
p=event_prob(len(data[data['smoker']=='No']),len(data)) # probability of success

print("Probability of 5 non-smokers out of 30 is: {}".format(st.binom.pmf(5,n,p)))

## number of successes, i.e. number of non-smokers (0 to n inclusive)
k=np.arange(0,n+1) 
binomial=st.binom.pmf(k,n,p)

plt.plot(k,binomial,'o-')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()
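
For reference, the mean and spread of a binomial distribution can be compared against SciPy. This is a minimal sketch: the value `p = 0.62` is an assumed success probability chosen for illustration, not one computed from the dataset here.

```python
import math
import scipy.stats as st

n = 30      # number of trials
p = 0.62    # assumed success probability, for illustration only

mean_formula = n * p                          # mean of a binomial: n*p
sd_formula = math.sqrt(n * p * (1 - p))       # s.d. is the square root of the variance n*p*(1-p)

binom_mean = st.binom.mean(n, p)
binom_sd = st.binom.std(n, p)

print("Mean:", mean_formula, binom_mean)
print("S.D.:", sd_formula, binom_sd)
# the probabilities of all possible success counts (0..n) add up to 1
total_probability = st.binom.pmf(range(n + 1), n, p).sum()
print("Sum of pmf:", total_probability)
```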

In [None]:
'''

Poisson distribution: used to calculate the probability of a given number of events occurring in a fixed interval of time


Mean and variance are equal (both equal the rate λ).
'''

'''
Consider a testing environment
 1. On average, 3 bugs are reported per week
 2. What is the probability that 4 bugs are reported next week?
'''

## mean
rate=3
print("Probability of 4 bugs reported is {}".format(st.poisson.pmf(4,rate)))

## number of bugs ranging from 0 to 10
n=np.arange(0,11)
poisson=st.poisson.pmf(n,rate)
plt.plot(n,poisson,'o-')
plt.xlabel('Number of bugs')
plt.ylabel('Probability')
plt.show()

In [None]:
## Normal Distribution: 
'''
1. mean = median = mode
2. The distribution is bell-shaped and symmetrical around the mean.
3. The total area under the curve is equal to 1.

The z-score tells how many standard deviations a data point lies from the mean:
(data_point - mean) / standard deviation
'''
'''
Consider a classroom result.
1. The average result of the class is 75. The standard deviation of the class is 18.7.
2. What is the probability density at a score of 84?
   (For a continuous distribution the probability of any exact value is 0; pdf gives the height of the curve, not a probability.)

'''
mean=75
sd=18.7
print("Probability density at a score of 84 is {}".format(st.norm.pdf(84,mean,sd)))

## scores 1 to 99
x=np.arange(1,100)
result=st.norm.pdf(x,mean,sd)

plt.plot(x,result)
plt.xlabel('Score')
plt.ylabel('Probability density')
plt.show()

## try with a lower mean and the same standard deviation
x=np.arange(1,100)
result=st.norm.pdf(x,65,18.7)

plt.plot(x,result)
plt.xlabel('Score')
plt.ylabel('Probability density')
plt.show()

## try with a higher mean and a lower standard deviation
x=np.arange(1,100)
result=st.norm.pdf(x,78,13)

plt.plot(x,result)
plt.xlabel('Score')
plt.ylabel('Probability density')
plt.show()

**Observations from Above:**
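1. If the s.d. is lower, the curve becomes narrower and taller (like the third chart, s.d. = 13); a higher s.d. makes it wider and flatter.
2. Changing the mean only shifts the curve along the x-axis without changing its shape (like the second chart, mean = 65).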
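
A quick numeric check of how the parameters affect the curve: the peak of a normal density sits at the mean and has height 1/(s.d. * sqrt(2π)), so it depends on the s.d. only. The sketch below evaluates the peak for the three (mean, s.d.) pairs plotted above.

```python
import math
import scipy.stats as st

# (mean, standard deviation) pairs used in the three charts
params = [(75, 18.7), (65, 18.7), (78, 13)]

peaks = []
for mu, sd in params:
    peak = st.norm.pdf(mu, mu, sd)                   # density at the mean = highest point of the curve
    expected = 1 / (sd * math.sqrt(2 * math.pi))     # closed-form peak height
    peaks.append(peak)
    print(f"mean={mu}, sd={sd}: peak={peak:.5f}, 1/(sd*sqrt(2*pi))={expected:.5f}")
```

The first two curves share the same peak height because they share the same s.d.; only the third, with the smaller s.d., is taller.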

In [None]:
## cdf (cumulative distribution function): gives the probability of a value less than or equal to some value
## For the problem above, what is the probability of getting score<=90?
print("Probability of getting a score less than or equal to 90 is {}".format(st.norm.cdf(90,mean,sd)))

1. Normal distribution can approximate binomial distribution when n.p>5 and n.(1-p)>5.
2. For sufficiently large values of λ, the normal distribution with mean λ and variance λ approximates the Poisson distribution.
3. In a Poisson distribution, when λ is a positive integer, the modes are λ and λ − 1

**Standard Normal Distribution: where mean=0 and s.d=1**


**Empirical Rule**: for normally distributed data, 
1.         around 68% of the data lies within 1 standard deviation of the mean (mean-s.d, mean+s.d)
1.          around 95% of the data lies within 2 standard deviations of the mean (mean-2*s.d, mean+2*s.d)
1.          around 99.7% of the data lies within 3 standard deviations of the mean (mean-3*s.d, mean+3*s.d)
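
These percentages follow from the standard normal CDF. A small sketch that recomputes them:

```python
import scipy.stats as st

coverage = {}
for k in (1, 2, 3):
    # area under the standard normal curve between -k and +k standard deviations
    coverage[k] = st.norm.cdf(k) - st.norm.cdf(-k)
    print(f"within {k} s.d.: {coverage[k]:.4f}")
```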

### Sampling

* Population (N): the whole group of interest.
* Sample (n): a small, representative subset of the population.
* The process of choosing a sample from the population is known as sampling.
     1. Random Sampling
     2. Stratified Sampling
* Sampling distribution: the distribution of a statistic (e.g. the sample mean) over many samples.
* **The Central Limit Theorem states that as the sample size increases, the distribution of sample means (the sampling distribution of the mean) approaches a normal distribution, whatever the shape of the population distribution (given finite variance).**

In [None]:
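## first try with a small number of rolls (e.g. 20), then keep increasing it
## the histogram of raw rolls flattens towards a uniform shape; the bell shape of the CLT appears only for sample means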
np.random.seed(6)
rolls=np.random.randint(1,7,50000000)

plt.hist(rolls)
plt.show()
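
The Central Limit Theorem concerns the means of samples rather than individual rolls. The sketch below, with an arbitrarily chosen sample size of 30 and 10,000 repeated samples, draws dice samples and compares the spread of their means with the theoretical standard error.

```python
import numpy as np

rng = np.random.default_rng(6)
sample_size = 30        # rolls per sample (an arbitrary choice; try 2, 5, 30, ...)
n_samples = 10_000      # number of repeated samples

rolls = rng.integers(1, 7, size=(n_samples, sample_size))   # fair die: values 1..6
sample_means = rolls.mean(axis=1)                           # one mean per sample

die_sd = np.sqrt(35 / 12)          # s.d. of a single fair die roll
print("Mean of sample means:", sample_means.mean())
print("S.D. of sample means:", sample_means.std())
print("sigma / sqrt(n):     ", die_sd / np.sqrt(sample_size))
```

Passing `sample_means` to `plt.hist` shows the bell shape; it becomes more pronounced as `sample_size` grows.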

**Confidence Interval**: a range of values, computed from a sample, that is expected to contain the population parameter with a stated level of confidence (e.g. 95%).

Standard Error = S.D. of sample / sqrt(sample size)

Margin of Error (M.E.) = critical z-score at alpha/2 * Standard Error 

C.I.: [ sample mean - M.E. , sample mean + M.E. ]


In [None]:
''' Critical z-scores from the significance level (alpha = 1 - confidence) '''
print(st.norm.ppf(1-0.05/2)) ## two tailed test, alpha=0.05
print(st.norm.ppf(1-0.01)) ## right tailed test, alpha=0.01
print(st.norm.ppf(0.01))   ## left tailed test, alpha=0.01

In [None]:
'''
Consider a classroom result. 
1. Sample mean score: 78 and S.D. = 13.5
2. With 95% confidence, in what range does the mean score lie for a sample of 50 students?
'''
n=50          # sample size 
x_bar=78     # sample mean
sigma=13.5   # S.D
standard_error= (sigma/math.sqrt(n))   # S.D. divided by the square root of the sample size
margin_error = standard_error * st.norm.ppf(1-0.05/2)

print("Confidence Interval: {},{}".format(round(x_bar-margin_error,2),round(x_bar+margin_error,2)))
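
The same interval can be cross-checked with SciPy's `norm.interval`. This is a sketch using the classroom numbers from the cell above; the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import math
import scipy.stats as st

n = 50          # sample size
x_bar = 78      # sample mean
sigma = 13.5    # standard deviation

standard_error = sigma / math.sqrt(n)
margin_error = st.norm.ppf(1 - 0.05 / 2) * standard_error
manual = (x_bar - margin_error, x_bar + margin_error)

# norm.interval returns the central range holding 95% of a normal
# distribution centred at x_bar with the standard error as its scale
low, high = st.norm.interval(0.95, loc=x_bar, scale=standard_error)

print("Manual:", manual)
print("SciPy: ", (low, high))
```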

## Hypothesis Testing: 

 Step-1: **Null Hypothesis (Ho): believed to be true unless there is overwhelming evidence to the contrary (status quo).**
 
 Step-2: **Alternate Hypothesis (H1): the opposite of Ho.**
        **One-tailed test: used when H1 states that the parameter is lower than (or greater than) some value**
        **Two-tailed test: used when H1 states that the parameter is not equal to some value**
        
 Step-3: **Level of Significance: alpha = 1 - confidence. It is the probability of a Type-I error that one is willing to tolerate**
 
 Step-4: **Test Statistic (either z or t)**

 Step-5: **Calculate the critical value from alpha (PPF function), or calculate the p-value (CDF function).**
 
 Step-6: **Reject Ho if the test statistic falls in the rejection region (left-tailed: statistic <= -critical; right-tailed: statistic >= critical; two-tailed: |statistic| >= critical at alpha/2), or equivalently if p-value <= alpha. Otherwise fail to reject Ho.**
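
The steps above can be sketched as one small helper. The function name `z_test` and the numbers in the example are made up for illustration; the p-value is computed per tail type and then compared with alpha.

```python
import math
import scipy.stats as st

def z_test(x_bar, mu0, sd, n, tail="two"):
    """Return (z, p_value) for a one-sample z-test.

    tail: "left", "right" or "two" (direction of the alternate hypothesis).
    """
    z = (x_bar - mu0) / (sd / math.sqrt(n))
    if tail == "left":
        p_value = st.norm.cdf(z)             # area below z
    elif tail == "right":
        p_value = st.norm.sf(z)              # area above z (1 - cdf)
    elif tail == "two":
        p_value = 2 * st.norm.sf(abs(z))     # area in both tails beyond |z|
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, p_value

# illustrative numbers, not taken from any dataset
alpha = 0.05
z, p_value = z_test(x_bar=52, mu0=50, sd=8, n=64, tail="right")
print("z =", z, "p-value =", p_value)
print("Reject Ho" if p_value <= alpha else "Failed to reject Ho")
```

Working with the p-value keeps the decision rule identical for all three tail types, which avoids sign mistakes with critical values.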

### T-Distribution
**It is similar to the normal distribution but with fatter tails: with less data, extreme values are more likely, and the tails accommodate them.**

*Used when the sample size is small (n<30) and the population s.d. is unknown.*

**Degrees of Freedom = n-1**


In [None]:
'''Left Tailed Test'''
'''Jeffrey, as an eight-year old, established a mean time of 16.43 seconds for swimming the 25-yard freestyle, with a standard deviation of 0.8 seconds. 
    His dad, Frank, thought that Jeffrey could swim the 25-yard freestyle faster using goggles. 
    Frank bought Jeffrey a new pair of expensive goggles and timed Jeffrey for 15 25-yard freestyle swims. 
    For the 15 swims, Jeffrey's mean time was 16 seconds.
    Frank thought that the goggles helped Jeffrey to swim faster than the 16.43 seconds.
    Conduct a hypothesis test using a preset α = 0.05.'''

mean=16.43
s_d =0.8
n=15
alpha=0.05
x_bar=16

'''
   Null hypothesis: mean>=16.43 
   Alternate hypothesis: mean<16.43 
   H1 is "less than", so this is a left-tailed test
   Level of significance = 0.05
   Test statistic: t-test, since we don't know the population S.D. and n<30
   Using the p-value to decide.
'''

t = (x_bar-mean)/(s_d/math.sqrt(n))
p_value = st.t.cdf(t,n-1)   # left-tail area below t
if p_value<=alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to Reject Null Hypothesis")
# p_value is below alpha, so at the 5% significance level the data support that the goggles improved swimming speed.

In [None]:
''' Right Tailed Test'''
'''
Jane has just begun her new job as on the sales force of a very competitive company.
In a sample of 16 sales calls it was found that she closed the contract for an average value of 108 dollars with a standard deviation of 12 dollars. 
Company policy requires that new members of the sales force must exceed an average of $100 per contract during the trial employment period.
Can we conclude that Jane has met this requirement at the 5% significance level (95% confidence)?
'''

mean=100
n=16
s_d=12
x_bar=108 # what jane made
alpha=0.05 

'''
Null Hypothesis: mean<=100
Alternate Hypothesis: mean>100 (right tailed test)
los: alpha=0.05 
test statistic is t-test
using critical t-value
'''
t = (x_bar-mean)/(s_d/math.sqrt(n))
tC= st.t.ppf(1-alpha,n-1)

if t>=tC:
    print("Reject Null Hypothesis")
else:
    print("Failed to reject Null Hypothesis")
# t exceeds the critical value, so at the 5% significance level Jane met the requirement

In [None]:
''' Two Tailed Test '''

'''
A manufacturer of salad dressings uses machines to dispense liquid ingredients into bottles that move along a filling line.
The machine that dispenses salad dressings is working properly when 8 ounces are dispensed. 
Suppose that the average amount dispensed in a particular sample of 35 bottles is 7.91 ounces with a variance s² of 0.03 ounces squared.
Is there evidence that the machine should be stopped and production wait for repairs?
The lost production from a shutdown is potentially so great that management feels that the level of significance in the analysis should be 1% (99% confidence).
'''

n=35
x_bar=7.91 
variance=0.03 
mean=8
alpha=0.01

'''
Ho : mean=8  (no over or under filling)
H1 : mean!=8 two tailed test
los: alpha=0.01 
test: Z-test (n>=30)
using the two-sided p-value
'''

z= (x_bar-mean)/(math.sqrt(variance)/math.sqrt(n))
p_value=2*st.norm.cdf(-abs(z))   # two-sided p-value: area in both tails beyond |z|

if p_value<=alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to Reject Null Hypothesis")
# Ho is rejected: at the 1% significance level there is evidence that the machine does not dispense 8 ounces on average

In [None]:
''' Using Proportions '''
'''
The mortgage department of a large bank is interested in the nature of loans of first-time borrowers. 
This information will be used to tailor their marketing strategy.
They believe that 50% of first-time borrowers take out smaller loans than other borrowers.
They perform a hypothesis test to determine if the percentage is the same or different from 50%.
They sample 100 first-time borrowers and find 53 of these loans are smaller than those of other borrowers.
For the hypothesis test, they choose a 5% level of significance.
'''
p_bar=0.53
p=0.5 
n=100
alpha=0.05

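'''
Ho: p==0.5
H1: p!=0.5 (two tailed test)
'''

z=(p_bar-p)/(math.sqrt(p*(1-p)/n))
p_value=2*st.norm.cdf(-abs(z))   # two-sided p-value from the CDF (ppf is the inverse function and does not return a p-value)

if p_value<=alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to Reject Null Hypothesis")
## we fail to reject Ho: there is not enough evidence that the proportion differs from 50%.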

There are two types of error:

1. Type-I Error: **Rejecting a true null hypothesis** (false positive, e.g. telling a man that he is pregnant).
2. Type-II Error: **Failing to reject a false null hypothesis** (false negative, e.g. telling a visibly pregnant woman that she isn't pregnant).

Chi-Square Distribution: 
1. Allows us to perform hypothesis tests on nominal and ordinal (categorical) data.
2. **Goodness-of-fit test**: uses a sample to test whether an observed frequency distribution fits the expected distribution. The statistic is never negative.
3. Formula: chi-square = sum( (observed-expected)**2 / expected )
4. **Test of independence** between two categorical variables.
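
To connect the formula with SciPy, the sketch below computes the statistic by hand for made-up counts (three categories, 60 observations) and compares it with `st.chisquare`. The degrees of freedom for a goodness-of-fit test are the number of categories minus 1.

```python
import numpy as np
import scipy.stats as st

# made-up counts for illustration; both lists must have the same total
observed = np.array([18, 22, 20])
expected = np.array([20, 20, 20])

# chi-square = sum( (observed - expected)^2 / expected )
stat_manual = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1
p_manual = st.chi2.sf(stat_manual, dof)     # right-tail area of the chi-square distribution

stat_scipy, p_scipy = st.chisquare(observed, expected)

print("Manual:", stat_manual, p_manual)
print("SciPy: ", stat_scipy, p_scipy)
```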


In [None]:
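''' 
Consider an exit poll scenario: compare the observed vote counts with the expected counts
'''

expected = [303,117,122]
observed = [353,91,98]

stat,p_value=st.chisquare(observed,expected)

if p_value<0.05:
    print("Reject Null Hypothesis: observed counts differ significantly from the expected counts")
else:
    print("Failed to Reject Null Hypothesis: observed counts are consistent with the expected counts")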

In [None]:
'''
Is gender independent of education level? 
A random sample of 395 people was surveyed and each person was asked to report the highest education level they obtained. 
The data that resulted from the survey are summarized in the following table:

         High School   Bachelors   Masters   Ph.D.   Total
Female       60           54         46       41      201
Male         40           44         53       57      194
Total       100           98         99       98      395

Question: Are gender and education level dependent at a 5% level of significance?
'''

'''
Null Hypothesis: gender and education level are independent.
Alternate Hypothesis: gender and education level are dependent.
LOS: 5%
'''

observed = [[60,54,46,41],[40,44,53,57]]   # named 'observed' so the tips DataFrame in 'data' is not overwritten
statistic,p_value,dof,expected_value = st.chi2_contingency(observed)

if p_value<0.05:
    print("Reject Null Hypothesis")
else:
    print("Failed to Reject Null Hypothesis")
# here p_value < 0.05, so gender and education level are dependent at the 5% level

**F-Test**
1. Based on the variance-ratio distribution (F-distribution): the test statistic is the ratio of two variance estimates from normally distributed populations.

**ANOVA assumptions:**
1. Each population must be normally distributed.
2. Samples must be independent of each other.
3. Each population must have the same variance.

F-test (completely randomized one-way ANOVA)

**F = MSB/MSW**

In this formula,

* F = ANOVA test statistic (F-ratio)
* MSB = mean sum of squares between the groups
* MSW = mean sum of squares within the groups

In [None]:
'''
A pharmaceutical company conducts an experiment to test the effect of a new cholesterol medication.
The company selects 15 subjects randomly from a larger population. 
Each subject is randomly assigned to one of three treatment groups.
Within each treatment group, subjects receive a different dose of the new medication.
In Group 1, subjects receive 0 mg/day; in Group 2, 50 mg/day; and in Group 3, 100 mg/day.

In conducting this experiment, the experimenter had two research questions:

Does dosage level have a significant effect on cholesterol level?
How strong is the effect of dosage level on cholesterol level?
'''
''' Use one way anova analysis'''
group1=[210,240,270,270,300]
group2=[210,240,240,270,270]
group3=[180,210,210,210,240]

'''
H0: dosage has no effect on cholesterol (all group means are equal)
H1: dosage has an effect on cholesterol (at least one group mean differs)
LOS: 0.05
'''

f_stat,p_value=st.f_oneway(group1,group2,group3)

if p_value<0.05:
    print("Reject null hypothesis.")
else:
    print("Failed to reject null hypothesis.")
## if Ho is rejected: the mean cholesterol level in at least one treatment group differs significantly from the mean cholesterol level in another group.
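
To make F = MSB/MSW concrete, the sketch below recomputes the statistic by hand for the three dose groups and compares it with `st.f_oneway`. It also reports eta squared (SSB / total sum of squares) as one possible answer to the "how strong is the effect" question; that choice of effect-size measure is an assumption here, not something the original analysis specifies.

```python
import numpy as np
import scipy.stats as st

groups = [np.array([210, 240, 270, 270, 300]),
          np.array([210, 240, 240, 270, 270]),
          np.array([180, 210, 210, 210, 240])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
N = len(all_values)        # total number of observations

# between-group and within-group sums of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)        # mean square between groups
msw = ssw / (N - k)        # mean square within groups
f_manual = msb / msw
p_manual = st.f.sf(f_manual, k - 1, N - k)   # right-tail area of the F-distribution

eta_squared = ssb / (ssb + ssw)              # share of total variation explained by dosage

f_scipy, p_scipy = st.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
print("Eta squared:", eta_squared)
```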

**Covariance** indicates the direction (positive or negative) of the linear relationship between two variables. Its values range from -∞ to +∞ and depend on the units of the variables, so the magnitude is hard to interpret on its own. A positive value implies a positive relationship, whereas a negative value represents a negative relationship.

**Correlation**
1. It tells us about the association between two variables.
1. It measures the strength and direction of the linear relationship between variables and lies in the range -1 to +1.

In [None]:
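data=sns.load_dataset('tips')
sns.heatmap(data.corr(numeric_only=True))   # numeric_only skips the categorical columns, which would raise an error in recent pandas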

**Covariance shows how two variables vary together (in the units of the variables), whereas correlation is covariance rescaled to the range -1 to +1, showing how strongly the two variables are related**
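
The relationship between the two measures can be seen on a tiny made-up example: correlation equals covariance divided by the product of the standard deviations, and it does not change when a variable is rescaled.

```python
import numpy as np

# small made-up arrays for illustration
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov_xy = np.cov(x, y)[0, 1]                          # sample covariance (divides by n-1)
corr_xy = np.corrcoef(x, y)[0, 1]                    # Pearson correlation
corr_from_cov = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# rescaling y (e.g. changing its unit) changes the covariance but not the correlation
cov_scaled = np.cov(x, 100 * y)[0, 1]
corr_scaled = np.corrcoef(x, 100 * y)[0, 1]

print("Covariance:", cov_xy, "after rescaling:", cov_scaled)
print("Correlation:", corr_xy, "after rescaling:", corr_scaled)
```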

**Regression:**
1.     It models the relationship between independent variables and a dependent variable.
1.     An F-test can be used for selecting features.
1.     **R-squared (Coefficient of Determination)**: the proportion of the variance in the dependent variable that the model explains.

In [None]:
## regression analysis of tips using total_bill :- simple linear regression
x=data['total_bill']
y=data['tip']
res=st.linregress(x,y)
print(f"R-squared: {res.rvalue**2:.6f}")

*R-squared is only about 0.46, so total_bill alone explains less than half of the variance in tip; simple linear regression gives rough predictions at best. Look at the plot below.*

In [None]:
plt.plot(x, y, 'o', label='original data')
plt.plot(x, res.intercept + res.slope*x, 'r', label='fitted line')
plt.legend()
plt.show()

## Multiple Linear Regression
Contains more than one independent variable.

In [None]:
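## regression analysis
x=pd.get_dummies(data,drop_first=True,dtype=float).drop('tip',axis=1)   # encode categorical columns as 0/1 floats
x.insert(0,'intercept',1.0)   # column of ones so that the model has an intercept term
y=data['tip']
coef,residuals,rank,singular_values = np.linalg.lstsq(x,y, rcond=None)   # lstsq returns (solution, residuals, rank, singular values)

print("intercept {}".format(coef[0]))
print("Coefficients {}".format(dict(zip(x.columns[1:],coef[1:]))))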