# Statistics

## Descriptive statistics
Descriptive statistics provide a concise summary of data. You can summarize data numerically or graphically. For example, the manager of a fast food restaurant tracks the wait times for customers during the lunch hour for a week and summarizes the data.

The manager calculates the following numeric descriptive statistics:

| Statistic | Sample value |
| :-: | :-: |
| Mean | 6.2 minutes |
| Standard deviation | 1.5 minutes |
| Range | 3 to 10 minutes |
| N (sample size) | 50 |
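
As a rough illustration, the same kind of summary can be computed with Python's standard library. The wait times below are made-up values, not the manager's actual data, and the variable names are only illustrative.

```python
import statistics

# Hypothetical wait times in minutes (illustrative values, not the manager's data).
wait_times = [3.0, 4.5, 5.0, 5.5, 6.0, 6.0, 6.5, 7.0, 7.5, 8.0, 10.0]

mean = statistics.mean(wait_times)
sd = statistics.stdev(wait_times)          # sample standard deviation (divides by n - 1)
low, high = min(wait_times), max(wait_times)
n = len(wait_times)

print(f"mean = {mean:.2f} min, sd = {sd:.2f} min, range = {low} to {high} min, n = {n}")
```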

The manager examines the following graphs to visualize the wait times:


#### Histogram of wait times
![image.png](attachment:image.png)

#### Boxplot of wait times
![image-2.png](attachment:image-2.png)

### Inferential statistics
Inferential statistics use a random sample of data taken from a population to describe and make inferences (predictions) about the population. Inferential statistics are valuable when it is not convenient or possible to examine each member of an entire population. For example, it is impractical to measure the diameter of each nail that is manufactured in a mill, but you can measure the diameters of a representative random sample of nails and use that information to make generalizations about the diameters of all the nails produced.

### Descriptive statistics vs inferential statistics
https://youtu.be/rwm_R5D8lU8

### Hypothesis
The null and alternative hypotheses are two mutually exclusive statements about a population. A hypothesis test uses sample data to determine whether to reject the null hypothesis.

	• Null hypothesis (H0)
	The null hypothesis states that a population parameter (such as the mean, the standard deviation, and so on) is equal to a hypothesized value. The null hypothesis is often an initial claim that is based on previous analyses or specialized knowledge.
	
	• Alternative Hypothesis (H1)
	The alternative hypothesis states that a population parameter is smaller, greater, or different than the hypothesized value in the null hypothesis. The alternative hypothesis is what you might believe to be true or hope to prove true.

**One-sided and two-sided hypotheses**

The alternative hypothesis can be either one-sided or two-sided.

	• Two-sided
	Use a two-sided alternative hypothesis (also known as a nondirectional hypothesis) to determine whether the population parameter is either greater than or less than the hypothesized value. A two-sided test can detect when the population parameter differs in either direction, but has less power than a one-sided test.
	E.g.
	H0: μ = 850 vs. H1: μ ≠ 850
	
	• One-sided
	Use a one-sided alternative hypothesis (also known as a directional hypothesis) to determine whether the population parameter differs from the hypothesized value in a specific direction. You can specify the direction to be either greater than or less than the hypothesized value. A one-sided test has greater power than a two-sided test, but it cannot detect whether the population parameter differs in the opposite direction.
	E.g.
	H0: μ = 850 vs. H1: μ > 850
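
To make the difference concrete, here is a small sketch that runs both versions of the test on the same data. It assumes SciPy 1.6 or later (for the `alternative` argument of `scipy.stats.ttest_1samp`), and the ten measurements are invented for illustration.

```python
from scipy import stats

# Hypothetical sample (illustrative values only); its mean is 860.
sample = [872, 845, 861, 858, 849, 880, 853, 866, 841, 875]

# Two-sided: H0: mu = 850 vs. H1: mu != 850
two_sided = stats.ttest_1samp(sample, popmean=850, alternative="two-sided")
# One-sided: H0: mu = 850 vs. H1: mu > 850
one_sided = stats.ttest_1samp(sample, popmean=850, alternative="greater")

print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

Because the sample mean lies above 850, the one-sided p-value is half the two-sided one, which is where the extra power of the one-sided test comes from.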
	

There are two types of errors that may occur in our hypothesis testing:

	1. Type I error: We reject the null hypothesis when it is true (a false positive).
	2. Type II error: We fail to reject the null hypothesis when it is false (a false negative).
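
A Type I error rate can be seen directly in a simulation. The sketch below is only an illustration under assumed settings: it draws many samples from a population where the null hypothesis really is true (mean 850) and counts how often a t-test at a 0.05 cutoff rejects it anyway. The population values and the number of experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
cutoff = 0.05
n_experiments = 2000

# H0 is true in every experiment: each row is a sample of 25 values
# drawn from a population whose mean really is 850.
samples = rng.normal(loc=850, scale=30, size=(n_experiments, 25))
p_values = stats.ttest_1samp(samples, popmean=850, axis=1).pvalue

# Every rejection here is a Type I error, because H0 is true by construction.
type_i_rate = np.mean(p_values < cutoff)
print(f"Share of true nulls rejected: {type_i_rate:.3f}")
```

The rejected share should land near the cutoff itself, which is why the cutoff is often described as the Type I error rate you are willing to tolerate.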

### P value (Probability Value):

Video on Statistical Significance and p-Values:
https://youtu.be/DAkJhY2zQ3c

![image.png](attachment:image.png)
	When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

*A p-value, or probability value, is a number describing how likely it is that you would see data at least as extreme as yours if the null hypothesis were true (i.e. if only random chance were at work).*

The level of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

	• A p-value at or below the chosen threshold (typically ≤ 0.05) is statistically significant. It indicates that data this extreme would be unlikely if the null hypothesis were true, which counts as evidence against the null hypothesis. We therefore reject the null hypothesis in favor of the alternative hypothesis.
	Rejecting the null hypothesis does not mean that there is a 95% probability that the alternative hypothesis is true. The p-value is computed on the assumption that the null hypothesis is true; it is not the probability that either hypothesis is true or false.

	• A p-value higher than 0.05 (> 0.05) is not statistically significant. It means the data do not provide enough evidence against the null hypothesis, so we fail to reject it. Note that this is not the same as accepting the null hypothesis: we can only reject the null or fail to reject it.
	A statistically significant result cannot prove that a research hypothesis is correct (as this implies 100% certainty).
	Instead, we may state that our results “provide support for” or “give evidence for” our research hypothesis (as results this extreme can still occur by chance when the null hypothesis is true).

	• A p-value is the probability of seeing a difference at least as large as the observed one just by random chance, assuming the null hypothesis is true.
	• The lower the p-value, the greater the statistical significance of the observed difference.
	• A p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.

According to the American Statistical Association,<br>
“a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.”

**The null hypothesis assumes there is ‘no effect’ or ‘no relationship’ by default.<br>
For example:
	If you are testing whether a drug is effective, the null hypothesis assumes the drug is ineffective.
	To reject this null hypothesis, we collect evidence that would be unlikely to occur if the drug were ineffective.**


### Statistical tests that report a p-value
	1. Welch Two Sample t-Test (t-test)
	2. Chi Square test
	3. Linear Regression
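
As an example of the first test in this list, the sketch below runs a Welch two-sample t-test with `scipy.stats.ttest_ind`. The two groups are invented values used only to show where the p-value comes from.

```python
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative values only).
group_a = [6.1, 5.8, 7.2, 6.5, 6.9, 5.5, 6.3, 7.0]
group_b = [7.4, 8.1, 6.9, 7.8, 8.5, 7.2, 7.9, 8.3]

# equal_var=False requests Welch's t-test, which does not assume
# that the two groups have equal variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```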


So, when the p-value is low enough, we reject the null hypothesis and conclude the observed effect holds.
But how low is ‘low enough’ for rejecting the null hypothesis?
This level of ‘low enough’ cutoff is called the alpha level, and you need to decide it before conducting a statistical test.


### Statistical Significance (alpha level)
The alpha level is the cutoff probability that the p-value is compared against to establish statistical significance for a given hypothesis test.
Typically, for most statistical tests (but not always), alpha is set to 0.05.

### Confidence Interval
*1. What is Confidence Interval?*

**A confidence interval is a measure that quantifies the uncertainty in an estimated statistic (like the mean of a certain quantity) when the true population parameter is unknown.**<br>
The reason I specifically mention the term ‘population parameter’ is that, when you deal with data, you will usually have data from a smaller sample of the population. While you can easily compute a sample statistic (example: average weight of a sample of 10 adult mice), the true population parameter (example: average weight of ALL adult mice that exist) is usually unknown (not always).<br>
**In simpler terms, a confidence interval provides the lower and upper bounds between which the true value is estimated to lie at a given confidence level. Half the width of this range, the distance from the estimate to either bound, is usually referred to as the ‘margin of error’.**

Let’s understand with an example.<br>
Let’s suppose you want to know the expected annual yield for a given variety of palm tree in a given region. To know this, you start collecting annual yield data from a small sample of trees. Based on this sample data you want to know the expected yield of that variety ‘in general’ and what range of values the annual yield can take, assuming a 95% certainty (confidence).
Understanding confidence intervals can help you answer that question.
However, this is just one example. There can be multiple flavors of ‘confidence interval problems’. In this post, you will encounter such problems.
Also remember, when someone refers to the confidence level of a given statistic, it is usually the 95% confidence level by default. As you increase the confidence level further, the margin of error (and with it the confidence interval band) becomes wider.

*2. Two types of Confidence Intervals problems*

When you speak of confidence intervals, there are largely two types of problems where you would compute confidence intervals. The formula to compute confidence interval changes depending on the type.

1. Confidence interval of a mean.<br>
Example: Find the confidence interval for the mean weight of adult white mice.

2. Confidence interval of a proportion.<br>
Example: Find the confidence interval of the percentage of voters who voted for candidate A in an election (based only on exit polls data).

Depending on the type of problem, you need to apply the appropriate formula to calculate confidence intervals.
Secondly, the approach you take to compute the confidence intervals depends on what information you know about the population.
For example, in most real world cases, you would only be working with a small sample and not know anything about the population, particularly its standard deviation. If that is the case, you will use the T-distribution approach. However, sometimes you may know the population standard deviation, in which case you will use the standard normal distribution based approach.

If this doesn’t make sense to you yet, that’s ok, because all of it will be explained. But you first need to understand the difference between a ‘Population parameter’ and a ‘Sample statistic’.

*3. Difference between Population parameter vs Sample statistic*

Well, as the name suggests, a population parameter (like mean, standard deviation, etc.) is one that is computed or known from the entire population, whereas a sample statistic is computed from a smaller sample drawn from that population.

*4. Confidence Interval Formula*

The formula and method for estimating a confidence interval depend on whether the population’s standard deviation is known or not.
If population standard deviation is known, then:

$$
\text{Confidence Interval} = \bar{X} \pm Z_{crit} \frac{\sigma}{\sqrt{n}}
$$

If population standard deviation is not known, then:

$$
\text{Confidence Interval} = \bar{X} \pm T_{crit} \frac{s}{\sqrt{n}}
$$


Where,

1. $\bar{X}$ is the mean value of the observations,
2. $Z_{crit}$ is the critical Z-score for the respective confidence level from the standard normal distribution,
3. $T_{crit}$ is the critical t-value for the respective confidence level from the T-distribution with $n - 1$ degrees of freedom,
4. $s$ is the standard deviation of the sample,
5. $\sigma$ is the standard deviation of the population,
6. $n$ is the number of observations in the sample.

The main difference in the calculation is that you look up the Z table when the population standard deviation is known, and the T-distribution table otherwise.

Since the formula has $\sqrt{n}$ in the denominator, the more observations you have, the smaller the margin of error and the narrower the confidence interval, which means a more precise estimate.
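
The two formulas above can be sketched in Python as follows. This is an illustration under assumptions: the ten mouse weights are invented, the "known" population standard deviation of 1.4 g is made up for the second case, and the critical values come from `scipy.stats` instead of a printed table.

```python
import math
import statistics
from scipy import stats

# Hypothetical weights (grams) of 10 adult mice; illustrative values only.
weights = [21.5, 23.1, 19.8, 22.4, 20.9, 24.0, 21.7, 22.8, 20.3, 23.5]
confidence = 0.95

n = len(weights)
x_bar = statistics.mean(weights)
s = statistics.stdev(weights)            # sample standard deviation

# Population sigma unknown: t-distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
margin_t = t_crit * s / math.sqrt(n)
ci_t = (x_bar - margin_t, x_bar + margin_t)

# Population sigma known (assumed here to be 1.4 g): standard normal.
sigma = 1.4
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
margin_z = z_crit * sigma / math.sqrt(n)
ci_z = (x_bar - margin_z, x_bar + margin_z)

print(f"t-based 95% CI: ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
print(f"z-based 95% CI: ({ci_z[0]:.2f}, {ci_z[1]:.2f})")
```

For the same confidence level, the t critical value is larger than the Z critical value, which reflects the extra uncertainty of estimating the standard deviation from a small sample.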


*5. Confidence interval of a proportion*

When your sample data is made up of binary values (like 1 vs 0, yes vs no) you are dealing with the problem of proportions.
For example, you might want to find the confidence interval of the proportion of people who voted ‘yes’ in a certain shareholders voting call. In this case, the regular concept of standard deviation does not apply in the sense it does with numeric data.
When you want to compute the confidence interval for proportions, you need to use a slightly different formula:

$$
\text{Confidence Interval} = p \pm Z_{crit} \sqrt{\frac{p(1-p)}{n}}
$$

Where $p$ is the sample’s proportion of interest and $n$ is the sample size.

Let’s look at a simple example.
Problem Statement<br>
Two candidates A and B are contesting in an election. Out of the total 10000 people participating in the election, 100 were randomly surveyed at exit polls. Of those, 27 voted for candidate A.
Find the 99% confidence interval of the true proportion of people who voted for candidate A.

Solution:<br>

Here is what we know:<br>
N = 10000 <br>
n = 100 <br>
p = 27/100 = 0.27 <br>
1-p = 1-0.27 = 0.73 <br>
Since you want 99% CI, Z-critical = 2.576 <br>
Substituting all the values in the formula gives: <br>
Confidence Interval = {0.27 - [2.576 × sqrt(0.27 × 0.73 / 100)], 0.27 + [2.576 × sqrt(0.27 × 0.73 / 100)]} <br>
                    = {0.1556, 0.3844}
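
The worked example above can be reproduced with a few lines of standard-library Python. This is a minimal sketch; `NormalDist().inv_cdf` is used here in place of a Z table, and the variable names are only illustrative.

```python
import math
from statistics import NormalDist

n = 100          # surveyed voters
p = 27 / n       # sample proportion who voted for candidate A
confidence = 0.99

# Critical Z value for a two-sided 99% interval (about 2.576).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z_crit * math.sqrt(p * (1 - p) / n)
lower, upper = p - margin, p + margin

print(f"{confidence:.0%} CI: ({lower:.4f}, {upper:.4f})")
```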

### Bootstrapping in Statistics: How to Compute Confidence Interval with Bootstrapping
First, what does ‘bootstrapping’ mean in statistics?

Bootstrapping is a sampling technique where you repeatedly draw a sample with the same number of observations as the data, but with replacement. You do this for a large number of iterations (say 1000 or more).
By doing this you get a large number of simulated datasets in which some observations are repeated, because the random sampling is done ‘with replacement’.
Now, in each of the simulated samples you compute the statistic, like the ‘mean’ in this case, and note it down across the large number of iterations.
Let’s say you computed the mean of a set of height measurements this way. You would expect the mean computed from each of the simulated samples to be slightly different.

Now, how to calculate the confidence intervals?

The 95% confidence interval of the mean is nothing but the interval that covers the middle 95% of these simulated means.<br>
Because bootstrapping is purely a sampling-based technique, it can be used to estimate confidence intervals regardless of what distribution your data follows.<br>
For the sake of clarity, let’s write down the steps to calculate the confidence interval:<br>
Step 1: Randomly sample, with replacement, as many items as there are in the dataset.<br>
Step 2: Calculate the mean (or whatever statistic) of that sample.<br>
Step 3: Repeat Step 1 and 2 for a large number of iterations and plot them in a graph if you want to visualize. The 95% confidence interval is the range that covers 95% of the simulated means.<br>
Anything outside that 95% interval has a lower probability of occurring.<br>
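
The three steps can be sketched with NumPy as follows. This is a minimal percentile-bootstrap illustration: the heights are invented values, and the iteration count and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heights in cm (illustrative values only).
heights = np.array([158, 162, 165, 167, 170, 171, 173, 175, 178, 181, 184, 169])

n_iterations = 5000
boot_means = np.empty(n_iterations)
for i in range(n_iterations):
    # Step 1: resample as many items as the dataset has, with replacement.
    resample = rng.choice(heights, size=heights.size, replace=True)
    # Step 2: compute the statistic of interest on the resample.
    boot_means[i] = resample.mean()

# Step 3: the middle 95% of the simulated means gives the interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.1f}, {upper:.1f})")
```

Swapping `resample.mean()` for another statistic (a median, say) is all it takes to get an interval for that statistic instead.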


**Central limit theorem:**

It says that the means of repeated samples drawn from a population will follow an approximately bell-shaped normal distribution, even if the source population is not normally distributed, provided that the sample size is large enough and the departure of the data from normality is not too great.
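
A quick simulation can make this tangible. The sketch below assumes an exponential source population (clearly not bell-shaped) with mean 2; the sample size and number of samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A clearly non-normal source population: exponential with mean 2 (and std 2).
sample_size = 50
n_samples = 10_000
samples = rng.exponential(scale=2.0, size=(n_samples, sample_size))

# One mean per sample; these are the values the theorem talks about.
sample_means = samples.mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}")        # close to 2
print(f"std of sample means:  {sample_means.std(ddof=1):.3f}")   # close to 2 / sqrt(50)
```

A histogram of `sample_means` would look roughly bell-shaped and centred on the population mean, even though the individual observations are heavily skewed.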

### A/B Test

**Introduction**

Statistical analysis is our best tool for predicting outcomes we don’t know, using the information we know.
Picture this scenario – You have made certain changes to your website recently. Unfortunately, you have no way of knowing with full accuracy how the next 100,000 people who visit your website will behave. That is the information we cannot know today, and if we were to wait until those 100,000 people visited our site, it would be too late to optimize their experience.
This seems to be a classic Catch-22 situation!
This is where a data scientist can take control. A data scientist collects and studies the data available to help optimize the website for a better consumer experience. And for this, it is imperative to know how to use various statistical tools, especially the concept of A/B Testing.

![image.png](attachment:image.png)

A/B Testing is a widely used concept in most industries nowadays, and data scientists are at the forefront of implementing it. In this article, I will explain A/B testing in-depth and how a data scientist can leverage it to suggest changes in a product.



**What is A/B testing?**

A/B testing is a basic randomized controlled experiment. It is a way to compare two versions of a variable to find out which performs better in a controlled environment.
For instance, let’s say you own a company and want to increase the sales of your product. Here, either you can use random experiments, or you can apply scientific and statistical methods. A/B testing is one of the most prominent and widely used statistical tools.
In the above scenario, you may divide the products into two parts – A and B. Here A will remain unchanged while you make significant changes in B’s packaging. Now, on the basis of the response from customer groups who used A and B respectively, you try to decide which is performing better.

![image-2.png](attachment:image-2.png)

It is a hypothesis testing methodology for making decisions that estimates population parameters based on sample statistics. The population refers to all the customers buying your product, while the sample refers to the customers that participated in the test.
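
One common way to analyze such a test, shown here only as a sketch, is a two-proportion z-test on the purchase rates of the two groups. The counts below are hypothetical, and the choice of a pooled two-sided z-test is an assumption for illustration, not something prescribed by the scenario above.

```python
import math
from statistics import NormalDist

# Hypothetical results: purchases out of customers shown each version.
conversions_a, visitors_a = 120, 2400   # version A (unchanged)
conversions_b, visitors_b = 156, 2400   # version B (new packaging)

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled proportion under H0: both versions convert at the same rate.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(f"A: {p_a:.3f}, B: {p_b:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```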

### ANOVA for statistics in data science

Analysis of variance (ANOVA) is a type of hypothesis test that finds out whether the means of two or more groups are significantly different from each other by analysing the variance within and between the groups. It checks the impact of one or more factors by comparing the means of different samples.
When we have two samples/groups, we can use a t-test to compare their means, but running repeated t-tests across more than two groups inflates the chance of a Type I error; therefore, we use ANOVA.

![image.png](attachment:image.png)

**Why do we use ANOVA testing?**

In machine learning, the biggest problem is selecting the best features or attributes for training the model. We only require those features that are highly dependent on the response variable so that our model is able to predict the actual outcome after training. ANOVA is used for this when the response variable is continuous and the feature being assessed is categorical.

The hypotheses in ANOVA are:
	• H0: μ1 = μ2 = μ3 …
	• H1: Means are not all equal.
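
These hypotheses can be tested with a one-way ANOVA using `scipy.stats.f_oneway`. The three groups below are invented scores, used only to show the call.

```python
from scipy import stats

# Hypothetical scores for three groups (illustrative values only).
group_1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group_2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group_3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# H0: mu1 = mu2 = mu3; H1: the means are not all equal.
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A small p-value would suggest that at least one group mean differs; it does not say which one, so a follow-up comparison is usually needed.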

![image-2.png](attachment:image-2.png)

### Chi-Square Test

We often wonder where the Chi-Square test is useful in machine learning and how this test makes a difference. Feature selection is an important problem in machine learning, where we have several candidate features and have to select the best ones to build the model. The chi-square test helps with feature selection by testing the relationship between the features and the response.

**Chi-Square Test for Feature Selection**

A chi-square test is used in statistics to test the independence of two categorical variables. Given the data of two variables, we can get the observed count O and the expected count E. Chi-Square measures how much the expected count E and the observed count O deviate from each other.

![image.png](attachment:image.png)
									
Let’s consider a scenario where we need to determine the relationship between an independent categorical feature (predictor) and a dependent categorical feature (response). In feature selection, we aim to select the features which are highly dependent on the response.
When two features are independent, the observed count is close to the expected count, so the Chi-Square value is small. A high Chi-Square value therefore indicates that the hypothesis of independence is unlikely to hold. In simple words, the higher the Chi-Square value, the more dependent the feature is on the response, and the better a candidate it is for model training.


Steps to perform the Chi-Square Test:
1. Define Hypothesis.
2. Build a Contingency table.
3. Find the expected values.
4. Calculate the Chi-Square statistic.
5. Reject or fail to reject the Null Hypothesis.


Null Hypothesis (H0): Two variables are independent.<br>
Alternate Hypothesis (H1): Two variables are not independent.
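
The steps above can be walked through on a small contingency table with `scipy.stats.chi2_contingency`. The counts are hypothetical, and `correction=False` is chosen here so that the result matches the plain sum of (O - E)² / E.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Step 2: hypothetical contingency table (illustrative counts only);
# rows = feature category, columns = response category.
observed = np.array([[30, 10],
                     [20, 40]])

# Steps 3 and 4: expected counts and the chi-square statistic.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# The same statistic computed by hand from observed and expected counts.
manual_stat = ((observed - expected) ** 2 / expected).sum()

# Step 5: compare the p-value with the chosen cutoff.
print(f"chi-square = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.5f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```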



**Chi-Square Test using Python**

The code below shows how to perform the Chi-Square test using Python.

```python
from sklearn.feature_selection import chi2

# X (non-negative feature matrix) and y (class labels) are assumed to be
# defined earlier in the notebook.
chi_scores = chi2(X, y)
chi_scores
# (array([ 11.85325057,  51.53992627,   0.15004097, 118.19941432]),
#  array([5.75607838e-04, 7.01557451e-13, 6.98496209e-01, 1.56803624e-27]))
```

Here the first array contains the chi-square values and the second array contains the p-values.<br>
Since the 3rd variable has a high p-value (about 0.70), we fail to reject the hypothesis that it is independent of the response, so it can be left out of model training.

![image.png](attachment:image.png)