**Inferential Statistics**

Inferential statistics allows us to draw conclusions and make predictions about a population based on sample data. Unlike descriptive statistics, which only summarizes data, inferential statistics lets us test hypotheses, make estimates, and measure the uncertainty of those estimates. These tools are essential for evaluating models, testing assumptions, and supporting data-driven decision-making.

For example, instead of surveying every voter in a country, we can survey a few thousand and still draw reliable conclusions about the entire population’s opinion. Inferential statistics provides the tools to do this systematically and mathematically.

***Why Do We Need Inferential Statistics?***
In real-world scenarios, analyzing an entire population is often impossible. Instead, we collect data from a sample and use inferential statistics to:

* Draw conclusions about the whole population.
* Test claims or hypotheses.
* Calculate confidence intervals and p-values to measure uncertainty.
* Make predictions with statistical models.

***Techniques in Inferential Statistics***
Inferential statistics offers several key methods for testing hypotheses, estimating population parameters, and making predictions. Here are the major techniques:

***1. Confidence Intervals***: A confidence interval gives a range of values that is likely to contain the true population parameter, which quantifies the uncertainty of an estimate. The formula for calculating a confidence interval for the mean is:

![image.png](attachment:image.png)  Where:


* $\bar{x}$ is the sample mean
* $Z_{\alpha/2}$ is the Z-value from the standard normal distribution (e.g., 1.96 for a 95% confidence interval)
* σ is the population standard deviation
* n is the sample size

For example, if we measure the average height of 100 people, a 95% confidence interval gives us a range where the true population mean height is likely to fall. This helps gauge the precision of our estimate and compare models (like in A/B testing).
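The height example can be worked through numerically. The sketch below assumes the population standard deviation is known and uses made-up values (a sample mean of 170 cm and σ = 10 cm); the function name `z_confidence_interval` is illustrative, not taken from any library.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) bounds for the population mean, assuming sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when confidence is 0.95
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical numbers: 100 people, sample mean 170 cm, known sigma of 10 cm.
lower, upper = z_confidence_interval(sample_mean=170.0, sigma=10.0, n=100)
print(f"95% CI for the mean height: ({lower:.2f}, {upper:.2f}) cm")
```

With these inputs the margin of error is about 1.96 cm, so the interval runs from roughly 168.04 cm to 171.96 cm. A larger sample shrinks the margin in proportion to 1/√n.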

***2. Hypothesis Testing***: Hypothesis testing is a formal procedure for testing claims or assumptions about data. It involves the following steps:

* Null Hypothesis (H₀): The default assumption, such as “there’s no difference between two models.”
* Alternative Hypothesis (H₁): The claim you are looking for evidence of, such as “Model A performs better than Model B.”

We then collect data and compute a test statistic (such as Z for a Z-test or t for a t-test):

![image-2.png](attachment:image-2.png)  Where:

* $\bar{x}$ is the sample mean
* μ is the hypothesized population mean
* σ is the population standard deviation
* n is the sample size

After calculating the test statistic, we compare it with a critical value or use a p-value to decide whether to reject or fail to reject the null hypothesis. If the p-value is smaller than the significance level α (usually 0.05), we reject the null hypothesis.
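As a rough illustration of this procedure, the following sketch computes a one-sample Z statistic and a two-sided p-value from the quantities defined above. The input values and the helper name `one_sample_z_test` are hypothetical, and σ is assumed to be known.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: population mean equals mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: sample mean 52, hypothesized mean 50, sigma 10, n = 100.
z, p_value = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```

Here z is 2.0 and the p-value is about 0.0455, which is just under 0.05, so the null hypothesis is rejected at the 5% level.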

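***3. Central Limit Theorem***: The Central Limit Theorem states that, for a sufficiently large sample size, the distribution of the sample mean is approximately normal, whatever the shape of the underlying population distribution (provided its variance is finite):

![image-3.png](attachment:image-3.png)  Where:

* μ is the population mean
* σ is the population standard deviation
* n is the sample size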

This theorem allows us to apply normal distribution-based methods even when the original data is not normally distributed, such as in cases with skewed income or shopping behavior data.
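A small simulation can make this concrete. The sketch below draws repeated samples from a strongly skewed exponential population (mean 1, standard deviation 1, chosen only for illustration) and checks that the sample means centre on μ with a spread close to σ/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, trials=2000):
    """Means of `trials` samples of size n from an exponential population (mean 1, sd 1)."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]

for n in (2, 30, 200):
    means = sample_means(n)
    print(
        f"n={n:>3}: mean of means = {statistics.fmean(means):.3f}, "
        f"sd of means = {statistics.stdev(means):.3f}, "
        f"sigma/sqrt(n) = {1 / n ** 0.5:.3f}"
    )
```

Plotting a histogram of `means` for each n would also show the shape moving from skewed toward a bell curve as n grows.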

***Errors in Inferential Statistics***
In hypothesis testing, Type I Error and Type II Error are key concepts:

***Type I Error*** occurs when we wrongly reject a true null hypothesis. The probability of making a Type I error is denoted by α (the significance level).

***Type II Error*** occurs when we fail to reject a false null hypothesis. The probability of making a Type II error is denoted by β, and the power of the test is given by 1−β.

The goal is to minimize these errors by carefully selecting sample sizes and significance levels.
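To see how sample size affects these errors, here is a minimal sketch that computes the power of a one-sided Z-test analytically. It assumes σ is known and a hypothetical true shift of 2 units; `power_one_sided_z` is an illustrative helper, not a library function.

```python
import math
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test of H0: mu = mu0 against H1: mu > mu0.

    `effect` is the true shift mu - mu0; sigma is assumed to be known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when Z > z_crit
    shift = effect / (sigma / math.sqrt(n))  # mean of Z when H1 is true
    return 1 - std_normal.cdf(z_crit - shift)

for n in (25, 50, 100, 200):
    power = power_one_sided_z(effect=2.0, sigma=10.0, n=n)
    print(f"n={n:>3}: power = {power:.3f}, beta = {1 - power:.3f}")
```

With α held at 0.05, increasing n lowers β. Lowering α for a fixed n raises β, which is why the two are chosen together.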

***Parametric and Non-Parametric Tests***

Statistical tests help decide whether the data supports a hypothesis. They calculate a test statistic that shows how much the data differs from what the null hypothesis predicts. The statistic is then compared to a critical value, or converted to a p-value, to decide whether to reject or fail to reject the null hypothesis.

***Parametric Tests***: These tests assume that the data follows a specific distribution (often normal) and, in many cases, that the groups being compared have equal variance. They are typically used for continuous data. Examples include the Z-test, t-test, and ANOVA. These tests are effective for comparing models or measuring performance when the assumptions are met.

***Non-Parametric Tests***: Non-parametric tests do not assume a specific distribution for the data, making them ideal for small samples or non-normal data, including categorical or ranked data. Examples include the Chi-Square test, Mann-Whitney U test, and Kruskal-Wallis test. They are useful when data is skewed or categorical, such as customer ratings or behaviors.
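The contrast between the two families can be sketched by running one test of each kind on the same data. This example assumes NumPy and SciPy are installed and uses synthetic right-skewed data, so the specific numbers carry no meaning beyond illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed data for two groups (e.g. time on page in seconds).
group_a = rng.exponential(scale=30.0, size=40)
group_b = rng.exponential(scale=45.0, size=40)

# Parametric: Welch's t-test compares means without assuming equal variances.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-parametric: Mann-Whitney U works on ranks, so it does not assume normality.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test:   t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4f}")
```

The two tests answer slightly different questions (a difference in means versus a shift in ranks), so their p-values need not agree, especially on skewed data.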

***Example: Evaluating a New Delivery Algorithm Using Inferential Statistics***
A quick commerce company wants to check if a new delivery algorithm reduces delivery times compared to the current system.

***Experiment Setup***:

* 100 orders are split into two groups: 50 with the new algorithm, 50 with the current system.
* Delivery times for both groups are recorded.

**Steps**

***Hypotheses***:

* Null (H₀): The new algorithm does not reduce delivery time.
* Alternative (H₁): The new algorithm reduces delivery time.

***Significance Level***:

Set at 0.05 (5% risk of wrongly rejecting H₀).

* Type I error: Thinking the new system is better when it isn’t.
* Type II error: Missing a real improvement.

***Test Statistic***: The difference in average delivery time between the two groups.

***Analysis***:

* Calculate the group means and their difference.
* Check whether the data is roughly normal.
* Perform a t-test or z-test.

If the p-value is below 0.05, reject H₀ and conclude that the new algorithm reduces delivery time. Otherwise, the data does not show a clear improvement.

***Confidence Interval***: For example, a confidence interval of -5 to -2 minutes for the difference in mean delivery time (new minus current) means deliveries are estimated to be 2 to 5 minutes faster with the new system.
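The whole workflow can be sketched end to end. The delivery times below are synthetic (drawn from normal distributions with made-up means of 30 and 27 minutes), and the test is a large-sample two-sample z-test using the sample standard deviations; a t-test would be the more cautious choice for groups of 50.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic delivery times in minutes (made up for illustration only).
current = [random.gauss(30.0, 5.0) for _ in range(50)]
new = [random.gauss(27.0, 5.0) for _ in range(50)]

# Difference in sample means (new - current); negative means faster deliveries.
diff = fmean(new) - fmean(current)
std_err = math.sqrt(stdev(new) ** 2 / len(new) + stdev(current) ** 2 / len(current))

# One-sided test of H0: no reduction, against H1: new mean < current mean.
z = diff / std_err
p_value = NormalDist().cdf(z)

# 95% confidence interval for the difference in means.
z_crit = NormalDist().inv_cdf(0.975)
ci = (diff - z_crit * std_err, diff + z_crit * std_err)

alpha = 0.05
print(f"difference = {diff:.2f} min, z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f}) min")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

An interval that lies entirely below zero points the same way as a rejected H₀: the new algorithm is estimated to be faster.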

***Covariance and correlation*** are two key concepts in statistics that help us analyze the relationship between two variables. Covariance measures how two variables change together, indicating whether they move in the same or opposite directions.

![image.png](attachment:image.png)

To understand this relationship better, consider factors like sunlight, water, and soil nutrients (as shown in the image), which are independent variables that influence plant growth, our dependent variable. Covariance tells us whether each of these factors tends to move in the same direction as plant growth or in the opposite direction.

***What is Covariance?***

Covariance is a statistical measure of the relationship between a pair of random variables: it describes how a change in one variable is associated with a change in the other, without implying that one causes the other. It assesses how much two variables deviate together from their mean values and is calculated by taking the average of the product of the deviations of each variable from its mean. Covariance tells us the direction of the relationship but not how strong it is, because its magnitude depends on the units used.

* It can take any value from -infinity to +infinity; a negative value indicates a negative relationship and a positive value indicates a positive relationship.
* It captures only the linear relationship between variables.
* It gives the direction of the relationship between variables.

***Covariance Formula***

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
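The calculation described in the text (the average product of deviations from the means) can be written as a short function. This is a sketch using the standard sample and population definitions; the sunlight and growth values are hypothetical.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equal-length sequences.

    sample=True divides by n - 1 (sample covariance);
    sample=False divides by n (population covariance).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical data: hours of sunlight and plant growth in cm.
sunlight = [2, 4, 6, 8, 10]
growth = [1.0, 2.5, 3.0, 4.5, 6.0]

print(covariance(sunlight, growth))                # sample covariance
print(covariance(sunlight, growth, sample=False))  # population covariance
```

For this data the sum of products of deviations is 24, giving a sample covariance of about 6.0 and a population covariance of about 4.8. Both are positive, so growth tends to rise with sunlight.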

***Types of Covariance***
* Positive Covariance: When one variable increases, the other variable tends to increase as well and vice versa.
* Negative Covariance: When one variable increases, the other variable tends to decrease.
* Zero Covariance: There is no linear relationship between the two variables. This does not by itself mean the variables are independent, since a non-linear relationship can still exist.

![image-5.png](attachment:image-5.png)


***What is Correlation?***

Correlation is a standardized measure of the strength and direction of the linear relationship between two variables. It is derived from covariance and ranges between -1 and 1. Unlike covariance, whose magnitude depends on the units of the variables, correlation is unitless, so its value can be compared across different pairs of variables.

* Positive Correlation (close to +1): As one variable increases, the other variable also tends to increase.
* Negative Correlation (close to -1): As one variable increases, the other variable tends to decrease.
* Zero Correlation: There is no linear relationship between the variables.

The correlation coefficient ρ (rho) for variables X and Y is defined as:

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.

* Correlation takes values between -1 and +1; values close to +1 represent a strong positive correlation and values close to -1 represent a strong negative correlation.
* Correlation does not imply causation: the variables may be only indirectly related to each other.
* It gives the direction and strength of the relationship between variables.

***Correlation Formula***

![image.png](attachment:image.png)

Here,

* x' and y' = means of the two samples
* n = total number of samples
* $x_i$ and $y_i$ = individual sample values

![image-2.png](attachment:image-2.png)
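A sample correlation coefficient can be computed directly from the deviations of each value from its mean. The sketch below uses the standard Pearson formula with hypothetical data; `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours of sunlight, plant growth in cm, and hours of shade.
sunlight = [2, 4, 6, 8, 10]
growth = [1.0, 2.5, 3.0, 4.5, 6.0]
shade = [10, 8, 6, 4, 2]  # decreases linearly as sunlight increases

print(round(pearson_r(sunlight, growth), 4))
print(round(pearson_r(sunlight, shade), 4))
```

Sunlight and growth give a coefficient of about 0.99 (a strong positive relationship), while sunlight and shade give -1 because one is an exact decreasing linear function of the other.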


***Difference between Covariance and Correlation***

This table shows the difference between Covariance and Correlation:

| Covariance | Correlation |
| :--- | :--- |
| Covariance is a measure of how much two random variables vary together. | Correlation is a statistical measure that indicates how strongly two variables are related. |
| Involves the relationship between two variables or data sets | Involves the relationship between multiple variables as well |
| Lies between -infinity and +infinity | Lies between -1 and +1 |
| An unstandardized measure of association | Scaled version of covariance |
| Provides direction of relationship | Provides direction and strength of relationship |
| Dependent on the scale of the variables | Independent of the scale of the variables |
| Has dimensions | Dimensionless |

The key difference is that covariance shows the direction of the relationship between variables, while correlation shows both the direction and strength in a standardized form.
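The scale dependence is easy to check numerically. In the sketch below, which uses hypothetical height and weight values, converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def cov_and_corr(xs, ys):
    """Return (sample covariance, Pearson correlation) for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (n - 1), sxy / math.sqrt(sxx * syy)

# Hypothetical measurements for five people.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52.0, 58.0, 66.0, 75.0, 83.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

cov_m, corr_m = cov_and_corr(heights_m, weights_kg)
cov_cm, corr_cm = cov_and_corr(heights_cm, weights_kg)

print(f"metres:      covariance = {cov_m:.3f}, correlation = {corr_m:.4f}")
print(f"centimetres: covariance = {cov_cm:.3f}, correlation = {corr_cm:.4f}")
```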

***Applications of correlation***
* Market Research: Correlation is used to identify relationships between consumer behavior and sales trends, helping businesses make informed marketing decisions.
* Medical Research: Correlation helps in understanding the relationship between different health indicators, such as the correlation between blood pressure and cholesterol levels.
* Weather Forecasting: Correlation is used to analyze the relationship between various meteorological variables, such as temperature and humidity, to improve weather predictions.
* Machine Learning: Correlation analysis is used in feature selection to identify which variables have strong relationships with the target variable, improving model accuracy.

***Applications of Covariance***

* Portfolio Management in Finance: Covariance is used to measure how different stocks or financial assets move together, aiding in portfolio diversification to minimize risk.
* Genetics: In genetics, covariance can help understand the relationship between different genetic traits and how they vary together.
* Econometrics: Covariance is employed to study the relationship between different economic indicators, such as the relationship between GDP growth and inflation rates.
* Signal Processing: Covariance is used to analyze and filter signals in various forms, including audio and image signals.
* Environmental Science: Covariance is applied to study relationships between environmental variables, such as temperature and humidity changes over time.