## Population v/s Sample

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*7IV-AEckvESm5UHWRXMHmw.png" width="400px"><br>
</p>

- A Population is the entire group of individuals or events that we want to study and draw conclusions about.
- A Sample is a subset of the population that we select and study in order to draw conclusions about the entire population.
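
The distinction can be sketched in a few lines of Python. This is only an illustration: the population below is synthetic (randomly generated ages), and the names `population` and `sample` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: ages of 10,000 adults (synthetic, for illustration only)
population = rng.integers(low=18, high=80, size=10_000)

# A simple random sample of 50 individuals, drawn without replacement
sample = rng.choice(population, size=50, replace=False)

print(f"Population size: {population.size}, sample size: {sample.size}")
print(f"Population mean: {population.mean():.2f}, sample mean: {sample.mean():.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean.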

## Parameter v/s Estimate

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:624/format:webp/1*ZVtg0VwAQOE8XqZbZn_Duw.gif" width="400px"><br>
</p>

- Parameters are numerical values that describe a population, such as the population mean ‘mu’ ($\mu$) and the population standard deviation ‘sigma’ ($\sigma$). Parameter values are generally unknown, so we estimate them from sample data. Any population-level measurement is called a ‘Parameter’.
- A ‘Statistic’ is a numerical value computed from a sample, such as the sample ‘mean’, ‘standard deviation’, or ‘variance’. When a statistic is used to approximate a population parameter, it is called an ‘Estimate’.
- By computing statistics from sample data, we can make inferences about the corresponding unknown parameters of the population. Common statistics are the sample mean ‘X bar’ ($\bar{X}$), the sample median, and the sample standard deviation ‘s’.
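
A minimal sketch of the difference, assuming a synthetic population of heights (the numbers are invented for illustration). Note the `ddof` argument: the population standard deviation divides by $N$, while the sample standard deviation divides by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=170, scale=10, size=100_000)  # hypothetical heights in cm

# Parameters: computed from the whole population (usually unknown in practice)
mu = population.mean()
sigma = population.std(ddof=0)  # divides by N

# Statistics: computed from a sample and used to estimate the parameters
sample = rng.choice(population, size=100, replace=False)
x_bar = sample.mean()
s = sample.std(ddof=1)  # divides by n - 1

print(f"Parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```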

## Inferential Statistics

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*q3D9OJV_r1BfAsemI45VLQ.png" width="400px"><br>
</p>

- Inferential statistics solves a very big problem: using a small sample, we make inferences or predictions about the whole population. Any work of this kind falls under ‘Inferential Statistics’.
- Inferential statistics includes techniques like ‘Hypothesis Testing’, ‘Confidence Intervals’, and ‘Regression Analysis’, among others.
- The types of questions we can answer through ‘Inferential Statistics’:

1. Is there a significant difference between the two groups?
2. Can we predict the outcome of a variable based on the values of another variable?
3. What is the relationship between two or more variables?

## Point Estimate

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*o0DRVEcNMov3cmCZEUy2jg.png" width="400px"><br>
</p>

A point estimate is a ‘Single Value’ calculated from a sample that serves as the best guess or approximation for an unknown population parameter, such as the mean or standard deviation. Point estimates are often used to make inferences about a population based on a sample.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*6b0K476JAQgKwmTP5dnykw.jpeg" width="400px"><br>
  <em>Sample Mean</em>
</p>

- Suppose I’m interested in determining the average age of adults in my society. However, it’s impractical to ask the age of every resident in each household. Instead, I opt to gather data exclusively from adult attendees of society meetings and individuals frequenting commercial areas within the vicinity.
- Firstly, I collect the ages of adults attending society meetings, representing a diverse range of households within the community. Additionally, I survey adults visiting commercial areas, where individuals from various demographics converge.
- After gathering data from these sources, I calculate the mean age for each sample group of adults. These individual means serve as point estimates for the average age within their respective segments.

- To refine the estimate further, I repeat this process multiple times, sampling 50 adults each time. After conducting this procedure 10 times, I calculate the average of the 10 sample means. This final calculated mean provides a more precise point estimate for the overall average age of adults in society.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*w9nqM3BQhTRsigmZf8RuXg.jpeg" width="400px"><br>
</p>

In summary, by collecting data from adult attendees of society meetings and individuals in commercial areas, and averaging multiple sample means, I can derive a more stable point estimate for the average age of adults in the society, provided the people surveyed are representative of all adults there.
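
The repeated-sampling idea can be simulated as follows. This is a sketch under stated assumptions: the ages are synthetic, and the samples are drawn completely at random (unlike the meeting and commercial-area survey described above), so it illustrates only the averaging step.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical adult ages for a society of 5,000 people (synthetic data)
population = rng.integers(18, 90, size=5_000)

n_repeats, n = 10, 50  # 10 samples of 50 adults each, as in the example

# Each sample mean is itself a point estimate of the population mean
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_repeats)
])

# Averaging the 10 sample means gives the final point estimate
point_estimate = sample_means.mean()

print("Sample means:", np.round(sample_means, 1))
print(f"Point estimate: {point_estimate:.2f} (true mean: {population.mean():.2f})")
```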



## Confidence Interval

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*OmzebupIFJx6hWFcKJWV1Q.jpeg" width="400px"><br>
</p>

- Calculating only the point estimate and relying solely on it is not sufficient, because a single value computed from a sample will almost never equal the population parameter exactly. So we build a ‘Range’ around the point estimate that covers the plausible values of the ‘Population parameter’. This range is called the **Confidence Interval**.
- It's like asking, *How confident are you that a given range is a good estimate for the population?*
- In simple words, a Confidence Interval is a range of values within which we expect a particular population parameter, like a mean, to fall.
- It’s a way to express the uncertainty around an estimate obtained from a sample of data.
- **Confidence Level**, usually expressed as a percentage like 95%, indicates how sure we are that the true value lies within the interval.
- Every confidence interval is reported together with its confidence level, for example ‘a 95% confidence interval’.

### Confidence Interval Formula

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*yROm6LT4hlIPyX-Z8KP-BQ.jpeg" width="400px"><br>
</p>

Steps to calculate CI:
1. Find the **Point Estimate**.
2. Compute the **Margin of Error**.
3. Determine the **Range (Lower & Upper Bound)**.
4. Report the **Confidence Level** (%).

## Ways to Calculate CI

1. **Z Procedure** - When we know the population standard deviation (σ).
2. **t Procedure** - When we do not have the population standard deviation.

**Note:**  
Confidence Intervals are created for **parameters**, not for estimates/statistics. We create a confidence interval for a population parameter using sample statistics.

### Examples of Confidence Intervals

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:750/format:webp/1*Y5VF6DY-vJBW6mplnuAnUA.png" width="400px"><br>
  <em>The ‘|’ line represents the confidence interval</em>
</p>

- Confidence intervals are widely used in finance and many other industries:
  - **Finance:** Stock market analysis, to assess potential risk and return.
  - **Economics:** Estimate parameters like unemployment rates, inflation, GDP growth.
  - **Marketing:** Customer satisfaction, market share, response rates.
  - **Healthcare:** Drug effectiveness, treatment outcome differences.
  - **Manufacturing:** Defect rates, product dimensions.

## Applications of CI in Machine Learning & Data Science

Definition: A confidence interval provides a range of plausible values for a parameter with a specified confidence (e.g., 95% CI means we are 95% confident the true value lies within this range).

### 1. Model Performance Evaluation

**Accuracy with Uncertainty:**
- Report: `"Model accuracy: 87.3% (95% CI: 84.1% to 90.5%)"`
- Compare models: If the CIs don't overlap, one model is significantly better. Overlapping CIs do not by themselves prove there is no difference.
- Bootstrap CI: Resample data to estimate performance variability.

### 2. A/B Testing & Experimentation

**Feature/Algorithm Comparison:**
- Test new recommendation algorithm vs baseline.
- Example: Algorithm A CTR = 5.2% (CI: 4.8%-5.6%), Algorithm B = 6.1% (CI: 5.7%-6.5%).
- Decision: Non-overlapping CIs → Algorithm B is significantly better.
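
One common way to obtain such intervals is the normal-approximation (Wald) interval for a proportion. The sketch below is an assumption about how the CTR intervals might be computed, not a statement of the method behind the example; the click counts are hypothetical, chosen so the intervals land near the ones quoted above.

```python
import numpy as np
from scipy import stats

def proportion_ci(successes, trials, confidence=0.95):
    """Normal-approximation (Wald) CI for a proportion such as CTR."""
    p_hat = successes / trials
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z_crit * np.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin

# Hypothetical click counts: 12,000 impressions per algorithm
ci_a = proportion_ci(successes=624, trials=12_000)  # CTR = 5.2%
ci_b = proportion_ci(successes=732, trials=12_000)  # CTR = 6.1%

print(f"Algorithm A: {ci_a[0]:.1%} to {ci_a[1]:.1%}")
print(f"Algorithm B: {ci_b[0]:.1%} to {ci_b[1]:.1%}")
print("Intervals overlap:", ci_a[1] >= ci_b[0])
```

Comparing two separate intervals is a conservative check; a confidence interval for the difference in CTR is the more direct way to compare the two algorithms.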

### 3. Hyperparameter Tuning

**Cross-Validation Results:**
- Report mean CV score with CI.
- Example: `"5-fold CV accuracy: 0.85 ± 0.03 (95% CI: 0.82-0.88)"`
- Helps assess model stability and generalization.

### 4. Regression Coefficients

**Feature Importance:**
- CI for each coefficient shows if a feature is significant.
- If CI excludes 0 → feature has a significant effect.
- Example: `"Income coefficient: 0.45 (95% CI: 0.32 to 0.58)"` → significant predictor.

### 5. Prediction Intervals

**Uncertainty in Predictions:**
- Point prediction + uncertainty range.
- Example: `"Predicted house price: $350K (95% PI: $320K-$380K)"`
- Wider interval = higher prediction uncertainty.

### 6. Time Series Forecasting

**Future Predictions:**
- Example: `"Next month sales: 5000 units (95% CI: 4500-5500)"`
- Shows range of likely outcomes.
- Critical for risk assessment and planning.

### 7. Bias Detection

**Fairness Metrics:**
- Estimate performance across demographic groups.
- Example: `"Accuracy for Group A: 85% (CI: 82%-88%), Group B: 78% (CI: 74%-82%)"`
- Non-overlapping CIs indicate potential bias.



### Quick Implementation (Python)
```python
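import numpy as np

# Bootstrap (percentile) CI for the mean of a set of scores
def bootstrap_ci(scores, confidence=0.95, n_bootstrap=1000, seed=0):
    rng = np.random.default_rng(seed)  # seeded so results are reproducible
    scores = np.asarray(scores)
    n = len(scores)
    # Resample with replacement and record the mean of each resample
    bootstrap_means = [np.mean(rng.choice(scores, size=n, replace=True))
                       for _ in range(n_bootstrap)]
    lower = np.percentile(bootstrap_means, (1 - confidence) / 2 * 100)
    upper = np.percentile(bootstrap_means, (1 + confidence) / 2 * 100)
    return lower, upper

# Usage
accuracies = [0.85, 0.87, 0.84, 0.86, 0.88]  # From cross-validation
lower, upper = bootstrap_ci(accuracies)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")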
```
**Note:** Always report CIs with model metrics to show uncertainty.  
- Narrow CI = reliable estimate.  
- Wide CI = need more data or model is unstable.

## Assumptions of Z-Procedure

Before applying the Z-procedure, it is crucial to check the underlying assumptions. **If any of these assumptions are violated, the Z-procedure should not be applied.**

**There are 3 key assumptions:**

1. **Random Sampling** - The sample must be randomly selected from the population.
   - This ensures that the sample is representative and unbiased.

2. **Known Population Standard Deviation ($\sigma$)** - We must know the population standard deviation.
   - In practice, this is often unknown, which is why the t-procedure is used instead.

3. **Normal Distribution or Large Sample Size** - The population distribution should be normal, or the sample should be large.
   - If the population is not normal, rely on the **Central Limit Theorem (CLT)**:  
     - Use a sample size of $n \geq 30$ (a common rule of thumb).
     - If we drew many such samples and computed the mean of each, the distribution of those sample means would be approximately normal, even though the population itself is not.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*aF8530nzp4iKv-2bOBzliQ.jpeg" width="300px">
  <br>
</p>
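
The Central Limit Theorem can be checked with a quick simulation. The sketch below assumes a skewed exponential population with mean 2 (an arbitrary choice for illustration) and draws many samples of size 30.

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed, clearly non-normal population: exponential with mean 2 (hypothetical)
n, n_samples = 30, 5_000

# Draw 5,000 samples of size 30 and compute the mean of each one
sample_means = rng.exponential(scale=2.0, size=(n_samples, n)).mean(axis=1)

# The sample means cluster around the population mean, with spread sigma / sqrt(n)
print(f"Mean of sample means: {sample_means.mean():.3f}")        # close to 2
print(f"Std of sample means:  {sample_means.std(ddof=1):.3f}")   # close to 2 / sqrt(30)
print(f"Theoretical std:      {2.0 / np.sqrt(n):.3f}")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve, even though the underlying exponential population is strongly skewed.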



### Procedure to Calculate Confidence Interval using Z

1. Identify the **point estimate** ($\bar{X}$) from the sample.
2. Use the **population standard deviation** ($\sigma$) and **sample size** ($n$).
3. Choose the **confidence level** (e.g., 90%, 95%, 99%).  
   - Determine $Z_{\alpha/2}$ from the Z-table corresponding to the confidence level.
4. Apply the **CI formula**:

$$\text{CI} = \bar{X} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

5. The resulting range is the **Confidence Interval** for the population parameter.
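
The steps above can be sketched in Python as follows. The function name `z_confidence_interval`, the sample values, and the assumed $\sigma = 2$ are all hypothetical; `scipy.stats.norm.ppf` stands in for the Z-table lookup.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(sample, sigma, confidence=0.95):
    """CI for the population mean when sigma is known (Z-procedure)."""
    x_bar = np.mean(sample)                 # step 1: point estimate
    n = len(sample)                         # step 2: sample size (sigma is given)
    alpha = 1 - confidence
    z_crit = stats.norm.ppf(1 - alpha / 2)  # step 3: Z_{alpha/2}, replaces the Z-table
    margin = z_crit * sigma / np.sqrt(n)    # step 4: margin of error
    return x_bar - margin, x_bar + margin   # step 5: lower and upper bound

# Hypothetical sample, with the population standard deviation assumed to be 2
sample = [23, 25, 27, 24, 26, 28, 22, 25, 24, 26]
lower, upper = z_confidence_interval(sample, sigma=2.0)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```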

**Summary:**
- Z-procedure is straightforward but requires known $\sigma$ and either a normal population or a large enough sample ($n \geq 30$).  
- If these conditions are not met, the t-procedure should be used instead.

## Formula and Intuition of Z-Procedure

**Confidence Interval (Z-procedure)** *When population standard deviation ($\sigma$) is known*

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*7gNtWVmIBuBJDHK86X1V7w.jpeg" width="300px">
  <br>
</p>



- Here, **$\sigma$ known** means the **population standard deviation** is available.  
- In most cases, population parameters are unknown, but if **$\sigma$** is known, we apply the **Z-procedure**.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*tNfSgWOFe_JkeNNeFR1qyQ.png" width="300px">
  <br>
  <em>Z score when $\sigma$ is known</em>
</p>

### Formula for Confidence Interval (Z-Procedure)

$$\text{CI} = \bar{X} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

**Where:**
- $\bar{X}$ = sample mean (point estimate)  
- $\sigma$ = population standard deviation  
- $n$ = sample size  
- $Z_{\alpha/2}$ = Z value corresponding to the desired confidence level (from Z-table)  

**Intuition:**
1. The **point estimate** ($\bar{X}$) is the best guess for the population mean.  
2. Multiply the **standard error** ($\sigma / \sqrt{n}$) by the Z value to account for variability.  
3. The range $\bar{X} \pm Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$ gives the **Confidence Interval**.  
4. The confidence level (e.g., 95%) tells us how confident we are that the true population mean lies within this interval.  

**Summary:**
- Z-procedure is simple and direct if the population standard deviation is known.  
- For unknown $\sigma$, we need the **t-procedure** instead.

## Interpreting CI

### 1. Confidence Level:
- The confidence level (commonly set at 90%, 95%, or 99%) represents the probability that the confidence interval will contain the true population parameter if the sampling and estimation process are repeated multiple times.
- Suppose that, in the average-age example for my society, the 95% confidence interval for the population mean age is 18 to 42. What does that mean?
- Imagine repeating the study 100 times: each time, pick 50 random people, compute their average age, and build a 95% confidence interval. About 95 of those 100 intervals would contain the true population mean age. The interval 18 to 42 is one such interval, so we say we are 95% confident that it contains the true mean.
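
This repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with a known mean and standard deviation (values invented for illustration) and counts how often a 95% Z-interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

mu, sigma, n = 30.0, 12.0, 50  # hypothetical population mean, std, and sample size
n_trials = 2_000               # number of repeated studies
z_crit = stats.norm.ppf(0.975)

# Each row is one study: 50 random people drawn from the population
samples = rng.normal(mu, sigma, size=(n_trials, n))
means = samples.mean(axis=1)
margin = z_crit * sigma / np.sqrt(n)

# Check, for every study, whether its interval contains the true mean
covered = (means - margin <= mu) & (mu <= means + margin)
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95, which is what the 95% confidence level promises over many repetitions.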



---

### 2. Interval Range:
- The width (difference between lower and upper bound) of the confidence interval indicates the precision of the estimate. A narrower confidence interval suggests a more precise estimate of the population parameter, while a wider interval indicates greater uncertainty.
- The width of the interval depends on the sample size, variability in the data, and the desired level of confidence.

---

### 3. Interpretation:
- To interpret the confidence interval values, we can say that we are 95% confident that the true population parameter lies within the range (lower and upper limit).
- This statement is about the interval, not the specific point estimate, and it refers to the confidence level we choose when constructing the interval.

---

### **Factors Affecting Margin of Error:**

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:634/format:webp/1*2yLaA4DYU9Z4bbx6TOILLw.png" width="300px">
</p>

1. Confidence Level ($1 - \alpha$)
2. Sample Size ($n$)
3. Population Standard Deviation ($\sigma$)

**$\text{CI} = \text{Point estimate} \pm \text{Margin of error}$**

The margin of error = $(\text{upper} - \text{lower}) / 2$

$$\text{Margin of Error} = Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*xhivmfyKP94UCPJHlwoySQ.png" width="400px">
</p>

As the formula shows, the margin of error (and therefore the width of the confidence interval) depends on three things: the **“Critical Value”**, the **“Population Standard Deviation”**, and the **“Sample Size”**.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*lnEJptewbc6gwKTqZe7mzg.png" width="300px">
  <br>
  <em>The margin of error increases as the confidence level increases</em>
</p>

- As the confidence level decreases (for example, from 95% to 75%), the critical value shrinks, so the margin of error and the width of the interval shrink too.
- A narrower interval looks more precise, but we are less confident that it actually contains the true parameter.

> **Ex:**
> - Critical value ($Z$-score): 0.14
> - Margin of error: 0.29
> - Confidence Interval: (49.71, 50.29)

- Conversely, a higher confidence level needs a larger critical value, so the margin of error grows and the interval widens. The wider range is what allows us to be more confident that it contains the true parameter.

> **Ex:**
> - Critical value ($Z$-score): 1.96
> - The margin of error: 4.16
> - Confidence Interval: (45.84, 54.16)
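
The effect of each factor can be explored with a small helper. This is a sketch: `margin_of_error` is a hypothetical function name, and $\sigma = 15$ and $n = 50$ are assumptions chosen for illustration (with these values the 95% margin comes out near 4.16).

```python
import numpy as np
from scipy import stats

def margin_of_error(sigma, n, confidence):
    """Z-procedure margin of error: Z_{alpha/2} * sigma / sqrt(n)."""
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z_crit * sigma / np.sqrt(n)

sigma, n = 15, 50  # hypothetical population std and sample size

# 1. Higher confidence level -> larger margin of error
for confidence in (0.75, 0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: margin = {margin_of_error(sigma, n, confidence):.2f}")

# 2. Larger sample size -> smaller margin (4x the data halves the margin)
print(f"n = 200: margin = {margin_of_error(sigma, 200, 0.95):.2f}")

# 3. Larger population std -> larger margin
print(f"sigma = 30: margin = {margin_of_error(30, n, 0.95):.2f}")
```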

## T-procedure and t-distribution

### Confidence Interval t-Procedure
**When Population Standard Deviation (Sigma) is Not Known ($\sigma$)**

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*kVQIzG34u94g2sE1mG0vUQ.png" width="300px">
</p>

The t-procedure does not require the population standard deviation. In real life, we mostly use the t-procedure because population parameters are typically unknown.

**Assumptions:**

1.  **Random Sampling:**
    * Like the Z-procedure, data must be collected using random sampling.
    * The sample should be representative of the entire population, not specific to any one area.

2.  **Sample Standard Deviation:**
    * Since we do not know the population standard deviation ($\sigma$), we use the sample standard deviation ($s$) as an estimate.

3.  **Approximately Normal Distribution:**
    * The t-procedure assumes the underlying population is approximately normally distributed, or the sample size is large enough ($n > 30$) to apply the Central Limit Theorem.
    * If the population distribution is heavily skewed or contains extreme outliers, the t-procedure may not be accurate, and non-parametric methods should be considered.

4.  **Independent Observations:**
    * The occurrence of one sample should not be related to another.
    * The value of one observation must not affect the value of another observation.
    * This is particularly important when working with time-series data or data with inherent dependencies.

---

### Transitioning from Z to t

According to the Z-procedure for calculating a Confidence Interval, we use the population standard deviation ($\sigma$):

> $$\text{CI} = \bar{X} \pm Z_{\alpha/2} \left(\frac{\sigma}{\sqrt{n}}\right)$$

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*2bCmNkH2q5cRaQSFSh1TgA.png" width="300px">
</p>

**But in the t-procedure, we do not know "Sigma" ($\sigma$); therefore, the next best estimate we have is the "Sample standard deviation" ($S$).**

Modifying the formula to replace sigma with the sample standard deviation introduces a new complexity:

> $$\text{Confidence Interval} = \bar{X} \pm t_{\alpha/2} \left(\frac{S}{\sqrt{n}}\right)$$

We know we can convert the sample mean ($\bar{X}$) distribution into a standard normal distribution using:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

However, when we replace "$\sigma$" with "$S$", the formula becomes:
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

*Unlike "Sigma $\sigma$", which is a "Constant", "$S$" varies from sample to sample, which adds extra variability to the statistic.*

Consequently, this new statistic is no longer normally distributed. Its distribution is called **"Student’s t-distribution,"** which is similar to a normal distribution but not exactly the same.
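
A quick simulation makes the difference visible. The sketch below assumes a standard normal population and a deliberately small sample size ($n = 5$), then compares how often each statistic falls beyond $\pm 1.96$.

```python
import numpy as np

rng = np.random.default_rng(2024)

mu, sigma, n = 0.0, 1.0, 5  # hypothetical normal population, small sample size
n_trials = 100_000

samples = rng.normal(mu, sigma, size=(n_trials, n))
x_bar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)  # sample standard deviation, differs per sample

z_stat = (x_bar - mu) / (sigma / np.sqrt(n))  # uses the known sigma
t_stat = (x_bar - mu) / (s / np.sqrt(n))      # uses S in place of sigma

# Fraction of values beyond +/-1.96 (5% for a standard normal)
z_tail = np.mean(np.abs(z_stat) > 1.96)
t_tail = np.mean(np.abs(t_stat) > 1.96)

print(f"Beyond +/-1.96 using sigma: {z_tail:.3f}")
print(f"Beyond +/-1.96 using S:     {t_tail:.3f}")
```

The statistic built from $S$ lands in the tails noticeably more often than 5% of the time, so using 1.96 as the critical value would give intervals that are too narrow.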

---

### Student’s t-distribution

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*QCTgLIl8QEFCIV7YDkQ4Mg.jpeg" width="300px">
</p>



* This is a **"Theoretical Distribution,"** meaning it is derived mathematically rather than observed directly in nature.
* While Normal, Log-Normal, Pareto, and Binomial distributions are used to describe natural phenomena, Student’s t-distribution was created to solve problems where population parameters are unknown.
* Student’s t-distribution was developed by **William Sealy Gosset**. Because he could not publish under his own name, he used the pseudonym **"Student"**.
* While a normal distribution uses parameters $\mu$ and $\sigma$, the t-distribution uses one parameter: **"Degrees of Freedom"**, which equals $n - 1$ when estimating a single mean.

#### Degrees of Freedom ($n - 1$):
* If our sample size is "50," then the degrees of freedom will be "49".
* The t-distribution has "fatter tails" compared to the Normal distribution.
* This means more of the probability lies in the tails (the outlier areas) and less lies around the mean.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*brLL43TvODha8dLILPw-zQ.png" width="300px">
  <br><em>Degree of Freedom = 1</em>
</p>

* As we **"increase Sample Size,"** the t-distribution begins to look like a normal distribution. This is because a larger sample provides more knowledge and confidence about the population.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*YZcEp2RrxAuKIzkNajugpw.jpeg" width="300px">
  <br><em>t-distributions with different Degrees of Freedom</em>
</p>


* As the degrees of freedom increase, it starts to look like a normal distribution. It exactly matches the normal distribution at **"infinity"**.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*xuEnfra9bsTyJb6ItKde8g.jpeg" width="300px">
  <br><em>t-distribution at infinity matches the Normal Distribution exactly</em>
</p>

**t-distribution formula:**
<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:378/format:webp/1*bLXFI8wtjtR9JvrOw-WqEA.jpeg" width="300px">
</p>

Instead of the Z-table, we use a **"t-table"** to find the $t_{\alpha/2}$ value.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*wa-uuOmo9xiBP2GDn_CGrA.png" width="300px">
</p>

* For a 95% confidence interval, the $Z_{\alpha/2}$ value is "1.96". For the same 95% confidence level, the $t_{\alpha/2}$ value is always greater than the $Z_{\alpha/2}$ value, and the gap grows as the degrees of freedom get smaller.
* Example: for a 95% CI with 30 degrees of freedom, $t_{\alpha/2} = 2.042$.
* For lower degrees of freedom, $t_{\alpha/2} > Z_{\alpha/2}$ because the Confidence Interval must be widened to account for the uncertainty in the sample standard deviation ($S$).
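
Instead of reading a printed t-table, the critical values can be computed with `scipy.stats`. A short sketch comparing them with the Z value for a 95% confidence level (the list of degrees of freedom is an arbitrary choice):

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # Z_{alpha/2} for a 95% confidence level

# t_{alpha/2} for increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (1, 5, 10, 30, 100, 1000)}

print(f"Z critical value: {z_crit:.3f}")
for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: t critical value = {t_crit:.3f}")
```

The t values shrink toward the Z value as the degrees of freedom grow, which matches the behaviour of the curves described above.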

---

### Summary of Usage

Confidence intervals estimate the range within which a population parameter is likely to fall based on sample data. The choice between the Z-procedure and Student’s t-procedure depends mainly on whether the population standard deviation is known, and to a lesser extent on the sample size.

**The Z-procedure is used when:**
* The population standard deviation ($\sigma$) is known.
* The population is normal or the sample size is large (typically $n \geq 30$).

**Student’s t-procedure is used when:**
* The population standard deviation ($\sigma$) is unknown and is estimated by the sample standard deviation ($S$).
* The sample size is small (typically $n < 30$) and the population is approximately normal. For large samples the t and Z critical values are nearly identical, so the t-procedure remains a safe choice.

The Z-procedure is preferred when the population standard deviation is known, while Student’s t-procedure is used whenever it is unknown, which matters most for small samples. Choosing the appropriate method ensures accurate estimation and valid statistical inference.
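
The decision rule can be captured in a single helper. This is a minimal sketch: the function name `mean_confidence_interval` and its `sigma` argument are assumptions made for the example, and the data values are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(sample, confidence=0.95, sigma=None):
    """Z-procedure if sigma is given, otherwise t-procedure with n - 1 df."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    tail = 1 - (1 - confidence) / 2
    if sigma is not None:
        # Known population std: normal critical value
        crit, spread = stats.norm.ppf(tail), sigma
    else:
        # Unknown population std: t critical value and sample std S
        crit, spread = stats.t.ppf(tail, df=n - 1), sample.std(ddof=1)
    margin = crit * spread / np.sqrt(n)
    return x_bar - margin, x_bar + margin

data = [23, 25, 27, 24, 26, 28, 22, 25, 24, 26]  # hypothetical measurements
z_ci = mean_confidence_interval(data, sigma=2.0)  # sigma assumed known
t_ci = mean_confidence_interval(data)             # sigma unknown
print(f"Z-procedure 95% CI: [{z_ci[0]:.2f}, {z_ci[1]:.2f}]")
print(f"t-procedure 95% CI: [{t_ci[0]:.2f}, {t_ci[1]:.2f}]")
```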

## Confidence Intervals in Code

### 1. **CI for Mean (Most Common)**
```python
from scipy import stats
import numpy as np

# Sample data
data = [23, 25, 27, 24, 26, 28, 22, 25, 24, 26]

# Calculate mean and CI
mean = np.mean(data)
sem = stats.sem(data)  # Standard error
ci = stats.t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print(f"Mean: {mean:.2f}")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```


### 2. **Bootstrap CI (Works for Any Metric)**
```python
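import numpy as np

def bootstrap_ci(data, statistic=np.mean, confidence=0.95, n_bootstrap=1000, seed=0):
    rng = np.random.default_rng(seed)  # seeded so results are reproducible
    data = np.asarray(data)
    # Recompute the statistic on resamples drawn with replacement
    boot_stats = [statistic(rng.choice(data, size=len(data), replace=True))
                  for _ in range(n_bootstrap)]

    alpha = 1 - confidence
    lower = np.percentile(boot_stats, alpha / 2 * 100)
    upper = np.percentile(boot_stats, (1 - alpha / 2) * 100)

    return lower, upper

# Usage
data = [23, 25, 27, 24, 26, 28, 22, 25, 24, 26]
ci = bootstrap_ci(data)                              # CI for the mean
ci_median = bootstrap_ci(data, statistic=np.median)  # CI for the median
print(f"Bootstrap 95% CI (mean): [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"Bootstrap 95% CI (median): [{ci_median[0]:.2f}, {ci_median[1]:.2f}]")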
```


### 3. **CI for Model Accuracy**
```python
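from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy import stats

# Placeholder data and model for illustration; substitute your own
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Get CV scores
scores = cross_val_score(model, X, y, cv=5)

# Calculate CI (t-interval over the fold scores; folds share training data,
# so treat the result as a rough guide)
mean_acc = scores.mean()
sem = stats.sem(scores)
ci = stats.t.interval(0.95, len(scores)-1, loc=mean_acc, scale=sem)

print(f"Accuracy: {mean_acc:.3f} (95% CI: [{ci[0]:.3f}, {ci[1]:.3f}])")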
```

**Note:** Use method 1 for simple mean estimation, method 2 for any statistic (median, std, etc.), and method 3 for ML model evaluation.

---

## Credits

**Prepared by:**  
**Chetan Sharma**  
AIML / Data Science Notes  

🔗 **GitHub:** [github.com/Chetan559](https://github.com/Chetan559)  
🌐 **Portfolio:** [chetan559.github.io](https://chetan559.github.io)  
💼 **LinkedIn:** [linkedin.com/in/sharma-chetan-k](https://www.linkedin.com/in/sharma-chetan-k/)  

These notes were compiled for learning, revision, and academic understanding. 
