# **<span style="color:#2E86C1">Inferential Statistics</span>**

`What's inferential statistics?` In contrast to descriptive statistics, inferential statistics aims to make statements about a whole population. Because surveying an entire population is rarely feasible, a sample is used instead, i.e. a small data set drawn from the population. Conclusions about the population are then drawn from this sample.

`Definition :` "Inferential statistics is a branch of statistics that uses various analytical tools to draw conclusions about the population from sample data. For a given hypothesis about the population, inferential statistics uses a sample and gives an indication of the validity of the hypothesis based on the sample collected." 
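
To make the sample-versus-population idea concrete, here is a minimal sketch. The "population" is simulated, and its size, mean and spread are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 monthly salaries (invented parameters)
population = rng.normal(loc=3000, scale=500, size=100_000)

# In practice we only observe a small sample drawn from that population
sample = rng.choice(population, size=200, replace=False)

print("Population mean:", population.mean())
print("Sample mean:    ", sample.mean())
```

The sample mean is usually close to, but not identical to, the population mean. Inferential statistics quantifies how much this gap can be attributed to random sampling.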

---

## **<span style="color:#2E86C1">Hypothesis</span>**

A hypothesis is an assumption that is neither proven nor disproven. In the research process, a hypothesis is made at the very beginning and the goal is to either reject or not reject the hypothesis. In order to reject or not reject a hypothesis, data, e.g. from an experiment or a survey, are needed, which are then evaluated using a hypothesis test.

## <span style="color:#D35400"><b>Types of Hypotheses</b></span>

-   <span style="color:#28B463"><b>Difference Hypothesis</b></span>
    
    Difference hypotheses are used to test whether two or more groups differ from each other. 
    
    Examples of difference hypotheses are:
    
    -   The "group" of men earn more than the "group" of women.
    -   Smokers have a higher risk of heart attack than non-smokers.
    -   There is a difference between Germany, Austria, and France in terms of hours worked per week.

-   <span style="color:#28B463"><b>Correlation Hypothesis</b></span>

    Correlation hypotheses are used when the relationship or correlation between variables is to be tested. 

    Examples of correlation hypotheses are:
    
    -   There is a relationship between age and height.
    -   The more horsepower a car has, the higher its fuel consumption.
    -   The better the math grade, the higher the future salary.

-   <span style="color:#28B463"><b>Directional and Non-directional Hypotheses : </b></span>

    Hypotheses can also be divided into **directional** and **non-directional** (also known as one-sided and two-sided hypotheses).

    -   <span style="color:purple"><b>Non-directional Hypotheses</b></span>:

        Non-directional hypotheses test whether there is a relationship or a difference, and it does not matter in which direction the relationship or difference goes.
        
        Examples:
        -   There is a difference between the salary of men and women (but it is not said who earns more!).
        -   There is a difference in the risk of heart attack between smokers and non-smokers (but it is not said who has the higher risk!).

        In regard to a correlation hypothesis, this means there is a relationship between two variables, but it is not said whether this relationship is positive or negative.
        
        -   There is a correlation between height and weight.
        -   There is a correlation between horsepower and fuel consumption in cars.

    -   <span style="color:purple"><b>Directional Hypotheses : </b></span>

        Directional hypotheses additionally indicate the direction of the relationship or difference. 
        
        Examples:
        -   Men earn more than women.
        -   Smokers have a higher risk of heart attack than non-smokers.

        In the case of a correlation hypothesis, it specifies whether the correlation is positive or negative.
        
        -   The taller a person is, the heavier they are.
        -   The more horsepower a car has, the higher its fuel consumption.

`Note: The p-value is calculated differently for directional (one-sided) and non-directional (two-sided) hypotheses.`

 <center><img src="../../images/directional_and_nondiarectional_hypothesis.png" alt="error" width="1000"/></center>
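
The difference between the two kinds of p-value can be sketched with a standard normal test statistic. The value `z = 1.8` below is hypothetical and chosen only to illustrate the point.

```python
from scipy import stats

z = 1.8  # hypothetical test statistic, chosen only for illustration

# Non-directional (two-sided): extreme values in BOTH tails count as evidence
p_two_sided = 2 * stats.norm.sf(abs(z))

# Directional (one-sided, H1: effect > 0): only the upper tail counts
p_one_sided = stats.norm.sf(z)

print("Two-sided p-value:", p_two_sided)
print("One-sided p-value:", p_one_sided)
```

For this statistic the one-sided p-value is half the two-sided one, so the same data can fall below a 5% threshold for a directional hypothesis and above it for a non-directional one. This is why the direction has to be fixed before looking at the data.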

## **<span style="color:#2E86C1">Hypothesis Testing</span>**


<span style="color:#D35400">What is Hypothesis Testing?</span>
Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It helps determine whether the observed data deviates from what is expected under a specific assumption (the null hypothesis).

In simpler terms, hypothesis testing allows us to test assumptions or claims about a population by analyzing sample data. It helps to objectively decide whether to reject or fail to reject a hypothesis.

<span style="color:#D35400">Why Do We Need Hypothesis Testing?</span>
Hypothesis testing is crucial in making data-driven decisions. It enables us to:

- <span style="color:#28B463">Test Assumptions</span>: Check if a particular assumption about a population holds true.
- <span style="color:#28B463">Validate Claims</span>: For example, whether a new drug is more effective than an existing one.
- <span style="color:#28B463">Reduce Uncertainty</span>: Helps in reducing subjectivity by using data to guide conclusions.
- <span style="color:#28B463">Support Evidence-based Decisions</span>: Hypothesis testing provides a structured method to accept or reject a claim based on data, which is vital in fields like healthcare, business, and social sciences.

<span style="color:#D35400">Key Terms in Hypothesis Testing</span>

- **<span style="color:#28B463">1. Null Hypothesis (H₀)</span>**  
  The null hypothesis represents a statement of no effect or no difference. It is the default assumption we aim to test against.
  - `Null Hypothesis Assumption` : There is no difference in test scores between two groups.
  - `Example`: gender has no effect on salary.
  - `When to choose the null hypothesis?` It is usually the hypothesis that suggests no change or status quo.

- **<span style="color:#28B463">2. Alternate Hypothesis (H₁ or Hₐ)</span>**  
  The alternate hypothesis is the statement that contradicts the null hypothesis. It represents the claim we are testing for.
  
  - `Alternate Hypothesis Assumption` : There is a significant difference in test scores between two groups.
  - `Example`: gender has an effect on salary. This hypothesis is called an alternative hypothesis.
  - `When to choose the alternate hypothesis?` The alternate hypothesis represents the effect or difference you're interested in proving.

## <span style="color:#D35400"><b>What is P-Value</b></span>

<span style="color:#2E86C1"><b>P-Value and Decision Making</b></span>

The p-value helps us decide whether to reject or keep the **null hypothesis** (the assumption that there is no effect or difference). If the p-value is smaller than a pre-set **significance level** (usually 5%), we reject the null hypothesis. Otherwise, we don't reject it.

`Example:` If we assume that men's and women's salaries in Germany are the same, but our sample shows a difference of 300 euros per month, the **p-value** tells us how likely it is to observe a difference of 300 euros or more purely by chance, assuming there’s really no difference in the overall population.

<span style="color:#D35400"><b>Understanding Small and Large P-values</b></span>

---

- **Small p-value (e.g., 3%)**:
    - A p-value of **3%** means that, if there were really no salary difference in the population, a sample difference of **300 euros or more** would occur in only about **3%** of samples.
    - You then need to decide whether you're willing to accept this **3% risk of wrongly rejecting** the null hypothesis, or whether that’s too risky.

<span style="color:#28B463"><b>Key Interpretations:</b></span>

---

- **When p-value is small (e.g., 3%)**:
    - If the null hypothesis were true, a result at least as extreme as the one observed (like a **300 euro salary difference**) would occur only **3%** of the time.
    - Such a result is hard to explain by random sampling alone, so we **reject the null hypothesis** (which says there's no difference).

- **When p-value is large (e.g., 50%)**:
    - If the null hypothesis were true, a difference at least this large would still occur in about **50%** of samples.
    - The result is therefore unsurprising under the null hypothesis, so we **don't reject the null hypothesis** (since there's no strong evidence against it).

---

<span style="color:#D35400"><b>In Short:</b></span>

- **Small p-value** = Unlikely to be by chance → likely a **real effect**.
- **Large p-value** = Could easily be by chance → not enough evidence for a real effect.


<center><img src="../../images/p_value_example.png" alt="error" width="500"/></center>
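
The idea of "how often would chance alone produce a difference this large" can be simulated directly. The sketch below uses a permutation test, which is not the method used elsewhere in this notebook, and the salary data are simulated with made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical monthly salaries in euros (simulated, not real data)
men = rng.normal(loc=3300, scale=600, size=40)
women = rng.normal(loc=3000, scale=600, size=40)
observed_diff = men.mean() - women.mean()

# Under H0 the group labels carry no information, so we shuffle them
# and record how often a shuffled difference is at least as extreme
pooled = np.concatenate([men, women])
n_permutations = 10_000
count_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:40].mean() - shuffled[40:].mean()
    if abs(diff) >= abs(observed_diff):
        count_extreme += 1

p_value = count_extreme / n_permutations  # two-sided p-value estimate
print("Observed difference:", observed_diff)
print("Estimated p-value:  ", p_value)
```

The estimated p-value is simply the fraction of label shuffles that produce a difference at least as large as the observed one.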

## <span style="color:#D35400"><b>Importance of Significance Level</b></span>

<span style="color:#2E86C1"><b>Understanding Significance Level and P-value</b></span>

-   <span style="color:#D35400"><b>The Role of Significance Level (α)</b></span>
    - Before conducting a hypothesis test, the **significance level (α)** is set to decide how much uncertainty we're willing to accept.
    - A commonly used significance level is **5% (α = 0.05)**.
    - The significance level is the threshold the p-value must fall below before we reject the **null hypothesis**. It is also the probability of rejecting the null hypothesis when it is actually true (a false positive) that we are willing to tolerate.

    <center><img src="../../images/p_value.png" alt="error" width="600"/></center>

---

-   <span style="color:#28B463"><b>Interpretation of P-values</b></span>

    - **p < 0.01**: **Very significant result.**
        - If the null hypothesis were true, a result this extreme would occur less than **1%** of the time, so we reject the null hypothesis with high confidence.

    - **p < 0.05**: **Significant result.**
        - If the null hypothesis were true, a result this extreme would occur less than **5%** of the time, so we reject the null hypothesis. 

    - **p > 0.05**: **Not a significant result.**
        - The result could easily be due to chance, so we **don’t reject the null hypothesis**.

---

<span style="color:#2E86C1"><b>Example of Reluctance in Rejecting the Null Hypothesis</b></span>

If we calculate a p-value of exactly **0.05**, it means that, if the null hypothesis were true, a result at least this extreme would occur **5%** of the time. This sits right on the usual threshold, so it is a **borderline** result and should not be treated as strong evidence against the null hypothesis.

We are generally **reluctant to reject the null hypothesis** because we assume that the **null hypothesis is true** before any testing is done. So, unless we are **very sure** (based on a small p-value), we **don't reject the null hypothesis**. 

In short, we only reject the null hypothesis if the chance of randomness is very small (typically less than **5%**).
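
What the significance level controls can be checked with a small simulation. The sketch below repeatedly samples from a population where the null hypothesis is true by construction (mean 0) and counts how often a test at α = 0.05 rejects anyway. The sample size and number of repetitions are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05
n_experiments = 2_000

false_rejections = 0
for _ in range(n_experiments):
    # H0 is true here: the population mean really is 0
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        false_rejections += 1

rejection_rate = false_rejections / n_experiments
print("Share of true null hypotheses rejected:", rejection_rate)
```

The rejection rate should land near α, which illustrates that the significance level is the false-positive rate we accept in advance.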


---

#### 🎯 `NOTE` : In general, the **p-value** is calculated from a test statistic such as the **Z-score** or **T-statistic** produced by the corresponding test. Many Python functions return the p-value directly as part of their output, so all we need to do is compare it with the significance level of our choice, typically 0.05 (i.e. a 5% significance level).

# **<span style="color:#28B463">Z-Test</span>**


- **<span style="color:#D35400">When to Use a Z-Test:</span>**
    - <span style="color:#28B463">Used For :</span> To compare a sample mean with a known population mean, or the means of two groups.
    - <span style="color:#28B463">Large sample size:</span> The Z-test is appropriate when the sample size is large (usually $n > 30$).
    - <span style="color:#28B463">Known population variance:</span> The population variance (or standard deviation) is known.
    - <span style="color:#28B463">Normal distribution:</span> The data follows a normal distribution.

- **<span style="color:#28B463">Z-score</span>**

    -   A Z-score measures how many standard deviations a value is from the mean. 
    -   In hypothesis testing, Z-scores are used in tests involving large samples to determine how far away a sample mean is from the population mean, measured in standard errors ($\sigma / \sqrt{n}$).

        $$ Z = \frac{X - \mu}{\sigma / \sqrt{n}} $$  
        Where:
        - $X$ is the sample mean
        - $\mu$ 'mu' is the population mean
        - $\sigma$ 'sigma' is the standard deviation of the population
        - $n$ is the sample size

- **<span style="color:#D35400">Real-life Example:</span>**
  Imagine a factory producing lightbulbs where the average lifespan of a bulb is claimed to be 1,000 hours, and the standard deviation is 100 hours. You take a sample of 50 bulbs and want to check if the average lifespan is significantly different from the claimed 1,000 hours.

  `General Idea of Statistical Test`: If the sample really comes from the claimed population, its mean should be close to the population mean. The Z-test checks whether the gap between the two is larger than random sampling alone would plausibly produce.


In [14]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt 

# Given data
population_mean = 1000
population_std = 100  # Known population standard deviation
sample_size = 50  # Sample size 
alpha = 0.05  # Significance level 

# Generate random sample data based on the population mean and std
# noise_factor shifts the mean of the sampled data away from the claimed population mean
noise_factor = 0  # Size of the shift (0 = sample drawn from the claimed population)

### `NOTE`:
-   Here, the **noise factor** shifts the sample's mean to mimic a real-world sample that may not match the claimed population.
-   With a noise factor of 0, the sample is drawn from the claimed population, so its mean and standard deviation will be close to the population values.
-   Increasing the noise factor moves the sample mean away from the population mean.
-   Try a very small value such as 0 and a large value such as 50, and see how the final result changes.

In [15]:
sample = np.random.normal(loc=population_mean + noise_factor, scale=population_std, size=sample_size)

# Z-test
sample_mean = np.mean(sample)
z_score = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))

# Get p-value from z-score
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

# Print results
print("Sample Mean:", sample_mean)
print("Z-score:", z_score)
print("P-value:", p_value)

# Decision making
if p_value <= alpha:
    print("Reject the null hypothesis: There is a significant difference.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")

Sample Mean: 996.7047305791999
Z-score: -0.2330107353284405
P-value: 0.8157530696062265
Fail to reject the null hypothesis: No significant difference.
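
The experiment suggested in the note can be sketched by wrapping the Z-test in a small helper and running it for two noise factors. The function name `one_sample_z_test` and the fixed seed are illustrative choices, not part of the cells above.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(sample, population_mean, population_std):
    """Return (z_score, two-sided p-value) for a one-sample Z-test."""
    sample = np.asarray(sample, dtype=float)
    standard_error = population_std / np.sqrt(sample.size)
    z_score = (sample.mean() - population_mean) / standard_error
    p_value = 2 * stats.norm.sf(abs(z_score))  # sf = 1 - cdf
    return z_score, p_value

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
for noise_factor in (0, 50):
    sample = rng.normal(loc=1000 + noise_factor, scale=100, size=50)
    z, p = one_sample_z_test(sample, population_mean=1000, population_std=100)
    print(f"noise_factor={noise_factor}: z={z:.3f}, p={p:.4f}")
```

A larger noise factor pushes the sample mean further from 1,000 hours, which tends to increase the absolute Z-score and shrink the p-value.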


# **<span style="color:#28B463">T-Test</span>**


- **<span style="color:#D35400">When to Use a T-Test:</span>**
    - <span style="color:#28B463">Used For:</span> To compare the means of one or two groups.
    - <span style="color:#28B463">Small sample size:</span> The T-test is appropriate when the sample size is small (usually $n \leq 30$).
    - <span style="color:#28B463">Unknown population variance:</span> The population variance (or standard deviation) is unknown.

- **<span style="color:#28B463">T-score</span>**
    - A T-score measures how many standard deviations the sample mean is from the population mean, accounting for sample size.
    - In hypothesis testing, T-scores are used when dealing with smaller sample sizes to evaluate how far away a sample mean is from the population mean.

        $$ T = \frac{X - \mu}{s / \sqrt{n}} $$  
        Where:
        - $X$ is the sample mean
        - $\mu$ 'mu' is the population mean
        - $s$ is the sample standard deviation
        - $n$ is the sample size

### **<span style="color:#2E86C1">Types of T-tests</span>**

- **<span style="color:#D35400">One-sample T-test</span>**
    - <span style="color:#28B463">Description:</span> A one-sample T-test is used to compare the mean of a single sample to a known population mean.
    - **<span style="color:#28B463">Example:</span>**  
      A university claims that the average GPA of all its students is 3.0. You take a sample of 30 students and want to verify if the average GPA of these students is significantly different from 3.0.

In [33]:
import numpy as np
from scipy import stats

# Sample data: GPA of 30 students
sample = [3.1, 2.9, 3.0, 3.2, 3.1, 3.3, 2.8, 3.0, 2.9, 3.2, 3.1, 2.9, 3.0, 3.1, 3.2, 3.3, 2.8, 3.0, 3.1, 2.9, 3.0, 3.1, 3.2, 3.3, 2.8, 3.0, 3.1, 2.9, 3.0, 3.1]

# Population mean (claimed GPA)
population_mean = 3.0

# Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample, population_mean)

print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Significance level
alpha = 0.05

# Conclusion
if p_value <= alpha:
    print("Reject the null hypothesis: The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis: No significant difference.")


T-statistic: 1.7556849093970603
P-value: 0.08970111773731367
Fail to reject the null hypothesis: No significant difference.
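
To connect this result to the T-score formula, the statistic can also be computed by hand. This is a minimal sketch that reuses the same GPA sample and compares the manual values with `stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

sample = np.array([3.1, 2.9, 3.0, 3.2, 3.1, 3.3, 2.8, 3.0, 2.9, 3.2,
                   3.1, 2.9, 3.0, 3.1, 3.2, 3.3, 2.8, 3.0, 3.1, 2.9,
                   3.0, 3.1, 3.2, 3.3, 2.8, 3.0, 3.1, 2.9, 3.0, 3.1])
population_mean = 3.0

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation s

# T = (X - mu) / (s / sqrt(n))
t_manual = (sample_mean - population_mean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, population_mean)
print("Manual:", t_manual, p_manual)
print("SciPy: ", t_scipy, p_scipy)
```

Both routes give the same statistic and p-value, which shows that the library call is simply an implementation of the formula.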


- **<span style="color:#D35400">Paired T-test</span>**
    - <span style="color:#28B463">Description:</span> Compares the means of two related groups (e.g., before and after treatment).
    - **<span style="color:#28B463">Example:</span>**  
      You want to test if a new teaching method has improved the scores of students by comparing their scores before and after the method was introduced.

In [36]:
import numpy as np
from scipy import stats

# Test scores before and after applying a new teaching method
before_scores = [78, 82, 85, 80, 88, 84, 77, 83, 79, 81]
after_scores = [85, 86, 88, 85, 92, 89, 81, 87, 85, 86]

# Perform paired t-test (two-sided)
# The statistic is based on before - after, so it is negative when scores went up
t_statistic, p_value = stats.ttest_rel(before_scores, after_scores)

print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Significance level
alpha = 0.05

# Conclusion
if p_value <= alpha:
    print("Reject the null hypothesis: The new method significantly improved the scores.")
else:
    print("Fail to reject the null hypothesis: The new method did not significantly improve the scores.")


T-statistic: -12.818181818181818
P-value: 4.3817855382226614e-07
Reject the null hypothesis: The new method significantly improved the scores.
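
Because the claim here is directional (scores improved), a one-sided paired test is a natural variant. The sketch below assumes SciPy 1.6 or newer, where `ttest_rel` accepts the `alternative` argument.

```python
from scipy import stats

before_scores = [78, 82, 85, 80, 88, 84, 77, 83, 79, 81]
after_scores = [85, 86, 88, 85, 92, 89, 81, 87, 85, 86]

# H1: the mean of (after - before) is greater than 0
t_statistic, p_value = stats.ttest_rel(after_scores, before_scores,
                                       alternative="greater")

# Two-sided version of the same comparison, for reference
_, p_two_sided = stats.ttest_rel(after_scores, before_scores)

print("T-statistic:", t_statistic)
print("One-sided P-value:", p_value)
print("Two-sided P-value:", p_two_sided)
```

Passing `after_scores` first makes the statistic positive when scores went up, and the one-sided p-value is half the two-sided one in that case.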


- **<span style="color:#D35400">Two-sample T-test</span>**
    - <span style="color:#28B463">Description:</span> Compares the means of two independent samples.
    - **<span style="color:#28B463">Example:</span>**  
      You want to compare the average test scores of students from two different schools to see if there's a significant difference in their performance.



In [34]:
import numpy as np
from scipy import stats

# Test scores from two independent schools
school1_scores = [85, 88, 92, 75, 89, 91, 86, 84, 90, 93]
school2_scores = [82, 79, 88, 77, 85, 80, 83, 81, 86, 84]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(school1_scores, school2_scores)

print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Significance level
alpha = 0.05

# Conclusion
if p_value <= alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the two groups.")


T-statistic: 2.4319606085620005
P-value: 0.02568187849919393
Reject the null hypothesis: There is a significant difference between the two groups.
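
By default `stats.ttest_ind` assumes both groups have the same variance. When that assumption is doubtful, Welch's t-test is a common alternative. A minimal sketch with the same scores:

```python
from scipy import stats

school1_scores = [85, 88, 92, 75, 89, 91, 86, 84, 90, 93]
school2_scores = [82, 79, 88, 77, 85, 80, 83, 81, 86, 84]

# Student's t-test (default): assumes equal variances in both groups
t_student, p_student = stats.ttest_ind(school1_scores, school2_scores)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(school1_scores, school2_scores,
                                   equal_var=False)

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
```

With equal group sizes the two statistics coincide. Only the degrees of freedom, and therefore the p-value, differ.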


### **<span style="color:#2E86C1">Two-sample T-test Types :</span>**

- **<span style="color:#D35400">One-tailed Test:</span>**
    - <span style="color:#28B463">Description:</span> A one-tailed test is used when you want to test if one sample's mean is either greater than or less than the other sample's mean, but not both.
    - **<span style="color:#28B463">Example:</span>**  
      You want to test if students from School A have higher test scores than those from School B. A difference in the opposite direction is not of interest, and this direction is fixed before looking at the data.

In [37]:
import numpy as np
from scipy import stats

# Test scores from two independent schools (School A and School B)
school_A_scores = [85, 88, 92, 75, 89, 91, 86, 84, 90, 93]
school_B_scores = [82, 79, 88, 77, 85, 80, 83, 81, 86, 84]

# Perform one-tailed t-test (School A > School B)
t_statistic, p_value = stats.ttest_ind(school_A_scores, school_B_scores)

# The one-tailed p-value is then obtained by dividing the two-tailed p-value by 2.
# This is only valid when the t-statistic has the hypothesized sign (checked below).
one_tailed_p_value = p_value / 2

print("T-statistic:", t_statistic)
print("One-tailed P-value:", one_tailed_p_value)

# Significance level
alpha = 0.05

# Conclusion
if one_tailed_p_value <= alpha and t_statistic > 0:
    print("Reject the null hypothesis: School A has significantly higher scores than School B.")
else:
    print("Fail to reject the null hypothesis: No significant difference in scores.")


T-statistic: 2.4319606085620005
One-tailed P-value: 0.012840939249596965
Reject the null hypothesis: School A has significantly higher scores than School B.
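
Instead of halving the two-tailed p-value by hand, the direction can be passed to SciPy directly. This sketch assumes SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` argument.

```python
from scipy import stats

school_A_scores = [85, 88, 92, 75, 89, 91, 86, 84, 90, 93]
school_B_scores = [82, 79, 88, 77, 85, 80, 83, 81, 86, 84]

# H1: mean of School A is greater than mean of School B
t_statistic, p_one_tailed = stats.ttest_ind(school_A_scores, school_B_scores,
                                            alternative="greater")

print("T-statistic:", t_statistic)
print("One-tailed P-value:", p_one_tailed)
```

This avoids the separate sign check, because a statistic in the wrong direction automatically yields a large p-value.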


- **<span style="color:#D35400">Two-tailed Test:</span>**
    - <span style="color:#28B463">Description:</span> A two-tailed test is used when you want to test if the means of two samples are different from each other, regardless of which one is greater.
    - **<span style="color:#28B463">Example:</span>**  
      You want to compare test scores between two schools but don’t assume which school has higher or lower scores.

In [38]:
import numpy as np
from scipy import stats

# Test scores from two independent schools (School A and School B)
school_A_scores = [85, 88, 92, 75, 89, 91, 86, 84, 90, 93]
school_B_scores = [82, 79, 88, 77, 85, 80, 83, 81, 86, 84]

# Perform two-tailed t-test
t_statistic, p_value = stats.ttest_ind(school_A_scores, school_B_scores)

print("T-statistic:", t_statistic)
print("Two-tailed P-value:", p_value)

# Significance level
alpha = 0.05

# Conclusion
if p_value <= alpha:
    print("Reject the null hypothesis: There is a significant difference between the two groups.")
else:
    print("Fail to reject the null hypothesis: No significant difference between the two groups.")


T-statistic: 2.4319606085620005
Two-tailed P-value: 0.02568187849919393
Reject the null hypothesis: There is a significant difference between the two groups.


# **<span style="color:#28B463">Chi-Square Test</span>**

### **<span style="color:#D35400">What is the Chi-Square Test?</span>**
- The **Chi-Square test** is a statistical method used to determine if there is a significant association between two categorical variables. 
- It compares the observed frequency of events with the expected frequency under the assumption of no association.

### **<span style="color:#D35400">When to Use the Chi-Square Test:</span>**
- **<span style="color:#28B463">Use the Chi-Square Test when:</span>**
  - You have **categorical data**.
  - You want to test the **association** or **independence** between two variables.
  - You want to compare the **observed** distribution of data with an **expected** distribution.

### **<span style="color:#D35400">Steps to Perform a Chi-Square Test:</span>**

- **<span style="color:#28B463">Step 1: Formulate the Hypotheses</span>**
  - **Null Hypothesis (H₀):** Assumes no association between the variables or the sample fits the expected distribution.
  - **Alternate Hypothesis (H₁):** Assumes an association between the variables or that the sample does not fit the expected distribution.

- **<span style="color:#28B463">Step 2: Create a Contingency Table</span>**
  - Organize your observed data in a contingency table with the categories of one variable as rows and the categories of the second variable as columns.

- **<span style="color:#28B463">Step 3: Calculate the Expected Frequencies</span>**
  - The expected frequency for each cell in the contingency table is calculated based on the assumption of independence.
  $$ Expected\ Frequency = \frac{(Row\ Total \times Column\ Total)}{Grand\ Total} $$

- **<span style="color:#28B463">Step 4: Compute the Chi-Square Statistic</span>**
  $$ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} $$

- **<span style="color:#28B463">Step 5: Compare the Chi-Square Statistic to the Critical Value</span>**
  - Compare the calculated Chi-Square value with the critical value from the Chi-Square distribution table using the degrees of freedom (df).
  $$ df = (number\ of\ rows - 1) \times (number\ of\ columns - 1) $$

- **<span style="color:#28B463">Step 6: Conclusion</span>**
  - If the calculated Chi-Square value is greater than the critical value (or the p-value is less than the significance level), reject the null hypothesis.
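
The six steps above can be sketched in a few lines of NumPy. The 2×2 table below is hypothetical, with counts invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Step 2: hypothetical contingency table (invented counts)
observed = np.array([[30, 10],
                     [20, 40]])

# Step 3: expected frequency = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Step 4: chi-square statistic
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Step 5: degrees of freedom and critical value at alpha = 0.05
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, dof)
p_value = stats.chi2.sf(chi2_stat, dof)

# Step 6: decision
reject_null = chi2_stat > critical_value
print("Expected:\n", expected)
print("Chi-square:", chi2_stat, "Critical value:", critical_value)
print("P-value:", p_value, "Reject H0:", reject_null)
```

Comparing the statistic with the critical value and comparing the p-value with α always lead to the same decision.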

---

### **<span style="color:#D35400">Example Calculation:</span>**

Let's assume you have a contingency table showing the number of male and female customers who either smoke or don't smoke.

|                  | Smoker | Non-Smoker | Total |
|------------------|--------|------------|-------|
| **Male**         | 60     | 97         | 157   |
| **Female**       | 30     | 54         | 84    |
| **Total**        | 90     | 151        | 241   |

- **<span style="color:#28B463">Step 3: Expected Frequencies</span>**

To calculate the expected frequency for each cell, use the formula:

$$ E = \frac{(Row\ Total \times Column\ Total)}{Grand\ Total} $$

For example, with row totals 157 (Male) and 84 (Female), column totals 90 (Smoker) and 151 (Non-Smoker), and a grand total of 241:

$$ E_{1A} = \frac{(60 + 97) \times (60 + 30)}{60 + 97 + 30 + 54} = \frac{157 \times 90}{241} = 58.63 $$  
$$ E_{1B} = \frac{(60 + 97) \times (97 + 54)}{60 + 97 + 30 + 54} = \frac{157 \times 151}{241} = 98.37 $$  
$$ E_{2A} = \frac{(30 + 54) \times (60 + 30)}{60 + 97 + 30 + 54} = \frac{84 \times 90}{241} = 31.37 $$  
$$ E_{2B} = \frac{(30 + 54) \times (97 + 54)}{60 + 97 + 30 + 54} = \frac{84 \times 151}{241} = 52.63 $$

So, the expected frequencies table will be:

|                  | Smoker | Non-Smoker | Total |
|------------------|--------|------------|-------|
| **Male**         | 58.63  | 98.37      | 157   |
| **Female**       | 31.37  | 52.63      | 84    |
| **Total**        | 90     | 151        | 241   |


### **<span style="color:#D35400">Continuing Hypothesis Testing:</span>**

We will now proceed with the hypothesis testing steps after calculating the expected frequencies.

---

### **<span style="color:#28B463">Step 4: Compute the Chi-Square Statistic</span>**

To calculate the Chi-Square statistic, use the formula:

$$ \chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} $$

Let’s calculate the Chi-Square value for each cell.

1. For Male-Smoker:
   $$ \frac{(60 - 58.63)^2}{58.63} = \frac{(1.37)^2}{58.63} \approx 0.0320 $$

2. For Male-Non-Smoker:
   $$ \frac{(97 - 98.37)^2}{98.37} = \frac{(-1.37)^2}{98.37} \approx 0.0191 $$

3. For Female-Smoker:
   $$ \frac{(30 - 31.37)^2}{31.37} = \frac{(-1.37)^2}{31.37} \approx 0.0598 $$

4. For Female-Non-Smoker:
   $$ \frac{(54 - 52.63)^2}{52.63} = \frac{(1.37)^2}{52.63} \approx 0.0357 $$

Now, sum all the calculated values to get the total Chi-Square value:

$$ \chi^2 \approx 0.0320 + 0.0191 + 0.0598 + 0.0357 = 0.1466 \approx 0.147 $$

---

### **<span style="color:#28B463">Step 5: Compare the Chi-Square Statistic to the Critical Value</span>**

We will now compare our calculated Chi-Square statistic to the critical value from the Chi-Square distribution table.

1. **Degrees of freedom (df)**:
   $$ df = (number\ of\ rows - 1) \times (number\ of\ columns - 1) $$

   For this example:
   $$ df = (2 - 1) \times (2 - 1) = 1 $$

2. **Critical value for Chi-Square at 0.05 significance level and 1 degree of freedom**:
    - Use a look-up table to find the critical value: https://datatab.net/tutorial/chi-square-distribution
    - From the Chi-Square table, the critical value for **df = 1** and **α = 0.05** is **3.841**.

---

### **<span style="color:#28B463">Step 6: Conclusion</span>**

- Our calculated Chi-Square value is approximately **0.147**, which is **much smaller** than the critical value **3.841**.

- Since the calculated Chi-Square value is smaller than the critical value, we **fail to reject the null hypothesis**. This means the data show no statistically significant association between gender (male/female) and smoking status (smoker/non-smoker).

---

### **<span style="color:#D35400">Summary</span>**

- **Observed Data**: We had observed data on whether males and females were smokers or non-smokers.
- **Expected Frequencies**: We calculated the expected frequencies assuming there is no association between gender and smoking status.
- **Chi-Square Statistic**: We computed the Chi-Square value and found it to be approximately **0.147**.
- **Conclusion**: Since the Chi-Square value is smaller than the critical value, we fail to reject the null hypothesis, meaning there is no significant association between gender and smoking status in this table.
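
The hand calculation for this table can be cross-checked with SciPy. This is a minimal sketch: `correction=False` is passed so that `chi2_contingency` computes the plain sum of $(Observed - Expected)^2 / Expected$ without Yates' continuity correction, and the critical value is taken from the chi-square distribution instead of a look-up table.

```python
import numpy as np
from scipy import stats

# Observed counts from the example table (rows: Male, Female;
# columns: Smoker, Non-Smoker)
observed = np.array([[60, 97],
                     [30, 54]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed,
                                                           correction=False)

alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, dof)

print("Expected frequencies:\n", expected)
print("Chi-square statistic:", chi2_stat)
print("Degrees of freedom:", dof)
print("Critical value:", critical_value)
print("P-value:", p_value)
```

Small differences from a hand calculation are expected, because the manual steps round the expected frequencies before squaring.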



## **<span style="color:#2E86C1">Types of Chi-Square Tests:</span>**

### **<span style="color:#28B463">Chi-Square Test of Independence</span>**

- **Purpose:** This test is used to determine if there is a significant association between two categorical variables.
- **Example Variables:** Let's test if there is an association between `sex` (male/female) and `smoker` status (yes/no) in the tips dataset.

    - **Null Hypothesis (H₀):** There is no association between `sex` and `smoker` status (they are independent).
    - **Alternate Hypothesis (H₁):** There is an association between `sex` and `smoker` status (they are not independent).


In [50]:
import seaborn as sns
import pandas as pd
from scipy.stats import chi2_contingency

In [51]:
# Step 1: Load the dataset (libraries were imported in the previous cell)

df = sns.load_dataset('tips')
df.head()

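|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |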


In [52]:
# Create a contingency table for 'sex' and 'smoker'

# A contingency table contains the count of rows for each combination of categories, e.g.
# count of males who smoke = 60  (i.e. Male & Yes)
# count of females who smoke = 33  (i.e. Female & Yes)
# count of males who don't smoke = 97  (i.e. Male & No)
# count of females who don't smoke = 54  (i.e. Female & No)

contingency_table = pd.crosstab(df['sex'], df['smoker'])
contingency_table

| sex / smoker | Yes | No |
|--------------|-----|----|
| Male         | 60  | 97 |
| Female       | 33  | 54 |


In [53]:
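# Perform the Chi-Square Test of Independence
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default (correction=True)
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)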

### **<span style="color:#D35400">Table: Expected Frequencies</span>**

| Row/Column        | Smoker (Yes) | Non-Smoker (No) | Total  |
|-------------------|----------|----------|--------|
| **Male**          | $E_{1A} = \frac{(60 + 97) \times (60 + 33)}{60 + 97 + 33 + 54} = \frac{157 \times 93}{244} = 59.84$ | $E_{1B} = \frac{(60 + 97) \times (97 + 54)}{60 + 97 + 33 + 54} = \frac{157 \times 151}{244} = 97.16$ | $157$   |
| **Female**        | $E_{2A} = \frac{(33 + 54) \times (60 + 33)}{60 + 97 + 33 + 54} = \frac{87 \times 93}{244} = 33.16$ | $E_{2B} = \frac{(33 + 54) \times (97 + 54)}{60 + 97 + 33 + 54} = \frac{87 \times 151}{244} = 53.84$ | $87$   |
| **Total**         | $93$    | $151$    | $244$  |

In [55]:
print("Expected Frequencies:\n", expected)

Expected Frequencies:
 [[59.84016393 97.15983607]
 [33.15983607 53.84016393]]


In [56]:
# Print the results
print("Chi-Square Statistic:", chi2_stat)
print("P-value:", p_value)
print("Degrees of Freedom:", dof)

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between sex and smoker status.")
else:
    print("Fail to reject the null hypothesis: No significant association between sex and smoker status.")

Chi-Square Statistic: 0.0
P-value: 1.0
Degrees of Freedom: 1
Fail to reject the null hypothesis: No significant association between sex and smoker status.
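
A statistic of exactly 0.0 may look suspicious. It comes from Yates' continuity correction, which `chi2_contingency` applies by default to 2×2 tables: every observed count here is within 0.5 of its expected value, so the corrected differences shrink to zero. The sketch below enters the counts from the contingency table directly (so it runs without downloading the dataset) and compares both settings.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the contingency table above (rows: Male, Female; columns: Yes, No)
observed = np.array([[60, 97],
                     [33, 54]])

# Default behaviour: Yates' continuity correction is applied (2x2 table)
stat_yates, p_yates, _, _ = chi2_contingency(observed)

# Plain Pearson chi-square statistic without the correction
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print("With correction:   ", stat_yates, p_yates)
print("Without correction:", stat_plain, p_plain)
```

Either way the p-value is far above 0.05, so the conclusion (no significant association between `sex` and `smoker`) does not change.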


# **<span style="color:#28B463">Anova Test</span>**