# Chi-Square Test

The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables. It's based on the difference between observed and expected frequencies in one or more categories.

Here's a step-by-step explanation of how the chi-square test works:

1. **Formulate Hypotheses**: 
   - Null Hypothesis (H0): There is no association between the two categorical variables.
   - Alternative Hypothesis (H1): There is an association between the two categorical variables.

2. **Construct a Contingency Table**: 
   - Organize the data into rows and columns based on the categories of the variables. This table shows the observed frequencies of each category combination.

   A contingency table organizes categorical data so that the relationship between two or more variables can be analyzed. It is the standard input for the chi-square test of independence.

   The general form of a contingency table can be represented as follows:

   ```
               | Category 1    | Category 2    | ... | Category n    | Total
   --------------------------------------------------------------------------
   Group 1      | O11           | O12           | ... | O1n           | R1
   Group 2      | O21           | O22           | ... | O2n           | R2
   ...          | ...           | ...           | ... | ...           | ...
   Group m      | Om1           | Om2           | ... | Omn           | Rm
   --------------------------------------------------------------------------
   Total        | C1            | C2            | ... | Cn            | N
   ```

   Where:
   - Oij represents the observed frequency in the cell corresponding to the ith row (group) and jth column (category).
   - Ri represents the row total (sum of observed frequencies in the ith row).
   - Cj represents the column total (sum of observed frequencies in the jth column).
   - N represents the grand total (total number of observations).

3. **Calculate Expected Frequencies**: 
   - Assuming that the null hypothesis is true (i.e., no association between variables), calculate the expected frequencies for each cell of the contingency table. Expected frequency is calculated as (row total * column total) / grand total.
   ```
   Expected frequency (Eij) = (Ri * Cj) / N
   ```

4. **Compute the Chi-Square Statistic**: 
   - For each cell in the contingency table, compute its contribution $(O_{ij} - E_{ij})^2 / E_{ij}$, then add the contributions together to obtain the chi-square statistic:
     $\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
   - Where:
     - $O_{ij}$ is the observed frequency in cell (i, j).
     - $E_{ij}$ is the expected frequency in cell (i, j).
     - The sum is taken over all cells in the contingency table.

5. **Determine the Degrees of Freedom (df)**:
   - The degrees of freedom for a chi-square test of independence are $(r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns in the contingency table (so $r = m$ and $c = n$ in the table above).

6. **Find Critical Value or P-value**: 
   - Using the chi-square distribution table or statistical software, find the critical value corresponding to the chosen significance level (typically 0.05).
   - Alternatively, calculate the p-value associated with the chi-square statistic.

7. **Make a Decision**:
   - If the chi-square statistic exceeds the critical value or, equivalently, the p-value is less than the chosen significance level (usually 0.05), reject the null hypothesis.
   - Otherwise, fail to reject the null hypothesis.

8. **Interpretation**:
   - If the null hypothesis is rejected, it indicates that there is a significant association between the two categorical variables.
   - If the null hypothesis is not rejected, it suggests that there is insufficient evidence to conclude that there is an association between the variables.
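
Steps 3 to 5 above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `expected_frequencies` and `chi_square_statistic` and the 2x3 `toy_table` are hypothetical and chosen only for this example, and no continuity correction is applied.

```python
def expected_frequencies(observed):
    """Return the table of expected counts E_ij = R_i * C_j / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a table."""
    expected = expected_frequencies(observed)
    # Sum (O - E)^2 / E over every cell of the table.
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # df = (rows - 1) * (columns - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


# Hypothetical 2x3 table of observed counts, for illustration only.
toy_table = [[10, 20, 30],
             [20, 20, 20]]
stat, df = chi_square_statistic(toy_table)
print(f"chi-square = {stat:.3f}, df = {df}")
```

The observed table is passed as a list of rows without the totals; the row totals, column totals, and grand total are derived inside `expected_frequencies`.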

The chi-square test is widely used in various fields, including biology, social sciences, market research, and quality control, to analyze categorical data and test hypotheses about relationships between variables.

## Problem statement

An RTO department has observed that more accidents on a highway take place in the early hours of the day than at other times, and wishes to test whether the outcome of an accident is associated with its timing. A survey of 400 accident cases found that 280 occurred in the early hours and that 99 of the 400 were fatal. Of the accidents that occurred in the early hours, 80 were fatal. Do these data indicate an association between the time of an accident and its fatality? Use α = 0.05.

## Solution 
To determine whether there is an association between the time of accident and the fatality of the accident, we can perform a chi-squared test of independence. The null hypothesis (H0) is that there is no association between the time of accident and fatality, while the alternative hypothesis (H1) is that there is an association.

First, let's organize the data into a contingency table:

```
                  | Fatality     | Non-Fatality | Total
--------------------------------------------------------
Early Hours       | 80           | 200          | 280
Other Timings     | 19           | 101          | 120
--------------------------------------------------------
Total             | 99           | 301          | 400
```


Now, we can calculate the expected frequencies under the assumption of independence. Expected frequency for a cell is calculated by: (row total * column total) / grand total.
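
As a cross-check, the expected frequencies and the chi-square statistic for this table can be reproduced with a short Python sketch. The variable names are illustrative, and the calculation is the plain Pearson statistic with no continuity correction, which is what yields the 7.318 used in this solution.

```python
# Observed counts from the contingency table above.
# Rows: early hours, other timings; columns: fatal, non-fatal.
observed = [[80, 200],
            [19, 101]]

row_totals = [sum(row) for row in observed]          # [280, 120]
col_totals = [sum(col) for col in zip(*observed)]    # [99, 301]
n = sum(row_totals)                                  # 400

# Expected count for each cell under independence: R_i * C_j / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)              # approximately [[69.3, 210.7], [29.7, 90.3]]
print(round(chi_square, 3))  # 7.318
```

Every cell differs from its expected count by the same amount, 10.7, so the cell with the smallest expected count (fatal accidents at other timings, 29.7) contributes the most to the statistic.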

<img src="images/chi-sq-2.png" width="500">

<img src="images/chi-sq-1.png" width="700">

Now, we compare the computed chi-square statistic, 7.318, to the critical value from the chi-squared distribution with (2-1)*(2-1) = 1 degree of freedom at α = 0.05. Using a chi-squared distribution table or a calculator, the critical value is approximately 3.841.

Since 7.318 > 3.841, we reject the null hypothesis. The data indicate a statistically significant association between the time of an accident and its fatality at the 0.05 significance level.
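
The same decision can be reached through the p-value. The sketch below relies on a shortcut that holds only for 1 degree of freedom: a chi-square variable with df = 1 is the square of a standard normal variable, so its upper-tail probability equals `erfc(sqrt(x / 2))`. For other degrees of freedom a statistics library or table would be needed; the variable names here are illustrative.

```python
import math

chi_square = 7.318      # statistic from the solution above
alpha = 0.05
critical_value = 3.841  # critical value for df = 1 at alpha = 0.05

# Upper-tail probability of the chi-square distribution, valid for df = 1 only.
p_value = math.erfc(math.sqrt(chi_square / 2))

# Both decision rules from step 7 should agree.
reject_by_p_value = p_value < alpha
reject_by_critical_value = chi_square > critical_value

print(f"p-value = {p_value:.4f}")
print(f"reject H0 (p-value rule): {reject_by_p_value}")
print(f"reject H0 (critical-value rule): {reject_by_critical_value}")
```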