# Chi-Squared Test for Categorical Variables 

- The chi-square test is widely used to assess how closely the distribution of a categorical variable matches an expected distribution (the goodness-of-fit test), or whether two categorical variables are independent of one another (the test of independence).

- In mathematical terms, the χ² variable is the sum of the squares of a set of independent standard normal variables.

- Suppose that a particular value Z1 is randomly selected from a standardized normal distribution. Then suppose another value Z2 is selected independently from the same standardized normal distribution. If there are d degrees of freedom, let this process continue until d independent Z values have been selected. The χ² variable is defined as the sum of the squares of these Z values:

![image.png](attachment:image.png)
                                
- This sum of squares of d independent standard normal variables has a distribution called the χ² distribution with d degrees of freedom.
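
The definition above can be checked numerically. The following is a minimal simulation sketch using only Python's standard library; the function name `simulate_chi_square`, the sample size, and the seed are illustrative choices, not part of the original text. It draws d standard normal values, sums their squares, and repeats many times. The sample mean should come out close to d and the sample variance close to 2d, which are the known mean and variance of the χ² distribution with d degrees of freedom.

```python
import random
import statistics


def simulate_chi_square(d, n_samples=50_000, seed=0):
    """Draw n_samples values of Z1^2 + ... + Zd^2 with each Zi ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        for _ in range(n_samples)
    ]


d = 4  # degrees of freedom (illustrative choice)
samples = simulate_chi_square(d)

sample_mean = statistics.fmean(samples)
sample_var = statistics.variance(samples)

print(f"sample mean     = {sample_mean:.3f} (theory: {d})")
print(f"sample variance = {sample_var:.3f} (theory: {2 * d})")
```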



## Chi-Square Test for Goodness of Fit

- The chi-square goodness-of-fit test is used to decide whether the observed (experimental) counts differ significantly from the expected (theoretical) counts.

- A goodness of fit test is a test that is concerned with the distribution of one categorical variable. 

The null and alternative hypotheses reflect this focus:

H0: The population distribution of the variable is the same as the proposed distribution

HA: The distributions are different

The chi-square statistic is calculated as:

![image-2.png](attachment:image-2.png)
                                                          
where Observed = the actual counts in each category, and

Expected = the counts predicted for each category if the null hypothesis were true.
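
Written out in symbols, with $O_i$ and $E_i$ denoting the observed and expected counts in category $i$ and $k$ the number of categories (this notation is an assumption, since the original formula is given only as an image), the statistic is:

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$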

- Let’s see an example for better understanding:

- Q) A survey conducted by a pet food company determined that 60% of dog owners have only one dog, 28% have two dogs, and 12% have three or more. You were not convinced by the survey, so you conducted your own and collected the data below.

Data: Out of 129 dog owners, 73 had one dog, 38 had two dogs, and the remaining 18 had three or more.

Determine whether your data supports the results of the pet food company's survey.

Use a significance level of 0.05.

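- Ans: The expected proportions under the survey are:

E(1 dog) = 0.60

E(2 dogs) = 0.28

E(3 or more dogs) = 0.12

Multiplying each proportion by the sample size (129) gives expected counts of 77.4, 36.12, and 15.48.

H0: The proportions of owners with one, two, and three or more dogs are equal to the survey proportions

H1: At least one proportion differs from the survey proportions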

![image-3.png](attachment:image-3.png)

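Chi-square statistic = 19.36/77.4 + 3.53/36.12 + 6.35/15.48 ≈ 0.758

Each numerator is the squared difference between the observed and expected count. For the third category, the observed count is 129 − 73 − 38 = 18, so the numerator is (18 − 15.48)² = 2.52² ≈ 6.35.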

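Let’s look up the critical value for 2 degrees of freedom (3 categories − 1) at the 5% significance level:

Critical chi-square = 5.99

Here, our chi-square statistic is less than the critical value. Thus, we fail to reject the null hypothesis: the data are consistent with the survey results.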
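
The same calculation can be reproduced in code. This is a sketch that assumes SciPy is available: `scipy.stats.chisquare` computes the goodness-of-fit statistic and p-value, and `scipy.stats.chi2.ppf` gives the critical value. The count of 18 owners with three or more dogs is derived as 129 − 73 − 38. Because the code does not round intermediate values, its statistic may differ in the later decimal places from a hand calculation.

```python
from scipy.stats import chi2, chisquare

observed = [73, 38, 129 - 73 - 38]       # one dog, two dogs, three or more
survey_proportions = [0.60, 0.28, 0.12]  # proportions reported by the survey
n = sum(observed)
expected = [p * n for p in survey_proportions]

# Goodness-of-fit statistic and its p-value
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
dof = len(observed) - 1                  # number of categories - 1
critical_value = chi2.ppf(1 - alpha, dof)

print(f"chi-square statistic = {statistic:.4f}")
print(f"critical value       = {critical_value:.2f}")
print(f"p-value              = {p_value:.4f}")
print("reject H0" if statistic > critical_value else "fail to reject H0")
```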


- Let’s learn the use of chi-square with an intuitive example.

A research scholar is interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A (their final assessment score).

He obtains a random sample of placement records from the past five years from the placement cell database. He records how many placed students fell into each of the following C.G.P.A. categories: 9-10, 8-9, 7-8, 6-7, and below 6.

- If there is no relationship between the placement rate and the C.G.P.A., then the placed students should be equally spread across the different C.G.P.A. categories (i.e. there should be similar numbers of placed students in each category).

- However, if students having C.G.P.A more than 8 are more likely to get placed, then there would be a large number of placed students in the higher C.G.P.A. categories as compared to the lower C.G.P.A. categories. In this case, the data collected would make up the observed frequencies.

- So the question is, are these frequencies being observed by chance or do they follow some pattern?

- Here enters the chi-square test! The chi-square test helps us answer the above question by comparing the observed frequencies to the frequencies that we might expect to obtain purely by chance.

- In hypothesis testing, the chi-square test is used to test hypotheses about how observations (frequencies) are distributed across categories.


## Assumptions of the Chi-Square Test

- Just like any other statistical test, the chi-square test comes with a few assumptions of its own:

- The χ² test assumes that the data for the study are obtained through random selection, i.e. they are randomly sampled from the population

- The categories are mutually exclusive, i.e. each subject fits in only one category. For example, people who had lunch in your restaurant on Monday cannot be counted in the Tuesday category

- The data should be in the form of frequencies or counts for each category, not percentages

- The observations should be independent of each other, so the data should not consist of paired samples or related groups

- When more than 20% of the expected frequencies are below 5, the chi-square approximation is unreliable and the test should not be used. To address this, either combine categories (only where doing so is meaningful) or collect more data
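
The rule of thumb on expected frequencies can be checked before running the test. The helper below is a hypothetical sketch, not a function from any library: the name `expected_counts_ok` is illustrative, and its defaults are taken from the 20% and 5 thresholds stated above.

```python
def expected_counts_ok(expected, min_count=5, max_fraction=0.20):
    """Return True if at most max_fraction of expected counts are below min_count."""
    if not expected:
        raise ValueError("expected must contain at least one count")
    n_small = sum(1 for e in expected if e < min_count)
    return n_small / len(expected) <= max_fraction


# Illustrative expected counts (made up for this sketch)
print(expected_counts_ok([20, 20, 20, 20, 20]))  # no cells below 5
print(expected_counts_ok([12, 9, 4, 3, 2]))      # 3 of 5 cells are below 5
```

If the check fails, the options listed above apply: merge sparse categories where that is meaningful, or collect more data.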

- The chi-square goodness-of-fit test is a non-parametric test. We typically use it to determine whether the observed values of a given event differ significantly from the expected values. In this case, we have categorical data for one variable, and we want to check whether its distribution matches the expected distribution.

- Let’s consider the above example where the research scholar was interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A.

- In this case, the independent variable is C.G.P.A with the categories 9-10, 8-9, 7-8, 6-7, and below 6.

- The statistical question here is whether the observed frequencies of placed students are equally distributed across the C.G.P.A. categories (so that our theoretical frequency distribution contains the same number of students in each category).

- We arrange this data in a contingency table containing both the observed and expected values, as below:

 ![image.png](attachment:image.png)
 
 ![image-2.png](attachment:image-2.png)
 
 After constructing the contingency table, the next task is to compute the value of the chi-square statistic. The formula for chi-square is given as:
 
 ![image-3.png](attachment:image-3.png)
 
 Let us look at the step-by-step approach to calculate the chi-square value:

Step 1: Subtract each expected frequency from the related observed frequency. For example, for the C.G.P.A. category 9-10, it will be “30 − 20 = 10”. Apply the same operation to all the categories.

Step 2: Square each value obtained in step 1, i.e. (O − E)². For example, for the C.G.P.A. category 9-10, the value obtained in step 1 is 10, which becomes 100 on squaring. Apply the same operation to all the categories.

Step 3: Divide each value obtained in step 2 by the related expected frequency, i.e. (O − E)²/E. For example, for the C.G.P.A. category 9-10, the value obtained in step 2 is 100. Dividing it by the related expected frequency, which is 20, gives 5. Apply the same operation to all the categories.

Step 4: Add all the values obtained in step 3 to get the chi-square value. In this case, the chi-square value comes out to be 32.5.

Step 5: Once we have calculated the chi-square value, the next task is to compare it with the critical chi-square value. We can find this in the chi-square table below, using the degrees of freedom (number of categories − 1) and the level of significance:

![image-4.png](attachment:image-4.png)

- In this case, the degrees of freedom are 5-1 = 4. So, the critical value at 5% level of significance is 9.49.

- Our obtained value of 32.5 is much larger than the critical value of 9.49. Therefore, we can say that the observed frequencies are significantly different from the expected frequencies. In other words, C.G.P.A is related to the number of placements that occur in the department of statistics.
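
The five steps can be sketched in plain Python. The original counts are only available in the table image, so the observed values below are hypothetical: they were chosen to be consistent with the figures quoted in the text (30 observed against 20 expected in the 9-10 category, and a total of 32.5) and may differ from the actual table. The critical value 9.49 is the one quoted above for 4 degrees of freedom.

```python
# Hypothetical observed counts per C.G.P.A. category (see the note above);
# only the first value (30) and the expected count (20) appear in the text.
categories = ["9-10", "8-9", "7-8", "6-7", "below 6"]
observed = [30, 35, 20, 10, 5]
expected = [sum(observed) / len(observed)] * len(observed)  # equal spread: 20 each

# Step 1: subtract each expected frequency from the observed frequency
differences = [o - e for o, e in zip(observed, expected)]

# Step 2: square each difference
squared = [diff ** 2 for diff in differences]

# Step 3: divide each squared difference by its expected frequency
contributions = [sq / e for sq, e in zip(squared, expected)]

# Step 4: add the contributions to get the chi-square value
chi_square = sum(contributions)

# Step 5: compare with the critical value for (categories - 1) degrees of freedom
dof = len(categories) - 1
critical_value = 9.49  # from the chi-square table, 5% level, 4 d.o.f.

for cat, c in zip(categories, contributions):
    print(f"{cat:>8}: {c:.2f}")
print(f"chi-square = {chi_square}, d.o.f. = {dof}")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```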

## Chi-Square Test for Association/Independence


- The second type of chi-square test is Pearson’s chi-square test of association. This test is used when we have two categorical variables and we want to see whether there is any relationship between them. The null hypothesis is that the two variables are independent.

- Let’s take another example to understand this. A teacher wants to know whether the outcome of a mathematics test is related to the gender of the person taking it. In other words, she wants to know if males show a different pattern of pass/fail rates than females.

- So, here are two categorical variables: Gender (Male and Female) and mathematics test outcome (Pass or Fail). Let us now look at the contingency table:

![image.png](attachment:image.png)

- By looking at the above contingency table, we can see that the girls have a comparatively higher pass rate than boys. However, to test whether this observed difference is significant or not, we will carry out the chi-square test.

The steps to calculate the chi-square value are as follows:

Step 1: Calculate the row and column total of the above contingency table:

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)


- The next task is to compare the calculated value with the critical chi-square value from the table we saw above.

- The calculated chi-square value is 0.9354, which is less than the critical value of 3.84 (1 degree of freedom at the 5% level of significance). So in this case, we fail to reject the null hypothesis. This means there is no significant association between the two variables, i.e. boys and girls have a statistically similar pattern of pass/fail rates on their mathematics tests.
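
Since the intermediate steps above are shown only as images, here is a minimal sketch of the usual procedure in plain Python: compute the row and column totals, derive each expected count as (row total × column total) / grand total, and sum (O − E)²/E over all cells. The 2×2 counts are hypothetical and are not the values from the table above; the function name `chi_square_independence` is likewise illustrative.

```python
def chi_square_independence(table):
    """Return (statistic, dof, expected) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [
        [r * c / grand_total for c in col_totals]
        for r in row_totals
    ]

    # Sum (O - E)^2 / E over every cell of the table
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts: rows = gender, columns = (pass, fail)
table = [
    [20, 30],
    [25, 25],
]
statistic, dof, expected = chi_square_independence(table)

critical_value = 3.84  # 5% level, 1 degree of freedom (as quoted above)
print(f"chi-square = {statistic:.4f}, d.o.f. = {dof}")
print("reject H0" if statistic > critical_value else "fail to reject H0")
```

For a 2×2 table, `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so pass `correction=False` when comparing its result against this uncorrected statistic.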

## Example

A toy company builds football player toys. It claims that 30% of the toys are mid-fielders, 60% are defenders, and 10% are forwards. A random sample of 100 toys has 50 mid-fielders, 45 defenders, and 5 forwards. Given a 0.05 level of significance, can you justify the company's claim?

Solution:

## Determine Hypotheses

Null hypothesis H0 - The proportion of mid-fielders, defenders, and forwards is 30%, 60% and 10%, respectively.

Alternative hypothesis H1 - At least one of the proportions in the null hypothesis is false.


## Determine Degree of Freedom

The degrees of freedom (DF) equal the number of levels (k) of the categorical variable minus 1: DF = k - 1. Here there are 3 levels, so DF = 3 - 1 = 2:


![image.png](attachment:image.png)

## Determine p-value

The p-value is the probability that a chi-square statistic χ² with 2 degrees of freedom exceeds the observed value of 19.58. Use the Chi-Square Distribution Calculator to find P(χ² > 19.58) ≈ 0.0001.

## Interpret results
As the p-value (0.0001) is far below the significance level (0.05), we reject the null hypothesis. Thus the sample does not support the company's claim.
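
The whole example can be reproduced with SciPy (assumed to be installed). `scipy.stats.chisquare` returns the statistic and p-value directly, and `scipy.stats.chi2.sf` plays the role of the distribution calculator by giving the upper-tail probability. This is a sketch of one way to check the numbers, not necessarily the tool used originally.

```python
from scipy.stats import chi2, chisquare

observed = [50, 45, 5]        # mid-fielders, defenders, forwards in the sample
claimed = [0.30, 0.60, 0.10]  # proportions claimed by the company
n = sum(observed)
expected = [p * n for p in claimed]  # roughly 30, 60, 10

# Goodness-of-fit statistic and p-value in one call
statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

dof = len(observed) - 1
# Upper-tail probability P(X2 > statistic); should agree with p_value above
tail_probability = chi2.sf(statistic, dof)

alpha = 0.05
print(f"chi-square = {statistic:.2f}, d.o.f. = {dof}")
print(f"p-value    = {p_value:.6f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```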