# Chi-square Test
Used for Categorical vs Categorical Testing

In [2]:
from scipy.stats import chisquare, chi2_contingency

## Degrees of freedom

dof = n - 1 <br>
If you have a set of n numbers and you know their average, how many of those numbers do you need to know to determine the full set? The answer is n - 1, because the last number is fixed by the average. <br>
For two arrays with lengths n1 and n2, the degrees of freedom (dof) is the sum of the degrees of freedom of each array: (n1 - 1) + (n2 - 1)

For a two-sample t-test, the degrees of freedom are therefore n1 + n2 - 2, while for a one-sample t-test they simplify to n - 1, where n is the sample size.


For a chi-square test on a contingency table, the degrees of freedom are calculated by multiplying the two table dimensions together, each reduced by one: (#rows - 1) * (#columns - 1)

Now, why are the degrees of freedom important? <br>
1. In the context of the chi-square test, the degrees of freedom depend on the number of categories, and they determine the critical values used to assess statistical significance.
    - As degrees of freedom increase, the chi-squared distribution changes shape.
    - Higher degrees of freedom lead to higher critical values, requiring a larger test statistic to reject the null hypothesis at a given significance level (α).
2. More degrees of freedom allow for more variability in the distribution of the test statistic: its mean equals the degrees of freedom and its variance is twice the degrees of freedom.
3. Degrees of freedom help define the expected distribution of the test statistic under the null hypothesis.
    - The expected distribution is a key reference point for evaluating the observed test statistic and determining whether deviations are statistically significant.
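
The effect of degrees of freedom on the critical value can be checked numerically. The following is a minimal sketch using `scipy.stats.chi2`; the significance level of 0.05 and the particular dof values are chosen only for illustration.

```python
from scipy.stats import chi2

alpha = 0.05  # significance level, chosen for illustration

# The critical value is the (1 - alpha) quantile of the chi-square distribution
critical_values = {dof: chi2.ppf(1 - alpha, dof) for dof in (1, 2, 5, 10)}

for dof, crit in critical_values.items():
    print(f"dof={dof:>2}  critical value={crit:.3f}")
```

Each printed critical value is larger than the previous one, so a larger test statistic is needed to reject H0 as the degrees of freedom grow.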

## Chi-square Goodness of Fit Test

Used when you have one categorical variable, and you want to see if the observed frequencies match the expected frequencies.

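degrees of freedom = k - 1, where k is the number of categories. For a coin toss (two categories: heads and tails), degrees of freedom = 1.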

**Example** <br>
The expected outcome for a fair coin toss is 50% heads and 50% tails, which corresponds to 25 heads and 25 tails in 50 tosses. <br>
The observed outcomes from 50 coin tosses are 28 heads and 22 tails. <br>
To determine whether the coin is fair, we perform a chi-square test to check if the observed results significantly deviate from the expected results. If the deviation is statistically significant, it may indicate that the coin is not fair.

In [4]:
cstat, pval = chisquare([28, 22],   # observed counts: heads, tails
                        [25, 25])   # expected counts: heads, tails
cstat, pval

(0.72, 0.3961439091520741)
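
The statistic and p-value returned for the coin example can be reproduced by hand from the definition, the sum of (observed - expected)^2 / expected over the categories. This is a minimal sketch with NumPy and `scipy.stats.chi2`; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([28, 22])  # heads, tails
expected = np.array([25, 25])  # fair coin, 50 tosses

# Chi-square statistic: 9/25 + 9/25 = 0.72
stat = np.sum((observed - expected) ** 2 / expected)

# Goodness of fit: dof = number of categories - 1
dof = len(observed) - 1

# p-value = upper-tail probability of the chi-square distribution
p_value = chi2.sf(stat, dof)
print(stat, p_value)
```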

In [7]:
alpha = 0.01  # significance level
if pval < alpha:
    print("Reject H0")
    print("Coin is biased")
else:
    print("Fail to reject H0")
    print("No evidence that the coin is biased")

Fail to reject H0
No evidence that the coin is biased


## Chi-square Test for Independence

Now, let us say we are conducting a survey on whether gender impacts the preference for offline or online purchases. <br>
The survey records each respondent's gender and purchase preference, and the hypotheses are:<br>
- H0: Gender and preference are independent
- Ha: Gender and preference are not independent <br>
<br>In this scenario, what are the expected values under the null hypothesis? <br>
Of the 907 respondents, 599 (about 66%) prefer offline and 308 (about 34%) prefer online.<br>
If gender has no impact, then among 733 men, how many are expected to prefer offline? => 66% of 733 ≈ 484 <br>
If gender has no impact, then among 174 women, how many are expected to prefer offline? => 66% of 174 ≈ 115 <br>
If gender has no impact, then among 733 men, how many are expected to prefer online? => 34% of 733 ≈ 249 <br>
If gender has no impact, then among 174 women, how many are expected to prefer online? => 34% of 174 ≈ 59 <br> <br>
But the observed values are: <br>
- Men : Offline = 527
- Women : Offline = 72
- Men : Online = 206
- Women : Online = 102
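
The expected counts under independence can be computed directly from the row and column totals of the observed table: expected = row total * column total / grand total. This is a minimal sketch with NumPy; the variable names are illustrative only.

```python
import numpy as np

observed = np.array([[527, 206],   # men: offline, online
                     [72, 102]])   # women: offline, online

row_totals = observed.sum(axis=1, keepdims=True)  # men = 733, women = 174
col_totals = observed.sum(axis=0, keepdims=True)  # offline = 599, online = 308
grand_total = observed.sum()                      # 907 respondents

# Under H0, each cell's expected count is (row total * column total) / grand total
expected = row_totals * col_totals / grand_total
print(expected)
```

Rounded to whole respondents, this gives the 484, 249, 115 and 59 worked out above.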

In [10]:
cstat, pval, dof, exp_freq = chi2_contingency([[527, 206],   # men: offline, online
                                               [72, 102]])   # women: offline, online
cstat, pval, dof, exp_freq

(57.04098674049609,
 4.268230756875865e-14,
 1,
 array([[484.08710033, 248.91289967],
        [114.91289967,  59.08710033]]))
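
One detail worth knowing when comparing this result with a hand calculation: for a 2x2 table (dof = 1), `chi2_contingency` applies Yates' continuity correction by default (`correction=True`). The sketch below runs the test both ways on the same table; the variable names are illustrative only.

```python
from scipy.stats import chi2_contingency

table = [[527, 206],   # men: offline, online
         [72, 102]]    # women: offline, online

# Default behaviour: Yates' continuity correction is applied because dof == 1
stat_corrected, p_corrected, dof, _ = chi2_contingency(table)

# Plain Pearson chi-square statistic, without the continuity correction
stat_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print(stat_corrected, stat_plain)
```

The corrected statistic is slightly smaller than the uncorrected one, but with a sample of this size both lead to the same conclusion.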

In [11]:
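# Transposed table (rows and columns swapped): statistic, p-value and dof are unchanged
cstat, pval, dof, exp_freq = chi2_contingency([[527, 72],    # offline: men, women
                                               [206, 102]])  # online: men, women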
cstat, pval, dof, exp_freq

(57.04098674049609,
 4.268230756875865e-14,
 1,
 array([[484.08710033, 114.91289967],
        [248.91289967,  59.08710033]]))

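When the data comes as one row per respondent rather than as pre-computed counts, pd.crosstab (from pandas) is a convenient way to build the contingency table that is passed to chi2_contingency.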
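
A minimal sketch of the pd.crosstab approach follows. The raw survey rows are hypothetical: they are generated here only so that they reproduce the counts above, and the column names `gender` and `preference` are assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw data: one row per respondent, built to match the observed counts
counts = {("Men", "Offline"): 527, ("Men", "Online"): 206,
          ("Women", "Offline"): 72, ("Women", "Online"): 102}
rows = [(gender, pref) for (gender, pref), n in counts.items() for _ in range(n)]
survey = pd.DataFrame(rows, columns=["gender", "preference"])

# Cross-tabulate the two categorical columns into a contingency table
table = pd.crosstab(survey["gender"], survey["preference"])
print(table)

# The resulting table can be passed straight to chi2_contingency
stat, p_value, dof, expected = chi2_contingency(table)
print(stat, p_value, dof)
```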